[PATCH v3 0/6] DAX fixes, move flushing calls to FS

linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH v3 0/6] DAX fixes, move flushing calls to FS
@ 2016-02-17  3:34 Ross Zwisler
  2016-02-17  3:34 ` [PATCH v3 1/6] block: disable block device DAX by default Ross Zwisler
                   ` (5 more replies)
  0 siblings, 6 replies; 16+ messages in thread
From: Ross Zwisler @ 2016-02-17  3:34 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, J. Bruce Fields, Theodore Ts'o, Alexander Viro,
	Andreas Dilger, Andrew Morton, Dan Williams, Dave Chinner,
	Jan Kara, Jeff Layton, Jens Axboe, Matthew Wilcox, linux-block,
	linux-ext4, linux-fsdevel, linux-mm, linux-nvdimm, xfs

This patch series fixes several issues with the current DAX code:

1) DAX is used by default on raw block devices that are capable of
supporting it.  This creates an issue because there are still uses of the
block device that use the page cache, and having one block device user
doing DAX I/O and another doing page cache I/O can lead to data corruption.

2) When S_DAX is set on an inode we assume that if there are pages attached
to the mapping (mapping->nrpages != 0), those pages are clean zero pages
that were used to service reads from holes.  This wasn't true in all cases.

3) ext4 online defrag combined with DAX I/O could lead to data corruption.

4) The DAX block/sector zeroing code needs a valid struct block_device,
which it wasn't always getting.

5) The DAX writeback code needs a valid struct block_device, which it
wasn't always getting.

6) The DAX writeback code needs to be called for sync(2) and syncfs(2).

The last patch in this series reenables the DAX I/O path for raw block
devices when they would otherwise be doing direct I/O.  It can be dropped
if it is too controversial.

Thank you to Dan Williams and Jan Kara for their code contributions to this
set.

A working tree can be found here:
https://git.kernel.org/cgit/linux/kernel/git/zwisler/linux.git/log/?h=fsync_bdev_v3

Dan Williams (2):
  block: disable block device DAX by default
  block: use dax_do_io() if blkdev_dax_capable()

Ross Zwisler (4):
  ext2, ext4: only set S_DAX for regular inodes
  ext4: Online defrag not supported with DAX
  dax: give DAX clearing code correct bdev
  dax: move writeback calls into the filesystems

 block/Kconfig          | 13 +++++++++++++
 block/ioctl.c          |  1 +
 fs/block_dev.c         | 22 +++++++++++++++++++---
 fs/dax.c               | 21 +++++++++++----------
 fs/ext2/inode.c        | 16 +++++++++++++---
 fs/ext4/inode.c        |  6 +++++-
 fs/ext4/ioctl.c        |  5 +++++
 fs/xfs/xfs_aops.c      |  6 +++++-
 fs/xfs/xfs_aops.h      |  1 +
 fs/xfs/xfs_bmap_util.c |  3 ++-
 include/linux/dax.h    |  8 +++++---
 include/linux/fs.h     | 31 +++++++++++++++++++++----------
 mm/filemap.c           | 12 ++++--------
 13 files changed, 105 insertions(+), 40 deletions(-)

-- 
2.5.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH v3 1/6] block: disable block device DAX by default
  2016-02-17  3:34 [PATCH v3 0/6] DAX fixes, move flushing calls to FS Ross Zwisler
@ 2016-02-17  3:34 ` Ross Zwisler
  2016-02-17 21:55   ` Jan Kara
  2016-02-17  3:34 ` [PATCH v3 2/6] ext2, ext4: only set S_DAX for regular inodes Ross Zwisler
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 16+ messages in thread
From: Ross Zwisler @ 2016-02-17  3:34 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dan Williams, J. Bruce Fields, Theodore Ts'o, Alexander Viro,
	Andreas Dilger, Andrew Morton, Dave Chinner, Jan Kara,
	Jeff Layton, Jens Axboe, Matthew Wilcox, linux-block, linux-ext4,
	linux-fsdevel, linux-mm, linux-nvdimm, xfs, Jan Kara, Jens Axboe,
	Matthew Wilcox, Al Viro, Ross Zwisler

From: Dan Williams <dan.j.williams@intel.com>

The recent *sync enabling discovered that we are inserting into the
block_device pagecache counter to the expectations of the dirty data
tracking for dax mappings.  This can lead to data corruption.

We want to support DAX for block devices eventually, but it requires
wider changes to properly manage the pagecache.

  [<ffffffff81576d93>] dump_stack+0x85/0xc2
  [<ffffffff812b9ee0>] dax_writeback_mapping_range+0x60/0xe0
  [<ffffffff812a1d4f>] blkdev_writepages+0x3f/0x50
  [<ffffffff811db011>] do_writepages+0x21/0x30
  [<ffffffff811cb6a6>] __filemap_fdatawrite_range+0xc6/0x100
  [<ffffffff811cb75a>] filemap_write_and_wait+0x4a/0xa0
  [<ffffffff812a15e0>] set_blocksize+0x70/0xd0
  [<ffffffff812a273d>] sb_set_blocksize+0x1d/0x50
  [<ffffffff8132ac9b>] ext4_fill_super+0x75b/0x3360
  [<ffffffff81583381>] ? vsnprintf+0x201/0x4c0
  [<ffffffff815836d9>] ? snprintf+0x49/0x60
  [<ffffffff81263010>] mount_bdev+0x180/0x1b0
  [<ffffffff8132a540>] ? ext4_calculate_overhead+0x370/0x370
  [<ffffffff8131ad95>] ext4_mount+0x15/0x20
  [<ffffffff81263908>] mount_fs+0x38/0x170

Mark the support broken so its disabled by default, but otherwise still
available for testing.

Cc: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@fb.com>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
Cc: Al Viro <viro@ftp.linux.org.uk>
Reported-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Suggested-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 block/Kconfig  | 13 +++++++++++++
 fs/block_dev.c |  6 +++++-
 2 files changed, 18 insertions(+), 1 deletion(-)

diff --git a/block/Kconfig b/block/Kconfig
index 161491d..0363cd7 100644
--- a/block/Kconfig
+++ b/block/Kconfig
@@ -88,6 +88,19 @@ config BLK_DEV_INTEGRITY
 	T10/SCSI Data Integrity Field or the T13/ATA External Path
 	Protection.  If in doubt, say N.
 
+config BLK_DEV_DAX
+	bool "Block device DAX support"
+	depends on FS_DAX
+	depends on BROKEN
+	help
+	  When DAX support is available (CONFIG_FS_DAX) raw block
+	  devices can also support direct userspace access to the
+	  storage capacity via MMAP(2) similar to a file on a
+	  DAX-enabled filesystem.  However, the DAX I/O-path disables
+	  some standard I/O-statistics, and the MMAP(2) path has some
+	  operational differences due to bypassing the page
+	  cache.  If in doubt, say N.
+
 config BLK_DEV_THROTTLING
 	bool "Block layer bio throttling support"
 	depends on BLK_CGROUP=y
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 39b3a17..31c6d10 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1201,7 +1201,11 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
 		bdev->bd_disk = disk;
 		bdev->bd_queue = disk->queue;
 		bdev->bd_contains = bdev;
-		bdev->bd_inode->i_flags = disk->fops->direct_access ? S_DAX : 0;
+		if (IS_ENABLED(CONFIG_BLK_DEV_DAX) && disk->fops->direct_access)
+			bdev->bd_inode->i_flags = S_DAX;
+		else
+			bdev->bd_inode->i_flags = 0;
+
 		if (!partno) {
 			ret = -ENXIO;
 			bdev->bd_part = disk_get_part(disk, partno);
-- 
2.5.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH v3 2/6] ext2, ext4: only set S_DAX for regular inodes
  2016-02-17  3:34 [PATCH v3 0/6] DAX fixes, move flushing calls to FS Ross Zwisler
  2016-02-17  3:34 ` [PATCH v3 1/6] block: disable block device DAX by default Ross Zwisler
@ 2016-02-17  3:34 ` Ross Zwisler
  2016-02-17 21:33   ` Jan Kara
  2016-02-17  3:34 ` [PATCH v3 3/6] ext4: Online defrag not supported with DAX Ross Zwisler
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 16+ messages in thread
From: Ross Zwisler @ 2016-02-17  3:34 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, J. Bruce Fields, Theodore Ts'o, Alexander Viro,
	Andreas Dilger, Andrew Morton, Dan Williams, Dave Chinner,
	Jan Kara, Jeff Layton, Jens Axboe, Matthew Wilcox, linux-block,
	linux-ext4, linux-fsdevel, linux-mm, linux-nvdimm, xfs

When S_DAX is set on an inode we assume that if there are pages attached
to the mapping (mapping->nrpages != 0), those pages are clean zero pages
that were used to service reads from holes.  Any dirty data associated with
the inode should be in the form of DAX exceptional entries
(mapping->nrexceptional) that is written back via
dax_writeback_mapping_range().

With the current code, though, this isn't always true.  For example, ext2
and ext4 directory inodes can have S_DAX set, but have their dirty data
stored as dirty page cache entries.  For these types of inodes, having
S_DAX set doesn't really make sense since their I/O doesn't actually happen
through the DAX code path.

Instead, only allow S_DAX to be set for regular inodes for ext2 and ext4.
This allows us to have strict DAX vs non-DAX paths in the writeback code.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 fs/ext2/inode.c | 2 +-
 fs/ext4/inode.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 338eefd..27e2cdd 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -1296,7 +1296,7 @@ void ext2_set_inode_flags(struct inode *inode)
 		inode->i_flags |= S_NOATIME;
 	if (flags & EXT2_DIRSYNC_FL)
 		inode->i_flags |= S_DIRSYNC;
-	if (test_opt(inode->i_sb, DAX))
+	if (test_opt(inode->i_sb, DAX) && S_ISREG(inode->i_mode))
 		inode->i_flags |= S_DAX;
 }
 
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 83bc8bf..7088aa5 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4127,7 +4127,7 @@ void ext4_set_inode_flags(struct inode *inode)
 		new_fl |= S_NOATIME;
 	if (flags & EXT4_DIRSYNC_FL)
 		new_fl |= S_DIRSYNC;
-	if (test_opt(inode->i_sb, DAX))
+	if (test_opt(inode->i_sb, DAX) && S_ISREG(inode->i_mode))
 		new_fl |= S_DAX;
 	inode_set_flags(inode, new_fl,
 			S_SYNC|S_APPEND|S_IMMUTABLE|S_NOATIME|S_DIRSYNC|S_DAX);
-- 
2.5.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH v3 3/6] ext4: Online defrag not supported with DAX
  2016-02-17  3:34 [PATCH v3 0/6] DAX fixes, move flushing calls to FS Ross Zwisler
  2016-02-17  3:34 ` [PATCH v3 1/6] block: disable block device DAX by default Ross Zwisler
  2016-02-17  3:34 ` [PATCH v3 2/6] ext2, ext4: only set S_DAX for regular inodes Ross Zwisler
@ 2016-02-17  3:34 ` Ross Zwisler
  2016-02-17 21:34   ` Jan Kara
  2016-02-17 21:50   ` Ross Zwisler
  2016-02-17  3:34 ` [PATCH v3 4/6] dax: give DAX clearing code correct bdev Ross Zwisler
                   ` (2 subsequent siblings)
  5 siblings, 2 replies; 16+ messages in thread
From: Ross Zwisler @ 2016-02-17  3:34 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, J. Bruce Fields, Theodore Ts'o, Alexander Viro,
	Andreas Dilger, Andrew Morton, Dan Williams, Dave Chinner,
	Jan Kara, Jeff Layton, Jens Axboe, Matthew Wilcox, linux-block,
	linux-ext4, linux-fsdevel, linux-mm, linux-nvdimm, xfs

Online defrag operations for ext4 are hard coded to use the page cache.
See ext4_ioctl() -> ext4_move_extents() -> move_extent_per_page()

When combined with DAX I/O, which circumvents the page cache, this can
result in data corruption.  This was observed with xfstests ext4/307 and
ext4/308.

Fix this by only allowing online defrag for non-DAX files.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 fs/ext4/ioctl.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c
index 0f6c369..e32c86f 100644
--- a/fs/ext4/ioctl.c
+++ b/fs/ext4/ioctl.c
@@ -583,6 +583,11 @@ group_extend_out:
 				 "Online defrag not supported with bigalloc");
 			err = -EOPNOTSUPP;
 			goto mext_out;
+		} else if (IS_DAX(inode)) {
+			ext4_msg(sb, KERN_ERR,
+				 "Online defrag not supported with DAX");
+			err = -EOPNOTSUPP;
+			goto mext_out;
 		}
 
 		err = mnt_want_write_file(filp);
-- 
2.5.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH v3 4/6] dax: give DAX clearing code correct bdev
  2016-02-17  3:34 [PATCH v3 0/6] DAX fixes, move flushing calls to FS Ross Zwisler
                   ` (2 preceding siblings ...)
  2016-02-17  3:34 ` [PATCH v3 3/6] ext4: Online defrag not supported with DAX Ross Zwisler
@ 2016-02-17  3:34 ` Ross Zwisler
  2016-02-17 21:37   ` Jan Kara
  2016-02-17  3:34 ` [PATCH v3 5/6] dax: move writeback calls into the filesystems Ross Zwisler
  2016-02-17  3:34 ` [PATCH v3 6/6] block: use dax_do_io() if blkdev_dax_capable() Ross Zwisler
  5 siblings, 1 reply; 16+ messages in thread
From: Ross Zwisler @ 2016-02-17  3:34 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, J. Bruce Fields, Theodore Ts'o, Alexander Viro,
	Andreas Dilger, Andrew Morton, Dan Williams, Dave Chinner,
	Jan Kara, Jeff Layton, Jens Axboe, Matthew Wilcox, linux-block,
	linux-ext4, linux-fsdevel, linux-mm, linux-nvdimm, xfs

dax_clear_blocks() needs a valid struct block_device and previously it was
using inode->i_sb->s_bdev in all cases.  This is correct for normal inodes
on mounted ext2, ext4 and XFS filesystems, but is incorrect for DAX raw
block devices and for XFS real-time devices.

Instead, rename dax_clear_blocks() to dax_clear_sectors(), and change its
arguments to take a bdev and a sector instead of an inode and a block.
This better reflects what the function does, and it allows the filesystem
and raw block device code to pass in an appropriate struct block_device.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Suggested-by: Dan Williams <dan.j.williams@intel.com>
---
 fs/dax.c               | 9 ++++-----
 fs/ext2/inode.c        | 6 ++++--
 fs/xfs/xfs_aops.c      | 2 +-
 fs/xfs/xfs_aops.h      | 1 +
 fs/xfs/xfs_bmap_util.c | 3 ++-
 include/linux/dax.h    | 2 +-
 6 files changed, 13 insertions(+), 10 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index fc2e314..9a173dd 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -79,15 +79,14 @@ struct page *read_dax_sector(struct block_device *bdev, sector_t n)
 }
 
 /*
- * dax_clear_blocks() is called from within transaction context from XFS,
+ * dax_clear_sectors() is called from within transaction context from XFS,
  * and hence this means the stack from this point must follow GFP_NOFS
  * semantics for all operations.
  */
-int dax_clear_blocks(struct inode *inode, sector_t block, long _size)
+int dax_clear_sectors(struct block_device *bdev, sector_t _sector, long _size)
 {
-	struct block_device *bdev = inode->i_sb->s_bdev;
 	struct blk_dax_ctl dax = {
-		.sector = block << (inode->i_blkbits - 9),
+		.sector = _sector,
 		.size = _size,
 	};
 
@@ -109,7 +108,7 @@ int dax_clear_blocks(struct inode *inode, sector_t block, long _size)
 	wmb_pmem();
 	return 0;
 }
-EXPORT_SYMBOL_GPL(dax_clear_blocks);
+EXPORT_SYMBOL_GPL(dax_clear_sectors);
 
 /* the clear_pmem() calls are ordered by a wmb_pmem() in the caller */
 static void dax_new_buf(void __pmem *addr, unsigned size, unsigned first,
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 27e2cdd..4467cbd 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -737,8 +737,10 @@ static int ext2_get_blocks(struct inode *inode,
 		 * so that it's not found by another thread before it's
 		 * initialised
 		 */
-		err = dax_clear_blocks(inode, le32_to_cpu(chain[depth-1].key),
-						1 << inode->i_blkbits);
+		err = dax_clear_sectors(inode->i_sb->s_bdev,
+				le32_to_cpu(chain[depth-1].key) <<
+				(inode->i_blkbits - 9),
+				1 << inode->i_blkbits);
 		if (err) {
 			mutex_unlock(&ei->truncate_mutex);
 			goto cleanup;
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 379c089..fc20518 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -55,7 +55,7 @@ xfs_count_page_state(
 	} while ((bh = bh->b_this_page) != head);
 }
 
-STATIC struct block_device *
+struct block_device *
 xfs_find_bdev_for_inode(
 	struct inode		*inode)
 {
diff --git a/fs/xfs/xfs_aops.h b/fs/xfs/xfs_aops.h
index f6ffc9a..a4343c6 100644
--- a/fs/xfs/xfs_aops.h
+++ b/fs/xfs/xfs_aops.h
@@ -62,5 +62,6 @@ int	xfs_get_blocks_dax_fault(struct inode *inode, sector_t offset,
 			         struct buffer_head *map_bh, int create);
 
 extern void xfs_count_page_state(struct page *, int *, int *);
+extern struct block_device *xfs_find_bdev_for_inode(struct inode *);
 
 #endif /* __XFS_AOPS_H__ */
diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 45ec9e4..6c87601 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -75,7 +75,8 @@ xfs_zero_extent(
 	ssize_t		size = XFS_FSB_TO_B(mp, count_fsb);
 
 	if (IS_DAX(VFS_I(ip)))
-		return dax_clear_blocks(VFS_I(ip), block, size);
+		return dax_clear_sectors(xfs_find_bdev_for_inode(VFS_I(ip)),
+				sector, size);
 
 	/*
 	 * let the block layer decide on the fastest method of
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 818e450..7b6bced 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -7,7 +7,7 @@
 
 ssize_t dax_do_io(struct kiocb *, struct inode *, struct iov_iter *, loff_t,
 		  get_block_t, dio_iodone_t, int flags);
-int dax_clear_blocks(struct inode *, sector_t block, long size);
+int dax_clear_sectors(struct block_device *bdev, sector_t _sector, long _size);
 int dax_zero_page_range(struct inode *, loff_t from, unsigned len, get_block_t);
 int dax_truncate_page(struct inode *, loff_t from, get_block_t);
 int dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t,
-- 
2.5.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH v3 5/6] dax: move writeback calls into the filesystems
  2016-02-17  3:34 [PATCH v3 0/6] DAX fixes, move flushing calls to FS Ross Zwisler
                   ` (3 preceding siblings ...)
  2016-02-17  3:34 ` [PATCH v3 4/6] dax: give DAX clearing code correct bdev Ross Zwisler
@ 2016-02-17  3:34 ` Ross Zwisler
  2016-02-17  3:34 ` [PATCH v3 6/6] block: use dax_do_io() if blkdev_dax_capable() Ross Zwisler
  5 siblings, 0 replies; 16+ messages in thread
From: Ross Zwisler @ 2016-02-17  3:34 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, J. Bruce Fields, Theodore Ts'o, Alexander Viro,
	Andreas Dilger, Andrew Morton, Dan Williams, Dave Chinner,
	Jan Kara, Jeff Layton, Jens Axboe, Matthew Wilcox, linux-block,
	linux-ext4, linux-fsdevel, linux-mm, linux-nvdimm, xfs, Jan Kara

Previously calls to dax_writeback_mapping_range() for all DAX filesystems
(ext2, ext4 & xfs) were centralized in filemap_write_and_wait_range().
dax_writeback_mapping_range() needs a struct block_device, and it used to
get that from inode->i_sb->s_bdev.  This is correct for normal inodes
mounted on ext2, ext4 and XFS filesystems, but is incorrect for DAX raw
block devices and for XFS real-time files.

Instead, call dax_writeback_mapping_range() directly from the filesystem
->writepages function so that it can supply us with a valid block
device. This also fixes DAX code to properly flush caches in response to
sync(2).

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/block_dev.c      | 13 ++++++++++++-
 fs/dax.c            | 12 +++++++-----
 fs/ext2/inode.c     |  8 ++++++++
 fs/ext4/inode.c     |  4 ++++
 fs/xfs/xfs_aops.c   |  4 ++++
 include/linux/dax.h |  6 ++++--
 mm/filemap.c        | 12 ++++--------
 7 files changed, 43 insertions(+), 16 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 31c6d10..826b164 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1697,13 +1697,24 @@ static int blkdev_releasepage(struct page *page, gfp_t wait)
 	return try_to_free_buffers(page);
 }
 
+static int blkdev_writepages(struct address_space *mapping,
+			     struct writeback_control *wbc)
+{
+	if (dax_mapping(mapping)) {
+		struct block_device *bdev = I_BDEV(mapping->host);
+
+		return dax_writeback_mapping_range(mapping, bdev, wbc);
+	}
+	return generic_writepages(mapping, wbc);
+}
+
 static const struct address_space_operations def_blk_aops = {
 	.readpage	= blkdev_readpage,
 	.readpages	= blkdev_readpages,
 	.writepage	= blkdev_writepage,
 	.write_begin	= blkdev_write_begin,
 	.write_end	= blkdev_write_end,
-	.writepages	= generic_writepages,
+	.writepages	= blkdev_writepages,
 	.releasepage	= blkdev_releasepage,
 	.direct_IO	= blkdev_direct_IO,
 	.is_dirty_writeback = buffer_check_dirty_writeback,
diff --git a/fs/dax.c b/fs/dax.c
index 9a173dd..7111724 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -484,11 +484,10 @@ static int dax_writeback_one(struct block_device *bdev,
  * end]. This is required by data integrity operations to ensure file data is
  * on persistent storage prior to completion of the operation.
  */
-int dax_writeback_mapping_range(struct address_space *mapping, loff_t start,
-		loff_t end)
+int dax_writeback_mapping_range(struct address_space *mapping,
+		struct block_device *bdev, struct writeback_control *wbc)
 {
 	struct inode *inode = mapping->host;
-	struct block_device *bdev = inode->i_sb->s_bdev;
 	pgoff_t start_index, end_index, pmd_index;
 	pgoff_t indices[PAGEVEC_SIZE];
 	struct pagevec pvec;
@@ -499,8 +498,11 @@ int dax_writeback_mapping_range(struct address_space *mapping, loff_t start,
 	if (WARN_ON_ONCE(inode->i_blkbits != PAGE_SHIFT))
 		return -EIO;
 
-	start_index = start >> PAGE_CACHE_SHIFT;
-	end_index = end >> PAGE_CACHE_SHIFT;
+	if (!mapping->nrexceptional || wbc->sync_mode != WB_SYNC_ALL)
+		return 0;
+
+	start_index = wbc->range_start >> PAGE_CACHE_SHIFT;
+	end_index = wbc->range_end >> PAGE_CACHE_SHIFT;
 	pmd_index = DAX_PMD_INDEX(start_index);
 
 	rcu_read_lock();
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 4467cbd..6bd58e6 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -876,6 +876,14 @@ ext2_direct_IO(struct kiocb *iocb, struct iov_iter *iter, loff_t offset)
 static int
 ext2_writepages(struct address_space *mapping, struct writeback_control *wbc)
 {
+#ifdef CONFIG_FS_DAX
+	if (dax_mapping(mapping)) {
+		return dax_writeback_mapping_range(mapping,
+						   mapping->host->i_sb->s_bdev,
+						   wbc);
+	}
+#endif
+
 	return mpage_writepages(mapping, wbc, ext2_get_block);
 }
 
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 7088aa5..c2d1b51 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2450,6 +2450,10 @@ static int ext4_writepages(struct address_space *mapping,
 
 	trace_ext4_writepages(inode, wbc);
 
+	if (dax_mapping(mapping))
+		return dax_writeback_mapping_range(mapping, inode->i_sb->s_bdev,
+						   wbc);
+
 	/*
 	 * No pages to write? This is mainly a kludge to avoid starting
 	 * a transaction for special inodes like journal inode on last iput()
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index fc20518..a9ebabfe 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -1208,6 +1208,10 @@ xfs_vm_writepages(
 	struct writeback_control *wbc)
 {
 	xfs_iflags_clear(XFS_I(mapping->host), XFS_ITRUNCATED);
+	if (dax_mapping(mapping))
+		return dax_writeback_mapping_range(mapping,
+				xfs_find_bdev_for_inode(mapping->host), wbc);
+
 	return generic_writepages(mapping, wbc);
 }
 
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 7b6bced..636dd59 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -52,6 +52,8 @@ static inline bool dax_mapping(struct address_space *mapping)
 {
 	return mapping->host && IS_DAX(mapping->host);
 }
-int dax_writeback_mapping_range(struct address_space *mapping, loff_t start,
-		loff_t end);
+
+struct writeback_control;
+int dax_writeback_mapping_range(struct address_space *mapping,
+		struct block_device *bdev, struct writeback_control *wbc);
 #endif
diff --git a/mm/filemap.c b/mm/filemap.c
index 23edcce..3461d97 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -446,7 +446,8 @@ int filemap_write_and_wait(struct address_space *mapping)
 {
 	int err = 0;
 
-	if (mapping->nrpages) {
+	if ((!dax_mapping(mapping) && mapping->nrpages) ||
+	    (dax_mapping(mapping) && mapping->nrexceptional)) {
 		err = filemap_fdatawrite(mapping);
 		/*
 		 * Even if the above returned error, the pages may be
@@ -482,13 +483,8 @@ int filemap_write_and_wait_range(struct address_space *mapping,
 {
 	int err = 0;
 
-	if (dax_mapping(mapping) && mapping->nrexceptional) {
-		err = dax_writeback_mapping_range(mapping, lstart, lend);
-		if (err)
-			return err;
-	}
-
-	if (mapping->nrpages) {
+	if ((!dax_mapping(mapping) && mapping->nrpages) ||
+	    (dax_mapping(mapping) && mapping->nrexceptional)) {
 		err = __filemap_fdatawrite_range(mapping, lstart, lend,
 						 WB_SYNC_ALL);
 		/* See comment of filemap_write_and_wait() */
-- 
2.5.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH v3 6/6] block: use dax_do_io() if blkdev_dax_capable()
  2016-02-17  3:34 [PATCH v3 0/6] DAX fixes, move flushing calls to FS Ross Zwisler
                   ` (4 preceding siblings ...)
  2016-02-17  3:34 ` [PATCH v3 5/6] dax: move writeback calls into the filesystems Ross Zwisler
@ 2016-02-17  3:34 ` Ross Zwisler
  2016-02-17 21:54   ` Jan Kara
  5 siblings, 1 reply; 16+ messages in thread
From: Ross Zwisler @ 2016-02-17  3:34 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dan Williams, J. Bruce Fields, Theodore Ts'o, Alexander Viro,
	Andreas Dilger, Andrew Morton, Dave Chinner, Jan Kara,
	Jeff Layton, Jens Axboe, Matthew Wilcox, linux-block, linux-ext4,
	linux-fsdevel, linux-mm, linux-nvdimm, xfs, Jan Kara, Jens Axboe,
	Al Viro, Ross Zwisler, Matthew Wilcox

From: Dan Williams <dan.j.williams@intel.com>

Setting S_DAX on an inode requires that the inode participate in the
DAX-fsync mechanism which expects to use the pagecache for tracking
potentially dirty cpu cachelines.  However, dax_do_io() participates in
the standard pagecache sync semantics and arranges for dirty pages to be
flushed through the driver when a direct-IO operation accesses the same
ranges.

It should always be valid to use the dax_do_io() path regardless of
whether the block_device inode has S_DAX set.  In either case dirty
pages or dirty cachelines are made durable before the direct-IO
operation proceeds.

Cc: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@fb.com>
Cc: Al Viro <viro@ftp.linux.org.uk>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 block/ioctl.c      |  1 +
 fs/block_dev.c     |  3 ++-
 include/linux/fs.h | 31 +++++++++++++++++++++----------
 3 files changed, 24 insertions(+), 11 deletions(-)

diff --git a/block/ioctl.c b/block/ioctl.c
index d8996bb..7c64286 100644
--- a/block/ioctl.c
+++ b/block/ioctl.c
@@ -434,6 +434,7 @@ bool blkdev_dax_capable(struct block_device *bdev)
 
 	return true;
 }
+EXPORT_SYMBOL(blkdev_dax_capable);
 #endif
 
 static int blkdev_flushbuf(struct block_device *bdev, fmode_t mode,
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 826b164..0e937dd 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -166,8 +166,9 @@ blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, loff_t offset)
 {
 	struct file *file = iocb->ki_filp;
 	struct inode *inode = bdev_file_inode(file);
+	struct block_device *bdev = I_BDEV(inode);
 
-	if (IS_DAX(inode))
+	if (blkdev_dax_capable(bdev))
 		return dax_do_io(iocb, inode, iter, offset, blkdev_get_block,
 				NULL, DIO_SKIP_DIO_COUNT);
 	return __blockdev_direct_IO(iocb, inode, I_BDEV(inode), iter, offset,
diff --git a/include/linux/fs.h b/include/linux/fs.h
index ae68100..a3f5ee8 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -830,7 +830,14 @@ static inline unsigned imajor(const struct inode *inode)
 	return MAJOR(inode->i_rdev);
 }
 
+#ifdef CONFIG_BLOCK
 extern struct block_device *I_BDEV(struct inode *inode);
+#else
+static inline struct block_device *I_BDEV(struct inode *inode)
+{
+	return NULL;
+}
+#endif
 
 struct fown_struct {
 	rwlock_t lock;          /* protects pid, uid, euid fields */
@@ -2306,15 +2313,6 @@ extern struct super_block *freeze_bdev(struct block_device *);
 extern void emergency_thaw_all(void);
 extern int thaw_bdev(struct block_device *bdev, struct super_block *sb);
 extern int fsync_bdev(struct block_device *);
-#ifdef CONFIG_FS_DAX
-extern bool blkdev_dax_capable(struct block_device *bdev);
-#else
-static inline bool blkdev_dax_capable(struct block_device *bdev)
-{
-	return false;
-}
-#endif
-
 extern struct super_block *blockdev_superblock;
 
 static inline bool sb_is_blkdev_sb(struct super_block *sb)
@@ -2902,9 +2900,22 @@ extern int generic_show_options(struct seq_file *m, struct dentry *root);
 extern void save_mount_options(struct super_block *sb, char *options);
 extern void replace_mount_options(struct super_block *sb, char *options);
 
+#ifdef CONFIG_FS_DAX
+extern bool blkdev_dax_capable(struct block_device *bdev);
+#else
+static inline bool blkdev_dax_capable(struct block_device *bdev)
+{
+	return false;
+}
+#endif
+
 static inline bool io_is_direct(struct file *filp)
 {
-	return (filp->f_flags & O_DIRECT) || IS_DAX(filp->f_mapping->host);
+	struct inode *inode = filp->f_mapping->host;
+
+	return (filp->f_flags & O_DIRECT) || IS_DAX(inode)
+		|| (S_ISBLK(file_inode(filp)->i_mode)
+				&& blkdev_dax_capable(I_BDEV(inode)));
 }
 
 static inline int iocb_flags(struct file *file)
-- 
2.5.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH v3 2/6] ext2, ext4: only set S_DAX for regular inodes
  2016-02-17  3:34 ` [PATCH v3 2/6] ext2, ext4: only set S_DAX for regular inodes Ross Zwisler
@ 2016-02-17 21:33   ` Jan Kara
  0 siblings, 0 replies; 16+ messages in thread
From: Jan Kara @ 2016-02-17 21:33 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: linux-kernel, J. Bruce Fields, Theodore Ts'o, Alexander Viro,
	Andreas Dilger, Andrew Morton, Dan Williams, Dave Chinner,
	Jan Kara, Jeff Layton, Jens Axboe, Matthew Wilcox, linux-block,
	linux-ext4, linux-fsdevel, linux-mm, linux-nvdimm, xfs

On Tue 16-02-16 20:34:15, Ross Zwisler wrote:
> When S_DAX is set on an inode we assume that if there are pages attached
> to the mapping (mapping->nrpages != 0), those pages are clean zero pages
> that were used to service reads from holes.  Any dirty data associated with
> the inode should be in the form of DAX exceptional entries
> (mapping->nrexceptional) that is written back via
> dax_writeback_mapping_range().
> 
> With the current code, though, this isn't always true.  For example, ext2
> and ext4 directory inodes can have S_DAX set, but have their dirty data
> stored as dirty page cache entries.  For these types of inodes, having
> S_DAX set doesn't really make sense since their I/O doesn't actually happen
> through the DAX code path.
> 
> Instead, only allow S_DAX to be set for regular inodes for ext2 and ext4.
> This allows us to have strict DAX vs non-DAX paths in the writeback code.
> 
> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>

Looks good. You can add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/ext2/inode.c | 2 +-
>  fs/ext4/inode.c | 2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
> index 338eefd..27e2cdd 100644
> --- a/fs/ext2/inode.c
> +++ b/fs/ext2/inode.c
> @@ -1296,7 +1296,7 @@ void ext2_set_inode_flags(struct inode *inode)
>  		inode->i_flags |= S_NOATIME;
>  	if (flags & EXT2_DIRSYNC_FL)
>  		inode->i_flags |= S_DIRSYNC;
> -	if (test_opt(inode->i_sb, DAX))
> +	if (test_opt(inode->i_sb, DAX) && S_ISREG(inode->i_mode))
>  		inode->i_flags |= S_DAX;
>  }
>  
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 83bc8bf..7088aa5 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -4127,7 +4127,7 @@ void ext4_set_inode_flags(struct inode *inode)
>  		new_fl |= S_NOATIME;
>  	if (flags & EXT4_DIRSYNC_FL)
>  		new_fl |= S_DIRSYNC;
> -	if (test_opt(inode->i_sb, DAX))
> +	if (test_opt(inode->i_sb, DAX) && S_ISREG(inode->i_mode))
>  		new_fl |= S_DAX;
>  	inode_set_flags(inode, new_fl,
>  			S_SYNC|S_APPEND|S_IMMUTABLE|S_NOATIME|S_DIRSYNC|S_DAX);
> -- 
> 2.5.0
> 
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v3 3/6] ext4: Online defrag not supported with DAX
  2016-02-17  3:34 ` [PATCH v3 3/6] ext4: Online defrag not supported with DAX Ross Zwisler
@ 2016-02-17 21:34   ` Jan Kara
  2016-02-17 21:50   ` Ross Zwisler
  1 sibling, 0 replies; 16+ messages in thread
From: Jan Kara @ 2016-02-17 21:34 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: linux-kernel, J. Bruce Fields, Theodore Ts'o, Alexander Viro,
	Andreas Dilger, Andrew Morton, Dan Williams, Dave Chinner,
	Jan Kara, Jeff Layton, Jens Axboe, Matthew Wilcox, linux-block,
	linux-ext4, linux-fsdevel, linux-mm, linux-nvdimm, xfs

On Tue 16-02-16 20:34:16, Ross Zwisler wrote:
> Online defrag operations for ext4 are hard coded to use the page cache.
> See ext4_ioctl() -> ext4_move_extents() -> move_extent_per_page()
> 
> When combined with DAX I/O, which circumvents the page cache, this can
> result in data corruption.  This was observed with xfstests ext4/307 and
> ext4/308.
> 
> Fix this by only allowing online defrag for non-DAX files.
> 
> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>

We need to handle this eventually but for now we are fine. You can add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/ext4/ioctl.c | 5 +++++
>  1 file changed, 5 insertions(+)
> 
> diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c
> index 0f6c369..e32c86f 100644
> --- a/fs/ext4/ioctl.c
> +++ b/fs/ext4/ioctl.c
> @@ -583,6 +583,11 @@ group_extend_out:
>  				 "Online defrag not supported with bigalloc");
>  			err = -EOPNOTSUPP;
>  			goto mext_out;
> +		} else if (IS_DAX(inode)) {
> +			ext4_msg(sb, KERN_ERR,
> +				 "Online defrag not supported with DAX");
> +			err = -EOPNOTSUPP;
> +			goto mext_out;
>  		}
>  
>  		err = mnt_want_write_file(filp);
> -- 
> 2.5.0
> 
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v3 4/6] dax: give DAX clearing code correct bdev
  2016-02-17  3:34 ` [PATCH v3 4/6] dax: give DAX clearing code correct bdev Ross Zwisler
@ 2016-02-17 21:37   ` Jan Kara
  0 siblings, 0 replies; 16+ messages in thread
From: Jan Kara @ 2016-02-17 21:37 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: linux-kernel, J. Bruce Fields, Theodore Ts'o, Alexander Viro,
	Andreas Dilger, Andrew Morton, Dan Williams, Dave Chinner,
	Jan Kara, Jeff Layton, Jens Axboe, Matthew Wilcox, linux-block,
	linux-ext4, linux-fsdevel, linux-mm, linux-nvdimm, xfs

On Tue 16-02-16 20:34:17, Ross Zwisler wrote:
> dax_clear_blocks() needs a valid struct block_device and previously it was
> using inode->i_sb->s_bdev in all cases.  This is correct for normal inodes
> on mounted ext2, ext4 and XFS filesystems, but is incorrect for DAX raw
> block devices and for XFS real-time devices.
> 
> Instead, rename dax_clear_blocks() to dax_clear_sectors(), and change its
> arguments to take a bdev and a sector instead of an inode and a block.
> This better reflects what the function does, and it allows the filesystem
> and raw block device code to pass in an appropriate struct block_device.
> 
> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> Suggested-by: Dan Williams <dan.j.williams@intel.com>

The patch looks good to me. You can add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/dax.c               | 9 ++++-----
>  fs/ext2/inode.c        | 6 ++++--
>  fs/xfs/xfs_aops.c      | 2 +-
>  fs/xfs/xfs_aops.h      | 1 +
>  fs/xfs/xfs_bmap_util.c | 3 ++-
>  include/linux/dax.h    | 2 +-
>  6 files changed, 13 insertions(+), 10 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index fc2e314..9a173dd 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -79,15 +79,14 @@ struct page *read_dax_sector(struct block_device *bdev, sector_t n)
>  }
>  
>  /*
> - * dax_clear_blocks() is called from within transaction context from XFS,
> + * dax_clear_sectors() is called from within transaction context from XFS,
>   * and hence this means the stack from this point must follow GFP_NOFS
>   * semantics for all operations.
>   */
> -int dax_clear_blocks(struct inode *inode, sector_t block, long _size)
> +int dax_clear_sectors(struct block_device *bdev, sector_t _sector, long _size)
>  {
> -	struct block_device *bdev = inode->i_sb->s_bdev;
>  	struct blk_dax_ctl dax = {
> -		.sector = block << (inode->i_blkbits - 9),
> +		.sector = _sector,
>  		.size = _size,
>  	};
>  
> @@ -109,7 +108,7 @@ int dax_clear_blocks(struct inode *inode, sector_t block, long _size)
>  	wmb_pmem();
>  	return 0;
>  }
> -EXPORT_SYMBOL_GPL(dax_clear_blocks);
> +EXPORT_SYMBOL_GPL(dax_clear_sectors);
>  
>  /* the clear_pmem() calls are ordered by a wmb_pmem() in the caller */
>  static void dax_new_buf(void __pmem *addr, unsigned size, unsigned first,
> diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
> index 27e2cdd..4467cbd 100644
> --- a/fs/ext2/inode.c
> +++ b/fs/ext2/inode.c
> @@ -737,8 +737,10 @@ static int ext2_get_blocks(struct inode *inode,
>  		 * so that it's not found by another thread before it's
>  		 * initialised
>  		 */
> -		err = dax_clear_blocks(inode, le32_to_cpu(chain[depth-1].key),
> -						1 << inode->i_blkbits);
> +		err = dax_clear_sectors(inode->i_sb->s_bdev,
> +				le32_to_cpu(chain[depth-1].key) <<
> +				(inode->i_blkbits - 9),
> +				1 << inode->i_blkbits);
>  		if (err) {
>  			mutex_unlock(&ei->truncate_mutex);
>  			goto cleanup;
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index 379c089..fc20518 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -55,7 +55,7 @@ xfs_count_page_state(
>  	} while ((bh = bh->b_this_page) != head);
>  }
>  
> -STATIC struct block_device *
> +struct block_device *
>  xfs_find_bdev_for_inode(
>  	struct inode		*inode)
>  {
> diff --git a/fs/xfs/xfs_aops.h b/fs/xfs/xfs_aops.h
> index f6ffc9a..a4343c6 100644
> --- a/fs/xfs/xfs_aops.h
> +++ b/fs/xfs/xfs_aops.h
> @@ -62,5 +62,6 @@ int	xfs_get_blocks_dax_fault(struct inode *inode, sector_t offset,
>  			         struct buffer_head *map_bh, int create);
>  
>  extern void xfs_count_page_state(struct page *, int *, int *);
> +extern struct block_device *xfs_find_bdev_for_inode(struct inode *);
>  
>  #endif /* __XFS_AOPS_H__ */
> diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
> index 45ec9e4..6c87601 100644
> --- a/fs/xfs/xfs_bmap_util.c
> +++ b/fs/xfs/xfs_bmap_util.c
> @@ -75,7 +75,8 @@ xfs_zero_extent(
>  	ssize_t		size = XFS_FSB_TO_B(mp, count_fsb);
>  
>  	if (IS_DAX(VFS_I(ip)))
> -		return dax_clear_blocks(VFS_I(ip), block, size);
> +		return dax_clear_sectors(xfs_find_bdev_for_inode(VFS_I(ip)),
> +				sector, size);
>  
>  	/*
>  	 * let the block layer decide on the fastest method of
> diff --git a/include/linux/dax.h b/include/linux/dax.h
> index 818e450..7b6bced 100644
> --- a/include/linux/dax.h
> +++ b/include/linux/dax.h
> @@ -7,7 +7,7 @@
>  
>  ssize_t dax_do_io(struct kiocb *, struct inode *, struct iov_iter *, loff_t,
>  		  get_block_t, dio_iodone_t, int flags);
> -int dax_clear_blocks(struct inode *, sector_t block, long size);
> +int dax_clear_sectors(struct block_device *bdev, sector_t _sector, long _size);
>  int dax_zero_page_range(struct inode *, loff_t from, unsigned len, get_block_t);
>  int dax_truncate_page(struct inode *, loff_t from, get_block_t);
>  int dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t,
> -- 
> 2.5.0
> 
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v3 3/6] ext4: Online defrag not supported with DAX
  2016-02-17  3:34 ` [PATCH v3 3/6] ext4: Online defrag not supported with DAX Ross Zwisler
  2016-02-17 21:34   ` Jan Kara
@ 2016-02-17 21:50   ` Ross Zwisler
  2016-02-17 22:10     ` Jan Kara
  2016-02-18  0:12     ` Dave Chinner
  1 sibling, 2 replies; 16+ messages in thread
From: Ross Zwisler @ 2016-02-17 21:50 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-kernel, J. Bruce Fields, Theodore Ts'o, Alexander Viro,
	Andreas Dilger, Andrew Morton, Dan Williams, Dave Chinner,
	Jan Kara, Jeff Layton, Jens Axboe, Matthew Wilcox, linux-block,
	linux-ext4, linux-fsdevel, linux-mm, linux-nvdimm, xfs

On Tue, Feb 16, 2016 at 08:34:16PM -0700, Ross Zwisler wrote:
> Online defrag operations for ext4 are hard coded to use the page cache.
> See ext4_ioctl() -> ext4_move_extents() -> move_extent_per_page()
> 
> When combined with DAX I/O, which circumvents the page cache, this can
> result in data corruption.  This was observed with xfstests ext4/307 and
> ext4/308.
> 
> Fix this by only allowing online defrag for non-DAX files.

Jan,

Thinking about this a bit more, it's probably the case that the data
corruption I was observing was due to us skipping the writeback of the dirty
page cache pages because S_DAX was set.

I do think we have a problem with defrag because it is doing the extent
swapping using the page cache, and we won't flush the dirty pages due to
S_DAX being set.

This patch is the quick and easy answer, and is perhaps appropriate for v4.5.

Looking forward, though, what do you think the correct solution is?  Making an
extent swapper that doesn't use the page cache (as I believe XFS has? see
xfs_swap_extents()), or maybe just unsetting S_DAX while we do the defrag and
being careful to block out page faults and I/O?  Or is it acceptable to just
say that DAX and defrag are mutually exclusive for ext4?

Thanks,
- Ross

> 
> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> ---
>  fs/ext4/ioctl.c | 5 +++++
>  1 file changed, 5 insertions(+)
> 
> diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c
> index 0f6c369..e32c86f 100644
> --- a/fs/ext4/ioctl.c
> +++ b/fs/ext4/ioctl.c
> @@ -583,6 +583,11 @@ group_extend_out:
>  				 "Online defrag not supported with bigalloc");
>  			err = -EOPNOTSUPP;
>  			goto mext_out;
> +		} else if (IS_DAX(inode)) {
> +			ext4_msg(sb, KERN_ERR,
> +				 "Online defrag not supported with DAX");
> +			err = -EOPNOTSUPP;
> +			goto mext_out;
>  		}
>  
>  		err = mnt_want_write_file(filp);
> -- 
> 2.5.0
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v3 6/6] block: use dax_do_io() if blkdev_dax_capable()
  2016-02-17  3:34 ` [PATCH v3 6/6] block: use dax_do_io() if blkdev_dax_capable() Ross Zwisler
@ 2016-02-17 21:54   ` Jan Kara
  2016-02-17 22:18     ` Dan Williams
  0 siblings, 1 reply; 16+ messages in thread
From: Jan Kara @ 2016-02-17 21:54 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: linux-kernel, Dan Williams, J. Bruce Fields, Theodore Ts'o,
	Alexander Viro, Andreas Dilger, Andrew Morton, Dave Chinner,
	Jan Kara, Jeff Layton, Jens Axboe, Matthew Wilcox, linux-block,
	linux-ext4, linux-fsdevel, linux-mm, linux-nvdimm, xfs, Jan Kara,
	Jens Axboe, Al Viro, Matthew Wilcox

On Tue 16-02-16 20:34:19, Ross Zwisler wrote:
> From: Dan Williams <dan.j.williams@intel.com>
> 
> Setting S_DAX on an inode requires that the inode participate in the
> DAX-fsync mechanism which expects to use the pagecache for tracking
> potentially dirty cpu cachelines.  However, dax_do_io() participates in
> the standard pagecache sync semantics and arranges for dirty pages to be
> flushed through the driver when a direct-IO operation accesses the same
> ranges.
> 
> It should always be valid to use the dax_do_io() path regardless of
> whether the block_device inode has S_DAX set.  In either case dirty
> pages or dirty cachelines are made durable before the direct-IO
> operation proceeds.

Please no. I agree that going via DAX path for normal likely won't
introduce new data corruption issues. But I dislike having a special
case for block devices. Also you have no way of turning DAX off for block
devices AFAIU and as Dave said, DAX should be opt-in, not opt-out. Note
that you may actually want to go through the block layer for normal IO e.g.
because you use IO cgroups to limit processes so using DAX regresses some
functionality.

								Honza
 
> Cc: Jan Kara <jack@suse.cz>
> Cc: Jens Axboe <axboe@fb.com>
> Cc: Al Viro <viro@ftp.linux.org.uk>
> Cc: Dave Chinner <david@fromorbit.com>
> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
> Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> ---
>  block/ioctl.c      |  1 +
>  fs/block_dev.c     |  3 ++-
>  include/linux/fs.h | 31 +++++++++++++++++++++----------
>  3 files changed, 24 insertions(+), 11 deletions(-)
> 
> diff --git a/block/ioctl.c b/block/ioctl.c
> index d8996bb..7c64286 100644
> --- a/block/ioctl.c
> +++ b/block/ioctl.c
> @@ -434,6 +434,7 @@ bool blkdev_dax_capable(struct block_device *bdev)
>  
>  	return true;
>  }
> +EXPORT_SYMBOL(blkdev_dax_capable);
>  #endif
>  
>  static int blkdev_flushbuf(struct block_device *bdev, fmode_t mode,
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index 826b164..0e937dd 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -166,8 +166,9 @@ blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, loff_t offset)
>  {
>  	struct file *file = iocb->ki_filp;
>  	struct inode *inode = bdev_file_inode(file);
> +	struct block_device *bdev = I_BDEV(inode);
>  
> -	if (IS_DAX(inode))
> +	if (blkdev_dax_capable(bdev))
>  		return dax_do_io(iocb, inode, iter, offset, blkdev_get_block,
>  				NULL, DIO_SKIP_DIO_COUNT);
>  	return __blockdev_direct_IO(iocb, inode, I_BDEV(inode), iter, offset,
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index ae68100..a3f5ee8 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -830,7 +830,14 @@ static inline unsigned imajor(const struct inode *inode)
>  	return MAJOR(inode->i_rdev);
>  }
>  
> +#ifdef CONFIG_BLOCK
>  extern struct block_device *I_BDEV(struct inode *inode);
> +#else
> +static inline struct block_device *I_BDEV(struct inode *inode)
> +{
> +	return NULL;
> +}
> +#endif
>  
>  struct fown_struct {
>  	rwlock_t lock;          /* protects pid, uid, euid fields */
> @@ -2306,15 +2313,6 @@ extern struct super_block *freeze_bdev(struct block_device *);
>  extern void emergency_thaw_all(void);
>  extern int thaw_bdev(struct block_device *bdev, struct super_block *sb);
>  extern int fsync_bdev(struct block_device *);
> -#ifdef CONFIG_FS_DAX
> -extern bool blkdev_dax_capable(struct block_device *bdev);
> -#else
> -static inline bool blkdev_dax_capable(struct block_device *bdev)
> -{
> -	return false;
> -}
> -#endif
> -
>  extern struct super_block *blockdev_superblock;
>  
>  static inline bool sb_is_blkdev_sb(struct super_block *sb)
> @@ -2902,9 +2900,22 @@ extern int generic_show_options(struct seq_file *m, struct dentry *root);
>  extern void save_mount_options(struct super_block *sb, char *options);
>  extern void replace_mount_options(struct super_block *sb, char *options);
>  
> +#ifdef CONFIG_FS_DAX
> +extern bool blkdev_dax_capable(struct block_device *bdev);
> +#else
> +static inline bool blkdev_dax_capable(struct block_device *bdev)
> +{
> +	return false;
> +}
> +#endif
> +
>  static inline bool io_is_direct(struct file *filp)
>  {
> -	return (filp->f_flags & O_DIRECT) || IS_DAX(filp->f_mapping->host);
> +	struct inode *inode = filp->f_mapping->host;
> +
> +	return (filp->f_flags & O_DIRECT) || IS_DAX(inode)
> +		|| (S_ISBLK(file_inode(filp)->i_mode)
> +				&& blkdev_dax_capable(I_BDEV(inode)));
>  }
>  
>  static inline int iocb_flags(struct file *file)
> -- 
> 2.5.0
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v3 1/6] block: disable block device DAX by default
  2016-02-17  3:34 ` [PATCH v3 1/6] block: disable block device DAX by default Ross Zwisler
@ 2016-02-17 21:55   ` Jan Kara
  0 siblings, 0 replies; 16+ messages in thread
From: Jan Kara @ 2016-02-17 21:55 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: linux-kernel, Dan Williams, J. Bruce Fields, Theodore Ts'o,
	Alexander Viro, Andreas Dilger, Andrew Morton, Dave Chinner,
	Jan Kara, Jeff Layton, Jens Axboe, Matthew Wilcox, linux-block,
	linux-ext4, linux-fsdevel, linux-mm, linux-nvdimm, xfs, Jan Kara,
	Jens Axboe, Matthew Wilcox, Al Viro

On Tue 16-02-16 20:34:14, Ross Zwisler wrote:
> From: Dan Williams <dan.j.williams@intel.com>
> 
> The recent *sync enabling discovered that we are inserting into the
> block_device pagecache counter to the expectations of the dirty data
> tracking for dax mappings.  This can lead to data corruption.
> 
> We want to support DAX for block devices eventually, but it requires
> wider changes to properly manage the pagecache.
> 
>   [<ffffffff81576d93>] dump_stack+0x85/0xc2
>   [<ffffffff812b9ee0>] dax_writeback_mapping_range+0x60/0xe0
>   [<ffffffff812a1d4f>] blkdev_writepages+0x3f/0x50
>   [<ffffffff811db011>] do_writepages+0x21/0x30
>   [<ffffffff811cb6a6>] __filemap_fdatawrite_range+0xc6/0x100
>   [<ffffffff811cb75a>] filemap_write_and_wait+0x4a/0xa0
>   [<ffffffff812a15e0>] set_blocksize+0x70/0xd0
>   [<ffffffff812a273d>] sb_set_blocksize+0x1d/0x50
>   [<ffffffff8132ac9b>] ext4_fill_super+0x75b/0x3360
>   [<ffffffff81583381>] ? vsnprintf+0x201/0x4c0
>   [<ffffffff815836d9>] ? snprintf+0x49/0x60
>   [<ffffffff81263010>] mount_bdev+0x180/0x1b0
>   [<ffffffff8132a540>] ? ext4_calculate_overhead+0x370/0x370
>   [<ffffffff8131ad95>] ext4_mount+0x15/0x20
>   [<ffffffff81263908>] mount_fs+0x38/0x170
> 
> Mark the support broken so its disabled by default, but otherwise still
> available for testing.
> 
> Cc: Jan Kara <jack@suse.cz>
> Cc: Jens Axboe <axboe@fb.com>
> Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
> Cc: Al Viro <viro@ftp.linux.org.uk>
> Reported-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> Suggested-by: Dave Chinner <david@fromorbit.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>

Makes sense. You can add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  block/Kconfig  | 13 +++++++++++++
>  fs/block_dev.c |  6 +++++-
>  2 files changed, 18 insertions(+), 1 deletion(-)
> 
> diff --git a/block/Kconfig b/block/Kconfig
> index 161491d..0363cd7 100644
> --- a/block/Kconfig
> +++ b/block/Kconfig
> @@ -88,6 +88,19 @@ config BLK_DEV_INTEGRITY
>  	T10/SCSI Data Integrity Field or the T13/ATA External Path
>  	Protection.  If in doubt, say N.
>  
> +config BLK_DEV_DAX
> +	bool "Block device DAX support"
> +	depends on FS_DAX
> +	depends on BROKEN
> +	help
> +	  When DAX support is available (CONFIG_FS_DAX) raw block
> +	  devices can also support direct userspace access to the
> +	  storage capacity via MMAP(2) similar to a file on a
> +	  DAX-enabled filesystem.  However, the DAX I/O-path disables
> +	  some standard I/O-statistics, and the MMAP(2) path has some
> +	  operational differences due to bypassing the page
> +	  cache.  If in doubt, say N.
> +
>  config BLK_DEV_THROTTLING
>  	bool "Block layer bio throttling support"
>  	depends on BLK_CGROUP=y
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index 39b3a17..31c6d10 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -1201,7 +1201,11 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
>  		bdev->bd_disk = disk;
>  		bdev->bd_queue = disk->queue;
>  		bdev->bd_contains = bdev;
> -		bdev->bd_inode->i_flags = disk->fops->direct_access ? S_DAX : 0;
> +		if (IS_ENABLED(CONFIG_BLK_DEV_DAX) && disk->fops->direct_access)
> +			bdev->bd_inode->i_flags = S_DAX;
> +		else
> +			bdev->bd_inode->i_flags = 0;
> +
>  		if (!partno) {
>  			ret = -ENXIO;
>  			bdev->bd_part = disk_get_part(disk, partno);
> -- 
> 2.5.0
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v3 3/6] ext4: Online defrag not supported with DAX
  2016-02-17 21:50   ` Ross Zwisler
@ 2016-02-17 22:10     ` Jan Kara
  2016-02-18  0:12     ` Dave Chinner
  1 sibling, 0 replies; 16+ messages in thread
From: Jan Kara @ 2016-02-17 22:10 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Jan Kara, linux-kernel, J. Bruce Fields, Theodore Ts'o,
	Alexander Viro, Andreas Dilger, Andrew Morton, Dan Williams,
	Dave Chinner, Jan Kara, Jeff Layton, Jens Axboe, Matthew Wilcox,
	linux-block, linux-ext4, linux-fsdevel, linux-mm, linux-nvdimm,
	xfs

On Wed 17-02-16 14:50:37, Ross Zwisler wrote:
> On Tue, Feb 16, 2016 at 08:34:16PM -0700, Ross Zwisler wrote:
> > Online defrag operations for ext4 are hard coded to use the page cache.
> > See ext4_ioctl() -> ext4_move_extents() -> move_extent_per_page()
> > 
> > When combined with DAX I/O, which circumvents the page cache, this can
> > result in data corruption.  This was observed with xfstests ext4/307 and
> > ext4/308.
> > 
> > Fix this by only allowing online defrag for non-DAX files.
> 
> Jan,
> 
> Thinking about this a bit more, it's probably the case that the data
> corruption I was observing was due to us skipping the writeback of the dirty
> page cache pages because S_DAX was set.
> 
> I do think we have a problem with defrag because it is doing the extent
> swapping using the page cache, and we won't flush the dirty pages due to
> S_DAX being set.
> 
> This patch is the quick and easy answer, and is perhaps appropriate for v4.5.
> 
> Looking forward, though, what do you think the correct solution is?  Making an
> extent swapper that doesn't use the page cache (as I believe XFS has? see
> xfs_swap_extents()), or maybe just unsetting S_DAX while we do the defrag and
> being careful to block out page faults and I/O?  Or is it acceptable to just
> say that DAX and defrag are mutually exclusive for ext4?

For 4.5 I'd just say just make them exclusive. Long term, we could just
avoid using page cache for copying data - grab all the necessary locks,
wait for all DIO to complete, evict anything in pagecache, copy data, swap
extents. It could even result in a simpler code.

								Honza


> > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> > ---
> >  fs/ext4/ioctl.c | 5 +++++
> >  1 file changed, 5 insertions(+)
> > 
> > diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c
> > index 0f6c369..e32c86f 100644
> > --- a/fs/ext4/ioctl.c
> > +++ b/fs/ext4/ioctl.c
> > @@ -583,6 +583,11 @@ group_extend_out:
> >  				 "Online defrag not supported with bigalloc");
> >  			err = -EOPNOTSUPP;
> >  			goto mext_out;
> > +		} else if (IS_DAX(inode)) {
> > +			ext4_msg(sb, KERN_ERR,
> > +				 "Online defrag not supported with DAX");
> > +			err = -EOPNOTSUPP;
> > +			goto mext_out;
> >  		}
> >  
> >  		err = mnt_want_write_file(filp);
> > -- 
> > 2.5.0
> > 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v3 6/6] block: use dax_do_io() if blkdev_dax_capable()
  2016-02-17 21:54   ` Jan Kara
@ 2016-02-17 22:18     ` Dan Williams
  0 siblings, 0 replies; 16+ messages in thread
From: Dan Williams @ 2016-02-17 22:18 UTC (permalink / raw)
  To: Jan Kara
  Cc: Ross Zwisler, linux-kernel@vger.kernel.org, J. Bruce Fields,
	Theodore Ts'o, Alexander Viro, Andreas Dilger, Andrew Morton,
	Dave Chinner, Jan Kara, Jeff Layton, Jens Axboe, Matthew Wilcox,
	linux-block, linux-ext4, linux-fsdevel, Linux MM,
	linux-nvdimm@lists.01.org, XFS Developers, Jens Axboe, Al Viro,
	Matthew Wilcox

On Wed, Feb 17, 2016 at 1:54 PM, Jan Kara <jack@suse.cz> wrote:
> On Tue 16-02-16 20:34:19, Ross Zwisler wrote:
>> From: Dan Williams <dan.j.williams@intel.com>
>>
>> Setting S_DAX on an inode requires that the inode participate in the
>> DAX-fsync mechanism which expects to use the pagecache for tracking
>> potentially dirty cpu cachelines.  However, dax_do_io() participates in
>> the standard pagecache sync semantics and arranges for dirty pages to be
>> flushed through the driver when a direct-IO operation accesses the same
>> ranges.
>>
>> It should always be valid to use the dax_do_io() path regardless of
>> whether the block_device inode has S_DAX set.  In either case dirty
>> pages or dirty cachelines are made durable before the direct-IO
>> operation proceeds.
>
> Please no. I agree that going via DAX path for normal likely won't
> introduce new data corruption issues. But I dislike having a special
> case for block devices. Also you have no way of turning DAX off for block
> devices AFAIU and as Dave said, DAX should be opt-in, not opt-out. Note
> that you may actually want to go through the block layer for normal IO e.g.
> because you use IO cgroups to limit processes so using DAX regresses some
> functionality.
>

Sounds good.

As Ross mentioned in the cover letter, I'm fine with dropping this one
for now as we think through how to restore raw device DAX support.  In
the meantime we can still force CONFIG_BLK_DEV_DAX=y for testing.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v3 3/6] ext4: Online defrag not supported with DAX
  2016-02-17 21:50   ` Ross Zwisler
  2016-02-17 22:10     ` Jan Kara
@ 2016-02-18  0:12     ` Dave Chinner
  1 sibling, 0 replies; 16+ messages in thread
From: Dave Chinner @ 2016-02-18  0:12 UTC (permalink / raw)
  To: Ross Zwisler, Jan Kara, linux-kernel, J. Bruce Fields,
	Theodore Ts'o, Alexander Viro, Andreas Dilger, Andrew Morton,
	Dan Williams, Jan Kara, Jeff Layton, Jens Axboe, Matthew Wilcox,
	linux-block, linux-ext4, linux-fsdevel, linux-mm, linux-nvdimm,
	xfs

On Wed, Feb 17, 2016 at 02:50:37PM -0700, Ross Zwisler wrote:
> On Tue, Feb 16, 2016 at 08:34:16PM -0700, Ross Zwisler wrote:
> > Online defrag operations for ext4 are hard coded to use the page cache.
> > See ext4_ioctl() -> ext4_move_extents() -> move_extent_per_page()
> > 
> > When combined with DAX I/O, which circumvents the page cache, this can
> > result in data corruption.  This was observed with xfstests ext4/307 and
> > ext4/308.
> > 
> > Fix this by only allowing online defrag for non-DAX files.
> 
> Jan,
> 
> Thinking about this a bit more, it's probably the case that the data
> corruption I was observing was due to us skipping the writeback of the dirty
> page cache pages because S_DAX was set.
> 
> I do think we have a problem with defrag because it is doing the extent
> swapping using the page cache, and we won't flush the dirty pages due to
> S_DAX being set.
> 
> This patch is the quick and easy answer, and is perhaps appropriate for v4.5.
> 
> Looking forward, though, what do you think the correct solution is?  Making an
> extent swapper that doesn't use the page cache (as I believe XFS has? see
> xfs_swap_extents()),

XFS does the data copy in userspace using direct IO so we don't
care about whether DAX is enabled or not on either the source or
destination inode. i.e. xfs_swap_extents() is a pure
metadata operation, swapping the entire extent tree between two
inodes if the source data has not changed while the copy was in
progress.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2016-02-18  0:12 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2016-02-17  3:34 [PATCH v3 0/6] DAX fixes, move flushing calls to FS Ross Zwisler
2016-02-17  3:34 ` [PATCH v3 1/6] block: disable block device DAX by default Ross Zwisler
2016-02-17 21:55   ` Jan Kara
2016-02-17  3:34 ` [PATCH v3 2/6] ext2, ext4: only set S_DAX for regular inodes Ross Zwisler
2016-02-17 21:33   ` Jan Kara
2016-02-17  3:34 ` [PATCH v3 3/6] ext4: Online defrag not supported with DAX Ross Zwisler
2016-02-17 21:34   ` Jan Kara
2016-02-17 21:50   ` Ross Zwisler
2016-02-17 22:10     ` Jan Kara
2016-02-18  0:12     ` Dave Chinner
2016-02-17  3:34 ` [PATCH v3 4/6] dax: give DAX clearing code correct bdev Ross Zwisler
2016-02-17 21:37   ` Jan Kara
2016-02-17  3:34 ` [PATCH v3 5/6] dax: move writeback calls into the filesystems Ross Zwisler
2016-02-17  3:34 ` [PATCH v3 6/6] block: use dax_do_io() if blkdev_dax_capable() Ross Zwisler
2016-02-17 21:54   ` Jan Kara
2016-02-17 22:18     ` Dan Williams

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).