* [PATCH v3 00/18] btrfs dax support
@ 2019-04-16 16:41 Goldwyn Rodrigues
  2019-04-16 16:41 ` [PATCH 01/18] btrfs: create a mount option for dax Goldwyn Rodrigues
                   ` (18 more replies)
  0 siblings, 19 replies; 31+ messages in thread
From: Goldwyn Rodrigues @ 2019-04-16 16:41 UTC (permalink / raw)
  To: linux-btrfs
  Cc: linux-fsdevel, jack, david, willy, hch, darrick.wong, kilobyte,
	dsterba, nborisov, linux-nvdimm
This patch set adds support for dax on the BTRFS filesystem.
In order to support for CoW for btrfs, there were changes which had to be
made to the dax handling. The important one is copying blocks into the
same dax device before using them which is performed by iomap
type IOMAP_DAX_COW.
Snapshotting and CoW features are supported (including mmap preservation
across snapshots).
Git: https://github.com/goldwynr/linux/tree/btrfs-dax
Changes since v2:
 - Created a new type IOMAP_DAX_COW as opposed to flag IOMAP_F_COW
 - CoW source address is presented in iomap.inline_data
 - Split the patches to more elaborate dax/iomap patches
Changes since v1:
 - use iomap instead of redoing everything in btrfs
 - support for mmap writeprotecting on snapshotting
 fs/btrfs/Makefile            |    1 
 fs/btrfs/ctree.h             |   38 +++++
 fs/btrfs/dax.c               |  288 +++++++++++++++++++++++++++++++++++++++++--
 fs/btrfs/disk-io.c           |    4 
 fs/btrfs/file.c              |   37 ++++-
 fs/btrfs/inode.c             |  114 ++++++++++++-----
 fs/btrfs/ioctl.c             |   29 +++-
 fs/btrfs/send.c              |    4 
 fs/btrfs/super.c             |   30 ++++
 fs/dax.c                     |  152 ++++++++++++++++++++--
 fs/iomap.c                   |    9 -
 fs/ocfs2/file.c              |    2 
 fs/read_write.c              |   10 -
 fs/xfs/xfs_reflink.c         |    2 
 include/linux/dax.h          |   13 +
 include/linux/fs.h           |    7 -
 include/linux/iomap.h        |    7 +
 include/trace/events/btrfs.h |   56 ++++++++
 18 files changed, 717 insertions(+), 86 deletions(-)
-- 
Goldwyn
^ permalink raw reply	[flat|nested] 31+ messages in thread
* [PATCH 01/18] btrfs: create a mount option for dax
  2019-04-16 16:41 [PATCH v3 00/18] btrfs dax support Goldwyn Rodrigues
@ 2019-04-16 16:41 ` Goldwyn Rodrigues
  2019-04-16 16:52   ` Dan Williams
  2019-04-16 16:41 ` [PATCH 02/18] btrfs: Carve out btrfs_get_extent_map_write() out of btrfs_get_blocks_write() Goldwyn Rodrigues
                   ` (17 subsequent siblings)
  18 siblings, 1 reply; 31+ messages in thread
From: Goldwyn Rodrigues @ 2019-04-16 16:41 UTC (permalink / raw)
  To: linux-btrfs
  Cc: linux-fsdevel, jack, david, willy, hch, darrick.wong, kilobyte,
	dsterba, nborisov, linux-nvdimm, Goldwyn Rodrigues
From: Goldwyn Rodrigues <rgoldwyn@suse.com>
This sets S_DAX in inode->i_flags, which can be used with
IS_DAX().
The dax option is restricted to non multi-device mounts.
dax interacts with the device directly instead of using bio, so
all bio-hooks which we use for multi-device cannot be performed
here. While regular read/writes could be manipulated with
RAID0/1, mmap() is still an issue.
Auto-setting free space tree, because dealing with free space
inode (specifically readpages) is a nightmare.
Auto-setting nodatasum because we don't get callback for writing
checksums after mmap()s.
Deny compression because it does not work with direct I/O.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
 fs/btrfs/ctree.h   |  2 ++
 fs/btrfs/disk-io.c |  4 ++++
 fs/btrfs/ioctl.c   |  5 ++++-
 fs/btrfs/super.c   | 30 ++++++++++++++++++++++++++++++
 4 files changed, 40 insertions(+), 1 deletion(-)
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index b3642367a595..8ca1c0d120f4 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1067,6 +1067,7 @@ struct btrfs_fs_info {
 	u32 metadata_ratio;
 
 	void *bdev_holder;
+	struct dax_device *dax_dev;
 
 	/* private scrub information */
 	struct mutex scrub_lock;
@@ -1442,6 +1443,7 @@ static inline u32 BTRFS_MAX_XATTR_SIZE(const struct btrfs_fs_info *info)
 #define BTRFS_MOUNT_FREE_SPACE_TREE	(1 << 26)
 #define BTRFS_MOUNT_NOLOGREPLAY		(1 << 27)
 #define BTRFS_MOUNT_REF_VERIFY		(1 << 28)
+#define BTRFS_MOUNT_DAX			(1 << 29)
 
 #define BTRFS_DEFAULT_COMMIT_INTERVAL	(30)
 #define BTRFS_DEFAULT_MAX_INLINE	(2048)
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 6fe9197f6ee4..2bbb63b2fcff 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -16,6 +16,7 @@
 #include <linux/uuid.h>
 #include <linux/semaphore.h>
 #include <linux/error-injection.h>
+#include <linux/dax.h>
 #include <linux/crc32c.h>
 #include <linux/sched/mm.h>
 #include <asm/unaligned.h>
@@ -2805,6 +2806,8 @@ int open_ctree(struct super_block *sb,
 		goto fail_alloc;
 	}
 
+	fs_info->dax_dev = fs_dax_get_by_bdev(fs_devices->latest_bdev);
+
 	/*
 	 * We want to check superblock checksum, the type is stored inside.
 	 * Pass the whole disk block of size BTRFS_SUPER_INFO_SIZE (4k).
@@ -4043,6 +4046,7 @@ void close_ctree(struct btrfs_fs_info *fs_info)
 #endif
 
 	btrfs_close_devices(fs_info->fs_devices);
+	fs_put_dax(fs_info->dax_dev);
 	btrfs_mapping_tree_free(&fs_info->mapping_tree);
 
 	percpu_counter_destroy(&fs_info->dirty_metadata_bytes);
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index cd4e693406a0..0138119cd9a3 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -149,8 +149,11 @@ void btrfs_sync_inode_flags_to_i_flags(struct inode *inode)
 	if (binode->flags & BTRFS_INODE_DIRSYNC)
 		new_fl |= S_DIRSYNC;
 
+	if ((btrfs_test_opt(btrfs_sb(inode->i_sb), DAX)) && S_ISREG(inode->i_mode))
+		new_fl |= S_DAX;
+
 	set_mask_bits(&inode->i_flags,
-		      S_SYNC | S_APPEND | S_IMMUTABLE | S_NOATIME | S_DIRSYNC,
+		      S_SYNC | S_APPEND | S_IMMUTABLE | S_NOATIME | S_DIRSYNC | S_DAX,
 		      new_fl);
 }
 
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 120e4340792a..3b85e61e5182 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -326,6 +326,7 @@ enum {
 	Opt_treelog, Opt_notreelog,
 	Opt_usebackuproot,
 	Opt_user_subvol_rm_allowed,
+	Opt_dax,
 
 	/* Deprecated options */
 	Opt_alloc_start,
@@ -393,6 +394,7 @@ static const match_table_t tokens = {
 	{Opt_notreelog, "notreelog"},
 	{Opt_usebackuproot, "usebackuproot"},
 	{Opt_user_subvol_rm_allowed, "user_subvol_rm_allowed"},
+	{Opt_dax, "dax"},
 
 	/* Deprecated options */
 	{Opt_alloc_start, "alloc_start=%s"},
@@ -745,6 +747,32 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char *options,
 		case Opt_user_subvol_rm_allowed:
 			btrfs_set_opt(info->mount_opt, USER_SUBVOL_RM_ALLOWED);
 			break;
+		case Opt_dax:
+#ifdef CONFIG_FS_DAX
+			if (btrfs_super_num_devices(info->super_copy) > 1) {
+				btrfs_info(info,
+					   "dax not supported for multi-device btrfs partition\n");
+				ret = -EOPNOTSUPP;
+				goto out;
+			}
+			btrfs_set_opt(info->mount_opt, DAX);
+			btrfs_warn(info, "DAX enabled. Warning: EXPERIMENTAL, use at your own risk\n");
+			btrfs_set_and_info(info, NODATASUM,
+					   "auto-setting nodatasum (dax)");
+			btrfs_clear_opt(info->mount_opt, SPACE_CACHE);
+			btrfs_set_and_info(info, FREE_SPACE_TREE,
+					"auto-setting free space tree (dax)");
+			if (btrfs_test_opt(info, COMPRESS)) {
+				btrfs_info(info, "disabling compress (dax)");
+				btrfs_clear_opt(info->mount_opt, COMPRESS);
+			}
+			break;
+#else
+			btrfs_err(info,
+				  "DAX option not supported\n");
+			ret = -EINVAL;
+			goto out;
+#endif
 		case Opt_enospc_debug:
 			btrfs_set_opt(info->mount_opt, ENOSPC_DEBUG);
 			break;
@@ -1335,6 +1363,8 @@ static int btrfs_show_options(struct seq_file *seq, struct dentry *dentry)
 		seq_puts(seq, ",clear_cache");
 	if (btrfs_test_opt(info, USER_SUBVOL_RM_ALLOWED))
 		seq_puts(seq, ",user_subvol_rm_allowed");
+	if (btrfs_test_opt(info, DAX))
+		seq_puts(seq, ",dax");
 	if (btrfs_test_opt(info, ENOSPC_DEBUG))
 		seq_puts(seq, ",enospc_debug");
 	if (btrfs_test_opt(info, AUTO_DEFRAG))
-- 
2.16.4
^ permalink raw reply related	[flat|nested] 31+ messages in thread
* [PATCH 02/18] btrfs: Carve out btrfs_get_extent_map_write() out of btrfs_get_blocks_write()
  2019-04-16 16:41 [PATCH v3 00/18] btrfs dax support Goldwyn Rodrigues
  2019-04-16 16:41 ` [PATCH 01/18] btrfs: create a mount option for dax Goldwyn Rodrigues
@ 2019-04-16 16:41 ` Goldwyn Rodrigues
  2019-04-16 23:45   ` Elliott, Robert (Servers)
  2019-04-16 16:41 ` [PATCH 03/18] btrfs: basic dax read Goldwyn Rodrigues
                   ` (16 subsequent siblings)
  18 siblings, 1 reply; 31+ messages in thread
From: Goldwyn Rodrigues @ 2019-04-16 16:41 UTC (permalink / raw)
  To: linux-btrfs
  Cc: linux-fsdevel, jack, david, willy, hch, darrick.wong, kilobyte,
	dsterba, nborisov, linux-nvdimm, Goldwyn Rodrigues
From: Goldwyn Rodrigues <rgoldwyn@suse.com>
This makes btrfs_get_extent_map_write() independent of Direct
I/O code.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
 fs/btrfs/ctree.h |  2 ++
 fs/btrfs/inode.c | 40 +++++++++++++++++++++++++++-------------
 2 files changed, 29 insertions(+), 13 deletions(-)
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 8ca1c0d120f4..9512f49262dd 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3277,6 +3277,8 @@ struct inode *btrfs_iget_path(struct super_block *s, struct btrfs_key *location,
 			      struct btrfs_path *path);
 struct inode *btrfs_iget(struct super_block *s, struct btrfs_key *location,
 			 struct btrfs_root *root, int *was_new);
+int btrfs_get_extent_map_write(struct extent_map **map, struct buffer_head *bh,
+		struct inode *inode, u64 start, u64 len);
 struct extent_map *btrfs_get_extent(struct btrfs_inode *inode,
 				    struct page *page, size_t pg_offset,
 				    u64 start, u64 end, int create);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 82fdda8ff5ab..80184d0c3b52 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7496,11 +7496,10 @@ static int btrfs_get_blocks_direct_read(struct extent_map *em,
 	return 0;
 }
 
-static int btrfs_get_blocks_direct_write(struct extent_map **map,
-					 struct buffer_head *bh_result,
-					 struct inode *inode,
-					 struct btrfs_dio_data *dio_data,
-					 u64 start, u64 len)
+int btrfs_get_extent_map_write(struct extent_map **map,
+		struct buffer_head *bh,
+		struct inode *inode,
+		u64 start, u64 len)
 {
 	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
 	struct extent_map *em = *map;
@@ -7554,22 +7553,38 @@ static int btrfs_get_blocks_direct_write(struct extent_map **map,
 			 */
 			btrfs_free_reserved_data_space_noquota(inode, start,
 							       len);
-			goto skip_cow;
+			/* skip COW */
+			goto out;
 		}
 	}
 
 	/* this will cow the extent */
-	len = bh_result->b_size;
+	if (bh)
+		len = bh->b_size;
 	free_extent_map(em);
 	*map = em = btrfs_new_extent_direct(inode, start, len);
-	if (IS_ERR(em)) {
-		ret = PTR_ERR(em);
-		goto out;
-	}
+	if (IS_ERR(em))
+		return PTR_ERR(em);
+out:
+	return ret;
+}
 
+static int btrfs_get_blocks_direct_write(struct extent_map **map,
+					 struct buffer_head *bh_result,
+					 struct inode *inode,
+					 struct btrfs_dio_data *dio_data,
+					 u64 start, u64 len)
+{
+	int ret = 0;
+	struct extent_map *em;
+
+	ret = btrfs_get_extent_map_write(map, bh_result, inode,
+			start, len);
+	if (ret < 0)
+		return ret;
+	em = *map;
 	len = min(len, em->len - (start - em->start));
 
-skip_cow:
 	bh_result->b_blocknr = (em->block_start + (start - em->start)) >>
 		inode->i_blkbits;
 	bh_result->b_size = len;
@@ -7590,7 +7605,6 @@ static int btrfs_get_blocks_direct_write(struct extent_map **map,
 	dio_data->reserve -= len;
 	dio_data->unsubmitted_oe_range_end = start + len;
 	current->journal_info = dio_data;
-out:
 	return ret;
 }
 
-- 
2.16.4
^ permalink raw reply related	[flat|nested] 31+ messages in thread
* [PATCH 03/18] btrfs: basic dax read
  2019-04-16 16:41 [PATCH v3 00/18] btrfs dax support Goldwyn Rodrigues
  2019-04-16 16:41 ` [PATCH 01/18] btrfs: create a mount option for dax Goldwyn Rodrigues
  2019-04-16 16:41 ` [PATCH 02/18] btrfs: Carve out btrfs_get_extent_map_write() out of btrfs_get_blocks_write() Goldwyn Rodrigues
@ 2019-04-16 16:41 ` Goldwyn Rodrigues
  2019-04-16 16:41 ` [PATCH 04/18] dax: Introduce IOMAP_DAX_COW to CoW edges during writes Goldwyn Rodrigues
                   ` (15 subsequent siblings)
  18 siblings, 0 replies; 31+ messages in thread
From: Goldwyn Rodrigues @ 2019-04-16 16:41 UTC (permalink / raw)
  To: linux-btrfs
  Cc: linux-fsdevel, jack, david, willy, hch, darrick.wong, kilobyte,
	dsterba, nborisov, linux-nvdimm, Goldwyn Rodrigues
From: Goldwyn Rodrigues <rgoldwyn@suse.com>
Perform a basic read using iomap support. The btrfs_iomap_begin()
finds the extent at the position and fills the iomap data
structure with the values.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
 fs/btrfs/Makefile |  1 +
 fs/btrfs/ctree.h  |  5 +++++
 fs/btrfs/dax.c    | 49 +++++++++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/file.c   | 11 ++++++++++-
 4 files changed, 65 insertions(+), 1 deletion(-)
 create mode 100644 fs/btrfs/dax.c
diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
index ca693dd554e9..1fa77b875ae9 100644
--- a/fs/btrfs/Makefile
+++ b/fs/btrfs/Makefile
@@ -12,6 +12,7 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o root-tree.o dir-item.o \
 	   reada.o backref.o ulist.o qgroup.o send.o dev-replace.o raid56.o \
 	   uuid-tree.o props.o free-space-tree.o tree-checker.o
 
+btrfs-$(CONFIG_FS_DAX) += dax.o
 btrfs-$(CONFIG_BTRFS_FS_POSIX_ACL) += acl.o
 btrfs-$(CONFIG_BTRFS_FS_CHECK_INTEGRITY) += check-integrity.o
 btrfs-$(CONFIG_BTRFS_FS_REF_VERIFY) += ref-verify.o
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 9512f49262dd..b7bbe5130a3b 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3795,6 +3795,11 @@ int btrfs_reada_wait(void *handle);
 void btrfs_reada_detach(void *handle);
 int btree_readahead_hook(struct extent_buffer *eb, int err);
 
+#ifdef CONFIG_FS_DAX
+/* dax.c */
+ssize_t btrfs_file_dax_read(struct kiocb *iocb, struct iov_iter *to);
+#endif /* CONFIG_FS_DAX */
+
 static inline int is_fstree(u64 rootid)
 {
 	if (rootid == BTRFS_FS_TREE_OBJECTID ||
diff --git a/fs/btrfs/dax.c b/fs/btrfs/dax.c
new file mode 100644
index 000000000000..bf3d46b0acb6
--- /dev/null
+++ b/fs/btrfs/dax.c
@@ -0,0 +1,49 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * DAX support for BTRFS
+ *
+ * Copyright (c) 2019  SUSE Linux
+ * Author: Goldwyn Rodrigues <rgoldwyn@suse.com>
+ */
+
+#ifdef CONFIG_FS_DAX
+#include <linux/dax.h>
+#include <linux/iomap.h>
+#include "ctree.h"
+#include "btrfs_inode.h"
+
+static int btrfs_iomap_begin(struct inode *inode, loff_t pos,
+		loff_t length, unsigned flags, struct iomap *iomap)
+{
+	struct extent_map *em;
+	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
+	em = btrfs_get_extent(BTRFS_I(inode), NULL, 0, pos, length, 0);
+	if (em->block_start == EXTENT_MAP_HOLE) {
+		iomap->type = IOMAP_HOLE;
+		return 0;
+	}
+	iomap->type = IOMAP_MAPPED;
+	iomap->bdev = em->bdev;
+	iomap->dax_dev = fs_info->dax_dev;
+	iomap->offset = em->start;
+	iomap->length = em->len;
+	iomap->addr = em->block_start;
+	return 0;
+}
+
+static const struct iomap_ops btrfs_iomap_ops = {
+	.iomap_begin		= btrfs_iomap_begin,
+};
+
+ssize_t btrfs_file_dax_read(struct kiocb *iocb, struct iov_iter *to)
+{
+	ssize_t ret;
+	struct inode *inode = file_inode(iocb->ki_filp);
+
+	inode_lock_shared(inode);
+	ret = dax_iomap_rw(iocb, to, &btrfs_iomap_ops);
+	inode_unlock_shared(inode);
+
+	return ret;
+}
+#endif /* CONFIG_FS_DAX */
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 34fe8a58b0e9..9194591f9eea 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -3288,9 +3288,18 @@ static int btrfs_file_open(struct inode *inode, struct file *filp)
 	return generic_file_open(inode, filp);
 }
 
+static ssize_t btrfs_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
+{
+#ifdef CONFIG_FS_DAX
+	if (IS_DAX(file_inode(iocb->ki_filp)))
+		return btrfs_file_dax_read(iocb, to);
+#endif
+	return generic_file_read_iter(iocb, to);
+}
+
 const struct file_operations btrfs_file_operations = {
 	.llseek		= btrfs_file_llseek,
-	.read_iter      = generic_file_read_iter,
+	.read_iter      = btrfs_file_read_iter,
 	.splice_read	= generic_file_splice_read,
 	.write_iter	= btrfs_file_write_iter,
 	.mmap		= btrfs_file_mmap,
-- 
2.16.4
^ permalink raw reply related	[flat|nested] 31+ messages in thread
* [PATCH 04/18] dax: Introduce IOMAP_DAX_COW to CoW edges during writes
  2019-04-16 16:41 [PATCH v3 00/18] btrfs dax support Goldwyn Rodrigues
                   ` (2 preceding siblings ...)
  2019-04-16 16:41 ` [PATCH 03/18] btrfs: basic dax read Goldwyn Rodrigues
@ 2019-04-16 16:41 ` Goldwyn Rodrigues
  2019-04-17 16:46   ` Darrick J. Wong
  2019-04-16 16:41 ` [PATCH 05/18] btrfs: return whether extent is nocow or not Goldwyn Rodrigues
                   ` (14 subsequent siblings)
  18 siblings, 1 reply; 31+ messages in thread
From: Goldwyn Rodrigues @ 2019-04-16 16:41 UTC (permalink / raw)
  To: linux-btrfs
  Cc: linux-fsdevel, jack, david, willy, hch, darrick.wong, kilobyte,
	dsterba, nborisov, linux-nvdimm, Goldwyn Rodrigues
From: Goldwyn Rodrigues <rgoldwyn@suse.com>
The IOMAP_DAX_COW is a iomap type which performs copy of
edges of data while performing a write if start/end are
not page aligned. The source address is expected in
iomap->inline_data.
dax_copy_edges() is a helper functions performs a copy from
one part of the device to another for data not page aligned.
If iomap->inline_data is NULL, it memset's the area to zero.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
 fs/dax.c              | 41 ++++++++++++++++++++++++++++++++++++++++-
 include/linux/iomap.h |  1 +
 2 files changed, 41 insertions(+), 1 deletion(-)
diff --git a/fs/dax.c b/fs/dax.c
index ca0671d55aa6..4b4ac51fbd16 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1083,6 +1083,40 @@ int __dax_zero_page_range(struct block_device *bdev,
 }
 EXPORT_SYMBOL_GPL(__dax_zero_page_range);
 
+/*
+ * dax_copy_edges - Copies the part of the pages not included in
+ * 		    the write, but required for CoW because
+ * 		    offset/offset+length are not page aligned.
+ */
+static void dax_copy_edges(struct inode *inode, loff_t pos, loff_t length,
+			   struct iomap *iomap, void *daddr)
+{
+	unsigned offset = pos & (PAGE_SIZE - 1);
+	loff_t end = pos + length;
+	loff_t pg_end = round_up(end, PAGE_SIZE);
+	void *saddr = iomap->inline_data;
+	/*
+	 * Copy the first part of the page
+	 * Note: we pass offset as length
+	 */
+	if (offset) {
+		if (saddr)
+			memcpy(daddr, saddr, offset);
+		else
+			memset(daddr, 0, offset);
+	}
+
+	/* Copy the last part of the range */
+	if (end < pg_end) {
+		if (saddr)
+			memcpy(daddr + offset + length,
+			       saddr + offset + length,	pg_end - end);
+		else
+			memset(daddr + offset + length, 0,
+					pg_end - end);
+	}
+}
+
 static loff_t
 dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 		struct iomap *iomap)
@@ -1104,9 +1138,11 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 			return iov_iter_zero(min(length, end - pos), iter);
 	}
 
-	if (WARN_ON_ONCE(iomap->type != IOMAP_MAPPED))
+	if (WARN_ON_ONCE(iomap->type != IOMAP_MAPPED
+			 && iomap->type != IOMAP_DAX_COW))
 		return -EIO;
 
+
 	/*
 	 * Write can allocate block for an area which has a hole page mapped
 	 * into page tables. We have to tear down these mappings so that data
@@ -1143,6 +1179,9 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 			break;
 		}
 
+		if (iomap->type == IOMAP_DAX_COW)
+			dax_copy_edges(inode, pos, length, iomap, kaddr);
+
 		map_len = PFN_PHYS(map_len);
 		kaddr += offset;
 		map_len -= offset;
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 0fefb5455bda..6e885c5a38a3 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -25,6 +25,7 @@ struct vm_fault;
 #define IOMAP_MAPPED	0x03	/* blocks allocated at @addr */
 #define IOMAP_UNWRITTEN	0x04	/* blocks allocated at @addr in unwritten state */
 #define IOMAP_INLINE	0x05	/* data inline in the inode */
+#define IOMAP_DAX_COW	0x06	/* Copy data pointed by inline_data before write*/
 
 /*
  * Flags for all iomap mappings:
-- 
2.16.4
^ permalink raw reply related	[flat|nested] 31+ messages in thread
* [PATCH 05/18] btrfs: return whether extent is nocow or not
  2019-04-16 16:41 [PATCH v3 00/18] btrfs dax support Goldwyn Rodrigues
                   ` (3 preceding siblings ...)
  2019-04-16 16:41 ` [PATCH 04/18] dax: Introduce IOMAP_DAX_COW to CoW edges during writes Goldwyn Rodrigues
@ 2019-04-16 16:41 ` Goldwyn Rodrigues
  2019-04-16 16:41 ` [PATCH 06/18] btrfs: Rename __endio_write_update_ordered() to btrfs_update_ordered_extent() Goldwyn Rodrigues
                   ` (13 subsequent siblings)
  18 siblings, 0 replies; 31+ messages in thread
From: Goldwyn Rodrigues @ 2019-04-16 16:41 UTC (permalink / raw)
  To: linux-btrfs
  Cc: linux-fsdevel, jack, david, willy, hch, darrick.wong, kilobyte,
	dsterba, nborisov, linux-nvdimm, Goldwyn Rodrigues
From: Goldwyn Rodrigues <rgoldwyn@suse.com>
We require this to make sure we return type IOMAP_DAX_COW in
iomap structure, in the later patches.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
 fs/btrfs/ctree.h | 2 +-
 fs/btrfs/inode.c | 9 +++++++--
 2 files changed, 8 insertions(+), 3 deletions(-)
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index b7bbe5130a3b..050f30165531 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3278,7 +3278,7 @@ struct inode *btrfs_iget_path(struct super_block *s, struct btrfs_key *location,
 struct inode *btrfs_iget(struct super_block *s, struct btrfs_key *location,
 			 struct btrfs_root *root, int *was_new);
 int btrfs_get_extent_map_write(struct extent_map **map, struct buffer_head *bh,
-		struct inode *inode, u64 start, u64 len);
+		struct inode *inode, u64 start, u64 len, bool *nocow);
 struct extent_map *btrfs_get_extent(struct btrfs_inode *inode,
 				    struct page *page, size_t pg_offset,
 				    u64 start, u64 end, int create);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 80184d0c3b52..06764d89490f 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7499,12 +7499,15 @@ static int btrfs_get_blocks_direct_read(struct extent_map *em,
 int btrfs_get_extent_map_write(struct extent_map **map,
 		struct buffer_head *bh,
 		struct inode *inode,
-		u64 start, u64 len)
+		u64 start, u64 len, bool *nocow)
 {
 	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
 	struct extent_map *em = *map;
 	int ret = 0;
 
+	if (nocow)
+		*nocow = false;
+
 	/*
 	 * We don't allocate a new extent in the following cases
 	 *
@@ -7553,6 +7556,8 @@ int btrfs_get_extent_map_write(struct extent_map **map,
 			 */
 			btrfs_free_reserved_data_space_noquota(inode, start,
 							       len);
+			if (nocow)
+				*nocow = true;
 			/* skip COW */
 			goto out;
 		}
@@ -7579,7 +7584,7 @@ static int btrfs_get_blocks_direct_write(struct extent_map **map,
 	struct extent_map *em;
 
 	ret = btrfs_get_extent_map_write(map, bh_result, inode,
-			start, len);
+			start, len, NULL);
 	if (ret < 0)
 		return ret;
 	em = *map;
-- 
2.16.4
^ permalink raw reply related	[flat|nested] 31+ messages in thread
* [PATCH 06/18] btrfs: Rename __endio_write_update_ordered() to btrfs_update_ordered_extent()
  2019-04-16 16:41 [PATCH v3 00/18] btrfs dax support Goldwyn Rodrigues
                   ` (4 preceding siblings ...)
  2019-04-16 16:41 ` [PATCH 05/18] btrfs: return whether extent is nocow or not Goldwyn Rodrigues
@ 2019-04-16 16:41 ` Goldwyn Rodrigues
  2019-04-16 16:41 ` [PATCH 07/18] btrfs: add dax write support Goldwyn Rodrigues
                   ` (12 subsequent siblings)
  18 siblings, 0 replies; 31+ messages in thread
From: Goldwyn Rodrigues @ 2019-04-16 16:41 UTC (permalink / raw)
  To: linux-btrfs
  Cc: linux-fsdevel, jack, david, willy, hch, darrick.wong, kilobyte,
	dsterba, nborisov, linux-nvdimm, Goldwyn Rodrigues
From: Goldwyn Rodrigues <rgoldwyn@suse.com>
Since we will be using it in another part of the code, use a
better name to declare it non-static
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
 fs/btrfs/ctree.h |  7 +++++--
 fs/btrfs/inode.c | 14 +++++---------
 2 files changed, 10 insertions(+), 11 deletions(-)
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 050f30165531..1e3e758b83c2 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3280,8 +3280,11 @@ struct inode *btrfs_iget(struct super_block *s, struct btrfs_key *location,
 int btrfs_get_extent_map_write(struct extent_map **map, struct buffer_head *bh,
 		struct inode *inode, u64 start, u64 len, bool *nocow);
 struct extent_map *btrfs_get_extent(struct btrfs_inode *inode,
-				    struct page *page, size_t pg_offset,
-				    u64 start, u64 end, int create);
+		struct page *page, size_t pg_offset,
+		u64 start, u64 end, int create);
+void btrfs_update_ordered_extent(struct inode *inode,
+		const u64 offset, const u64 bytes,
+		const bool uptodate);
 int btrfs_update_inode(struct btrfs_trans_handle *trans,
 			      struct btrfs_root *root,
 			      struct inode *inode);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 06764d89490f..7fc778eb6bd2 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -98,10 +98,6 @@ static struct extent_map *create_io_em(struct inode *inode, u64 start, u64 len,
 				       u64 ram_bytes, int compress_type,
 				       int type);
 
-static void __endio_write_update_ordered(struct inode *inode,
-					 const u64 offset, const u64 bytes,
-					 const bool uptodate);
-
 /*
  * Cleanup all submitted ordered extents in specified range to handle errors
  * from the btrfs_run_delalloc_range() callback.
@@ -142,7 +138,7 @@ static inline void btrfs_cleanup_ordered_extents(struct inode *inode,
 		bytes -= PAGE_SIZE;
 	}
 
-	return __endio_write_update_ordered(inode, offset, bytes, false);
+	return btrfs_update_ordered_extent(inode, offset, bytes, false);
 }
 
 static int btrfs_dirty_inode(struct inode *inode);
@@ -8085,7 +8081,7 @@ static void btrfs_endio_direct_read(struct bio *bio)
 	bio_put(bio);
 }
 
-static void __endio_write_update_ordered(struct inode *inode,
+void btrfs_update_ordered_extent(struct inode *inode,
 					 const u64 offset, const u64 bytes,
 					 const bool uptodate)
 {
@@ -8138,7 +8134,7 @@ static void btrfs_endio_direct_write(struct bio *bio)
 	struct btrfs_dio_private *dip = bio->bi_private;
 	struct bio *dio_bio = dip->dio_bio;
 
-	__endio_write_update_ordered(dip->inode, dip->logical_offset,
+	btrfs_update_ordered_extent(dip->inode, dip->logical_offset,
 				     dip->bytes, !bio->bi_status);
 
 	kfree(dip);
@@ -8457,7 +8453,7 @@ static void btrfs_submit_direct(struct bio *dio_bio, struct inode *inode,
 		bio = NULL;
 	} else {
 		if (write)
-			__endio_write_update_ordered(inode,
+			btrfs_update_ordered_extent(inode,
 						file_offset,
 						dio_bio->bi_iter.bi_size,
 						false);
@@ -8597,7 +8593,7 @@ static ssize_t btrfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 			 */
 			if (dio_data.unsubmitted_oe_range_start <
 			    dio_data.unsubmitted_oe_range_end)
-				__endio_write_update_ordered(inode,
+				btrfs_update_ordered_extent(inode,
 					dio_data.unsubmitted_oe_range_start,
 					dio_data.unsubmitted_oe_range_end -
 					dio_data.unsubmitted_oe_range_start,
-- 
2.16.4
^ permalink raw reply related	[flat|nested] 31+ messages in thread
* [PATCH 07/18] btrfs: add dax write support
  2019-04-16 16:41 [PATCH v3 00/18] btrfs dax support Goldwyn Rodrigues
                   ` (5 preceding siblings ...)
  2019-04-16 16:41 ` [PATCH 06/18] btrfs: Rename __endio_write_update_ordered() to btrfs_update_ordered_extent() Goldwyn Rodrigues
@ 2019-04-16 16:41 ` Goldwyn Rodrigues
  2019-04-16 16:41 ` [PATCH 08/18] dax: memcpy page in case of IOMAP_DAX_COW for mmap faults Goldwyn Rodrigues
                   ` (11 subsequent siblings)
  18 siblings, 0 replies; 31+ messages in thread
From: Goldwyn Rodrigues @ 2019-04-16 16:41 UTC (permalink / raw)
  To: linux-btrfs
  Cc: linux-fsdevel, jack, david, willy, hch, darrick.wong, kilobyte,
	dsterba, nborisov, linux-nvdimm, Goldwyn Rodrigues
From: Goldwyn Rodrigues <rgoldwyn@suse.com>
IOMAP_DAX_COW allows to inform the dax code, to first perform
a copy which are not page-aligned before performing the write.
The responsibility of checking if data edges are page aligned
is performed in ->iomap_begin() and the source address is
stored in ->inline_data
A new struct btrfs_iomap is passed from iomap_begin() to
iomap_end(), which contains all the accounting and locking information
for CoW based writes.
For writing to a hole, iomap->inline_data is set to zero.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
 fs/btrfs/ctree.h |   6 ++
 fs/btrfs/dax.c   | 182 +++++++++++++++++++++++++++++++++++++++++++++++++++++--
 fs/btrfs/file.c  |   4 +-
 3 files changed, 185 insertions(+), 7 deletions(-)
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 1e3e758b83c2..eec01eb92f33 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3801,6 +3801,12 @@ int btree_readahead_hook(struct extent_buffer *eb, int err);
 #ifdef CONFIG_FS_DAX
 /* dax.c */
 ssize_t btrfs_file_dax_read(struct kiocb *iocb, struct iov_iter *to);
+ssize_t btrfs_file_dax_write(struct kiocb *iocb, struct iov_iter *from);
+#else
+static inline ssize_t btrfs_file_dax_write(struct kiocb *iocb, struct iov_iter *from)
+{
+	return 0;
+}
 #endif /* CONFIG_FS_DAX */
 
 static inline int is_fstree(u64 rootid)
diff --git a/fs/btrfs/dax.c b/fs/btrfs/dax.c
index bf3d46b0acb6..f5cc9bcdbf14 100644
--- a/fs/btrfs/dax.c
+++ b/fs/btrfs/dax.c
@@ -9,30 +9,184 @@
 #ifdef CONFIG_FS_DAX
 #include <linux/dax.h>
 #include <linux/iomap.h>
+#include <linux/uio.h>
 #include "ctree.h"
 #include "btrfs_inode.h"
 
+struct btrfs_iomap {
+	u64 start;
+	u64 end;
+	bool nocow;
+	struct extent_changeset *data_reserved;
+	struct extent_state *cached_state;
+};
+
+static struct btrfs_iomap *btrfs_iomap_init(struct inode *inode,
+				     struct extent_map **em,
+				     loff_t pos, loff_t length)
+{
+	int ret = 0;
+	struct extent_map *map = *em;
+	struct btrfs_iomap *bi;
+
+	bi = kzalloc(sizeof(struct btrfs_iomap), GFP_NOFS);
+	if (!bi)
+		return ERR_PTR(-ENOMEM);
+
+	bi->start = round_down(pos, PAGE_SIZE);
+	bi->end = PAGE_ALIGN(pos + length);
+
+	/* Wait for existing ordered extents in range to finish */
+	btrfs_wait_ordered_range(inode, bi->start, bi->end - bi->start);
+
+	lock_extent_bits(&BTRFS_I(inode)->io_tree, bi->start, bi->end, &bi->cached_state);
+
+	ret = btrfs_delalloc_reserve_space(inode, &bi->data_reserved,
+			bi->start, bi->end - bi->start);
+	if (ret) {
+		unlock_extent_cached(&BTRFS_I(inode)->io_tree, bi->start, bi->end,
+				&bi->cached_state);
+		kfree(bi);
+		return ERR_PTR(ret);
+	}
+
+	refcount_inc(&map->refs);
+	ret = btrfs_get_extent_map_write(em, NULL,
+			inode, bi->start, bi->end - bi->start, &bi->nocow);
+	if (ret) {
+		unlock_extent_cached(&BTRFS_I(inode)->io_tree, bi->start, bi->end,
+				&bi->cached_state);
+		btrfs_delalloc_release_space(inode,
+				bi->data_reserved, bi->start,
+				bi->end - bi->start, true);
+		extent_changeset_free(bi->data_reserved);
+		kfree(bi);
+		return ERR_PTR(ret);
+	}
+	free_extent_map(map);
+	return bi;
+}
+
+static void *dax_address(struct block_device *bdev, struct dax_device *dax_dev,
+			 sector_t sector, loff_t pos, loff_t length)
+{
+	size_t size = ALIGN(pos + length, PAGE_SIZE);
+	int id, ret = 0;
+	void *kaddr = NULL;
+	pgoff_t pgoff;
+	long map_len;
+
+	id = dax_read_lock();
+
+	ret = bdev_dax_pgoff(bdev, sector, size, &pgoff);
+	if (ret)
+		goto out;
+
+	map_len = dax_direct_access(dax_dev, pgoff, PHYS_PFN(size),
+			&kaddr, NULL);
+	if (map_len < 0)
+		ret = map_len;
+
+out:
+	dax_read_unlock(id);
+	if (ret)
+		return ERR_PTR(ret);
+	return kaddr;
+}
+
 static int btrfs_iomap_begin(struct inode *inode, loff_t pos,
 		loff_t length, unsigned flags, struct iomap *iomap)
 {
 	struct extent_map *em;
 	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
+	struct btrfs_iomap *bi = NULL;
+	unsigned offset = pos & (PAGE_SIZE - 1);
+	u64 srcblk = 0;
+	loff_t diff;
+
 	em = btrfs_get_extent(BTRFS_I(inode), NULL, 0, pos, length, 0);
+
+	iomap->type = IOMAP_MAPPED;
+
+	if (flags & IOMAP_WRITE) {
+		if (em->block_start != EXTENT_MAP_HOLE)
+			srcblk = em->block_start + pos - em->start - offset;
+
+		bi = btrfs_iomap_init(inode, &em, pos, length);
+		if (IS_ERR(bi))
+			return PTR_ERR(bi);
+
+	}
+
+	/*
+	 * Advance the difference between pos and start, to align well with
+	 * inline_data in case of writes
+	 */
+	diff = round_down(pos - em->start, PAGE_SIZE);
+	iomap->offset = em->start + diff;
+	iomap->length = em->len - diff;
+	iomap->bdev = em->bdev;
+	iomap->dax_dev = fs_info->dax_dev;
+
+	/*
+	 * This will be true for reads only since we have already
+	 * allocated em
+	 */
 	if (em->block_start == EXTENT_MAP_HOLE) {
 		iomap->type = IOMAP_HOLE;
 		return 0;
 	}
-	iomap->type = IOMAP_MAPPED;
-	iomap->bdev = em->bdev;
-	iomap->dax_dev = fs_info->dax_dev;
-	iomap->offset = em->start;
-	iomap->length = em->len;
-	iomap->addr = em->block_start;
+
+	iomap->addr = em->block_start + diff;
+	/* Check if we really need to copy data from old extent */
+	if (bi && !bi->nocow && (offset || pos + length < bi->end)) {
+		iomap->type = IOMAP_DAX_COW;
+		if (srcblk) {
+			sector_t sector = (srcblk + (pos & PAGE_MASK) -
+					  iomap->offset) >> 9;
+			iomap->inline_data = dax_address(em->bdev,
+					fs_info->dax_dev, sector, pos, length);
+			if (IS_ERR(iomap->inline_data)) {
+				kfree(bi);
+				return PTR_ERR(iomap->inline_data);
+			}
+		}
+	}
+
+	iomap->private = bi;
+	return 0;
+}
+
+static int btrfs_iomap_end(struct inode *inode, loff_t pos,
+		loff_t length, ssize_t written, unsigned flags,
+		struct iomap *iomap)
+{
+	struct btrfs_iomap *bi = iomap->private;
+	u64 wend;
+
+	if (!bi)
+		return 0;
+
+	unlock_extent_cached(&BTRFS_I(inode)->io_tree, bi->start, bi->end,
+			&bi->cached_state);
+
+	wend = PAGE_ALIGN(pos + written);
+	if (wend < bi->end) {
+		btrfs_delalloc_release_space(inode,
+				bi->data_reserved, wend,
+				bi->end - wend, true);
+	}
+
+	btrfs_update_ordered_extent(inode, bi->start, wend - bi->start, true);
+	btrfs_delalloc_release_extents(BTRFS_I(inode), wend - bi->start, false);
+	extent_changeset_free(bi->data_reserved);
+	kfree(bi);
 	return 0;
 }
 
 static const struct iomap_ops btrfs_iomap_ops = {
 	.iomap_begin		= btrfs_iomap_begin,
+	.iomap_end		= btrfs_iomap_end,
 };
 
 ssize_t btrfs_file_dax_read(struct kiocb *iocb, struct iov_iter *to)
@@ -46,4 +200,20 @@ ssize_t btrfs_file_dax_read(struct kiocb *iocb, struct iov_iter *to)
 
 	return ret;
 }
+
+ssize_t btrfs_file_dax_write(struct kiocb *iocb, struct iov_iter *iter)
+{
+	ssize_t ret = 0;
+	u64 pos = iocb->ki_pos;
+	struct inode *inode = file_inode(iocb->ki_filp);
+	ret = dax_iomap_rw(iocb, iter, &btrfs_iomap_ops);
+
+	if (ret > 0) {
+		pos += ret;
+		if (pos > i_size_read(inode))
+			i_size_write(inode, pos);
+		iocb->ki_pos = pos;
+	}
+	return ret;
+}
 #endif /* CONFIG_FS_DAX */
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 9194591f9eea..a795023e26ca 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1964,7 +1964,9 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
 	if (sync)
 		atomic_inc(&BTRFS_I(inode)->sync_writers);
 
-	if (iocb->ki_flags & IOCB_DIRECT) {
+	if (IS_DAX(inode)) {
+		num_written = btrfs_file_dax_write(iocb, from);
+	} else if (iocb->ki_flags & IOCB_DIRECT) {
 		num_written = __btrfs_direct_write(iocb, from);
 	} else {
 		num_written = btrfs_buffered_write(iocb, from);
-- 
2.16.4
^ permalink raw reply related	[flat|nested] 31+ messages in thread
* [PATCH 08/18] dax: memcpy page in case of IOMAP_DAX_COW for mmap faults
  2019-04-16 16:41 [PATCH v3 00/18] btrfs dax support Goldwyn Rodrigues
                   ` (6 preceding siblings ...)
  2019-04-16 16:41 ` [PATCH 07/18] btrfs: add dax write support Goldwyn Rodrigues
@ 2019-04-16 16:41 ` Goldwyn Rodrigues
  2019-04-17 16:52   ` Darrick J. Wong
  2019-04-16 16:41 ` [PATCH 09/18] btrfs: Add dax specific address_space_operations Goldwyn Rodrigues
                   ` (10 subsequent siblings)
  18 siblings, 1 reply; 31+ messages in thread
From: Goldwyn Rodrigues @ 2019-04-16 16:41 UTC (permalink / raw)
  To: linux-btrfs
  Cc: linux-fsdevel, jack, david, willy, hch, darrick.wong, kilobyte,
	dsterba, nborisov, linux-nvdimm, Goldwyn Rodrigues
From: Goldwyn Rodrigues <rgoldwyn@suse.com>
Change dax_iomap_pfn to return the address as well in order to
use it for performing a memcpy in case the type is IOMAP_DAX_COW.
Question:
The sequence of bdev_dax_pgoff() and dax_direct_access() is
used multiple times to calculate address and pfn's. Would it make
sense to call it while calculating address as well to reduce code?
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
 fs/dax.c | 16 ++++++++++++----
 1 file changed, 12 insertions(+), 4 deletions(-)
diff --git a/fs/dax.c b/fs/dax.c
index 4b4ac51fbd16..45fc2e18969a 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -983,7 +983,7 @@ static sector_t dax_iomap_sector(struct iomap *iomap, loff_t pos)
 }
 
 static int dax_iomap_pfn(struct iomap *iomap, loff_t pos, size_t size,
-			 pfn_t *pfnp)
+			 pfn_t *pfnp, void **addr)
 {
 	const sector_t sector = dax_iomap_sector(iomap, pos);
 	pgoff_t pgoff;
@@ -995,7 +995,7 @@ static int dax_iomap_pfn(struct iomap *iomap, loff_t pos, size_t size,
 		return rc;
 	id = dax_read_lock();
 	length = dax_direct_access(iomap->dax_dev, pgoff, PHYS_PFN(size),
-				   NULL, pfnp);
+				   addr, pfnp);
 	if (length < 0) {
 		rc = length;
 		goto out;
@@ -1280,6 +1280,7 @@ static vm_fault_t dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp,
 	XA_STATE(xas, &mapping->i_pages, vmf->pgoff);
 	struct inode *inode = mapping->host;
 	unsigned long vaddr = vmf->address;
+	void *addr;
 	loff_t pos = (loff_t)vmf->pgoff << PAGE_SHIFT;
 	struct iomap iomap = { 0 };
 	unsigned flags = IOMAP_FAULT;
@@ -1369,16 +1370,23 @@ static vm_fault_t dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp,
 	sync = dax_fault_is_synchronous(flags, vma, &iomap);
 
 	switch (iomap.type) {
+	case IOMAP_DAX_COW:
 	case IOMAP_MAPPED:
 		if (iomap.flags & IOMAP_F_NEW) {
 			count_vm_event(PGMAJFAULT);
 			count_memcg_event_mm(vma->vm_mm, PGMAJFAULT);
 			major = VM_FAULT_MAJOR;
 		}
-		error = dax_iomap_pfn(&iomap, pos, PAGE_SIZE, &pfn);
+		error = dax_iomap_pfn(&iomap, pos, PAGE_SIZE, &pfn, &addr);
 		if (error < 0)
 			goto error_finish_iomap;
 
+		if (iomap.type == IOMAP_DAX_COW) {
+			if (iomap.inline_data)
+				memcpy(addr, iomap.inline_data, PAGE_SIZE);
+			else
+				memset(addr, 0, PAGE_SIZE);
+		}
 		entry = dax_insert_entry(&xas, mapping, vmf, entry, pfn,
 						 0, write && !sync);
 
@@ -1577,7 +1585,7 @@ static vm_fault_t dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp,
 
 	switch (iomap.type) {
 	case IOMAP_MAPPED:
-		error = dax_iomap_pfn(&iomap, pos, PMD_SIZE, &pfn);
+		error = dax_iomap_pfn(&iomap, pos, PMD_SIZE, &pfn, NULL);
 		if (error < 0)
 			goto finish_iomap;
 
-- 
2.16.4
^ permalink raw reply related	[flat|nested] 31+ messages in thread
* [PATCH 09/18] btrfs: Add dax specific address_space_operations
  2019-04-16 16:41 [PATCH v3 00/18] btrfs dax support Goldwyn Rodrigues
                   ` (7 preceding siblings ...)
  2019-04-16 16:41 ` [PATCH 08/18] dax: memcpy page in case of IOMAP_DAX_COW for mmap faults Goldwyn Rodrigues
@ 2019-04-16 16:41 ` Goldwyn Rodrigues
  2019-04-16 16:41 ` [PATCH 10/18] dax: replace mmap entry in case of CoW Goldwyn Rodrigues
                   ` (9 subsequent siblings)
  18 siblings, 0 replies; 31+ messages in thread
From: Goldwyn Rodrigues @ 2019-04-16 16:41 UTC (permalink / raw)
  To: linux-btrfs
  Cc: linux-fsdevel, jack, david, willy, hch, darrick.wong, kilobyte,
	dsterba, nborisov, linux-nvdimm, Goldwyn Rodrigues
From: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
 fs/btrfs/inode.c | 34 +++++++++++++++++++++++++++++++---
 1 file changed, 31 insertions(+), 3 deletions(-)
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 7fc778eb6bd2..c2c0544009a7 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -28,6 +28,7 @@
 #include <linux/magic.h>
 #include <linux/iversion.h>
 #include <linux/swap.h>
+#include <linux/dax.h>
 #include <asm/unaligned.h>
 #include "ctree.h"
 #include "disk-io.h"
@@ -65,6 +66,7 @@ static const struct inode_operations btrfs_dir_ro_inode_operations;
 static const struct inode_operations btrfs_special_inode_operations;
 static const struct inode_operations btrfs_file_inode_operations;
 static const struct address_space_operations btrfs_aops;
+static const struct address_space_operations btrfs_dax_aops;
 static const struct file_operations btrfs_dir_file_operations;
 static const struct extent_io_ops btrfs_extent_io_ops;
 
@@ -3757,7 +3759,10 @@ static int btrfs_read_locked_inode(struct inode *inode,
 
 	switch (inode->i_mode & S_IFMT) {
 	case S_IFREG:
-		inode->i_mapping->a_ops = &btrfs_aops;
+		if (btrfs_test_opt(fs_info, DAX))
+			inode->i_mapping->a_ops = &btrfs_dax_aops;
+		else
+			inode->i_mapping->a_ops = &btrfs_aops;
 		BTRFS_I(inode)->io_tree.ops = &btrfs_extent_io_ops;
 		inode->i_fop = &btrfs_file_operations;
 		inode->i_op = &btrfs_file_inode_operations;
@@ -3778,6 +3783,7 @@ static int btrfs_read_locked_inode(struct inode *inode,
 	}
 
 	btrfs_sync_inode_flags_to_i_flags(inode);
+
 	return 0;
 }
 
@@ -6538,7 +6544,10 @@ static int btrfs_create(struct inode *dir, struct dentry *dentry,
 	*/
 	inode->i_fop = &btrfs_file_operations;
 	inode->i_op = &btrfs_file_inode_operations;
-	inode->i_mapping->a_ops = &btrfs_aops;
+	if (IS_DAX(inode) && S_ISREG(mode))
+		inode->i_mapping->a_ops = &btrfs_dax_aops;
+	else
+		inode->i_mapping->a_ops = &btrfs_aops;
 
 	err = btrfs_init_inode_security(trans, inode, dir, &dentry->d_name);
 	if (err)
@@ -8665,6 +8674,15 @@ static int btrfs_writepages(struct address_space *mapping,
 	return extent_writepages(mapping, wbc);
 }
 
+static int btrfs_dax_writepages(struct address_space *mapping,
+			    struct writeback_control *wbc)
+{
+	struct inode *inode = mapping->host;
+	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
+	return dax_writeback_mapping_range(mapping, fs_info->fs_devices->latest_bdev,
+			wbc);
+}
+
 static int
 btrfs_readpages(struct file *file, struct address_space *mapping,
 		struct list_head *pages, unsigned nr_pages)
@@ -10436,7 +10454,10 @@ static int btrfs_tmpfile(struct inode *dir, struct dentry *dentry, umode_t mode)
 	inode->i_fop = &btrfs_file_operations;
 	inode->i_op = &btrfs_file_inode_operations;
 
-	inode->i_mapping->a_ops = &btrfs_aops;
+	if (IS_DAX(inode))
+		inode->i_mapping->a_ops = &btrfs_dax_aops;
+	else
+		inode->i_mapping->a_ops = &btrfs_aops;
 	BTRFS_I(inode)->io_tree.ops = &btrfs_extent_io_ops;
 
 	ret = btrfs_init_inode_security(trans, inode, dir, NULL);
@@ -10892,6 +10913,13 @@ static const struct address_space_operations btrfs_aops = {
 	.swap_deactivate = btrfs_swap_deactivate,
 };
 
+static const struct address_space_operations btrfs_dax_aops = {
+	.writepages             = btrfs_dax_writepages,
+	.direct_IO              = noop_direct_IO,
+	.set_page_dirty         = noop_set_page_dirty,
+	.invalidatepage         = noop_invalidatepage,
+};
+
 static const struct inode_operations btrfs_file_inode_operations = {
 	.getattr	= btrfs_getattr,
 	.setattr	= btrfs_setattr,
-- 
2.16.4
^ permalink raw reply related	[flat|nested] 31+ messages in thread
* [PATCH 10/18] dax: replace mmap entry in case of CoW
  2019-04-16 16:41 [PATCH v3 00/18] btrfs dax support Goldwyn Rodrigues
                   ` (8 preceding siblings ...)
  2019-04-16 16:41 ` [PATCH 09/18] btrfs: Add dax specific address_space_operations Goldwyn Rodrigues
@ 2019-04-16 16:41 ` Goldwyn Rodrigues
  2019-04-17 15:24   ` Darrick J. Wong
  2019-04-16 16:41 ` [PATCH 11/18] btrfs: add dax mmap support Goldwyn Rodrigues
                   ` (8 subsequent siblings)
  18 siblings, 1 reply; 31+ messages in thread
From: Goldwyn Rodrigues @ 2019-04-16 16:41 UTC (permalink / raw)
  To: linux-btrfs
  Cc: linux-fsdevel, jack, david, willy, hch, darrick.wong, kilobyte,
	dsterba, nborisov, linux-nvdimm, Goldwyn Rodrigues
From: Goldwyn Rodrigues <rgoldwyn@suse.com>
We replace the existing entry to the newly allocated one
in case of CoW. Also, we mark the entry as PAGECACHE_TAG_TOWRITE
so writeback marks this entry as writeprotected. This
helps us snapshots so new write pagefaults after snapshots
trigger a CoW.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
 fs/dax.c | 23 +++++++++++++++--------
 1 file changed, 15 insertions(+), 8 deletions(-)
diff --git a/fs/dax.c b/fs/dax.c
index 45fc2e18969a..d5100cbe8bd2 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -708,14 +708,15 @@ static int copy_user_dax(struct block_device *bdev, struct dax_device *dax_dev,
  */
 static void *dax_insert_entry(struct xa_state *xas,
 		struct address_space *mapping, struct vm_fault *vmf,
-		void *entry, pfn_t pfn, unsigned long flags, bool dirty)
+		void *entry, pfn_t pfn, unsigned long flags, bool dirty,
+		bool cow)
 {
 	void *new_entry = dax_make_entry(pfn, flags);
 
 	if (dirty)
 		__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
 
-	if (dax_is_zero_entry(entry) && !(flags & DAX_ZERO_PAGE)) {
+	if (cow || (dax_is_zero_entry(entry) && !(flags & DAX_ZERO_PAGE))) {
 		unsigned long index = xas->xa_index;
 		/* we are replacing a zero page with block mapping */
 		if (dax_is_pmd_entry(entry))
@@ -727,12 +728,12 @@ static void *dax_insert_entry(struct xa_state *xas,
 
 	xas_reset(xas);
 	xas_lock_irq(xas);
-	if (dax_entry_size(entry) != dax_entry_size(new_entry)) {
+	if (cow || (dax_entry_size(entry) != dax_entry_size(new_entry))) {
 		dax_disassociate_entry(entry, mapping, false);
 		dax_associate_entry(new_entry, mapping, vmf->vma, vmf->address);
 	}
 
-	if (dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) {
+	if (cow || dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) {
 		/*
 		 * Only swap our new entry into the page cache if the current
 		 * entry is a zero page or an empty entry.  If a normal PTE or
@@ -752,6 +753,9 @@ static void *dax_insert_entry(struct xa_state *xas,
 	if (dirty)
 		xas_set_mark(xas, PAGECACHE_TAG_DIRTY);
 
+	if (cow)
+		xas_set_mark(xas, PAGECACHE_TAG_TOWRITE);
+
 	xas_unlock_irq(xas);
 	return entry;
 }
@@ -1031,7 +1035,7 @@ static vm_fault_t dax_load_hole(struct xa_state *xas,
 	vm_fault_t ret;
 
 	*entry = dax_insert_entry(xas, mapping, vmf, *entry, pfn,
-			DAX_ZERO_PAGE, false);
+			DAX_ZERO_PAGE, false, false);
 
 	ret = vmf_insert_mixed(vmf->vma, vaddr, pfn);
 	trace_dax_load_hole(inode, vmf, ret);
@@ -1388,7 +1392,8 @@ static vm_fault_t dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp,
 				memset(addr, 0, PAGE_SIZE);
 		}
 		entry = dax_insert_entry(&xas, mapping, vmf, entry, pfn,
-						 0, write && !sync);
+						 0, write && !sync,
+						 iomap.type == IOMAP_DAX_COW);
 
 		/*
 		 * If we are doing synchronous page fault and inode needs fsync,
@@ -1467,7 +1472,8 @@ static vm_fault_t dax_pmd_load_hole(struct xa_state *xas, struct vm_fault *vmf,
 
 	pfn = page_to_pfn_t(zero_page);
 	*entry = dax_insert_entry(xas, mapping, vmf, *entry, pfn,
-			DAX_PMD | DAX_ZERO_PAGE, false);
+			DAX_PMD | DAX_ZERO_PAGE, false,
+			iomap->type == IOMAP_DAX_COW);
 
 	ptl = pmd_lock(vmf->vma->vm_mm, vmf->pmd);
 	if (!pmd_none(*(vmf->pmd))) {
@@ -1590,7 +1596,8 @@ static vm_fault_t dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp,
 			goto finish_iomap;
 
 		entry = dax_insert_entry(&xas, mapping, vmf, entry, pfn,
-						DAX_PMD, write && !sync);
+						DAX_PMD, write && !sync,
+						false);
 
 		/*
 		 * If we are doing synchronous page fault and inode needs fsync,
-- 
2.16.4
^ permalink raw reply related	[flat|nested] 31+ messages in thread
* [PATCH 11/18] btrfs: add dax mmap support
  2019-04-16 16:41 [PATCH v3 00/18] btrfs dax support Goldwyn Rodrigues
                   ` (9 preceding siblings ...)
  2019-04-16 16:41 ` [PATCH 10/18] dax: replace mmap entry in case of CoW Goldwyn Rodrigues
@ 2019-04-16 16:41 ` Goldwyn Rodrigues
  2019-04-16 16:41 ` [PATCH 12/18] btrfs: allow MAP_SYNC mmap Goldwyn Rodrigues
                   ` (7 subsequent siblings)
  18 siblings, 0 replies; 31+ messages in thread
From: Goldwyn Rodrigues @ 2019-04-16 16:41 UTC (permalink / raw)
  To: linux-btrfs
  Cc: linux-fsdevel, jack, david, willy, hch, darrick.wong, kilobyte,
	dsterba, nborisov, linux-nvdimm, Goldwyn Rodrigues
From: Goldwyn Rodrigues <rgoldwyn@suse.com>
Add a new vm_operations struct btrfs_dax_vm_ops
specifically for dax files.
Since we will be removing(nulling) readpages/writepages for dax
return ENOEXEC only for non-dax files.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
 fs/btrfs/ctree.h |  1 +
 fs/btrfs/dax.c   | 13 ++++++++++++-
 fs/btrfs/file.c  | 18 ++++++++++++++++--
 3 files changed, 29 insertions(+), 3 deletions(-)
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index eec01eb92f33..2b7bdabb44f8 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3802,6 +3802,7 @@ int btree_readahead_hook(struct extent_buffer *eb, int err);
 /* dax.c */
 ssize_t btrfs_file_dax_read(struct kiocb *iocb, struct iov_iter *to);
 ssize_t btrfs_file_dax_write(struct kiocb *iocb, struct iov_iter *from);
+vm_fault_t btrfs_dax_fault(struct vm_fault *vmf);
 #else
 static inline ssize_t btrfs_file_dax_write(struct kiocb *iocb, struct iov_iter *from)
 {
diff --git a/fs/btrfs/dax.c b/fs/btrfs/dax.c
index f5cc9bcdbf14..de957d681e16 100644
--- a/fs/btrfs/dax.c
+++ b/fs/btrfs/dax.c
@@ -139,7 +139,7 @@ static int btrfs_iomap_begin(struct inode *inode, loff_t pos,
 
 	iomap->addr = em->block_start + diff;
 	/* Check if we really need to copy data from old extent */
-	if (bi && !bi->nocow && (offset || pos + length < bi->end)) {
+	if (bi && !bi->nocow && (offset || pos + length < bi->end || flags & IOMAP_FAULT)) {
 		iomap->type = IOMAP_DAX_COW;
 		if (srcblk) {
 			sector_t sector = (srcblk + (pos & PAGE_MASK) -
@@ -216,4 +216,15 @@ ssize_t btrfs_file_dax_write(struct kiocb *iocb, struct iov_iter *iter)
 	}
 	return ret;
 }
+
+vm_fault_t btrfs_dax_fault(struct vm_fault *vmf)
+{
+	vm_fault_t ret;
+	pfn_t pfn;
+	ret = dax_iomap_fault(vmf, PE_SIZE_PTE, &pfn, NULL, &btrfs_iomap_ops);
+	if (ret & VM_FAULT_NEEDDSYNC)
+		ret = dax_finish_sync_fault(vmf, PE_SIZE_PTE, pfn);
+
+	return ret;
+}
 #endif /* CONFIG_FS_DAX */
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index a795023e26ca..9d5a3c99a6b9 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -2214,15 +2214,29 @@ static const struct vm_operations_struct btrfs_file_vm_ops = {
 	.page_mkwrite	= btrfs_page_mkwrite,
 };
 
+#ifdef CONFIG_FS_DAX
+static const struct vm_operations_struct btrfs_dax_vm_ops = {
+	.fault          = btrfs_dax_fault,
+	.page_mkwrite   = btrfs_dax_fault,
+	.pfn_mkwrite    = btrfs_dax_fault,
+};
+#else
+#define btrfs_dax_vm_ops btrfs_file_vm_ops
+#endif
+
 static int btrfs_file_mmap(struct file	*filp, struct vm_area_struct *vma)
 {
 	struct address_space *mapping = filp->f_mapping;
+	struct inode *inode = file_inode(filp);
 
-	if (!mapping->a_ops->readpage)
+	if (!IS_DAX(inode) && !mapping->a_ops->readpage)
 		return -ENOEXEC;
 
 	file_accessed(filp);
-	vma->vm_ops = &btrfs_file_vm_ops;
+	if (IS_DAX(inode))
+		vma->vm_ops = &btrfs_dax_vm_ops;
+	else
+		vma->vm_ops = &btrfs_file_vm_ops;
 
 	return 0;
 }
-- 
2.16.4
^ permalink raw reply related	[flat|nested] 31+ messages in thread
* [PATCH 12/18] btrfs: allow MAP_SYNC mmap
  2019-04-16 16:41 [PATCH v3 00/18] btrfs dax support Goldwyn Rodrigues
                   ` (10 preceding siblings ...)
  2019-04-16 16:41 ` [PATCH 11/18] btrfs: add dax mmap support Goldwyn Rodrigues
@ 2019-04-16 16:41 ` Goldwyn Rodrigues
  2019-04-16 16:41 ` [PATCH 13/18] fs: dedup file range to use a compare function Goldwyn Rodrigues
                   ` (6 subsequent siblings)
  18 siblings, 0 replies; 31+ messages in thread
From: Goldwyn Rodrigues @ 2019-04-16 16:41 UTC (permalink / raw)
  To: linux-btrfs
  Cc: linux-fsdevel, jack, david, willy, hch, darrick.wong, kilobyte,
	dsterba, nborisov, linux-nvdimm, Goldwyn Rodrigues
From: Adam Borowski <kilobyte@angband.pl>
Used by userspace to detect DAX.
[rgoldwyn@suse.com: Added CONFIG_FS_DAX around mmap_supported_flags]
Signed-off-by: Adam Borowski <kilobyte@angband.pl>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
 fs/btrfs/file.c | 4 ++++
 1 file changed, 4 insertions(+)
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 9d5a3c99a6b9..362a9cf9dcb2 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -16,6 +16,7 @@
 #include <linux/btrfs.h>
 #include <linux/uio.h>
 #include <linux/iversion.h>
+#include <linux/mman.h>
 #include "ctree.h"
 #include "disk-io.h"
 #include "transaction.h"
@@ -3319,6 +3320,9 @@ const struct file_operations btrfs_file_operations = {
 	.splice_read	= generic_file_splice_read,
 	.write_iter	= btrfs_file_write_iter,
 	.mmap		= btrfs_file_mmap,
+#ifdef CONFIG_FS_DAX
+	.mmap_supported_flags = MAP_SYNC,
+#endif
 	.open		= btrfs_file_open,
 	.release	= btrfs_release_file,
 	.fsync		= btrfs_sync_file,
-- 
2.16.4
^ permalink raw reply related	[flat|nested] 31+ messages in thread
* [PATCH 13/18] fs: dedup file range to use a compare function
  2019-04-16 16:41 [PATCH v3 00/18] btrfs dax support Goldwyn Rodrigues
                   ` (11 preceding siblings ...)
  2019-04-16 16:41 ` [PATCH 12/18] btrfs: allow MAP_SYNC mmap Goldwyn Rodrigues
@ 2019-04-16 16:41 ` Goldwyn Rodrigues
  2019-04-17 15:36   ` Darrick J. Wong
  2019-04-16 16:41 ` [PATCH 14/18] dax: memcpy before zeroing range Goldwyn Rodrigues
                   ` (5 subsequent siblings)
  18 siblings, 1 reply; 31+ messages in thread
From: Goldwyn Rodrigues @ 2019-04-16 16:41 UTC (permalink / raw)
  To: linux-btrfs
  Cc: linux-fsdevel, jack, david, willy, hch, darrick.wong, kilobyte,
	dsterba, nborisov, linux-nvdimm, Goldwyn Rodrigues
From: Goldwyn Rodrigues <rgoldwyn@suse.com>
With dax we cannot deal with readpage() etc. So, we create a
funciton callback to perform the file data comparison and pass
it to generic_remap_file_range_prep() so it can use iomap-based
functions.
This may not be the best way to solve this. Suggestions welcome.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
 fs/btrfs/ctree.h     |  9 ++++++++
 fs/btrfs/dax.c       |  7 +++++++
 fs/btrfs/ioctl.c     | 11 ++++++++--
 fs/dax.c             | 58 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/ocfs2/file.c      |  2 +-
 fs/read_write.c      | 10 ++++-----
 fs/xfs/xfs_reflink.c |  2 +-
 include/linux/dax.h  |  2 ++
 include/linux/fs.h   |  7 ++++++-
 9 files changed, 98 insertions(+), 10 deletions(-)
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 2b7bdabb44f8..d3d044125619 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3803,11 +3803,20 @@ int btree_readahead_hook(struct extent_buffer *eb, int err);
 ssize_t btrfs_file_dax_read(struct kiocb *iocb, struct iov_iter *to);
 ssize_t btrfs_file_dax_write(struct kiocb *iocb, struct iov_iter *from);
 vm_fault_t btrfs_dax_fault(struct vm_fault *vmf);
+int btrfs_dax_file_range_compare(struct inode *src, loff_t srcoff,
+		struct inode *dest, loff_t destoff, loff_t len,
+		bool *is_same);
 #else
 static inline ssize_t btrfs_file_dax_write(struct kiocb *iocb, struct iov_iter *from)
 {
 	return 0;
 }
+static inline int btrfs_dax_file_range_compare(struct inode *src, loff_t srcoff,
+		struct inode *dest, loff_t destoff, loff_t len,
+		bool *is_same)
+{
+	return 0;
+}
 #endif /* CONFIG_FS_DAX */
 
 static inline int is_fstree(u64 rootid)
diff --git a/fs/btrfs/dax.c b/fs/btrfs/dax.c
index de957d681e16..a29628b403b3 100644
--- a/fs/btrfs/dax.c
+++ b/fs/btrfs/dax.c
@@ -227,4 +227,11 @@ vm_fault_t btrfs_dax_fault(struct vm_fault *vmf)
 
 	return ret;
 }
+
+int btrfs_dax_file_range_compare(struct inode *src, loff_t srcoff,
+		struct inode *dest, loff_t destoff, loff_t len,
+		bool *is_same)
+{
+	return dax_file_range_compare(src, srcoff, dest, destoff, len, is_same, &btrfs_iomap_ops);
+}
 #endif /* CONFIG_FS_DAX */
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 0138119cd9a3..cd590105bd78 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -4000,8 +4000,15 @@ static int btrfs_remap_file_range_prep(struct file *file_in, loff_t pos_in,
 	if (ret < 0)
 		goto out_unlock;
 
-	ret = generic_remap_file_range_prep(file_in, pos_in, file_out, pos_out,
-					    len, remap_flags);
+	if (IS_DAX(file_inode(file_in)) && IS_DAX(file_inode(file_out)))
+		ret = generic_remap_file_range_prep(file_in, pos_in, file_out,
+				pos_out, len, remap_flags,
+				btrfs_dax_file_range_compare);
+	else
+		ret = generic_remap_file_range_prep(file_in, pos_in, file_out,
+				pos_out, len, remap_flags,
+				vfs_dedupe_file_range_compare);
+
 	if (ret < 0 || *len == 0)
 		goto out_unlock;
 
diff --git a/fs/dax.c b/fs/dax.c
index d5100cbe8bd2..abbe4a79f219 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1759,3 +1759,61 @@ vm_fault_t dax_finish_sync_fault(struct vm_fault *vmf,
 	return dax_insert_pfn_mkwrite(vmf, pfn, order);
 }
 EXPORT_SYMBOL_GPL(dax_finish_sync_fault);
+
+
+int dax_file_range_compare(struct inode *src, loff_t srcoff, struct inode *dest,
+		loff_t destoff, loff_t len, bool *is_same, const struct iomap_ops *ops)
+{
+	void *saddr, *daddr;
+	struct iomap s_iomap = {0};
+	struct iomap d_iomap = {0};
+	loff_t dstart, sstart;
+	bool same = true;
+	loff_t cmp_len, l;
+	int id, ret = 0;
+
+	id = dax_read_lock();
+	while (len) {
+		ret = ops->iomap_begin(src, srcoff, len, 0, &s_iomap);
+		if (ret < 0) {
+			if (ops->iomap_end)
+				ops->iomap_end(src, srcoff, len, ret, 0, &s_iomap);
+			return ret;
+		}
+		cmp_len = len;
+		if (cmp_len > s_iomap.offset + s_iomap.length - srcoff)
+			cmp_len = s_iomap.offset + s_iomap.length - srcoff;
+
+		ret = ops->iomap_begin(dest, destoff, cmp_len, 0, &d_iomap);
+		if (ret < 0) {
+			if (ops->iomap_end) {
+				ops->iomap_end(src, srcoff, len, ret, 0, &s_iomap);
+				ops->iomap_end(dest, destoff, len, ret, 0, &d_iomap);
+			}
+			return ret;
+		}
+		if (cmp_len > d_iomap.offset + d_iomap.length - destoff)
+			cmp_len = d_iomap.offset + d_iomap.length - destoff;
+
+
+		sstart = (get_start_sect(s_iomap.bdev) << 9) + s_iomap.addr + (srcoff - s_iomap.offset);
+		l = dax_direct_access(s_iomap.dax_dev, PHYS_PFN(sstart), PHYS_PFN(cmp_len), &saddr, NULL);
+		dstart = (get_start_sect(d_iomap.bdev) << 9) + d_iomap.addr + (destoff - d_iomap.offset);
+		l = dax_direct_access(d_iomap.dax_dev, PHYS_PFN(dstart), PHYS_PFN(cmp_len), &daddr, NULL);
+		same = !memcmp(saddr, daddr, cmp_len);
+		if (!same)
+			break;
+		len -= cmp_len;
+		srcoff += cmp_len;
+		destoff += cmp_len;
+
+		if (ops->iomap_end) {
+			ret = ops->iomap_end(src, srcoff, len, 0, 0, &s_iomap);
+			ret = ops->iomap_end(dest, destoff, len, 0, 0, &d_iomap);
+		}
+	}
+	dax_read_unlock(id);
+	*is_same = same;
+	return ret;
+}
+EXPORT_SYMBOL_GPL(dax_file_range_compare);
diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
index d640c5f8a85d..9d470306cfc3 100644
--- a/fs/ocfs2/file.c
+++ b/fs/ocfs2/file.c
@@ -2558,7 +2558,7 @@ static loff_t ocfs2_remap_file_range(struct file *file_in, loff_t pos_in,
 		goto out_unlock;
 
 	ret = generic_remap_file_range_prep(file_in, pos_in, file_out, pos_out,
-			&len, remap_flags);
+			&len, remap_flags, vfs_dedupe_file_range_compare);
 	if (ret < 0 || len == 0)
 		goto out_unlock;
 
diff --git a/fs/read_write.c b/fs/read_write.c
index 61b43ad7608e..ecc67740d0ff 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1778,7 +1778,7 @@ static struct page *vfs_dedupe_get_page(struct inode *inode, loff_t offset)
  * Compare extents of two files to see if they are the same.
  * Caller must have locked both inodes to prevent write races.
  */
-static int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
+int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
 					 struct inode *dest, loff_t destoff,
 					 loff_t len, bool *is_same)
 {
@@ -1845,6 +1845,7 @@ static int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
 out_error:
 	return error;
 }
+EXPORT_SYMBOL(vfs_dedupe_file_range_compare);
 
 /*
  * Check that the two inodes are eligible for cloning, the ranges make
@@ -1856,7 +1857,7 @@ static int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
  */
 int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
 				  struct file *file_out, loff_t pos_out,
-				  loff_t *len, unsigned int remap_flags)
+				  loff_t *len, unsigned int remap_flags, compare_range_t compare)
 {
 	struct inode *inode_in = file_inode(file_in);
 	struct inode *inode_out = file_inode(file_out);
@@ -1915,9 +1916,8 @@ int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
 	 */
 	if (remap_flags & REMAP_FILE_DEDUP) {
 		bool		is_same = false;
-
-		ret = vfs_dedupe_file_range_compare(inode_in, pos_in,
-				inode_out, pos_out, *len, &is_same);
+		ret = compare(inode_in, pos_in,
+			inode_out, pos_out, *len, &is_same);
 		if (ret)
 			return ret;
 		if (!is_same)
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index 680ae7662a78..68e4257cebb0 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -1350,7 +1350,7 @@ xfs_reflink_remap_prep(
 		goto out_unlock;
 
 	ret = generic_remap_file_range_prep(file_in, pos_in, file_out, pos_out,
-			len, remap_flags);
+			len, remap_flags, vfs_dedupe_file_range_compare);
 	if (ret < 0 || *len == 0)
 		goto out_unlock;
 
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 0dd316a74a29..a11bc7b1f526 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -157,6 +157,8 @@ vm_fault_t dax_finish_sync_fault(struct vm_fault *vmf,
 int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index);
 int dax_invalidate_mapping_entry_sync(struct address_space *mapping,
 				      pgoff_t index);
+int dax_file_range_compare(struct inode *src, loff_t srcoff, struct inode *dest,
+                loff_t destoff, loff_t len, bool *is_same, const struct iomap_ops *ops);
 
 #ifdef CONFIG_FS_DAX
 int __dax_zero_page_range(struct block_device *bdev,
diff --git a/include/linux/fs.h b/include/linux/fs.h
index dd28e7679089..b1aa2fc105ae 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1883,10 +1883,15 @@ extern ssize_t vfs_readv(struct file *, const struct iovec __user *,
 		unsigned long, loff_t *, rwf_t);
 extern ssize_t vfs_copy_file_range(struct file *, loff_t , struct file *,
 				   loff_t, size_t, unsigned int);
+extern int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
+					 struct inode *dest, loff_t destoff,
+					 loff_t len, bool *is_same);
+typedef int (compare_range_t)(struct inode *src, loff_t srcpos, struct inode *dest,
+		loff_t destpos, loff_t len, bool *is_same);
 extern int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
 					 struct file *file_out, loff_t pos_out,
 					 loff_t *count,
-					 unsigned int remap_flags);
+					 unsigned int remap_flags, compare_range_t cmp);
 extern loff_t do_clone_file_range(struct file *file_in, loff_t pos_in,
 				  struct file *file_out, loff_t pos_out,
 				  loff_t len, unsigned int remap_flags);
-- 
2.16.4
^ permalink raw reply related	[flat|nested] 31+ messages in thread
* [PATCH 14/18] dax: memcpy before zeroing range
  2019-04-16 16:41 [PATCH v3 00/18] btrfs dax support Goldwyn Rodrigues
                   ` (12 preceding siblings ...)
  2019-04-16 16:41 ` [PATCH 13/18] fs: dedup file range to use a compare function Goldwyn Rodrigues
@ 2019-04-16 16:41 ` Goldwyn Rodrigues
  2019-04-17 15:45   ` Darrick J. Wong
  2019-04-16 16:41 ` [PATCH 15/18] btrfs: handle dax page zeroing Goldwyn Rodrigues
                   ` (4 subsequent siblings)
  18 siblings, 1 reply; 31+ messages in thread
From: Goldwyn Rodrigues @ 2019-04-16 16:41 UTC (permalink / raw)
  To: linux-btrfs
  Cc: linux-fsdevel, jack, david, willy, hch, darrick.wong, kilobyte,
	dsterba, nborisov, linux-nvdimm, Goldwyn Rodrigues
From: Goldwyn Rodrigues <rgoldwyn@suse.com>
However, this needed more iomap fields, so it was easier
to pass iomap and compute inside the function rather
than passing a log of arguments.
Note, there is subtle difference between iomap_sector and
dax_iomap_sector(). Can we replace dax_iomap_sector with
iomap_sector()? It would need pos & PAGE_MASK though or else
bdev_dax_pgoff() return -EINVAL.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
 fs/dax.c              | 14 ++++++++++----
 fs/iomap.c            |  9 +--------
 include/linux/dax.h   | 11 +++++------
 include/linux/iomap.h |  6 ++++++
 4 files changed, 22 insertions(+), 18 deletions(-)
diff --git a/fs/dax.c b/fs/dax.c
index abbe4a79f219..af94909640ea 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1055,11 +1055,15 @@ static bool dax_range_is_aligned(struct block_device *bdev,
 	return true;
 }
 
-int __dax_zero_page_range(struct block_device *bdev,
-		struct dax_device *dax_dev, sector_t sector,
-		unsigned int offset, unsigned int size)
+int __dax_zero_page_range(struct iomap *iomap, loff_t pos,
+			  unsigned int offset, unsigned int size)
 {
-	if (dax_range_is_aligned(bdev, offset, size)) {
+	sector_t sector = dax_iomap_sector(iomap, pos & PAGE_MASK);
+	struct block_device *bdev = iomap->bdev;
+	struct dax_device *dax_dev = iomap->dax_dev;
+
+	if (!(iomap->type == IOMAP_DAX_COW) &&
+	    dax_range_is_aligned(bdev, offset, size)) {
 		sector_t start_sector = sector + (offset >> 9);
 
 		return blkdev_issue_zeroout(bdev, start_sector,
@@ -1079,6 +1083,8 @@ int __dax_zero_page_range(struct block_device *bdev,
 			dax_read_unlock(id);
 			return rc;
 		}
+		if (iomap->type == IOMAP_DAX_COW)
+			memcpy(iomap->inline_data, kaddr, offset);
 		memset(kaddr + offset, 0, size);
 		dax_flush(dax_dev, kaddr + offset, size);
 		dax_read_unlock(id);
diff --git a/fs/iomap.c b/fs/iomap.c
index abdd18e404f8..90698c854883 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -98,12 +98,6 @@ iomap_apply(struct inode *inode, loff_t pos, loff_t length, unsigned flags,
 	return written ? written : ret;
 }
 
-static sector_t
-iomap_sector(struct iomap *iomap, loff_t pos)
-{
-	return (iomap->addr + pos - iomap->offset) >> SECTOR_SHIFT;
-}
-
 static struct iomap_page *
 iomap_page_create(struct inode *inode, struct page *page)
 {
@@ -990,8 +984,7 @@ static int iomap_zero(struct inode *inode, loff_t pos, unsigned offset,
 static int iomap_dax_zero(loff_t pos, unsigned offset, unsigned bytes,
 		struct iomap *iomap)
 {
-	return __dax_zero_page_range(iomap->bdev, iomap->dax_dev,
-			iomap_sector(iomap, pos & PAGE_MASK), offset, bytes);
+	return __dax_zero_page_range(iomap, pos, offset, bytes);
 }
 
 static loff_t
diff --git a/include/linux/dax.h b/include/linux/dax.h
index a11bc7b1f526..892c478d7073 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -9,6 +9,7 @@
 
 typedef unsigned long dax_entry_t;
 
+struct iomap;
 struct iomap_ops;
 struct dax_device;
 struct dax_operations {
@@ -161,13 +162,11 @@ int dax_file_range_compare(struct inode *src, loff_t srcoff, struct inode *dest,
                 loff_t destoff, loff_t len, bool *is_same, const struct iomap_ops *ops);
 
 #ifdef CONFIG_FS_DAX
-int __dax_zero_page_range(struct block_device *bdev,
-		struct dax_device *dax_dev, sector_t sector,
-		unsigned int offset, unsigned int length);
+int __dax_zero_page_range(struct iomap *iomap, loff_t pos,
+		unsigned int offset, unsigned int size);
 #else
-static inline int __dax_zero_page_range(struct block_device *bdev,
-		struct dax_device *dax_dev, sector_t sector,
-		unsigned int offset, unsigned int length)
+static inline int __dax_zero_page_range(struct iomap *iomap, loff_t pos,
+		unsigned int offset, unsigned int size)
 {
 	return -ENXIO;
 }
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 6e885c5a38a3..3a803566dea1 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -7,6 +7,7 @@
 #include <linux/mm.h>
 #include <linux/types.h>
 #include <linux/mm_types.h>
+#include <linux/blkdev.h>
 
 struct address_space;
 struct fiemap_extent_info;
@@ -120,6 +121,11 @@ static inline struct iomap_page *to_iomap_page(struct page *page)
 	return NULL;
 }
 
+static inline sector_t iomap_sector(struct iomap *iomap, loff_t pos)
+{
+	        return (iomap->addr + pos - iomap->offset) >> SECTOR_SHIFT;
+}
+
 ssize_t iomap_file_buffered_write(struct kiocb *iocb, struct iov_iter *from,
 		const struct iomap_ops *ops);
 int iomap_readpage(struct page *page, const struct iomap_ops *ops);
-- 
2.16.4
^ permalink raw reply related	[flat|nested] 31+ messages in thread
* [PATCH 15/18] btrfs: handle dax page zeroing
  2019-04-16 16:41 [PATCH v3 00/18] btrfs dax support Goldwyn Rodrigues
                   ` (13 preceding siblings ...)
  2019-04-16 16:41 ` [PATCH 14/18] dax: memcpy before zeroing range Goldwyn Rodrigues
@ 2019-04-16 16:41 ` Goldwyn Rodrigues
  2019-04-16 16:41 ` [PATCH 16/18] btrfs: Writeprotect mmap pages on snapshot Goldwyn Rodrigues
                   ` (3 subsequent siblings)
  18 siblings, 0 replies; 31+ messages in thread
From: Goldwyn Rodrigues @ 2019-04-16 16:41 UTC (permalink / raw)
  To: linux-btrfs
  Cc: linux-fsdevel, jack, david, willy, hch, darrick.wong, kilobyte,
	dsterba, nborisov, linux-nvdimm, Goldwyn Rodrigues
From: Goldwyn Rodrigues <rgoldwyn@suse.com>
btrfs_dax_zero_block() zeros part of the page, either from the
front or the regular rest of the block.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
 fs/btrfs/ctree.h |  1 +
 fs/btrfs/dax.c   | 27 ++++++++++++++++++++++++++-
 fs/btrfs/inode.c |  4 ++++
 3 files changed, 31 insertions(+), 1 deletion(-)
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index d3d044125619..ee1ed18f8b3c 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3806,6 +3806,7 @@ vm_fault_t btrfs_dax_fault(struct vm_fault *vmf);
 int btrfs_dax_file_range_compare(struct inode *src, loff_t srcoff,
 		struct inode *dest, loff_t destoff, loff_t len,
 		bool *is_same);
+int btrfs_dax_zero_block(struct inode *inode, loff_t from, loff_t len, bool front);
 #else
 static inline ssize_t btrfs_file_dax_write(struct kiocb *iocb, struct iov_iter *from)
 {
diff --git a/fs/btrfs/dax.c b/fs/btrfs/dax.c
index a29628b403b3..a4b36d5b537c 100644
--- a/fs/btrfs/dax.c
+++ b/fs/btrfs/dax.c
@@ -132,7 +132,8 @@ static int btrfs_iomap_begin(struct inode *inode, loff_t pos,
 	 * This will be true for reads only since we have already
 	 * allocated em
 	 */
-	if (em->block_start == EXTENT_MAP_HOLE) {
+	if (em->block_start == EXTENT_MAP_HOLE ||
+			em->flags == EXTENT_FLAG_FILLING) {
 		iomap->type = IOMAP_HOLE;
 		return 0;
 	}
@@ -234,4 +235,28 @@ int btrfs_dax_file_range_compare(struct inode *src, loff_t srcoff,
 {
 	return dax_file_range_compare(src, srcoff, dest, destoff, len, is_same, &btrfs_iomap_ops);
 }
+
+/*
+ * zero a part of the page only. This should CoW (via iomap_begin) if required
+ */
+int btrfs_dax_zero_block(struct inode *inode, loff_t from, loff_t len, bool front)
+{
+	loff_t start = round_down(from, PAGE_SIZE);
+	loff_t end = round_up(from, PAGE_SIZE);
+	loff_t offset = from;
+	int ret = 0;
+
+	if (front) {
+		len = from - start;
+		offset = start;
+	} else	{
+		if (!len)
+			len = end - from;
+	}
+
+	if (len)
+		ret = iomap_zero_range(inode, offset, len, NULL, &btrfs_iomap_ops);
+
+	return (ret < 0) ? ret : 0;
+}
 #endif /* CONFIG_FS_DAX */
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index c2c0544009a7..f6a64607bc8c 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -4833,6 +4833,10 @@ int btrfs_truncate_block(struct inode *inode, loff_t from, loff_t len,
 	    (!len || IS_ALIGNED(len, blocksize)))
 		goto out;
 
+#ifdef CONFIG_FS_DAX
+	if (IS_DAX(inode))
+		return btrfs_dax_zero_block(inode, from, len, front);
+#endif
 	block_start = round_down(from, blocksize);
 	block_end = block_start + blocksize - 1;
 
-- 
2.16.4
^ permalink raw reply related	[flat|nested] 31+ messages in thread
* [PATCH 16/18] btrfs: Writeprotect mmap pages on snapshot
  2019-04-16 16:41 [PATCH v3 00/18] btrfs dax support Goldwyn Rodrigues
                   ` (14 preceding siblings ...)
  2019-04-16 16:41 ` [PATCH 15/18] btrfs: handle dax page zeroing Goldwyn Rodrigues
@ 2019-04-16 16:41 ` Goldwyn Rodrigues
  2019-04-16 16:41 ` [PATCH 17/18] btrfs: Disable dax-based defrag and send Goldwyn Rodrigues
                   ` (2 subsequent siblings)
  18 siblings, 0 replies; 31+ messages in thread
From: Goldwyn Rodrigues @ 2019-04-16 16:41 UTC (permalink / raw)
  To: linux-btrfs
  Cc: linux-fsdevel, jack, david, willy, hch, darrick.wong, kilobyte,
	dsterba, nborisov, linux-nvdimm, Goldwyn Rodrigues
From: Goldwyn Rodrigues <rgoldwyn@suse.com>
Inorder to make sure mmap'd files don't change after snapshot,
writeprotect the mmap pages on snapshot. This is done by performing
a data writeback on the pages (which simply mark the pages are
wrprotected). This way if the user process tries to access the memory
we will get another fault and we can perform a CoW.
In order to accomplish this, we tag all CoW pages as
PAGECACHE_TAG_TOWRITE, and add the mmapd inode in delalloc_inodes.
During snapshot, it starts writeback of all delalloc'd inodes and
here we perform a data writeback. We don't want to keep the inodes
in delalloc_inodes until it umount (WARN_ON), so we remove it
during inode evictions.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
 fs/btrfs/ctree.h |  3 ++-
 fs/btrfs/dax.c   |  7 +++++++
 fs/btrfs/inode.c | 13 ++++++++++++-
 3 files changed, 21 insertions(+), 2 deletions(-)
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index ee1ed18f8b3c..d1b70f24adeb 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3252,7 +3252,8 @@ int btrfs_create_subvol_root(struct btrfs_trans_handle *trans,
 			     struct btrfs_root *new_root,
 			     struct btrfs_root *parent_root,
 			     u64 new_dirid);
- void btrfs_set_delalloc_extent(struct inode *inode, struct extent_state *state,
+void btrfs_add_delalloc_inodes(struct btrfs_root *root, struct inode *inode);
+void btrfs_set_delalloc_extent(struct inode *inode, struct extent_state *state,
 			       unsigned *bits);
 void btrfs_clear_delalloc_extent(struct inode *inode,
 				 struct extent_state *state, unsigned *bits);
diff --git a/fs/btrfs/dax.c b/fs/btrfs/dax.c
index a4b36d5b537c..2740c15286ee 100644
--- a/fs/btrfs/dax.c
+++ b/fs/btrfs/dax.c
@@ -222,10 +222,17 @@ vm_fault_t btrfs_dax_fault(struct vm_fault *vmf)
 {
 	vm_fault_t ret;
 	pfn_t pfn;
+	struct inode *inode = file_inode(vmf->vma->vm_file);
+	struct btrfs_inode *binode = BTRFS_I(inode);
 	ret = dax_iomap_fault(vmf, PE_SIZE_PTE, &pfn, NULL, &btrfs_iomap_ops);
 	if (ret & VM_FAULT_NEEDDSYNC)
 		ret = dax_finish_sync_fault(vmf, PE_SIZE_PTE, pfn);
 
+	/* Insert into delalloc so we get writeback calls on snapshots */
+	if (vmf->flags & FAULT_FLAG_WRITE &&
+			!test_bit(BTRFS_INODE_IN_DELALLOC_LIST, &binode->runtime_flags))
+		btrfs_add_delalloc_inodes(binode->root, inode);
+
 	return ret;
 }
 
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index f6a64607bc8c..d439c6c80a2a 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -1713,7 +1713,7 @@ void btrfs_merge_delalloc_extent(struct inode *inode, struct extent_state *new,
 	spin_unlock(&BTRFS_I(inode)->lock);
 }
 
-static void btrfs_add_delalloc_inodes(struct btrfs_root *root,
+void btrfs_add_delalloc_inodes(struct btrfs_root *root,
 				      struct inode *inode)
 {
 	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
@@ -5358,12 +5358,17 @@ void btrfs_evict_inode(struct inode *inode)
 {
 	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
 	struct btrfs_trans_handle *trans;
+	struct btrfs_inode *binode = BTRFS_I(inode);
 	struct btrfs_root *root = BTRFS_I(inode)->root;
 	struct btrfs_block_rsv *rsv;
 	int ret;
 
 	trace_btrfs_inode_evict(inode);
 
+	if (IS_DAX(inode)
+	    && test_bit(BTRFS_INODE_IN_DELALLOC_LIST, &binode->runtime_flags))
+		btrfs_del_delalloc_inode(root, binode);
+
 	if (!root) {
 		clear_inode(inode);
 		return;
@@ -8683,6 +8688,10 @@ static int btrfs_dax_writepages(struct address_space *mapping,
 {
 	struct inode *inode = mapping->host;
 	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
+	struct btrfs_inode *binode = BTRFS_I(inode);
+	if ((wbc->sync_mode == WB_SYNC_ALL) &&
+	    test_bit(BTRFS_INODE_IN_DELALLOC_LIST, &binode->runtime_flags))
+		btrfs_del_delalloc_inode(binode->root, binode);
 	return dax_writeback_mapping_range(mapping, fs_info->fs_devices->latest_bdev,
 			wbc);
 }
@@ -9981,6 +9990,8 @@ static void btrfs_run_delalloc_work(struct btrfs_work *work)
 	delalloc_work = container_of(work, struct btrfs_delalloc_work,
 				     work);
 	inode = delalloc_work->inode;
+	if (IS_DAX(inode))
+		filemap_fdatawrite(inode->i_mapping);
 	filemap_flush(inode->i_mapping);
 	if (test_bit(BTRFS_INODE_HAS_ASYNC_EXTENT,
 				&BTRFS_I(inode)->runtime_flags))
-- 
2.16.4
^ permalink raw reply related	[flat|nested] 31+ messages in thread
* [PATCH 17/18] btrfs: Disable dax-based defrag and send
  2019-04-16 16:41 [PATCH v3 00/18] btrfs dax support Goldwyn Rodrigues
                   ` (15 preceding siblings ...)
  2019-04-16 16:41 ` [PATCH 16/18] btrfs: Writeprotect mmap pages on snapshot Goldwyn Rodrigues
@ 2019-04-16 16:41 ` Goldwyn Rodrigues
  2019-04-16 16:41 ` [PATCH 18/18] btrfs: trace functions for btrfs_iomap_begin/end Goldwyn Rodrigues
  2019-04-17 16:49 ` [PATCH v3 00/18] btrfs dax support Adam Borowski
  18 siblings, 0 replies; 31+ messages in thread
From: Goldwyn Rodrigues @ 2019-04-16 16:41 UTC (permalink / raw)
  To: linux-btrfs
  Cc: linux-fsdevel, jack, david, willy, hch, darrick.wong, kilobyte,
	dsterba, nborisov, linux-nvdimm, Goldwyn Rodrigues
From: Goldwyn Rodrigues <rgoldwyn@suse.com>
This is temporary, and a TODO.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
 fs/btrfs/ioctl.c | 13 +++++++++++++
 fs/btrfs/send.c  |  4 ++++
 2 files changed, 17 insertions(+)
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index cd590105bd78..8ac770fe00dc 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -2990,6 +2990,12 @@ static int btrfs_ioctl_defrag(struct file *file, void __user *argp)
 			goto out;
 		}
 
+		if (IS_DAX(inode)) {
+			btrfs_warn(root->fs_info, "File defrag is not supported with DAX");
+			ret = -EOPNOTSUPP;
+			goto out;
+		}
+
 		if (argp) {
 			if (copy_from_user(range, argp,
 					   sizeof(*range))) {
@@ -4653,6 +4659,10 @@ static long btrfs_ioctl_balance(struct file *file, void __user *arg)
 	if (!capable(CAP_SYS_ADMIN))
 		return -EPERM;
 
+	/* send can be on a directory, so check super block instead */
+	if (btrfs_test_opt(fs_info, DAX))
+		return -EOPNOTSUPP;
+
 	ret = mnt_want_write_file(file);
 	if (ret)
 		return ret;
@@ -5505,6 +5515,9 @@ static int _btrfs_ioctl_send(struct file *file, void __user *argp, bool compat)
 	struct btrfs_ioctl_send_args *arg;
 	int ret;
 
+	if (IS_DAX(file_inode(file)))
+		return -EOPNOTSUPP;
+
 	if (compat) {
 #if defined(CONFIG_64BIT) && defined(CONFIG_COMPAT)
 		struct btrfs_ioctl_send_args_32 args32;
diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
index 7ea2d6b1f170..9679fd54db86 100644
--- a/fs/btrfs/send.c
+++ b/fs/btrfs/send.c
@@ -6609,6 +6609,10 @@ long btrfs_ioctl_send(struct file *mnt_file, struct btrfs_ioctl_send_args *arg)
 	int sort_clone_roots = 0;
 	int index;
 
+	/* send can be on a directory, so check super block instead */
+	if (btrfs_test_opt(fs_info, DAX))
+		return -EOPNOTSUPP;
+
 	if (!capable(CAP_SYS_ADMIN))
 		return -EPERM;
 
-- 
2.16.4
^ permalink raw reply related	[flat|nested] 31+ messages in thread
* [PATCH 18/18] btrfs: trace functions for btrfs_iomap_begin/end
  2019-04-16 16:41 [PATCH v3 00/18] btrfs dax support Goldwyn Rodrigues
                   ` (16 preceding siblings ...)
  2019-04-16 16:41 ` [PATCH 17/18] btrfs: Disable dax-based defrag and send Goldwyn Rodrigues
@ 2019-04-16 16:41 ` Goldwyn Rodrigues
  2019-04-17 16:49 ` [PATCH v3 00/18] btrfs dax support Adam Borowski
  18 siblings, 0 replies; 31+ messages in thread
From: Goldwyn Rodrigues @ 2019-04-16 16:41 UTC (permalink / raw)
  To: linux-btrfs
  Cc: linux-fsdevel, jack, david, willy, hch, darrick.wong, kilobyte,
	dsterba, nborisov, linux-nvdimm, Goldwyn Rodrigues
From: Goldwyn Rodrigues <rgoldwyn@suse.com>
This is for debug purposes only and can be skipped.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
 fs/btrfs/dax.c               |  3 +++
 include/trace/events/btrfs.h | 56 ++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 59 insertions(+)
diff --git a/fs/btrfs/dax.c b/fs/btrfs/dax.c
index 2740c15286ee..de3ebc75f0b2 100644
--- a/fs/btrfs/dax.c
+++ b/fs/btrfs/dax.c
@@ -104,6 +104,8 @@ static int btrfs_iomap_begin(struct inode *inode, loff_t pos,
 	u64 srcblk = 0;
 	loff_t diff;
 
+	trace_btrfs_iomap_begin(inode, pos, length, flags);
+
 	em = btrfs_get_extent(BTRFS_I(inode), NULL, 0, pos, length, 0);
 
 	iomap->type = IOMAP_MAPPED;
@@ -164,6 +166,7 @@ static int btrfs_iomap_end(struct inode *inode, loff_t pos,
 {
 	struct btrfs_iomap *bi = iomap->private;
 	u64 wend;
+	trace_btrfs_iomap_end(inode, pos, length, written, flags);
 
 	if (!bi)
 		return 0;
diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
index ab1cc33adbac..8779e5789a7c 100644
--- a/include/trace/events/btrfs.h
+++ b/include/trace/events/btrfs.h
@@ -1850,6 +1850,62 @@ DEFINE_EVENT(btrfs__block_group, btrfs_skip_unused_block_group,
 	TP_ARGS(bg_cache)
 );
 
+TRACE_EVENT(btrfs_iomap_begin,
+
+	TP_PROTO(const struct inode *inode, loff_t pos, loff_t length, int flags),
+
+	TP_ARGS(inode, pos, length, flags),
+
+	TP_STRUCT__entry_btrfs(
+		__field(	u64,	ino		)
+		__field(	u64,	pos		)
+		__field(	u64,	length		)
+		__field(	int,    flags		)
+	),
+
+	TP_fast_assign_btrfs(btrfs_sb(inode->i_sb),
+		__entry->ino		= btrfs_ino(BTRFS_I(inode));
+		__entry->pos		= pos;
+		__entry->length		= length;
+		__entry->flags		= flags;
+	),
+
+	TP_printk_btrfs("ino=%llu pos=%llu len=%llu flags=0x%x",
+		  __entry->ino,
+		  __entry->pos,
+		  __entry->length,
+		  __entry->flags)
+);
+
+TRACE_EVENT(btrfs_iomap_end,
+
+	TP_PROTO(const struct inode *inode, loff_t pos, loff_t length, loff_t written, int flags),
+
+	TP_ARGS(inode, pos, length, written, flags),
+
+	TP_STRUCT__entry_btrfs(
+		__field(	u64,	ino		)
+		__field(	u64,	pos		)
+		__field(	u64,	length		)
+		__field(	u64,	written		)
+		__field(	int,    flags		)
+	),
+
+	TP_fast_assign_btrfs(btrfs_sb(inode->i_sb),
+		__entry->ino		= btrfs_ino(BTRFS_I(inode));
+		__entry->pos		= pos;
+		__entry->length		= length;
+		__entry->written		= written;
+		__entry->flags		= flags;
+	),
+
+	TP_printk_btrfs("ino=%llu pos=%llu len=%llu written=%llu flags=0x%x",
+		  __entry->ino,
+		  __entry->pos,
+		  __entry->length,
+		  __entry->written,
+		  __entry->flags)
+);
 #endif /* _TRACE_BTRFS_H */
 
 /* This part must be outside protection */
-- 
2.16.4
^ permalink raw reply related	[flat|nested] 31+ messages in thread
* Re: [PATCH 01/18] btrfs: create a mount option for dax
  2019-04-16 16:41 ` [PATCH 01/18] btrfs: create a mount option for dax Goldwyn Rodrigues
@ 2019-04-16 16:52   ` Dan Williams
  0 siblings, 0 replies; 31+ messages in thread
From: Dan Williams @ 2019-04-16 16:52 UTC (permalink / raw)
  To: Goldwyn Rodrigues
  Cc: linux-btrfs, kilobyte, Jan Kara, Darrick J. Wong, nborisov,
	linux-nvdimm, david, dsterba, Matthew Wilcox, linux-fsdevel,
	Christoph Hellwig, Goldwyn Rodrigues
On Tue, Apr 16, 2019 at 9:42 AM Goldwyn Rodrigues <rgoldwyn@suse.de> wrote:
>
> From: Goldwyn Rodrigues <rgoldwyn@suse.com>
>
> This sets S_DAX in inode->i_flags, which can be used with
> IS_DAX().
>
> The dax option is restricted to non multi-device mounts.
> dax interacts with the device directly instead of using bio, so
> all bio-hooks which we use for multi-device cannot be performed
> here. While regular read/writes could be manipulated with
> RAID0/1, mmap() is still an issue.
>
> Auto-setting free space tree, because dealing with free space
> inode (specifically readpages) is a nightmare.
> Auto-setting nodatasum because we don't get callback for writing
> checksums after mmap()s.
> Deny compression because it does not work with direct I/O.
I'd like to consider the dax mount option deprecated and require new
implementations to use a dynamic scheme that Darrick and I once
discussed where "dax" (or more specifically the MAP_SYNC semantic) is
an attribute that can be set on an empty directory and then inherited
by all files / directories created in that hierarchy.
Dax mappings are not always a performance win and enabling it for all
files in the filesystem is potentially overkill, and I'm of opinion
that we should step away from an filesystem-global mount option.
^ permalink raw reply	[flat|nested] 31+ messages in thread
* RE: [PATCH 02/18] btrfs: Carve out btrfs_get_extent_map_write() out of btrfs_get_blocks_write()
  2019-04-16 16:41 ` [PATCH 02/18] btrfs: Carve out btrfs_get_extent_map_write() out of btrfs_get_blocks_write() Goldwyn Rodrigues
@ 2019-04-16 23:45   ` Elliott, Robert (Servers)
  0 siblings, 0 replies; 31+ messages in thread
From: Elliott, Robert (Servers) @ 2019-04-16 23:45 UTC (permalink / raw)
  To: Goldwyn Rodrigues, linux-btrfs@vger.kernel.org
  Cc: kilobyte@angband.pl, jack@suse.cz, darrick.wong@oracle.com,
	nborisov@suse.com, linux-nvdimm@lists.01.org, david@fromorbit.com,
	dsterba@suse.cz, willy@infradead.org,
	linux-fsdevel@vger.kernel.org, hch@lst.de, Goldwyn Rodrigues
> -----Original Message-----
> From: Linux-nvdimm [mailto:linux-nvdimm-bounces@lists.01.org] On Behalf Of Goldwyn Rodrigues
> Sent: Tuesday, April 16, 2019 11:42 AM
> Subject: [PATCH 02/18] btrfs: Carve out btrfs_get_extent_map_write() out of btrfs_get_blocks_write()
...
> +static int btrfs_get_blocks_direct_write(struct extent_map **map,
> +					 struct buffer_head *bh_result,
> +					 struct inode *inode,
> +					 struct btrfs_dio_data *dio_data,
> +					 u64 start, u64 len)
> +{
> +	int ret = 0;
That initialization value is not needed, since ret is always overwritten
two lines later.
> +	struct extent_map *em;
> +
> +	ret = btrfs_get_extent_map_write(map, bh_result, inode,
> +			start, len);
> +	if (ret < 0)
> +		return ret;
> +	em = *map;
^ permalink raw reply	[flat|nested] 31+ messages in thread
* Re: [PATCH 10/18] dax: replace mmap entry in case of CoW
  2019-04-16 16:41 ` [PATCH 10/18] dax: replace mmap entry in case of CoW Goldwyn Rodrigues
@ 2019-04-17 15:24   ` Darrick J. Wong
  0 siblings, 0 replies; 31+ messages in thread
From: Darrick J. Wong @ 2019-04-17 15:24 UTC (permalink / raw)
  To: Goldwyn Rodrigues
  Cc: linux-btrfs, linux-fsdevel, jack, david, willy, hch, kilobyte,
	dsterba, nborisov, linux-nvdimm, Goldwyn Rodrigues
On Tue, Apr 16, 2019 at 11:41:46AM -0500, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues <rgoldwyn@suse.com>
> 
> We replace the existing entry to the newly allocated one
> in case of CoW. Also, we mark the entry as PAGECACHE_TAG_TOWRITE
> so writeback marks this entry as writeprotected. This
> helps us snapshots so new write pagefaults after snapshots
> trigger a CoW.
> 
> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
> ---
>  fs/dax.c | 23 +++++++++++++++--------
>  1 file changed, 15 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index 45fc2e18969a..d5100cbe8bd2 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -708,14 +708,15 @@ static int copy_user_dax(struct block_device *bdev, struct dax_device *dax_dev,
>   */
>  static void *dax_insert_entry(struct xa_state *xas,
>  		struct address_space *mapping, struct vm_fault *vmf,
> -		void *entry, pfn_t pfn, unsigned long flags, bool dirty)
> +		void *entry, pfn_t pfn, unsigned long flags, bool dirty,
> +		bool cow)
I still wish these were flags instead of double booleans that will be
easy to mix up, especially since this is a static function and nobody
else has to see the flags...
#define IE_DIRTY	(1 << 0) /* mark entry and inode dirty */
#define IE_REPLACE	(1 << 1) /* replacing one page with another */
...otoh maybe I'll just defer to the maintainer. :)
>  {
>  	void *new_entry = dax_make_entry(pfn, flags);
>  
>  	if (dirty)
>  		__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
>  
> -	if (dax_is_zero_entry(entry) && !(flags & DAX_ZERO_PAGE)) {
> +	if (cow || (dax_is_zero_entry(entry) && !(flags & DAX_ZERO_PAGE))) {
>  		unsigned long index = xas->xa_index;
>  		/* we are replacing a zero page with block mapping */
These comments need updating.
Otherwise looks good to me...
--D
>  		if (dax_is_pmd_entry(entry))
> @@ -727,12 +728,12 @@ static void *dax_insert_entry(struct xa_state *xas,
>  
>  	xas_reset(xas);
>  	xas_lock_irq(xas);
> -	if (dax_entry_size(entry) != dax_entry_size(new_entry)) {
> +	if (cow || (dax_entry_size(entry) != dax_entry_size(new_entry))) {
>  		dax_disassociate_entry(entry, mapping, false);
>  		dax_associate_entry(new_entry, mapping, vmf->vma, vmf->address);
>  	}
>  
> -	if (dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) {
> +	if (cow || dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) {
>  		/*
>  		 * Only swap our new entry into the page cache if the current
>  		 * entry is a zero page or an empty entry.  If a normal PTE or
> @@ -752,6 +753,9 @@ static void *dax_insert_entry(struct xa_state *xas,
>  	if (dirty)
>  		xas_set_mark(xas, PAGECACHE_TAG_DIRTY);
>  
> +	if (cow)
> +		xas_set_mark(xas, PAGECACHE_TAG_TOWRITE);
> +
>  	xas_unlock_irq(xas);
>  	return entry;
>  }
> @@ -1031,7 +1035,7 @@ static vm_fault_t dax_load_hole(struct xa_state *xas,
>  	vm_fault_t ret;
>  
>  	*entry = dax_insert_entry(xas, mapping, vmf, *entry, pfn,
> -			DAX_ZERO_PAGE, false);
> +			DAX_ZERO_PAGE, false, false);
>  
>  	ret = vmf_insert_mixed(vmf->vma, vaddr, pfn);
>  	trace_dax_load_hole(inode, vmf, ret);
> @@ -1388,7 +1392,8 @@ static vm_fault_t dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp,
>  				memset(addr, 0, PAGE_SIZE);
>  		}
>  		entry = dax_insert_entry(&xas, mapping, vmf, entry, pfn,
> -						 0, write && !sync);
> +						 0, write && !sync,
> +						 iomap.type == IOMAP_DAX_COW);
>  
>  		/*
>  		 * If we are doing synchronous page fault and inode needs fsync,
> @@ -1467,7 +1472,8 @@ static vm_fault_t dax_pmd_load_hole(struct xa_state *xas, struct vm_fault *vmf,
>  
>  	pfn = page_to_pfn_t(zero_page);
>  	*entry = dax_insert_entry(xas, mapping, vmf, *entry, pfn,
> -			DAX_PMD | DAX_ZERO_PAGE, false);
> +			DAX_PMD | DAX_ZERO_PAGE, false,
> +			iomap->type == IOMAP_DAX_COW);
>  
>  	ptl = pmd_lock(vmf->vma->vm_mm, vmf->pmd);
>  	if (!pmd_none(*(vmf->pmd))) {
> @@ -1590,7 +1596,8 @@ static vm_fault_t dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp,
>  			goto finish_iomap;
>  
>  		entry = dax_insert_entry(&xas, mapping, vmf, entry, pfn,
> -						DAX_PMD, write && !sync);
> +						DAX_PMD, write && !sync,
> +						false);
>  
>  		/*
>  		 * If we are doing synchronous page fault and inode needs fsync,
> -- 
> 2.16.4
> 
^ permalink raw reply	[flat|nested] 31+ messages in thread
* Re: [PATCH 13/18] fs: dedup file range to use a compare function
  2019-04-16 16:41 ` [PATCH 13/18] fs: dedup file range to use a compare function Goldwyn Rodrigues
@ 2019-04-17 15:36   ` Darrick J. Wong
  0 siblings, 0 replies; 31+ messages in thread
From: Darrick J. Wong @ 2019-04-17 15:36 UTC (permalink / raw)
  To: Goldwyn Rodrigues
  Cc: linux-btrfs, linux-fsdevel, jack, david, willy, hch, kilobyte,
	dsterba, nborisov, linux-nvdimm, Goldwyn Rodrigues
On Tue, Apr 16, 2019 at 11:41:49AM -0500, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues <rgoldwyn@suse.com>
> 
> With dax we cannot deal with readpage() etc. So, we create a
> funciton callback to perform the file data comparison and pass
> it to generic_remap_file_range_prep() so it can use iomap-based
> functions.
> 
> This may not be the best way to solve this. Suggestions welcome.
> 
> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
> ---
>  fs/btrfs/ctree.h     |  9 ++++++++
>  fs/btrfs/dax.c       |  7 +++++++
>  fs/btrfs/ioctl.c     | 11 ++++++++--
>  fs/dax.c             | 58 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>  fs/ocfs2/file.c      |  2 +-
>  fs/read_write.c      | 10 ++++-----
>  fs/xfs/xfs_reflink.c |  2 +-
>  include/linux/dax.h  |  2 ++
>  include/linux/fs.h   |  7 ++++++-
>  9 files changed, 98 insertions(+), 10 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 2b7bdabb44f8..d3d044125619 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -3803,11 +3803,20 @@ int btree_readahead_hook(struct extent_buffer *eb, int err);
>  ssize_t btrfs_file_dax_read(struct kiocb *iocb, struct iov_iter *to);
>  ssize_t btrfs_file_dax_write(struct kiocb *iocb, struct iov_iter *from);
>  vm_fault_t btrfs_dax_fault(struct vm_fault *vmf);
> +int btrfs_dax_file_range_compare(struct inode *src, loff_t srcoff,
> +		struct inode *dest, loff_t destoff, loff_t len,
> +		bool *is_same);
>  #else
>  static inline ssize_t btrfs_file_dax_write(struct kiocb *iocb, struct iov_iter *from)
>  {
>  	return 0;
>  }
> +static inline int btrfs_dax_file_range_compare(struct inode *src, loff_t srcoff,
> +		struct inode *dest, loff_t destoff, loff_t len,
> +		bool *is_same)
> +{
> +	return 0;
> +}
>  #endif /* CONFIG_FS_DAX */
>  
>  static inline int is_fstree(u64 rootid)
> diff --git a/fs/btrfs/dax.c b/fs/btrfs/dax.c
> index de957d681e16..a29628b403b3 100644
> --- a/fs/btrfs/dax.c
> +++ b/fs/btrfs/dax.c
> @@ -227,4 +227,11 @@ vm_fault_t btrfs_dax_fault(struct vm_fault *vmf)
>  
>  	return ret;
>  }
> +
> +int btrfs_dax_file_range_compare(struct inode *src, loff_t srcoff,
> +		struct inode *dest, loff_t destoff, loff_t len,
> +		bool *is_same)
> +{
> +	return dax_file_range_compare(src, srcoff, dest, destoff, len, is_same, &btrfs_iomap_ops);
> +}
>  #endif /* CONFIG_FS_DAX */
> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> index 0138119cd9a3..cd590105bd78 100644
> --- a/fs/btrfs/ioctl.c
> +++ b/fs/btrfs/ioctl.c
> @@ -4000,8 +4000,15 @@ static int btrfs_remap_file_range_prep(struct file *file_in, loff_t pos_in,
>  	if (ret < 0)
>  		goto out_unlock;
>  
> -	ret = generic_remap_file_range_prep(file_in, pos_in, file_out, pos_out,
> -					    len, remap_flags);
> +	if (IS_DAX(file_inode(file_in)) && IS_DAX(file_inode(file_out)))
> +		ret = generic_remap_file_range_prep(file_in, pos_in, file_out,
> +				pos_out, len, remap_flags,
> +				btrfs_dax_file_range_compare);
> +	else
> +		ret = generic_remap_file_range_prep(file_in, pos_in, file_out,
> +				pos_out, len, remap_flags,
> +				vfs_dedupe_file_range_compare);
I wonder if you could simply have a compare_range_t variable that you
can set to either the vfs and btrfs_dax compare functions, and then only
have to maintain a single generic_remap_file_range_prep callsite?
> +
>  	if (ret < 0 || *len == 0)
>  		goto out_unlock;
>  
> diff --git a/fs/dax.c b/fs/dax.c
> index d5100cbe8bd2..abbe4a79f219 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -1759,3 +1759,61 @@ vm_fault_t dax_finish_sync_fault(struct vm_fault *vmf,
>  	return dax_insert_pfn_mkwrite(vmf, pfn, order);
>  }
>  EXPORT_SYMBOL_GPL(dax_finish_sync_fault);
> +
> +
> +int dax_file_range_compare(struct inode *src, loff_t srcoff, struct inode *dest,
> +		loff_t destoff, loff_t len, bool *is_same, const struct iomap_ops *ops)
> +{
> +	void *saddr, *daddr;
> +	struct iomap s_iomap = {0};
> +	struct iomap d_iomap = {0};
> +	loff_t dstart, sstart;
> +	bool same = true;
> +	loff_t cmp_len, l;
> +	int id, ret = 0;
> +
> +	id = dax_read_lock();
> +	while (len) {
> +		ret = ops->iomap_begin(src, srcoff, len, 0, &s_iomap);
> +		if (ret < 0) {
> +			if (ops->iomap_end)
> +				ops->iomap_end(src, srcoff, len, ret, 0, &s_iomap);
> +			return ret;
> +		}
> +		cmp_len = len;
> +		if (cmp_len > s_iomap.offset + s_iomap.length - srcoff)
> +			cmp_len = s_iomap.offset + s_iomap.length - srcoff;
> +
> +		ret = ops->iomap_begin(dest, destoff, cmp_len, 0, &d_iomap);
> +		if (ret < 0) {
> +			if (ops->iomap_end) {
> +				ops->iomap_end(src, srcoff, len, ret, 0, &s_iomap);
> +				ops->iomap_end(dest, destoff, len, ret, 0, &d_iomap);
> +			}
> +			return ret;
> +		}
> +		if (cmp_len > d_iomap.offset + d_iomap.length - destoff)
> +			cmp_len = d_iomap.offset + d_iomap.length - destoff;
> +
> +
> +		sstart = (get_start_sect(s_iomap.bdev) << 9) + s_iomap.addr + (srcoff - s_iomap.offset);
This kinda screams static inline helper function...
> +		l = dax_direct_access(s_iomap.dax_dev, PHYS_PFN(sstart), PHYS_PFN(cmp_len), &saddr, NULL);
> +		dstart = (get_start_sect(d_iomap.bdev) << 9) + d_iomap.addr + (destoff - d_iomap.offset);
...especially since it happens again here...
> +		l = dax_direct_access(d_iomap.dax_dev, PHYS_PFN(dstart), PHYS_PFN(cmp_len), &daddr, NULL);
...and these lines are too long.
> +		same = !memcmp(saddr, daddr, cmp_len);
> +		if (!same)
> +			break;
> +		len -= cmp_len;
> +		srcoff += cmp_len;
> +		destoff += cmp_len;
> +
> +		if (ops->iomap_end) {
> +			ret = ops->iomap_end(src, srcoff, len, 0, 0, &s_iomap);
> +			ret = ops->iomap_end(dest, destoff, len, 0, 0, &d_iomap);
> +		}
> +	}
> +	dax_read_unlock(id);
> +	*is_same = same;
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(dax_file_range_compare);
> diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
> index d640c5f8a85d..9d470306cfc3 100644
> --- a/fs/ocfs2/file.c
> +++ b/fs/ocfs2/file.c
> @@ -2558,7 +2558,7 @@ static loff_t ocfs2_remap_file_range(struct file *file_in, loff_t pos_in,
>  		goto out_unlock;
>  
>  	ret = generic_remap_file_range_prep(file_in, pos_in, file_out, pos_out,
> -			&len, remap_flags);
> +			&len, remap_flags, vfs_dedupe_file_range_compare);
>  	if (ret < 0 || len == 0)
>  		goto out_unlock;
>  
> diff --git a/fs/read_write.c b/fs/read_write.c
> index 61b43ad7608e..ecc67740d0ff 100644
> --- a/fs/read_write.c
> +++ b/fs/read_write.c
> @@ -1778,7 +1778,7 @@ static struct page *vfs_dedupe_get_page(struct inode *inode, loff_t offset)
>   * Compare extents of two files to see if they are the same.
>   * Caller must have locked both inodes to prevent write races.
>   */
> -static int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
> +int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
>  					 struct inode *dest, loff_t destoff,
>  					 loff_t len, bool *is_same)
>  {
> @@ -1845,6 +1845,7 @@ static int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
>  out_error:
>  	return error;
>  }
> +EXPORT_SYMBOL(vfs_dedupe_file_range_compare);
Not EXPORT_SYMBOL_GPL? :D
>  /*
>   * Check that the two inodes are eligible for cloning, the ranges make
> @@ -1856,7 +1857,7 @@ static int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
>   */
>  int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
>  				  struct file *file_out, loff_t pos_out,
> -				  loff_t *len, unsigned int remap_flags)
> +				  loff_t *len, unsigned int remap_flags, compare_range_t compare)
>  {
>  	struct inode *inode_in = file_inode(file_in);
>  	struct inode *inode_out = file_inode(file_out);
> @@ -1915,9 +1916,8 @@ int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
>  	 */
>  	if (remap_flags & REMAP_FILE_DEDUP) {
>  		bool		is_same = false;
> -
> -		ret = vfs_dedupe_file_range_compare(inode_in, pos_in,
> -				inode_out, pos_out, *len, &is_same);
> +		ret = compare(inode_in, pos_in,
> +			inode_out, pos_out, *len, &is_same);
>  		if (ret)
>  			return ret;
>  		if (!is_same)
> diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> index 680ae7662a78..68e4257cebb0 100644
> --- a/fs/xfs/xfs_reflink.c
> +++ b/fs/xfs/xfs_reflink.c
> @@ -1350,7 +1350,7 @@ xfs_reflink_remap_prep(
>  		goto out_unlock;
>  
>  	ret = generic_remap_file_range_prep(file_in, pos_in, file_out, pos_out,
> -			len, remap_flags);
> +			len, remap_flags, vfs_dedupe_file_range_compare);
>  	if (ret < 0 || *len == 0)
>  		goto out_unlock;
>  
> diff --git a/include/linux/dax.h b/include/linux/dax.h
> index 0dd316a74a29..a11bc7b1f526 100644
> --- a/include/linux/dax.h
> +++ b/include/linux/dax.h
> @@ -157,6 +157,8 @@ vm_fault_t dax_finish_sync_fault(struct vm_fault *vmf,
>  int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index);
>  int dax_invalidate_mapping_entry_sync(struct address_space *mapping,
>  				      pgoff_t index);
> +int dax_file_range_compare(struct inode *src, loff_t srcoff, struct inode *dest,
> +                loff_t destoff, loff_t len, bool *is_same, const struct iomap_ops *ops);
>  
>  #ifdef CONFIG_FS_DAX
>  int __dax_zero_page_range(struct block_device *bdev,
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index dd28e7679089..b1aa2fc105ae 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1883,10 +1883,15 @@ extern ssize_t vfs_readv(struct file *, const struct iovec __user *,
>  		unsigned long, loff_t *, rwf_t);
>  extern ssize_t vfs_copy_file_range(struct file *, loff_t , struct file *,
>  				   loff_t, size_t, unsigned int);
> +extern int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
> +					 struct inode *dest, loff_t destoff,
> +					 loff_t len, bool *is_same);
> +typedef int (compare_range_t)(struct inode *src, loff_t srcpos, struct inode *dest,
> +		loff_t destpos, loff_t len, bool *is_same);
Might want to call this vfs_compare_range_t to make it clear that this
belongs to the generic vfs namespace.
--D
>  extern int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
>  					 struct file *file_out, loff_t pos_out,
>  					 loff_t *count,
> -					 unsigned int remap_flags);
> +					 unsigned int remap_flags, compare_range_t cmp);
>  extern loff_t do_clone_file_range(struct file *file_in, loff_t pos_in,
>  				  struct file *file_out, loff_t pos_out,
>  				  loff_t len, unsigned int remap_flags);
> -- 
> 2.16.4
> 
^ permalink raw reply	[flat|nested] 31+ messages in thread
* Re: [PATCH 14/18] dax: memcpy before zeroing range
  2019-04-16 16:41 ` [PATCH 14/18] dax: memcpy before zeroing range Goldwyn Rodrigues
@ 2019-04-17 15:45   ` Darrick J. Wong
  2019-04-17 16:39     ` Goldwyn Rodrigues
  0 siblings, 1 reply; 31+ messages in thread
From: Darrick J. Wong @ 2019-04-17 15:45 UTC (permalink / raw)
  To: Goldwyn Rodrigues
  Cc: linux-btrfs, linux-fsdevel, jack, david, willy, hch, kilobyte,
	dsterba, nborisov, linux-nvdimm, Goldwyn Rodrigues
On Tue, Apr 16, 2019 at 11:41:50AM -0500, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues <rgoldwyn@suse.com>
> 
> However, this needed more iomap fields, so it was easier
> to pass iomap and compute inside the function rather
> than passing a log of arguments.
> 
> Note, there is subtle difference between iomap_sector and
> dax_iomap_sector(). Can we replace dax_iomap_sector with
> iomap_sector()? It would need pos & PAGE_MASK though or else
> bdev_dax_pgoff() return -EINVAL.
> 
> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
> ---
>  fs/dax.c              | 14 ++++++++++----
>  fs/iomap.c            |  9 +--------
>  include/linux/dax.h   | 11 +++++------
>  include/linux/iomap.h |  6 ++++++
>  4 files changed, 22 insertions(+), 18 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index abbe4a79f219..af94909640ea 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -1055,11 +1055,15 @@ static bool dax_range_is_aligned(struct block_device *bdev,
>  	return true;
>  }
>  
> -int __dax_zero_page_range(struct block_device *bdev,
> -		struct dax_device *dax_dev, sector_t sector,
> -		unsigned int offset, unsigned int size)
> +int __dax_zero_page_range(struct iomap *iomap, loff_t pos,
> +			  unsigned int offset, unsigned int size)
>  {
> -	if (dax_range_is_aligned(bdev, offset, size)) {
> +	sector_t sector = dax_iomap_sector(iomap, pos & PAGE_MASK);
> +	struct block_device *bdev = iomap->bdev;
> +	struct dax_device *dax_dev = iomap->dax_dev;
> +
> +	if (!(iomap->type == IOMAP_DAX_COW) &&
> +	    dax_range_is_aligned(bdev, offset, size)) {
>  		sector_t start_sector = sector + (offset >> 9);
>  
>  		return blkdev_issue_zeroout(bdev, start_sector,
> @@ -1079,6 +1083,8 @@ int __dax_zero_page_range(struct block_device *bdev,
>  			dax_read_unlock(id);
>  			return rc;
>  		}
> +		if (iomap->type == IOMAP_DAX_COW)
> +			memcpy(iomap->inline_data, kaddr, offset);
I'm confused, why are we copying into the source page before zeroing
some part of the dax device?
>  		memset(kaddr + offset, 0, size);
>  		dax_flush(dax_dev, kaddr + offset, size);
>  		dax_read_unlock(id);
> diff --git a/fs/iomap.c b/fs/iomap.c
> index abdd18e404f8..90698c854883 100644
> --- a/fs/iomap.c
> +++ b/fs/iomap.c
> @@ -98,12 +98,6 @@ iomap_apply(struct inode *inode, loff_t pos, loff_t length, unsigned flags,
>  	return written ? written : ret;
>  }
>  
> -static sector_t
> -iomap_sector(struct iomap *iomap, loff_t pos)
> -{
> -	return (iomap->addr + pos - iomap->offset) >> SECTOR_SHIFT;
> -}
> -
>  static struct iomap_page *
>  iomap_page_create(struct inode *inode, struct page *page)
>  {
> @@ -990,8 +984,7 @@ static int iomap_zero(struct inode *inode, loff_t pos, unsigned offset,
>  static int iomap_dax_zero(loff_t pos, unsigned offset, unsigned bytes,
>  		struct iomap *iomap)
>  {
> -	return __dax_zero_page_range(iomap->bdev, iomap->dax_dev,
> -			iomap_sector(iomap, pos & PAGE_MASK), offset, bytes);
> +	return __dax_zero_page_range(iomap, pos, offset, bytes);
>  }
>  
>  static loff_t
> diff --git a/include/linux/dax.h b/include/linux/dax.h
> index a11bc7b1f526..892c478d7073 100644
> --- a/include/linux/dax.h
> +++ b/include/linux/dax.h
> @@ -9,6 +9,7 @@
>  
>  typedef unsigned long dax_entry_t;
>  
> +struct iomap;
>  struct iomap_ops;
>  struct dax_device;
>  struct dax_operations {
> @@ -161,13 +162,11 @@ int dax_file_range_compare(struct inode *src, loff_t srcoff, struct inode *dest,
>                  loff_t destoff, loff_t len, bool *is_same, const struct iomap_ops *ops);
>  
>  #ifdef CONFIG_FS_DAX
> -int __dax_zero_page_range(struct block_device *bdev,
> -		struct dax_device *dax_dev, sector_t sector,
> -		unsigned int offset, unsigned int length);
> +int __dax_zero_page_range(struct iomap *iomap, loff_t pos,
> +		unsigned int offset, unsigned int size);
>  #else
> -static inline int __dax_zero_page_range(struct block_device *bdev,
> -		struct dax_device *dax_dev, sector_t sector,
> -		unsigned int offset, unsigned int length)
> +static inline int __dax_zero_page_range(struct iomap *iomap, loff_t pos,
> +		unsigned int offset, unsigned int size)
>  {
>  	return -ENXIO;
>  }
> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> index 6e885c5a38a3..3a803566dea1 100644
> --- a/include/linux/iomap.h
> +++ b/include/linux/iomap.h
> @@ -7,6 +7,7 @@
>  #include <linux/mm.h>
>  #include <linux/types.h>
>  #include <linux/mm_types.h>
> +#include <linux/blkdev.h>
>  
>  struct address_space;
>  struct fiemap_extent_info;
> @@ -120,6 +121,11 @@ static inline struct iomap_page *to_iomap_page(struct page *page)
>  	return NULL;
>  }
>  
> +static inline sector_t iomap_sector(struct iomap *iomap, loff_t pos)
> +{
> +	        return (iomap->addr + pos - iomap->offset) >> SECTOR_SHIFT;
Why the double indent?
--D
> +}
> +
>  ssize_t iomap_file_buffered_write(struct kiocb *iocb, struct iov_iter *from,
>  		const struct iomap_ops *ops);
>  int iomap_readpage(struct page *page, const struct iomap_ops *ops);
> -- 
> 2.16.4
> 
^ permalink raw reply	[flat|nested] 31+ messages in thread
* Re: [PATCH 14/18] dax: memcpy before zeroing range
  2019-04-17 15:45   ` Darrick J. Wong
@ 2019-04-17 16:39     ` Goldwyn Rodrigues
  0 siblings, 0 replies; 31+ messages in thread
From: Goldwyn Rodrigues @ 2019-04-17 16:39 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: linux-btrfs, linux-fsdevel, jack, david, willy, hch, kilobyte,
	dsterba, nborisov, linux-nvdimm
On  8:45 17/04, Darrick J. Wong wrote:
> On Tue, Apr 16, 2019 at 11:41:50AM -0500, Goldwyn Rodrigues wrote:
> > From: Goldwyn Rodrigues <rgoldwyn@suse.com>
> > 
> > However, this needed more iomap fields, so it was easier
> > to pass iomap and compute inside the function rather
> > than passing a log of arguments.
> > 
> > Note, there is subtle difference between iomap_sector and
> > dax_iomap_sector(). Can we replace dax_iomap_sector with
> > iomap_sector()? It would need pos & PAGE_MASK though or else
> > bdev_dax_pgoff() return -EINVAL.
> > 
> > Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
> > ---
> >  fs/dax.c              | 14 ++++++++++----
> >  fs/iomap.c            |  9 +--------
> >  include/linux/dax.h   | 11 +++++------
> >  include/linux/iomap.h |  6 ++++++
> >  4 files changed, 22 insertions(+), 18 deletions(-)
> > 
> > diff --git a/fs/dax.c b/fs/dax.c
> > index abbe4a79f219..af94909640ea 100644
> > --- a/fs/dax.c
> > +++ b/fs/dax.c
> > @@ -1055,11 +1055,15 @@ static bool dax_range_is_aligned(struct block_device *bdev,
> >  	return true;
> >  }
> >  
> > -int __dax_zero_page_range(struct block_device *bdev,
> > -		struct dax_device *dax_dev, sector_t sector,
> > -		unsigned int offset, unsigned int size)
> > +int __dax_zero_page_range(struct iomap *iomap, loff_t pos,
> > +			  unsigned int offset, unsigned int size)
> >  {
> > -	if (dax_range_is_aligned(bdev, offset, size)) {
> > +	sector_t sector = dax_iomap_sector(iomap, pos & PAGE_MASK);
> > +	struct block_device *bdev = iomap->bdev;
> > +	struct dax_device *dax_dev = iomap->dax_dev;
> > +
> > +	if (!(iomap->type == IOMAP_DAX_COW) &&
> > +	    dax_range_is_aligned(bdev, offset, size)) {
> >  		sector_t start_sector = sector + (offset >> 9);
> >  
> >  		return blkdev_issue_zeroout(bdev, start_sector,
> > @@ -1079,6 +1083,8 @@ int __dax_zero_page_range(struct block_device *bdev,
> >  			dax_read_unlock(id);
> >  			return rc;
> >  		}
> > +		if (iomap->type == IOMAP_DAX_COW)
> > +			memcpy(iomap->inline_data, kaddr, offset);
> 
> I'm confused, why are we copying into the source page before zeroing
> some part of the dax device?
Clearly my testing was not good enough. This should be -
memcpy(kaddr, iomap->inline_data, offset);
We copy the part which is not zeroed to new location.
> 
> >  		memset(kaddr + offset, 0, size);
> >  		dax_flush(dax_dev, kaddr + offset, size);
> >  		dax_read_unlock(id);
> > diff --git a/fs/iomap.c b/fs/iomap.c
> > index abdd18e404f8..90698c854883 100644
> > --- a/fs/iomap.c
> > +++ b/fs/iomap.c
> > @@ -98,12 +98,6 @@ iomap_apply(struct inode *inode, loff_t pos, loff_t length, unsigned flags,
> >  	return written ? written : ret;
> >  }
> >  
> > -static sector_t
> > -iomap_sector(struct iomap *iomap, loff_t pos)
> > -{
> > -	return (iomap->addr + pos - iomap->offset) >> SECTOR_SHIFT;
> > -}
> > -
> >  static struct iomap_page *
> >  iomap_page_create(struct inode *inode, struct page *page)
> >  {
> > @@ -990,8 +984,7 @@ static int iomap_zero(struct inode *inode, loff_t pos, unsigned offset,
> >  static int iomap_dax_zero(loff_t pos, unsigned offset, unsigned bytes,
> >  		struct iomap *iomap)
> >  {
> > -	return __dax_zero_page_range(iomap->bdev, iomap->dax_dev,
> > -			iomap_sector(iomap, pos & PAGE_MASK), offset, bytes);
> > +	return __dax_zero_page_range(iomap, pos, offset, bytes);
> >  }
> >  
> >  static loff_t
> > diff --git a/include/linux/dax.h b/include/linux/dax.h
> > index a11bc7b1f526..892c478d7073 100644
> > --- a/include/linux/dax.h
> > +++ b/include/linux/dax.h
> > @@ -9,6 +9,7 @@
> >  
> >  typedef unsigned long dax_entry_t;
> >  
> > +struct iomap;
> >  struct iomap_ops;
> >  struct dax_device;
> >  struct dax_operations {
> > @@ -161,13 +162,11 @@ int dax_file_range_compare(struct inode *src, loff_t srcoff, struct inode *dest,
> >                  loff_t destoff, loff_t len, bool *is_same, const struct iomap_ops *ops);
> >  
> >  #ifdef CONFIG_FS_DAX
> > -int __dax_zero_page_range(struct block_device *bdev,
> > -		struct dax_device *dax_dev, sector_t sector,
> > -		unsigned int offset, unsigned int length);
> > +int __dax_zero_page_range(struct iomap *iomap, loff_t pos,
> > +		unsigned int offset, unsigned int size);
> >  #else
> > -static inline int __dax_zero_page_range(struct block_device *bdev,
> > -		struct dax_device *dax_dev, sector_t sector,
> > -		unsigned int offset, unsigned int length)
> > +static inline int __dax_zero_page_range(struct iomap *iomap, loff_t pos,
> > +		unsigned int offset, unsigned int size)
> >  {
> >  	return -ENXIO;
> >  }
> > diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> > index 6e885c5a38a3..3a803566dea1 100644
> > --- a/include/linux/iomap.h
> > +++ b/include/linux/iomap.h
> > @@ -7,6 +7,7 @@
> >  #include <linux/mm.h>
> >  #include <linux/types.h>
> >  #include <linux/mm_types.h>
> > +#include <linux/blkdev.h>
> >  
> >  struct address_space;
> >  struct fiemap_extent_info;
> > @@ -120,6 +121,11 @@ static inline struct iomap_page *to_iomap_page(struct page *page)
> >  	return NULL;
> >  }
> >  
> > +static inline sector_t iomap_sector(struct iomap *iomap, loff_t pos)
> > +{
> > +	        return (iomap->addr + pos - iomap->offset) >> SECTOR_SHIFT;
> 
> Why the double indent?
Ahh.. I will fix this.
> 
> --D
> 
> > +}
> > +
> >  ssize_t iomap_file_buffered_write(struct kiocb *iocb, struct iov_iter *from,
> >  		const struct iomap_ops *ops);
> >  int iomap_readpage(struct page *page, const struct iomap_ops *ops);
> > -- 
> > 2.16.4
> > 
> 
-- 
Goldwyn
^ permalink raw reply	[flat|nested] 31+ messages in thread
* Re: [PATCH 04/18] dax: Introduce IOMAP_DAX_COW to CoW edges during writes
  2019-04-16 16:41 ` [PATCH 04/18] dax: Introduce IOMAP_DAX_COW to CoW edges during writes Goldwyn Rodrigues
@ 2019-04-17 16:46   ` Darrick J. Wong
  0 siblings, 0 replies; 31+ messages in thread
From: Darrick J. Wong @ 2019-04-17 16:46 UTC (permalink / raw)
  To: Goldwyn Rodrigues
  Cc: linux-btrfs, linux-fsdevel, jack, david, willy, hch, kilobyte,
	dsterba, nborisov, linux-nvdimm, Goldwyn Rodrigues
On Tue, Apr 16, 2019 at 11:41:40AM -0500, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues <rgoldwyn@suse.com>
> 
> The IOMAP_DAX_COW is a iomap type which performs copy of
> edges of data while performing a write if start/end are
> not page aligned. The source address is expected in
> iomap->inline_data.
> 
> dax_copy_edges() is a helper functions performs a copy from
> one part of the device to another for data not page aligned.
> If iomap->inline_data is NULL, it memset's the area to zero.
> 
> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
> ---
>  fs/dax.c              | 41 ++++++++++++++++++++++++++++++++++++++++-
>  include/linux/iomap.h |  1 +
>  2 files changed, 41 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index ca0671d55aa6..4b4ac51fbd16 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -1083,6 +1083,40 @@ int __dax_zero_page_range(struct block_device *bdev,
>  }
>  EXPORT_SYMBOL_GPL(__dax_zero_page_range);
>  
> +/*
> + * dax_copy_edges - Copies the part of the pages not included in
> + * 		    the write, but required for CoW because
> + * 		    offset/offset+length are not page aligned.
> + */
> +static void dax_copy_edges(struct inode *inode, loff_t pos, loff_t length,
> +			   struct iomap *iomap, void *daddr)
> +{
> +	unsigned offset = pos & (PAGE_SIZE - 1);
> +	loff_t end = pos + length;
> +	loff_t pg_end = round_up(end, PAGE_SIZE);
> +	void *saddr = iomap->inline_data;
> +	/*
> +	 * Copy the first part of the page
> +	 * Note: we pass offset as length
> +	 */
> +	if (offset) {
> +		if (saddr)
> +			memcpy(daddr, saddr, offset);
I've been wondering, do we need memcpy_mcsafe here?
> +		else
> +			memset(daddr, 0, offset);
Or here?
(Or any of the other places we call memcpy/memset in this series...)
Because I think we'd prefer to return EIO on bad pmem over a machine
check.
--D
> +	}
> +
> +	/* Copy the last part of the range */
> +	if (end < pg_end) {
> +		if (saddr)
> +			memcpy(daddr + offset + length,
> +			       saddr + offset + length,	pg_end - end);
> +		else
> +			memset(daddr + offset + length, 0,
> +					pg_end - end);
> +	}
> +}
> +
>  static loff_t
>  dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
>  		struct iomap *iomap)
> @@ -1104,9 +1138,11 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
>  			return iov_iter_zero(min(length, end - pos), iter);
>  	}
>  
> -	if (WARN_ON_ONCE(iomap->type != IOMAP_MAPPED))
> +	if (WARN_ON_ONCE(iomap->type != IOMAP_MAPPED
> +			 && iomap->type != IOMAP_DAX_COW))
Usually the '&&' goes on the first line, right?
>  		return -EIO;
>  
> +
>  	/*
>  	 * Write can allocate block for an area which has a hole page mapped
>  	 * into page tables. We have to tear down these mappings so that data
> @@ -1143,6 +1179,9 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
>  			break;
>  		}
>  
> +		if (iomap->type == IOMAP_DAX_COW)
> +			dax_copy_edges(inode, pos, length, iomap, kaddr);
No return value?  So the pmem copy never fails?
--D
> +
>  		map_len = PFN_PHYS(map_len);
>  		kaddr += offset;
>  		map_len -= offset;
> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> index 0fefb5455bda..6e885c5a38a3 100644
> --- a/include/linux/iomap.h
> +++ b/include/linux/iomap.h
> @@ -25,6 +25,7 @@ struct vm_fault;
>  #define IOMAP_MAPPED	0x03	/* blocks allocated at @addr */
>  #define IOMAP_UNWRITTEN	0x04	/* blocks allocated at @addr in unwritten state */
>  #define IOMAP_INLINE	0x05	/* data inline in the inode */
> +#define IOMAP_DAX_COW	0x06	/* Copy data pointed by inline_data before write*/
>  
>  /*
>   * Flags for all iomap mappings:
> -- 
> 2.16.4
> 
^ permalink raw reply	[flat|nested] 31+ messages in thread
* Re: [PATCH v3 00/18] btrfs dax support
  2019-04-16 16:41 [PATCH v3 00/18] btrfs dax support Goldwyn Rodrigues
                   ` (17 preceding siblings ...)
  2019-04-16 16:41 ` [PATCH 18/18] btrfs: trace functions for btrfs_iomap_begin/end Goldwyn Rodrigues
@ 2019-04-17 16:49 ` Adam Borowski
  18 siblings, 0 replies; 31+ messages in thread
From: Adam Borowski @ 2019-04-17 16:49 UTC (permalink / raw)
  To: Goldwyn Rodrigues
  Cc: linux-btrfs, linux-fsdevel, jack, david, willy, hch, darrick.wong,
	dsterba, nborisov, linux-nvdimm
On Tue, Apr 16, 2019 at 11:41:36AM -0500, Goldwyn Rodrigues wrote:
> This patch set adds support for dax on the BTRFS filesystem.
> In order to support for CoW for btrfs, there were changes which had to be
> made to the dax handling. The important one is copying blocks into the
> same dax device before using them which is performed by iomap
> type IOMAP_DAX_COW.
I'm afraid that PMDK's testsuite nearly instantly makes this patchset
explode in different ways.
It'd probably be a waste of your time to require a round-trip to report such
fails and re-test.  As PMDK is probably the biggest userspace component, and
exercises DAX paths in interesting ways, I guess it'd be best if you ran its
testsuite locally until it no longer kills the kernel.
Quick start:
git clone https://github.com/pmem/pmdk
deps: libndctl-dev[el], libdaxctl-dev[el]; rest are mostly optional
(full deps for Fedora:
https://github.com/pmem/pmdk/blob/master/utils/docker/images/Dockerfile.fedora-28
SuSE's should be similar)
cp src/test/testconfig.sh{.example,}
-- then edit, you'd want:
PMEM_FS_DIR=/mnt/pmem      # where you have the dax filesystem on
KEEP_GOING=y               # to see more than one broken test
make
make test    # to build tests
make check   # to build and run all tests
cd src/test && ./RUNTESTS a_single_test
Obviously, PMDK's tests may fail due to some difference of behaviour between
ext4, xfs, and your btrfs implementation -- that'll require some
cooperation.  But crashing the kernel is something where the blame is quite
obvious.
Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ 
⣾⠁⢠⠒⠀⣿⡁ A dumb species has no way to open a tuna can.
⢿⡄⠘⠷⠚⠋⠀ A smart species invents a can opener.
⠈⠳⣄⠀⠀⠀⠀ A master species delegates.
^ permalink raw reply	[flat|nested] 31+ messages in thread
* Re: [PATCH 08/18] dax: memcpy page in case of IOMAP_DAX_COW for mmap faults
  2019-04-16 16:41 ` [PATCH 08/18] dax: memcpy page in case of IOMAP_DAX_COW for mmap faults Goldwyn Rodrigues
@ 2019-04-17 16:52   ` Darrick J. Wong
  0 siblings, 0 replies; 31+ messages in thread
From: Darrick J. Wong @ 2019-04-17 16:52 UTC (permalink / raw)
  To: Goldwyn Rodrigues
  Cc: linux-btrfs, linux-fsdevel, jack, david, willy, hch, kilobyte,
	dsterba, nborisov, linux-nvdimm, Goldwyn Rodrigues
On Tue, Apr 16, 2019 at 11:41:44AM -0500, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues <rgoldwyn@suse.com>
> 
> Change dax_iomap_pfn to return the address as well in order to
> use it for performing a memcpy in case the type is IOMAP_DAX_COW.
> 
> Question:
> The sequence of bdev_dax_pgoff() and dax_direct_access() is
> used multiple times to calculate address and pfn's. Would it make
> sense to call it while calculating address as well to reduce code?
> 
> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
> ---
>  fs/dax.c | 16 ++++++++++++----
>  1 file changed, 12 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index 4b4ac51fbd16..45fc2e18969a 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -983,7 +983,7 @@ static sector_t dax_iomap_sector(struct iomap *iomap, loff_t pos)
>  }
>  
>  static int dax_iomap_pfn(struct iomap *iomap, loff_t pos, size_t size,
> -			 pfn_t *pfnp)
> +			 pfn_t *pfnp, void **addr)
>  {
>  	const sector_t sector = dax_iomap_sector(iomap, pos);
>  	pgoff_t pgoff;
> @@ -995,7 +995,7 @@ static int dax_iomap_pfn(struct iomap *iomap, loff_t pos, size_t size,
>  		return rc;
>  	id = dax_read_lock();
>  	length = dax_direct_access(iomap->dax_dev, pgoff, PHYS_PFN(size),
> -				   NULL, pfnp);
> +				   addr, pfnp);
>  	if (length < 0) {
>  		rc = length;
>  		goto out;
> @@ -1280,6 +1280,7 @@ static vm_fault_t dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp,
>  	XA_STATE(xas, &mapping->i_pages, vmf->pgoff);
>  	struct inode *inode = mapping->host;
>  	unsigned long vaddr = vmf->address;
> +	void *addr;
>  	loff_t pos = (loff_t)vmf->pgoff << PAGE_SHIFT;
>  	struct iomap iomap = { 0 };
>  	unsigned flags = IOMAP_FAULT;
> @@ -1369,16 +1370,23 @@ static vm_fault_t dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp,
>  	sync = dax_fault_is_synchronous(flags, vma, &iomap);
>  
>  	switch (iomap.type) {
> +	case IOMAP_DAX_COW:
>  	case IOMAP_MAPPED:
>  		if (iomap.flags & IOMAP_F_NEW) {
>  			count_vm_event(PGMAJFAULT);
>  			count_memcg_event_mm(vma->vm_mm, PGMAJFAULT);
>  			major = VM_FAULT_MAJOR;
>  		}
> -		error = dax_iomap_pfn(&iomap, pos, PAGE_SIZE, &pfn);
> +		error = dax_iomap_pfn(&iomap, pos, PAGE_SIZE, &pfn, &addr);
>  		if (error < 0)
>  			goto error_finish_iomap;
>  
> +		if (iomap.type == IOMAP_DAX_COW) {
> +			if (iomap.inline_data)
> +				memcpy(addr, iomap.inline_data, PAGE_SIZE);
Same memcpy_mcsafe question from my reply to patch 4 applies here.
> +			else
> +				memset(addr, 0, PAGE_SIZE);
> +		}
>  		entry = dax_insert_entry(&xas, mapping, vmf, entry, pfn,
>  						 0, write && !sync);
>  
> @@ -1577,7 +1585,7 @@ static vm_fault_t dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp,
>  
>  	switch (iomap.type) {
>  	case IOMAP_MAPPED:
> -		error = dax_iomap_pfn(&iomap, pos, PMD_SIZE, &pfn);
> +		error = dax_iomap_pfn(&iomap, pos, PMD_SIZE, &pfn, NULL);
Same (unanswered) question from the v2 series -- doesn't a PMD fault
also require handling IOMAP_DAX_COW?
--D
>  		if (error < 0)
>  			goto finish_iomap;
>  
> -- 
> 2.16.4
> 
^ permalink raw reply	[flat|nested] 31+ messages in thread
* [PATCH 10/18] dax: replace mmap entry in case of CoW
  2019-04-29 17:26 [PATCH v4 " Goldwyn Rodrigues
@ 2019-04-29 17:26 ` Goldwyn Rodrigues
  2019-05-21 17:35   ` Darrick J. Wong
  2019-05-23 13:38   ` Jan Kara
  0 siblings, 2 replies; 31+ messages in thread
From: Goldwyn Rodrigues @ 2019-04-29 17:26 UTC (permalink / raw)
  To: linux-btrfs
  Cc: kilobyte, linux-fsdevel, jack, david, willy, hch, darrick.wong,
	dsterba, nborisov, linux-nvdimm, Goldwyn Rodrigues
From: Goldwyn Rodrigues <rgoldwyn@suse.com>
We replace the existing entry to the newly allocated one
in case of CoW. Also, we mark the entry as PAGECACHE_TAG_TOWRITE
so writeback marks this entry as writeprotected. This
helps us snapshots so new write pagefaults after snapshots
trigger a CoW.
btrfs does not support hugepages so we don't handle PMD.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
 fs/dax.c | 36 ++++++++++++++++++++++++++++--------
 1 file changed, 28 insertions(+), 8 deletions(-)
diff --git a/fs/dax.c b/fs/dax.c
index 718b1632a39d..07e8ff20161d 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -700,6 +700,9 @@ static int copy_user_dax(struct block_device *bdev, struct dax_device *dax_dev,
 	return 0;
 }
 
+#define DAX_IF_DIRTY		(1ULL << 0)
+#define DAX_IF_COW		(1ULL << 1)
+
 /*
  * By this point grab_mapping_entry() has ensured that we have a locked entry
  * of the appropriate size so we don't have to worry about downgrading PMDs to
@@ -709,14 +712,17 @@ static int copy_user_dax(struct block_device *bdev, struct dax_device *dax_dev,
  */
 static void *dax_insert_entry(struct xa_state *xas,
 		struct address_space *mapping, struct vm_fault *vmf,
-		void *entry, pfn_t pfn, unsigned long flags, bool dirty)
+		void *entry, pfn_t pfn, unsigned long flags,
+		unsigned long insert_flags)
 {
 	void *new_entry = dax_make_entry(pfn, flags);
+	bool dirty = insert_flags & DAX_IF_DIRTY;
+	bool cow = insert_flags & DAX_IF_COW;
 
 	if (dirty)
 		__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
 
-	if (dax_is_zero_entry(entry) && !(flags & DAX_ZERO_PAGE)) {
+	if (cow || (dax_is_zero_entry(entry) && !(flags & DAX_ZERO_PAGE))) {
 		unsigned long index = xas->xa_index;
 		/* we are replacing a zero page with block mapping */
 		if (dax_is_pmd_entry(entry))
@@ -728,12 +734,12 @@ static void *dax_insert_entry(struct xa_state *xas,
 
 	xas_reset(xas);
 	xas_lock_irq(xas);
-	if (dax_entry_size(entry) != dax_entry_size(new_entry)) {
+	if (cow || (dax_entry_size(entry) != dax_entry_size(new_entry))) {
 		dax_disassociate_entry(entry, mapping, false);
 		dax_associate_entry(new_entry, mapping, vmf->vma, vmf->address);
 	}
 
-	if (dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) {
+	if (cow || dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) {
 		/*
 		 * Only swap our new entry into the page cache if the current
 		 * entry is a zero page or an empty entry.  If a normal PTE or
@@ -753,6 +759,9 @@ static void *dax_insert_entry(struct xa_state *xas,
 	if (dirty)
 		xas_set_mark(xas, PAGECACHE_TAG_DIRTY);
 
+	if (cow)
+		xas_set_mark(xas, PAGECACHE_TAG_TOWRITE);
+
 	xas_unlock_irq(xas);
 	return entry;
 }
@@ -1032,7 +1041,7 @@ static vm_fault_t dax_load_hole(struct xa_state *xas,
 	vm_fault_t ret;
 
 	*entry = dax_insert_entry(xas, mapping, vmf, *entry, pfn,
-			DAX_ZERO_PAGE, false);
+			DAX_ZERO_PAGE, 0);
 
 	ret = vmf_insert_mixed(vmf->vma, vaddr, pfn);
 	trace_dax_load_hole(inode, vmf, ret);
@@ -1296,6 +1305,7 @@ static vm_fault_t dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp,
 	vm_fault_t ret = 0;
 	void *entry;
 	pfn_t pfn;
+	unsigned long insert_flags = 0;
 
 	trace_dax_pte_fault(inode, vmf, ret);
 	/*
@@ -1357,6 +1367,8 @@ static vm_fault_t dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp,
 			error = copy_user_dax(iomap.bdev, iomap.dax_dev,
 					sector, PAGE_SIZE, vmf->cow_page, vaddr);
 			break;
+		case IOMAP_DAX_COW:
+			/* Should not be setting this - fallthrough */
 		default:
 			WARN_ON_ONCE(1);
 			error = -EIO;
@@ -1377,6 +1389,8 @@ static vm_fault_t dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp,
 
 	switch (iomap.type) {
 	case IOMAP_DAX_COW:
+		insert_flags |= DAX_IF_COW;
+		/* fallthrough */
 	case IOMAP_MAPPED:
 		if (iomap.flags & IOMAP_F_NEW) {
 			count_vm_event(PGMAJFAULT);
@@ -1396,8 +1410,10 @@ static vm_fault_t dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp,
 			} else
 				memset(addr, 0, PAGE_SIZE);
 		}
+		if (write && !sync)
+			insert_flags |= DAX_IF_DIRTY;
 		entry = dax_insert_entry(&xas, mapping, vmf, entry, pfn,
-						 0, write && !sync);
+						 0, insert_flags);
 
 		/*
 		 * If we are doing synchronous page fault and inode needs fsync,
@@ -1478,7 +1494,7 @@ static vm_fault_t dax_pmd_load_hole(struct xa_state *xas, struct vm_fault *vmf,
 
 	pfn = page_to_pfn_t(zero_page);
 	*entry = dax_insert_entry(xas, mapping, vmf, *entry, pfn,
-			DAX_PMD | DAX_ZERO_PAGE, false);
+			DAX_PMD | DAX_ZERO_PAGE, 0);
 
 	if (arch_needs_pgtable_deposit()) {
 		pgtable = pte_alloc_one(vma->vm_mm);
@@ -1528,6 +1544,7 @@ static vm_fault_t dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp,
 	loff_t pos;
 	int error;
 	pfn_t pfn;
+	unsigned long insert_flags = 0;
 
 	/*
 	 * Check whether offset isn't beyond end of file now. Caller is
@@ -1612,8 +1629,11 @@ static vm_fault_t dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp,
 		if (error < 0)
 			goto finish_iomap;
 
+		if (write && !sync)
+			insert_flags |= DAX_IF_DIRTY;
+
 		entry = dax_insert_entry(&xas, mapping, vmf, entry, pfn,
-						DAX_PMD, write && !sync);
+						DAX_PMD, insert_flags);
 
 		/*
 		 * If we are doing synchronous page fault and inode needs fsync,
-- 
2.16.4
^ permalink raw reply related	[flat|nested] 31+ messages in thread
* Re: [PATCH 10/18] dax: replace mmap entry in case of CoW
  2019-04-29 17:26 ` [PATCH 10/18] dax: replace mmap entry in case of CoW Goldwyn Rodrigues
@ 2019-05-21 17:35   ` Darrick J. Wong
  2019-05-23 13:38   ` Jan Kara
  1 sibling, 0 replies; 31+ messages in thread
From: Darrick J. Wong @ 2019-05-21 17:35 UTC (permalink / raw)
  To: Goldwyn Rodrigues
  Cc: linux-btrfs, kilobyte, linux-fsdevel, jack, david, willy, hch,
	dsterba, nborisov, linux-nvdimm, Goldwyn Rodrigues
On Mon, Apr 29, 2019 at 12:26:41PM -0500, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues <rgoldwyn@suse.com>
> 
> We replace the existing entry to the newly allocated one
> in case of CoW. Also, we mark the entry as PAGECACHE_TAG_TOWRITE
> so writeback marks this entry as writeprotected. This
> helps us snapshots so new write pagefaults after snapshots
> trigger a CoW.
> 
> btrfs does not support hugepages so we don't handle PMD.
> 
> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
> ---
>  fs/dax.c | 36 ++++++++++++++++++++++++++++--------
>  1 file changed, 28 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index 718b1632a39d..07e8ff20161d 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -700,6 +700,9 @@ static int copy_user_dax(struct block_device *bdev, struct dax_device *dax_dev,
>  	return 0;
>  }
>  
> +#define DAX_IF_DIRTY		(1ULL << 0)
> +#define DAX_IF_COW		(1ULL << 1)
> +
>  /*
>   * By this point grab_mapping_entry() has ensured that we have a locked entry
>   * of the appropriate size so we don't have to worry about downgrading PMDs to
> @@ -709,14 +712,17 @@ static int copy_user_dax(struct block_device *bdev, struct dax_device *dax_dev,
>   */
>  static void *dax_insert_entry(struct xa_state *xas,
>  		struct address_space *mapping, struct vm_fault *vmf,
> -		void *entry, pfn_t pfn, unsigned long flags, bool dirty)
> +		void *entry, pfn_t pfn, unsigned long flags,
> +		unsigned long insert_flags)
I think unsigned int would have sufficed here.
>  {
>  	void *new_entry = dax_make_entry(pfn, flags);
> +	bool dirty = insert_flags & DAX_IF_DIRTY;
> +	bool cow = insert_flags & DAX_IF_COW;
>  
>  	if (dirty)
>  		__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
>  
> -	if (dax_is_zero_entry(entry) && !(flags & DAX_ZERO_PAGE)) {
> +	if (cow || (dax_is_zero_entry(entry) && !(flags & DAX_ZERO_PAGE))) {
>  		unsigned long index = xas->xa_index;
>  		/* we are replacing a zero page with block mapping */
>  		if (dax_is_pmd_entry(entry))
> @@ -728,12 +734,12 @@ static void *dax_insert_entry(struct xa_state *xas,
>  
>  	xas_reset(xas);
>  	xas_lock_irq(xas);
> -	if (dax_entry_size(entry) != dax_entry_size(new_entry)) {
> +	if (cow || (dax_entry_size(entry) != dax_entry_size(new_entry))) {
>  		dax_disassociate_entry(entry, mapping, false);
>  		dax_associate_entry(new_entry, mapping, vmf->vma, vmf->address);
>  	}
>  
> -	if (dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) {
> +	if (cow || dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) {
>  		/*
>  		 * Only swap our new entry into the page cache if the current
>  		 * entry is a zero page or an empty entry.  If a normal PTE or
> @@ -753,6 +759,9 @@ static void *dax_insert_entry(struct xa_state *xas,
>  	if (dirty)
>  		xas_set_mark(xas, PAGECACHE_TAG_DIRTY);
>  
> +	if (cow)
> +		xas_set_mark(xas, PAGECACHE_TAG_TOWRITE);
> +
>  	xas_unlock_irq(xas);
>  	return entry;
>  }
> @@ -1032,7 +1041,7 @@ static vm_fault_t dax_load_hole(struct xa_state *xas,
>  	vm_fault_t ret;
>  
>  	*entry = dax_insert_entry(xas, mapping, vmf, *entry, pfn,
> -			DAX_ZERO_PAGE, false);
> +			DAX_ZERO_PAGE, 0);
>  
>  	ret = vmf_insert_mixed(vmf->vma, vaddr, pfn);
>  	trace_dax_load_hole(inode, vmf, ret);
> @@ -1296,6 +1305,7 @@ static vm_fault_t dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp,
>  	vm_fault_t ret = 0;
>  	void *entry;
>  	pfn_t pfn;
> +	unsigned long insert_flags = 0;
>  
>  	trace_dax_pte_fault(inode, vmf, ret);
>  	/*
> @@ -1357,6 +1367,8 @@ static vm_fault_t dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp,
>  			error = copy_user_dax(iomap.bdev, iomap.dax_dev,
>  					sector, PAGE_SIZE, vmf->cow_page, vaddr);
>  			break;
> +		case IOMAP_DAX_COW:
> +			/* Should not be setting this - fallthrough */
>  		default:
>  			WARN_ON_ONCE(1);
>  			error = -EIO;
> @@ -1377,6 +1389,8 @@ static vm_fault_t dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp,
>  
>  	switch (iomap.type) {
>  	case IOMAP_DAX_COW:
> +		insert_flags |= DAX_IF_COW;
> +		/* fallthrough */
>  	case IOMAP_MAPPED:
>  		if (iomap.flags & IOMAP_F_NEW) {
>  			count_vm_event(PGMAJFAULT);
> @@ -1396,8 +1410,10 @@ static vm_fault_t dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp,
>  			} else
>  				memset(addr, 0, PAGE_SIZE);
>  		}
> +		if (write && !sync)
> +			insert_flags |= DAX_IF_DIRTY;
>  		entry = dax_insert_entry(&xas, mapping, vmf, entry, pfn,
> -						 0, write && !sync);
> +						 0, insert_flags);
>  
>  		/*
>  		 * If we are doing synchronous page fault and inode needs fsync,
> @@ -1478,7 +1494,7 @@ static vm_fault_t dax_pmd_load_hole(struct xa_state *xas, struct vm_fault *vmf,
>  
>  	pfn = page_to_pfn_t(zero_page);
>  	*entry = dax_insert_entry(xas, mapping, vmf, *entry, pfn,
> -			DAX_PMD | DAX_ZERO_PAGE, false);
> +			DAX_PMD | DAX_ZERO_PAGE, 0);
>  
>  	if (arch_needs_pgtable_deposit()) {
>  		pgtable = pte_alloc_one(vma->vm_mm);
> @@ -1528,6 +1544,7 @@ static vm_fault_t dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp,
>  	loff_t pos;
>  	int error;
>  	pfn_t pfn;
> +	unsigned long insert_flags = 0;
>  
>  	/*
>  	 * Check whether offset isn't beyond end of file now. Caller is
> @@ -1612,8 +1629,11 @@ static vm_fault_t dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp,
>  		if (error < 0)
>  			goto finish_iomap;
>  
> +		if (write && !sync)
> +			insert_flags |= DAX_IF_DIRTY;
> +
>  		entry = dax_insert_entry(&xas, mapping, vmf, entry, pfn,
> -						DAX_PMD, write && !sync);
> +						DAX_PMD, insert_flags);
Otherwise, this seems reasonable enough to me.  What do the DAX
developers think?
FWIW I'm not thrilled by the fact that Goldwyn isn't wiring up the
(rather similar looking) PMD code paths, but I grok that btrfs doesn't
support hugepages so he has no way to test that the code path actually
works so I'm ok with letting that go (until XFS joins the party) so long
as the pmd code path changes aren't too dissimilar to the pte code
paths.  Is that true?
--D
>  
>  		/*
>  		 * If we are doing synchronous page fault and inode needs fsync,
> -- 
> 2.16.4
> 
^ permalink raw reply	[flat|nested] 31+ messages in thread
* Re: [PATCH 10/18] dax: replace mmap entry in case of CoW
  2019-04-29 17:26 ` [PATCH 10/18] dax: replace mmap entry in case of CoW Goldwyn Rodrigues
  2019-05-21 17:35   ` Darrick J. Wong
@ 2019-05-23 13:38   ` Jan Kara
  1 sibling, 0 replies; 31+ messages in thread
From: Jan Kara @ 2019-05-23 13:38 UTC (permalink / raw)
  To: Goldwyn Rodrigues
  Cc: linux-btrfs, kilobyte, linux-fsdevel, jack, david, willy, hch,
	darrick.wong, dsterba, nborisov, linux-nvdimm, Goldwyn Rodrigues
On Mon 29-04-19 12:26:41, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues <rgoldwyn@suse.com>
> 
> We replace the existing entry to the newly allocated one
> in case of CoW. Also, we mark the entry as PAGECACHE_TAG_TOWRITE
> so writeback marks this entry as writeprotected. This
> helps us snapshots so new write pagefaults after snapshots
> trigger a CoW.
I don't understand why do you need to mark the new entry with
PAGECACHE_TAG_TOWRITE. dax_insert_entry() will unmap the entry from all
page tables so what's there left to writeprotect?
>  /*
>   * By this point grab_mapping_entry() has ensured that we have a locked entry
>   * of the appropriate size so we don't have to worry about downgrading PMDs to
> @@ -709,14 +712,17 @@ static int copy_user_dax(struct block_device *bdev, struct dax_device *dax_dev,
>   */
>  static void *dax_insert_entry(struct xa_state *xas,
>  		struct address_space *mapping, struct vm_fault *vmf,
> -		void *entry, pfn_t pfn, unsigned long flags, bool dirty)
> +		void *entry, pfn_t pfn, unsigned long flags,
> +		unsigned long insert_flags)
>  {
>  	void *new_entry = dax_make_entry(pfn, flags);
> +	bool dirty = insert_flags & DAX_IF_DIRTY;
> +	bool cow = insert_flags & DAX_IF_COW;
Does 'cow' really need to be a separate flag? dax_insert_entry() can just
figure out the right thing to do on its own based on old entry value and
new entry to be inserted...
>  
>  	if (dirty)
>  		__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
>  
> -	if (dax_is_zero_entry(entry) && !(flags & DAX_ZERO_PAGE)) {
> +	if (cow || (dax_is_zero_entry(entry) && !(flags & DAX_ZERO_PAGE))) {
E.g. here we need to unmap if old entry is not 'empty' and the pfns differ
(well, the pfns differ check should better be done like I outline below to
make pmd + pte match work correctly).
>  		unsigned long index = xas->xa_index;
>  		/* we are replacing a zero page with block mapping */
>  		if (dax_is_pmd_entry(entry))
> @@ -728,12 +734,12 @@ static void *dax_insert_entry(struct xa_state *xas,
>  
>  	xas_reset(xas);
>  	xas_lock_irq(xas);
> -	if (dax_entry_size(entry) != dax_entry_size(new_entry)) {
> +	if (cow || (dax_entry_size(entry) != dax_entry_size(new_entry))) {
This needs to be done if entries are different at all...
>  		dax_disassociate_entry(entry, mapping, false);
>  		dax_associate_entry(new_entry, mapping, vmf->vma, vmf->address);
>  	}
>  
> -	if (dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) {
> +	if (cow || dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) {
This is the only place that will be a bit more subtle - you need to check
whether the new entry is not a subset of the old one (i.e., a PTE inside a
PMD) and skip setting in that case. So something like:
	if (xa_to_value(new_entry) | DAX_LOCKED == xa_to_value(entry) ||
	    (dax_is_pmd_entry(entry) && dax_is_pte_entry(new_entry) &&
	     dax_to_pfn(entry) + (xas->xa_index & PG_PMD_COLOUR) ==
	     dax_to_pfn(new_entry))) {
		/* New entry is a subset of the current one? Skip update... */
		xas_load(xas);
	} else {
		do work...
	}
								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply	[flat|nested] 31+ messages in thread
end of thread, other threads:[~2019-05-23 13:38 UTC | newest]
Thread overview: 31+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2019-04-16 16:41 [PATCH v3 00/18] btrfs dax support Goldwyn Rodrigues
2019-04-16 16:41 ` [PATCH 01/18] btrfs: create a mount option for dax Goldwyn Rodrigues
2019-04-16 16:52   ` Dan Williams
2019-04-16 16:41 ` [PATCH 02/18] btrfs: Carve out btrfs_get_extent_map_write() out of btrfs_get_blocks_write() Goldwyn Rodrigues
2019-04-16 23:45   ` Elliott, Robert (Servers)
2019-04-16 16:41 ` [PATCH 03/18] btrfs: basic dax read Goldwyn Rodrigues
2019-04-16 16:41 ` [PATCH 04/18] dax: Introduce IOMAP_DAX_COW to CoW edges during writes Goldwyn Rodrigues
2019-04-17 16:46   ` Darrick J. Wong
2019-04-16 16:41 ` [PATCH 05/18] btrfs: return whether extent is nocow or not Goldwyn Rodrigues
2019-04-16 16:41 ` [PATCH 06/18] btrfs: Rename __endio_write_update_ordered() to btrfs_update_ordered_extent() Goldwyn Rodrigues
2019-04-16 16:41 ` [PATCH 07/18] btrfs: add dax write support Goldwyn Rodrigues
2019-04-16 16:41 ` [PATCH 08/18] dax: memcpy page in case of IOMAP_DAX_COW for mmap faults Goldwyn Rodrigues
2019-04-17 16:52   ` Darrick J. Wong
2019-04-16 16:41 ` [PATCH 09/18] btrfs: Add dax specific address_space_operations Goldwyn Rodrigues
2019-04-16 16:41 ` [PATCH 10/18] dax: replace mmap entry in case of CoW Goldwyn Rodrigues
2019-04-17 15:24   ` Darrick J. Wong
2019-04-16 16:41 ` [PATCH 11/18] btrfs: add dax mmap support Goldwyn Rodrigues
2019-04-16 16:41 ` [PATCH 12/18] btrfs: allow MAP_SYNC mmap Goldwyn Rodrigues
2019-04-16 16:41 ` [PATCH 13/18] fs: dedup file range to use a compare function Goldwyn Rodrigues
2019-04-17 15:36   ` Darrick J. Wong
2019-04-16 16:41 ` [PATCH 14/18] dax: memcpy before zeroing range Goldwyn Rodrigues
2019-04-17 15:45   ` Darrick J. Wong
2019-04-17 16:39     ` Goldwyn Rodrigues
2019-04-16 16:41 ` [PATCH 15/18] btrfs: handle dax page zeroing Goldwyn Rodrigues
2019-04-16 16:41 ` [PATCH 16/18] btrfs: Writeprotect mmap pages on snapshot Goldwyn Rodrigues
2019-04-16 16:41 ` [PATCH 17/18] btrfs: Disable dax-based defrag and send Goldwyn Rodrigues
2019-04-16 16:41 ` [PATCH 18/18] btrfs: trace functions for btrfs_iomap_begin/end Goldwyn Rodrigues
2019-04-17 16:49 ` [PATCH v3 00/18] btrfs dax support Adam Borowski
  -- strict thread matches above, loose matches on Subject: below --
2019-04-29 17:26 [PATCH v4 " Goldwyn Rodrigues
2019-04-29 17:26 ` [PATCH 10/18] dax: replace mmap entry in case of CoW Goldwyn Rodrigues
2019-05-21 17:35   ` Darrick J. Wong
2019-05-23 13:38   ` Jan Kara
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).