* [PATCH v3 00/18] btrfs dax support
@ 2019-04-16 16:41 Goldwyn Rodrigues
2019-04-16 16:41 ` [PATCH 01/18] btrfs: create a mount option for dax Goldwyn Rodrigues
` (18 more replies)
0 siblings, 19 replies; 45+ messages in thread
From: Goldwyn Rodrigues @ 2019-04-16 16:41 UTC (permalink / raw)
To: linux-btrfs
Cc: kilobyte, jack, darrick.wong, nborisov, linux-nvdimm, david,
dsterba, willy, linux-fsdevel, hch
This patch set adds support for dax on the BTRFS filesystem.
In order to support CoW on btrfs, changes had to be made to the dax
handling. The important one is copying blocks within the same dax device
before using them, which is performed by the new iomap type
IOMAP_DAX_COW.
Snapshotting and CoW features are supported (including mmap preservation
across snapshots).
Git: https://github.com/goldwynr/linux/tree/btrfs-dax
Changes since v2:
- Created a new type IOMAP_DAX_COW as opposed to flag IOMAP_F_COW
- CoW source address is presented in iomap.inline_data
- Split the patches to more elaborate dax/iomap patches
Changes since v1:
- use iomap instead of redoing everything in btrfs
- support for mmap writeprotecting on snapshotting
fs/btrfs/Makefile | 1
fs/btrfs/ctree.h | 38 +++++
fs/btrfs/dax.c | 288 +++++++++++++++++++++++++++++++++++++++++--
fs/btrfs/disk-io.c | 4
fs/btrfs/file.c | 37 ++++-
fs/btrfs/inode.c | 114 ++++++++++++-----
fs/btrfs/ioctl.c | 29 +++-
fs/btrfs/send.c | 4
fs/btrfs/super.c | 30 ++++
fs/dax.c | 152 ++++++++++++++++++++--
fs/iomap.c | 9 -
fs/ocfs2/file.c | 2
fs/read_write.c | 10 -
fs/xfs/xfs_reflink.c | 2
include/linux/dax.h | 13 +
include/linux/fs.h | 7 -
include/linux/iomap.h | 7 +
include/trace/events/btrfs.h | 56 ++++++++
18 files changed, 717 insertions(+), 86 deletions(-)
--
Goldwyn
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm
* [PATCH 01/18] btrfs: create a mount option for dax
  2019-04-16 16:41 [PATCH v3 00/18] btrfs dax support Goldwyn Rodrigues
@ 2019-04-16 16:41 ` Goldwyn Rodrigues
  2019-04-16 16:52   ` Dan Williams
  2019-04-16 16:41 ` [PATCH 02/18] btrfs: Carve out btrfs_get_extent_map_write() out of btrfs_get_blocks_write() Goldwyn Rodrigues
  ` (17 subsequent siblings)
  18 siblings, 1 reply; 45+ messages in thread
From: Goldwyn Rodrigues @ 2019-04-16 16:41 UTC (permalink / raw)
  To: linux-btrfs
  Cc: kilobyte, jack, darrick.wong, nborisov, linux-nvdimm, david,
      dsterba, willy, linux-fsdevel, hch, Goldwyn Rodrigues

From: Goldwyn Rodrigues <rgoldwyn@suse.com>

This sets S_DAX in inode->i_flags, which can be used with IS_DAX().

The dax option is restricted to non-multi-device mounts. dax interacts
with the device directly instead of using bio, so the bio hooks we use
for multi-device cannot be performed here. While regular reads/writes
could be handled with RAID0/1, mmap() is still an issue.

Auto-setting the free space tree, because dealing with the free space
inode (specifically readpages) is a nightmare.
Auto-setting nodatasum, because we don't get a callback for writing
checksums after mmap()s.
Deny compression, because it does not work with direct I/O.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
 fs/btrfs/ctree.h   |  2 ++
 fs/btrfs/disk-io.c |  4 ++++
 fs/btrfs/ioctl.c   |  5 ++++-
 fs/btrfs/super.c   | 30 ++++++++++++++++++++++++++++++
 4 files changed, 40 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index b3642367a595..8ca1c0d120f4 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1067,6 +1067,7 @@ struct btrfs_fs_info {
 	u32 metadata_ratio;
 
 	void *bdev_holder;
+	struct dax_device *dax_dev;
 
 	/* private scrub information */
 	struct mutex scrub_lock;
@@ -1442,6 +1443,7 @@ static inline u32 BTRFS_MAX_XATTR_SIZE(const struct btrfs_fs_info *info)
 #define BTRFS_MOUNT_FREE_SPACE_TREE	(1 << 26)
 #define BTRFS_MOUNT_NOLOGREPLAY		(1 << 27)
 #define BTRFS_MOUNT_REF_VERIFY		(1 << 28)
+#define BTRFS_MOUNT_DAX			(1 << 29)
 
 #define BTRFS_DEFAULT_COMMIT_INTERVAL	(30)
 #define BTRFS_DEFAULT_MAX_INLINE	(2048)
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 6fe9197f6ee4..2bbb63b2fcff 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -16,6 +16,7 @@
 #include <linux/uuid.h>
 #include <linux/semaphore.h>
 #include <linux/error-injection.h>
+#include <linux/dax.h>
 #include <linux/crc32c.h>
 #include <linux/sched/mm.h>
 #include <asm/unaligned.h>
@@ -2805,6 +2806,8 @@ int open_ctree(struct super_block *sb,
 		goto fail_alloc;
 	}
 
+	fs_info->dax_dev = fs_dax_get_by_bdev(fs_devices->latest_bdev);
+
 	/*
 	 * We want to check superblock checksum, the type is stored inside.
 	 * Pass the whole disk block of size BTRFS_SUPER_INFO_SIZE (4k).
@@ -4043,6 +4046,7 @@ void close_ctree(struct btrfs_fs_info *fs_info)
 #endif
 
 	btrfs_close_devices(fs_info->fs_devices);
+	fs_put_dax(fs_info->dax_dev);
 	btrfs_mapping_tree_free(&fs_info->mapping_tree);
 
 	percpu_counter_destroy(&fs_info->dirty_metadata_bytes);
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index cd4e693406a0..0138119cd9a3 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -149,8 +149,11 @@ void btrfs_sync_inode_flags_to_i_flags(struct inode *inode)
 	if (binode->flags & BTRFS_INODE_DIRSYNC)
 		new_fl |= S_DIRSYNC;
 
+	if ((btrfs_test_opt(btrfs_sb(inode->i_sb), DAX)) && S_ISREG(inode->i_mode))
+		new_fl |= S_DAX;
+
 	set_mask_bits(&inode->i_flags,
-		      S_SYNC | S_APPEND | S_IMMUTABLE | S_NOATIME | S_DIRSYNC,
+		      S_SYNC | S_APPEND | S_IMMUTABLE | S_NOATIME | S_DIRSYNC | S_DAX,
 		      new_fl);
 }
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 120e4340792a..3b85e61e5182 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -326,6 +326,7 @@ enum {
 	Opt_treelog, Opt_notreelog,
 	Opt_usebackuproot,
 	Opt_user_subvol_rm_allowed,
+	Opt_dax,
 
 	/* Deprecated options */
 	Opt_alloc_start,
@@ -393,6 +394,7 @@ static const match_table_t tokens = {
 	{Opt_notreelog, "notreelog"},
 	{Opt_usebackuproot, "usebackuproot"},
 	{Opt_user_subvol_rm_allowed, "user_subvol_rm_allowed"},
+	{Opt_dax, "dax"},
 
 	/* Deprecated options */
 	{Opt_alloc_start, "alloc_start=%s"},
@@ -745,6 +747,32 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char *options,
 		case Opt_user_subvol_rm_allowed:
 			btrfs_set_opt(info->mount_opt, USER_SUBVOL_RM_ALLOWED);
 			break;
+		case Opt_dax:
+#ifdef CONFIG_FS_DAX
+			if (btrfs_super_num_devices(info->super_copy) > 1) {
+				btrfs_info(info,
+					   "dax not supported for multi-device btrfs partition\n");
+				ret = -EOPNOTSUPP;
+				goto out;
+			}
+			btrfs_set_opt(info->mount_opt, DAX);
+			btrfs_warn(info, "DAX enabled. Warning: EXPERIMENTAL, use at your own risk\n");
+			btrfs_set_and_info(info, NODATASUM,
+					   "auto-setting nodatasum (dax)");
+			btrfs_clear_opt(info->mount_opt, SPACE_CACHE);
+			btrfs_set_and_info(info, FREE_SPACE_TREE,
+					   "auto-setting free space tree (dax)");
+			if (btrfs_test_opt(info, COMPRESS)) {
+				btrfs_info(info, "disabling compress (dax)");
+				btrfs_clear_opt(info->mount_opt, COMPRESS);
+			}
+			break;
+#else
+			btrfs_err(info,
+				  "DAX option not supported\n");
+			ret = -EINVAL;
+			goto out;
+#endif
 		case Opt_enospc_debug:
 			btrfs_set_opt(info->mount_opt, ENOSPC_DEBUG);
 			break;
@@ -1335,6 +1363,8 @@ static int btrfs_show_options(struct seq_file *seq, struct dentry *dentry)
 		seq_puts(seq, ",clear_cache");
 	if (btrfs_test_opt(info, USER_SUBVOL_RM_ALLOWED))
 		seq_puts(seq, ",user_subvol_rm_allowed");
+	if (btrfs_test_opt(info, DAX))
+		seq_puts(seq, ",dax");
 	if (btrfs_test_opt(info, ENOSPC_DEBUG))
 		seq_puts(seq, ",enospc_debug");
 	if (btrfs_test_opt(info, AUTO_DEFRAG))
-- 
2.16.4
* Re: [PATCH 01/18] btrfs: create a mount option for dax
  2019-04-16 16:41 ` [PATCH 01/18] btrfs: create a mount option for dax Goldwyn Rodrigues
@ 2019-04-16 16:52   ` Dan Williams
  0 siblings, 0 replies; 45+ messages in thread
From: Dan Williams @ 2019-04-16 16:52 UTC (permalink / raw)
  To: Goldwyn Rodrigues
  Cc: kilobyte, Jan Kara, linux-nvdimm, nborisov, Darrick J. Wong,
      david, dsterba, Matthew Wilcox, Goldwyn Rodrigues, linux-fsdevel,
      Christoph Hellwig, linux-btrfs

On Tue, Apr 16, 2019 at 9:42 AM Goldwyn Rodrigues <rgoldwyn@suse.de> wrote:
>
> From: Goldwyn Rodrigues <rgoldwyn@suse.com>
>
> This sets S_DAX in inode->i_flags, which can be used with
> IS_DAX().
>
> The dax option is restricted to non multi-device mounts.
> dax interacts with the device directly instead of using bio, so
> all bio-hooks which we use for multi-device cannot be performed
> here. While regular read/writes could be manipulated with
> RAID0/1, mmap() is still an issue.
>
> Auto-setting free space tree, because dealing with free space
> inode (specifically readpages) is a nightmare.
> Auto-setting nodatasum because we don't get callback for writing
> checksums after mmap()s.
> Deny compression because it does not work with direct I/O.

I'd like to consider the dax mount option deprecated and require new
implementations to use a dynamic scheme that Darrick and I once
discussed, where "dax" (or more specifically the MAP_SYNC semantic) is
an attribute that can be set on an empty directory and then inherited
by all files / directories created in that hierarchy. Dax mappings are
not always a performance win, enabling it for all files in the
filesystem is potentially overkill, and I'm of the opinion that we
should step away from a filesystem-global mount option.
* [PATCH 02/18] btrfs: Carve out btrfs_get_extent_map_write() out of btrfs_get_blocks_write()
  2019-04-16 16:41 [PATCH v3 00/18] btrfs dax support Goldwyn Rodrigues
  2019-04-16 16:41 ` [PATCH 01/18] btrfs: create a mount option for dax Goldwyn Rodrigues
@ 2019-04-16 16:41 ` Goldwyn Rodrigues
  2019-04-16 23:45   ` Elliott, Robert (Servers)
  2019-04-16 16:41 ` [PATCH 03/18] btrfs: basic dax read Goldwyn Rodrigues
  ` (16 subsequent siblings)
  18 siblings, 1 reply; 45+ messages in thread
From: Goldwyn Rodrigues @ 2019-04-16 16:41 UTC (permalink / raw)
  To: linux-btrfs
  Cc: kilobyte, jack, darrick.wong, nborisov, linux-nvdimm, david,
      dsterba, willy, linux-fsdevel, hch, Goldwyn Rodrigues

From: Goldwyn Rodrigues <rgoldwyn@suse.com>

This makes btrfs_get_extent_map_write() independent of the Direct I/O
code.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
 fs/btrfs/ctree.h |  2 ++
 fs/btrfs/inode.c | 40 +++++++++++++++++++++++++++-------------
 2 files changed, 29 insertions(+), 13 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 8ca1c0d120f4..9512f49262dd 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3277,6 +3277,8 @@ struct inode *btrfs_iget_path(struct super_block *s, struct btrfs_key *location,
 			      struct btrfs_path *path);
 struct inode *btrfs_iget(struct super_block *s, struct btrfs_key *location,
 			 struct btrfs_root *root, int *was_new);
+int btrfs_get_extent_map_write(struct extent_map **map, struct buffer_head *bh,
+		struct inode *inode, u64 start, u64 len);
 struct extent_map *btrfs_get_extent(struct btrfs_inode *inode,
 				    struct page *page, size_t pg_offset,
 				    u64 start, u64 end, int create);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 82fdda8ff5ab..80184d0c3b52 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7496,11 +7496,10 @@ static int btrfs_get_blocks_direct_read(struct extent_map *em,
 	return 0;
 }
 
-static int btrfs_get_blocks_direct_write(struct extent_map **map,
-					 struct buffer_head *bh_result,
-					 struct inode *inode,
-					 struct btrfs_dio_data *dio_data,
-					 u64 start, u64 len)
+int btrfs_get_extent_map_write(struct extent_map **map,
+		struct buffer_head *bh,
+		struct inode *inode,
+		u64 start, u64 len)
 {
 	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
 	struct extent_map *em = *map;
@@ -7554,22 +7553,38 @@ static int btrfs_get_blocks_direct_write(struct extent_map **map,
 		 */
 		btrfs_free_reserved_data_space_noquota(inode, start, len);
 
-		goto skip_cow;
+		/* skip COW */
+		goto out;
 	}
 }
 
 	/* this will cow the extent */
-	len = bh_result->b_size;
+	if (bh)
+		len = bh->b_size;
 	free_extent_map(em);
 	*map = em = btrfs_new_extent_direct(inode, start, len);
-	if (IS_ERR(em)) {
-		ret = PTR_ERR(em);
-		goto out;
-	}
+	if (IS_ERR(em))
+		return PTR_ERR(em);
+out:
+	return ret;
+}
+
+static int btrfs_get_blocks_direct_write(struct extent_map **map,
+					 struct buffer_head *bh_result,
+					 struct inode *inode,
+					 struct btrfs_dio_data *dio_data,
+					 u64 start, u64 len)
+{
+	int ret = 0;
+	struct extent_map *em;
+
+	ret = btrfs_get_extent_map_write(map, bh_result, inode,
+			start, len);
+	if (ret < 0)
+		return ret;
+	em = *map;
 	len = min(len, em->len - (start - em->start));
-skip_cow:
 	bh_result->b_blocknr = (em->block_start + (start - em->start)) >>
 			inode->i_blkbits;
 	bh_result->b_size = len;
@@ -7590,7 +7605,6 @@ static int btrfs_get_blocks_direct_write(struct extent_map **map,
 	dio_data->reserve -= len;
 	dio_data->unsubmitted_oe_range_end = start + len;
 	current->journal_info = dio_data;
-out:
 	return ret;
 }
-- 
2.16.4
* RE: [PATCH 02/18] btrfs: Carve out btrfs_get_extent_map_write() out of btrfs_get_blocks_write()
  2019-04-16 16:41 ` [PATCH 02/18] btrfs: Carve out btrfs_get_extent_map_write() out of btrfs_get_blocks_write() Goldwyn Rodrigues
@ 2019-04-16 23:45   ` Elliott, Robert (Servers)
  0 siblings, 0 replies; 45+ messages in thread
From: Elliott, Robert (Servers) @ 2019-04-16 23:45 UTC (permalink / raw)
  To: Goldwyn Rodrigues, linux-btrfs@vger.kernel.org
  Cc: kilobyte@angband.pl, jack@suse.cz, linux-nvdimm@lists.01.org,
      nborisov@suse.com, darrick.wong@oracle.com, david@fromorbit.com,
      dsterba@suse.cz, willy@infradead.org,
      linux-fsdevel@vger.kernel.org, hch@lst.de, Goldwyn Rodrigues

> -----Original Message-----
> From: Linux-nvdimm [mailto:linux-nvdimm-bounces@lists.01.org] On Behalf
> Of Goldwyn Rodrigues
> Sent: Tuesday, April 16, 2019 11:42 AM
> Subject: [PATCH 02/18] btrfs: Carve out btrfs_get_extent_map_write()
> out of btrfs_get_blocks_write()
...
> +static int btrfs_get_blocks_direct_write(struct extent_map **map,
> +					 struct buffer_head *bh_result,
> +					 struct inode *inode,
> +					 struct btrfs_dio_data *dio_data,
> +					 u64 start, u64 len)
> +{
> +	int ret = 0;

That initialization value is not needed, since ret is always overwritten
two lines later.

> +	struct extent_map *em;
> +
> +	ret = btrfs_get_extent_map_write(map, bh_result, inode,
> +			start, len);
> +	if (ret < 0)
> +		return ret;
> +	em = *map;
* [PATCH 03/18] btrfs: basic dax read
  2019-04-16 16:41 [PATCH v3 00/18] btrfs dax support Goldwyn Rodrigues
  2019-04-16 16:41 ` [PATCH 01/18] btrfs: create a mount option for dax Goldwyn Rodrigues
  2019-04-16 16:41 ` [PATCH 02/18] btrfs: Carve out btrfs_get_extent_map_write() out of btrfs_get_blocks_write() Goldwyn Rodrigues
@ 2019-04-16 16:41 ` Goldwyn Rodrigues
  2019-04-16 16:41 ` [PATCH 04/18] dax: Introduce IOMAP_DAX_COW to CoW edges during writes Goldwyn Rodrigues
  ` (15 subsequent siblings)
  18 siblings, 0 replies; 45+ messages in thread
From: Goldwyn Rodrigues @ 2019-04-16 16:41 UTC (permalink / raw)
  To: linux-btrfs
  Cc: kilobyte, jack, darrick.wong, nborisov, linux-nvdimm, david,
      dsterba, willy, linux-fsdevel, hch, Goldwyn Rodrigues

From: Goldwyn Rodrigues <rgoldwyn@suse.com>

Perform a basic read using iomap support. btrfs_iomap_begin() finds the
extent at the given position and fills in the iomap data structure with
its values.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
 fs/btrfs/Makefile |  1 +
 fs/btrfs/ctree.h  |  5 +++++
 fs/btrfs/dax.c    | 49 +++++++++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/file.c   | 11 ++++++++++-
 4 files changed, 65 insertions(+), 1 deletion(-)
 create mode 100644 fs/btrfs/dax.c

diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
index ca693dd554e9..1fa77b875ae9 100644
--- a/fs/btrfs/Makefile
+++ b/fs/btrfs/Makefile
@@ -12,6 +12,7 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o root-tree.o dir-item.o \
 	   reada.o backref.o ulist.o qgroup.o send.o dev-replace.o raid56.o \
 	   uuid-tree.o props.o free-space-tree.o tree-checker.o
 
+btrfs-$(CONFIG_FS_DAX) += dax.o
 btrfs-$(CONFIG_BTRFS_FS_POSIX_ACL) += acl.o
 btrfs-$(CONFIG_BTRFS_FS_CHECK_INTEGRITY) += check-integrity.o
 btrfs-$(CONFIG_BTRFS_FS_REF_VERIFY) += ref-verify.o
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 9512f49262dd..b7bbe5130a3b 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3795,6 +3795,11 @@ int btrfs_reada_wait(void *handle);
 void btrfs_reada_detach(void *handle);
 int btree_readahead_hook(struct extent_buffer *eb, int err);
 
+#ifdef CONFIG_FS_DAX
+/* dax.c */
+ssize_t btrfs_file_dax_read(struct kiocb *iocb, struct iov_iter *to);
+#endif /* CONFIG_FS_DAX */
+
 static inline int is_fstree(u64 rootid)
 {
 	if (rootid == BTRFS_FS_TREE_OBJECTID ||
diff --git a/fs/btrfs/dax.c b/fs/btrfs/dax.c
new file mode 100644
index 000000000000..bf3d46b0acb6
--- /dev/null
+++ b/fs/btrfs/dax.c
@@ -0,0 +1,49 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * DAX support for BTRFS
+ *
+ * Copyright (c) 2019 SUSE Linux
+ * Author: Goldwyn Rodrigues <rgoldwyn@suse.com>
+ */
+
+#ifdef CONFIG_FS_DAX
+#include <linux/dax.h>
+#include <linux/iomap.h>
+#include "ctree.h"
+#include "btrfs_inode.h"
+
+static int btrfs_iomap_begin(struct inode *inode, loff_t pos,
+		loff_t length, unsigned flags, struct iomap *iomap)
+{
+	struct extent_map *em;
+	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
+
+	em = btrfs_get_extent(BTRFS_I(inode), NULL, 0, pos, length, 0);
+	if (em->block_start == EXTENT_MAP_HOLE) {
+		iomap->type = IOMAP_HOLE;
+		return 0;
+	}
+	iomap->type = IOMAP_MAPPED;
+	iomap->bdev = em->bdev;
+	iomap->dax_dev = fs_info->dax_dev;
+	iomap->offset = em->start;
+	iomap->length = em->len;
+	iomap->addr = em->block_start;
+	return 0;
+}
+
+static const struct iomap_ops btrfs_iomap_ops = {
+	.iomap_begin	= btrfs_iomap_begin,
+};
+
+ssize_t btrfs_file_dax_read(struct kiocb *iocb, struct iov_iter *to)
+{
+	ssize_t ret;
+	struct inode *inode = file_inode(iocb->ki_filp);
+
+	inode_lock_shared(inode);
+	ret = dax_iomap_rw(iocb, to, &btrfs_iomap_ops);
+	inode_unlock_shared(inode);
+
+	return ret;
+}
+#endif /* CONFIG_FS_DAX */
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 34fe8a58b0e9..9194591f9eea 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -3288,9 +3288,18 @@ static int btrfs_file_open(struct inode *inode, struct file *filp)
 	return generic_file_open(inode, filp);
 }
 
+static ssize_t btrfs_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
+{
+#ifdef CONFIG_FS_DAX
+	if (IS_DAX(file_inode(iocb->ki_filp)))
+		return btrfs_file_dax_read(iocb, to);
+#endif
+	return generic_file_read_iter(iocb, to);
+}
+
 const struct file_operations btrfs_file_operations = {
 	.llseek		= btrfs_file_llseek,
-	.read_iter	= generic_file_read_iter,
+	.read_iter	= btrfs_file_read_iter,
 	.splice_read	= generic_file_splice_read,
 	.write_iter	= btrfs_file_write_iter,
 	.mmap		= btrfs_file_mmap,
-- 
2.16.4
* [PATCH 04/18] dax: Introduce IOMAP_DAX_COW to CoW edges during writes
  2019-04-16 16:41 [PATCH v3 00/18] btrfs dax support Goldwyn Rodrigues
  ` (2 preceding siblings ...)
  2019-04-16 16:41 ` [PATCH 03/18] btrfs: basic dax read Goldwyn Rodrigues
@ 2019-04-16 16:41 ` Goldwyn Rodrigues
  [not found]   ` <20190416164154.30390-5-rgoldwyn-l3A5Bk7waGM@public.gmane.org>
  2019-04-16 16:41 ` [PATCH 05/18] btrfs: return whether extent is nocow or not Goldwyn Rodrigues
  ` (14 subsequent siblings)
  18 siblings, 1 reply; 45+ messages in thread
From: Goldwyn Rodrigues @ 2019-04-16 16:41 UTC (permalink / raw)
  To: linux-btrfs
  Cc: kilobyte, jack, darrick.wong, nborisov, linux-nvdimm, david,
      dsterba, willy, linux-fsdevel, hch, Goldwyn Rodrigues

From: Goldwyn Rodrigues <rgoldwyn@suse.com>

IOMAP_DAX_COW is an iomap type which copies the edges of the data (the
parts outside the write) while performing a write, if start/end are not
page aligned. The source address is expected in iomap->inline_data.

dax_copy_edges() is a helper function that performs this copy from one
part of the device to another for data that is not page aligned. If
iomap->inline_data is NULL, it memset()s the area to zero.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
 fs/dax.c              | 41 ++++++++++++++++++++++++++++++++++++++++-
 include/linux/iomap.h |  1 +
 2 files changed, 41 insertions(+), 1 deletion(-)

diff --git a/fs/dax.c b/fs/dax.c
index ca0671d55aa6..4b4ac51fbd16 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1083,6 +1083,40 @@ int __dax_zero_page_range(struct block_device *bdev,
 }
 EXPORT_SYMBOL_GPL(__dax_zero_page_range);
 
+/*
+ * dax_copy_edges - Copies the part of the pages not included in
+ *		    the write, but required for CoW because
+ *		    offset/offset+length are not page aligned.
+ */
+static void dax_copy_edges(struct inode *inode, loff_t pos, loff_t length,
+			   struct iomap *iomap, void *daddr)
+{
+	unsigned offset = pos & (PAGE_SIZE - 1);
+	loff_t end = pos + length;
+	loff_t pg_end = round_up(end, PAGE_SIZE);
+	void *saddr = iomap->inline_data;
+
+	/*
+	 * Copy the first part of the page
+	 * Note: we pass offset as length
+	 */
+	if (offset) {
+		if (saddr)
+			memcpy(daddr, saddr, offset);
+		else
+			memset(daddr, 0, offset);
+	}
+
+	/* Copy the last part of the range */
+	if (end < pg_end) {
+		if (saddr)
+			memcpy(daddr + offset + length,
+			       saddr + offset + length, pg_end - end);
+		else
+			memset(daddr + offset + length, 0,
+			       pg_end - end);
+	}
+}
+
 static loff_t
 dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 		struct iomap *iomap)
@@ -1104,9 +1138,11 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 		return iov_iter_zero(min(length, end - pos), iter);
 	}
 
-	if (WARN_ON_ONCE(iomap->type != IOMAP_MAPPED))
+	if (WARN_ON_ONCE(iomap->type != IOMAP_MAPPED
+			 && iomap->type != IOMAP_DAX_COW))
 		return -EIO;
 
+
 	/*
 	 * Write can allocate block for an area which has a hole page mapped
 	 * into page tables. We have to tear down these mappings so that data
@@ -1143,6 +1179,9 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 		break;
 	}
 
+	if (iomap->type == IOMAP_DAX_COW)
+		dax_copy_edges(inode, pos, length, iomap, kaddr);
+
 	map_len = PFN_PHYS(map_len);
 	kaddr += offset;
 	map_len -= offset;
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 0fefb5455bda..6e885c5a38a3 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -25,6 +25,7 @@ struct vm_fault;
 #define IOMAP_MAPPED	0x03	/* blocks allocated at @addr */
 #define IOMAP_UNWRITTEN	0x04	/* blocks allocated at @addr in unwritten state */
 #define IOMAP_INLINE	0x05	/* data inline in the inode */
+#define IOMAP_DAX_COW	0x06	/* Copy data pointed by inline_data before write */
 
 /*
  * Flags for all iomap mappings:
-- 
2.16.4
[parent not found: <20190416164154.30390-5-rgoldwyn-l3A5Bk7waGM@public.gmane.org>]
* Re: [PATCH 04/18] dax: Introduce IOMAP_DAX_COW to CoW edges during writes
  [not found] ` <20190416164154.30390-5-rgoldwyn-l3A5Bk7waGM@public.gmane.org>
@ 2019-04-17 16:46   ` Darrick J. Wong
  0 siblings, 0 replies; 45+ messages in thread
From: Darrick J. Wong @ 2019-04-17 16:46 UTC (permalink / raw)
  To: Goldwyn Rodrigues
  Cc: kilobyte-b9QjgO8OEXPVItvQsEIGlw, jack-AlSwsSmVLrQ,
      linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw, nborisov-IBi9RG/b67k,
      david-FqsqvQoI3Ljby3iVrkZq2A, dsterba-AlSwsSmVLrQ,
      willy-wEGCiKHe2LqWVfeAwA7xHQ, Goldwyn Rodrigues,
      linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, hch-jcswGhMUV9g,
      linux-btrfs-u79uwXL29TY76Z2rM5mHXA

On Tue, Apr 16, 2019 at 11:41:40AM -0500, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues <rgoldwyn-IBi9RG/b67k@public.gmane.org>
>
> The IOMAP_DAX_COW is a iomap type which performs copy of
> edges of data while performing a write if start/end are
> not page aligned. The source address is expected in
> iomap->inline_data.
>
> dax_copy_edges() is a helper functions performs a copy from
> one part of the device to another for data not page aligned.
> If iomap->inline_data is NULL, it memset's the area to zero.
>
> Signed-off-by: Goldwyn Rodrigues <rgoldwyn-IBi9RG/b67k@public.gmane.org>
> ---
>  fs/dax.c              | 41 ++++++++++++++++++++++++++++++++++++++++-
>  include/linux/iomap.h |  1 +
>  2 files changed, 41 insertions(+), 1 deletion(-)
>
> diff --git a/fs/dax.c b/fs/dax.c
> index ca0671d55aa6..4b4ac51fbd16 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -1083,6 +1083,40 @@ int __dax_zero_page_range(struct block_device *bdev,
>  }
>  EXPORT_SYMBOL_GPL(__dax_zero_page_range);
>
> +/*
> + * dax_copy_edges - Copies the part of the pages not included in
> + *		    the write, but required for CoW because
> + *		    offset/offset+length are not page aligned.
> + */
> +static void dax_copy_edges(struct inode *inode, loff_t pos, loff_t length,
> +			   struct iomap *iomap, void *daddr)
> +{
> +	unsigned offset = pos & (PAGE_SIZE - 1);
> +	loff_t end = pos + length;
> +	loff_t pg_end = round_up(end, PAGE_SIZE);
> +	void *saddr = iomap->inline_data;
> +
> +	/*
> +	 * Copy the first part of the page
> +	 * Note: we pass offset as length
> +	 */
> +	if (offset) {
> +		if (saddr)
> +			memcpy(daddr, saddr, offset);

I've been wondering, do we need memcpy_mcsafe here?

> +		else
> +			memset(daddr, 0, offset);

Or here? (Or any of the other places we call memcpy/memset in this
series...) Because I think we'd prefer to return EIO on bad pmem over
a machine check.

--D

> +	}
> +
> +	/* Copy the last part of the range */
> +	if (end < pg_end) {
> +		if (saddr)
> +			memcpy(daddr + offset + length,
> +			       saddr + offset + length, pg_end - end);
> +		else
> +			memset(daddr + offset + length, 0,
> +			       pg_end - end);
> +	}
> +}
> +
>  static loff_t
>  dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
>  		struct iomap *iomap)
> @@ -1104,9 +1138,11 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
>  		return iov_iter_zero(min(length, end - pos), iter);
>  	}
>
> -	if (WARN_ON_ONCE(iomap->type != IOMAP_MAPPED))
> +	if (WARN_ON_ONCE(iomap->type != IOMAP_MAPPED
> +			 && iomap->type != IOMAP_DAX_COW))

Usually the '&&' goes on the first line, right?

> 		return -EIO;
>
> +
> 	/*
> 	 * Write can allocate block for an area which has a hole page mapped
> 	 * into page tables. We have to tear down these mappings so that data
> @@ -1143,6 +1179,9 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
> 		break;
> 	}
>
> +	if (iomap->type == IOMAP_DAX_COW)
> +		dax_copy_edges(inode, pos, length, iomap, kaddr);

No return value? So the pmem copy never fails?

--D

> +
> 	map_len = PFN_PHYS(map_len);
> 	kaddr += offset;
> 	map_len -= offset;
> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> index 0fefb5455bda..6e885c5a38a3 100644
> --- a/include/linux/iomap.h
> +++ b/include/linux/iomap.h
> @@ -25,6 +25,7 @@ struct vm_fault;
>  #define IOMAP_MAPPED	0x03	/* blocks allocated at @addr */
>  #define IOMAP_UNWRITTEN	0x04	/* blocks allocated at @addr in unwritten state */
>  #define IOMAP_INLINE	0x05	/* data inline in the inode */
> +#define IOMAP_DAX_COW	0x06	/* Copy data pointed by inline_data before write*/
>
>  /*
>   * Flags for all iomap mappings:
> --
> 2.16.4
* [PATCH 05/18] btrfs: return whether extent is nocow or not
  2019-04-16 16:41 [PATCH v3 00/18] btrfs dax support Goldwyn Rodrigues
  ` (3 preceding siblings ...)
  2019-04-16 16:41 ` [PATCH 04/18] dax: Introduce IOMAP_DAX_COW to CoW edges during writes Goldwyn Rodrigues
@ 2019-04-16 16:41 ` Goldwyn Rodrigues
  2019-04-16 16:41 ` [PATCH 06/18] btrfs: Rename __endio_write_update_ordered() to btrfs_update_ordered_extent() Goldwyn Rodrigues
  ` (13 subsequent siblings)
  18 siblings, 0 replies; 45+ messages in thread
From: Goldwyn Rodrigues @ 2019-04-16 16:41 UTC (permalink / raw)
  To: linux-btrfs
  Cc: kilobyte, jack, darrick.wong, nborisov, linux-nvdimm, david,
      dsterba, willy, linux-fsdevel, hch, Goldwyn Rodrigues

From: Goldwyn Rodrigues <rgoldwyn@suse.com>

We require this to make sure we return the type IOMAP_DAX_COW in the
iomap structure in the later patches.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
 fs/btrfs/ctree.h | 2 +-
 fs/btrfs/inode.c | 9 +++++++--
 2 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index b7bbe5130a3b..050f30165531 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3278,7 +3278,7 @@ struct inode *btrfs_iget_path(struct super_block *s, struct btrfs_key *location,
 struct inode *btrfs_iget(struct super_block *s, struct btrfs_key *location,
 			 struct btrfs_root *root, int *was_new);
 int btrfs_get_extent_map_write(struct extent_map **map, struct buffer_head *bh,
-		struct inode *inode, u64 start, u64 len);
+		struct inode *inode, u64 start, u64 len, bool *nocow);
 struct extent_map *btrfs_get_extent(struct btrfs_inode *inode,
 				    struct page *page, size_t pg_offset,
 				    u64 start, u64 end, int create);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 80184d0c3b52..06764d89490f 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7499,12 +7499,15 @@ static int btrfs_get_blocks_direct_read(struct extent_map *em,
 int btrfs_get_extent_map_write(struct extent_map **map,
 		struct buffer_head *bh,
 		struct inode *inode,
-		u64 start, u64 len)
+		u64 start, u64 len, bool *nocow)
 {
 	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
 	struct extent_map *em = *map;
 	int ret = 0;
 
+	if (nocow)
+		*nocow = false;
+
 	/*
 	 * We don't allocate a new extent in the following cases
 	 *
@@ -7553,6 +7556,8 @@ int btrfs_get_extent_map_write(struct extent_map **map,
 		 */
 		btrfs_free_reserved_data_space_noquota(inode, start, len);
 
+		if (nocow)
+			*nocow = true;
 		/* skip COW */
 		goto out;
 	}
@@ -7579,7 +7584,7 @@ static int btrfs_get_blocks_direct_write(struct extent_map **map,
 	struct extent_map *em;
 
 	ret = btrfs_get_extent_map_write(map, bh_result, inode,
-			start, len);
+			start, len, NULL);
 	if (ret < 0)
 		return ret;
 	em = *map;
-- 
2.16.4
* [PATCH 06/18] btrfs: Rename __endio_write_update_ordered() to btrfs_update_ordered_extent() 2019-04-16 16:41 [PATCH v3 00/18] btrfs dax support Goldwyn Rodrigues ` (4 preceding siblings ...) 2019-04-16 16:41 ` [PATCH 05/18] btrfs: return whether extent is nocow or not Goldwyn Rodrigues @ 2019-04-16 16:41 ` Goldwyn Rodrigues 2019-04-16 16:41 ` [PATCH 07/18] btrfs: add dax write support Goldwyn Rodrigues ` (12 subsequent siblings) 18 siblings, 0 replies; 45+ messages in thread From: Goldwyn Rodrigues @ 2019-04-16 16:41 UTC (permalink / raw) To: linux-btrfs Cc: kilobyte, jack, darrick.wong, nborisov, linux-nvdimm, david, dsterba, willy, linux-fsdevel, hch, Goldwyn Rodrigues From: Goldwyn Rodrigues <rgoldwyn@suse.com> Since we will be using it in another part of the code, use a better name to declare it non-static Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> --- fs/btrfs/ctree.h | 7 +++++-- fs/btrfs/inode.c | 14 +++++--------- 2 files changed, 10 insertions(+), 11 deletions(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 050f30165531..1e3e758b83c2 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -3280,8 +3280,11 @@ struct inode *btrfs_iget(struct super_block *s, struct btrfs_key *location, int btrfs_get_extent_map_write(struct extent_map **map, struct buffer_head *bh, struct inode *inode, u64 start, u64 len, bool *nocow); struct extent_map *btrfs_get_extent(struct btrfs_inode *inode, - struct page *page, size_t pg_offset, - u64 start, u64 end, int create); + struct page *page, size_t pg_offset, + u64 start, u64 end, int create); +void btrfs_update_ordered_extent(struct inode *inode, + const u64 offset, const u64 bytes, + const bool uptodate); int btrfs_update_inode(struct btrfs_trans_handle *trans, struct btrfs_root *root, struct inode *inode); diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 06764d89490f..7fc778eb6bd2 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -98,10 +98,6 @@ static struct extent_map *create_io_em(struct 
inode *inode, u64 start, u64 len, u64 ram_bytes, int compress_type, int type); -static void __endio_write_update_ordered(struct inode *inode, - const u64 offset, const u64 bytes, - const bool uptodate); - /* * Cleanup all submitted ordered extents in specified range to handle errors * from the btrfs_run_delalloc_range() callback. @@ -142,7 +138,7 @@ static inline void btrfs_cleanup_ordered_extents(struct inode *inode, bytes -= PAGE_SIZE; } - return __endio_write_update_ordered(inode, offset, bytes, false); + return btrfs_update_ordered_extent(inode, offset, bytes, false); } static int btrfs_dirty_inode(struct inode *inode); @@ -8085,7 +8081,7 @@ static void btrfs_endio_direct_read(struct bio *bio) bio_put(bio); } -static void __endio_write_update_ordered(struct inode *inode, +void btrfs_update_ordered_extent(struct inode *inode, const u64 offset, const u64 bytes, const bool uptodate) { @@ -8138,7 +8134,7 @@ static void btrfs_endio_direct_write(struct bio *bio) struct btrfs_dio_private *dip = bio->bi_private; struct bio *dio_bio = dip->dio_bio; - __endio_write_update_ordered(dip->inode, dip->logical_offset, + btrfs_update_ordered_extent(dip->inode, dip->logical_offset, dip->bytes, !bio->bi_status); kfree(dip); @@ -8457,7 +8453,7 @@ static void btrfs_submit_direct(struct bio *dio_bio, struct inode *inode, bio = NULL; } else { if (write) - __endio_write_update_ordered(inode, + btrfs_update_ordered_extent(inode, file_offset, dio_bio->bi_iter.bi_size, false); @@ -8597,7 +8593,7 @@ static ssize_t btrfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter) */ if (dio_data.unsubmitted_oe_range_start < dio_data.unsubmitted_oe_range_end) - __endio_write_update_ordered(inode, + btrfs_update_ordered_extent(inode, dio_data.unsubmitted_oe_range_start, dio_data.unsubmitted_oe_range_end - dio_data.unsubmitted_oe_range_start, -- 2.16.4 _______________________________________________ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org 
* [PATCH 07/18] btrfs: add dax write support 2019-04-16 16:41 [PATCH v3 00/18] btrfs dax support Goldwyn Rodrigues ` (5 preceding siblings ...) 2019-04-16 16:41 ` [PATCH 06/18] btrfs: Rename __endio_write_update_ordered() to btrfs_update_ordered_extent() Goldwyn Rodrigues @ 2019-04-16 16:41 ` Goldwyn Rodrigues 2019-04-16 16:41 ` [PATCH 08/18] dax: memcpy page in case of IOMAP_DAX_COW for mmap faults Goldwyn Rodrigues ` (11 subsequent siblings) 18 siblings, 0 replies; 45+ messages in thread From: Goldwyn Rodrigues @ 2019-04-16 16:41 UTC (permalink / raw) To: linux-btrfs Cc: kilobyte, jack, darrick.wong, nborisov, linux-nvdimm, david, dsterba, willy, linux-fsdevel, hch, Goldwyn Rodrigues From: Goldwyn Rodrigues <rgoldwyn@suse.com> The IOMAP_DAX_COW iomap type informs the dax code that the data edges which are not page-aligned must first be copied from the old extent before performing the write. Checking whether the data edges are page-aligned is the responsibility of ->iomap_begin(), and the source address is stored in ->inline_data. A new struct btrfs_iomap is passed from iomap_begin() to iomap_end(); it carries all the accounting and locking information for CoW-based writes. For writes to a hole, iomap->inline_data is set to zero.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> --- fs/btrfs/ctree.h | 6 ++ fs/btrfs/dax.c | 182 +++++++++++++++++++++++++++++++++++++++++++++++++++++-- fs/btrfs/file.c | 4 +- 3 files changed, 185 insertions(+), 7 deletions(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 1e3e758b83c2..eec01eb92f33 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -3801,6 +3801,12 @@ int btree_readahead_hook(struct extent_buffer *eb, int err); #ifdef CONFIG_FS_DAX /* dax.c */ ssize_t btrfs_file_dax_read(struct kiocb *iocb, struct iov_iter *to); +ssize_t btrfs_file_dax_write(struct kiocb *iocb, struct iov_iter *from); +#else +static inline ssize_t btrfs_file_dax_write(struct kiocb *iocb, struct iov_iter *from) +{ + return 0; +} #endif /* CONFIG_FS_DAX */ static inline int is_fstree(u64 rootid) diff --git a/fs/btrfs/dax.c b/fs/btrfs/dax.c index bf3d46b0acb6..f5cc9bcdbf14 100644 --- a/fs/btrfs/dax.c +++ b/fs/btrfs/dax.c @@ -9,30 +9,184 @@ #ifdef CONFIG_FS_DAX #include <linux/dax.h> #include <linux/iomap.h> +#include <linux/uio.h> #include "ctree.h" #include "btrfs_inode.h" +struct btrfs_iomap { + u64 start; + u64 end; + bool nocow; + struct extent_changeset *data_reserved; + struct extent_state *cached_state; +}; + +static struct btrfs_iomap *btrfs_iomap_init(struct inode *inode, + struct extent_map **em, + loff_t pos, loff_t length) +{ + int ret = 0; + struct extent_map *map = *em; + struct btrfs_iomap *bi; + + bi = kzalloc(sizeof(struct btrfs_iomap), GFP_NOFS); + if (!bi) + return ERR_PTR(-ENOMEM); + + bi->start = round_down(pos, PAGE_SIZE); + bi->end = PAGE_ALIGN(pos + length); + + /* Wait for existing ordered extents in range to finish */ + btrfs_wait_ordered_range(inode, bi->start, bi->end - bi->start); + + lock_extent_bits(&BTRFS_I(inode)->io_tree, bi->start, bi->end, &bi->cached_state); + + ret = btrfs_delalloc_reserve_space(inode, &bi->data_reserved, + bi->start, bi->end - bi->start); + if (ret) { + unlock_extent_cached(&BTRFS_I(inode)->io_tree, bi->start, 
bi->end, + &bi->cached_state); + kfree(bi); + return ERR_PTR(ret); + } + + refcount_inc(&map->refs); + ret = btrfs_get_extent_map_write(em, NULL, + inode, bi->start, bi->end - bi->start, &bi->nocow); + if (ret) { + unlock_extent_cached(&BTRFS_I(inode)->io_tree, bi->start, bi->end, + &bi->cached_state); + btrfs_delalloc_release_space(inode, + bi->data_reserved, bi->start, + bi->end - bi->start, true); + extent_changeset_free(bi->data_reserved); + kfree(bi); + return ERR_PTR(ret); + } + free_extent_map(map); + return bi; +} + +static void *dax_address(struct block_device *bdev, struct dax_device *dax_dev, + sector_t sector, loff_t pos, loff_t length) +{ + size_t size = ALIGN(pos + length, PAGE_SIZE); + int id, ret = 0; + void *kaddr = NULL; + pgoff_t pgoff; + long map_len; + + id = dax_read_lock(); + + ret = bdev_dax_pgoff(bdev, sector, size, &pgoff); + if (ret) + goto out; + + map_len = dax_direct_access(dax_dev, pgoff, PHYS_PFN(size), + &kaddr, NULL); + if (map_len < 0) + ret = map_len; + +out: + dax_read_unlock(id); + if (ret) + return ERR_PTR(ret); + return kaddr; +} + static int btrfs_iomap_begin(struct inode *inode, loff_t pos, loff_t length, unsigned flags, struct iomap *iomap) { struct extent_map *em; struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb); + struct btrfs_iomap *bi = NULL; + unsigned offset = pos & (PAGE_SIZE - 1); + u64 srcblk = 0; + loff_t diff; + em = btrfs_get_extent(BTRFS_I(inode), NULL, 0, pos, length, 0); + + iomap->type = IOMAP_MAPPED; + + if (flags & IOMAP_WRITE) { + if (em->block_start != EXTENT_MAP_HOLE) + srcblk = em->block_start + pos - em->start - offset; + + bi = btrfs_iomap_init(inode, &em, pos, length); + if (IS_ERR(bi)) + return PTR_ERR(bi); + + } + + /* + * Advance the difference between pos and start, to align well with + * inline_data in case of writes + */ + diff = round_down(pos - em->start, PAGE_SIZE); + iomap->offset = em->start + diff; + iomap->length = em->len - diff; + iomap->bdev = em->bdev; + iomap->dax_dev = 
fs_info->dax_dev; + + /* + * This will be true for reads only since we have already + * allocated em + */ if (em->block_start == EXTENT_MAP_HOLE) { iomap->type = IOMAP_HOLE; return 0; } - iomap->type = IOMAP_MAPPED; - iomap->bdev = em->bdev; - iomap->dax_dev = fs_info->dax_dev; - iomap->offset = em->start; - iomap->length = em->len; - iomap->addr = em->block_start; + + iomap->addr = em->block_start + diff; + /* Check if we really need to copy data from old extent */ + if (bi && !bi->nocow && (offset || pos + length < bi->end)) { + iomap->type = IOMAP_DAX_COW; + if (srcblk) { + sector_t sector = (srcblk + (pos & PAGE_MASK) - + iomap->offset) >> 9; + iomap->inline_data = dax_address(em->bdev, + fs_info->dax_dev, sector, pos, length); + if (IS_ERR(iomap->inline_data)) { + kfree(bi); + return PTR_ERR(iomap->inline_data); + } + } + } + + iomap->private = bi; + return 0; +} + +static int btrfs_iomap_end(struct inode *inode, loff_t pos, + loff_t length, ssize_t written, unsigned flags, + struct iomap *iomap) +{ + struct btrfs_iomap *bi = iomap->private; + u64 wend; + + if (!bi) + return 0; + + unlock_extent_cached(&BTRFS_I(inode)->io_tree, bi->start, bi->end, + &bi->cached_state); + + wend = PAGE_ALIGN(pos + written); + if (wend < bi->end) { + btrfs_delalloc_release_space(inode, + bi->data_reserved, wend, + bi->end - wend, true); + } + + btrfs_update_ordered_extent(inode, bi->start, wend - bi->start, true); + btrfs_delalloc_release_extents(BTRFS_I(inode), wend - bi->start, false); + extent_changeset_free(bi->data_reserved); + kfree(bi); return 0; } static const struct iomap_ops btrfs_iomap_ops = { .iomap_begin = btrfs_iomap_begin, + .iomap_end = btrfs_iomap_end, }; ssize_t btrfs_file_dax_read(struct kiocb *iocb, struct iov_iter *to) @@ -46,4 +200,20 @@ ssize_t btrfs_file_dax_read(struct kiocb *iocb, struct iov_iter *to) return ret; } + +ssize_t btrfs_file_dax_write(struct kiocb *iocb, struct iov_iter *iter) +{ + ssize_t ret = 0; + u64 pos = iocb->ki_pos; + struct inode 
*inode = file_inode(iocb->ki_filp); + ret = dax_iomap_rw(iocb, iter, &btrfs_iomap_ops); + + if (ret > 0) { + pos += ret; + if (pos > i_size_read(inode)) + i_size_write(inode, pos); + iocb->ki_pos = pos; + } + return ret; +} #endif /* CONFIG_FS_DAX */ diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c index 9194591f9eea..a795023e26ca 100644 --- a/fs/btrfs/file.c +++ b/fs/btrfs/file.c @@ -1964,7 +1964,9 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb, if (sync) atomic_inc(&BTRFS_I(inode)->sync_writers); - if (iocb->ki_flags & IOCB_DIRECT) { + if (IS_DAX(inode)) { + num_written = btrfs_file_dax_write(iocb, from); + } else if (iocb->ki_flags & IOCB_DIRECT) { num_written = __btrfs_direct_write(iocb, from); } else { num_written = btrfs_buffered_write(iocb, from); -- 2.16.4 _______________________________________________ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm ^ permalink raw reply related [flat|nested] 45+ messages in thread
* [PATCH 08/18] dax: memcpy page in case of IOMAP_DAX_COW for mmap faults 2019-04-16 16:41 [PATCH v3 00/18] btrfs dax support Goldwyn Rodrigues ` (6 preceding siblings ...) 2019-04-16 16:41 ` [PATCH 07/18] btrfs: add dax write support Goldwyn Rodrigues @ 2019-04-16 16:41 ` Goldwyn Rodrigues [not found] ` <20190416164154.30390-9-rgoldwyn-l3A5Bk7waGM@public.gmane.org> 2019-04-16 16:41 ` [PATCH 09/18] btrfs: Add dax specific address_space_operations Goldwyn Rodrigues ` (10 subsequent siblings) 18 siblings, 1 reply; 45+ messages in thread From: Goldwyn Rodrigues @ 2019-04-16 16:41 UTC (permalink / raw) To: linux-btrfs Cc: kilobyte, jack, darrick.wong, nborisov, linux-nvdimm, david, dsterba, willy, linux-fsdevel, hch, Goldwyn Rodrigues From: Goldwyn Rodrigues <rgoldwyn@suse.com> Change dax_iomap_pfn to return the address as well in order to use it for performing a memcpy in case the type is IOMAP_DAX_COW. Question: The sequence of bdev_dax_pgoff() and dax_direct_access() is used multiple times to calculate address and pfn's. Would it make sense to call it while calculating address as well to reduce code? 
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> --- fs/dax.c | 16 ++++++++++++---- 1 file changed, 12 insertions(+), 4 deletions(-) diff --git a/fs/dax.c b/fs/dax.c index 4b4ac51fbd16..45fc2e18969a 100644 --- a/fs/dax.c +++ b/fs/dax.c @@ -983,7 +983,7 @@ static sector_t dax_iomap_sector(struct iomap *iomap, loff_t pos) } static int dax_iomap_pfn(struct iomap *iomap, loff_t pos, size_t size, - pfn_t *pfnp) + pfn_t *pfnp, void **addr) { const sector_t sector = dax_iomap_sector(iomap, pos); pgoff_t pgoff; @@ -995,7 +995,7 @@ static int dax_iomap_pfn(struct iomap *iomap, loff_t pos, size_t size, return rc; id = dax_read_lock(); length = dax_direct_access(iomap->dax_dev, pgoff, PHYS_PFN(size), - NULL, pfnp); + addr, pfnp); if (length < 0) { rc = length; goto out; @@ -1280,6 +1280,7 @@ static vm_fault_t dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp, XA_STATE(xas, &mapping->i_pages, vmf->pgoff); struct inode *inode = mapping->host; unsigned long vaddr = vmf->address; + void *addr; loff_t pos = (loff_t)vmf->pgoff << PAGE_SHIFT; struct iomap iomap = { 0 }; unsigned flags = IOMAP_FAULT; @@ -1369,16 +1370,23 @@ static vm_fault_t dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp, sync = dax_fault_is_synchronous(flags, vma, &iomap); switch (iomap.type) { + case IOMAP_DAX_COW: case IOMAP_MAPPED: if (iomap.flags & IOMAP_F_NEW) { count_vm_event(PGMAJFAULT); count_memcg_event_mm(vma->vm_mm, PGMAJFAULT); major = VM_FAULT_MAJOR; } - error = dax_iomap_pfn(&iomap, pos, PAGE_SIZE, &pfn); + error = dax_iomap_pfn(&iomap, pos, PAGE_SIZE, &pfn, &addr); if (error < 0) goto error_finish_iomap; + if (iomap.type == IOMAP_DAX_COW) { + if (iomap.inline_data) + memcpy(addr, iomap.inline_data, PAGE_SIZE); + else + memset(addr, 0, PAGE_SIZE); + } entry = dax_insert_entry(&xas, mapping, vmf, entry, pfn, 0, write && !sync); @@ -1577,7 +1585,7 @@ static vm_fault_t dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp, switch (iomap.type) { case IOMAP_MAPPED: - error = 
dax_iomap_pfn(&iomap, pos, PMD_SIZE, &pfn); + error = dax_iomap_pfn(&iomap, pos, PMD_SIZE, &pfn, NULL); if (error < 0) goto finish_iomap; -- 2.16.4 _______________________________________________ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm ^ permalink raw reply related [flat|nested] 45+ messages in thread
* Re: [PATCH 08/18] dax: memcpy page in case of IOMAP_DAX_COW for mmap faults [not found] ` <20190416164154.30390-9-rgoldwyn-l3A5Bk7waGM@public.gmane.org> @ 2019-04-17 16:52 ` Darrick J. Wong 0 siblings, 0 replies; 45+ messages in thread From: Darrick J. Wong @ 2019-04-17 16:52 UTC (permalink / raw) To: Goldwyn Rodrigues Cc: kilobyte-b9QjgO8OEXPVItvQsEIGlw, jack-AlSwsSmVLrQ, linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw, nborisov-IBi9RG/b67k, david-FqsqvQoI3Ljby3iVrkZq2A, dsterba-AlSwsSmVLrQ, willy-wEGCiKHe2LqWVfeAwA7xHQ, Goldwyn Rodrigues, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, hch-jcswGhMUV9g, linux-btrfs-u79uwXL29TY76Z2rM5mHXA On Tue, Apr 16, 2019 at 11:41:44AM -0500, Goldwyn Rodrigues wrote: > From: Goldwyn Rodrigues <rgoldwyn-IBi9RG/b67k@public.gmane.org> > > Change dax_iomap_pfn to return the address as well in order to > use it for performing a memcpy in case the type is IOMAP_DAX_COW. > > Question: > The sequence of bdev_dax_pgoff() and dax_direct_access() is > used multiple times to calculate address and pfn's. Would it make > sense to call it while calculating address as well to reduce code? 
> > Signed-off-by: Goldwyn Rodrigues <rgoldwyn-IBi9RG/b67k@public.gmane.org> > --- > fs/dax.c | 16 ++++++++++++---- > 1 file changed, 12 insertions(+), 4 deletions(-) > > diff --git a/fs/dax.c b/fs/dax.c > index 4b4ac51fbd16..45fc2e18969a 100644 > --- a/fs/dax.c > +++ b/fs/dax.c > @@ -983,7 +983,7 @@ static sector_t dax_iomap_sector(struct iomap *iomap, loff_t pos) > } > > static int dax_iomap_pfn(struct iomap *iomap, loff_t pos, size_t size, > - pfn_t *pfnp) > + pfn_t *pfnp, void **addr) > { > const sector_t sector = dax_iomap_sector(iomap, pos); > pgoff_t pgoff; > @@ -995,7 +995,7 @@ static int dax_iomap_pfn(struct iomap *iomap, loff_t pos, size_t size, > return rc; > id = dax_read_lock(); > length = dax_direct_access(iomap->dax_dev, pgoff, PHYS_PFN(size), > - NULL, pfnp); > + addr, pfnp); > if (length < 0) { > rc = length; > goto out; > @@ -1280,6 +1280,7 @@ static vm_fault_t dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp, > XA_STATE(xas, &mapping->i_pages, vmf->pgoff); > struct inode *inode = mapping->host; > unsigned long vaddr = vmf->address; > + void *addr; > loff_t pos = (loff_t)vmf->pgoff << PAGE_SHIFT; > struct iomap iomap = { 0 }; > unsigned flags = IOMAP_FAULT; > @@ -1369,16 +1370,23 @@ static vm_fault_t dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp, > sync = dax_fault_is_synchronous(flags, vma, &iomap); > > switch (iomap.type) { > + case IOMAP_DAX_COW: > case IOMAP_MAPPED: > if (iomap.flags & IOMAP_F_NEW) { > count_vm_event(PGMAJFAULT); > count_memcg_event_mm(vma->vm_mm, PGMAJFAULT); > major = VM_FAULT_MAJOR; > } > - error = dax_iomap_pfn(&iomap, pos, PAGE_SIZE, &pfn); > + error = dax_iomap_pfn(&iomap, pos, PAGE_SIZE, &pfn, &addr); > if (error < 0) > goto error_finish_iomap; > > + if (iomap.type == IOMAP_DAX_COW) { > + if (iomap.inline_data) > + memcpy(addr, iomap.inline_data, PAGE_SIZE); Same memcpy_mcsafe question from my reply to patch 4 applies here. 
> + else > + memset(addr, 0, PAGE_SIZE); > + } > entry = dax_insert_entry(&xas, mapping, vmf, entry, pfn, > 0, write && !sync); > > @@ -1577,7 +1585,7 @@ static vm_fault_t dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp, > > switch (iomap.type) { > case IOMAP_MAPPED: > - error = dax_iomap_pfn(&iomap, pos, PMD_SIZE, &pfn); > + error = dax_iomap_pfn(&iomap, pos, PMD_SIZE, &pfn, NULL); Same (unanswered) question from the v2 series -- doesn't a PMD fault also require handling IOMAP_DAX_COW? --D > if (error < 0) > goto finish_iomap; > > -- > 2.16.4 > ^ permalink raw reply [flat|nested] 45+ messages in thread
* [PATCH 09/18] btrfs: Add dax specific address_space_operations 2019-04-16 16:41 [PATCH v3 00/18] btrfs dax support Goldwyn Rodrigues ` (7 preceding siblings ...) 2019-04-16 16:41 ` [PATCH 08/18] dax: memcpy page in case of IOMAP_DAX_COW for mmap faults Goldwyn Rodrigues @ 2019-04-16 16:41 ` Goldwyn Rodrigues [not found] ` <20190416164154.30390-1-rgoldwyn-l3A5Bk7waGM@public.gmane.org> ` (9 subsequent siblings) 18 siblings, 0 replies; 45+ messages in thread From: Goldwyn Rodrigues @ 2019-04-16 16:41 UTC (permalink / raw) To: linux-btrfs Cc: kilobyte, jack, darrick.wong, nborisov, linux-nvdimm, david, dsterba, willy, linux-fsdevel, hch, Goldwyn Rodrigues From: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> --- fs/btrfs/inode.c | 34 +++++++++++++++++++++++++++++++--- 1 file changed, 31 insertions(+), 3 deletions(-) diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 7fc778eb6bd2..c2c0544009a7 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -28,6 +28,7 @@ #include <linux/magic.h> #include <linux/iversion.h> #include <linux/swap.h> +#include <linux/dax.h> #include <asm/unaligned.h> #include "ctree.h" #include "disk-io.h" @@ -65,6 +66,7 @@ static const struct inode_operations btrfs_dir_ro_inode_operations; static const struct inode_operations btrfs_special_inode_operations; static const struct inode_operations btrfs_file_inode_operations; static const struct address_space_operations btrfs_aops; +static const struct address_space_operations btrfs_dax_aops; static const struct file_operations btrfs_dir_file_operations; static const struct extent_io_ops btrfs_extent_io_ops; @@ -3757,7 +3759,10 @@ static int btrfs_read_locked_inode(struct inode *inode, switch (inode->i_mode & S_IFMT) { case S_IFREG: - inode->i_mapping->a_ops = &btrfs_aops; + if (btrfs_test_opt(fs_info, DAX)) + inode->i_mapping->a_ops = &btrfs_dax_aops; + else + inode->i_mapping->a_ops = &btrfs_aops; BTRFS_I(inode)->io_tree.ops = &btrfs_extent_io_ops; 
inode->i_fop = &btrfs_file_operations; inode->i_op = &btrfs_file_inode_operations; @@ -3778,6 +3783,7 @@ static int btrfs_read_locked_inode(struct inode *inode, } btrfs_sync_inode_flags_to_i_flags(inode); + return 0; } @@ -6538,7 +6544,10 @@ static int btrfs_create(struct inode *dir, struct dentry *dentry, */ inode->i_fop = &btrfs_file_operations; inode->i_op = &btrfs_file_inode_operations; - inode->i_mapping->a_ops = &btrfs_aops; + if (IS_DAX(inode) && S_ISREG(mode)) + inode->i_mapping->a_ops = &btrfs_dax_aops; + else + inode->i_mapping->a_ops = &btrfs_aops; err = btrfs_init_inode_security(trans, inode, dir, &dentry->d_name); if (err) @@ -8665,6 +8674,15 @@ static int btrfs_writepages(struct address_space *mapping, return extent_writepages(mapping, wbc); } +static int btrfs_dax_writepages(struct address_space *mapping, + struct writeback_control *wbc) +{ + struct inode *inode = mapping->host; + struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb); + return dax_writeback_mapping_range(mapping, fs_info->fs_devices->latest_bdev, + wbc); +} + static int btrfs_readpages(struct file *file, struct address_space *mapping, struct list_head *pages, unsigned nr_pages) @@ -10436,7 +10454,10 @@ static int btrfs_tmpfile(struct inode *dir, struct dentry *dentry, umode_t mode) inode->i_fop = &btrfs_file_operations; inode->i_op = &btrfs_file_inode_operations; - inode->i_mapping->a_ops = &btrfs_aops; + if (IS_DAX(inode)) + inode->i_mapping->a_ops = &btrfs_dax_aops; + else + inode->i_mapping->a_ops = &btrfs_aops; BTRFS_I(inode)->io_tree.ops = &btrfs_extent_io_ops; ret = btrfs_init_inode_security(trans, inode, dir, NULL); @@ -10892,6 +10913,13 @@ static const struct address_space_operations btrfs_aops = { .swap_deactivate = btrfs_swap_deactivate, }; +static const struct address_space_operations btrfs_dax_aops = { + .writepages = btrfs_dax_writepages, + .direct_IO = noop_direct_IO, + .set_page_dirty = noop_set_page_dirty, + .invalidatepage = noop_invalidatepage, +}; + static const 
struct inode_operations btrfs_file_inode_operations = { .getattr = btrfs_getattr, .setattr = btrfs_setattr, -- 2.16.4 _______________________________________________ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm ^ permalink raw reply related [flat|nested] 45+ messages in thread
* [PATCH 10/18] dax: replace mmap entry in case of CoW [not found] ` <20190416164154.30390-1-rgoldwyn-l3A5Bk7waGM@public.gmane.org> @ 2019-04-16 16:41 ` Goldwyn Rodrigues [not found] ` <20190416164154.30390-11-rgoldwyn-l3A5Bk7waGM@public.gmane.org> 0 siblings, 1 reply; 45+ messages in thread From: Goldwyn Rodrigues @ 2019-04-16 16:41 UTC (permalink / raw) To: linux-btrfs-u79uwXL29TY76Z2rM5mHXA Cc: kilobyte-b9QjgO8OEXPVItvQsEIGlw, jack-AlSwsSmVLrQ, darrick.wong-QHcLZuEGTsvQT0dZR+AlfA, nborisov-IBi9RG/b67k, linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw, david-FqsqvQoI3Ljby3iVrkZq2A, dsterba-AlSwsSmVLrQ, willy-wEGCiKHe2LqWVfeAwA7xHQ, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, hch-jcswGhMUV9g, Goldwyn Rodrigues From: Goldwyn Rodrigues <rgoldwyn-IBi9RG/b67k@public.gmane.org> In case of CoW, we replace the existing entry with the newly allocated one. We also mark the entry with PAGECACHE_TAG_TOWRITE so that writeback write-protects it. This helps with snapshots: new write page faults taken after a snapshot trigger a CoW.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn-IBi9RG/b67k@public.gmane.org> --- fs/dax.c | 23 +++++++++++++++-------- 1 file changed, 15 insertions(+), 8 deletions(-) diff --git a/fs/dax.c b/fs/dax.c index 45fc2e18969a..d5100cbe8bd2 100644 --- a/fs/dax.c +++ b/fs/dax.c @@ -708,14 +708,15 @@ static int copy_user_dax(struct block_device *bdev, struct dax_device *dax_dev, */ static void *dax_insert_entry(struct xa_state *xas, struct address_space *mapping, struct vm_fault *vmf, - void *entry, pfn_t pfn, unsigned long flags, bool dirty) + void *entry, pfn_t pfn, unsigned long flags, bool dirty, + bool cow) { void *new_entry = dax_make_entry(pfn, flags); if (dirty) __mark_inode_dirty(mapping->host, I_DIRTY_PAGES); - if (dax_is_zero_entry(entry) && !(flags & DAX_ZERO_PAGE)) { + if (cow || (dax_is_zero_entry(entry) && !(flags & DAX_ZERO_PAGE))) { unsigned long index = xas->xa_index; /* we are replacing a zero page with block mapping */ if (dax_is_pmd_entry(entry)) @@ -727,12 +728,12 @@ static void *dax_insert_entry(struct xa_state *xas, xas_reset(xas); xas_lock_irq(xas); - if (dax_entry_size(entry) != dax_entry_size(new_entry)) { + if (cow || (dax_entry_size(entry) != dax_entry_size(new_entry))) { dax_disassociate_entry(entry, mapping, false); dax_associate_entry(new_entry, mapping, vmf->vma, vmf->address); } - if (dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) { + if (cow || dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) { /* * Only swap our new entry into the page cache if the current * entry is a zero page or an empty entry. 
If a normal PTE or @@ -752,6 +753,9 @@ static void *dax_insert_entry(struct xa_state *xas, if (dirty) xas_set_mark(xas, PAGECACHE_TAG_DIRTY); + if (cow) + xas_set_mark(xas, PAGECACHE_TAG_TOWRITE); + xas_unlock_irq(xas); return entry; } @@ -1031,7 +1035,7 @@ static vm_fault_t dax_load_hole(struct xa_state *xas, vm_fault_t ret; *entry = dax_insert_entry(xas, mapping, vmf, *entry, pfn, - DAX_ZERO_PAGE, false); + DAX_ZERO_PAGE, false, false); ret = vmf_insert_mixed(vmf->vma, vaddr, pfn); trace_dax_load_hole(inode, vmf, ret); @@ -1388,7 +1392,8 @@ static vm_fault_t dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp, memset(addr, 0, PAGE_SIZE); } entry = dax_insert_entry(&xas, mapping, vmf, entry, pfn, - 0, write && !sync); + 0, write && !sync, + iomap.type == IOMAP_DAX_COW); /* * If we are doing synchronous page fault and inode needs fsync, @@ -1467,7 +1472,8 @@ static vm_fault_t dax_pmd_load_hole(struct xa_state *xas, struct vm_fault *vmf, pfn = page_to_pfn_t(zero_page); *entry = dax_insert_entry(xas, mapping, vmf, *entry, pfn, - DAX_PMD | DAX_ZERO_PAGE, false); + DAX_PMD | DAX_ZERO_PAGE, false, + iomap->type == IOMAP_DAX_COW); ptl = pmd_lock(vmf->vma->vm_mm, vmf->pmd); if (!pmd_none(*(vmf->pmd))) { @@ -1590,7 +1596,8 @@ static vm_fault_t dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp, goto finish_iomap; entry = dax_insert_entry(&xas, mapping, vmf, entry, pfn, - DAX_PMD, write && !sync); + DAX_PMD, write && !sync, + false); /* * If we are doing synchronous page fault and inode needs fsync, -- 2.16.4 ^ permalink raw reply related [flat|nested] 45+ messages in thread
* Re: [PATCH 10/18] dax: replace mmap entry in case of CoW [not found] ` <20190416164154.30390-11-rgoldwyn-l3A5Bk7waGM@public.gmane.org> @ 2019-04-17 15:24 ` Darrick J. Wong 0 siblings, 0 replies; 45+ messages in thread From: Darrick J. Wong @ 2019-04-17 15:24 UTC (permalink / raw) To: Goldwyn Rodrigues Cc: kilobyte-b9QjgO8OEXPVItvQsEIGlw, jack-AlSwsSmVLrQ, linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw, nborisov-IBi9RG/b67k, david-FqsqvQoI3Ljby3iVrkZq2A, dsterba-AlSwsSmVLrQ, willy-wEGCiKHe2LqWVfeAwA7xHQ, Goldwyn Rodrigues, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, hch-jcswGhMUV9g, linux-btrfs-u79uwXL29TY76Z2rM5mHXA On Tue, Apr 16, 2019 at 11:41:46AM -0500, Goldwyn Rodrigues wrote: > From: Goldwyn Rodrigues <rgoldwyn-IBi9RG/b67k@public.gmane.org> > > We replace the existing entry to the newly allocated one > in case of CoW. Also, we mark the entry as PAGECACHE_TAG_TOWRITE > so writeback marks this entry as writeprotected. This > helps us snapshots so new write pagefaults after snapshots > trigger a CoW. > > Signed-off-by: Goldwyn Rodrigues <rgoldwyn-IBi9RG/b67k@public.gmane.org> > --- > fs/dax.c | 23 +++++++++++++++-------- > 1 file changed, 15 insertions(+), 8 deletions(-) > > diff --git a/fs/dax.c b/fs/dax.c > index 45fc2e18969a..d5100cbe8bd2 100644 > --- a/fs/dax.c > +++ b/fs/dax.c > @@ -708,14 +708,15 @@ static int copy_user_dax(struct block_device *bdev, struct dax_device *dax_dev, > */ > static void *dax_insert_entry(struct xa_state *xas, > struct address_space *mapping, struct vm_fault *vmf, > - void *entry, pfn_t pfn, unsigned long flags, bool dirty) > + void *entry, pfn_t pfn, unsigned long flags, bool dirty, > + bool cow) I still wish these were flags instead of double booleans that will be easy to mix up, especially since this is a static function and nobody else has to see the flags... #define IE_DIRTY (1 << 0) /* mark entry and inode dirty */ #define IE_REPLACE (1 << 1) /* replacing one page with another */ ...otoh maybe I'll just defer to the maintainer. 
:) > { > void *new_entry = dax_make_entry(pfn, flags); > > if (dirty) > __mark_inode_dirty(mapping->host, I_DIRTY_PAGES); > > - if (dax_is_zero_entry(entry) && !(flags & DAX_ZERO_PAGE)) { > + if (cow || (dax_is_zero_entry(entry) && !(flags & DAX_ZERO_PAGE))) { > unsigned long index = xas->xa_index; > /* we are replacing a zero page with block mapping */ These comments need updating. Otherwise looks good to me... --D > if (dax_is_pmd_entry(entry)) > @@ -727,12 +728,12 @@ static void *dax_insert_entry(struct xa_state *xas, > > xas_reset(xas); > xas_lock_irq(xas); > - if (dax_entry_size(entry) != dax_entry_size(new_entry)) { > + if (cow || (dax_entry_size(entry) != dax_entry_size(new_entry))) { > dax_disassociate_entry(entry, mapping, false); > dax_associate_entry(new_entry, mapping, vmf->vma, vmf->address); > } > > - if (dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) { > + if (cow || dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) { > /* > * Only swap our new entry into the page cache if the current > * entry is a zero page or an empty entry. 
If a normal PTE or > @@ -752,6 +753,9 @@ static void *dax_insert_entry(struct xa_state *xas, > if (dirty) > xas_set_mark(xas, PAGECACHE_TAG_DIRTY); > > + if (cow) > + xas_set_mark(xas, PAGECACHE_TAG_TOWRITE); > + > xas_unlock_irq(xas); > return entry; > } > @@ -1031,7 +1035,7 @@ static vm_fault_t dax_load_hole(struct xa_state *xas, > vm_fault_t ret; > > *entry = dax_insert_entry(xas, mapping, vmf, *entry, pfn, > - DAX_ZERO_PAGE, false); > + DAX_ZERO_PAGE, false, false); > > ret = vmf_insert_mixed(vmf->vma, vaddr, pfn); > trace_dax_load_hole(inode, vmf, ret); > @@ -1388,7 +1392,8 @@ static vm_fault_t dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp, > memset(addr, 0, PAGE_SIZE); > } > entry = dax_insert_entry(&xas, mapping, vmf, entry, pfn, > - 0, write && !sync); > + 0, write && !sync, > + iomap.type == IOMAP_DAX_COW); > > /* > * If we are doing synchronous page fault and inode needs fsync, > @@ -1467,7 +1472,8 @@ static vm_fault_t dax_pmd_load_hole(struct xa_state *xas, struct vm_fault *vmf, > > pfn = page_to_pfn_t(zero_page); > *entry = dax_insert_entry(xas, mapping, vmf, *entry, pfn, > - DAX_PMD | DAX_ZERO_PAGE, false); > + DAX_PMD | DAX_ZERO_PAGE, false, > + iomap->type == IOMAP_DAX_COW); > > ptl = pmd_lock(vmf->vma->vm_mm, vmf->pmd); > if (!pmd_none(*(vmf->pmd))) { > @@ -1590,7 +1596,8 @@ static vm_fault_t dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp, > goto finish_iomap; > > entry = dax_insert_entry(&xas, mapping, vmf, entry, pfn, > - DAX_PMD, write && !sync); > + DAX_PMD, write && !sync, > + false); > > /* > * If we are doing synchronous page fault and inode needs fsync, > -- > 2.16.4 > ^ permalink raw reply [flat|nested] 45+ messages in thread
* [PATCH 11/18] btrfs: add dax mmap support 2019-04-16 16:41 [PATCH v3 00/18] btrfs dax support Goldwyn Rodrigues ` (9 preceding siblings ...) [not found] ` <20190416164154.30390-1-rgoldwyn-l3A5Bk7waGM@public.gmane.org> @ 2019-04-16 16:41 ` Goldwyn Rodrigues 2019-04-16 16:41 ` [PATCH 12/18] btrfs: allow MAP_SYNC mmap Goldwyn Rodrigues ` (7 subsequent siblings) 18 siblings, 0 replies; 45+ messages in thread From: Goldwyn Rodrigues @ 2019-04-16 16:41 UTC (permalink / raw) To: linux-btrfs Cc: kilobyte, jack, darrick.wong, nborisov, linux-nvdimm, david, dsterba, willy, linux-fsdevel, hch, Goldwyn Rodrigues From: Goldwyn Rodrigues <rgoldwyn@suse.com> Add a new vm_operations struct btrfs_dax_vm_ops specifically for dax files. Since we will be removing (nulling) readpages/writepages for dax, return ENOEXEC only for non-dax files. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> --- fs/btrfs/ctree.h | 1 + fs/btrfs/dax.c | 13 ++++++++++++- fs/btrfs/file.c | 18 ++++++++++++++++-- 3 files changed, 29 insertions(+), 3 deletions(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index eec01eb92f33..2b7bdabb44f8 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -3802,6 +3802,7 @@ int btree_readahead_hook(struct extent_buffer *eb, int err); /* dax.c */ ssize_t btrfs_file_dax_read(struct kiocb *iocb, struct iov_iter *to); ssize_t btrfs_file_dax_write(struct kiocb *iocb, struct iov_iter *from); +vm_fault_t btrfs_dax_fault(struct vm_fault *vmf); #else static inline ssize_t btrfs_file_dax_write(struct kiocb *iocb, struct iov_iter *from) { diff --git a/fs/btrfs/dax.c b/fs/btrfs/dax.c index f5cc9bcdbf14..de957d681e16 100644 --- a/fs/btrfs/dax.c +++ b/fs/btrfs/dax.c @@ -139,7 +139,7 @@ static int btrfs_iomap_begin(struct inode *inode, loff_t pos, iomap->addr = em->block_start + diff; /* Check if we really need to copy data from old extent */ - if (bi && !bi->nocow && (offset || pos + length < bi->end)) { + if (bi && !bi->nocow && (offset || pos + length < bi->end || flags &
IOMAP_FAULT)) { iomap->type = IOMAP_DAX_COW; if (srcblk) { sector_t sector = (srcblk + (pos & PAGE_MASK) - @@ -216,4 +216,15 @@ ssize_t btrfs_file_dax_write(struct kiocb *iocb, struct iov_iter *iter) } return ret; } + +vm_fault_t btrfs_dax_fault(struct vm_fault *vmf) +{ + vm_fault_t ret; + pfn_t pfn; + ret = dax_iomap_fault(vmf, PE_SIZE_PTE, &pfn, NULL, &btrfs_iomap_ops); + if (ret & VM_FAULT_NEEDDSYNC) + ret = dax_finish_sync_fault(vmf, PE_SIZE_PTE, pfn); + + return ret; +} #endif /* CONFIG_FS_DAX */ diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c index a795023e26ca..9d5a3c99a6b9 100644 --- a/fs/btrfs/file.c +++ b/fs/btrfs/file.c @@ -2214,15 +2214,29 @@ static const struct vm_operations_struct btrfs_file_vm_ops = { .page_mkwrite = btrfs_page_mkwrite, }; +#ifdef CONFIG_FS_DAX +static const struct vm_operations_struct btrfs_dax_vm_ops = { + .fault = btrfs_dax_fault, + .page_mkwrite = btrfs_dax_fault, + .pfn_mkwrite = btrfs_dax_fault, +}; +#else +#define btrfs_dax_vm_ops btrfs_file_vm_ops +#endif + static int btrfs_file_mmap(struct file *filp, struct vm_area_struct *vma) { struct address_space *mapping = filp->f_mapping; + struct inode *inode = file_inode(filp); - if (!mapping->a_ops->readpage) + if (!IS_DAX(inode) && !mapping->a_ops->readpage) return -ENOEXEC; file_accessed(filp); - vma->vm_ops = &btrfs_file_vm_ops; + if (IS_DAX(inode)) + vma->vm_ops = &btrfs_dax_vm_ops; + else + vma->vm_ops = &btrfs_file_vm_ops; return 0; } -- 2.16.4 _______________________________________________ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm ^ permalink raw reply related [flat|nested] 45+ messages in thread
* [PATCH 12/18] btrfs: allow MAP_SYNC mmap 2019-04-16 16:41 [PATCH v3 00/18] btrfs dax support Goldwyn Rodrigues ` (10 preceding siblings ...) 2019-04-16 16:41 ` [PATCH 11/18] btrfs: add dax mmap support Goldwyn Rodrigues @ 2019-04-16 16:41 ` Goldwyn Rodrigues 2019-04-16 16:41 ` [PATCH 13/18] fs: dedup file range to use a compare function Goldwyn Rodrigues ` (6 subsequent siblings) 18 siblings, 0 replies; 45+ messages in thread From: Goldwyn Rodrigues @ 2019-04-16 16:41 UTC (permalink / raw) To: linux-btrfs Cc: linux-fsdevel, jack, david, willy, hch, darrick.wong, kilobyte, dsterba, nborisov, linux-nvdimm, Goldwyn Rodrigues From: Adam Borowski <kilobyte@angband.pl> Used by userspace to detect DAX. [rgoldwyn@suse.com: Added CONFIG_FS_DAX around mmap_supported_flags] Signed-off-by: Adam Borowski <kilobyte@angband.pl> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> --- fs/btrfs/file.c | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c index 9d5a3c99a6b9..362a9cf9dcb2 100644 --- a/fs/btrfs/file.c +++ b/fs/btrfs/file.c @@ -16,6 +16,7 @@ #include <linux/btrfs.h> #include <linux/uio.h> #include <linux/iversion.h> +#include <linux/mman.h> #include "ctree.h" #include "disk-io.h" #include "transaction.h" @@ -3319,6 +3320,9 @@ const struct file_operations btrfs_file_operations = { .splice_read = generic_file_splice_read, .write_iter = btrfs_file_write_iter, .mmap = btrfs_file_mmap, +#ifdef CONFIG_FS_DAX + .mmap_supported_flags = MAP_SYNC, +#endif .open = btrfs_file_open, .release = btrfs_release_file, .fsync = btrfs_sync_file, -- 2.16.4 ^ permalink raw reply related [flat|nested] 45+ messages in thread
* [PATCH 13/18] fs: dedup file range to use a compare function 2019-04-16 16:41 [PATCH v3 00/18] btrfs dax support Goldwyn Rodrigues ` (11 preceding siblings ...) 2019-04-16 16:41 ` [PATCH 12/18] btrfs: allow MAP_SYNC mmap Goldwyn Rodrigues @ 2019-04-16 16:41 ` Goldwyn Rodrigues [not found] ` <20190416164154.30390-14-rgoldwyn-l3A5Bk7waGM@public.gmane.org> 2019-04-16 16:41 ` [PATCH 14/18] dax: memcpy before zeroing range Goldwyn Rodrigues ` (5 subsequent siblings) 18 siblings, 1 reply; 45+ messages in thread From: Goldwyn Rodrigues @ 2019-04-16 16:41 UTC (permalink / raw) To: linux-btrfs Cc: kilobyte, jack, darrick.wong, nborisov, linux-nvdimm, david, dsterba, willy, linux-fsdevel, hch, Goldwyn Rodrigues From: Goldwyn Rodrigues <rgoldwyn@suse.com> With dax we cannot deal with readpage() etc. So, we create a function callback to perform the file data comparison and pass it to generic_remap_file_range_prep() so it can use iomap-based functions. This may not be the best way to solve this. Suggestions welcome. 
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> --- fs/btrfs/ctree.h | 9 ++++++++ fs/btrfs/dax.c | 7 +++++++ fs/btrfs/ioctl.c | 11 ++++++++-- fs/dax.c | 58 ++++++++++++++++++++++++++++++++++++++++++++++++++++ fs/ocfs2/file.c | 2 +- fs/read_write.c | 10 ++++----- fs/xfs/xfs_reflink.c | 2 +- include/linux/dax.h | 2 ++ include/linux/fs.h | 7 ++++++- 9 files changed, 98 insertions(+), 10 deletions(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 2b7bdabb44f8..d3d044125619 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -3803,11 +3803,20 @@ int btree_readahead_hook(struct extent_buffer *eb, int err); ssize_t btrfs_file_dax_read(struct kiocb *iocb, struct iov_iter *to); ssize_t btrfs_file_dax_write(struct kiocb *iocb, struct iov_iter *from); vm_fault_t btrfs_dax_fault(struct vm_fault *vmf); +int btrfs_dax_file_range_compare(struct inode *src, loff_t srcoff, + struct inode *dest, loff_t destoff, loff_t len, + bool *is_same); #else static inline ssize_t btrfs_file_dax_write(struct kiocb *iocb, struct iov_iter *from) { return 0; } +static inline int btrfs_dax_file_range_compare(struct inode *src, loff_t srcoff, + struct inode *dest, loff_t destoff, loff_t len, + bool *is_same) +{ + return 0; +} #endif /* CONFIG_FS_DAX */ static inline int is_fstree(u64 rootid) diff --git a/fs/btrfs/dax.c b/fs/btrfs/dax.c index de957d681e16..a29628b403b3 100644 --- a/fs/btrfs/dax.c +++ b/fs/btrfs/dax.c @@ -227,4 +227,11 @@ vm_fault_t btrfs_dax_fault(struct vm_fault *vmf) return ret; } + +int btrfs_dax_file_range_compare(struct inode *src, loff_t srcoff, + struct inode *dest, loff_t destoff, loff_t len, + bool *is_same) +{ + return dax_file_range_compare(src, srcoff, dest, destoff, len, is_same, &btrfs_iomap_ops); +} #endif /* CONFIG_FS_DAX */ diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c index 0138119cd9a3..cd590105bd78 100644 --- a/fs/btrfs/ioctl.c +++ b/fs/btrfs/ioctl.c @@ -4000,8 +4000,15 @@ static int btrfs_remap_file_range_prep(struct file *file_in, loff_t 
pos_in, if (ret < 0) goto out_unlock; - ret = generic_remap_file_range_prep(file_in, pos_in, file_out, pos_out, - len, remap_flags); + if (IS_DAX(file_inode(file_in)) && IS_DAX(file_inode(file_out))) + ret = generic_remap_file_range_prep(file_in, pos_in, file_out, + pos_out, len, remap_flags, + btrfs_dax_file_range_compare); + else + ret = generic_remap_file_range_prep(file_in, pos_in, file_out, + pos_out, len, remap_flags, + vfs_dedupe_file_range_compare); + if (ret < 0 || *len == 0) goto out_unlock; diff --git a/fs/dax.c b/fs/dax.c index d5100cbe8bd2..abbe4a79f219 100644 --- a/fs/dax.c +++ b/fs/dax.c @@ -1759,3 +1759,61 @@ vm_fault_t dax_finish_sync_fault(struct vm_fault *vmf, return dax_insert_pfn_mkwrite(vmf, pfn, order); } EXPORT_SYMBOL_GPL(dax_finish_sync_fault); + + +int dax_file_range_compare(struct inode *src, loff_t srcoff, struct inode *dest, + loff_t destoff, loff_t len, bool *is_same, const struct iomap_ops *ops) +{ + void *saddr, *daddr; + struct iomap s_iomap = {0}; + struct iomap d_iomap = {0}; + loff_t dstart, sstart; + bool same = true; + loff_t cmp_len, l; + int id, ret = 0; + + id = dax_read_lock(); + while (len) { + ret = ops->iomap_begin(src, srcoff, len, 0, &s_iomap); + if (ret < 0) { + if (ops->iomap_end) + ops->iomap_end(src, srcoff, len, ret, 0, &s_iomap); + return ret; + } + cmp_len = len; + if (cmp_len > s_iomap.offset + s_iomap.length - srcoff) + cmp_len = s_iomap.offset + s_iomap.length - srcoff; + + ret = ops->iomap_begin(dest, destoff, cmp_len, 0, &d_iomap); + if (ret < 0) { + if (ops->iomap_end) { + ops->iomap_end(src, srcoff, len, ret, 0, &s_iomap); + ops->iomap_end(dest, destoff, len, ret, 0, &d_iomap); + } + return ret; + } + if (cmp_len > d_iomap.offset + d_iomap.length - destoff) + cmp_len = d_iomap.offset + d_iomap.length - destoff; + + + sstart = (get_start_sect(s_iomap.bdev) << 9) + s_iomap.addr + (srcoff - s_iomap.offset); + l = dax_direct_access(s_iomap.dax_dev, PHYS_PFN(sstart), PHYS_PFN(cmp_len), &saddr, NULL); + dstart 
= (get_start_sect(d_iomap.bdev) << 9) + d_iomap.addr + (destoff - d_iomap.offset); + l = dax_direct_access(d_iomap.dax_dev, PHYS_PFN(dstart), PHYS_PFN(cmp_len), &daddr, NULL); + same = !memcmp(saddr, daddr, cmp_len); + if (!same) + break; + len -= cmp_len; + srcoff += cmp_len; + destoff += cmp_len; + + if (ops->iomap_end) { + ret = ops->iomap_end(src, srcoff, len, 0, 0, &s_iomap); + ret = ops->iomap_end(dest, destoff, len, 0, 0, &d_iomap); + } + } + dax_read_unlock(id); + *is_same = same; + return ret; +} +EXPORT_SYMBOL_GPL(dax_file_range_compare); diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c index d640c5f8a85d..9d470306cfc3 100644 --- a/fs/ocfs2/file.c +++ b/fs/ocfs2/file.c @@ -2558,7 +2558,7 @@ static loff_t ocfs2_remap_file_range(struct file *file_in, loff_t pos_in, goto out_unlock; ret = generic_remap_file_range_prep(file_in, pos_in, file_out, pos_out, - &len, remap_flags); + &len, remap_flags, vfs_dedupe_file_range_compare); if (ret < 0 || len == 0) goto out_unlock; diff --git a/fs/read_write.c b/fs/read_write.c index 61b43ad7608e..ecc67740d0ff 100644 --- a/fs/read_write.c +++ b/fs/read_write.c @@ -1778,7 +1778,7 @@ static struct page *vfs_dedupe_get_page(struct inode *inode, loff_t offset) * Compare extents of two files to see if they are the same. * Caller must have locked both inodes to prevent write races. 
*/ -static int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff, +int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff, struct inode *dest, loff_t destoff, loff_t len, bool *is_same) { @@ -1845,6 +1845,7 @@ static int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff, out_error: return error; } +EXPORT_SYMBOL(vfs_dedupe_file_range_compare); /* * Check that the two inodes are eligible for cloning, the ranges make @@ -1856,7 +1857,7 @@ static int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff, */ int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in, struct file *file_out, loff_t pos_out, - loff_t *len, unsigned int remap_flags) + loff_t *len, unsigned int remap_flags, compare_range_t compare) { struct inode *inode_in = file_inode(file_in); struct inode *inode_out = file_inode(file_out); @@ -1915,9 +1916,8 @@ int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in, */ if (remap_flags & REMAP_FILE_DEDUP) { bool is_same = false; - - ret = vfs_dedupe_file_range_compare(inode_in, pos_in, - inode_out, pos_out, *len, &is_same); + ret = compare(inode_in, pos_in, + inode_out, pos_out, *len, &is_same); if (ret) return ret; if (!is_same) diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c index 680ae7662a78..68e4257cebb0 100644 --- a/fs/xfs/xfs_reflink.c +++ b/fs/xfs/xfs_reflink.c @@ -1350,7 +1350,7 @@ xfs_reflink_remap_prep( goto out_unlock; ret = generic_remap_file_range_prep(file_in, pos_in, file_out, pos_out, - len, remap_flags); + len, remap_flags, vfs_dedupe_file_range_compare); if (ret < 0 || *len == 0) goto out_unlock; diff --git a/include/linux/dax.h b/include/linux/dax.h index 0dd316a74a29..a11bc7b1f526 100644 --- a/include/linux/dax.h +++ b/include/linux/dax.h @@ -157,6 +157,8 @@ vm_fault_t dax_finish_sync_fault(struct vm_fault *vmf, int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index); int dax_invalidate_mapping_entry_sync(struct address_space *mapping, 
pgoff_t index); +int dax_file_range_compare(struct inode *src, loff_t srcoff, struct inode *dest, + loff_t destoff, loff_t len, bool *is_same, const struct iomap_ops *ops); #ifdef CONFIG_FS_DAX int __dax_zero_page_range(struct block_device *bdev, diff --git a/include/linux/fs.h b/include/linux/fs.h index dd28e7679089..b1aa2fc105ae 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -1883,10 +1883,15 @@ extern ssize_t vfs_readv(struct file *, const struct iovec __user *, unsigned long, loff_t *, rwf_t); extern ssize_t vfs_copy_file_range(struct file *, loff_t , struct file *, loff_t, size_t, unsigned int); +extern int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff, + struct inode *dest, loff_t destoff, + loff_t len, bool *is_same); +typedef int (compare_range_t)(struct inode *src, loff_t srcpos, struct inode *dest, + loff_t destpos, loff_t len, bool *is_same); extern int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in, struct file *file_out, loff_t pos_out, loff_t *count, - unsigned int remap_flags); + unsigned int remap_flags, compare_range_t cmp); extern loff_t do_clone_file_range(struct file *file_in, loff_t pos_in, struct file *file_out, loff_t pos_out, loff_t len, unsigned int remap_flags); -- 2.16.4 _______________________________________________ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm ^ permalink raw reply related [flat|nested] 45+ messages in thread
[parent not found: <20190416164154.30390-14-rgoldwyn-l3A5Bk7waGM@public.gmane.org>]
* Re: [PATCH 13/18] fs: dedup file range to use a compare function [not found] ` <20190416164154.30390-14-rgoldwyn-l3A5Bk7waGM@public.gmane.org> @ 2019-04-17 15:36 ` Darrick J. Wong 0 siblings, 0 replies; 45+ messages in thread From: Darrick J. Wong @ 2019-04-17 15:36 UTC (permalink / raw) To: Goldwyn Rodrigues Cc: kilobyte-b9QjgO8OEXPVItvQsEIGlw, jack-AlSwsSmVLrQ, linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw, nborisov-IBi9RG/b67k, david-FqsqvQoI3Ljby3iVrkZq2A, dsterba-AlSwsSmVLrQ, willy-wEGCiKHe2LqWVfeAwA7xHQ, Goldwyn Rodrigues, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, hch-jcswGhMUV9g, linux-btrfs-u79uwXL29TY76Z2rM5mHXA On Tue, Apr 16, 2019 at 11:41:49AM -0500, Goldwyn Rodrigues wrote: > From: Goldwyn Rodrigues <rgoldwyn-IBi9RG/b67k@public.gmane.org> > > With dax we cannot deal with readpage() etc. So, we create a > funciton callback to perform the file data comparison and pass > it to generic_remap_file_range_prep() so it can use iomap-based > functions. > > This may not be the best way to solve this. Suggestions welcome. 
> > Signed-off-by: Goldwyn Rodrigues <rgoldwyn-IBi9RG/b67k@public.gmane.org> > --- > fs/btrfs/ctree.h | 9 ++++++++ > fs/btrfs/dax.c | 7 +++++++ > fs/btrfs/ioctl.c | 11 ++++++++-- > fs/dax.c | 58 ++++++++++++++++++++++++++++++++++++++++++++++++++++ > fs/ocfs2/file.c | 2 +- > fs/read_write.c | 10 ++++----- > fs/xfs/xfs_reflink.c | 2 +- > include/linux/dax.h | 2 ++ > include/linux/fs.h | 7 ++++++- > 9 files changed, 98 insertions(+), 10 deletions(-) > > diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h > index 2b7bdabb44f8..d3d044125619 100644 > --- a/fs/btrfs/ctree.h > +++ b/fs/btrfs/ctree.h > @@ -3803,11 +3803,20 @@ int btree_readahead_hook(struct extent_buffer *eb, int err); > ssize_t btrfs_file_dax_read(struct kiocb *iocb, struct iov_iter *to); > ssize_t btrfs_file_dax_write(struct kiocb *iocb, struct iov_iter *from); > vm_fault_t btrfs_dax_fault(struct vm_fault *vmf); > +int btrfs_dax_file_range_compare(struct inode *src, loff_t srcoff, > + struct inode *dest, loff_t destoff, loff_t len, > + bool *is_same); > #else > static inline ssize_t btrfs_file_dax_write(struct kiocb *iocb, struct iov_iter *from) > { > return 0; > } > +static inline int btrfs_dax_file_range_compare(struct inode *src, loff_t srcoff, > + struct inode *dest, loff_t destoff, loff_t len, > + bool *is_same) > +{ > + return 0; > +} > #endif /* CONFIG_FS_DAX */ > > static inline int is_fstree(u64 rootid) > diff --git a/fs/btrfs/dax.c b/fs/btrfs/dax.c > index de957d681e16..a29628b403b3 100644 > --- a/fs/btrfs/dax.c > +++ b/fs/btrfs/dax.c > @@ -227,4 +227,11 @@ vm_fault_t btrfs_dax_fault(struct vm_fault *vmf) > > return ret; > } > + > +int btrfs_dax_file_range_compare(struct inode *src, loff_t srcoff, > + struct inode *dest, loff_t destoff, loff_t len, > + bool *is_same) > +{ > + return dax_file_range_compare(src, srcoff, dest, destoff, len, is_same, &btrfs_iomap_ops); > +} > #endif /* CONFIG_FS_DAX */ > diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c > index 0138119cd9a3..cd590105bd78 100644 > --- 
a/fs/btrfs/ioctl.c > +++ b/fs/btrfs/ioctl.c > @@ -4000,8 +4000,15 @@ static int btrfs_remap_file_range_prep(struct file *file_in, loff_t pos_in, > if (ret < 0) > goto out_unlock; > > - ret = generic_remap_file_range_prep(file_in, pos_in, file_out, pos_out, > - len, remap_flags); > + if (IS_DAX(file_inode(file_in)) && IS_DAX(file_inode(file_out))) > + ret = generic_remap_file_range_prep(file_in, pos_in, file_out, > + pos_out, len, remap_flags, > + btrfs_dax_file_range_compare); > + else > + ret = generic_remap_file_range_prep(file_in, pos_in, file_out, > + pos_out, len, remap_flags, > + vfs_dedupe_file_range_compare); I wonder if you could simply have a compare_range_t variable that you can set to either the vfs and btrfs_dax compare functions, and then only have to maintain a single generic_remap_file_range_prep callsite? > + > if (ret < 0 || *len == 0) > goto out_unlock; > > diff --git a/fs/dax.c b/fs/dax.c > index d5100cbe8bd2..abbe4a79f219 100644 > --- a/fs/dax.c > +++ b/fs/dax.c > @@ -1759,3 +1759,61 @@ vm_fault_t dax_finish_sync_fault(struct vm_fault *vmf, > return dax_insert_pfn_mkwrite(vmf, pfn, order); > } > EXPORT_SYMBOL_GPL(dax_finish_sync_fault); > + > + > +int dax_file_range_compare(struct inode *src, loff_t srcoff, struct inode *dest, > + loff_t destoff, loff_t len, bool *is_same, const struct iomap_ops *ops) > +{ > + void *saddr, *daddr; > + struct iomap s_iomap = {0}; > + struct iomap d_iomap = {0}; > + loff_t dstart, sstart; > + bool same = true; > + loff_t cmp_len, l; > + int id, ret = 0; > + > + id = dax_read_lock(); > + while (len) { > + ret = ops->iomap_begin(src, srcoff, len, 0, &s_iomap); > + if (ret < 0) { > + if (ops->iomap_end) > + ops->iomap_end(src, srcoff, len, ret, 0, &s_iomap); > + return ret; > + } > + cmp_len = len; > + if (cmp_len > s_iomap.offset + s_iomap.length - srcoff) > + cmp_len = s_iomap.offset + s_iomap.length - srcoff; > + > + ret = ops->iomap_begin(dest, destoff, cmp_len, 0, &d_iomap); > + if (ret < 0) { > + if 
(ops->iomap_end) { > + ops->iomap_end(src, srcoff, len, ret, 0, &s_iomap); > + ops->iomap_end(dest, destoff, len, ret, 0, &d_iomap); > + } > + return ret; > + } > + if (cmp_len > d_iomap.offset + d_iomap.length - destoff) > + cmp_len = d_iomap.offset + d_iomap.length - destoff; > + > + > + sstart = (get_start_sect(s_iomap.bdev) << 9) + s_iomap.addr + (srcoff - s_iomap.offset); This kinda screams static inline helper function... > + l = dax_direct_access(s_iomap.dax_dev, PHYS_PFN(sstart), PHYS_PFN(cmp_len), &saddr, NULL); > + dstart = (get_start_sect(d_iomap.bdev) << 9) + d_iomap.addr + (destoff - d_iomap.offset); ...especially since it happens again here... > + l = dax_direct_access(d_iomap.dax_dev, PHYS_PFN(dstart), PHYS_PFN(cmp_len), &daddr, NULL); ...and these lines are too long. > + same = !memcmp(saddr, daddr, cmp_len); > + if (!same) > + break; > + len -= cmp_len; > + srcoff += cmp_len; > + destoff += cmp_len; > + > + if (ops->iomap_end) { > + ret = ops->iomap_end(src, srcoff, len, 0, 0, &s_iomap); > + ret = ops->iomap_end(dest, destoff, len, 0, 0, &d_iomap); > + } > + } > + dax_read_unlock(id); > + *is_same = same; > + return ret; > +} > +EXPORT_SYMBOL_GPL(dax_file_range_compare); > diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c > index d640c5f8a85d..9d470306cfc3 100644 > --- a/fs/ocfs2/file.c > +++ b/fs/ocfs2/file.c > @@ -2558,7 +2558,7 @@ static loff_t ocfs2_remap_file_range(struct file *file_in, loff_t pos_in, > goto out_unlock; > > ret = generic_remap_file_range_prep(file_in, pos_in, file_out, pos_out, > - &len, remap_flags); > + &len, remap_flags, vfs_dedupe_file_range_compare); > if (ret < 0 || len == 0) > goto out_unlock; > > diff --git a/fs/read_write.c b/fs/read_write.c > index 61b43ad7608e..ecc67740d0ff 100644 > --- a/fs/read_write.c > +++ b/fs/read_write.c > @@ -1778,7 +1778,7 @@ static struct page *vfs_dedupe_get_page(struct inode *inode, loff_t offset) > * Compare extents of two files to see if they are the same. 
> * Caller must have locked both inodes to prevent write races. > */ > -static int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff, > +int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff, > struct inode *dest, loff_t destoff, > loff_t len, bool *is_same) > { > @@ -1845,6 +1845,7 @@ static int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff, > out_error: > return error; > } > +EXPORT_SYMBOL(vfs_dedupe_file_range_compare); Not EXPORT_SYMBOL_GPL? :D > /* > * Check that the two inodes are eligible for cloning, the ranges make > @@ -1856,7 +1857,7 @@ static int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff, > */ > int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in, > struct file *file_out, loff_t pos_out, > - loff_t *len, unsigned int remap_flags) > + loff_t *len, unsigned int remap_flags, compare_range_t compare) > { > struct inode *inode_in = file_inode(file_in); > struct inode *inode_out = file_inode(file_out); > @@ -1915,9 +1916,8 @@ int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in, > */ > if (remap_flags & REMAP_FILE_DEDUP) { > bool is_same = false; > - > - ret = vfs_dedupe_file_range_compare(inode_in, pos_in, > - inode_out, pos_out, *len, &is_same); > + ret = compare(inode_in, pos_in, > + inode_out, pos_out, *len, &is_same); > if (ret) > return ret; > if (!is_same) > diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c > index 680ae7662a78..68e4257cebb0 100644 > --- a/fs/xfs/xfs_reflink.c > +++ b/fs/xfs/xfs_reflink.c > @@ -1350,7 +1350,7 @@ xfs_reflink_remap_prep( > goto out_unlock; > > ret = generic_remap_file_range_prep(file_in, pos_in, file_out, pos_out, > - len, remap_flags); > + len, remap_flags, vfs_dedupe_file_range_compare); > if (ret < 0 || *len == 0) > goto out_unlock; > > diff --git a/include/linux/dax.h b/include/linux/dax.h > index 0dd316a74a29..a11bc7b1f526 100644 > --- a/include/linux/dax.h > +++ b/include/linux/dax.h > @@ -157,6 +157,8 @@ 
vm_fault_t dax_finish_sync_fault(struct vm_fault *vmf, > int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index); > int dax_invalidate_mapping_entry_sync(struct address_space *mapping, > pgoff_t index); > +int dax_file_range_compare(struct inode *src, loff_t srcoff, struct inode *dest, > + loff_t destoff, loff_t len, bool *is_same, const struct iomap_ops *ops); > > #ifdef CONFIG_FS_DAX > int __dax_zero_page_range(struct block_device *bdev, > diff --git a/include/linux/fs.h b/include/linux/fs.h > index dd28e7679089..b1aa2fc105ae 100644 > --- a/include/linux/fs.h > +++ b/include/linux/fs.h > @@ -1883,10 +1883,15 @@ extern ssize_t vfs_readv(struct file *, const struct iovec __user *, > unsigned long, loff_t *, rwf_t); > extern ssize_t vfs_copy_file_range(struct file *, loff_t , struct file *, > loff_t, size_t, unsigned int); > +extern int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff, > + struct inode *dest, loff_t destoff, > + loff_t len, bool *is_same); > +typedef int (compare_range_t)(struct inode *src, loff_t srcpos, struct inode *dest, > + loff_t destpos, loff_t len, bool *is_same); Might want to call this vfs_compare_range_t to make it clear that this belongs to the generic vfs namespace. --D > extern int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in, > struct file *file_out, loff_t pos_out, > loff_t *count, > - unsigned int remap_flags); > + unsigned int remap_flags, compare_range_t cmp); > extern loff_t do_clone_file_range(struct file *file_in, loff_t pos_in, > struct file *file_out, loff_t pos_out, > loff_t len, unsigned int remap_flags); > -- > 2.16.4 > ^ permalink raw reply [flat|nested] 45+ messages in thread
* [PATCH 14/18] dax: memcpy before zeroing range 2019-04-16 16:41 [PATCH v3 00/18] btrfs dax support Goldwyn Rodrigues ` (12 preceding siblings ...) 2019-04-16 16:41 ` [PATCH 13/18] fs: dedup file range to use a compare function Goldwyn Rodrigues @ 2019-04-16 16:41 ` Goldwyn Rodrigues [not found] ` <20190416164154.30390-15-rgoldwyn-l3A5Bk7waGM@public.gmane.org> 2019-04-16 16:41 ` [PATCH 15/18] btrfs: handle dax page zeroing Goldwyn Rodrigues ` (4 subsequent siblings) 18 siblings, 1 reply; 45+ messages in thread From: Goldwyn Rodrigues @ 2019-04-16 16:41 UTC (permalink / raw) To: linux-btrfs Cc: kilobyte, jack, darrick.wong, nborisov, linux-nvdimm, david, dsterba, willy, linux-fsdevel, hch, Goldwyn Rodrigues From: Goldwyn Rodrigues <rgoldwyn@suse.com> Performing the copy before zeroing needed more iomap fields, so it was easier to pass the iomap and compute inside the function rather than passing a lot of arguments. Note, there is a subtle difference between iomap_sector and dax_iomap_sector(). Can we replace dax_iomap_sector with iomap_sector()? It would need pos & PAGE_MASK though, or else bdev_dax_pgoff() would return -EINVAL. 
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> --- fs/dax.c | 14 ++++++++++---- fs/iomap.c | 9 +-------- include/linux/dax.h | 11 +++++------ include/linux/iomap.h | 6 ++++++ 4 files changed, 22 insertions(+), 18 deletions(-) diff --git a/fs/dax.c b/fs/dax.c index abbe4a79f219..af94909640ea 100644 --- a/fs/dax.c +++ b/fs/dax.c @@ -1055,11 +1055,15 @@ static bool dax_range_is_aligned(struct block_device *bdev, return true; } -int __dax_zero_page_range(struct block_device *bdev, - struct dax_device *dax_dev, sector_t sector, - unsigned int offset, unsigned int size) +int __dax_zero_page_range(struct iomap *iomap, loff_t pos, + unsigned int offset, unsigned int size) { - if (dax_range_is_aligned(bdev, offset, size)) { + sector_t sector = dax_iomap_sector(iomap, pos & PAGE_MASK); + struct block_device *bdev = iomap->bdev; + struct dax_device *dax_dev = iomap->dax_dev; + + if (!(iomap->type == IOMAP_DAX_COW) && + dax_range_is_aligned(bdev, offset, size)) { sector_t start_sector = sector + (offset >> 9); return blkdev_issue_zeroout(bdev, start_sector, @@ -1079,6 +1083,8 @@ int __dax_zero_page_range(struct block_device *bdev, dax_read_unlock(id); return rc; } + if (iomap->type == IOMAP_DAX_COW) + memcpy(iomap->inline_data, kaddr, offset); memset(kaddr + offset, 0, size); dax_flush(dax_dev, kaddr + offset, size); dax_read_unlock(id); diff --git a/fs/iomap.c b/fs/iomap.c index abdd18e404f8..90698c854883 100644 --- a/fs/iomap.c +++ b/fs/iomap.c @@ -98,12 +98,6 @@ iomap_apply(struct inode *inode, loff_t pos, loff_t length, unsigned flags, return written ? 
written : ret; } -static sector_t -iomap_sector(struct iomap *iomap, loff_t pos) -{ - return (iomap->addr + pos - iomap->offset) >> SECTOR_SHIFT; -} - static struct iomap_page * iomap_page_create(struct inode *inode, struct page *page) { @@ -990,8 +984,7 @@ static int iomap_zero(struct inode *inode, loff_t pos, unsigned offset, static int iomap_dax_zero(loff_t pos, unsigned offset, unsigned bytes, struct iomap *iomap) { - return __dax_zero_page_range(iomap->bdev, iomap->dax_dev, - iomap_sector(iomap, pos & PAGE_MASK), offset, bytes); + return __dax_zero_page_range(iomap, pos, offset, bytes); } static loff_t diff --git a/include/linux/dax.h b/include/linux/dax.h index a11bc7b1f526..892c478d7073 100644 --- a/include/linux/dax.h +++ b/include/linux/dax.h @@ -9,6 +9,7 @@ typedef unsigned long dax_entry_t; +struct iomap; struct iomap_ops; struct dax_device; struct dax_operations { @@ -161,13 +162,11 @@ int dax_file_range_compare(struct inode *src, loff_t srcoff, struct inode *dest, loff_t destoff, loff_t len, bool *is_same, const struct iomap_ops *ops); #ifdef CONFIG_FS_DAX -int __dax_zero_page_range(struct block_device *bdev, - struct dax_device *dax_dev, sector_t sector, - unsigned int offset, unsigned int length); +int __dax_zero_page_range(struct iomap *iomap, loff_t pos, + unsigned int offset, unsigned int size); #else -static inline int __dax_zero_page_range(struct block_device *bdev, - struct dax_device *dax_dev, sector_t sector, - unsigned int offset, unsigned int length) +static inline int __dax_zero_page_range(struct iomap *iomap, loff_t pos, + unsigned int offset, unsigned int size) { return -ENXIO; } diff --git a/include/linux/iomap.h b/include/linux/iomap.h index 6e885c5a38a3..3a803566dea1 100644 --- a/include/linux/iomap.h +++ b/include/linux/iomap.h @@ -7,6 +7,7 @@ #include <linux/mm.h> #include <linux/types.h> #include <linux/mm_types.h> +#include <linux/blkdev.h> struct address_space; struct fiemap_extent_info; @@ -120,6 +121,11 @@ static inline struct 
iomap_page *to_iomap_page(struct page *page) return NULL; } +static inline sector_t iomap_sector(struct iomap *iomap, loff_t pos) +{ + return (iomap->addr + pos - iomap->offset) >> SECTOR_SHIFT; +} + ssize_t iomap_file_buffered_write(struct kiocb *iocb, struct iov_iter *from, const struct iomap_ops *ops); int iomap_readpage(struct page *page, const struct iomap_ops *ops); -- 2.16.4 _______________________________________________ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm ^ permalink raw reply related [flat|nested] 45+ messages in thread
[parent not found: <20190416164154.30390-15-rgoldwyn-l3A5Bk7waGM@public.gmane.org>]
* Re: [PATCH 14/18] dax: memcpy before zeroing range [not found] ` <20190416164154.30390-15-rgoldwyn-l3A5Bk7waGM@public.gmane.org> @ 2019-04-17 15:45 ` Darrick J. Wong 2019-04-17 16:39 ` Goldwyn Rodrigues 0 siblings, 1 reply; 45+ messages in thread From: Darrick J. Wong @ 2019-04-17 15:45 UTC (permalink / raw) To: Goldwyn Rodrigues Cc: kilobyte-b9QjgO8OEXPVItvQsEIGlw, jack-AlSwsSmVLrQ, linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw, nborisov-IBi9RG/b67k, david-FqsqvQoI3Ljby3iVrkZq2A, dsterba-AlSwsSmVLrQ, willy-wEGCiKHe2LqWVfeAwA7xHQ, Goldwyn Rodrigues, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, hch-jcswGhMUV9g, linux-btrfs-u79uwXL29TY76Z2rM5mHXA On Tue, Apr 16, 2019 at 11:41:50AM -0500, Goldwyn Rodrigues wrote: > From: Goldwyn Rodrigues <rgoldwyn-IBi9RG/b67k@public.gmane.org> > > However, this needed more iomap fields, so it was easier > to pass iomap and compute inside the function rather > than passing a log of arguments. > > Note, there is subtle difference between iomap_sector and > dax_iomap_sector(). Can we replace dax_iomap_sector with > iomap_sector()? It would need pos & PAGE_MASK though or else > bdev_dax_pgoff() return -EINVAL. 
> > Signed-off-by: Goldwyn Rodrigues <rgoldwyn-IBi9RG/b67k@public.gmane.org> > --- > fs/dax.c | 14 ++++++++++---- > fs/iomap.c | 9 +-------- > include/linux/dax.h | 11 +++++------ > include/linux/iomap.h | 6 ++++++ > 4 files changed, 22 insertions(+), 18 deletions(-) > > diff --git a/fs/dax.c b/fs/dax.c > index abbe4a79f219..af94909640ea 100644 > --- a/fs/dax.c > +++ b/fs/dax.c > @@ -1055,11 +1055,15 @@ static bool dax_range_is_aligned(struct block_device *bdev, > return true; > } > > -int __dax_zero_page_range(struct block_device *bdev, > - struct dax_device *dax_dev, sector_t sector, > - unsigned int offset, unsigned int size) > +int __dax_zero_page_range(struct iomap *iomap, loff_t pos, > + unsigned int offset, unsigned int size) > { > - if (dax_range_is_aligned(bdev, offset, size)) { > + sector_t sector = dax_iomap_sector(iomap, pos & PAGE_MASK); > + struct block_device *bdev = iomap->bdev; > + struct dax_device *dax_dev = iomap->dax_dev; > + > + if (!(iomap->type == IOMAP_DAX_COW) && > + dax_range_is_aligned(bdev, offset, size)) { > sector_t start_sector = sector + (offset >> 9); > > return blkdev_issue_zeroout(bdev, start_sector, > @@ -1079,6 +1083,8 @@ int __dax_zero_page_range(struct block_device *bdev, > dax_read_unlock(id); > return rc; > } > + if (iomap->type == IOMAP_DAX_COW) > + memcpy(iomap->inline_data, kaddr, offset); I'm confused, why are we copying into the source page before zeroing some part of the dax device? > memset(kaddr + offset, 0, size); > dax_flush(dax_dev, kaddr + offset, size); > dax_read_unlock(id); > diff --git a/fs/iomap.c b/fs/iomap.c > index abdd18e404f8..90698c854883 100644 > --- a/fs/iomap.c > +++ b/fs/iomap.c > @@ -98,12 +98,6 @@ iomap_apply(struct inode *inode, loff_t pos, loff_t length, unsigned flags, > return written ? 
written : ret; > } > > -static sector_t > -iomap_sector(struct iomap *iomap, loff_t pos) > -{ > - return (iomap->addr + pos - iomap->offset) >> SECTOR_SHIFT; > -} > - > static struct iomap_page * > iomap_page_create(struct inode *inode, struct page *page) > { > @@ -990,8 +984,7 @@ static int iomap_zero(struct inode *inode, loff_t pos, unsigned offset, > static int iomap_dax_zero(loff_t pos, unsigned offset, unsigned bytes, > struct iomap *iomap) > { > - return __dax_zero_page_range(iomap->bdev, iomap->dax_dev, > - iomap_sector(iomap, pos & PAGE_MASK), offset, bytes); > + return __dax_zero_page_range(iomap, pos, offset, bytes); > } > > static loff_t > diff --git a/include/linux/dax.h b/include/linux/dax.h > index a11bc7b1f526..892c478d7073 100644 > --- a/include/linux/dax.h > +++ b/include/linux/dax.h > @@ -9,6 +9,7 @@ > > typedef unsigned long dax_entry_t; > > +struct iomap; > struct iomap_ops; > struct dax_device; > struct dax_operations { > @@ -161,13 +162,11 @@ int dax_file_range_compare(struct inode *src, loff_t srcoff, struct inode *dest, > loff_t destoff, loff_t len, bool *is_same, const struct iomap_ops *ops); > > #ifdef CONFIG_FS_DAX > -int __dax_zero_page_range(struct block_device *bdev, > - struct dax_device *dax_dev, sector_t sector, > - unsigned int offset, unsigned int length); > +int __dax_zero_page_range(struct iomap *iomap, loff_t pos, > + unsigned int offset, unsigned int size); > #else > -static inline int __dax_zero_page_range(struct block_device *bdev, > - struct dax_device *dax_dev, sector_t sector, > - unsigned int offset, unsigned int length) > +static inline int __dax_zero_page_range(struct iomap *iomap, loff_t pos, > + unsigned int offset, unsigned int size) > { > return -ENXIO; > } > diff --git a/include/linux/iomap.h b/include/linux/iomap.h > index 6e885c5a38a3..3a803566dea1 100644 > --- a/include/linux/iomap.h > +++ b/include/linux/iomap.h > @@ -7,6 +7,7 @@ > #include <linux/mm.h> > #include <linux/types.h> > #include <linux/mm_types.h> 
> +#include <linux/blkdev.h> > > struct address_space; > struct fiemap_extent_info; > @@ -120,6 +121,11 @@ static inline struct iomap_page *to_iomap_page(struct page *page) > return NULL; > } > > +static inline sector_t iomap_sector(struct iomap *iomap, loff_t pos) > +{ > + return (iomap->addr + pos - iomap->offset) >> SECTOR_SHIFT; Why the double indent? --D > +} > + > ssize_t iomap_file_buffered_write(struct kiocb *iocb, struct iov_iter *from, > const struct iomap_ops *ops); > int iomap_readpage(struct page *page, const struct iomap_ops *ops); > -- > 2.16.4 > ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH 14/18] dax: memcpy before zeroing range 2019-04-17 15:45 ` Darrick J. Wong @ 2019-04-17 16:39 ` Goldwyn Rodrigues 0 siblings, 0 replies; 45+ messages in thread From: Goldwyn Rodrigues @ 2019-04-17 16:39 UTC (permalink / raw) To: Darrick J. Wong Cc: linux-btrfs, linux-fsdevel, jack, david, willy, hch, kilobyte, dsterba, nborisov, linux-nvdimm On 8:45 17/04, Darrick J. Wong wrote: > On Tue, Apr 16, 2019 at 11:41:50AM -0500, Goldwyn Rodrigues wrote: > > From: Goldwyn Rodrigues <rgoldwyn@suse.com> > > > > However, this needed more iomap fields, so it was easier > > to pass iomap and compute inside the function rather > > than passing a log of arguments. > > > > Note, there is subtle difference between iomap_sector and > > dax_iomap_sector(). Can we replace dax_iomap_sector with > > iomap_sector()? It would need pos & PAGE_MASK though or else > > bdev_dax_pgoff() return -EINVAL. > > > > Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> > > --- > > fs/dax.c | 14 ++++++++++---- > > fs/iomap.c | 9 +-------- > > include/linux/dax.h | 11 +++++------ > > include/linux/iomap.h | 6 ++++++ > > 4 files changed, 22 insertions(+), 18 deletions(-) > > > > diff --git a/fs/dax.c b/fs/dax.c > > index abbe4a79f219..af94909640ea 100644 > > --- a/fs/dax.c > > +++ b/fs/dax.c > > @@ -1055,11 +1055,15 @@ static bool dax_range_is_aligned(struct block_device *bdev, > > return true; > > } > > > > -int __dax_zero_page_range(struct block_device *bdev, > > - struct dax_device *dax_dev, sector_t sector, > > - unsigned int offset, unsigned int size) > > +int __dax_zero_page_range(struct iomap *iomap, loff_t pos, > > + unsigned int offset, unsigned int size) > > { > > - if (dax_range_is_aligned(bdev, offset, size)) { > > + sector_t sector = dax_iomap_sector(iomap, pos & PAGE_MASK); > > + struct block_device *bdev = iomap->bdev; > > + struct dax_device *dax_dev = iomap->dax_dev; > > + > > + if (!(iomap->type == IOMAP_DAX_COW) && > > + dax_range_is_aligned(bdev, offset, size)) { > > 
sector_t start_sector = sector + (offset >> 9); > > > > return blkdev_issue_zeroout(bdev, start_sector, > > @@ -1079,6 +1083,8 @@ int __dax_zero_page_range(struct block_device *bdev, > > dax_read_unlock(id); > > return rc; > > } > > + if (iomap->type == IOMAP_DAX_COW) > > + memcpy(iomap->inline_data, kaddr, offset); > > I'm confused, why are we copying into the source page before zeroing > some part of the dax device? Clearly my testing was not good enough. This should be - memcpy(kaddr, iomap->inline_data, offset); We copy the part which is not zeroed to new location. > > > memset(kaddr + offset, 0, size); > > dax_flush(dax_dev, kaddr + offset, size); > > dax_read_unlock(id); > > diff --git a/fs/iomap.c b/fs/iomap.c > > index abdd18e404f8..90698c854883 100644 > > --- a/fs/iomap.c > > +++ b/fs/iomap.c > > @@ -98,12 +98,6 @@ iomap_apply(struct inode *inode, loff_t pos, loff_t length, unsigned flags, > > return written ? written : ret; > > } > > > > -static sector_t > > -iomap_sector(struct iomap *iomap, loff_t pos) > > -{ > > - return (iomap->addr + pos - iomap->offset) >> SECTOR_SHIFT; > > -} > > - > > static struct iomap_page * > > iomap_page_create(struct inode *inode, struct page *page) > > { > > @@ -990,8 +984,7 @@ static int iomap_zero(struct inode *inode, loff_t pos, unsigned offset, > > static int iomap_dax_zero(loff_t pos, unsigned offset, unsigned bytes, > > struct iomap *iomap) > > { > > - return __dax_zero_page_range(iomap->bdev, iomap->dax_dev, > > - iomap_sector(iomap, pos & PAGE_MASK), offset, bytes); > > + return __dax_zero_page_range(iomap, pos, offset, bytes); > > } > > > > static loff_t > > diff --git a/include/linux/dax.h b/include/linux/dax.h > > index a11bc7b1f526..892c478d7073 100644 > > --- a/include/linux/dax.h > > +++ b/include/linux/dax.h > > @@ -9,6 +9,7 @@ > > > > typedef unsigned long dax_entry_t; > > > > +struct iomap; > > struct iomap_ops; > > struct dax_device; > > struct dax_operations { > > @@ -161,13 +162,11 @@ int 
dax_file_range_compare(struct inode *src, loff_t srcoff, struct inode *dest, > > loff_t destoff, loff_t len, bool *is_same, const struct iomap_ops *ops); > > > > #ifdef CONFIG_FS_DAX > > -int __dax_zero_page_range(struct block_device *bdev, > > - struct dax_device *dax_dev, sector_t sector, > > - unsigned int offset, unsigned int length); > > +int __dax_zero_page_range(struct iomap *iomap, loff_t pos, > > + unsigned int offset, unsigned int size); > > #else > > -static inline int __dax_zero_page_range(struct block_device *bdev, > > - struct dax_device *dax_dev, sector_t sector, > > - unsigned int offset, unsigned int length) > > +static inline int __dax_zero_page_range(struct iomap *iomap, loff_t pos, > > + unsigned int offset, unsigned int size) > > { > > return -ENXIO; > > } > > diff --git a/include/linux/iomap.h b/include/linux/iomap.h > > index 6e885c5a38a3..3a803566dea1 100644 > > --- a/include/linux/iomap.h > > +++ b/include/linux/iomap.h > > @@ -7,6 +7,7 @@ > > #include <linux/mm.h> > > #include <linux/types.h> > > #include <linux/mm_types.h> > > +#include <linux/blkdev.h> > > > > struct address_space; > > struct fiemap_extent_info; > > @@ -120,6 +121,11 @@ static inline struct iomap_page *to_iomap_page(struct page *page) > > return NULL; > > } > > > > +static inline sector_t iomap_sector(struct iomap *iomap, loff_t pos) > > +{ > > + return (iomap->addr + pos - iomap->offset) >> SECTOR_SHIFT; > > Why the double indent? Ahh.. I will fix this. > > --D > > > +} > > + > > ssize_t iomap_file_buffered_write(struct kiocb *iocb, struct iov_iter *from, > > const struct iomap_ops *ops); > > int iomap_readpage(struct page *page, const struct iomap_ops *ops); > > -- > > 2.16.4 > > > -- Goldwyn ^ permalink raw reply [flat|nested] 45+ messages in thread
* [PATCH 15/18] btrfs: handle dax page zeroing 2019-04-16 16:41 [PATCH v3 00/18] btrfs dax support Goldwyn Rodrigues ` (13 preceding siblings ...) 2019-04-16 16:41 ` [PATCH 14/18] dax: memcpy before zeroing range Goldwyn Rodrigues @ 2019-04-16 16:41 ` Goldwyn Rodrigues 2019-04-16 16:41 ` [PATCH 16/18] btrfs: Writeprotect mmap pages on snapshot Goldwyn Rodrigues ` (3 subsequent siblings) 18 siblings, 0 replies; 45+ messages in thread From: Goldwyn Rodrigues @ 2019-04-16 16:41 UTC (permalink / raw) To: linux-btrfs Cc: linux-fsdevel, jack, david, willy, hch, darrick.wong, kilobyte, dsterba, nborisov, linux-nvdimm, Goldwyn Rodrigues From: Goldwyn Rodrigues <rgoldwyn@suse.com> btrfs_dax_zero_block() zeros part of the page, either from the front or the regular rest of the block. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> --- fs/btrfs/ctree.h | 1 + fs/btrfs/dax.c | 27 ++++++++++++++++++++++++++- fs/btrfs/inode.c | 4 ++++ 3 files changed, 31 insertions(+), 1 deletion(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index d3d044125619..ee1ed18f8b3c 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -3806,6 +3806,7 @@ vm_fault_t btrfs_dax_fault(struct vm_fault *vmf); int btrfs_dax_file_range_compare(struct inode *src, loff_t srcoff, struct inode *dest, loff_t destoff, loff_t len, bool *is_same); +int btrfs_dax_zero_block(struct inode *inode, loff_t from, loff_t len, bool front); #else static inline ssize_t btrfs_file_dax_write(struct kiocb *iocb, struct iov_iter *from) { diff --git a/fs/btrfs/dax.c b/fs/btrfs/dax.c index a29628b403b3..a4b36d5b537c 100644 --- a/fs/btrfs/dax.c +++ b/fs/btrfs/dax.c @@ -132,7 +132,8 @@ static int btrfs_iomap_begin(struct inode *inode, loff_t pos, * This will be true for reads only since we have already * allocated em */ - if (em->block_start == EXTENT_MAP_HOLE) { + if (em->block_start == EXTENT_MAP_HOLE || + em->flags == EXTENT_FLAG_FILLING) { iomap->type = IOMAP_HOLE; return 0; } @@ -234,4 +235,28 @@ int 
btrfs_dax_file_range_compare(struct inode *src, loff_t srcoff, { return dax_file_range_compare(src, srcoff, dest, destoff, len, is_same, &btrfs_iomap_ops); } + +/* + * zero a part of the page only. This should CoW (via iomap_begin) if required + */ +int btrfs_dax_zero_block(struct inode *inode, loff_t from, loff_t len, bool front) +{ + loff_t start = round_down(from, PAGE_SIZE); + loff_t end = round_up(from, PAGE_SIZE); + loff_t offset = from; + int ret = 0; + + if (front) { + len = from - start; + offset = start; + } else { + if (!len) + len = end - from; + } + + if (len) + ret = iomap_zero_range(inode, offset, len, NULL, &btrfs_iomap_ops); + + return (ret < 0) ? ret : 0; +} #endif /* CONFIG_FS_DAX */ diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index c2c0544009a7..f6a64607bc8c 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -4833,6 +4833,10 @@ int btrfs_truncate_block(struct inode *inode, loff_t from, loff_t len, (!len || IS_ALIGNED(len, blocksize))) goto out; +#ifdef CONFIG_FS_DAX + if (IS_DAX(inode)) + return btrfs_dax_zero_block(inode, from, len, front); +#endif block_start = round_down(from, blocksize); block_end = block_start + blocksize - 1; -- 2.16.4 ^ permalink raw reply related [flat|nested] 45+ messages in thread
* [PATCH 16/18] btrfs: Writeprotect mmap pages on snapshot 2019-04-16 16:41 [PATCH v3 00/18] btrfs dax support Goldwyn Rodrigues ` (14 preceding siblings ...) 2019-04-16 16:41 ` [PATCH 15/18] btrfs: handle dax page zeroing Goldwyn Rodrigues @ 2019-04-16 16:41 ` Goldwyn Rodrigues 2019-04-16 16:41 ` [PATCH 17/18] btrfs: Disable dax-based defrag and send Goldwyn Rodrigues ` (2 subsequent siblings) 18 siblings, 0 replies; 45+ messages in thread From: Goldwyn Rodrigues @ 2019-04-16 16:41 UTC (permalink / raw) To: linux-btrfs Cc: kilobyte, jack, darrick.wong, nborisov, linux-nvdimm, david, dsterba, willy, linux-fsdevel, hch, Goldwyn Rodrigues From: Goldwyn Rodrigues <rgoldwyn@suse.com> In order to make sure mmap'd files don't change after a snapshot, write-protect the mmap'd pages on snapshot. This is done by performing a data writeback on the pages (which simply marks the pages as write-protected). This way, if the user process tries to access the memory we get another fault and can perform a CoW. To accomplish this, we tag all CoW pages as PAGECACHE_TAG_TOWRITE and add the mmap'd inode to delalloc_inodes. During snapshot creation, writeback is started for all delalloc'd inodes, and here we perform a data writeback. We don't want to keep the inodes in delalloc_inodes until unmount (which would trigger a WARN_ON), so we remove them during inode eviction. 
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> --- fs/btrfs/ctree.h | 3 ++- fs/btrfs/dax.c | 7 +++++++ fs/btrfs/inode.c | 13 ++++++++++++- 3 files changed, 21 insertions(+), 2 deletions(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index ee1ed18f8b3c..d1b70f24adeb 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -3252,7 +3252,8 @@ int btrfs_create_subvol_root(struct btrfs_trans_handle *trans, struct btrfs_root *new_root, struct btrfs_root *parent_root, u64 new_dirid); - void btrfs_set_delalloc_extent(struct inode *inode, struct extent_state *state, +void btrfs_add_delalloc_inodes(struct btrfs_root *root, struct inode *inode); +void btrfs_set_delalloc_extent(struct inode *inode, struct extent_state *state, unsigned *bits); void btrfs_clear_delalloc_extent(struct inode *inode, struct extent_state *state, unsigned *bits); diff --git a/fs/btrfs/dax.c b/fs/btrfs/dax.c index a4b36d5b537c..2740c15286ee 100644 --- a/fs/btrfs/dax.c +++ b/fs/btrfs/dax.c @@ -222,10 +222,17 @@ vm_fault_t btrfs_dax_fault(struct vm_fault *vmf) { vm_fault_t ret; pfn_t pfn; + struct inode *inode = file_inode(vmf->vma->vm_file); + struct btrfs_inode *binode = BTRFS_I(inode); ret = dax_iomap_fault(vmf, PE_SIZE_PTE, &pfn, NULL, &btrfs_iomap_ops); if (ret & VM_FAULT_NEEDDSYNC) ret = dax_finish_sync_fault(vmf, PE_SIZE_PTE, pfn); + /* Insert into delalloc so we get writeback calls on snapshots */ + if (vmf->flags & FAULT_FLAG_WRITE && + !test_bit(BTRFS_INODE_IN_DELALLOC_LIST, &binode->runtime_flags)) + btrfs_add_delalloc_inodes(binode->root, inode); + return ret; } diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index f6a64607bc8c..d439c6c80a2a 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -1713,7 +1713,7 @@ void btrfs_merge_delalloc_extent(struct inode *inode, struct extent_state *new, spin_unlock(&BTRFS_I(inode)->lock); } -static void btrfs_add_delalloc_inodes(struct btrfs_root *root, +void btrfs_add_delalloc_inodes(struct btrfs_root *root, struct inode *inode) { struct 
btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb); @@ -5358,12 +5358,17 @@ void btrfs_evict_inode(struct inode *inode) { struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb); struct btrfs_trans_handle *trans; + struct btrfs_inode *binode = BTRFS_I(inode); struct btrfs_root *root = BTRFS_I(inode)->root; struct btrfs_block_rsv *rsv; int ret; trace_btrfs_inode_evict(inode); + if (IS_DAX(inode) + && test_bit(BTRFS_INODE_IN_DELALLOC_LIST, &binode->runtime_flags)) + btrfs_del_delalloc_inode(root, binode); + if (!root) { clear_inode(inode); return; @@ -8683,6 +8688,10 @@ static int btrfs_dax_writepages(struct address_space *mapping, { struct inode *inode = mapping->host; struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb); + struct btrfs_inode *binode = BTRFS_I(inode); + if ((wbc->sync_mode == WB_SYNC_ALL) && + test_bit(BTRFS_INODE_IN_DELALLOC_LIST, &binode->runtime_flags)) + btrfs_del_delalloc_inode(binode->root, binode); return dax_writeback_mapping_range(mapping, fs_info->fs_devices->latest_bdev, wbc); } @@ -9981,6 +9990,8 @@ static void btrfs_run_delalloc_work(struct btrfs_work *work) delalloc_work = container_of(work, struct btrfs_delalloc_work, work); inode = delalloc_work->inode; + if (IS_DAX(inode)) + filemap_fdatawrite(inode->i_mapping); filemap_flush(inode->i_mapping); if (test_bit(BTRFS_INODE_HAS_ASYNC_EXTENT, &BTRFS_I(inode)->runtime_flags)) -- 2.16.4 _______________________________________________ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm ^ permalink raw reply related [flat|nested] 45+ messages in thread
* [PATCH 17/18] btrfs: Disable dax-based defrag and send 2019-04-16 16:41 [PATCH v3 00/18] btrfs dax support Goldwyn Rodrigues ` (15 preceding siblings ...) 2019-04-16 16:41 ` [PATCH 16/18] btrfs: Writeprotect mmap pages on snapshot Goldwyn Rodrigues @ 2019-04-16 16:41 ` Goldwyn Rodrigues 2019-04-16 16:41 ` [PATCH 18/18] btrfs: trace functions for btrfs_iomap_begin/end Goldwyn Rodrigues 2019-04-17 16:49 ` [PATCH v3 00/18] btrfs dax support Adam Borowski 18 siblings, 0 replies; 45+ messages in thread From: Goldwyn Rodrigues @ 2019-04-16 16:41 UTC (permalink / raw) To: linux-btrfs Cc: kilobyte, jack, darrick.wong, nborisov, linux-nvdimm, david, dsterba, willy, linux-fsdevel, hch, Goldwyn Rodrigues From: Goldwyn Rodrigues <rgoldwyn@suse.com> This is temporary, and a TODO. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> --- fs/btrfs/ioctl.c | 13 +++++++++++++ fs/btrfs/send.c | 4 ++++ 2 files changed, 17 insertions(+) diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c index cd590105bd78..8ac770fe00dc 100644 --- a/fs/btrfs/ioctl.c +++ b/fs/btrfs/ioctl.c @@ -2990,6 +2990,12 @@ static int btrfs_ioctl_defrag(struct file *file, void __user *argp) goto out; } + if (IS_DAX(inode)) { + btrfs_warn(root->fs_info, "File defrag is not supported with DAX"); + ret = -EOPNOTSUPP; + goto out; + } + if (argp) { if (copy_from_user(range, argp, sizeof(*range))) { @@ -4653,6 +4659,10 @@ static long btrfs_ioctl_balance(struct file *file, void __user *arg) if (!capable(CAP_SYS_ADMIN)) return -EPERM; + /* send can be on a directory, so check super block instead */ + if (btrfs_test_opt(fs_info, DAX)) + return -EOPNOTSUPP; + ret = mnt_want_write_file(file); if (ret) return ret; @@ -5505,6 +5515,9 @@ static int _btrfs_ioctl_send(struct file *file, void __user *argp, bool compat) struct btrfs_ioctl_send_args *arg; int ret; + if (IS_DAX(file_inode(file))) + return -EOPNOTSUPP; + if (compat) { #if defined(CONFIG_64BIT) && defined(CONFIG_COMPAT) struct btrfs_ioctl_send_args_32 args32; diff --git 
a/fs/btrfs/send.c b/fs/btrfs/send.c index 7ea2d6b1f170..9679fd54db86 100644 --- a/fs/btrfs/send.c +++ b/fs/btrfs/send.c @@ -6609,6 +6609,10 @@ long btrfs_ioctl_send(struct file *mnt_file, struct btrfs_ioctl_send_args *arg) int sort_clone_roots = 0; int index; + /* send can be on a directory, so check super block instead */ + if (btrfs_test_opt(fs_info, DAX)) + return -EOPNOTSUPP; + if (!capable(CAP_SYS_ADMIN)) return -EPERM; -- 2.16.4 _______________________________________________ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm ^ permalink raw reply related [flat|nested] 45+ messages in thread
* [PATCH 18/18] btrfs: trace functions for btrfs_iomap_begin/end 2019-04-16 16:41 [PATCH v3 00/18] btrfs dax support Goldwyn Rodrigues ` (16 preceding siblings ...) 2019-04-16 16:41 ` [PATCH 17/18] btrfs: Disable dax-based defrag and send Goldwyn Rodrigues @ 2019-04-16 16:41 ` Goldwyn Rodrigues 2019-04-17 16:49 ` [PATCH v3 00/18] btrfs dax support Adam Borowski 18 siblings, 0 replies; 45+ messages in thread From: Goldwyn Rodrigues @ 2019-04-16 16:41 UTC (permalink / raw) To: linux-btrfs Cc: linux-fsdevel, jack, david, willy, hch, darrick.wong, kilobyte, dsterba, nborisov, linux-nvdimm, Goldwyn Rodrigues From: Goldwyn Rodrigues <rgoldwyn@suse.com> This is for debug purposes only and can be skipped. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> --- fs/btrfs/dax.c | 3 +++ include/trace/events/btrfs.h | 56 ++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 59 insertions(+) diff --git a/fs/btrfs/dax.c b/fs/btrfs/dax.c index 2740c15286ee..de3ebc75f0b2 100644 --- a/fs/btrfs/dax.c +++ b/fs/btrfs/dax.c @@ -104,6 +104,8 @@ static int btrfs_iomap_begin(struct inode *inode, loff_t pos, u64 srcblk = 0; loff_t diff; + trace_btrfs_iomap_begin(inode, pos, length, flags); + em = btrfs_get_extent(BTRFS_I(inode), NULL, 0, pos, length, 0); iomap->type = IOMAP_MAPPED; @@ -164,6 +166,7 @@ static int btrfs_iomap_end(struct inode *inode, loff_t pos, { struct btrfs_iomap *bi = iomap->private; u64 wend; + trace_btrfs_iomap_end(inode, pos, length, written, flags); if (!bi) return 0; diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h index ab1cc33adbac..8779e5789a7c 100644 --- a/include/trace/events/btrfs.h +++ b/include/trace/events/btrfs.h @@ -1850,6 +1850,62 @@ DEFINE_EVENT(btrfs__block_group, btrfs_skip_unused_block_group, TP_ARGS(bg_cache) ); +TRACE_EVENT(btrfs_iomap_begin, + + TP_PROTO(const struct inode *inode, loff_t pos, loff_t length, int flags), + + TP_ARGS(inode, pos, length, flags), + + TP_STRUCT__entry_btrfs( + __field( u64, ino ) + 
__field( u64, pos ) + __field( u64, length ) + __field( int, flags ) + ), + + TP_fast_assign_btrfs(btrfs_sb(inode->i_sb), + __entry->ino = btrfs_ino(BTRFS_I(inode)); + __entry->pos = pos; + __entry->length = length; + __entry->flags = flags; + ), + + TP_printk_btrfs("ino=%llu pos=%llu len=%llu flags=0x%x", + __entry->ino, + __entry->pos, + __entry->length, + __entry->flags) +); + +TRACE_EVENT(btrfs_iomap_end, + + TP_PROTO(const struct inode *inode, loff_t pos, loff_t length, loff_t written, int flags), + + TP_ARGS(inode, pos, length, written, flags), + + TP_STRUCT__entry_btrfs( + __field( u64, ino ) + __field( u64, pos ) + __field( u64, length ) + __field( u64, written ) + __field( int, flags ) + ), + + TP_fast_assign_btrfs(btrfs_sb(inode->i_sb), + __entry->ino = btrfs_ino(BTRFS_I(inode)); + __entry->pos = pos; + __entry->length = length; + __entry->written = written; + __entry->flags = flags; + ), + + TP_printk_btrfs("ino=%llu pos=%llu len=%llu written=%llu flags=0x%x", + __entry->ino, + __entry->pos, + __entry->length, + __entry->written, + __entry->flags) +); #endif /* _TRACE_BTRFS_H */ /* This part must be outside protection */ -- 2.16.4 ^ permalink raw reply related [flat|nested] 45+ messages in thread
* Re: [PATCH v3 00/18] btrfs dax support 2019-04-16 16:41 [PATCH v3 00/18] btrfs dax support Goldwyn Rodrigues ` (17 preceding siblings ...) 2019-04-16 16:41 ` [PATCH 18/18] btrfs: trace functions for btrfs_iomap_begin/end Goldwyn Rodrigues @ 2019-04-17 16:49 ` Adam Borowski 18 siblings, 0 replies; 45+ messages in thread From: Adam Borowski @ 2019-04-17 16:49 UTC (permalink / raw) To: Goldwyn Rodrigues Cc: linux-btrfs, linux-fsdevel, jack, david, willy, hch, darrick.wong, dsterba, nborisov, linux-nvdimm On Tue, Apr 16, 2019 at 11:41:36AM -0500, Goldwyn Rodrigues wrote: > This patch set adds support for dax on the BTRFS filesystem. > In order to support for CoW for btrfs, there were changes which had to be > made to the dax handling. The important one is copying blocks into the > same dax device before using them which is performed by iomap > type IOMAP_DAX_COW. I'm afraid that PMDK's testsuite nearly instantly makes this patchset explode in different ways. It'd probably be a waste of your time to require a round-trip to report such fails and re-test. As PMDK is probably the biggest userspace component, and exercises DAX paths in interesting ways, I guess it'd be best if you ran its testsuite locally until it no longer kills the kernel. Quick start: git clone https://github.com/pmem/pmdk deps: libndctl-dev[el], libdaxctl-dev[el]; rest are mostly optional (full deps for Fedora: https://github.com/pmem/pmdk/blob/master/utils/docker/images/Dockerfile.fedora-28 SuSE's should be similar) cp src/test/testconfig.sh{.example,} -- then edit, you'd want: PMEM_FS_DIR=/mnt/pmem # where you have the dax filesystem on KEEP_GOING=y # to see more than one broken test make make test # to build tests make check # to build and run all tests cd src/test && ./RUNTESTS a_single_test Obviously, PMDK's tests may fail due to some difference of behaviour between ext4, xfs, and your btrfs implementation -- that'll require some cooperation. 
But crashing the kernel is something where the blame is quite obvious. Meow! -- ⢀⣴⠾⠻⢶⣦⠀ ⣾⠁⢠⠒⠀⣿⡁ A dumb species has no way to open a tuna can. ⢿⡄⠘⠷⠚⠋⠀ A smart species invents a can opener. ⠈⠳⣄⠀⠀⠀⠀ A master species delegates. ^ permalink raw reply [flat|nested] 45+ messages in thread
* [PATCH v4 00/18] btrfs dax support @ 2019-04-29 17:26 Goldwyn Rodrigues 2019-04-29 17:26 ` [PATCH 04/18] dax: Introduce IOMAP_DAX_COW to CoW edges during writes Goldwyn Rodrigues 0 siblings, 1 reply; 45+ messages in thread From: Goldwyn Rodrigues @ 2019-04-29 17:26 UTC (permalink / raw) To: linux-btrfs Cc: kilobyte, jack, darrick.wong, nborisov, linux-nvdimm, david, dsterba, willy, linux-fsdevel, hch This patch set adds support for dax on the BTRFS filesystem. In order to support for CoW for btrfs, there were changes which had to be made to the dax handling. The important one is copying blocks into the same dax device before using them which is performed by iomap type IOMAP_DAX_COW. Snapshotting and CoW features are supported (including mmap preservation across snapshots). Git: https://github.com/goldwynr/linux/tree/btrfs-dax Changes since v3: - Fixed memcpy bug - used flags for dax_insert_entry instead of bools for dax_insert_entry() Changes since v2: - Created a new type IOMAP_DAX_COW as opposed to flag IOMAP_F_COW - CoW source address is presented in iomap.inline_data - Split the patches to more elaborate dax/iomap patches Changes since v1: - use iomap instead of redoing everything in btrfs - support for mmap writeprotecting on snapshotting fs/btrfs/Makefile | 1 fs/btrfs/ctree.h | 38 +++++ fs/btrfs/dax.c | 289 +++++++++++++++++++++++++++++++++++++++++-- fs/btrfs/disk-io.c | 4 fs/btrfs/file.c | 37 ++++- fs/btrfs/inode.c | 114 ++++++++++++---- fs/btrfs/ioctl.c | 29 +++- fs/btrfs/send.c | 4 fs/btrfs/super.c | 30 ++++ fs/dax.c | 183 ++++++++++++++++++++++++--- fs/iomap.c | 9 - fs/ocfs2/file.c | 2 fs/read_write.c | 11 - fs/xfs/xfs_reflink.c | 2 include/linux/dax.h | 15 +- include/linux/fs.h | 8 + include/linux/iomap.h | 7 + include/trace/events/btrfs.h | 56 ++++++++ 18 files changed, 752 insertions(+), 87 deletions(-) _______________________________________________ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org 
https://lists.01.org/mailman/listinfo/linux-nvdimm ^ permalink raw reply [flat|nested] 45+ messages in thread
* [PATCH 04/18] dax: Introduce IOMAP_DAX_COW to CoW edges during writes 2019-04-29 17:26 [PATCH v4 " Goldwyn Rodrigues @ 2019-04-29 17:26 ` Goldwyn Rodrigues 2019-05-21 16:51 ` Darrick J. Wong 0 siblings, 1 reply; 45+ messages in thread From: Goldwyn Rodrigues @ 2019-04-29 17:26 UTC (permalink / raw) To: linux-btrfs Cc: kilobyte, jack, darrick.wong, nborisov, linux-nvdimm, david, dsterba, willy, linux-fsdevel, hch, Goldwyn Rodrigues From: Goldwyn Rodrigues <rgoldwyn@suse.com> IOMAP_DAX_COW is an iomap type which copies the edges of the data during a write if the start/end are not page aligned. The source address is expected in iomap->inline_data. dax_copy_edges() is a helper function that performs a copy from one part of the device to another for data that is not page aligned. If iomap->inline_data is NULL, it memsets the area to zero. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> --- fs/dax.c | 46 +++++++++++++++++++++++++++++++++++++++++++++- include/linux/iomap.h | 1 + 2 files changed, 46 insertions(+), 1 deletion(-) diff --git a/fs/dax.c b/fs/dax.c index e5e54da1715f..610bfa861a28 100644 --- a/fs/dax.c +++ b/fs/dax.c @@ -1084,6 +1084,42 @@ int __dax_zero_page_range(struct block_device *bdev, } EXPORT_SYMBOL_GPL(__dax_zero_page_range); +/* + * dax_copy_edges - Copies the part of the pages not included in + * the write, but required for CoW because + * offset/offset+length are not page aligned. 
+ */ +static int dax_copy_edges(struct inode *inode, loff_t pos, loff_t length, + struct iomap *iomap, void *daddr) +{ + unsigned offset = pos & (PAGE_SIZE - 1); + loff_t end = pos + length; + loff_t pg_end = round_up(end, PAGE_SIZE); + void *saddr = iomap->inline_data; + int ret = 0; + /* + * Copy the first part of the page + * Note: we pass offset as length + */ + if (offset) { + if (saddr) + ret = memcpy_mcsafe(daddr, saddr, offset); + else + memset(daddr, 0, offset); + } + + /* Copy the last part of the range */ + if (end < pg_end) { + if (saddr) + ret = memcpy_mcsafe(daddr + offset + length, + saddr + offset + length, pg_end - end); + else + memset(daddr + offset + length, 0, + pg_end - end); + } + return ret; +} + static loff_t dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data, struct iomap *iomap) @@ -1105,9 +1141,11 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data, return iov_iter_zero(min(length, end - pos), iter); } - if (WARN_ON_ONCE(iomap->type != IOMAP_MAPPED)) + if (WARN_ON_ONCE(iomap->type != IOMAP_MAPPED + && iomap->type != IOMAP_DAX_COW)) return -EIO; + /* * Write can allocate block for an area which has a hole page mapped * into page tables. 
We have to tear down these mappings so that data @@ -1144,6 +1182,12 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data, break; } + if (iomap->type == IOMAP_DAX_COW) { + ret = dax_copy_edges(inode, pos, length, iomap, kaddr); + if (ret) + break; + } + map_len = PFN_PHYS(map_len); kaddr += offset; map_len -= offset; diff --git a/include/linux/iomap.h b/include/linux/iomap.h index 0fefb5455bda..6e885c5a38a3 100644 --- a/include/linux/iomap.h +++ b/include/linux/iomap.h @@ -25,6 +25,7 @@ struct vm_fault; #define IOMAP_MAPPED 0x03 /* blocks allocated at @addr */ #define IOMAP_UNWRITTEN 0x04 /* blocks allocated at @addr in unwritten state */ #define IOMAP_INLINE 0x05 /* data inline in the inode */ +#define IOMAP_DAX_COW 0x06 /* Copy data pointed by inline_data before write*/ /* * Flags for all iomap mappings: -- 2.16.4 _______________________________________________ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm ^ permalink raw reply related [flat|nested] 45+ messages in thread
* Re: [PATCH 04/18] dax: Introduce IOMAP_DAX_COW to CoW edges during writes 2019-04-29 17:26 ` [PATCH 04/18] dax: Introduce IOMAP_DAX_COW to CoW edges during writes Goldwyn Rodrigues @ 2019-05-21 16:51 ` Darrick J. Wong 2019-05-22 20:14 ` Goldwyn Rodrigues 2019-05-23 9:05 ` Shiyang Ruan 0 siblings, 2 replies; 45+ messages in thread From: Darrick J. Wong @ 2019-05-21 16:51 UTC (permalink / raw) To: Goldwyn Rodrigues Cc: linux-btrfs, kilobyte, linux-fsdevel, jack, david, willy, hch, dsterba, nborisov, linux-nvdimm, Goldwyn Rodrigues On Mon, Apr 29, 2019 at 12:26:35PM -0500, Goldwyn Rodrigues wrote: > From: Goldwyn Rodrigues <rgoldwyn@suse.com> > > The IOMAP_DAX_COW is a iomap type which performs copy of > edges of data while performing a write if start/end are > not page aligned. The source address is expected in > iomap->inline_data. > > dax_copy_edges() is a helper functions performs a copy from > one part of the device to another for data not page aligned. > If iomap->inline_data is NULL, it memset's the area to zero. > > Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> > --- > fs/dax.c | 46 +++++++++++++++++++++++++++++++++++++++++++++- > include/linux/iomap.h | 1 + > 2 files changed, 46 insertions(+), 1 deletion(-) > > diff --git a/fs/dax.c b/fs/dax.c > index e5e54da1715f..610bfa861a28 100644 > --- a/fs/dax.c > +++ b/fs/dax.c > @@ -1084,6 +1084,42 @@ int __dax_zero_page_range(struct block_device *bdev, > } > EXPORT_SYMBOL_GPL(__dax_zero_page_range); > > +/* > + * dax_copy_edges - Copies the part of the pages not included in > + * the write, but required for CoW because > + * offset/offset+length are not page aligned. 
> + */ > +static int dax_copy_edges(struct inode *inode, loff_t pos, loff_t length, > + struct iomap *iomap, void *daddr) > +{ > + unsigned offset = pos & (PAGE_SIZE - 1); > + loff_t end = pos + length; > + loff_t pg_end = round_up(end, PAGE_SIZE); > + void *saddr = iomap->inline_data; > + int ret = 0; > + /* > + * Copy the first part of the page > + * Note: we pass offset as length > + */ > + if (offset) { > + if (saddr) > + ret = memcpy_mcsafe(daddr, saddr, offset); > + else > + memset(daddr, 0, offset); > + } > + > + /* Copy the last part of the range */ > + if (end < pg_end) { > + if (saddr) > + ret = memcpy_mcsafe(daddr + offset + length, > + saddr + offset + length, pg_end - end); > + else > + memset(daddr + offset + length, 0, > + pg_end - end); > + } > + return ret; > +} > + > static loff_t > dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data, > struct iomap *iomap) > @@ -1105,9 +1141,11 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data, > return iov_iter_zero(min(length, end - pos), iter); > } > > - if (WARN_ON_ONCE(iomap->type != IOMAP_MAPPED)) > + if (WARN_ON_ONCE(iomap->type != IOMAP_MAPPED > + && iomap->type != IOMAP_DAX_COW)) I reiterate (from V3) that the && goes on the previous line... if (WARN_ON_ONCE(iomap->type != IOMAP_MAPPED && iomap->type != IOMAP_DAX_COW)) > return -EIO; > > + > /* > * Write can allocate block for an area which has a hole page mapped > * into page tables. 
We have to tear down these mappings so that data > @@ -1144,6 +1182,12 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data, > break; > } > > + if (iomap->type == IOMAP_DAX_COW) { > + ret = dax_copy_edges(inode, pos, length, iomap, kaddr); > + if (ret) > + break; > + } > + > map_len = PFN_PHYS(map_len); > kaddr += offset; > map_len -= offset; > diff --git a/include/linux/iomap.h b/include/linux/iomap.h > index 0fefb5455bda..6e885c5a38a3 100644 > --- a/include/linux/iomap.h > +++ b/include/linux/iomap.h > @@ -25,6 +25,7 @@ struct vm_fault; > #define IOMAP_MAPPED 0x03 /* blocks allocated at @addr */ > #define IOMAP_UNWRITTEN 0x04 /* blocks allocated at @addr in unwritten state */ > #define IOMAP_INLINE 0x05 /* data inline in the inode */ > +#define IOMAP_DAX_COW 0x06 DAX isn't going to be the only scenario where we need a way to communicate to iomap actors the need to implement copy on write. XFS also uses struct iomap to hand out file leases to clients. The lease code /currently/ doesn't support files with shared blocks (because the only user is pNFS) but one could easily imagine a future where some client wants to lease a file with shared blocks, in which case XFS will want to convey the COW details to the lessee. > +/* Copy data pointed by inline_data before write*/ A month ago during the V3 patchset review, I wrote (possibly in an other thread, sorry) about something that I'm putting my foot down about now for the V4 patchset, which is the {re,ab}use of @inline_data for the data source address. We cannot use @inline_data to convey the source address. @inline_data (so far) is used to point to the in-memory representation of the storage described by @addr. For data writes, @addr is the location of the write on disk and @inline_data is the location of the write in memory. Reusing @inline_data here to point to the location of the source data in memory is a totally different thing and will likely result in confusion. 
On a practical level, this also means that we cannot support the case of COW && INLINE because the type codes collide and so would the users of @inline_data. This isn't required *right now*, but if you had a pmem filesystem that stages inode updates in memory and flips a pointer to commit changes then the ->iomap_begin function will need to convey two pointers at once. So this brings us back to Dave's suggestion during the V1 patchset review that instead of adding more iomap flags/types and overloading fields, we simply pass two struct iomaps into ->iomap_begin: - Change iomap_apply() to "struct iomap iomap[2] = 0;" and pass &iomap[0] into the ->iomap_begin and ->iomap_end functions. The first iomap will be filled in with the destination for the write (as all implementations do now), and the second iomap can be filled in with the source information for a COW operation. - If the ->iomap_begin implementation decides that COW is necessary for the requested operation, then it should fill out that second iomap with information about the extent that the actor must copy before returning. The second iomap's offset and length must match the first. If COW isn't necessary, the ->iomap_begin implementation ignores it, and the second iomap retains type == 0 (i.e. invalid mapping). Proceeding along these lines will (AFAICT) still allow you to enable all the btrfs functionality in the rest of this patchset while making the task of wiring up XFS fairly simple. No overloaded fields and no new flags. This is how I'd like to see this patchset proceed to V5. Does that make sense? --D > > /* > * Flags for all iomap mappings: > -- > 2.16.4 > ^ permalink raw reply [flat|nested] 45+ messages in thread
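The two-iomap control flow described above can be sketched in miniature. This is a hedged userspace model of the proposed interface, not kernel code: struct iomap is cut down to two fields, fs_iomap_begin() stands in for a filesystem's ->iomap_begin implementation, and type == 0 in the second map means no CoW is required.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { IOMAP_MAPPED = 0x03 };	/* 0 remains "invalid mapping" */

struct iomap {
	int type;	/* 0 = invalid, i.e. no CoW source */
	char *addr;	/* in-memory stand-in for the block address */
};

/* Stand-in for a filesystem ->iomap_begin: fill iomap[0] with the write
 * destination and, when CoW is needed, iomap[1] with the source. */
static void fs_iomap_begin(struct iomap *iomap, char *dst, char *cow_src)
{
	iomap[0].type = IOMAP_MAPPED;
	iomap[0].addr = dst;
	if (cow_src) {
		iomap[1].type = IOMAP_MAPPED;
		iomap[1].addr = cow_src;
	}
}

/* Actor: pre-copy the old block contents when the second map is valid,
 * then perform the write itself at the start of the block. */
static void write_actor(struct iomap *iomap, const char *buf, size_t len,
			size_t blksz)
{
	if (iomap[1].type != 0)
		memcpy(iomap[0].addr, iomap[1].addr, blksz);
	memcpy(iomap[0].addr, buf, len);
}

/* Model of iomap_apply(): the second iomap starts out invalid. */
static void iomap_apply(char *dst, char *cow_src, const char *buf,
			size_t len, size_t blksz)
{
	struct iomap iomap[2] = { { 0 } };

	fs_iomap_begin(iomap, dst, cow_src);
	write_actor(iomap, buf, len, blksz);
}
```

With a CoW source the untouched remainder of the block survives the write; without one the actor skips the pre-copy entirely, which is the "ignore the second map" behaviour described above.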
* Re: [PATCH 04/18] dax: Introduce IOMAP_DAX_COW to CoW edges during writes 2019-05-21 16:51 ` Darrick J. Wong @ 2019-05-22 20:14 ` Goldwyn Rodrigues 2019-05-23 2:10 ` Dave Chinner 2019-05-23 9:05 ` Shiyang Ruan 1 sibling, 1 reply; 45+ messages in thread From: Goldwyn Rodrigues @ 2019-05-22 20:14 UTC (permalink / raw) To: Darrick J. Wong Cc: kilobyte, jack, linux-nvdimm, nborisov, david, dsterba, willy, linux-fsdevel, hch, linux-btrfs On 9:51 21/05, Darrick J. Wong wrote: > On Mon, Apr 29, 2019 at 12:26:35PM -0500, Goldwyn Rodrigues wrote: > > From: Goldwyn Rodrigues <rgoldwyn@suse.com> > > > > The IOMAP_DAX_COW is a iomap type which performs copy of > > edges of data while performing a write if start/end are > > not page aligned. The source address is expected in > > iomap->inline_data. > > > > dax_copy_edges() is a helper functions performs a copy from > > one part of the device to another for data not page aligned. > > If iomap->inline_data is NULL, it memset's the area to zero. > > > > Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> > > --- > > fs/dax.c | 46 +++++++++++++++++++++++++++++++++++++++++++++- > > include/linux/iomap.h | 1 + > > 2 files changed, 46 insertions(+), 1 deletion(-) > > > > diff --git a/fs/dax.c b/fs/dax.c > > index e5e54da1715f..610bfa861a28 100644 > > --- a/fs/dax.c > > +++ b/fs/dax.c > > @@ -1084,6 +1084,42 @@ int __dax_zero_page_range(struct block_device *bdev, > > } > > EXPORT_SYMBOL_GPL(__dax_zero_page_range); > > > > +/* > > + * dax_copy_edges - Copies the part of the pages not included in > > + * the write, but required for CoW because > > + * offset/offset+length are not page aligned. 
> > + */ > > +static int dax_copy_edges(struct inode *inode, loff_t pos, loff_t length, > > + struct iomap *iomap, void *daddr) > > +{ > > + unsigned offset = pos & (PAGE_SIZE - 1); > > + loff_t end = pos + length; > > + loff_t pg_end = round_up(end, PAGE_SIZE); > > + void *saddr = iomap->inline_data; > > + int ret = 0; > > + /* > > + * Copy the first part of the page > > + * Note: we pass offset as length > > + */ > > + if (offset) { > > + if (saddr) > > + ret = memcpy_mcsafe(daddr, saddr, offset); > > + else > > + memset(daddr, 0, offset); > > + } > > + > > + /* Copy the last part of the range */ > > + if (end < pg_end) { > > + if (saddr) > > + ret = memcpy_mcsafe(daddr + offset + length, > > + saddr + offset + length, pg_end - end); > > + else > > + memset(daddr + offset + length, 0, > > + pg_end - end); > > + } > > + return ret; > > +} > > + > > static loff_t > > dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data, > > struct iomap *iomap) > > @@ -1105,9 +1141,11 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data, > > return iov_iter_zero(min(length, end - pos), iter); > > } > > > > - if (WARN_ON_ONCE(iomap->type != IOMAP_MAPPED)) > > + if (WARN_ON_ONCE(iomap->type != IOMAP_MAPPED > > + && iomap->type != IOMAP_DAX_COW)) > > I reiterate (from V3) that the && goes on the previous line... > > if (WARN_ON_ONCE(iomap->type != IOMAP_MAPPED && > iomap->type != IOMAP_DAX_COW)) > > > return -EIO; > > > > + > > /* > > * Write can allocate block for an area which has a hole page mapped > > * into page tables. 
We have to tear down these mappings so that data > > @@ -1144,6 +1182,12 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data, > > break; > > } > > > > + if (iomap->type == IOMAP_DAX_COW) { > > + ret = dax_copy_edges(inode, pos, length, iomap, kaddr); > > + if (ret) > > + break; > > + } > > + > > map_len = PFN_PHYS(map_len); > > kaddr += offset; > > map_len -= offset; > > diff --git a/include/linux/iomap.h b/include/linux/iomap.h > > index 0fefb5455bda..6e885c5a38a3 100644 > > --- a/include/linux/iomap.h > > +++ b/include/linux/iomap.h > > @@ -25,6 +25,7 @@ struct vm_fault; > > #define IOMAP_MAPPED 0x03 /* blocks allocated at @addr */ > > #define IOMAP_UNWRITTEN 0x04 /* blocks allocated at @addr in unwritten state */ > > #define IOMAP_INLINE 0x05 /* data inline in the inode */ > > > +#define IOMAP_DAX_COW 0x06 > > DAX isn't going to be the only scenario where we need a way to > communicate to iomap actors the need to implement copy on write. > > XFS also uses struct iomap to hand out file leases to clients. The > lease code /currently/ doesn't support files with shared blocks (because > the only user is pNFS) but one could easily imagine a future where some > client wants to lease a file with shared blocks, in which case XFS will > want to convey the COW details to the lessee. > > > +/* Copy data pointed by inline_data before write*/ > > A month ago during the V3 patchset review, I wrote (possibly in an other > thread, sorry) about something that I'm putting my foot down about now > for the V4 patchset, which is the {re,ab}use of @inline_data for the > data source address. Looks like I missed this. > > We cannot use @inline_data to convey the source address. @inline_data > (so far) is used to point to the in-memory representation of the storage > described by @addr. For data writes, @addr is the location of the write > on disk and @inline_data is the location of the write in memory. 
> > Reusing @inline_data here to point to the location of the source data in > memory is a totally different thing and will likely result in confusion. > On a practical level, this also means that we cannot support the case of > COW && INLINE because the type codes collide and so would the users of > @inline_data. This isn't required *right now*, but if you had a pmem > filesystem that stages inode updates in memory and flips a pointer to > commit changes then the ->iomap_begin function will need to convey two > pointers at once. > > So this brings us back to Dave's suggestion during the V1 patchset > review that instead of adding more iomap flags/types and overloading > fields, we simply pass two struct iomaps into ->iomap_begin: Actually, Dave is the one who suggested to perform it this way. https://patchwork.kernel.org/comment/22562195/ > > - Change iomap_apply() to "struct iomap iomap[2] = 0;" and pass > &iomap[0] into the ->iomap_begin and ->iomap_end functions. The > first iomap will be filled in with the destination for the write (as > all implementations do now), and the second iomap can be filled in > with the source information for a COW operation. > > - If the ->iomap_begin implementation decides that COW is necessary for > the requested operation, then it should fill out that second iomap > with information about the extent that the actor must copied before > returning. The second iomap's offset and length must match the > first. If COW isn't necessary, the ->iomap_begin implementation > ignores it, and the second iomap retains type == 0 (i.e. invalid > mapping). > > Proceeding along these lines will (AFAICT) still allow you to enable all > the btrfs functionality in the rest of this patchset while making the > task of wiring up XFS fairly simple. No overloaded fields and no new > flags. > > This is how I'd like to see this patchset should proceed to V5. Does > that make sense? 
Yes, I think this would be a more flexible design as well if we ever decide to extend it beyond dax. We would still need an IOMAP_COW type set in iomap[0]. -- Goldwyn ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH 04/18] dax: Introduce IOMAP_DAX_COW to CoW edges during writes 2019-05-22 20:14 ` Goldwyn Rodrigues @ 2019-05-23 2:10 ` Dave Chinner 0 siblings, 0 replies; 45+ messages in thread From: Dave Chinner @ 2019-05-23 2:10 UTC (permalink / raw) To: Goldwyn Rodrigues Cc: kilobyte, jack, Darrick J. Wong, nborisov, linux-nvdimm, dsterba, willy, linux-fsdevel, hch, linux-btrfs On Wed, May 22, 2019 at 03:14:47PM -0500, Goldwyn Rodrigues wrote: > On 9:51 21/05, Darrick J. Wong wrote: > > On Mon, Apr 29, 2019 at 12:26:35PM -0500, Goldwyn Rodrigues wrote: > > We cannot use @inline_data to convey the source address. @inline_data > > (so far) is used to point to the in-memory representation of the storage > > described by @addr. For data writes, @addr is the location of the write > > on disk and @inline_data is the location of the write in memory. > > > > Reusing @inline_data here to point to the location of the source data in > > memory is a totally different thing and will likely result in confusion. > > On a practical level, this also means that we cannot support the case of > > COW && INLINE because the type codes collide and so would the users of > > @inline_data. This isn't required *right now*, but if you had a pmem > > filesystem that stages inode updates in memory and flips a pointer to > > commit changes then the ->iomap_begin function will need to convey two > > pointers at once. > > > > So this brings us back to Dave's suggestion during the V1 patchset > > review that instead of adding more iomap flags/types and overloading > > fields, we simply pass two struct iomaps into ->iomap_begin: > > Actually, Dave is the one who suggested to perform it this way. > https://patchwork.kernel.org/comment/22562195/ My first suggestion was to use two iomaps. This suggestion came later, as a way of demonstrating that a different type could be used to redefine what ->inline_data was used for, if people considered that an acceptible solution. 
What was apparent from other discussions in the thread you quote was that using two iomaps looked to be the better, more general approach to solving the iomap read-modify-write issue at hand. Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH 04/18] dax: Introduce IOMAP_DAX_COW to CoW edges during writes 2019-05-21 16:51 ` Darrick J. Wong 2019-05-22 20:14 ` Goldwyn Rodrigues @ 2019-05-23 9:05 ` Shiyang Ruan 2019-05-23 11:51 ` Goldwyn Rodrigues 1 sibling, 1 reply; 45+ messages in thread From: Shiyang Ruan @ 2019-05-23 9:05 UTC (permalink / raw) To: Darrick J. Wong, Goldwyn Rodrigues Cc: kilobyte, jack, linux-nvdimm, nborisov, david, dsterba, willy, Goldwyn Rodrigues, linux-fsdevel, hch, linux-btrfs On 5/22/19 12:51 AM, Darrick J. Wong wrote: > On Mon, Apr 29, 2019 at 12:26:35PM -0500, Goldwyn Rodrigues wrote: >> From: Goldwyn Rodrigues <rgoldwyn@suse.com> >> >> The IOMAP_DAX_COW is a iomap type which performs copy of >> edges of data while performing a write if start/end are >> not page aligned. The source address is expected in >> iomap->inline_data. >> >> dax_copy_edges() is a helper functions performs a copy from >> one part of the device to another for data not page aligned. >> If iomap->inline_data is NULL, it memset's the area to zero. >> >> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> >> --- >> fs/dax.c | 46 +++++++++++++++++++++++++++++++++++++++++++++- >> include/linux/iomap.h | 1 + >> 2 files changed, 46 insertions(+), 1 deletion(-) >> >> diff --git a/fs/dax.c b/fs/dax.c >> index e5e54da1715f..610bfa861a28 100644 >> --- a/fs/dax.c >> +++ b/fs/dax.c >> @@ -1084,6 +1084,42 @@ int __dax_zero_page_range(struct block_device *bdev, >> } >> EXPORT_SYMBOL_GPL(__dax_zero_page_range); >> >> +/* >> + * dax_copy_edges - Copies the part of the pages not included in >> + * the write, but required for CoW because >> + * offset/offset+length are not page aligned. 
>> + */ >> +static int dax_copy_edges(struct inode *inode, loff_t pos, loff_t length, >> + struct iomap *iomap, void *daddr) >> +{ >> + unsigned offset = pos & (PAGE_SIZE - 1); >> + loff_t end = pos + length; >> + loff_t pg_end = round_up(end, PAGE_SIZE); >> + void *saddr = iomap->inline_data; >> + int ret = 0; >> + /* >> + * Copy the first part of the page >> + * Note: we pass offset as length >> + */ >> + if (offset) { >> + if (saddr) >> + ret = memcpy_mcsafe(daddr, saddr, offset); >> + else >> + memset(daddr, 0, offset); >> + } >> + >> + /* Copy the last part of the range */ >> + if (end < pg_end) { >> + if (saddr) >> + ret = memcpy_mcsafe(daddr + offset + length, >> + saddr + offset + length, pg_end - end); >> + else >> + memset(daddr + offset + length, 0, >> + pg_end - end); >> + } >> + return ret; >> +} >> + >> static loff_t >> dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data, >> struct iomap *iomap) >> @@ -1105,9 +1141,11 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data, >> return iov_iter_zero(min(length, end - pos), iter); >> } >> >> - if (WARN_ON_ONCE(iomap->type != IOMAP_MAPPED)) >> + if (WARN_ON_ONCE(iomap->type != IOMAP_MAPPED >> + && iomap->type != IOMAP_DAX_COW)) > > I reiterate (from V3) that the && goes on the previous line... > > if (WARN_ON_ONCE(iomap->type != IOMAP_MAPPED && > iomap->type != IOMAP_DAX_COW)) > >> return -EIO; >> >> + >> /* >> * Write can allocate block for an area which has a hole page mapped >> * into page tables. 
We have to tear down these mappings so that data >> @@ -1144,6 +1182,12 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data, >> break; >> } >> >> + if (iomap->type == IOMAP_DAX_COW) { >> + ret = dax_copy_edges(inode, pos, length, iomap, kaddr); >> + if (ret) >> + break; >> + } >> + >> map_len = PFN_PHYS(map_len); >> kaddr += offset; >> map_len -= offset; >> diff --git a/include/linux/iomap.h b/include/linux/iomap.h >> index 0fefb5455bda..6e885c5a38a3 100644 >> --- a/include/linux/iomap.h >> +++ b/include/linux/iomap.h >> @@ -25,6 +25,7 @@ struct vm_fault; >> #define IOMAP_MAPPED 0x03 /* blocks allocated at @addr */ >> #define IOMAP_UNWRITTEN 0x04 /* blocks allocated at @addr in unwritten state */ >> #define IOMAP_INLINE 0x05 /* data inline in the inode */ > >> +#define IOMAP_DAX_COW 0x06 > > DAX isn't going to be the only scenario where we need a way to > communicate to iomap actors the need to implement copy on write. > > XFS also uses struct iomap to hand out file leases to clients. The > lease code /currently/ doesn't support files with shared blocks (because > the only user is pNFS) but one could easily imagine a future where some > client wants to lease a file with shared blocks, in which case XFS will > want to convey the COW details to the lessee. > >> +/* Copy data pointed by inline_data before write*/ > > A month ago during the V3 patchset review, I wrote (possibly in an other > thread, sorry) about something that I'm putting my foot down about now > for the V4 patchset, which is the {re,ab}use of @inline_data for the > data source address. > > We cannot use @inline_data to convey the source address. @inline_data > (so far) is used to point to the in-memory representation of the storage > described by @addr. For data writes, @addr is the location of the write > on disk and @inline_data is the location of the write in memory. 
> > Reusing @inline_data here to point to the location of the source data in > memory is a totally different thing and will likely result in confusion. > On a practical level, this also means that we cannot support the case of > COW && INLINE because the type codes collide and so would the users of > @inline_data. This isn't required *right now*, but if you had a pmem > filesystem that stages inode updates in memory and flips a pointer to > commit changes then the ->iomap_begin function will need to convey two > pointers at once. > > So this brings us back to Dave's suggestion during the V1 patchset > review that instead of adding more iomap flags/types and overloading > fields, we simply pass two struct iomaps into ->iomap_begin: > > - Change iomap_apply() to "struct iomap iomap[2] = 0;" and pass > &iomap[0] into the ->iomap_begin and ->iomap_end functions. The > first iomap will be filled in with the destination for the write (as > all implementations do now), and the second iomap can be filled in > with the source information for a COW operation. > > - If the ->iomap_begin implementation decides that COW is necessary for > the requested operation, then it should fill out that second iomap > with information about the extent that the actor must copied before > returning. The second iomap's offset and length must match the > first. If COW isn't necessary, the ->iomap_begin implementation Hi, I'm working on reflink & dax in XFS, here are some thoughts on this: As mentioned above: the second iomap's offset and length must match the first. I thought so at the beginning, but later found that the only difference between these two iomaps is @addr. So, what about adding a @saddr, which means the source address of COW extent, into the struct iomap. The ->iomap_begin() fills @saddr if the extent is COW, and 0 if not. Then handle this @saddr in each ->actor(). No more modifications in other functions. 
My RFC patchset[1] is implemented in this way and works for me, though it is far from perfect. [1]: https://patchwork.kernel.org/cover/10904307/ -- Thanks, Shiyang Ruan. > ignores it, and the second iomap retains type == 0 (i.e. invalid > mapping). > > Proceeding along these lines will (AFAICT) still allow you to enable all > the btrfs functionality in the rest of this patchset while making the > task of wiring up XFS fairly simple. No overloaded fields and no new > flags. > > This is how I'd like to see this patchset should proceed to V5. Does > that make sense? > > --D > >> >> /* >> * Flags for all iomap mappings: >> -- >> 2.16.4 >> > > ^ permalink raw reply [flat|nested] 45+ messages in thread
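For comparison, the single-iomap alternative described above — an extra source-address field that ->iomap_begin() fills only for CoW extents — reduces to something like the sketch below. The struct and field names are illustrative; the actual RFC linked above differs in detail.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Model of a struct iomap carrying its own CoW source pointer. */
struct iomap_with_saddr {
	char *addr;	/* destination block */
	char *saddr;	/* CoW source block; NULL when the extent is not CoW */
};

/* Each ->actor() checks @saddr itself: pre-copy the old contents when
 * set, then perform the write at the start of the block. */
static void actor_write(struct iomap_with_saddr *map, const char *buf,
			size_t len, size_t blksz)
{
	if (map->saddr)
		memcpy(map->addr, map->saddr, blksz);
	memcpy(map->addr, buf, len);
}
```

The trade-off is visible here: a bare source address cannot describe a source at a different offset, or in a different inode, which the two-iomap form can.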
* Re: [PATCH 04/18] dax: Introduce IOMAP_DAX_COW to CoW edges during writes 2019-05-23 9:05 ` Shiyang Ruan @ 2019-05-23 11:51 ` Goldwyn Rodrigues 2019-05-27 8:25 ` Shiyang Ruan 0 siblings, 1 reply; 45+ messages in thread From: Goldwyn Rodrigues @ 2019-05-23 11:51 UTC (permalink / raw) To: Shiyang Ruan Cc: kilobyte, jack, Darrick J. Wong, nborisov, linux-nvdimm, david, dsterba, willy, linux-fsdevel, hch, linux-btrfs On 17:05 23/05, Shiyang Ruan wrote: > > > On 5/22/19 12:51 AM, Darrick J. Wong wrote: > > On Mon, Apr 29, 2019 at 12:26:35PM -0500, Goldwyn Rodrigues wrote: > > > From: Goldwyn Rodrigues <rgoldwyn@suse.com> > > > > > > The IOMAP_DAX_COW is a iomap type which performs copy of > > > edges of data while performing a write if start/end are > > > not page aligned. The source address is expected in > > > iomap->inline_data. > > > > > > dax_copy_edges() is a helper functions performs a copy from > > > one part of the device to another for data not page aligned. > > > If iomap->inline_data is NULL, it memset's the area to zero. > > > > > > Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> > > > --- > > > fs/dax.c | 46 +++++++++++++++++++++++++++++++++++++++++++++- > > > include/linux/iomap.h | 1 + > > > 2 files changed, 46 insertions(+), 1 deletion(-) > > > > > > diff --git a/fs/dax.c b/fs/dax.c > > > index e5e54da1715f..610bfa861a28 100644 > > > --- a/fs/dax.c > > > +++ b/fs/dax.c > > > @@ -1084,6 +1084,42 @@ int __dax_zero_page_range(struct block_device *bdev, > > > } > > > EXPORT_SYMBOL_GPL(__dax_zero_page_range); > > > +/* > > > + * dax_copy_edges - Copies the part of the pages not included in > > > + * the write, but required for CoW because > > > + * offset/offset+length are not page aligned. 
> > > + */ > > > +static int dax_copy_edges(struct inode *inode, loff_t pos, loff_t length, > > > + struct iomap *iomap, void *daddr) > > > +{ > > > + unsigned offset = pos & (PAGE_SIZE - 1); > > > + loff_t end = pos + length; > > > + loff_t pg_end = round_up(end, PAGE_SIZE); > > > + void *saddr = iomap->inline_data; > > > + int ret = 0; > > > + /* > > > + * Copy the first part of the page > > > + * Note: we pass offset as length > > > + */ > > > + if (offset) { > > > + if (saddr) > > > + ret = memcpy_mcsafe(daddr, saddr, offset); > > > + else > > > + memset(daddr, 0, offset); > > > + } > > > + > > > + /* Copy the last part of the range */ > > > + if (end < pg_end) { > > > + if (saddr) > > > + ret = memcpy_mcsafe(daddr + offset + length, > > > + saddr + offset + length, pg_end - end); > > > + else > > > + memset(daddr + offset + length, 0, > > > + pg_end - end); > > > + } > > > + return ret; > > > +} > > > + > > > static loff_t > > > dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data, > > > struct iomap *iomap) > > > @@ -1105,9 +1141,11 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data, > > > return iov_iter_zero(min(length, end - pos), iter); > > > } > > > - if (WARN_ON_ONCE(iomap->type != IOMAP_MAPPED)) > > > + if (WARN_ON_ONCE(iomap->type != IOMAP_MAPPED > > > + && iomap->type != IOMAP_DAX_COW)) > > > > I reiterate (from V3) that the && goes on the previous line... > > > > if (WARN_ON_ONCE(iomap->type != IOMAP_MAPPED && > > iomap->type != IOMAP_DAX_COW)) > > > > > return -EIO; > > > + > > > /* > > > * Write can allocate block for an area which has a hole page mapped > > > * into page tables. 
We have to tear down these mappings so that data > > > @@ -1144,6 +1182,12 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data, > > > break; > > > } > > > + if (iomap->type == IOMAP_DAX_COW) { > > > + ret = dax_copy_edges(inode, pos, length, iomap, kaddr); > > > + if (ret) > > > + break; > > > + } > > > + > > > map_len = PFN_PHYS(map_len); > > > kaddr += offset; > > > map_len -= offset; > > > diff --git a/include/linux/iomap.h b/include/linux/iomap.h > > > index 0fefb5455bda..6e885c5a38a3 100644 > > > --- a/include/linux/iomap.h > > > +++ b/include/linux/iomap.h > > > @@ -25,6 +25,7 @@ struct vm_fault; > > > #define IOMAP_MAPPED 0x03 /* blocks allocated at @addr */ > > > #define IOMAP_UNWRITTEN 0x04 /* blocks allocated at @addr in unwritten state */ > > > #define IOMAP_INLINE 0x05 /* data inline in the inode */ > > > > > +#define IOMAP_DAX_COW 0x06 > > > > DAX isn't going to be the only scenario where we need a way to > > communicate to iomap actors the need to implement copy on write. > > > > XFS also uses struct iomap to hand out file leases to clients. The > > lease code /currently/ doesn't support files with shared blocks (because > > the only user is pNFS) but one could easily imagine a future where some > > client wants to lease a file with shared blocks, in which case XFS will > > want to convey the COW details to the lessee. > > > > > +/* Copy data pointed by inline_data before write*/ > > > > A month ago during the V3 patchset review, I wrote (possibly in an other > > thread, sorry) about something that I'm putting my foot down about now > > for the V4 patchset, which is the {re,ab}use of @inline_data for the > > data source address. > > > > We cannot use @inline_data to convey the source address. @inline_data > > (so far) is used to point to the in-memory representation of the storage > > described by @addr. For data writes, @addr is the location of the write > > on disk and @inline_data is the location of the write in memory. 
> > > > Reusing @inline_data here to point to the location of the source data in > > memory is a totally different thing and will likely result in confusion. > > On a practical level, this also means that we cannot support the case of > > COW && INLINE because the type codes collide and so would the users of > > @inline_data. This isn't required *right now*, but if you had a pmem > > filesystem that stages inode updates in memory and flips a pointer to > > commit changes then the ->iomap_begin function will need to convey two > > pointers at once. > > > > So this brings us back to Dave's suggestion during the V1 patchset > > review that instead of adding more iomap flags/types and overloading > > fields, we simply pass two struct iomaps into ->iomap_begin: > > > > - Change iomap_apply() to "struct iomap iomap[2] = 0;" and pass > > &iomap[0] into the ->iomap_begin and ->iomap_end functions. The > > first iomap will be filled in with the destination for the write (as > > all implementations do now), and the second iomap can be filled in > > with the source information for a COW operation. > > > > - If the ->iomap_begin implementation decides that COW is necessary for > > the requested operation, then it should fill out that second iomap > > with information about the extent that the actor must copied before > > returning. The second iomap's offset and length must match the > > first. If COW isn't necessary, the ->iomap_begin implementation > > Hi, > > I'm working on reflink & dax in XFS, here are some thoughts on this: > > As mentioned above: the second iomap's offset and length must match the > first. I thought so at the beginning, but later found that the only > difference between these two iomaps is @addr. So, what about adding a > @saddr, which means the source address of COW extent, into the struct iomap. > The ->iomap_begin() fills @saddr if the extent is COW, and 0 if not. Then > handle this @saddr in each ->actor(). No more modifications in other > functions. 
Yes, I started off with the exact idea before being recommended this by Dave. I used two fields instead of one, namely cow_pos and cow_addr, which defined the source details. I had put it as an iomap flag as opposed to a type, which of course did not appeal well. We may want to use iomaps for cases where two inodes are involved. An example of the other scenario where offset may be different is file comparison for dedup: vfs_dedup_file_range_compare(). However, it would need two inodes in iomap as well. > > My RFC patchset[1] is implemented in this way and works for me, though it is > far away from perfectness. > > [1]: https://patchwork.kernel.org/cover/10904307/ > -- Goldwyn ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH 04/18] dax: Introduce IOMAP_DAX_COW to CoW edges during writes 2019-05-23 11:51 ` Goldwyn Rodrigues @ 2019-05-27 8:25 ` Shiyang Ruan 2019-05-28 9:17 ` Jan Kara 0 siblings, 1 reply; 45+ messages in thread From: Shiyang Ruan @ 2019-05-27 8:25 UTC (permalink / raw) To: Goldwyn Rodrigues Cc: kilobyte, jack, Darrick J. Wong, nborisov, linux-nvdimm, david, dsterba, willy, linux-fsdevel, hch, linux-btrfs On 5/23/19 7:51 PM, Goldwyn Rodrigues wrote: >> >> Hi, >> >> I'm working on reflink & dax in XFS, here are some thoughts on this: >> >> As mentioned above: the second iomap's offset and length must match the >> first. I thought so at the beginning, but later found that the only >> difference between these two iomaps is @addr. So, what about adding a >> @saddr, which means the source address of COW extent, into the struct iomap. >> The ->iomap_begin() fills @saddr if the extent is COW, and 0 if not. Then >> handle this @saddr in each ->actor(). No more modifications in other >> functions. > > Yes, I started of with the exact idea before being recommended this by Dave. > I used two fields instead of one namely cow_pos and cow_addr which defined > the source details. I had put it as a iomap flag as opposed to a type > which of course did not appeal well. > > We may want to use iomaps for cases where two inodes are involved. > An example of the other scenario where offset may be different is file > comparison for dedup: vfs_dedup_file_range_compare(). However, it would > need two inodes in iomap as well. > Yes, it is reasonable. Thanks for your explanation. One more thing RFC: I'd like to add an end-io callback argument in ->dax_iomap_actor() to update the metadata after one whole COW operation is completed. The end-io can also be called in ->iomap_end(). But one COW operation may call ->iomap_apply() many times, and so does the end-io. Thus, I think it would be nice to move it to the bottom of ->dax_iomap_actor(), called just once in each COW operation. 
-- Thanks, Shiyang Ruan. >> >> My RFC patchset[1] is implemented in this way and works for me, though it is >> far away from perfectness. >> >> [1]: https://patchwork.kernel.org/cover/10904307/ >> > _______________________________________________ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm ^ permalink raw reply [flat|nested] 45+ messages in thread
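[Editor's note: the @saddr idea proposed above can be illustrated with a small user-space toy model. The field and function names below are illustrative only — this is not the real struct iomap layout or the kernel's actor signature — but it shows the shape of the proposal: ->iomap_begin() fills a source address for CoW extents and leaves it empty otherwise, and each actor branches on it.]

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy model of the @saddr proposal: a CoW mapping carries, besides
 * the destination address, the source address of the extent being
 * copied from.  All names are illustrative, not the kernel's. */
struct toy_iomap {
	char   *addr;    /* destination: the newly allocated extent */
	char   *saddr;   /* CoW source extent, or NULL for non-CoW  */
	size_t  length;  /* extent length in bytes                  */
};

/* Toy write actor: seed the new extent from the CoW source first,
 * then apply the incoming write on top of it. */
static void toy_dax_actor(struct toy_iomap *map, size_t off,
			  const char *buf, size_t len)
{
	if (map->saddr)                       /* CoW: copy old data over */
		memcpy(map->addr, map->saddr, map->length);
	memcpy(map->addr + off, buf, len);    /* then apply the write    */
}
```

A non-CoW mapping (saddr == NULL) degenerates to a plain overwrite, which is why no other functions would need changing.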
* Re: [PATCH 04/18] dax: Introduce IOMAP_DAX_COW to CoW edges during writes 2019-05-27 8:25 ` Shiyang Ruan @ 2019-05-28 9:17 ` Jan Kara 2019-05-29 2:01 ` Shiyang Ruan 0 siblings, 1 reply; 45+ messages in thread From: Jan Kara @ 2019-05-28 9:17 UTC (permalink / raw) To: Shiyang Ruan Cc: kilobyte, jack, Darrick J. Wong, nborisov, Goldwyn Rodrigues, linux-nvdimm, david, dsterba, willy, linux-fsdevel, hch, linux-btrfs On Mon 27-05-19 16:25:41, Shiyang Ruan wrote: > On 5/23/19 7:51 PM, Goldwyn Rodrigues wrote: > > > > > > Hi, > > > > > > I'm working on reflink & dax in XFS, here are some thoughts on this: > > > > > > As mentioned above: the second iomap's offset and length must match the > > > first. I thought so at the beginning, but later found that the only > > > difference between these two iomaps is @addr. So, what about adding a > > > @saddr, which means the source address of COW extent, into the struct iomap. > > > The ->iomap_begin() fills @saddr if the extent is COW, and 0 if not. Then > > > handle this @saddr in each ->actor(). No more modifications in other > > > functions. > > > > Yes, I started of with the exact idea before being recommended this by Dave. > > I used two fields instead of one namely cow_pos and cow_addr which defined > > the source details. I had put it as a iomap flag as opposed to a type > > which of course did not appeal well. > > > > We may want to use iomaps for cases where two inodes are involved. > > An example of the other scenario where offset may be different is file > > comparison for dedup: vfs_dedup_file_range_compare(). However, it would > > need two inodes in iomap as well. > > > Yes, it is reasonable. Thanks for your explanation. > > One more thing RFC: > I'd like to add an end-io callback argument in ->dax_iomap_actor() to update > the metadata after one whole COW operation is completed. The end-io can > also be called in ->iomap_end(). But one COW operation may call > ->iomap_apply() many times, and so does the end-io. 
Thus, I think it would > be nice to move it to the bottom of ->dax_iomap_actor(), called just once in > each COW operation. I'm sorry but I don't follow what you suggest. One COW operation is a call to dax_iomap_rw(), isn't it? That may call iomap_apply() several times, each invocation calls ->iomap_begin(), ->actor() (dax_iomap_actor()), ->iomap_end() once. So I don't see a difference between doing something in ->actor() and ->iomap_end() (besides the passed arguments but that does not seem to be your concern). So what do you exactly want to do? Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR _______________________________________________ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH 04/18] dax: Introduce IOMAP_DAX_COW to CoW edges during writes 2019-05-28 9:17 ` Jan Kara @ 2019-05-29 2:01 ` Shiyang Ruan 2019-05-29 2:47 ` Dave Chinner 0 siblings, 1 reply; 45+ messages in thread From: Shiyang Ruan @ 2019-05-29 2:01 UTC (permalink / raw) To: Jan Kara Cc: kilobyte, Darrick J. Wong, nborisov, Goldwyn Rodrigues, linux-nvdimm, david, dsterba, willy, linux-fsdevel, hch, linux-btrfs On 5/28/19 5:17 PM, Jan Kara wrote: > On Mon 27-05-19 16:25:41, Shiyang Ruan wrote: >> On 5/23/19 7:51 PM, Goldwyn Rodrigues wrote: >>>> >>>> Hi, >>>> >>>> I'm working on reflink & dax in XFS, here are some thoughts on this: >>>> >>>> As mentioned above: the second iomap's offset and length must match the >>>> first. I thought so at the beginning, but later found that the only >>>> difference between these two iomaps is @addr. So, what about adding a >>>> @saddr, which means the source address of COW extent, into the struct iomap. >>>> The ->iomap_begin() fills @saddr if the extent is COW, and 0 if not. Then >>>> handle this @saddr in each ->actor(). No more modifications in other >>>> functions. >>> >>> Yes, I started of with the exact idea before being recommended this by Dave. >>> I used two fields instead of one namely cow_pos and cow_addr which defined >>> the source details. I had put it as a iomap flag as opposed to a type >>> which of course did not appeal well. >>> >>> We may want to use iomaps for cases where two inodes are involved. >>> An example of the other scenario where offset may be different is file >>> comparison for dedup: vfs_dedup_file_range_compare(). However, it would >>> need two inodes in iomap as well. >>> >> Yes, it is reasonable. Thanks for your explanation. >> >> One more thing RFC: >> I'd like to add an end-io callback argument in ->dax_iomap_actor() to update >> the metadata after one whole COW operation is completed. The end-io can >> also be called in ->iomap_end(). 
But one COW operation may call >> ->iomap_apply() many times, and so does the end-io. Thus, I think it would >> be nice to move it to the bottom of ->dax_iomap_actor(), called just once in >> each COW operation. > > I'm sorry but I don't follow what you suggest. One COW operation is a call > to dax_iomap_rw(), isn't it? That may call iomap_apply() several times, > each invocation calls ->iomap_begin(), ->actor() (dax_iomap_actor()), > ->iomap_end() once. So I don't see a difference between doing something in > ->actor() and ->iomap_end() (besides the passed arguments but that does not > seem to be your concern). So what do you exactly want to do? Hi Jan, Thanks for pointing out, and I'm sorry for my mistake. It's ->dax_iomap_rw(), not ->dax_iomap_actor(). I want to call the callback function at the end of ->dax_iomap_rw(). Like this: dax_iomap_rw(..., callback) { ... while (...) { iomap_apply(...); } if (callback != null) { callback(); } return ...; } > > Honza > -- Thanks, Shiyang Ruan. _______________________________________________ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH 04/18] dax: Introduce IOMAP_DAX_COW to CoW edges during writes 2019-05-29 2:01 ` Shiyang Ruan @ 2019-05-29 2:47 ` Dave Chinner 2019-05-29 4:02 ` Shiyang Ruan 0 siblings, 1 reply; 45+ messages in thread From: Dave Chinner @ 2019-05-29 2:47 UTC (permalink / raw) To: Shiyang Ruan Cc: kilobyte, Jan Kara, Darrick J. Wong, nborisov, Goldwyn Rodrigues, linux-nvdimm, dsterba, willy, linux-fsdevel, hch, linux-btrfs On Wed, May 29, 2019 at 10:01:58AM +0800, Shiyang Ruan wrote: > > On 5/28/19 5:17 PM, Jan Kara wrote: > > On Mon 27-05-19 16:25:41, Shiyang Ruan wrote: > > > On 5/23/19 7:51 PM, Goldwyn Rodrigues wrote: > > > > > > > > > > Hi, > > > > > > > > > > I'm working on reflink & dax in XFS, here are some thoughts on this: > > > > > > > > > > As mentioned above: the second iomap's offset and length must match the > > > > > first. I thought so at the beginning, but later found that the only > > > > > difference between these two iomaps is @addr. So, what about adding a > > > > > @saddr, which means the source address of COW extent, into the struct iomap. > > > > > The ->iomap_begin() fills @saddr if the extent is COW, and 0 if not. Then > > > > > handle this @saddr in each ->actor(). No more modifications in other > > > > > functions. > > > > > > > > Yes, I started of with the exact idea before being recommended this by Dave. > > > > I used two fields instead of one namely cow_pos and cow_addr which defined > > > > the source details. I had put it as a iomap flag as opposed to a type > > > > which of course did not appeal well. > > > > > > > > We may want to use iomaps for cases where two inodes are involved. > > > > An example of the other scenario where offset may be different is file > > > > comparison for dedup: vfs_dedup_file_range_compare(). However, it would > > > > need two inodes in iomap as well. > > > > > > > Yes, it is reasonable. Thanks for your explanation. 
> > > > > > One more thing RFC: > > > I'd like to add an end-io callback argument in ->dax_iomap_actor() to update > > > the metadata after one whole COW operation is completed. The end-io can > > > also be called in ->iomap_end(). But one COW operation may call > > > ->iomap_apply() many times, and so does the end-io. Thus, I think it would > > > be nice to move it to the bottom of ->dax_iomap_actor(), called just once in > > > each COW operation. > > > > I'm sorry but I don't follow what you suggest. One COW operation is a call > > to dax_iomap_rw(), isn't it? That may call iomap_apply() several times, > > each invocation calls ->iomap_begin(), ->actor() (dax_iomap_actor()), > > ->iomap_end() once. So I don't see a difference between doing something in > > ->actor() and ->iomap_end() (besides the passed arguments but that does not > > seem to be your concern). So what do you exactly want to do? > > Hi Jan, > > Thanks for pointing out, and I'm sorry for my mistake. It's > ->dax_iomap_rw(), not ->dax_iomap_actor(). > > I want to call the callback function at the end of ->dax_iomap_rw(). > > Like this: > dax_iomap_rw(..., callback) { > > ... > while (...) { > iomap_apply(...); > } > > if (callback != null) { > callback(); > } > return ...; > } Why does this need to be in dax_iomap_rw()? We already do post-dax_iomap_rw() "io-end callbacks" directly in xfs_file_dax_write() to update the file size.... Cheers, Dave. -- Dave Chinner david@fromorbit.com _______________________________________________ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH 04/18] dax: Introduce IOMAP_DAX_COW to CoW edges during writes 2019-05-29 2:47 ` Dave Chinner @ 2019-05-29 4:02 ` Shiyang Ruan 2019-05-29 4:07 ` Darrick J. Wong 0 siblings, 1 reply; 45+ messages in thread From: Shiyang Ruan @ 2019-05-29 4:02 UTC (permalink / raw) To: Dave Chinner Cc: kilobyte, Jan Kara, Darrick J. Wong <darrick.wong@oracle.com>, nborisov@suse.com, Goldwyn Rodrigues, linux-nvdimm, dsterba, willy, linux-fsdevel, hch, linux-btrfs On 5/29/19 10:47 AM, Dave Chinner wrote: > On Wed, May 29, 2019 at 10:01:58AM +0800, Shiyang Ruan wrote: >> >> On 5/28/19 5:17 PM, Jan Kara wrote: >>> On Mon 27-05-19 16:25:41, Shiyang Ruan wrote: >>>> On 5/23/19 7:51 PM, Goldwyn Rodrigues wrote: >>>>>> >>>>>> Hi, >>>>>> >>>>>> I'm working on reflink & dax in XFS, here are some thoughts on this: >>>>>> >>>>>> As mentioned above: the second iomap's offset and length must match the >>>>>> first. I thought so at the beginning, but later found that the only >>>>>> difference between these two iomaps is @addr. So, what about adding a >>>>>> @saddr, which means the source address of COW extent, into the struct iomap. >>>>>> The ->iomap_begin() fills @saddr if the extent is COW, and 0 if not. Then >>>>>> handle this @saddr in each ->actor(). No more modifications in other >>>>>> functions. >>>>> >>>>> Yes, I started of with the exact idea before being recommended this by Dave. >>>>> I used two fields instead of one namely cow_pos and cow_addr which defined >>>>> the source details. I had put it as a iomap flag as opposed to a type >>>>> which of course did not appeal well. >>>>> >>>>> We may want to use iomaps for cases where two inodes are involved. >>>>> An example of the other scenario where offset may be different is file >>>>> comparison for dedup: vfs_dedup_file_range_compare(). However, it would >>>>> need two inodes in iomap as well. >>>>> >>>> Yes, it is reasonable. Thanks for your explanation. 
>>>> >>>> One more thing RFC: >>>> I'd like to add an end-io callback argument in ->dax_iomap_actor() to update >>>> the metadata after one whole COW operation is completed. The end-io can >>>> also be called in ->iomap_end(). But one COW operation may call >>>> ->iomap_apply() many times, and so does the end-io. Thus, I think it would >>>> be nice to move it to the bottom of ->dax_iomap_actor(), called just once in >>>> each COW operation. >>> >>> I'm sorry but I don't follow what you suggest. One COW operation is a call >>> to dax_iomap_rw(), isn't it? That may call iomap_apply() several times, >>> each invocation calls ->iomap_begin(), ->actor() (dax_iomap_actor()), >>> ->iomap_end() once. So I don't see a difference between doing something in >>> ->actor() and ->iomap_end() (besides the passed arguments but that does not >>> seem to be your concern). So what do you exactly want to do? >> >> Hi Jan, >> >> Thanks for pointing out, and I'm sorry for my mistake. It's >> ->dax_iomap_rw(), not ->dax_iomap_actor(). >> >> I want to call the callback function at the end of ->dax_iomap_rw(). >> >> Like this: >> dax_iomap_rw(..., callback) { >> >> ... >> while (...) { >> iomap_apply(...); >> } >> >> if (callback != null) { >> callback(); >> } >> return ...; >> } > > Why does this need to be in dax_iomap_rw()? > > We already do post-dax_iomap_rw() "io-end callbacks" directly in > xfs_file_dax_write() to update the file size.... Yes, but we also need to call ->xfs_reflink_end_cow() after a COW operation. And an is-cow flag(from iomap) is also needed to determine if we call it. I think it would be better to put this into ->dax_iomap_rw() as a callback function. So sorry for my poor expression. > > Cheers, > > Dave. > -- Thanks, Shiyang Ruan. _______________________________________________ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH 04/18] dax: Introduce IOMAP_DAX_COW to CoW edges during writes 2019-05-29 4:02 ` Shiyang Ruan @ 2019-05-29 4:07 ` Darrick J. Wong 2019-05-29 4:46 ` Dave Chinner 0 siblings, 1 reply; 45+ messages in thread From: Darrick J. Wong @ 2019-05-29 4:07 UTC (permalink / raw) To: Shiyang Ruan Cc: kilobyte, Jan Kara, linux-nvdimm, nborisov, Goldwyn Rodrigues, Dave Chinner, dsterba, willy, linux-fsdevel, hch, linux-btrfs On Wed, May 29, 2019 at 12:02:40PM +0800, Shiyang Ruan wrote: > > > On 5/29/19 10:47 AM, Dave Chinner wrote: > > On Wed, May 29, 2019 at 10:01:58AM +0800, Shiyang Ruan wrote: > > > > > > On 5/28/19 5:17 PM, Jan Kara wrote: > > > > On Mon 27-05-19 16:25:41, Shiyang Ruan wrote: > > > > > On 5/23/19 7:51 PM, Goldwyn Rodrigues wrote: > > > > > > > > > > > > > > Hi, > > > > > > > > > > > > > > I'm working on reflink & dax in XFS, here are some thoughts on this: > > > > > > > > > > > > > > As mentioned above: the second iomap's offset and length must match the > > > > > > > first. I thought so at the beginning, but later found that the only > > > > > > > difference between these two iomaps is @addr. So, what about adding a > > > > > > > @saddr, which means the source address of COW extent, into the struct iomap. > > > > > > > The ->iomap_begin() fills @saddr if the extent is COW, and 0 if not. Then > > > > > > > handle this @saddr in each ->actor(). No more modifications in other > > > > > > > functions. > > > > > > > > > > > > Yes, I started of with the exact idea before being recommended this by Dave. > > > > > > I used two fields instead of one namely cow_pos and cow_addr which defined > > > > > > the source details. I had put it as a iomap flag as opposed to a type > > > > > > which of course did not appeal well. > > > > > > > > > > > > We may want to use iomaps for cases where two inodes are involved. > > > > > > An example of the other scenario where offset may be different is file > > > > > > comparison for dedup: vfs_dedup_file_range_compare(). 
However, it would > > > > > > need two inodes in iomap as well. > > > > > > > > > > > Yes, it is reasonable. Thanks for your explanation. > > > > > > > > > > One more thing RFC: > > > > > I'd like to add an end-io callback argument in ->dax_iomap_actor() to update > > > > > the metadata after one whole COW operation is completed. The end-io can > > > > > also be called in ->iomap_end(). But one COW operation may call > > > > > ->iomap_apply() many times, and so does the end-io. Thus, I think it would > > > > > be nice to move it to the bottom of ->dax_iomap_actor(), called just once in > > > > > each COW operation. > > > > > > > > I'm sorry but I don't follow what you suggest. One COW operation is a call > > > > to dax_iomap_rw(), isn't it? That may call iomap_apply() several times, > > > > each invocation calls ->iomap_begin(), ->actor() (dax_iomap_actor()), > > > > ->iomap_end() once. So I don't see a difference between doing something in > > > > ->actor() and ->iomap_end() (besides the passed arguments but that does not > > > > seem to be your concern). So what do you exactly want to do? > > > > > > Hi Jan, > > > > > > Thanks for pointing out, and I'm sorry for my mistake. It's > > > ->dax_iomap_rw(), not ->dax_iomap_actor(). > > > > > > I want to call the callback function at the end of ->dax_iomap_rw(). > > > > > > Like this: > > > dax_iomap_rw(..., callback) { > > > > > > ... > > > while (...) { > > > iomap_apply(...); > > > } > > > > > > if (callback != null) { > > > callback(); > > > } > > > return ...; > > > } > > > > Why does this need to be in dax_iomap_rw()? > > > > We already do post-dax_iomap_rw() "io-end callbacks" directly in > > xfs_file_dax_write() to update the file size.... > > Yes, but we also need to call ->xfs_reflink_end_cow() after a COW operation. > And an is-cow flag(from iomap) is also needed to determine if we call it. I > think it would be better to put this into ->dax_iomap_rw() as a callback > function. 
Sort of like how iomap_dio_rw takes a write endio function? --D > So sorry for my poor expression. > > > > > Cheers, > > > > Dave. > > > > -- > Thanks, > Shiyang Ruan. > > _______________________________________________ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH 04/18] dax: Introduce IOMAP_DAX_COW to CoW edges during writes 2019-05-29 4:07 ` Darrick J. Wong @ 2019-05-29 4:46 ` Dave Chinner 2019-05-29 13:46 ` Jan Kara 0 siblings, 1 reply; 45+ messages in thread From: Dave Chinner @ 2019-05-29 4:46 UTC (permalink / raw) To: Darrick J. Wong Cc: kilobyte, Jan Kara, linux-nvdimm, nborisov, Goldwyn Rodrigues, dsterba, willy, linux-fsdevel, hch, linux-btrfs On Tue, May 28, 2019 at 09:07:19PM -0700, Darrick J. Wong wrote: > On Wed, May 29, 2019 at 12:02:40PM +0800, Shiyang Ruan wrote: > > On 5/29/19 10:47 AM, Dave Chinner wrote: > > > On Wed, May 29, 2019 at 10:01:58AM +0800, Shiyang Ruan wrote: > > > > On 5/28/19 5:17 PM, Jan Kara wrote: > > > > > I'm sorry but I don't follow what you suggest. One COW operation is a call > > > > > to dax_iomap_rw(), isn't it? That may call iomap_apply() several times, > > > > > each invocation calls ->iomap_begin(), ->actor() (dax_iomap_actor()), > > > > > ->iomap_end() once. So I don't see a difference between doing something in > > > > > ->actor() and ->iomap_end() (besides the passed arguments but that does not > > > > > seem to be your concern). So what do you exactly want to do? > > > > > > > > Hi Jan, > > > > > > > > Thanks for pointing out, and I'm sorry for my mistake. It's > > > > ->dax_iomap_rw(), not ->dax_iomap_actor(). > > > > > > > > I want to call the callback function at the end of ->dax_iomap_rw(). > > > > > > > > Like this: > > > > dax_iomap_rw(..., callback) { > > > > > > > > ... > > > > while (...) { > > > > iomap_apply(...); > > > > } > > > > > > > > if (callback != null) { > > > > callback(); > > > > } > > > > return ...; > > > > } > > > > > > Why does this need to be in dax_iomap_rw()? > > > > > > We already do post-dax_iomap_rw() "io-end callbacks" directly in > > > xfs_file_dax_write() to update the file size.... > > > > Yes, but we also need to call ->xfs_reflink_end_cow() after a COW operation. 
> > And an is-cow flag(from iomap) is also needed to determine if we call it. I > > think it would be better to put this into ->dax_iomap_rw() as a callback > > function. > > Sort of like how iomap_dio_rw takes a write endio function? You mean like we originally had in the DAX code for unwritten extents? But we got rid of that because performance of unwritten extents was absolutely woeful - it's cheaper in terms of CPU cost to do up front zeroing (i.e. inside ->iomap_begin) than it is to use unwritten extents and convert them to protect against stale data exposure. I have a feeling that exactly the same thing is true for CoW - the hoops we jump through to do COW fork manipulation and then extent movement between the COW fork and the data fork on IO completion would be better done before we commit the COW extent allocation. In which case, what we actually want for DAX is: iomap_apply() ->iomap_begin() map old data extent that we copy from allocate new data extent we copy to in data fork, immediately replacing old data extent return transaction handle as private data dax_iomap_actor() copies data from old extent to new extent ->iomap_end commits transaction now data has been copied, making the COW operation atomic with the data copy. This, in fact, should be how we do all DAX writes that require allocation, because then we get rid of the need to zero newly allocated or unwritten extents before we copy the data into it. i.e. we only need to write once to newly allocated storage rather than twice. This gets rid of the need for COW callbacks, and means the DAX reflink implementation does not need to use delalloc speculative preallocation or COW forks at all. Cheers, Dave. -- Dave Chinner david@fromorbit.com _______________________________________________ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH 04/18] dax: Introduce IOMAP_DAX_COW to CoW edges during writes 2019-05-29 4:46 ` Dave Chinner @ 2019-05-29 13:46 ` Jan Kara 2019-05-29 22:14 ` Dave Chinner 0 siblings, 1 reply; 45+ messages in thread From: Jan Kara @ 2019-05-29 13:46 UTC (permalink / raw) To: Dave Chinner Cc: kilobyte, Jan Kara, Darrick J. Wong, nborisov, Goldwyn Rodrigues, linux-nvdimm, dsterba, willy, linux-fsdevel, hch, linux-btrfs On Wed 29-05-19 14:46:58, Dave Chinner wrote: > On Tue, May 28, 2019 at 09:07:19PM -0700, Darrick J. Wong wrote: > > On Wed, May 29, 2019 at 12:02:40PM +0800, Shiyang Ruan wrote: > > > On 5/29/19 10:47 AM, Dave Chinner wrote: > > > > On Wed, May 29, 2019 at 10:01:58AM +0800, Shiyang Ruan wrote: > > > > > On 5/28/19 5:17 PM, Jan Kara wrote: > > > > > > I'm sorry but I don't follow what you suggest. One COW operation is a call > > > > > > to dax_iomap_rw(), isn't it? That may call iomap_apply() several times, > > > > > > each invocation calls ->iomap_begin(), ->actor() (dax_iomap_actor()), > > > > > > ->iomap_end() once. So I don't see a difference between doing something in > > > > > > ->actor() and ->iomap_end() (besides the passed arguments but that does not > > > > > > seem to be your concern). So what do you exactly want to do? > > > > > > > > > > Hi Jan, > > > > > > > > > > Thanks for pointing out, and I'm sorry for my mistake. It's > > > > > ->dax_iomap_rw(), not ->dax_iomap_actor(). > > > > > > > > > > I want to call the callback function at the end of ->dax_iomap_rw(). > > > > > > > > > > Like this: > > > > > dax_iomap_rw(..., callback) { > > > > > > > > > > ... > > > > > while (...) { > > > > > iomap_apply(...); > > > > > } > > > > > > > > > > if (callback != null) { > > > > > callback(); > > > > > } > > > > > return ...; > > > > > } > > > > > > > > Why does this need to be in dax_iomap_rw()? > > > > > > > > We already do post-dax_iomap_rw() "io-end callbacks" directly in > > > > xfs_file_dax_write() to update the file size.... 
> > > > > > Yes, but we also need to call ->xfs_reflink_end_cow() after a COW operation. > > > And an is-cow flag(from iomap) is also needed to determine if we call it. I > > > think it would be better to put this into ->dax_iomap_rw() as a callback > > > function. > > > > Sort of like how iomap_dio_rw takes a write endio function? > > You mean like we originally had in the DAX code for unwritten > extents? > > But we got rid of that because performance of unwritten extents was > absolutely woeful - it's cheaper in terms of CPU cost to do up front > zeroing (i.e. inside ->iomap_begin) than it is to use unwritten > extents and convert them to protect against stale data exposure. > > I have a feeling that exactly the same thing is true for CoW - the > hoops we jump through to do COW fork manipulation and then extent > movement between the COW fork and the data fork on IO completion > would be better done before we commit the COW extent allocation. > > In which case, what we actually want for DAX is: > > > iomap_apply() > > ->iomap_begin() > map old data extent that we copy from > > allocate new data extent we copy to in data fork, > immediately replacing old data extent > > return transaction handle as private data > > dax_iomap_actor() > copies data from old extent to new extent > > ->iomap_end > commits transaction now data has been copied, making > the COW operation atomic with the data copy. > > > This, in fact, should be how we do all DAX writes that require > allocation, because then we get rid of the need to zero newly > allocated or unwritten extents before we copy the data into it. i.e. > we only need to write once to newly allocated storage rather than > twice. You need to be careful though. You need to synchronize with page faults so that they cannot see and expose in page tables blocks you've allocated before their contents is filled. This race was actually the strongest motivation for pre-zeroing of blocks. 
OTOH copy_from_iter() in dax_iomap_actor() needs to be able to fault pages to copy from (and these pages may be from the same file you're writing to) so you cannot just block faulting for the file through I_MMAP_LOCK. Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR _______________________________________________ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH 04/18] dax: Introduce IOMAP_DAX_COW to CoW edges during writes 2019-05-29 13:46 ` Jan Kara @ 2019-05-29 22:14 ` Dave Chinner 2019-05-30 11:16 ` Jan Kara 0 siblings, 1 reply; 45+ messages in thread From: Dave Chinner @ 2019-05-29 22:14 UTC (permalink / raw) To: Jan Kara Cc: kilobyte, Darrick J. Wong, nborisov, Goldwyn Rodrigues, linux-nvdimm, dsterba, willy, linux-fsdevel, hch, linux-btrfs On Wed, May 29, 2019 at 03:46:29PM +0200, Jan Kara wrote: > On Wed 29-05-19 14:46:58, Dave Chinner wrote: > > iomap_apply() > > > > ->iomap_begin() > > map old data extent that we copy from > > > > allocate new data extent we copy to in data fork, > > immediately replacing old data extent > > > > return transaction handle as private data This holds the inode block map locked exclusively across the IO, so.... > > > > dax_iomap_actor() > > copies data from old extent to new extent > > > > ->iomap_end > > commits transaction now data has been copied, making > > the COW operation atomic with the data copy. > > > > > > This, in fact, should be how we do all DAX writes that require > > allocation, because then we get rid of the need to zero newly > > allocated or unwritten extents before we copy the data into it. i.e. > > we only need to write once to newly allocated storage rather than > > twice. > > You need to be careful though. You need to synchronize with page faults so > that they cannot see and expose in page tables blocks you've allocated > before their contents is filled. ... so the page fault will block trying to map the blocks because it can't get the xfs_inode->i_ilock until the allocation transaciton commits.... > This race was actually the strongest > motivation for pre-zeroing of blocks. OTOH copy_from_iter() in > dax_iomap_actor() needs to be able to fault pages to copy from (and these > pages may be from the same file you're writing to) so you cannot just block > faulting for the file through I_MMAP_LOCK. 
Right, it doesn't take the I_MMAP_LOCK, but it would block further
in. And, really, I'm not caring all this much about this corner
case. i.e. anyone using a "mmap()+write() zero copy" pattern on DAX
within a file is unbelievably naive - the data still gets copied by
the CPU in the write() call. It's far simpler and more efficient to
just mmap() both ranges of the file(s) and memcpy() in userspace....

FWIW, it's to avoid problems with stupid userspace stuff that nobody
really should be doing that I want range locks for the XFS inode
locks. If userspace overlaps the ranges and deadlocks in that case,
they get to keep all the broken bits because, IMO, they are
doing something monumentally stupid. I'd probably be making it
return EDEADLOCK back out to userspace in that case rather than
deadlocking but, fundamentally, I think it's broken behaviour that
we should be rejecting with an error rather than adding complexity
trying to handle it.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 45+ messages in thread
* Re: [PATCH 04/18] dax: Introduce IOMAP_DAX_COW to CoW edges during writes 2019-05-29 22:14 ` Dave Chinner @ 2019-05-30 11:16 ` Jan Kara 2019-05-30 22:59 ` Dave Chinner 0 siblings, 1 reply; 45+ messages in thread From: Jan Kara @ 2019-05-30 11:16 UTC (permalink / raw) To: Dave Chinner Cc: kilobyte, Jan Kara, Darrick J. Wong, nborisov, Goldwyn Rodrigues, linux-nvdimm, dsterba, willy, linux-fsdevel, hch, linux-btrfs On Thu 30-05-19 08:14:45, Dave Chinner wrote: > On Wed, May 29, 2019 at 03:46:29PM +0200, Jan Kara wrote: > > On Wed 29-05-19 14:46:58, Dave Chinner wrote: > > > iomap_apply() > > > > > > ->iomap_begin() > > > map old data extent that we copy from > > > > > > allocate new data extent we copy to in data fork, > > > immediately replacing old data extent > > > > > > return transaction handle as private data > > This holds the inode block map locked exclusively across the IO, > so.... Does it? We do hold XFS_IOLOCK_EXCL during the whole dax write. But xfs_file_iomap_begin() does release XFS_ILOCK_* on exit AFAICS. So I don't see anything that would prevent page fault from mapping blocks into page tables just after xfs_file_iomap_begin() returns. > > > dax_iomap_actor() > > > copies data from old extent to new extent > > > > > > ->iomap_end > > > commits transaction now data has been copied, making > > > the COW operation atomic with the data copy. > > > > > > > > > This, in fact, should be how we do all DAX writes that require > > > allocation, because then we get rid of the need to zero newly > > > allocated or unwritten extents before we copy the data into it. i.e. > > > we only need to write once to newly allocated storage rather than > > > twice. > > > > You need to be careful though. You need to synchronize with page faults so > > that they cannot see and expose in page tables blocks you've allocated > > before their contents is filled. > > ... 
so the page fault will block trying to map the blocks because > it can't get the xfs_inode->i_ilock until the allocation transaciton > commits.... > > > This race was actually the strongest > > motivation for pre-zeroing of blocks. OTOH copy_from_iter() in > > dax_iomap_actor() needs to be able to fault pages to copy from (and these > > pages may be from the same file you're writing to) so you cannot just block > > faulting for the file through I_MMAP_LOCK. > > Right, it doesn't take the I_MMAP_LOCK, but it would block further > in. And, really, I'm not caring all this much about this corner > case. i.e. anyone using a "mmap()+write() zero copy" pattern on DAX > within a file is unbeleivably naive - the data still gets copied by > the CPU in the write() call. It's far simpler and more effcient to > just mmap() both ranges of the file(s) and memcpy() in userspace.... > > FWIW, it's to avoid problems with stupid userspace stuff that nobody > really should be doing that I want range locks for the XFS inode > locks. If userspace overlaps the ranges and deadlocks in that case, > they they get to keep all the broken bits because, IMO, they are > doing something monumentally stupid. I'd probably be making it > return EDEADLOCK back out to userspace in the case rather than > deadlocking but, fundamentally, I think it's broken behaviour that > we should be rejecting with an error rather than adding complexity > trying to handle it. I agree with this. We must just prevent user from taking the kernel down with maliciously created IOs... Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR _______________________________________________ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH 04/18] dax: Introduce IOMAP_DAX_COW to CoW edges during writes
  2019-05-30 11:16           ` Jan Kara
@ 2019-05-30 22:59             ` Dave Chinner
  0 siblings, 0 replies; 45+ messages in thread
From: Dave Chinner @ 2019-05-30 22:59 UTC (permalink / raw)
  To: Jan Kara
  Cc: kilobyte, Darrick J. Wong, nborisov, Goldwyn Rodrigues,
      linux-nvdimm, dsterba, willy, linux-fsdevel, hch, linux-btrfs

On Thu, May 30, 2019 at 01:16:05PM +0200, Jan Kara wrote:
> On Thu 30-05-19 08:14:45, Dave Chinner wrote:
> > On Wed, May 29, 2019 at 03:46:29PM +0200, Jan Kara wrote:
> > > On Wed 29-05-19 14:46:58, Dave Chinner wrote:
> > > > iomap_apply()
> > > >
> > > >     ->iomap_begin()
> > > >             map old data extent that we copy from
> > > >
> > > >             allocate new data extent we copy to in data fork,
> > > >             immediately replacing old data extent
> > > >
> > > >             return transaction handle as private data
> >
> > This holds the inode block map locked exclusively across the IO,
> > so....
>
> Does it? We do hold XFS_IOLOCK_EXCL during the whole dax write.

I forgot about that, I keep thinking that we use shared locking for
DAX like we do for direct IO. There's another reason for range locks -
allowing concurrent DAX read/write IO - but that's orthogonal to the
issue here.

> But
> xfs_file_iomap_begin() does release XFS_ILOCK_* on exit AFAICS. So I don't
> see anything that would prevent page fault from mapping blocks into page
> tables just after xfs_file_iomap_begin() returns.

Right, holding the IOLOCK doesn't stop concurrent page faults from
mapping the page we are trying to write, and that leaves a window
where stale data can be exposed if we don't initialise the newly
allocated range whilst in the allocation transaction holding the
ILOCK.

That's what the XFS_BMAPI_ZERO flag does in the DAX block allocation
path. So the idea of holding the allocation transaction across the
data copy means that the ILOCK is then held until the data blocks are
fully initialised with valid data, meaning we can greatly reduce the
scope of the XFS_BMAPI_ZERO flag and possibly get rid of it
altogether.

> > > This race was actually the strongest
> > > motivation for pre-zeroing of blocks. OTOH copy_from_iter() in
> > > dax_iomap_actor() needs to be able to fault pages to copy from (and these
> > > pages may be from the same file you're writing to) so you cannot just block
> > > faulting for the file through I_MMAP_LOCK.
> >
> > Right, it doesn't take the I_MMAP_LOCK, but it would block further
> > in. And, really, I'm not caring all this much about this corner
> > case. i.e. anyone using a "mmap()+write() zero copy" pattern on DAX
> > within a file is unbelievably naive - the data still gets copied by
> > the CPU in the write() call. It's far simpler and more efficient to
> > just mmap() both ranges of the file(s) and memcpy() in userspace....
> >
> > FWIW, it's to avoid problems with stupid userspace stuff that nobody
> > really should be doing that I want range locks for the XFS inode
> > locks. If userspace overlaps the ranges and deadlocks in that case,
> > then they get to keep all the broken bits because, IMO, they are
> > doing something monumentally stupid. I'd probably be making it
> > return EDEADLOCK back out to userspace in that case rather than
> > deadlocking but, fundamentally, I think it's broken behaviour that
> > we should be rejecting with an error rather than adding complexity
> > trying to handle it.
>
> I agree with this. We must just prevent users from taking the kernel down
> with maliciously created IOs...

Noted. :)

I'm still working to scale the range locks effectively for direct IO;
I've got to work out why sometimes they give identical performance to
rwsems out to 16 threads, and other times they run 20% slower or worse
at 8+ threads. I'm way ahead of the original mutex-protected tree
implementation that I have, but still have work to do to get
consistently close to rwsem performance for pure shared locking
workloads like direct IO.

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
end of thread, other threads:[~2019-05-30 22:59 UTC | newest]
Thread overview: 45+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-04-16 16:41 [PATCH v3 00/18] btrfs dax support Goldwyn Rodrigues
2019-04-16 16:41 ` [PATCH 01/18] btrfs: create a mount option for dax Goldwyn Rodrigues
2019-04-16 16:52 ` Dan Williams
2019-04-16 16:41 ` [PATCH 02/18] btrfs: Carve out btrfs_get_extent_map_write() out of btrfs_get_blocks_write() Goldwyn Rodrigues
2019-04-16 23:45 ` Elliott, Robert (Servers)
2019-04-16 16:41 ` [PATCH 03/18] btrfs: basic dax read Goldwyn Rodrigues
2019-04-16 16:41 ` [PATCH 04/18] dax: Introduce IOMAP_DAX_COW to CoW edges during writes Goldwyn Rodrigues
[not found] ` <20190416164154.30390-5-rgoldwyn-l3A5Bk7waGM@public.gmane.org>
2019-04-17 16:46 ` Darrick J. Wong
2019-04-16 16:41 ` [PATCH 05/18] btrfs: return whether extent is nocow or not Goldwyn Rodrigues
2019-04-16 16:41 ` [PATCH 06/18] btrfs: Rename __endio_write_update_ordered() to btrfs_update_ordered_extent() Goldwyn Rodrigues
2019-04-16 16:41 ` [PATCH 07/18] btrfs: add dax write support Goldwyn Rodrigues
2019-04-16 16:41 ` [PATCH 08/18] dax: memcpy page in case of IOMAP_DAX_COW for mmap faults Goldwyn Rodrigues
[not found] ` <20190416164154.30390-9-rgoldwyn-l3A5Bk7waGM@public.gmane.org>
2019-04-17 16:52 ` Darrick J. Wong
2019-04-16 16:41 ` [PATCH 09/18] btrfs: Add dax specific address_space_operations Goldwyn Rodrigues
[not found] ` <20190416164154.30390-1-rgoldwyn-l3A5Bk7waGM@public.gmane.org>
2019-04-16 16:41 ` [PATCH 10/18] dax: replace mmap entry in case of CoW Goldwyn Rodrigues
[not found] ` <20190416164154.30390-11-rgoldwyn-l3A5Bk7waGM@public.gmane.org>
2019-04-17 15:24 ` Darrick J. Wong
2019-04-16 16:41 ` [PATCH 11/18] btrfs: add dax mmap support Goldwyn Rodrigues
2019-04-16 16:41 ` [PATCH 12/18] btrfs: allow MAP_SYNC mmap Goldwyn Rodrigues
2019-04-16 16:41 ` [PATCH 13/18] fs: dedup file range to use a compare function Goldwyn Rodrigues
[not found] ` <20190416164154.30390-14-rgoldwyn-l3A5Bk7waGM@public.gmane.org>
2019-04-17 15:36 ` Darrick J. Wong
2019-04-16 16:41 ` [PATCH 14/18] dax: memcpy before zeroing range Goldwyn Rodrigues
[not found] ` <20190416164154.30390-15-rgoldwyn-l3A5Bk7waGM@public.gmane.org>
2019-04-17 15:45 ` Darrick J. Wong
2019-04-17 16:39 ` Goldwyn Rodrigues
2019-04-16 16:41 ` [PATCH 15/18] btrfs: handle dax page zeroing Goldwyn Rodrigues
2019-04-16 16:41 ` [PATCH 16/18] btrfs: Writeprotect mmap pages on snapshot Goldwyn Rodrigues
2019-04-16 16:41 ` [PATCH 17/18] btrfs: Disable dax-based defrag and send Goldwyn Rodrigues
2019-04-16 16:41 ` [PATCH 18/18] btrfs: trace functions for btrfs_iomap_begin/end Goldwyn Rodrigues
2019-04-17 16:49 ` [PATCH v3 00/18] btrfs dax support Adam Borowski
-- strict thread matches above, loose matches on Subject: below --
2019-04-29 17:26 [PATCH v4 " Goldwyn Rodrigues
2019-04-29 17:26 ` [PATCH 04/18] dax: Introduce IOMAP_DAX_COW to CoW edges during writes Goldwyn Rodrigues
2019-05-21 16:51 ` Darrick J. Wong
2019-05-22 20:14 ` Goldwyn Rodrigues
2019-05-23 2:10 ` Dave Chinner
2019-05-23 9:05 ` Shiyang Ruan
2019-05-23 11:51 ` Goldwyn Rodrigues
2019-05-27 8:25 ` Shiyang Ruan
2019-05-28 9:17 ` Jan Kara
2019-05-29 2:01 ` Shiyang Ruan
2019-05-29 2:47 ` Dave Chinner
2019-05-29 4:02 ` Shiyang Ruan
2019-05-29 4:07 ` Darrick J. Wong
2019-05-29 4:46 ` Dave Chinner
2019-05-29 13:46 ` Jan Kara
2019-05-29 22:14 ` Dave Chinner
2019-05-30 11:16 ` Jan Kara
2019-05-30 22:59 ` Dave Chinner