* [PATCH 04/18] dax: Introduce IOMAP_DAX_COW to CoW edges during writes
2019-04-16 16:41 [PATCH v3 00/18] btrfs dax support Goldwyn Rodrigues
@ 2019-04-16 16:41 ` Goldwyn Rodrigues
2019-04-17 16:46 ` Darrick J. Wong
0 siblings, 1 reply; 58+ messages in thread
From: Goldwyn Rodrigues @ 2019-04-16 16:41 UTC (permalink / raw)
To: linux-btrfs
Cc: linux-fsdevel, jack, david, willy, hch, darrick.wong, kilobyte,
dsterba, nborisov, linux-nvdimm, Goldwyn Rodrigues
From: Goldwyn Rodrigues <rgoldwyn@suse.com>
The IOMAP_DAX_COW is a iomap type which performs copy of
edges of data while performing a write if start/end are
not page aligned. The source address is expected in
iomap->inline_data.
dax_copy_edges() is a helper functions performs a copy from
one part of the device to another for data not page aligned.
If iomap->inline_data is NULL, it memset's the area to zero.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
fs/dax.c | 41 ++++++++++++++++++++++++++++++++++++++++-
include/linux/iomap.h | 1 +
2 files changed, 41 insertions(+), 1 deletion(-)
diff --git a/fs/dax.c b/fs/dax.c
index ca0671d55aa6..4b4ac51fbd16 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1083,6 +1083,40 @@ int __dax_zero_page_range(struct block_device *bdev,
}
EXPORT_SYMBOL_GPL(__dax_zero_page_range);
+/*
+ * dax_copy_edges - Copies the part of the pages not included in
+ * the write, but required for CoW because
+ * offset/offset+length are not page aligned.
+ */
+static void dax_copy_edges(struct inode *inode, loff_t pos, loff_t length,
+ struct iomap *iomap, void *daddr)
+{
+ unsigned offset = pos & (PAGE_SIZE - 1);
+ loff_t end = pos + length;
+ loff_t pg_end = round_up(end, PAGE_SIZE);
+ void *saddr = iomap->inline_data;
+ /*
+ * Copy the first part of the page
+ * Note: we pass offset as length
+ */
+ if (offset) {
+ if (saddr)
+ memcpy(daddr, saddr, offset);
+ else
+ memset(daddr, 0, offset);
+ }
+
+ /* Copy the last part of the range */
+ if (end < pg_end) {
+ if (saddr)
+ memcpy(daddr + offset + length,
+ saddr + offset + length, pg_end - end);
+ else
+ memset(daddr + offset + length, 0,
+ pg_end - end);
+ }
+}
+
static loff_t
dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
struct iomap *iomap)
@@ -1104,9 +1138,11 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
return iov_iter_zero(min(length, end - pos), iter);
}
- if (WARN_ON_ONCE(iomap->type != IOMAP_MAPPED))
+ if (WARN_ON_ONCE(iomap->type != IOMAP_MAPPED
+ && iomap->type != IOMAP_DAX_COW))
return -EIO;
+
/*
* Write can allocate block for an area which has a hole page mapped
* into page tables. We have to tear down these mappings so that data
@@ -1143,6 +1179,9 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
break;
}
+ if (iomap->type == IOMAP_DAX_COW)
+ dax_copy_edges(inode, pos, length, iomap, kaddr);
+
map_len = PFN_PHYS(map_len);
kaddr += offset;
map_len -= offset;
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 0fefb5455bda..6e885c5a38a3 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -25,6 +25,7 @@ struct vm_fault;
#define IOMAP_MAPPED 0x03 /* blocks allocated at @addr */
#define IOMAP_UNWRITTEN 0x04 /* blocks allocated at @addr in unwritten state */
#define IOMAP_INLINE 0x05 /* data inline in the inode */
+#define IOMAP_DAX_COW 0x06 /* Copy data pointed by inline_data before write*/
/*
* Flags for all iomap mappings:
--
2.16.4
^ permalink raw reply related [flat|nested] 58+ messages in thread
* Re: [PATCH 04/18] dax: Introduce IOMAP_DAX_COW to CoW edges during writes
2019-04-16 16:41 ` [PATCH 04/18] dax: Introduce IOMAP_DAX_COW to CoW edges during writes Goldwyn Rodrigues
@ 2019-04-17 16:46 ` Darrick J. Wong
0 siblings, 0 replies; 58+ messages in thread
From: Darrick J. Wong @ 2019-04-17 16:46 UTC (permalink / raw)
To: Goldwyn Rodrigues
Cc: linux-btrfs, linux-fsdevel, jack, david, willy, hch, kilobyte,
dsterba, nborisov, linux-nvdimm, Goldwyn Rodrigues
On Tue, Apr 16, 2019 at 11:41:40AM -0500, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues <rgoldwyn@suse.com>
>
> The IOMAP_DAX_COW is a iomap type which performs copy of
> edges of data while performing a write if start/end are
> not page aligned. The source address is expected in
> iomap->inline_data.
>
> dax_copy_edges() is a helper functions performs a copy from
> one part of the device to another for data not page aligned.
> If iomap->inline_data is NULL, it memset's the area to zero.
>
> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
> ---
> fs/dax.c | 41 ++++++++++++++++++++++++++++++++++++++++-
> include/linux/iomap.h | 1 +
> 2 files changed, 41 insertions(+), 1 deletion(-)
>
> diff --git a/fs/dax.c b/fs/dax.c
> index ca0671d55aa6..4b4ac51fbd16 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -1083,6 +1083,40 @@ int __dax_zero_page_range(struct block_device *bdev,
> }
> EXPORT_SYMBOL_GPL(__dax_zero_page_range);
>
> +/*
> + * dax_copy_edges - Copies the part of the pages not included in
> + * the write, but required for CoW because
> + * offset/offset+length are not page aligned.
> + */
> +static void dax_copy_edges(struct inode *inode, loff_t pos, loff_t length,
> + struct iomap *iomap, void *daddr)
> +{
> + unsigned offset = pos & (PAGE_SIZE - 1);
> + loff_t end = pos + length;
> + loff_t pg_end = round_up(end, PAGE_SIZE);
> + void *saddr = iomap->inline_data;
> + /*
> + * Copy the first part of the page
> + * Note: we pass offset as length
> + */
> + if (offset) {
> + if (saddr)
> + memcpy(daddr, saddr, offset);
I've been wondering, do we need memcpy_mcsafe here?
> + else
> + memset(daddr, 0, offset);
Or here?
(Or any of the other places we call memcpy/memset in this series...)
Because I think we'd prefer to return EIO on bad pmem over a machine
check.
--D
> + }
> +
> + /* Copy the last part of the range */
> + if (end < pg_end) {
> + if (saddr)
> + memcpy(daddr + offset + length,
> + saddr + offset + length, pg_end - end);
> + else
> + memset(daddr + offset + length, 0,
> + pg_end - end);
> + }
> +}
> +
> static loff_t
> dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
> struct iomap *iomap)
> @@ -1104,9 +1138,11 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
> return iov_iter_zero(min(length, end - pos), iter);
> }
>
> - if (WARN_ON_ONCE(iomap->type != IOMAP_MAPPED))
> + if (WARN_ON_ONCE(iomap->type != IOMAP_MAPPED
> + && iomap->type != IOMAP_DAX_COW))
Usually the '&&' goes on the first line, right?
> return -EIO;
>
> +
> /*
> * Write can allocate block for an area which has a hole page mapped
> * into page tables. We have to tear down these mappings so that data
> @@ -1143,6 +1179,9 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
> break;
> }
>
> + if (iomap->type == IOMAP_DAX_COW)
> + dax_copy_edges(inode, pos, length, iomap, kaddr);
No return value? So the pmem copy never fails?
--D
> +
> map_len = PFN_PHYS(map_len);
> kaddr += offset;
> map_len -= offset;
> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> index 0fefb5455bda..6e885c5a38a3 100644
> --- a/include/linux/iomap.h
> +++ b/include/linux/iomap.h
> @@ -25,6 +25,7 @@ struct vm_fault;
> #define IOMAP_MAPPED 0x03 /* blocks allocated at @addr */
> #define IOMAP_UNWRITTEN 0x04 /* blocks allocated at @addr in unwritten state */
> #define IOMAP_INLINE 0x05 /* data inline in the inode */
> +#define IOMAP_DAX_COW 0x06 /* Copy data pointed by inline_data before write*/
>
> /*
> * Flags for all iomap mappings:
> --
> 2.16.4
>
^ permalink raw reply [flat|nested] 58+ messages in thread
* [PATCH v4 00/18] btrfs dax support
@ 2019-04-29 17:26 Goldwyn Rodrigues
2019-04-29 17:26 ` [PATCH 01/18] btrfs: create a mount option for dax Goldwyn Rodrigues
` (17 more replies)
0 siblings, 18 replies; 58+ messages in thread
From: Goldwyn Rodrigues @ 2019-04-29 17:26 UTC (permalink / raw)
To: linux-btrfs
Cc: kilobyte, linux-fsdevel, jack, david, willy, hch, darrick.wong,
dsterba, nborisov, linux-nvdimm
This patch set adds support for dax on the BTRFS filesystem.
In order to support for CoW for btrfs, there were changes which had to be
made to the dax handling. The important one is copying blocks into the
same dax device before using them which is performed by iomap
type IOMAP_DAX_COW.
Snapshotting and CoW features are supported (including mmap preservation
across snapshots).
Git: https://github.com/goldwynr/linux/tree/btrfs-dax
Changes since v3:
- Fixed memcpy bug
- used flags for dax_insert_entry instead of bools for dax_insert_entry()
Changes since v2:
- Created a new type IOMAP_DAX_COW as opposed to flag IOMAP_F_COW
- CoW source address is presented in iomap.inline_data
- Split the patches to more elaborate dax/iomap patches
Changes since v1:
- use iomap instead of redoing everything in btrfs
- support for mmap writeprotecting on snapshotting
fs/btrfs/Makefile | 1
fs/btrfs/ctree.h | 38 +++++
fs/btrfs/dax.c | 289 +++++++++++++++++++++++++++++++++++++++++--
fs/btrfs/disk-io.c | 4
fs/btrfs/file.c | 37 ++++-
fs/btrfs/inode.c | 114 ++++++++++++----
fs/btrfs/ioctl.c | 29 +++-
fs/btrfs/send.c | 4
fs/btrfs/super.c | 30 ++++
fs/dax.c | 183 ++++++++++++++++++++++++---
fs/iomap.c | 9 -
fs/ocfs2/file.c | 2
fs/read_write.c | 11 -
fs/xfs/xfs_reflink.c | 2
include/linux/dax.h | 15 +-
include/linux/fs.h | 8 +
include/linux/iomap.h | 7 +
include/trace/events/btrfs.h | 56 ++++++++
18 files changed, 752 insertions(+), 87 deletions(-)
^ permalink raw reply [flat|nested] 58+ messages in thread
* [PATCH 01/18] btrfs: create a mount option for dax
2019-04-29 17:26 [PATCH v4 00/18] btrfs dax support Goldwyn Rodrigues
@ 2019-04-29 17:26 ` Goldwyn Rodrigues
2019-05-21 18:02 ` Darrick J. Wong
2019-04-29 17:26 ` [PATCH 02/18] btrfs: Carve out btrfs_get_extent_map_write() out of btrfs_get_blocks_write() Goldwyn Rodrigues
` (16 subsequent siblings)
17 siblings, 1 reply; 58+ messages in thread
From: Goldwyn Rodrigues @ 2019-04-29 17:26 UTC (permalink / raw)
To: linux-btrfs
Cc: kilobyte, linux-fsdevel, jack, david, willy, hch, darrick.wong,
dsterba, nborisov, linux-nvdimm, Goldwyn Rodrigues
From: Goldwyn Rodrigues <rgoldwyn@suse.com>
This sets S_DAX in inode->i_flags, which can be used with
IS_DAX().
The dax option is restricted to non multi-device mounts.
dax interacts with the device directly instead of using bio, so
all bio-hooks which we use for multi-device cannot be performed
here. While regular read/writes could be manipulated with
RAID0/1, mmap() is still an issue.
Auto-setting free space tree, because dealing with free space
inode (specifically readpages) is a nightmare.
Auto-setting nodatasum because we don't get callback for writing
checksums after mmap()s.
Deny compression because it does not go with direct I/O.
Store the dax_device in fs_info which will be used in iomap code.
I am aware of the push to directory-based flags for dax. Until, that
code is in the kernel, we will work with mount flags.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
fs/btrfs/ctree.h | 2 ++
fs/btrfs/disk-io.c | 4 ++++
fs/btrfs/ioctl.c | 5 ++++-
fs/btrfs/super.c | 30 ++++++++++++++++++++++++++++++
4 files changed, 40 insertions(+), 1 deletion(-)
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index b3642367a595..8ca1c0d120f4 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1067,6 +1067,7 @@ struct btrfs_fs_info {
u32 metadata_ratio;
void *bdev_holder;
+ struct dax_device *dax_dev;
/* private scrub information */
struct mutex scrub_lock;
@@ -1442,6 +1443,7 @@ static inline u32 BTRFS_MAX_XATTR_SIZE(const struct btrfs_fs_info *info)
#define BTRFS_MOUNT_FREE_SPACE_TREE (1 << 26)
#define BTRFS_MOUNT_NOLOGREPLAY (1 << 27)
#define BTRFS_MOUNT_REF_VERIFY (1 << 28)
+#define BTRFS_MOUNT_DAX (1 << 29)
#define BTRFS_DEFAULT_COMMIT_INTERVAL (30)
#define BTRFS_DEFAULT_MAX_INLINE (2048)
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 6fe9197f6ee4..2bbb63b2fcff 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -16,6 +16,7 @@
#include <linux/uuid.h>
#include <linux/semaphore.h>
#include <linux/error-injection.h>
+#include <linux/dax.h>
#include <linux/crc32c.h>
#include <linux/sched/mm.h>
#include <asm/unaligned.h>
@@ -2805,6 +2806,8 @@ int open_ctree(struct super_block *sb,
goto fail_alloc;
}
+ fs_info->dax_dev = fs_dax_get_by_bdev(fs_devices->latest_bdev);
+
/*
* We want to check superblock checksum, the type is stored inside.
* Pass the whole disk block of size BTRFS_SUPER_INFO_SIZE (4k).
@@ -4043,6 +4046,7 @@ void close_ctree(struct btrfs_fs_info *fs_info)
#endif
btrfs_close_devices(fs_info->fs_devices);
+ fs_put_dax(fs_info->dax_dev);
btrfs_mapping_tree_free(&fs_info->mapping_tree);
percpu_counter_destroy(&fs_info->dirty_metadata_bytes);
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index cd4e693406a0..0138119cd9a3 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -149,8 +149,11 @@ void btrfs_sync_inode_flags_to_i_flags(struct inode *inode)
if (binode->flags & BTRFS_INODE_DIRSYNC)
new_fl |= S_DIRSYNC;
+ if ((btrfs_test_opt(btrfs_sb(inode->i_sb), DAX)) && S_ISREG(inode->i_mode))
+ new_fl |= S_DAX;
+
set_mask_bits(&inode->i_flags,
- S_SYNC | S_APPEND | S_IMMUTABLE | S_NOATIME | S_DIRSYNC,
+ S_SYNC | S_APPEND | S_IMMUTABLE | S_NOATIME | S_DIRSYNC | S_DAX,
new_fl);
}
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 120e4340792a..3b85e61e5182 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -326,6 +326,7 @@ enum {
Opt_treelog, Opt_notreelog,
Opt_usebackuproot,
Opt_user_subvol_rm_allowed,
+ Opt_dax,
/* Deprecated options */
Opt_alloc_start,
@@ -393,6 +394,7 @@ static const match_table_t tokens = {
{Opt_notreelog, "notreelog"},
{Opt_usebackuproot, "usebackuproot"},
{Opt_user_subvol_rm_allowed, "user_subvol_rm_allowed"},
+ {Opt_dax, "dax"},
/* Deprecated options */
{Opt_alloc_start, "alloc_start=%s"},
@@ -745,6 +747,32 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char *options,
case Opt_user_subvol_rm_allowed:
btrfs_set_opt(info->mount_opt, USER_SUBVOL_RM_ALLOWED);
break;
+ case Opt_dax:
+#ifdef CONFIG_FS_DAX
+ if (btrfs_super_num_devices(info->super_copy) > 1) {
+ btrfs_info(info,
+ "dax not supported for multi-device btrfs partition\n");
+ ret = -EOPNOTSUPP;
+ goto out;
+ }
+ btrfs_set_opt(info->mount_opt, DAX);
+ btrfs_warn(info, "DAX enabled. Warning: EXPERIMENTAL, use at your own risk\n");
+ btrfs_set_and_info(info, NODATASUM,
+ "auto-setting nodatasum (dax)");
+ btrfs_clear_opt(info->mount_opt, SPACE_CACHE);
+ btrfs_set_and_info(info, FREE_SPACE_TREE,
+ "auto-setting free space tree (dax)");
+ if (btrfs_test_opt(info, COMPRESS)) {
+ btrfs_info(info, "disabling compress (dax)");
+ btrfs_clear_opt(info->mount_opt, COMPRESS);
+ }
+ break;
+#else
+ btrfs_err(info,
+ "DAX option not supported\n");
+ ret = -EINVAL;
+ goto out;
+#endif
case Opt_enospc_debug:
btrfs_set_opt(info->mount_opt, ENOSPC_DEBUG);
break;
@@ -1335,6 +1363,8 @@ static int btrfs_show_options(struct seq_file *seq, struct dentry *dentry)
seq_puts(seq, ",clear_cache");
if (btrfs_test_opt(info, USER_SUBVOL_RM_ALLOWED))
seq_puts(seq, ",user_subvol_rm_allowed");
+ if (btrfs_test_opt(info, DAX))
+ seq_puts(seq, ",dax");
if (btrfs_test_opt(info, ENOSPC_DEBUG))
seq_puts(seq, ",enospc_debug");
if (btrfs_test_opt(info, AUTO_DEFRAG))
--
2.16.4
^ permalink raw reply related [flat|nested] 58+ messages in thread
* [PATCH 02/18] btrfs: Carve out btrfs_get_extent_map_write() out of btrfs_get_blocks_write()
2019-04-29 17:26 [PATCH v4 00/18] btrfs dax support Goldwyn Rodrigues
2019-04-29 17:26 ` [PATCH 01/18] btrfs: create a mount option for dax Goldwyn Rodrigues
@ 2019-04-29 17:26 ` Goldwyn Rodrigues
2019-04-29 17:26 ` [PATCH 03/18] btrfs: basic dax read Goldwyn Rodrigues
` (15 subsequent siblings)
17 siblings, 0 replies; 58+ messages in thread
From: Goldwyn Rodrigues @ 2019-04-29 17:26 UTC (permalink / raw)
To: linux-btrfs
Cc: kilobyte, linux-fsdevel, jack, david, willy, hch, darrick.wong,
dsterba, nborisov, linux-nvdimm, Goldwyn Rodrigues
From: Goldwyn Rodrigues <rgoldwyn@suse.com>
This makes btrfs_get_extent_map_write() independent of Direct
I/O code.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
fs/btrfs/ctree.h | 2 ++
fs/btrfs/inode.c | 40 +++++++++++++++++++++++++++-------------
2 files changed, 29 insertions(+), 13 deletions(-)
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 8ca1c0d120f4..9512f49262dd 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3277,6 +3277,8 @@ struct inode *btrfs_iget_path(struct super_block *s, struct btrfs_key *location,
struct btrfs_path *path);
struct inode *btrfs_iget(struct super_block *s, struct btrfs_key *location,
struct btrfs_root *root, int *was_new);
+int btrfs_get_extent_map_write(struct extent_map **map, struct buffer_head *bh,
+ struct inode *inode, u64 start, u64 len);
struct extent_map *btrfs_get_extent(struct btrfs_inode *inode,
struct page *page, size_t pg_offset,
u64 start, u64 end, int create);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 82fdda8ff5ab..68b8a4935ba6 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7496,11 +7496,10 @@ static int btrfs_get_blocks_direct_read(struct extent_map *em,
return 0;
}
-static int btrfs_get_blocks_direct_write(struct extent_map **map,
- struct buffer_head *bh_result,
- struct inode *inode,
- struct btrfs_dio_data *dio_data,
- u64 start, u64 len)
+int btrfs_get_extent_map_write(struct extent_map **map,
+ struct buffer_head *bh,
+ struct inode *inode,
+ u64 start, u64 len)
{
struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
struct extent_map *em = *map;
@@ -7554,22 +7553,38 @@ static int btrfs_get_blocks_direct_write(struct extent_map **map,
*/
btrfs_free_reserved_data_space_noquota(inode, start,
len);
- goto skip_cow;
+ /* skip COW */
+ goto out;
}
}
/* this will cow the extent */
- len = bh_result->b_size;
+ if (bh)
+ len = bh->b_size;
free_extent_map(em);
*map = em = btrfs_new_extent_direct(inode, start, len);
- if (IS_ERR(em)) {
- ret = PTR_ERR(em);
- goto out;
- }
+ if (IS_ERR(em))
+ return PTR_ERR(em);
+out:
+ return ret;
+}
+static int btrfs_get_blocks_direct_write(struct extent_map **map,
+ struct buffer_head *bh_result,
+ struct inode *inode,
+ struct btrfs_dio_data *dio_data,
+ u64 start, u64 len)
+{
+ int ret;
+ struct extent_map *em;
+
+ ret = btrfs_get_extent_map_write(map, bh_result, inode,
+ start, len);
+ if (ret < 0)
+ return ret;
+ em = *map;
len = min(len, em->len - (start - em->start));
-skip_cow:
bh_result->b_blocknr = (em->block_start + (start - em->start)) >>
inode->i_blkbits;
bh_result->b_size = len;
@@ -7590,7 +7605,6 @@ static int btrfs_get_blocks_direct_write(struct extent_map **map,
dio_data->reserve -= len;
dio_data->unsubmitted_oe_range_end = start + len;
current->journal_info = dio_data;
-out:
return ret;
}
--
2.16.4
^ permalink raw reply related [flat|nested] 58+ messages in thread
* [PATCH 03/18] btrfs: basic dax read
2019-04-29 17:26 [PATCH v4 00/18] btrfs dax support Goldwyn Rodrigues
2019-04-29 17:26 ` [PATCH 01/18] btrfs: create a mount option for dax Goldwyn Rodrigues
2019-04-29 17:26 ` [PATCH 02/18] btrfs: Carve out btrfs_get_extent_map_write() out of btrfs_get_blocks_write() Goldwyn Rodrigues
@ 2019-04-29 17:26 ` Goldwyn Rodrigues
2019-05-21 15:14 ` Darrick J. Wong
2019-04-29 17:26 ` [PATCH 04/18] dax: Introduce IOMAP_DAX_COW to CoW edges during writes Goldwyn Rodrigues
` (14 subsequent siblings)
17 siblings, 1 reply; 58+ messages in thread
From: Goldwyn Rodrigues @ 2019-04-29 17:26 UTC (permalink / raw)
To: linux-btrfs
Cc: kilobyte, linux-fsdevel, jack, david, willy, hch, darrick.wong,
dsterba, nborisov, linux-nvdimm, Goldwyn Rodrigues
From: Goldwyn Rodrigues <rgoldwyn@suse.com>
Perform a basic read using iomap support. The btrfs_iomap_begin()
finds the extent at the position and fills the iomap data
structure with the values.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
fs/btrfs/Makefile | 1 +
fs/btrfs/ctree.h | 5 +++++
fs/btrfs/dax.c | 49 +++++++++++++++++++++++++++++++++++++++++++++++++
fs/btrfs/file.c | 11 ++++++++++-
4 files changed, 65 insertions(+), 1 deletion(-)
create mode 100644 fs/btrfs/dax.c
diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
index ca693dd554e9..1fa77b875ae9 100644
--- a/fs/btrfs/Makefile
+++ b/fs/btrfs/Makefile
@@ -12,6 +12,7 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o root-tree.o dir-item.o \
reada.o backref.o ulist.o qgroup.o send.o dev-replace.o raid56.o \
uuid-tree.o props.o free-space-tree.o tree-checker.o
+btrfs-$(CONFIG_FS_DAX) += dax.o
btrfs-$(CONFIG_BTRFS_FS_POSIX_ACL) += acl.o
btrfs-$(CONFIG_BTRFS_FS_CHECK_INTEGRITY) += check-integrity.o
btrfs-$(CONFIG_BTRFS_FS_REF_VERIFY) += ref-verify.o
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 9512f49262dd..b7bbe5130a3b 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3795,6 +3795,11 @@ int btrfs_reada_wait(void *handle);
void btrfs_reada_detach(void *handle);
int btree_readahead_hook(struct extent_buffer *eb, int err);
+#ifdef CONFIG_FS_DAX
+/* dax.c */
+ssize_t btrfs_file_dax_read(struct kiocb *iocb, struct iov_iter *to);
+#endif /* CONFIG_FS_DAX */
+
static inline int is_fstree(u64 rootid)
{
if (rootid == BTRFS_FS_TREE_OBJECTID ||
diff --git a/fs/btrfs/dax.c b/fs/btrfs/dax.c
new file mode 100644
index 000000000000..bf3d46b0acb6
--- /dev/null
+++ b/fs/btrfs/dax.c
@@ -0,0 +1,49 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * DAX support for BTRFS
+ *
+ * Copyright (c) 2019 SUSE Linux
+ * Author: Goldwyn Rodrigues <rgoldwyn@suse.com>
+ */
+
+#ifdef CONFIG_FS_DAX
+#include <linux/dax.h>
+#include <linux/iomap.h>
+#include "ctree.h"
+#include "btrfs_inode.h"
+
+static int btrfs_iomap_begin(struct inode *inode, loff_t pos,
+ loff_t length, unsigned flags, struct iomap *iomap)
+{
+ struct extent_map *em;
+ struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
+ em = btrfs_get_extent(BTRFS_I(inode), NULL, 0, pos, length, 0);
+ if (em->block_start == EXTENT_MAP_HOLE) {
+ iomap->type = IOMAP_HOLE;
+ return 0;
+ }
+ iomap->type = IOMAP_MAPPED;
+ iomap->bdev = em->bdev;
+ iomap->dax_dev = fs_info->dax_dev;
+ iomap->offset = em->start;
+ iomap->length = em->len;
+ iomap->addr = em->block_start;
+ return 0;
+}
+
+static const struct iomap_ops btrfs_iomap_ops = {
+ .iomap_begin = btrfs_iomap_begin,
+};
+
+ssize_t btrfs_file_dax_read(struct kiocb *iocb, struct iov_iter *to)
+{
+ ssize_t ret;
+ struct inode *inode = file_inode(iocb->ki_filp);
+
+ inode_lock_shared(inode);
+ ret = dax_iomap_rw(iocb, to, &btrfs_iomap_ops);
+ inode_unlock_shared(inode);
+
+ return ret;
+}
+#endif /* CONFIG_FS_DAX */
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 34fe8a58b0e9..9194591f9eea 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -3288,9 +3288,18 @@ static int btrfs_file_open(struct inode *inode, struct file *filp)
return generic_file_open(inode, filp);
}
+static ssize_t btrfs_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
+{
+#ifdef CONFIG_FS_DAX
+ if (IS_DAX(file_inode(iocb->ki_filp)))
+ return btrfs_file_dax_read(iocb, to);
+#endif
+ return generic_file_read_iter(iocb, to);
+}
+
const struct file_operations btrfs_file_operations = {
.llseek = btrfs_file_llseek,
- .read_iter = generic_file_read_iter,
+ .read_iter = btrfs_file_read_iter,
.splice_read = generic_file_splice_read,
.write_iter = btrfs_file_write_iter,
.mmap = btrfs_file_mmap,
--
2.16.4
^ permalink raw reply related [flat|nested] 58+ messages in thread
* [PATCH 04/18] dax: Introduce IOMAP_DAX_COW to CoW edges during writes
2019-04-29 17:26 [PATCH v4 00/18] btrfs dax support Goldwyn Rodrigues
` (2 preceding siblings ...)
2019-04-29 17:26 ` [PATCH 03/18] btrfs: basic dax read Goldwyn Rodrigues
@ 2019-04-29 17:26 ` Goldwyn Rodrigues
2019-05-21 16:51 ` Darrick J. Wong
2019-04-29 17:26 ` [PATCH 05/18] btrfs: return whether extent is nocow or not Goldwyn Rodrigues
` (13 subsequent siblings)
17 siblings, 1 reply; 58+ messages in thread
From: Goldwyn Rodrigues @ 2019-04-29 17:26 UTC (permalink / raw)
To: linux-btrfs
Cc: kilobyte, linux-fsdevel, jack, david, willy, hch, darrick.wong,
dsterba, nborisov, linux-nvdimm, Goldwyn Rodrigues
From: Goldwyn Rodrigues <rgoldwyn@suse.com>
The IOMAP_DAX_COW is a iomap type which performs copy of
edges of data while performing a write if start/end are
not page aligned. The source address is expected in
iomap->inline_data.
dax_copy_edges() is a helper functions performs a copy from
one part of the device to another for data not page aligned.
If iomap->inline_data is NULL, it memset's the area to zero.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
fs/dax.c | 46 +++++++++++++++++++++++++++++++++++++++++++++-
include/linux/iomap.h | 1 +
2 files changed, 46 insertions(+), 1 deletion(-)
diff --git a/fs/dax.c b/fs/dax.c
index e5e54da1715f..610bfa861a28 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1084,6 +1084,42 @@ int __dax_zero_page_range(struct block_device *bdev,
}
EXPORT_SYMBOL_GPL(__dax_zero_page_range);
+/*
+ * dax_copy_edges - Copies the part of the pages not included in
+ * the write, but required for CoW because
+ * offset/offset+length are not page aligned.
+ */
+static int dax_copy_edges(struct inode *inode, loff_t pos, loff_t length,
+ struct iomap *iomap, void *daddr)
+{
+ unsigned offset = pos & (PAGE_SIZE - 1);
+ loff_t end = pos + length;
+ loff_t pg_end = round_up(end, PAGE_SIZE);
+ void *saddr = iomap->inline_data;
+ int ret = 0;
+ /*
+ * Copy the first part of the page
+ * Note: we pass offset as length
+ */
+ if (offset) {
+ if (saddr)
+ ret = memcpy_mcsafe(daddr, saddr, offset);
+ else
+ memset(daddr, 0, offset);
+ }
+
+ /* Copy the last part of the range */
+ if (end < pg_end) {
+ if (saddr)
+ ret = memcpy_mcsafe(daddr + offset + length,
+ saddr + offset + length, pg_end - end);
+ else
+ memset(daddr + offset + length, 0,
+ pg_end - end);
+ }
+ return ret;
+}
+
static loff_t
dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
struct iomap *iomap)
@@ -1105,9 +1141,11 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
return iov_iter_zero(min(length, end - pos), iter);
}
- if (WARN_ON_ONCE(iomap->type != IOMAP_MAPPED))
+ if (WARN_ON_ONCE(iomap->type != IOMAP_MAPPED
+ && iomap->type != IOMAP_DAX_COW))
return -EIO;
+
/*
* Write can allocate block for an area which has a hole page mapped
* into page tables. We have to tear down these mappings so that data
@@ -1144,6 +1182,12 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
break;
}
+ if (iomap->type == IOMAP_DAX_COW) {
+ ret = dax_copy_edges(inode, pos, length, iomap, kaddr);
+ if (ret)
+ break;
+ }
+
map_len = PFN_PHYS(map_len);
kaddr += offset;
map_len -= offset;
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 0fefb5455bda..6e885c5a38a3 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -25,6 +25,7 @@ struct vm_fault;
#define IOMAP_MAPPED 0x03 /* blocks allocated at @addr */
#define IOMAP_UNWRITTEN 0x04 /* blocks allocated at @addr in unwritten state */
#define IOMAP_INLINE 0x05 /* data inline in the inode */
+#define IOMAP_DAX_COW 0x06 /* Copy data pointed by inline_data before write*/
/*
* Flags for all iomap mappings:
--
2.16.4
^ permalink raw reply related [flat|nested] 58+ messages in thread
* [PATCH 05/18] btrfs: return whether extent is nocow or not
2019-04-29 17:26 [PATCH v4 00/18] btrfs dax support Goldwyn Rodrigues
` (3 preceding siblings ...)
2019-04-29 17:26 ` [PATCH 04/18] dax: Introduce IOMAP_DAX_COW to CoW edges during writes Goldwyn Rodrigues
@ 2019-04-29 17:26 ` Goldwyn Rodrigues
2019-04-29 17:26 ` [PATCH 06/18] btrfs: Rename __endio_write_update_ordered() to btrfs_update_ordered_extent() Goldwyn Rodrigues
` (12 subsequent siblings)
17 siblings, 0 replies; 58+ messages in thread
From: Goldwyn Rodrigues @ 2019-04-29 17:26 UTC (permalink / raw)
To: linux-btrfs
Cc: kilobyte, linux-fsdevel, jack, david, willy, hch, darrick.wong,
dsterba, nborisov, linux-nvdimm, Goldwyn Rodrigues
From: Goldwyn Rodrigues <rgoldwyn@suse.com>
We require this to make sure we return type IOMAP_DAX_COW in
iomap structure, in the later patches.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
fs/btrfs/ctree.h | 2 +-
fs/btrfs/inode.c | 9 +++++++--
2 files changed, 8 insertions(+), 3 deletions(-)
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index b7bbe5130a3b..050f30165531 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3278,7 +3278,7 @@ struct inode *btrfs_iget_path(struct super_block *s, struct btrfs_key *location,
struct inode *btrfs_iget(struct super_block *s, struct btrfs_key *location,
struct btrfs_root *root, int *was_new);
int btrfs_get_extent_map_write(struct extent_map **map, struct buffer_head *bh,
- struct inode *inode, u64 start, u64 len);
+ struct inode *inode, u64 start, u64 len, bool *nocow);
struct extent_map *btrfs_get_extent(struct btrfs_inode *inode,
struct page *page, size_t pg_offset,
u64 start, u64 end, int create);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 68b8a4935ba6..8e33c38511bb 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7499,12 +7499,15 @@ static int btrfs_get_blocks_direct_read(struct extent_map *em,
int btrfs_get_extent_map_write(struct extent_map **map,
struct buffer_head *bh,
struct inode *inode,
- u64 start, u64 len)
+ u64 start, u64 len, bool *nocow)
{
struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
struct extent_map *em = *map;
int ret = 0;
+ if (nocow)
+ *nocow = false;
+
/*
* We don't allocate a new extent in the following cases
*
@@ -7553,6 +7556,8 @@ int btrfs_get_extent_map_write(struct extent_map **map,
*/
btrfs_free_reserved_data_space_noquota(inode, start,
len);
+ if (nocow)
+ *nocow = true;
/* skip COW */
goto out;
}
@@ -7579,7 +7584,7 @@ static int btrfs_get_blocks_direct_write(struct extent_map **map,
struct extent_map *em;
ret = btrfs_get_extent_map_write(map, bh_result, inode,
- start, len);
+ start, len, NULL);
if (ret < 0)
return ret;
em = *map;
--
2.16.4
^ permalink raw reply related [flat|nested] 58+ messages in thread
* [PATCH 06/18] btrfs: Rename __endio_write_update_ordered() to btrfs_update_ordered_extent()
2019-04-29 17:26 [PATCH v4 00/18] btrfs dax support Goldwyn Rodrigues
` (4 preceding siblings ...)
2019-04-29 17:26 ` [PATCH 05/18] btrfs: return whether extent is nocow or not Goldwyn Rodrigues
@ 2019-04-29 17:26 ` Goldwyn Rodrigues
2019-04-29 17:26 ` [PATCH 07/18] btrfs: add dax write support Goldwyn Rodrigues
` (11 subsequent siblings)
17 siblings, 0 replies; 58+ messages in thread
From: Goldwyn Rodrigues @ 2019-04-29 17:26 UTC (permalink / raw)
To: linux-btrfs
Cc: kilobyte, linux-fsdevel, jack, david, willy, hch, darrick.wong,
dsterba, nborisov, linux-nvdimm, Goldwyn Rodrigues
From: Goldwyn Rodrigues <rgoldwyn@suse.com>
Since we will be using it in another part of the code, use a
better name to declare it non-static
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
fs/btrfs/ctree.h | 7 +++++--
fs/btrfs/inode.c | 14 +++++---------
2 files changed, 10 insertions(+), 11 deletions(-)
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 050f30165531..1e3e758b83c2 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3280,8 +3280,11 @@ struct inode *btrfs_iget(struct super_block *s, struct btrfs_key *location,
int btrfs_get_extent_map_write(struct extent_map **map, struct buffer_head *bh,
struct inode *inode, u64 start, u64 len, bool *nocow);
struct extent_map *btrfs_get_extent(struct btrfs_inode *inode,
- struct page *page, size_t pg_offset,
- u64 start, u64 end, int create);
+ struct page *page, size_t pg_offset,
+ u64 start, u64 end, int create);
+void btrfs_update_ordered_extent(struct inode *inode,
+ const u64 offset, const u64 bytes,
+ const bool uptodate);
int btrfs_update_inode(struct btrfs_trans_handle *trans,
struct btrfs_root *root,
struct inode *inode);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 8e33c38511bb..af4b56cba104 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -98,10 +98,6 @@ static struct extent_map *create_io_em(struct inode *inode, u64 start, u64 len,
u64 ram_bytes, int compress_type,
int type);
-static void __endio_write_update_ordered(struct inode *inode,
- const u64 offset, const u64 bytes,
- const bool uptodate);
-
/*
* Cleanup all submitted ordered extents in specified range to handle errors
* from the btrfs_run_delalloc_range() callback.
@@ -142,7 +138,7 @@ static inline void btrfs_cleanup_ordered_extents(struct inode *inode,
bytes -= PAGE_SIZE;
}
- return __endio_write_update_ordered(inode, offset, bytes, false);
+ return btrfs_update_ordered_extent(inode, offset, bytes, false);
}
static int btrfs_dirty_inode(struct inode *inode);
@@ -8085,7 +8081,7 @@ static void btrfs_endio_direct_read(struct bio *bio)
bio_put(bio);
}
-static void __endio_write_update_ordered(struct inode *inode,
+void btrfs_update_ordered_extent(struct inode *inode,
const u64 offset, const u64 bytes,
const bool uptodate)
{
@@ -8138,7 +8134,7 @@ static void btrfs_endio_direct_write(struct bio *bio)
struct btrfs_dio_private *dip = bio->bi_private;
struct bio *dio_bio = dip->dio_bio;
- __endio_write_update_ordered(dip->inode, dip->logical_offset,
+ btrfs_update_ordered_extent(dip->inode, dip->logical_offset,
dip->bytes, !bio->bi_status);
kfree(dip);
@@ -8457,7 +8453,7 @@ static void btrfs_submit_direct(struct bio *dio_bio, struct inode *inode,
bio = NULL;
} else {
if (write)
- __endio_write_update_ordered(inode,
+ btrfs_update_ordered_extent(inode,
file_offset,
dio_bio->bi_iter.bi_size,
false);
@@ -8597,7 +8593,7 @@ static ssize_t btrfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
*/
if (dio_data.unsubmitted_oe_range_start <
dio_data.unsubmitted_oe_range_end)
- __endio_write_update_ordered(inode,
+ btrfs_update_ordered_extent(inode,
dio_data.unsubmitted_oe_range_start,
dio_data.unsubmitted_oe_range_end -
dio_data.unsubmitted_oe_range_start,
--
2.16.4
^ permalink raw reply related [flat|nested] 58+ messages in thread
* [PATCH 07/18] btrfs: add dax write support
2019-04-29 17:26 [PATCH v4 00/18] btrfs dax support Goldwyn Rodrigues
` (5 preceding siblings ...)
2019-04-29 17:26 ` [PATCH 06/18] btrfs: Rename __endio_write_update_ordered() to btrfs_update_ordered_extent() Goldwyn Rodrigues
@ 2019-04-29 17:26 ` Goldwyn Rodrigues
2019-05-21 17:08 ` Darrick J. Wong
2019-04-29 17:26 ` [PATCH 08/18] dax: memcpy page in case of IOMAP_DAX_COW for mmap faults Goldwyn Rodrigues
` (10 subsequent siblings)
17 siblings, 1 reply; 58+ messages in thread
From: Goldwyn Rodrigues @ 2019-04-29 17:26 UTC (permalink / raw)
To: linux-btrfs
Cc: kilobyte, linux-fsdevel, jack, david, willy, hch, darrick.wong,
dsterba, nborisov, linux-nvdimm, Goldwyn Rodrigues
From: Goldwyn Rodrigues <rgoldwyn@suse.com>
IOMAP_DAX_COW allows to inform the dax code, to first perform
a copy which are not page-aligned before performing the write.
The responsibility of checking if data edges are page aligned
is performed in ->iomap_begin() and the source address is
stored in ->inline_data
A new struct btrfs_iomap is passed from iomap_begin() to
iomap_end(), which contains all the accounting and locking information
for CoW based writes.
For writing to a hole, iomap->inline_data is set to zero.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
fs/btrfs/ctree.h | 6 ++
fs/btrfs/dax.c | 182 +++++++++++++++++++++++++++++++++++++++++++++++++++++--
fs/btrfs/file.c | 4 +-
3 files changed, 185 insertions(+), 7 deletions(-)
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 1e3e758b83c2..eec01eb92f33 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3801,6 +3801,12 @@ int btree_readahead_hook(struct extent_buffer *eb, int err);
#ifdef CONFIG_FS_DAX
/* dax.c */
ssize_t btrfs_file_dax_read(struct kiocb *iocb, struct iov_iter *to);
+ssize_t btrfs_file_dax_write(struct kiocb *iocb, struct iov_iter *from);
+#else
+static inline ssize_t btrfs_file_dax_write(struct kiocb *iocb, struct iov_iter *from)
+{
+ return 0;
+}
#endif /* CONFIG_FS_DAX */
static inline int is_fstree(u64 rootid)
diff --git a/fs/btrfs/dax.c b/fs/btrfs/dax.c
index bf3d46b0acb6..f5cc9bcdbf14 100644
--- a/fs/btrfs/dax.c
+++ b/fs/btrfs/dax.c
@@ -9,30 +9,184 @@
#ifdef CONFIG_FS_DAX
#include <linux/dax.h>
#include <linux/iomap.h>
+#include <linux/uio.h>
#include "ctree.h"
#include "btrfs_inode.h"
+struct btrfs_iomap {
+ u64 start;
+ u64 end;
+ bool nocow;
+ struct extent_changeset *data_reserved;
+ struct extent_state *cached_state;
+};
+
+static struct btrfs_iomap *btrfs_iomap_init(struct inode *inode,
+ struct extent_map **em,
+ loff_t pos, loff_t length)
+{
+ int ret = 0;
+ struct extent_map *map = *em;
+ struct btrfs_iomap *bi;
+
+ bi = kzalloc(sizeof(struct btrfs_iomap), GFP_NOFS);
+ if (!bi)
+ return ERR_PTR(-ENOMEM);
+
+ bi->start = round_down(pos, PAGE_SIZE);
+ bi->end = PAGE_ALIGN(pos + length);
+
+ /* Wait for existing ordered extents in range to finish */
+ btrfs_wait_ordered_range(inode, bi->start, bi->end - bi->start);
+
+ lock_extent_bits(&BTRFS_I(inode)->io_tree, bi->start, bi->end, &bi->cached_state);
+
+ ret = btrfs_delalloc_reserve_space(inode, &bi->data_reserved,
+ bi->start, bi->end - bi->start);
+ if (ret) {
+ unlock_extent_cached(&BTRFS_I(inode)->io_tree, bi->start, bi->end,
+ &bi->cached_state);
+ kfree(bi);
+ return ERR_PTR(ret);
+ }
+
+ refcount_inc(&map->refs);
+ ret = btrfs_get_extent_map_write(em, NULL,
+ inode, bi->start, bi->end - bi->start, &bi->nocow);
+ if (ret) {
+ unlock_extent_cached(&BTRFS_I(inode)->io_tree, bi->start, bi->end,
+ &bi->cached_state);
+ btrfs_delalloc_release_space(inode,
+ bi->data_reserved, bi->start,
+ bi->end - bi->start, true);
+ extent_changeset_free(bi->data_reserved);
+ kfree(bi);
+ return ERR_PTR(ret);
+ }
+ free_extent_map(map);
+ return bi;
+}
+
+static void *dax_address(struct block_device *bdev, struct dax_device *dax_dev,
+ sector_t sector, loff_t pos, loff_t length)
+{
+ size_t size = ALIGN(pos + length, PAGE_SIZE);
+ int id, ret = 0;
+ void *kaddr = NULL;
+ pgoff_t pgoff;
+ long map_len;
+
+ id = dax_read_lock();
+
+ ret = bdev_dax_pgoff(bdev, sector, size, &pgoff);
+ if (ret)
+ goto out;
+
+ map_len = dax_direct_access(dax_dev, pgoff, PHYS_PFN(size),
+ &kaddr, NULL);
+ if (map_len < 0)
+ ret = map_len;
+
+out:
+ dax_read_unlock(id);
+ if (ret)
+ return ERR_PTR(ret);
+ return kaddr;
+}
+
static int btrfs_iomap_begin(struct inode *inode, loff_t pos,
loff_t length, unsigned flags, struct iomap *iomap)
{
struct extent_map *em;
struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
+ struct btrfs_iomap *bi = NULL;
+ unsigned offset = pos & (PAGE_SIZE - 1);
+ u64 srcblk = 0;
+ loff_t diff;
+
em = btrfs_get_extent(BTRFS_I(inode), NULL, 0, pos, length, 0);
+
+ iomap->type = IOMAP_MAPPED;
+
+ if (flags & IOMAP_WRITE) {
+ if (em->block_start != EXTENT_MAP_HOLE)
+ srcblk = em->block_start + pos - em->start - offset;
+
+ bi = btrfs_iomap_init(inode, &em, pos, length);
+ if (IS_ERR(bi))
+ return PTR_ERR(bi);
+
+ }
+
+ /*
+ * Advance the difference between pos and start, to align well with
+ * inline_data in case of writes
+ */
+ diff = round_down(pos - em->start, PAGE_SIZE);
+ iomap->offset = em->start + diff;
+ iomap->length = em->len - diff;
+ iomap->bdev = em->bdev;
+ iomap->dax_dev = fs_info->dax_dev;
+
+ /*
+ * This will be true for reads only since we have already
+ * allocated em
+ */
if (em->block_start == EXTENT_MAP_HOLE) {
iomap->type = IOMAP_HOLE;
return 0;
}
- iomap->type = IOMAP_MAPPED;
- iomap->bdev = em->bdev;
- iomap->dax_dev = fs_info->dax_dev;
- iomap->offset = em->start;
- iomap->length = em->len;
- iomap->addr = em->block_start;
+
+ iomap->addr = em->block_start + diff;
+ /* Check if we really need to copy data from old extent */
+ if (bi && !bi->nocow && (offset || pos + length < bi->end)) {
+ iomap->type = IOMAP_DAX_COW;
+ if (srcblk) {
+ sector_t sector = (srcblk + (pos & PAGE_MASK) -
+ iomap->offset) >> 9;
+ iomap->inline_data = dax_address(em->bdev,
+ fs_info->dax_dev, sector, pos, length);
+ if (IS_ERR(iomap->inline_data)) {
+ kfree(bi);
+ return PTR_ERR(iomap->inline_data);
+ }
+ }
+ }
+
+ iomap->private = bi;
+ return 0;
+}
+
+static int btrfs_iomap_end(struct inode *inode, loff_t pos,
+ loff_t length, ssize_t written, unsigned flags,
+ struct iomap *iomap)
+{
+ struct btrfs_iomap *bi = iomap->private;
+ u64 wend;
+
+ if (!bi)
+ return 0;
+
+ unlock_extent_cached(&BTRFS_I(inode)->io_tree, bi->start, bi->end,
+ &bi->cached_state);
+
+ wend = PAGE_ALIGN(pos + written);
+ if (wend < bi->end) {
+ btrfs_delalloc_release_space(inode,
+ bi->data_reserved, wend,
+ bi->end - wend, true);
+ }
+
+ btrfs_update_ordered_extent(inode, bi->start, wend - bi->start, true);
+ btrfs_delalloc_release_extents(BTRFS_I(inode), wend - bi->start, false);
+ extent_changeset_free(bi->data_reserved);
+ kfree(bi);
return 0;
}
static const struct iomap_ops btrfs_iomap_ops = {
.iomap_begin = btrfs_iomap_begin,
+ .iomap_end = btrfs_iomap_end,
};
ssize_t btrfs_file_dax_read(struct kiocb *iocb, struct iov_iter *to)
@@ -46,4 +200,20 @@ ssize_t btrfs_file_dax_read(struct kiocb *iocb, struct iov_iter *to)
return ret;
}
+
+ssize_t btrfs_file_dax_write(struct kiocb *iocb, struct iov_iter *iter)
+{
+ ssize_t ret = 0;
+ u64 pos = iocb->ki_pos;
+ struct inode *inode = file_inode(iocb->ki_filp);
+ ret = dax_iomap_rw(iocb, iter, &btrfs_iomap_ops);
+
+ if (ret > 0) {
+ pos += ret;
+ if (pos > i_size_read(inode))
+ i_size_write(inode, pos);
+ iocb->ki_pos = pos;
+ }
+ return ret;
+}
#endif /* CONFIG_FS_DAX */
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 9194591f9eea..a795023e26ca 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1964,7 +1964,9 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
if (sync)
atomic_inc(&BTRFS_I(inode)->sync_writers);
- if (iocb->ki_flags & IOCB_DIRECT) {
+ if (IS_DAX(inode)) {
+ num_written = btrfs_file_dax_write(iocb, from);
+ } else if (iocb->ki_flags & IOCB_DIRECT) {
num_written = __btrfs_direct_write(iocb, from);
} else {
num_written = btrfs_buffered_write(iocb, from);
--
2.16.4
^ permalink raw reply related [flat|nested] 58+ messages in thread
* [PATCH 08/18] dax: memcpy page in case of IOMAP_DAX_COW for mmap faults
2019-04-29 17:26 [PATCH v4 00/18] btrfs dax support Goldwyn Rodrigues
` (6 preceding siblings ...)
2019-04-29 17:26 ` [PATCH 07/18] btrfs: add dax write support Goldwyn Rodrigues
@ 2019-04-29 17:26 ` Goldwyn Rodrigues
2019-05-21 17:46 ` Darrick J. Wong
2019-04-29 17:26 ` [PATCH 09/18] btrfs: Add dax specific address_space_operations Goldwyn Rodrigues
` (9 subsequent siblings)
17 siblings, 1 reply; 58+ messages in thread
From: Goldwyn Rodrigues @ 2019-04-29 17:26 UTC (permalink / raw)
To: linux-btrfs
Cc: kilobyte, linux-fsdevel, jack, david, willy, hch, darrick.wong,
dsterba, nborisov, linux-nvdimm, Goldwyn Rodrigues
From: Goldwyn Rodrigues <rgoldwyn@suse.com>
Change dax_iomap_pfn to return the address as well in order to
use it for performing a memcpy in case the type is IOMAP_DAX_COW.
We don't handle PMD because btrfs does not support hugepages.
Question:
The sequence of bdev_dax_pgoff() and dax_direct_access() is
used multiple times to calculate address and pfn's. Would it make
sense to call it while calculating address as well to reduce code?
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
fs/dax.c | 19 +++++++++++++++----
1 file changed, 15 insertions(+), 4 deletions(-)
diff --git a/fs/dax.c b/fs/dax.c
index 610bfa861a28..718b1632a39d 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -984,7 +984,7 @@ static sector_t dax_iomap_sector(struct iomap *iomap, loff_t pos)
}
static int dax_iomap_pfn(struct iomap *iomap, loff_t pos, size_t size,
- pfn_t *pfnp)
+ pfn_t *pfnp, void **addr)
{
const sector_t sector = dax_iomap_sector(iomap, pos);
pgoff_t pgoff;
@@ -996,7 +996,7 @@ static int dax_iomap_pfn(struct iomap *iomap, loff_t pos, size_t size,
return rc;
id = dax_read_lock();
length = dax_direct_access(iomap->dax_dev, pgoff, PHYS_PFN(size),
- NULL, pfnp);
+ addr, pfnp);
if (length < 0) {
rc = length;
goto out;
@@ -1286,6 +1286,7 @@ static vm_fault_t dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp,
XA_STATE(xas, &mapping->i_pages, vmf->pgoff);
struct inode *inode = mapping->host;
unsigned long vaddr = vmf->address;
+ void *addr;
loff_t pos = (loff_t)vmf->pgoff << PAGE_SHIFT;
struct iomap iomap = { 0 };
unsigned flags = IOMAP_FAULT;
@@ -1375,16 +1376,26 @@ static vm_fault_t dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp,
sync = dax_fault_is_synchronous(flags, vma, &iomap);
switch (iomap.type) {
+ case IOMAP_DAX_COW:
case IOMAP_MAPPED:
if (iomap.flags & IOMAP_F_NEW) {
count_vm_event(PGMAJFAULT);
count_memcg_event_mm(vma->vm_mm, PGMAJFAULT);
major = VM_FAULT_MAJOR;
}
- error = dax_iomap_pfn(&iomap, pos, PAGE_SIZE, &pfn);
+ error = dax_iomap_pfn(&iomap, pos, PAGE_SIZE, &pfn, &addr);
if (error < 0)
goto error_finish_iomap;
+ if (iomap.type == IOMAP_DAX_COW) {
+ if (iomap.inline_data) {
+ error = memcpy_mcsafe(addr, iomap.inline_data,
+ PAGE_SIZE);
+ if (error < 0)
+ goto error_finish_iomap;
+ } else
+ memset(addr, 0, PAGE_SIZE);
+ }
entry = dax_insert_entry(&xas, mapping, vmf, entry, pfn,
0, write && !sync);
@@ -1597,7 +1608,7 @@ static vm_fault_t dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp,
switch (iomap.type) {
case IOMAP_MAPPED:
- error = dax_iomap_pfn(&iomap, pos, PMD_SIZE, &pfn);
+ error = dax_iomap_pfn(&iomap, pos, PMD_SIZE, &pfn, NULL);
if (error < 0)
goto finish_iomap;
--
2.16.4
^ permalink raw reply related [flat|nested] 58+ messages in thread
* [PATCH 09/18] btrfs: Add dax specific address_space_operations
2019-04-29 17:26 [PATCH v4 00/18] btrfs dax support Goldwyn Rodrigues
` (7 preceding siblings ...)
2019-04-29 17:26 ` [PATCH 08/18] dax: memcpy page in case of IOMAP_DAX_COW for mmap faults Goldwyn Rodrigues
@ 2019-04-29 17:26 ` Goldwyn Rodrigues
2019-04-29 17:26 ` [PATCH 10/18] dax: replace mmap entry in case of CoW Goldwyn Rodrigues
` (8 subsequent siblings)
17 siblings, 0 replies; 58+ messages in thread
From: Goldwyn Rodrigues @ 2019-04-29 17:26 UTC (permalink / raw)
To: linux-btrfs
Cc: kilobyte, linux-fsdevel, jack, david, willy, hch, darrick.wong,
dsterba, nborisov, linux-nvdimm, Goldwyn Rodrigues
From: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
fs/btrfs/inode.c | 34 +++++++++++++++++++++++++++++++---
1 file changed, 31 insertions(+), 3 deletions(-)
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index af4b56cba104..05714ffc4894 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -28,6 +28,7 @@
#include <linux/magic.h>
#include <linux/iversion.h>
#include <linux/swap.h>
+#include <linux/dax.h>
#include <asm/unaligned.h>
#include "ctree.h"
#include "disk-io.h"
@@ -65,6 +66,7 @@ static const struct inode_operations btrfs_dir_ro_inode_operations;
static const struct inode_operations btrfs_special_inode_operations;
static const struct inode_operations btrfs_file_inode_operations;
static const struct address_space_operations btrfs_aops;
+static const struct address_space_operations btrfs_dax_aops;
static const struct file_operations btrfs_dir_file_operations;
static const struct extent_io_ops btrfs_extent_io_ops;
@@ -3757,7 +3759,10 @@ static int btrfs_read_locked_inode(struct inode *inode,
switch (inode->i_mode & S_IFMT) {
case S_IFREG:
- inode->i_mapping->a_ops = &btrfs_aops;
+ if (btrfs_test_opt(fs_info, DAX))
+ inode->i_mapping->a_ops = &btrfs_dax_aops;
+ else
+ inode->i_mapping->a_ops = &btrfs_aops;
BTRFS_I(inode)->io_tree.ops = &btrfs_extent_io_ops;
inode->i_fop = &btrfs_file_operations;
inode->i_op = &btrfs_file_inode_operations;
@@ -3778,6 +3783,7 @@ static int btrfs_read_locked_inode(struct inode *inode,
}
btrfs_sync_inode_flags_to_i_flags(inode);
+
return 0;
}
@@ -6538,7 +6544,10 @@ static int btrfs_create(struct inode *dir, struct dentry *dentry,
*/
inode->i_fop = &btrfs_file_operations;
inode->i_op = &btrfs_file_inode_operations;
- inode->i_mapping->a_ops = &btrfs_aops;
+ if (IS_DAX(inode) && S_ISREG(mode))
+ inode->i_mapping->a_ops = &btrfs_dax_aops;
+ else
+ inode->i_mapping->a_ops = &btrfs_aops;
err = btrfs_init_inode_security(trans, inode, dir, &dentry->d_name);
if (err)
@@ -8665,6 +8674,15 @@ static int btrfs_writepages(struct address_space *mapping,
return extent_writepages(mapping, wbc);
}
+static int btrfs_dax_writepages(struct address_space *mapping,
+ struct writeback_control *wbc)
+{
+ struct inode *inode = mapping->host;
+ struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
+ return dax_writeback_mapping_range(mapping, fs_info->fs_devices->latest_bdev,
+ wbc);
+}
+
static int
btrfs_readpages(struct file *file, struct address_space *mapping,
struct list_head *pages, unsigned nr_pages)
@@ -10436,7 +10454,10 @@ static int btrfs_tmpfile(struct inode *dir, struct dentry *dentry, umode_t mode)
inode->i_fop = &btrfs_file_operations;
inode->i_op = &btrfs_file_inode_operations;
- inode->i_mapping->a_ops = &btrfs_aops;
+ if (IS_DAX(inode))
+ inode->i_mapping->a_ops = &btrfs_dax_aops;
+ else
+ inode->i_mapping->a_ops = &btrfs_aops;
BTRFS_I(inode)->io_tree.ops = &btrfs_extent_io_ops;
ret = btrfs_init_inode_security(trans, inode, dir, NULL);
@@ -10892,6 +10913,13 @@ static const struct address_space_operations btrfs_aops = {
.swap_deactivate = btrfs_swap_deactivate,
};
+static const struct address_space_operations btrfs_dax_aops = {
+ .writepages = btrfs_dax_writepages,
+ .direct_IO = noop_direct_IO,
+ .set_page_dirty = noop_set_page_dirty,
+ .invalidatepage = noop_invalidatepage,
+};
+
static const struct inode_operations btrfs_file_inode_operations = {
.getattr = btrfs_getattr,
.setattr = btrfs_setattr,
--
2.16.4
^ permalink raw reply related [flat|nested] 58+ messages in thread
* [PATCH 10/18] dax: replace mmap entry in case of CoW
2019-04-29 17:26 [PATCH v4 00/18] btrfs dax support Goldwyn Rodrigues
` (8 preceding siblings ...)
2019-04-29 17:26 ` [PATCH 09/18] btrfs: Add dax specific address_space_operations Goldwyn Rodrigues
@ 2019-04-29 17:26 ` Goldwyn Rodrigues
2019-05-21 17:35 ` Darrick J. Wong
2019-05-23 13:38 ` Jan Kara
2019-04-29 17:26 ` [PATCH 11/18] btrfs: add dax mmap support Goldwyn Rodrigues
` (7 subsequent siblings)
17 siblings, 2 replies; 58+ messages in thread
From: Goldwyn Rodrigues @ 2019-04-29 17:26 UTC (permalink / raw)
To: linux-btrfs
Cc: kilobyte, linux-fsdevel, jack, david, willy, hch, darrick.wong,
dsterba, nborisov, linux-nvdimm, Goldwyn Rodrigues
From: Goldwyn Rodrigues <rgoldwyn@suse.com>
We replace the existing entry to the newly allocated one
in case of CoW. Also, we mark the entry as PAGECACHE_TAG_TOWRITE
so writeback marks this entry as writeprotected. This
helps us snapshots so new write pagefaults after snapshots
trigger a CoW.
btrfs does not support hugepages so we don't handle PMD.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
fs/dax.c | 36 ++++++++++++++++++++++++++++--------
1 file changed, 28 insertions(+), 8 deletions(-)
diff --git a/fs/dax.c b/fs/dax.c
index 718b1632a39d..07e8ff20161d 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -700,6 +700,9 @@ static int copy_user_dax(struct block_device *bdev, struct dax_device *dax_dev,
return 0;
}
+#define DAX_IF_DIRTY (1ULL << 0)
+#define DAX_IF_COW (1ULL << 1)
+
/*
* By this point grab_mapping_entry() has ensured that we have a locked entry
* of the appropriate size so we don't have to worry about downgrading PMDs to
@@ -709,14 +712,17 @@ static int copy_user_dax(struct block_device *bdev, struct dax_device *dax_dev,
*/
static void *dax_insert_entry(struct xa_state *xas,
struct address_space *mapping, struct vm_fault *vmf,
- void *entry, pfn_t pfn, unsigned long flags, bool dirty)
+ void *entry, pfn_t pfn, unsigned long flags,
+ unsigned long insert_flags)
{
void *new_entry = dax_make_entry(pfn, flags);
+ bool dirty = insert_flags & DAX_IF_DIRTY;
+ bool cow = insert_flags & DAX_IF_COW;
if (dirty)
__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
- if (dax_is_zero_entry(entry) && !(flags & DAX_ZERO_PAGE)) {
+ if (cow || (dax_is_zero_entry(entry) && !(flags & DAX_ZERO_PAGE))) {
unsigned long index = xas->xa_index;
/* we are replacing a zero page with block mapping */
if (dax_is_pmd_entry(entry))
@@ -728,12 +734,12 @@ static void *dax_insert_entry(struct xa_state *xas,
xas_reset(xas);
xas_lock_irq(xas);
- if (dax_entry_size(entry) != dax_entry_size(new_entry)) {
+ if (cow || (dax_entry_size(entry) != dax_entry_size(new_entry))) {
dax_disassociate_entry(entry, mapping, false);
dax_associate_entry(new_entry, mapping, vmf->vma, vmf->address);
}
- if (dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) {
+ if (cow || dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) {
/*
* Only swap our new entry into the page cache if the current
* entry is a zero page or an empty entry. If a normal PTE or
@@ -753,6 +759,9 @@ static void *dax_insert_entry(struct xa_state *xas,
if (dirty)
xas_set_mark(xas, PAGECACHE_TAG_DIRTY);
+ if (cow)
+ xas_set_mark(xas, PAGECACHE_TAG_TOWRITE);
+
xas_unlock_irq(xas);
return entry;
}
@@ -1032,7 +1041,7 @@ static vm_fault_t dax_load_hole(struct xa_state *xas,
vm_fault_t ret;
*entry = dax_insert_entry(xas, mapping, vmf, *entry, pfn,
- DAX_ZERO_PAGE, false);
+ DAX_ZERO_PAGE, 0);
ret = vmf_insert_mixed(vmf->vma, vaddr, pfn);
trace_dax_load_hole(inode, vmf, ret);
@@ -1296,6 +1305,7 @@ static vm_fault_t dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp,
vm_fault_t ret = 0;
void *entry;
pfn_t pfn;
+ unsigned long insert_flags = 0;
trace_dax_pte_fault(inode, vmf, ret);
/*
@@ -1357,6 +1367,8 @@ static vm_fault_t dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp,
error = copy_user_dax(iomap.bdev, iomap.dax_dev,
sector, PAGE_SIZE, vmf->cow_page, vaddr);
break;
+ case IOMAP_DAX_COW:
+ /* Should not be setting this - fallthrough */
default:
WARN_ON_ONCE(1);
error = -EIO;
@@ -1377,6 +1389,8 @@ static vm_fault_t dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp,
switch (iomap.type) {
case IOMAP_DAX_COW:
+ insert_flags |= DAX_IF_COW;
+ /* fallthrough */
case IOMAP_MAPPED:
if (iomap.flags & IOMAP_F_NEW) {
count_vm_event(PGMAJFAULT);
@@ -1396,8 +1410,10 @@ static vm_fault_t dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp,
} else
memset(addr, 0, PAGE_SIZE);
}
+ if (write && !sync)
+ insert_flags |= DAX_IF_DIRTY;
entry = dax_insert_entry(&xas, mapping, vmf, entry, pfn,
- 0, write && !sync);
+ 0, insert_flags);
/*
* If we are doing synchronous page fault and inode needs fsync,
@@ -1478,7 +1494,7 @@ static vm_fault_t dax_pmd_load_hole(struct xa_state *xas, struct vm_fault *vmf,
pfn = page_to_pfn_t(zero_page);
*entry = dax_insert_entry(xas, mapping, vmf, *entry, pfn,
- DAX_PMD | DAX_ZERO_PAGE, false);
+ DAX_PMD | DAX_ZERO_PAGE, 0);
if (arch_needs_pgtable_deposit()) {
pgtable = pte_alloc_one(vma->vm_mm);
@@ -1528,6 +1544,7 @@ static vm_fault_t dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp,
loff_t pos;
int error;
pfn_t pfn;
+ unsigned long insert_flags = 0;
/*
* Check whether offset isn't beyond end of file now. Caller is
@@ -1612,8 +1629,11 @@ static vm_fault_t dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp,
if (error < 0)
goto finish_iomap;
+ if (write && !sync)
+ insert_flags |= DAX_IF_DIRTY;
+
entry = dax_insert_entry(&xas, mapping, vmf, entry, pfn,
- DAX_PMD, write && !sync);
+ DAX_PMD, insert_flags);
/*
* If we are doing synchronous page fault and inode needs fsync,
--
2.16.4
^ permalink raw reply related [flat|nested] 58+ messages in thread
* [PATCH 11/18] btrfs: add dax mmap support
2019-04-29 17:26 [PATCH v4 00/18] btrfs dax support Goldwyn Rodrigues
` (9 preceding siblings ...)
2019-04-29 17:26 ` [PATCH 10/18] dax: replace mmap entry in case of CoW Goldwyn Rodrigues
@ 2019-04-29 17:26 ` Goldwyn Rodrigues
2019-04-29 17:26 ` [PATCH 12/18] btrfs: allow MAP_SYNC mmap Goldwyn Rodrigues
` (6 subsequent siblings)
17 siblings, 0 replies; 58+ messages in thread
From: Goldwyn Rodrigues @ 2019-04-29 17:26 UTC (permalink / raw)
To: linux-btrfs
Cc: kilobyte, linux-fsdevel, jack, david, willy, hch, darrick.wong,
dsterba, nborisov, linux-nvdimm, Goldwyn Rodrigues
From: Goldwyn Rodrigues <rgoldwyn@suse.com>
Add a new vm_operations struct btrfs_dax_vm_ops
specifically for dax files.
Since we will be removing(nulling) readpages/writepages for dax
return ENOEXEC only for non-dax files.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
fs/btrfs/ctree.h | 1 +
fs/btrfs/dax.c | 13 ++++++++++++-
fs/btrfs/file.c | 18 ++++++++++++++++--
3 files changed, 29 insertions(+), 3 deletions(-)
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index eec01eb92f33..2b7bdabb44f8 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3802,6 +3802,7 @@ int btree_readahead_hook(struct extent_buffer *eb, int err);
/* dax.c */
ssize_t btrfs_file_dax_read(struct kiocb *iocb, struct iov_iter *to);
ssize_t btrfs_file_dax_write(struct kiocb *iocb, struct iov_iter *from);
+vm_fault_t btrfs_dax_fault(struct vm_fault *vmf);
#else
static inline ssize_t btrfs_file_dax_write(struct kiocb *iocb, struct iov_iter *from)
{
diff --git a/fs/btrfs/dax.c b/fs/btrfs/dax.c
index f5cc9bcdbf14..de957d681e16 100644
--- a/fs/btrfs/dax.c
+++ b/fs/btrfs/dax.c
@@ -139,7 +139,7 @@ static int btrfs_iomap_begin(struct inode *inode, loff_t pos,
iomap->addr = em->block_start + diff;
/* Check if we really need to copy data from old extent */
- if (bi && !bi->nocow && (offset || pos + length < bi->end)) {
+ if (bi && !bi->nocow && (offset || pos + length < bi->end || flags & IOMAP_FAULT)) {
iomap->type = IOMAP_DAX_COW;
if (srcblk) {
sector_t sector = (srcblk + (pos & PAGE_MASK) -
@@ -216,4 +216,15 @@ ssize_t btrfs_file_dax_write(struct kiocb *iocb, struct iov_iter *iter)
}
return ret;
}
+
+vm_fault_t btrfs_dax_fault(struct vm_fault *vmf)
+{
+ vm_fault_t ret;
+ pfn_t pfn;
+ ret = dax_iomap_fault(vmf, PE_SIZE_PTE, &pfn, NULL, &btrfs_iomap_ops);
+ if (ret & VM_FAULT_NEEDDSYNC)
+ ret = dax_finish_sync_fault(vmf, PE_SIZE_PTE, pfn);
+
+ return ret;
+}
#endif /* CONFIG_FS_DAX */
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index a795023e26ca..9d5a3c99a6b9 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -2214,15 +2214,29 @@ static const struct vm_operations_struct btrfs_file_vm_ops = {
.page_mkwrite = btrfs_page_mkwrite,
};
+#ifdef CONFIG_FS_DAX
+static const struct vm_operations_struct btrfs_dax_vm_ops = {
+ .fault = btrfs_dax_fault,
+ .page_mkwrite = btrfs_dax_fault,
+ .pfn_mkwrite = btrfs_dax_fault,
+};
+#else
+#define btrfs_dax_vm_ops btrfs_file_vm_ops
+#endif
+
static int btrfs_file_mmap(struct file *filp, struct vm_area_struct *vma)
{
struct address_space *mapping = filp->f_mapping;
+ struct inode *inode = file_inode(filp);
- if (!mapping->a_ops->readpage)
+ if (!IS_DAX(inode) && !mapping->a_ops->readpage)
return -ENOEXEC;
file_accessed(filp);
- vma->vm_ops = &btrfs_file_vm_ops;
+ if (IS_DAX(inode))
+ vma->vm_ops = &btrfs_dax_vm_ops;
+ else
+ vma->vm_ops = &btrfs_file_vm_ops;
return 0;
}
--
2.16.4
^ permalink raw reply related [flat|nested] 58+ messages in thread
* [PATCH 12/18] btrfs: allow MAP_SYNC mmap
2019-04-29 17:26 [PATCH v4 00/18] btrfs dax support Goldwyn Rodrigues
` (10 preceding siblings ...)
2019-04-29 17:26 ` [PATCH 11/18] btrfs: add dax mmap support Goldwyn Rodrigues
@ 2019-04-29 17:26 ` Goldwyn Rodrigues
2019-05-10 15:32 ` [PATCH for-goldwyn] btrfs: disallow MAP_SYNC outside of DAX mounts Adam Borowski
2019-05-23 13:44 ` [PATCH 12/18] btrfs: allow MAP_SYNC mmap Jan Kara
2019-04-29 17:26 ` [PATCH 13/18] fs: dedup file range to use a compare function Goldwyn Rodrigues
` (5 subsequent siblings)
17 siblings, 2 replies; 58+ messages in thread
From: Goldwyn Rodrigues @ 2019-04-29 17:26 UTC (permalink / raw)
To: linux-btrfs
Cc: kilobyte, linux-fsdevel, jack, david, willy, hch, darrick.wong,
dsterba, nborisov, linux-nvdimm, Goldwyn Rodrigues
From: Adam Borowski <kilobyte@angband.pl>
Used by userspace to detect DAX.
[rgoldwyn@suse.com: Added CONFIG_FS_DAX around mmap_supported_flags]
Signed-off-by: Adam Borowski <kilobyte@angband.pl>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
fs/btrfs/file.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 9d5a3c99a6b9..362a9cf9dcb2 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -16,6 +16,7 @@
#include <linux/btrfs.h>
#include <linux/uio.h>
#include <linux/iversion.h>
+#include <linux/mman.h>
#include "ctree.h"
#include "disk-io.h"
#include "transaction.h"
@@ -3319,6 +3320,9 @@ const struct file_operations btrfs_file_operations = {
.splice_read = generic_file_splice_read,
.write_iter = btrfs_file_write_iter,
.mmap = btrfs_file_mmap,
+#ifdef CONFIG_FS_DAX
+ .mmap_supported_flags = MAP_SYNC,
+#endif
.open = btrfs_file_open,
.release = btrfs_release_file,
.fsync = btrfs_sync_file,
--
2.16.4
^ permalink raw reply related [flat|nested] 58+ messages in thread
* [PATCH 13/18] fs: dedup file range to use a compare function
2019-04-29 17:26 [PATCH v4 00/18] btrfs dax support Goldwyn Rodrigues
` (11 preceding siblings ...)
2019-04-29 17:26 ` [PATCH 12/18] btrfs: allow MAP_SYNC mmap Goldwyn Rodrigues
@ 2019-04-29 17:26 ` Goldwyn Rodrigues
2019-05-21 18:17 ` Darrick J. Wong
2019-04-29 17:26 ` [PATCH 14/18] dax: memcpy before zeroing range Goldwyn Rodrigues
` (4 subsequent siblings)
17 siblings, 1 reply; 58+ messages in thread
From: Goldwyn Rodrigues @ 2019-04-29 17:26 UTC (permalink / raw)
To: linux-btrfs
Cc: kilobyte, linux-fsdevel, jack, david, willy, hch, darrick.wong,
dsterba, nborisov, linux-nvdimm, Goldwyn Rodrigues
From: Goldwyn Rodrigues <rgoldwyn@suse.com>
With dax we cannot deal with readpage() etc. So, we create a
funciton callback to perform the file data comparison and pass
it to generic_remap_file_range_prep() so it can use iomap-based
functions.
This may not be the best way to solve this. Suggestions welcome.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
fs/btrfs/ctree.h | 9 ++++++++
fs/btrfs/dax.c | 8 +++++++
fs/btrfs/ioctl.c | 11 +++++++--
fs/dax.c | 65 ++++++++++++++++++++++++++++++++++++++++++++++++++++
fs/ocfs2/file.c | 2 +-
fs/read_write.c | 11 +++++----
fs/xfs/xfs_reflink.c | 2 +-
include/linux/dax.h | 4 ++++
include/linux/fs.h | 8 ++++++-
9 files changed, 110 insertions(+), 10 deletions(-)
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 2b7bdabb44f8..d3d044125619 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3803,11 +3803,20 @@ int btree_readahead_hook(struct extent_buffer *eb, int err);
ssize_t btrfs_file_dax_read(struct kiocb *iocb, struct iov_iter *to);
ssize_t btrfs_file_dax_write(struct kiocb *iocb, struct iov_iter *from);
vm_fault_t btrfs_dax_fault(struct vm_fault *vmf);
+int btrfs_dax_file_range_compare(struct inode *src, loff_t srcoff,
+ struct inode *dest, loff_t destoff, loff_t len,
+ bool *is_same);
#else
static inline ssize_t btrfs_file_dax_write(struct kiocb *iocb, struct iov_iter *from)
{
return 0;
}
+static inline int btrfs_dax_file_range_compare(struct inode *src, loff_t srcoff,
+ struct inode *dest, loff_t destoff, loff_t len,
+ bool *is_same)
+{
+ return 0;
+}
#endif /* CONFIG_FS_DAX */
static inline int is_fstree(u64 rootid)
diff --git a/fs/btrfs/dax.c b/fs/btrfs/dax.c
index de957d681e16..af64696a5337 100644
--- a/fs/btrfs/dax.c
+++ b/fs/btrfs/dax.c
@@ -227,4 +227,12 @@ vm_fault_t btrfs_dax_fault(struct vm_fault *vmf)
return ret;
}
+
+int btrfs_dax_file_range_compare(struct inode *src, loff_t srcoff,
+ struct inode *dest, loff_t destoff, loff_t len,
+ bool *is_same)
+{
+ return dax_file_range_compare(src, srcoff, dest, destoff, len,
+ is_same, &btrfs_iomap_ops);
+}
#endif /* CONFIG_FS_DAX */
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 0138119cd9a3..5ebb52848d5a 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -3939,6 +3939,7 @@ static int btrfs_remap_file_range_prep(struct file *file_in, loff_t pos_in,
bool same_inode = inode_out == inode_in;
u64 wb_len;
int ret;
+ compare_range_t cmp;
if (!(remap_flags & REMAP_FILE_DEDUP)) {
struct btrfs_root *root_out = BTRFS_I(inode_out)->root;
@@ -4000,8 +4001,14 @@ static int btrfs_remap_file_range_prep(struct file *file_in, loff_t pos_in,
if (ret < 0)
goto out_unlock;
- ret = generic_remap_file_range_prep(file_in, pos_in, file_out, pos_out,
- len, remap_flags);
+ if (IS_DAX(file_inode(file_in)) && IS_DAX(file_inode(file_out)))
+ cmp = btrfs_dax_file_range_compare;
+ else
+ cmp = vfs_dedupe_file_range_compare;
+
+ ret = generic_remap_file_range_prep(file_in, pos_in, file_out,
+ pos_out, len, remap_flags, cmp);
+
if (ret < 0 || *len == 0)
goto out_unlock;
diff --git a/fs/dax.c b/fs/dax.c
index 07e8ff20161d..fa9ccbad7c03 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -39,6 +39,8 @@
#define CREATE_TRACE_POINTS
#include <trace/events/fs_dax.h>
+#define MIN(a, b) (((a) < (b)) ? (a) : (b))
+
static inline unsigned int pe_order(enum page_entry_size pe_size)
{
if (pe_size == PE_SIZE_PTE)
@@ -1795,3 +1797,66 @@ vm_fault_t dax_finish_sync_fault(struct vm_fault *vmf,
return dax_insert_pfn_mkwrite(vmf, pfn, order);
}
EXPORT_SYMBOL_GPL(dax_finish_sync_fault);
+
+static inline void *iomap_address(struct iomap *iomap, loff_t off, loff_t len)
+{
+ loff_t start;
+ void *addr;
+ start = (get_start_sect(iomap->bdev) << 9) + iomap->addr +
+ (off - iomap->offset);
+ dax_direct_access(iomap->dax_dev, PHYS_PFN(start), PHYS_PFN(len),
+ &addr, NULL);
+ return addr;
+}
+
+int dax_file_range_compare(struct inode *src, loff_t srcoff, struct inode *dest,
+ loff_t destoff, loff_t len, bool *is_same, const struct iomap_ops *ops)
+{
+ void *saddr, *daddr;
+ struct iomap s_iomap = {0};
+ struct iomap d_iomap = {0};
+ bool same = true;
+ loff_t cmp_len;
+ int id, ret = 0;
+
+ id = dax_read_lock();
+ while (len) {
+ ret = ops->iomap_begin(src, srcoff, len, 0, &s_iomap);
+ if (ret < 0) {
+ if (ops->iomap_end)
+ ops->iomap_end(src, srcoff, len, ret, 0, &s_iomap);
+ return ret;
+ }
+ cmp_len = len;
+ cmp_len = MIN(len, s_iomap.offset + s_iomap.length - srcoff);
+
+ ret = ops->iomap_begin(dest, destoff, cmp_len, 0, &d_iomap);
+ if (ret < 0) {
+ if (ops->iomap_end) {
+ ops->iomap_end(src, srcoff, len, ret, 0, &s_iomap);
+ ops->iomap_end(dest, destoff, len, ret, 0, &d_iomap);
+ }
+ return ret;
+ }
+ cmp_len = MIN(cmp_len, d_iomap.offset + d_iomap.length - destoff);
+
+ saddr = iomap_address(&s_iomap, srcoff, cmp_len);
+ daddr = iomap_address(&d_iomap, destoff, cmp_len);
+
+ same = !memcmp(saddr, daddr, cmp_len);
+ if (!same)
+ break;
+ len -= cmp_len;
+ srcoff += cmp_len;
+ destoff += cmp_len;
+
+ if (ops->iomap_end) {
+ ret = ops->iomap_end(src, srcoff, len, 0, 0, &s_iomap);
+ ret = ops->iomap_end(dest, destoff, len, 0, 0, &d_iomap);
+ }
+ }
+ dax_read_unlock(id);
+ *is_same = same;
+ return ret;
+}
+EXPORT_SYMBOL_GPL(dax_file_range_compare);
diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
index d640c5f8a85d..9d470306cfc3 100644
--- a/fs/ocfs2/file.c
+++ b/fs/ocfs2/file.c
@@ -2558,7 +2558,7 @@ static loff_t ocfs2_remap_file_range(struct file *file_in, loff_t pos_in,
goto out_unlock;
ret = generic_remap_file_range_prep(file_in, pos_in, file_out, pos_out,
- &len, remap_flags);
+ &len, remap_flags, vfs_dedupe_file_range_compare);
if (ret < 0 || len == 0)
goto out_unlock;
diff --git a/fs/read_write.c b/fs/read_write.c
index 61b43ad7608e..c6283802ef1c 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1778,7 +1778,7 @@ static struct page *vfs_dedupe_get_page(struct inode *inode, loff_t offset)
* Compare extents of two files to see if they are the same.
* Caller must have locked both inodes to prevent write races.
*/
-static int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
+int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
struct inode *dest, loff_t destoff,
loff_t len, bool *is_same)
{
@@ -1845,6 +1845,7 @@ static int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
out_error:
return error;
}
+EXPORT_SYMBOL_GPL(vfs_dedupe_file_range_compare);
/*
* Check that the two inodes are eligible for cloning, the ranges make
@@ -1856,7 +1857,8 @@ static int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
*/
int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
struct file *file_out, loff_t pos_out,
- loff_t *len, unsigned int remap_flags)
+ loff_t *len, unsigned int remap_flags,
+ compare_range_t compare)
{
struct inode *inode_in = file_inode(file_in);
struct inode *inode_out = file_inode(file_out);
@@ -1915,9 +1917,8 @@ int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
*/
if (remap_flags & REMAP_FILE_DEDUP) {
bool is_same = false;
-
- ret = vfs_dedupe_file_range_compare(inode_in, pos_in,
- inode_out, pos_out, *len, &is_same);
+ ret = (*compare)(inode_in, pos_in,
+ inode_out, pos_out, *len, &is_same);
if (ret)
return ret;
if (!is_same)
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index 680ae7662a78..68e4257cebb0 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -1350,7 +1350,7 @@ xfs_reflink_remap_prep(
goto out_unlock;
ret = generic_remap_file_range_prep(file_in, pos_in, file_out, pos_out,
- len, remap_flags);
+ len, remap_flags, vfs_dedupe_file_range_compare);
if (ret < 0 || *len == 0)
goto out_unlock;
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 0dd316a74a29..1370d39c91b6 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -157,6 +157,10 @@ vm_fault_t dax_finish_sync_fault(struct vm_fault *vmf,
int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index);
int dax_invalidate_mapping_entry_sync(struct address_space *mapping,
pgoff_t index);
+int dax_file_range_compare(struct inode *src, loff_t srcoff,
+ struct inode *dest, loff_t destoff,
+ loff_t len, bool *is_same,
+ const struct iomap_ops *ops);
#ifdef CONFIG_FS_DAX
int __dax_zero_page_range(struct block_device *bdev,
diff --git a/include/linux/fs.h b/include/linux/fs.h
index dd28e7679089..0224503e42ce 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1883,10 +1883,16 @@ extern ssize_t vfs_readv(struct file *, const struct iovec __user *,
unsigned long, loff_t *, rwf_t);
extern ssize_t vfs_copy_file_range(struct file *, loff_t , struct file *,
loff_t, size_t, unsigned int);
+extern int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
+ struct inode *dest, loff_t destoff,
+ loff_t len, bool *is_same);
+typedef int (*compare_range_t)(struct inode *src, loff_t srcpos,
+ struct inode *dest, loff_t destpos,
+ loff_t len, bool *is_same);
extern int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
struct file *file_out, loff_t pos_out,
loff_t *count,
- unsigned int remap_flags);
+ unsigned int remap_flags, compare_range_t cmp);
extern loff_t do_clone_file_range(struct file *file_in, loff_t pos_in,
struct file *file_out, loff_t pos_out,
loff_t len, unsigned int remap_flags);
--
2.16.4
^ permalink raw reply related [flat|nested] 58+ messages in thread
* [PATCH 14/18] dax: memcpy before zeroing range
2019-04-29 17:26 [PATCH v4 00/18] btrfs dax support Goldwyn Rodrigues
` (12 preceding siblings ...)
2019-04-29 17:26 ` [PATCH 13/18] fs: dedup file range to use a compare function Goldwyn Rodrigues
@ 2019-04-29 17:26 ` Goldwyn Rodrigues
2019-05-21 17:27 ` Darrick J. Wong
2019-04-29 17:26 ` [PATCH 15/18] btrfs: handle dax page zeroing Goldwyn Rodrigues
` (3 subsequent siblings)
17 siblings, 1 reply; 58+ messages in thread
From: Goldwyn Rodrigues @ 2019-04-29 17:26 UTC (permalink / raw)
To: linux-btrfs
Cc: kilobyte, linux-fsdevel, jack, david, willy, hch, darrick.wong,
dsterba, nborisov, linux-nvdimm, Goldwyn Rodrigues
From: Goldwyn Rodrigues <rgoldwyn@suse.com>
However, this needed more iomap fields, so it was easier
to pass iomap and compute inside the function rather
than passing a log of arguments.
Note, there is subtle difference between iomap_sector and
dax_iomap_sector(). Can we replace dax_iomap_sector with
iomap_sector()? It would need pos & PAGE_MASK though or else
bdev_dax_pgoff() return -EINVAL.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
fs/dax.c | 17 ++++++++++++-----
fs/iomap.c | 9 +--------
include/linux/dax.h | 11 +++++------
include/linux/iomap.h | 6 ++++++
4 files changed, 24 insertions(+), 19 deletions(-)
diff --git a/fs/dax.c b/fs/dax.c
index fa9ccbad7c03..82a08b0eec23 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1063,11 +1063,16 @@ static bool dax_range_is_aligned(struct block_device *bdev,
return true;
}
-int __dax_zero_page_range(struct block_device *bdev,
- struct dax_device *dax_dev, sector_t sector,
- unsigned int offset, unsigned int size)
+int __dax_zero_page_range(struct iomap *iomap, loff_t pos,
+ unsigned int offset, unsigned int size)
{
- if (dax_range_is_aligned(bdev, offset, size)) {
+ sector_t sector = dax_iomap_sector(iomap, pos & PAGE_MASK);
+ struct block_device *bdev = iomap->bdev;
+ struct dax_device *dax_dev = iomap->dax_dev;
+ int ret = 0;
+
+ if (!(iomap->type == IOMAP_DAX_COW) &&
+ dax_range_is_aligned(bdev, offset, size)) {
sector_t start_sector = sector + (offset >> 9);
return blkdev_issue_zeroout(bdev, start_sector,
@@ -1087,11 +1092,13 @@ int __dax_zero_page_range(struct block_device *bdev,
dax_read_unlock(id);
return rc;
}
+ if (iomap->type == IOMAP_DAX_COW)
+ ret = memcpy_mcsafe(kaddr, iomap->inline_data, offset);
memset(kaddr + offset, 0, size);
dax_flush(dax_dev, kaddr + offset, size);
dax_read_unlock(id);
}
- return 0;
+ return ret;
}
EXPORT_SYMBOL_GPL(__dax_zero_page_range);
diff --git a/fs/iomap.c b/fs/iomap.c
index abdd18e404f8..90698c854883 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -98,12 +98,6 @@ iomap_apply(struct inode *inode, loff_t pos, loff_t length, unsigned flags,
return written ? written : ret;
}
-static sector_t
-iomap_sector(struct iomap *iomap, loff_t pos)
-{
- return (iomap->addr + pos - iomap->offset) >> SECTOR_SHIFT;
-}
-
static struct iomap_page *
iomap_page_create(struct inode *inode, struct page *page)
{
@@ -990,8 +984,7 @@ static int iomap_zero(struct inode *inode, loff_t pos, unsigned offset,
static int iomap_dax_zero(loff_t pos, unsigned offset, unsigned bytes,
struct iomap *iomap)
{
- return __dax_zero_page_range(iomap->bdev, iomap->dax_dev,
- iomap_sector(iomap, pos & PAGE_MASK), offset, bytes);
+ return __dax_zero_page_range(iomap, pos, offset, bytes);
}
static loff_t
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 1370d39c91b6..c469d9ff54b4 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -9,6 +9,7 @@
typedef unsigned long dax_entry_t;
+struct iomap;
struct iomap_ops;
struct dax_device;
struct dax_operations {
@@ -163,13 +164,11 @@ int dax_file_range_compare(struct inode *src, loff_t srcoff,
const struct iomap_ops *ops);
#ifdef CONFIG_FS_DAX
-int __dax_zero_page_range(struct block_device *bdev,
- struct dax_device *dax_dev, sector_t sector,
- unsigned int offset, unsigned int length);
+int __dax_zero_page_range(struct iomap *iomap, loff_t pos,
+ unsigned int offset, unsigned int size);
#else
-static inline int __dax_zero_page_range(struct block_device *bdev,
- struct dax_device *dax_dev, sector_t sector,
- unsigned int offset, unsigned int length)
+static inline int __dax_zero_page_range(struct iomap *iomap, loff_t pos,
+ unsigned int offset, unsigned int size)
{
return -ENXIO;
}
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 6e885c5a38a3..fcfce269db3e 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -7,6 +7,7 @@
#include <linux/mm.h>
#include <linux/types.h>
#include <linux/mm_types.h>
+#include <linux/blkdev.h>
struct address_space;
struct fiemap_extent_info;
@@ -120,6 +121,11 @@ static inline struct iomap_page *to_iomap_page(struct page *page)
return NULL;
}
+static inline sector_t iomap_sector(struct iomap *iomap, loff_t pos)
+{
+ return (iomap->addr + pos - iomap->offset) >> SECTOR_SHIFT;
+}
+
ssize_t iomap_file_buffered_write(struct kiocb *iocb, struct iov_iter *from,
const struct iomap_ops *ops);
int iomap_readpage(struct page *page, const struct iomap_ops *ops);
--
2.16.4
^ permalink raw reply related [flat|nested] 58+ messages in thread
* [PATCH 15/18] btrfs: handle dax page zeroing
2019-04-29 17:26 [PATCH v4 00/18] btrfs dax support Goldwyn Rodrigues
` (13 preceding siblings ...)
2019-04-29 17:26 ` [PATCH 14/18] dax: memcpy before zeroing range Goldwyn Rodrigues
@ 2019-04-29 17:26 ` Goldwyn Rodrigues
2019-04-29 17:26 ` [PATCH 16/18] btrfs: Writeprotect mmap pages on snapshot Goldwyn Rodrigues
` (2 subsequent siblings)
17 siblings, 0 replies; 58+ messages in thread
From: Goldwyn Rodrigues @ 2019-04-29 17:26 UTC (permalink / raw)
To: linux-btrfs
Cc: kilobyte, linux-fsdevel, jack, david, willy, hch, darrick.wong,
dsterba, nborisov, linux-nvdimm, Goldwyn Rodrigues
From: Goldwyn Rodrigues <rgoldwyn@suse.com>
btrfs_dax_zero_block() zeros part of the page, either from the
front or the regular rest of the block.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
fs/btrfs/ctree.h | 1 +
fs/btrfs/dax.c | 27 ++++++++++++++++++++++++++-
fs/btrfs/inode.c | 4 ++++
3 files changed, 31 insertions(+), 1 deletion(-)
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index d3d044125619..ee1ed18f8b3c 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3806,6 +3806,7 @@ vm_fault_t btrfs_dax_fault(struct vm_fault *vmf);
int btrfs_dax_file_range_compare(struct inode *src, loff_t srcoff,
struct inode *dest, loff_t destoff, loff_t len,
bool *is_same);
+int btrfs_dax_zero_block(struct inode *inode, loff_t from, loff_t len, bool front);
#else
static inline ssize_t btrfs_file_dax_write(struct kiocb *iocb, struct iov_iter *from)
{
diff --git a/fs/btrfs/dax.c b/fs/btrfs/dax.c
index af64696a5337..bf2ddac5b5a1 100644
--- a/fs/btrfs/dax.c
+++ b/fs/btrfs/dax.c
@@ -132,7 +132,8 @@ static int btrfs_iomap_begin(struct inode *inode, loff_t pos,
* This will be true for reads only since we have already
* allocated em
*/
- if (em->block_start == EXTENT_MAP_HOLE) {
+ if (em->block_start == EXTENT_MAP_HOLE ||
+ em->flags == EXTENT_FLAG_FILLING) {
iomap->type = IOMAP_HOLE;
return 0;
}
@@ -235,4 +236,28 @@ int btrfs_dax_file_range_compare(struct inode *src, loff_t srcoff,
return dax_file_range_compare(src, srcoff, dest, destoff, len,
is_same, &btrfs_iomap_ops);
}
+
+/*
+ * zero a part of the page only. This should CoW (via iomap_begin) if required
+ */
+int btrfs_dax_zero_block(struct inode *inode, loff_t from, loff_t len, bool front)
+{
+ loff_t start = round_down(from, PAGE_SIZE);
+ loff_t end = round_up(from, PAGE_SIZE);
+ loff_t offset = from;
+ int ret = 0;
+
+ if (front) {
+ len = from - start;
+ offset = start;
+ } else {
+ if (!len)
+ len = end - from;
+ }
+
+ if (len)
+ ret = iomap_zero_range(inode, offset, len, NULL, &btrfs_iomap_ops);
+
+ return (ret < 0) ? ret : 0;
+}
#endif /* CONFIG_FS_DAX */
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 05714ffc4894..7e88280a2c3b 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -4833,6 +4833,10 @@ int btrfs_truncate_block(struct inode *inode, loff_t from, loff_t len,
(!len || IS_ALIGNED(len, blocksize)))
goto out;
+#ifdef CONFIG_FS_DAX
+ if (IS_DAX(inode))
+ return btrfs_dax_zero_block(inode, from, len, front);
+#endif
block_start = round_down(from, blocksize);
block_end = block_start + blocksize - 1;
--
2.16.4
^ permalink raw reply related [flat|nested] 58+ messages in thread
* [PATCH 16/18] btrfs: Writeprotect mmap pages on snapshot
2019-04-29 17:26 [PATCH v4 00/18] btrfs dax support Goldwyn Rodrigues
` (14 preceding siblings ...)
2019-04-29 17:26 ` [PATCH 15/18] btrfs: handle dax page zeroing Goldwyn Rodrigues
@ 2019-04-29 17:26 ` Goldwyn Rodrigues
2019-05-23 14:04 ` Jan Kara
2019-04-29 17:26 ` [PATCH 17/18] btrfs: Disable dax-based defrag and send Goldwyn Rodrigues
2019-04-29 17:26 ` [PATCH 18/18] btrfs: trace functions for btrfs_iomap_begin/end Goldwyn Rodrigues
17 siblings, 1 reply; 58+ messages in thread
From: Goldwyn Rodrigues @ 2019-04-29 17:26 UTC (permalink / raw)
To: linux-btrfs
Cc: kilobyte, linux-fsdevel, jack, david, willy, hch, darrick.wong,
dsterba, nborisov, linux-nvdimm, Goldwyn Rodrigues
From: Goldwyn Rodrigues <rgoldwyn@suse.com>
Inorder to make sure mmap'd files don't change after snapshot,
writeprotect the mmap pages on snapshot. This is done by performing
a data writeback on the pages (which simply mark the pages are
wrprotected). This way if the user process tries to access the memory
we will get another fault and we can perform a CoW.
In order to accomplish this, we tag all CoW pages as
PAGECACHE_TAG_TOWRITE, and add the mmapd inode in delalloc_inodes.
During snapshot, it starts writeback of all delalloc'd inodes and
here we perform a data writeback. We don't want to keep the inodes
in delalloc_inodes until it umount (WARN_ON), so we remove it
during inode evictions.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
fs/btrfs/ctree.h | 3 ++-
fs/btrfs/dax.c | 7 +++++++
fs/btrfs/inode.c | 13 ++++++++++++-
3 files changed, 21 insertions(+), 2 deletions(-)
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index ee1ed18f8b3c..d1b70f24adeb 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3252,7 +3252,8 @@ int btrfs_create_subvol_root(struct btrfs_trans_handle *trans,
struct btrfs_root *new_root,
struct btrfs_root *parent_root,
u64 new_dirid);
- void btrfs_set_delalloc_extent(struct inode *inode, struct extent_state *state,
+void btrfs_add_delalloc_inodes(struct btrfs_root *root, struct inode *inode);
+void btrfs_set_delalloc_extent(struct inode *inode, struct extent_state *state,
unsigned *bits);
void btrfs_clear_delalloc_extent(struct inode *inode,
struct extent_state *state, unsigned *bits);
diff --git a/fs/btrfs/dax.c b/fs/btrfs/dax.c
index bf2ddac5b5a1..20ec2ec49c68 100644
--- a/fs/btrfs/dax.c
+++ b/fs/btrfs/dax.c
@@ -222,10 +222,17 @@ vm_fault_t btrfs_dax_fault(struct vm_fault *vmf)
{
vm_fault_t ret;
pfn_t pfn;
+ struct inode *inode = file_inode(vmf->vma->vm_file);
+ struct btrfs_inode *binode = BTRFS_I(inode);
ret = dax_iomap_fault(vmf, PE_SIZE_PTE, &pfn, NULL, &btrfs_iomap_ops);
if (ret & VM_FAULT_NEEDDSYNC)
ret = dax_finish_sync_fault(vmf, PE_SIZE_PTE, pfn);
+ /* Insert into delalloc so we get writeback calls on snapshots */
+ if (vmf->flags & FAULT_FLAG_WRITE &&
+ !test_bit(BTRFS_INODE_IN_DELALLOC_LIST, &binode->runtime_flags))
+ btrfs_add_delalloc_inodes(binode->root, inode);
+
return ret;
}
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 7e88280a2c3b..e98fb512e1ca 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -1713,7 +1713,7 @@ void btrfs_merge_delalloc_extent(struct inode *inode, struct extent_state *new,
spin_unlock(&BTRFS_I(inode)->lock);
}
-static void btrfs_add_delalloc_inodes(struct btrfs_root *root,
+void btrfs_add_delalloc_inodes(struct btrfs_root *root,
struct inode *inode)
{
struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
@@ -5358,12 +5358,17 @@ void btrfs_evict_inode(struct inode *inode)
{
struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
struct btrfs_trans_handle *trans;
+ struct btrfs_inode *binode = BTRFS_I(inode);
struct btrfs_root *root = BTRFS_I(inode)->root;
struct btrfs_block_rsv *rsv;
int ret;
trace_btrfs_inode_evict(inode);
+ if (IS_DAX(inode)
+ && test_bit(BTRFS_INODE_IN_DELALLOC_LIST, &binode->runtime_flags))
+ btrfs_del_delalloc_inode(root, binode);
+
if (!root) {
clear_inode(inode);
return;
@@ -8683,6 +8688,10 @@ static int btrfs_dax_writepages(struct address_space *mapping,
{
struct inode *inode = mapping->host;
struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
+ struct btrfs_inode *binode = BTRFS_I(inode);
+ if ((wbc->sync_mode == WB_SYNC_ALL) &&
+ test_bit(BTRFS_INODE_IN_DELALLOC_LIST, &binode->runtime_flags))
+ btrfs_del_delalloc_inode(binode->root, binode);
return dax_writeback_mapping_range(mapping, fs_info->fs_devices->latest_bdev,
wbc);
}
@@ -9981,6 +9990,8 @@ static void btrfs_run_delalloc_work(struct btrfs_work *work)
delalloc_work = container_of(work, struct btrfs_delalloc_work,
work);
inode = delalloc_work->inode;
+ if (IS_DAX(inode))
+ filemap_fdatawrite(inode->i_mapping);
filemap_flush(inode->i_mapping);
if (test_bit(BTRFS_INODE_HAS_ASYNC_EXTENT,
&BTRFS_I(inode)->runtime_flags))
--
2.16.4
^ permalink raw reply related [flat|nested] 58+ messages in thread
* [PATCH 17/18] btrfs: Disable dax-based defrag and send
2019-04-29 17:26 [PATCH v4 00/18] btrfs dax support Goldwyn Rodrigues
` (15 preceding siblings ...)
2019-04-29 17:26 ` [PATCH 16/18] btrfs: Writeprotect mmap pages on snapshot Goldwyn Rodrigues
@ 2019-04-29 17:26 ` Goldwyn Rodrigues
2019-04-29 17:26 ` [PATCH 18/18] btrfs: trace functions for btrfs_iomap_begin/end Goldwyn Rodrigues
17 siblings, 0 replies; 58+ messages in thread
From: Goldwyn Rodrigues @ 2019-04-29 17:26 UTC (permalink / raw)
To: linux-btrfs
Cc: kilobyte, linux-fsdevel, jack, david, willy, hch, darrick.wong,
dsterba, nborisov, linux-nvdimm, Goldwyn Rodrigues
From: Goldwyn Rodrigues <rgoldwyn@suse.com>
This is temporary, and a TODO.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
fs/btrfs/ioctl.c | 13 +++++++++++++
fs/btrfs/send.c | 4 ++++
2 files changed, 17 insertions(+)
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 5ebb52848d5a..2470f90c9983 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -2990,6 +2990,12 @@ static int btrfs_ioctl_defrag(struct file *file, void __user *argp)
goto out;
}
+ if (IS_DAX(inode)) {
+ btrfs_warn(root->fs_info, "File defrag is not supported with DAX");
+ ret = -EOPNOTSUPP;
+ goto out;
+ }
+
if (argp) {
if (copy_from_user(range, argp,
sizeof(*range))) {
@@ -4653,6 +4659,10 @@ static long btrfs_ioctl_balance(struct file *file, void __user *arg)
if (!capable(CAP_SYS_ADMIN))
return -EPERM;
+ /* send can be on a directory, so check super block instead */
+ if (btrfs_test_opt(fs_info, DAX))
+ return -EOPNOTSUPP;
+
ret = mnt_want_write_file(file);
if (ret)
return ret;
@@ -5505,6 +5515,9 @@ static int _btrfs_ioctl_send(struct file *file, void __user *argp, bool compat)
struct btrfs_ioctl_send_args *arg;
int ret;
+ if (IS_DAX(file_inode(file)))
+ return -EOPNOTSUPP;
+
if (compat) {
#if defined(CONFIG_64BIT) && defined(CONFIG_COMPAT)
struct btrfs_ioctl_send_args_32 args32;
diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
index 7ea2d6b1f170..9679fd54db86 100644
--- a/fs/btrfs/send.c
+++ b/fs/btrfs/send.c
@@ -6609,6 +6609,10 @@ long btrfs_ioctl_send(struct file *mnt_file, struct btrfs_ioctl_send_args *arg)
int sort_clone_roots = 0;
int index;
+ /* send can be on a directory, so check super block instead */
+ if (btrfs_test_opt(fs_info, DAX))
+ return -EOPNOTSUPP;
+
if (!capable(CAP_SYS_ADMIN))
return -EPERM;
--
2.16.4
^ permalink raw reply related [flat|nested] 58+ messages in thread
* [PATCH 18/18] btrfs: trace functions for btrfs_iomap_begin/end
2019-04-29 17:26 [PATCH v4 00/18] btrfs dax support Goldwyn Rodrigues
` (16 preceding siblings ...)
2019-04-29 17:26 ` [PATCH 17/18] btrfs: Disable dax-based defrag and send Goldwyn Rodrigues
@ 2019-04-29 17:26 ` Goldwyn Rodrigues
17 siblings, 0 replies; 58+ messages in thread
From: Goldwyn Rodrigues @ 2019-04-29 17:26 UTC (permalink / raw)
To: linux-btrfs
Cc: kilobyte, linux-fsdevel, jack, david, willy, hch, darrick.wong,
dsterba, nborisov, linux-nvdimm, Goldwyn Rodrigues
From: Goldwyn Rodrigues <rgoldwyn@suse.com>
This is for debug purposes only and can be skipped.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
fs/btrfs/dax.c | 3 +++
include/trace/events/btrfs.h | 56 ++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 59 insertions(+)
diff --git a/fs/btrfs/dax.c b/fs/btrfs/dax.c
index 20ec2ec49c68..3fee28f5a199 100644
--- a/fs/btrfs/dax.c
+++ b/fs/btrfs/dax.c
@@ -104,6 +104,8 @@ static int btrfs_iomap_begin(struct inode *inode, loff_t pos,
u64 srcblk = 0;
loff_t diff;
+ trace_btrfs_iomap_begin(inode, pos, length, flags);
+
em = btrfs_get_extent(BTRFS_I(inode), NULL, 0, pos, length, 0);
iomap->type = IOMAP_MAPPED;
@@ -164,6 +166,7 @@ static int btrfs_iomap_end(struct inode *inode, loff_t pos,
{
struct btrfs_iomap *bi = iomap->private;
u64 wend;
+ trace_btrfs_iomap_end(inode, pos, length, written, flags);
if (!bi)
return 0;
diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
index ab1cc33adbac..8779e5789a7c 100644
--- a/include/trace/events/btrfs.h
+++ b/include/trace/events/btrfs.h
@@ -1850,6 +1850,62 @@ DEFINE_EVENT(btrfs__block_group, btrfs_skip_unused_block_group,
TP_ARGS(bg_cache)
);
+TRACE_EVENT(btrfs_iomap_begin,
+
+ TP_PROTO(const struct inode *inode, loff_t pos, loff_t length, int flags),
+
+ TP_ARGS(inode, pos, length, flags),
+
+ TP_STRUCT__entry_btrfs(
+ __field( u64, ino )
+ __field( u64, pos )
+ __field( u64, length )
+ __field( int, flags )
+ ),
+
+ TP_fast_assign_btrfs(btrfs_sb(inode->i_sb),
+ __entry->ino = btrfs_ino(BTRFS_I(inode));
+ __entry->pos = pos;
+ __entry->length = length;
+ __entry->flags = flags;
+ ),
+
+ TP_printk_btrfs("ino=%llu pos=%llu len=%llu flags=0x%x",
+ __entry->ino,
+ __entry->pos,
+ __entry->length,
+ __entry->flags)
+);
+
+TRACE_EVENT(btrfs_iomap_end,
+
+ TP_PROTO(const struct inode *inode, loff_t pos, loff_t length, loff_t written, int flags),
+
+ TP_ARGS(inode, pos, length, written, flags),
+
+ TP_STRUCT__entry_btrfs(
+ __field( u64, ino )
+ __field( u64, pos )
+ __field( u64, length )
+ __field( u64, written )
+ __field( int, flags )
+ ),
+
+ TP_fast_assign_btrfs(btrfs_sb(inode->i_sb),
+ __entry->ino = btrfs_ino(BTRFS_I(inode));
+ __entry->pos = pos;
+ __entry->length = length;
+ __entry->written = written;
+ __entry->flags = flags;
+ ),
+
+ TP_printk_btrfs("ino=%llu pos=%llu len=%llu written=%llu flags=0x%x",
+ __entry->ino,
+ __entry->pos,
+ __entry->length,
+ __entry->written,
+ __entry->flags)
+);
#endif /* _TRACE_BTRFS_H */
/* This part must be outside protection */
--
2.16.4
^ permalink raw reply related [flat|nested] 58+ messages in thread
* [PATCH for-goldwyn] btrfs: disallow MAP_SYNC outside of DAX mounts
2019-04-29 17:26 ` [PATCH 12/18] btrfs: allow MAP_SYNC mmap Goldwyn Rodrigues
@ 2019-05-10 15:32 ` Adam Borowski
2019-05-10 15:41 ` Dan Williams
2019-05-23 13:44 ` [PATCH 12/18] btrfs: allow MAP_SYNC mmap Jan Kara
1 sibling, 1 reply; 58+ messages in thread
From: Adam Borowski @ 2019-05-10 15:32 UTC (permalink / raw)
To: linux-btrfs, Goldwyn Rodrigues, linux-fsdevel, jack, david, willy,
hch, darrick.wong, dsterba, nborisov, linux-nvdimm
Cc: Adam Borowski
Even if allocation is done synchronously, data would be lost except on
actual pmem. Explicit msync()s don't need MAP_SYNC, and don't require
a sync per page.
Signed-off-by: Adam Borowski <kilobyte@angband.pl>
---
MAP_SYNC can't be allowed unconditionally, as cacheline flushes don't help
guarantee persistency in page cache. This fixes an error in my earlier
patch "btrfs: allow MAP_SYNC mmap" -- you'd probably want to amend that.
fs/btrfs/file.c | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 362a9cf9dcb2..0bc5428037ba 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -2233,6 +2233,13 @@ static int btrfs_file_mmap(struct file *filp, struct vm_area_struct *vma)
if (!IS_DAX(inode) && !mapping->a_ops->readpage)
return -ENOEXEC;
+ /*
+ * Normal operation of btrfs is pretty much an antithesis of MAP_SYNC;
+ * supporting it outside DAX is pointless.
+ */
+ if (!IS_DAX(inode) && (vma->vm_flags & VM_SYNC))
+ return -EOPNOTSUPP;
+
file_accessed(filp);
if (IS_DAX(inode))
vma->vm_ops = &btrfs_dax_vm_ops;
--
2.20.1
^ permalink raw reply related [flat|nested] 58+ messages in thread
* Re: [PATCH for-goldwyn] btrfs: disallow MAP_SYNC outside of DAX mounts
2019-05-10 15:32 ` [PATCH for-goldwyn] btrfs: disallow MAP_SYNC outside of DAX mounts Adam Borowski
@ 2019-05-10 15:41 ` Dan Williams
2019-05-10 15:59 ` Pankaj Gupta
0 siblings, 1 reply; 58+ messages in thread
From: Dan Williams @ 2019-05-10 15:41 UTC (permalink / raw)
To: Adam Borowski
Cc: linux-btrfs, Goldwyn Rodrigues, linux-fsdevel, Jan Kara, david,
Matthew Wilcox, Christoph Hellwig, Darrick J. Wong, dsterba,
nborisov, linux-nvdimm, Pankaj Gupta
On Fri, May 10, 2019 at 8:33 AM Adam Borowski <kilobyte@angband.pl> wrote:
>
> Even if allocation is done synchronously, data would be lost except on
> actual pmem. Explicit msync()s don't need MAP_SYNC, and don't require
> a sync per page.
>
> Signed-off-by: Adam Borowski <kilobyte@angband.pl>
> ---
> MAP_SYNC can't be allowed unconditionally, as cacheline flushes don't help
> guarantee persistency in page cache. This fixes an error in my earlier
> patch "btrfs: allow MAP_SYNC mmap" -- you'd probably want to amend that.
>
>
> fs/btrfs/file.c | 7 +++++++
> 1 file changed, 7 insertions(+)
>
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index 362a9cf9dcb2..0bc5428037ba 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -2233,6 +2233,13 @@ static int btrfs_file_mmap(struct file *filp, struct vm_area_struct *vma)
> if (!IS_DAX(inode) && !mapping->a_ops->readpage)
> return -ENOEXEC;
>
> + /*
> + * Normal operation of btrfs is pretty much an antithesis of MAP_SYNC;
> + * supporting it outside DAX is pointless.
> + */
> + if (!IS_DAX(inode) && (vma->vm_flags & VM_SYNC))
> + return -EOPNOTSUPP;
> +
If the virtio-pmem patch set goes upstream prior to btrfs-dax support
this will need to switch over to the new daxdev_mapping_supported()
helper.
https://lore.kernel.org/lkml/20190426050039.17460-5-pagupta@redhat.com/
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [PATCH for-goldwyn] btrfs: disallow MAP_SYNC outside of DAX mounts
2019-05-10 15:41 ` Dan Williams
@ 2019-05-10 15:59 ` Pankaj Gupta
0 siblings, 0 replies; 58+ messages in thread
From: Pankaj Gupta @ 2019-05-10 15:59 UTC (permalink / raw)
To: Dan Williams
Cc: Adam Borowski, linux-btrfs, Goldwyn Rodrigues, linux-fsdevel,
Jan Kara, david, Matthew Wilcox, Christoph Hellwig,
Darrick J. Wong, dsterba, nborisov, linux-nvdimm
> >
> > Even if allocation is done synchronously, data would be lost except on
> > actual pmem. Explicit msync()s don't need MAP_SYNC, and don't require
> > a sync per page.
> >
> > Signed-off-by: Adam Borowski <kilobyte@angband.pl>
> > ---
> > MAP_SYNC can't be allowed unconditionally, as cacheline flushes don't help
> > guarantee persistency in page cache. This fixes an error in my earlier
> > patch "btrfs: allow MAP_SYNC mmap" -- you'd probably want to amend that.
> >
> >
> > fs/btrfs/file.c | 7 +++++++
> > 1 file changed, 7 insertions(+)
> >
> > diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> > index 362a9cf9dcb2..0bc5428037ba 100644
> > --- a/fs/btrfs/file.c
> > +++ b/fs/btrfs/file.c
> > @@ -2233,6 +2233,13 @@ static int btrfs_file_mmap(struct file *filp,
> > struct vm_area_struct *vma)
> > if (!IS_DAX(inode) && !mapping->a_ops->readpage)
> > return -ENOEXEC;
> >
> > + /*
> > + * Normal operation of btrfs is pretty much an antithesis of
> > MAP_SYNC;
> > + * supporting it outside DAX is pointless.
> > + */
> > + if (!IS_DAX(inode) && (vma->vm_flags & VM_SYNC))
> > + return -EOPNOTSUPP;
> > +
>
> If the virtio-pmem patch set goes upstream prior to btrfs-dax support
> this will need to switch over to the new daxdev_mapping_supported()
> helper.
I was planning to do changes for virtio pmem & BTRFS. I was waiting for
DAX support for BTRFS to merge upstream.
Thank you,
Pankaj
>
> https://lore.kernel.org/lkml/20190426050039.17460-5-pagupta@redhat.com/
>
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [PATCH 03/18] btrfs: basic dax read
2019-04-29 17:26 ` [PATCH 03/18] btrfs: basic dax read Goldwyn Rodrigues
@ 2019-05-21 15:14 ` Darrick J. Wong
2019-05-22 21:50 ` Goldwyn Rodrigues
0 siblings, 1 reply; 58+ messages in thread
From: Darrick J. Wong @ 2019-05-21 15:14 UTC (permalink / raw)
To: Goldwyn Rodrigues
Cc: linux-btrfs, kilobyte, linux-fsdevel, jack, david, willy, hch,
dsterba, nborisov, linux-nvdimm, Goldwyn Rodrigues
On Mon, Apr 29, 2019 at 12:26:34PM -0500, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues <rgoldwyn@suse.com>
>
> Perform a basic read using iomap support. The btrfs_iomap_begin()
> finds the extent at the position and fills the iomap data
> structure with the values.
>
> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
> ---
> fs/btrfs/Makefile | 1 +
> fs/btrfs/ctree.h | 5 +++++
> fs/btrfs/dax.c | 49 +++++++++++++++++++++++++++++++++++++++++++++++++
> fs/btrfs/file.c | 11 ++++++++++-
> 4 files changed, 65 insertions(+), 1 deletion(-)
> create mode 100644 fs/btrfs/dax.c
>
> diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
> index ca693dd554e9..1fa77b875ae9 100644
> --- a/fs/btrfs/Makefile
> +++ b/fs/btrfs/Makefile
> @@ -12,6 +12,7 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o root-tree.o dir-item.o \
> reada.o backref.o ulist.o qgroup.o send.o dev-replace.o raid56.o \
> uuid-tree.o props.o free-space-tree.o tree-checker.o
>
> +btrfs-$(CONFIG_FS_DAX) += dax.o
> btrfs-$(CONFIG_BTRFS_FS_POSIX_ACL) += acl.o
> btrfs-$(CONFIG_BTRFS_FS_CHECK_INTEGRITY) += check-integrity.o
> btrfs-$(CONFIG_BTRFS_FS_REF_VERIFY) += ref-verify.o
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 9512f49262dd..b7bbe5130a3b 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -3795,6 +3795,11 @@ int btrfs_reada_wait(void *handle);
> void btrfs_reada_detach(void *handle);
> int btree_readahead_hook(struct extent_buffer *eb, int err);
>
> +#ifdef CONFIG_FS_DAX
> +/* dax.c */
> +ssize_t btrfs_file_dax_read(struct kiocb *iocb, struct iov_iter *to);
> +#endif /* CONFIG_FS_DAX */
> +
> static inline int is_fstree(u64 rootid)
> {
> if (rootid == BTRFS_FS_TREE_OBJECTID ||
> diff --git a/fs/btrfs/dax.c b/fs/btrfs/dax.c
> new file mode 100644
> index 000000000000..bf3d46b0acb6
> --- /dev/null
> +++ b/fs/btrfs/dax.c
> @@ -0,0 +1,49 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * DAX support for BTRFS
> + *
> + * Copyright (c) 2019 SUSE Linux
> + * Author: Goldwyn Rodrigues <rgoldwyn@suse.com>
> + */
> +
> +#ifdef CONFIG_FS_DAX
> +#include <linux/dax.h>
> +#include <linux/iomap.h>
> +#include "ctree.h"
> +#include "btrfs_inode.h"
> +
> +static int btrfs_iomap_begin(struct inode *inode, loff_t pos,
> + loff_t length, unsigned flags, struct iomap *iomap)
> +{
> + struct extent_map *em;
> + struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
> + em = btrfs_get_extent(BTRFS_I(inode), NULL, 0, pos, length, 0);
> + if (em->block_start == EXTENT_MAP_HOLE) {
> + iomap->type = IOMAP_HOLE;
> + return 0;
I'm not doing a rigorous review of the btrfs-specific pieces, but you're
required to fill out the other iomap fields for a read hole.
--D
> + }
> + iomap->type = IOMAP_MAPPED;
> + iomap->bdev = em->bdev;
> + iomap->dax_dev = fs_info->dax_dev;
> + iomap->offset = em->start;
> + iomap->length = em->len;
> + iomap->addr = em->block_start;
> + return 0;
> +}
> +
> +static const struct iomap_ops btrfs_iomap_ops = {
> + .iomap_begin = btrfs_iomap_begin,
> +};
> +
> +ssize_t btrfs_file_dax_read(struct kiocb *iocb, struct iov_iter *to)
> +{
> + ssize_t ret;
> + struct inode *inode = file_inode(iocb->ki_filp);
> +
> + inode_lock_shared(inode);
> + ret = dax_iomap_rw(iocb, to, &btrfs_iomap_ops);
> + inode_unlock_shared(inode);
> +
> + return ret;
> +}
> +#endif /* CONFIG_FS_DAX */
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index 34fe8a58b0e9..9194591f9eea 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -3288,9 +3288,18 @@ static int btrfs_file_open(struct inode *inode, struct file *filp)
> return generic_file_open(inode, filp);
> }
>
> +static ssize_t btrfs_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
> +{
> +#ifdef CONFIG_FS_DAX
> + if (IS_DAX(file_inode(iocb->ki_filp)))
> + return btrfs_file_dax_read(iocb, to);
> +#endif
> + return generic_file_read_iter(iocb, to);
> +}
> +
> const struct file_operations btrfs_file_operations = {
> .llseek = btrfs_file_llseek,
> - .read_iter = generic_file_read_iter,
> + .read_iter = btrfs_file_read_iter,
> .splice_read = generic_file_splice_read,
> .write_iter = btrfs_file_write_iter,
> .mmap = btrfs_file_mmap,
> --
> 2.16.4
>
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [PATCH 04/18] dax: Introduce IOMAP_DAX_COW to CoW edges during writes
2019-04-29 17:26 ` [PATCH 04/18] dax: Introduce IOMAP_DAX_COW to CoW edges during writes Goldwyn Rodrigues
@ 2019-05-21 16:51 ` Darrick J. Wong
2019-05-22 20:14 ` Goldwyn Rodrigues
2019-05-23 9:05 ` Shiyang Ruan
0 siblings, 2 replies; 58+ messages in thread
From: Darrick J. Wong @ 2019-05-21 16:51 UTC (permalink / raw)
To: Goldwyn Rodrigues
Cc: linux-btrfs, kilobyte, linux-fsdevel, jack, david, willy, hch,
dsterba, nborisov, linux-nvdimm, Goldwyn Rodrigues
On Mon, Apr 29, 2019 at 12:26:35PM -0500, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues <rgoldwyn@suse.com>
>
> The IOMAP_DAX_COW is a iomap type which performs copy of
> edges of data while performing a write if start/end are
> not page aligned. The source address is expected in
> iomap->inline_data.
>
> dax_copy_edges() is a helper functions performs a copy from
> one part of the device to another for data not page aligned.
> If iomap->inline_data is NULL, it memset's the area to zero.
>
> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
> ---
> fs/dax.c | 46 +++++++++++++++++++++++++++++++++++++++++++++-
> include/linux/iomap.h | 1 +
> 2 files changed, 46 insertions(+), 1 deletion(-)
>
> diff --git a/fs/dax.c b/fs/dax.c
> index e5e54da1715f..610bfa861a28 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -1084,6 +1084,42 @@ int __dax_zero_page_range(struct block_device *bdev,
> }
> EXPORT_SYMBOL_GPL(__dax_zero_page_range);
>
> +/*
> + * dax_copy_edges - Copies the part of the pages not included in
> + * the write, but required for CoW because
> + * offset/offset+length are not page aligned.
> + */
> +static int dax_copy_edges(struct inode *inode, loff_t pos, loff_t length,
> + struct iomap *iomap, void *daddr)
> +{
> + unsigned offset = pos & (PAGE_SIZE - 1);
> + loff_t end = pos + length;
> + loff_t pg_end = round_up(end, PAGE_SIZE);
> + void *saddr = iomap->inline_data;
> + int ret = 0;
> + /*
> + * Copy the first part of the page
> + * Note: we pass offset as length
> + */
> + if (offset) {
> + if (saddr)
> + ret = memcpy_mcsafe(daddr, saddr, offset);
> + else
> + memset(daddr, 0, offset);
> + }
> +
> + /* Copy the last part of the range */
> + if (end < pg_end) {
> + if (saddr)
> + ret = memcpy_mcsafe(daddr + offset + length,
> + saddr + offset + length, pg_end - end);
> + else
> + memset(daddr + offset + length, 0,
> + pg_end - end);
> + }
> + return ret;
> +}
> +
> static loff_t
> dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
> struct iomap *iomap)
> @@ -1105,9 +1141,11 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
> return iov_iter_zero(min(length, end - pos), iter);
> }
>
> - if (WARN_ON_ONCE(iomap->type != IOMAP_MAPPED))
> + if (WARN_ON_ONCE(iomap->type != IOMAP_MAPPED
> + && iomap->type != IOMAP_DAX_COW))
I reiterate (from V3) that the && goes on the previous line...
if (WARN_ON_ONCE(iomap->type != IOMAP_MAPPED &&
iomap->type != IOMAP_DAX_COW))
> return -EIO;
>
> +
> /*
> * Write can allocate block for an area which has a hole page mapped
> * into page tables. We have to tear down these mappings so that data
> @@ -1144,6 +1182,12 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
> break;
> }
>
> + if (iomap->type == IOMAP_DAX_COW) {
> + ret = dax_copy_edges(inode, pos, length, iomap, kaddr);
> + if (ret)
> + break;
> + }
> +
> map_len = PFN_PHYS(map_len);
> kaddr += offset;
> map_len -= offset;
> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> index 0fefb5455bda..6e885c5a38a3 100644
> --- a/include/linux/iomap.h
> +++ b/include/linux/iomap.h
> @@ -25,6 +25,7 @@ struct vm_fault;
> #define IOMAP_MAPPED 0x03 /* blocks allocated at @addr */
> #define IOMAP_UNWRITTEN 0x04 /* blocks allocated at @addr in unwritten state */
> #define IOMAP_INLINE 0x05 /* data inline in the inode */
> +#define IOMAP_DAX_COW 0x06
DAX isn't going to be the only scenario where we need a way to
communicate to iomap actors the need to implement copy on write.
XFS also uses struct iomap to hand out file leases to clients. The
lease code /currently/ doesn't support files with shared blocks (because
the only user is pNFS) but one could easily imagine a future where some
client wants to lease a file with shared blocks, in which case XFS will
want to convey the COW details to the lessee.
> +/* Copy data pointed by inline_data before write*/
A month ago during the V3 patchset review, I wrote (possibly in an other
thread, sorry) about something that I'm putting my foot down about now
for the V4 patchset, which is the {re,ab}use of @inline_data for the
data source address.
We cannot use @inline_data to convey the source address. @inline_data
(so far) is used to point to the in-memory representation of the storage
described by @addr. For data writes, @addr is the location of the write
on disk and @inline_data is the location of the write in memory.
Reusing @inline_data here to point to the location of the source data in
memory is a totally different thing and will likely result in confusion.
On a practical level, this also means that we cannot support the case of
COW && INLINE because the type codes collide and so would the users of
@inline_data. This isn't required *right now*, but if you had a pmem
filesystem that stages inode updates in memory and flips a pointer to
commit changes then the ->iomap_begin function will need to convey two
pointers at once.
So this brings us back to Dave's suggestion during the V1 patchset
review that instead of adding more iomap flags/types and overloading
fields, we simply pass two struct iomaps into ->iomap_begin:
- Change iomap_apply() to "struct iomap iomap[2] = 0;" and pass
&iomap[0] into the ->iomap_begin and ->iomap_end functions. The
first iomap will be filled in with the destination for the write (as
all implementations do now), and the second iomap can be filled in
with the source information for a COW operation.
- If the ->iomap_begin implementation decides that COW is necessary for
the requested operation, then it should fill out that second iomap
with information about the extent that the actor must copied before
returning. The second iomap's offset and length must match the
first. If COW isn't necessary, the ->iomap_begin implementation
ignores it, and the second iomap retains type == 0 (i.e. invalid
mapping).
Proceeding along these lines will (AFAICT) still allow you to enable all
the btrfs functionality in the rest of this patchset while making the
task of wiring up XFS fairly simple. No overloaded fields and no new
flags.
This is how I'd like to see this patchset should proceed to V5. Does
that make sense?
--D
>
> /*
> * Flags for all iomap mappings:
> --
> 2.16.4
>
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [PATCH 07/18] btrfs: add dax write support
2019-04-29 17:26 ` [PATCH 07/18] btrfs: add dax write support Goldwyn Rodrigues
@ 2019-05-21 17:08 ` Darrick J. Wong
0 siblings, 0 replies; 58+ messages in thread
From: Darrick J. Wong @ 2019-05-21 17:08 UTC (permalink / raw)
To: Goldwyn Rodrigues
Cc: linux-btrfs, kilobyte, linux-fsdevel, jack, david, willy, hch,
dsterba, nborisov, linux-nvdimm, Goldwyn Rodrigues
On Mon, Apr 29, 2019 at 12:26:38PM -0500, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues <rgoldwyn@suse.com>
>
> IOMAP_DAX_COW allows to inform the dax code, to first perform
> a copy which are not page-aligned before performing the write.
> The responsibility of checking if data edges are page aligned
> is performed in ->iomap_begin() and the source address is
> stored in ->inline_data
>
> A new struct btrfs_iomap is passed from iomap_begin() to
> iomap_end(), which contains all the accounting and locking information
> for CoW based writes.
>
> For writing to a hole, iomap->inline_data is set to zero.
>
> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
> ---
> fs/btrfs/ctree.h | 6 ++
> fs/btrfs/dax.c | 182 +++++++++++++++++++++++++++++++++++++++++++++++++++++--
> fs/btrfs/file.c | 4 +-
> 3 files changed, 185 insertions(+), 7 deletions(-)
>
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 1e3e758b83c2..eec01eb92f33 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -3801,6 +3801,12 @@ int btree_readahead_hook(struct extent_buffer *eb, int err);
> #ifdef CONFIG_FS_DAX
> /* dax.c */
> ssize_t btrfs_file_dax_read(struct kiocb *iocb, struct iov_iter *to);
> +ssize_t btrfs_file_dax_write(struct kiocb *iocb, struct iov_iter *from);
> +#else
> +static inline ssize_t btrfs_file_dax_write(struct kiocb *iocb, struct iov_iter *from)
> +{
> + return 0;
> +}
> #endif /* CONFIG_FS_DAX */
>
> static inline int is_fstree(u64 rootid)
> diff --git a/fs/btrfs/dax.c b/fs/btrfs/dax.c
> index bf3d46b0acb6..f5cc9bcdbf14 100644
> --- a/fs/btrfs/dax.c
> +++ b/fs/btrfs/dax.c
> @@ -9,30 +9,184 @@
> #ifdef CONFIG_FS_DAX
> #include <linux/dax.h>
> #include <linux/iomap.h>
> +#include <linux/uio.h>
> #include "ctree.h"
> #include "btrfs_inode.h"
>
> +struct btrfs_iomap {
> + u64 start;
> + u64 end;
> + bool nocow;
> + struct extent_changeset *data_reserved;
> + struct extent_state *cached_state;
> +};
> +
> +static struct btrfs_iomap *btrfs_iomap_init(struct inode *inode,
> + struct extent_map **em,
> + loff_t pos, loff_t length)
> +{
> + int ret = 0;
> + struct extent_map *map = *em;
> + struct btrfs_iomap *bi;
> +
> + bi = kzalloc(sizeof(struct btrfs_iomap), GFP_NOFS);
> + if (!bi)
> + return ERR_PTR(-ENOMEM);
> +
> + bi->start = round_down(pos, PAGE_SIZE);
> + bi->end = PAGE_ALIGN(pos + length);
> +
> + /* Wait for existing ordered extents in range to finish */
> + btrfs_wait_ordered_range(inode, bi->start, bi->end - bi->start);
> +
> + lock_extent_bits(&BTRFS_I(inode)->io_tree, bi->start, bi->end, &bi->cached_state);
> +
> + ret = btrfs_delalloc_reserve_space(inode, &bi->data_reserved,
> + bi->start, bi->end - bi->start);
> + if (ret) {
> + unlock_extent_cached(&BTRFS_I(inode)->io_tree, bi->start, bi->end,
> + &bi->cached_state);
> + kfree(bi);
> + return ERR_PTR(ret);
> + }
> +
> + refcount_inc(&map->refs);
> + ret = btrfs_get_extent_map_write(em, NULL,
> + inode, bi->start, bi->end - bi->start, &bi->nocow);
> + if (ret) {
> + unlock_extent_cached(&BTRFS_I(inode)->io_tree, bi->start, bi->end,
> + &bi->cached_state);
> + btrfs_delalloc_release_space(inode,
> + bi->data_reserved, bi->start,
> + bi->end - bi->start, true);
> + extent_changeset_free(bi->data_reserved);
> + kfree(bi);
> + return ERR_PTR(ret);
> + }
> + free_extent_map(map);
> + return bi;
> +}
> +
> +static void *dax_address(struct block_device *bdev, struct dax_device *dax_dev,
> + sector_t sector, loff_t pos, loff_t length)
This looks like a common function for fs/dax.c.
--D
> +{
> + size_t size = ALIGN(pos + length, PAGE_SIZE);
> + int id, ret = 0;
> + void *kaddr = NULL;
> + pgoff_t pgoff;
> + long map_len;
> +
> + id = dax_read_lock();
> +
> + ret = bdev_dax_pgoff(bdev, sector, size, &pgoff);
> + if (ret)
> + goto out;
> +
> + map_len = dax_direct_access(dax_dev, pgoff, PHYS_PFN(size),
> + &kaddr, NULL);
> + if (map_len < 0)
> + ret = map_len;
> +
> +out:
> + dax_read_unlock(id);
> + if (ret)
> + return ERR_PTR(ret);
> + return kaddr;
> +}
> +
> static int btrfs_iomap_begin(struct inode *inode, loff_t pos,
> loff_t length, unsigned flags, struct iomap *iomap)
> {
> struct extent_map *em;
> struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
> + struct btrfs_iomap *bi = NULL;
> + unsigned offset = pos & (PAGE_SIZE - 1);
> + u64 srcblk = 0;
> + loff_t diff;
> +
> em = btrfs_get_extent(BTRFS_I(inode), NULL, 0, pos, length, 0);
> +
> + iomap->type = IOMAP_MAPPED;
> +
> + if (flags & IOMAP_WRITE) {
> + if (em->block_start != EXTENT_MAP_HOLE)
> + srcblk = em->block_start + pos - em->start - offset;
> +
> + bi = btrfs_iomap_init(inode, &em, pos, length);
> + if (IS_ERR(bi))
> + return PTR_ERR(bi);
> +
> + }
> +
> + /*
> + * Advance the difference between pos and start, to align well with
> + * inline_data in case of writes
> + */
> + diff = round_down(pos - em->start, PAGE_SIZE);
> + iomap->offset = em->start + diff;
> + iomap->length = em->len - diff;
> + iomap->bdev = em->bdev;
> + iomap->dax_dev = fs_info->dax_dev;
> +
> + /*
> + * This will be true for reads only since we have already
> + * allocated em
> + */
> if (em->block_start == EXTENT_MAP_HOLE) {
> iomap->type = IOMAP_HOLE;
> return 0;
> }
> - iomap->type = IOMAP_MAPPED;
> - iomap->bdev = em->bdev;
> - iomap->dax_dev = fs_info->dax_dev;
> - iomap->offset = em->start;
> - iomap->length = em->len;
> - iomap->addr = em->block_start;
> +
> + iomap->addr = em->block_start + diff;
> + /* Check if we really need to copy data from old extent */
> + if (bi && !bi->nocow && (offset || pos + length < bi->end)) {
> + iomap->type = IOMAP_DAX_COW;
> + if (srcblk) {
> + sector_t sector = (srcblk + (pos & PAGE_MASK) -
> + iomap->offset) >> 9;
> + iomap->inline_data = dax_address(em->bdev,
> + fs_info->dax_dev, sector, pos, length);
> + if (IS_ERR(iomap->inline_data)) {
> + kfree(bi);
> + return PTR_ERR(iomap->inline_data);
> + }
> + }
> + }
> +
> + iomap->private = bi;
> + return 0;
> +}
> +
> +static int btrfs_iomap_end(struct inode *inode, loff_t pos,
> + loff_t length, ssize_t written, unsigned flags,
> + struct iomap *iomap)
> +{
> + struct btrfs_iomap *bi = iomap->private;
> + u64 wend;
> +
> + if (!bi)
> + return 0;
> +
> + unlock_extent_cached(&BTRFS_I(inode)->io_tree, bi->start, bi->end,
> + &bi->cached_state);
> +
> + wend = PAGE_ALIGN(pos + written);
> + if (wend < bi->end) {
> + btrfs_delalloc_release_space(inode,
> + bi->data_reserved, wend,
> + bi->end - wend, true);
> + }
> +
> + btrfs_update_ordered_extent(inode, bi->start, wend - bi->start, true);
> + btrfs_delalloc_release_extents(BTRFS_I(inode), wend - bi->start, false);
> + extent_changeset_free(bi->data_reserved);
> + kfree(bi);
> return 0;
> }
>
> static const struct iomap_ops btrfs_iomap_ops = {
> .iomap_begin = btrfs_iomap_begin,
> + .iomap_end = btrfs_iomap_end,
> };
>
> ssize_t btrfs_file_dax_read(struct kiocb *iocb, struct iov_iter *to)
> @@ -46,4 +200,20 @@ ssize_t btrfs_file_dax_read(struct kiocb *iocb, struct iov_iter *to)
>
> return ret;
> }
> +
> +ssize_t btrfs_file_dax_write(struct kiocb *iocb, struct iov_iter *iter)
> +{
> + ssize_t ret = 0;
> + u64 pos = iocb->ki_pos;
> + struct inode *inode = file_inode(iocb->ki_filp);
> + ret = dax_iomap_rw(iocb, iter, &btrfs_iomap_ops);
> +
> + if (ret > 0) {
> + pos += ret;
> + if (pos > i_size_read(inode))
> + i_size_write(inode, pos);
> + iocb->ki_pos = pos;
> + }
> + return ret;
> +}
> #endif /* CONFIG_FS_DAX */
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index 9194591f9eea..a795023e26ca 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -1964,7 +1964,9 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
> if (sync)
> atomic_inc(&BTRFS_I(inode)->sync_writers);
>
> - if (iocb->ki_flags & IOCB_DIRECT) {
> + if (IS_DAX(inode)) {
> + num_written = btrfs_file_dax_write(iocb, from);
> + } else if (iocb->ki_flags & IOCB_DIRECT) {
> num_written = __btrfs_direct_write(iocb, from);
> } else {
> num_written = btrfs_buffered_write(iocb, from);
> --
> 2.16.4
>
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [PATCH 14/18] dax: memcpy before zeroing range
2019-04-29 17:26 ` [PATCH 14/18] dax: memcpy before zeroing range Goldwyn Rodrigues
@ 2019-05-21 17:27 ` Darrick J. Wong
0 siblings, 0 replies; 58+ messages in thread
From: Darrick J. Wong @ 2019-05-21 17:27 UTC (permalink / raw)
To: Goldwyn Rodrigues
Cc: linux-btrfs, kilobyte, linux-fsdevel, jack, david, willy, hch,
dsterba, nborisov, linux-nvdimm, Goldwyn Rodrigues
On Mon, Apr 29, 2019 at 12:26:45PM -0500, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues <rgoldwyn@suse.com>
>
> However, this needed more iomap fields, so it was easier
> to pass iomap and compute inside the function rather
> than passing a log of arguments.
>
> Note, there is subtle difference between iomap_sector and
> dax_iomap_sector(). Can we replace dax_iomap_sector with
> iomap_sector()? It would need pos & PAGE_MASK though or else
> bdev_dax_pgoff() return -EINVAL.
>
> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
> ---
> fs/dax.c | 17 ++++++++++++-----
> fs/iomap.c | 9 +--------
> include/linux/dax.h | 11 +++++------
> include/linux/iomap.h | 6 ++++++
> 4 files changed, 24 insertions(+), 19 deletions(-)
>
> diff --git a/fs/dax.c b/fs/dax.c
> index fa9ccbad7c03..82a08b0eec23 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -1063,11 +1063,16 @@ static bool dax_range_is_aligned(struct block_device *bdev,
> return true;
> }
>
> -int __dax_zero_page_range(struct block_device *bdev,
> - struct dax_device *dax_dev, sector_t sector,
> - unsigned int offset, unsigned int size)
> +int __dax_zero_page_range(struct iomap *iomap, loff_t pos,
> + unsigned int offset, unsigned int size)
> {
> - if (dax_range_is_aligned(bdev, offset, size)) {
> + sector_t sector = dax_iomap_sector(iomap, pos & PAGE_MASK);
> + struct block_device *bdev = iomap->bdev;
> + struct dax_device *dax_dev = iomap->dax_dev;
> + int ret = 0;
> +
> + if (!(iomap->type == IOMAP_DAX_COW) &&
> + dax_range_is_aligned(bdev, offset, size)) {
> sector_t start_sector = sector + (offset >> 9);
>
> return blkdev_issue_zeroout(bdev, start_sector,
> @@ -1087,11 +1092,13 @@ int __dax_zero_page_range(struct block_device *bdev,
> dax_read_unlock(id);
> return rc;
> }
> + if (iomap->type == IOMAP_DAX_COW)
> + ret = memcpy_mcsafe(kaddr, iomap->inline_data, offset);
If the memcpy fails, does it make sense to keep going?
> memset(kaddr + offset, 0, size);
Is it ever the case that offset + size isn't the end of the page? If
so, then don't we need a second memcpy_mcsafe to handle that too?
> dax_flush(dax_dev, kaddr + offset, size);
> dax_read_unlock(id);
> }
> - return 0;
> + return ret;
> }
> EXPORT_SYMBOL_GPL(__dax_zero_page_range);
>
> diff --git a/fs/iomap.c b/fs/iomap.c
> index abdd18e404f8..90698c854883 100644
> --- a/fs/iomap.c
> +++ b/fs/iomap.c
> @@ -98,12 +98,6 @@ iomap_apply(struct inode *inode, loff_t pos, loff_t length, unsigned flags,
> return written ? written : ret;
> }
>
> -static sector_t
> -iomap_sector(struct iomap *iomap, loff_t pos)
> -{
> - return (iomap->addr + pos - iomap->offset) >> SECTOR_SHIFT;
> -}
> -
> static struct iomap_page *
> iomap_page_create(struct inode *inode, struct page *page)
> {
> @@ -990,8 +984,7 @@ static int iomap_zero(struct inode *inode, loff_t pos, unsigned offset,
> static int iomap_dax_zero(loff_t pos, unsigned offset, unsigned bytes,
> struct iomap *iomap)
> {
> - return __dax_zero_page_range(iomap->bdev, iomap->dax_dev,
> - iomap_sector(iomap, pos & PAGE_MASK), offset, bytes);
> + return __dax_zero_page_range(iomap, pos, offset, bytes);
> }
>
> static loff_t
> diff --git a/include/linux/dax.h b/include/linux/dax.h
> index 1370d39c91b6..c469d9ff54b4 100644
> --- a/include/linux/dax.h
> +++ b/include/linux/dax.h
> @@ -9,6 +9,7 @@
>
> typedef unsigned long dax_entry_t;
>
> +struct iomap;
> struct iomap_ops;
> struct dax_device;
> struct dax_operations {
> @@ -163,13 +164,11 @@ int dax_file_range_compare(struct inode *src, loff_t srcoff,
> const struct iomap_ops *ops);
>
> #ifdef CONFIG_FS_DAX
> -int __dax_zero_page_range(struct block_device *bdev,
> - struct dax_device *dax_dev, sector_t sector,
> - unsigned int offset, unsigned int length);
> +int __dax_zero_page_range(struct iomap *iomap, loff_t pos,
> + unsigned int offset, unsigned int size);
> #else
> -static inline int __dax_zero_page_range(struct block_device *bdev,
> - struct dax_device *dax_dev, sector_t sector,
> - unsigned int offset, unsigned int length)
> +static inline int __dax_zero_page_range(struct iomap *iomap, loff_t pos,
> + unsigned int offset, unsigned int size)
> {
> return -ENXIO;
> }
> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> index 6e885c5a38a3..fcfce269db3e 100644
> --- a/include/linux/iomap.h
> +++ b/include/linux/iomap.h
> @@ -7,6 +7,7 @@
> #include <linux/mm.h>
> #include <linux/types.h>
> #include <linux/mm_types.h>
> +#include <linux/blkdev.h>
>
> struct address_space;
> struct fiemap_extent_info;
> @@ -120,6 +121,11 @@ static inline struct iomap_page *to_iomap_page(struct page *page)
> return NULL;
> }
>
> +static inline sector_t iomap_sector(struct iomap *iomap, loff_t pos)
> +{
> + return (iomap->addr + pos - iomap->offset) >> SECTOR_SHIFT;
> +}
FWIW the iomap changes seem fine to me.
--D
> +
> ssize_t iomap_file_buffered_write(struct kiocb *iocb, struct iov_iter *from,
> const struct iomap_ops *ops);
> int iomap_readpage(struct page *page, const struct iomap_ops *ops);
> --
> 2.16.4
>
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [PATCH 10/18] dax: replace mmap entry in case of CoW
2019-04-29 17:26 ` [PATCH 10/18] dax: replace mmap entry in case of CoW Goldwyn Rodrigues
@ 2019-05-21 17:35 ` Darrick J. Wong
2019-05-23 13:38 ` Jan Kara
1 sibling, 0 replies; 58+ messages in thread
From: Darrick J. Wong @ 2019-05-21 17:35 UTC (permalink / raw)
To: Goldwyn Rodrigues
Cc: linux-btrfs, kilobyte, linux-fsdevel, jack, david, willy, hch,
dsterba, nborisov, linux-nvdimm, Goldwyn Rodrigues
On Mon, Apr 29, 2019 at 12:26:41PM -0500, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues <rgoldwyn@suse.com>
>
> We replace the existing entry to the newly allocated one
> in case of CoW. Also, we mark the entry as PAGECACHE_TAG_TOWRITE
> so writeback marks this entry as writeprotected. This
> helps us snapshots so new write pagefaults after snapshots
> trigger a CoW.
>
> btrfs does not support hugepages so we don't handle PMD.
>
> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
> ---
> fs/dax.c | 36 ++++++++++++++++++++++++++++--------
> 1 file changed, 28 insertions(+), 8 deletions(-)
>
> diff --git a/fs/dax.c b/fs/dax.c
> index 718b1632a39d..07e8ff20161d 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -700,6 +700,9 @@ static int copy_user_dax(struct block_device *bdev, struct dax_device *dax_dev,
> return 0;
> }
>
> +#define DAX_IF_DIRTY (1ULL << 0)
> +#define DAX_IF_COW (1ULL << 1)
> +
> /*
> * By this point grab_mapping_entry() has ensured that we have a locked entry
> * of the appropriate size so we don't have to worry about downgrading PMDs to
> @@ -709,14 +712,17 @@ static int copy_user_dax(struct block_device *bdev, struct dax_device *dax_dev,
> */
> static void *dax_insert_entry(struct xa_state *xas,
> struct address_space *mapping, struct vm_fault *vmf,
> - void *entry, pfn_t pfn, unsigned long flags, bool dirty)
> + void *entry, pfn_t pfn, unsigned long flags,
> + unsigned long insert_flags)
I think unsigned int would have sufficed here.
> {
> void *new_entry = dax_make_entry(pfn, flags);
> + bool dirty = insert_flags & DAX_IF_DIRTY;
> + bool cow = insert_flags & DAX_IF_COW;
>
> if (dirty)
> __mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
>
> - if (dax_is_zero_entry(entry) && !(flags & DAX_ZERO_PAGE)) {
> + if (cow || (dax_is_zero_entry(entry) && !(flags & DAX_ZERO_PAGE))) {
> unsigned long index = xas->xa_index;
> /* we are replacing a zero page with block mapping */
> if (dax_is_pmd_entry(entry))
> @@ -728,12 +734,12 @@ static void *dax_insert_entry(struct xa_state *xas,
>
> xas_reset(xas);
> xas_lock_irq(xas);
> - if (dax_entry_size(entry) != dax_entry_size(new_entry)) {
> + if (cow || (dax_entry_size(entry) != dax_entry_size(new_entry))) {
> dax_disassociate_entry(entry, mapping, false);
> dax_associate_entry(new_entry, mapping, vmf->vma, vmf->address);
> }
>
> - if (dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) {
> + if (cow || dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) {
> /*
> * Only swap our new entry into the page cache if the current
> * entry is a zero page or an empty entry. If a normal PTE or
> @@ -753,6 +759,9 @@ static void *dax_insert_entry(struct xa_state *xas,
> if (dirty)
> xas_set_mark(xas, PAGECACHE_TAG_DIRTY);
>
> + if (cow)
> + xas_set_mark(xas, PAGECACHE_TAG_TOWRITE);
> +
> xas_unlock_irq(xas);
> return entry;
> }
> @@ -1032,7 +1041,7 @@ static vm_fault_t dax_load_hole(struct xa_state *xas,
> vm_fault_t ret;
>
> *entry = dax_insert_entry(xas, mapping, vmf, *entry, pfn,
> - DAX_ZERO_PAGE, false);
> + DAX_ZERO_PAGE, 0);
>
> ret = vmf_insert_mixed(vmf->vma, vaddr, pfn);
> trace_dax_load_hole(inode, vmf, ret);
> @@ -1296,6 +1305,7 @@ static vm_fault_t dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp,
> vm_fault_t ret = 0;
> void *entry;
> pfn_t pfn;
> + unsigned long insert_flags = 0;
>
> trace_dax_pte_fault(inode, vmf, ret);
> /*
> @@ -1357,6 +1367,8 @@ static vm_fault_t dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp,
> error = copy_user_dax(iomap.bdev, iomap.dax_dev,
> sector, PAGE_SIZE, vmf->cow_page, vaddr);
> break;
> + case IOMAP_DAX_COW:
> + /* Should not be setting this - fallthrough */
> default:
> WARN_ON_ONCE(1);
> error = -EIO;
> @@ -1377,6 +1389,8 @@ static vm_fault_t dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp,
>
> switch (iomap.type) {
> case IOMAP_DAX_COW:
> + insert_flags |= DAX_IF_COW;
> + /* fallthrough */
> case IOMAP_MAPPED:
> if (iomap.flags & IOMAP_F_NEW) {
> count_vm_event(PGMAJFAULT);
> @@ -1396,8 +1410,10 @@ static vm_fault_t dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp,
> } else
> memset(addr, 0, PAGE_SIZE);
> }
> + if (write && !sync)
> + insert_flags |= DAX_IF_DIRTY;
> entry = dax_insert_entry(&xas, mapping, vmf, entry, pfn,
> - 0, write && !sync);
> + 0, insert_flags);
>
> /*
> * If we are doing synchronous page fault and inode needs fsync,
> @@ -1478,7 +1494,7 @@ static vm_fault_t dax_pmd_load_hole(struct xa_state *xas, struct vm_fault *vmf,
>
> pfn = page_to_pfn_t(zero_page);
> *entry = dax_insert_entry(xas, mapping, vmf, *entry, pfn,
> - DAX_PMD | DAX_ZERO_PAGE, false);
> + DAX_PMD | DAX_ZERO_PAGE, 0);
>
> if (arch_needs_pgtable_deposit()) {
> pgtable = pte_alloc_one(vma->vm_mm);
> @@ -1528,6 +1544,7 @@ static vm_fault_t dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp,
> loff_t pos;
> int error;
> pfn_t pfn;
> + unsigned long insert_flags = 0;
>
> /*
> * Check whether offset isn't beyond end of file now. Caller is
> @@ -1612,8 +1629,11 @@ static vm_fault_t dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp,
> if (error < 0)
> goto finish_iomap;
>
> + if (write && !sync)
> + insert_flags |= DAX_IF_DIRTY;
> +
> entry = dax_insert_entry(&xas, mapping, vmf, entry, pfn,
> - DAX_PMD, write && !sync);
> + DAX_PMD, insert_flags);
Otherwise, this seems reasonable enough to me. What do the DAX
developers think?
FWIW I'm not thrilled by the fact that Goldwyn isn't wiring up the
(rather similar looking) PMD code paths, but I grok that btrfs doesn't
support hugepages so he has no way to test that the code path actually
works so I'm ok with letting that go (until XFS joins the party) so long
as the pmd code path changes aren't too dissimilar to the pte code
paths. Is that true?
--D
>
> /*
> * If we are doing synchronous page fault and inode needs fsync,
> --
> 2.16.4
>
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [PATCH 08/18] dax: memcpy page in case of IOMAP_DAX_COW for mmap faults
2019-04-29 17:26 ` [PATCH 08/18] dax: memcpy page in case of IOMAP_DAX_COW for mmap faults Goldwyn Rodrigues
@ 2019-05-21 17:46 ` Darrick J. Wong
2019-05-22 19:11 ` Goldwyn Rodrigues
2019-05-23 12:10 ` Jan Kara
0 siblings, 2 replies; 58+ messages in thread
From: Darrick J. Wong @ 2019-05-21 17:46 UTC (permalink / raw)
To: Goldwyn Rodrigues
Cc: linux-btrfs, kilobyte, linux-fsdevel, jack, david, willy, hch,
dsterba, nborisov, linux-nvdimm, Goldwyn Rodrigues
On Mon, Apr 29, 2019 at 12:26:39PM -0500, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues <rgoldwyn@suse.com>
>
> Change dax_iomap_pfn to return the address as well in order to
> use it for performing a memcpy in case the type is IOMAP_DAX_COW.
> We don't handle PMD because btrfs does not support hugepages.
>
> Question:
> The sequence of bdev_dax_pgoff() and dax_direct_access() is
> used multiple times to calculate address and pfn's. Would it make
> sense to call it while calculating address as well to reduce code?
>
> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
> ---
> fs/dax.c | 19 +++++++++++++++----
> 1 file changed, 15 insertions(+), 4 deletions(-)
>
> diff --git a/fs/dax.c b/fs/dax.c
> index 610bfa861a28..718b1632a39d 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -984,7 +984,7 @@ static sector_t dax_iomap_sector(struct iomap *iomap, loff_t pos)
> }
>
> static int dax_iomap_pfn(struct iomap *iomap, loff_t pos, size_t size,
> - pfn_t *pfnp)
> + pfn_t *pfnp, void **addr)
> {
> const sector_t sector = dax_iomap_sector(iomap, pos);
> pgoff_t pgoff;
> @@ -996,7 +996,7 @@ static int dax_iomap_pfn(struct iomap *iomap, loff_t pos, size_t size,
> return rc;
> id = dax_read_lock();
> length = dax_direct_access(iomap->dax_dev, pgoff, PHYS_PFN(size),
> - NULL, pfnp);
> + addr, pfnp);
> if (length < 0) {
> rc = length;
> goto out;
> @@ -1286,6 +1286,7 @@ static vm_fault_t dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp,
> XA_STATE(xas, &mapping->i_pages, vmf->pgoff);
> struct inode *inode = mapping->host;
> unsigned long vaddr = vmf->address;
> + void *addr;
> loff_t pos = (loff_t)vmf->pgoff << PAGE_SHIFT;
> struct iomap iomap = { 0 };
Ugh, I had forgotten that fs/dax.c open-codes iomap_apply, probably
because the actor returns vm_fault_t, not bytes copied. I guess that
makes it a tiny bit more complicated to pass in two (struct iomap *) to
the iomap_begin function...
> unsigned flags = IOMAP_FAULT;
> @@ -1375,16 +1376,26 @@ static vm_fault_t dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp,
> sync = dax_fault_is_synchronous(flags, vma, &iomap);
>
> switch (iomap.type) {
> + case IOMAP_DAX_COW:
> case IOMAP_MAPPED:
> if (iomap.flags & IOMAP_F_NEW) {
> count_vm_event(PGMAJFAULT);
> count_memcg_event_mm(vma->vm_mm, PGMAJFAULT);
> major = VM_FAULT_MAJOR;
> }
> - error = dax_iomap_pfn(&iomap, pos, PAGE_SIZE, &pfn);
> + error = dax_iomap_pfn(&iomap, pos, PAGE_SIZE, &pfn, &addr);
> if (error < 0)
> goto error_finish_iomap;
>
> + if (iomap.type == IOMAP_DAX_COW) {
> + if (iomap.inline_data) {
> + error = memcpy_mcsafe(addr, iomap.inline_data,
> + PAGE_SIZE);
> + if (error < 0)
> + goto error_finish_iomap;
> + } else
> + memset(addr, 0, PAGE_SIZE);
This memcpy_mcsafe/memset chunk shows up a lot in this series. Maybe it
should be a static inline within dax.c?
--D
> + }
> entry = dax_insert_entry(&xas, mapping, vmf, entry, pfn,
> 0, write && !sync);
>
> @@ -1597,7 +1608,7 @@ static vm_fault_t dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp,
>
> switch (iomap.type) {
> case IOMAP_MAPPED:
> - error = dax_iomap_pfn(&iomap, pos, PMD_SIZE, &pfn);
> + error = dax_iomap_pfn(&iomap, pos, PMD_SIZE, &pfn, NULL);
> if (error < 0)
> goto finish_iomap;
>
> --
> 2.16.4
>
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [PATCH 01/18] btrfs: create a mount option for dax
2019-04-29 17:26 ` [PATCH 01/18] btrfs: create a mount option for dax Goldwyn Rodrigues
@ 2019-05-21 18:02 ` Darrick J. Wong
0 siblings, 0 replies; 58+ messages in thread
From: Darrick J. Wong @ 2019-05-21 18:02 UTC (permalink / raw)
To: Goldwyn Rodrigues, Theodore Ts'o
Cc: linux-btrfs, kilobyte, linux-fsdevel, jack, david, willy, hch,
dsterba, nborisov, linux-nvdimm, Goldwyn Rodrigues
[add Ted to the thread]
On Mon, Apr 29, 2019 at 12:26:32PM -0500, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues <rgoldwyn@suse.com>
>
> This sets S_DAX in inode->i_flags, which can be used with
> IS_DAX().
>
> The dax option is restricted to non multi-device mounts.
> dax interacts with the device directly instead of using bio, so
> all bio-hooks which we use for multi-device cannot be performed
> here. While regular read/writes could be manipulated with
> RAID0/1, mmap() is still an issue.
>
> Auto-setting free space tree, because dealing with free space
> inode (specifically readpages) is a nightmare.
> Auto-setting nodatasum because we don't get callback for writing
> checksums after mmap()s.
> Deny compression because it does not go with direct I/O.
>
> Store the dax_device in fs_info which will be used in iomap code.
>
> I am aware of the push to directory-based flags for dax. Until, that
> code is in the kernel, we will work with mount flags.
Hmm. This patchset was sent before LSFMM, and I've heard[1] that the
discussion there yielded some progress on how to move forward with the
user interface. I've gotten the impression that means no new dax mount
options; a persistent flag that can be inherited by new files; and some
other means for userspace to check if writethrough worked.
However, the LWN article says Ted planned to summarize for fsdevel so
let's table this part until he does that. Ted? :)
--D
[1] https://lwn.net/SubscriberLink/787973/ad85537bf8747e90/
> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
> ---
> fs/btrfs/ctree.h | 2 ++
> fs/btrfs/disk-io.c | 4 ++++
> fs/btrfs/ioctl.c | 5 ++++-
> fs/btrfs/super.c | 30 ++++++++++++++++++++++++++++++
> 4 files changed, 40 insertions(+), 1 deletion(-)
>
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index b3642367a595..8ca1c0d120f4 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -1067,6 +1067,7 @@ struct btrfs_fs_info {
> u32 metadata_ratio;
>
> void *bdev_holder;
> + struct dax_device *dax_dev;
>
> /* private scrub information */
> struct mutex scrub_lock;
> @@ -1442,6 +1443,7 @@ static inline u32 BTRFS_MAX_XATTR_SIZE(const struct btrfs_fs_info *info)
> #define BTRFS_MOUNT_FREE_SPACE_TREE (1 << 26)
> #define BTRFS_MOUNT_NOLOGREPLAY (1 << 27)
> #define BTRFS_MOUNT_REF_VERIFY (1 << 28)
> +#define BTRFS_MOUNT_DAX (1 << 29)
>
> #define BTRFS_DEFAULT_COMMIT_INTERVAL (30)
> #define BTRFS_DEFAULT_MAX_INLINE (2048)
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 6fe9197f6ee4..2bbb63b2fcff 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -16,6 +16,7 @@
> #include <linux/uuid.h>
> #include <linux/semaphore.h>
> #include <linux/error-injection.h>
> +#include <linux/dax.h>
> #include <linux/crc32c.h>
> #include <linux/sched/mm.h>
> #include <asm/unaligned.h>
> @@ -2805,6 +2806,8 @@ int open_ctree(struct super_block *sb,
> goto fail_alloc;
> }
>
> + fs_info->dax_dev = fs_dax_get_by_bdev(fs_devices->latest_bdev);
> +
> /*
> * We want to check superblock checksum, the type is stored inside.
> * Pass the whole disk block of size BTRFS_SUPER_INFO_SIZE (4k).
> @@ -4043,6 +4046,7 @@ void close_ctree(struct btrfs_fs_info *fs_info)
> #endif
>
> btrfs_close_devices(fs_info->fs_devices);
> + fs_put_dax(fs_info->dax_dev);
> btrfs_mapping_tree_free(&fs_info->mapping_tree);
>
> percpu_counter_destroy(&fs_info->dirty_metadata_bytes);
> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> index cd4e693406a0..0138119cd9a3 100644
> --- a/fs/btrfs/ioctl.c
> +++ b/fs/btrfs/ioctl.c
> @@ -149,8 +149,11 @@ void btrfs_sync_inode_flags_to_i_flags(struct inode *inode)
> if (binode->flags & BTRFS_INODE_DIRSYNC)
> new_fl |= S_DIRSYNC;
>
> + if ((btrfs_test_opt(btrfs_sb(inode->i_sb), DAX)) && S_ISREG(inode->i_mode))
> + new_fl |= S_DAX;
> +
> set_mask_bits(&inode->i_flags,
> - S_SYNC | S_APPEND | S_IMMUTABLE | S_NOATIME | S_DIRSYNC,
> + S_SYNC | S_APPEND | S_IMMUTABLE | S_NOATIME | S_DIRSYNC | S_DAX,
> new_fl);
> }
>
> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
> index 120e4340792a..3b85e61e5182 100644
> --- a/fs/btrfs/super.c
> +++ b/fs/btrfs/super.c
> @@ -326,6 +326,7 @@ enum {
> Opt_treelog, Opt_notreelog,
> Opt_usebackuproot,
> Opt_user_subvol_rm_allowed,
> + Opt_dax,
>
> /* Deprecated options */
> Opt_alloc_start,
> @@ -393,6 +394,7 @@ static const match_table_t tokens = {
> {Opt_notreelog, "notreelog"},
> {Opt_usebackuproot, "usebackuproot"},
> {Opt_user_subvol_rm_allowed, "user_subvol_rm_allowed"},
> + {Opt_dax, "dax"},
>
> /* Deprecated options */
> {Opt_alloc_start, "alloc_start=%s"},
> @@ -745,6 +747,32 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char *options,
> case Opt_user_subvol_rm_allowed:
> btrfs_set_opt(info->mount_opt, USER_SUBVOL_RM_ALLOWED);
> break;
> + case Opt_dax:
> +#ifdef CONFIG_FS_DAX
> + if (btrfs_super_num_devices(info->super_copy) > 1) {
> + btrfs_info(info,
> + "dax not supported for multi-device btrfs partition\n");
> + ret = -EOPNOTSUPP;
> + goto out;
> + }
> + btrfs_set_opt(info->mount_opt, DAX);
> + btrfs_warn(info, "DAX enabled. Warning: EXPERIMENTAL, use at your own risk\n");
> + btrfs_set_and_info(info, NODATASUM,
> + "auto-setting nodatasum (dax)");
> + btrfs_clear_opt(info->mount_opt, SPACE_CACHE);
> + btrfs_set_and_info(info, FREE_SPACE_TREE,
> + "auto-setting free space tree (dax)");
> + if (btrfs_test_opt(info, COMPRESS)) {
> + btrfs_info(info, "disabling compress (dax)");
> + btrfs_clear_opt(info->mount_opt, COMPRESS);
> + }
> + break;
> +#else
> + btrfs_err(info,
> + "DAX option not supported\n");
> + ret = -EINVAL;
> + goto out;
> +#endif
> case Opt_enospc_debug:
> btrfs_set_opt(info->mount_opt, ENOSPC_DEBUG);
> break;
> @@ -1335,6 +1363,8 @@ static int btrfs_show_options(struct seq_file *seq, struct dentry *dentry)
> seq_puts(seq, ",clear_cache");
> if (btrfs_test_opt(info, USER_SUBVOL_RM_ALLOWED))
> seq_puts(seq, ",user_subvol_rm_allowed");
> + if (btrfs_test_opt(info, DAX))
> + seq_puts(seq, ",dax");
> if (btrfs_test_opt(info, ENOSPC_DEBUG))
> seq_puts(seq, ",enospc_debug");
> if (btrfs_test_opt(info, AUTO_DEFRAG))
> --
> 2.16.4
>
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [PATCH 13/18] fs: dedup file range to use a compare function
2019-04-29 17:26 ` [PATCH 13/18] fs: dedup file range to use a compare function Goldwyn Rodrigues
@ 2019-05-21 18:17 ` Darrick J. Wong
0 siblings, 0 replies; 58+ messages in thread
From: Darrick J. Wong @ 2019-05-21 18:17 UTC (permalink / raw)
To: Goldwyn Rodrigues
Cc: linux-btrfs, kilobyte, linux-fsdevel, jack, david, willy, hch,
dsterba, nborisov, linux-nvdimm, Goldwyn Rodrigues
On Mon, Apr 29, 2019 at 12:26:44PM -0500, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues <rgoldwyn@suse.com>
>
> With dax we cannot deal with readpage() etc. So, we create a
> funciton callback to perform the file data comparison and pass
> it to generic_remap_file_range_prep() so it can use iomap-based
> functions.
>
> This may not be the best way to solve this. Suggestions welcome.
>
> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
> ---
> fs/btrfs/ctree.h | 9 ++++++++
> fs/btrfs/dax.c | 8 +++++++
> fs/btrfs/ioctl.c | 11 +++++++--
> fs/dax.c | 65 ++++++++++++++++++++++++++++++++++++++++++++++++++++
> fs/ocfs2/file.c | 2 +-
> fs/read_write.c | 11 +++++----
> fs/xfs/xfs_reflink.c | 2 +-
> include/linux/dax.h | 4 ++++
> include/linux/fs.h | 8 ++++++-
> 9 files changed, 110 insertions(+), 10 deletions(-)
>
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 2b7bdabb44f8..d3d044125619 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -3803,11 +3803,20 @@ int btree_readahead_hook(struct extent_buffer *eb, int err);
> ssize_t btrfs_file_dax_read(struct kiocb *iocb, struct iov_iter *to);
> ssize_t btrfs_file_dax_write(struct kiocb *iocb, struct iov_iter *from);
> vm_fault_t btrfs_dax_fault(struct vm_fault *vmf);
> +int btrfs_dax_file_range_compare(struct inode *src, loff_t srcoff,
> + struct inode *dest, loff_t destoff, loff_t len,
> + bool *is_same);
> #else
> static inline ssize_t btrfs_file_dax_write(struct kiocb *iocb, struct iov_iter *from)
> {
> return 0;
> }
> +static inline int btrfs_dax_file_range_compare(struct inode *src, loff_t srcoff,
> + struct inode *dest, loff_t destoff, loff_t len,
> + bool *is_same)
> +{
> + return 0;
> +}
> #endif /* CONFIG_FS_DAX */
>
> static inline int is_fstree(u64 rootid)
> diff --git a/fs/btrfs/dax.c b/fs/btrfs/dax.c
> index de957d681e16..af64696a5337 100644
> --- a/fs/btrfs/dax.c
> +++ b/fs/btrfs/dax.c
> @@ -227,4 +227,12 @@ vm_fault_t btrfs_dax_fault(struct vm_fault *vmf)
>
> return ret;
> }
> +
> +int btrfs_dax_file_range_compare(struct inode *src, loff_t srcoff,
> + struct inode *dest, loff_t destoff, loff_t len,
> + bool *is_same)
> +{
> + return dax_file_range_compare(src, srcoff, dest, destoff, len,
> + is_same, &btrfs_iomap_ops);
> +}
> #endif /* CONFIG_FS_DAX */
> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> index 0138119cd9a3..5ebb52848d5a 100644
> --- a/fs/btrfs/ioctl.c
> +++ b/fs/btrfs/ioctl.c
> @@ -3939,6 +3939,7 @@ static int btrfs_remap_file_range_prep(struct file *file_in, loff_t pos_in,
> bool same_inode = inode_out == inode_in;
> u64 wb_len;
> int ret;
> + compare_range_t cmp;
>
> if (!(remap_flags & REMAP_FILE_DEDUP)) {
> struct btrfs_root *root_out = BTRFS_I(inode_out)->root;
> @@ -4000,8 +4001,14 @@ static int btrfs_remap_file_range_prep(struct file *file_in, loff_t pos_in,
> if (ret < 0)
> goto out_unlock;
>
> - ret = generic_remap_file_range_prep(file_in, pos_in, file_out, pos_out,
> - len, remap_flags);
> + if (IS_DAX(file_inode(file_in)) && IS_DAX(file_inode(file_out)))
If we're moving towards a world where IS_DAX is a per-file condition, I
think this is going to need quite a bit more work to support doing
mixed-mode comparisons.
That, I think, could involve reworking vfs_dedupe_file_range_compare to
take a pair of (inode, iomap_ops) so that we can use the iomap
information to skip holes, avoid reading pages for uncached unwritten
ranges, etc.
TBH that sounds like a whole series on its own, so maybe we just want to
say no dedupe for now unless both files are in the page cache or both
files are DAX?
--D
> + cmp = btrfs_dax_file_range_compare;
> + else
> + cmp = vfs_dedupe_file_range_compare;
> +
> + ret = generic_remap_file_range_prep(file_in, pos_in, file_out,
> + pos_out, len, remap_flags, cmp);
> +
> if (ret < 0 || *len == 0)
> goto out_unlock;
>
> diff --git a/fs/dax.c b/fs/dax.c
> index 07e8ff20161d..fa9ccbad7c03 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -39,6 +39,8 @@
> #define CREATE_TRACE_POINTS
> #include <trace/events/fs_dax.h>
>
> +#define MIN(a, b) (((a) < (b)) ? (a) : (b))
> +
> static inline unsigned int pe_order(enum page_entry_size pe_size)
> {
> if (pe_size == PE_SIZE_PTE)
> @@ -1795,3 +1797,66 @@ vm_fault_t dax_finish_sync_fault(struct vm_fault *vmf,
> return dax_insert_pfn_mkwrite(vmf, pfn, order);
> }
> EXPORT_SYMBOL_GPL(dax_finish_sync_fault);
> +
> +static inline void *iomap_address(struct iomap *iomap, loff_t off, loff_t len)
> +{
> + loff_t start;
> + void *addr;
> + start = (get_start_sect(iomap->bdev) << 9) + iomap->addr +
> + (off - iomap->offset);
> + dax_direct_access(iomap->dax_dev, PHYS_PFN(start), PHYS_PFN(len),
> + &addr, NULL);
> + return addr;
> +}
> +
> +int dax_file_range_compare(struct inode *src, loff_t srcoff, struct inode *dest,
> + loff_t destoff, loff_t len, bool *is_same, const struct iomap_ops *ops)
> +{
> + void *saddr, *daddr;
> + struct iomap s_iomap = {0};
> + struct iomap d_iomap = {0};
> + bool same = true;
> + loff_t cmp_len;
> + int id, ret = 0;
> +
> + id = dax_read_lock();
> + while (len) {
> + ret = ops->iomap_begin(src, srcoff, len, 0, &s_iomap);
> + if (ret < 0) {
> + if (ops->iomap_end)
> + ops->iomap_end(src, srcoff, len, ret, 0, &s_iomap);
> + return ret;
> + }
> + cmp_len = len;
> + cmp_len = MIN(len, s_iomap.offset + s_iomap.length - srcoff);
> +
> + ret = ops->iomap_begin(dest, destoff, cmp_len, 0, &d_iomap);
> + if (ret < 0) {
> + if (ops->iomap_end) {
> + ops->iomap_end(src, srcoff, len, ret, 0, &s_iomap);
> + ops->iomap_end(dest, destoff, len, ret, 0, &d_iomap);
> + }
> + return ret;
> + }
> + cmp_len = MIN(cmp_len, d_iomap.offset + d_iomap.length - destoff);
> +
> + saddr = iomap_address(&s_iomap, srcoff, cmp_len);
> + daddr = iomap_address(&d_iomap, destoff, cmp_len);
> +
> + same = !memcmp(saddr, daddr, cmp_len);
> + if (!same)
> + break;
> + len -= cmp_len;
> + srcoff += cmp_len;
> + destoff += cmp_len;
> +
> + if (ops->iomap_end) {
> + ret = ops->iomap_end(src, srcoff, len, 0, 0, &s_iomap);
> + ret = ops->iomap_end(dest, destoff, len, 0, 0, &d_iomap);
> + }
> + }
> + dax_read_unlock(id);
> + *is_same = same;
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(dax_file_range_compare);
> diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
> index d640c5f8a85d..9d470306cfc3 100644
> --- a/fs/ocfs2/file.c
> +++ b/fs/ocfs2/file.c
> @@ -2558,7 +2558,7 @@ static loff_t ocfs2_remap_file_range(struct file *file_in, loff_t pos_in,
> goto out_unlock;
>
> ret = generic_remap_file_range_prep(file_in, pos_in, file_out, pos_out,
> - &len, remap_flags);
> + &len, remap_flags, vfs_dedupe_file_range_compare);
> if (ret < 0 || len == 0)
> goto out_unlock;
>
> diff --git a/fs/read_write.c b/fs/read_write.c
> index 61b43ad7608e..c6283802ef1c 100644
> --- a/fs/read_write.c
> +++ b/fs/read_write.c
> @@ -1778,7 +1778,7 @@ static struct page *vfs_dedupe_get_page(struct inode *inode, loff_t offset)
> * Compare extents of two files to see if they are the same.
> * Caller must have locked both inodes to prevent write races.
> */
> -static int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
> +int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
> struct inode *dest, loff_t destoff,
> loff_t len, bool *is_same)
> {
> @@ -1845,6 +1845,7 @@ static int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
> out_error:
> return error;
> }
> +EXPORT_SYMBOL_GPL(vfs_dedupe_file_range_compare);
>
> /*
> * Check that the two inodes are eligible for cloning, the ranges make
> @@ -1856,7 +1857,8 @@ static int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
> */
> int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
> struct file *file_out, loff_t pos_out,
> - loff_t *len, unsigned int remap_flags)
> + loff_t *len, unsigned int remap_flags,
> + compare_range_t compare)
> {
> struct inode *inode_in = file_inode(file_in);
> struct inode *inode_out = file_inode(file_out);
> @@ -1915,9 +1917,8 @@ int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
> */
> if (remap_flags & REMAP_FILE_DEDUP) {
> bool is_same = false;
> -
> - ret = vfs_dedupe_file_range_compare(inode_in, pos_in,
> - inode_out, pos_out, *len, &is_same);
> + ret = (*compare)(inode_in, pos_in,
> + inode_out, pos_out, *len, &is_same);
> if (ret)
> return ret;
> if (!is_same)
> diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> index 680ae7662a78..68e4257cebb0 100644
> --- a/fs/xfs/xfs_reflink.c
> +++ b/fs/xfs/xfs_reflink.c
> @@ -1350,7 +1350,7 @@ xfs_reflink_remap_prep(
> goto out_unlock;
>
> ret = generic_remap_file_range_prep(file_in, pos_in, file_out, pos_out,
> - len, remap_flags);
> + len, remap_flags, vfs_dedupe_file_range_compare);
> if (ret < 0 || *len == 0)
> goto out_unlock;
>
> diff --git a/include/linux/dax.h b/include/linux/dax.h
> index 0dd316a74a29..1370d39c91b6 100644
> --- a/include/linux/dax.h
> +++ b/include/linux/dax.h
> @@ -157,6 +157,10 @@ vm_fault_t dax_finish_sync_fault(struct vm_fault *vmf,
> int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index);
> int dax_invalidate_mapping_entry_sync(struct address_space *mapping,
> pgoff_t index);
> +int dax_file_range_compare(struct inode *src, loff_t srcoff,
> + struct inode *dest, loff_t destoff,
> + loff_t len, bool *is_same,
> + const struct iomap_ops *ops);
>
> #ifdef CONFIG_FS_DAX
> int __dax_zero_page_range(struct block_device *bdev,
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index dd28e7679089..0224503e42ce 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1883,10 +1883,16 @@ extern ssize_t vfs_readv(struct file *, const struct iovec __user *,
> unsigned long, loff_t *, rwf_t);
> extern ssize_t vfs_copy_file_range(struct file *, loff_t , struct file *,
> loff_t, size_t, unsigned int);
> +extern int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
> + struct inode *dest, loff_t destoff,
> + loff_t len, bool *is_same);
> +typedef int (*compare_range_t)(struct inode *src, loff_t srcpos,
> + struct inode *dest, loff_t destpos,
> + loff_t len, bool *is_same);
> extern int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
> struct file *file_out, loff_t pos_out,
> loff_t *count,
> - unsigned int remap_flags);
> + unsigned int remap_flags, compare_range_t cmp);
> extern loff_t do_clone_file_range(struct file *file_in, loff_t pos_in,
> struct file *file_out, loff_t pos_out,
> loff_t len, unsigned int remap_flags);
> --
> 2.16.4
>
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [PATCH 08/18] dax: memcpy page in case of IOMAP_DAX_COW for mmap faults
2019-05-21 17:46 ` Darrick J. Wong
@ 2019-05-22 19:11 ` Goldwyn Rodrigues
2019-05-23 4:02 ` Darrick J. Wong
2019-05-23 12:10 ` Jan Kara
1 sibling, 1 reply; 58+ messages in thread
From: Goldwyn Rodrigues @ 2019-05-22 19:11 UTC (permalink / raw)
To: Darrick J. Wong
Cc: linux-btrfs, kilobyte, linux-fsdevel, jack, david, willy, hch,
dsterba, nborisov, linux-nvdimm
On 10:46 21/05, Darrick J. Wong wrote:
> On Mon, Apr 29, 2019 at 12:26:39PM -0500, Goldwyn Rodrigues wrote:
> > From: Goldwyn Rodrigues <rgoldwyn@suse.com>
> >
> > Change dax_iomap_pfn to return the address as well in order to
> > use it for performing a memcpy in case the type is IOMAP_DAX_COW.
> > We don't handle PMD because btrfs does not support hugepages.
> >
> > Question:
> > The sequence of bdev_dax_pgoff() and dax_direct_access() is
> > used multiple times to calculate address and pfn's. Would it make
> > sense to call it while calculating address as well to reduce code?
> >
> > Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
> > ---
> > fs/dax.c | 19 +++++++++++++++----
> > 1 file changed, 15 insertions(+), 4 deletions(-)
> >
> > diff --git a/fs/dax.c b/fs/dax.c
> > index 610bfa861a28..718b1632a39d 100644
> > --- a/fs/dax.c
> > +++ b/fs/dax.c
> > @@ -984,7 +984,7 @@ static sector_t dax_iomap_sector(struct iomap *iomap, loff_t pos)
> > }
> >
> > static int dax_iomap_pfn(struct iomap *iomap, loff_t pos, size_t size,
> > - pfn_t *pfnp)
> > + pfn_t *pfnp, void **addr)
> > {
> > const sector_t sector = dax_iomap_sector(iomap, pos);
> > pgoff_t pgoff;
> > @@ -996,7 +996,7 @@ static int dax_iomap_pfn(struct iomap *iomap, loff_t pos, size_t size,
> > return rc;
> > id = dax_read_lock();
> > length = dax_direct_access(iomap->dax_dev, pgoff, PHYS_PFN(size),
> > - NULL, pfnp);
> > + addr, pfnp);
> > if (length < 0) {
> > rc = length;
> > goto out;
> > @@ -1286,6 +1286,7 @@ static vm_fault_t dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp,
> > XA_STATE(xas, &mapping->i_pages, vmf->pgoff);
> > struct inode *inode = mapping->host;
> > unsigned long vaddr = vmf->address;
> > + void *addr;
> > loff_t pos = (loff_t)vmf->pgoff << PAGE_SHIFT;
> > struct iomap iomap = { 0 };
>
> Ugh, I had forgotten that fs/dax.c open-codes iomap_apply, probably
> because the actor returns vm_fault_t, not bytes copied. I guess that
> makes it a tiny bit more complicated to pass in two (struct iomap *) to
> the iomap_begin function...
I am not sure I understand this. We do not use iomap_apply() in
the fault path: dax_iomap_pte_fault(). We just use iomap_begin()
and iomap_end(). So, why can we not implement your idea of using two
iomaps? What does open-coding iomap-apply mean?
--
Goldwyn
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [PATCH 04/18] dax: Introduce IOMAP_DAX_COW to CoW edges during writes
2019-05-21 16:51 ` Darrick J. Wong
@ 2019-05-22 20:14 ` Goldwyn Rodrigues
2019-05-23 2:10 ` Dave Chinner
2019-05-23 9:05 ` Shiyang Ruan
1 sibling, 1 reply; 58+ messages in thread
From: Goldwyn Rodrigues @ 2019-05-22 20:14 UTC (permalink / raw)
To: Darrick J. Wong
Cc: linux-btrfs, kilobyte, linux-fsdevel, jack, david, willy, hch,
dsterba, nborisov, linux-nvdimm
On 9:51 21/05, Darrick J. Wong wrote:
> On Mon, Apr 29, 2019 at 12:26:35PM -0500, Goldwyn Rodrigues wrote:
> > From: Goldwyn Rodrigues <rgoldwyn@suse.com>
> >
> > The IOMAP_DAX_COW is a iomap type which performs copy of
> > edges of data while performing a write if start/end are
> > not page aligned. The source address is expected in
> > iomap->inline_data.
> >
> > dax_copy_edges() is a helper functions performs a copy from
> > one part of the device to another for data not page aligned.
> > If iomap->inline_data is NULL, it memset's the area to zero.
> >
> > Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
> > ---
> > fs/dax.c | 46 +++++++++++++++++++++++++++++++++++++++++++++-
> > include/linux/iomap.h | 1 +
> > 2 files changed, 46 insertions(+), 1 deletion(-)
> >
> > diff --git a/fs/dax.c b/fs/dax.c
> > index e5e54da1715f..610bfa861a28 100644
> > --- a/fs/dax.c
> > +++ b/fs/dax.c
> > @@ -1084,6 +1084,42 @@ int __dax_zero_page_range(struct block_device *bdev,
> > }
> > EXPORT_SYMBOL_GPL(__dax_zero_page_range);
> >
> > +/*
> > + * dax_copy_edges - Copies the part of the pages not included in
> > + * the write, but required for CoW because
> > + * offset/offset+length are not page aligned.
> > + */
> > +static int dax_copy_edges(struct inode *inode, loff_t pos, loff_t length,
> > + struct iomap *iomap, void *daddr)
> > +{
> > + unsigned offset = pos & (PAGE_SIZE - 1);
> > + loff_t end = pos + length;
> > + loff_t pg_end = round_up(end, PAGE_SIZE);
> > + void *saddr = iomap->inline_data;
> > + int ret = 0;
> > + /*
> > + * Copy the first part of the page
> > + * Note: we pass offset as length
> > + */
> > + if (offset) {
> > + if (saddr)
> > + ret = memcpy_mcsafe(daddr, saddr, offset);
> > + else
> > + memset(daddr, 0, offset);
> > + }
> > +
> > + /* Copy the last part of the range */
> > + if (end < pg_end) {
> > + if (saddr)
> > + ret = memcpy_mcsafe(daddr + offset + length,
> > + saddr + offset + length, pg_end - end);
> > + else
> > + memset(daddr + offset + length, 0,
> > + pg_end - end);
> > + }
> > + return ret;
> > +}
> > +
> > static loff_t
> > dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
> > struct iomap *iomap)
> > @@ -1105,9 +1141,11 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
> > return iov_iter_zero(min(length, end - pos), iter);
> > }
> >
> > - if (WARN_ON_ONCE(iomap->type != IOMAP_MAPPED))
> > + if (WARN_ON_ONCE(iomap->type != IOMAP_MAPPED
> > + && iomap->type != IOMAP_DAX_COW))
>
> I reiterate (from V3) that the && goes on the previous line...
>
> if (WARN_ON_ONCE(iomap->type != IOMAP_MAPPED &&
> iomap->type != IOMAP_DAX_COW))
>
> > return -EIO;
> >
> > +
> > /*
> > * Write can allocate block for an area which has a hole page mapped
> > * into page tables. We have to tear down these mappings so that data
> > @@ -1144,6 +1182,12 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
> > break;
> > }
> >
> > + if (iomap->type == IOMAP_DAX_COW) {
> > + ret = dax_copy_edges(inode, pos, length, iomap, kaddr);
> > + if (ret)
> > + break;
> > + }
> > +
> > map_len = PFN_PHYS(map_len);
> > kaddr += offset;
> > map_len -= offset;
> > diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> > index 0fefb5455bda..6e885c5a38a3 100644
> > --- a/include/linux/iomap.h
> > +++ b/include/linux/iomap.h
> > @@ -25,6 +25,7 @@ struct vm_fault;
> > #define IOMAP_MAPPED 0x03 /* blocks allocated at @addr */
> > #define IOMAP_UNWRITTEN 0x04 /* blocks allocated at @addr in unwritten state */
> > #define IOMAP_INLINE 0x05 /* data inline in the inode */
>
> > +#define IOMAP_DAX_COW 0x06
>
> DAX isn't going to be the only scenario where we need a way to
> communicate to iomap actors the need to implement copy on write.
>
> XFS also uses struct iomap to hand out file leases to clients. The
> lease code /currently/ doesn't support files with shared blocks (because
> the only user is pNFS) but one could easily imagine a future where some
> client wants to lease a file with shared blocks, in which case XFS will
> want to convey the COW details to the lessee.
>
> > +/* Copy data pointed by inline_data before write*/
>
> A month ago during the V3 patchset review, I wrote (possibly in an other
> thread, sorry) about something that I'm putting my foot down about now
> for the V4 patchset, which is the {re,ab}use of @inline_data for the
> data source address.
Looks like I missed this.
>
> We cannot use @inline_data to convey the source address. @inline_data
> (so far) is used to point to the in-memory representation of the storage
> described by @addr. For data writes, @addr is the location of the write
> on disk and @inline_data is the location of the write in memory.
>
> Reusing @inline_data here to point to the location of the source data in
> memory is a totally different thing and will likely result in confusion.
> On a practical level, this also means that we cannot support the case of
> COW && INLINE because the type codes collide and so would the users of
> @inline_data. This isn't required *right now*, but if you had a pmem
> filesystem that stages inode updates in memory and flips a pointer to
> commit changes then the ->iomap_begin function will need to convey two
> pointers at once.
>
> So this brings us back to Dave's suggestion during the V1 patchset
> review that instead of adding more iomap flags/types and overloading
> fields, we simply pass two struct iomaps into ->iomap_begin:
Actually, Dave is the one who suggested to perform it this way.
https://patchwork.kernel.org/comment/22562195/
>
> - Change iomap_apply() to "struct iomap iomap[2] = 0;" and pass
> &iomap[0] into the ->iomap_begin and ->iomap_end functions. The
> first iomap will be filled in with the destination for the write (as
> all implementations do now), and the second iomap can be filled in
> with the source information for a COW operation.
>
> - If the ->iomap_begin implementation decides that COW is necessary for
> the requested operation, then it should fill out that second iomap
> with information about the extent that the actor must copied before
> returning. The second iomap's offset and length must match the
> first. If COW isn't necessary, the ->iomap_begin implementation
> ignores it, and the second iomap retains type == 0 (i.e. invalid
> mapping).
>
> Proceeding along these lines will (AFAICT) still allow you to enable all
> the btrfs functionality in the rest of this patchset while making the
> task of wiring up XFS fairly simple. No overloaded fields and no new
> flags.
>
> This is how I'd like to see this patchset should proceed to V5. Does
> that make sense?
Yes, I think this would be a more flexible design as well if we ever
decide to extend it beyond dax.
We would still need a IOMAP_COW type set in iomap[0].
--
Goldwyn
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [PATCH 03/18] btrfs: basic dax read
2019-05-21 15:14 ` Darrick J. Wong
@ 2019-05-22 21:50 ` Goldwyn Rodrigues
0 siblings, 0 replies; 58+ messages in thread
From: Goldwyn Rodrigues @ 2019-05-22 21:50 UTC (permalink / raw)
To: Darrick J. Wong
Cc: linux-btrfs, kilobyte, linux-fsdevel, jack, david, willy, hch,
dsterba, nborisov, linux-nvdimm
On 8:14 21/05, Darrick J. Wong wrote:
> On Mon, Apr 29, 2019 at 12:26:34PM -0500, Goldwyn Rodrigues wrote:
> > From: Goldwyn Rodrigues <rgoldwyn@suse.com>
> >
> > Perform a basic read using iomap support. The btrfs_iomap_begin()
> > finds the extent at the position and fills the iomap data
> > structure with the values.
> >
> > Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
> > ---
> > fs/btrfs/Makefile | 1 +
> > fs/btrfs/ctree.h | 5 +++++
> > fs/btrfs/dax.c | 49 +++++++++++++++++++++++++++++++++++++++++++++++++
> > fs/btrfs/file.c | 11 ++++++++++-
> > 4 files changed, 65 insertions(+), 1 deletion(-)
> > create mode 100644 fs/btrfs/dax.c
> >
> > diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
> > index ca693dd554e9..1fa77b875ae9 100644
> > --- a/fs/btrfs/Makefile
> > +++ b/fs/btrfs/Makefile
> > @@ -12,6 +12,7 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o root-tree.o dir-item.o \
> > reada.o backref.o ulist.o qgroup.o send.o dev-replace.o raid56.o \
> > uuid-tree.o props.o free-space-tree.o tree-checker.o
> >
> > +btrfs-$(CONFIG_FS_DAX) += dax.o
> > btrfs-$(CONFIG_BTRFS_FS_POSIX_ACL) += acl.o
> > btrfs-$(CONFIG_BTRFS_FS_CHECK_INTEGRITY) += check-integrity.o
> > btrfs-$(CONFIG_BTRFS_FS_REF_VERIFY) += ref-verify.o
> > diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> > index 9512f49262dd..b7bbe5130a3b 100644
> > --- a/fs/btrfs/ctree.h
> > +++ b/fs/btrfs/ctree.h
> > @@ -3795,6 +3795,11 @@ int btrfs_reada_wait(void *handle);
> > void btrfs_reada_detach(void *handle);
> > int btree_readahead_hook(struct extent_buffer *eb, int err);
> >
> > +#ifdef CONFIG_FS_DAX
> > +/* dax.c */
> > +ssize_t btrfs_file_dax_read(struct kiocb *iocb, struct iov_iter *to);
> > +#endif /* CONFIG_FS_DAX */
> > +
> > static inline int is_fstree(u64 rootid)
> > {
> > if (rootid == BTRFS_FS_TREE_OBJECTID ||
> > diff --git a/fs/btrfs/dax.c b/fs/btrfs/dax.c
> > new file mode 100644
> > index 000000000000..bf3d46b0acb6
> > --- /dev/null
> > +++ b/fs/btrfs/dax.c
> > @@ -0,0 +1,49 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +/*
> > + * DAX support for BTRFS
> > + *
> > + * Copyright (c) 2019 SUSE Linux
> > + * Author: Goldwyn Rodrigues <rgoldwyn@suse.com>
> > + */
> > +
> > +#ifdef CONFIG_FS_DAX
> > +#include <linux/dax.h>
> > +#include <linux/iomap.h>
> > +#include "ctree.h"
> > +#include "btrfs_inode.h"
> > +
> > +static int btrfs_iomap_begin(struct inode *inode, loff_t pos,
> > + loff_t length, unsigned flags, struct iomap *iomap)
> > +{
> > + struct extent_map *em;
> > + struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
> > + em = btrfs_get_extent(BTRFS_I(inode), NULL, 0, pos, length, 0);
> > + if (em->block_start == EXTENT_MAP_HOLE) {
> > + iomap->type = IOMAP_HOLE;
> > + return 0;
>
> I'm not doing a rigorous review of the btrfs-specific pieces, but you're
> required to fill out the other iomap fields for a read hole.
I fixed this in the patch adding write support. However, this
looks wrong as a patch. I will fix this.
--
Goldwyn
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [PATCH 04/18] dax: Introduce IOMAP_DAX_COW to CoW edges during writes
2019-05-22 20:14 ` Goldwyn Rodrigues
@ 2019-05-23 2:10 ` Dave Chinner
0 siblings, 0 replies; 58+ messages in thread
From: Dave Chinner @ 2019-05-23 2:10 UTC (permalink / raw)
To: Goldwyn Rodrigues
Cc: Darrick J. Wong, linux-btrfs, kilobyte, linux-fsdevel, jack,
willy, hch, dsterba, nborisov, linux-nvdimm
On Wed, May 22, 2019 at 03:14:47PM -0500, Goldwyn Rodrigues wrote:
> On 9:51 21/05, Darrick J. Wong wrote:
> > On Mon, Apr 29, 2019 at 12:26:35PM -0500, Goldwyn Rodrigues wrote:
> > We cannot use @inline_data to convey the source address. @inline_data
> > (so far) is used to point to the in-memory representation of the storage
> > described by @addr. For data writes, @addr is the location of the write
> > on disk and @inline_data is the location of the write in memory.
> >
> > Reusing @inline_data here to point to the location of the source data in
> > memory is a totally different thing and will likely result in confusion.
> > On a practical level, this also means that we cannot support the case of
> > COW && INLINE because the type codes collide and so would the users of
> > @inline_data. This isn't required *right now*, but if you had a pmem
> > filesystem that stages inode updates in memory and flips a pointer to
> > commit changes then the ->iomap_begin function will need to convey two
> > pointers at once.
> >
> > So this brings us back to Dave's suggestion during the V1 patchset
> > review that instead of adding more iomap flags/types and overloading
> > fields, we simply pass two struct iomaps into ->iomap_begin:
>
> Actually, Dave is the one who suggested to perform it this way.
> https://patchwork.kernel.org/comment/22562195/
My first suggestion was to use two iomaps. This suggestion came
later, as a way of demonstrating that a different type could be used
to redefine what ->inline_data was used for, if people considered
that an acceptible solution.
What was apparent from other discussions in the thread you quote was
that using two iomaps looked to be the better, more general approach
to solving the iomap read-modify-write issue at hand.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [PATCH 08/18] dax: memcpy page in case of IOMAP_DAX_COW for mmap faults
2019-05-22 19:11 ` Goldwyn Rodrigues
@ 2019-05-23 4:02 ` Darrick J. Wong
0 siblings, 0 replies; 58+ messages in thread
From: Darrick J. Wong @ 2019-05-23 4:02 UTC (permalink / raw)
To: Goldwyn Rodrigues
Cc: linux-btrfs, kilobyte, linux-fsdevel, jack, david, willy, hch,
dsterba, nborisov, linux-nvdimm
On Wed, May 22, 2019 at 02:11:39PM -0500, Goldwyn Rodrigues wrote:
> On 10:46 21/05, Darrick J. Wong wrote:
> > On Mon, Apr 29, 2019 at 12:26:39PM -0500, Goldwyn Rodrigues wrote:
> > > From: Goldwyn Rodrigues <rgoldwyn@suse.com>
> > >
> > > Change dax_iomap_pfn to return the address as well in order to
> > > use it for performing a memcpy in case the type is IOMAP_DAX_COW.
> > > We don't handle PMD because btrfs does not support hugepages.
> > >
> > > Question:
> > > The sequence of bdev_dax_pgoff() and dax_direct_access() is
> > > used multiple times to calculate address and pfn's. Would it make
> > > sense to call it while calculating address as well to reduce code?
> > >
> > > Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
> > > ---
> > > fs/dax.c | 19 +++++++++++++++----
> > > 1 file changed, 15 insertions(+), 4 deletions(-)
> > >
> > > diff --git a/fs/dax.c b/fs/dax.c
> > > index 610bfa861a28..718b1632a39d 100644
> > > --- a/fs/dax.c
> > > +++ b/fs/dax.c
> > > @@ -984,7 +984,7 @@ static sector_t dax_iomap_sector(struct iomap *iomap, loff_t pos)
> > > }
> > >
> > > static int dax_iomap_pfn(struct iomap *iomap, loff_t pos, size_t size,
> > > - pfn_t *pfnp)
> > > + pfn_t *pfnp, void **addr)
> > > {
> > > const sector_t sector = dax_iomap_sector(iomap, pos);
> > > pgoff_t pgoff;
> > > @@ -996,7 +996,7 @@ static int dax_iomap_pfn(struct iomap *iomap, loff_t pos, size_t size,
> > > return rc;
> > > id = dax_read_lock();
> > > length = dax_direct_access(iomap->dax_dev, pgoff, PHYS_PFN(size),
> > > - NULL, pfnp);
> > > + addr, pfnp);
> > > if (length < 0) {
> > > rc = length;
> > > goto out;
> > > @@ -1286,6 +1286,7 @@ static vm_fault_t dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp,
> > > XA_STATE(xas, &mapping->i_pages, vmf->pgoff);
> > > struct inode *inode = mapping->host;
> > > unsigned long vaddr = vmf->address;
> > > + void *addr;
> > > loff_t pos = (loff_t)vmf->pgoff << PAGE_SHIFT;
> > > struct iomap iomap = { 0 };
> >
> > Ugh, I had forgotten that fs/dax.c open-codes iomap_apply, probably
> > because the actor returns vm_fault_t, not bytes copied. I guess that
> > makes it a tiny bit more complicated to pass in two (struct iomap *) to
> > the iomap_begin function...
>
> I am not sure I understand this. We do not use iomap_apply() in
> the fault path: dax_iomap_pte_fault(). We just use iomap_begin()
> and iomap_end(). So, why can we not implement your idea of using two
> iomaps?
Oh, sorry, I wasn't trying to say that calling ->iomap_begin made it
*impossible* to implement. I was merely complaining about the increased
maintenance burden that results from open coding -- now there are three
places where we have to change a struct iomap declaration, not one
(iomap_apply) as I had originally thought.
> What does open-coding iomap-apply mean?
Any function that calls (1) ->iomap_begin; (2) performs an action on the
returned iomap; and (3) then calls calling ->iomap_end. That's what
iomap_apply() does.
Really I'm just being maintainer-cranky. Ignore me for now. :)
--D
>
> --
> Goldwyn
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [PATCH 04/18] dax: Introduce IOMAP_DAX_COW to CoW edges during writes
2019-05-21 16:51 ` Darrick J. Wong
2019-05-22 20:14 ` Goldwyn Rodrigues
@ 2019-05-23 9:05 ` Shiyang Ruan
2019-05-23 11:51 ` Goldwyn Rodrigues
1 sibling, 1 reply; 58+ messages in thread
From: Shiyang Ruan @ 2019-05-23 9:05 UTC (permalink / raw)
To: Darrick J. Wong, Goldwyn Rodrigues
Cc: linux-btrfs, kilobyte, linux-fsdevel, jack, david, willy, hch,
dsterba, nborisov, linux-nvdimm, Goldwyn Rodrigues
On 5/22/19 12:51 AM, Darrick J. Wong wrote:
> On Mon, Apr 29, 2019 at 12:26:35PM -0500, Goldwyn Rodrigues wrote:
>> From: Goldwyn Rodrigues <rgoldwyn@suse.com>
>>
>> The IOMAP_DAX_COW is a iomap type which performs copy of
>> edges of data while performing a write if start/end are
>> not page aligned. The source address is expected in
>> iomap->inline_data.
>>
>> dax_copy_edges() is a helper functions performs a copy from
>> one part of the device to another for data not page aligned.
>> If iomap->inline_data is NULL, it memset's the area to zero.
>>
>> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
>> ---
>> fs/dax.c | 46 +++++++++++++++++++++++++++++++++++++++++++++-
>> include/linux/iomap.h | 1 +
>> 2 files changed, 46 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/dax.c b/fs/dax.c
>> index e5e54da1715f..610bfa861a28 100644
>> --- a/fs/dax.c
>> +++ b/fs/dax.c
>> @@ -1084,6 +1084,42 @@ int __dax_zero_page_range(struct block_device *bdev,
>> }
>> EXPORT_SYMBOL_GPL(__dax_zero_page_range);
>>
>> +/*
>> + * dax_copy_edges - Copies the part of the pages not included in
>> + * the write, but required for CoW because
>> + * offset/offset+length are not page aligned.
>> + */
>> +static int dax_copy_edges(struct inode *inode, loff_t pos, loff_t length,
>> + struct iomap *iomap, void *daddr)
>> +{
>> + unsigned offset = pos & (PAGE_SIZE - 1);
>> + loff_t end = pos + length;
>> + loff_t pg_end = round_up(end, PAGE_SIZE);
>> + void *saddr = iomap->inline_data;
>> + int ret = 0;
>> + /*
>> + * Copy the first part of the page
>> + * Note: we pass offset as length
>> + */
>> + if (offset) {
>> + if (saddr)
>> + ret = memcpy_mcsafe(daddr, saddr, offset);
>> + else
>> + memset(daddr, 0, offset);
>> + }
>> +
>> + /* Copy the last part of the range */
>> + if (end < pg_end) {
>> + if (saddr)
>> + ret = memcpy_mcsafe(daddr + offset + length,
>> + saddr + offset + length, pg_end - end);
>> + else
>> + memset(daddr + offset + length, 0,
>> + pg_end - end);
>> + }
>> + return ret;
>> +}
>> +
>> static loff_t
>> dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
>> struct iomap *iomap)
>> @@ -1105,9 +1141,11 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
>> return iov_iter_zero(min(length, end - pos), iter);
>> }
>>
>> - if (WARN_ON_ONCE(iomap->type != IOMAP_MAPPED))
>> + if (WARN_ON_ONCE(iomap->type != IOMAP_MAPPED
>> + && iomap->type != IOMAP_DAX_COW))
>
> I reiterate (from V3) that the && goes on the previous line...
>
> if (WARN_ON_ONCE(iomap->type != IOMAP_MAPPED &&
> iomap->type != IOMAP_DAX_COW))
>
>> return -EIO;
>>
>> +
>> /*
>> * Write can allocate block for an area which has a hole page mapped
>> * into page tables. We have to tear down these mappings so that data
>> @@ -1144,6 +1182,12 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
>> break;
>> }
>>
>> + if (iomap->type == IOMAP_DAX_COW) {
>> + ret = dax_copy_edges(inode, pos, length, iomap, kaddr);
>> + if (ret)
>> + break;
>> + }
>> +
>> map_len = PFN_PHYS(map_len);
>> kaddr += offset;
>> map_len -= offset;
>> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
>> index 0fefb5455bda..6e885c5a38a3 100644
>> --- a/include/linux/iomap.h
>> +++ b/include/linux/iomap.h
>> @@ -25,6 +25,7 @@ struct vm_fault;
>> #define IOMAP_MAPPED 0x03 /* blocks allocated at @addr */
>> #define IOMAP_UNWRITTEN 0x04 /* blocks allocated at @addr in unwritten state */
>> #define IOMAP_INLINE 0x05 /* data inline in the inode */
>
>> +#define IOMAP_DAX_COW 0x06
>
> DAX isn't going to be the only scenario where we need a way to
> communicate to iomap actors the need to implement copy on write.
>
> XFS also uses struct iomap to hand out file leases to clients. The
> lease code /currently/ doesn't support files with shared blocks (because
> the only user is pNFS) but one could easily imagine a future where some
> client wants to lease a file with shared blocks, in which case XFS will
> want to convey the COW details to the lessee.
>
>> +/* Copy data pointed by inline_data before write*/
>
> A month ago during the V3 patchset review, I wrote (possibly in an other
> thread, sorry) about something that I'm putting my foot down about now
> for the V4 patchset, which is the {re,ab}use of @inline_data for the
> data source address.
>
> We cannot use @inline_data to convey the source address. @inline_data
> (so far) is used to point to the in-memory representation of the storage
> described by @addr. For data writes, @addr is the location of the write
> on disk and @inline_data is the location of the write in memory.
>
> Reusing @inline_data here to point to the location of the source data in
> memory is a totally different thing and will likely result in confusion.
> On a practical level, this also means that we cannot support the case of
> COW && INLINE because the type codes collide and so would the users of
> @inline_data. This isn't required *right now*, but if you had a pmem
> filesystem that stages inode updates in memory and flips a pointer to
> commit changes then the ->iomap_begin function will need to convey two
> pointers at once.
>
> So this brings us back to Dave's suggestion during the V1 patchset
> review that instead of adding more iomap flags/types and overloading
> fields, we simply pass two struct iomaps into ->iomap_begin:
>
> - Change iomap_apply() to "struct iomap iomap[2] = 0;" and pass
> &iomap[0] into the ->iomap_begin and ->iomap_end functions. The
> first iomap will be filled in with the destination for the write (as
> all implementations do now), and the second iomap can be filled in
> with the source information for a COW operation.
>
> - If the ->iomap_begin implementation decides that COW is necessary for
> the requested operation, then it should fill out that second iomap
> with information about the extent that the actor must copied before
> returning. The second iomap's offset and length must match the
> first. If COW isn't necessary, the ->iomap_begin implementation
Hi,
I'm working on reflink & dax in XFS, here are some thoughts on this:
As mentioned above: the second iomap's offset and length must match the
first. I thought so at the beginning, but later found that the only
difference between these two iomaps is @addr. So, what about adding a
@saddr, which means the source address of COW extent, into the struct
iomap. The ->iomap_begin() fills @saddr if the extent is COW, and 0 if
not. Then handle this @saddr in each ->actor(). No more modifications
in other functions.
My RFC patchset[1] is implemented in this way and works for me, though
it is far away from perfectness.
[1]: https://patchwork.kernel.org/cover/10904307/
--
Thanks,
Shiyang Ruan.
> ignores it, and the second iomap retains type == 0 (i.e. invalid
> mapping).
>
> Proceeding along these lines will (AFAICT) still allow you to enable all
> the btrfs functionality in the rest of this patchset while making the
> task of wiring up XFS fairly simple. No overloaded fields and no new
> flags.
>
> This is how I'd like to see this patchset should proceed to V5. Does
> that make sense?
>
> --D
>
>>
>> /*
>> * Flags for all iomap mappings:
>> --
>> 2.16.4
>>
>
>
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [PATCH 04/18] dax: Introduce IOMAP_DAX_COW to CoW edges during writes
2019-05-23 9:05 ` Shiyang Ruan
@ 2019-05-23 11:51 ` Goldwyn Rodrigues
2019-05-27 8:25 ` Shiyang Ruan
0 siblings, 1 reply; 58+ messages in thread
From: Goldwyn Rodrigues @ 2019-05-23 11:51 UTC (permalink / raw)
To: Shiyang Ruan
Cc: Darrick J. Wong, linux-btrfs, kilobyte, linux-fsdevel, jack,
david, willy, hch, dsterba, nborisov, linux-nvdimm
On 17:05 23/05, Shiyang Ruan wrote:
>
>
> On 5/22/19 12:51 AM, Darrick J. Wong wrote:
> > On Mon, Apr 29, 2019 at 12:26:35PM -0500, Goldwyn Rodrigues wrote:
> > > From: Goldwyn Rodrigues <rgoldwyn@suse.com>
> > >
> > > The IOMAP_DAX_COW is a iomap type which performs copy of
> > > edges of data while performing a write if start/end are
> > > not page aligned. The source address is expected in
> > > iomap->inline_data.
> > >
> > > dax_copy_edges() is a helper functions performs a copy from
> > > one part of the device to another for data not page aligned.
> > > If iomap->inline_data is NULL, it memset's the area to zero.
> > >
> > > Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
> > > ---
> > > fs/dax.c | 46 +++++++++++++++++++++++++++++++++++++++++++++-
> > > include/linux/iomap.h | 1 +
> > > 2 files changed, 46 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/fs/dax.c b/fs/dax.c
> > > index e5e54da1715f..610bfa861a28 100644
> > > --- a/fs/dax.c
> > > +++ b/fs/dax.c
> > > @@ -1084,6 +1084,42 @@ int __dax_zero_page_range(struct block_device *bdev,
> > > }
> > > EXPORT_SYMBOL_GPL(__dax_zero_page_range);
> > > +/*
> > > + * dax_copy_edges - Copies the part of the pages not included in
> > > + * the write, but required for CoW because
> > > + * offset/offset+length are not page aligned.
> > > + */
> > > +static int dax_copy_edges(struct inode *inode, loff_t pos, loff_t length,
> > > + struct iomap *iomap, void *daddr)
> > > +{
> > > + unsigned offset = pos & (PAGE_SIZE - 1);
> > > + loff_t end = pos + length;
> > > + loff_t pg_end = round_up(end, PAGE_SIZE);
> > > + void *saddr = iomap->inline_data;
> > > + int ret = 0;
> > > + /*
> > > + * Copy the first part of the page
> > > + * Note: we pass offset as length
> > > + */
> > > + if (offset) {
> > > + if (saddr)
> > > + ret = memcpy_mcsafe(daddr, saddr, offset);
> > > + else
> > > + memset(daddr, 0, offset);
> > > + }
> > > +
> > > + /* Copy the last part of the range */
> > > + if (end < pg_end) {
> > > + if (saddr)
> > > + ret = memcpy_mcsafe(daddr + offset + length,
> > > + saddr + offset + length, pg_end - end);
> > > + else
> > > + memset(daddr + offset + length, 0,
> > > + pg_end - end);
> > > + }
> > > + return ret;
> > > +}
> > > +
> > > static loff_t
> > > dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
> > > struct iomap *iomap)
> > > @@ -1105,9 +1141,11 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
> > > return iov_iter_zero(min(length, end - pos), iter);
> > > }
> > > - if (WARN_ON_ONCE(iomap->type != IOMAP_MAPPED))
> > > + if (WARN_ON_ONCE(iomap->type != IOMAP_MAPPED
> > > + && iomap->type != IOMAP_DAX_COW))
> >
> > I reiterate (from V3) that the && goes on the previous line...
> >
> > if (WARN_ON_ONCE(iomap->type != IOMAP_MAPPED &&
> > iomap->type != IOMAP_DAX_COW))
> >
> > > return -EIO;
> > > +
> > > /*
> > > * Write can allocate block for an area which has a hole page mapped
> > > * into page tables. We have to tear down these mappings so that data
> > > @@ -1144,6 +1182,12 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
> > > break;
> > > }
> > > + if (iomap->type == IOMAP_DAX_COW) {
> > > + ret = dax_copy_edges(inode, pos, length, iomap, kaddr);
> > > + if (ret)
> > > + break;
> > > + }
> > > +
> > > map_len = PFN_PHYS(map_len);
> > > kaddr += offset;
> > > map_len -= offset;
> > > diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> > > index 0fefb5455bda..6e885c5a38a3 100644
> > > --- a/include/linux/iomap.h
> > > +++ b/include/linux/iomap.h
> > > @@ -25,6 +25,7 @@ struct vm_fault;
> > > #define IOMAP_MAPPED 0x03 /* blocks allocated at @addr */
> > > #define IOMAP_UNWRITTEN 0x04 /* blocks allocated at @addr in unwritten state */
> > > #define IOMAP_INLINE 0x05 /* data inline in the inode */
> >
> > > +#define IOMAP_DAX_COW 0x06
> >
> > DAX isn't going to be the only scenario where we need a way to
> > communicate to iomap actors the need to implement copy on write.
> >
> > XFS also uses struct iomap to hand out file leases to clients. The
> > lease code /currently/ doesn't support files with shared blocks (because
> > the only user is pNFS) but one could easily imagine a future where some
> > client wants to lease a file with shared blocks, in which case XFS will
> > want to convey the COW details to the lessee.
> >
> > > +/* Copy data pointed by inline_data before write*/
> >
> > A month ago during the V3 patchset review, I wrote (possibly in an other
> > thread, sorry) about something that I'm putting my foot down about now
> > for the V4 patchset, which is the {re,ab}use of @inline_data for the
> > data source address.
> >
> > We cannot use @inline_data to convey the source address. @inline_data
> > (so far) is used to point to the in-memory representation of the storage
> > described by @addr. For data writes, @addr is the location of the write
> > on disk and @inline_data is the location of the write in memory.
> >
> > Reusing @inline_data here to point to the location of the source data in
> > memory is a totally different thing and will likely result in confusion.
> > On a practical level, this also means that we cannot support the case of
> > COW && INLINE because the type codes collide and so would the users of
> > @inline_data. This isn't required *right now*, but if you had a pmem
> > filesystem that stages inode updates in memory and flips a pointer to
> > commit changes then the ->iomap_begin function will need to convey two
> > pointers at once.
> >
> > So this brings us back to Dave's suggestion during the V1 patchset
> > review that instead of adding more iomap flags/types and overloading
> > fields, we simply pass two struct iomaps into ->iomap_begin:
> >
> > - Change iomap_apply() to "struct iomap iomap[2] = 0;" and pass
> > &iomap[0] into the ->iomap_begin and ->iomap_end functions. The
> > first iomap will be filled in with the destination for the write (as
> > all implementations do now), and the second iomap can be filled in
> > with the source information for a COW operation.
> >
> > - If the ->iomap_begin implementation decides that COW is necessary for
> > the requested operation, then it should fill out that second iomap
> > with information about the extent that the actor must copied before
> > returning. The second iomap's offset and length must match the
> > first. If COW isn't necessary, the ->iomap_begin implementation
>
> Hi,
>
> I'm working on reflink & dax in XFS, here are some thoughts on this:
>
> As mentioned above: the second iomap's offset and length must match the
> first. I thought so at the beginning, but later found that the only
> difference between these two iomaps is @addr. So, what about adding a
> @saddr, which means the source address of COW extent, into the struct iomap.
> The ->iomap_begin() fills @saddr if the extent is COW, and 0 if not. Then
> handle this @saddr in each ->actor(). No more modifications in other
> functions.
Yes, I started of with the exact idea before being recommended this by Dave.
I used two fields instead of one namely cow_pos and cow_addr which defined
the source details. I had put it as a iomap flag as opposed to a type
which of course did not appeal well.
We may want to use iomaps for cases where two inodes are involved.
An example of the other scenario where offset may be different is file
comparison for dedup: vfs_dedup_file_range_compare(). However, it would
need two inodes in iomap as well.
>
> My RFC patchset[1] is implemented in this way and works for me, though it is
> far away from perfectness.
>
> [1]: https://patchwork.kernel.org/cover/10904307/
>
--
Goldwyn
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [PATCH 08/18] dax: memcpy page in case of IOMAP_DAX_COW for mmap faults
2019-05-21 17:46 ` Darrick J. Wong
2019-05-22 19:11 ` Goldwyn Rodrigues
@ 2019-05-23 12:10 ` Jan Kara
1 sibling, 0 replies; 58+ messages in thread
From: Jan Kara @ 2019-05-23 12:10 UTC (permalink / raw)
To: Darrick J. Wong
Cc: Goldwyn Rodrigues, linux-btrfs, kilobyte, linux-fsdevel, jack,
david, willy, hch, dsterba, nborisov, linux-nvdimm,
Goldwyn Rodrigues
On Tue 21-05-19 10:46:25, Darrick J. Wong wrote:
> On Mon, Apr 29, 2019 at 12:26:39PM -0500, Goldwyn Rodrigues wrote:
> > From: Goldwyn Rodrigues <rgoldwyn@suse.com>
> >
> > Change dax_iomap_pfn to return the address as well in order to
> > use it for performing a memcpy in case the type is IOMAP_DAX_COW.
> > We don't handle PMD because btrfs does not support hugepages.
> >
> > Question:
> > The sequence of bdev_dax_pgoff() and dax_direct_access() is
> > used multiple times to calculate address and pfn's. Would it make
> > sense to call it while calculating address as well to reduce code?
> >
> > Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
> > ---
> > fs/dax.c | 19 +++++++++++++++----
> > 1 file changed, 15 insertions(+), 4 deletions(-)
> >
> > diff --git a/fs/dax.c b/fs/dax.c
> > index 610bfa861a28..718b1632a39d 100644
> > --- a/fs/dax.c
> > +++ b/fs/dax.c
> > @@ -984,7 +984,7 @@ static sector_t dax_iomap_sector(struct iomap *iomap, loff_t pos)
> > }
> >
> > static int dax_iomap_pfn(struct iomap *iomap, loff_t pos, size_t size,
> > - pfn_t *pfnp)
> > + pfn_t *pfnp, void **addr)
> > {
> > const sector_t sector = dax_iomap_sector(iomap, pos);
> > pgoff_t pgoff;
> > @@ -996,7 +996,7 @@ static int dax_iomap_pfn(struct iomap *iomap, loff_t pos, size_t size,
> > return rc;
> > id = dax_read_lock();
> > length = dax_direct_access(iomap->dax_dev, pgoff, PHYS_PFN(size),
> > - NULL, pfnp);
> > + addr, pfnp);
> > if (length < 0) {
> > rc = length;
> > goto out;
> > @@ -1286,6 +1286,7 @@ static vm_fault_t dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp,
> > XA_STATE(xas, &mapping->i_pages, vmf->pgoff);
> > struct inode *inode = mapping->host;
> > unsigned long vaddr = vmf->address;
> > + void *addr;
> > loff_t pos = (loff_t)vmf->pgoff << PAGE_SHIFT;
> > struct iomap iomap = { 0 };
>
> Ugh, I had forgotten that fs/dax.c open-codes iomap_apply, probably
> because the actor returns vm_fault_t, not bytes copied. I guess that
> makes it a tiny bit more complicated to pass in two (struct iomap *) to
> the iomap_begin function...
Hum, right. We could actually reimplement dax_iomap_{pte|pmd}_fault() using
iomap_apply(). We would just need to propagate error code out of our
'actor' inside the structure pointed to by 'data'. But that's doable.
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [PATCH 10/18] dax: replace mmap entry in case of CoW
2019-04-29 17:26 ` [PATCH 10/18] dax: replace mmap entry in case of CoW Goldwyn Rodrigues
2019-05-21 17:35 ` Darrick J. Wong
@ 2019-05-23 13:38 ` Jan Kara
1 sibling, 0 replies; 58+ messages in thread
From: Jan Kara @ 2019-05-23 13:38 UTC (permalink / raw)
To: Goldwyn Rodrigues
Cc: linux-btrfs, kilobyte, linux-fsdevel, jack, david, willy, hch,
darrick.wong, dsterba, nborisov, linux-nvdimm, Goldwyn Rodrigues
On Mon 29-04-19 12:26:41, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues <rgoldwyn@suse.com>
>
> We replace the existing entry to the newly allocated one
> in case of CoW. Also, we mark the entry as PAGECACHE_TAG_TOWRITE
> so writeback marks this entry as writeprotected. This
> helps us snapshots so new write pagefaults after snapshots
> trigger a CoW.
I don't understand why do you need to mark the new entry with
PAGECACHE_TAG_TOWRITE. dax_insert_entry() will unmap the entry from all
page tables so what's there left to writeprotect?
> /*
> * By this point grab_mapping_entry() has ensured that we have a locked entry
> * of the appropriate size so we don't have to worry about downgrading PMDs to
> @@ -709,14 +712,17 @@ static int copy_user_dax(struct block_device *bdev, struct dax_device *dax_dev,
> */
> static void *dax_insert_entry(struct xa_state *xas,
> struct address_space *mapping, struct vm_fault *vmf,
> - void *entry, pfn_t pfn, unsigned long flags, bool dirty)
> + void *entry, pfn_t pfn, unsigned long flags,
> + unsigned long insert_flags)
> {
> void *new_entry = dax_make_entry(pfn, flags);
> + bool dirty = insert_flags & DAX_IF_DIRTY;
> + bool cow = insert_flags & DAX_IF_COW;
Does 'cow' really need to be a separate flag? dax_insert_entry() can just
figure out the right thing to do on its own based on old entry value and
new entry to be inserted...
>
> if (dirty)
> __mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
>
> - if (dax_is_zero_entry(entry) && !(flags & DAX_ZERO_PAGE)) {
> + if (cow || (dax_is_zero_entry(entry) && !(flags & DAX_ZERO_PAGE))) {
E.g. here we need to unmap if old entry is not 'empty' and the pfns differ
(well, the pfns differ check should better be done like I outline below to
make pmd + pte match work correctly).
> unsigned long index = xas->xa_index;
> /* we are replacing a zero page with block mapping */
> if (dax_is_pmd_entry(entry))
> @@ -728,12 +734,12 @@ static void *dax_insert_entry(struct xa_state *xas,
>
> xas_reset(xas);
> xas_lock_irq(xas);
> - if (dax_entry_size(entry) != dax_entry_size(new_entry)) {
> + if (cow || (dax_entry_size(entry) != dax_entry_size(new_entry))) {
This needs to be done if entries are different at all...
> dax_disassociate_entry(entry, mapping, false);
> dax_associate_entry(new_entry, mapping, vmf->vma, vmf->address);
> }
>
> - if (dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) {
> + if (cow || dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) {
This is the only place that will be a bit more subtle - you need to check
whether the new entry is not a subset of the old one (i.e., a PTE inside a
PMD) and skip setting in that case. So something like:
if (xa_to_value(new_entry) | DAX_LOCKED == xa_to_value(entry) ||
(dax_is_pmd_entry(entry) && dax_is_pte_entry(new_entry) &&
dax_to_pfn(entry) + (xas->xa_index & PG_PMD_COLOUR) ==
dax_to_pfn(new_entry))) {
/* New entry is a subset of the current one? Skip update... */
xas_load(xas);
} else {
do work...
}
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [PATCH 12/18] btrfs: allow MAP_SYNC mmap
2019-04-29 17:26 ` [PATCH 12/18] btrfs: allow MAP_SYNC mmap Goldwyn Rodrigues
2019-05-10 15:32 ` [PATCH for-goldwyn] btrfs: disallow MAP_SYNC outside of DAX mounts Adam Borowski
@ 2019-05-23 13:44 ` Jan Kara
2019-05-23 16:19 ` Adam Borowski
1 sibling, 1 reply; 58+ messages in thread
From: Jan Kara @ 2019-05-23 13:44 UTC (permalink / raw)
To: Goldwyn Rodrigues
Cc: linux-btrfs, kilobyte, linux-fsdevel, jack, david, willy, hch,
darrick.wong, dsterba, nborisov, linux-nvdimm, Goldwyn Rodrigues
On Mon 29-04-19 12:26:43, Goldwyn Rodrigues wrote:
> From: Adam Borowski <kilobyte@angband.pl>
>
> Used by userspace to detect DAX.
> [rgoldwyn@suse.com: Added CONFIG_FS_DAX around mmap_supported_flags]
Why the CONFIG_FS_DAX bit? Your mmap(2) implementation understands
implications of MAP_SYNC flag and that's all that's needed to set
.mmap_supported_flags.
Honza
> Signed-off-by: Adam Borowski <kilobyte@angband.pl>
> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
> ---
> fs/btrfs/file.c | 4 ++++
> 1 file changed, 4 insertions(+)
>
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index 9d5a3c99a6b9..362a9cf9dcb2 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -16,6 +16,7 @@
> #include <linux/btrfs.h>
> #include <linux/uio.h>
> #include <linux/iversion.h>
> +#include <linux/mman.h>
> #include "ctree.h"
> #include "disk-io.h"
> #include "transaction.h"
> @@ -3319,6 +3320,9 @@ const struct file_operations btrfs_file_operations = {
> .splice_read = generic_file_splice_read,
> .write_iter = btrfs_file_write_iter,
> .mmap = btrfs_file_mmap,
> +#ifdef CONFIG_FS_DAX
> + .mmap_supported_flags = MAP_SYNC,
> +#endif
> .open = btrfs_file_open,
> .release = btrfs_release_file,
> .fsync = btrfs_sync_file,
> --
> 2.16.4
>
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [PATCH 16/18] btrfs: Writeprotect mmap pages on snapshot
2019-04-29 17:26 ` [PATCH 16/18] btrfs: Writeprotect mmap pages on snapshot Goldwyn Rodrigues
@ 2019-05-23 14:04 ` Jan Kara
2019-05-23 15:27 ` Goldwyn Rodrigues
0 siblings, 1 reply; 58+ messages in thread
From: Jan Kara @ 2019-05-23 14:04 UTC (permalink / raw)
To: Goldwyn Rodrigues
Cc: linux-btrfs, kilobyte, linux-fsdevel, jack, david, willy, hch,
darrick.wong, dsterba, nborisov, linux-nvdimm, Goldwyn Rodrigues
On Mon 29-04-19 12:26:47, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues <rgoldwyn@suse.com>
>
> Inorder to make sure mmap'd files don't change after snapshot,
> writeprotect the mmap pages on snapshot. This is done by performing
> a data writeback on the pages (which simply mark the pages are
> wrprotected). This way if the user process tries to access the memory
> we will get another fault and we can perform a CoW.
>
> In order to accomplish this, we tag all CoW pages as
> PAGECACHE_TAG_TOWRITE, and add the mmapd inode in delalloc_inodes.
> During snapshot, it starts writeback of all delalloc'd inodes and
> here we perform a data writeback. We don't want to keep the inodes
> in delalloc_inodes until it umount (WARN_ON), so we remove it
> during inode evictions.
>
> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
OK, so here you use PAGECACHE_TAG_TOWRITE. But why is not
PAGECACHE_TAG_DIRTY enough for you? Also why isn't the same needed also for
normal non-DAX inodes? There you also need to trigger CoW on mmap write so
I just don't see the difference...
Honza
> ---
> fs/btrfs/ctree.h | 3 ++-
> fs/btrfs/dax.c | 7 +++++++
> fs/btrfs/inode.c | 13 ++++++++++++-
> 3 files changed, 21 insertions(+), 2 deletions(-)
>
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index ee1ed18f8b3c..d1b70f24adeb 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -3252,7 +3252,8 @@ int btrfs_create_subvol_root(struct btrfs_trans_handle *trans,
> struct btrfs_root *new_root,
> struct btrfs_root *parent_root,
> u64 new_dirid);
> - void btrfs_set_delalloc_extent(struct inode *inode, struct extent_state *state,
> +void btrfs_add_delalloc_inodes(struct btrfs_root *root, struct inode *inode);
> +void btrfs_set_delalloc_extent(struct inode *inode, struct extent_state *state,
> unsigned *bits);
> void btrfs_clear_delalloc_extent(struct inode *inode,
> struct extent_state *state, unsigned *bits);
> diff --git a/fs/btrfs/dax.c b/fs/btrfs/dax.c
> index bf2ddac5b5a1..20ec2ec49c68 100644
> --- a/fs/btrfs/dax.c
> +++ b/fs/btrfs/dax.c
> @@ -222,10 +222,17 @@ vm_fault_t btrfs_dax_fault(struct vm_fault *vmf)
> {
> vm_fault_t ret;
> pfn_t pfn;
> + struct inode *inode = file_inode(vmf->vma->vm_file);
> + struct btrfs_inode *binode = BTRFS_I(inode);
> ret = dax_iomap_fault(vmf, PE_SIZE_PTE, &pfn, NULL, &btrfs_iomap_ops);
> if (ret & VM_FAULT_NEEDDSYNC)
> ret = dax_finish_sync_fault(vmf, PE_SIZE_PTE, pfn);
>
> + /* Insert into delalloc so we get writeback calls on snapshots */
> + if (vmf->flags & FAULT_FLAG_WRITE &&
> + !test_bit(BTRFS_INODE_IN_DELALLOC_LIST, &binode->runtime_flags))
> + btrfs_add_delalloc_inodes(binode->root, inode);
> +
> return ret;
> }
>
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 7e88280a2c3b..e98fb512e1ca 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -1713,7 +1713,7 @@ void btrfs_merge_delalloc_extent(struct inode *inode, struct extent_state *new,
> spin_unlock(&BTRFS_I(inode)->lock);
> }
>
> -static void btrfs_add_delalloc_inodes(struct btrfs_root *root,
> +void btrfs_add_delalloc_inodes(struct btrfs_root *root,
> struct inode *inode)
> {
> struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
> @@ -5358,12 +5358,17 @@ void btrfs_evict_inode(struct inode *inode)
> {
> struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
> struct btrfs_trans_handle *trans;
> + struct btrfs_inode *binode = BTRFS_I(inode);
> struct btrfs_root *root = BTRFS_I(inode)->root;
> struct btrfs_block_rsv *rsv;
> int ret;
>
> trace_btrfs_inode_evict(inode);
>
> + if (IS_DAX(inode)
> + && test_bit(BTRFS_INODE_IN_DELALLOC_LIST, &binode->runtime_flags))
> + btrfs_del_delalloc_inode(root, binode);
> +
> if (!root) {
> clear_inode(inode);
> return;
> @@ -8683,6 +8688,10 @@ static int btrfs_dax_writepages(struct address_space *mapping,
> {
> struct inode *inode = mapping->host;
> struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
> + struct btrfs_inode *binode = BTRFS_I(inode);
> + if ((wbc->sync_mode == WB_SYNC_ALL) &&
> + test_bit(BTRFS_INODE_IN_DELALLOC_LIST, &binode->runtime_flags))
> + btrfs_del_delalloc_inode(binode->root, binode);
> return dax_writeback_mapping_range(mapping, fs_info->fs_devices->latest_bdev,
> wbc);
> }
> @@ -9981,6 +9990,8 @@ static void btrfs_run_delalloc_work(struct btrfs_work *work)
> delalloc_work = container_of(work, struct btrfs_delalloc_work,
> work);
> inode = delalloc_work->inode;
> + if (IS_DAX(inode))
> + filemap_fdatawrite(inode->i_mapping);
> filemap_flush(inode->i_mapping);
> if (test_bit(BTRFS_INODE_HAS_ASYNC_EXTENT,
> &BTRFS_I(inode)->runtime_flags))
> --
> 2.16.4
>
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [PATCH 16/18] btrfs: Writeprotect mmap pages on snapshot
2019-05-23 14:04 ` Jan Kara
@ 2019-05-23 15:27 ` Goldwyn Rodrigues
2019-05-23 19:07 ` Jan Kara
0 siblings, 1 reply; 58+ messages in thread
From: Goldwyn Rodrigues @ 2019-05-23 15:27 UTC (permalink / raw)
To: Jan Kara
Cc: linux-btrfs, kilobyte, linux-fsdevel, david, willy, hch,
darrick.wong, dsterba, nborisov, linux-nvdimm
On 16:04 23/05, Jan Kara wrote:
> On Mon 29-04-19 12:26:47, Goldwyn Rodrigues wrote:
> > From: Goldwyn Rodrigues <rgoldwyn@suse.com>
> >
> > Inorder to make sure mmap'd files don't change after snapshot,
> > writeprotect the mmap pages on snapshot. This is done by performing
> > a data writeback on the pages (which simply mark the pages are
> > wrprotected). This way if the user process tries to access the memory
> > we will get another fault and we can perform a CoW.
> >
> > In order to accomplish this, we tag all CoW pages as
> > PAGECACHE_TAG_TOWRITE, and add the mmapd inode in delalloc_inodes.
> > During snapshot, it starts writeback of all delalloc'd inodes and
> > here we perform a data writeback. We don't want to keep the inodes
> > in delalloc_inodes until it umount (WARN_ON), so we remove it
> > during inode evictions.
> >
> > Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
>
> OK, so here you use PAGECACHE_TAG_TOWRITE. But why is not
> PAGECACHE_TAG_DIRTY enough for you? Also why isn't the same needed also for
> normal non-DAX inodes? There you also need to trigger CoW on mmap write so
> I just don't see the difference...
Because dax_writeback_mapping_range() writebacks pages marked
PAGECACHE_TAG_TOWRITE and not PAGECACHE_TAG_DIRTY. Should it
writeback pages marked as PAGECACHE_TAG_DIRTY as well?
--
Goldwyn
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [PATCH 12/18] btrfs: allow MAP_SYNC mmap
2019-05-23 13:44 ` [PATCH 12/18] btrfs: allow MAP_SYNC mmap Jan Kara
@ 2019-05-23 16:19 ` Adam Borowski
0 siblings, 0 replies; 58+ messages in thread
From: Adam Borowski @ 2019-05-23 16:19 UTC (permalink / raw)
To: Jan Kara
Cc: Goldwyn Rodrigues, linux-btrfs, linux-fsdevel, david, willy, hch,
darrick.wong, dsterba, nborisov, linux-nvdimm, Goldwyn Rodrigues
On Thu, May 23, 2019 at 03:44:49PM +0200, Jan Kara wrote:
> On Mon 29-04-19 12:26:43, Goldwyn Rodrigues wrote:
> > From: Adam Borowski <kilobyte@angband.pl>
> >
> > Used by userspace to detect DAX.
> > [rgoldwyn@suse.com: Added CONFIG_FS_DAX around mmap_supported_flags]
>
> Why the CONFIG_FS_DAX bit? Your mmap(2) implementation understands
> implications of MAP_SYNC flag and that's all that's needed to set
> .mmap_supported_flags.
Good point.
Also, that check will need to be updated when the pmem-virtio patchset goes
in.
> > Signed-off-by: Adam Borowski <kilobyte@angband.pl>
> > Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
> > ---
> > fs/btrfs/file.c | 4 ++++
> > 1 file changed, 4 insertions(+)
> >
> > diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> > index 9d5a3c99a6b9..362a9cf9dcb2 100644
> > --- a/fs/btrfs/file.c
> > +++ b/fs/btrfs/file.c
> > @@ -16,6 +16,7 @@
> > #include <linux/btrfs.h>
> > #include <linux/uio.h>
> > #include <linux/iversion.h>
> > +#include <linux/mman.h>
> > #include "ctree.h"
> > #include "disk-io.h"
> > #include "transaction.h"
> > @@ -3319,6 +3320,9 @@ const struct file_operations btrfs_file_operations = {
> > .splice_read = generic_file_splice_read,
> > .write_iter = btrfs_file_write_iter,
> > .mmap = btrfs_file_mmap,
> > +#ifdef CONFIG_FS_DAX
> > + .mmap_supported_flags = MAP_SYNC,
> > +#endif
> > .open = btrfs_file_open,
> > .release = btrfs_release_file,
> > .fsync = btrfs_sync_file,
> > --
> > 2.16.4
> >
> --
> Jan Kara <jack@suse.com>
> SUSE Labs, CR
>
--
⢀⣴⠾⠻⢶⣦⠀ Latin: meow 4 characters, 4 columns, 4 bytes
⣾⠁⢠⠒⠀⣿⡁ Greek: μεου 4 characters, 4 columns, 8 bytes
⢿⡄⠘⠷⠚⠋ Runes: ᛗᛖᛟᚹ 4 characters, 4 columns, 12 bytes
⠈⠳⣄⠀⠀⠀⠀ Chinese: 喵 1 character, 2 columns, 3 bytes <-- best!
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [PATCH 16/18] btrfs: Writeprotect mmap pages on snapshot
2019-05-23 15:27 ` Goldwyn Rodrigues
@ 2019-05-23 19:07 ` Jan Kara
2019-05-23 21:22 ` Goldwyn Rodrigues
0 siblings, 1 reply; 58+ messages in thread
From: Jan Kara @ 2019-05-23 19:07 UTC (permalink / raw)
To: Goldwyn Rodrigues
Cc: Jan Kara, linux-btrfs, kilobyte, linux-fsdevel, david, willy, hch,
darrick.wong, dsterba, nborisov, linux-nvdimm
On Thu 23-05-19 10:27:22, Goldwyn Rodrigues wrote:
> On 16:04 23/05, Jan Kara wrote:
> > On Mon 29-04-19 12:26:47, Goldwyn Rodrigues wrote:
> > > From: Goldwyn Rodrigues <rgoldwyn@suse.com>
> > >
> > > Inorder to make sure mmap'd files don't change after snapshot,
> > > writeprotect the mmap pages on snapshot. This is done by performing
> > > a data writeback on the pages (which simply mark the pages are
> > > wrprotected). This way if the user process tries to access the memory
> > > we will get another fault and we can perform a CoW.
> > >
> > > In order to accomplish this, we tag all CoW pages as
> > > PAGECACHE_TAG_TOWRITE, and add the mmapd inode in delalloc_inodes.
> > > During snapshot, it starts writeback of all delalloc'd inodes and
> > > here we perform a data writeback. We don't want to keep the inodes
> > > in delalloc_inodes until it umount (WARN_ON), so we remove it
> > > during inode evictions.
> > >
> > > Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
> >
> > OK, so here you use PAGECACHE_TAG_TOWRITE. But why is not
> > PAGECACHE_TAG_DIRTY enough for you? Also why isn't the same needed also for
> > normal non-DAX inodes? There you also need to trigger CoW on mmap write so
> > I just don't see the difference...
>
> Because dax_writeback_mapping_range() writebacks pages marked
> PAGECACHE_TAG_TOWRITE and not PAGECACHE_TAG_DIRTY. Should it
> writeback pages marked as PAGECACHE_TAG_DIRTY as well?
It does writeback PAGECACHE_TAG_DIRTY pages - tag_pages_for_writeback()
moves PAGECACHE_TAG_DIRTY to PAGECACHE_TAG_TOWRITE...
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [PATCH 16/18] btrfs: Writeprotect mmap pages on snapshot
2019-05-23 19:07 ` Jan Kara
@ 2019-05-23 21:22 ` Goldwyn Rodrigues
0 siblings, 0 replies; 58+ messages in thread
From: Goldwyn Rodrigues @ 2019-05-23 21:22 UTC (permalink / raw)
To: Jan Kara
Cc: linux-btrfs, kilobyte, linux-fsdevel, david, willy, hch,
darrick.wong, dsterba, nborisov, linux-nvdimm
On 21:07 23/05, Jan Kara wrote:
> On Thu 23-05-19 10:27:22, Goldwyn Rodrigues wrote:
> > On 16:04 23/05, Jan Kara wrote:
> > > On Mon 29-04-19 12:26:47, Goldwyn Rodrigues wrote:
> > > > From: Goldwyn Rodrigues <rgoldwyn@suse.com>
> > > >
> > > > Inorder to make sure mmap'd files don't change after snapshot,
> > > > writeprotect the mmap pages on snapshot. This is done by performing
> > > > a data writeback on the pages (which simply mark the pages are
> > > > wrprotected). This way if the user process tries to access the memory
> > > > we will get another fault and we can perform a CoW.
> > > >
> > > > In order to accomplish this, we tag all CoW pages as
> > > > PAGECACHE_TAG_TOWRITE, and add the mmapd inode in delalloc_inodes.
> > > > During snapshot, it starts writeback of all delalloc'd inodes and
> > > > here we perform a data writeback. We don't want to keep the inodes
> > > > in delalloc_inodes until it umount (WARN_ON), so we remove it
> > > > during inode evictions.
> > > >
> > > > Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
> > >
> > > OK, so here you use PAGECACHE_TAG_TOWRITE. But why is not
> > > PAGECACHE_TAG_DIRTY enough for you? Also why isn't the same needed also for
> > > normal non-DAX inodes? There you also need to trigger CoW on mmap write so
> > > I just don't see the difference...
> >
> > Because dax_writeback_mapping_range() writebacks pages marked
> > PAGECACHE_TAG_TOWRITE and not PAGECACHE_TAG_DIRTY. Should it
> > writeback pages marked as PAGECACHE_TAG_DIRTY as well?
>
> It does writeback PAGECACHE_TAG_DIRTY pages - tag_pages_for_writeback()
> moves PAGECACHE_TAG_DIRTY to PAGECACHE_TAG_TOWRITE...
Yes, I missed that. It just needs a writeback with WB_SYNC_ALL.
Thanks!
--
Goldwyn
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [PATCH 04/18] dax: Introduce IOMAP_DAX_COW to CoW edges during writes
2019-05-23 11:51 ` Goldwyn Rodrigues
@ 2019-05-27 8:25 ` Shiyang Ruan
2019-05-28 9:17 ` Jan Kara
0 siblings, 1 reply; 58+ messages in thread
From: Shiyang Ruan @ 2019-05-27 8:25 UTC (permalink / raw)
To: Goldwyn Rodrigues
Cc: Darrick J. Wong, linux-btrfs, kilobyte, linux-fsdevel, jack,
david, willy, hch, dsterba, nborisov, linux-nvdimm
On 5/23/19 7:51 PM, Goldwyn Rodrigues wrote:
>>
>> Hi,
>>
>> I'm working on reflink & dax in XFS, here are some thoughts on this:
>>
>> As mentioned above: the second iomap's offset and length must match the
>> first. I thought so at the beginning, but later found that the only
>> difference between these two iomaps is @addr. So, what about adding a
>> @saddr, which means the source address of COW extent, into the struct iomap.
>> The ->iomap_begin() fills @saddr if the extent is COW, and 0 if not. Then
>> handle this @saddr in each ->actor(). No more modifications in other
>> functions.
>
> Yes, I started of with the exact idea before being recommended this by Dave.
> I used two fields instead of one namely cow_pos and cow_addr which defined
> the source details. I had put it as a iomap flag as opposed to a type
> which of course did not appeal well.
>
> We may want to use iomaps for cases where two inodes are involved.
> An example of the other scenario where offset may be different is file
> comparison for dedup: vfs_dedup_file_range_compare(). However, it would
> need two inodes in iomap as well.
>
Yes, it is reasonable. Thanks for your explanation.
One more thing RFC:
I'd like to add an end-io callback argument in ->dax_iomap_actor() to
update the metadata after one whole COW operation is completed. The
end-io can also be called in ->iomap_end(). But one COW operation may
call ->iomap_apply() many times, and so does the end-io. Thus, I think
it would be nice to move it to the bottom of ->dax_iomap_actor(), called
just once in each COW operation.
--
Thanks,
Shiyang Ruan.
>>
>> My RFC patchset[1] is implemented in this way and works for me, though it is
>> far away from perfectness.
>>
>> [1]: https://patchwork.kernel.org/cover/10904307/
>>
>
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [PATCH 04/18] dax: Introduce IOMAP_DAX_COW to CoW edges during writes
2019-05-27 8:25 ` Shiyang Ruan
@ 2019-05-28 9:17 ` Jan Kara
2019-05-29 2:01 ` Shiyang Ruan
0 siblings, 1 reply; 58+ messages in thread
From: Jan Kara @ 2019-05-28 9:17 UTC (permalink / raw)
To: Shiyang Ruan
Cc: Goldwyn Rodrigues, Darrick J. Wong, linux-btrfs, kilobyte,
linux-fsdevel, jack, david, willy, hch, dsterba, nborisov,
linux-nvdimm
On Mon 27-05-19 16:25:41, Shiyang Ruan wrote:
> On 5/23/19 7:51 PM, Goldwyn Rodrigues wrote:
> > >
> > > Hi,
> > >
> > > I'm working on reflink & dax in XFS, here are some thoughts on this:
> > >
> > > As mentioned above: the second iomap's offset and length must match the
> > > first. I thought so at the beginning, but later found that the only
> > > difference between these two iomaps is @addr. So, what about adding a
> > > @saddr, which means the source address of COW extent, into the struct iomap.
> > > The ->iomap_begin() fills @saddr if the extent is COW, and 0 if not. Then
> > > handle this @saddr in each ->actor(). No more modifications in other
> > > functions.
> >
> > Yes, I started of with the exact idea before being recommended this by Dave.
> > I used two fields instead of one namely cow_pos and cow_addr which defined
> > the source details. I had put it as a iomap flag as opposed to a type
> > which of course did not appeal well.
> >
> > We may want to use iomaps for cases where two inodes are involved.
> > An example of the other scenario where offset may be different is file
> > comparison for dedup: vfs_dedup_file_range_compare(). However, it would
> > need two inodes in iomap as well.
> >
> Yes, it is reasonable. Thanks for your explanation.
>
> One more thing RFC:
> I'd like to add an end-io callback argument in ->dax_iomap_actor() to update
> the metadata after one whole COW operation is completed. The end-io can
> also be called in ->iomap_end(). But one COW operation may call
> ->iomap_apply() many times, and so does the end-io. Thus, I think it would
> be nice to move it to the bottom of ->dax_iomap_actor(), called just once in
> each COW operation.
I'm sorry but I don't follow what you suggest. One COW operation is a call
to dax_iomap_rw(), isn't it? That may call iomap_apply() several times,
each invocation calls ->iomap_begin(), ->actor() (dax_iomap_actor()),
->iomap_end() once. So I don't see a difference between doing something in
->actor() and ->iomap_end() (besides the passed arguments but that does not
seem to be your concern). So what do you exactly want to do?
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [PATCH 04/18] dax: Introduce IOMAP_DAX_COW to CoW edges during writes
2019-05-28 9:17 ` Jan Kara
@ 2019-05-29 2:01 ` Shiyang Ruan
2019-05-29 2:47 ` Dave Chinner
0 siblings, 1 reply; 58+ messages in thread
From: Shiyang Ruan @ 2019-05-29 2:01 UTC (permalink / raw)
To: Jan Kara
Cc: Goldwyn Rodrigues, Darrick J. Wong, linux-btrfs, kilobyte,
linux-fsdevel, david, willy, hch, dsterba, nborisov, linux-nvdimm
On 5/28/19 5:17 PM, Jan Kara wrote:
> On Mon 27-05-19 16:25:41, Shiyang Ruan wrote:
>> On 5/23/19 7:51 PM, Goldwyn Rodrigues wrote:
>>>>
>>>> Hi,
>>>>
>>>> I'm working on reflink & dax in XFS, here are some thoughts on this:
>>>>
>>>> As mentioned above: the second iomap's offset and length must match the
>>>> first. I thought so at the beginning, but later found that the only
>>>> difference between these two iomaps is @addr. So, what about adding a
>>>> @saddr, which means the source address of COW extent, into the struct iomap.
>>>> The ->iomap_begin() fills @saddr if the extent is COW, and 0 if not. Then
>>>> handle this @saddr in each ->actor(). No more modifications in other
>>>> functions.
>>>
>>> Yes, I started of with the exact idea before being recommended this by Dave.
>>> I used two fields instead of one namely cow_pos and cow_addr which defined
>>> the source details. I had put it as a iomap flag as opposed to a type
>>> which of course did not appeal well.
>>>
>>> We may want to use iomaps for cases where two inodes are involved.
>>> An example of the other scenario where offset may be different is file
>>> comparison for dedup: vfs_dedup_file_range_compare(). However, it would
>>> need two inodes in iomap as well.
>>>
>> Yes, it is reasonable. Thanks for your explanation.
>>
>> One more thing RFC:
>> I'd like to add an end-io callback argument in ->dax_iomap_actor() to update
>> the metadata after one whole COW operation is completed. The end-io can
>> also be called in ->iomap_end(). But one COW operation may call
>> ->iomap_apply() many times, and so does the end-io. Thus, I think it would
>> be nice to move it to the bottom of ->dax_iomap_actor(), called just once in
>> each COW operation.
>
> I'm sorry but I don't follow what you suggest. One COW operation is a call
> to dax_iomap_rw(), isn't it? That may call iomap_apply() several times,
> each invocation calls ->iomap_begin(), ->actor() (dax_iomap_actor()),
> ->iomap_end() once. So I don't see a difference between doing something in
> ->actor() and ->iomap_end() (besides the passed arguments but that does not
> seem to be your concern). So what do you exactly want to do?
Hi Jan,
Thanks for pointing out, and I'm sorry for my mistake. It's
->dax_iomap_rw(), not ->dax_iomap_actor().
I want to call the callback function at the end of ->dax_iomap_rw().
Like this:
dax_iomap_rw(..., callback) {
...
while (...) {
iomap_apply(...);
}
if (callback != null) {
callback();
}
return ...;
}
>
> Honza
>
--
Thanks,
Shiyang Ruan.
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [PATCH 04/18] dax: Introduce IOMAP_DAX_COW to CoW edges during writes
2019-05-29 2:01 ` Shiyang Ruan
@ 2019-05-29 2:47 ` Dave Chinner
2019-05-29 4:02 ` Shiyang Ruan
0 siblings, 1 reply; 58+ messages in thread
From: Dave Chinner @ 2019-05-29 2:47 UTC (permalink / raw)
To: Shiyang Ruan
Cc: Jan Kara, Goldwyn Rodrigues, Darrick J. Wong, linux-btrfs,
kilobyte, linux-fsdevel, willy, hch, dsterba, nborisov,
linux-nvdimm
On Wed, May 29, 2019 at 10:01:58AM +0800, Shiyang Ruan wrote:
>
> On 5/28/19 5:17 PM, Jan Kara wrote:
> > On Mon 27-05-19 16:25:41, Shiyang Ruan wrote:
> > > On 5/23/19 7:51 PM, Goldwyn Rodrigues wrote:
> > > > >
> > > > > Hi,
> > > > >
> > > > > I'm working on reflink & dax in XFS, here are some thoughts on this:
> > > > >
> > > > > As mentioned above: the second iomap's offset and length must match the
> > > > > first. I thought so at the beginning, but later found that the only
> > > > > difference between these two iomaps is @addr. So, what about adding a
> > > > > @saddr, which means the source address of COW extent, into the struct iomap.
> > > > > The ->iomap_begin() fills @saddr if the extent is COW, and 0 if not. Then
> > > > > handle this @saddr in each ->actor(). No more modifications in other
> > > > > functions.
> > > >
> > > > Yes, I started of with the exact idea before being recommended this by Dave.
> > > > I used two fields instead of one namely cow_pos and cow_addr which defined
> > > > the source details. I had put it as a iomap flag as opposed to a type
> > > > which of course did not appeal well.
> > > >
> > > > We may want to use iomaps for cases where two inodes are involved.
> > > > An example of the other scenario where offset may be different is file
> > > > comparison for dedup: vfs_dedup_file_range_compare(). However, it would
> > > > need two inodes in iomap as well.
> > > >
> > > Yes, it is reasonable. Thanks for your explanation.
> > >
> > > One more thing RFC:
> > > I'd like to add an end-io callback argument in ->dax_iomap_actor() to update
> > > the metadata after one whole COW operation is completed. The end-io can
> > > also be called in ->iomap_end(). But one COW operation may call
> > > ->iomap_apply() many times, and so does the end-io. Thus, I think it would
> > > be nice to move it to the bottom of ->dax_iomap_actor(), called just once in
> > > each COW operation.
> >
> > I'm sorry but I don't follow what you suggest. One COW operation is a call
> > to dax_iomap_rw(), isn't it? That may call iomap_apply() several times,
> > each invocation calls ->iomap_begin(), ->actor() (dax_iomap_actor()),
> > ->iomap_end() once. So I don't see a difference between doing something in
> > ->actor() and ->iomap_end() (besides the passed arguments but that does not
> > seem to be your concern). So what do you exactly want to do?
>
> Hi Jan,
>
> Thanks for pointing out, and I'm sorry for my mistake. It's
> ->dax_iomap_rw(), not ->dax_iomap_actor().
>
> I want to call the callback function at the end of ->dax_iomap_rw().
>
> Like this:
> dax_iomap_rw(..., callback) {
>
> ...
> while (...) {
> iomap_apply(...);
> }
>
> if (callback != null) {
> callback();
> }
> return ...;
> }
Why does this need to be in dax_iomap_rw()?
We already do post-dax_iomap_rw() "io-end callbacks" directly in
xfs_file_dax_write() to update the file size....
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [PATCH 04/18] dax: Introduce IOMAP_DAX_COW to CoW edges during writes
2019-05-29 2:47 ` Dave Chinner
@ 2019-05-29 4:02 ` Shiyang Ruan
2019-05-29 4:07 ` Darrick J. Wong
0 siblings, 1 reply; 58+ messages in thread
From: Shiyang Ruan @ 2019-05-29 4:02 UTC (permalink / raw)
To: Dave Chinner
Cc: Jan Kara, Goldwyn Rodrigues, Darrick J. Wong, linux-btrfs,
kilobyte, linux-fsdevel, willy, hch, dsterba, nborisov,
linux-nvdimm
On 5/29/19 10:47 AM, Dave Chinner wrote:
> On Wed, May 29, 2019 at 10:01:58AM +0800, Shiyang Ruan wrote:
>>
>> On 5/28/19 5:17 PM, Jan Kara wrote:
>>> On Mon 27-05-19 16:25:41, Shiyang Ruan wrote:
>>>> On 5/23/19 7:51 PM, Goldwyn Rodrigues wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I'm working on reflink & dax in XFS, here are some thoughts on this:
>>>>>>
>>>>>> As mentioned above: the second iomap's offset and length must match the
>>>>>> first. I thought so at the beginning, but later found that the only
>>>>>> difference between these two iomaps is @addr. So, what about adding a
>>>>>> @saddr, which means the source address of COW extent, into the struct iomap.
>>>>>> The ->iomap_begin() fills @saddr if the extent is COW, and 0 if not. Then
>>>>>> handle this @saddr in each ->actor(). No more modifications in other
>>>>>> functions.
>>>>>
>>>>> Yes, I started of with the exact idea before being recommended this by Dave.
>>>>> I used two fields instead of one namely cow_pos and cow_addr which defined
>>>>> the source details. I had put it as a iomap flag as opposed to a type
>>>>> which of course did not appeal well.
>>>>>
>>>>> We may want to use iomaps for cases where two inodes are involved.
>>>>> An example of the other scenario where offset may be different is file
>>>>> comparison for dedup: vfs_dedup_file_range_compare(). However, it would
>>>>> need two inodes in iomap as well.
>>>>>
>>>> Yes, it is reasonable. Thanks for your explanation.
>>>>
>>>> One more thing RFC:
>>>> I'd like to add an end-io callback argument in ->dax_iomap_actor() to update
>>>> the metadata after one whole COW operation is completed. The end-io can
>>>> also be called in ->iomap_end(). But one COW operation may call
>>>> ->iomap_apply() many times, and so does the end-io. Thus, I think it would
>>>> be nice to move it to the bottom of ->dax_iomap_actor(), called just once in
>>>> each COW operation.
>>>
>>> I'm sorry but I don't follow what you suggest. One COW operation is a call
>>> to dax_iomap_rw(), isn't it? That may call iomap_apply() several times,
>>> each invocation calls ->iomap_begin(), ->actor() (dax_iomap_actor()),
>>> ->iomap_end() once. So I don't see a difference between doing something in
>>> ->actor() and ->iomap_end() (besides the passed arguments but that does not
>>> seem to be your concern). So what do you exactly want to do?
>>
>> Hi Jan,
>>
>> Thanks for pointing out, and I'm sorry for my mistake. It's
>> ->dax_iomap_rw(), not ->dax_iomap_actor().
>>
>> I want to call the callback function at the end of ->dax_iomap_rw().
>>
>> Like this:
>> dax_iomap_rw(..., callback) {
>>
>> ...
>> while (...) {
>> iomap_apply(...);
>> }
>>
>> if (callback != null) {
>> callback();
>> }
>> return ...;
>> }
>
> Why does this need to be in dax_iomap_rw()?
>
> We already do post-dax_iomap_rw() "io-end callbacks" directly in
> xfs_file_dax_write() to update the file size....
Yes, but we also need to call ->xfs_reflink_end_cow() after a COW
operation. And an is-cow flag(from iomap) is also needed to determine
if we call it. I think it would be better to put this into
->dax_iomap_rw() as a callback function.
So sorry for my poor expression.
>
> Cheers,
>
> Dave.
>
--
Thanks,
Shiyang Ruan.
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [PATCH 04/18] dax: Introduce IOMAP_DAX_COW to CoW edges during writes
2019-05-29 4:02 ` Shiyang Ruan
@ 2019-05-29 4:07 ` Darrick J. Wong
2019-05-29 4:46 ` Dave Chinner
0 siblings, 1 reply; 58+ messages in thread
From: Darrick J. Wong @ 2019-05-29 4:07 UTC (permalink / raw)
To: Shiyang Ruan
Cc: Dave Chinner, Jan Kara, Goldwyn Rodrigues, linux-btrfs, kilobyte,
linux-fsdevel, willy, hch, dsterba, nborisov, linux-nvdimm
On Wed, May 29, 2019 at 12:02:40PM +0800, Shiyang Ruan wrote:
>
>
> On 5/29/19 10:47 AM, Dave Chinner wrote:
> > On Wed, May 29, 2019 at 10:01:58AM +0800, Shiyang Ruan wrote:
> > >
> > > On 5/28/19 5:17 PM, Jan Kara wrote:
> > > > On Mon 27-05-19 16:25:41, Shiyang Ruan wrote:
> > > > > On 5/23/19 7:51 PM, Goldwyn Rodrigues wrote:
> > > > > > >
> > > > > > > Hi,
> > > > > > >
> > > > > > > I'm working on reflink & dax in XFS, here are some thoughts on this:
> > > > > > >
> > > > > > > As mentioned above: the second iomap's offset and length must match the
> > > > > > > first. I thought so at the beginning, but later found that the only
> > > > > > > difference between these two iomaps is @addr. So, what about adding a
> > > > > > > @saddr, which means the source address of COW extent, into the struct iomap.
> > > > > > > The ->iomap_begin() fills @saddr if the extent is COW, and 0 if not. Then
> > > > > > > handle this @saddr in each ->actor(). No more modifications in other
> > > > > > > functions.
> > > > > >
> > > > > > Yes, I started of with the exact idea before being recommended this by Dave.
> > > > > > I used two fields instead of one namely cow_pos and cow_addr which defined
> > > > > > the source details. I had put it as a iomap flag as opposed to a type
> > > > > > which of course did not appeal well.
> > > > > >
> > > > > > We may want to use iomaps for cases where two inodes are involved.
> > > > > > An example of the other scenario where offset may be different is file
> > > > > > comparison for dedup: vfs_dedup_file_range_compare(). However, it would
> > > > > > need two inodes in iomap as well.
> > > > > >
> > > > > Yes, it is reasonable. Thanks for your explanation.
> > > > >
> > > > > One more thing RFC:
> > > > > I'd like to add an end-io callback argument in ->dax_iomap_actor() to update
> > > > > the metadata after one whole COW operation is completed. The end-io can
> > > > > also be called in ->iomap_end(). But one COW operation may call
> > > > > ->iomap_apply() many times, and so does the end-io. Thus, I think it would
> > > > > be nice to move it to the bottom of ->dax_iomap_actor(), called just once in
> > > > > each COW operation.
> > > >
> > > > I'm sorry but I don't follow what you suggest. One COW operation is a call
> > > > to dax_iomap_rw(), isn't it? That may call iomap_apply() several times,
> > > > each invocation calls ->iomap_begin(), ->actor() (dax_iomap_actor()),
> > > > ->iomap_end() once. So I don't see a difference between doing something in
> > > > ->actor() and ->iomap_end() (besides the passed arguments but that does not
> > > > seem to be your concern). So what do you exactly want to do?
> > >
> > > Hi Jan,
> > >
> > > Thanks for pointing out, and I'm sorry for my mistake. It's
> > > ->dax_iomap_rw(), not ->dax_iomap_actor().
> > >
> > > I want to call the callback function at the end of ->dax_iomap_rw().
> > >
> > > Like this:
> > > dax_iomap_rw(..., callback) {
> > >
> > > ...
> > > while (...) {
> > > iomap_apply(...);
> > > }
> > >
> > > if (callback != null) {
> > > callback();
> > > }
> > > return ...;
> > > }
> >
> > Why does this need to be in dax_iomap_rw()?
> >
> > We already do post-dax_iomap_rw() "io-end callbacks" directly in
> > xfs_file_dax_write() to update the file size....
>
> Yes, but we also need to call ->xfs_reflink_end_cow() after a COW operation.
> And an is-cow flag(from iomap) is also needed to determine if we call it. I
> think it would be better to put this into ->dax_iomap_rw() as a callback
> function.
Sort of like how iomap_dio_rw takes a write endio function?
--D
> So sorry for my poor expression.
>
> >
> > Cheers,
> >
> > Dave.
> >
>
> --
> Thanks,
> Shiyang Ruan.
>
>
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [PATCH 04/18] dax: Introduce IOMAP_DAX_COW to CoW edges during writes
2019-05-29 4:07 ` Darrick J. Wong
@ 2019-05-29 4:46 ` Dave Chinner
2019-05-29 13:46 ` Jan Kara
0 siblings, 1 reply; 58+ messages in thread
From: Dave Chinner @ 2019-05-29 4:46 UTC (permalink / raw)
To: Darrick J. Wong
Cc: Shiyang Ruan, Jan Kara, Goldwyn Rodrigues, linux-btrfs, kilobyte,
linux-fsdevel, willy, hch, dsterba, nborisov, linux-nvdimm
On Tue, May 28, 2019 at 09:07:19PM -0700, Darrick J. Wong wrote:
> On Wed, May 29, 2019 at 12:02:40PM +0800, Shiyang Ruan wrote:
> > On 5/29/19 10:47 AM, Dave Chinner wrote:
> > > On Wed, May 29, 2019 at 10:01:58AM +0800, Shiyang Ruan wrote:
> > > > On 5/28/19 5:17 PM, Jan Kara wrote:
> > > > > I'm sorry but I don't follow what you suggest. One COW operation is a call
> > > > > to dax_iomap_rw(), isn't it? That may call iomap_apply() several times,
> > > > > each invocation calls ->iomap_begin(), ->actor() (dax_iomap_actor()),
> > > > > ->iomap_end() once. So I don't see a difference between doing something in
> > > > > ->actor() and ->iomap_end() (besides the passed arguments but that does not
> > > > > seem to be your concern). So what do you exactly want to do?
> > > >
> > > > Hi Jan,
> > > >
> > > > Thanks for pointing out, and I'm sorry for my mistake. It's
> > > > ->dax_iomap_rw(), not ->dax_iomap_actor().
> > > >
> > > > I want to call the callback function at the end of ->dax_iomap_rw().
> > > >
> > > > Like this:
> > > > dax_iomap_rw(..., callback) {
> > > >
> > > > ...
> > > > while (...) {
> > > > iomap_apply(...);
> > > > }
> > > >
> > > > if (callback != null) {
> > > > callback();
> > > > }
> > > > return ...;
> > > > }
> > >
> > > Why does this need to be in dax_iomap_rw()?
> > >
> > > We already do post-dax_iomap_rw() "io-end callbacks" directly in
> > > xfs_file_dax_write() to update the file size....
> >
> > Yes, but we also need to call ->xfs_reflink_end_cow() after a COW operation.
> > And an is-cow flag(from iomap) is also needed to determine if we call it. I
> > think it would be better to put this into ->dax_iomap_rw() as a callback
> > function.
>
> Sort of like how iomap_dio_rw takes a write endio function?
You mean like we originally had in the DAX code for unwritten
extents?
But we got rid of that because performance of unwritten extents was
absolutely woeful - it's cheaper in terms of CPU cost to do up front
zeroing (i.e. inside ->iomap_begin) than it is to use unwritten
extents and convert them to protect against stale data exposure.
I have a feeling that exactly the same thing is true for CoW - the
hoops we jump through to do COW fork manipulation and then extent
movement between the COW fork and the data fork on IO completion
would be better done before we commit the COW extent allocation.
In which case, what we actually want for DAX is:
iomap_apply()
->iomap_begin()
map old data extent that we copy from
allocate new data extent we copy to in data fork,
immediately replacing old data extent
return transaction handle as private data
dax_iomap_actor()
copies data from old extent to new extent
->iomap_end
commits transaction now data has been copied, making
the COW operation atomic with the data copy.
This, in fact, should be how we do all DAX writes that require
allocation, because then we get rid of the need to zero newly
allocated or unwritten extents before we copy the data into it. i.e.
we only need to write once to newly allocated storage rather than
twice.
This gets rid of the need for COW callbacks, and means the DAX
reflink implementation does not need to use delalloc
speculative preallocation or COW forks at all.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [PATCH 04/18] dax: Introduce IOMAP_DAX_COW to CoW edges during writes
2019-05-29 4:46 ` Dave Chinner
@ 2019-05-29 13:46 ` Jan Kara
2019-05-29 22:14 ` Dave Chinner
0 siblings, 1 reply; 58+ messages in thread
From: Jan Kara @ 2019-05-29 13:46 UTC (permalink / raw)
To: Dave Chinner
Cc: Darrick J. Wong, Shiyang Ruan, Jan Kara, Goldwyn Rodrigues,
linux-btrfs, kilobyte, linux-fsdevel, willy, hch, dsterba,
nborisov, linux-nvdimm
On Wed 29-05-19 14:46:58, Dave Chinner wrote:
> On Tue, May 28, 2019 at 09:07:19PM -0700, Darrick J. Wong wrote:
> > On Wed, May 29, 2019 at 12:02:40PM +0800, Shiyang Ruan wrote:
> > > On 5/29/19 10:47 AM, Dave Chinner wrote:
> > > > On Wed, May 29, 2019 at 10:01:58AM +0800, Shiyang Ruan wrote:
> > > > > On 5/28/19 5:17 PM, Jan Kara wrote:
> > > > > > I'm sorry but I don't follow what you suggest. One COW operation is a call
> > > > > > to dax_iomap_rw(), isn't it? That may call iomap_apply() several times,
> > > > > > each invocation calls ->iomap_begin(), ->actor() (dax_iomap_actor()),
> > > > > > ->iomap_end() once. So I don't see a difference between doing something in
> > > > > > ->actor() and ->iomap_end() (besides the passed arguments but that does not
> > > > > > seem to be your concern). So what do you exactly want to do?
> > > > >
> > > > > Hi Jan,
> > > > >
> > > > > Thanks for pointing out, and I'm sorry for my mistake. It's
> > > > > ->dax_iomap_rw(), not ->dax_iomap_actor().
> > > > >
> > > > > I want to call the callback function at the end of ->dax_iomap_rw().
> > > > >
> > > > > Like this:
> > > > > dax_iomap_rw(..., callback) {
> > > > >
> > > > > ...
> > > > > while (...) {
> > > > > iomap_apply(...);
> > > > > }
> > > > >
> > > > > if (callback != null) {
> > > > > callback();
> > > > > }
> > > > > return ...;
> > > > > }
> > > >
> > > > Why does this need to be in dax_iomap_rw()?
> > > >
> > > > We already do post-dax_iomap_rw() "io-end callbacks" directly in
> > > > xfs_file_dax_write() to update the file size....
> > >
> > > Yes, but we also need to call ->xfs_reflink_end_cow() after a COW operation.
> > > And an is-cow flag(from iomap) is also needed to determine if we call it. I
> > > think it would be better to put this into ->dax_iomap_rw() as a callback
> > > function.
> >
> > Sort of like how iomap_dio_rw takes a write endio function?
>
> You mean like we originally had in the DAX code for unwritten
> extents?
>
> But we got rid of that because performance of unwritten extents was
> absolutely woeful - it's cheaper in terms of CPU cost to do up front
> zeroing (i.e. inside ->iomap_begin) than it is to use unwritten
> extents and convert them to protect against stale data exposure.
>
> I have a feeling that exactly the same thing is true for CoW - the
> hoops we jump through to do COW fork manipulation and then extent
> movement between the COW fork and the data fork on IO completion
> would be better done before we commit the COW extent allocation.
>
> In which case, what we actually want for DAX is:
>
>
> iomap_apply()
>
> ->iomap_begin()
> map old data extent that we copy from
>
> allocate new data extent we copy to in data fork,
> immediately replacing old data extent
>
> return transaction handle as private data
>
> dax_iomap_actor()
> copies data from old extent to new extent
>
> ->iomap_end
> commits transaction now data has been copied, making
> the COW operation atomic with the data copy.
>
>
> This, in fact, should be how we do all DAX writes that require
> allocation, because then we get rid of the need to zero newly
> allocated or unwritten extents before we copy the data into it. i.e.
> we only need to write once to newly allocated storage rather than
> twice.
You need to be careful though. You need to synchronize with page faults so
that they cannot see and expose in page tables blocks you've allocated
before their contents is filled. This race was actually the strongest
motivation for pre-zeroing of blocks. OTOH copy_from_iter() in
dax_iomap_actor() needs to be able to fault pages to copy from (and these
pages may be from the same file you're writing to) so you cannot just block
faulting for the file through I_MMAP_LOCK.
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [PATCH 04/18] dax: Introduce IOMAP_DAX_COW to CoW edges during writes
2019-05-29 13:46 ` Jan Kara
@ 2019-05-29 22:14 ` Dave Chinner
2019-05-30 11:16 ` Jan Kara
0 siblings, 1 reply; 58+ messages in thread
From: Dave Chinner @ 2019-05-29 22:14 UTC (permalink / raw)
To: Jan Kara
Cc: Darrick J. Wong, Shiyang Ruan, Goldwyn Rodrigues, linux-btrfs,
kilobyte, linux-fsdevel, willy, hch, dsterba, nborisov,
linux-nvdimm
On Wed, May 29, 2019 at 03:46:29PM +0200, Jan Kara wrote:
> On Wed 29-05-19 14:46:58, Dave Chinner wrote:
> > iomap_apply()
> >
> > ->iomap_begin()
> > map old data extent that we copy from
> >
> > allocate new data extent we copy to in data fork,
> > immediately replacing old data extent
> >
> > return transaction handle as private data
This holds the inode block map locked exclusively across the IO,
so....
> >
> > dax_iomap_actor()
> > copies data from old extent to new extent
> >
> > ->iomap_end
> > commits transaction now data has been copied, making
> > the COW operation atomic with the data copy.
> >
> >
> > This, in fact, should be how we do all DAX writes that require
> > allocation, because then we get rid of the need to zero newly
> > allocated or unwritten extents before we copy the data into it. i.e.
> > we only need to write once to newly allocated storage rather than
> > twice.
>
> You need to be careful though. You need to synchronize with page faults so
> that they cannot see and expose in page tables blocks you've allocated
> before their contents is filled.
... so the page fault will block trying to map the blocks because
it can't get the xfs_inode->i_ilock until the allocation transaciton
commits....
> This race was actually the strongest
> motivation for pre-zeroing of blocks. OTOH copy_from_iter() in
> dax_iomap_actor() needs to be able to fault pages to copy from (and these
> pages may be from the same file you're writing to) so you cannot just block
> faulting for the file through I_MMAP_LOCK.
Right, it doesn't take the I_MMAP_LOCK, but it would block further
in. And, really, I'm not caring all this much about this corner
case. i.e. anyone using a "mmap()+write() zero copy" pattern on DAX
within a file is unbeleivably naive - the data still gets copied by
the CPU in the write() call. It's far simpler and more effcient to
just mmap() both ranges of the file(s) and memcpy() in userspace....
FWIW, it's to avoid problems with stupid userspace stuff that nobody
really should be doing that I want range locks for the XFS inode
locks. If userspace overlaps the ranges and deadlocks in that case,
they they get to keep all the broken bits because, IMO, they are
doing something monumentally stupid. I'd probably be making it
return EDEADLOCK back out to userspace in the case rather than
deadlocking but, fundamentally, I think it's broken behaviour that
we should be rejecting with an error rather than adding complexity
trying to handle it.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [PATCH 04/18] dax: Introduce IOMAP_DAX_COW to CoW edges during writes
2019-05-29 22:14 ` Dave Chinner
@ 2019-05-30 11:16 ` Jan Kara
2019-05-30 22:59 ` Dave Chinner
0 siblings, 1 reply; 58+ messages in thread
From: Jan Kara @ 2019-05-30 11:16 UTC (permalink / raw)
To: Dave Chinner
Cc: Jan Kara, Darrick J. Wong, Shiyang Ruan, Goldwyn Rodrigues,
linux-btrfs, kilobyte, linux-fsdevel, willy, hch, dsterba,
nborisov, linux-nvdimm
On Thu 30-05-19 08:14:45, Dave Chinner wrote:
> On Wed, May 29, 2019 at 03:46:29PM +0200, Jan Kara wrote:
> > On Wed 29-05-19 14:46:58, Dave Chinner wrote:
> > > iomap_apply()
> > >
> > > ->iomap_begin()
> > > map old data extent that we copy from
> > >
> > > allocate new data extent we copy to in data fork,
> > > immediately replacing old data extent
> > >
> > > return transaction handle as private data
>
> This holds the inode block map locked exclusively across the IO,
> so....
Does it? We do hold XFS_IOLOCK_EXCL during the whole dax write. But
xfs_file_iomap_begin() does release XFS_ILOCK_* on exit AFAICS. So I don't
see anything that would prevent page fault from mapping blocks into page
tables just after xfs_file_iomap_begin() returns.
> > > dax_iomap_actor()
> > > copies data from old extent to new extent
> > >
> > > ->iomap_end
> > > commits transaction now data has been copied, making
> > > the COW operation atomic with the data copy.
> > >
> > >
> > > This, in fact, should be how we do all DAX writes that require
> > > allocation, because then we get rid of the need to zero newly
> > > allocated or unwritten extents before we copy the data into it. i.e.
> > > we only need to write once to newly allocated storage rather than
> > > twice.
> >
> > You need to be careful though. You need to synchronize with page faults so
> > that they cannot see and expose in page tables blocks you've allocated
> > before their contents is filled.
>
> ... so the page fault will block trying to map the blocks because
> it can't get the xfs_inode->i_ilock until the allocation transaciton
> commits....
>
> > This race was actually the strongest
> > motivation for pre-zeroing of blocks. OTOH copy_from_iter() in
> > dax_iomap_actor() needs to be able to fault pages to copy from (and these
> > pages may be from the same file you're writing to) so you cannot just block
> > faulting for the file through I_MMAP_LOCK.
>
> Right, it doesn't take the I_MMAP_LOCK, but it would block further
> in. And, really, I'm not caring all this much about this corner
> case. i.e. anyone using a "mmap()+write() zero copy" pattern on DAX
> within a file is unbeleivably naive - the data still gets copied by
> the CPU in the write() call. It's far simpler and more effcient to
> just mmap() both ranges of the file(s) and memcpy() in userspace....
>
> FWIW, it's to avoid problems with stupid userspace stuff that nobody
> really should be doing that I want range locks for the XFS inode
> locks. If userspace overlaps the ranges and deadlocks in that case,
> they they get to keep all the broken bits because, IMO, they are
> doing something monumentally stupid. I'd probably be making it
> return EDEADLOCK back out to userspace in the case rather than
> deadlocking but, fundamentally, I think it's broken behaviour that
> we should be rejecting with an error rather than adding complexity
> trying to handle it.
I agree with this. We must just prevent user from taking the kernel down
with maliciously created IOs...
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [PATCH 04/18] dax: Introduce IOMAP_DAX_COW to CoW edges during writes
2019-05-30 11:16 ` Jan Kara
@ 2019-05-30 22:59 ` Dave Chinner
0 siblings, 0 replies; 58+ messages in thread
From: Dave Chinner @ 2019-05-30 22:59 UTC (permalink / raw)
To: Jan Kara
Cc: Darrick J. Wong, Shiyang Ruan, Goldwyn Rodrigues, linux-btrfs,
kilobyte, linux-fsdevel, willy, hch, dsterba, nborisov,
linux-nvdimm
On Thu, May 30, 2019 at 01:16:05PM +0200, Jan Kara wrote:
> On Thu 30-05-19 08:14:45, Dave Chinner wrote:
> > On Wed, May 29, 2019 at 03:46:29PM +0200, Jan Kara wrote:
> > > On Wed 29-05-19 14:46:58, Dave Chinner wrote:
> > > > iomap_apply()
> > > >
> > > > ->iomap_begin()
> > > > map old data extent that we copy from
> > > >
> > > > allocate new data extent we copy to in data fork,
> > > > immediately replacing old data extent
> > > >
> > > > return transaction handle as private data
> >
> > This holds the inode block map locked exclusively across the IO,
> > so....
>
> Does it? We do hold XFS_IOLOCK_EXCL during the whole dax write.
I forgot about that, I keep thinking that we use shared locking for
DAX like we do for direct IO. There's another reason for range
locks - allowing concurrent DAX read/write IO - but that's
orthogonal to the issue here.
> But
> xfs_file_iomap_begin() does release XFS_ILOCK_* on exit AFAICS. So I don't
> see anything that would prevent page fault from mapping blocks into page
> tables just after xfs_file_iomap_begin() returns.
Right, holding the IOLOCK doesn't stop concurrent page faults from
mapping the page we are trying to write, and that leaves a window
where stale data can be exposed if we don't initialise the newly
allocated range whilst in the allocation transaction holding the
ILOCK. That's what the XFS_BMAPI_ZERO flag does in the DAX block
allocation path.
So the idea of holding the allocation transaction across the data
copy means that ILOCK is then held until the data blocks are fully
initialised with valid data, meaning we can greatly reduce the scope
of the XFS_BMAPI_ZERO flag and possible get rid of it altogether.
> > > This race was actually the strongest
> > > motivation for pre-zeroing of blocks. OTOH copy_from_iter() in
> > > dax_iomap_actor() needs to be able to fault pages to copy from (and these
> > > pages may be from the same file you're writing to) so you cannot just block
> > > faulting for the file through I_MMAP_LOCK.
> >
> > Right, it doesn't take the I_MMAP_LOCK, but it would block further
> > in. And, really, I'm not caring all this much about this corner
> > case. i.e. anyone using a "mmap()+write() zero copy" pattern on DAX
> > within a file is unbeleivably naive - the data still gets copied by
> > the CPU in the write() call. It's far simpler and more effcient to
> > just mmap() both ranges of the file(s) and memcpy() in userspace....
> >
> > FWIW, it's to avoid problems with stupid userspace stuff that nobody
> > really should be doing that I want range locks for the XFS inode
> > locks. If userspace overlaps the ranges and deadlocks in that case,
> > they they get to keep all the broken bits because, IMO, they are
> > doing something monumentally stupid. I'd probably be making it
> > return EDEADLOCK back out to userspace in the case rather than
> > deadlocking but, fundamentally, I think it's broken behaviour that
> > we should be rejecting with an error rather than adding complexity
> > trying to handle it.
>
> I agree with this. We must just prevent user from taking the kernel down
> with maliciously created IOs...
Noted. :)
I'm still working to scale the range locks effectively for direct
IO; I've got to work out why sometimes they give identical
performance to rwsems out to 16 threads, and other times they run
20% slower or worse at 8+ threads. I'm way ahead of the original
mutex protected tree implementation that I have, but still got work
to do to get consistently close to rwsem performance for pure shared
locking workloads like direct IO.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 58+ messages in thread
end of thread, other threads:[~2019-05-30 22:59 UTC | newest]
Thread overview: 58+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2019-04-29 17:26 [PATCH v4 00/18] btrfs dax support Goldwyn Rodrigues
2019-04-29 17:26 ` [PATCH 01/18] btrfs: create a mount option for dax Goldwyn Rodrigues
2019-05-21 18:02 ` Darrick J. Wong
2019-04-29 17:26 ` [PATCH 02/18] btrfs: Carve out btrfs_get_extent_map_write() out of btrfs_get_blocks_write() Goldwyn Rodrigues
2019-04-29 17:26 ` [PATCH 03/18] btrfs: basic dax read Goldwyn Rodrigues
2019-05-21 15:14 ` Darrick J. Wong
2019-05-22 21:50 ` Goldwyn Rodrigues
2019-04-29 17:26 ` [PATCH 04/18] dax: Introduce IOMAP_DAX_COW to CoW edges during writes Goldwyn Rodrigues
2019-05-21 16:51 ` Darrick J. Wong
2019-05-22 20:14 ` Goldwyn Rodrigues
2019-05-23 2:10 ` Dave Chinner
2019-05-23 9:05 ` Shiyang Ruan
2019-05-23 11:51 ` Goldwyn Rodrigues
2019-05-27 8:25 ` Shiyang Ruan
2019-05-28 9:17 ` Jan Kara
2019-05-29 2:01 ` Shiyang Ruan
2019-05-29 2:47 ` Dave Chinner
2019-05-29 4:02 ` Shiyang Ruan
2019-05-29 4:07 ` Darrick J. Wong
2019-05-29 4:46 ` Dave Chinner
2019-05-29 13:46 ` Jan Kara
2019-05-29 22:14 ` Dave Chinner
2019-05-30 11:16 ` Jan Kara
2019-05-30 22:59 ` Dave Chinner
2019-04-29 17:26 ` [PATCH 05/18] btrfs: return whether extent is nocow or not Goldwyn Rodrigues
2019-04-29 17:26 ` [PATCH 06/18] btrfs: Rename __endio_write_update_ordered() to btrfs_update_ordered_extent() Goldwyn Rodrigues
2019-04-29 17:26 ` [PATCH 07/18] btrfs: add dax write support Goldwyn Rodrigues
2019-05-21 17:08 ` Darrick J. Wong
2019-04-29 17:26 ` [PATCH 08/18] dax: memcpy page in case of IOMAP_DAX_COW for mmap faults Goldwyn Rodrigues
2019-05-21 17:46 ` Darrick J. Wong
2019-05-22 19:11 ` Goldwyn Rodrigues
2019-05-23 4:02 ` Darrick J. Wong
2019-05-23 12:10 ` Jan Kara
2019-04-29 17:26 ` [PATCH 09/18] btrfs: Add dax specific address_space_operations Goldwyn Rodrigues
2019-04-29 17:26 ` [PATCH 10/18] dax: replace mmap entry in case of CoW Goldwyn Rodrigues
2019-05-21 17:35 ` Darrick J. Wong
2019-05-23 13:38 ` Jan Kara
2019-04-29 17:26 ` [PATCH 11/18] btrfs: add dax mmap support Goldwyn Rodrigues
2019-04-29 17:26 ` [PATCH 12/18] btrfs: allow MAP_SYNC mmap Goldwyn Rodrigues
2019-05-10 15:32 ` [PATCH for-goldwyn] btrfs: disallow MAP_SYNC outside of DAX mounts Adam Borowski
2019-05-10 15:41 ` Dan Williams
2019-05-10 15:59 ` Pankaj Gupta
2019-05-23 13:44 ` [PATCH 12/18] btrfs: allow MAP_SYNC mmap Jan Kara
2019-05-23 16:19 ` Adam Borowski
2019-04-29 17:26 ` [PATCH 13/18] fs: dedup file range to use a compare function Goldwyn Rodrigues
2019-05-21 18:17 ` Darrick J. Wong
2019-04-29 17:26 ` [PATCH 14/18] dax: memcpy before zeroing range Goldwyn Rodrigues
2019-05-21 17:27 ` Darrick J. Wong
2019-04-29 17:26 ` [PATCH 15/18] btrfs: handle dax page zeroing Goldwyn Rodrigues
2019-04-29 17:26 ` [PATCH 16/18] btrfs: Writeprotect mmap pages on snapshot Goldwyn Rodrigues
2019-05-23 14:04 ` Jan Kara
2019-05-23 15:27 ` Goldwyn Rodrigues
2019-05-23 19:07 ` Jan Kara
2019-05-23 21:22 ` Goldwyn Rodrigues
2019-04-29 17:26 ` [PATCH 17/18] btrfs: Disable dax-based defrag and send Goldwyn Rodrigues
2019-04-29 17:26 ` [PATCH 18/18] btrfs: trace functions for btrfs_iomap_begin/end Goldwyn Rodrigues
-- strict thread matches above, loose matches on Subject: below --
2019-04-16 16:41 [PATCH v3 00/18] btrfs dax support Goldwyn Rodrigues
2019-04-16 16:41 ` [PATCH 04/18] dax: Introduce IOMAP_DAX_COW to CoW edges during writes Goldwyn Rodrigues
2019-04-17 16:46 ` Darrick J. Wong
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).