* [RFC] extending splice for copy offloading
@ 2013-09-11 17:06 Zach Brown
       [not found] ` <1378919210-10372-1-git-send-email-zab-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2013-09-11 21:17 ` Eric Wong
  0 siblings, 2 replies; 62+ messages in thread
From: Zach Brown @ 2013-09-11 17:06 UTC
To: linux-kernel, linux-fsdevel, linux-nfs, Trond Myklebust, Bryan Schumaker, Martin K. Petersen, Jens Axboe, Mark Fasheh, Joel Becker, Eric Wong

When I first started on this stuff I followed the lead of previous
work and added a new syscall for the copy operation:

  https://lkml.org/lkml/2013/5/14/618

Towards the end of that thread Eric Wong asked why we didn't just
extend splice.  I immediately replied with some dumb dismissive
answer.  Once I sat down and looked at it, though, it does make a
lot of sense.  So good job, Eric.  +10 Dummie points for me.

Extending splice avoids all the noise of adding a new syscall and
naturally falls back to buffered copying as that's what the direct
splice path does for sendfile() today.

So that's what this patch series demonstrates.  It adds a flag that
lets splice get at the same direct splicing that sendfile() does.
We then add a file system file_operations method to accelerate the
copy which has access to both files.

Some things to talk about:
 - I really don't care about the naming here.  If you do, holler.
 - We might want different flags for file-to-file splicing and acceleration
 - We might want flags to require or forbid acceleration
 - We might want to provide all these flags to sendfile, too

Thoughts?  Objections?

Bryan, do you see any problems with wiring the NFS COPY RPC under this?

Martin, are we any closer to getting blk_() calls to kick off XCOPY
bios?

OCFS2 friends, is it a manageable amount of work to implement an
ocfs2_splice_direct() that only modifies a region of the destination
file?

Finally, there's a slot in the plumbers schedule next week to talk
about this stuff.  Come say hi if you're interested.

-z
* [PATCH 1/3] splice: add DIRECT flag for splicing between files [not found] ` <1378919210-10372-1-git-send-email-zab-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> @ 2013-09-11 17:06 ` Zach Brown 2013-09-11 17:06 ` [PATCH 2/3] splice: add f_op->splice_direct Zach Brown ` (3 subsequent siblings) 4 siblings, 0 replies; 62+ messages in thread From: Zach Brown @ 2013-09-11 17:06 UTC (permalink / raw) To: linux-kernel-u79uwXL29TY76Z2rM5mHXA, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, linux-nfs-u79uwXL29TY76Z2rM5mHXA, Trond Myklebust, Bryan Schumaker, Martin K. Petersen, Jens Axboe, Mark Fasheh, Joel Becker, Eric Wong sendfile() is implemented by performing an internal "direct" splice between two regular files. A per-task pipe buffer is allocated to splice between the reads from the source page cache and writes to the destination file page cache. This patch lets userspace perform these direct splices with sys_splice() by setting the SPLICE_F_DIRECT flag. This provides a single syscall for copying a region between files without either having to store the destination offset in the descriptor for sendfile or having to use multiple splicing syscalls to and from a pipe. Providing both files to the method lets the file system lock both for the duration of the copy, should it need to. If the method refuses to accelerate the copy, for whatever reason, we can naturally fall back to the generic direct splice method that sendfile uses today. 
Signed-off-by: Zach Brown <zab-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> --- fs/splice.c | 38 ++++++++++++++++++++++++++++++++++++-- include/linux/splice.h | 1 + 2 files changed, 37 insertions(+), 2 deletions(-) diff --git a/fs/splice.c b/fs/splice.c index 3b7ee65..c0f4e27 100644 --- a/fs/splice.c +++ b/fs/splice.c @@ -1347,7 +1347,7 @@ static long do_splice(struct file *in, loff_t __user *off_in, } if (ipipe) { - if (off_in) + if (off_in || (flags & SPLICE_F_DIRECT)) return -ESPIPE; if (off_out) { if (!(out->f_mode & FMODE_PWRITE)) @@ -1381,7 +1381,7 @@ static long do_splice(struct file *in, loff_t __user *off_in, } if (opipe) { - if (off_out) + if (off_out || (flags & SPLICE_F_DIRECT)) return -ESPIPE; if (off_in) { if (!(in->f_mode & FMODE_PREAD)) @@ -1402,6 +1402,40 @@ static long do_splice(struct file *in, loff_t __user *off_in, return ret; } + if (flags & SPLICE_F_DIRECT) { + loff_t out_pos; + + if (off_in) { + if (!(in->f_mode & FMODE_PREAD)) + return -EINVAL; + if (copy_from_user(&offset, off_in, sizeof(loff_t))) + return -EFAULT; + } else + offset = in->f_pos; + + if (off_out) { + if (!(out->f_mode & FMODE_PWRITE)) + return -EINVAL; + if (copy_from_user(&out_pos, off_out, sizeof(loff_t))) + return -EFAULT; + } else + out_pos = out->f_pos; + + ret = do_splice_direct(in, &offset, out, &out_pos, len, flags); + + if (!off_in) + in->f_pos = offset; + else if (copy_to_user(off_in, &offset, sizeof(loff_t))) + ret = -EFAULT; + + if (!off_out) + out->f_pos = out_pos; + else if (copy_to_user(off_out, &out_pos, sizeof(loff_t))) + ret = -EFAULT; + + return ret; + } + return -EINVAL; } diff --git a/include/linux/splice.h b/include/linux/splice.h index 74575cb..e1aa3ad 100644 --- a/include/linux/splice.h +++ b/include/linux/splice.h @@ -19,6 +19,7 @@ /* from/to, of course */ #define SPLICE_F_MORE (0x04) /* expect more data */ #define SPLICE_F_GIFT (0x08) /* pages passed in are a gift */ +#define SPLICE_F_DIRECT (0x10) /* neither splice fd is a pipe */ /* * Passed to the 
actors -- 1.7.11.7 -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply related [flat|nested] 62+ messages in thread
* [PATCH 2/3] splice: add f_op->splice_direct [not found] ` <1378919210-10372-1-git-send-email-zab-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 2013-09-11 17:06 ` [PATCH 1/3] splice: add DIRECT flag for splicing between files Zach Brown @ 2013-09-11 17:06 ` Zach Brown 2013-09-11 17:06 ` [PATCH 3/3] btrfs: implement .splice_direct extent copying Zach Brown ` (2 subsequent siblings) 4 siblings, 0 replies; 62+ messages in thread From: Zach Brown @ 2013-09-11 17:06 UTC (permalink / raw) To: linux-kernel-u79uwXL29TY76Z2rM5mHXA, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, linux-nfs-u79uwXL29TY76Z2rM5mHXA, Trond Myklebust, Bryan Schumaker, Martin K. Petersen, Jens Axboe, Mark Fasheh, Joel Becker, Eric Wong The splice_direct file_operations method gives file systems the opportunity to accelerate copying a region between two files. The generic path attempts to copy the remainder of the region that the file system fails to accelerate, for whatever reason. We may choose to dial this back a bit if the caller wants to avoid unaccelerated copying, perhaps by setting behavioural flags. The SPLICE_F_DIRECT flag is arguably misused here to indicate both file-to-file "direct" splicing *and* acceleration. 
Signed-off-by: Zach Brown <zab-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> --- fs/bad_inode.c | 8 ++++++++ fs/splice.c | 28 +++++++++++++++++++++++----- include/linux/fs.h | 1 + 3 files changed, 32 insertions(+), 5 deletions(-) diff --git a/fs/bad_inode.c b/fs/bad_inode.c index 7c93953..394914b 100644 --- a/fs/bad_inode.c +++ b/fs/bad_inode.c @@ -145,6 +145,13 @@ static ssize_t bad_file_splice_read(struct file *in, loff_t *ppos, return -EIO; } +static ssize_t bad_file_splice_direct(struct file *in, loff_t in_pos, + struct file *out, loff_t out_pos, size_t len, + unsigned int flags) +{ + return -EIO; +} + static const struct file_operations bad_file_ops = { .llseek = bad_file_llseek, @@ -170,6 +177,7 @@ static const struct file_operations bad_file_ops = .flock = bad_file_flock, .splice_write = bad_file_splice_write, .splice_read = bad_file_splice_read, + .splice_direct = bad_file_splice_direct, }; static int bad_inode_create (struct inode *dir, struct dentry *dentry, diff --git a/fs/splice.c b/fs/splice.c index c0f4e27..eac310f 100644 --- a/fs/splice.c +++ b/fs/splice.c @@ -1284,14 +1284,12 @@ long do_splice_direct(struct file *in, loff_t *ppos, struct file *out, loff_t *opos, size_t len, unsigned int flags) { struct splice_desc sd = { - .len = len, - .total_len = len, .flags = flags, - .pos = *ppos, .u.file = out, .opos = opos, }; long ret; + long bytes = 0; if (unlikely(!(out->f_mode & FMODE_WRITE))) return -EBADF; @@ -1303,11 +1301,31 @@ long do_splice_direct(struct file *in, loff_t *ppos, struct file *out, if (unlikely(ret < 0)) return ret; + if ((flags & SPLICE_F_DIRECT) && out->f_op->splice_direct) { + ret = out->f_op->splice_direct(in, *ppos, out, *opos, len, + flags); + if (ret > 0) { + bytes += ret; + len -= ret; + *opos += ret; + *ppos += ret; + + if (len == 0) + return ret; + } + } + + sd.len = len; + sd.total_len = len; + sd.pos = *ppos; + ret = splice_direct_to_actor(in, &sd, direct_splice_actor); - if (ret > 0) + if (ret > 0) { + bytes += ret; *ppos = 
sd.pos; + } - return ret; + return bytes ? bytes : ret; } static int splice_pipe_to_pipe(struct pipe_inode_info *ipipe, diff --git a/include/linux/fs.h b/include/linux/fs.h index 529d871..725e6fc 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -1553,6 +1553,7 @@ struct file_operations { int (*flock) (struct file *, int, struct file_lock *); ssize_t (*splice_write)(struct pipe_inode_info *, struct file *, loff_t *, size_t, unsigned int); ssize_t (*splice_read)(struct file *, loff_t *, struct pipe_inode_info *, size_t, unsigned int); + ssize_t (*splice_direct)(struct file *, loff_t, struct file *, loff_t, size_t, unsigned int); int (*setlease)(struct file *, long, struct file_lock **); long (*fallocate)(struct file *file, int mode, loff_t offset, loff_t len); -- 1.7.11.7 -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply related [flat|nested] 62+ messages in thread
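The byte accounting in the do_splice_direct() hunk above can be easier to follow as a small userspace model. This is a sketch only: it works on memory buffers rather than files, and model_splice_direct(), accel_whole_blocks(), accel_refuse() and generic_copy() are hypothetical stand-ins for the kernel functions, chosen here for illustration. What it tries to mirror is the control flow of the patch: let the accelerated method copy what it can, hand the remainder to the generic path, and return the total number of bytes if any were copied, otherwise the last return value.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>

/* Stand-in for f_op->splice_direct: bytes copied, or a negative errno. */
typedef ssize_t (*accel_fn)(const char *src, char *dst, size_t len);

/* Example accelerator that only handles whole 4096-byte blocks. */
ssize_t accel_whole_blocks(const char *src, char *dst, size_t len)
{
	size_t aligned = len - (len % 4096);

	if (aligned == 0)
		return -EINVAL;
	memcpy(dst, src, aligned);
	return (ssize_t)aligned;
}

/* Example accelerator that always refuses. */
ssize_t accel_refuse(const char *src, char *dst, size_t len)
{
	(void)src; (void)dst; (void)len;
	return -EOPNOTSUPP;
}

/* Stand-in for the generic splice_direct_to_actor() path. */
static ssize_t generic_copy(const char *src, char *dst, size_t len)
{
	memcpy(dst, src, len);
	return (ssize_t)len;
}

ssize_t model_splice_direct(accel_fn accel, const char *src, char *dst,
			    size_t len)
{
	ssize_t ret;
	ssize_t bytes = 0;

	if (accel) {
		ret = accel(src, dst, len);
		if (ret > 0) {
			/* Advance past whatever the accelerator handled. */
			bytes += ret;
			len -= (size_t)ret;
			src += ret;
			dst += ret;
			if (len == 0)
				return bytes;
		}
		/* A refusal or error is ignored; the generic path takes over. */
	}

	ret = generic_copy(src, dst, len);
	if (ret > 0)
		bytes += ret;

	/* Partial progress wins over a later error, as in the patch. */
	return bytes ? bytes : ret;
}
```

One consequence visible in the model: an error from the generic path after the accelerator has made progress is reported as a short copy, not as an error, so the caller only sees the failure on its next call.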
* [PATCH 3/3] btrfs: implement .splice_direct extent copying [not found] ` <1378919210-10372-1-git-send-email-zab-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 2013-09-11 17:06 ` [PATCH 1/3] splice: add DIRECT flag for splicing between files Zach Brown 2013-09-11 17:06 ` [PATCH 2/3] splice: add f_op->splice_direct Zach Brown @ 2013-09-11 17:06 ` Zach Brown 2013-09-20 9:49 ` [RFC] extending splice for copy offloading Szeredi Miklos 2013-12-18 12:41 ` Christoph Hellwig 4 siblings, 0 replies; 62+ messages in thread From: Zach Brown @ 2013-09-11 17:06 UTC (permalink / raw) To: linux-kernel-u79uwXL29TY76Z2rM5mHXA, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, linux-nfs-u79uwXL29TY76Z2rM5mHXA, Trond Myklebust, Bryan Schumaker, Martin K. Petersen, Jens Axboe, Mark Fasheh, Joel Becker, Eric Wong This patch re-uses the existing btrfs file cloning ioctl code to implement the .splice_direct copy offloading file operation. The existing extent item copying btrfs_ioctl_clone() is renamed to a shared btrfs_clone_extents(). The ioctl specific code (mostly simple entry-point stuff that splice() already does elsewhere) is moved to a new much smaller btrfs_ioctl_clone(). btrfs_splice_direct() thus inherits the conservative limitations of the btrfs clone ioctl: it only allows block-aligned copies between files on the same snapshot. 
Signed-off-by: Zach Brown <zab-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> --- fs/btrfs/ctree.h | 2 ++ fs/btrfs/file.c | 11 ++++++++++ fs/btrfs/ioctl.c | 64 +++++++++++++++++++++++++++++++------------------------- 3 files changed, 48 insertions(+), 29 deletions(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index e795bf1..f73830e 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -3648,6 +3648,8 @@ int btrfs_defrag_file(struct inode *inode, struct file *file, u64 newer_than, unsigned long max_pages); void btrfs_get_block_group_info(struct list_head *groups_list, struct btrfs_ioctl_space_info *space); +long btrfs_clone_extents(struct file *file, struct file *src_file, u64 off, + u64 olen, u64 destoff); /* file.c */ int btrfs_auto_defrag_init(void); diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c index 4d2eb64..82aec93 100644 --- a/fs/btrfs/file.c +++ b/fs/btrfs/file.c @@ -2557,6 +2557,16 @@ out: return offset; } +static long btrfs_splice_direct(struct file *in, loff_t in_pos, + struct file *out, loff_t out_pos, size_t len, + unsigned int flags) +{ + int ret = btrfs_clone_extents(out, in, in_pos, len, out_pos); + if (ret == 0) + ret = len; + return ret; +} + const struct file_operations btrfs_file_operations = { .llseek = btrfs_file_llseek, .read = do_sync_read, @@ -2573,6 +2583,7 @@ const struct file_operations btrfs_file_operations = { #ifdef CONFIG_COMPAT .compat_ioctl = btrfs_ioctl, #endif + .splice_direct = btrfs_splice_direct, }; void btrfs_auto_defrag_exit(void) diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c index 238a055..cddf6ef 100644 --- a/fs/btrfs/ioctl.c +++ b/fs/btrfs/ioctl.c @@ -2469,13 +2469,12 @@ out: return ret; } -static noinline long btrfs_ioctl_clone(struct file *file, unsigned long srcfd, - u64 off, u64 olen, u64 destoff) +long btrfs_clone_extents(struct file *file, struct file *src_file, u64 off, + u64 olen, u64 destoff) { struct inode *inode = file_inode(file); + struct inode *src = file_inode(src_file); struct btrfs_root *root = 
BTRFS_I(inode)->root; - struct fd src_file; - struct inode *src; struct btrfs_trans_handle *trans; struct btrfs_path *path; struct extent_buffer *leaf; @@ -2498,10 +2497,6 @@ static noinline long btrfs_ioctl_clone(struct file *file, unsigned long srcfd, * they don't overlap)? */ - /* the destination must be opened for writing */ - if (!(file->f_mode & FMODE_WRITE) || (file->f_flags & O_APPEND)) - return -EINVAL; - if (btrfs_root_readonly(root)) return -EROFS; @@ -2509,48 +2504,36 @@ static noinline long btrfs_ioctl_clone(struct file *file, unsigned long srcfd, if (ret) return ret; - src_file = fdget(srcfd); - if (!src_file.file) { - ret = -EBADF; - goto out_drop_write; - } - ret = -EXDEV; - if (src_file.file->f_path.mnt != file->f_path.mnt) - goto out_fput; - - src = file_inode(src_file.file); + if (src_file->f_path.mnt != file->f_path.mnt) + goto out_drop_write; ret = -EINVAL; if (src == inode) same_inode = 1; - /* the src must be open for reading */ - if (!(src_file.file->f_mode & FMODE_READ)) - goto out_fput; - /* don't make the dst file partly checksummed */ if ((BTRFS_I(src)->flags & BTRFS_INODE_NODATASUM) != (BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM)) - goto out_fput; + goto out_drop_write; ret = -EISDIR; if (S_ISDIR(src->i_mode) || S_ISDIR(inode->i_mode)) - goto out_fput; + goto out_drop_write; ret = -EXDEV; if (src->i_sb != inode->i_sb) - goto out_fput; + goto out_drop_write; ret = -ENOMEM; buf = vmalloc(btrfs_level_size(root, 0)); if (!buf) - goto out_fput; + goto out_drop_write; path = btrfs_alloc_path(); if (!path) { vfree(buf); - goto out_fput; + goto out_drop_write; } path->reada = 2; @@ -2867,13 +2850,36 @@ out_unlock: mutex_unlock(&inode->i_mutex); vfree(buf); btrfs_free_path(path); -out_fput: - fdput(src_file); out_drop_write: mnt_drop_write_file(file); return ret; } +static noinline long btrfs_ioctl_clone(struct file *file, unsigned long srcfd, + u64 off, u64 olen, u64 destoff) +{ + struct fd src_file; + int ret; + + /* the destination must be 
opened for writing */ + if (!(file->f_mode & FMODE_WRITE) || (file->f_flags & O_APPEND)) + return -EINVAL; + + src_file = fdget(srcfd); + if (!src_file.file) + return -EBADF; + + /* the src must be open for reading */ + if (!(src_file.file->f_mode & FMODE_READ)) + ret = -EINVAL; + else + ret = btrfs_clone_extents(file, src_file.file, off, olen, + destoff); + + fdput(src_file); + return ret; +} + static long btrfs_ioctl_clone_range(struct file *file, void __user *argp) { struct btrfs_ioctl_clone_range_args args; -- 1.7.11.7 -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply related [flat|nested] 62+ messages in thread
* Re: [RFC] extending splice for copy offloading [not found] ` <1378919210-10372-1-git-send-email-zab-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> ` (2 preceding siblings ...) 2013-09-11 17:06 ` [PATCH 3/3] btrfs: implement .splice_direct extent copying Zach Brown @ 2013-09-20 9:49 ` Szeredi Miklos [not found] ` <CAELBmZBGD4rph=gjLCPKCdEj+nzEQ-F=DExoL+h3vRm7qF7dCQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 2013-12-18 12:41 ` Christoph Hellwig 4 siblings, 1 reply; 62+ messages in thread From: Szeredi Miklos @ 2013-09-20 9:49 UTC (permalink / raw) To: Zach Brown Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, linux-nfs-u79uwXL29TY76Z2rM5mHXA, Trond Myklebust, Bryan Schumaker, Martin K. Petersen, Jens Axboe, Mark Fasheh, Joel Becker, Eric Wong On Wed, Sep 11, 2013 at 7:06 PM, Zach Brown <zab-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote: > > When I first started on this stuff I followed the lead of previous > work and added a new syscall for the copy operation: > > https://lkml.org/lkml/2013/5/14/618 > > Towards the end of that thread Eric Wong asked why we didn't just > extend splice. I immediately replied with some dumb dismissive > answer. Once I sat down and looked at it, though, it does make a > lot of sense. So good job, Eric. +10 Dummie points for me. > > Extending splice avoids all the noise of adding a new syscall and > naturally falls back to buffered copying as that's what the direct > splice path does for sendfile() today. Nice idea. > > So that's what this patch series demonstrates. It adds a flag that > lets splice get at the same direct splicing that sendfile() does. > We then add a file system file_operations method to accelerate the > copy which has access to both files. > > Some things to talk about: > - I really don't care about the naming here. If you do, holler. > - We might want different flags for file-to-file splicing and acceleration Yes, I think "copy" and "reflink" needs to be differentiated. 
> - We might want flags to require or forbid acceleration > - We might want to provide all these flags to sendfile, too > > Thoughts? Objections? Can filesystem support "whole file copy" only? Or arbitrary block-to-block copy should be mandatory? Splice has size_t argument for the size, which is limited to 4G on 32 bit. Won't this be an issue for whole-file-copy? We could have special value (-1) for whole file, but that's starting to be hackish. We are talking about copying large amounts of data in a single syscall, which will possibly take a long time. Will the syscall be interruptible? Restartable? Thanks, Miklos -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [RFC] extending splice for copy offloading [not found] ` <CAELBmZBGD4rph=gjLCPKCdEj+nzEQ-F=DExoL+h3vRm7qF7dCQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2013-09-25 18:38 ` Zach Brown 2013-09-25 19:02 ` Anna Schumaker 0 siblings, 1 reply; 62+ messages in thread From: Zach Brown @ 2013-09-25 18:38 UTC (permalink / raw) To: Szeredi Miklos Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, linux-nfs-u79uwXL29TY76Z2rM5mHXA, Trond Myklebust, Bryan Schumaker, Martin K. Petersen, Jens Axboe, Mark Fasheh, Joel Becker, Eric Wong Hrmph. I had composed a reply to you during Plumbers but.. something happened to it :). Here's another try now that I'm back. > > Some things to talk about: > > - I really don't care about the naming here. If you do, holler. > > - We might want different flags for file-to-file splicing and acceleration > > Yes, I think "copy" and "reflink" needs to be differentiated. I initially agreed but I'm not so sure now. The problem is that we can't know whether the acceleration is copying or not. XCOPY on some array may well do some shared referencing tricks. The nfs COPY op can have a server use btrfs reflink, or ext* and XCOPY, or .. who knows. At some point we have to admit that we have no way to determine the relative durability of writes. Storage can do a lot to make writes more or less fragile that we have no visibility of. SSD FTLs can log a bunch of unrelated sectors on to one flash failure domain. And if such a flag couldn't *actually* guarantee anything for a bunch of storage topologies, well, let's not bother with it. The only flag I'm in favour of now is one that has splice return rather than falling back to manual page cache reads and writes. It's more like O_NONBLOCK than any kind of data durability hint. > > - We might want flags to require or forbid acceleration > > - We might want to provide all these flags to sendfile, too > > > > Thoughts? Objections? > > Can filesystem support "whole file copy" only? 
Or arbitrary > block-to-block copy should be mandatory? I'm not sure I understand what you're asking. The interface specifies byte ranges. File systems can return errors if they can't accelerate the copy. We *can't* mandate copy acceleration granularity as some formats and protocols just can't do it. splice() will fall back to doing buffered copies when the file system returns an error. > Splice has size_t argument for the size, which is limited to 4G on 32 > bit. Won't this be an issue for whole-file-copy? We could have > special value (-1) for whole file, but that's starting to be hackish. It will be an issue, yeah. Just like it is with write() today. I think it's reasonable to start with a simple interface that matches current IO syscalls. I won't implement a special whole-file value, no. And it's not just 32bit size_t. While do_splice_direct() doesn't use the truncated length that's returned from rw_verify_area(), it then silently truncates the lengths to unsigned int in the splice_desc struct fields. It seems like we might want to address that :/. > We are talking about copying large amounts of data in a single > syscall, which will possibly take a long time. Will the syscall be > interruptible? Restartable? In as much as file systems let it be, yeah. As ever, you're not going to have a lot of luck interrupting a process stuck in lock_page(), mutex_lock(), wait_on_page_writeback(), etc. Though you did remind me to investigate restarting. Thanks. - z -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 62+ messages in thread
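The length truncation mentioned in the message above can be shown with a little arithmetic. This is a sketch, not kernel code: uint32_t stands in for the "unsigned int" splice_desc fields, and truncated_len() and chunk_len() are hypothetical names used only here. It shows what a silent narrowing does to a large request, and one way a caller could keep each request small enough to survive it.

```c
#include <stdint.h>

/* What a 64-bit length becomes when stored in a 32-bit field. */
uint32_t truncated_len(uint64_t len)
{
	return (uint32_t)len;	/* keeps only the low 32 bits */
}

/* Clamp a request to 1 GiB so it always fits in a 32-bit field. */
uint64_t chunk_len(uint64_t remaining)
{
	const uint64_t max_chunk = 1ULL << 30;

	return remaining < max_chunk ? remaining : max_chunk;
}
```

A 5 GiB request (0x140000000) narrows to 0x40000000, which is 1 GiB, and a request of exactly 4 GiB narrows to 0. Looping over chunk_len()-sized pieces sidesteps both cases regardless of how the kernel stores the length.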
* Re: [RFC] extending splice for copy offloading 2013-09-25 18:38 ` Zach Brown @ 2013-09-25 19:02 ` Anna Schumaker [not found] ` <CAFX2JfnyF8kyMYzCdqdr2JkoyQCom1bFLpFj89wODjoju54-Ow-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 0 siblings, 1 reply; 62+ messages in thread From: Anna Schumaker @ 2013-09-25 19:02 UTC (permalink / raw) To: Zach Brown Cc: Szeredi Miklos, linux-kernel, linux-fsdevel, linux-nfs@vger.kernel.org, Trond Myklebust, Bryan Schumaker, Martin K. Petersen, Jens Axboe, Mark Fasheh, Joel Becker, Eric Wong On Wed, Sep 25, 2013 at 2:38 PM, Zach Brown <zab@redhat.com> wrote: > > Hrmph. I had composed a reply to you during Plumbers but.. something > happened to it :). Here's another try now that I'm back. > >> > Some things to talk about: >> > - I really don't care about the naming here. If you do, holler. >> > - We might want different flags for file-to-file splicing and acceleration >> >> Yes, I think "copy" and "reflink" needs to be differentiated. > > I initially agreed but I'm not so sure now. The problem is that we > can't know whether the acceleration is copying or not. XCOPY on some > array may well do some shared referencing tricks. The nfs COPY op can > have a server use btrfs reflink, or ext* and XCOPY, or .. who knows. At > some point we have to admit that we have no way to determine the > relative durability of writes. Storage can do a lot to make writes more > or less fragile that we have no visibility of. SSD FTLs can log a bunch > of unrelated sectors on to one flash failure domain. > > And if such a flag couldn't *actually* guarantee anything for a bunch of > storage topologies, well, let's not bother with it. > > The only flag I'm in favour of now is one that has splice return rather > than falling back to manual page cache reads and writes. It's more like > O_NONBLOCK than any kind of data durability hint. 
For reference, I'm planning to have the NFS server do the fallback when it copies since any local copy will be faster than a read and write over the network. Anna > >> > - We might want flags to require or forbid acceleration >> > - We might want to provide all these flags to sendfile, too >> > >> > Thoughts? Objections? >> >> Can filesystem support "whole file copy" only? Or arbitrary >> block-to-block copy should be mandatory? > > I'm not sure I understand what you're asking. The interface specifies > byte ranges. File systems can return errors if they can't accelerate > the copy. We *can't* mandate copy acceleration granularity as some > formats and protocols just can't do it. splice() will fall back to > doing buffered copies when the file system returns an error. > >> Splice has size_t argument for the size, which is limited to 4G on 32 >> bit. Won't this be an issue for whole-file-copy? We could have >> special value (-1) for whole file, but that's starting to be hackish. > > It will be an issue, yeah. Just like it is with write() today. I think > it's reasonable to start with a simple interface that matches current IO > syscalls. I won't implement a special whole-file value, no. > > And it's not just 32bit size_t. While do_splice_direct() doesn't use > the truncated length that's returned from rw_verify_area(), it then > silently truncates the lengths to unsigned int in the splice_desc struct > fields. It seems like we might want to address that :/. > >> We are talking about copying large amounts of data in a single >> syscall, which will possibly take a long time. Will the syscall be >> interruptible? Restartable? > > In as much as file systems let it be, yeah. As ever, you're not going > to have a lot of luck interrupting a process stuck in lock_page(), > mutex_lock(), wait_on_page_writeback(), etc. Though you did remind me > to investigate restarting. Thanks. 
> - z
* Re: [RFC] extending splice for copy offloading [not found] ` <CAFX2JfnyF8kyMYzCdqdr2JkoyQCom1bFLpFj89wODjoju54-Ow-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2013-09-25 19:06 ` Zach Brown [not found] ` <20130925190620.GB30372-fypN+1c5dIyjpB87vu3CluTW4wlIGRCZ@public.gmane.org> 0 siblings, 1 reply; 62+ messages in thread From: Zach Brown @ 2013-09-25 19:06 UTC (permalink / raw) To: Anna Schumaker Cc: Szeredi Miklos, linux-kernel-u79uwXL29TY76Z2rM5mHXA, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Trond Myklebust, Bryan Schumaker, Martin K. Petersen, Jens Axboe, Mark Fasheh, Joel Becker, Eric Wong On Wed, Sep 25, 2013 at 03:02:29PM -0400, Anna Schumaker wrote: > On Wed, Sep 25, 2013 at 2:38 PM, Zach Brown <zab-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote: > > > > Hrmph. I had composed a reply to you during Plumbers but.. something > > happened to it :). Here's another try now that I'm back. > > > >> > Some things to talk about: > >> > - I really don't care about the naming here. If you do, holler. > >> > - We might want different flags for file-to-file splicing and acceleration > >> > >> Yes, I think "copy" and "reflink" needs to be differentiated. > > > > I initially agreed but I'm not so sure now. The problem is that we > > can't know whether the acceleration is copying or not. XCOPY on some > > array may well do some shared referencing tricks. The nfs COPY op can > > have a server use btrfs reflink, or ext* and XCOPY, or .. who knows. At > > some point we have to admit that we have no way to determine the > > relative durability of writes. Storage can do a lot to make writes more > > or less fragile that we have no visibility of. SSD FTLs can log a bunch > > of unrelated sectors on to one flash failure domain. > > > > And if such a flag couldn't *actually* guarantee anything for a bunch of > > storage topologies, well, let's not bother with it. 
> > > > The only flag I'm in favour of now is one that has splice return rather > > than falling back to manual page cache reads and writes. It's more like > > O_NONBLOCK than any kind of data durability hint. > > For reference, I'm planning to have the NFS server do the fallback > when it copies since any local copy will be faster than a read and > write over the network. Agreed, this is definitely the reasonable thing to do. - z -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [RFC] extending splice for copy offloading [not found] ` <20130925190620.GB30372-fypN+1c5dIyjpB87vu3CluTW4wlIGRCZ@public.gmane.org> @ 2013-09-25 19:55 ` J. Bruce Fields 2013-09-25 21:07 ` Zach Brown 0 siblings, 1 reply; 62+ messages in thread From: J. Bruce Fields @ 2013-09-25 19:55 UTC (permalink / raw) To: Zach Brown Cc: Anna Schumaker, Szeredi Miklos, linux-kernel-u79uwXL29TY76Z2rM5mHXA, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Trond Myklebust, Bryan Schumaker, Martin K. Petersen, Jens Axboe, Mark Fasheh, Joel Becker, Eric Wong On Wed, Sep 25, 2013 at 12:06:20PM -0700, Zach Brown wrote: > On Wed, Sep 25, 2013 at 03:02:29PM -0400, Anna Schumaker wrote: > > On Wed, Sep 25, 2013 at 2:38 PM, Zach Brown <zab-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote: > > > > > > Hrmph. I had composed a reply to you during Plumbers but.. something > > > happened to it :). Here's another try now that I'm back. > > > > > >> > Some things to talk about: > > >> > - I really don't care about the naming here. If you do, holler. > > >> > - We might want different flags for file-to-file splicing and acceleration > > >> > > >> Yes, I think "copy" and "reflink" needs to be differentiated. > > > > > > I initially agreed but I'm not so sure now. The problem is that we > > > can't know whether the acceleration is copying or not. XCOPY on some > > > array may well do some shared referencing tricks. The nfs COPY op can > > > have a server use btrfs reflink, or ext* and XCOPY, or .. who knows. At > > > some point we have to admit that we have no way to determine the > > > relative durability of writes. Storage can do a lot to make writes more > > > or less fragile that we have no visibility of. SSD FTLs can log a bunch > > > of unrelated sectors on to one flash failure domain. > > > > > > And if such a flag couldn't *actually* guarantee anything for a bunch of > > > storage topologies, well, let's not bother with it. 
> > > > > > The only flag I'm in favour of now is one that has splice return rather > > > than falling back to manual page cache reads and writes. It's more like > > > O_NONBLOCK than any kind of data durability hint. > > > > For reference, I'm planning to have the NFS server do the fallback > > when it copies since any local copy will be faster than a read and > > write over the network. > > Agreed, this is definitely the reasonable thing to do. A client-side copy will be slower, but I guess it does have the advantage that the application can track progress to some degree, and abort it fairly quickly without leaving the file in a totally undefined state--and both might be useful if the copy's not a simple constant-time operation. So maybe a way to pass your NONBLOCKy flag to the server would be useful? FWIW the protocol doesn't seem frozen yet, so I assume we could still add an extra flag field if you think it would be worthwhile. --b. -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [RFC] extending splice for copy offloading 2013-09-25 19:55 ` J. Bruce Fields @ 2013-09-25 21:07 ` Zach Brown 2013-09-26 8:58 ` Miklos Szeredi 0 siblings, 1 reply; 62+ messages in thread From: Zach Brown @ 2013-09-25 21:07 UTC (permalink / raw) To: J. Bruce Fields Cc: Anna Schumaker, Szeredi Miklos, linux-kernel, linux-fsdevel, linux-nfs@vger.kernel.org, Trond Myklebust, Bryan Schumaker, Martin K. Petersen, Jens Axboe, Mark Fasheh, Joel Becker, Eric Wong > A client-side copy will be slower, but I guess it does have the > advantage that the application can track progress to some degree, and > abort it fairly quickly without leaving the file in a totally undefined > state--and both might be useful if the copy's not a simple constant-time > operation. I suppose, but can't the app achieve a nice middle ground by copying the file in smaller syscalls? Avoid bulk data motion back to the client, but still get notification every, I dunno, few hundred meg? > So maybe a way to pass your NONBLOCKy flag to the server would be > useful? Maybe, but maybe it also just won't be used in practice. I'm to the point where I'd rather we get the stupidest possible thing out there so that we can learn from actual use of the interface. - z ^ permalink raw reply [flat|nested] 62+ messages in thread
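The "smaller syscalls" idea above can be sketched roughly as follows. This is only an illustration in Python, not anything from the patch series: the name `copy_in_chunks`, the `on_progress` callback and the 1 MiB `CHUNK` value are assumptions made up for the example, and a plain read/write pair stands in for whatever offloaded request the kernel would actually issue per chunk.

```python
import os

CHUNK = 1 << 20  # bytes per request; an arbitrary value chosen for illustration


def copy_in_chunks(src_path, dst_path, chunk=CHUNK, on_progress=None):
    """Copy src to dst in bounded requests, reporting progress after each one.

    Each loop iteration stands in for one (possibly offloaded) copy request.
    The caller learns how far the copy got after every request and could
    stop between requests, leaving a well-defined prefix in the destination.
    """
    copied = 0
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        total = os.fstat(src.fileno()).st_size
        while True:
            buf = src.read(chunk)  # hypothetical stand-in for one copy request
            if not buf:
                break
            dst.write(buf)
            copied += len(buf)
            if on_progress is not None:
                on_progress(copied, total)
    return copied
```

The point of the sketch is only the shape of the loop: the request size bounds both how long a single call can block and how often the application hears back.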
* Re: [RFC] extending splice for copy offloading 2013-09-25 21:07 ` Zach Brown @ 2013-09-26 8:58 ` Miklos Szeredi 2013-09-26 15:34 ` J. Bruce Fields 2013-09-26 18:55 ` Zach Brown 0 siblings, 2 replies; 62+ messages in thread From: Miklos Szeredi @ 2013-09-26 8:58 UTC (permalink / raw) To: Zach Brown Cc: J. Bruce Fields, Anna Schumaker, Kernel Mailing List, Linux-Fsdevel, linux-nfs@vger.kernel.org, Trond Myklebust, Bryan Schumaker, Martin K. Petersen, Jens Axboe, Mark Fasheh, Joel Becker, Eric Wong On Wed, Sep 25, 2013 at 11:07 PM, Zach Brown <zab@redhat.com> wrote: >> A client-side copy will be slower, but I guess it does have the >> advantage that the application can track progress to some degree, and >> abort it fairly quickly without leaving the file in a totally undefined >> state--and both might be useful if the copy's not a simple constant-time >> operation. > > I suppose, but can't the app achieve a nice middle ground by copying the > file in smaller syscalls? Avoid bulk data motion back to the client, > but still get notification every, I dunno, few hundred meg? Yes. And if "cp" could just be switched from a read+write syscall pair to a single splice syscall using the same buffer size. And then the user would only notice that things got faster in case of server side copy. No problems with long blocking times (at least not much worse than it was). However "cp" doesn't do reflinking by default, it has a switch for that. If we just want "cp" and the like to use splice without fearing side effects then by default we should try to be as close to read+write behavior as possible. No? That's what I'm really worrying about when you want to wire up splice to reflink by default. I do think there should be a flag for that. And if on the block level some magic happens, so be it. It's not the fs developer's worry any more ;) Thanks, Miklos ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [RFC] extending splice for copy offloading 2013-09-26 8:58 ` Miklos Szeredi @ 2013-09-26 15:34 ` J. Bruce Fields [not found] ` <20130926153359.GE704-uC3wQj2KruNg9hUCZPvPmw@public.gmane.org> 2013-09-26 18:06 ` Miklos Szeredi 2013-09-26 18:55 ` Zach Brown 1 sibling, 2 replies; 62+ messages in thread From: J. Bruce Fields @ 2013-09-26 15:34 UTC (permalink / raw) To: Miklos Szeredi Cc: Zach Brown, Anna Schumaker, Kernel Mailing List, Linux-Fsdevel, linux-nfs@vger.kernel.org, Trond Myklebust, Bryan Schumaker, Martin K. Petersen, Jens Axboe, Mark Fasheh, Joel Becker, Eric Wong On Thu, Sep 26, 2013 at 10:58:05AM +0200, Miklos Szeredi wrote: > On Wed, Sep 25, 2013 at 11:07 PM, Zach Brown <zab@redhat.com> wrote: > >> A client-side copy will be slower, but I guess it does have the > >> advantage that the application can track progress to some degree, and > >> abort it fairly quickly without leaving the file in a totally undefined > >> state--and both might be useful if the copy's not a simple constant-time > >> operation. > > > > I suppose, but can't the app achieve a nice middle ground by copying the > > file in smaller syscalls? Avoid bulk data motion back to the client, > > but still get notification every, I dunno, few hundred meg? > > Yes. And if "cp" could just be switched from a read+write syscall > pair to a single splice syscall using the same buffer size. Will the various magic fs-specific copy operations become inefficient when the range copied is too small? (Totally naive question, as I have no idea how they really work.) --b. > And then > the user would only notice that things got faster in case of server > side copy. No problems with long blocking times (at least not much > worse than it was). > > However "cp" doesn't do reflinking by default, it has a switch for > that. If we just want "cp" and the like to use splice without fearing > side effects then by default we should try to be as close to > read+write behavior as possible. No? 
That's what I'm really > worrying about when you want to wire up splice to reflink by default. > I do think there should be a flag for that. And if on the block level > some magic happens, so be it. It's not the fs deverloper's worry any > more ;) > > Thanks, > Miklos > -- > To unsubscribe from this list: send the line "unsubscribe linux-nfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 62+ messages in thread
[parent not found: <20130926153359.GE704-uC3wQj2KruNg9hUCZPvPmw@public.gmane.org>]
* Re: [RFC] extending splice for copy offloading [not found] ` <20130926153359.GE704-uC3wQj2KruNg9hUCZPvPmw@public.gmane.org> @ 2013-09-26 16:46 ` Ric Wheeler 0 siblings, 0 replies; 62+ messages in thread From: Ric Wheeler @ 2013-09-26 16:46 UTC (permalink / raw) To: J. Bruce Fields Cc: Miklos Szeredi, Zach Brown, Anna Schumaker, Kernel Mailing List, Linux-Fsdevel, linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Trond Myklebust, Bryan Schumaker, Martin K. Petersen, Jens Axboe, Mark Fasheh, Joel Becker, Eric Wong On 09/26/2013 11:34 AM, J. Bruce Fields wrote: > On Thu, Sep 26, 2013 at 10:58:05AM +0200, Miklos Szeredi wrote: >> On Wed, Sep 25, 2013 at 11:07 PM, Zach Brown <zab-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote: >>>> A client-side copy will be slower, but I guess it does have the >>>> advantage that the application can track progress to some degree, and >>>> abort it fairly quickly without leaving the file in a totally undefined >>>> state--and both might be useful if the copy's not a simple constant-time >>>> operation. >>> I suppose, but can't the app achieve a nice middle ground by copying the >>> file in smaller syscalls? Avoid bulk data motion back to the client, >>> but still get notification every, I dunno, few hundred meg? >> Yes. And if "cp" could just be switched from a read+write syscall >> pair to a single splice syscall using the same buffer size. > Will the various magic fs-specific copy operations become inefficient > when the range copied is too small? > > (Totally naive question, as I have no idea how they really work.) > > --b. I think that is not really possible to tell when we invoke it. It is very much target device (or file system, etc) dependent on how long it takes. It could be as simple as a reflink copying in a smallish amount of metadata or fall back to a full byte-by-byte copy. Also note that speed is not the only impact here, some of the mechanisms actually do not consume more space (just increment shared data references). 
It would probably make more sense to send it off to the target device and have it return an error when not appropriate (then the app can fall back to the old-fashioned copy). ric > >> And then >> the user would only notice that things got faster in case of server >> side copy. No problems with long blocking times (at least not much >> worse than it was). >> >> However "cp" doesn't do reflinking by default, it has a switch for >> that. If we just want "cp" and the like to use splice without fearing >> side effects then by default we should try to be as close to >> read+write behavior as possible. No? That's what I'm really >> worrying about when you want to wire up splice to reflink by default. >> I do think there should be a flag for that. And if on the block level >> some magic happens, so be it. It's not the fs deverloper's worry any >> more ;) >> >> Thanks, >> Miklos ^ permalink raw reply [flat|nested] 62+ messages in thread
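The "return an error, then fall back" flow Ric describes can be sketched from the application side. This is a hedged illustration only: it uses Python's `os.copy_file_range()`, which wraps a Linux interface that arrived later than this thread and is not the splice flag being proposed here. The names `offload_copy`, `manual_copy`, `copy_with_fallback` and the `FALLBACK_ERRNOS` set are assumptions invented for the example.

```python
import errno
import os

# Errors treated as "offload not appropriate here"; this set is an assumption.
FALLBACK_ERRNOS = {errno.ENOSYS, errno.EXDEV, errno.EINVAL, errno.EOPNOTSUPP}


def offload_copy(src_fd, dst_fd, count):
    """Ask the kernel to copy up to count bytes; raises OSError on refusal."""
    if not hasattr(os, "copy_file_range"):
        raise OSError(errno.ENOSYS, "copy_file_range unavailable")
    return os.copy_file_range(src_fd, dst_fd, count)


def manual_copy(src_fd, dst_fd, count):
    """The old-fashioned path: read into userspace, then write it back out."""
    buf = os.read(src_fd, count)
    view = memoryview(buf)
    while len(view) > 0:
        written = os.write(dst_fd, view)
        view = view[written:]
    return len(buf)


def copy_with_fallback(src_path, dst_path, chunk=1 << 20, offload=offload_copy):
    """Copy a file, switching to manual_copy once the offload is refused.

    Returns True if the fallback path had to be used.
    """
    copier, used_fallback = offload, False
    src_fd = os.open(src_path, os.O_RDONLY)
    try:
        dst_fd = os.open(dst_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
        try:
            while True:
                try:
                    n = copier(src_fd, dst_fd, chunk)
                except OSError as e:
                    if copier is manual_copy or e.errno not in FALLBACK_ERRNOS:
                        raise
                    # Both descriptors keep their offsets, so the manual
                    # path resumes exactly where the offload stopped.
                    copier, used_fallback = manual_copy, True
                    continue
                if n == 0:
                    break
        finally:
            os.close(dst_fd)
    finally:
        os.close(src_fd)
    return used_fallback
```

The `offload` parameter exists only so the refusal path can be exercised without a particular file system; a real caller would just use the default.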
* Re: [RFC] extending splice for copy offloading 2013-09-26 15:34 ` J. Bruce Fields [not found] ` <20130926153359.GE704-uC3wQj2KruNg9hUCZPvPmw@public.gmane.org> @ 2013-09-26 18:06 ` Miklos Szeredi 2013-09-26 19:06 ` Zach Brown [not found] ` <CAJfpegsUchb0eX+Hi3rN5Ypje3Y-dgo=pxgM1Y3BQbHVp=1hSw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 1 sibling, 2 replies; 62+ messages in thread From: Miklos Szeredi @ 2013-09-26 18:06 UTC (permalink / raw) To: J. Bruce Fields Cc: Zach Brown, Anna Schumaker, Kernel Mailing List, Linux-Fsdevel, linux-nfs@vger.kernel.org, Trond Myklebust, Bryan Schumaker, Martin K. Petersen, Jens Axboe, Mark Fasheh, Joel Becker, Eric Wong On Thu, Sep 26, 2013 at 5:34 PM, J. Bruce Fields <bfields@fieldses.org> wrote: > On Thu, Sep 26, 2013 at 10:58:05AM +0200, Miklos Szeredi wrote: >> On Wed, Sep 25, 2013 at 11:07 PM, Zach Brown <zab@redhat.com> wrote: >> >> A client-side copy will be slower, but I guess it does have the >> >> advantage that the application can track progress to some degree, and >> >> abort it fairly quickly without leaving the file in a totally undefined >> >> state--and both might be useful if the copy's not a simple constant-time >> >> operation. >> > >> > I suppose, but can't the app achieve a nice middle ground by copying the >> > file in smaller syscalls? Avoid bulk data motion back to the client, >> > but still get notification every, I dunno, few hundred meg? >> >> Yes. And if "cp" could just be switched from a read+write syscall >> pair to a single splice syscall using the same buffer size. > > Will the various magic fs-specific copy operations become inefficient > when the range copied is too small? We could treat splice-copy operations just like write operations (can be buffered, coalesced, synced). But I'm not sure it's worth the effort; 99% of the use of this interface will be copying whole files.
And for that perhaps we need a different API, one which has been discussed some time ago: asynchronous copyfile() returns immediately with a pollable event descriptor indicating copy progress, and some way to cancel the copy. And that can internally rely on ->direct_splice(), with appropriate algorithms for determining the optimal chunk size. Thanks, Miklos ^ permalink raw reply [flat|nested] 62+ messages in thread
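To make the shape of such an interface concrete, here is a userspace emulation in Python. It is purely hypothetical: no asynchronous copyfile() with these semantics is defined anywhere in this thread, and the class name `AsyncCopy`, the 8-byte progress records and the use of a pipe as the "pollable event descriptor" are all assumptions for the sketch. It also assumes a POSIX system where `select()` works on pipes.

```python
import os
import select
import struct
import threading


class AsyncCopy:
    """Hypothetical emulation of an asynchronous copyfile().

    The constructor returns immediately. progress_fd is a pollable
    descriptor that yields 8-byte big-endian cumulative byte counts and
    reaches EOF when the copy has finished or has been cancelled.
    """

    def __init__(self, src_path, dst_path, chunk=1 << 20):
        self._src, self._dst, self._chunk = src_path, dst_path, chunk
        self._cancel = threading.Event()
        self.progress_fd, self._wfd = os.pipe()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self):
        copied = 0
        try:
            with open(self._src, "rb") as src, open(self._dst, "wb") as dst:
                while not self._cancel.is_set():
                    buf = src.read(self._chunk)
                    if not buf:
                        break
                    dst.write(buf)
                    copied += len(buf)
                    os.write(self._wfd, struct.pack("!Q", copied))
        finally:
            # Closing the write end is what signals completion to the poller.
            os.close(self._wfd)

    def cancel(self):
        """Ask the worker to stop after the chunk it is currently copying."""
        self._cancel.set()

    def wait(self):
        """Poll progress_fd until EOF; return the last reported byte count."""
        last = 0
        while True:
            select.select([self.progress_fd], [], [])
            record = os.read(self.progress_fd, 8)
            if not record:
                break
            last = struct.unpack("!Q", record)[0]
        os.close(self.progress_fd)
        self._thread.join()
        return last
```

A real caller would add `progress_fd` to its own event loop rather than call `wait()`; the helper is there only to show how the records are consumed.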
* Re: [RFC] extending splice for copy offloading 2013-09-26 18:06 ` Miklos Szeredi @ 2013-09-26 19:06 ` Zach Brown 2013-09-26 19:53 ` Miklos Szeredi [not found] ` <CAJfpegsUchb0eX+Hi3rN5Ypje3Y-dgo=pxgM1Y3BQbHVp=1hSw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 1 sibling, 1 reply; 62+ messages in thread From: Zach Brown @ 2013-09-26 19:06 UTC (permalink / raw) To: Miklos Szeredi Cc: J. Bruce Fields, Anna Schumaker, Kernel Mailing List, Linux-Fsdevel, linux-nfs@vger.kernel.org, Trond Myklebust, Bryan Schumaker, Martin K. Petersen, Jens Axboe, Mark Fasheh, Joel Becker, Eric Wong On Thu, Sep 26, 2013 at 08:06:41PM +0200, Miklos Szeredi wrote: > On Thu, Sep 26, 2013 at 5:34 PM, J. Bruce Fields <bfields@fieldses.org> wrote: > > On Thu, Sep 26, 2013 at 10:58:05AM +0200, Miklos Szeredi wrote: > >> On Wed, Sep 25, 2013 at 11:07 PM, Zach Brown <zab@redhat.com> wrote: > >> >> A client-side copy will be slower, but I guess it does have the > >> >> advantage that the application can track progress to some degree, and > >> >> abort it fairly quickly without leaving the file in a totally undefined > >> >> state--and both might be useful if the copy's not a simple constant-time > >> >> operation. > >> > > >> > I suppose, but can't the app achieve a nice middle ground by copying the > >> > file in smaller syscalls? Avoid bulk data motion back to the client, > >> > but still get notification every, I dunno, few hundred meg? > >> > >> Yes. And if "cp" could just be switched from a read+write syscall > >> pair to a single splice syscall using the same buffer size. > > > > Will the various magic fs-specific copy operations become inefficient > > when the range copied is too small? > > We could treat spice-copy operations just like write operations (can > be buffered, coalesced, synced). > > But I'm not sure it's worth the effort; 99% of the use of this > interface will be copying whole files. 
And for that perhaps we need a > different API, one which has been discussed some time ago: > asynchronous copyfile() returns immediately with a pollable event > descriptor indicating copy progress, and some way to cancel the copy. > And that can internally rely on ->direct_splice(), with appropriate > algorithms for determine the optimal chunk size. And perhaps we don't. Perhaps we can provide this much simpler data-plane interface that works well enough for most everyone and can avoid going down the async rat hole, yet again. - z ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [RFC] extending splice for copy offloading 2013-09-26 19:06 ` Zach Brown @ 2013-09-26 19:53 ` Miklos Szeredi [not found] ` <CAJfpegvvWhs+jv2J9kOQrB31PEO3kyn_sLm_e2w9YKp=y6EDhA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 0 siblings, 1 reply; 62+ messages in thread From: Miklos Szeredi @ 2013-09-26 19:53 UTC (permalink / raw) To: Zach Brown Cc: J. Bruce Fields, Anna Schumaker, Kernel Mailing List, Linux-Fsdevel, linux-nfs@vger.kernel.org, Trond Myklebust, Bryan Schumaker, Martin K. Petersen, Jens Axboe, Mark Fasheh, Joel Becker, Eric Wong On Thu, Sep 26, 2013 at 9:06 PM, Zach Brown <zab@redhat.com> wrote: >> But I'm not sure it's worth the effort; 99% of the use of this >> interface will be copying whole files. And for that perhaps we need a >> different API, one which has been discussed some time ago: >> asynchronous copyfile() returns immediately with a pollable event >> descriptor indicating copy progress, and some way to cancel the copy. >> And that can internally rely on ->direct_splice(), with appropriate >> algorithms for determine the optimal chunk size. > > And perhaps we don't. Perhaps we can provide this much simpler > data-plane interface that works well enough for most everyone and can > avoid going down the async rat hole, yet again. I think either buffering or async is needed to get good performance without too much complexity in the app (which is not good). Buffering works quite well for regular I/O, so maybe it's the way to go here as well. Thanks, Miklos ^ permalink raw reply [flat|nested] 62+ messages in thread
[parent not found: <CAJfpegvvWhs+jv2J9kOQrB31PEO3kyn_sLm_e2w9YKp=y6EDhA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: [RFC] extending splice for copy offloading [not found] ` <CAJfpegvvWhs+jv2J9kOQrB31PEO3kyn_sLm_e2w9YKp=y6EDhA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2013-09-26 21:23 ` Ric Wheeler [not found] ` <5244A5E7.90808-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 0 siblings, 1 reply; 62+ messages in thread From: Ric Wheeler @ 2013-09-26 21:23 UTC (permalink / raw) To: Miklos Szeredi Cc: Zach Brown, J. Bruce Fields, Anna Schumaker, Kernel Mailing List, Linux-Fsdevel, linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Trond Myklebust, Bryan Schumaker, Martin K. Petersen, Jens Axboe, Mark Fasheh, Joel Becker, Eric Wong On 09/26/2013 03:53 PM, Miklos Szeredi wrote: > On Thu, Sep 26, 2013 at 9:06 PM, Zach Brown <zab-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote: > >>> But I'm not sure it's worth the effort; 99% of the use of this >>> interface will be copying whole files. And for that perhaps we need a >>> different API, one which has been discussed some time ago: >>> asynchronous copyfile() returns immediately with a pollable event >>> descriptor indicating copy progress, and some way to cancel the copy. >>> And that can internally rely on ->direct_splice(), with appropriate >>> algorithms for determine the optimal chunk size. >> And perhaps we don't. Perhaps we can provide this much simpler >> data-plane interface that works well enough for most everyone and can >> avoid going down the async rat hole, yet again. > I think either buffering or async is needed to get good perforrmace > without too much complexity in the app (which is not good). Buffering > works quite well for regular I/O, so maybe its the way to go here as > well. > > Thanks, > Miklos > Buffering misses the whole point of the copy offload - the idea is *not* to read or write the actual data in the most interesting cases which offload the operation to a smart target device or file system. 
Regards, Ric ^ permalink raw reply [flat|nested] 62+ messages in thread
[parent not found: <5244A5E7.90808-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>]
* Re: [RFC] extending splice for copy offloading [not found] ` <5244A5E7.90808-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> @ 2013-09-27 4:47 ` Miklos Szeredi 2013-09-27 14:00 ` Ric Wheeler 0 siblings, 1 reply; 62+ messages in thread From: Miklos Szeredi @ 2013-09-27 4:47 UTC (permalink / raw) To: Ric Wheeler Cc: Zach Brown, J. Bruce Fields, Anna Schumaker, Kernel Mailing List, Linux-Fsdevel, linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Trond Myklebust, Bryan Schumaker, Martin K. Petersen, Jens Axboe, Mark Fasheh, Joel Becker, Eric Wong On Thu, Sep 26, 2013 at 11:23 PM, Ric Wheeler <rwheeler-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote: > On 09/26/2013 03:53 PM, Miklos Szeredi wrote: >> >> On Thu, Sep 26, 2013 at 9:06 PM, Zach Brown <zab-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote: >> >>>> But I'm not sure it's worth the effort; 99% of the use of this >>>> interface will be copying whole files. And for that perhaps we need a >>>> different API, one which has been discussed some time ago: >>>> asynchronous copyfile() returns immediately with a pollable event >>>> descriptor indicating copy progress, and some way to cancel the copy. >>>> And that can internally rely on ->direct_splice(), with appropriate >>>> algorithms for determine the optimal chunk size. >>> >>> And perhaps we don't. Perhaps we can provide this much simpler >>> data-plane interface that works well enough for most everyone and can >>> avoid going down the async rat hole, yet again. >> >> I think either buffering or async is needed to get good perforrmace >> without too much complexity in the app (which is not good). Buffering >> works quite well for regular I/O, so maybe its the way to go here as >> well. >> >> Thanks, >> Miklos >> > > Buffering misses the whole point of the copy offload - the idea is *not* to > read or write the actual data in the most interesting cases which offload > the operation to a smart target device or file system. I meant buffering the COPY, not the data. 
Doing the COPY synchronously will always incur a performance penalty, the amount depending on the latency, which can be significant with networking. We think of write(2) as a synchronous interface, because that's the appearance we get from all that hard work the page cache and delayed writeback code does to make an asynchronous operation look as if it was synchronous. So from a userspace API perspective a sync interface is nice, but inside we almost always have async interfaces to do the actual work. Thanks, Miklos > > Regards, > > Ric > ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [RFC] extending splice for copy offloading 2013-09-27 4:47 ` Miklos Szeredi @ 2013-09-27 14:00 ` Ric Wheeler 2013-09-27 14:39 ` Miklos Szeredi 0 siblings, 1 reply; 62+ messages in thread From: Ric Wheeler @ 2013-09-27 14:00 UTC (permalink / raw) To: Miklos Szeredi Cc: Ric Wheeler, Zach Brown, J. Bruce Fields, Anna Schumaker, Kernel Mailing List, Linux-Fsdevel, linux-nfs@vger.kernel.org, Trond Myklebust, Bryan Schumaker, Martin K. Petersen, Jens Axboe, Mark Fasheh, Joel Becker, Eric Wong On 09/27/2013 12:47 AM, Miklos Szeredi wrote: > On Thu, Sep 26, 2013 at 11:23 PM, Ric Wheeler <rwheeler@redhat.com> wrote: >> On 09/26/2013 03:53 PM, Miklos Szeredi wrote: >>> On Thu, Sep 26, 2013 at 9:06 PM, Zach Brown <zab@redhat.com> wrote: >>> >>>>> But I'm not sure it's worth the effort; 99% of the use of this >>>>> interface will be copying whole files. And for that perhaps we need a >>>>> different API, one which has been discussed some time ago: >>>>> asynchronous copyfile() returns immediately with a pollable event >>>>> descriptor indicating copy progress, and some way to cancel the copy. >>>>> And that can internally rely on ->direct_splice(), with appropriate >>>>> algorithms for determine the optimal chunk size. >>>> And perhaps we don't. Perhaps we can provide this much simpler >>>> data-plane interface that works well enough for most everyone and can >>>> avoid going down the async rat hole, yet again. >>> I think either buffering or async is needed to get good perforrmace >>> without too much complexity in the app (which is not good). Buffering >>> works quite well for regular I/O, so maybe its the way to go here as >>> well. >>> >>> Thanks, >>> Miklos >>> >> Buffering misses the whole point of the copy offload - the idea is *not* to >> read or write the actual data in the most interesting cases which offload >> the operation to a smart target device or file system. > I meant buffering the COPY, not the data. 
Doing the COPY > synchronously will always incur a performance penalty, the amount > depending on the latency, which can be significant with networking. > > We think of write(2) as a synchronous interface, because that's the > appearance we get from all that hard work the page cache and delayed > writeback code does to make an asynchronous operation look as if it > was synchronous. So from a userspace API perspective a sync interface > is nice, but inside we almost always have async interfaces to do the > actual work. > > Thanks, > Miklos I think that you are an order of magnitude off here in thinking about the scale of the operations. An enabled, synchronous copy offload to an array (or one that turns into a reflink locally) is effectively the cost of the call itself. Let's say no slower than one IO to a S-ATA disk (10ms?) as a pessimistic guess. Realistically, that call is much faster than that worst case number. Copying any substantial amount of data - like the target workload of VM images or media files - would be hundreds of MBs per copy and that would take seconds or minutes. We should really work on getting the basic mechanism working and robust without any complications, then we can look at real, measured performance and see if there is any justification for adding complexity. thanks! Ric > ^ permalink raw reply [flat|nested] 62+ messages in thread
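Ric's scale argument can be put in numbers with a back-of-the-envelope calculation. Only the 10 ms figure comes from the message above; the 500 MB file size and the 100 MB/s sustained throughput are assumptions picked for illustration, and the names are made up for the sketch.

```python
def copy_seconds(size_bytes, bytes_per_second):
    """Time to move size_bytes through a read+write path at a given rate."""
    return size_bytes / bytes_per_second


OFFLOAD_CALL_S = 0.010      # pessimistic cost of one offloaded call (10 ms)
SIZE = 500 * 10**6          # assumed example of "hundreds of MBs"
THROUGHPUT = 100 * 10**6    # assumed sustained copy rate of 100 MB/s

data_copy_s = copy_seconds(SIZE, THROUGHPUT)  # a few seconds of data motion
ratio = data_copy_s / OFFLOAD_CALL_S          # how much slower than the offload
```

With these assumed figures the data copy takes about five seconds, several hundred times the pessimistic offload cost, which is the gap Ric is pointing at; a slower device or a larger image only widens it.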
* Re: [RFC] extending splice for copy offloading 2013-09-27 14:00 ` Ric Wheeler @ 2013-09-27 14:39 ` Miklos Szeredi 0 siblings, 0 replies; 62+ messages in thread From: Miklos Szeredi @ 2013-09-27 14:39 UTC (permalink / raw) To: Ric Wheeler Cc: Zach Brown, J. Bruce Fields, Anna Schumaker, Kernel Mailing List, Linux-Fsdevel, linux-nfs@vger.kernel.org, Trond Myklebust, Bryan Schumaker, Martin K. Petersen, Jens Axboe, Mark Fasheh, Joel Becker, Eric Wong On Fri, Sep 27, 2013 at 4:00 PM, Ric Wheeler <rwheeler@redhat.com> wrote: > I think that you are an order of magnitude off here in thinking about the > scale of the operations. > > An enabled, synchronize copy offload to an array (or one that turns into a > reflink locally) is effectively the cost of the call itself. Let's say no > slower than one IO to a S-ATA disk (10ms?) as a pessimistic guess. > Realistically, that call is much faster than that worst case number. > > Copying any substantial amount of data - like the target workload of VM > images or media files - would be hundreds of MB's per copy and that would > take seconds or minutes. Will a single splice-copy operation be interruptible/restartable? If not, how should apps size one request so that it doesn't take too much time? Even for slow devices (usb stick)? If it will be restartable, how? Can remote copy be done with this? Over a high latency network? Those are the questions I'm worried about. > > We should really work on getting the basic mechanism working and robust > without any complications, then we can look at real, measured performance > and see if there is any justification for adding complexity. Go for that. But don't forget that at the end of the day actual apps will need to be converted like file managers and "dd" and "cp" and we definitely don't want a userspace library to be able to figure out how the copy is done most efficiently; it's something for the kernel to figure out.
Thanks, Miklos ^ permalink raw reply [flat|nested] 62+ messages in thread
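The request-sizing question Miklos raises can be illustrated with a small feedback rule: time each request and scale the next one toward a target duration. This is a sketch of the kind of logic an application or library would otherwise have to carry, which is exactly what the message above argues belongs in the kernel. The function name `next_chunk_size`, the 2x growth cap and the size bounds are all assumptions for the example.

```python
def next_chunk_size(prev_size, elapsed_s, budget_s,
                    min_size=64 * 1024, max_size=1 << 30):
    """Pick the next request size so one request takes roughly budget_s.

    prev_size  bytes requested last time
    elapsed_s  how long that request took
    budget_s   how long the caller is willing to block per request
    """
    if elapsed_s <= 0:
        # Too fast to measure (for example an instant reflink): just grow.
        scaled = prev_size * 2
    else:
        scaled = int(prev_size * budget_s / elapsed_s)
        # Grow by at most 2x per step so one fast sample cannot overshoot.
        scaled = min(scaled, prev_size * 2)
    return max(min_size, min(scaled, max_size))
```

On a slow device such as a USB stick the size shrinks until each request fits the budget; on a target that offloads the copy it keeps doubling up to `max_size`.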
[parent not found: <CAJfpegsUchb0eX+Hi3rN5Ypje3Y-dgo=pxgM1Y3BQbHVp=1hSw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: [RFC] extending splice for copy offloading [not found] ` <CAJfpegsUchb0eX+Hi3rN5Ypje3Y-dgo=pxgM1Y3BQbHVp=1hSw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2013-10-06 8:42 ` Rob Landley 0 siblings, 0 replies; 62+ messages in thread From: Rob Landley @ 2013-10-06 8:42 UTC (permalink / raw) To: Miklos Szeredi Cc: J. Bruce Fields, Zach Brown, Anna Schumaker, Kernel Mailing List, Linux-Fsdevel, linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Trond Myklebust, Bryan Schumaker, Martin K. Petersen, Jens Axboe, Mark Fasheh, Joel Becker, Eric Wong On 09/26/2013 01:06:41 PM, Miklos Szeredi wrote: > On Thu, Sep 26, 2013 at 5:34 PM, J. Bruce Fields > <bfields-uC3wQj2KruNg9hUCZPvPmw@public.gmane.org> wrote: > > On Thu, Sep 26, 2013 at 10:58:05AM +0200, Miklos Szeredi wrote: > >> On Wed, Sep 25, 2013 at 11:07 PM, Zach Brown <zab-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> > wrote: > >> >> A client-side copy will be slower, but I guess it does have the > >> >> advantage that the application can track progress to some > degree, and > >> >> abort it fairly quickly without leaving the file in a totally > undefined > >> >> state--and both might be useful if the copy's not a simple > constant-time > >> >> operation. > >> > > >> > I suppose, but can't the app achieve a nice middle ground by > copying the > >> > file in smaller syscalls? Avoid bulk data motion back to the > client, > >> > but still get notification every, I dunno, few hundred meg? > >> > >> Yes. And if "cp" could just be switched from a read+write syscall > >> pair to a single splice syscall using the same buffer size. > > > > Will the various magic fs-specific copy operations become > inefficient > > when the range copied is too small? > > We could treat spice-copy operations just like write operations (can > be buffered, coalesced, synced). > > But I'm not sure it's worth the effort; 99% of the use of this > interface will be copying whole files. 
My "patch" implementation (in busybox and toybox) hits a point where it wants to copy the rest of the file, once there are no more hunks to apply. This is not copying a whole file. A similar thing happens with tail when you use the +N syntax to skip start instead of end lines. I can see sed doing a similar thing when told to operate on line ranges... Not sure your 99% holds up here. Rob ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [RFC] extending splice for copy offloading 2013-09-26 8:58 ` Miklos Szeredi 2013-09-26 15:34 ` J. Bruce Fields @ 2013-09-26 18:55 ` Zach Brown [not found] ` <20130926185508.GO30372-fypN+1c5dIyjpB87vu3CluTW4wlIGRCZ@public.gmane.org> 1 sibling, 1 reply; 62+ messages in thread From: Zach Brown @ 2013-09-26 18:55 UTC (permalink / raw) To: Miklos Szeredi Cc: J. Bruce Fields, Anna Schumaker, Kernel Mailing List, Linux-Fsdevel, linux-nfs@vger.kernel.org, Trond Myklebust, Bryan Schumaker, Martin K. Petersen, Jens Axboe, Mark Fasheh, Joel Becker, Eric Wong On Thu, Sep 26, 2013 at 10:58:05AM +0200, Miklos Szeredi wrote: > On Wed, Sep 25, 2013 at 11:07 PM, Zach Brown <zab@redhat.com> wrote: > >> A client-side copy will be slower, but I guess it does have the > >> advantage that the application can track progress to some degree, and > >> abort it fairly quickly without leaving the file in a totally undefined > >> state--and both might be useful if the copy's not a simple constant-time > >> operation. > > > > I suppose, but can't the app achieve a nice middle ground by copying the > > file in smaller syscalls? Avoid bulk data motion back to the client, > > but still get notification every, I dunno, few hundred meg? > > Yes. And if "cp" could just be switched from a read+write syscall > pair to a single splice syscall using the same buffer size. And then > the user would only notice that things got faster in case of server > side copy. No problems with long blocking times (at least not much > worse than it was). Hmm, yes, that would be a nice outcome. > However "cp" doesn't do reflinking by default, it has a switch for > that. If we just want "cp" and the like to use splice without fearing > side effects then by default we should try to be as close to > read+write behavior as possible. No? I guess? I don't find requiring --reflink hugely compelling. But there it is. > That's what I'm really > worrying about when you want to wire up splice to reflink by default. 
> I do think there should be a flag for that. And if on the block level > some magic happens, so be it. It's not the fs deverloper's worry any > more ;) Sure. So we'd have: - no flag default that forbids knowingly copying with shared references so that it will be used by default by people who feel strongly about their assumptions about independent write durability. - a flag that allows shared references for people who would otherwise use the file system shared reference ioctls (ocfs2 reflink, btrfs clone) but would like it to also do server-side read/write copies over nfs without additional intervention. - a flag that requires shared references for callers who don't want giant copies to take forever if they aren't instant. (The qemu guys asked for this at Plumbers.) I think I can live with that. - z ^ permalink raw reply [flat|nested] 62+ messages in thread
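The three behaviours Zach lists can be written down as a small decision table. The sketch below is hypothetical throughout: the thread deliberately leaves naming open, so `SharePolicy`, `plan_copy` and the choice of `EOPNOTSUPP` as the failure code are assumptions for illustration, not proposed flag names or kernel behaviour.

```python
import enum
import errno


class SharePolicy(enum.Enum):
    FORBID = "forbid"    # no flag: never knowingly share references
    ALLOW = "allow"      # share if possible, otherwise copy the data
    REQUIRE = "require"  # share or fail; never fall back to a long copy


def plan_copy(policy, can_share):
    """Return "share" or "copy" for a request, or raise if REQUIRE fails.

    can_share says whether the file system could satisfy the request with
    shared references (for example a reflink or clone).
    """
    if policy is SharePolicy.FORBID:
        return "copy"
    if can_share:
        return "share"
    if policy is SharePolicy.ALLOW:
        return "copy"
    raise OSError(errno.EOPNOTSUPP, "shared references required but unavailable")
```

For example, `plan_copy(SharePolicy.ALLOW, can_share=False)` returns `"copy"`, while the same call with `SharePolicy.REQUIRE` raises, which is the "don't take forever" behaviour described for the third flag.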
[parent not found: <20130926185508.GO30372-fypN+1c5dIyjpB87vu3CluTW4wlIGRCZ@public.gmane.org>]
* Re: [RFC] extending splice for copy offloading [not found] ` <20130926185508.GO30372-fypN+1c5dIyjpB87vu3CluTW4wlIGRCZ@public.gmane.org> @ 2013-09-26 21:26 ` Ric Wheeler [not found] ` <5244A68F.906-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 0 siblings, 1 reply; 62+ messages in thread From: Ric Wheeler @ 2013-09-26 21:26 UTC (permalink / raw) To: Zach Brown Cc: Miklos Szeredi, J. Bruce Fields, Anna Schumaker, Kernel Mailing List, Linux-Fsdevel, linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Trond Myklebust, Bryan Schumaker, Martin K. Petersen, Jens Axboe, Mark Fasheh, Joel Becker, Eric Wong On 09/26/2013 02:55 PM, Zach Brown wrote: > On Thu, Sep 26, 2013 at 10:58:05AM +0200, Miklos Szeredi wrote: >> On Wed, Sep 25, 2013 at 11:07 PM, Zach Brown <zab-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote: >>>> A client-side copy will be slower, but I guess it does have the >>>> advantage that the application can track progress to some degree, and >>>> abort it fairly quickly without leaving the file in a totally undefined >>>> state--and both might be useful if the copy's not a simple constant-time >>>> operation. >>> I suppose, but can't the app achieve a nice middle ground by copying the >>> file in smaller syscalls? Avoid bulk data motion back to the client, >>> but still get notification every, I dunno, few hundred meg? >> Yes. And if "cp" could just be switched from a read+write syscall >> pair to a single splice syscall using the same buffer size. And then >> the user would only notice that things got faster in case of server >> side copy. No problems with long blocking times (at least not much >> worse than it was). > Hmm, yes, that would be a nice outcome. > >> However "cp" doesn't do reflinking by default, it has a switch for >> that. If we just want "cp" and the like to use splice without fearing >> side effects then by default we should try to be as close to >> read+write behavior as possible. No? > I guess? I don't find requiring --reflink hugely compelling. 
But there
> it is.
>
>> That's what I'm really
>> worrying about when you want to wire up splice to reflink by default.
>> I do think there should be a flag for that. And if on the block level
>> some magic happens, so be it. It's not the fs developer's worry any
>> more ;)
> Sure. So we'd have:
>
> - no flag default that forbids knowingly copying with shared references
>   so that it will be used by default by people who feel strongly about
>   their assumptions about independent write durability.
>
> - a flag that allows shared references for people who would otherwise
>   use the file system shared reference ioctls (ocfs2 reflink, btrfs
>   clone) but would like it to also do server-side read/write copies
>   over nfs without additional intervention.
>
> - a flag that requires shared references for callers who don't want
>   giant copies to take forever if they aren't instant. (The qemu guys
>   asked for this at Plumbers.)
>
> I think I can live with that.
>
> - z

This last flag should not prevent a remote target device (NFS or SCSI array)
copy from working, though, since they often do reflink-like operations inside
of the remote target device....

ric

^ permalink raw reply [flat|nested] 62+ messages in thread
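One way to read the three flag behaviours listed above, together with Ric's caveat about remote targets, is as a small decision function. The sketch below is only an illustration under stated assumptions: `SharePolicy`, `plan_copy`, and the strategy names "share", "offload", and "buffered" are hypothetical and appear nowhere in the patch series, and treating a remote-target offload as acceptable under the "require" flag is one interpretation of Ric's remark, not an agreed design.

```python
import errno
from enum import Enum


class SharePolicy(Enum):
    """Hypothetical names for the three flag behaviours discussed in the thread."""
    FORBID = "forbid"    # no flag: never knowingly share blocks
    ALLOW = "allow"      # share blocks if the fs can, otherwise copy
    REQUIRE = "require"  # share blocks or fail; never fall back to a slow copy


def plan_copy(policy, fs_can_share, target_can_offload):
    """Pick a copy strategy: 'share', 'offload', or 'buffered'.

    fs_can_share: the local fs supports reflink/clone (assumption).
    target_can_offload: a remote target (NFS server, SCSI array) can do
    the copy itself (assumption).
    """
    if policy is not SharePolicy.FORBID and fs_can_share:
        return "share"
    if policy is SharePolicy.REQUIRE:
        # Interpretation of Ric's point: the remote target may share
        # blocks internally, so an offload is still allowed here.
        if target_can_offload:
            return "offload"
        raise OSError(errno.EOPNOTSUPP, "no constant-time copy available")
    # FORBID and ALLOW may both use a server-side copy, else the page cache.
    return "offload" if target_can_offload else "buffered"
```

Whether the no-flag default should permit an offload at all (the target might share blocks behind the caller's back) is exactly the kind of question the thread leaves open.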
[parent not found: <5244A68F.906-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>]
* Re: [RFC] extending splice for copy offloading [not found] ` <5244A68F.906-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> @ 2013-09-27 20:05 ` J. Bruce Fields 2013-09-27 20:50 ` Zach Brown 0 siblings, 1 reply; 62+ messages in thread From: J. Bruce Fields @ 2013-09-27 20:05 UTC (permalink / raw) To: Ric Wheeler Cc: Zach Brown, Miklos Szeredi, Anna Schumaker, Kernel Mailing List, Linux-Fsdevel, linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Trond Myklebust, Bryan Schumaker, Martin K. Petersen, Jens Axboe, Mark Fasheh, Joel Becker, Eric Wong On Thu, Sep 26, 2013 at 05:26:39PM -0400, Ric Wheeler wrote: > On 09/26/2013 02:55 PM, Zach Brown wrote: > >On Thu, Sep 26, 2013 at 10:58:05AM +0200, Miklos Szeredi wrote: > >>On Wed, Sep 25, 2013 at 11:07 PM, Zach Brown <zab-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote: > >>>>A client-side copy will be slower, but I guess it does have the > >>>>advantage that the application can track progress to some degree, and > >>>>abort it fairly quickly without leaving the file in a totally undefined > >>>>state--and both might be useful if the copy's not a simple constant-time > >>>>operation. > >>>I suppose, but can't the app achieve a nice middle ground by copying the > >>>file in smaller syscalls? Avoid bulk data motion back to the client, > >>>but still get notification every, I dunno, few hundred meg? > >>Yes. And if "cp" could just be switched from a read+write syscall > >>pair to a single splice syscall using the same buffer size. And then > >>the user would only notice that things got faster in case of server > >>side copy. No problems with long blocking times (at least not much > >>worse than it was). > >Hmm, yes, that would be a nice outcome. > > > >>However "cp" doesn't do reflinking by default, it has a switch for > >>that. If we just want "cp" and the like to use splice without fearing > >>side effects then by default we should try to be as close to > >>read+write behavior as possible. No? > >I guess? 
I don't find requiring --reflink hugely compelling. But there
> >it is.
> >
> >>That's what I'm really
> >>worrying about when you want to wire up splice to reflink by default.
> >>I do think there should be a flag for that. And if on the block level
> >>some magic happens, so be it. It's not the fs developer's worry any
> >>more ;)
> >Sure. So we'd have:
> >
> >- no flag default that forbids knowingly copying with shared references
> >  so that it will be used by default by people who feel strongly about
> >  their assumptions about independent write durability.
> >
> >- a flag that allows shared references for people who would otherwise
> >  use the file system shared reference ioctls (ocfs2 reflink, btrfs
> >  clone) but would like it to also do server-side read/write copies
> >  over nfs without additional intervention.
> >
> >- a flag that requires shared references for callers who don't want
> >  giant copies to take forever if they aren't instant. (The qemu guys
> >  asked for this at Plumbers.)

Why not implement only the last flag as the first step? It seems
like the simplest one. So I think that would mean:

	- no worrying about cancelling, etc.
	- apps should be told to pass the entire range at once (normally
	  the whole file).
	- The NFS server probably shouldn't do the internal copy loop by
	  default.

We can't prevent some storage system from implementing a high-latency
copy operation, but we can refuse to provide them any help (providing no
progress reports or easy way to cancel) and then they can deal with the
complaints from their users.

Also, I don't get the first option above at all. The argument is that
it's safer to have more copies? How much safety does another copy on
the same disk really give you? Do systems that do dedup provide
interfaces to turn it off per-file?

> This last flag should not prevent a remote target device (NFS or
> SCSI array) copy from working though since they often do reflink
> like operations inside of the remote target device....

In fact maybe that's the only case to care about on the first pass. But I understand that Zach's tired of the woodshedding and I could live with the above I guess.... --b. ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [RFC] extending splice for copy offloading 2013-09-27 20:05 ` J. Bruce Fields @ 2013-09-27 20:50 ` Zach Brown 2013-09-28 5:49 ` Miklos Szeredi 0 siblings, 1 reply; 62+ messages in thread From: Zach Brown @ 2013-09-27 20:50 UTC (permalink / raw) To: J. Bruce Fields Cc: Ric Wheeler, Miklos Szeredi, Anna Schumaker, Kernel Mailing List, Linux-Fsdevel, linux-nfs@vger.kernel.org, Trond Myklebust, Bryan Schumaker, Martin K. Petersen, Jens Axboe, Mark Fasheh, Joel Becker, Eric Wong > > >Sure. So we'd have: > > > > > >- no flag default that forbids knowingly copying with shared references > > > so that it will be used by default by people who feel strongly about > > > their assumptions about independent write durability. > > > > > >- a flag that allows shared references for people who would otherwise > > > use the file system shared reference ioctls (ocfs2 reflink, btrfs > > > clone) but would like it to also do server-side read/write copies > > > over nfs without additional intervention. > > > > > >- a flag that requires shared references for callers who don't want > > > giant copies to take forever if they aren't instant. (The qemu guys > > > asked for this at Plumbers.) > > Why not implement only the last flag only as the first step? It seems > like the simplest one. So I think that would mean: > > - no worrying about cancelling, etc. > - apps should be told to pass the entire range at once (normally > the whole file). > - The NFS server probably shouldn't do the internal copy loop by > default. > > We can't prevent some storage system from implementing a high-latency > copy operation, but we can refuse to provide them any help (providing no > progress reports or easy way to cancel) and then they can deal with the > complaints from their users. I can see where you're going with that, yeah. It'd make less sense as a splice extension, then, perhaps. It'd be more like a generic entry point for the existing ioctls. 
Maybe even just defining the semantics of a common ioctl. Hmm. > Also, I don't get the first option above at all. The argument is that > it's safer to have more copies? How much safety does another copy on > the same disk really give you? Do systems that do dedup provide > interfaces to turn it off per-file? Yeah, got me. It's certainly nonsense on a lot of FTL logging implementations (which are making their way into SMR drives in the future). > But I understand that Zach's tired of the woodshedding and I could live > with the above I guess.... No, it's fine. At least people are expressing some interest in the interface! That's a marked improvement over the state of things in the past. - z ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [RFC] extending splice for copy offloading 2013-09-27 20:50 ` Zach Brown @ 2013-09-28 5:49 ` Miklos Szeredi 2013-09-28 15:20 ` Myklebust, Trond 0 siblings, 1 reply; 62+ messages in thread From: Miklos Szeredi @ 2013-09-28 5:49 UTC (permalink / raw) To: Zach Brown Cc: J. Bruce Fields, Ric Wheeler, Anna Schumaker, Kernel Mailing List, Linux-Fsdevel, linux-nfs@vger.kernel.org, Trond Myklebust, Bryan Schumaker, Martin K. Petersen, Jens Axboe, Mark Fasheh, Joel Becker, Eric Wong On Fri, Sep 27, 2013 at 10:50 PM, Zach Brown <zab@redhat.com> wrote: >> Also, I don't get the first option above at all. The argument is that >> it's safer to have more copies? How much safety does another copy on >> the same disk really give you? Do systems that do dedup provide >> interfaces to turn it off per-file? I don't find the safety argument very compelling either. There are real semantic differences, however: ENOSPC on a write to an (apparently) already allocated block. That could be a bit unexpected. Do we need a fallocate extension to deal with shared blocks? Thanks, Miklos ^ permalink raw reply [flat|nested] 62+ messages in thread
* RE: [RFC] extending splice for copy offloading 2013-09-28 5:49 ` Miklos Szeredi @ 2013-09-28 15:20 ` Myklebust, Trond 2013-09-28 21:20 ` Ric Wheeler 0 siblings, 1 reply; 62+ messages in thread From: Myklebust, Trond @ 2013-09-28 15:20 UTC (permalink / raw) To: Miklos Szeredi, Zach Brown Cc: J. Bruce Fields, Ric Wheeler, Anna Schumaker, Kernel Mailing List, Linux-Fsdevel, linux-nfs@vger.kernel.org, Schumaker, Bryan, Martin K. Petersen, Jens Axboe, Mark Fasheh, Joel Becker, Eric Wong > -----Original Message----- > From: Miklos Szeredi [mailto:miklos@szeredi.hu] > Sent: Saturday, September 28, 2013 12:50 AM > To: Zach Brown > Cc: J. Bruce Fields; Ric Wheeler; Anna Schumaker; Kernel Mailing List; Linux- > Fsdevel; linux-nfs@vger.kernel.org; Myklebust, Trond; Schumaker, Bryan; > Martin K. Petersen; Jens Axboe; Mark Fasheh; Joel Becker; Eric Wong > Subject: Re: [RFC] extending splice for copy offloading > > On Fri, Sep 27, 2013 at 10:50 PM, Zach Brown <zab@redhat.com> wrote: > >> Also, I don't get the first option above at all. The argument is > >> that it's safer to have more copies? How much safety does another > >> copy on the same disk really give you? Do systems that do dedup > >> provide interfaces to turn it off per-file? > > I don't see the safety argument very compelling either. There are real > semantic differences, however: ENOSPC on a write to a > (apparentlíy) already allocated block. That could be a bit unexpected. Do we > need a fallocate extension to deal with shared blocks? The above has been the case for all enterprise storage arrays ever since the invention of snapshots. The NFSv4.2 spec does allow you to set a per-file attribute that causes the storage server to always preallocate enough buffers to guarantee that you can rewrite the entire file, however the fact that we've lived without it for said 20 years leads me to believe that demand for it is going to be limited. I haven't put it top of the list of features we care to implement... 
Cheers, Trond ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [RFC] extending splice for copy offloading 2013-09-28 15:20 ` Myklebust, Trond @ 2013-09-28 21:20 ` Ric Wheeler [not found] ` <52474839.2080201-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 0 siblings, 1 reply; 62+ messages in thread From: Ric Wheeler @ 2013-09-28 21:20 UTC (permalink / raw) To: Myklebust, Trond Cc: Miklos Szeredi, Zach Brown, J. Bruce Fields, Anna Schumaker, Kernel Mailing List, Linux-Fsdevel, linux-nfs@vger.kernel.org, Schumaker, Bryan, Martin K. Petersen, Jens Axboe, Mark Fasheh, Joel Becker, Eric Wong On 09/28/2013 11:20 AM, Myklebust, Trond wrote: >> -----Original Message----- >> From: Miklos Szeredi [mailto:miklos@szeredi.hu] >> Sent: Saturday, September 28, 2013 12:50 AM >> To: Zach Brown >> Cc: J. Bruce Fields; Ric Wheeler; Anna Schumaker; Kernel Mailing List; Linux- >> Fsdevel; linux-nfs@vger.kernel.org; Myklebust, Trond; Schumaker, Bryan; >> Martin K. Petersen; Jens Axboe; Mark Fasheh; Joel Becker; Eric Wong >> Subject: Re: [RFC] extending splice for copy offloading >> >> On Fri, Sep 27, 2013 at 10:50 PM, Zach Brown <zab@redhat.com> wrote: >>>> Also, I don't get the first option above at all. The argument is >>>> that it's safer to have more copies? How much safety does another >>>> copy on the same disk really give you? Do systems that do dedup >>>> provide interfaces to turn it off per-file? >> I don't see the safety argument very compelling either. There are real >> semantic differences, however: ENOSPC on a write to a >> (apparentlíy) already allocated block. That could be a bit unexpected. Do we >> need a fallocate extension to deal with shared blocks? > The above has been the case for all enterprise storage arrays ever since the invention of snapshots. 
The NFSv4.2 spec does allow you to set a per-file attribute that causes the storage server to always preallocate enough buffers to guarantee that you can rewrite the entire file, however the fact that we've lived without it for said 20 years leads me to believe that demand for it is going to be limited. I haven't put it top of the list of features we care to implement... > > Cheers, > Trond I agree - this has been common behaviour for a very long time in the array space. Even without an array, this is the same as overwriting a block in btrfs or any file system with a read-write LVM snapshot. Regards, Ric ^ permalink raw reply [flat|nested] 62+ messages in thread
[parent not found: <52474839.2080201-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>]
* Re: [RFC] extending splice for copy offloading [not found] ` <52474839.2080201-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> @ 2013-09-30 12:20 ` Miklos Szeredi 2013-09-30 14:34 ` J. Bruce Fields 0 siblings, 1 reply; 62+ messages in thread From: Miklos Szeredi @ 2013-09-30 12:20 UTC (permalink / raw) To: Ric Wheeler Cc: Myklebust, Trond, Zach Brown, J. Bruce Fields, Anna Schumaker, Kernel Mailing List, Linux-Fsdevel, linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Schumaker, Bryan, Martin K. Petersen, Jens Axboe, Mark Fasheh, Joel Becker, Eric Wong On Sat, Sep 28, 2013 at 11:20 PM, Ric Wheeler <rwheeler-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote: >>> I don't see the safety argument very compelling either. There are real >>> semantic differences, however: ENOSPC on a write to a >>> (apparentlíy) already allocated block. That could be a bit unexpected. >>> Do we >>> need a fallocate extension to deal with shared blocks? >> >> The above has been the case for all enterprise storage arrays ever since >> the invention of snapshots. The NFSv4.2 spec does allow you to set a >> per-file attribute that causes the storage server to always preallocate >> enough buffers to guarantee that you can rewrite the entire file, however >> the fact that we've lived without it for said 20 years leads me to believe >> that demand for it is going to be limited. I haven't put it top of the list >> of features we care to implement... >> >> Cheers, >> Trond > > > I agree - this has been common behaviour for a very long time in the array > space. Even without an array, this is the same as overwriting a block in > btrfs or any file system with a read-write LVM snapshot. Okay, I'm convinced. So I suggest - mount(..., MNT_REFLINK): *allow* splice to reflink. If this is not set, fall back to page cache copy. - splice(... SPLICE_REFLINK): fail non-reflink copy. With this app can force reflink. Both are trivial to implement and make sure that no backward incompatibility surprises happen. 
My other worry is about interruptibility/restartability. Ideas? What happens on splice(from, to, 4G) and it's a non-reflink copy? Can the page cache copy be made restartable? Or should splice() be allowed to return a short count? What happens on (non-reflink) remote copies and huge request sizes? Thanks, Miklos ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [RFC] extending splice for copy offloading 2013-09-30 12:20 ` Miklos Szeredi @ 2013-09-30 14:34 ` J. Bruce Fields 2013-09-30 14:48 ` Ric Wheeler 2013-09-30 14:51 ` Miklos Szeredi 0 siblings, 2 replies; 62+ messages in thread From: J. Bruce Fields @ 2013-09-30 14:34 UTC (permalink / raw) To: Miklos Szeredi Cc: Ric Wheeler, Myklebust, Trond, Zach Brown, Anna Schumaker, Kernel Mailing List, Linux-Fsdevel, linux-nfs@vger.kernel.org, Schumaker, Bryan, Martin K. Petersen, Jens Axboe, Mark Fasheh, Joel Becker, Eric Wong On Mon, Sep 30, 2013 at 02:20:30PM +0200, Miklos Szeredi wrote: > On Sat, Sep 28, 2013 at 11:20 PM, Ric Wheeler <rwheeler@redhat.com> wrote: > > >>> I don't see the safety argument very compelling either. There are real > >>> semantic differences, however: ENOSPC on a write to a > >>> (apparentlíy) already allocated block. That could be a bit unexpected. > >>> Do we > >>> need a fallocate extension to deal with shared blocks? > >> > >> The above has been the case for all enterprise storage arrays ever since > >> the invention of snapshots. The NFSv4.2 spec does allow you to set a > >> per-file attribute that causes the storage server to always preallocate > >> enough buffers to guarantee that you can rewrite the entire file, however > >> the fact that we've lived without it for said 20 years leads me to believe > >> that demand for it is going to be limited. I haven't put it top of the list > >> of features we care to implement... > >> > >> Cheers, > >> Trond > > > > > > I agree - this has been common behaviour for a very long time in the array > > space. Even without an array, this is the same as overwriting a block in > > btrfs or any file system with a read-write LVM snapshot. > > Okay, I'm convinced. > > So I suggest > > - mount(..., MNT_REFLINK): *allow* splice to reflink. If this is not > set, fall back to page cache copy. > - splice(... SPLICE_REFLINK): fail non-reflink copy. With this app > can force reflink. 
> > Both are trivial to implement and make sure that no backward > incompatibility surprises happen. > > My other worry is about interruptibility/restartability. Ideas? > > What happens on splice(from, to, 4G) and it's a non-reflink copy? > Can the page cache copy be made restartable? Or should splice() be > allowed to return a short count? What happens on (non-reflink) remote > copies and huge request sizes? If I were writing an application that required copies to be restartable, I'd probably use the largest possible range in the reflink case but break the copy into smaller chunks in the splice case. For that reason I don't like the idea of a mount option--the choice is something that the application probably wants to make (or at least to know about). The NFS COPY operation, as specified in current drafts, allows for asynchronous copies but leaves the state of the file undefined in the case of an aborted COPY. I worry that agreeing on standard behavior in the case of an abort might be difficult. --b. ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [RFC] extending splice for copy offloading 2013-09-30 14:34 ` J. Bruce Fields @ 2013-09-30 14:48 ` Ric Wheeler 2013-09-30 14:51 ` Miklos Szeredi 1 sibling, 0 replies; 62+ messages in thread From: Ric Wheeler @ 2013-09-30 14:48 UTC (permalink / raw) To: J. Bruce Fields Cc: Miklos Szeredi, Myklebust, Trond, Zach Brown, Anna Schumaker, Kernel Mailing List, Linux-Fsdevel, linux-nfs@vger.kernel.org, Schumaker, Bryan, Martin K. Petersen, Jens Axboe, Mark Fasheh, Joel Becker, Eric Wong On 09/30/2013 10:34 AM, J. Bruce Fields wrote: > On Mon, Sep 30, 2013 at 02:20:30PM +0200, Miklos Szeredi wrote: >> On Sat, Sep 28, 2013 at 11:20 PM, Ric Wheeler <rwheeler@redhat.com> wrote: >> >>>>> I don't see the safety argument very compelling either. There are real >>>>> semantic differences, however: ENOSPC on a write to a >>>>> (apparentlíy) already allocated block. That could be a bit unexpected. >>>>> Do we >>>>> need a fallocate extension to deal with shared blocks? >>>> The above has been the case for all enterprise storage arrays ever since >>>> the invention of snapshots. The NFSv4.2 spec does allow you to set a >>>> per-file attribute that causes the storage server to always preallocate >>>> enough buffers to guarantee that you can rewrite the entire file, however >>>> the fact that we've lived without it for said 20 years leads me to believe >>>> that demand for it is going to be limited. I haven't put it top of the list >>>> of features we care to implement... >>>> >>>> Cheers, >>>> Trond >>> >>> I agree - this has been common behaviour for a very long time in the array >>> space. Even without an array, this is the same as overwriting a block in >>> btrfs or any file system with a read-write LVM snapshot. >> Okay, I'm convinced. >> >> So I suggest >> >> - mount(..., MNT_REFLINK): *allow* splice to reflink. If this is not >> set, fall back to page cache copy. >> - splice(... SPLICE_REFLINK): fail non-reflink copy. With this app >> can force reflink. 
>> >> Both are trivial to implement and make sure that no backward >> incompatibility surprises happen. >> >> My other worry is about interruptibility/restartability. Ideas? >> >> What happens on splice(from, to, 4G) and it's a non-reflink copy? >> Can the page cache copy be made restartable? Or should splice() be >> allowed to return a short count? What happens on (non-reflink) remote >> copies and huge request sizes? > If I were writing an application that required copies to be restartable, > I'd probably use the largest possible range in the reflink case but > break the copy into smaller chunks in the splice case. > > For that reason I don't like the idea of a mount option--the choice is > something that the application probably wants to make (or at least to > know about). > > The NFS COPY operation, as specified in current drafts, allows for > asynchronous copies but leaves the state of the file undefined in the > case of an aborted COPY. I worry that agreeing on standard behavior in > the case of an abort might be difficult. > > --b. I think that this is still confusing - reflink and array copy offload should not be differentiated. In effect, they should often be the same order of magnitude in performance and possibly even use the same or very similar techniques (just on different sides of the initiator/target transaction!). It is much simpler to let the application fail if the offload (or reflink) is not supported and let it do the traditional copy offload. Then you always send the largest possible offload operation and do whatever you do now if that fails. thanks! Ric ^ permalink raw reply [flat|nested] 62+ messages in thread
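The approach Ric describes above (send the largest possible offload request, and if it fails do whatever the application does today) might look roughly like the following from the application's side. This is a minimal sketch, not an implementation of any proposed interface: `try_offload` is a hypothetical hook standing in for the offloaded or reflink copy, `OffloadUnsupported` is an invented exception, and file objects stand in for file descriptors.

```python
class OffloadUnsupported(Exception):
    """Raised by the hypothetical offload hook when it cannot do the copy."""


def copy_with_fallback(src, dst, try_offload, chunk_size=64 * 1024):
    """Try one whole-file offload; on failure redo the copy with read/write.

    src, dst: seekable binary file objects.
    try_offload(src, dst): hypothetical hook that either copies the whole
    file or raises OffloadUnsupported, possibly after partial writes.
    """
    try:
        try_offload(src, dst)
        return "offload"
    except OffloadUnsupported:
        pass

    # A failed offload may have left the destination in an unknown state,
    # so restart from scratch rather than trusting any partial result.
    src.seek(0)
    dst.seek(0)
    dst.truncate(0)
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        dst.write(chunk)
    return "fallback"
```

Restarting from offset 0 after a failure reflects the concern, raised elsewhere in the thread, that the state of the destination is undefined after an aborted copy.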
* Re: [RFC] extending splice for copy offloading 2013-09-30 14:34 ` J. Bruce Fields 2013-09-30 14:48 ` Ric Wheeler @ 2013-09-30 14:51 ` Miklos Szeredi [not found] ` <CAJfpeguMCzv-UhrXrG7e9Q7F_0aEe3_ZMumFwLu3hxcewA_7gA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 1 sibling, 1 reply; 62+ messages in thread From: Miklos Szeredi @ 2013-09-30 14:51 UTC (permalink / raw) To: J. Bruce Fields Cc: Ric Wheeler, Myklebust, Trond, Zach Brown, Anna Schumaker, Kernel Mailing List, Linux-Fsdevel, linux-nfs@vger.kernel.org, Schumaker, Bryan, Martin K. Petersen, Jens Axboe, Mark Fasheh, Joel Becker, Eric Wong On Mon, Sep 30, 2013 at 4:34 PM, J. Bruce Fields <bfields@fieldses.org> wrote: >> My other worry is about interruptibility/restartability. Ideas? >> >> What happens on splice(from, to, 4G) and it's a non-reflink copy? >> Can the page cache copy be made restartable? Or should splice() be >> allowed to return a short count? What happens on (non-reflink) remote >> copies and huge request sizes? > > If I were writing an application that required copies to be restartable, > I'd probably use the largest possible range in the reflink case but > break the copy into smaller chunks in the splice case. > The app really doesn't want to care about that. And it doesn't want to care about restartability, etc.. It's something the *kernel* has to care about. You just can't have uninterruptible syscalls that sleep for a "long" time, otherwise first you'll just have annoyed users pressing ^C in vain; then, if the sleep is even longer, warnings about task sleeping too long. One idea is letting splice() return a short count, and so the app can safely issue SIZE_MAX requests and the kernel can decide if it can copy the whole file in one go or if it wants to do it in smaller chunks. Thanks, Miklos ^ permalink raw reply [flat|nested] 62+ messages in thread
[parent not found: <CAJfpeguMCzv-UhrXrG7e9Q7F_0aEe3_ZMumFwLu3hxcewA_7gA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: [RFC] extending splice for copy offloading [not found] ` <CAJfpeguMCzv-UhrXrG7e9Q7F_0aEe3_ZMumFwLu3hxcewA_7gA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2013-09-30 14:52 ` Ric Wheeler 2013-09-30 15:24 ` Miklos Szeredi 0 siblings, 1 reply; 62+ messages in thread From: Ric Wheeler @ 2013-09-30 14:52 UTC (permalink / raw) To: Miklos Szeredi Cc: J. Bruce Fields, Myklebust, Trond, Zach Brown, Anna Schumaker, Kernel Mailing List, Linux-Fsdevel, linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Schumaker, Bryan, Martin K. Petersen, Jens Axboe, Mark Fasheh, Joel Becker, Eric Wong On 09/30/2013 10:51 AM, Miklos Szeredi wrote: > On Mon, Sep 30, 2013 at 4:34 PM, J. Bruce Fields <bfields-uC3wQj2KruNg9hUCZPvPmw@public.gmane.org> wrote: >>> My other worry is about interruptibility/restartability. Ideas? >>> >>> What happens on splice(from, to, 4G) and it's a non-reflink copy? >>> Can the page cache copy be made restartable? Or should splice() be >>> allowed to return a short count? What happens on (non-reflink) remote >>> copies and huge request sizes? >> If I were writing an application that required copies to be restartable, >> I'd probably use the largest possible range in the reflink case but >> break the copy into smaller chunks in the splice case. >> > The app really doesn't want to care about that. And it doesn't want > to care about restartability, etc.. It's something the *kernel* has > to care about. You just can't have uninterruptible syscalls that > sleep for a "long" time, otherwise first you'll just have annoyed > users pressing ^C in vain; then, if the sleep is even longer, warnings > about task sleeping too long. > > One idea is letting splice() return a short count, and so the app can > safely issue SIZE_MAX requests and the kernel can decide if it can > copy the whole file in one go or if it wants to do it in smaller > chunks. > > Thanks, > Miklos You cannot rely on a short count. 
That implies that an offloaded copy starts at byte 0 and that the first "short count" bytes are all valid. I don't believe that is in fact required by all (any?) versions of the spec :) Best just to fail and restart the whole operation. Ric ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [RFC] extending splice for copy offloading 2013-09-30 14:52 ` Ric Wheeler @ 2013-09-30 15:24 ` Miklos Szeredi 2013-09-30 14:28 ` Ric Wheeler [not found] ` <CAJfpegtpXuh9070ALGy16Y8kdgioBqSf4JQqBBCF4FHvFJWAWQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 0 siblings, 2 replies; 62+ messages in thread From: Miklos Szeredi @ 2013-09-30 15:24 UTC (permalink / raw) To: Ric Wheeler Cc: J. Bruce Fields, Myklebust, Trond, Zach Brown, Anna Schumaker, Kernel Mailing List, Linux-Fsdevel, linux-nfs@vger.kernel.org, Schumaker, Bryan, Martin K. Petersen, Jens Axboe, Mark Fasheh, Joel Becker, Eric Wong On Mon, Sep 30, 2013 at 4:52 PM, Ric Wheeler <rwheeler@redhat.com> wrote: > On 09/30/2013 10:51 AM, Miklos Szeredi wrote: >> >> On Mon, Sep 30, 2013 at 4:34 PM, J. Bruce Fields <bfields@fieldses.org> >> wrote: >>>> >>>> My other worry is about interruptibility/restartability. Ideas? >>>> >>>> What happens on splice(from, to, 4G) and it's a non-reflink copy? >>>> Can the page cache copy be made restartable? Or should splice() be >>>> allowed to return a short count? What happens on (non-reflink) remote >>>> copies and huge request sizes? >>> >>> If I were writing an application that required copies to be restartable, >>> I'd probably use the largest possible range in the reflink case but >>> break the copy into smaller chunks in the splice case. >>> >> The app really doesn't want to care about that. And it doesn't want >> to care about restartability, etc.. It's something the *kernel* has >> to care about. You just can't have uninterruptible syscalls that >> sleep for a "long" time, otherwise first you'll just have annoyed >> users pressing ^C in vain; then, if the sleep is even longer, warnings >> about task sleeping too long. >> >> One idea is letting splice() return a short count, and so the app can >> safely issue SIZE_MAX requests and the kernel can decide if it can >> copy the whole file in one go or if it wants to do it in smaller >> chunks. 
>>
>
> You cannot rely on a short count. That implies that an offloaded copy starts
> at byte 0 and the short count first bytes are all valid.

Huh?

- app calls splice(from, 0, to, 0, SIZE_MAX)
  1) VFS calls ->direct_splice(from, 0, to, 0, SIZE_MAX)
     1.a) fs reflinks the whole file in a jiffy and returns the size of the file
     1.b) fs does copy offload of, say, 64MB and returns 64M
  2) VFS does page copy of, say, 1MB and returns 1MB
- app calls splice(from, X, to, X, SIZE_MAX) where X is the new offset
...

The point is: the app is always doing the same (incrementing offset
with the return value from splice) and the kernel can decide what is
the best size it can service within a single uninterruptible syscall.

Wouldn't that work?

Thanks,
Miklos

^ permalink raw reply [flat|nested] 62+ messages in thread
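The application-side loop Miklos outlines above can be sketched as follows. This is a toy model under stated assumptions, not the proposed syscall: `step(offset, length)` is a hypothetical stand-in for `splice(from, offset, to, offset, length)` that returns the number of bytes the kernel chose to copy (0 at end of file), and `make_fake_kernel` only simulates a kernel that caps each call at `max_per_call` bytes.

```python
SIZE_MAX = 2**64 - 1  # "copy as much as you are willing to in one call"


def copy_whole_file(step):
    """Advance the offset by each return value until the copy reports 0.

    Returns (bytes_copied, number_of_calls).
    """
    offset = 0
    calls = 0
    while True:
        done = step(offset, SIZE_MAX)
        calls += 1
        if done == 0:  # nothing left to copy
            return offset, calls
        offset += done  # a short count just means "call again from here"


def make_fake_kernel(file_size, max_per_call):
    """Simulated kernel side: copies at most max_per_call bytes per call."""
    def step(offset, length):
        remaining = max(file_size - offset, 0)
        return min(remaining, length, max_per_call)
    return step
```

With a very large `max_per_call` the fake kernel behaves like the reflink case (one productive call), and with a small one it behaves like the page-cache case (many short calls); the application loop is identical in both. The loop relies on a short return meaning that the bytes from `offset` up to `offset + done` are valid, which is the assumption Ric questions for offloaded copies.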
* Re: [RFC] extending splice for copy offloading 2013-09-30 15:24 ` Miklos Szeredi @ 2013-09-30 14:28 ` Ric Wheeler [not found] ` <CAJfpegv_C6cLOuA-mNtgtf2QbmmmcHwjQVo8mA nhf_wbJ8iRhg@mail.gmail.com> ` (2 more replies) [not found] ` <CAJfpegtpXuh9070ALGy16Y8kdgioBqSf4JQqBBCF4FHvFJWAWQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 1 sibling, 3 replies; 62+ messages in thread From: Ric Wheeler @ 2013-09-30 14:28 UTC (permalink / raw) To: Miklos Szeredi Cc: J. Bruce Fields, Myklebust, Trond, Zach Brown, Anna Schumaker, Kernel Mailing List, Linux-Fsdevel, linux-nfs@vger.kernel.org, Schumaker, Bryan, Martin K. Petersen, Jens Axboe, Mark Fasheh, Joel Becker, Eric Wong On 09/30/2013 10:24 AM, Miklos Szeredi wrote: > On Mon, Sep 30, 2013 at 4:52 PM, Ric Wheeler <rwheeler@redhat.com> wrote: >> On 09/30/2013 10:51 AM, Miklos Szeredi wrote: >>> On Mon, Sep 30, 2013 at 4:34 PM, J. Bruce Fields <bfields@fieldses.org> >>> wrote: >>>>> My other worry is about interruptibility/restartability. Ideas? >>>>> >>>>> What happens on splice(from, to, 4G) and it's a non-reflink copy? >>>>> Can the page cache copy be made restartable? Or should splice() be >>>>> allowed to return a short count? What happens on (non-reflink) remote >>>>> copies and huge request sizes? >>>> If I were writing an application that required copies to be restartable, >>>> I'd probably use the largest possible range in the reflink case but >>>> break the copy into smaller chunks in the splice case. >>>> >>> The app really doesn't want to care about that. And it doesn't want >>> to care about restartability, etc.. It's something the *kernel* has >>> to care about. You just can't have uninterruptible syscalls that >>> sleep for a "long" time, otherwise first you'll just have annoyed >>> users pressing ^C in vain; then, if the sleep is even longer, warnings >>> about task sleeping too long. 
>>> >>> One idea is letting splice() return a short count, and so the app can >>> safely issue SIZE_MAX requests and the kernel can decide if it can >>> copy the whole file in one go or if it wants to do it in smaller >>> chunks. >>> >> You cannot rely on a short count. That implies that an offloaded copy starts >> at byte 0 and the short count first bytes are all valid. > Huh? > > - app calls splice(from, 0, to, 0, SIZE_MAX) > 1) VFS calls ->direct_splice(from, 0, to, 0, SIZE_MAX) > 1.a) fs reflinks the whole file in a jiffy and returns the size of the file > 1 b) fs does copy offload of, say, 64MB and returns 64M > 2) VFS does page copy of, say, 1MB and returns 1MB > - app calls splice(from, X, to, X, SIZE_MAX) where X is the new offset > ... > > The point is: the app is always doing the same (incrementing offset > with the return value from splice) and the kernel can decide what is > the best size it can service within a single uninterruptible syscall. > > Wouldn't that work? > > Thanks, > Miklos No. Keep in mind that the offload operation in (1) might fail partially. The target file (the copy) is allocated, the question is what ranges have valid data. I don't see that (2) is interesting or really needed to be done in the kernel. If nothing else, it tends to confuse the discussion.... ric ^ permalink raw reply [flat|nested] 62+ messages in thread
[parent not found: <CAJfpegv_C6cLOuA-mNtgtf2QbmmmcHwjQVo8mA nhf_wbJ8iRhg@mail.gmail.com>]
[parent not found: <CAJfpegv_C6cLOuA-mNtgtf2QbmmmcHwjQVo8mAnhf_wbJ8iRhg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: [RFC] extending splice for copy offloading [not found] ` <CAJfpegv_C6cLOuA-mNtgtf2QbmmmcHwjQVo8mAnhf_wbJ8iRhg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2013-09-30 14:41 ` Ric Wheeler [not found] ` <52498DB6.7060901-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 0 siblings, 1 reply; 62+ messages in thread From: Ric Wheeler @ 2013-09-30 14:41 UTC (permalink / raw) To: Miklos Szeredi Cc: J. Bruce Fields, Myklebust, Trond, Zach Brown, Anna Schumaker, Kernel Mailing List, Linux-Fsdevel, linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Schumaker, Bryan, Martin K. Petersen, Jens Axboe, Mark Fasheh, Joel Becker, Eric Wong On 09/30/2013 10:38 AM, Miklos Szeredi wrote: > On Mon, Sep 30, 2013 at 4:28 PM, Ric Wheeler <rwheeler-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote: >> On 09/30/2013 10:24 AM, Miklos Szeredi wrote: >>> On Mon, Sep 30, 2013 at 4:52 PM, Ric Wheeler <rwheeler-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote: >>>> On 09/30/2013 10:51 AM, Miklos Szeredi wrote: >>>>> On Mon, Sep 30, 2013 at 4:34 PM, J. Bruce Fields <bfields-uC3wQj2KruNg9hUCZPvPmw@public.gmane.org> >>>>> wrote: >>>>>>> My other worry is about interruptibility/restartability. Ideas? >>>>>>> >>>>>>> What happens on splice(from, to, 4G) and it's a non-reflink copy? >>>>>>> Can the page cache copy be made restartable? Or should splice() be >>>>>>> allowed to return a short count? What happens on (non-reflink) remote >>>>>>> copies and huge request sizes? >>>>>> If I were writing an application that required copies to be >>>>>> restartable, >>>>>> I'd probably use the largest possible range in the reflink case but >>>>>> break the copy into smaller chunks in the splice case. >>>>>> >>>>> The app really doesn't want to care about that. And it doesn't want >>>>> to care about restartability, etc.. It's something the *kernel* has >>>>> to care about. 
You just can't have uninterruptible syscalls that >>>>> sleep for a "long" time, otherwise first you'll just have annoyed >>>>> users pressing ^C in vain; then, if the sleep is even longer, warnings >>>>> about task sleeping too long. >>>>> >>>>> One idea is letting splice() return a short count, and so the app can >>>>> safely issue SIZE_MAX requests and the kernel can decide if it can >>>>> copy the whole file in one go or if it wants to do it in smaller >>>>> chunks. >>>>> >>>> You cannot rely on a short count. That implies that an offloaded copy >>>> starts >>>> at byte 0 and the short count first bytes are all valid. >>> Huh? >>> >>> - app calls splice(from, 0, to, 0, SIZE_MAX) >>> 1) VFS calls ->direct_splice(from, 0, to, 0, SIZE_MAX) >>> 1.a) fs reflinks the whole file in a jiffy and returns the size of >>> the file >>> 1 b) fs does copy offload of, say, 64MB and returns 64M >>> 2) VFS does page copy of, say, 1MB and returns 1MB >>> - app calls splice(from, X, to, X, SIZE_MAX) where X is the new offset >>> ... >>> >>> The point is: the app is always doing the same (incrementing offset >>> with the return value from splice) and the kernel can decide what is >>> the best size it can service within a single uninterruptible syscall. >>> >>> Wouldn't that work? >>> >> No. >> >> Keep in mind that the offload operation in (1) might fail partially. The >> target file (the copy) is allocated, the question is what ranges have valid >> data. > You are talking about case 1.a, right? So if the offload copy 0-64MB > fails partially, we return failure from splice, yet some of the copy > did succeed. Is that the problem? Why? > > Thanks, > Miklos The way the array based offload (and some software side reflink works) is not a byte by byte copy. We cannot assume that a valid count can be returned or that such a count would be an indication of a sequential segment of good data. The whole thing would normally have to be reissued. 
To make that a true assumption, you would have to mandate that in each of the specifications (and sw targets)... ric -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 62+ messages in thread
[parent not found: <52498DB6.7060901-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>]
* Re: [RFC] extending splice for copy offloading [not found] ` <52498DB6.7060901-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> @ 2013-09-30 15:46 ` Miklos Szeredi 2013-09-30 14:49 ` Ric Wheeler [not found] ` <CAJfpegsvrr7x3MbdpvxUmzq0ZfDHfZkzAar6Od2G7wg8DgPLYQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 0 siblings, 2 replies; 62+ messages in thread From: Miklos Szeredi @ 2013-09-30 15:46 UTC (permalink / raw) To: Ric Wheeler Cc: J. Bruce Fields, Myklebust, Trond, Zach Brown, Anna Schumaker, Kernel Mailing List, Linux-Fsdevel, linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Schumaker, Bryan, Martin K. Petersen, Jens Axboe, Mark Fasheh, Joel Becker, Eric Wong

On Mon, Sep 30, 2013 at 4:41 PM, Ric Wheeler <rwheeler-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> The way the array based offload (and some software side reflink works) is
> not a byte by byte copy. We cannot assume that a valid count can be returned
> or that such a count would be an indication of a sequential segment of good
> data. The whole thing would normally have to be reissued.
>
> To make that a true assumption, you would have to mandate that in each of
> the specifications (and sw targets)...

You're missing my point.

 - user issues SIZE_MAX splice request
 - fs issues *64M* (or whatever) request to offload
 - when that completes *fully* then we return 64M to userspace
 - if it completes partially, then we return an error to userspace

Again, wouldn't that work?

Thanks,
Miklos

^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [RFC] extending splice for copy offloading 2013-09-30 15:46 ` Miklos Szeredi @ 2013-09-30 14:49 ` Ric Wheeler [not found] ` <52498F68.8050200-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> [not found] ` <CAJfpegsvrr7x3MbdpvxUmzq0ZfDHfZkzAar6Od2G7wg8DgPLYQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 1 sibling, 1 reply; 62+ messages in thread From: Ric Wheeler @ 2013-09-30 14:49 UTC (permalink / raw) To: Miklos Szeredi Cc: J. Bruce Fields, Myklebust, Trond, Zach Brown, Anna Schumaker, Kernel Mailing List, Linux-Fsdevel, linux-nfs@vger.kernel.org, Schumaker, Bryan, Martin K. Petersen, Jens Axboe, Mark Fasheh, Joel Becker, Eric Wong On 09/30/2013 10:46 AM, Miklos Szeredi wrote: > On Mon, Sep 30, 2013 at 4:41 PM, Ric Wheeler <rwheeler@redhat.com> wrote: >> The way the array based offload (and some software side reflink works) is >> not a byte by byte copy. We cannot assume that a valid count can be returned >> or that such a count would be an indication of a sequential segment of good >> data. The whole thing would normally have to be reissued. >> >> To make that a true assumption, you would have to mandate that in each of >> the specifications (and sw targets)... > You're missing my point. > > - user issues SIZE_MAX splice request > - fs issues *64M* (or whatever) request to offload > - when that completes *fully* then we return 64M to userspace > - if it completes partially, then we return an error to userspace > > Again, wouldn't that work? > > Thanks, > Miklos Yes, if you send a copy offload command and it works, you can assume that it worked fully. It would be pretty interesting if that were not true :) If it fails, we cannot assume anything about partial completion. Ric ^ permalink raw reply [flat|nested] 62+ messages in thread
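The contract the two of them arrive at here (a bounded offload request either completes in full and its length is returned, or it fails and an error is returned with no partial count) can be sketched in a few lines of C. This is only an illustration under stated assumptions: `direct_splice_once`, the `offload_fn` hook type, the two stand-in hooks and the 64MB `OFFLOAD_CHUNK` cap are hypothetical names made up for the sketch. None of them come from the patch series or from any kernel API.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>

/* Hypothetical cap on a single offload request ("64M (or whatever)"). */
#define OFFLOAD_CHUNK ((size_t)64 << 20)

/*
 * Hypothetical offload hook. Returns 0 only if the whole range was copied,
 * otherwise a negative errno. It never reports a partial byte count.
 */
typedef int (*offload_fn)(off_t in_off, off_t out_off, size_t len, void *ctx);

/* Stand-in hooks so the sketch can be exercised without real hardware. */
int offload_always_ok(off_t in_off, off_t out_off, size_t len, void *ctx)
{
    (void)in_off; (void)out_off; (void)len; (void)ctx;
    return 0;
}

int offload_always_fails(off_t in_off, off_t out_off, size_t len, void *ctx)
{
    (void)in_off; (void)out_off; (void)len; (void)ctx;
    return -EIO;
}

/*
 * Service one request. The caller may ask for SIZE_MAX bytes; the request
 * is clamped to the rest of the source and to one chunk. The result is
 * either the full clamped length or a negative errno, never a guess at
 * how far a failed offload got.
 */
ssize_t direct_splice_once(offload_fn offload, void *ctx,
                           off_t in_off, off_t out_off,
                           size_t want, off_t src_size)
{
    off_t remaining;
    size_t len = OFFLOAD_CHUNK;
    int rc;

    assert(offload != NULL);
    if (in_off < 0 || out_off < 0)
        return -EINVAL;
    if (in_off >= src_size)
        return 0;                   /* nothing left to copy */

    remaining = src_size - in_off;
    if (remaining < (off_t)len)
        len = (size_t)remaining;
    if (want < len)
        len = want;
    if (len == 0)
        return 0;

    rc = offload(in_off, out_off, len, ctx);
    if (rc < 0)
        return rc;                  /* failure: report no partial progress */
    return (ssize_t)len;
}
```

A caller that keeps adding the return value to its offset, as in the splice(from, X, to, X, SIZE_MAX) loop quoted earlier in the thread, never needs to know the chunk size. On an error it can only assume that the state of the failed range is unknown.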
[parent not found: <52498F68.8050200-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>]
* Re: [RFC] extending splice for copy offloading [not found] ` <52498F68.8050200-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> @ 2013-09-30 15:57 ` Miklos Szeredi [not found] ` <CAJfpegvvN_5c5oMv8UoODXQHc-DQnijhOtPDXmNamVpQLDoWMQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 0 siblings, 1 reply; 62+ messages in thread From: Miklos Szeredi @ 2013-09-30 15:57 UTC (permalink / raw) To: Ric Wheeler Cc: J. Bruce Fields, Myklebust, Trond, Zach Brown, Anna Schumaker, Kernel Mailing List, Linux-Fsdevel, linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Schumaker, Bryan, Martin K. Petersen, Jens Axboe, Mark Fasheh, Joel Becker, Eric Wong On Mon, Sep 30, 2013 at 4:49 PM, Ric Wheeler <rwheeler-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote: > On 09/30/2013 10:46 AM, Miklos Szeredi wrote: >> >> On Mon, Sep 30, 2013 at 4:41 PM, Ric Wheeler <rwheeler-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote: >>> >>> The way the array based offload (and some software side reflink works) is >>> not a byte by byte copy. We cannot assume that a valid count can be >>> returned >>> or that such a count would be an indication of a sequential segment of >>> good >>> data. The whole thing would normally have to be reissued. >>> >>> To make that a true assumption, you would have to mandate that in each of >>> the specifications (and sw targets)... >> >> You're missing my point. >> >> - user issues SIZE_MAX splice request >> - fs issues *64M* (or whatever) request to offload >> - when that completes *fully* then we return 64M to userspace >> - if it completes partially, then we return an error to userspace >> >> Again, wouldn't that work? >> >> Thanks, >> Miklos > > > Yes, if you send a copy offload command and it works, you can assume that it > worked fully. It would be pretty interesting if that were not true :) > > If it fails, we cannot assume anything about partial completion. Sure, that was my understanding from the start. Maybe I wasn't precise enough in my explanation. 
Thanks,
Miklos

^ permalink raw reply [flat|nested] 62+ messages in thread
[parent not found: <CAJfpegvvN_5c5oMv8UoODXQHc-DQnijhOtPDXmNamVpQLDoWMQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: [RFC] extending splice for copy offloading [not found] ` <CAJfpegvvN_5c5oMv8UoODXQHc-DQnijhOtPDXmNamVpQLDoWMQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2013-09-30 16:31 ` Miklos Szeredi 2013-09-30 17:17 ` Bernd Schubert 0 siblings, 1 reply; 62+ messages in thread From: Miklos Szeredi @ 2013-09-30 16:31 UTC (permalink / raw) To: Ric Wheeler Cc: J. Bruce Fields, Myklebust, Trond, Zach Brown, Anna Schumaker, Kernel Mailing List, Linux-Fsdevel, linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Schumaker, Bryan, Martin K. Petersen, Jens Axboe, Mark Fasheh, Joel Becker, Eric Wong

Here's an example "cp" app using direct splice (and without fallback to
non-splice, which is obviously required unless the kernel is known to support
direct splice).

Untested, but trivial enough...

The important part is, I think, that the app must not assume that the kernel can
complete the request in one go.

Thanks,
Miklos

----
#define _GNU_SOURCE

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <limits.h>
#include <sys/stat.h>
#include <err.h>

#ifndef SPLICE_F_DIRECT
#define SPLICE_F_DIRECT        (0x10)  /* neither splice fd is a pipe */
#endif

int main(int argc, char *argv[])
{
        struct stat stbuf;
        int in_fd;
        int out_fd;
        int res;
        off_t off;

        if (argc != 3)
                errx(1, "usage: %s from to", argv[0]);

        in_fd = open(argv[1], O_RDONLY);
        if (in_fd == -1)
                err(1, "opening %s", argv[1]);

        res = fstat(in_fd, &stbuf);
        if (res == -1)
                err(1, "fstat");

        out_fd = open(argv[2], O_CREAT | O_WRONLY | O_TRUNC, stbuf.st_mode);
        if (out_fd == -1)
                err(1, "opening %s", argv[2]);

        do {
                off_t in_off = off, out_off = off;
                ssize_t rres;

                rres = splice(in_fd, &in_off, out_fd, &out_off, SSIZE_MAX,
                              SPLICE_F_DIRECT);
                if (rres == -1)
                        err(1, "splice");
                if (rres == 0)
                        break;

                off += rres;
        } while (off < stbuf.st_size);

        res = close(in_fd);
        if (res == -1)
                err(1, "close");

        res = fsync(out_fd);
        if (res == -1)
                err(1, "fsync");

        res = close(out_fd);
        if (res == -1)
                err(1, "close");

        return 0;
}

^ permalink raw reply [flat|nested] 62+ messages in thread
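The message above says a fallback to a non-splice copy is required unless the kernel is known to support direct splice, but does not show one. A minimal sketch of such a fallback, using only plain read() and write(), might look like the following. The function name `copy_fallback` and the 64KB buffer size are assumptions made for illustration. How an application detects that direct splice is unavailable is not settled in this thread and is left out.

```c
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Copy everything from in_fd (starting at its current position) to out_fd
 * with ordinary read()/write() calls. Returns the number of bytes copied,
 * or -1 with errno set. Short reads and short writes are both handled, so
 * the caller sees the same "keep going until done" behaviour as the
 * splice loop.
 */
ssize_t copy_fallback(int in_fd, int out_fd)
{
    char buf[64 * 1024];            /* buffer size is an arbitrary choice */
    ssize_t total = 0;

    assert(in_fd >= 0 && out_fd >= 0);

    for (;;) {
        ssize_t n = read(in_fd, buf, sizeof(buf));

        if (n == 0)
            return total;           /* end of input */
        if (n < 0) {
            if (errno == EINTR)
                continue;           /* interrupted before any data: retry */
            return -1;
        }

        /* write() may accept fewer bytes than requested; drain the buffer */
        for (ssize_t done = 0; done < n; ) {
            ssize_t w = write(out_fd, buf + done, (size_t)(n - done));

            if (w < 0) {
                if (errno == EINTR)
                    continue;
                return -1;
            }
            done += w;
        }
        total += n;
    }
}
```

Unlike the splice loop, this path always moves the data through a userspace buffer, so it gains nothing from reflink or copy offload. It only guarantees that the copy still happens.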
* Re: [RFC] extending splice for copy offloading 2013-09-30 16:31 ` Miklos Szeredi @ 2013-09-30 17:17 ` Bernd Schubert [not found] ` <5249B21E.70603-mPn0NPGs4xGatNDF+KUbs4QuADTiUCJX@public.gmane.org> 0 siblings, 1 reply; 62+ messages in thread From: Bernd Schubert @ 2013-09-30 17:17 UTC (permalink / raw) To: Miklos Szeredi, Ric Wheeler Cc: J. Bruce Fields, Myklebust, Trond, Zach Brown, Anna Schumaker, Kernel Mailing List, Linux-Fsdevel, linux-nfs@vger.kernel.org, Schumaker, Bryan, Martin K. Petersen, Jens Axboe, Mark Fasheh, Joel Becker, Eric Wong

On 09/30/2013 06:31 PM, Miklos Szeredi wrote:
> Here's an example "cp" app using direct splice (and without fallback to
> non-splice, which is obviously required unless the kernel is known to support
> direct splice).
>
> Untested, but trivial enough...
>
> The important part is, I think, that the app must not assume that the kernel can
> complete the request in one go.
>
> Thanks,
> Miklos
>
> ----
> #define _GNU_SOURCE
>
> #include <stdio.h>
> #include <fcntl.h>
> #include <unistd.h>
> #include <limits.h>
> #include <sys/stat.h>
> #include <err.h>
>
> #ifndef SPLICE_F_DIRECT
> #define SPLICE_F_DIRECT        (0x10)  /* neither splice fd is a pipe */
> #endif
>
> int main(int argc, char *argv[])
> {
>         struct stat stbuf;
>         int in_fd;
>         int out_fd;
>         int res;
>         off_t off;

        off_t off = 0;

>
>         if (argc != 3)
>                 errx(1, "usage: %s from to", argv[0]);
>
>         in_fd = open(argv[1], O_RDONLY);
>         if (in_fd == -1)
>                 err(1, "opening %s", argv[1]);
>
>         res = fstat(in_fd, &stbuf);
>         if (res == -1)
>                 err(1, "fstat");
>
>         out_fd = open(argv[2], O_CREAT | O_WRONLY | O_TRUNC, stbuf.st_mode);
>         if (out_fd == -1)
>                 err(1, "opening %s", argv[2]);
>
>         do {
>                 off_t in_off = off, out_off = off;
>                 ssize_t rres;
>
>                 rres = splice(in_fd, &in_off, out_fd, &out_off, SSIZE_MAX,
>                               SPLICE_F_DIRECT);
>                 if (rres == -1)
>                         err(1, "splice");
>                 if (rres == 0)
>                         break;
>
>                 off += rres;
>         } while (off < stbuf.st_size);
>
>         res = close(in_fd);
>         if (res == -1)
>                 err(1, "close");
>
>         res = fsync(out_fd);
>         if (res == -1)
>                 err(1, "fsync");
>
>         res = close(out_fd);
>         if (res == -1)
>                 err(1, "close");
>
>         return 0;
> }

It would be nice if there would be way if the file system would get a
hint that the target file is supposed to be copy of another file. That
way distributed file systems could also create the target-file with the
correct meta-information (same storage targets as in-file has).
Well, if we cannot agree on that, file system with a custom protocol at
least can detect from 0 to SSIZE_MAX and then reset metadata. I'm not
sure if this would work for pNFS, though.

Bernd

^ permalink raw reply [flat|nested] 62+ messages in thread
[parent not found: <5249B21E.70603-mPn0NPGs4xGatNDF+KUbs4QuADTiUCJX@public.gmane.org>]
* Re: [RFC] extending splice for copy offloading [not found] ` <5249B21E.70603-mPn0NPGs4xGatNDF+KUbs4QuADTiUCJX@public.gmane.org> @ 2013-09-30 17:44 ` Myklebust, Trond [not found] ` <1380563050.6501.15.camel-5lNtUQgoD8Pfa3cDbr2K10B+6BGkLq7r@public.gmane.org> 0 siblings, 1 reply; 62+ messages in thread From: Myklebust, Trond @ 2013-09-30 17:44 UTC (permalink / raw) To: Bernd Schubert Cc: Miklos Szeredi, Ric Wheeler, J. Bruce Fields, Zach Brown, Anna Schumaker, Kernel Mailing List, Linux-Fsdevel, linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Schumaker, Bryan, Martin K. Petersen, Jens Axboe, Mark Fasheh, Joel Becker, Eric Wong

On Mon, 2013-09-30 at 19:17 +0200, Bernd Schubert wrote:
> It would be nice if there would be way if the file system would get a
> hint that the target file is supposed to be copy of another file. That
> way distributed file systems could also create the target-file with the
> correct meta-information (same storage targets as in-file has).
> Well, if we cannot agree on that, file system with a custom protocol at
> least can detect from 0 to SSIZE_MAX and then reset metadata. I'm not
> sure if this would work for pNFS, though.

splice() does not create new files. What you appear to be asking for
lies way outside the scope of that system call interface.

--
Trond Myklebust
Linux NFS client maintainer

NetApp
Trond.Myklebust@netapp.com
www.netapp.com

^ permalink raw reply [flat|nested] 62+ messages in thread
[parent not found: <1380563050.6501.15.camel-5lNtUQgoD8Pfa3cDbr2K10B+6BGkLq7r@public.gmane.org>]
* Re: [RFC] extending splice for copy offloading [not found] ` <1380563050.6501.15.camel-5lNtUQgoD8Pfa3cDbr2K10B+6BGkLq7r@public.gmane.org> @ 2013-09-30 17:48 ` Bernd Schubert [not found] ` <5249B987.8020807-mPn0NPGs4xGatNDF+KUbs4QuADTiUCJX@public.gmane.org> 0 siblings, 1 reply; 62+ messages in thread From: Bernd Schubert @ 2013-09-30 17:48 UTC (permalink / raw) To: Myklebust, Trond Cc: Miklos Szeredi, Ric Wheeler, J. Bruce Fields, Zach Brown, Anna Schumaker, Kernel Mailing List, Linux-Fsdevel, linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Schumaker, Bryan, Martin K. Petersen, Jens Axboe, Mark Fasheh, Joel Becker, Eric Wong

On 09/30/2013 07:44 PM, Myklebust, Trond wrote:
> On Mon, 2013-09-30 at 19:17 +0200, Bernd Schubert wrote:
>> It would be nice if there would be way if the file system would get a
>> hint that the target file is supposed to be copy of another file. That
>> way distributed file systems could also create the target-file with the
>> correct meta-information (same storage targets as in-file has).
>> Well, if we cannot agree on that, file system with a custom protocol at
>> least can detect from 0 to SSIZE_MAX and then reset metadata. I'm not
>> sure if this would work for pNFS, though.
>
> splice() does not create new files. What you appear to be asking for
> lies way outside the scope of that system call interface.
>

Sorry I know, definitely outside the scope of splice, but in the context
of offloaded file copies. So the question is, what is the best way to
address/discuss that?

Thanks,
Bernd

^ permalink raw reply [flat|nested] 62+ messages in thread
[parent not found: <5249B987.8020807-mPn0NPGs4xGatNDF+KUbs4QuADTiUCJX@public.gmane.org>]
* Re: [RFC] extending splice for copy offloading [not found] ` <5249B987.8020807-mPn0NPGs4xGatNDF+KUbs4QuADTiUCJX@public.gmane.org> @ 2013-09-30 18:02 ` Myklebust, Trond 2013-09-30 18:49 ` Bernd Schubert 0 siblings, 1 reply; 62+ messages in thread From: Myklebust, Trond @ 2013-09-30 18:02 UTC (permalink / raw) To: Bernd Schubert Cc: Miklos Szeredi, Ric Wheeler, J. Bruce Fields, Zach Brown, Anna Schumaker, Kernel Mailing List, Linux-Fsdevel, linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Schumaker, Bryan, Martin K. Petersen, Jens Axboe, Mark Fasheh, Joel Becker, Eric Wong On Mon, 2013-09-30 at 19:48 +0200, Bernd Schubert wrote: > On 09/30/2013 07:44 PM, Myklebust, Trond wrote: > > On Mon, 2013-09-30 at 19:17 +0200, Bernd Schubert wrote: > >> It would be nice if there would be way if the file system would get a > >> hint that the target file is supposed to be copy of another file. That > >> way distributed file systems could also create the target-file with the > >> correct meta-information (same storage targets as in-file has). > >> Well, if we cannot agree on that, file system with a custom protocol at > >> least can detect from 0 to SSIZE_MAX and then reset metadata. I'm not > >> sure if this would work for pNFS, though. > > > > splice() does not create new files. What you appear to be asking for > > lies way outside the scope of that system call interface. > > > > Sorry I know, definitely outside the scope of splice, but in the context > of offloaded file copies. So the question is, what is the best way to > address/discuss that? Why does it need to be addressed in the first place? What is preventing an application from retrieving and setting this information using standard libc functions such as fstat()+open(), and supplemented with libattr attr_setf/getf(), and libacl acl_get_fd/set_fd where appropriate? -- Trond Myklebust Linux NFS client maintainer NetApp Trond.Myklebust@netapp.com www.netapp.com ^ permalink raw reply [flat|nested] 62+ messages in thread
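The generic part of what is suggested here, an application carrying metadata over itself with fstat()+open(), can be sketched as below. The helper name `create_like` is hypothetical, and the sketch covers only permission bits and timestamps through standard POSIX calls. Ownership, xattrs and ACLs (the libattr and libacl calls mentioned above) are left out, and nothing here influences where a distributed file system places the data, which is the point under dispute in the thread.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

/*
 * Create (or truncate) out_path so that it carries the same permission
 * bits and access/modification times as the already open in_fd.
 * Returns a write-only descriptor for the new file, or -1 with errno set.
 */
int create_like(int in_fd, const char *out_path)
{
    struct stat st;
    struct timespec times[2];
    int out_fd;
    int saved;

    assert(in_fd >= 0 && out_path != NULL);

    if (fstat(in_fd, &st) == -1)
        return -1;

    out_fd = open(out_path, O_CREAT | O_WRONLY | O_TRUNC, st.st_mode & 07777);
    if (out_fd == -1)
        return -1;

    times[0] = st.st_atim;          /* access time */
    times[1] = st.st_mtim;          /* modification time */

    /* open() applies the umask, so set the exact mode explicitly */
    if (fchmod(out_fd, st.st_mode & 07777) == -1 ||
        futimens(out_fd, times) == -1) {
        saved = errno;
        close(out_fd);
        errno = saved;
        return -1;
    }
    return out_fd;
}
```

In a real copy the timestamps would be applied after the data has been written, since writing updates the modification time again. They are set at creation here only to keep the sketch short.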
* Re: [RFC] extending splice for copy offloading 2013-09-30 18:02 ` Myklebust, Trond @ 2013-09-30 18:49 ` Bernd Schubert 2013-09-30 19:34 ` Myklebust, Trond 0 siblings, 1 reply; 62+ messages in thread From: Bernd Schubert @ 2013-09-30 18:49 UTC (permalink / raw) To: Myklebust, Trond Cc: Miklos Szeredi, Ric Wheeler, J. Bruce Fields, Zach Brown, Anna Schumaker, Kernel Mailing List, Linux-Fsdevel, linux-nfs@vger.kernel.org, Schumaker, Bryan, Martin K. Petersen, Jens Axboe, Mark Fasheh, Joel Becker, Eric Wong On 09/30/2013 08:02 PM, Myklebust, Trond wrote: > On Mon, 2013-09-30 at 19:48 +0200, Bernd Schubert wrote: >> On 09/30/2013 07:44 PM, Myklebust, Trond wrote: >>> On Mon, 2013-09-30 at 19:17 +0200, Bernd Schubert wrote: >>>> It would be nice if there would be way if the file system would get a >>>> hint that the target file is supposed to be copy of another file. That >>>> way distributed file systems could also create the target-file with the >>>> correct meta-information (same storage targets as in-file has). >>>> Well, if we cannot agree on that, file system with a custom protocol at >>>> least can detect from 0 to SSIZE_MAX and then reset metadata. I'm not >>>> sure if this would work for pNFS, though. >>> >>> splice() does not create new files. What you appear to be asking for >>> lies way outside the scope of that system call interface. >>> >> >> Sorry I know, definitely outside the scope of splice, but in the context >> of offloaded file copies. So the question is, what is the best way to >> address/discuss that? > > Why does it need to be addressed in the first place? An offloaded copy is still not efficient if different storage servers/targets used by from-file and to-file. > > What is preventing an application from retrieving and setting this > information using standard libc functions such as fstat()+open(), and > supplemented with libattr attr_setf/getf(), and libacl acl_get_fd/set_fd > where appropriate? 
> At a minimum this requires network and metadata overhead. And while I'm working on FhGFS now, I still wonder what other file system need to do - for example Lustre pre-allocates storage-target files on creating a file, so file layout changes mean even more overhead there. Anyway, if we could agree on to use libattr or libacl to teach the file system about the upcoming splice call I would be fine. Metadata overhead is probably negligible for large files. Thanks, Bernd ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [RFC] extending splice for copy offloading 2013-09-30 18:49 ` Bernd Schubert @ 2013-09-30 19:34 ` Myklebust, Trond 2013-09-30 20:00 ` Bernd Schubert 0 siblings, 1 reply; 62+ messages in thread From: Myklebust, Trond @ 2013-09-30 19:34 UTC (permalink / raw) To: Bernd Schubert Cc: Miklos Szeredi, Ric Wheeler, J. Bruce Fields, Zach Brown, Anna Schumaker, Kernel Mailing List, Linux-Fsdevel, linux-nfs@vger.kernel.org, Schumaker, Bryan, Martin K. Petersen, Jens Axboe, Mark Fasheh, Joel Becker, Eric Wong On Mon, 2013-09-30 at 20:49 +0200, Bernd Schubert wrote: > On 09/30/2013 08:02 PM, Myklebust, Trond wrote: > > On Mon, 2013-09-30 at 19:48 +0200, Bernd Schubert wrote: > >> On 09/30/2013 07:44 PM, Myklebust, Trond wrote: > >>> On Mon, 2013-09-30 at 19:17 +0200, Bernd Schubert wrote: > >>>> It would be nice if there would be way if the file system would get a > >>>> hint that the target file is supposed to be copy of another file. That > >>>> way distributed file systems could also create the target-file with the > >>>> correct meta-information (same storage targets as in-file has). > >>>> Well, if we cannot agree on that, file system with a custom protocol at > >>>> least can detect from 0 to SSIZE_MAX and then reset metadata. I'm not > >>>> sure if this would work for pNFS, though. > >>> > >>> splice() does not create new files. What you appear to be asking for > >>> lies way outside the scope of that system call interface. > >>> > >> > >> Sorry I know, definitely outside the scope of splice, but in the context > >> of offloaded file copies. So the question is, what is the best way to > >> address/discuss that? > > > > Why does it need to be addressed in the first place? > > An offloaded copy is still not efficient if different storage > servers/targets used by from-file and to-file. So? 
> > > > What is preventing an application from retrieving and setting this > > information using standard libc functions such as fstat()+open(), and > > supplemented with libattr attr_setf/getf(), and libacl acl_get_fd/set_fd > > where appropriate? > > > > At a minimum this requires network and metadata overhead. And while I'm > working on FhGFS now, I still wonder what other file system need to do - > for example Lustre pre-allocates storage-target files on creating a > file, so file layout changes mean even more overhead there. The problem you are describing is limited to a narrow set of storage architectures. If copy offload using splice() doesn't make sense for those architectures, then don't implement it for them. You might be able to provide ioctls() to do these special hinted file creations for those filesystems that need it, but the vast majority don't, and you shouldn't enforce it on them. > Anyway, if we could agree on to use libattr or libacl to teach the file > system about the upcoming splice call I would be fine. libattr and libacl are generic libraries that exist to manipulate xattrs and acls. They do not need to contain Lustre-specific code. -- Trond Myklebust Linux NFS client maintainer NetApp Trond.Myklebust@netapp.com www.netapp.com ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [RFC] extending splice for copy offloading 2013-09-30 19:34 ` Myklebust, Trond @ 2013-09-30 20:00 ` Bernd Schubert 2013-09-30 20:08 ` Ric Wheeler [not found] ` <5249D86A.7080603-mPn0NPGs4xGatNDF+KUbs4QuADTiUCJX@public.gmane.org> 0 siblings, 2 replies; 62+ messages in thread From: Bernd Schubert @ 2013-09-30 20:00 UTC (permalink / raw) To: Myklebust, Trond Cc: Miklos Szeredi, Ric Wheeler, J. Bruce Fields, Zach Brown, Anna Schumaker, Kernel Mailing List, Linux-Fsdevel, linux-nfs@vger.kernel.org, Schumaker, Bryan, Martin K. Petersen, Jens Axboe, Mark Fasheh, Joel Becker, Eric Wong On 09/30/2013 09:34 PM, Myklebust, Trond wrote: > On Mon, 2013-09-30 at 20:49 +0200, Bernd Schubert wrote: >> On 09/30/2013 08:02 PM, Myklebust, Trond wrote: >>> On Mon, 2013-09-30 at 19:48 +0200, Bernd Schubert wrote: >>>> On 09/30/2013 07:44 PM, Myklebust, Trond wrote: >>>>> On Mon, 2013-09-30 at 19:17 +0200, Bernd Schubert wrote: >>>>>> It would be nice if there would be way if the file system would get a >>>>>> hint that the target file is supposed to be copy of another file. That >>>>>> way distributed file systems could also create the target-file with the >>>>>> correct meta-information (same storage targets as in-file has). >>>>>> Well, if we cannot agree on that, file system with a custom protocol at >>>>>> least can detect from 0 to SSIZE_MAX and then reset metadata. I'm not >>>>>> sure if this would work for pNFS, though. >>>>> >>>>> splice() does not create new files. What you appear to be asking for >>>>> lies way outside the scope of that system call interface. >>>>> >>>> >>>> Sorry I know, definitely outside the scope of splice, but in the context >>>> of offloaded file copies. So the question is, what is the best way to >>>> address/discuss that? >>> >>> Why does it need to be addressed in the first place? >> >> An offloaded copy is still not efficient if different storage >> servers/targets used by from-file and to-file. > > So? 

mds1: orig-file
oss1/target1: orig-chunk1

mds1: target-file
ossN/targetN: target-chunk1

clientN: Performs the copy

Ideally, orig-chunk1 and target-chunk1 are on the same server and same
target. Copy offload could then even be done by the underlying fs,
similar to a local splice.
If different ossN servers are used, copies still have to be done over
the network by these storage servers, although the client would only
need to initiate the copy. Still faster, but also not ideal.

>
>>>
>>> What is preventing an application from retrieving and setting this
>>> information using standard libc functions such as fstat()+open(), and
>>> supplemented with libattr attr_setf/getf(), and libacl acl_get_fd/set_fd
>>> where appropriate?
>>>
>>
>> At a minimum this requires network and metadata overhead. And while I'm
>> working on FhGFS now, I still wonder what other file system need to do -
>> for example Lustre pre-allocates storage-target files on creating a
>> file, so file layout changes mean even more overhead there.
>
> The problem you are describing is limited to a narrow set of storage
> architectures. If copy offload using splice() doesn't make sense for
> those architectures, then don't implement it for them.

But it _does_ make sense. The file system just needs a hint that a
splice copy is going to come up.

> You might be able to provide ioctls() to do these special hinted file
> creations for those filesystems that need it, but the vast majority
> don't, and you shouldn't enforce it on them.

And exactly for that we need a standard - it does not make sense if each
and every distributed file system implements its own
ioctl/libattr/libacl interface for that.

>
>> Anyway, if we could agree on to use libattr or libacl to teach the file
>> system about the upcoming splice call I would be fine.
>
> libattr and libacl are generic libraries that exist to manipulate xattrs
> and acls. They do not need to contain Lustre-specific code.
>

pNFS, FhGFS, Lustre, Ceph, etc., all of them shall implement their own
interface? And userspace needs to address all of them differently?

I'm just asking for something like a vfs ioctl SPLICE_META_COPY (sorry,
didn't find a better name yet), which would take in-file-path and
out-file-path and allow the file system to create out-file-path with the
same meta-layout as in-file-path. And it would need some flags, such as
AUTO (file system decides if it makes sense to do a local copy) and
FORCE (always try a local copy).

Thanks,
Bernd

^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [RFC] extending splice for copy offloading 2013-09-30 20:00 ` Bernd Schubert @ 2013-09-30 20:08 ` Ric Wheeler [not found] ` <5249DA50.5060105-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> [not found] ` <5249D86A.7080603-mPn0NPGs4xGatNDF+KUbs4QuADTiUCJX@public.gmane.org> 1 sibling, 1 reply; 62+ messages in thread From: Ric Wheeler @ 2013-09-30 20:08 UTC (permalink / raw) To: Bernd Schubert Cc: Myklebust, Trond, Miklos Szeredi, J. Bruce Fields, Zach Brown, Anna Schumaker, Kernel Mailing List, Linux-Fsdevel, linux-nfs@vger.kernel.org, Schumaker, Bryan, Martin K. Petersen, Jens Axboe, Mark Fasheh, Joel Becker, Eric Wong On 09/30/2013 04:00 PM, Bernd Schubert wrote: > pNFS, FhGFS, Lustre, Ceph, etc., all of them shall implement their own > interface? And userspace needs to address all of them differently? The NFS and SCSI groups have each defined a standard which Zach's proposal abstracts into a common user API. Distributed file systems tend to be rather unique and do not have similar standard bodies, but a lot of them could hide server specific implementations under the current proposed interfaces. What is not a good idea is to drag out the core, simple copy offload discussion for another 5 years to pull in every odd use case :) ric ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [RFC] extending splice for copy offloading [not found] ` <5249DA50.5060105-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> @ 2013-09-30 20:27 ` Myklebust, Trond 0 siblings, 0 replies; 62+ messages in thread From: Myklebust, Trond @ 2013-09-30 20:27 UTC (permalink / raw) To: Ric Wheeler Cc: Bernd Schubert, Miklos Szeredi, J. Bruce Fields, Zach Brown, Anna Schumaker, Kernel Mailing List, Linux-Fsdevel, linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Schumaker, Bryan, Martin K. Petersen, Jens Axboe, Mark Fasheh, Joel Becker, Eric Wong On Mon, 2013-09-30 at 16:08 -0400, Ric Wheeler wrote: > On 09/30/2013 04:00 PM, Bernd Schubert wrote: > > pNFS, FhGFS, Lustre, Ceph, etc., all of them shall implement their own > > interface? And userspace needs to address all of them differently? > > The NFS and SCSI groups have each defined a standard which Zach's proposal > abstracts into a common user API. > > Distributed file systems tend to be rather unique and do not have similar > standard bodies, but a lot of them could hide server specific implementations > under the current proposed interfaces. > > What is not a good idea is to drag out the core, simple copy offload discussion > for another 5 years to pull in every odd use case :) Agreed. The whole idea of a common system call interface should be to allow us to abstract away the underlying storage and filesystem architectures. If filesystem developers also want a way to expose that underlying architecture to applications in order to enable further optimisations, then that belongs in a separate discussion. -- Trond Myklebust Linux NFS client maintainer NetApp Trond.Myklebust@netapp.com www.netapp.com ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [RFC] extending splice for copy offloading [not found] ` <5249D86A.7080603-mPn0NPGs4xGatNDF+KUbs4QuADTiUCJX@public.gmane.org> @ 2013-09-30 20:10 ` Myklebust, Trond 0 siblings, 0 replies; 62+ messages in thread From: Myklebust, Trond @ 2013-09-30 20:10 UTC (permalink / raw) To: Bernd Schubert Cc: Miklos Szeredi, Ric Wheeler, J. Bruce Fields, Zach Brown, Anna Schumaker, Kernel Mailing List, Linux-Fsdevel, linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Schumaker, Bryan, Martin K. Petersen, Jens Axboe, Mark Fasheh, Joel Becker, Eric Wong On Mon, 2013-09-30 at 22:00 +0200, Bernd Schubert wrote: > On 09/30/2013 09:34 PM, Myklebust, Trond wrote: > > On Mon, 2013-09-30 at 20:49 +0200, Bernd Schubert wrote: > >> On 09/30/2013 08:02 PM, Myklebust, Trond wrote: > >>> On Mon, 2013-09-30 at 19:48 +0200, Bernd Schubert wrote: > >>>> On 09/30/2013 07:44 PM, Myklebust, Trond wrote: > >>>>> On Mon, 2013-09-30 at 19:17 +0200, Bernd Schubert wrote: > >>>>>> It would be nice if there would be way if the file system would get a > >>>>>> hint that the target file is supposed to be copy of another file. That > >>>>>> way distributed file systems could also create the target-file with the > >>>>>> correct meta-information (same storage targets as in-file has). > >>>>>> Well, if we cannot agree on that, file system with a custom protocol at > >>>>>> least can detect from 0 to SSIZE_MAX and then reset metadata. I'm not > >>>>>> sure if this would work for pNFS, though. > >>>>> > >>>>> splice() does not create new files. What you appear to be asking for > >>>>> lies way outside the scope of that system call interface. > >>>>> > >>>> > >>>> Sorry I know, definitely outside the scope of splice, but in the context > >>>> of offloaded file copies. So the question is, what is the best way to > >>>> address/discuss that? > >>> > >>> Why does it need to be addressed in the first place? 
> >> > >> An offloaded copy is still not efficient if different storage > >> servers/targets used by from-file and to-file. > > > > So? > > mds1: orig-file > oss1/target1: orig-chunk1 > > mds1: target-file > ossN/targetN: target-chunk1 > > clientN: Performs the copy > > Ideally, orig-chunk1 and target-chunk1 are on the same server and same > target. Copy offload could then even be done from the underlying fs, > similar to a local splice. > If different ossN servers are used, copies still have to be done over > the network by these storage servers, although the client would only need to > initiate the copy. Still faster, but also not ideal. > > > > >>> > >>> What is preventing an application from retrieving and setting this > >>> information using standard libc functions such as fstat()+open(), and > >>> supplemented with libattr attr_setf/getf(), and libacl acl_get_fd/set_fd > >>> where appropriate? > >>> > >> > >> At a minimum this requires network and metadata overhead. And while I'm > >> working on FhGFS now, I still wonder what other file system need to do - > >> for example Lustre pre-allocates storage-target files on creating a > >> file, so file layout changes mean even more overhead there. > > > > The problem you are describing is limited to a narrow set of storage > > architectures. If copy offload using splice() doesn't make sense for > > those architectures, then don't implement it for them. > > But it _does_ make sense. The file system just needs a hint that a > splice copy is going to come up. Just wait for the splice() system call. How is this any different from write()? > > You might be able to provide ioctls() to do these special hinted file > > creations for those filesystems that need it, but the vast majority > > don't, and you shouldn't enforce it on them. > > And exactly for that we need a standard - it does not make sense if each > and every distributed file system implements its own > ioctl/libattr/libacl interface for that. 
> > > > >> Anyway, if we could agree on to use libattr or libacl to teach the file > >> system about the upcoming splice call I would be fine. > > > > libattr and libacl are generic libraries that exist to manipulate xattrs > > and acls. They do not need to contain Lustre-specific code. > > > > pNFS, FhGFS, Lustre, Ceph, etc., all of them shall implement their own > interface? And userspace needs to address all of them differently? > > I'm just asking for something like a vfs ioctl SPLICE_META_COPY (sorry, > didn't find a better name yet), which would take in-file-path and > out-file-path and allow the file system to create out-file-path with the > same meta-layout as in-file-path. And it would need some flags, such as > AUTO (file system decides if it makes sense to do a local copy) and > FORCE (always try a local copy). splice() is not a whole-file copy operation; it's a byte range copy. How does the above help other than in the whole-file case? -- Trond Myklebust Linux NFS client maintainer NetApp Trond.Myklebust@netapp.com www.netapp.com ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [RFC] extending splice for copy offloading [not found] ` <CAJfpegsvrr7x3MbdpvxUmzq0ZfDHfZkzAar6Od2G7wg8DgPLYQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2013-10-01 18:42 ` J. Bruce Fields 0 siblings, 0 replies; 62+ messages in thread From: J. Bruce Fields @ 2013-10-01 18:42 UTC (permalink / raw) To: Miklos Szeredi Cc: Ric Wheeler, Myklebust, Trond, Zach Brown, Anna Schumaker, Kernel Mailing List, Linux-Fsdevel, linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Schumaker, Bryan, Martin K. Petersen, Jens Axboe, Mark Fasheh, Joel Becker, Eric Wong On Mon, Sep 30, 2013 at 05:46:38PM +0200, Miklos Szeredi wrote: > On Mon, Sep 30, 2013 at 4:41 PM, Ric Wheeler <rwheeler-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote: > > The way the array based offload (and some software side reflink works) is > > not a byte by byte copy. We cannot assume that a valid count can be returned > > or that such a count would be an indication of a sequential segment of good > > data. The whole thing would normally have to be reissued. > > > > To make that a true assumption, you would have to mandate that in each of > > the specifications (and sw targets)... > > You're missing my point. > > - user issues SIZE_MAX splice request > - fs issues *64M* (or whatever) request to offload > - when that completes *fully* then we return 64M to userspace > - if it completes partially, then we return an error to userspace > > Again, wouldn't that work? So if implementations fall into two categories: - "instant": latency is on the order of a single IO. - "slow": latency is seconds or minutes, but still faster than a normal copy. (See Anna's NFS server implementation that does an ordinary copy internally.) Then to me it still seems simplest to design only for the "instant" case. 
But if we want to add some minimal help for the "slow" case then Miklos's proposal looks fine: the application doesn't have to know which case it's dealing with ahead of time--it always just submits the largest range it knows about--but a "slow" implementation isn't forced to leave the application waiting in one syscall for minutes with no indication what's going on. --b. -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 62+ messages in thread
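As an illustration of the bounded request Miklos describes above (the app asks for everything, the fs offloads at most one chunk, and a partial completion is reported as an error rather than as a short count), here is a minimal sketch in plain userspace C. The names OFFLOAD_CHUNK, offload_fn, copy_one_chunk and the two mock backends are hypothetical and exist only for this example; the 64M limit is the "or whatever" figure from the thread, and -EIO is an assumed choice of error code, not something the thread settles.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>

/* Hypothetical per-filesystem limit; 64M is the example figure from the thread. */
#define OFFLOAD_CHUNK ((size_t)64 << 20)

/* Hypothetical backend: tries to offload `len` bytes starting at `off` and
 * reports how many bytes it claims to have completed. */
typedef size_t (*offload_fn)(off_t off, size_t len);

/* One syscall's worth of work: clamp the request, offload it, and treat
 * anything short of full completion as a failure. */
ssize_t copy_one_chunk(offload_fn offload, off_t off, size_t len)
{
    size_t want = len < OFFLOAD_CHUNK ? len : OFFLOAD_CHUNK;
    size_t done = offload(off, want);

    assert(done <= want);
    if (done != want)
        return -EIO;        /* partial completion: error, not a count */
    return (ssize_t)want;   /* full completion: report the chunk size */
}

/* Mock backends for experimenting with the policy above. */
size_t mock_offload_ok(off_t off, size_t len)      { (void)off; return len; }
size_t mock_offload_partial(off_t off, size_t len) { (void)off; return len / 2; }
```

With these mocks, a SIZE_MAX request against mock_offload_ok comes back as one 64M chunk, while the same request against mock_offload_partial comes back as an error, which is the case Ric is worried about.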
* RE: [RFC] extending splice for copy offloading 2013-09-30 14:28 ` Ric Wheeler [not found] ` <CAJfpegv_C6cLOuA-mNtgtf2QbmmmcHwjQVo8mA nhf_wbJ8iRhg@mail.gmail.com> @ 2013-09-30 15:33 ` Myklebust, Trond [not found] ` <52498AA8.2090204-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 2 siblings, 0 replies; 62+ messages in thread From: Myklebust, Trond @ 2013-09-30 15:33 UTC (permalink / raw) To: Ric Wheeler, Miklos Szeredi Cc: J. Bruce Fields, Zach Brown, Anna Schumaker, Kernel Mailing List, Linux-Fsdevel, linux-nfs@vger.kernel.org, Schumaker, Bryan, Martin K. Petersen, Jens Axboe, Mark Fasheh, Joel Becker, Eric Wong > -----Original Message----- > From: Ric Wheeler [mailto:rwheeler@redhat.com] > Sent: Monday, September 30, 2013 10:29 AM > To: Miklos Szeredi > Cc: J. Bruce Fields; Myklebust, Trond; Zach Brown; Anna Schumaker; Kernel > Mailing List; Linux-Fsdevel; linux-nfs@vger.kernel.org; Schumaker, Bryan; > Martin K. Petersen; Jens Axboe; Mark Fasheh; Joel Becker; Eric Wong > Subject: Re: [RFC] extending splice for copy offloading > > On 09/30/2013 10:24 AM, Miklos Szeredi wrote: > > On Mon, Sep 30, 2013 at 4:52 PM, Ric Wheeler <rwheeler@redhat.com> > wrote: > >> On 09/30/2013 10:51 AM, Miklos Szeredi wrote: > >>> On Mon, Sep 30, 2013 at 4:34 PM, J. Bruce Fields > >>> <bfields@fieldses.org> > >>> wrote: > >>>>> My other worry is about interruptibility/restartability. Ideas? > >>>>> > >>>>> What happens on splice(from, to, 4G) and it's a non-reflink copy? > >>>>> Can the page cache copy be made restartable? Or should splice() be > >>>>> allowed to return a short count? What happens on (non-reflink) > >>>>> remote copies and huge request sizes? > >>>> If I were writing an application that required copies to be > >>>> restartable, I'd probably use the largest possible range in the > >>>> reflink case but break the copy into smaller chunks in the splice case. > >>>> > >>> The app really doesn't want to care about that. 
And it doesn't want > >>> to care about restartability, etc.. It's something the *kernel* has > >>> to care about. You just can't have uninterruptible syscalls that > >>> sleep for a "long" time, otherwise first you'll just have annoyed > >>> users pressing ^C in vain; then, if the sleep is even longer, > >>> warnings about task sleeping too long. > >>> > >>> One idea is letting splice() return a short count, and so the app > >>> can safely issue SIZE_MAX requests and the kernel can decide if it > >>> can copy the whole file in one go or if it wants to do it in smaller > >>> chunks. > >>> > >> You cannot rely on a short count. That implies that an offloaded copy > >> starts at byte 0 and the short count first bytes are all valid. > > Huh? > > > > - app calls splice(from, 0, to, 0, SIZE_MAX) > > 1) VFS calls ->direct_splice(from, 0, to, 0, SIZE_MAX) > > 1.a) fs reflinks the whole file in a jiffy and returns the size of the file > > 1 b) fs does copy offload of, say, 64MB and returns 64M > > 2) VFS does page copy of, say, 1MB and returns 1MB > > - app calls splice(from, X, to, X, SIZE_MAX) where X is the new offset > > ... > > > > The point is: the app is always doing the same (incrementing offset > > with the return value from splice) and the kernel can decide what is > > the best size it can service within a single uninterruptible syscall. > > > > Wouldn't that work? > > > > Thanks, > > Miklos > > No. > > Keep in mind that the offload operation in (1) might fail partially. The target > file (the copy) is allocated, the question is what ranges have valid data. > > I don't see that (2) is interesting or really needed to be done in the kernel. > If nothing else, it tends to confuse the discussion.... > Anna's figures, that were presented at Plumber's, show that (2) is still worth doing on the _server_ for the case of NFS. Cheers Trond ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [RFC] extending splice for copy offloading [not found] ` <52498AA8.2090204-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> @ 2013-09-30 15:38 ` Miklos Szeredi 0 siblings, 0 replies; 62+ messages in thread From: Miklos Szeredi @ 2013-09-30 15:38 UTC (permalink / raw) To: Ric Wheeler Cc: J. Bruce Fields, Myklebust, Trond, Zach Brown, Anna Schumaker, Kernel Mailing List, Linux-Fsdevel, linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Schumaker, Bryan, Martin K. Petersen, Jens Axboe, Mark Fasheh, Joel Becker, Eric Wong On Mon, Sep 30, 2013 at 4:28 PM, Ric Wheeler <rwheeler-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote: > On 09/30/2013 10:24 AM, Miklos Szeredi wrote: >> >> On Mon, Sep 30, 2013 at 4:52 PM, Ric Wheeler <rwheeler-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote: >>> >>> On 09/30/2013 10:51 AM, Miklos Szeredi wrote: >>>> >>>> On Mon, Sep 30, 2013 at 4:34 PM, J. Bruce Fields <bfields-uC3wQj2KruNg9hUCZPvPmw@public.gmane.org> >>>> wrote: >>>>>> >>>>>> My other worry is about interruptibility/restartability. Ideas? >>>>>> >>>>>> What happens on splice(from, to, 4G) and it's a non-reflink copy? >>>>>> Can the page cache copy be made restartable? Or should splice() be >>>>>> allowed to return a short count? What happens on (non-reflink) remote >>>>>> copies and huge request sizes? >>>>> >>>>> If I were writing an application that required copies to be >>>>> restartable, >>>>> I'd probably use the largest possible range in the reflink case but >>>>> break the copy into smaller chunks in the splice case. >>>>> >>>> The app really doesn't want to care about that. And it doesn't want >>>> to care about restartability, etc.. It's something the *kernel* has >>>> to care about. You just can't have uninterruptible syscalls that >>>> sleep for a "long" time, otherwise first you'll just have annoyed >>>> users pressing ^C in vain; then, if the sleep is even longer, warnings >>>> about task sleeping too long. 
>>>> >>>> One idea is letting splice() return a short count, and so the app can >>>> safely issue SIZE_MAX requests and the kernel can decide if it can >>>> copy the whole file in one go or if it wants to do it in smaller >>>> chunks. >>>> >>> You cannot rely on a short count. That implies that an offloaded copy >>> starts >>> at byte 0 and the short count first bytes are all valid. >> >> Huh? >> >> - app calls splice(from, 0, to, 0, SIZE_MAX) >> 1) VFS calls ->direct_splice(from, 0, to, 0, SIZE_MAX) >> 1.a) fs reflinks the whole file in a jiffy and returns the size of >> the file >> 1 b) fs does copy offload of, say, 64MB and returns 64M >> 2) VFS does page copy of, say, 1MB and returns 1MB >> - app calls splice(from, X, to, X, SIZE_MAX) where X is the new offset >> ... >> >> The point is: the app is always doing the same (incrementing offset >> with the return value from splice) and the kernel can decide what is >> the best size it can service within a single uninterruptible syscall. >> >> Wouldn't that work? >> > > No. > > Keep in mind that the offload operation in (1) might fail partially. The > target file (the copy) is allocated, the question is what ranges have valid > data. You are talking about case 1.a, right? So if the offload copy 0-64MB fails partially, we return failure from splice, yet some of the copy did succeed. Is that the problem? Why? Thanks, Miklos ^ permalink raw reply [flat|nested] 62+ messages in thread
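The application-side loop quoted above ("incrementing offset with the return value from splice") can be sketched in userspace C as follows. This is only an illustration under assumptions: copy_fn stands in for the proposed file-to-file splice() and is not an existing kernel interface, and struct fake_file and fake_copy are hypothetical mocks. The sketch also assumes the kernel only ever reports bytes that were fully copied, which is exactly the property under dispute in this subthread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>

/* Stand-in for the proposed call: copy up to `len` bytes at offset `off`.
 * Returns bytes copied (possibly fewer than asked), 0 at end of file,
 * or a negative errno. Hypothetical interface for this example only. */
typedef ssize_t (*copy_fn)(void *ctx, off_t off, size_t len);

/* The app always asks for "everything" and advances by whatever came back.
 * Returns the total number of bytes copied, or a negative errno. */
off_t copy_all(copy_fn copy, void *ctx)
{
    off_t off = 0;

    assert(copy != NULL);
    for (;;) {
        ssize_t n = copy(ctx, off, SIZE_MAX);

        if (n < 0)
            return (off_t)n;    /* error: give up */
        if (n == 0)
            return off;         /* end of file: done */
        off += n;               /* short or full count: just advance */
    }
}

/* Mock "kernel": a file of `size` bytes that services at most
 * `max_per_call` bytes per request and counts how often it is called. */
struct fake_file {
    off_t size;
    size_t max_per_call;
    unsigned calls;
};

ssize_t fake_copy(void *ctx, off_t off, size_t len)
{
    struct fake_file *f = ctx;
    size_t left = (size_t)(f->size - off);
    size_t n = len < left ? len : left;

    if (n > f->max_per_call)
        n = f->max_per_call;
    f->calls++;
    return (ssize_t)n;
}
```

The same loop serves both cases from the list: a mock with a 4M per-call limit is called several times, while a mock with no limit (the reflink case) finishes in one call plus the final end-of-file call.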
* Re: [RFC] extending splice for copy offloading [not found] ` <CAJfpegtpXuh9070ALGy16Y8kdgioBqSf4JQqBBCF4FHvFJWAWQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2013-10-01 19:58 ` Zach Brown [not found] ` <20131001195817.GE10831-fypN+1c5dIyjpB87vu3CluTW4wlIGRCZ@public.gmane.org> 0 siblings, 1 reply; 62+ messages in thread From: Zach Brown @ 2013-10-01 19:58 UTC (permalink / raw) To: Miklos Szeredi Cc: Ric Wheeler, J. Bruce Fields, Myklebust, Trond, Anna Schumaker, Kernel Mailing List, Linux-Fsdevel, linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Schumaker, Bryan, Martin K. Petersen, Jens Axboe, Mark Fasheh, Joel Becker, Eric Wong > - app calls splice(from, 0, to, 0, SIZE_MAX) > 1) VFS calls ->direct_splice(from, 0, to, 0, SIZE_MAX) > 1.a) fs reflinks the whole file in a jiffy and returns the size of the file > 1 b) fs does copy offload of, say, 64MB and returns 64M > 2) VFS does page copy of, say, 1MB and returns 1MB > - app calls splice(from, X, to, X, SIZE_MAX) where X is the new offset (It's not SIZE_MAX. It's MAX_RW_COUNT. INT_MAX with some PAGE_CACHE_SIZE rounding noise. For fear of weird corners of fs code paths that still use int, one assumes.) > The point is: the app is always doing the same (incrementing offset > with the return value from splice) and the kernel can decide what is > the best size it can service within a single uninterruptible syscall. > > Wouldn't that work? It seems like it should, if people are willing to allow splice() to return partial counts. Quite a lot of IO syscalls technically do return partial counts today if you try to write > MAX_RW_COUNT :). But returning partial counts on the order of a handful of megs that the file systems make up as the point of diminishing returns is another thing entirely. I can imagine people being anxious about that. I guess we'll find out! 
- z ^ permalink raw reply [flat|nested] 62+ messages in thread
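For reference, the "INT_MAX with some PAGE_CACHE_SIZE rounding noise" remark can be worked through in a few lines of C. This is a userspace sketch, not kernel code: the EX_ names are made up for the example, a 4096-byte page and a 32-bit int are assumed, and the formula follows what the kernel's MAX_RW_COUNT definition is understood to have been at the time (INT_MAX masked down to a page boundary).

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Assumed page size for the example; real systems may differ. */
#define EX_PAGE_SIZE    4096UL
#define EX_PAGE_MASK    (~(EX_PAGE_SIZE - 1))

/* INT_MAX rounded down to a page boundary: 0x7ffff000 with 4K pages. */
#define EX_MAX_RW_COUNT ((size_t)INT_MAX & EX_PAGE_MASK)

/* What a read/write-style syscall would do with an oversized request:
 * silently clamp it, so the caller sees a partial count. */
size_t clamp_rw_count(size_t len)
{
    assert(EX_MAX_RW_COUNT % EX_PAGE_SIZE == 0);
    return len > EX_MAX_RW_COUNT ? EX_MAX_RW_COUNT : len;
}
```

Under these assumptions a request for SIZE_MAX bytes is clamped to 2147479552 bytes (0x7ffff000), which is the sense in which existing IO syscalls already return partial counts for very large requests.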
* Re: [RFC] extending splice for copy offloading [not found] ` <20131001195817.GE10831-fypN+1c5dIyjpB87vu3CluTW4wlIGRCZ@public.gmane.org> @ 2013-10-02 12:58 ` Jan Kara 2013-10-02 13:31 ` David Lang 0 siblings, 1 reply; 62+ messages in thread From: Jan Kara @ 2013-10-02 12:58 UTC (permalink / raw) To: Zach Brown Cc: Miklos Szeredi, Ric Wheeler, J. Bruce Fields, Myklebust, Trond, Anna Schumaker, Kernel Mailing List, Linux-Fsdevel, linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Schumaker, Bryan, Martin K. Petersen, Jens Axboe, Mark Fasheh, Joel Becker, Eric Wong On Tue 01-10-13 12:58:17, Zach Brown wrote: > > - app calls splice(from, 0, to, 0, SIZE_MAX) > > 1) VFS calls ->direct_splice(from, 0, to, 0, SIZE_MAX) > > 1.a) fs reflinks the whole file in a jiffy and returns the size of the file > > 1 b) fs does copy offload of, say, 64MB and returns 64M > > 2) VFS does page copy of, say, 1MB and returns 1MB > > - app calls splice(from, X, to, X, SIZE_MAX) where X is the new offset > > (It's not SIZE_MAX. It's MAX_RW_COUNT. INT_MAX with some > PAGE_CACHE_SIZE rounding noise. For fear of weird corners of fs code > paths that still use int, one assumes.) > > > The point is: the app is always doing the same (incrementing offset > > with the return value from splice) and the kernel can decide what is > > the best size it can service within a single uninterruptible syscall. > > > > Wouldn't that work? > > It seems like it should, if people are willing to allow splice() to > return partial counts. Quite a lot of IO syscalls technically do return > partial counts today if you try to write > MAX_RW_COUNT :). Yes. Also POSIX says that application must handle such case for read & write. But in practice programmers are lazy. > But returning partial counts on the order of a handful of megs that the > file systems make up as the point of diminishing returns is another > thing entirely. I can imagine people being anxious about that. > > I guess we'll find out! 
Return 4 KB once in a while to screw up buggy applications from the start :-p Honza -- Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org> SUSE Labs, CR ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [RFC] extending splice for copy offloading 2013-10-02 12:58 ` Jan Kara @ 2013-10-02 13:31 ` David Lang 0 siblings, 0 replies; 62+ messages in thread From: David Lang @ 2013-10-02 13:31 UTC (permalink / raw) To: Jan Kara Cc: Zach Brown, Miklos Szeredi, Ric Wheeler, J. Bruce Fields, Myklebust, Trond, Anna Schumaker, Kernel Mailing List, Linux-Fsdevel, linux-nfs@vger.kernel.org, Schumaker, Bryan, Martin K. Petersen, Jens Axboe, Mark Fasheh, Joel Becker, Eric Wong On Wed, 2 Oct 2013, Jan Kara wrote: > On Tue 01-10-13 12:58:17, Zach Brown wrote: >>> - app calls splice(from, 0, to, 0, SIZE_MAX) >>> 1) VFS calls ->direct_splice(from, 0, to, 0, SIZE_MAX) >>> 1.a) fs reflinks the whole file in a jiffy and returns the size of the file >>> 1 b) fs does copy offload of, say, 64MB and returns 64M >>> 2) VFS does page copy of, say, 1MB and returns 1MB >>> - app calls splice(from, X, to, X, SIZE_MAX) where X is the new offset >> >> (It's not SIZE_MAX. It's MAX_RW_COUNT. INT_MAX with some >> PAGE_CACHE_SIZE rounding noise. For fear of weird corners of fs code >> paths that still use int, one assumes.) >> >>> The point is: the app is always doing the same (incrementing offset >>> with the return value from splice) and the kernel can decide what is >>> the best size it can service within a single uninterruptible syscall. >>> >>> Wouldn't that work? >> >> It seems like it should, if people are willing to allow splice() to >> return partial counts. Quite a lot of IO syscalls technically do return >> partial counts today if you try to write > MAX_RW_COUNT :). > Yes. Also POSIX says that application must handle such case for read & > write. But in practice programmers are lazy. > >> But returning partial counts on the order of a handful of megs that the >> file systems make up as the point of diminishing returns is another >> thing entirely. I can imagine people being anxious about that. >> >> I guess we'll find out! 
> Return 4 KB once in a while to screw up buggy applications from the > start :-p or at least have a debugging option early on that does this so people can use it to find such buggy apps. David Lang ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [RFC] extending splice for copy offloading [not found] ` <1378919210-10372-1-git-send-email-zab-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> ` (3 preceding siblings ...) 2013-09-20 9:49 ` [RFC] extending splice for copy offloading Szeredi Miklos @ 2013-12-18 12:41 ` Christoph Hellwig 2013-12-18 17:10 ` Zach Brown 4 siblings, 1 reply; 62+ messages in thread From: Christoph Hellwig @ 2013-12-18 12:41 UTC (permalink / raw) To: Zach Brown Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, linux-nfs-u79uwXL29TY76Z2rM5mHXA, Trond Myklebust, Bryan Schumaker, Martin K. Petersen, Jens Axboe, Mark Fasheh, Joel Becker, Eric Wong On Wed, Sep 11, 2013 at 10:06:47AM -0700, Zach Brown wrote: > When I first started on this stuff I followed the lead of previous > work and added a new syscall for the copy operation: > > https://lkml.org/lkml/2013/5/14/618 > > Towards the end of that thread Eric Wong asked why we didn't just > extend splice. I immediately replied with some dumb dismissive > answer. Once I sat down and looked at it, though, it does make a > lot of sense. So good job, Eric. +10 Dummie points for me. > > Extending splice avoids all the noise of adding a new syscall and > naturally falls back to buffered copying as that's what the direct > splice path does for sendfile() today. Given the convoluted mess that the splice code already is, I'd rather not overload it even further. Instead I'd make the sendfile code path that already works differently in practice separate first, and then generalize it to a copy chunk syscall using the same code path. We can still fall back to the splice code as a last resort if no option is provided, but I think making the splice code handle even more totally different cases is the wrong direction. 
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [RFC] extending splice for copy offloading 2013-12-18 12:41 ` Christoph Hellwig @ 2013-12-18 17:10 ` Zach Brown 2013-12-18 17:26 ` Anna Schumaker 0 siblings, 1 reply; 62+ messages in thread From: Zach Brown @ 2013-12-18 17:10 UTC (permalink / raw) To: Christoph Hellwig Cc: linux-kernel, linux-fsdevel, linux-nfs, Trond Myklebust, Bryan Schumaker, Martin K. Petersen, Jens Axboe, Mark Fasheh, Joel Becker, Eric Wong On Wed, Dec 18, 2013 at 04:41:26AM -0800, Christoph Hellwig wrote: > On Wed, Sep 11, 2013 at 10:06:47AM -0700, Zach Brown wrote: > > When I first started on this stuff I followed the lead of previous > > work and added a new syscall for the copy operation: > > > > https://lkml.org/lkml/2013/5/14/618 > > > > Towards the end of that thread Eric Wong asked why we didn't just > > extend splice. I immediately replied with some dumb dismissive > > answer. Once I sat down and looked at it, though, it does make a > > lot of sense. So good job, Eric. +10 Dummie points for me. > > > > Extending splice avoids all the noise of adding a new syscall and > > naturally falls back to buffered copying as that's what the direct > > splice path does for sendfile() today. > > Given the convoluted mess that the splice code already is, I'd rather > not overload it even further. I agree, after trying to weave the copy offloading API into the splice interface. There are also weird cases that we haven't really discussed so far (preserving unwritten allocations between the copied files?) that would muddy the waters even further. The further the APIs drift from each other, the more I prefer giving copy offloading its own clean syscall, even if the argument types superficially match the splice() ABI. > We can still fall back to the splice code as a last resort if no option > is provided, but I think making the splice code handle > even more totally different cases is the wrong direction. I'm with you. 
I'll have another version out sometime after the US holiday break.. say in a few weeks? - z ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [RFC] extending splice for copy offloading 2013-12-18 17:10 ` Zach Brown @ 2013-12-18 17:26 ` Anna Schumaker 0 siblings, 0 replies; 62+ messages in thread From: Anna Schumaker @ 2013-12-18 17:26 UTC (permalink / raw) To: Zach Brown, Christoph Hellwig Cc: linux-kernel, linux-fsdevel, linux-nfs, Trond Myklebust, Bryan Schumaker, Martin K. Petersen, Jens Axboe, Mark Fasheh, Joel Becker, Eric Wong On 12/18/2013 12:10 PM, Zach Brown wrote: > On Wed, Dec 18, 2013 at 04:41:26AM -0800, Christoph Hellwig wrote: >> On Wed, Sep 11, 2013 at 10:06:47AM -0700, Zach Brown wrote: >>> When I first started on this stuff I followed the lead of previous >>> work and added a new syscall for the copy operation: >>> >>> https://lkml.org/lkml/2013/5/14/618 >>> >>> Towards the end of that thread Eric Wong asked why we didn't just >>> extend splice. I immediately replied with some dumb dismissive >>> answer. Once I sat down and looked at it, though, it does make a >>> lot of sense. So good job, Eric. +10 Dummie points for me. >>> >>> Extending splice avoids all the noise of adding a new syscall and >>> naturally falls back to buffered copying as that's what the direct >>> splice path does for sendfile() today. >> Given the convoluted mess that the splice code already is, I'd rather >> not overload it even further. > I agree, after trying to weave the copy offloading API into the splice > interface. There are also weird cases that we haven't really discussed > so far (preserving unwritten allocations between the copied files?) that > would muddy the waters even further. > > The further the APIs drift from each other, the more I prefer > giving copy offloading its own clean syscall, even if the argument > types superficially match the splice() ABI. > >> We can still fall back to the splice code as a last resort if no option >> is provided, but I think making the splice code handle >> even more totally different cases is the wrong direction. > I'm with you. 
I'll have another version out sometime after the US > holiday break.. say in a few weeks? That'll work for me, I'll update my NFS code once your new patches are out. Anna > > - z ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [RFC] extending splice for copy offloading
  2013-09-11 17:06 [RFC] extending splice for copy offloading Zach Brown
@ 2013-09-11 21:17 ` Eric Wong
  2013-09-19 12:59 ` Jeff Layton
  1 sibling, 2 replies; 62+ messages in thread
From: Eric Wong @ 2013-09-11 21:17 UTC
To: Zach Brown
Cc: linux-kernel, linux-fsdevel, linux-nfs, Trond Myklebust,
    Bryan Schumaker, Martin K. Petersen, Jens Axboe, Mark Fasheh,
    Joel Becker

Zach Brown <zab@redhat.com> wrote:
> Towards the end of that thread Eric Wong asked why we didn't just
> extend splice. I immediately replied with some dumb dismissive
> answer. Once I sat down and looked at it, though, it does make a
> lot of sense. So good job, Eric. +10 Dummie points for me.

Thanks for revisiting that :>

> Some things to talk about:
> - I really don't care about the naming here. If you do, holler.

Exposing "DIRECT" to userspace now might confuse users into expecting
O_DIRECT behavior. I say this as an easily-confused user.

In the future, perhaps O_DIRECT behavior can become per-splice (instead
of just per-open) and we can save SPLICE_F_DIRECT for that.

> - We might want different flags for file-to-file splicing and acceleration
> - We might want flags to require or forbid acceleration
> - We might want to provide all these flags to sendfile, too

Another syscall? I prefer not. Better to just maintain the sendfile
API as-is for compatibility reasons and nudge users towards splice.

> Thoughts? Objections?

I'll try to test/comment more in a week or two (not much time for
computing until then).
* Re: [RFC] extending splice for copy offloading
@ 2013-09-16 19:44 ` Rob Landley
  0 siblings, 0 replies; 62+ messages in thread
From: Rob Landley @ 2013-09-16 19:44 UTC
To: Eric Wong
Cc: Zach Brown, linux-kernel, linux-fsdevel, linux-nfs, Trond Myklebust,
    Bryan Schumaker, Martin K. Petersen, Jens Axboe, Mark Fasheh,
    Joel Becker

On 09/11/2013 04:17:23 PM, Eric Wong wrote:
> Zach Brown <zab@redhat.com> wrote:
> > Towards the end of that thread Eric Wong asked why we didn't just
> > extend splice. I immediately replied with some dumb dismissive
> > answer. Once I sat down and looked at it, though, it does make a
> > lot of sense. So good job, Eric. +10 Dummie points for me.
>
> Thanks for revisiting that :>
>
> > Some things to talk about:
> > - I really don't care about the naming here. If you do, holler.
>
> Exposing "DIRECT" to userspace now might confuse users into expecting
> O_DIRECT behavior. I say this as an easily-confused user.
>
> In the future, perhaps O_DIRECT behavior can become per-splice (instead
> of just per-open) and we can save SPLICE_F_DIRECT for that.
>
> > - We might want different flags for file-to-file splicing and acceleration
> > - We might want flags to require or forbid acceleration
> > - We might want to provide all these flags to sendfile, too
>
> Another syscall? I prefer not. Better to just maintain the sendfile
> API as-is for compatibility reasons and nudge users towards splice.
>
> > Thoughts? Objections?
>
> I'll try to test/comment more in a week or two (not much time for
> computing until then).

Just a vague note that I've wanted to use splice when implementing cp and
patch and cat and so on in toybox, but couldn't because it needs a pipe.
So I'm quite interested in moves to lift this restriction...

Rob
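To illustrate the restriction Rob describes: splice() as it exists requires at least one of the two descriptors to be a pipe, so a cp-style file-to-file copy has to bounce the data through an intermediate pipe. The sketch below shows that workaround under stated assumptions; copy_via_pipe is a hypothetical name, both descriptors are assumed to be regular files, and error handling is kept minimal.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE     /* splice() is a GNU/Linux extension */
#endif
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: copy up to len bytes from in_fd to out_fd by
 * splicing file -> pipe -> file. Returns bytes copied, or -1 on error.
 */
static ssize_t copy_via_pipe(int in_fd, int out_fd, size_t len)
{
    int p[2];
    size_t done = 0;

    if (pipe(p) < 0)
        return -1;

    while (done < len) {
        /* Fill the pipe from the source file (at most one pipe's worth). */
        ssize_t n = splice(in_fd, NULL, p[1], NULL, len - done, SPLICE_F_MOVE);
        if (n < 0)
            goto fail;
        if (n == 0)
            break;      /* end of input file */

        /* Drain everything just queued into the destination file. */
        ssize_t left = n;
        while (left > 0) {
            ssize_t m = splice(p[0], NULL, out_fd, NULL, (size_t)left,
                               SPLICE_F_MOVE);
            if (m <= 0)
                goto fail;
            left -= m;
        }
        done += (size_t)n;
    }

    close(p[0]);
    close(p[1]);
    return (ssize_t)done;

fail:
    close(p[0]);
    close(p[1]);
    return -1;
}
```

The extra pipe and the second splice() call per chunk are exactly the overhead that lifting the pipe restriction would remove for tools like cp and cat.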
* Re: [RFC] extending splice for copy offloading
  2013-09-11 21:17 ` Eric Wong
@ 2013-09-19 12:59 ` Jeff Layton
  1 sibling, 0 replies; 62+ messages in thread
From: Jeff Layton @ 2013-09-19 12:59 UTC
To: Eric Wong
Cc: Zach Brown, linux-kernel, linux-fsdevel, linux-nfs, Trond Myklebust,
    Bryan Schumaker, Martin K. Petersen, Jens Axboe, Mark Fasheh,
    Joel Becker

On Wed, 11 Sep 2013 21:17:23 +0000 Eric Wong <normalperson@yhbt.net> wrote:
> Zach Brown <zab@redhat.com> wrote:
> > Towards the end of that thread Eric Wong asked why we didn't just
> > extend splice. I immediately replied with some dumb dismissive
> > answer. Once I sat down and looked at it, though, it does make a
> > lot of sense. So good job, Eric. +10 Dummie points for me.
>
> Thanks for revisiting that :>
>
> > Some things to talk about:
> > - I really don't care about the naming here. If you do, holler.
>
> Exposing "DIRECT" to userspace now might confuse users into expecting
> O_DIRECT behavior. I say this as an easily-confused user.
>
> In the future, perhaps O_DIRECT behavior can become per-splice (instead
> of just per-open) and we can save SPLICE_F_DIRECT for that.
>
> > - We might want different flags for file-to-file splicing and acceleration
> > - We might want flags to require or forbid acceleration
>

Do we need new flags at all? If both fds refer to files, then perhaps we
can just take it that SPLICE_F_DIRECT behavior is implied? I'd probably
suggest that we not add any more flags than are necessary until use-cases
for them become clear.

> > - We might want to provide all these flags to sendfile, too
>
> Another syscall? I prefer not. Better to just maintain the sendfile
> API as-is for compatibility reasons and nudge users towards splice.
>

Agreed.

> > Thoughts? Objections?
>
> I'll try to test/comment more in a week or two (not much time for
> computing until then).

On the whole, the concept looks sound.

I'll note too that by simply lifting the restriction that one of the
fds passed to splice must always be a pipe, that may also give us a
relatively simple way to add recvfile() as well, even if only as a macro
wrapper around splice().

That's been a long sought-after feature of the Samba developers...
Just allow userland to do a splice straight from a socket fd to a file.
We may end up having to copy data if the alignment isn't right, but it'd
still be valuable to do that directly in the kernel in a single syscall.

--
Jeff Layton <jlayton@redhat.com>