* [RFC v0 0/4] sys_copy_range() rough draft @ 2013-05-14 21:15 Zach Brown 2013-05-14 21:15 ` [RFC v0 1/4] vfs: add copy_range syscall and vfs entry point Zach Brown ` (3 more replies) 0 siblings, 4 replies; 24+ messages in thread From: Zach Brown @ 2013-05-14 21:15 UTC (permalink / raw) To: Martin K. Petersen, Trond Myklebust, linux-kernel, linux-fsdevel, linux-btrfs, linux-nfs We've been talking about implementing some form of bulk data copy offloading for a while now. BTRFS and OCFS2 implement forms of copy offloading with ioctls, NFS 4.2 will include a byte-granular COPY operation, and the SCSI XCOPY command is being implemented now that Windows can issue it. In the past we've discussed promoting the ocfs2 reflink ioctl into a system call that would create a new file and implicitly copy the source data into the new file: https://lkml.org/lkml/2009/9/14/481 These draft patches take the simpler approach of only copying data between existing files. The patches 1) make a system call out of the btrfs CLONE_RANGE ioctl, 2) implement the btrfs .copy_range method with the ioctl's guts, 3) implement the nfs .copy_range by sending a COPY op, and 4) serve the COPY op in nfsd by calling the .copy_range method again. The nfs patch is an untested hack. I'm happy to beat it in to shape but I'll need some guidance. I'd like strong review feedback on the interfaces, here are some possible topics: a) Hopefully being able to specify a portion of the data to copy will avoid *huge* syscall latencies and the motivation for new async semantics. b) The BTRFS ioctl and nfs COPY let you specify a count of 0 to copy from the start offset to the end of the file. Does anyone have a strong feeling about this? I'm leaning towards not bothering with it in the syscall interface. c) I chose to return partial progess in the ssize_t return code. This limits the length of the range and the size_t count argument can be too large and return errors, much like other io syscalls. This seemed less awful than some extra argument with a pointer to a status value. d) I'm dreading mentioning a vector of ranges to copy in one syscall because I don't want to think about overlaping ranges and file systems that use range locks -- xfs for now, but more if Jan gets his way. I'd rather that we get some experience with this simpler syscall before taking on that headache. I'm sure I'm forgetting some other details. I'm going to keep hacking away at this. My next step is to get ext4 supporting .copy_range, probably with a quick hack to copy the contents of bios. Hopefully that'll give enough time to also integrate review feedback. Thoughts? - z ^ permalink raw reply [flat|nested] 24+ messages in thread
* [RFC v0 1/4] vfs: add copy_range syscall and vfs entry point 2013-05-14 21:15 [RFC v0 0/4] sys_copy_range() rough draft Zach Brown @ 2013-05-14 21:15 ` Zach Brown 2013-05-15 19:44 ` Eric Wong 2013-05-14 21:15 ` [RFC v0 2/4] x86: add sys_copy_range to syscall tables Zach Brown ` (2 subsequent siblings) 3 siblings, 1 reply; 24+ messages in thread From: Zach Brown @ 2013-05-14 21:15 UTC (permalink / raw) To: Martin K. Petersen, Trond Myklebust, linux-kernel, linux-fsdevel, linux-btrfs, linux-nfs This adds a syscall and vfs entry point for clone_range which offloads data copying between existing files. The syscall is a thin wrapper around the vfs entry point. Its arguments are inspired by sys_splice(). The behaviour of the vfs helper is derived from the current btrfs CLONE_RANGE ioctl. --- fs/Makefile | 2 +- fs/copy_range.c | 127 ++++++++++++++++++++++++++++++++++++++ include/linux/fs.h | 3 + include/uapi/asm-generic/unistd.h | 4 +- kernel/sys_ni.c | 1 + 5 files changed, 135 insertions(+), 2 deletions(-) create mode 100644 fs/copy_range.c diff --git a/fs/Makefile b/fs/Makefile index 4fe6df3..1be83b3 100644 --- a/fs/Makefile +++ b/fs/Makefile @@ -11,7 +11,7 @@ obj-y := open.o read_write.o file_table.o super.o \ attr.o bad_inode.o file.o filesystems.o namespace.o \ seq_file.o xattr.o libfs.o fs-writeback.o \ pnode.o splice.o sync.o utimes.o \ - stack.o fs_struct.o statfs.o + stack.o fs_struct.o statfs.o copy_range.o ifeq ($(CONFIG_BLOCK),y) obj-y += buffer.o bio.o block_dev.o direct-io.o mpage.o ioprio.o diff --git a/fs/copy_range.c b/fs/copy_range.c new file mode 100644 index 0000000..3000b9f --- /dev/null +++ b/fs/copy_range.c @@ -0,0 +1,127 @@ +/* + * "copy_range": offload data copying between existing files + * + * Copyright (C) 2013 Zach Brown <zab@redhat.com> + */ +#include <linux/fs.h> +#include <linux/file.h> +#include <linux/mount.h> +#include <linux/syscalls.h> +#include <linux/export.h> +#include <linux/fsnotify.h> + +/** + * vfs_copy_range - copy range of bytes from source file to existing file + * @file_in: source regular file + * @pos_in: starting byte offset to copy from the source file + * @file_out: destination regular file + * @pos_out: starting byte offset to copy to in the destination file + * @count: number of bytes to copy + * + * Returns number of bytes successfully copied from the start of the range or + * a negative errno error value. + * + * The number of bytes successfully written can be less than the input + * count if an error is encountered. In this partial success case the + * contents of the destination range after the copied bytes can be a mix + * of pre-existing bytes, bytes from the source range, or zeros, + * depending on the implementation. + * + * The source range must be entirely within i_size in the source file. + * A destination range outside of the size of the destination file will + * extend its size. + */ +ssize_t vfs_copy_range(struct file *file_in, loff_t pos_in, + struct file *file_out, loff_t pos_out, + size_t count) +{ + struct inode *inode_in; + struct inode *inode_out; + ssize_t ret; + + if (count == 0) + return 0; + + /* copy_range allows full ssize_t count, ignoring MAX_RW_COUNT */ + ret = rw_verify_area(READ, file_in, &pos_in, count); + if (ret >= 0) + ret = rw_verify_area(WRITE, file_out, &pos_out, count); + if (ret < 0) + return ret; + + if (!(file_in->f_mode & FMODE_READ) || + !(file_out->f_mode & FMODE_WRITE) || + (file_out->f_flags & O_APPEND) || + !file_in->f_op || !file_in->f_op->copy_range) + return -EINVAL; + + inode_in = file_inode(file_in); + inode_out = file_inode(file_out); + + /* make sure offsets don't wrap and the input is inside i_size */ + if (pos_in + count < pos_in || pos_out + count < pos_out || + pos_in + count > i_size_read(inode_in)) + return -EINVAL; + + /* XXX do we want this test? btrfs_ioctl_clone_range() */ + if (S_ISDIR(inode_in->i_mode) || S_ISDIR(inode_out->i_mode)) + return -EISDIR; + + if (!S_ISREG(inode_in->i_mode) || !S_ISREG(inode_out->i_mode)) + return -EINVAL; + + if (inode_in->i_sb != inode_out->i_sb || + file_in->f_path.mnt != file_out->f_path.mnt) + return -EXDEV; + + /* forbid ranges in the same file for now */ + if (inode_in == inode_out) + return -EINVAL; + + ret = mnt_want_write_file(file_out); + if (ret) + return ret; + + ret = file_in->f_op->copy_range(file_in, pos_in, file_out, pos_out, + count); + if (ret > 0) { + fsnotify_access(file_in); + add_rchar(current, ret); + fsnotify_modify(file_out); + add_wchar(current, ret); + } + inc_syscr(current); + inc_syscw(current); + + mnt_drop_write_file(file_out); + + return ret; +} +EXPORT_SYMBOL(vfs_copy_range); + +SYSCALL_DEFINE5(copy_range, int, fd_in, loff_t __user *, upos_in, + int, fd_out, loff_t __user *, upos_out, size_t, count) +{ + loff_t pos_in; + loff_t pos_out; + struct fd f_in; + struct fd f_out; + ssize_t ret; + + if (get_user(pos_in, upos_in) || get_user(pos_out, upos_out)) + return -EFAULT; + + f_in = fdget(fd_in); + f_out = fdget(fd_out); + + if (f_in.file && f_out.file) + ret = vfs_copy_range(f_in.file, pos_in, f_out.file, pos_out, + count); + else + ret = -EBADF; + + fdput(f_in); + fdput(f_out); + + return ret; +} diff --git a/include/linux/fs.h b/include/linux/fs.h index 43db02e..6214893 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -1543,6 +1543,7 @@ struct file_operations { long (*fallocate)(struct file *file, int mode, loff_t offset, loff_t len); int (*show_fdinfo)(struct seq_file *m, struct file *f); + ssize_t (*copy_range)(struct file *, loff_t, struct file *, loff_t, size_t); }; struct inode_operations { @@ -1588,6 +1589,8 @@ extern ssize_t vfs_readv(struct file *, const struct iovec __user *, unsigned long, loff_t *); extern ssize_t vfs_writev(struct file *, const struct iovec __user *, unsigned long, loff_t *); +extern ssize_t vfs_copy_range(struct file *, loff_t , struct file *, loff_t, + size_t); struct super_operations { struct inode *(*alloc_inode)(struct super_block *sb); diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h index 0cc74c4..3935d1c 100644 --- a/include/uapi/asm-generic/unistd.h +++ b/include/uapi/asm-generic/unistd.h @@ -692,9 +692,11 @@ __SC_COMP(__NR_process_vm_writev, sys_process_vm_writev, \ __SYSCALL(__NR_kcmp, sys_kcmp) #define __NR_finit_module 273 __SYSCALL(__NR_finit_module, sys_finit_module) +#define __NR_copy_range 274 +__SYSCALL(__NR_copy_range, sys_copy_range) #undef __NR_syscalls -#define __NR_syscalls 274 +#define __NR_syscalls 275 /* * All syscalls below here should go away really, diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c index 7078052..af7808a 100644 --- a/kernel/sys_ni.c +++ b/kernel/sys_ni.c @@ -151,6 +151,7 @@ cond_syscall(sys_process_vm_readv); cond_syscall(sys_process_vm_writev); cond_syscall(compat_sys_process_vm_readv); cond_syscall(compat_sys_process_vm_writev); +cond_syscall(sys_copy_range); /* arch-specific weak syscall entries */ cond_syscall(sys_pciconfig_read); -- 1.7.11.7 ^ permalink raw reply related [flat|nested] 24+ messages in thread
* Re: [RFC v0 1/4] vfs: add copy_range syscall and vfs entry point 2013-05-14 21:15 ` [RFC v0 1/4] vfs: add copy_range syscall and vfs entry point Zach Brown @ 2013-05-15 19:44 ` Eric Wong 2013-05-15 20:03 ` Zach Brown 0 siblings, 1 reply; 24+ messages in thread From: Eric Wong @ 2013-05-15 19:44 UTC (permalink / raw) To: Zach Brown Cc: Martin K. Petersen, Trond Myklebust, linux-kernel, linux-fsdevel, linux-btrfs, linux-nfs Zach Brown <zab@redhat.com> wrote: > This adds a syscall and vfs entry point for clone_range which offloads > data copying between existing files. > > The syscall is a thin wrapper around the vfs entry point. Its arguments > are inspired by sys_splice(). Why introduce a new syscall instead of extending sys_splice? Perhaps adding a new SPLICE_F_COPYRANGE flag to avoid compatibility problems with userspace which may expect ESPIPE when given two non-pipes would be useful. If the user doesn't need a out offset, then sendfile() should also be able to transparently utilize COPY/CLONE_RANGE, too. ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [RFC v0 1/4] vfs: add copy_range syscall and vfs entry point 2013-05-15 19:44 ` Eric Wong @ 2013-05-15 20:03 ` Zach Brown 2013-05-16 21:16 ` Ric Wheeler 2013-05-21 19:47 ` Eric Wong 0 siblings, 2 replies; 24+ messages in thread From: Zach Brown @ 2013-05-15 20:03 UTC (permalink / raw) To: Eric Wong Cc: Martin K. Petersen, Trond Myklebust, linux-kernel, linux-fsdevel, linux-btrfs, linux-nfs On Wed, May 15, 2013 at 07:44:05PM +0000, Eric Wong wrote: > Why introduce a new syscall instead of extending sys_splice? Personally, I think it's ugly to have different operations use the same syscall just because their arguments match. But that preference aside, sure, if the consensus is that we'd rather use the splice() entry point then I can duck tape the pieces together to make it work. > If the user doesn't need a out offset, then sendfile() should also be > able to transparently utilize COPY/CLONE_RANGE, too. Perhaps, yeah. - z ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [RFC v0 1/4] vfs: add copy_range syscall and vfs entry point 2013-05-15 20:03 ` Zach Brown @ 2013-05-16 21:16 ` Ric Wheeler 2013-05-21 19:47 ` Eric Wong 1 sibling, 0 replies; 24+ messages in thread From: Ric Wheeler @ 2013-05-16 21:16 UTC (permalink / raw) To: Zach Brown Cc: Eric Wong, Martin K. Petersen, Trond Myklebust, linux-kernel, linux-fsdevel, linux-btrfs, linux-nfs On 05/15/2013 04:03 PM, Zach Brown wrote: > On Wed, May 15, 2013 at 07:44:05PM +0000, Eric Wong wrote: >> Why introduce a new syscall instead of extending sys_splice? > Personally, I think it's ugly to have different operations use the same > syscall just because their arguments match. I agree with Zach - having a system call called "splice" do copy offloads is not intuitive. This is a very reasonable name for something that battled its way through several standards bodies (for NFS and SCSI :)), so we should give it a reasonable name thanks! Ric > > But that preference aside, sure, if the consensus is that we'd rather > use the splice() entry point then I can duck tape the pieces together to > make it work. > >> If the user doesn't need a out offset, then sendfile() should also be >> able to transparently utilize COPY/CLONE_RANGE, too. > Perhaps, yeah. > > - z ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [RFC v0 1/4] vfs: add copy_range syscall and vfs entry point 2013-05-15 20:03 ` Zach Brown 2013-05-16 21:16 ` Ric Wheeler @ 2013-05-21 19:47 ` Eric Wong 2013-05-21 19:50 ` Zach Brown 1 sibling, 1 reply; 24+ messages in thread From: Eric Wong @ 2013-05-21 19:47 UTC (permalink / raw) To: Zach Brown Cc: Martin K. Petersen, Trond Myklebust, linux-kernel, linux-fsdevel, linux-btrfs, linux-nfs Zach Brown <zab@redhat.com> wrote: > On Wed, May 15, 2013 at 07:44:05PM +0000, Eric Wong wrote: > > Why introduce a new syscall instead of extending sys_splice? > > Personally, I think it's ugly to have different operations use the same > syscall just because their arguments match. Fair enough. I think adding a (currently unused) flags parameter would make sense for future-proofing. If this were extended to sockets/pipes, peek/nonblock/waitall flags may be added. Basically, something like splice/tee, but without needing to create/manage a pipe in userspace or needing extra syscalls/looping. ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [RFC v0 1/4] vfs: add copy_range syscall and vfs entry point 2013-05-21 19:47 ` Eric Wong @ 2013-05-21 19:50 ` Zach Brown 0 siblings, 0 replies; 24+ messages in thread From: Zach Brown @ 2013-05-21 19:50 UTC (permalink / raw) To: Eric Wong Cc: Martin K. Petersen, Trond Myklebust, linux-kernel, linux-fsdevel, linux-btrfs, linux-nfs On Tue, May 21, 2013 at 07:47:19PM +0000, Eric Wong wrote: > Zach Brown <zab@redhat.com> wrote: > > On Wed, May 15, 2013 at 07:44:05PM +0000, Eric Wong wrote: > > > Why introduce a new syscall instead of extending sys_splice? > > > > Personally, I think it's ugly to have different operations use the same > > syscall just because their arguments match. > > Fair enough. I think adding a (currently unused) flags parameter would > make sense for future-proofing. Yeah, that seems reasonble. - z ^ permalink raw reply [flat|nested] 24+ messages in thread
* [RFC v0 2/4] x86: add sys_copy_range to syscall tables 2013-05-14 21:15 [RFC v0 0/4] sys_copy_range() rough draft Zach Brown 2013-05-14 21:15 ` [RFC v0 1/4] vfs: add copy_range syscall and vfs entry point Zach Brown @ 2013-05-14 21:15 ` Zach Brown 2013-05-14 21:15 ` [RFC v0 3/4] btrfs: add .copy_range file operation Zach Brown [not found] ` <1368566126-17610-1-git-send-email-zab-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 3 siblings, 0 replies; 24+ messages in thread From: Zach Brown @ 2013-05-14 21:15 UTC (permalink / raw) To: Martin K. Petersen, Trond Myklebust, linux-kernel, linux-fsdevel, linux-btrfs, linux-nfs Add sys_copy_range to the x86 syscall tables. Happily, it doesn't require compat helpers. Signed-off-by: Zach Brown <zab@redhat.com> --- arch/x86/syscalls/syscall_32.tbl | 1 + arch/x86/syscalls/syscall_64.tbl | 1 + 2 files changed, 2 insertions(+) diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl index aabfb83..d75d1b5 100644 --- a/arch/x86/syscalls/syscall_32.tbl +++ b/arch/x86/syscalls/syscall_32.tbl @@ -357,3 +357,4 @@ 348 i386 process_vm_writev sys_process_vm_writev compat_sys_process_vm_writev 349 i386 kcmp sys_kcmp 350 i386 finit_module sys_finit_module +351 i386 copy_range sys_copy_range diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl index 38ae65d..dc9d766 100644 --- a/arch/x86/syscalls/syscall_64.tbl +++ b/arch/x86/syscalls/syscall_64.tbl @@ -320,6 +320,7 @@ 311 64 process_vm_writev sys_process_vm_writev 312 common kcmp sys_kcmp 313 common finit_module sys_finit_module +314 common copy_range sys_copy_range # # x32-specific system call numbers start at 512 to avoid cache impact -- 1.7.11.7 ^ permalink raw reply related [flat|nested] 24+ messages in thread
* [RFC v0 3/4] btrfs: add .copy_range file operation 2013-05-14 21:15 [RFC v0 0/4] sys_copy_range() rough draft Zach Brown 2013-05-14 21:15 ` [RFC v0 1/4] vfs: add copy_range syscall and vfs entry point Zach Brown 2013-05-14 21:15 ` [RFC v0 2/4] x86: add sys_copy_range to syscall tables Zach Brown @ 2013-05-14 21:15 ` Zach Brown [not found] ` <1368566126-17610-1-git-send-email-zab-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 3 siblings, 0 replies; 24+ messages in thread From: Zach Brown @ 2013-05-14 21:15 UTC (permalink / raw) To: Martin K. Petersen, Trond Myklebust, linux-kernel, linux-fsdevel, linux-btrfs, linux-nfs This rearranges the existing COPY_RANGE ioctl implementation so that the .copy_range file operation can call the core loop that copies file data extent items. The extent copying loop is lifted up into its own function. It retains the core btrfs error checks that should be shared between the CLONE_RANGE ioctl and copy_range syscall. Signed-off-by: Zach Brown <zab@redhat.com> --- fs/btrfs/ctree.h | 3 ++ fs/btrfs/file.c | 1 + fs/btrfs/ioctl.c | 122 +++++++++++++++++++++++++++++++++---------------------- 3 files changed, 77 insertions(+), 49 deletions(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 63c328a..bf9555c 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -3612,6 +3612,9 @@ int btrfs_dirty_pages(struct btrfs_root *root, struct inode *inode, struct page **pages, size_t num_pages, loff_t pos, size_t write_bytes, struct extent_state **cached); +ssize_t btrfs_copy_range(struct file *file_in, loff_t pos_in, + struct file *file_out, loff_t pos_out, + size_t count); /* tree-defrag.c */ int btrfs_defrag_leaves(struct btrfs_trans_handle *trans, diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c index 4205ba7..d75cc07 100644 --- a/fs/btrfs/file.c +++ b/fs/btrfs/file.c @@ -2460,6 +2460,7 @@ const struct file_operations btrfs_file_operations = { #ifdef CONFIG_COMPAT .compat_ioctl = btrfs_ioctl, #endif + .copy_range = btrfs_copy_range, }; void btrfs_auto_defrag_exit(void) diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c index 0de4a2f..ac035d8 100644 --- a/fs/btrfs/ioctl.c +++ b/fs/btrfs/ioctl.c @@ -2463,13 +2463,10 @@ out: return ret; } -static noinline long btrfs_ioctl_clone(struct file *file, unsigned long srcfd, - u64 off, u64 olen, u64 destoff) +static noinline int btrfs_clone_extents(struct inode *inode, struct inode *src, + u64 off, u64 olen, u64 destoff) { - struct inode *inode = file_inode(file); struct btrfs_root *root = BTRFS_I(inode)->root; - struct fd src_file; - struct inode *src; struct btrfs_trans_handle *trans; struct btrfs_path *path; struct extent_buffer *leaf; @@ -2491,59 +2488,22 @@ static noinline long btrfs_ioctl_clone(struct file *file, unsigned long srcfd, * they don't overlap)? */ - /* the destination must be opened for writing */ - if (!(file->f_mode & FMODE_WRITE) || (file->f_flags & O_APPEND)) - return -EINVAL; - if (btrfs_root_readonly(root)) return -EROFS; - ret = mnt_want_write_file(file); - if (ret) - return ret; - - src_file = fdget(srcfd); - if (!src_file.file) { - ret = -EBADF; - goto out_drop_write; - } - - ret = -EXDEV; - if (src_file.file->f_path.mnt != file->f_path.mnt) - goto out_fput; - - src = file_inode(src_file.file); - - ret = -EINVAL; - if (src == inode) - goto out_fput; - - /* the src must be open for reading */ - if (!(src_file.file->f_mode & FMODE_READ)) - goto out_fput; - /* don't make the dst file partly checksummed */ if ((BTRFS_I(src)->flags & BTRFS_INODE_NODATASUM) != (BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM)) - goto out_fput; - - ret = -EISDIR; - if (S_ISDIR(src->i_mode) || S_ISDIR(inode->i_mode)) - goto out_fput; - - ret = -EXDEV; - if (src->i_sb != inode->i_sb) - goto out_fput; + return -EINVAL; - ret = -ENOMEM; buf = vmalloc(btrfs_level_size(root, 0)); if (!buf) - goto out_fput; + return -ENOMEM; path = btrfs_alloc_path(); if (!path) { vfree(buf); - goto out_fput; + return -ENOMEM; } path->reada = 2; @@ -2555,10 +2515,7 @@ static noinline long btrfs_ioctl_clone(struct file *file, unsigned long srcfd, mutex_lock_nested(&inode->i_mutex, I_MUTEX_CHILD); } - /* determine range to clone */ - ret = -EINVAL; - if (off + len > src->i_size || off + len < off) - goto out_unlock; + /* CLONE_RANGE can have len == 0, copy_range won't */ if (len == 0) olen = len = src->i_size - off; /* if we extend to eof, continue to block boundary */ @@ -2566,6 +2523,7 @@ static noinline long btrfs_ioctl_clone(struct file *file, unsigned long srcfd, len = ALIGN(src->i_size, bs) - off; /* verify the end result is block aligned */ + ret = -EINVAL; if (!IS_ALIGNED(off, bs) || !IS_ALIGNED(off + len, bs) || !IS_ALIGNED(destoff, bs)) goto out_unlock; @@ -2849,6 +2807,72 @@ out_unlock: mutex_unlock(&inode->i_mutex); vfree(buf); btrfs_free_path(path); + return ret; +} + +ssize_t btrfs_copy_range(struct file *file_in, loff_t pos_in, + struct file *file_out, loff_t pos_out, + size_t count) +{ + ssize_t ret; + + ret = btrfs_clone_extents(file_inode(file_out), file_inode(file_in), + pos_in, count, pos_out); + if (ret == 0) + ret = count; + return ret; +} + +static noinline long btrfs_ioctl_clone(struct file *file, unsigned long srcfd, + u64 off, u64 len, u64 destoff) +{ + struct inode *inode = file_inode(file); + struct fd src_file; + struct inode *src; + long ret; + + /* the destination must be opened for writing */ + if (!(file->f_mode & FMODE_WRITE) || (file->f_flags & O_APPEND)) + return -EINVAL; + + ret = mnt_want_write_file(file); + if (ret) + return ret; + + src_file = fdget(srcfd); + if (!src_file.file) { + ret = -EBADF; + goto out_drop_write; + } + + ret = -EXDEV; + if (src_file.file->f_path.mnt != file->f_path.mnt) + goto out_fput; + + src = file_inode(src_file.file); + + ret = -EINVAL; + if (src == inode) + goto out_fput; + + /* the src must be open for reading */ + if (!(src_file.file->f_mode & FMODE_READ)) + goto out_fput; + + ret = -EISDIR; + if (S_ISDIR(src->i_mode) || S_ISDIR(inode->i_mode)) + goto out_fput; + + ret = -EXDEV; + if (src->i_sb != inode->i_sb) + goto out_fput; + + ret = -EINVAL; + if (off + len > src->i_size || off + len < off) + goto out_fput; + + ret = btrfs_clone_extents(inode, src, off, len, destoff); + out_fput: fdput(src_file); out_drop_write: -- 1.7.11.7 ^ permalink raw reply related [flat|nested] 24+ messages in thread
[parent not found: <1368566126-17610-1-git-send-email-zab-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>]
* [RFC v0 4/4] nfs, nfsd: rough sys_copy_range and COPY support [not found] ` <1368566126-17610-1-git-send-email-zab-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> @ 2013-05-14 21:15 ` Zach Brown [not found] ` <1368566126-17610-5-git-send-email-zab-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 2013-05-14 21:42 ` [RFC v0 0/4] sys_copy_range() rough draft Dave Chinner 1 sibling, 1 reply; 24+ messages in thread From: Zach Brown @ 2013-05-14 21:15 UTC (permalink / raw) To: Martin K. Petersen, Trond Myklebust, linux-kernel-u79uwXL29TY76Z2rM5mHXA, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, linux-btrfs-u79uwXL29TY76Z2rM5mHXA, linux-nfs-u79uwXL29TY76Z2rM5mHXA This crude patch illustrates the simplest plumbing involved in supporting sys_call_range with the NFS COPY operation that's pending in the 4.2 draft spec. The patch is based on a previous prototype that used the COPY op to implement sys_copyfileat which created a new file (based on the ocfs2 reflink ioctl). By contrast, this copies file contents between existing files. There's still a lot of implementation and testing to do, but this can get discussion going. --- fs/nfs/file.c | 25 +++++++++ fs/nfs/nfs4proc.c | 72 ++++++++++++++++++++++++++ fs/nfs/nfs4xdr.c | 132 ++++++++++++++++++++++++++++++++++++++++++++++++ fs/nfsd/nfs4proc.c | 35 +++++++++++++ fs/nfsd/nfs4xdr.c | 43 ++++++++++++++++ fs/nfsd/vfs.c | 41 +++++++++++++++ fs/nfsd/vfs.h | 3 ++ fs/nfsd/xdr4.h | 21 ++++++++ include/linux/nfs4.h | 6 ++- include/linux/nfs_xdr.h | 24 +++++++++ 10 files changed, 401 insertions(+), 1 deletion(-) diff --git a/fs/nfs/file.c b/fs/nfs/file.c index a87a44f..7d7bedf 100644 --- a/fs/nfs/file.c +++ b/fs/nfs/file.c @@ -917,6 +917,30 @@ int nfs_setlease(struct file *file, long arg, struct file_lock **fl) } EXPORT_SYMBOL_GPL(nfs_setlease); +ssize_t nfs_copy_range(struct file *file_in, loff_t pos_in, + struct file *file_out, loff_t pos_out, + size_t count) +{ + struct dentry *dentry_in = file_in->f_path.dentry; + struct dentry *dentry_out = file_out->f_path.dentry; + struct inode *inode_in = dentry_in->d_inode; + struct inode *inode_out = dentry_out->d_inode; + loff_t ret; + + dprintk("NFS copy_range(%s/%s@%llu, %s/%s@%llu, %zd)\n", + dentry_in->d_parent->d_name.name, dentry_in->d_name.name, + (unsigned long long)pos_in, + dentry_out->d_parent->d_name.name, dentry_out->d_name.name, + (unsigned long long)pos_out, count); + + if (NFS_PROTO(inode_in)->copy == NULL) + ret = -EOPNOTSUPP; + else + ret = NFS_PROTO(inode_in)->copy(inode_in, inode_out, NULL, + 0, count, pos_in, pos_out); + return ret; +} + const struct file_operations nfs_file_operations = { .llseek = nfs_file_llseek, .read = do_sync_read, @@ -934,5 +958,6 @@ const struct file_operations nfs_file_operations = { .splice_write = nfs_file_splice_write, .check_flags = nfs_check_flags, .setlease = nfs_setlease, + .copy_range = nfs_copy_range, }; EXPORT_SYMBOL_GPL(nfs_file_operations); diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c index 8fbc100..1586b3e 100644 --- a/fs/nfs/nfs4proc.c +++ b/fs/nfs/nfs4proc.c @@ -5405,6 +5405,75 @@ int nfs4_proc_secinfo(struct inode *dir, const struct qstr *name, } #ifdef CONFIG_NFS_V4_1 +static loff_t _nfs4_proc_copy(struct inode *inode, + struct inode *dir, + struct qstr *name, + int flags, + loff_t nbyte, + loff_t src_offset, + loff_t dst_offset) +{ + struct nfs_server *server = NFS_SERVER(inode); + int status; + struct nfs_copy_args arg = { + .fh = NFS_FH(inode), + .dir_fh = NFS_FH(dir), + .src_offset = src_offset, + .dst_offset = dst_offset, + .count = nbyte, + .flags = flags, + .destination = name, + .bitmask = server->attr_bitmask, + }; + struct nfs_copy_res res = { + .fh = NFS_FH(inode), + .callback_id_length = 0, + .callback_id = 0, + .bytes_copied = 0, + .server = server, + }; + struct rpc_message msg = { + .rpc_proc = &nfs4_procedures[NFSPROC4_CLNT_COPY], + .rpc_argp = &arg, + .rpc_resp = &res, + }; + + res.fattr = nfs_alloc_fattr(); + if (res.fattr == NULL) + return -ENOMEM; + + status = nfs4_call_sync(server->client, server, &msg, &arg.seq_args, + &res.seq_res, 1); + if (res.bytes_copied) + status = res.bytes_copied; + + nfs_free_fattr(res.fattr); + return status; +} + +static loff_t nfs4_proc_copy(struct inode *inode, + struct inode *dir, + struct qstr *name, + int flags, + loff_t nbyte, + loff_t src_offset, + loff_t dst_offset) +{ + struct nfs4_exception exception = {0, }; + loff_t ret; + + do { + ret = _nfs4_proc_copy(inode, dir, name, flags, nbyte, + src_offset, dst_offset); + if (ret < 0) + ret = nfs4_handle_exception(NFS_SERVER(inode), ret, + &exception); + } while (exception.retry); + + return ret; +} + + /* * Check the exchange flags returned by the server for invalid flags, having * both PNFS and NON_PNFS flags set, and not having one of NON_PNFS, PNFS, or @@ -7097,6 +7166,9 @@ const struct nfs_rpc_ops nfs_v4_clientops = { .free_client = nfs4_free_client, .create_server = nfs4_create_server, .clone_server = nfs_clone_server, +#ifdef CONFIG_NFS_V4_1 + .copy = nfs4_proc_copy, +#endif }; static const struct xattr_handler nfs4_xattr_nfs4_acl_handler = { diff --git a/fs/nfs/nfs4xdr.c b/fs/nfs/nfs4xdr.c index 4be8d13..28598b0 100644 --- a/fs/nfs/nfs4xdr.c +++ b/fs/nfs/nfs4xdr.c @@ -397,6 +397,8 @@ static int nfs4_stat_to_errno(int); #define encode_free_stateid_maxsz (op_encode_hdr_maxsz + 1 + \ XDR_QUADLEN(NFS4_STATEID_SIZE)) #define decode_free_stateid_maxsz (op_decode_hdr_maxsz + 1) +#define encode_copy_maxsz (op_encode_hdr_maxsz + 8 + nfs4_name_maxsz) +#define decode_copy_maxsz (op_decode_hdr_maxsz + 1 + decode_stateid_maxsz) #else /* CONFIG_NFS_V4_1 */ #define encode_sequence_maxsz 0 #define decode_sequence_maxsz 0 @@ -840,6 +842,22 @@ static int nfs4_stat_to_errno(int); #define NFS4_dec_free_stateid_sz (compound_decode_hdr_maxsz + \ decode_sequence_maxsz + \ decode_free_stateid_maxsz) +#define NFS4_enc_copy_sz (compound_encode_hdr_maxsz + \ + encode_sequence_maxsz + \ + encode_putfh_maxsz + \ + encode_savefh_maxsz + \ + encode_putfh_maxsz + \ + encode_copy_maxsz + \ + encode_getfh_maxsz + \ + encode_getattr_maxsz) +#define NFS4_dec_copy_sz (compound_decode_hdr_maxsz + \ + decode_sequence_maxsz + \ + decode_putfh_maxsz + \ + decode_savefh_maxsz + \ + decode_putfh_maxsz + \ + decode_copy_maxsz + \ + decode_getfh_maxsz + \ + decode_getattr_maxsz) const u32 nfs41_maxwrite_overhead = ((RPC_MAX_HEADER_WITH_AUTH + compound_encode_hdr_maxsz + @@ -1817,6 +1835,23 @@ static void encode_reclaim_complete(struct xdr_stream *xdr, encode_op_hdr(xdr, OP_RECLAIM_COMPLETE, decode_reclaim_complete_maxsz, hdr); encode_uint32(xdr, args->one_fs); } + +static void encode_copy(struct xdr_stream *xdr, + const struct nfs_copy_args *args, + struct compound_hdr *hdr) +{ + __be32 *p; + + p = reserve_space(xdr, 36 + args->destination->len); + *p++ = cpu_to_be32(OP_COPY); + p = xdr_encode_hyper(p, args->src_offset); + p = xdr_encode_hyper(p, args->dst_offset); + p = xdr_encode_hyper(p, args->count); + *p++ = cpu_to_be32(args->flags); + xdr_encode_opaque(p, args->destination->name, args->destination->len); + hdr->nops++; + hdr->replen += decode_copy_maxsz; +} #endif /* CONFIG_NFS_V4_1 */ static void encode_sequence(struct xdr_stream *xdr, @@ -2761,6 +2796,30 @@ static void nfs4_xdr_enc_sequence(struct rpc_rqst *req, struct xdr_stream *xdr, } /* + * Encode a COPY request + */ +static int nfs4_xdr_enc_copy(struct rpc_rqst *req, __be32 *p, + struct nfs_copy_args *args) +{ + struct xdr_stream xdr; + struct compound_hdr hdr = { + .minorversion = nfs4_xdr_minorversion(&args->seq_args), + }; + + xdr_init_encode(&xdr, &req->rq_snd_buf, p); + encode_compound_hdr(&xdr, req, &hdr); + encode_sequence(&xdr, &args->seq_args, &hdr); + encode_putfh(&xdr, args->fh, &hdr); + encode_savefh(&xdr, &hdr); + encode_putfh(&xdr, args->dir_fh, &hdr); + encode_copy(&xdr, args, &hdr); + encode_getfh(&xdr, &hdr); + encode_getfattr(&xdr, args->bitmask, &hdr); + encode_nops(&hdr); + return 0; +} + +/* * a GET_LEASE_TIME request */ static void nfs4_xdr_enc_get_lease_time(struct rpc_rqst *req, @@ -4688,6 +4747,41 @@ static int decode_link(struct xdr_stream *xdr, struct nfs4_change_info *cinfo) return decode_change_info(xdr, cinfo); } +#if defined(CONFIG_NFS_V4_1) +static int decode_copy(struct xdr_stream *xdr, struct nfs_copy_res *res) +{ + __be32 *p; + int status; + + status = decode_op_hdr(xdr, OP_COPY); + if (status) + return status; + + if (status == 0) { + p = xdr_inline_decode(xdr, 4); + if (unlikely(!p)) + goto out_overflow; + res->callback_id_length = be32_to_cpup(p); + if (res->callback_id_length == 1) { + status = decode_stateid(xdr, res->callback_id); + if (unlikely(status)) + return status; + } else if (res->callback_id_length != 0) + return -EIO; + } else { + p = xdr_inline_decode(xdr, 8); + if (unlikely(!p)) + goto out_overflow; + p = xdr_decode_hyper(p, &res->bytes_copied); + } + + return 0; +out_overflow: + print_overflow_msg(__func__, xdr); + return -EIO; +} +#endif /* CONFIG_NFS_V4_1 */ + /* * We create the owner, so we know a proper owner.id length is 4. */ @@ -7047,6 +7141,43 @@ static int nfs4_xdr_dec_free_stateid(struct rpc_rqst *rqstp, out: return status; } + +/* + * Decode COPY response + */ +static int nfs4_xdr_dec_copy(struct rpc_rqst *rqstp, __be32 *p, + struct nfs_copy_res *res) +{ + struct xdr_stream xdr; + struct compound_hdr hdr; + int status; + + xdr_init_decode(&xdr, &rqstp->rq_rcv_buf, p); + status = decode_compound_hdr(&xdr, &hdr); + if (status) + goto out; + status = decode_sequence(&xdr, &res->seq_res, rqstp); + if (status) + goto out; + status = decode_putfh(&xdr); + if (status) + goto out; + status = decode_savefh(&xdr); + if (status != 0) + goto out; + status = decode_putfh(&xdr); + if (status != 0) + goto out; + status = decode_copy(&xdr, res); + if (status) + goto out; + status = decode_getfh(&xdr, res->fh); + if (status != 0) + goto out; + decode_getfattr(&xdr, res->fattr, res->server); +out: + return status; +} #endif /* CONFIG_NFS_V4_1 */ /** @@ -7257,6 +7388,7 @@ struct rpc_procinfo nfs4_procedures[] = { PROC(BIND_CONN_TO_SESSION, enc_bind_conn_to_session, dec_bind_conn_to_session), PROC(DESTROY_CLIENTID, enc_destroy_clientid, dec_destroy_clientid), + PROC(COPY, enc_copy, dec_copy), #endif /* CONFIG_NFS_V4_1 */ }; diff --git a/fs/nfsd/nfs4proc.c b/fs/nfsd/nfs4proc.c index 27d74a2..2f62ebb 100644 --- a/fs/nfsd/nfs4proc.c +++ b/fs/nfsd/nfs4proc.c @@ -986,6 +986,37 @@ nfsd4_write(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate, return status; } +/* + * XXX: + * - do something with stateids :) + * - implement callback results and OFFLOAD_ABORT + * - inter-server copies? + */ +static __be32 +nfsd4_copy(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate, + struct nfsd4_copy *copy) +{ + __be32 status; + + /* only support copying data to an existing file */ + if (copy->ca_destinationlen) + return nfserr_inval; + + if (!cstate->current_fh.fh_dentry || !cstate->save_fh.fh_dentry) + return nfserr_nofilehandle; + + status = nfsd_copy_range(rqstp, &cstate->save_fh, copy->ca_src_offset, + &cstate->current_fh, copy->ca_dst_offset, + copy->ca_count); + if (status == nfs_ok) + copy->u.cr_bytes_copied = copy->ca_count; + + /* don't support async callbacks yet */ + copy->u.ok.cr_callback_id_length = 0; + + return status; +} + /* This routine never returns NFS_OK! If there are no other errors, it * will return NFSERR_SAME or NFSERR_NOT_SAME depending on whether the * attributes matched. VERIFY is implemented by mapping NFSERR_SAME @@ -1798,6 +1829,10 @@ static struct nfsd4_operation nfsd4_ops[] = { .op_get_currentstateid = (stateid_getter)nfsd4_get_freestateid, .op_rsize_bop = (nfsd4op_rsize)nfsd4_only_status_rsize, }, + [OP_COPY] = { + .op_func = (nfsd4op_func)nfsd4_copy, + .op_name = "OP_COPY", + }, }; #ifdef NFSD_DEBUG diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c index 6cd86e0..d2978e9 100644 --- a/fs/nfsd/nfs4xdr.c +++ b/fs/nfsd/nfs4xdr.c @@ -1445,6 +1445,26 @@ static __be32 nfsd4_decode_reclaim_complete(struct nfsd4_compoundargs *argp, str } static __be32 +nfsd4_decode_copy(struct nfsd4_compoundargs *argp, struct nfsd4_copy *copy) +{ + DECODE_HEAD; + + READ_BUF(32); + READ64(copy->ca_src_offset); + READ64(copy->ca_dst_offset); + READ64(copy->ca_count); + READ32(copy->ca_flags); + READ32(copy->ca_destinationlen); + READ_BUF(copy->ca_destinationlen); + SAVEMEM(copy->ca_destination, copy->ca_destinationlen); + if ((status = check_filename(copy->ca_destination, + copy->ca_destinationlen))) + return status; + + DECODE_TAIL; +} + +static __be32 nfsd4_decode_noop(struct nfsd4_compoundargs *argp, void *p) { return nfs_ok; @@ -1557,6 +1577,7 @@ static nfsd4_dec nfsd41_dec_ops[] = { [OP_WANT_DELEGATION] = (nfsd4_dec)nfsd4_decode_notsupp, [OP_DESTROY_CLIENTID] = (nfsd4_dec)nfsd4_decode_destroy_clientid, [OP_RECLAIM_COMPLETE] = (nfsd4_dec)nfsd4_decode_reclaim_complete, + [OP_COPY] = (nfsd4_dec)nfsd4_decode_copy, }; struct nfsd4_minorversion_ops { @@ -3394,6 +3415,27 @@ nfsd4_encode_test_stateid(struct nfsd4_compoundres *resp, __be32 nfserr, } static __be32 +nfsd4_encode_copy(struct nfsd4_compoundres *resp, __be32 nfserr, + struct nfsd4_copy *copy) +{ + __be32 *p; + + if (!nfserr) { + RESERVE_SPACE(4); + WRITE32(copy->u.ok.cr_callback_id_length); + ADJUST_ARGS(); + if (copy->u.ok.cr_callback_id_length == 1) + nfsd4_encode_stateid(resp, copy->u.ok.cr_callback_id); + } else { + RESERVE_SPACE(8); + WRITE64(copy->u.cr_bytes_copied); + ADJUST_ARGS(); + } + + return nfserr; +} + +static __be32 nfsd4_encode_noop(struct nfsd4_compoundres *resp, __be32 nfserr, void *p) { return nfserr; @@ -3465,6 +3507,7 @@ static nfsd4_enc nfsd4_enc_ops[] = { [OP_WANT_DELEGATION] = (nfsd4_enc)nfsd4_encode_noop, [OP_DESTROY_CLIENTID] = (nfsd4_enc)nfsd4_encode_noop, [OP_RECLAIM_COMPLETE] = (nfsd4_enc)nfsd4_encode_noop, + [OP_COPY] = (nfsd4_enc)nfsd4_encode_copy, }; /* diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c index 84ce601..0c1b427 100644 --- a/fs/nfsd/vfs.c +++ b/fs/nfsd/vfs.c @@ -28,6 +28,8 @@ #include <asm/uaccess.h> #include <linux/exportfs.h> #include <linux/writeback.h> +#include <linux/fs_struct.h> +#include <linux/kmod.h> #ifdef CONFIG_NFSD_V3 #include "xdr3.h" @@ -621,6 +623,45 @@ int nfsd4_is_junction(struct dentry *dentry) return 0; return 1; } + +__be32 +nfsd_copy_range(struct svc_rqst *rqstp, struct svc_fh *fhp_in, u64 pos_in, + struct svc_fh *fhp_out, u64 pos_out, u64 count) +{ + struct file *filp_in = NULL; + struct file *filp_out = NULL; + int err; + + /* XXX verify pos and count within sane limits? */ + + err = nfsd_open(rqstp, fhp_in, S_IFREG, NFSD_MAY_READ, &filp_in); + if (err) + goto out; + + err = nfsd_open(rqstp, fhp_out, S_IFREG, NFSD_MAY_WRITE, &filp_out); + if (err) + goto out; + + err = vfs_copy_range(filp_in, pos_in, filp_out, pos_out, count); + /* fall back if .copy_range isn't supported */ + + if (!err && EX_ISSYNC(fhp_out->fh_export)) + err = vfs_fsync_range(filp_out, pos_out, pos_out + count-1, 0); + +out: + if (filp_in) + nfsd_close(filp_in); + if (filp_out) + nfsd_close(filp_out); + + if (err < 0) + err = nfserrno(err); + else + err = 0; + + return err; +} + #endif /* defined(CONFIG_NFSD_V4) */ #ifdef CONFIG_NFSD_V3 diff --git a/fs/nfsd/vfs.h b/fs/nfsd/vfs.h index 5b58941..bbc9483 100644 --- a/fs/nfsd/vfs.h +++ b/fs/nfsd/vfs.h @@ -85,6 +85,9 @@ __be32 nfsd_symlink(struct svc_rqst *, struct svc_fh *, struct svc_fh *res, struct iattr *); __be32 nfsd_link(struct svc_rqst *, struct svc_fh *, char *, int, struct svc_fh *); +__be32 nfsd_copy_range(struct svc_rqst *, + struct svc_fh *, u64, + struct svc_fh *, u64, u64); __be32 nfsd_rename(struct svc_rqst *, struct svc_fh *, char *, int, struct svc_fh *, char *, int); diff --git a/fs/nfsd/xdr4.h b/fs/nfsd/xdr4.h index 3b271d2..95fd1c3 100644 --- a/fs/nfsd/xdr4.h +++ b/fs/nfsd/xdr4.h @@ -426,6 +426,26 @@ struct nfsd4_reclaim_complete { u32 rca_one_fs; }; +struct nfsd4_copy { + /* request */ + u64 ca_src_offset; + u64 ca_dst_offset; + u64 ca_count; + u32 ca_flags; + u32 ca_destinationlen; + char * ca_destination; + + /* response */ + union { + struct { + u32 cr_callback_id_length; + stateid_t * cr_callback_id; + } ok; + u64 cr_bytes_copied; + } u; + +}; + struct nfsd4_op { int opnum; __be32 status; @@ -471,6 +491,7 @@ struct nfsd4_op { struct nfsd4_reclaim_complete reclaim_complete; struct nfsd4_test_stateid test_stateid; struct nfsd4_free_stateid free_stateid; + struct nfsd4_copy copy; } u; struct nfs4_replay * replay; }; diff --git a/include/linux/nfs4.h b/include/linux/nfs4.h index 7b8fc73..6be484e 100644 --- a/include/linux/nfs4.h +++ b/include/linux/nfs4.h @@ -100,6 +100,7 @@ enum nfs_opnum4 { OP_WANT_DELEGATION = 56, OP_DESTROY_CLIENTID = 57, OP_RECLAIM_COMPLETE = 58, + OP_COPY = 59, OP_ILLEGAL = 10044, }; @@ -108,7 +109,7 @@ enum nfs_opnum4 { Needs to be updated if more operations are defined in future.*/ #define FIRST_NFS4_OP OP_ACCESS -#define LAST_NFS4_OP OP_RECLAIM_COMPLETE +#define LAST_NFS4_OP OP_COPY enum nfsstat4 { NFS4_OK = 0, @@ -456,6 +457,9 @@ enum { NFSPROC4_CLNT_GETDEVICELIST, NFSPROC4_CLNT_BIND_CONN_TO_SESSION, NFSPROC4_CLNT_DESTROY_CLIENTID, + + /* nfs42 */ + NFSPROC4_CLNT_COPY, }; /* nfs41 types */ diff --git a/include/linux/nfs_xdr.h b/include/linux/nfs_xdr.h index 104b62f..2256e31 100644 --- a/include/linux/nfs_xdr.h +++ b/include/linux/nfs_xdr.h @@ -1184,6 +1184,28 @@ struct nfs41_free_stateid_res { unsigned int status; }; +struct nfs_copy_args { + struct nfs_fh *fh; + struct nfs_fh *dir_fh; + u32 *bitmask; + __u64 src_offset; + __u64 dst_offset; + __u64 count; + __u32 flags; + const struct qstr *destination; + struct nfs4_sequence_args seq_args; +}; + +struct nfs_copy_res { + struct nfs_fh *fh; + struct nfs_fattr *fattr; + __u32 callback_id_length; + nfs4_stateid *callback_id; + __u64 bytes_copied; + const struct nfs_server *server; + struct nfs4_sequence_res seq_res; +}; + #else struct pnfs_ds_commit_info { @@ -1433,6 +1455,8 @@ struct nfs_rpc_ops { struct nfs_server *(*create_server)(struct nfs_mount_info *, struct nfs_subversion *); struct nfs_server *(*clone_server)(struct nfs_server *, struct nfs_fh *, struct nfs_fattr *, rpc_authflavor_t); + loff_t (*copy) (struct inode *, struct inode *, struct qstr *, + int, loff_t, loff_t, loff_t); }; /* -- 1.7.11.7 -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply related [flat|nested] 24+ messages in thread
[parent not found: <1368566126-17610-5-git-send-email-zab-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>]
* Re: [RFC v0 4/4] nfs, nfsd: rough sys_copy_range and COPY support [not found] ` <1368566126-17610-5-git-send-email-zab-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> @ 2013-05-15 20:19 ` J. Bruce Fields [not found] ` <20130515201949.GD25994-uC3wQj2KruNg9hUCZPvPmw@public.gmane.org> 0 siblings, 1 reply; 24+ messages in thread From: J. Bruce Fields @ 2013-05-15 20:19 UTC (permalink / raw) To: Zach Brown Cc: Martin K. Petersen, Trond Myklebust, linux-kernel-u79uwXL29TY76Z2rM5mHXA, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, linux-btrfs-u79uwXL29TY76Z2rM5mHXA, linux-nfs-u79uwXL29TY76Z2rM5mHXA On Tue, May 14, 2013 at 02:15:26PM -0700, Zach Brown wrote: > This crude patch illustrates the simplest plumbing involved in > supporting sys_call_range with the NFS COPY operation that's pending in > the 4.2 draft spec. > > The patch is based on a previous prototype that used the COPY op to > implement sys_copyfileat which created a new file (based on the ocfs2 > reflink ioctl). By contrast, this copies file contents between existing > files. > > There's still a lot of implementation and testing to do, but this can > get discussion going. I'm using: git://github.com/loghyr/NFSv4.2 as my reference for the draft protocol. On a quick skim, one thing this is missing before it complies is a client implementation of CB_OFFLOAD: "If a client desires an intra-server file copy, then it MUST support the COPY and CB_OFFLOAD operations." The server doesn't have to implement CB_OFFLOAD, though, so we should ditch these todo's: > +/* > + * XXX: > + * - do something with stateids :) > + * - implement callback results and OFFLOAD_ABORT > + * - inter-server copies? > + */ ... > + /* don't support async callbacks yet */ ... lest someone go try to implement them for no reason. (Stranger things have happened.) Nits, possibly to ignore for now: > + copy->u.ok.cr_callback_id_length = 0; > + > + return status; > +} > + > /* This routine never returns NFS_OK! If there are no other errors, it > * will return NFSERR_SAME or NFSERR_NOT_SAME depending on whether the > * attributes matched. VERIFY is implemented by mapping NFSERR_SAME > @@ -1798,6 +1829,10 @@ static struct nfsd4_operation nfsd4_ops[] = { > .op_get_currentstateid = (stateid_getter)nfsd4_get_freestateid, > .op_rsize_bop = (nfsd4op_rsize)nfsd4_only_status_rsize, > }, > + [OP_COPY] = { > + .op_func = (nfsd4op_func)nfsd4_copy, > + .op_name = "OP_COPY", > + }, There's some more boilerplate to fill in (see other ops). > +static __be32 > nfsd4_decode_noop(struct nfsd4_compoundargs *argp, void *p) > { > return nfs_ok; > @@ -1557,6 +1577,7 @@ static nfsd4_dec nfsd41_dec_ops[] = { > [OP_WANT_DELEGATION] = (nfsd4_dec)nfsd4_decode_notsupp, > [OP_DESTROY_CLIENTID] = (nfsd4_dec)nfsd4_decode_destroy_clientid, > [OP_RECLAIM_COMPLETE] = (nfsd4_dec)nfsd4_decode_reclaim_complete, > + [OP_COPY] = (nfsd4_dec)nfsd4_decode_copy, And this should be made 4.2-specific. > }; > > struct nfsd4_minorversion_ops { > @@ -3394,6 +3415,27 @@ nfsd4_encode_test_stateid(struct nfsd4_compoundres *resp, __be32 nfserr, > } > > static __be32 > +nfsd4_encode_copy(struct nfsd4_compoundres *resp, __be32 nfserr, > + struct nfsd4_copy *copy) > +{ > + __be32 *p; > + > + if (!nfserr) { > + RESERVE_SPACE(4); > + WRITE32(copy->u.ok.cr_callback_id_length); > + ADJUST_ARGS(); > + if (copy->u.ok.cr_callback_id_length == 1) > + nfsd4_encode_stateid(resp, copy->u.ok.cr_callback_id); > + } else { > + RESERVE_SPACE(8); > + WRITE64(copy->u.cr_bytes_copied); > + ADJUST_ARGS(); > + } > + > + return nfserr; > +} > + > +static __be32 > nfsd4_encode_noop(struct nfsd4_compoundres *resp, __be32 nfserr, void *p) > { > return nfserr; > @@ -3465,6 +3507,7 @@ static nfsd4_enc nfsd4_enc_ops[] = { > [OP_WANT_DELEGATION] = (nfsd4_enc)nfsd4_encode_noop, > [OP_DESTROY_CLIENTID] = (nfsd4_enc)nfsd4_encode_noop, > [OP_RECLAIM_COMPLETE] = (nfsd4_enc)nfsd4_encode_noop, > + [OP_COPY] = (nfsd4_enc)nfsd4_encode_copy, > }; > > /* > diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c > index 84ce601..0c1b427 100644 > --- a/fs/nfsd/vfs.c > +++ b/fs/nfsd/vfs.c > @@ -28,6 +28,8 @@ > #include <asm/uaccess.h> > #include <linux/exportfs.h> > #include <linux/writeback.h> > +#include <linux/fs_struct.h> > +#include <linux/kmod.h> > > #ifdef CONFIG_NFSD_V3 > #include "xdr3.h" > @@ -621,6 +623,45 @@ int nfsd4_is_junction(struct dentry *dentry) > return 0; > return 1; > } > + > +__be32 > +nfsd_copy_range(struct svc_rqst *rqstp, struct svc_fh *fhp_in, u64 pos_in, > + struct svc_fh *fhp_out, u64 pos_out, u64 count) > +{ > + struct file *filp_in = NULL; > + struct file *filp_out = NULL; > + int err; > + > + /* XXX verify pos and count within sane limits? */ > + > + err = nfsd_open(rqstp, fhp_in, S_IFREG, NFSD_MAY_READ, &filp_in); > + if (err) > + goto out; > + > + err = nfsd_open(rqstp, fhp_out, S_IFREG, NFSD_MAY_WRITE, &filp_out); > + if (err) > + goto out; Looking at the xdr... the COPY operation takes stateid's, which nfsd can use to look up files, so the opens shouldn't be required. --b. > + > + err = vfs_copy_range(filp_in, pos_in, filp_out, pos_out, count); > + /* fall back if .copy_range isn't supported */ > + > + if (!err && EX_ISSYNC(fhp_out->fh_export)) > + err = vfs_fsync_range(filp_out, pos_out, pos_out + count-1, 0); > + > +out: > + if (filp_in) > + nfsd_close(filp_in); > + if (filp_out) > + nfsd_close(filp_out); > + > + if (err < 0) > + err = nfserrno(err); > + else > + err = 0; > + > + return err; > +} > + > #endif /* defined(CONFIG_NFSD_V4) */ > > #ifdef CONFIG_NFSD_V3 > diff --git a/fs/nfsd/vfs.h b/fs/nfsd/vfs.h > index 5b58941..bbc9483 100644 > --- a/fs/nfsd/vfs.h > +++ b/fs/nfsd/vfs.h > @@ -85,6 +85,9 @@ __be32 nfsd_symlink(struct svc_rqst *, struct svc_fh *, > struct svc_fh *res, struct iattr *); > __be32 nfsd_link(struct svc_rqst *, struct svc_fh *, > char *, int, struct svc_fh *); > +__be32 nfsd_copy_range(struct svc_rqst *, > + struct svc_fh *, u64, > + struct svc_fh *, u64, u64); > __be32 nfsd_rename(struct svc_rqst *, > struct svc_fh *, char *, int, > struct svc_fh *, char *, int); > diff --git a/fs/nfsd/xdr4.h b/fs/nfsd/xdr4.h > index 3b271d2..95fd1c3 100644 > --- a/fs/nfsd/xdr4.h > +++ b/fs/nfsd/xdr4.h > @@ -426,6 +426,26 @@ struct nfsd4_reclaim_complete { > u32 rca_one_fs; > }; > > +struct nfsd4_copy { > + /* request */ > + u64 ca_src_offset; > + u64 ca_dst_offset; > + u64 ca_count; > + u32 ca_flags; > + u32 ca_destinationlen; > + char * ca_destination; > + > + /* response */ > + union { > + struct { > + u32 cr_callback_id_length; > + stateid_t * cr_callback_id; > + } ok; > + u64 cr_bytes_copied; > + } u; > + > +}; > + > struct nfsd4_op { > int opnum; > __be32 status; > @@ -471,6 +491,7 @@ struct nfsd4_op { > struct nfsd4_reclaim_complete reclaim_complete; > struct nfsd4_test_stateid test_stateid; > struct nfsd4_free_stateid free_stateid; > + struct nfsd4_copy copy; > } u; > struct nfs4_replay * replay; > }; > diff --git a/include/linux/nfs4.h b/include/linux/nfs4.h > index 7b8fc73..6be484e 100644 > --- a/include/linux/nfs4.h > +++ b/include/linux/nfs4.h > @@ -100,6 +100,7 @@ enum nfs_opnum4 { > OP_WANT_DELEGATION = 56, > OP_DESTROY_CLIENTID = 57, > OP_RECLAIM_COMPLETE = 58, > + OP_COPY = 59, > > OP_ILLEGAL = 10044, > }; > @@ -108,7 +109,7 @@ enum nfs_opnum4 { > Needs to be updated if more operations are defined in future.*/ > > #define FIRST_NFS4_OP OP_ACCESS > -#define LAST_NFS4_OP OP_RECLAIM_COMPLETE > +#define LAST_NFS4_OP OP_COPY > > enum nfsstat4 { > NFS4_OK = 0, > @@ -456,6 +457,9 @@ enum { > NFSPROC4_CLNT_GETDEVICELIST, > NFSPROC4_CLNT_BIND_CONN_TO_SESSION, > NFSPROC4_CLNT_DESTROY_CLIENTID, > + > + /* nfs42 */ > + NFSPROC4_CLNT_COPY, > }; > > /* nfs41 types */ > diff --git a/include/linux/nfs_xdr.h b/include/linux/nfs_xdr.h > index 104b62f..2256e31 100644 > --- a/include/linux/nfs_xdr.h > +++ b/include/linux/nfs_xdr.h > @@ -1184,6 +1184,28 @@ struct nfs41_free_stateid_res { > unsigned int status; > }; > > +struct nfs_copy_args { > + struct nfs_fh *fh; > + struct nfs_fh *dir_fh; > + u32 *bitmask; > + __u64 src_offset; > + __u64 dst_offset; > + __u64 count; > + __u32 flags; > + const struct qstr *destination; > + struct nfs4_sequence_args seq_args; > +}; > + > +struct nfs_copy_res { > + struct nfs_fh *fh; > + struct nfs_fattr *fattr; > + __u32 callback_id_length; > + nfs4_stateid *callback_id; > + __u64 bytes_copied; > + const struct nfs_server *server; > + struct nfs4_sequence_res seq_res; > +}; > + > #else > > struct pnfs_ds_commit_info { > @@ -1433,6 +1455,8 @@ struct nfs_rpc_ops { > struct nfs_server *(*create_server)(struct nfs_mount_info *, struct nfs_subversion *); > struct nfs_server *(*clone_server)(struct nfs_server *, struct nfs_fh *, > struct nfs_fattr *, rpc_authflavor_t); > + loff_t (*copy) (struct inode *, struct inode *, struct qstr *, > + int, loff_t, loff_t, loff_t); > }; > > /* > -- > 1.7.11.7 > > -- > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in > the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 24+ messages in thread
[parent not found: <20130515201949.GD25994-uC3wQj2KruNg9hUCZPvPmw@public.gmane.org>]
* Re: [RFC v0 4/4] nfs, nfsd: rough sys_copy_range and COPY support [not found] ` <20130515201949.GD25994-uC3wQj2KruNg9hUCZPvPmw@public.gmane.org> @ 2013-05-15 20:21 ` Myklebust, Trond 2013-05-15 20:24 ` J. Bruce Fields 0 siblings, 1 reply; 24+ messages in thread From: Myklebust, Trond @ 2013-05-15 20:21 UTC (permalink / raw) To: J. Bruce Fields Cc: Zach Brown, Martin K. Petersen, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-btrfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org On Wed, 2013-05-15 at 16:19 -0400, J. Bruce Fields wrote: > On Tue, May 14, 2013 at 02:15:26PM -0700, Zach Brown wrote: > > This crude patch illustrates the simplest plumbing involved in > > supporting sys_call_range with the NFS COPY operation that's pending in > > the 4.2 draft spec. > > > > The patch is based on a previous prototype that used the COPY op to > > implement sys_copyfileat which created a new file (based on the ocfs2 > > reflink ioctl). By contrast, this copies file contents between existing > > files. > > > > There's still a lot of implementation and testing to do, but this can > > get discussion going. > > I'm using: > > git://github.com/loghyr/NFSv4.2 > > as my reference for the draft protocol. > > On a quick skim, one thing this is missing before it complies is a > client implementation of CB_OFFLOAD: "If a client desires an > intra-server file copy, then it MUST support the COPY and CB_OFFLOAD > operations." Note that Bryan is currently working on updating the NFS implementation to match the draft protocol. -- Trond Myklebust Linux NFS client maintainer NetApp Trond.Myklebust-HgOvQuBEEgTQT0dZR+AlfA@public.gmane.org www.netapp.com -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [RFC v0 4/4] nfs, nfsd: rough sys_copy_range and COPY support 2013-05-15 20:21 ` Myklebust, Trond @ 2013-05-15 20:24 ` J. Bruce Fields 0 siblings, 0 replies; 24+ messages in thread From: J. Bruce Fields @ 2013-05-15 20:24 UTC (permalink / raw) To: Myklebust, Trond Cc: Zach Brown, Martin K. Petersen, linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-btrfs@vger.kernel.org, linux-nfs@vger.kernel.org On Wed, May 15, 2013 at 08:21:54PM +0000, Myklebust, Trond wrote: > On Wed, 2013-05-15 at 16:19 -0400, J. Bruce Fields wrote: > > On Tue, May 14, 2013 at 02:15:26PM -0700, Zach Brown wrote: > > > This crude patch illustrates the simplest plumbing involved in > > > supporting sys_call_range with the NFS COPY operation that's pending in > > > the 4.2 draft spec. > > > > > > The patch is based on a previous prototype that used the COPY op to > > > implement sys_copyfileat which created a new file (based on the ocfs2 > > > reflink ioctl). By contrast, this copies file contents between existing > > > files. > > > > > > There's still a lot of implementation and testing to do, but this can > > > get discussion going. > > > > I'm using: > > > > git://github.com/loghyr/NFSv4.2 > > > > as my reference for the draft protocol. > > > > On a quick skim, one thing this is missing before it complies is a > > client implementation of CB_OFFLOAD: "If a client desires an > > intra-server file copy, then it MUST support the COPY and CB_OFFLOAD > > operations." > > Note that Bryan is currently working on updating the NFS implementation > to match the draft protocol. OK, good.--b. ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [RFC v0 0/4] sys_copy_range() rough draft [not found] ` <1368566126-17610-1-git-send-email-zab-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 2013-05-14 21:15 ` [RFC v0 4/4] nfs, nfsd: rough sys_copy_range and COPY support Zach Brown @ 2013-05-14 21:42 ` Dave Chinner 2013-05-14 22:04 ` Zach Brown 1 sibling, 1 reply; 24+ messages in thread From: Dave Chinner @ 2013-05-14 21:42 UTC (permalink / raw) To: Zach Brown Cc: Martin K. Petersen, Trond Myklebust, linux-kernel-u79uwXL29TY76Z2rM5mHXA, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, linux-btrfs-u79uwXL29TY76Z2rM5mHXA, linux-nfs-u79uwXL29TY76Z2rM5mHXA On Tue, May 14, 2013 at 02:15:22PM -0700, Zach Brown wrote: > We've been talking about implementing some form of bulk data copy > offloading for a while now. BTRFS and OCFS2 implement forms of copy > offloading with ioctls, NFS 4.2 will include a byte-granular COPY > operation, and the SCSI XCOPY command is being implemented now that > Windows can issue it. > > In the past we've discussed promoting the ocfs2 reflink ioctl into a > system call that would create a new file and implicitly copy the > source data into the new file: > https://lkml.org/lkml/2009/9/14/481 > > These draft patches take the simpler approach of only copying data > between existing files. The patches 1) make a system call out of the > btrfs CLONE_RANGE ioctl, 2) implement the btrfs .copy_range method with > the ioctl's guts, 3) implement the nfs .copy_range by sending a COPY > op, and 4) serve the COPY op in nfsd by calling the .copy_range method > again. > > The nfs patch is an untested hack. I'm happy to beat it in to shape > but I'll need some guidance. > > I'd like strong review feedback on the interfaces, here are some > possible topics: > > a) Hopefully being able to specify a portion of the data to copy will > avoid *huge* syscall latencies and the motivation for new async > semantics. > > b) The BTRFS ioctl and nfs COPY let you specify a count of 0 to copy > from the start offset to the end of the file. Does anyone have a > strong feeling about this? I'm leaning towards not bothering with it > in the syscall interface. > > c) I chose to return partial progess in the ssize_t return code. This > limits the length of the range and the size_t count argument can be too > large and return errors, much like other io syscalls. This seemed > less awful than some extra argument with a pointer to a status value. > > d) I'm dreading mentioning a vector of ranges to copy in one syscall > because I don't want to think about overlaping ranges and file systems > that use range locks -- xfs for now, but more if Jan gets his way. XFS doesn't use range locks (yet). > I'd rather that we get some experience with this simpler syscall before > taking on that headache. > > I'm sure I'm forgetting some other details. > > I'm going to keep hacking away at this. My next step is to get ext4 > supporting .copy_range, probably with a quick hack to copy the > contents of bios. Hopefully that'll give enough time to also integrate > review feedback. Wouldn't the easiest "support all filesystems" hack just be to add a destination offset parameter to do_splice_direct() and call that when the filesystem doesn't supply a ->copy_range method? i.e. use the mechanisms we already have for copying from one file to another via the page cache as efficiently as possible? Cheers, Dave. -- Dave Chinner david-FqsqvQoI3Ljby3iVrkZq2A@public.gmane.org -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [RFC v0 0/4] sys_copy_range() rough draft 2013-05-14 21:42 ` [RFC v0 0/4] sys_copy_range() rough draft Dave Chinner @ 2013-05-14 22:04 ` Zach Brown 2013-05-15 1:01 ` Dave Chinner 0 siblings, 1 reply; 24+ messages in thread From: Zach Brown @ 2013-05-14 22:04 UTC (permalink / raw) To: Dave Chinner Cc: Martin K. Petersen, Trond Myklebust, linux-kernel, linux-fsdevel, linux-btrfs, linux-nfs On Wed, May 15, 2013 at 07:42:51AM +1000, Dave Chinner wrote: > On Tue, May 14, 2013 at 02:15:22PM -0700, Zach Brown wrote: > > I'm going to keep hacking away at this. My next step is to get ext4 > > supporting .copy_range, probably with a quick hack to copy the > > contents of bios. Hopefully that'll give enough time to also integrate > > review feedback. > > Wouldn't the easiest "support all filesystems" hack just be to add > a destination offset parameter to do_splice_direct() and call that > when the filesystem doesn't supply a ->copy_range method? i.e. use > the mechanisms we already have for copying from one file to another > via the page cache as efficiently as possible? Probably; and this in-kernel buffered fallback is particularly desirable for nfsd when the exported fs doesn't provide .copy_range. Having nfsd service the COPY op is still a significant win over having the client move the data backand forth over the wire. But in that quote above I was talking about implementing .copy_range in ext4 as though it could use XCOPY today. I'd like to get a feel for how bad it's going to be to juggle the bio XCOPY IO with unwritten extent conversion, RMW with overlapping existing blocks, i_size advancing, etc. (It's so much like O_DIRECT that I'm already crying a little.) - z ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [RFC v0 0/4] sys_copy_range() rough draft 2013-05-14 22:04 ` Zach Brown @ 2013-05-15 1:01 ` Dave Chinner 0 siblings, 0 replies; 24+ messages in thread From: Dave Chinner @ 2013-05-15 1:01 UTC (permalink / raw) To: Zach Brown Cc: Martin K. Petersen, Trond Myklebust, linux-kernel, linux-fsdevel, linux-btrfs, linux-nfs On Tue, May 14, 2013 at 03:04:40PM -0700, Zach Brown wrote: > On Wed, May 15, 2013 at 07:42:51AM +1000, Dave Chinner wrote: > > On Tue, May 14, 2013 at 02:15:22PM -0700, Zach Brown wrote: > > > I'm going to keep hacking away at this. My next step is to get ext4 > > > supporting .copy_range, probably with a quick hack to copy the > > > contents of bios. Hopefully that'll give enough time to also integrate > > > review feedback. > > > > Wouldn't the easiest "support all filesystems" hack just be to add > > a destination offset parameter to do_splice_direct() and call that > > when the filesystem doesn't supply a ->copy_range method? i.e. use > > the mechanisms we already have for copying from one file to another > > via the page cache as efficiently as possible? > > Probably; and this in-kernel buffered fallback is particularly desirable > for nfsd when the exported fs doesn't provide .copy_range. Having nfsd > service the COPY op is still a significant win over having the client > move the data backand forth over the wire. Sure. That's kind of what I was thinking to make it easy to test and have widespread support up front. > But in that quote above I was talking about implementing .copy_range in > ext4 as though it could use XCOPY today. I'd like to get a feel for how > bad it's going to be to juggle the bio XCOPY IO with unwritten extent > conversion, RMW with overlapping existing blocks, i_size advancing, etc. > (It's so much like O_DIRECT that I'm already crying a little.) Toss anything that is hard back to the page cache path. Overlapping blocks, partial blocks and so can be handled by the slow path without making the offload path complex. Make the offload do the simple stuff fast - the mapping and completion callbacks should be no different to the direct IO bits we have now, and if you only handle filesystem block aligned ranges in the offload (rather than sector alignment) most of the grot that DIO code has to handle goes away.... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [RFC v0 1/4] vfs: add copy_range syscall and vfs entry point @ 2013-05-15 17:50 Steve French 2013-05-15 18:54 ` J. Bruce Fields [not found] ` <CAH2r5ms0P8Hgv1mUpyHA32Er38iiaC1HHC4fhxvz2SBFy6Sucw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 0 siblings, 2 replies; 24+ messages in thread From: Steve French @ 2013-05-15 17:50 UTC (permalink / raw) To: linux-fsdevel, zab-H+wXaHxf7aLQT0dZR+AlfA Cc: linux-cifs-u79uwXL29TY76Z2rM5mHXA Doesn't the new syscall have to invalidate the page cache pages that the server is about to overwrite as btrfs does with the following line in fs/btrfs/ioctl.c truncate_inode_pages_range(&inode->i_data, destoff, PAGE_CACHE_ALIGN(destoff + len) - 1); (and doesn't truncate_inode_pages_range handle page cache alignment anyway - and also why did btrfs use truncate_inode_pages_range instead of invalidate?) Does nfs client ever have the case where two different superblocks map to the same nfs export (and thus the check below is restricting the ability to do server side copy)? + if (inode_in->i_sb != inode_out->i_sb || + file_in->f_path.mnt != file_out->f_path.mnt) + return -EXDEV; I am working on cifs client patches for the ioctl, and the new syscall also looks pretty easy. Some popular cifs servers (like Windows) have supported smb/cifs copy offload for many, many years - and now Samba with the support that David Disseldorp added for the clone range ioctl has been supporting copychunk (server side copy from Windows to Samba) so about time to finish the cifs client equivalent. -- Thanks, Steve ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [RFC v0 1/4] vfs: add copy_range syscall and vfs entry point 2013-05-15 17:50 [RFC v0 1/4] vfs: add copy_range syscall and vfs entry point Steve French @ 2013-05-15 18:54 ` J. Bruce Fields [not found] ` <20130515185429.GA25994-uC3wQj2KruNg9hUCZPvPmw@public.gmane.org> [not found] ` <CAH2r5ms0P8Hgv1mUpyHA32Er38iiaC1HHC4fhxvz2SBFy6Sucw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 1 sibling, 1 reply; 24+ messages in thread From: J. Bruce Fields @ 2013-05-15 18:54 UTC (permalink / raw) To: Steve French; +Cc: linux-fsdevel, zab, linux-cifs On Wed, May 15, 2013 at 12:50:13PM -0500, Steve French wrote: > Doesn't the new syscall have to invalidate the page cache pages that > the server is about to overwrite as btrfs does with the following line > in fs/btrfs/ioctl.c > > truncate_inode_pages_range(&inode->i_data, destoff, > PAGE_CACHE_ALIGN(destoff + len) - 1); > > (and doesn't truncate_inode_pages_range handle page cache alignment > anyway - and also why did btrfs use truncate_inode_pages_range instead > of invalidate?) > > Does nfs client ever have the case where two different superblocks map > to the same nfs export (and thus the check below is restricting the > ability to do server side copy)? > > + if (inode_in->i_sb != inode_out->i_sb || > + file_in->f_path.mnt != file_out->f_path.mnt) > + return -EXDEV; The client attempts to use the same superblock whenever it can. I suppose you're also losing the opportunity to copy between two different filesystems on the same server, which should be faster than requiring the client to do the copy. --b. > > I am working on cifs client patches for the ioctl, and the new syscall > also looks pretty easy. Some popular cifs servers (like Windows) > have supported smb/cifs copy offload for many, many years - and now > Samba with the support that David Disseldorp added for the clone range > ioctl has been supporting copychunk (server side copy from Windows to > Samba) so about time to finish the cifs client equivalent. > > -- > Thanks, > > Steve > -- > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 24+ messages in thread
[parent not found: <20130515185429.GA25994-uC3wQj2KruNg9hUCZPvPmw@public.gmane.org>]
* Re: [RFC v0 1/4] vfs: add copy_range syscall and vfs entry point [not found] ` <20130515185429.GA25994-uC3wQj2KruNg9hUCZPvPmw@public.gmane.org> @ 2013-05-15 19:39 ` Zach Brown 0 siblings, 0 replies; 24+ messages in thread From: Zach Brown @ 2013-05-15 19:39 UTC (permalink / raw) To: J. Bruce Fields Cc: Steve French, linux-fsdevel, linux-cifs-u79uwXL29TY76Z2rM5mHXA On Wed, May 15, 2013 at 02:54:29PM -0400, J. Bruce Fields wrote: > On Wed, May 15, 2013 at 12:50:13PM -0500, Steve French wrote: > > Does nfs client ever have the case where two different superblocks map > > to the same nfs export (and thus the check below is restricting the > > ability to do server side copy)? > > > > + if (inode_in->i_sb != inode_out->i_sb || > > + file_in->f_path.mnt != file_out->f_path.mnt) > > + return -EXDEV; > > The client attempts to use the same superblock whenever it can. > > I suppose you're also losing the opportunity to copy between two > different filesystems on the same server, which should be faster than > requiring the client to do the copy. Ah, yeah, that'd probably be the simplest motivation to move this down into the ->copy_range methods. It should be trivial to test once the nfs bits are going and we have the page cache fallback. - z ^ permalink raw reply [flat|nested] 24+ messages in thread
[parent not found: <CAH2r5ms0P8Hgv1mUpyHA32Er38iiaC1HHC4fhxvz2SBFy6Sucw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: [RFC v0 1/4] vfs: add copy_range syscall and vfs entry point [not found] ` <CAH2r5ms0P8Hgv1mUpyHA32Er38iiaC1HHC4fhxvz2SBFy6Sucw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2013-05-15 19:36 ` Zach Brown [not found] ` <20130515193600.GA318-fypN+1c5dIyjpB87vu3CluTW4wlIGRCZ@public.gmane.org> 0 siblings, 1 reply; 24+ messages in thread From: Zach Brown @ 2013-05-15 19:36 UTC (permalink / raw) To: Steve French; +Cc: linux-fsdevel, linux-cifs-u79uwXL29TY76Z2rM5mHXA On Wed, May 15, 2013 at 12:50:13PM -0500, Steve French wrote: > Doesn't the new syscall have to invalidate the page cache pages that > the server is about to overwrite as btrfs does with the following line > in fs/btrfs/ioctl.c > > truncate_inode_pages_range(&inode->i_data, destoff, > PAGE_CACHE_ALIGN(destoff + len) - 1); The file_operations ->copy_range implementation is responsible for this, yeah, similar to ->write being responsible for invalidating around O_DIRECT writes. > (and doesn't truncate_inode_pages_range handle page cache alignment > anyway No, it bugs if given an end offset that isn't page aligned: BUG_ON((lend & (PAGE_CACHE_SIZE - 1)) != (PAGE_CACHE_SIZE - 1)); Lukas is working on a patch series to fix this: http://marc.info/?l=linux-fsdevel&m=136854986101843&w=2 > - and also why did btrfs use truncate_inode_pages_range instead > of invalidate?) I'm not sure. Maybe they can have racing dirty pages and want them thrown away rather than having invalidation fail? > Does nfs client ever have the case where two different superblocks map > to the same nfs export (and thus the check below is restricting the > ability to do server side copy)? > > + if (inode_in->i_sb != inode_out->i_sb || > + file_in->f_path.mnt != file_out->f_path.mnt) > + return -EXDEV; Ah, yes, thanks for bringing this up. This check was brought over from the btrfs ioctl code into the core just to make our lives less complicated in the short term. We may well want to push this test down into the ->copy_range methods so that some can support these cross-sb/mnt copies. I'm not sure how the nfs client manages all these relationships, but yes, what you suggest seems plausible. There's also the long term possibility of using the nfs copy op to offload copying between files on different servers. The spec talks about this quite a bit (how the client informs both servers, whether to use nfs between the servers or some server-specific protocol), but this certainly isn't my priority. I figure we can worry about this once we have a ->copy_range method that can copy files like this safely and are stopped by the test. It's easy enough to move the test around. > I am working on cifs client patches for the ioctl, and the new syscall > also looks pretty easy. Some popular cifs servers (like Windows) > have supported smb/cifs copy offload for many, many years - and now > Samba with the support that David Disseldorp added for the clone range > ioctl has been supporting copychunk (server side copy from Windows to > Samba) so about time to finish the cifs client equivalent. Cool, let me know if anything strange pops up. - z ^ permalink raw reply [flat|nested] 24+ messages in thread
[parent not found: <20130515193600.GA318-fypN+1c5dIyjpB87vu3CluTW4wlIGRCZ@public.gmane.org>]
* Re: [RFC v0 1/4] vfs: add copy_range syscall and vfs entry point [not found] ` <20130515193600.GA318-fypN+1c5dIyjpB87vu3CluTW4wlIGRCZ@public.gmane.org> @ 2013-05-15 20:08 ` Steve French 2013-05-15 20:16 ` Chris Mason 0 siblings, 1 reply; 24+ messages in thread From: Steve French @ 2013-05-15 20:08 UTC (permalink / raw) To: Zach Brown Cc: linux-fsdevel, linux-cifs-u79uwXL29TY76Z2rM5mHXA, J. Bruce Fields On Wed, May 15, 2013 at 2:36 PM, Zach Brown <zab-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote: > On Wed, May 15, 2013 at 12:50:13PM -0500, Steve French wrote: >> Doesn't the new syscall have to invalidate the page cache pages that >> the server is about to overwrite as btrfs does with the following line >> in fs/btrfs/ioctl.c >> >> truncate_inode_pages_range(&inode->i_data, destoff, >> PAGE_CACHE_ALIGN(destoff + len) - 1); > > The file_operations ->copy_range implementation is responsible for this, > yeah, similar to ->write being responsible for invalidating around > O_DIRECT writes. > >> (and doesn't truncate_inode_pages_range handle page cache alignment >> anyway > > No, it bugs if given an end offset that isn't page aligned: > > BUG_ON((lend & (PAGE_CACHE_SIZE - 1)) != (PAGE_CACHE_SIZE - 1)); > > Lukas is working on a patch series to fix this: > > http://marc.info/?l=linux-fsdevel&m=136854986101843&w=2 > >> - and also why did btrfs use truncate_inode_pages_range instead >> of invalidate?) > > I'm not sure. Maybe they can have racing dirty pages and want them > thrown away rather than having invalidation fail? Ideally cifs (and presumably) nfs would do a flush of the beginning of the first page (before source offset) and/or the end of the last page of the range (after target offset) if dirty (to send it to the server before the copy) if they are not writing the complete page out, and then toss the pages (including the first and last partial pages) from the page cache. Is there a good fs example for code for this? >> Does nfs client ever have the case where two different superblocks map >> to the same nfs export (and thus the check below is restricting the >> ability to do server side copy)? >> >> + if (inode_in->i_sb != inode_out->i_sb || >> + file_in->f_path.mnt != file_out->f_path.mnt) >> + return -EXDEV; > > Ah, yes, thanks for bringing this up. This check was brought over from > the btrfs ioctl code into the core just to make our lives less > complicated in the short term. > > We may well want to push this test down into the ->copy_range methods so > that some can support these cross-sb/mnt copies. I'm not sure how the > nfs client manages all these relationships, but yes, what you suggest > seems plausible. > > There's also the long term possibility of using the nfs copy op to > offload copying between files on different servers. The spec talks > about this quite a bit (how the client informs both servers, whether to > use nfs between the servers or some server-specific protocol), but this > certainly isn't my priority. > > I figure we can worry about this once we have a ->copy_range method that > can copy files like this safely and are stopped by the test. It's easy > enough to move the test around. > >> I am working on cifs client patches for the ioctl, and the new syscall >> also looks pretty easy. Some popular cifs servers (like Windows) >> have supported smb/cifs copy offload for many, many years - and now >> Samba with the support that David Disseldorp added for the clone range >> ioctl has been supporting copychunk (server side copy from Windows to >> Samba) so about time to finish the cifs client equivalent. > > Cool, let me know if anything strange pops up. > > - z -- Thanks, Steve ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [RFC v0 1/4] vfs: add copy_range syscall and vfs entry point 2013-05-15 20:08 ` Steve French @ 2013-05-15 20:16 ` Chris Mason [not found] ` <20130515201614.24668.83788-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org> 0 siblings, 1 reply; 24+ messages in thread From: Chris Mason @ 2013-05-15 20:16 UTC (permalink / raw) To: Steve French, Zach Brown Cc: linux-fsdevel, linux-cifs@vger.kernel.org, J. Bruce Fields Quoting Steve French (2013-05-15 16:08:16) > On Wed, May 15, 2013 at 2:36 PM, Zach Brown <zab@redhat.com> wrote: > > On Wed, May 15, 2013 at 12:50:13PM -0500, Steve French wrote: > >> Doesn't the new syscall have to invalidate the page cache pages that > >> the server is about to overwrite as btrfs does with the following line > >> in fs/btrfs/ioctl.c > >> > >> truncate_inode_pages_range(&inode->i_data, destoff, > >> PAGE_CACHE_ALIGN(destoff + len) - 1); > > > > The file_operations ->copy_range implementation is responsible for this, > > yeah, similar to ->write being responsible for invalidating around > > O_DIRECT writes. > > > >> (and doesn't truncate_inode_pages_range handle page cache alignment > >> anyway > > > > No, it bugs if given an end offset that isn't page aligned: > > > > BUG_ON((lend & (PAGE_CACHE_SIZE - 1)) != (PAGE_CACHE_SIZE - 1)); > > > > Lukas is working on a patch series to fix this: > > > > http://marc.info/?l=linux-fsdevel&m=136854986101843&w=2 > > > >> - and also why did btrfs use truncate_inode_pages_range instead > >> of invalidate?) > > > > I'm not sure. Maybe they can have racing dirty pages and want them > > thrown away rather than having invalidation fail? truncate_inode_pages_range ends with invalidate. It just locks the page first and makes sure to wait for writeback if it was already in progress. Looking at the btrfs side, I think we should add some locking on the destination extents as well. ^ permalink raw reply [flat|nested] 24+ messages in thread
[parent not found: <20130515201614.24668.83788-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>]
* Re: [RFC v0 1/4] vfs: add copy_range syscall and vfs entry point [not found] ` <20130515201614.24668.83788-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org> @ 2013-05-15 20:21 ` Steve French 2013-05-15 20:25 ` Chris Mason 0 siblings, 1 reply; 24+ messages in thread From: Steve French @ 2013-05-15 20:21 UTC (permalink / raw) To: Chris Mason Cc: Zach Brown, linux-fsdevel, linux-cifs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, J. Bruce Fields On Wed, May 15, 2013 at 3:16 PM, Chris Mason <chris.mason-5c4llco8/ftWk0Htik3J/w@public.gmane.org> wrote: > Quoting Steve French (2013-05-15 16:08:16) >> On Wed, May 15, 2013 at 2:36 PM, Zach Brown <zab-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote: >> > On Wed, May 15, 2013 at 12:50:13PM -0500, Steve French wrote: >> >> Doesn't the new syscall have to invalidate the page cache pages that >> >> the server is about to overwrite as btrfs does with the following line >> >> in fs/btrfs/ioctl.c >> >> >> >> truncate_inode_pages_range(&inode->i_data, destoff, >> >> PAGE_CACHE_ALIGN(destoff + len) - 1); >> > >> > The file_operations ->copy_range implementation is responsible for this, >> > yeah, similar to ->write being responsible for invalidating around >> > O_DIRECT writes. >> > >> >> (and doesn't truncate_inode_pages_range handle page cache alignment >> >> anyway >> > >> > No, it bugs if given an end offset that isn't page aligned: >> > >> > BUG_ON((lend & (PAGE_CACHE_SIZE - 1)) != (PAGE_CACHE_SIZE - 1)); >> > >> > Lukas is working on a patch series to fix this: >> > >> > http://marc.info/?l=linux-fsdevel&m=136854986101843&w=2 >> > >> >> - and also why did btrfs use truncate_inode_pages_range instead >> >> of invalidate?) >> > >> > I'm not sure. Maybe they can have racing dirty pages and want them >> > thrown away rather than having invalidation fail? > > truncate_inode_pages_range ends with invalidate. It just locks the page > first and makes sure to wait for writeback if it was already in > progress. Does truncate_inode_pages_range(&inode->i_data, destoff, PAGE_CACHE_ALIGN(destoff + len) - 1); I also am worried about what does: 1) writing out data from start of previous page up to destoff (if destoff not page aligned, and first page dirty) 2) writing out data from dest+len through end of page (if destoff not page aligned, and last page dirty) before tossing first through last page out -- Thanks, Steve ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [RFC v0 1/4] vfs: add copy_range syscall and vfs entry point 2013-05-15 20:21 ` Steve French @ 2013-05-15 20:25 ` Chris Mason 0 siblings, 0 replies; 24+ messages in thread From: Chris Mason @ 2013-05-15 20:25 UTC (permalink / raw) To: Steve French Cc: Zach Brown, linux-fsdevel, linux-cifs@vger.kernel.org, J. Bruce Fields Quoting Steve French (2013-05-15 16:21:12) > On Wed, May 15, 2013 at 3:16 PM, Chris Mason <chris.mason@fusionio.com> wrote: > > truncate_inode_pages_range ends with invalidate. It just locks the page > > first and makes sure to wait for writeback if it was already in > > progress. > > Does > > truncate_inode_pages_range(&inode->i_data, destoff, > PAGE_CACHE_ALIGN(destoff + len) - 1); > > > I also am worried about what does: > > 1) writing out data from start of previous page up to destoff (if > destoff not page aligned, and first page dirty) > 2) writing out data from dest+len through end of page (if destoff not > page aligned, and last page dirty) > > before tossing first through last page out It does a check for a partial first page (sending down the partial offset to invalidatepage), but the last page is fully truncated. -chris ^ permalink raw reply [flat|nested] 24+ messages in thread
end of thread, other threads:[~2013-05-21 19:50 UTC | newest] Thread overview: 24+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2013-05-14 21:15 [RFC v0 0/4] sys_copy_range() rough draft Zach Brown 2013-05-14 21:15 ` [RFC v0 1/4] vfs: add copy_range syscall and vfs entry point Zach Brown 2013-05-15 19:44 ` Eric Wong 2013-05-15 20:03 ` Zach Brown 2013-05-16 21:16 ` Ric Wheeler 2013-05-21 19:47 ` Eric Wong 2013-05-21 19:50 ` Zach Brown 2013-05-14 21:15 ` [RFC v0 2/4] x86: add sys_copy_range to syscall tables Zach Brown 2013-05-14 21:15 ` [RFC v0 3/4] btrfs: add .copy_range file operation Zach Brown [not found] ` <1368566126-17610-1-git-send-email-zab-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 2013-05-14 21:15 ` [RFC v0 4/4] nfs, nfsd: rough sys_copy_range and COPY support Zach Brown [not found] ` <1368566126-17610-5-git-send-email-zab-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 2013-05-15 20:19 ` J. Bruce Fields [not found] ` <20130515201949.GD25994-uC3wQj2KruNg9hUCZPvPmw@public.gmane.org> 2013-05-15 20:21 ` Myklebust, Trond 2013-05-15 20:24 ` J. Bruce Fields 2013-05-14 21:42 ` [RFC v0 0/4] sys_copy_range() rough draft Dave Chinner 2013-05-14 22:04 ` Zach Brown 2013-05-15 1:01 ` Dave Chinner -- strict thread matches above, loose matches on Subject: below -- 2013-05-15 17:50 [RFC v0 1/4] vfs: add copy_range syscall and vfs entry point Steve French 2013-05-15 18:54 ` J. Bruce Fields [not found] ` <20130515185429.GA25994-uC3wQj2KruNg9hUCZPvPmw@public.gmane.org> 2013-05-15 19:39 ` Zach Brown [not found] ` <CAH2r5ms0P8Hgv1mUpyHA32Er38iiaC1HHC4fhxvz2SBFy6Sucw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 2013-05-15 19:36 ` Zach Brown [not found] ` <20130515193600.GA318-fypN+1c5dIyjpB87vu3CluTW4wlIGRCZ@public.gmane.org> 2013-05-15 20:08 ` Steve French 2013-05-15 20:16 ` Chris Mason [not found] ` <20130515201614.24668.83788-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org> 2013-05-15 20:21 ` Steve French 2013-05-15 20:25 ` Chris Mason
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).