[RFC v0 0/4] sys_copy_range() rough draft

linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [RFC v0 0/4] sys_copy_range() rough draft
@ 2013-05-14 21:15 Zach Brown
  2013-05-14 21:15 ` [RFC v0 1/4] vfs: add copy_range syscall and vfs entry point Zach Brown
                   ` (3 more replies)
  0 siblings, 4 replies; 24+ messages in thread
From: Zach Brown @ 2013-05-14 21:15 UTC (permalink / raw)
  To: Martin K. Petersen, Trond Myklebust, linux-kernel, linux-fsdevel,
	linux-btrfs, linux-nfs

We've been talking about implementing some form of bulk data copy
offloading for a while now.  BTRFS and OCFS2 implement forms of copy
offloading with ioctls, NFS 4.2 will include a byte-granular COPY
operation, and the SCSI XCOPY command is being implemented now that
Windows can issue it.

In the past we've discussed promoting the ocfs2 reflink ioctl into a
system call that would create a new file and implicitly copy the
source data into the new file:
https://lkml.org/lkml/2009/9/14/481

These draft patches take the simpler approach of only copying data
between existing files.  The patches 1) make a system call out of the
btrfs CLONE_RANGE ioctl, 2) implement the btrfs .copy_range method with
the ioctl's guts, 3) implement the nfs .copy_range by sending a COPY
op, and 4) serve the COPY op in nfsd by calling the .copy_range method
again.

The nfs patch is an untested hack.  I'm happy to beat it in to shape
but I'll need some guidance.

I'd like strong review feedback on the interfaces, here are some
possible topics:

a) Hopefully being able to specify a portion of the data to copy will
avoid *huge* syscall latencies and the motivation for new async
semantics.

b) The BTRFS ioctl and nfs COPY let you specify a count of 0 to copy
from the start offset to the end of the file.  Does anyone have a
strong feeling about this?  I'm leaning towards not bothering with it
in the syscall interface.

c) I chose to return partial progess in the ssize_t return code.  This
limits the length of the range and the size_t count argument can be too
large and return errors, much like other io syscalls.  This seemed
less awful than some extra argument with a pointer to a status value.

d) I'm dreading mentioning a vector of ranges to copy in one syscall
because I don't want to think about overlaping ranges and file systems
that use range locks -- xfs for now, but more if Jan gets his way.
I'd rather that we get some experience with this simpler syscall before
taking on that headache.

I'm sure I'm forgetting some other details.

I'm going to keep hacking away at this.  My next step is to get ext4
supporting .copy_range, probably with a quick hack to copy the
contents of bios.  Hopefully that'll give enough time to also integrate
review feedback.

Thoughts?

- z

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [RFC v0 1/4] vfs: add copy_range syscall and vfs entry point
  2013-05-14 21:15 [RFC v0 0/4] sys_copy_range() rough draft Zach Brown
@ 2013-05-14 21:15 ` Zach Brown
  2013-05-15 19:44   ` Eric Wong
  2013-05-14 21:15 ` [RFC v0 2/4] x86: add sys_copy_range to syscall tables Zach Brown
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 24+ messages in thread
From: Zach Brown @ 2013-05-14 21:15 UTC (permalink / raw)
  To: Martin K. Petersen, Trond Myklebust, linux-kernel, linux-fsdevel,
	linux-btrfs, linux-nfs

This adds a syscall and vfs entry point for clone_range which offloads
data copying between existing files.

The syscall is a thin wrapper around the vfs entry point.  Its arguments
are inspired by sys_splice().

The behaviour of the vfs helper is derived from the current btrfs
CLONE_RANGE ioctl.
---
 fs/Makefile                       |   2 +-
 fs/copy_range.c                   | 127 ++++++++++++++++++++++++++++++++++++++
 include/linux/fs.h                |   3 +
 include/uapi/asm-generic/unistd.h |   4 +-
 kernel/sys_ni.c                   |   1 +
 5 files changed, 135 insertions(+), 2 deletions(-)
 create mode 100644 fs/copy_range.c

diff --git a/fs/Makefile b/fs/Makefile
index 4fe6df3..1be83b3 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -11,7 +11,7 @@ obj-y :=	open.o read_write.o file_table.o super.o \
 		attr.o bad_inode.o file.o filesystems.o namespace.o \
 		seq_file.o xattr.o libfs.o fs-writeback.o \
 		pnode.o splice.o sync.o utimes.o \
-		stack.o fs_struct.o statfs.o
+		stack.o fs_struct.o statfs.o copy_range.o
 
 ifeq ($(CONFIG_BLOCK),y)
 obj-y +=	buffer.o bio.o block_dev.o direct-io.o mpage.o ioprio.o
diff --git a/fs/copy_range.c b/fs/copy_range.c
new file mode 100644
index 0000000..3000b9f
--- /dev/null
+++ b/fs/copy_range.c
@@ -0,0 +1,127 @@
+/*
+ * "copy_range": offload data copying between existing files
+ *
+ * Copyright (C) 2013 Zach Brown <zab@redhat.com>
+ */
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/mount.h>
+#include <linux/syscalls.h>
+#include <linux/export.h>
+#include <linux/fsnotify.h>
+
+/**
+ * vfs_copy_range - copy range of bytes from source file to existing file
+ * @file_in:   source regular file
+ * @pos_in:    starting byte offset to copy from the source file
+ * @file_out:  destination regular file
+ * @pos_out:   starting byte offset to copy to in the destination file
+ * @count:     number of bytes to copy
+ *
+ * Returns number of bytes successfully copied from the start of the range or
+ * a negative errno error value.
+ *
+ * The number of bytes successfully written can be less than the input
+ * count if an error is encountered.  In this partial success case the
+ * contents of the destination range after the copied bytes can be a mix
+ * of pre-existing bytes, bytes from the source range, or zeros,
+ * depending on the implementation.
+ *
+ * The source range must be entirely within i_size in the source file.
+ * A destination range outside of the size of the destination file will
+ * extend its size.
+ */
+ssize_t vfs_copy_range(struct file *file_in, loff_t pos_in,
+		       struct file *file_out, loff_t pos_out,
+		       size_t count)
+{
+	struct inode *inode_in;
+	struct inode *inode_out;
+	ssize_t ret;
+
+	if (count == 0)
+		return 0;
+
+	/* copy_range allows full ssize_t count, ignoring MAX_RW_COUNT  */
+	ret = rw_verify_area(READ, file_in, &pos_in, count);
+	if (ret >= 0)
+		ret = rw_verify_area(WRITE, file_out, &pos_out, count);
+	if (ret < 0)
+		return ret;
+
+	if (!(file_in->f_mode & FMODE_READ) ||
+	    !(file_out->f_mode & FMODE_WRITE) ||
+	    (file_out->f_flags & O_APPEND) ||
+	    !file_in->f_op || !file_in->f_op->copy_range)
+		return -EINVAL;
+
+	inode_in = file_inode(file_in);
+	inode_out = file_inode(file_out);
+
+	/* make sure offsets don't wrap and the input is inside i_size */
+	if (pos_in + count < pos_in || pos_out + count < pos_out ||
+	    pos_in + count > i_size_read(inode_in))
+		return -EINVAL;
+
+	/* XXX do we want this test?  btrfs_ioctl_clone_range() */
+	if (S_ISDIR(inode_in->i_mode) || S_ISDIR(inode_out->i_mode))
+		return -EISDIR;
+
+	if (!S_ISREG(inode_in->i_mode) || !S_ISREG(inode_out->i_mode))
+		return -EINVAL;
+
+	if (inode_in->i_sb != inode_out->i_sb ||
+	    file_in->f_path.mnt != file_out->f_path.mnt)
+		return -EXDEV;
+
+	/* forbid ranges in the same file for now */
+	if (inode_in == inode_out)
+		return -EINVAL;
+
+	ret = mnt_want_write_file(file_out);
+	if (ret)
+		return ret;
+
+	ret = file_in->f_op->copy_range(file_in, pos_in, file_out, pos_out,
+					count);
+	if (ret > 0) {
+		fsnotify_access(file_in);
+		add_rchar(current, ret);
+		fsnotify_modify(file_out);
+		add_wchar(current, ret);
+	}
+	inc_syscr(current);
+	inc_syscw(current);
+
+	mnt_drop_write_file(file_out);
+
+	return ret;
+}
+EXPORT_SYMBOL(vfs_copy_range);
+
+SYSCALL_DEFINE5(copy_range, int, fd_in, loff_t __user *, upos_in,
+		int, fd_out, loff_t __user *, upos_out, size_t, count)
+{
+	loff_t pos_in;
+	loff_t pos_out;
+	struct fd f_in;
+	struct fd f_out;
+	ssize_t ret;
+
+	if (get_user(pos_in, upos_in) || get_user(pos_out, upos_out))
+		return -EFAULT;
+
+	f_in = fdget(fd_in);
+	f_out = fdget(fd_out);
+
+	if (f_in.file && f_out.file)
+		ret = vfs_copy_range(f_in.file, pos_in, f_out.file, pos_out,
+				     count);
+	else
+		ret = -EBADF;
+
+	fdput(f_in);
+	fdput(f_out);
+
+	return ret;
+}
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 43db02e..6214893 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1543,6 +1543,7 @@ struct file_operations {
 	long (*fallocate)(struct file *file, int mode, loff_t offset,
 			  loff_t len);
 	int (*show_fdinfo)(struct seq_file *m, struct file *f);
+	ssize_t (*copy_range)(struct file *, loff_t, struct file *, loff_t, size_t);
 };
 
 struct inode_operations {
@@ -1588,6 +1589,8 @@ extern ssize_t vfs_readv(struct file *, const struct iovec __user *,
 		unsigned long, loff_t *);
 extern ssize_t vfs_writev(struct file *, const struct iovec __user *,
 		unsigned long, loff_t *);
+extern ssize_t vfs_copy_range(struct file *, loff_t , struct file *, loff_t,
+		size_t);
 
 struct super_operations {
    	struct inode *(*alloc_inode)(struct super_block *sb);
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 0cc74c4..3935d1c 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -692,9 +692,11 @@ __SC_COMP(__NR_process_vm_writev, sys_process_vm_writev, \
 __SYSCALL(__NR_kcmp, sys_kcmp)
 #define __NR_finit_module 273
 __SYSCALL(__NR_finit_module, sys_finit_module)
+#define __NR_copy_range 274
+__SYSCALL(__NR_copy_range, sys_copy_range)
 
 #undef __NR_syscalls
-#define __NR_syscalls 274
+#define __NR_syscalls 275
 
 /*
  * All syscalls below here should go away really,
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 7078052..af7808a 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -151,6 +151,7 @@ cond_syscall(sys_process_vm_readv);
 cond_syscall(sys_process_vm_writev);
 cond_syscall(compat_sys_process_vm_readv);
 cond_syscall(compat_sys_process_vm_writev);
+cond_syscall(sys_copy_range);
 
 /* arch-specific weak syscall entries */
 cond_syscall(sys_pciconfig_read);
-- 
1.7.11.7

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [RFC v0 1/4] vfs: add copy_range syscall and vfs entry point
  2013-05-14 21:15 ` [RFC v0 1/4] vfs: add copy_range syscall and vfs entry point Zach Brown
@ 2013-05-15 19:44   ` Eric Wong
  2013-05-15 20:03     ` Zach Brown
  0 siblings, 1 reply; 24+ messages in thread
From: Eric Wong @ 2013-05-15 19:44 UTC (permalink / raw)
  To: Zach Brown
  Cc: Martin K. Petersen, Trond Myklebust, linux-kernel, linux-fsdevel,
	linux-btrfs, linux-nfs

Zach Brown <zab@redhat.com> wrote:
> This adds a syscall and vfs entry point for clone_range which offloads
> data copying between existing files.
> 
> The syscall is a thin wrapper around the vfs entry point.  Its arguments
> are inspired by sys_splice().

Why introduce a new syscall instead of extending sys_splice?

Perhaps adding a new SPLICE_F_COPYRANGE flag to avoid compatibility
problems with userspace which may expect ESPIPE when given two non-pipes
would be useful.

If the user doesn't need a out offset, then sendfile() should also be
able to transparently utilize COPY/CLONE_RANGE, too.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC v0 1/4] vfs: add copy_range syscall and vfs entry point
  2013-05-15 19:44   ` Eric Wong
@ 2013-05-15 20:03     ` Zach Brown
  2013-05-16 21:16       ` Ric Wheeler
  2013-05-21 19:47       ` Eric Wong
  0 siblings, 2 replies; 24+ messages in thread
From: Zach Brown @ 2013-05-15 20:03 UTC (permalink / raw)
  To: Eric Wong
  Cc: Martin K. Petersen, Trond Myklebust, linux-kernel, linux-fsdevel,
	linux-btrfs, linux-nfs

On Wed, May 15, 2013 at 07:44:05PM +0000, Eric Wong wrote:
> Why introduce a new syscall instead of extending sys_splice?

Personally, I think it's ugly to have different operations use the same
syscall just because their arguments match.

But that preference aside, sure, if the consensus is that we'd rather
use the splice() entry point then I can duck tape the pieces together to
make it work.

> If the user doesn't need a out offset, then sendfile() should also be
> able to transparently utilize COPY/CLONE_RANGE, too.

Perhaps, yeah.

- z

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC v0 1/4] vfs: add copy_range syscall and vfs entry point
  2013-05-15 20:03     ` Zach Brown
@ 2013-05-16 21:16       ` Ric Wheeler
  2013-05-21 19:47       ` Eric Wong
  1 sibling, 0 replies; 24+ messages in thread
From: Ric Wheeler @ 2013-05-16 21:16 UTC (permalink / raw)
  To: Zach Brown
  Cc: Eric Wong, Martin K. Petersen, Trond Myklebust, linux-kernel,
	linux-fsdevel, linux-btrfs, linux-nfs

On 05/15/2013 04:03 PM, Zach Brown wrote:
> On Wed, May 15, 2013 at 07:44:05PM +0000, Eric Wong wrote:
>> Why introduce a new syscall instead of extending sys_splice?
> Personally, I think it's ugly to have different operations use the same
> syscall just because their arguments match.

I agree with Zach - having a system call called "splice" do copy offloads is not 
intuitive.

This is a very reasonable name for something that battled its way through 
several standards bodies (for NFS and SCSI :)), so we should give it a 
reasonable name

thanks!

Ric

>
> But that preference aside, sure, if the consensus is that we'd rather
> use the splice() entry point then I can duck tape the pieces together to
> make it work.
>
>> If the user doesn't need a out offset, then sendfile() should also be
>> able to transparently utilize COPY/CLONE_RANGE, too.
> Perhaps, yeah.
>
> - z


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC v0 1/4] vfs: add copy_range syscall and vfs entry point
  2013-05-15 20:03     ` Zach Brown
  2013-05-16 21:16       ` Ric Wheeler
@ 2013-05-21 19:47       ` Eric Wong
  2013-05-21 19:50         ` Zach Brown
  1 sibling, 1 reply; 24+ messages in thread
From: Eric Wong @ 2013-05-21 19:47 UTC (permalink / raw)
  To: Zach Brown
  Cc: Martin K. Petersen, Trond Myklebust, linux-kernel, linux-fsdevel,
	linux-btrfs, linux-nfs

Zach Brown <zab@redhat.com> wrote:
> On Wed, May 15, 2013 at 07:44:05PM +0000, Eric Wong wrote:
> > Why introduce a new syscall instead of extending sys_splice?
> 
> Personally, I think it's ugly to have different operations use the same
> syscall just because their arguments match.

Fair enough.  I think adding a (currently unused) flags parameter would
make sense for future-proofing.

If this were extended to sockets/pipes, peek/nonblock/waitall flags
may be added.

Basically, something like splice/tee, but without needing to
create/manage a pipe in userspace or needing extra syscalls/looping.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC v0 1/4] vfs: add copy_range syscall and vfs entry point
  2013-05-21 19:47       ` Eric Wong
@ 2013-05-21 19:50         ` Zach Brown
  0 siblings, 0 replies; 24+ messages in thread
From: Zach Brown @ 2013-05-21 19:50 UTC (permalink / raw)
  To: Eric Wong
  Cc: Martin K. Petersen, Trond Myklebust, linux-kernel, linux-fsdevel,
	linux-btrfs, linux-nfs

On Tue, May 21, 2013 at 07:47:19PM +0000, Eric Wong wrote:
> Zach Brown <zab@redhat.com> wrote:
> > On Wed, May 15, 2013 at 07:44:05PM +0000, Eric Wong wrote:
> > > Why introduce a new syscall instead of extending sys_splice?
> > 
> > Personally, I think it's ugly to have different operations use the same
> > syscall just because their arguments match.
> 
> Fair enough.  I think adding a (currently unused) flags parameter would
> make sense for future-proofing.

Yeah, that seems reasonble.  

- z

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [RFC v0 2/4] x86: add sys_copy_range to syscall tables
  2013-05-14 21:15 [RFC v0 0/4] sys_copy_range() rough draft Zach Brown
  2013-05-14 21:15 ` [RFC v0 1/4] vfs: add copy_range syscall and vfs entry point Zach Brown
@ 2013-05-14 21:15 ` Zach Brown
  2013-05-14 21:15 ` [RFC v0 3/4] btrfs: add .copy_range file operation Zach Brown
       [not found] ` <1368566126-17610-1-git-send-email-zab-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  3 siblings, 0 replies; 24+ messages in thread
From: Zach Brown @ 2013-05-14 21:15 UTC (permalink / raw)
  To: Martin K. Petersen, Trond Myklebust, linux-kernel, linux-fsdevel,
	linux-btrfs, linux-nfs

Add sys_copy_range to the x86 syscall tables.  Happily, it doesn't
require compat helpers.

Signed-off-by: Zach Brown <zab@redhat.com>
---
 arch/x86/syscalls/syscall_32.tbl | 1 +
 arch/x86/syscalls/syscall_64.tbl | 1 +
 2 files changed, 2 insertions(+)

diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl
index aabfb83..d75d1b5 100644
--- a/arch/x86/syscalls/syscall_32.tbl
+++ b/arch/x86/syscalls/syscall_32.tbl
@@ -357,3 +357,4 @@
 348	i386	process_vm_writev	sys_process_vm_writev		compat_sys_process_vm_writev
 349	i386	kcmp			sys_kcmp
 350	i386	finit_module		sys_finit_module
+351	i386	copy_range		sys_copy_range
diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
index 38ae65d..dc9d766 100644
--- a/arch/x86/syscalls/syscall_64.tbl
+++ b/arch/x86/syscalls/syscall_64.tbl
@@ -320,6 +320,7 @@
 311	64	process_vm_writev	sys_process_vm_writev
 312	common	kcmp			sys_kcmp
 313	common	finit_module		sys_finit_module
+314	common	copy_range		sys_copy_range
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
-- 
1.7.11.7

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [RFC v0 3/4] btrfs: add .copy_range file operation
  2013-05-14 21:15 [RFC v0 0/4] sys_copy_range() rough draft Zach Brown
  2013-05-14 21:15 ` [RFC v0 1/4] vfs: add copy_range syscall and vfs entry point Zach Brown
  2013-05-14 21:15 ` [RFC v0 2/4] x86: add sys_copy_range to syscall tables Zach Brown
@ 2013-05-14 21:15 ` Zach Brown
       [not found] ` <1368566126-17610-1-git-send-email-zab-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  3 siblings, 0 replies; 24+ messages in thread
From: Zach Brown @ 2013-05-14 21:15 UTC (permalink / raw)
  To: Martin K. Petersen, Trond Myklebust, linux-kernel, linux-fsdevel,
	linux-btrfs, linux-nfs

This rearranges the existing COPY_RANGE ioctl implementation so that the
.copy_range file operation can call the core loop that copies file data
extent items.

The extent copying loop is lifted up into its own function.  It retains
the core btrfs error checks that should be shared between the
CLONE_RANGE ioctl and copy_range syscall.

Signed-off-by: Zach Brown <zab@redhat.com>
---
 fs/btrfs/ctree.h |   3 ++
 fs/btrfs/file.c  |   1 +
 fs/btrfs/ioctl.c | 122 +++++++++++++++++++++++++++++++++----------------------
 3 files changed, 77 insertions(+), 49 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 63c328a..bf9555c 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3612,6 +3612,9 @@ int btrfs_dirty_pages(struct btrfs_root *root, struct inode *inode,
 		      struct page **pages, size_t num_pages,
 		      loff_t pos, size_t write_bytes,
 		      struct extent_state **cached);
+ssize_t btrfs_copy_range(struct file *file_in, loff_t pos_in,
+			 struct file *file_out, loff_t pos_out,
+			 size_t count);
 
 /* tree-defrag.c */
 int btrfs_defrag_leaves(struct btrfs_trans_handle *trans,
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 4205ba7..d75cc07 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -2460,6 +2460,7 @@ const struct file_operations btrfs_file_operations = {
 #ifdef CONFIG_COMPAT
 	.compat_ioctl	= btrfs_ioctl,
 #endif
+	.copy_range	= btrfs_copy_range,
 };
 
 void btrfs_auto_defrag_exit(void)
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 0de4a2f..ac035d8 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -2463,13 +2463,10 @@ out:
 	return ret;
 }
 
-static noinline long btrfs_ioctl_clone(struct file *file, unsigned long srcfd,
-				       u64 off, u64 olen, u64 destoff)
+static noinline int btrfs_clone_extents(struct inode *inode, struct inode *src,
+					u64 off, u64 olen, u64 destoff)
 {
-	struct inode *inode = file_inode(file);
 	struct btrfs_root *root = BTRFS_I(inode)->root;
-	struct fd src_file;
-	struct inode *src;
 	struct btrfs_trans_handle *trans;
 	struct btrfs_path *path;
 	struct extent_buffer *leaf;
@@ -2491,59 +2488,22 @@ static noinline long btrfs_ioctl_clone(struct file *file, unsigned long srcfd,
 	 *   they don't overlap)?
 	 */
 
-	/* the destination must be opened for writing */
-	if (!(file->f_mode & FMODE_WRITE) || (file->f_flags & O_APPEND))
-		return -EINVAL;
-
 	if (btrfs_root_readonly(root))
 		return -EROFS;
 
-	ret = mnt_want_write_file(file);
-	if (ret)
-		return ret;
-
-	src_file = fdget(srcfd);
-	if (!src_file.file) {
-		ret = -EBADF;
-		goto out_drop_write;
-	}
-
-	ret = -EXDEV;
-	if (src_file.file->f_path.mnt != file->f_path.mnt)
-		goto out_fput;
-
-	src = file_inode(src_file.file);
-
-	ret = -EINVAL;
-	if (src == inode)
-		goto out_fput;
-
-	/* the src must be open for reading */
-	if (!(src_file.file->f_mode & FMODE_READ))
-		goto out_fput;
-
 	/* don't make the dst file partly checksummed */
 	if ((BTRFS_I(src)->flags & BTRFS_INODE_NODATASUM) !=
 	    (BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM))
-		goto out_fput;
-
-	ret = -EISDIR;
-	if (S_ISDIR(src->i_mode) || S_ISDIR(inode->i_mode))
-		goto out_fput;
-
-	ret = -EXDEV;
-	if (src->i_sb != inode->i_sb)
-		goto out_fput;
+		return -EINVAL;
 
-	ret = -ENOMEM;
 	buf = vmalloc(btrfs_level_size(root, 0));
 	if (!buf)
-		goto out_fput;
+		return -ENOMEM;
 
 	path = btrfs_alloc_path();
 	if (!path) {
 		vfree(buf);
-		goto out_fput;
+		return -ENOMEM;
 	}
 	path->reada = 2;
 
@@ -2555,10 +2515,7 @@ static noinline long btrfs_ioctl_clone(struct file *file, unsigned long srcfd,
 		mutex_lock_nested(&inode->i_mutex, I_MUTEX_CHILD);
 	}
 
-	/* determine range to clone */
-	ret = -EINVAL;
-	if (off + len > src->i_size || off + len < off)
-		goto out_unlock;
+	/* CLONE_RANGE can have len == 0, copy_range won't */
 	if (len == 0)
 		olen = len = src->i_size - off;
 	/* if we extend to eof, continue to block boundary */
@@ -2566,6 +2523,7 @@ static noinline long btrfs_ioctl_clone(struct file *file, unsigned long srcfd,
 		len = ALIGN(src->i_size, bs) - off;
 
 	/* verify the end result is block aligned */
+	ret = -EINVAL;
 	if (!IS_ALIGNED(off, bs) || !IS_ALIGNED(off + len, bs) ||
 	    !IS_ALIGNED(destoff, bs))
 		goto out_unlock;
@@ -2849,6 +2807,72 @@ out_unlock:
 	mutex_unlock(&inode->i_mutex);
 	vfree(buf);
 	btrfs_free_path(path);
+	return ret;
+}
+
+ssize_t btrfs_copy_range(struct file *file_in, loff_t pos_in,
+			 struct file *file_out, loff_t pos_out,
+			 size_t count)
+{
+	ssize_t ret;
+
+	ret = btrfs_clone_extents(file_inode(file_out), file_inode(file_in),
+				  pos_in, count, pos_out);
+	if (ret == 0)
+		ret = count;
+	return ret;
+}
+
+static noinline long btrfs_ioctl_clone(struct file *file, unsigned long srcfd,
+				       u64 off, u64 len, u64 destoff)
+{
+	struct inode *inode = file_inode(file);
+	struct fd src_file;
+	struct inode *src;
+	long ret;
+
+	/* the destination must be opened for writing */
+	if (!(file->f_mode & FMODE_WRITE) || (file->f_flags & O_APPEND))
+		return -EINVAL;
+
+	ret = mnt_want_write_file(file);
+	if (ret)
+		return ret;
+
+	src_file = fdget(srcfd);
+	if (!src_file.file) {
+		ret = -EBADF;
+		goto out_drop_write;
+	}
+
+	ret = -EXDEV;
+	if (src_file.file->f_path.mnt != file->f_path.mnt)
+		goto out_fput;
+
+	src = file_inode(src_file.file);
+
+	ret = -EINVAL;
+	if (src == inode)
+		goto out_fput;
+
+	/* the src must be open for reading */
+	if (!(src_file.file->f_mode & FMODE_READ))
+		goto out_fput;
+
+	ret = -EISDIR;
+	if (S_ISDIR(src->i_mode) || S_ISDIR(inode->i_mode))
+		goto out_fput;
+
+	ret = -EXDEV;
+	if (src->i_sb != inode->i_sb)
+		goto out_fput;
+
+	ret = -EINVAL;
+	if (off + len > src->i_size || off + len < off)
+		goto out_fput;
+
+	ret = btrfs_clone_extents(inode, src, off, len, destoff);
+
 out_fput:
 	fdput(src_file);
 out_drop_write:
-- 
1.7.11.7

^ permalink raw reply related	[flat|nested] 24+ messages in thread

[parent not found: <1368566126-17610-1-git-send-email-zab-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>]

* [RFC v0 4/4] nfs, nfsd: rough sys_copy_range and COPY support
       [not found] ` <1368566126-17610-1-git-send-email-zab-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2013-05-14 21:15   ` Zach Brown
       [not found]     ` <1368566126-17610-5-git-send-email-zab-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2013-05-14 21:42   ` [RFC v0 0/4] sys_copy_range() rough draft Dave Chinner
  1 sibling, 1 reply; 24+ messages in thread
From: Zach Brown @ 2013-05-14 21:15 UTC (permalink / raw)
  To: Martin K. Petersen, Trond Myklebust,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	linux-btrfs-u79uwXL29TY76Z2rM5mHXA,
	linux-nfs-u79uwXL29TY76Z2rM5mHXA

This crude patch illustrates the simplest plumbing involved in
supporting sys_call_range with the NFS COPY operation that's pending in
the 4.2 draft spec.

The patch is based on a previous prototype that used the COPY op to
implement sys_copyfileat which created a new file (based on the ocfs2
reflink ioctl).  By contrast, this copies file contents between existing
files.

There's still a lot of implementation and testing to do, but this can
get discussion going.
---
 fs/nfs/file.c           |  25 +++++++++
 fs/nfs/nfs4proc.c       |  72 ++++++++++++++++++++++++++
 fs/nfs/nfs4xdr.c        | 132 ++++++++++++++++++++++++++++++++++++++++++++++++
 fs/nfsd/nfs4proc.c      |  35 +++++++++++++
 fs/nfsd/nfs4xdr.c       |  43 ++++++++++++++++
 fs/nfsd/vfs.c           |  41 +++++++++++++++
 fs/nfsd/vfs.h           |   3 ++
 fs/nfsd/xdr4.h          |  21 ++++++++
 include/linux/nfs4.h    |   6 ++-
 include/linux/nfs_xdr.h |  24 +++++++++
 10 files changed, 401 insertions(+), 1 deletion(-)

diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index a87a44f..7d7bedf 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -917,6 +917,30 @@ int nfs_setlease(struct file *file, long arg, struct file_lock **fl)
 }
 EXPORT_SYMBOL_GPL(nfs_setlease);
 
+ssize_t nfs_copy_range(struct file *file_in, loff_t pos_in,
+		       struct file *file_out, loff_t pos_out,
+		       size_t count)
+{
+	struct dentry *dentry_in = file_in->f_path.dentry;
+	struct dentry *dentry_out = file_out->f_path.dentry;
+	struct inode *inode_in = dentry_in->d_inode;
+	struct inode *inode_out = dentry_out->d_inode;
+	loff_t ret;
+
+	dprintk("NFS copy_range(%s/%s@%llu, %s/%s@%llu, %zd)\n",
+		dentry_in->d_parent->d_name.name, dentry_in->d_name.name,
+		(unsigned long long)pos_in,
+		dentry_out->d_parent->d_name.name, dentry_out->d_name.name,
+		(unsigned long long)pos_out, count);
+
+	if (NFS_PROTO(inode_in)->copy == NULL)
+		ret = -EOPNOTSUPP;
+	else
+		ret = NFS_PROTO(inode_in)->copy(inode_in, inode_out, NULL,
+						0, count, pos_in, pos_out);
+	return ret;
+}
+
 const struct file_operations nfs_file_operations = {
 	.llseek		= nfs_file_llseek,
 	.read		= do_sync_read,
@@ -934,5 +958,6 @@ const struct file_operations nfs_file_operations = {
 	.splice_write	= nfs_file_splice_write,
 	.check_flags	= nfs_check_flags,
 	.setlease	= nfs_setlease,
+	.copy_range	= nfs_copy_range,
 };
 EXPORT_SYMBOL_GPL(nfs_file_operations);
diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
index 8fbc100..1586b3e 100644
--- a/fs/nfs/nfs4proc.c
+++ b/fs/nfs/nfs4proc.c
@@ -5405,6 +5405,75 @@ int nfs4_proc_secinfo(struct inode *dir, const struct qstr *name,
 }
 
 #ifdef CONFIG_NFS_V4_1
+static loff_t _nfs4_proc_copy(struct inode *inode,
+			      struct inode *dir,
+			      struct qstr *name,
+			      int flags,
+			      loff_t nbyte,
+			      loff_t src_offset,
+			      loff_t dst_offset)
+{
+	struct nfs_server *server = NFS_SERVER(inode);
+	int status;
+	struct nfs_copy_args arg = {
+		.fh     = NFS_FH(inode),
+		.dir_fh = NFS_FH(dir),
+		.src_offset = src_offset,
+		.dst_offset = dst_offset,
+		.count = nbyte,
+		.flags = flags,
+		.destination   = name,
+		.bitmask = server->attr_bitmask,
+	};
+	struct nfs_copy_res res = {
+		.fh = NFS_FH(inode),
+		.callback_id_length = 0,
+		.callback_id = 0,
+		.bytes_copied = 0,
+		.server = server,
+	};
+	struct rpc_message msg = {
+		.rpc_proc = &nfs4_procedures[NFSPROC4_CLNT_COPY],
+		.rpc_argp = &arg,
+		.rpc_resp = &res,
+	};
+
+	res.fattr = nfs_alloc_fattr();
+	if (res.fattr == NULL)
+		return -ENOMEM;
+
+	status = nfs4_call_sync(server->client, server, &msg, &arg.seq_args,
+				&res.seq_res, 1);
+	if (res.bytes_copied)
+		status = res.bytes_copied;
+
+	nfs_free_fattr(res.fattr);
+	return status;
+}
+
+static loff_t nfs4_proc_copy(struct inode *inode,
+			struct inode *dir,
+			struct qstr *name,
+			int flags,
+			loff_t nbyte,
+			loff_t src_offset,
+			loff_t dst_offset)
+{
+	struct nfs4_exception exception = {0, };
+	loff_t ret;
+
+	do {
+		ret = _nfs4_proc_copy(inode, dir, name, flags, nbyte,
+				      src_offset, dst_offset);
+		if (ret < 0)
+			ret = nfs4_handle_exception(NFS_SERVER(inode), ret,
+						    &exception);
+	} while (exception.retry);
+
+	return ret;
+}
+
+
 /*
  * Check the exchange flags returned by the server for invalid flags, having
  * both PNFS and NON_PNFS flags set, and not having one of NON_PNFS, PNFS, or
@@ -7097,6 +7166,9 @@ const struct nfs_rpc_ops nfs_v4_clientops = {
 	.free_client	= nfs4_free_client,
 	.create_server	= nfs4_create_server,
 	.clone_server	= nfs_clone_server,
+#ifdef CONFIG_NFS_V4_1
+	.copy		= nfs4_proc_copy,
+#endif
 };
 
 static const struct xattr_handler nfs4_xattr_nfs4_acl_handler = {
diff --git a/fs/nfs/nfs4xdr.c b/fs/nfs/nfs4xdr.c
index 4be8d13..28598b0 100644
--- a/fs/nfs/nfs4xdr.c
+++ b/fs/nfs/nfs4xdr.c
@@ -397,6 +397,8 @@ static int nfs4_stat_to_errno(int);
 #define encode_free_stateid_maxsz	(op_encode_hdr_maxsz + 1 + \
 					 XDR_QUADLEN(NFS4_STATEID_SIZE))
 #define decode_free_stateid_maxsz	(op_decode_hdr_maxsz + 1)
+#define encode_copy_maxsz	(op_encode_hdr_maxsz + 8 + nfs4_name_maxsz)
+#define decode_copy_maxsz	(op_decode_hdr_maxsz + 1 + decode_stateid_maxsz)
 #else /* CONFIG_NFS_V4_1 */
 #define encode_sequence_maxsz	0
 #define decode_sequence_maxsz	0
@@ -840,6 +842,22 @@ static int nfs4_stat_to_errno(int);
 #define NFS4_dec_free_stateid_sz	(compound_decode_hdr_maxsz + \
 					 decode_sequence_maxsz + \
 					 decode_free_stateid_maxsz)
+#define NFS4_enc_copy_sz		(compound_encode_hdr_maxsz + \
+					 encode_sequence_maxsz + \
+					 encode_putfh_maxsz + \
+					 encode_savefh_maxsz + \
+					 encode_putfh_maxsz + \
+					 encode_copy_maxsz + \
+					 encode_getfh_maxsz + \
+					 encode_getattr_maxsz)
+#define NFS4_dec_copy_sz		(compound_decode_hdr_maxsz + \
+					 decode_sequence_maxsz + \
+					 decode_putfh_maxsz + \
+					 decode_savefh_maxsz + \
+					 decode_putfh_maxsz + \
+					 decode_copy_maxsz + \
+					 decode_getfh_maxsz + \
+					 decode_getattr_maxsz)
 
 const u32 nfs41_maxwrite_overhead = ((RPC_MAX_HEADER_WITH_AUTH +
 				      compound_encode_hdr_maxsz +
@@ -1817,6 +1835,23 @@ static void encode_reclaim_complete(struct xdr_stream *xdr,
 	encode_op_hdr(xdr, OP_RECLAIM_COMPLETE, decode_reclaim_complete_maxsz, hdr);
 	encode_uint32(xdr, args->one_fs);
 }
+
+static void encode_copy(struct xdr_stream *xdr,
+			const struct nfs_copy_args *args,
+			struct compound_hdr *hdr)
+{
+	__be32 *p;
+
+	p = reserve_space(xdr, 36 + args->destination->len);
+	*p++ = cpu_to_be32(OP_COPY);
+	p = xdr_encode_hyper(p, args->src_offset);
+	p = xdr_encode_hyper(p, args->dst_offset);
+	p = xdr_encode_hyper(p, args->count);
+	*p++ = cpu_to_be32(args->flags);
+	xdr_encode_opaque(p, args->destination->name, args->destination->len);
+	hdr->nops++;
+	hdr->replen += decode_copy_maxsz;
+}
 #endif /* CONFIG_NFS_V4_1 */
 
 static void encode_sequence(struct xdr_stream *xdr,
@@ -2761,6 +2796,30 @@ static void nfs4_xdr_enc_sequence(struct rpc_rqst *req, struct xdr_stream *xdr,
 }
 
 /*
+ * Encode a COPY request
+ */
+static int nfs4_xdr_enc_copy(struct rpc_rqst *req, __be32 *p,
+				struct nfs_copy_args *args)
+{
+	struct xdr_stream xdr;
+	struct compound_hdr hdr = {
+		.minorversion = nfs4_xdr_minorversion(&args->seq_args),
+	};
+
+	xdr_init_encode(&xdr, &req->rq_snd_buf, p);
+	encode_compound_hdr(&xdr, req, &hdr);
+	encode_sequence(&xdr, &args->seq_args, &hdr);
+	encode_putfh(&xdr, args->fh, &hdr);
+	encode_savefh(&xdr, &hdr);
+	encode_putfh(&xdr, args->dir_fh, &hdr);
+	encode_copy(&xdr, args, &hdr);
+	encode_getfh(&xdr, &hdr);
+	encode_getfattr(&xdr, args->bitmask, &hdr);
+	encode_nops(&hdr);
+	return 0;
+}
+
+/*
  * a GET_LEASE_TIME request
  */
 static void nfs4_xdr_enc_get_lease_time(struct rpc_rqst *req,
@@ -4688,6 +4747,41 @@ static int decode_link(struct xdr_stream *xdr, struct nfs4_change_info *cinfo)
 	return decode_change_info(xdr, cinfo);
 }
 
+#if defined(CONFIG_NFS_V4_1)
+static int decode_copy(struct xdr_stream *xdr, struct nfs_copy_res *res)
+{
+	__be32 *p;
+	int status;
+
+	status = decode_op_hdr(xdr, OP_COPY);
+	if (status)
+		return status;
+
+	if (status == 0) {
+		p = xdr_inline_decode(xdr, 4);
+		if (unlikely(!p))
+			goto out_overflow;
+		res->callback_id_length = be32_to_cpup(p);
+		if (res->callback_id_length == 1) {
+			status = decode_stateid(xdr, res->callback_id);
+			if (unlikely(status))
+				return status;
+		} else if (res->callback_id_length != 0)
+			return -EIO;
+	} else {
+		p = xdr_inline_decode(xdr, 8);
+		if (unlikely(!p))
+			goto out_overflow;
+		p = xdr_decode_hyper(p, &res->bytes_copied);
+	}
+
+	return 0;
+out_overflow:
+	print_overflow_msg(__func__, xdr);
+	return -EIO;
+}
+#endif /* CONFIG_NFS_V4_1 */
+
 /*
  * We create the owner, so we know a proper owner.id length is 4.
  */
@@ -7047,6 +7141,43 @@ static int nfs4_xdr_dec_free_stateid(struct rpc_rqst *rqstp,
 out:
 	return status;
 }
+
+/*
+ * Decode COPY response
+ */
+static int nfs4_xdr_dec_copy(struct rpc_rqst *rqstp, __be32 *p,
+				struct nfs_copy_res *res)
+{
+	struct xdr_stream xdr;
+	struct compound_hdr hdr;
+	int status;
+
+	xdr_init_decode(&xdr, &rqstp->rq_rcv_buf, p);
+	status = decode_compound_hdr(&xdr, &hdr);
+	if (status)
+		goto out;
+	status = decode_sequence(&xdr, &res->seq_res, rqstp);
+	if (status)
+		goto out;
+	status = decode_putfh(&xdr);
+	if (status)
+		goto out;
+	status = decode_savefh(&xdr);
+	if (status != 0)
+		goto out;
+	status = decode_putfh(&xdr);
+	if (status != 0)
+		goto out;
+	status = decode_copy(&xdr, res);
+	if (status)
+		goto out;
+	status = decode_getfh(&xdr, res->fh);
+	if (status != 0)
+		goto out;
+	decode_getfattr(&xdr, res->fattr, res->server);
+out:
+	return status;
+}
 #endif /* CONFIG_NFS_V4_1 */
 
 /**
@@ -7257,6 +7388,7 @@ struct rpc_procinfo	nfs4_procedures[] = {
 	PROC(BIND_CONN_TO_SESSION,
 			enc_bind_conn_to_session, dec_bind_conn_to_session),
 	PROC(DESTROY_CLIENTID,	enc_destroy_clientid,	dec_destroy_clientid),
+	PROC(COPY,		enc_copy,		dec_copy),
 #endif /* CONFIG_NFS_V4_1 */
 };
 
diff --git a/fs/nfsd/nfs4proc.c b/fs/nfsd/nfs4proc.c
index 27d74a2..2f62ebb 100644
--- a/fs/nfsd/nfs4proc.c
+++ b/fs/nfsd/nfs4proc.c
@@ -986,6 +986,37 @@ nfsd4_write(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
 	return status;
 }
 
+/*
+ * XXX:
+ *  - do something with stateids :)
+ *  - implement callback results and OFFLOAD_ABORT
+ *  - inter-server copies?
+ */
+static __be32
+nfsd4_copy(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
+		struct nfsd4_copy *copy)
+{
+	__be32 status;
+
+	/* only support copying data to an existing file */
+	if (copy->ca_destinationlen)
+		return nfserr_inval;
+
+	if (!cstate->current_fh.fh_dentry || !cstate->save_fh.fh_dentry)
+		return nfserr_nofilehandle;
+
+	status = nfsd_copy_range(rqstp, &cstate->save_fh, copy->ca_src_offset,
+				 &cstate->current_fh, copy->ca_dst_offset,
+				 copy->ca_count);
+	if (status == nfs_ok)
+		copy->u.cr_bytes_copied = copy->ca_count;
+
+	/* don't support async callbacks yet */
+	copy->u.ok.cr_callback_id_length = 0;
+
+	return status;
+}
+
 /* This routine never returns NFS_OK!  If there are no other errors, it
  * will return NFSERR_SAME or NFSERR_NOT_SAME depending on whether the
  * attributes matched.  VERIFY is implemented by mapping NFSERR_SAME
@@ -1798,6 +1829,10 @@ static struct nfsd4_operation nfsd4_ops[] = {
 		.op_get_currentstateid = (stateid_getter)nfsd4_get_freestateid,
 		.op_rsize_bop = (nfsd4op_rsize)nfsd4_only_status_rsize,
 	},
+	[OP_COPY] = {
+		.op_func = (nfsd4op_func)nfsd4_copy,
+		.op_name = "OP_COPY",
+	},
 };
 
 #ifdef NFSD_DEBUG
diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
index 6cd86e0..d2978e9 100644
--- a/fs/nfsd/nfs4xdr.c
+++ b/fs/nfsd/nfs4xdr.c
@@ -1445,6 +1445,26 @@ static __be32 nfsd4_decode_reclaim_complete(struct nfsd4_compoundargs *argp, str
 }
 
 static __be32
+nfsd4_decode_copy(struct nfsd4_compoundargs *argp, struct nfsd4_copy *copy)
+{
+	DECODE_HEAD;
+
+	READ_BUF(32);
+	READ64(copy->ca_src_offset);
+	READ64(copy->ca_dst_offset);
+	READ64(copy->ca_count);
+	READ32(copy->ca_flags);
+	READ32(copy->ca_destinationlen);
+	READ_BUF(copy->ca_destinationlen);
+	SAVEMEM(copy->ca_destination, copy->ca_destinationlen);
+	if ((status = check_filename(copy->ca_destination,
+				     copy->ca_destinationlen)))
+		return status;
+
+	DECODE_TAIL;
+}
+
+static __be32
 nfsd4_decode_noop(struct nfsd4_compoundargs *argp, void *p)
 {
 	return nfs_ok;
@@ -1557,6 +1577,7 @@ static nfsd4_dec nfsd41_dec_ops[] = {
 	[OP_WANT_DELEGATION]	= (nfsd4_dec)nfsd4_decode_notsupp,
 	[OP_DESTROY_CLIENTID]	= (nfsd4_dec)nfsd4_decode_destroy_clientid,
 	[OP_RECLAIM_COMPLETE]	= (nfsd4_dec)nfsd4_decode_reclaim_complete,
+	[OP_COPY]               = (nfsd4_dec)nfsd4_decode_copy,
 };
 
 struct nfsd4_minorversion_ops {
@@ -3394,6 +3415,27 @@ nfsd4_encode_test_stateid(struct nfsd4_compoundres *resp, __be32 nfserr,
 }
 
 static __be32
+nfsd4_encode_copy(struct nfsd4_compoundres *resp, __be32 nfserr,
+			struct nfsd4_copy *copy)
+{
+	__be32 *p;
+
+	if (!nfserr) {
+		RESERVE_SPACE(4);
+		WRITE32(copy->u.ok.cr_callback_id_length);
+		ADJUST_ARGS();
+		if (copy->u.ok.cr_callback_id_length == 1)
+			nfsd4_encode_stateid(resp, copy->u.ok.cr_callback_id);
+	} else {
+		RESERVE_SPACE(8);
+		WRITE64(copy->u.cr_bytes_copied);
+		ADJUST_ARGS();
+	}
+
+	return nfserr;
+}
+
+static __be32
 nfsd4_encode_noop(struct nfsd4_compoundres *resp, __be32 nfserr, void *p)
 {
 	return nfserr;
@@ -3465,6 +3507,7 @@ static nfsd4_enc nfsd4_enc_ops[] = {
 	[OP_WANT_DELEGATION]	= (nfsd4_enc)nfsd4_encode_noop,
 	[OP_DESTROY_CLIENTID]	= (nfsd4_enc)nfsd4_encode_noop,
 	[OP_RECLAIM_COMPLETE]	= (nfsd4_enc)nfsd4_encode_noop,
+	[OP_COPY]		= (nfsd4_enc)nfsd4_encode_copy,
 };
 
 /*
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 84ce601..0c1b427 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -28,6 +28,8 @@
 #include <asm/uaccess.h>
 #include <linux/exportfs.h>
 #include <linux/writeback.h>
+#include <linux/fs_struct.h>
+#include <linux/kmod.h>
 
 #ifdef CONFIG_NFSD_V3
 #include "xdr3.h"
@@ -621,6 +623,45 @@ int nfsd4_is_junction(struct dentry *dentry)
 		return 0;
 	return 1;
 }
+
+__be32
+nfsd_copy_range(struct svc_rqst *rqstp, struct svc_fh *fhp_in, u64 pos_in,
+		struct svc_fh *fhp_out, u64 pos_out, u64 count)
+{
+	struct file *filp_in = NULL;
+	struct file *filp_out = NULL;
+	int err;
+
+	/* XXX verify pos and count within sane limits? */
+
+	err = nfsd_open(rqstp, fhp_in, S_IFREG, NFSD_MAY_READ, &filp_in);
+	if (err)
+		goto out;
+
+	err = nfsd_open(rqstp, fhp_out, S_IFREG, NFSD_MAY_WRITE, &filp_out);
+	if (err)
+		goto out;
+
+	err = vfs_copy_range(filp_in, pos_in, filp_out, pos_out, count);
+	/* fall back if .copy_range isn't supported */
+
+	if (!err && EX_ISSYNC(fhp_out->fh_export))
+		err = vfs_fsync_range(filp_out, pos_out, pos_out + count-1, 0);
+
+out:
+	if (filp_in)
+		nfsd_close(filp_in);
+	if (filp_out)
+		nfsd_close(filp_out);
+
+	if (err < 0)
+		err = nfserrno(err);
+	else
+		err = 0;
+
+	return err;
+}
+
 #endif /* defined(CONFIG_NFSD_V4) */
 
 #ifdef CONFIG_NFSD_V3
diff --git a/fs/nfsd/vfs.h b/fs/nfsd/vfs.h
index 5b58941..bbc9483 100644
--- a/fs/nfsd/vfs.h
+++ b/fs/nfsd/vfs.h
@@ -85,6 +85,9 @@ __be32		nfsd_symlink(struct svc_rqst *, struct svc_fh *,
 				struct svc_fh *res, struct iattr *);
 __be32		nfsd_link(struct svc_rqst *, struct svc_fh *,
 				char *, int, struct svc_fh *);
+__be32		nfsd_copy_range(struct svc_rqst *,
+				struct svc_fh *, u64,
+				struct svc_fh *, u64, u64);
 __be32		nfsd_rename(struct svc_rqst *,
 				struct svc_fh *, char *, int,
 				struct svc_fh *, char *, int);
diff --git a/fs/nfsd/xdr4.h b/fs/nfsd/xdr4.h
index 3b271d2..95fd1c3 100644
--- a/fs/nfsd/xdr4.h
+++ b/fs/nfsd/xdr4.h
@@ -426,6 +426,26 @@ struct nfsd4_reclaim_complete {
 	u32 rca_one_fs;
 };
 
+struct nfsd4_copy {
+	/* request */
+	u64             ca_src_offset;
+	u64             ca_dst_offset;
+	u64             ca_count;
+	u32             ca_flags;
+	u32             ca_destinationlen;
+	char *          ca_destination;
+
+	/* response */
+	union {
+		struct {
+			u32			cr_callback_id_length;
+			stateid_t *             cr_callback_id;
+		} ok;
+		u64	cr_bytes_copied;
+	} u;
+
+};
+
 struct nfsd4_op {
 	int					opnum;
 	__be32					status;
@@ -471,6 +491,7 @@ struct nfsd4_op {
 		struct nfsd4_reclaim_complete	reclaim_complete;
 		struct nfsd4_test_stateid	test_stateid;
 		struct nfsd4_free_stateid	free_stateid;
+		struct nfsd4_copy		copy;
 	} u;
 	struct nfs4_replay *			replay;
 };
diff --git a/include/linux/nfs4.h b/include/linux/nfs4.h
index 7b8fc73..6be484e 100644
--- a/include/linux/nfs4.h
+++ b/include/linux/nfs4.h
@@ -100,6 +100,7 @@ enum nfs_opnum4 {
 	OP_WANT_DELEGATION = 56,
 	OP_DESTROY_CLIENTID = 57,
 	OP_RECLAIM_COMPLETE = 58,
+	OP_COPY = 59,
 
 	OP_ILLEGAL = 10044,
 };
@@ -108,7 +109,7 @@ enum nfs_opnum4 {
 Needs to be updated if more operations are defined in future.*/
 
 #define FIRST_NFS4_OP	OP_ACCESS
-#define LAST_NFS4_OP 	OP_RECLAIM_COMPLETE
+#define LAST_NFS4_OP	OP_COPY
 
 enum nfsstat4 {
 	NFS4_OK = 0,
@@ -456,6 +457,9 @@ enum {
 	NFSPROC4_CLNT_GETDEVICELIST,
 	NFSPROC4_CLNT_BIND_CONN_TO_SESSION,
 	NFSPROC4_CLNT_DESTROY_CLIENTID,
+
+	/* nfs42 */
+	NFSPROC4_CLNT_COPY,
 };
 
 /* nfs41 types */
diff --git a/include/linux/nfs_xdr.h b/include/linux/nfs_xdr.h
index 104b62f..2256e31 100644
--- a/include/linux/nfs_xdr.h
+++ b/include/linux/nfs_xdr.h
@@ -1184,6 +1184,28 @@ struct nfs41_free_stateid_res {
 	unsigned int			status;
 };
 
+struct nfs_copy_args {
+	struct nfs_fh			*fh;
+	struct nfs_fh			*dir_fh;
+	u32				*bitmask;
+	__u64				src_offset;
+	__u64				dst_offset;
+	__u64				count;
+	__u32				flags;
+	const struct qstr		*destination;
+	struct nfs4_sequence_args	seq_args;
+};
+
+struct nfs_copy_res {
+	struct nfs_fh			*fh;
+	struct nfs_fattr		*fattr;
+	__u32				callback_id_length;
+	nfs4_stateid			*callback_id;
+	__u64				bytes_copied;
+	const struct nfs_server		*server;
+	struct nfs4_sequence_res	seq_res;
+};
+
 #else
 
 struct pnfs_ds_commit_info {
@@ -1433,6 +1455,8 @@ struct nfs_rpc_ops {
 	struct nfs_server *(*create_server)(struct nfs_mount_info *, struct nfs_subversion *);
 	struct nfs_server *(*clone_server)(struct nfs_server *, struct nfs_fh *,
 					   struct nfs_fattr *, rpc_authflavor_t);
+	loff_t  (*copy)    (struct inode *, struct inode *, struct qstr *,
+				int, loff_t, loff_t, loff_t);
 };
 
 /*
-- 
1.7.11.7

--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 24+ messages in thread

[parent not found: <1368566126-17610-5-git-send-email-zab-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>]

* Re: [RFC v0 4/4] nfs, nfsd: rough sys_copy_range and COPY support
       [not found]     ` <1368566126-17610-5-git-send-email-zab-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2013-05-15 20:19       ` J. Bruce Fields
       [not found]         ` <20130515201949.GD25994-uC3wQj2KruNg9hUCZPvPmw@public.gmane.org>
  0 siblings, 1 reply; 24+ messages in thread
From: J. Bruce Fields @ 2013-05-15 20:19 UTC (permalink / raw)
  To: Zach Brown
  Cc: Martin K. Petersen, Trond Myklebust,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	linux-btrfs-u79uwXL29TY76Z2rM5mHXA,
	linux-nfs-u79uwXL29TY76Z2rM5mHXA

On Tue, May 14, 2013 at 02:15:26PM -0700, Zach Brown wrote:
> This crude patch illustrates the simplest plumbing involved in
> supporting sys_call_range with the NFS COPY operation that's pending in
> the 4.2 draft spec.
> 
> The patch is based on a previous prototype that used the COPY op to
> implement sys_copyfileat which created a new file (based on the ocfs2
> reflink ioctl).  By contrast, this copies file contents between existing
> files.
> 
> There's still a lot of implementation and testing to do, but this can
> get discussion going.

I'm using:

	git://github.com/loghyr/NFSv4.2

as my reference for the draft protocol.

On a quick skim, one thing this is missing before it complies is a
client implementation of CB_OFFLOAD: "If a client desires an
intra-server file copy, then it MUST support the COPY and CB_OFFLOAD
operations."

The server doesn't have to implement CB_OFFLOAD, though, so we should
ditch these todo's:

> +/*
> + * XXX:
> + *  - do something with stateids :)
> + *  - implement callback results and OFFLOAD_ABORT
> + *  - inter-server copies?
> + */

...

> +	/* don't support async callbacks yet */

...

lest someone go try to implement them for no reason.  (Stranger things
have happened.)

Nits, possibly to ignore for now:

> +	copy->u.ok.cr_callback_id_length = 0;
> +
> +	return status;
> +}
> +
>  /* This routine never returns NFS_OK!  If there are no other errors, it
>   * will return NFSERR_SAME or NFSERR_NOT_SAME depending on whether the
>   * attributes matched.  VERIFY is implemented by mapping NFSERR_SAME
> @@ -1798,6 +1829,10 @@ static struct nfsd4_operation nfsd4_ops[] = {
>  		.op_get_currentstateid = (stateid_getter)nfsd4_get_freestateid,
>  		.op_rsize_bop = (nfsd4op_rsize)nfsd4_only_status_rsize,
>  	},
> +	[OP_COPY] = {
> +		.op_func = (nfsd4op_func)nfsd4_copy,
> +		.op_name = "OP_COPY",
> +	},

There's some more boilerplate to fill in (see other ops).

> +static __be32
>  nfsd4_decode_noop(struct nfsd4_compoundargs *argp, void *p)
>  {
>  	return nfs_ok;
> @@ -1557,6 +1577,7 @@ static nfsd4_dec nfsd41_dec_ops[] = {
>  	[OP_WANT_DELEGATION]	= (nfsd4_dec)nfsd4_decode_notsupp,
>  	[OP_DESTROY_CLIENTID]	= (nfsd4_dec)nfsd4_decode_destroy_clientid,
>  	[OP_RECLAIM_COMPLETE]	= (nfsd4_dec)nfsd4_decode_reclaim_complete,
> +	[OP_COPY]               = (nfsd4_dec)nfsd4_decode_copy,

And this should be made 4.2-specific.

>  };
>  
>  struct nfsd4_minorversion_ops {
> @@ -3394,6 +3415,27 @@ nfsd4_encode_test_stateid(struct nfsd4_compoundres *resp, __be32 nfserr,
>  }
>  
>  static __be32
> +nfsd4_encode_copy(struct nfsd4_compoundres *resp, __be32 nfserr,
> +			struct nfsd4_copy *copy)
> +{
> +	__be32 *p;
> +
> +	if (!nfserr) {
> +		RESERVE_SPACE(4);
> +		WRITE32(copy->u.ok.cr_callback_id_length);
> +		ADJUST_ARGS();
> +		if (copy->u.ok.cr_callback_id_length == 1)
> +			nfsd4_encode_stateid(resp, copy->u.ok.cr_callback_id);
> +	} else {
> +		RESERVE_SPACE(8);
> +		WRITE64(copy->u.cr_bytes_copied);
> +		ADJUST_ARGS();
> +	}
> +
> +	return nfserr;
> +}
> +
> +static __be32
>  nfsd4_encode_noop(struct nfsd4_compoundres *resp, __be32 nfserr, void *p)
>  {
>  	return nfserr;
> @@ -3465,6 +3507,7 @@ static nfsd4_enc nfsd4_enc_ops[] = {
>  	[OP_WANT_DELEGATION]	= (nfsd4_enc)nfsd4_encode_noop,
>  	[OP_DESTROY_CLIENTID]	= (nfsd4_enc)nfsd4_encode_noop,
>  	[OP_RECLAIM_COMPLETE]	= (nfsd4_enc)nfsd4_encode_noop,
> +	[OP_COPY]		= (nfsd4_enc)nfsd4_encode_copy,
>  };
>  
>  /*
> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> index 84ce601..0c1b427 100644
> --- a/fs/nfsd/vfs.c
> +++ b/fs/nfsd/vfs.c
> @@ -28,6 +28,8 @@
>  #include <asm/uaccess.h>
>  #include <linux/exportfs.h>
>  #include <linux/writeback.h>
> +#include <linux/fs_struct.h>
> +#include <linux/kmod.h>
>  
>  #ifdef CONFIG_NFSD_V3
>  #include "xdr3.h"
> @@ -621,6 +623,45 @@ int nfsd4_is_junction(struct dentry *dentry)
>  		return 0;
>  	return 1;
>  }
> +
> +__be32
> +nfsd_copy_range(struct svc_rqst *rqstp, struct svc_fh *fhp_in, u64 pos_in,
> +		struct svc_fh *fhp_out, u64 pos_out, u64 count)
> +{
> +	struct file *filp_in = NULL;
> +	struct file *filp_out = NULL;
> +	int err;
> +
> +	/* XXX verify pos and count within sane limits? */
> +
> +	err = nfsd_open(rqstp, fhp_in, S_IFREG, NFSD_MAY_READ, &filp_in);
> +	if (err)
> +		goto out;
> +
> +	err = nfsd_open(rqstp, fhp_out, S_IFREG, NFSD_MAY_WRITE, &filp_out);
> +	if (err)
> +		goto out;

Looking at the xdr... the COPY operation takes stateid's, which nfsd can
use to look up files, so the opens shouldn't be required.

--b.

> +
> +	err = vfs_copy_range(filp_in, pos_in, filp_out, pos_out, count);
> +	/* fall back if .copy_range isn't supported */
> +
> +	if (!err && EX_ISSYNC(fhp_out->fh_export))
> +		err = vfs_fsync_range(filp_out, pos_out, pos_out + count-1, 0);
> +
> +out:
> +	if (filp_in)
> +		nfsd_close(filp_in);
> +	if (filp_out)
> +		nfsd_close(filp_out);
> +
> +	if (err < 0)
> +		err = nfserrno(err);
> +	else
> +		err = 0;
> +
> +	return err;
> +}
> +
>  #endif /* defined(CONFIG_NFSD_V4) */
>  
>  #ifdef CONFIG_NFSD_V3
> diff --git a/fs/nfsd/vfs.h b/fs/nfsd/vfs.h
> index 5b58941..bbc9483 100644
> --- a/fs/nfsd/vfs.h
> +++ b/fs/nfsd/vfs.h
> @@ -85,6 +85,9 @@ __be32		nfsd_symlink(struct svc_rqst *, struct svc_fh *,
>  				struct svc_fh *res, struct iattr *);
>  __be32		nfsd_link(struct svc_rqst *, struct svc_fh *,
>  				char *, int, struct svc_fh *);
> +__be32		nfsd_copy_range(struct svc_rqst *,
> +				struct svc_fh *, u64,
> +				struct svc_fh *, u64, u64);
>  __be32		nfsd_rename(struct svc_rqst *,
>  				struct svc_fh *, char *, int,
>  				struct svc_fh *, char *, int);
> diff --git a/fs/nfsd/xdr4.h b/fs/nfsd/xdr4.h
> index 3b271d2..95fd1c3 100644
> --- a/fs/nfsd/xdr4.h
> +++ b/fs/nfsd/xdr4.h
> @@ -426,6 +426,26 @@ struct nfsd4_reclaim_complete {
>  	u32 rca_one_fs;
>  };
>  
> +struct nfsd4_copy {
> +	/* request */
> +	u64             ca_src_offset;
> +	u64             ca_dst_offset;
> +	u64             ca_count;
> +	u32             ca_flags;
> +	u32             ca_destinationlen;
> +	char *          ca_destination;
> +
> +	/* response */
> +	union {
> +		struct {
> +			u32			cr_callback_id_length;
> +			stateid_t *             cr_callback_id;
> +		} ok;
> +		u64	cr_bytes_copied;
> +	} u;
> +
> +};
> +
>  struct nfsd4_op {
>  	int					opnum;
>  	__be32					status;
> @@ -471,6 +491,7 @@ struct nfsd4_op {
>  		struct nfsd4_reclaim_complete	reclaim_complete;
>  		struct nfsd4_test_stateid	test_stateid;
>  		struct nfsd4_free_stateid	free_stateid;
> +		struct nfsd4_copy		copy;
>  	} u;
>  	struct nfs4_replay *			replay;
>  };
> diff --git a/include/linux/nfs4.h b/include/linux/nfs4.h
> index 7b8fc73..6be484e 100644
> --- a/include/linux/nfs4.h
> +++ b/include/linux/nfs4.h
> @@ -100,6 +100,7 @@ enum nfs_opnum4 {
>  	OP_WANT_DELEGATION = 56,
>  	OP_DESTROY_CLIENTID = 57,
>  	OP_RECLAIM_COMPLETE = 58,
> +	OP_COPY = 59,
>  
>  	OP_ILLEGAL = 10044,
>  };
> @@ -108,7 +109,7 @@ enum nfs_opnum4 {
>  Needs to be updated if more operations are defined in future.*/
>  
>  #define FIRST_NFS4_OP	OP_ACCESS
> -#define LAST_NFS4_OP 	OP_RECLAIM_COMPLETE
> +#define LAST_NFS4_OP	OP_COPY
>  
>  enum nfsstat4 {
>  	NFS4_OK = 0,
> @@ -456,6 +457,9 @@ enum {
>  	NFSPROC4_CLNT_GETDEVICELIST,
>  	NFSPROC4_CLNT_BIND_CONN_TO_SESSION,
>  	NFSPROC4_CLNT_DESTROY_CLIENTID,
> +
> +	/* nfs42 */
> +	NFSPROC4_CLNT_COPY,
>  };
>  
>  /* nfs41 types */
> diff --git a/include/linux/nfs_xdr.h b/include/linux/nfs_xdr.h
> index 104b62f..2256e31 100644
> --- a/include/linux/nfs_xdr.h
> +++ b/include/linux/nfs_xdr.h
> @@ -1184,6 +1184,28 @@ struct nfs41_free_stateid_res {
>  	unsigned int			status;
>  };
>  
> +struct nfs_copy_args {
> +	struct nfs_fh			*fh;
> +	struct nfs_fh			*dir_fh;
> +	u32				*bitmask;
> +	__u64				src_offset;
> +	__u64				dst_offset;
> +	__u64				count;
> +	__u32				flags;
> +	const struct qstr		*destination;
> +	struct nfs4_sequence_args	seq_args;
> +};
> +
> +struct nfs_copy_res {
> +	struct nfs_fh			*fh;
> +	struct nfs_fattr		*fattr;
> +	__u32				callback_id_length;
> +	nfs4_stateid			*callback_id;
> +	__u64				bytes_copied;
> +	const struct nfs_server		*server;
> +	struct nfs4_sequence_res	seq_res;
> +};
> +
>  #else
>  
>  struct pnfs_ds_commit_info {
> @@ -1433,6 +1455,8 @@ struct nfs_rpc_ops {
>  	struct nfs_server *(*create_server)(struct nfs_mount_info *, struct nfs_subversion *);
>  	struct nfs_server *(*clone_server)(struct nfs_server *, struct nfs_fh *,
>  					   struct nfs_fattr *, rpc_authflavor_t);
> +	loff_t  (*copy)    (struct inode *, struct inode *, struct qstr *,
> +				int, loff_t, loff_t, loff_t);
>  };
>  
>  /*
> -- 
> 1.7.11.7
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 24+ messages in thread

[parent not found: <20130515201949.GD25994-uC3wQj2KruNg9hUCZPvPmw@public.gmane.org>]

* Re: [RFC v0 4/4] nfs, nfsd: rough sys_copy_range and COPY support
       [not found]         ` <20130515201949.GD25994-uC3wQj2KruNg9hUCZPvPmw@public.gmane.org>
@ 2013-05-15 20:21           ` Myklebust, Trond
  2013-05-15 20:24             ` J. Bruce Fields
  0 siblings, 1 reply; 24+ messages in thread
From: Myklebust, Trond @ 2013-05-15 20:21 UTC (permalink / raw)
  To: J. Bruce Fields
  Cc: Zach Brown, Martin K. Petersen,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-btrfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org

On Wed, 2013-05-15 at 16:19 -0400, J. Bruce Fields wrote:
> On Tue, May 14, 2013 at 02:15:26PM -0700, Zach Brown wrote:
> > This crude patch illustrates the simplest plumbing involved in
> > supporting sys_call_range with the NFS COPY operation that's pending in
> > the 4.2 draft spec.
> > 
> > The patch is based on a previous prototype that used the COPY op to
> > implement sys_copyfileat which created a new file (based on the ocfs2
> > reflink ioctl).  By contrast, this copies file contents between existing
> > files.
> > 
> > There's still a lot of implementation and testing to do, but this can
> > get discussion going.
> 
> I'm using:
> 
> 	git://github.com/loghyr/NFSv4.2
> 
> as my reference for the draft protocol.
> 
> On a quick skim, one thing this is missing before it complies is a
> client implementation of CB_OFFLOAD: "If a client desires an
> intra-server file copy, then it MUST support the COPY and CB_OFFLOAD
> operations."

Note that Bryan is currently working on updating the NFS implementation
to match the draft protocol.

-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
Trond.Myklebust-HgOvQuBEEgTQT0dZR+AlfA@public.gmane.org
www.netapp.com
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC v0 4/4] nfs, nfsd: rough sys_copy_range and COPY support
  2013-05-15 20:21           ` Myklebust, Trond
@ 2013-05-15 20:24             ` J. Bruce Fields
  0 siblings, 0 replies; 24+ messages in thread
From: J. Bruce Fields @ 2013-05-15 20:24 UTC (permalink / raw)
  To: Myklebust, Trond
  Cc: Zach Brown, Martin K. Petersen, linux-kernel@vger.kernel.org,
	linux-fsdevel@vger.kernel.org, linux-btrfs@vger.kernel.org,
	linux-nfs@vger.kernel.org

On Wed, May 15, 2013 at 08:21:54PM +0000, Myklebust, Trond wrote:
> On Wed, 2013-05-15 at 16:19 -0400, J. Bruce Fields wrote:
> > On Tue, May 14, 2013 at 02:15:26PM -0700, Zach Brown wrote:
> > > This crude patch illustrates the simplest plumbing involved in
> > > supporting sys_call_range with the NFS COPY operation that's pending in
> > > the 4.2 draft spec.
> > > 
> > > The patch is based on a previous prototype that used the COPY op to
> > > implement sys_copyfileat which created a new file (based on the ocfs2
> > > reflink ioctl).  By contrast, this copies file contents between existing
> > > files.
> > > 
> > > There's still a lot of implementation and testing to do, but this can
> > > get discussion going.
> > 
> > I'm using:
> > 
> > 	git://github.com/loghyr/NFSv4.2
> > 
> > as my reference for the draft protocol.
> > 
> > On a quick skim, one thing this is missing before it complies is a
> > client implementation of CB_OFFLOAD: "If a client desires an
> > intra-server file copy, then it MUST support the COPY and CB_OFFLOAD
> > operations."
> 
> Note that Bryan is currently working on updating the NFS implementation
> to match the draft protocol.

OK, good.--b.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC v0 0/4] sys_copy_range() rough draft
       [not found] ` <1368566126-17610-1-git-send-email-zab-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2013-05-14 21:15   ` [RFC v0 4/4] nfs, nfsd: rough sys_copy_range and COPY support Zach Brown
@ 2013-05-14 21:42   ` Dave Chinner
  2013-05-14 22:04     ` Zach Brown
  1 sibling, 1 reply; 24+ messages in thread
From: Dave Chinner @ 2013-05-14 21:42 UTC (permalink / raw)
  To: Zach Brown
  Cc: Martin K. Petersen, Trond Myklebust,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	linux-btrfs-u79uwXL29TY76Z2rM5mHXA,
	linux-nfs-u79uwXL29TY76Z2rM5mHXA

On Tue, May 14, 2013 at 02:15:22PM -0700, Zach Brown wrote:
> We've been talking about implementing some form of bulk data copy
> offloading for a while now.  BTRFS and OCFS2 implement forms of copy
> offloading with ioctls, NFS 4.2 will include a byte-granular COPY
> operation, and the SCSI XCOPY command is being implemented now that
> Windows can issue it.
> 
> In the past we've discussed promoting the ocfs2 reflink ioctl into a
> system call that would create a new file and implicitly copy the
> source data into the new file:
> https://lkml.org/lkml/2009/9/14/481
> 
> These draft patches take the simpler approach of only copying data
> between existing files.  The patches 1) make a system call out of the
> btrfs CLONE_RANGE ioctl, 2) implement the btrfs .copy_range method with
> the ioctl's guts, 3) implement the nfs .copy_range by sending a COPY
> op, and 4) serve the COPY op in nfsd by calling the .copy_range method
> again.
> 
> The nfs patch is an untested hack.  I'm happy to beat it in to shape
> but I'll need some guidance.
> 
> I'd like strong review feedback on the interfaces, here are some
> possible topics:
> 
> a) Hopefully being able to specify a portion of the data to copy will
> avoid *huge* syscall latencies and the motivation for new async
> semantics.
> 
> b) The BTRFS ioctl and nfs COPY let you specify a count of 0 to copy
> from the start offset to the end of the file.  Does anyone have a
> strong feeling about this?  I'm leaning towards not bothering with it
> in the syscall interface.
> 
> c) I chose to return partial progess in the ssize_t return code.  This
> limits the length of the range and the size_t count argument can be too
> large and return errors, much like other io syscalls.  This seemed
> less awful than some extra argument with a pointer to a status value.
> 
> d) I'm dreading mentioning a vector of ranges to copy in one syscall
> because I don't want to think about overlaping ranges and file systems
> that use range locks -- xfs for now, but more if Jan gets his way.

XFS doesn't use range locks (yet).

> I'd rather that we get some experience with this simpler syscall before
> taking on that headache.
> 
> I'm sure I'm forgetting some other details.
> 
> I'm going to keep hacking away at this.  My next step is to get ext4
> supporting .copy_range, probably with a quick hack to copy the
> contents of bios.  Hopefully that'll give enough time to also integrate
> review feedback.

Wouldn't the easiest "support all filesystems" hack just be to add
a destination offset parameter to do_splice_direct() and call that
when the filesystem doesn't supply a ->copy_range method? i.e. use
the mechanisms we already have for copying from one file to another
via the page cache as efficiently as possible?

Cheers,

Dave.
-- 
Dave Chinner
david-FqsqvQoI3Ljby3iVrkZq2A@public.gmane.org
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC v0 0/4] sys_copy_range() rough draft
  2013-05-14 21:42   ` [RFC v0 0/4] sys_copy_range() rough draft Dave Chinner
@ 2013-05-14 22:04     ` Zach Brown
  2013-05-15  1:01       ` Dave Chinner
  0 siblings, 1 reply; 24+ messages in thread
From: Zach Brown @ 2013-05-14 22:04 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Martin K. Petersen, Trond Myklebust, linux-kernel, linux-fsdevel,
	linux-btrfs, linux-nfs

On Wed, May 15, 2013 at 07:42:51AM +1000, Dave Chinner wrote:
> On Tue, May 14, 2013 at 02:15:22PM -0700, Zach Brown wrote:
> > I'm going to keep hacking away at this.  My next step is to get ext4
> > supporting .copy_range, probably with a quick hack to copy the
> > contents of bios.  Hopefully that'll give enough time to also integrate
> > review feedback.
> 
> Wouldn't the easiest "support all filesystems" hack just be to add
> a destination offset parameter to do_splice_direct() and call that
> when the filesystem doesn't supply a ->copy_range method? i.e. use
> the mechanisms we already have for copying from one file to another
> via the page cache as efficiently as possible?

Probably; and this in-kernel buffered fallback is particularly desirable
for nfsd when the exported fs doesn't provide .copy_range.  Having nfsd
service the COPY op is still a significant win over having the client
move the data backand forth over the wire.

But in that quote above I was talking about implementing .copy_range in
ext4 as though it could use XCOPY today.  I'd like to get a feel for how
bad it's going to be to juggle the bio XCOPY IO with unwritten extent
conversion, RMW with overlapping existing blocks, i_size advancing, etc.
(It's so much like O_DIRECT that I'm already crying a little.)

- z

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC v0 0/4] sys_copy_range() rough draft
  2013-05-14 22:04     ` Zach Brown
@ 2013-05-15  1:01       ` Dave Chinner
  0 siblings, 0 replies; 24+ messages in thread
From: Dave Chinner @ 2013-05-15  1:01 UTC (permalink / raw)
  To: Zach Brown
  Cc: Martin K. Petersen, Trond Myklebust, linux-kernel, linux-fsdevel,
	linux-btrfs, linux-nfs

On Tue, May 14, 2013 at 03:04:40PM -0700, Zach Brown wrote:
> On Wed, May 15, 2013 at 07:42:51AM +1000, Dave Chinner wrote:
> > On Tue, May 14, 2013 at 02:15:22PM -0700, Zach Brown wrote:
> > > I'm going to keep hacking away at this.  My next step is to get ext4
> > > supporting .copy_range, probably with a quick hack to copy the
> > > contents of bios.  Hopefully that'll give enough time to also integrate
> > > review feedback.
> > 
> > Wouldn't the easiest "support all filesystems" hack just be to add
> > a destination offset parameter to do_splice_direct() and call that
> > when the filesystem doesn't supply a ->copy_range method? i.e. use
> > the mechanisms we already have for copying from one file to another
> > via the page cache as efficiently as possible?
> 
> Probably; and this in-kernel buffered fallback is particularly desirable
> for nfsd when the exported fs doesn't provide .copy_range.  Having nfsd
> service the COPY op is still a significant win over having the client
> move the data backand forth over the wire.

Sure. That's kind of what I was thinking to make it easy to test and
have widespread support up front.

> But in that quote above I was talking about implementing .copy_range in
> ext4 as though it could use XCOPY today.  I'd like to get a feel for how
> bad it's going to be to juggle the bio XCOPY IO with unwritten extent
> conversion, RMW with overlapping existing blocks, i_size advancing, etc.
> (It's so much like O_DIRECT that I'm already crying a little.)

Toss anything that is hard back to the page cache path. Overlapping
blocks, partial blocks and so can be handled by the slow path
without making the offload path complex.

Make the offload do the simple stuff fast - the mapping and
completion callbacks should be no different to the direct IO bits we
have now, and if you only handle filesystem block aligned ranges in
the offload (rather than sector alignment) most of the grot that DIO
code has to handle goes away....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC v0 1/4] vfs: add copy_range syscall and vfs entry point
@ 2013-05-15 17:50 Steve French
  2013-05-15 18:54 ` J. Bruce Fields
       [not found] ` <CAH2r5ms0P8Hgv1mUpyHA32Er38iiaC1HHC4fhxvz2SBFy6Sucw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 2 replies; 24+ messages in thread
From: Steve French @ 2013-05-15 17:50 UTC (permalink / raw)
  To: linux-fsdevel, zab-H+wXaHxf7aLQT0dZR+AlfA
  Cc: linux-cifs-u79uwXL29TY76Z2rM5mHXA

Doesn't the new syscall have to invalidate the page cache pages that
the server is about to overwrite as btrfs does with the following line
in fs/btrfs/ioctl.c

	truncate_inode_pages_range(&inode->i_data, destoff,
				   PAGE_CACHE_ALIGN(destoff + len) - 1);

(and doesn't truncate_inode_pages_range handle page cache alignment
anyway - and also why did btrfs use truncate_inode_pages_range instead
of invalidate?)

Does nfs client ever have the case where two different superblocks map
to the same nfs export (and thus the check below is restricting the
ability to do server side copy)?

+	if (inode_in->i_sb != inode_out->i_sb ||
+	    file_in->f_path.mnt != file_out->f_path.mnt)
+		return -EXDEV;

I am working on cifs client patches for the ioctl, and the new syscall
also looks pretty easy.   Some popular cifs servers (like Windows)
have supported smb/cifs copy offload for many, many years - and now
Samba with the support that David Disseldorp added for the clone range
ioctl has been supporting copychunk (server side copy from Windows to
Samba) so about time to finish the cifs client equivalent.

-- 
Thanks,

Steve

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC v0 1/4] vfs: add copy_range syscall and vfs entry point
  2013-05-15 17:50 [RFC v0 1/4] vfs: add copy_range syscall and vfs entry point Steve French
@ 2013-05-15 18:54 ` J. Bruce Fields
       [not found]   ` <20130515185429.GA25994-uC3wQj2KruNg9hUCZPvPmw@public.gmane.org>
       [not found] ` <CAH2r5ms0P8Hgv1mUpyHA32Er38iiaC1HHC4fhxvz2SBFy6Sucw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  1 sibling, 1 reply; 24+ messages in thread
From: J. Bruce Fields @ 2013-05-15 18:54 UTC (permalink / raw)
  To: Steve French; +Cc: linux-fsdevel, zab, linux-cifs

On Wed, May 15, 2013 at 12:50:13PM -0500, Steve French wrote:
> Doesn't the new syscall have to invalidate the page cache pages that
> the server is about to overwrite as btrfs does with the following line
> in fs/btrfs/ioctl.c
> 
> 	truncate_inode_pages_range(&inode->i_data, destoff,
> 				   PAGE_CACHE_ALIGN(destoff + len) - 1);
> 
> (and doesn't truncate_inode_pages_range handle page cache alignment
> anyway - and also why did btrfs use truncate_inode_pages_range instead
> of invalidate?)
> 
> Does nfs client ever have the case where two different superblocks map
> to the same nfs export (and thus the check below is restricting the
> ability to do server side copy)?
> 
> +	if (inode_in->i_sb != inode_out->i_sb ||
> +	    file_in->f_path.mnt != file_out->f_path.mnt)
> +		return -EXDEV;

The client attempts to use the same superblock whenever it can.

I suppose you're also losing the opportunity to copy between two
different filesystems on the same server, which should be faster than
requiring the client to do the copy.

--b.

> 
> I am working on cifs client patches for the ioctl, and the new syscall
> also looks pretty easy.   Some popular cifs servers (like Windows)
> have supported smb/cifs copy offload for many, many years - and now
> Samba with the support that David Disseldorp added for the clone range
> ioctl has been supporting copychunk (server side copy from Windows to
> Samba) so about time to finish the cifs client equivalent.
> 
> -- 
> Thanks,
> 
> Steve
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 24+ messages in thread

[parent not found: <20130515185429.GA25994-uC3wQj2KruNg9hUCZPvPmw@public.gmane.org>]

* Re: [RFC v0 1/4] vfs: add copy_range syscall and vfs entry point
       [not found]   ` <20130515185429.GA25994-uC3wQj2KruNg9hUCZPvPmw@public.gmane.org>
@ 2013-05-15 19:39     ` Zach Brown
  0 siblings, 0 replies; 24+ messages in thread
From: Zach Brown @ 2013-05-15 19:39 UTC (permalink / raw)
  To: J. Bruce Fields
  Cc: Steve French, linux-fsdevel, linux-cifs-u79uwXL29TY76Z2rM5mHXA

On Wed, May 15, 2013 at 02:54:29PM -0400, J. Bruce Fields wrote:
> On Wed, May 15, 2013 at 12:50:13PM -0500, Steve French wrote:
> > Does nfs client ever have the case where two different superblocks map
> > to the same nfs export (and thus the check below is restricting the
> > ability to do server side copy)?
> > 
> > +	if (inode_in->i_sb != inode_out->i_sb ||
> > +	    file_in->f_path.mnt != file_out->f_path.mnt)
> > +		return -EXDEV;
> 
> The client attempts to use the same superblock whenever it can.
> 
> I suppose you're also losing the opportunity to copy between two
> different filesystems on the same server, which should be faster than
> requiring the client to do the copy.

Ah, yeah, that'd probably be the simplest motivation to move this down
into the ->copy_range methods.  It should be trivial to test once the
nfs bits are going and we have the page cache fallback.

- z

^ permalink raw reply	[flat|nested] 24+ messages in thread

[parent not found: <CAH2r5ms0P8Hgv1mUpyHA32Er38iiaC1HHC4fhxvz2SBFy6Sucw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]

* Re: [RFC v0 1/4] vfs: add copy_range syscall and vfs entry point
       [not found] ` <CAH2r5ms0P8Hgv1mUpyHA32Er38iiaC1HHC4fhxvz2SBFy6Sucw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2013-05-15 19:36   ` Zach Brown
       [not found]     ` <20130515193600.GA318-fypN+1c5dIyjpB87vu3CluTW4wlIGRCZ@public.gmane.org>
  0 siblings, 1 reply; 24+ messages in thread
From: Zach Brown @ 2013-05-15 19:36 UTC (permalink / raw)
  To: Steve French; +Cc: linux-fsdevel, linux-cifs-u79uwXL29TY76Z2rM5mHXA

On Wed, May 15, 2013 at 12:50:13PM -0500, Steve French wrote:
> Doesn't the new syscall have to invalidate the page cache pages that
> the server is about to overwrite as btrfs does with the following line
> in fs/btrfs/ioctl.c
> 
> 	truncate_inode_pages_range(&inode->i_data, destoff,
> 				   PAGE_CACHE_ALIGN(destoff + len) - 1);

The file_operations ->copy_range implementation is responsible for this,
yeah, similar to ->write being responsible for invalidating around
O_DIRECT writes.

> (and doesn't truncate_inode_pages_range handle page cache alignment
> anyway

No, it bugs if given an end offset that isn't page aligned:

        BUG_ON((lend & (PAGE_CACHE_SIZE - 1)) != (PAGE_CACHE_SIZE - 1));

Lukas is working on a patch series to fix this:

  http://marc.info/?l=linux-fsdevel&m=136854986101843&w=2

> - and also why did btrfs use truncate_inode_pages_range instead
> of invalidate?)

I'm not sure.  Maybe they can have racing dirty pages and want them
thrown away rather than having invalidation fail?

> Does nfs client ever have the case where two different superblocks map
> to the same nfs export (and thus the check below is restricting the
> ability to do server side copy)?
> 
> +	if (inode_in->i_sb != inode_out->i_sb ||
> +	    file_in->f_path.mnt != file_out->f_path.mnt)
> +		return -EXDEV;

Ah, yes, thanks for bringing this up.  This check was brought over from
the btrfs ioctl code into the core just to make our lives less
complicated in the short term.

We may well want to push this test down into the ->copy_range methods so
that some can support these cross-sb/mnt copies.  I'm not sure how the
nfs client manages all these relationships, but yes, what you suggest
seems plausible.

There's also the long term possibility of using the nfs copy op to
offload copying between files on different servers.  The spec talks
about this quite a bit (how the client informs both servers, whether to
use nfs between the servers or some server-specific protocol), but this
certainly isn't my priority.

I figure we can worry about this once we have a ->copy_range method that
can copy files like this safely and are stopped by the test.  It's easy
enough to move the test around.

> I am working on cifs client patches for the ioctl, and the new syscall
> also looks pretty easy.   Some popular cifs servers (like Windows)
> have supported smb/cifs copy offload for many, many years - and now
> Samba with the support that David Disseldorp added for the clone range
> ioctl has been supporting copychunk (server side copy from Windows to
> Samba) so about time to finish the cifs client equivalent.

Cool, let me know if anything strange pops up. 

- z

^ permalink raw reply	[flat|nested] 24+ messages in thread

[parent not found: <20130515193600.GA318-fypN+1c5dIyjpB87vu3CluTW4wlIGRCZ@public.gmane.org>]

* Re: [RFC v0 1/4] vfs: add copy_range syscall and vfs entry point
       [not found]     ` <20130515193600.GA318-fypN+1c5dIyjpB87vu3CluTW4wlIGRCZ@public.gmane.org>
@ 2013-05-15 20:08       ` Steve French
  2013-05-15 20:16         ` Chris Mason
  0 siblings, 1 reply; 24+ messages in thread
From: Steve French @ 2013-05-15 20:08 UTC (permalink / raw)
  To: Zach Brown
  Cc: linux-fsdevel, linux-cifs-u79uwXL29TY76Z2rM5mHXA, J. Bruce Fields

On Wed, May 15, 2013 at 2:36 PM, Zach Brown <zab-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> On Wed, May 15, 2013 at 12:50:13PM -0500, Steve French wrote:
>> Doesn't the new syscall have to invalidate the page cache pages that
>> the server is about to overwrite as btrfs does with the following line
>> in fs/btrfs/ioctl.c
>>
>>       truncate_inode_pages_range(&inode->i_data, destoff,
>>                                  PAGE_CACHE_ALIGN(destoff + len) - 1);
>
> The file_operations ->copy_range implementation is responsible for this,
> yeah, similar to ->write being responsible for invalidating around
> O_DIRECT writes.
>
>> (and doesn't truncate_inode_pages_range handle page cache alignment
>> anyway
>
> No, it bugs if given an end offset that isn't page aligned:
>
>         BUG_ON((lend & (PAGE_CACHE_SIZE - 1)) != (PAGE_CACHE_SIZE - 1));
>
> Lukas is working on a patch series to fix this:
>
>   http://marc.info/?l=linux-fsdevel&m=136854986101843&w=2
>
>> - and also why did btrfs use truncate_inode_pages_range instead
>> of invalidate?)
>
> I'm not sure.  Maybe they can have racing dirty pages and want them
> thrown away rather than having invalidation fail?

Ideally cifs (and presumably) nfs would do a flush of the beginning of
the first page (before source offset) and/or the end of the last page
of the range (after target offset) if dirty (to send it to the server
before the copy) if they are not writing the complete page out, and
then toss the pages (including the first and last partial pages) from
the page cache.  Is there a good fs example for code for this?


>> Does nfs client ever have the case where two different superblocks map
>> to the same nfs export (and thus the check below is restricting the
>> ability to do server side copy)?
>>
>> +     if (inode_in->i_sb != inode_out->i_sb ||
>> +         file_in->f_path.mnt != file_out->f_path.mnt)
>> +             return -EXDEV;
>
> Ah, yes, thanks for bringing this up.  This check was brought over from
> the btrfs ioctl code into the core just to make our lives less
> complicated in the short term.
>
> We may well want to push this test down into the ->copy_range methods so
> that some can support these cross-sb/mnt copies.  I'm not sure how the
> nfs client manages all these relationships, but yes, what you suggest
> seems plausible.
>
> There's also the long term possibility of using the nfs copy op to
> offload copying between files on different servers.  The spec talks
> about this quite a bit (how the client informs both servers, whether to
> use nfs between the servers or some server-specific protocol), but this
> certainly isn't my priority.
>
> I figure we can worry about this once we have a ->copy_range method that
> can copy files like this safely and are stopped by the test.  It's easy
> enough to move the test around.
>
>> I am working on cifs client patches for the ioctl, and the new syscall
>> also looks pretty easy.   Some popular cifs servers (like Windows)
>> have supported smb/cifs copy offload for many, many years - and now
>> Samba with the support that David Disseldorp added for the clone range
>> ioctl has been supporting copychunk (server side copy from Windows to
>> Samba) so about time to finish the cifs client equivalent.
>
> Cool, let me know if anything strange pops up.
>
> - z



-- 
Thanks,

Steve

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC v0 1/4] vfs: add copy_range syscall and vfs entry point
  2013-05-15 20:08       ` Steve French
@ 2013-05-15 20:16         ` Chris Mason
       [not found]           ` <20130515201614.24668.83788-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
  0 siblings, 1 reply; 24+ messages in thread
From: Chris Mason @ 2013-05-15 20:16 UTC (permalink / raw)
  To: Steve French, Zach Brown
  Cc: linux-fsdevel, linux-cifs@vger.kernel.org, J. Bruce Fields

Quoting Steve French (2013-05-15 16:08:16)
> On Wed, May 15, 2013 at 2:36 PM, Zach Brown <zab@redhat.com> wrote:
> > On Wed, May 15, 2013 at 12:50:13PM -0500, Steve French wrote:
> >> Doesn't the new syscall have to invalidate the page cache pages that
> >> the server is about to overwrite as btrfs does with the following line
> >> in fs/btrfs/ioctl.c
> >>
> >>       truncate_inode_pages_range(&inode->i_data, destoff,
> >>                                  PAGE_CACHE_ALIGN(destoff + len) - 1);
> >
> > The file_operations ->copy_range implementation is responsible for this,
> > yeah, similar to ->write being responsible for invalidating around
> > O_DIRECT writes.
> >
> >> (and doesn't truncate_inode_pages_range handle page cache alignment
> >> anyway
> >
> > No, it bugs if given an end offset that isn't page aligned:
> >
> >         BUG_ON((lend & (PAGE_CACHE_SIZE - 1)) != (PAGE_CACHE_SIZE - 1));
> >
> > Lukas is working on a patch series to fix this:
> >
> >   http://marc.info/?l=linux-fsdevel&m=136854986101843&w=2
> >
> >> - and also why did btrfs use truncate_inode_pages_range instead
> >> of invalidate?)
> >
> > I'm not sure.  Maybe they can have racing dirty pages and want them
> > thrown away rather than having invalidation fail?

truncate_inode_pages_range ends with invalidate.  It just locks the page
first and makes sure to wait for writeback if it was already in
progress.

Looking at the btrfs side, I think we should add some locking on the
destination extents as well.


^ permalink raw reply	[flat|nested] 24+ messages in thread

[parent not found: <20130515201614.24668.83788-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>]

* Re: [RFC v0 1/4] vfs: add copy_range syscall and vfs entry point
       [not found]           ` <20130515201614.24668.83788-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
@ 2013-05-15 20:21             ` Steve French
  2013-05-15 20:25               ` Chris Mason
  0 siblings, 1 reply; 24+ messages in thread
From: Steve French @ 2013-05-15 20:21 UTC (permalink / raw)
  To: Chris Mason
  Cc: Zach Brown, linux-fsdevel,
	linux-cifs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	J. Bruce Fields

On Wed, May 15, 2013 at 3:16 PM, Chris Mason <chris.mason-5c4llco8/ftWk0Htik3J/w@public.gmane.org> wrote:
> Quoting Steve French (2013-05-15 16:08:16)
>> On Wed, May 15, 2013 at 2:36 PM, Zach Brown <zab-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
>> > On Wed, May 15, 2013 at 12:50:13PM -0500, Steve French wrote:
>> >> Doesn't the new syscall have to invalidate the page cache pages that
>> >> the server is about to overwrite as btrfs does with the following line
>> >> in fs/btrfs/ioctl.c
>> >>
>> >>       truncate_inode_pages_range(&inode->i_data, destoff,
>> >>                                  PAGE_CACHE_ALIGN(destoff + len) - 1);
>> >
>> > The file_operations ->copy_range implementation is responsible for this,
>> > yeah, similar to ->write being responsible for invalidating around
>> > O_DIRECT writes.
>> >
>> >> (and doesn't truncate_inode_pages_range handle page cache alignment
>> >> anyway
>> >
>> > No, it bugs if given an end offset that isn't page aligned:
>> >
>> >         BUG_ON((lend & (PAGE_CACHE_SIZE - 1)) != (PAGE_CACHE_SIZE - 1));
>> >
>> > Lukas is working on a patch series to fix this:
>> >
>> >   http://marc.info/?l=linux-fsdevel&m=136854986101843&w=2
>> >
>> >> - and also why did btrfs use truncate_inode_pages_range instead
>> >> of invalidate?)
>> >
>> > I'm not sure.  Maybe they can have racing dirty pages and want them
>> > thrown away rather than having invalidation fail?
>
> truncate_inode_pages_range ends with invalidate.  It just locks the page
> first and makes sure to wait for writeback if it was already in
> progress.

Does

     truncate_inode_pages_range(&inode->i_data, destoff,
                            PAGE_CACHE_ALIGN(destoff + len) - 1);


I also am worried about what does:

1) writing out data from start of previous page up to destoff (if
destoff not page aligned, and first page dirty)
2) writing out data from dest+len through end of page (if destoff not
page aligned, and last page dirty)

before tossing first through last page out


-- 
Thanks,

Steve

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC v0 1/4] vfs: add copy_range syscall and vfs entry point
  2013-05-15 20:21             ` Steve French
@ 2013-05-15 20:25               ` Chris Mason
  0 siblings, 0 replies; 24+ messages in thread
From: Chris Mason @ 2013-05-15 20:25 UTC (permalink / raw)
  To: Steve French
  Cc: Zach Brown, linux-fsdevel, linux-cifs@vger.kernel.org,
	J. Bruce Fields

Quoting Steve French (2013-05-15 16:21:12)
> On Wed, May 15, 2013 at 3:16 PM, Chris Mason <chris.mason@fusionio.com> wrote:
> > truncate_inode_pages_range ends with invalidate.  It just locks the page
> > first and makes sure to wait for writeback if it was already in
> > progress.
> 
> Does
> 
>      truncate_inode_pages_range(&inode->i_data, destoff,
>                             PAGE_CACHE_ALIGN(destoff + len) - 1);
> 
> 
> I also am worried about what does:
> 
> 1) writing out data from start of previous page up to destoff (if
> destoff not page aligned, and first page dirty)
> 2) writing out data from dest+len through end of page (if destoff not
> page aligned, and last page dirty)
> 
> before tossing first through last page out

It does a check for a partial first page (sending down the partial
offset to invalidatepage), but the last page is fully truncated.

-chris


^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2013-05-21 19:50 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2013-05-14 21:15 [RFC v0 0/4] sys_copy_range() rough draft Zach Brown
2013-05-14 21:15 ` [RFC v0 1/4] vfs: add copy_range syscall and vfs entry point Zach Brown
2013-05-15 19:44   ` Eric Wong
2013-05-15 20:03     ` Zach Brown
2013-05-16 21:16       ` Ric Wheeler
2013-05-21 19:47       ` Eric Wong
2013-05-21 19:50         ` Zach Brown
2013-05-14 21:15 ` [RFC v0 2/4] x86: add sys_copy_range to syscall tables Zach Brown
2013-05-14 21:15 ` [RFC v0 3/4] btrfs: add .copy_range file operation Zach Brown
     [not found] ` <1368566126-17610-1-git-send-email-zab-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2013-05-14 21:15   ` [RFC v0 4/4] nfs, nfsd: rough sys_copy_range and COPY support Zach Brown
     [not found]     ` <1368566126-17610-5-git-send-email-zab-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2013-05-15 20:19       ` J. Bruce Fields
     [not found]         ` <20130515201949.GD25994-uC3wQj2KruNg9hUCZPvPmw@public.gmane.org>
2013-05-15 20:21           ` Myklebust, Trond
2013-05-15 20:24             ` J. Bruce Fields
2013-05-14 21:42   ` [RFC v0 0/4] sys_copy_range() rough draft Dave Chinner
2013-05-14 22:04     ` Zach Brown
2013-05-15  1:01       ` Dave Chinner
  -- strict thread matches above, loose matches on Subject: below --
2013-05-15 17:50 [RFC v0 1/4] vfs: add copy_range syscall and vfs entry point Steve French
2013-05-15 18:54 ` J. Bruce Fields
     [not found]   ` <20130515185429.GA25994-uC3wQj2KruNg9hUCZPvPmw@public.gmane.org>
2013-05-15 19:39     ` Zach Brown
     [not found] ` <CAH2r5ms0P8Hgv1mUpyHA32Er38iiaC1HHC4fhxvz2SBFy6Sucw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2013-05-15 19:36   ` Zach Brown
     [not found]     ` <20130515193600.GA318-fypN+1c5dIyjpB87vu3CluTW4wlIGRCZ@public.gmane.org>
2013-05-15 20:08       ` Steve French
2013-05-15 20:16         ` Chris Mason
     [not found]           ` <20130515201614.24668.83788-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
2013-05-15 20:21             ` Steve French
2013-05-15 20:25               ` Chris Mason

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).