[GIT PULL] ocfs2 changes for 2.6.32

* [GIT PULL] ocfs2 changes for 2.6.32
@ 2009-09-11 20:04 Joel Becker
  2009-09-14 21:32 ` Linus Torvalds
  0 siblings, 1 reply; 33+ messages in thread
From: Joel Becker @ 2009-09-11 20:04 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Mark Fasheh, Andrew Morton, linux-kernel, ocfs2-devel

Linus, et al,
	Here are the ocfs2 feature changes for 2.6.32.  The big ticket
item is the reflinkat(2) system call and ocfs2's support for it.  The
ocfs2 support accounts for all but a handful of the changes.  The
remaining few patches are fixes.
	Please pull.

Joel

The following changes since commit 8379e7c46cc48f51197dd663fc6676f47f2a1e71:
  Sunil Mushran (1):
        ocfs2: ocfs2_write_begin_nolock() should handle len=0

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2.git upstream-linus

Coly Li (1):
      dlmglue.c: add missed mlog lines

Joel Becker (41):
      ocfs2: Make the ocfs2_caching_info structure self-contained.
      ocfs2: Change metadata caching locks to an operations structure.
      ocfs2: Take the inode out of the metadata read/write paths.
      ocfs2: move ip_last_trans to struct ocfs2_caching_info
      ocfs2: move ip_created_trans to struct ocfs2_caching_info
      ocfs2: Pass struct ocfs2_caching_info to the journal functions.
      ocfs2: Store the ocfs2_caching_info on ocfs2_extent_tree.
      ocfs2: Pass ocfs2_caching_info to ocfs2_read_extent_block().
      ocfs2: ocfs2_find_path() only needs the caching info
      ocfs2: ocfs2_create_new_meta_bhs() doesn't need struct inode.
      ocfs2: Pass ocfs2_extent_tree to ocfs2_unlink_path()
      ocfs2: ocfs2_complete_edge_insert() doesn't need struct inode at all.
      ocfs2: Get inode out of ocfs2_rotate_subtree_root_right().
      ocfs2: Pass ocfs2_extent_tree to ocfs2_get_subtree_root()
      ocfs2: Drop struct inode from ocfs2_extent_tree_operations.
      ocfs2: ocfs2_rotate_tree_right() doesn't need struct inode.
      ocfs2: ocfs2_update_edge_lengths() doesn't need struct inode.
      ocfs2: ocfs2_rotate_subtree_left() doesn't need struct inode.
      ocfs2: __ocfs2_rotate_tree_left() doesn't need struct inode.
      ocfs2: ocfs2_rotate_tree_left() no longer needs struct inode.
      ocfs2: ocfs2_merge_rec_left/right() no longer need struct inode.
      ocfs2: ocfs2_try_to_merge_extent() doesn't need struct inode.
      ocfs2: ocfs2_grow_branch() and ocfs2_append_rec_to_path() lose struct inode.
      ocfs2: ocfs2_truncate_rec() doesn't need struct inode.
      ocfs2: Make truncating the extent map an extent_tree_operation.
      ocfs2: ocfs2_insert_at_leaf() doesn't need struct inode.
      ocfs2: Give ocfs2_split_record() an extent_tree instead of an inode.
      ocfs2: ocfs2_do_insert_extent() and ocfs2_insert_path() no longer need an inode.
      ocfs2: ocfs2_extent_contig() only requires the superblock.
      ocfs2: Swap inode for extent_tree in ocfs2_figure_merge_contig_type().
      ocfs2: Remove inode from ocfs2_figure_extent_contig().
      ocfs2: ocfs2_figure_insert_type() no longer needs struct inode.
      ocfs2: Make extent map insertion an extent_tree_operation.
      ocfs2: ocfs2_insert_extent() no longer needs struct inode.
      ocfs2: ocfs2_add_clusters_in_btree() no longer needs struct inode.
      ocfs2: ocfs2_remove_extent() no longer needs struct inode.
      ocfs2: ocfs2_split_and_insert() no longer needs struct inode.
      ocfs2: Teach ocfs2_replace_extent_rec() to use an extent_tree.
      ocfs2: __ocfs2_mark_extent_written() doesn't need struct inode.
      ocfs2: Pass ocfs2_caching_info into ocfs_init_*_extent_tree().
      fs: Add the reflink() operation and reflinkat(2) system call.

Sunil Mushran (1):
      ocfs2: __ocfs2_abort() should not enable panic for local mounts

Tao Ma (42):
      ocfs2: Define refcount tree structure.
      ocfs2: Add metaecc for ocfs2_refcount_block.
      ocfs2: Add ocfs2_read_refcount_block.
      ocfs2: Abstract caching info checkpoint.
      ocfs2: Add new refcount tree lock resource in dlmglue.
      ocfs2: Add caching info for refcount tree.
      ocfs2: Add refcount tree lock mechanism.
      ocfs2: Basic tree root operation.
      ocfs2: Wrap ocfs2_extent_contig in ocfs2_extent_tree.
      ocfs2: Abstract extent split process.
      ocfs2: Add refcount b-tree as a new extent tree.
      ocfs2: move tree path functions to alloc.h.
      ocfs2: Add support for incrementing refcount in the tree.
      ocfs2: Add support of decrementing refcount for delete.
      ocfs2: Add functions for extents refcounted.
      ocfs2: Decrement refcount when truncating refcounted extents.
      ocfs2: Add CoW support.
      ocfs2: CoW refcount tree improvement.
      ocfs2: Integrate CoW in file write.
      ocfs2: CoW a reflinked cluster when it is truncated.
      ocfs2: Add normal functions for reflink a normal file's extents.
      ocfs2: handle file attributes issue for reflink.
      ocfs2: Return extent flags for xattr value tree.
      ocfs2: Abstract duplicate clusters process in CoW.
      ocfs2: Add CoW support for xattr.
      ocfs2: Remove inode from ocfs2_xattr_bucket_get_name_value.
      ocfs2: Abstract the creation of xattr block.
      ocfs2: Abstract ocfs2 xattr tree extend rec iteration process.
      ocfs2: Attach xattr clusters to refcount tree.
      ocfs2: Call refcount tree remove process properly.
      ocfs2: Create an xattr indexed block if needed.
      ocfs2: Add reflink support for xattr.
      ocfs2: Modify removing xattr process for refcount.
      ocfs2: Don't merge in 1st refcount ops of reflink.
      ocfs2: Make transaction extend more efficient.
      ocfs2: Use proper parameter for some inode operation.
      ocfs2: Create reflinked file in orphan dir.
      ocfs2: Add preserve to reflink.
      ocfs2: Implement ocfs2_reflink.
      ocfs2: Enable refcount tree support.
      ocfs2: Add ioctl for reflink.
      ocfs2: Use buffer IO if we are appending a file.

Wengang Wang (1):
      ocfs2: add spinlock protection when dealing with lockres->purge.

 Documentation/filesystems/reflink.txt |  174 ++
 Documentation/filesystems/vfs.txt     |    4 +
 arch/x86/ia32/ia32entry.S             |    1 +
 arch/x86/include/asm/unistd_32.h      |    1 +
 arch/x86/include/asm/unistd_64.h      |    2 +
 arch/x86/kernel/syscall_table_32.S    |    1 +
 fs/namei.c                            |  137 ++
 fs/ocfs2/Makefile                     |    1 +
 fs/ocfs2/alloc.c                      | 1342 ++++++-----
 fs/ocfs2/alloc.h                      |  101 +-
 fs/ocfs2/aops.c                       |   37 +-
 fs/ocfs2/aops.h                       |    2 +
 fs/ocfs2/buffer_head_io.c             |   47 +-
 fs/ocfs2/buffer_head_io.h             |    8 +-
 fs/ocfs2/cluster/masklog.c            |    1 +
 fs/ocfs2/cluster/masklog.h            |    1 +
 fs/ocfs2/dir.c                        |  107 +-
 fs/ocfs2/dlm/dlmthread.c              |    6 +-
 fs/ocfs2/dlmglue.c                    |  105 +-
 fs/ocfs2/dlmglue.h                    |    6 +
 fs/ocfs2/extent_map.c                 |   33 +-
 fs/ocfs2/extent_map.h                 |    8 +-
 fs/ocfs2/file.c                       |  151 ++-
 fs/ocfs2/file.h                       |    2 +
 fs/ocfs2/inode.c                      |   86 +-
 fs/ocfs2/inode.h                      |   20 +-
 fs/ocfs2/ioctl.c                      |   14 +
 fs/ocfs2/journal.c                    |   82 +-
 fs/ocfs2/journal.h                    |   94 +-
 fs/ocfs2/localalloc.c                 |   12 +-
 fs/ocfs2/namei.c                      |  343 +++-
 fs/ocfs2/namei.h                      |    6 +
 fs/ocfs2/ocfs2.h                      |   52 +-
 fs/ocfs2/ocfs2_fs.h                   |  107 +-
 fs/ocfs2/ocfs2_lockid.h               |    5 +
 fs/ocfs2/quota_global.c               |    5 +-
 fs/ocfs2/quota_local.c                |   26 +-
 fs/ocfs2/refcounttree.c               | 4249 +++++++++++++++++++++++++++++++++
 fs/ocfs2/refcounttree.h               |  108 +
 fs/ocfs2/resize.c                     |   16 +-
 fs/ocfs2/slot_map.c                   |   10 +-
 fs/ocfs2/suballoc.c                   |   35 +-
 fs/ocfs2/super.c                      |   13 +-
 fs/ocfs2/uptodate.c                   |  265 ++-
 fs/ocfs2/uptodate.h                   |   51 +-
 fs/ocfs2/xattr.c                      | 2056 +++++++++++++++--
 fs/ocfs2/xattr.h                      |   15 +-
 include/linux/fcntl.h                 |    8 +
 include/linux/fs.h                    |    2 +
 include/linux/security.h              |   23 +
 include/linux/syscalls.h              |    3 +
 security/capability.c                 |    7 +
 security/security.c                   |    8 +
 53 files changed, 8823 insertions(+), 1176 deletions(-)
 create mode 100644 Documentation/filesystems/reflink.txt
 create mode 100644 fs/ocfs2/refcounttree.c
 create mode 100644 fs/ocfs2/refcounttree.h
-- 

Life's Little Instruction Book #99

	"Think big thoughts, but relish small pleasures."

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [GIT PULL] ocfs2 changes for 2.6.32
  2009-09-11 20:04 [GIT PULL] ocfs2 changes for 2.6.32 Joel Becker
@ 2009-09-14 21:32 ` Linus Torvalds
  2009-09-14 22:14   ` Joel Becker
                     ` (2 more replies)
  0 siblings, 3 replies; 33+ messages in thread
From: Linus Torvalds @ 2009-09-14 21:32 UTC (permalink / raw)
  To: Joel Becker; +Cc: Mark Fasheh, Andrew Morton, linux-kernel, ocfs2-devel

On Fri, 11 Sep 2009, Joel Becker wrote:
>
> Linus, et al,
> 	Here are the ocfs2 feature changes for 2.6.32.  The big ticket
> item is the reflinkat(2) system call and ocfs2's support for it.  The
> ocfs2 support accounts for all but a handful of the changes.  The
> remaining few patches are fixes.

I _really_ want some kind of acks for new filesystem system calls like
this. I'm not going to pull a new 'reflink[at]()' system call just based 
on a single filesystem.

Yes, there's clearly been _some_ discussion, but (a) I've not seen it 
(since it's been on 'fsdevel', which is one of those single-topic mailing 
lists that I'm totally uninterested in, since they tend to become clique 
groups) and (b) you don't even say whether the thing has been acked by 
things like the security angle etc.

So I'm not pulling this. Not until I get the feeling that there is 
consensus.

I also don't understand why it's called 'reflink'. Why not 'copyfile'? We 
should not name things by implementation, we should name things by what 
they _do_. And I'm not seeing what is so 'reflink' about this that it's 
not a 'copyfile'. I also am not entirely clear on why you need the source 
name, and not - for example - an 'fd'.

Are we going to add 'freflink[at]()' at some point?

So I want explanations for the naming, I want sign-offs from other 
filesystem (and security) people, etc. What I do _not_ want is to get a 
"please pull" request for a filesystem, and notice that it's suddenly not 
all about just that particular filesystem, without any indication of who 
you've been talking to etc etc.

			Linus

* Re: [GIT PULL] ocfs2 changes for 2.6.32
  2009-09-14 21:32 ` Linus Torvalds
@ 2009-09-14 22:14   ` Joel Becker
  2009-09-14 23:27     ` Linus Torvalds
  2009-09-15  6:44   ` Miklos Szeredi
  2009-09-23 11:02   ` [GIT PULL] ocfs2 changes for 2.6.32 (take 2, no syscall) Joel Becker
  2 siblings, 1 reply; 33+ messages in thread
From: Joel Becker @ 2009-09-14 22:14 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Mark Fasheh, Andrew Morton, linux-kernel, ocfs2-devel

On Mon, Sep 14, 2009 at 02:32:36PM -0700, Linus Torvalds wrote:
> On Fri, 11 Sep 2009, Joel Becker wrote:
> >
> > Linus, et al,
> > 	Here are the ocfs2 feature changes for 2.6.32.  The big ticket
> > item is the reflinkat(2) system call and ocfs2's support for it.  The
> > ocfs2 support accounts for all but a handful of the changes.  The
> > remaining few patches are fixes.
> 
> I _really_ want some kind of acks for new filesystem system calls like
> this. I'm not going to pull a new 'reflink[at]()' system call just based 
> on a single filesystem.

	I'll get specific acks.  I sent it via ocfs2.git because others
recommended I not send it upstream in June but instead wait until
I had at least one filesystem implementing it.

> Yes, there's clearly been _some_ discussion, but (a) I've not seen it 
> (since it's been on 'fsdevel', which is one of those single-topic mailing 
> lists that I'm totally uninterested in, since they tend to become clique 
> groups) and (b) you don't even say whether the thing has been acked by 
> things like the security angle etc.

	Fair enough.  Don't worry, the security folks were involved.
I'll get direct acks.

> I also don't understand why it's called 'reflink'. Why not 'copyfile'? We 
> should not name things by implementation, we should name things by what 
> they _do_. And I'm not seeing what is so 'reflink' about this that it's 
> not a 'copyfile'. I also am not entirely clear on why you need the source 
> name, and not - for example - an 'fd'.
> 
> Are we going to add 'freflink[at]()' at some point?

	It's a link(2) analogue.  symlink(2) has the loosest coupling,
and reflink(2) the highest.  We're not going to add freflink[at]().
It's a snap, not a copy.  It can be used to implement a copy, and
copyfile() in libc can be written with reflinkat(2), but it isn't just a
copy.

Joel

-- 

"There is shadow under this red rock.
 (Come in under the shadow of this red rock)
 And I will show you something different from either
 Your shadow at morning striding behind you
 Or your shadow at evening rising to meet you.
 I will show you fear in a handful of dust."


* Re: [GIT PULL] ocfs2 changes for 2.6.32
  2009-09-14 22:14   ` Joel Becker
@ 2009-09-14 23:27     ` Linus Torvalds
  2009-09-15  0:04       ` Joel Becker
  0 siblings, 1 reply; 33+ messages in thread
From: Linus Torvalds @ 2009-09-14 23:27 UTC (permalink / raw)
  To: Joel Becker; +Cc: Mark Fasheh, Andrew Morton, linux-kernel, ocfs2-devel

On Mon, 14 Sep 2009, Joel Becker wrote:
> 
> 	It's a link(2) analogue.  symlink(2) has the loosest coupling,
> and reflink(2) the highest.  We're not going to add freflink[at]().
> It's a snap, not a copy.  It can be used to implement a copy, and
> copyfile() in libc can be written with reflinkat(2), but it isn't just a
> copy.

From all but a performance standpoint, it's a copy. It has absolutely
_zero_ "link" semantics. When you do a symlink or a hardlink, you see it 
in the resulting semantics: changing one changes the other. 

This 'reflink' has no such semantics that I can tell. It has purely copy 
semantics, never mind that it's optimized.

And the thing to note is that it doesn't even have to be optimized as a 
"link". Think about network filesystems: maybe they want to implement this 
thing as a server-side "copy" operation (with atomicity guarantees).

In other words, I can well imagine that for some filesystems, there really 
is no refcounting or linking implied, and that's why I think naming should 
be about semantics, not some random implementation issue that just happens 
to be true for some particular class of filesystems.

So tell me - are there actually any non-copying semantics as far as the 
_user_ is concerned? Is there some reason why a NFS server might not 
implement this as a server-side copy? Is there something fundamentally in 
this all that is about reference counting as far as a user is concerned?

I also still didn't get any answer to the "freflink()" question. You just 
said that we wouldn't do it, with no explanation. Why? We've discussed 
'flink()' in the past, I just want to know that when we do a new system 
call there is some _reason_ why it's not going to explode into many 
different variants later...

			Linus

* Re: [GIT PULL] ocfs2 changes for 2.6.32
  2009-09-14 23:27     ` Linus Torvalds
@ 2009-09-15  0:04       ` Joel Becker
  2009-09-15  0:31         ` Linus Torvalds
  0 siblings, 1 reply; 33+ messages in thread
From: Joel Becker @ 2009-09-15  0:04 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Mark Fasheh, Andrew Morton, linux-kernel, ocfs2-devel

On Mon, Sep 14, 2009 at 04:27:59PM -0700, Linus Torvalds wrote:
> On Mon, 14 Sep 2009, Joel Becker wrote:
> From all but a performance standpoint, it's a copy. It has absolutely
> _zero_ "link" semantics. When you do a symlink or a hardlink, you see it 
> in the resulting semantics: changing one changes the other. 

	It's creating a new entry in the name space based on an old one.

> And the thing to note is that it doesn't even have to be optimized as a 
> "link". Think about network filesystems: maybe they want to implement this 
> thing as a server-side "copy" operation (with atomicity guarantees).

	reflink doesn't merely guarantee atomicity, it guarantees the
shared data extents.  Under the auspices of reflink a network filesystem
cannot merely provide an atomic copy.  A separate copyfile call might
allow that, but reflink doesn't.  This is deliberate, because the caller
wants the shared storage, not just a copy.

> I also still didn't get any answer to the "freflink()" question. You just 
> said that we wouldn't do it, with no explanation. Why? We've discussed 
> 'flink()' in the past, I just want to know that when we do a new system 
> call there is some _reason_ why it's not going to explode into many 
> different variants later...

	Well, obviously I started from the fact that we don't have
flink().  But it doesn't really fit anyway.  reflink is a namespace
operation - give me a new item in the namespace that shares the data
extents of the old item.  So working from a file descriptor doesn't
quite fit.  Plus, flink and freflink would have to deal with
recovering already-orphaned inodes.
	Where do you stand on flink?  If it actually makes sense to
you, then perhaps we should consider it and freflinkat.  It doesn't
strike me as the way to go, but throughout all the discussion I'm quite
willing to be convinced.

Joel

-- 

"I don't know anything about music. In my line you don't have
 to."
        - Elvis Presley


* Re: [GIT PULL] ocfs2 changes for 2.6.32
  2009-09-15  0:04       ` Joel Becker
@ 2009-09-15  0:31         ` Linus Torvalds
  2009-09-15  0:54           ` Joel Becker
  0 siblings, 1 reply; 33+ messages in thread
From: Linus Torvalds @ 2009-09-15  0:31 UTC (permalink / raw)
  To: Joel Becker; +Cc: Mark Fasheh, Andrew Morton, linux-kernel, ocfs2-devel

On Mon, 14 Sep 2009, Joel Becker wrote:
> 
> 	It's creating a new entry in the name space based on an old one.

That's just a cumbersome way of saying "copyfile".

Here's a challenge for you: go outside, take the first five people you 
meet at random, and ask them what a 'copyfile()' system call would do.

Then, do the same thing with 'reflink()'.

Feel free to stack the deck, so that the people you ask about 'reflink()' 
actually know computers.

Then report back which group guessed better what the system call does.

> 	reflink doesn't merely guarantee atomicity, it guarantees the
> shared data extents.

Why?

That just limits its usefulness. What's the reason for that sophistry, 
except to try to argue for a name that makes no sense?

> 	Well, obviously I started from the fact that we don't have
> flink().  But it doesn't really fit anyway.  reflink is a namespace
> operation - give me a new item in the namespace that shares the data
> extents of the old item.

That's not a namespace op, EXCEPT FOR THE NEW NAME.

The data you share from has no namespace component to it, except as a 
lookup. But a 'fd' is equally descriptive of the shared data.

		Linus

* Re: [GIT PULL] ocfs2 changes for 2.6.32
  2009-09-15  0:31         ` Linus Torvalds
@ 2009-09-15  0:54           ` Joel Becker
  2009-09-15  2:01             ` Linus Torvalds
  0 siblings, 1 reply; 33+ messages in thread
From: Joel Becker @ 2009-09-15  0:54 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Mark Fasheh, Andrew Morton, linux-kernel, ocfs2-devel

On Mon, Sep 14, 2009 at 05:31:27PM -0700, Linus Torvalds wrote:
> On Mon, 14 Sep 2009, Joel Becker wrote:
> > 	reflink doesn't merely guarantee atomicity, it guarantees the
> > shared data extents.
> 
> Why?
> 
> That just limits its usefulness. What's the reason for that sophistry, 
> except to try to argue for a name that makes no sense?

	This originally came from the idea of creating file snapshots.
That was our original goal, but the more generic reflink call allows
more than snapshots to be built.  You can use it to implement copyfile
or clone or a variety of things.  But the snapshot capability is what
really motivates, and removing the shared data requirement means
removing that capability.  Like any API we have, if it can degrade, you
have to assume it degraded.  A reflink/copyfile that can just copy means
you have to assume it copied and didn't conserve space.  This makes it
useless for snapshotting or cloning.
	In the reflink discussion before, I proposed that a separate
copyfile() syscall could be written that uses the same ->reflink() inode
operation but allows degradation in the storage handling.  This would be
a little more capable than a glibc copyfile() written around reflink
because it would get the atomicity right.  The separate copyfile/reflink
calls would handle the different requirements of storage handling.  I
just concentrated on reflink and didn't worry about that alternate
copyfile at the time being.
	I'm open to another proposal on how to do it.  As a user, I need
a way to ask for a reflink/copyfile that fails if it can't share the
data.  Things like snapshots and cloning gold VM images can't be
doubling the storage.  They become pointless.
	About the name, the reflink name came out of "you call it like
link(2)" and "the storage is reference counted CoW".  It really works
well as "ln -r".  Folks at the filesystem summit liked it, so I didn't
change it.  It's not so much that it has to be "reflink", but I've
avoided "copyfile" because copyfile intuitively sounds like you
describe, including the plain-copy fallback.  Want me to call the
requires-shared-data-because-its-a-snap version snapfileat(2)?
Something better?

> > 	Well, obviously I started from the fact that we don't have
> > flink().  But it doesn't really fit anyway.  reflink is a namespace
> > operation - give me a new item in the namespace that shares the data
> > extents of the old item.
> 
> That's not a namespace op, EXCEPT FOR THE NEW NAME.
> 
> The data you share from has no namespace component to it, except as a 
> lookup. But a 'fd' is equally descriptive of the shared data.

	Ok, I gather that you find freflink (and by extension, flink)
compelling.  I can certainly implement it.

Joel

-- 

A good programming language should have features that make the
kind of people who use the phrase "software engineering" shake
their heads disapprovingly.
	- Paul Graham


* Re: [GIT PULL] ocfs2 changes for 2.6.32
  2009-09-15  0:54           ` Joel Becker
@ 2009-09-15  2:01             ` Linus Torvalds
  2009-09-15  4:05               ` Arjan van de Ven
  2009-09-15  4:06               ` Joel Becker
  0 siblings, 2 replies; 33+ messages in thread
From: Linus Torvalds @ 2009-09-15  2:01 UTC (permalink / raw)
  To: Joel Becker
  Cc: Mark Fasheh, Andrew Morton, Linux Kernel Mailing List,
	ocfs2-devel

On Mon, 14 Sep 2009, Joel Becker wrote:
>
> 	In the reflink discussion before, I proposed that a separate
> copyfile() syscall could be written that uses the same ->reflink() inode
> operation but allows degradation in the storage handling.

.. exactly how?

If you're talking about falling back to manually just copying the data, 
then nobody is interested in that. User space can do that better with a 
simple read-write loop or with splice, or whatever. There's no reason
what-so-ever to do that.
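
A minimal sketch of the "simple read-write loop" mentioned above, in C. The helper name copy_fd_loop() and the buffer size are assumptions made for illustration; neither comes from the thread or from the patch set.

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/*
 * Hypothetical helper: copy everything readable from in_fd to out_fd
 * with a plain read/write loop.  Returns 0 on success, -1 on error
 * (errno is left as set by the failing call).
 */
static int copy_fd_loop(int in_fd, int out_fd)
{
	char buf[65536];

	assert(in_fd >= 0 && out_fd >= 0);

	for (;;) {
		ssize_t n = read(in_fd, buf, sizeof(buf));

		if (n == 0)
			return 0;		/* end of file */
		if (n < 0) {
			if (errno == EINTR)
				continue;	/* interrupted, retry the read */
			return -1;
		}

		/* write() may be short, so loop until the chunk is out. */
		for (ssize_t off = 0; off < n; ) {
			ssize_t w = write(out_fd, buf + off, (size_t)(n - off));

			if (w < 0) {
				if (errno == EINTR)
					continue;	/* retry the write */
				return -1;
			}
			off += w;
		}
	}
}
```

Every byte crosses into userspace and back, which is the cost a server-side or shared-extent copy would avoid.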

But the thing is, network filesystems may be able to do server-side 
copies, and the point being that they can do so _without_ transferring the 
data to the client (and back). And if we do 'copyfile' (under whatever 
name) for one filesystem, then I think we should strive to make sure that 
it's useful for other filesystems too.

Just google for "NFS Server-side Copy". And SMB has had a COPY command 
from the very beginning, I think.

And as far as I can tell, neither NFS nor CIFS could use your definition 
of 'reflink()'. They aren't reflinks. Or rather, _could_ be, on the 
server, of course, but what some people want to do is to avoid moving data 
over the network. So it's not about "don't use more diskspace" for that 
kind of application.

Do we really want to introduce a new filesystem operation that is likely 
to be broken for something like that?

Now, it's possible that nobody will ever care, and that NFS server-side 
copy goes the way of a lot of other failed trials. But I really hope you 
have at least _talked_ to some CIFS/NFS people about this.

[ Btw, it's quite possible that CIFS/NFS people would want more than a 
  single entrypoint. I think they might want partial copies and status 
  updates etc, which would likely mean that a single ->copyfile() thing 
  isn't sufficient.

  Maybe it's not worth it, and the complexity of something like that gets 
  to be too annoying. But I don't get the feeling that you've even _tried_ 
  to see if this can be generalized to something that would be much more 
  widely useful ]

Now, I can see that you might want to say "fail rather than use double 
the diskspace for data". But why not just do that as a flag? You already 
have flags for 'copy extended attributes or not'. Why not have a flag that 
says 'copy only if you can do it without any extra space'?
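
One way to read the flag suggestion above is as a single entry point whose behaviour is selected by flags. The sketch below models only that decision logic, in C. COPYFILE_NO_EXTRA_SPACE, enum fs_capability, enum copy_plan and copyfile_plan() are hypothetical names invented for illustration, and the error codes are assumptions; none of this corresponds to the patch set or to any kernel interface.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical flag: fail rather than use extra space for the data. */
#define COPYFILE_NO_EXTRA_SPACE 0x1u

/* Hypothetical description of what a filesystem can do for a copy. */
enum fs_capability {
	FS_CAP_NONE,		/* no in-filesystem copy support */
	FS_CAP_SERVER_COPY,	/* server duplicates the data, no client traffic */
	FS_CAP_SHARE_EXTENTS,	/* CoW sharing, no extra data space */
};

enum copy_plan {
	PLAN_SHARE,		/* share extents copy-on-write */
	PLAN_SERVER_COPY,	/* let the server duplicate the data */
};

/*
 * Decide how a request would be satisfied.  Returns 0 and fills *plan,
 * or a negative errno value when the request cannot be honoured.
 */
static int copyfile_plan(enum fs_capability cap, unsigned int flags,
			 enum copy_plan *plan)
{
	assert(plan != NULL);

	switch (cap) {
	case FS_CAP_SHARE_EXTENTS:
		*plan = PLAN_SHARE;
		return 0;
	case FS_CAP_SERVER_COPY:
		if (flags & COPYFILE_NO_EXTRA_SPACE)
			return -EOPNOTSUPP;	/* caller insisted on sharing */
		*plan = PLAN_SERVER_COPY;
		return 0;
	case FS_CAP_NONE:
	default:
		return -ENOSYS;		/* caller falls back to its own loop */
	}
}
```

With the flag set, a snapshot-style caller gets an error instead of a silent doubling of storage; without it, a network filesystem is free to use a server-side copy.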

			Linus

* Re: [GIT PULL] ocfs2 changes for 2.6.32
  2009-09-15  2:01             ` Linus Torvalds
@ 2009-09-15  4:05               ` Arjan van de Ven
  2009-09-15  4:35                 ` Joel Becker
  2009-09-15  4:06               ` Joel Becker
  1 sibling, 1 reply; 33+ messages in thread
From: Arjan van de Ven @ 2009-09-15  4:05 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Joel Becker, Mark Fasheh, Andrew Morton,
	Linux Kernel Mailing List, ocfs2-devel

On Mon, 14 Sep 2009 19:01:06 -0700 (PDT)
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> 
> 
> On Mon, 14 Sep 2009, Joel Becker wrote:
> >
> > 	In the reflink discussion before, I proposed that a separate
> > copyfile() syscall could be written that uses the same ->reflink()
> > inode operation but allows degradation in the storage handling.
> 
> .. exactly how?
> 
> If you're talking about falling back to manually just copying the
> data, then nobody is interested in that. User space can do that
> better with a simple read-write loop or with splice, or whatever.
> There's no reason what-so-ever to do that.
> 
> But the thing is, network filesystems may be able to do server-side 
> copies, and the point being that they can do so _without_
> transferring the data to the client (and back). And if we do
> 'copyfile' (under whatever name) for one filesystem, then I think we
> should strive to make sure that it's useful for other filesystems too.

COW filesystems like btrfs may also be able to do interesting things
with copyfile() btw by just sharing all data blocks COW.

That would make copyfile() useful for me, much more so than the network
filesystem side...

-- 
Arjan van de Ven 	Intel Open Source Technology Centre
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org

* Re: [GIT PULL] ocfs2 changes for 2.6.32
  2009-09-15  4:05               ` Arjan van de Ven
@ 2009-09-15  4:35                 ` Joel Becker
  0 siblings, 0 replies; 33+ messages in thread
From: Joel Becker @ 2009-09-15  4:35 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Linus Torvalds, Mark Fasheh, Andrew Morton,
	Linux Kernel Mailing List, ocfs2-devel

On Tue, Sep 15, 2009 at 06:05:59AM +0200, Arjan van de Ven wrote:
> COW filesystems like btrfs may also be able to do interesting things
> with copyfile() btw by just sharing all data blocks COW.
> 
> That would make copyfile() useful for me, much more so than the network
> filesystem side...

	btrfs reflink is definitely in the cards.

Joel

-- 

"The first thing we do, let's kill all the lawyers."
                                        -Henry VI, IV:ii


* Re: [GIT PULL] ocfs2 changes for 2.6.32
  2009-09-15  2:01             ` Linus Torvalds
  2009-09-15  4:05               ` Arjan van de Ven
@ 2009-09-15  4:06               ` Joel Becker
  2009-09-15 16:30                 ` Linus Torvalds
  1 sibling, 1 reply; 33+ messages in thread
From: Joel Becker @ 2009-09-15  4:06 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mark Fasheh, Andrew Morton, Linux Kernel Mailing List,
	ocfs2-devel

On Mon, Sep 14, 2009 at 07:01:06PM -0700, Linus Torvalds wrote:
> On Mon, 14 Sep 2009, Joel Becker wrote:
> >
> > 	In the reflink discussion before, I proposed that a separate
> > copyfile() syscall could be written that uses the same ->reflink() inode
> > operation but allows degradation in the storage handling.
> 
> .. exactly how?
> 
> If you're talking about falling back to manually just copying the data, 
> then nobody is interested in that. User space can do that better with a 
> simple read-write loop or with splice, or whatever. There's no reason
> what-so-ever to do that.

	I'm talking about any facility for copying that isn't just a
userspace loop.  Like your discussion of network filesystems.

> But the thing is, network filesystems may be able to do server-side 
> copies, and the point being that they can do so _without_ transferring the 
> data to the client (and back). And if we do 'copyfile' (under whatever 
> name) for one filesystem, then I think we should strive to make sure that 
> it's useful for other filesystems too.

	Hence I brought this to the filesystem summit and then fsdevel
rather than just implementing it in ocfs2.  I know NFS folks were in the
room in April, and they said the call definition was workable.  Can't
remember if CIFS folks were there, but I think so.

> [ Btw, it's quite possible that CIFS/NFS people would want more than a 
>   single entrypoint. I think they might want partial copies and status 
>   updates etc, which would likely mean that a single ->copyfile() thing 
>   isn't sufficient.
> 
>   Maybe it's not worth it, and the complexity of something like that gets 
>   to be too annoying. But I don't get the feeling that you've even _tried_ 
>   to see if this can be generalized to something that would be much more 
>   widely useful ]

	I brought it up in a forum with everyone there precisely so that
I wouldn't miss their concerns via myopia.  reflink() is a generic
application of the specific "let's snapshot inodes" idea.  It doesn't do
"atomic copy of data into duplicate storage", nor does it do "send byte
ranges".  The goal was something straightforward, not a kitchen sink.

> Now, I can see that you might want to say "fail rather than use double 
> the diskspace for data". But why not just do that as a flag? You already 
> have flags for 'copy extended attributes or not'. Why not have a flag that 
> says 'copy only if you can do it without any extra space'?

	We could.  Like I said, I really wanted something simple and
clean.  I tried hard to avoid that other flag, but I had to give up due
to (correct) concerns from the security folks.
	I'm looking at both the ease of calling the call and how we
define userspace programs to use it.  reflink(1), the program, is
essentially a synonym for 'ln -r' right now.  That's pretty nice to use
from a script.  Other ideas have been 'cp --reflink' or 'cp --clone',
but every proposal for a cp argument has felt awful and clunky.
	If I were doing a straight copyfile(), ignoring the reflink
semantics, I'd want something that could be done by cp(1) at all times
(rc = copyfile(); if ENOSYS do_normal_copy()).  I mean, if we do it
right, why not take advantage at all times.  Using reflink here violates
people's expectations, because a reflink, with its shared data extents,
can ENOSPC when you do CoW.  Whereas a copyfile() that expects to
duplicate the storage can fit within default cp.
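
A rough sketch of that cp(1) fallback, under stated assumptions: there is no
copyfile() syscall, so the copyfile() below is a stub that always fails with
ENOSYS, and cp_one_file() and do_normal_copy() are hypothetical names used
only for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/* Hypothetical stand-in for the proposed syscall.  No such call exists,
 * so this stub always reports ENOSYS. */
static int copyfile(const char *src, const char *dst, unsigned long flags)
{
	(void)src; (void)dst; (void)flags;
	errno = ENOSYS;
	return -1;
}

/* Plain read-write loop: the "normal copy" userspace can always do. */
static int do_normal_copy(const char *src, const char *dst)
{
	char buf[65536];
	size_t n;
	int rc = 0;
	FILE *in = fopen(src, "rb");
	FILE *out;

	if (!in)
		return -1;
	out = fopen(dst, "wb");
	if (!out) {
		fclose(in);
		return -1;
	}
	while ((n = fread(buf, 1, sizeof(buf), in)) > 0) {
		if (fwrite(buf, 1, n, out) != n) {
			rc = -1;
			break;
		}
	}
	if (ferror(in))
		rc = -1;
	fclose(in);
	if (fclose(out) != 0)
		rc = -1;
	return rc;
}

/* Try the kernel-assisted copy first; fall back only when unsupported. */
int cp_one_file(const char *src, const char *dst)
{
	int rc;

	assert(src && dst);
	rc = copyfile(src, dst, 0);
	if (rc < 0 && errno == ENOSYS)
		rc = do_normal_copy(src, dst);
	return rc;
}
```

Any error other than ENOSYS is passed straight back, so a real failure of the
kernel call is not papered over by the fallback.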

Joel

-- 

"Heav'n hath no rage like love to hatred turn'd, nor Hell a fury,
 like a woman scorn'd."
        - William Congreve

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127


* Re: [GIT PULL] ocfs2 changes for 2.6.32
  2009-09-15  4:06               ` Joel Becker
@ 2009-09-15 16:30                 ` Linus Torvalds
  2009-09-15 21:45                   ` Joel Becker
  0 siblings, 1 reply; 33+ messages in thread
From: Linus Torvalds @ 2009-09-15 16:30 UTC (permalink / raw)
  To: Joel Becker
  Cc: Mark Fasheh, Andrew Morton, Linux Kernel Mailing List,
	ocfs2-devel

On Mon, 14 Sep 2009, Joel Becker wrote:
> > 
> > If you're talking about falling back to manually just copying the data, 
> > then nobody is interested in that. User space can do that better with a 
> > simple read-write loop or with splice, or whatever. There's no reason 
> > what-so-ever to do that.
> 
> 	I'm talking about any facility for copying that isn't just a
> userspace loop.  Like your discussion of network filesystems.

HOW?

We need to have a per-filesystem interface to that. 

Having a '->copyfile()' function would be great.

But don't you see how _idiotic_ it is to then also have a '->reflink()' 
function that does _conceptually_ the exact same thing, except it does it 
by incrementing a usage count instead?

Do you see why I'm so unhappy to add a ->reflink() function? 

> 	Hence I brought this to the filesystem summit and then fsdevel
> rather than just implementing it in ocfs2.  I know NFS folks were in the
> room in April, and they said the call definition was workable.  Can't
> remember if CIFS folks were there, but I think so.

It's not workable if you define the 'reflink()' function to not use any 
disk space on the filesystem. Because SMB _will_ do a copy (and I presume 
the NFS thing will too). So it would not in general be what you call 
reflink, it will not be a "snapshot".

So if you _define_ the semantics of "reflink" to be that it's atomic and 
doesn't use any new diskspace (apart from the new inode/directory entry, 
of course), then it will be almost totally useless to other filesystems.

In fact, it's entirely possible to have filesystems that can avoid copying 
the _data_ blocks, but would need to copy the indirect blocks - maybe the 
data blocks are ref-counted, but the metadata needs to be per-file (I can 
see many reasons to do it that way, even if it's organized as a tree - 
it's how we do page table COW, for example, and it makes some things much 
simpler).

Would that be a 'reflink()' or not? I have no way of knowing, because you 
have decided on reflink on a purely ocfs2-specific implementation basis. 
But I do know that such a filesystem would be perfectly happy to have a 
'copyfile' function.

This is why I want the VFS pointers to be about _semantics_, not about 
some random implementation detail.

			Linus


* Re: [GIT PULL] ocfs2 changes for 2.6.32
  2009-09-15 16:30                 ` Linus Torvalds
@ 2009-09-15 21:45                   ` Joel Becker
  2009-09-16  4:20                     ` Linus Torvalds
  0 siblings, 1 reply; 33+ messages in thread
From: Joel Becker @ 2009-09-15 21:45 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mark Fasheh, Andrew Morton, Linux Kernel Mailing List,
	ocfs2-devel

On Tue, Sep 15, 2009 at 09:30:54AM -0700, Linus Torvalds wrote:
> HOW?
> 
> We need to have a per-filesystem interface to that. 

	No argument here.

> But don't you see how _idiotic_ it is to then also have a '->reflink()' 
> function that does _conceptually_ the exact same thing, except it does it 
> by incrementing a usage count instead?
> 
> Do you see why I'm so unhappy to add a ->reflink() function? 

	I got it the first time.  You see reflink() as a copyfile(), and
distinguishing the inode operations doesn't make sense to you.   Quite
frankly, it doesn't to me either.  There is the user<->kernel interface
of the system call, and there is the filesystem interface of the inode
operation.  One inode op that can support multiple variations of
user<->kernel is fine with me!
	Let's step back a second.  I'm not married to the name
'reflink'.  I'm not opposed to a copyfile() syscall.  I think I have a
clearer idea of what I see.  More below.

> Would that be a 'reflink()' or not? I have no way of knowing, because you 
> have decided on reflink on a purely ocfs2-specific implementation basis. 
> But I do know that such a filesystem would be perfectly happy to have a 
> 'copyfile' function.

	That's not fair.  I deliberately defined it as something outside
of the ocfs2 implementation.  Apparently I didn't do a good enough job.

> This is why I want the VFS pointers to be about _semantics_, not about 
> some random implementation detail.

	Again, no argument here.  The syscall interface better be
reasonably obvious to the userspace programmer.  The VFS pointer better
be an efficient and clean way to implement the syscall interface.
	I'm seeing three things here:

1. A CoW snapshot of an inode.  This is reflink.  It expressly defines
   metadata as copyable, but data must be shared in a CoW fashion (to
   answer your question about indirect blocks).  You either get a
   snapshot or nothing.  Call it snapfile() if you like.  Don't care.

2. An efficient copy.  This is what you're talking about with CIFS COPY,
   etc.  You want to be guaranteed it does NOT do CoW, because it would
   be great for a naive cp(1) to use it without the ENOSPC surprise of
   CoW.  You'd like the kernel call to fail if you're just going to get
   read-write-loops, because userspace can implement that better.  Maybe
   we have it such that only network filesystems implement this action,
   all the others return -ENOTSUPP, and then glibc handles the
   read-write-loop.  This allows everyone to call copyfile() and get
   what they expected.

3. A space-saving copy.  This is doing CoW linkup of the data storage if
   possible, like a snapshot but without the atomicity guarantee.  It
   has the ENOSPC surprise, but someone using it should know that.
 
	I think it would be great for Linux to provide all three.  I
chose to only attack (1) because I could define it well.  I left (2) and
(3), what I see as copyfile(), for later work.  And I fully expected
that the VFS operation could change later - it's an internal thing,
after all.  I want to get a good user<->kernel interface, because that's
the one that is set in stone.  What I didn't want was to create another
kitchen-sink call, or another POSIXy thing that has a million special
cases that trip folks up.
	I'm glad you've taken an interest, because you're pretty damned
good at architecture.  If we can expand to cover copyfile sanely too,
win-win.  To me, the user<->kernel interface really is two system calls:
reflink/snapfile for (1) and copyfile for (2) & (3).  The kernel VFS
interface I would think you could do in one inode operation.  If you
want to name it ->copyfile, that's fine.
	Perhaps ->copyfile takes the following flags:

#define ALLOW_COW_SHARED	0x0001
#define REQUIRE_COW_SHARED	0x0002
#define REQUIRE_BASIC_ATTRS	0x0004
#define REQUIRE_FULL_ATTRS	0x0008
#define REQUIRE_ATOMIC		0x0010
#define SNAPSHOT		(REQUIRE_COW_SHARED |
				 REQUIRE_BASIC_ATTRS |
				 REQUIRE_ATOMIC)
#define SNAPSHOT_PRESERVE	(SNAPSHOT | REQUIRE_FULL_ATTRS)

Thus, sys_reflink/sys_snapfile(oldpath, newpath, 0) becomes:

  ->copyfile(oldpath, newpath, SNAPSHOT)

and sys_reflink/sys_snapfile(oldpath, newpath, ATTR_PRESERVE) becomes:

  ->copyfile(oldpath, newpath, SNAPSHOT_PRESERVE)

while sys_copyfile(oldpath, newpath, 0) is:

  ->copyfile(oldpath, newpath, 0)

and sys_copyfile(oldpath, newpath, ALLOW_COW) is:

  ->copyfile(oldpath, newpath, ALLOW_COW_SHARED)
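
A minimal sketch of that mapping, assuming the flag values listed above.
ATTR_PRESERVE and ALLOW_COW are named in this mail but given no values, so the
values below are placeholders, and the helper names are hypothetical.

```c
#include <assert.h>

/* Flags for the ->copyfile inode operation, as proposed in the mail. */
#define ALLOW_COW_SHARED	0x0001
#define REQUIRE_COW_SHARED	0x0002
#define REQUIRE_BASIC_ATTRS	0x0004
#define REQUIRE_FULL_ATTRS	0x0008
#define REQUIRE_ATOMIC		0x0010
#define SNAPSHOT		(REQUIRE_COW_SHARED | \
				 REQUIRE_BASIC_ATTRS | \
				 REQUIRE_ATOMIC)
#define SNAPSHOT_PRESERVE	(SNAPSHOT | REQUIRE_FULL_ATTRS)

/* Userspace-visible syscall flags.  Values are assumed for illustration. */
#define ATTR_PRESERVE	0x1	/* sys_reflink/sys_snapfile */
#define ALLOW_COW	0x1	/* sys_copyfile */

/* sys_reflink/sys_snapfile: always a snapshot, optionally with full attrs. */
unsigned long snapfile_vfs_flags(unsigned long uflags)
{
	return (uflags & ATTR_PRESERVE) ? SNAPSHOT_PRESERVE : SNAPSHOT;
}

/* sys_copyfile: a real copy by default, CoW sharing only when allowed. */
unsigned long copyfile_vfs_flags(unsigned long uflags)
{
	return (uflags & ALLOW_COW) ? ALLOW_COW_SHARED : 0;
}

/* Sanity check on the composite values implied by the definitions. */
void check_flag_layout(void)
{
	assert(SNAPSHOT == 0x0016);
	assert(SNAPSHOT_PRESERVE == 0x001e);
}
```

Both syscalls end up in the same inode operation; only the flag word differs.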

	What do you think?  Other ideas?

Joel
-- 

"The lawgiver, of all beings, most owes the law allegiance.  He of all
 men should behave as though the law compelled him.  But it is the
 universal weakness of mankind that what we are given to administer we
 presently imagine we own."
        - H.G. Wells

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127


* Re: [GIT PULL] ocfs2 changes for 2.6.32
  2009-09-15 21:45                   ` Joel Becker
@ 2009-09-16  4:20                     ` Linus Torvalds
  2009-09-16  4:40                       ` Joel Becker
  0 siblings, 1 reply; 33+ messages in thread
From: Linus Torvalds @ 2009-09-16  4:20 UTC (permalink / raw)
  To: Joel Becker
  Cc: Mark Fasheh, Andrew Morton, Linux Kernel Mailing List,
	ocfs2-devel

On Tue, 15 Sep 2009, Joel Becker wrote:
>
> 	Perhaps ->copyfile takes the following flags:
> 
> #define ALLOW_COW_SHARED	0x0001
> #define REQUIRE_COW_SHARED	0x0002
> #define REQUIRE_BASIC_ATTRS	0x0004
> #define REQUIRE_FULL_ATTRS	0x0008
> #define REQUIRE_ATOMIC		0x0010
> #define SNAPSHOT		(REQUIRE_COW_SHARED |
> 				 REQUIRE_BASIC_ATTRS |
> 				 REQUIRE_ATOMIC)
> #define SNAPSHOT_PRESERVE	(SNAPSHOT | REQUIRE_FULL_ATTRS)
> 
> Thus, sys_reflink/sys_snapfile(oldpath, newpath, 0) becomes:
> ...

Yes. The above all sounds sane to me.

I still worry that especially the non-atomic case will want some kind of 
partial-copy updates (think graphical file managers that want to show the 
progress of the copy), and that (think EINTR and continuing) makes me 
think "that could get really complex really quickly", but that's something 
that the NFS/SMB people would have to pipe up on. I'm pretty sure the NFS 
spec has some kind of "partial completion notification" model, I dunno about 
SMB.

			Linus


* Re: [GIT PULL] ocfs2 changes for 2.6.32
  2009-09-16  4:20                     ` Linus Torvalds
@ 2009-09-16  4:40                       ` Joel Becker
  2009-09-17 16:29                         ` Linus Torvalds
  0 siblings, 1 reply; 33+ messages in thread
From: Joel Becker @ 2009-09-16  4:40 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mark Fasheh, Andrew Morton, Linux Kernel Mailing List,
	ocfs2-devel

On Tue, Sep 15, 2009 at 09:20:47PM -0700, Linus Torvalds wrote:
> On Tue, 15 Sep 2009, Joel Becker wrote:
> >
> > 	Perhaps ->copyfile takes the following flags:
> > 
> > #define ALLOW_COW_SHARED	0x0001
> > #define REQUIRE_COW_SHARED	0x0002
> > #define REQUIRE_BASIC_ATTRS	0x0004
> > #define REQUIRE_FULL_ATTRS	0x0008
> > #define REQUIRE_ATOMIC		0x0010
> > #define SNAPSHOT		(REQUIRE_COW_SHARED |
> > 				 REQUIRE_BASIC_ATTRS |
> > 				 REQUIRE_ATOMIC)
> > #define SNAPSHOT_PRESERVE	(SNAPSHOT | REQUIRE_FULL_ATTRS)
> > 
> > Thus, sys_reflink/sys_snapfile(oldpath, newpath, 0) becomes:
> > ...
> 
> Yes. The above all sounds sane to me.

	Ok.  Where do you see the exposure level?  What I mean is, I
just defined a vfs op that handles these things, but accessed it via two
syscalls, sys_snapfile() and sys_copyfile().  We could also just provide
one system call and allow userspace to use these flags itself, creating
snapfile(3) and copyfile(3) in libc, hiding the details (kind of like
clone being hidden by pthreads, though ignoring that pthreads has
"issues").  Or we could explicitly make this the public API and expect
something like cp(1) to directly use the flags.  Thoughts?

> I still worry that especially the non-atomic case will want some kind of 
> partial-copy updates (think graphical file managers that want to show the 
> progress of the copy), and that (think EINTR and continuing) makes me 
> think "that could get really complex really quickly", but that's something 
> that the NFS/SMB people would have to pipe up on. I'm pretty sure the NFS 
> spec has some kind of "partial completion notification" model, I dunno about 
> SMB.

	I'm really wary of combining a ranged interface with this one.
Not only does it make no sense for snapshots, but I think it falls down
in any "create a new inode" scheme entirely.
	btrfs has an ioctl that basically says "link up range x->y of
file 1 to file 2".  Chris is using the underlying machinery to implement
reflink, but I think the concept actually would work nicely as a splice
flag.  If you have two existing files, not creating one, you can just
ask splice to do efficient things with a SPLICE_F_EFFICIENT_COPY for
your CIFS COPY-style thing or SPLICE_F_COW_COPY for btrfs- and
ocfs2-style data sharing.

Joel

-- 

"Nothing is wrong with California that a rise in the ocean level
 wouldn't cure."
        - Ross MacDonald

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127


* Re: [GIT PULL] ocfs2 changes for 2.6.32
  2009-09-16  4:40                       ` Joel Becker
@ 2009-09-17 16:29                         ` Linus Torvalds
  2009-09-17 16:38                           ` Arjan van de Ven
                                             ` (2 more replies)
  0 siblings, 3 replies; 33+ messages in thread
From: Linus Torvalds @ 2009-09-17 16:29 UTC (permalink / raw)
  To: Joel Becker
  Cc: Mark Fasheh, Andrew Morton, Linux Kernel Mailing List,
	ocfs2-devel

On Tue, 15 Sep 2009, Joel Becker wrote:
> 
> 	Ok.  Where do you see the exposure level?  What I mean is, I
> just defined a vfs op that handles these things, but accessed it via two
> syscalls, sys_snapfile() and sys_copyfile().  We could also just provide
> one system call and allow userspace to use these flags itself, creating
> snapfile(3) and copyfile(3) in libc

Why would anybody want to hide it at all? Why even the libc hiding?

Nobody is going to use this except for special apps. Let them see what 
they can do, in all its glory. 

> > I still worry that especially the non-atomic case will want some kind of 
> > partial-copy updates (think graphical file managers that want to show the 
> > progress of the copy), and that (think EINTR and continuing) makes me 
> > think "that could get really complex really quickly", but that's something 
> > that the NFS/SMB people would have to pipe up on. I'm pretty sure the NFS 
> > spec has some kind of "partial completion notification" model, I dunno about 
> > SMB.
> 
> 	I'm really wary of combining a ranged interface with this one.
> Not only does it make no sense for snapshots, but I think it falls down
> in any "create a new inode" scheme entirely.

Oh, I wouldn't suggest a ranged interface, just one that allows for status 
updates and cancelling - _if_ the initial op isn't atomic to begin with. 
There's also the issue of concurrency in IO: maybe you want to start 
several things without necessarily waiting for them (think high-throughput 
"cp -R" on NFS or something like that).

So I'd suggest something like having two system calls: one to start the 
operation, and one to control it. And for a filesystem that does atomic 
copies, the 'start' one obviously would also finish it, so the 'control' 
one would be a no-op, because there would never be any outstanding ones.

See what I'm saying? It wouldn't complicate _your_ life, but it would 
allow for filesystems that can't do it atomically (or even quickly).

So the first one would be something like

	int copyfile(const char *src, const char *dest, unsigned long flags);

which would return:

 - zero on success
 - negative (with errno) on error
 - positive cookie on "I started it, here's my cookie". For extra bonus 
   points, maybe the cookie would actually be a file descriptor (for 
   poll/select users), but it would _not_ be a file descriptor to the 
   resulting _file_, it would literally be a "cookie" to the actual 
   copyfile event.

and then for ocfs2 you'd never return positive cookies. You'd never have 
to worry about it.

Then the second interface would be something like

	int copyfile_ctrl(long cookie, unsigned long cmd);

where you'd just have some way to wait for completion and ask how much has 
been copied. The 'cmd' would be some set of 'cancel', 'status' or 
'uninterruptible wait' or whatever, and the return value would again be

 - negative (with errno) for errors (copy failed) - cookie released
 - zero for 'done' - cookie released
 - positive for 'percent remaining' or whatever - cookie still valid

and this would be another callback into the filesystem code, but you'd 
never have to worry about it, since you'd never see it (just leave it 
NULL).

NOTE! The above is a rough idea - I have not spent tons of time thinking 
about it, or looking at exactly what something like NFS would really want. 
But the _concept_ is simple, and usage should be pretty trivial. A simple 
case would be something like this:

   int copy_file(const char *src, const char *dst)
   {
	/* Start a file copy */
	int cookie = copyfile(src, dst, 0);

	/* Async case? */
	if (cookie > 0) {
		int ret;

		while ((ret = copyfile_ctrl(cookie, COPYFILE_WAIT)) > 0)
			/* nothing */;

		/* Error handling is shared for async/sync */
		cookie = ret;
	}
	if (cookie < 0) {
		perror("copyfile failed");
		return -1;
	}
	return 0;
   }

doesn't that look fairly easy to use?
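
The "status" command could drive a progress display in much the same way. The
following is only a sketch under stated assumptions: COPYFILE_STATUS and its
value are made up, and copyfile()/copyfile_ctrl() are toy stand-ins (one copy
at a time, advancing 25% per query) so that the loop can be run anywhere.

```c
#include <assert.h>
#include <stdio.h>

#define COPYFILE_STATUS	1	/* hypothetical command value */

/* Toy state: percent remaining for the single outstanding copy. */
static int toy_left;

/* Toy stand-in: always takes the async path and hands back cookie 1. */
static int copyfile(const char *src, const char *dst, unsigned long flags)
{
	(void)src; (void)dst; (void)flags;
	toy_left = 100;
	return 1;
}

/* Toy stand-in: each status query "copies" another 25%. */
static int copyfile_ctrl(long cookie, unsigned long cmd)
{
	assert(cookie == 1 && cmd == COPYFILE_STATUS);
	toy_left -= 25;
	return toy_left;	/* 0 means done, cookie released */
}

/* Copy a file, printing the remaining percentage after each status query.
 * Returns 0 on success or a negative value on error; *updates, if given,
 * is incremented once per progress line printed. */
int copy_with_progress(const char *src, const char *dst, int *updates)
{
	int cookie, left;

	assert(src && dst);
	cookie = copyfile(src, dst, 0);
	if (cookie <= 0)
		return cookie;	/* finished synchronously, or failed */

	/* A real caller would sleep or poll() between queries. */
	while ((left = copyfile_ctrl(cookie, COPYFILE_STATUS)) > 0) {
		printf("%s: %d%% remaining\n", dst, left);
		if (updates)
			(*updates)++;
	}
	return left;
}
```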

And the advantage here is that you _can_ - still fairly easily - do much 
more involved things. For example, let's say that you wanted to do a very 
efficient parallel copy, so you'd do something like this:

	#define MAX_PENDING 10
	static int pending[MAX_PENDING];
	static int nr_pending = 0;

	static int wait_for_completion(int nr_left)
	{
		int ret = 0;

		while (nr_pending > nr_left) {
			int cookie = pending[0], i;

			/* Wait for completion of the oldest entry */
			while ((i = copyfile_ctrl(cookie, COPYFILE_WAIT)) > 0)
				/* nothing */;

			/* Save the "we had an error" case */
			if (i < 0)
				ret = i;

			/* Move the other entries down */
			memmove(pending, pending+1, sizeof(int)*--nr_pending);
		}
		return ret;
	}

	int start_copy(const char *src, const char *dst)
	{
		int cookie, ret;

		cookie = copyfile(src, dst, 0);
		if (cookie <= 0)
			return cookie;

		ret = 0;
		if (nr_pending == MAX_PENDING)
			ret = wait_for_completion(MAX_PENDING/2);

		pending[nr_pending++] = cookie;
		return ret;
	}

	int stop_copy(void)
	{
		return wait_for_completion(0);
	}
	}

which basically ends up having ten copyfile() calls outstanding (and when 
we hit the limit, we wait for half of them to complete), so now you can do 
an efficient "cp -R" with concurrent server-side IO. And it wasn't so 
hard, was it?

(Ok, so the above would need to be fleshed out to remember the filenames 
so that you can report _which_ file failed etc, but you get the idea).

And again, it wouldn't be any more complicated for your case. Your 
copyfile would always just return 0 or negative for error. But it would be 
_way_ more powerful for filesystems that want to do potentially lots of IO 
for the file copy.

I dunno. The above seems like a fairly simple and powerful interface, and 
I _think_ it would be ok for NFS and CIFS. And in fact, if that whole 
"background copy" ends up being used a lot, maybe even a local filesystem 
would implement it just to get easy overlapping IO - even if it would just 
be a trivial common wrapper function that says "start a thread to do a 
trivial manual copy".

			Linus


* Re: [GIT PULL] ocfs2 changes for 2.6.32
  2009-09-17 16:29                         ` Linus Torvalds
@ 2009-09-17 16:38                           ` Arjan van de Ven
  2009-09-17 20:16                             ` Linus Torvalds
  2009-09-17 18:40                           ` Roland Dreier
  2009-09-18  1:43                           ` [Ocfs2-devel] " Joel Becker
  2 siblings, 1 reply; 33+ messages in thread
From: Arjan van de Ven @ 2009-09-17 16:38 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Joel Becker, Mark Fasheh, Andrew Morton,
	Linux Kernel Mailing List, ocfs2-devel

On Thu, 17 Sep 2009 09:29:14 -0700 (PDT)
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> 
> 
> On Tue, 15 Sep 2009, Joel Becker wrote:
> > 
> > 	Ok.  Where do you see the exposure level?  What I mean is, I
> > just defined a vfs op that handles these things, but accessed it
> > via two syscalls, sys_snapfile() and sys_copyfile().  We could also
> > just provide one system call and allow userspace to use these flags
> > itself, creating snapfile(3) and copyfile(3) in libc
> 
> Why would anybody want to hide it at all? Why even the libc hiding?
> 
> Nobody is going to use this except for special apps. Let them see
> what they can do, in all its glory. 
> 
> > > I still worry that especially the non-atomic case will want some
> > > kind of partial-copy updates (think graphical file managers that
> > > want to show the progress of the copy), and that (think EINTR and
> > > continuing) makes me think "that could get really complex really
> > > quickly", but that's something that the NFS/SMB people would have
> > > to pipe up on. I'm pretty sure the NFS spec has some kind of
> > > "partial completion notification" model, I dunno about SMB.
> > 
> > 	I'm really wary of combining a ranged interface with this
> > one. Not only does it make no sense for snapshots, but I think it
> > falls down in any "create a new inode" scheme entirely.
> 
> Oh, I wouldn't suggest a ranged interface, just one that allows for
> status updates and cancelling - _if_ the initial op isn't atomic to
> begin with. There's also the issue of concurrency in IO: maybe you
> want to start several things without necessarily waiting for them
> (think high-throughput "cp -R" on NFS or something like that).
> 
> So I'd suggest something like having two system calls: one to start
> the operation, and one to control it. And for a filesystem that does
> atomic copies, the 'start' one obviously would also finish it, so the
> 'control' one would be a no-op, because there would never be any
> outstanding ones.
> 
> See what I'm saying? It wouldn't complicate _your_ life, but it would 
> allow for filesystems that can't do it atomically (or even quickly).
> 
> So the first one would be something like
> 
> 	int copyfile(const char *src, const char *dest, unsigned long flags);
> 
> which would return:
> 
>  - zero on success
>  - negative (with errno) on error
>  - positive cookie on "I started it, here's my cookie". For extra bonus
>    points, maybe the cookie would actually be a file descriptor (for
>    poll/select users), but it would _not_ be a file descriptor to the
>    resulting _file_, it would literally be a "cookie" to the actual 
>    copyfile event.
> 
> and then for ocfs2 you'd never return positive cookies. You'd never
> have to worry about it.
> 
> Then the second interface would be something like
> 
> 	int copyfile_ctrl(long cookie, unsigned long cmd);
> 
> where you'd just have some way to wait for completion and ask how
> much has been copied. The 'cmd' would be some set of 'cancel',
> 'status' or 'uninterruptible wait' or whatever, and the return value
> would again be
> 
>  - negative (with errno) for errors (copy failed) - cookie released
>  - zero for 'done' - cookie released
>  - positive for 'percent remaining' or whatever - cookie still valid
> 
> and this would be another callback into the filesystem code, but
> you'd never have to worry about it, since you'd never see it (just
> leave it NULL).
> 
> NOTE! The above is a rough idea - I have not spent tons of time
> thinking about it, or looking at exactly what something like NFS
> would really want. But the _concept_ is simple, and usage should be
> pretty trivial. A simple case would be something like this:
> 
>    int copy_file(const char *src, const char *dst)
>    {
> 	/* Start a file copy */
> 	int cookie = copyfile(src, dst, 0);
> 
> 	/* Async case? */
> 	if (cookie > 0) {
> 		int ret;
> 
> 		while ((ret = copyfile_ctrl(cookie, COPYFILE_WAIT)) > 0)
> 			/* nothing */;
> 
> 		/* Error handling is shared for async/sync */
> 		cookie = ret;
> 	}
> 	if (cookie < 0) {
> 		perror("copyfile failed");
> 		return -1;
> 	}
> 	return 0;
>    }
> 
> doesn't that look fairly easy to use?
> 
> And the advantage here is that you _can_ - still fairly easily - do
> much more involved things. For example, let's say that you wanted to
> do a very efficient parallel copy, so you'd do something like this:
> 
> 	#define MAX_PENDING 10
> 	static int pending[MAX_PENDING];
> 	static int nr_pending = 0;
> 
> 	static int wait_for_completion(int nr_left)
> 	{
> 		int ret = 0;
> 
> 		while (nr_pending > nr_left) {
> 			int cookie = pending[0], i;
> 
> 			/* Wait for completion of the oldest entry */
> 			while ((i = copyfile_ctrl(cookie, COPYFILE_WAIT)) > 0)
> 				/* nothing */;
> 
> 			/* Save the "we had an error" case */
> 			if (i < 0)
> 				ret = i;
> 
> 			/* Move the other entries down */
> 			memmove(pending, pending+1, sizeof(int)*--nr_pending);
> 		}
> 		return ret;
> 	}
> 
> 	int start_copy(const char *src, const char *dst)
> 	{
> 		int cookie, ret;
> 
> 		cookie = copyfile(src, dst, 0);
> 		if (cookie <= 0)
> 			return cookie;
> 
> 		ret = 0;
> 		if (nr_pending == MAX_PENDING)
> 			ret = wait_for_completion(MAX_PENDING/2);
> 
> 		pending[nr_pending++] = cookie;
> 		return ret;
> 	}
> 
> 	int stop_copy(void)
> 	{
> 		return wait_for_completion(0);
> 	}
> 
> which basically ends up having ten copyfile() calls outstanding (and
> when we hit the limit, we wait for half of them to complete), so now
> you can do an efficient "cp -R" with concurrent server-side IO. And
> it wasn't so hard, was it?
> 
> (Ok, so the above would need to be fleshed out to remember the
> filenames so that you can report _which_ file failed etc, but you get
> the idea).
> 
> And again, it wouldn't be any more complicated for your case. Your 
> copyfile would always just return 0 or negative for error. But it
> would be _way_ more powerful for filesystems that want to do
> potentially lots of IO for the file copy.
> 
> I dunno. The above seems like a fairly simple and powerful interface,
> and I _think_ it would be ok for NFS and CIFS. And in fact, if that
> whole "background copy" ends up being used a lot, maybe even a local
> filesystem would implement it just to get easy overlapping IO - even
> if it would just be a trivial common wrapper function that says
> "start a thread to do a trivial manual copy".

or make it one level simpler?
Have a "wait for all started copies" call only.... saves a ton of book
keeping, and is likely what people will use it for in the end anyway.


(implementation wise the fallback implementation could then just use
the async function calls if it wanted to, and just wait for all copies
to finish in the complete call)


-- 
Arjan van de Ven 	Intel Open Source Technology Centre
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org


* Re: [GIT PULL] ocfs2 changes for 2.6.32
  2009-09-17 16:38                           ` Arjan van de Ven
@ 2009-09-17 20:16                             ` Linus Torvalds
  0 siblings, 0 replies; 33+ messages in thread
From: Linus Torvalds @ 2009-09-17 20:16 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Joel Becker, Mark Fasheh, Andrew Morton,
	Linux Kernel Mailing List, ocfs2-devel



On Thu, 17 Sep 2009, Arjan van de Ven wrote:
> 
> or make it one level simpler?
> Have a "wait for all started copies" call only.... saves a ton of book
> keeping, and is likely what people will use it for in the end anyway.

No. That wouldn't work. For a few reasons:

 - the case I didn't show was the "graphical file manager client" thing 
   that wants to show the copy as it progresses. It needs to know how much 
   is left, and for which file.

 - if errors happen, you need to indicate which file had an error. Again, 
   my example code didn't show that, since it was written as an example 
   and obviously just while writing email anyway. But it's a major 
   requirement for any sane and reliable filesystem model!

 - even in my example, I wanted to show how you don't want to wait for 
   _all_ of them in the middle, you just want to wait for some of them. If 
   you wait for all of them - just to make room for more - you're going to 
   have hiccups in your IO patterns and you cannot saturate your server or 
   disks well.

So I really think there needs to be a cookie per outstanding file copy (of 
course, the kernel is likely to not allow a single user more than 'n' 
outstanding copies anyway, but that's a separate issue).
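
To make the per-cookie point concrete, here is a sketch of the bookkeeping a
caller might keep so that a failure can be tied back to a file. Everything in
it is an assumption for illustration: the COPYFILE_WAIT value, the struct and
function names, and the toy copyfile()/copyfile_ctrl() stand-ins, which fail
any copy whose destination is literally named "bad".

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define COPYFILE_WAIT	0	/* hypothetical command value */
#define MAX_PENDING	10

/* One outstanding copy: the cookie plus the name needed for reporting. */
struct pending_copy {
	int cookie;
	const char *dst;
};

static struct pending_copy pending[MAX_PENDING];
static int nr_pending;

/* Toy stand-ins for the proposed calls: cookies are handed out in order,
 * and a copy to a destination named "bad" completes with an error. */
static int toy_fail[64];
static int toy_next = 1;

static int copyfile(const char *src, const char *dst, unsigned long flags)
{
	int cookie = toy_next++;

	(void)src; (void)flags;
	assert(cookie < 64);
	toy_fail[cookie] = (strcmp(dst, "bad") == 0);
	return cookie;
}

static int copyfile_ctrl(long cookie, unsigned long cmd)
{
	(void)cmd;
	return toy_fail[cookie] ? -1 : 0;
}

/* Wait until at most nr_left copies are outstanding.
 * Returns how many of the reaped copies failed. */
static int reap_copies(int nr_left)
{
	int failed = 0;

	while (nr_pending > nr_left) {
		struct pending_copy oldest = pending[0];
		int rc;

		while ((rc = copyfile_ctrl(oldest.cookie, COPYFILE_WAIT)) > 0)
			/* still in flight */;
		if (rc < 0) {
			fprintf(stderr, "copy to %s failed\n", oldest.dst);
			failed++;
		}
		nr_pending--;
		memmove(pending, pending + 1, nr_pending * sizeof(pending[0]));
	}
	return failed;
}

/* Start one copy; returns the number of failures noticed along the way. */
int start_copy(const char *src, const char *dst)
{
	int failed = 0;
	int cookie = copyfile(src, dst, 0);

	if (cookie < 0) {
		fprintf(stderr, "copy to %s failed\n", dst);
		return 1;
	}
	if (cookie == 0)
		return 0;	/* completed synchronously */
	if (nr_pending == MAX_PENDING)
		failed = reap_copies(MAX_PENDING / 2);
	pending[nr_pending].cookie = cookie;
	pending[nr_pending].dst = dst;
	nr_pending++;
	return failed;
}

/* Drain everything still outstanding. */
int finish_copies(void)
{
	return reap_copies(0);
}
```

With a single "wait for all" call there would be nothing to hang the dst
pointer on, which is the objection above.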

		Linus


* Re: [GIT PULL] ocfs2 changes for 2.6.32
  2009-09-17 16:29                         ` Linus Torvalds
  2009-09-17 16:38                           ` Arjan van de Ven
@ 2009-09-17 18:40                           ` Roland Dreier
  2009-09-17 20:17                             ` Linus Torvalds
  2009-09-18  1:43                           ` [Ocfs2-devel] " Joel Becker
  2 siblings, 1 reply; 33+ messages in thread
From: Roland Dreier @ 2009-09-17 18:40 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Joel Becker, Mark Fasheh, Andrew Morton,
	Linux Kernel Mailing List, ocfs2-devel


 >   int copy_file(const char *src, const char *dst)
 >   {
 >	/* Start a file copy */
 >	int cookie = copyfile(src, dst, 0);
 >
 >	/* Async case? */
 >	if (cookie > 0) {
 >		int ret;
 >
 >		while ((ret = copyfile_ctrl(cookie, COPYFILE_WAIT)) > 0)
 >			/* nothing */;
 >
 >		/* Error handling is shared for async/sync */
 >		cookie = ret;
 >	}
 >	if (cookie < 0) {
 >		perror("copyfile failed");
 >		return -1;
 >	}

I guess one bit of semantics to figure out is what happens if copyfile()
does the async case but then copyfile_ctrl() returns an error halfway
through... is the state of the dest file just undefined?

 - R.


* Re: [GIT PULL] ocfs2 changes for 2.6.32
  2009-09-17 18:40                           ` Roland Dreier
@ 2009-09-17 20:17                             ` Linus Torvalds
  2009-09-17 20:34                               ` Joel Becker
  2009-09-17 20:42                               ` Roland Dreier
  0 siblings, 2 replies; 33+ messages in thread
From: Linus Torvalds @ 2009-09-17 20:17 UTC (permalink / raw)
  To: Roland Dreier
  Cc: Joel Becker, Mark Fasheh, Andrew Morton,
	Linux Kernel Mailing List, ocfs2-devel

On Thu, 17 Sep 2009, Roland Dreier wrote:
> 
> I guess one bit of semantics to figure out is what happens if copyfile()
> does the async case but then copyfile_ctrl() returns an error halfway
> through... is the state of the dest file just undefined?

I think that's the one that most filesystems would prefer. Maybe the file 
is there, it's just that it's only half copied because the filesystem 
filled up. 

Making filesystems give atomicity guarantees would be hard for the async 
case. 

Of course, if the filesystem can do the copy entirely atomically (ie by 
just incrementing a refcount), then it can give atomicity guarantees, but 
then you'd never see the async case either.

		Linus

* Re: [GIT PULL] ocfs2 changes for 2.6.32
  2009-09-17 20:17                             ` Linus Torvalds
@ 2009-09-17 20:34                               ` Joel Becker
  2009-09-18  0:29                                 ` Linus Torvalds
  2009-09-17 20:42                               ` Roland Dreier
  1 sibling, 1 reply; 33+ messages in thread
From: Joel Becker @ 2009-09-17 20:34 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Roland Dreier, Mark Fasheh, Andrew Morton,
	Linux Kernel Mailing List, ocfs2-devel

On Thu, Sep 17, 2009 at 01:17:55PM -0700, Linus Torvalds wrote:
> On Thu, 17 Sep 2009, Roland Dreier wrote:
> > 
> > I guess one bit of semantics to figure out is what happens if copyfile()
> > does the async case but then copyfile_ctrl() returns an error halfway
> > through... is the state of the dest file just undefined?
> 
> I think that's the one that most filesystems would prefer. Maybe the file 
> is there, it's just that it's only half copied because the filesystem 
> filled up. 

	I have to say, adding 'undefined behavior' things isn't fun in a
call that is already potentially confusing.  We have a bunch of flags
and behaviors we're covering.

> Making filesystems give atomicity guarantees would be hard for the async 
> case. 

	Note that "cleaning up after an error" and "atomic" are not the
same.  Atomicity implies that not only do you see all or none, but that
the contents are a point-in-time snapshot of the source file.  A non-atomic
implementation may be affected by writes that happen during the copy
(like any read-write-loop copy would be).
	As an example, ocfs2_reflink() builds the target inode in the
orphan directory.  If the operation fails at any point, it's removed.
If we crash, orphan cleanup happens.  Only if it succeeds do we move it
to the target directory.  ocfs2_reflink() is an atomic snapshot, of
course, but recoverability is certainly possible for a non-atomic
copyfile() on filesystems with similar orphan schemes (ext3 is the
obvious example).
	Of course, how the network filesystems might see it, I don't
know.  NFS/CIFS folks, please speak up.
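
For comparison, the orphan scheme has a rough userspace analogue: build
the copy under a temporary name, remove it on any failure, and rename(2)
it over the target only once everything is written.  A minimal sketch,
assuming a hypothetical helper copy_via_temp() and a made-up ".tmp"
suffix (neither is part of any proposal in this thread).  It gives
all-or-nothing visibility of the target, but not a point-in-time view of
the source:

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Write the whole buffer, looping over short writes. */
static int write_all(int fd, const char *p, size_t len)
{
	while (len > 0) {
		ssize_t w = write(fd, p, len);

		if (w < 0)
			return -1;
		p += w;
		len -= (size_t)w;
	}
	return 0;
}

/*
 * Hypothetical helper: copy src to dst so that dst either appears
 * complete or does not appear at all.  Returns 0 or -1 with errno set.
 */
int copy_via_temp(const char *src, const char *dst)
{
	char tmp[4096], buf[65536];
	ssize_t n;
	int in, out, len, err = 0;

	assert(src && dst);
	len = snprintf(tmp, sizeof(tmp), "%s.tmp", dst);
	if (len < 0 || (size_t)len >= sizeof(tmp)) {
		errno = ENAMETOOLONG;
		return -1;
	}
	in = open(src, O_RDONLY);
	if (in < 0)
		return -1;
	/* The temporary file plays the role of the orphan-dir inode. */
	out = open(tmp, O_WRONLY | O_CREAT | O_EXCL, 0600);
	if (out < 0) {
		err = errno;
		close(in);
		errno = err;
		return -1;
	}
	while ((n = read(in, buf, sizeof(buf))) > 0) {
		if (write_all(out, buf, (size_t)n) < 0) {
			err = errno;
			break;
		}
	}
	if (n < 0)
		err = errno;
	if (!err && fsync(out) < 0)
		err = errno;
	close(in);
	if (close(out) < 0 && !err)
		err = errno;
	/* Only a fully written copy is moved to the target name. */
	if (!err && rename(tmp, dst) < 0)
		err = errno;
	if (err) {
		unlink(tmp);	/* failure: remove the partial copy */
		errno = err;
		return -1;
	}
	return 0;
}
```

Cleanup of a stale .tmp file after a crash is left out; that is the part
a filesystem's orphan recovery would handle.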

> Of course, if the filesystem can do the copy entirely atomically (ie by 
> just incrementing a refcount), then it can give atomicity guarantees, but 
> then you'd never see the async case either.

	Even the atomic copy might take a little time (say, to bump up
and write out the metadata structures).  Do you want to define that as
not being async?  I was figuring COPYFILE_ATOMIC and COPYFILE_WAIT to be
separate flags.

Joel

-- 

"Behind every successful man there's a lot of unsuccessful years."
        - Bob Brown

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127

* Re: [GIT PULL] ocfs2 changes for 2.6.32
  2009-09-17 20:34                               ` Joel Becker
@ 2009-09-18  0:29                                 ` Linus Torvalds
  0 siblings, 0 replies; 33+ messages in thread
From: Linus Torvalds @ 2009-09-18  0:29 UTC (permalink / raw)
  To: Joel Becker
  Cc: Roland Dreier, Mark Fasheh, Andrew Morton,
	Linux Kernel Mailing List, ocfs2-devel

On Thu, 17 Sep 2009, Joel Becker wrote:
>
> 	I have to say, adding 'undefined behavior' things isn't fun in a
> call that is already potentially confusing.  We have a bunch of flags
> and behaviors we're covering.

I don't think it's "undefined". It's just not complete.

From a user standpoint, there is no difference between such a system call 
and doing the thing as a library routine (which has to be the fallback 
anyway for something like 'cp').

> 	Note that "cleaning up after an error" and "atomic" are not the
> same.  Atomicity implies that not only do you see all or none, but that
> the contents are a point-in-time snapshot of the source file.  A non-atomic
> implementation may be affected by writes that happen during the copy
> (like any read-write-loop copy would be).

Sure. There are middle grounds that may not need the cleanup. I was more 
looking at the two extreme ends.

> > Of course, if the filesystem can do the copy entirely atomically (ie by 
> > just incrementing a refcount), then it can give atomicity guarantees, but 
> > then you'd never see the async case either.
> 
> 	Even the atomic copy might take a little time (say, to bump up
> and write out the metadata structures).  Do you want to define that as
> not being async?  I was figuring COPYFILE_ATOMIC and COPYFILE_WAIT to be
> separate flags.

I don't think that's wrong, and yeah, you could decide that you actually 
want to be able to support the "ten outstanding 'copy' commands from user 
space" model too. So yeah, separate COPYFILE_ATOMIC (only succeed if you 
can do it as an atomic op in the filesystem) and COPYFILE_WAIT (only 
return when fully done) bits sounds conceptually fine to me.

Whether it's worth it for a filesystem that effectively only needs a 
couple of writes (that can be buffered too), I dunno. But it's certainly 
not something I'd object to on an interface level.

		Linus

* Re: [GIT PULL] ocfs2 changes for 2.6.32
  2009-09-17 20:17                             ` Linus Torvalds
  2009-09-17 20:34                               ` Joel Becker
@ 2009-09-17 20:42                               ` Roland Dreier
  2009-09-17 20:55                                 ` Linus Torvalds
  1 sibling, 1 reply; 33+ messages in thread
From: Roland Dreier @ 2009-09-17 20:42 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Joel Becker, Mark Fasheh, Andrew Morton,
	Linux Kernel Mailing List, ocfs2-devel


 > > I guess one bit of semantics to figure out is what happens if copyfile()
 > > does the async case but then copyfile_ctrl() returns an error halfway
 > > through... is the state of the dest file just undefined?

 > I think that's the one that most filesystems would prefer. Maybe the file 
 > is there, it's just that it's only half copied because the filesystem 
 > filled up. 

Makes sense... and even the state of a half-copied file is not really
well-defined, since a filesystem may want to optimize things by copying
out of order etc.

* Re: [GIT PULL] ocfs2 changes for 2.6.32
  2009-09-17 20:42                               ` Roland Dreier
@ 2009-09-17 20:55                                 ` Linus Torvalds
  0 siblings, 0 replies; 33+ messages in thread
From: Linus Torvalds @ 2009-09-17 20:55 UTC (permalink / raw)
  To: Roland Dreier
  Cc: Joel Becker, Mark Fasheh, Andrew Morton,
	Linux Kernel Mailing List, ocfs2-devel

On Thu, 17 Sep 2009, Roland Dreier wrote:
>  > > I guess one bit of semantics to figure out is what happens if copyfile()
>  > > does the async case but then copyfile_ctrl() returns an error halfway
>  > > through... is the state of the dest file just undefined?
> 
>  > I think that's the one that most filesystems would prefer. Maybe the file 
>  > is there, it's just that it's only half copied because the filesystem 
>  > filled up. 
> 
> Makes sense... and even the state of a half-copied file is not really
> well-defined, since a filesystem may want to optimize things by copying
> out of order etc.

Yeah. 

The thing to remember is that where you'd use a non-atomic 'copyfile()' 
system call is as a replacement for just doing the same thing by hand in 
user space, so any users of this system call would basically have to 
handle the "oops, it failed with ENOSPC in the middle" _anyway_.

So there is no downside to saying "ok, it failed in the middle, you clean 
up the result", because the user needs to support that anyway.

The ones that use copyfile for filesystem-specific snapshots and use the 
ATOMIC bit to say so obviously don't have this issue. But they aren't 
looking for a "faster 'cp'" thing - they're looking for very specific 
semantics, and for the ATOMIC case we can provide those kinds of nice 
atomic "everything or nothing" semantics.

So the fact that some random server-side copy over CIFS/NFS may need 
cleanup by the user after failing at half-point is not actually a 
downside. It doesn't affect Joel's kind of fancy use.

			Linus

* Re: [Ocfs2-devel] [GIT PULL] ocfs2 changes for 2.6.32
  2009-09-17 16:29                         ` Linus Torvalds
  2009-09-17 16:38                           ` Arjan van de Ven
  2009-09-17 18:40                           ` Roland Dreier
@ 2009-09-18  1:43                           ` Joel Becker
  2009-09-18 13:34                             ` Pádraig Brady
  2009-09-18 17:23                             ` Peter W. Morreale
  2 siblings, 2 replies; 33+ messages in thread
From: Joel Becker @ 2009-09-18  1:43 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mark Fasheh, Andrew Morton, Linux Kernel Mailing List,
	ocfs2-devel

On Thu, Sep 17, 2009 at 09:29:14AM -0700, Linus Torvalds wrote:
> Why would anybody want to hide it at all? Why even the libc hiding?
> 
> Nobody is going to use this except for special apps. Let them see what 
> they can do, in all its glory. 

	I expect everyone will use this through cp(1), so that cp(1) can
try to get server-side copy on the network filesystems.
	Speaking of "all its glory", what we have now is:

int sys_copyfileat(int oldfd, const char *oldname, int newfd,
                   const char *newname, int flags, int atflags)

> So I'd suggest something like having two system calls: one to start the 
> operation, and one to control it. And for a filesystem that does atomic 
> copies, the 'start' one obviously would also finish it, so the 'control' 
> one would be a no-op, because there would never be any outstanding ones.
> 
> See what I'm saying? It wouldn't complicate _your_ life, but it would 
> allow for filesystems that can't do it atomically (or even quickly).
> 
> So the first one would be something like
> 
> 	int copyfile(const char *src, const char *dest, unsigned long flags);
> 
> which would return:
> 
>  - zero on success
>  - negative (with errno) on error
>  - positive cookie on "I started it, here's my cookie". For extra bonus 
>    points, maybe the cookie would actually be a file descriptor (for 
>    poll/select users), but it would _not_ be a file descriptor to the 
>    resulting _file_, it would literally be a "cookie" to the actual 
>    copyfile event.

	Actually, if the cookie is a magic file descriptor, you don't
need ctl.  You can play tricks like polling for completion,
read(magic_fd, &remain, sizeof(loff_t)) for status, and close(magic_fd)
for cancel.  Might be a bit overloaded, though.
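
As a rough illustration of the magic-fd idea, the status query could
look like the sketch below.  Everything here is an assumption: no such
descriptor exists, copy_fd_remaining() is a hypothetical helper, and the
convention that the fd becomes readable and yields the remaining byte
count as an off_t is only one guess at the semantics (userspace off_t
stands in for loff_t):

```c
#include <assert.h>
#include <poll.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: wait up to timeout_ms for the copy's descriptor
 * to become readable, then read how many bytes are still to be copied.
 * Returns the remaining byte count (>= 0), or -1 on timeout or error.
 */
long long copy_fd_remaining(int magic_fd, int timeout_ms)
{
	struct pollfd pfd = { .fd = magic_fd, .events = POLLIN };
	off_t remain;
	int ready;

	assert(magic_fd >= 0);

	/* "polling for completion": block until there is status to read */
	ready = poll(&pfd, 1, timeout_ms);
	if (ready <= 0)
		return -1;

	/* "read(magic_fd, &remain, ...) for status" */
	if (read(magic_fd, &remain, sizeof(remain)) != (ssize_t)sizeof(remain))
		return -1;

	return (long long)remain;
}
```

Cancel would then simply be close(magic_fd), which needs no helper.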

> and then for ocfs2 you'd never return positive cookies. You'd never have 
> to worry about it.

	I suspect we'll later take advantage of copyfile's other
modes.  I did reflink as reflink only for the simple fact of doing one
thing and doing it well, not because I think copyfile isn't good.

> Then the second interface would be something like
> 
> 	int copyfile_ctrl(long cookie, unsigned long cmd);
> 
> where you'd just have some way to wait for completion and ask how much has 
> been copied. The 'cmd' would be some set of 'cancel', 'status' or 
> 'uninterruptible wait' or whatever, and the return value would again be
> 
>  - negative (with errno) for errors (copy failed) - cookie released
>  - zero for 'done' - cookie released
>  - positive for 'percent remaining' or whatever - cookie still valid
> 
> and this would be another callback into the filesystem code, but you'd 
> never have to worry about it, since you'd never see it (just leave it 
> NULL).

	I was going to ask about how to fit both calls into one inode
operation, but I see you're giving this as an additional inode
operation.
	This leaves us with a similar-to-reflink inode copyfile op and a
control op:

    ->copyfile(old_dentry, dir_inode, new_dentry, flags)
    ->copyfile_ctl(int cookie, unsigned int cmd)

	I have to change the flags a little, as my original proposal
didn't handle backoff correctly.

#define COPYFILE_WAIT		0x0001	/* Block until complete */
#define COPYFILE_ATOMIC		0x0002	/* Things copied must be
					   point-in-time and it must
					   fail or succeed completely. */
#define COPYFILE_ALLOW_COW	0x0004	/* The filesystem may share data
					   extents between the source
					   and target in a Copy-on-Write
					   fashion.  If neither
					   COPYFILE_ALLOW_COW nor
					   COPYFILE_REQUIRE_COW are
					   specified, data extents must
					   NOT be shared.  When neither
					   COW flag is provided, most
					   filesystems should return
					   -ENOTSUPP, as userspace can
					   do read-write looping
					   itself */
#define COPYFILE_REQUIRE_COW	0x0008	/* Data extents MUST be shared
					   between the source and target
					   in a Copy-on-Write fashion */
#define COPYFILE_UNPRIV_ATTRS	0x0010	/* Unprivileged attributes
					   should be copied from the
					   source to the target */
#define COPYFILE_PRIV_ATTRS	0x0020	/* Privileged attributes should
					   be copied from the source to
					   the target if the caller has
					   the necessary privileges */
#define COPYFILE_REQUIRE_ATTRS	0x0040	/* Combined with the other
					   attribute flags, the call
					   MUST fail if the caller lacks
					   the necessary privileges to
					   copy every attribute
					   requested */

#define COPYFILE_SNAPSHOT_ASYNC	(COPYFILE_REQUIRE_COW | \
				 COPYFILE_UNPRIV_ATTRS | \
				 COPYFILE_PRIV_ATTRS | \
				 COPYFILE_ATOMIC)
#define COPYFILE_SNAPSHOT_STRICT_ASYNC	(COPYFILE_SNAPSHOT_ASYNC | \
					 COPYFILE_REQUIRE_ATTRS)
#define COPYFILE_SNAPSHOT	(COPYFILE_SNAPSHOT_ASYNC | \
				 COPYFILE_WAIT)
#define COPYFILE_SNAPSHOT_STRICT	(COPYFILE_SNAPSHOT_STRICT_ASYNC | \
					 COPYFILE_WAIT)
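
To make the interaction between the two COW bits concrete, here is a
small sketch of how an implementation might classify a flags word.  The
flag values are the ones proposed above; enum cow_mode,
COPYFILE_ALL_FLAGS and copyfile_cow_mode() are hypothetical names used
only for illustration, and letting REQUIRE_COW win when both bits are
set is an assumption:

```c
#include <assert.h>

#define COPYFILE_WAIT		0x0001
#define COPYFILE_ATOMIC		0x0002
#define COPYFILE_ALLOW_COW	0x0004
#define COPYFILE_REQUIRE_COW	0x0008
#define COPYFILE_UNPRIV_ATTRS	0x0010
#define COPYFILE_PRIV_ATTRS	0x0020
#define COPYFILE_REQUIRE_ATTRS	0x0040

/* Hypothetical mask of every bit defined so far. */
#define COPYFILE_ALL_FLAGS	0x007f

#define COPYFILE_SNAPSHOT_ASYNC	(COPYFILE_REQUIRE_COW | \
				 COPYFILE_UNPRIV_ATTRS | \
				 COPYFILE_PRIV_ATTRS | \
				 COPYFILE_ATOMIC)
#define COPYFILE_SNAPSHOT	(COPYFILE_SNAPSHOT_ASYNC | \
				 COPYFILE_WAIT)
#define COPYFILE_SNAPSHOT_STRICT	(COPYFILE_SNAPSHOT | \
					 COPYFILE_REQUIRE_ATTRS)

enum cow_mode {
	COW_FORBIDDEN,	/* neither bit: data extents must NOT be shared */
	COW_ALLOWED,	/* ALLOW_COW: sharing is up to the filesystem */
	COW_REQUIRED,	/* REQUIRE_COW: fail if extents cannot be shared */
};

/* Classify the COW behavior requested by a flags word. */
enum cow_mode copyfile_cow_mode(unsigned int flags)
{
	/* Unknown bits are treated as a caller bug in this sketch. */
	assert((flags & ~COPYFILE_ALL_FLAGS) == 0);

	if (flags & COPYFILE_REQUIRE_COW)
		return COW_REQUIRED;
	if (flags & COPYFILE_ALLOW_COW)
		return COW_ALLOWED;
	return COW_FORBIDDEN;
}
```

With these definitions a plain copyfile(src, dst, 0) lands in the
"extents must not be shared" case, while COPYFILE_SNAPSHOT lands in the
"must share" case.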

> I dunno. The above seems like a fairly simple and powerful interface, and 
> I _think_ it would be ok for NFS and CIFS. And in fact, if that whole 
> "background copy" ends up being used a lot, maybe even a local filesystem 
> would implement it just to get easy overlapping IO - even if it would just 
> be a trivial common wrapper function that says "start a thread to do a 
> trivial manual copy".

	NFS and CIFS folks, please speak up.

Joel

-- 

"There is no more evil thing on earth than race prejudice, none at 
 all.  I write deliberately -- it is the worst single thing in life 
 now.  It justifies and holds together more baseness, cruelty and
 abomination than any other sort of error in the world." 
        - H. G. Wells

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127

* Re: [Ocfs2-devel] [GIT PULL] ocfs2 changes for 2.6.32
  2009-09-18  1:43                           ` [Ocfs2-devel] " Joel Becker
@ 2009-09-18 13:34                             ` Pádraig Brady
  2009-09-18 18:37                               ` Joel Becker
  2009-09-18 17:23                             ` Peter W. Morreale
  1 sibling, 1 reply; 33+ messages in thread
From: Pádraig Brady @ 2009-09-18 13:34 UTC (permalink / raw)
  To: Linus Torvalds, Mark Fasheh, Andrew Morton,
	Linux Kernel Mailing List, ocfs2-devel

Joel Becker wrote:
> On Thu, Sep 17, 2009 at 09:29:14AM -0700, Linus Torvalds wrote:
>> Why would anybody want to hide it at all? Why even the libc hiding?
>>
>> Nobody is going to use this except for special apps. Let them see what 
>> they can do, in all its glory. 
> 
> 	I expect everyone will use this through cp(1), so that cp(1) can
> try to get server-side copy on the network filesystems.

For reference, cp(1) has a --reflink option as of
coreutils-7.5 which currently just does:

  ioctl (dest_fd, BTRFS_IOC_CLONE, src_fd);

There is a specific option in cp to do this because
a "reflink copy" was seen to have these disadvantages:

  1. one copy of data blocks so more chances of data loss
  2. disk head seeking deferred to modification process
  3. possible fragmentation on write
  4. possible ENOSPC on write

Now 2. will go away with time, and 3 & 4 may be alleviated
by the use of fallocate(), but 1. was deemed important
enough to not enable by default.

cheers,
Pádraig.

* Re: [Ocfs2-devel] [GIT PULL] ocfs2 changes for 2.6.32
  2009-09-18 13:34                             ` Pádraig Brady
@ 2009-09-18 18:37                               ` Joel Becker
  0 siblings, 0 replies; 33+ messages in thread
From: Joel Becker @ 2009-09-18 18:37 UTC (permalink / raw)
  To: Pádraig Brady
  Cc: Linus Torvalds, Mark Fasheh, Andrew Morton,
	Linux Kernel Mailing List, ocfs2-devel

On Fri, Sep 18, 2009 at 02:34:18PM +0100, Pádraig Brady wrote:
> Joel Becker wrote:
> > On Thu, Sep 17, 2009 at 09:29:14AM -0700, Linus Torvalds wrote:
> >> Why would anybody want to hide it at all? Why even the libc hiding?
> >>
> >> Nobody is going to use this except for special apps. Let them see what
> >> they can do, in all its glory.
> >
> > 	I expect everyone will use this through cp(1), so that cp(1) can
> > try to get server-side copy on the network filesystems.
> 
> For reference, cp(1) has a --reflink option as of
> coreutils-7.5 which currently just does:
> 
>   ioctl (dest_fd, BTRFS_IOC_CLONE, src_fd);

	Note that the btrfs ioctl is not a reflink(), so this probably
wants changing (OCFS2_IOC_REFLINK is the ocfs2 ioctl, sys_reflink() was
going to be the syscall).

> There is a specific option in cp to do this because
> a "reflink copy" was seen to have these disadvantages:
> 
>   1. one copy of data blocks so more chances of data loss
>   2. disk head seeking deferred to modification process
>   3. possible fragmentation on write
>   4. possible ENOSPC on write
> 
> Now 2. will go away with time, and 3 & 4 may be alleviated
> by the use of fallocate(), but 1. was deemed important
> enough to not enable by default.

	1, 2, and 3 are definitely in the category of "it would be nice
to choose the behavior".  4 is the big one, because it breaks default
cp(1) assumptions.  The good news is that the current copyfile
idea of copyfile(src, dst, 0) would satisfy 1-4 and be efficient or
return -ENOTSUPP/-ENOSYS if it couldn't be.  Then cp(1) falls back to
the read-write loop.
	cp --reflink would become copyfile(src, dst, COPYFILE_SNAPSHOT)
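
The fallback in cp(1) might then look roughly like this.  It is only a
sketch: copyfile() does not exist, so the fast path is passed in as a
callback, and cp_with_fallback() plus the fast_*/slow_ok stand-ins are
hypothetical.  ENOTSUPP is a kernel-internal errno, so the sketch
defines it if missing and also accepts EOPNOTSUPP, on the assumption
that this is what userspace would actually see:

```c
#include <assert.h>
#include <errno.h>

#ifndef ENOTSUPP
#define ENOTSUPP 524	/* kernel-internal value, usually absent in userspace */
#endif

/* Both callbacks return 0 on success or a negative errno value. */
typedef int (*copy_fn)(const char *src, const char *dst);

/*
 * Try the efficient copy first; fall back to the read-write loop only
 * when the fast path reports that it is not available.  Real failures
 * (e.g. -ENOSPC) are passed straight back to the caller.
 */
int cp_with_fallback(const char *src, const char *dst,
		     copy_fn fast_copy, copy_fn slow_copy)
{
	int ret;

	assert(fast_copy && slow_copy);

	ret = fast_copy(src, dst);
	if (ret == -ENOSYS || ret == -ENOTSUPP || ret == -EOPNOTSUPP)
		return slow_copy(src, dst);
	return ret;
}

/* Stand-ins so the sketch can be exercised without a real syscall. */
int slow_calls;

int fast_unsupported(const char *src, const char *dst)
{
	(void)src; (void)dst;
	return -ENOSYS;		/* kernel or filesystem lacks the fast path */
}

int fast_enospc(const char *src, const char *dst)
{
	(void)src; (void)dst;
	return -ENOSPC;		/* fast path exists but the copy failed */
}

int slow_ok(const char *src, const char *dst)
{
	(void)src; (void)dst;
	slow_calls++;		/* the read-write loop would go here */
	return 0;
}
```

A positive "cookie" return for the async case is not handled here; the
sketch covers only the synchronous outcome.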

Joel

-- 

Life's Little Instruction Book #451

	"Don't be afraid to say, 'I'm sorry.'"

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127

* Re: [Ocfs2-devel] [GIT PULL] ocfs2 changes for 2.6.32
  2009-09-18  1:43                           ` [Ocfs2-devel] " Joel Becker
  2009-09-18 13:34                             ` Pádraig Brady
@ 2009-09-18 17:23                             ` Peter W. Morreale
  2009-09-18 18:39                               ` Joel Becker
  1 sibling, 1 reply; 33+ messages in thread
From: Peter W. Morreale @ 2009-09-18 17:23 UTC (permalink / raw)
  To: Joel Becker
  Cc: Linus Torvalds, Mark Fasheh, Andrew Morton,
	Linux Kernel Mailing List, ocfs2-devel

On Thu, 2009-09-17 at 18:43 -0700, Joel Becker wrote:
> On Thu, Sep 17, 2009 at 09:29:14AM -0700, Linus Torvalds wrote:
> > Why would anybody want to hide it at all? Why even the libc hiding?
> > 
> > Nobody is going to use this except for special apps. Let them see what 
> > they can do, in all its glory. 
> 
> 	I expect everyone will use this through cp(1), so that cp(1) can
> try to get server-side copy on the network filesystems.
> 	Speaking of "all its glory", what we have now is:
> 
> int sys_copyfileat(int oldfd, const char *oldname, int newfd,
>                    const char *newname, int flags, int atflags)


Would it be worthwhile to consider adding an offset and length?  

Then we get dd as well. (potentially) 


Best,
-PWM


* Re: [Ocfs2-devel] [GIT PULL] ocfs2 changes for 2.6.32
  2009-09-18 17:23                             ` Peter W. Morreale
@ 2009-09-18 18:39                               ` Joel Becker
  0 siblings, 0 replies; 33+ messages in thread
From: Joel Becker @ 2009-09-18 18:39 UTC (permalink / raw)
  To: Peter W. Morreale
  Cc: Linus Torvalds, Mark Fasheh, Andrew Morton,
	Linux Kernel Mailing List, ocfs2-devel

On Fri, Sep 18, 2009 at 11:23:33AM -0600, Peter W. Morreale wrote:
> On Thu, 2009-09-17 at 18:43 -0700, Joel Becker wrote:
> > On Thu, Sep 17, 2009 at 09:29:14AM -0700, Linus Torvalds wrote:
> > > Why would anybody want to hide it at all? Why even the libc hiding?
> > > 
> > > Nobody is going to use this except for special apps. Let them see what 
> > > they can do, in all its glory. 
> > 
> > 	I expect everyone will use this through cp(1), so that cp(1) can
> > try to get server-side copy on the network filesystems.
> > 	Speaking of "all its glory", what we have now is:
> > 
> > int sys_copyfileat(int oldfd, const char *oldname, int newfd,
> >                    const char *newname, int flags, int atflags)
> 
> 
> Would it be worthwhile to consider adding an offset and length?  
> 
> Then we get dd as well. (potentially) 

	I'm with Linus that a range attribute really makes this
complicated.  I also think it doesn't work well with a call that is
supposed to create newname.

Joel

-- 

	Pitchers and catchers report.

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127

* Re: [GIT PULL] ocfs2 changes for 2.6.32
  2009-09-14 21:32 ` Linus Torvalds
  2009-09-14 22:14   ` Joel Becker
@ 2009-09-15  6:44   ` Miklos Szeredi
  2009-09-23 11:02   ` [GIT PULL] ocfs2 changes for 2.6.32 (take 2, no syscall) Joel Becker
  2 siblings, 0 replies; 33+ messages in thread
From: Miklos Szeredi @ 2009-09-15  6:44 UTC (permalink / raw)
  To: torvalds; +Cc: Joel.Becker, mfasheh, akpm, linux-kernel, ocfs2-devel

On Mon, 14 Sep 2009, Linus Torvalds wrote:
> Are we going to add 'freflink[at]()' at some point?

We already have an interface for "freflink()" called splice().  Splice
has all the arguments needed to implement refcounted data copies and
more.

Yes, splice's name implies more of a "piecing data together" type of
operation, but we can look at reflink() or copyfile() as just "splice
one big piece from here to there".

Thanks,
Miklos

* [GIT PULL] ocfs2 changes for 2.6.32 (take 2, no syscall)
  2009-09-14 21:32 ` Linus Torvalds
  2009-09-14 22:14   ` Joel Becker
  2009-09-15  6:44   ` Miklos Szeredi
@ 2009-09-23 11:02   ` Joel Becker
  2 siblings, 0 replies; 33+ messages in thread
From: Joel Becker @ 2009-09-23 11:02 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Mark Fasheh, Andrew Morton, linux-kernel, ocfs2-devel

On Mon, Sep 14, 2009 at 02:32:36PM -0700, Linus Torvalds wrote:
> So I'm not pulling this. Not until I get the feeling that there is 
> consensus.

Linus,
	Here are all of the ocfs2 changes without the reflink(2)
system call.  The snapshot functionality is still available via the
ocfs2 ioctl(2).  I'm going to formulate a new system call based on
the discussion we've had, but clearly that's not something for 2.6.32
anymore.
	Please pull.

The following changes since commit 8379e7c46cc48f51197dd663fc6676f47f2a1e71:
  Sunil Mushran (1):
        ocfs2: ocfs2_write_begin_nolock() should handle len=0

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2.git upstream-linus

Coly Li (1):
      dlmglue.c: add missed mlog lines

Joel Becker (40):
      ocfs2: Make the ocfs2_caching_info structure self-contained.
      ocfs2: Change metadata caching locks to an operations structure.
      ocfs2: Take the inode out of the metadata read/write paths.
      ocfs2: move ip_last_trans to struct ocfs2_caching_info
      ocfs2: move ip_created_trans to struct ocfs2_caching_info
      ocfs2: Pass struct ocfs2_caching_info to the journal functions.
      ocfs2: Store the ocfs2_caching_info on ocfs2_extent_tree.
      ocfs2: Pass ocfs2_caching_info to ocfs2_read_extent_block().
      ocfs2: ocfs2_find_path() only needs the caching info
      ocfs2: ocfs2_create_new_meta_bhs() doesn't need struct inode.
      ocfs2: Pass ocfs2_extent_tree to ocfs2_unlink_path()
      ocfs2: ocfs2_complete_edge_insert() doesn't need struct inode at all.
      ocfs2: Get inode out of ocfs2_rotate_subtree_root_right().
      ocfs2: Pass ocfs2_extent_tree to ocfs2_get_subtree_root()
      ocfs2: Drop struct inode from ocfs2_extent_tree_operations.
      ocfs2: ocfs2_rotate_tree_right() doesn't need struct inode.
      ocfs2: ocfs2_update_edge_lengths() doesn't need struct inode.
      ocfs2: ocfs2_rotate_subtree_left() doesn't need struct inode.
      ocfs2: __ocfs2_rotate_tree_left() doesn't need struct inode.
      ocfs2: ocfs2_rotate_tree_left() no longer needs struct inode.
      ocfs2: ocfs2_merge_rec_left/right() no longer need struct inode.
      ocfs2: ocfs2_try_to_merge_extent() doesn't need struct inode.
      ocfs2: ocfs2_grow_branch() and ocfs2_append_rec_to_path() lose struct inode.
      ocfs2: ocfs2_truncate_rec() doesn't need struct inode.
      ocfs2: Make truncating the extent map an extent_tree_operation.
      ocfs2: ocfs2_insert_at_leaf() doesn't need struct inode.
      ocfs2: Give ocfs2_split_record() an extent_tree instead of an inode.
      ocfs2: ocfs2_do_insert_extent() and ocfs2_insert_path() no longer need an inode.
      ocfs2: ocfs2_extent_contig() only requires the superblock.
      ocfs2: Swap inode for extent_tree in ocfs2_figure_merge_contig_type().
      ocfs2: Remove inode from ocfs2_figure_extent_contig().
      ocfs2: ocfs2_figure_insert_type() no longer needs struct inode.
      ocfs2: Make extent map insertion an extent_tree_operation.
      ocfs2: ocfs2_insert_extent() no longer needs struct inode.
      ocfs2: ocfs2_add_clusters_in_btree() no longer needs struct inode.
      ocfs2: ocfs2_remove_extent() no longer needs struct inode.
      ocfs2: ocfs2_split_and_insert() no longer needs struct inode.
      ocfs2: Teach ocfs2_replace_extent_rec() to use an extent_tree.
      ocfs2: __ocfs2_mark_extent_written() doesn't need struct inode.
      ocfs2: Pass ocfs2_caching_info into ocfs_init_*_extent_tree().

Sunil Mushran (1):
      ocfs2: __ocfs2_abort() should not enable panic for local mounts

Tao Ma (42):
      ocfs2: Define refcount tree structure.
      ocfs2: Add metaecc for ocfs2_refcount_block.
      ocfs2: Add ocfs2_read_refcount_block.
      ocfs2: Abstract caching info checkpoint.
      ocfs2: Add new refcount tree lock resource in dlmglue.
      ocfs2: Add caching info for refcount tree.
      ocfs2: Add refcount tree lock mechanism.
      ocfs2: Basic tree root operation.
      ocfs2: Wrap ocfs2_extent_contig in ocfs2_extent_tree.
      ocfs2: Abstract extent split process.
      ocfs2: Add refcount b-tree as a new extent tree.
      ocfs2: move tree path functions to alloc.h.
      ocfs2: Add support for incrementing refcount in the tree.
      ocfs2: Add support of decrementing refcount for delete.
      ocfs2: Add functions for extents refcounted.
      ocfs2: Decrement refcount when truncating refcounted extents.
      ocfs2: Add CoW support.
      ocfs2: CoW refcount tree improvement.
      ocfs2: Integrate CoW in file write.
      ocfs2: CoW a reflinked cluster when it is truncated.
      ocfs2: Add normal functions for reflink a normal file's extents.
      ocfs2: handle file attributes issue for reflink.
      ocfs2: Return extent flags for xattr value tree.
      ocfs2: Abstract duplicate clusters process in CoW.
      ocfs2: Add CoW support for xattr.
      ocfs2: Remove inode from ocfs2_xattr_bucket_get_name_value.
      ocfs2: Abstract the creation of xattr block.
      ocfs2: Abstract ocfs2 xattr tree extend rec iteration process.
      ocfs2: Attach xattr clusters to refcount tree.
      ocfs2: Call refcount tree remove process properly.
      ocfs2: Create an xattr indexed block if needed.
      ocfs2: Add reflink support for xattr.
      ocfs2: Modify removing xattr process for refcount.
      ocfs2: Don't merge in 1st refcount ops of reflink.
      ocfs2: Make transaction extend more efficient.
      ocfs2: Use proper parameter for some inode operation.
      ocfs2: Create reflinked file in orphan dir.
      ocfs2: Add preserve to reflink.
      ocfs2: Implement ocfs2_reflink.
      ocfs2: Enable refcount tree support.
      ocfs2: Add ioctl for reflink.
      ocfs2: Use buffer IO if we are appending a file.

Wengang Wang (1):
      ocfs2: add spinlock protection when dealing with lockres->purge.

 fs/ocfs2/Makefile          |    1 +
 fs/ocfs2/alloc.c           | 1342 ++++++++------
 fs/ocfs2/alloc.h           |  101 +-
 fs/ocfs2/aops.c            |   37 +-
 fs/ocfs2/aops.h            |    2 +
 fs/ocfs2/buffer_head_io.c  |   47 +-
 fs/ocfs2/buffer_head_io.h  |    8 +-
 fs/ocfs2/cluster/masklog.c |    1 +
 fs/ocfs2/cluster/masklog.h |    1 +
 fs/ocfs2/dir.c             |  107 +-
 fs/ocfs2/dlm/dlmthread.c   |    6 +-
 fs/ocfs2/dlmglue.c         |  105 +-
 fs/ocfs2/dlmglue.h         |    6 +
 fs/ocfs2/extent_map.c      |   33 +-
 fs/ocfs2/extent_map.h      |    8 +-
 fs/ocfs2/file.c            |  151 ++-
 fs/ocfs2/file.h            |    2 +
 fs/ocfs2/inode.c           |   86 +-
 fs/ocfs2/inode.h           |   20 +-
 fs/ocfs2/ioctl.c           |   14 +
 fs/ocfs2/journal.c         |   82 +-
 fs/ocfs2/journal.h         |   94 +-
 fs/ocfs2/localalloc.c      |   12 +-
 fs/ocfs2/namei.c           |  341 ++++-
 fs/ocfs2/namei.h           |    6 +
 fs/ocfs2/ocfs2.h           |   52 +-
 fs/ocfs2/ocfs2_fs.h        |  107 ++-
 fs/ocfs2/ocfs2_lockid.h    |    5 +
 fs/ocfs2/quota_global.c    |    5 +-
 fs/ocfs2/quota_local.c     |   26 +-
 fs/ocfs2/refcounttree.c    | 4313 ++++++++++++++++++++++++++++++++++++++++++++
 fs/ocfs2/refcounttree.h    |  106 ++
 fs/ocfs2/resize.c          |   16 +-
 fs/ocfs2/slot_map.c        |   10 +-
 fs/ocfs2/suballoc.c        |   35 +-
 fs/ocfs2/super.c           |   13 +-
 fs/ocfs2/uptodate.c        |  265 ++--
 fs/ocfs2/uptodate.h        |   51 +-
 fs/ocfs2/xattr.c           | 2056 +++++++++++++++++++--
 fs/ocfs2/xattr.h           |   15 +-
 40 files changed, 8512 insertions(+), 1176 deletions(-)
 create mode 100644 fs/ocfs2/refcounttree.c
 create mode 100644 fs/ocfs2/refcounttree.h
-- 

"But then she looks me in the eye
 And says, 'We're going to last forever,'
 And man you know I can't begin to doubt it.
 Cause it just feels so good and so free and so right,
 I know we ain't never going to change our minds about it, Hey!
 Here comes my girl."

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127


* Re: [GIT PULL] ocfs2 changes for 2.6.32
@ 2009-09-22  0:51 George Spelvin
  0 siblings, 0 replies; 33+ messages in thread
From: George Spelvin @ 2009-09-22  0:51 UTC (permalink / raw)
  To: Joel.Becker; +Cc: linux, linux-kernel, torvalds

> Perhaps ->copyfile takes the following flags:
>
> #define ALLOW_COW_SHARED	0x0001
> #define REQUIRE_COW_SHARED	0x0002
> #define REQUIRE_BASIC_ATTRS	0x0004
> #define REQUIRE_FULL_ATTRS	0x0008
> #define REQUIRE_ATOMIC		0x0010
> #define SNAPSHOT		(REQUIRE_COW_SHARED | \
> 				 REQUIRE_BASIC_ATTRS | \
> 				 REQUIRE_ATOMIC)
> #define SNAPSHOT_PRESERVE	(SNAPSHOT | REQUIRE_FULL_ATTRS)

Um, could I strongly suggest that flags == 0 be the "succeed if at all
possible" case, and that the various options limit it?

In particular, invert ALLOW_COW_SHARED to REQUIRE_ALLOCATE.

Another possibly useful flag would be REQUIRE_OPTIMIZED.
I.e. if it's not appreciably faster than a read/write loop, perhaps
the application would prefer to do it itself.

We also have to define the error code to return in case of a flag
violation.  ENOTSUP?
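
To make the suggestion concrete, here is a minimal sketch of a flag check in
which flags == 0 always succeeds and every set bit only narrows what counts as
success.  The flag names follow the ones in this message and the quoted
proposal; the numeric values, the check_copy_flags() helper, the "supported"
capability mask, and the use of EINVAL for contradictory requests are all
assumptions made for illustration.  ENOTSUP is used only because it is the
error code floated above, not because anything was settled.

```c
#include <errno.h>

/* Hypothetical values: each set bit is a requirement the caller insists on. */
#define REQUIRE_ALLOCATE	0x0001u	/* inverse of ALLOW_COW_SHARED */
#define REQUIRE_COW_SHARED	0x0002u
#define REQUIRE_BASIC_ATTRS	0x0004u
#define REQUIRE_FULL_ATTRS	0x0008u
#define REQUIRE_ATOMIC		0x0010u
#define REQUIRE_OPTIMIZED	0x0020u	/* assumed value */

/*
 * Decide whether a copy request can proceed.
 * flags:     requirements requested by the caller (0 = anything goes)
 * supported: requirements this (hypothetical) filesystem can guarantee
 * Returns 0 if the request is acceptable, a negative errno value otherwise.
 */
int check_copy_flags(unsigned int flags, unsigned int supported)
{
	/* Asking for both private and shared blocks cannot be satisfied. */
	if ((flags & REQUIRE_ALLOCATE) && (flags & REQUIRE_COW_SHARED))
		return -EINVAL;

	/* Any requirement the filesystem cannot guarantee is a violation. */
	if (flags & ~supported)
		return -ENOTSUP;

	return 0;
}
```

With this shape, check_copy_flags(0, supported) succeeds for every value of
"supported", which is the property being argued for: a caller that asks for
nothing special never fails on a flag violation.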


* Re: [GIT PULL] ocfs2 changes for 2.6.32
@ 2009-09-22  3:28 George Spelvin
  0 siblings, 0 replies; 33+ messages in thread
From: George Spelvin @ 2009-09-22  3:28 UTC (permalink / raw)
  To: arjan; +Cc: linux, linux-kernel, torvalds

> or make it one level simpler?
> Have a "wait for all started copies" call only.... saves a ton of book
> keeping, and is likely what people will use it for in the end anyway.

Aieeee!  No, no, a thousand times, no!

We do NOT need another blocking primitive that can't play well with others.
That would be a HORRIBLE design mistake.

The whole benefit of Linus' scheme is that it returns a file descriptor.
Any waiting should be done by the standard event-wait system call, poll().
It should return POLLIN when there's an interesting event (such as copy
completion), and should remain valid until explicitly close()d.

There's nothing wrong with a convenience function that waits for all
started copies, but I don't see a reason to design a new kernel interface
for the purpose.
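
As a rough illustration, such a convenience function can live entirely in user
space on top of poll().  The sketch below assumes only what is described
above: each started copy is represented by a file descriptor that reports
POLLIN once something interesting has happened.  The name wait_all_copies()
and the MAX_COPIES limit are hypothetical, and since copyfile() does not
exist, any pollable descriptor (a pipe, for instance) can stand in when trying
it out.

```c
#include <errno.h>
#include <poll.h>
#include <stddef.h>

#define MAX_COPIES 64	/* arbitrary limit chosen for this sketch */

/*
 * Block until every descriptor in fds[0..n-1] has reported an event.
 * Returns 0 on success, -1 with errno set on failure.
 */
int wait_all_copies(const int *fds, size_t n)
{
	struct pollfd pfds[MAX_COPIES];
	size_t pending = n;
	size_t i;

	if (n > MAX_COPIES) {
		errno = EINVAL;
		return -1;
	}

	for (i = 0; i < n; i++) {
		pfds[i].fd = fds[i];
		pfds[i].events = POLLIN;
		pfds[i].revents = 0;
	}

	while (pending > 0) {
		if (poll(pfds, (nfds_t)n, -1) < 0) {
			if (errno == EINTR)
				continue;	/* interrupted by a signal: retry */
			return -1;
		}

		for (i = 0; i < n; i++) {
			/* Error conditions also count as "done" to avoid spinning. */
			if (pfds[i].fd >= 0 &&
			    (pfds[i].revents & (POLLIN | POLLERR | POLLHUP | POLLNVAL))) {
				pfds[i].fd = -1;	/* poll() ignores negative fds */
				pending--;
			}
		}
	}

	return 0;
}
```

Because the descriptors remain ordinary pollable objects, a caller that needs
to multiplex copies with sockets or timers can skip this helper and add the
same descriptors to its own poll() set.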

A few more points:
- If the file descriptor returned by copyfile() is guaranteed not to be 0
  (even if that is available), perhaps it should be guaranteed to be >= 3.
- We might as well make the returned file descriptor O_CLOEXEC by default.
  You can always change it back if you want to.
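
For comparison, both properties can already be obtained from user space with
fcntl(), which suggests what the kernel-side default would amount to.  This is
only a sketch: the helper names move_fd_high_cloexec() and clear_cloexec() are
made up for illustration, while F_DUPFD_CLOEXEC, F_GETFD, F_SETFD and
FD_CLOEXEC are the standard fcntl() interfaces.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <unistd.h>

/*
 * Duplicate fd onto the lowest free descriptor numbered >= 3, with
 * close-on-exec set, and close the original.
 * Returns the new descriptor, or -1 with errno set (fd is left open).
 */
int move_fd_high_cloexec(int fd)
{
	int newfd = fcntl(fd, F_DUPFD_CLOEXEC, 3);

	if (newfd < 0)
		return -1;
	close(fd);
	return newfd;
}

/* "Change it back": clear close-on-exec so the fd survives exec(). */
int clear_cloexec(int fd)
{
	int flags = fcntl(fd, F_GETFD);

	if (flags < 0)
		return -1;
	return fcntl(fd, F_SETFD, flags & ~FD_CLOEXEC);
}
```

Doing this after the fact leaves a window in which another thread can fork()
and exec() while the descriptor is still inheritable, which is one argument
for making O_CLOEXEC the default at creation time.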


end of thread, other threads:[~2009-09-23 11:02 UTC | newest]

Thread overview: 33+ messages
2009-09-11 20:04 [GIT PULL] ocfs2 changes for 2.6.32 Joel Becker
2009-09-14 21:32 ` Linus Torvalds
2009-09-14 22:14   ` Joel Becker
2009-09-14 23:27     ` Linus Torvalds
2009-09-15  0:04       ` Joel Becker
2009-09-15  0:31         ` Linus Torvalds
2009-09-15  0:54           ` Joel Becker
2009-09-15  2:01             ` Linus Torvalds
2009-09-15  4:05               ` Arjan van de Ven
2009-09-15  4:35                 ` Joel Becker
2009-09-15  4:06               ` Joel Becker
2009-09-15 16:30                 ` Linus Torvalds
2009-09-15 21:45                   ` Joel Becker
2009-09-16  4:20                     ` Linus Torvalds
2009-09-16  4:40                       ` Joel Becker
2009-09-17 16:29                         ` Linus Torvalds
2009-09-17 16:38                           ` Arjan van de Ven
2009-09-17 20:16                             ` Linus Torvalds
2009-09-17 18:40                           ` Roland Dreier
2009-09-17 20:17                             ` Linus Torvalds
2009-09-17 20:34                               ` Joel Becker
2009-09-18  0:29                                 ` Linus Torvalds
2009-09-17 20:42                               ` Roland Dreier
2009-09-17 20:55                                 ` Linus Torvalds
2009-09-18  1:43                           ` [Ocfs2-devel] " Joel Becker
2009-09-18 13:34                             ` Pádraig Brady
2009-09-18 18:37                               ` Joel Becker
2009-09-18 17:23                             ` Peter W. Morreale
2009-09-18 18:39                               ` Joel Becker
2009-09-15  6:44   ` Miklos Szeredi
2009-09-23 11:02   ` [GIT PULL] ocfs2 changes for 2.6.32 (take 2, no syscall) Joel Becker
  -- strict thread matches above, loose matches on Subject: below --
2009-09-22  0:51 [GIT PULL] ocfs2 changes for 2.6.32 George Spelvin
2009-09-22  3:28 George Spelvin
