[RFC] The reflink(2) system call.

* [RFC] The reflink(2) system call.
@ 2009-05-03  6:15 Joel Becker
  2009-05-03  6:15 ` [PATCH 1/3] fs: Document the " Joel Becker
                   ` (3 more replies)
  0 siblings, 4 replies; 151+ messages in thread
From: Joel Becker @ 2009-05-03  6:15 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: jmorris, ocfs2-devel, viro

Hi everyone,
	I described the reflink operation at the Linux Storage &
Filesystems Workshop last month.  It was originally implemented as an
ocfs2-specific ioctl, but the consensus was that it should be a syscall
from the get-go.  Here are some first-cut patches.
	For people who have not seen reflink, either at LSF or on the
ocfs2 wiki, the first patch contains
Documentation/filesystems/reflink.txt to describe the call.  The
short-short version is that reflink creates a reference-counted link.
This is a new file that shares the data extents of a source file in a
copy-on-write fashion.
	The second patch adds iops->reflink() and vfs_reflink().  People
interested in LSM interaction, please look at my comments in the patch
header and the implementation of vfs_link().  I think it needs
improvement.
	The last patch defines sys_reflink() and sys_reflinkat().  It
also hooks them up for x86_32.  The final version of this patch will
obviously include the other architectures.
	The patches are also available in my git tree:

  git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2.git reflink

The current ioctl-based implementation for ocfs2 is available in Tao's
git tree at:

  git://oss.oracle.com/git/tma/linux-2.6.git refcount

It will be reset atop the system call very soon.
	Please send any comments along.

Joel

 Documentation/filesystems/reflink.txt |  129 ++++++++++++++++++++++++++++++++++
 Documentation/filesystems/vfs.txt     |    4 +
 arch/x86/include/asm/unistd_32.h      |    1 
 arch/x86/kernel/syscall_table_32.S    |    1 
 fs/namei.c                            |   96 +++++++++++++++++++++++++
 include/linux/fs.h                    |    2 
 6 files changed, 233 insertions(+)

-- 

"But then she looks me in the eye
 And says, 'We're going to last forever,'
 And man you know I can't begin to doubt it.
 Cause it just feels so good and so free and so right,
 I know we ain't never going to change our minds about it, Hey!
 Here comes my girl."

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127




* [PATCH 1/3] fs: Document the reflink(2) system call.
  2009-05-03  6:15 [RFC] The reflink(2) system call Joel Becker
@ 2009-05-03  6:15 ` Joel Becker
  2009-05-03  8:01   ` Christoph Hellwig
                     ` (3 more replies)
  2009-05-03  6:15 ` [PATCH 2/3] fs: Add vfs_reflink() and the ->reflink() inode operation Joel Becker
                   ` (2 subsequent siblings)
  3 siblings, 4 replies; 151+ messages in thread
From: Joel Becker @ 2009-05-03  6:15 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: jmorris, ocfs2-devel, viro

int reflink(const char *oldpath, const char *newpath);

The reflink(2) system call creates reference-counted links.  It creates
a new file that shares the data extents of the source file in a
copy-on-write fashion.  Its calling semantics are identical to link(2).
Once complete, programs see the new file as a completely separate entry.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
---
 Documentation/filesystems/reflink.txt |  129 +++++++++++++++++++++++++++++++++
 Documentation/filesystems/vfs.txt     |    4 +
 2 files changed, 133 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/filesystems/reflink.txt

diff --git a/Documentation/filesystems/reflink.txt b/Documentation/filesystems/reflink.txt
new file mode 100644
index 0000000..f3620f0
--- /dev/null
+++ b/Documentation/filesystems/reflink.txt
@@ -0,0 +1,129 @@
+reflink(2)
+==========
+
+NAME
+----
+reflink - make a reference-counted link of a file
+
+
+SYNOPSIS
+--------
+#include <unistd.h>
+
+int reflink(const char *oldpath, const char *newpath);
+
+DESCRIPTION
+-----------
+reflink() creates a new reflink (also known as a reference-counted link)
+to an existing file.  This reflink is a new file object that shares the
+attributes and data extents of the source object in a copy-on-write fashion.
+
+An easy way to think of it is that the semantics of the reflink() call
+are identical to the link(2) system call, but the resulting file object
+behaves as if it were a copy with identical attributes.
+
+Like the link(2) system call, if newpath exists, it will not be overwritten.
+oldpath must be a regular file.  oldpath and newpath must be on the same
+mounted filesystem.
+
+All data extents of the new file must be shared with the source file in
+a copy-on-write fashion.  This includes data extents for extended
+attributes.  If either the source or new files are written to, the
+changes do not show up in the other file.
+
+All file attributes and extended attributes of the new file must be
+identical to the source file with the following exceptions:
+
+- The new file must have a new inode number.  This allows POSIX
+  programs to treat the source and new files as separate objects.  From
+  the view of the POSIX application, the files are distinct.  The
+  sharing is invisible outside the filesystem.
+- The ctime of the source file only changes if the source's metadata
+  must be changed to accommodate the copy-on-write linkage.  The ctime of
+  the new file is set to represent its creation.
+- The mtime of the source file is unmodified, and the mtime of the new file
+  is set identical to the source file.  This reflects that the data is
+  unchanged.
+- The link count of the source file is unchanged, and the link count of
+  the new file is one.
+
+RETURN VALUE
+------------
+On success, zero is returned.  On error, -1 is returned, and errno is
+set appropriately.
+
+ERRORS
+------
+EACCES::
+	Write access to the directory containing newpath is denied, or
+	search permission is denied for one of the directories in the
+	path prefix of oldpath or newpath.  (See also path_resolution(7).)
+
+EEXIST::
+	newpath already exists.
+
+EFAULT::
+	oldpath or newpath points outside your accessible address space.
+
+EIO::
+	An I/O error occurred.
+
+ELOOP::
+	Too many symbolic links were encountered in resolving oldpath or
+	newpath.
+
+ENAMETOOLONG::
+	oldpath or newpath was too long.
+
+ENOENT::
+	A directory component in oldpath or newpath does not exist or is
+	a dangling symbolic link.
+
+ENOMEM::
+	Insufficient kernel memory was available.
+
+ENOSPC::
+	The device containing the file has no room for the new directory
+	entry or file object.
+
+ENOTDIR::
+	A component used as a directory in oldpath or newpath is not, in
+	fact, a directory.
+
+EPERM::
+	oldpath is a directory.
+
+EPERM::
+	The file system containing oldpath and newpath does not support
+	the creation of reference-counted links.
+
+EROFS::
+	The file is on a read-only file system.
+
+EXDEV::
+	oldpath and newpath are not on the same mounted file system.
+	(Linux permits a file system to be mounted at multiple points,
+	but reflink() does not work across different mount points, even if
+	the same file system is mounted on both.)
+
+VERSIONS
+--------
+reflink() is available on Linux since kernel 2.6.31.
+
+CONFORMING TO
+-------------
+reflink() is Linux-specific.
+
+NOTES
+-----
+reflink() dereferences symbolic links in the same manner that link(2)
+does.  For precise control over the treatment of symbolic links, see
+reflinkat().
+
+In the case of a crash, the new file must not appear partially complete
+in the filesystem.
+
+SEE ALSO
+--------
+ln(1), reflink(1), reflinkat(2), path_resolution(7)
+
diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
index f49eecf..01cd810 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -333,6 +333,7 @@ struct inode_operations {
 	ssize_t (*listxattr) (struct dentry *, char *, size_t);
 	int (*removexattr) (struct dentry *, const char *);
 	void (*truncate_range)(struct inode *, loff_t, loff_t);
+	int (*reflink) (struct dentry *,struct inode *,struct dentry *);
 };
 
 Again, all methods are called without any locks being held, unless
@@ -431,6 +432,9 @@ otherwise noted.
 
   truncate_range: a method provided by the underlying filesystem to truncate a
   	range of blocks , i.e. punch a hole somewhere in a file.
+  reflink: called by the reflink(2) system call. Only required if you want
+	to support reflinks.  For further information, see
+	Documentation/filesystems/reflink.txt.
 
 
 The Address Space Object
-- 
1.6.1.3



* Re: [PATCH 1/3] fs: Document the reflink(2) system call.
  2009-05-03  6:15 ` [PATCH 1/3] fs: Document the " Joel Becker
@ 2009-05-03  8:01   ` Christoph Hellwig
  2009-05-04  2:46     ` Joel Becker
  2009-05-03 13:08   ` Boaz Harrosh
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 151+ messages in thread
From: Christoph Hellwig @ 2009-05-03  8:01 UTC (permalink / raw)
  To: Joel Becker; +Cc: linux-fsdevel, jmorris, ocfs2-devel, viro, mtk.manpages

On Sat, May 02, 2009 at 11:15:01PM -0700, Joel Becker wrote:
> int reflink(const char *oldpath, const char *newpath);
> 
> The reflink(2) system call creates reference-counted links.  It creates
> a new file that shares the data extents of the source file in a
> copy-on-write fashion.  Its calling semantics are identical to link(2).
> Once complete, programs see the new file as a completely separate entry.

Just send this as a manpage to Michael, no need to duplicate a
pseudo-manpage in the kernel tree.



* Re: [PATCH 1/3] fs: Document the reflink(2) system call.
  2009-05-03  8:01   ` Christoph Hellwig
@ 2009-05-04  2:46     ` Joel Becker
  2009-05-04  6:36       ` Michael Kerrisk
  0 siblings, 1 reply; 151+ messages in thread
From: Joel Becker @ 2009-05-04  2:46 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-fsdevel, mtk.manpages, jmorris, ocfs2-devel, viro

On Sun, May 03, 2009 at 04:01:12AM -0400, Christoph Hellwig wrote:
> On Sat, May 02, 2009 at 11:15:01PM -0700, Joel Becker wrote:
> > int reflink(const char *oldpath, const char *newpath);
> > 
> > The reflink(2) system call creates reference-counted links.  It creates
> > a new file that shares the data extents of the source file in a
> > copy-on-write fashion.  Its calling semantics are identical to link(2).
> > Once complete, programs see the new file as a completely separate entry.
> 
> Just send this as a manpage to Michael, no need to duplicate a
> pseudo-manpage in the kernel tree.

	The manpage style was just a convenient way to organize my
thoughts.  The goal was to document the behavior of reflink() for
implementors.  If the pseudo-manpage doesn't work, perhaps I'll try some
other form.

Joel

-- 

Life's Little Instruction Book #337

	"Reread your favorite book."

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127


* Re: [PATCH 1/3] fs: Document the reflink(2) system call.
  2009-05-04  2:46     ` Joel Becker
@ 2009-05-04  6:36       ` Michael Kerrisk
  2009-05-04  7:12         ` Joel Becker
  0 siblings, 1 reply; 151+ messages in thread
From: Michael Kerrisk @ 2009-05-04  6:36 UTC (permalink / raw)
  To: Christoph Hellwig, linux-fsdevel, jmorris, ocfs2-devel, viro,
	mtk.manpages

On Mon, May 4, 2009 at 2:46 PM, Joel Becker <Joel.Becker@oracle.com> wrote:
> On Sun, May 03, 2009 at 04:01:12AM -0400, Christoph Hellwig wrote:
>> On Sat, May 02, 2009 at 11:15:01PM -0700, Joel Becker wrote:
>> > int reflink(const char *oldpath, const char *newpath);
>> >
>> > The reflink(2) system call creates reference-counted links.  It creates
>> > a new file that shares the data extents of the source file in a
>> > copy-on-write fashion.  Its calling semantics are identical to link(2).
>> > Once complete, programs see the new file as a completely separate entry.
>>
>> Just send this as a manpage to Michael, no need to duplicate a
>> pseudo-manpage in the kernel tree.
>
>        The manpage style was just a convenient way to organize my
> thoughts.  The goal was to document the behavior of reflink() for
> implementors.  If the pseudo-manpage doesn't work, perhaps I'll try some
> other form.

So, I'm late to this thread.  Is reflink() (to be) a user-visible syscall?

Cheers,

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
git://git.kernel.org/pub/scm/docs/man-pages/man-pages.git
man-pages online: http://www.kernel.org/doc/man-pages/online_pages.html
Found a bug? http://www.kernel.org/doc/man-pages/reporting_bugs.html

* Re: [PATCH 1/3] fs: Document the reflink(2) system call.
  2009-05-04  6:36       ` Michael Kerrisk
@ 2009-05-04  7:12         ` Joel Becker
  0 siblings, 0 replies; 151+ messages in thread
From: Joel Becker @ 2009-05-04  7:12 UTC (permalink / raw)
  To: mtk.manpages; +Cc: Christoph Hellwig, linux-fsdevel, jmorris, ocfs2-devel, viro

On Mon, May 04, 2009 at 06:36:58PM +1200, Michael Kerrisk wrote:
> On Mon, May 4, 2009 at 2:46 PM, Joel Becker <Joel.Becker@oracle.com> wrote:
> > On Sun, May 03, 2009 at 04:01:12AM -0400, Christoph Hellwig wrote:
> >> On Sat, May 02, 2009 at 11:15:01PM -0700, Joel Becker wrote:
> >> > int reflink(const char *oldpath, const char *newpath);
> >> >
> >> > The reflink(2) system call creates reference-counted links.  It creates
> >> > a new file that shares the data extents of the source file in a
> >> > copy-on-write fashion.  Its calling semantics are identical to link(2).
> >> > Once complete, programs see the new file as a completely separate entry.
> >>
> >> Just send this as a manpage to Michael, no need to duplicate a
> >> pseudo-manpage in the kernel tree.
> >
> >        The manpage style was just a convenient way to organize my
> > thoughts.  The goal was to document the behavior of reflink() for
> > implementors.  If the pseudo-manpage doesn't work, perhaps I'll try some
> > other form.
> 
> So, I'm late to this thread.  Is reflink() (to be) a user-visible syscall?

	Yes.  The actual call will be reflinkat(), as they're correct
that userspace can wrap reflink() around it.  I did have you on my todo
for notification as it settled down.

Joel

-- 

"Well-timed silence hath more eloquence than speech."  
         - Martin Farquhar Tupper

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127


* Re: [PATCH 1/3] fs: Document the reflink(2) system call.
  2009-05-03  6:15 ` [PATCH 1/3] fs: Document the " Joel Becker
  2009-05-03  8:01   ` Christoph Hellwig
@ 2009-05-03 13:08   ` Boaz Harrosh
  2009-05-03 23:08     ` Al Viro
  2009-05-04  2:49     ` Joel Becker
  2009-05-03 23:45   ` Theodore Tso
  2009-05-05  1:07   ` Jamie Lokier
  3 siblings, 2 replies; 151+ messages in thread
From: Boaz Harrosh @ 2009-05-03 13:08 UTC (permalink / raw)
  To: Joel Becker; +Cc: linux-fsdevel, jmorris, ocfs2-devel, viro

On 05/03/2009 09:15 AM, Joel Becker wrote:
> int reflink(const char *oldpath, const char *newpath);
> 
> The reflink(2) system call creates reference-counted links.  It creates
> a new file that shares the data extents of the source file in a
> copy-on-write fashion.  Its calling semantics are identical to link(2).
> Once complete, programs see the new file as a completely separate entry.
> 

Please forgive my complete Unix jargon novice-ness, but from here it looks like the
name is very wrong, and confusing.

if I put data to link graph then:

[data]<--[hard-link (one or more)]<--[soft-link(zero or more)]

The data is otherwise just there on disk but is unavailable until
it is linked to at least one dir-entry. The middle hard-link is reference
counted and once all uses are removed data can be garbage collected. Soft links
don't follow on-disk data but follow a dir-entry. So if we have a completely
different on disk data we're still in agreement with the dir-entry.

In the graph above, and as explained below, there is no reference counting
going on:
> +- The link count of the source file is unchanged, and the link count of
> +  the new file is one.

And the "link" meaning is very vaguely kept, only halfway until the next
write. (If it can be called a link at all being a different inode and cached
twice)

As my first impression when I read the title of the patch, an English reflink
I would imagine is something more to the left of above graph, between hard-link
and soft-link, something like: link to an invisible dir-entry that is gone once
all soft-links to it are gone.

So from my point of view, call it something different like Copy-On-Write or
COW.

I do understand that there is something very fundamental in my misunderstanding,
but it was not explained below, in fact the below terminology confused me even
more. Please explain?

> Signed-off-by: Joel Becker <joel.becker@oracle.com>
> ---
>  Documentation/filesystems/reflink.txt |  129 +++++++++++++++++++++++++++++++++
>  Documentation/filesystems/vfs.txt     |    4 +
>  2 files changed, 133 insertions(+), 0 deletions(-)
>  create mode 100644 Documentation/filesystems/reflink.txt
> 
> diff --git a/Documentation/filesystems/reflink.txt b/Documentation/filesystems/reflink.txt
> new file mode 100644
> index 0000000..f3620f0
> --- /dev/null
> +++ b/Documentation/filesystems/reflink.txt
> @@ -0,0 +1,129 @@
> +reflink(2)
> +==========
> +
> +NAME
> +----
> +reflink - make a reference-counted link of a file
> +
> +
> +SYNOPSIS
> +--------
> +#include <unistd.h>
> +
> +int reflink(const char *oldpath, const char *newpath);
> +
> +DESCRIPTION
> +-----------
> +reflink() creates a new reflink (also known as a reference-counted link)
> +to an existing file.  This reflink is a new file object that shares the
> +attributes and data extents of the source object in a copy-on-write fashion.
> +

This is exactly my confusion how is the logical jump made from reflink (reference/link)
to copy-on-write. I fail to see any logical connection.

> +An easy way to think of it is that the semantics of the reflink() call
> +are identical to the link(2) system call, but the resulting file object
> +behaves as if it were a copy with identical attributes.
> +
> +Like the link(2) system call, if newpath exists, it will not be overwritten.
> +oldpath must be a regular file.  oldpath and newpath must be on the same
> +mounted filesystem.
> +
> +All data extents of the new file must be shared with the source file in
> +a copy-on-write fashion.  This includes data extents for extended
> +attributes.  If either the source or new files are written to, the
> +changes do not show up in the other file.
> +
> +All file attributes and extended attributes of the new file must
> +identical to the source file with the following exceptions:
> +
> +- The new file must have a new inode number.  This allows POSIX
> +  programs to treat the source and new files as separate objects.  From
> +  the view of the POSIX application, the files are distinct.  The
> +  sharing is invisible outside the filesystem.
> +- The ctime of the source file only changes if the source's metadata
> +  must be changed to accommodate the copy-on-write linkage.  The ctime of
> +  the new file is set to represent its creation.
> +- The mtime of the source file is unmodified, and the mtime of the new file
> +  is set identical to the source file.  This reflects that the data is
> +  unchanged.
> +- The link count of the source file is unchanged, and the link count of
> +  the new file is one.
> +
> +RETURN VALUE
> +------------
> +On success, zero is returned.  On error, -1 is returned, and errno is
> +set appropriately.
> +
> +ERRORS
> +------
> +EACCES::
> +	Write access to the directory containing newpath is denied, or
> +	search permission is denied for one of the directories in the
> +	path prefix of oldpath or newpath.  (See also path_resolution(7).)
> +
> +EEXIST::
> +	newpath already exists.
> +
> +EFAULT::
> +	oldpath or newpath points outside your accessible address space.
> +
> +EIO::
> +	An I/O error occurred.
> +
> +ELOOP::
> +	Too many symbolic links were encountered in resolving oldpath or
> +	newpath.
> +
> +ENAMETOOLONG::
> +	oldpath or newpath was too long.
> +
> +ENOENT::
> +	A directory component in oldpath or newpath does not exist or is
> +	a dangling symbolic link.
> +
> +ENOMEM::
> +	Insufficient kernel memory was available.
> +
> +ENOSPC::
> +	The device containing the file has no room for the new directory
> +	entry or file object.
> +
> +ENOTDIR::
> +	A component used as a directory in oldpath or newpath is not, in
> +	fact, a directory.
> +
> +EPERM::
> +	oldpath is a directory.
> +
> +EPERM::
> +	The file system containing oldpath and newpath does not support
> +	the creation of reference-counted links.
> +
> +EROFS::
> +	The file is on a read-only file system.
> +
> +EXDEV::
> +	oldpath and newpath are not on the same mounted file system.
> +	(Linux permits a file system to be mounted at multiple points,
> +	but reflink() does not work across different mount points, even if
> +	the same file system is mounted on both.)
> +
> +VERSIONS
> +--------
> +reflink() is available on Linux since kernel 2.6.31.
> +
> +CONFORMING TO
> +-------------
> +reflink() is Linux-specific.
> +
> +NOTES
> +-----
> +reflink() deferences symbolic links in the same manner that link(2)
> +does.  For precise control over the treatment of symbolic links, see
> +reflinkat().
> +
> +In the case of a crash, the new file must not appear partially complete
> +in the filesystem.
> +
> +SEE ALSO
> +--------
> +ln(1), reflink(1), reflinkat(2), path_resolution(7)
> +
> diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
> index f49eecf..01cd810 100644
> --- a/Documentation/filesystems/vfs.txt
> +++ b/Documentation/filesystems/vfs.txt
> @@ -333,6 +333,7 @@ struct inode_operations {
>  	ssize_t (*listxattr) (struct dentry *, char *, size_t);
>  	int (*removexattr) (struct dentry *, const char *);
>  	void (*truncate_range)(struct inode *, loff_t, loff_t);
> +	int (*reflink) (struct dentry *,struct inode *,struct dentry *);
>  };
>  
>  Again, all methods are called without any locks being held, unless
> @@ -431,6 +432,9 @@ otherwise noted.
>  
>    truncate_range: a method provided by the underlying filesystem to truncate a
>    	range of blocks , i.e. punch a hole somewhere in a file.
> +  reflink: called by the reflink(2) system call. Only required if you want
> +	to support reflinks.  For further information, see
> +	Documentation/filesystems/reflink.txt.
>  
>  
>  The Address Space Object

Please forgive my ignorance, again I would honestly like to understand, and
how else than to just ask?

Thanks in advance
Boaz


* Re: [PATCH 1/3] fs: Document the reflink(2) system call.
  2009-05-03 13:08   ` Boaz Harrosh
@ 2009-05-03 23:08     ` Al Viro
  2009-05-04  2:49     ` Joel Becker
  1 sibling, 0 replies; 151+ messages in thread
From: Al Viro @ 2009-05-03 23:08 UTC (permalink / raw)
  To: Boaz Harrosh; +Cc: Joel Becker, linux-fsdevel, jmorris, ocfs2-devel

On Sun, May 03, 2009 at 04:08:59PM +0300, Boaz Harrosh wrote:

> As my first impression when I read the title of the patch, an English reflink
> I would imagine is something more to the left of above graph, between hard-link
> and soft-link, something like: link to an invisible dir-entry that is gone once
> all soft-links to it are gone.
> 
> So from my point of view, call it something different like Copy-On-Write or
> COW.
> 
> I do understand that there is something very fundamental in my misunderstanding,
> but it was not explained below, in fact the below terminology confused me even
> more. Please explain?

It's simply a lazy copy, with interface for creating it similar to link(2).
That's all.


* Re: [PATCH 1/3] fs: Document the reflink(2) system call.
  2009-05-03 13:08   ` Boaz Harrosh
  2009-05-03 23:08     ` Al Viro
@ 2009-05-04  2:49     ` Joel Becker
  1 sibling, 0 replies; 151+ messages in thread
From: Joel Becker @ 2009-05-04  2:49 UTC (permalink / raw)
  To: Boaz Harrosh; +Cc: linux-fsdevel, jmorris, ocfs2-devel, viro

On Sun, May 03, 2009 at 04:08:59PM +0300, Boaz Harrosh wrote:
> On 05/03/2009 09:15 AM, Joel Becker wrote:
> > int reflink(const char *oldpath, const char *newpath);
> > 
> > The reflink(2) system call creates reference-counted links.  It creates
> > a new file that shares the data extents of the source file in a
> > copy-on-write fashion.  Its calling semantics are identical to link(2).
> > Once complete, programs see the new file as a completely separate entry.
> > 
> 
> Please forgive my complete Unix jargon novice-ness, but from here it looks like the
> name is very wrong, and confusing.
> 
> if I put data to link graph then:
> 
> [data]<--[hard-link (one or more)]<--[soft-link(zero or more)]
> 
> The data is otherwise just there on disk but is unavailable until
> it is linked to at least one dir-entry. The middle hard-link is reference
> counted and once all uses are removed data can be garbage collected. Soft links
> don't follow on-disk data but follow a dir-entry. So if we have a completely
> different on disk data we're still in agreement with the dir-entry.

	A reflink creates a dir entry.  That's what newpath is about.
Using your graph:


[data]<--[reflink (zero or more)]<--[hard-link (one or more)]<--[soft-link(zero or more)]

 
> As my first impression when I read the title of the patch, an English reflink
> I would imagine is something more to the left of above graph, between hard-link
> and soft-link, something like: link to an invisible dir-entry that is gone once
> all soft-links to it are gone.

	There is no "invisible dir entry".  The target is a new file
with a new dir entry.  It just shares the data extents of the source.
Perhaps I can clarify that better.

Joel

-- 

"Maybe the time has drawn the faces I recall.
 But things in this life change very slowly,
 If they ever change at all."

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127


* Re: [PATCH 1/3] fs: Document the reflink(2) system call.
  2009-05-03  6:15 ` [PATCH 1/3] fs: Document the " Joel Becker
  2009-05-03  8:01   ` Christoph Hellwig
  2009-05-03 13:08   ` Boaz Harrosh
@ 2009-05-03 23:45   ` Theodore Tso
  2009-05-04  1:44     ` Tao Ma
  2009-05-05  1:07   ` Jamie Lokier
  3 siblings, 1 reply; 151+ messages in thread
From: Theodore Tso @ 2009-05-03 23:45 UTC (permalink / raw)
  To: Joel Becker; +Cc: linux-fsdevel, jmorris, ocfs2-devel, viro

On Sat, May 02, 2009 at 11:15:01PM -0700, Joel Becker wrote:
> int reflink(const char *oldpath, const char *newpath);
> 
> The reflink(2) system call creates reference-counted links.  It creates
> a new file that shares the data extents of the source file in a
> copy-on-write fashion.  Its calling semantics are identical to link(2).
> Once complete, programs see the new file as a completely separate entry.

How should quota handle reflinks?  Since there are separate inodes,
the two files could be owned by different user ID's.  Since the data
blocks exist only once, I can imagine a number of different ways of
handling it:

1) When the reflink is created, the owner of the new reflink is not
charged the number of blocks of the file against his/her quota.  If
the original inode is deleted, the original owner continues to have
the cost of the file charged against his/her quota until the last
reflink disappears.

2) When the reflink is created, the owner of the new reflink is NOT
charged the number of blocks of the file against his/her quota.  If
the original inode is deleted, the owner of the reflink is charged the
number of blocks against his/her quota.  If that drives the owner over
quota, the quota subsystem will enforce the soft and hard quota limits
as per normal.  If there is more than one reflink against the file,
the system will randomly choose one user and charge the blocks against
his/her quota.

3) When the reflink is created, the owner of the new reflink is
charged the number of blocks of the file against his/her quota.  The
original owner of the inode continues to also have the blocks of the
file charged against his/her quota, so in effect the blocks are
"double counted".

4) When the reflink is created, the owner of the new reflink is NOT
charged the number of blocks of the file against his/her quota.  The
original owner of the inode continues to also have the blocks of the
file charged against his/her quota; if the file is deleted the blocks
associated with the file will not be charged against any users' quota.

All of these have various problems; and maybe the answer is that
reflinks aren't really compatible with quotas, so pick something least
bad (say #3), and we can just move on.

						- Ted


* Re: [PATCH 1/3] fs: Document the reflink(2) system call.
  2009-05-03 23:45   ` Theodore Tso
@ 2009-05-04  1:44     ` Tao Ma
  2009-05-04 18:25       ` Joel Becker
  0 siblings, 1 reply; 151+ messages in thread
From: Tao Ma @ 2009-05-04  1:44 UTC (permalink / raw)
  To: Theodore Tso; +Cc: Joel Becker, linux-fsdevel, jmorris, ocfs2-devel, viro

Hi Ted,

Theodore Tso wrote:
> On Sat, May 02, 2009 at 11:15:01PM -0700, Joel Becker wrote:
>> int reflink(const char *oldpath, const char *newpath);
>>
>> The reflink(2) system call creates reference-counted links.  It creates
>> a new file that shares the data extents of the source file in a
>> copy-on-write fashion.  Its calling semantics are identical to link(2).
>> Once complete, programs see the new file as a completely separate entry.
> 
> How should quota handle reflinks?  Since there are separate inodes,
> the two files could be owned by different user ID's.  Since the data
> blocks exist only once, I can imagine a number of different ways of
> handling it:
> 
> 1) When the reflink is created, the owner of the new reflink is not
> charged the number of blocks of the file against his/her quota.  If
> the original inode is deleted, the original owner continues to have
> the cost of the file charged against his/her quota until the last
> reflink disappears.
> 
> 2) When the reflink is created, the owner of the new reflink is NOT
> charged the number of blocks of the file against his/her quota.  If
> the original inode is deleted, the owner of the reflink is charged the
> number of blocks against his/her quota.  If that drives the owner over
> quota, the quota subsystem will enforce the soft and hard quota limits
> as per normal.  If there is more than one reflink against the file,
> the system will randomly choose one user and charge the blocks against
> his/her quota.
> 
> 3) When the reflink is created, the owner of the new reflink is
> charged the number of blocks of the file against his/her quota.  The
> original owner of the inode continues to also have the blocks of the
> file charged against his/her quota, so in effect the blocks are
> "double counted".
> 
> 4) When the reflink is created, the owner of the new reflink is NOT
> charged the number of blocks of the file against his/her quota.  The
> original owner of the inode continues to also have the blocks of the
> file charged against his/her quota; if the file is deleted the blocks
> associated with the file will not be charged against any users' quota.
> 
> All of these have various problems; and maybe the answer is that
> reflinks aren't really compatible with quotas, so pick something least
> bad (say #3), and we can just move on.
yeah, agree. So I will pick #3 in my ocfs2 reflink implementation.

Thanks.

Regards,
Tao

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH 1/3] fs: Document the reflink(2) system call.
  2009-05-04  1:44     ` Tao Ma
@ 2009-05-04 18:25       ` Joel Becker
  2009-05-04 21:18         ` [Ocfs2-devel] " Joel Becker
  0 siblings, 1 reply; 151+ messages in thread
From: Joel Becker @ 2009-05-04 18:25 UTC (permalink / raw)
  To: Tao Ma; +Cc: linux-fsdevel, Theodore Tso, jmorris, ocfs2-devel, viro

On Mon, May 04, 2009 at 09:44:32AM +0800, Tao Ma wrote:
> Theodore Tso wrote:
>> On Sat, May 02, 2009 at 11:15:01PM -0700, Joel Becker wrote:
>> How should quota handle reflinks?  Since there are separate inodes,
>> the two files could be owned by different user ID's.  Since the data
>> blocks exist only once, I can imagine a number of different ways of
>> handling it:
<snip>
> yeah, agree. So I will pick #3 in my ocfs2 reflink implementation.

	While at first I was all "sure, this makes sense," now I'm
thinking otherwise.  Because reflink() means the file attributes are
unmodified.  So the original owner owns the new file, and thus the quota
charge doesn't matter.  If and when the new file is changed to another
owner, then the normal quota code will adjust the quotas.

Joel

-- 

"If you are ever in doubt as to whether or not to kiss a pretty girl, 
 give her the benefit of the doubt"
                                        -Thomas Carlyle

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [Ocfs2-devel] [PATCH 1/3] fs: Document the reflink(2) system call.
  2009-05-04 18:25       ` Joel Becker
@ 2009-05-04 21:18         ` Joel Becker
  2009-05-04 22:23           ` Theodore Tso
  0 siblings, 1 reply; 151+ messages in thread
From: Joel Becker @ 2009-05-04 21:18 UTC (permalink / raw)
  To: Tao Ma, Theodore Tso, linux-fsdevel, jmorris, ocfs2-devel, viro

On Mon, May 04, 2009 at 11:25:52AM -0700, Joel Becker wrote:
> On Mon, May 04, 2009 at 09:44:32AM +0800, Tao Ma wrote:
> > Theodore Tso wrote:
> >> On Sat, May 02, 2009 at 11:15:01PM -0700, Joel Becker wrote:
> >> How should quota handle reflinks?  Since there are separate inodes,
> >> the two files could be owned by different user ID's.  Since the data
> >> blocks exist only once, I can imagine a number of different ways of
> >> handling it:
> <snip>
> > yeah, agree. So I will pick #3 in my ocfs2 reflink implementation.
> 
> 	While at first I was all "sure, this makes sense," now I'm
> thinking otherwise.  Because reflink() means the file attributes are
> unmodified.  So the original owner owns the new file, and thus the quota
> charge doesn't matter.  If and when the new file is changed to another
> owner, then the normal quota code will adjust the quotas.

	More thinking.  It looks like we'll restrict reflink() to owners
or people with CAP_FCHOWN.  This prevents some quota DoS behavior.
	We need to pre-charge all quota.  That means a reflink must be
charged the entire size of the file.  So, if I do:

  # dd if=/dev/zero bs=1M count=1 of=foo
  # reflink foo bar

I am now charged 2MB of quota, even though foo and bar share the same
1MB of space.
	Why?  Because if I only mark 1M of quota and then do "chown
tao.tao bar", we can't sanely keep track of fractional quota.  Whereas if
we charge the 2MB up front, the chown just moves the quota over to tao.
Copy-on-write is even cleaner - since you were pre-charged for the
quota, you don't do any quota adjustments for the data blocks in the CoW
operation (though any new metadata is a new charge).
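
	A minimal sketch of that pre-charge accounting, in Python.  The
class and method names are hypothetical, and the model ignores metadata
charges and quota limits; it only shows why chown becomes a plain
transfer once every reflink is charged its full size:

```python
from collections import defaultdict

MB = 1024 * 1024


class PrechargedQuota:
    """Toy model: every inode is charged its full logical size."""

    def __init__(self):
        self.charged = defaultdict(int)  # owner -> bytes charged
        self.owner = {}                  # file name -> owner
        self.size = {}                   # file name -> logical size

    def create(self, name, owner, size):
        self.owner[name] = owner
        self.size[name] = size
        self.charged[owner] += size

    def reflink(self, src, dst):
        # Attributes (including the owner) are unchanged, and the new
        # inode is charged the entire size even though data is shared.
        self.create(dst, self.owner[src], self.size[src])

    def chown(self, name, new_owner):
        # No fractional bookkeeping: move the whole charge over.
        size = self.size[name]
        self.charged[self.owner[name]] -= size
        self.charged[new_owner] += size
        self.owner[name] = new_owner

    def cow_write(self, name, offset, length):
        # Overwriting inside the file needs no data-block adjustment,
        # because the blocks were paid for at reflink time.
        assert offset + length <= self.size[name]
        return 0  # additional data bytes charged


q = PrechargedQuota()
q.create("foo", "joel", 1 * MB)
q.reflink("foo", "bar")
charged_after_reflink = q.charged["joel"]   # 2MB for 1MB of shared space
q.chown("bar", "tao")
extra = q.cow_write("bar", 0, 4096)
```

After the chown, "joel" and "tao" are each charged 1MB, and the CoW
write adds nothing to either.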

Joel  

-- 

"The whole principle is wrong; it's like demanding that grown men 
 live on skim milk because the baby can't eat steak."
        - author Robert A. Heinlein on censorship

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [Ocfs2-devel] [PATCH 1/3] fs: Document the reflink(2) system call.
  2009-05-04 21:18         ` [Ocfs2-devel] " Joel Becker
@ 2009-05-04 22:23           ` Theodore Tso
  2009-05-05  6:55             ` Joel Becker
  0 siblings, 1 reply; 151+ messages in thread
From: Theodore Tso @ 2009-05-04 22:23 UTC (permalink / raw)
  To: Tao Ma, linux-fsdevel, jmorris, ocfs2-devel, viro

On Mon, May 04, 2009 at 02:18:54PM -0700, Joel Becker wrote:
> 	More thinking.  It looks like we'll restrict reflink() to owners
> or people with CAP_FCHOWN.  This prevents some quota DoS behavior.
> 	We need to pre-charge all quota.  That means a reflink must be
> charged the entire size of the file.  So, if I do:
> 
>   # dd if=/dev/zero bs=1M count=1 of=foo
>   # reflink foo bar
> 
> I am now charged 2MB of quota, even though foo and bar share the same
> 1MB of space.

Yep; but as long as you do this, why do you need CAP_FCHOWN?  

Suppose Alice has a 1MB file, and Bob creates a reflink to it.  The
reflink would be owned by Bob, and Bob would be charged the 1MB quota.
This mirrors exactly what happens if Bob were to make a copy of the
file, and we want to make the creation of reflink mirror a copy, right?

In that case, as long as Bob has read access to the file, he should be
allowed to create a reflink.

That way when you do the copy-on-write, Bob will continue to be
charged the 1MB quota, which is what you want.  So pre-charging the
quota makes the most amount of sense.

	    	       	     	     - Ted

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [Ocfs2-devel] [PATCH 1/3] fs: Document the reflink(2) system call.
  2009-05-04 22:23           ` Theodore Tso
@ 2009-05-05  6:55             ` Joel Becker
  0 siblings, 0 replies; 151+ messages in thread
From: Joel Becker @ 2009-05-05  6:55 UTC (permalink / raw)
  To: Theodore Tso; +Cc: Tao Ma, linux-fsdevel, jmorris, ocfs2-devel, viro

On Mon, May 04, 2009 at 06:23:27PM -0400, Theodore Tso wrote:
> On Mon, May 04, 2009 at 02:18:54PM -0700, Joel Becker wrote:
> > 	More thinking.  It looks like we'll restrict reflink() to owners
> > or people with CAP_FCHOWN.  This prevents some quota DoS behavior.
> > 	We need to pre-charge all quota.  That means a reflink must be
> > charged the entire size of the file.  So, if I do:
> > 
> >   # dd if=/dev/zero bs=1M count=1 of=foo
> >   # reflink foo bar
> > 
> > I am now charged 2MB of quota, even though foo and bar share the same
> > 1MB of space.
> 
> Yep; but as long as you do this, why do you need CAP_FCHOWN?  

	Because the ownership doesn't change, and thus the person doing
the reflink is effectively setting ownership.

> Suppose Alice has a 1MB file, and Bob creates a reflink to it.  The
> reflink would be owned by Bob, and Bob would be charged the 1MB quota.
> This mirrors exactly what happens if Bob were to make a copy of the
> file, and we want to make the creation of reflink mirror a copy, right?

	It's more like link(2).  The ownership, permissions, and attributes
are identical to the original.

-- 

"Always give your best, never get discouraged, never be petty; always
 remember, others may hate you.  Those who hate you don't win unless
 you hate them.  And then you destroy yourself."
	- Richard M. Nixon

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH 1/3] fs: Document the reflink(2) system call.
  2009-05-03  6:15 ` [PATCH 1/3] fs: Document the " Joel Becker
                     ` (2 preceding siblings ...)
  2009-05-03 23:45   ` Theodore Tso
@ 2009-05-05  1:07   ` Jamie Lokier
  2009-05-05  7:16     ` Joel Becker
  3 siblings, 1 reply; 151+ messages in thread
From: Jamie Lokier @ 2009-05-05  1:07 UTC (permalink / raw)
  To: Joel Becker; +Cc: linux-fsdevel, jmorris, ocfs2-devel, viro

Joel Becker wrote:
> +All file attributes and extended attributes of the new file must
> +be identical to the source file with the following exceptions:

reflink() sounds useful already, but is there a compelling reason why
both files must have the same attributes, and changing attributes will
break the COW?

Being able to have different attributes would allow:

   - reflink() to be used for fast space-efficient copying, i.e. an
     optimisation to "cp", "git checkout" and things like that.

   - reflink() to be used for merging files with identical contents
     (something I find surprisingly often on my disks).

   - reflink() to be used for merging files from different
     cgroup-style VMs in particular.

Requiring all attributes except nlink and ino to be identical makes
reflink() unsuitable for transparently doing those things, except in
cases where they happen to have the same attributes anyway.

I'm thinking particularly of file permissions, owner/group and atime.

Since each reflink has its own nlink and ino, I'm wondering why the
other attributes cannot also be separate.  (I realise extended
attributes complicate the picture and it's desirable to share them,
especially if they are large).

> +- The new file must have a new inode number.  This allows POSIX
> +  programs to treat the source and new files as separate objects.  From
> +  the view of the POSIX application, the files are distinct.  The
> +  sharing is invisible outside the filesystem.

Invisible sharing is good and different inode number is obviously required.

But is there an efficient way for reflink-aware applications to detect
these files have the same contents, other than reading the contents
twice and comparing?  Occasionally that would be good.  E.g. It would
be nice if "diff -r" could be patched to do that.

> +- The ctime of the source file only changes if the source's metadata
> +  must be changed to accommodate the copy-on-write linkage.  The ctime of
> +  the new file is set to represent its creation.

What change to the source metadata would require ctime to change?

> +- The link count of the source file is unchanged, and the link count of
> +  the new file is one.

Can you hard link to the source file and the reflink afterwards,
incrementing the reflink's link count?  (I presume yes).  Can you
reflink to both of them too?

> +EPERM::
> +	oldpath is a directory.

I've always been surprised this isn't EISDIR :-)

> +EXDEV::
> +	oldpath and newpath are not on the same mounted file system.
> +	(Linux permits a file system to be mounted at multiple points,
> +	but reflink() does not work across different mount points, even if
> +	the same file system is mounted on both.)

That's an interesting restriction, though I see link() does the same.

> +reflink() dereferences symbolic links in the same manner that link(2)
> +does.

Would that be "reflink() does not dereference symbolic links as the
final path component, in the same manner that link() does not" :-)

> For precise control over the treatment of symbolic links, see
> reflinkat().

As others have said, there's no need for a reflink() kernel system
call, as reflinkat() can be used for the same thing, and wrapped in
libc if reflink() is desirable as a userspace C function.

Also, reflinkat() has room for reflink-specific flags to be added
later if needed, which may come in handy.

-- Jamie

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH 1/3] fs: Document the reflink(2) system call.
  2009-05-05  1:07   ` Jamie Lokier
@ 2009-05-05  7:16     ` Joel Becker
  2009-05-05  8:09       ` Andreas Dilger
                         ` (2 more replies)
  0 siblings, 3 replies; 151+ messages in thread
From: Joel Becker @ 2009-05-05  7:16 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: linux-fsdevel, jmorris, ocfs2-devel, viro

On Tue, May 05, 2009 at 02:07:03AM +0100, Jamie Lokier wrote:
> Joel Becker wrote:
> > +All file attributes and extended attributes of the new file must
> > +be identical to the source file with the following exceptions:
> 
> reflink() sounds useful already, but is there a compelling reason why
> both files must have the same attributes, and changing attributes will
> break the COW?

	Yeah, because without it you can't use it for snapshotting.
That's where the original design came from - inode snapshots.  The big
thing that excited me was that defining reflink() as I did, instead of
a more specific snapshot call, allows all sorts of generic uses (some of
which you outline below).
	If reflink() creates a snapshot, you can then break it to make
things a little different.  But if it changes things, you can never
change them back.

> Being able to have different attributes would allow:
> 
>    - reflink() to be used for fast space-efficient copying, i.e. an
>      optimisation to "cp", "git checkout" and things like that.

	It can right now, just not of other people's files.  Actually,
the only real difficulty with doing it to other people's files is quota.
But I can't come up with a way to prevent quota DoS.
	Here's another fun trick.  Overwriting rsync, instead of copying
blocks from the already-existing source, could reflink the source to the
.temporary, then only write the changed blocks.  And since you own both
files, it just works.  If you're overwriting someone else's file?  The
old copy behavior is fine.

>    - reflink() to be used for merging files with identical contents
>      (something I find surprisingly often on my disks).
> 
>    - reflink() to be used for merging files from different
>      cgroup-style VMs in particular.

	While it would be great to have a way to do this, reflink() is
not the way.  It's really simple to understand with its link-like
semantic, and I see no point in making it a seven-different-operation
kitchen sink call.

> Requiring all attributes except nlink and ino to be identical makes
> reflink() unsuitable for transparently doing those things, except in
> cases where they happen to have the same attributes anyway.

	We've had a lot of fun thinking up many uses for reflink(), and
almost all of them are within the context of one's own files.

> I'm thinking particularly of file permissions, owner/group and atime.

	People do cp -p all the time.  I don't see how keeping those
things the same will break anything.  It's a new call, not an existing
semantic.

> Since each reflink has its own nlink and ino, I'm wondering why the
> other attributes cannot also be separate.  (I realise extended
> attributes complicate the picture and it's desirable to share them,
> especially if they are large).

	The biggest reason is snapshotting.  The second biggest reason
is a simple to understand call.  "Everything is identical except those
things that *have* to be different".

> But is there an efficient way for reflink-aware applications to detect
> these files have the same contents, other than reading the contents
> twice and comparing?  Occasionally that would be good.  E.g. It would
> be nice if "diff -r" could be patched to do that.

	I would think FIEMAP would tell you what you want to know,
wouldn't it?
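
	A rough sketch of how an application might ask that question
through FIEMAP, in Python.  It assumes a Linux kernel and filesystem
that report the FIEMAP_EXTENT_SHARED flag, which nothing in this thread
establishes; the helper names are hypothetical and the comparison is
deliberately conservative:

```python
import fcntl
import struct

FS_IOC_FIEMAP = 0xC020660B       # _IOWR('f', 11, struct fiemap)
FIEMAP_FLAG_SYNC = 0x1           # flush dirty data before mapping
FIEMAP_EXTENT_SHARED = 0x2000    # extent is referenced by other files

HEADER = struct.Struct("=QQIIII")    # struct fiemap, minus the extent array
EXTENT = struct.Struct("=QQQ2QI3I")  # struct fiemap_extent


def parse_extents(buf):
    """Decode a FIEMAP buffer into (logical, physical, length, flags)."""
    mapped = HEADER.unpack_from(buf, 0)[3]  # fm_mapped_extents
    extents = []
    for i in range(mapped):
        fields = EXTENT.unpack_from(buf, HEADER.size + i * EXTENT.size)
        extents.append((fields[0], fields[1], fields[2], fields[5]))
    return extents


def file_extents(path, max_extents=256):
    """Map a whole file; files with more extents are silently cut off."""
    buf = bytearray(HEADER.size + max_extents * EXTENT.size)
    HEADER.pack_into(buf, 0, 0, 0xFFFFFFFFFFFFFFFF,
                     FIEMAP_FLAG_SYNC, 0, max_extents, 0)
    with open(path, "rb") as f:
        fcntl.ioctl(f.fileno(), FS_IOC_FIEMAP, buf)  # fills buf in place
    return parse_extents(buf)


def share_all_data(extents_a, extents_b):
    """True only if both maps list identical, shared physical extents.

    Extents that cover the same blocks but are split differently are
    reported as not shared, so a False result proves nothing.
    """
    if [e[:3] for e in extents_a] != [e[:3] for e in extents_b]:
        return False
    return all(e[3] & FIEMAP_EXTENT_SHARED for e in extents_a + extents_b)
```

A "diff -r" style tool could call share_all_data(file_extents(a),
file_extents(b)) and fall back to reading both files whenever it
returns False or the ioctl fails.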

> > +- The ctime of the source file only changes if the source's metadata
> > +  must be changed to accommodate the copy-on-write linkage.  The ctime of
> > +  the new file is set to represent its creation.
> 
> What change to the source metadata would require ctime to change?

	ocfs2 flags all extents in the source file with a "this is now
shared, go check the reference count before writing" flag if they don't
have it already.  I'd call that a metadata update.

> > +- The link count of the source file is unchanged, and the link count of
> > +  the new file is one.
> 
> Can you hard link to the source file and the reflink afterwards,
> incrementing the reflink's link count?  (I presume yes).  Can you
> reflink to both of them too?

	Yes, absolutely.  Once reflinked, they look like two separate
POSIX files.

Joel

-- 

"Depend on the rabbit's foot if you will, but remember, it didn't
 help the rabbit."
	- R. E. Shay

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH 1/3] fs: Document the reflink(2) system call.
  2009-05-05  7:16     ` Joel Becker
@ 2009-05-05  8:09       ` Andreas Dilger
  2009-05-05 16:56         ` Joel Becker
  2009-05-05 13:01       ` Theodore Tso
  2009-05-05 13:01       ` Jamie Lokier
  2 siblings, 1 reply; 151+ messages in thread
From: Andreas Dilger @ 2009-05-05  8:09 UTC (permalink / raw)
  To: Jamie Lokier, linux-fsdevel, jmorris, ocfs2-devel, viro

On May 05, 2009  00:16 -0700, Joel Becker wrote:
> On Tue, May 05, 2009 at 02:07:03AM +0100, Jamie Lokier wrote:
> > Being able to have different attributes would allow:
> > 
> >    - reflink() to be used for fast space-efficient copying, i.e. an
> >      optimisation to "cp", "git checkout" and things like that.
> 
> 	It can right now, just not of other people's files.  Actually,
> the only real difficulty with doing it to other people's files is quota.
> But I can't come up with a way to prevent quota DoS.

The reflink caller should always be charged for the full space used (as
if it were a real copy) by virtue of the user doing the reflink() owning
the new inode.  Doing anything else seems broken.  If the owner of the file
wasn't charged for the reflink's quota then if the reflink inode was
chowned the new owner would be charged for the new file, but the quota
code would have to special case the decrement of EACH of the reflink's
blocks because otherwise the original owner might "release" quota that
it was never originally charged.

> 	Here's another fun trick.  Overwriting rsync, instead of copying
> blocks from the already-existing source could reflink the source to the
> .temporary, then only write the changed blocks.  And since you own both
> files, it just works.  If you're overwriting someone else's file?  The
> old copy behavior is fine.

Well, "fine" as in it works, but if there are only a few changed blocks,
and the old copy is now part of a snapshot (so it won't be released when
rsync is finished) the space consumption has doubled instead of just
using a few extra blocks.

> > Requiring all attributes except nlink and ino to be identical makes
> > reflink() unsuitable for transparently doing those things, except in
> > cases where they happen to have the same attributes anyway.
>
> 	We've had a lot of fun thinking up many uses for reflink(), and
> almost all of them are within the context of one's own files.

Is there anything about changing the owner/group of the new inode during
reflink that makes the implementation more complex?  If the process doing
the reflink is the same as the file owner then the semantics are unchanged
from what you have proposed.

> > I'm thinking particularly of file permissions, owner/group and atime.
> 
> 	People do cp -p all the time.  I don't see how keeping those
> things the same will break anything.  It's a new call, not an existing
> semantic.

Though "cp -p" doesn't keep the owner/group of the original file if you
are not root.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH 1/3] fs: Document the reflink(2) system call.
  2009-05-05  8:09       ` Andreas Dilger
@ 2009-05-05 16:56         ` Joel Becker
  2009-05-05 21:24           ` Andreas Dilger
  0 siblings, 1 reply; 151+ messages in thread
From: Joel Becker @ 2009-05-05 16:56 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: linux-fsdevel, Jamie Lokier, jmorris, ocfs2-devel, viro

On Tue, May 05, 2009 at 02:09:36AM -0600, Andreas Dilger wrote:
> On May 05, 2009  00:16 -0700, Joel Becker wrote:
> > On Tue, May 05, 2009 at 02:07:03AM +0100, Jamie Lokier wrote:
> > > Being able to have different attributes would allow:
> > > 
> > >    - reflink() to be used for fast space-efficient copying, i.e. an
> > >      optimisation to "cp", "git checkout" and things like that.
> > 
> > 	It can right now, just not of other people's files.  Actually,
> > the only real difficulty with doing it to other people's files is quota.
> > But I can't come up with a way to prevent quota DoS.
> 
> The reflink caller should always be charged for the full space used (as
> if it were a real copy) by virtue of the user doing the reflink() owning
> the new inode.  Doing anything else seems broken.  If the owner of the file
> wasn't charged for the reflink's quota then if the reflink inode was
> chowned the new owner would be charged for the new file, but the quota
> code would have to special case the decrement of EACH of the reflink's
> blocks because otherwise the original owner might "release" quota that
> it was never originally charged.

	If the caller is creating an inode in someone else's name, then
who do you charge for the quota?  If you charge the caller, how do you
know to decrement the caller's quota when the actual owner does
truncate, given that the inode has no knowledge of the caller anymore.
	You've hit the nail on the head - without backrefs for each
refcounted hunk, you can't figure out who owns it from a quota
perspective.  And that's just a non-starter to try and maintain.

> > 	Here's another fun trick.  Overwriting rsync, instead of copying
> > blocks from the already-existing source could reflink the source to the
> > .temporary, then only write the changed blocks.  And since you own both
> > files, it just works.  If you're overwriting someone else's file?  The
> > old copy behavior is fine.
> 
> Well, "fine" as in it works, but if there are only a few changed blocks,
> and the old copy is now part of a snapshot (so it won't be released when
> rsync is finished) the space consumption has doubled instead of just
> using a few extra blocks.

	No, because the last thing rsync will do is rename(.temporary,
source).  All the references from the source will be decremented, and
any blocks only owned by the source will be freed.  Space usage is
identical before and after, like a copying rsync, but there is less
space used and less I/O done during the rsync process.
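
	The sequence above (reflink the source to the .temporary, write
only the changed blocks, rename over the source) can be sketched as
follows.  The reflink(2) call proposed in this thread is not assumed to
exist; the sketch uses Linux's FICLONE ioctl as a stand-in and falls
back to a plain copy when the filesystem refuses it.  The function
names are hypothetical:

```python
import fcntl
import os
import shutil

FICLONE = 0x40049409  # _IOW(0x94, 9, int); a stand-in for reflink(2)


def clone_or_copy(src_path, dst_path):
    """Share src's data with dst if possible; return False on fallback."""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        try:
            fcntl.ioctl(dst.fileno(), FICLONE, src.fileno())
            return True
        except OSError:
            # Filesystem cannot share extents: the old copy behavior.
            shutil.copyfileobj(src, dst)
            return False


def overwrite_blocks(path, changes):
    """Apply {offset: bytes} via a .temporary file, then rename it."""
    tmp = path + ".temporary"
    clone_or_copy(path, tmp)
    fd = os.open(tmp, os.O_WRONLY)
    try:
        for offset, data in changes.items():
            # Only the blocks touched here stop being shared.
            os.pwrite(fd, data, offset)
        os.fsync(fd)
    finally:
        os.close(fd)
    # Drops the old inode's references; blocks only it owned are freed.
    os.rename(tmp, path)
```

For example, overwrite_blocks("foo", {4096: b"new data"}) leaves every
other block of foo untouched.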

> Is there anything about changing the owner/group of the new inode during
> reflink that makes the implementation more complex?  If the process doing
> the reflink is the same as the file owner then the semantics are unchanged
> from what you have proposed.

	If you define that 'reflink sets the attributes as if it was a
new file', then you should be creating the file with a new security
context, not with the security context from the existing inode.  And
then you can't really snapshot.
	A mixed behavior, like "if you own it, I'll preserve the entire
security context, but if not I will treat it with a new context" is
confusing at best.

> > > I'm thinking particularly of file permissions, owner/group and atime.
> > 
> > 	People do cp -p all the time.  I don't see how keeping those
> > things the same will break anything.  It's a new call, not an existing
> > semantic.
> 
> Though "cp -p" doesn't keep the owner/group of the original file if you
> are not root.

	Sure, my argument wasn't that we should be exactly like cp -p,
it was that the results of cp -p are understood, so if we look like them
it won't break anything.
	I actually discussed the "cp -p" issue elsewhere.  Yes, we all
understand the caveats of "cp -p".  But it's actually a combination of
many simple operations.  reflink() is one operation, and trying to give
it confusing and varied semantics seems to clutter it up for no good
reason.

Joel

-- 

"Baby, even the losers
 Get lucky sometimes.
 Even the losers
 Keep a little bit of pride."

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH 1/3] fs: Document the reflink(2) system call.
  2009-05-05 16:56         ` Joel Becker
@ 2009-05-05 21:24           ` Andreas Dilger
  2009-05-05 21:32             ` Joel Becker
  0 siblings, 1 reply; 151+ messages in thread
From: Andreas Dilger @ 2009-05-05 21:24 UTC (permalink / raw)
  To: Jamie Lokier, linux-fsdevel, jmorris, ocfs2-devel, viro

On May 05, 2009  09:56 -0700, Joel Becker wrote:
> On Tue, May 05, 2009 at 02:09:36AM -0600, Andreas Dilger wrote:
> > The reflink caller should always be charged for the full space used (as
> > if it were a real copy) by virtue of the user doing the reflink() owning
> > the new inode.  Doing anything else seems broken.  If the owner of the file
> > wasn't charged for the reflink's quota then if the reflink inode was
> > chowned the new owner would be charged for the new file, but the quota
> > code would have to special case the decrement of EACH of the reflink's
> > blocks because otherwise the original owner might "release" quota that
> > it was never originally charged.
> 
>  If the caller is creating an inode in someone else's name, then
> who do you charge for the quota?

IMHO, it shouldn't be possible to create an inode in someone else's
name (CAP_* excluded), just like it isn't possible to create a new
file in someone else's name.  The caller of reflink() should be the
one creating the file, hence the owner of the file, and the owner of
the quota.

> If you charge the caller, how do you know to decrement the caller's
> quota when the actual owner does truncate, given that the inode has
> no knowledge of the caller anymore.

No, if the owner of the inode (== caller) is charged the quota then
when the inode is truncated (regardless of who does the truncate)
the quota will just work correctly.

> 	You've hit the nail on the head - without backrefs for each
> refcounted hunk, you can't figure out who owns it from a quota
> perspective.  And that's just a non-starter to try and maintain.

No, I don't think my proposal is _more_ complex than the original.
It is actually _less_ complex, because the fact that this is a reflink
and not a complete file copy is a purely internal detail of the filesystem
and is not exposed outside the filesystem.  The fact that a reflink
consumes less space and is faster than a real copy is an implementation
detail, not really any different than if the file were compressed by
the filesystem internally.

> > > 	Here's another fun trick.  Overwriting rsync, instead of copying
> > > blocks from the already-existing source could reflink the source to the
> > > .temporary, then only write the changed blocks.  And since you own both
> > > files, it just works.  If you're overwriting someone else's file?  The
> > > old copy behavior is fine.
> > 
> > Well, "fine" as in it works, but if there are only a few changed blocks,
> > and the old copy is now part of a snapshot (so it won't be released when
> > rsync is finished) the space consumption has doubled instead of just
> > using a few extra blocks.
> 
> 	No, because the last thing rsync will do is rename(.temporary,
> source).  All the references from the source will be decremented, and
> any blocks only owned by the source will be freed.  Space usage is
> identical before and after, like a copying rsync, but there is less
> space used and less I/O done during the rsync process.

What I was objecting to is "when overwriting someone else's file, the old
copy behaviour is fine".  If we are implementing a copy-on-write API,
why hamstring it to not work in the expected manner by a normal "cp"?

> > Is there anything about changing the owner/group of the new inode during
> > reflink that makes the implementation more complex?  If the process doing
> > the reflink is the same as the file owner then the semantics are unchanged
> > from what you have proposed.
> 
> 	If you define that 'reflink sets the attributes as if it was a
> new file', then you should be creating the file with a new security
> context, not with the security context from the existing inode.  And
> then you can't really snapshot.
> 	A mixed behavior, like "if you own it, I'll preserve the entire
> security context, but if not I will treat it with a new context" is
> confusing at best.

I don't find it confusing.  The security context would be inherited from
the creating process, just like creating a new file would.  If it is the
same user as the file owner then the security context will be the same.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH 1/3] fs: Document the reflink(2) system call.
  2009-05-05 21:24           ` Andreas Dilger
@ 2009-05-05 21:32             ` Joel Becker
  2009-05-06  7:15               ` [Ocfs2-devel] " Theodore Tso
  0 siblings, 1 reply; 151+ messages in thread
From: Joel Becker @ 2009-05-05 21:32 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: linux-fsdevel, Jamie Lokier, jmorris, ocfs2-devel, viro

On Tue, May 05, 2009 at 03:24:17PM -0600, Andreas Dilger wrote:
> On May 05, 2009  09:56 -0700, Joel Becker wrote:

<snip a bunch of stuff about how quota obviously works correctly if we
 change ownership>
 
> > 	No, because the last thing rsync will do is rename(.temporary,
> > source).  All the references from the source will be decremented, and
> > any blocks only owned by the source will be freed.  Space usage is
> > identical before and after, like a copying rsync, but there is less
> > space used and less I/O done during the rsync process.
> 
> What I was objecting to is "when overwriting someone else's file, the old
> copy behaviour is fine".  If we are implementing a copy-on-write API,
> why hamstring it to not work in the expected manner by a normal "cp"?

	We're implementing an inode-level snapshot/clone that also
happens to be very convenient for many cp-like operations.

> > 	If you define that 'reflink sets the attributes as if it was a
> > new file', then you should be creating the file with a new security
> > context, not with the security context from the existing inode.  And
> > then you can't really snapshot.
> > 	A mixed behavior, like "if you own it, I'll preserve the entire
> > security context, but if not I will treat it with a new context" is
> > confusing at best.
> 
> I don't find it confusing.  The security context would be inherited from
> the creating process, just like creating a new file would.  If it is the
> same user as the file owner then the security context will be the same.

	The same as what?  If you reflink your own file, it preserves
the security context of the original or it appears with the default
security context of yourself?  They are not the same.  "Treat it like
link(2)" argues for the former - which precludes changing ownership.
That's what reflink is designed to do.  "Treat it like cp" is a
different behavior.

Joel

-- 

"The lawgiver, of all beings, most owes the law allegiance.  He of all
 men should behave as though the law compelled him.  But it is the
 universal weakness of mankind that what we are given to administer we
 presently imagine we own."
        - H.G. Wells

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [Ocfs2-devel] [PATCH 1/3] fs: Document the reflink(2) system call.
  2009-05-05 21:32             ` Joel Becker
@ 2009-05-06  7:15               ` Theodore Tso
  2009-05-06 14:24                 ` jim owens
  0 siblings, 1 reply; 151+ messages in thread
From: Theodore Tso @ 2009-05-06  7:15 UTC (permalink / raw)
  To: Andreas Dilger, Jamie Lokier, linux-fsdevel, jmorris, ocfs2-devel,
	viro

On Tue, May 05, 2009 at 02:32:06PM -0700, Joel Becker wrote:
> 	The same as what?  If you reflink your own file, it preserves
> the security context of the original or it appears with the default
> security context of yourself?  They are not the same.  "Treat it like
> link(2)" argues for the former - which precludes changing ownership.
> That's what reflink is designed to do.  "Treat it like cp" is a
> different behavior.

The reason why I don't like the default to be "preserve the inode
ownership" is because it's *not* just like link(2).  If it were just
like link(2), the inode number would also be preserved.  If the inode
number is changing, then it arguably is ***much*** more like a copy.
And a copy operation also has many useful properties.

      	   	     	      	   - Ted


* Re: [Ocfs2-devel] [PATCH 1/3] fs: Document the reflink(2) system call.
  2009-05-06  7:15               ` [Ocfs2-devel] " Theodore Tso
@ 2009-05-06 14:24                 ` jim owens
  2009-05-06 14:30                   ` jim owens
  2009-05-12 19:11                   ` Jamie Lokier
  0 siblings, 2 replies; 151+ messages in thread
From: jim owens @ 2009-05-06 14:24 UTC (permalink / raw)
  To: Theodore Tso, joel.becker
  Cc: Andreas Dilger, Jamie Lokier, linux-fsdevel, jmorris, ocfs2-devel,
	viro

So summarizing the main argument of the day,
there are 2 different functions proposed:

1) "snapfile" - Joel's reflink(2) design.

The definition is good.  It makes writable snapshots
possible, and is security safe with CAP_FOWNER added
as a requirement because CAP_CHOWN is very restricted
in the real world.  Only the owner and admin can reflink.
For the admin to use it in backups, it must be a single
call as Joel said, a point in time of data and attributes.

If the name "reflink" is the problem, call it something else.

2) "cowfilecopy" - Ted's/Jamie's kernel cp.

Again, the definition makes sense.  The security model
is simply "current read access to the file", so anyone
who can read it can make an almost-zero-space-consumed
copy of a file.

--- analysis ---

You ask why not use a 2-step "cowfilecopy" and "attrfilecopy"
to do "snapfile"... because that is not an atomic snapshot.

The security and "might not know about it" concerns are bogus:
No extra visibility exists to future updates of the original
file that would not exist without either snapfile or cowfilecopy.
That BOTH point at your old data is no different than if root
or raid was copying every disk block to permanent storage. If
you write it, someone can have it later.

So bottom line... I see no reason (except someone has to document)
why we should not have 2 system calls since there are good uses
and good definitions for both and the code is 99% identical.

jim


* Re: [Ocfs2-devel] [PATCH 1/3] fs: Document the reflink(2) system call.
  2009-05-06 14:24                 ` jim owens
@ 2009-05-06 14:30                   ` jim owens
  2009-05-06 17:50                     ` jim owens
  2009-05-12 19:11                   ` Jamie Lokier
  1 sibling, 1 reply; 151+ messages in thread
From: jim owens @ 2009-05-06 14:30 UTC (permalink / raw)
  To: Theodore Tso, joel.becker
  Cc: Andreas Dilger, Jamie Lokier, linux-fsdevel, jmorris, ocfs2-devel,
	viro

P.S. as people have already said, both

1) "snapfile" - Joel's reflink(2) design.
2) "cowfilecopy" - Ted's/Jamie's kernel cp.

could be 1 syscall with a "preserve" flag that requires CAP_FCHOWN.
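
A minimal sketch of what the permission check for such a combined call
might look like, written as a userspace toy rather than kernel code.
The struct names, the may_reflink() helper and the "privileged" field
are all hypothetical; "privileged" stands in for whichever capability
ends up being required.

```c
#include <stdbool.h>
#include <sys/types.h>

/* Hypothetical description of the calling process (illustration only). */
struct caller {
    uid_t uid;
    bool  privileged;      /* stands in for the capability check */
};

/* Hypothetical description of the source file (illustration only). */
struct src_file {
    uid_t owner;
    bool  caller_can_read; /* result of the usual read-permission check */
};

/*
 * Return true if the combined call would be allowed.
 *
 * preserve == true  : "snapfile" behaviour, attributes are cloned, so
 *                     only the owner or a privileged caller may do it.
 * preserve == false : "cowfilecopy" behaviour, read access is enough.
 */
bool may_reflink(const struct caller *c, const struct src_file *f,
                 bool preserve)
{
    if (preserve)
        return c->uid == f->owner || c->privileged;
    return f->caller_can_read;
}
```

With this shape, an unprivileged user who can read someone else's file
gets a COW copy with preserve unset, and is refused with preserve set.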


* Re: [Ocfs2-devel] [PATCH 1/3] fs: Document the reflink(2) system call.
  2009-05-06 14:30                   ` jim owens
@ 2009-05-06 17:50                     ` jim owens
  2009-05-12 19:20                       ` Jamie Lokier
  2009-05-12 19:30                       ` Jamie Lokier
  0 siblings, 2 replies; 151+ messages in thread
From: jim owens @ 2009-05-06 17:50 UTC (permalink / raw)
  To: Theodore Tso, joel.becker
  Cc: Andreas Dilger, Jamie Lokier, linux-fsdevel, jmorris, ocfs2-devel,
	viro

jim owens wrote:
> 
> 1) "snapfile" - Joel's reflink(2) design.
> 2) "cowfilecopy" - Ted's/Jamie's kernel cp.
> 
> could be 1 syscall with a "preserve" flag that requires CAP_FCHOWN.

No disagreement must mean the mail isn't getting through :)

So on to the last turd in the punch bowl,  quota and du rules:

Both snapfile and cowfilecopy:

- must not allow their use to cheat to exceed the user's quota

- would best serve if only the actual disk non-shared space was counted

But we know not all filesystems will be able to change their
on-disk data structures and efficiently count only non-shared.

So I suggest this is the rule:

Quota accounting and disk space used for the original file will
be as if there were 0 reflinks.  Quota accounting and disk space
reported for the new reflink file is filesystem specific and may
or may not include shared disk space.

jim


* Re: [Ocfs2-devel] [PATCH 1/3] fs: Document the reflink(2) system call.
  2009-05-06 17:50                     ` jim owens
@ 2009-05-12 19:20                       ` Jamie Lokier
  2009-05-12 19:30                       ` Jamie Lokier
  1 sibling, 0 replies; 151+ messages in thread
From: Jamie Lokier @ 2009-05-12 19:20 UTC (permalink / raw)
  To: jim owens
  Cc: Theodore Tso, joel.becker, Andreas Dilger, linux-fsdevel, jmorris,
	ocfs2-devel, viro

jim owens wrote:
> Quota accounting and disk space used for the original file will
> be as if there were 0 reflinks.  Quota accounting and disk space
> reported for the new reflink file is filesystem specific and may
> or may not include shared disk space.

One little thing:

If the original file is deleted,

   1. The data must still be accounted at least once.

   2. After deleting the original, the space must not be charged to
      the owner of the original file if different from the owners of
      reflinks which remain, because that would be a quota attack.

Less important:

   3. The combination of 1 and 2 probably shouldn't cause the quota
      charge of other users (who created reflinks and aren't the
      original file owner) to increase as it might push them over their
      quota, in a way that's difficult for them to know beforehand.

One way to satisfy that is for the reflink's data to be charged at
least once to each distinct owner who references that data.  Maybe it
can be done by charging it when the owner of a new reflink is
different from the original, and something appropriate during chown.
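
One possible reading of "charged at least once to each distinct owner"
is sketched below as a toy model.  It is not taken from any filesystem;
the structs, the MAX_OWNERS limit and the data_ref()/data_unref()/
data_chown() helpers are made up for illustration, and owners are small
integer ids with no bounds checking.

```c
#define MAX_OWNERS 8   /* toy limit; owners are small integer ids */

/* A piece of shared data and how many files of each owner reference it. */
struct shared_data {
    long blocks;               /* size of the shared data, in blocks */
    int  refs[MAX_OWNERS];     /* referencing files, counted per owner */
};

/* Per-owner quota usage, in blocks. */
struct quota {
    long used[MAX_OWNERS];
};

/* A file owned by `owner` starts referencing the data. */
void data_ref(struct shared_data *d, struct quota *q, int owner)
{
    if (d->refs[owner]++ == 0)
        q->used[owner] += d->blocks;   /* first reference by this owner */
}

/* A file owned by `owner` stops referencing the data (unlink, full COW). */
void data_unref(struct shared_data *d, struct quota *q, int owner)
{
    if (--d->refs[owner] == 0)
        q->used[owner] -= d->blocks;   /* last reference by this owner */
}

/* chown of one referencing file moves a single reference between owners. */
void data_chown(struct shared_data *d, struct quota *q, int from, int to)
{
    data_ref(d, q, to);
    data_unref(d, q, from);
}
```

In this model, deleting the original never raises anyone else's charge:
a different owner was already charged when the reflink was created, and
the original owner's charge drops once their last reference is gone.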

-- Jamie


* Re: [Ocfs2-devel] [PATCH 1/3] fs: Document the reflink(2) system call.
  2009-05-06 17:50                     ` jim owens
  2009-05-12 19:20                       ` Jamie Lokier
@ 2009-05-12 19:30                       ` Jamie Lokier
  1 sibling, 0 replies; 151+ messages in thread
From: Jamie Lokier @ 2009-05-12 19:30 UTC (permalink / raw)
  To: jim owens
  Cc: Theodore Tso, joel.becker, Andreas Dilger, linux-fsdevel, jmorris,
	ocfs2-devel, viro

jim owens wrote:
> du rules

In general, hard :-)

See the difficulties reporting process memory usage given shared
libraries for how tricky it gets.

Imho, the simplest is for du to report how much space it would take if
all the files were fully unCOWed, either by being modified or copied
to another filesystem.  In other words, just return the data blocks
assigned to a file as usual; the COW difference is the same block can
be assigned to more than one file.  Sometimes knowing the unCOWed
space would even be useful.

Of course knowing the COWed space is also useful.  (Both together
would give you a nice feel for how much space COW is saving.)  I
suspect that would need changes to du to give useful answers (and
similar changes to "Disk Usage Analyzer" if you like that tool).
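
To make the two numbers concrete, here is a small sketch that computes
both from per-file lists of physical block numbers.  It assumes the
block lists are already available somehow (which is exactly the hard
part); struct file_map and both function names are hypothetical.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical per-file block map: the physical blocks a file uses. */
struct file_map {
    const unsigned long *blocks;
    size_t               nblocks;
};

/* Space needed if every file were fully unCOWed: sum the per-file counts. */
size_t uncowed_blocks(const struct file_map *files, size_t nfiles)
{
    size_t total = 0;
    for (size_t i = 0; i < nfiles; i++)
        total += files[i].nblocks;
    return total;
}

static int cmp_block(const void *a, const void *b)
{
    unsigned long x = *(const unsigned long *)a;
    unsigned long y = *(const unsigned long *)b;
    return (x > y) - (x < y);
}

/*
 * Space actually occupied: count each physical block once, however many
 * files share it.  Returns (size_t)-1 if memory cannot be allocated.
 */
size_t cowed_blocks(const struct file_map *files, size_t nfiles)
{
    size_t total = uncowed_blocks(files, nfiles);
    if (total == 0)
        return 0;

    unsigned long *all = malloc(total * sizeof(*all));
    if (!all)
        return (size_t)-1;

    /* Gather every block reference, then sort so duplicates are adjacent. */
    size_t n = 0;
    for (size_t i = 0; i < nfiles; i++) {
        if (files[i].nblocks)
            memcpy(all + n, files[i].blocks,
                   files[i].nblocks * sizeof(*all));
        n += files[i].nblocks;
    }
    qsort(all, n, sizeof(*all), cmp_block);

    size_t distinct = 1;
    for (size_t i = 1; i < n; i++)
        if (all[i] != all[i - 1])
            distinct++;

    free(all);
    return distinct;
}
```

The difference between the two results is the space COW is saving for
that set of files.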

du detects hard links by i_nlink!=1 and inode number, and merges the
accounting.  For partially shared files, I'm not sure how best to get
the right information out.  FIEMAP is inherently a lot slower than
stat() because it can do much more disk access, and it might not
always work (depending on how data is represented on disk), and it
might not always be permitted.  Worse, there's no i_nlink!=1
equivalent to decide when FIEMAP does not need to be called.

-- Jamie


* Re: [Ocfs2-devel] [PATCH 1/3] fs: Document the reflink(2) system call.
  2009-05-06 14:24                 ` jim owens
  2009-05-06 14:30                   ` jim owens
@ 2009-05-12 19:11                   ` Jamie Lokier
  2009-05-12 19:37                     ` jim owens
  1 sibling, 1 reply; 151+ messages in thread
From: Jamie Lokier @ 2009-05-12 19:11 UTC (permalink / raw)
  To: jim owens
  Cc: Theodore Tso, joel.becker, Andreas Dilger, linux-fsdevel, jmorris,
	ocfs2-devel, viro

jim owens wrote:
> You ask why not use a 2-step "cowfilecopy" and "attrfilecopy"
> to do "snapfile"... because that is not an atomic snapshot.

Understood, no problem with that.  (Though it would be nice to have a
realistic example showing the atomicity being useful for a single file
snapshot).

Being able to create a _new file_ with the security attributes of an
existing file is sometimes useful too.  Lots of programs do that, of
course, but a lot of them get it wrong when non-traditional security
attributes are used.

reflink() followed by truncate() would be useful for that - and in
that case, returning EPERM if it can't clone the attributes would be
essential - because a program which wants to copy "all the security
attributes" without the knowledge to parse them itself and set them in
the right order won't have the code to check whether they were cloned
reliably either.

> The security and "might not know about it" concerns are bogus:
> No extra visibility exists to future updates of the original
> file that would not exist without either snapfile or cowfilecopy.
> That BOTH point at your old data is no different than if root
> or raid was copying every disk block to permanent storage. If
> you write it, someone can have it later.

I agree with that _as long as_ reflink() does not permit you to clone
a file when you are not the owner and you don't have read access.

It looks like reflink() V4 does not permit it in that case - good!

(A more precise statement of the rule is "as long as you could not
copy the file normally and then change its attributes to match what
reflink() produces").

That's different from link(), which _does_ allow links when you have
no read access and aren't the owner, but it always bumps i_nlink.
That's where I was coming from with the "might not know about it"
concern, because it looked like earlier reflink() proposals applied
the same weak permission checks as link().

V4 seems much better.

> So bottom line... I see no reason (except someone has to document)
> why we should not have 2 system calls since there are good uses
> and good definitions for both and the code is 99% identical.

I doubt if anyone cares deeply if there are two system calls or one
system call with a flag(*), since they are so similar.  The main thing
is having useful behaviours.

(*) Except for aesthetics.

I'm with the folks who think it's better for userspace to explicitly
request one behaviour or the other, rather than having reflink()
"automatically" decide for itself whether it will clone the attributes
or use new-file attributes.

The reason is because the "automatic" behaviour will certainly require
some applications to work around it, by guessing what it's going to do
before (which is difficult to do accurately), or checking what it did
afterwards.

That will be these applications:

   - Sometimes an app will want to clone the attributes, and tell the
     user "sorry, no" if that's not possible.  So the app will have to
     stat the file first, check the file owner against its euid,
     reflink, then stat the resulting file afterwards and check what
     happened (because ownership might have changed between the first
     stat and reflink calls, changing reflink's behaviour from what it
     expected), and then call unlink if the wrong thing happened *and*
     it will still be wrong 1% of the time when the security model is
     not what the application expected.  Applications should not
     have to hard-code every known security model.  And linking then
     unlinking because you got it wrong is another security issue.

     "cp --cow -a" might be in this category, so would "rsync --cow -a"
     and generic backup applications.  I expect most applications
     wanting to copy exactly care about this.

   - Sometimes an app will want to warn the user if the attributes
     couldn't be cloned, but succeed in making the copy.  reflink() V4
     does that, but the app will have to check the new attributes against
     the old ones to know whether to warn, and then guess what errno
     would be appropriate.

     Maybe "cp --cow -a" will be like this.

   - Sometimes an app really just wants to copy a file with COW for
     efficient data sharing.  It will have to change the resulting
     attributes to "new file" attributes - and that will be wrong 1%
     of the time because it's not necessarily easy to get those
     attributes right, especially with non-standard security models.
     Even with traditional security, getting setgid-directory
     behaviour right is extremely difficult - because it depends on
     the filesystem's mount options among other things.  Basically
     "new file" attributes are something that should always be left to
     the kernel.

     While it might not be obvious when root would want to copy a file
     without preserving attributes with COW performance, the argument
     "I nearly always forget -p when writing cp" is arguing for "alias
     cp='cp -p'" in your /root/.profile, not for making the system
     call do it in a way you can't disable :-)

     Besides I can think of when you would want it: When running *any*
     shell script that you didn't write with the environment variable
     CP_USE_COW_WHEN_POSSIBLE_TO_SAVE_SPACE set ;-)

Now the opposite of "automatic" is the app requests whether to clone
attributes or use "new file" attributes.  In contrast to the above
problems, this doesn't cause any difficulty to applications, because
any app wanting the automatic choice can just do this:

     ret = reflink(a,b);
     if (ret == -1 && errno == EPERM)
         ret = cowlink(a,b);

Ok, that's not perfect because EPERM can mean other things.

Which brings us back to a flag ;-) like this:

     REFLINK_ATTR_CLONE                    (EPERM if can't clone attributes)
     REFLINK_ATTR_CLONE_IF_OWNER_OR_ROOT   (choose, as proposed in reflink V4)

One last annoyance.  If you're making a new file, then like open() you
need another argument, which is the new file's mode which is combined
with umask.  But not if you're cloning the attributes.

That's a good reason why there should be two functions for
applications.  The names reflink/cowlink (and reflinkat/cowlinkat)
make sense to me.  The cowlink functions have an extra mode argument,
like the last argument to open().

(They could all be one system call at the kernel level, but different
in libc, as is already planned for the reflink/reflinkat distinction.)

Oh, and please implement AT_SYMLINK_FOLLOW the same as link().

Thanks :-)
-- Jamie


* Re: [Ocfs2-devel] [PATCH 1/3] fs: Document the reflink(2) system call.
  2009-05-12 19:11                   ` Jamie Lokier
@ 2009-05-12 19:37                     ` jim owens
  2009-05-12 20:11                       ` Jamie Lokier
  0 siblings, 1 reply; 151+ messages in thread
From: jim owens @ 2009-05-12 19:37 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Theodore Tso, joel.becker, Andreas Dilger, linux-fsdevel, jmorris,
	ocfs2-devel, viro

I don't write applications so I won't argue when
what they want does not make sense to me :)

Jamie Lokier wrote:
> 
> One last annoyance.  If you're making a new file, then like open() you
> need another argument, which is the new file's mode which is combined
> with umask.

But that only works for minimal traditional permissions.  If you
want to adjust ACL or MAC, you need to do something else anyway,
so is it really worth having the old-style mode parameter?

jim


* Re: [Ocfs2-devel] [PATCH 1/3] fs: Document the reflink(2) system call.
  2009-05-12 19:37                     ` jim owens
@ 2009-05-12 20:11                       ` Jamie Lokier
  0 siblings, 0 replies; 151+ messages in thread
From: Jamie Lokier @ 2009-05-12 20:11 UTC (permalink / raw)
  To: jim owens
  Cc: Theodore Tso, joel.becker, Andreas Dilger, linux-fsdevel, jmorris,
	ocfs2-devel, viro

jim owens wrote:
> >One last annoyance.  If you're making a new file, then like open() you
> >need another argument, which is the new file's mode which is combined
> >with umask.
> 
> But that only works for minimal traditional permissions.  If you
> want to adjust ACL or MAC, you need to do something else anyway,
> so is it really worth having the old-style mode parameter?

You have a point, and mode+umask is sort of ugly, but:

ACLs and MACs are intentionally designed so that in 99.9% of
cases, there is no need to do anything else after open(), even in
programs that use different mode arguments for security and don't know
anything about non-traditional permissions.  So very few apps need to
do anything else afterwards.  The ACL/MAC defaults have been carefully
designed to have the right security properties, and people writing
security policies understand how that works.

The most often used mode parameters are almost certainly 0666 meaning
"use what umask says", and 0600 meaning "most restricted useful
permissions" for a new file.

If you want to create a file with restricted permissions without
altering umask, which isn't safe in a threaded program, you must _not_
use 0666 _and then_ narrow the permissions - it's important that the
initial permissions are <= the final ones that you need.
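
As an illustration of "start narrow, then widen", the sketch below
creates a file with 0600 and only afterwards raises it to the wanted
mode, without touching the process-wide umask.  The helper name
create_restricted() is made up, and it assumes the wanted mode is at
least as permissive as 0600; only standard POSIX calls are used.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Create `path` so that it is never more permissive than `final_mode`
 * at any point in time.  Assumes final_mode includes at least 0600.
 * Returns an open fd, or -1 with errno set.
 */
int create_restricted(const char *path, mode_t final_mode)
{
    /* Narrowest useful mode first.  O_EXCL refuses to reuse an existing
     * file, which might already have wider permissions. */
    int fd = open(path, O_WRONLY | O_CREAT | O_EXCL, 0600);
    if (fd < 0)
        return -1;

    /* Widen to the mode that is actually wanted.  Unlike open(),
     * fchmod() is not filtered through umask. */
    if (final_mode != 0600 && fchmod(fd, final_mode) < 0) {
        int saved = errno;
        close(fd);
        unlink(path);
        errno = saved;
        return -1;
    }
    return fd;
}
```

Going the other way (open with 0666, then chmod down) leaves a window in
which the file is readable by others, which is the problem described
above.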

So without the parameter, what's the sane default?

For typical cowlink uses it should be equivalent to open(...,0666) as
you don't want to umask+chmod afterwards.  I wouldn't be surprised if
umask+chmod afterwards gave different ACL/MAC results.

But if you need restricted permission on the file afterwards, since
it's not safe to start wide and then narrow, 0666 is not a suitable
default.

You could say "just change the umask!" but that is bad in a threaded
program, unfortunately.  (Imho they should have made umask
thread-specific; oh well.  In fact you can emulate per-thread umask by
adjusting the mode argument in some environments :-)

The mode argument, though ugly, is at least well understood and
security policies (inside apps and outside) do the right thing with it.

-- Jamie


* Re: [PATCH 1/3] fs: Document the reflink(2) system call.
  2009-05-05  7:16     ` Joel Becker
  2009-05-05  8:09       ` Andreas Dilger
@ 2009-05-05 13:01       ` Theodore Tso
  2009-05-05 13:19         ` Jamie Lokier
  2009-05-05 17:00         ` Joel Becker
  2009-05-05 13:01       ` Jamie Lokier
  2 siblings, 2 replies; 151+ messages in thread
From: Theodore Tso @ 2009-05-05 13:01 UTC (permalink / raw)
  To: Jamie Lokier, linux-fsdevel, jmorris, ocfs2-devel, viro

On Tue, May 05, 2009 at 12:16:09AM -0700, Joel Becker wrote:
> On Tue, May 05, 2009 at 02:07:03AM +0100, Jamie Lokier wrote:
> > Joel Becker wrote:
> > > +All file attributes and extended attributes of the new file must be
> > > +identical to the source file with the following exceptions:
> > 
> > reflink() sounds useful already, but is there a compelling reason why
> > both files must have the same attributes, and changing attributes will
> > break the COW?
> 
> 	Yeah, because without it you can't use it for snapshotting.
> That's where the original design came from - inode snapshots.  The big
> thing that excited me was that defining reflink() as I did, instead of
> a more specific snapshot call, allows all sorts of generic uses (some of
> which you outline below).

I guess it depends on your implementation.  At least the way I would
implement this in ext4, for example, I'd simply set a new flag
indicating this was a "reflink", and then the i_data[0..3] field would
contain the inode number of the "host" inode, and i_data [4..7] and
i_data[8..11] would contain a circular linked list of all reflinks
associated with that inode.  I'd then grab a spare inode field so the
"host" inode could point to the reflink'ed inodes.

If you ever need to delete the host inode, you simply pick one of the
reflink inodes, copy i_data from the host inode into it, promote it to
be the "host" inode, and then update all of the other reflink inodes to
point at the new host inode.
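
A rough in-memory sketch of that scheme, to make the promotion step
concrete.  This is a toy, not ext4 code: it uses C pointers where the
on-disk design would store inode numbers in i_data, and struct
toy_inode, add_reflink() and delete_host() are hypothetical names.

```c
#include <stddef.h>

struct toy_inode {
    unsigned long     ino;
    int               is_reflink;    /* the new "this is a reflink" flag */
    struct toy_inode *host;          /* reflinks: the host inode */
    struct toy_inode *next, *prev;   /* reflinks: circular list of siblings */
    struct toy_inode *first_reflink; /* hosts: entry point into that list */
    const void       *i_data;        /* hosts: stands in for the block map */
};

/* Attach reflink inode `r` to `host`, appending it to the circular list. */
void add_reflink(struct toy_inode *host, struct toy_inode *r)
{
    r->is_reflink = 1;
    r->host = host;
    r->i_data = NULL;
    if (!host->first_reflink) {
        r->next = r->prev = r;
        host->first_reflink = r;
    } else {
        struct toy_inode *head = host->first_reflink;
        r->next = head;
        r->prev = head->prev;
        head->prev->next = r;
        head->prev = r;
    }
}

/*
 * Delete the host: promote one reflink to be the new host and repoint
 * the rest.  Returns the new host, or NULL if there were no reflinks
 * (in which case the blocks could simply be freed).
 */
struct toy_inode *delete_host(struct toy_inode *host)
{
    struct toy_inode *nh = host->first_reflink;
    if (!nh)
        return NULL;

    /* Take the promoted inode out of the circular list. */
    struct toy_inode *rest = (nh->next == nh) ? NULL : nh->next;
    nh->prev->next = nh->next;
    nh->next->prev = nh->prev;

    /* Promote it: it takes over the block map and is no longer a reflink. */
    nh->i_data = host->i_data;
    nh->is_reflink = 0;
    nh->host = NULL;
    nh->next = nh->prev = NULL;
    nh->first_reflink = rest;

    /* Every remaining reflink now points at the new host. */
    if (rest) {
        struct toy_inode *p = rest;
        do {
            p->host = nh;
            p = p->next;
        } while (p != rest);
    }

    host->i_data = NULL;
    host->first_reflink = NULL;
    return nh;
}
```

The cost of deleting the host is one pass over the list to update the
host pointers, which on disk would mean touching every reflink inode.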

The advantage of this scheme is not only does the reflink'ed inode
have a new inode number (as in your design), it actually has an
entirely new inode.  So we can change the ownership, the mtime, ctime;
it behaves *entirely* as a separate, free-standing inode except it is
sharing the data blocks.

This allows me to easily set a new owner, and indeed any other inode
metadata, on the reflink'ed inode, which I would argue is a Good
Thing.

I'm guessing that with the way OCFS2 has implemented (or is planning
on implementing) reflinks, you can't modify the metadata?  Or is there
some really important reason why it's not a good idea for OCFS2?

> > Since each reflink has its own nlink and ino, I'm wondering why the
> > other attributes cannot also be separate.  (I realise extended
> > attributes complicate the picture and it's desirable to share them,
> > especially if they are large).
> 
> 	The biggest reason is snapshotting.

I guess this doesn't mean much to me.  Can you say more about what you
have in mind when you say "snapshotting"?  Is this in the WAFL sense?
What's the use case?

> > Can you hard link to the source file and the reflink afterwards,
> > incrementing the reflink's link count?  (I presume yes).  Can you
> > reflink to both of them too?
> 
> 	Yes, absolutely.  Once reflinked, they look like two separate
> POSIX files.

... but in your implementation, if you ever chown or chmod (or even
touch the atime?) of the file, it instantly does a copy-on-write?

            	  	     	       	  - Ted


* Re: [PATCH 1/3] fs: Document the reflink(2) system call.
  2009-05-05 13:01       ` Theodore Tso
@ 2009-05-05 13:19         ` Jamie Lokier
  2009-05-05 13:39           ` Chris Mason
                             ` (2 more replies)
  2009-05-05 17:00         ` Joel Becker
  1 sibling, 3 replies; 151+ messages in thread
From: Jamie Lokier @ 2009-05-05 13:19 UTC (permalink / raw)
  To: Theodore Tso; +Cc: linux-fsdevel, jmorris, ocfs2-devel, viro

Theodore Tso wrote:
> I guess it depends on your implementation.  At least the way I would
> implement this in ext4, for example, I'd simply set a new flag
> indicating this was a "reflink", and then the i_data[0..3] field would
> contain the inode number of the "host" inode, and i_data [4..7] and
> i_data[8..11] would contain a circular linked list of all reflinks
> associated with that inode.  I'd then grab a spare inode field so the
> "host" inode could point to the reflink'ed inodes.
> 
> If you ever need to delete the host inode, you simply pick one of the
> reflink inodes, copy i_data from the host inode into it, promote it to
> be the "host" inode, and then update all of the other reflink inodes to
> point at the new host inode.
> 
> The advantage of this scheme is not only does the reflink'ed inode
> have a new inode number (as in your design), it actually has an
> entirely new inode.  So we can change the ownership, the mtime, ctime;
> it behaves *entirely* as a separate, free-standing inode except it is
> sharing the data blocks.
> 
> This allows me to easily set a new owner, and indeed any other inode
> metadata, on the reflink'ed inode, which I would argue is a Good
> Thing.

There was an attempt at something like that for ext3 a year or two ago.
Search for "cowlink" if you're interested.

Most of the discussion ended up around how to handle copying on writes
to shared-writable mmaps, something which I guess is solved these days.

Instead of a circular list, a proposed implementation was to create a
separate "host" inode on the first reflink, converting the source
inode to a reflink inode and moving the data block references to the
new host inode.  Each reflink was simply a reference to the host
inode, much like your design, and the host inode was only to hold the
data blocks, with its i_nlink counting the number of reflinks
pointing to it.

Using a circular list means the space must be reserved in every inode,
even those which are not (yet) reflinks.  It also does a bit more
writing sometimes, because of having to update next and previous
entries on the list.
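
For comparison, a toy sketch of the separate-host-inode variant.  Again
this is not the old cowlink code, just an illustration under the
assumptions stated in the comments; struct host_inode, struct
file_inode, toy_reflink() and toy_unlink() are hypothetical names.

```c
#include <stdlib.h>

/* Holds only the data block references and a count of reflinks. */
struct host_inode {
    unsigned int i_nlink;   /* number of reflink inodes pointing here */
    const void  *blocks;    /* stands in for the block references */
};

struct file_inode {
    unsigned long      ino;
    const void        *blocks; /* used while the file is not shared */
    struct host_inode *host;   /* non-NULL once the file is a reflink */
};

/* Make `dst` share the data of `src`.  Returns 0, or -1 if out of memory. */
int toy_reflink(struct file_inode *src, struct file_inode *dst)
{
    if (!src->host) {
        /* First reflink: create the host, move the block references
         * to it, and turn the source into a reflink as well. */
        struct host_inode *h = malloc(sizeof(*h));
        if (!h)
            return -1;
        h->blocks = src->blocks;
        h->i_nlink = 1;
        src->blocks = NULL;
        src->host = h;
    }
    dst->blocks = NULL;
    dst->host = src->host;
    dst->host->i_nlink++;
    return 0;
}

/*
 * Drop one reflink.  Returns 1 if this was the last reference and the
 * host (and with it the data) was released, 0 otherwise.
 */
int toy_unlink(struct file_inode *f)
{
    struct host_inode *h = f->host;
    f->host = NULL;
    if (h && --h->i_nlink == 0) {
        free(h);
        return 1;
    }
    return 0;
}
```

Nothing needs to be reserved in ordinary inodes here beyond the host
reference, and removing a reflink only touches the host's counter.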

Hmm.  The data pointers could live in all the inodes, since they are
identical and the whole data is cloned on write.  That would make
reading a bit faster.

> I'm guessing that with the way OCFS2 has implemented (or is planning
> on implementing) reflinks, you can't modify the metadata?  Or is there
> some really important reason why it's not a good idea for OCFS2?

I would have thought for OCFS2 and BTRFS, with their nice keyed tree
structure, it would be quite natural to implement separate inodes for
the reflinks pointing at a shared data-holding inode.  Something a
little bit like that must be happening to permit separate inode numbers.

I wonder if even pointing at shared subtrees of data extents might be
feasible, to share some file data.  That would make the COW copy less
of a catastrophe when it happens on a large file :-)

-- Jamie


* Re: [PATCH 1/3] fs: Document the reflink(2) system call.
  2009-05-05 13:19         ` Jamie Lokier
@ 2009-05-05 13:39           ` Chris Mason
  2009-05-05 15:36             ` Jamie Lokier
  2009-05-05 14:21           ` [PATCH 1/3] fs: Document the reflink(2) system call Theodore Tso
  2009-05-05 17:05           ` Joel Becker
  2 siblings, 1 reply; 151+ messages in thread
From: Chris Mason @ 2009-05-05 13:39 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: Theodore Tso, linux-fsdevel, jmorris, ocfs2-devel, viro

On Tue, 2009-05-05 at 14:19 +0100, Jamie Lokier wrote:
> Theodore Tso wrote:
> > I guess it depends on your implementation.  At least the way I would
> > implement this in ext4, for example, I'd simply set a new flag
> > indicating this was a "reflink", and then the i_data[0..3] field would
> > contain the inode number of the "host" inode, and i_data [4..7] and
> > i_data[8..11] would contain a circular linked list of all reflinks
> > associated with that inode.  I'd then grab a spare inode field so the
> > "host" inode could point to the reflink'ed inodes.
> > 
> > If you ever need to delete the host inode, you simply pick one of the
> > reflink inodes, copy i_data from the host inode into it, promote it to
> > be the "host" inode, and then update all of the other reflink inodes to
> > point at the new host inode.
> > 
> > The advantage of this scheme is not only does the reflink'ed inode
> > have a new inode number (as in your design), it actually has an
> > entirely new inode.  So we can change the ownership, the mtime, ctime;
> > it behaves *entirely* as a separate, free-standing inode except it is
> > sharing the data blocks.
> > 
> > This allows me to easily set a new owner, and indeed any other inode
> > metadata, on the reflink'ed inode, which I would argue is a Good
> > Thing.
> 
> There was an attempt at something like that for ext3 a year or two ago.
> Search for "cowlink" if you're interested.
> 
> Most of the discussion ended up around how to handle copying on writes
> to shared-writable mmaps, something which I guess is solved these days.
> 
> Instead of a circular list, a proposed implementation was to create a
> separate "host" inode on the first reflink, converting the source
> inode to a reflink inode and moving the data block references to the
> new host inode.  Each reflink was simply a reference to the host
> inode, much like your design, and the host inode was only to hold the
> data blocks, with its i_nlink counting the number of reflinks
> pointing to it.
> 
> Using a circular list means the space must be reserved in every inode,
> even those which are not (yet) reflinks.  It also does a bit more
> writing sometimes, because of having to update next and previous
> entries on the list.
> 
> Hmm.  The data pointers could live in all the inodes, since they are
> identical and the whole data is cloned on write.  That would make
> reading a bit faster.
> 
> > I'm guessing that with the way OCFS2 has implemented (or is planning
> > on implementing) reflinks, you can't modify the metadata?  Or is there
> > some really important reason why it's not a good idea for OCFS2?
> 
> I would have thought for OCFS2 and BTRFS, with their nice keyed tree
> structure, it would be quite natural to implement separate inodes for
> the reflinks pointing at a shared data-holding inode.  Something a
> little bit like that must be happening to permit separate inode numbers.
> 

Thanks for getting this discussion going, Joel; it's really good to get
this behavior well defined.

The btrfs implementation is just that you have two separate files
pointing to the same extents on disk.  Each file has a reference on each
extent, and deleting or chowning fileA doesn't change the metadata in
fileB.

The btrfs cow code makes sure that modifications in either file (even
when mounted with -o nodatacow) are written to new extents instead of
changing the original.  If you write one block in a 1TB file, the new
space used by the clone is only one block.  (Thanks to the ceph
developers for coding all of this up a while ago).
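
The per-extent reference idea can be pictured with a toy model like the
one below.  It is emphatically not btrfs code: each "extent" is a single
block, and struct extent, struct toy_file, toy_clone() and toy_write()
are names invented for this sketch.

```c
#include <stdlib.h>

struct extent {
    int  refs;   /* how many files reference this extent */
    char data;   /* one byte stands in for the block's contents */
};

struct toy_file {
    struct extent **ext;  /* one extent pointer per block of the file */
    size_t          n;
};

/* Clone: the new file references the same extents; no data is copied. */
int toy_clone(const struct toy_file *src, struct toy_file *dst)
{
    dst->ext = malloc(src->n * sizeof(*dst->ext));
    if (!dst->ext && src->n)
        return -1;
    dst->n = src->n;
    for (size_t i = 0; i < src->n; i++) {
        dst->ext[i] = src->ext[i];
        dst->ext[i]->refs++;
    }
    return 0;
}

/*
 * Write one block of `f`.  A shared extent is never modified in place:
 * the write goes to a fresh extent and only this file is repointed.
 * Returns the number of newly allocated extents (0 or 1), or -1 if
 * out of memory.
 */
int toy_write(struct toy_file *f, size_t idx, char value)
{
    struct extent *e = f->ext[idx];
    int allocated = 0;

    if (e->refs > 1) {
        struct extent *ne = malloc(sizeof(*ne));
        if (!ne)
            return -1;
        ne->refs = 1;
        e->refs--;          /* this file drops its reference to the old one */
        f->ext[idx] = ne;
        e = ne;
        allocated = 1;
    }
    e->data = value;
    return allocated;
}
```

Writing one block of a clone therefore costs one new extent, and all
other blocks stay shared with the original.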

The main difference between reflink and the btrfs ioctl is that in the
btrfs ioctl the destination file must already exist.  The btrfs code can
also do range replacements in the destination file, but I'd agree with
Joel that we don't want to toss the kitchen sink into something nice and
clean like reflink.

-chris




* Re: [PATCH 1/3] fs: Document the reflink(2) system call.
  2009-05-05 13:39           ` Chris Mason
@ 2009-05-05 15:36             ` Jamie Lokier
  2009-05-05 15:41               ` Chris Mason
  2009-05-05 16:46               ` Jörn Engel
  0 siblings, 2 replies; 151+ messages in thread
From: Jamie Lokier @ 2009-05-05 15:36 UTC (permalink / raw)
  To: Chris Mason; +Cc: Theodore Tso, linux-fsdevel, jmorris, ocfs2-devel, viro

Chris Mason wrote:
> The btrfs implementation is just that you have two separate files
> pointing to the same extents on disk.  Each file has a reference on each
> extent, and deleting or chowning fileA doesn't change the metadata in
> fileB.
> 
> The btrfs cow code makes sure that modifications in either file (even
> when mounted in -o nodatacow) are written to new extents instead of
> changing the original.  If you write one block in a 1TB file, the new
> space used by the clone is only one block.  (Thanks to the ceph
> developers for coding all of this up a while ago).

Ooh, nice.

> The main difference between reflink and the btrfs ioctl is that in the
> btrfs ioctl the destination file must already exist.  The btrfs code can
> also do range replacements in the destination file, but I'd agree with
> Joel that we don't want to toss the kitchen sink into something nice and
> clean like reflink.

Ah, now that I know about the BTRFS data-cloning ioctl... :-)

I'm wondering why reflink() is needed at all.  Can't it be done in
userspace, using the BTRFS ioctl?  The hard part in userspace seems to
be copying the file attributes, but "cp -a" and other tools manage.

What is the advantage of adding the system call for the special case
of reflink(), when we choose not to have, say, a copyfile() system
call which does what "cp -a" does because doing it in user space is
good enough?

-- Jamie


* Re: [PATCH 1/3] fs: Document the reflink(2) system call.
  2009-05-05 15:36             ` Jamie Lokier
@ 2009-05-05 15:41               ` Chris Mason
  2009-05-05 16:03                 ` Jamie Lokier
  2009-05-05 16:46               ` Jörn Engel
  1 sibling, 1 reply; 151+ messages in thread
From: Chris Mason @ 2009-05-05 15:41 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: Theodore Tso, linux-fsdevel, jmorris, ocfs2-devel, viro

On Tue, 2009-05-05 at 16:36 +0100, Jamie Lokier wrote:
> Chris Mason wrote:
> > The btrfs implementation is just that you have two separate files
> > pointing to the same extents on disk.  Each file has a reference on each
> > extent, and deleting or chowning fileA doesn't change the metadata in
> > fileB.
> > 
> > The btrfs cow code makes sure that modifications in either file (even
> > when mounted in -o nodatacow) are written to new extents instead of
> > changing the original.  If you write one block in a 1TB file, the new
> > space used by the clone is only one block.  (Thanks to the ceph
> > developers for coding all of this up a while ago).
> 
> Ooh, nice.
> 
> > The main difference between reflink and the btrfs ioctl is that in the
> > btrfs ioctl the destination file must already exist.  The btrfs code can
> > also do range replacements in the destination file, but I'd agree with
> > Joel that we don't want to toss the kitchen sink into something nice and
> > clean like reflink.
> 
> Ah, now that I know about the BTRFS data-cloning ioctl... :-)
> 
> I'm wondering why reflink() is needed at all.  Can't it be done in
> userspace, using the BTRFS ioctl?  The hard part in userspace seems to
> be copying the file attributes, but "cp -a" and other tools manage.
> 

reflink is a subset of what the btrfs ioctl does, and that's a good
thing.  The way they've added support for this to ocfs2 is really cool,
and the same ideas could be used in other filesystems.

So, I'd rather see a system call that everyone can implement, and if
btrfs hangs on to the ioctl for extra features, even better.

-chris




* Re: [PATCH 1/3] fs: Document the reflink(2) system call.
  2009-05-05 15:41               ` Chris Mason
@ 2009-05-05 16:03                 ` Jamie Lokier
  2009-05-05 16:18                   ` Chris Mason
  2009-05-05 20:48                   ` jim owens
  0 siblings, 2 replies; 151+ messages in thread
From: Jamie Lokier @ 2009-05-05 16:03 UTC (permalink / raw)
  To: Chris Mason; +Cc: Theodore Tso, linux-fsdevel, jmorris, ocfs2-devel, viro

Chris Mason wrote:
> > > The main difference between reflink and the btrfs ioctl is that in the
> > > btrfs ioctl the destination file must already exist.  The btrfs code can
> > > also do range replacements in the destination file, but I'd agree with
> > > Joel that we don't want to toss the kitchen sink into something nice and
> > > clean like reflink.
> > 
> > Ah, now that I know about the BTRFS data-cloning ioctl... :-)
> > 
> > I'm wondering why reflink() is needed at all.  Can't it be done in
> > userspace, using the BTRFS ioctl?  The hard part in userspace seems to
> > be copying the file attributes, but "cp -a" and other tools manage.
> > 
> 
> reflink is a subset of what the btrfs ioctl does, and that's a good
> thing.  The way they've added support for this to ocfs2 is really cool,
> and the same ideas could be used in other filesystems.
> 
> So, I'd rather see a system call that everyone can implement, and if
> btrfs hangs on to the ioctl for extra features, even better.

Realistically, very few existing filesystems can implement this system call.

I agree that it's much more likely that a filesystem can implement
reflink() than BTRFS' more flexible data cloning though.

-- Jamie


* Re: [PATCH 1/3] fs: Document the reflink(2) system call.
  2009-05-05 16:03                 ` Jamie Lokier
@ 2009-05-05 16:18                   ` Chris Mason
  2009-05-05 20:48                   ` jim owens
  1 sibling, 0 replies; 151+ messages in thread
From: Chris Mason @ 2009-05-05 16:18 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: linux-fsdevel, Theodore Tso, jmorris, ocfs2-devel, viro

On Tue, 2009-05-05 at 17:03 +0100, Jamie Lokier wrote:
> Chris Mason wrote:
> > > > The main difference between reflink and the btrfs ioctl is that in the
> > > > btrfs ioctl the destination file must already exist.  The btrfs code can
> > > > also do range replacements in the destination file, but I'd agree with
> > > > Joel that we don't want to toss the kitchen sink into something nice and
> > > > clean like reflink.
> > > 
> > > Ah, now that I know about the BTRFS data-cloning ioctl... :-)
> > > 
> > > I'm wondering why reflink() is needed at all.  Can't it be done in
> > > userspace, using the BTRFS ioctl?  The hard part in userspace seems to
> > > be copying the file attributes, but "cp -a" and other tools manage.
> > > 
> > 
> > reflink is a subset of what the btrfs ioctl does, and that's a good
> > thing.  The way they've added support for this to ocfs2 is really cool,
> > and the same ideas could be used in other filesystems.
> > 
> > So, I'd rather see a system call that everyone can implement, and if
> > btrfs hangs on to the ioctl for extra features, even better.
> 
> Realistically, very few existing filesystems can implement this system call.
> 

I'd say that if the shared disk clustering filesystem can do it, pretty
much anyone can ;)  This doesn't mean it's easy, but it is a good set of
semantics to have as the baseline.

-chris


* Re: [PATCH 1/3] fs: Document the reflink(2) system call.
  2009-05-05 16:03                 ` Jamie Lokier
  2009-05-05 16:18                   ` Chris Mason
@ 2009-05-05 20:48                   ` jim owens
  2009-05-05 21:57                     ` Jamie Lokier
  1 sibling, 1 reply; 151+ messages in thread
From: jim owens @ 2009-05-05 20:48 UTC (permalink / raw)
  To: joel.becker
  Cc: Jamie Lokier, Chris Mason, Theodore Tso, linux-fsdevel, jmorris,
	ocfs2-devel, viro

Not surprising that the discussion is all over the place
as far as what this should do.  Whether it is better to
implement one do-many-things syscall or several different
syscalls for different features can be debated after we
set some rules.  Going back to Joel's patch, I think the
first rules we need agreement on are:

1) is only for filesystems with COW operation,
    if the fs does not support COW it returns ENOSYS.

    the rationale is that while we could allow it to
    be a copyfile, it would not save space, so just use "cp -a".

2) is only for regular files, all others return EPERM

    *note* as-coded the patch only traps S_ISDIR, but
    other file types could be a problem on some fs and
    I don't see any value in supporting more than regular
    files unless we support directory COW and then we are
    really jumping into the swamp.

3) the granularity of the COW (1-byte write may cause
    1-block up through whole file copy) is fs-dependent.

4) post-reflink changes done to data or attributes
    in either the original or new file are independent.

next rules if we assume reflink(2) matches Joel's
manpage and call arguments and any other features are
a different api definition:

5) you must be the file owner or have CAP_FCHOWN

    because...

6) all non-time file attributes (owner, security, etc),
    atime, and mtime match the original file.  ctime is
    when the reflink was created.

but the hard part is the quota accounting rule:

7) pre-charge all quotas so a reflink double-counts inodes
    and blocks against the original owner/group

    pro - easiest, does not allow owner to bypass limits,
          quota utilities just work

    con - admin snapshot can trip user quota-limit failures,
          du/df will wildly disagree on space used

so is that what we want or do we want to just say the
behavior is fs-specific with respect to quotas.
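
To make the con under rule 7 concrete, here is a toy model in Python.
Nothing in it is real kernel or quota code: the block size, the file
names and both helper functions are assumptions made up for the
example.  It only shows why a pre-charged quota and the space actually
allocated drift apart once extents are shared.

```python
BLOCK = 4096  # assumed block size for the toy model


def charged_blocks(files):
    """Quota view under rule 7: every file is charged in full."""
    return sum(len(extents) for extents in files.values())


def allocated_blocks(files):
    """Disk view: a shared extent occupies space only once."""
    return len({ext for extents in files.values() for ext in extents})


# "orig" owns extents 0..99; "snap" is a fresh reflink sharing all of them.
files = {"orig": list(range(100))}
files["snap"] = list(files["orig"])

print("charged  :", charged_blocks(files) * BLOCK, "bytes")
print("allocated:", allocated_blocks(files) * BLOCK, "bytes")
```

The first figure is what the quota utilities and du would add up, the
second is what df would report as consumed.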

jim


* Re: [PATCH 1/3] fs: Document the reflink(2) system call.
  2009-05-05 20:48                   ` jim owens
@ 2009-05-05 21:57                     ` Jamie Lokier
  2009-05-05 22:04                       ` Joel Becker
  0 siblings, 1 reply; 151+ messages in thread
From: Jamie Lokier @ 2009-05-05 21:57 UTC (permalink / raw)
  To: jim owens
  Cc: joel.becker, Chris Mason, Theodore Tso, linux-fsdevel, jmorris,
	ocfs2-devel, viro

Joel Becker wrote:
>> > >     If you define that 'reflink sets the attributes as if it was a
>> > > new file', then you should be creating the file with a new security
>> > > context, not with the security context from the existing inode.  And
>> > > then you can't really snapshot.
>> > >     A mixed behavior, like "if you own it, I'll preserve the entire
>> > > security context, but if not I will treat it with a new context" is
>> > > confusing at best.
>> >
>> > I don't find it confusing.  The security context would be inherited from
>> > the creating process, just like creating a new file would.  If it is the
>> > same user as the file owner then the security context will be the same.
>> 
>>         The same as what?  If you reflink your own file, it preserves
>> the security context of the original or it appears with the default
>> security context of yourself?  They are not the same.  "Treat it like
>> link(2)" argues for the former - which precludes changing ownership.
>> That's what reflink is designed to do.  "Treat it like cp" is a
>> different behavior.

jim owens wrote:
> 1) is only for filesystems with COW operation,
>    if the fs does not support COW it returns ENOSYS.
> 
>    the rationale is that while we could allow it to
>    be a copyfile, it would not save space, so just use "cp -a".

As Joel explains above, reflink has user-visible semantics that are
different from "cp -a" quite aside from the COW efficiency which can
be seen as an internal fs-dependent speed/space optimisation.

That means you cannot fall back to "cp -a": reflink has semantic
behaviour which "cp -a" cannot always mimic, and won't always mimic
correctly when it tries.

Imho that's because reflink is overcomplicated and tries to do
multiple jobs at once ;-)

> 2) is only for regular files, all others return EPERM
> 
>    *note* as-coded the patch only traps S_ISDIR, but
>    other file types could be a problem on some fs and
>    I don't see any value in supporting more than regular
>    files unless we support directory COW and then we are
>    really jumping into the swamp.

I agree.

> 3) the granularity of the COW (1-byte write may cause
>    1-block up through whole file copy) is fs-dependent.

And yet ENOSYS if the fs cannot implement any COW, and it isn't
possible for userspace to duplicate the semantics by explicit copying?

> 4) post-reflink changes done to data or attributes
>    in either the original or new file are independent.

Hopefully :-)

Do we say anything about attribute changes triggering COW or not, or
leave it fs-dependent?  Given 3) fs-dependent makes sense, but it's
nice to know in advance if { reflink -R old_tree saved_tree; chmod -R
a-w saved_tree } will be as expensive as copying or as cheap as linking.

> next rules if we assume reflink(2) matches Joel's
> manpage and call arguments and any other features are
> a different api definition:
> 
> 5) you must be the file owner or have CAP_FCHOWN
> 
>    because...
> 
> 6) all non-time file attributes (owner, security, etc),
>    atime, and mtime match the original file.  ctime is
>    when the reflink was created.
> 
> but the hard part is the quota accounting rule:
> 
> 7) pre-charge all quotas so a reflink double-counts inodes
>    and blocks against the original owner/group
> 
>    pro - easiest, does not allow owner to bypass limits,
>          quota utilities just work
> 
>    con - admin snapshot can trip user quota-limit failures,
>          du/df will wildly disagree on space used

Another con - reflinks are potentially a useful way to save space, but
multiple-charging prevents their use when tight on quota.

If a user is tight on their quota and they need lots of snapshots of
their files, e.g. snapshots of work in progress, why should they have
to use hard links with their associated problems (i.e. cannot be
trusted) for their snapshots, instead of reflinks which are ideal?

-- Jamie


* Re: [PATCH 1/3] fs: Document the reflink(2) system call.
  2009-05-05 21:57                     ` Jamie Lokier
@ 2009-05-05 22:04                       ` Joel Becker
  2009-05-05 22:11                         ` Jamie Lokier
                                           ` (2 more replies)
  0 siblings, 3 replies; 151+ messages in thread
From: Joel Becker @ 2009-05-05 22:04 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: jim owens, Chris Mason, Theodore Tso, linux-fsdevel, jmorris,
	ocfs2-devel, viro

On Tue, May 05, 2009 at 10:57:11PM +0100, Jamie Lokier wrote:
> jim owens wrote:
> > 3) the granularity of the COW (1-byte write may cause
> >    1-block up through whole file copy) is fs-dependent.
> 
> And yet ENOSYS if the fs cannot implement any COW, and it isn't
> possible for userspace to duplicate the semantics by explicit copying?

	The point-in-time of the snapshot is what's important here.

> Do we say anything about attribute changes triggering COW or not, or
> leave it fs-dependent?  Given 3) fs-dependent makes sense, but it's
> nice to know in advance if { reflink -R old_tree saved_tree; chmod -R
> a-w saved_tree } will be as expensive as copying or as cheap as linking.

	"Shares the data extents of the source file".  I should hope
that chmod doesn't require copying out all the data.

Joel

-- 

Life's Little Instruction Book #267

	"Lie on your back and look at the stars."

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127


* Re: [PATCH 1/3] fs: Document the reflink(2) system call.
  2009-05-05 22:04                       ` Joel Becker
@ 2009-05-05 22:11                         ` Jamie Lokier
  2009-05-05 22:24                           ` Joel Becker
  2009-05-05 22:12                         ` Jamie Lokier
  2009-05-05 22:28                         ` jim owens
  2 siblings, 1 reply; 151+ messages in thread
From: Jamie Lokier @ 2009-05-05 22:11 UTC (permalink / raw)
  To: jim owens, Chris Mason, Theodore Tso, linux-fsdevel, jmorris,
	ocfs2-devel, viro

Joel Becker wrote:
> 	"Shares the data extents of the source file".  I should hope
> that chmod doesn't require copying out all the data.

Oh... I was under the impression that it would, because the
man-page-of-sorts says attributes must be the same and the following
question:

> Jamie Lokier wrote:
> > reflink() sounds useful already, but is there a compelling reason why
> > both files must have the same attributes, and changing attributes will
                                                  ------------------------
> > break the COW?
    --------------
>
> Yeah, because without it you can't use it for snapshotting.
  ----

If that's not true, then I change my tune substantially and like
reflink() semantics a lot more. :-)

-- Jamie


* Re: [PATCH 1/3] fs: Document the reflink(2) system call.
  2009-05-05 22:11                         ` Jamie Lokier
@ 2009-05-05 22:24                           ` Joel Becker
  2009-05-05 23:14                             ` Jamie Lokier
  0 siblings, 1 reply; 151+ messages in thread
From: Joel Becker @ 2009-05-05 22:24 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: jim owens, Chris Mason, Theodore Tso, linux-fsdevel, jmorris,
	ocfs2-devel, viro

On Tue, May 05, 2009 at 11:11:01PM +0100, Jamie Lokier wrote:
> Joel Becker wrote:
> > 	"Shares the data extents of the source file".  I should hope
> > that chmod doesn't require copying out all the data.
> 
> Oh... I was under the impression that it would, because the
> man-page-of-sorts says attributes must be the same and the following
> question:
> 
> > Jamie Lokier wrote:
> > > reflink() sounds useful already, but is there a compelling reason why
> > > both files must have the same attributes, and changing attributes will
>                                                   ------------------------
> > > break the COW?
>     --------------
> >
> > Yeah, because without it you can't use it for snapshotting.
>   ----
> 
> If that's not true, then I change my tune substantially and like
> reflink() semantics a lot more. :-)

	I was yeah'ing the first part.  The explicit requirement of
reflink is sharing the data extents (including xattr extents).  So, for
example, both the btrfs and ocfs2 implementations can
chmod/chown/utimes/etc all they want after the reflink is done, and no
data sharing is broken.  The data sharing is broken only via data
modification.  Both btrfs and ocfs2 will only copy the hunk modified,
leaving the rest of the file shared; you won't have long wait times for
the CoW of large files just because you modified one byte.
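
The behaviour described above can be sketched with a toy model, in
Python.  This is not the btrfs or ocfs2 implementation; the Volume
class and all of its method names are hypothetical.  It only
illustrates refcounted extents: metadata changes touch the inode
alone, and a data write unshares just the block that was written.

```python
class Volume:
    """Toy model: extents are refcounted, files are lists of extent ids."""

    def __init__(self):
        self.refs = {}    # extent id -> number of files pointing at it
        self.files = {}   # name -> {"mode": int, "extents": [ids]}
        self.next_id = 0

    def _alloc(self):
        ext = self.next_id
        self.next_id += 1
        self.refs[ext] = 1
        return ext

    def create(self, name, nblocks, mode=0o644):
        extents = [self._alloc() for _ in range(nblocks)]
        self.files[name] = {"mode": mode, "extents": extents}

    def reflink(self, src, dst):
        # New inode, same extents: only the reference counts change.
        extents = list(self.files[src]["extents"])
        for ext in extents:
            self.refs[ext] += 1
        self.files[dst] = {"mode": self.files[src]["mode"],
                           "extents": extents}

    def chmod(self, name, mode):
        # Attributes live in the inode, so no extent is touched.
        self.files[name]["mode"] = mode

    def write_block(self, name, index):
        extents = self.files[name]["extents"]
        old = extents[index]
        if self.refs[old] > 1:
            # Shared: drop our reference and write to a fresh extent.
            self.refs[old] -= 1
            extents[index] = self._alloc()

    def blocks_in_use(self):
        return len(self.refs)


vol = Volume()
vol.create("a", 1000)
vol.reflink("a", "b")      # still 1000 blocks in use
vol.chmod("b", 0o444)      # still 1000: sharing is intact
vol.write_block("b", 7)    # one block is copied, the rest stay shared
print(vol.blocks_in_use())
```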

Joel

-- 

"Get right to the heart of matters.
 It's the heart that matters more."

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127


* Re: [PATCH 1/3] fs: Document the reflink(2) system call.
  2009-05-05 22:24                           ` Joel Becker
@ 2009-05-05 23:14                             ` Jamie Lokier
  0 siblings, 0 replies; 151+ messages in thread
From: Jamie Lokier @ 2009-05-05 23:14 UTC (permalink / raw)
  To: jim owens, Chris Mason, Theodore Tso, linux-fsdevel, jmorris,
	ocfs2-devel, viro

Joel Becker wrote:
> Both btrfs and ocfs2 will only copy the hunk modified, leaving the
> rest of the file shared; you won't have long wait times for the CoW
> of large files just because you modified one byte.

Which is brilliant and perfect, and should be near the top of any
marketing materials for reflink :-)

-- Jamie



* Re: [PATCH 1/3] fs: Document the reflink(2) system call.
  2009-05-05 22:04                       ` Joel Becker
  2009-05-05 22:11                         ` Jamie Lokier
@ 2009-05-05 22:12                         ` Jamie Lokier
  2009-05-05 22:21                           ` Joel Becker
  2009-05-05 22:28                         ` jim owens
  2 siblings, 1 reply; 151+ messages in thread
From: Jamie Lokier @ 2009-05-05 22:12 UTC (permalink / raw)
  To: jim owens, Chris Mason, Theodore Tso, linux-fsdevel, jmorris,
	ocfs2-devel, viro

Joel Becker wrote:
> On Tue, May 05, 2009 at 10:57:11PM +0100, Jamie Lokier wrote:
> > jim owens wrote:
> > > 3) the granularity of the COW (1-byte write may cause
> > >    1-block up through whole file copy) is fs-dependent.
> > 
> > And yet ENOSYS if the fs cannot implement any COW, and it isn't
> > possible for userspace to duplicate the semantics by explicit copying?
> 
> 	The point-in-time of the snapshot is what's important here.

Don't we have a slight problem that useful point-in-time snapshots
really need to snapshot whole directory trees?  Otherwise you get the
same inter-file inconsistency issues that you get intra-file from old
fashioned copying.

-- Jamie


* Re: [PATCH 1/3] fs: Document the reflink(2) system call.
  2009-05-05 22:12                         ` Jamie Lokier
@ 2009-05-05 22:21                           ` Joel Becker
  2009-05-05 22:32                             ` James Morris
  0 siblings, 1 reply; 151+ messages in thread
From: Joel Becker @ 2009-05-05 22:21 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Theodore Tso, jmorris, linux-fsdevel, ocfs2-devel, jim owens,
	Chris Mason, viro

On Tue, May 05, 2009 at 11:12:36PM +0100, Jamie Lokier wrote:
> Joel Becker wrote:
> > 	The point-in-time of the snapshot is what's important here.
> 
> Don't we have a slight problem that useful point-in-time snapshots
> really need to snapshot whole directory trees?  Otherwise you get the
> same inter-file inconsistency issues that you get intra-file from old
> fashioned copying.

	Snapshotting whole trees is already doable from things like
btrfs and from whole volumes on your storage.  This is a different
beast.
	Inter-file is a lot easier to handle than intra-file, because
you have control over that part of the process.
	And if your file is actually a disk image, you get snapping of
disks for free :-)

Joel

-- 

"We'd better get back, `cause it'll be dark soon,
 and they mostly come at night.  Mostly."

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127


* Re: [PATCH 1/3] fs: Document the reflink(2) system call.
  2009-05-05 22:21                           ` Joel Becker
@ 2009-05-05 22:32                             ` James Morris
  2009-05-05 22:39                               ` Joel Becker
  2009-05-12 19:40                               ` Jamie Lokier
  0 siblings, 2 replies; 151+ messages in thread
From: James Morris @ 2009-05-05 22:32 UTC (permalink / raw)
  To: Joel Becker
  Cc: Jamie Lokier, jim owens, Chris Mason, Theodore Tso, linux-fsdevel,
	ocfs2-devel, viro, Daniel P. Berrange

On Tue, 5 May 2009, Joel Becker wrote:

> 	And if your file is actually a disk image, you get snapping of
> disks for free :-)

Indeed... I think a great use-case scenario for this will be snapshotting 
VM images, as well as fast and space-efficient instantiation of VMs.


-- 
James Morris
<jmorris@namei.org>


* Re: [PATCH 1/3] fs: Document the reflink(2) system call.
  2009-05-05 22:32                             ` James Morris
@ 2009-05-05 22:39                               ` Joel Becker
  2009-05-12 19:40                               ` Jamie Lokier
  1 sibling, 0 replies; 151+ messages in thread
From: Joel Becker @ 2009-05-05 22:39 UTC (permalink / raw)
  To: James Morris
  Cc: Theodore Tso, Jamie Lokier, linux-fsdevel, Chris Mason, jim owens,
	Daniel P. Berrange, ocfs2-devel, viro

On Wed, May 06, 2009 at 08:32:07AM +1000, James Morris wrote:
> On Tue, 5 May 2009, Joel Becker wrote:
> 
> > 	And if your file is actually a disk image, you get snapping of
> > disks for free :-)
> 
> Indeed... I think a great use-case scenario for this will be snapshotting 
> VM images, as well as fast and space-efficient instantiation of VMs.

	I did the initial design work (ocfs2's refcount tree structure)
to support snapping VM images.  I came up with the reflink() interface
when I realized the structure would back space-efficient instantiation,
or "shallow clones".  Since then, we've been coming up with more and
more fun tricks that the generic reflink() interface allows us to do.

Joel

-- 

"For every complex problem there exists a solution that is brief,
     concise, and totally wrong."
                                        -Unknown

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127


* Re: [PATCH 1/3] fs: Document the reflink(2) system call.
  2009-05-05 22:32                             ` James Morris
  2009-05-05 22:39                               ` Joel Becker
@ 2009-05-12 19:40                               ` Jamie Lokier
  1 sibling, 0 replies; 151+ messages in thread
From: Jamie Lokier @ 2009-05-12 19:40 UTC (permalink / raw)
  To: James Morris
  Cc: Joel Becker, jim owens, Chris Mason, Theodore Tso, linux-fsdevel,
	ocfs2-devel, viro, Daniel P. Berrange

James Morris wrote:
> Indeed... I think a great use-case scenario for this will be snapshotting 
> VM images, as well as fast and space-efficient instantiation of VMs.

I agree, except beware of the illusion that atomic file snapshots mean
safe VM snapshots...

To snapshot a live VM safely, you need to atomically snapshot both
the running state (memory and CPU) _and_ all its disk images
simultaneously.

Otherwise you're asking for guest filesystem corruption.

reflink() won't do that by itself, but the VM implementation could use
reflink() to make fast snapshots without significantly pausing a
running VM.
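
A rough sketch of that ordering, in Python.  Everything here is
hypothetical: the vm object with pause(), resume() and save_state(),
the clone callable standing in for a reflink of one image, and the
".snap" suffix.  It only illustrates that all disk images and the
running state are captured inside one short pause.

```python
from contextlib import contextmanager


@contextmanager
def paused(vm):
    """Hold the guest still; always resume, even if a clone fails."""
    vm.pause()
    try:
        yield
    finally:
        vm.resume()


def snapshot_vm(vm, disk_paths, clone, suffix=".snap"):
    """Clone every disk image and save guest state during one pause."""
    with paused(vm):
        # Each clone is assumed to be cheap (extent sharing), so the
        # pause stays short no matter how large the images are.
        snaps = [clone(path, path + suffix) for path in disk_paths]
        state = vm.save_state()  # memory and CPU at the same instant
    return snaps, state
```

If the clone step were a full copy instead, the guest would stay
paused for as long as the copy takes.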

-- Jamie


* Re: [PATCH 1/3] fs: Document the reflink(2) system call.
  2009-05-05 22:04                       ` Joel Becker
  2009-05-05 22:11                         ` Jamie Lokier
  2009-05-05 22:12                         ` Jamie Lokier
@ 2009-05-05 22:28                         ` jim owens
  2009-05-05 23:12                           ` Jamie Lokier
  2 siblings, 1 reply; 151+ messages in thread
From: jim owens @ 2009-05-05 22:28 UTC (permalink / raw)
  To: Jamie Lokier, jim owens, Chris Mason, Theodore Tso, linux-fsdevel,
	jmorris, oc

Jamie,

Joel Becker wrote:
> 	"Shares the data extents of the source file".

so with that clarification, do you now agree with this

 > 1) is only for filesystems with COW operation,
 >    if the fs does not support COW it returns ENOSYS.

being a requirement so the user can trust that calling
reflink() uses minimal space (inode/extentmap) and only
a change to the file will trigger a data copy.

jim


* Re: [PATCH 1/3] fs: Document the reflink(2) system call.
  2009-05-05 22:28                         ` jim owens
@ 2009-05-05 23:12                           ` Jamie Lokier
  0 siblings, 0 replies; 151+ messages in thread
From: Jamie Lokier @ 2009-05-05 23:12 UTC (permalink / raw)
  To: jim owens
  Cc: Chris Mason, Theodore Tso, linux-fsdevel, jmorris, ocfs2-devel,
	viro

jim owens wrote:
> Jamie,
> 
> Joel Becker wrote:
> >	"Shares the data extents of the source file".
> 
> so with that clarification, do you now agree with this
> 
> > 1) is only for filesystems with COW operation,
> >    if the fs does not support COW it returns ENOSYS.
> 
> being a requirement so the user can trust that calling
> reflink() uses minimal space (inode/extentmap) and only

Yes I do, if

> a change to the file will trigger a data copy.

"file" means the data, not the permissions and timestamps :-)

Otherwise there's still a user trust issue, since many applications
come to mind who would like to chmod/chown/futimes immediately after
making the reflink, and they need to trust that the result uses
minimal space.

I realise now in the OCFS2/BTRFS cases this isn't an issue since
changing the data only unshares a small region of the data anyway.

But that's quite a difficult thing to ask of any filesystem which
implements reflink(), whereas saying "attribute changes do not trigger
COW" (well maybe chown/chgrp do) is reasonable for any filesystems
which can implement reflink().

-- Jamie


* Re: [PATCH 1/3] fs: Document the reflink(2) system call.
  2009-05-05 15:36             ` Jamie Lokier
  2009-05-05 15:41               ` Chris Mason
@ 2009-05-05 16:46               ` Jörn Engel
  2009-05-05 16:54                 ` Jörn Engel
  2009-05-05 21:44                 ` copyfile semantics Andreas Dilger
  1 sibling, 2 replies; 151+ messages in thread
From: Jörn Engel @ 2009-05-05 16:46 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Chris Mason, Theodore Tso, linux-fsdevel, jmorris, ocfs2-devel,
	viro

On Tue, 5 May 2009 16:36:29 +0100, Jamie Lokier wrote:
> 
> I'm wondering why reflink() is needed at all.  Can't it be done in
> userspace, using the BTRFS ioctl?  The hard part in userspace seems to
> be copying the file attributes, but "cp -a" and other tools manage.
> 
> What is the advantage of adding the system call for the special case
> of reflink(), when we choose not to have, say, a copyfile() system
> call which does what "cp -a" does because doing it in user space is
> good enough?

Given an ignorant filesystem, copyfile() will simply do the read/write
loop in kernelspace.  So either copyfile() is just a fancy name for
splice() or copyfile() will also have to create a tempfile, rename the
tempfile when the copy is done and deal with all possible errors.  And
if the system crashes, who will remove the tempfile on reboot?  Will the
tempfile have a well-known name, allowing for easy DoS?  Or will it be
random, causing much fun locating it after reboot.

In short, copyfile() for ignorant filesystems is a steaming load of
it.  I know, I've written it [1].
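
For reference, the userspace version of the tempfile-and-rename dance
looks roughly like this in Python.  It is a sketch, not the code
referred to in [1]; copyfile_atomic() and the ".copyfile-" prefix are
names invented for the example.  It uses a random temp name in the
destination directory, and it cleans up after errors but cannot clean
up after a crash, which is exactly the problem described above.

```python
import os
import shutil
import tempfile


def copyfile_atomic(src, dst):
    """Copy src to dst so that dst never appears half-written."""
    dst_dir = os.path.dirname(os.path.abspath(dst))
    # Random name in the destination directory: the rename stays on
    # one filesystem and nobody can pre-create the name to block us.
    fd, tmp = tempfile.mkstemp(prefix=".copyfile-", dir=dst_dir)
    try:
        with os.fdopen(fd, "wb") as out, open(src, "rb") as inp:
            shutil.copyfileobj(inp, out)  # the plain read/write loop
            out.flush()
            os.fsync(out.fileno())
        os.replace(tmp, dst)              # publish under the final name
    except BaseException:
        # Covers errors, not crashes: after a crash the temp file
        # stays behind and something has to find and remove it.
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise
```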

When implemented in the filesystem itself, copyfile() can be quite nice.
The filesystem can create a temporary inode without visibly exposing it
to userspace.  It can delete temporary inodes in journal replay after a
crash.  And depending on the fs design, the read/write loop can be
replaced with finer-grained reference counting.

[1] Not a year or two ago, but in 2004, btw.

Jörn

-- 
Do not stop an army on its way home.
-- Sun Tzu
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: [PATCH 1/3] fs: Document the reflink(2) system call.
  2009-05-05 16:46               ` Jörn Engel
@ 2009-05-05 16:54                 ` Jörn Engel
  2009-05-05 22:03                   ` Jamie Lokier
  2009-05-05 21:44                 ` copyfile semantics Andreas Dilger
  1 sibling, 1 reply; 151+ messages in thread
From: Jörn Engel @ 2009-05-05 16:54 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Chris Mason, Theodore Tso, linux-fsdevel, jmorris, ocfs2-devel,
	viro

On Tue, 5 May 2009 18:46:19 +0200, Jörn Engel wrote:
> 
> And depending on the fs design, the read/write loop can be
> replaced with finer-grained reference counting.

And maybe finer-grained reference counting should be a requirement for
copyfile/cowlink/reflink or whatever we call it.  With a large file on
slow media, open("foo", O_RDWR); should still return in a reasonable
amount of time.  Not after ten minutes.

Jörn

-- 
Fancy algorithms are slow when n is small, and n is usually small.
Fancy algorithms have big constants. Until you know that n is
frequently going to be big, don't get fancy.
-- Rob Pike


* Re: [PATCH 1/3] fs: Document the reflink(2) system call.
  2009-05-05 16:54                 ` Jörn Engel
@ 2009-05-05 22:03                   ` Jamie Lokier
  0 siblings, 0 replies; 151+ messages in thread
From: Jamie Lokier @ 2009-05-05 22:03 UTC (permalink / raw)
  To: Jörn Engel
  Cc: Chris Mason, Theodore Tso, linux-fsdevel, jmorris, ocfs2-devel,
	viro

Jörn Engel wrote:
> On Tue, 5 May 2009 18:46:19 +0200, Jörn Engel wrote:
> > And depending on the fs design, the read/write loop can be
> > replaced with finer-grained reference counting.
> 
> And maybe finer-grained reference counting should be a requirement for
> copyfile/cowlink/reflink or whatever we call it.  With a large file on
> slow media, open("foo", O_RDWR); should still return in a reasonable
> amount of time.  Not after ten minutes.

Or 8 hours, which is how long it took me to copy a really large file
last time...

Oh, and are open() or write() on regular files interruptible per POSIX?
Didn't think so :-)

Fortunately BTRFS does do fine-grained extent sharing and reflink,
so it should work ok on BTRFS.

-- Jamie


* Re: copyfile semantics.
  2009-05-05 16:46               ` Jörn Engel
  2009-05-05 16:54                 ` Jörn Engel
@ 2009-05-05 21:44                 ` Andreas Dilger
  2009-05-05 21:48                   ` Matthew Wilcox
                                     ` (2 more replies)
  1 sibling, 3 replies; 151+ messages in thread
From: Andreas Dilger @ 2009-05-05 21:44 UTC (permalink / raw)
  To: Jörn Engel
  Cc: Theodore Tso, Jamie Lokier, jmorris, ocfs2-devel, linux-fsdevel,
	Chris Mason, viro

On May 05, 2009  18:46 +0200, Jörn Engel wrote:
> On Tue, 5 May 2009 16:36:29 +0100, Jamie Lokier wrote:
> > What is the advantage of adding the system call for the special case
> > of reflink(), when we choose not to have, say, a copyfile() system
> > call which does what "cp -a" does because doing it in user space is
> > good enough?
> 
> Given an ignorant filesystem, copyfile() will simply do the read/write
> loop in kernelspace.  So either copyfile() is just a fancy name for
> splice()

Sure, except splice() (AFAIK) doesn't allow a splice between two regular
files, only between a pipe and a file.  Maybe it has changed since the
last time I looked.  On high performance filesystems the copy_to_user()
and copy_from_user() can be a major limiting factor on IO performance,
and it is getting more significant because the single-core performance
is not improving at all.  At 1GB/s just a single copy_{to,from}_user
(read or write) will consume 40% of a single core.
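
As a back-of-the-envelope check, the 40% figure can be extended to a full
userspace read/write loop, which pays for one copy_to_user() on the read
side and one copy_from_user() on the write side.  The helper below is
hypothetical and assumes the cost scales linearly with throughput and with
the number of copies, which is a simplification.

```python
def core_fraction(throughput_gb_s, copies, cost_per_copy_at_1gb_s=0.40):
    """Fraction of one core spent copying, under a linear-cost assumption.

    cost_per_copy_at_1gb_s is the 40% figure quoted above; everything else
    is an illustrative assumption.
    """
    return throughput_gb_s * copies * cost_per_copy_at_1gb_s
```

Under those assumptions a read/write loop at 1GB/s (two copies) would spend
about 80% of a core on copying alone.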

If it is possible to use splice() to copy between two regular files then
that is great.  Does anything (e.g. cp) actually use this yet?

> or copyfile() will also have to create a tempfile, rename the
> tempfile when the copy is done and deal with all possible errors.  And
> if the system crashes, who will remove the tempfile on reboot?  Will the
> tempfile have a well-known name, allowing for easy DoS?  Or will it be
> random, causing much fun locating it after reboot.

Maybe I'm missing something, but why do we need a tempfile at all?
I can't imagine that people expect atomic semantics for copyfile(),
any more than they expect atomic semantics for "cp" in the face of a
crash.

> When implemented in the filesystem itself, copyfile() can be quite nice.
> The filesystem can create a temporary inode without visibly exposing it
> to userspace.  It can delete temporary inodes in journal replay after a
> crash.  And depending on the fs design, the read/write loop can be
> replaced with finer-grained reference counting.

I would think that copyfile() is of primary interest when it involves
a network filesystem, so there is no need to ship data to the client
doing the copy at all.  This is possible for NFS and CIFS protocol today,
AFAIK.  The problem with splice is that the filesystem only knows about
->splice_read() and ->splice_write(), it doesn't have any opportunity
to optimize this further (e.g. by sending a "copyfile" RPC, or implementing
a reflink or whatever).
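
A minimal sketch of the dispatch this argues for: give the filesystem the
first chance to do the whole copy itself, and fall back to the plain
read/write loop only if it cannot.  The fs_copy hook is hypothetical; it
stands in for whatever the filesystem could offer (a "copyfile" RPC, a
reflink, and so on) and is not an existing kernel or library interface.

```python
def copyfile(src, dst, fs_copy=None, bufsize=64 * 1024):
    """Copy src to dst, preferring a filesystem-provided fast path.

    fs_copy is a hypothetical callable (src, dst) that raises
    NotImplementedError when the filesystem has no optimized copy.
    Returns "fs" or "loop" to show which path was taken.
    """
    if fs_copy is not None:
        try:
            fs_copy(src, dst)
            return "fs"
        except NotImplementedError:
            pass  # "ignorant" filesystem: fall through to the loop
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        while True:
            chunk = fin.read(bufsize)
            if not chunk:
                break
            fout.write(chunk)
    return "loop"
```

The point of the sketch is only the ordering: a whole-file operation can be
offered to the filesystem, whereas a loop of per-chunk calls cannot.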

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


* Re: copyfile semantics.
  2009-05-05 21:44                 ` copyfile semantics Andreas Dilger
@ 2009-05-05 21:48                   ` Matthew Wilcox
  2009-05-05 22:25                     ` Trond Myklebust
  2009-05-05 22:06                   ` Jamie Lokier
  2009-05-06  5:57                   ` Jörn Engel
  2 siblings, 1 reply; 151+ messages in thread
From: Matthew Wilcox @ 2009-05-05 21:48 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Jörn Engel, Jamie Lokier, Chris Mason, Theodore Tso,
	linux-fsdevel, jmorris, ocfs2-devel, viro

On Tue, May 05, 2009 at 03:44:54PM -0600, Andreas Dilger wrote:
> > When implemented in the filesystem itself, copyfile() can be quite nice.
> > The filesystem can create a temporary inode without visibly exposing it
> > to userspace.  It can delete temporary inodes in journal replay after a
> > crash.  And depending on the fs design, the read/write loop can be
> > replaced with finer-grained reference counting.
> 
> I would think that copyfile() is of primary interest when it involves
> a network filesystem, so there is no need to ship data to the client
> doing the copy at all.  This is possible for NFS and CIFS protocol today,
> AFAIK.  The problem with splice is that the filesystem only knows about
> ->splice_read() and ->splice_write(), it doesn't have any opportunity
> to optimize this further (e.g. by sending a "copyfile" RPC, or implementing
> a reflink or whatever).

Do you mean NFSv4?  I don't know of a way to do it with traditional NFS.

-- 
Matthew Wilcox				Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."


* Re: copyfile semantics.
  2009-05-05 21:48                   ` Matthew Wilcox
@ 2009-05-05 22:25                     ` Trond Myklebust
  0 siblings, 0 replies; 151+ messages in thread
From: Trond Myklebust @ 2009-05-05 22:25 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Andreas Dilger, Jörn Engel, Jamie Lokier, Chris Mason,
	Theodore Tso, linux-fsdevel, jmorris, ocfs2-devel, viro

On Tue, 2009-05-05 at 15:48 -0600, Matthew Wilcox wrote:
> On Tue, May 05, 2009 at 03:44:54PM -0600, Andreas Dilger wrote:
> > > When implemented in the filesystem itself, copyfile() can be quite nice.
> > > The filesystem can create a temporary inode without visibly exposing it
> > > to userspace.  It can delete temporary inodes in journal replay after a
> > > crash.  And depending on the fs design, the read/write loop can be
> > > replaced with finer-grained reference counting.
> > 
> > I would think that copyfile() is of primary interest when it involves
> > a network filesystem, so there is no need to ship data to the client
> > doing the copy at all.  This is possible for NFS and CIFS protocol today,
> > AFAIK.  The problem with splice is that the filesystem only knows about
> > ->splice_read() and ->splice_write(), it doesn't have any opportunity
> > to optimize this further (e.g. by sending a "copyfile" RPC, or implementing
> > a reflink or whatever).
> 
> Do you mean NFSv4?  I don't know of a way to do it with traditional NFS.

It is expected to be a feature of NFSv4.2. There is a proposal currently
winding its way through the IETF that can handle both copyfile() and
reflink() semantics.

I can help to relay the input from this discussion to the people that
are drafting the IETF proposal to ensure that the Linux community
concerns get heard.

Cheers
  Trond



* Re: copyfile semantics.
  2009-05-05 21:44                 ` copyfile semantics Andreas Dilger
  2009-05-05 21:48                   ` Matthew Wilcox
@ 2009-05-05 22:06                   ` Jamie Lokier
  2009-05-06  5:57                   ` Jörn Engel
  2 siblings, 0 replies; 151+ messages in thread
From: Jamie Lokier @ 2009-05-05 22:06 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Jörn Engel, Chris Mason, Theodore Tso, linux-fsdevel,
	jmorris, ocfs2-devel, viro

Andreas Dilger wrote:
> If it is possible to use splice() to copy between two regular files then
> that is great.  Does anything (e.g. cp) actually use this yet?

It's mentioned earlier in this thread that BTRFS has an ioctl() for
copying parts of files into other files, and will share the data
between both files.

With a bit of plumbing, splice() could probably be persuaded to call
the mechanism which BTRFS provides.

-- Jamie


* Re: copyfile semantics.
  2009-05-05 21:44                 ` copyfile semantics Andreas Dilger
  2009-05-05 21:48                   ` Matthew Wilcox
  2009-05-05 22:06                   ` Jamie Lokier
@ 2009-05-06  5:57                   ` Jörn Engel
  2 siblings, 0 replies; 151+ messages in thread
From: Jörn Engel @ 2009-05-06  5:57 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Jamie Lokier, Chris Mason, Theodore Tso, linux-fsdevel, jmorris,
	ocfs2-devel, viro

On Tue, 5 May 2009 15:44:54 -0600, Andreas Dilger wrote:
> 
> > or copyfile() will also have to create a tempfile, rename the
> > tempfile when the copy is done and deal with all possible errors.  And
> > if the system crashes, who will remove the tempfile on reboot?  Will the
> > tempfile have a well-known name, allowing for easy DoS?  Or will it be
> > random, causing much fun locating it after reboot.
> 
> Maybe I'm missing something, but why do we need a tempfile at all?
> I can't imagine that people expect atomic semantics for copyfile(),
> any more than they expect atomic semantics for "cp" in the face of a
> crash.

In the case of cowlink() a tempfile is required when breaking the link.
Otherwise open() can result in the file disappearing or being truncated.
Rather unexpected.

If copyfile() doesn't try to be smart and does the actual copy when
being called, I could certainly live with half-written files.
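
For reference, the userspace version of the tempfile-and-rename pattern
looks roughly like this.  It is a sketch using only standard library calls;
the function name and the ".copy-" prefix are made up for the example.  A
crash between creating the tempfile and the rename still leaves the
randomly named tempfile behind, which is exactly the cleanup problem raised
earlier.

```python
import os
import shutil
import tempfile


def copy_via_tempfile(src, dst):
    """Copy src to dst so that dst is never seen half-written."""
    dst_dir = os.path.dirname(os.path.abspath(dst))
    # Random name in the destination directory, so the final rename stays
    # on one filesystem and the name cannot be guessed in advance.
    fd, tmp = tempfile.mkstemp(dir=dst_dir, prefix=".copy-")
    try:
        with os.fdopen(fd, "wb") as fout, open(src, "rb") as fin:
            shutil.copyfileobj(fin, fout)
            fout.flush()
            os.fsync(fout.fileno())   # data on disk before it becomes visible
        os.replace(tmp, dst)          # atomic rename over the destination
    except BaseException:
        # Error path only; nothing runs this after a system crash.
        try:
            os.unlink(tmp)
        except FileNotFoundError:
            pass
        raise
```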

Jörn

-- 
"Security vulnerabilities are here to stay."
-- Scott Culp, Manager of the Microsoft Security Response Center, 2001


* Re: [PATCH 1/3] fs: Document the reflink(2) system call.
  2009-05-05 13:19         ` Jamie Lokier
  2009-05-05 13:39           ` Chris Mason
@ 2009-05-05 14:21           ` Theodore Tso
  2009-05-05 15:32             ` Jamie Lokier
  2009-05-05 22:49             ` James Morris
  2009-05-05 17:05           ` Joel Becker
  2 siblings, 2 replies; 151+ messages in thread
From: Theodore Tso @ 2009-05-05 14:21 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: linux-fsdevel, jmorris, ocfs2-devel, viro

On Tue, May 05, 2009 at 02:19:07PM +0100, Jamie Lokier wrote:
> There was an attempt at something like that for ext3 a year or two ago.
> Search for "cowlink" if you're interested.

Yeah, I remember that discussion.  The hard part was always the VM
infrastructure, not the fs metadata.

> Instead of a circular list, a proposed implementation was to create a
> separate "host" inode on the first reflink, converting the source
> inode to a reflink inode and moving the data block references to the
> new host inode.  Each reflink was simply a reference to the host
> inode, much like your design, and the host inode was only to hold the
> data blocks, with its i_nlink counting the number of reflinks
> pointing to it.
> 
> Using a circular list means the space must be reserved in every inode,
> even those which are not (yet) reflinks.  It also does a bit more
> writing sometimes, because of having to update next and previous
> entries on the list.

It's a tradeoff.  If you use a separate "host" inode on the first
reflink, then you burn 3 inodes instead of two for a file which is
"copied"/"reflinked" once.  The only reason why we need to reserve an
extra field in the inode structure is for the pointer from the "host"
inode to the circular linked list.  (The space for the circular linked
list gets stored in i_data in the reflink inodes.)  If we are using
256 byte inodes we have the space to spare --- and if we really cared
about not utilizing the space in the inode structure if it wasn't
necessary, it could always be stored as an extended attribute
(although that has a greater overhead).
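
The circular-list layout can be modelled roughly as below.  This is a toy
in-memory model, not ext4 code: the class and function names are invented,
and the fields only mirror the roles described in this thread (a host
pointer and next/prev links kept in the reflink inode's i_data, plus one
spare field on the host pointing into the ring).

```python
class Inode:
    def __init__(self, ino, blocks=None):
        self.ino = ino
        self.blocks = blocks       # data block references (host only)
        self.host = None           # models i_data[0..3] on a reflink inode
        self.next = self           # models i_data[4..7]
        self.prev = self           # models i_data[8..11]
        self.first_reflink = None  # spare field on the host inode


def add_reflink(host, ino):
    """Create a reflink inode and splice it into the host's ring."""
    new = Inode(ino)
    new.host = host
    head = host.first_reflink
    if head is None:
        host.first_reflink = new   # ring of one; next/prev already self
    else:
        new.prev, new.next = head.prev, head
        head.prev.next = new
        head.prev = new
    return new


def delete_host(host):
    """Promote one reflink inode to host and repoint the others."""
    new_host = host.first_reflink
    if new_host is None:
        return None
    new_host.blocks = host.blocks  # take over the host's block references
    rest = None if new_host.next is new_host else new_host.next
    new_host.prev.next = new_host.next   # unlink new_host from the ring
    new_host.next.prev = new_host.prev
    new_host.next = new_host.prev = new_host
    new_host.host = None
    new_host.first_reflink = rest
    node = rest
    while node is not None:        # walk the remaining ring once
        node.host = new_host
        node = node.next
        if node is rest:
            break
    return new_host
```

Deleting the host is the expensive case: it has to touch every remaining
reflink inode, while adding a reflink only rewrites its two ring neighbours.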

The question of which of these design tradeoffs is preferable is
really one of how many inodes will get copied via reflinks, and how
many times a particular inode will be copied by a reflink.  If it
is common (for example, in a virtualization or container workload) for
a single file to be copied via reflink 50 or 100 times, then the extra
inode created when you create the first reflink is no big deal.  If
most of the time a file is only going to be reflink'ed once or twice,
then the overhead is much bigger.
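
The inode counts behind that tradeoff, written out as a small sketch.  The
function names are made up, and the model counts only inodes; it ignores
the extra per-inode field that the circular-list scheme needs.

```python
def inodes_circular_list(copies):
    """Original file acts as the host; one inode per reflink copy."""
    return 1 + copies


def inodes_separate_host(copies):
    """As above, plus one dedicated host inode once a reflink exists."""
    return 1 + copies + (1 if copies > 0 else 0)


def extra_inode_overhead(copies):
    """Relative inode overhead of the separate-host scheme."""
    return inodes_separate_host(copies) / inodes_circular_list(copies) - 1
```

One copy gives 3 inodes against 2, a 50% overhead; at 100 copies it is 102
against 101, under 1%.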

This is really a design detail, though.

The bigger questions, which we really need to answer are:

1) If someone other than the owner of a file uses reflink to "make a
copy" of the file, is it a new inode, with the new inode number, owned
by the original owner (making it look more like a link), or owned by
the person creating the reflink (making it look more like a copy).

2) Does changing the metadata --- atime, user/group ownership, ctime,
etc., break the COW link and cause a copy?

(2) could be a per-filesystem implementation detail, but (1) goes to
the semantics of how the reflink() system call will work, so I
think we need to have a common answer which is the same across all
filesystems.  

Maybe some filesystems could simply refuse to support a user who isn't
the owner creating a reflink.  But saying that some filesystems might
require CAP_FOWNER (because the inode will be created with the uid of
the original owner) still leaves a problem where you have a setuid
binary, or where the system supports fine-grained capabilities, so
that a user with a non-zero UID has CAP_FOWNER.  It would be
unfortunate if a file owned by uid 23, when copied via reflink by uid
45 with CAP_FOWNER privs, on some filesystems creates a reflinked
inode whose st_uid is 23 when stat'ed, and on other filesystems
creates a reflinked inode whose st_uid is 45.

							- Ted


* Re: [PATCH 1/3] fs: Document the reflink(2) system call.
  2009-05-05 14:21           ` [PATCH 1/3] fs: Document the reflink(2) system call Theodore Tso
@ 2009-05-05 15:32             ` Jamie Lokier
  2009-05-05 22:49             ` James Morris
  1 sibling, 0 replies; 151+ messages in thread
From: Jamie Lokier @ 2009-05-05 15:32 UTC (permalink / raw)
  To: Theodore Tso; +Cc: linux-fsdevel, jmorris, ocfs2-devel, viro

Theodore Tso wrote:
> 1) If someone other than the owner of a file uses reflink to "make a
> copy" of the file, is it a new inode, with the new inode number, owned
> by the original owner (making it look more like a link), or owned by
> the person creating the reflink (making it look more like a copy).
> 
> 2) Does changing the metadata --- atime, user/group ownership, ctime,
> etc., break the COW link and cause a copy?
> 
> (2) could be a per-filesystem implementation detail, but (1) goes to
> the semantics of how the reflink() system call will work, so I
> think we need to have a common answer which is the same across all
> filesystems.  

I agree on both.

user/group ownership seems to raise the most questions.

Can we settle the hopefully simpler question of ctime, atime, mode changes?

-- Jamie


* Re: [PATCH 1/3] fs: Document the reflink(2) system call.
  2009-05-05 14:21           ` [PATCH 1/3] fs: Document the reflink(2) system call Theodore Tso
  2009-05-05 15:32             ` Jamie Lokier
@ 2009-05-05 22:49             ` James Morris
  1 sibling, 0 replies; 151+ messages in thread
From: James Morris @ 2009-05-05 22:49 UTC (permalink / raw)
  To: Theodore Tso
  Cc: Jamie Lokier, linux-fsdevel, ocfs2-devel, viro,
	linux-security-module

On Tue, 5 May 2009, Theodore Tso wrote:

> The bigger questions, which we really need to answer are:
> 
> 1) If someone other than the owner of a file uses reflink to "make a
> copy" of the file, is it a new inode, with the new inode number, owned
> by the original owner (making it look more like a link), or owned by
> the person creating the reflink (making it look more like a copy).

Changing the owner fundamentally changes the character of the call 
(certainly, the SELinux security logic would be quite different), and I 
think application writers would often be asking "what type of reflink call 
am I supposed to be using here?", and possibly getting it wrong much of 
the time.  It might be better to create a separate syscall for the copy 
case, with its own distinct semantics, if it is desired.

- James
-- 
James Morris
<jmorris@namei.org>


* Re: [PATCH 1/3] fs: Document the reflink(2) system call.
  2009-05-05 13:19         ` Jamie Lokier
  2009-05-05 13:39           ` Chris Mason
  2009-05-05 14:21           ` [PATCH 1/3] fs: Document the reflink(2) system call Theodore Tso
@ 2009-05-05 17:05           ` Joel Becker
  2 siblings, 0 replies; 151+ messages in thread
From: Joel Becker @ 2009-05-05 17:05 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: linux-fsdevel, Theodore Tso, jmorris, ocfs2-devel, viro

On Tue, May 05, 2009 at 02:19:07PM +0100, Jamie Lokier wrote:
> There was an attempt at something like that for ext3 a year or two ago.
> Search for "cowlink" if you're interested.

	Yeah, I discussed those with Jörn Engel after my talk at LSF - I
hadn't heard of them before.  cowlinks actually changed the semantics of
link(2).  This does not do that.

> Instead of a circular list, a proposed implementation was to create a
> separate "host" inode on the first reflink, converting the source
> inode to a reflink inode and moving the data block references to the
> new host inode.  Each reflink was simply a reference to the host
> inode, much like your design, and the host inode was only to hold the
> data blocks, with its i_nlink counting the number of reflinks
> pointing to it.

	Reflinks are not cowlinks.  reflinks are new files (new inodes
in most implementations I expect) that only share the *data extents* in
a CoW fashion.
	Maybe reading the wiki details of the ocfs2 implementation and
so on would be helpful?

[Overview]
http://wiki.us.oracle.com/calpg/OCFS2Reflink
[ocfs2 Implementation]
http://oss.oracle.com/osswiki/OCFS2/DesignDocs/RefcountTrees
[reflink() Itself]
http://oss.oracle.com/osswiki/OCFS2/DesignDocs/ReflinkOperation
[Use Cases]
http://oss.oracle.com/osswiki/OCFS2/DesignDocs/ReflinkUses

Joel

-- 

"Every day I get up and look through the Forbes list of the richest
 people in America. If I'm not there, I go to work."
        - Robert Orben

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127


* Re: [PATCH 1/3] fs: Document the reflink(2) system call.
  2009-05-05 13:01       ` Theodore Tso
  2009-05-05 13:19         ` Jamie Lokier
@ 2009-05-05 17:00         ` Joel Becker
  2009-05-05 17:29           ` Theodore Tso
  2009-05-05 22:30           ` Jamie Lokier
  1 sibling, 2 replies; 151+ messages in thread
From: Joel Becker @ 2009-05-05 17:00 UTC (permalink / raw)
  To: Theodore Tso; +Cc: Jamie Lokier, linux-fsdevel, jmorris, ocfs2-devel, viro

On Tue, May 05, 2009 at 09:01:14AM -0400, Theodore Tso wrote:
> I guess it depends on your implementation.  At least the way I would
> implement this in ext4, for example, I'd simply set a new flag
> indicating this was a "reflink", and then the i_data[0..3] field would
> contain the inode number of the "host" inode, and i_data [4..7] and
> i_data[8..11] would contain a circular linked list of all reflinks
> associated with that inode.  I'd then grab a spare inode field so the
> "host" inode could point to the reflink'ed inodes.
> 
> If you ever need to delete the host inode, you simply pick one of the
> reflink inodes and copy i_data from the host inode one of the reflink
> inodes and promote it to be the "host" inode, and then update all of
> the other reflink inodes to point at the new host inode.
> 
> The advantage of this scheme is not only does the reflink'ed inode
> have a new inode number (as in your design), it actually has an
> entirely new inode.  So we can change the ownership, the mtime, ctime;
> it behaves *entirely* as a separate, free-standing inode except it is
> sharing the data blocks.
> 
> This allows me to easily set a new owner, and indeed any other inode
> metadata, on the reflink'ed inode, which I would argue is a Good
> Thing.
> 
> I'm guessing that with the way OCFS2 has implemented (or is planning
> on implementing) reflinks, you can't modify the metadata?  Or is there
> some really important reason why it's not a good idea for OCFS2?

	I think I'm confusing you.  ocfs2 creates a new inode, with a
new tree of extent blocks, pointing to the same data extents as the
source.  You can do *anything* POSIX to that new inode.  You can chown
it, chmod it, truncate it, futimes it, whatever.  The only thing at
issue is what the state of the inode is at the return of the reflink()
call.
	I'm not defining reflink() as "creates a new inode" because I
can see something like btrfs using the same storage inode with a new
inode number until it needs to CoW.  But from the user-visible
perspective, that's exactly what happens.

Joel

-- 

Life's Little Instruction Book #347

	"Never waste the opportunity to tell someone you love them."

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127


* Re: [PATCH 1/3] fs: Document the reflink(2) system call.
  2009-05-05 17:00         ` Joel Becker
@ 2009-05-05 17:29           ` Theodore Tso
  2009-05-05 22:36             ` Jamie Lokier
  2009-05-05 22:30           ` Jamie Lokier
  1 sibling, 1 reply; 151+ messages in thread
From: Theodore Tso @ 2009-05-05 17:29 UTC (permalink / raw)
  To: Jamie Lokier, linux-fsdevel, jmorris, ocfs2-devel, viro

On Tue, May 05, 2009 at 10:00:58AM -0700, Joel Becker wrote:
> On Tue, May 05, 2009 at 09:01:14AM -0400, Theodore Tso wrote:
> > I'm guessing that with the way OCFS2 has implemented (or is planning
> > on implementing) reflinks, you can't modify the metadata?  Or is there
> > some really important reason why it's not a good idea for OCFS2?
> 
> 	I think I'm confusing you.  ocfs2 creates a new inode, with a
> new tree of extent blocks, pointing to the same data extents as the
> source.  You can do *anything* POSIX to that new inode.  You can chown
> it, chmod it, truncate it, futimes it, whatever.  The only thing at
> issue is what the state of the inode is at the return of the reflink()
> call.

OK, cool.  But in that case, if in every user-visible sense of the
word, it's equivalent to a file copy --- which is to say, it gets a
new inode number, and, then why not make it work *exactly* like a file
copy, which is to say make the ownership be the user who asked for the
reflink to be created?  That way /bin/cp could potentially use
reflinks, and aside from the fact that a cp -r of an existing
directory hierarchy takes no extra disk space and runs *much* faster,
a reflink acts exactly like a file copy.  The semantics are easy to
describe, we don't need CAP_FOWNER nonsense, it becomes much easier to
deal with the semantics vis-a-vis quota, etc.

> 	I'm not defining reflink() as "creates a new inode" because I
> can see something like btrfs using the same storage inode with a new
> inode number until it needs to CoW.  But from the user-visible
> perspective, that's exactly what happens.

Well, we can talk about inodes even for filesystems like FAT that
don't really have inodes; the user-visible perspective is the only
thing that we really care when we try to define the semantics of the
system call in a way that causes the least amount of surprise; given
that the new file gets a new inode number, it is *not* a hard link,
and it looks much more like a file copy.

       	     	       	      	   - Ted


* Re: [PATCH 1/3] fs: Document the reflink(2) system call.
  2009-05-05 17:29           ` Theodore Tso
@ 2009-05-05 22:36             ` Jamie Lokier
  0 siblings, 0 replies; 151+ messages in thread
From: Jamie Lokier @ 2009-05-05 22:36 UTC (permalink / raw)
  To: Theodore Tso; +Cc: linux-fsdevel, jmorris, ocfs2-devel, viro

Theodore Tso wrote:
> But in that case, if in every user-visible sense of the
> word, it's equivalent to a file copy --- which is to say, it gets a
> new inode number, and, then why not make it work *exactly* like a file
> copy, which is to say make the ownership be the user who asked for the
> reflink to be created?  That way /bin/cp could potentially use
> reflinks, and aside from the fact that a cp -r of an existing
> directory hierarchy takes no extra disk space and runs *much* faster,
> a reflink acts exactly like a file copy.  The semantics are easy to
> describe, we don't need CAP_FOWNER nonsense, it becomes much easier to
> deal with the semantics vis-a-vis quota, etc.

reflink() seems to be designed to copy a file _and_ clone the file's
attributes exactly, and to do it all atomically.

So how about relaxing a bit and, since reflinkat() takes flags, giving
it a flag to make cloning the attributes optional.

I imagine there's little implementation difference between cloning the
attributes and giving it new file attributes, and both behaviours are
useful for different things.
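
Something like the following, as a sketch of the two behaviours.
REFLINK_NEW_ATTRS is a made-up flag name, not part of the posted patches,
and the model covers only uid, mode and xattrs; the uids 23 and 45 in the
usage note are the ones from Ted's example.

```python
from dataclasses import dataclass, field

REFLINK_NEW_ATTRS = 0x1  # hypothetical flag: do not clone attributes


@dataclass
class FileAttrs:
    uid: int
    mode: int
    xattrs: dict = field(default_factory=dict)


def reflink_attrs(src, caller_uid, flags=0, umask=0o022):
    """Attributes of the new file right after the reflink call returns."""
    if flags & REFLINK_NEW_ATTRS:
        # Copy-like: behaves as if the caller had created a fresh file.
        return FileAttrs(uid=caller_uid, mode=0o666 & ~umask)
    # Snapshot-like: owner, mode and extended attributes are cloned.
    return FileAttrs(uid=src.uid, mode=src.mode, xattrs=dict(src.xattrs))
```

Without the flag, uid 45 reflinking a file owned by uid 23 gets a file
owned by 23; with it, the new file is owned by 45 and starts with default
attributes.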

-- Jamie


* Re: [PATCH 1/3] fs: Document the reflink(2) system call.
  2009-05-05 17:00         ` Joel Becker
  2009-05-05 17:29           ` Theodore Tso
@ 2009-05-05 22:30           ` Jamie Lokier
  2009-05-05 22:37             ` Joel Becker
  2009-05-05 23:08             ` jim owens
  1 sibling, 2 replies; 151+ messages in thread
From: Jamie Lokier @ 2009-05-05 22:30 UTC (permalink / raw)
  To: Theodore Tso, linux-fsdevel, jmorris, ocfs2-devel, viro

Joel Becker wrote:
> 	I think I'm confusing you.  ocfs2 creates a new inode, with a
> new tree of extent blocks, pointing to the same data extents as the
> source.  You can do *anything* POSIX to that new inode.  You can chown
> it, chmod it, truncate it, futimes it, whatever.  The only thing at
> issue is what the state of the inode is at the return of the reflink()
> call.

Ok, but does chown/chmod/futimes trigger a COW copy, unsharing the data?
This is still not clear. :-)

Behaviourally, whether a massive copy is triggered by chmod is quite a
significant thing.  It dictates whether programs and scripts should be
careful to avoid chmod on reflinked files because it may be very
expensive (think chmod triggering a 200GB copy), or can do so cheaply.

> 	I'm not defining reflink() as "creates a new inode" because I
> can see something like btrfs using the same storage inode with a new
> inode number until it needs to CoW.  But from the user-visible
> perspective, that's exactly what happens.

I'm still not clear from the above explanation whether full data
unsharing (i.e. it's all copied, takes a long time, can trigger
ENOSPC) happens on chown/chmod etc.

But assuming it stays shared until you modify the actual data, could
the documentation make this important fact a bit more prominent:

    reflink() creates a new file which initially shares the same
    underlying data storage as the source file, and has all the same
    attributes including security context and extended attributes.

    After creating the new file, you can do *anything* POSIX to that
    new file.  You can chown it, chmod it, futimes it, truncate it,
    write to it, whatever.  When the data is modified, that will
    trigger a copy-on-write operation so that the underlying data is
    not completely shared any more.

    The amount and timing of copying is filesystem-dependent, but only
    happens when a data write or extended attribute change takes place.

    Opening a file, reading it, read-only or private mappings, and
    simple attribute updates (chown, chmod, futimes, as well as
    automatic atime updates) will not trigger copy-on-write and will
    not return ENOSPC errors.
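
The behaviour proposed in that text can be sketched with a toy model.  This
is an illustration under simplifying assumptions, not any filesystem's
implementation: the names are invented, sharing is tracked per extent with
a plain reference count, and metadata is assumed to be updated in place, so
filesystems that also CoW their metadata are not modelled.

```python
class Extent:
    def __init__(self, data):
        self.data = bytearray(data)
        self.refs = 1              # number of files pointing at this extent


class File:
    def __init__(self, extents, mode=0o644):
        self.extents = extents
        self.mode = mode


def reflink(src):
    """New file sharing every data extent of src; no data is copied."""
    for ext in src.extents:
        ext.refs += 1
    return File(list(src.extents), src.mode)


def chmod(f, mode):
    """Attribute update: touches only the inode, never the extents."""
    f.mode = mode


def write(f, index, data):
    """Write into one extent; returns how many bytes CoW had to copy."""
    ext = f.extents[index]
    copied = 0
    if ext.refs > 1:               # still shared: unshare this extent only
        ext.refs -= 1
        ext = Extent(ext.data)
        f.extents[index] = ext
        copied = len(ext.data)
    ext.data[:len(data)] = data
    return copied
```

In this model chmod() leaves all sharing intact, and a write copies only
the extent it touches, once.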

Thanks,
-- Jamie


* Re: [PATCH 1/3] fs: Document the reflink(2) system call.
  2009-05-05 22:30           ` Jamie Lokier
@ 2009-05-05 22:37             ` Joel Becker
  2009-05-05 23:08             ` jim owens
  1 sibling, 0 replies; 151+ messages in thread
From: Joel Becker @ 2009-05-05 22:37 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: linux-fsdevel, Theodore Tso, jmorris, ocfs2-devel, viro

On Tue, May 05, 2009 at 11:30:16PM +0100, Jamie Lokier wrote:
> Joel Becker wrote:
> > 	I think I'm confusing you.  ocfs2 creates a new inode, with a
> > new tree of extent blocks, pointing to the same data extents as the
> > source.  You can do *anything* POSIX to that new inode.  You can chown
> > it, chmod it, truncate it, futimes it, whatever.  The only thing at
> > issue is what the state of the inode is at the return of the reflink()
> > call.
> 
> Ok, but does chown/chmod/futimes trigger a COW copy, unsharing the data?
> This is still not clear. :-)

	No, of course it doesn't.  That would be awful!

> But assuming it stays shared until you modify the actual data, could
> the documentation make this important fact a bit more prominent:
> 
>     reflink() creates a new file which initially shares the same
>     underlying data storage as the source file, and has all the same
>     attributes including security context and extended attributes.
> 
>     After creating the new file, you can do *anything* POSIX to that
>     new file.  You can chown it, chmod it, futimes it, truncate it,
>     write to it, whatever.  When the data is modified, that will
>     trigger a copy-on-write operation so that the underlying data is
>     not completely shared any more.
> 
>     The amount and timing of copying is filesystem-dependent, but only
>     happens when a data write or extended attribute change takes place.
> 
>     Opening a file, reading it, read-only or private mappings, and
>     simple attribute updates (chown, chmod, futimes, as well as
>     automatic atime updates) will not trigger copy-on-write and will
>     not return ENOSPC errors.

	You got it.

Joel

-- 

"In the room the women come and go
 Talking of Michelangelo."

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127


* Re: [PATCH 1/3] fs: Document the reflink(2) system call.
  2009-05-05 22:30           ` Jamie Lokier
  2009-05-05 22:37             ` Joel Becker
@ 2009-05-05 23:08             ` jim owens
  1 sibling, 0 replies; 151+ messages in thread
From: jim owens @ 2009-05-05 23:08 UTC (permalink / raw)
  To: Jamie Lokier, joel.becker
  Cc: Theodore Tso, linux-fsdevel, jmorris, ocfs2-devel, viro

Jamie Lokier wrote:
> But assuming it stays shared until you modify the actual data, could
> the documentation make this important fact a bit more prominent:

>     Opening a file, reading it, read-only or private mappings, and
>     simple attribute updates (chown, chmod, futimes, as well as
>     automatic atime updates) will not trigger copy-on-write and will
>     not return ENOSPC errors.

almost... more like:

     automatic atime updates) will not trigger file data copy-on-write
     and will not return ENOSPC errors unless the filesystem would have
     returned ENOSPC if the file had no reflink.

filesystems such as btrfs that COW metadata changes can
generate ENOSPC on any attribute update!

jim


* Re: [PATCH 1/3] fs: Document the reflink(2) system call.
  2009-05-05  7:16     ` Joel Becker
  2009-05-05  8:09       ` Andreas Dilger
  2009-05-05 13:01       ` Theodore Tso
@ 2009-05-05 13:01       ` Jamie Lokier
  2009-05-05 17:09         ` Joel Becker
  2 siblings, 1 reply; 151+ messages in thread
From: Jamie Lokier @ 2009-05-05 13:01 UTC (permalink / raw)
  To: linux-fsdevel, jmorris, ocfs2-devel, viro

Joel Becker wrote:
> On Tue, May 05, 2009 at 02:07:03AM +0100, Jamie Lokier wrote:
> > Joel Becker wrote:
> > > +All file attributes and extended attributes of the new file must
> > > +identical to the source file with the following exceptions:
> > 
> > reflink() sounds useful already, but is there a compelling reason why
> > both files must have the same attributes, and changing attributes will
> > break the COW?
> 
> 	Yeah, because without it you can't use it for snapshotting.
> That's where the original design came from - inode snapshots.  The big
> thing that excited me was that defining reflink() as I did, instead of
> a more specific snapshot call, allows all sorts of generic uses (some of
> which you outline below).
> 	If reflink() creates a snapshot, you can then break it to make
> things a little different.  But if it changes things, you can never
> change them back.
> 
> > Being able to have different attributes would allow:
> > 
> >    - reflink() to be used for fast space-efficient copying, i.e. an
> >      optimisation to "cp", "git checkout" and things like that.
> 
> 	It can right now, just not of other people's files.  Actually,
> the only real difficult with doing it to other people's files is quota.
> But I can't come up with a way to prevent quota DoS.
> 	Here's another fun trick.  Overwriting rsync, instead of copying
> blocks from the already-existing source could reflink the source to the
> .temporary, then only write the changed blocks.  And since you own both
> files, it just works.  If you're overwriting someone else's file?  The
> old copy behavior is fine.

The moment rsync overwrites a single block, the whole reflink file
will be copied by the filesystem, and then rsync will overwrite other
blocks in the copy.

So I would think it's more efficient for rsync to do what it's always
done instead, and just copy those parts of the file which are not changed.

(It needs to read the whole file anyway for checksumming, unless you
have a filesystem trick planned to avoid that :-)  If you made
splice() share file extents when cloning data from one file to
another, that would really accelerate rsync and do a better job of
reducing storage...)

> >    - reflink() to be used for merging files with identical contents
> >      (something I find surprisingly often on my disks).
> > 
> >    - reflink() to be used for merging files from different
> >      cgroup-style VMs in particular.
> 
> 	While it would be great to have a way to do this, reflink() is
> not the way.  It's really simple to understand with its link-like
> semantic, and I see no point in making it a seven-different-operation
> kitchen sink call.

That's hand-waving away.  I'm thinking of it doing _one_ simple thing:
copy the file with a COW implementation, which happens to be versatile
in its consequences.  It's not a kitchen sink call.

I.e. what the ext3 cowlink() call partially implemented a year or two
ago did.

In some ways reflink() is more complicated to understand than
cowlink(), because of reflink making chown and chmod have potentially
heavy side effects.

> > Requiring all attributes except nlink and ino to be identical makes
> > reflink() unsuitable for transparently doing those things, except in
> > cases where they happen to have the same attributes anyway.
> 
> 	We've had a lot of fun thinking up many uses for reflink(), and
> almost all of them are within the context of one's own files.

Sure.

> > I'm thinking particularly of file permissions, owner/group and atime.
> 
> 	People do cp -p all the time.  I don't see how keeping those
> things the same will break anything.  It's a new call, not an existing
> semantic.

Some people do "chmod -R a-w" all the time after copying a tree for
snapshotting, so they don't accidentally modify files later when
viewing them in a text editor :-)  (I'm thinking of the old days, when
we edited kernel trees using "cp -rl" to make snapshots)

Thinking about it, with reflink snapshots, it would be annoying to be
unable to write-protect the snapshots.

> > Since each reflink has its own nlink and ino, I'm wondering why the
> > other attributes cannot also be separate.  (I realise extended
> > attributes complicate the picture and it's desirable to share them,
> > especially if they are large).
> 
> 	The biggest reason is snapshotting.  The second biggest reason
> is a simple to understand call.  "Everything is identical except those
> things that *have* to be different".

I'm not clear about something.  Will "chmod XXX reflinked-file" change
the permissions of both files (like hard-linked files), or will it
trigger a data copy (like lazy cp -a)?

I think "chmod XXX reflinked-file" is simpler to understand if it
doesn't trigger a copy as side effect.  (Especially as the copy may
take a long time and/or ENOSPC - things you don't expect from
"chmod").

What if you want to change the permissions of both reflinks - do you
have to recreate them?

> > But is there an efficient way for reflink-aware applications to detect
> > these files have the same contents, other than reading the contents
> > twice and comparing?  Occasionally that would be good.  E.g. It would
> > be nice if "diff -r" could be patched to do that.
> 
> 	I would think FIEMAP would tell you what you want to know,
> wouldn't it?

I'm not sure.  FIEMAP can be quite a heavy operation too, and it's
only available to root I think.

From a user's "managing space on my disk" perspective, the important
things are being able to see where their data is shared and
_especially_ being able to see when touching a file would trigger a
massive increase in storage + copying time.

I.e. I can see an additional flag to "ls" being useful if reflink is
used for more than just very well organised backup folders.
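
The FIEMAP route mentioned above can be sketched in userspace as follows.
This is only an illustration, assuming a Linux system whose headers define
FS_IOC_FIEMAP and the FIEMAP_EXTENT_SHARED extent flag, and a filesystem
that actually reports that flag.  The helper name count_shared_extents()
and the MAX_EXTENTS limit are made up for the example.  As far as I know,
FS_IOC_FIEMAP itself is not restricted to root (that restriction applies
to the older FIBMAP ioctl), but treat that as something to verify on your
kernel.

```c
#include <fcntl.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>      /* FS_IOC_FIEMAP */
#include <linux/fiemap.h>  /* struct fiemap, FIEMAP_EXTENT_SHARED */

#define MAX_EXTENTS 64     /* arbitrary cap chosen for this sketch */

/*
 * Hypothetical helper: count how many of the first MAX_EXTENTS extents
 * of a file the filesystem reports as shared.  Returns -1 if the file
 * cannot be opened or the filesystem does not support FIEMAP.
 */
int count_shared_extents(const char *path)
{
	struct fiemap *fm;
	size_t size;
	int shared = -1;
	int fd;

	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;

	size = sizeof(*fm) + MAX_EXTENTS * sizeof(struct fiemap_extent);
	fm = calloc(1, size);
	if (!fm) {
		close(fd);
		return -1;
	}

	fm->fm_start = 0;
	fm->fm_length = FIEMAP_MAX_OFFSET;   /* map the whole file */
	fm->fm_flags = FIEMAP_FLAG_SYNC;     /* flush dirty data first */
	fm->fm_extent_count = MAX_EXTENTS;

	if (ioctl(fd, FS_IOC_FIEMAP, fm) == 0) {
		shared = 0;
		for (unsigned int i = 0; i < fm->fm_mapped_extents; i++)
			if (fm->fm_extents[i].fe_flags & FIEMAP_EXTENT_SHARED)
				shared++;
	}

	free(fm);
	close(fd);
	return shared;
}
```

A nonzero result says that some extents are shared with something, not
which other file shares them, so comparing the physical offsets
(fe_physical) of two files would still be needed to tell whether they
share data with each other.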

> > > +- The ctime of the source file only changes if the source's metadata
> > > +  must be changed to accommodate the copy-on-write linkage.  The ctime of
> > > +  the new file is set to represent its creation.
> > 
> > What change to the source metadata would require ctime to change?
> 
> 	ocfs2 flags all extents in the source file with a "this is now
> shared, go check the reference count before writing" flag if they don't
> have it already.  I'd call that a metadata update.

If the flag is invisible to users, it isn't.  If the flag is visible,
isn't that the answer to the previous question? :-)

> > > +- The link count of the source file is unchanged, and the link count of
> > > +  the new file is one.
> > 
> > Can you hard link to the source file and the reflink afterwards,
> > incrementing the reflink's link count?  (I presume yes).  Can you
> > reflink to both of them too?
> 
> 	Yes, absolutely.  Once reflinked, they look like two separate
> POSIX files.

Except that chmod can take hours and trigger ENOSPC, and the POSIX
atime does... what?

Thanks, btw.
-- Jamie

* Re: [PATCH 1/3] fs: Document the reflink(2) system call.
  2009-05-05 13:01       ` Jamie Lokier
@ 2009-05-05 17:09         ` Joel Becker
  0 siblings, 0 replies; 151+ messages in thread
From: Joel Becker @ 2009-05-05 17:09 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: linux-fsdevel, jmorris, ocfs2-devel, viro

On Tue, May 05, 2009 at 02:01:36PM +0100, Jamie Lokier wrote:
> Joel Becker wrote:
> > 	Here's another fun trick.  Overwriting rsync, instead of copying
> > blocks from the already-existing source could reflink the source to the
> > .temporary, then only write the changed blocks.  And since you own both
> > files, it just works.  If you're overwriting someone else's file?  The
> > old copy behavior is fine.
> 
> The moment rsync overwrites a single block, the whole reflink file
> will be copied by the filesystem, and then rsync will overwrite other
> blocks in the copy.

	This is not cowlink.  It's not a "CoW the whole thing when I
touch one block".  It's a new file (new inode for most implementations)
that just shares the data extents.  So if I write to one block, I only
need to CoW that one block.
	See my other email with the wiki pages.
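
	The difference from whole-file CoW can be shown with a toy
userspace model.  This is not the ocfs2 code, just a sketch under the
assumption of fixed-size blocks that each carry a reference count; all
of the toy_* names are invented for the example.

```c
#include <stdlib.h>
#include <string.h>

#define NBLOCKS 4
#define BLKSZ   8

struct toy_block {
	int refs;              /* how many files point at this block */
	char data[BLKSZ];
};

struct toy_file {
	struct toy_block *blk[NBLOCKS];
};

/* Create a file of zeroed, unshared blocks (cleanup on failure omitted). */
int toy_create(struct toy_file *f)
{
	for (int i = 0; i < NBLOCKS; i++) {
		f->blk[i] = calloc(1, sizeof(struct toy_block));
		if (!f->blk[i])
			return -1;
		f->blk[i]->refs = 1;
	}
	return 0;
}

/* "Reflink": the new file points at the same blocks; no data is copied. */
void toy_reflink(struct toy_file *dst, const struct toy_file *src)
{
	for (int i = 0; i < NBLOCKS; i++) {
		dst->blk[i] = src->blk[i];
		dst->blk[i]->refs++;
	}
}

/*
 * Write one full block.  Only that block is copied, and only if it is
 * still shared.  Returns 1 if a copy happened, 0 if the write went in
 * place, -1 on allocation failure.
 */
int toy_write(struct toy_file *f, int idx, const char *buf)
{
	int copied = 0;

	if (f->blk[idx]->refs > 1) {
		struct toy_block *nb = malloc(sizeof(*nb));

		if (!nb)
			return -1;
		memcpy(nb->data, f->blk[idx]->data, BLKSZ);
		nb->refs = 1;
		f->blk[idx]->refs--;
		f->blk[idx] = nb;
		copied = 1;
	}
	memcpy(f->blk[idx]->data, buf, BLKSZ);
	return copied;
}

/* Count the blocks two files still have in common. */
int toy_shared_blocks(const struct toy_file *a, const struct toy_file *b)
{
	int n = 0;

	for (int i = 0; i < NBLOCKS; i++)
		if (a->blk[i] == b->blk[i])
			n++;
	return n;
}
```

	After toy_reflink(), a single toy_write() to block 2 leaves the
other three blocks shared, and a second write to the same block goes in
place with no further copy.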

Joel

-- 

"Maybe the time has drawn the faces I recall.
 But things in this life change very slowly,
 If they ever change at all."

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127

* [PATCH 2/3] fs: Add vfs_reflink() and the ->reflink() inode operation.
  2009-05-03  6:15 [RFC] The reflink(2) system call Joel Becker
  2009-05-03  6:15 ` [PATCH 1/3] fs: Document the " Joel Becker
@ 2009-05-03  6:15 ` Joel Becker
  2009-05-03  8:03   ` Christoph Hellwig
  2009-05-03  6:15 ` [PATCH 3/3] fs: Add the reflink(2) system call Joel Becker
  2009-05-07 22:15 ` [RFC] The reflink(2) system call v2 Joel Becker
  3 siblings, 1 reply; 151+ messages in thread
From: Joel Becker @ 2009-05-03  6:15 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: jmorris, ocfs2-devel, viro

Implement vfs_reflink(), which calls iops->reflink().  See
Documentation/filesystems/reflink.txt for a description of the
reflink(2) system call.

I'm not quite certain of the security model to follow.
security_inode_link() is clearly not correct as the resulting file is
not the source inode.  I have chosen security_inode_create() to reflect
the creation of a new file in the directory.  This matches the
fsnotify_create() I've decided to use.  However, it does not reflect
that the new file will have the same contents as the source file.  The
real solution is probably either to check read access on the source or
define a new security_inode_reflink().

Signed-off-by: Joel Becker <joel.becker@oracle.com>
---
 fs/namei.c         |   40 ++++++++++++++++++++++++++++++++++++++++
 include/linux/fs.h |    2 ++
 2 files changed, 42 insertions(+), 0 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index 78f253c..45cbe7a 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2486,6 +2486,45 @@ SYSCALL_DEFINE2(link, const char __user *, oldname, const char __user *, newname
 	return sys_linkat(AT_FDCWD, oldname, AT_FDCWD, newname, 0);
 }
 
+int vfs_reflink(struct dentry *old_dentry, struct inode *dir, struct dentry *new_dentry)
+{
+	struct inode *inode = old_dentry->d_inode;
+	int error;
+
+	if (!inode)
+		return -ENOENT;
+
+	error = may_create(dir, new_dentry);
+	if (error)
+		return error;
+
+	if (dir->i_sb != inode->i_sb)
+		return -EXDEV;
+
+	/*
+	 * A reflink to an append-only or immutable file cannot be created.
+	 */
+	if (IS_APPEND(inode) || IS_IMMUTABLE(inode))
+		return -EPERM;
+	if (!dir->i_op->reflink)
+		return -EPERM;
+	if (S_ISDIR(inode->i_mode))
+		return -EPERM;
+
+	error = security_inode_create(dir, new_dentry, inode->i_mode);
+	if (error)
+		return error;
+
+	mutex_lock(&inode->i_mutex);
+	vfs_dq_init(dir);
+	error = dir->i_op->reflink(old_dentry, dir, new_dentry);
+	mutex_unlock(&inode->i_mutex);
+	if (!error)
+		fsnotify_create(dir, new_dentry);
+	return error;
+}
+
+
 /*
  * The worst of all namespace operations - renaming directory. "Perverted"
  * doesn't even start to describe it. Somebody in UCB had a heck of a trip...
@@ -2890,6 +2929,7 @@ EXPORT_SYMBOL(unlock_rename);
 EXPORT_SYMBOL(vfs_create);
 EXPORT_SYMBOL(vfs_follow_link);
 EXPORT_SYMBOL(vfs_link);
+EXPORT_SYMBOL(vfs_reflink);
 EXPORT_SYMBOL(vfs_mkdir);
 EXPORT_SYMBOL(vfs_mknod);
 EXPORT_SYMBOL(generic_permission);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 5bed436..3c9e4ec 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1415,6 +1415,7 @@ extern int vfs_link(struct dentry *, struct inode *, struct dentry *);
 extern int vfs_rmdir(struct inode *, struct dentry *);
 extern int vfs_unlink(struct inode *, struct dentry *);
 extern int vfs_rename(struct inode *, struct dentry *, struct inode *, struct dentry *);
+extern int vfs_reflink(struct dentry *, struct inode *, struct dentry *);
 
 /*
  * VFS dentry helper functions.
@@ -1537,6 +1538,7 @@ struct inode_operations {
 			  loff_t len);
 	int (*fiemap)(struct inode *, struct fiemap_extent_info *, u64 start,
 		      u64 len);
+	int (*reflink) (struct dentry *,struct inode *,struct dentry *);
 };
 
 struct seq_file;
-- 
1.6.1.3


* Re: [PATCH 2/3] fs: Add vfs_reflink() and the ->reflink() inode operation.
  2009-05-03  6:15 ` [PATCH 2/3] fs: Add vfs_reflink() and the ->reflink() inode operation Joel Becker
@ 2009-05-03  8:03   ` Christoph Hellwig
  2009-05-04  2:51     ` Joel Becker
  0 siblings, 1 reply; 151+ messages in thread
From: Christoph Hellwig @ 2009-05-03  8:03 UTC (permalink / raw)
  To: Joel Becker; +Cc: linux-fsdevel, jmorris, ocfs2-devel, viro

> +int vfs_reflink(struct dentry *old_dentry, struct inode *dir, struct dentry *new_dentry)
>
>+{

Would be nice to have a little kerneldoc comment for it.  Also please
avoid the > 80 char lines

> +EXPORT_SYMBOL(vfs_reflink);

No really good reason to export this.  Most vfs_ helpers are exported
for nfsd, and I can't really see nfsd use this anytime soon.


* Re: [PATCH 2/3] fs: Add vfs_reflink() and the ->reflink() inode operation.
  2009-05-03  8:03   ` Christoph Hellwig
@ 2009-05-04  2:51     ` Joel Becker
  0 siblings, 0 replies; 151+ messages in thread
From: Joel Becker @ 2009-05-04  2:51 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-fsdevel, jmorris, ocfs2-devel, viro

On Sun, May 03, 2009 at 04:03:25AM -0400, Christoph Hellwig wrote:
> > +int vfs_reflink(struct dentry *old_dentry, struct inode *dir, struct dentry *new_dentry)
> >
> >+{
> 
> Would be nice to have a little kerneldoc comment for it.  Also please
> avoid the > 80 char lines

	Both good points.

> > +EXPORT_SYMBOL(vfs_reflink);
> 
> No really good reason to export this.  Most vfs_ helpers are exported
> for nfsd, and I can't really see nfsd use this anytime soon.

	While we're going forward with the system call, ocfs2's going to
support the ioctl for older kernels.  I was planning to have mainline
just reroute the ioctl to vfs_reflink(), rather than have the ioctl
just break.
 
Joel

-- 

Life's Little Instruction Book #222

	"Think twice before burdening a friend with a secret."

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127

* [PATCH 3/3] fs: Add the reflink(2) system call.
  2009-05-03  6:15 [RFC] The reflink(2) system call Joel Becker
  2009-05-03  6:15 ` [PATCH 1/3] fs: Document the " Joel Becker
  2009-05-03  6:15 ` [PATCH 2/3] fs: Add vfs_reflink() and the ->reflink() inode operation Joel Becker
@ 2009-05-03  6:15 ` Joel Becker
  2009-05-03  6:27   ` Matthew Wilcox
  2009-05-03  8:04   ` Christoph Hellwig
  2009-05-07 22:15 ` [RFC] The reflink(2) system call v2 Joel Becker
  3 siblings, 2 replies; 151+ messages in thread
From: Joel Becker @ 2009-05-03  6:15 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: jmorris, ocfs2-devel, viro

This implements reflinkat(2) and reflink(2).  See
Documentation/filesystems/reflink.txt for a description of the
reflink(2) system call.

XXX: Currently only adds the x86_32 linkage.  The rest of the
architectures belong here too.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
---
 arch/x86/include/asm/unistd_32.h   |    1 +
 arch/x86/kernel/syscall_table_32.S |    1 +
 fs/namei.c                         |   56 ++++++++++++++++++++++++++++++++++++
 3 files changed, 58 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/unistd_32.h b/arch/x86/include/asm/unistd_32.h
index 6e72d74..ea8eb94 100644
--- a/arch/x86/include/asm/unistd_32.h
+++ b/arch/x86/include/asm/unistd_32.h
@@ -340,6 +340,7 @@
 #define __NR_inotify_init1	332
 #define __NR_preadv		333
 #define __NR_pwritev		334
+#define __NR_reflink		335
 
 #ifdef __KERNEL__
 
diff --git a/arch/x86/kernel/syscall_table_32.S b/arch/x86/kernel/syscall_table_32.S
index ff5c873..866705d 100644
--- a/arch/x86/kernel/syscall_table_32.S
+++ b/arch/x86/kernel/syscall_table_32.S
@@ -334,3 +334,4 @@ ENTRY(sys_call_table)
 	.long sys_inotify_init1
 	.long sys_preadv
 	.long sys_pwritev
+	.long sys_reflink		/* 335 */
diff --git a/fs/namei.c b/fs/namei.c
index 45cbe7a..cf739a3 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2524,6 +2524,62 @@ int vfs_reflink(struct dentry *old_dentry, struct inode *dir, struct dentry *new
 	return error;
 }
 
+SYSCALL_DEFINE5(reflinkat, int, olddfd, const char __user *, oldname,
+		int, newdfd, const char __user *, newname, int, flags)
+{
+	struct dentry *new_dentry;
+	struct nameidata nd;
+	struct path old_path;
+	int error;
+	char *to;
+
+	if ((flags & ~AT_SYMLINK_FOLLOW) != 0)
+		return -EINVAL;
+
+	error = user_path_at(olddfd, oldname,
+			     flags & AT_SYMLINK_FOLLOW ? LOOKUP_FOLLOW : 0,
+			     &old_path);
+	if (error)
+		return error;
+
+	error = user_path_parent(newdfd, newname, &nd, &to);
+	if (error)
+		goto out;
+	error = -EXDEV;
+	if (old_path.mnt != nd.path.mnt)
+		goto out_release;
+	new_dentry = lookup_create(&nd, 0);
+	error = PTR_ERR(new_dentry);
+	if (IS_ERR(new_dentry))
+		goto out_unlock;
+	error = mnt_want_write(nd.path.mnt);
+	if (error)
+		goto out_dput;
+	error = security_path_mknod(&nd.path, new_dentry,
+				    old_path.dentry->d_inode->i_mode, 0);
+	if (error)
+		goto out_drop_write;
+	error = vfs_reflink(old_path.dentry, nd.path.dentry->d_inode, new_dentry);
+out_drop_write:
+	mnt_drop_write(nd.path.mnt);
+out_dput:
+	dput(new_dentry);
+out_unlock:
+	mutex_unlock(&nd.path.dentry->d_inode->i_mutex);
+out_release:
+	path_put(&nd.path);
+	putname(to);
+out:
+	path_put(&old_path);
+
+	return error;
+}
+
+SYSCALL_DEFINE2(reflink, const char __user *, oldname, const char __user *, newname)
+{
+	return sys_reflinkat(AT_FDCWD, oldname, AT_FDCWD, newname, 0);
+}
+
 
 /*
  * The worst of all namespace operations - renaming directory. "Perverted"
-- 
1.6.1.3


* Re: [PATCH 3/3] fs: Add the reflink(2) system call.
  2009-05-03  6:15 ` [PATCH 3/3] fs: Add the reflink(2) system call Joel Becker
@ 2009-05-03  6:27   ` Matthew Wilcox
  2009-05-03  6:39     ` Al Viro
  2009-05-04  2:53     ` Joel Becker
  2009-05-03  8:04   ` Christoph Hellwig
  1 sibling, 2 replies; 151+ messages in thread
From: Matthew Wilcox @ 2009-05-03  6:27 UTC (permalink / raw)
  To: Joel Becker; +Cc: linux-fsdevel, jmorris, ocfs2-devel, viro

On Sat, May 02, 2009 at 11:15:03PM -0700, Joel Becker wrote:
> This implements reflinkat(2) and reflink(2).  See
> Documentation/reflink.txt for a description of the reflink(2) system
> call.

Do we need to add sys_reflink()?  Since sys_reflinkat() has a superset
of the functionality, presumably glibc can provide both reflink() and
reflinkat() calls, and userspace need never know that glibc is calling
sys_reflinkat() for both.

-- 
Matthew Wilcox				Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."

* Re: [PATCH 3/3] fs: Add the reflink(2) system call.
  2009-05-03  6:27   ` Matthew Wilcox
@ 2009-05-03  6:39     ` Al Viro
  2009-05-03  7:48       ` Christoph Hellwig
  2009-05-04  2:53       ` Joel Becker
  2009-05-04  2:53     ` Joel Becker
  1 sibling, 2 replies; 151+ messages in thread
From: Al Viro @ 2009-05-03  6:39 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: Joel Becker, linux-fsdevel, jmorris, ocfs2-devel

On Sun, May 03, 2009 at 12:27:57AM -0600, Matthew Wilcox wrote:
> On Sat, May 02, 2009 at 11:15:03PM -0700, Joel Becker wrote:
> > This implements reflinkat(2) and reflink(2).  See
> > Documentation/reflink.txt for a description of the reflink(2) system
> > call.
> 
> Do we need to add sys_reflink()?  Since sys_reflinkat() has a superset
> of the functionality, presumably glibc can provide both reflink() and
> reflinkat() calls, and userspace need never know that glibc is calling
> sys_reflinkat() for both.

Yes, indeed...

Another question: do we want that to work across mountpoint boundary?
It's probably OK in this case, but...

* Re: [PATCH 3/3] fs: Add the reflink(2) system call.
  2009-05-03  6:39     ` Al Viro
@ 2009-05-03  7:48       ` Christoph Hellwig
  2009-05-03 11:16         ` Al Viro
  2009-05-04  2:53       ` Joel Becker
  1 sibling, 1 reply; 151+ messages in thread
From: Christoph Hellwig @ 2009-05-03  7:48 UTC (permalink / raw)
  To: Al Viro; +Cc: Matthew Wilcox, Joel Becker, linux-fsdevel, jmorris, ocfs2-devel

On Sun, May 03, 2009 at 07:39:02AM +0100, Al Viro wrote:
> Another question: do we want that to work across mountpoint boundary?
> It's probably OK in this case, but...

I don't think so.  Allowing any link-like semantics over mount point
boundaries will just cause problems.

Joel, can you also submit a reflink man page to Michael?


* Re: [PATCH 3/3] fs: Add the reflink(2) system call.
  2009-05-03  7:48       ` Christoph Hellwig
@ 2009-05-03 11:16         ` Al Viro
  0 siblings, 0 replies; 151+ messages in thread
From: Al Viro @ 2009-05-03 11:16 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Matthew Wilcox, Joel Becker, linux-fsdevel, jmorris, ocfs2-devel

On Sun, May 03, 2009 at 03:48:49AM -0400, Christoph Hellwig wrote:
> On Sun, May 03, 2009 at 07:39:02AM +0100, Al Viro wrote:
> > Another question: do we want that to work across mountpoint boundary?
> > It's probably OK in this case, but...
> 
> I don't think so.  Allowing any link-like semantics over mount point
> boundaries will just cause problems.

Quite.  I realize that this is how vfs_link() is written, but I really
wonder if we should turn that if (foo->i_sb != bar->i_sb) into BUG_ON()
in both.  Their callers have vfsmounts and ought to do the vfsmount-level
check anyway, so running into *that* -EXDEV should be impossible.

* Re: [PATCH 3/3] fs: Add the reflink(2) system call.
  2009-05-03  6:39     ` Al Viro
  2009-05-03  7:48       ` Christoph Hellwig
@ 2009-05-04  2:53       ` Joel Becker
  1 sibling, 0 replies; 151+ messages in thread
From: Joel Becker @ 2009-05-04  2:53 UTC (permalink / raw)
  To: Al Viro; +Cc: Matthew Wilcox, linux-fsdevel, jmorris, ocfs2-devel

On Sun, May 03, 2009 at 07:39:02AM +0100, Al Viro wrote:
> Another question: do we want that to work across mountpoint boundary?
> It's probably OK in this case, but...

	I don't think we want it working across mountpoints, just like
link(2).  I thought I checked for that in sys_reflinkat().

Joel

-- 

Life's Little Instruction Book #139

	"Never deprive someone of hope; it might be all they have."

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127

* Re: [PATCH 3/3] fs: Add the reflink(2) system call.
  2009-05-03  6:27   ` Matthew Wilcox
  2009-05-03  6:39     ` Al Viro
@ 2009-05-04  2:53     ` Joel Becker
  1 sibling, 0 replies; 151+ messages in thread
From: Joel Becker @ 2009-05-04  2:53 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-fsdevel, jmorris, ocfs2-devel, viro

On Sun, May 03, 2009 at 12:27:57AM -0600, Matthew Wilcox wrote:
> On Sat, May 02, 2009 at 11:15:03PM -0700, Joel Becker wrote:
> > This implements reflinkat(2) and reflink(2).  See
> > Documentation/reflink.txt for a description of the reflink(2) system
> > call.
> 
> Do we need to add sys_reflink()?  Since sys_reflinkat() has a superset
> of the functionality, presumably glibc can provide both reflink() and
> reflinkat() calls, and userspace need never know that glibc is calling
> sys_reflinkat() for both.

	Sure, that works.

Joel

-- 

"Always give your best, never get discouraged, never be petty; always
 remember, others may hate you.  Those who hate you don't win unless
 you hate them.  And then you destroy yourself."
	- Richard M. Nixon

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127

* Re: [PATCH 3/3] fs: Add the reflink(2) system call.
  2009-05-03  6:15 ` [PATCH 3/3] fs: Add the reflink(2) system call Joel Becker
  2009-05-03  6:27   ` Matthew Wilcox
@ 2009-05-03  8:04   ` Christoph Hellwig
  1 sibling, 0 replies; 151+ messages in thread
From: Christoph Hellwig @ 2009-05-03  8:04 UTC (permalink / raw)
  To: Joel Becker; +Cc: linux-fsdevel, jmorris, ocfs2-devel, viro

On Sat, May 02, 2009 at 11:15:03PM -0700, Joel Becker wrote:
> This implements reflinkat(2) and reflink(2).  See
> Documentation/reflink.txt for a description of the reflink(2) system
> call.
> 
> XXX: Currently only adds the x86_32 linkage.  The rest of the
> architectures belong here too.

As mentioned by willy, no need for the sys_reflink syscall.  Also
no really good reason to split the support up into three patches,
one is enough.


* [RFC] The reflink(2) system call v2.
  2009-05-03  6:15 [RFC] The reflink(2) system call Joel Becker
                   ` (2 preceding siblings ...)
  2009-05-03  6:15 ` [PATCH 3/3] fs: Add the reflink(2) system call Joel Becker
@ 2009-05-07 22:15 ` Joel Becker
  2009-05-08  1:39   ` James Morris
  2009-05-08  2:59   ` jim owens
  3 siblings, 2 replies; 151+ messages in thread
From: Joel Becker @ 2009-05-07 22:15 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: mtk.manpages, linux-security-module, jmorris, ocfs2-devel, viro

Hi again,
	Here's version 2 of reflink.  Changes since the first version:

- One patch, not three.
- Documentation/filesystems/reflink.txt is no longer a pseudo-manpage.
  It also tries to encapsulate all the feedback from the discussion to
  make the operation clearer.
- LSM hooks added as recommended by the LSM folks.  This includes the
  default implementation in capability.c.
- Restricted reflink to owner or CAP_CHOWN.
- reflink(2) removed, only reflinkat(2) will be in the syscall table.
  Userspace can trivially write reflink(3).
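
	The last item can be sketched as below.  Since reflinkat(2) is
only proposed here and has no libc entry point, the sketch calls a
stand-in, reflinkat_stub(), which records its arguments and fails with
ENOSYS.  Both reflinkat_stub() and reflink_wrapper() are hypothetical
names; a real wrapper would invoke the system call instead.

```c
#include <errno.h>
#include <fcntl.h>   /* AT_FDCWD */

#ifndef AT_FDCWD
#define AT_FDCWD (-100)   /* Linux value, used only if the headers hide it */
#endif

/* Arguments seen by the stand-in, kept so the wrapper can be checked. */
int last_olddirfd, last_newdirfd, last_flags;

/*
 * Hypothetical stand-in for the proposed reflinkat(2).  It performs no
 * filesystem operation: it records the directory fds and flags, then
 * fails the way an unimplemented system call would.
 */
int reflinkat_stub(int olddirfd, const char *oldpath,
		   int newdirfd, const char *newpath, int flags)
{
	(void)oldpath;
	(void)newpath;
	last_olddirfd = olddirfd;
	last_newdirfd = newdirfd;
	last_flags = flags;
	errno = ENOSYS;
	return -1;
}

/*
 * reflink(3)-style wrapper: both paths are resolved relative to the
 * current working directory and no flags are passed, mirroring how
 * link(2) relates to linkat(2).
 */
int reflink_wrapper(const char *oldpath, const char *newpath)
{
	return reflinkat_stub(AT_FDCWD, oldpath, AT_FDCWD, newpath, 0);
}
```

	Passing 0 for flags means a symlink given as oldpath is not
followed, matching what the removed sys_reflink() did in the first
version of the patch.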

	The patch still only defines sys_reflinkat() for x86_32.  The
final version will have all architectures.  The patch is also available
in my ocfs2 tree:

  git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2.git reflink

	If you want to play with reflinks, here's what you need:

1) Tao's kernel code.  This is the ioctl-based ocfs2 implementation.
   Obviously we'll be putting it under the syscall shortly.  Compile and
   install as you'd expect.  It's in the 'refcount' branch of his git
   tree:

  git://oss.oracle.com/git/tma/linux-2.6.git refcount

2) My code for ocfs2-tools.  This is the mkfs.ocfs2(8) support to create
   a filesystem ready for reflink.  It's in the 'refcount' branch of the
   ocfs2-tools git tree:

  git://oss.oracle.com/git/ocfs2-tools.git refcount

   Once the branch is checked out, you can build and install it with:

  # ./autogen.sh; make; make install

   Create a non-clustered ocfs2 filesystem like so:

  # mkfs.ocfs2 -M local --fs-features=refcount /dev/XXX

   If you really want a clustered ocfs2, go right ahead, but I figure
   most people that want to play with reflinks want the quickest start
   possible, and a non-clustered ocfs2 means mkfs+mount just like any
   other local filesystem.

3) The reflink(1) program.  Grab the master branch from the reflink git
   tree:

  git://oss.oracle.com/git/jlbec/reflink.git master

   Type 'make' and 'make install' in the toplevel directory.  You now
   have the reflink(1) program.  It works with both the system call and
   the ocfs2 ioctl, so you can use it atop the current ocfs2 patch set.

4) Have fun!
 
Joel

From 3130be9651832cece277d30182a04274798ce7f2 Mon Sep 17 00:00:00 2001
From: Joel Becker <joel.becker@oracle.com>
Date: Sat, 2 May 2009 22:48:59 -0700
Subject: [PATCH] fs: Add the reflink() operation and reflinkat(2) system call.

The userspace-visible idea of the operation is:

int reflink(const char *oldpath, const char *newpath);
int reflinkat(int olddirfd, const char *oldpath,
	      int newdirfd, const char *newpath, int flags);

The kernel only implements reflinkat(2).  reflink(3) is a trivial
wrapper around reflinkat(2).

The reflink() system call creates reference-counted links.  It creates
a new file that shares the data extents of the source file in a
copy-on-write fashion.  Its calling semantics are identical to link(2)
and linkat(2).  Once complete, programs see the new file as a completely
separate entry.

In the VFS, ->reflink() is an inode_operation with the same arguments as
->link().

reflink() requires the caller to own the source file or have CAP_CHOWN,
because a reflink preserves ownership, permissions, and security
contexts.  Without the privileges, a regular user can't preserve
ownership.

Two new LSM hooks are added, security_path_reflink() and
security_inode_reflink().  None of the existing LSM hooks appear to
fit.

XXX: Currently only adds the x86_32 linkage.  The rest of the
architectures belong here too.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
---
 Documentation/filesystems/reflink.txt |  152 +++++++++++++++++++++++++++++++++
 Documentation/filesystems/vfs.txt     |    4 +
 arch/x86/include/asm/unistd_32.h      |    1 +
 arch/x86/kernel/syscall_table_32.S    |    1 +
 fs/namei.c                            |  101 ++++++++++++++++++++++
 include/linux/fs.h                    |    2 +
 include/linux/security.h              |   38 ++++++++
 include/linux/syscalls.h              |    2 +
 security/capability.c                 |   13 +++
 security/security.c                   |   15 +++
 10 files changed, 329 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/filesystems/reflink.txt

diff --git a/Documentation/filesystems/reflink.txt b/Documentation/filesystems/reflink.txt
new file mode 100644
index 0000000..58a6b38
--- /dev/null
+++ b/Documentation/filesystems/reflink.txt
@@ -0,0 +1,152 @@
+reflink(2)
+==========
+
+
+INTRODUCTION
+------------
+
+A reflink is a reference-counted link.  The reflink(2) operation is
+analogous to the link(2) operation, except that instead of two directory
+entries pointing to the same inode, there are two identical inodes
+pointing to the same data.  Writes do not modify the shared data; they
+use copy-on-write (CoW).  Thus, after the reflink has been created, the
+inodes can diverge without impacting each other.
+
+
+SYNOPSIS
+--------
+
+The reflink(2) call looks just like link(2):
+
+    int reflink(const char *oldpath, const char *newpath);
+
+The actual system call is reflinkat(2):
+
+    int reflinkat(int olddirfd, const char *oldpath,
+                  int newdirfd, const char *newpath, int flags);
+
+For details on how olddirfd, newdirfd, and flags behave, see linkat(2).
+The reflink(2) call won't be implemented by the kernel, because it's a
+trivial wrapper around reflinkat(2).
+
+
+DESCRIPTION
+-----------
+
+One way of viewing reflink is to look at the level of sharing.  A
+symbolic link does its sharing at the directory entry level; many names
+end up pointing at the same directory entry.  Hard links are one step
+down.  Multiple directory entries are sharing one inode.  Reflinks are
+down one more level: multiple inodes share the same data extents.
+
+When you symlink a file, you can then access it via the symlink or the
+real directory entry, and for the most part they look identical.  When
+accessing more than one name for a hard link, the object returned looks
+identical.  Similarly, a newly created reflink is identical to its
+source in almost every way and can be treated as such.  This includes
+ownership, permissions, security context, and data.  The only things
+that are different are the inode number, the link count, and the ctime.
+
+A reflink is a snapshot of the source file at the time it is created.
+
+Once created, though, a reflink can be modified like any other normal
+file without affecting the source file.  Changes to trivial fields like
+permissions, owner, or times are guaranteed not to trigger CoW of file
+data and will not return any error that wouldn't happen on a truly
+distinct file.  Changes to the file's data will trigger CoW of the data
+affected - the actual CoW granularity is up to the filesystem, from
+exact bytes up to the entire file.  ocfs2, for example, will copy out an
+entire extent or 1MB, whichever is smaller.
+
+Partial reflinks are not allowed.  The new inode will only appear in the
+directory structure after it is fully formed.  This prevents a crash or
+lack of space from creating a partial reflink.
+
+If a filesystem does not support reflinks, the kernel and libc MUST NOT
+fake it.  Callers are expecting to get snapshots, and faking it will
+violate that trust.
+
+The userspace view is as follows.  When reflink(2) returns, opening
+oldpath and newpath returns identical-looking files, just like link(2).
+After that, oldpath and newpath behave as distinct files, and
+modifications to one have no impact on the other.
+
+
+RESTRICTIONS
+------------
+
+Just as the sharing gets lower as you move from symlink() -> link() ->
+reflink(), the restrictions on the call get tighter.  A symlink doesn't
+require any access permissions other than being able to create its
+inode.  It can cross filesystems and mount points, and it can point to
+any type of file.  A hard link requires both source and target to be on
+the same filesystem under the same mount point, and that the source not
+be a directory.   Like hard links and symlinks, a reflink cannot be
+created if newpath exists.
+
+Reflinks add one big restriction on top of hard links: only the owner
+or someone with elevated privileges (CAP_CHOWN) can reflink a file.  A
+reflink is a point-in-time snapshot of a file.  It has the same
+ownership, attributes, and security context as the source file.  A
+regular user cannot change the ownership of files, so they cannot create
+a reflink of a file they do not own.
+
+
+SHARING
+-------
+
+A reflink creates a new inode.  It shares all data extents of the source
+file; this includes file data and extended attribute data.  All of the
+sharing is in a CoW fashion, and any modification of the data will break
+the sharing.
+
+For some filesystems, certain data structures are not in allocated
+storage extents.  Creating a reflink might make a copy of these extents.
+An example is ext3's ability to store small extended attributes inside
+the ext3 inode.  Since a reflink is creating a new inode, those extended
+attributes are merely copied to the new inode.
+
+
+EXCEPTIONS
+----------
+
+All file attributes and extended attributes of the new file must be
+identical to the source file with the following exceptions:
+
+- The new file must have a new inode number.  This allows POSIX
+  programs to treat the source and new files as separate objects.  From
+  the view of the POSIX application, the files are distinct.  The
+  sharing is invisible outside of the filesystem's internal structures.
+- The ctime of the source file only changes if the source's metadata
+  must be changed to accommodate the copy-on-write linkage.  The ctime
+  of the new file is set to represent its creation.
+- The link count of the source file is unchanged, and the link count of
+  the new file is one.
+
+The mtime of the source file is unmodified, and the mtime of the new
+file is set identical to the source file.  This reflects that the data
+is unchanged.
+
+
+INODE OPERATION
+---------------
+
+Filesystems implement the ->reflink() inode operation.  It has the same
+prototype as ->link():
+
+    int (*reflink)(struct dentry *old_dentry, struct inode *dir,
+                   struct dentry *new_dentry);
+
+When the filesystem is called, the VFS has already checked the
+permissions and mountpoint of the operation.  The filesystem just needs
+to create the new inode identical to the old one with the exceptions
+noted above, link up the shared data extents, and then link the new
+inode into dir.
+
+
+FOLLOWING SYMBOLIC LINKS
+------------------------
+
+reflink() dereferences symbolic links in the same manner that link(2)
+does.  The AT_SYMLINK_FOLLOW flag is honored just as for linkat(2).
+
diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
index f49eecf..01cd810 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -333,6 +333,7 @@ struct inode_operations {
 	ssize_t (*listxattr) (struct dentry *, char *, size_t);
 	int (*removexattr) (struct dentry *, const char *);
 	void (*truncate_range)(struct inode *, loff_t, loff_t);
+	int (*reflink) (struct dentry *,struct inode *,struct dentry *);
 };
 
 Again, all methods are called without any locks being held, unless
@@ -431,6 +432,9 @@ otherwise noted.
 
   truncate_range: a method provided by the underlying filesystem to truncate a
   	range of blocks , i.e. punch a hole somewhere in a file.
+  reflink: called by the reflink(2) system call. Only required if you want
+	to support reflinks.  For further information, see
+	Documentation/filesystems/reflink.txt.
 
 
 The Address Space Object
diff --git a/arch/x86/include/asm/unistd_32.h b/arch/x86/include/asm/unistd_32.h
index 6e72d74..c368563 100644
--- a/arch/x86/include/asm/unistd_32.h
+++ b/arch/x86/include/asm/unistd_32.h
@@ -340,6 +340,7 @@
 #define __NR_inotify_init1	332
 #define __NR_preadv		333
 #define __NR_pwritev		334
+#define __NR_reflinkat		335
 
 #ifdef __KERNEL__
 
diff --git a/arch/x86/kernel/syscall_table_32.S b/arch/x86/kernel/syscall_table_32.S
index ff5c873..d11c200 100644
--- a/arch/x86/kernel/syscall_table_32.S
+++ b/arch/x86/kernel/syscall_table_32.S
@@ -334,3 +334,4 @@ ENTRY(sys_call_table)
 	.long sys_inotify_init1
 	.long sys_preadv
 	.long sys_pwritev
+	.long sys_reflinkat		/* 335 */
diff --git a/fs/namei.c b/fs/namei.c
index 78f253c..3f80c2f 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2486,6 +2486,106 @@ SYSCALL_DEFINE2(link, const char __user *, oldname, const char __user *, newname
 	return sys_linkat(AT_FDCWD, oldname, AT_FDCWD, newname, 0);
 }
 
+int vfs_reflink(struct dentry *old_dentry, struct inode *dir, struct dentry *new_dentry)
+{
+	struct inode *inode = old_dentry->d_inode;
+	int error;
+
+	if (!inode)
+		return -ENOENT;
+
+	/*
+	 * reflink() preserves ownership, so the caller must have the
+	 * right to do so.
+	 */
+	if ((current_fsuid() != inode->i_uid) && !capable(CAP_CHOWN))
+		return -EPERM;
+
+	if ((current_fsuid() != inode->i_uid) &&
+	    !in_group_p(inode->i_gid) && !capable(CAP_CHOWN))
+		return -EPERM;
+
+	error = may_create(dir, new_dentry);
+	if (error)
+		return error;
+
+	if (dir->i_sb != inode->i_sb)
+		return -EXDEV;
+
+	/*
+	 * A reflink to an append-only or immutable file cannot be created.
+	 */
+	if (IS_APPEND(inode) || IS_IMMUTABLE(inode))
+		return -EPERM;
+	if (!dir->i_op->reflink)
+		return -EPERM;
+	if (S_ISDIR(inode->i_mode))
+		return -EPERM;
+
+	error = security_inode_reflink(old_dentry, dir, new_dentry);
+	if (error)
+		return error;
+
+	mutex_lock(&inode->i_mutex);
+	vfs_dq_init(dir);
+	error = dir->i_op->reflink(old_dentry, dir, new_dentry);
+	mutex_unlock(&inode->i_mutex);
+	if (!error)
+		fsnotify_create(dir, new_dentry);
+	return error;
+}
+
+SYSCALL_DEFINE5(reflinkat, int, olddfd, const char __user *, oldname,
+		int, newdfd, const char __user *, newname, int, flags)
+{
+	struct dentry *new_dentry;
+	struct nameidata nd;
+	struct path old_path;
+	int error;
+	char *to;
+
+	if ((flags & ~AT_SYMLINK_FOLLOW) != 0)
+		return -EINVAL;
+
+	error = user_path_at(olddfd, oldname,
+			     flags & AT_SYMLINK_FOLLOW ? LOOKUP_FOLLOW : 0,
+			     &old_path);
+	if (error)
+		return error;
+
+	error = user_path_parent(newdfd, newname, &nd, &to);
+	if (error)
+		goto out;
+	error = -EXDEV;
+	if (old_path.mnt != nd.path.mnt)
+		goto out_release;
+	new_dentry = lookup_create(&nd, 0);
+	error = PTR_ERR(new_dentry);
+	if (IS_ERR(new_dentry))
+		goto out_unlock;
+	error = mnt_want_write(nd.path.mnt);
+	if (error)
+		goto out_dput;
+	error = security_path_reflink(old_path.dentry, &nd.path, new_dentry);
+	if (error)
+		goto out_drop_write;
+	error = vfs_reflink(old_path.dentry, nd.path.dentry->d_inode, new_dentry);
+out_drop_write:
+	mnt_drop_write(nd.path.mnt);
+out_dput:
+	dput(new_dentry);
+out_unlock:
+	mutex_unlock(&nd.path.dentry->d_inode->i_mutex);
+out_release:
+	path_put(&nd.path);
+	putname(to);
+out:
+	path_put(&old_path);
+
+	return error;
+}
+
+
 /*
  * The worst of all namespace operations - renaming directory. "Perverted"
  * doesn't even start to describe it. Somebody in UCB had a heck of a trip...
@@ -2890,6 +2990,7 @@ EXPORT_SYMBOL(unlock_rename);
 EXPORT_SYMBOL(vfs_create);
 EXPORT_SYMBOL(vfs_follow_link);
 EXPORT_SYMBOL(vfs_link);
+EXPORT_SYMBOL(vfs_reflink);
 EXPORT_SYMBOL(vfs_mkdir);
 EXPORT_SYMBOL(vfs_mknod);
 EXPORT_SYMBOL(generic_permission);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 5bed436..3c9e4ec 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1415,6 +1415,7 @@ extern int vfs_link(struct dentry *, struct inode *, struct dentry *);
 extern int vfs_rmdir(struct inode *, struct dentry *);
 extern int vfs_unlink(struct inode *, struct dentry *);
 extern int vfs_rename(struct inode *, struct dentry *, struct inode *, struct dentry *);
+extern int vfs_reflink(struct dentry *, struct inode *, struct dentry *);
 
 /*
  * VFS dentry helper functions.
@@ -1537,6 +1538,7 @@ struct inode_operations {
 			  loff_t len);
 	int (*fiemap)(struct inode *, struct fiemap_extent_info *, u64 start,
 		      u64 len);
+	int (*reflink) (struct dentry *,struct inode *,struct dentry *);
 };
 
 struct seq_file;
diff --git a/include/linux/security.h b/include/linux/security.h
index d5fd616..c647761 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -528,6 +528,23 @@ static inline void security_free_mnt_opts(struct security_mnt_opts *opts)
  *	@inode contains a pointer to the inode.
  *	@secid contains a pointer to the location where result will be saved.
  *	In case of failure, @secid will be set to zero.
+ * @inode_reflink:
+ *	Check permission before creating a new reference-counted link to
+ *	a file.
+ *	@old_dentry contains the dentry structure for an existing link to
+ *	the file.
+ *	@dir contains the inode structure of the parent directory of the
+ *	new reflink.
+ *	Return 0 if permission is granted.
+ * @path_reflink:
+ *	Check permission before creating a new reference-counted link to
+ *	a file.
+ *	@old_dentry contains the dentry structure for an existing link
+ *	to the file.
+ *	@new_dir contains the path structure of the parent directory of
+ *	the new reflink.
+ *	@new_dentry contains the dentry structure for the new reflink.
+ *	Return 0 if permission is granted.
  *
  * Security hooks for file operations
  *
@@ -1402,6 +1419,8 @@ struct security_operations {
 			  struct dentry *new_dentry);
 	int (*path_rename) (struct path *old_dir, struct dentry *old_dentry,
 			    struct path *new_dir, struct dentry *new_dentry);
+	int (*path_reflink) (struct dentry *old_dentry, struct path *new_dir,
+			     struct dentry *new_dentry);
 #endif
 
 	int (*inode_alloc_security) (struct inode *inode);
@@ -1415,6 +1434,7 @@ struct security_operations {
 	int (*inode_unlink) (struct inode *dir, struct dentry *dentry);
 	int (*inode_symlink) (struct inode *dir,
 			      struct dentry *dentry, const char *old_name);
+	int (*inode_reflink) (struct dentry *old_dentry, struct inode *dir);
 	int (*inode_mkdir) (struct inode *dir, struct dentry *dentry, int mode);
 	int (*inode_rmdir) (struct inode *dir, struct dentry *dentry);
 	int (*inode_mknod) (struct inode *dir, struct dentry *dentry,
@@ -1675,6 +1695,8 @@ int security_inode_link(struct dentry *old_dentry, struct inode *dir,
 int security_inode_unlink(struct inode *dir, struct dentry *dentry);
 int security_inode_symlink(struct inode *dir, struct dentry *dentry,
 			   const char *old_name);
+int security_inode_reflink(struct dentry *old_dentry, struct inode *dir,
+			   struct dentry *new_dentry);
 int security_inode_mkdir(struct inode *dir, struct dentry *dentry, int mode);
 int security_inode_rmdir(struct inode *dir, struct dentry *dentry);
 int security_inode_mknod(struct inode *dir, struct dentry *dentry, int mode, dev_t dev);
@@ -2056,6 +2078,13 @@ static inline int security_inode_symlink(struct inode *dir,
 	return 0;
 }
 
+static inline int security_inode_reflink(struct dentry *old_dentry,
+					 struct inode *dir,
+					 struct dentry *new_dentry)
+{
+	return 0;
+}
+
 static inline int security_inode_mkdir(struct inode *dir,
 					struct dentry *dentry,
 					int mode)
@@ -2802,6 +2831,8 @@ int security_path_link(struct dentry *old_dentry, struct path *new_dir,
 		       struct dentry *new_dentry);
 int security_path_rename(struct path *old_dir, struct dentry *old_dentry,
 			 struct path *new_dir, struct dentry *new_dentry);
+int security_path_reflink(struct dentry *old_dentry, struct path *new_dir,
+			  struct dentry *new_dentry);
 #else	/* CONFIG_SECURITY_PATH */
 static inline int security_path_unlink(struct path *dir, struct dentry *dentry)
 {
@@ -2851,6 +2882,13 @@ static inline int security_path_rename(struct path *old_dir,
 {
 	return 0;
 }
+
+static inline int security_path_reflink(struct dentry *old_dentry,
+					struct path *new_dir,
+					struct dentry *new_dentry)
+{
+	return 0;
+}
 #endif	/* CONFIG_SECURITY_PATH */
 
 #ifdef CONFIG_KEYS
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 40617c1..35a8743 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -692,6 +692,8 @@ asmlinkage long sys_symlinkat(const char __user * oldname,
 			      int newdfd, const char __user * newname);
 asmlinkage long sys_linkat(int olddfd, const char __user *oldname,
 			   int newdfd, const char __user *newname, int flags);
+asmlinkage long sys_reflinkat(int olddfd, const char __user *oldname,
+			      int newdfd, const char __user *newname, int flags);
 asmlinkage long sys_renameat(int olddfd, const char __user * oldname,
 			     int newdfd, const char __user * newname);
 asmlinkage long sys_futimesat(int dfd, char __user *filename,
diff --git a/security/capability.c b/security/capability.c
index 21b6cea..60c6eda 100644
--- a/security/capability.c
+++ b/security/capability.c
@@ -172,6 +172,11 @@ static int cap_inode_symlink(struct inode *inode, struct dentry *dentry,
 	return 0;
 }
 
+static int cap_inode_reflink(struct dentry *old_dentry, struct inode *inode)
+{
+	return 0;
+}
+
 static int cap_inode_mkdir(struct inode *inode, struct dentry *dentry,
 			   int mask)
 {
@@ -308,6 +313,12 @@ static int cap_path_truncate(struct path *path, loff_t length,
 {
 	return 0;
 }
+
+static int cap_path_reflink(struct dentry *old_dentry, struct path *new_dir,
+			    struct dentry *new_dentry)
+{
+	return 0;
+}
 #endif
 
 static int cap_file_permission(struct file *file, int mask)
@@ -905,6 +916,7 @@ void security_fixup_ops(struct security_operations *ops)
 	set_to_cap_if_null(ops, inode_link);
 	set_to_cap_if_null(ops, inode_unlink);
 	set_to_cap_if_null(ops, inode_symlink);
+	set_to_cap_if_null(ops, inode_reflink);
 	set_to_cap_if_null(ops, inode_mkdir);
 	set_to_cap_if_null(ops, inode_rmdir);
 	set_to_cap_if_null(ops, inode_mknod);
@@ -935,6 +947,7 @@ void security_fixup_ops(struct security_operations *ops)
 	set_to_cap_if_null(ops, path_link);
 	set_to_cap_if_null(ops, path_rename);
 	set_to_cap_if_null(ops, path_truncate);
+	set_to_cap_if_null(ops, path_reflink);
 #endif
 	set_to_cap_if_null(ops, file_permission);
 	set_to_cap_if_null(ops, file_alloc_security);
diff --git a/security/security.c b/security/security.c
index 5284255..fc40a29 100644
--- a/security/security.c
+++ b/security/security.c
@@ -437,6 +437,14 @@ int security_path_truncate(struct path *path, loff_t length,
 		return 0;
 	return security_ops->path_truncate(path, length, time_attrs);
 }
+
+int security_path_reflink(struct dentry *old_dentry, struct path *new_dir,
+			  struct dentry *new_dentry)
+{
+	if (unlikely(IS_PRIVATE(old_dentry->d_inode)))
+		return 0;
+	return security_ops->path_reflink(old_dentry, new_dir, new_dentry);
+}
 #endif
 
 int security_inode_create(struct inode *dir, struct dentry *dentry, int mode)
@@ -470,6 +478,13 @@ int security_inode_symlink(struct inode *dir, struct dentry *dentry,
 	return security_ops->inode_symlink(dir, dentry, old_name);
 }
 
+int security_inode_reflink(struct dentry *old_dentry, struct inode *dir)
+{
+	if (unlikely(IS_PRIVATE(old_dentry->d_inode)))
+		return 0;
+	return security_ops->inode_reflink(old_dentry, dir);
+}
+
 int security_inode_mkdir(struct inode *dir, struct dentry *dentry, int mode)
 {
 	if (unlikely(IS_PRIVATE(dir)))
-- 
1.6.1.3
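
For readers who want to see the calling convention from reflink.txt in
code, here is a minimal userspace sketch.  It is not part of the patch.
It assumes no installed kernel or libc provides reflinkat(2), so the
reflinkat() below is a placeholder stub: it repeats only the flag check
from sys_reflinkat() and then fails with ENOSYS.  The reflink() wrapper
shows what "trivial wrapper" would mean in practice.

```c
/* Request the POSIX.1-2008 declarations (AT_FDCWD, AT_SYMLINK_FOLLOW). */
#define _POSIX_C_SOURCE 200809L

#include <errno.h>
#include <fcntl.h>

/*
 * Placeholder for the proposed reflinkat(2).  This is an assumption made
 * for illustration: it performs no filesystem work.  It mirrors the flag
 * validation in the patch and otherwise reports ENOSYS.
 */
int reflinkat(int olddirfd, const char *oldpath,
              int newdirfd, const char *newpath, int flags)
{
    (void)olddirfd;
    (void)oldpath;
    (void)newdirfd;
    (void)newpath;

    /* Same test as sys_reflinkat(): only AT_SYMLINK_FOLLOW is accepted. */
    if ((flags & ~AT_SYMLINK_FOLLOW) != 0) {
        errno = EINVAL;
        return -1;
    }

    errno = ENOSYS;
    return -1;
}

/* reflink() as the trivial wrapper: both paths relative to the cwd. */
int reflink(const char *oldpath, const char *newpath)
{
    return reflinkat(AT_FDCWD, oldpath, AT_FDCWD, newpath, 0);
}
```

A caller would treat ENOSYS (no such system call) and EPERM (the error
vfs_reflink() returns when the filesystem has no ->reflink()) as "no
snapshot available" rather than falling back to a plain copy, in line
with the MUST NOT in the documentation.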

-- 

"Sometimes I think the surest sign intelligent
 life exists elsewhere in the universe is that
 none of it has tried to contact us."
                                -Calvin & Hobbes

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127


* Re: [RFC] The reflink(2) system call v2.
  2009-05-07 22:15 ` [RFC] The reflink(2) system call v2 Joel Becker
@ 2009-05-08  1:39   ` James Morris
  2009-05-08  1:49     ` Joel Becker
  2009-05-08  2:59   ` jim owens
  1 sibling, 1 reply; 151+ messages in thread
From: James Morris @ 2009-05-08  1:39 UTC (permalink / raw)
  To: Joel Becker
  Cc: linux-fsdevel, ocfs2-devel, viro, mtk.manpages,
	linux-security-module

On Thu, 7 May 2009, Joel Becker wrote:


> @@ -1402,6 +1419,8 @@ struct security_operations {
>  			  struct dentry *new_dentry);
>  	int (*path_rename) (struct path *old_dir, struct dentry *old_dentry,
>  			    struct path *new_dir, struct dentry *new_dentry);
> +	int (*path_reflink) (struct dentry *old_dentry, struct path *new_dir,
> +			     struct dentry *new_dentry);
>  #endif
>  

The TOMOYO folk don't need a path hook, so it would be unused, and should 
not be added unless someone responsible for an in-tree LSM establishes a 
case for it.

> +int security_inode_reflink(struct dentry *old_dentry, struct inode *dir,
> +			   struct dentry *new_dentry);

We don't need the new_dentry argument (this is correct in the low-level 
hook, and doesn't compile with CONFIG_SECURITY=y).


- James
-- 
James Morris
<jmorris@namei.org>


* Re: [RFC] The reflink(2) system call v2.
  2009-05-08  1:39   ` James Morris
@ 2009-05-08  1:49     ` Joel Becker
  2009-05-08 13:01       ` Tetsuo Handa
  0 siblings, 1 reply; 151+ messages in thread
From: Joel Becker @ 2009-05-08  1:49 UTC (permalink / raw)
  To: James Morris
  Cc: linux-fsdevel, ocfs2-devel, viro, mtk.manpages,
	linux-security-module

On Fri, May 08, 2009 at 11:39:53AM +1000, James Morris wrote:
> On Thu, 7 May 2009, Joel Becker wrote:
> 
> 
> > @@ -1402,6 +1419,8 @@ struct security_operations {
> >  			  struct dentry *new_dentry);
> >  	int (*path_rename) (struct path *old_dir, struct dentry *old_dentry,
> >  			    struct path *new_dir, struct dentry *new_dentry);
> > +	int (*path_reflink) (struct dentry *old_dentry, struct path *new_dir,
> > +			     struct dentry *new_dentry);
> >  #endif
> >  
> 
> The TOMOYO folk don't need a path hook, so it would be unused, and should 
> not be added unless someone responsible for an in-tree LSM establishes a 
> case for it.

	Oh, I misread what they said:

> TOMOYO wants to prevent reflink(".htpasswd", "readme.html").
> But security_path_mknod() can't know the source file's name.
> Therefore, TOMOYO wants security_path_link() rather than security_path_mknod().
> So far I don't feel TOMOYO needs to introduce security_path_reflink() because
> modifications after reflink() will be checked by other LSM hooks.

	So I should change the path_reflink() call to path_link() in
reflinkat(2)?

> > +int security_inode_reflink(struct dentry *old_dentry, struct inode *dir,
> > +			   struct dentry *new_dentry);
> 
> We don't need the new_dentry argument (this is correct in the low-level 
> hook, and doesn't compile with CONFIG_SECURITY=y).

	Eek, missed that.

Joel

-- 

"The cynics are right nine times out of ten."  
        - H. L. Mencken

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127


* Re: [RFC] The reflink(2) system call v2.
  2009-05-08  1:49     ` Joel Becker
@ 2009-05-08 13:01       ` Tetsuo Handa
  0 siblings, 0 replies; 151+ messages in thread
From: Tetsuo Handa @ 2009-05-08 13:01 UTC (permalink / raw)
  To: Joel.Becker
  Cc: jmorris, linux-fsdevel, ocfs2-devel, viro, mtk.manpages,
	linux-security-module

Joel Becker wrote:
>> So far I don't feel TOMOYO needs to introduce security_path_reflink() because
>> modifications after reflink() will be checked by other LSM hooks.
> 
> 	So I should change the path_reflink() call to path_link() in
> reflinkat(2)?

Yes, you can.


* Re: [RFC] The reflink(2) system call v2.
  2009-05-07 22:15 ` [RFC] The reflink(2) system call v2 Joel Becker
  2009-05-08  1:39   ` James Morris
@ 2009-05-08  2:59   ` jim owens
  2009-05-08  3:10     ` Joel Becker
  2009-05-11 20:49     ` [RFC] The reflink(2) system call v2 Joel Becker
  1 sibling, 2 replies; 151+ messages in thread
From: jim owens @ 2009-05-08  2:59 UTC (permalink / raw)
  To: jmorris, ocfs2-devel, viro, mtk.manpages, linux-security-module,
	joel.becker
  Cc: linux-fsdevel

Joel Becker wrote:
> Hi again,
> 	Here's version 2 of reflink.  Changes since the first version:
> 
> - One patch, not three.
> - Documentation/filesystems/reflink.txt is no longer a pseudo-manpage.
>   It also tries to encapsulate all the feedback from the discussion to
>   make the operation clearer.

You certainly did not address:

- desire for one single system call to handle both
   owner preservation and create with current owner.

   I see no reason to have 2 vfs_xxx and 2 inode functions for those.

- please just add the flag to the defined reflink API...
   there is no reason to keep saying "it is just like link(2)".
   that is not true and you will just cause confusion.

- fix the
+	if (S_ISDIR(inode->i_mode))
+		return -EPERM;

   to be an ISREG check unless you have an argument for
   special files and symlinks being COWed.

jim


* Re: [RFC] The reflink(2) system call v2.
  2009-05-08  2:59   ` jim owens
@ 2009-05-08  3:10     ` Joel Becker
  2009-05-08 11:53       ` jim owens
                         ` (2 more replies)
  2009-05-11 20:49     ` [RFC] The reflink(2) system call v2 Joel Becker
  1 sibling, 3 replies; 151+ messages in thread
From: Joel Becker @ 2009-05-08  3:10 UTC (permalink / raw)
  To: jim owens
  Cc: jmorris, linux-security-module, mtk.manpages, linux-fsdevel,
	ocfs2-devel, viro

On Thu, May 07, 2009 at 10:59:04PM -0400, jim owens wrote:
> You certainly did not address:
>
> - desire for one single system call to handle both
>   owner preservation and create with current owner.

	Nope, and I don't intend to.  reflink() is a snapshotting call,
not a kitchen sink.

Joel

-- 

Life's Little Instruction Book #444

	"Never underestimate the power of a kind word or deed."

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127


* Re: [RFC] The reflink(2) system call v2.
  2009-05-08  3:10     ` Joel Becker
@ 2009-05-08 11:53       ` jim owens
  2009-05-08 12:16       ` jim owens
  2009-05-11 20:40       ` [RFC] The reflink(2) system call v4 Joel Becker
  2 siblings, 0 replies; 151+ messages in thread
From: jim owens @ 2009-05-08 11:53 UTC (permalink / raw)
  To: jim owens, jmorris, ocfs2-devel, viro, mtk.manpages,
	linux-security-module, linux-fsdevel

Joel Becker wrote:
> On Thu, May 07, 2009 at 10:59:04PM -0400, jim owens wrote:
>> You certainly did not address:
>>
>> - desire for one single system call to handle both
>>   owner preservation and create with current owner.
> 
> 	Nope, and I don't intend to.  reflink() is a snapshotting call,
> not a kitchen sink.

I'm not a maintainer but if I was, in that case I would
NAK this since more people wanted the cowfile() definition
than your reflink definition.

If you persist that you are only doing the snapshot
then call it snaplink(2) or something.

The reflink() name makes no sense because all references are
internal to the file system.  There is absolutely no way via
"ls" to determine the reference between the original and new.

With hard links and symlinks you can easily associate them.

jim


* Re: [RFC] The reflink(2) system call v2.
  2009-05-08  3:10     ` Joel Becker
  2009-05-08 11:53       ` jim owens
@ 2009-05-08 12:16       ` jim owens
  2009-05-08 14:11         ` jim owens
  2009-05-11 20:40       ` [RFC] The reflink(2) system call v4 Joel Becker
  2 siblings, 1 reply; 151+ messages in thread
From: jim owens @ 2009-05-08 12:16 UTC (permalink / raw)
  To: jmorris, ocfs2-devel, viro, mtk.manpages, linux-security-module,
	linux-fsdevel
  Cc: joel.becker

Joel Becker wrote:
> On Thu, May 07, 2009 at 10:59:04PM -0400, jim owens wrote:
>> You certainly did not address:
>>
>> - desire for one single system call to handle both
>>   owner preservation and create with current owner.
> 
> 	Nope, and I don't intend to.  reflink() is a snapshotting call,
> not a kitchen sink.

BTW, the "kitchen sink" argument is bull!

All we are saying is have 1 syscall with 1 vfs operation that
does exactly the same thing except:

if (FLAG_SNAPFILE) {
       if (not CAP_FOWNER)
             return -EPERM
       newfile.attrs = old_file.attrs
} else
       newfile.attrs = user_default_create_attrs

I really think your objection is all because you are
hung up on your reflink() API that has NO EXISTING USERS!

jim


* Re: [RFC] The reflink(2) system call v2.
  2009-05-08 12:16       ` jim owens
@ 2009-05-08 14:11         ` jim owens
  0 siblings, 0 replies; 151+ messages in thread
From: jim owens @ 2009-05-08 14:11 UTC (permalink / raw)
  To: jmorris, ocfs2-devel, viro, mtk.manpages, linux-security-module,
	linux-fsdevel, joel.becker

> Joel Becker wrote:
>> On Thu, May 07, 2009 at 10:59:04PM -0400, jim owens wrote:
>>> You certainly did not address:
>>>
>>> - desire for one single system call to handle both
>>>   owner preservation and create with current owner.
>>
>>     Nope, and I don't intend to.  reflink() is a snapshotting call,

You might have designed this for *snapshotting*
but from a user perspective the function is best described as:

*Copy_file_attributes() and Cow_file_data()*

Since immediately after the reflink() you permit everything
to be modified on the new file.

Your design is good, but you need to admit to yourself
that the Copy_file_attributes() is the special case
as far as users are concerned.  Because most people expect
snapshots are immutable and these are really files that
don't use up space and can be used as snapshots
*if you don't modify them*.

jim


* [RFC] The reflink(2) system call v4.
  2009-05-08  3:10     ` Joel Becker
  2009-05-08 11:53       ` jim owens
  2009-05-08 12:16       ` jim owens
@ 2009-05-11 20:40       ` Joel Becker
  2009-05-11 22:27         ` James Morris
                           ` (6 more replies)
  2 siblings, 7 replies; 151+ messages in thread
From: Joel Becker @ 2009-05-11 20:40 UTC (permalink / raw)
  To: jim owens, jmorris, ocfs2-devel, viro, mtk.manpages,
	linux-security-module, linux-fsdevel

On Thu, May 07, 2009 at 08:10:18PM -0700, Joel Becker wrote:
> On Thu, May 07, 2009 at 10:59:04PM -0400, jim owens wrote:
> > You certainly did not address:
> >
> > - desire for one single system call to handle both
> >   owner preservation and create with current owner.
> 
> 	Nope, and I don't intend to.  reflink() is a snapshotting call,
> not a kitchen sink.

	I've been thinking about this all weekend.  The current state
doesn't make me happy.
	Now, what concerns me here is the interface to userspace.  The
system call itself.  I don't care if we implement it via one vfs_foo()
or 10 nor how many iops we end up with.  We can and will modify those as
we find better ideas.  But I want reflink(2) to have a semantic that is
easily understood and intuitive.
	When I initially designed reflink(), I hadn't thought about the
ownership and permission implications of snapshotting.  I was having too
much fun reflinking files around.  In that iteration, anyone could
reflink a file.  But a true snapshot needs ownership, permissions, acls,
and other security attributes (in all, I'm gonna call that the "security
context") as well.  So I defined reflink() as such.  This meant
requiring privileges, but lost some of the flexibility of the call.  I
call that a loss.
	What I'm not going to do is add optional behaviors to the system
call.  It should be pretty obvious what it does, or we're doing it
wrong.  The 'flags' field of reflinkat(2) is for AT_* flags.
	When I decided on requiring privileges, I thought that degrading
without privileges was too confusing.  I was wrong.  I want reflink() to
fit into the pantheon of file system operations in a way that makes
sense alongside the others, and this isn't it.
	Here's v4 of reflink().  If you have the privileges, you get the
full snapshot.  If you don't, you must have read access, and then you
get the entire snapshot (data and extended attributes) except that the
security context is reinitialized.  That's it.  It fits with most of the
other ops, and it's a clean degradation.
	I add a flag to iops->reflink() so that the filesystem knows what
to do with the security context.  That's the only change visible outside
of vfs_reflink().
	Security folks, check my work.  Everyone else, let me know if
this satisfies.
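
	The degradation rule described above can be written out as a
small decision function.  This is only a sketch of the rule as stated
in this mail, not code from the patch: the enum, the function name, and
the boolean parameters are hypothetical, and it assumes that an owner
or CAP_CHOWN holder needs no separate read check.

```c
#include <stdbool.h>
#include <sys/types.h>

enum reflink_outcome {
    REFLINK_DENIED,          /* caller may not reflink the file at all */
    REFLINK_PRESERVE_SECCTX, /* full snapshot, security context copied */
    REFLINK_REINIT_SECCTX    /* data and xattrs shared, context reset  */
};

/*
 * Decide what a v4 reflink would give the caller.
 * Owner or CAP_CHOWN: the security context is preserved.
 * Otherwise read access is required and the context is reinitialized.
 */
enum reflink_outcome reflink_v4_outcome(uid_t caller_fsuid, uid_t owner_uid,
                                        bool has_cap_chown, bool can_read)
{
    if (caller_fsuid == owner_uid || has_cap_chown)
        return REFLINK_PRESERVE_SECCTX;
    if (can_read)
        return REFLINK_REINIT_SECCTX;
    return REFLINK_DENIED;
}
```

The last two outcomes differ only in the security context; in both the
new file shares data and extended attributes with the source.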

Joel

From 1ebf4c2cf36d38b22de025b03753497466e18941 Mon Sep 17 00:00:00 2001
From: Joel Becker <joel.becker@oracle.com>
Date: Sat, 2 May 2009 22:48:59 -0700
Subject: [PATCH] fs: Add the reflink() operation and reflinkat(2) system call.

The userspace-visible idea of the operation is:

int reflink(const char *oldpath, const char *newpath);
int reflinkat(int olddirfd, const char *oldpath,
	      int newdirfd, const char *newpath, int flags);

The kernel only implements reflinkat(2).  reflink(3) is a trivial
wrapper around reflinkat(2).

The reflink() system call creates reference-counted links.  It creates
a new file that shares the data extents of the source file in a
copy-on-write fashion.  Its calling semantics are identical to link(2)
and linkat(2).  Once complete, programs see the new file as a completely
separate entry.

reflink() attempts to preserve ownership, permissions, and security
contexts in order to create a full snapshot.  Preserving those
attributes requires ownership or CAP_CHOWN.  A caller without those
privileges will see the security context of the new file initialized to
their default.

In the VFS, ->reflink() is an inode_operation with almost the same
arguments as ->link(); an additional argument tells the filesystem to
copy over or reinitialize the security context on the new file.

A new LSM hook, security_inode_reflink(), is added.  None of the
existing LSM hooks appeared to fit.

XXX: Currently only adds the x86_32 linkage.  The rest of the
architectures belong here too.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
---
 Documentation/filesystems/reflink.txt |  165 +++++++++++++++++++++++++++++++++
 Documentation/filesystems/vfs.txt     |    4 +
 arch/x86/include/asm/unistd_32.h      |    1 +
 arch/x86/kernel/syscall_table_32.S    |    1 +
 fs/namei.c                            |  113 ++++++++++++++++++++++
 include/linux/fs.h                    |    2 +
 include/linux/security.h              |   16 +++
 include/linux/syscalls.h              |    2 +
 security/capability.c                 |    6 +
 security/security.c                   |    7 ++
 10 files changed, 317 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/filesystems/reflink.txt

diff --git a/Documentation/filesystems/reflink.txt b/Documentation/filesystems/reflink.txt
new file mode 100644
index 0000000..aa7380f
--- /dev/null
+++ b/Documentation/filesystems/reflink.txt
@@ -0,0 +1,165 @@
+reflink(2)
+==========
+
+
+INTRODUCTION
+------------
+
+A reflink is a reference-counted link.  The reflink(2) operation is
+analogous to the link(2) operation, except that instead of two directory
+entries pointing to the same inode, there are two identical inodes
+pointing to the same data.  Writes do not modify the shared data; they
+use copy-on-write (CoW).  Thus, after the reflink has been created, the
+inodes can diverge without impacting each other.
+
+
+SYNOPSIS
+--------
+
+The reflink(2) call looks just like link(2):
+
+    int reflink(const char *oldpath, const char *newpath);
+
+The actual system call is reflinkat(2):
+
+    int reflinkat(int olddirfd, const char *oldpath,
+                  int newdirfd, const char *newpath, int flags);
+
+For details on how olddirfd, newdirfd, and flags behave, see linkat(2).
+The reflink(2) call won't be implemented by the kernel, because it's a
+trivial wrapper around reflinkat(2).
+
+
+DESCRIPTION
+-----------
+
+One way of viewing reflink is to look at the level of sharing.  A
+symbolic link does its sharing at the directory entry level; many names
+end up pointing at the same directory entry.  Hard links are one step
+down.  Multiple directory entries are sharing one inode.  Reflinks are
+down one more level: multiple inodes share the same data extents.
+
+When you symlink a file, you can then access it via the symlink or the
+real directory entry, and for the most part they look identical.  When
+accessing more than one name for a hard link, the object returned looks
+identical.  Similarly, a newly created reflink is identical to its
+source in almost every way and can be treated as such.  This includes
+ownership, permissions, security context, and data.  The only things
+that are different are the inode number, the link count, and the ctime.
+
+A reflink is a snapshot of the source file at the time it is created.
+
+Once created, though, a reflink can be modified like any other normal
+file without affecting the source file.  Changes to trivial fields like
+permissions, owner, or times are guaranteed not to trigger CoW of file
+data and will not return any error that wouldn't happen on a truly
+distinct file.  Changes to the file's data will trigger CoW of the data
+affected - the actual CoW granularity is up to the filesystem, from
+exact bytes up to the entire file.  ocfs2, for example, will copy out an
+entire extent or 1MB, whichever is smaller.
+
+Preserving the security context of the source file obviously requires
+the privilege to do so.  Callers that do not own the source file and do
+not have CAP_CHOWN will get a new reflink with all non-security
+attributes preserved; the security context of the new reflink will be
+that of a file newly created by that user.
+
+Partial reflinks are not allowed.  The new inode will only appear in the
+directory structure after it is fully formed.  This prevents a crash or
+lack of space from creating a partial reflink.
+
+If a filesystem does not support reflinks, the kernel and libc MUST NOT
+fake it.  Callers are expecting to get snapshots, and faking it will
+violate that trust.
+
+The userspace view is as follows.  When reflink(2) returns, opening
+oldpath and newpath returns identical-looking files, just like link(2).
+After that, oldpath and newpath behave as distinct files, and
+modifications to one have no impact on the other.
+
+
+RESTRICTIONS
+------------
+
+Just as the sharing gets lower as you move from symlink() -> link() ->
+reflink(), the restrictions on the call get tighter.  A symlink doesn't
+require any access permissions other than being able to create its
+inode.  It can cross filesystems and mount points, and it can point to
+any type of file.  A hard link requires both source and target to be on
+the same filesystem under the same mount point, and that the source not
+be a directory.   Like hard links and symlinks, a reflink cannot be
+created if newpath exists.
+
+Reflinks add one big restriction on top of hard links: only the owner
+or someone with elevated privileges (CAP_CHOWN) can preserve the
+security context (permissions, ownership, ACLs, etc) across a reflink.
+A reflink is a point-in-time snapshot of a file.  Without the
+appropriate privilege, the caller will see their own default security
+context applied to the file.
+
+A caller without the privileges to preserve the security context must
+have read access to reflink a file.
+
+
+SHARING
+-------
+
+A reflink creates a new inode.  It shares all data extents of the source
+file; this includes file data and extended attribute data.  All of the
+sharing is in a CoW fashion, and any modification of the data will break
+the sharing.
+
+For some filesystems, certain data structures are not in allocated
+storage extents.  Creating a reflink might make a copy of these extents.
+An example is ext3's ability to store small extended attributes inside
+the ext3 inode.  Since a reflink is creating a new inode, those extended
+attributes are merely copied to the new inode.
+
+
+EXCEPTIONS
+----------
+
+All file attributes and extended attributes of the new file must be
+identical to the source file with the following exceptions:
+
+- The new file must have a new inode number.  This allows POSIX
+  programs to treat the source and new files as separate objects.  From
+  the view of the POSIX application, the files are distinct.  The
+  sharing is invisible outside of the filesystem's internal structures.
+- The ctime of the source file only changes if the source's metadata
+  must be changed to accommodate the copy-on-write linkage.  The ctime
+  of the new file is set to represent its creation.
+- The link count of the source file is unchanged, and the link count of
+  the new file is one.
+- If the caller lacks the privileges to preserve the security context,
+  the file will have its security context initialized as would any new
+  file.
+
+The mtime of the source file is unmodified, and the mtime of the new
+file is set identical to the source file.  This reflects that the data
+is unchanged.
+
+
+INODE OPERATION
+---------------
+
+Filesystems implement the ->reflink() inode operation.  It has almost
+the same prototype as ->link():
+
+    int (*reflink)(struct dentry *old_dentry, struct inode *dir,
+                   struct dentry *new_dentry, int preserve_security);
+
+When the filesystem is called, the VFS has already checked the
+permissions and mountpoint of the operation.  It has determined whether
+the security context should be preserved or reinitialized, as specified
+by the preserve_security argument.  The filesystem just needs to create
+the new inode identical to the old one with the exceptions noted above,
+link up the shared data extents, and then link the new inode into dir.
+
+
+FOLLOWING SYMBOLIC LINKS
+------------------------
+
+reflink() dereferences symbolic links in the same manner that link(2)
+does.  The AT_SYMLINK_FOLLOW flag is honored just as for linkat(2).
+
diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
index f49eecf..01cd810 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -333,6 +333,7 @@ struct inode_operations {
 	ssize_t (*listxattr) (struct dentry *, char *, size_t);
 	int (*removexattr) (struct dentry *, const char *);
 	void (*truncate_range)(struct inode *, loff_t, loff_t);
+	int (*reflink) (struct dentry *,struct inode *,struct dentry *,int);
 };
 
 Again, all methods are called without any locks being held, unless
@@ -431,6 +432,9 @@ otherwise noted.
 
   truncate_range: a method provided by the underlying filesystem to truncate a
   	range of blocks , i.e. punch a hole somewhere in a file.
+  reflink: called by the reflink(2) system call. Only required if you want
+	to support reflinks.  For further information, see
+	Documentation/filesystems/reflink.txt.
 
 
 The Address Space Object
diff --git a/arch/x86/include/asm/unistd_32.h b/arch/x86/include/asm/unistd_32.h
index 6e72d74..c368563 100644
--- a/arch/x86/include/asm/unistd_32.h
+++ b/arch/x86/include/asm/unistd_32.h
@@ -340,6 +340,7 @@
 #define __NR_inotify_init1	332
 #define __NR_preadv		333
 #define __NR_pwritev		334
+#define __NR_reflinkat		335
 
 #ifdef __KERNEL__
 
diff --git a/arch/x86/kernel/syscall_table_32.S b/arch/x86/kernel/syscall_table_32.S
index ff5c873..d11c200 100644
--- a/arch/x86/kernel/syscall_table_32.S
+++ b/arch/x86/kernel/syscall_table_32.S
@@ -334,3 +334,4 @@ ENTRY(sys_call_table)
 	.long sys_inotify_init1
 	.long sys_preadv
 	.long sys_pwritev
+	.long sys_reflinkat		/* 335 */
diff --git a/fs/namei.c b/fs/namei.c
index 78f253c..34a6ce5 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2486,6 +2486,118 @@ SYSCALL_DEFINE2(link, const char __user *, oldname, const char __user *, newname
 	return sys_linkat(AT_FDCWD, oldname, AT_FDCWD, newname, 0);
 }
 
+int vfs_reflink(struct dentry *old_dentry, struct inode *dir, struct dentry *new_dentry)
+{
+	struct inode *inode = old_dentry->d_inode;
+	int error;
+	int preserve_security = 1;
+
+	if (!inode)
+		return -ENOENT;
+
+	/*
+	 * If the caller has the rights, reflink() will preserve the
+	 * security context of the source inode.
+	 */
+	if ((current_fsuid() != inode->i_uid) && !capable(CAP_CHOWN))
+		preserve_security = 0;
+	if ((current_fsuid() != inode->i_uid) &&
+	    !in_group_p(inode->i_gid) && !capable(CAP_CHOWN))
+		preserve_security = 0;
+
+	/*
+	 * If the caller doesn't have the right to preserve the security
+	 * context, the caller is only getting the data and extended
+	 * attributes.  They need read permission on the file.
+	 */
+	if (!preserve_security) {
+		error = inode_permission(inode, MAY_READ);
+		if (error)
+			return error;
+	}
+
+	error = may_create(dir, new_dentry);
+	if (error)
+		return error;
+
+	if (dir->i_sb != inode->i_sb)
+		return -EXDEV;
+
+	/*
+	 * A reflink to an append-only or immutable file cannot be created.
+	 */
+	if (IS_APPEND(inode) || IS_IMMUTABLE(inode))
+		return -EPERM;
+	if (!dir->i_op->reflink)
+		return -EPERM;
+	if (S_ISDIR(inode->i_mode))
+		return -EPERM;
+
+	error = security_inode_reflink(old_dentry, dir);
+	if (error)
+		return error;
+
+	mutex_lock(&inode->i_mutex);
+	vfs_dq_init(dir);
+	error = dir->i_op->reflink(old_dentry, dir, new_dentry,
+				   preserve_security);
+	mutex_unlock(&inode->i_mutex);
+	if (!error)
+		fsnotify_create(dir, new_dentry);
+	return error;
+}
+
+SYSCALL_DEFINE5(reflinkat, int, olddfd, const char __user *, oldname,
+		int, newdfd, const char __user *, newname, int, flags)
+{
+	struct dentry *new_dentry;
+	struct nameidata nd;
+	struct path old_path;
+	int error;
+	char *to;
+
+	if ((flags & ~AT_SYMLINK_FOLLOW) != 0)
+		return -EINVAL;
+
+	error = user_path_at(olddfd, oldname,
+			     flags & AT_SYMLINK_FOLLOW ? LOOKUP_FOLLOW : 0,
+			     &old_path);
+	if (error)
+		return error;
+
+	error = user_path_parent(newdfd, newname, &nd, &to);
+	if (error)
+		goto out;
+	error = -EXDEV;
+	if (old_path.mnt != nd.path.mnt)
+		goto out_release;
+	new_dentry = lookup_create(&nd, 0);
+	error = PTR_ERR(new_dentry);
+	if (IS_ERR(new_dentry))
+		goto out_unlock;
+	error = mnt_want_write(nd.path.mnt);
+	if (error)
+		goto out_dput;
+	error = security_path_link(old_path.dentry, &nd.path, new_dentry);
+	if (error)
+		goto out_drop_write;
+	error = vfs_reflink(old_path.dentry, nd.path.dentry->d_inode, new_dentry);
+out_drop_write:
+	mnt_drop_write(nd.path.mnt);
+out_dput:
+	dput(new_dentry);
+out_unlock:
+	mutex_unlock(&nd.path.dentry->d_inode->i_mutex);
+out_release:
+	path_put(&nd.path);
+	putname(to);
+out:
+	path_put(&old_path);
+
+	return error;
+}
+
+
 /*
  * The worst of all namespace operations - renaming directory. "Perverted"
  * doesn't even start to describe it. Somebody in UCB had a heck of a trip...
@@ -2890,6 +3002,7 @@ EXPORT_SYMBOL(unlock_rename);
 EXPORT_SYMBOL(vfs_create);
 EXPORT_SYMBOL(vfs_follow_link);
 EXPORT_SYMBOL(vfs_link);
+EXPORT_SYMBOL(vfs_reflink);
 EXPORT_SYMBOL(vfs_mkdir);
 EXPORT_SYMBOL(vfs_mknod);
 EXPORT_SYMBOL(generic_permission);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 5bed436..0a5c807 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1415,6 +1415,7 @@ extern int vfs_link(struct dentry *, struct inode *, struct dentry *);
 extern int vfs_rmdir(struct inode *, struct dentry *);
 extern int vfs_unlink(struct inode *, struct dentry *);
 extern int vfs_rename(struct inode *, struct dentry *, struct inode *, struct dentry *);
+extern int vfs_reflink(struct dentry *, struct inode *, struct dentry *);
 
 /*
  * VFS dentry helper functions.
@@ -1537,6 +1538,7 @@ struct inode_operations {
 			  loff_t len);
 	int (*fiemap)(struct inode *, struct fiemap_extent_info *, u64 start,
 		      u64 len);
+	int (*reflink) (struct dentry *,struct inode *,struct dentry *,int);
 };
 
 struct seq_file;
diff --git a/include/linux/security.h b/include/linux/security.h
index d5fd616..ea9cd93 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -528,6 +528,14 @@ static inline void security_free_mnt_opts(struct security_mnt_opts *opts)
  *	@inode contains a pointer to the inode.
  *	@secid contains a pointer to the location where result will be saved.
  *	In case of failure, @secid will be set to zero.
+ * @inode_reflink:
+ *	Check permission before creating a new reference-counted link to
+ *	a file.
+ *	@old_dentry contains the dentry structure for an existing link to
+ *	the file.
+ *	@dir contains the inode structure of the parent directory of the
+ *	new reflink.
+ *	Return 0 if permission is granted.
  *
  * Security hooks for file operations
  *
@@ -1415,6 +1423,7 @@ struct security_operations {
 	int (*inode_unlink) (struct inode *dir, struct dentry *dentry);
 	int (*inode_symlink) (struct inode *dir,
 			      struct dentry *dentry, const char *old_name);
+	int (*inode_reflink) (struct dentry *old_dentry, struct inode *dir);
 	int (*inode_mkdir) (struct inode *dir, struct dentry *dentry, int mode);
 	int (*inode_rmdir) (struct inode *dir, struct dentry *dentry);
 	int (*inode_mknod) (struct inode *dir, struct dentry *dentry,
@@ -1675,6 +1684,7 @@ int security_inode_link(struct dentry *old_dentry, struct inode *dir,
 int security_inode_unlink(struct inode *dir, struct dentry *dentry);
 int security_inode_symlink(struct inode *dir, struct dentry *dentry,
 			   const char *old_name);
+int security_inode_reflink(struct dentry *old_dentry, struct inode *dir);
 int security_inode_mkdir(struct inode *dir, struct dentry *dentry, int mode);
 int security_inode_rmdir(struct inode *dir, struct dentry *dentry);
 int security_inode_mknod(struct inode *dir, struct dentry *dentry, int mode, dev_t dev);
@@ -2056,6 +2066,12 @@ static inline int security_inode_symlink(struct inode *dir,
 	return 0;
 }
 
+static inline int security_inode_reflink(struct dentry *old_dentry,
+					 struct inode *dir)
+{
+	return 0;
+}
+
 static inline int security_inode_mkdir(struct inode *dir,
 					struct dentry *dentry,
 					int mode)
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 40617c1..35a8743 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -692,6 +692,8 @@ asmlinkage long sys_symlinkat(const char __user * oldname,
 			      int newdfd, const char __user * newname);
 asmlinkage long sys_linkat(int olddfd, const char __user *oldname,
 			   int newdfd, const char __user *newname, int flags);
+asmlinkage long sys_reflinkat(int olddfd, const char __user *oldname,
+			      int newdfd, const char __user *newname, int flags);
 asmlinkage long sys_renameat(int olddfd, const char __user * oldname,
 			     int newdfd, const char __user * newname);
 asmlinkage long sys_futimesat(int dfd, char __user *filename,
diff --git a/security/capability.c b/security/capability.c
index 21b6cea..3dcc4cc 100644
--- a/security/capability.c
+++ b/security/capability.c
@@ -172,6 +172,11 @@ static int cap_inode_symlink(struct inode *inode, struct dentry *dentry,
 	return 0;
 }
 
+static int cap_inode_reflink(struct dentry *old_dentry, struct inode *inode)
+{
+	return 0;
+}
+
 static int cap_inode_mkdir(struct inode *inode, struct dentry *dentry,
 			   int mask)
 {
@@ -905,6 +910,7 @@ void security_fixup_ops(struct security_operations *ops)
 	set_to_cap_if_null(ops, inode_link);
 	set_to_cap_if_null(ops, inode_unlink);
 	set_to_cap_if_null(ops, inode_symlink);
+	set_to_cap_if_null(ops, inode_reflink);
 	set_to_cap_if_null(ops, inode_mkdir);
 	set_to_cap_if_null(ops, inode_rmdir);
 	set_to_cap_if_null(ops, inode_mknod);
diff --git a/security/security.c b/security/security.c
index 5284255..70d0ac3 100644
--- a/security/security.c
+++ b/security/security.c
@@ -470,6 +470,13 @@ int security_inode_symlink(struct inode *dir, struct dentry *dentry,
 	return security_ops->inode_symlink(dir, dentry, old_name);
 }
 
+int security_inode_reflink(struct dentry *old_dentry, struct inode *dir)
+{
+	if (unlikely(IS_PRIVATE(old_dentry->d_inode)))
+		return 0;
+	return security_ops->inode_reflink(old_dentry, dir);
+}
+
 int security_inode_mkdir(struct inode *dir, struct dentry *dentry, int mode)
 {
 	if (unlikely(IS_PRIVATE(dir)))
-- 
1.6.1.3


-- 

"Three o'clock is always too late or too early for anything you
 want to do."
        - Jean-Paul Sartre

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127


* Re: [RFC] The reflink(2) system call v4.
  2009-05-11 20:40       ` [RFC] The reflink(2) system call v4 Joel Becker
@ 2009-05-11 22:27         ` James Morris
  2009-05-11 22:34           ` Joel Becker
  2009-05-12 12:01           ` Stephen Smalley
  2009-05-11 23:11         ` jim owens
                           ` (5 subsequent siblings)
  6 siblings, 2 replies; 151+ messages in thread
From: James Morris @ 2009-05-11 22:27 UTC (permalink / raw)
  To: Joel Becker
  Cc: jim owens, ocfs2-devel, viro, mtk.manpages, linux-security-module,
	linux-fsdevel

On Mon, 11 May 2009, Joel Becker wrote:

> and other security attributes (in all, I'm gonna call that the "security
> context") as well.  So I defined reflink() as such.  This meant

"security context" is a term associated with SELinux, so you may want to
use something like "security attributes" or "security state" to avoid 
confusing people.

> +	error = security_inode_reflink(old_dentry, dir);
> +	if (error)
> +		return error;

We'll need the new_dentry now, to set up new security state before the 
dentry is instantiated.

e.g. SELinux will need to perform some checks on the operation, then 
calculate a new security context for the new file.

- James
-- 
James Morris
<jmorris@namei.org>


* Re: [RFC] The reflink(2) system call v4.
  2009-05-11 22:27         ` James Morris
@ 2009-05-11 22:34           ` Joel Becker
  2009-05-12  1:12             ` James Morris
  2009-05-12 12:01           ` Stephen Smalley
  1 sibling, 1 reply; 151+ messages in thread
From: Joel Becker @ 2009-05-11 22:34 UTC (permalink / raw)
  To: James Morris
  Cc: jim owens, ocfs2-devel, viro, mtk.manpages, linux-security-module,
	linux-fsdevel

On Tue, May 12, 2009 at 08:27:17AM +1000, James Morris wrote:
> On Mon, 11 May 2009, Joel Becker wrote:
> 
> > and other security attributes (in all, I'm gonna call that the "security
> > context") as well.  So I defined reflink() as such.  This meant
> 
> "security context" is a term associated with SELinux, so you may want to
> use something like "security attributes" or "security state" to avoid 
> confusing people.

	Ok, I wondered if my brain had picked that out from somewhere.

> > +	error = security_inode_reflink(old_dentry, dir);
> > +	if (error)
> > +		return error;
> 
> We'll need the new_dentry now, to set up new security state before the 
> dentry is instantiated.
> 
> e.g. SELinux will need to perform some checks on the operation, then 
> calculate a new security context for the new file.

	Do I need to pass in preserve_security as well so SELinux knows
what the ownership check determined?

Joel

-- 

"Copy from one, it's plagiarism; copy from two, it's research."
        - Wilson Mizner



* Re: [RFC] The reflink(2) system call v4.
  2009-05-11 22:34           ` Joel Becker
@ 2009-05-12  1:12             ` James Morris
  2009-05-12 12:18               ` Stephen Smalley
  0 siblings, 1 reply; 151+ messages in thread
From: James Morris @ 2009-05-12  1:12 UTC (permalink / raw)
  To: Joel Becker
  Cc: jim owens, ocfs2-devel, viro, mtk.manpages, linux-security-module,
	linux-fsdevel

On Mon, 11 May 2009, Joel Becker wrote:

> > e.g. SELinux will need to perform some checks on the operation, then 
> > calculate a new security context for the new file.
> 
> 	Do I need to pass in preserve_security as well so SELinux knows
> what the ownership check determined?

Not for SELinux -- its security attributes are orthogonal to DAC, and it 
will perform its own checks on them.

Other LSMs should operate similarly (there is also the CAP_CHOWN check 
which the LSM may hook), although if not, the flag can be added later if 
required.


- James
-- 
James Morris
<jmorris@namei.org>


* Re: [RFC] The reflink(2) system call v4.
  2009-05-12  1:12             ` James Morris
@ 2009-05-12 12:18               ` Stephen Smalley
  2009-05-12 17:22                 ` Joel Becker
  0 siblings, 1 reply; 151+ messages in thread
From: Stephen Smalley @ 2009-05-12 12:18 UTC (permalink / raw)
  To: James Morris
  Cc: Joel Becker, jim owens, ocfs2-devel, viro, mtk.manpages,
	linux-security-module, linux-fsdevel

On Tue, 2009-05-12 at 11:12 +1000, James Morris wrote:
> On Mon, 11 May 2009, Joel Becker wrote:
> 
> > > e.g. SELinux will need to perform some checks on the operation, then 
> > > calculate a new security context for the new file.
> > 
> > 	Do I need to pass in preserve_security as well so SELinux knows
> > what the ownership check determined?
> 
> Not for SELinux -- its security attributes are orthogonal to DAC, and it 
> will perform its own checks on them.

Is preserve_security supposed to also control the preservation of the
SELinux security attribute (security.selinux extended attribute)?  I'd
expect that either we preserve all the security-relevant attributes or
none of them.  And if that is the case, then SELinux has to know about
preserve_security in order to know what the security context of the new
inode will be.  

Also, if you are going to automatically degrade reflink(2) behavior
based on the owner_or_cap test, then you ought to allow the same to be
true if the security module vetoes the attempt to preserve attributes.
Either DAC or MAC logic may say that security attributes cannot be
preserved.  Your current logic will only allow graceful degradation in
the DAC case, but the MAC case will remain a hard failure.

> Other LSMs should operate similarly (there is also the CAP_CHOWN check 
> which the LSM may hook), although if not, the flag can be added later if 
> required.
> 
> 
> - James
-- 
Stephen Smalley
National Security Agency


* Re: [RFC] The reflink(2) system call v4.
  2009-05-12 12:18               ` Stephen Smalley
@ 2009-05-12 17:22                 ` Joel Becker
  2009-05-12 17:32                   ` Stephen Smalley
  0 siblings, 1 reply; 151+ messages in thread
From: Joel Becker @ 2009-05-12 17:22 UTC (permalink / raw)
  To: Stephen Smalley
  Cc: James Morris, linux-fsdevel, linux-security-module, mtk.manpages,
	jim owens, ocfs2-devel, viro

On Tue, May 12, 2009 at 08:18:34AM -0400, Stephen Smalley wrote:
> On Tue, 2009-05-12 at 11:12 +1000, James Morris wrote:
> > On Mon, 11 May 2009, Joel Becker wrote:
> > 
> > > > e.g. SELinux will need to perform some checks on the operation, then 
> > > > calculate a new security context for the new file.
> > > 
> > > 	Do I need to pass in preserve_security as well so SELinux knows
> > > what the ownership check determined?
> > 
> > Not for SELinux -- its security attributes are orthogonal to DAC, and it 
> > will perform its own checks on them.
> 
> Is preserve_security supposed to also control the preservation of the
> SELinux security attribute (security.selinux extended attribute)?  I'd
> expect that either we preserve all the security-relevant attributes or
> none of them.  And if that is the case, then SELinux has to know about
> preserve_security in order to know what the security context of the new
> inode will be.  

	Thank you Stephen, you read my mind.  In the ocfs2 case, we're 
expecting to just reflink the extended attribute structures verbatim in
the preserve_security case.  So we would be ignoring whatever was set on
the new_dentry by security_inode_reflink().  This gets us the best CoW
sharing of the xattr extents, but I want to make sure that's "safe" in
the preserve_security case.

> Also, if you are going to automatically degrade reflink(2) behavior
> based on the owner_or_cap test, then you ought to allow the same to be
> true if the security module vetoes the attempt to preserve attributes.
> Either DAC or MAC logic may say that security attributes cannot be
> preserved.  Your current logic will only allow graceful degradation in
> the DAC case, but the MAC case will remain a hard failure.

	I did not think of this, and it's a very good point as well.  I'm
not sure how to have the return value of security_inode_reflink()
distinguish between "disallow the reflink" and "disallow
preserve_security".  But since !preserve_security requires read access
only, perhaps we move security_inode_reflink up higher and say:

	error = security_inode_reflink(old_dentry, dir);
	if (error)
		preserve_security = 0;

Here security_inode_reflink() does not need new_dentry, because it isn't
setting a security context.  If it's ok with the reflink, we'll be
copying the extended attribute.  If it's not OK, it falls through to the
inode_permission(inode, MAY_READ) check, which will check for plain old
read access.
	What do we think?

Joel

-- 

"Under capitalism, man exploits man.  Under Communism, it's just 
   the opposite."
				 - John Kenneth Galbraith



* Re: [RFC] The reflink(2) system call v4.
  2009-05-12 17:22                 ` Joel Becker
@ 2009-05-12 17:32                   ` Stephen Smalley
  2009-05-12 18:03                     ` Joel Becker
  0 siblings, 1 reply; 151+ messages in thread
From: Stephen Smalley @ 2009-05-12 17:32 UTC (permalink / raw)
  To: Joel Becker
  Cc: James Morris, jim owens, ocfs2-devel, viro, mtk.manpages,
	linux-security-module, linux-fsdevel

On Tue, 2009-05-12 at 10:22 -0700, Joel Becker wrote:
> On Tue, May 12, 2009 at 08:18:34AM -0400, Stephen Smalley wrote:
> > On Tue, 2009-05-12 at 11:12 +1000, James Morris wrote:
> > > On Mon, 11 May 2009, Joel Becker wrote:
> > > 
> > > > > e.g. SELinux will need to perform some checks on the operation, then 
> > > > > calculate a new security context for the new file.
> > > > 
> > > > 	Do I need to pass in preserve_security as well so SELinux knows
> > > > what the ownership check determined?
> > > 
> > > Not for SELinux -- its security attributes are orthogonal to DAC, and it 
> > > will perform its own checks on them.
> > 
> > Is preserve_security supposed to also control the preservation of the
> > SELinux security attribute (security.selinux extended attribute)?  I'd
> > expect that either we preserve all the security-relevant attributes or
> > none of them.  And if that is the case, then SELinux has to know about
> > preserve_security in order to know what the security context of the new
> > inode will be.  
> 
> 	Thank you Stephen, you read my mind.  In the ocfs2 case, we're 
> expecting to just reflink the extended attribute structures verbatim in
> the preserve_security case.

And in the preserve_security==0 case, you'll be calling
security_inode_init_security() in order to get the attribute name/value
pair to assign to the new inode just as in the normal file creation
case?

>   So we would be ignoring whatever was set on
> the new_dentry by security_inode_reflink().  This gets us the best CoW
> sharing of the xattr extents, but I want to make sure that's "safe" in
> the preserve_security case.

security_inode_reflink() can't handle the initialization regardless, as
the inode doesn't yet exist at that point.

> > Also, if you are going to automatically degrade reflink(2) behavior
> > based on the owner_or_cap test, then you ought to allow the same to be
> > true if the security module vetoes the attempt to preserve attributes.
> > Either DAC or MAC logic may say that security attributes cannot be
> > preserved.  Your current logic will only allow graceful degradation in
> > the DAC case, but the MAC case will remain a hard failure.
> 
> 	I did not think of this, and it's a very good point as well.  I'm
> not sure how to have the return value of security_inode_reflink()
> distinguish between "disallow the reflink" and "disallow
> preserve_security".  But since !preserve_security requires read access
> only, perhaps we move security_inode_reflink up higher and say:
> 
> 	error = security_inode_reflink(old_dentry, dir);
> 	if (error)
> 		preserve_security = 0;
> 
> Here security_inode_reflink() does not need new_dentry, because it isn't
> setting a security context.  If it's ok with the reflink, we'll be
> copying the extended attribute.  If it's not OK, it falls through to the
> inode_permission(inode, MAY_READ) check, which will check for plain old
> read access.
> 	What do we think?

I'd rather have two hooks, one to allow the security module to override
preserve_security and one to allow the security module to deny the
operation altogether.  The former hook only needs to be called if
preserve_security is not already cleared by the DAC logic.  The latter
hook needs to know the final verdict on preserve_security in order to
determine the right set of checks to apply, which isn't necessarily
limited to only checking read access.

But we don't need the new_dentry regardless.
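
A rough userspace sketch of the two-hook ordering suggested above, under
stated assumptions.  Neither hook exists in the posted patch; struct
lsm_policy, its fields, and reflink_security_flow() are hypothetical
names used only to model the control flow, and -EACCES stands in for
whatever error a real security module would return.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the answers a security module would give. */
struct lsm_policy {
	bool may_preserve;	/* first hook: may attributes be preserved? */
	bool allow_preserving;	/* second hook verdict if preserving */
	bool allow_degraded;	/* second hook verdict if reinitializing */
};

/*
 * Models the proposed ordering.  dac_preserve is the result of the
 * owner-or-CAP_CHOWN test.  Returns 1 if the reflink proceeds with
 * attributes preserved, 0 if it proceeds with reinitialized
 * attributes, or -EACCES if the operation is denied altogether.
 */
int reflink_security_flow(bool dac_preserve, struct lsm_policy policy)
{
	bool preserve = dac_preserve;

	/* First hook: only consulted if DAC has not already cleared it. */
	if (preserve && !policy.may_preserve)
		preserve = false;

	/* Second hook: sees the final verdict and may deny outright. */
	if (!(preserve ? policy.allow_preserving : policy.allow_degraded))
		return -EACCES;

	return preserve ? 1 : 0;
}
```

With this shape a MAC veto on preservation degrades the same way a DAC
veto does, while the second hook can still turn the operation into a
hard failure.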

-- 
Stephen Smalley
National Security Agency



* Re: [RFC] The reflink(2) system call v4.
  2009-05-12 17:32                   ` Stephen Smalley
@ 2009-05-12 18:03                     ` Joel Becker
  2009-05-12 18:04                       ` Stephen Smalley
  2009-05-13  1:47                       ` Casey Schaufler
  0 siblings, 2 replies; 151+ messages in thread
From: Joel Becker @ 2009-05-12 18:03 UTC (permalink / raw)
  To: Stephen Smalley
  Cc: James Morris, linux-fsdevel, linux-security-module, mtk.manpages,
	jim owens, ocfs2-devel, viro

On Tue, May 12, 2009 at 01:32:47PM -0400, Stephen Smalley wrote:
> On Tue, 2009-05-12 at 10:22 -0700, Joel Becker wrote:
> > On Tue, May 12, 2009 at 08:18:34AM -0400, Stephen Smalley wrote:
> > > Is preserve_security supposed to also control the preservation of the
> > > SELinux security attribute (security.selinux extended attribute)?  I'd
> > > expect that either we preserve all the security-relevant attributes or
> > > none of them.  And if that is the case, then SELinux has to know about
> > > preserve_security in order to know what the security context of the new
> > > inode will be.  
> > 
> > 	Thank you Stephen, you read my mind.  In the ocfs2 case, we're 
> > expecting to just reflink the extended attribute structures verbatim in
> > the preserve_security case.
> 
> And in the preserve_security==0 case, you'll be calling
> security_inode_init_security() in order to get the attribute name/value
> pair to assign to the new inode just as in the normal file creation
> case?

	Oh, absolutely.
	As an aside, do inodes ever have more than one security.*
attribute?  It would appear that security_inode_init_security() just
returns one attribute, but what if I had a system running under SMACK
and then changed to SELinux?  Would my (existing) inode then have
security.smack and security.selinux attributes?

> > > Also, if you are going to automatically degrade reflink(2) behavior
> > > based on the owner_or_cap test, then you ought to allow the same to be
> > > true if the security module vetoes the attempt to preserve attributes.
> > > Either DAC or MAC logic may say that security attributes cannot be
> > > preserved.  Your current logic will only allow graceful degradation in
> > > the DAC case, but the MAC case will remain a hard failure.
> > 
> > 	I did not think of this, and it's a very good point as well.  I'm
> > not sure how to have the return value of security_inode_reflink()
> > distinguish between "disallow the reflink" and "disallow
> > preserve_security".  But since !preserve_security requires read access
> > only, perhaps we move security_inode_reflink up higher and say:
> > 
> > 	error = security_inode_reflink(old_dentry, dir);
> > 	if (error)
> > 		preserve_security = 0;
> > 
> > Here security_inode_reflink() does not need new_dentry, because it isn't
> > setting a security context.  If it's ok with the reflink, we'll be
> > copying the extended attribute.  If it's not OK, it falls through to the
> > inode_permission(inode, MAY_READ) check, which will check for plain old
> > read access.
> > 	What do we think?
> 
> I'd rather have two hooks, one to allow the security module to override
> preserve_security and one to allow the security module to deny the
> operation altogether.  The former hook only needs to be called if
> preserve_security is not already cleared by the DAC logic.  The latter
> hook needs to know the final verdict on preserve_security in order to
> determine the right set of checks to apply, which isn't necessarily
> limited to only checking read access.

	Ok, is that two hooks or one hook with specific error returns?
I don't care, it's up to the LSM group.  I just can't come up with a
good distinguishing set of names if it's two hooks :-)

Joel

-- 

Life's Little Instruction Book #157 

	"Take time to smell the roses."

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [RFC] The reflink(2) system call v4.
  2009-05-12 18:03                     ` Joel Becker
@ 2009-05-12 18:04                       ` Stephen Smalley
  2009-05-12 18:28                         ` Joel Becker
  2009-05-14 18:06                         ` Stephen Smalley
  2009-05-13  1:47                       ` Casey Schaufler
  1 sibling, 2 replies; 151+ messages in thread
From: Stephen Smalley @ 2009-05-12 18:04 UTC (permalink / raw)
  To: Joel Becker
  Cc: James Morris, jim owens, ocfs2-devel, viro, mtk.manpages,
	linux-security-module, linux-fsdevel

On Tue, 2009-05-12 at 11:03 -0700, Joel Becker wrote:
> On Tue, May 12, 2009 at 01:32:47PM -0400, Stephen Smalley wrote:
> > On Tue, 2009-05-12 at 10:22 -0700, Joel Becker wrote:
> > > On Tue, May 12, 2009 at 08:18:34AM -0400, Stephen Smalley wrote:
> > > > Is preserve_security supposed to also control the preservation of the
> > > > SELinux security attribute (security.selinux extended attribute)?  I'd
> > > > expect that either we preserve all the security-relevant attributes or
> > > > none of them.  And if that is the case, then SELinux has to know about
> > > > preserve_security in order to know what the security context of the new
> > > > inode will be.  
> > > 
> > > 	Thank you Stephen, you read my mind.  In the ocfs2 case, we're 
> > > expecting to just reflink the extended attribute structures verbatim in
> > > the preserve_security case.
> > 
> > And in the preserve_security==0 case, you'll be calling
> > security_inode_init_security() in order to get the attribute name/value
> > pair to assign to the new inode just as in the normal file creation
> > case?
> 
> 	Oh, absolutely.
> 	As an aside, do inodes ever have more than one security.*
> attribute?  It would appear that security_inode_init_security() just
> returns one attribute, but what if I had a system running under SMACK
> and then changed to SELinux?  Would my (existing) inode then have
> security.smack and security.selinux attributes?

No, there would be no security.selinux attribute and the file would be
treated as having a well-defined 'unlabeled' attribute by SELinux.  Not
something you have to worry about.

> > > > Also, if you are going to automatically degrade reflink(2) behavior
> > > > based on the owner_or_cap test, then you ought to allow the same to be
> > > > true if the security module vetoes the attempt to preserve attributes.
> > > > Either DAC or MAC logic may say that security attributes cannot be
> > > > preserved.  Your current logic will only allow graceful degradation in
> > > > the DAC case, but the MAC case will remain a hard failure.
> > > 
> > > 	I did not think of this, and it's a very good point as well.  I'm
> > > not sure how to have the return value of security_inode_reflink()
> > > distinguish between "disallow the reflink" and "disallow
> > > preserve_security".  But since !preserve_security requires read access
> > > only, perhaps we move security_inode_reflink up higher and say:
> > > 
> > > 	error = security_inode_reflink(old_dentry, dir);
> > > 	if (error)
> > > 		preserve_security = 0;
> > > 
> > > Here security_inode_reflink() does not need new_dentry, because it isn't
> > > setting a security context.  If it's ok with the reflink, we'll be
> > > copying the extended attribute.  If it's not OK, it falls through to the
> > > inode_permission(inode, MAY_READ) check, which will check for plain old
> > > read access.
> > > 	What do we think?
> > 
> > I'd rather have two hooks, one to allow the security module to override
> > preserve_security and one to allow the security module to deny the
> > operation altogether.  The former hook only needs to be called if
> > preserve_security is not already cleared by the DAC logic.  The latter
> > hook needs to know the final verdict on preserve_security in order to
> > determine the right set of checks to apply, which isn't necessarily
> > limited to only checking read access.
> 
> 	Ok, is that two hooks or one hook with specific error returns?
> I don't care, it's up to the LSM group.  I just can't come up with a
> good distinguishing set of names if it's two hooks :-)

I suppose you could coalesce them into a single hook ala:
	error = security_inode_reflink(old_dentry, dir, &preserve_security);
	if (error)
		return (error);
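
A minimal user-space sketch of the coalesced-hook shape suggested above,
not kernel code.  The struct, its fields, and the model_* function names
are hypothetical stand-ins; only the overall flow (DAC may clear
preserve_security, the hook may clear it or deny outright, and the
non-preserving case falls back to a read check) comes from the thread.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the state the real checks would consult. */
struct reflink_request {
	bool owner_or_capable;     /* DAC: caller owns the file or is capable */
	bool mac_allows_reflink;   /* MAC: may the operation happen at all? */
	bool mac_allows_preserve;  /* MAC: may security attributes be kept? */
	bool has_read_access;      /* outcome of the plain read check */
};

/*
 * Model of the coalesced hook: return an error to deny the reflink
 * outright, or clear *preserve_security to degrade it gracefully.
 */
static int model_security_inode_reflink(const struct reflink_request *req,
					bool *preserve_security)
{
	assert(req != NULL && preserve_security != NULL);

	if (!req->mac_allows_reflink)
		return -EACCES;
	if (*preserve_security && !req->mac_allows_preserve)
		*preserve_security = false;
	return 0;
}

/* Model of the caller: DAC degradation, then the hook, then read access. */
static int model_vfs_reflink(const struct reflink_request *req,
			     bool *preserve_security)
{
	int error;

	if (*preserve_security && !req->owner_or_capable)
		*preserve_security = false;

	error = model_security_inode_reflink(req, preserve_security);
	if (error)
		return error;

	/* Only the non-preserving case depends on plain read access here. */
	if (!*preserve_security && !req->has_read_access)
		return -EACCES;
	return 0;
}
```

With this shape, a MAC veto on preservation degrades the call the same
way the DAC test does, while a MAC veto on the operation itself is still
a hard failure.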

-- 
Stephen Smalley
National Security Agency


^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [RFC] The reflink(2) system call v4.
  2009-05-12 18:04                       ` Stephen Smalley
@ 2009-05-12 18:28                         ` Joel Becker
  2009-05-12 18:37                           ` Stephen Smalley
  2009-05-14 18:06                         ` Stephen Smalley
  1 sibling, 1 reply; 151+ messages in thread
From: Joel Becker @ 2009-05-12 18:28 UTC (permalink / raw)
  To: Stephen Smalley
  Cc: James Morris, linux-fsdevel, linux-security-module, mtk.manpages,
	jim owens, ocfs2-devel, viro

On Tue, May 12, 2009 at 02:04:53PM -0400, Stephen Smalley wrote:
> On Tue, 2009-05-12 at 11:03 -0700, Joel Becker wrote:
> > 	As an aside, do inodes ever have more than one security.*
> > attribute?  It would appear that security_inode_init_security() just
> > returns one attribute, but what if I had a system running under SMACK
> > and then changed to SELinux?  Would my (existing) inode then have
> > security.smack and security.selinux attributes?
> 
> No, there would be no security.selinux attribute and the file would be
> treated as having a well-defined 'unlabeled' attribute by SELinux.  Not
> something you have to worry about.

	Even if I've run restorecon?  Basically, I'm trying to understand
if, in the !preserve_security case, ocfs2 can just do "link up the
existing xattrs, then set whatever we got from
security_inode_init_security()", or if we have to go through and delete
all security.* attributes before installing the result of
security_inode_init_security().

> > > I'd rather have two hooks, one to allow the security module to override
> > > preserve_security and one to allow the security module to deny the
> > > operation altogether.  The former hook only needs to be called if
> > > preserve_security is not already cleared by the DAC logic.  The latter
> > > hook needs to know the final verdict on preserve_security in order to
> > > determine the right set of checks to apply, which isn't necessarily
> > > limited to only checking read access.
> > 
> > 	Ok, is that two hooks or one hook with specific error returns?
> > I don't care, it's up to the LSM group.  I just can't come up with a
> > good distinguishing set of names if it's two hooks :-)
> 
> I suppose you could coalesce them into a single hook ala:
> 	error = security_inode_reflink(old_dentry, dir, &preserve_security);
> 	if (error)
> 		return (error);

	Whatever fits in with the LSM convention.  That's more important
than one-hook-vs-two.

Joel

-- 

"Gone to plant a weeping willow
 On the bank's green edge it will roll, roll, roll.
 Sing a lullaby beside the waters.
 Lovers come and go, the river roll, roll, rolls."

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [RFC] The reflink(2) system call v4.
  2009-05-12 18:28                         ` Joel Becker
@ 2009-05-12 18:37                           ` Stephen Smalley
  0 siblings, 0 replies; 151+ messages in thread
From: Stephen Smalley @ 2009-05-12 18:37 UTC (permalink / raw)
  To: Joel Becker
  Cc: James Morris, jim owens, ocfs2-devel, viro, mtk.manpages,
	linux-security-module, linux-fsdevel

On Tue, 2009-05-12 at 11:28 -0700, Joel Becker wrote:
> On Tue, May 12, 2009 at 02:04:53PM -0400, Stephen Smalley wrote:
> > On Tue, 2009-05-12 at 11:03 -0700, Joel Becker wrote:
> > > 	As an aside, do inodes ever have more than one security.*
> > > attribute?  It would appear that security_inode_init_security() just
> > > returns one attribute, but what if I had a system running under SMACK
> > > and then changed to SELinux?  Would my (existing) inode then have
> > > security.smack and security.selinux attributes?
> > 
> > No, there would be no security.selinux attribute and the file would be
> > treated as having a well-defined 'unlabeled' attribute by SELinux.  Not
> > something you have to worry about.
> 
> 	Even if I've run restorecon?  Basically, I'm trying to understand
> if, in the !preserve_security case, ocfs2 can just do "link up the
> existing xattrs, then set whatever we got from
> security_inode_init_security()", or if we have to go through and delete
> all security.* attributes before installing the result of
> security_inode_init_security().

Likely a better example would be file capabilities
(security.capability), as you might be using those simultaneously with
SELinux (security.selinux).

security_inode_init_security() is only going to return security.selinux,
as new files don't get any file capabilities assigned by default.  I
guess you would want to delete security.capability from the reflink if
preserve_security==0.
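
A rough user-space sketch of that xattr handling, under the assumption
that every security.* attribute is dropped in the non-preserving case
and the single attribute obtained from security_inode_init_security()
is installed afterwards.  The model_xattr type and both function names
are made up for illustration; a real filesystem would operate on its
own on-disk xattr structures.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical in-memory view of one extended attribute. */
struct model_xattr {
	const char *name;
	const char *value;
};

static int is_security_xattr(const char *name)
{
	return strncmp(name, "security.", strlen("security.")) == 0;
}

/*
 * Fill dst with the xattrs for the new inode.  When preserving, copy
 * everything verbatim.  Otherwise skip all security.* entries and then
 * append init_attr (the freshly computed label), if one was supplied.
 * Returns the number of entries written, or -1 if dst is too small.
 */
static int build_reflink_xattrs(const struct model_xattr *src, size_t n,
				int preserve_security,
				const struct model_xattr *init_attr,
				struct model_xattr *dst, size_t dst_cap)
{
	size_t i, out = 0;

	assert(dst != NULL);

	for (i = 0; i < n; i++) {
		if (!preserve_security && is_security_xattr(src[i].name))
			continue;
		if (out == dst_cap)
			return -1;
		dst[out++] = src[i];
	}
	if (!preserve_security && init_attr != NULL) {
		if (out == dst_cap)
			return -1;
		dst[out++] = *init_attr;
	}
	return (int)out;
}
```

Filtering by prefix rather than by a fixed list means a stale
security.capability (or any other leftover security.* entry) never
reaches the new inode when preserve_security==0.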

> > > > I'd rather have two hooks, one to allow the security module to override
> > > > preserve_security and one to allow the security module to deny the
> > > > operation altogether.  The former hook only needs to be called if
> > > > preserve_security is not already cleared by the DAC logic.  The latter
> > > > hook needs to know the final verdict on preserve_security in order to
> > > > determine the right set of checks to apply, which isn't necessarily
> > > > limited to only checking read access.
> > > 
> > > 	Ok, is that two hooks or one hook with specific error returns?
> > > I don't care, it's up to the LSM group.  I just can't come up with a
> > > good distinguishing set of names if it's two hooks :-)
> > 
> > I suppose you could coalesce them into a single hook ala:
> > 	error = security_inode_reflink(old_dentry, dir, &preserve_security);
> > 	if (error)
> > 		return (error);
> 
> 	Whatever fits in with the LSM convention.  That's more important
> than one-hook-vs-two.

I think that the above example fits with the LSM convention.

-- 
Stephen Smalley
National Security Agency


^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [RFC] The reflink(2) system call v4.
  2009-05-12 18:04                       ` Stephen Smalley
  2009-05-12 18:28                         ` Joel Becker
@ 2009-05-14 18:06                         ` Stephen Smalley
  2009-05-14 18:25                           ` Stephen Smalley
  1 sibling, 1 reply; 151+ messages in thread
From: Stephen Smalley @ 2009-05-14 18:06 UTC (permalink / raw)
  To: Joel Becker
  Cc: James Morris, jim owens, ocfs2-devel, viro, mtk.manpages,
	linux-security-module, linux-fsdevel

On Tue, 2009-05-12 at 14:04 -0400, Stephen Smalley wrote:
> On Tue, 2009-05-12 at 11:03 -0700, Joel Becker wrote:
> > On Tue, May 12, 2009 at 01:32:47PM -0400, Stephen Smalley wrote:
> > > On Tue, 2009-05-12 at 10:22 -0700, Joel Becker wrote:
> > > > On Tue, May 12, 2009 at 08:18:34AM -0400, Stephen Smalley wrote:
> > > > > Is preserve_security supposed to also control the preservation of the
> > > > > SELinux security attribute (security.selinux extended attribute)?  I'd
> > > > > expect that either we preserve all the security-relevant attributes or
> > > > > none of them.  And if that is the case, then SELinux has to know about
> > > > > preserve_security in order to know what the security context of the new
> > > > > inode will be.  
> > > > 
> > > > 	Thank you Stephen, you read my mind.  In the ocfs2 case, we're 
> > > > expecting to just reflink the extended attribute structures verbatim in
> > > > the preserve_security case.
> > > 
> > > And in the preserve_security==0 case, you'll be calling
> > > security_inode_init_security() in order to get the attribute name/value
> > > pair to assign to the new inode just as in the normal file creation
> > > case?
> > 
> > 	Oh, absolutely.
> > 	As an aside, do inodes ever have more than one security.*
> > attribute?  It would appear that security_inode_init_security() just
> > returns one attribute, but what if I had a system running under SMACK
> > and then changed to SELinux?  Would my (existing) inode then have
> > security.smack and security.selinux attributes?
> 
> No, there would be no security.selinux attribute and the file would be
> treated as having a well-defined 'unlabeled' attribute by SELinux.  Not
> something you have to worry about.
> 
> > > > > Also, if you are going to automatically degrade reflink(2) behavior
> > > > > based on the owner_or_cap test, then you ought to allow the same to be
> > > > > true if the security module vetoes the attempt to preserve attributes.
> > > > > Either DAC or MAC logic may say that security attributes cannot be
> > > > > preserved.  Your current logic will only allow graceful degradation in
> > > > > the DAC case, but the MAC case will remain a hard failure.
> > > > 
> > > > 	I did not think of this, and it's a very good point as well.  I'm
> > > > not sure how to have the return value of security_inode_reflink()
> > > > distinguish between "disallow the reflink" and "disallow
> > > > preserve_security".  But since !preserve_security requires read access
> > > > only, perhaps we move security_inode_reflink up higher and say:
> > > > 
> > > > 	error = security_inode_reflink(old_dentry, dir);
> > > > 	if (error)
> > > > 		preserve_security = 0;
> > > > 
> > > > Here security_inode_reflink() does not need new_dentry, because it isn't
> > > > setting a security context.  If it's ok with the reflink, we'll be
> > > > copying the extended attribute.  If it's not OK, it falls through to the
> > > > inode_permission(inode, MAY_READ) check, which will check for plain old
> > > > read access.
> > > > 	What do we think?
> > > 
> > > I'd rather have two hooks, one to allow the security module to override
> > > preserve_security and one to allow the security module to deny the
> > > operation altogether.  The former hook only needs to be called if
> > > preserve_security is not already cleared by the DAC logic.  The latter
> > > hook needs to know the final verdict on preserve_security in order to
> > > determine the right set of checks to apply, which isn't necessarily
> > > limited to only checking read access.
> > 
> > 	Ok, is that two hooks or one hook with specific error returns?
> > I don't care, it's up to the LSM group.  I just can't come up with a
> > good distinguishing set of names if it's two hooks :-)
> 
> I suppose you could coalesce them into a single hook ala:
> 	error = security_inode_reflink(old_dentry, dir, &preserve_security);
> 	if (error)
> 		return (error);

On second thought (agreeing with Andy about making the interface
explicit wrt preserve_security), I don't expect us to ever override
preserve_security from SELinux, so you can just pass it in by value.

-- 
Stephen Smalley
National Security Agency


^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [RFC] The reflink(2) system call v4.
  2009-05-14 18:06                         ` Stephen Smalley
@ 2009-05-14 18:25                           ` Stephen Smalley
  2009-05-14 23:25                             ` James Morris
  0 siblings, 1 reply; 151+ messages in thread
From: Stephen Smalley @ 2009-05-14 18:25 UTC (permalink / raw)
  To: Joel Becker
  Cc: James Morris, jim owens, ocfs2-devel, viro, mtk.manpages,
	linux-security-module, linux-fsdevel

On Thu, 2009-05-14 at 14:06 -0400, Stephen Smalley wrote:
> On Tue, 2009-05-12 at 14:04 -0400, Stephen Smalley wrote:
> > On Tue, 2009-05-12 at 11:03 -0700, Joel Becker wrote:
> > > On Tue, May 12, 2009 at 01:32:47PM -0400, Stephen Smalley wrote:
> > > > On Tue, 2009-05-12 at 10:22 -0700, Joel Becker wrote:
> > > > > On Tue, May 12, 2009 at 08:18:34AM -0400, Stephen Smalley wrote:
> > > > > > Is preserve_security supposed to also control the preservation of the
> > > > > > SELinux security attribute (security.selinux extended attribute)?  I'd
> > > > > > expect that either we preserve all the security-relevant attributes or
> > > > > > none of them.  And if that is the case, then SELinux has to know about
> > > > > > preserve_security in order to know what the security context of the new
> > > > > > inode will be.  
> > > > > 
> > > > > 	Thank you Stephen, you read my mind.  In the ocfs2 case, we're 
> > > > > expecting to just reflink the extended attribute structures verbatim in
> > > > > the preserve_security case.
> > > > 
> > > > And in the preserve_security==0 case, you'll be calling
> > > > security_inode_init_security() in order to get the attribute name/value
> > > > pair to assign to the new inode just as in the normal file creation
> > > > case?
> > > 
> > > 	Oh, absolutely.
> > > 	As an aside, do inodes ever have more than one security.*
> > > attribute?  It would appear that security_inode_init_security() just
> > > returns one attribute, but what if I had a system running under SMACK
> > > and then changed to SELinux?  Would my (existing) inode then have
> > > security.smack and security.selinux attributes?
> > 
> > No, there would be no security.selinux attribute and the file would be
> > treated as having a well-defined 'unlabeled' attribute by SELinux.  Not
> > something you have to worry about.
> > 
> > > > > > Also, if you are going to automatically degrade reflink(2) behavior
> > > > > > based on the owner_or_cap test, then you ought to allow the same to be
> > > > > > true if the security module vetoes the attempt to preserve attributes.
> > > > > > Either DAC or MAC logic may say that security attributes cannot be
> > > > > > preserved.  Your current logic will only allow graceful degradation in
> > > > > > the DAC case, but the MAC case will remain a hard failure.
> > > > > 
> > > > > 	I did not think of this, and it's a very good point as well.  I'm
> > > > > not sure how to have the return value of security_inode_reflink()
> > > > > distinguish between "disallow the reflink" and "disallow
> > > > > preserve_security".  But since !preserve_security requires read access
> > > > > only, perhaps we move security_inode_reflink up higher and say:
> > > > > 
> > > > > 	error = security_inode_reflink(old_dentry, dir);
> > > > > 	if (error)
> > > > > 		preserve_security = 0;
> > > > > 
> > > > > Here security_inode_reflink() does not need new_dentry, because it isn't
> > > > > setting a security context.  If it's ok with the reflink, we'll be
> > > > > copying the extended attribute.  If it's not OK, it falls through to the
> > > > > inode_permission(inode, MAY_READ) check, which will check for plain old
> > > > > read access.
> > > > > 	What do we think?
> > > > 
> > > > I'd rather have two hooks, one to allow the security module to override
> > > > preserve_security and one to allow the security module to deny the
> > > > operation altogether.  The former hook only needs to be called if
> > > > preserve_security is not already cleared by the DAC logic.  The latter
> > > > hook needs to know the final verdict on preserve_security in order to
> > > > determine the right set of checks to apply, which isn't necessarily
> > > > limited to only checking read access.
> > > 
> > > 	Ok, is that two hooks or one hook with specific error returns?
> > > I don't care, it's up to the LSM group.  I just can't come up with a
> > > good distinguishing set of names if it's two hooks :-)
> > 
> > I suppose you could coalesce them into a single hook ala:
> > 	error = security_inode_reflink(old_dentry, dir, &preserve_security);
> > 	if (error)
> > 		return (error);
> 
> On second thought (agreeing with Andy about making the interface
> explicit wrt preserve_security), I don't expect us to ever override
> preserve_security from SELinux, so you can just pass it in by value.

And you can likely make preserve_security a simple bool (set from some
caller-provided flag) rather than an int.  At which point the SELinux
wiring for the new hook would be something like this:

If we are preserving security attributes on the reflink, then treat it
like creating a link to an existing file; else treat it like creating a
new file.  Read access will also be checked in the non-preserving case
by virtue of the separate inode_permission call.

diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index 2fcad7c..20ef414 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -2667,6 +2667,17 @@ static int selinux_inode_symlink(struct inode *dir, struct dentry *dentry, const
 	return may_create(dir, dentry, SECCLASS_LNK_FILE);
 }
 
+static int selinux_inode_reflink(struct dentry *dentry, struct inode *dir,
+				bool preserve_security)
+{
+	struct inode_security_struct *isec = dentry->d_inode->i_security;
+
+	if (preserve_security)
+		return may_link(dir, dentry, MAY_LINK);
+	else
+		return may_create(dir, dentry, isec->sclass);
+}
+
 static int selinux_inode_mkdir(struct inode *dir, struct dentry *dentry, int mask)
 {
 	return may_create(dir, dentry, SECCLASS_DIR);
@@ -5357,6 +5368,7 @@ static struct security_operations selinux_ops = {
 	.inode_link =			selinux_inode_link,
 	.inode_unlink =			selinux_inode_unlink,
 	.inode_symlink =		selinux_inode_symlink,
+	.inode_reflink =		selinux_inode_reflink,
 	.inode_mkdir =			selinux_inode_mkdir,
 	.inode_rmdir =			selinux_inode_rmdir,
 	.inode_mknod =			selinux_inode_mknod,



-- 
Stephen Smalley
National Security Agency


^ permalink raw reply related	[flat|nested] 151+ messages in thread

* Re: [RFC] The reflink(2) system call v4.
  2009-05-14 18:25                           ` Stephen Smalley
@ 2009-05-14 23:25                             ` James Morris
  2009-05-15 11:54                               ` Stephen Smalley
  0 siblings, 1 reply; 151+ messages in thread
From: James Morris @ 2009-05-14 23:25 UTC (permalink / raw)
  To: Stephen Smalley
  Cc: Joel Becker, jim owens, ocfs2-devel, viro, mtk.manpages,
	linux-security-module, linux-fsdevel

On Thu, 14 May 2009, Stephen Smalley wrote:

> And you can likely make preserve_security a simple bool (set from some
> caller-provided flag) rather than an int.  At which point the SELinux
> wiring for the new hook would be something like this:
> 
> If we are preserving security attributes on the reflink, then treat it
> like creating a link to an existing file;

Do we also need to somewhat consider it like a new file? e.g. in the case 
of create_sid being set (if different to the existing security attribute), 
I believe we need to fail the operation because security attributes are 
not preserved, and also decide which error code to return (the user may be 
confused if it's EACCES -- EINVAL might be better).  Similar for reflinks 
on a context mounted file system, although create_sid needs to be checked 
during inode instantiation (unless we, say, set a preserve_sid flag 
which overrides create_sid and is cleared upon use).

- James
-- 
James Morris
<jmorris@namei.org>

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [RFC] The reflink(2) system call v4.
  2009-05-14 23:25                             ` James Morris
@ 2009-05-15 11:54                               ` Stephen Smalley
  2009-05-15 13:35                                 ` James Morris
  0 siblings, 1 reply; 151+ messages in thread
From: Stephen Smalley @ 2009-05-15 11:54 UTC (permalink / raw)
  To: James Morris
  Cc: Joel Becker, jim owens, ocfs2-devel, viro, mtk.manpages,
	linux-security-module, linux-fsdevel

On Fri, 2009-05-15 at 09:25 +1000, James Morris wrote:
> On Thu, 14 May 2009, Stephen Smalley wrote:
> 
> > And you can likely make preserve_security a simple bool (set from some
> > caller-provided flag) rather than an int.  At which point the SELinux
> > wiring for the new hook would be something like this:
> > 
> > If we are preserving security attributes on the reflink, then treat it
> > like creating a link to an existing file;
> 
> Do we also need to somewhat consider it like a new file? e.g. in the case 
> of create_sid being set (if different to the existing security attribute), 
> I believe we need to fail the operation because security attributes are 
> not preserved, and also decide which error code to return (the user may be 
> confused if it's EACCES -- EINVAL might be better).  Similar for reflinks 
> on a context mounted file system, although create_sid needs to be checked 
> during inode instantiation (unless we, say, set a preserve_sid flag 
> which overrides create_sid and is cleared upon use).

The create_sid is not relevant in the preserve_security==1 case; the
filesystem will always preserve the security context from the original
inode on the new inode in that case.  The create_sid won't ever be used
in that case, as it only gets applied if the filesystem calls
security_inode_init_security() to obtain the attribute (name, value)
pair for a new inode, and the filesystem will only do that in the
preserve_security==0 case.

-- 
Stephen Smalley
National Security Agency


^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [RFC] The reflink(2) system call v4.
  2009-05-15 11:54                               ` Stephen Smalley
@ 2009-05-15 13:35                                 ` James Morris
  2009-05-15 15:44                                   ` Stephen Smalley
  0 siblings, 1 reply; 151+ messages in thread
From: James Morris @ 2009-05-15 13:35 UTC (permalink / raw)
  To: Stephen Smalley
  Cc: Joel Becker, jim owens, ocfs2-devel, viro, mtk.manpages,
	linux-security-module, linux-fsdevel

On Fri, 15 May 2009, Stephen Smalley wrote:

> The create_sid is not relevant in the preserve_security==1 case; the
> filesystem will always preserve the security context from the original
> inode on the new inode in that case.  The create_sid won't ever be used
> in that case, as it only gets applied if the filesystem calls
> security_inode_init_security() to obtain the attribute (name, value)
> pair for a new inode, and the filesystem will only do that in the
> preserve_security==0 case.

Ok.  Does this break the idea of create_sid, though?  i.e. it will be 
ignored when a new file is created via reflink(), potentially allowing DAC 
to determine whether MAC labeling policy is enforced, and is also not 
consistent with the way fsuid is handled.


- James
-- 
James Morris
<jmorris@namei.org>

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [RFC] The reflink(2) system call v4.
  2009-05-15 13:35                                 ` James Morris
@ 2009-05-15 15:44                                   ` Stephen Smalley
  0 siblings, 0 replies; 151+ messages in thread
From: Stephen Smalley @ 2009-05-15 15:44 UTC (permalink / raw)
  To: James Morris
  Cc: Joel Becker, jim owens, ocfs2-devel, viro, mtk.manpages,
	linux-security-module, linux-fsdevel

On Fri, 2009-05-15 at 23:35 +1000, James Morris wrote:
> On Fri, 15 May 2009, Stephen Smalley wrote:
> 
> > The create_sid is not relevant in the preserve_security==1 case; the
> > filesystem will always preserve the security context from the original
> > inode on the new inode in that case.  The create_sid won't ever be used
> > in that case, as it only gets applied if the filesystem calls
> > security_inode_init_security() to obtain the attribute (name, value)
> > pair for a new inode, and the filesystem will only do that in the
> > preserve_security==0 case.
> 
> Ok.  Does this break the idea of create_sid, though?  i.e. it will be 
> ignored when a new file is created via reflink(), potentially allowing DAC 
> to determine whether MAC labeling policy is enforced, and is also not 
> consistent with the way fsuid is handled.

I think it is consistent with the planned uid handling for reflink (if
preserve_security==1, then the new inode gets the uid of the original
inode; else the new inode gets the fsuid of the creating process).

create_sid is a "discretionary" mechanism - the application supplies the
value via setfscreatecon(3), subject to a policy check (the file create
check).  Applications only expect the create_sid to be applied on normal
file creations (and even there, it may not happen due to context mounts
or filesystems that do not support labeling), so we aren't bound to that
behavior for reflink.
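
As an illustration only, the uid and label selection described above can
be modeled in user space as follows.  The model_inode and model_task
types, their fields, and the function name are hypothetical; the
"default_label" field stands in for whatever label policy would compute
when no create_sid has been set.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified views of an inode and the calling task. */
struct model_inode {
	unsigned int uid;
	const char *label;
};

struct model_task {
	unsigned int fsuid;
	const char *create_label;   /* NULL when no create_sid was set */
	const char *default_label;  /* policy-computed fallback (assumed) */
};

/*
 * preserve_security: the new inode inherits uid and label from the
 * original, and the task's create label is never consulted.
 * Otherwise: behave like a normal creation, using fsuid and the create
 * label if one is set, else the default.
 */
static struct model_inode model_reflink_identity(const struct model_inode *orig,
						 const struct model_task *task,
						 bool preserve_security)
{
	struct model_inode new_inode;

	assert(orig != NULL && task != NULL);

	if (preserve_security) {
		new_inode = *orig;
	} else {
		new_inode.uid = task->fsuid;
		new_inode.label = task->create_label ? task->create_label
						     : task->default_label;
	}
	return new_inode;
}
```

The two branches are symmetric for uid and label, which is the sense in
which ignoring create_sid in the preserving case matches the planned
uid handling.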

The MAC policy is enforced based on the permission checks, not the
create_sid, so the only question is whether it is sufficient to check
link permission for reflink(2) in the attribute-preserving case or
whether we should add a new permission for it.  We don't want to reuse
the create permission for reflink(2) in the attribute-preserving case
due to the difference in semantics between a reflink and a normal file
creation.  The result of a reflink(2) will look identical to the result
of a link(2) except that it will have its own inode and thus a different
inode number, link count, and ctime.

-- 
Stephen Smalley
National Security Agency

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [RFC] The reflink(2) system call v4.
  2009-05-12 18:03                     ` Joel Becker
  2009-05-12 18:04                       ` Stephen Smalley
@ 2009-05-13  1:47                       ` Casey Schaufler
  2009-05-13 16:43                         ` Joel Becker
  1 sibling, 1 reply; 151+ messages in thread
From: Casey Schaufler @ 2009-05-13  1:47 UTC (permalink / raw)
  To: Stephen Smalley, James Morris, jim owens, ocfs2-devel, viro,
	mtk.manpages, linux-se

Joel Becker wrote:
> On Tue, May 12, 2009 at 01:32:47PM -0400, Stephen Smalley wrote:
>   
>> On Tue, 2009-05-12 at 10:22 -0700, Joel Becker wrote:
>>     
>>> On Tue, May 12, 2009 at 08:18:34AM -0400, Stephen Smalley wrote:
>>>       
>>>> Is preserve_security supposed to also control the preservation of the
>>>> SELinux security attribute (security.selinux extended attribute)?  I'd
>>>> expect that either we preserve all the security-relevant attributes or
>>>> none of them.  And if that is the case, then SELinux has to know about
>>>> preserve_security in order to know what the security context of the new
>>>> inode will be.  
>>>>         
>>> 	Thank you Stephen, you read my mind.  In the ocfs2 case, we're 
>>> expecting to just reflink the extended attribute structures verbatim in
>>> the preserve_security case.
>>>       
>> And in the preserve_security==0 case, you'll be calling
>> security_inode_init_security() in order to get the attribute name/value
>> pair to assign to the new inode just as in the normal file creation
>> case?
>>     
>
> 	Oh, absolutely.
> 	As an aside, do inodes ever have more than one security.*
> attribute?

ACLs, capability sets and Smack labels can all exist on a file at
the same time. I know of at least one effort underway to create a
multiple-label LSM.

>   It would appear that security_inode_init_security() just
> returns one attribute, but what if I had a system running under SMACK
> and then changed to SELinux?  

The Smack attribute would hang around, it would just be unused.


> Would my (existing) inode then have
> security.smack and security.selinux attributes?
>   

Yup. It happens all the time. Whenever someone converts a Fedora
system to Smack they end up with a filesystem full of unused selinux
labels. It does no harm.

>   
>>>> Also, if you are going to automatically degrade reflink(2) behavior
>>>> based on the owner_or_cap test, then you ought to allow the same to be
>>>> true if the security module vetoes the attempt to preserve attributes.
>>>> Either DAC or MAC logic may say that security attributes cannot be
>>>> preserved.  Your current logic will only allow graceful degradation in
>>>> the DAC case, but the MAC case will remain a hard failure.
>>>>         
>>> 	I did not think of this, and it's a very good point as well.  I'm
>>> not sure how to have the return value of security_inode_reflink()
>>> distinguish between "disallow the reflink" and "disallow
>>> preserve_security".  But since !preserve_security requires read access
>>> only, perhaps we move security_inode_reflink up higher and say:
>>>
>>> 	error = security_inode_reflink(old_dentry, dir);
>>> 	if (error)
>>> 		preserve_security = 0;
>>>
>>> Here security_inode_reflink() does not need new_dentry, because it isn't
>>> setting a security context.  If it's ok with the reflink, we'll be
>>> copying the extended attribute.  If it's not OK, it falls through to the
>>> inode_permission(inode, MAY_READ) check, which will check for plain old
>>> read access.
>>> 	What do we think?
>>>       
>> I'd rather have two hooks, one to allow the security module to override
>> preserve_security and one to allow the security module to deny the
>> operation altogether.  The former hook only needs to be called if
>> preserve_security is not already cleared by the DAC logic.  The latter
>> hook needs to know the final verdict on preserve_security in order to
>> determine the right set of checks to apply, which isn't necessarily
>> limited to only checking read access.
>>     
>
> 	Ok, is that two hooks or one hook with specific error returns?
> I don't care, it's up to the LSM group.  I just can't come up with a
> good distinguishing set of names if it's two hooks :-)
>
> Joel
>
>   

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [RFC] The reflink(2) system call v4.
  2009-05-13  1:47                       ` Casey Schaufler
@ 2009-05-13 16:43                         ` Joel Becker
  2009-05-13 17:23                           ` Stephen Smalley
  0 siblings, 1 reply; 151+ messages in thread
From: Joel Becker @ 2009-05-13 16:43 UTC (permalink / raw)
  To: Casey Schaufler
  Cc: James Morris, linux-fsdevel, linux-security-module, mtk.manpages,
	jim owens, Stephen Smalley, ocfs2-devel, viro

On Tue, May 12, 2009 at 06:47:04PM -0700, Casey Schaufler wrote:
> Joel Becker wrote:
> > 	Oh, absolutely.
> > 	As an aside, do inodes ever have more than one security.*
> > attribute?
> 
> ACLs, capability sets and Smack labels can all exist on a file at
> the same time. I know of at least one effort underway to create a
> multiple-label LSM.

	So ACLs and cap sets live under security.*?  That's good.

> > Would my (existing) inode then have
> > security.smack and security.selinux attributes?
> >   
> 
> Yup. It happens all the time. Whenever someone converts a Fedora
> system to Smack they end up with a filesystem full of unused selinux
> labels. It does no harm.

	At that runtime, sure.  But with reflink(), we may be reflinking
someone else's inode, and if we have to drop its security state, we
should clean the unused labels just in case they go back to selinux (or
back to smack, etc).  But if they are all under security.*, it's easy to
do.

Thanks!
Joel

-- 

Life's Little Instruction Book #173

	"Be kinder than necessary."

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [RFC] The reflink(2) system call v4.
  2009-05-13 16:43                         ` Joel Becker
@ 2009-05-13 17:23                           ` Stephen Smalley
  2009-05-13 18:27                             ` Joel Becker
  0 siblings, 1 reply; 151+ messages in thread
From: Stephen Smalley @ 2009-05-13 17:23 UTC (permalink / raw)
  To: Joel Becker
  Cc: Casey Schaufler, James Morris, jim owens, ocfs2-devel, viro,
	mtk.manpages, linux-security-module, linux-fsdevel

On Wed, 2009-05-13 at 09:43 -0700, Joel Becker wrote:
> On Tue, May 12, 2009 at 06:47:04PM -0700, Casey Schaufler wrote:
> > Joel Becker wrote:
> > > 	Oh, absolutely.
> > > 	As an aside, do inodes ever have more than one security.*
> > > attribute?
> > 
> > ACLs, capability sets and Smack labels can all exist on a file at
> > the same time. I know of at least one effort underway to create a
> > multiple-label LSM.
> 
> 	So ACLs and cap sets live under security.*?  That's good.

File capabilities live under security.*, but ACLs predate the security
namespace and live in the system namespace as
"system.posix_acl_access" (and if a directory, there is also a
"system.posix_acl_default" attribute that specifies the default ACL for
new files in that directory).

In the preserve_security==0 case, you'd want to:
- drop all attributes under security.* on the new inode,
- set (security.<name>, value) to the name:value pair provided by
security_inode_init_security(),
- set system.posix_acl_access to the default ACL associated with the
parent directory (the "system.posix_acl_default" attribute on the
parent).

The latter two steps are what is already done in the new inode creation
code path, so you hopefully can just reuse that code.
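
A minimal userspace sketch of the first step only (dropping everything
under security.*), not kernel code.  The helper names
is_security_xattr() and drop_security_xattrs() are made up for
illustration, and the attribute list is assumed to be a plain array of
names rather than any real filesystem structure:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SECURITY_PREFIX "security."

/* Return 1 if the xattr name lives in the security.* namespace. */
static int is_security_xattr(const char *name)
{
	assert(name != NULL);
	return strncmp(name, SECURITY_PREFIX, strlen(SECURITY_PREFIX)) == 0;
}

/*
 * Hypothetical helper: compact names[] in place so that only the
 * attributes outside security.* remain.  Returns how many were kept.
 */
static size_t drop_security_xattrs(const char **names, size_t count)
{
	size_t kept = 0;

	for (size_t i = 0; i < count; i++) {
		if (!is_security_xattr(names[i]))
			names[kept++] = names[i];
	}
	return kept;
}
```

The other two steps (asking the security module for the new
name/value pair and applying the parent's default ACL) are left to the
existing new-inode code path and are not modeled here.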

> > > Would my (existing) inode then have
> > > security.smack and security.selinux attributes?
> > >   
> > 
> > Yup. It happens all the time. Whenever someone converts a Fedora
> > system to Smack they end up with a filesystem full of unused selinux
> > labels. It does no harm.
> 
> 	At that runtime, sure.  But with reflink(), we may be reflinking
> someone else's inode, and if we have to drop its security state, we
> should clean the unused labels just in case they go back to selinux (or
> back to smack, etc).  But if they are all under security.*, it's easy to
> do.
> 
> Thanks!
> Joel
> 
-- 
Stephen Smalley
National Security Agency


^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [RFC] The reflink(2) system call v4.
  2009-05-13 17:23                           ` Stephen Smalley
@ 2009-05-13 18:27                             ` Joel Becker
  0 siblings, 0 replies; 151+ messages in thread
From: Joel Becker @ 2009-05-13 18:27 UTC (permalink / raw)
  To: Stephen Smalley
  Cc: James Morris, jim owens, linux-security-module, mtk.manpages,
	Casey Schaufler, linux-fsdevel, ocfs2-devel, viro

On Wed, May 13, 2009 at 01:23:58PM -0400, Stephen Smalley wrote:
> File capabilities live under security.*, but ACLs predate the security
> namespace and live in the system namespace as
> "system.posix_acl_access" (and if a directory, there is also a
> "system.posix_acl_default" attribute that specifies the default ACL for
> new files in that directory).
> 
> In the preserve_security==0 case, you'd want to:
> - drop all attributes under security.* on the new inode,
> - set (security.<name>, value) to the name:value pair provided by
> security_inode_init_security(),
> - set system.posix_acl_access to the default ACL associated with the
> parent directory (the "system.posix_acl_default" attribute on the
> parent).
> 
> The latter two steps are what is already done in the new inode creation
> code path, so you hopefully can just reuse that code.

	I am absolutely expecting to reuse that code.  I was just
trying to make sure I didn't miss any steps prior to the normal
new-inode stuff.  Thanks.

Joel

-- 

 The zen have a saying:
 "When you learn how to listen, ANYONE can be your teacher."

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [RFC] The reflink(2) system call v4.
  2009-05-11 22:27         ` James Morris
  2009-05-11 22:34           ` Joel Becker
@ 2009-05-12 12:01           ` Stephen Smalley
  1 sibling, 0 replies; 151+ messages in thread
From: Stephen Smalley @ 2009-05-12 12:01 UTC (permalink / raw)
  To: James Morris
  Cc: Joel Becker, jim owens, ocfs2-devel, viro, mtk.manpages,
	linux-security-module, linux-fsdevel

On Tue, 2009-05-12 at 08:27 +1000, James Morris wrote:
> On Mon, 11 May 2009, Joel Becker wrote:
> 
> > and other security attributes (in all, I'm gonna call that the "security
> > context") as well.  So I defined reflink() as such.  This meant
> 
> "security context" is a term associated with SELinux, so you may want to 
> use something like "security attributes" or "security state" to avoid 
> confusing people.
> 
> > +	error = security_inode_reflink(old_dentry, dir);
> > +	if (error)
> > +		return error;
> 
> We'll need the new_dentry now, to set up new security state before the 
> dentry is instantiated.

I don't think the inode exists yet for the new_dentry (not until after
the call to i_op->reflink), and thus we cannot set up the new inode
state at the point of security_inode_reflink().  We will need the
filesystem to call into the security module to get the right security
attribute name/value pair when creating the new inode, just as with
normal inode creation, unless it is preserving the name/value pair from
the original.  The security_inode_init_security() hook is for that
purpose - you can see its usage in existing filesystems when creating
new inodes.

> e.g. SELinux will need to perform some checks on the operation, then 
> calculate a new security context for the new file.
> 
> 
> - James
-- 
Stephen Smalley
National Security Agency

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [RFC] The reflink(2) system call v4.
  2009-05-11 20:40       ` [RFC] The reflink(2) system call v4 Joel Becker
  2009-05-11 22:27         ` James Morris
@ 2009-05-11 23:11         ` jim owens
  2009-05-11 23:42           ` Joel Becker
  2009-05-12 11:31         ` Jörn Engel
                           ` (4 subsequent siblings)
  6 siblings, 1 reply; 151+ messages in thread
From: jim owens @ 2009-05-11 23:11 UTC (permalink / raw)
  To: joel.becker, linux-fsdevel
  Cc: jmorris, ocfs2-devel, viro, mtk.manpages, linux-security-module

Joel Becker wrote:
> 	Here's v4 of reflink().  If you have the privileges, you get the
> full snapshot.  If you don't, you must have read access, and then you
> get the entire snapshot (data and extended attributes) except that the
> security context is reinitialized.  That's it.  It fits with most of the
> other ops, and it's a clean degradation.

I really like this.  It has a nice clean user operational definition
and gives them all the snap/cowfile features.  And if they had the
privilege to do the reflink(), they can just chattr away :)

jim

> +	/*
> +	 * If the caller has the rights, reflink() will preserve the
> +	 * security context of the source inode.
> +	 */
> +	if ((current_fsuid() != inode->i_uid) && !capable(CAP_CHOWN))
> +		preserve_security = 0;
> +	if ((current_fsuid() != inode->i_uid) &&
> +	    !in_group_p(inode->i_gid) && !capable(CAP_CHOWN))
> +		preserve_security = 0;

I have not done a code review, but that appears to be an
editing cut-and-paste duplication.
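
One way to see why the second test looks redundant is to model both
conditions as plain booleans and enumerate every combination.  This is
a userspace sketch, not the patch itself: is_owner stands in for
current_fsuid() == inode->i_uid, in_group for in_group_p(), and
has_cap_chown for capable(CAP_CHOWN), and any side effects of the real
calls are ignored:

```c
#include <assert.h>
#include <stdbool.h>

/* First test from the quoted hunk: not the owner and no CAP_CHOWN. */
static bool first_check_clears(bool is_owner, bool has_cap_chown)
{
	return !is_owner && !has_cap_chown;
}

/* Second test: the same, plus "not in the file's group". */
static bool second_check_clears(bool is_owner, bool in_group,
				bool has_cap_chown)
{
	return !is_owner && !in_group && !has_cap_chown;
}

/*
 * Count the input combinations where the second test would clear
 * preserve_security although the first test had not already done so.
 */
static int count_cases_only_second_catches(void)
{
	int n = 0;

	for (int bits = 0; bits < 8; bits++) {
		bool is_owner = bits & 1;
		bool in_group = bits & 2;
		bool has_cap = bits & 4;

		if (second_check_clears(is_owner, in_group, has_cap) &&
		    !first_check_clears(is_owner, has_cap))
			n++;
	}
	return n;
}

/* Abort if any combination is caught by the second test alone. */
static void assert_second_check_is_redundant(void)
{
	assert(count_cases_only_second_catches() == 0);
}
```

Under this model the count is zero, so the second test never changes
the outcome once the first has run.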

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [RFC] The reflink(2) system call v4.
  2009-05-11 23:11         ` jim owens
@ 2009-05-11 23:42           ` Joel Becker
  0 siblings, 0 replies; 151+ messages in thread
From: Joel Becker @ 2009-05-11 23:42 UTC (permalink / raw)
  To: jim owens
  Cc: jmorris, linux-security-module, mtk.manpages, linux-fsdevel,
	ocfs2-devel, viro

On Mon, May 11, 2009 at 07:11:00PM -0400, jim owens wrote:
> Joel Becker wrote:
>> 	Here's v4 of reflink().  If you have the privileges, you get the
>> full snapshot.  If you don't, you must have read access, and then you
>> get the entire snapshot (data and extended attributes) except that the
>> security context is reinitialized.  That's it.  It fits with most of the
>> other ops, and it's a clean degradation.
>
> I really like this.  It has a nice clean user operational definition
> and gives them all the snap/cowfile features.  And if they had the
> privilege to do the reflink(), they can just chattr away :)
>
> jim
>
>> +	/*
>> +	 * If the caller has the rights, reflink() will preserve the
>> +	 * security context of the source inode.
>> +	 */
>> +	if ((current_fsuid() != inode->i_uid) && !capable(CAP_CHOWN))
>> +		preserve_security = 0;
>> +	if ((current_fsuid() != inode->i_uid) &&
>> +	    !in_group_p(inode->i_gid) && !capable(CAP_CHOWN))
>> +		preserve_security = 0;
>
> I have not done a code review, but that appears to be an
> editing cut-and-paste duplication.

	Oh, good catch.

Joel

-- 

"In the long run...we'll all be dead."
                                        -Unknown

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [RFC] The reflink(2) system call v4.
  2009-05-11 20:40       ` [RFC] The reflink(2) system call v4 Joel Becker
  2009-05-11 22:27         ` James Morris
  2009-05-11 23:11         ` jim owens
@ 2009-05-12 11:31         ` Jörn Engel
  2009-05-12 13:12           ` jim owens
  2009-05-12 15:04         ` Sage Weil
                           ` (3 subsequent siblings)
  6 siblings, 1 reply; 151+ messages in thread
From: Jörn Engel @ 2009-05-12 11:31 UTC (permalink / raw)
  To: Joel Becker
  Cc: jim owens, jmorris, ocfs2-devel, viro, mtk.manpages,
	linux-security-module, linux-fsdevel

On Mon, 11 May 2009 13:40:11 -0700, Joel Becker wrote:
>
> 	Here's v4 of reflink().  If you have the privileges, you get the
> full snapshot.  If you don't, you must have read access, and then you
> get the entire snapshot (data and extended attributes) except that the
> security context is reinitialized.  That's it.  It fits with most of the
> other ops, and it's a clean degradation.

Let me see if I understand this correctly.  File "/tmp/foo" belongs to
Joel, file "/tmp/bar" belongs to Joern.  Everyone has read access to
those files.  Now if you reflink them to your home directory, both files
belong to you.  If I reflink them to my home directory, both files
belong to me.  And if root reflinks them to /root, one file belongs to
Joel, the other to Joern.  Is that correct?
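
The scenario can be written down as a small userspace model.  This is
only a sketch of the v4 rule as described in the quoted text (preserve
if the caller owns the source or has CAP_CHOWN, otherwise the new file
gets the caller's fsuid); the struct and function names are invented
for illustration and the read-access check is not modeled:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the calling process's credentials. */
struct caller {
	unsigned int fsuid;
	bool cap_chown;
};

/* v4 rule as described: owner or CAP_CHOWN may preserve security state. */
static bool can_preserve(const struct caller *c, unsigned int src_uid)
{
	assert(c != NULL);
	return c->fsuid == src_uid || c->cap_chown;
}

/* Owner of the new reflink under that rule. */
static unsigned int reflink_owner(const struct caller *c,
				  unsigned int src_uid)
{
	return can_preserve(c, src_uid) ? src_uid : c->fsuid;
}
```

With uid 1000 for Joel, 1001 for Joern, and a root caller holding
CAP_CHOWN, reflink_owner() gives each unprivileged caller ownership of
both new files, while root's copies keep their original owners.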

Because if it is, I would call that behaviour rather confusing.  A
system call that behaves differently depending on who calls it - or
on whether the binary is installed suid root - is something I would like
to avoid.

Jörn

-- 
A surrounded army must be given a way out.
-- Sun Tzu
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [RFC] The reflink(2) system call v4.
  2009-05-12 11:31         ` Jörn Engel
@ 2009-05-12 13:12           ` jim owens
  2009-05-12 20:24             ` Jamie Lokier
  2009-05-14 18:43             ` Jörn Engel
  0 siblings, 2 replies; 151+ messages in thread
From: jim owens @ 2009-05-12 13:12 UTC (permalink / raw)
  To: Jörn Engel
  Cc: Joel Becker, jmorris, ocfs2-devel, viro, mtk.manpages,
	linux-security-module, linux-fsdevel

Jörn Engel wrote:
> On Mon, 11 May 2009 13:40:11 -0700, Joel Becker wrote:
>> 	Here's v4 of reflink().  If you have the privileges, you get the
>> full snapshot.  If you don't, you must have read access, and then you
>> get the entire snapshot (data and extended attributes) except that the
>> security context is reinitialized.  That's it.  It fits with most of the
>> other ops, and it's a clean degradation.
> 
> Let me see if I understand this correctly.  File "/tmp/foo" belongs to
> Joel, file "/tmp/bar" belongs to Joern.  Everyone has read access to
> those files.  Now if you reflink them to your home directory, both files
> belong to you.  If I reflink them to my home directory, both files
> belong to me.  And if root reflinks them to /root, one file belongs to
> Joel, the other to Joern.  Is that correct?

yes

> Because if it is, I would call that behaviour rather confusing.  A
> system call that behaves differently depending on who calls it - or
> on whether the binary is installed suid root - is something I would like
> to avoid.

Avoiding that just gives us other confusing operations unless
you have a really good alternative.

This design is very elegant, I wish I had thought of it :)

It passes the test that 99% of the time for any user (including
root), "it just works the way I want it to".  In my experience,
root and setuid programs really don't want to take ownership,
they want to replicate it.

The behavior matches "cp -p" or "tar -x" and yes those are not
system calls but so what.  What matters is the documentation is
clear about what happens and the most useful result occurs.

jim

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [RFC] The reflink(2) system call v4.
  2009-05-12 13:12           ` jim owens
@ 2009-05-12 20:24             ` Jamie Lokier
  2009-05-14 18:43             ` Jörn Engel
  1 sibling, 0 replies; 151+ messages in thread
From: Jamie Lokier @ 2009-05-12 20:24 UTC (permalink / raw)
  To: jim owens
  Cc: Jörn Engel, Joel Becker, jmorris, ocfs2-devel, viro,
	mtk.manpages, linux-security-module, linux-fsdevel

jim owens wrote:
> It passes the test that 99% of the time for any user (including
> root), "it just works the way I want it to".  In my experience,
> root and setuid programs really don't want to take ownership,
> they want to replicate it.

Unfortunately in the other 1%, as I've explained in detail in another
mail, it's a lot of work and sometimes impossible for a program to set
the attributes to be those of a new file.

Whereas an explicit choice between snapshot attributes and new-file
attributes never causes problems, because it's trivial to provide the
automatic "-p" switch by trying one then the other.
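
A rough userspace sketch of that fallback, assuming two hypothetical
entry points with the usual "return -1 and set errno" convention.  No
such pair of calls exists in the posted patch; reflink_fn,
reflink_auto() and the stub_* functions are placeholders used only to
exercise the logic:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical signature shared by both variants of the call. */
typedef int (*reflink_fn)(const char *oldpath, const char *newpath);

/*
 * Try the attribute-preserving variant first; fall back to the
 * new-file variant only when the first attempt fails with EPERM.
 */
static int reflink_auto(reflink_fn snapshot, reflink_fn as_new_file,
			const char *oldpath, const char *newpath)
{
	assert(snapshot != NULL && as_new_file != NULL);

	if (snapshot(oldpath, newpath) == 0)
		return 0;
	if (errno != EPERM)
		return -1;	/* unrelated failure: report it as-is */
	return as_new_file(oldpath, newpath);
}

/* Stubs standing in for the two calls, for demonstration only. */
static int stub_ok(const char *o, const char *n)
{
	(void)o; (void)n;
	return 0;
}

static int stub_denied(const char *o, const char *n)
{
	(void)o; (void)n;
	errno = EPERM;
	return -1;
}

static int stub_nospace(const char *o, const char *n)
{
	(void)o; (void)n;
	errno = ENOSPC;
	return -1;
}
```

A caller that wants strictly one behaviour would call the matching
variant directly and treat EPERM as an error.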

To human-optimise, make your reflink _program_ do that.
Humans don't call system calls themselves :-)

> The behavior matches "cp -p" or "tar -x"

Actually it doesn't, but even if it did, not having any way to turn
off the "-p" would be just as annoying as if you couldn't do that with "cp".

If you like root to have "cp -p", put it in /root/.bashrc :-)

-- Jamie

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [RFC] The reflink(2) system call v4.
  2009-05-12 13:12           ` jim owens
  2009-05-12 20:24             ` Jamie Lokier
@ 2009-05-14 18:43             ` Jörn Engel
  1 sibling, 0 replies; 151+ messages in thread
From: Jörn Engel @ 2009-05-14 18:43 UTC (permalink / raw)
  To: jim owens
  Cc: Joel Becker, jmorris, ocfs2-devel, viro, mtk.manpages,
	linux-security-module, linux-fsdevel

[ Delayed response - mailserver was dead. ]

On Tue, 12 May 2009 09:12:17 -0400, jim owens wrote:
> 
> >Because if it is, I would call that behaviour rather confusing.  A
> >system call that behaves differently depending on who calls it - or
> >on whether the binary is installed suid root - is something I would like
> >to avoid.
> 
> Avoiding that just gives us other confusing operations unless
> you have a really good alternative.
> 
> This design is very elegant, I wish I had thought of it :)
> 
> It passes the test that 99% of the time for any user (including
> root), "it just works the way I want it to".  In my experience,
> root and setuid programs really don't want to take ownership,
> they want to replicate it.
> 
> The behavior matches "cp -p" or "tar -x" and yes those are not
> system calls but so what.  What matters is the documentation is
> clear about what happens and the most useful result occurs.

If what you want is copyfile(2), this is a poor design because it
usually does what you want and sometimes doesn't.  If what you want is
reflink(2), this may be acceptable.  Not sure.  I personally would
prefer to get -EPERM or something instead of altered behaviour.

So you can count me in with the people that propose two separate system
calls.

Jörn

-- 
They laughed at Galileo.  They laughed at Copernicus.  They laughed at
Columbus. But remember, they also laughed at Bozo the Clown.
-- unknown

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [RFC] The reflink(2) system call v4.
  2009-05-11 20:40       ` [RFC] The reflink(2) system call v4 Joel Becker
                           ` (2 preceding siblings ...)
  2009-05-12 11:31         ` Jörn Engel
@ 2009-05-12 15:04         ` Sage Weil
  2009-05-12 15:23           ` jim owens
  2009-05-12 17:28           ` Joel Becker
  2009-05-14  3:57         ` Andy Lutomirski
                           ` (2 subsequent siblings)
  6 siblings, 2 replies; 151+ messages in thread
From: Sage Weil @ 2009-05-12 15:04 UTC (permalink / raw)
  To: Joel Becker
  Cc: jim owens, jmorris, ocfs2-devel, viro, mtk.manpages,
	linux-security-module, linux-fsdevel

On Mon, 11 May 2009, Joel Becker wrote:
> 	Here's v4 of reflink().  If you have the privileges, you get the
> full snapshot.  If you don't, you must have read access, and then you
> get the entire snapshot (data and extended attributes) except that the
> security context is reinitialized.  That's it.  It fits with most of the
> other ops, and it's a clean degradation.

What would a 'cp' without '-p' be expected to do here when it has the 
privileges?  Call reflink(2), then explicitly clear out any copied 
security attributes, and otherwise jump through hoops to make the newly 
created file look like it should?  Should it check whether it has the 
privileges and act accordingly (_can_ it even do that 
reliably/atomically?), or unconditionally verify the attributes look 
like a new file's should?

To me, a simple 'cp' type operation (assuming it gets wired up the way it 
could) seems like at least as common a use case as a 'snapshot' 
operation.  I know that's not your main goal here, but I don't 
understand the resistance to two syscalls.  Mixing the two might give you 
the right answer in many cases, but certainly not all, and it makes for 
confusing application interface semantics that we won't be able to change 
down the line.

sage


> 	I add a flag to iops->reflink() so that the filesystem knows what
> to do with the security context.  That's the only change visible outside
> of vfs_reflink().
> 	Security folks, check my work.  Everyone else, let me know if
> this satisfies.
> 
> Joel
> 
> From 1ebf4c2cf36d38b22de025b03753497466e18941 Mon Sep 17 00:00:00 2001
> From: Joel Becker <joel.becker@oracle.com>
> Date: Sat, 2 May 2009 22:48:59 -0700
> Subject: [PATCH] fs: Add the reflink() operation and reflinkat(2) system call.
> 
> The userspace-visible idea of the operation is:
> 
> int reflink(const char *oldpath, const char *newpath);
> int reflinkat(int olddirfd, const char *oldpath,
> 	      int newdirfd, const char *newpath, int flags);
> 
> The kernel only implements reflinkat(2).  reflink(3) is a trivial
> wrapper around reflinkat(2).
> 
> The reflink() system call creates reference-counted links.  It creates
> a new file that shares the data extents of the source file in a
> copy-on-write fashion.  Its calling semantics are identical to link(2)
> and linkat(2).  Once complete, programs see the new file as a completely
> separate entry.
> 
> reflink() attempts to preserve ownership, permissions, and security
> contexts in order to create a full snapshot.  Preserving those
> attributes requires ownership or CAP_CHOWN.  A caller without those
> privileges will see the security context of the new file initialized to
> their default.
> 
> In the VFS, ->reflink() is an inode_operation with almost the same
> arguments as ->link(); an additional argument tells the filesystem to
> copy over or reinitialize the security context on the new file.
> 
> A new LSM hook, security_inode_reflink(), is added.  None of the
> existing LSM hooks appeared to fit.
> 
> XXX: Currently only adds the x86_32 linkage.  The rest of the
> architectures belong here too.
> 
> Signed-off-by: Joel Becker <joel.becker@oracle.com>
> ---
>  Documentation/filesystems/reflink.txt |  165 +++++++++++++++++++++++++++++++++
>  Documentation/filesystems/vfs.txt     |    4 +
>  arch/x86/include/asm/unistd_32.h      |    1 +
>  arch/x86/kernel/syscall_table_32.S    |    1 +
>  fs/namei.c                            |  113 ++++++++++++++++++++++
>  include/linux/fs.h                    |    2 +
>  include/linux/security.h              |   16 +++
>  include/linux/syscalls.h              |    2 +
>  security/capability.c                 |    6 +
>  security/security.c                   |    7 ++
>  10 files changed, 317 insertions(+), 0 deletions(-)
>  create mode 100644 Documentation/filesystems/reflink.txt
> 
> diff --git a/Documentation/filesystems/reflink.txt b/Documentation/filesystems/reflink.txt
> new file mode 100644
> index 0000000..aa7380f
> --- /dev/null
> +++ b/Documentation/filesystems/reflink.txt
> @@ -0,0 +1,165 @@
> +reflink(2)
> +==========
> +
> +
> +INTRODUCTION
> +------------
> +
> +A reflink is a reference-counted link.  The reflink(2) operation is
> +analogous to the link(2) operation, except that instead of two directory
> +entries pointing to the same inode, there are two identical inodes
> +pointing to the same data.  Writes do not modify the shared data; they
> +use copy-on-write (CoW).  Thus, after the reflink has been created, the
> +inodes can diverge without impacting each other.
> +
> +
> +SYNOPSIS
> +--------
> +
> +The reflink(2) call looks just like link(2):
> +
> +    int reflink(const char *oldpath, const char *newpath);
> +
> +The actual system call is reflinkat(2):
> +
> +    int reflinkat(int olddirfd, const char *oldpath,
> +                  int newdirfd, const char *newpath, int flags);
> +
> +For details on how olddirfd, newdirfd, and flags behave, see linkat(2).
> +The reflink(2) call won't be implemented by the kernel, because it's a
> +trivial wrapper around reflinkat(2).
> +
> +
> +DESCRIPTION
> +-----------
> +
> +One way of viewing reflink is to look at the level of sharing.  A
> +symbolic link does its sharing at the directory entry level; many names
> +end up pointing at the same directory entry.  Hard links are one step
> +down.  Multiple directory entries are sharing one inode.  Reflinks are
> +down one more level: multiple inodes share the same data extents.
> +
> +When you symlink a file, you can then access it via the symlink or the
> +real directory entry, and for the most part they look identical.  When
> +accessing more than one name for a hard link, the object returned looks
> +identical.  Similarly, a newly created reflink is identical to its
> +source in almost every way and can be treated as such.  This includes
> +ownership, permissions, security context, and data.  The only things
> +that are different are the inode number, the link count, and the ctime.
> +
> +A reflink is a snapshot of the source file at the time it is created.
> +
> +Once created, though, a reflink can be modified like any other normal
> +file without affecting the source file.  Changes to trivial fields like
> +permissions, owner, or times are guaranteed not to trigger CoW of file
> +data and will not return any error that wouldn't happen on a truly
> +distinct file.  Changes to the file's data will trigger CoW of the data
> +affected - the actual CoW granularity is up to the filesystem, from
> +exact bytes up to the entire file.  ocfs2, for example, will copy out an
> +entire extent or 1MB, whichever is smaller.
> +
> +Preserving the security context of the source file obviously requires
> +the privilege to do so.  Callers that do not own the source file and do
> +not have CAP_CHOWN will get a new reflink with all non-security
> +attributes preserved; the security context of the new reflink will be
> +that of a file newly created by that user.
> +
> +Partial reflinks are not allowed.  The new inode will only appear in the
> +directory structure after it is fully formed.  This prevents a crash or
> +lack of space from creating a partial reflink.
> +
> +If a filesystem does not support reflinks, the kernel and libc MUST NOT
> +fake it.  Callers are expecting to get snapshots, and faking it will
> +violate that trust.
> +
> +The userspace view is as follows.  When reflink(2) returns, opening
> +oldpath and newpath returns identical-looking files, just like link(2).
> +After that, oldpath and newpath behave as distinct files, and
> +modifications to one have no impact on the other.
> +
> +
> +RESTRICTIONS
> +------------
> +
> +Just as the sharing gets lower as you move from symlink() -> link() ->
> +reflink(), the restrictions on the call get tighter.  A symlink doesn't
> +require any access permissions other than being able to create its
> +inode.  It can cross filesystems and mount points, and it can point to
> +any type of file.  A hard link requires both source and target to be on
> +the same filesystem under the same mount point, and that the source not
> +be a directory.   Like hard links and symlinks, a reflink cannot be
> +created if newpath exists.
> +
> +Reflinks add one big restriction on top of hard links: only the owner
> +or someone with elevated privileges (CAP_CHOWN) can preserve the
> +security context (permissions, ownership, ACLs, etc) across a reflink.
> +A reflink is a point-in-time snapshot of a file.  Without the
> +appropriate privilege, the caller will see their own default security
> +context applied to the file.
> +
> +A caller without the privileges to preserve the security context must
> +have read access to reflink a file.
> +
> +
> +SHARING
> +-------
> +
> +A reflink creates a new inode.  It shares all data extents of the source
> +file; this includes file data and extended attribute data.  All of the
> +sharing is in a CoW fashion, and any modification of the data will break
> +the sharing.
> +
> +For some filesystems, certain data structures are not in allocated
> +storage extents.  Creating a reflink might make a copy of these extents.
> +An example is ext3's ability to store small extended attributes inside
> +the ext3 inode.  Since a reflink is creating a new inode, those extended
> +attributes are merely copied to the new inode.
> +
> +
> +EXCEPTIONS
> +----------
> +
> +All file attributes and extended attributes of the new file must be
> +identical to the source file with the following exceptions:
> +
> +- The new file must have a new inode number.  This allows POSIX
> +  programs to treat the source and new files as separate objects.  From
> +  the view of the POSIX application, the files are distinct.  The
> +  sharing is invisible outside of the filesystem's internal structures.
> +- The ctime of the source file only changes if the source's metadata
> +  must be changed to accommodate the copy-on-write linkage.  The ctime
> +  of the new file is set to represent its creation.
> +- The link count of the source file is unchanged, and the link count of
> +  the new file is one.
> +- If the caller lacks the privileges to preserve the security context,
> +  the file will have its security context initialized as would any new
> +  file.
> +
> +The mtime of the source file is unmodified, and the mtime of the new
> +file is set identical to the source file.  This reflects that the data
> +is unchanged.
> +
> +
> +INODE OPERATION
> +---------------
> +
> +Filesystems implement the ->reflink() inode operation.  It has almost
> +the same prototype as ->link():
> +
> +    int (*reflink)(struct dentry *old_dentry, struct inode *dir,
> +                   struct dentry *new_dentry, int preserve_security);
> +
> +When the filesystem is called, the VFS has already checked the
> +permissions and mountpoint of the operation.  It has determined whether
> +the security context should be preserved or reinitialized, as specified
> +by the preserve_security argument.  The filesystem just needs to create
> +the new inode identical to the old one with the exceptions noted above,
> +link up the shared data extents, and then link the new inode into dir.
> +
> +
> +FOLLOWING SYMBOLIC LINKS
> +------------------------
> +
> +reflink() dereferences symbolic links in the same manner that link(2)
> +does.  The AT_SYMLINK_FOLLOW flag is honored just as for linkat(2).
> +
> diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
> index f49eecf..01cd810 100644
> --- a/Documentation/filesystems/vfs.txt
> +++ b/Documentation/filesystems/vfs.txt
> @@ -333,6 +333,7 @@ struct inode_operations {
>  	ssize_t (*listxattr) (struct dentry *, char *, size_t);
>  	int (*removexattr) (struct dentry *, const char *);
>  	void (*truncate_range)(struct inode *, loff_t, loff_t);
> +	int (*reflink) (struct dentry *,struct inode *,struct dentry *,int);
>  };
>  
>  Again, all methods are called without any locks being held, unless
> @@ -431,6 +432,9 @@ otherwise noted.
>  
>    truncate_range: a method provided by the underlying filesystem to truncate a
>    	range of blocks , i.e. punch a hole somewhere in a file.
> +  reflink: called by the reflink(2) system call. Only required if you want
> +	to support reflinks.  For further information, see
> +	Documentation/filesystems/reflink.txt.
>  
>  
>  The Address Space Object
> diff --git a/arch/x86/include/asm/unistd_32.h b/arch/x86/include/asm/unistd_32.h
> index 6e72d74..c368563 100644
> --- a/arch/x86/include/asm/unistd_32.h
> +++ b/arch/x86/include/asm/unistd_32.h
> @@ -340,6 +340,7 @@
>  #define __NR_inotify_init1	332
>  #define __NR_preadv		333
>  #define __NR_pwritev		334
> +#define __NR_reflinkat		335
>  
>  #ifdef __KERNEL__
>  
> diff --git a/arch/x86/kernel/syscall_table_32.S b/arch/x86/kernel/syscall_table_32.S
> index ff5c873..d11c200 100644
> --- a/arch/x86/kernel/syscall_table_32.S
> +++ b/arch/x86/kernel/syscall_table_32.S
> @@ -334,3 +334,4 @@ ENTRY(sys_call_table)
>  	.long sys_inotify_init1
>  	.long sys_preadv
>  	.long sys_pwritev
> +	.long sys_reflinkat		/* 335 */
> diff --git a/fs/namei.c b/fs/namei.c
> index 78f253c..34a6ce5 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -2486,6 +2486,118 @@ SYSCALL_DEFINE2(link, const char __user *, oldname, const char __user *, newname
>  	return sys_linkat(AT_FDCWD, oldname, AT_FDCWD, newname, 0);
>  }
>  
> +int vfs_reflink(struct dentry *old_dentry, struct inode *dir, struct dentry *new_dentry)
> +{
> +	struct inode *inode = old_dentry->d_inode;
> +	int error;
> +	int preserve_security = 1;
> +
> +	if (!inode)
> +		return -ENOENT;
> +
> +	/*
> +	 * If the caller has the rights, reflink() will preserve the
> +	 * security context of the source inode.
> +	 */
> +	if ((current_fsuid() != inode->i_uid) && !capable(CAP_CHOWN))
> +		preserve_security = 0;
> +	if ((current_fsuid() != inode->i_uid) &&
> +	    !in_group_p(inode->i_gid) && !capable(CAP_CHOWN))
> +		preserve_security = 0;
> +
> +	/*
> +	 * If the caller doesn't have the right to preserve the security
> +	 * context, the caller is only getting the data and extended
> +	 * attributes.  They need read permission on the file.
> +	 */
> +	if (!preserve_security) {
> +		error = inode_permission(inode, MAY_READ);
> +		if (error)
> +			return error;
> +	}
> +
> +	error = may_create(dir, new_dentry);
> +	if (error)
> +		return error;
> +
> +	if (dir->i_sb != inode->i_sb)
> +		return -EXDEV;
> +
> +	/*
> +	 * A reflink to an append-only or immutable file cannot be created.
> +	 */
> +	if (IS_APPEND(inode) || IS_IMMUTABLE(inode))
> +		return -EPERM;
> +	if (!dir->i_op->reflink)
> +		return -EPERM;
> +	if (S_ISDIR(inode->i_mode))
> +		return -EPERM;
> +
> +	error = security_inode_reflink(old_dentry, dir);
> +	if (error)
> +		return error;
> +
> +	mutex_lock(&inode->i_mutex);
> +	vfs_dq_init(dir);
> +	error = dir->i_op->reflink(old_dentry, dir, new_dentry,
> +				   preserve_security);
> +	mutex_unlock(&inode->i_mutex);
> +	if (!error)
> +		fsnotify_create(dir, new_dentry);
> +	return error;
> +}
> +
> +SYSCALL_DEFINE5(reflinkat, int, olddfd, const char __user *, oldname,
> +		int, newdfd, const char __user *, newname, int, flags)
> +{
> +	struct dentry *new_dentry;
> +	struct nameidata nd;
> +	struct path old_path;
> +	int error;
> +	char *to;
> +
> +	if ((flags & ~AT_SYMLINK_FOLLOW) != 0)
> +		return -EINVAL;
> +
> +	error = user_path_at(olddfd, oldname,
> +			     flags & AT_SYMLINK_FOLLOW ? LOOKUP_FOLLOW : 0,
> +			     &old_path);
> +	if (error)
> +		return error;
> +
> +	error = user_path_parent(newdfd, newname, &nd, &to);
> +	if (error)
> +		goto out;
> +	error = -EXDEV;
> +	if (old_path.mnt != nd.path.mnt)
> +		goto out_release;
> +	new_dentry = lookup_create(&nd, 0);
> +	error = PTR_ERR(new_dentry);
> +	if (IS_ERR(new_dentry))
> +		goto out_unlock;
> +	error = mnt_want_write(nd.path.mnt);
> +	if (error)
> +		goto out_dput;
> +	error = security_path_link(old_path.dentry, &nd.path, new_dentry);
> +	if (error)
> +		goto out_drop_write;
> +	error = vfs_reflink(old_path.dentry, nd.path.dentry->d_inode, new_dentry);
> +out_drop_write:
> +	mnt_drop_write(nd.path.mnt);
> +out_dput:
> +	dput(new_dentry);
> +out_unlock:
> +	mutex_unlock(&nd.path.dentry->d_inode->i_mutex);
> +out_release:
> +	path_put(&nd.path);
> +	putname(to);
> +out:
> +	path_put(&old_path);
> +
> +	return error;
> +}
> +
> +
>  /*
>   * The worst of all namespace operations - renaming directory. "Perverted"
>   * doesn't even start to describe it. Somebody in UCB had a heck of a trip...
> @@ -2890,6 +3002,7 @@ EXPORT_SYMBOL(unlock_rename);
>  EXPORT_SYMBOL(vfs_create);
>  EXPORT_SYMBOL(vfs_follow_link);
>  EXPORT_SYMBOL(vfs_link);
> +EXPORT_SYMBOL(vfs_reflink);
>  EXPORT_SYMBOL(vfs_mkdir);
>  EXPORT_SYMBOL(vfs_mknod);
>  EXPORT_SYMBOL(generic_permission);
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 5bed436..0a5c807 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1415,6 +1415,7 @@ extern int vfs_link(struct dentry *, struct inode *, struct dentry *);
>  extern int vfs_rmdir(struct inode *, struct dentry *);
>  extern int vfs_unlink(struct inode *, struct dentry *);
>  extern int vfs_rename(struct inode *, struct dentry *, struct inode *, struct dentry *);
> +extern int vfs_reflink(struct dentry *, struct inode *, struct dentry *);
>  
>  /*
>   * VFS dentry helper functions.
> @@ -1537,6 +1538,7 @@ struct inode_operations {
>  			  loff_t len);
>  	int (*fiemap)(struct inode *, struct fiemap_extent_info *, u64 start,
>  		      u64 len);
> +	int (*reflink) (struct dentry *,struct inode *,struct dentry *,int);
>  };
>  
>  struct seq_file;
> diff --git a/include/linux/security.h b/include/linux/security.h
> index d5fd616..ea9cd93 100644
> --- a/include/linux/security.h
> +++ b/include/linux/security.h
> @@ -528,6 +528,14 @@ static inline void security_free_mnt_opts(struct security_mnt_opts *opts)
>   *	@inode contains a pointer to the inode.
>   *	@secid contains a pointer to the location where result will be saved.
>   *	In case of failure, @secid will be set to zero.
> + * @inode_reflink:
> + *	Check permission before creating a new reference-counted link to
> + *	a file.
> + *	@old_dentry contains the dentry structure for an existing link to
> + *	the file.
> + *	@dir contains the inode structure of the parent directory of the
> + *	new reflink.
> + *	Return 0 if permission is granted.
>   *
>   * Security hooks for file operations
>   *
> @@ -1415,6 +1423,7 @@ struct security_operations {
>  	int (*inode_unlink) (struct inode *dir, struct dentry *dentry);
>  	int (*inode_symlink) (struct inode *dir,
>  			      struct dentry *dentry, const char *old_name);
> +	int (*inode_reflink) (struct dentry *old_dentry, struct inode *dir);
>  	int (*inode_mkdir) (struct inode *dir, struct dentry *dentry, int mode);
>  	int (*inode_rmdir) (struct inode *dir, struct dentry *dentry);
>  	int (*inode_mknod) (struct inode *dir, struct dentry *dentry,
> @@ -1675,6 +1684,7 @@ int security_inode_link(struct dentry *old_dentry, struct inode *dir,
>  int security_inode_unlink(struct inode *dir, struct dentry *dentry);
>  int security_inode_symlink(struct inode *dir, struct dentry *dentry,
>  			   const char *old_name);
> +int security_inode_reflink(struct dentry *old_dentry, struct inode *dir);
>  int security_inode_mkdir(struct inode *dir, struct dentry *dentry, int mode);
>  int security_inode_rmdir(struct inode *dir, struct dentry *dentry);
>  int security_inode_mknod(struct inode *dir, struct dentry *dentry, int mode, dev_t dev);
> @@ -2056,6 +2066,12 @@ static inline int security_inode_symlink(struct inode *dir,
>  	return 0;
>  }
>  
> +static inline int security_inode_reflink(struct dentry *old_dentry,
> +					 struct inode *dir)
> +{
> +	return 0;
> +}
> +
>  static inline int security_inode_mkdir(struct inode *dir,
>  					struct dentry *dentry,
>  					int mode)
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index 40617c1..35a8743 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -692,6 +692,8 @@ asmlinkage long sys_symlinkat(const char __user * oldname,
>  			      int newdfd, const char __user * newname);
>  asmlinkage long sys_linkat(int olddfd, const char __user *oldname,
>  			   int newdfd, const char __user *newname, int flags);
> +asmlinkage long sys_reflinkat(int olddfd, const char __user *oldname,
> +			      int newdfd, const char __user *newname, int flags);
>  asmlinkage long sys_renameat(int olddfd, const char __user * oldname,
>  			     int newdfd, const char __user * newname);
>  asmlinkage long sys_futimesat(int dfd, char __user *filename,
> diff --git a/security/capability.c b/security/capability.c
> index 21b6cea..3dcc4cc 100644
> --- a/security/capability.c
> +++ b/security/capability.c
> @@ -172,6 +172,11 @@ static int cap_inode_symlink(struct inode *inode, struct dentry *dentry,
>  	return 0;
>  }
>  
> +static int cap_inode_reflink(struct dentry *old_dentry, struct inode *inode)
> +{
> +	return 0;
> +}
> +
>  static int cap_inode_mkdir(struct inode *inode, struct dentry *dentry,
>  			   int mask)
>  {
> @@ -905,6 +910,7 @@ void security_fixup_ops(struct security_operations *ops)
>  	set_to_cap_if_null(ops, inode_link);
>  	set_to_cap_if_null(ops, inode_unlink);
>  	set_to_cap_if_null(ops, inode_symlink);
> +	set_to_cap_if_null(ops, inode_reflink);
>  	set_to_cap_if_null(ops, inode_mkdir);
>  	set_to_cap_if_null(ops, inode_rmdir);
>  	set_to_cap_if_null(ops, inode_mknod);
> diff --git a/security/security.c b/security/security.c
> index 5284255..70d0ac3 100644
> --- a/security/security.c
> +++ b/security/security.c
> @@ -470,6 +470,13 @@ int security_inode_symlink(struct inode *dir, struct dentry *dentry,
>  	return security_ops->inode_symlink(dir, dentry, old_name);
>  }
>  
> +int security_inode_reflink(struct dentry *old_dentry, struct inode *dir)
> +{
> +	if (unlikely(IS_PRIVATE(old_dentry->d_inode)))
> +		return 0;
> +	return security_ops->inode_reflink(old_dentry, dir);
> +}
> +
>  int security_inode_mkdir(struct inode *dir, struct dentry *dentry, int mode)
>  {
>  	if (unlikely(IS_PRIVATE(dir)))
> -- 
> 1.6.1.3
> 
> 
> -- 
> 
> "Three o'clock is always too late or too early for anything you
>  want to do."
>         - Jean-Paul Sartre
> 
> Joel Becker
> Principal Software Developer
> Oracle
> E-mail: joel.becker@oracle.com
> Phone: (650) 506-8127
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 
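
The preserve_security decision at the top of the quoted vfs_reflink() can be
modelled in userspace as a pure function, which makes the v4 behavior easier
to reason about.  This is only a sketch: struct caller, struct source and
reflink_v4_decision() are hypothetical names, may_read stands in for
inode_permission(inode, MAY_READ) succeeding, and -EACCES is assumed as the
usual failure in that case.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel state the check looks at. */
struct caller {
	unsigned int fsuid;	/* models current_fsuid() */
	bool cap_chown;		/* models capable(CAP_CHOWN) */
	bool may_read;		/* models inode_permission(inode, MAY_READ) == 0 */
};

struct source {
	unsigned int uid;	/* models inode->i_uid */
};

/*
 * Returns 1 if the security context would be preserved, 0 if it would
 * be reinitialized, or -EACCES (assumed) if a non-preserving caller
 * lacks read access to the source.
 */
static int reflink_v4_decision(const struct caller *c, const struct source *s)
{
	bool preserve;

	assert(c && s);
	preserve = (c->fsuid == s->uid) || c->cap_chown;

	/* Only the non-preserving path is checked for read permission. */
	if (!preserve && !c->may_read)
		return -EACCES;
	return preserve ? 1 : 0;
}
```

Note that the second test in the patch (the one using in_group_p()) can only
fire when the first one already has, so the model folds both into a single
condition.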

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [RFC] The reflink(2) system call v4.
  2009-05-12 15:04         ` Sage Weil
@ 2009-05-12 15:23           ` jim owens
  2009-05-12 16:16             ` Sage Weil
  2009-05-12 17:28           ` Joel Becker
  1 sibling, 1 reply; 151+ messages in thread
From: jim owens @ 2009-05-12 15:23 UTC (permalink / raw)
  To: Sage Weil
  Cc: Joel Becker, jmorris, ocfs2-devel, viro, mtk.manpages,
	linux-security-module, linux-fsdevel

Sage Weil wrote:
> On Mon, 11 May 2009, Joel Becker wrote:
>> 	Here's v4 of reflink().  If you have the privileges, you get the
>> full snapshot.  If you don't, you must have read access, and then you
>> get the entire snapshot (data and extended attributes) except that the
>> security context is reinitialized.  That's it.  It fits with most of the
>> other ops, and it's a clean degradation.
> 
> What would a 'cp' without '-p' be expected to do here when it has the 
> privileges?  Call reflink(2), then explicitly clear out any copied 
> security attributes, ensure that any copied attributes are removed, and 
> otherwise jump through hoops to make the newly created file look like it 
> should?  Should it check whether it has the privileges and act accordingly 
> (_can_ it even do that reliably/atomically?), or unconditionally verify 
> the attributes look like a new file's should?

I don't understand what you think is hard about cp doing the
"if not preserve then update attributes".  It does not have to check
the reflink() attr result, it just assigns the expected new attributes.

Only the -p snapshot needs atomicity.

> To me, a simple 'cp' type operation (assuming it gets wired up the way it 
> could) seems like at least as common a use case than a 'snapshot' 

I don't think changing "cp" is a good idea since users have a
long history that cp means make a data copy, not cow.  Adding
a new flag is IMO not as good as a new utility.  Particularly
since we can not do directories.

jim

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [RFC] The reflink(2) system call v4.
  2009-05-12 15:23           ` jim owens
@ 2009-05-12 16:16             ` Sage Weil
  2009-05-12 17:45               ` jim owens
  0 siblings, 1 reply; 151+ messages in thread
From: Sage Weil @ 2009-05-12 16:16 UTC (permalink / raw)
  To: jim owens
  Cc: Joel Becker, jmorris, ocfs2-devel, viro, mtk.manpages,
	linux-security-module, linux-fsdevel

On Tue, 12 May 2009, jim owens wrote:
> Sage Weil wrote:
> > On Mon, 11 May 2009, Joel Becker wrote:
> > > 	Here's v4 of reflink().  If you have the privileges, you get the
> > > full snapshot.  If you don't, you must have read access, and then you
> > > get the entire snapshot (data and extended attributes) except that the
> > > security context is reinitialized.  That's it.  It fits with most of the
> > > other ops, and it's a clean degradation.
> > 
> > What would a 'cp' without '-p' be expected to do here when it has the
> > privileges?  Call reflink(2), then explicitly clear out any copied security
> > attributes, ensure that any copied attributes are removed, and otherwise jump
> > through hoops to make the newly created file look like it should?  Should it
> > check whether it has the privileges and act accordingly (_can_ it even do
> > that reliably/atomically?), or unconditionally verify the attributes look
> > like a new file's should?
> 
> I don't understand what you think is hard about cp doing the
> "if not preserve then update attributes".  It does not have to check
> the reflink() attr result, it just assigns the expected new attributes.

I assume it's possible, but not being familiar with how the SELinux etc 
attributes look, my guess is that any tool that wants to cow file data 
to a new file (even if root) would need to do something like

 reflink(src, dst)
 chown(dst, getuid(), getgid())
 listxattr and removexattr each, or just delete selinux/whatever attributes.
 create generic 'new file' selinux/whatever attributes, if needed.

The chown bit isn't even right, since it doesn't follow the directory 
sticky bit rules.  And is there some generic way to assign an existing 
file 'new file'-like security attributes?  It's a mess.
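
For illustration, the list-and-remove step above could be sketched roughly
like this.  It is only a sketch under assumptions: strip_xattrs() and
xattr_has_prefix() are made-up helper names, and it relies on the Linux
listxattr(2) and removexattr(2) calls from <sys/xattr.h>.  It does nothing
about the harder part, creating proper 'new file' security attributes.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/xattr.h>

/* Return nonzero if the attribute name starts with the given prefix. */
static int xattr_has_prefix(const char *name, const char *prefix)
{
	return strncmp(name, prefix, strlen(prefix)) == 0;
}

/*
 * Remove every extended attribute on path whose name starts with
 * prefix (for example "security.").  Returns the number of attributes
 * removed, or -1 if listing or removing fails.
 */
static int strip_xattrs(const char *path, const char *prefix)
{
	ssize_t len;
	char *names, *p;
	int removed = 0;

	assert(path && prefix);

	len = listxattr(path, NULL, 0);	/* size 0: ask how big the list is */
	if (len < 0)
		return -1;
	if (len == 0)
		return 0;

	names = malloc(len);
	if (!names)
		return -1;
	len = listxattr(path, names, len);
	if (len < 0) {
		free(names);
		return -1;
	}

	/* The list is a run of NUL-terminated names, back to back. */
	for (p = names; p < names + len; p += strlen(p) + 1) {
		if (!xattr_has_prefix(p, prefix))
			continue;
		if (removexattr(path, p) < 0) {
			free(names);
			return -1;
		}
		removed++;
	}
	free(names);
	return removed;
}
```

A tool would call something like strip_xattrs(dst, "security.") right after
the reflink, which is exactly the kind of non-atomic fixup being objected to
here.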

> Only the -p snapshot needs atomicity.

My point is that the process creating the cow file should unconditionally 
do the above checks (and needed fixups) because it can't atomically verify 
the attribute copy won't happen and make the reflink call.

> > To me, a simple 'cp' type operation (assuming it gets wired up the way it
> > could) seems like at least as common a use case as a 'snapshot' 
> 
> I don't think changing "cp" is a good idea since users have a
> long history that cp means make a data copy, not cow.  Adding
> a new flag is IMO not as good as a new utility.  Particularly
> since we can not do directories.

Maybe not, but that's a separate question from the interface issue.  We 
shouldn't preclude the possibility of creating tools that preserve attributes 
(or warn if they can't) and tools that simply want to cow data to a new 
file.  AFAICS reflink(2) as proposed doesn't quite let you do either one 
without extra hackery to compensate for its dual-mode behavior. If this 
thread has demonstrated anything, it's that some users want snapshot-like 
semantics (cp -p) and some want cowfile()-like semantics (cp).  What is 
the benefit of combining the two into a single call?  If I want 
snapshot-like semantics, I would rather get -EPERM if I lack the necessary 
permissions than silently get an approximation.  Then I can at least issue 
a warning to the user.  If I really want to gracefully 'degrade', I can 
always do something like

err = reflink(src, dst);
if (err == -EPERM) {
	err = cowfile(src, dst);
	if (!err)
		printf("warning: failed to preserve all file attributes\n");
}

sage

> 
> jim

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [RFC] The reflink(2) system call v4.
  2009-05-12 16:16             ` Sage Weil
@ 2009-05-12 17:45               ` jim owens
  2009-05-12 20:29                 ` Jamie Lokier
  0 siblings, 1 reply; 151+ messages in thread
From: jim owens @ 2009-05-12 17:45 UTC (permalink / raw)
  To: Sage Weil
  Cc: Joel Becker, jmorris, ocfs2-devel, viro, mtk.manpages,
	linux-security-module, linux-fsdevel

Sage Weil wrote:
> Maybe not, but that's a separate question from the interface issue.  We 
> shouldn't preclude the possibility of creating tools that preserve attributes 
> (or warn if they can't) and tools that simply want to cow data to a new 
> file.  AFAICS reflink(2) as proposed doesn't quite let you do either one 
> without extra hackery to compensate for its dual-mode behavior. If this 
> thread has demonstrated anything, it's that some users want snapshot-like 
> semantics (cp -p) and some want cowfile()-like semantics (cp).  What is 
> the benefit of combining the two into a single call?  If I want 
> snapshot-like semantics, I would rather get -EPERM if I lack the necessary 
> permissions than silently get an approximation.

I'm not fighting against two syscalls but the reason I like
the V4 definition is the opposite of knowing I failed to snapshot.

It is really because in my experience as both root on multi-user
systems and basic untrusted user, when root copies something from
a user there are only two desired outcomes:

  1) cp -p

  2) cp, chown "someone" , chgrp "somegroup", chmod "new rights"

The common mistake is wanting #1 and forgetting the -p so it
then produces an error and has to be fixed.

Using root's default attributes is almost never desired.

So with this reflink() definition, normal users get their own
attributes and root automatically gets preserve but can change
them later.

IMO this is optimized for humans, and I don't really know
of any privileged daemon things that are setuid and
want to not preserve attributes.  Do you have examples?

jim

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [RFC] The reflink(2) system call v4.
  2009-05-12 17:45               ` jim owens
@ 2009-05-12 20:29                 ` Jamie Lokier
  0 siblings, 0 replies; 151+ messages in thread
From: Jamie Lokier @ 2009-05-12 20:29 UTC (permalink / raw)
  To: jim owens
  Cc: Sage Weil, Joel Becker, jmorris, ocfs2-devel, viro, mtk.manpages,
	linux-security-module, linux-fsdevel

jim owens wrote:
> Using root's default attributes is almost never desired.
                                     ^^^^^^

Exactly.  When it is desired, it shouldn't be impossible :-)

Setting attributes to those of a new file outside the kernel requires
parsing /proc/mounts and knowing filesystem-type-specific things,
among other things.  Ugly stuff - should never be written.  Don't make
such ugly stuff be written (and fail when /proc isn't mounted).

There is also the principle of least surprise...  Shell scripts which
behave differently for root - that's asking for trouble.

-- Jamie

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [RFC] The reflink(2) system call v4.
  2009-05-12 15:04         ` Sage Weil
  2009-05-12 15:23           ` jim owens
@ 2009-05-12 17:28           ` Joel Becker
  2009-05-13  4:30             ` Sage Weil
  1 sibling, 1 reply; 151+ messages in thread
From: Joel Becker @ 2009-05-12 17:28 UTC (permalink / raw)
  To: Sage Weil
  Cc: jim owens, jmorris, ocfs2-devel, viro, mtk.manpages,
	linux-security-module, linux-fsdevel

On Tue, May 12, 2009 at 08:04:21AM -0700, Sage Weil wrote:
> To me, a simple 'cp' type operation (assuming it gets wired up the way it 
> could) seems like at least as common a use case as a 'snapshot' 
> operation.  I know that's not your main goal here, but I don't 
> understand the resistance to two syscalls.  Mixing the two might give you 
> the right answer in many cases, but certainly not all, and it makes for 
> confusing application interface semantics that we won't be able to change 
> down the line.

	I'm not against two syscalls, but I'm not writing copyfile()
here, just reflink().  Someone clearly could write copyfile() later and
link into some of the same underlying mechanisms.
	It's important to distinguish the semantics, though, and that's
why I'm doing one thing.  For example, reflink() is a snapshot (a
"reference-counted link") and has behaviors based on that.  libc should
never fake it, because the callers expect those behaviors.  Whereas
copyfile() would be fakeable in libc with a read/write cycle on
filesystems that don't support it.  Things like that.
	Heck, I think you could use reflink() to create a copyfile() in
libc that uses no additional syscall.  But you couldn't use copyfile()
to create reflink().
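
To make the "fakeable in libc" point concrete, a read/write cycle of the
kind a copyfile() fallback might use could look roughly like the sketch
below.  copy_by_read_write() is a hypothetical name, not a proposed libc
interface, and the sketch copies data only: no attributes, no sparse-file
handling.

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/*
 * Copy everything readable from in_fd to out_fd with plain read(2)
 * and write(2).  Returns 0 on success, -1 on error with errno set.
 */
static int copy_by_read_write(int in_fd, int out_fd)
{
	char buf[4096];
	char *p;
	ssize_t n, w;

	assert(in_fd >= 0 && out_fd >= 0);

	for (;;) {
		n = read(in_fd, buf, sizeof(buf));
		if (n == 0)
			return 0;		/* end of input */
		if (n < 0) {
			if (errno == EINTR)
				continue;	/* interrupted, retry the read */
			return -1;
		}

		/* write(2) may be short, so loop until the chunk is out. */
		p = buf;
		while (n > 0) {
			w = write(out_fd, p, n);
			if (w < 0) {
				if (errno == EINTR)
					continue;
				return -1;
			}
			p += w;
			n -= w;
		}
	}
}
```

Nothing like this can stand in for reflink(), since the result shares no
extents with the source and is not taken at a single point in time.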

Joel

-- 

"Lately I've been talking in my sleep.
 Can't imagine what I'd have to say.
 Except my world will be right
 When love comes back my way."

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [RFC] The reflink(2) system call v4.
  2009-05-12 17:28           ` Joel Becker
@ 2009-05-13  4:30             ` Sage Weil
  0 siblings, 0 replies; 151+ messages in thread
From: Sage Weil @ 2009-05-13  4:30 UTC (permalink / raw)
  To: Joel Becker
  Cc: jim owens, jmorris, ocfs2-devel, viro, mtk.manpages,
	linux-security-module, linux-fsdevel

On Tue, 12 May 2009, Joel Becker wrote:
> 	I'm not against two syscalls, but I'm not writing copyfile()
> here, just reflink().  Someone clearly could write copyfile() later and
> link into some of the same underlying mechanisms.

Ok, good.

> 	It's important to distinguish the semantics, though, and that's
> why I'm doing one thing.  For example, reflink() is a snapshot (a
> "reference-counted link") and has behaviors based on that.  libc should
> never fake it, because the callers expect those behaviors.  Whereas
> copyfile() would be fakeable in libc with a read/write cycle on
> filesystems that don't support it.  Things like that.
> 	Heck, I think you could use reflink() to create a copyfile() in
> libc that uses no additional syscall.  But you couldn't use copyfile()
> to create reflink().

Right, except that you _could_ implement the degraded (no CAP_CHOWN) 
reflink() behavior with a hypothetical copyfile().

I just think you should be sure that reflink() has _exactly_ the snapshot 
semantics that make sense, without compromises that try to capture some or 
all of copyfile() as well.  Assuming that a copyfile() type syscall also 
existed, would you really want reflink() to silently degrade to something 
that can be implemented via copyfile() when you lack CAP_CHOWN?

With the proposed reflink(), we might end up with a final API that looks 
something like:

 cowfile(src, dst, flags) - cow data and/or xattrs from src to dst
 reflink(src, dst) - snapshot src to dst, or if !CAP_CHOWN, cowfile() instead

A simpler reflink() would make that degradation non-mandatory, and 
trivially implemented in userspace by those who want it.

sage

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [RFC] The reflink(2) system call v4.
  2009-05-11 20:40       ` [RFC] The reflink(2) system call v4 Joel Becker
                           ` (3 preceding siblings ...)
  2009-05-12 15:04         ` Sage Weil
@ 2009-05-14  3:57         ` Andy Lutomirski
  2009-05-14 18:12           ` Stephen Smalley
  2009-05-28  0:24         ` [RFC] The reflink(2) system call v5 Joel Becker
  2009-09-14 22:24         ` Joel Becker
  6 siblings, 1 reply; 151+ messages in thread
From: Andy Lutomirski @ 2009-05-14  3:57 UTC (permalink / raw)
  To: jim owens, jmorris, ocfs2-devel, viro, mtk.manpages,
	linux-security-module, linux-fsdevel

Joel Becker wrote:
> +
> +Preserving the security context of the source file obviously requires
> +the privilege to do so.  Callers that do not own the source file and do
> +not have CAP_CHOWN will get a new reflink with all non-security
> +attributes preserved; the security context of the new reflink will be
> +as a newly created file by that user.
> +

There are plenty of syscalls that require some privilege and fail if the 
caller doesn't have it.  But I can think of only one syscall that does 
*something different* depending on who called it: setuid.

Please search the web and marvel at the disasters caused by setuid's 
magical caller-dependent behavior (the sendmail bug is probably the most 
famous [1]).  This proposal for reflink is just asking for bugs where an 
attacker gets some otherwise privileged program to call reflink but to 
somehow lack the privileges (CAP_CHOWN, selinux rights, or whatever) to 
copy security attributes, thus exposing a link with the wrong permissions.

Would it really be that hard to have two syscalls, or a flag, or 
whatever, where one of them preserves all security attributes and 
*fails* if the caller isn't allowed to do that and the other one makes 
the caller own the new link?
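
As a rough illustration of the flag variant, here is a small userspace
model of the decision.  It is a sketch only: REFLINK_PRESERVE and
reflink_explicit_decision() are hypothetical names that appear in none of
the posted patches, and the error codes are assumptions.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define REFLINK_PRESERVE 0x1	/* hypothetical: caller demands a full snapshot */

/*
 * Returns 1 if the security context is preserved, 0 if the new link
 * gets the caller's own context, or a negative errno.  The outcome
 * depends on flags; privilege only decides success or failure.
 */
static int reflink_explicit_decision(unsigned int fsuid, unsigned int owner_uid,
				     bool cap_chown, bool may_read, int flags)
{
	bool may_preserve = (fsuid == owner_uid) || cap_chown;

	if (flags & ~REFLINK_PRESERVE)
		return -EINVAL;

	if (flags & REFLINK_PRESERVE)
		return may_preserve ? 1 : -EPERM;	/* fail, never degrade */

	/* Non-preserving form: read access is all that is needed. */
	return may_read ? 0 : -EACCES;
}
```

With this shape a privileged caller that omits the flag gets a link it
owns, and an unprivileged caller that sets it gets -EPERM rather than a
silently different result.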

[1] http://www.cs.berkeley.edu/~daw/papers/setuid-usenix02.pdf

--Andy

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [RFC] The reflink(2) system call v4.
  2009-05-14  3:57         ` Andy Lutomirski
@ 2009-05-14 18:12           ` Stephen Smalley
  2009-05-14 22:00             ` Joel Becker
  0 siblings, 1 reply; 151+ messages in thread
From: Stephen Smalley @ 2009-05-14 18:12 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: jim owens, jmorris, ocfs2-devel, viro, mtk.manpages,
	linux-security-module, linux-fsdevel

On Wed, 2009-05-13 at 23:57 -0400, Andy Lutomirski wrote:
> Joel Becker wrote:
> > +
> > +Preserving the security context of the source file obviously requires
> > +the privilege to do so.  Callers that do not own the source file and do
> > +not have CAP_CHOWN will get a new reflink with all non-security
> > +attributes preserved; the security context of the new reflink will be
> > +as a newly created file by that user.
> > +
> 
> There are plenty of syscalls that require some privilege and fail if the 
> caller doesn't have it.  But I can think of only one syscall that does 
> *something different* depending on who called it: setuid.
> 
> Please search the web and marvel at the disasters caused by setuid's 
> magical caller-dependent behavior (the sendmail bug is probably the most 
> famous [1]).  This proposal for reflink is just asking for bugs where an 
> attacker gets some otherwise privileged program to call reflink but to 
> somehow lack the privileges (CAP_CHOWN, selinux rights, or whatever) to 
> copy security attributes, thus exposing a link with the wrong permissions.
> 
> Would it really be that hard to have two syscalls, or a flag, or 
> whatever, where one of them preserves all security attributes and 
> *fails* if the caller isn't allowed to do that and the other one makes 
> the caller own the new link?
> 
> 
> [1] http://www.cs.berkeley.edu/~daw/papers/setuid-usenix02.pdf

Yes, I agree - the selection of whether or not to preserve the security
attributes should be an explicit part of the kernel interface.  Then the
application still has the freedom to fall back on the non-preserving
form of the call if that is truly what it wants.

-- 
Stephen Smalley
National Security Agency


^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [RFC] The reflink(2) system call v4.
  2009-05-14 18:12           ` Stephen Smalley
@ 2009-05-14 22:00             ` Joel Becker
  2009-05-15  1:20               ` Jamie Lokier
  2009-05-15 12:01               ` Stephen Smalley
  0 siblings, 2 replies; 151+ messages in thread
From: Joel Becker @ 2009-05-14 22:00 UTC (permalink / raw)
  To: Stephen Smalley
  Cc: Andy Lutomirski, jim owens, jmorris, ocfs2-devel, viro,
	mtk.manpages, linux-security-module, linux-fsdevel

On Thu, May 14, 2009 at 02:12:45PM -0400, Stephen Smalley wrote:
> On Wed, 2009-05-13 at 23:57 -0400, Andy Lutomirski wrote:
> > Joel Becker wrote:
> > > +
> > > +Preserving the security context of the source file obviously requires
> > > +the privilege to do so.  Callers that do not own the source file and do
> > > +not have CAP_CHOWN will get a new reflink with all non-security
> > > +attributes preserved; the security context of the new reflink will be
> > > +as a newly created file by that user.
> > > +
> > 
> > There are plenty of syscalls that require some privilege and fail if the 
> > caller doesn't have it.  But I can think of only one syscall that does 
> > *something different* depending on who called it: setuid.
> > 
> > Please search the web and marvel at the disasters caused by setuid's 
> > magical caller-dependent behavior (the sendmail bug is probably the most 
> > famous [1]).  This proposal for reflink is just asking for bugs where an 
> > attacker gets some otherwise privileged program to call reflink but to 
> > somehow lack the privileges (CAP_CHOWN, selinux rights, or whatever) to 
> > copy security attributes, thus exposing a link with the wrong permissions.
> > 
> > Would it really be that hard to have two syscalls, or a flag, or 
> > whatever, where one of them preserves all security attributes and 
> > *fails* if the caller isn't allowed to do that and the other one makes 
> > the caller own the new link?
> > 
> > 
> > [1] http://www.cs.berkeley.edu/~daw/papers/setuid-usenix02.pdf
> 
> Yes, I agree - the selection of whether or not to preserve the security
> attributes should be an explicit part of the kernel interface.  Then the
> application still has the freedom to fall back on the non-preserving
> form of the call if that is truly what it wants.

	Here's my problem.  Every single shell script now has to do:

    ln -r source target
    [ $? != 0 ] && ln -r --no-perms source target

Every single program now has to do:

    if (reflink(source, target) && errno == EPERM)
        reflinkat(AT_FDCWD, source, AT_FDCWD, target, 0, REFLINK_NOPERMS);

Because the 99% user wants a real snapshot, and doesn't want to have to
think about it.  They could, of course, code up their own permission
checks to see which variant of reflink to call, but it's still useless
(to them) boilerplate.
	Also, if the 'common' user has to use the reflinkat() call?
We've lost.
	Finally, how is this safer?  Don't get me wrong, I do respect
the concern - that's why I originally went with your proposal of
is_owner_or_cap().  But the fact is that if you've hijacked a process
with enough privileges, you *can* make the full reflink, and if your
hijacked process doesn't but does have read access, you *can* make the
NOPERMS reflink.  So doing it with the userspace code above is identical
to the kernel code, except that every userspace program has to handle it
themselves.

Joel

-- 

"Vote early and vote often." 
        - Al Capone

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [RFC] The reflink(2) system call v4.
  2009-05-14 22:00             ` Joel Becker
@ 2009-05-15  1:20               ` Jamie Lokier
  2009-05-15 12:01               ` Stephen Smalley
  1 sibling, 0 replies; 151+ messages in thread
From: Jamie Lokier @ 2009-05-15  1:20 UTC (permalink / raw)
  To: Stephen Smalley, Andy Lutomirski, jim owens, jmorris, ocfs2-devel,
	viro, mtk.manpages

Joel Becker wrote:
> 	Here's my problem.  Every single shell script now has to do:
> 
>     ln -r source target
>     [ $? != 0 ] && ln -r --no-perms source target

No, they'll obviously do

    ln -Rr source target

It is not a burden to type that.

(Where -R == your -r --no-perms, and -R -r together means try -R then -r).

> Every single program now has to do:
> 
>     if (reflink(source, target) && errno == EPERM)
>         reflinkat(AT_FDCWD, source, AT_FDCWD, target, 0, REFLINK_NOPERMS);

Yes if that's what they want.

> Because the 99% user wants a real snapshot,

A quick poll based on emails in these threads says >50% doesn't want a real snapshot :-)

But even at 99%, what about the other 1%?

As I've explained, it is _impossible_ for userspace to do the "ln -r"
thing itself in some conditions given your system call.

> and doesn't want to have to think about it.

The problem with the "automatic" switch is that it isn't obvious, so
people will make mistaken assumptions when using it.

If they _want_ the automatic switch, then a few moments of thought
doesn't matter.  Make it easy if you care: like "ln -Rr" in scripts
and a flag REFLINK_PERMS_IF_ALLOWED in the system call.

This is especially so with reflink(), because the userspace code needed
if you _didn't_ want the automatic change is tricky to write (and
extremely difficult to get right), so authors will either not bother,
or do it badly.

And test suites for programs using reflink() will pass nicely, yet the
code may still be broken because ordinary users can't test the "other
user's files" cases.

> They could, of course, code up their own permission
> checks to see which variant of reflink to call, but it's still useless
> (to them) boilerplate.

Why wouldn't you just do the two calls?  It's much easier.  But even
that goes away with REFLINK_PERMS_IF_ALLOWED (and conversely
REFLINK_PERMS_STRICT).

(Note it's not just permissions - it's also timestamps, group,
xattrs. The flag names could reflect that).

> 	Also, if the 'common' user has to use the reflinkat() call?  We've lost.

Provide a reflink() call in libc.  Problem solved.

Heck, provide separate reflink() and cowlink() calls in libc if you
don't like a flag.

> 	Finally, how is this safer?  Don't get me wrong, I do respect
> the concern - that's why I originally went with your proposal of
> is_owner_or_cap().  But the fact is that if you've hijacked a process
> with enough privileges, you *can* make the full reflink, and if your
> hijacked process doesn't but does have read access, you *can* make the
> NOPERMS reflink.

If you can trick a process into unexpected behaviour, it doesn't mean
you can make it do just anything.  It means you can trick specific
checks and assumptions that the program makes into being wrong,
because you made something behave in a way the authors didn't expect.
Building on that, sometimes the trick is enough to make a backdoor.

Which is why file system calls should behave in a simple way that
doesn't surprise anyone.

> So doing it with the userspace code above is identical
> to the kernel code, except that every userspace program has to handle it
> themselves.

No because not every userspace program _wants_ that behaviour.

So you have these problems if it's forced in the kernel:

    - Userspace programs that _don't want_ a "full reflink" but have the
      privilege to do so.  Sometimes they can't do the chmod/etc. to
      fix the attributes after _at all_ (think setgid-directories
      among other things - it's *hard* to simulate that in userspace
      and never quite right).

    - Sometimes fixing up afterwards would be a security race
      condition - the temporary unwanted permissions can be
      looser than the process wants to expose in the new directory.

What I'm seeing is that for the benefit of saving exactly one line in
some userspace programs - a line which is quite helpful in showing
what the program intends - it will cost about 1000 lines of code
(which is still slightly broken) in other userspace programs, and I
can think of a number of those programs already.  Not pretty.

If you don't like the two calls, just add a flag which means try one
then the other.  Then it's clear what the app is requesting, and
invites authors to decide what behaviour they want, trivially.

-- Jamie

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [RFC] The reflink(2) system call v4.
  2009-05-14 22:00             ` Joel Becker
  2009-05-15  1:20               ` Jamie Lokier
@ 2009-05-15 12:01               ` Stephen Smalley
  2009-05-15 15:22                 ` Joel Becker
  1 sibling, 1 reply; 151+ messages in thread
From: Stephen Smalley @ 2009-05-15 12:01 UTC (permalink / raw)
  To: Joel Becker
  Cc: Andy Lutomirski, jim owens, jmorris, ocfs2-devel, viro,
	mtk.manpages, linux-security-module, linux-fsdevel

On Thu, 2009-05-14 at 15:00 -0700, Joel Becker wrote:
> On Thu, May 14, 2009 at 02:12:45PM -0400, Stephen Smalley wrote:
> > On Wed, 2009-05-13 at 23:57 -0400, Andy Lutomirski wrote:
> > > Joel Becker wrote:
> > > > +
> > > > +Preserving the security context of the source file obviously requires
> > > > +the privilege to do so.  Callers that do not own the source file and do
> > > > +not have CAP_CHOWN will get a new reflink with all non-security
> > > > +attributes preserved; the security context of the new reflink will be
> > > > +as a newly created file by that user.
> > > > +
> > > 
> > > There are plenty of syscalls that require some privilege and fail if the 
> > > caller doesn't have it.  But I can think of only one syscall that does 
> > > *something different* depending on who called it: setuid.
> > > 
> > > Please search the web and marvel at the disasters caused by setuid's 
> > > magical caller-dependent behavior (the sendmail bug is probably the most 
> > > famous [1]).  This proposal for reflink is just asking for bugs where an 
> > > attacker gets some otherwise privileged program to call reflink but to 
> > > somehow lack the privileges (CAP_CHOWN, selinux rights, or whatever) to 
> > > copy security attributes, thus exposing a link with the wrong permissions.
> > > 
> > > Would it really be that hard to have two syscalls, or a flag, or 
> > > whatever, where one of them preserves all security attributes and 
> > > *fails* if the caller isn't allowed to do that and the other one makes 
> > > the caller own the new link?
> > > 
> > > 
> > > [1] http://www.cs.berkeley.edu/~daw/papers/setuid-usenix02.pdf
> > 
> > Yes, I agree - the selection of whether or not to preserve the security
> > attributes should be an explicit part of the kernel interface.  Then the
> > application still has the freedom to fall back on the non-preserving
> > form of the call if that is truly what it wants.
> 
> 	Here's my problem.  Every single shell script now has to do:
> 
>     ln -r source target
>     [ $? != 0 ] && ln -r --no-perms source target
> 
> Every single program now has to do:
> 
>     if (reflink(source, target) && errno == EPERM)
>         reflinkat(AT_FDCWD, source, AT_FDCWD, target, 0, REFLINK_NOPERMS);
> 
> Because the 99% user wants a real snapshot, and doesn't want to have to
> think about it.  They could, of course, code up their own permission
> checks to see which variant of reflink to call, but it's still useless
> (to them) boilerplate.
> 	Also, if the 'common' user has to use the reflinkat() call?
> We've lost.

I think Jamie covered the fact that you can provide a user interface and
library functions that provide the "simpler" interface on top of this
interface, but not vice versa.

> 	Finally, how is this safer?  Don't get me wrong, I do respect
> the concern - that's why I originally went with your proposal of
> is_owner_or_cap().  But the fact is that if you've hijacked a process
> with enough privileges, you *can* make the full reflink, and if your
> hijacked process doesn't but does have read access, you *can* make the
> NOPERMS reflink.  So doing it with the userspace code above is identical
> to the kernel code, except that every userspace program has to handle it
> themselves.

As Jamie said, we aren't talking about injecting arbitrary code into the
process.  The failure scenario is quite similar to the setuid() one:
arrange conditions such that the process lacks sufficient privileges to
preserve attributes, and when it calls reflink(2) expecting to preserve
the attributes, it will get no indication that they weren't preserved.
At which point the data may be unwittingly exposed beyond its original
constraints.

-- 
Stephen Smalley
National Security Agency


^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [RFC] The reflink(2) system call v4.
  2009-05-15 12:01               ` Stephen Smalley
@ 2009-05-15 15:22                 ` Joel Becker
  2009-05-15 15:55                   ` Stephen Smalley
  0 siblings, 1 reply; 151+ messages in thread
From: Joel Becker @ 2009-05-15 15:22 UTC (permalink / raw)
  To: Stephen Smalley
  Cc: Andy Lutomirski, jmorris, linux-fsdevel, linux-security-module,
	mtk.manpages, jim owens, ocfs2-devel, viro

On Fri, May 15, 2009 at 08:01:45AM -0400, Stephen Smalley wrote:
> > 	Finally, how is this safer?  Don't get me wrong, I do respect
> > the concern - that's why I originally went with your proposal of
> > is_owner_or_cap().  But the fact is that if you've hijacked a process
> > with enough privileges, you *can* make the full reflink, and if your
> > hijacked process doesn't but does have read access, you *can* make the
> > NOPERMS reflink.  So doing it with the userspace code above is identical
> > to the kernel code, except that every userspace program has to handle it
> > themselves.
> 
> As Jamie said, we aren't talking about injecting arbitrary code into the
> process.  The failure scenario is quite similar to the setuid() one:
> arrange conditions such that the process lacks sufficient privileges to
> preserve attributes, and when it calls reflink(2) expecting to preserve
> the attributes, it will get no indication that they weren't preserved.
> At which point the data may be unwittingly exposed beyond its original
> constraints.

	I wasn't being specific to injected code.  Assume we have a
deliberate flag to reflinkat(2).  Then we provide reflink(3) in
userspace that does the fallback, keeping it out of the kernel.  Doesn't
that have the exact same problem?

Joel

-- 

"Same dancers in the same old shoes.
 You get too careful with the steps you choose.
 You don't care about winning but you don't want to lose
 After the thrill is gone."

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [RFC] The reflink(2) system call v4.
  2009-05-15 15:22                 ` Joel Becker
@ 2009-05-15 15:55                   ` Stephen Smalley
  2009-05-15 16:42                     ` Joel Becker
  0 siblings, 1 reply; 151+ messages in thread
From: Stephen Smalley @ 2009-05-15 15:55 UTC (permalink / raw)
  To: Joel Becker
  Cc: Andy Lutomirski, jim owens, jmorris, ocfs2-devel, viro,
	mtk.manpages, linux-security-module, linux-fsdevel

On Fri, 2009-05-15 at 08:22 -0700, Joel Becker wrote:
> On Fri, May 15, 2009 at 08:01:45AM -0400, Stephen Smalley wrote:
> > > 	Finally, how is this safer?  Don't get me wrong, I do respect
> > > the concern - that's why I originally went with your proposal of
> > > is_owner_or_cap().  But the fact is that if you've hijacked a process
> > > with enough privileges, you *can* make the full reflink, and if your
> > > hijacked process doesn't but does have read access, you *can* make the
> > > NOPERMS reflink.  So doing it with the userspace code above is identical
> > > to the kernel code, except that every userspace program has to handle it
> > > themselves.
> > 
> > As Jamie said, we aren't talking about injecting arbitrary code into the
> > process.  The failure scenario is quite similar to the setuid() one:
> > arrange conditions such that the process lacks sufficient privileges to
> > preserve attributes, and when it calls reflink(2) expecting to preserve
> > the attributes, it will get no indication that they weren't preserved.
> > At which point the data may be unwittingly exposed beyond its original
> > constraints.
> 
> 	I wasn't being specific to injected code.  Assume we have a
> deliberate flag to reflinkat(2).  Then we provide reflink(3) in
> userspace that does the fallback, keeping it out of the kernel.  Doesn't
> that have the exact same problem?

You wouldn't always do the fallback in reflink(3), but instead provide a
helper interface that would perform the fallback for applications that
want that behavior.

Consider a program that wants to always preserve attributes on the
reflinks it creates.  If the interface allows the program to explicitly
request that behavior and returns an error when the request cannot be
honored, then the program knows that upon a successful return, the
attributes were in fact preserved.  If the interface instead silently
selects a behavior based on the current privileges of the process and
gives no indication to the caller as to what behavior was selected, then
the opportunity for error is great.

-- 
Stephen Smalley
National Security Agency


^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [RFC] The reflink(2) system call v4.
  2009-05-15 15:55                   ` Stephen Smalley
@ 2009-05-15 16:42                     ` Joel Becker
  2009-05-15 17:01                       ` Shaya Potter
  2009-05-15 20:53                       ` [Ocfs2-devel] " Joel Becker
  0 siblings, 2 replies; 151+ messages in thread
From: Joel Becker @ 2009-05-15 16:42 UTC (permalink / raw)
  To: Stephen Smalley
  Cc: Andy Lutomirski, jim owens, jmorris, ocfs2-devel, viro,
	mtk.manpages, linux-security-module, linux-fsdevel

On Fri, May 15, 2009 at 11:55:25AM -0400, Stephen Smalley wrote:
> > 	I wasn't being specific to injected code.  Assume we have a
> > deliberate flag to reflinkat(2).  Then we provide reflink(3) in
> > userspace that does the fallback, keeping it out of the kernel.  Doesn't
> > that have the exact same problem?
> 
> You wouldn't always do the fallback in reflink(3), but instead provide a
> helper interface that would perform the fallback for applications that
> want that behavior.

	But isn't that reflink(3)?  And the application that wants to
know uses reflinkat(2)?
> 
> Consider a program that wants to always preserve attributes on the
> reflinks it creates.  If the interface allows the program to explicitly
> request that behavior and returns an error when the request cannot be
> honored, then the program knows that upon a successful return, the
> attributes were in fact preserved.  If the interface instead silently
> selects a behavior based on the current privileges of the process and
> gives no indication to the caller as to what behavior was selected, then
> the opportunity for error is great.

	I get that.  I'm looking at what the programming interface is.
What's the standard function for "I want the fallback behavior" called?
What's the standard function for "I want preserve security" called?
"int reflink(oldpath, newpath)" has to pick one of the behaviors.  Which
is it?

Joel

-- 

Life's Little Instruction Book #69

	"Whistle"

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [RFC] The reflink(2) system call v4.
  2009-05-15 16:42                     ` Joel Becker
@ 2009-05-15 17:01                       ` Shaya Potter
  2009-05-15 20:53                       ` [Ocfs2-devel] " Joel Becker
  1 sibling, 0 replies; 151+ messages in thread
From: Shaya Potter @ 2009-05-15 17:01 UTC (permalink / raw)
  To: Stephen Smalley, Andy Lutomirski, jim owens, jmorris, ocfs2-devel,
	viro, mtk.manpages

Joel Becker wrote:
> On Fri, May 15, 2009 at 11:55:25AM -0400, Stephen Smalley wrote:
>>> 	I wasn't being specific to injected code.  Assume we have a
>>> deliberate flag to reflinkat(2).  Then we provide reflink(3) in
>>> userspace that does the fallback, keeping it out of the kernel.  Doesn't
>>> that have the exact same problem?
>> You wouldn't always do the fallback in reflink(3), but instead provide a
>> helper interface that would perform the fallback for applications that
>> want that behavior.
> 
> 	But isn't that reflink(3)?  And the application that wants to
> know uses reflinkat(2)?
>> Consider a program that wants to always preserve attributes on the
>> reflinks it creates.  If the interface allows the program to explicitly
>> request that behavior and returns an error when the request cannot be
>> honored, then the program knows that upon a successful return, the
>> attributes were in fact preserved.  If the interface instead silently
>> selects a behavior based on the current privileges of the process and
>> gives no indication to the caller as to what behavior was selected, then
>> the opportunity for error is great.
> 
> 	I get that.  I'm looking at what the programming interface is.
> What's the standard function for "I want the fallback behavior" called?
> What's the standard function for "I want preserve security" called?
> "int reflink(oldpath, newpath)" has to pick one of the behaviors.  Which
> is it?

whenever there's hidden fallback behavior that changes the security
semantics you will cause programming error.

the only correct way for an application that wants the fallback
functionality to code it is

if (initial_behavior()) {
	if (fallback_behavior()) {
		/* some sort of error */
	}
}

as that way the application knows what occurred.  if that logic is
wrapped in a single function, you would have to do something like

ret = initial_and_fallback();
if (ret == 0) {
	fallback = 0;
} else if (ret == 1) {
	fallback = 1;
} else {
	/* some sort of error */
}

which is much more prone to error.

at the end of the day, a single function that has hidden fallback
behavior does not really save lines of code in a well written
application.  it does however make it easier to write a poorly written
application that can cause security problems.
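
The explicit two-step pattern described above can be sketched in C as
follows.  Everything named here is hypothetical: strict_reflink_stub()
and noperms_reflink_stub() stand in for the preserving and
non-preserving forms of the call, and caller_may_preserve only
simulates the privilege check so the example can run without the
proposed syscall.

```c
#include <errno.h>

/* Hypothetical switch simulating whether the caller has the privilege
 * to preserve security attributes. */
static int caller_may_preserve;

/* Stand-in for the preserving form: fails with EPERM instead of
 * silently doing something else. */
static int strict_reflink_stub(const char *oldpath, const char *newpath)
{
	(void)oldpath;
	(void)newpath;
	if (!caller_may_preserve) {
		errno = EPERM;
		return -1;
	}
	return 0;
}

/* Stand-in for the non-preserving form: the caller owns the new link. */
static int noperms_reflink_stub(const char *oldpath, const char *newpath)
{
	(void)oldpath;
	(void)newpath;
	return 0;
}

enum reflink_outcome {
	REFLINK_PRESERVED,	/* attributes were kept */
	REFLINK_NOT_PRESERVED,	/* fallback taken, attributes reset */
	REFLINK_FAILED		/* neither form succeeded */
};

/* The caller asks for the fallback explicitly and is told which path
 * was actually taken. */
static enum reflink_outcome reflink_explicit_fallback(const char *oldpath,
						      const char *newpath)
{
	if (strict_reflink_stub(oldpath, newpath) == 0)
		return REFLINK_PRESERVED;
	if (errno != EPERM)
		return REFLINK_FAILED;
	if (noperms_reflink_stub(oldpath, newpath) == 0)
		return REFLINK_NOT_PRESERVED;
	return REFLINK_FAILED;
}
```

A program that must never expose data with reset attributes would call
only the strict form and treat EPERM as a hard error; only a program
that opts in sees REFLINK_NOT_PRESERVED.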

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [Ocfs2-devel] [RFC] The reflink(2) system call v4.
  2009-05-15 16:42                     ` Joel Becker
  2009-05-15 17:01                       ` Shaya Potter
@ 2009-05-15 20:53                       ` Joel Becker
  2009-05-18  9:17                         ` Jörn Engel
                                           ` (3 more replies)
  1 sibling, 4 replies; 151+ messages in thread
From: Joel Becker @ 2009-05-15 20:53 UTC (permalink / raw)
  To: Stephen Smalley, Andy Lutomirski, jim owens, jmorris, ocfs2-devel,
	viro, mtk.manpages

On Fri, May 15, 2009 at 09:42:09AM -0700, Joel Becker wrote:
> On Fri, May 15, 2009 at 11:55:25AM -0400, Stephen Smalley wrote:
> > Consider a program that wants to always preserve attributes on the
> > reflinks it creates.  If the interface allows the program to explicitly
> > request that behavior and returns an error when the request cannot be
> > honored, then the program knows that upon a successful return, the
> > attributes were in fact preserved.  If the interface instead silently
> > selects a behavior based on the current privileges of the process and
> > gives no indication to the caller as to what behavior was selected, then
> > the opportunity for error is great.
> 
> 	I get that.  I'm looking at what the programming interface is.
> What's the standard function for "I want the fallback behavior" called?
> What's the standard function for "I want preserve security" called?
> "int reflink(oldpath, newpath)" has to pick one of the behaviors.  Which
> is it?

	Ok, I've been casting about how to solve the concern and provide
a decent interface.  I'm not about to give up on either.  I think,
though, that we do have to let the application signal its intent to the
system.  And if we're doing that, let's add a little flexibility.
	I think the interface will be this (ignoring the reflinkat(2)
bit for now):

int reflink(const char *oldpath, const char *newpath, int preserve);

- Data and xattrs are reflinked always.
- 'preserve' is a bitfield describing which attributes to keep across the
  reflink:
 * REFLINK_ATTR_OWNER - Keeps uid/gid the same.  Requires ownership or
   CAP_CHOWN.
 * REFLINK_ATTR_SECURITY - Keeps the security state (SELinux/SMACK/etc)
   the same.  This requires REFLINK_ATTR_OWNER (the security state makes
   no sense if the ownership changes).  If not set, the filesystem wipes
   all security.* xattrs and reinitializes with
   security_inode_init_security() just like a new file.
 * REFLINK_ATTR_MODE - Keeps the mode bits the same.  Requires ownership
   or CAP_FOWNER.
 * REFLINK_ATTR_ACL - Keeps the ACLs the same.  Requires
   REFLINK_ATTR_MODE, as ACLs have to get adjusted when the mode
   changes, and so you can't keep them the same if the mode wasn't
   preserved.  If not set, the filesystem reinits the ACLs as for a new
   file.
- REFLINK_ATTR_NONE is 0 and REFLINK_ATTR_ALL is ~0.
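
As an illustration of the dependencies listed above, here is a minimal
sketch in C.  The bit values and the helper name
reflink_preserve_valid() are assumptions made for this example: the
proposal names the flags but assigns no numbers, and it declares
'preserve' as int (unsigned is used here only to keep the bit
operations well defined).

```c
/* Hypothetical bit assignments; only NONE == 0 and ALL == ~0 come from
 * the proposal itself. */
#define REFLINK_ATTR_NONE	0u
#define REFLINK_ATTR_OWNER	(1u << 0)
#define REFLINK_ATTR_SECURITY	(1u << 1)
#define REFLINK_ATTR_MODE	(1u << 2)
#define REFLINK_ATTR_ACL	(1u << 3)
#define REFLINK_ATTR_ALL	(~0u)

/* Returns 1 if the mask respects the stated dependencies, 0 otherwise.
 * This checks only the flag combinations, not the caller's privileges. */
static int reflink_preserve_valid(unsigned int preserve)
{
	/* Security state makes no sense if the ownership changes. */
	if ((preserve & REFLINK_ATTR_SECURITY) &&
	    !(preserve & REFLINK_ATTR_OWNER))
		return 0;

	/* ACLs cannot stay the same if the mode was not preserved. */
	if ((preserve & REFLINK_ATTR_ACL) &&
	    !(preserve & REFLINK_ATTR_MODE))
		return 0;

	return 1;
}
```

With these assumed values, both REFLINK_ATTR_NONE and REFLINK_ATTR_ALL
pass the check, while REFLINK_ATTR_SECURITY or REFLINK_ATTR_ACL on
their own are rejected.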

	That's all the relevant attributes.  The timestamps behave as
already described (ctime is now, mtime matches the source), which is the
only sane behavior for this sort of thing.
	So, a copy program would reflink(source, target,
REFLINK_ATTR_NONE), a snapshot program would reflink(source, target,
REFLINK_ATTR_ALL), and someone wanting the fallback behavior can do it
easily.
	In the kernel, security_inode_reflink() gets passed the preserve
bits.  It's responsible for determining whether REFLINK_ATTR_SECURITY is
allowed (vfs_reflink() will already have asserted REFLINK_ATTR_OWNER).
It may do other checks on the reflink and the preserve bits, that's up
to the LSM.
	For scripting, we add '-p' and '-P' to "ln -r":

- ln -r == reflink(source, target, REFLINK_ATTR_NONE);
- ln -r -P == reflink(source, target, REFLINK_ATTR_ALL);
- ln -r -p == the fallback behavior.  This is like cp(1), where "cp -p"
  is best-effort.

	Does this make everyone happy?
Joel

-- 

"In the beginning, the universe was created. This has made a lot 
 of people very angry, and is generally considered to have been a 
 bad move."
        - Douglas Adams

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [Ocfs2-devel] [RFC] The reflink(2) system call v4.
  2009-05-15 20:53                       ` [Ocfs2-devel] " Joel Becker
@ 2009-05-18  9:17                         ` Jörn Engel
  2009-05-18 13:02                         ` Stephen Smalley
                                           ` (2 subsequent siblings)
  3 siblings, 0 replies; 151+ messages in thread
From: Jörn Engel @ 2009-05-18  9:17 UTC (permalink / raw)
  To: Joel Becker
  Cc: Stephen Smalley, Andy Lutomirski, jim owens, jmorris, ocfs2-devel,
	viro, mtk.manpages, linux-security-module, linux-fsdevel

On Fri, 15 May 2009 13:53:35 -0700, Joel Becker wrote:
> 
> 	Does this make everyone happy?

Provided the only fallback is to return an error code and let userspace
decide what to do, I'm a happy camper.

Not sure how many of the REFLINK_ATTR_* flags will actually be used,
apart from ALL and NONE.  But I don't mind having them.

Jörn

-- 
People will accept your ideas much more readily if you tell them
that Benjamin Franklin said it first.
-- unknown

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [Ocfs2-devel] [RFC] The reflink(2) system call v4.
  2009-05-15 20:53                       ` [Ocfs2-devel] " Joel Becker
  2009-05-18  9:17                         ` Jörn Engel
@ 2009-05-18 13:02                         ` Stephen Smalley
  2009-05-18 14:33                           ` Stephen Smalley
  2009-05-18 18:26                           ` Joel Becker
  2009-05-19 19:33                         ` Jonathan Corbet
       [not found]                         ` <20090519132057.419b9de0@bike.lwn.net>
  3 siblings, 2 replies; 151+ messages in thread
From: Stephen Smalley @ 2009-05-18 13:02 UTC (permalink / raw)
  To: Joel Becker
  Cc: Andy Lutomirski, jim owens, jmorris, ocfs2-devel, viro,
	mtk.manpages, linux-security-module, linux-fsdevel

On Fri, 2009-05-15 at 13:53 -0700, Joel Becker wrote:
> On Fri, May 15, 2009 at 09:42:09AM -0700, Joel Becker wrote:
> > On Fri, May 15, 2009 at 11:55:25AM -0400, Stephen Smalley wrote:
> > > Consider a program that wants to always preserve attributes on the
> > > reflinks it creates.  If the interface allows the program to explicitly
> > > request that behavior and returns an error when the request cannot be
> > > honored, then the program knows that upon a successful return, the
> > > attributes were in fact preserved.  If the interface instead silently
> > > selects a behavior based on the current privileges of the process and
> > > gives no indication to the caller as to what behavior was selected, then
> > > the opportunity for error is great.
> > 
> > 	I get that.  I'm looking at what the programming interface is.
> > What's the standard function for "I want the fallback behavior" called?
> > What's the standard function for "I want preserve security" called?
> > "int reflink(oldpath, newpath)" has to pick one of the behaviors.  Which
> > is it?
> 
> 	Ok, I've been casting about how to solve the concern and provide
> a decent interface.  I'm not about to give up on either.  I think,
> though, that we do have to let the application signal its intent to the
> system.  And if we're doing that, let's add a little flexibility.
> 	I think the interface will be this (ignoring the reflinkat(2)
> bit for now):
> 
> int reflink(const char *oldpath, const char *newpath, int preserve);
> 
> - Data and xattrs are reflinked always.
> - 'preserve' is a bitfield describing which attributes to keep across the
>   reflink:
>  * REFLINK_ATTR_OWNER - Keeps uid/gid the same.  Requires ownership or
>    CAP_CHOWN.
>  * REFLINK_ATTR_SECURITY - Keeps the security state (SELinux/SMACK/etc)
>    the same.  This requires REFLINK_ATTR_OWNER (the security state makes
>    no sense if the ownership changes).  If not set, the filesystem wipes
>    all security.* xattrs and reinitializes with
>    security_inode_init_security() just like a new file.
>  * REFLINK_ATTR_MODE - Keeps the mode bits the same.  Requires ownership
>    or CAP_FOWNER.
>  * REFLINK_ATTR_ACL - Keeps the ACLs the same.  Requires
>    REFLINK_ATTR_MODE, as ACLs have to get adjusted when the mode
>    changes, and so you can't keep them the same if the mode wasn't
>    preserved.  If not set, the filesystem reinits the ACLs as for a new
>    file.
> - REFLINK_ATTR_NONE is 0 and REFLINK_ATTR_ALL is ~0.
> 
> 	That's all the relevant attributes.  The timestamps behave as
> already described (ctime is now, mtime matches the source), which is the
> only sane behavior for this sort of thing.
> 	So, a copy program would reflink(source, target,
> REFLINK_ATTR_NONE), a snapshot program would reflink(source, target,
> REFLINK_ATTR_ALL), and someone wanting the fallback behavior can do it
> easily.
> 	In the kernel, security_inode_reflink() gets passed the preserve
> bits.  It's responsible for determining whether REFLINK_ATTR_SECURITY is
> allowed (vfs_reflink() will already have asserted REFLINK_ATTR_OWNER).
> It may do other checks on the reflink and the preserve bits, that's up
> to the LSM.
> 	For scripting, we add '-p' and '-P' to "ln -r":
> 
> - ln -r == reflink(source, target, REFLINK_ATTR_NONE);
> - ln -r -P == reflink(source, target, REFLINK_ATTR_ALL);
> - ln -r -p == the fallback behavior.  This is like cp(1), where "cp -p"
>   is best-effort.
> 
> 	Does this make everyone happy?

For simplicity and robustness, I would only support the none or all
flags, i.e. preserve can be a simple bool.  I don't think you really
want to deal with the individual flags, and I don't see a use case for
them.

-- 
Stephen Smalley
National Security Agency


* Re: [Ocfs2-devel] [RFC] The reflink(2) system call v4.
  2009-05-18 13:02                         ` Stephen Smalley
@ 2009-05-18 14:33                           ` Stephen Smalley
  2009-05-18 17:15                             ` Stephen Smalley
  2009-05-18 18:26                           ` Joel Becker
  1 sibling, 1 reply; 151+ messages in thread
From: Stephen Smalley @ 2009-05-18 14:33 UTC (permalink / raw)
  To: Joel Becker
  Cc: Andy Lutomirski, jim owens, jmorris, ocfs2-devel, viro,
	mtk.manpages, linux-security-module, linux-fsdevel

On Mon, 2009-05-18 at 09:02 -0400, Stephen Smalley wrote:
> On Fri, 2009-05-15 at 13:53 -0700, Joel Becker wrote:
> > On Fri, May 15, 2009 at 09:42:09AM -0700, Joel Becker wrote:
> > > On Fri, May 15, 2009 at 11:55:25AM -0400, Stephen Smalley wrote:
> > > > Consider a program that wants to always preserve attributes on the
> > > > reflinks it creates.  If the interface allows the program to explicitly
> > > > request that behavior and returns an error when the request cannot be
> > > > honored, then the program knows that upon a successful return, the
> > > > attributes were in fact preserved.  If the interface instead silently
> > > > selects a behavior based on the current privileges of the process and
> > > > gives no indication to the caller as to what behavior was selected, then
> > > > the opportunity for error is great.
> > > 
> > > 	I get that.  I'm looking at what the programming interface is.
> > > What's the standard function for "I want the fallback behavior" called?
> > > What's the standard function for "I want preserve security" called?
> > > "int reflink(oldpath, newpath)" has to pick one of the behaviors.  Which
> > > is it?
> > 
> > 	Ok, I've been casting about how to solve the concern and provide
> > a decent interface.  I'm not about to give up on either.  I think,
> > though, that we do have to let the application signal its intent to the
> > system.  And if we're doing that, let's add a little flexibility.
> > 	I think the interface will be this (ignoring the reflinkat(2)
> > bit for now):
> > 
> > int reflink(const char *oldpath, const char *newpath, int preserve);
> > 
> > - Data and xattrs are reflinked always.
> > - 'preserve' is a bitfield describing which attributes to keep across the
> >   reflink:
> >  * REFLINK_ATTR_OWNER - Keeps uid/gid the same.  Requires ownership or
> >    CAP_CHOWN.
> >  * REFLINK_ATTR_SECURITY - Keeps the security state (SELinux/SMACK/etc)
> >    the same.  This requires REFLINK_ATTR_OWNER (the security state makes
> >    no sense if the ownership changes).  If not set, the filesystem wipes
> >    all security.* xattrs and reinitializes with
> >    security_inode_init_security() just like a new file.
> >  * REFLINK_ATTR_MODE - Keeps the mode bits the same.  Requires ownership
> >    or CAP_FOWNER.
> >  * REFLINK_ATTR_ACL - Keeps the ACLs the same.  Requires
> >    REFLINK_ATTR_MODE, as ACLs have to get adjusted when the mode
> >    changes, and so you can't keep them the same if the mode wasn't
> >    preserved.  If not set, the filesystem reinits the ACLs as for a new
> >    file.
> > - REFLINK_ATTR_NONE is 0 and REFLINK_ATTR_ALL is ~0.
> > 
> > 	That's all the relevant attributes.  The timestamps behave as
> > already described (ctime is now, mtime matches the source), which is the
> > only sane behavior for this sort of thing.
> > 	So, a copy program would reflink(source, target,
> > REFLINK_ATTR_NONE), a snapshot program would reflink(source, target,
> > REFLINK_ATTR_ALL), and someone wanting the fallback behavior can do it
> > easily.
> > 	In the kernel, security_inode_reflink() gets passed the preserve
> > bits.  It's responsible for determining whether REFLINK_ATTR_SECURITY is
> > allowed (vfs_reflink() will already have asserted REFLINK_ATTR_OWNER).
> > It may do other checks on the reflink and the preserve bits, that's up
> > to the LSM.
> >         For scripting, we add '-p' and '-P' to "ln -r":
> > 
> > - ln -r == reflink(source, target, REFLINK_ATTR_NONE);
> > - ln -r -P == reflink(source, target, REFLINK_ATTR_ALL);
> > - ln -r -p == the fallback behavior.  This is like cp(1), where "cp -p"
> >   is best-effort.
> > 
> > 	Does this make everyone happy?
> 
> For simplicity and robustness, I would only support the none or all
> flags, i.e. preserve can be a simple bool.  I don't think you really
> want to deal with the individual flags, and I don't see a use case for
> them.

Or possibly only distinguish preserve-dac from preserve-mac, e.g.
REFLINK_ATTR_NONE (preserve none),
REFLINK_ATTR_DAC (preserve uid, gid, mode, and ACLs ala cp -p)
REFLINK_ATTR_MAC (preserve MAC security label ala cp -c)
REFLINK_ATTR_ALL (preserve all)

-- 
Stephen Smalley
National Security Agency


* Re: [Ocfs2-devel] [RFC] The reflink(2) system call v4.
  2009-05-18 14:33                           ` Stephen Smalley
@ 2009-05-18 17:15                             ` Stephen Smalley
  0 siblings, 0 replies; 151+ messages in thread
From: Stephen Smalley @ 2009-05-18 17:15 UTC (permalink / raw)
  To: Joel Becker
  Cc: Andy Lutomirski, jim owens, jmorris, ocfs2-devel, viro,
	mtk.manpages, linux-security-module, linux-fsdevel

On Mon, 2009-05-18 at 10:33 -0400, Stephen Smalley wrote:
> On Mon, 2009-05-18 at 09:02 -0400, Stephen Smalley wrote:
> > On Fri, 2009-05-15 at 13:53 -0700, Joel Becker wrote:
> > > On Fri, May 15, 2009 at 09:42:09AM -0700, Joel Becker wrote:
> > > > On Fri, May 15, 2009 at 11:55:25AM -0400, Stephen Smalley wrote:
> > > > > Consider a program that wants to always preserve attributes on the
> > > > > reflinks it creates.  If the interface allows the program to explicitly
> > > > > request that behavior and returns an error when the request cannot be
> > > > > honored, then the program knows that upon a successful return, the
> > > > > attributes were in fact preserved.  If the interface instead silently
> > > > > selects a behavior based on the current privileges of the process and
> > > > > gives no indication to the caller as to what behavior was selected, then
> > > > > the opportunity for error is great.
> > > > 
> > > > 	I get that.  I'm looking at what the programming interface is.
> > > > What's the standard function for "I want the fallback behavior" called?
> > > > What's the standard function for "I want preserve security" called?
> > > > "int reflink(oldpath, newpath)" has to pick one of the behaviors.  Which
> > > > is it?
> > > 
> > > 	Ok, I've been casting about how to solve the concern and provide
> > > a decent interface.  I'm not about to give up on either.  I think,
> > > though, that we do have to let the application signal its intent to the
> > > system.  And if we're doing that, let's add a little flexibility.
> > > 	I think the interface will be this (ignoring the reflinkat(2)
> > > bit for now):
> > > 
> > > int reflink(const char *oldpath, const char *newpath, int preserve);
> > > 
> > > - Data and xattrs are reflinked always.
> > > - 'preserve' is a bitfield describing which attributes to keep across the
> > >   reflink:
> > >  * REFLINK_ATTR_OWNER - Keeps uid/gid the same.  Requires ownership or
> > >    CAP_CHOWN.
> > >  * REFLINK_ATTR_SECURITY - Keeps the security state (SELinux/SMACK/etc)
> > >    the same.  This requires REFLINK_ATTR_OWNER (the security state makes
> > >    no sense if the ownership changes).  If not set, the filesystem wipes
> > >    all security.* xattrs and reinitializes with
> > >    security_inode_init_security() just like a new file.
> > >  * REFLINK_ATTR_MODE - Keeps the mode bits the same.  Requires ownership
> > >    or CAP_FOWNER.
> > >  * REFLINK_ATTR_ACL - Keeps the ACLs the same.  Requires
> > >    REFLINK_ATTR_MODE, as ACLs have to get adjusted when the mode
> > >    changes, and so you can't keep them the same if the mode wasn't
> > >    preserved.  If not set, the filesystem reinits the ACLs as for a new
> > >    file.
> > > - REFLINK_ATTR_NONE is 0 and REFLINK_ATTR_ALL is ~0.
> > > 
> > > 	That's all the relevant attributes.  The timestamps behave as
> > > already described (ctime is now, mtime matches the source), which is the
> > > only sane behavior for this sort of thing.
> > > 	So, a copy program would reflink(source, target,
> > > REFLINK_ATTR_NONE), a snapshot program would reflink(source, target,
> > > REFLINK_ATTR_ALL), and someone wanting the fallback behavior can do it
> > > easily.
> > > 	In the kernel, security_inode_reflink() gets passed the preserve
> > > bits.  It's responsible for determining whether REFLINK_ATTR_SECURITY is
> > > allowed (vfs_reflink() will already have asserted REFLINK_ATTR_OWNER).
> > > It may do other checks on the reflink and the preserve bits, that's up
> > > to the LSM.
> > >         For scripting, we add '-p' and '-P' to "ln -r":
> > > 
> > > - ln -r == reflink(source, target, REFLINK_ATTR_NONE);
> > > - ln -r -P == reflink(source, target, REFLINK_ATTR_ALL);
> > > - ln -r -p == the fallback behavior.  This is like cp(1), where "cp -p"
> > >   is best-effort.
> > > 
> > > 	Does this make everyone happy?
> > 
> > For simplicity and robustness, I would only support the none or all
> > flags, i.e. preserve can be a simple bool.  I don't think you really
> > want to deal with the individual flags, and I don't see a use case for
> > them.
> 
> Or possibly only distinguish preserve-dac from preserve-mac, e.g.
> REFLINK_ATTR_NONE (preserve none),
> REFLINK_ATTR_DAC (preserve uid, gid, mode, and ACLs ala cp -p)
> REFLINK_ATTR_MAC (preserve MAC security label ala cp -c)
> REFLINK_ATTR_ALL (preserve all)

Even this distinction doesn't seem worthwhile and could get complicated,
e.g. security.capability is an alternative to using the setuid mode bit,
and thus logically would fall into the same class as the owner and mode.
I'd just limit reflink() to preserving none or all of the security
attributes.

-- 
Stephen Smalley
National Security Agency


* Re: [RFC] The reflink(2) system call v4.
  2009-05-18 13:02                         ` Stephen Smalley
  2009-05-18 14:33                           ` Stephen Smalley
@ 2009-05-18 18:26                           ` Joel Becker
  2009-05-19 16:32                             ` [Ocfs2-devel] " Sage Weil
  1 sibling, 1 reply; 151+ messages in thread
From: Joel Becker @ 2009-05-18 18:26 UTC (permalink / raw)
  To: Stephen Smalley
  Cc: Andy Lutomirski, jmorris, linux-fsdevel, linux-security-module,
	mtk.manpages, jim owens, ocfs2-devel, viro

On Mon, May 18, 2009 at 09:02:39AM -0400, Stephen Smalley wrote:
> For simplicity and robustness, I would only support the none or all
> flags, i.e. preserve can be a simple bool.  I don't think you really
> want to deal with the individual flags, and I don't see a use case for
> them.

	The simple use case I can think of is "I want a snapshot, but I
don't have rights to copy the MAC context".  Or "I want to own it, but I
want to keep all the ACLs for other users".
	Basically, if I'm adding another int argument to reflinkat(2), I
wanted to consider the future.  Maybe define it as 1 or 0, and leave the
use of the other bits for future possibilities?  If we're lucky, of
course, we never need future changes.

Joel

-- 

"There is a country in Europe where multiple-choice tests are
 illegal."
        - Sigfried Hulzer

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127

* Re: [Ocfs2-devel] [RFC] The reflink(2) system call v4.
  2009-05-18 18:26                           ` Joel Becker
@ 2009-05-19 16:32                             ` Sage Weil
  0 siblings, 0 replies; 151+ messages in thread
From: Sage Weil @ 2009-05-19 16:32 UTC (permalink / raw)
  To: Joel Becker
  Cc: Stephen Smalley, Andy Lutomirski, jim owens, jmorris, ocfs2-devel,
	viro, mtk.manpages, linux-security-module, linux-fsdevel

Hi Joel,

This version (with whatever flag simplifications are deemed appropriate) 
looks pretty good to me!

The only other thing I would like to see is a flag that makes copying the 
xattrs optional.  That's straying toward kitchen sink territory, but it 
seems like a natural enough interface once you're cherry-picking what to 
preserve in the reflink.  (Since you can always remove unwanted xattrs 
later, of course, it's certainly not a show-stopper.)

Thanks!
sage

* Re: [Ocfs2-devel] [RFC] The reflink(2) system call v4.
  2009-05-15 20:53                       ` [Ocfs2-devel] " Joel Becker
  2009-05-18  9:17                         ` Jörn Engel
  2009-05-18 13:02                         ` Stephen Smalley
@ 2009-05-19 19:33                         ` Jonathan Corbet
  2009-05-19 20:15                           ` Jamie Lokier
       [not found]                         ` <20090519132057.419b9de0@bike.lwn.net>
  3 siblings, 1 reply; 151+ messages in thread
From: Jonathan Corbet @ 2009-05-19 19:33 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-security-module

One tiny little thing that crossed my mind as I was looking at this...

> - REFLINK_ATTR_NONE is 0 and REFLINK_ATTR_ALL is ~0.

That, I think, could lead to unexpected results if different flags
(perhaps controlling different aspects of behavior altogether) are
added in the future.  Might it make more sense for REFLINK_ATTR_ALL to
be something like 0xffff, with the current implementation insisting
that all other bits are zero?  That would leave room for expansion of
the set of things covered by the "preserve all" semantics while,
simultaneously, allowing the addition of different types of flags
entirely.
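
A minimal sketch of how such a check might look, purely for illustration.
The bit values, the REFLINK_ATTR_KNOWN mask, and the helper name
reflink_check_preserve() are all assumptions, not part of any posted patch.
It reserves the low 16 bits for "preserve" attributes, rejects anything
outside that range, and also enforces the two dependencies from the v4
proposal (SECURITY needs OWNER, ACL needs MODE):

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical bit assignments; the thread does not fix any values. */
#define REFLINK_ATTR_NONE      0x0000
#define REFLINK_ATTR_OWNER     0x0001
#define REFLINK_ATTR_SECURITY  0x0002
#define REFLINK_ATTR_MODE      0x0004
#define REFLINK_ATTR_ACL       0x0008
#define REFLINK_ATTR_ALL       0xffff  /* the reserved "preserve" space */

/* Bits this imaginary kernel version knows how to preserve. */
#define REFLINK_ATTR_KNOWN (REFLINK_ATTR_OWNER | REFLINK_ATTR_SECURITY | \
                            REFLINK_ATTR_MODE | REFLINK_ATTR_ACL)

/*
 * Returns the effective set of attributes to preserve (>= 0),
 * or -EINVAL if the request cannot be interpreted.
 */
static int reflink_check_preserve(unsigned int preserve)
{
    /* Bits outside the reserved range are some other kind of flag. */
    if (preserve & ~(unsigned int)REFLINK_ATTR_ALL)
        return -EINVAL;

    /* "All" expands to whatever this version can preserve. */
    if (preserve == REFLINK_ATTR_ALL)
        return REFLINK_ATTR_KNOWN;

    /* Assumption: an unknown bit inside the range is an error. */
    if (preserve & ~REFLINK_ATTR_KNOWN)
        return -EINVAL;

    /* Dependencies described in the v4 proposal. */
    if ((preserve & REFLINK_ATTR_SECURITY) && !(preserve & REFLINK_ATTR_OWNER))
        return -EINVAL;
    if ((preserve & REFLINK_ATTR_ACL) && !(preserve & REFLINK_ATTR_MODE))
        return -EINVAL;

    return (int)preserve;
}
```

With this layout an old binary passing 0xffff keeps working when new
attribute bits are added inside the range, while flags of a different
kind can live above bit 15 without changing what "all" means.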

jon

* Re: [Ocfs2-devel] [RFC] The reflink(2) system call v4.
  2009-05-19 19:33                         ` Jonathan Corbet
@ 2009-05-19 20:15                           ` Jamie Lokier
  0 siblings, 0 replies; 151+ messages in thread
From: Jamie Lokier @ 2009-05-19 20:15 UTC (permalink / raw)
  To: Jonathan Corbet; +Cc: linux-fsdevel, linux-security-module

Jonathan Corbet wrote:
> One tiny little thing that crossed my mind as I was looking at this...
> 
> > - REFLINK_ATTR_NONE is 0 and REFLINK_ATTR_ALL is ~0.
> 
> That, I think, could lead to unexpected results if different flags
> (perhaps controlling different aspects of behavior altogether) are
> added in the future.  Might it make more sense for REFLINK_ATTR_ALL to
> be something like 0xffff, with the current implementation insisting
> that all other bits are zero?  That would leave room for expansion of
> the set of things covered by the "preserve all" semantics while,
> simultaneously, allowing the addition of different types of flags
> entirely.

I think it's far better if REFLINK_ATTR_ALL is simply its own 1-bit
flag, meaning exactly what you think it means: In the kernel, it sets
all the attribute flags.

It's possible to choose a bit-mask now, but there's no particular
reason that 16 bits is the right size, and it's ugly if it turns out
you need a hack for a backward-compatible 17th attribute sometime.
(It can be done, it's just ugly).

(I'd also add REFLINK_ATTR_ATOMIC, because you might want the
attributes copied but don't care about atomicity, and some filesystems
might be able to do one without the other.  I'm thinking of SMB/CIFS here.)

By the way, there is work going on towards a "selective stat()" call,
which takes a set of bits for which attributes are to be returned.  Is
it worth converging on some common flags to select attributes?

-- Jamie

[parent not found: <20090519132057.419b9de0@bike.lwn.net>]

[parent not found: <20090519193244.GB25521@mail.oracle.com>]

* Re: [Ocfs2-devel] [RFC] The reflink(2) system call v4.
       [not found]                           ` <20090519193244.GB25521@mail.oracle.com>
@ 2009-05-19 19:41                             ` Jonathan Corbet
  0 siblings, 0 replies; 151+ messages in thread
From: Jonathan Corbet @ 2009-05-19 19:41 UTC (permalink / raw)
  To: Joel Becker
  Cc: Stephen Smalley, Andy Lutomirski, jim owens, jmorris, ocfs2-devel,
	viro, mtk.manpages, linux-security-module, linux-fsdevel

On Tue, 19 May 2009 12:32:44 -0700
Joel Becker <Joel.Becker@oracle.com> wrote:

> 	I considered that, but really a process specifying
> REFLINK_ATTR_ALL wants a complete snapshot.  So if we add things to our
> inodes later, and then you have an old program asking for "a complete
> snapshot", it won't get it.  It'll get a partial snapshot, missing the
> things we added later.
> 	Conversely, a newer program that knows about the new things will
> get an error on an older kernel when it asks for the complete snapshot.

Yep, that's why I'd suggested carving out a set of bits rather larger
than the ones specified now.  That would allow any future flags to be
included in the REFLINK_ATTR_ALL "space" if that seemed like the right
thing to do.  It would be forward and backward compatible.

Anything added outside that bit range would, presumably, be a more
significant change which should not carry forward or backward
automatically.

> 	You'll note I called this 'preserve', not 'flags'.  It's not a
> set of behavioral flags, it's a mask of attributes to preserve.

Understood, but that may not stop somebody else from trying to extend
the API in different directions in the future.  It seems like a way to
make life easier for that person when the time comes.

Just a thought, anyway; not something I'd make a fuss about.

jon

* [RFC] The reflink(2) system call v5.
  2009-05-11 20:40       ` [RFC] The reflink(2) system call v4 Joel Becker
                           ` (4 preceding siblings ...)
  2009-05-14  3:57         ` Andy Lutomirski
@ 2009-05-28  0:24         ` Joel Becker
  2009-09-14 22:24         ` Joel Becker
  6 siblings, 0 replies; 151+ messages in thread
From: Joel Becker @ 2009-05-28  0:24 UTC (permalink / raw)
  To: jim owens, jmorris, ocfs2-devel, viro, mtk.manpages,
	linux-security-module, linux-fsdevel

	Here's v5 of reflink().  It adds a 'preserve' argument to the
call.  This argument may currently be one of REFLINK_ATTR_PRESERVE and
REFLINK_ATTR_NONE.  _ATTR_PRESERVE takes a full snapshot, and fails if
the caller lacks the privileges.  _ATTR_NONE links up the data extents
(data and xattrs) in a CoW fashion, but otherwise initializes the new
inode as a new file (new security state, acls, ownership, etc).  I took
everyone's advice and dropped attribute-specific flags for a single
_ATTR_PRESERVE.
	Inside the kernel, the iop and security op get 'bool preserve'
to tell them what to do.

Joel

>From d3c4ed0cb3f5af75f2adf92346e7a3f23870cd16 Mon Sep 17 00:00:00 2001
From: Joel Becker <joel.becker@oracle.com>
Date: Sat, 2 May 2009 22:48:59 -0700
Subject: [PATCH] fs: Add the reflink() operation and reflinkat(2) system call.

The userspace-visible idea of the operation is:

int reflink(const char *oldpath, const char *newpath, int preserve);
int reflinkat(int olddirfd, const char *oldpath,
	      int newdirfd, const char *newpath,
	      int preserve,  int flags);

The kernel only implements reflinkat(2).  reflink(3) is a trivial
wrapper around reflinkat(2).
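
As a rough illustration of how thin that wrapper could be, here is a
sketch that mirrors the way link(2) forwards to linkat(2) with AT_FDCWD
and no flags.  The reflinkat() below is only a stand-in that fails with
ENOSYS, since no libc provides the call; a real library would issue the
system call there instead.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>   /* AT_FDCWD */

/*
 * Stand-in for the proposed system call (an assumption for this sketch).
 * It always fails with ENOSYS so the example is safe to run anywhere.
 */
static int reflinkat(int olddirfd, const char *oldpath,
                     int newdirfd, const char *newpath,
                     int preserve, int flags)
{
    (void)olddirfd; (void)oldpath;
    (void)newdirfd; (void)newpath;
    (void)preserve; (void)flags;
    errno = ENOSYS;
    return -1;
}

/* reflink(3): both paths relative to the current directory, no flags. */
static int reflink(const char *oldpath, const char *newpath, int preserve)
{
    return reflinkat(AT_FDCWD, oldpath, AT_FDCWD, newpath, preserve, 0);
}
```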

The reflink() system call creates reference-counted links.  It creates
a new file that shares the data extents of the source file in a
copy-on-write fashion.  Its calling semantics are identical to link(2)
and linkat(2).  Once complete, programs see the new file as a completely
separate entry.

reflink() attempts to preserve ownership, permissions, and all other
security state in order to create a full snapshot.  A caller requests
this by passing REFLINK_ATTR_PRESERVE as the 'preserve' argument.
Preserving those attributes requires ownership or CAP_CHOWN.  A caller
without those privileges will get EPERM.  An unprivileged caller can
specify REFLINK_ATTR_NONE.  They will acquire the data extent sharing
but will see the file's security state and attributes initialized as a
new file.  The unprivileged reflink requires read access.
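
A caller wanting the best-effort behavior discussed earlier in the thread
(in the spirit of "cp -p") could build it in userspace on top of these two
modes.  The sketch below is an assumption-laden illustration: the numeric
values of the REFLINK_ATTR_* constants, the helper reflink_best_effort(),
and the mock used to exercise it are all hypothetical.  The reflink
implementation is passed in as a function pointer so the example does not
depend on a libc that provides the call.

```c
#include <assert.h>
#include <errno.h>

/* Assumed values; the real ones come from the patched <linux/fcntl.h>. */
#define REFLINK_ATTR_NONE      0
#define REFLINK_ATTR_PRESERVE  1

typedef int (*reflink_fn)(const char *oldpath, const char *newpath,
                          int preserve);

/*
 * Try a full snapshot first; if the kernel refuses with EPERM, retry
 * without preserving the security state.  Any other error is returned
 * to the caller unchanged.
 */
static int reflink_best_effort(reflink_fn do_reflink,
                               const char *oldpath, const char *newpath)
{
    int ret = do_reflink(oldpath, newpath, REFLINK_ATTR_PRESERVE);

    if (ret == 0 || errno != EPERM)
        return ret;
    return do_reflink(oldpath, newpath, REFLINK_ATTR_NONE);
}

/* Mock of an unprivileged caller: preserving fails, plain reflink works. */
static int last_preserve = -1;

static int mock_unprivileged_reflink(const char *oldpath,
                                     const char *newpath, int preserve)
{
    (void)oldpath; (void)newpath;
    last_preserve = preserve;
    if (preserve == REFLINK_ATTR_PRESERVE) {
        errno = EPERM;
        return -1;
    }
    return 0;
}
```

Note that vfs_reflink() also returns EPERM for other reasons (for
example a non-regular or immutable source); in those cases the retry
simply fails with EPERM again.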

In the VFS, ->reflink() is an inode_operation with almost the same
arguments as ->link(); an additional argument tells the filesystem to
copy over or reinitialize the security state on the new file.

A new LSM hook, security_inode_reflink(), is added.  None of the
existing LSM hooks appeared to fit.

This only adds the x86 linkage.  The trend appears to be for other
architectures to add their own linkage.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
---
 Documentation/filesystems/reflink.txt |  174 +++++++++++++++++++++++++++++++++
 Documentation/filesystems/vfs.txt     |    4 +
 arch/x86/ia32/ia32entry.S             |    1 +
 arch/x86/include/asm/unistd_32.h      |    1 +
 arch/x86/include/asm/unistd_64.h      |    2 +
 arch/x86/kernel/syscall_table_32.S    |    1 +
 fs/namei.c                            |  124 +++++++++++++++++++++++
 include/linux/fcntl.h                 |    8 ++
 include/linux/fs.h                    |    2 +
 include/linux/security.h              |   23 +++++
 include/linux/syscalls.h              |    3 +
 security/capability.c                 |    7 ++
 security/security.c                   |    8 ++
 13 files changed, 358 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/filesystems/reflink.txt

diff --git a/Documentation/filesystems/reflink.txt b/Documentation/filesystems/reflink.txt
new file mode 100644
index 0000000..7effe33
--- /dev/null
+++ b/Documentation/filesystems/reflink.txt
@@ -0,0 +1,174 @@
+reflink(2)
+==========
+
+
+INTRODUCTION
+------------
+
+A reflink is a reference-counted link.  The reflink(2) operation is
+analogous to the link(2) operation, except that instead of two directory
+entries pointing to the same inode, there are two identical inodes
+pointing to the same data.  Writes do not modify the shared data; they
+use copy-on-write (CoW).  Thus, after the reflink has been created, the
+inodes can diverge without impacting each other.
+
+
+SYNOPSIS
+--------
+
+The reflink(2) call looks almost like link(2):
+
+    int reflink(const char *oldpath, const char *newpath, int preserve);
+
+The actual system call is reflinkat(2):
+
+    int reflinkat(int olddirfd, const char *oldpath,
+                  int newdirfd, const char *newpath,
+                  int preserve, int flags);
+
+For details on how olddirfd, newdirfd, and flags behave, see linkat(2).
+The reflink(2) call won't be implemented by the kernel, because it's a
+trivial wrapper around reflinkat(2).
+
+
+DESCRIPTION
+-----------
+
+One way of viewing reflink is to look at the level of sharing.  A
+symbolic link does its sharing at the directory entry level; many names
+end up pointing at the same directory entry.  Hard links are one step
+down.  Multiple directory entries are sharing one inode.  Reflinks are
+down one more level: multiple inodes share the same data extents.
+
+When you symlink a file, you can then access it via the symlink or the
+real directory entry, and for the most part they look identical.  When
+accessing more than one name for a hard link, the object returned looks
+identical.  Similarly, a newly created reflink is identical to its
+source in almost every way and can be treated as such.  This includes
+ownership, permissions, security state, and data.  The only things
+that are different are the inode number, the link count, and the ctime.
+
+A reflink is a snapshot of the source file at the time it is created.
+
+Once created, though, a reflink can be modified like any other normal
+file without affecting the source file.  Changes to trivial fields like
+permissions, owner, or times are guaranteed not to trigger CoW of file
+data and will not return any error that wouldn't happen on a truly
+distinct file.  Changes to the file's data will trigger CoW of the data
+affected - the actual CoW granularity is up to the filesystem, from
+exact bytes up to the entire file.  ocfs2, for example, will copy out an
+entire extent or 1MB, whichever is smaller.
+
+Preserving the security state of the source file obviously requires
+the privilege to do so.  Because of this, the reflink(2) call has the
+preserve argument.  If it is set to REFLINK_ATTR_PRESERVE, the security
+state and file attributes will match the source as described above.
+Callers that do not own the source file and do not have CAP_CHOWN will
+see reflink(2) fail with EPERM.  If preserve is set to
+REFLINK_ATTR_NONE, the new reflink will still share all the data extents
+of the source file, including extended attributes.  The security state
+and attributes of the new reflink will be as a newly created file by
+that user.  With REFLINK_ATTR_NONE, the caller must have read access to
+the source file.
+
+Partial reflinks are not allowed.  The new inode will only appear in the
+directory structure after it is fully formed.  This prevents a crash or
+lack of space from creating a partial reflink.
+
+If a filesystem does not support reflinks, the kernel and libc MUST NOT
+fake it.  Callers are expecting to get snapshots, and faking it will
+violate that trust.
+
+The userspace view is as follows.  When reflink(2) returns, opening
+oldpath and newpath returns identical-looking files, just like link(2).
+After that, oldpath and newpath behave as distinct files, and
+modifications to one have no impact on the other.
+
+
+RESTRICTIONS
+------------
+
+Just as the sharing gets lower as you move from symlink() -> link() ->
+reflink(), the restrictions on the call get tighter.  A symlink doesn't
+require any access permissions other than being able to create its
+inode.  It can cross filesystems and mount points, and it can point to
+any type of file.  A hard link requires both source and target to be on
+the same filesystem under the same mount point, and that the source not
+be a directory.  A reflink tightens that to regular files only.  Like
+hard links and symlinks, a reflink cannot be created if newpath exists.
+
+Reflinks add one big restriction on top of hard links: only the owner
+or someone with elevated privileges (CAP_CHOWN) can preserve the
+security state (permissions, ownership, ACLs, etc) across a reflink.
+A reflink is a point-in-time snapshot of a file.  Without the
+appropriate privilege, the caller specifying REFLINK_ATTR_PRESERVE
+will receive EPERM.
+
+A caller specifying REFLINK_ATTR_NONE must have read access to reflink a
+file.
+
+
+SHARING
+-------
+
+A reflink creates a new inode.  It shares all data extents of the source
+file; this includes file data and extended attribute data.  All of the
+sharing is in a CoW fashion, and any modification of the data will break
+the sharing.
+
+For some filesystems, certain data structures are not in allocated
+storage extents.  Creating a reflink might make a copy of these extents.
+An example is ext3's ability to store small extended attributes inside
+the ext3 inode.  Since a reflink is creating a new inode, those extended
+attributes are merely copied to the new inode.
+
+
+EXCEPTIONS
+----------
+
+When REFLINK_ATTR_PRESERVE is specified, all file attributes and
+extended attributes of the new file must be identical to the source file
+with the following exceptions:
+
+- The new file must have a new inode number.  This allows POSIX
+  programs to treat the source and new files as separate objects.  From
+  the view of the POSIX application, the files are distinct.  The
+  sharing is invisible outside of the filesystem's internal structures.
+- The ctime of the source file only changes if the source's metadata
+  must be changed to accommodate the copy-on-write linkage.  The ctime
+  of the new file is set to represent its creation.
+- The link count of the source file is unchanged, and the link count of
+  the new file is one.
+
+The mtime of the source file is unmodified, and the mtime of the new
+file is set identical to the source file.  This reflects that the data
+is unchanged.
+
+If REFLINK_ATTR_NONE is specified, all data extents will be reflinked,
+but file attributes and security state will be as any new file.
+
+
+INODE OPERATION
+---------------
+
+Filesystems implement the ->reflink() inode operation.  It has almost
+the same prototype as ->link():
+
+    int (*reflink)(struct dentry *old_dentry, struct inode *dir,
+                   struct dentry *new_dentry, bool preserve);
+
+When the filesystem is called, the VFS has already checked the
+permissions and mountpoint of the operation.  It has determined whether
+the file attributes and security state should be preserved or
+reinitialized, as specified by the preserve argument.  The filesystem
+just needs to create the new inode identical to the old one with the
+exceptions noted above, link up the shared data extents, and then link
+the new inode into dir.
+
+
+FOLLOWING SYMBOLIC LINKS
+------------------------
+
+reflink() dereferences symbolic links in the same manner that link(2)
+does.  The AT_SYMLINK_FOLLOW flag is honored just as for linkat(2).
+
diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
index f49eecf..0620d73 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -333,6 +333,7 @@ struct inode_operations {
 	ssize_t (*listxattr) (struct dentry *, char *, size_t);
 	int (*removexattr) (struct dentry *, const char *);
 	void (*truncate_range)(struct inode *, loff_t, loff_t);
+	int (*reflink) (struct dentry *,struct inode *,struct dentry *,bool);
 };
 
 Again, all methods are called without any locks being held, unless
@@ -431,6 +432,9 @@ otherwise noted.
 
   truncate_range: a method provided by the underlying filesystem to truncate a
   	range of blocks , i.e. punch a hole somewhere in a file.
+  reflink: called by the reflink(2) system call. Only required if you want
+	to support reflinks.  For further information, see
+	Documentation/filesystems/reflink.txt.
 
 
 The Address Space Object
diff --git a/arch/x86/ia32/ia32entry.S b/arch/x86/ia32/ia32entry.S
index a505202..ca832b4 100644
--- a/arch/x86/ia32/ia32entry.S
+++ b/arch/x86/ia32/ia32entry.S
@@ -830,4 +830,5 @@ ia32_sys_call_table:
 	.quad sys_inotify_init1
 	.quad compat_sys_preadv
 	.quad compat_sys_pwritev
+	.quad sys_reflinkat		/* 335 */
 ia32_syscall_end:
diff --git a/arch/x86/include/asm/unistd_32.h b/arch/x86/include/asm/unistd_32.h
index 6e72d74..c368563 100644
--- a/arch/x86/include/asm/unistd_32.h
+++ b/arch/x86/include/asm/unistd_32.h
@@ -340,6 +340,7 @@
 #define __NR_inotify_init1	332
 #define __NR_preadv		333
 #define __NR_pwritev		334
+#define __NR_reflinkat		335
 
 #ifdef __KERNEL__
 
diff --git a/arch/x86/include/asm/unistd_64.h b/arch/x86/include/asm/unistd_64.h
index f818294..b20f68c 100644
--- a/arch/x86/include/asm/unistd_64.h
+++ b/arch/x86/include/asm/unistd_64.h
@@ -657,6 +657,8 @@ __SYSCALL(__NR_inotify_init1, sys_inotify_init1)
 __SYSCALL(__NR_preadv, sys_preadv)
 #define __NR_pwritev				296
 __SYSCALL(__NR_pwritev, sys_pwritev)
+#define __NR_reflinkat				297
+__SYSCALL(__NR_reflinkat, sys_reflinkat)
 
 
 #ifndef __NO_STUBS
diff --git a/arch/x86/kernel/syscall_table_32.S b/arch/x86/kernel/syscall_table_32.S
index ff5c873..d11c200 100644
--- a/arch/x86/kernel/syscall_table_32.S
+++ b/arch/x86/kernel/syscall_table_32.S
@@ -334,3 +334,4 @@ ENTRY(sys_call_table)
 	.long sys_inotify_init1
 	.long sys_preadv
 	.long sys_pwritev
+	.long sys_reflinkat		/* 335 */
diff --git a/fs/namei.c b/fs/namei.c
index 78f253c..55f5c80 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2486,6 +2486,129 @@ SYSCALL_DEFINE2(link, const char __user *, oldname, const char __user *, newname
 	return sys_linkat(AT_FDCWD, oldname, AT_FDCWD, newname, 0);
 }
 
+int vfs_reflink(struct dentry *old_dentry, struct inode *dir,
+		struct dentry *new_dentry, bool preserve)
+{
+	struct inode *inode = old_dentry->d_inode;
+	int error;
+
+	if (!inode)
+		return -ENOENT;
+
+	error = may_create(dir, new_dentry);
+	if (error)
+		return error;
+
+	if (dir->i_sb != inode->i_sb)
+		return -EXDEV;
+
+	/*
+	 * A reflink to an append-only or immutable file cannot be created.
+	 */
+	if (IS_APPEND(inode) || IS_IMMUTABLE(inode))
+		return -EPERM;
+	if (!dir->i_op->reflink)
+		return -EPERM;
+
+	/*
+	 * Only regular files can be reflinked; if a user tries to
+	 * reflink a block device, do they expect copy-on-write of the
+	 * entire device?
+	 */
+	if (!S_ISREG(inode->i_mode))
+		return -EPERM;
+
+	/*
+	 * If the caller wants to preserve ownership, they require the
+	 * rights to do so.
+	 */
+	if (preserve) {
+		if ((current_fsuid() != inode->i_uid) && !capable(CAP_CHOWN))
+			return -EPERM;
+		if (!in_group_p(inode->i_gid) && !capable(CAP_CHOWN))
+			return -EPERM;
+	}
+
+	error = security_inode_reflink(old_dentry, dir, preserve);
+	if (error)
+		return error;
+
+	/*
+	 * If the caller is modifying any aspect of the attributes, they
+	 * are not creating a snapshot.  They need read permission on the
+	 * file.
+	 */
+	if (!preserve) {
+		error = inode_permission(inode, MAY_READ);
+		if (error)
+			return error;
+	}
+
+	mutex_lock(&inode->i_mutex);
+	vfs_dq_init(dir);
+	error = dir->i_op->reflink(old_dentry, dir, new_dentry, preserve);
+	mutex_unlock(&inode->i_mutex);
+	if (!error)
+		fsnotify_create(dir, new_dentry);
+	return error;
+}
+
+SYSCALL_DEFINE6(reflinkat, int, olddfd, const char __user *, oldname,
+		int, newdfd, const char __user *, newname, int, preserve,
+		int, flags)
+{
+	struct dentry *new_dentry;
+	struct nameidata nd;
+	struct path old_path;
+	int error;
+	char *to;
+
+	if ((flags & ~AT_SYMLINK_FOLLOW) != 0)
+		return -EINVAL;
+
+	if ((preserve & ~REFLINK_ATTR_PRESERVE) != 0)
+		return -EINVAL;
+
+	error = user_path_at(olddfd, oldname,
+			     flags & AT_SYMLINK_FOLLOW ? LOOKUP_FOLLOW : 0,
+			     &old_path);
+	if (error)
+		return error;
+
+	error = user_path_parent(newdfd, newname, &nd, &to);
+	if (error)
+		goto out;
+	error = -EXDEV;
+	if (old_path.mnt != nd.path.mnt)
+		goto out_release;
+	new_dentry = lookup_create(&nd, 0);
+	error = PTR_ERR(new_dentry);
+	if (IS_ERR(new_dentry))
+		goto out_unlock;
+	error = mnt_want_write(nd.path.mnt);
+	if (error)
+		goto out_dput;
+	error = security_path_link(old_path.dentry, &nd.path, new_dentry);
+	if (error)
+		goto out_drop_write;
+	error = vfs_reflink(old_path.dentry, nd.path.dentry->d_inode,
+			    new_dentry, preserve);
+out_drop_write:
+	mnt_drop_write(nd.path.mnt);
+out_dput:
+	dput(new_dentry);
+out_unlock:
+	mutex_unlock(&nd.path.dentry->d_inode->i_mutex);
+out_release:
+	path_put(&nd.path);
+	putname(to);
+out:
+	path_put(&old_path);
+
+	return error;
+}
+
+
 /*
  * The worst of all namespace operations - renaming directory. "Perverted"
  * doesn't even start to describe it. Somebody in UCB had a heck of a trip...
@@ -2890,6 +3013,7 @@ EXPORT_SYMBOL(unlock_rename);
 EXPORT_SYMBOL(vfs_create);
 EXPORT_SYMBOL(vfs_follow_link);
 EXPORT_SYMBOL(vfs_link);
+EXPORT_SYMBOL(vfs_reflink);
 EXPORT_SYMBOL(vfs_mkdir);
 EXPORT_SYMBOL(vfs_mknod);
 EXPORT_SYMBOL(generic_permission);
diff --git a/include/linux/fcntl.h b/include/linux/fcntl.h
index 8603740..96dc2f0 100644
--- a/include/linux/fcntl.h
+++ b/include/linux/fcntl.h
@@ -40,6 +40,14 @@
                                            unlinking file.  */
 #define AT_SYMLINK_FOLLOW	0x400   /* Follow symbolic links.  */
 
+/*
+ * A reflink call may preserve the file's attributes in toto or not at
+ * all.
+ */
+#define REFLINK_ATTR_PRESERVE	0x00000001
+#define REFLINK_ATTR_NONE	0
+
+
 #ifdef __KERNEL__
 
 #ifndef force_o_largefile
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 5bed436..c6f9cb0 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1415,6 +1415,7 @@ extern int vfs_link(struct dentry *, struct inode *, struct dentry *);
 extern int vfs_rmdir(struct inode *, struct dentry *);
 extern int vfs_unlink(struct inode *, struct dentry *);
 extern int vfs_rename(struct inode *, struct dentry *, struct inode *, struct dentry *);
+extern int vfs_reflink(struct dentry *, struct inode *, struct dentry *, bool);
 
 /*
  * VFS dentry helper functions.
@@ -1537,6 +1538,7 @@ struct inode_operations {
 			  loff_t len);
 	int (*fiemap)(struct inode *, struct fiemap_extent_info *, u64 start,
 		      u64 len);
+	int (*reflink) (struct dentry *,struct inode *,struct dentry *,bool);
 };
 
 struct seq_file;
diff --git a/include/linux/security.h b/include/linux/security.h
index d5fd616..2f1f520 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -528,6 +528,18 @@ static inline void security_free_mnt_opts(struct security_mnt_opts *opts)
  *	@inode contains a pointer to the inode.
  *	@secid contains a pointer to the location where result will be saved.
  *	In case of failure, @secid will be set to zero.
+ * @inode_reflink:
+ *	Check permission before creating a new reference-counted link to
+ *	a file.
+ *	@old_dentry contains the dentry structure for an existing link to
+ *	the file.
+ *	@dir contains the inode structure of the parent directory of the
+ *	new reflink.
+ *	@preserve specifies whether the caller wishes to preserve the
+ *	file's attributes.  If true, the caller wishes to clone the file's
+ *	attributes exactly.  If false, the caller expects to reflink the
+ *	data extents but reset the attributes.
+ *	Return 0 if permission is granted.
  *
  * Security hooks for file operations
  *
@@ -1415,6 +1427,8 @@ struct security_operations {
 	int (*inode_unlink) (struct inode *dir, struct dentry *dentry);
 	int (*inode_symlink) (struct inode *dir,
 			      struct dentry *dentry, const char *old_name);
+	int (*inode_reflink) (struct dentry *old_dentry, struct inode *dir,
+			      bool preserve);
 	int (*inode_mkdir) (struct inode *dir, struct dentry *dentry, int mode);
 	int (*inode_rmdir) (struct inode *dir, struct dentry *dentry);
 	int (*inode_mknod) (struct inode *dir, struct dentry *dentry,
@@ -1675,6 +1689,8 @@ int security_inode_link(struct dentry *old_dentry, struct inode *dir,
 int security_inode_unlink(struct inode *dir, struct dentry *dentry);
 int security_inode_symlink(struct inode *dir, struct dentry *dentry,
 			   const char *old_name);
+int security_inode_reflink(struct dentry *old_dentry, struct inode *dir,
+			   bool preserve);
 int security_inode_mkdir(struct inode *dir, struct dentry *dentry, int mode);
 int security_inode_rmdir(struct inode *dir, struct dentry *dentry);
 int security_inode_mknod(struct inode *dir, struct dentry *dentry, int mode, dev_t dev);
@@ -2056,6 +2072,13 @@ static inline int security_inode_symlink(struct inode *dir,
 	return 0;
 }
 
+static inline int security_inode_reflink(struct dentry *old_dentry,
+					 struct inode *dir,
+					 bool preserve)
+{
+	return 0;
+}
+
 static inline int security_inode_mkdir(struct inode *dir,
 					struct dentry *dentry,
 					int mode)
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 40617c1..a11f228 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -692,6 +692,9 @@ asmlinkage long sys_symlinkat(const char __user * oldname,
 			      int newdfd, const char __user * newname);
 asmlinkage long sys_linkat(int olddfd, const char __user *oldname,
 			   int newdfd, const char __user *newname, int flags);
+asmlinkage long sys_reflinkat(int olddfd, const char __user *oldname,
+			      int newdfd, const char __user *newname,
+			      int preserve, int flags);
 asmlinkage long sys_renameat(int olddfd, const char __user * oldname,
 			     int newdfd, const char __user * newname);
 asmlinkage long sys_futimesat(int dfd, char __user *filename,
diff --git a/security/capability.c b/security/capability.c
index 21b6cea..8047b7c 100644
--- a/security/capability.c
+++ b/security/capability.c
@@ -172,6 +172,12 @@ static int cap_inode_symlink(struct inode *inode, struct dentry *dentry,
 	return 0;
 }
 
+static int cap_inode_reflink(struct dentry *old_dentry, struct inode *inode,
+			     bool preserve)
+{
+	return 0;
+}
+
 static int cap_inode_mkdir(struct inode *inode, struct dentry *dentry,
 			   int mask)
 {
@@ -905,6 +911,7 @@ void security_fixup_ops(struct security_operations *ops)
 	set_to_cap_if_null(ops, inode_link);
 	set_to_cap_if_null(ops, inode_unlink);
 	set_to_cap_if_null(ops, inode_symlink);
+	set_to_cap_if_null(ops, inode_reflink);
 	set_to_cap_if_null(ops, inode_mkdir);
 	set_to_cap_if_null(ops, inode_rmdir);
 	set_to_cap_if_null(ops, inode_mknod);
diff --git a/security/security.c b/security/security.c
index 5284255..e2b12f9 100644
--- a/security/security.c
+++ b/security/security.c
@@ -470,6 +470,14 @@ int security_inode_symlink(struct inode *dir, struct dentry *dentry,
 	return security_ops->inode_symlink(dir, dentry, old_name);
 }
 
+int security_inode_reflink(struct dentry *old_dentry, struct inode *dir,
+			   bool preserve)
+{
+	if (unlikely(IS_PRIVATE(old_dentry->d_inode)))
+		return 0;
+	return security_ops->inode_reflink(old_dentry, dir, preserve);
+}
+
 int security_inode_mkdir(struct inode *dir, struct dentry *dentry, int mode)
 {
 	if (unlikely(IS_PRIVATE(dir)))
-- 
1.6.3

-- 

"Anything that is too stupid to be spoken is sung."  
        - Voltaire

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127


* [RFC] The reflink(2) system call v5.
  2009-05-11 20:40       ` [RFC] The reflink(2) system call v4 Joel Becker
                           ` (5 preceding siblings ...)
  2009-05-28  0:24         ` [RFC] The reflink(2) system call v5 Joel Becker
@ 2009-09-14 22:24         ` Joel Becker
  6 siblings, 0 replies; 151+ messages in thread
From: Joel Becker @ 2009-09-14 22:24 UTC (permalink / raw)
  To: jim owens, jmorris, ocfs2-devel, viro, mtk.manpages,
	linux-security-module, linux-fsdevel

[This is a resend of the v5 patch sent on May 25th.  Jim, Al, can I get
 acks, please?]

	Here's v5 of reflink().  It adds a 'preserve' argument to the
call.  This argument may currently be one of REFLINK_ATTR_PRESERVE and
REFLINK_ATTR_NONE.  _ATTR_PRESERVE takes a full snapshot, and fails if
the caller lacks the privileges.  _ATTR_NONE links up the data extents
(data and xattrs) in a CoW fashion, but otherwise initializes the new
inode as a new file (new security state, acls, ownership, etc).  I took
everyone's advice and dropped attribute-specific flags for a single
_ATTR_PRESERVE.
	Inside the kernel, the iop and security op get 'bool preserve'
to tell them what to do.
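
	To illustrate which 'preserve' and 'flags' values the call accepts,
here is a minimal userspace sketch that mirrors the two argument checks at
the top of sys_reflinkat() in the patch below.  The helper names
reflinkat_check_args() and reflinkat_check_args_selftest() are hypothetical
and exist only for this illustration; the constants are copied from the
fcntl.h hunk.  It models the checks only and does not invoke the syscall.

```c
#include <assert.h>
#include <errno.h>

/* Values copied from the include/linux/fcntl.h hunk in this patch. */
#define REFLINK_ATTR_NONE	0
#define REFLINK_ATTR_PRESERVE	0x00000001
#ifndef AT_SYMLINK_FOLLOW
#define AT_SYMLINK_FOLLOW	0x400
#endif

/*
 * Hypothetical userspace model of the argument validation done by
 * sys_reflinkat(): any unknown bit in either argument gives -EINVAL.
 */
static int reflinkat_check_args(int preserve, int flags)
{
	/* Only AT_SYMLINK_FOLLOW is a valid flag. */
	if ((flags & ~AT_SYMLINK_FOLLOW) != 0)
		return -EINVAL;

	/* preserve is all-or-nothing: _ATTR_PRESERVE or _ATTR_NONE. */
	if ((preserve & ~REFLINK_ATTR_PRESERVE) != 0)
		return -EINVAL;

	return 0;
}

/* Small self-check; call it from main() or a test harness. */
void reflinkat_check_args_selftest(void)
{
	assert(reflinkat_check_args(REFLINK_ATTR_NONE, 0) == 0);
	assert(reflinkat_check_args(REFLINK_ATTR_PRESERVE, 0) == 0);
	assert(reflinkat_check_args(0x2, 0) == -EINVAL);
}
```

Because attribute-specific flags were dropped, any 'preserve' value other
than 0 or 1 is rejected, which leaves the remaining bits free for later
extension.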

Joel

From d3c4ed0cb3f5af75f2adf92346e7a3f23870cd16 Mon Sep 17 00:00:00 2001
From: Joel Becker <joel.becker@oracle.com>
Date: Sat, 2 May 2009 22:48:59 -0700
Subject: [PATCH] fs: Add the reflink() operation and reflinkat(2) system call.

The userspace-visible idea of the operation is:

int reflink(const char *oldpath, const char *newpath, int preserve);
int reflinkat(int olddirfd, const char *oldpath,
	      int newdirfd, const char *newpath,
	      int preserve,  int flags);

The kernel only implements reflinkat(2).  reflink(3) is a trivial
wrapper around reflinkat(2).

The reflink() system call creates reference-counted links.  It creates
a new file that shares the data extents of the source file in a
copy-on-write fashion.  Its calling semantics are identical to link(2)
and linkat(2).  Once complete, programs see the new file as a completely
separate entry.

reflink() attempts to preserve ownership, permissions, and all other
security state in order to create a full snapshot.  A caller requests
this by passing REFLINK_ATTR_PRESERVE as the 'preserve' argument.
Preserving those attributes requires ownership or CAP_CHOWN.  A caller
without those privileges will get EPERM.  An unprivileged caller can
specify REFLINK_ATTR_NONE.  They will acquire the data extent sharing
but will see the file's security state and attributes initialized as a
new file.  The unprivileged reflink requires read access.

In the VFS, ->reflink() is an inode_operation with almost the same
arguments as ->link(); an additional argument tells the filesystem to
copy over or reinitialize the security state on the new file.

A new LSM hook, security_inode_reflink(), is added.  None of the
existing LSM hooks appeared to fit.

This only adds the x86 linkage.  The trend appears to be for other
architectures to add their own linkage.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
---
 Documentation/filesystems/reflink.txt |  174 +++++++++++++++++++++++++++++++++
 Documentation/filesystems/vfs.txt     |    4 +
 arch/x86/ia32/ia32entry.S             |    1 +
 arch/x86/include/asm/unistd_32.h      |    1 +
 arch/x86/include/asm/unistd_64.h      |    2 +
 arch/x86/kernel/syscall_table_32.S    |    1 +
 fs/namei.c                            |  124 +++++++++++++++++++++++
 include/linux/fcntl.h                 |    8 ++
 include/linux/fs.h                    |    2 +
 include/linux/security.h              |   23 +++++
 include/linux/syscalls.h              |    3 +
 security/capability.c                 |    7 ++
 security/security.c                   |    8 ++
 13 files changed, 358 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/filesystems/reflink.txt
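
The privilege rules from the commit message can be modelled in userspace
as below.  This is only a sketch of the decision vfs_reflink() makes, not
the kernel code: struct reflink_req and reflink_permission_model() are
hypothetical names, the LSM hook and the other vfs_reflink() checks are
left out, and returning -EACCES for a failed read check is an assumption
about what inode_permission() reports.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical flattened view of the inputs vfs_reflink() examines. */
struct reflink_req {
	bool preserve;		/* REFLINK_ATTR_PRESERVE was requested */
	bool owns_file;		/* current_fsuid() == inode->i_uid */
	bool in_file_group;	/* in_group_p(inode->i_gid) */
	bool cap_chown;		/* capable(CAP_CHOWN) */
	bool may_read;		/* inode_permission(inode, MAY_READ) == 0 */
};

/* Returns 0 if the reflink would be allowed, a negative errno if not. */
int reflink_permission_model(const struct reflink_req *r)
{
	if (r->preserve) {
		/* A full snapshot needs ownership or CAP_CHOWN. */
		if (!r->owns_file && !r->cap_chown)
			return -EPERM;
		if (!r->in_file_group && !r->cap_chown)
			return -EPERM;
		return 0;
	}

	/* Without preserve, only read access to the source is needed. */
	return r->may_read ? 0 : -EACCES;	/* assumed errno */
}

/* Small self-check; call it from main() or a test harness. */
void reflink_permission_model_selftest(void)
{
	struct reflink_req owner = {
		.preserve = true, .owns_file = true, .in_file_group = true,
	};
	struct reflink_req stranger = { .preserve = true, .may_read = true };

	assert(reflink_permission_model(&owner) == 0);
	assert(reflink_permission_model(&stranger) == -EPERM);
}
```

Note that in this model read access is not consulted at all when preserve
is set, which matches the order of checks in vfs_reflink() in this patch.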

diff --git a/Documentation/filesystems/reflink.txt b/Documentation/filesystems/reflink.txt
new file mode 100644
index 0000000..7effe33
--- /dev/null
+++ b/Documentation/filesystems/reflink.txt
@@ -0,0 +1,174 @@
+reflink(2)
+==========
+
+
+INTRODUCTION
+------------
+
+A reflink is a reference-counted link.  The reflink(2) operation is
+analogous to the link(2) operation, except that instead of two directory
+entries pointing to the same inode, there are two identical inodes
+pointing to the same data.  Writes do not modify the shared data; they
+use copy-on-write (CoW).  Thus, after the reflink has been created, the
+inodes can diverge without impacting each other.
+
+
+SYNOPSIS
+--------
+
+The reflink(2) call looks almost like link(2):
+
+    int reflink(const char *oldpath, const char *newpath, int preserve);
+
+The actual system call is reflinkat(2):
+
+    int reflinkat(int olddirfd, const char *oldpath,
+                  int newdirfd, const char *newpath,
+                  int preserve, int flags);
+
+For details on how olddirfd, newdirfd, and flags behave, see linkat(2).
+The reflink(2) call won't be implemented by the kernel, because it's a
+trivial wrapper around reflinkat(2).
+
+
+DESCRIPTION
+-----------
+
+One way of viewing reflink is to look at the level of sharing.  A
+symbolic link does its sharing at the directory entry level; many names
+end up pointing at the same directory entry.  Hard links are one step
+down.  Multiple directory entries are sharing one inode.  Reflinks are
+down one more level: multiple inodes share the same data extents.
+
+When you symlink a file, you can then access it via the symlink or the
+real directory entry, and for the most part they look identical.  When
+accessing more than one name for a hard link, the object returned looks
+identical.  Similarly, a newly created reflink is identical to its
+source in almost every way and can be treated as such.  This includes
+ownership, permissions, security state, and data.  The only things
+that are different are the inode number, the link count, and the ctime.
+
+A reflink is a snapshot of the source file at the time it is created.
+
+Once created, though, a reflink can be modified like any other normal
+file without affecting the source file.  Changes to trivial fields like
+permissions, owner, or times are guaranteed not to trigger CoW of file
+data and will not return any error that wouldn't happen on a truly
+distinct file.  Changes to the file's data will trigger CoW of the data
+affected - the actual CoW granularity is up to the filesystem, from
+exact bytes up to the entire file.  ocfs2, for example, will copy out an
+entire extent or 1MB, whichever is smaller.
+
+Preserving the security state of the source file obviously requires
+the privilege to do so.  Because of this, the reflink(2) call has the
+preserve argument.  If it is set to REFLINK_ATTR_PRESERVE, the security
+state and file attributes will match the source as described above.
+Callers that do not own the source file and do not have CAP_CHOWN will
+see reflink(2) fail with EPERM.  If preserve is set to
+REFLINK_ATTR_NONE, the new reflink will still share all the data extents
+of the source file, including extended attributes.  The security state
+and attributes of the new reflink will be as a newly created file by
+that user.  With REFLINK_ATTR_NONE, the caller must have read access to
+the source file.
+
+Partial reflinks are not allowed.  The new inode will only appear in the
+directory structure after it is fully formed.  This prevents a crash or
+lack of space from creating a partial reflink.
+
+If a filesystem does not support reflinks, the kernel and libc MUST NOT
+fake it.  Callers are expecting to get snapshots, and faking it will
+violate that trust.
+
+The userspace view is as follows.  When reflink(2) returns, opening
+oldpath and newpath returns identical-looking files, just like link(2).
+After that, oldpath and newpath behave as distinct files, and
+modifications to one have no impact on the other.
+
+
+RESTRICTIONS
+------------
+
+Just as the sharing gets lower as you move from symlink() -> link() ->
+reflink(), the restrictions on the call get tighter.  A symlink doesn't
+require any access permissions other than being able to create its
+inode.  It can cross filesystems and mount points, and it can point to
+any type of file.  A hard link requires both source and target to be on
+the same filesystem under the same mount point, and that the source not
+be a directory.  A reflink tightens that to regular files only.  Like
+hard links and symlinks, a reflink cannot be created if newpath exists.
+
+Reflinks add one big restriction on top of hard links: only the owner
+or someone with elevated privileges (CAP_CHOWN) can preserve the
+security state (permissions, ownership, ACLs, etc) across a reflink.
+A reflink is a point-in-time snapshot of a file.  Without the
+appropriate privilege, the caller specifying REFLINK_ATTR_PRESERVE
+will receive EPERM.
+
+A caller specifying REFLINK_ATTR_NONE must have read access to reflink a
+file.
+
+
+SHARING
+-------
+
+A reflink creates a new inode.  It shares all data extents of the source
+file; this includes file data and extended attribute data.  All of the
+sharing is in a CoW fashion, and any modification of the data will break
+the sharing.
+
+For some filesystems, certain data structures are not in allocated
+storage extents.  Creating a reflink might make a copy of these extents.
+An example is ext3's ability to store small extended attributes inside
+the ext3 inode.  Since a reflink is creating a new inode, those extended
+attributes are merely copied to the new inode.
+
+
+EXCEPTIONS
+----------
+
+When REFLINK_ATTR_PRESERVE is specified, all file attributes and
+extended attributes of the new file must be identical to the source file
+with the following exceptions:
+
+- The new file must have a new inode number.  This allows POSIX
+  programs to treat the source and new files as separate objects.  From
+  the view of the POSIX application, the files are distinct.  The
+  sharing is invisible outside of the filesystem's internal structures.
+- The ctime of the source file only changes if the source's metadata
+  must be changed to accommodate the copy-on-write linkage.  The ctime
+  of the new file is set to represent its creation.
+- The link count of the source file is unchanged, and the link count of
+  the new file is one.
+
+The mtime of the source file is unmodified, and the mtime of the new
+file is set identical to the source file.  This reflects that the data
+is unchanged.
+
+If REFLINK_ATTR_NONE is specified, all data extents will be reflinked,
+but file attributes and security state will be as any new file.
+
+
+INODE OPERATION
+---------------
+
+Filesystems implement the ->reflink() inode operation.  It has almost
+the same prototype as ->link():
+
+    int (*reflink)(struct dentry *old_dentry, struct inode *dir,
+                   struct dentry *new_dentry, bool preserve);
+
+When the filesystem is called, the VFS has already checked the
+permissions and mountpoint of the operation.  It has determined whether
+the file attributes and security state should be preserved or
+reinitialized, as specified by the preserve argument.  The filesystem
+just needs to create the new inode identical to the old one with the
+exceptions noted above, link up the shared data extents, and then link
+the new inode into dir.
+
+
+FOLLOWING SYMBOLIC LINKS
+------------------------
+
+reflink() dereferences symbolic links in the same manner that link(2)
+does.  The AT_SYMLINK_FOLLOW flag is honored just as for linkat(2).
+
diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
index f49eecf..0620d73 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -333,6 +333,7 @@ struct inode_operations {
 	ssize_t (*listxattr) (struct dentry *, char *, size_t);
 	int (*removexattr) (struct dentry *, const char *);
 	void (*truncate_range)(struct inode *, loff_t, loff_t);
+	int (*reflink) (struct dentry *,struct inode *,struct dentry *,bool);
 };
 
 Again, all methods are called without any locks being held, unless
@@ -431,6 +432,9 @@ otherwise noted.
 
   truncate_range: a method provided by the underlying filesystem to truncate a
   	range of blocks , i.e. punch a hole somewhere in a file.
+  reflink: called by the reflink(2) system call. Only required if you want
+	to support reflinks.  For further information, see
+	Documentation/filesystems/reflink.txt.
 
 
 The Address Space Object
diff --git a/arch/x86/ia32/ia32entry.S b/arch/x86/ia32/ia32entry.S
index a505202..ca832b4 100644
--- a/arch/x86/ia32/ia32entry.S
+++ b/arch/x86/ia32/ia32entry.S
@@ -830,4 +830,5 @@ ia32_sys_call_table:
 	.quad sys_inotify_init1
 	.quad compat_sys_preadv
 	.quad compat_sys_pwritev
+	.quad sys_reflinkat		/* 335 */
 ia32_syscall_end:
diff --git a/arch/x86/include/asm/unistd_32.h b/arch/x86/include/asm/unistd_32.h
index 6e72d74..c368563 100644
--- a/arch/x86/include/asm/unistd_32.h
+++ b/arch/x86/include/asm/unistd_32.h
@@ -340,6 +340,7 @@
 #define __NR_inotify_init1	332
 #define __NR_preadv		333
 #define __NR_pwritev		334
+#define __NR_reflinkat		335
 
 #ifdef __KERNEL__
 
diff --git a/arch/x86/include/asm/unistd_64.h b/arch/x86/include/asm/unistd_64.h
index f818294..b20f68c 100644
--- a/arch/x86/include/asm/unistd_64.h
+++ b/arch/x86/include/asm/unistd_64.h
@@ -657,6 +657,8 @@ __SYSCALL(__NR_inotify_init1, sys_inotify_init1)
 __SYSCALL(__NR_preadv, sys_preadv)
 #define __NR_pwritev				296
 __SYSCALL(__NR_pwritev, sys_pwritev)
+#define __NR_reflinkat				297
+__SYSCALL(__NR_reflinkat, sys_reflinkat)
 
 
 #ifndef __NO_STUBS
diff --git a/arch/x86/kernel/syscall_table_32.S b/arch/x86/kernel/syscall_table_32.S
index ff5c873..d11c200 100644
--- a/arch/x86/kernel/syscall_table_32.S
+++ b/arch/x86/kernel/syscall_table_32.S
@@ -334,3 +334,4 @@ ENTRY(sys_call_table)
 	.long sys_inotify_init1
 	.long sys_preadv
 	.long sys_pwritev
+	.long sys_reflinkat		/* 335 */
diff --git a/fs/namei.c b/fs/namei.c
index 78f253c..55f5c80 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2486,6 +2486,129 @@ SYSCALL_DEFINE2(link, const char __user *, oldname, const char __user *, newname
 	return sys_linkat(AT_FDCWD, oldname, AT_FDCWD, newname, 0);
 }
 
+int vfs_reflink(struct dentry *old_dentry, struct inode *dir,
+		struct dentry *new_dentry, bool preserve)
+{
+	struct inode *inode = old_dentry->d_inode;
+	int error;
+
+	if (!inode)
+		return -ENOENT;
+
+	error = may_create(dir, new_dentry);
+	if (error)
+		return error;
+
+	if (dir->i_sb != inode->i_sb)
+		return -EXDEV;
+
+	/*
+	 * A reflink to an append-only or immutable file cannot be created.
+	 */
+	if (IS_APPEND(inode) || IS_IMMUTABLE(inode))
+		return -EPERM;
+	if (!dir->i_op->reflink)
+		return -EPERM;
+
+	/*
+	 * Only regular files can be reflinked; if a user tries to
+	 * reflink a block device, do they expect copy-on-write of the
+	 * entire device?
+	 */
+	if (!S_ISREG(inode->i_mode))
+		return -EPERM;
+
+	/*
+	 * If the caller wants to preserve ownership, they require the
+	 * rights to do so.
+	 */
+	if (preserve) {
+		if ((current_fsuid() != inode->i_uid) && !capable(CAP_CHOWN))
+			return -EPERM;
+		if (!in_group_p(inode->i_gid) && !capable(CAP_CHOWN))
+			return -EPERM;
+	}
+
+	error = security_inode_reflink(old_dentry, dir, preserve);
+	if (error)
+		return error;
+
+	/*
+	 * If the caller is modifying any aspect of the attributes, they
+	 * are not creating a snapshot.  They need read permission on the
+	 * file.
+	 */
+	if (!preserve) {
+		error = inode_permission(inode, MAY_READ);
+		if (error)
+			return error;
+	}
+
+	mutex_lock(&inode->i_mutex);
+	vfs_dq_init(dir);
+	error = dir->i_op->reflink(old_dentry, dir, new_dentry, preserve);
+	mutex_unlock(&inode->i_mutex);
+	if (!error)
+		fsnotify_create(dir, new_dentry);
+	return error;
+}
+
+SYSCALL_DEFINE6(reflinkat, int, olddfd, const char __user *, oldname,
+		int, newdfd, const char __user *, newname, int, preserve,
+		int, flags)
+{
+	struct dentry *new_dentry;
+	struct nameidata nd;
+	struct path old_path;
+	int error;
+	char *to;
+
+	if ((flags & ~AT_SYMLINK_FOLLOW) != 0)
+		return -EINVAL;
+
+	if ((preserve & ~REFLINK_ATTR_PRESERVE) != 0)
+		return -EINVAL;
+
+	error = user_path_at(olddfd, oldname,
+			     flags & AT_SYMLINK_FOLLOW ? LOOKUP_FOLLOW : 0,
+			     &old_path);
+	if (error)
+		return error;
+
+	error = user_path_parent(newdfd, newname, &nd, &to);
+	if (error)
+		goto out;
+	error = -EXDEV;
+	if (old_path.mnt != nd.path.mnt)
+		goto out_release;
+	new_dentry = lookup_create(&nd, 0);
+	error = PTR_ERR(new_dentry);
+	if (IS_ERR(new_dentry))
+		goto out_unlock;
+	error = mnt_want_write(nd.path.mnt);
+	if (error)
+		goto out_dput;
+	error = security_path_link(old_path.dentry, &nd.path, new_dentry);
+	if (error)
+		goto out_drop_write;
+	error = vfs_reflink(old_path.dentry, nd.path.dentry->d_inode,
+			    new_dentry, preserve);
+out_drop_write:
+	mnt_drop_write(nd.path.mnt);
+out_dput:
+	dput(new_dentry);
+out_unlock:
+	mutex_unlock(&nd.path.dentry->d_inode->i_mutex);
+out_release:
+	path_put(&nd.path);
+	putname(to);
+out:
+	path_put(&old_path);
+
+	return error;
+}
+
+
 /*
  * The worst of all namespace operations - renaming directory. "Perverted"
  * doesn't even start to describe it. Somebody in UCB had a heck of a trip...
@@ -2890,6 +3013,7 @@ EXPORT_SYMBOL(unlock_rename);
 EXPORT_SYMBOL(vfs_create);
 EXPORT_SYMBOL(vfs_follow_link);
 EXPORT_SYMBOL(vfs_link);
+EXPORT_SYMBOL(vfs_reflink);
 EXPORT_SYMBOL(vfs_mkdir);
 EXPORT_SYMBOL(vfs_mknod);
 EXPORT_SYMBOL(generic_permission);
diff --git a/include/linux/fcntl.h b/include/linux/fcntl.h
index 8603740..96dc2f0 100644
--- a/include/linux/fcntl.h
+++ b/include/linux/fcntl.h
@@ -40,6 +40,14 @@
                                            unlinking file.  */
 #define AT_SYMLINK_FOLLOW	0x400   /* Follow symbolic links.  */
 
+/*
+ * A reflink call may preserve the file's attributes in toto or not at
+ * all.
+ */
+#define REFLINK_ATTR_PRESERVE	0x00000001
+#define REFLINK_ATTR_NONE	0
+
+
 #ifdef __KERNEL__
 
 #ifndef force_o_largefile
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 5bed436..c6f9cb0 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1415,6 +1415,7 @@ extern int vfs_link(struct dentry *, struct inode *, struct dentry *);
 extern int vfs_rmdir(struct inode *, struct dentry *);
 extern int vfs_unlink(struct inode *, struct dentry *);
 extern int vfs_rename(struct inode *, struct dentry *, struct inode *, struct dentry *);
+extern int vfs_reflink(struct dentry *, struct inode *, struct dentry *, bool);
 
 /*
  * VFS dentry helper functions.
@@ -1537,6 +1538,7 @@ struct inode_operations {
 			  loff_t len);
 	int (*fiemap)(struct inode *, struct fiemap_extent_info *, u64 start,
 		      u64 len);
+	int (*reflink) (struct dentry *,struct inode *,struct dentry *,bool);
 };
 
 struct seq_file;
diff --git a/include/linux/security.h b/include/linux/security.h
index d5fd616..2f1f520 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -528,6 +528,18 @@ static inline void security_free_mnt_opts(struct security_mnt_opts *opts)
  *	@inode contains a pointer to the inode.
  *	@secid contains a pointer to the location where result will be saved.
  *	In case of failure, @secid will be set to zero.
+ * @inode_reflink:
+ *	Check permission before creating a new reference-counted link to
+ *	a file.
+ *	@old_dentry contains the dentry structure for an existing link to
+ *	the file.
+ *	@dir contains the inode structure of the parent directory of the
+ *	new reflink.
+ *	@preserve specifies whether the caller wishes to preserve the
+ *	file's attributes.  If true, the caller wishes to clone the file's
+ *	attributes exactly.  If false, the caller expects to reflink the
+ *	data extents but reset the attributes.
+ *	Return 0 if permission is granted.
  *
  * Security hooks for file operations
  *
@@ -1415,6 +1427,8 @@ struct security_operations {
 	int (*inode_unlink) (struct inode *dir, struct dentry *dentry);
 	int (*inode_symlink) (struct inode *dir,
 			      struct dentry *dentry, const char *old_name);
+	int (*inode_reflink) (struct dentry *old_dentry, struct inode *dir,
+			      bool preserve);
 	int (*inode_mkdir) (struct inode *dir, struct dentry *dentry, int mode);
 	int (*inode_rmdir) (struct inode *dir, struct dentry *dentry);
 	int (*inode_mknod) (struct inode *dir, struct dentry *dentry,
@@ -1675,6 +1689,8 @@ int security_inode_link(struct dentry *old_dentry, struct inode *dir,
 int security_inode_unlink(struct inode *dir, struct dentry *dentry);
 int security_inode_symlink(struct inode *dir, struct dentry *dentry,
 			   const char *old_name);
+int security_inode_reflink(struct dentry *old_dentry, struct inode *dir,
+			   bool preserve);
 int security_inode_mkdir(struct inode *dir, struct dentry *dentry, int mode);
 int security_inode_rmdir(struct inode *dir, struct dentry *dentry);
 int security_inode_mknod(struct inode *dir, struct dentry *dentry, int mode, dev_t dev);
@@ -2056,6 +2072,13 @@ static inline int security_inode_symlink(struct inode *dir,
 	return 0;
 }
 
+static inline int security_inode_reflink(struct dentry *old_dentry,
+					 struct inode *dir,
+					 bool preserve)
+{
+	return 0;
+}
+
 static inline int security_inode_mkdir(struct inode *dir,
 					struct dentry *dentry,
 					int mode)
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 40617c1..a11f228 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -692,6 +692,9 @@ asmlinkage long sys_symlinkat(const char __user * oldname,
 			      int newdfd, const char __user * newname);
 asmlinkage long sys_linkat(int olddfd, const char __user *oldname,
 			   int newdfd, const char __user *newname, int flags);
+asmlinkage long sys_reflinkat(int olddfd, const char __user *oldname,
+			      int newdfd, const char __user *newname,
+			      int preserve, int flags);
 asmlinkage long sys_renameat(int olddfd, const char __user * oldname,
 			     int newdfd, const char __user * newname);
 asmlinkage long sys_futimesat(int dfd, char __user *filename,
diff --git a/security/capability.c b/security/capability.c
index 21b6cea..8047b7c 100644
--- a/security/capability.c
+++ b/security/capability.c
@@ -172,6 +172,12 @@ static int cap_inode_symlink(struct inode *inode, struct dentry *dentry,
 	return 0;
 }
 
+static int cap_inode_reflink(struct dentry *old_dentry, struct inode *inode,
+			     bool preserve)
+{
+	return 0;
+}
+
 static int cap_inode_mkdir(struct inode *inode, struct dentry *dentry,
 			   int mask)
 {
@@ -905,6 +911,7 @@ void security_fixup_ops(struct security_operations *ops)
 	set_to_cap_if_null(ops, inode_link);
 	set_to_cap_if_null(ops, inode_unlink);
 	set_to_cap_if_null(ops, inode_symlink);
+	set_to_cap_if_null(ops, inode_reflink);
 	set_to_cap_if_null(ops, inode_mkdir);
 	set_to_cap_if_null(ops, inode_rmdir);
 	set_to_cap_if_null(ops, inode_mknod);
diff --git a/security/security.c b/security/security.c
index 5284255..e2b12f9 100644
--- a/security/security.c
+++ b/security/security.c
@@ -470,6 +470,14 @@ int security_inode_symlink(struct inode *dir, struct dentry *dentry,
 	return security_ops->inode_symlink(dir, dentry, old_name);
 }
 
+int security_inode_reflink(struct dentry *old_dentry, struct inode *dir,
+			   bool preserve)
+{
+	if (unlikely(IS_PRIVATE(old_dentry->d_inode)))
+		return 0;
+	return security_ops->inode_reflink(old_dentry, dir, preserve);
+}
+
 int security_inode_mkdir(struct inode *dir, struct dentry *dentry, int mode)
 {
 	if (unlikely(IS_PRIVATE(dir)))
-- 
1.6.3
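
For readers following the hook plumbing in this patch, the pattern can be
modelled in userspace roughly as below: a hook left NULL in the ops table is
replaced by a permissive default (the role set_to_cap_if_null() plays above),
and the security_inode_reflink() wrapper skips private inodes before
dispatching. This is only a sketch. The struct layouts, the is_private field,
fixup_ops() and example_deny_reflink() are hypothetical stand-ins, not kernel
code.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily reduced stand-ins for the kernel structures. */
struct inode { bool is_private; };
struct dentry { struct inode *d_inode; };

struct security_operations {
	int (*inode_reflink)(struct dentry *old_dentry, struct inode *dir,
			     bool preserve);
};

/* Permissive default, mirroring cap_inode_reflink() in the patch. */
static int cap_inode_reflink(struct dentry *old_dentry, struct inode *dir,
			     bool preserve)
{
	(void)old_dentry; (void)dir; (void)preserve;
	return 0;
}

/* Hypothetical module hook that refuses every reflink. */
static int example_deny_reflink(struct dentry *old_dentry, struct inode *dir,
				bool preserve)
{
	(void)old_dentry; (void)dir; (void)preserve;
	return -EACCES;
}

/* Simplified model of the fixup step: fill in a hook left unset. */
static void fixup_ops(struct security_operations *ops)
{
	assert(ops != NULL);
	if (!ops->inode_reflink)
		ops->inode_reflink = cap_inode_reflink;
}

static struct security_operations *security_ops;

/* Wrapper: private inodes bypass the module, everything else dispatches. */
static int security_inode_reflink(struct dentry *old_dentry, struct inode *dir,
				  bool preserve)
{
	if (old_dentry->d_inode->is_private)
		return 0;
	return security_ops->inode_reflink(old_dentry, dir, preserve);
}
```

With an empty ops table the wrapper returns 0 for any source; once a module
installs its own hook, only non-private inodes reach it.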

-- 

"Anything that is too stupid to be spoken is sung."  
        - Voltaire

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127

* Re: [RFC] The reflink(2) system call v2.
  2009-05-08  2:59   ` jim owens
  2009-05-08  3:10     ` Joel Becker
@ 2009-05-11 20:49     ` Joel Becker
  2009-05-11 22:49       ` jim owens
  1 sibling, 1 reply; 151+ messages in thread
From: Joel Becker @ 2009-05-11 20:49 UTC (permalink / raw)
  To: jim owens
  Cc: jmorris, linux-security-module, mtk.manpages, linux-fsdevel,
	ocfs2-devel, viro

On Thu, May 07, 2009 at 10:59:04PM -0400, jim owens wrote:
> - fix the
> +	if (S_ISDIR(inode->i_mode))
> +		return -EPERM;
>
>   to be an ISREG check unless you have an argument for
>   special files and symlinks being COWed.

	I'm unsure on this one, and would like other comments.  Why?  It
doesn't *hurt* to allow reflink on symlinks or special files.  Mostly
it's a waste - symlinks may have a data extent, but special files do
not.  But I'm not sure there's a point to arbitrarily limit filesystems
when there's nothing we're combating.
	Jim, if you have a real problem this prevents, I'm all ears.
And if others concur that restricting it to regular files is the right
way to go, I can be convinced.

Joel

-- 

"Hey mister if you're gonna walk on water,
 Could you drop a line my way?"

Joel Becker


* Re: [RFC] The reflink(2) system call v2.
  2009-05-11 20:49     ` [RFC] The reflink(2) system call v2 Joel Becker
@ 2009-05-11 22:49       ` jim owens
  2009-05-11 23:46         ` Joel Becker
  0 siblings, 1 reply; 151+ messages in thread
From: jim owens @ 2009-05-11 22:49 UTC (permalink / raw)
  To: joel.becker, linux-fsdevel
  Cc: jmorris, ocfs2-devel, viro, mtk.manpages, linux-security-module

Joel Becker wrote:
> On Thu, May 07, 2009 at 10:59:04PM -0400, jim owens wrote:
>> - fix the
>> +	if (S_ISDIR(inode->i_mode))
>> +		return -EPERM;
>>
>>   to be an ISREG check unless you have an argument for
>>   special files and symlinks being COWed.
> 
> 	I'm unsure on this one, and would like other comments.  Why?  It
> doesn't *hurt* to allow reflink on symlinks or special files.  Mostly
> it's a waste - symlinks may have a data extent, but special files do
> not.  But I'm not sure there's a point to arbitrarily limit filesystems
> when there's nothing we're combating.
> 	Jim, if you have a real problem this prevents, I'm all ears.
> And if others concur that restricting it to regular files is the right
> way to go, I can be convinced.

My only problem was my past experience on non-Linux systems
where once we said it works for multiple file types, we had
to support that forever across all filesystems.  We could add
support for more types but not eliminate supported ones.

Since only ocfs2 will initially support this, I'm fine with the
S_ISDIR and if in the future other filesystems can only support
regular files (or can also support directories), we move the
check out of VFS to be filesystem specific.

jim


* Re: [RFC] The reflink(2) system call v2.
  2009-05-11 22:49       ` jim owens
@ 2009-05-11 23:46         ` Joel Becker
  2009-05-12  0:54           ` Chris Mason
  2009-05-12 20:36           ` Jamie Lokier
  0 siblings, 2 replies; 151+ messages in thread
From: Joel Becker @ 2009-05-11 23:46 UTC (permalink / raw)
  To: jim owens
  Cc: linux-fsdevel, jmorris, ocfs2-devel, viro, mtk.manpages,
	linux-security-module

On Mon, May 11, 2009 at 06:49:01PM -0400, jim owens wrote:
> Joel Becker wrote:
>> On Thu, May 07, 2009 at 10:59:04PM -0400, jim owens wrote:
>>> - fix the
>>> +	if (S_ISDIR(inode->i_mode))
>>> +		return -EPERM;
>>>
>>>   to be an ISREG check unless you have an argument for
>>>   special files and symlinks being COWed.
>>
>> 	Jim, if you have a real problem this prevents, I'm all ears.
>> And if others concur that restricting it to regular files is the right
>> way to go, I can be convinced.
>
> My only problem was my past experience on non-Linux systems
> where once we said it works for multiple file types, we had
> to support that forever across all filesystems.  We could add
> support for more types but not eliminate supported ones.

	Someone else pointed out that a naive user might reflink a block
device file and expect the device contents to be copied-on-write.
Obviously wrong if you understand filesystems, but let's just prevent
that misunderstanding.  S_ISREG() it is.
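
As a rough userspace illustration of the check agreed on here (not the actual
vfs_reflink() code), the S_ISDIR() rejection turns into a test that the source
is a regular file. The helper name reflink_check_source_mode() is made up for
this sketch; -EPERM is the error value from the quoted hunk.

```c
#define _XOPEN_SOURCE 700	/* expose S_IFBLK, S_IFLNK etc. in strict modes */
#include <errno.h>
#include <sys/stat.h>

/*
 * Hypothetical helper: return 0 if a file with this mode may be the source
 * of a reflink, -EPERM otherwise.  Directories, symlinks and device nodes
 * are all rejected because only regular files pass S_ISREG().
 */
static int reflink_check_source_mode(mode_t mode)
{
	if (!S_ISREG(mode))
		return -EPERM;
	return 0;
}
```

A block device node (S_IFBLK) is refused under this rule, which is the
misunderstanding the change is meant to head off.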

Joel

-- 

"All alone at the end of the evening
 When the bright lights have faded to blue.
 I was thinking about a woman who had loved me
 And I never knew"

Joel Becker


* Re: [RFC] The reflink(2) system call v2.
  2009-05-11 23:46         ` Joel Becker
@ 2009-05-12  0:54           ` Chris Mason
  2009-05-12 20:36           ` Jamie Lokier
  1 sibling, 0 replies; 151+ messages in thread
From: Chris Mason @ 2009-05-12  0:54 UTC (permalink / raw)
  To: Joel Becker
  Cc: jmorris, linux-fsdevel, linux-security-module, mtk.manpages,
	jim owens, ocfs2-devel, viro

On Mon, 2009-05-11 at 16:46 -0700, Joel Becker wrote:
> On Mon, May 11, 2009 at 06:49:01PM -0400, jim owens wrote:
> > Joel Becker wrote:
> >> On Thu, May 07, 2009 at 10:59:04PM -0400, jim owens wrote:
> >>> - fix the
> >>> +	if (S_ISDIR(inode->i_mode))
> >>> +		return -EPERM;
> >>>
> >>>   to be an ISREG check unless you have an argument for
> >>>   special files and symlinks being COWed.
> >>
> >> 	Jim, if you have a real problem this prevents, I'm all ears.
> >> And if others concur that restricting it to regular files is the right
> >> way to go, I can be convinced.
> >
> > My only problem was my past experience on non-Linux systems
> > where once we said it works for multiple file types, we had
> > to support that forever across all filesystems.  We could add
> > support for more types but not eliminate supported ones.
> 
> 	Someone else pointed out that a naive user might reflink a block
> device file and expect the device contents to be copied-on-write.
> Obviously wrong if you understand filesystems, but let's just prevent
> that misunderstanding.  S_ISREG() it is.

Btrfs won't be doing single directories, and I'd rather keep using a
dedicated ioctl for snapshotting whole subvolumes.

The semantics described here all sound sane, if this looks like the
final-ish rev I'll try to find someone interested in wiring it up to the
btrfs clone ioctl.  It just needs a wrapper to create the new inode and
copy xattrs/acls over.

Thanks for doing all of this Joel.

-chris


* Re: [RFC] The reflink(2) system call v2.
  2009-05-11 23:46         ` Joel Becker
  2009-05-12  0:54           ` Chris Mason
@ 2009-05-12 20:36           ` Jamie Lokier
  1 sibling, 0 replies; 151+ messages in thread
From: Jamie Lokier @ 2009-05-12 20:36 UTC (permalink / raw)
  To: jim owens, linux-fsdevel, jmorris, ocfs2-devel, viro,
	mtk.manpages, linux-security-module

Joel Becker wrote:
> 	Someone else pointed out that a naive user might reflink a block
> device file and expect the device contents to be copied-on-write.
> Obviously wrong if you understand filesystems, but let's just prevent
> that misunderstanding.  S_ISREG() it is.

I think S_ISLNK() should be allowed too if the filesystem allows, as
it is harmless, behaves as expected, saves a little space, and copying
symlink attributes is meaningful too.
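
One way to picture this suggestion, purely as a userspace sketch: regular
files are always accepted, and symlinks only when the filesystem opts in.
The struct fs_reflink_caps and the helper name are hypothetical and do not
come from any posted patch.

```c
#define _XOPEN_SOURCE 700	/* expose S_ISLNK and S_IF* in strict modes */
#include <errno.h>
#include <stdbool.h>
#include <sys/stat.h>

/* Hypothetical per-filesystem opt-in flags. */
struct fs_reflink_caps {
	bool symlinks;	/* filesystem is willing to reflink symlinks */
};

/* Return 0 if the source mode is acceptable, -EPERM otherwise. */
static int reflink_check_mode_with_caps(mode_t mode,
					const struct fs_reflink_caps *caps)
{
	if (S_ISREG(mode))
		return 0;
	if (S_ISLNK(mode) && caps->symlinks)
		return 0;
	return -EPERM;
}
```

Device nodes and directories are still refused regardless of the flag.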

-- Jamie


end of thread, other threads:[~2009-09-14 22:26 UTC | newest]

Thread overview: 151+ messages
2009-05-03  6:15 [RFC] The reflink(2) system call Joel Becker
2009-05-03  6:15 ` [PATCH 1/3] fs: Document the " Joel Becker
2009-05-03  8:01   ` Christoph Hellwig
2009-05-04  2:46     ` Joel Becker
2009-05-04  6:36       ` Michael Kerrisk
2009-05-04  7:12         ` Joel Becker
2009-05-03 13:08   ` Boaz Harrosh
2009-05-03 23:08     ` Al Viro
2009-05-04  2:49     ` Joel Becker
2009-05-03 23:45   ` Theodore Tso
2009-05-04  1:44     ` Tao Ma
2009-05-04 18:25       ` Joel Becker
2009-05-04 21:18         ` [Ocfs2-devel] " Joel Becker
2009-05-04 22:23           ` Theodore Tso
2009-05-05  6:55             ` Joel Becker
2009-05-05  1:07   ` Jamie Lokier
2009-05-05  7:16     ` Joel Becker
2009-05-05  8:09       ` Andreas Dilger
2009-05-05 16:56         ` Joel Becker
2009-05-05 21:24           ` Andreas Dilger
2009-05-05 21:32             ` Joel Becker
2009-05-06  7:15               ` [Ocfs2-devel] " Theodore Tso
2009-05-06 14:24                 ` jim owens
2009-05-06 14:30                   ` jim owens
2009-05-06 17:50                     ` jim owens
2009-05-12 19:20                       ` Jamie Lokier
2009-05-12 19:30                       ` Jamie Lokier
2009-05-12 19:11                   ` Jamie Lokier
2009-05-12 19:37                     ` jim owens
2009-05-12 20:11                       ` Jamie Lokier
2009-05-05 13:01       ` Theodore Tso
2009-05-05 13:19         ` Jamie Lokier
2009-05-05 13:39           ` Chris Mason
2009-05-05 15:36             ` Jamie Lokier
2009-05-05 15:41               ` Chris Mason
2009-05-05 16:03                 ` Jamie Lokier
2009-05-05 16:18                   ` Chris Mason
2009-05-05 20:48                   ` jim owens
2009-05-05 21:57                     ` Jamie Lokier
2009-05-05 22:04                       ` Joel Becker
2009-05-05 22:11                         ` Jamie Lokier
2009-05-05 22:24                           ` Joel Becker
2009-05-05 23:14                             ` Jamie Lokier
2009-05-05 22:12                         ` Jamie Lokier
2009-05-05 22:21                           ` Joel Becker
2009-05-05 22:32                             ` James Morris
2009-05-05 22:39                               ` Joel Becker
2009-05-12 19:40                               ` Jamie Lokier
2009-05-05 22:28                         ` jim owens
2009-05-05 23:12                           ` Jamie Lokier
2009-05-05 16:46               ` Jörn Engel
2009-05-05 16:54                 ` Jörn Engel
2009-05-05 22:03                   ` Jamie Lokier
2009-05-05 21:44                 ` copyfile semantics Andreas Dilger
2009-05-05 21:48                   ` Matthew Wilcox
2009-05-05 22:25                     ` Trond Myklebust
2009-05-05 22:06                   ` Jamie Lokier
2009-05-06  5:57                   ` Jörn Engel
2009-05-05 14:21           ` [PATCH 1/3] fs: Document the reflink(2) system call Theodore Tso
2009-05-05 15:32             ` Jamie Lokier
2009-05-05 22:49             ` James Morris
2009-05-05 17:05           ` Joel Becker
2009-05-05 17:00         ` Joel Becker
2009-05-05 17:29           ` Theodore Tso
2009-05-05 22:36             ` Jamie Lokier
2009-05-05 22:30           ` Jamie Lokier
2009-05-05 22:37             ` Joel Becker
2009-05-05 23:08             ` jim owens
2009-05-05 13:01       ` Jamie Lokier
2009-05-05 17:09         ` Joel Becker
2009-05-03  6:15 ` [PATCH 2/3] fs: Add vfs_reflink() and the ->reflink() inode operation Joel Becker
2009-05-03  8:03   ` Christoph Hellwig
2009-05-04  2:51     ` Joel Becker
2009-05-03  6:15 ` [PATCH 3/3] fs: Add the reflink(2) system call Joel Becker
2009-05-03  6:27   ` Matthew Wilcox
2009-05-03  6:39     ` Al Viro
2009-05-03  7:48       ` Christoph Hellwig
2009-05-03 11:16         ` Al Viro
2009-05-04  2:53       ` Joel Becker
2009-05-04  2:53     ` Joel Becker
2009-05-03  8:04   ` Christoph Hellwig
2009-05-07 22:15 ` [RFC] The reflink(2) system call v2 Joel Becker
2009-05-08  1:39   ` James Morris
2009-05-08  1:49     ` Joel Becker
2009-05-08 13:01       ` Tetsuo Handa
2009-05-08  2:59   ` jim owens
2009-05-08  3:10     ` Joel Becker
2009-05-08 11:53       ` jim owens
2009-05-08 12:16       ` jim owens
2009-05-08 14:11         ` jim owens
2009-05-11 20:40       ` [RFC] The reflink(2) system call v4 Joel Becker
2009-05-11 22:27         ` James Morris
2009-05-11 22:34           ` Joel Becker
2009-05-12  1:12             ` James Morris
2009-05-12 12:18               ` Stephen Smalley
2009-05-12 17:22                 ` Joel Becker
2009-05-12 17:32                   ` Stephen Smalley
2009-05-12 18:03                     ` Joel Becker
2009-05-12 18:04                       ` Stephen Smalley
2009-05-12 18:28                         ` Joel Becker
2009-05-12 18:37                           ` Stephen Smalley
2009-05-14 18:06                         ` Stephen Smalley
2009-05-14 18:25                           ` Stephen Smalley
2009-05-14 23:25                             ` James Morris
2009-05-15 11:54                               ` Stephen Smalley
2009-05-15 13:35                                 ` James Morris
2009-05-15 15:44                                   ` Stephen Smalley
2009-05-13  1:47                       ` Casey Schaufler
2009-05-13 16:43                         ` Joel Becker
2009-05-13 17:23                           ` Stephen Smalley
2009-05-13 18:27                             ` Joel Becker
2009-05-12 12:01           ` Stephen Smalley
2009-05-11 23:11         ` jim owens
2009-05-11 23:42           ` Joel Becker
2009-05-12 11:31         ` Jörn Engel
2009-05-12 13:12           ` jim owens
2009-05-12 20:24             ` Jamie Lokier
2009-05-14 18:43             ` Jörn Engel
2009-05-12 15:04         ` Sage Weil
2009-05-12 15:23           ` jim owens
2009-05-12 16:16             ` Sage Weil
2009-05-12 17:45               ` jim owens
2009-05-12 20:29                 ` Jamie Lokier
2009-05-12 17:28           ` Joel Becker
2009-05-13  4:30             ` Sage Weil
2009-05-14  3:57         ` Andy Lutomirski
2009-05-14 18:12           ` Stephen Smalley
2009-05-14 22:00             ` Joel Becker
2009-05-15  1:20               ` Jamie Lokier
2009-05-15 12:01               ` Stephen Smalley
2009-05-15 15:22                 ` Joel Becker
2009-05-15 15:55                   ` Stephen Smalley
2009-05-15 16:42                     ` Joel Becker
2009-05-15 17:01                       ` Shaya Potter
2009-05-15 20:53                       ` [Ocfs2-devel] " Joel Becker
2009-05-18  9:17                         ` Jörn Engel
2009-05-18 13:02                         ` Stephen Smalley
2009-05-18 14:33                           ` Stephen Smalley
2009-05-18 17:15                             ` Stephen Smalley
2009-05-18 18:26                           ` Joel Becker
2009-05-19 16:32                             ` [Ocfs2-devel] " Sage Weil
2009-05-19 19:33                         ` Jonathan Corbet
2009-05-19 20:15                           ` Jamie Lokier
     [not found]                         ` <20090519132057.419b9de0@bike.lwn.net>
     [not found]                           ` <20090519193244.GB25521@mail.oracle.com>
2009-05-19 19:41                             ` Jonathan Corbet
2009-05-28  0:24         ` [RFC] The reflink(2) system call v5 Joel Becker
2009-09-14 22:24         ` Joel Becker
2009-05-11 20:49     ` [RFC] The reflink(2) system call v2 Joel Becker
2009-05-11 22:49       ` jim owens
2009-05-11 23:46         ` Joel Becker
2009-05-12  0:54           ` Chris Mason
2009-05-12 20:36           ` Jamie Lokier
