* [RFC] The reflink(2) system call. @ 2009-05-03 6:15 Joel Becker 2009-05-03 6:15 ` [PATCH 1/3] fs: Document the " Joel Becker ` (3 more replies) 0 siblings, 4 replies; 151+ messages in thread From: Joel Becker @ 2009-05-03 6:15 UTC (permalink / raw) To: linux-fsdevel; +Cc: jmorris, ocfs2-devel, viro Hi everyone, I described the reflink operation at the Linux Storage & Filesystems Workshop last month. Originally implemented as an ocfs2-specific ioctl, the consensus was that it should be a syscall from the get-go. Here's some first-cut patches. For people who have not seen reflink, either at LSF or on the ocfs2 wiki, the first patch contains Documentation/filesystems/reflink.txt to describe the call. The short-short version is that reflink creates a reference-counted link. This is a new file that shares the data extents of a source file in a copy-on-write fashion. The second patch adds iops->reflink() and vfs_reflink(). People interested in LSM interaction, please look at my comments in the patch header and the implementation of vfs_link(). I think it needs improvement. The last patch defines sys_reflink() and sys_reflinkat(). It also hooks them up for x86_32. The final version of this patch will obviously include the other architectures. The patches are also available in my git tree: git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2.git reflink The current ioctl-based implementation for ocfs2 is available in Tao's git tree at: git://oss.oracle.com/git/tma/linux-2.6.git refcount It will be reset atop the system call very soon. Please send any comments along. 
Joel Documentation/filesystems/reflink.txt | 129 ++++++++++++++++++++++++++++++++++ Documentation/filesystems/vfs.txt | 4 + arch/x86/include/asm/unistd_32.h | 1 arch/x86/kernel/syscall_table_32.S | 1 fs/namei.c | 96 +++++++++++++++++++++++++ include/linux/fs.h | 2 6 files changed, 233 insertions(+) -- "But then she looks me in the eye And says, 'We're going to last forever,' And man you know I can't begin to doubt it. Cause it just feels so good and so free and so right, I know we ain't never going to change our minds about it, Hey! Here comes my girl." Joel Becker Principal Software Developer Oracle E-mail: joel.becker@oracle.com Phone: (650) 506-8127 ^ permalink raw reply [flat|nested] 151+ messages in thread
* [PATCH 1/3] fs: Document the reflink(2) system call. 2009-05-03 6:15 [RFC] The reflink(2) system call Joel Becker @ 2009-05-03 6:15 ` Joel Becker 2009-05-03 8:01 ` Christoph Hellwig ` (3 more replies) 2009-05-03 6:15 ` [PATCH 2/3] fs: Add vfs_reflink() and the ->reflink() inode operation Joel Becker ` (2 subsequent siblings) 3 siblings, 4 replies; 151+ messages in thread From: Joel Becker @ 2009-05-03 6:15 UTC (permalink / raw) To: linux-fsdevel; +Cc: jmorris, ocfs2-devel, viro int reflink(const char *oldpath, const char *newpath); The reflink(2) system call creates reference-counted links. It creates a new file that shares the data extents of the source file in a copy-on-write fashion. Its calling semantics are identical to link(2). Once complete, programs see the new file as a completely separate entry. Signed-off-by: Joel Becker <joel.becker@oracle.com> --- Documentation/filesystems/reflink.txt | 129 +++++++++++++++++++++++++++++++++ Documentation/filesystems/vfs.txt | 4 + 2 files changed, 133 insertions(+), 0 deletions(-) create mode 100644 Documentation/filesystems/reflink.txt diff --git a/Documentation/filesystems/reflink.txt b/Documentation/filesystems/reflink.txt new file mode 100644 index 0000000..f3620f0 --- /dev/null +++ b/Documentation/filesystems/reflink.txt @@ -0,0 +1,129 @@ +reflink(2) +========== + +NAME +---- +reflink - make a reference-counted link of a file + + +SYNOPSIS +-------- +#include <unistd.h> + +int reflink(const char *oldpath, const char *newpath); + +DESCRIPTION +----------- +reflink() creates a new reflink (also known as a reference-counted link) +to an existing file. This reflink is a new file object that shares the +attributes and data extents of the source object in a copy-on-write fashion. + +An easy way to think of it is that the semantics of the reflink() call +are identical to the link(2) system call, but the resulting file object +behaves as if it were a copy with identical attributes. 
+ +Like the link(2) system call, if newpath exists, it will not be overwritten. +oldpath must be a regular file. oldpath and newpath must be on the same +mounted filesystem. + +All data extents of the new file must be shared with the source file in +a copy-on-write fashion. This includes data extents for extended +attributes. If either the source or the new file is written to, the +changes do not show up in the other file. + +All file attributes and extended attributes of the new file must be +identical to the source file with the following exceptions: + +- The new file must have a new inode number. This allows POSIX + programs to treat the source and new files as separate objects. From + the view of the POSIX application, the files are distinct. The + sharing is invisible outside the filesystem. +- The ctime of the source file only changes if the source's metadata + must be changed to accommodate the copy-on-write linkage. The ctime of + the new file is set to represent its creation. +- The mtime of the source file is unmodified, and the mtime of the new file + is set identical to the source file. This reflects that the data is + unchanged. +- The link count of the source file is unchanged, and the link count of + the new file is one. + +RETURN VALUE +------------ +On success, zero is returned. On error, -1 is returned, and errno is +set appropriately. + +ERRORS +------ +EACCES:: + Write access to the directory containing newpath is denied, or + search permission is denied for one of the directories in the + path prefix of oldpath or newpath. (See also path_resolution(7).) + +EEXIST:: + newpath already exists. + +EFAULT:: + oldpath or newpath points outside your accessible address space. + +EIO:: + An I/O error occurred. + +ELOOP:: + Too many symbolic links were encountered in resolving oldpath or + newpath. + +ENAMETOOLONG:: + oldpath or newpath was too long. + +ENOENT:: + A directory component in oldpath or newpath does not exist or is + a dangling symbolic link. 
+ +ENOMEM:: + Insufficient kernel memory was available. + +ENOSPC:: + The device containing the file has no room for the new directory + entry or file object. + +ENOTDIR:: + A component used as a directory in oldpath or newpath is not, in + fact, a directory. + +EPERM:: + oldpath is a directory. + +EPERM:: + The file system containing oldpath and newpath does not support + the creation of reference-counted links. + +EROFS:: + The file is on a read-only file system. + +EXDEV:: + oldpath and newpath are not on the same mounted file system. + (Linux permits a file system to be mounted at multiple points, + but reflink() does not work across different mount points, even if + the same file system is mounted on both.) + +VERSIONS +-------- +reflink() is available on Linux since kernel 2.6.31. + +CONFORMING TO +------------- +reflink() is Linux-specific. + +NOTES +----- +reflink() dereferences symbolic links in the same manner that link(2) +does. For precise control over the treatment of symbolic links, see +reflinkat(). + +In the case of a crash, the new file must not appear partially complete +in the filesystem. + +SEE ALSO +-------- +ln(1), reflink(1), reflinkat(2), path_resolution(7) + diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt index f49eecf..01cd810 100644 --- a/Documentation/filesystems/vfs.txt +++ b/Documentation/filesystems/vfs.txt @@ -333,6 +333,7 @@ struct inode_operations { ssize_t (*listxattr) (struct dentry *, char *, size_t); int (*removexattr) (struct dentry *, const char *); void (*truncate_range)(struct inode *, loff_t, loff_t); + int (*reflink) (struct dentry *,struct inode *,struct dentry *); }; Again, all methods are called without any locks being held, unless @@ -431,6 +432,9 @@ otherwise noted. truncate_range: a method provided by the underlying filesystem to truncate a range of blocks , i.e. punch a hole somewhere in a file. + reflink: called by the reflink(2) system call. 
Only required if you want + to support reflinks. For further information, see + Documentation/filesystems/reflink.txt. The Address Space Object -- 1.6.1.3 ^ permalink raw reply related [flat|nested] 151+ messages in thread
* Re: [PATCH 1/3] fs: Document the reflink(2) system call. 2009-05-03 6:15 ` [PATCH 1/3] fs: Document the " Joel Becker @ 2009-05-03 8:01 ` Christoph Hellwig 2009-05-04 2:46 ` Joel Becker 2009-05-03 13:08 ` Boaz Harrosh ` (2 subsequent siblings) 3 siblings, 1 reply; 151+ messages in thread From: Christoph Hellwig @ 2009-05-03 8:01 UTC (permalink / raw) To: Joel Becker; +Cc: linux-fsdevel, jmorris, ocfs2-devel, viro, mtk.manpages On Sat, May 02, 2009 at 11:15:01PM -0700, Joel Becker wrote: > int reflink(const char *oldpath, const char *newpath); > > The reflink(2) system call creates reference-counted links. It creates > a new file that shares the data extents of the source file in a > copy-on-write fashion. Its calling semantics are identical to link(2). > Once complete, programs see the new file as a completely separate entry. Just send this as a manpage to Michael, no need to duplicate a pseudo-manpage in the kernel tree. ^ permalink raw reply [flat|nested] 151+ messages in thread
* Re: [PATCH 1/3] fs: Document the reflink(2) system call. 2009-05-03 8:01 ` Christoph Hellwig @ 2009-05-04 2:46 ` Joel Becker 2009-05-04 6:36 ` Michael Kerrisk 0 siblings, 1 reply; 151+ messages in thread From: Joel Becker @ 2009-05-04 2:46 UTC (permalink / raw) To: Christoph Hellwig; +Cc: linux-fsdevel, mtk.manpages, jmorris, ocfs2-devel, viro On Sun, May 03, 2009 at 04:01:12AM -0400, Christoph Hellwig wrote: > On Sat, May 02, 2009 at 11:15:01PM -0700, Joel Becker wrote: > > int reflink(const char *oldpath, const char *newpath); > > > > The reflink(2) system call creates reference-counted links. It creates > > a new file that shares the data extents of the source file in a > > copy-on-write fashion. Its calling semantics are identical to link(2). > > Once complete, programs see the new file as a completely separate entry. > > Just send this as a manpage to Michael, no need to duplicate a > pseudo-manpage in the kernel tree. The manpage style was just a convenient way to organize my thoughts. The goal was to document the behavior of reflink() for implementors. If the pseudo-manpage doesn't work, perhaps I'll try some other form. Joel -- Life's Little Instruction Book #337 "Reread your favorite book." Joel Becker Principal Software Developer Oracle E-mail: joel.becker@oracle.com Phone: (650) 506-8127 ^ permalink raw reply [flat|nested] 151+ messages in thread
* Re: [PATCH 1/3] fs: Document the reflink(2) system call. 2009-05-04 2:46 ` Joel Becker @ 2009-05-04 6:36 ` Michael Kerrisk 2009-05-04 7:12 ` Joel Becker 0 siblings, 1 reply; 151+ messages in thread From: Michael Kerrisk @ 2009-05-04 6:36 UTC (permalink / raw) To: Christoph Hellwig, linux-fsdevel, jmorris, ocfs2-devel, viro, mtk.manpages On Mon, May 4, 2009 at 2:46 PM, Joel Becker <Joel.Becker@oracle.com> wrote: > On Sun, May 03, 2009 at 04:01:12AM -0400, Christoph Hellwig wrote: >> On Sat, May 02, 2009 at 11:15:01PM -0700, Joel Becker wrote: >> > int reflink(const char *oldpath, const char *newpath); >> > >> > The reflink(2) system call creates reference-counted links. It creates >> > a new file that shares the data extents of the source file in a >> > copy-on-write fashion. Its calling semantics are identical to link(2). >> > Once complete, programs see the new file as a completely separate entry. >> >> Just send this as a manpage to Michael, no need to duplicate a >> pseudo-manpage in the kernel tree. > > The manpage style was just a convenient way to organize my > thoughts. The goal was to document the behavior of reflink() for > implementors. If the pseudo-manpage doesn't work, perhaps I'll try some > other form. So, I'm late to this thread. Is reflink() (to be) a user-visible syscall? Cheers, Michael -- Michael Kerrisk Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/ git://git.kernel.org/pub/scm/docs/man-pages/man-pages.git man-pages online: http://www.kernel.org/doc/man-pages/online_pages.html Found a bug? http://www.kernel.org/doc/man-pages/reporting_bugs.html ^ permalink raw reply [flat|nested] 151+ messages in thread
* Re: [PATCH 1/3] fs: Document the reflink(2) system call. 2009-05-04 6:36 ` Michael Kerrisk @ 2009-05-04 7:12 ` Joel Becker 0 siblings, 0 replies; 151+ messages in thread From: Joel Becker @ 2009-05-04 7:12 UTC (permalink / raw) To: mtk.manpages; +Cc: Christoph Hellwig, linux-fsdevel, jmorris, ocfs2-devel, viro On Mon, May 04, 2009 at 06:36:58PM +1200, Michael Kerrisk wrote: > On Mon, May 4, 2009 at 2:46 PM, Joel Becker <Joel.Becker@oracle.com> wrote: > > On Sun, May 03, 2009 at 04:01:12AM -0400, Christoph Hellwig wrote: > >> On Sat, May 02, 2009 at 11:15:01PM -0700, Joel Becker wrote: > >> > int reflink(const char *oldpath, const char *newpath); > >> > > >> > The reflink(2) system call creates reference-counted links. It creates > >> > a new file that shares the data extents of the source file in a > >> > copy-on-write fashion. Its calling semantics are identical to link(2). > >> > Once complete, programs see the new file as a completely separate entry. > >> > >> Just send this as a manpage to Michael, no need to duplicate a > >> pseudo-manpage in the kernel tree. > > > > The manpage style was just a convenient way to organize my > > thoughts. The goal was to document the behavior of reflink() for > > implementors. If the pseudo-manpage doesn't work, perhaps I'll try some > > other form. > > So, I'm late to this thread. Is reflink() (to be) a user-visible syscall? Yes. The actual call will be reflinkat(), as they're correct that userspace can wrap reflink() around it. I did have you on my todo for notification as it settled down. Joel -- "Well-timed silence hath more eloquence than speech." - Martin Fraquhar Tupper Joel Becker Principal Software Developer Oracle E-mail: joel.becker@oracle.com Phone: (650) 506-8127 ^ permalink raw reply [flat|nested] 151+ messages in thread
* Re: [PATCH 1/3] fs: Document the reflink(2) system call. 2009-05-03 6:15 ` [PATCH 1/3] fs: Document the " Joel Becker 2009-05-03 8:01 ` Christoph Hellwig @ 2009-05-03 13:08 ` Boaz Harrosh 2009-05-03 23:08 ` Al Viro 2009-05-04 2:49 ` Joel Becker 2009-05-03 23:45 ` Theodore Tso 2009-05-05 1:07 ` Jamie Lokier 3 siblings, 2 replies; 151+ messages in thread From: Boaz Harrosh @ 2009-05-03 13:08 UTC (permalink / raw) To: Joel Becker; +Cc: linux-fsdevel, jmorris, ocfs2-devel, viro On 05/03/2009 09:15 AM, Joel Becker wrote: > int reflink(const char *oldpath, const char *newpath); > > The reflink(2) system call creates reference-counted links. It creates > a new file that shares the data extents of the source file in a > copy-on-write fashion. Its calling semantics are identical to link(2). > Once complete, programs see the new file as a completely separate entry. > Please forgive my complete Unix jargon novice-ness, but from here it looks like the name is very wrong, and confusing. If I put data and links in a graph, then: [data]<--[hard-link (one or more)]<--[soft-link(zero or more)] The data is otherwise just there on disk but is unavailable until it is linked to at least one dir-entry. The middle hard-link is reference counted and once all uses are removed data can be garbage collected. Soft links don't follow on-disk data but follow a dir-entry. So if we have completely different on-disk data we're still in agreement with the dir-entry. In the graph above, and as explained below, there is no reference counting going on: > +- The link count of the source file is unchanged, and the link count of > + the new file is one. And the "link" meaning is very vaguely kept, only half way until the next write. 
(If it can be called a link at all, being a different inode and cached twice.) As my first impression when I read the title of the patch, an English reflink I would imagine is something more to the left of above graph, between hard-link and soft-link, something like: link to an invisible dir-entry that is gone once all soft-links to it are gone. So, from my point of view: call it something different like Copy-On-Write or COW. I do understand that there is something very fundamental in my misunderstanding, but it was not explained below; in fact, the terminology below confused me even more. Please explain? > Signed-off-by: Joel Becker <joel.becker@oracle.com> > --- > Documentation/filesystems/reflink.txt | 129 +++++++++++++++++++++++++++++++++ > Documentation/filesystems/vfs.txt | 4 + > 2 files changed, 133 insertions(+), 0 deletions(-) > create mode 100644 Documentation/filesystems/reflink.txt > > diff --git a/Documentation/filesystems/reflink.txt b/Documentation/filesystems/reflink.txt > new file mode 100644 > index 0000000..f3620f0 > --- /dev/null > +++ b/Documentation/filesystems/reflink.txt > @@ -0,0 +1,129 @@ > +reflink(2) > +========== > + > +NAME > +---- > +reflink - make a reference-counted link of a file > + > + > +SYNOPSIS > +-------- > +#include <unistd.h> > + > +int reflink(const char *oldpath, const char *newpath); > + > +DESCRIPTION > +----------- > +reflink() creates a new reflink (also known as a reference-counted link) > +to an existing file. This reflink is a new file object that shares the > +attributes and data extents of the source object in a copy-on-write fashion. > + This is exactly my confusion: how is the logical jump made from reflink (reference/link) to copy-on-write? I fail to see any logical connection. > +An easy way to think of it is that the semantics of the reflink() call > +are identical to the link(2) system call, but the resulting file object > +behaves as if it were a copy with identical attributes. 
> + > +Like the link(2) system call, if newpath exists, it will not be overwritten. > +oldpath must be a regular file. oldpath and newpath must be on the same > +mounted filesystem. > + > +All data extents of the new file must be shared with the source file in > +a copy-on-write fashion. This includes data extents for extended > +attributes. If either the source or new files are written to, the > +changes do not show up in the other file. > + > +All file attributes and extended attributes of the new file must > +identical to the source file with the following exceptions: > + > +- The new file must have a new inode number. This allows POSIX > + programs to treat the source and new files as separate objects. From > + the view of the POSIX application, the files are distinct. The > + sharing is invisible outside the filesystem. > +- The ctime of the source file only changes if the source's metadata > + must be changed to accommodate the copy-on-write linkage. The ctime of > + the new file is set to represent its creation. > +- The mtime of the source file is unmodified, and the mtime of the new file > + is set identical to the source file. This reflects that the data is > + unchanged. > +- The link count of the source file is unchanged, and the link count of > + the new file is one. > + > +RETURN VALUE > +------------ > +On success, zero is returned. On error, -1 is returned, and errno is > +set appropriately. > + > +ERRORS > +------ > +EACCES:: > + Write access to the directory containing newpath is denied, or > + search permission is denied for one of the directories in the > + path prefix of oldpath or newpath. (See also path_resolution(7).) > + > +EEXIST:: > + newpath already exists. > + > +EFAULT:: > + oldpath or newpath points outside your accessible address space. > + > +EIO:: > + An I/O error occurred. > + > +ELOOP:: > + Too many symbolic links were encountered in resolving oldpath or > + newpath. > + > +ENAMETOOLONG:: > + oldpath or newpath was too long. 
> + > +ENOENT:: > + A directory component in oldpath or newpath does not exist or is > + a dangling symbolic link. > + > +ENOMEM:: > + Insufficient kernel memory was available. > + > +ENOSPC:: > + The device containing the file has no room for the new directory > + entry or file object. > + > +ENOTDIR:: > + A component used as a directory in oldpath or newpath is not, in > + fact, a directory. > + > +EPERM:: > + oldpath is a directory. > + > +EPERM:: > + The file system containing oldpath and newpath does not support > + the creation of reference-counted links. > + > +EROFS:: > + The file is on a read-only file system. > + > +EXDEV:: > + oldpath and newpath are not on the same mounted file system. > + (Linux permits a file system to be mounted at multiple points, > + but reflink() does not work across different mount points, even if > + the same file system is mounted on both.) > + > +VERSIONS > +-------- > +reflink() is available on Linux since kernel 2.6.31. > + > +CONFORMING TO > +------------- > +reflink() is Linux-specific. > + > +NOTES > +----- > +reflink() deferences symbolic links in the same manner that link(2) > +does. For precise control over the treatment of symbolic links, see > +reflinkat(). > + > +In the case of a crash, the new file must not appear partially complete > +in the filesystem. 
> + > +SEE ALSO > +-------- > +ln(1), reflink(1), reflinkat(2), path_resolution(7) > + > diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt > index f49eecf..01cd810 100644 > --- a/Documentation/filesystems/vfs.txt > +++ b/Documentation/filesystems/vfs.txt > @@ -333,6 +333,7 @@ struct inode_operations { > ssize_t (*listxattr) (struct dentry *, char *, size_t); > int (*removexattr) (struct dentry *, const char *); > void (*truncate_range)(struct inode *, loff_t, loff_t); > + int (*reflink) (struct dentry *,struct inode *,struct dentry *); > }; > > Again, all methods are called without any locks being held, unless > @@ -431,6 +432,9 @@ otherwise noted. > > truncate_range: a method provided by the underlying filesystem to truncate a > range of blocks , i.e. punch a hole somewhere in a file. > + reflink: called by the reflink(2) system call. Only required if you want > + to support reflinks. For further information, see > + Documentation/filesystems/reflink.txt. > > > The Address Space Object Please forgive my ignorance, again I would honestly like to understand, and how else, then to just ask? Thanks in advance Boaz ^ permalink raw reply [flat|nested] 151+ messages in thread
* Re: [PATCH 1/3] fs: Document the reflink(2) system call. 2009-05-03 13:08 ` Boaz Harrosh @ 2009-05-03 23:08 ` Al Viro 2009-05-04 2:49 ` Joel Becker 1 sibling, 0 replies; 151+ messages in thread From: Al Viro @ 2009-05-03 23:08 UTC (permalink / raw) To: Boaz Harrosh; +Cc: Joel Becker, linux-fsdevel, jmorris, ocfs2-devel On Sun, May 03, 2009 at 04:08:59PM +0300, Boaz Harrosh wrote: > As my first impression when I read the title of the patch, an English reflink > I would imagine is something more to the left of above graph, between hard-link > and soft-link, something like: link to an invisible dir-entry that is gone once > all soft-links to it are gone. > > So form my point of view. Call it something different like Copy-On-Write or > COW. > > I do understand that there is something very fundamental in my misunderstanding, > but it was not explained below, in fact the below terminology confused me even > more. Please explain? It's simply a lazy copy, with interface for creating it similar to link(2). That's all. ^ permalink raw reply [flat|nested] 151+ messages in thread
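To make the "lazy copy" concrete, and to show where the reference count actually lives, here is a toy in-memory sketch. It is not kernel or ocfs2 code. Every name in it (struct extent, struct toy_file, the toy_*() functions, EXTENT_SIZE) is invented for illustration, and a "file" is reduced to a single fixed-size extent. The count sits on the shared data extent, not on the inode, which is why both files can report a link count of one.

```c
#include <stdlib.h>
#include <string.h>

#define EXTENT_SIZE 8   /* toy size; a real extent is a run of disk blocks */

/* A data extent that knows how many files currently point at it. */
struct extent {
    int refcount;
    char data[EXTENT_SIZE];
};

/* A toy "file": a separate object that points at one extent. */
struct toy_file {
    struct extent *ext;
};

/* Create a file with its own private extent. Returns 0 or -1. */
int toy_create(struct toy_file *f, const char *contents)
{
    f->ext = calloc(1, sizeof(*f->ext));
    if (!f->ext)
        return -1;
    f->ext->refcount = 1;
    strncpy(f->ext->data, contents, EXTENT_SIZE - 1);
    return 0;
}

/* The lazy copy: a new file object, but no data is copied yet. */
void toy_reflink(const struct toy_file *src, struct toy_file *dst)
{
    dst->ext = src->ext;
    dst->ext->refcount++;
}

/* Copy-on-write: break the sharing before modifying a shared extent. */
int toy_write(struct toy_file *f, const char *contents)
{
    if (f->ext->refcount > 1) {
        struct extent *copy = malloc(sizeof(*copy));
        if (!copy)
            return -1;
        memcpy(copy, f->ext, sizeof(*copy));
        copy->refcount = 1;
        f->ext->refcount--;
        f->ext = copy;
    }
    memset(f->ext->data, 0, EXTENT_SIZE);
    strncpy(f->ext->data, contents, EXTENT_SIZE - 1);
    return 0;
}

/* Drop a file; the extent is freed once nothing references it. */
void toy_unlink(struct toy_file *f)
{
    if (--f->ext->refcount == 0)
        free(f->ext);
    f->ext = NULL;
}
```

After toy_reflink() both files point at one extent with a count of two. The first toy_write() to either file gives that file a private copy and leaves the other untouched.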
* Re: [PATCH 1/3] fs: Document the reflink(2) system call. 2009-05-03 13:08 ` Boaz Harrosh 2009-05-03 23:08 ` Al Viro @ 2009-05-04 2:49 ` Joel Becker 1 sibling, 0 replies; 151+ messages in thread From: Joel Becker @ 2009-05-04 2:49 UTC (permalink / raw) To: Boaz Harrosh; +Cc: linux-fsdevel, jmorris, ocfs2-devel, viro On Sun, May 03, 2009 at 04:08:59PM +0300, Boaz Harrosh wrote: > On 05/03/2009 09:15 AM, Joel Becker wrote: > > int reflink(const char *oldpath, const char *newpath); > > > > The reflink(2) system call creates reference-counted links. It creates > > a new file that shares the data extents of the source file in a > > copy-on-write fashion. Its calling semantics are identical to link(2). > > Once complete, programs see the new file as a completely separate entry. > > > > Please forgive my complete Unix jargon novice-ness, but from here it looks like the > name is very wrong, and confusing. > > if I put data to link graph then: > > [data]<--[hard-link (one or more)]<--[soft-link(zero or more)] > > The data is other-wise just there on disk but is un available until > it is linked to a dir-entry, at-least one. The middle hard-link is reference > counted and once all uses are removed data can be garbage collected. Soft links > don't follow on-disk data but follow a dir-entry. So if we have a completely > different on disk data we're still in agreement with the dir-entry. A reflink creates a dir entry. That's what newpath is about. Using your graph: [data]<--[reflink (zero or more)]<--[hard-link (one or more)]<--[soft-link(zero or more)] > As my first impression when I read the title of the patch, an English reflink > I would imagine is something more to the left of above graph, between hard-link > and soft-link, something like: link to an invisible dir-entry that is gone once > all soft-links to it are gone. There is no "invisible dir entry". The target is a new file with a new dir entry. It just shares the data extents of the source. 
Perhaps I can clarify that better. Joel -- "Maybe the time has drawn the faces I recall. But things in this life change very slowly, If they ever change at all." Joel Becker Principal Software Developer Oracle E-mail: joel.becker@oracle.com Phone: (650) 506-8127 ^ permalink raw reply [flat|nested] 151+ messages in thread
* Re: [PATCH 1/3] fs: Document the reflink(2) system call. 2009-05-03 6:15 ` [PATCH 1/3] fs: Document the " Joel Becker 2009-05-03 8:01 ` Christoph Hellwig 2009-05-03 13:08 ` Boaz Harrosh @ 2009-05-03 23:45 ` Theodore Tso 2009-05-04 1:44 ` Tao Ma 2009-05-05 1:07 ` Jamie Lokier 3 siblings, 1 reply; 151+ messages in thread From: Theodore Tso @ 2009-05-03 23:45 UTC (permalink / raw) To: Joel Becker; +Cc: linux-fsdevel, jmorris, ocfs2-devel, viro On Sat, May 02, 2009 at 11:15:01PM -0700, Joel Becker wrote: > int reflink(const char *oldpath, const char *newpath); > > The reflink(2) system call creates reference-counted links. It creates > a new file that shares the data extents of the source file in a > copy-on-write fashion. Its calling semantics are identical to link(2). > Once complete, programs see the new file as a completely separate entry. How should quota handle reflinks? Since there are separate inodes, the two files could be owned by different user ID's. Since the data blocks exist only once, I can imagine a number of different ways of handling it: 1) When the reflink is created, the owner of the new reflink is not charged the number of blocks of the file against his/her quota. If the original inode is deleted, the original owner continues to have the cost of the file charged against his/her quota until the last reflink disappears. 2) When the reflink is created, the owner of the new reflink is NOT charged the number of blocks of the file against his/her quota. If the original inode is deleted, the owner of the reflink is charged the number of blocks against his/her quota. If that drives the owner over quota, the quota subsystem will enforce the soft and hard quota limits as per normal. If there are more than one reflink against the file, the system will randomly choose one user and charge the blocks against his/her quota. 3) When the reflink is created, the owner of the new reflink is charged the number of blocks of the file against his/her quota. 
The original owner of the inode continues to also have the blocks of the file charged against his/her quota, so in effect the blocks are "double counted". 4) When the reflink is created, the owner of the new reflink is NOT charged the number of blocks of the file against his/her quota. The original owner of the inode continues to also have the blocks of the file charged against his/her quota; if the file is deleted the blocks associated with the file will not be charged against any user's quota. All of these have various problems; and maybe the answer is that reflinks aren't really compatible with quotas, so pick something least bad (say #3), and we can just move on. - Ted ^ permalink raw reply [flat|nested] 151+ messages in thread
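To compare the four options side by side, here is a small sketch that tabulates who is charged under each one. It is only a reading of the list above, not an implementation of any quota code. The enum, the struct, and both function names are invented, exactly one reflink owned by a different user is assumed, and for option 3 it assumes the original owner's charge is released when the original inode is deleted, which the list does not spell out.

```c
/* The four options from the list above, numbered as in the mail. */
enum quota_policy {
    ORIGINAL_PAYS_UNTIL_LAST = 1, /* 1) original owner pays until last reflink goes */
    TRANSFER_ON_DELETE       = 2, /* 2) charge moves to a reflink owner on delete */
    DOUBLE_COUNT             = 3, /* 3) both owners are charged in full */
    UNCHARGED_AFTER_DELETE   = 4  /* 4) nobody is charged once the original is gone */
};

/* Blocks charged to each of the two users in this toy scenario. */
struct charges {
    long orig_owner;
    long reflink_owner;
};

/* Charges right after the reflink is created. */
struct charges charges_after_reflink(enum quota_policy p, long blocks)
{
    struct charges c = { blocks, 0 };

    if (p == DOUBLE_COUNT)
        c.reflink_owner = blocks;
    return c;
}

/* Charges after the original inode is deleted and the reflink remains. */
struct charges charges_after_original_deleted(enum quota_policy p, long blocks)
{
    struct charges c = { 0, 0 };

    switch (p) {
    case ORIGINAL_PAYS_UNTIL_LAST:
        c.orig_owner = blocks;      /* still pays for blocks he no longer names */
        break;
    case TRANSFER_ON_DELETE:
    case DOUBLE_COUNT:              /* assumption: original's charge is released */
        c.reflink_owner = blocks;
        break;
    case UNCHARGED_AFTER_DELETE:
        break;                      /* blocks are in use but charged to no one */
    }
    return c;
}
```

Only option 3 leaves the total charged at or above the blocks actually in use at every step, which fits it being picked as the least bad choice.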
* Re: [PATCH 1/3] fs: Document the reflink(2) system call. 2009-05-03 23:45 ` Theodore Tso @ 2009-05-04 1:44 ` Tao Ma 2009-05-04 18:25 ` Joel Becker 0 siblings, 1 reply; 151+ messages in thread From: Tao Ma @ 2009-05-04 1:44 UTC (permalink / raw) To: Theodore Tso; +Cc: Joel Becker, linux-fsdevel, jmorris, ocfs2-devel, viro Hi Ted, Theodore Tso wrote: > On Sat, May 02, 2009 at 11:15:01PM -0700, Joel Becker wrote: >> int reflink(const char *oldpath, const char *newpath); >> >> The reflink(2) system call creates reference-counted links. It creates >> a new file that shares the data extents of the source file in a >> copy-on-write fashion. Its calling semantics are identical to link(2). >> Once complete, programs see the new file as a completely separate entry. > > How should quota handle reflinks? Since there are separate inodes, > the two files could be owned by different user ID's. Since the data > blocks exist only once, I can imagine a number of different ways of > handling it: > > 1) When the reflink is created, the owner of the new reflink is not > charged the number of blocks of the file against his/her quota. If > the original inode is deleted, the original owner continues to have > the cost of the file charged against his/her quota until the last > reflink disappears. > > 2) When the reflink is created, the owner of the new reflink is NOT > charged the number of blocks of the file against his/her quota. If > the original inode is deleted, the owner of the reflink is charged the > number of blocks against his/her quota. If that drives the owner over > quota, the quota subsystem will enforce the soft and hard quota limits > as per normal. If there are more than one reflink against the file, > the system will randomly choose one user and charge the blocks against > his/her quota. > > 3) When the reflink is created, the owner of the new reflink is > charged the number of blocks of the file against his/her quota. 
The > original owner of the inode continus to also have the blocks of the > file charged against his/her quota, so in effect the blocks are > "double counted". > > 4) When the reflink is created, the owner of the new reflink is NOT > charged the number of blocks of the file against his/her quota. The > original owner of the inode continues to also have the blocks of the > file charged against his/her quota; if the file is deleted the blocks > associated with the file will not be charged against any users' quota. > > All of these have various problems; and maybe the answer is that > reflinks aren't really compatible with quotas, so pick something least > bad (say #3), and we can just move on. yeah, agree. So I will pick #3 in my ocfs2 reflink implementation. Thanks. Regards, Tao ^ permalink raw reply [flat|nested] 151+ messages in thread
* Re: [PATCH 1/3] fs: Document the reflink(2) system call. 2009-05-04 1:44 ` Tao Ma @ 2009-05-04 18:25 ` Joel Becker 2009-05-04 21:18 ` [Ocfs2-devel] " Joel Becker 0 siblings, 1 reply; 151+ messages in thread From: Joel Becker @ 2009-05-04 18:25 UTC (permalink / raw) To: Tao Ma; +Cc: linux-fsdevel, Theodore Tso, jmorris, ocfs2-devel, viro On Mon, May 04, 2009 at 09:44:32AM +0800, Tao Ma wrote: > Theodore Tso wrote: >> On Sat, May 02, 2009 at 11:15:01PM -0700, Joel Becker wrote: >> How should quota handle reflinks? Since there are separate inodes, >> the two files could be owned by different user ID's. Since the data >> blocks exist only once, I can imagine a number of different ways of >> handling it: <snip> > yeah, agree. So I will pick #3 in my ocfs2 reflink implementation. While at first I was all "sure, this makes sense," now I'm thinking otherwise. Because reflink() means the file attributes are unmodified. So the original owner owns the new file, and thus the quota charge doesn't matter. If and when the new file is changed to another owner, then the normal quota code will adjust the quotas. Joel -- "If you are ever in doubt as to whether or not to kiss a pretty girl, give her the benefit of the doubt" -Thomas Carlyle Joel Becker Principal Software Developer Oracle E-mail: joel.becker@oracle.com Phone: (650) 506-8127 ^ permalink raw reply [flat|nested] 151+ messages in thread
* Re: [Ocfs2-devel] [PATCH 1/3] fs: Document the reflink(2) system call. 2009-05-04 18:25 ` Joel Becker @ 2009-05-04 21:18 ` Joel Becker 2009-05-04 22:23 ` Theodore Tso 0 siblings, 1 reply; 151+ messages in thread From: Joel Becker @ 2009-05-04 21:18 UTC (permalink / raw) To: Tao Ma, Theodore Tso, linux-fsdevel, jmorris, ocfs2-devel, viro On Mon, May 04, 2009 at 11:25:52AM -0700, Joel Becker wrote: > On Mon, May 04, 2009 at 09:44:32AM +0800, Tao Ma wrote: > > Theodore Tso wrote: > >> On Sat, May 02, 2009 at 11:15:01PM -0700, Joel Becker wrote: > >> How should quota handle reflinks? Since there are separate inodes, > >> the two files could be owned by different user ID's. Since the data > >> blocks exist only once, I can imagine a number of different ways of > >> handling it: > <snip> > > yeah, agree. So I will pick #3 in my ocfs2 reflink implementation. > > While at first I was all "sure, this makes sense," now I'm > thinking otherwise. Because reflink() means the file attributes are > unmodified. So the original owner owns the new file, and thus the quota > charge doesn't matter. If and when the new file is changed to another > owner, then the normal quota code will adjust the quotas. More thinking. It looks like we'll restrict reflink() to owners or people with CAP_FCHOWN. This prevents some quota DoS behavior. We need to pre-charge all quota. That means a reflink must be charged the entire size of the file. So, if I do: # dd if=/dev/zero bs=1M count=1 of=foo # reflink foo bar I am now charged 2MB of quota, even though foo and bar share the same 1MB of space. Why? Because if I only mark 1M of quota and then do "chown tao.tao bar", we can't sanely keep track of fractional quota. Whereas if we charge the 2MB up front, the chown just moves the quota over to tao. Copy-on-write is even cleaner - since you were pre-charged for the quota, you don't do any quota adjustments for the data blocks in the CoW operation (though any new metadata is a new charge). 
Joel -- "The whole principle is wrong; it's like demanding that grown men live on skim milk because the baby can't eat steak." - author Robert A. Heinlein on censorship Joel Becker Principal Software Developer Oracle E-mail: joel.becker@oracle.com Phone: (650) 506-8127 ^ permalink raw reply [flat|nested] 151+ messages in thread
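The pre-charge scheme Joel describes in the message above can be illustrated with a small toy model. This is only a sketch under stated assumptions: the class PrechargeQuota and its methods are hypothetical names made up for illustration, not kernel or ocfs2 code, and the model tracks nothing but per-owner byte counts.

```python
from collections import defaultdict

MB = 1 << 20


class PrechargeQuota:
    """Toy model (hypothetical): every inode is charged its full size to its owner."""

    def __init__(self):
        self.usage = defaultdict(int)  # owner -> bytes charged
        self.inodes = {}               # file name -> (owner, size in bytes)

    def create(self, name, owner, size):
        self.inodes[name] = (owner, size)
        self.usage[owner] += size

    def reflink(self, src, dst):
        # Ownership is preserved and the full size is charged again,
        # even though the data blocks are shared on disk.
        owner, size = self.inodes[src]
        self.create(dst, owner, size)

    def chown(self, name, new_owner):
        # Because the reflink was pre-charged, chown just moves the whole charge.
        owner, size = self.inodes[name]
        self.usage[owner] -= size
        self.usage[new_owner] += size
        self.inodes[name] = (new_owner, size)


q = PrechargeQuota()
q.create("foo", "joel", 1 * MB)   # dd if=/dev/zero bs=1M count=1 of=foo
q.reflink("foo", "bar")           # reflink foo bar
charged_after_reflink = q.usage["joel"]
q.chown("bar", "tao")             # chown tao.tao bar
```

In this model the reflink doubles the owner's charge up front, so the later chown never has to split a charge between two users, which is the "fractional quota" problem the message wants to avoid.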
* Re: [Ocfs2-devel] [PATCH 1/3] fs: Document the reflink(2) system call. 2009-05-04 21:18 ` [Ocfs2-devel] " Joel Becker @ 2009-05-04 22:23 ` Theodore Tso 2009-05-05 6:55 ` Joel Becker 0 siblings, 1 reply; 151+ messages in thread From: Theodore Tso @ 2009-05-04 22:23 UTC (permalink / raw) To: Tao Ma, linux-fsdevel, jmorris, ocfs2-devel, viro On Mon, May 04, 2009 at 02:18:54PM -0700, Joel Becker wrote: > More thinking. It looks like we'll restrict reflink() to owners > or people with CAP_FCHOWN. This prevents some quota DoS behavior. > We need to pre-charge all quota. That means a reflink must be > charged the entire size of the file. So, if I do: > > # dd if=/dev/zero bs=1M count=1 of=foo > # reflink foo bar > > I am now charged 2MB of quota, even though foo and bar share the same > 1MB of space. Yep; but as long as you do this, why do you need CAP_FCHOWN? Suppose Alice has a 1MB file, and Bob creates a reflink to it. The reflink would be owned by Bob, and Bob would be charged the 1MB quota. This mirrors exactly what happens if Bob were to make a copy of the file, and we want to make the creation of reflink mirror a copy, right? In that case, as long as Bob has read access to the file, he should be allowed to create a reflink. That way when you do the copy-on-write, Bob will continue to be charged the 1MB quota, which is what you want. So pre-charging the quota makes the most amount of sense. - Ted ^ permalink raw reply [flat|nested] 151+ messages in thread
* Re: [Ocfs2-devel] [PATCH 1/3] fs: Document the reflink(2) system call. 2009-05-04 22:23 ` Theodore Tso @ 2009-05-05 6:55 ` Joel Becker 0 siblings, 0 replies; 151+ messages in thread From: Joel Becker @ 2009-05-05 6:55 UTC (permalink / raw) To: Theodore Tso; +Cc: Tao Ma, linux-fsdevel, jmorris, ocfs2-devel, viro On Mon, May 04, 2009 at 06:23:27PM -0400, Theodore Tso wrote: > On Mon, May 04, 2009 at 02:18:54PM -0700, Joel Becker wrote: > > More thinking. It looks like we'll restrict reflink() to owners > > or people with CAP_FCHOWN. This prevents some quota DoS behavior. > > We need to pre-charge all quota. That means a reflink must be > > charged the entire size of the file. So, if I do: > > > > # dd if=/dev/zero bs=1M count=1 of=foo > > # reflink foo bar > > > > I am now charged 2MB of quota, even though foo and bar share the same > > 1MB of space. > > Yep; but as long as you do this, why do you need CAP_FCHOWN? Because the ownership doesn't change, and thus the person doing the reflink is effectively setting ownership. > Suppose Alice has a 1MB file, and Bob creates a reflink to it. The > reflink would be owned by Bob, and Bob would be charged the 1MB quota. > This mirrors exactly what happens if Bob were to make a copy of the > file, and we want to make the creation of reflink mirror a copy, right? It's more a link(2). The ownership, permissions, and attributes are identical to the original. -- "Always give your best, never get discouraged, never be petty; always remember, others may hate you. Those who hate you don't win unless you hate them. And then you destroy yourself." - Richard M. Nixon Joel Becker Principal Software Developer Oracle E-mail: joel.becker@oracle.com Phone: (650) 506-8127 ^ permalink raw reply [flat|nested] 151+ messages in thread
* Re: [PATCH 1/3] fs: Document the reflink(2) system call. 2009-05-03 6:15 ` [PATCH 1/3] fs: Document the " Joel Becker ` (2 preceding siblings ...) 2009-05-03 23:45 ` Theodore Tso @ 2009-05-05 1:07 ` Jamie Lokier 2009-05-05 7:16 ` Joel Becker 3 siblings, 1 reply; 151+ messages in thread From: Jamie Lokier @ 2009-05-05 1:07 UTC (permalink / raw) To: Joel Becker; +Cc: linux-fsdevel, jmorris, ocfs2-devel, viro Joel Becker wrote: > +All file attributes and extended attributes of the new file must > +identical to the source file with the following exceptions: reflink() sounds useful already, but is there a compelling reason why both files must have the same attributes, and changing attributes will break the COW? Being able to have different attributes would allow: - reflink() to be used for fast space-efficient copying, i.e. an optimisation to "cp", "git checkout" and things like that. - reflink() to be used for merging files with identical contents (something I find surprisingly often on my disks). - reflink() to be used for merging files from different cgroup-style VMs in particular. Requiring all attributes except nlink and ino to be identical makes reflink() unsuitable for transparently doing those things, except in cases where they happen to have the same attributes anyway. I'm thinking particularly of file permissions, owner/group and atime. Since each reflink has its own nlink and ino, I'm wondering why the other attributes cannot also be separate. (I realise extended attributes complicate the picture and it's desirable to share them, especially if they are large). > +- The new file must have a new inode number. This allows POSIX > + programs to treat the source and new files as separate objects. From > + the view of the POSIX application, the files are distinct. The > + sharing is invisible outside the filesystem. Invisible sharing is good and different inode number is obviously required. 
But is there an efficient way for reflink-aware applications to detect these files have the same contents, other than reading the contents twice and comparing? Occasionally that would be good. E.g. It would be nice if "diff -r" could be patched to do that. > +- The ctime of the source file only changes if the source's metadata > + must be changed to accommodate the copy-on-write linkage. The ctime of > + the new file is set to represent its creation. What change to the source metadata would require ctime to change? > +- The link count of the source file is unchanged, and the link count of > + the new file is one. Can you hard link to the source file and the reflink afterwards, incrementing the reflink's link count? (I presume yes). Can you reflink to both of them too? > +EPERM:: > + oldpath is a directory. I've always been surprised this isn't EISDIR :-) > +EXDEV:: > + oldpath and newpath are not on the same mounted file system. > + (Linux permits a file system to be mounted at multiple points, > + but reflink() does not work across different mount points, even if > + the same file system is mounted on both.) That's an interesting restriction, though I see link() does the same. > +reflink() dereferences symbolic links in the same manner that link(2) > +does. Would that be "reflink() does not dereference symbolic links as the final path component, in the same manner that link() does not" :-) > For precise control over the treatment of symbolic links, see > reflinkat(). As others have said, there's no need for a reflink() kernel system call, as reflinkat() can be used for the same thing, and wrapped in libc if reflink() is desirable as a userspace C function. Also, reflinkat() has room for reflink-specific flags to be added later if needed, which may come in handy. -- Jamie ^ permalink raw reply [flat|nested] 151+ messages in thread
* Re: [PATCH 1/3] fs: Document the reflink(2) system call. 2009-05-05 1:07 ` Jamie Lokier @ 2009-05-05 7:16 ` Joel Becker 2009-05-05 8:09 ` Andreas Dilger ` (2 more replies) 0 siblings, 3 replies; 151+ messages in thread From: Joel Becker @ 2009-05-05 7:16 UTC (permalink / raw) To: Jamie Lokier; +Cc: linux-fsdevel, jmorris, ocfs2-devel, viro On Tue, May 05, 2009 at 02:07:03AM +0100, Jamie Lokier wrote: > Joel Becker wrote: > > +All file attributes and extended attributes of the new file must > > +identical to the source file with the following exceptions: > > reflink() sounds useful already, but is there a compelling reason why > both files must have the same attributes, and changing attributes will > break the COW? Yeah, because without it you can't use it for snapshotting. That's where the original design came from - inode snapshots. The big thing that excited me was that defining reflink() as I did, instead of a more specific snapshot call, allows all sorts of generic uses (some of which you outline below). If reflink() creates a snapshot, you can then break it to make things a little different. But if it changes things, you can never change them back. > Being able to have different attributes would allow: > > - reflink() to be used for fast space-efficient copying, i.e. an > optimisation to "cp", "git checkout" and things like that. It can right now, just not of other people's files. Actually, the only real difficulty with doing it to other people's files is quota. But I can't come up with a way to prevent quota DoS. Here's another fun trick. Overwriting rsync, instead of copying blocks from the already-existing source could reflink the source to the .temporary, then only write the changed blocks. And since you own both files, it just works. If you're overwriting someone else's file? The old copy behavior is fine. > - reflink() to be used for merging files with identical contents > (something I find surprisingly often on my disks). 
> > - reflink() to be used for merging files from different > cgroup-style VMs in particular. While it would be great to have a way to do this, reflink() is not the way. It's really simple to understand with its link-like semantic, and I see no point in making it a seven-different-operation kitchen sink call. > Requiring all attributes except nlink and ino to be identical makes > reflink() unsuitable for transparently doing those things, except in > cases where they happen to have the same attributes anyway. We've had a lot of fun thinking up many uses for reflink(), and almost all of them are within the context of one's own files. > I'm thinking particularly of file permissions, owner/group and atime. People do cp -p all the time. I don't see how keeping those things the same will break anything. It's a new call, not an existing semantic. > Since each reflink has its own nlink and ino, I'm wondering why the > other attributes cannot also be separate. (I realise extended > attributes complicate the picture and it's desirable to share them, > especially if they are large). The biggest reason is snapshotting. The second biggest reason is a simple to understand call. "Everything is identical except those things that *have* to be different". > But is there an efficient way for reflink-aware applications to detect > these files have the same contents, other than reading the contents > twice and comparing? Occasionally that would be good. E.g. It would > be nice if "diff -r" could be patched to do that. I would think FIEMAP would tell you what you want to know, wouldn't it? > > +- The ctime of the source file only changes if the source's metadata > > + must be changed to accommodate the copy-on-write linkage. The ctime of > > + the new file is set to represent its creation. > > What change to the source metadata would require ctime to change? 
ocfs2 flags all extents in the source file with a "this is now shared, go check the reference count before writing" flag if they don't have it already. I'd call that a metadata update. > > +- The link count of the source file is unchanged, and the link count of > > + the new file is one. > > Can you hard link to the source file and the reflink afterwards, > incrementing the reflink's link count? (I presume yes). Can you > reflink to both of them too? Yes, absolutely. Once reflinked, they look like two separate POSIX files. Joel -- "Depend on the rabbit's foot if you will, but remember, it didn't help the rabbit." - R. E. Shay Joel Becker Principal Software Developer Oracle E-mail: joel.becker@oracle.com Phone: (650) 506-8127 ^ permalink raw reply [flat|nested] 151+ messages in thread
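Joel's FIEMAP suggestion above amounts to comparing the physical locations of two files' extents. The sketch below shows only that comparison step, on plain tuples. It is an illustration under assumptions: shared_bytes is a hypothetical helper, the tuples mirror the fe_logical, fe_physical and fe_length fields of struct fiemap_extent, and a real tool would first have to fetch the extent lists with the FS_IOC_FIEMAP ioctl. It also assumes the extents within one file do not overlap each other physically.

```python
def shared_bytes(extents_a, extents_b):
    """Count bytes of physical storage referenced by both extent lists.

    Each extent is a (logical_offset, physical_offset, length) tuple, all in bytes.
    """
    total = 0
    for _, phys_a, len_a in extents_a:
        for _, phys_b, len_b in extents_b:
            # Intersection of the two physical byte ranges, if any.
            start = max(phys_a, phys_b)
            end = min(phys_a + len_a, phys_b + len_b)
            if end > start:
                total += end - start
    return total


BLOCK = 4096

# Source file: two extents, three blocks in total.
source = [(0, 100 * BLOCK, 2 * BLOCK), (2 * BLOCK, 300 * BLOCK, BLOCK)]

# Reflink of the source after its second block was rewritten (copy-on-write
# moved that block to a new physical location).
clone = [
    (0, 100 * BLOCK, BLOCK),
    (BLOCK, 500 * BLOCK, BLOCK),
    (2 * BLOCK, 300 * BLOCK, BLOCK),
]

still_shared = shared_bytes(source, clone)
```

A "diff -r" style tool could skip any logical range whose extents map to the same physical bytes in both files and only read the ranges that differ.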
* Re: [PATCH 1/3] fs: Document the reflink(2) system call. 2009-05-05 7:16 ` Joel Becker @ 2009-05-05 8:09 ` Andreas Dilger 2009-05-05 16:56 ` Joel Becker 2009-05-05 13:01 ` Theodore Tso 2009-05-05 13:01 ` Jamie Lokier 2 siblings, 1 reply; 151+ messages in thread From: Andreas Dilger @ 2009-05-05 8:09 UTC (permalink / raw) To: Jamie Lokier, linux-fsdevel, jmorris, ocfs2-devel, viro On May 05, 2009 00:16 -0700, Joel Becker wrote: > On Tue, May 05, 2009 at 02:07:03AM +0100, Jamie Lokier wrote: > > Being able to have different attributes would allow: > > > > - reflink() to be used for fast space-efficient copying, i.e. an > > optimisation to "cp", "git checkout" and things like that. > > It can right now, just not of other people's files. Actually, > the only real difficulty with doing it to other people's files is quota. > But I can't come up with a way to prevent quota DoS. If the reflink caller is always charged for the full space used (as if it were a real copy) by virtue of the user doing the reflink() owning the new inode. Doing anything else seems broken. If the owner of the file wasn't charged for the reflink's quota then if the reflink inode was chowned the new owner would be charged for the new file, but the quota code would have to special case the decrement of EACH of the reflink's blocks because otherwise the original owner might "release" quota that it was never originally charged. > Here's another fun trick. Overwriting rsync, instead of copying > blocks from the already-existing source could reflink the source to the > .temporary, then only write the changed blocks. And since you own both > files, it just works. If you're overwriting someone else's file? The > old copy behavior is fine. Well, "fine" as in it works, but if there are only a few changed blocks, and the old copy is now part of a snapshot (so it won't be released when rsync is finished) the space consumption has doubled instead of just using a few extra blocks. 
> > Requiring all attributes except nlink and ino to be identical makes > > reflink() unsuitable for transparently doing those things, except in > > cases where they happen to have the same attributes anyway. > > We've had a lot of fun thinking up many uses for reflink(), and > almost all of them are within the context of one's own files. Is there anything about changing the owner/group of the new inode during reflink that makes the implementation more complex? If the process doing the reflink is the same as the file owner then the semantics are unchanged from what you have proposed. > > I'm thinking particularly of file permissions, owner/group and atime. > > People do cp -p all the time. I don't see how keeping those > things the same will break anything. It's a new call, not an existing > semantic. Though "cp -p" doesn't keep the owner/group of the original file if you are not root. Cheers, Andreas -- Andreas Dilger Sr. Staff Engineer, Lustre Group Sun Microsystems of Canada, Inc. ^ permalink raw reply [flat|nested] 151+ messages in thread
* Re: [PATCH 1/3] fs: Document the reflink(2) system call. 2009-05-05 8:09 ` Andreas Dilger @ 2009-05-05 16:56 ` Joel Becker 2009-05-05 21:24 ` Andreas Dilger 0 siblings, 1 reply; 151+ messages in thread From: Joel Becker @ 2009-05-05 16:56 UTC (permalink / raw) To: Andreas Dilger; +Cc: linux-fsdevel, Jamie Lokier, jmorris, ocfs2-devel, viro On Tue, May 05, 2009 at 02:09:36AM -0600, Andreas Dilger wrote: > On May 05, 2009 00:16 -0700, Joel Becker wrote: > > On Tue, May 05, 2009 at 02:07:03AM +0100, Jamie Lokier wrote: > > > Being able to have different attributes would allow: > > > > > > - reflink() to be used for fast space-efficient copying, i.e. an > > > optimisation to "cp", "git checkout" and things like that. > > > > It can right now, just not of other people's files. Actually, > > the only real difficulty with doing it to other people's files is quota. > > But I can't come up with a way to prevent quota DoS. > > If the reflink caller is always charged for the full space used (as if > it were a real copy) by virtue of the user doing the reflink() owning the > new inode. Doing anything else seems broken. If the owner of the file > wasn't charged for the reflink's quota then if the reflink inode was > chowned the new owner would be charged for the new file, but the quota > code would have to special case the decrement of EACH of the reflink's > blocks because otherwise the original owner might "release" quota that > it was never originally charged. If the caller is creating an inode in someone else's name, then who do you charge for the quota? If you charge the caller, how do you know to decrement the caller's quota when the actual owner does truncate, given that the inode has no knowledge of the caller anymore. You've hit the nail on the head - without backrefs for each refcounted hunk, you can't figure out who owns it from a quota perspective. And that's just a non-starter to try and maintain. > > Here's another fun trick. 
Overwriting rsync, instead of copying > > blocks from the already-existing source could reflink the source to the > > .temporary, then only write the changed blocks. And since you own both > > files, it just works. If you're overwriting someone else's file? The > > old copy behavior is fine. > > Well, "fine" as in it works, but if there are only a few changed blocks, > and the old copy is now part of a snapshot (so it won't be released when > rsync is finished) the space consumption has doubled instead of just > using a few extra blocks. No, because the last thing rsync will do is rename(.temporary, source). All the references from the source will be decremented, and any blocks only owned by the source will be freed. Space usage is identical before and after, like a copying rsync, but there is less space used and less I/O done during the rsync process. > Is there anything about changing the owner/group of the new inode during > reflink that makes the implementation more complex? If the process doing > the reflink is the same as the file owner then the semantics are unchanged > from what you have proposed. If you define that 'reflink sets the attributes as if it was a new file', then you should be creating the file with a new security context, not with the security context from the existing inode. And then you can't really snapshot. A mixed behavior, like "if you own it, I'll preserve the entire security context, but if not I will treat it with a new context" is confusing at best. > > > I'm thinking particularly of file permissions, owner/group and atime. > > > > People do cp -p all the time. I don't see how keeping those > > things the same will break anything. It's a new call, not an existing > > semantic. > > Though "cp -p" doesn't keep the owner/group of the original file if you > are not root. Sure, my argument wasn't that we should be exactly like cp -p, it was that the results of cp -p are understood, so if we look like them it won't break anything. 
I actually discussed the "cp -p" issue elsewhere. Yes, we all understand the caveats of "cp -p". But it's actually a combination of many simple operations. reflink() is one operation, and trying to give it confusing and varied semantics seems to clutter it up for no good reason. Joel -- "Baby, even the losers Get lucky sometimes. Even the losers Keep a little bit of pride." Joel Becker Principal Software Developer Oracle E-mail: joel.becker@oracle.com Phone: (650) 506-8127 ^ permalink raw reply [flat|nested] 151+ messages in thread
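The rsync trick discussed in the message above (reflink the source to the .temporary, write only the changed blocks, then rename over the source) can be checked with a toy reference-counted block store. Everything here is a hypothetical model for illustration: BlockStore and its methods are invented names, one "block" stands in for any allocation unit, and no real filesystem behaves exactly this simply.

```python
from collections import Counter


class BlockStore:
    """Toy model (hypothetical) of files that share refcounted blocks."""

    def __init__(self):
        self.refs = Counter()  # block id -> number of files referencing it
        self.files = {}        # file name -> list of block ids
        self.next_id = 0
        self.writes = 0        # number of data blocks written

    def _alloc(self):
        block = self.next_id
        self.next_id += 1
        self.refs[block] = 1
        self.writes += 1
        return block

    def _drop(self, block):
        self.refs[block] -= 1
        if self.refs[block] == 0:
            del self.refs[block]  # last reference gone: block is freed

    def create(self, name, nblocks):
        self.files[name] = [self._alloc() for _ in range(nblocks)]

    def reflink(self, src, dst):
        blocks = list(self.files[src])
        for block in blocks:
            self.refs[block] += 1  # share the data, write nothing
        self.files[dst] = blocks

    def write_block(self, name, index):
        old = self.files[name][index]
        if self.refs[old] > 1:
            # Shared block: copy-on-write into a freshly allocated block.
            self._drop(old)
            self.files[name][index] = self._alloc()
        else:
            self.writes += 1  # exclusively owned: overwrite in place

    def rename(self, src, dst):
        # Replacing dst drops its references, as rename(2) over a file would.
        for block in self.files.pop(dst, []):
            self._drop(block)
        self.files[dst] = self.files.pop(src)

    def used(self):
        return len(self.refs)


store = BlockStore()
store.create("data", 100)
store.writes = 0                      # only count I/O done by the "rsync"

store.reflink("data", ".temporary")
for changed in (5, 42, 77):           # rsync found three changed blocks
    store.write_block(".temporary", changed)
peak_blocks = store.used()

store.rename(".temporary", "data")
final_blocks = store.used()
```

In this model the peak usage is the original blocks plus the changed ones, and after the rename the usage returns to the original block count, matching Joel's claim that space is identical before and after while less space and I/O are needed during the transfer.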
* Re: [PATCH 1/3] fs: Document the reflink(2) system call. 2009-05-05 16:56 ` Joel Becker @ 2009-05-05 21:24 ` Andreas Dilger 2009-05-05 21:32 ` Joel Becker 0 siblings, 1 reply; 151+ messages in thread From: Andreas Dilger @ 2009-05-05 21:24 UTC (permalink / raw) To: Jamie Lokier, linux-fsdevel, jmorris, ocfs2-devel, viro On May 05, 2009 09:56 -0700, Joel Becker wrote: > On Tue, May 05, 2009 at 02:09:36AM -0600, Andreas Dilger wrote: > > If the reflink caller is always charged for the full space used (as if > > it were a real copy) by virtue of the user doing the reflink() owning the > > new inode. Doing anything else seems broken. If the owner of the file > > wasn't charged for the reflink's quota then if the reflink inode was > > chowned the new owner would be charged for the new file, but the quota > > code would have to special case the decrement of EACH of the reflink's > > blocks because otherwise the original owner might "release" quota that > > it was never originally charged. > > If the caller is creating an inode in someone else's name, then > who do you charge for the quota? IMHO, it shouldn't be possible to create an inode in someone else's name (CAP_* excluded), just like it isn't possible to create a new file in someone else's name. The caller of reflink() should be the one creating the file, hence the owner of the file, and the owner of the quota. > If you charge the caller, how do you know to decrement the caller's > quota when the actual owner does truncate, given that the inode has > no knowledge of the caller anymore. No, if the owner of the inode (== caller) is charged the quota then when the inode is truncated (regardless of who does the truncate) the quota will just work correctly. > You've hit the nail on the head - without backrefs for each > refcounted hunk, you can't figure out who it owns it from a quota > perspective. And that's just a non-starter to try and maintain. No, I don't think my proposal is _more_ complex than the original. 
It is actually _less_ complex, because the fact that this is a reflink and not a complete file copy is a purely internal detail of the filesystem and is not exposed outside the filesystem. The fact that a reflink consumes less space and is faster than a real copy is an implementation detail, not really any different than if the file were compressed by the filesystem internally. > > > Here's another fun trick. Overwriting rsync, instead of copying > > > blocks from the already-existing source could reflink the source to the > > > .temporary, then only write the changed blocks. And since you own both > > > files, it just works. If you're overwriting someone else's file? The > > > old copy behavior is fine. > > > > Well, "fine" as in it works, but if there are only a few changed blocks, > > and the old copy is now part of a snapshot (so it won't be released when > > rsync is finished) the space consumption has doubled instead of just > > using a few extra blocks. > > No, because the last thing rsync will do is rename(.temporary, > source). All the references from the source will be decremented, and > any blocks only owned by the source will be freed. Space usage is > identical before and after, like a copying rsync, but there is less > space used and less I/O done during the rsync process. What I was objecting to is "when overwriting someone else's file, the old copy behaviour is fine". If we are implementing a copy-on-write API, why hamstring it to not work in the expected manner by a normal "cp"? > > Is there anything about changing the owner/group of the new inode during > > reflink that makes the implementation more complex? If the process doing > > the reflink is the same as the file owner then the semantics are unchanged > > from what you have proposed. > > If you define that 'reflink sets the attributes as if it was a > new file', then you should be creating the file with a new security > context, not with the security context from the existing inode. 
And > then you can't really snapshot. > A mixed behavior, like "if you own it, I'll preserve the entire > security context, but if not I will treat it with a new context" is > confusing at best. I don't find it confusing. The security context would be inherited from the creating process, just like creating a new file would. If it is the same user as the file owner then the security context will be the same. Cheers, Andreas -- Andreas Dilger Sr. Staff Engineer, Lustre Group Sun Microsystems of Canada, Inc. ^ permalink raw reply [flat|nested] 151+ messages in thread
* Re: [PATCH 1/3] fs: Document the reflink(2) system call. 2009-05-05 21:24 ` Andreas Dilger @ 2009-05-05 21:32 ` Joel Becker 2009-05-06 7:15 ` [Ocfs2-devel] " Theodore Tso 0 siblings, 1 reply; 151+ messages in thread From: Joel Becker @ 2009-05-05 21:32 UTC (permalink / raw) To: Andreas Dilger; +Cc: linux-fsdevel, Jamie Lokier, jmorris, ocfs2-devel, viro On Tue, May 05, 2009 at 03:24:17PM -0600, Andreas Dilger wrote: > On May 05, 2009 09:56 -0700, Joel Becker wrote: <snip a bunch of stuff about how quota obviously works correctly if we change ownership> > > No, because the last thing rsync will do is rename(.temporary, > > source). All the references from the source will be decremented, and > > any blocks only owned by the source will be freed. Space usage is > > identical before and after, like a copying rsync, but there is less > > space used and less I/O done during the rsync process. > > What I was objecting to is "when overwriting someone else's file, the old > copy behaviour is fine". If we are implementing a copy-on-write API, > why hamstring it to not work in the expected manner by a normal "cp"? We're implementing an inode-level snapshot/clone that also happens to be very convenient for many cp-like operations. > > If you define that 'reflink sets the attributes as if it was a > > new file', then you should be creating the file with a new security > > context, not with the security context from the existing inode. And > > then you can't really snapshot. > > A mixed behavior, like "if you own it, I'll preserve the entire > > security context, but if not I will treat it with a new context" is > > confusing at best. > > I don't find it confusing. The security context would be inherited from > the creating process, just like creating a new file would. If it is the > same user as the file owner then the security context will be the same. The same as what? 
If you reflink your own file, it preserves the security context of the original or it appears with the default security context of yourself? They are not the same. "Treat it like link(2)" argues for the former - which precludes changing ownership. That's what reflink is designed to do. "Treat it like cp" is a different behavior. Joel -- "The lawgiver, of all beings, most owes the law allegiance. He of all men should behave as though the law compelled him. But it is the universal weakness of mankind that what we are given to administer we presently imagine we own." - H.G. Wells Joel Becker Principal Software Developer Oracle E-mail: joel.becker@oracle.com Phone: (650) 506-8127 ^ permalink raw reply [flat|nested] 151+ messages in thread
* Re: [Ocfs2-devel] [PATCH 1/3] fs: Document the reflink(2) system call. 2009-05-05 21:32 ` Joel Becker @ 2009-05-06 7:15 ` Theodore Tso 2009-05-06 14:24 ` jim owens 0 siblings, 1 reply; 151+ messages in thread From: Theodore Tso @ 2009-05-06 7:15 UTC (permalink / raw) To: Andreas Dilger, Jamie Lokier, linux-fsdevel, jmorris, ocfs2-devel, viro On Tue, May 05, 2009 at 02:32:06PM -0700, Joel Becker wrote: > The same as what? If you reflink your own file, it preserves > the security context of the original or it appears with the default > security context of yourself? They are not the same. "Treat it like > link(2)" argues for the former - which precludes changing ownership. > That's what reflink is designed to do. "Treat it like cp" is a > different behavior. The reason why I don't like the default to be "preserve the inode ownership" is because it's *not* just like link(2). If it were just like link(2), the inode number would also be preserved. If the inode number is changing, then it arguably is ***much*** more like a copy. And a copy operation also has many useful properties. - Ted ^ permalink raw reply [flat|nested] 151+ messages in thread
* Re: [Ocfs2-devel] [PATCH 1/3] fs: Document the reflink(2) system call. 2009-05-06 7:15 ` [Ocfs2-devel] " Theodore Tso @ 2009-05-06 14:24 ` jim owens 2009-05-06 14:30 ` jim owens 2009-05-12 19:11 ` Jamie Lokier 0 siblings, 2 replies; 151+ messages in thread From: jim owens @ 2009-05-06 14:24 UTC (permalink / raw) To: Theodore Tso, joel.becker Cc: Andreas Dilger, Jamie Lokier, linux-fsdevel, jmorris, ocfs2-devel, viro So summarizing the main argument of the day, there are 2 different functions proposed: 1) "snapfile" - Joel's reflink(2) design. The definition is good. It makes writable snapshots possible, and is security safe with CAP_FOWNER added as a requirement because CAP_CHOWN is very restricted in the real world. Only the owner and admin can reflink. For the admin to use it in backups, it must be a single call as Joel said, a point in time of data and attributes. If the name "reflink" is the problem, call it something else. 2) "cowfilecopy" - Ted's/Jamie's kernel cp. Again, the definition makes sense. The security model is simply "current read access to the file", so anyone who can read it can make an almost-zero-space-consumed copy of a file. --- analysis --- You ask why not use a 2-step "cowfilecopy" and "attrfilecopy" to do "snapfile"... because that is not an atomic snapshot. The security and "might not know about it" concerns are bogus: No extra visibility exists to future updates of the original file that would not exist without either snapfile or cowfilecopy. That BOTH point at your old data is no different than if root or raid was copying every disk block to permanent storage. If you write it, someone can have it later. So bottom line... I see no reason (except someone has to document) why we should not have 2 system calls since there are good uses and good definitions for both and the code is 99% identical. jim ^ permalink raw reply [flat|nested] 151+ messages in thread
* Re: [Ocfs2-devel] [PATCH 1/3] fs: Document the reflink(2) system call. 2009-05-06 14:24 ` jim owens @ 2009-05-06 14:30 ` jim owens 2009-05-06 17:50 ` jim owens 2009-05-12 19:11 ` Jamie Lokier 1 sibling, 1 reply; 151+ messages in thread From: jim owens @ 2009-05-06 14:30 UTC (permalink / raw) To: Theodore Tso, joel.becker Cc: Andreas Dilger, Jamie Lokier, linux-fsdevel, jmorris, ocfs2-devel, viro P.S. as people have already said, both 1) "snapfile" - Joel's reflink(2) design. 2) "cowfilecopy" - Ted's/Jamie's kernel cp. could be 1 syscall with a "preserve" flag that requires CAP_FCHOWN. ^ permalink raw reply [flat|nested] 151+ messages in thread
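The single-call idea in the message above, one operation with a "preserve" flag that selects between "snapfile" and "cowfilecopy" behaviour, might look roughly like the sketch below. It is a hypothetical model, not the proposed kernel interface: reflink_new_owner and its parameters are invented for illustration, and the privileged flag stands in for whichever capability the final design would require (the thread mentions both CAP_FCHOWN and CAP_FOWNER, so no specific one is assumed here).

```python
def reflink_new_owner(src_owner, caller, preserve, privileged=False, can_read=True):
    """Return the owner of the new inode, or raise PermissionError.

    preserve=True  models "snapfile": attributes are kept, so ownership stays
                   with src_owner and only the owner or a privileged caller
                   may do it.
    preserve=False models "cowfilecopy": read access is enough, and the new
                   inode belongs to the caller, as with a normal copy.
    """
    if preserve:
        if caller != src_owner and not privileged:
            raise PermissionError("preserving attributes requires ownership or privilege")
        return src_owner
    if not can_read:
        raise PermissionError("caller cannot read the source file")
    return caller


# Bob copies Alice's readable file: he owns (and is charged for) the result.
copy_owner = reflink_new_owner("alice", "bob", preserve=False)

# Alice snapshots her own file: ownership is unchanged.
snap_owner = reflink_new_owner("alice", "alice", preserve=True)

# A privileged backup process snapshots Alice's file on her behalf.
backup_owner = reflink_new_owner("alice", "root", preserve=True, privileged=True)
```

The one branch that is refused is an unprivileged caller asking to preserve someone else's ownership, which is the case the quota DoS discussion earlier in the thread is worried about.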
* Re: [Ocfs2-devel] [PATCH 1/3] fs: Document the reflink(2) system call. 2009-05-06 14:30 ` jim owens @ 2009-05-06 17:50 ` jim owens 2009-05-12 19:20 ` Jamie Lokier 2009-05-12 19:30 ` Jamie Lokier 0 siblings, 2 replies; 151+ messages in thread From: jim owens @ 2009-05-06 17:50 UTC (permalink / raw) To: Theodore Tso, joel.becker Cc: Andreas Dilger, Jamie Lokier, linux-fsdevel, jmorris, ocfs2-devel, viro jim owens wrote: > > 1) "snapfile" - Joel's reflink(2) design. > 2) "cowfilecopy" - Ted's/Jamie's kernel cp. > > could be 1 syscall with a "preserve" flag that requires CAP_FCHOWN. No disagreement must mean the mail isn't getting through :) So on to the last turd in the punch bowl, quota and du rules: Both snapfile and cowfilecopy: - must not allow their use to cheat to exceed the user's quota - would best serve if only the actual disk non-shared space was counted But we know not all filesystems will be able to change their on-disk data structures and efficiently count only non-shared. So I suggest this is the rule: Quota accounting and disk space used for the original file will be as if there were 0 reflinks. Quota accounting and disk space reported for the new reflink file is filesystem specific and may or may not include shared disk space. jim ^ permalink raw reply [flat|nested] 151+ messages in thread
* Re: [Ocfs2-devel] [PATCH 1/3] fs: Document the reflink(2) system call. 2009-05-06 17:50 ` jim owens @ 2009-05-12 19:20 ` Jamie Lokier 2009-05-12 19:30 ` Jamie Lokier 1 sibling, 0 replies; 151+ messages in thread From: Jamie Lokier @ 2009-05-12 19:20 UTC (permalink / raw) To: jim owens Cc: Theodore Tso, joel.becker, Andreas Dilger, linux-fsdevel, jmorris, ocfs2-devel, viro jim owens wrote: > Quota accounting and disk space used for the original file will > be as if there were 0 reflinks. Quota accounting and disk space > reported for the new reflink file is filesystem specific and may > or may not include shared disk space. One little thing: If the original file is deleted, 1. The data must still be accounted at least once. 2. After deleting the original, the space must not be charged to the owner of the original file if different from the owners of reflinks which remain, because that would be a quota attack. Less important: 3. The combination of 1 and 2 probably shouldn't cause the quota charge of other users (who created reflinks and aren't the original file owner) to increase as it might push them over their quota, in a way that's difficult for them to know beforehand. One way to satisfy that is for the reflink's data to be charged at least once to each distinct owner who references that data. Maybe it can be done by charging it when the owner of a new reflink is different from the original, and something appropriate during chown. -- Jamie ^ permalink raw reply [flat|nested] 151+ messages in thread
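As a rough illustration of the "charge once per distinct owner" idea above, here is a small user-space model in C. It is only a sketch of the accounting rule, not how any filesystem implements quota; the struct `shared_extent` and the function `charge_for_owner` are hypothetical names made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical model: one run of shared data referenced by several files. */
struct shared_extent {
    unsigned long blocks;  /* size of the shared data, in blocks */
    const uid_t *owners;   /* owner of each file that references it */
    size_t nrefs;          /* number of referencing files */
};

/*
 * Blocks charged to `uid` for this extent: the full size if `uid` owns at
 * least one referencing file, otherwise nothing. Several reflinks owned by
 * the same user are charged only once.
 */
unsigned long charge_for_owner(const struct shared_extent *ext, uid_t uid)
{
    assert(ext != NULL);
    for (size_t i = 0; i < ext->nrefs; i++) {
        if (ext->owners[i] == uid)
            return ext->blocks;
    }
    return 0;
}
```

Under this model, deleting the original file only removes one entry from `owners`: the remaining reflink owners were already charged, so their usage does not jump, and the original owner stops being charged once they hold no reference.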
* Re: [Ocfs2-devel] [PATCH 1/3] fs: Document the reflink(2) system call. 2009-05-06 17:50 ` jim owens 2009-05-12 19:20 ` Jamie Lokier @ 2009-05-12 19:30 ` Jamie Lokier 1 sibling, 0 replies; 151+ messages in thread From: Jamie Lokier @ 2009-05-12 19:30 UTC (permalink / raw) To: jim owens Cc: Theodore Tso, joel.becker, Andreas Dilger, linux-fsdevel, jmorris, ocfs2-devel, viro jim owens wrote: > du rules In general, hard :-) See the difficulties reporting process memory usage given shared libraries for how tricky it gets. Imho, the simplest is for du to report how much space it would take if all the files were fully unCOWed, either by being modified or copied to another filesystem. In other words, just return the data blocks assigned to a file as usual; the COW difference is the same block can be assigned to more than one file. Sometimes knowing the unCOWed space would even be useful. Of course knowing the COWed space is also useful. (Both together would give you a nice feel for how much space COW is saving.) I suspect that would need changes to du to give useful answers (and similar changes to "Disk Usage Analyzer" if you like that tool). du detects hard links by i_nlink!=1 and inode number, and merges the accounting. For partially shared files, I'm not sure how best to get the right information out. FIEMAP is inherently a lot slower than stat() because it can do much more disk access, and it might not always work (depending on how data is represented on disk), and it might not always be permitted. Worse, there's no i_nlink!=1 equivalent to decide when FIEMAP does not need to be called. -- Jamie ^ permalink raw reply [flat|nested] 151+ messages in thread
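The hard-link check described above can be sketched roughly as follows. This is a simplified model of the technique (skip files whose device and inode pair was already seen, but only bother checking when `st_nlink` is not 1), not the actual du source; `struct seen_set`, `MAX_SEEN` and `should_count` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/stat.h>
#include <sys/types.h>

#define MAX_SEEN 1024  /* arbitrary cap for this sketch */

/* Fixed-size record of (device, inode) pairs already counted. */
struct seen_set {
    dev_t dev[MAX_SEEN];
    ino_t ino[MAX_SEEN];
    size_t count;
};

/* Returns true if this file's blocks should be added to the running total. */
bool should_count(struct seen_set *seen, const struct stat *st)
{
    assert(seen != NULL && st != NULL);

    /* A file with a single link cannot have been counted under another name. */
    if (st->st_nlink == 1)
        return true;

    for (size_t i = 0; i < seen->count; i++) {
        if (seen->dev[i] == st->st_dev && seen->ino[i] == st->st_ino)
            return false;  /* same inode reached through another hard link */
    }

    if (seen->count < MAX_SEEN) {
        seen->dev[seen->count] = st->st_dev;
        seen->ino[seen->count] = st->st_ino;
        seen->count++;
    }
    return true;
}
```

Two reflinked files each have their own inode number and normally `st_nlink == 1`, so a check like this counts both in full. That is the gap pointed out above: nothing in `stat()` output says that some of their blocks are shared.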
* Re: [Ocfs2-devel] [PATCH 1/3] fs: Document the reflink(2) system call. 2009-05-06 14:24 ` jim owens 2009-05-06 14:30 ` jim owens @ 2009-05-12 19:11 ` Jamie Lokier 2009-05-12 19:37 ` jim owens 1 sibling, 1 reply; 151+ messages in thread From: Jamie Lokier @ 2009-05-12 19:11 UTC (permalink / raw) To: jim owens Cc: Theodore Tso, joel.becker, Andreas Dilger, linux-fsdevel, jmorris, ocfs2-devel, viro jim owens wrote: > You ask why not use a 2-step "cowfilecopy" and "attrfilecopy" > to do "snapfile"... because that is not an atomic snapshot. Understood, no problem with that. (Though it would be nice to have a realistic example showing the atomicity being useful for a single file snapshot). Being able to create a _new file_ with the security attributes of an existing file is sometimes useful too. Lots of programs do that, of course, but a lot of them get it wrong when non-traditional security attributes are used. reflink() followed by truncate() would be useful for that - and in that case, returning EPERM if it can't clone the attributes would be essential - because if a program which wants to copy "all the security attributes" without the knowledge to parse them itself and set them in the right order, then it won't have the code to check if they were cloned reliably either. > The security and "might not know about it" concerns are bogus: > No extra visibility exists to future updates of the original > file that would not exist without either snapfile or cowfilecopy. > That BOTH point at your old data is no different than if root > or raid was copying every disk block to permanent storage. If > you write it, someone can have it later. I agree with that _as long as_ reflink() does not permit you to clone a file when you are not the owner and you don't have read access. It looks like reflink() V4 does not permit in that case - good! 
(A more precise statement of the rule is "as long as you could copy the file normally and then change its attributes to match what reflink() produces"). That's different from link(), which _does_ allow links when you have no read access and aren't the owner, but it always bumps i_nlink. That's where I was coming from with the "might not know about it" concern, because it looked like earlier reflink() proposals applied the same weak permission checks as link(). V4 seems much better. > So bottom line... I see no reason (except someone has to document) > why we should not have 2 system calls since there are good uses > and good definitions for both and the code is 99% identical. I doubt if anyone cares deeply if there are two system calls or one system call with a flag(*), since they are so similar. The main thing is having useful behaviours. (*) Except for aesthetics. I'm with the folks who think it's better for userspace to explicitly request one behaviour or the other, rather than having reflink() "automatically" decide for itself whether it will clone the attributes or use new-file attributes. The reason is that the "automatic" behaviour will certainly require some applications to work around it, by guessing what it's going to do before (which is difficult to do accurately), or checking what it did afterwards. That will be these applications: - Sometimes an app will want to clone the attributes, and tell the user "sorry, no" if that's not possible. So the app will have to stat the file first, check the file owner against its euid, reflink, then stat the resulting file afterwards and check what happened (because ownership might have changed between the first stat and reflink calls, changing reflink's behaviour from what it expected), and then call unlink if the wrong thing happened *and* it will still be wrong 1% of the time when the security model is not what the application expected. Applications should not have to hard-code every known security model.
And linking then unlinking because you got it wrong is another security issue. "cp --cow -a" might be in this category, so would "rsync --cow -a" and generic backup applications. I expect most applications wanting to copy exactly care about this. - Sometimes an app will want to warn the user if the attributes couldn't be cloned, but succeed in making the copy. reflink() V4 does that, but the app will have to check the new attributes against the old ones to know whether to warn, and then guess what errno would be appropriate. Maybe "cp --cow -a" will be like this. - Sometimes an app really just wants to copy a file with COW for efficient data sharing. It will have to change the resulting attributes to "new file" attributes - and that will be wrong 1% of the time because it's not necessarily easy to get those attributes right, especially with non-standard security models. Even with traditional security, getting setgid-directory behaviour right is extremely difficult - because it depends on the filesystem's mount options among other things. Basically "new file" attributes are something that should always be left to the kernel. While it might not be obvious when root would want to copy a file without preserving attributes with COW performance, the argument "I nearly always forget -p when writing cp" is arguing for "alias cp='cp -p'" in your /root/.profile, not for making the system call do it in a way you can't disable :-) Besides I can think of when you would want it: When running *any* shell script that you didn't write with the environment variable CP_USE_COW_WHEN_POSSIBLE_TO_SAVE_SPACE set ;-) Now the opposite of "automatic" is the app requests whether to clone attributes or use "new file" attributes. 
In contrast to the above problems, this doesn't cause any difficulty to applications, because any app wanting the automatic choice can just do this: ret = reflink(a,b); if (ret == -1 && errno == EPERM) ret = cowlink(a,b); Ok, that's not perfect because EPERM can mean other things. Which brings us back to a flag ;-) like this: REFLINK_ATTR_CLONE (EPERM if can't clone attributes) REFLINK_ATTR_CLONE_IF_OWNER_OR_ROOT (choose, as proposed in reflink V4) One last annoyance. If you're making a new file, then like open() you need another argument, which is the new file's mode which is combined with umask. But not if you're cloning the attributes. That's a good reason why there should be two functions for applications. The names reflink/cowlink (and reflinkat/cowlinkat) make sense to me. The cowlink functions have an extra mode argument, like the last argument to open(). (They could all be one system call at the kernel level, but different in libc, as is already planned for the reflink/reflinkat distinction.) Oh, and please implement AT_SYMLINK_FOLLOW the same as link(). Thanks :-) -- Jamie ^ permalink raw reply [flat|nested] 151+ messages in thread
* Re: [Ocfs2-devel] [PATCH 1/3] fs: Document the reflink(2) system call. 2009-05-12 19:11 ` Jamie Lokier @ 2009-05-12 19:37 ` jim owens 2009-05-12 20:11 ` Jamie Lokier 0 siblings, 1 reply; 151+ messages in thread From: jim owens @ 2009-05-12 19:37 UTC (permalink / raw) To: Jamie Lokier Cc: Theodore Tso, joel.becker, Andreas Dilger, linux-fsdevel, jmorris, ocfs2-devel, viro I don't write applications so I won't argue when what they want does not make sense to me :) Jamie Lokier wrote: > > One last annoyance. If you're making a new file, then like open() you > need another argument, which is the new file's mode which is combined > with umask. But that only works for minimal traditional permissions. If you want to adjust ACL or MAC, you need to do something else anyway, so is it really worth having the old-style mode parameter? jim ^ permalink raw reply [flat|nested] 151+ messages in thread
* Re: [Ocfs2-devel] [PATCH 1/3] fs: Document the reflink(2) system call. 2009-05-12 19:37 ` jim owens @ 2009-05-12 20:11 ` Jamie Lokier 0 siblings, 0 replies; 151+ messages in thread From: Jamie Lokier @ 2009-05-12 20:11 UTC (permalink / raw) To: jim owens Cc: Theodore Tso, joel.becker, Andreas Dilger, linux-fsdevel, jmorris, ocfs2-devel, viro jim owens wrote: > >One last annoyance. If you're making a new file, then like open() you > >need another argument, which is the new file's mode which is combined > >with umask. > > But that only works for minimal traditional permissions. If you > want to adjust ACL or MAC, you need to do something else anyway, > so is it really worth having the old-style mode parameter? You have a point, and mode+umask is sort of ugly, but: ACLs and MACs are intentionally designed so that in 99.9% of cases, there is no need to do anything else after open(), even in programs that use different mode arguments for security and don't know anything about non-traditional permissions. So very few apps need to do anything else afterwards. The ACL/MAC defaults have been carefully designed to have the right security properties, and people writing security policies understand how that works. The most often used mode parameters are almost certainly 0666 meaning "use what umask says", and 0600 meaning "most restricted useful permissions" for a new file. If you want to create a file with restricted permissions without altering umask, which isn't safe in a threaded program, you must _not_ use 0666 _and then_ narrow the permissions - it's important that the initial permissions are <= the final ones that you need. So without the parameter, what's the sane default? For typical cowlink uses it should be equivalent to open(...,0666) as you don't want to umask+chmod afterwards. I wouldn't be surprised if umask+chmod afterwards gave different ACL/MAC results.
But if you need restricted permissions on the file afterwards, since it's not safe to start wide and then narrow, 0666 is not a suitable default. You could say "just change the umask!" but that is bad in a threaded program, unfortunately. (Imho they should have made umask thread-specific; oh well. In fact you can emulate per-thread umask by adjusting the mode argument in some environments :-) The mode argument, though ugly, is at least well understood and security policies (inside apps and outside) do the right thing with it. -- Jamie ^ permalink raw reply [flat|nested] 151+ messages in thread
* Re: [PATCH 1/3] fs: Document the reflink(2) system call. 2009-05-05 7:16 ` Joel Becker 2009-05-05 8:09 ` Andreas Dilger @ 2009-05-05 13:01 ` Theodore Tso 2009-05-05 13:19 ` Jamie Lokier 2009-05-05 17:00 ` Joel Becker 2009-05-05 13:01 ` Jamie Lokier 2 siblings, 2 replies; 151+ messages in thread From: Theodore Tso @ 2009-05-05 13:01 UTC (permalink / raw) To: Jamie Lokier, linux-fsdevel, jmorris, ocfs2-devel, viro On Tue, May 05, 2009 at 12:16:09AM -0700, Joel Becker wrote: > On Tue, May 05, 2009 at 02:07:03AM +0100, Jamie Lokier wrote: > > Joel Becker wrote: > > > +All file attributes and extended attributes of the new file must > > > +identical to the source file with the following exceptions: > > > > reflink() sounds useful already, but is there a compelling reason why > > both files must have the same attributes, and changing attributes will > > break the COW? > > Yeah, because without it you can't use it for snapshotting. > That's where the original design came from - inode snapshots. The big > thing that excited me was that defining reflink() as I did, instead of > a more specific snapshot call, allows all sorts of generic uses (some of > which you outline below). I guess it depends on your implementation. At least the way I would implement this in ext4, for example, I'd simply set a new flag indicating this was a "reflink", and then the i_data[0..3] field would contain the inode number of the "host" inode, and i_data[4..7] and i_data[8..11] would contain a circular linked list of all reflinks associated with that inode. I'd then grab a spare inode field so the "host" inode could point to the reflink'ed inodes. If you ever need to delete the host inode, you simply pick one of the reflink inodes, copy i_data from the host inode to it, promote it to be the "host" inode, and then update all of the other reflink inodes to point at the new host inode.
The advantage of this scheme is not only does the reflink'ed inode have a new inode number (as in your design), it actually has an entirely new inode. So we can change the ownership, the mtime, ctime; it behaves *entirely* as a separate, free-standing inode except it is sharing the data blocks. This allows me to easily set a new owner, and indeed any other inode metadata, on the reflink'ed inode, which I would argue is a Good Thing. I'm guessing that OCFS2 has implemented (or is planning on implementing) reflinks, you can't modify the metadata? Or is there some really important reason why it's not a good idea for OCFS2? > > Since each reflink has its own nlink and ino, I'm wondering why the > > other attributes cannot also be separate. (I realise extended > > attributes complicate the picture and it's desirable to share them, > > especially if they are large). > > The biggest reason is snapshotting. I guess this doesn't mean much to me. Can you say more about what you have in mind when you say "snapshotting"? Is this in the WAFL sense? What's the use case? > > Can you hard link to the source file and the reflink afterwards, > > incrementing the reflink's link count? (I presume yes). Can you > > reflink to both of them too? > > Yes, absolutely. Once reflinked, they look like two separate > POSIX files. ... but in your implementation, if you ever chown or chmod (or even touch the atime?) of the file, it instantly does a copy-on-write? - Ted ^ permalink raw reply [flat|nested] 151+ messages in thread
* Re: [PATCH 1/3] fs: Document the reflink(2) system call. 2009-05-05 13:01 ` Theodore Tso @ 2009-05-05 13:19 ` Jamie Lokier 2009-05-05 13:39 ` Chris Mason ` (2 more replies) 2009-05-05 17:00 ` Joel Becker 1 sibling, 3 replies; 151+ messages in thread From: Jamie Lokier @ 2009-05-05 13:19 UTC (permalink / raw) To: Theodore Tso; +Cc: linux-fsdevel, jmorris, ocfs2-devel, viro Theodore Tso wrote: > I guess it depends on your implementation. At least the way I would > implement this in ext4, for example, I'd simply set a new flag > indicating this was a "reflink", and then the i_data[0..3] field would > contain the inode number of the "host" inode, and i_data [4..7] and > i_data[8..11] would contain a circular linked list of all reflinks > associated with that inode. I'd then grab a spare inode field so the > "host" inode could point to the reflink'ed inodes. > > If you ever need to delete the host inode, you simply pick one of the > reflink inodes and copy i_data from the host inode one of the reflink > inodes and promote it to be the "host" inode, and then update all of > the other reflink inodes to point at the new host inode. > > The advantage of this scheme is not only does the reflink'ed inode > have a new inode number (as in your design), it actually has an > entirely new inode. So we can change the ownership, the mtime, ctime; > it behaves *entirely* as a separate, free-standing inode except it is > sharing the data blocks. > > This allows me to easily set a new owner, and indeed any other inode > metadata, on the reflink'ed inode, which I would argue is a Good > Thing. There was an attempt at something like that for ext3 a year or two ago. Search for "cowlink" if you're interested. Most of the discussion ended up around how to handle copying on writes to shared-writable mmaps, something which I guess is solved these days. 
Instead of a circular list, a proposed implementation was to create a separate "host" inode on the first reflink, converting the source inode to a reflink inode and moving the data block references to the new host inode. Each reflink was simply a reference to the host inode, much like your design, and the host inode was only to hold the data blocks, with its i_nlink counting the number of reflinks pointing to it. Using a circular list means the space must be reserved in every inode, even those which are not (yet) reflinks. It also does a bit more writing sometimes, because of having to update next and previous entries on the list. Hmm. The data pointers could live in all the inodes, since they are identical and the whole data is cloned on write. That would make reading a bit faster. > I'm guessing that OCFS2 has implemented (or is planning on > implementing) reflinks, you can't modify the metadata? Or is there > some really important reason why it's not a good idea for OCFS2? I would have thought for OCFS2 and BTRFS, with their nice keyed tree structure, it would be quite natural to implement separate inodes for the reflinks pointing at a shared data-holding inode. Something a little bit like that must be happening to permit separate inode numbers. I wonder if even pointing at shared subtrees of data extents might be feasible, to share some file data. That would make the COW copy less of a catastrophe when it happens on a large file :-) -- Jamie ^ permalink raw reply [flat|nested] 151+ messages in thread
* Re: [PATCH 1/3] fs: Document the reflink(2) system call. 2009-05-05 13:19 ` Jamie Lokier @ 2009-05-05 13:39 ` Chris Mason 2009-05-05 15:36 ` Jamie Lokier 2009-05-05 14:21 ` [PATCH 1/3] fs: Document the reflink(2) system call Theodore Tso 2009-05-05 17:05 ` Joel Becker 2 siblings, 1 reply; 151+ messages in thread From: Chris Mason @ 2009-05-05 13:39 UTC (permalink / raw) To: Jamie Lokier; +Cc: Theodore Tso, linux-fsdevel, jmorris, ocfs2-devel, viro On Tue, 2009-05-05 at 14:19 +0100, Jamie Lokier wrote: > Theodore Tso wrote: > > I guess it depends on your implementation. At least the way I would > > implement this in ext4, for example, I'd simply set a new flag > > indicating this was a "reflink", and then the i_data[0..3] field would > > contain the inode number of the "host" inode, and i_data [4..7] and > > i_data[8..11] would contain a circular linked list of all reflinks > > associated with that inode. I'd then grab a spare inode field so the > > "host" inode could point to the reflink'ed inodes. > > > > If you ever need to delete the host inode, you simply pick one of the > > reflink inodes and copy i_data from the host inode one of the reflink > > inodes and promote it to be the "host" inode, and then update all of > > the other reflink inodes to point at the new host inode. > > > > The advantage of this scheme is not only does the reflink'ed inode > > have a new inode number (as in your design), it actually has an > > entirely new inode. So we can change the ownership, the mtime, ctime; > > it behaves *entirely* as a separate, free-standing inode except it is > > sharing the data blocks. > > > > This allows me to easily set a new owner, and indeed any other inode > > metadata, on the reflink'ed inode, which I would argue is a Good > > Thing. > > There was an attempt at something like that for ext3 a year or two ago. > Search for "cowlink" if you're interested. 
> > Most of the discussion ended up around how to handle copying on writes > to shared-writable mmaps, something which I guess is solved these days. > > Instead of a circular list, a proposed implementation was to create a > separate "host" inode on the first reflink, converting the source > inode to a reflink inode and moving the data block references to the > new host inode. Each reflink was simply a reference to the host > inode, much like your design, and the host inode was only to hold the > data blocks, with it's i_nlink counting the number of reflinks > pointing to it. > > Using a circular list means the space must be reserved in every inode, > even those which are not (yet) reflinks. It also does a bit more > writing sometimes, because of having to update next and previous > entries on the list. > > Hmm. The data pointers could live in all the inodes, since they are > identical and the whole data is cloned on write. That would make > reading a bit faster. > > > I'm guessing that OCFS2 has implemented (or is planning on > > implementing) reflinks, you can't modify the metadata? Or is there > > some really important reason why it's not a good idea for OCFS2? > > I would have thought for OCFS2 and BTRFS, with their nice keyed tree > structure, it would be quite natural to implement separate inodes for > the reflinks pointing at a shared data-holding inode. Something a > little bit like that must be happening to permit separate inode numbers. > Thanks for getting this discussion going, Joel; it's really good to get this behavior well defined. The btrfs implementation is just that you have two separate files pointing to the same extents on disk. Each file has a reference on each extent, and deleting or chowning fileA doesn't change the metadata in fileB. The btrfs cow code makes sure that modifications in either file (even when mounted in -o nodatacow) are written to new extents instead of changing the original.
If you write one block in a 1TB file, the new space used by the clone is only one block. (Thanks to the ceph developers for coding all of this up a while ago). The main difference between reflink and the btrfs ioctl is that in the btrfs ioctl the destination file must already exist. The btrfs code can also do range replacements in the destination file, but I'd agree with Joel that we don't want to toss the kitchen sink into something nice and clean like reflink. -chris ^ permalink raw reply [flat|nested] 151+ messages in thread
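For readers who want to see the shape of the ioctl-based interface described above, here is a hedged C sketch. It uses `FICLONE`, the generic name under which later kernels (4.5 and newer) expose the same ioctl number as the btrfs clone ioctl; that generic name postdates this thread, and the helper `clone_file` is a hypothetical wrapper written for this example. The call fails with an error on filesystems that cannot share extents.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>   /* FICLONE on kernels 4.5 and newer */

#ifndef FICLONE
#define FICLONE _IOW(0x94, 9, int)  /* same number as BTRFS_IOC_CLONE */
#endif

/*
 * Make dst share all of src's data extents.
 * Returns 0 on success, -1 with errno set on failure.
 */
int clone_file(const char *src, const char *dst)
{
    assert(src != NULL && dst != NULL);

    int sfd = open(src, O_RDONLY);
    if (sfd < 0)
        return -1;

    /* Unlike the proposed reflink(), the destination must exist first. */
    int dfd = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (dfd < 0) {
        int saved = errno;
        close(sfd);
        errno = saved;
        return -1;
    }

    /* The ioctl is issued on the destination; its argument is the source fd. */
    int ret = ioctl(dfd, FICLONE, sfd);
    int saved = errno;
    close(dfd);
    close(sfd);
    errno = saved;
    return ret;
}
```

Note that the destination here gets ordinary new-file attributes from open(); copying ownership, ACLs and other metadata from the source is left to the caller, which is exactly the part of the job the thread is debating whether to move into the kernel.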
* Re: [PATCH 1/3] fs: Document the reflink(2) system call. 2009-05-05 13:39 ` Chris Mason @ 2009-05-05 15:36 ` Jamie Lokier 2009-05-05 15:41 ` Chris Mason 2009-05-05 16:46 ` Jörn Engel 0 siblings, 2 replies; 151+ messages in thread From: Jamie Lokier @ 2009-05-05 15:36 UTC (permalink / raw) To: Chris Mason; +Cc: Theodore Tso, linux-fsdevel, jmorris, ocfs2-devel, viro Chris Mason wrote: > The btrfs implementation is just that you have two separate files > pointing to the same extents on disk. Each file has a reference on each > extent, and deleting or chowning fileA doesn't change the metadata in > fileB. > > The btrfs cow code makes sure that modifications in either file (even > when mounted in -o nodatacow) are written to new extents instead of > changing the original. If you write one block in a 1TB file, the new > space used by the clone is only one block. (Thanks to the ceph > developers for coding all of this up a while ago). Ooh, nice. > The main difference between reflink and the btrfs ioctl is that in the > btrfs ioctl the destination file must already exist. The btrfs code can > also do range replacements in the destination file, but I'd agree with > Joel that we don't want to toss the kitchen sink into something nice and > clean like reflink. Ah, now that I know about the BTRFS data-cloning ioctl... :-) I'm wondering why reflink() is needed at all. Can't it be done in userspace, using the BTRFS ioctl? The hard part in userspace seems to be copying the file attributes, but "cp -a" and other tools manage. What is the advantage of adding the system call for the special case of reflink(), when we choose not to have, say, a copyfile() system call which does what "cp -a" does because doing it in user space is good enough? -- Jamie ^ permalink raw reply [flat|nested] 151+ messages in thread
* Re: [PATCH 1/3] fs: Document the reflink(2) system call. 2009-05-05 15:36 ` Jamie Lokier @ 2009-05-05 15:41 ` Chris Mason 2009-05-05 16:03 ` Jamie Lokier 2009-05-05 16:46 ` Jörn Engel 1 sibling, 1 reply; 151+ messages in thread From: Chris Mason @ 2009-05-05 15:41 UTC (permalink / raw) To: Jamie Lokier; +Cc: Theodore Tso, linux-fsdevel, jmorris, ocfs2-devel, viro On Tue, 2009-05-05 at 16:36 +0100, Jamie Lokier wrote: > Chris Mason wrote: > > The btrfs implementation is just that you have two separate files > > pointing to the same extents on disk. Each file has a reference on each > > extent, and deleting or chowning fileA doesn't change the metadata in > > fileB. > > > > The btrfs cow code makes sure that modifications in either file (even > > when mounted in -o nodatacow) are written to new extents instead of > > changing the original. If you write one block in a 1TB file, the new > > space used by the clone is only one block. (Thanks to the ceph > > developers for coding all of this up a while ago). > > Ooh, nice. > > > The main difference between reflink and the btrfs ioctl is that in the > > btrfs ioctl the destination file must already exist. The btrfs code can > > also do range replacements in the destination file, but I'd agree with > > Joel that we don't want to toss the kitchen sink into something nice and > > clean like reflink. > > Ah, now that I know about the BTRFS data-cloning ioctl... :-) > > I'm wondering why reflink() is needed at all. Can't it be done in > userspace, using the BTRFS ioctl? The hard part in userspace seems to > be copying the file attributes, but "cp -a" and other tools manage. > reflink is a subset of what the btrfs ioctl does, and that's a good thing. The way they've added support for this to ocfs2 is really cool, and the same ideas could be used in other filesystems. So, I'd rather see a system call that everyone can implement, and if btrfs hangs on to the ioctl for extra features, even better. 
-chris ^ permalink raw reply [flat|nested] 151+ messages in thread
* Re: [PATCH 1/3] fs: Document the reflink(2) system call. 2009-05-05 15:41 ` Chris Mason @ 2009-05-05 16:03 ` Jamie Lokier 2009-05-05 16:18 ` Chris Mason 2009-05-05 20:48 ` jim owens 0 siblings, 2 replies; 151+ messages in thread From: Jamie Lokier @ 2009-05-05 16:03 UTC (permalink / raw) To: Chris Mason; +Cc: Theodore Tso, linux-fsdevel, jmorris, ocfs2-devel, viro Chris Mason wrote: > > > The main difference between reflink and the btrfs ioctl is that in the > > > btrfs ioctl the destination file must already exist. The btrfs code can > > > also do range replacements in the destination file, but I'd agree with > > > Joel that we don't want to toss the kitchen sink into something nice and > > > clean like reflink. > > > > Ah, now that I know about the BTRFS data-cloning ioctl... :-) > > > > I'm wondering why reflink() is needed at all. Can't it be done in > > userspace, using the BTRFS ioctl? The hard part in userspace seems to > > be copying the file attributes, but "cp -a" and other tools manage. > > > > reflink is a subset of what the btrfs ioctl does, and that's a good > thing. The way they've added support for this to ocfs2 is really cool, > and the same ideas could be used in other filesystems. > > So, I'd rather see a system call that everyone can implement, and if > btrfs hangs on to the ioctl for extra features, even better. Realistically, very few existing filesystems can implement this system call. I agree that it's much more likely that a filesystem can implement reflink() than BTRFS' more flexible data cloning though. -- Jamie ^ permalink raw reply [flat|nested] 151+ messages in thread
* Re: [PATCH 1/3] fs: Document the reflink(2) system call. 2009-05-05 16:03 ` Jamie Lokier @ 2009-05-05 16:18 ` Chris Mason 2009-05-05 20:48 ` jim owens 1 sibling, 0 replies; 151+ messages in thread From: Chris Mason @ 2009-05-05 16:18 UTC (permalink / raw) To: Jamie Lokier; +Cc: linux-fsdevel, Theodore Tso, jmorris, ocfs2-devel, viro On Tue, 2009-05-05 at 17:03 +0100, Jamie Lokier wrote: > Chris Mason wrote: > > > > The main difference between reflink and the btrfs ioctl is that in the > > > > btrfs ioctl the destination file must already exist. The btrfs code can > > > > also do range replacements in the destination file, but I'd agree with > > > > Joel that we don't want to toss the kitchen sink into something nice and > > > > clean like reflink. > > > > > > Ah, now that I know about the BTRFS data-cloning ioctl... :-) > > > > > > I'm wondering why reflink() is needed at all. Can't it be done in > > > userspace, using the BTRFS ioctl? The hard part in userspace seems to > > > be copying the file attributes, but "cp -a" and other tools manage. > > > > > > > reflink is a subset of what the btrfs ioctl does, and that's a good > > thing. The way they've added support for this to ocfs2 is really cool, > > and the same ideas could be used in other filesystems. > > > > So, I'd rather see a system call that everyone can implement, and if > > btrfs hangs on to the ioctl for extra features, even better. > > Realistically, very few existing filesystems can implement this system call. > I'd say that if the shared disk clustering filesystem can do it, pretty much anyone can ;) This doesn't mean it's easy, but it is a good set of semantics to have as the baseline. -chris ^ permalink raw reply [flat|nested] 151+ messages in thread
* Re: [PATCH 1/3] fs: Document the reflink(2) system call. 2009-05-05 16:03 ` Jamie Lokier 2009-05-05 16:18 ` Chris Mason @ 2009-05-05 20:48 ` jim owens 2009-05-05 21:57 ` Jamie Lokier 1 sibling, 1 reply; 151+ messages in thread From: jim owens @ 2009-05-05 20:48 UTC (permalink / raw) To: joel.becker Cc: Jamie Lokier, Chris Mason, Theodore Tso, linux-fsdevel, jmorris, ocfs2-devel, viro

Not surprising that the discussion is all over the place
as far as what this should do.

Whether it is better to implement one do-many-things syscall
or several different syscalls for different features can
be debated after we set some rules.

Going back to Joel's patch, I think the first rules we
need agreement on are:

1) is only for filesystems with COW operation,
   if the fs does not support COW it returns ENOSYS.

   the rationale is that while we could allow it to
   be a copyfile, it would not save space, so use "cp -a".

2) is only for regular files, all others return EPERM

   *note* as-coded the patch only traps S_ISDIR, but
   other file types could be a problem on some fs and
   I don't see any value in supporting more than regular
   files unless we support directory COW and then we are
   really jumping into the swamp.

3) the granularity of the COW (1-byte write may cause
   1-block up through whole file copy) is fs-dependent.

4) post-reflink changes done to data or attributes
   in either the original or new file are independent.

next rules if we assume reflink(2) matches Joel's
manpage and call arguments and any other features are
a different api definition:

5) you must be the file owner or have CAP_FCHOWN

   because...

6) all non-time file attributes (owner, security, etc),
   atime, and mtime match the original file. ctime is
   when the reflink was created.

but the hard part is the quota accounting rule:

7) pre-charge all quotas so a reflink double-counts inodes
   and blocks against the original owner/group

   pro - easiest, does not allow owner to bypass limits,
         quota utilities just work

   con - admin snapshot can trip user quota-limit failures,
         du/df will wildly disagree on space used

so is that what we want or do we want to just say
the behavior is fs-specific with respect to quotas.

jim
* Re: [PATCH 1/3] fs: Document the reflink(2) system call. 2009-05-05 20:48 ` jim owens @ 2009-05-05 21:57 ` Jamie Lokier 2009-05-05 22:04 ` Joel Becker 0 siblings, 1 reply; 151+ messages in thread From: Jamie Lokier @ 2009-05-05 21:57 UTC (permalink / raw) To: jim owens Cc: joel.becker, Chris Mason, Theodore Tso, linux-fsdevel, jmorris, ocfs2-devel, viro

Joel Becker wrote:
>> > > If you define that 'reflink sets the attributes as if it was a
>> > > new file', then you should be creating the file with a new security
>> > > context, not with the security context from the existing inode. And
>> > > then you can't really snapshot.
>> > > A mixed behavior, like "if you own it, I'll preserve the entire
>> > > security context, but if not I will treat it with a new context" is
>> > > confusing at best.
>> >
>> > I don't find it confusing. The security context would be inherited from
>> > the creating process, just like creating a new file would. If it is the
>> > same user as the file owner then the security context will be the same.
>>
>> The same as what? If you reflink your own file, it preserves
>> the security context of the original or it appears with the default
>> security context of yourself? They are not the same. "Treat it like
>> link(2)" argues for the former - which precludes changing ownership.
>> That's what reflink is designed to do. "Treat it like cp" is a
>> different behavior.

jim owens wrote:
> 1) is only for filesystems with COW operation,
> if the fs does not support COW it returns ENOSYS.
>
> the rational is that while we could allow it to
> be a copyfile, it would not save space so "cp -a".

As Joel explains above, reflink has user-visible semantics that are
different from "cp -a" quite aside from the COW efficiency which can
be seen as an internal fs-dependent speed/space optimisation.

That means you cannot fall back to "cp -a": reflink has semantic
behaviour which "cp -a" cannot always mimic, and won't always mimic
correctly when it tries.

Imho that's because reflink is overcomplicated and tries to do
multiple jobs at once ;-)

> 2) is only for regular files, all others return EPERM
>
> *note* as-coded the patch only traps S_ISDIR, but
> other file types could be a problem on some fs and
> I don't see any value in supporting more than regular
> files unless we support directory COW and then we are
> really jumping into the swamp.

I agree.

> 3) the granularity of the COW (1-byte write may cause
> 1-block up through whole file copy) is fs-dependent.

And yet ENOSYS if the fs cannot implement any COW, and it isn't
possible for userspace to duplicate the semantics by explicit copying?

> 4) post-reflink changes done to data or attributes
> in either the original or new file are independent.

Hopefully :-)

Do we say anything about attribute changes triggering COW or not, or
leave it fs-dependent? Given 3) fs-dependent makes sense, but it's
nice to know in advance if { reflink -R old_tree saved_tree; chmod -R
a-w saved_tree } will be as expensive as copying or as cheap as linking.

> next rules if we assume reflink(2) matches Joel's
> manpage and call arguments and any other features are
> a different api definition:
>
> 5) you must be the file owner or have CAP_FCHOWN
>
> because...
>
> 6) all non-time file attributes (owner, security, etc),
> atime, and mtime match the original file. ctime is
> when the reflink was created.
>
> but the hard part is the quota accounting rule:
>
> 7) pre-charge all quotas so a reflink double-counts inodes
> and blocks against the original owner/group
>
> pro - easiest, does not allow owner to bypass limits,
> quota utilities just work
>
> con - admin snapshot can trip user quota-limit failures,
> du/df will wildly disagree on space used

Another con - reflinks are potentially a useful way to save space, and
multiple-charging prevents their use when tight on quota.

If a user is tight on their quota and they need lots of snapshots of
their files, e.g. snapshots of work in progress, why should they have
to use hard links with their associated problems (i.e. cannot be
trusted) for their snapshots, instead of reflinks which are ideal?

-- Jamie
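The quota disagreement in rule 7) is simple arithmetic. A small sketch, using hypothetical function names and ignoring metadata blocks (inodes, extent maps), which a real filesystem would also account for:

```c
#include <stdint.h>

/* Blocks charged to the owner when each reflink is pre-charged in full. */
uint64_t charged_precharge(uint64_t file_blocks, uint64_t nr_reflinks)
{
    /* the original plus every reflink counts as a full file */
    return file_blocks * (nr_reflinks + 1);
}

/* Data blocks actually allocated while no reflink has been written to. */
uint64_t allocated_unmodified(uint64_t file_blocks)
{
    /* all reflinks still share the single original copy */
    return file_blocks;
}
```

A 1000-block file with 10 unmodified reflinks is charged 11000 blocks under pre-charging while only 1000 data blocks are allocated, which is both jim's "du/df will wildly disagree" and Jamie's objection for users who are tight on quota.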
* Re: [PATCH 1/3] fs: Document the reflink(2) system call. 2009-05-05 21:57 ` Jamie Lokier @ 2009-05-05 22:04 ` Joel Becker 2009-05-05 22:11 ` Jamie Lokier ` (2 more replies) 0 siblings, 3 replies; 151+ messages in thread From: Joel Becker @ 2009-05-05 22:04 UTC (permalink / raw) To: Jamie Lokier Cc: jim owens, Chris Mason, Theodore Tso, linux-fsdevel, jmorris, ocfs2-devel, viro On Tue, May 05, 2009 at 10:57:11PM +0100, Jamie Lokier wrote: > jim owens wrote: > > 3) the granularity of the COW (1-byte write may cause > > 1-block up through whole file copy) is fs-dependent. > > And yet ENOSYS if the fs cannot implement any COW, and it isn't > possible for userspace to duplicate the semantics by explicit copying? The point-in-time of the snapshot is what's important here. > Do we say anything about attribute changes triggering COW or not, or > leave it fs-dependent? Given 3) fs-dependent makes sense, but it's > nice to know in advance if { reflink -R old_tree saved_tree; chmod -R > a-w saved_tree } will be as expensive as copying or as cheap as linking. "Shares the data extents of the source file". I should hope that chmod doesn't require copying out all the data. Joel -- Life's Little Instruction Book #267 "Lie on your back and look at the stars." Joel Becker Principal Software Developer Oracle E-mail: joel.becker@oracle.com Phone: (650) 506-8127 ^ permalink raw reply [flat|nested] 151+ messages in thread
* Re: [PATCH 1/3] fs: Document the reflink(2) system call. 2009-05-05 22:04 ` Joel Becker @ 2009-05-05 22:11 ` Jamie Lokier 2009-05-05 22:24 ` Joel Becker 2009-05-05 22:12 ` Jamie Lokier 2009-05-05 22:28 ` jim owens 2 siblings, 1 reply; 151+ messages in thread From: Jamie Lokier @ 2009-05-05 22:11 UTC (permalink / raw) To: jim owens, Chris Mason, Theodore Tso, linux-fsdevel, jmorris, ocfs2-devel, viro

Joel Becker wrote:
> "Shares the data extents of the source file". I should hope
> that chmod doesn't require copying out all the data.

Oh... I was under the impression that it would, because the
man-page-of-sorts says attributes must be the same and the following
question:

> Jamie Lokier wrote:
> > reflink() sounds useful already, but is there a compelling reason why
> > both files must have the same attributes, and changing attributes will
                                                  ------------------------
> > break the COW?
    --------------
>
> Yeah, because without it you can't use it for snapshotting.
  ----

If that's not true, then I change my tune substantially and like
reflink() semantics a lot more. :-)

-- Jamie
* Re: [PATCH 1/3] fs: Document the reflink(2) system call. 2009-05-05 22:11 ` Jamie Lokier @ 2009-05-05 22:24 ` Joel Becker 2009-05-05 23:14 ` Jamie Lokier 0 siblings, 1 reply; 151+ messages in thread From: Joel Becker @ 2009-05-05 22:24 UTC (permalink / raw) To: Jamie Lokier Cc: jim owens, Chris Mason, Theodore Tso, linux-fsdevel, jmorris, ocfs2-devel, viro

On Tue, May 05, 2009 at 11:11:01PM +0100, Jamie Lokier wrote:
> Joel Becker wrote:
> > "Shares the data extents of the source file". I should hope
> > that chmod doesn't require copying out all the data.
>
> Oh... I was under the impression that it would, because the
> man-page-of-sorts says attributes must be the same and the following
> question:
>
> > Jamie Lokier wrote:
> > > reflink() sounds useful already, but is there a compelling reason why
> > > both files must have the same attributes, and changing attributes will
>                                                   ------------------------
> > > break the COW?
>     --------------
> >
> > Yeah, because without it you can't use it for snapshotting.
>   ----
>
> If that's not true, then I change my tune substantially and like
> reflink() semantics a lot more. :-)

I was yeah'ing the first part. The explicit requirement of
reflink is sharing the data extents (including xattr extents). So, for
example, both the btrfs and ocfs2 implementations can
chmod/chown/utimes/etc all they want after the reflink is done, and no
data sharing is broken. The data sharing is broken only via data
modification. Both btrfs and ocfs2 will only copy the hunk modified,
leaving the rest of the file shared; you won't have long wait times for
the CoW of large files just because you modified one byte.

Joel

--
"Get right to the heart of matters.
It's the heart that matters more."

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127
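The behaviour Joel describes (attribute changes leave sharing intact, a data write copies only the hunk that was modified) can be modelled with per-extent reference counts. The following is a toy in-memory model only; the structures and function names are hypothetical and do not reflect the actual ocfs2 refcount tree or the btrfs implementation.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define NR_EXTENTS  4
#define EXTENT_SIZE 8

struct extent { int refcount; char data[EXTENT_SIZE]; };
struct toy_file { unsigned mode; struct extent *ext[NR_EXTENTS]; };

/* Create a file whose extents are private (refcount 1) and filled with 'a'. */
int toy_init(struct toy_file *f, unsigned mode)
{
    f->mode = mode;
    for (int i = 0; i < NR_EXTENTS; i++) {
        f->ext[i] = malloc(sizeof(struct extent));
        if (!f->ext[i])
            return -1;
        f->ext[i]->refcount = 1;
        memset(f->ext[i]->data, 'a', EXTENT_SIZE);
    }
    return 0;
}

/* reflink: the new file points at the same extents; nothing is copied. */
void toy_reflink(const struct toy_file *src, struct toy_file *dst)
{
    dst->mode = src->mode;
    for (int i = 0; i < NR_EXTENTS; i++) {
        dst->ext[i] = src->ext[i];
        dst->ext[i]->refcount++;
    }
}

/* Attribute change: touches the inode only, never the extents. */
void toy_chmod(struct toy_file *f, unsigned mode)
{
    f->mode = mode;
}

/* Data write: un-share just the extent containing the byte, then modify it. */
int toy_write_byte(struct toy_file *f, size_t off, char c)
{
    size_t idx = off / EXTENT_SIZE;
    if (idx >= NR_EXTENTS)
        return -1;
    struct extent *e = f->ext[idx];
    if (e->refcount > 1) {
        struct extent *copy = malloc(sizeof(*copy));
        if (!copy)
            return -1;
        memcpy(copy->data, e->data, EXTENT_SIZE);
        copy->refcount = 1;
        e->refcount--;          /* the other file keeps the old extent */
        f->ext[idx] = copy;
        e = copy;
    }
    e->data[off % EXTENT_SIZE] = c;
    return 0;
}

/* Number of extents the two files still have in common. */
int toy_shared_extents(const struct toy_file *a, const struct toy_file *b)
{
    int n = 0;
    for (int i = 0; i < NR_EXTENTS; i++)
        n += (a->ext[i] == b->ext[i]);
    return n;
}
```

In this model a chmod after the reflink leaves all four extents shared, and a one-byte write un-shares exactly one of them, which is the "only copy the hunk modified" property.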
* Re: [PATCH 1/3] fs: Document the reflink(2) system call. 2009-05-05 22:24 ` Joel Becker @ 2009-05-05 23:14 ` Jamie Lokier 0 siblings, 0 replies; 151+ messages in thread From: Jamie Lokier @ 2009-05-05 23:14 UTC (permalink / raw) To: jim owens, Chris Mason, Theodore Tso, linux-fsdevel, jmorris, ocfs2-devel, viro Joel Becker wrote: > Both btrfs and ocfs2 will only copy the hunk modified, leaving the > rest of the file shared; you won't have long wait times for the CoW > of large files just because you modified one byte. Which is brilliant and perfect, and should be near the top of any marketing materials for reflink :-) -- Jamie ^ permalink raw reply [flat|nested] 151+ messages in thread
* Re: [PATCH 1/3] fs: Document the reflink(2) system call. 2009-05-05 22:04 ` Joel Becker 2009-05-05 22:11 ` Jamie Lokier @ 2009-05-05 22:12 ` Jamie Lokier 2009-05-05 22:21 ` Joel Becker 2009-05-05 22:28 ` jim owens 2 siblings, 1 reply; 151+ messages in thread From: Jamie Lokier @ 2009-05-05 22:12 UTC (permalink / raw) To: jim owens, Chris Mason, Theodore Tso, linux-fsdevel, jmorris, ocfs2-devel, viro Joel Becker wrote: > On Tue, May 05, 2009 at 10:57:11PM +0100, Jamie Lokier wrote: > > jim owens wrote: > > > 3) the granularity of the COW (1-byte write may cause > > > 1-block up through whole file copy) is fs-dependent. > > > > And yet ENOSYS if the fs cannot implement any COW, and it isn't > > possible for userspace to duplicate the semantics by explicit copying? > > The point-in-time of the snapshot is what's important here. Don't we have a slight problem that useful point-in-time snapshots really need to snapshot whole directory trees? Otherwise you get the same inter-file inconsistency issues that you get intra-file from old fashioned copying. -- Jamie ^ permalink raw reply [flat|nested] 151+ messages in thread
* Re: [PATCH 1/3] fs: Document the reflink(2) system call. 2009-05-05 22:12 ` Jamie Lokier @ 2009-05-05 22:21 ` Joel Becker 2009-05-05 22:32 ` James Morris 0 siblings, 1 reply; 151+ messages in thread From: Joel Becker @ 2009-05-05 22:21 UTC (permalink / raw) To: Jamie Lokier Cc: Theodore Tso, jmorris, linux-fsdevel, ocfs2-devel, jim owens, Chris Mason, viro On Tue, May 05, 2009 at 11:12:36PM +0100, Jamie Lokier wrote: > Joel Becker wrote: > > The point-in-time of the snapshot is what's important here. > > Don't we have a slight problem that useful point-in-time snapshots > really need to snapshot whole directory trees? Otherwise you get the > same inter-file inconsistency issues that you get intra-file from old > fashioned copying. Snapshotting whole trees is already doable from things like btrfs and from whole volumes on your storage. This is a different beast. Inter-file is a lot easier to handle than intra-file, because you have control over that part of the process. And if your file is actually a disk image, you get snapping of disks for free :-) Joel -- "We'd better get back, `cause it'll be dark soon, and they mostly come at night. Mostly." Joel Becker Principal Software Developer Oracle E-mail: joel.becker@oracle.com Phone: (650) 506-8127 ^ permalink raw reply [flat|nested] 151+ messages in thread
* Re: [PATCH 1/3] fs: Document the reflink(2) system call. 2009-05-05 22:21 ` Joel Becker @ 2009-05-05 22:32 ` James Morris 2009-05-05 22:39 ` Joel Becker 2009-05-12 19:40 ` Jamie Lokier 0 siblings, 2 replies; 151+ messages in thread From: James Morris @ 2009-05-05 22:32 UTC (permalink / raw) To: Joel Becker Cc: Jamie Lokier, jim owens, Chris Mason, Theodore Tso, linux-fsdevel, ocfs2-devel, viro, Daniel P. Berrange On Tue, 5 May 2009, Joel Becker wrote: > And if your file is actually a disk image, you get snapping of > disks for free :-) Indeed... I think a great use-case scenario for this will be snapshotting VM images, as well as fast and space-efficient instantiation of VMs. -- James Morris <jmorris@namei.org> ^ permalink raw reply [flat|nested] 151+ messages in thread
* Re: [PATCH 1/3] fs: Document the reflink(2) system call. 2009-05-05 22:32 ` James Morris @ 2009-05-05 22:39 ` Joel Becker 2009-05-12 19:40 ` Jamie Lokier 1 sibling, 0 replies; 151+ messages in thread From: Joel Becker @ 2009-05-05 22:39 UTC (permalink / raw) To: James Morris Cc: Theodore Tso, Jamie Lokier, linux-fsdevel, Chris Mason, jim owens, Daniel P. Berrange, ocfs2-devel, viro On Wed, May 06, 2009 at 08:32:07AM +1000, James Morris wrote: > On Tue, 5 May 2009, Joel Becker wrote: > > > And if your file is actually a disk image, you get snapping of > > disks for free :-) > > Indeed... I think a great use-case scenario for this will be snapshotting > VM images, as well as fast and space-efficient instantiation of VMs. I did the initial design work (ocfs2's refcount tree structure) to support snapping VM images. I came up with the reflink() interface when I realized the structure would back space-efficient instantiation, or "shallow clones". Since then, we've been coming up with more and more fun tricks that the generic reflink() interface allows us to do. Joel -- "For every complex problem there exists a solution that is brief, concise, and totally wrong." -Unknown Joel Becker Principal Software Developer Oracle E-mail: joel.becker@oracle.com Phone: (650) 506-8127 ^ permalink raw reply [flat|nested] 151+ messages in thread
* Re: [PATCH 1/3] fs: Document the reflink(2) system call. 2009-05-05 22:32 ` James Morris 2009-05-05 22:39 ` Joel Becker @ 2009-05-12 19:40 ` Jamie Lokier 1 sibling, 0 replies; 151+ messages in thread From: Jamie Lokier @ 2009-05-12 19:40 UTC (permalink / raw) To: James Morris Cc: Joel Becker, jim owens, Chris Mason, Theodore Tso, linux-fsdevel, ocfs2-devel, viro, Daniel P. Berrange James Morris wrote: > Indeed... I think a great use-case scenario for this will be snapshotting > VM images, as well as fast and space-efficient instantiation of VMs. I agree, except beware of the illusion that atomic file snapshots mean safe VM snapshots... To snapshot a live VM safely, you need to atomically snapshot both the running state (memory and CPU) _and_ all its disk images simultaneously. Otherwise you're asking for guest filesystem corruption. reflink() won't do that by itself, but the VM implementation could use reflink() to make fast snapshots without significantly pausing a running VM. -- Jamie ^ permalink raw reply [flat|nested] 151+ messages in thread
* Re: [PATCH 1/3] fs: Document the reflink(2) system call. 2009-05-05 22:04 ` Joel Becker 2009-05-05 22:11 ` Jamie Lokier 2009-05-05 22:12 ` Jamie Lokier @ 2009-05-05 22:28 ` jim owens 2009-05-05 23:12 ` Jamie Lokier 2 siblings, 1 reply; 151+ messages in thread From: jim owens @ 2009-05-05 22:28 UTC (permalink / raw) To: Jamie Lokier, jim owens, Chris Mason, Theodore Tso, linux-fsdevel, jmorris, oc Jamie, Joel Becker wrote: > "Shares the data extents of the source file". so with that clarification, do you now agree with this > 1) is only for filesystems with COW operation, > if the fs does not support COW it returns ENOSYS. being a requirement so the user can trust that calling reflink() uses minimal space (inode/extentmap) and only a change to the file will trigger a data copy. jim ^ permalink raw reply [flat|nested] 151+ messages in thread
* Re: [PATCH 1/3] fs: Document the reflink(2) system call. 2009-05-05 22:28 ` jim owens @ 2009-05-05 23:12 ` Jamie Lokier 0 siblings, 0 replies; 151+ messages in thread From: Jamie Lokier @ 2009-05-05 23:12 UTC (permalink / raw) To: jim owens Cc: Chris Mason, Theodore Tso, linux-fsdevel, jmorris, ocfs2-devel, viro jim owens wrote: > Jamie, > > Joel Becker wrote: > > "Shares the data extents of the source file". > > so with that clarification, do you now agree with this > > > 1) is only for filesystems with COW operation, > > if the fs does not support COW it returns ENOSYS. > > being a requirement so the user can trust that calling > reflink() uses minimal space (inode/extentmap) and only Yes I do, if > a change to the file will trigger a data copy. "file" means the data, not the permissions and timestamps :-) Otherwise there's still a user trust issue, since many applications come to mind who would like to chmod/chown/futimes immediately after making the reflink, and they need to trust that the result uses minimal space. I realise now in the OCFS2/BTRFS cases this isn't an issue since changing the data only unshares a small region of the data anyway. But that's quite a difficult thing to ask of any filesystem which implements reflink(), whereas saying "attribute changes do not trigger COW" (well maybe chown/chgrp do) is reasonable for any filesystems which can implement reflink(). -- Jamie ^ permalink raw reply [flat|nested] 151+ messages in thread
* Re: [PATCH 1/3] fs: Document the reflink(2) system call. 2009-05-05 15:36 ` Jamie Lokier 2009-05-05 15:41 ` Chris Mason @ 2009-05-05 16:46 ` Jörn Engel 2009-05-05 16:54 ` Jörn Engel 2009-05-05 21:44 ` copyfile semantics Andreas Dilger 1 sibling, 2 replies; 151+ messages in thread From: Jörn Engel @ 2009-05-05 16:46 UTC (permalink / raw) To: Jamie Lokier Cc: Chris Mason, Theodore Tso, linux-fsdevel, jmorris, ocfs2-devel, viro

On Tue, 5 May 2009 16:36:29 +0100, Jamie Lokier wrote:
>
> I'm wondering why reflink() is needed at all. Can't it be done in
> userspace, using the BTRFS ioctl? The hard part in userspace seems to
> be copying the file attributes, but "cp -a" and other tools manage.
>
> What is the advantage of adding the system call for the special case
> of reflink(), when we choose not to have, say, a copyfile() system
> call which does what "cp -a" does because doing it in user space is
> good enough?

Given an ignorant filesystem, copyfile() will simply do the read/write
loop in kernelspace. So either copyfile() is just a fancy name for
splice() or copyfile() will also have to create a tempfile, rename the
tempfile when the copy is done and deal with all possible errors. And
if the system crashes, who will remove the tempfile on reboot? Will the
tempfile have a well-known name, allowing for easy DoS? Or will it be
random, causing much fun locating it after reboot?

In short, copyfile() for ignorant filesystems is a steaming load of it.
I know, I've written it [1].

When implemented in the filesystem itself, copyfile() can be quite nice.
The filesystem can create a temporary inode without visibly exposing it
to userspace. It can delete temporary inodes in journal replay after a
crash. And depending on the fs design, the read/write loop can be
replaced with finer-grained reference counting.

[1] Not a year or two ago, but in 2004, btw.

Jörn

--
Do not stop an army on its way home.
-- Sun Tzu
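The tempfile-and-rename scheme Jörn describes for an "ignorant" filesystem looks roughly like the userspace sketch below. The helper name copy_via_tempfile() is hypothetical, and this is not Jörn's 2004 code; it only illustrates the read/write loop, the randomly named tempfile, and the final rename.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Hypothetical helper: copy src to dst through a temporary file next to
 * dst, so dst only ever appears complete. If the machine crashes mid-copy
 * the randomly named tempfile is left behind, which is exactly the cleanup
 * problem raised above. Returns 0 on success, -1 with errno set.
 */
int copy_via_tempfile(const char *src, const char *dst)
{
    char tmpl[4096];
    char buf[65536];
    int in = -1, out = -1, saved;
    ssize_t n;

    if (snprintf(tmpl, sizeof(tmpl), "%s.XXXXXX", dst) >= (int)sizeof(tmpl)) {
        errno = ENAMETOOLONG;
        return -1;
    }
    in = open(src, O_RDONLY);
    if (in < 0)
        return -1;
    out = mkstemp(tmpl);            /* random suffix, created exclusively */
    if (out < 0) {
        saved = errno;
        close(in);
        errno = saved;
        return -1;
    }

    /* The plain read/write loop: every byte passes through this buffer. */
    while ((n = read(in, buf, sizeof(buf))) > 0) {
        char *p = buf;
        while (n > 0) {
            ssize_t w = write(out, p, (size_t)n);
            if (w < 0) {
                if (errno == EINTR)
                    continue;
                goto fail;
            }
            p += w;
            n -= w;
        }
    }
    if (n < 0)
        goto fail;
    if (fsync(out) < 0)
        goto fail;
    if (close(out) < 0) {
        out = -1;
        goto fail;
    }
    out = -1;
    close(in);
    in = -1;
    if (rename(tmpl, dst) < 0)      /* atomically publish the finished copy */
        goto fail;
    return 0;

fail:
    saved = errno;
    if (out >= 0)
        close(out);
    if (in >= 0)
        close(in);
    unlink(tmpl);
    errno = saved;
    return -1;
}
```

A filesystem-internal implementation can avoid both the visible tempfile and the data copy, which is the contrast drawn in the last paragraph of the message.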
* Re: [PATCH 1/3] fs: Document the reflink(2) system call. 2009-05-05 16:46 ` Jörn Engel @ 2009-05-05 16:54 ` Jörn Engel 2009-05-05 22:03 ` Jamie Lokier 2009-05-05 21:44 ` copyfile semantics Andreas Dilger 1 sibling, 1 reply; 151+ messages in thread From: Jörn Engel @ 2009-05-05 16:54 UTC (permalink / raw) To: Jamie Lokier Cc: Chris Mason, Theodore Tso, linux-fsdevel, jmorris, ocfs2-devel, viro

On Tue, 5 May 2009 18:46:19 +0200, Jörn Engel wrote:
>
> And depending on the fs design, the read/write loop can be
> replaced with finer-grained reference counting.

And maybe finer-grained reference counting should be a requirement for
copyfile/cowlink/reflink or whatever we call it. With a large file on
slow media, open("foo", O_RDWR); should still return in a reasonable
amount of time. Not after ten minutes.

Jörn

--
Fancy algorithms are slow when n is small, and n is usually small.
Fancy algorithms have big constants. Until you know that n is
frequently going to be big, don't get fancy.
-- Rob Pike
* Re: [PATCH 1/3] fs: Document the reflink(2) system call. 2009-05-05 16:54 ` Jörn Engel @ 2009-05-05 22:03 ` Jamie Lokier 0 siblings, 0 replies; 151+ messages in thread From: Jamie Lokier @ 2009-05-05 22:03 UTC (permalink / raw) To: Jörn Engel Cc: Chris Mason, Theodore Tso, linux-fsdevel, jmorris, ocfs2-devel, viro

Jörn Engel wrote:
> On Tue, 5 May 2009 18:46:19 +0200, Jörn Engel wrote:
> > And depending on the fs design, the read/write loop can be
> > replaced with finer-grained reference counting.
>
> And maybe finer-grained reference counting should be a requirement for
> copyfile/cowlink/reflink or whatever we call it. With a large file on
> slow media, open("foo", O_RDWR); should still return in a reasonable
> amount of time. Not after ten minutes.

Or 8 hours, which is how long it took me to copy a really large file
last time... Oh, and are open() or write() on regular files
interruptible per POSIX? Didn't think so :-)

Fortunately BTRFS does do fine-grained extent sharing, and reflink, so
it should work ok on BTRFS.

-- Jamie
* Re: copyfile semantics. 2009-05-05 16:46 ` Jörn Engel 2009-05-05 16:54 ` Jörn Engel @ 2009-05-05 21:44 ` Andreas Dilger 2009-05-05 21:48 ` Matthew Wilcox ` (2 more replies) 1 sibling, 3 replies; 151+ messages in thread From: Andreas Dilger @ 2009-05-05 21:44 UTC (permalink / raw) To: Jörn Engel Cc: Theodore Tso, Jamie Lokier, jmorris, ocfs2-devel, linux-fsdevel, Chris Mason, viro

On May 05, 2009 18:46 +0200, Jörn Engel wrote:
> On Tue, 5 May 2009 16:36:29 +0100, Jamie Lokier wrote:
> > What is the advantage of adding the system call for the special case
> > of reflink(), when we choose not to have, say, a copyfile() system
> > call which does what "cp -a" does because doing it in user space is
> > good enough?
>
> Given an ignorant filesystem, copyfile() will simply do the read/write
> loop in kernelspace. So either copyfile() is just a fancy name for
> splice()

Sure, except splice() (AFAIK) doesn't allow a splice between two regular
files, only between a pipe and a file. Maybe it has changed since the
last time I looked.

On high performance filesystems the copy_to_user() and copy_from_user()
can be a major limiting factor on IO performance, and it is getting more
significant because the single-core performance is not improving at all.
At 1GB/s just a single copy_{to,from}_user (read or write) will consume
40% of a single core.

If it is possible to use splice() to copy between two regular files then
that is great. Does anything (e.g. cp) actually use this yet?

> or copyfile() will also have to create a tempfile, rename the
> tempfile when the copy is done and deal with all possible errors. And
> if the system crashes, who will remove the tempfile on reboot? Will the
> tempfile have a well-known name, allowing for easy DoS? Or will it be
> random, causing much fun locating it after reboot.

Maybe I'm missing something, but why do we need a tempfile at all?
I can't imagine that people expect atomic semantics for copyfile(),
any more than they expect atomic semantics for "cp" in the face of a
crash.

> When implemented in the filesystem itself, copyfile() can be quite nice.
> The filesystem can create a temporary inode without visibly exposing it
> to userspace. It can delete temporary inodes in journal replay after a
> crash. And depending on the fs design, the read/write loop can be
> replaced with finer-grained reference counting.

I would think that copyfile() is of primary interest when it involves
a network filesystem, so there is no need to ship data to the client
doing the copy at all. This is possible for NFS and CIFS protocol today,
AFAIK. The problem with splice is that the filesystem only knows about
->splice_read() and ->splice_write(), it doesn't have any opportunity
to optimize this further (e.g. by sending a "copyfile" RPC, or implementing
a reflink or whatever).

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
* Re: copyfile semantics. 2009-05-05 21:44 ` copyfile semantics Andreas Dilger @ 2009-05-05 21:48 ` Matthew Wilcox 2009-05-05 22:25 ` Trond Myklebust 2009-05-05 22:06 ` Jamie Lokier 2009-05-06 5:57 ` Jörn Engel 2 siblings, 1 reply; 151+ messages in thread From: Matthew Wilcox @ 2009-05-05 21:48 UTC (permalink / raw) To: Andreas Dilger Cc: Jörn Engel, Jamie Lokier, Chris Mason, Theodore Tso, linux-fsdevel, jmorris, ocfs2-devel, viro

On Tue, May 05, 2009 at 03:44:54PM -0600, Andreas Dilger wrote:
> > When implemented in the filesystem itself, copyfile() can be quite nice.
> > The filesystem can create a temporary inode without visibly exposing it
> > to userspace. It can delete temporary inodes in journal replay after a
> > crash. And depending on the fs design, the read/write loop can be
> > replaced with finer-grained reference counting.
>
> I would think that copyfile() is of primary interest when it involves
> a network filesystem, so there is no need to ship data to the client
> doing the copy at all. This is possible for NFS and CIFS protocol today,
> AFAIK. The problem with splice is that the filesystem only knows about
> ->splice_read() and ->splice_write(), it doesn't have any opportunity
> to optimize this further (e.g. by sending a "copyfile" RPC, or implementing
> a reflink or whatever).

Do you mean NFSv4? I don't know of a way to do it with traditional NFS.

--
Matthew Wilcox
Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
* Re: copyfile semantics. 2009-05-05 21:48 ` Matthew Wilcox @ 2009-05-05 22:25 ` Trond Myklebust 0 siblings, 0 replies; 151+ messages in thread From: Trond Myklebust @ 2009-05-05 22:25 UTC (permalink / raw) To: Matthew Wilcox Cc: Andreas Dilger, Jörn Engel, Jamie Lokier, Chris Mason, Theodore Tso, linux-fsdevel, jmorris, ocfs2-devel, viro

On Tue, 2009-05-05 at 15:48 -0600, Matthew Wilcox wrote:
> On Tue, May 05, 2009 at 03:44:54PM -0600, Andreas Dilger wrote:
> > > When implemented in the filesystem itself, copyfile() can be quite nice.
> > > The filesystem can create a temporary inode without visibly exposing it
> > > to userspace. It can delete temporary inodes in journal replay after a
> > > crash. And depending on the fs design, the read/write loop can be
> > > replaced with finer-grained reference counting.
> >
> > I would think that copyfile() is of primary interest when it involves
> > a network filesystem, so there is no need to ship data to the client
> > doing the copy at all. This is possible for NFS and CIFS protocol today,
> > AFAIK. The problem with splice is that the filesystem only knows about
> > ->splice_read() and ->splice_write(), it doesn't have any opportunity
> > to optimize this further (e.g. by sending a "copyfile" RPC, or implementing
> > a reflink or whatever).
>
> Do you mean NFSv4? I don't know of a way to do it with traditional NFS.

It is expected to be a feature of NFSv4.2. There is a proposal currently
winding its way through the IETF that can handle both copyfile() and
reflink() semantics.

I can help to relay the input from this discussion to the people that
are drafting the IETF proposal to ensure that the Linux community
concerns get heard.

Cheers
Trond
* Re: copyfile semantics. 2009-05-05 21:44 ` copyfile semantics Andreas Dilger 2009-05-05 21:48 ` Matthew Wilcox @ 2009-05-05 22:06 ` Jamie Lokier 2009-05-06 5:57 ` Jörn Engel 2 siblings, 0 replies; 151+ messages in thread From: Jamie Lokier @ 2009-05-05 22:06 UTC (permalink / raw) To: Andreas Dilger Cc: Jörn Engel, Chris Mason, Theodore Tso, linux-fsdevel, jmorris, ocfs2-devel, viro Andreas Dilger wrote: > If it is possible to use splice() to copy between two regular files then > that is great. Does anything (e.g. cp) actually use this yet? It's mentioned earlier in this thread that BTRFS has an ioctl() for copying parts of files into other files, and will share the data between both files. With a bit of plumbing, splice() could probably be persuaded to call the mechanism which BTRFS provides. -- Jamie ^ permalink raw reply [flat|nested] 151+ messages in thread
* Re: copyfile semantics. 2009-05-05 21:44 ` copyfile semantics Andreas Dilger 2009-05-05 21:48 ` Matthew Wilcox 2009-05-05 22:06 ` Jamie Lokier @ 2009-05-06 5:57 ` Jörn Engel 2 siblings, 0 replies; 151+ messages in thread From: Jörn Engel @ 2009-05-06 5:57 UTC (permalink / raw) To: Andreas Dilger Cc: Jamie Lokier, Chris Mason, Theodore Tso, linux-fsdevel, jmorris, ocfs2-devel, viro

On Tue, 5 May 2009 15:44:54 -0600, Andreas Dilger wrote:
>
> > or copyfile() will also have to create a tempfile, rename the
> > tempfile when the copy is done and deal with all possible errors. And
> > if the system crashes, who will remove the tempfile on reboot? Will the
> > tempfile have a well-known name, allowing for easy DoS? Or will it be
> > random, causing much fun locating it after reboot.
>
> Maybe I'm missing something, but why do we need a tempfile at all?
> I can't imagine that people expect atomic semantics for copyfile(),
> any more than they expect atomic semantics for "cp" in the face of a
> crash.

In the case of cowlink() a tempfile is required when breaking the link.
Otherwise open() can result in the file disappearing or being truncated.
Rather unexpected.

If copyfile() doesn't try to be smart and does the actual copy when
being called, I could certainly live with half-written files.

Jörn

--
"Security vulnerabilities are here to stay."
-- Scott Culp, Manager of the Microsoft Security Response Center, 2001
* Re: [PATCH 1/3] fs: Document the reflink(2) system call. 2009-05-05 13:19 ` Jamie Lokier 2009-05-05 13:39 ` Chris Mason @ 2009-05-05 14:21 ` Theodore Tso 2009-05-05 15:32 ` Jamie Lokier 2009-05-05 22:49 ` James Morris 2009-05-05 17:05 ` Joel Becker 2 siblings, 2 replies; 151+ messages in thread From: Theodore Tso @ 2009-05-05 14:21 UTC (permalink / raw) To: Jamie Lokier; +Cc: linux-fsdevel, jmorris, ocfs2-devel, viro On Tue, May 05, 2009 at 02:19:07PM +0100, Jamie Lokier wrote: > There was an attempt at something like that for ext3 a year or two ago. > Search for "cowlink" if you're interested. Yeah, I remember that discussion. The hard part was always the VM infrastructure, not the fs metadata. > Instead of a circular list, a proposed implementation was to create a > separate "host" inode on the first reflink, converting the source > inode to a reflink inode and moving the data block references to the > new host inode. Each reflink was simply a reference to the host > inode, much like your design, and the host inode was only to hold the > data blocks, with its i_nlink counting the number of reflinks > pointing to it. > > Using a circular list means the space must be reserved in every inode, > even those which are not (yet) reflinks. It also does a bit more > writing sometimes, because of having to update next and previous > entries on the list. It's a tradeoff. If you use a separate "host" inode on the first reflink, then you burn 3 inodes instead of two for a file which is "copied"/"reflinked" once. The only reason why we need to reserve an extra field in the inode structure is for the pointer from the "host" inode to the circular linked list. (The space for the circular linked list gets stored in i_data in the reflink inodes.) 
If we are using 256 byte inodes we have the space to spare --- and if we really cared about not utilizing the space in the inode structure if it wasn't necessary, it could always be stored as an extended attribute (although that has a greater overhead). The question of which of these design tradeoffs is preferable is really one of how many inodes will get copied via reflinks, and how many times a particular inode will be copied by a reflink. If it is common (for example, in a virtualization or container workload) for a single file to be copied via reflink 50 or 100 times, then the extra inode created when you create the first reflink is no big deal. If most of the time a file is only going to be reflink'ed once or twice, then the overhead is much bigger. This is really a design detail, though. The bigger questions, which we really need to answer, are: 1) If someone other than the owner of a file uses reflink to "make a copy" of the file, is the new inode, with the new inode number, owned by the original owner (making it look more like a link), or owned by the person creating the reflink (making it look more like a copy)? 2) Does changing the metadata --- atime, user/group ownership, ctime, etc., break the COW link and cause a copy? (2) could be a per-filesystem implementation detail, but (1) goes to the semantics of how the reflink() system call will work, so I think we need to have a common answer which is the same across all filesystems. 
Maybe some filesystems could simply refuse to let a user who isn't the owner create a reflink. But saying that some filesystems might require CAP_FOWNER (because the inode will be created with the owner's uid) still leaves a problem. With a setuid binary, or on a system with fine-grained capability support where a user with a non-zero UID has CAP_FOWNER, it would be unfortunate if a file owned by uid 23, when copied via reflink by uid 45 with CAP_FOWNER privs, on some filesystems created a reflinked inode whose st_uid is 23 when stat'ed, and on other filesystems created a reflinked inode whose st_uid is 45. - Ted ^ permalink raw reply [flat|nested] 151+ messages in thread
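[Editorial sketch, not part of the original message.] The inode-count tradeoff between the two layouts can be worked through with a little arithmetic. The function names below are hypothetical and model only the counting argument from this message (three inodes versus two after one reflink); they are not ext4 or kernel interfaces.

```c
/*
 * Toy accounting for the two layouts discussed in the thread.
 * "copies" is the number of reflinks made of one original file.
 * All names are hypothetical; nothing here is a kernel interface.
 */

/* Separate "host" inode: original + each reflink + one host holding the blocks. */
unsigned long inodes_separate_host(unsigned long copies)
{
	return copies == 0 ? 1 : copies + 2;
}

/* Circular list: the original keeps the blocks, so only the reflinks are added. */
unsigned long inodes_circular_list(unsigned long copies)
{
	return copies + 1;
}

/* Extra inodes used by the separate-host layout, as a percentage. */
double host_overhead_percent(unsigned long copies)
{
	unsigned long base = inodes_circular_list(copies);

	return 100.0 * (double)(inodes_separate_host(copies) - base) / (double)base;
}
```

With one reflink the extra host inode costs 50% more inodes; with 100 reflinks the same single extra inode is under 1%, which is the point made above about virtualization workloads.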
* Re: [PATCH 1/3] fs: Document the reflink(2) system call. 2009-05-05 14:21 ` [PATCH 1/3] fs: Document the reflink(2) system call Theodore Tso @ 2009-05-05 15:32 ` Jamie Lokier 2009-05-05 22:49 ` James Morris 1 sibling, 0 replies; 151+ messages in thread From: Jamie Lokier @ 2009-05-05 15:32 UTC (permalink / raw) To: Theodore Tso; +Cc: linux-fsdevel, jmorris, ocfs2-devel, viro Theodore Tso wrote: > 1) If someone other than the owner of a file uses reflink to "make a > copy" of the file, is the new inode, with the new inode number, owned > by the original owner (making it look more like a link), or owned by > the person creating the reflink (making it look more like a copy)? > > 2) Does changing the metadata --- atime, user/group ownership, ctime, > etc., break the COW link and cause a copy? > > (2) could be a per-filesystem implementation detail, but (1) goes to > the semantics of how the reflink() system call will work, so I > think we need to have a common answer which is the same across all > filesystems. I agree on both. user/group ownership seems to raise the most questions. Can we settle the hopefully simpler question of ctime, atime, mode changes? -- Jamie ^ permalink raw reply [flat|nested] 151+ messages in thread
* Re: [PATCH 1/3] fs: Document the reflink(2) system call. 2009-05-05 14:21 ` [PATCH 1/3] fs: Document the reflink(2) system call Theodore Tso 2009-05-05 15:32 ` Jamie Lokier @ 2009-05-05 22:49 ` James Morris 1 sibling, 0 replies; 151+ messages in thread From: James Morris @ 2009-05-05 22:49 UTC (permalink / raw) To: Theodore Tso Cc: Jamie Lokier, linux-fsdevel, ocfs2-devel, viro, linux-security-module On Tue, 5 May 2009, Theodore Tso wrote: > The bigger questions, which we really need to answer, are: > > 1) If someone other than the owner of a file uses reflink to "make a > copy" of the file, is the new inode, with the new inode number, owned > by the original owner (making it look more like a link), or owned by > the person creating the reflink (making it look more like a copy)? Changing the owner fundamentally changes the character of the call (certainly, the SELinux security logic would be quite different), and I think application writers would often be asking "what type of reflink call am I supposed to be using here?", and possibly getting it wrong much of the time. It might be better to create a separate syscall for the copy case, with its own distinct semantics, if it is desired. - James -- James Morris <jmorris@namei.org> ^ permalink raw reply [flat|nested] 151+ messages in thread
* Re: [PATCH 1/3] fs: Document the reflink(2) system call. 2009-05-05 13:19 ` Jamie Lokier 2009-05-05 13:39 ` Chris Mason 2009-05-05 14:21 ` [PATCH 1/3] fs: Document the reflink(2) system call Theodore Tso @ 2009-05-05 17:05 ` Joel Becker 2 siblings, 0 replies; 151+ messages in thread From: Joel Becker @ 2009-05-05 17:05 UTC (permalink / raw) To: Jamie Lokier; +Cc: linux-fsdevel, Theodore Tso, jmorris, ocfs2-devel, viro On Tue, May 05, 2009 at 02:19:07PM +0100, Jamie Lokier wrote: > There was an attempt at something like that for ext3 a year or two ago. > Search for "cowlink" if you're interested. Yeah, I discussed those with Jörn Engel after my talk at LSF - I hadn't heard of them before. cowlinks actually changed the semantics of link(2). This does not do that. > Instead of a circular list, a proposed implementation was to create a > separate "host" inode on the first reflink, converting the source > inode to a reflink inode and moving the data block references to the > new host inode. Each reflink was simply a reference to the host > inode, much like your design, and the host inode was only to hold the > data blocks, with its i_nlink counting the number of reflinks > pointing to it. Reflinks are not cowlinks. reflinks are new files (new inodes in most implementations I expect) that only share the *data extents* in a CoW fashion. Maybe reading the wiki details of the ocfs2 implementation and so on would be helpful? [Overview] http://wiki.us.oracle.com/calpg/OCFS2Reflink [ocfs2 Implementation] http://oss.oracle.com/osswiki/OCFS2/DesignDocs/RefcountTrees [reflink() Itself] http://oss.oracle.com/osswiki/OCFS2/DesignDocs/ReflinkOperation [Use Cases] http://oss.oracle.com/osswiki/OCFS2/DesignDocs/ReflinkUses Joel -- "Every day I get up and look through the Forbes list of the richest people in America. If I'm not there, I go to work." 
- Robert Orben Joel Becker Principal Software Developer Oracle E-mail: joel.becker@oracle.com Phone: (650) 506-8127 ^ permalink raw reply [flat|nested] 151+ messages in thread
* Re: [PATCH 1/3] fs: Document the reflink(2) system call. 2009-05-05 13:01 ` Theodore Tso 2009-05-05 13:19 ` Jamie Lokier @ 2009-05-05 17:00 ` Joel Becker 2009-05-05 17:29 ` Theodore Tso 2009-05-05 22:30 ` Jamie Lokier 1 sibling, 2 replies; 151+ messages in thread From: Joel Becker @ 2009-05-05 17:00 UTC (permalink / raw) To: Theodore Tso; +Cc: Jamie Lokier, linux-fsdevel, jmorris, ocfs2-devel, viro On Tue, May 05, 2009 at 09:01:14AM -0400, Theodore Tso wrote: > I guess it depends on your implementation. At least the way I would > implement this in ext4, for example, I'd simply set a new flag > indicating this was a "reflink", and then the i_data[0..3] field would > contain the inode number of the "host" inode, and i_data [4..7] and > i_data[8..11] would contain a circular linked list of all reflinks > associated with that inode. I'd then grab a spare inode field so the > "host" inode could point to the reflink'ed inodes. > > If you ever need to delete the host inode, you simply pick one of the > reflink inodes and copy i_data from the host inode one of the reflink > inodes and promote it to be the "host" inode, and then update all of > the other reflink inodes to point at the new host inode. > > The advantage of this scheme is not only does the reflink'ed inode > have a new inode number (as in your design), it actually has an > entirely new inode. So we can change the ownership, the mtime, ctime; > it behaves *entirely* as a separate, free-standing inode except it is > sharing the data blocks. > > This allows me to easily set a new owner, and indeed any other inode > metadata, on the reflink'ed inode, which I would argue is a Good > Thing. > > I'm guessing that OCFS2 has implemented (or is planning on > implementing) reflinks, you can't modify the metadata? Or is there > some really important reason why it's not a good idea for OCFS2? I think I'm confusing you. 
ocfs2 creates a new inode, with a new tree of extent blocks, pointing to the same data extents as the source. You can do *anything* POSIX to that new inode. You can chown it, chmod it, truncate it, futimes it, whatever. The only thing at issue is what the state of the inode is at the return of the reflink() call. I'm not defining reflink() as "creates a new inode" because I can see something like btrfs using the same storage inode with a new inode number until it needs to CoW. But from the user-visible perspective, that's exactly what happens. Joel -- Life's Little Instruction Book #347 "Never waste the oppourtunity to tell someone you love them." Joel Becker Principal Software Developer Oracle E-mail: joel.becker@oracle.com Phone: (650) 506-8127 ^ permalink raw reply [flat|nested] 151+ messages in thread
* Re: [PATCH 1/3] fs: Document the reflink(2) system call. 2009-05-05 17:00 ` Joel Becker @ 2009-05-05 17:29 ` Theodore Tso 2009-05-05 22:36 ` Jamie Lokier 2009-05-05 22:30 ` Jamie Lokier 1 sibling, 1 reply; 151+ messages in thread From: Theodore Tso @ 2009-05-05 17:29 UTC (permalink / raw) To: Jamie Lokier, linux-fsdevel, jmorris, ocfs2-devel, viro On Tue, May 05, 2009 at 10:00:58AM -0700, Joel Becker wrote: > On Tue, May 05, 2009 at 09:01:14AM -0400, Theodore Tso wrote: > > I'm guessing that OCFS2 has implemented (or is planning on > > implementing) reflinks, you can't modify the metadata? Or is there > > some really important reason why it's not a good idea for OCFS2? > > I think I'm confusing you. ocfs2 creates a new inode, with a > new tree of extent blocks, pointing to the same data extents as the > source. You can do *anything* POSIX to that new inode. You can chown > it, chmod it, truncate it, futimes it, whatever. The only thing at > issue is what the state of the inode is at the return of the reflink() > call. OK, cool. But in that case, if in every user-visible sense of the word, it's equivalent to a file copy --- which is to say, it gets a new inode number, and, then why not make it work *exactly* like a file copy, which is to say make the ownership be the user who asked for the reflink to be created? That way /bin/cp could potentially use reflinks, and aside from the fact that a cp -r of an existing directory hierarchy takes no extra disk space and runs *much* faster, a reflink acts exactly like a file copy. The semantics are easy to describe, we don't need CAP_FOWNER nonsense, it becomes much easier to deal with the semantics vis-a-vis quota, etc. > I'm not defining reflink() as "creates a new inode" because I > can see something like btrfs using the same storage inode with a new > inode number until it needs to CoW. But from the user-visible > perspective, that's exactly what happens. 
Well, we can talk about inodes even for filesystems like FAT that don't really have inodes; the user-visible perspective is the only thing that we really care when we try to define the semantics of the system call in a way that causes the least amount of surprise; given that the new file gets a new inode number, it is *not* a hard link, and it looks much more like a file copy. - Ted ^ permalink raw reply [flat|nested] 151+ messages in thread
* Re: [PATCH 1/3] fs: Document the reflink(2) system call. 2009-05-05 17:29 ` Theodore Tso @ 2009-05-05 22:36 ` Jamie Lokier 0 siblings, 0 replies; 151+ messages in thread From: Jamie Lokier @ 2009-05-05 22:36 UTC (permalink / raw) To: Theodore Tso; +Cc: linux-fsdevel, jmorris, ocfs2-devel, viro Theodore Tso wrote: > But in that case, if in every user-visible sense of the > word, it's equivalent to a file copy --- which is to say, it gets a > new inode number, and, then why not make it work *exactly* like a file > copy, which is to say make the ownership be the user who asked for the > reflink to be created? That way /bin/cp could potentially use > reflinks, and aside from the fact that a cp -r of an existing > directory hierarchy takes no extra disk space and runs *much* faster, > a reflink acts exactly like a file copy. The semantics are easy to > describe, we don't need CAP_FOWNER nonsense, it becomes much easier to > deal with the semantics vis-a-vis quota, etc. reflink() seems to be designed to copy a file _and_ clone the file's attributes exactly, and to do it all atomically. So how about relaxing a bit and, since reflinkat() takes flags, giving it a flag to make cloning the attributes optional. I imagine there's little implementation difference between cloning the attributes and giving it new file attributes, and both behaviours are useful for different things. -- Jamie ^ permalink raw reply [flat|nested] 151+ messages in thread
* Re: [PATCH 1/3] fs: Document the reflink(2) system call. 2009-05-05 17:00 ` Joel Becker 2009-05-05 17:29 ` Theodore Tso @ 2009-05-05 22:30 ` Jamie Lokier 2009-05-05 22:37 ` Joel Becker 2009-05-05 23:08 ` jim owens 1 sibling, 2 replies; 151+ messages in thread From: Jamie Lokier @ 2009-05-05 22:30 UTC (permalink / raw) To: Theodore Tso, linux-fsdevel, jmorris, ocfs2-devel, viro Joel Becker wrote: > I think I'm confusing you. ocfs2 creates a new inode, with a > new tree of extent blocks, pointing to the same data extents as the > source. You can do *anything* POSIX to that new inode. You can chown > it, chmod it, truncate it, futimes it, whatever. The only thing at > issue is what the state of the inode is at the return of the reflink() > call. Ok, but does chown/chmod/futimes trigger a COW copy, unsharing the data? This is still not clear. :-) Behaviourally, whether a massive copy is triggered by chmod is quite a significant thing. It dictates whether programs and scripts should be careful to avoid chmod on reflinked files because it may be very expensive (think chmod triggering a 200GB copy), or can do so cheaply. > I'm not defining reflink() as "creates a new inode" because I > can see something like btrfs using the same storage inode with a new > inode number until it needs to CoW. But from the user-visible > perspective, that's exactly what happens. I'm still not clear from the above explanation whether full data unsharing (i.e. it's all copied, takes a long time, can trigger ENOSPC) happens on chown/chmod etc. But assuming it stays shared until you modify the actual data, could the documentation make this important fact a bit more prominent: reflink() creates a new file which initially shares the same underlying data storage as the source file, and has all the same attributes including security context and extended attributes. After creating the new file, you can do *anything* POSIX to that new file. 
You can chown it, chmod it, futimes it, truncate it, write to it, whatever. When the data is modified, that will trigger a copy-on-write operation so that the underlying data is not completely shared any more. The amount and timing of copying is filesystem-dependent, but only happens when a data write or extended attribute change takes place. Opening a file, reading it, read-only or private mappings, and simple attribute updates (chown, chmod, futimes, as well as automatic atime updates) will not trigger copy-on-write and will not return ENOSPC errors. Thanks, -- Jamie ^ permalink raw reply [flat|nested] 151+ messages in thread
* Re: [PATCH 1/3] fs: Document the reflink(2) system call. 2009-05-05 22:30 ` Jamie Lokier @ 2009-05-05 22:37 ` Joel Becker 2009-05-05 23:08 ` jim owens 1 sibling, 0 replies; 151+ messages in thread From: Joel Becker @ 2009-05-05 22:37 UTC (permalink / raw) To: Jamie Lokier; +Cc: linux-fsdevel, Theodore Tso, jmorris, ocfs2-devel, viro On Tue, May 05, 2009 at 11:30:16PM +0100, Jamie Lokier wrote: > Joel Becker wrote: > > I think I'm confusing you. ocfs2 creates a new inode, with a > > new tree of extent blocks, pointing to the same data extents as the > > source. You can do *anything* POSIX to that new inode. You can chown > > it, chmod it, truncate it, futimes it, whatever. The only thing at > > issue is what the state of the inode is at the return of the reflink() > > call. > > Ok, but does chown/chmod/futimes trigger a COW copy, unsharing the data? > This is still not clear. :-) No, of course it doesn't. That would be awful! > But assuming it stays shared until you modify the actual data, could > the documentation make this important fact a bit more prominent: > > reflink() creates a new file which initially shares the same > underlying data storage as the source file, and has all the same > attributes including security context and extended attributes. > > After creating the new file, you can do *anything* POSIX to that > new file. You can chown it, chmod it, futimes it, truncate it, > write to it, whatever. When the data is modified, that will > trigger a copy-on-write operation so that the underlying data is > not completely shared any more. > > The amount and timing of copying is filesystem-dependent, but only > happens when a data write or extended attribute change takes place. > > Opening a file, reading it, read-only or private mappings, and > simple attribute updates (chown, chmod, futimes, as well as > automatic atime updates) will not trigger copy-on-write and will > not return ENOSPC errors. You got it. 
Joel -- "In the room the women come and go Talking of Michaelangelo." Joel Becker Principal Software Developer Oracle E-mail: joel.becker@oracle.com Phone: (650) 506-8127 ^ permalink raw reply [flat|nested] 151+ messages in thread
* Re: [PATCH 1/3] fs: Document the reflink(2) system call. 2009-05-05 22:30 ` Jamie Lokier 2009-05-05 22:37 ` Joel Becker @ 2009-05-05 23:08 ` jim owens 1 sibling, 0 replies; 151+ messages in thread From: jim owens @ 2009-05-05 23:08 UTC (permalink / raw) To: Jamie Lokier, joel.becker Cc: Theodore Tso, linux-fsdevel, jmorris, ocfs2-devel, viro Jamie Lokier wrote: > But assuming it stays shared until you modify the actual data, could > the documentation make this important fact a bit more prominent: > Opening a file, reading it, read-only or private mappings, and > simple attribute updates (chown, chmod, futimes, as well as > automatic atime updates) will not trigger copy-on-write and will > not return ENOSPC errors. almost... more like: automatic atime updates) will not trigger file data copy-on-write and will not return ENOSPC errors unless the filesystem would have returned ENOSPC if the file had no reflink. filesystems such as btrfs that COW metadata changes can generate ENOSPC on any attribute update! jim ^ permalink raw reply [flat|nested] 151+ messages in thread
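[Editorial sketch, not part of the original message.] The wording proposed by Jamie Lokier, together with jim owens's caveat, can be written down as a small decision table. The enum and function names are hypothetical, and the table encodes only what the two messages state; it does not describe the behaviour of any particular filesystem.

```c
#include <stdbool.h>

/* Operations named in the proposed documentation text (hypothetical enum). */
enum reflink_op {
	OP_OPEN,
	OP_READ,
	OP_PRIVATE_MMAP,
	OP_CHOWN,
	OP_CHMOD,
	OP_FUTIMES,
	OP_ATIME_UPDATE,
	OP_DATA_WRITE,
	OP_XATTR_CHANGE
};

/* Per the proposed text: copying happens only on a data write or xattr change. */
bool may_trigger_copy(enum reflink_op op)
{
	return op == OP_DATA_WRITE || op == OP_XATTR_CHANGE;
}

/*
 * Per the caveat: an attribute update can still hit ENOSPC when the
 * filesystem copy-on-writes its own metadata, reflink or not.
 */
bool may_return_enospc(enum reflink_op op, bool fs_cows_metadata)
{
	if (may_trigger_copy(op))
		return true;

	switch (op) {
	case OP_CHOWN:
	case OP_CHMOD:
	case OP_FUTIMES:
	case OP_ATIME_UPDATE:
		return fs_cows_metadata;
	default:
		return false;
	}
}
```

For example, `may_return_enospc(OP_CHMOD, true)` is true while `may_trigger_copy(OP_CHMOD)` stays false, which is exactly the distinction the caveat draws.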
* Re: [PATCH 1/3] fs: Document the reflink(2) system call. 2009-05-05 7:16 ` Joel Becker 2009-05-05 8:09 ` Andreas Dilger 2009-05-05 13:01 ` Theodore Tso @ 2009-05-05 13:01 ` Jamie Lokier 2009-05-05 17:09 ` Joel Becker 2 siblings, 1 reply; 151+ messages in thread From: Jamie Lokier @ 2009-05-05 13:01 UTC (permalink / raw) To: linux-fsdevel, jmorris, ocfs2-devel, viro Joel Becker wrote: > On Tue, May 05, 2009 at 02:07:03AM +0100, Jamie Lokier wrote: > > Joel Becker wrote: > > > +All file attributes and extended attributes of the new file must be > > > +identical to the source file with the following exceptions: > > > > reflink() sounds useful already, but is there a compelling reason why > > both files must have the same attributes, and changing attributes will > > break the COW? > > Yeah, because without it you can't use it for snapshotting. > That's where the original design came from - inode snapshots. The big > thing that excited me was that defining reflink() as I did, instead of > a more specific snapshot call, allows all sorts of generic uses (some of > which you outline below). > If reflink() creates a snapshot, you can then break it to make > things a little different. But if it changes things, you can never > change them back. > > > Being able to have different attributes would allow: > > > > - reflink() to be used for fast space-efficient copying, i.e. an > > optimisation to "cp", "git checkout" and things like that. > > It can right now, just not of other people's files. Actually, > the only real difficulty with doing it to other people's files is quota. > But I can't come up with a way to prevent quota DoS. > Here's another fun trick. Overwriting rsync, instead of copying > blocks from the already-existing source could reflink the source to the > .temporary, then only write the changed blocks. And since you own both > files, it just works. If you're overwriting someone else's file? The > old copy behavior is fine. 
The moment rsync overwrites a single block, the whole reflink file will be copied by the filesystem, and then rsync will overwrite other blocks in the copy. So I would think it's more efficient for rsync to do what it's always done instead, and just copy those parts of the file which are not changed. (It needs to read the whole file anyway for checksumming, unless you have a filesystem trick planned to avoid that :-) If you made splice() share file extents when cloning data from one file to another, that would really accelerate rsync and do a better job of reducing storage...) > > - reflink() to be used for merging files with identical contents > > (something I find surprisingly often on my disks). > > > > - reflink() to be used for merging files from different > > cgroup-style VMs in particular. > > While it would be great to have a way to do this, reflink() is > not the way. It's really simple to understand with its link-like > semantic, and I see no point in making it a seven-different-operation > kitchen sink call. That's hand-waving away. I'm thinking of it doing _one_ simple thing: copy the file with a COW implementation, which happens to be versatile in its consequences. It's not a kitchen sink call. I.e. what the ext3 cowlink() call partially implemented a year or two ago did. In some ways reflink() is more complicated to understand than cowlink(), because of reflink making chown and chmod have potentially heavy side effects. > > Requiring all attributes except nlink and ino to be identical makes > > reflink() unsuitable for transparently doing those things, except in > > cases where they happen to have the same attributes anyway. > > We've had a lot of fun thinking up many uses for reflink(), and > almost all of them are within the context of one's own files. Sure. > > I'm thinking particularly of file permissions, owner/group and atime. > > People do cp -p all the time. I don't see how keeping those > things the same will break anything. 
It's a new call, not an existing > semantic. Some people do "chmod -R a-w" all the time after copying a tree for snapshotting, so they don't accidentally modify files later when viewing them in a text editor :-) (I'm thinking of the old days, when we edited kernel trees using "cp -rl" to make snapshots) Thinking about it, with reflink snapshots, it would be annoying to be unable to write-protect the snapshots. > > Since each reflink has its own nlink and ino, I'm wondering why the > > other attributes cannot also be separate. (I realise extended > > attributes complicate the picture and it's desirable to share them, > > especially if they are large). > > The biggest reason is snapshotting. The second biggest reason > is a simple to understand call. "Everything is identical except those > things that *have* to be different". I'm not clear about something. Will "chmod XXX reflinked-file" change the permissions of both files (like hard-linked files), or will it trigger a data copy (like lazy cp -a)? I think "chmod XXX reflinked-file" is simpler to understand if it doesn't trigger a copy as side effect. (Especially as the copy may take a long time and/or ENOSPC - things you don't expect from "chmod"). What if you want to change the permissions of both reflinks - do you have to recreate them? > > But is there an efficient way for reflink-aware applications to detect > > these files have the same contents, other than reading the contents > > twice and comparing? Occasionally that would be good. E.g. It would > > be nice if "diff -r" could be patched to do that. > > I would think FIEMAP would tell you what you want to know, > wouldn't it? I'm not sure. FIEMAP can be quite a heavy operation too, and it's only available to root I think. From a user's "managing space on my disk" perspective, the important things are being able to see where their data is shared and _especially_ being able to see when touching a file would trigger a massive increase in storage + copying time. 
I.e. I can see an additional flag to "ls" being useful if reflink is used for more than just very well organised backup folders. > > > +- The ctime of the source file only changes if the source's metadata > > > + must be changed to accommodate the copy-on-write linkage. The ctime of > > > + the new file is set to represent its creation. > > > > What change to the source metadata would require ctime to change? > > ocfs2 flags all extents in the source file with a "this is now > shared, go check the reference count before writing" flag if they don't > have it already. I'd call that a metadata update. If the flag is invisible to users, it isn't. If the flag is visible, isn't that the answer to the previous question? :-) > > > +- The link count of the source file is unchanged, and the link count of > > > + the new file is one. > > > > Can you hard link to the source file and the reflink afterwards, > > incrementing the reflink's link count? (I presume yes). Can you > > reflink to both of them too? > > Yes, absolutely. Once reflinked, they look like two separate > POSIX files. Except that chmod can take hours and trigger ENOSPC, and the POSIX atime does... what? Thanks, btw. -- Jamie ^ permalink raw reply [flat|nested] 151+ messages in thread
* Re: [PATCH 1/3] fs: Document the reflink(2) system call. 2009-05-05 13:01 ` Jamie Lokier @ 2009-05-05 17:09 ` Joel Becker 0 siblings, 0 replies; 151+ messages in thread From: Joel Becker @ 2009-05-05 17:09 UTC (permalink / raw) To: Jamie Lokier; +Cc: linux-fsdevel, jmorris, ocfs2-devel, viro On Tue, May 05, 2009 at 02:01:36PM +0100, Jamie Lokier wrote: > Joel Becker wrote: > > Here's another fun trick. Overwriting rsync, instead of copying > > blocks from the already-existing source could reflink the source to the > > .temporary, then only write the changed blocks. And since you own both > > files, it just works. If you're overwriting someone else's file? The > > old copy behavior is fine. > > The moment rsync overwrites a single block, the whole reflink file > will be copied by the filesystem, and then rsync will overwrite other > blocks in the copy. This is not cowlink. It's not a "CoW the whole thing when I touch one block". It's a new file (new inode for most implementations) that just shares the data extents. So if I write to one block, I only need to CoW that one block. See my other email with the wiki pages. Joel -- "Maybe the time has drawn the faces I recall. But things in this life change very slowly, If they ever change at all." Joel Becker Principal Software Developer Oracle E-mail: joel.becker@oracle.com Phone: (650) 506-8127 ^ permalink raw reply [flat|nested] 151+ messages in thread
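[Editorial sketch, not part of the original message.] The per-block behaviour Joel describes (writing one block of a reflink copies only that block) can be illustrated with a toy in-memory model of reference-counted extents. Every type and function name below is hypothetical; this is not the ocfs2 refcount tree, only an illustration of the idea under the assumption of one extent per block.

```c
#include <stddef.h>

#define TOY_BLOCKS 4        /* blocks per file in this toy */
#define TOY_MAX_EXTENTS 16  /* total extents the toy "disk" can hold */

/* One reference count per data extent (hypothetical toy structure). */
struct toy_fs {
	int refcount[TOY_MAX_EXTENTS];
	size_t next_free;
};

/* A file is just a map from block index to backing extent. */
struct toy_file {
	size_t extent[TOY_BLOCKS];
};

/* Allocate one extent with refcount 1; returns -1 when the toy disk is full. */
static int toy_alloc(struct toy_fs *fs, size_t *out)
{
	if (fs->next_free >= TOY_MAX_EXTENTS)
		return -1;
	*out = fs->next_free++;
	fs->refcount[*out] = 1;
	return 0;
}

/* Create a file with private extents; returns 0, or -1 if out of space. */
int toy_create(struct toy_fs *fs, struct toy_file *f)
{
	for (size_t i = 0; i < TOY_BLOCKS; i++)
		if (toy_alloc(fs, &f->extent[i]) != 0)
			return -1;
	return 0;
}

/* Reflink: the new file points at the same extents; no data is copied. */
void toy_reflink(struct toy_fs *fs, const struct toy_file *src,
		 struct toy_file *dst)
{
	for (size_t i = 0; i < TOY_BLOCKS; i++) {
		dst->extent[i] = src->extent[i];
		fs->refcount[src->extent[i]]++;
	}
}

/*
 * Write one block. Returns 1 if that block had to be copied first,
 * 0 if it was already private, -1 if no space was left for the copy.
 */
int toy_write_block(struct toy_fs *fs, struct toy_file *f, size_t block)
{
	size_t old = f->extent[block];
	size_t fresh;

	if (fs->refcount[old] == 1)
		return 0;
	if (toy_alloc(fs, &fresh) != 0)
		return -1;
	fs->refcount[old]--;
	f->extent[block] = fresh;
	return 1;
}

/* Count how many of the file's blocks are still shared. */
size_t toy_shared_blocks(const struct toy_fs *fs, const struct toy_file *f)
{
	size_t n = 0;

	for (size_t i = 0; i < TOY_BLOCKS; i++)
		if (fs->refcount[f->extent[i]] > 1)
			n++;
	return n;
}
```

After `toy_reflink()`, a single `toy_write_block()` on the copy leaves the other three blocks shared, and the -1 return stands in for the ENOSPC case raised elsewhere in the thread.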
* [PATCH 2/3] fs: Add vfs_reflink() and the ->reflink() inode operation. 2009-05-03 6:15 [RFC] The reflink(2) system call Joel Becker 2009-05-03 6:15 ` [PATCH 1/3] fs: Document the " Joel Becker @ 2009-05-03 6:15 ` Joel Becker 2009-05-03 8:03 ` Christoph Hellwig 2009-05-03 6:15 ` [PATCH 3/3] fs: Add the reflink(2) system call Joel Becker 2009-05-07 22:15 ` [RFC] The reflink(2) system call v2 Joel Becker 3 siblings, 1 reply; 151+ messages in thread From: Joel Becker @ 2009-05-03 6:15 UTC (permalink / raw) To: linux-fsdevel; +Cc: jmorris, ocfs2-devel, viro Implement vfs_reflink(), which calls iops->reflink(). See Documentation/filesystems/reflink.txt for a description of the reflink(2) system call. I'm not quite certain of the security model to follow. security_inode_link() is clearly not correct as the resulting file is not the source inode. I have chosen security_inode_create() to reflect the creation of a new file in the directory. This matches the fsnotify_create() I've decided to use. However, it does not reflect that the new file will have the same contents as the source file. The real solution is probably either to check read access on the source or define a new security_inode_reflink(). 
Signed-off-by: Joel Becker <joel.becker@oracle.com>
---
 fs/namei.c         |   40 ++++++++++++++++++++++++++++++++++++++++
 include/linux/fs.h |    2 ++
 2 files changed, 42 insertions(+), 0 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index 78f253c..45cbe7a 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2486,6 +2486,45 @@ SYSCALL_DEFINE2(link, const char __user *, oldname, const char __user *, newname
 	return sys_linkat(AT_FDCWD, oldname, AT_FDCWD, newname, 0);
 }
 
+int vfs_reflink(struct dentry *old_dentry, struct inode *dir, struct dentry *new_dentry)
+{
+	struct inode *inode = old_dentry->d_inode;
+	int error;
+
+	if (!inode)
+		return -ENOENT;
+
+	error = may_create(dir, new_dentry);
+	if (error)
+		return error;
+
+	if (dir->i_sb != inode->i_sb)
+		return -EXDEV;
+
+	/*
+	 * A reflink to an append-only or immutable file cannot be created.
+	 */
+	if (IS_APPEND(inode) || IS_IMMUTABLE(inode))
+		return -EPERM;
+	if (!dir->i_op->reflink)
+		return -EPERM;
+	if (S_ISDIR(inode->i_mode))
+		return -EPERM;
+
+	error = security_inode_create(dir, new_dentry, inode->i_mode);
+	if (error)
+		return error;
+
+	mutex_lock(&inode->i_mutex);
+	vfs_dq_init(dir);
+	error = dir->i_op->reflink(old_dentry, dir, new_dentry);
+	mutex_unlock(&inode->i_mutex);
+	if (!error)
+		fsnotify_create(dir, new_dentry);
+	return error;
+}
+
+
 /*
  * The worst of all namespace operations - renaming directory. "Perverted"
  * doesn't even start to describe it. Somebody in UCB had a heck of a trip...
@@ -2890,6 +2929,7 @@ EXPORT_SYMBOL(unlock_rename); EXPORT_SYMBOL(vfs_create); EXPORT_SYMBOL(vfs_follow_link); EXPORT_SYMBOL(vfs_link); +EXPORT_SYMBOL(vfs_reflink); EXPORT_SYMBOL(vfs_mkdir); EXPORT_SYMBOL(vfs_mknod); EXPORT_SYMBOL(generic_permission); diff --git a/include/linux/fs.h b/include/linux/fs.h index 5bed436..3c9e4ec 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -1415,6 +1415,7 @@ extern int vfs_link(struct dentry *, struct inode *, struct dentry *); extern int vfs_rmdir(struct inode *, struct dentry *); extern int vfs_unlink(struct inode *, struct dentry *); extern int vfs_rename(struct inode *, struct dentry *, struct inode *, struct dentry *); +extern int vfs_reflink(struct dentry *, struct inode *, struct dentry *); /* * VFS dentry helper functions. @@ -1537,6 +1538,7 @@ struct inode_operations { loff_t len); int (*fiemap)(struct inode *, struct fiemap_extent_info *, u64 start, u64 len); + int (*reflink) (struct dentry *,struct inode *,struct dentry *); }; struct seq_file; -- 1.6.1.3 ^ permalink raw reply related [flat|nested] 151+ messages in thread
* Re: [PATCH 2/3] fs: Add vfs_reflink() and the ->reflink() inode operation.
From: Christoph Hellwig @ 2009-05-03 8:03 UTC
To: Joel Becker; +Cc: linux-fsdevel, jmorris, ocfs2-devel, viro

> +int vfs_reflink(struct dentry *old_dentry, struct inode *dir, struct dentry *new_dentry)
> +{

Would be nice to have a little kerneldoc comment for it. Also please
avoid the > 80 char lines

> +EXPORT_SYMBOL(vfs_reflink);

No really good reason to export this. Most vfs_ helpers are exported
for nfsd, and I can't really see nfsd use this anytime soon.
* Re: [PATCH 2/3] fs: Add vfs_reflink() and the ->reflink() inode operation.
From: Joel Becker @ 2009-05-04 2:51 UTC
To: Christoph Hellwig; +Cc: linux-fsdevel, jmorris, ocfs2-devel, viro

On Sun, May 03, 2009 at 04:03:25AM -0400, Christoph Hellwig wrote:
> > +int vfs_reflink(struct dentry *old_dentry, struct inode *dir, struct dentry *new_dentry)
> > +{
>
> Would be nice to have a little kerneldoc comment for it. Also please
> avoid the > 80 char lines

Both good points.

> > +EXPORT_SYMBOL(vfs_reflink);
>
> No really good reason to export this. Most vfs_ helpers are exported
> for nfsd, and I can't really see nfsd use this anytime soon.

While we're going forward with the system call, ocfs2's going to
support the ioctl for older kernels. I was planning to have mainline
just reroute the ioctl to vfs_reflink(), rather than have the ioctl just
break.

Joel

-- 

Life's Little Instruction Book #222

"Think twice before burdening a friend with a secret."

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127
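For readers following the review of vfs_reflink() in this sub-thread, here is a minimal userspace sketch of the order in which the posted function rejects a request. It is a model only: struct model_inode and model_reflink_checks() are hypothetical names invented for illustration, the may_create() result is passed in as a plain integer, and the LSM hook, locking, and the actual ->reflink() call are left out.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for struct inode; not a kernel type. */
struct model_inode {
	int  sb_id;          /* identifies the superblock the inode lives on */
	bool is_dir;         /* models S_ISDIR(inode->i_mode) */
	bool append_only;    /* models IS_APPEND(inode) */
	bool immutable;      /* models IS_IMMUTABLE(inode) */
	bool has_reflink_op; /* models dir->i_op->reflink != NULL */
};

/*
 * Mirrors the order of the precondition checks in the posted vfs_reflink().
 * src == NULL models a negative dentry (old_dentry->d_inode == NULL).
 * may_create_err models the return value of may_create(dir, new_dentry).
 */
static int model_reflink_checks(const struct model_inode *src,
				const struct model_inode *dir,
				int may_create_err)
{
	if (!src)
		return -ENOENT;
	if (may_create_err)
		return may_create_err;
	if (dir->sb_id != src->sb_id)
		return -EXDEV;
	if (src->append_only || src->immutable)
		return -EPERM;
	if (!dir->has_reflink_op)
		return -EPERM;
	if (src->is_dir)
		return -EPERM;
	return 0;
}
```

With a regular file and a target directory on the same superblock the model returns 0; an immutable source, a directory source, or a filesystem without ->reflink() all yield -EPERM, as in the patch.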
* [PATCH 3/3] fs: Add the reflink(2) system call.
From: Joel Becker @ 2009-05-03 6:15 UTC
To: linux-fsdevel; +Cc: jmorris, ocfs2-devel, viro

This implements reflinkat(2) and reflink(2). See
Documentation/reflink.txt for a description of the reflink(2) system
call.

XXX: Currently only adds the x86_32 linkage. The rest of the
architectures belong here too.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
---
 arch/x86/include/asm/unistd_32.h   |    1 +
 arch/x86/kernel/syscall_table_32.S |    1 +
 fs/namei.c                         |   56 ++++++++++++++++++++++++++++++++++++
 3 files changed, 58 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/unistd_32.h b/arch/x86/include/asm/unistd_32.h
index 6e72d74..ea8eb94 100644
--- a/arch/x86/include/asm/unistd_32.h
+++ b/arch/x86/include/asm/unistd_32.h
@@ -340,6 +340,7 @@
 #define __NR_inotify_init1	332
 #define __NR_preadv		333
 #define __NR_pwritev		334
+#define __NR_reflink		335
 
 #ifdef __KERNEL__
 
diff --git a/arch/x86/kernel/syscall_table_32.S b/arch/x86/kernel/syscall_table_32.S
index ff5c873..866705d 100644
--- a/arch/x86/kernel/syscall_table_32.S
+++ b/arch/x86/kernel/syscall_table_32.S
@@ -334,3 +334,4 @@ ENTRY(sys_call_table)
 	.long sys_inotify_init1
 	.long sys_preadv
 	.long sys_pwritev
+	.long sys_reflink		/* 335 */
diff --git a/fs/namei.c b/fs/namei.c
index 45cbe7a..cf739a3 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2524,6 +2524,62 @@ int vfs_reflink(struct dentry *old_dentry, struct inode *dir, struct dentry *new
 	return error;
 }
 
+SYSCALL_DEFINE5(reflinkat, int, olddfd, const char __user *, oldname,
+		int, newdfd, const char __user *, newname, int, flags)
+{
+	struct dentry *new_dentry;
+	struct nameidata nd;
+	struct path old_path;
+	int error;
+	char *to;
+
+	if ((flags & ~AT_SYMLINK_FOLLOW) != 0)
+		return -EINVAL;
+
+	error = user_path_at(olddfd, oldname,
+			     flags & AT_SYMLINK_FOLLOW ? LOOKUP_FOLLOW : 0,
+			     &old_path);
+	if (error)
+		return error;
+
+	error = user_path_parent(newdfd, newname, &nd, &to);
+	if (error)
+		goto out;
+	error = -EXDEV;
+	if (old_path.mnt != nd.path.mnt)
+		goto out_release;
+	new_dentry = lookup_create(&nd, 0);
+	error = PTR_ERR(new_dentry);
+	if (IS_ERR(new_dentry))
+		goto out_unlock;
+	error = mnt_want_write(nd.path.mnt);
+	if (error)
+		goto out_dput;
+	error = security_path_mknod(&nd.path, new_dentry,
+				    old_path.dentry->d_inode->i_mode, 0);
+	if (error)
+		goto out_drop_write;
+	error = vfs_reflink(old_path.dentry, nd.path.dentry->d_inode, new_dentry);
+out_drop_write:
+	mnt_drop_write(nd.path.mnt);
+out_dput:
+	dput(new_dentry);
+out_unlock:
+	mutex_unlock(&nd.path.dentry->d_inode->i_mutex);
+out_release:
+	path_put(&nd.path);
+	putname(to);
+out:
+	path_put(&old_path);
+
+	return error;
+}
+
+SYSCALL_DEFINE2(reflink, const char __user *, oldname, const char __user *, newname)
+{
+	return sys_reflinkat(AT_FDCWD, oldname, AT_FDCWD, newname, 0);
+}
+
 
 /*
  * The worst of all namespace operations - renaming directory. "Perverted"
-- 
1.6.1.3
* Re: [PATCH 3/3] fs: Add the reflink(2) system call.
From: Matthew Wilcox @ 2009-05-03 6:27 UTC
To: Joel Becker; +Cc: linux-fsdevel, jmorris, ocfs2-devel, viro

On Sat, May 02, 2009 at 11:15:03PM -0700, Joel Becker wrote:
> This implements reflinkat(2) and reflink(2). See
> Documentation/reflink.txt for a description of the reflink(2) system
> call.

Do we need to add sys_reflink()? Since sys_reflinkat() has a superset
of the functionality, presumably glibc can provide both reflink() and
reflinkat() calls, and userspace need never know that glibc is calling
sys_reflinkat() for both.

-- 
Matthew Wilcox
Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
* Re: [PATCH 3/3] fs: Add the reflink(2) system call.
From: Al Viro @ 2009-05-03 6:39 UTC
To: Matthew Wilcox; +Cc: Joel Becker, linux-fsdevel, jmorris, ocfs2-devel

On Sun, May 03, 2009 at 12:27:57AM -0600, Matthew Wilcox wrote:
> On Sat, May 02, 2009 at 11:15:03PM -0700, Joel Becker wrote:
> > This implements reflinkat(2) and reflink(2). See
> > Documentation/reflink.txt for a description of the reflink(2) system
> > call.
>
> Do we need to add sys_reflink()? Since sys_reflinkat() has a superset
> of the functionality, presumably glibc can provide both reflink() and
> reflinkat() calls, and userspace need never know that glibc is calling
> sys_reflinkat() for both.

Yes, indeed...

Another question: do we want that to work across mountpoint boundary?
It's probably OK in this case, but...
* Re: [PATCH 3/3] fs: Add the reflink(2) system call.
From: Christoph Hellwig @ 2009-05-03 7:48 UTC
To: Al Viro; +Cc: Matthew Wilcox, Joel Becker, linux-fsdevel, jmorris, ocfs2-devel

On Sun, May 03, 2009 at 07:39:02AM +0100, Al Viro wrote:
> Another question: do we want that to work across mountpoint boundary?
> It's probably OK in this case, but...

I don't think so. Allowing any link-like semantics over mount point
boundaries will just cause problems.

Joel, can you also submit a reflink man page to Michael?
* Re: [PATCH 3/3] fs: Add the reflink(2) system call.
From: Al Viro @ 2009-05-03 11:16 UTC
To: Christoph Hellwig; +Cc: Matthew Wilcox, Joel Becker, linux-fsdevel, jmorris, ocfs2-devel

On Sun, May 03, 2009 at 03:48:49AM -0400, Christoph Hellwig wrote:
> On Sun, May 03, 2009 at 07:39:02AM +0100, Al Viro wrote:
> > Another question: do we want that to work across mountpoint boundary?
> > It's probably OK in this case, but...
>
> I don't think so. Allowing any link-like semantics over mount point
> boundaries will just cause problems.

Quite. I realize that this is how vfs_link() is written, but I really
wonder if we should turn that if (foo->i_sb != bar->i_sb) into BUG_ON()
in both. Their callers have vfsmounts and ought to do the vfsmount-level
check anyway, so running into *that* -EXDEV should be impossible.
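The layering Al describes can be sketched in userspace as follows. This is an illustrative model, not kernel code: the model_* structs and functions are hypothetical names, and the only property assumed is that every mount refers to exactly one superblock. Under that assumption, once the caller has verified that both paths sit on the same mount, the superblock comparison in the VFS helper can no longer fail; the mount check is also the stricter of the two, since two mounts of one superblock (a bind mount, for instance) pass the superblock test but fail the mount test.

```c
#include <errno.h>

/* Hypothetical stand-ins for struct super_block, struct vfsmount, struct path. */
struct model_sb    { int id; };
struct model_mount { const struct model_sb *sb; };
struct model_path  { const struct model_mount *mnt; };

/* Caller-level check, modelled on the old_path.mnt != nd.path.mnt test. */
static int model_mount_check(const struct model_path *old,
			     const struct model_path *new_parent)
{
	return old->mnt != new_parent->mnt ? -EXDEV : 0;
}

/* VFS-level check, modelled on the dir->i_sb != inode->i_sb test. */
static int model_sb_check(const struct model_path *old,
			  const struct model_path *new_parent)
{
	return old->mnt->sb != new_parent->mnt->sb ? -EXDEV : 0;
}

/* Both checks in the order the syscall and then the VFS helper apply them. */
static int model_reflink_path_checks(const struct model_path *old,
				     const struct model_path *new_parent)
{
	int error = model_mount_check(old, new_parent);

	if (error)
		return error;
	/* Same mount implies same superblock, so this cannot fail here. */
	return model_sb_check(old, new_parent);
}
```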
* Re: [PATCH 3/3] fs: Add the reflink(2) system call.
From: Joel Becker @ 2009-05-04 2:53 UTC
To: Al Viro; +Cc: Matthew Wilcox, linux-fsdevel, jmorris, ocfs2-devel

On Sun, May 03, 2009 at 07:39:02AM +0100, Al Viro wrote:
> Another question: do we want that to work across mountpoint boundary?
> It's probably OK in this case, but...

I don't think we want it working across mountpoints, just like
link(2). I thought I checked for that in sys_reflinkat().

Joel

-- 

Life's Little Instruction Book #139

"Never deprive someone of hope; it might be all they have."

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127
* Re: [PATCH 3/3] fs: Add the reflink(2) system call.
From: Joel Becker @ 2009-05-04 2:53 UTC
To: Matthew Wilcox; +Cc: linux-fsdevel, jmorris, ocfs2-devel, viro

On Sun, May 03, 2009 at 12:27:57AM -0600, Matthew Wilcox wrote:
> On Sat, May 02, 2009 at 11:15:03PM -0700, Joel Becker wrote:
> > This implements reflinkat(2) and reflink(2). See
> > Documentation/reflink.txt for a description of the reflink(2) system
> > call.
>
> Do we need to add sys_reflink()? Since sys_reflinkat() has a superset
> of the functionality, presumably glibc can provide both reflink() and
> reflinkat() calls, and userspace need never know that glibc is calling
> sys_reflinkat() for both.

Sure, that works.

Joel

-- 

"Always give your best, never get discouraged, never be petty; always
remember, others may hate you. Those who hate you don't win unless
you hate them. And then you destroy yourself."
- Richard M. Nixon

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127
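A minimal sketch of the userspace wrapper agreed on above, assuming the reflinkat() calling convention from the patch. Since reflinkat(2) is only proposed in this thread, the sketch does not issue a real system call: stub_reflinkat() is a hypothetical stand-in that applies the same flag validation as the posted sys_reflinkat(), records its arguments, and then fails with ENOSYS. reflink_wrapper() and last_call are likewise names invented for illustration.

```c
#include <errno.h>
#include <fcntl.h>	/* AT_FDCWD, AT_SYMLINK_FOLLOW */

/* Fallbacks using the Linux values, in case the headers hide them. */
#ifndef AT_FDCWD
#define AT_FDCWD (-100)
#endif
#ifndef AT_SYMLINK_FOLLOW
#define AT_SYMLINK_FOLLOW 0x400
#endif

/* Records the arguments of the most recent stub call, for inspection. */
struct reflinkat_call {
	int olddirfd;
	const char *oldpath;
	int newdirfd;
	const char *newpath;
	int flags;
};
static struct reflinkat_call last_call;

/*
 * Hypothetical stand-in for the proposed reflinkat(2). It rejects unknown
 * flags like the posted sys_reflinkat(), then reports ENOSYS because no
 * real system call is made.
 */
static int stub_reflinkat(int olddirfd, const char *oldpath,
			  int newdirfd, const char *newpath, int flags)
{
	if ((flags & ~AT_SYMLINK_FOLLOW) != 0) {
		errno = EINVAL;
		return -1;
	}
	last_call = (struct reflinkat_call){
		olddirfd, oldpath, newdirfd, newpath, flags
	};
	errno = ENOSYS;
	return -1;
}

/* reflink() expressed purely in terms of reflinkat(), as a libc could do. */
static int reflink_wrapper(const char *oldpath, const char *newpath)
{
	return stub_reflinkat(AT_FDCWD, oldpath, AT_FDCWD, newpath, 0);
}
```

The wrapper passes AT_FDCWD for both directory descriptors and 0 for flags, which is the same translation the posted SYSCALL_DEFINE2(reflink, ...) performs inside the kernel.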
* Re: [PATCH 3/3] fs: Add the reflink(2) system call.
From: Christoph Hellwig @ 2009-05-03 8:04 UTC
To: Joel Becker; +Cc: linux-fsdevel, jmorris, ocfs2-devel, viro

On Sat, May 02, 2009 at 11:15:03PM -0700, Joel Becker wrote:
> This implements reflinkat(2) and reflink(2). See
> Documentation/reflink.txt for a description of the reflink(2) system
> call.
>
> XXX: Currently only adds the x86_32 linkage. The rest of the
> architectures belong here too.

As mentioned by willy, no need for the sys_reflink syscall. Also no
really good reason to split the support up into three patches, one is
enough.
* [RFC] The reflink(2) system call v2.
From: Joel Becker @ 2009-05-07 22:15 UTC
To: linux-fsdevel; +Cc: mtk.manpages, linux-security-module, jmorris, ocfs2-devel, viro

Hi again,

Here's version 2 of reflink. Changes since the first version:

- One patch, not three.
- Documentation/filesystems/reflink.txt is no longer a pseudo-manpage.
  It also tries to encapsulate all the feedback from the discussion to
  make the operation clearer.
- LSM hooks added as recommended by the LSM folks. This includes the
  default implementation in capability.c.
- Restricted reflink to owner or CAP_CHOWN.
- reflink(2) removed, only reflinkat(2) will be in the syscall table.
  Userspace can trivially write reflink(3).

The patch still only defines sys_reflinkat() for x86_32. The final
version will have all architectures. The patch is also available in my
ocfs2 tree:

git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2.git reflink

If you want to play with reflinks, here's what you need:

1) Tao's kernel code. This is the ioctl-based ocfs2 implementation.
   Obviously we'll be putting it under the syscall shortly. Compile and
   install as you'd expect. It's in the 'refcount' branch of his git
   tree:

   git://oss.oracle.com/git/tma/linux-2.6.git refcount

2) My code for ocfs2-tools. This is the mkfs.ocfs2(8) support to create
   a filesystem ready for reflink. It's in the 'refcount' branch of the
   ocfs2-tools git tree:

   git://oss.oracle.com/git/ocfs2-tools.git refcount

   Once the branch is checked out, you can build and install it with:

   # ./autogen.sh; make; make install

   Create a non-clustered ocfs2 filesystem like so:

   # mkfs.ocfs2 -M local --fs-features=refcount /dev/XXX

   If you really want a clustered ocfs2, go right ahead, but I figure
   most people that want to play with reflinks want the quickest start
   possible, and a non-clustered ocfs2 means mkfs+mount just like any
   other local filesystem.

3) The reflink(1) program. Grab the master branch from the reflink git
   tree:

   git://oss.oracle.com/git/jlbec/reflink.git master

   Type 'make' and 'make install' in the toplevel directory. You now
   have the reflink(1) program. It works with both the system call and
   the ocfs2 ioctl, so you can use it atop the current ocfs2 patch set.

4) Have fun!

Joel

From 3130be9651832cece277d30182a04274798ce7f2 Mon Sep 17 00:00:00 2001
From: Joel Becker <joel.becker@oracle.com>
Date: Sat, 2 May 2009 22:48:59 -0700
Subject: [PATCH] fs: Add the reflink() operation and reflinkat(2) system call.

The userspace visible idea of the operation is:

int reflink(const char *oldpath, const char *newpath);
int reflinkat(int olddirfd, const char *oldpath,
              int newdirfd, const char *newpath, int flags);

The kernel only implements reflinkat(2). reflink(3) is a trivial
wrapper around reflinkat(2).

The reflink() system call creates reference-counted links. It creates
a new file that shares the data extents of the source file in a
copy-on-write fashion. Its calling semantics are identical to link(2)
and linkat(2). Once complete, programs see the new file as a completely
separate entry.

In the VFS, ->reflink() is an inode_operation with the same arguments as
->link().

reflink() requires the caller to own the source file or have CAP_CHOWN,
because a reflink preserves ownership, permissions, and security
contexts. Without the privileges, a regular user can't preserve
ownership.

Two new LSM hooks are added, security_path_reflink() and
security_inode_reflink(). None of the existing LSM hooks appear to fit.

XXX: Currently only adds the x86_32 linkage. The rest of the
architectures belong here too.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
---
 Documentation/filesystems/reflink.txt |  152 +++++++++++++++++++++++++++++++++
 Documentation/filesystems/vfs.txt     |    4 +
 arch/x86/include/asm/unistd_32.h      |    1 +
 arch/x86/kernel/syscall_table_32.S    |    1 +
 fs/namei.c                            |  101 ++++++++++++++++++++++
 include/linux/fs.h                    |    2 +
 include/linux/security.h              |   38 ++++++++
 include/linux/syscalls.h              |    2 +
 security/capability.c                 |   13 +++
 security/security.c                   |   15 +++
 10 files changed, 329 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/filesystems/reflink.txt

diff --git a/Documentation/filesystems/reflink.txt b/Documentation/filesystems/reflink.txt
new file mode 100644
index 0000000..58a6b38
--- /dev/null
+++ b/Documentation/filesystems/reflink.txt
@@ -0,0 +1,152 @@
+reflink(2)
+==========
+
+
+INTRODUCTION
+------------
+
+A reflink is a reference-counted link. The reflink(2) operation is
+analogous to the link(2) operation, except that instead of two directory
+entries pointing to the same inode, there are two identical inodes
+pointing to the same data. Writes do not modify the shared data; they
+use copy-on-write (CoW). Thus, after the reflink has been created, the
+inodes can diverge without impacting each other.
+
+
+SYNOPSIS
+--------
+
+The reflink(2) call looks just like link(2):
+
+	int reflink(const char *oldpath, const char *newpath);
+
+The actual system call is reflinkat(2):
+
+	int reflinkat(int olddirfd, const char *oldpath,
+		      int newdirfd, const char *newpath, int flags);
+
+For details on how olddirfd, newdirfd, and flags behave, see linkat(2).
+The reflink(2) call won't be implemented by the kernel, because it's a
+trivial wrapper around reflinkat(2).
+
+
+DESCRIPTION
+-----------
+
+One way of viewing reflink is to look at the level of sharing. A
+symbolic link does its sharing at the directory entry level; many names
+end up pointing at the same directory entry. Hard links are one step
+down. Multiple directory entries are sharing one inode. Reflinks are
+down one more level: multiple inodes share the same data extents.
+
+When you symlink a file, you can then access it via the symlink or the
+real directory entry, and for the most part they look identical. When
+accessing more than one name for a hard link, the object returned looks
+identical. Similarly, a newly created reflink is identical to its
+source in almost every way and can be treated as such. This includes
+ownership, permissions, security context, and data. The only things
+that are different are the inode number, the link count, and the ctime.
+
+A reflink is a snapshot of the source file at the time it is created.
+
+Once created, though, a reflink can be modified like any other normal
+file without affecting the source file. Changes to trivial fields like
+permissions, owner, or times are guaranteed not to trigger CoW of file
+data and will not return any error that wouldn't happen on a truly
+distinct file. Changes to the file's data will trigger CoW of the data
+affected - the actual CoW granularity is up to the filesystem, from
+exact bytes up to the entire file. ocfs2, for example, will copy out an
+entire extent or 1MB, whichever is smaller.
+
+Partial reflinks are not allowed. The new inode will only appear in the
+directory structure after it is fully formed. This prevents a crash or
+lack of space from creating a partial reflink.
+
+If a filesystem does not support reflinks, the kernel and libc MUST NOT
+fake it. Callers are expecting to get snapshots, and faking it will
+violate that trust.
+
+The userspace view is as follows. When reflink(2) returns, opening
+oldpath and newpath returns identical-looking files, just like link(2).
+After that, oldpath and newpath behave as distinct files, and
+modifications to one have no impact on the other.
+
+
+RESTRICTIONS
+------------
+
+Just as the sharing gets lower as you move from symlink() -> link() ->
+reflink(), the restrictions on the call get tighter. A symlink doesn't
+require any access permissions other than being able to create its
+inode. It can cross filesystems and mount points, and it can point to
+any type of file. A hard link requires both source and target to be on
+the same filesystem under the same mount point, and that the source not
+be a directory. Like hard links and symlinks, a reflink cannot be
+created if newpath exists.
+
+Reflinks add one big restriction on top of hard links: only the owner
+or someone with elevated privileges (CAP_CHOWN) can reflink a file. A
+reflink is a point-in-time snapshot of a file. It has the same
+ownership, attributes, and security context as the source file. A
+regular user cannot change the ownership of files, so they cannot create
+a reflink of a file they do not own.
+
+
+SHARING
+-------
+
+A reflink creates a new inode. It shares all data extents of the source
+file; this includes file data and extended attribute data. All of the
+sharing is in a CoW fashion, and any modification of the data will break
+the sharing.
+
+For some filesystems, certain data structures are not in allocated
+storage extents. Creating a reflink might make a copy of these extents.
+An example is ext3's ability to store small extended attributes inside
+the ext3 inode. Since a reflink is creating a new inode, those extended
+attributes are merely copied to the new inode.
+
+
+EXCEPTIONS
+----------
+
+All file attributes and extended attributes of the new file must be
+identical to the source file with the following exceptions:
+
+- The new file must have a new inode number. This allows POSIX
+  programs to treat the source and new files as separate objects. From
+  the view of the POSIX application, the files are distinct. The
+  sharing is invisible outside of the filesystem's internal structures.
+- The ctime of the source file only changes if the source's metadata
+  must be changed to accommodate the copy-on-write linkage. The ctime
+  of the new file is set to represent its creation.
+- The link count of the source file is unchanged, and the link count of
+  the new file is one.
+
+The mtime of the source file is unmodified, and the mtime of the new
+file is set identical to the source file. This reflects that the data
+is unchanged.
+
+
+INODE OPERATION
+---------------
+
+Filesystems implement the ->reflink() inode operation. It has the same
+prototype as ->link():
+
+	int (*reflink)(struct dentry *old_dentry, struct inode *dir,
+		       struct dentry *new_dentry);
+
+When the filesystem is called, the VFS has already checked the
+permissions and mountpoint of the operation. The filesystem just needs
+to create the new inode identical to the old one with the exceptions
+noted above, link up the shared data extents, and then link the new
+inode into dir.
+
+
+FOLLOWING SYMBOLIC LINKS
+------------------------
+
+reflink() dereferences symbolic links in the same manner that link(2)
+does. The AT_SYMLINK_FOLLOW flag is honored just as for linkat(2).
+
diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
index f49eecf..01cd810 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -333,6 +333,7 @@ struct inode_operations {
 	ssize_t (*listxattr) (struct dentry *, char *, size_t);
 	int (*removexattr) (struct dentry *, const char *);
 	void (*truncate_range)(struct inode *, loff_t, loff_t);
+	int (*reflink) (struct dentry *,struct inode *,struct dentry *);
 };
 
 Again, all methods are called without any locks being held, unless
@@ -431,6 +432,9 @@ otherwise noted.
 
   truncate_range: a method provided by the underlying filesystem to truncate a
   	range of blocks , i.e. punch a hole somewhere in a file.
+  reflink: called by the reflink(2) system call. Only required if you want
+  	to support reflinks. For further information, see
+  	Documentation/filesystems/reflink.txt.
 
 
 The Address Space Object
diff --git a/arch/x86/include/asm/unistd_32.h b/arch/x86/include/asm/unistd_32.h
index 6e72d74..c368563 100644
--- a/arch/x86/include/asm/unistd_32.h
+++ b/arch/x86/include/asm/unistd_32.h
@@ -340,6 +340,7 @@
 #define __NR_inotify_init1	332
 #define __NR_preadv		333
 #define __NR_pwritev		334
+#define __NR_reflinkat		335
 
 #ifdef __KERNEL__
 
diff --git a/arch/x86/kernel/syscall_table_32.S b/arch/x86/kernel/syscall_table_32.S
index ff5c873..d11c200 100644
--- a/arch/x86/kernel/syscall_table_32.S
+++ b/arch/x86/kernel/syscall_table_32.S
@@ -334,3 +334,4 @@ ENTRY(sys_call_table)
 	.long sys_inotify_init1
 	.long sys_preadv
 	.long sys_pwritev
+	.long sys_reflinkat		/* 335 */
diff --git a/fs/namei.c b/fs/namei.c
index 78f253c..3f80c2f 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2486,6 +2486,106 @@ SYSCALL_DEFINE2(link, const char __user *, oldname, const char __user *, newname
 	return sys_linkat(AT_FDCWD, oldname, AT_FDCWD, newname, 0);
 }
 
+int vfs_reflink(struct dentry *old_dentry, struct inode *dir, struct dentry *new_dentry)
+{
+	struct inode *inode = old_dentry->d_inode;
+	int error;
+
+	if (!inode)
+		return -ENOENT;
+
+	/*
+	 * reflink() preserves ownership, so the caller must have the
+	 * right to do so.
+	 */
+	if ((current_fsuid() != inode->i_uid) && !capable(CAP_CHOWN))
+		return -EPERM;
+
+	if ((current_fsuid() != inode->i_uid) &&
+	    !in_group_p(inode->i_gid) && !capable(CAP_CHOWN))
+		return -EPERM;
+
+	error = may_create(dir, new_dentry);
+	if (error)
+		return error;
+
+	if (dir->i_sb != inode->i_sb)
+		return -EXDEV;
+
+	/*
+	 * A reflink to an append-only or immutable file cannot be created.
+	 */
+	if (IS_APPEND(inode) || IS_IMMUTABLE(inode))
+		return -EPERM;
+	if (!dir->i_op->reflink)
+		return -EPERM;
+	if (S_ISDIR(inode->i_mode))
+		return -EPERM;
+
+	error = security_inode_reflink(old_dentry, dir, new_dentry);
+	if (error)
+		return error;
+
+	mutex_lock(&inode->i_mutex);
+	vfs_dq_init(dir);
+	error = dir->i_op->reflink(old_dentry, dir, new_dentry);
+	mutex_unlock(&inode->i_mutex);
+	if (!error)
+		fsnotify_create(dir, new_dentry);
+	return error;
+}
+
+SYSCALL_DEFINE5(reflinkat, int, olddfd, const char __user *, oldname,
+		int, newdfd, const char __user *, newname, int, flags)
+{
+	struct dentry *new_dentry;
+	struct nameidata nd;
+	struct path old_path;
+	int error;
+	char *to;
+
+	if ((flags & ~AT_SYMLINK_FOLLOW) != 0)
+		return -EINVAL;
+
+	error = user_path_at(olddfd, oldname,
+			     flags & AT_SYMLINK_FOLLOW ? LOOKUP_FOLLOW : 0,
+			     &old_path);
+	if (error)
+		return error;
+
+	error = user_path_parent(newdfd, newname, &nd, &to);
+	if (error)
+		goto out;
+	error = -EXDEV;
+	if (old_path.mnt != nd.path.mnt)
+		goto out_release;
+	new_dentry = lookup_create(&nd, 0);
+	error = PTR_ERR(new_dentry);
+	if (IS_ERR(new_dentry))
+		goto out_unlock;
+	error = mnt_want_write(nd.path.mnt);
+	if (error)
+		goto out_dput;
+	error = security_path_reflink(old_path.dentry, &nd.path, new_dentry);
+	if (error)
+		goto out_drop_write;
+	error = vfs_reflink(old_path.dentry, nd.path.dentry->d_inode, new_dentry);
+out_drop_write:
+	mnt_drop_write(nd.path.mnt);
+out_dput:
+	dput(new_dentry);
+out_unlock:
+	mutex_unlock(&nd.path.dentry->d_inode->i_mutex);
+out_release:
+	path_put(&nd.path);
+	putname(to);
+out:
+	path_put(&old_path);
+
+	return error;
+}
+
+
 /*
  * The worst of all namespace operations - renaming directory. "Perverted"
  * doesn't even start to describe it. Somebody in UCB had a heck of a trip...
@@ -2890,6 +2990,7 @@ EXPORT_SYMBOL(unlock_rename);
 EXPORT_SYMBOL(vfs_create);
 EXPORT_SYMBOL(vfs_follow_link);
 EXPORT_SYMBOL(vfs_link);
+EXPORT_SYMBOL(vfs_reflink);
 EXPORT_SYMBOL(vfs_mkdir);
 EXPORT_SYMBOL(vfs_mknod);
 EXPORT_SYMBOL(generic_permission);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 5bed436..3c9e4ec 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1415,6 +1415,7 @@ extern int vfs_link(struct dentry *, struct inode *, struct dentry *);
 extern int vfs_rmdir(struct inode *, struct dentry *);
 extern int vfs_unlink(struct inode *, struct dentry *);
 extern int vfs_rename(struct inode *, struct dentry *, struct inode *, struct dentry *);
+extern int vfs_reflink(struct dentry *, struct inode *, struct dentry *);
 
 /*
  * VFS dentry helper functions.
@@ -1537,6 +1538,7 @@ struct inode_operations {
 			  loff_t len);
 	int (*fiemap)(struct inode *, struct fiemap_extent_info *, u64 start,
 		      u64 len);
+	int (*reflink) (struct dentry *,struct inode *,struct dentry *);
 };
 
 struct seq_file;
diff --git a/include/linux/security.h b/include/linux/security.h
index d5fd616..c647761 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -528,6 +528,23 @@ static inline void security_free_mnt_opts(struct security_mnt_opts *opts)
  *	@inode contains a pointer to the inode.
  *	@secid contains a pointer to the location where result will be saved.
  *	In case of failure, @secid will be set to zero.
+ * @inode_reflink:
+ *	Check permission before creating a new reference-counted link to
+ *	a file.
+ *	@old_dentry contains the dentry structure for an existing link to
+ *	the file.
+ *	@dir contains the inode structure of the parent directory of the
+ *	new reflink.
+ *	Return 0 if permission is granted.
+ * @path_reflink:
+ *	Check permission before creating a new reference-counted link to
+ *	a file.
+ *	@old_dentry contains the dentry structure for an existing link
+ *	to the file.
+ *	@new_dir contains the path structure of the parent directory of
+ *	the new reflink.
+ *	@new_dentry contains the dentry structure for the new reflink.
+ *	Return 0 if permission is granted.
  *
  * Security hooks for file operations
  *
@@ -1402,6 +1419,8 @@ struct security_operations {
 			  struct dentry *new_dentry);
 	int (*path_rename) (struct path *old_dir, struct dentry *old_dentry,
 			    struct path *new_dir, struct dentry *new_dentry);
+	int (*path_reflink) (struct dentry *old_dentry, struct path *new_dir,
+			     struct dentry *new_dentry);
 #endif
 
 	int (*inode_alloc_security) (struct inode *inode);
@@ -1415,6 +1434,7 @@ struct security_operations {
 	int (*inode_unlink) (struct inode *dir, struct dentry *dentry);
 	int (*inode_symlink) (struct inode *dir,
 			      struct dentry *dentry, const char *old_name);
+	int (*inode_reflink) (struct dentry *old_dentry, struct inode *dir);
 	int (*inode_mkdir) (struct inode *dir, struct dentry *dentry, int mode);
 	int (*inode_rmdir) (struct inode *dir, struct dentry *dentry);
 	int (*inode_mknod) (struct inode *dir, struct dentry *dentry,
@@ -1675,6 +1695,8 @@ int security_inode_link(struct dentry *old_dentry, struct inode *dir,
 int security_inode_unlink(struct inode *dir, struct dentry *dentry);
 int security_inode_symlink(struct inode *dir, struct dentry *dentry,
 			   const char *old_name);
+int security_inode_reflink(struct dentry *old_dentry, struct inode *dir,
+			   struct dentry *new_dentry);
 int security_inode_mkdir(struct inode *dir, struct dentry *dentry, int mode);
 int security_inode_rmdir(struct inode *dir, struct dentry *dentry);
 int security_inode_mknod(struct inode *dir, struct dentry *dentry, int mode, dev_t dev);
@@ -2056,6 +2078,13 @@ static inline int security_inode_symlink(struct inode *dir,
 	return 0;
 }
 
+static inline int security_inode_reflink(struct dentry *old_dentry,
+					 struct inode *dir,
+					 struct dentry *new_dentry)
+{
+	return 0;
+}
+
 static inline int security_inode_mkdir(struct inode *dir,
 					struct dentry *dentry,
 					int mode)
@@ -2802,6 +2831,8 @@ int security_path_link(struct dentry *old_dentry, struct path *new_dir,
 		       struct dentry *new_dentry);
 int security_path_rename(struct path *old_dir, struct dentry *old_dentry,
 			 struct path *new_dir, struct dentry *new_dentry);
+int security_path_reflink(struct dentry *old_dentry, struct path *new_dir,
+			  struct dentry *new_dentry);
 #else	/* CONFIG_SECURITY_PATH */
 static inline int security_path_unlink(struct path *dir, struct dentry *dentry)
 {
@@ -2851,6 +2882,13 @@ static inline int security_path_rename(struct path *old_dir,
 {
 	return 0;
 }
+
+static inline int security_path_reflink(struct dentry *old_dentry,
+					struct path *new_dir,
+					struct dentry *new_dentry)
+{
+	return 0;
+}
 #endif	/* CONFIG_SECURITY_PATH */
 
 #ifdef CONFIG_KEYS
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 40617c1..35a8743 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -692,6 +692,8 @@ asmlinkage long sys_symlinkat(const char __user * oldname,
 			      int newdfd, const char __user * newname);
 asmlinkage long sys_linkat(int olddfd, const char __user *oldname,
 			   int newdfd, const char __user *newname, int flags);
+asmlinkage long sys_reflinkat(int olddfd, const char __user *oldname,
+			      int newdfd, const char __user *newname, int flags);
 asmlinkage long sys_renameat(int olddfd, const char __user * oldname,
 			     int newdfd, const char __user * newname);
 asmlinkage long sys_futimesat(int dfd, char __user *filename,
diff --git a/security/capability.c b/security/capability.c
index 21b6cea..60c6eda 100644
--- a/security/capability.c
+++ b/security/capability.c
@@ -172,6 +172,11 @@ static int cap_inode_symlink(struct inode *inode, struct dentry *dentry,
 	return 0;
 }
 
+static int cap_inode_reflink(struct dentry *old_dentry, struct inode *inode)
+{
+	return 0;
+}
+
 static int cap_inode_mkdir(struct inode *inode, struct dentry *dentry,
 			   int mask)
 {
@@ -308,6 +313,12 @@ static int cap_path_truncate(struct path *path, loff_t length,
 {
 	return 0;
 }
+
+static int cap_path_reflink(struct dentry *old_dentry, struct path *new_dir,
+			    struct dentry *new_dentry)
+{
+	return 0;
+}
 #endif
 
 static int cap_file_permission(struct file *file, int mask)
@@ -905,6 +916,7 @@ void security_fixup_ops(struct security_operations *ops)
 	set_to_cap_if_null(ops, inode_link);
 	set_to_cap_if_null(ops, inode_unlink);
 	set_to_cap_if_null(ops, inode_symlink);
+	set_to_cap_if_null(ops, inode_reflink);
 	set_to_cap_if_null(ops, inode_mkdir);
 	set_to_cap_if_null(ops, inode_rmdir);
 	set_to_cap_if_null(ops, inode_mknod);
@@ -935,6 +947,7 @@ void security_fixup_ops(struct security_operations *ops)
 	set_to_cap_if_null(ops, path_link);
 	set_to_cap_if_null(ops, path_rename);
 	set_to_cap_if_null(ops, path_truncate);
+	set_to_cap_if_null(ops, path_reflink);
 #endif
 	set_to_cap_if_null(ops, file_permission);
 	set_to_cap_if_null(ops, file_alloc_security);
diff --git a/security/security.c b/security/security.c
index 5284255..fc40a29 100644
--- a/security/security.c
+++ b/security/security.c
@@ -437,6 +437,14 @@ int security_path_truncate(struct path *path, loff_t length,
 		return 0;
 	return security_ops->path_truncate(path, length, time_attrs);
 }
+
+int security_path_reflink(struct dentry *old_dentry, struct path *new_dir,
+			  struct dentry *new_dentry)
+{
+	if (unlikely(IS_PRIVATE(old_dentry->d_inode)))
+		return 0;
+	return security_ops->path_reflink(old_dentry, new_dir, new_dentry);
+}
 #endif
 
 int security_inode_create(struct inode *dir, struct dentry *dentry, int mode)
@@ -470,6 +478,13 @@ int security_inode_symlink(struct inode *dir, struct dentry *dentry,
 	return security_ops->inode_symlink(dir, dentry, old_name);
 }
 
+int security_inode_reflink(struct dentry *old_dentry, struct inode *dir)
+{
+	if (unlikely(IS_PRIVATE(old_dentry->d_inode)))
+		return 0;
+	return security_ops->inode_reflink(old_dentry, dir);
+}
+
 int security_inode_mkdir(struct inode *dir, struct dentry *dentry, int mode)
 {
 	if (unlikely(IS_PRIVATE(dir)))
-- 
1.6.1.3

-- 

"Sometimes I think the surest sign intelligent
life exists elsewhere in the universe is that none of it has tried to contact us." -Calvin & Hobbes Joel Becker Principal Software Developer Oracle E-mail: joel.becker@oracle.com Phone: (650) 506-8127 ^ permalink raw reply related [flat|nested] 151+ messages in thread
* Re: [RFC] The reflink(2) system call v2.
From: James Morris @ 2009-05-08 1:39 UTC
To: Joel Becker
Cc: linux-fsdevel, ocfs2-devel, viro, mtk.manpages, linux-security-module

On Thu, 7 May 2009, Joel Becker wrote:

> @@ -1402,6 +1419,8 @@ struct security_operations {
>  				 struct dentry *new_dentry);
>  	int (*path_rename) (struct path *old_dir, struct dentry *old_dentry,
>  			    struct path *new_dir, struct dentry *new_dentry);
> +	int (*path_reflink) (struct dentry *old_dentry, struct path *new_dir,
> +			     struct dentry *new_dentry);
>  #endif
>

The TOMOYO folk don't need a path hook, so it would be unused, and should
not be added unless someone responsible for an in-tree LSM establishes a
case for it.

> +int security_inode_reflink(struct dentry *old_dentry, struct inode *dir,
> +			   struct dentry *new_dentry);

We don't need the new_dentry argument (this is correct in the low-level
hook, and doesn't compile with CONFIG_SECURITY=y).

- James
--
James Morris <jmorris@namei.org>
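The prototype mismatch James points out is easier to see in a small userspace model of the hook plumbing. What follows is only a sketch under stated assumptions: the struct definitions are stand-ins for the kernel types, the ops table is passed in explicitly instead of living in the global security_ops, and deny_inode_reflink() is a hypothetical policy invented for illustration. The point is that the hook pointer, the default stub, and the wrapper all share the same two-argument list.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel types; only what the sketch needs. */
struct inode  { int is_private; };
struct dentry { struct inode *d_inode; };

/* Hook table using the two-argument prototype, with no new_dentry. */
struct security_operations {
	int (*inode_reflink)(struct dentry *old_dentry, struct inode *dir);
};

/* Default hook: permit everything, like cap_inode_reflink() in the patch. */
int cap_inode_reflink(struct dentry *old_dentry, struct inode *dir)
{
	(void)old_dentry;
	(void)dir;
	return 0;
}

/* Hypothetical LSM policy that refuses every reflink (illustration only). */
int deny_inode_reflink(struct dentry *old_dentry, struct inode *dir)
{
	(void)old_dentry;
	(void)dir;
	return -1;
}

/* Models set_to_cap_if_null(): fill an unset hook with the default. */
void security_fixup_ops(struct security_operations *ops)
{
	if (!ops->inode_reflink)
		ops->inode_reflink = cap_inode_reflink;
}

/*
 * Wrapper with the same argument list as the hook it calls. The ops
 * parameter is an assumption of this sketch; the kernel uses a global.
 */
int security_inode_reflink(const struct security_operations *ops,
			   struct dentry *old_dentry, struct inode *dir)
{
	assert(ops && ops->inode_reflink && old_dentry && old_dentry->d_inode);
	if (old_dentry->d_inode->is_private)	/* models IS_PRIVATE() */
		return 0;
	return ops->inode_reflink(old_dentry, dir);
}
```

With one prototype used in all three places, a declaration that carries an extra new_dentry argument would conflict with the definition at compile time, which is the failure James reports for CONFIG_SECURITY=y.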
* Re: [RFC] The reflink(2) system call v2.
From: Joel Becker @ 2009-05-08 1:49 UTC
To: James Morris
Cc: linux-fsdevel, ocfs2-devel, viro, mtk.manpages, linux-security-module

On Fri, May 08, 2009 at 11:39:53AM +1000, James Morris wrote:
> On Thu, 7 May 2009, Joel Becker wrote:
>
> > @@ -1402,6 +1419,8 @@ struct security_operations {
> >  				 struct dentry *new_dentry);
> >  	int (*path_rename) (struct path *old_dir, struct dentry *old_dentry,
> >  			    struct path *new_dir, struct dentry *new_dentry);
> > +	int (*path_reflink) (struct dentry *old_dentry, struct path *new_dir,
> > +			     struct dentry *new_dentry);
> >  #endif
> >
>
> The TOMOYO folk don't need a path hook, so it would be unused, and should
> not be added unless someone responsible for an in-tree LSM establishes a
> case for it.

	Oh, I misread what they said:

> TOMOYO wants to prevent reflink(".htpasswd", "readme.html").
> But security_path_mknod() can't know the source file's name.
> Therefore, TOMOYO wants security_path_link() rather than security_path_mknod().
> So far I don't feel TOMOYO needs to introduce security_path_reflink()
> because modifications after reflink() will be checked by other LSM hooks.

	So I should change the path_reflink() call to path_link() in
reflinkat(2)?

> > +int security_inode_reflink(struct dentry *old_dentry, struct inode *dir,
> > +			   struct dentry *new_dentry);
>
> We don't need the new_dentry argument (this is correct in the low-level
> hook, and doesn't compile with CONFIG_SECURITY=y).

	Eek, missed that.

Joel

--
"The cynics are right nine times out of ten."
	- H. L. Mencken
* Re: [RFC] The reflink(2) system call v2.
From: Tetsuo Handa @ 2009-05-08 13:01 UTC
To: Joel.Becker
Cc: jmorris, linux-fsdevel, ocfs2-devel, viro, mtk.manpages, linux-security-module

Joel Becker wrote:
>> So far I don't feel TOMOYO needs to introduce security_path_reflink() because
>> modifications after reflink() will be checked by other LSM hooks.
>
> 	So I should change the path_reflink() call to path_link() in
> reflinkat(2)?

Yes, you can.
* Re: [RFC] The reflink(2) system call v2.
From: jim owens @ 2009-05-08 2:59 UTC
To: jmorris, ocfs2-devel, viro, mtk.manpages, linux-security-module, joel.becker
Cc: linux-fsdevel

Joel Becker wrote:
> Hi again,
> 	Here's version 2 of reflink.  Changes since the first version:
>
> - One patch, not three.
> - Documentation/filesystems/reflink.txt is no longer a pseudo-manpage.
>   It also tries to encapsulate all the feedback from the discussion to
>   make the operation clearer.

You certainly did not address:

- desire for one single system call to handle both
  owner preservation and create with current owner.
  I see no reason to have 2 vfs_xxx and 2 inode functions
  for those.

- please just add the flag to the defined reflink API...
  there is no reason to keep saying "it is just like link(2)".
  that is not true and you will just cause confusion.

- fix the

  +	if (S_ISDIR(inode->i_mode))
  +		return -EPERM;

  to be an ISREG check unless you have an argument for
  special files and symlinks being COWed.

jim
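To make jim's last point concrete, here is a minimal userspace sketch comparing the two checks. Both helper names are hypothetical and are not part of the posted patch; the sketch only assumes the standard S_ISDIR()/S_ISREG() macros from <sys/stat.h>.

```c
#define _XOPEN_SOURCE 700
#include <errno.h>
#include <sys/stat.h>

/*
 * Hypothetical helpers, not part of the posted patch. Each returns 0
 * when a file of the given mode may be reflinked and -EPERM otherwise.
 */

/* The check as posted in v2: only directories are refused. */
int reflink_mode_check_isdir(mode_t mode)
{
	return S_ISDIR(mode) ? -EPERM : 0;
}

/* The check jim suggests: only regular files are accepted. */
int reflink_mode_check_isreg(mode_t mode)
{
	return S_ISREG(mode) ? 0 : -EPERM;
}
```

The two versions differ only for FIFOs, device nodes, sockets, and symlinks: the S_ISDIR() form lets them through to the filesystem, while the S_ISREG() form rejects them up front.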
* Re: [RFC] The reflink(2) system call v2.
From: Joel Becker @ 2009-05-08 3:10 UTC
To: jim owens
Cc: jmorris, linux-security-module, mtk.manpages, linux-fsdevel, ocfs2-devel, viro

On Thu, May 07, 2009 at 10:59:04PM -0400, jim owens wrote:
> You certainly did not address:
>
> - desire for one single system call to handle both
>   owner preservation and create with current owner.

	Nope, and I don't intend to.  reflink() is a snapshotting call,
not a kitchen sink.

Joel

--
Life's Little Instruction Book #444

	"Never underestimate the power of a kind word or deed."
* Re: [RFC] The reflink(2) system call v2.
From: jim owens @ 2009-05-08 11:53 UTC
To: jim owens, jmorris, ocfs2-devel, viro, mtk.manpages, linux-security-module, linux-fsdevel

Joel Becker wrote:
> On Thu, May 07, 2009 at 10:59:04PM -0400, jim owens wrote:
>> You certainly did not address:
>>
>> - desire for one single system call to handle both
>>   owner preservation and create with current owner.
>
> 	Nope, and I don't intend to.  reflink() is a snapshotting call,
> not a kitchen sink.

I'm not a maintainer, but if I were, I would NAK this in that
case, since more people wanted the cowfile() definition than
your reflink definition.

If you insist that you are only doing the snapshot, then
call it snaplink(2) or something.

The reflink() name makes no sense because all references
are internal to the file system.  There is absolutely no
way via "ls" to determine the reference between the original
and the new file.  With hard links and symlinks you can easily
associate them.

jim
* Re: [RFC] The reflink(2) system call v2.
From: jim owens @ 2009-05-08 12:16 UTC
To: jmorris, ocfs2-devel, viro, mtk.manpages, linux-security-module, linux-fsdevel
Cc: joel.becker

Joel Becker wrote:
> On Thu, May 07, 2009 at 10:59:04PM -0400, jim owens wrote:
>> You certainly did not address:
>>
>> - desire for one single system call to handle both
>>   owner preservation and create with current owner.
>
> 	Nope, and I don't intend to.  reflink() is a snapshotting call,
> not a kitchen sink.

BTW, the "kitchen sink" argument is bull!

All we are saying is have 1 syscall with 1 vfs operation
that does exactly the same thing except:

    if (FLAG_SNAPFILE) {
        if (not CAP_FOWNER)
            return -EPERM
        newfile.attrs = old_file.attrs
    } else
        newfile.attrs = user_default_create_attrs

I really think your objection is all because you are hung up
on your reflink() API that has NO EXISTING USERS!

jim
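jim's pseudo-code can be turned into a small compilable sketch. Everything named here is hypothetical: the flag name COW_FLAG_SNAPFILE, the file_attrs struct, and the has_cap_fowner parameter (standing in for a real capability check) are assumptions made for illustration, and the -EINVAL check on unknown flags is an addition that his mail does not mention.

```c
#include <errno.h>

/* Hypothetical flag: preserve the source file's attributes. */
#define COW_FLAG_SNAPFILE 0x1

/* Hypothetical, reduced set of attributes for the sketch. */
struct file_attrs {
	unsigned uid;
	unsigned gid;
	unsigned mode;
};

/*
 * One entry point, two behaviours. With the flag set, the caller must
 * hold the privilege and the new file inherits the source's attributes.
 * Without it, the new file gets the caller's default create attributes.
 */
int cowfile_pick_attrs(int flags, int has_cap_fowner,
		       const struct file_attrs *src,
		       const struct file_attrs *caller_default,
		       struct file_attrs *out)
{
	if (flags & ~COW_FLAG_SNAPFILE)
		return -EINVAL;		/* addition: reject unknown flags */

	if (flags & COW_FLAG_SNAPFILE) {
		if (!has_cap_fowner)
			return -EPERM;
		*out = *src;
	} else {
		*out = *caller_default;
	}
	return 0;
}
```

In this form the privileged path fails outright when the caller lacks the capability. The v4 proposal later in the thread instead degrades to reinitialized attributes without any flag.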
* Re: [RFC] The reflink(2) system call v2.
From: jim owens @ 2009-05-08 14:11 UTC
To: jmorris, ocfs2-devel, viro, mtk.manpages, linux-security-module, linux-fsdevel, joel.becker

> Joel Becker wrote:
>> On Thu, May 07, 2009 at 10:59:04PM -0400, jim owens wrote:
>>> You certainly did not address:
>>>
>>> - desire for one single system call to handle both
>>>   owner preservation and create with current owner.
>>
>> 	Nope, and I don't intend to.  reflink() is a snapshotting call,

You might have designed this for *snapshotting* but from a user
perspective the function is best described as:

    *Copy_file_attributes() and Cow_file_data()*

since immediately after the reflink() you permit everything
to be modified on the new file.

Your design is good, but you need to admit to yourself that
the Copy_file_attributes() is the special case as far as users
are concerned, because most people expect snapshots to be
immutable, and these are really files that don't use up space and
can be used as snapshots *if you don't modify them*.

jim
* [RFC] The reflink(2) system call v4. 2009-05-08 3:10 ` Joel Becker 2009-05-08 11:53 ` jim owens 2009-05-08 12:16 ` jim owens @ 2009-05-11 20:40 ` Joel Becker 2009-05-11 22:27 ` James Morris ` (6 more replies) 2 siblings, 7 replies; 151+ messages in thread From: Joel Becker @ 2009-05-11 20:40 UTC (permalink / raw) To: jim owens, jmorris, ocfs2-devel, viro, mtk.manpages, linux-security-module, linux-fsdevel On Thu, May 07, 2009 at 08:10:18PM -0700, Joel Becker wrote: > On Thu, May 07, 2009 at 10:59:04PM -0400, jim owens wrote: > > You certainly did not address: > > > > - desire for one single system call to handle both > > owner preservation and create with current owner. > > Nope, and I don't intend to. reflink() is a snapshotting call, > not a kitchen sink. I've been thinking about this all weekend. The current state doesn't make me happy. Now, what concerns me here is the interface to userspace. The system call itself. I don't care if we implement it via one vfs_foo() or 10 nor how many iops we end up with. We can and will modify those as we find better ideas. But I want reflink(2) to have a semantic that is easily understood and intuitive. When I initially designed reflink(), I hadn't thought about the ownership and permission implications of snapshotting. I was having too much fun reflinking files around. In that iteration, anyone could reflink a file. But a true snapshot needs ownership, permissions, acls, and other security attributes (in all, I'm gonna call that the "security context") as well. So I defined reflink() as such. This meant requiring privileges, but lost some of the flexibility of the call. I call that a loss. What I'm not going to do is add optional behaviors to the system call. It should be pretty obvious what it does, or we're doing it wrong. The 'flags' field of reflinkat(2) is for AT_* flags. When I decided on requiring privileges, I thought that degrading without privileges was too confusing. I was wrong. 
I want reflink() to fit into the pantheon of file system operations in a way that makes sense alongside the others, and this isn't it. Here's v4 of reflink(). If you have the privileges, you get the full snapshot. If you don't, you must have read access, and then you get the entire snapshot (data and extended attributes) except that the security context is reinitialized. That's it. It fits with most of the other ops, and it's a clean degradation. I add a flag to iops->reflink() so that the filesystem knows what to do with the security context. That's the only change visible outside of vfs_reflink(). Security folks, check my work. Everyone else, let me know if this satisfies. Joel From 1ebf4c2cf36d38b22de025b03753497466e18941 Mon Sep 17 00:00:00 2001 From: Joel Becker <joel.becker@oracle.com> Date: Sat, 2 May 2009 22:48:59 -0700 Subject: [PATCH] fs: Add the reflink() operation and reflinkat(2) system call. The userspace-visible idea of the operation is: int reflink(const char *oldpath, const char *newpath); int reflinkat(int olddirfd, const char *oldpath, int newdirfd, const char *newpath, int flags); The kernel only implements reflinkat(2). reflink(3) is a trivial wrapper around reflinkat(2). The reflink() system call creates reference-counted links. It creates a new file that shares the data extents of the source file in a copy-on-write fashion. Its calling semantics are identical to link(2) and linkat(2). Once complete, programs see the new file as a completely separate entry. reflink() attempts to preserve ownership, permissions, and security contexts in order to create a full snapshot. Preserving those attributes requires ownership or CAP_CHOWN. A caller without those privileges will see the security context of the new file initialized to their default. In the VFS, ->reflink() is an inode_operation with almost the same arguments as ->link(); an additional argument tells the filesystem to copy over or reinitialize the security context on the new file. 
A new LSM hook, security_inode_reflink(), is added. None of the existing LSM hooks appeared to fit. XXX: Currently only adds the x86_32 linkage. The rest of the architectures belong here too. Signed-off-by: Joel Becker <joel.becker@oracle.com> --- Documentation/filesystems/reflink.txt | 165 +++++++++++++++++++++++++++++++++ Documentation/filesystems/vfs.txt | 4 + arch/x86/include/asm/unistd_32.h | 1 + arch/x86/kernel/syscall_table_32.S | 1 + fs/namei.c | 113 ++++++++++++++++++++++ include/linux/fs.h | 2 + include/linux/security.h | 16 +++ include/linux/syscalls.h | 2 + security/capability.c | 6 + security/security.c | 7 ++ 10 files changed, 317 insertions(+), 0 deletions(-) create mode 100644 Documentation/filesystems/reflink.txt diff --git a/Documentation/filesystems/reflink.txt b/Documentation/filesystems/reflink.txt new file mode 100644 index 0000000..aa7380f --- /dev/null +++ b/Documentation/filesystems/reflink.txt @@ -0,0 +1,165 @@ +reflink(2) +========== + + +INTRODUCTION +------------ + +A reflink is a reference-counted link. The reflink(2) operation is +analogous to the link(2) operation, except that instead of two directory +entries pointing to the same inode, there are two identical inodes +pointing to the same data. Writes do not modify the shared data; they +use copy-on-write (CoW). Thus, after the reflink has been created, the +inodes can diverge without impacting each other. + + +SYNOPSIS +-------- + +The reflink(2) call looks just like link(2): + + int reflink(const char *oldpath, const char *newpath); + +The actual system call is reflinkat(2): + + int reflinkat(int olddirfd, const char *oldpath, + int newdirfd, const char *newpath, int flags); + +For details on how olddirfd, newdirfd, and flags behave, see linkat(2). +The reflink(2) call won't be implemented by the kernel, because it's a +trivial wrapper around reflinkat(2). + + +DESCRIPTION +----------- + +One way of viewing reflink is to look at the level of sharing. 
A +symbolic link does its sharing at the directory entry level; many names +end up pointing at the same directory entry. Hard links are one step +down. Multiple directory entries are sharing one inode. Reflinks are +down one more level: multiple inodes share the same data extents. + +When you symlink a file, you can then access it via the symlink or the +real directory entry, and for the most part they look identical. When +accessing more than one name for a hard link, the object returned looks +identical. Similarly, a newly created reflink is identical to its +source in almost every way and can be treated as such. This includes +ownership, permissions, security context, and data. The only things +that are different are the inode number, the link count, and the ctime. + +A reflink is a snapshot of the source file at the time it is created. + +Once created, though, a reflink can be modified like any other normal +file without affecting the source file. Changes to trivial fields like +permissions, owner, or times are guaranteed not to trigger CoW of file +data and will not return any error that wouldn't happen on a truly +distinct file. Changes to the file's data will trigger CoW of the data +affected - the actual CoW granularity is up to the filesystem, from +exact bytes up to the entire file. ocfs2, for example, will copy out an +entire extent or 1MB, whichever is smaller. + +Preserving the security context of the source file obviously requires +the privilege to do so. Callers that do not own the source file and do +not have CAP_CHOWN will get a new reflink with all non-security +attributes preserved; the security context of the new reflink will be +as a newly created file by that user. + +Partial reflinks are not allowed. The new inode will only appear in the +directory structure after it is fully formed. This prevents a crash or +lack of space from creating a partial reflink. + +If a filesystem does not support reflinks, the kernel and libc MUST NOT +fake it. 
Callers are expecting to get snapshots, and faking it will +violate that trust. + +The userspace view is as follows. When reflink(2) returns, opening +oldpath and newpath returns identical-looking files, just like link(2). +After that, oldpath and newpath behave as distinct files, and +modifications to one have no impact on the other. + + +RESTRICTIONS +------------ + +Just as the sharing gets lower as you move from symlink() -> link() -> +reflink(), the restrictions on the call get tighter. A symlink doesn't +require any access permissions other than being able to create its +inode. It can cross filesystems and mount points, and it can point to +any type of file. A hard link requires both source and target to be on +the same filesystem under the same mount point, and that the source not +be a directory. Like hard links and symlinks, a reflink cannot be +created if newpath exists. + +Reflinks add one big restriction on top of hard links: only the owner +or someone with elevated privileges (CAP_CHOWN) can preserve the +security context (permissions, ownership, ACLs, etc) across a reflink. +A reflink is a point-in-time snapshot of a file. Without the +appropriate privilege, the caller will see their own default security +context applied to the file. + +A caller without the privileges to preserve the security context must +have read access to reflink a file. + + +SHARING +------- + +A reflink creates a new inode. It shares all data extents of the source +file; this includes file data and extended attribute data. All of the +sharing is in a CoW fashion, and any modification of the data will break +the sharing. + +For some filesystems, certain data structures are not in allocated +storage extents. Creating a reflink might make a copy of these extents. +An example is ext3's ability to store small extended attributes inside +the ext3 inode. Since a reflink is creating a new inode, those extended +attributes are merely copied to the new inode. 
+ + +EXCEPTIONS +---------- + +All file attributes and extended attributes of the new file must be +identical to the source file with the following exceptions: + +- The new file must have a new inode number. This allows POSIX + programs to treat the source and new files as separate objects. From + the view of the POSIX application, the files are distinct. The + sharing is invisible outside of the filesystem's internal structures. +- The ctime of the source file only changes if the source's metadata + must be changed to accommodate the copy-on-write linkage. The ctime + of the new file is set to represent its creation. +- The link count of the source file is unchanged, and the link count of + the new file is one. +- If the caller lacks the privileges to preserve the security context, + the file will have its security context initialized as would any new + file. + +The mtime of the source file is unmodified, and the mtime of the new +file is set identical to the source file. This reflects that the data +is unchanged. + + +INODE OPERATION +--------------- + +Filesystems implement the ->reflink() inode operation. It has almost +the same prototype as ->link(): + + int (*reflink)(struct dentry *old_dentry, struct inode *dir, + struct dentry *new_dentry, int preserve_security); + +When the filesystem is called, the VFS has already checked the +permissions and mountpoint of the operation. It has determined whether +the security context should be preserved or reinitialized, as specified +by the preserve_security argument. The filesystem just needs to create +the new inode identical to the old one with the exceptions noted above, +link up the shared data extents, and then link the new inode into dir. + + +FOLLOWING SYMBOLIC LINKS +------------------------ + +reflink() dereferences symbolic links in the same manner that link(2) +does. The AT_SYMLINK_FOLLOW flag is honored just as for linkat(2). 
+ diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt index f49eecf..01cd810 100644 --- a/Documentation/filesystems/vfs.txt +++ b/Documentation/filesystems/vfs.txt @@ -333,6 +333,7 @@ struct inode_operations { ssize_t (*listxattr) (struct dentry *, char *, size_t); int (*removexattr) (struct dentry *, const char *); void (*truncate_range)(struct inode *, loff_t, loff_t); + int (*reflink) (struct dentry *,struct inode *,struct dentry *); }; Again, all methods are called without any locks being held, unless @@ -431,6 +432,9 @@ otherwise noted. truncate_range: a method provided by the underlying filesystem to truncate a range of blocks , i.e. punch a hole somewhere in a file. + reflink: called by the reflink(2) system call. Only required if you want + to support reflinks. For further information, see + Documentation/filesystems/reflink.txt. The Address Space Object diff --git a/arch/x86/include/asm/unistd_32.h b/arch/x86/include/asm/unistd_32.h index 6e72d74..c368563 100644 --- a/arch/x86/include/asm/unistd_32.h +++ b/arch/x86/include/asm/unistd_32.h @@ -340,6 +340,7 @@ #define __NR_inotify_init1 332 #define __NR_preadv 333 #define __NR_pwritev 334 +#define __NR_reflinkat 335 #ifdef __KERNEL__ diff --git a/arch/x86/kernel/syscall_table_32.S b/arch/x86/kernel/syscall_table_32.S index ff5c873..d11c200 100644 --- a/arch/x86/kernel/syscall_table_32.S +++ b/arch/x86/kernel/syscall_table_32.S @@ -334,3 +334,4 @@ ENTRY(sys_call_table) .long sys_inotify_init1 .long sys_preadv .long sys_pwritev + .long sys_reflinkat /* 335 */ diff --git a/fs/namei.c b/fs/namei.c index 78f253c..34a6ce5 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -2486,6 +2486,118 @@ SYSCALL_DEFINE2(link, const char __user *, oldname, const char __user *, newname return sys_linkat(AT_FDCWD, oldname, AT_FDCWD, newname, 0); } +int vfs_reflink(struct dentry *old_dentry, struct inode *dir, struct dentry *new_dentry) +{ + struct inode *inode = old_dentry->d_inode; + int error; + int 
preserve_security = 1; + + if (!inode) + return -ENOENT; + + /* + * If the caller has the rights, reflink() will preserve the + * security context of the source inode. + */ + if ((current_fsuid() != inode->i_uid) && !capable(CAP_CHOWN)) + preserve_security = 0; + if ((current_fsuid() != inode->i_uid) && + !in_group_p(inode->i_gid) && !capable(CAP_CHOWN)) + preserve_security = 0; + + /* + * If the caller doesn't have the right to preserve the security + * context, the caller is only getting the data and extended + * attributes. They need read permission on the file. + */ + if (!preserve_security) { + error = inode_permission(inode, MAY_READ); + if (error) + return error; + } + + error = may_create(dir, new_dentry); + if (error) + return error; + + if (dir->i_sb != inode->i_sb) + return -EXDEV; + + /* + * A reflink to an append-only or immutable file cannot be created. + */ + if (IS_APPEND(inode) || IS_IMMUTABLE(inode)) + return -EPERM; + if (!dir->i_op->reflink) + return -EPERM; + if (S_ISDIR(inode->i_mode)) + return -EPERM; + + error = security_inode_reflink(old_dentry, dir); + if (error) + return error; + + mutex_lock(&inode->i_mutex); + vfs_dq_init(dir); + error = dir->i_op->reflink(old_dentry, dir, new_dentry, + preserve_security); + mutex_unlock(&inode->i_mutex); + if (!error) + fsnotify_create(dir, new_dentry); + return error; +} + +SYSCALL_DEFINE5(reflinkat, int, olddfd, const char __user *, oldname, + int, newdfd, const char __user *, newname, int, flags) +{ + struct dentry *new_dentry; + struct nameidata nd; + struct path old_path; + int error; + char *to; + + if ((flags & ~AT_SYMLINK_FOLLOW) != 0) + return -EINVAL; + + error = user_path_at(olddfd, oldname, + flags & AT_SYMLINK_FOLLOW ? 
LOOKUP_FOLLOW : 0,
+			&old_path);
+	if (error)
+		return error;
+
+	error = user_path_parent(newdfd, newname, &nd, &to);
+	if (error)
+		goto out;
+	error = -EXDEV;
+	if (old_path.mnt != nd.path.mnt)
+		goto out_release;
+	new_dentry = lookup_create(&nd, 0);
+	error = PTR_ERR(new_dentry);
+	if (IS_ERR(new_dentry))
+		goto out_unlock;
+	error = mnt_want_write(nd.path.mnt);
+	if (error)
+		goto out_dput;
+	error = security_path_link(old_path.dentry, &nd.path, new_dentry);
+	if (error)
+		goto out_drop_write;
+	error = vfs_reflink(old_path.dentry, nd.path.dentry->d_inode, new_dentry);
+out_drop_write:
+	mnt_drop_write(nd.path.mnt);
+out_dput:
+	dput(new_dentry);
+out_unlock:
+	mutex_unlock(&nd.path.dentry->d_inode->i_mutex);
+out_release:
+	path_put(&nd.path);
+	putname(to);
+out:
+	path_put(&old_path);
+
+	return error;
+}
+
+
 /*
  * The worst of all namespace operations - renaming directory. "Perverted"
  * doesn't even start to describe it. Somebody in UCB had a heck of a trip...
@@ -2890,6 +3002,7 @@ EXPORT_SYMBOL(unlock_rename);
 EXPORT_SYMBOL(vfs_create);
 EXPORT_SYMBOL(vfs_follow_link);
 EXPORT_SYMBOL(vfs_link);
+EXPORT_SYMBOL(vfs_reflink);
 EXPORT_SYMBOL(vfs_mkdir);
 EXPORT_SYMBOL(vfs_mknod);
 EXPORT_SYMBOL(generic_permission);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 5bed436..0a5c807 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1415,6 +1415,7 @@ extern int vfs_link(struct dentry *, struct inode *, struct dentry *);
 extern int vfs_rmdir(struct inode *, struct dentry *);
 extern int vfs_unlink(struct inode *, struct dentry *);
 extern int vfs_rename(struct inode *, struct dentry *, struct inode *, struct dentry *);
+extern int vfs_reflink(struct dentry *, struct inode *, struct dentry *);
 
 /*
  * VFS dentry helper functions.
@@ -1537,6 +1538,7 @@ struct inode_operations {
 			  loff_t len);
 	int (*fiemap)(struct inode *, struct fiemap_extent_info *, u64 start,
 		      u64 len);
+	int (*reflink) (struct dentry *,struct inode *,struct dentry *,int);
 };
 
 struct seq_file;
diff --git a/include/linux/security.h b/include/linux/security.h
index d5fd616..ea9cd93 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -528,6 +528,14 @@ static inline void security_free_mnt_opts(struct security_mnt_opts *opts)
  *	@inode contains a pointer to the inode.
  *	@secid contains a pointer to the location where result will be saved.
  *	In case of failure, @secid will be set to zero.
+ * @inode_reflink:
+ *	Check permission before creating a new reference-counted link to
+ *	a file.
+ *	@old_dentry contains the dentry structure for an existing link to
+ *	the file.
+ *	@dir contains the inode structure of the parent directory of the
+ *	new reflink.
+ *	Return 0 if permission is granted.
  *
  * Security hooks for file operations
  *
@@ -1415,6 +1423,7 @@ struct security_operations {
 	int (*inode_unlink) (struct inode *dir, struct dentry *dentry);
 	int (*inode_symlink) (struct inode *dir,
 			      struct dentry *dentry, const char *old_name);
+	int (*inode_reflink) (struct dentry *old_dentry, struct inode *dir);
 	int (*inode_mkdir) (struct inode *dir, struct dentry *dentry, int mode);
 	int (*inode_rmdir) (struct inode *dir, struct dentry *dentry);
 	int (*inode_mknod) (struct inode *dir, struct dentry *dentry,
@@ -1675,6 +1684,7 @@ int security_inode_link(struct dentry *old_dentry, struct inode *dir,
 int security_inode_unlink(struct inode *dir, struct dentry *dentry);
 int security_inode_symlink(struct inode *dir, struct dentry *dentry,
 			   const char *old_name);
+int security_inode_reflink(struct dentry *old_dentry, struct inode *dir);
 int security_inode_mkdir(struct inode *dir, struct dentry *dentry, int mode);
 int security_inode_rmdir(struct inode *dir, struct dentry *dentry);
 int security_inode_mknod(struct inode *dir, struct dentry *dentry, int mode, dev_t dev);
@@ -2056,6 +2066,12 @@ static inline int security_inode_symlink(struct inode *dir,
 	return 0;
 }
 
+static inline int security_inode_reflink(struct dentry *old_dentry,
+					 struct inode *dir)
+{
+	return 0;
+}
+
 static inline int security_inode_mkdir(struct inode *dir,
 					struct dentry *dentry,
 					int mode)
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 40617c1..35a8743 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -692,6 +692,8 @@ asmlinkage long sys_symlinkat(const char __user * oldname,
 				   int newdfd, const char __user * newname);
 asmlinkage long sys_linkat(int olddfd, const char __user *oldname,
 			   int newdfd, const char __user *newname, int flags);
+asmlinkage long sys_reflinkat(int olddfd, const char __user *oldname,
+			      int newdfd, const char __user *newname, int flags);
 asmlinkage long sys_renameat(int olddfd, const char __user * oldname,
 			     int newdfd, const char __user * newname);
 asmlinkage long sys_futimesat(int dfd, char __user *filename,
diff --git a/security/capability.c b/security/capability.c
index 21b6cea..3dcc4cc 100644
--- a/security/capability.c
+++ b/security/capability.c
@@ -172,6 +172,11 @@ static int cap_inode_symlink(struct inode *inode, struct dentry *dentry,
 	return 0;
 }
 
+static int cap_inode_reflink(struct dentry *old_dentry, struct inode *inode)
+{
+	return 0;
+}
+
 static int cap_inode_mkdir(struct inode *inode, struct dentry *dentry,
 			   int mask)
 {
@@ -905,6 +910,7 @@ void security_fixup_ops(struct security_operations *ops)
 	set_to_cap_if_null(ops, inode_link);
 	set_to_cap_if_null(ops, inode_unlink);
 	set_to_cap_if_null(ops, inode_symlink);
+	set_to_cap_if_null(ops, inode_reflink);
 	set_to_cap_if_null(ops, inode_mkdir);
 	set_to_cap_if_null(ops, inode_rmdir);
 	set_to_cap_if_null(ops, inode_mknod);
diff --git a/security/security.c b/security/security.c
index 5284255..70d0ac3 100644
--- a/security/security.c
+++ b/security/security.c
@@ -470,6 +470,13 @@ int security_inode_symlink(struct inode *dir, struct dentry *dentry,
 	return security_ops->inode_symlink(dir, dentry, old_name);
 }
 
+int security_inode_reflink(struct dentry *old_dentry, struct inode *dir)
+{
+	if (unlikely(IS_PRIVATE(old_dentry->d_inode)))
+		return 0;
+	return security_ops->inode_reflink(old_dentry, dir);
+}
+
 int security_inode_mkdir(struct inode *dir, struct dentry *dentry, int mode)
 {
 	if (unlikely(IS_PRIVATE(dir)))
-- 
1.6.1.3

-- 

"Three o'clock is always too late or too early for anything you
 want to do."
        - Jean-Paul Sartre

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127

^ permalink raw reply related [flat|nested] 151+ messages in thread
* Re: [RFC] The reflink(2) system call v4.
  2009-05-11 20:40 ` [RFC] The reflink(2) system call v4 Joel Becker
@ 2009-05-11 22:27 ` James Morris
  2009-05-11 22:34 ` Joel Becker
  2009-05-12 12:01 ` Stephen Smalley
  2009-05-11 23:11 ` jim owens
  ` (5 subsequent siblings)
  6 siblings, 2 replies; 151+ messages in thread
From: James Morris @ 2009-05-11 22:27 UTC (permalink / raw)
  To: Joel Becker
  Cc: jim owens, ocfs2-devel, viro, mtk.manpages, linux-security-module, linux-fsdevel

On Mon, 11 May 2009, Joel Becker wrote:

> and other security attributes (in all, I'm gonna call that the "security
> context") as well. So I defined reflink() as such. This meant

"security context" is a term associated with SELinux, so you may want to
use something like "security attributes" or "security state" to avoid
confusing people.

> +	error = security_inode_reflink(old_dentry, dir);
> +	if (error)
> +		return error;

We'll need the new_dentry now, to set up new security state before the
dentry is instantiated.

e.g. SELinux will need to perform some checks on the operation, then
calculate a new security context for the new file.

- James
-- 
James Morris
<jmorris@namei.org>

^ permalink raw reply [flat|nested] 151+ messages in thread
* Re: [RFC] The reflink(2) system call v4.
  2009-05-11 22:27 ` James Morris
@ 2009-05-11 22:34 ` Joel Becker
  2009-05-12 1:12 ` James Morris
  2009-05-12 12:01 ` Stephen Smalley
  1 sibling, 1 reply; 151+ messages in thread
From: Joel Becker @ 2009-05-11 22:34 UTC (permalink / raw)
  To: James Morris
  Cc: jim owens, ocfs2-devel, viro, mtk.manpages, linux-security-module, linux-fsdevel

On Tue, May 12, 2009 at 08:27:17AM +1000, James Morris wrote:
> On Mon, 11 May 2009, Joel Becker wrote:
>
> > and other security attributes (in all, I'm gonna call that the "security
> > context") as well. So I defined reflink() as such. This meant
>
> "security context" is a term associated with SELinux, so you may want to
> use something like "security attributes" or "security state" to avoid
> confusing people.

Ok, I wondered if my brain had picked that out from somewhere.

> > +	error = security_inode_reflink(old_dentry, dir);
> > +	if (error)
> > +		return error;
>
> We'll need the new_dentry now, to set up new security state before the
> dentry is instantiated.
>
> e.g. SELinux will need to perform some checks on the operation, then
> calculate a new security context for the new file.

Do I need to pass in preserve_security as well so SELinux knows
what the ownership check determined?

Joel

-- 

"Copy from one, it's plagiarism; copy from two, it's research."
        - Wilson Mizner

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127

^ permalink raw reply [flat|nested] 151+ messages in thread
* Re: [RFC] The reflink(2) system call v4.
  2009-05-11 22:34 ` Joel Becker
@ 2009-05-12 1:12 ` James Morris
  2009-05-12 12:18 ` Stephen Smalley
  0 siblings, 1 reply; 151+ messages in thread
From: James Morris @ 2009-05-12 1:12 UTC (permalink / raw)
  To: Joel Becker
  Cc: jim owens, ocfs2-devel, viro, mtk.manpages, linux-security-module, linux-fsdevel

On Mon, 11 May 2009, Joel Becker wrote:

> > e.g. SELinux will need to perform some checks on the operation, then
> > calculate a new security context for the new file.
>
> Do I need to pass in preserve_security as well so SELinux knows
> what the ownership check determined?

Not for SELinux -- its security attributes are orthogonal to DAC, and it
will perform its own checks on them.

Other LSMs should operate similarly (there is also the CAP_CHOWN check
which the LSM may hook), although if not, the flag can be added later if
required.

- James
-- 
James Morris
<jmorris@namei.org>

^ permalink raw reply [flat|nested] 151+ messages in thread
* Re: [RFC] The reflink(2) system call v4.
  2009-05-12 1:12 ` James Morris
@ 2009-05-12 12:18 ` Stephen Smalley
  2009-05-12 17:22 ` Joel Becker
  0 siblings, 1 reply; 151+ messages in thread
From: Stephen Smalley @ 2009-05-12 12:18 UTC (permalink / raw)
  To: James Morris
  Cc: Joel Becker, jim owens, ocfs2-devel, viro, mtk.manpages, linux-security-module, linux-fsdevel

On Tue, 2009-05-12 at 11:12 +1000, James Morris wrote:
> On Mon, 11 May 2009, Joel Becker wrote:
>
> > > e.g. SELinux will need to perform some checks on the operation, then
> > > calculate a new security context for the new file.
> >
> > Do I need to pass in preserve_security as well so SELinux knows
> > what the ownership check determined?
>
> Not for SELinux -- its security attributes are orthogonal to DAC, and it
> will perform its own checks on them.

Is preserve_security supposed to also control the preservation of the
SELinux security attribute (security.selinux extended attribute)? I'd
expect that either we preserve all the security-relevant attributes or
none of them. And if that is the case, then SELinux has to know about
preserve_security in order to know what the security context of the new
inode will be.

Also, if you are going to automatically degrade reflink(2) behavior
based on the owner_or_cap test, then you ought to allow the same to be
true if the security module vetoes the attempt to preserve attributes.
Either DAC or MAC logic may say that security attributes cannot be
preserved. Your current logic will only allow graceful degradation in
the DAC case, but the MAC case will remain a hard failure.

> Other LSMs should operate similarly (there is also the CAP_CHOWN check
> which the LSM may hook), although if not, the flag can be added later if
> required.
>
>
> - James
-- 
Stephen Smalley
National Security Agency

^ permalink raw reply [flat|nested] 151+ messages in thread
* Re: [RFC] The reflink(2) system call v4.
  2009-05-12 12:18 ` Stephen Smalley
@ 2009-05-12 17:22 ` Joel Becker
  2009-05-12 17:32 ` Stephen Smalley
  0 siblings, 1 reply; 151+ messages in thread
From: Joel Becker @ 2009-05-12 17:22 UTC (permalink / raw)
  To: Stephen Smalley
  Cc: James Morris, linux-fsdevel, linux-security-module, mtk.manpages, jim owens, ocfs2-devel, viro

On Tue, May 12, 2009 at 08:18:34AM -0400, Stephen Smalley wrote:
> On Tue, 2009-05-12 at 11:12 +1000, James Morris wrote:
> > On Mon, 11 May 2009, Joel Becker wrote:
> >
> > > > e.g. SELinux will need to perform some checks on the operation, then
> > > > calculate a new security context for the new file.
> > >
> > > Do I need to pass in preserve_security as well so SELinux knows
> > > what the ownership check determined?
> >
> > Not for SELinux -- its security attributes are orthogonal to DAC, and it
> > will perform its own checks on them.
>
> Is preserve_security supposed to also control the preservation of the
> SELinux security attribute (security.selinux extended attribute)? I'd
> expect that either we preserve all the security-relevant attributes or
> none of them. And if that is the case, then SELinux has to know about
> preserve_security in order to know what the security context of the new
> inode will be.

Thank you Stephen, you read my mind. In the ocfs2 case, we're
expecting to just reflink the extended attribute structures verbatim in
the preserve_security case. So we would be ignoring whatever was set on
the new_dentry by security_inode_reflink(). This gets us the best CoW
sharing of the xattr extents, but I want to make sure that's "safe" in
the preserve_security case.

> Also, if you are going to automatically degrade reflink(2) behavior
> based on the owner_or_cap test, then you ought to allow the same to be
> true if the security module vetoes the attempt to preserve attributes.
> Either DAC or MAC logic may say that security attributes cannot be
> preserved. Your current logic will only allow graceful degradation in
> the DAC case, but the MAC case will remain a hard failure.

I did not think of this, and it's a very good point as well. I'm
not sure how to have the return value of security_inode_reflink()
distinguish between "disallow the reflink" and "disallow
preserve_security". But since !preserve_security requires read access
only, perhaps we move security_inode_reflink up higher and say:

	error = security_inode_reflink(old_dentry, dir);
	if (error)
		preserve_security = 0;

Here security_inode_reflink() does not need new_dentry, because it isn't
setting a security context. If it's ok with the reflink, we'll be
copying the extended attribute. If it's not OK, it falls through to the
inode_permission(inode, MAY_READ) check, which will check for plain old
read access.
What do we think?

Joel

-- 

"Under capitalism, man exploits man. Under Communism, it's just
 the opposite."
        - John Kenneth Galbraith

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127

^ permalink raw reply [flat|nested] 151+ messages in thread
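The fall-through proposed in this message can be modelled in userspace. The following is a minimal sketch, not kernel code: `struct reflink_checks` and `reflink_permission()` are hypothetical names, and the three fields merely stand in for the results of the ownership test, security_inode_reflink(), and inode_permission(inode, MAY_READ).

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical inputs standing in for the kernel-side checks. */
struct reflink_checks {
	int is_owner_or_cap;	/* result of the DAC ownership test */
	int lsm_error;		/* what security_inode_reflink() would return */
	int read_error;		/* what inode_permission(inode, MAY_READ) would return */
};

/* Returns 0 and sets *preserve_security, or a negative errno. */
static int reflink_permission(const struct reflink_checks *c,
			      int *preserve_security)
{
	assert(c != NULL && preserve_security != NULL);

	/* DAC: a non-owner may not preserve security attributes. */
	*preserve_security = c->is_owner_or_cap;

	/* As proposed: an LSM refusal only downgrades the operation. */
	if (c->lsm_error)
		*preserve_security = 0;

	/* Without preservation, plain read access is what is checked. */
	if (!*preserve_security && c->read_error)
		return c->read_error;

	return 0;
}
```

In this model an LSM refusal never fails the call by itself; only the read check can, which is the behaviour the follow-up replies go on to question.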
* Re: [RFC] The reflink(2) system call v4.
  2009-05-12 17:22 ` Joel Becker
@ 2009-05-12 17:32 ` Stephen Smalley
  2009-05-12 18:03 ` Joel Becker
  0 siblings, 1 reply; 151+ messages in thread
From: Stephen Smalley @ 2009-05-12 17:32 UTC (permalink / raw)
  To: Joel Becker
  Cc: James Morris, jim owens, ocfs2-devel, viro, mtk.manpages, linux-security-module, linux-fsdevel

On Tue, 2009-05-12 at 10:22 -0700, Joel Becker wrote:
> On Tue, May 12, 2009 at 08:18:34AM -0400, Stephen Smalley wrote:
> > On Tue, 2009-05-12 at 11:12 +1000, James Morris wrote:
> > > On Mon, 11 May 2009, Joel Becker wrote:
> > >
> > > > > e.g. SELinux will need to perform some checks on the operation, then
> > > > > calculate a new security context for the new file.
> > > >
> > > > Do I need to pass in preserve_security as well so SELinux knows
> > > > what the ownership check determined?
> > >
> > > Not for SELinux -- its security attributes are orthogonal to DAC, and it
> > > will perform its own checks on them.
> >
> > Is preserve_security supposed to also control the preservation of the
> > SELinux security attribute (security.selinux extended attribute)? I'd
> > expect that either we preserve all the security-relevant attributes or
> > none of them. And if that is the case, then SELinux has to know about
> > preserve_security in order to know what the security context of the new
> > inode will be.
>
> Thank you Stephen, you read my mind. In the ocfs2 case, we're
> expecting to just reflink the extended attribute structures verbatim in
> the preserve_security case.

And in the preserve_security==0 case, you'll be calling
security_inode_init_security() in order to get the attribute name/value
pair to assign to the new inode just as in the normal file creation
case?

> So we would be ignoring whatever was set on
> the new_dentry by security_inode_reflink(). This gets us the best CoW
> sharing of the xattr extents, but I want to make sure that's "safe" in
> the preserve_security case.

security_inode_reflink() can't handle the initialization regardless, as
the inode doesn't yet exist at that point.

> > Also, if you are going to automatically degrade reflink(2) behavior
> > based on the owner_or_cap test, then you ought to allow the same to be
> > true if the security module vetoes the attempt to preserve attributes.
> > Either DAC or MAC logic may say that security attributes cannot be
> > preserved. Your current logic will only allow graceful degradation in
> > the DAC case, but the MAC case will remain a hard failure.
>
> I did not think of this, and it's a very good point as well. I'm
> not sure how to have the return value of security_inode_reflink()
> distinguish between "disallow the reflink" and "disallow
> preserve_security". But since !preserve_security requires read access
> only, perhaps we move security_inode_reflink up higher and say:
>
> 	error = security_inode_reflink(old_dentry, dir);
> 	if (error)
> 		preserve_security = 0;
>
> Here security_inode_reflink() does not need new_dentry, because it isn't
> setting a security context. If it's ok with the reflink, we'll be
> copying the extended attribute. If it's not OK, it falls through to the
> inode_permission(inode, MAY_READ) check, which will check for plain old
> read access.
> What do we think?

I'd rather have two hooks, one to allow the security module to override
preserve_security and one to allow the security module to deny the
operation altogether. The former hook only needs to be called if
preserve_security is not already cleared by the DAC logic. The latter
hook needs to know the final verdict on preserve_security in order to
determine the right set of checks to apply, which isn't necessarily
limited to only checking read access.

But we don't need the new_dentry regardless.

-- 
Stephen Smalley
National Security Agency

^ permalink raw reply [flat|nested] 151+ messages in thread
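The two-hook arrangement described in this message might look roughly like the sketch below. It is a userspace model under stated assumptions: `struct mock_lsm`, `mock_reflink_preserve()`, `mock_reflink_permission()` and `mock_vfs_reflink_checks()` are hypothetical names invented for illustration and do not appear in the patch or in the LSM interface.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical policy state for a mock security module. */
struct mock_lsm {
	int allow_preserve;	/* may security attributes be copied? */
	int allow_reflink;	/* may the operation happen at all? */
};

/* Hook 1 (hypothetical): may clear preserve_security, never fails. */
static void mock_reflink_preserve(const struct mock_lsm *lsm,
				  int *preserve_security)
{
	assert(lsm != NULL && preserve_security != NULL);
	if (!lsm->allow_preserve)
		*preserve_security = 0;
}

/* Hook 2 (hypothetical): final permission check, told the verdict. */
static int mock_reflink_permission(const struct mock_lsm *lsm,
				   int preserve_security)
{
	assert(lsm != NULL);
	/* A real module would choose its checks based on this value. */
	(void)preserve_security;
	return lsm->allow_reflink ? 0 : -EACCES;
}

/* Caller side: DAC first, then hook 1 only if needed, then hook 2. */
static int mock_vfs_reflink_checks(const struct mock_lsm *lsm,
				   int is_owner_or_cap,
				   int *preserve_security)
{
	*preserve_security = is_owner_or_cap;

	/* Hook 1 is skipped when DAC already cleared the flag. */
	if (*preserve_security)
		mock_reflink_preserve(lsm, preserve_security);

	return mock_reflink_permission(lsm, *preserve_security);
}
```

Splitting the hooks keeps "downgrade" and "deny" as separate decisions, at the cost of two entry points in the security operations table.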
* Re: [RFC] The reflink(2) system call v4.
  2009-05-12 17:32 ` Stephen Smalley
@ 2009-05-12 18:03 ` Joel Becker
  2009-05-12 18:04 ` Stephen Smalley
  2009-05-13 1:47 ` Casey Schaufler
  0 siblings, 2 replies; 151+ messages in thread
From: Joel Becker @ 2009-05-12 18:03 UTC (permalink / raw)
  To: Stephen Smalley
  Cc: James Morris, linux-fsdevel, linux-security-module, mtk.manpages, jim owens, ocfs2-devel, viro

On Tue, May 12, 2009 at 01:32:47PM -0400, Stephen Smalley wrote:
> On Tue, 2009-05-12 at 10:22 -0700, Joel Becker wrote:
> > On Tue, May 12, 2009 at 08:18:34AM -0400, Stephen Smalley wrote:
> > > Is preserve_security supposed to also control the preservation of the
> > > SELinux security attribute (security.selinux extended attribute)? I'd
> > > expect that either we preserve all the security-relevant attributes or
> > > none of them. And if that is the case, then SELinux has to know about
> > > preserve_security in order to know what the security context of the new
> > > inode will be.
> >
> > Thank you Stephen, you read my mind. In the ocfs2 case, we're
> > expecting to just reflink the extended attribute structures verbatim in
> > the preserve_security case.
>
> And in the preserve_security==0 case, you'll be calling
> security_inode_init_security() in order to get the attribute name/value
> pair to assign to the new inode just as in the normal file creation
> case?

Oh, absolutely.
As an aside, do inodes ever have more than one security.*
attribute? It would appear that security_inode_init_security() just
returns one attribute, but what if I had a system running under SMACK
and then changed to SELinux? Would my (existing) inode then have
security.smack and security.selinux attributes?

> > > Also, if you are going to automatically degrade reflink(2) behavior
> > > based on the owner_or_cap test, then you ought to allow the same to be
> > > true if the security module vetoes the attempt to preserve attributes.
> > > Either DAC or MAC logic may say that security attributes cannot be
> > > preserved. Your current logic will only allow graceful degradation in
> > > the DAC case, but the MAC case will remain a hard failure.
> >
> > I did not think of this, and it's a very good point as well. I'm
> > not sure how to have the return value of security_inode_reflink()
> > distinguish between "disallow the reflink" and "disallow
> > preserve_security". But since !preserve_security requires read access
> > only, perhaps we move security_inode_reflink up higher and say:
> >
> > 	error = security_inode_reflink(old_dentry, dir);
> > 	if (error)
> > 		preserve_security = 0;
> >
> > Here security_inode_reflink() does not need new_dentry, because it isn't
> > setting a security context. If it's ok with the reflink, we'll be
> > copying the extended attribute. If it's not OK, it falls through to the
> > inode_permission(inode, MAY_READ) check, which will check for plain old
> > read access.
> > What do we think?
>
> I'd rather have two hooks, one to allow the security module to override
> preserve_security and one to allow the security module to deny the
> operation altogether. The former hook only needs to be called if
> preserve_security is not already cleared by the DAC logic. The latter
> hook needs to know the final verdict on preserve_security in order to
> determine the right set of checks to apply, which isn't necessarily
> limited to only checking read access.

Ok, is that two hooks or one hook with specific error returns?
I don't care, it's up to the LSM group. I just can't come up with a
good distinguishing set of names if it's two hooks :-)

Joel

-- 

Life's Little Instruction Book #157

	"Take time to smell the roses."

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127

^ permalink raw reply [flat|nested] 151+ messages in thread
* Re: [RFC] The reflink(2) system call v4.
  2009-05-12 18:03 ` Joel Becker
@ 2009-05-12 18:04 ` Stephen Smalley
  2009-05-12 18:28 ` Joel Becker
  2009-05-14 18:06 ` Stephen Smalley
  2009-05-13 1:47 ` Casey Schaufler
  1 sibling, 2 replies; 151+ messages in thread
From: Stephen Smalley @ 2009-05-12 18:04 UTC (permalink / raw)
  To: Joel Becker
  Cc: James Morris, jim owens, ocfs2-devel, viro, mtk.manpages, linux-security-module, linux-fsdevel

On Tue, 2009-05-12 at 11:03 -0700, Joel Becker wrote:
> On Tue, May 12, 2009 at 01:32:47PM -0400, Stephen Smalley wrote:
> > On Tue, 2009-05-12 at 10:22 -0700, Joel Becker wrote:
> > > On Tue, May 12, 2009 at 08:18:34AM -0400, Stephen Smalley wrote:
> > > > Is preserve_security supposed to also control the preservation of the
> > > > SELinux security attribute (security.selinux extended attribute)? I'd
> > > > expect that either we preserve all the security-relevant attributes or
> > > > none of them. And if that is the case, then SELinux has to know about
> > > > preserve_security in order to know what the security context of the new
> > > > inode will be.
> > >
> > > Thank you Stephen, you read my mind. In the ocfs2 case, we're
> > > expecting to just reflink the extended attribute structures verbatim in
> > > the preserve_security case.
> >
> > And in the preserve_security==0 case, you'll be calling
> > security_inode_init_security() in order to get the attribute name/value
> > pair to assign to the new inode just as in the normal file creation
> > case?
>
> Oh, absolutely.
> As an aside, do inodes ever have more than one security.*
> attribute? It would appear that security_inode_init_security() just
> returns one attribute, but what if I had a system running under SMACK
> and then changed to SELinux? Would my (existing) inode then have
> security.smack and security.selinux attributes?

No, there would be no security.selinux attribute and the file would be
treated as having a well-defined 'unlabeled' attribute by SELinux. Not
something you have to worry about.

> > > > Also, if you are going to automatically degrade reflink(2) behavior
> > > > based on the owner_or_cap test, then you ought to allow the same to be
> > > > true if the security module vetoes the attempt to preserve attributes.
> > > > Either DAC or MAC logic may say that security attributes cannot be
> > > > preserved. Your current logic will only allow graceful degradation in
> > > > the DAC case, but the MAC case will remain a hard failure.
> > >
> > > I did not think of this, and it's a very good point as well. I'm
> > > not sure how to have the return value of security_inode_reflink()
> > > distinguish between "disallow the reflink" and "disallow
> > > preserve_security". But since !preserve_security requires read access
> > > only, perhaps we move security_inode_reflink up higher and say:
> > >
> > > 	error = security_inode_reflink(old_dentry, dir);
> > > 	if (error)
> > > 		preserve_security = 0;
> > >
> > > Here security_inode_reflink() does not need new_dentry, because it isn't
> > > setting a security context. If it's ok with the reflink, we'll be
> > > copying the extended attribute. If it's not OK, it falls through to the
> > > inode_permission(inode, MAY_READ) check, which will check for plain old
> > > read access.
> > > What do we think?
> >
> > I'd rather have two hooks, one to allow the security module to override
> > preserve_security and one to allow the security module to deny the
> > operation altogether. The former hook only needs to be called if
> > preserve_security is not already cleared by the DAC logic. The latter
> > hook needs to know the final verdict on preserve_security in order to
> > determine the right set of checks to apply, which isn't necessarily
> > limited to only checking read access.
>
> Ok, is that two hooks or one hook with specific error returns?
> I don't care, it's up to the LSM group. I just can't come up with a
> good distinguishing set of names if it's two hooks :-)

I suppose you could coalesce them into a single hook ala:
	error = security_inode_reflink(old_dentry, dir, &preserve_security);
	if (error)
		return (error);

-- 
Stephen Smalley
National Security Agency

^ permalink raw reply [flat|nested] 151+ messages in thread
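The coalesced form, where one hook can either deny the reflink or merely clear the flag through the pointer, can be sketched in userspace as follows. This is an illustration only: `struct mock_policy` and `mock_security_inode_reflink()` are hypothetical, and the real hook would also take the dentry and directory inode shown in the snippet above.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical policy knobs for a mock security module. */
struct mock_policy {
	int deny_reflink;	/* refuse the operation outright */
	int deny_preserve;	/* allow it, but not attribute preservation */
};

/*
 * Mock of the coalesced hook: a non-zero return denies the reflink,
 * while a cleared *preserve_security merely downgrades it.
 */
static int mock_security_inode_reflink(const struct mock_policy *p,
				       int *preserve_security)
{
	assert(p != NULL && preserve_security != NULL);

	if (p->deny_reflink)
		return -EACCES;
	if (p->deny_preserve)
		*preserve_security = 0;
	return 0;
}
```

The in/out parameter lets the return value keep its usual meaning ("deny the operation") while the downgrade travels through `*preserve_security`.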
* Re: [RFC] The reflink(2) system call v4.
  2009-05-12 18:04 ` Stephen Smalley
@ 2009-05-12 18:28 ` Joel Becker
  2009-05-12 18:37 ` Stephen Smalley
  2009-05-14 18:06 ` Stephen Smalley
  1 sibling, 1 reply; 151+ messages in thread
From: Joel Becker @ 2009-05-12 18:28 UTC (permalink / raw)
  To: Stephen Smalley
  Cc: James Morris, linux-fsdevel, linux-security-module, mtk.manpages, jim owens, ocfs2-devel, viro

On Tue, May 12, 2009 at 02:04:53PM -0400, Stephen Smalley wrote:
> On Tue, 2009-05-12 at 11:03 -0700, Joel Becker wrote:
> > As an aside, do inodes ever have more than one security.*
> > attribute? It would appear that security_inode_init_security() just
> > returns one attribute, but what if I had a system running under SMACK
> > and then changed to SELinux? Would my (existing) inode then have
> > security.smack and security.selinux attributes?
>
> No, there would be no security.selinux attribute and the file would be
> treated as having a well-defined 'unlabeled' attribute by SELinux. Not
> something you have to worry about.

Even if I've run restorecon? Basically, I'm trying to understand
if, in the !preserve_security case, ocfs2 can just do "link up the
existing xattrs, then set whatever we got from
security_inode_init_security()", or if we have to go through and delete
all security.* attributes before installing the result of
security_inode_init_security().

> > > I'd rather have two hooks, one to allow the security module to override
> > > preserve_security and one to allow the security module to deny the
> > > operation altogether. The former hook only needs to be called if
> > > preserve_security is not already cleared by the DAC logic. The latter
> > > hook needs to know the final verdict on preserve_security in order to
> > > determine the right set of checks to apply, which isn't necessarily
> > > limited to only checking read access.
> >
> > Ok, is that two hooks or one hook with specific error returns?
> > I don't care, it's up to the LSM group. I just can't come up with a
> > good distinguishing set of names if it's two hooks :-)
>
> I suppose you could coalesce them into a single hook ala:
> 	error = security_inode_reflink(old_dentry, dir, &preserve_security);
> 	if (error)
> 		return (error);

What fits in with the LSM convention. That's more important
than one-hook-vs-two.

Joel

-- 

"Gone to plant a weeping willow
 On the bank's green edge it will roll, roll, roll.
 Sing a lullaby beside the waters.
 Lovers come and go, the river roll, roll, rolls."

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127

^ permalink raw reply [flat|nested] 151+ messages in thread
* Re: [RFC] The reflink(2) system call v4.
  2009-05-12 18:28 ` Joel Becker
@ 2009-05-12 18:37 ` Stephen Smalley
  0 siblings, 0 replies; 151+ messages in thread
From: Stephen Smalley @ 2009-05-12 18:37 UTC (permalink / raw)
  To: Joel Becker
  Cc: James Morris, jim owens, ocfs2-devel, viro, mtk.manpages, linux-security-module, linux-fsdevel

On Tue, 2009-05-12 at 11:28 -0700, Joel Becker wrote:
> On Tue, May 12, 2009 at 02:04:53PM -0400, Stephen Smalley wrote:
> > On Tue, 2009-05-12 at 11:03 -0700, Joel Becker wrote:
> > > As an aside, do inodes ever have more than one security.*
> > > attribute? It would appear that security_inode_init_security() just
> > > returns one attribute, but what if I had a system running under SMACK
> > > and then changed to SELinux? Would my (existing) inode then have
> > > security.smack and security.selinux attributes?
> >
> > No, there would be no security.selinux attribute and the file would be
> > treated as having a well-defined 'unlabeled' attribute by SELinux. Not
> > something you have to worry about.
>
> Even if I've run restorecon? Basically, I'm trying to understand
> if, in the !preserve_security case, ocfs2 can just do "link up the
> existing xattrs, then set whatever we got from
> security_inode_init_security()", or if we have to go through and delete
> all security.* attributes before installing the result of
> security_inode_init_security().

Likely a better example would be file capabilities
(security.capability), as you might be using those simultaneously with
SELinux (security.selinux). security_inode_init_security() is only
going to return security.selinux, as new files don't get any file
capabilities assigned by default. I guess you would want to delete
security.capability from the reflink if preserve_security==0.

> > > > I'd rather have two hooks, one to allow the security module to override
> > > > preserve_security and one to allow the security module to deny the
> > > > operation altogether. The former hook only needs to be called if
> > > > preserve_security is not already cleared by the DAC logic. The latter
> > > > hook needs to know the final verdict on preserve_security in order to
> > > > determine the right set of checks to apply, which isn't necessarily
> > > > limited to only checking read access.
> > >
> > > Ok, is that two hooks or one hook with specific error returns?
> > > I don't care, it's up to the LSM group. I just can't come up with a
> > > good distinguishing set of names if it's two hooks :-)
> >
> > I suppose you could coalesce them into a single hook ala:
> > 	error = security_inode_reflink(old_dentry, dir, &preserve_security);
> > 	if (error)
> > 		return (error);
>
> What fits in with the LSM convention. That's more important
> than one-hook-vs-two.

I think that the above example fits with the LSM convention.

-- 
Stephen Smalley
National Security Agency

^ permalink raw reply [flat|nested] 151+ messages in thread
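For the preserve_security==0 case discussed in this message, dropping every security.* attribute (security.capability, security.selinux, and so on) before installing the fresh one could be modelled as a filter over attribute names. This is a userspace sketch on an array of strings, assuming nothing about how ocfs2 stores xattrs on disk; `drop_security_xattrs()` is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SECURITY_PREFIX "security."

/*
 * Compacts names[] in place, dropping every "security.*" entry, and
 * returns the number of names kept. Order of survivors is preserved.
 */
static size_t drop_security_xattrs(const char **names, size_t count)
{
	size_t kept = 0;

	assert(names != NULL || count == 0);
	for (size_t i = 0; i < count; i++) {
		if (strncmp(names[i], SECURITY_PREFIX,
			    strlen(SECURITY_PREFIX)) == 0)
			continue;	/* not carried over to the new inode */
		names[kept++] = names[i];
	}
	return kept;
}
```

After the filter runs, the caller would add whatever name/value pair security_inode_init_security() returned, as in normal file creation.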
* Re: [RFC] The reflink(2) system call v4.
  2009-05-12 18:04 ` Stephen Smalley
  2009-05-12 18:28 ` Joel Becker
@ 2009-05-14 18:06 ` Stephen Smalley
  2009-05-14 18:25 ` Stephen Smalley
  1 sibling, 1 reply; 151+ messages in thread
From: Stephen Smalley @ 2009-05-14 18:06 UTC (permalink / raw)
  To: Joel Becker
  Cc: James Morris, jim owens, ocfs2-devel, viro, mtk.manpages, linux-security-module, linux-fsdevel

On Tue, 2009-05-12 at 14:04 -0400, Stephen Smalley wrote:
> On Tue, 2009-05-12 at 11:03 -0700, Joel Becker wrote:
> > On Tue, May 12, 2009 at 01:32:47PM -0400, Stephen Smalley wrote:
> > > On Tue, 2009-05-12 at 10:22 -0700, Joel Becker wrote:
> > > > On Tue, May 12, 2009 at 08:18:34AM -0400, Stephen Smalley wrote:
> > > > > Is preserve_security supposed to also control the preservation of the
> > > > > SELinux security attribute (security.selinux extended attribute)? I'd
> > > > > expect that either we preserve all the security-relevant attributes or
> > > > > none of them. And if that is the case, then SELinux has to know about
> > > > > preserve_security in order to know what the security context of the new
> > > > > inode will be.
> > > >
> > > > Thank you Stephen, you read my mind. In the ocfs2 case, we're
> > > > expecting to just reflink the extended attribute structures verbatim in
> > > > the preserve_security case.
> > >
> > > And in the preserve_security==0 case, you'll be calling
> > > security_inode_init_security() in order to get the attribute name/value
> > > pair to assign to the new inode just as in the normal file creation
> > > case?
> >
> > Oh, absolutely.
> > As an aside, do inodes ever have more than one security.*
> > attribute? It would appear that security_inode_init_security() just
> > returns one attribute, but what if I had a system running under SMACK
> > and then changed to SELinux? Would my (existing) inode then have
> > security.smack and security.selinux attributes?
>
> No, there would be no security.selinux attribute and the file would be
> treated as having a well-defined 'unlabeled' attribute by SELinux. Not
> something you have to worry about.
>
> > > > > Also, if you are going to automatically degrade reflink(2) behavior
> > > > > based on the owner_or_cap test, then you ought to allow the same to be
> > > > > true if the security module vetoes the attempt to preserve attributes.
> > > > > Either DAC or MAC logic may say that security attributes cannot be
> > > > > preserved. Your current logic will only allow graceful degradation in
> > > > > the DAC case, but the MAC case will remain a hard failure.
> > > >
> > > > I did not think of this, and it's a very good point as well. I'm
> > > > not sure how to have the return value of security_inode_reflink()
> > > > distinguish between "disallow the reflink" and "disallow
> > > > preserve_security". But since !preserve_security requires read access
> > > > only, perhaps we move security_inode_reflink up higher and say:
> > > >
> > > > 	error = security_inode_reflink(old_dentry, dir);
> > > > 	if (error)
> > > > 		preserve_security = 0;
> > > >
> > > > Here security_inode_reflink() does not need new_dentry, because it isn't
> > > > setting a security context. If it's ok with the reflink, we'll be
> > > > copying the extended attribute. If it's not OK, it falls through to the
> > > > inode_permission(inode, MAY_READ) check, which will check for plain old
> > > > read access.
> > > > What do we think?
> > >
> > > I'd rather have two hooks, one to allow the security module to override
> > > preserve_security and one to allow the security module to deny the
> > > operation altogether. The former hook only needs to be called if
> > > preserve_security is not already cleared by the DAC logic. The latter
> > > hook needs to know the final verdict on preserve_security in order to
> > > determine the right set of checks to apply, which isn't necessarily
> > > limited to only checking read access.
> >
> > Ok, is that two hooks or one hook with specific error returns?
> > I don't care, it's up to the LSM group. I just can't come up with a
> > good distinguishing set of names if it's two hooks :-)
>
> I suppose you could coalesce them into a single hook ala:
> 	error = security_inode_reflink(old_dentry, dir, &preserve_security);
> 	if (error)
> 		return (error);

On second thought (agreeing with Andy about making the interface
explicit wrt preserve_security), I don't expect us to ever override
preserve_security from SELinux, so you can just pass it in by value.

-- 
Stephen Smalley
National Security Agency

^ permalink raw reply [flat|nested] 151+ messages in thread
* Re: [RFC] The reflink(2) system call v4. 2009-05-14 18:06 ` Stephen Smalley @ 2009-05-14 18:25 ` Stephen Smalley 2009-05-14 23:25 ` James Morris 0 siblings, 1 reply; 151+ messages in thread From: Stephen Smalley @ 2009-05-14 18:25 UTC (permalink / raw) To: Joel Becker Cc: James Morris, jim owens, ocfs2-devel, viro, mtk.manpages, linux-security-module, linux-fsdevel On Thu, 2009-05-14 at 14:06 -0400, Stephen Smalley wrote: > On Tue, 2009-05-12 at 14:04 -0400, Stephen Smalley wrote: > > On Tue, 2009-05-12 at 11:03 -0700, Joel Becker wrote: > > > On Tue, May 12, 2009 at 01:32:47PM -0400, Stephen Smalley wrote: > > > > On Tue, 2009-05-12 at 10:22 -0700, Joel Becker wrote: > > > > > On Tue, May 12, 2009 at 08:18:34AM -0400, Stephen Smalley wrote: > > > > > > Is preserve_security supposed to also control the preservation of the > > > > > > SELinux security attribute (security.selinux extended attribute)? I'd > > > > > > expect that either we preserve all the security-relevant attributes or > > > > > > none of them. And if that is the case, then SELinux has to know about > > > > > > preserve_security in order to know what the security context of the new > > > > > > inode will be. > > > > > > > > > > Thank you Stephen, you read my mind. In the ocfs2 case, we're > > > > > expecting to just reflink the extended attribute structures verbatim in > > > > > the preserve_security case. > > > > > > > > And in the preserve_security==0 case, you'll be calling > > > > security_inode_init_security() in order to get the attribute name/value > > > > pair to assign to the new inode just as in the normal file creation > > > > case? > > > > > > Oh, absolutely. > > > As an aside, do inodes ever have more than one security.* > > > attribute? It would appear that security_inode_init_security() just > > > returns one attribute, but what if I had a system running under SMACK > > > and then changed to SELinux? 
Would my (existing) inode then have > > > security.smack and security.selinux attributes? > > > > No, there would be no security.selinux attribute and the file would be > > treated as having a well-defined 'unlabeled' attribute by SELinux. Not > > something you have to worry about. > > > > > > > > Also, if you are going to automatically degrade reflink(2) behavior > > > > > > based on the owner_or_cap test, then you ought to allow the same to be > > > > > > true if the security module vetoes the attempt to preserve attributes. > > > > > > Either DAC or MAC logic may say that security attributes cannot be > > > > > > preserved. Your current logic will only allow graceful degradation in > > > > > > the DAC case, but the MAC case will remain a hard failure. > > > > > > > > > > I did not think of this, and its a very good point as well. I'm > > > > > not sure how to have the return value of security_inode_reflink() > > > > > distinguish between "disallow the reflink" and "disallow > > > > > preserve_security". But since !preserve_security requires read access > > > > > only, perhaps we move security_inode_reflink up higher and say: > > > > > > > > > > error = security_inode_reflink(old_dentry, dir); > > > > > if (error) > > > > > preserve_security = 0; > > > > > > > > > > Here security_inode_reflink() does not need new_dentry, because it isn't > > > > > setting a security context. If it's ok with the reflink, we'll be > > > > > copying the extended attribute. If it's not OK, it falls through to the > > > > > inode_permission(inode, MAY_READ) check, which will check for plain old > > > > > read access. > > > > > What do we think? > > > > > > > > I'd rather have two hooks, one to allow the security module to override > > > > preserve_security and one to allow the security module to deny the > > > > operation altogether. The former hook only needs to be called if > > > > preserve_security is not already cleared by the DAC logic. 
The latter > > > > hook needs to know the final verdict on preserve_security in order to > > > > determine the right set of checks to apply, which isn't necessarily > > > > limited to only checking read access. > > > > > > Ok, is that two hooks or one hook with specific error returns? > > > I don't care, it's up to the LSM group. I just can't come up with a > > > good distinguishing set of names if its two hooks :-) > > > > I suppose you could coalesce them into a single hook ala: > > error = security_inode_reflink(old_dentry, dir, &preserve_security); > > if (error) > > return (error); > > On second thought (agreeing with Andy about making the interface > explicit wrt preserve_security), I don't expect us to ever override > preserve_security from SELinux, so you can just pass it in by value. And you can likely make preserve_security a simple bool (set from some caller-provided flag) rather than an int. At which point the SELinux wiring for the new hook would be something like this: If we are preserving security attributes on the reflink, then treat it like creating a link to an existing file; else treat it like creating a new file. Read access will also be checked in the non-preserving case by virtue of the separate inode_permission call. 
diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index 2fcad7c..20ef414 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -2667,6 +2667,17 @@ static int selinux_inode_symlink(struct inode *dir, struct dentry *dentry, const
 	return may_create(dir, dentry, SECCLASS_LNK_FILE);
 }
 
+static int selinux_inode_reflink(struct dentry *dentry, struct inode *dir,
+				 bool preserve_security)
+{
+	struct inode_security_struct *isec = dentry->d_inode->i_security;
+
+	if (preserve_security)
+		return may_link(dir, dentry, MAY_LINK);
+	else
+		return may_create(dir, dentry, isec->sclass);
+}
+
 static int selinux_inode_mkdir(struct inode *dir, struct dentry *dentry, int mask)
 {
 	return may_create(dir, dentry, SECCLASS_DIR);
@@ -5357,6 +5368,7 @@ static struct security_operations selinux_ops = {
 	.inode_link = selinux_inode_link,
 	.inode_unlink = selinux_inode_unlink,
 	.inode_symlink = selinux_inode_symlink,
+	.inode_reflink = selinux_inode_reflink,
 	.inode_mkdir = selinux_inode_mkdir,
 	.inode_rmdir = selinux_inode_rmdir,
 	.inode_mknod = selinux_inode_mknod,

-- 
Stephen Smalley
National Security Agency

^ permalink raw reply related [flat|nested] 151+ messages in thread
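For readers following the control flow rather than the SELinux internals, the decision sequence discussed in this subthread can be modelled in user space. This is only a sketch under assumptions: the struct, its field names, and model_vfs_reflink() are hypothetical stand-ins for the kernel checks, the ordering of the checks is inferred from the discussion rather than taken from the posted VFS patch, and -EACCES is assumed as the failure code.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical inputs standing in for the kernel-side checks. */
struct reflink_checks {
	bool owner_or_cap;      /* DAC: caller owns the source or is privileged */
	bool may_read;          /* DAC: plain read access to the source */
	bool mac_allows_link;   /* MAC: link-style check would pass */
	bool mac_allows_create; /* MAC: create-style check would pass */
};

/*
 * Returns 0 on success or a negative errno. On success, *preserved reports
 * whether the security attributes would be carried over to the new inode.
 */
static int model_vfs_reflink(const struct reflink_checks *c, bool *preserved)
{
	/* DAC decides first whether attributes can be preserved at all. */
	bool preserve_security = c->owner_or_cap;

	/* The degraded (non-preserving) mode still requires read access. */
	if (!preserve_security && !c->may_read)
		return -EACCES;

	/*
	 * The security hook receives the final verdict by value. It cannot
	 * change it; it can only allow or deny the operation.
	 */
	if (preserve_security ? !c->mac_allows_link : !c->mac_allows_create)
		return -EACCES;

	*preserved = preserve_security;
	return 0;
}
```

An owner who passes the link-style check gets a preserving reflink; a non-owner with read access who passes the create-style check gets a reflink with reinitialized attributes; anything else is refused.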
* Re: [RFC] The reflink(2) system call v4. 2009-05-14 18:25 ` Stephen Smalley @ 2009-05-14 23:25 ` James Morris 2009-05-15 11:54 ` Stephen Smalley 0 siblings, 1 reply; 151+ messages in thread From: James Morris @ 2009-05-14 23:25 UTC (permalink / raw) To: Stephen Smalley Cc: Joel Becker, jim owens, ocfs2-devel, viro, mtk.manpages, linux-security-module, linux-fsdevel On Thu, 14 May 2009, Stephen Smalley wrote: > And you can likely make preserve_security a simple bool (set from some > caller-provided flag) rather than an int. At which point the SELinux > wiring for the new hook would be something like this: > > If we are preserving security attributes on the reflink, then treat it > like creating a link to an existing file; Do we also need to somewhat consider it like a new file? e.g. in the case of create_sid being set (if different to the existing security attribute), I believe we need to fail the operation because security attributes are not preserved, and also decide which error code to return (the user may be confused if it's EACCES -- EINVAL might be better). Similarly for reflinks on a context-mounted file system, although create_sid needs to be checked during inode instantiation (unless we, say, set a preserve_sid flag which overrides create_sid and is cleared upon use). - James -- James Morris <jmorris@namei.org> ^ permalink raw reply [flat|nested] 151+ messages in thread
* Re: [RFC] The reflink(2) system call v4. 2009-05-14 23:25 ` James Morris @ 2009-05-15 11:54 ` Stephen Smalley 2009-05-15 13:35 ` James Morris 0 siblings, 1 reply; 151+ messages in thread From: Stephen Smalley @ 2009-05-15 11:54 UTC (permalink / raw) To: James Morris Cc: Joel Becker, jim owens, ocfs2-devel, viro, mtk.manpages, linux-security-module, linux-fsdevel On Fri, 2009-05-15 at 09:25 +1000, James Morris wrote: > On Thu, 14 May 2009, Stephen Smalley wrote: > > > And you can likely make preserve_security a simple bool (set from some > > caller-provided flag) rather than an int. At which point the SELinux > > wiring for the new hook would be something like this: > > > > If we are preserving security attributes on the reflink, then treat it > > like creating a link to an existing file; > > Do we also need to somewhat consider it like a new file? e.g. in the case > of create_sid being set (if different to the existing security attribute), > I believe we need to fail the operation because security attributes are > not preserved, and also decide which error code to return (the user may be > confused if it's EACCES -- EINVAL might be better). Similar for reflinks > on a context mounted file system, although create_sid needs to be checked > during inode instantiation (unless we, say, add set a preserve_sid flag > which overrides create_sid and is cleared upon use). The create_sid is not relevant in the preserve_security==1 case; the filesystem will always preserve the security context from the original inode on the new inode in that case. The create_sid won't ever be used in that case, as it only gets applied if the filesystem calls security_inode_init_security() to obtain the attribute (name, value) pair for a new inode, and the filesystem will only do that in the preserve_security==0 case. -- Stephen Smalley National Security Agency ^ permalink raw reply [flat|nested] 151+ messages in thread
* Re: [RFC] The reflink(2) system call v4. 2009-05-15 11:54 ` Stephen Smalley @ 2009-05-15 13:35 ` James Morris 2009-05-15 15:44 ` Stephen Smalley 0 siblings, 1 reply; 151+ messages in thread From: James Morris @ 2009-05-15 13:35 UTC (permalink / raw) To: Stephen Smalley Cc: Joel Becker, jim owens, ocfs2-devel, viro, mtk.manpages, linux-security-module, linux-fsdevel On Fri, 15 May 2009, Stephen Smalley wrote: > The create_sid is not relevant in the preserve_security==1 case; the > filesystem will always preserve the security context from the original > inode on the new inode in that case. The create_sid won't ever be used > in that case, as it only gets applied if the filesystem calls > security_inode_init_security() to obtain the attribute (name, value) > pair for a new inode, and the filesystem will only do that in the > preserve_security==0 case. Ok. Does this break the idea of create_sid, though? i.e. it will be ignored when a new file is created via reflink(), potentially allowing DAC to determine whether MAC labeling policy is enforced, and is also not consistent with the way fsuid is handled. - James -- James Morris <jmorris@namei.org> ^ permalink raw reply [flat|nested] 151+ messages in thread
* Re: [RFC] The reflink(2) system call v4. 2009-05-15 13:35 ` James Morris @ 2009-05-15 15:44 ` Stephen Smalley 0 siblings, 0 replies; 151+ messages in thread From: Stephen Smalley @ 2009-05-15 15:44 UTC (permalink / raw) To: James Morris Cc: Joel Becker, jim owens, ocfs2-devel, viro, mtk.manpages, linux-security-module, linux-fsdevel On Fri, 2009-05-15 at 23:35 +1000, James Morris wrote: > On Fri, 15 May 2009, Stephen Smalley wrote: > > > The create_sid is not relevant in the preserve_security==1 case; the > > filesystem will always preserve the security context from the original > > inode on the new inode in that case. The create_sid won't ever be used > > in that case, as it only gets applied if the filesystem calls > > security_inode_init_security() to obtain the attribute (name, value) > > pair for a new inode, and the filesystem will only do that in the > > preserve_security==0 case. > > Ok. Does this break the idea of create_sid, though? i.e. it will be > ignored when a new file is created via reflink(), potentially allowing DAC > to determine whether MAC labeling policy is enforced, and is also not > consistent with the way fsuid is handled. I think it is consistent with the planned uid handling for reflink (if preserve_security==1, then the new inode gets the uid of the original inode; else the new inode gets the fsuid of the creating process). create_sid is a "discretionary" mechanism - the application supplies the value via setfscreatecon(3), subject to a policy check (the file create check). Applications only expect the create_sid to be applied on normal file creations (and even there, it may not happen due to context mounts or filesystems that do not support labeling), so we aren't bound to that behavior for reflink. 
The MAC policy is enforced based on the permission checks, not the create_sid, so the only question is whether it is sufficient to check link permission for reflink(2) in the attribute-preserving case or whether we should add a new permission for it. We don't want to reuse the create permission for reflink(2) in the attribute-preserving case due to the difference in semantics between a reflink and a normal file creation. The result of a reflink(2) will look identical to the result of a link(2) except that it will have its own inode and thus a different inode number, link count, and ctime. -- Stephen Smalley National Security Agency ^ permalink raw reply [flat|nested] 151+ messages in thread
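The uid rule described in this message (original owner when attributes are preserved, the creator's fsuid otherwise) can be written down as a small user-space sketch. The function names are hypothetical, uids are plain unsigned integers, and the "owner or capable" condition for preservation is taken from the v4 discussion rather than from any final code.

```c
#include <stdbool.h>

/* Whether a caller would get the attribute-preserving behaviour (assumed rule). */
static bool reflink_preserves(unsigned int source_uid,
			      unsigned int caller_fsuid, bool has_capability)
{
	return caller_fsuid == source_uid || has_capability;
}

/* Owner of the new inode under the rule described above. */
static unsigned int reflink_new_uid(unsigned int source_uid,
				    unsigned int caller_fsuid,
				    bool has_capability)
{
	if (reflink_preserves(source_uid, caller_fsuid, has_capability))
		return source_uid;   /* preserved: keeps the original owner */
	return caller_fsuid;         /* reinitialized: belongs to the creator */
}
```

With a source file owned by uid 1000, an unprivileged caller with fsuid 1001 ends up owning the new inode, while a capable caller with fsuid 0 leaves it owned by 1000.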
* Re: [RFC] The reflink(2) system call v4. 2009-05-12 18:03 ` Joel Becker 2009-05-12 18:04 ` Stephen Smalley @ 2009-05-13 1:47 ` Casey Schaufler 2009-05-13 16:43 ` Joel Becker 1 sibling, 1 reply; 151+ messages in thread From: Casey Schaufler @ 2009-05-13 1:47 UTC (permalink / raw) To: Stephen Smalley, James Morris, jim owens, ocfs2-devel, viro, mtk.manpages, linux-se Joel Becker wrote: > On Tue, May 12, 2009 at 01:32:47PM -0400, Stephen Smalley wrote: > >> On Tue, 2009-05-12 at 10:22 -0700, Joel Becker wrote: >> >>> On Tue, May 12, 2009 at 08:18:34AM -0400, Stephen Smalley wrote: >>> >>>> Is preserve_security supposed to also control the preservation of the >>>> SELinux security attribute (security.selinux extended attribute)? I'd >>>> expect that either we preserve all the security-relevant attributes or >>>> none of them. And if that is the case, then SELinux has to know about >>>> preserve_security in order to know what the security context of the new >>>> inode will be. >>>> >>> Thank you Stephen, you read my mind. In the ocfs2 case, we're >>> expecting to just reflink the extended attribute structures verbatim in >>> the preserve_security case. >>> >> And in the preserve_security==0 case, you'll be calling >> security_inode_init_security() in order to get the attribute name/value >> pair to assign to the new inode just as in the normal file creation >> case? >> > > Oh, absolutely. > As an aside, do inodes ever have more than one security.* > attribute? ACLs, capability sets and Smack labels can all exist on a file at the same time. I know of at least one effort underway to create a multiple-label LSM. > It would appear that security_inode_init_security() just > returns one attribute, but what if I had a system running under SMACK > and then changed to SELinux? The Smack attribute would hang around, it would just be unused. > Would my (existing) inode then have > security.smack and security.selinux attributes? > Yup. It happens all the time. 
Whenever someone converts a Fedora system to Smack they end up with a filesystem full of unused selinux labels. It does no harm. > >>>> Also, if you are going to automatically degrade reflink(2) behavior >>>> based on the owner_or_cap test, then you ought to allow the same to be >>>> true if the security module vetoes the attempt to preserve attributes. >>>> Either DAC or MAC logic may say that security attributes cannot be >>>> preserved. Your current logic will only allow graceful degradation in >>>> the DAC case, but the MAC case will remain a hard failure. >>>> >>> I did not think of this, and its a very good point as well. I'm >>> not sure how to have the return value of security_inode_reflink() >>> distinguish between "disallow the reflink" and "disallow >>> preserve_security". But since !preserve_security requires read access >>> only, perhaps we move security_inode_reflink up higher and say: >>> >>> error = security_inode_reflink(old_dentry, dir); >>> if (error) >>> preserve_security = 0; >>> >>> Here security_inode_reflink() does not need new_dentry, because it isn't >>> setting a security context. If it's ok with the reflink, we'll be >>> copying the extended attribute. If it's not OK, it falls through to the >>> inode_permission(inode, MAY_READ) check, which will check for plain old >>> read access. >>> What do we think? >>> >> I'd rather have two hooks, one to allow the security module to override >> preserve_security and one to allow the security module to deny the >> operation altogether. The former hook only needs to be called if >> preserve_security is not already cleared by the DAC logic. The latter >> hook needs to know the final verdict on preserve_security in order to >> determine the right set of checks to apply, which isn't necessarily >> limited to only checking read access. >> > > Ok, is that two hooks or one hook with specific error returns? > I don't care, it's up to the LSM group. 
I just can't come up with a > good distinguishing set of names if its two hooks :-) > > Joel > > ^ permalink raw reply [flat|nested] 151+ messages in thread
* Re: [RFC] The reflink(2) system call v4. 2009-05-13 1:47 ` Casey Schaufler @ 2009-05-13 16:43 ` Joel Becker 2009-05-13 17:23 ` Stephen Smalley 0 siblings, 1 reply; 151+ messages in thread From: Joel Becker @ 2009-05-13 16:43 UTC (permalink / raw) To: Casey Schaufler Cc: James Morris, linux-fsdevel, linux-security-module, mtk.manpages, jim owens, Stephen Smalley, ocfs2-devel, viro On Tue, May 12, 2009 at 06:47:04PM -0700, Casey Schaufler wrote: > Joel Becker wrote: > > Oh, absolutely. > > As an aside, do inodes ever have more than one security.* > > attribute? > > ACLs, capability sets and Smack labels can all exist on a file at > the same time. I know of at least one effort underway to create a > multiple-label LSM. So ACLs and cap sets live under security.*? That's good. > > Would my (existing) inode then have > > security.smack and security.selinux attributes? > > > > Yup. It happens all the time. Whenever someone converts a Fedora > system to Smack they end up with a filesystem full of unused selinux > labels. It does no harm. At that runtime, sure. But with reflink(), we may be reflinking someone else's inode, and if we have to drop its security state, we should clean the unused labels just in case they go back to selinux (or back to smack, etc). But if they are all under security.*, it's easy to do. Thanks! Joel -- Life's Little Instruction Book #173 "Be kinder than necessary." Joel Becker Principal Software Developer Oracle E-mail: joel.becker@oracle.com Phone: (650) 506-8127 ^ permalink raw reply [flat|nested] 151+ messages in thread
* Re: [RFC] The reflink(2) system call v4. 2009-05-13 16:43 ` Joel Becker @ 2009-05-13 17:23 ` Stephen Smalley 2009-05-13 18:27 ` Joel Becker 0 siblings, 1 reply; 151+ messages in thread From: Stephen Smalley @ 2009-05-13 17:23 UTC (permalink / raw) To: Joel Becker Cc: Casey Schaufler, James Morris, jim owens, ocfs2-devel, viro, mtk.manpages, linux-security-module, linux-fsdevel On Wed, 2009-05-13 at 09:43 -0700, Joel Becker wrote: > On Tue, May 12, 2009 at 06:47:04PM -0700, Casey Schaufler wrote: > > Joel Becker wrote: > > > Oh, absolutely. > > > As an aside, do inodes ever have more than one security.* > > > attribute? > > > > ACLs, capability sets and Smack labels can all exist on a file at > > the same time. I know of at least one effort underway to create a > > multiple-label LSM. > > So ACLs and cap sets live under security.*? That's good. File capabilities live under security.*, but ACLs predate the security namespace and live in the system namespace as "system.posix_acl_access" (and if a directory, there is also a "system.posix_acl_default" attribute that specifies the default ACL for new files in that directory). In the preserve_security==0 case, you'd want to: - drop all attributes under security.* on the new inode, - set (security.<name>, value) to the name:value pair provided by security_inode_init_security(), - set system.posix_acl_access to the default ACL associated with the parent directory (the "system.posix_acl_default" attribute on the parent). The latter two steps are what is already done in the new inode creation code path, so you hopefully can just reuse that code. > > > Would my (existing) inode then have > > > security.smack and security.selinux attributes? > > > > > > > Yup. It happens all the time. Whenever someone converts a Fedora > > system to Smack they end up with a filesystem full of unused selinux > > labels. It does no harm. > > At that runtime, sure. 
But with reflink(), we may be reflinking > someone else's inode, and if we have to drop its security state, we > should clean the unused labels just in case they go back to selinux (or > back to smack, etc). But if they are all under security.*, it's easy to > do. > > Thanks! > Joel > -- Stephen Smalley National Security Agency ^ permalink raw reply [flat|nested] 151+ messages in thread
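The three steps listed above for the preserve_security==0 case can be sketched against an in-memory attribute table. Everything here is a hypothetical user-space model, not kernel or filesystem code: the xattr_set type and its helpers are invented for illustration, values are treated as plain strings although real xattr values are binary, and real ACL inheritance also involves the creation mode. One further assumption, not stated in the thread: when the parent has no default ACL, the access ACL copied from the source is removed, as a freshly created file would have none.

```c
#include <stdio.h>
#include <string.h>

#define MAX_XATTRS 16

struct xattr {
	char name[64];
	char value[64];
};

struct xattr_set {
	struct xattr items[MAX_XATTRS];
	size_t count;
};

/* Look up an attribute by exact name; NULL if absent. */
static const char *xattr_get(const struct xattr_set *set, const char *name)
{
	for (size_t i = 0; i < set->count; i++)
		if (strcmp(set->items[i].name, name) == 0)
			return set->items[i].value;
	return NULL;
}

/* Create or overwrite an attribute; -1 if the table is full. */
static int xattr_put(struct xattr_set *set, const char *name, const char *value)
{
	struct xattr *slot = NULL;

	for (size_t i = 0; i < set->count; i++)
		if (strcmp(set->items[i].name, name) == 0)
			slot = &set->items[i];
	if (!slot) {
		if (set->count == MAX_XATTRS)
			return -1;
		slot = &set->items[set->count++];
		snprintf(slot->name, sizeof(slot->name), "%s", name);
	}
	snprintf(slot->value, sizeof(slot->value), "%s", value);
	return 0;
}

/* Remove every attribute whose name starts with prefix. */
static void xattr_drop_prefix(struct xattr_set *set, const char *prefix)
{
	size_t kept = 0, plen = strlen(prefix);

	for (size_t i = 0; i < set->count; i++)
		if (strncmp(set->items[i].name, prefix, plen) != 0)
			set->items[kept++] = set->items[i];
	set->count = kept;
}

/*
 * Reinitialize the security state of a reflinked inode. lsm_name/lsm_value
 * stand in for the pair returned by security_inode_init_security().
 */
static int reinit_security_xattrs(struct xattr_set *inode,
				  const struct xattr_set *parent,
				  const char *lsm_name, const char *lsm_value)
{
	const char *dflt = xattr_get(parent, "system.posix_acl_default");

	/* Step 1: drop all security.* attributes copied from the source. */
	xattr_drop_prefix(inode, "security.");

	/* Step 2: install the pair chosen by the security module. */
	if (xattr_put(inode, lsm_name, lsm_value))
		return -1;

	/* Step 3: take the access ACL from the parent's default ACL. */
	if (dflt)
		return xattr_put(inode, "system.posix_acl_access", dflt);

	/* Assumption: no default ACL on the parent means no access ACL. */
	xattr_drop_prefix(inode, "system.posix_acl_access");
	return 0;
}
```

Attributes outside security.* and the ACL name, such as user.* entries, pass through untouched, which matches the "data and extended attributes except the security state" description of the degraded mode.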
* Re: [RFC] The reflink(2) system call v4. 2009-05-13 17:23 ` Stephen Smalley @ 2009-05-13 18:27 ` Joel Becker 0 siblings, 0 replies; 151+ messages in thread From: Joel Becker @ 2009-05-13 18:27 UTC (permalink / raw) To: Stephen Smalley Cc: James Morris, jim owens, linux-security-module, mtk.manpages, Casey Schaufler, linux-fsdevel, ocfs2-devel, viro On Wed, May 13, 2009 at 01:23:58PM -0400, Stephen Smalley wrote: > File capabilities live under security.*, but ACLs predate the security > namespace and live in the system namespace as > "system.posix_acl_access" (and if a directory, there is also a > "system.posix_acl_default" attribute that specifies the default ACL for > new files in that directory). > > In the preserve_security==0 case, you'd want to: > - drop all attributes under security.* on the new inode, > - set (security.<name>, value) to the name:value pair provided by > security_inode_init_security(), > - set system.posix_acl_access to the default ACL associated with the > parent directory (the "system.posix_acl_default" attribute on the > parent). > > The latter two steps are what is already done in the new inode creation > code path, so you hopefully can just reuse that code. I am absolutely expecting to reuse that code. I was just trying to make sure I didn't miss any steps prior to the normal new-inode stuff. Thanks. Joel -- The zen have a saying: "When you learn how to listen, ANYONE can be your teacher." Joel Becker Principal Software Developer Oracle E-mail: joel.becker@oracle.com Phone: (650) 506-8127 ^ permalink raw reply [flat|nested] 151+ messages in thread
* Re: [RFC] The reflink(2) system call v4. 2009-05-11 22:27 ` James Morris 2009-05-11 22:34 ` Joel Becker @ 2009-05-12 12:01 ` Stephen Smalley 1 sibling, 0 replies; 151+ messages in thread From: Stephen Smalley @ 2009-05-12 12:01 UTC (permalink / raw) To: James Morris Cc: Joel Becker, jim owens, ocfs2-devel, viro, mtk.manpages, linux-security-module, linux-fsdevel On Tue, 2009-05-12 at 08:27 +1000, James Morris wrote: > On Mon, 11 May 2009, Joel Becker wrote: > > > and other security attributes (in all, I'm gonna call that the "security > > context") as well. So I defined reflink() as such. This meant > > "security context" is an term associated with SELinux, so you may want to > use something like "security attributes" or "security state" to avoid > confusing people. > > > + error = security_inode_reflink(old_dentry, dir); > > + if (error) > > + return error; > > We'll need the new_dentry now, to set up new security state before the > dentry is instantiated. I don't think the inode exists yet for the new_dentry (not until after the call to i_op->reflink), and thus we cannot set up the new inode state at the point of security_inode_reflink(). We will need the filesystem to call into the security module to get the right security attribute name/value pair when creating the new inode, just as with normal inode creation, unless it is preserving the name/value pair from the original. The security_inode_init_security() hook is for that purpose - you can see its usage in existing filesystems when creating new inodes. > e.g. SELinux will need to perform some checks on the operation, then > calculate a new security context for the new file. > > > - James -- Stephen Smalley National Security Agency ^ permalink raw reply [flat|nested] 151+ messages in thread
* Re: [RFC] The reflink(2) system call v4. 2009-05-11 20:40 ` [RFC] The reflink(2) system call v4 Joel Becker 2009-05-11 22:27 ` James Morris @ 2009-05-11 23:11 ` jim owens 2009-05-11 23:42 ` Joel Becker 2009-05-12 11:31 ` Jörn Engel ` (4 subsequent siblings) 6 siblings, 1 reply; 151+ messages in thread From: jim owens @ 2009-05-11 23:11 UTC (permalink / raw) To: joel.becker, linux-fsdevel Cc: jmorris, ocfs2-devel, viro, mtk.manpages, linux-security-module Joel Becker wrote: > Here's v4 of reflink(). If you have the privileges, you get the > full snapshot. If you don't, you must have read access, and then you > get the entire snapshot (data and extended attributes) except that the > security context is reinitialized. That's it. It fits with most of the > other ops, and it's a clean degradation. I really like this. It has a nice clean user operational definition and gives them all the snap/cowfile features. And if they had the privilege to do the reflink(), they can just chattr away :) jim > + /* > + * If the caller has the rights, reflink() will preserve the > + * security context of the source inode. > + */ > + if ((current_fsuid() != inode->i_uid) && !capable(CAP_CHOWN)) > + preserve_security = 0; > + if ((current_fsuid() != inode->i_uid) && > + !in_group_p(inode->i_gid) && !capable(CAP_CHOWN)) > + preserve_security = 0; I have not done a code review, but that appears to be an editing cut-and-paste duplication. ^ permalink raw reply [flat|nested] 151+ messages in thread
* Re: [RFC] The reflink(2) system call v4. 2009-05-11 23:11 ` jim owens @ 2009-05-11 23:42 ` Joel Becker 0 siblings, 0 replies; 151+ messages in thread From: Joel Becker @ 2009-05-11 23:42 UTC (permalink / raw) To: jim owens Cc: jmorris, linux-security-module, mtk.manpages, linux-fsdevel, ocfs2-devel, viro On Mon, May 11, 2009 at 07:11:00PM -0400, jim owens wrote: > Joel Becker wrote: >> Here's v4 of reflink(). If you have the privileges, you get the >> full snapshot. If you don't, you must have read access, and then you >> get the entire snapshot (data and extended attributes) except that the >> security context is reinitialized. That's it. It fits with most of the >> other ops, and it's a clean degradation. > > I really like this. It has a nice clean user operational definition > and gives them all the snap/cowfile features. And if they had the > privilege to do the reflink(), they can just chattr away :) > > jim > >> + /* >> + * If the caller has the rights, reflink() will preserve the >> + * security context of the source inode. >> + */ >> + if ((current_fsuid() != inode->i_uid) && !capable(CAP_CHOWN)) >> + preserve_security = 0; >> + if ((current_fsuid() != inode->i_uid) && >> + !in_group_p(inode->i_gid) && !capable(CAP_CHOWN)) >> + preserve_security = 0; > > I have not done a code review, but that appears to be an > editing cut-and-past duplication. Oh, good catch. Joel -- "In the long run...we'll all be dead." -Unknown Joel Becker Principal Software Developer Oracle E-mail: joel.becker@oracle.com Phone: (650) 506-8127 ^ permalink raw reply [flat|nested] 151+ messages in thread
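A small user-space check makes the duplication concrete: whenever the second test in the quoted hunk fires, the first has already fired, so as posted the pair behaves exactly like the first test alone. The sketch below replaces the kernel helpers with boolean flags (is_owner for the fsuid comparison, in_group for in_group_p(), cap_chown for capable(CAP_CHOWN)); the function names are hypothetical. It says nothing about which of the two tests the author meant to keep, which the thread does not state.

```c
#include <stdbool.h>

/* Both tests from the quoted v4 hunk, with kernel helpers replaced by flags. */
static bool preserve_as_posted(bool is_owner, bool in_group, bool cap_chown)
{
	bool preserve_security = true;

	if (!is_owner && !cap_chown)
		preserve_security = false;
	if (!is_owner && !in_group && !cap_chown)
		preserve_security = false;
	return preserve_security;
}

/* The first test alone: owner or CAP_CHOWN keeps the attributes. */
static bool preserve_first_test_only(bool is_owner, bool cap_chown)
{
	return is_owner || cap_chown;
}

/* Count the input combinations on which the two versions disagree. */
static int count_mismatches(void)
{
	int mismatches = 0;

	for (int bits = 0; bits < 8; bits++) {
		bool is_owner = bits & 1;
		bool in_group = bits & 2;
		bool cap_chown = bits & 4;

		if (preserve_as_posted(is_owner, in_group, cap_chown) !=
		    preserve_first_test_only(is_owner, cap_chown))
			mismatches++;
	}
	return mismatches;
}
```

Because the group-membership flag never changes the outcome, count_mismatches() finds no disagreement across all eight combinations.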
* Re: [RFC] The reflink(2) system call v4. 2009-05-11 20:40 ` [RFC] The reflink(2) system call v4 Joel Becker 2009-05-11 22:27 ` James Morris 2009-05-11 23:11 ` jim owens @ 2009-05-12 11:31 ` Jörn Engel 2009-05-12 13:12 ` jim owens 2009-05-12 15:04 ` Sage Weil ` (3 subsequent siblings) 6 siblings, 1 reply; 151+ messages in thread From: Jörn Engel @ 2009-05-12 11:31 UTC (permalink / raw) To: Joel Becker Cc: jim owens, jmorris, ocfs2-devel, viro, mtk.manpages, linux-security-module, linux-fsdevel On Mon, 11 May 2009 13:40:11 -0700, Joel Becker wrote: > > Here's v4 of reflink(). If you have the privileges, you get the > full snapshot. If you don't, you must have read access, and then you > get the entire snapshot (data and extended attributes) except that the > security context is reinitialized. That's it. It fits with most of the > other ops, and it's a clean degradation. Let me see if I understand this correctly. File "/tmp/foo" belongs to Joel, file "/tmp/bar" belongs to Joern. Everyone has read access to those files. Now if you reflink them to your home directory, both files belong to you. If I reflink them to my home directory, both files belong to me. And if root reflinks them to /root, one file belongs to Joel, the other to Joern. Is that correct? Because if it is, I would call that behaviour rather confusing. A system call that behaves differently depending on who calls it - or on whether the binary is installed suid root - is something I would like to avoid. Jörn -- A surrounded army must be given a way out. -- Sun Tzu ^ permalink raw reply [flat|nested] 151+ messages in thread
* Re: [RFC] The reflink(2) system call v4.
  2009-05-12 11:31 ` Jörn Engel
@ 2009-05-12 13:12   ` jim owens
  2009-05-12 20:24     ` Jamie Lokier
  2009-05-14 18:43     ` Jörn Engel
  0 siblings, 2 replies; 151+ messages in thread
From: jim owens @ 2009-05-12 13:12 UTC (permalink / raw)
  To: Jörn Engel
  Cc: Joel Becker, jmorris, ocfs2-devel, viro, mtk.manpages,
	linux-security-module, linux-fsdevel

Jörn Engel wrote:
> On Mon, 11 May 2009 13:40:11 -0700, Joel Becker wrote:
>> 	Here's v4 of reflink().  If you have the privileges, you get the
>> full snapshot.  If you don't, you must have read access, and then you
>> get the entire snapshot (data and extended attributes) except that the
>> security context is reinitialized.  That's it.  It fits with most of the
>> other ops, and it's a clean degradation.
>
> Let me see if I understand this correctly.  File "/tmp/foo" belongs to
> Joel, file "/tmp/bar" belongs to Joern.  Everyone has read access to
> those files.  Now if you reflink them to your home directory, both files
> belong to you.  If I reflink them to my home directory, both files
> belong to me.  And if root reflinks them to /root, one file belongs to
> Joel, the other to Joern.  Is that correct?

yes

> Because if it is, I would call that behaviour rather confusing.  A
> system call that behaves differently depending on who calls it - or
> on whether the binary is installed suid root - is something I would like
> to avoid.

Avoiding that just gives us other confusing operations unless
you have a really good alternative.

This design is very elegant, I wish I had thought of it :)

It passes the test that 99% of the time for any user (including
root), "it just works the way I want it to".  In my experience,
root and setuid programs really don't want to take ownership,
they want to replicate it.

The behavior matches "cp -p" or "tar -x" and yes those are not
system calls but so what.  What matters is the documentation is
clear about what happens and the most useful result occurs.

jim

^ permalink raw reply	[flat|nested] 151+ messages in thread
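[Editorial sketch] Read against the v4 patch quoted later in this thread, the scenario Jörn describes and jim confirms comes down to one decision: the security context is preserved only when the caller owns the source file or has CAP_CHOWN; otherwise the new file gets the caller's defaults. A minimal userspace model of that rule follows. The function names (reflink_preserves_security, reflink_new_owner) and the use of bare uid values are assumptions made for illustration; this is not the kernel code, only a way to tabulate the outcomes being debated.

```c
#include <stdbool.h>
#include <sys/types.h>

/*
 * Hypothetical model of the v4 rule (illustration only, not kernel code):
 * the security context is kept when the caller owns the source file or
 * holds CAP_CHOWN.
 */
bool reflink_preserves_security(uid_t caller, uid_t owner, bool cap_chown)
{
	return caller == owner || cap_chown;
}

/*
 * Owner of the new reflink under that rule: the original owner when the
 * context is preserved, otherwise the caller (as for any new file).
 */
uid_t reflink_new_owner(uid_t caller, uid_t owner, bool cap_chown)
{
	return reflink_preserves_security(caller, owner, cap_chown) ? owner : caller;
}
```

With Joel as uid 1000 and Joern as uid 1001, an unprivileged Joel reflinking Joern's file ends up owning the result, while a caller with CAP_CHOWN leaves it owned by Joern, which is exactly the caller-dependent behaviour under discussion.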
* Re: [RFC] The reflink(2) system call v4.
  2009-05-12 13:12 ` jim owens
@ 2009-05-12 20:24   ` Jamie Lokier
  2009-05-14 18:43   ` Jörn Engel
  1 sibling, 0 replies; 151+ messages in thread
From: Jamie Lokier @ 2009-05-12 20:24 UTC (permalink / raw)
  To: jim owens
  Cc: Jörn Engel, Joel Becker, jmorris, ocfs2-devel, viro,
	mtk.manpages, linux-security-module, linux-fsdevel

jim owens wrote:
> It passes the test that 99% of the time for any user (including
> root), "it just works the way I want it to".  In my experience,
> root and setuid programs really don't want to take ownership,
> they want to replicate it.

Unfortunately in the other 1%, as I've explained in detail in another
mail, it's a lot of work and sometimes impossible for a program to set
the attributes to be those of a new file.

Whereas an explicit choice between snapshot attributes and new-file
attributes never causes problems, because it's trivial to provide the
automatic "-p" switch by trying one then the other.

To human-optimise, make your reflink _program_ do that.  Humans don't
call system calls themselves :-)

> The behavior matches "cp -p" or "tar -x"

Actually it doesn't, but even if it did, not having any way to turn
off the "-p" would be just as annoying as if you couldn't do that with
"cp".

If you like root to have "cp -p", put it in /root/.bashrc :-)

-- Jamie

^ permalink raw reply	[flat|nested] 151+ messages in thread
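[Editorial sketch] Jamie's point that the automatic "-p" behaviour can live in the tool by "trying one then the other" can be illustrated roughly as follows. Neither a snapshot-only call nor a new-file-attributes call exists in this patch set as separate entry points, so both are passed in as function pointers; the typedef clone_fn, the wrapper clone_prefer_snapshot, and the two stubs are hypothetical names used only so the control flow can be shown and exercised.

```c
#include <errno.h>

/* Hypothetical shape shared by both operations: returns 0 or -errno. */
typedef int (*clone_fn)(const char *src, const char *dst);

/*
 * Try the attribute-preserving snapshot first.  Only when it is refused
 * with -EPERM, fall back to the variant that gives the new file fresh
 * attributes.  *degraded is set to 1 when the fallback succeeded, so the
 * caller can warn the user; it is 0 in every other case.
 */
int clone_prefer_snapshot(clone_fn snapshot, clone_fn cow_new_attrs,
			  const char *src, const char *dst, int *degraded)
{
	int err;

	*degraded = 0;
	err = snapshot(src, dst);
	if (err != -EPERM)
		return err;

	err = cow_new_attrs(src, dst);
	if (!err)
		*degraded = 1;
	return err;
}

/* Stand-ins for the two operations, present only to exercise the wrapper. */
int stub_ok(const char *src, const char *dst)
{
	(void)src;
	(void)dst;
	return 0;
}

int stub_eperm(const char *src, const char *dst)
{
	(void)src;
	(void)dst;
	return -EPERM;
}
```

The design choice this makes visible is that the policy (degrade silently, degrade with a warning, or fail) stays in userspace, where each tool can pick its own.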
* Re: [RFC] The reflink(2) system call v4.
  2009-05-12 13:12 ` jim owens
  2009-05-12 20:24   ` Jamie Lokier
@ 2009-05-14 18:43   ` Jörn Engel
  1 sibling, 0 replies; 151+ messages in thread
From: Jörn Engel @ 2009-05-14 18:43 UTC (permalink / raw)
  To: jim owens
  Cc: Joel Becker, jmorris, ocfs2-devel, viro, mtk.manpages,
	linux-security-module, linux-fsdevel

[ Delayed response - mailserver was dead. ]

On Tue, 12 May 2009 09:12:17 -0400, jim owens wrote:
>
> >Because if it is, I would call that behaviour rather confusing.  A
> >system call that behaves differently depending on who calls it - or
> >on whether the binary is installed suid root - is something I would like
> >to avoid.
>
> Avoiding that just gives us other confusing operations unless
> you have a really good alternative.
>
> This design is very elegant, I wish I had thought of it :)
>
> It passes the test that 99% of the time for any user (including
> root), "it just works the way I want it to".  In my experience,
> root and setuid programs really don't want to take ownership,
> they want to replicate it.
>
> The behavior matches "cp -p" or "tar -x" and yes those are not
> system calls but so what.  What matters is the documentation is
> clear about what happens and the most useful result occurs.

If what you want is copyfile(2), this is a poor design because it
usually does what you want and sometimes doesn't.  If what you want is
reflink(2), this may be acceptable.  Not sure.

I personally would prefer to get -EPERM or something instead of altered
behaviour.  So you can count me in with the people that propose two
separate system calls.

Jörn

-- 
They laughed at Galileo.  They laughed at Copernicus.  They laughed at
Columbus.  But remember, they also laughed at Bozo the Clown.
-- unknown

^ permalink raw reply	[flat|nested] 151+ messages in thread
* Re: [RFC] The reflink(2) system call v4.
  2009-05-11 20:40 ` [RFC] The reflink(2) system call v4 Joel Becker
                     ` (2 preceding siblings ...)
  2009-05-12 11:31   ` Jörn Engel
@ 2009-05-12 15:04   ` Sage Weil
  2009-05-12 15:23     ` jim owens
  2009-05-12 17:28     ` Joel Becker
  2009-05-14  3:57   ` Andy Lutomirski
                     ` (2 subsequent siblings)
  6 siblings, 2 replies; 151+ messages in thread
From: Sage Weil @ 2009-05-12 15:04 UTC (permalink / raw)
  To: Joel Becker
  Cc: jim owens, jmorris, ocfs2-devel, viro, mtk.manpages,
	linux-security-module, linux-fsdevel

On Mon, 11 May 2009, Joel Becker wrote:
> 	Here's v4 of reflink().  If you have the privileges, you get the
> full snapshot.  If you don't, you must have read access, and then you
> get the entire snapshot (data and extended attributes) except that the
> security context is reinitialized.  That's it.  It fits with most of the
> other ops, and it's a clean degradation.

What would a 'cp' without '-p' be expected to do here when it has the
privileges?  Call reflink(2), then explicitly clear out any copied
security attributes, ensure that any copied attributes are removed, and
otherwise jump through hoops to make the newly created file look like it
should?  Should it check whether it has the privileges and act accordingly
(_can_ it even do that reliably/atomically?), or unconditionally verify
the attributes look like a new file's should?

To me, a simple 'cp' type operation (assuming it gets wired up the way it
could) seems like at least as common a use case as a 'snapshot'
operation.  I know that's not your main goal here, but I don't
understand the resistance to two syscalls.  Mixing the two might give you
the right answer in many cases, but certainly not all, and it makes for
confusing application interface semantics that we won't be able to change
down the line.

sage

> 	I add a flag to iops->reflink() so that the filesystem knows what
> to do with the security context.  That's the only change visible outside
> of vfs_reflink().
> Security folks, check my work. Everyone else, let me know if > this satisfies. > > Joel > > >From 1ebf4c2cf36d38b22de025b03753497466e18941 Mon Sep 17 00:00:00 2001 > From: Joel Becker <joel.becker@oracle.com> > Date: Sat, 2 May 2009 22:48:59 -0700 > Subject: [PATCH] fs: Add the reflink() operation and reflinkat(2) system call. > > The userpace visible idea of the operation is: > > int reflink(const char *oldpath, const char *newpath); > int reflinkat(int olddirfd, const char *oldpath, > int newdirfd, const char *newpath, int flags); > > The kernel only implements reflinkat(2). reflink(3) is a trivial > wrapper around reflinkat(2). > > The reflink() system call creates reference-counted links. It creates > a new file that shares the data extents of the source file in a > copy-on-write fashion. Its calling semantics are identical to link(2) > and linkat(2). Once complete, programs see the new file as a completely > separate entry. > > reflink() attempts to preserve ownership, permissions, and security > contexts in order to create a fully snapshot. Preserving those > attributes requires ownership or CAP_CHOWN. A caller without those > privileges will see the security context of the new file initialized to > their default. > > In the VFS, ->reflink() is an inode_operation with the almost same > arguments as ->link(); an additional argument tells the filesystem to > copy over or reinitialize the security context on the new file. > > A new LSM hook, security_inode_reflink(), is added. None of the > existing LSM hooks appeared to fit. > > XXX: Currently only adds the x86_32 linkage. The rest of the > architectures belong here too. 
> > Signed-off-by: Joel Becker <joel.becker@oracle.com> > --- > Documentation/filesystems/reflink.txt | 165 +++++++++++++++++++++++++++++++++ > Documentation/filesystems/vfs.txt | 4 + > arch/x86/include/asm/unistd_32.h | 1 + > arch/x86/kernel/syscall_table_32.S | 1 + > fs/namei.c | 113 ++++++++++++++++++++++ > include/linux/fs.h | 2 + > include/linux/security.h | 16 +++ > include/linux/syscalls.h | 2 + > security/capability.c | 6 + > security/security.c | 7 ++ > 10 files changed, 317 insertions(+), 0 deletions(-) > create mode 100644 Documentation/filesystems/reflink.txt > > diff --git a/Documentation/filesystems/reflink.txt b/Documentation/filesystems/reflink.txt > new file mode 100644 > index 0000000..aa7380f > --- /dev/null > +++ b/Documentation/filesystems/reflink.txt > @@ -0,0 +1,165 @@ > +reflink(2) > +========== > + > + > +INTRODUCTION > +------------ > + > +A reflink is a reference-counted link. The reflink(2) operation is > +analogous to the link(2) operation, except that instead of two directory > +entries pointing to the same inode, there are two identical inodes > +pointing to the same data. Writes do not modify the shared data; they > +use copy-on-write (CoW). Thus, after the reflink has been created, the > +inodes can diverge without impacting each other. > + > + > +SYNOPSIS > +-------- > + > +The reflink(2) call looks just like link(2): > + > + int reflink(const char *oldpath, const char *newpath); > + > +The actual system call is reflinkat(2): > + > + int reflinkat(int olddirfd, const char *oldpath, > + int newdirfd, const char *newpath, int flags); > + > +For details on how olddirfd, newdirfd, and flags behave, see linkat(2). > +The reflink(2) call won't be implemented by the kernel, because it's a > +trivial wrapper around reflinkat(2). > + > + > +DESCRIPTION > +----------- > + > +One way of viewing reflink is to look at the level of sharing. 
A > +symbolic link does its sharing at the directory entry level; many names > +end up pointing at the same directory entry. Hard links are one step > +down. Multiple directory entries are sharing one inode. Reflinks are > +down one more level: multiple inodes share the same data extents. > + > +When you symlink a file, you can then access it via the symlink or the > +real directory entry, and for the most part they look identical. When > +accessing more than one name for a hard link, the object returned looks > +identical. Similarly, a newly created reflink is identical to its > +source in almost every way and can be treated as such. This includes > +ownership, permissions, security context, and data. The only things > +that are different are the inode number, the link count, and the ctime. > + > +A reflink is a snapshot of the source file at the time it is created. > + > +Once created, though, a reflink can be modified like any other normal > +file without affecting the source file. Changes to trivial fields like > +permissions, owner, or times are guaranteed not to trigger CoW of file > +data and will not return any error that wouldn't happen on a truly > +distinct file. Changes to the file's data will trigger CoW of the data > +affected - the actual CoW granularity is up to the filesystem, from > +exact bytes up to the entire file. ocfs2, for example, will copy out an > +entire extent or 1MB, whichever is smaller. > + > +Preserving the security context of the source file obviously requires > +the privilege to do so. Callers that do not own the source file and do > +not have CAP_CHOWN will get a new reflink with all non-security > +attributes preserved; the security context of the new reflink will be > +as a newly created file by that user. > + > +Partial reflinks are not allowed. The new inode will only appear in the > +directory structure after it is fully formed. This prevents a crash or > +lack of space from creating a partial reflink. 
> + > +If a filesystem does not support reflinks, the kernel and libc MUST NOT > +fake it. Callers are expecting to get snapshots, and faking it will > +violate that trust. > + > +The userspace view is as follows. When reflink(2) returns, opening > +oldpath and newpath returns identical-looking files, just like link(2). > +After that, oldpath and newpath behave as distinct files, and > +modifications to one have no impact on the other. > + > + > +RESTRICTIONS > +------------ > + > +Just as the sharing gets lower as you move from symlink() -> link() -> > +reflink(), the restrictions on the call get tighter. A symlink doesn't > +require any access permissions other than being able to create its > +inode. It can cross filesystems and mount points, and it can point to > +any type of file. A hard link requires both source and target to be on > +the same filesystem under the same mount point, and that the source not > +be a directory. Like hard links and symlinks, a reflink cannot be > +created if newpath exists. > + > +Reflinks adds one big restriction on top of hard links: only the owner > +or someone with elevated privileges (CAP_CHOWN) can preserve the > +security context (permissions, ownership, ACLs, etc) across a reflink. > +A reflink is a point-in-time snapshot of a file. Without the > +appropriate privilege, the caller will see their own default security > +context applied to the file. > + > +A caller without the privileges to preserve the security context must > +have read access to reflink a file. > + > + > +SHARING > +------- > + > +A reflink creates a new inode. It shares all data extents of the source > +file; this includes file data and extended attribute data. All of the > +sharing is in a CoW fashion, and any modification of the data will break > +the sharing. > + > +For some filesystems, certain data structures are not in allocated > +storage extents. Creating a reflink might make a copy of these extents. 
> +An example is ext3's ability to store small extended attributes inside > +the ext3 inode. Since a reflink is creating a new inode, those extended > +attributes are merely copied to the new inode. > + > + > +EXCEPTIONS > +---------- > + > +All file attributes and extended attributes of the new file must > +identical to the source file with the following exceptions: > + > +- The new file must have a new inode number. This allows POSIX > + programs to treat the source and new files as separate objects. From > + the view of the POSIX application, the files are distinct. The > + sharing is invisible outside of the filesystem's internal structures. > +- The ctime of the source file only changes if the source's metadata > + must be changed to accommodate the copy-on-write linkage. The ctime > + of the new file is set to represent its creation. > +- The link count of the source file is unchanged, and the link count of > + the new file is one. > +- If the caller lacks the privileges to preserve the security context, > + the file will have its security context initialized as would any new > + file. > + > +The mtime of the source file is unmodified, and the mtime of the new > +file is set identical to the source file. This reflects that the data > +is unchanged. > + > + > +INODE OPERATION > +--------------- > + > +Filesystems implement the ->reflink() inode operation. It has almost > +the same prototype as ->link(): > + > + int (*reflink)(struct dentry *old_dentry, struct inode *dir, > + struct dentry *new_dentry, int preserve_security); > + > +When the filesystem is called, the VFS has already checked the > +permissions and mountpoint of the operation. It has determined whether > +the security context should be preserved or reinitialized, as specified > +by the preserve_security argument. The filesystem just needs to create > +the new inode identical to the old one with the exceptions noted above, > +link up the shared data extents, and then link the new inode into dir. 
> + > + > +FOLLOWING SYMBOLIC LINKS > +------------------------ > + > +reflink() deferences symbolic links in the same manner that link(2) > +does. The AT_SYMLINK_FOLLOW flag is honored just as for linkat(2). > + > diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt > index f49eecf..01cd810 100644 > --- a/Documentation/filesystems/vfs.txt > +++ b/Documentation/filesystems/vfs.txt > @@ -333,6 +333,7 @@ struct inode_operations { > ssize_t (*listxattr) (struct dentry *, char *, size_t); > int (*removexattr) (struct dentry *, const char *); > void (*truncate_range)(struct inode *, loff_t, loff_t); > + int (*reflink) (struct dentry *,struct inode *,struct dentry *); > }; > > Again, all methods are called without any locks being held, unless > @@ -431,6 +432,9 @@ otherwise noted. > > truncate_range: a method provided by the underlying filesystem to truncate a > range of blocks , i.e. punch a hole somewhere in a file. > + reflink: called by the reflink(2) system call. Only required if you want > + to support reflinks. For further information, see > + Documentation/filesystems/reflink.txt. 
> > > The Address Space Object > diff --git a/arch/x86/include/asm/unistd_32.h b/arch/x86/include/asm/unistd_32.h > index 6e72d74..c368563 100644 > --- a/arch/x86/include/asm/unistd_32.h > +++ b/arch/x86/include/asm/unistd_32.h > @@ -340,6 +340,7 @@ > #define __NR_inotify_init1 332 > #define __NR_preadv 333 > #define __NR_pwritev 334 > +#define __NR_reflinkat 335 > > #ifdef __KERNEL__ > > diff --git a/arch/x86/kernel/syscall_table_32.S b/arch/x86/kernel/syscall_table_32.S > index ff5c873..d11c200 100644 > --- a/arch/x86/kernel/syscall_table_32.S > +++ b/arch/x86/kernel/syscall_table_32.S > @@ -334,3 +334,4 @@ ENTRY(sys_call_table) > .long sys_inotify_init1 > .long sys_preadv > .long sys_pwritev > + .long sys_reflinkat /* 335 */ > diff --git a/fs/namei.c b/fs/namei.c > index 78f253c..34a6ce5 100644 > --- a/fs/namei.c > +++ b/fs/namei.c > @@ -2486,6 +2486,118 @@ SYSCALL_DEFINE2(link, const char __user *, oldname, const char __user *, newname > return sys_linkat(AT_FDCWD, oldname, AT_FDCWD, newname, 0); > } > > +int vfs_reflink(struct dentry *old_dentry, struct inode *dir, struct dentry *new_dentry) > +{ > + struct inode *inode = old_dentry->d_inode; > + int error; > + int preserve_security = 1; > + > + if (!inode) > + return -ENOENT; > + > + /* > + * If the caller has the rights, reflink() will preserve the > + * security context of the source inode. > + */ > + if ((current_fsuid() != inode->i_uid) && !capable(CAP_CHOWN)) > + preserve_security = 0; > + if ((current_fsuid() != inode->i_uid) && > + !in_group_p(inode->i_gid) && !capable(CAP_CHOWN)) > + preserve_security = 0; > + > + /* > + * If the caller doesn't have the right to preserve the security > + * context, the caller is only getting the data and extended > + * attributes. They need read permission on the file. 
> + */ > + if (!preserve_security) { > + error = inode_permission(inode, MAY_READ); > + if (error) > + return error; > + } > + > + error = may_create(dir, new_dentry); > + if (error) > + return error; > + > + if (dir->i_sb != inode->i_sb) > + return -EXDEV; > + > + /* > + * A reflink to an append-only or immutable file cannot be created. > + */ > + if (IS_APPEND(inode) || IS_IMMUTABLE(inode)) > + return -EPERM; > + if (!dir->i_op->reflink) > + return -EPERM; > + if (S_ISDIR(inode->i_mode)) > + return -EPERM; > + > + error = security_inode_reflink(old_dentry, dir); > + if (error) > + return error; > + > + mutex_lock(&inode->i_mutex); > + vfs_dq_init(dir); > + error = dir->i_op->reflink(old_dentry, dir, new_dentry, > + preserve_security); > + mutex_unlock(&inode->i_mutex); > + if (!error) > + fsnotify_create(dir, new_dentry); > + return error; > +} > + > +SYSCALL_DEFINE5(reflinkat, int, olddfd, const char __user *, oldname, > + int, newdfd, const char __user *, newname, int, flags) > +{ > + struct dentry *new_dentry; > + struct nameidata nd; > + struct path old_path; > + int error; > + char *to; > + > + if ((flags & ~AT_SYMLINK_FOLLOW) != 0) > + return -EINVAL; > + > + error = user_path_at(olddfd, oldname, > + flags & AT_SYMLINK_FOLLOW ? 
LOOKUP_FOLLOW : 0, > + &old_path); > + if (error) > + return error; > + > + error = user_path_parent(newdfd, newname, &nd, &to); > + if (error) > + goto out; > + error = -EXDEV; > + if (old_path.mnt != nd.path.mnt) > + goto out_release; > + new_dentry = lookup_create(&nd, 0); > + error = PTR_ERR(new_dentry); > + if (IS_ERR(new_dentry)) > + goto out_unlock; > + error = mnt_want_write(nd.path.mnt); > + if (error) > + goto out_dput; > + error = security_path_link(old_path.dentry, &nd.path, new_dentry); > + if (error) > + goto out_drop_write; > + error = vfs_reflink(old_path.dentry, nd.path.dentry->d_inode, new_dentry); > +out_drop_write: > + mnt_drop_write(nd.path.mnt); > +out_dput: > + dput(new_dentry); > +out_unlock: > + mutex_unlock(&nd.path.dentry->d_inode->i_mutex); > +out_release: > + path_put(&nd.path); > + putname(to); > +out: > + path_put(&old_path); > + > + return error; > +} > + > + > /* > * The worst of all namespace operations - renaming directory. "Perverted" > * doesn't even start to describe it. Somebody in UCB had a heck of a trip... > @@ -2890,6 +3002,7 @@ EXPORT_SYMBOL(unlock_rename); > EXPORT_SYMBOL(vfs_create); > EXPORT_SYMBOL(vfs_follow_link); > EXPORT_SYMBOL(vfs_link); > +EXPORT_SYMBOL(vfs_reflink); > EXPORT_SYMBOL(vfs_mkdir); > EXPORT_SYMBOL(vfs_mknod); > EXPORT_SYMBOL(generic_permission); > diff --git a/include/linux/fs.h b/include/linux/fs.h > index 5bed436..0a5c807 100644 > --- a/include/linux/fs.h > +++ b/include/linux/fs.h > @@ -1415,6 +1415,7 @@ extern int vfs_link(struct dentry *, struct inode *, struct dentry *); > extern int vfs_rmdir(struct inode *, struct dentry *); > extern int vfs_unlink(struct inode *, struct dentry *); > extern int vfs_rename(struct inode *, struct dentry *, struct inode *, struct dentry *); > +extern int vfs_reflink(struct dentry *, struct inode *, struct dentry *); > > /* > * VFS dentry helper functions. 
> @@ -1537,6 +1538,7 @@ struct inode_operations { > loff_t len); > int (*fiemap)(struct inode *, struct fiemap_extent_info *, u64 start, > u64 len); > + int (*reflink) (struct dentry *,struct inode *,struct dentry *,int); > }; > > struct seq_file; > diff --git a/include/linux/security.h b/include/linux/security.h > index d5fd616..ea9cd93 100644 > --- a/include/linux/security.h > +++ b/include/linux/security.h > @@ -528,6 +528,14 @@ static inline void security_free_mnt_opts(struct security_mnt_opts *opts) > * @inode contains a pointer to the inode. > * @secid contains a pointer to the location where result will be saved. > * In case of failure, @secid will be set to zero. > + * @inode_reflink: > + * Check permission before creating a new reference-counted link to > + * a file. > + * @old_dentry contains the dentry structure for an existing link to > + * the file. > + * @dir contains the inode structure of the parent directory of the > + * new reflink. > + * Return 0 if permission is granted. 
> * > * Security hooks for file operations > * > @@ -1415,6 +1423,7 @@ struct security_operations { > int (*inode_unlink) (struct inode *dir, struct dentry *dentry); > int (*inode_symlink) (struct inode *dir, > struct dentry *dentry, const char *old_name); > + int (*inode_reflink) (struct dentry *old_dentry, struct inode *dir); > int (*inode_mkdir) (struct inode *dir, struct dentry *dentry, int mode); > int (*inode_rmdir) (struct inode *dir, struct dentry *dentry); > int (*inode_mknod) (struct inode *dir, struct dentry *dentry, > @@ -1675,6 +1684,7 @@ int security_inode_link(struct dentry *old_dentry, struct inode *dir, > int security_inode_unlink(struct inode *dir, struct dentry *dentry); > int security_inode_symlink(struct inode *dir, struct dentry *dentry, > const char *old_name); > +int security_inode_reflink(struct dentry *old_dentry, struct inode *dir); > int security_inode_mkdir(struct inode *dir, struct dentry *dentry, int mode); > int security_inode_rmdir(struct inode *dir, struct dentry *dentry); > int security_inode_mknod(struct inode *dir, struct dentry *dentry, int mode, dev_t dev); > @@ -2056,6 +2066,12 @@ static inline int security_inode_symlink(struct inode *dir, > return 0; > } > > +static inline int security_inode_reflink(struct dentry *old_dentry, > + struct inode *dir) > +{ > + return 0; > +} > + > static inline int security_inode_mkdir(struct inode *dir, > struct dentry *dentry, > int mode) > diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h > index 40617c1..35a8743 100644 > --- a/include/linux/syscalls.h > +++ b/include/linux/syscalls.h > @@ -692,6 +692,8 @@ asmlinkage long sys_symlinkat(const char __user * oldname, > int newdfd, const char __user * newname); > asmlinkage long sys_linkat(int olddfd, const char __user *oldname, > int newdfd, const char __user *newname, int flags); > +asmlinkage long sys_reflinkat(int olddfd, const char __user *oldname, > + int newdfd, const char __user *newname, int flags); > asmlinkage long 
sys_renameat(int olddfd, const char __user * oldname, > int newdfd, const char __user * newname); > asmlinkage long sys_futimesat(int dfd, char __user *filename, > diff --git a/security/capability.c b/security/capability.c > index 21b6cea..3dcc4cc 100644 > --- a/security/capability.c > +++ b/security/capability.c > @@ -172,6 +172,11 @@ static int cap_inode_symlink(struct inode *inode, struct dentry *dentry, > return 0; > } > > +static int cap_inode_reflink(struct dentry *old_dentry, struct inode *inode) > +{ > + return 0; > +} > + > static int cap_inode_mkdir(struct inode *inode, struct dentry *dentry, > int mask) > { > @@ -905,6 +910,7 @@ void security_fixup_ops(struct security_operations *ops) > set_to_cap_if_null(ops, inode_link); > set_to_cap_if_null(ops, inode_unlink); > set_to_cap_if_null(ops, inode_symlink); > + set_to_cap_if_null(ops, inode_reflink); > set_to_cap_if_null(ops, inode_mkdir); > set_to_cap_if_null(ops, inode_rmdir); > set_to_cap_if_null(ops, inode_mknod); > diff --git a/security/security.c b/security/security.c > index 5284255..70d0ac3 100644 > --- a/security/security.c > +++ b/security/security.c > @@ -470,6 +470,13 @@ int security_inode_symlink(struct inode *dir, struct dentry *dentry, > return security_ops->inode_symlink(dir, dentry, old_name); > } > > +int security_inode_reflink(struct dentry *old_dentry, struct inode *dir) > +{ > + if (unlikely(IS_PRIVATE(old_dentry->d_inode))) > + return 0; > + return security_ops->inode_reflink(old_dentry, dir); > +} > + > int security_inode_mkdir(struct inode *dir, struct dentry *dentry, int mode) > { > if (unlikely(IS_PRIVATE(dir))) > -- > 1.6.1.3 > > > -- > > "Three o'clock is always too late or too early for anything you > want to do." 
> 	- Jean-Paul Sartre
>
> Joel Becker
> Principal Software Developer
> Oracle
> E-mail: joel.becker@oracle.com
> Phone: (650) 506-8127

^ permalink raw reply	[flat|nested] 151+ messages in thread
* Re: [RFC] The reflink(2) system call v4.
  2009-05-12 15:04 ` Sage Weil
@ 2009-05-12 15:23   ` jim owens
  2009-05-12 16:16     ` Sage Weil
  2009-05-12 17:28   ` Joel Becker
  1 sibling, 1 reply; 151+ messages in thread
From: jim owens @ 2009-05-12 15:23 UTC (permalink / raw)
  To: Sage Weil
  Cc: Joel Becker, jmorris, ocfs2-devel, viro, mtk.manpages,
	linux-security-module, linux-fsdevel

Sage Weil wrote:
> On Mon, 11 May 2009, Joel Becker wrote:
>> 	Here's v4 of reflink().  If you have the privileges, you get the
>> full snapshot.  If you don't, you must have read access, and then you
>> get the entire snapshot (data and extended attributes) except that the
>> security context is reinitialized.  That's it.  It fits with most of the
>> other ops, and it's a clean degradation.
>
> What would a 'cp' without '-p' be expected to do here when it has the
> privileges?  Call reflink(2), then explicitly clear out any copied
> security attributes, ensure that any copied attributes are removed, and
> otherwise jump through hoops to make the newly created file look like it
> should?  Should it check whether it has the privileges and act accordingly
> (_can_ it even do that reliably/atomically?), or unconditionally verify
> the attributes look like a new file's should?

I don't understand what you think is hard about cp doing the
"if not preserve then update attributes".  It does not have to check
the reflink() attr result, it just assigns the expected new attributes.

Only the -p snapshot needs atomicity.

> To me, a simple 'cp' type operation (assuming it gets wired up the way it
> could) seems like at least as common a use case as a 'snapshot'

I don't think changing "cp" is a good idea since users have a
long history that cp means make a data copy, not cow.  Adding
a new flag is IMO not as good as a new utility.  Particularly
since we can not do directories.

jim

^ permalink raw reply	[flat|nested] 151+ messages in thread
* Re: [RFC] The reflink(2) system call v4.
  2009-05-12 15:23 ` jim owens
@ 2009-05-12 16:16   ` Sage Weil
  2009-05-12 17:45     ` jim owens
  0 siblings, 1 reply; 151+ messages in thread
From: Sage Weil @ 2009-05-12 16:16 UTC (permalink / raw)
  To: jim owens
  Cc: Joel Becker, jmorris, ocfs2-devel, viro, mtk.manpages,
	linux-security-module, linux-fsdevel

On Tue, 12 May 2009, jim owens wrote:
> Sage Weil wrote:
> > On Mon, 11 May 2009, Joel Becker wrote:
> > > 	Here's v4 of reflink().  If you have the privileges, you get the
> > > full snapshot.  If you don't, you must have read access, and then you
> > > get the entire snapshot (data and extended attributes) except that the
> > > security context is reinitialized.  That's it.  It fits with most of the
> > > other ops, and it's a clean degradation.
> >
> > What would a 'cp' without '-p' be expected to do here when it has the
> > privileges?  Call reflink(2), then explicitly clear out any copied security
> > attributes, ensure that any copied attributes are removed, and otherwise jump
> > through hoops to make the newly created file look like it should?  Should it
> > check whether it has the privileges and act accordingly (_can_ it even do
> > that reliably/atomically?), or unconditionally verify the attributes look
> > like a new file's should?
>
> I don't understand what you think is hard about cp doing the
> "if not preserve then update attributes".  It does not have to check
> the reflink() attr result, it just assigns the expected new attributes.

I assume it's possible, but not being familiar with how the SELinux etc
attributes look, my guess is that any tool that wants to cow file data to
a new file (even if root) would need to do something like

	reflink(src, dst)
	chown(dst, getuid(), getgid())
	listxattr and rmxattr each.  or just delete selinux/whatever attributes.
	create generic 'new file' selinux/whatever attributes, if needed.

The chown bit isn't even right, since it doesn't follow the directory
sticky bit rules.  And is there some generic way to assign an existing
file 'new file'-like security attributes?  It's a mess.

> Only the -p snapshot needs atomicity.

My point is that the process creating the cow file should unconditionally
do the above checks (and needed fixups) because it can't atomically verify
the attribute copy won't happen and make the reflink call.

> > To me, a simple 'cp' type operation (assuming it gets wired up the way it
> > could) seems like at least as common a use case as a 'snapshot'
>
> I don't think changing "cp" is a good idea since users have a
> long history that cp means make a data copy, not cow.  Adding
> a new flag is IMO not as good as a new utility.  Particularly
> since we can not do directories.

Maybe not, but that's a separate question from the interface issue.  We
shouldn't preclude the possibility of creating tools that preserve
attributes (or warn if they can't) and tools that simply want to cow data
to a new file.  AFAICS reflink(2) as proposed doesn't quite let you do
either one without extra hackery to compensate for its dual-mode
behavior.  If this thread has demonstrated anything, it's that some users
want snapshot-like semantics (cp -p) and some want cowfile()-like
semantics (cp).  What is the benefit of combining the two into a single
call?  If I want snapshot-like semantics, I would rather get -EPERM if I
lack the necessary permissions than silently get an approximation.  Then
I can at least issue a warning to the user.  If I really want to
gracefully 'degrade', I can always do something like

	err = reflink(src, dst);
	if (err == -EPERM) {
		err = cowfile(src, dst);
		if (!err)
			printf("warning: failed to preserve all file attributes\n");
	}

sage

> jim

^ permalink raw reply	[flat|nested] 151+ messages in thread
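[Editorial sketch] To make the cost of the userspace fixup Sage outlines more concrete, here is roughly what the "listxattr and rmxattr each" step looks like with the existing Linux calls listxattr(2) and removexattr(2) from <sys/xattr.h>. The helper name strip_xattrs is hypothetical, and the sketch makes several simplifying assumptions: attributes the caller may not remove are skipped silently, a filesystem without xattr support is treated as "nothing to strip", and no attempt is made to handle the attribute list changing between the two listxattr() calls. It also does nothing about the harder half of the problem raised above, recreating 'new file' security attributes afterwards.

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/xattr.h>

/*
 * Hypothetical helper: remove every extended attribute on path that the
 * caller is permitted to remove.  Returns the number of attributes
 * removed, or -1 with errno set if the attribute list cannot be read.
 */
int strip_xattrs(const char *path)
{
	ssize_t len;
	char *names, *p;
	int removed = 0;
	int saved;

	/* First call with a NULL buffer asks for the size of the name list. */
	len = listxattr(path, NULL, 0);
	if (len < 0)
		return errno == ENOTSUP ? 0 : -1;
	if (len == 0)
		return 0;

	names = malloc((size_t)len);
	if (!names)
		return -1;

	/* Second call fills the buffer with NUL-terminated names. */
	len = listxattr(path, names, (size_t)len);
	if (len < 0) {
		saved = errno;
		free(names);
		errno = saved;
		return -1;
	}

	/* Walk the packed list of names; failures to remove are skipped. */
	for (p = names; p < names + len; p += strlen(p) + 1) {
		if (removexattr(path, p) == 0)
			removed++;
	}

	free(names);
	return removed;
}
```

Even this small step needs two system calls plus one per attribute and is not atomic with respect to the reflink itself, which is part of why the thread argues for an explicit new-file-attributes mode in the kernel interface.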
* Re: [RFC] The reflink(2) system call v4. 2009-05-12 16:16 ` Sage Weil @ 2009-05-12 17:45 ` jim owens 2009-05-12 20:29 ` Jamie Lokier 0 siblings, 1 reply; 151+ messages in thread From: jim owens @ 2009-05-12 17:45 UTC (permalink / raw) To: Sage Weil Cc: Joel Becker, jmorris, ocfs2-devel, viro, mtk.manpages, linux-security-module, linux-fsdevel Sage Weil wrote: > Maybe not, but that's a separate question from the interface issue. We > shouldn't preclude the possibility creating tools that preserve attributes > (or warn if they can't) and tools that simply want to cow data to a new > file. AFAICS reflink(2) as proposed doesn't quite let you do either one > without extra hackery to compensate for its dual-mode behavior. If this > thread has demonstrated anything, it's that some users want snapshot-like > semantics (cp -p) and some want cowfile()-like semantics (cp). What is > the benefit of combining the two into a single call? If I want > snapshot-like semantics, I would rather get -EPERM if I lack the necessary > permissions than silently get an approximation. I'm not fighting against two syscalls but the reason I like the V4 definition is the opposite of knowing I failed to snapshot. It is really because in my experience as both root on multi-user systems and basic untrusted user, when root copies something from a user there are only two desired outcomes: 1) cp -p 2) cp, chown "someone" , chgrp "somegroup", chmod "new rights" The common mistake is wanting #1 and forgetting the -p so it then produces an error and has to be fixed. Using root's default attributes is almost never desired. So with this reflink() definition, normal users get their own attributes and root automatically gets preserve but can change them later. IMO this is optimized for humans, and I don't really know of any privileged daemon things that are setuid and want to not preserve attributes. Do you have examples? jim ^ permalink raw reply [flat|nested] 151+ messages in thread
* Re: [RFC] The reflink(2) system call v4. 2009-05-12 17:45 ` jim owens @ 2009-05-12 20:29 ` Jamie Lokier 0 siblings, 0 replies; 151+ messages in thread From: Jamie Lokier @ 2009-05-12 20:29 UTC (permalink / raw) To: jim owens Cc: Sage Weil, Joel Becker, jmorris, ocfs2-devel, viro, mtk.manpages, linux-security-module, linux-fsdevel jim owens wrote: > Using root's default attributes is almost never desired. ^^^^^^ Exactly. When it is desired, it shouldn't be impossible :-) Setting attributes to those of a new file outside the kernel requires parsing /proc/mounts and knowing filesystem-type-specific things, among other things. Ugly stuff - should never be written. Don't make such ugly stuff be written (and fail when /proc isn't mounted). There is also the principle of least surprise... Shell scripts which behave differently for root - that's asking for trouble. -- Jamie ^ permalink raw reply [flat|nested] 151+ messages in thread
* Re: [RFC] The reflink(2) system call v4. 2009-05-12 15:04 ` Sage Weil 2009-05-12 15:23 ` jim owens @ 2009-05-12 17:28 ` Joel Becker 2009-05-13 4:30 ` Sage Weil 1 sibling, 1 reply; 151+ messages in thread From: Joel Becker @ 2009-05-12 17:28 UTC (permalink / raw) To: Sage Weil Cc: jim owens, jmorris, ocfs2-devel, viro, mtk.manpages, linux-security-module, linux-fsdevel On Tue, May 12, 2009 at 08:04:21AM -0700, Sage Weil wrote: > To me, a simple 'cp' type operation (assuming it gets wired up the way it > could) seems like at least as common a use case than a 'snapshot' > operation. I know that's not what your main goal here, but I don't > understand the resistance to two syscalls. Mixing the two might give you > the right answer in many cases, but certainly not all, and it makes for > confusing application interface semantics that we won't be able to change > down the line. I'm not against two syscalls, but I'm not writing copyfile() here, just reflink(). Someone clearly could write copyfile() later and link into some of the same underlying mechanisms. It's important to distinguish the semantics, though, and that's why I'm doing one thing. For example, reflink() is a snapshot (a "reference-counted link") and has behaviors based on that. libc should never fake it, because the callers expect those behaviors. Whereas copyfile() would be fakeable in libc with a read/write cycle on filesystems that don't support it. Things like that. Heck, I think you could use reflink() to create a copyfile() in libc that uses no additional syscall. But you couldn't use copyfile() to create reflink(). Joel -- "Lately I've been talking in my sleep. Can't imagine what I'd have to say. Except my world will be right When love comes back my way." Joel Becker Principal Software Developer Oracle E-mail: joel.becker@oracle.com Phone: (650) 506-8127 ^ permalink raw reply [flat|nested] 151+ messages in thread
* Re: [RFC] The reflink(2) system call v4. 2009-05-12 17:28 ` Joel Becker @ 2009-05-13 4:30 ` Sage Weil 0 siblings, 0 replies; 151+ messages in thread From: Sage Weil @ 2009-05-13 4:30 UTC (permalink / raw) To: Joel Becker Cc: jim owens, jmorris, ocfs2-devel, viro, mtk.manpages, linux-security-module, linux-fsdevel On Tue, 12 May 2009, Joel Becker wrote: > I'm not against two syscalls, but I'm not writing copyfile() > here, just reflink(). Someone clearly could write copyfile() later and > link into some of the same underlying mechanisms. Ok, good. > It's important to distinguish the semantics, though, and that's > why I'm doing one thing. For example, reflink() is a snapshot (a > "reference-counted link") and has behaviors based on that. libc should > never fake it, because the callers expect those behaviors. Whereas > copyfile() would be fakeable in libc with a read/write cycle on > filesystems that don't support it. Things like that. > Heck, I think you could use reflink() to create a copyfile() in > libc that uses no additional syscall. But you couldn't use copyfile() > to create reflink(). Right, except that you _could_ implement the degraded (no CAP_CHOWN) reflink() behavior with a hypothetical copyfile(). I just think you should be sure that reflink() has _exactly_ the snapshot semantics that make sense, without compromises that try to capture some or all of copyfile() as well. Assuming that a copyfile() type syscall also existed, would you really want reflink() to silently degrade to something that can be implemented via copyfile() when you lack CAP_CHOWN? With the proposed reflink(), we might end up with a final API that looks something like: cowfile(src, dst, flags) - cow data and/or xattrs from src to dst reflink(src, dst) - snapshot src to dst, or if !CAP_CHOWN, cowfile() instead A simpler reflink() would make that degradation non-mandatory, and trivially implemented in userspace by those who want it. 
sage
* Re: [RFC] The reflink(2) system call v4. 2009-05-11 20:40 ` [RFC] The reflink(2) system call v4 Joel Becker ` (3 preceding siblings ...) 2009-05-12 15:04 ` Sage Weil @ 2009-05-14 3:57 ` Andy Lutomirski 2009-05-14 18:12 ` Stephen Smalley 2009-05-28 0:24 ` [RFC] The reflink(2) system call v5 Joel Becker 2009-09-14 22:24 ` Joel Becker 6 siblings, 1 reply; 151+ messages in thread From: Andy Lutomirski @ 2009-05-14 3:57 UTC (permalink / raw) To: jim owens, jmorris, ocfs2-devel, viro, mtk.manpages, linux-security-module, linux-fsdevel Joel Becker wrote: > + > +Preserving the security context of the source file obviously requires > +the privilege to do so. Callers that do not own the source file and do > +not have CAP_CHOWN will get a new reflink with all non-security > +attributes preserved; the security context of the new reflink will be > +as a newly created file by that user. > + There are plenty of syscalls that require some privilege and fail if the caller doesn't have it. But I can think of only one syscall that does *something different* depending on who called it: setuid. Please search the web and marvel at the disasters caused by setuid's magical caller-dependent behavior (the sendmail bug is probably the most famous [1]). This proposal for reflink is just asking for bugs where an attacker gets some otherwise privileged program to call reflink but to somehow lack the privileges (CAP_CHOWN, selinux rights, or whatever) to copy security attributes, thus exposing a link with the wrong permissions. Would it really be that hard to have two syscalls, or a flag, or whatever, where one of them preserves all security attributes and *fails* if the caller isn't allowed to do that and the other one makes the caller own the new link? [1] http://www.cs.berkeley.edu/~daw/papers/setuid-usenix02.pdf --Andy ^ permalink raw reply [flat|nested] 151+ messages in thread
* Re: [RFC] The reflink(2) system call v4. 2009-05-14 3:57 ` Andy Lutomirski @ 2009-05-14 18:12 ` Stephen Smalley 2009-05-14 22:00 ` Joel Becker 0 siblings, 1 reply; 151+ messages in thread From: Stephen Smalley @ 2009-05-14 18:12 UTC (permalink / raw) To: Andy Lutomirski Cc: jim owens, jmorris, ocfs2-devel, viro, mtk.manpages, linux-security-module, linux-fsdevel On Wed, 2009-05-13 at 23:57 -0400, Andy Lutomirski wrote: > Joel Becker wrote: > > + > > +Preserving the security context of the source file obviously requires > > +the privilege to do so. Callers that do not own the source file and do > > +not have CAP_CHOWN will get a new reflink with all non-security > > +attributes preserved; the security context of the new reflink will be > > +as a newly created file by that user. > > + > > There are plenty of syscalls that require some privilege and fail if the > caller doesn't have it. But I can think of only one syscall that does > *something different* depending on who called it: setuid. > > Please search the web and marvel at the disasters caused by setuid's > magical caller-dependent behavior (the sendmail bug is probably the most > famous [1]). This proposal for reflink is just asking for bugs where an > attacker gets some otherwise privileged program to call reflink but to > somehow lack the privileges (CAP_CHOWN, selinux rights, or whatever) to > copy security attributes, thus exposing a link with the wrong permissions. > > Would it really be that hard to have two syscalls, or a flag, or > whatever, where one of them preserves all security attributes and > *fails* if the caller isn't allowed to do that and the other one makes > the caller own the new link? > > > [1] http://www.cs.berkeley.edu/~daw/papers/setuid-usenix02.pdf Yes, I agree - the selection of whether or not to preserve the security attributes should be an explicit part of the kernel interface. 
Then the application still has the freedom to fall back on the non-preserving form of the call if that is truly what it wants. -- Stephen Smalley National Security Agency ^ permalink raw reply [flat|nested] 151+ messages in thread
* Re: [RFC] The reflink(2) system call v4. 2009-05-14 18:12 ` Stephen Smalley @ 2009-05-14 22:00 ` Joel Becker 2009-05-15 1:20 ` Jamie Lokier 2009-05-15 12:01 ` Stephen Smalley 0 siblings, 2 replies; 151+ messages in thread From: Joel Becker @ 2009-05-14 22:00 UTC (permalink / raw) To: Stephen Smalley Cc: Andy Lutomirski, jim owens, jmorris, ocfs2-devel, viro, mtk.manpages, linux-security-module, linux-fsdevel On Thu, May 14, 2009 at 02:12:45PM -0400, Stephen Smalley wrote: > On Wed, 2009-05-13 at 23:57 -0400, Andy Lutomirski wrote: > > Joel Becker wrote: > > > + > > > +Preserving the security context of the source file obviously requires > > > +the privilege to do so. Callers that do not own the source file and do > > > +not have CAP_CHOWN will get a new reflink with all non-security > > > +attributes preserved; the security context of the new reflink will be > > > +as a newly created file by that user. > > > + > > > > There are plenty of syscalls that require some privilege and fail if the > > caller doesn't have it. But I can think of only one syscall that does > > *something different* depending on who called it: setuid. > > > > Please search the web and marvel at the disasters caused by setuid's > > magical caller-dependent behavior (the sendmail bug is probably the most > > famous [1]). This proposal for reflink is just asking for bugs where an > > attacker gets some otherwise privileged program to call reflink but to > > somehow lack the privileges (CAP_CHOWN, selinux rights, or whatever) to > > copy security attributes, thus exposing a link with the wrong permissions. > > > > Would it really be that hard to have two syscalls, or a flag, or > > whatever, where one of them preserves all security attributes and > > *fails* if the caller isn't allowed to do that and the other one makes > > the caller own the new link? 
> > > > > > [1] http://www.cs.berkeley.edu/~daw/papers/setuid-usenix02.pdf > > Yes, I agree - the selection of whether or not to preserve the security > attributes should be an explicit part of the kernel interface. Then the > application still has the freedom to fall back on the non-preserving > form of the call if that is truly what it wants. Here's my problem. Every single shell script now has to do: ln -r source target [ $? != 0 ] && ln -r --no-perms source target Every single program now has to do: if (reflink(source, target) && errno == EPERM) reflinkat(AT_FDCWD, source, AT_FDCWD, target, 0, REFLINK_NOPERMS); Because the 99% user wants a real snapshot, and doesn't want to have to think about it. The could, of course, code up their own permission checks to see which variant of reflink to call, but it's still useless (to them) boilerplate. Also, if the 'common' user has to use the reflinkat() call? We've lost. Finally, how is this safer? Don't get me wrong, I do respect the concern - that's why I originally went with your proposal of is_owner_or_cap(). But the fact is that if you've hijacked a process with enough privileges, you *can* make the full reflink, and if your hijacked process doesn't but does have read access, you *can* make the NOPERMS reflink. So doing it with the userspace code above is identical to the kernel code, except that every userspace program has to handle it themselves. Joel -- "Vote early and vote often." - Al Capone Joel Becker Principal Software Developer Oracle E-mail: joel.becker@oracle.com Phone: (650) 506-8127 ^ permalink raw reply [flat|nested] 151+ messages in thread
* Re: [RFC] The reflink(2) system call v4. 2009-05-14 22:00 ` Joel Becker @ 2009-05-15 1:20 ` Jamie Lokier 2009-05-15 12:01 ` Stephen Smalley 1 sibling, 0 replies; 151+ messages in thread From: Jamie Lokier @ 2009-05-15 1:20 UTC (permalink / raw) To: Stephen Smalley, Andy Lutomirski, jim owens, jmorris, ocfs2-devel, viro, mtk.manpages Joel Becker wrote: > Here's my problem. Every single shell script now has to do: > > ln -r source target > [ $? != 0 ] && ln -r --no-perms source target No, they'll obviously do ln -Rr source target It is not a burden to type that. (Where -R == your -r --no-perms, and -R -r together means try -R then -r). > Every single program now has to do: > > if (reflink(source, target) && errno == EPERM) > reflinkat(AT_FDCWD, source, AT_FDCWD, target, 0, REFLINK_NOPERMS); Yes if that's what they want. > Because the 99% user wants a real snapshot, A quick poll based on emails in these threads says >50% doesn't want a real snapshot :-) But even at 99%, what about the other 1%? As I've explained, it is _impossible_ for userspace to do "ln -r" thing itself in some conditions given your system call. > and doesn't want to have to think about it. The problem with the "automatic" switch is that it isn't obvious, so people will make mistaken assumptions when using it. If they _want_ the automatic switch, then a few moments of thought doesn't matter. Make it easy if you care: like "ln -Rr" in scripts and a flag REFLINK_PERMS_IF_ALLOWED in the system call. This is especially so with reflink(), because the userspace code if you _didn't_ want the automatic change are tricky to write (and extremely difficult to get right), so authors will either not bother, or do it badly. And test suites for programs using reflink() will pass nicely, yet the code may still be broken because ordinary users can't test the "other user's files" cases. 
> The could, of course, code up their own permission
> checks to see which variant of reflink to call, but it's still useless
> (to them) boilerplate.

Why wouldn't you just do the two calls? It's much easier. But even that goes away with REFLINK_PERMS_IF_ALLOWED (and conversely REFLINK_PERMS_STRICT). (Note it's not just permissions - it's also timestamps, group, xattrs. The flag names could reflect that).

> Also, if the 'common' user has to use the reflinkat() call? We've lost.

Provide a reflink() call in libc. Problem solved. Heck, provide separate reflink() and cowlink() calls in libc if you don't like a flag.

> Finally, how is this safer? Don't get me wrong, I do respect
> the concern - that's why I originally went with your proposal of
> is_owner_or_cap(). But the fact is that if you've hijacked a process
> with enough privileges, you *can* make the full reflink, and if your
> hijacked process doesn't but does have read access, you *can* make the
> NOPERMS reflink.

If you can trick a process into unexpected behaviour, it doesn't mean you can make it do just anything. It means you can trick specific checks and assumptions that the program makes into being wrong, because you made something behave in a way the authors didn't expect. Building on that, sometimes the trick is enough to make a backdoor. Which is why file system calls should behave in a simple way that doesn't surprise anyone.

> So doing it with the userspace code above is identical
> to the kernel code, except that every userspace program has to handle it
> themselves.

No, because not every userspace program _wants_ that behaviour. So you have these problems if it's forced in the kernel:

- Userspace programs that _don't want_ a "full reflink" but have the privilege to do so. Sometimes they can't do the chmod/etc. to fix the attributes after _at all_ (think setgid-directories among other things - it's *hard* to simulate that in userspace and never quite right).

- Sometimes fixing up afterwards would be a security race condition - the temporary unwanted permissions can be looser than the process wants to expose in the new directory.

What I'm seeing is that for the benefit of saving exactly one line in some userspace programs - a line which is quite helpful in showing what the program intends - it will cost about 1000 lines of code (which is still slightly broken) in other userspace programs, and I can think of a number of those programs already. Not pretty.

If you don't like the two calls, just add a flag which means try one then the other. Then it's clear what the app is requesting, and invites authors to decide what behaviour they want, trivially.

-- Jamie
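The flag-based alternative discussed in the message above can be modelled in a few lines of C. This is only a sketch of the decision logic, not kernel code and not any real API: the numeric flag values, the `model_reflink()` function, the `reflink_result` enum, and the `caller_may_preserve` parameter are all hypothetical, since the thread names the flags but never defines them.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical values; the thread only names these flags. */
#define REFLINK_NOPERMS           0x0  /* new file gets fresh attributes */
#define REFLINK_PERMS_STRICT      0x1  /* preserve attributes or fail */
#define REFLINK_PERMS_IF_ALLOWED  0x2  /* preserve if permitted, else fall back */

/* Hypothetical result codes describing what kind of link was made. */
enum reflink_result {
    REFLINK_FRESH_ATTRS = 0,  /* attributes reinitialized as for a new file */
    REFLINK_PRESERVED   = 1,  /* attributes copied from the source */
};

/*
 * Model of the decision only: given the requested flags and whether the
 * caller holds the privilege to preserve attributes, return the kind of
 * link that would be created, or -EPERM if the request must be refused.
 */
static int model_reflink(int flags, bool caller_may_preserve)
{
    if (flags & REFLINK_PERMS_STRICT)
        return caller_may_preserve ? REFLINK_PRESERVED : -EPERM;
    if (flags & REFLINK_PERMS_IF_ALLOWED)
        return caller_may_preserve ? REFLINK_PRESERVED : REFLINK_FRESH_ATTRS;
    return REFLINK_FRESH_ATTRS;
}
```

In this model the caller-dependent outcome exists only under REFLINK_PERMS_IF_ALLOWED, so a program gets it only by asking for it; the strict flag either preserves or fails.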
* Re: [RFC] The reflink(2) system call v4. 2009-05-14 22:00 ` Joel Becker 2009-05-15 1:20 ` Jamie Lokier @ 2009-05-15 12:01 ` Stephen Smalley 2009-05-15 15:22 ` Joel Becker 1 sibling, 1 reply; 151+ messages in thread From: Stephen Smalley @ 2009-05-15 12:01 UTC (permalink / raw) To: Joel Becker Cc: Andy Lutomirski, jim owens, jmorris, ocfs2-devel, viro, mtk.manpages, linux-security-module, linux-fsdevel On Thu, 2009-05-14 at 15:00 -0700, Joel Becker wrote: > On Thu, May 14, 2009 at 02:12:45PM -0400, Stephen Smalley wrote: > > On Wed, 2009-05-13 at 23:57 -0400, Andy Lutomirski wrote: > > > Joel Becker wrote: > > > > + > > > > +Preserving the security context of the source file obviously requires > > > > +the privilege to do so. Callers that do not own the source file and do > > > > +not have CAP_CHOWN will get a new reflink with all non-security > > > > +attributes preserved; the security context of the new reflink will be > > > > +as a newly created file by that user. > > > > + > > > > > > There are plenty of syscalls that require some privilege and fail if the > > > caller doesn't have it. But I can think of only one syscall that does > > > *something different* depending on who called it: setuid. > > > > > > Please search the web and marvel at the disasters caused by setuid's > > > magical caller-dependent behavior (the sendmail bug is probably the most > > > famous [1]). This proposal for reflink is just asking for bugs where an > > > attacker gets some otherwise privileged program to call reflink but to > > > somehow lack the privileges (CAP_CHOWN, selinux rights, or whatever) to > > > copy security attributes, thus exposing a link with the wrong permissions. > > > > > > Would it really be that hard to have two syscalls, or a flag, or > > > whatever, where one of them preserves all security attributes and > > > *fails* if the caller isn't allowed to do that and the other one makes > > > the caller own the new link? 
> > > > > > > > > [1] http://www.cs.berkeley.edu/~daw/papers/setuid-usenix02.pdf > > > > Yes, I agree - the selection of whether or not to preserve the security > > attributes should be an explicit part of the kernel interface. Then the > > application still has the freedom to fall back on the non-preserving > > form of the call if that is truly what it wants. > > Here's my problem. Every single shell script now has to do: > > ln -r source target > [ $? != 0 ] && ln -r --no-perms source target > > Every single program now has to do: > > if (reflink(source, target) && errno == EPERM) > reflinkat(AT_FDCWD, source, AT_FDCWD, target, 0, REFLINK_NOPERMS); > > Because the 99% user wants a real snapshot, and doesn't want to have to > think about it. The could, of course, code up their own permission > checks to see which variant of reflink to call, but it's still useless > (to them) boilerplate. > Also, if the 'common' user has to use the reflinkat() call? > We've lost. I think Jamie covered the fact that you can provide a user interface and library functions that provide the "simpler" interface on top of this interface, but not vice versa. > Finally, how is this safer? Don't get me wrong, I do respect > the concern - that's why I originally went with your proposal of > is_owner_or_cap(). But the fact is that if you've hijacked a process > with enough privileges, you *can* make the full reflink, and if your > hijacked process doesn't but does have read access, you *can* make the > NOPERMS reflink. So doing it with the userspace code above is identical > to the kernel code, except that every userspace program has to handle it > themselves. As Jamie said, we aren't talking about injecting arbitrary code into the process. 
The failure scenario is quite similar to the setuid() one: arrange conditions such that the process lacks sufficient privileges to preserve attributes, and when it calls reflink(2) expecting to preserve the attributes, it will get no indication that they weren't preserved. At which point the data may be unwittingly exposed beyond its original constraints. -- Stephen Smalley National Security Agency ^ permalink raw reply [flat|nested] 151+ messages in thread
* Re: [RFC] The reflink(2) system call v4. 2009-05-15 12:01 ` Stephen Smalley @ 2009-05-15 15:22 ` Joel Becker 2009-05-15 15:55 ` Stephen Smalley 0 siblings, 1 reply; 151+ messages in thread From: Joel Becker @ 2009-05-15 15:22 UTC (permalink / raw) To: Stephen Smalley Cc: Andy Lutomirski, jmorris, linux-fsdevel, linux-security-module, mtk.manpages, jim owens, ocfs2-devel, viro On Fri, May 15, 2009 at 08:01:45AM -0400, Stephen Smalley wrote: > > Finally, how is this safer? Don't get me wrong, I do respect > > the concern - that's why I originally went with your proposal of > > is_owner_or_cap(). But the fact is that if you've hijacked a process > > with enough privileges, you *can* make the full reflink, and if your > > hijacked process doesn't but does have read access, you *can* make the > > NOPERMS reflink. So doing it with the userspace code above is identical > > to the kernel code, except that every userspace program has to handle it > > themselves. > > As Jamie said, we aren't talking about injecting arbitrary code into the > process. The failure scenario is quite similar to the setuid() one: > arrange conditions such that the process lacks sufficient privileges to > preserve attributes, and when it calls reflink(2) expecting to preserve > the attributes, it will get no indication that they weren't preserved. > At which point the data may be unwittingly exposed beyond its original > constraints. I wasn't being specific to injected code. Assume we have a deliberate flag to reflinkat(2). Then we provide reflink(3) in userspace that does the fallback, keeping it out of the kernel. Doesn't that have the exact same problem? Joel -- "Same dancers in the same old shoes. You get too careful with the steps you choose. You don't care about winning but you don't want to lose After the thrill is gone." Joel Becker Principal Software Developer Oracle E-mail: joel.becker@oracle.com Phone: (650) 506-8127 ^ permalink raw reply [flat|nested] 151+ messages in thread
* Re: [RFC] The reflink(2) system call v4. 2009-05-15 15:22 ` Joel Becker @ 2009-05-15 15:55 ` Stephen Smalley 2009-05-15 16:42 ` Joel Becker 0 siblings, 1 reply; 151+ messages in thread From: Stephen Smalley @ 2009-05-15 15:55 UTC (permalink / raw) To: Joel Becker Cc: Andy Lutomirski, jim owens, jmorris, ocfs2-devel, viro, mtk.manpages, linux-security-module, linux-fsdevel On Fri, 2009-05-15 at 08:22 -0700, Joel Becker wrote: > On Fri, May 15, 2009 at 08:01:45AM -0400, Stephen Smalley wrote: > > > Finally, how is this safer? Don't get me wrong, I do respect > > > the concern - that's why I originally went with your proposal of > > > is_owner_or_cap(). But the fact is that if you've hijacked a process > > > with enough privileges, you *can* make the full reflink, and if your > > > hijacked process doesn't but does have read access, you *can* make the > > > NOPERMS reflink. So doing it with the userspace code above is identical > > > to the kernel code, except that every userspace program has to handle it > > > themselves. > > > > As Jamie said, we aren't talking about injecting arbitrary code into the > > process. The failure scenario is quite similar to the setuid() one: > > arrange conditions such that the process lacks sufficient privileges to > > preserve attributes, and when it calls reflink(2) expecting to preserve > > the attributes, it will get no indication that they weren't preserved. > > At which point the data may be unwittingly exposed beyond its original > > constraints. > > I wasn't being specific to injected code. Assume we have a > deliberate flag to reflinkat(2). Then we provide reflink(3) in > userspace that does the fallback, keeping it out of the kernel. Doesn't > that have the exact same problem? You wouldn't always do the fallback in reflink(3), but instead provide a helper interface that would perform the fallback for applications that want that behavior. Consider a program that wants to always preserve attributes on the reflinks it creates. 
If the interface allows the program to explicitly request that behavior and returns an error when the request cannot be honored, then the program knows that upon a successful return, the attributes were in fact preserved. If the interface instead silently selects a behavior based on the current privileges of the process and gives no indication to the caller as to what behavior was selected, then the opportunity for error is great. -- Stephen Smalley National Security Agency ^ permalink raw reply [flat|nested] 151+ messages in thread
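A helper of the kind described above might look roughly like the following C sketch. Nothing here is a real interface: `link_op`, `reflink_with_fallback()`, and the three `stub_*` functions are hypothetical stand-ins, with the stubs replacing the strict and non-preserving kernel calls so that the control flow can be exercised without a filesystem. The assumed convention is 0 on success and a negative errno value on failure.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical signature shared by both primitives: 0 or negative errno. */
typedef int (*link_op)(const char *src, const char *dst);

/*
 * Opt-in fallback helper. Tries the strict (attribute-preserving) operation
 * first and falls back to the non-preserving one only on -EPERM. The caller
 * always learns which behaviour it got through *preserved.
 */
static int reflink_with_fallback(link_op strict, link_op no_preserve,
                                 const char *src, const char *dst,
                                 bool *preserved)
{
    int err = strict(src, dst);

    *preserved = (err == 0);
    if (err != -EPERM)
        return err;               /* success, or an unrelated failure */
    return no_preserve(src, dst); /* explicit, caller-requested fallback */
}

/* Stand-ins for the real operations, for illustration only. */
static int stub_strict_ok(const char *src, const char *dst)
{
    (void)src; (void)dst;
    return 0;
}

static int stub_strict_denied(const char *src, const char *dst)
{
    (void)src; (void)dst;
    return -EPERM;
}

static int stub_no_preserve_ok(const char *src, const char *dst)
{
    (void)src; (void)dst;
    return 0;
}
```

A program that must preserve attributes would call the strict primitive directly and treat -EPERM as an error; only programs that want best-effort behaviour would go through the helper, and they can still warn the user when `*preserved` comes back false.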
* Re: [RFC] The reflink(2) system call v4. 2009-05-15 15:55 ` Stephen Smalley @ 2009-05-15 16:42 ` Joel Becker 2009-05-15 17:01 ` Shaya Potter 2009-05-15 20:53 ` [Ocfs2-devel] " Joel Becker 0 siblings, 2 replies; 151+ messages in thread From: Joel Becker @ 2009-05-15 16:42 UTC (permalink / raw) To: Stephen Smalley Cc: Andy Lutomirski, jim owens, jmorris, ocfs2-devel, viro, mtk.manpages, linux-security-module, linux-fsdevel On Fri, May 15, 2009 at 11:55:25AM -0400, Stephen Smalley wrote: > > I wasn't being specific to injected code. Assume we have a > > deliberate flag to reflinkat(2). Then we provide reflink(3) in > > userspace that does the fallback, keeping it out of the kernel. Doesn't > > that have the exact same problem? > > You wouldn't always do the fallback in reflink(3), but instead provide a > helper interface that would perform the fallback for applications that > want that behavior. But isn't that reflink(3)? And the application that wants to know uses reflinkat(2)? > > Consider a program that wants to always preserve attributes on the > reflinks it creates. If the interface allows the program to explicitly > request that behavior and returns an error when the request cannot be > honored, then the program knows that upon a successful return, the > attributes were in fact preserved. If the interface instead silently > selects a behavior based on the current privileges of the process and > gives no indication to the caller as to what behavior was selected, then > the opportunity for error is great. I get that. I'm looking at what the programming interface is. What's the standard function for "I want the fallback behavior" called? What's the standard function for "I want preserve security" called? "int reflink(oldpath, newpath)" has to pick one of the behaviors. Which is it? 
Joel -- Life's Little Instruction Book #69 "Whistle" Joel Becker Principal Software Developer Oracle E-mail: joel.becker@oracle.com Phone: (650) 506-8127 ^ permalink raw reply [flat|nested] 151+ messages in thread
* Re: [RFC] The reflink(2) system call v4. 2009-05-15 16:42 ` Joel Becker @ 2009-05-15 17:01 ` Shaya Potter 2009-05-15 20:53 ` [Ocfs2-devel] " Joel Becker 1 sibling, 0 replies; 151+ messages in thread From: Shaya Potter @ 2009-05-15 17:01 UTC (permalink / raw) To: Stephen Smalley, Andy Lutomirski, jim owens, jmorris, ocfs2-devel, viro, mtk.manpages Joel Becker wrote: > On Fri, May 15, 2009 at 11:55:25AM -0400, Stephen Smalley wrote: >>> I wasn't being specific to injected code. Assume we have a >>> deliberate flag to reflinkat(2). Then we provide reflink(3) in >>> userspace that does the fallback, keeping it out of the kernel. Doesn't >>> that have the exact same problem? >> You wouldn't always do the fallback in reflink(3), but instead provide a >> helper interface that would perform the fallback for applications that >> want that behavior. > > But isn't that reflink(3)? And the application that wants to > know uses reflinkat(2)? >> Consider a program that wants to always preserve attributes on the >> reflinks it creates. If the interface allows the program to explicitly >> request that behavior and returns an error when the request cannot be >> honored, then the program knows that upon a successful return, the >> attributes were in fact preserved. If the interface instead silently >> selects a behavior based on the current privileges of the process and >> gives no indication to the caller as to what behavior was selected, then >> the opportunity for error is great. > > I get that. I'm looking at what the programming interface is. > What's the standard function for "I want the fallback behavior" called? > What's the standard function for "I want preserve security" called? > "int reflink(oldpath, newpath)" has to pick one of the behaviors. Which > is it? whenever there's hidden fallback behavior that changes the security semantics you will cause programming error. 
the only correct way for an application that wants the fallback functionality to code it is

    if (initial_behavior()) {
        if (fallback_behavior()) {
            /* some sort of error */
        }
    }

as that way the application knows what occurred.

if that logic is wrapped in a single function, you would have to do something like

    ret = initial_and_fallback();
    if (ret == 0) {
        fallback = 0;
    } else if (ret == 1) {
        fallback = 1;
    } else {
        /* some sort of error */
    }

which is much more prone to error.

at the end of the day, a single function that has hidden fallback behavior does not really save lines of code in a well written application. it does however make it easier to write a poorly written application that can cause security problems.
* Re: [Ocfs2-devel] [RFC] The reflink(2) system call v4.
  2009-05-15 16:42 ` Joel Becker
  2009-05-15 17:01 ` Shaya Potter
@ 2009-05-15 20:53 ` Joel Becker
  2009-05-18 9:17 ` Jörn Engel
  ` (3 more replies)
  1 sibling, 4 replies; 151+ messages in thread
From: Joel Becker @ 2009-05-15 20:53 UTC (permalink / raw)
  To: Stephen Smalley, Andy Lutomirski, jim owens, jmorris, ocfs2-devel, viro, mtk.manpages

On Fri, May 15, 2009 at 09:42:09AM -0700, Joel Becker wrote:
> On Fri, May 15, 2009 at 11:55:25AM -0400, Stephen Smalley wrote:
> > Consider a program that wants to always preserve attributes on the
> > reflinks it creates. If the interface allows the program to explicitly
> > request that behavior and returns an error when the request cannot be
> > honored, then the program knows that upon a successful return, the
> > attributes were in fact preserved. If the interface instead silently
> > selects a behavior based on the current privileges of the process and
> > gives no indication to the caller as to what behavior was selected, then
> > the opportunity for error is great.
>
> I get that. I'm looking at what the programming interface is.
> What's the standard function for "I want the fallback behavior" called?
> What's the standard function for "I want preserve security" called?
> "int reflink(oldpath, newpath)" has to pick one of the behaviors. Which
> is it?

        Ok, I've been casting about how to solve the concern and provide
a decent interface. I'm not about to give up on either. I think,
though, that we do have to let the application signal its intent to the
system. And if we're doing that, let's add a little flexibility.
        I think the interface will be this (ignoring the reflinkat(2)
bit for now):

int reflink(const char *oldpath, const char *newpath, int preserve);

- Data and xattrs are reflinked always.
- 'preserve' is a bitfield describing which attributes to keep across the
  reflink:
  * REFLINK_ATTR_OWNER - Keeps uid/gid the same. Requires ownership or
    CAP_CHOWN.
  * REFLINK_ATTR_SECURITY - Keeps the security state (SELinux/SMACK/etc)
    the same. This requires REFLINK_ATTR_OWNER (the security state makes
    no sense if the ownership changes). If not set, the filesystem wipes
    all security.* xattrs and reinitializes with
    security_inode_init_security() just like a new file.
  * REFLINK_ATTR_MODE - Keeps the mode bits the same. Requires ownership
    or CAP_FOWNER.
  * REFLINK_ATTR_ACL - Keeps the ACLs the same. Requires
    REFLINK_ATTR_MODE, as ACLs have to get adjusted when the mode
    changes, and so you can't keep them the same if the mode wasn't
    preserved. If not set, the filesystem reinits the ACLs as for a new
    file.
- REFLINK_ATTR_NONE is 0 and REFLINK_ATTR_ALL is ~0.

        That's all the relevant attributes. The timestamps behave as
already described (ctime is now, mtime matches the source), which is the
only sane behavior for this sort of thing.
        So, a copy program would reflink(source, target,
REFLINK_ATTR_NONE), a snapshot program would reflink(source, target,
REFLINK_ATTR_ALL), and someone wanting the fallback behavior can do it
easily.
        In the kernel, security_inode_reflink() gets passed the preserve
bits. It's responsible for determining whether REFLINK_ATTR_SECURITY is
allowed (vfs_reflink() will already have asserted REFLINK_ATTR_OWNER).
It may do other checks on the reflink and the preserve bits; that's up
to the LSM.
        For scripting, we add '-p' and '-P' to "ln -r":

- ln -r == reflink(source, target, REFLINK_ATTR_NONE);
- ln -r -P == reflink(source, target, REFLINK_ATTR_ALL);
- ln -r -p == the fallback behavior. This is like cp(1), where "cp -p"
  is best-effort.

        Does this make everyone happy?

Joel

--

"In the beginning, the universe was created. This has made a lot
 of people very angry, and is generally considered to have been a
 bad move."
        - Douglas Adams

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127

^ permalink raw reply [flat|nested] 151+ messages in thread
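The dependencies between the proposed preserve bits (SECURITY needs OWNER, ACL needs MODE) can be checked with a few lines of C. This is a minimal sketch, not code from the patch: the bit positions are assumptions, since the proposal only fixes REFLINK_ATTR_NONE as 0 and REFLINK_ATTR_ALL as ~0, and `reflink_check_preserve` is a hypothetical helper name.

```c
#include <errno.h>

/* Bit values are illustrative assumptions; the proposal names the flags
 * but assigns values only to NONE (0) and ALL (~0). */
#define REFLINK_ATTR_NONE      0
#define REFLINK_ATTR_OWNER     (1 << 0)
#define REFLINK_ATTR_SECURITY  (1 << 1)
#define REFLINK_ATTR_MODE      (1 << 2)
#define REFLINK_ATTR_ACL       (1 << 3)
#define REFLINK_ATTR_ALL       (~0)

/* Hypothetical helper: returns 0 if the mask honours the stated
 * dependencies between attributes, -EINVAL otherwise. */
static int reflink_check_preserve(int preserve)
{
    /* The security state makes no sense if the ownership changes. */
    if ((preserve & REFLINK_ATTR_SECURITY) && !(preserve & REFLINK_ATTR_OWNER))
        return -EINVAL;

    /* ACLs cannot stay the same unless the mode is preserved too. */
    if ((preserve & REFLINK_ATTR_ACL) && !(preserve & REFLINK_ATTR_MODE))
        return -EINVAL;

    return 0;
}
```

Both REFLINK_ATTR_NONE and REFLINK_ATTR_ALL pass this check trivially; only partial masks can violate a dependency. The privilege checks (ownership, CAP_CHOWN, CAP_FOWNER) are a separate step and are not modelled here.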
* Re: [Ocfs2-devel] [RFC] The reflink(2) system call v4.
  2009-05-15 20:53 ` [Ocfs2-devel] " Joel Becker
@ 2009-05-18 9:17 ` Jörn Engel
  2009-05-18 13:02 ` Stephen Smalley
  ` (2 subsequent siblings)
  3 siblings, 0 replies; 151+ messages in thread
From: Jörn Engel @ 2009-05-18 9:17 UTC (permalink / raw)
  To: Joel Becker
  Cc: Stephen Smalley, Andy Lutomirski, jim owens, jmorris, ocfs2-devel, viro, mtk.manpages, linux-security-module, linux-fsdevel

On Fri, 15 May 2009 13:53:35 -0700, Joel Becker wrote:
>
> Does this make everyone happy?

Provided the only fallback is to return an error code and let userspace
decide what to do, I'm a happy camper.

Not sure how many of the REFLINK_ATTR_* flags will actually be used,
apart from ALL and NONE. But I don't mind having them.

Jörn

--
People will accept your ideas much more readily if you tell them
that Benjamin Franklin said it first.
-- unknown
--
To unsubscribe from this list: send the line "unsubscribe linux-security-module" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

^ permalink raw reply [flat|nested] 151+ messages in thread
* Re: [Ocfs2-devel] [RFC] The reflink(2) system call v4. 2009-05-15 20:53 ` [Ocfs2-devel] " Joel Becker 2009-05-18 9:17 ` Jörn Engel @ 2009-05-18 13:02 ` Stephen Smalley 2009-05-18 14:33 ` Stephen Smalley 2009-05-18 18:26 ` Joel Becker 2009-05-19 19:33 ` Jonathan Corbet [not found] ` <20090519132057.419b9de0@bike.lwn.net> 3 siblings, 2 replies; 151+ messages in thread From: Stephen Smalley @ 2009-05-18 13:02 UTC (permalink / raw) To: Joel Becker Cc: Andy Lutomirski, jim owens, jmorris, ocfs2-devel, viro, mtk.manpages, linux-security-module, linux-fsdevel On Fri, 2009-05-15 at 13:53 -0700, Joel Becker wrote: > On Fri, May 15, 2009 at 09:42:09AM -0700, Joel Becker wrote: > > On Fri, May 15, 2009 at 11:55:25AM -0400, Stephen Smalley wrote: > > > Consider a program that wants to always preserve attributes on the > > > reflinks it creates. If the interface allows the program to explicitly > > > request that behavior and returns an error when the request cannot be > > > honored, then the program knows that upon a successful return, the > > > attributes were in fact preserved. If the interface instead silently > > > selects a behavior based on the current privileges of the process and > > > gives no indication to the caller as to what behavior was selected, then > > > the opportunity for error is great. > > > > I get that. I'm looking at what the programming interface is. > > What's the standard function for "I want the fallback behavior" called? > > What's the standard function for "I want preserve security" called? > > "int reflink(oldpath, newpath)" has to pick one of the behaviors. Which > > is it? > > Ok, I've been casting about how to solve the concern and provide > a decent interface. I'm not about to give up on either. I think, > though, that we do have to let the application signal its intent to the > system. And if we're doing that, let's add a little flexibility. 
> I think the interface will be this (ignoring the reflinkat(2) > bit for now): > > int reflink(const char *oldpath, const char *newpath, int preserve); > > - Data and xattrs are reflinked always. > - 'preserve is a bitfield describing which attributes to keep across the > reflink: > * REFLINK_ATTR_OWNER - Keeps uid/gid the same. Requires ownership or > CAP_CHOWN. > * REFLINK_ATTR_SECURITY - Keeps the security state (SELinux/SMACK/etc) > the same. This requires REFLINK_ATTR_OWNER (the security state makes > no sense if the ownership changes). If not set, the filesystem wipes > all security.* xattrs and reinitializes with > security_inode_init_security() just like a new file. > * REFLINK_ATTR_MODE - Keeps the mode bits the same. Requires ownership > or CAP_FOWNER. > * REFLINK_ATTR_ACL - Keeps the ACLs the same. Requires > REFLINK_ATTR_MODE, as ACLs have to get adjusted when the mode > changes, and so you can't keep them the same if the mode wasn't > preserved. If not set, the filesystem reinits the ACLs as for a new > file. > - REFLINK_ATTR_NONE is 0 and REFLINK_ATTR_ALL is ~0. > > That's all the relevant attributes. The timestamps behave as > already described (ctime is now, mtime matches the source), which is the > only sane behavior for this sort of thing. > So, a copy program would reflink(source, target, > REFLINK_ATTR_NONE), a snapshot program would reflink(source, target, > REFLINK_ATTR_ALL), and someone wanting the fallback behavior can do it > easily. > In the kernel, security_inode_reflink() gets passed the preserve > bits. It's responsible for determining whether REFLINK_ATTR_SECURITY is > allowed (vfs_reflink() will already have asserted REFLINK_ATTR_OWNER). > It may do other checks on the reflink and the preserve bits, that's up > to the LSM. 
> For scripting, we add '-p' and '-P' to "ln -r":
>
> - ln -r == reflink(source, target, REFLINK_ATTR_NONE);
> - ln -r -P == reflink(source, target, REFLINK_ATTR_ALL);
> - ln -r -p == the fallback behavior. This is like cp(1), where "cp -p"
>   is best-effort.
>
> Does this make everyone happy?

For simplicity and robustness, I would only support the none or all
flags, i.e. preserve can be a simple bool. I don't think you really
want to deal with the individual flags, and I don't see a use case for
them.

--
Stephen Smalley
National Security Agency

^ permalink raw reply [flat|nested] 151+ messages in thread
* Re: [Ocfs2-devel] [RFC] The reflink(2) system call v4. 2009-05-18 13:02 ` Stephen Smalley @ 2009-05-18 14:33 ` Stephen Smalley 2009-05-18 17:15 ` Stephen Smalley 2009-05-18 18:26 ` Joel Becker 1 sibling, 1 reply; 151+ messages in thread From: Stephen Smalley @ 2009-05-18 14:33 UTC (permalink / raw) To: Joel Becker Cc: Andy Lutomirski, jim owens, jmorris, ocfs2-devel, viro, mtk.manpages, linux-security-module, linux-fsdevel On Mon, 2009-05-18 at 09:02 -0400, Stephen Smalley wrote: > On Fri, 2009-05-15 at 13:53 -0700, Joel Becker wrote: > > On Fri, May 15, 2009 at 09:42:09AM -0700, Joel Becker wrote: > > > On Fri, May 15, 2009 at 11:55:25AM -0400, Stephen Smalley wrote: > > > > Consider a program that wants to always preserve attributes on the > > > > reflinks it creates. If the interface allows the program to explicitly > > > > request that behavior and returns an error when the request cannot be > > > > honored, then the program knows that upon a successful return, the > > > > attributes were in fact preserved. If the interface instead silently > > > > selects a behavior based on the current privileges of the process and > > > > gives no indication to the caller as to what behavior was selected, then > > > > the opportunity for error is great. > > > > > > I get that. I'm looking at what the programming interface is. > > > What's the standard function for "I want the fallback behavior" called? > > > What's the standard function for "I want preserve security" called? > > > "int reflink(oldpath, newpath)" has to pick one of the behaviors. Which > > > is it? > > > > Ok, I've been casting about how to solve the concern and provide > > a decent interface. I'm not about to give up on either. I think, > > though, that we do have to let the application signal its intent to the > > system. And if we're doing that, let's add a little flexibility. 
> > I think the interface will be this (ignoring the reflinkat(2) > > bit for now): > > > > int reflink(const char *oldpath, const char *newpath, int preserve); > > > > - Data and xattrs are reflinked always. > > - 'preserve is a bitfield describing which attributes to keep across the > > reflink: > > * REFLINK_ATTR_OWNER - Keeps uid/gid the same. Requires ownership or > > CAP_CHOWN. > > * REFLINK_ATTR_SECURITY - Keeps the security state (SELinux/SMACK/etc) > > the same. This requires REFLINK_ATTR_OWNER (the security state makes > > no sense if the ownership changes). If not set, the filesystem wipes > > all security.* xattrs and reinitializes with > > security_inode_init_security() just like a new file. > > * REFLINK_ATTR_MODE - Keeps the mode bits the same. Requires ownership > > or CAP_FOWNER. > > * REFLINK_ATTR_ACL - Keeps the ACLs the same. Requires > > REFLINK_ATTR_MODE, as ACLs have to get adjusted when the mode > > changes, and so you can't keep them the same if the mode wasn't > > preserved. If not set, the filesystem reinits the ACLs as for a new > > file. > > - REFLINK_ATTR_NONE is 0 and REFLINK_ATTR_ALL is ~0. > > > > That's all the relevant attributes. The timestamps behave as > > already described (ctime is now, mtime matches the source), which is the > > only sane behavior for this sort of thing. > > So, a copy program would reflink(source, target, > > REFLINK_ATTR_NONE), a snapshot program would reflink(source, target, > > REFLINK_ATTR_ALL), and someone wanting the fallback behavior can do it > > easily. > > In the kernel, security_inode_reflink() gets passed the preserve > > bits. It's responsible for determining whether REFLINK_ATTR_SECURITY is > > allowed (vfs_reflink() will already have asserted REFLINK_ATTR_OWNER). > > It may do other checks on the reflink and the preserve bits, that's up > > to the LSM. 
> > For scripting, we add '-p' and '-P' to "ln -r":
> >
> > - ln -r == reflink(source, target, REFLINK_ATTR_NONE);
> > - ln -r -P == reflink(source, target, REFLINK_ATTR_ALL);
> > - ln -r -p == the fallback behavior. This is like cp(1), where "cp -p"
> >   is best-effort.
> >
> > Does this make everyone happy?
>
> For simplicity and robustness, I would only support the none or all
> flags, i.e. preserve can be a simple bool. I don't think you really
> want to deal with the individual flags, and I don't see a use case for
> them.

Or possibly only distinguish preserve-dac from preserve-mac, e.g.
REFLINK_ATTR_NONE (preserve none),
REFLINK_ATTR_DAC (preserve uid, gid, mode, and ACLs ala cp -p)
REFLINK_ATTR_MAC (preserve MAC security label ala cp -c)
REFLINK_ATTR_ALL (preserve all)

--
Stephen Smalley
National Security Agency

^ permalink raw reply [flat|nested] 151+ messages in thread
* Re: [Ocfs2-devel] [RFC] The reflink(2) system call v4. 2009-05-18 14:33 ` Stephen Smalley @ 2009-05-18 17:15 ` Stephen Smalley 0 siblings, 0 replies; 151+ messages in thread From: Stephen Smalley @ 2009-05-18 17:15 UTC (permalink / raw) To: Joel Becker Cc: Andy Lutomirski, jim owens, jmorris, ocfs2-devel, viro, mtk.manpages, linux-security-module, linux-fsdevel On Mon, 2009-05-18 at 10:33 -0400, Stephen Smalley wrote: > On Mon, 2009-05-18 at 09:02 -0400, Stephen Smalley wrote: > > On Fri, 2009-05-15 at 13:53 -0700, Joel Becker wrote: > > > On Fri, May 15, 2009 at 09:42:09AM -0700, Joel Becker wrote: > > > > On Fri, May 15, 2009 at 11:55:25AM -0400, Stephen Smalley wrote: > > > > > Consider a program that wants to always preserve attributes on the > > > > > reflinks it creates. If the interface allows the program to explicitly > > > > > request that behavior and returns an error when the request cannot be > > > > > honored, then the program knows that upon a successful return, the > > > > > attributes were in fact preserved. If the interface instead silently > > > > > selects a behavior based on the current privileges of the process and > > > > > gives no indication to the caller as to what behavior was selected, then > > > > > the opportunity for error is great. > > > > > > > > I get that. I'm looking at what the programming interface is. > > > > What's the standard function for "I want the fallback behavior" called? > > > > What's the standard function for "I want preserve security" called? > > > > "int reflink(oldpath, newpath)" has to pick one of the behaviors. Which > > > > is it? > > > > > > Ok, I've been casting about how to solve the concern and provide > > > a decent interface. I'm not about to give up on either. I think, > > > though, that we do have to let the application signal its intent to the > > > system. And if we're doing that, let's add a little flexibility. 
> > > I think the interface will be this (ignoring the reflinkat(2) > > > bit for now): > > > > > > int reflink(const char *oldpath, const char *newpath, int preserve); > > > > > > - Data and xattrs are reflinked always. > > > - 'preserve is a bitfield describing which attributes to keep across the > > > reflink: > > > * REFLINK_ATTR_OWNER - Keeps uid/gid the same. Requires ownership or > > > CAP_CHOWN. > > > * REFLINK_ATTR_SECURITY - Keeps the security state (SELinux/SMACK/etc) > > > the same. This requires REFLINK_ATTR_OWNER (the security state makes > > > no sense if the ownership changes). If not set, the filesystem wipes > > > all security.* xattrs and reinitializes with > > > security_inode_init_security() just like a new file. > > > * REFLINK_ATTR_MODE - Keeps the mode bits the same. Requires ownership > > > or CAP_FOWNER. > > > * REFLINK_ATTR_ACL - Keeps the ACLs the same. Requires > > > REFLINK_ATTR_MODE, as ACLs have to get adjusted when the mode > > > changes, and so you can't keep them the same if the mode wasn't > > > preserved. If not set, the filesystem reinits the ACLs as for a new > > > file. > > > - REFLINK_ATTR_NONE is 0 and REFLINK_ATTR_ALL is ~0. > > > > > > That's all the relevant attributes. The timestamps behave as > > > already described (ctime is now, mtime matches the source), which is the > > > only sane behavior for this sort of thing. > > > So, a copy program would reflink(source, target, > > > REFLINK_ATTR_NONE), a snapshot program would reflink(source, target, > > > REFLINK_ATTR_ALL), and someone wanting the fallback behavior can do it > > > easily. > > > In the kernel, security_inode_reflink() gets passed the preserve > > > bits. It's responsible for determining whether REFLINK_ATTR_SECURITY is > > > allowed (vfs_reflink() will already have asserted REFLINK_ATTR_OWNER). > > > It may do other checks on the reflink and the preserve bits, that's up > > > to the LSM. 
> > > For scripting, we add '-p' and '-P' to "ln -r":
> > >
> > > - ln -r == reflink(source, target, REFLINK_ATTR_NONE);
> > > - ln -r -P == reflink(source, target, REFLINK_ATTR_ALL);
> > > - ln -r -p == the fallback behavior. This is like cp(1), where "cp -p"
> > >   is best-effort.
> > >
> > > Does this make everyone happy?
> >
> > For simplicity and robustness, I would only support the none or all
> > flags, i.e. preserve can be a simple bool. I don't think you really
> > want to deal with the individual flags, and I don't see a use case for
> > them.
>
> Or possibly only distinguish preserve-dac from preserve-mac, e.g.
> REFLINK_ATTR_NONE (preserve none),
> REFLINK_ATTR_DAC (preserve uid, gid, mode, and ACLs ala cp -p)
> REFLINK_ATTR_MAC (preserve MAC security label ala cp -c)
> REFLINK_ATTR_ALL (preserve all)

Even this distinction doesn't seem worthwhile and could get complicated,
e.g. security.capability is an alternative to using the setuid mode bit,
and thus logically would fall into the same class as the owner and mode.
I'd just limit reflink() to preserving none or all of the security
attributes.

--
Stephen Smalley
National Security Agency

^ permalink raw reply [flat|nested] 151+ messages in thread
* Re: [RFC] The reflink(2) system call v4.
  2009-05-18 13:02 ` Stephen Smalley
  2009-05-18 14:33 ` Stephen Smalley
@ 2009-05-18 18:26 ` Joel Becker
  2009-05-19 16:32 ` [Ocfs2-devel] " Sage Weil
  1 sibling, 1 reply; 151+ messages in thread
From: Joel Becker @ 2009-05-18 18:26 UTC (permalink / raw)
  To: Stephen Smalley
  Cc: Andy Lutomirski, jmorris, linux-fsdevel, linux-security-module, mtk.manpages, jim owens, ocfs2-devel, viro

On Mon, May 18, 2009 at 09:02:39AM -0400, Stephen Smalley wrote:
> For simplicity and robustness, I would only support the none or all
> flags, i.e. preserve can be a simple bool. I don't think you really
> want to deal with the individual flags, and I don't see a use case for
> them.

        The simple use case I can think of is "I want a snapshot, but I
don't have rights to copy the MAC context". Or "I want to own it, but I
want to keep all the ACLs for other users".
        Basically, if I'm adding another int argument to reflinkat(2), I
wanted to consider the future. Maybe define it as 1 or 0, and leave the
use of the other bits for future possibilities? If we're lucky, of
course, we never need future changes.

Joel

--

"There is a country in Europe where multiple-choice tests are
 illegal."
        - Sigfried Hulzer

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127

^ permalink raw reply [flat|nested] 151+ messages in thread
* Re: [Ocfs2-devel] [RFC] The reflink(2) system call v4.
  2009-05-18 18:26 ` Joel Becker
@ 2009-05-19 16:32 ` Sage Weil
  0 siblings, 0 replies; 151+ messages in thread
From: Sage Weil @ 2009-05-19 16:32 UTC (permalink / raw)
  To: Joel Becker
  Cc: Stephen Smalley, Andy Lutomirski, jim owens, jmorris, ocfs2-devel, viro, mtk.manpages, linux-security-module, linux-fsdevel

Hi Joel,

This version (with whatever flag simplifications are deemed appropriate)
looks pretty good to me!

The only other thing I would like to see is a flag that makes copying
the xattrs optional. That's straying toward kitchen sink territory, but
it seems like a natural enough interface once you're cherry-picking what
to preserve in the reflink. (Since you can always remove unwanted xattrs
later, of course, it's certainly not a show-stopper.)

Thanks!
sage

^ permalink raw reply [flat|nested] 151+ messages in thread
* Re: [Ocfs2-devel] [RFC] The reflink(2) system call v4.
  2009-05-15 20:53 ` [Ocfs2-devel] " Joel Becker
  2009-05-18 9:17 ` Jörn Engel
  2009-05-18 13:02 ` Stephen Smalley
@ 2009-05-19 19:33 ` Jonathan Corbet
  2009-05-19 20:15 ` Jamie Lokier
  [not found] ` <20090519132057.419b9de0@bike.lwn.net>
  3 siblings, 1 reply; 151+ messages in thread
From: Jonathan Corbet @ 2009-05-19 19:33 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-security-module

One tiny little thing that crossed my mind as I was looking at this...

> - REFLINK_ATTR_NONE is 0 and REFLINK_ATTR_ALL is ~0.

That, I think, could lead to unexpected results if different flags
(perhaps controlling different aspects of behavior altogether) are
added in the future. Might it make more sense for REFLINK_ATTR_ALL to
be something like 0xffff, with the current implementation insisting
that all other bits are zero? That would leave room for expansion of
the set of things covered by the "preserve all" semantics while,
simultaneously, allowing the addition of different types of flags
entirely.

jon

^ permalink raw reply [flat|nested] 151+ messages in thread
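The suggestion above, a fixed-width REFLINK_ATTR_ALL with every bit outside it required to be zero, amounts to a one-line mask test. The following is a sketch of that idea only; `reflink_check_mask` is a hypothetical name, and 0xffff is the example width from the message, not a value from any patch.

```c
#include <errno.h>

/* Assumed layout following the suggestion: the low 16 bits are reserved
 * for "preserve" attributes, including ones not yet defined. */
#define REFLINK_ATTR_ALL 0xffffu

/* Hypothetical check: reject any bit outside the reserved range so that
 * those bits stay available for unrelated flags later. */
static int reflink_check_mask(unsigned int preserve)
{
    if (preserve & ~REFLINK_ATTR_ALL)
        return -EINVAL;
    return 0;
}
```

Because unknown bits are rejected today, a later kernel could give them a meaning without changing the behaviour of any program that currently succeeds.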
* Re: [Ocfs2-devel] [RFC] The reflink(2) system call v4.
  2009-05-19 19:33 ` Jonathan Corbet
@ 2009-05-19 20:15 ` Jamie Lokier
  0 siblings, 0 replies; 151+ messages in thread
From: Jamie Lokier @ 2009-05-19 20:15 UTC (permalink / raw)
  To: Jonathan Corbet; +Cc: linux-fsdevel, linux-security-module

Jonathan Corbet wrote:
> One tiny little thing that crossed my mind as I was looking at this...
>
> > - REFLINK_ATTR_NONE is 0 and REFLINK_ATTR_ALL is ~0.
>
> That, I think, could lead to unexpected results if different flags
> (perhaps controlling different aspects of behavior altogether) are
> added in the future. Might it make more sense for REFLINK_ATTR_ALL to
> be something like 0xffff, with the current implementation insisting
> that all other bits are zero? That would leave room for expansion of
> the set of things covered by the "preserve all" semantics while,
> simultaneously, allowing the addition of different types of flags
> entirely.

I think it's far better if REFLINK_ATTR_ALL is simply its own 1-bit
flag, meaning exactly what you think it means: In the kernel, it sets
all the attribute flags.

It's possible to choose a bit-mask now, but there's no particular
reason that 16 bits is the right size, and it's ugly if it turns out
you need a hack for a backward-compatible 17th attribute sometime.
(It can be done, it's just ugly).

(I'd also add REFLINK_ATTR_ATOMIC, because you might want the
attributes copied but don't care about atomicity, and some filesystems
might be able to do one without the other. I'm thinking of SMB/CIFS here.)

By the way, there is work going on towards a "selective stat()" call,
which takes a set of bits for which attributes are to be returned. Is
it worth converging on some common flags to select attributes?

-- Jamie

^ permalink raw reply [flat|nested] 151+ messages in thread
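The alternative described above, REFLINK_ATTR_ALL as a single request bit that the kernel expands into whatever attribute bits it currently knows, might look like the sketch below. All bit positions, the `REFLINK_ATTR_KNOWN` macro, and `reflink_expand_preserve` are assumptions made for illustration; none of them come from the thread or the patch.

```c
/* Bit positions are assumptions for illustration only. */
#define REFLINK_ATTR_OWNER     (1u << 0)
#define REFLINK_ATTR_SECURITY  (1u << 1)
#define REFLINK_ATTR_MODE      (1u << 2)
#define REFLINK_ATTR_ACL       (1u << 3)
#define REFLINK_ATTR_ALL       (1u << 31)   /* a request, not a mask */

/* Hypothetical: every attribute this particular kernel can preserve.
 * A newer kernel would extend this without changing REFLINK_ATTR_ALL. */
#define REFLINK_ATTR_KNOWN \
    (REFLINK_ATTR_OWNER | REFLINK_ATTR_SECURITY | \
     REFLINK_ATTR_MODE | REFLINK_ATTR_ACL)

/* Expand the caller's request into concrete attribute bits. */
static unsigned int reflink_expand_preserve(unsigned int preserve)
{
    if (preserve & REFLINK_ATTR_ALL)
        return REFLINK_ATTR_KNOWN;
    return preserve;
}
```

With this shape an old binary passing REFLINK_ATTR_ALL automatically picks up attributes added later, which is the property the 1-bit design is meant to provide without fixing a mask width in advance.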
* Re: [Ocfs2-devel] [RFC] The reflink(2) system call v4.
  [not found] ` <20090519193244.GB25521@mail.oracle.com>
@ 2009-05-19 19:41 ` Jonathan Corbet
  0 siblings, 0 replies; 151+ messages in thread
From: Jonathan Corbet @ 2009-05-19 19:41 UTC (permalink / raw)
  To: Joel Becker
  Cc: Stephen Smalley, Andy Lutomirski, jim owens, jmorris, ocfs2-devel, viro, mtk.manpages, linux-security-module, linux-fsdevel

On Tue, 19 May 2009 12:32:44 -0700
Joel Becker <Joel.Becker@oracle.com> wrote:

> I considered that, but really a process specifying
> REFLINK_ATTR_ALL wants a complete snapshot. So if we add things to our
> inodes later, and then you have an old program asking for "a complete
> snapshot", it won't get it. It'll get a partial snapshot, missing the
> things we added later.
> Conversely, a newer program that knows about the new things will
> get an error on an older kernel when it asks for the complete snapshot.

Yep, that's why I'd suggested carving out a set of bits rather larger
than the ones specified now. That would allow any future flags to be
included in the REFLINK_ATTR_ALL "space" if that seemed like the right
thing to do. It would be forward and backward compatible. Anything
added outside that bit range would, presumably, be a more significant
change which should not carry forward or backward automatically.

> You'll note I called this 'preserve', not 'flags'. It's not a
> set of behavioral flags, it's a mask of attributes to preserve.

Understood, but that may not stop somebody else from trying to extend
the API in different directions in the future. It seems like a way to
make life easier for that person when the time comes.

Just a thought, anyway; not something I'd make a fuss about.

jon

^ permalink raw reply [flat|nested] 151+ messages in thread
* [RFC] The reflink(2) system call v5. 2009-05-11 20:40 ` [RFC] The reflink(2) system call v4 Joel Becker ` (4 preceding siblings ...) 2009-05-14 3:57 ` Andy Lutomirski @ 2009-05-28 0:24 ` Joel Becker 2009-09-14 22:24 ` Joel Becker 6 siblings, 0 replies; 151+ messages in thread From: Joel Becker @ 2009-05-28 0:24 UTC (permalink / raw) To: jim owens, jmorris, ocfs2-devel, viro, mtk.manpages, linux-security-module, linux-fsdevel Here's v5 of reflink(). It adds a 'preserve' argument to the call. This argument may currently be one of REFLINK_ATTR_PRESERVE and REFLINK_ATTR_NONE. _ATTR_PRESERVE takes a full snapshot, and fails if the caller lacks the privileges. _ATTR_NONE links up the data extents (data and xattrs) in a CoW fashion, but otherwise initializes the new inode as a new file (new security state, acls, ownership, etc). I took everyone's advice and dropped attribute-specific flags for a single _ATTR_PRESERVE. Inside the kernel, the iop and security op get 'bool preserve' to tell them what to do. Joel >From d3c4ed0cb3f5af75f2adf92346e7a3f23870cd16 Mon Sep 17 00:00:00 2001 From: Joel Becker <joel.becker@oracle.com> Date: Sat, 2 May 2009 22:48:59 -0700 Subject: [PATCH] fs: Add the reflink() operation and reflinkat(2) system call. The userpace visible idea of the operation is: int reflink(const char *oldpath, const char *newpath, int preserve); int reflinkat(int olddirfd, const char *oldpath, int newdirfd, const char *newpath, int preserve, int flags); The kernel only implements reflinkat(2). reflink(3) is a trivial wrapper around reflinkat(2). The reflink() system call creates reference-counted links. It creates a new file that shares the data extents of the source file in a copy-on-write fashion. Its calling semantics are identical to link(2) and linkat(2). Once complete, programs see the new file as a completely separate entry. reflink() attempts to preserve ownership, permissions, and all other security state in order to create a full snapshot. 
A caller requests this by passing REFLINK_ATTR_PRESERVE as the 'preserve' argument. Preserving those attributes requires ownership or CAP_CHOWN. A caller without those privileges will get EPERM. An unpriviledged caller can specify REFLINK_ATTR_NONE. They will acquire the data extent sharing but will see the file's security state and attributes initialized as a new file. The unpriviledged reflink requires read access. In the VFS, ->reflink() is an inode_operation with the almost same arguments as ->link(); an additional argument tells the filesystem to copy over or reinitialize the security state on the new file. A new LSM hook, security_inode_reflink(), is added. None of the existing LSM hooks appeared to fit. This only adds the x86 linkage. The trend appears to be for other architectures to add their own linkage. Signed-off-by: Joel Becker <joel.becker@oracle.com> --- Documentation/filesystems/reflink.txt | 174 +++++++++++++++++++++++++++++++++ Documentation/filesystems/vfs.txt | 4 + arch/x86/ia32/ia32entry.S | 1 + arch/x86/include/asm/unistd_32.h | 1 + arch/x86/include/asm/unistd_64.h | 2 + arch/x86/kernel/syscall_table_32.S | 1 + fs/namei.c | 124 +++++++++++++++++++++++ include/linux/fcntl.h | 8 ++ include/linux/fs.h | 2 + include/linux/security.h | 23 +++++ include/linux/syscalls.h | 3 + security/capability.c | 7 ++ security/security.c | 8 ++ 13 files changed, 358 insertions(+), 0 deletions(-) create mode 100644 Documentation/filesystems/reflink.txt diff --git a/Documentation/filesystems/reflink.txt b/Documentation/filesystems/reflink.txt new file mode 100644 index 0000000..7effe33 --- /dev/null +++ b/Documentation/filesystems/reflink.txt @@ -0,0 +1,174 @@ +reflink(2) +========== + + +INTRODUCTION +------------ + +A reflink is a reference-counted link. The reflink(2) operation is +analogous to the link(2) operation, except that instead of two directory +entries pointing to the same inode, there are two identical inodes +pointing to the same data. 
Writes do not modify the shared data; they +use copy-on-write (CoW). Thus, after the reflink has been created, the +inodes can diverge without impacting each other. + + +SYNOPSIS +-------- + +The reflink(2) call looks almost like link(2): + + int reflink(const char *oldpath, const char *newpath, int preserve); + +The actual system call is reflinkat(2): + + int reflinkat(int olddirfd, const char *oldpath, + int newdirfd, const char *newpath, + int preserve, int flags); + +For details on how olddirfd, newdirfd, and flags behave, see linkat(2). +The reflink(2) call won't be implemented by the kernel, because it's a +trivial wrapper around reflinkat(2). + + +DESCRIPTION +----------- + +One way of viewing reflink is to look at the level of sharing. A +symbolic link does its sharing at the directory entry level; many names +end up pointing at the same directory entry. Hard links are one step +down. Multiple directory entries are sharing one inode. Reflinks are +down one more level: multiple inodes share the same data extents. + +When you symlink a file, you can then access it via the symlink or the +real directory entry, and for the most part they look identical. When +accessing more than one name for a hard link, the object returned looks +identical. Similarly, a newly created reflink is identical to its +source in almost every way and can be treated as such. This includes +ownership, permissions, security state, and data. The only things +that are different are the inode number, the link count, and the ctime. + +A reflink is a snapshot of the source file at the time it is created. + +Once created, though, a reflink can be modified like any other normal +file without affecting the source file. Changes to trivial fields like +permissions, owner, or times are guaranteed not to trigger CoW of file +data and will not return any error that wouldn't happen on a truly +distinct file. 
Changes to the file's data will trigger CoW of the data +affected - the actual CoW granularity is up to the filesystem, from +exact bytes up to the entire file. ocfs2, for example, will copy out an +entire extent or 1MB, whichever is smaller. + +Preserving the security state of the source file obviously requires +the privilege to do so. Because of this, the reflink(2) call has the +preserve argument. If it is set to REFLINK_ATTR_PRESERVE, the security +state and file attributes will match the source as described above. +Callers that do not own the source file and do not have CAP_CHOWN will +see reflink(2) fail with EPERM. If preserve is set to +REFLINK_ATTR_NONE, the new reflink will still share all the data extents +of the source file, including extended attributes. The security state +and attributes of the new reflink will be as a newly created file by +that user. With REFLINK_ATTR_NONE, the caller must have read access to +the source file. + +Partial reflinks are not allowed. The new inode will only appear in the +directory structure after it is fully formed. This prevents a crash or +lack of space from creating a partial reflink. + +If a filesystem does not support reflinks, the kernel and libc MUST NOT +fake it. Callers are expecting to get snapshots, and faking it will +violate that trust. + +The userspace view is as follows. When reflink(2) returns, opening +oldpath and newpath returns identical-looking files, just like link(2). +After that, oldpath and newpath behave as distinct files, and +modifications to one have no impact on the other. + + +RESTRICTIONS +------------ + +Just as the sharing gets lower as you move from symlink() -> link() -> +reflink(), the restrictions on the call get tighter. A symlink doesn't +require any access permissions other than being able to create its +inode. It can cross filesystems and mount points, and it can point to +any type of file. 
A hard link requires both source and target to be on +the same filesystem under the same mount point, and that the source not +be a directory. A reflink tightens that to regular files only. Like +hard links and symlinks, a reflink cannot be created if newpath exists. + +Reflinks add one big restriction on top of hard links: only the owner +or someone with elevated privileges (CAP_CHOWN) can preserve the +security state (permissions, ownership, ACLs, etc.) across a reflink. +A reflink is a point-in-time snapshot of a file. Without the +appropriate privilege, the caller specifying REFLINK_ATTR_PRESERVE +will receive EPERM. + +A caller specifying REFLINK_ATTR_NONE must have read access to reflink a +file. + + +SHARING +------- + +A reflink creates a new inode. It shares all data extents of the source +file; this includes file data and extended attribute data. All of the +sharing is in a CoW fashion, and any modification of the data will break +the sharing. + +For some filesystems, certain data structures are not in allocated +storage extents. Creating a reflink might make a copy of these extents. +An example is ext3's ability to store small extended attributes inside +the ext3 inode. Since a reflink is creating a new inode, those extended +attributes are merely copied to the new inode. + + +EXCEPTIONS +---------- + +When REFLINK_ATTR_PRESERVE is specified, all file attributes and +extended attributes of the new file must be identical to the source file +with the following exceptions: + +- The new file must have a new inode number. This allows POSIX + programs to treat the source and new files as separate objects. From + the view of the POSIX application, the files are distinct. The + sharing is invisible outside of the filesystem's internal structures. +- The ctime of the source file only changes if the source's metadata + must be changed to accommodate the copy-on-write linkage. The ctime + of the new file is set to represent its creation.
+- The link count of the source file is unchanged, and the link count of + the new file is one. + +The mtime of the source file is unmodified, and the mtime of the new +file is set identical to the source file. This reflects that the data +is unchanged. + +If REFLINK_ATTR_NONE is specified, all data extents will be reflinked, +but file attributes and security state will be as any new file. + + +INODE OPERATION +--------------- + +Filesystems implement the ->reflink() inode operation. It has almost +the same prototype as ->link(): + + int (*reflink)(struct dentry *old_dentry, struct inode *dir, + struct dentry *new_dentry, bool preserve); + +When the filesystem is called, the VFS has already checked the +permissions and mountpoint of the operation. It has determined whether +the file attributes and security state should be preserved or +reinitialized, as specified by the preserve argument. The filesystem +just needs to create the new inode identical to the old one with the +exceptions noted above, link up the shared data extents, and then link +the new inode into dir. + + +FOLLOWING SYMBOLIC LINKS +------------------------ + +reflink() dereferences symbolic links in the same manner that link(2) +does. The AT_SYMLINK_FOLLOW flag is honored just as for linkat(2). + diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt index f49eecf..0620d73 100644 --- a/Documentation/filesystems/vfs.txt +++ b/Documentation/filesystems/vfs.txt @@ -333,6 +333,7 @@ struct inode_operations { ssize_t (*listxattr) (struct dentry *, char *, size_t); int (*removexattr) (struct dentry *, const char *); void (*truncate_range)(struct inode *, loff_t, loff_t); + int (*reflink) (struct dentry *,struct inode *,struct dentry *,bool); }; Again, all methods are called without any locks being held, unless @@ -431,6 +432,9 @@ otherwise noted. truncate_range: a method provided by the underlying filesystem to truncate a range of blocks , i.e. punch a hole somewhere in a file.
+ reflink: called by the reflink(2) system call. Only required if you want + to support reflinks. For further information, see + Documentation/filesystems/reflink.txt. The Address Space Object diff --git a/arch/x86/ia32/ia32entry.S b/arch/x86/ia32/ia32entry.S index a505202..ca832b4 100644 --- a/arch/x86/ia32/ia32entry.S +++ b/arch/x86/ia32/ia32entry.S @@ -830,4 +830,5 @@ ia32_sys_call_table: .quad sys_inotify_init1 .quad compat_sys_preadv .quad compat_sys_pwritev + .quad sys_reflinkat /* 335 */ ia32_syscall_end: diff --git a/arch/x86/include/asm/unistd_32.h b/arch/x86/include/asm/unistd_32.h index 6e72d74..c368563 100644 --- a/arch/x86/include/asm/unistd_32.h +++ b/arch/x86/include/asm/unistd_32.h @@ -340,6 +340,7 @@ #define __NR_inotify_init1 332 #define __NR_preadv 333 #define __NR_pwritev 334 +#define __NR_reflinkat 335 #ifdef __KERNEL__ diff --git a/arch/x86/include/asm/unistd_64.h b/arch/x86/include/asm/unistd_64.h index f818294..b20f68c 100644 --- a/arch/x86/include/asm/unistd_64.h +++ b/arch/x86/include/asm/unistd_64.h @@ -657,6 +657,8 @@ __SYSCALL(__NR_inotify_init1, sys_inotify_init1) __SYSCALL(__NR_preadv, sys_preadv) #define __NR_pwritev 296 __SYSCALL(__NR_pwritev, sys_pwritev) +#define __NR_reflink 297 +__SYSCALL(__NR_reflink, sys_reflink) #ifndef __NO_STUBS diff --git a/arch/x86/kernel/syscall_table_32.S b/arch/x86/kernel/syscall_table_32.S index ff5c873..d11c200 100644 --- a/arch/x86/kernel/syscall_table_32.S +++ b/arch/x86/kernel/syscall_table_32.S @@ -334,3 +334,4 @@ ENTRY(sys_call_table) .long sys_inotify_init1 .long sys_preadv .long sys_pwritev + .long sys_reflinkat /* 335 */ diff --git a/fs/namei.c b/fs/namei.c index 78f253c..55f5c80 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -2486,6 +2486,129 @@ SYSCALL_DEFINE2(link, const char __user *, oldname, const char __user *, newname return sys_linkat(AT_FDCWD, oldname, AT_FDCWD, newname, 0); } +int vfs_reflink(struct dentry *old_dentry, struct inode *dir, + struct dentry *new_dentry, bool preserve) +{ 
+ struct inode *inode = old_dentry->d_inode; + int error; + + if (!inode) + return -ENOENT; + + error = may_create(dir, new_dentry); + if (error) + return error; + + if (dir->i_sb != inode->i_sb) + return -EXDEV; + + /* + * A reflink to an append-only or immutable file cannot be created. + */ + if (IS_APPEND(inode) || IS_IMMUTABLE(inode)) + return -EPERM; + if (!dir->i_op->reflink) + return -EPERM; + + /* + * Only regular files can be reflinked; if a user tries to + * reflink a block device, do they expect copy-on-write of the + * entire device? + */ + if (!S_ISREG(inode->i_mode)) + return -EPERM; + + /* + * If the caller wants to preserve ownership, they require the + * rights to do so. + */ + if (preserve) { + if ((current_fsuid() != inode->i_uid) && !capable(CAP_CHOWN)) + return -EPERM; + if (!in_group_p(inode->i_gid) && !capable(CAP_CHOWN)) + return -EPERM; + } + + error = security_inode_reflink(old_dentry, dir, preserve); + if (error) + return error; + + /* + * If the caller is modifying any aspect of the attributes, they + * are not creating a snapshot. They need read permission on the + * file. + */ + if (!preserve) { + error = inode_permission(inode, MAY_READ); + if (error) + return error; + } + + mutex_lock(&inode->i_mutex); + vfs_dq_init(dir); + error = dir->i_op->reflink(old_dentry, dir, new_dentry, preserve); + mutex_unlock(&inode->i_mutex); + if (!error) + fsnotify_create(dir, new_dentry); + return error; +} + +SYSCALL_DEFINE6(reflinkat, int, olddfd, const char __user *, oldname, + int, newdfd, const char __user *, newname, int, preserve, + int, flags) +{ + struct dentry *new_dentry; + struct nameidata nd; + struct path old_path; + int error; + char *to; + + if ((flags & ~AT_SYMLINK_FOLLOW) != 0) + return -EINVAL; + + if ((preserve & ~REFLINK_ATTR_PRESERVE) != 0) + return -EINVAL; + + error = user_path_at(olddfd, oldname, + flags & AT_SYMLINK_FOLLOW ? 
LOOKUP_FOLLOW : 0, + &old_path); + if (error) + return error; + + error = user_path_parent(newdfd, newname, &nd, &to); + if (error) + goto out; + error = -EXDEV; + if (old_path.mnt != nd.path.mnt) + goto out_release; + new_dentry = lookup_create(&nd, 0); + error = PTR_ERR(new_dentry); + if (IS_ERR(new_dentry)) + goto out_unlock; + error = mnt_want_write(nd.path.mnt); + if (error) + goto out_dput; + error = security_path_link(old_path.dentry, &nd.path, new_dentry); + if (error) + goto out_drop_write; + error = vfs_reflink(old_path.dentry, nd.path.dentry->d_inode, + new_dentry, preserve); +out_drop_write: + mnt_drop_write(nd.path.mnt); +out_dput: + dput(new_dentry); +out_unlock: + mutex_unlock(&nd.path.dentry->d_inode->i_mutex); +out_release: + path_put(&nd.path); + putname(to); +out: + path_put(&old_path); + + return error; +} + + /* * The worst of all namespace operations - renaming directory. "Perverted" * doesn't even start to describe it. Somebody in UCB had a heck of a trip... @@ -2890,6 +3013,7 @@ EXPORT_SYMBOL(unlock_rename); EXPORT_SYMBOL(vfs_create); EXPORT_SYMBOL(vfs_follow_link); EXPORT_SYMBOL(vfs_link); +EXPORT_SYMBOL(vfs_reflink); EXPORT_SYMBOL(vfs_mkdir); EXPORT_SYMBOL(vfs_mknod); EXPORT_SYMBOL(generic_permission); diff --git a/include/linux/fcntl.h b/include/linux/fcntl.h index 8603740..96dc2f0 100644 --- a/include/linux/fcntl.h +++ b/include/linux/fcntl.h @@ -40,6 +40,14 @@ unlinking file. */ #define AT_SYMLINK_FOLLOW 0x400 /* Follow symbolic links. */ +/* + * A reflink call may preserve the file's attributes in toto or not at + * all. 
+ */ +#define REFLINK_ATTR_PRESERVE 0x00000001 +#define REFLINK_ATTR_NONE 0 + + #ifdef __KERNEL__ #ifndef force_o_largefile diff --git a/include/linux/fs.h b/include/linux/fs.h index 5bed436..c6f9cb0 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -1415,6 +1415,7 @@ extern int vfs_link(struct dentry *, struct inode *, struct dentry *); extern int vfs_rmdir(struct inode *, struct dentry *); extern int vfs_unlink(struct inode *, struct dentry *); extern int vfs_rename(struct inode *, struct dentry *, struct inode *, struct dentry *); +extern int vfs_reflink(struct dentry *, struct inode *, struct dentry *, bool); /* * VFS dentry helper functions. @@ -1537,6 +1538,7 @@ struct inode_operations { loff_t len); int (*fiemap)(struct inode *, struct fiemap_extent_info *, u64 start, u64 len); + int (*reflink) (struct dentry *,struct inode *,struct dentry *,bool); }; struct seq_file; diff --git a/include/linux/security.h b/include/linux/security.h index d5fd616..2f1f520 100644 --- a/include/linux/security.h +++ b/include/linux/security.h @@ -528,6 +528,18 @@ static inline void security_free_mnt_opts(struct security_mnt_opts *opts) * @inode contains a pointer to the inode. * @secid contains a pointer to the location where result will be saved. * In case of failure, @secid will be set to zero. + * @inode_reflink: + * Check permission before creating a new reference-counted link to + * a file. + * @old_dentry contains the dentry structure for an existing link to + * the file. + * @dir contains the inode structure of the parent directory of the + * new reflink. + * @preserve specifies whether the caller wishes to preserve the + * file's attributes. If true, the caller wishes to clone the file's + * attributes exactly. If false, the caller expects to reflink the + * data extents but reset the attributes. + * Return 0 if permission is granted. 
* * Security hooks for file operations * @@ -1415,6 +1427,8 @@ struct security_operations { int (*inode_unlink) (struct inode *dir, struct dentry *dentry); int (*inode_symlink) (struct inode *dir, struct dentry *dentry, const char *old_name); + int (*inode_reflink) (struct dentry *old_dentry, struct inode *dir, + bool preserve); int (*inode_mkdir) (struct inode *dir, struct dentry *dentry, int mode); int (*inode_rmdir) (struct inode *dir, struct dentry *dentry); int (*inode_mknod) (struct inode *dir, struct dentry *dentry, @@ -1675,6 +1689,8 @@ int security_inode_link(struct dentry *old_dentry, struct inode *dir, int security_inode_unlink(struct inode *dir, struct dentry *dentry); int security_inode_symlink(struct inode *dir, struct dentry *dentry, const char *old_name); +int security_inode_reflink(struct dentry *old_dentry, struct inode *dir, + bool preserve); int security_inode_mkdir(struct inode *dir, struct dentry *dentry, int mode); int security_inode_rmdir(struct inode *dir, struct dentry *dentry); int security_inode_mknod(struct inode *dir, struct dentry *dentry, int mode, dev_t dev); @@ -2056,6 +2072,13 @@ static inline int security_inode_symlink(struct inode *dir, return 0; } +static inline int security_inode_reflink(struct dentry *old_dentry, + struct inode *dir, + bool preserve) +{ + return 0; +} + static inline int security_inode_mkdir(struct inode *dir, struct dentry *dentry, int mode) diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h index 40617c1..a11f228 100644 --- a/include/linux/syscalls.h +++ b/include/linux/syscalls.h @@ -692,6 +692,9 @@ asmlinkage long sys_symlinkat(const char __user * oldname, int newdfd, const char __user * newname); asmlinkage long sys_linkat(int olddfd, const char __user *oldname, int newdfd, const char __user *newname, int flags); +asmlinkage long sys_reflinkat(int olddfd, const char __user *oldname, + int newdfd, const char __user *newname, + int preserve, int flags); asmlinkage long sys_renameat(int 
olddfd, const char __user * oldname, int newdfd, const char __user * newname); asmlinkage long sys_futimesat(int dfd, char __user *filename, diff --git a/security/capability.c b/security/capability.c index 21b6cea..8047b7c 100644 --- a/security/capability.c +++ b/security/capability.c @@ -172,6 +172,12 @@ static int cap_inode_symlink(struct inode *inode, struct dentry *dentry, return 0; } +static int cap_inode_reflink(struct dentry *old_dentry, struct inode *inode, + bool preserve) +{ + return 0; +} + static int cap_inode_mkdir(struct inode *inode, struct dentry *dentry, int mask) { @@ -905,6 +911,7 @@ void security_fixup_ops(struct security_operations *ops) set_to_cap_if_null(ops, inode_link); set_to_cap_if_null(ops, inode_unlink); set_to_cap_if_null(ops, inode_symlink); + set_to_cap_if_null(ops, inode_reflink); set_to_cap_if_null(ops, inode_mkdir); set_to_cap_if_null(ops, inode_rmdir); set_to_cap_if_null(ops, inode_mknod); diff --git a/security/security.c b/security/security.c index 5284255..e2b12f9 100644 --- a/security/security.c +++ b/security/security.c @@ -470,6 +470,14 @@ int security_inode_symlink(struct inode *dir, struct dentry *dentry, return security_ops->inode_symlink(dir, dentry, old_name); } +int security_inode_reflink(struct dentry *old_dentry, struct inode *dir, + bool preserve) +{ + if (unlikely(IS_PRIVATE(old_dentry->d_inode))) + return 0; + return security_ops->inode_reflink(old_dentry, dir, preserve); +} + int security_inode_mkdir(struct inode *dir, struct dentry *dentry, int mode) { if (unlikely(IS_PRIVATE(dir))) -- 1.6.3 -- "Anything that is too stupid to be spoken is sung." - Voltaire Joel Becker Principal Software Developer Oracle E-mail: joel.becker@oracle.com Phone: (650) 506-8127 ^ permalink raw reply related [flat|nested] 151+ messages in thread
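The permission rules that vfs_reflink() applies in the patch above can be sketched as a small userspace model. This is only an illustration and not kernel code: struct reflink_caller, its fields, and reflink_may_proceed() are hypothetical names, and the kernel's capable(), in_group_p(), and inode_permission() calls are collapsed into plain booleans. The model also assumes that a failed read-permission check reports EACCES, and it leaves out the LSM hook and the filesystem's own ->reflink() result.

```c
#include <errno.h>
#include <stdbool.h>

/* Values copied from the patch's include/linux/fcntl.h hunk. */
#define REFLINK_ATTR_NONE     0
#define REFLINK_ATTR_PRESERVE 0x00000001

/* Hypothetical summary of what the VFS knows about caller and source. */
struct reflink_caller {
    bool owns_source;                   /* current_fsuid() == inode->i_uid */
    bool in_source_group;               /* in_group_p(inode->i_gid) */
    bool has_cap_chown;                 /* capable(CAP_CHOWN) */
    bool can_read_source;               /* inode_permission(inode, MAY_READ) == 0 */
    bool source_is_regular;             /* S_ISREG(inode->i_mode) */
    bool source_is_append_or_immutable; /* IS_APPEND() || IS_IMMUTABLE() */
};

/*
 * Returns 0 if the modeled checks pass, or a negative errno value.
 * The order of the checks follows vfs_reflink() in the patch.
 */
static int reflink_may_proceed(const struct reflink_caller *c, int preserve)
{
    /* sys_reflinkat() rejects unknown bits in 'preserve'. */
    if ((preserve & ~REFLINK_ATTR_PRESERVE) != 0)
        return -EINVAL;

    /* Append-only and immutable files cannot be reflinked. */
    if (c->source_is_append_or_immutable)
        return -EPERM;

    /* Only regular files can be reflinked. */
    if (!c->source_is_regular)
        return -EPERM;

    if (preserve) {
        /* A full snapshot needs matching owner and group, or CAP_CHOWN. */
        if (!c->owns_source && !c->has_cap_chown)
            return -EPERM;
        if (!c->in_source_group && !c->has_cap_chown)
            return -EPERM;
    } else if (!c->can_read_source) {
        /* REFLINK_ATTR_NONE only requires read access to the source. */
        return -EACCES;
    }

    return 0;
}
```

Note that the preserve path never consults read access in this model, matching the patch: the owner can snapshot a file even when its mode bits deny reading, while a caller who passes REFLINK_ATTR_NONE is treated like someone copying the data.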
* [RFC] The reflink(2) system call v5. 2009-05-11 20:40 ` [RFC] The reflink(2) system call v4 Joel Becker ` (5 preceding siblings ...) 2009-05-28 0:24 ` [RFC] The reflink(2) system call v5 Joel Becker @ 2009-09-14 22:24 ` Joel Becker 6 siblings, 0 replies; 151+ messages in thread From: Joel Becker @ 2009-09-14 22:24 UTC (permalink / raw) To: jim owens, jmorris, ocfs2-devel, viro, mtk.manpages, linux-security-module, linux-fsdevel [This is a resend of the v5 patch sent on May 25th. Jim, Al, can I get acks please.] Here's v5 of reflink(). It adds a 'preserve' argument to the call. This argument may currently be one of REFLINK_ATTR_PRESERVE and REFLINK_ATTR_NONE. _ATTR_PRESERVE takes a full snapshot, and fails if the caller lacks the privileges. _ATTR_NONE links up the data extents (data and xattrs) in a CoW fashion, but otherwise initializes the new inode as a new file (new security state, acls, ownership, etc.). I took everyone's advice and dropped attribute-specific flags for a single _ATTR_PRESERVE. Inside the kernel, the iop and security op get 'bool preserve' to tell them what to do. Joel From d3c4ed0cb3f5af75f2adf92346e7a3f23870cd16 Mon Sep 17 00:00:00 2001 From: Joel Becker <joel.becker@oracle.com> Date: Sat, 2 May 2009 22:48:59 -0700 Subject: [PATCH] fs: Add the reflink() operation and reflinkat(2) system call. The userspace-visible idea of the operation is: int reflink(const char *oldpath, const char *newpath, int preserve); int reflinkat(int olddirfd, const char *oldpath, int newdirfd, const char *newpath, int preserve, int flags); The kernel only implements reflinkat(2). reflink(3) is a trivial wrapper around reflinkat(2). The reflink() system call creates reference-counted links. It creates a new file that shares the data extents of the source file in a copy-on-write fashion. Its calling semantics are identical to link(2) and linkat(2). Once complete, programs see the new file as a completely separate entry.
reflink() attempts to preserve ownership, permissions, and all other security state in order to create a full snapshot. A caller requests this by passing REFLINK_ATTR_PRESERVE as the 'preserve' argument. Preserving those attributes requires ownership or CAP_CHOWN. A caller without those privileges will get EPERM. An unprivileged caller can specify REFLINK_ATTR_NONE. They will acquire the data extent sharing but will see the file's security state and attributes initialized as a new file. The unprivileged reflink requires read access. In the VFS, ->reflink() is an inode_operation with almost the same arguments as ->link(); an additional argument tells the filesystem to copy over or reinitialize the security state on the new file. A new LSM hook, security_inode_reflink(), is added. None of the existing LSM hooks appeared to fit. This only adds the x86 linkage. The trend appears to be for other architectures to add their own linkage. Signed-off-by: Joel Becker <joel.becker@oracle.com> --- Documentation/filesystems/reflink.txt | 174 +++++++++++++++++++++++++++++++++ Documentation/filesystems/vfs.txt | 4 + arch/x86/ia32/ia32entry.S | 1 + arch/x86/include/asm/unistd_32.h | 1 + arch/x86/include/asm/unistd_64.h | 2 + arch/x86/kernel/syscall_table_32.S | 1 + fs/namei.c | 124 +++++++++++++++++++++++ include/linux/fcntl.h | 8 ++ include/linux/fs.h | 2 + include/linux/security.h | 23 +++++ include/linux/syscalls.h | 3 + security/capability.c | 7 ++ security/security.c | 8 ++ 13 files changed, 358 insertions(+), 0 deletions(-) create mode 100644 Documentation/filesystems/reflink.txt diff --git a/Documentation/filesystems/reflink.txt b/Documentation/filesystems/reflink.txt new file mode 100644 index 0000000..7effe33 --- /dev/null +++ b/Documentation/filesystems/reflink.txt @@ -0,0 +1,174 @@ +reflink(2) +========== + + +INTRODUCTION +------------ + +A reflink is a reference-counted link.
The reflink(2) operation is +analogous to the link(2) operation, except that instead of two directory +entries pointing to the same inode, there are two identical inodes +pointing to the same data. Writes do not modify the shared data; they +use copy-on-write (CoW). Thus, after the reflink has been created, the +inodes can diverge without impacting each other. + + +SYNOPSIS +-------- + +The reflink(2) call looks almost like link(2): + + int reflink(const char *oldpath, const char *newpath, int preserve); + +The actual system call is reflinkat(2): + + int reflinkat(int olddirfd, const char *oldpath, + int newdirfd, const char *newpath, + int preserve, int flags); + +For details on how olddirfd, newdirfd, and flags behave, see linkat(2). +The reflink(2) call won't be implemented by the kernel, because it's a +trivial wrapper around reflinkat(2). + + +DESCRIPTION +----------- + +One way of viewing reflink is to look at the level of sharing. A +symbolic link does its sharing at the directory entry level; many names +end up pointing at the same directory entry. Hard links are one step +down. Multiple directory entries are sharing one inode. Reflinks are +down one more level: multiple inodes share the same data extents. + +When you symlink a file, you can then access it via the symlink or the +real directory entry, and for the most part they look identical. When +accessing more than one name for a hard link, the object returned looks +identical. Similarly, a newly created reflink is identical to its +source in almost every way and can be treated as such. This includes +ownership, permissions, security state, and data. The only things +that are different are the inode number, the link count, and the ctime. + +A reflink is a snapshot of the source file at the time it is created. + +Once created, though, a reflink can be modified like any other normal +file without affecting the source file. 
Changes to trivial fields like +permissions, owner, or times are guaranteed not to trigger CoW of file +data and will not return any error that wouldn't happen on a truly +distinct file. Changes to the file's data will trigger CoW of the data +affected - the actual CoW granularity is up to the filesystem, from +exact bytes up to the entire file. ocfs2, for example, will copy out an +entire extent or 1MB, whichever is smaller. + +Preserving the security state of the source file obviously requires +the privilege to do so. Because of this, the reflink(2) call has the +preserve argument. If it is set to REFLINK_ATTR_PRESERVE, the security +state and file attributes will match the source as described above. +Callers that do not own the source file and do not have CAP_CHOWN will +see reflink(2) fail with EPERM. If preserve is set to +REFLINK_ATTR_NONE, the new reflink will still share all the data extents +of the source file, including extended attributes. The security state +and attributes of the new reflink will be as a newly created file by +that user. With REFLINK_ATTR_NONE, the caller must have read access to +the source file. + +Partial reflinks are not allowed. The new inode will only appear in the +directory structure after it is fully formed. This prevents a crash or +lack of space from creating a partial reflink. + +If a filesystem does not support reflinks, the kernel and libc MUST NOT +fake it. Callers are expecting to get snapshots, and faking it will +violate that trust. + +The userspace view is as follows. When reflink(2) returns, opening +oldpath and newpath returns identical-looking files, just like link(2). +After that, oldpath and newpath behave as distinct files, and +modifications to one have no impact on the other. + + +RESTRICTIONS +------------ + +Just as the sharing gets lower as you move from symlink() -> link() -> +reflink(), the restrictions on the call get tighter. 
A symlink doesn't +require any access permissions other than being able to create its +inode. It can cross filesystems and mount points, and it can point to +any type of file. A hard link requires both source and target to be on +the same filesystem under the same mount point, and that the source not +be a directory. A reflink tightens that to regular files only. Like +hard links and symlinks, a reflink cannot be created if newpath exists. + +Reflinks add one big restriction on top of hard links: only the owner +or someone with elevated privileges (CAP_CHOWN) can preserve the +security state (permissions, ownership, ACLs, etc.) across a reflink. +A reflink is a point-in-time snapshot of a file. Without the +appropriate privilege, the caller specifying REFLINK_ATTR_PRESERVE +will receive EPERM. + +A caller specifying REFLINK_ATTR_NONE must have read access to reflink a +file. + + +SHARING +------- + +A reflink creates a new inode. It shares all data extents of the source +file; this includes file data and extended attribute data. All of the +sharing is in a CoW fashion, and any modification of the data will break +the sharing. + +For some filesystems, certain data structures are not in allocated +storage extents. Creating a reflink might make a copy of these extents. +An example is ext3's ability to store small extended attributes inside +the ext3 inode. Since a reflink is creating a new inode, those extended +attributes are merely copied to the new inode. + + +EXCEPTIONS +---------- + +When REFLINK_ATTR_PRESERVE is specified, all file attributes and +extended attributes of the new file must be identical to the source file +with the following exceptions: + +- The new file must have a new inode number. This allows POSIX + programs to treat the source and new files as separate objects. From + the view of the POSIX application, the files are distinct. The + sharing is invisible outside of the filesystem's internal structures.
+- The ctime of the source file only changes if the source's metadata + must be changed to accommodate the copy-on-write linkage. The ctime + of the new file is set to represent its creation. +- The link count of the source file is unchanged, and the link count of + the new file is one. + +The mtime of the source file is unmodified, and the mtime of the new +file is set identical to the source file. This reflects that the data +is unchanged. + +If REFLINK_ATTR_NONE is specified, all data extents will be reflinked, +but file attributes and security state will be as any new file. + + +INODE OPERATION +--------------- + +Filesystems implement the ->reflink() inode operation. It has almost +the same prototype as ->link(): + + int (*reflink)(struct dentry *old_dentry, struct inode *dir, + struct dentry *new_dentry, bool preserve); + +When the filesystem is called, the VFS has already checked the +permissions and mountpoint of the operation. It has determined whether +the file attributes and security state should be preserved or +reinitialized, as specified by the preserve argument. The filesystem +just needs to create the new inode identical to the old one with the +exceptions noted above, link up the shared data extents, and then link +the new inode into dir. + + +FOLLOWING SYMBOLIC LINKS +------------------------ + +reflink() dereferences symbolic links in the same manner that link(2) +does. The AT_SYMLINK_FOLLOW flag is honored just as for linkat(2).
+ diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt index f49eecf..0620d73 100644 --- a/Documentation/filesystems/vfs.txt +++ b/Documentation/filesystems/vfs.txt @@ -333,6 +333,7 @@ struct inode_operations { ssize_t (*listxattr) (struct dentry *, char *, size_t); int (*removexattr) (struct dentry *, const char *); void (*truncate_range)(struct inode *, loff_t, loff_t); + int (*reflink) (struct dentry *,struct inode *,struct dentry *,bool); }; Again, all methods are called without any locks being held, unless @@ -431,6 +432,9 @@ otherwise noted. truncate_range: a method provided by the underlying filesystem to truncate a range of blocks , i.e. punch a hole somewhere in a file. + reflink: called by the reflink(2) system call. Only required if you want + to support reflinks. For further information, see + Documentation/filesystems/reflink.txt. The Address Space Object diff --git a/arch/x86/ia32/ia32entry.S b/arch/x86/ia32/ia32entry.S index a505202..ca832b4 100644 --- a/arch/x86/ia32/ia32entry.S +++ b/arch/x86/ia32/ia32entry.S @@ -830,4 +830,5 @@ ia32_sys_call_table: .quad sys_inotify_init1 .quad compat_sys_preadv .quad compat_sys_pwritev + .quad sys_reflinkat /* 335 */ ia32_syscall_end: diff --git a/arch/x86/include/asm/unistd_32.h b/arch/x86/include/asm/unistd_32.h index 6e72d74..c368563 100644 --- a/arch/x86/include/asm/unistd_32.h +++ b/arch/x86/include/asm/unistd_32.h @@ -340,6 +340,7 @@ #define __NR_inotify_init1 332 #define __NR_preadv 333 #define __NR_pwritev 334 +#define __NR_reflinkat 335 #ifdef __KERNEL__ diff --git a/arch/x86/include/asm/unistd_64.h b/arch/x86/include/asm/unistd_64.h index f818294..b20f68c 100644 --- a/arch/x86/include/asm/unistd_64.h +++ b/arch/x86/include/asm/unistd_64.h @@ -657,6 +657,8 @@ __SYSCALL(__NR_inotify_init1, sys_inotify_init1) __SYSCALL(__NR_preadv, sys_preadv) #define __NR_pwritev 296 __SYSCALL(__NR_pwritev, sys_pwritev) +#define __NR_reflink 297 +__SYSCALL(__NR_reflink, sys_reflink) #ifndef 
__NO_STUBS
diff --git a/arch/x86/kernel/syscall_table_32.S b/arch/x86/kernel/syscall_table_32.S
index ff5c873..d11c200 100644
--- a/arch/x86/kernel/syscall_table_32.S
+++ b/arch/x86/kernel/syscall_table_32.S
@@ -334,3 +334,4 @@ ENTRY(sys_call_table)
 	.long sys_inotify_init1
 	.long sys_preadv
 	.long sys_pwritev
+	.long sys_reflinkat		/* 335 */
diff --git a/fs/namei.c b/fs/namei.c
index 78f253c..55f5c80 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2486,6 +2486,129 @@ SYSCALL_DEFINE2(link, const char __user *, oldname, const char __user *, newname
 	return sys_linkat(AT_FDCWD, oldname, AT_FDCWD, newname, 0);
 }
 
+int vfs_reflink(struct dentry *old_dentry, struct inode *dir,
+		struct dentry *new_dentry, bool preserve)
+{
+	struct inode *inode = old_dentry->d_inode;
+	int error;
+
+	if (!inode)
+		return -ENOENT;
+
+	error = may_create(dir, new_dentry);
+	if (error)
+		return error;
+
+	if (dir->i_sb != inode->i_sb)
+		return -EXDEV;
+
+	/*
+	 * A reflink to an append-only or immutable file cannot be created.
+	 */
+	if (IS_APPEND(inode) || IS_IMMUTABLE(inode))
+		return -EPERM;
+	if (!dir->i_op->reflink)
+		return -EPERM;
+
+	/*
+	 * Only regular files can be reflinked; if a user tries to
+	 * reflink a block device, do they expect copy-on-write of the
+	 * entire device?
+	 */
+	if (!S_ISREG(inode->i_mode))
+		return -EPERM;
+
+	/*
+	 * If the caller wants to preserve ownership, they require the
+	 * rights to do so.
+	 */
+	if (preserve) {
+		if ((current_fsuid() != inode->i_uid) && !capable(CAP_CHOWN))
+			return -EPERM;
+		if (!in_group_p(inode->i_gid) && !capable(CAP_CHOWN))
+			return -EPERM;
+	}
+
+	error = security_inode_reflink(old_dentry, dir, preserve);
+	if (error)
+		return error;
+
+	/*
+	 * If the caller is modifying any aspect of the attributes, they
+	 * are not creating a snapshot.  They need read permission on the
+	 * file.
+	 */
+	if (!preserve) {
+		error = inode_permission(inode, MAY_READ);
+		if (error)
+			return error;
+	}
+
+	mutex_lock(&inode->i_mutex);
+	vfs_dq_init(dir);
+	error = dir->i_op->reflink(old_dentry, dir, new_dentry, preserve);
+	mutex_unlock(&inode->i_mutex);
+	if (!error)
+		fsnotify_create(dir, new_dentry);
+	return error;
+}
+
+SYSCALL_DEFINE6(reflinkat, int, olddfd, const char __user *, oldname,
+		int, newdfd, const char __user *, newname, int, preserve,
+		int, flags)
+{
+	struct dentry *new_dentry;
+	struct nameidata nd;
+	struct path old_path;
+	int error;
+	char *to;
+
+	if ((flags & ~AT_SYMLINK_FOLLOW) != 0)
+		return -EINVAL;
+
+	if ((preserve & ~REFLINK_ATTR_PRESERVE) != 0)
+		return -EINVAL;
+
+	error = user_path_at(olddfd, oldname,
+			     flags & AT_SYMLINK_FOLLOW ? LOOKUP_FOLLOW : 0,
+			     &old_path);
+	if (error)
+		return error;
+
+	error = user_path_parent(newdfd, newname, &nd, &to);
+	if (error)
+		goto out;
+	error = -EXDEV;
+	if (old_path.mnt != nd.path.mnt)
+		goto out_release;
+	new_dentry = lookup_create(&nd, 0);
+	error = PTR_ERR(new_dentry);
+	if (IS_ERR(new_dentry))
+		goto out_unlock;
+	error = mnt_want_write(nd.path.mnt);
+	if (error)
+		goto out_dput;
+	error = security_path_link(old_path.dentry, &nd.path, new_dentry);
+	if (error)
+		goto out_drop_write;
+	error = vfs_reflink(old_path.dentry, nd.path.dentry->d_inode,
+			    new_dentry, preserve);
+out_drop_write:
+	mnt_drop_write(nd.path.mnt);
+out_dput:
+	dput(new_dentry);
+out_unlock:
+	mutex_unlock(&nd.path.dentry->d_inode->i_mutex);
+out_release:
+	path_put(&nd.path);
+	putname(to);
+out:
+	path_put(&old_path);
+
+	return error;
+}
+
+
 /*
  * The worst of all namespace operations - renaming directory. "Perverted"
  * doesn't even start to describe it. Somebody in UCB had a heck of a trip...
@@ -2890,6 +3013,7 @@ EXPORT_SYMBOL(unlock_rename);
 EXPORT_SYMBOL(vfs_create);
 EXPORT_SYMBOL(vfs_follow_link);
 EXPORT_SYMBOL(vfs_link);
+EXPORT_SYMBOL(vfs_reflink);
 EXPORT_SYMBOL(vfs_mkdir);
 EXPORT_SYMBOL(vfs_mknod);
 EXPORT_SYMBOL(generic_permission);
diff --git a/include/linux/fcntl.h b/include/linux/fcntl.h
index 8603740..96dc2f0 100644
--- a/include/linux/fcntl.h
+++ b/include/linux/fcntl.h
@@ -40,6 +40,14 @@
                                            unlinking file.  */
 #define AT_SYMLINK_FOLLOW	0x400   /* Follow symbolic links.  */
 
+/*
+ * A reflink call may preserve the file's attributes in toto or not at
+ * all.
+ */
+#define REFLINK_ATTR_PRESERVE	0x00000001
+#define REFLINK_ATTR_NONE	0
+
+
 #ifdef __KERNEL__
 
 #ifndef force_o_largefile
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 5bed436..c6f9cb0 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1415,6 +1415,7 @@ extern int vfs_link(struct dentry *, struct inode *, struct dentry *);
 extern int vfs_rmdir(struct inode *, struct dentry *);
 extern int vfs_unlink(struct inode *, struct dentry *);
 extern int vfs_rename(struct inode *, struct dentry *, struct inode *, struct dentry *);
+extern int vfs_reflink(struct dentry *, struct inode *, struct dentry *, bool);
 
 /*
  * VFS dentry helper functions.
@@ -1537,6 +1538,7 @@ struct inode_operations {
 			  loff_t len);
 	int (*fiemap)(struct inode *, struct fiemap_extent_info *, u64 start,
 		      u64 len);
+	int (*reflink) (struct dentry *,struct inode *,struct dentry *,bool);
 };
 
 struct seq_file;
diff --git a/include/linux/security.h b/include/linux/security.h
index d5fd616..2f1f520 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -528,6 +528,18 @@ static inline void security_free_mnt_opts(struct security_mnt_opts *opts)
  *	@inode contains a pointer to the inode.
  *	@secid contains a pointer to the location where result will be saved.
  *	In case of failure, @secid will be set to zero.
+ * @inode_reflink:
+ *	Check permission before creating a new reference-counted link to
+ *	a file.
+ *	@old_dentry contains the dentry structure for an existing link to
+ *	the file.
+ *	@dir contains the inode structure of the parent directory of the
+ *	new reflink.
+ *	@preserve specifies whether the caller wishes to preserve the
+ *	file's attributes.  If true, the caller wishes to clone the file's
+ *	attributes exactly.  If false, the caller expects to reflink the
+ *	data extents but reset the attributes.
+ *	Return 0 if permission is granted.
  *
  * Security hooks for file operations
  *
@@ -1415,6 +1427,8 @@ struct security_operations {
 	int (*inode_unlink) (struct inode *dir, struct dentry *dentry);
 	int (*inode_symlink) (struct inode *dir,
 			      struct dentry *dentry, const char *old_name);
+	int (*inode_reflink) (struct dentry *old_dentry, struct inode *dir,
+			      bool preserve);
 	int (*inode_mkdir) (struct inode *dir, struct dentry *dentry, int mode);
 	int (*inode_rmdir) (struct inode *dir, struct dentry *dentry);
 	int (*inode_mknod) (struct inode *dir, struct dentry *dentry,
@@ -1675,6 +1689,8 @@ int security_inode_link(struct dentry *old_dentry, struct inode *dir,
 int security_inode_unlink(struct inode *dir, struct dentry *dentry);
 int security_inode_symlink(struct inode *dir, struct dentry *dentry,
 			   const char *old_name);
+int security_inode_reflink(struct dentry *old_dentry, struct inode *dir,
+			   bool preserve);
 int security_inode_mkdir(struct inode *dir, struct dentry *dentry, int mode);
 int security_inode_rmdir(struct inode *dir, struct dentry *dentry);
 int security_inode_mknod(struct inode *dir, struct dentry *dentry, int mode, dev_t dev);
@@ -2056,6 +2072,13 @@ static inline int security_inode_symlink(struct inode *dir,
 	return 0;
 }
 
+static inline int security_inode_reflink(struct dentry *old_dentry,
+					 struct inode *dir,
+					 bool preserve)
+{
+	return 0;
+}
+
 static inline int security_inode_mkdir(struct inode *dir,
 					struct dentry *dentry,
 					int mode)
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 40617c1..a11f228 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -692,6 +692,9 @@ asmlinkage long sys_symlinkat(const char __user * oldname,
 			      int newdfd, const char __user * newname);
 asmlinkage long sys_linkat(int olddfd, const char __user *oldname,
 			   int newdfd, const char __user *newname, int flags);
+asmlinkage long sys_reflinkat(int olddfd, const char __user *oldname,
+			      int newdfd, const char __user *newname,
+			      int preserve, int flags);
 asmlinkage long sys_renameat(int olddfd, const char __user * oldname,
 			     int newdfd, const char __user * newname);
 asmlinkage long sys_futimesat(int dfd, char __user *filename,
diff --git a/security/capability.c b/security/capability.c
index 21b6cea..8047b7c 100644
--- a/security/capability.c
+++ b/security/capability.c
@@ -172,6 +172,12 @@ static int cap_inode_symlink(struct inode *inode, struct dentry *dentry,
 	return 0;
 }
 
+static int cap_inode_reflink(struct dentry *old_dentry, struct inode *inode,
+			     bool preserve)
+{
+	return 0;
+}
+
 static int cap_inode_mkdir(struct inode *inode, struct dentry *dentry,
 			   int mask)
 {
@@ -905,6 +911,7 @@ void security_fixup_ops(struct security_operations *ops)
 	set_to_cap_if_null(ops, inode_link);
 	set_to_cap_if_null(ops, inode_unlink);
 	set_to_cap_if_null(ops, inode_symlink);
+	set_to_cap_if_null(ops, inode_reflink);
 	set_to_cap_if_null(ops, inode_mkdir);
 	set_to_cap_if_null(ops, inode_rmdir);
 	set_to_cap_if_null(ops, inode_mknod);
diff --git a/security/security.c b/security/security.c
index 5284255..e2b12f9 100644
--- a/security/security.c
+++ b/security/security.c
@@ -470,6 +470,14 @@ int security_inode_symlink(struct inode *dir, struct dentry *dentry,
 	return security_ops->inode_symlink(dir, dentry, old_name);
 }
 
+int security_inode_reflink(struct dentry *old_dentry, struct inode *dir,
+			   bool preserve)
+{
+	if (unlikely(IS_PRIVATE(old_dentry->d_inode)))
+		return 0;
+	return security_ops->inode_reflink(old_dentry, dir, preserve);
+}
+
 int security_inode_mkdir(struct inode *dir, struct dentry *dentry, int mode)
 {
 	if (unlikely(IS_PRIVATE(dir)))
-- 
1.6.3

-- 

"Anything that is too stupid to be spoken is sung."
	- Voltaire

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
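The argument checks at the top of sys_reflinkat() in the patch above can be sketched in userspace. This is only an illustration under stated assumptions: reflinkat_args_valid() is a hypothetical helper name that does not appear in the patch, and the constant values are copied from the include/linux/fcntl.h hunk of this RFC rather than from any released kernel header.

```c
#include <assert.h>
#include <errno.h>

/*
 * Values as given in the RFC patch.  AT_SYMLINK_FOLLOW is guarded in
 * case a libc header already defines it.
 */
#ifndef AT_SYMLINK_FOLLOW
#define AT_SYMLINK_FOLLOW	0x400
#endif
#define REFLINK_ATTR_PRESERVE	0x00000001
#define REFLINK_ATTR_NONE	0

/*
 * Hypothetical helper (not part of the patch): mirrors the two
 * -EINVAL checks that sys_reflinkat() performs before any lookup.
 * Returns 0 if both arguments carry only known bits.
 */
static int reflinkat_args_valid(int preserve, int flags)
{
	/* Only AT_SYMLINK_FOLLOW is accepted in flags. */
	if ((flags & ~AT_SYMLINK_FOLLOW) != 0)
		return -EINVAL;

	/* preserve is all-or-nothing: REFLINK_ATTR_PRESERVE or 0. */
	if ((preserve & ~REFLINK_ATTR_PRESERVE) != 0)
		return -EINVAL;

	return 0;
}
```

Rejecting unknown bits up front, as the patch does, leaves room to give new bits a meaning later without changing the behaviour seen by existing callers.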
* Re: [RFC] The reflink(2) system call v2.
From: Joel Becker @ 2009-05-11 20:49 UTC
To: jim owens
Cc: jmorris, linux-security-module, mtk.manpages, linux-fsdevel, ocfs2-devel, viro

On Thu, May 07, 2009 at 10:59:04PM -0400, jim owens wrote:
> - fix the
> +	if (S_ISDIR(inode->i_mode))
> +		return -EPERM;
>
> to be an ISREG check unless you have an argument for
> special files and symlinks being COWed.

	I'm unsure on this one, and would like other comments. Why? It
doesn't *hurt* to allow reflink on symlinks or special files. Mostly
it's a waste - symlinks may have a data extent, but special files do
not. But I'm not sure there's a point to arbitrarily limit filesystems
when there's nothing we're combating.
	Jim, if you have a real problem this prevents, I'm all ears.
And if others concur that restricting it to regular files is the right
way to go, I can be convinced.

Joel

-- 

"Hey mister if you're gonna walk on water,
 Could you drop a line my way?"

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127
* Re: [RFC] The reflink(2) system call v2.
From: jim owens @ 2009-05-11 22:49 UTC
To: joel.becker, linux-fsdevel
Cc: jmorris, ocfs2-devel, viro, mtk.manpages, linux-security-module

Joel Becker wrote:
> On Thu, May 07, 2009 at 10:59:04PM -0400, jim owens wrote:
>> - fix the
>> +	if (S_ISDIR(inode->i_mode))
>> +		return -EPERM;
>>
>> to be an ISREG check unless you have an argument for
>> special files and symlinks being COWed.
>
> 	I'm unsure on this one, and would like other comments. Why? It
> doesn't *hurt* to allow reflink on symlinks or special files. Mostly
> it's a waste - symlinks may have a data extent, but special files do
> not. But I'm not sure there's a point to arbitrarily limit filesystems
> when there's nothing we're combating.
> 	Jim, if you have a real problem this prevents, I'm all ears.
> And if others concur that restricting it to regular files is the right
> way to go, I can be convinced.

My only problem was my past experience on non-Linux systems
where once we said it works for multiple file types, we had
to support that forever across all filesystems. We could add
support for more types but not eliminate supported ones.

Since only ocfs2 will initially support this, I'm fine with
the S_ISDIR and if in the future other filesystems can only
support regular files (or can also support directories), we
move the check out of VFS to be filesystem specific.

jim
* Re: [RFC] The reflink(2) system call v2.
From: Joel Becker @ 2009-05-11 23:46 UTC
To: jim owens
Cc: linux-fsdevel, jmorris, ocfs2-devel, viro, mtk.manpages, linux-security-module

On Mon, May 11, 2009 at 06:49:01PM -0400, jim owens wrote:
> Joel Becker wrote:
>> On Thu, May 07, 2009 at 10:59:04PM -0400, jim owens wrote:
>>> - fix the
>>> +	if (S_ISDIR(inode->i_mode))
>>> +		return -EPERM;
>>>
>>> to be an ISREG check unless you have an argument for
>>> special files and symlinks being COWed.
>>
>> 	Jim, if you have a real problem this prevents, I'm all ears.
>> And if others concur that restricting it to regular files is the right
>> way to go, I can be convinced.
>
> My only problem was my past experience on non-Linux systems
> where once we said it works for multiple file types, we had
> to support that forever across all filesystems. We could add
> support for more types but not eliminate supported ones.

	Someone else pointed out that a naive user might reflink a block
device file and expect the device contents to be copied-on-write.
Obviously wrong if you understand filesystems, but let's just prevent
that misunderstanding. S_ISREG() it is.

Joel

-- 

"All alone at the end of the evening
 When the bright lights have faded to blue.
 I was thinking about a woman who had loved me
 And I never knew"

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127
* Re: [RFC] The reflink(2) system call v2.
From: Chris Mason @ 2009-05-12 0:54 UTC
To: Joel Becker
Cc: jmorris, linux-fsdevel, linux-security-module, mtk.manpages, jim owens, ocfs2-devel, viro

On Mon, 2009-05-11 at 16:46 -0700, Joel Becker wrote:
> On Mon, May 11, 2009 at 06:49:01PM -0400, jim owens wrote:
> > Joel Becker wrote:
> >> On Thu, May 07, 2009 at 10:59:04PM -0400, jim owens wrote:
> >>> - fix the
> >>> +	if (S_ISDIR(inode->i_mode))
> >>> +		return -EPERM;
> >>>
> >>> to be an ISREG check unless you have an argument for
> >>> special files and symlinks being COWed.
> >>
> >> 	Jim, if you have a real problem this prevents, I'm all ears.
> >> And if others concur that restricting it to regular files is the right
> >> way to go, I can be convinced.
> >
> > My only problem was my past experience on non-Linux systems
> > where once we said it works for multiple file types, we had
> > to support that forever across all filesystems. We could add
> > support for more types but not eliminate supported ones.
>
> 	Someone else pointed out that a naive user might reflink a block
> device file and expect the device contents to be copied-on-write.
> Obviously wrong if you understand filesystems, but let's just prevent
> that misunderstanding. S_ISREG() it is.

Btrfs won't be doing single directories, and I'd rather keep using a
dedicated ioctl for snapshotting whole subvolumes.

The semantics described here all sound sane, if this looks like the
final-ish rev I'll try to find someone interested in wiring it up to the
btrfs clone ioctl. It just needs a wrapper to create the new inode and
copy xattrs/acls over.

Thanks for doing all of this Joel.

-chris
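For readers who want to try the clone operation Chris mentions, here is a minimal userspace sketch. It assumes a much later kernel than this thread: the FICLONE ioctl from <linux/fs.h> (Linux 4.5 and newer), which is the generic name later given to the btrfs clone ioctl. clone_file() is a hypothetical wrapper name. The sketch shares data extents only and does not copy xattrs or ACLs, which Chris notes a reflink wrapper would also need to do.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>	/* FICLONE, on Linux 4.5+ headers */

/* Fallback for older headers; same number as the btrfs clone ioctl. */
#ifndef FICLONE
#define FICLONE	_IOW(0x94, 9, int)
#endif

/*
 * Hypothetical wrapper: create dst and make it share src's data
 * extents copy-on-write.  Returns 0 on success or a negative errno
 * value.  Attributes (owner, xattrs, ACLs) are NOT copied.
 */
static int clone_file(const char *src, const char *dst)
{
	int src_fd, dst_fd, ret = 0;

	src_fd = open(src, O_RDONLY);
	if (src_fd < 0)
		return -errno;

	/* O_EXCL: refuse to overwrite an existing destination. */
	dst_fd = open(dst, O_WRONLY | O_CREAT | O_EXCL, 0600);
	if (dst_fd < 0) {
		ret = -errno;
		close(src_fd);
		return ret;
	}

	if (ioctl(dst_fd, FICLONE, src_fd) < 0) {
		ret = -errno;
		unlink(dst);	/* do not leave an empty file behind */
	}

	close(dst_fd);
	close(src_fd);
	return ret;
}
```

On a filesystem without clone support the ioctl fails with an error such as EOPNOTSUPP, and the wrapper removes the empty destination it created.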
* Re: [RFC] The reflink(2) system call v2.
From: Jamie Lokier @ 2009-05-12 20:36 UTC
To: jim owens, linux-fsdevel, jmorris, ocfs2-devel, viro, mtk.manpages, linux-security-module

Joel Becker wrote:
> 	Someone else pointed out that a naive user might reflink a block
> device file and expect the device contents to be copied-on-write.
> Obviously wrong if you understand filesystems, but let's just prevent
> that misunderstanding. S_ISREG() it is.

I think S_ISLNK() should be allowed too if the filesystem allows, as
it is harmless, behaves as expected, saves a little space, and copying
symlink attributes is meaningful too.

-- Jamie