From: "Darrick J. Wong" <darrick.wong@oracle.com>
To: Dave Chinner <david@fromorbit.com>
Cc: vito.caputo@coreos.com, xfs@pengaru.com, xfs <xfs@oss.sgi.com>
Subject: Re: question re: xfs inode to inode copy implementation
Date: Thu, 30 Apr 2015 01:03:43 -0700 [thread overview]
Message-ID: <20150430080343.GA28992@birch.djwong.org> (raw)
In-Reply-To: <20150430062614.GA15810@dastard>
On Thu, Apr 30, 2015 at 04:26:14PM +1000, Dave Chinner wrote:
> On Wed, Apr 29, 2015 at 07:53:30PM -0500, xfs@pengaru.com wrote:
> > On Mon, Apr 20, 2015 at 09:28:20PM -0700, Darrick J. Wong wrote:
> > > On Mon, Apr 20, 2015 at 08:06:46PM -0500, xfs@pengaru.com wrote:
> > > > My prototype already mostly works just using xfs_alloc_file_space() to
> > > > allocate the appropriate space in the destination inode, but I need to
> > > > get that allocated space populated from the shared inode's extents.
> > >
> > > I think you're asking how to copy extent map entries from one file to another?
> >
> > What I really wanted was something analogous to do_splice_direct() but for
> > operating on the inodes/address_space structs. I ended up just hacking
> > something together which does the copy ad-hoc directly via the address_space
> > structs and calling xfs_get_blocks() on buffer heads of the destination pages,
> > without any readahead or other optimizations, at least it reads from and
> > populates the page caches.
> >
> > It looks like what you guys are working on is a more granular/block-level COW
> > reflink implementation, which I assumed would be significantly more challenging
> > and well beyond my ability to hack up quickly for experimentation.
> >
> > Below I'll summarize what I've hacked together. It's probably inappropriate to
> > refer to this as a reflink, I've been referring to it as a COW-file in the
> > code.
>
> It's the same thing, just using a different COW mechanism to break
> the reflink.
>
> > A COW-file is a new inode which refers to another (shared) inode for its data
> > until the COW-file is opened for writing. At that point it clones the shared
> > inode's data as its own.
> >
> > Here's the mid-level details of the hack:
> >
> > 1. Add two members to the inode in the available padding:
> > be32 nlink_cow: Number of COW-file links to the inode
> > be64 inumber_cow: Number of backing inode if inode is a COW-file
>
> FYI, inode number alone is not enough to uniquely identify an inode.
> Needs the ino/gen pair as inode numbers can be reused.
>
> > 2. Introduc a new ioctl for creating a COW-file:
> > #define XFS_IOC_CREATE_COWFILE_AT _IOW ('X', 126, struct xfs_create_cowfile_at)
> >
> > typedef struct xfs_create_cowfile_at
> > {
> > int dfd; /* parent directory */
> > char name[MAXNAMELEN]; /* name to create */
> > umode_t mode; /* mode */
> > } xfs_create_cowfile_at_t;
>
> Let's not create yet another incompatible reflink API. Please use
> OCFS2_IOC_REFLINK as the API and pull the ioctl and infrastructure
> within fs/ocfs2/refcounttree.c up to the VFS, add a ->reflink method
> to the operations structure and then hook ocfs2 and the new XFS code
> up to that interface. We don't want to duplicate all that code
> unnecessarily in an incompatible fashion.
I've been using BTRFS_IOC_CLONE{,_RANGE} so far, and eyeing zab's proposed
copy_file_range syscall.
> > 3. Derive a xfs_create_cowfile() from xfs_create() and xfs_link():
> > xfs_create_cowfile(
> > xfs_inode_t *dp, /* parent directory */
> > xfs_inode_t *sip, /* shared inode (ioctl callee) */
> > struct xfs_name *name, /* name of cowfile to create in dp */
> > umode_t mode, /* mode */
> > xfs_dev_t rdev,
> > xfs_inode_t **ipp) /* new inode */
> >
> > - ipp = allocate inode
> > - ipp->i_mapping->host = sip
> > - bump sip->nlink_cow
> > - ipp->inumber_cow = sip->di_ino
> > - create name in dp referencing ipp
> >
> > ipp starts out with the shared inode hosting i_mapping
>
> That's kinda messy. I'd like to leave sharing page cache completely
> out of the picture first, and get all the inode and extent
> manipulations correct first. most important is all the inode life
> cycle and unlink/unshare operations sorted, especially w.r.t.
> locking and transactions.
Heh heh heh. I found another one hidden one of those just this evening
(zeroing ranges for punch/zerorange). In addition to buffered writes,
dio writes (which lazily fall back to buffered for now), rm, and block-punch.
There was also the little matter of making FIEMAP work.
> Once we have that sanely under control, page cache sharing is a
> matter of reflecting the shared extents in the page cache via
> multiple page references (i.e. one for each mapping tree the page is
> inserted into) rather than playing games trying to share mappings.
>
> > 4. Modify xfs_setup_inode() to be inumber_cow-aware, opening the shared inode
> > when set, and assigning to i_mapping->host.
>
> So you've added an xfs_iget() call to xfs_setup_inode() if the inode
> is a reflink inode? How does the locking work?
>
> > 5. Modify xfs_file_open() S_ISREG && inumber_cow && FMODE_WRITE:
> > - clear inumber_cow
> > - restore i_mapping->host to the inode being opened
> > - invalidate_inode_pages2(i_mapping)
> > - allocate all needed space in this inode
> > - copy size from shared inode to this inode
> > - copy all pages from the previously shared inode to this one
>
> So the copy is done on open(O_WR) of the file. files get opened in
> write mode for more than just writing date to them. e.g. fchmod,
> writing new attributes, etc. We don't want a data cow if this is the
> first operation that is done to the inode after the reflink is
> made...
That could be a lot of stuff to copy?
Anyway, I might have a RFC(rap) posting of reflink patches tomorrow.
I haven't added the "this is reflinked" inode flag optimization yet. I bet the
performance hit is awful. But, no more patching at 1am.
--D
> > 6. Modify xfs_vn_getattr() to copy stat->size from the shared inode if
> > inumer_cow is set.
>
> Why wouldn't you just put the size in the COW inode?
>
> > 7. Modify unlink paths to be nlink_cow-aware
>
> What happens when parent is unlinked? It stays on the unlinked list?
> Or do we know have the possibility that unlink can have two inodes
> to free in xfs-inactive()? What's the implication for transaction
> reservations? What's the locking order?
>
> reflink doesn't have this "readonly parent" wart. If the parent is
> modified, it breaks it's side of the reflink via COW. Simple,
> symmetric handling of the problem - it doesn't matter who writes to
> what, the unmodified inodes do not see any change.
>
> > 8. To simplify things, inodes that have nlink_cow no longer can be opened for
> > writing, they've become immutable backing stores of sorts for other inodes.
>
> So, someone marks a file as COW, and that makes it read only? What
> happens if there is already an open fd with write permissions when
> the COW ioctl is called? So, where does the immutable state of the
> inode get enforced? What's preventing it from being unlinked?
>
> > The one major question mark I see in this achieving correctness is the live
> > manipulation of i_mapping->host. It seems to work in my casual testing on
> > x86_64, this actually all works surprisingly well considering it's a fast and
> > nasty hack. I abuse invalidate_inode_pages2() as if a truncate has occurred,
> > forcing subsequent page accesses to fault and revisit i_mapping->host.
>
> I don't think you can't change mapping->host, as there is no one
> lock that protects access to it.
>
> > The goal here was to achieve something overlayfs-like but with inodes capable
> > of being chowned/chmodded without triggering the copy_up, operations likely
> > necessary for supporting user namespaces in containers.
>
> Yup, reflink gives us this.
>
> > Additionally,
> > overlayfs has some serious correctness issues WRT multiply-opened files
> > spanning the lower and upper layers due to one of the subsequent opens being a
> > writer. Since my hack creates distinct inodes from the start, no such issue
> > exists.
>
> Right, but it's not something that overlayfs can use unconditionally
> (reflink or your COW file) because overlayfs can have upper and
> lower directories on different filesystems. And, as has also been
> pointed out, it doesn't work on lower filesystems that are
> read-only.
>
> As such, I'm not sure that overlayfs will ever support this change
> of behaviour as it restricts overlayfs to a single writeable
> filesystem. I know, most people only need overlayfs on a single
> writeable filesystem, but....
>
> > However, one of the attractive things about overlayfs is the page cache sharing
> > which my XFS hack doesn't enable due to the distinct struct addres_space's and
> > i_mapping->host host exchanging. I had hoped to explore something KSM-like
> > with maybe some hints from XFS for these shared inode pagess saying "hey these
> > are read-only pages in a shared inode, try deduplicate them!" but KSM is purely
> > in vma land so that seems like a lot of work to make happen.
>
> As I mentioned before, page cache sharing needs to reflect shared
> blocks on disk. Playing games with KSM or mappings isn't a viable
> solution.
>
> > In any case, thanks for the responses and any further input you guys have.
> > Obviously a more granular btrfs-like reflink is preferable, and I'd welcome it.
> > It just seemed like doing something overlayfs-like would be a whole lot easier
> > to get working in a relatively short time.
>
> Overlayfs is creating more problems than it is solving. Hacking
> random one-off functionality into filesystems is not the way to fix
> overlayfs problems. Let's do reflink properly, then we can modify
> overlayfs to be able to use reflink when all it's working
> directories are on the same filesystem. The we can add page cache
> sharing to reflink files, and all of a sudden, the overlayfs page
> cache sharing problem is solved. Not to mention it is solved for all
> the other people that want reflink files for their applications,
> too.
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
prev parent reply other threads:[~2015-04-30 8:04 UTC|newest]
Thread overview: 10+ messages / expand[flat|nested] mbox.gz Atom feed top
2015-04-21 1:06 question re: xfs inode to inode copy implementation xfs
2015-04-21 4:28 ` Darrick J. Wong
2015-04-21 22:27 ` Dave Chinner
2015-04-23 0:44 ` Darrick J. Wong
2015-04-23 1:13 ` Dave Chinner
2015-04-23 6:40 ` Darrick J. Wong
2015-04-26 22:40 ` Dave Chinner
2015-04-30 0:53 ` xfs
2015-04-30 6:26 ` Dave Chinner
2015-04-30 8:03 ` Darrick J. Wong [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20150430080343.GA28992@birch.djwong.org \
--to=darrick.wong@oracle.com \
--cc=david@fromorbit.com \
--cc=vito.caputo@coreos.com \
--cc=xfs@oss.sgi.com \
--cc=xfs@pengaru.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox