From mboxrd@z Thu Jan 1 00:00:00 1970 From: Chris Mason Subject: Re: [Btrfs-devel] cloning file data Date: Fri, 25 Apr 2008 09:41:35 -0400 Message-ID: <200804250941.35343.chris.mason@oracle.com> References: Mime-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Cc: Sage Weil , linux-btrfs@vger.kernel.org To: btrfs-devel@oss.oracle.com Return-path: In-Reply-To: List-ID: On Thursday 24 April 2008, Sage Weil wrote: > Hi- > > I'm working on a clone ioctl that will quickly and efficiently duplicate > the contents of a file, e.g. Very cool. I'd actually loved to see this wrapped into a program that will cow a directory tree. Basically the same as cp -al, but with cow instead of linking. > > int main(int argc, const char **argv) > { > int in = open(argv[1], O_RDONLY); > int out = open(argv[2], O_CREAT|O_TRUNC|O_WRONLY, 0644); > ioctl(out, BTRFS_IOC_CLONE, in); > close(in); > close(out); > return 0; > } > > I've probably got the locking order a bit wrong, lots of error handling is > missing, and I suspect there's a cleaner way to do the target inode size > update, but it behaves well enough in my (limited :) testing. > > Oh, and I wasn't certain the 'offset' in file_extent_item could be safely > ignored when duplicating the extent reference. My assumption was that it > is orthogonal to extent allocation and isn't related to the backref. > However, btrfs_insert_file_extent() always set offset=0. I'm guessing I > need to add an argument there and fix up the other callers? > Yes, you need to preserve the offset. There's only one place right now that sets a non-zero offset and it inserts the extent by hand for other reasons (if you're brave, file.c:btrfs_drop_extents) The reason file extents have an offset field is to allow COW without read/modify/write. Picture something like this: # create a single 100MB extent in file foo dd if=/dev/zero of=foo bs=1M count=100 sync # write into the middle dd if=/dev/zero of=foo bs=4k count=1 seek=100 conv=notrunc sync We've written into the middle of that 100MB extent, and we need to do COW. One option is to read the whole thing, change 4k and write it all back. Instead, btrfs does something like this (+/- off by need more coffee errors): file pos = 0 -> [ old extent, offset = 0, num_bytes = 400k ] file pos = 409600 -> [ new 4k extent, offset = 0, num_bytes = 4k ] file pos = 413696 -> [ old extent, offset = 413696, num_bytes = 100MB - 404k] An extra reference is taken on the old extent to reflect that we're pointing to it twice. > Anyway, any comments or suggestions (on the interface or implemantation) > are welcome.. :) > By taking the inode mutex, you protect against file_write and truncates changing the file. But, we also need to prevent mmaps from changing the file pages as well. What you want to do lock all the file bytes in the extent tree: lock_extent(&BTRFS_I(src_inode)->io_tree, 0, (u64)-1, GFP_NOFS); But unfortunately, the code to fill delayed allocation takes that same lock. So you need to loop a bit: while(1) { filemap_write_and_wait(src_inode); lock_extent() if (BTRFS_I(src_inode)->delalloc_bytes == 0) break; unlock_extent() } That should keep you from racing with btrfs_page_mkwrite() -chris