linux-btrfs.vger.kernel.org archive mirror
From: Chris Mason <chris.mason@oracle.com>
To: btrfs-devel@oss.oracle.com
Cc: Sage Weil <sage@newdream.net>, linux-btrfs@vger.kernel.org
Subject: Re: [Btrfs-devel] cloning file data
Date: Fri, 25 Apr 2008 09:41:35 -0400
Message-ID: <200804250941.35343.chris.mason@oracle.com>
In-Reply-To: <Pine.LNX.4.64.0804241352130.23551@cobra.newdream.net>

On Thursday 24 April 2008, Sage Weil wrote:
> Hi-
>
> I'm working on a clone ioctl that will quickly and efficiently duplicate
> the contents of a file, e.g.

Very cool.  I'd actually love to see this wrapped into a program that will
COW a directory tree.  Basically the same as cp -al, but with COW instead of
hard linking.

>
> int main(int argc, const char **argv)
> {
>   int in = open(argv[1], O_RDONLY);
>   int out = open(argv[2], O_CREAT|O_TRUNC|O_WRONLY, 0644);
>   ioctl(out, BTRFS_IOC_CLONE, in);
>   close(in);
>   close(out);
>   return 0;
> }
>
> I've probably got the locking order a bit wrong, lots of error handling is
> missing, and I suspect there's a cleaner way to do the target inode size
> update, but it behaves well enough in my (limited :) testing.
>
> Oh, and I wasn't certain the 'offset' in file_extent_item could be safely
> ignored when duplicating the extent reference.  My assumption was that it
> is orthogonal to extent allocation and isn't related to the backref.
> However, btrfs_insert_file_extent() always set offset=0.  I'm guessing I
> need to add an argument there and fix up the other callers?
>
Yes, you need to preserve the offset.  There's only one place right now that
sets a non-zero offset, and it inserts the extent by hand for other reasons
(if you're brave, see file.c:btrfs_drop_extents).

The reason file extents have an offset field is to allow COW without 
read/modify/write.  Picture something like this:

# create a single 100MB extent in file foo
dd if=/dev/zero of=foo bs=1M count=100
sync

# write into the middle
dd if=/dev/zero of=foo bs=4k count=1 seek=100 conv=notrunc
sync

We've written into the middle of that 100MB extent, and we need to do COW.  
One option is to read the whole thing, change 4k and write it all back.  
Instead, btrfs does something like this (+/- off by need more coffee errors):

file pos = 0 -> [ old extent, offset = 0, num_bytes = 400k ]
file pos = 409600 -> [ new 4k extent, offset = 0, num_bytes = 4k ]
file pos = 413696 -> [ old extent, offset = 413696, num_bytes = 100MB - 404k]

An extra reference is taken on the old extent to reflect that we're pointing 
to it twice.
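
The split described above can be sketched in plain user-space C.  This is only
an illustration of the arithmetic, not btrfs code: struct file_extent, its
disk_id field, and cow_split() are hypothetical names made up for the example,
and it assumes the write lies entirely inside one existing extent.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for a btrfs file extent item. */
struct file_extent {
    uint64_t file_pos;   /* where this piece starts in the file */
    uint64_t disk_id;    /* which on-disk extent it points at (illustrative) */
    uint64_t offset;     /* offset into that on-disk extent */
    uint64_t num_bytes;  /* length of this piece */
};

/*
 * Split `old` around a COW write of `len` bytes at file position `pos`.
 * The new data lives in the on-disk extent `new_disk_id`.  Up to three
 * items are written to out[]; the return value is how many were written,
 * or 0 if the write does not fit inside `old`.
 */
size_t cow_split(const struct file_extent *old, uint64_t pos, uint64_t len,
                 uint64_t new_disk_id, struct file_extent out[3])
{
    size_t n = 0;
    uint64_t old_end, write_end;

    assert(old != NULL && out != NULL);
    old_end = old->file_pos + old->num_bytes;
    write_end = pos + len;

    if (len == 0 || pos < old->file_pos || write_end > old_end)
        return 0;

    /* Head: still points at the old extent, starting at its old offset. */
    if (pos > old->file_pos)
        out[n++] = (struct file_extent){ old->file_pos, old->disk_id,
                                         old->offset, pos - old->file_pos };

    /* Middle: the freshly written data in its own new extent. */
    out[n++] = (struct file_extent){ pos, new_disk_id, 0, len };

    /* Tail: points at the old extent again, with a non-zero offset. */
    if (write_end < old_end)
        out[n++] = (struct file_extent){ write_end, old->disk_id,
                                         old->offset + (write_end - old->file_pos),
                                         old_end - write_end };
    return n;
}
```

Calling cow_split() on a single 100MB extent at file position 0 with pos =
409600 and len = 4096 yields the three items listed above, and the head and
tail both reference the old extent, which is why it needs the extra reference.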

> Anyway, any comments or suggestions (on the interface or implementation)
> are welcome.. :)
>
By taking the inode mutex, you protect against file_write and truncates
changing the file.  But we also need to prevent mmaps from changing the file
pages.  What you want to do is lock all the file bytes in the extent
tree:

lock_extent(&BTRFS_I(src_inode)->io_tree, 0, (u64)-1, GFP_NOFS);

But unfortunately, the code to fill delayed allocation takes that same lock.  
So you need to loop a bit:

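while (1) {
    /* flush dirty pages first; filling delalloc takes the extent lock itself */
    filemap_write_and_wait(src_inode->i_mapping);
    lock_extent(&BTRFS_I(src_inode)->io_tree, 0, (u64)-1, GFP_NOFS);
    /* nothing was dirtied between the flush and the lock: exit with it held */
    if (BTRFS_I(src_inode)->delalloc_bytes == 0)
        break;
    /* a write snuck in; drop the lock and flush again */
    unlock_extent(&BTRFS_I(src_inode)->io_tree, 0, (u64)-1, GFP_NOFS);
}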

That should keep you from racing with btrfs_page_mkwrite()
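
To see why the loop has to retry, here is a small single-threaded user-space
model of it.  Nothing here is kernel code: struct model_inode, model_flush()
and lock_quiesced() are hypothetical names, and the racing mmap writer is
simulated by a counter that dirties 4k right after a flush.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical user-space model of the inode state involved. */
struct model_inode {
    bool range_locked;        /* models lock_extent() over the whole file */
    uint64_t delalloc_bytes;  /* dirty bytes not yet allocated on disk */
    int pending_mmap_writes;  /* simulated writers that sneak in after a flush */
};

/*
 * Models filemap_write_and_wait().  Filling delalloc needs the range lock,
 * so the flush must run while we are NOT holding it.
 */
static void model_flush(struct model_inode *inode)
{
    assert(!inode->range_locked);
    inode->delalloc_bytes = 0;

    /* Simulate an mmap write landing in the window before we lock. */
    if (inode->pending_mmap_writes > 0) {
        inode->pending_mmap_writes--;
        inode->delalloc_bytes += 4096;
    }
}

/*
 * Flush, lock, recheck; retry if anything was dirtied in between.
 * Returns with the range locked and no delalloc outstanding.
 * The return value is the number of passes the loop needed.
 */
int lock_quiesced(struct model_inode *inode)
{
    int passes = 0;

    for (;;) {
        passes++;
        model_flush(inode);
        inode->range_locked = true;
        if (inode->delalloc_bytes == 0)
            break;
        inode->range_locked = false;
    }
    return passes;
}
```

With one simulated racing write, the first pass finds delalloc_bytes non-zero
after taking the lock and has to drop it; the second pass comes back clean
and the function returns with the lock held.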

-chris
