From: Miao Xie <miaox@cn.fujitsu.com>
To: Hugo Mills <hugo@carfax.org.uk>,
Alex Lyakas <alex.btrfs@zadarastorage.com>,
Chen Yang <chenyang.fnst@cn.fujitsu.com>,
Jan Schmidt <list.btrfs@jan-o-sch.net>,
Alexander Block <ablock84@googlemail.com>,
linux-btrfs <linux-btrfs@vger.kernel.org>
Subject: Re: [PATCH] Btrfs/send: sparse and pre-allocated file support for, btrfs-send mechanism
Date: Thu, 24 Jan 2013 20:13:34 +0800 [thread overview]
Message-ID: <5101256E.4060603@cn.fujitsu.com> (raw)
In-Reply-To: <20130124115813.GC19527@carfax.org.uk>
On thu, 24 Jan 2013 11:58:13 +0000, Hugo Mills wrote:
> On Thu, Jan 24, 2013 at 07:39:17PM +0800, Miao Xie wrote:
>> On thu, 24 Jan 2013 10:53:17 +0200, Alex Lyakas wrote:
>>> Hi Chen,
>>> with all due respect, what do you mean by "I see" and "OK for users"?
>>> The semantics of the fallocate API with/without the
>>> FALLOC_FL_PUNCH_HOLE flag is not defined by me or you. As far as I
>>> understand, the file system *must* adhere to the semantics of this
>>> API, as defined elsewhere.
>>> Looking, for example, here:
>>> http://man7.org/linux/man-pages/man2/fallocate.2.html
>>>
>>> "Allocating disk space
>>> The default operation (i.e., mode is zero) of fallocate() allocates
>>> and initializes to zero the disk space within the range specified by
>>> offset and len."
>>>
>>> "Deallocating file space
>>> Specifying the FALLOC_FL_PUNCH_HOLE flag (available since Linux
>>> 2.6.38) in mode deallocates space (i.e., creates a hole) in the byte
>>> range starting at offset and continuing for len bytes. Within the
>>> specified range, partial file system blocks are zeroed, and whole file
>>> system blocks are removed from the file."
>>>
>>> These are clearly two different modes of operation, and I don't think
>>> you or me can decide otherwise, at this point.
>>
>> I think send/receive commands just need ensure the content of the file is
>> right, not the disk format. That is also the reason why the developers just
>> send out the zero data when they meet a hole or pre-allocated extent. From
>> this viewpoint, they are the same.
>
> I disagree on the "content vs representation" comment above. I
> would very much hope that the reference receive implementation can
> turn a string of zeroes (or a hole) back into a sparse file. As a
> user, I'd be quite irritated if, say, one of my sparse VM images ended
> up 5 times the size when I backed it up with send/receive, simply
> because it's gone from holey to a huge bunch of zeroes. That
> particular issue can be dealt with at the receiver level, though.
My explanation is not clear, sorry! What I want to say is send command needn't
differentiate between a hole and a pre-allocated extent, because both of them
are considered as zero. So just send out a hole even we meet a pre-allocated
extent is OK, I think.
Thanks
Miao
>
> Hugo.
>
>>> However, I may be not knowledgeable enough to confirm this.
>>> Jan/Alexander, can you perhaps comment on this?
>>>
>>> Thanks,
>>> Alex.
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Thu, Jan 24, 2013 at 3:54 AM, Chen Yang <chenyang.fnst@cn.fujitsu.com> wrote:
>>>> 于 2013-1-23 19:56, Alex Lyakas 写道:
>>>>> Hi,
>>>>>
>>>>> On Wed, Jan 23, 2013 at 1:04 PM, Chen Yang <chenyang.fnst@cn.fujitsu.com> wrote:
>>>>>> From: Chen Yang <chenyang.fnst@cn.fujitsu.com>
>>>>>> Date: Wed, 23 Jan 2013 11:21:51 +0800
>>>>>> Subject: [PATCH] Btrfs/send: sparse and pre-allocated file support for
>>>>>> btrfs-send mechanism
>>>>>>
>>>>>> When sending a file with sparse or pre-allocated part,
>>>>>> these parts will be sent as ZERO streams, and it's unnecessary.
>>>>>>
>>>>>> There are two ways to improve this, one is just skip the EMPTY parts,
>>>>>> and the other one is to add a punch command to send, when an EMPTY parts
>>>>>> was detected. But considering a case of incremental sends, if we choose
>>>>>> the first one, when a hole got punched into the file after the initial
>>>>>> send, the data will be unchanged on the receiving side when received
>>>>>> incrementally. So the second choice is right.
>>>>>>
>>>>>> Signed-off-by: Cheng Yang <chenyang.fnst@cn.fujitsu.com>
>>>>>> ---
>>>>>> fs/btrfs/send.c | 60 ++++++++++++++++++++++++++++++++++++++++++++++++++++++-
>>>>>> fs/btrfs/send.h | 3 +-
>>>>>> 2 files changed, 61 insertions(+), 2 deletions(-)
>>>>>>
>>>>>> diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
>>>>>> index 5445454..31e9aef 100644
>>>>>> --- a/fs/btrfs/send.c
>>>>>> +++ b/fs/btrfs/send.c
>>>>>> @@ -3585,6 +3585,52 @@ out:
>>>>>> return ret;
>>>>>> }
>>>>>>
>>>>>> +static int send_punch(struct send_ctx *sctx, u64 offset, u32 len)
>>>>>> +{
>>>>>> + int ret = 0;
>>>>>> + struct fs_path *p;
>>>>>> + mm_segment_t old_fs;
>>>>>> +
>>>>>> + p = fs_path_alloc(sctx);
>>>>>> + if (!p)
>>>>>> + return -ENOMEM;
>>>>>> +
>>>>>> + /*
>>>>>> + * vfs normally only accepts user space buffers for security reasons.
>>>>>> + * we only read from the file and also only provide the read_buf buffer
>>>>>> + * to vfs. As this buffer does not come from a user space call, it's
>>>>>> + * ok to temporary allow kernel space buffers.
>>>>>> + */
>>>>>> + old_fs = get_fs();
>>>>>> + set_fs(KERNEL_DS);
>>>>>> +
>>>>>> +verbose_printk("btrfs: send_fallocate offset=%llu, len=%d\n", offset, len);
>>>>>> +
>>>>>> + ret = open_cur_inode_file(sctx);
>>>>>> + if (ret < 0)
>>>>>> + goto out;
>>>>>> +
>>>>>> + ret = begin_cmd(sctx, BTRFS_SEND_C_PUNCH);
>>>>>> + if (ret < 0)
>>>>>> + goto out;
>>>>>> +
>>>>>> + ret = get_cur_path(sctx, sctx->cur_ino, sctx->cur_inode_gen, p);
>>>>>> + if (ret < 0)
>>>>>> + goto out;
>>>>>> +
>>>>>> + TLV_PUT_PATH(sctx, BTRFS_SEND_A_PATH, p);
>>>>>> + TLV_PUT_U64(sctx, BTRFS_SEND_A_FILE_OFFSET, offset);
>>>>>> + TLV_PUT_U64(sctx, BTRFS_SEND_A_SIZE, len);
>>>>>> +
>>>>>> + ret = send_cmd(sctx);
>>>>>> +
>>>>>> +tlv_put_failure:
>>>>>> +out:
>>>>>> + fs_path_free(sctx, p);
>>>>>> + set_fs(old_fs);
>>>>>> + return ret;
>>>>>> +}
>>>>>> +
>>>>>> /*
>>>>>> * Read some bytes from the current inode/file and send a write command to
>>>>>> * user space.
>>>>>> @@ -3718,6 +3764,7 @@ static int send_write_or_clone(struct send_ctx *sctx,
>>>>>> u64 pos = 0;
>>>>>> u64 len;
>>>>>> u32 l;
>>>>>> + u64 bytenr;
>>>>>> u8 type;
>>>>>>
>>>>>> ei = btrfs_item_ptr(path->nodes[0], path->slots[0],
>>>>>> @@ -3731,8 +3778,19 @@ static int send_write_or_clone(struct send_ctx *sctx,
>>>>>> * sure to send the whole thing
>>>>>> */
>>>>>> len = PAGE_CACHE_ALIGN(len);
>>>>>> - } else {
>>>>>> + } else if (type == BTRFS_FILE_EXTENT_REG) {
>>>>>> len = btrfs_file_extent_num_bytes(path->nodes[0], ei);
>>>>>> + bytenr = btrfs_file_extent_disk_bytenr(path->nodes[0], ei);
>>>>>> + if (bytenr == 0) {
>>>>>> + ret = send_punch(sctx, offset, len);
>>>>>> + goto out;
>>>>>> + }
>>>>>> + } else if (type == BTRFS_FILE_EXTENT_PREALLOC) {
>>>>>> + len = btrfs_file_extent_num_bytes(path->nodes[0], ei);
>>>>>> + ret = send_punch(sctx, offset, len);
>>>>>> + goto out;
>>>>>> + } else {
>>>>>> + BUG();
>>>>>> }
>>>>>
>>>>> Are these two cases really the same? In the bytenr == 0 we want to
>>>>> deallocate the range. While in the prealloc case, we want to ensure
>>>>> disk space allocation. Or am I mistaken?
>>>>> Looking at the receive side, you use the same command for both:
>>>>> ret = fallocate(r->write_fd,
>>>>> FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
>>>>> offset, len);
>>>>>
>>>>> Looking at btrfs_fallocate code, in that case it will always punch the hole:
>>>>> static long btrfs_fallocate(struct file *file, int mode,
>>>>> loff_t offset, loff_t len)
>>>>> ...
>>>>> if (mode & FALLOC_FL_PUNCH_HOLE)
>>>>> return btrfs_punch_hole(inode, offset, len);
>>>>>
>>>>> So maybe you should have two different commands, or add a flag to
>>>>> distinguish between the two cases?
>>>>>
>>>>> Thanks,
>>>>> Alex.
>>>>>
>>>>
>>>> I see sparse and pre-allocated parts both as EMPTY hole,
>>>> and although different for a file, it's OK for users.
>>>>
>>>> Thanks
>>>>
>>>>>
>>>>>
>>>>>>
>>>>>> if (offset + len > sctx->cur_inode_size)
>>>>>> diff --git a/fs/btrfs/send.h b/fs/btrfs/send.h
>>>>>> index 1bf4f32..659ac8f 100644
>>>>>> --- a/fs/btrfs/send.h
>>>>>> +++ b/fs/btrfs/send.h
>>>>>> @@ -20,7 +20,7 @@
>>>>>> #include "ctree.h"
>>>>>>
>>>>>> #define BTRFS_SEND_STREAM_MAGIC "btrfs-stream"
>>>>>> -#define BTRFS_SEND_STREAM_VERSION 1
>>>>>> +#define BTRFS_SEND_STREAM_VERSION 2
>>>>>>
>>>>>> #define BTRFS_SEND_BUF_SIZE (1024 * 64)
>>>>>> #define BTRFS_SEND_READ_SIZE (1024 * 48)
>>>>>> @@ -80,6 +80,7 @@ enum btrfs_send_cmd {
>>>>>> BTRFS_SEND_C_WRITE,
>>>>>> BTRFS_SEND_C_CLONE,
>>>>>>
>>>>>> + BTRFS_SEND_C_PUNCH,
>>>>>> BTRFS_SEND_C_TRUNCATE,
>>>>>> BTRFS_SEND_C_CHMOD,
>>>>>> BTRFS_SEND_C_CHOWN,
>
next prev parent reply other threads:[~2013-01-24 12:13 UTC|newest]
Thread overview: 10+ messages / expand[flat|nested] mbox.gz Atom feed top
[not found] <50FF58BC.7090604@cn.fujitsu.com>
2013-01-23 11:04 ` [PATCH] Btrfs/send: sparse and pre-allocated file support for, btrfs-send mechanism Chen Yang
2013-01-23 11:56 ` Alex Lyakas
[not found] ` <51009455.7040005@cn.fujitsu.com>
2013-01-24 8:53 ` Alex Lyakas
2013-01-24 11:39 ` Miao Xie
2013-01-24 11:58 ` Hugo Mills
2013-01-24 12:13 ` Miao Xie [this message]
2013-01-24 12:41 ` Hugo Mills
2013-01-24 17:13 ` Jan Schmidt
2013-01-24 18:06 ` Alex Lyakas
2013-01-25 8:56 ` Jan Schmidt
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=5101256E.4060603@cn.fujitsu.com \
--to=miaox@cn.fujitsu.com \
--cc=ablock84@googlemail.com \
--cc=alex.btrfs@zadarastorage.com \
--cc=chenyang.fnst@cn.fujitsu.com \
--cc=hugo@carfax.org.uk \
--cc=linux-btrfs@vger.kernel.org \
--cc=list.btrfs@jan-o-sch.net \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).