From: Miao Xie <miaox@cn.fujitsu.com>
To: Hugo Mills <hugo@carfax.org.uk>,
Alex Lyakas <alex.btrfs@zadarastorage.com>,
Chen Yang <chenyang.fnst@cn.fujitsu.com>,
Jan Schmidt <list.btrfs@jan-o-sch.net>,
Alexander Block <ablock84@googlemail.com>,
linux-btrfs <linux-btrfs@vger.kernel.org>
Subject: Re: [PATCH] Btrfs/send: sparse and pre-allocated file support for, btrfs-send mechanism
Date: Thu, 24 Jan 2013 20:13:34 +0800 [thread overview]
Message-ID: <5101256E.4060603@cn.fujitsu.com> (raw)
In-Reply-To: <20130124115813.GC19527@carfax.org.uk>
On thu, 24 Jan 2013 11:58:13 +0000, Hugo Mills wrote:
> On Thu, Jan 24, 2013 at 07:39:17PM +0800, Miao Xie wrote:
>> On thu, 24 Jan 2013 10:53:17 +0200, Alex Lyakas wrote:
>>> Hi Chen,
>>> with all due respect, what do you mean by "I see" and "OK for users"?
>>> The semantics of the fallocate API with/without the
>>> FALLOC_FL_PUNCH_HOLE flag is not defined by me or you. As far as I
>>> understand, the file system *must* adhere to the semantics of this
>>> API, as defined elsewhere.
>>> Looking, for example, here:
>>> http://man7.org/linux/man-pages/man2/fallocate.2.html
>>>
>>> "Allocating disk space
>>> The default operation (i.e., mode is zero) of fallocate() allocates
>>> and initializes to zero the disk space within the range specified by
>>> offset and len."
>>>
>>> "Deallocating file space
>>> Specifying the FALLOC_FL_PUNCH_HOLE flag (available since Linux
>>> 2.6.38) in mode deallocates space (i.e., creates a hole) in the byte
>>> range starting at offset and continuing for len bytes. Within the
>>> specified range, partial file system blocks are zeroed, and whole file
>>> system blocks are removed from the file."
>>>
>>> These are clearly two different modes of operation, and I don't think
>>> you or me can decide otherwise, at this point.
>>
>> I think send/receive commands just need ensure the content of the file is
>> right, not the disk format. That is also the reason why the developers just
>> send out the zero data when they meet a hole or pre-allocated extent. From
>> this viewpoint, they are the same.
>
> I disagree on the "content vs representation" comment above. I
> would very much hope that the reference receive implementation can
> turn a string of zeroes (or a hole) back into a sparse file. As a
> user, I'd be quite irritated if, say, one of my sparse VM images ended
> up 5 times the size when I backed it up with send/receive, simply
> because it's gone from holey to a huge bunch of zeroes. That
> particular issue can be dealt with at the receiver level, though.
My explanation is not clear, sorry! What I want to say is send command needn't
differentiate between a hole and a pre-allocated extent, because both of them
are considered as zero. So just send out a hole even we meet a pre-allocated
extent is OK, I think.
Thanks
Miao
>
> Hugo.
>
>>> However, I may be not knowledgeable enough to confirm this.
>>> Jan/Alexander, can you perhaps comment on this?
>>>
>>> Thanks,
>>> Alex.
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Thu, Jan 24, 2013 at 3:54 AM, Chen Yang <chenyang.fnst@cn.fujitsu.com> wrote:
>>>> 于 2013-1-23 19:56, Alex Lyakas 写道:
>>>>> Hi,
>>>>>
>>>>> On Wed, Jan 23, 2013 at 1:04 PM, Chen Yang <chenyang.fnst@cn.fujitsu.com> wrote:
>>>>>> From: Chen Yang <chenyang.fnst@cn.fujitsu.com>
>>>>>> Date: Wed, 23 Jan 2013 11:21:51 +0800
>>>>>> Subject: [PATCH] Btrfs/send: sparse and pre-allocated file support for
>>>>>> btrfs-send mechanism
>>>>>>
>>>>>> When sending a file with sparse or pre-allocated part,
>>>>>> these parts will be sent as ZERO streams, and it's unnecessary.
>>>>>>
>>>>>> There are two ways to improve this, one is just skip the EMPTY parts,
>>>>>> and the other one is to add a punch command to send, when an EMPTY parts
>>>>>> was detected. But considering a case of incremental sends, if we choose
>>>>>> the first one, when a hole got punched into the file after the initial
>>>>>> send, the data will be unchanged on the receiving side when received
>>>>>> incrementally. So the second choice is right.
>>>>>>
>>>>>> Signed-off-by: Cheng Yang <chenyang.fnst@cn.fujitsu.com>
>>>>>> ---
>>>>>> fs/btrfs/send.c | 60 ++++++++++++++++++++++++++++++++++++++++++++++++++++++-
>>>>>> fs/btrfs/send.h | 3 +-
>>>>>> 2 files changed, 61 insertions(+), 2 deletions(-)
>>>>>>
>>>>>> diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
>>>>>> index 5445454..31e9aef 100644
>>>>>> --- a/fs/btrfs/send.c
>>>>>> +++ b/fs/btrfs/send.c
>>>>>> @@ -3585,6 +3585,52 @@ out:
>>>>>> return ret;
>>>>>> }
>>>>>>
>>>>>> +static int send_punch(struct send_ctx *sctx, u64 offset, u32 len)
>>>>>> +{
>>>>>> + int ret = 0;
>>>>>> + struct fs_path *p;
>>>>>> + mm_segment_t old_fs;
>>>>>> +
>>>>>> + p = fs_path_alloc(sctx);
>>>>>> + if (!p)
>>>>>> + return -ENOMEM;
>>>>>> +
>>>>>> + /*
>>>>>> + * vfs normally only accepts user space buffers for security reasons.
>>>>>> + * we only read from the file and also only provide the read_buf buffer
>>>>>> + * to vfs. As this buffer does not come from a user space call, it's
>>>>>> + * ok to temporary allow kernel space buffers.
>>>>>> + */
>>>>>> + old_fs = get_fs();
>>>>>> + set_fs(KERNEL_DS);
>>>>>> +
>>>>>> +verbose_printk("btrfs: send_fallocate offset=%llu, len=%d\n", offset, len);
>>>>>> +
>>>>>> + ret = open_cur_inode_file(sctx);
>>>>>> + if (ret < 0)
>>>>>> + goto out;
>>>>>> +
>>>>>> + ret = begin_cmd(sctx, BTRFS_SEND_C_PUNCH);
>>>>>> + if (ret < 0)
>>>>>> + goto out;
>>>>>> +
>>>>>> + ret = get_cur_path(sctx, sctx->cur_ino, sctx->cur_inode_gen, p);
>>>>>> + if (ret < 0)
>>>>>> + goto out;
>>>>>> +
>>>>>> + TLV_PUT_PATH(sctx, BTRFS_SEND_A_PATH, p);
>>>>>> + TLV_PUT_U64(sctx, BTRFS_SEND_A_FILE_OFFSET, offset);
>>>>>> + TLV_PUT_U64(sctx, BTRFS_SEND_A_SIZE, len);
>>>>>> +
>>>>>> + ret = send_cmd(sctx);
>>>>>> +
>>>>>> +tlv_put_failure:
>>>>>> +out:
>>>>>> + fs_path_free(sctx, p);
>>>>>> + set_fs(old_fs);
>>>>>> + return ret;
>>>>>> +}
>>>>>> +
>>>>>> /*
>>>>>> * Read some bytes from the current inode/file and send a write command to
>>>>>> * user space.
>>>>>> @@ -3718,6 +3764,7 @@ static int send_write_or_clone(struct send_ctx *sctx,
>>>>>> u64 pos = 0;
>>>>>> u64 len;
>>>>>> u32 l;
>>>>>> + u64 bytenr;
>>>>>> u8 type;
>>>>>>
>>>>>> ei = btrfs_item_ptr(path->nodes[0], path->slots[0],
>>>>>> @@ -3731,8 +3778,19 @@ static int send_write_or_clone(struct send_ctx *sctx,
>>>>>> * sure to send the whole thing
>>>>>> */
>>>>>> len = PAGE_CACHE_ALIGN(len);
>>>>>> - } else {
>>>>>> + } else if (type == BTRFS_FILE_EXTENT_REG) {
>>>>>> len = btrfs_file_extent_num_bytes(path->nodes[0], ei);
>>>>>> + bytenr = btrfs_file_extent_disk_bytenr(path->nodes[0], ei);
>>>>>> + if (bytenr == 0) {
>>>>>> + ret = send_punch(sctx, offset, len);
>>>>>> + goto out;
>>>>>> + }
>>>>>> + } else if (type == BTRFS_FILE_EXTENT_PREALLOC) {
>>>>>> + len = btrfs_file_extent_num_bytes(path->nodes[0], ei);
>>>>>> + ret = send_punch(sctx, offset, len);
>>>>>> + goto out;
>>>>>> + } else {
>>>>>> + BUG();
>>>>>> }
>>>>>
>>>>> Are these two cases really the same? In the bytenr == 0 we want to
>>>>> deallocate the range. While in the prealloc case, we want to ensure
>>>>> disk space allocation. Or am I mistaken?
>>>>> Looking at the receive side, you use the same command for both:
>>>>> ret = fallocate(r->write_fd,
>>>>> FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
>>>>> offset, len);
>>>>>
>>>>> Looking at btrfs_fallocate code, in that case it will always punch the hole:
>>>>> static long btrfs_fallocate(struct file *file, int mode,
>>>>> loff_t offset, loff_t len)
>>>>> ...
>>>>> if (mode & FALLOC_FL_PUNCH_HOLE)
>>>>> return btrfs_punch_hole(inode, offset, len);
>>>>>
>>>>> So maybe you should have two different commands, or add a flag to
>>>>> distinguish between the two cases?
>>>>>
>>>>> Thanks,
>>>>> Alex.
>>>>>
>>>>
>>>> I see sparse and pre-allocated parts both as EMPTY hole,
>>>> and although different for a file, it's OK for users.
>>>>
>>>> Thanks
>>>>
>>>>>
>>>>>
>>>>>>
>>>>>> if (offset + len > sctx->cur_inode_size)
>>>>>> diff --git a/fs/btrfs/send.h b/fs/btrfs/send.h
>>>>>> index 1bf4f32..659ac8f 100644
>>>>>> --- a/fs/btrfs/send.h
>>>>>> +++ b/fs/btrfs/send.h
>>>>>> @@ -20,7 +20,7 @@
>>>>>> #include "ctree.h"
>>>>>>
>>>>>> #define BTRFS_SEND_STREAM_MAGIC "btrfs-stream"
>>>>>> -#define BTRFS_SEND_STREAM_VERSION 1
>>>>>> +#define BTRFS_SEND_STREAM_VERSION 2
>>>>>>
>>>>>> #define BTRFS_SEND_BUF_SIZE (1024 * 64)
>>>>>> #define BTRFS_SEND_READ_SIZE (1024 * 48)
>>>>>> @@ -80,6 +80,7 @@ enum btrfs_send_cmd {
>>>>>> BTRFS_SEND_C_WRITE,
>>>>>> BTRFS_SEND_C_CLONE,
>>>>>>
>>>>>> + BTRFS_SEND_C_PUNCH,
>>>>>> BTRFS_SEND_C_TRUNCATE,
>>>>>> BTRFS_SEND_C_CHMOD,
>>>>>> BTRFS_SEND_C_CHOWN,
>
next prev parent reply other threads:[~2013-01-24 12:13 UTC|newest]
Thread overview: 10+ messages / expand[flat|nested] mbox.gz Atom feed top
[not found] <50FF58BC.7090604@cn.fujitsu.com>
2013-01-23 11:04 ` [PATCH] Btrfs/send: sparse and pre-allocated file support for, btrfs-send mechanism Chen Yang
2013-01-23 11:56 ` Alex Lyakas
[not found] ` <51009455.7040005@cn.fujitsu.com>
2013-01-24 8:53 ` Alex Lyakas
2013-01-24 11:39 ` Miao Xie
2013-01-24 11:58 ` Hugo Mills
2013-01-24 12:13 ` Miao Xie [this message]
2013-01-24 12:41 ` Hugo Mills
2013-01-24 17:13 ` Jan Schmidt
2013-01-24 18:06 ` Alex Lyakas
2013-01-25 8:56 ` Jan Schmidt
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=5101256E.4060603@cn.fujitsu.com \
--to=miaox@cn.fujitsu.com \
--cc=ablock84@googlemail.com \
--cc=alex.btrfs@zadarastorage.com \
--cc=chenyang.fnst@cn.fujitsu.com \
--cc=hugo@carfax.org.uk \
--cc=linux-btrfs@vger.kernel.org \
--cc=list.btrfs@jan-o-sch.net \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.