From: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
To: Eric Blake <eblake@redhat.com>, Kevin Wolf <kwolf@redhat.com>,
Max Reitz <mreitz@redhat.com>
Cc: Andrey Shinkevich <andrey.shinkevich@virtuozzo.com>,
qemu-devel <qemu-devel@nongnu.org>,
qemu block <qemu-block@nongnu.org>
Subject: Re: backing chain & block status & filters
Date: Wed, 29 Apr 2020 12:15:35 +0300
Message-ID: <7b1d4246-e59b-0fdb-3c44-6810eea6e5b8@virtuozzo.com>
In-Reply-To: <91b741ac-248c-2065-17b9-7fe31eafee40@virtuozzo.com>
[-- Attachment #1: Type: text/plain, Size: 3851 bytes --]
28.04.2020 22:44, Vladimir Sementsov-Ogievskiy wrote:
> 28.04.2020 19:46, Vladimir Sementsov-Ogievskiy wrote:
>> 28.04.2020 19:18, Eric Blake wrote:
>>> On 4/28/20 10:13 AM, Vladimir Sementsov-Ogievskiy wrote:
>>>
>>>>>> Hm. I could imagine that there are formats that have non-zero holes
>>>>>> (e.g. 0xff or just garbage). It would be a bit wrong for them to return
>>>>>> ZERO or DATA then.
>>>>>>
>>>>>> But OTOH we don’t care about such cases, do we? We need to know whether
>>>>>> ranges are zero, data, or unallocated. If they aren’t zero, we only
>>>>>> care about whether reading from it will return data from this layer or not.
>>>>>>
>>>>>> So I suppose that anything that doesn’t support backing files (or
>>>>>> filtered children) should always return ZERO and/or DATA.
>>>>>
>>>>> I'm not sure I agree with the notion that everything should be
>>>>> BDRV_BLOCK_ALLOCATED at the lowest layer. It's not what it means today
>>>>> at least. If we want to change this, we will have to check all callers
>>>>> of bdrv_is_allocated() and friends who might use this to find holes in
>>>>> the file.
>>>>
>>>> Yes. Because they are doing an incorrect (or at least undocumented and unreliable) thing.
>>>
>>> Here's some previous mails discussing the same question about what block_status should actually mean. At the time, I was so scared of the prospect of something breaking if I changed things that I ended up keeping status quo, so here we are revisiting the topic several years later, still asking the same questions.
>>>
>>> https://lists.gnu.org/archive/html/qemu-devel/2017-12/msg00069.html
>>> https://lists.gnu.org/archive/html/qemu-devel/2018-02/msg03757.html
>>>
>>>>
>>>>>
>>>>> Basically, the way bdrv_is_allocated() works today is that we assume an
>>>>> implicit zeroed backing layer even for block drivers that don't support
>>>>> backing files.
>>>>
>>>> But read doesn't work that way: it reads data from the bottom layer, not from
>>>> this implicit zeroed backing layer. And that is inconsistent. On read, data
>>>> comes exactly from this layer, not from its implicit backing. So it should
>>>> return BDRV_BLOCK_ALLOCATED, according to its definition.
>>>>
>>>> Or, we should at least document current behavior:
>>>>
>>>> BDRV_BLOCK_ALLOCATED: the content of the block is determined by this
>>>> layer rather than any backing, set by block. Attention: it may not be set
>>>> for drivers without backing support, although data is of course still read from
>>>> this layer. Note that for such drivers BDRV_BLOCK_ALLOCATED may mean
>>>> allocation at the fs level, which occupies real space on disk. So, for such drivers:
>>>>
>>>> ZERO | ALLOCATED means: reads as zero; data may be allocated on the fs, or
>>>> (most probably) not.
>>>> Don't look at the ALLOCATED flag, as it is added by the generic layer for other logic,
>>>> not related to fs allocation.
>>>>
>>>> 0 means that, most probably, data doesn't occupy space on the fs; zero status is
>>>> unknown (most probably non-zero).
>>>>
>>>
>>> That may be right in describing the current situation, but again, needs a GOOD audit of what we are actually using it for, and whether it is what we really WANT to be using it for. If we're going to audit/refactor the code, we might as well get semantics that are actually useful, rather than painfully contorted to documentation that happens to match our current contorted code.
>>>
>>
>> Honest enough:) I'll try to make a table.
>>
>> I don't think that reporting fs-allocation status is a bad thing. But I'm sure that it should be separated from backing-chain-allocated concept.
>>
>
> As a first step, I've done a brief analysis of .bdrv_co_block_status of drivers (attached)
>
As a second step, here is a brief analysis of all block_status usage (attached).
--
Best regards,
Vladimir
[-- Attachment #2: block-status-usage-report --]
[-- Type: text/plain, Size: 4098 bytes --]
Public interface of block-status is:
bdrv_block_status
bdrv_block_status_above
bdrv_is_allocated
bdrv_is_allocated_above
= bdrv_block_status =
bdrv_make_zero: works on the current level of the backing chain, wants to skip zeroes, not interested in @map and @file
img convert: convert_iteration_sectors: wants to distinguish ZERO, DATA and go-to-backing. It also tries not to write zeroes if the backing file is short, but does it a bit wrong. Treats unallocated as DATA if there is no backing.
img-map: get_block_status: distinguishes ZERO, DATA and go-to-backing. Counts the depth in the backing chain. Reports only the final ZERO and DATA, so the fs-unallocated state is reported to the user.
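
The three outcomes named above (ZERO, DATA, go-to-backing) can be sketched as a small Python model. This is not QEMU code: the flag names follow QEMU, the numeric values are only illustrative, and classify() is a hypothetical helper that mirrors the behaviour described in this report.

```python
# Toy model of the per-layer decision made by callers of bdrv_block_status.
# Flag names follow QEMU; the numeric values are illustrative assumptions.
BDRV_BLOCK_DATA = 0x01
BDRV_BLOCK_ZERO = 0x02
BDRV_BLOCK_ALLOCATED = 0x10


def classify(status, has_backing):
    """Return 'zero', 'data' or 'backing' for one layer's block status."""
    if status & BDRV_BLOCK_ZERO:
        return "zero"
    if status & BDRV_BLOCK_DATA:
        return "data"
    # Neither ZERO nor DATA: this layer does not provide the range.
    # With a backing file the lookup continues there; without one,
    # the range is treated as data (as the report says convert does).
    return "backing" if has_backing else "data"
```

Note that the last branch is where an fs-unallocated range of a protocol driver and a go-to-backing range become indistinguishable: both arrive as status 0.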
= bdrv_block_status_above =
block-copy: block_copy_block_status: wants two things:
1. skip go-to-backing holes in top layer for top mode
2. do write_zero for ZERO areas
mirror: call on the whole backing chain
- for DATA (and for DATA|ZERO, which is bad): just copy
- for ZERO: just write zeroes
- for 0 (which means that the bottom layer doesn't report that unallocated areas are zero): DISCARD (which most probably zeroes the area) - absolutely the wrong thing
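
A minimal sketch of the mirror decision table listed above, written as a Python model rather than the actual mirror code. The function name mirror_action() and the returned strings are hypothetical; the flag values are illustrative.

```python
# Toy model of mirror's choice per range, as described in the list above.
BDRV_BLOCK_DATA = 0x01
BDRV_BLOCK_ZERO = 0x02


def mirror_action(status):
    """Map a whole-chain block status to the operation done on the target."""
    if status & BDRV_BLOCK_DATA:
        # DATA wins even when ZERO is also set, so zeroes get copied.
        return "copy"
    if status & BDRV_BLOCK_ZERO:
        return "zero"
    # Status 0: nothing is known about the content, yet the target
    # range is discarded, which may not match what a read would return.
    return "discard"
```

The last branch is the problematic one: status 0 from the bottom layer says nothing about the content, so discarding the target range is only correct by luck.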
qcow2: is_zero: call on the whole backing chain, wants just to check whether the range reads as zero or not.
qcow2: qcow2_measure: call on the whole backing chain:
- skip ZERO
- count clusters with both DATA and ALLOCATED set. Hmm. ALLOCATED is always set for DATA. It seems the function actually tries to calculate disk occupation, assuming that BDRV_BLOCK_ALLOCATED helps with that, but it actually doesn't.
I think the correct solution is to support offset and bytes in bdrv_measure, and split it from block_status. Then qcow2_measure will just recursively call bdrv_measure on its children. This would be clean.
nbd: nbd_co_send_sparse_read: call on the whole backing chain:
- wants to distinguish zeroes
nbd: blockstatus_to_extents: call on the whole backing chain:
!ALLOCATED -> NBD_HOLE
ZERO -> NBD_ZERO
So, we report HOLE only if the range is not BDRV_BLOCK_ALLOCATED on any layer. That's wrong. I think we should report HOLE in a lot more cases: actually, whenever the range does not occupy real space on disk.
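
The mapping above can be modelled in a few lines of Python. This is a sketch, not the NBD server code: to_nbd_flags() is a hypothetical name, the BDRV_* values are illustrative, and NBD_STATE_HOLE / NBD_STATE_ZERO (abbreviated NBD_HOLE / NBD_ZERO in this report) use the bit positions of the NBD base:allocation context.

```python
# Toy model of the block-status to NBD extent-flag mapping described above.
BDRV_BLOCK_DATA = 0x01
BDRV_BLOCK_ZERO = 0x02
BDRV_BLOCK_ALLOCATED = 0x10

NBD_STATE_HOLE = 1 << 0
NBD_STATE_ZERO = 1 << 1


def to_nbd_flags(status):
    """Translate a whole-chain block status into NBD extent flags."""
    flags = 0
    if not (status & BDRV_BLOCK_ALLOCATED):
        flags |= NBD_STATE_HOLE
    if status & BDRV_BLOCK_ZERO:
        flags |= NBD_STATE_ZERO
    return flags
```

With this mapping a range reported as ZERO | ALLOCATED yields NBD_STATE_ZERO without NBD_STATE_HOLE, even if it occupies no space on disk, which is the case the paragraph above objects to.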
img-compare: call on the whole backing chain:
- do not compare zeroes
- do not compare if both report unallocated. That's actually not correct for protocols which report fs-unallocated non-zeroes, as reads may actually differ. Still, a read from an fs-unallocated area is not guaranteed to return the same thing each time, yes? At least, null-co doesn't guarantee it :) So it may be correct to skip these areas. Or maybe it's better to always report them as different?
- consider data-zeroes equal to unallocated. That's definitely not correct for protocols which report fs-unallocated non-zeroes
I think img-compare must only consider zero/non-zero, and not touch other block-status features. Otherwise it's a mess.
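
A minimal sketch of the zero/non-zero-only rule suggested above, as a Python model. can_skip_compare() is a hypothetical helper, not part of qemu-img, and the flag values are illustrative.

```python
# Toy model: decide whether two ranges may be skipped without reading them.
BDRV_BLOCK_DATA = 0x01
BDRV_BLOCK_ZERO = 0x02
BDRV_BLOCK_ALLOCATED = 0x10


def can_skip_compare(status_a, status_b):
    """Skip only when both sides are known to read as zero.

    DATA and ALLOCATED are deliberately ignored: under this rule they
    say nothing about whether the two ranges have equal content.
    """
    return bool(status_a & BDRV_BLOCK_ZERO) and bool(status_b & BDRV_BLOCK_ZERO)
```

Under this rule two ranges that both report status 0 are read and compared rather than assumed equal, at the cost of some extra I/O.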
img-convert: convert_iteration_sectors: call on the whole backing chain: already described in bdrv_block_status section
= bdrv_is_allocated =
Obvious thing for backing-chain related operations (still, it's wrong that some protocol drivers may return fs-unallocated and it is treated as a go-to-backing area):
block-copy, commit, copy-on-read, stream, img-rebase
Others:
vvfat: o_O it has a qcow child, and operates as if it were itself the backing of this child. But yes, it just uses bdrv_is_allocated to find out whether a chunk has been rewritten in qcow.
migration/block: skip unallocated for top mode (shared_base, as it is called here)
io-alloc: just report number of allocated in top layer
io-map: map_is_allocated: same thing as io-alloc, but report chunks
test_sync_op_block_status: just check what it returns
= bdrv_is_allocated_above =
Obvious usage for backing-chain related operations: commit, mirror, stream, img-rebase. Wrong for drivers that report fs-unallocated non-zero areas.
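
To illustrate why this is wrong for such drivers, here is a toy Python model of an allocated-above query. It is a sketch under stated assumptions, not QEMU's implementation: each layer is just a list of per-cluster status flags, is_allocated_above() is a simplified stand-in for the real function, and the flag values are illustrative.

```python
# Toy model of walking a backing chain and asking "allocated above base?".
BDRV_BLOCK_DATA = 0x01
BDRV_BLOCK_ZERO = 0x02
BDRV_BLOCK_ALLOCATED = 0x10


def is_allocated_above(chain, cluster, include_base=False):
    """chain[0] is the top layer, chain[-1] is the base.

    Returns True if any examined layer reports ALLOCATED for the cluster.
    """
    layers = chain if include_base else chain[:-1]
    return any(layer[cluster] & BDRV_BLOCK_ALLOCATED for layer in layers)


# Top layer provides cluster 1 only.
top = [0, BDRV_BLOCK_DATA | BDRV_BLOCK_ALLOCATED]
# Hypothetical base without backing support: cluster 0 is unallocated
# at the fs level and reported as 0, although reads are served from it.
base = [0, BDRV_BLOCK_DATA | BDRV_BLOCK_ALLOCATED]
chain = [top, base]
```

Here is_allocated_above(chain, 0, include_base=True) is False, so the caller concludes that cluster 0 comes from further down the chain, although in this model the base layer is the one that serves the read.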
Others:
qcow2: is_unallocated: call for the whole backing chain. Used to check is-zero. Wrong for drivers that report fs-unallocated non-zero areas, and could be more efficient if it also considered ZERO status, but in some smart, fast way.
replication: allocated or not in backing-chain: common case