From: Eric Blake <eblake@redhat.com>
To: Kevin Wolf <kwolf@redhat.com>
Cc: qemu-devel@nongnu.org, jsnow@redhat.com, famz@redhat.com,
qemu-block@nongnu.org, vsementsov@virtuozzo.com,
Stefan Hajnoczi <stefanha@redhat.com>,
Max Reitz <mreitz@redhat.com>
Subject: Re: [Qemu-devel] [PATCH v5 01/20] block: Add .bdrv_co_block_status() callback
Date: Fri, 1 Dec 2017 09:03:14 -0600 [thread overview]
Message-ID: <c2bde291-63d7-d94c-aac7-d894fa2a5218@redhat.com> (raw)
In-Reply-To: <20171201144047.GE4147@localhost.localdomain>
On 12/01/2017 08:40 AM, Kevin Wolf wrote:
>> Note that most drivers give sector-aligned answers, except at
>> end-of-file, even when request_alignment is smaller than a sector.
>> However, bdrv_getlength() is sector-aligned (even though it gives a
>> byte answer), often by exceeding the actual file size. If we were to
>> give back strict results, at least file-posix.c would report a
>> transition from DATA to HOLE at the end of a file even in the middle
>> of a sector, which can throw off callers; so we intentionally lie and
>> state that any partial sector at the end of a file has the same
>> status for the entire sector. Maybe at some future day we can
>> report actual file size instead of rounding up, but not for this
>> series.
>
> In what way does this throw off callers in practice?
>
Several iotests failed if I didn't do that (it's been a few months, so
the details are a bit fuzzy). I think the biggest problem is that
because we round the size up in bdrv_getlength(), but CANNOT access
those rounded bytes, then reporting the status of those bytes as a hole
(which is the only sane thing that file-posix can do) can cause grief
when the rest of the sector (which we CAN access) is data.
> The rounding will lead to strange effects, and I'm not sure that dealing
> with them is any easier than fixing the callers. Imagine an image file
> like this (very small example, file size 384 bytes):
>
> 0 128 384 512
> | | | |
> +----+--------+----+
> |Hole| Data | |
> +----+--------+----+
> |
> EOF
Unlikely. Holes are at least a sector in size on all known filesystems
that have holes; that's also true for qcow2 format. The only
non-sector-aligned hole that you can encounter in practice is at EOF.
>
> bdrv_co_block_status(offset=0, bytes=512) returns 512 bytes of HOLE.
> bdrv_co_block_status(offset=128, bytes=512) returns 384 bytes of DATA.
> bdrv_co_block_status(offset=384, bytes=512) returns 128 bytes of HOLE.
>
> This is not only contradictory, but the first one is almost begging for
> data corruption because it returns HOLE for a region that actually
> contains data.
I agree that it would be confusing, if it were possible. But in
practice it is not possible.
>
> The only excuse I can imagine is that we say that this can never happen
> because drivers use 512 byte granularity anyway. But why are we
> introducing the new interface then? I don't think this semantics is
> compatible with a bytes-based driver interface.
Really, the ONLY boundary that is unlikely to ever be 512-byte aligned
is at EOF - and we wouldn't even need to do any rounding if
bdrv_getlength() didn't round.
One thing I _can_ do: it is ALWAYS valid to report a partial sector as
data. It may pessimize the code slightly, but while rounding the size of
a hole up can be wrong if the rounding covers data but the caller
getting the rounded data treats that data as 0, rounding the size of
data up will never misbehave because the caller will just read the
literal zeroes. So I will tweak the code to make sure that if any
rounding takes place, that either the driver already set BDRV_BLOCK_EOF
(all further bytes also read as zero), or else report the rounded region
as data. (I thought I could get away with only io.c setting
BDRV_BLOCK_EOF; but it sounds like setting it in the drivers will be
helpful).
>
>> We also add an assertion that any driver using the new callback will
>> make progress (the only time pnum will be 0 is if the block layer
>> already handled an out-of-bounds request, or if there is an error).
>>
>> Signed-off-by: Eric Blake <eblake@redhat.com>
>>
>> + ret = bs->drv->bdrv_co_block_status(bs, want_zero, aligned_offset,
>> + aligned_bytes, pnum, &local_map,
>> + &local_file);
>> + if (ret < 0) {
>> + *pnum = 0;
>> + goto out;
>> + }
>> + assert(*pnum); /* The block driver must make progress */
I debated about combining this assert...
>> +
>> + /*
>> + * total_size is always sector-aligned, by sometimes exceeding actual
>> + * file size. Expand pnum if it lands mid-sector due to end-of-file.
>> + */
>> + if (QEMU_ALIGN_UP(*pnum + aligned_offset,
>
> aligned_offset + *pnum wouldn be the more conventional order.
Sure.
>
>> + BDRV_SECTOR_SIZE) == total_size) {
>> + *pnum = total_size - aligned_offset;
>> + }
>> +refine:
>> /*
>> * The driver's result must be a multiple of request_alignment.
>> * Clamp pnum and adjust map to original request.
...with these, but it would trigger on at least vmdk (see patch 17/20)
--
Eric Blake, Principal Software Engineer
Red Hat, Inc. +1-919-301-3266
Virtualization: qemu.org | libvirt.org
next prev parent reply other threads:[~2017-12-01 15:45 UTC|newest]
Thread overview: 26+ messages / expand[flat|nested] mbox.gz Atom feed top
2017-12-01 1:42 [Qemu-devel] [PATCH v5 00/20] add byte-based block_status driver callbacks Eric Blake
2017-12-01 1:42 ` [Qemu-devel] [PATCH v5 01/20] block: Add .bdrv_co_block_status() callback Eric Blake
2017-12-01 14:40 ` Kevin Wolf
2017-12-01 15:03 ` Eric Blake [this message]
2017-12-01 15:24 ` Kevin Wolf
2017-12-06 15:20 ` Eric Blake
2017-12-01 1:42 ` [Qemu-devel] [PATCH v5 02/20] block: Switch passthrough drivers to .bdrv_co_block_status() Eric Blake
2017-12-01 1:42 ` [Qemu-devel] [PATCH v5 03/20] file-posix: Switch " Eric Blake
2017-12-01 16:00 ` Kevin Wolf
2017-12-01 1:42 ` [Qemu-devel] [PATCH v5 04/20] gluster: " Eric Blake
2017-12-01 1:42 ` [Qemu-devel] [PATCH v5 05/20] iscsi: Switch cluster_sectors to byte-based Eric Blake
2017-12-01 1:42 ` [Qemu-devel] [PATCH v5 06/20] iscsi: Switch iscsi_allocmap_update() " Eric Blake
2017-12-01 1:42 ` [Qemu-devel] [PATCH v5 07/20] iscsi: Switch to .bdrv_co_block_status() Eric Blake
2017-12-01 1:42 ` [Qemu-devel] [PATCH v5 08/20] null: " Eric Blake
2017-12-01 1:42 ` [Qemu-devel] [PATCH v5 09/20] parallels: " Eric Blake
2017-12-01 1:42 ` [Qemu-devel] [PATCH v5 10/20] qcow: " Eric Blake
2017-12-01 1:42 ` [Qemu-devel] [PATCH v5 11/20] qcow2: " Eric Blake
2017-12-01 1:42 ` [Qemu-devel] [PATCH v5 12/20] qed: " Eric Blake
2017-12-01 1:42 ` [Qemu-devel] [PATCH v5 13/20] raw: " Eric Blake
2017-12-01 1:42 ` [Qemu-devel] [PATCH v5 14/20] sheepdog: " Eric Blake
2017-12-01 1:42 ` [Qemu-devel] [PATCH v5 15/20] vdi: Avoid bitrot of debugging code Eric Blake
2017-12-01 1:42 ` [Qemu-devel] [PATCH v5 16/20] vdi: Switch to .bdrv_co_block_status() Eric Blake
2017-12-01 1:42 ` [Qemu-devel] [PATCH v5 17/20] vmdk: " Eric Blake
2017-12-01 1:42 ` [Qemu-devel] [PATCH v5 18/20] vpc: " Eric Blake
2017-12-01 1:42 ` [Qemu-devel] [PATCH v5 19/20] vvfat: " Eric Blake
2017-12-01 1:42 ` [Qemu-devel] [PATCH v5 20/20] block: Drop unused .bdrv_co_get_block_status() Eric Blake
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=c2bde291-63d7-d94c-aac7-d894fa2a5218@redhat.com \
--to=eblake@redhat.com \
--cc=famz@redhat.com \
--cc=jsnow@redhat.com \
--cc=kwolf@redhat.com \
--cc=mreitz@redhat.com \
--cc=qemu-block@nongnu.org \
--cc=qemu-devel@nongnu.org \
--cc=stefanha@redhat.com \
--cc=vsementsov@virtuozzo.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).