From: Nir Soffer <nsoffer@redhat.com>
To: Kevin Wolf <kwolf@redhat.com>
Cc: QEMU Developers <qemu-devel@nongnu.org>,
Hanna Reitz <hreitz@redhat.com>,
qemu-block@nongnu.org
Subject: Re: [PATCH] block/file-posix: Consider discard flag when opening
Date: Wed, 19 Jun 2024 13:58:28 +0300 [thread overview]
Message-ID: <CA5B7F3C-1C85-4A65-BD10-7B7B913EDB2B@redhat.com> (raw)
In-Reply-To: <ZnKT8ECzHjpLGOm_@redhat.com>
[-- Attachment #1: Type: text/plain, Size: 8259 bytes --]
> On 19 Jun 2024, at 11:16, Kevin Wolf <kwolf@redhat.com> wrote:
>
> Am 18.06.2024 um 23:24 hat Nir Soffer geschrieben:
>> Set has_discard only when BDRV_O_UNMAP is not set. With this users that
>> want to keep their images fully allocated can disable hole punching
>> when writing zeros or discarding using:
>>
>> -drive file=thick.img,discard=off
>>
>> This change is not entirely correct since it changes the default discard
>> behavior. Previously we always allowed punching holes, but now you have
>> must use discard=unmap|on to enable it. We probably need to add the
>> BDDR_O_UNMAP flag by default.
>>
>> make check still works, so maybe we don't have tests for sparsifying
>> images, or maybe you need to run special tests that do not run by
>> default. We needs tests for keeping images non-sparse.
>>
>> Signed-off-by: Nir Soffer <nsoffer@redhat.com>
>
> So first of all, I agree with you that this patch is wrong. ;-)
>
> At first, I failed to understand the problem this is trying to solve. I
> put a debug message in handle_aiocb_discard() and tried with which
> options it triggers. [1] To me, this looked exactly like it should be.
> We only try to discard blocks when discard=unmap is given as an option.
>
> That leaves the case of write_zeroes. And while at the first sight, the
> code looked good, we do seem to have a problem there and it tried to
> unmap even with discard=off.
>
>> block/file-posix.c | 2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/block/file-posix.c b/block/file-posix.c
>> index be25e35ff6..acac2abadc 100644
>> --- a/block/file-posix.c
>> +++ b/block/file-posix.c
>> @@ -738,11 +738,11 @@ static int raw_open_common(BlockDriverState *bs, QDict *options,
>> ret = -EINVAL;
>> goto fail;
>> }
>> #endif /* !defined(CONFIG_LINUX_IO_URING) */
>>
>> - s->has_discard = true;
>> + s->has_discard = !!(bdrv_flags & BDRV_O_UNMAP);
>> s->has_write_zeroes = true;
>>
>> if (fstat(s->fd, &st) < 0) {
>> ret = -errno;
>> error_setg_errno(errp, errno, "Could not stat file");
>
> s->has_discard is about what the host supports, not about the semantics
> of the QEMU block node. So this doesn't feel right to me.
>
> So for the buggy case, write_zeroes, bdrv_co_pwrite_zeroes() has code
> that considers the case and clears the ~BDRV_REQ_MAY_UNMAP flags:
>
> if (!(child->bs->open_flags & BDRV_O_UNMAP)) {
> flags &= ~BDRV_REQ_MAY_UNMAP;
> }
>
> But it turns out that we don't necessarily even go through this function
> for the top node which has discard=off, so it can't take effect:
>
> (gdb) bt
> #0 0x00007ffff4f2f144 in __pthread_kill_implementation () at /lib64/libc.so <http://libc.so/>.6
> #1 0x00007ffff4ed765e in raise () at /lib64/libc.so <http://libc.so/>.6
> #2 0x00007ffff4ebf902 in abort () at /lib64/libc.so <http://libc.so/>.6
> #3 0x000055555615aff0 in raw_do_pwrite_zeroes (bs=0x555557f4bcf0, offset=0, bytes=1048576, flags=BDRV_REQ_MAY_UNMAP, blkdev=false) at ../block/file-posix.c:3643
> #4 0x000055555615557e in raw_co_pwrite_zeroes (bs=0x555557f4bcf0, offset=0, bytes=1048576, flags=BDRV_REQ_MAY_UNMAP) at ../block/file-posix.c:3655
> #5 0x00005555560cde2a in bdrv_co_do_pwrite_zeroes (bs=0x555557f4bcf0, offset=0, bytes=1048576, flags=6) at ../block/io.c:1901
> #6 0x00005555560c72f9 in bdrv_aligned_pwritev (child=0x555557f51460, req=0x7fffed5ff800, offset=0, bytes=1048576, align=1, qiov=0x0, qiov_offset=0, flags=6) at ../block/io.c:2100
> #7 0x00005555560c6b41 in bdrv_co_do_zero_pwritev (child=0x555557f51460, offset=0, bytes=1048576, flags=6, req=0x7fffed5ff800) at ../block/io.c:2183
> #8 0x00005555560c6647 in bdrv_co_pwritev_part (child=0x555557f51460, offset=0, bytes=1048576, qiov=0x0, qiov_offset=0, flags=6) at ../block/io.c:2283
> #9 0x00005555560c634f in bdrv_co_pwritev (child=0x555557f51460, offset=0, bytes=1048576, qiov=0x0, flags=6) at ../block/io.c:2216
> #10 0x00005555560c75b5 in bdrv_co_pwrite_zeroes (child=0x555557f51460, offset=0, bytes=1048576, flags=BDRV_REQ_MAY_UNMAP) at ../block/io.c:2322
> #11 0x0000555556117d24 in raw_co_pwrite_zeroes (bs=0x555557f44980, offset=0, bytes=1048576, flags=BDRV_REQ_MAY_UNMAP) at ../block/raw-format.c:307
> #12 0x00005555560cde2a in bdrv_co_do_pwrite_zeroes (bs=0x555557f44980, offset=0, bytes=1048576, flags=6) at ../block/io.c:1901
> #13 0x00005555560c72f9 in bdrv_aligned_pwritev (child=0x555557f513f0, req=0x7fffed5ffd90, offset=0, bytes=1048576, align=1, qiov=0x0, qiov_offset=0, flags=6) at ../block/io.c:2100
> #14 0x00005555560c6b41 in bdrv_co_do_zero_pwritev (child=0x555557f513f0, offset=0, bytes=1048576, flags=6, req=0x7fffed5ffd90) at ../block/io.c:2183
> #15 0x00005555560c6647 in bdrv_co_pwritev_part (child=0x555557f513f0, offset=0, bytes=1048576, qiov=0x0, qiov_offset=0, flags=6) at ../block/io.c:2283
> #16 0x00005555560ad741 in blk_co_do_pwritev_part (blk=0x555557f51660, offset=0, bytes=1048576, qiov=0x0, qiov_offset=0, flags=6) at ../block/block-backend.c:1425
> #17 0x00005555560ad5f2 in blk_co_pwritev_part (blk=0x555557f51660, offset=0, bytes=1048576, qiov=0x0, qiov_offset=0, flags=6) at ../block/block-backend.c:1440
> #18 0x00005555560ad8cf in blk_co_pwritev (blk=0x555557f51660, offset=0, bytes=1048576, qiov=0x0, flags=6) at ../block/block-backend.c:1462
> #19 0x00005555560b0f79 in blk_co_pwrite_zeroes (blk=0x555557f51660, offset=0, bytes=1048576, flags=6) at ../block/block-backend.c:2590
> #20 0x000055555606d240 in blk_co_pwrite_zeroes_entry (opaque=0x7fffffffbc18) at block/block-gen.c:2162
> #21 0x00005555562c36ba in coroutine_trampoline (i0=1475685216, i1=21845) at ../util/coroutine-ucontext.c:175
> #22 0x00007ffff4ef1190 in ??? () at /lib64/libc.so <http://libc.so/>.6
> #23 0x00007fffffffc010 in ??? ()
> #24 0x0000000000000000 in ??? ()
>
> I haven't checked the details yet, but my first impression is that this
> check should probably move to bdrv_co_do_pwrite_zeroes().
Punching holes only if BDRV_REQ_MAY_UNMAP bit is set sounds like the right way.
>
> Kevin
>
>
> [1] Tests I did:
>
> # For discard
>
> $ echo -e 'qemu-io none0 "discard 0 1M"\nquit' | ./qemu-system-x86_64 -drive if=none,file=/tmp/test.raw,format=raw -monitor stdio
> QEMU 9.0.50 monitor - type 'help' for more information
> (qemu) qemu-io none0 "discard 0 1M"
> discard 1048576/1048576 bytes at offset 0
> 1 MiB, 1 ops; 00.00 sec (396.171 GiB/sec and 405679.5132 ops/sec)
> (qemu) quit
>
> $ echo -e 'qemu-io none0 "discard 0 1M"\nquit' | ./qemu-system-x86_64 -drive if=none,file=/tmp/test.raw,format=raw,discard=unmap -monitor stdio
> QEMU 9.0.50 monitor - type 'help' for more information
> (qemu) qemu-io none0 "discard 0 1M"
> handle_aiocb_discard
> discard 1048576/1048576 bytes at offset 0
> 1 MiB, 1 ops; 00.00 sec (6.623 GiB/sec and 6782.2820 ops/sec)
> (qemu) quit
>
> $ echo -e 'qemu-io none0 "discard 0 1M"\nquit' | ./qemu-system-x86_64 -drive if=none,file=/tmp/test.raw,format=raw,discard=off -monitor stdio
> QEMU 9.0.50 monitor - type 'help' for more information
> (qemu) qemu-io none0 "discard 0 1M"
> discard 1048576/1048576 bytes at offset 0
> 1 MiB, 1 ops; 00.00 sec (375.168 GiB/sec and 384172.1091 ops/sec)
> (qemu) quit
>
> # For write_zeroes
>
> $ echo -e 'qemu-io none0 "write -zu 0 1M"\nquit' | ./qemu-system-x86_64 -drive if=none,file=/tmp/test.raw,format=raw,discard=unmap -monitor stdio
> QEMU 9.0.50 monitor - type 'help' for more information
> (qemu) qemu-io none0 "write -zu 0 1M"
> handle_aiocb_write_zeroes_unmap
> wrote 1048576/1048576 bytes at offset 0
> 1 MiB, 1 ops; 00.00 sec (4.943 GiB/sec and 5061.5997 ops/sec)
> (qemu) quit
> $ echo -e 'qemu-io none0 "write -zu 0 1M"\nquit' | ./qemu-system-x86_64 -drive if=none,file=/tmp/test.raw,format=raw,discard=off -monitor stdio
> QEMU 9.0.50 monitor - type 'help' for more information
> (qemu) qemu-io none0 "write -zu 0 1M"
> handle_aiocb_write_zeroes_unmap
> wrote 1048576/1048576 bytes at offset 0
> 1 MiB, 1 ops; 00.00 sec (7.208 GiB/sec and 7381.4357 ops/sec)
> (qemu) quit
Looks like a good way to write tests for this, thanks for sharing.
Nir
[-- Attachment #2: Type: text/html, Size: 75495 bytes --]
prev parent reply other threads:[~2024-06-19 10:59 UTC|newest]
Thread overview: 3+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-06-18 21:24 [PATCH] block/file-posix: Consider discard flag when opening Nir Soffer
2024-06-19 8:16 ` Kevin Wolf
2024-06-19 10:58 ` Nir Soffer [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=CA5B7F3C-1C85-4A65-BD10-7B7B913EDB2B@redhat.com \
--to=nsoffer@redhat.com \
--cc=hreitz@redhat.com \
--cc=kwolf@redhat.com \
--cc=qemu-block@nongnu.org \
--cc=qemu-devel@nongnu.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).