From: "Denis V. Lunev" <den@openvz.org>
To: Kevin Wolf <kwolf@redhat.com>
Cc: qemu-devel@nongnu.org, Max Reitz <mreitz@redhat.com>
Subject: Re: [Qemu-devel] [PATCH for 2.7 1/1] qcow2: improve qcow2_co_write_zeroes()
Date: Mon, 25 Apr 2016 13:20:45 +0300 [thread overview]
Message-ID: <571DEF7D.1010304@openvz.org> (raw)
In-Reply-To: <20160425090553.GA5293@noname.str.redhat.com>
On 04/25/2016 12:05 PM, Kevin Wolf wrote:
> Am 23.04.2016 um 14:05 hat Denis V. Lunev geschrieben:
>> Unfortunately Linux kernel could send non-aligned requests to qemu-nbd
>> if the caller is using O_DIRECT and does not align in-memory data to
>> page. Thus qemu-nbd will call block layer with non-aligned requests.
>>
>> qcow2_co_write_zeroes forcibly asks the caller to supply block-aligned
>> data. In the other case it rejects with ENOTSUP which is properly
>> handled on the upper level. The problem is that this grows the image.
>>
>> This could be optimized a bit:
>> - particular request could be split to block aligned part and head/tail,
>> which could be handled separately
> In fact, this is what bdrv_co_do_write_zeroes() is already supposed to
> do. qcow2 exposes its cluster size as bs->bl.write_zeroes_alignment, so
> block/io.c should split the request in three.
>
> If you see something different happening, we may have a bug there.
>
Pls look to the commit
commit 459b4e66129d091a11e9886ecc15a8bf9f7f3d92
Author: Denis V. Lunev <den@openvz.org>
Date: Tue May 12 17:30:56 2015 +0300
block: align bounce buffers to page
The situation is exactly like the described there. The user
of the /dev/nbd0 writes with O_DIRECT and has unaligned
to page buffers. Thus real operations on qemu-nbd
layer becomes unaligned to block size.
Thus bdrv_co_do_write_zeroes is helpless here unfortunately.
We have this problem with 3rd party software performing
restoration from the backup.
The program is 100% reproducible. The following sequence
is performed:
#define _GNU_SOURCE
#include <stdio.h>
#include <malloc.h>
#include <string.h>
#include <sys/fcntl.h>
#include <unistd.h>
int main(int argc, char *argv[])
{
char *buf;
int fd;
if (argc != 2) {
return -1;
}
fd = open(argv[1], O_WRONLY | O_DIRECT);
do {
buf = memalign(512, 1024 * 1024);
} while (!(unsigned long)buf & (4096 - 1));
memset(buf, 0, 1024 * 1024);
write(fd, buf, 1024 * 1024);
return 0;
}
This program is compiled as a.out.
Before the patch:
hades ~/src/qemu $ qemu-img create -f qcow2 test.qcow2 64G
Formatting 'test.qcow2', fmt=qcow2 size=68719476736 encryption=off
cluster_size=65536 lazy_refcounts=off refcount_bits=16
hades ~/src/qemu $ sudo ./qemu-nbd --connect=/dev/nbd3 test.qcow2
--detect-zeroes=on --aio=native --cache=none
hades ~/src/qemu $ sudo ./a.out /dev/nbd3
hades ~/src/qemu $ ls -als test.qcow2
772 -rw-r--r-- 1 den den 851968 Apr 25 12:48 test.qcow2
hades ~/src/qemu $
After the patch:
hades ~/src/qemu $ qemu-img create -f qcow2 test.qcow2 64G
Formatting 'test.qcow2', fmt=qcow2 size=68719476736 encryption=off
cluster_size=65536 lazy_refcounts=off refcount_bits=16
hades ~/src/qemu $ sudo ./qemu-nbd --connect=/dev/nbd3 test.qcow2
--detect-zeroes=on --aio=native --cache=none
hades ~/src/qemu $ sudo ./a.out /dev/nbd3
hades ~/src/qemu $ ls -als test.qcow2
260 -rw-r--r-- 1 den den 327680 Apr 25 12:50 test.qcow2
hades ~/src/qemu $
>> - writes could be omitted when we do know that the image already contains
>> zeroes at the offsets being written
> I don't think this is a valid shortcut. The semantics of a write_zeroes
> operation is that the zeros (literal or as flags) are stored in this
> layer and that the backing file isn't involved any more for the given
> sectors. For example, a streaming operation or qemu-img rebase may
> involve write_zeroes operations, and relying on the backing file would
> cause corruption there (because the whole point of the operation is that
> the backing file can be removed).
this is not a problem. The block will be abscent and thus it will be
read as zeroes.
> And to be honest, writing zero flags in memory to the cached L2 table is
> an operation so quick that calling bdrv_get_block_status_above() might
> actually end up slower in most cases than just setting the flag.
Main fast path is not touched. bdrv_get_block_status_above() is called
only for
non-aligned parts of the operation.
Den
next prev parent reply other threads:[~2016-04-25 10:21 UTC|newest]
Thread overview: 11+ messages / expand[flat|nested] mbox.gz Atom feed top
2016-04-23 12:05 [Qemu-devel] [PATCH for 2.7 1/1] qcow2: improve qcow2_co_write_zeroes() Denis V. Lunev
2016-04-25 9:05 ` Kevin Wolf
2016-04-25 10:20 ` Denis V. Lunev [this message]
2016-04-25 19:35 ` Eric Blake
2016-04-25 21:00 ` Denis V. Lunev
2016-04-26 8:23 ` Kevin Wolf
2016-04-26 9:35 ` Denis V. Lunev
2016-04-26 10:19 ` Kevin Wolf
2016-04-27 7:07 ` Denis V. Lunev
2016-04-27 8:12 ` Kevin Wolf
2016-04-27 8:32 ` Denis V. Lunev
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=571DEF7D.1010304@openvz.org \
--to=den@openvz.org \
--cc=kwolf@redhat.com \
--cc=mreitz@redhat.com \
--cc=qemu-devel@nongnu.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.