qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed
From: "Denis V. Lunev" <den@openvz.org>
To: Kevin Wolf <kwolf@redhat.com>
Cc: qemu-devel@nongnu.org, Max Reitz <mreitz@redhat.com>
Subject: Re: [Qemu-devel] [PATCH for 2.7 1/1] qcow2: improve qcow2_co_write_zeroes()
Date: Mon, 25 Apr 2016 13:20:45 +0300	[thread overview]
Message-ID: <571DEF7D.1010304@openvz.org> (raw)
In-Reply-To: <20160425090553.GA5293@noname.str.redhat.com>

On 04/25/2016 12:05 PM, Kevin Wolf wrote:
> Am 23.04.2016 um 14:05 hat Denis V. Lunev geschrieben:
>> Unfortunately Linux kernel could send non-aligned requests to qemu-nbd
>> if the caller is using O_DIRECT and does not align in-memory data to
>> page. Thus qemu-nbd will call block layer with non-aligned requests.
>>
>> qcow2_co_write_zeroes forcibly asks the caller to supply block-aligned
>> data. In the other case it rejects with ENOTSUP which is properly
>> handled on the upper level. The problem is that this grows the image.
>>
>> This could be optimized a bit:
>> - particular request could be split to block aligned part and head/tail,
>>    which could be handled separately
> In fact, this is what bdrv_co_do_write_zeroes() is already supposed to
> do. qcow2 exposes its cluster size as bs->bl.write_zeroes_alignment, so
> block/io.c should split the request in three.
>
> If you see something different happening, we may have a bug there.
>
Pls look to the commit

commit 459b4e66129d091a11e9886ecc15a8bf9f7f3d92
Author: Denis V. Lunev <den@openvz.org>
Date:   Tue May 12 17:30:56 2015 +0300

     block: align bounce buffers to page

The situation is exactly like the described there. The user
of the /dev/nbd0 writes with O_DIRECT and has unaligned
to page buffers. Thus real operations on qemu-nbd
layer becomes unaligned to block size.

Thus bdrv_co_do_write_zeroes is helpless here unfortunately.
We have this problem with 3rd party software performing
restoration from the backup.

The program is 100% reproducible. The following sequence
is performed:

#define _GNU_SOURCE

#include <stdio.h>
#include <malloc.h>
#include <string.h>
#include <sys/fcntl.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
     char *buf;
     int fd;

     if (argc != 2) {
         return -1;
     }

     fd = open(argv[1], O_WRONLY | O_DIRECT);

     do {
         buf = memalign(512, 1024 * 1024);
     } while (!(unsigned long)buf & (4096 - 1));
     memset(buf, 0, 1024 * 1024);
     write(fd, buf, 1024 * 1024);
     return 0;
}

This program is compiled as a.out.

Before the patch:
hades ~/src/qemu $ qemu-img create -f qcow2 test.qcow2 64G
Formatting 'test.qcow2', fmt=qcow2 size=68719476736 encryption=off 
cluster_size=65536 lazy_refcounts=off refcount_bits=16
hades ~/src/qemu $ sudo ./qemu-nbd  --connect=/dev/nbd3 test.qcow2 
--detect-zeroes=on --aio=native --cache=none
hades ~/src/qemu $ sudo ./a.out /dev/nbd3
hades ~/src/qemu $ ls -als test.qcow2
772 -rw-r--r-- 1 den den 851968 Apr 25 12:48 test.qcow2
hades ~/src/qemu $

After the patch:
hades ~/src/qemu $ qemu-img create -f qcow2 test.qcow2 64G
Formatting 'test.qcow2', fmt=qcow2 size=68719476736 encryption=off 
cluster_size=65536 lazy_refcounts=off refcount_bits=16
hades ~/src/qemu $ sudo ./qemu-nbd  --connect=/dev/nbd3 test.qcow2 
--detect-zeroes=on --aio=native --cache=none
hades ~/src/qemu $ sudo ./a.out /dev/nbd3
hades ~/src/qemu $ ls -als test.qcow2
260 -rw-r--r-- 1 den den 327680 Apr 25 12:50 test.qcow2
hades ~/src/qemu $




>> - writes could be omitted when we do know that the image already contains
>>    zeroes at the offsets being written
> I don't think this is a valid shortcut. The semantics of a write_zeroes
> operation is that the zeros (literal or as flags) are stored in this
> layer and that the backing file isn't involved any more for the given
> sectors. For example, a streaming operation or qemu-img rebase may
> involve write_zeroes operations, and relying on the backing file would
> cause corruption there (because the whole point of the operation is that
> the backing file can be removed).
this is not a problem. The block will be abscent and thus it will be
read as zeroes.


> And to be honest, writing zero flags in memory to the cached L2 table is
> an operation so quick that calling bdrv_get_block_status_above() might
> actually end up slower in most cases than just setting the flag.
Main fast path is not touched. bdrv_get_block_status_above() is called 
only for
non-aligned parts of the operation.

Den

  reply	other threads:[~2016-04-25 10:21 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-04-23 12:05 [Qemu-devel] [PATCH for 2.7 1/1] qcow2: improve qcow2_co_write_zeroes() Denis V. Lunev
2016-04-25  9:05 ` Kevin Wolf
2016-04-25 10:20   ` Denis V. Lunev [this message]
2016-04-25 19:35     ` Eric Blake
2016-04-25 21:00       ` Denis V. Lunev
2016-04-26  8:23     ` Kevin Wolf
2016-04-26  9:35       ` Denis V. Lunev
2016-04-26 10:19         ` Kevin Wolf
2016-04-27  7:07           ` Denis V. Lunev
2016-04-27  8:12             ` Kevin Wolf
2016-04-27  8:32               ` Denis V. Lunev

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=571DEF7D.1010304@openvz.org \
    --to=den@openvz.org \
    --cc=kwolf@redhat.com \
    --cc=mreitz@redhat.com \
    --cc=qemu-devel@nongnu.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).