From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from eggs.gnu.org ([2001:4830:134:3::10]:55529) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1auddS-00018f-7i for qemu-devel@nongnu.org; Mon, 25 Apr 2016 06:21:02 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1auddP-0003h3-0J for qemu-devel@nongnu.org; Mon, 25 Apr 2016 06:20:58 -0400 Received: from mail-am1on0137.outbound.protection.outlook.com ([157.56.112.137]:48951 helo=emea01-am1-obe.outbound.protection.outlook.com) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1auddO-0003go-Gv for qemu-devel@nongnu.org; Mon, 25 Apr 2016 06:20:54 -0400 From: "Denis V. Lunev" References: <1461413127-2594-1-git-send-email-den@openvz.org> <20160425090553.GA5293@noname.str.redhat.com> Message-ID: <571DEF7D.1010304@openvz.org> Date: Mon, 25 Apr 2016 13:20:45 +0300 MIME-Version: 1.0 In-Reply-To: <20160425090553.GA5293@noname.str.redhat.com> Content-Type: text/plain; charset="windows-1252"; format=flowed Content-Transfer-Encoding: 7bit Subject: Re: [Qemu-devel] [PATCH for 2.7 1/1] qcow2: improve qcow2_co_write_zeroes() List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: Kevin Wolf Cc: qemu-devel@nongnu.org, Max Reitz On 04/25/2016 12:05 PM, Kevin Wolf wrote: > Am 23.04.2016 um 14:05 hat Denis V. Lunev geschrieben: >> Unfortunately Linux kernel could send non-aligned requests to qemu-nbd >> if the caller is using O_DIRECT and does not align in-memory data to >> page. Thus qemu-nbd will call block layer with non-aligned requests. >> >> qcow2_co_write_zeroes forcibly asks the caller to supply block-aligned >> data. In the other case it rejects with ENOTSUP which is properly >> handled on the upper level. The problem is that this grows the image. >> >> This could be optimized a bit: >> - particular request could be split to block aligned part and head/tail, >> which could be handled separately > In fact, this is what bdrv_co_do_write_zeroes() is already supposed to > do. qcow2 exposes its cluster size as bs->bl.write_zeroes_alignment, so > block/io.c should split the request in three. > > If you see something different happening, we may have a bug there. > Pls look to the commit commit 459b4e66129d091a11e9886ecc15a8bf9f7f3d92 Author: Denis V. Lunev Date: Tue May 12 17:30:56 2015 +0300 block: align bounce buffers to page The situation is exactly like the described there. The user of the /dev/nbd0 writes with O_DIRECT and has unaligned to page buffers. Thus real operations on qemu-nbd layer becomes unaligned to block size. Thus bdrv_co_do_write_zeroes is helpless here unfortunately. We have this problem with 3rd party software performing restoration from the backup. The program is 100% reproducible. The following sequence is performed: #define _GNU_SOURCE #include #include #include #include #include int main(int argc, char *argv[]) { char *buf; int fd; if (argc != 2) { return -1; } fd = open(argv[1], O_WRONLY | O_DIRECT); do { buf = memalign(512, 1024 * 1024); } while (!(unsigned long)buf & (4096 - 1)); memset(buf, 0, 1024 * 1024); write(fd, buf, 1024 * 1024); return 0; } This program is compiled as a.out. Before the patch: hades ~/src/qemu $ qemu-img create -f qcow2 test.qcow2 64G Formatting 'test.qcow2', fmt=qcow2 size=68719476736 encryption=off cluster_size=65536 lazy_refcounts=off refcount_bits=16 hades ~/src/qemu $ sudo ./qemu-nbd --connect=/dev/nbd3 test.qcow2 --detect-zeroes=on --aio=native --cache=none hades ~/src/qemu $ sudo ./a.out /dev/nbd3 hades ~/src/qemu $ ls -als test.qcow2 772 -rw-r--r-- 1 den den 851968 Apr 25 12:48 test.qcow2 hades ~/src/qemu $ After the patch: hades ~/src/qemu $ qemu-img create -f qcow2 test.qcow2 64G Formatting 'test.qcow2', fmt=qcow2 size=68719476736 encryption=off cluster_size=65536 lazy_refcounts=off refcount_bits=16 hades ~/src/qemu $ sudo ./qemu-nbd --connect=/dev/nbd3 test.qcow2 --detect-zeroes=on --aio=native --cache=none hades ~/src/qemu $ sudo ./a.out /dev/nbd3 hades ~/src/qemu $ ls -als test.qcow2 260 -rw-r--r-- 1 den den 327680 Apr 25 12:50 test.qcow2 hades ~/src/qemu $ >> - writes could be omitted when we do know that the image already contains >> zeroes at the offsets being written > I don't think this is a valid shortcut. The semantics of a write_zeroes > operation is that the zeros (literal or as flags) are stored in this > layer and that the backing file isn't involved any more for the given > sectors. For example, a streaming operation or qemu-img rebase may > involve write_zeroes operations, and relying on the backing file would > cause corruption there (because the whole point of the operation is that > the backing file can be removed). this is not a problem. The block will be abscent and thus it will be read as zeroes. > And to be honest, writing zero flags in memory to the cached L2 table is > an operation so quick that calling bdrv_get_block_status_above() might > actually end up slower in most cases than just setting the flag. Main fast path is not touched. bdrv_get_block_status_above() is called only for non-aligned parts of the operation. Den