From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from eggs.gnu.org ([2001:4830:134:3::10]:59221) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1avJZZ-00083k-G2 for qemu-devel@nongnu.org; Wed, 27 Apr 2016 03:07:46 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1avJZW-0008AS-7k for qemu-devel@nongnu.org; Wed, 27 Apr 2016 03:07:45 -0400 Received: from mail-am1on0121.outbound.protection.outlook.com ([157.56.112.121]:46880 helo=emea01-am1-obe.outbound.protection.outlook.com) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1avJZV-0008AG-Jx for qemu-devel@nongnu.org; Wed, 27 Apr 2016 03:07:42 -0400 References: <1461413127-2594-1-git-send-email-den@openvz.org> <20160425090553.GA5293@noname.str.redhat.com> <571DEF7D.1010304@openvz.org> <20160426082321.GA4213@noname.str.redhat.com> <571F3653.2040007@openvz.org> <20160426101941.GB4213@noname.str.redhat.com> From: "Denis V. Lunev" Message-ID: <57206537.4040706@openvz.org> Date: Wed, 27 Apr 2016 10:07:35 +0300 MIME-Version: 1.0 In-Reply-To: <20160426101941.GB4213@noname.str.redhat.com> Content-Type: text/plain; charset="windows-1252"; format=flowed Content-Transfer-Encoding: 7bit Subject: Re: [Qemu-devel] [PATCH for 2.7 1/1] qcow2: improve qcow2_co_write_zeroes() List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: Kevin Wolf Cc: qemu-devel@nongnu.org, Max Reitz On 04/26/2016 01:19 PM, Kevin Wolf wrote: [skipped] > Did you ever talk to the kernel people? > > We can try to make the best out of suboptimal requests in qemu, but it > looks to me as if the real fix was in the kernel, and if we don't get it > fixed there, we'll see more and more of this kind of problems. I think > this is relevant not only for VMs, but probably on real hardware as > well. it is difficult and could be done partially only :( >> Locally with QEMU-IO the reproduction is simple. We can repeat above >> requests or could simple do the following: >> qemu-io -c "write 0xff 32k 1M" 1.img > I assume you mean "-z" instead of "0xff"? sure >> The code without the patch will allocate 2 blocks for guest offsets >> 0-64k and 1M-(1M+64k) and performs writes there. The code with >> the patch will skip creation of blocks if possible. > Okay, that's the second of the optimisations you mentioned in your > commit message. I can see how this adds something that the generic block > layer can't easily add, if it can be made safe (I think it can, even > though your patch doesn't get it completely right yet, see below). > >> I have recorded parameters in qcow2_co_do_write_zeroes for the >> reference (1 MB is written, memory is not aligned as sent in the >> letter above, [sudo ./a.out /dev/nbd3]): >> >> qcow2_co_write_zeroes off=0 size=10000 >> qcow2_co_write_zeroes off=1fe00 size=200 >> qcow2_co_write_zeroes off=3fe00 size=200 >> qcow2_co_write_zeroes off=5fe00 size=200 >> qcow2_co_write_zeroes off=7fe00 size=200 >> qcow2_co_write_zeroes off=9fe00 size=200 >> qcow2_co_write_zeroes off=bfe00 size=200 >> qcow2_co_write_zeroes off=dfe00 size=200 >> qcow2_co_write_zeroes off=ffe00 size=200 >> qcow2_co_write_zeroes off=10000 size=fe00 >> qcow2_co_write_zeroes off=20000 size=10000 >> qcow2_co_write_zeroes off=30000 size=fe00 >> qcow2_co_write_zeroes off=60000 size=10000 >> qcow2_co_write_zeroes off=70000 size=fe00 >> qcow2_co_write_zeroes off=c0000 size=10000 >> qcow2_co_write_zeroes off=d0000 size=fe00 >> qcow2_co_write_zeroes off=e0000 size=10000 >> qcow2_co_write_zeroes off=f0000 size=fe00 >> qcow2_co_write_zeroes off=80000 size=10000 >> qcow2_co_write_zeroes off=90000 size=fe00 >> qcow2_co_write_zeroes off=a0000 size=10000 >> qcow2_co_write_zeroes off=b0000 size=fe00 >> qcow2_co_write_zeroes off=40000 size=10000 >> qcow2_co_write_zeroes off=50000 size=fe00 > I don't see any requests here where your code actually ends up splitting > the request into head, aligned part and tail. Which is expected because > bdrv_co_do_write_zeroes() already does that. > > What I can't see here is whether this actually happened (10000 + fe00 > could be a split request) or whether it already came in this way over > NBD. bdrv_aligned_pwritev off=0 size=1fe00 qcow2_co_write_zeroes off=0 size=10000 qcow2_co_write_zeroes off=10000 size=fe00 bdrv_aligned_pwritev off=1fe00 size=20000 qcow2_co_write_zeroes off=1fe00 size=200 qcow2_co_write_zeroes off=20000 size=10000 qcow2_co_write_zeroes off=30000 size=fe00 bdrv_aligned_pwritev off=3fe00 size=20000 qcow2_co_write_zeroes off=3fe00 size=200 qcow2_co_write_zeroes off=40000 size=10000 qcow2_co_write_zeroes off=50000 size=fe00 bdrv_aligned_pwritev off=5fe00 size=20000 qcow2_co_write_zeroes off=5fe00 size=200 qcow2_co_write_zeroes off=60000 size=10000 qcow2_co_write_zeroes off=70000 size=fe00 bdrv_aligned_pwritev off=7fe00 size=20000 qcow2_co_write_zeroes off=7fe00 size=200 qcow2_co_write_zeroes off=80000 size=10000 qcow2_co_write_zeroes off=90000 size=fe00 bdrv_aligned_pwritev off=9fe00 size=20000 qcow2_co_write_zeroes off=9fe00 size=200 qcow2_co_write_zeroes off=a0000 size=10000 qcow2_co_write_zeroes off=b0000 size=fe00 bdrv_aligned_pwritev off=bfe00 size=20000 qcow2_co_write_zeroes off=bfe00 size=200 qcow2_co_write_zeroes off=c0000 size=10000 qcow2_co_write_zeroes off=d0000 size=fe00 bdrv_aligned_pwritev off=dfe00 size=20000 qcow2_co_write_zeroes off=dfe00 size=200 qcow2_co_write_zeroes off=e0000 size=10000 qcow2_co_write_zeroes off=f0000 size=fe00 bdrv_aligned_pwritev off=ffe00 size=200 qcow2_co_write_zeroes off=ffe00 size=200 [skipped] > Here is the trick that I think will save us: > > On a misaligned call, we call bdrv_get_block_status_above() for the > whole cluster that we're in. We know that it's only a single cluster > because bdrv_co_do_write_zeroes() splits things this way; only aligned > requests can be longer than a sector (we can even assert this). > > If the result is that the cluster already reads as zero, instead of > doing nothing and possibly breaking backing chain manipulations, we > simply extend the write zeroes operation to the whole cluster and > continue as normal with an aligned request. This way we end up with a > zero cluster instead of an unallocated one, and that should be safe. > > If the result is that the cluster isn't completed zeroed, return > -ENOTSUP as you did in the snippet above. > > That approach should probably result in an (even) simpler patch, too. let's start from the simple thing. I'll send a patch in a couple of minutes. It really looks MUCH better than the original one. Thank you for this cool suggestion. > > Hm... Or actually, if we want something more complex that will help all > block drivers, extending the range of the request could even be done in > bdrv_co_do_write_zeroes(), I guess. I won't insist on it, though. I do not think that this should be done now. This patch should go into our own stable and thus it would better be as simple as possible. Den