From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from eggs.gnu.org ([2001:4830:134:3::10]:59235) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1aunc9-0004Xy-Hr for qemu-devel@nongnu.org; Mon, 25 Apr 2016 17:00:18 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1aunc6-0006ZU-BH for qemu-devel@nongnu.org; Mon, 25 Apr 2016 17:00:17 -0400 Received: from mail-am1on0111.outbound.protection.outlook.com ([157.56.112.111]:10656 helo=emea01-am1-obe.outbound.protection.outlook.com) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1aunc5-0006Yj-Rr for qemu-devel@nongnu.org; Mon, 25 Apr 2016 17:00:14 -0400 References: <1461413127-2594-1-git-send-email-den@openvz.org> <20160425090553.GA5293@noname.str.redhat.com> <571DEF7D.1010304@openvz.org> <571E7170.1090305@redhat.com> From: "Denis V. Lunev" Message-ID: <571E8556.5030007@openvz.org> Date: Tue, 26 Apr 2016 00:00:06 +0300 MIME-Version: 1.0 In-Reply-To: <571E7170.1090305@redhat.com> Content-Type: text/plain; charset="utf-8"; format=flowed Content-Transfer-Encoding: 7bit Subject: Re: [Qemu-devel] [PATCH for 2.7 1/1] qcow2: improve qcow2_co_write_zeroes() List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: Eric Blake , Kevin Wolf Cc: qemu-devel@nongnu.org, Max Reitz On 04/25/2016 10:35 PM, Eric Blake wrote: > On 04/25/2016 04:20 AM, Denis V. Lunev wrote: >> On 04/25/2016 12:05 PM, Kevin Wolf wrote: >>> Am 23.04.2016 um 14:05 hat Denis V. Lunev geschrieben: >>>> Unfortunately Linux kernel could send non-aligned requests to qemu-nbd >>>> if the caller is using O_DIRECT and does not align in-memory data to >>>> page. Thus qemu-nbd will call block layer with non-aligned requests. > At first glance, I'd argue that any caller using O_DIRECT without > obeying memory alignment restrictions is broken; why should qemu have to > work around such broken callers? > Memory requirements are followed. Minimal requirement which is fixed in kernel API is 512 bytes. For here the broken part is kernel as we have agreed on year ago, which splits requests in a wrong way. Unfortunately 3.10 RHEL7 based kernel misbehaves this way. You could remember the situation a year ago. The QEMU behaves exactly in the same way. Usually people start probing memory alignment for O_DIRECT writes from 512 bytes and kernel accepts that. This situation is VERY common :( >> The program is 100% reproducible. The following sequence >> is performed: >> >> #define _GNU_SOURCE >> >> #include >> #include >> #include >> #include >> #include >> >> int main(int argc, char *argv[]) >> { >> char *buf; >> int fd; >> >> if (argc != 2) { >> return -1; >> } >> >> fd = open(argv[1], O_WRONLY | O_DIRECT); >> >> do { >> buf = memalign(512, 1024 * 1024); >> } while (!(unsigned long)buf & (4096 - 1)); > In other words, you are INTENTIONALLY grabbing an unaligned buffer for > use with an O_DIRECT fd, when O_DIRECT demands that you should be using > at least page alignment (4096 or greater). Arguably, the bug is in your > program, not qemu. Yes, as THIS IS THE TEST to reproduce the problem observed in the 3rd party software. I have spent some time narrowing it down to this. > That said, teaching qemu to split up a write_zeroes request into head, > tail, and aligned body, so at least the aligned part might benefit from > optimizations, seems like it might be worthwhile, particularly since my > pending NBD series changed from blk_write_zeroes (cluster-aligned) to > blk_pwrite_zeroes (byte-aligned), and it is conceivable that we can > encounter a situation without O_DIRECT in the picture at all, where our > NBD server is connected to a client that specifically asks for the new > NBD_CMD_WRITE_ZEROES on any arbitrary byte alignment. actually NBD server is just an innocent party which transparently processes requests coming to it. What will happen if we will write 4k into the middle of the zeroed block with old code? The system will allocate 64k and write it down. This code will skip this write at all. Thus the optimization has some sense on its own. >> memset(buf, 0, 1024 * 1024); >> write(fd, buf, 1024 * 1024); >> return 0; >> } >> >> This program is compiled as a.out. >> >> Before the patch: >> hades ~/src/qemu $ qemu-img create -f qcow2 test.qcow2 64G >> Formatting 'test.qcow2', fmt=qcow2 size=68719476736 encryption=off >> cluster_size=65536 lazy_refcounts=off refcount_bits=16 >> hades ~/src/qemu $ sudo ./qemu-nbd --connect=/dev/nbd3 test.qcow2 >> --detect-zeroes=on --aio=native --cache=none >> hades ~/src/qemu $ sudo ./a.out /dev/nbd3 > Here, I'd argue that the kernel NBD module is buggy, for allowing a > user-space app to pass an unaligned buffer to an O_DIRECT fd. But > that's outside the realm of qemu code. no. The victim is GENERIC block layer code in the kernel. You could look. Original report in the commit 459b4e661 "The patch changes default bounce buffer optimal alignment to MAX(page size, 4k). 4k is chosen as maximal known sector size on real HDD. The justification of the performance improve is quite interesting. From the kernel point of view each request to the disk was split by two. This could be seen by blktrace like this: 9,0 11 1 0.000000000 11151 Q WS 312737792 + 1023 [qemu-img] 9,0 11 2 0.000007938 11151 Q WS 312738815 + 8 [qemu-img] 9,0 11 3 0.000030735 11151 Q WS 312738823 + 1016 [qemu-img] 9,0 11 4 0.000032482 11151 Q WS 312739839 + 8 [qemu-img] 9,0 11 5 0.000041379 11151 Q WS 312739847 + 1016 [qemu-img] 9,0 11 6 0.000042818 11151 Q WS 312740863 + 8 [qemu-img] 9,0 11 7 0.000051236 11151 Q WS 312740871 + 1017 [qemu-img] 9,0 5 1 0.169071519 11151 Q WS 312741888 + 1023 [qemu-img] After the patch the pattern becomes normal: 9,0 6 1 0.000000000 12422 Q WS 314834944 + 1024 [qemu-img] 9,0 6 2 0.000038527 12422 Q WS 314835968 + 1024 [qemu-img] 9,0 6 3 0.000072849 12422 Q WS 314836992 + 1024 [qemu-img] 9,0 6 4 0.000106276 12422 Q WS 314838016 + 1024 [qemu-img]" was made against ext4 and xfs filesystem. Once again, the problem is in the block layer of the kernel itself. > But again, per my argument that you don't have to involve the kernel nor > O_DIRECT to be able to write a client that can attempt to cause an NBD > server to do unaligned operations, we can use this kernel bug as an > easier way to test any proposed fix to the qemu side of things, whether > or not the kernel module gets tightened in behavior down the road. > Den