From: Vladimir Sementsov-Ogievskiy
To: Max Reitz, qemu-devel@nongnu.org, qemu-block@nongnu.org
Cc: kwolf@redhat.com, den@openvz.org
Date: Mon, 20 Aug 2018 19:33:31 +0300
Subject: Re: [Qemu-devel] [PATCH 0/7] qcow2: async handling of fragmented io
In-Reply-To: <747506be-aceb-0ab8-a4ee-c79f9a6b929a@redhat.com>
References: <20180807174311.32454-1-vsementsov@virtuozzo.com>
 <13910182-771b-c5dc-26a7-0958a7241fe8@redhat.com>
 <6c318533-dc87-daeb-1fe8-6b11b0cbec8d@virtuozzo.com>
 <747506be-aceb-0ab8-a4ee-c79f9a6b929a@redhat.com>

17.08.2018 22:34, Max Reitz wrote:
> On 2018-08-16 15:58, Vladimir Sementsov-Ogievskiy wrote:
>> 16.08.2018 03:51, Max Reitz wrote:
>>> On 2018-08-07 19:43, Vladimir Sementsov-Ogievskiy wrote:
>>>> Hi all!
>>>>
>>>> Here is an asynchronous scheme for handling fragmented qcow2 reads
>>>> and writes. Both the qcow2 read and write functions loop through
>>>> sequential portions of data. The series aims to parallelize the
>>>> iterations of these loops.
>>>>
>>>> It improves performance for fragmented qcow2 images. I've tested it
>>>> as follows:
>>>>
>>>> I have four 4G qcow2 images (with the default 64k cluster size) on my
>>>> SSD:
>>>> t-seq.qcow2       - written sequentially
>>>> t-reverse.qcow2   - filled by writing 64k portions from the end to the start
>>>> t-rand.qcow2      - filled by writing 64k portions (aligned) in random order
>>>> t-part-rand.qcow2 - filled by shuffling the order of the 64k writes within 1m chunks
>>>> (see the source code of the image generation at the end for details)
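By the way, the generation script is only referenced above, not quoted here.
Roughly, images like t-reverse.qcow2 and t-rand.qcow2 can be produced with
qemu-io as follows; this is just an illustrative sketch, not the script from
the original mail (the 4G size, 64k step and image names are taken from the
description above):

    # illustrative sketch: reverse order, write 64k chunks from the end
    # of a 4G image back to the start
    ./qemu-img create -f qcow2 t-reverse.qcow2 4G
    for ((off = 4*1024*1024*1024 - 64*1024; off >= 0; off -= 64*1024)); do
        echo "write $off 64k"
    done | ./qemu-io -f qcow2 -t none t-reverse.qcow2

    # random order: generate the same offsets, shuffle them, then replay
    # them through a single qemu-io process
    ./qemu-img create -f qcow2 t-rand.qcow2 4G
    for ((off = 0; off < 4*1024*1024*1024; off += 64*1024)); do
        echo "write $off 64k"
    done | shuf | ./qemu-io -f qcow2 -t none t-rand.qcow2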
>>>>
>>>> and the test (sequential I/O in 1m chunks):
>>>>
>>>> test write:
>>>>     for t in /ssd/t-*; \
>>>>         do sync; echo 1 > /proc/sys/vm/drop_caches; echo === $t ===; \
>>>>         ./qemu-img bench -c 4096 -d 1 -f qcow2 -n -s 1m -t none -w $t; \
>>>>     done
>>>>
>>>> test read (the same, just drop the -w parameter):
>>>>     for t in /ssd/t-*; \
>>>>         do sync; echo 1 > /proc/sys/vm/drop_caches; echo === $t ===; \
>>>>         ./qemu-img bench -c 4096 -d 1 -f qcow2 -n -s 1m -t none $t; \
>>>>     done
>>>>
>>>> short info about the parameters:
>>>>    -w      - do writes (otherwise do reads)
>>>>    -c      - number of requests
>>>>    -s      - request size
>>>>    -t none - disable the cache
>>>>    -n      - native AIO
>>>>    -d 1    - don't use the parallel requests provided by qemu-img bench itself
>>> Hm, actually, why not? And how does a guest behave?
>>>
>>> If parallel requests on an SSD perform better, wouldn't a guest issue
>>> parallel requests to the virtual device and thus to qcow2 anyway?
>> The guest knows nothing about qcow2 fragmentation, so this kind of
>> "asynchronization" can only be done at the qcow2 level.
> Hm, yes. I'm sorry, but without having looked closer at the series
> (which is why I'm sorry in advance), I would suspect that the
> performance improvement comes from us being able to send parallel
> requests to an SSD.
>
> So if you send large requests to an SSD, you may either send them in
> parallel or sequentially, it doesn't matter. But for small requests,
> it's better to send them in parallel so the SSD always has requests in
> its queue.
>
> I would think this is where the performance improvement comes from. But
> I would also think that a guest OS knows this and would also send
> many requests in parallel so the virtual block device never runs out of
> requests.
>
>> However, if the guest does async I/O and sends a lot of parallel
>> requests, it behaves like qemu-img without the -d 1 option, and in that
>> case the parallel loop iterations in qcow2 don't make as much sense.
>> Still, I think that async parallel requests are better than sequential
>> ones in general, because if the device has some unused parallelization
>> capacity, it will be utilized.
> I agree that it probably doesn't make things worse performance-wise, but
> it's always added complexity (see the diffstat), which is why I'm just
> routinely asking how useful it is in practice. :-)
>
> Anyway, I suspect there are indeed cases where a guest doesn't send many
> requests in parallel but it makes sense for the qcow2 driver to
> parallelize it. That would be mainly when the guest reads seemingly
> sequential data that is then fragmented in the qcow2 file. So basically
> what your benchmark is testing. :-)
>
> Then, the guest could assume that there is no sense in parallelizing it
> because the latency from the device is large enough, whereas in qemu
> itself we always run dry and wait for different parts of the single
> large request to finish. So, yes, in that case, parallelization that's
> internal to qcow2 would make sense.
>
> Now another question is, does this negatively impact devices where
> seeking is slow, i.e. HDDs? Unfortunately I'm not home right now, so I
> don't have access to an HDD to test myself...

HDD results (qemu-img bench runtimes in seconds, lower is better):

+-----------+-----------+----------+-----------+----------+
|   file    | wr before | wr after | rd before | rd after |
+-----------+-----------+----------+-----------+----------+
| seq       |    39.821 |   40.513 |    38.600 |   38.916 |
| reverse   |    60.320 |   57.902 |    98.223 |  111.717 |
| rand      |   614.826 |  580.452 |   672.600 |  465.120 |
| part-rand |    52.311 |   52.450 |    37.663 |   37.989 |
+-----------+-----------+----------+-----------+----------+

Hmm, a 10% degradation in the "reverse" read case, strange magic...
However, a reverse-written image is close to impossible in practice.

>
>> We already use this approach in mirror and qemu-img convert.
> Indeed, but here you could always argue that this is just what guests
> do, so we should, too.
>
>> In Virtuozzo we have a backup job that is also improved by
>> parallelizing its request loop. I think it would be good to have some
>> general code for such things in the future.
> Well, those are different things, I'd think. Parallelization in
> mirror/backup/convert is useful not just because of qcow2 issues, but
> also because you have a volume to read from and a volume to write to, so
> that's where parallelization gives you some pipelining. And it gives
> you buffers for latency spikes, I guess.
>
> Max

--
Best regards,
Vladimir
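P.S. To emulate a guest that does send many parallel requests (i.e. qemu-img
bench without -d 1, as discussed above), the same loop can simply be run with
a larger queue depth. A sketch; the depth of 16 is an arbitrary choice:

    # same benchmark as above, but let qemu-img bench itself keep 16 requests
    # in flight (-d 16) instead of one (-d 1), roughly emulating a guest that
    # issues parallel requests
    for t in /ssd/t-*; \
        do sync; echo 1 > /proc/sys/vm/drop_caches; echo === $t ===; \
        ./qemu-img bench -c 4096 -d 16 -f qcow2 -n -s 1m -t none $t; \
    done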