From: "Denis V. Lunev"
To: Kevin Wolf
Cc: Alberto Garcia, qemu-devel@nongnu.org, Stefan Hajnoczi,
 qemu-block@nongnu.org, Max Reitz
Subject: Re: [Qemu-devel] [RFC] Proposed qcow2 extension: subcluster allocation
Date: Thu, 13 Apr 2017 16:09:53 +0300
Message-ID: <3654f226-d51a-5013-8301-5beb83200aa8@openvz.org>
In-Reply-To: <20170413130555.GC5095@noname.redhat.com>
References: <20170406150148.zwjpozqtale44jfh@perseus.local>
 <2b915695-29b5-df8d-4d89-080eeaaaff13@openvz.org>
 <565c1e1b-b9e1-e9c5-790e-283d04afc747@openvz.org>
 <20170413130555.GC5095@noname.redhat.com>

On 04/13/2017 04:05 PM, Kevin Wolf wrote:
> On 13.04.2017 at 14:44, Denis V. Lunev wrote:
>> On 04/13/2017 02:58 PM, Alberto Garcia wrote:
>>> On Wed 12 Apr 2017 06:54:50 PM CEST, Denis V. Lunev wrote:
>>>> My opinion about this approach is very negative, as the problem could
>>>> be (partially) solved in a much better way.
>>> Hmm... it seems to me that (some of) the problems you are describing
>>> are different from the ones this proposal tries to address. Not that I
>>> disagree with them! I think you are giving useful feedback :)
>>>
>>>> 1) The current L2 cache management seems very wrong to me. Each cache
>>>> miss means that we have to read an entire L2 table cluster. This means
>>>> that in the worst case (when the dataset of the test does not fit into
>>>> the L2 cache) we read 64 KB of L2 table for each 4 KB of data read.
>>>>
>>>> The situation gets MUCH worse once we start to increase the cluster
>>>> size. For 1 MB clusters we have to read 1 MB on each cache miss.
>>>>
>>>> The situation can be cured immediately once we start reading the L2
>>>> cache in 4 or 8 KB chunks. We have a patchset for this in our
>>>> downstream and are preparing it for upstream.
>>> Correct, although the impact of this depends on whether you are using
>>> an SSD or an HDD.
>>>
>>> With an SSD what you want to minimize is the number of unnecessary
>>> reads, so reading small chunks will likely increase the performance
>>> when there's a cache miss.
>>>
>>> With an HDD what you want is to minimize the number of seeks. Once you
>>> have moved the disk head to the location where the cluster is, reading
>>> the whole cluster is relatively inexpensive, so (leaving the memory
>>> requirements aside) you generally want to read as much as possible.
>> No! This greatly helps for HDDs too!
>>
>> The reason is that you cover areas of the virtual disk much more
>> precisely. Here is a very simple example. Let us assume that I have,
>> e.g., a 1 TB virtual HDD with a 1 MB cluster size. As far as I
>> understand, right now the L2 cache for this case consists of 4 L2
>> clusters.
>>
>> So I can exhaust the current cache with only 5 requests, and each
>> subsequent read will cost an extra L2 table read. This is a real
>> problem, and this condition can easily occur on a fragmented FS.
>>
>> With my proposal the situation is MUCH better. All accesses will be
>> served from the cache after the first run.
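
To make the arithmetic behind this example explicit, here is a small
standalone sketch of the coverage numbers. The figures (1 TiB disk, 1 MiB
clusters, 8-byte L2 entries, a cache of 4 L2 clusters, 4 KiB read chunks)
are assumptions taken from the example above, not values read out of the
qcow2 code:

#include <stdio.h>
#include <inttypes.h>

int main(void)
{
    const uint64_t disk_size     = 1ULL << 40;  /* 1 TiB virtual disk  */
    const uint64_t cluster_size  = 1ULL << 20;  /* 1 MiB clusters      */
    const uint64_t l2_entry_size = 8;           /* one 64-bit L2 entry */

    /* An L2 table occupies one cluster; how much guest data does it map? */
    uint64_t entries_per_l2 = cluster_size / l2_entry_size;   /* 131072  */
    uint64_t data_per_l2    = entries_per_l2 * cluster_size;  /* 128 GiB */
    uint64_t l2_tables      = disk_size / data_per_l2;        /* 8       */

    printf("L2 tables for the whole disk: %" PRIu64 "\n", l2_tables);
    printf("guest data mapped by one cached L2 cluster: %" PRIu64 " GiB\n",
           data_per_l2 >> 30);

    /*
     * A cache of 4 whole L2 clusters maps only 512 GiB of the 1 TiB disk,
     * so 5 requests spread over the disk are enough to start evicting
     * entries, and every miss then re-reads a full 1 MiB L2 cluster.
     * With 4 KiB read granularity a miss costs only 4 KiB and one chunk
     * still maps 512 MiB of guest data.
     */
    printf("guest data mapped by one 4 KiB L2 chunk: %" PRIu64 " MiB\n",
           (4096 / l2_entry_size) * cluster_size >> 20);
    return 0;
}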
>>>> 2) Yet another terrible thing in cluster allocation is its allocation
>>>> strategy. The current QCOW2 codebase implies that we need 5 (five)
>>>> I/O operations to complete a COW operation: we read the head, write
>>>> the head, read the tail, write the tail, and then write the actual
>>>> guest data. This could easily be reduced to 3 I/O operations.
>>> That sounds right, but I'm not sure if this is really incompatible with
>>> my proposal :)
>> The problem is code complexity, which is already very high right now.
>>
>>
>>>> Another problem is the amount of data written. We write the entire
>>>> cluster for each such write operation, and this is also insane. It is
>>>> possible to perform an fallocate() plus the actual data write on a
>>>> normal modern filesystem.
>>> But that only works when filling the cluster with zeroes, doesn't it?
>>> If there's a backing image you need to bring all the contents from
>>> there.
>> Yes, backing images are a problem. Though, even with sub-clusters, we
>> will suffer exactly the same amount of I/O, as even then the head and
>> the tail have to be read. If you are talking about subclusters equal to
>> the FS block size, so that COW is avoided entirely, this would be
>> terribly slow later on with sequential reading: with such an approach a
>> sequential read turns into a random read.
>>
>> Guest OSes are written with the assumption that adjacent LBAs are really
>> adjacent and that reading them sequentially is a very good idea. This
>> invariant will be broken in the subcluster case.
> How so?
>
> Given the same cluster size, subclustered and traditional images behave
> _exactly_ the same regarding fragmentation. Subclusters only have an
> effect on it (and a positive one) when you take them as a reason that
> you can now afford to increase the cluster size.
>
> I see subclusters and fragmentation as mostly independent topics.
>
>> With today's SSDs we are facing problems somewhere else. Right now I
>> can achieve only 100k IOPS on an SSD capable of 350-550k. A 1 MB
>> cluster size with preallocation and a fragmented (chunked) L2 cache
>> gives the same 100k. Tests on an initially empty image give around 80k
>> for us.
> Preallocated images aren't particularly interesting to me. qcow2 is used
> mainly for two reasons. One of them is sparseness (initially small file
> size), mostly for desktop use cases with no serious I/O, so not that
> interesting either. The other one is snapshots, i.e. backing files,
> which don't work with preallocation (yet).
>
> Actually, preallocation with backing files is something that subclusters
> would automatically enable: You could already reserve the space for a
> cluster, but still leave all subclusters marked as unallocated.
I am talking about an fallocate() for the entire cluster before the
actual write() on an originally empty image. This increases the
performance of 4k random writes 10+ times. In this case we can just
write those 4k and do nothing else; rough sketches of this and of the
3-I/O COW sequence are below.

Den
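
A rough sketch of the closing point, under assumptions: fallocate() the
whole cluster, then write only the 4k the guest sent. The function name,
offsets and fallback logic here are made up for illustration, this is not
the qcow2 code path, and the exact fallocate()/FALLOC_FL_ZERO_RANGE
behaviour (unwritten extents reading back as zeroes) depends on the
filesystem:

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>
#include <stdint.h>

/* Hypothetical helper: reserve a whole cluster, then write only the
 * guest data instead of rewriting the full cluster. */
static int allocate_and_write(int fd, uint64_t cluster_off,
                              uint64_t cluster_size,
                              const void *buf, uint64_t buf_off,
                              uint64_t buf_len)
{
    /* Reserve (and zero) the whole cluster without writing 1 MB of data.
     * FALLOC_FL_ZERO_RANGE is Linux-specific; fall back to a plain
     * allocation of unwritten extents if it is not supported. */
    if (fallocate(fd, FALLOC_FL_ZERO_RANGE, cluster_off, cluster_size) < 0) {
        if (fallocate(fd, 0, cluster_off, cluster_size) < 0) {
            return -1;
        }
    }

    /* Only the data the guest actually sent (e.g. 4k) gets written. */
    if (pwrite(fd, buf, buf_len, cluster_off + buf_off) < 0) {
        return -1;
    }
    return 0;
}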
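
For the "5 I/O operations vs 3" point earlier in the thread, one plausible
reading is to keep the two reads of the head and the tail, but fold the
three writes into a single contiguous pwritev() of head + guest data +
tail. The sketch below only illustrates that idea; cow_cluster(), its
parameters and the error handling are hypothetical and this is not how the
qcow2 driver is actually structured:

#define _GNU_SOURCE
#include <sys/uio.h>
#include <unistd.h>
#include <stdint.h>
#include <stdlib.h>

/* Copy a cluster from old_off to new_off while merging in the guest data
 * at [data_off, data_off + data_len) inside the cluster. */
static int cow_cluster(int backing_fd, int fd,
                       uint64_t old_off, uint64_t new_off,
                       uint64_t cluster_size, const void *data,
                       uint64_t data_off, uint64_t data_len)
{
    uint64_t head_len = data_off;
    uint64_t tail_len = cluster_size - (data_off + data_len);
    char *head = malloc(head_len ? head_len : 1);
    char *tail = malloc(tail_len ? tail_len : 1);
    struct iovec iov[3] = {
        { .iov_base = head,         .iov_len = head_len },
        { .iov_base = (void *)data, .iov_len = data_len },
        { .iov_base = tail,         .iov_len = tail_len },
    };
    int ret = -1;

    /* 1st and 2nd I/O: read the untouched parts of the old cluster. */
    if (head_len && pread(backing_fd, head, head_len, old_off) < 0) {
        goto out;
    }
    if (tail_len && pread(backing_fd, tail, tail_len,
                          old_off + data_off + data_len) < 0) {
        goto out;
    }

    /* 3rd I/O: head + new data + tail written in one contiguous request,
     * instead of three separate writes. */
    if (pwritev(fd, iov, 3, new_off) < 0) {
        goto out;
    }
    ret = 0;
out:
    free(head);
    free(tail);
    return ret;
}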