From: Avi Kivity
Date: Sun, 12 Oct 2008 22:43:29 +0200
Subject: Re: [Qemu-devel] [RFC] Disk integrity in QEMU
To: Anthony Liguori
Cc: Chris Wright, Mark McLoughlin, kvm-devel, Laurent Vivier,
    qemu-devel@nongnu.org, Ryan Harper
Message-ID: <48F26171.70109@redhat.com>
In-Reply-To: <48F25720.9010306@codemonkey.ws>

Anthony Liguori wrote:
>>
>> Let me phrase this another way: is there an argument against O_DIRECT?
>
> It slows down any user who frequently restarts virtual machines.

This is an important use case (us developers), but not the majority of
deployments.

> It slows down total system throughput when there are multiple virtual
> machines sharing a single disk.  This latter point is my primary
> concern because in the future, I expect disk sharing to be common in
> some form (either via common QCOW base images or via CAS).

Sharing via qcow base images is also an important use case, but mainly for
desktop workloads.  Server workloads will be able to share a lot less, and
in any case will not keep reloading their text pages the way desktops do.

Regarding CAS, the Linux page cache indexes pages by inode number and
offset, so it cannot share page cache contents without significant rework.
Perhaps ksm could be adapted to do this, but it can't right now.  And
again, server consolidation scenarios, which are mostly unrelated workloads
jammed onto a single host, won't benefit much from this.

>
> I'd like to see a benchmark demonstrating that O_DIRECT improves
> overall system throughput in any scenario today.  I just don't buy that
> the cost of the extra copy is going to be significant today, since the
> CPU cache is already polluted.  I think the burden of proof is on
> O_DIRECT because it's quite simple to demonstrate where it hurts
> performance (just the time it takes to do two boots of the same image).
>
>> In a significant fraction of deployments it will be both simpler and
>> faster.
>>
>
> I think this is speculative.  Is there any performance data to back
> this up?

Given that we don't have a zero-copy implementation yet, it is impossible
to generate real performance data.
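(For concreteness, here is a minimal sketch of the two open(2) policies
being debated.  The image path and the 4 KiB alignment are illustrative
assumptions only; the real alignment requirement for O_DIRECT depends on
the device and filesystem.)

#define _GNU_SOURCE            /* for O_DIRECT on Linux */
#include <fcntl.h>
#include <stdlib.h>

int open_image(const char *path, int use_direct)
{
    /* O_DSYNC: data still passes through the host page cache, but each
     * write reaches stable storage before the syscall returns.
     * O_DIRECT: bypasses the host page cache entirely; buffers, file
     * offsets and lengths must be suitably aligned. */
    return open(path, O_RDWR | (use_direct ? O_DIRECT : O_DSYNC));
}

/* O_DIRECT also constrains the caller: I/O buffers must be aligned.
 * 4096 is an assumption; the device may require a different alignment. */
void *alloc_io_buffer(size_t len)
{
    void *buf = NULL;
    if (posix_memalign(&buf, 4096, len) != 0)
        return NULL;
    return buf;
}

The zero-copy question above is, presumably, whether guest memory can be
handed to the O_DIRECT descriptor directly, or must first be staged
through an aligned bounce buffer such as the one alloc_io_buffer()
returns.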
However, the claim is backed up by experience: all major databases use
direct I/O and do their own caching, and since the data access patterns of
filesystems are similar to those of databases (perhaps less random), there
is a case for not caching them either.

I'll repeat my arguments:

- cache size

  In many deployments we will maximize the number of guests, so free host
  memory will be low.  If your L3 cache (here, the host page cache) is
  smaller than your L2 cache (the guest page cache), your cache hit rate
  will be low.  Guests will write out data they do not expect to need soon
  (the tails of their LRU, or their journals), so caching it is pointless.
  Conversely, they _will_ cache data they have just read.

- cpu cache utilization

  When a guest writes out its page cache, this is likely to be some time
  after the cpu moved the data there, so the data is long gone from the
  cpu cache.  Now we bring it back into the cpu cache twice: once reading
  guest memory, and a second time writing into the host page cache.

  Similarly, when reading from the host page cache into the guest, we have
  no idea whether the guest will actually touch the memory in question.
  It may be doing readahead, or reading a metadata page of which it will
  only access a small part.  So again we waste two pages' worth of cpu
  cache per page we read.  Note also that we have no idea which vcpu will
  use the page, so even if the guest does touch the data, there is a high
  likelihood (for large guests) that it will land in the wrong cpu's
  cache.  (A rough sketch of the two read paths is appended at the end of
  this mail.)

- conflicting readahead heuristics

  The host may attempt to perform readahead on the disk.  However, the
  guest is also doing readahead, so the host extends the readahead further
  than is likely to be a good idea.  Moreover, the guest does logical
  (file-based) readahead while the host does physical (disk-order-based)
  readahead, or qcow-level readahead, which is basically reading random
  blocks.

Now, I don't have data that demonstrates how bad these effects are, but I
think there are sufficient arguments here to justify adding O_DIRECT.  I
intend to recommend O_DIRECT unless I see performance data that favours
O_DSYNC in real-world scenarios that take into account bandwidth, cpu
utilization, and memory utilization (i.e. a 1G guest on a 32G host running
fio but not top doesn't count).

--
Do not meddle in the internals of kernels, for they are subtle and quick
to panic.
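Appended: a rough sketch of the two read paths mentioned under "cpu cache
utilization".  guest_ram, BLK and the function names are illustrative
assumptions, not actual QEMU code.

#include <sys/types.h>
#include <unistd.h>

#define BLK 4096

/* Buffered path (O_DSYNC or default): on a miss, the disk DMAs into the
 * host page cache, then the cpu copies the data from the page cache into
 * guest RAM.  That copy drags the data through the cpu cache and leaves a
 * second copy of it resident in host memory. */
static ssize_t read_buffered(int fd_buffered, void *guest_ram, off_t ofs)
{
    return pread(fd_buffered, guest_ram, BLK, ofs);
}

/* Direct path (O_DIRECT): with guest_ram suitably aligned, the disk DMAs
 * straight into guest RAM.  The cpu never touches the payload and nothing
 * is left behind in the host page cache. */
static ssize_t read_direct(int fd_direct, void *guest_ram_aligned, off_t ofs)
{
    return pread(fd_direct, guest_ram_aligned, BLK, ofs);
}

The syscall is the same in both cases; the difference in behaviour comes
entirely from whether the descriptor was opened with O_DIRECT.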