From: Avi Kivity
Date: Sun, 12 Oct 2008 20:34:08 +0200
Message-ID: <48F24320.9010201@redhat.com>
In-Reply-To: <48F23AF1.2000104@codemonkey.ws>
Subject: Re: [Qemu-devel] [RFC] Disk integrity in QEMU
To: Anthony Liguori
Cc: Chris Wright, Mark McLoughlin, kvm-devel, Laurent Vivier,
 qemu-devel@nongnu.org, Ryan Harper

Anthony Liguori wrote:
>>
>> Getting good performance because we have a huge amount of free memory
>> in the host is not a good benchmark. Under most circumstances, the
>> free memory will be used either for more guests, or will be given to
>> the existing guests, which can utilize it more efficiently than the
>> host.
>
> There are two arguments for O_DIRECT. The first is that you can avoid
> bringing data into the CPU cache. This requires zero-copy in QEMU,
> but ignoring that, the use of the page cache doesn't necessarily
> prevent us from achieving this.
>
> In the future, most systems will have a DMA offload engine. This is a
> pretty obvious thing to attempt to accelerate with such an engine,
> which would prevent cache pollution.

But it would increase latency, memory bus utilization, and CPU
overhead. In the cases where the page cache buys us something (host
page cache significantly larger than guest memory), that's
understandable. But for the other cases, why bother? Especially when
many systems don't have this today.

Let me phrase this another way: is there an argument against O_DIRECT?
In a significant fraction of deployments it will be both simpler and
faster.
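For reference, here is a minimal sketch of what an O_DIRECT read looks
like from userspace (illustration only, not QEMU's block layer; the
4096-byte block size is an assumption). The main constraint O_DIRECT
adds is that the buffer, offset, and length must be suitably aligned:

    /* Illustration only -- not QEMU code. Reads one block of a file
     * or device with O_DIRECT, i.e. without going through the host
     * page cache. Assumes 4096-byte alignment is sufficient. */
    #define _GNU_SOURCE             /* for O_DIRECT on Linux */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define ALIGNMENT 4096

    int main(int argc, char **argv)
    {
        void *buf;
        int fd;

        if (argc < 2) {
            fprintf(stderr, "usage: %s <file>\n", argv[0]);
            return 1;
        }
        fd = open(argv[1], O_RDONLY | O_DIRECT);
        if (fd < 0) {
            perror("open");
            return 1;
        }
        /* O_DIRECT requires an aligned buffer. */
        if (posix_memalign(&buf, ALIGNMENT, ALIGNMENT)) {
            return 1;
        }
        /* Read one aligned block, bypassing the host page cache. */
        if (pread(fd, buf, ALIGNMENT, 0) < 0) {
            perror("pread");
            return 1;
        }
        free(buf);
        close(fd);
        return 0;
    }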
> Another possibility is to directly map the host's page cache into the
> guest's memory space.

Doesn't work with large pages.

> The latter is a bit tricky but is so much more interesting, especially
> if you have a strong storage backend that is capable of deduplication
> (you get memory compaction for free).

It's not free at all. Replacing a guest memory page involves IPIs and
TLB flushes. It only works on small pages, and only if the host page
cache and guest page cache are aligned with each other. And with
current Linux memory management, I don't see a way to do it that
doesn't involve creating a vma for every page, which is prohibitively
expensive.

> I also have my doubts that the amount of memory saved by using
> O_DIRECT will have a noticeable impact on performance, considering
> that guest memory and page cache memory are entirely reclaimable.

O_DIRECT is not about saving memory; it is about saving CPU
utilization, cache utilization, and memory bandwidth.

> An LRU should make the best decisions about whether memory is more
> valuable for the guests or for the host page cache.

LRU typically makes fairly bad decisions, since it throws away most of
the information it has. I recommend looking up LRU-K and similar
algorithms, just to get a feel for this; LRU is basically the simplest
possible algorithm short of random selection. Note that Linux doesn't
even have a true LRU; it has to approximate, since it can't sample all
of the pages all of the time. With a hypervisor that uses Intel's EPT
it's even worse, since we don't have an accessed bit.

On silly benchmarks that just exercise the disk and touch no memory,
and if you tune the host very aggressively, LRU will win on long
running guests, since it will eventually page out all unused guest
memory (with Linux guests, it will never even page guest memory in).
On real-life applications I don't think there is much chance.

--
Do not meddle in the internals of kernels, for they are subtle and
quick to panic.
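To make the LRU vs. LRU-K point above concrete, here is a toy sketch
(illustration only; nothing like the actual Linux reclaim code, and the
page names and reference times are made up). Plain LRU looks only at
the most recent reference, so a page touched once by a streaming scan
looks "fresh" and survives at the expense of a page that is actually
reused; LRU-2 ranks pages by their second-most-recent reference and
evicts the one-shot page instead:

    /* Toy comparison of LRU and LRU-2 victim selection -- illustration
     * only, unrelated to the Linux implementation. Each page tracks
     * its last two reference times; prev_ref == 0 means the page has
     * been referenced only once. */
    #include <stdio.h>

    #define NPAGES 4

    struct page {
        const char *name;
        unsigned long last_ref;   /* most recent reference time */
        unsigned long prev_ref;   /* second most recent, 0 if none */
    };

    /* LRU: evict the page with the oldest most-recent reference. */
    static int lru_victim(const struct page *p, int n)
    {
        int i, victim = 0;
        for (i = 1; i < n; i++)
            if (p[i].last_ref < p[victim].last_ref)
                victim = i;
        return victim;
    }

    /* LRU-2: evict the page with the oldest second-most-recent
     * reference, so one-shot pages go first. */
    static int lru2_victim(const struct page *p, int n)
    {
        int i, victim = 0;
        for (i = 1; i < n; i++)
            if (p[i].prev_ref < p[victim].prev_ref)
                victim = i;
        return victim;
    }

    int main(void)
    {
        struct page pages[NPAGES] = {
            { "guest-hot",     95, 90 },
            { "guest-warm",    80, 70 },
            { "host-cache",    85, 60 },
            { "one-shot-scan", 99,  0 },  /* streaming read, touched once */
        };

        printf("LRU evicts:   %s\n", pages[lru_victim(pages, NPAGES)].name);
        printf("LRU-2 evicts: %s\n", pages[lru2_victim(pages, NPAGES)].name);
        return 0;
    }

Here LRU keeps the one-shot scan page and evicts a page that is being
reused; LRU-2, by keeping one extra reference per page, does the
opposite.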