From: Avi Kivity
Date: Wed, 21 May 2008 21:29:01 +0300
Subject: Re: [Qemu-devel] Re: [PATCH][v2] Align file accesses with cache=off (O_DIRECT)
Message-ID: <483469ED.1050408@qumranet.com>
In-Reply-To: <20080521174754.GG22488@duo.random>
To: Andrea Arcangeli
Cc: Blue Swirl, Laurent Vivier, qemu-devel@nongnu.org, Paul Brook

Andrea Arcangeli wrote:
> On Wed, May 21, 2008 at 08:18:01PM +0300, Avi Kivity wrote:
>
>> Yes, that's the reason. Here zerocopy is not the motivation; instead, we
>> have host-cached pages that are used directly in the guest. So we get
>> both reduced memory footprint and host caching. O_DIRECT reduces the
>> memory footprint but kills host caching.
>
> Sure. So MAP_SHARED+remap_file_pages should work just fine to achieve
> zerocopy I/O.

No: with MAP_SHARED, when the guest writes to that memory it will affect
the disk, which doesn't happen with normal memory writes. MAP_PRIVATE is
needed.

>> The scenario is desktop/laptop use. For server use O_DIRECT is clearly
>> preferred due to much reduced overhead.
>
> Well, in some ways there's more overhead with O_DIRECT, because O_DIRECT
> has to call get_user_pages and walk pagetables in software before every
> I/O operation. MAP_SHARED walks them in hardware and can take advantage
> of the CPU TLB too.
>
> The primary problem with MAP_SHARED isn't the overhead of the operation
> itself, which in fact will be lower with MAP_SHARED once the cache is
> allocated, but the write throttling and garbage collection of the host
> caches. If you have a 250G guest image, MAP_SHARED will allocate as much
> as 250G of cache, and a cp /dev/zero /dev/hdb in the guest will mark
> 100% of guest RAM dirty. The mkclean methods and write throttling for
> MAP_SHARED introduced in reasonably recent kernels can avoid filling
> 100% of host RAM with dirty pages, but it still requires write
> throttling, it pollutes the host caches, and it can result in large RAM
> allocations in the host having to block before the RAM is available,
> the same way as buffered writes in the host (the current KVM default).

I'd do writes via the normal write path, not mmap().
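
(An aside for anyone following the cache=off discussion without the patch
in front of them: below is a minimal sketch of what O_DIRECT imposes on
the I/O path. It is not the patch itself, just an illustration; the
512-byte alignment is an assumption, the real constraint depends on the
filesystem and device.)

#define _GNU_SOURCE              /* for O_DIRECT on Linux */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    const size_t align = 512;    /* assumed; often the logical block size */
    const size_t len   = 4096;   /* length must be a multiple of align */
    void *buf;
    int fd;

    if (argc < 2)
        return 1;

    fd = open(argv[1], O_RDONLY | O_DIRECT);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* the buffer must be suitably aligned; plain malloc() is not enough */
    if (posix_memalign(&buf, align, len)) {
        close(fd);
        return 1;
    }

    /* the file offset and length must also be aligned, hence offset 0 here */
    if (pread(fd, buf, len, 0) < 0)
        perror("pread");

    free(buf);
    close(fd);
    return 0;
}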
> I think O_DIRECT is the best solution, and MAP_SHARED could become a
> secondary option just for certain guest workloads with light I/O where
> fairness isn't even a variable worth considering.
>
> The cost of garbage collection of the mapped caches on the host isn't
> trivial, and I don't mean because the nonlinear rmap logic has to scan
> all pagetables; that's a minor cost compared to shrinking the host
> caches before try_to_unmap is ever invoked, etc. Leaving the host caches
> purely for host usage is surely more fair: it won't ever lead to one
> guest doing heavy I/O thrashing the host caches and leaving all other
> guests and host tasks hanging. If they hang for a few msec with O_DIRECT
> it'll be because they're waiting for I/O and the elevator put them on
> the queue to wait for the disk to become ready, not because of write
> throttling during writes, or because the alloc_pages shrink methods are
> calling ->writepage on the dirty pages.
>
> The other significant advantage of O_DIRECT is that you won't have to
> call msync to provide journaling.
>
> I think O_DIRECT will work best for all usages, and it looks higher
> priority to me than MAP_SHARED. MAP_SHARED will surely result in better
> benchmarks for certain workloads, though: imagine 'dd if=/dev/hda
> of=/dev/zero iflag=direct bs=1M count=100' run in the guest; it'll read
> from cache and do zero I/O starting from the second run with
> MAP_SHARED ;).
>
> If it were me, I'd prefer O_DIRECT by default.

Certainly O_DIRECT is the normal path. We're considering mmap() as a way
to get both host caching and avoid double-caching.

> For full disclosure, you may also want to read this, but I strongly
> disagree with those statements. http://kerneltrap.org/node/7563

I disagree with them strongly too. For general purpose applications you
want to avoid O_DIRECT, but for special purpose applications that do
their own caching (databases, virtualization, streaming servers),
O_DIRECT is critical. The kernel's cache management algorithms simply
cannot compete with a specially tuned application, not to mention the
additional overhead that comes from crossing a protection boundary.

[I've worked on a userspace filesystem that took every possible measure
to get the OS out of the way: user level threads, O_DIRECT, aio, large
pages.]

-- 
Do not meddle in the internals of kernels, for they are subtle and quick
to panic.