From mboxrd@z Thu Jan 1 00:00:00 1970
Received: from [140.186.70.92] (port=51201 helo=eggs.gnu.org)
    by lists.gnu.org with esmtp (Exim 4.43) id 1OxjJo-0004ds-PN
    for qemu-devel@nongnu.org; Mon, 20 Sep 2010 12:34:17 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.69)
    (envelope-from ) id 1OxjJn-0006X1-Ju
    for qemu-devel@nongnu.org; Mon, 20 Sep 2010 12:34:16 -0400
Received: from mail-vw0-f45.google.com ([209.85.212.45]:56186)
    by eggs.gnu.org with esmtp (Exim 4.69)
    (envelope-from ) id 1OxjJn-0006Wo-F5
    for qemu-devel@nongnu.org; Mon, 20 Sep 2010 12:34:15 -0400
Received: by vws19 with SMTP id 19so3718643vws.4
    for ; Mon, 20 Sep 2010 09:34:14 -0700 (PDT)
Message-ID: <4C978CFA.1000600@codemonkey.ws>
Date: Mon, 20 Sep 2010 11:34:02 -0500
From: Anthony Liguori
MIME-Version: 1.0
Subject: Re: [Qemu-devel] [RFC] block-queue: Delay and batch metadata writes
References: <1284991010-10951-1-git-send-email-kwolf@redhat.com>
    <4C977028.3050602@codemonkey.ws> <4C9778EC.9060704@redhat.com>
    <4C978071.2010209@codemonkey.ws> <4C9783E7.5080905@redhat.com>
In-Reply-To: <4C9783E7.5080905@redhat.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
List-Id: qemu-devel.nongnu.org
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
To: Kevin Wolf
Cc: qemu-devel@nongnu.org

On 09/20/2010 10:55 AM, Kevin Wolf wrote:
> On 20.09.2010 17:40, Anthony Liguori wrote:
>
>> On 09/20/2010 10:08 AM, Kevin Wolf wrote:
>>
>>>> If you're comfortable with a writeback cache for metadata, then you
>>>> should also be comfortable with a writeback cache for data in which
>>>> case, cache=writeback is the answer.
>>>>
>>>>
>>> Well, there is a difference: We don't pollute the host page cache with
>>> guest data and we don't get a virtual "disk cache" as big as the host
>>> RAM, but only a very limited queue of metadata.
>>>
>>> Basically, in qemu we have three different types of caching:
>>>
>>> 1. O_DSYNC, everything is always synced without any explicit request.
>>>    This is cache=writethrough.
>>>
>>>
>> I actually think O_DSYNC is the wrong implementation of
>> cache=writethrough.  cache=writethrough should behave just like
>> cache=none except that data goes through the page cache.
>>
> Then you have cache=writeback, basically.
>

No.  Write through means "write requests are not completed until the
data has been acknowledged by the next layer."  Write back means "write
requests are completed irrespective of the data being acknowledged by
the next layer."

Write through ensures consistency with the next layer whereas write back
doesn't.  cache=none means that there is no cache.  If there is no
cache, then you're guaranteed to be consistent with the next layer.  The
only reason it's exposed as writeback at the emulation layer is that
*usually* disks have writeback caches.

The fact that writethrough currently is stronger than the next layer (in
that it breaks through the writeback cache a disk may have) is not a
designed feature.  It's an accident.  Had ext3 enabled barriers by
default, writethrough would not use O_DSYNC.

>>> 2. Nothing is ever synced. This is cache=unsafe.
>>>
>>> 3. We present a writeback disk cache to the guest and the guest needs
>>>    to explicitly flush to get its data safe on disk. This is
>>>    cache=writeback and cache=none.
>>>
>>>
>> We shouldn't tie the virtual disk cache to which cache= option is used
>> in the host.  cache=none means that all requests go directly to the
>> disk.  cache=writeback means the host acts as a writeback cache.
>>
> No, that's not the meaning of cache=none if you take the disk cache into
> consideration.

You can't possibly take into account the disk cache because we don't
know anything about the disk cache in QEMU.  Just assuming disks always
have writeback caches is wrong.

> It might be what you think should be the meaning of
> cache=none, but it's not what it means in any qemu release.
>

That's precisely what it's meant in every release since I first
introduced the cache options.

>>> We're still lacking modes for O_DSYNC | O_DIRECT and unsafe | O_DIRECT,
>>> but they are entirely possible, because it's two different dimensions.
>>> (And I think Christoph was planning to actually make it two independent
>>> options)
>>>
>> I don't really think O_DSYNC | O_DIRECT makes much sense.
>>
> Maybe, maybe not. It's just a missing entry in the matrix.
>

Regards,

Anthony Liguori
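
For illustration, the write-through and write-back definitions given in this message can be sketched as a toy model. This is only a sketch under stated assumptions: `NextLayer`, `WriteThroughCache` and `WriteBackCache` are hypothetical names invented here, the "next layer" is an in-memory dictionary, and none of it reflects QEMU's actual block layer code.

```python
class NextLayer:
    """Hypothetical stand-in for whatever sits below the cache (e.g. a disk)."""

    def __init__(self):
        self.blocks = {}

    def write(self, sector, data):
        # Returning from this call models the layer acknowledging the write.
        self.blocks[sector] = data


class WriteThroughCache:
    """A write completes only after the next layer has acknowledged it."""

    def __init__(self, next_layer):
        self.next_layer = next_layer
        self.cached = {}

    def write(self, sector, data):
        self.cached[sector] = data
        self.next_layer.write(sector, data)  # wait for the acknowledgement
        return "completed"

    def flush(self):
        # Nothing is ever pending, so there is nothing to push down.
        pass


class WriteBackCache:
    """A write completes irrespective of the next layer's acknowledgement."""

    def __init__(self, next_layer):
        self.next_layer = next_layer
        self.dirty = {}

    def write(self, sector, data):
        self.dirty[sector] = data  # held in the cache only
        return "completed"

    def flush(self):
        # An explicit flush is what makes the next layer consistent.
        for sector, data in self.dirty.items():
            self.next_layer.write(sector, data)
        self.dirty.clear()
```

In this model, a completed write through `WriteThroughCache` is always visible in `NextLayer.blocks`, while a completed write through `WriteBackCache` becomes visible there only after `flush()`. That is the consistency difference described above; whether the next layer has a writeback cache of its own is deliberately outside the model.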