Date: Tue, 21 Sep 2010 16:26:08 +0200
From: Christoph Hellwig
To: Anthony Liguori
Cc: Kevin Wolf, Christoph Hellwig, qemu-devel
Subject: [Qemu-devel] Re: Caching modes
Message-ID: <20100921142608.GA18290@lst.de>
In-Reply-To: <4C97F9C6.60501@codemonkey.ws>

On Mon, Sep 20, 2010 at 07:18:14PM -0500, Anthony Liguori wrote:
> O_DIRECT alone to a pre-allocated file on a normal file system should
> result in the data being visible without any additional metadata
> transactions.

Anthony, for the third time: no.  O_DIRECT is a non-portable extension
in Linux (taken from IRIX) and is defined as:

       O_DIRECT (Since Linux 2.4.10)
              Try to minimize cache effects of the I/O to and from this
              file.  In general this will degrade performance, but it
              is useful in special situations, such as when
              applications do their own caching.  File I/O is done
              directly to/from user space buffers.  The O_DIRECT flag
              on its own makes an effort to transfer data
              synchronously, but does not give the guarantees of the
              O_SYNC flag that data and necessary metadata are
              transferred.
              To guarantee synchronous I/O, O_SYNC must be used in
              addition to O_DIRECT.  See NOTES below for further
              discussion.

              A semantically similar (but deprecated) interface for
              block devices is described in raw(8).

O_DIRECT does not have any meaning for data integrity; it just tells
the filesystem that it *should* not use the pagecache.  Even then,
various filesystems fall back to buffered I/O for corner cases.  It
does *not* mean the actual disk cache gets flushed, and it does *not*
guarantee anything about metadata, which is very important.  In
practice, metadata updates happen when filling sparse files, when
extending the file size, when using a COW filesystem, and when
converting preallocated extents to fully allocated ones, and they
could happen in many more cases depending on the filesystem
implementation.

> >Barriers are a Linux-specific implementation detail that is in the
> >process of going away, probably in Linux 2.6.37.  But if you want
> >O_DSYNC semantics with a volatile disk write cache there is no way
> >around using a cache flush or the FUA bit on all I/O caused by it.
>
> If you have a volatile disk write cache, then we don't need O_DSYNC
> semantics.

If you present a volatile write cache to the guest, you indeed do not
need O_DSYNC and can rely on the guest sending fdatasync calls when it
wants to flush the cache.  But for the statement above you can replace
O_DSYNC with fdatasync and it will still be correct: O_DSYNC in
current Linux kernels is nothing but an implicit range fdatasync after
each write.

> > We
> >currently use the cache flush, and although I plan to experiment a bit
> >more with the FUA bit for O_DIRECT | O_DSYNC writes I would be very
> >surprised if they actually are any faster.
>
> The thing I struggle with understanding is that if the guest is sending
> us a write request, why are we sending the underlying disk a write +
> flush request?  That doesn't seem logical at all to me.
We only send a cache flush request *iff* we present the guest a device
without a volatile write cache, so that it can assume all writes are
stable, and we sit on a host device that does have a volatile write
cache.

> Even if we advertise WC disable, it should be up to the guest to decide
> when to issue flushes.

No.  If we don't claim to have a volatile cache, no guest will ever
flush the cache.  Which is only logical, given that we just told it
that we don't have a cache that needs flushing.

> >ext3 and ext4 have really bad fsync implementations.  Just use a better
> >filesystem or bug one of its developers if you want that fixed.  But
> >except for disabling the disk cache there is no way to get data integrity
> >without cache flushes (the FUA bit is nothing but an implicit flush).
>
> But why are we issuing more flushes than the guest is issuing if we
> don't have to worry about filesystem metadata (i.e. preallocated storage
> or physical devices)?

Who is "we", and what is the workload/filesystem/kernel combination?
Specific details and numbers, please.