Date: Tue, 1 Aug 2006 15:17:05 +0100
From: Jamie Lokier
Subject: Re: [Qemu-devel] Ensuring data is written to disk
Message-ID: <20060801141705.GA7779@mail.shareable.org>
References: <20060801101743.GA31760@mail.shareable.org> <20060801104539.GO31908@suse.de>
In-Reply-To: <20060801104539.GO31908@suse.de>
Reply-To: qemu-devel@nongnu.org
List-Id: qemu-devel.nongnu.org
To: qemu-devel@nongnu.org

Jens Axboe wrote:
> On Tue, Aug 01 2006, Jamie Lokier wrote:
> > > Of course, guessing the disk drive write buffer size and trying not to kill
> > > system I/O performance with all these writes is another question entirely
> > > ... sigh !!!
> >
> > If you just want to evict all data from the drive's cache, and don't
> > actually have other data to write, there is a CACHEFLUSH command you
> > can send to the drive which will be more dependable than writing as
> > much data as the cache size.
>
> Exactly, and this is what the OS fsync() should do once the drive has
> acknowledged that the data has been written (to cache). At least
> reiserfs w/barriers on Linux does this.

1. Are you sure this happens, w/ reiserfs on Linux, even if the disk
   is a SATA or SCSI type that supports ordered tagged commands? My
   understanding is that barriers force an ordering between write
   commands, and that CACHEFLUSH is used only with disks that don't
   have more sophisticated write ordering commands. Is the data still
   committed to the disk platter before fsync() returns on those?

2. Do you know if ext3 (in ordered mode) w/barriers on Linux does it
   too, for in-place writes which don't modify the inode and therefore
   don't have a journal entry?

On Darwin, fsync() does not issue CACHEFLUSH to the drive. Instead,
it has an fcntl, F_FULLFSYNC, which does that, and which is documented
in Darwin's fsync() page as working with all Darwin's filesystems,
provided the hardware honours CACHEFLUSH or the equivalent.

From what little documentation I've found, on Linux it appears to be
much less predictable. It seems that some filesystems, with some
kernel versions, and some mount options, on some types of disk, with
some drive settings, will commit data to a platter before fsync()
returns, and others won't. And an application calling fsync() has no
easy way to find out. Have I got this wrong?

PS. (An aside question): do you happen to know of a good patch which
implements IDE barriers w/ ext3 on 2.4 kernels? I found a patch by
googling, but it seemed that the ext3 parts might not be finished, so
I don't trust it. I've found turning off the IDE write cache makes
writes safe, but with a huge performance cost.

Thanks,
-- Jamie
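
To make the Darwin/Linux difference above concrete, here is a minimal C sketch of what an application might do. The helper name flush_to_stable_storage() is hypothetical and not from any existing code; it assumes Darwin's fcntl constant is spelled F_FULLFSYNC, as in Apple's headers, and falls back to plain fsync() wherever that constant is not defined. On the fallback path it guarantees nothing beyond what the platform's fsync() provides, which is exactly the uncertainty discussed above.

```c
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper (illustration only): try to push a file's data
 * all the way to stable storage.
 *
 * Returns 0 on success, -1 on failure with errno set.
 */
int flush_to_stable_storage(int fd)
{
#ifdef F_FULLFSYNC
    /* Darwin: ask the drive to flush its write cache as well. */
    if (fcntl(fd, F_FULLFSYNC) == 0)
        return 0;
    /* The request can fail if the filesystem or device does not
     * support it; fall through to an ordinary fsync(). */
#endif
    /* Elsewhere: whether this reaches the platter depends on the
     * filesystem, kernel version, mount options and drive settings. */
    return fsync(fd);
}
```

A caller would invoke flush_to_stable_storage(fd) after its final write() and treat a non-zero return as "data may not be on disk".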