Date: Tue, 26 Feb 2008 07:32:57 +0000
From: Jamie Lokier
Subject: Re: [Qemu-devel] [PATCH] ide.c make write cacheing controllable by guest
Message-ID: <20080226073257.GC30238@shareable.org>
References: <18371.1341.577787.909764@mariner.uk.xensource.com> <20080225205040.GA18613@shareable.org> <20080226011639.GA20401@puku.stupidest.org>
In-Reply-To: <20080226011639.GA20401@puku.stupidest.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Reply-To: qemu-devel@nongnu.org
List-Id: qemu-devel.nongnu.org
To: qemu-devel@nongnu.org

[To qemu-devel and Chris,

I have started a thread on linux-kernel on this topic. I've copied the
first few paragraphs here, so you can see what it's about since it's a
response to a post here. But it's largely off topic for Qemu, and on
topic for linux-kernel, so I didn't cross post lest linux-kernel replies
come here.]

To: linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org
Subject: Proposal for "proper" durable fsync() and fdatasync()
Message-ID: <20080226072649.GB30238@shareable.org>
Date: Tue, 26 Feb 2008 07:26:49 +0000

Dear kernel,

This is a proposal to add "proper" durable fsync() and fdatasync() to
Linux.

First the problem, then a proposed solution "with benefits", so to
speak.

[...]

By durable, I mean that fsync() should actually commit writes to
physical stable storage, not just the disk write cache when that is
enabled. Databases and guest VMs need this, or an equivalent feature,
if they aren't to face occasional corruption after power failure and
perhaps some crashes.

The alternative is to disable the disk write cache. But that isn't
modern practice or recommendation, since I/O write barriers were
implemented and they are much faster.

I was surprised that fsync() doesn't do this already. There was a lot
of effort put into block I/O write barriers during 2.5, so that
journalling filesystems can force correct write ordering, using disk
flush cache commands.

After all that effort, I was very surprised to notice that Linux 2.6.x
doesn't use that capability to ensure fsync() flushes the disk cache
onto stable storage.

I noticed this following up discussions on the Qemu mailing list, about
guest VMs and how their IDE flush cache command should translate to
fsync() to avoid data loss. (For guest VMs, fsync() isn't necessary if
the host machine is fine, and it isn't enough (on a Linux host) if the
host machine loses power or the hard disk crashes another way.)

Then I noticed it again, when I was designing a database engine with
filesystem characteristics. I thought "how do I ensure ordered journal
writes; can I use fdatasync()?" and was surprised to find the answer is
no, I have to use hacks like calling hdparm, and the authors of major
SQL databases seem to brush the problem under a carpet.
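
For illustration, the ordering a journal writer would like fdatasync()
to provide can be sketched as below. This is a minimal sketch, not code
from any database or from the kernel: journal_append() and the "COMMIT"
marker are hypothetical names, and the descriptor is assumed to have
been opened with O_APPEND. Only open(), write() and fdatasync() are
real POSIX calls. On the kernels discussed above, each fdatasync() may
return once the data has reached the disk's write cache, so the
ordering shown is the intent rather than a guarantee.

```c
#define _POSIX_C_SOURCE 200809L

#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Write the whole buffer, retrying on short writes and EINTR. */
static int write_all(int fd, const void *buf, size_t len)
{
    const char *p = buf;

    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/*
 * Hypothetical journal append: payload first, commit marker second.
 * The first fdatasync() is meant to be an ordering point, so that the
 * marker can never be on stable storage without the payload.  That
 * only holds if fdatasync() really reaches stable storage, which is
 * exactly the property this proposal asks for.
 * Returns 0 on success, -1 on error (errno set by the failing call).
 */
int journal_append(int fd, const void *payload, size_t len)
{
    static const char commit_marker[] = "COMMIT\n";

    if (write_all(fd, payload, len) < 0)
        return -1;
    if (fdatasync(fd) < 0)      /* payload should be durable here */
        return -1;
    if (write_all(fd, commit_marker, sizeof commit_marker - 1) < 0)
        return -1;
    return fdatasync(fd);       /* commit marker should be durable here */
}
```

A caller would open the journal with something like
open(path, O_WRONLY | O_CREAT | O_APPEND, 0600) and treat a record as
committed only after journal_append() returns 0.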

(Interestingly, in the Linux 2.4 patches for write barriers, fsync()
seems to be fine, if a bit slow.)

It isn't the first time this topic has come up:

http://groups.google.com.br/group/linux.kernel/browse_thread/thread/d343e51655b4ac7c/7ee9bca80977c2d1?#7ee9bca80977c2d1
("True fsync() in Linux (on IDE)")

In that thread, it was implied that this would be fixed in 2.6. So I bet
some people are under the illusion that it's fixed in 2.6...

For a while, I've been meaning to bring it up on linux-kernel...

[More on linux-kernel].

Thanks,
-- Jamie