Date: Wed, 21 May 2008 02:19:15 +0100
From: Jamie Lokier
Subject: Re: [Qemu-devel] Re: [PATCH][v2] Align file accesses with cache=off (O_DIRECT)
Message-ID: <20080521011915.GC595@shareable.org>
References: <1211283126.4314.70.camel@frecb07144> <48332AB9.3010707@codemonkey.ws> <20080520223602.GE27853@shareable.org> <48337444.2070203@codemonkey.ws>
In-Reply-To: <48337444.2070203@codemonkey.ws>
Reply-To: qemu-devel@nongnu.org
List-Id: qemu-devel.nongnu.org
To: qemu-devel@nongnu.org
Cc: Blue Swirl, Laurent Vivier, Kevin Wolf

Anthony Liguori wrote:
> >One property of disks is that if you overwrite a sector and there's
> >power loss, when read later that sector might be corrupt.  Even if the
> >new data is the same as the old data with only some bytes changed,
> >some of the _unchanged_ bytes may be corrupted by this.
>
> I don't think this is true.  What evidence do you have to support such
> claims?

What do you imagine happens when you pull the power in the middle of
writing a sector to a floppy disk (to pick a more easily imagined
example)?

There is not enough residual power to write the rest of the sector.
That sector's checksum will therefore not match its data, and reading
it will (hopefully) give a CRC error.  The sector can be written over
again, which clears the CRC error.

No sector which wasn't being written will be corrupt: the write head
isn't activated over those.  The drive waits until it senses the start
of sector N, then activates the write head to write data bits.

The CRC error by itself may cause the whole sector to be reported as
corrupt, with no data returned.

However, if you do manage to get back the bits from the media, some
bits of the sector being written whose values were not intended to
change may be different than expected.  This is because the way data
is recorded does not encode each bit separately, but multiplexes them
together for modulation, and also because bit timing is not exact.

A modern hard disk uses much more complex data encoding, which further
adds to the effect of a truncated write corrupting even data bits not
intended to be changed, in the vicinity of those being changed.

But it should aim to provide the same basic guarantee: writing a
sector cannot corrupt neighbouring sectors on power failure, only the
one(s) being written.  This is because the robustness of journalling
filesystems and databases does rather depend on this property, and
simple old-fashioned disks do provide it.

I am just speculating; I don't know whether modern hard disks provide
this property, or under what circumstances they fail.  But it seems
they could provide it, because they still have physically independent
sectors.

(Interestingly, the journal block size used by Oracle differs between
OSes, suggesting the "basic unit of corruption" varies between OSes
and is not always a single sector.)

Although it's just speculation, do you think modern hard disks behave
differently from this?

-- 
Jamie
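
To make the floppy example above concrete, here is a toy model of a torn sector write. It is only a sketch under stated assumptions: the 512-byte sector size, the use of CRC-32 as the per-sector checksum, and the helper names (make_sector, torn_write, read_sector) are all invented for illustration and do not describe any real drive's format. It models only two of the points above: the interrupted sector fails its check, and sectors that were not being written are untouched. It does not model the modulation effects that can disturb unchanged bits inside the torn sector.

```python
import zlib

SECTOR_SIZE = 512  # assumed sector size, for illustration only


def make_sector(data: bytes) -> bytes:
    """Return the on-media image: payload followed by a 4-byte CRC-32."""
    assert len(data) == SECTOR_SIZE
    return data + zlib.crc32(data).to_bytes(4, "big")


def torn_write(old_image: bytes, new_image: bytes, bytes_written: int) -> bytes:
    """Model power loss: only the first bytes_written bytes of the new image land."""
    return new_image[:bytes_written] + old_image[bytes_written:]


def read_sector(image: bytes):
    """Return the payload, or None if the stored CRC does not match (read error)."""
    data = image[:SECTOR_SIZE]
    stored = int.from_bytes(image[SECTOR_SIZE:], "big")
    return data if zlib.crc32(data) == stored else None


# A toy "disk" of three sectors; only sector 1 is being rewritten.
disk = [make_sector(bytes([n]) * SECTOR_SIZE) for n in range(3)]

# New contents for sector 1: first half identical to the old data.
new_payload = bytes([1]) * 256 + bytes([0xFF]) * 256

# Power fails after 300 bytes: the tail of the data and the old CRC remain.
disk[1] = torn_write(disk[1], make_sector(new_payload), bytes_written=300)

# Neighbouring sectors stay readable; the torn sector fails its CRC.
results = [read_sector(s) is not None for s in disk]
print(results)
```

Rewriting sector 1 in full with make_sector(new_payload) makes it readable again, which corresponds to the "written over again" step described above.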