From: Pavel Machek <pavel@suse.cz>
To: Theodore Tso <tytso@mit.edu>, Chris Friesen <cfriesen@nortel.com>,
	mikulas@artax.karlin.mff.cuni.cz, clock@atrey.karlin.mff.cuni.cz,
	kernel list <linux-kernel@vger.kernel.org>,
	aviro@redhat.com
Subject: Re: writing file to disk: not as easy as it looks
Date: Tue, 2 Dec 2008 23:44:03 +0100
Message-ID: <20081202224403.GA8277@elf.ucw.cz>
In-Reply-To: <20081202205558.GD20858@mit.edu>

On Tue 2008-12-02 15:55:58, Theodore Tso wrote:
> On Tue, Dec 02, 2008 at 11:22:58AM -0600, Chris Friesen wrote:
> > Theodore Tso wrote:
> >
> >> Even for ext3/ext4 which is doing physical journalling, it's still the
> >> case that the journal commits first, and it's only later when the
> >> write happens that we write out the change.  If the disk fails some of
> >> the writes, it's possible to lose data, especially if the two blocks
> >> involved in the node split are far apart, and the write to the
> >> existing old btree block fails.
> >
> > Yikes.  I was under the impression that once the journal hit the platter  
> > then the data were safe (barring media corruption).
> 
> Well, this is a case of media corruption (or a cosmic ray
> hitting a ribbon cable in the disk controller sending the write to the
> wrong location on disk, or someone bumping the server causing the disk
> head to lift up a little higher than normal while it was writing the
> disk sector, etc.).  But it is a case of the hard drive misbehaving. 

I could not parse this. Negation seems to be missing somewhere.

> Heck, if you have a hiccup while writing an inode table block out to
> disk (for example a power failure at just the wrong time), so the
> memory (which is more voltage sensitive than hard drives) DMA's
> garbage which gets written to the inode table, you could lose a large
> number of adjacent inodes when garbage gets splatted over the inode
> table.

Ok, "memory failed before disk" is ... bad hardware.

...but... you seem to be saying that modern filesystems can damage
data even on "sane" hardware.

Let's define sane as:

1) if the disk says a sector was successfully written, it stays
written until you start writing to that sector again.

	(but the disk may say "error writing". The filesystem should
	propagate that back to userland, reliably. "Error writing" is
	extremely rare on modern disks, but can happen if you run out
	of spare blocks.)

	(and if you ask for a sector write, the sector is in an
	undefined state until the drive returns success. Flash behaves
	like this -- reads return errors. Do disks?)

2) the connection to the disk either works or fails totally. Bit errors
are reliably detected at the connection level.

3) power may fail any time.
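
The disk half of this definition can be modelled in a few lines of C.
This is only a toy sketch: the names (struct disk, disk_write_begin,
disk_write_complete, disk_read) and the sizes are made up for
illustration, and it assumes the flash-like behaviour where reading a
sector whose write never completed returns an error. A power failure is
modelled by simply never calling disk_write_complete().

```c
#include <assert.h>
#include <string.h>

#define SECTOR_SIZE 512
#define NSECTORS    8

/* A sector is either stable or in the middle of being rewritten. */
enum sector_state { SECTOR_VALID = 0, SECTOR_UNDEFINED };

struct sector {
	enum sector_state state;
	unsigned char data[SECTOR_SIZE];
};

struct disk {
	struct sector s[NSECTORS];
};

/* Fresh disk: every sector is valid and reads back as zeroes. */
void disk_init(struct disk *d)
{
	memset(d, 0, sizeof(*d));
}

/* Rule 1, second note: once a write starts, the contents are undefined. */
void disk_write_begin(struct disk *d, unsigned lba)
{
	assert(lba < NSECTORS);
	d->s[lba].state = SECTOR_UNDEFINED;
}

/* The drive reports success: the new data is now stable. */
void disk_write_complete(struct disk *d, unsigned lba,
			 const unsigned char *buf)
{
	assert(lba < NSECTORS);
	memcpy(d->s[lba].data, buf, SECTOR_SIZE);
	d->s[lba].state = SECTOR_VALID;
}

/* Returns 0 and fills buf, or -1 if the sector is in an undefined state. */
int disk_read(const struct disk *d, unsigned lba, unsigned char *buf)
{
	assert(lba < NSECTORS);
	if (d->s[lba].state != SECTOR_VALID)
		return -1;
	memcpy(buf, d->s[lba].data, SECTOR_SIZE);
	return 0;
}
```

In this model, writing sector 1 to completion and then starting a
second write that never completes leaves disk_read(&d, 1, buf)
returning -1, while the untouched sector 0 still reads fine.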

You seem to be saying that ext2/ext3 only work if these are met:

1) power may fail any time.

2) writes are always successful.

3) connection to the disk always works.

AFAICT it is unsafe to run ext2/ext3 on any media that can be removed
without unmounting (missing fsync error propagation), and it is unsafe
to run ext2/ext3 on any flash-based storage with block interface (SD
cards, flash sticks).
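
For reference, the userland side of that error propagation might look
like the sketch below. save_file() and write_all() are hypothetical
helper names, not anything from this thread. The sketch assumes a
POSIX system and only shows where an "error writing" would have to
surface: in the return values of write(), fsync() and close(), plus an
fsync() of the containing directory so that the new name reaches the
disk too. It can only report what the kernel and filesystem actually
pass up.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Write all of buf, retrying short writes and EINTR. Returns 0 or -1. */
static int write_all(int fd, const char *buf, size_t len)
{
	while (len > 0) {
		ssize_t n = write(fd, buf, len);
		if (n < 0) {
			if (errno == EINTR)
				continue;
			return -1;
		}
		buf += n;
		len -= (size_t)n;
	}
	return 0;
}

/*
 * Hypothetical helper: create `name` inside `dir`, write the data and
 * report every error the kernel hands back. Returns 0 on success, or
 * -1 with errno set to the first failure.
 */
int save_file(const char *dir, const char *name, const char *buf, size_t len)
{
	int dfd, fd, rc = 0, err = 0;

	assert(dir && name && buf);

	dfd = open(dir, O_RDONLY | O_DIRECTORY);
	if (dfd < 0)
		return -1;

	fd = openat(dfd, name, O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (fd < 0) {
		err = errno;
		close(dfd);
		errno = err;
		return -1;
	}

	/* Data first: write it, then ask for it to be flushed. */
	if (write_all(fd, buf, len) < 0 || fsync(fd) < 0) {
		rc = -1;
		err = errno;
	}
	/* close() can also report a deferred write error. */
	if (close(fd) < 0 && rc == 0) {
		rc = -1;
		err = errno;
	}
	/* Then the directory entry, so the name itself is flushed. */
	if (rc == 0 && fsync(dfd) < 0) {
		rc = -1;
		err = errno;
	}
	close(dfd);

	if (rc < 0)
		errno = err;
	return rc;
}
```

A caller would treat any -1 from save_file("/some/dir", "file", data,
len) as "the data may not be on disk" and act accordingly. If the
filesystem swallows the error before it gets this far, no amount of
checking here helps.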
 
> Ext3 tends to recover from this better than other filesystems, thanks
> to the fact that it does physical block journalling, but you do pay
> for this in terms of performance if you have a metadata-intensive
> workload, because you're writing more bytes to the journal for each
> metadata operation.

And thanks for that! Actually I'd be willing to pay some more
performance to get reliability up.

> > It seems like the more I learn about filesystems, the more failure modes  
> > there are and the fewer guarantees can be made.  It's amazing that  
> > things work as well as they do...
> 
> There are certainly things you can do.  Put your fileservers on
> UPSes.  Use RAID.  Make backups.   Do all three.  :-)

I was almost stupid enough to move the primary copy of ~ and linux
trees to SD... I do have UPSes; unfortunately they are li-ion and I'm
running off them most of the time. I do have backups, but restoring
them all the time is boring & time consuming. I'll try to stick two
MMC cards into the SD slot to make it RAID 1 :-).

									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
