From: Andrew Morton <akpm@linux-foundation.org>
To: Theodore Tso <tytso@mit.edu>
Cc: jack@suse.cz, hidehiro.kawai.ez@hitachi.com, sct@redhat.com,
	adilger@sun.com, linux-kernel@vger.kernel.org,
	linux-ext4@vger.kernel.org, jbacik@redhat.com, cmm@us.ibm.com,
	yumiko.sugita.yf@hitachi.com, satoshi.oshima.fk@hitachi.com
Subject: Re: [PATCH 1/5] jbd: strictly check for write errors on data buffers
Date: Wed, 4 Jun 2008 14:58:48 -0700
Message-ID: <20080604145848.e3da6f20.akpm@linux-foundation.org>
In-Reply-To: <20080604212202.GA8727@mit.edu>

On Wed, 4 Jun 2008 17:22:02 -0400
Theodore Tso <tytso@mit.edu> wrote:

> On Wed, Jun 04, 2008 at 11:19:11AM -0700, Andrew Morton wrote:
> > Does any other filesystem driver turn the fs read-only on the first
> > write-IO-error?

^^ this?

> > It seems like a big policy change to me.  For a lot of applications
> > it's effectively a complete outage and people might get a bit upset if
> > this happens on the first blip from their NAS.
> 
> As I told Kawai-san when I met with him and his colleagues in Tokyo
> last week, it is the responsibility of the storage stack to retry
> errors as appropriate.  From the filesystem perspective, a read or a
> write operation succeeds, or fails.  A read or write operation could
> take a long time before returning, but the storage stack doesn't get
> to return a "fail, but try again at some point; maybe we'll succeed
> later, or if you try writing to a different block".  The only sane
> thing for a filesystem to do is to treat any failure as a hard
> failure.
> 
> It is similarly insane to ask a filesystem to figure out that a newly
> plugged in USB stick is the same one that the user had accidentally
> unplugged 30 seconds ago.  We don't want to put that kind of low-level
> knowledge about storage details in each different filesystem.
> 
> A much better place to put that kind of smarts is in a multipath
> module which sits in between the device and the filesystem.  It can
> retry writes from a transient failure, if a path goes down or if an
> iSCSI device temporarily drops off the network.

That's fine in theory, but we don't do any of that right now, do we?
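
For concreteness, the kind of retry layer described above might look
roughly like the following.  This is a userspace sketch only, not
anything which exists in the block layer or in dm-multipath:
write_with_retry(), submit_write_fn, MAX_RETRIES and the flaky_dev toy
device are all made-up names for illustration.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical lower-level write hook: returns 0 or a negative errno. */
typedef int (*submit_write_fn)(void *dev, unsigned long long sector,
			       const void *buf, size_t len);

enum { MAX_RETRIES = 3 };	/* assumed policy knob, not a real tunable */

/*
 * Try the write up to MAX_RETRIES times in total.  The caller (the
 * "filesystem") only ever sees the final verdict: 0 or -EIO.
 */
static int write_with_retry(submit_write_fn submit, void *dev,
			    unsigned long long sector,
			    const void *buf, size_t len)
{
	int attempt;

	for (attempt = 0; attempt < MAX_RETRIES; attempt++) {
		if (submit(dev, sector, buf, len) == 0)
			return 0;
		/* A real layer would switch paths or back off here. */
	}
	return -EIO;
}

/* Toy device: fails its first `failures_left` writes, then succeeds. */
struct flaky_dev {
	int failures_left;
	int calls;
};

static int flaky_submit(void *dev, unsigned long long sector,
			const void *buf, size_t len)
{
	struct flaky_dev *d = dev;

	(void)sector;
	(void)buf;
	(void)len;
	d->calls++;
	if (d->failures_left > 0) {
		d->failures_left--;
		return -EIO;
	}
	return 0;
}
```

With a device which blips twice, the third attempt succeeds and the
caller never hears about the earlier failures.  With a device which
never recovers, the caller gets -EIO after three attempts, and that
is the hard failure the filesystem would then have to act on.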

>  But if a filesystem
> gets a write failure, it has to assume that the write failure is
> permanent.

To that sector, yes.  But to the entire partition?

(Well, if the entire partition became unwriteable then we don't have a
problem.  It's "parts are writeable" or "it became writeable again
later" which is the problem.)


> The question though is what should you do if you have a write failure
> in various different parts of the disk?  If you have a write failure
> in a data block, you can return -EIO to the user.

Absolutely.

But afaict this patch changes things so that if we get a write failure
in a data block we make the entire fs read-only.  Which, as I said, is
often "dead box".

This seems like a quite major policy change to me.

>  You could try
> reallocating to find another block, and try writing to that alternate
> location (although with modern filesystems that do block remapping,
> this is largely pointless, since an EIO failure on write probably
> means you've lost connectivity to the disk or the disk has run out of
> spare blocks).  But for a failure to write to a critical part of
> the filesystem, like the inode table, or failure to write to the
> journal, what the heck can you do?  Remounting read-only is probably
> the best thing you can do.

Ah, yes, I agree, a write error to the journal or to metadata is a
quite different thing.  An unwriteable journal block is surely
terminal.  And a write error during metadata checkpointing is pretty
horrid because we've (potentially) already told userspace that the
write was successful.

But should we treat plain old data blocks in the same way?
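
To make the distinction concrete, here is a rough userspace sketch of
a policy which treats the two cases differently.  It is not the code
from this patch series and not how jbd is structured: blk_kind, toy_fs
and handle_write_error() are invented names, and "read_only" merely
stands in for aborting the journal and remounting read-only.

```c
#include <errno.h>

/* Hypothetical classification of the block whose write failed. */
enum blk_kind { BLK_DATA, BLK_METADATA, BLK_JOURNAL };

struct toy_fs {
	int read_only;	/* stands in for "journal aborted, fs remounted ro" */
};

/*
 * Returns the errno which would be reported for the failed write.
 * Only journal and metadata failures take the whole fs read-only.
 */
static int handle_write_error(struct toy_fs *fs, enum blk_kind kind)
{
	switch (kind) {
	case BLK_DATA:
		/* Report the error; the rest of the fs stays writeable. */
		return -EIO;
	case BLK_METADATA:
	case BLK_JOURNAL:
		/* Consistency can no longer be guaranteed: stop writing. */
		fs->read_only = 1;
		return -EIO;
	}
	return -EIO;
}
```

Under this sketch a failed data block write yields -EIO and leaves
read_only clear, while a failed journal or metadata write sets it.
The behaviour being questioned here corresponds to moving BLK_DATA
into the second group.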

> In theory, if it is a failure to write to the journal, you could fall
> back to no-journaled operation, and if ext3 could support running w/o
> a journal, that is possibly an option --- but again, it's very likely
> that the disk is totally gone (i.e., the user pulled the USB stick
> without unmounting), or the disk is out of spare blocks in its bad
> block remapping pool, and the system is probably going to be in deep
> trouble --- and the next failure to write some data might be critical
> application data.  You probably *are* better off failing the system
> hard, and letting the HA system swap in the hot spare backup, if this
> is some critical service.
> 
> That being said, ext3 can be tuned (and it is the default today,
> although I should probably change the default to be remount-ro), so
> that its behaviour on write errors is, "don't worry, be happy", and
> just leave the filesystem mounted read/write.  That's actually quite
> dangerous for a critical production server, however.....
> 
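
The tunable mentioned above is the ext3 error behaviour, selectable
with the errors=continue, errors=remount-ro and errors=panic mount
options.  A minimal userspace model of how such a setting might be
dispatched is sketched below.  The toy_* names are invented for
illustration and this is not the ext3 implementation.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical mirror of the three errors= choices. */
enum toy_errors_mode {
	TOY_ERRORS_CONTINUE,	/* "don't worry, be happy" */
	TOY_ERRORS_REMOUNT_RO,
	TOY_ERRORS_PANIC
};

struct toy_sb {
	enum toy_errors_mode mode;
	int read_only;
};

/* Log the error, then apply whichever policy the admin selected. */
static void toy_handle_error(struct toy_sb *sb, const char *msg)
{
	fprintf(stderr, "toy-fs error: %s\n", msg);

	switch (sb->mode) {
	case TOY_ERRORS_CONTINUE:
		break;			/* stay mounted read/write */
	case TOY_ERRORS_REMOUNT_RO:
		sb->read_only = 1;	/* refuse further writes */
		break;
	case TOY_ERRORS_PANIC:
		abort();		/* stand-in for panic() */
	}
}
```

In this model the whole "policy change" question reduces to which
events get routed to toy_handle_error() in the first place, and
which mode is the default.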



