From: "Theodore Y. Ts'o" <tytso@mit.edu>
To: Jean-Louis Dupond <jean-louis@dupond.be>
Cc: linux-ext4@vger.kernel.org
Subject: Re: Filesystem corruption after unreachable storage
Date: Mon, 9 Mar 2020 18:32:38 -0400
Message-ID: <20200309223238.GC4852@mit.edu>
In-Reply-To: <93e74f9f-6694-a3e9-4fac-981389522d25@dupond.be>
On Mon, Mar 09, 2020 at 04:33:52PM +0100, Jean-Louis Dupond wrote:
> On 9/03/2020 16:18, Theodore Y. Ts'o wrote:
> > Did the panic happen immediately, or did things hang until the storage
> > recovered, and *then* it rebooted? Or did the hard reset and reboot
> > happen before the storage network connection was restored?
>
> The panic (well it was just frozen, no stacktrace or automatic reboot) did
> happen *after* storage came back online.
> So nothing happens while the storage is offline, even if we wait until the
> scsi timeout is exceeded (180s * 6).
> It's only when the storage returns that the filesystem goes read-only /
> panic (depending on the error setting).
So I wonder why the scsi timeout isn't sufficient to keep the panic
from hanging.
> If we do reset the VM before storage is back, the filesystem check just goes
> fine in automatic mode.
> So I think we should (in some cases) not try to update the superblock
> anymore on I/O errors, but just go read-only/panic.
> Cause it seems like updating the superblock makes things worse.
The problem is that from the file system's perspective, we don't know
why the I/O error has happened. Is it because of a timeout, or is it
because of a media error?
In the case where an SSD really was unable to write to a metadata
block, we *do* want to update the superblock.
There is a return status that the block device could send back,
BLK_STS_TIMEOUT, but it's not set by the SCSI layer. It is set by the
network block device (nbd), but it looks like the SCSI layer just
returns BLK_STS_IOERR if I'm reading the code correctly.
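
To make the loss of information concrete, here is a minimal userspace
sketch. It is not kernel code: the enums and function names are
hypothetical stand-ins, and the only behaviour modelled is the one
described above (nbd reports a timeout status, SCSI reports a generic
I/O error). The mapping of the timeout status to -ETIMEDOUT and of the
generic status to -EIO mirrors the kernel's usual status-to-errno
conversion.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Local stand-ins for block status codes; NOT the kernel definitions. */
enum model_blk_status {
	MODEL_STS_OK,
	MODEL_STS_TIMEOUT,	/* models BLK_STS_TIMEOUT */
	MODEL_STS_IOERR,	/* models BLK_STS_IOERR */
};

/* Hypothetical tag for the driver that completed the request. */
enum model_driver {
	MODEL_DRIVER_NBD,
	MODEL_DRIVER_SCSI,
};

/* Status reported when a request times out, as described in this mail. */
static enum model_blk_status status_on_timeout(enum model_driver drv)
{
	return drv == MODEL_DRIVER_NBD ? MODEL_STS_TIMEOUT : MODEL_STS_IOERR;
}

/* Negative errno as the file system would eventually see it. */
static int status_to_errno(enum model_blk_status sts)
{
	switch (sts) {
	case MODEL_STS_OK:
		return 0;
	case MODEL_STS_TIMEOUT:
		return -ETIMEDOUT;
	default:
		return -EIO;
	}
}

/* The file system can only be sure of a timeout if it is told so. */
static bool fs_knows_timeout(int err)
{
	return err == -ETIMEDOUT;
}

/* Same underlying event, two different answers. */
void check_model(void)
{
	int nbd_err = status_to_errno(status_on_timeout(MODEL_DRIVER_NBD));
	int scsi_err = status_to_errno(status_on_timeout(MODEL_DRIVER_SCSI));

	assert(fs_knows_timeout(nbd_err));
	assert(!fs_knows_timeout(scsi_err));	/* looks like a media error */
}
```

In this model the SCSI path collapses a timeout into the same error a
failing disk would produce, so nothing above the block layer can tell
the two apart.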
> Or changes could be made to e2fsck to allow automatic repair of this kind of
> error for example?
The fundamental problem is we don't know what "kind of error" has
taken place. If we did, we could theoretically have some kind of
mount option which means "in case of timeout, reboot the system
without setting some kind of file system error". But we need to know
that the I/O error was caused by a timeout first.
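
As a rough illustration of what such a mount option might decide, here
is a userspace sketch. Everything in it is hypothetical: no such ext4
option exists, the names are made up for this example, and it assumes
the file system receives -ETIMEDOUT when (and only when) the failure
really was a timeout, which is exactly the piece that is missing today.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical mount-time policy; not an existing ext4 option. */
enum timeout_policy {
	TIMEOUT_TREAT_AS_ERROR,	/* handle a timeout like any other I/O error */
	TIMEOUT_REBOOT_CLEAN,	/* reboot without flagging a file system error */
};

/* Hypothetical outcomes for a failed metadata write. */
enum fs_action {
	ACTION_RECORD_ERROR,	/* update the superblock, then remount-ro/panic */
	ACTION_REBOOT_NO_RECORD,	/* restart and leave the superblock alone */
};

/* Decide what to do with a failed metadata write (err is a negative errno). */
static enum fs_action on_metadata_io_error(int err, enum timeout_policy policy)
{
	bool is_timeout = (err == -ETIMEDOUT);

	if (is_timeout && policy == TIMEOUT_REBOOT_CLEAN)
		return ACTION_REBOOT_NO_RECORD;

	/* Unknown cause or real media error: keep recording it. */
	return ACTION_RECORD_ERROR;
}

void check_policy(void)
{
	/* Only a known timeout with the option enabled skips the superblock update. */
	assert(on_metadata_io_error(-ETIMEDOUT, TIMEOUT_REBOOT_CLEAN) ==
	       ACTION_REBOOT_NO_RECORD);
	/* A generic -EIO still gets recorded, even with the option enabled. */
	assert(on_metadata_io_error(-EIO, TIMEOUT_REBOOT_CLEAN) ==
	       ACTION_RECORD_ERROR);
}
```

Note that with a driver that only ever reports a generic I/O error, the
second case is the only one that can trigger, so the option would have
no effect.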
- Ted