From: Tejun Heo <tj@kernel.org>
To: Bernd Schubert <bernd.schubert@fastmail.fm>
Cc: linux-ide@vger.kernel.org, Gionatan Danti <g.danti@assyoma.it>,
linux-scsi@vger.kernel.org
Subject: Re: No I/O errors reported after SATA link hard reset
Date: Thu, 17 Aug 2017 05:48:21 -0700
Message-ID: <20170817124821.GA3238792@devbig577.frc2.facebook.com>
In-Reply-To: <13561110-4e59-303a-8e3d-dd60c1bafba8@fastmail.fm>

Hello,

On Thu, Aug 17, 2017 at 11:24:22AM +0200, Bernd Schubert wrote:
> > More concerning is the fact that these undetected errors can make their
> > way through even when the higher application consistently calls sync()
> > and/or fsync(). In other words, it seems that even acknowledged writes can
> > fail in this manner (and this is consistent with the first machine
> > corrupting its filesystem due to journal trashing - the XFS journal surely
> > uses sync() where appropriate). The mechanism seems to be the following:
> >
> > - a higher-layer application issues sync();
> > - a write barrier is generated;
> > - a first FLUSH CACHE command is sent to the disk;
> > - data are written to the disk's DRAM cache;
> > - power is lost! The volatile cache loses its content;
> > - power is re-established and the disk becomes responsive again;
> > - a second FLUSH CACHE command is sent to the disk;
> > - the disk acks each SATA command, but real data are lost.

Recovered errors aren't reported as IO errors, and from the link state
alone there's no way for the driver to tell link glitches apart from
power issues that erase the drive's buffer.

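A toy model of the sequence quoted above may make the failure easier to
see. This is only a sketch: ToyDisk and sync_with_glitch are hypothetical
names, nothing here is real ATA or libata code, and the sequence is
simplified to write, power loss, retried flush.

```python
class ToyDisk:
    """Toy disk with a volatile write cache (hypothetical, not real ATA)."""

    def __init__(self):
        self.cache = {}    # volatile DRAM cache, lost on power cycle
        self.platter = {}  # persistent media

    def write(self, lba, data):
        # Writes are acked as soon as they reach the volatile cache.
        self.cache[lba] = data
        return "ok"

    def flush_cache(self):
        # FLUSH CACHE persists whatever is in the cache *right now*.
        self.platter.update(self.cache)
        self.cache.clear()
        return "ok"

    def power_cycle(self):
        # Power loss empties the cache; the disk then answers normally.
        self.cache.clear()


def sync_with_glitch(disk, blocks, power_lost):
    """Write blocks, optionally lose power, then issue the (retried) flush."""
    for lba, data in blocks.items():
        disk.write(lba, data)
    if power_lost:
        # From the host side this looks like a recoverable link reset.
        disk.power_cycle()
    # After error recovery the flush is sent again and succeeds.
    return disk.flush_cache()
```

With power_lost=True the flush still returns "ok" while the platter stays
empty, which is the silent loss described above: every command is acked,
yet nothing was persisted.
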
> > Now, I have a few questions:
> > - is the above explanation plausible, or am I (horribly) missing something?

For the most part, yes. To be more accurate, the failure comes from
libata not being able to tell link glitches apart from the device
getting reset due to power issues.

> > - why doesn't the SCSI midlevel respond to a power loss event by
> > immediately offlining the disks?

Because we don't wanna be ditching disks on temporary link glitches,
which do happen once in a while.

> > - is the scsi midlevel behavior configurable (I know I can lower eh
> > timeout, but is this the right solution)?
> > - how to deal with this problem (other than being 100% sure power is
> > never lost by any disks)?

So, the right way to deal with the problem is probably to make use of
the SMART counter which indicates power loss events and to verify that
the counter hasn't increased across the link issue. If it changed, the
device should be detached and re-probed, which will make it come back
as a different block device. Unfortunately, I haven't had the chance
to actually implement that.

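A minimal userspace sketch of that counter check, under stated
assumptions: it is not libata code, it compares two captures of
`smartctl -A` output taken before and after the link reset, and it
assumes attribute 12 (Power_Cycle_Count) is the relevant counter, which
varies by drive. The function names and the "treat a missing counter as
power lost" policy are hypothetical choices made for illustration.

```python
POWER_CYCLE_ATTR_ID = 12  # assumption: Power_Cycle_Count; drives differ


def parse_smart_raw(smartctl_output, attr_id):
    """Return the raw value of attr_id from `smartctl -A` text, or None."""
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows have 10 columns: ID#, name, flag, value, worst,
        # thresh, type, updated, when_failed, raw_value.
        if len(fields) >= 10 and fields[0] == str(attr_id):
            try:
                return int(fields[9])
            except ValueError:
                return None
    return None


def power_lost_during_reset(before, after, attr_id=POWER_CYCLE_ATTR_ID):
    """True if the counter changed between the two captures."""
    old = parse_smart_raw(before, attr_id)
    new = parse_smart_raw(after, attr_id)
    if old is None or new is None:
        # Hypothetical policy: without the counter we cannot prove the
        # cache survived, so report the conservative answer.
        return True
    return new != old
```

A caller would capture the attribute table periodically, capture it again
after a link reset, and treat a True result as the cue to detach and
re-probe the device rather than continue issuing I/O to it.
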
Thanks.
--
tejun