From: "Verma, Vishal L" <vishal.l.verma@intel.com>
To: "Williams, Dan J" <dan.j.williams@intel.com>,
"Rudoff, Andy" <andy.rudoff@intel.com>
Cc: "linux-nvdimm@lists.01.org" <linux-nvdimm@lists.01.org>
Subject: Re: [ndctl PATCH] ndctl, check: Add a sigbus handler to detect metadata corruption
Date: Fri, 14 Apr 2017 20:28:04 +0000
Message-ID: <1492201682.1657.6.camel@intel.com>
In-Reply-To: <01C2D860-110D-407F-A8FA-2892AF38EB88@intel.com>
On Fri, 2017-04-14 at 19:52 +0000, Rudoff, Andy wrote:
> >If we hit a known badblock, that is 512B worth of map entries (128).
> >Should we really (almost certainly) scramble 64 blocks? :)
> >If it is a latent error, it will still be at least a cache line worth
> >of map entries, i.e. 16.
> >
> >If an error is in the log, then in the badblock case, we lose both log
> >and log' for four lanes. Which means we can't tell if one of those 4
> >entries needed a map update, leaving a potential corruption window
> >open.
>
> Before responding to this, can you walk me through the steps
> a user is expected to take if poison appears in the flog, for example?
> Say I just want to copy out as much data as I can get, then recreate
> a fresh BTT. I think you’re telling me that a poisoned flog prevents
> me from using the entire device, but maybe I’m wrong and you’re
> saying I can still get to the other arenas?
So I think currently the best way is to recreate the namespace and
restore from a backup.. I wonder if we have a problem hiding even here,
but more on that below.
I don't think all access will be prevented, even to that arena, and
certainly not the whole device. My point was only that we can't clear
the poison, because then all notification of "something went wrong" is
lost.
One option could be to clear the block of poisoned map entries, and
replace them with 'error' entries (the error bit in each map entry).
That way reads will continue to fail, and writes to those entries will
fix them. I'll have to work through it and make sure this works on the
kernel side first..
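
To make the idea concrete, here is a minimal sketch of replacing the map
entries under a bad range with 'error' entries. It assumes 4-byte map
entries with a zero flag in bit 31 and an error flag in bit 30, which is
how I recall the BTT layout; all names below (ENT_E_FLAG,
map_mark_errors, etc.) are made up for illustration and are not the
kernel's or ndctl's. It also ignores the little-endian on-media byte
order and operates on an in-memory copy of the map.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed layout: 4-byte entries, flags in the top two bits. */
#define MAP_ENT_SIZE   4u
#define SECTOR_SIZE    512u
#define CACHELINE_SIZE 64u
#define ENT_Z_FLAG     (UINT32_C(1) << 31)
#define ENT_E_FLAG     (UINT32_C(1) << 30)
#define ENT_LBA_MASK   (~(ENT_Z_FLAG | ENT_E_FLAG))

/* Number of map entries covered by a span of bytes (512B -> 128, 64B -> 16). */
static inline size_t entries_per_span(size_t span_bytes)
{
	return span_bytes / MAP_ENT_SIZE;
}

/* An entry is an 'error' entry when only the error flag is set. */
static inline int ent_is_error(uint32_t ent)
{
	return (ent & (ENT_Z_FLAG | ENT_E_FLAG)) == ENT_E_FLAG;
}

static inline uint32_t ent_make_error(uint32_t lba)
{
	return (lba & ENT_LBA_MASK) | ENT_E_FLAG;
}

/*
 * Overwrite every entry overlapping the byte range [bad_off, bad_off + bad_len)
 * of the map with an error entry. The old mapping is unreadable, so the
 * LBA field is simply set to 0 here. Returns the number of entries rewritten.
 */
static size_t map_mark_errors(uint32_t *map, size_t nentries,
			      size_t bad_off, size_t bad_len)
{
	size_t first = bad_off / MAP_ENT_SIZE;
	size_t last = (bad_off + bad_len + MAP_ENT_SIZE - 1) / MAP_ENT_SIZE;
	size_t i;

	if (last > nentries)
		last = nentries;
	for (i = first; i < last; i++)
		map[i] = ent_make_error(0);
	return last > first ? last - first : 0;
}
```

With this, a 512B badblock in the map turns into 128 error entries, so
reads of those blocks keep failing until a write installs a fresh
mapping.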
>
> Anyway, after copying out what I can, how do I recreate the BTT?
> When the BTT creation code starts writing the new flog, will that
> clear the poison? Does the BTT creation code do big enough writes
> so the driver will issue Clear Uncorrectable commands?
After 4.12, we will at least have the ability to clear poison for data
blocks, i.e. when the requests going through rw_bytes are aligned and
sized to 512B sectors. Map and flog writes are not large enough, and so
won't clear errors..
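
As a rough model of that condition (not the actual kernel code, and the
function name is made up), the check amounts to something like:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define POISON_CLEAR_UNIT 512u	/* sector size assumed in this discussion */

/*
 * Simplified model: a write can clear poison only when both its offset
 * and its length are multiples of 512 bytes.
 */
static bool write_can_clear_poison(uint64_t offset, size_t len)
{
	return len != 0 &&
	       offset % POISON_CLEAR_UNIT == 0 &&
	       len % POISON_CLEAR_UNIT == 0;
}
```

A 4-byte map entry update fails this test no matter where it lands,
while a 4096-byte data block write at a sector-aligned offset passes.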
And maybe this is something we need to fix otherwise.
Consider a scenario where we had metadata poison, and we delete and
recreate the namespace. This in itself doesn't hit the poison. We will
end up hitting CMCIs when we write metadata, and uncorrectable errors
on reads again, since the poison wasn't cleared.
Maybe what we need here is during BTT creation, check for at least
known badblocks, and initialize them so that we know at least new BTTs
are clean.
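
A minimal sketch of such a check, assuming the known badblocks are
available as (start sector, sector count) pairs in 512B units, and that
the creation code knows the byte range a metadata area (map, flog, info
block) will occupy. The struct and function names are hypothetical:

```c
#include <stddef.h>
#include <stdint.h>

#define BB_SECTOR_SIZE 512u

/* One known-bad range, in 512B sectors (hypothetical representation). */
struct bb_range {
	uint64_t sector;
	uint64_t count;
};

/*
 * Return how many known-bad sectors fall inside the byte range
 * [region_off, region_off + region_len). A non-zero result means the
 * region should be initialized with sector-sized, sector-aligned writes
 * so that the poison gets cleared before the BTT is laid out.
 */
static uint64_t badblocks_in_region(const struct bb_range *bb, size_t nbb,
				    uint64_t region_off, uint64_t region_len)
{
	uint64_t first, last, total = 0;
	size_t i;

	if (region_len == 0)
		return 0;

	/* Widen the region to whole sectors: [first, last). */
	first = region_off / BB_SECTOR_SIZE;
	last = (region_off + region_len + BB_SECTOR_SIZE - 1) / BB_SECTOR_SIZE;

	for (i = 0; i < nbb; i++) {
		uint64_t start = bb[i].sector;
		uint64_t end = bb[i].sector + bb[i].count;

		if (start < last && end > first) {
			uint64_t lo = start > first ? start : first;
			uint64_t hi = end < last ? end : last;

			total += hi - lo;
		}
	}
	return total;
}
```

This only covers known badblocks; latent errors would still be found
the hard way.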
>
> With your answers, I think we can create a typical user flow,
> where the user gets error messages and tries to use our tools
> to fix things. That should tell us if we’re on the right track
> or if we’re stranding the poor user with no way to recover…
Agreed, current recovery path (from poison) sure involves slaying
dragons :)
>
> Thanks,
>
> -andy
>
>