From: Boris Brezillon <boris.brezillon@free-electrons.com>
To: Brian Norris <computersforpeace@gmail.com>
Cc: Ricard Wanderlof <ricard.wanderlof@axis.com>,
Richard Weinberger <richard@nod.at>,
Steve deRosier <derosier@gmail.com>, Josh Wu <josh.wu@atmel.com>,
"linux-mtd@lists.infradead.org" <linux-mtd@lists.infradead.org>,
Ezequiel Garcia <ezequiel@vanguardiasur.com.ar>,
Huang Shijie <shijie8@gmail.com>
Subject: Re: [PATCH] mtd: nand: default bitflip-reporting threshold to 75% of correction strength
Date: Wed, 21 Jan 2015 09:42:57 +0100
Message-ID: <20150121094257.6c9d6214@bbrezillon>
In-Reply-To: <20150121082257.GB5273@norris-Latitude-E6410>
Hi Brian,
On Wed, 21 Jan 2015 00:22:57 -0800
Brian Norris <computersforpeace@gmail.com> wrote:
> I'm sorry, I just noticed your reply as I was applying the patch. I can
> back it out if we find a real objection.
No, I'm fine with this patch, I was just trying to draw attention to
several problems I noticed while working on MLC NANDs.
We'll need to sort this out at some point, but I won't prevent sane
SLC/MLC NANDs from having this ECC threshold changed for something that
is not well supported yet ;-).
>
> On Sat, Jan 17, 2015 at 08:01:37PM +0100, Boris Brezillon wrote:
> > On Mon, 12 Jan 2015 12:51:29 -0800
> > Brian Norris <computersforpeace@gmail.com> wrote:
> > > The MTD API reports -EUCLEAN only if the maximum number of bitflips
> > > found in any ECC block exceeds a certain threshold. This is done to
> > > avoid excessive -EUCLEAN reports to MTD users, which may induce
> > > additional scrubbing of data, even when the ECC algorithm in use is
> > > perfectly capable of handling the bitflips.
> > >
> > > This threshold can be controlled by user-space (via sysfs), to allow
> > > users to determine what they are willing to tolerate in their
> > > application. But it still helps to have sane defaults.
> > >
> > > In recent discussion [1], it was pointed out that our default threshold
> > > is equal to the correction strength. That means that we won't actually
> > > report any -EUCLEAN (i.e., "bitflips were corrected") errors until there
> > > are almost too many to handle. It was determined that 3/4 of the
> > > correction strength is probably a better default.
> > >
> > > [1] http://lists.infradead.org/pipermail/linux-mtd/2015-January/057259.html
> > >
> > > Signed-off-by: Brian Norris <computersforpeace@gmail.com>
> > > ---
> > > drivers/mtd/nand/nand_base.c | 2 +-
> > > 1 file changed, 1 insertion(+), 1 deletion(-)
> > >
> > > diff --git a/drivers/mtd/nand/nand_base.c b/drivers/mtd/nand/nand_base.c
> > > index 816b5c1fd416..3f24b587304f 100644
> > > --- a/drivers/mtd/nand/nand_base.c
> > > +++ b/drivers/mtd/nand/nand_base.c
> > > @@ -4171,7 +4171,7 @@ int nand_scan_tail(struct mtd_info *mtd)
> > > * properly set.
> > > */
> > > if (!mtd->bitflip_threshold)
> > > - mtd->bitflip_threshold = mtd->ecc_strength;
> > > + mtd->bitflip_threshold = DIV_ROUND_UP(mtd->ecc_strength * 3, 4);
> >
> > Just sharing my experience with MLC NANDs requiring read-retry: the
> > number of reported bitflips often reaches the ecc_strength value (at
> > least with the current read-retry approach).
>
> I did not have fun when testing a few MLC NAND which required read
> retry. I ended up recommending that most of our customers move off of it
> onto another solution where possible. But yes, I can understand your
> issue.
Neither did I, but that's the world we live in, and I guess board
manufacturers will keep using crappy MLC chips so this is definitely
something we have to work on.
>
> > This patch will definitely make UBI move NAND blocks over and over
> > again, considering the threshold has been reached and the block is not
> > reliable anymore.
>
> Personally, we found that we just needed to lower the
> MTD_UBI_WL_THRESHOLD and let UBI move away from blocks with high bitflip
> counts. I suppose we essentially treat the block as bad.
The problem I have is that a lot of them reach this ECC threshold limit
(at least with the current implementation).
>
> > While I like the idea of limiting the threshold to something smaller
> > than what's recommended on the datasheet (or reported by ONFI) I wonder
> > if it won't make things worse in some cases.
>
> I wouldn't exactly say that there is any threshold recommended on the
> datasheet or in ONFI. They simply specify a required correction
> strength,
Yes, that's what I was talking about.
> with no word about any intermediate handling -- what do they
> expect software to do when bitflips exceed the ECC strength? We
> immediately lose data. So we need to preemptively move such data.
Yep, I agree.
>
> So I don't think it's a good idea at all to say that
> threshold == ecc_strength. That renders your ECC nearly useless w.r.t.
> the original design of UBI. UBI intends to use -EUCLEAN as a signal of a
> high # of bitflips. I would suggest that 23 bitflips on a 24-bit
> correction algorithm is a "high # of bitflips." So as I see it, this
> patch is just restoring that UBI assumption.
Yes.
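For concreteness, here is a small user-space sketch (not the nand_base
code) of what the new default works out to. DIV_ROUND_UP is redefined
locally with the usual kernel definition, both helper names are made up
for illustration, and the >= comparison reflects my reading of how
mtd_read() decides to return -EUCLEAN, so treat that part as an assumption:

```c
/* Local copy of the kernel's DIV_ROUND_UP so this builds in user space. */
#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/* Hypothetical helper: the default threshold the patch assigns. */
static unsigned int default_bitflip_threshold(unsigned int ecc_strength)
{
	return DIV_ROUND_UP(ecc_strength * 3, 4);
}

/*
 * Hypothetical helper: non-zero if a read with max_bitflips corrected
 * bitflips in its worst ECC block would be reported as -EUCLEAN.
 * Assumption: the comparison is "greater than or equal".
 */
static int reports_euclean(unsigned int max_bitflips, unsigned int threshold)
{
	return max_bitflips >= threshold;
}
```

With a 24-bit ECC, default_bitflip_threshold(24) is 18, so the 23-bitflip
read mentioned above gets reported, whereas with threshold == ecc_strength
(24) it does not.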
>
> > Regarding the read-retry code, it currently stops retrying reading the
> > page once the page has been successfully retrieved (or in other words
> > all bitflips have been fixed). But it might stop too soon, because by
> > changing the bit level threshold (in other words, retrying one more time)
> > it might successfully read the page with fewer bitflips than the
> > previous attempt (this is just supposition, I haven't tested it yet).
> > If we can achieve that we could retry until we reach something below
> > the bitflips threshold value, and if we fail to find any, just consider
> > the lowest number of bitflips found during those read-retry operations.
>
> I believe I suggested scenarios like this to some flash vendors when
> speaking to reps in person, but they didn't seem to consider that
> likely. I think they were implying that there would be only one read
> retry mode that gives a reasonable result. I'm not sure if they were
> really the experts on that particular topic, though, or if they were
> just giving me an answer to make me happy.
Okay, good to know. I'll try to do some more testing to verify that.
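To be clear about what I would be testing, the retry idea quoted above
could be sketched roughly like this. It is a user-space sketch, not what
nand_base does today: the read_page_fn callback, its "negative means
uncorrectable" convention and the table_read() stand-in are all
assumptions made for illustration.

```c
/*
 * Hypothetical callback: read the page in the given read-retry mode and
 * return the number of corrected bitflips, or a negative value if the
 * page was uncorrectable in that mode.
 */
typedef int (*read_page_fn)(int mode, void *ctx);

/*
 * Try each retry mode in turn. Stop early once a read comes in below
 * the bitflip threshold; otherwise keep the lowest count seen.
 * Returns that count, or -1 if every mode failed. *best_mode receives
 * the mode that produced the returned count (-1 if none did).
 */
static int read_with_retry(read_page_fn read_page, void *ctx, int nmodes,
			   int threshold, int *best_mode)
{
	int best = -1;
	int mode;

	*best_mode = -1;
	for (mode = 0; mode < nmodes; mode++) {
		int flips = read_page(mode, ctx);

		if (flips < 0)
			continue;	/* uncorrectable in this mode */
		if (best < 0 || flips < best) {
			best = flips;
			*best_mode = mode;
		}
		if (flips < threshold)
			break;		/* below the threshold, stop retrying */
	}
	return best;
}

/* Stand-in for real hardware: ctx points to one bitflip count per mode. */
static int table_read(int mode, void *ctx)
{
	const int *counts = ctx;

	return counts[mode];
}
```

Whether a later retry mode can really give fewer bitflips than the first
one that succeeds is exactly the part that still needs to be verified on
real chips.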
>
> Honestly, there's a lot of work that goes into MLC NAND for use by
> serious applications. For instance, SSD controller vendors, and even
> eMMC makers, use plenty of low-level knowledge that we simply do not
> have. From whatever I've gleaned about Toshiba MLC NAND (I don't think
> they even try to sell much raw NAND outside of eMMC and SSD solutions),
> they actually expose fine-grained voltage threshold controls through
> their (non-standard, obviously) command interface, and their clients
> likely use this to chart a targeted plan on how to adjust thresholds
> over the lifetime of a block, rather than just using a dumb sequential
> search like we do.
Yep, but they do sell these unreliable MLC chips (at least Micron and
Hynix do) to board manufacturers, and a lot of people would really like
to use a mainline kernel on these boards.
>
> So anyway, that drifted a little away from the topic at hand. Are you
> suggesting that we should not change this default in nand_base, because
> of potential issues with highly-unreliable NAND that need read-retry? I
> can back out the patch while we finish this discussion, although I'm not
> very convinced so far.
As I said I don't expect you to back it out, just discussing the MLC
chip problems in an almost unrelated thread ;-).
Best Regards,
Boris
--
Boris Brezillon, Free Electrons
Embedded Linux and Kernel engineering
http://free-electrons.com