Re: [PATCH] mtd: nand: default bitflip-reporting threshold to 75% of correction strength

public inbox for linux-mtd@lists.infradead.org
 help / color / mirror / Atom feed

From: Brian Norris <computersforpeace@gmail.com>
To: Richard Weinberger <richard@nod.at>
Cc: Ricard Wanderlof <ricard.wanderlof@axis.com>,
	Kamal Dasu <kdasu.kdev@gmail.com>,
	Artem Bityutskiy <dedekind1@gmail.com>,
	Steve deRosier <derosier@gmail.com>, Josh Wu <josh.wu@atmel.com>,
	"linux-mtd@lists.infradead.org" <linux-mtd@lists.infradead.org>,
	Ezequiel Garcia <ezequiel@vanguardiasur.com.ar>,
	Huang Shijie <shijie8@gmail.com>
Subject: Re: [PATCH] mtd: nand: default bitflip-reporting threshold to 75% of correction strength
Date: Tue, 13 Jan 2015 11:51:14 -0800	[thread overview]
Message-ID: <20150113195114.GU9759@ld-irv-0074> (raw)
In-Reply-To: <54B5693C.6020700@nod.at>

Hi Richard,

On Tue, Jan 13, 2015 at 07:51:40PM +0100, Richard Weinberger wrote:
> Am 13.01.2015 um 19:48 schrieb Brian Norris:
> > On Tue, Jan 13, 2015 at 02:25:30PM +0100, Richard Weinberger wrote:
> >> Am 12.01.2015 um 21:51 schrieb Brian Norris:
> >>> The MTD API reports -EUCLEAN only if the maximum number of bitflips
> >>> found in any ECC block exceeds a certain threshold. This is done to
> >>> avoid excessive -EUCLEAN reports to MTD users, which may induce
> >>> additional scrubbing of data, even when the ECC algorithm in use is
> >>> perfectly capable of handling the bitflips.
> >>>
> >>> This threshold can be controlled by user-space (via sysfs), to allow
> >>> users to determine what they are willing to tolerate in their
> >>> application. But it still helps to have sane defaults.
> >>>
> >>> In recent discussion [1], it was pointed out that our default threshold
> >>> is equal to the correction strength. That means that we won't actually
> >>> report any -EUCLEAN (i.e., "bitflips were corrected") errors until there
> >>> are almost too many to handle. It was determined that 3/4 of the
> >>> correction strength is probably a better default.
> >>>
> >>> [1] http://lists.infradead.org/pipermail/linux-mtd/2015-January/057259.html
> >>
> >> I like this change but I have one question.
> >>
> >> UBI will treat a block as bad if it shows bitflips (EUCLEAN) right
> >> after erasure.
> > 
> > Can you elaborate? When "after erasure"? The closest I see is that UBI
> > will mark a block bad if it sees an -EIO failure from sync_erase() in
> > erase_worker(). If you have extra debug checks on, then
> > ubi_self_check_all_ff() could potentially give you bitflip problems
> > after the erase, but that's an odd corner case anyway, which many
> > drivers have been handling in hacked together ad-hoc ways anyway (search
> > for "bitflips in erase pages").
> > 
> > So I can't pinpoint what you're talking about, exactly.
> 
> See torture_peb()
> out:
>         mutex_unlock(&ubi->buf_mutex);
>         if (err == UBI_IO_BITFLIPS || mtd_is_eccerr(err)) {
>                 /*
>                  * If a bit-flip or data integrity error was detected, the test
>                  * has not passed because it happened on a freshly erased
>                  * physical eraseblock which means something is wrong with it.
>                  */
>                 ubi_err(ubi, "read problems on freshly erased PEB %d, must be bad",
>                         pnum);
>                 err = -EIO;
>         }

Thanks for refreshing my memory.

This is actually the same function that people have been tripping over
already, especially this:

		/* Make sure the PEB contains only 0xFF bytes */
		err = ubi_io_read(ubi, ubi->peb_buf, pnum, 0, ubi->peb_size);
		if (err)
			goto out;

		err = ubi_check_pattern(ubi->peb_buf, 0xFF, ubi->peb_size);
		if (err == 0) {
			ubi_err(ubi, "erased PEB %d, but a non-0xFF byte found",
				pnum);
			err = -EIO;
			goto out;
		}

The problems here are mostly with drivers that don't (or do a bad job
of) correcting bitflips in blank pages. But there's never been an issue
of more than a few bitflips, I don't think.

So I guess we're focusing on this chunk?

		/* Write a pattern and check it */
		memset(ubi->peb_buf, patterns[i], ubi->peb_size);
		err = ubi_io_write(ubi, ubi->peb_buf, pnum, 0, ubi->peb_size);
		if (err)
			goto out;

		memset(ubi->peb_buf, ~patterns[i], ubi->peb_size);
		err = ubi_io_read(ubi, ubi->peb_buf, pnum, 0, ubi->peb_size);
		if (err)
			goto out;

I wouldn't expect that even MLC NAND would see 75% of its spec'd
bitflips immediately after erase/program/read-verify. But I'll admit I
haven't heavily tested this.

Also interesting to consider: should we make use of the knowledge of the
manufacturer-specified minimum correction info, when available? See
nand_chip::ecc_strength_ds and nand_chip::ecc_step_ds. These are pulled
from the parmeter page (ONFI or JEDEC) or the full-ID table. Note that
some systems might overprovision their ECC strength, but it may still be
worth reporting based on the datasheet specifications. Not sure of all
the implications of doing that automatically, though.

> >> For SLC NAND this works very well.
> >> Does this also hold for MLC NAND? If one or two bit flips are okay
> >> even for a freshly erased MLC NAND this change could cause UBI to
> >> mark good blocks as bad depending on the ECC strength.
> > 
> > I would typically assume that MLC NAND users must be using significantly
> > stronger ECC (e.g., 12-bit / 512-byte, at least), so "one or two
> > bitflips" would still fall well under 75% of 12 bits.
> 
> Same here. I just want to make sure that UBI does not assume a perfect NAND world. :)

At some point the responsibility is on the user. Except for the open
problems on bitflips in erased pages [1], torture_peb() still seems to look
OK. But if there are further issues related to this threshold, we might
need to adjust UBI for a less-perfect world. e.g., notice the difference
between MTD_NANDFLASH and MTD_MLCNANDFLASH.

Recall this threshold parameter is already user-settable, and paranoid
users are likely to already set it to 75% or less anyway. So UBI should
not try to inflict too much unnecessary damage.

Brian

[1] Mostly a MTD/NAND layer problem, although it's arguably a UBI issue
    that it cares about something flash manufacturers never guaranteed;
    that an erased block contains all 0xff.

next prev parent reply	other threads:[~2015-01-13 19:51 UTC|newest]

Thread overview: 28+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-01-08  3:10 NAND ECC capabilities Steve deRosier
2015-01-08  4:17 ` Ezequiel Garcia
2015-01-08  6:22   ` Steve deRosier
     [not found]     ` <0D23F1ECC880A74392D56535BCADD73526C0EA9A@NTXBOIMBX03.micron.com>
2015-01-08 17:09       ` Steve deRosier
2015-01-08 18:57         ` Brian Norris
2015-01-08  8:32 ` Ricard Wanderlof
2015-01-08 16:42   ` Ezequiel Garcia
2015-01-08 17:26     ` Steve deRosier
2015-01-08 19:09     ` Brian Norris
2015-01-08 19:27       ` Ezequiel Garcia
2015-01-12  8:35       ` Josh Wu
2015-01-12 20:51         ` [PATCH] mtd: nand: default bitflip-reporting threshold to 75% of correction strength Brian Norris
2015-01-13  2:01           ` Huang Shijie
2015-01-13  2:38             ` Brian Norris
2015-01-13  2:56               ` Huang Shijie
2015-01-13 13:25           ` Richard Weinberger
2015-01-13 18:48             ` Brian Norris
2015-01-13 18:51               ` Richard Weinberger
2015-01-13 19:51                 ` Brian Norris [this message]
2015-01-17 19:01           ` Boris Brezillon
2015-01-17 19:26             ` Richard Weinberger
2015-01-17 19:42               ` Boris Brezillon
2015-01-17 19:54                 ` Richard Weinberger
2015-01-21  8:22             ` Brian Norris
2015-01-21  8:42               ` Boris Brezillon
2015-02-10 13:50                 ` Boris Brezillon
2015-01-21  7:45           ` Brian Norris
2015-01-08 17:14   ` NAND ECC capabilities Steve deRosier

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20150113195114.GU9759@ld-irv-0074 \
    --to=computersforpeace@gmail.com \
    --cc=dedekind1@gmail.com \
    --cc=derosier@gmail.com \
    --cc=ezequiel@vanguardiasur.com.ar \
    --cc=josh.wu@atmel.com \
    --cc=kdasu.kdev@gmail.com \
    --cc=linux-mtd@lists.infradead.org \
    --cc=ricard.wanderlof@axis.com \
    --cc=richard@nod.at \
    --cc=shijie8@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox