Re: [PATCH v7 3/6] mtd: nand: omap: ecc.correct: omap_elm_correct_data: fix erased-page detection for BCHx_HW ECC schemes

linux-mtd.lists.infradead.org archive mirror
 help / color / mirror / Atom feed

From: Brian Norris <computersforpeace@gmail.com>
To: "Gupta, Pekon" <pekon@ti.com>
Cc: Stefan Roese <sr@denx.de>,
	linux-mtd <linux-mtd@lists.infradead.org>,
	"Balbi, Felipe" <balbi@ti.com>,
	Artem Bityutskiy <dedekind1@gmail.com>
Subject: Re: [PATCH v7 3/6] mtd: nand: omap: ecc.correct: omap_elm_correct_data: fix erased-page detection for BCHx_HW ECC schemes
Date: Fri, 7 Mar 2014 11:11:23 -0800	[thread overview]
Message-ID: <20140307191123.GC5232@ld-irv-0074> (raw)
In-Reply-To: <20980858CB6D3A4BAE95CA194937D5E73EA6FB16@DBDE04.ent.ti.com>

Hi Pekon,

On Wed, Feb 12, 2014 at 10:31:02AM +0000, Pekon Gupta wrote:
> >From: Brian Norris [mailto:computersforpeace@gmail.com]
> >>On Fri, Jan 17, 2014 at 12:43:14PM +0530, Pekon Gupta wrote:
> >> This patch removes dependency on any marker byte in ecc-layout, to differentiate
> >> between erased-page v/s programeed-page. Instead a page is considered erased if
> >>  (a) all(OOB)  == 0xff, .i.e., number of zero-bits in read_ecc[] == 0
> >>  (b) number of zero-bits in (Data + OOB) is less than ecc-strength
[...]
> >> @@ -1410,24 +1441,47 @@ static int omap_elm_correct_data(struct mtd_info *mtd, u_char *data,
> >>  		}
> >>
> >>  		if (eccflag == 1) {
> >> -			/*
> >> -			 * Set threshold to minimum of 4, half of ecc.strength/2
> >> -			 * to allow max bit flip in byte to 4
> >> -			 */
> >> -			unsigned int threshold = min_t(unsigned int, 4,
> >> -					info->nand.ecc.strength / 2);
> >> +			bitflip_count = 0;
> >> +			/* count zero-bits in OOB region */
> >> +			err = count_zero_bits(read_ecc, eccbytes,
> >> +						 eccstrength, &bitflip_count);
> >> +			if (err) {
> >> +				/*
> >> +				 * number of zero-bits in OOB > ecc-strength
> >> +				 * either un-correctable or programmed-page
> >> +				 */
> >> +				page_is_erased = false;
> >> +			} else if (bitflip_count == 0) {
> >> +				/* OOB == all(0xff) */
> >> +				page_is_erased = true;
> >
> >This else-if is still incorrect. You cannot assume that just because the
> >OOB is all 0xff that the page is erased.
> >
> Now this part is most important, where I would like to get more clarity
> and feedback before I proceed.
> ----------------------------
> if OOB of an page is == all(0xff)
> (a) As per general NAND spec, chip->write_buf() _only_ fills the internal buffers of
>  NAND device's with information to be written. Actual write to NAND array cells
>  happen only after 'NAND_CMD_PAGEPROG' or 'NAND_CMD_CACHEDPROG' is issued.

I'm not really sure why you bring up chip->write_buf(); not all
implementations actually use this. But anyway...

>  Thus both Main-area and OOB-area are programmed simultaneously inside NAND device.
>  if there is any disruption (like power-cut) in on-going  NAND_CMD_PAGEPROG
>  cycle, then the data should be assumed to be garbage, and it may have un-stable bits.

OK. But this is not at all what we're trying to check; we are checking
*only* whether this particular page reads mostly-0xff, due to regular
bitflips on an otherwise unprogrammed page. The "assumed garbage"
conclusion is really irrelevant here.

> (b) if OOB==all(0xff) then there is _no_ ECC stored in OOB to check the read_data

I disagree. Can you guarantee that there is *no* data pattern in which
the matching ECC syndrome bytes are all 0xff?

>   Hence driver cannot distinguish between genuine data and bit-flips.
> 
> Now based on (a) & (b), it's safe to assume that:
> if (OOB == all 0xff)
> 	/* page has no-data | garbage-data*/
> Agree ?

No. But more importantly, why do you want to make this conclusion?
Aren't we *only* trying to make a decision of "is this page
mostly-0xff"?

Or more directly: you are using the above conclusion ("page has garbage
data") to then erase the buffer entirely. This is totally, 100% bogus
because the data area might have junk (partially-programmed page?) but
the OOB could be all 0xff -- and you are now lying to the upper layers,
saying the page is cleanly 0xff!!!

I believe one of the two of us has a very dire misunderstanding here.
Please help me decide which of us this is :)

> (This is where I call it page_is_erased==true, Though I could have used better
> variable name as page_has_no_valid_data = true).
> ----------------------------
> 
> And also, driver should return 'total number zero-bits in both Main-area + OOB'
> if (0 < 'total number of zero-bits in both OOB + DATA region' < mtd->bitflip_threshold)
> 	return -EUCLEAN; 
> else
> 	return -EBADMSG;
> 
> ** However there is a bug in current patch version which I just figured out.
> I don't include the number of zero-bits of Main-area, if OOB==all(0xff).
>  +			} else if (bitflip_count == 0) {
>  +				/* OOB == all(0xff) */
>  +				page_is_erased = true;
>  +  /* missing */		bitflip_count += count_zero_bits(buf, eccsize,
>  +					  		eccstrength, &bitflip_count);
> ------------------------
> 
> >> +			} else {
> >> +				/*
> >> +				 * OOB has some bits as '0' but
> >> +				 * number of zero-bits in OOB < ecc-strength.
> >> +				 * It may be erased-page with bit-flips so,
> >> +				 * further checks are needed for confirmation
> >> +				 */
> >> +				/* count zero-bits in DATA region */
> >> +				buf = &data[eccsize * i];
> >> +				err = count_zero_bits(buf, eccsize,
> >> +						 eccstrength, &bitflip_count);
> >> +				if (err) {
> >> +					/*
> >> +					 * total number of zero-bits in OOB
> >> +					 * and DATA exceeds ecc-strength
> >> +					 */
> >> +					page_is_erased = false;
> >> +				} else {
> >> +					/* number of zero-bits < ecc-strength */
> >> +					page_is_erased = true;
> >> +				}
> >> +			}
> >
> >This whole block (where you call count_zero_bits() twice) is a
> >convoluted and buggy way of just doing the following, AFAIU:
> >
> >	page_is_erased = !count_zero_bits(read_ecc, ecc->bytes,
> >				ecc->strength, &bitflip_count) &&
> >			 !count_zero_bits(&data[ecc->size * i], ecc->size,
> >			 	ecc->strength, &bitflip_count);
> >
> >You could modify that to your liking and add a comment, but it's much
> >more concise than your version, and from what I can tell, it's actually
> >correct.
> >
> Firstly, In you above statement    's/&&/||'

No. I wrote it exactly as I meant it. We need to check both the data
area and the OOB, and if either cause us to exceed the ECC strength,
then the page is not erased.

> because as per above statement, if count_zero_bits(oob) == 0, then
> my patch assumes 'page_has_no_valid_data = true'.

And that assumption is 100% wrong, AIUI.

> page_is_erased = !count_zero_bits(read_ecc, ecc->bytes,
> 				ecc->strength, &bitflip_count) ||
> 			 !count_zero_bits(&data[ecc->size * i], ecc->size,
> 			 	ecc->strength, &bitflip_count);
> 
> 
> Secondly, You are missing the that here NAND driver is relaxing the limit of
> number-of-acceptable bit-flips in an erased-page, because some MLC and 
> Toshiba SLC NANDs show bit-flips almost on every second already erased-page.

No, I'm not missing this. The MTD/NAND layers already take care of the
"number-of-acceptable bit-flips" using the mtd->bitflip_threshold
parameter.

> >> +				if (err) {
> >> +					/*
> >> +					 * total number of zero-bits in OOB
> >> +					 * and DATA exceeds ecc-strength
> >> +					 */
> >> +					page_is_erased = false;
> >> +				} else {
> >> +					/* number of zero-bits < ecc-strength */
> >> +					page_is_erased = true;
> >> +				}
> 
> In above patch I can replace ecc->strength with mtd->bitflip_threshold
> to make it more user controllable. As bitflip_threshold is configurable via
> /sys/class/mtd/mtdX/bitflip_threshold
> What is your opinion ?

The hardware driver should not be comparing with 'bitflip_threshold'.
Comparing against ecc_strength is correct, as we are simulating the
current ECC correction strength.

Brian

next prev parent reply	other threads:[~2014-03-07 19:12 UTC|newest]

Thread overview: 12+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2014-01-17  7:13 [PATCH v7 0/6] mtd: nand: omap: optimized chip->ecc.correct() for H/W ECC schemes Pekon Gupta
2014-01-17  7:13 ` [PATCH v7 1/6] mtd: nand: omap: add field to indicate current ecc-scheme in 'struct omap_nand_info' Pekon Gupta
2014-01-17  7:13 ` [PATCH v7 2/6] mtd: nand: omap: ecc.correct: omap_elm_correct_data: rename ambiguous variable 'eccsize' and 'ecc_vector_size' Pekon Gupta
2014-01-17  7:13 ` [PATCH v7 3/6] mtd: nand: omap: ecc.correct: omap_elm_correct_data: fix erased-page detection for BCHx_HW ECC schemes Pekon Gupta
2014-02-11 21:34   ` Brian Norris
2014-02-12 10:31     ` Gupta, Pekon
2014-03-07 19:11       ` Brian Norris [this message]
2014-01-17  7:13 ` [PATCH v7 4/6] mtd: nand: omap: ecc.correct: omap_elm_correct_data: remove redundant bit-flip counting for erased-page Pekon Gupta
2014-01-17  7:13 ` [PATCH v7 5/6] mtd: nand: omap: ecc.correct: omap_elm_correct_data: cleanup for future enhancements Pekon Gupta
2014-01-17  7:13 ` [PATCH v7 6/6] mtd: nand: omap: ecc.correct: omap_elm_correct_data: fix programmed-page bit-flip correction logic Pekon Gupta
2014-01-29 13:18 ` [PATCH v7 0/6] mtd: nand: omap: optimized chip->ecc.correct() for H/W ECC schemes Stefan Roese
2014-02-04 11:37 ` Gupta, Pekon

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20140307191123.GC5232@ld-irv-0074 \
    --to=computersforpeace@gmail.com \
    --cc=balbi@ti.com \
    --cc=dedekind1@gmail.com \
    --cc=linux-mtd@lists.infradead.org \
    --cc=pekon@ti.com \
    --cc=sr@denx.de \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).