All of lore.kernel.org
 help / color / mirror / Atom feed
From: Bill Pringlemeir <bpringlemeir@nbsps.com>
To: Stefan Agner <stefan@agner.ch>
Cc: b21989@freescale.com, linux-mtd@lists.infradead.org,
	Jason.jin@freescale.com, linux-arm-kernel@lists.infradead.org
Subject: Re: [RFC 2/5] mtd:fsl_nfc: Add hardware 45 byte BHC-ECC support for 24 bit corrections.
Date: Wed, 17 Sep 2014 14:06:08 -0400	[thread overview]
Message-ID: <87r3zaxgtr.fsf@nbsps.com> (raw)
In-Reply-To: 127db6b73356442d2ba12e8c011038cc@agner.ch

On 17 Sep 2014, stefan@agner.ch wrote:

> On one of our Colibri VF61 modules I discovered an issue using this
> driver:

> [    0.758327] ECC failed to correct all errors (ebd9fd80)
> [    0.767005] ECC failed to correct all errors (ebd9fd80)
> [    0.775525] ECC failed to correct all errors (ebd9fd80)
> [    0.784004] ECC failed to correct all errors (ebd9fd80)
> [    0.791938] UBI error: ubi_io_read: error -74 (ECC error) while
> reading 2048 bytes from PEB 28:2048, 
> read 2048 bytes

> That page supposed to be empty, and I got several of this messages.
> Hence I did not believe that they have really ECC errors, so I digged
> a bit deeper:

>>> -673,11 +673,13 @@ static int fsl_nfc_check_ecc_status(struct
> mtd_info *mtd, u_char *dat)
> return ecc_count;

> /* If 'ecc_count' zero or less then buffer is all 0xff or
> erased. */
> -       flip = count_written_bits(dat, chip->ecc.size, ecc_count);
> +       flip = count_written_bits(dat, chip->ecc.size, 100);

> if (flip > ecc_count) {
> -               printk("ECC failed to correct all errors (%08x)\n",
> ecc_status);
> +               printk("ECC failed to correct all errors (%08x, flips
> %d)\n",
> +                               ecc_status, flip);
> return -1;
> }

> [    0.758624] ECC failed to correct all errors (ebdded80, flips 1)
> [    0.768123] ECC failed to correct all errors (ebdded80, flips 1)
> [    0.777468] ECC failed to correct all errors (ebdded80, flips 1)
> [    0.786773] ECC failed to correct all errors (ebdded80, flips 1)
> [    0.795535] UBI error: ubi_io_read: error -74 (ECC error) while
> reading 2048 bytes from PEB 28:2048, read 2048 bytes

[snip]

> Interesting thought, the returned ecc_count is exactly one below the
> actual flipped bytes counted...
>
> One comment below regarding this...

[snip]

>> +
>> +	/* If 'ecc_count' zero or less then buffer is all 0xff or erased. */
>> +	flip = count_written_bits(dat, nfc->chip.ecc.size, ecc_count);

> Also, I could not find that the reference manual states that ecc_count
> represents the amount of flipped byte in a empty page. Is this given
> by the ECC algorithm?

I was using this information.

   Table 31-18. ECC Status Word Field Definition

   7            0 Page has been successfully corrected
   CORFAIL      1 Page is uncorrectable

   5–0          Number of errors that have been corrected in this page
   ERROR_COUNT

This is from the Vybrid RM, but the MPC5125RM has the same information.

I have definitely tested the detection of 'erased pages'.  However, I
don't know that I ever had actual bit flips.

The ECC controller has no idea whether the page is empty or not.  Are
you saying you have an erased page with bit flips?  I have definitely
not tested this.

>> +	/* If 'ecc_count' zero or less then buffer is all 0xff or erased. */
>> +	flip = count_written_bits(dat, nfc->chip.ecc.size, ecc_count);

There are two issues.

 1) erased page with some physical flipped bits (zero).
 2) erased page were controller flipped some bits.

Currently, the code is only handling case 2 (that is what the controller
counts).  I believe that your physical page has actual 'stuck at zero'
bits.  The current driver gives up.  If you want to handle this then we
should replace the lines,

> +	/* ECC failed. */
> +	if (flip > ecc_count)
> +		return -1;

With something like,

 +	/* Not completely empty. */
 +	if (flip > ecc_count) {
 +           re_read_page_w_ecc_off();
 +           if (count_written_bits() > strength) /* strength/2 ? */
 +		return -1;
 +      }

There is also a discussion on this here,

http://lists.infradead.org/pipermail/linux-mtd/2014-March/052507.html
http://lists.infradead.org/pipermail/linux-mtd/2014-March/052508.html
http://lists.infradead.org/pipermail/linux-mtd/2014-March/052509.html
http://lists.infradead.org/pipermail/linux-mtd/2014-March/052526.html
http://lists.infradead.org/pipermail/linux-mtd/2014-March/052527.html
http://lists.infradead.org/pipermail/linux-mtd/2014-March/052619.html
... etc.  Use the view thread.

Especially,
http://lists.infradead.org/pipermail/linux-mtd/2014-March/052647.html

Here Brian says that 3b applies to SLC NAND.  I guess this is you :).

It is fairly common to read an erased page.  Doing 'raw reads' all the
time on an erased page will slow the file system.  However, doing
re-reads for an erased page with bit flips is probably fairly uncommon.
A flash in this state maybe near EOL or the sector was bad from the
factory but not marked so.

I am chasing an DDR2 issue on another CPU and haven't had much more time
to the look at the Vybrid nor follow the MTD mailing list.  I don't know
if the 'NAND_ECC_NEED_CHECK_FF' feature has been merged?  It may also
solve the issue.

Regards,
Bill Pringlemeir.

WARNING: multiple messages have this Message-ID (diff)
From: bpringlemeir@nbsps.com (Bill Pringlemeir)
To: linux-arm-kernel@lists.infradead.org
Subject: [RFC 2/5] mtd:fsl_nfc: Add hardware 45 byte BHC-ECC support for 24 bit corrections.
Date: Wed, 17 Sep 2014 14:06:08 -0400	[thread overview]
Message-ID: <87r3zaxgtr.fsf@nbsps.com> (raw)
In-Reply-To: 127db6b73356442d2ba12e8c011038cc@agner.ch

On 17 Sep 2014, stefan at agner.ch wrote:

> On one of our Colibri VF61 modules I discovered an issue using this
> driver:

> [    0.758327] ECC failed to correct all errors (ebd9fd80)
> [    0.767005] ECC failed to correct all errors (ebd9fd80)
> [    0.775525] ECC failed to correct all errors (ebd9fd80)
> [    0.784004] ECC failed to correct all errors (ebd9fd80)
> [    0.791938] UBI error: ubi_io_read: error -74 (ECC error) while
> reading 2048 bytes from PEB 28:2048, 
> read 2048 bytes

> That page supposed to be empty, and I got several of this messages.
> Hence I did not believe that they have really ECC errors, so I digged
> a bit deeper:

>>> -673,11 +673,13 @@ static int fsl_nfc_check_ecc_status(struct
> mtd_info *mtd, u_char *dat)
> return ecc_count;

> /* If 'ecc_count' zero or less then buffer is all 0xff or
> erased. */
> -       flip = count_written_bits(dat, chip->ecc.size, ecc_count);
> +       flip = count_written_bits(dat, chip->ecc.size, 100);

> if (flip > ecc_count) {
> -               printk("ECC failed to correct all errors (%08x)\n",
> ecc_status);
> +               printk("ECC failed to correct all errors (%08x, flips
> %d)\n",
> +                               ecc_status, flip);
> return -1;
> }

> [    0.758624] ECC failed to correct all errors (ebdded80, flips 1)
> [    0.768123] ECC failed to correct all errors (ebdded80, flips 1)
> [    0.777468] ECC failed to correct all errors (ebdded80, flips 1)
> [    0.786773] ECC failed to correct all errors (ebdded80, flips 1)
> [    0.795535] UBI error: ubi_io_read: error -74 (ECC error) while
> reading 2048 bytes from PEB 28:2048, read 2048 bytes

[snip]

> Interesting thought, the returned ecc_count is exactly one below the
> actual flipped bytes counted...
>
> One comment below regarding this...

[snip]

>> +
>> +	/* If 'ecc_count' zero or less then buffer is all 0xff or erased. */
>> +	flip = count_written_bits(dat, nfc->chip.ecc.size, ecc_count);

> Also, I could not find that the reference manual states that ecc_count
> represents the amount of flipped byte in a empty page. Is this given
> by the ECC algorithm?

I was using this information.

   Table 31-18. ECC Status Word Field Definition

   7            0 Page has been successfully corrected
   CORFAIL      1 Page is uncorrectable

   5?0          Number of errors that have been corrected in this page
   ERROR_COUNT

This is from the Vybrid RM, but the MPC5125RM has the same information.

I have definitely tested the detection of 'erased pages'.  However, I
don't know that I ever had actual bit flips.

The ECC controller has no idea whether the page is empty or not.  Are
you saying you have an erased page with bit flips?  I have definitely
not tested this.

>> +	/* If 'ecc_count' zero or less then buffer is all 0xff or erased. */
>> +	flip = count_written_bits(dat, nfc->chip.ecc.size, ecc_count);

There are two issues.

 1) erased page with some physical flipped bits (zero).
 2) erased page were controller flipped some bits.

Currently, the code is only handling case 2 (that is what the controller
counts).  I believe that your physical page has actual 'stuck at zero'
bits.  The current driver gives up.  If you want to handle this then we
should replace the lines,

> +	/* ECC failed. */
> +	if (flip > ecc_count)
> +		return -1;

With something like,

 +	/* Not completely empty. */
 +	if (flip > ecc_count) {
 +           re_read_page_w_ecc_off();
 +           if (count_written_bits() > strength) /* strength/2 ? */
 +		return -1;
 +      }

There is also a discussion on this here,

http://lists.infradead.org/pipermail/linux-mtd/2014-March/052507.html
http://lists.infradead.org/pipermail/linux-mtd/2014-March/052508.html
http://lists.infradead.org/pipermail/linux-mtd/2014-March/052509.html
http://lists.infradead.org/pipermail/linux-mtd/2014-March/052526.html
http://lists.infradead.org/pipermail/linux-mtd/2014-March/052527.html
http://lists.infradead.org/pipermail/linux-mtd/2014-March/052619.html
... etc.  Use the view thread.

Especially,
http://lists.infradead.org/pipermail/linux-mtd/2014-March/052647.html

Here Brian says that 3b applies to SLC NAND.  I guess this is you :).

It is fairly common to read an erased page.  Doing 'raw reads' all the
time on an erased page will slow the file system.  However, doing
re-reads for an erased page with bit flips is probably fairly uncommon.
A flash in this state maybe near EOL or the sector was bad from the
factory but not marked so.

I am chasing an DDR2 issue on another CPU and haven't had much more time
to the look at the Vybrid nor follow the MTD mailing list.  I don't know
if the 'NAND_ECC_NEED_CHECK_FF' feature has been merged?  It may also
solve the issue.

Regards,
Bill Pringlemeir.

  reply	other threads:[~2014-09-17 18:06 UTC|newest]

Thread overview: 36+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2013-11-21 17:01 VF610+ColdFireM54418 controller Bill Pringlemeir
2013-11-21 21:52 ` Bill Pringlemeir
2014-01-08 23:07 ` [RFC 0/5] Nand Bill Pringlemeir
2014-01-08 23:07   ` [RFC 1/5] mtd:fsl_nfc: Nand flash controller for VF610, MPC5125, etc Bill Pringlemeir
2014-04-28 14:41     ` Stefan Agner
2014-04-28 14:41       ` Stefan Agner
2014-04-28 16:51       ` Bill Pringlemeir
2014-04-28 16:51         ` Bill Pringlemeir
2014-04-29  7:50         ` Stefan Agner
2014-04-29  7:50           ` Stefan Agner
2014-04-29 16:36       ` Bill Pringlemeir
2014-04-29 16:36         ` Bill Pringlemeir
2014-01-08 23:07   ` [RFC 2/5] mtd:fsl_nfc: Add hardware 45 byte BHC-ECC support for 24 bit corrections Bill Pringlemeir
2014-09-17 17:02     ` Stefan Agner
2014-09-17 17:02       ` Stefan Agner
2014-09-17 18:06       ` Bill Pringlemeir [this message]
2014-09-17 18:06         ` Bill Pringlemeir
2014-09-17 20:08         ` Stefan Agner
2014-09-17 20:08           ` Stefan Agner
2014-09-17 22:21           ` Bill Pringlemeir
2014-09-17 22:21             ` Bill Pringlemeir
2014-12-10 14:56             ` Stefan Agner
2014-12-10 14:56               ` Stefan Agner
2014-12-11 16:44               ` Bill Pringlemeir
2014-12-11 16:44                 ` Bill Pringlemeir
2015-03-01  0:38                 ` Stefan Agner
2015-03-01  0:38                   ` Stefan Agner
2015-03-02 15:05                   ` Bill Pringlemeir
2015-03-02 15:05                     ` Bill Pringlemeir
2015-03-02 21:39                     ` Aaron Brice
2015-03-02 21:39                       ` Aaron Brice
2015-03-02 21:44                       ` Stefan Agner
2015-03-02 21:44                         ` Stefan Agner
2014-01-08 23:07   ` [RFC 3/5] mtd:fsl_nfc: Add device tree documentation Bill Pringlemeir
2014-01-08 23:07   ` [RFC 4/5] imx:vf610: Add device tree support for the fsl_nfc driver and NAND interface Bill Pringlemeir
2014-01-08 23:07   ` [RFC 5/5] imx:vf610: Allow user to enable NAND controller for the VF610 SOC Bill Pringlemeir

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87r3zaxgtr.fsf@nbsps.com \
    --to=bpringlemeir@nbsps.com \
    --cc=Jason.jin@freescale.com \
    --cc=b21989@freescale.com \
    --cc=linux-arm-kernel@lists.infradead.org \
    --cc=linux-mtd@lists.infradead.org \
    --cc=stefan@agner.ch \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.