From: stefan@agner.ch (Stefan Agner)
To: linux-arm-kernel@lists.infradead.org
Subject: [RFC 2/5] mtd:fsl_nfc: Add hardware 45 byte BHC-ECC support for 24 bit corrections.
Date: Wed, 17 Sep 2014 22:08:04 +0200 [thread overview]
Message-ID: <c444059de888a661fadf3d6be598a98d@agner.ch> (raw)
In-Reply-To: <87r3zaxgtr.fsf@nbsps.com>
Am 2014-09-17 20:06, schrieb Bill Pringlemeir:
> On 17 Sep 2014, stefan at agner.ch wrote:
>
>> On one of our Colibri VF61 modules I discovered an issue using this
>> driver:
>
>> [ 0.758327] ECC failed to correct all errors (ebd9fd80)
>> [ 0.767005] ECC failed to correct all errors (ebd9fd80)
>> [ 0.775525] ECC failed to correct all errors (ebd9fd80)
>> [ 0.784004] ECC failed to correct all errors (ebd9fd80)
>> [ 0.791938] UBI error: ubi_io_read: error -74 (ECC error) while
>> reading 2048 bytes from PEB 28:2048,
>> read 2048 bytes
>
>> That page supposed to be empty, and I got several of this messages.
>> Hence I did not believe that they have really ECC errors, so I digged
>> a bit deeper:
>
>>>> -673,11 +673,13 @@ static int fsl_nfc_check_ecc_status(struct
>> mtd_info *mtd, u_char *dat)
>> return ecc_count;
>
>> /* If 'ecc_count' zero or less then buffer is all 0xff or
>> erased. */
>> - flip = count_written_bits(dat, chip->ecc.size, ecc_count);
>> + flip = count_written_bits(dat, chip->ecc.size, 100);
>
>> if (flip > ecc_count) {
>> - printk("ECC failed to correct all errors (%08x)\n",
>> ecc_status);
>> + printk("ECC failed to correct all errors (%08x, flips
>> %d)\n",
>> + ecc_status, flip);
>> return -1;
>> }
>
>> [ 0.758624] ECC failed to correct all errors (ebdded80, flips 1)
>> [ 0.768123] ECC failed to correct all errors (ebdded80, flips 1)
>> [ 0.777468] ECC failed to correct all errors (ebdded80, flips 1)
>> [ 0.786773] ECC failed to correct all errors (ebdded80, flips 1)
>> [ 0.795535] UBI error: ubi_io_read: error -74 (ECC error) while
>> reading 2048 bytes from PEB 28:2048, read 2048 bytes
>
> [snip]
>
>> Interesting thought, the returned ecc_count is exactly one below the
>> actual flipped bytes counted...
>>
>> One comment below regarding this...
>
> [snip]
>
>>> +
>>> + /* If 'ecc_count' zero or less then buffer is all 0xff or erased. */
>>> + flip = count_written_bits(dat, nfc->chip.ecc.size, ecc_count);
>
>> Also, I could not find that the reference manual states that ecc_count
>> represents the amount of flipped byte in a empty page. Is this given
>> by the ECC algorithm?
>
> I was using this information.
>
> Table 31-18. ECC Status Word Field Definition
>
> 7 0 Page has been successfully corrected
> CORFAIL 1 Page is uncorrectable
>
> 5?0 Number of errors that have been corrected in this page
> ERROR_COUNT
>
> This is from the Vybrid RM, but the MPC5125RM has the same information.
>
> I have definitely tested the detection of 'erased pages'. However, I
> don't know that I ever had actual bit flips.
>
> The ECC controller has no idea whether the page is empty or not. Are
> you saying you have an erased page with bit flips? I have definitely
> not tested this.
Yes, that's what it looks like. Note that this output is on first boot
after flashing the device (with the UBI auto-grow option). I guess UBI
is reading all pages once to verify that they are empty.
>>> + /* If 'ecc_count' zero or less then buffer is all 0xff or erased. */
>>> + flip = count_written_bits(dat, nfc->chip.ecc.size, ecc_count);
>
> There are two issues.
>
> 1) erased page with some physical flipped bits (zero).
> 2) erased page were controller flipped some bits.
I did not know that issue 2 exists. How can 2 happen on a erased page?
With a 'stuck at zero' in OOB?
> Currently, the code is only handling case 2 (that is what the controller
> counts). I believe that your physical page has actual 'stuck at zero'
> bits. The current driver gives up. If you want to handle this then we
> should replace the lines,
>
I assumed that your code assumes that the controller returns the flipped
bit count when reading an erase page (case 1), which I did not found in
the documentation.
>> + /* ECC failed. */
>> + if (flip > ecc_count)
>> + return -1;
>
> With something like,
>
> + /* Not completely empty. */
> + if (flip > ecc_count) {
> + re_read_page_w_ecc_off();
> + if (count_written_bits() > strength) /* strength/2 ? */
> + return -1;
> + }
>
> There is also a discussion on this here,
>
> http://lists.infradead.org/pipermail/linux-mtd/2014-March/052507.html
> http://lists.infradead.org/pipermail/linux-mtd/2014-March/052508.html
> http://lists.infradead.org/pipermail/linux-mtd/2014-March/052509.html
> http://lists.infradead.org/pipermail/linux-mtd/2014-March/052526.html
> http://lists.infradead.org/pipermail/linux-mtd/2014-March/052527.html
> http://lists.infradead.org/pipermail/linux-mtd/2014-March/052619.html
> ... etc. Use the view thread.
>
> Especially,
> http://lists.infradead.org/pipermail/linux-mtd/2014-March/052647.html
>
> Here Brian says that 3b applies to SLC NAND. I guess this is you :).
Yes, we are using Macronix SLC NAND.
>
> It is fairly common to read an erased page. Doing 'raw reads' all the
> time on an erased page will slow the file system. However, doing
> re-reads for an erased page with bit flips is probably fairly uncommon.
> A flash in this state maybe near EOL or the sector was bad from the
> factory but not marked so.
This is a new device, but its one out of several dozens. The device had
two factory marked bad page. This four page would then be 6 bad pages. I
would say that your guess is probably the case at hand (should be
considered bad, but were marked by factory).
>
> I am chasing an DDR2 issue on another CPU and haven't had much more time
> to the look at the Vybrid nor follow the MTD mailing list. I don't know
> if the 'NAND_ECC_NEED_CHECK_FF' feature has been merged? It may also
> solve the issue.
A quick search through 3.17-rc3 didn't found that string.
I will read the pages using raw again and check how many bit flips it
shows. But I guess regarding Brian's comment to 3b, the driver actually
behaves correct and return -1, because we have bit flips on erased page
which is not good...
UBI starts to scrub & torture those 4 pages then, but starts doing this
with other pages too. All the pages start to fail then! I'm still
investigating that issue.
[ 30.901345] UBI: scrubbed PEB 1436 (LEB 0:356), data moved to PEB
3965
[ 31.100247] UBI: run torture test for PEB 1436
[ 31.280845] UBI error: torture_peb: read problems on freshly erased
PEB 1436, must be bad
[ 31.300656] UBI error: erase_worker: failed to erase PEB 1436, error
-5
[ 31.312626] UBI: mark PEB 1436 as bad
[ 31.338216] UBI: 12 PEBs left in the reserve
[ 31.519044] UBI: scrubbed PEB 1390 (LEB 0:313), data moved to PEB
3963
[ 31.697812] UBI: run torture test for PEB 1390
[ 31.880470] UBI error: torture_peb: read problems on freshly erased
PEB 1390, must be bad
[ 31.898220] UBI error: erase_worker: failed to erase PEB 1390, error
-5
[ 31.909687] UBI: mark PEB 1390 as bad
[ 31.931620] UBI: 11 PEBs left in the reserve
[ 32.170599] UBI: scrubbed PEB 1470 (LEB 0:389), data moved to PEB
3961
[ 32.344564] UBI: run torture test for PEB 1470
[ 32.522842] UBI error: torture_peb: read problems on freshly erased
PEB 1470, must be bad
[ 32.540587] UBI error: erase_worker: failed to erase PEB 1470, error
-5
[ 32.552060] UBI: mark PEB 1470 as bad
next prev parent reply other threads:[~2014-09-17 20:08 UTC|newest]
Thread overview: 20+ messages / expand[flat|nested] mbox.gz Atom feed top
[not found] <87siupheou.fsf@nbsps.com>
2014-01-08 23:07 ` [RFC 0/5] Nand Bill Pringlemeir
2014-01-08 23:07 ` [RFC 1/5] mtd:fsl_nfc: Nand flash controller for VF610, MPC5125, etc Bill Pringlemeir
2014-04-28 14:41 ` Stefan Agner
2014-04-28 16:51 ` Bill Pringlemeir
2014-04-29 7:50 ` Stefan Agner
2014-04-29 16:36 ` Bill Pringlemeir
2014-01-08 23:07 ` [RFC 2/5] mtd:fsl_nfc: Add hardware 45 byte BHC-ECC support for 24 bit corrections Bill Pringlemeir
2014-09-17 17:02 ` Stefan Agner
2014-09-17 18:06 ` Bill Pringlemeir
2014-09-17 20:08 ` Stefan Agner [this message]
2014-09-17 22:21 ` Bill Pringlemeir
2014-12-10 14:56 ` Stefan Agner
2014-12-11 16:44 ` Bill Pringlemeir
2015-03-01 0:38 ` Stefan Agner
2015-03-02 15:05 ` Bill Pringlemeir
2015-03-02 21:39 ` Aaron Brice
2015-03-02 21:44 ` Stefan Agner
2014-01-08 23:07 ` [RFC 3/5] mtd:fsl_nfc: Add device tree documentation Bill Pringlemeir
2014-01-08 23:07 ` [RFC 4/5] imx:vf610: Add device tree support for the fsl_nfc driver and NAND interface Bill Pringlemeir
2014-01-08 23:07 ` [RFC 5/5] imx:vf610: Allow user to enable NAND controller for the VF610 SOC Bill Pringlemeir
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=c444059de888a661fadf3d6be598a98d@agner.ch \
--to=stefan@agner.ch \
--cc=linux-arm-kernel@lists.infradead.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).