From mboxrd@z Thu Jan 1 00:00:00 1970 From: stefan@agner.ch (Stefan Agner) Date: Wed, 17 Sep 2014 22:08:04 +0200 Subject: [RFC 2/5] =?UTF-8?Q?mtd=3Afsl=5Fnfc=3A=20Add=20hardware=20=34?= =?UTF-8?Q?=35=20byte=20BHC-ECC=20support=20for=20=32=34=20bit=20correctio?= =?UTF-8?Q?ns=2E?= In-Reply-To: <87r3zaxgtr.fsf@nbsps.com> References: <87siupheou.fsf@nbsps.com> <1389222441-4322-1-git-send-email-bpringlemeir@nbsps.com> <1389222441-4322-3-git-send-email-bpringlemeir@nbsps.com> <127db6b73356442d2ba12e8c011038cc@agner.ch> <87r3zaxgtr.fsf@nbsps.com> Message-ID: To: linux-arm-kernel@lists.infradead.org List-Id: linux-arm-kernel.lists.infradead.org Am 2014-09-17 20:06, schrieb Bill Pringlemeir: > On 17 Sep 2014, stefan at agner.ch wrote: > >> On one of our Colibri VF61 modules I discovered an issue using this >> driver: > >> [ 0.758327] ECC failed to correct all errors (ebd9fd80) >> [ 0.767005] ECC failed to correct all errors (ebd9fd80) >> [ 0.775525] ECC failed to correct all errors (ebd9fd80) >> [ 0.784004] ECC failed to correct all errors (ebd9fd80) >> [ 0.791938] UBI error: ubi_io_read: error -74 (ECC error) while >> reading 2048 bytes from PEB 28:2048, >> read 2048 bytes > >> That page supposed to be empty, and I got several of this messages. >> Hence I did not believe that they have really ECC errors, so I digged >> a bit deeper: > >>>> -673,11 +673,13 @@ static int fsl_nfc_check_ecc_status(struct >> mtd_info *mtd, u_char *dat) >> return ecc_count; > >> /* If 'ecc_count' zero or less then buffer is all 0xff or >> erased. */ >> - flip = count_written_bits(dat, chip->ecc.size, ecc_count); >> + flip = count_written_bits(dat, chip->ecc.size, 100); > >> if (flip > ecc_count) { >> - printk("ECC failed to correct all errors (%08x)\n", >> ecc_status); >> + printk("ECC failed to correct all errors (%08x, flips >> %d)\n", >> + ecc_status, flip); >> return -1; >> } > >> [ 0.758624] ECC failed to correct all errors (ebdded80, flips 1) >> [ 0.768123] ECC failed to correct all errors (ebdded80, flips 1) >> [ 0.777468] ECC failed to correct all errors (ebdded80, flips 1) >> [ 0.786773] ECC failed to correct all errors (ebdded80, flips 1) >> [ 0.795535] UBI error: ubi_io_read: error -74 (ECC error) while >> reading 2048 bytes from PEB 28:2048, read 2048 bytes > > [snip] > >> Interesting thought, the returned ecc_count is exactly one below the >> actual flipped bytes counted... >> >> One comment below regarding this... > > [snip] > >>> + >>> + /* If 'ecc_count' zero or less then buffer is all 0xff or erased. */ >>> + flip = count_written_bits(dat, nfc->chip.ecc.size, ecc_count); > >> Also, I could not find that the reference manual states that ecc_count >> represents the amount of flipped byte in a empty page. Is this given >> by the ECC algorithm? > > I was using this information. > > Table 31-18. ECC Status Word Field Definition > > 7 0 Page has been successfully corrected > CORFAIL 1 Page is uncorrectable > > 5?0 Number of errors that have been corrected in this page > ERROR_COUNT > > This is from the Vybrid RM, but the MPC5125RM has the same information. > > I have definitely tested the detection of 'erased pages'. However, I > don't know that I ever had actual bit flips. > > The ECC controller has no idea whether the page is empty or not. Are > you saying you have an erased page with bit flips? I have definitely > not tested this. Yes, that's what it looks like. Note that this output is on first boot after flashing the device (with the UBI auto-grow option). I guess UBI is reading all pages once to verify that they are empty. >>> + /* If 'ecc_count' zero or less then buffer is all 0xff or erased. */ >>> + flip = count_written_bits(dat, nfc->chip.ecc.size, ecc_count); > > There are two issues. > > 1) erased page with some physical flipped bits (zero). > 2) erased page were controller flipped some bits. I did not know that issue 2 exists. How can 2 happen on a erased page? With a 'stuck at zero' in OOB? > Currently, the code is only handling case 2 (that is what the controller > counts). I believe that your physical page has actual 'stuck at zero' > bits. The current driver gives up. If you want to handle this then we > should replace the lines, > I assumed that your code assumes that the controller returns the flipped bit count when reading an erase page (case 1), which I did not found in the documentation. >> + /* ECC failed. */ >> + if (flip > ecc_count) >> + return -1; > > With something like, > > + /* Not completely empty. */ > + if (flip > ecc_count) { > + re_read_page_w_ecc_off(); > + if (count_written_bits() > strength) /* strength/2 ? */ > + return -1; > + } > > There is also a discussion on this here, > > http://lists.infradead.org/pipermail/linux-mtd/2014-March/052507.html > http://lists.infradead.org/pipermail/linux-mtd/2014-March/052508.html > http://lists.infradead.org/pipermail/linux-mtd/2014-March/052509.html > http://lists.infradead.org/pipermail/linux-mtd/2014-March/052526.html > http://lists.infradead.org/pipermail/linux-mtd/2014-March/052527.html > http://lists.infradead.org/pipermail/linux-mtd/2014-March/052619.html > ... etc. Use the view thread. > > Especially, > http://lists.infradead.org/pipermail/linux-mtd/2014-March/052647.html > > Here Brian says that 3b applies to SLC NAND. I guess this is you :). Yes, we are using Macronix SLC NAND. > > It is fairly common to read an erased page. Doing 'raw reads' all the > time on an erased page will slow the file system. However, doing > re-reads for an erased page with bit flips is probably fairly uncommon. > A flash in this state maybe near EOL or the sector was bad from the > factory but not marked so. This is a new device, but its one out of several dozens. The device had two factory marked bad page. This four page would then be 6 bad pages. I would say that your guess is probably the case at hand (should be considered bad, but were marked by factory). > > I am chasing an DDR2 issue on another CPU and haven't had much more time > to the look at the Vybrid nor follow the MTD mailing list. I don't know > if the 'NAND_ECC_NEED_CHECK_FF' feature has been merged? It may also > solve the issue. A quick search through 3.17-rc3 didn't found that string. I will read the pages using raw again and check how many bit flips it shows. But I guess regarding Brian's comment to 3b, the driver actually behaves correct and return -1, because we have bit flips on erased page which is not good... UBI starts to scrub & torture those 4 pages then, but starts doing this with other pages too. All the pages start to fail then! I'm still investigating that issue. [ 30.901345] UBI: scrubbed PEB 1436 (LEB 0:356), data moved to PEB 3965 [ 31.100247] UBI: run torture test for PEB 1436 [ 31.280845] UBI error: torture_peb: read problems on freshly erased PEB 1436, must be bad [ 31.300656] UBI error: erase_worker: failed to erase PEB 1436, error -5 [ 31.312626] UBI: mark PEB 1436 as bad [ 31.338216] UBI: 12 PEBs left in the reserve [ 31.519044] UBI: scrubbed PEB 1390 (LEB 0:313), data moved to PEB 3963 [ 31.697812] UBI: run torture test for PEB 1390 [ 31.880470] UBI error: torture_peb: read problems on freshly erased PEB 1390, must be bad [ 31.898220] UBI error: erase_worker: failed to erase PEB 1390, error -5 [ 31.909687] UBI: mark PEB 1390 as bad [ 31.931620] UBI: 11 PEBs left in the reserve [ 32.170599] UBI: scrubbed PEB 1470 (LEB 0:389), data moved to PEB 3961 [ 32.344564] UBI: run torture test for PEB 1470 [ 32.522842] UBI error: torture_peb: read problems on freshly erased PEB 1470, must be bad [ 32.540587] UBI error: erase_worker: failed to erase PEB 1470, error -5 [ 32.552060] UBI: mark PEB 1470 as bad