* NAND BBT corruption on MPC83xx
@ 2011-06-15 19:48 Matthew L. Creech
2011-06-15 20:17 ` Mike Hench
2011-06-23 8:35 ` Artem Bityutskiy
0 siblings, 2 replies; 10+ messages in thread
From: Matthew L. Creech @ 2011-06-15 19:48 UTC (permalink / raw)
To: MTD list
Hi, I'm not sure whether this list or the U-Boot list is more
appropriate, but figured I'd start here and see if anyone can help.
We've gotten some devices back from the field which all suffer from
this same problem on bootup when attaching UBI (these messages are
from U-Boot):
...
Bad block table found at page 524224, version 0x01
Bad block table found at page 524160, version 0x01
nand_bbt: ECC error while reading bad block table
...(long stream of bogus bad blocks)...
UBI: attaching mtd1 to ubi0
UBI: physical eraseblock size: 131072 bytes (128 KiB)
UBI: logical eraseblock size: 129024 bytes
UBI: smallest flash I/O unit: 2048
UBI: sub-page size: 512
UBI: VID header offset: 512 (aligned 512)
UBI: data offset: 2048
UBI error: vtbl_check: volume table check failed: record 0, error 9
UBI error: ubi_init: cannot attach mtd1
UBI error: ubi_init: UBI error: cannot initialize UBI, error -22
UBI init error -22
A full console dump is here:
http://mcreech.com/work/bbt-ecc-error.txt
Question #1: Is the UBI error here attributable to the blocks which
are wrongly marked as bad? I would assume that it's a red herring,
and I should focus on figuring out how the BBT got corrupted, but
figured I'd check first.
Question #2: Are there any known issues that could cause the BBT to
become corrupt like this?
I noticed that the reported bad blocks were all aligned at multiples
of 0x80000 (with one exception). Dumping the last 2 blocks shows:
- one BBT with lots of bytes that have their lower 1 or 2 bits
  cleared (e.g. 0xfe instead of 0xff): this explains the
  every-4th-block alignment.
- the other BBT shows only one factory-marked bad block at
  0x062e0000, which is presumably correct. This is preserved in the
  bogus BBT, and is the only non-0x80000-aligned bad block in the table.
- only the first 1024 bytes of the BBT contain bogus info; the
  latter half of the BBT is all correct.
It seems like the original BBT somehow had 0-2 bits corrupted at the
low end of each of its bytes, either while in memory or when the BBT
was written to NAND. Any ideas on what I can do to isolate the
problem? Thanks in advance!
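
The link between cleared low bits and 0x80000-aligned bad blocks can be checked with a few lines of C. This is only a sketch: it assumes the on-flash table packs 2 bits per block, 4 blocks per byte, lowest bits first, with 11b meaning "good", and it uses the 128 KiB erase block size from the UBI log. The helper names are hypothetical, not kernel functions.

```c
#include <stddef.h>
#include <stdint.h>

#define NAND_BLOCK_SIZE 0x20000u /* 128 KiB erase block, from the UBI log */
#define BLOCKS_PER_BYTE 4u       /* assumption: 2 bits per block */

/* Nonzero if `block` still has both of its table bits set (assumed "good"). */
static int bbt_block_is_good(const uint8_t *bbt, uint32_t block)
{
    unsigned shift = 2u * (block % BLOCKS_PER_BYTE);
    uint8_t entry = (uint8_t)((bbt[block / BLOCKS_PER_BYTE] >> shift) & 0x3u);

    return entry == 0x3u;
}

/* Flash offset of the first of the 4 blocks covered by table byte `idx`. */
static uint32_t bbt_byte_to_offset(size_t idx)
{
    return (uint32_t)idx * BLOCKS_PER_BYTE * NAND_BLOCK_SIZE;
}
```

Under these assumptions a byte of 0xfe marks only the first block of its group as bad, and consecutive table bytes are 0x80000 apart in flash, which matches the pattern in the dump. The factory-marked block at 0x062e0000 is block 791, the last entry of its byte, so it falls outside that pattern.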
More info on this board:
- MPC 8313 SoC
- 1GB Samsung NAND flash (K9K8G08U0B)
- Linux 2.6.31
- U-Boot 2009.06
--
Matthew L. Creech
* RE: NAND BBT corruption on MPC83xx
2011-06-15 19:48 NAND BBT corruption on MPC83xx Matthew L. Creech
@ 2011-06-15 20:17 ` Mike Hench
2011-06-23 8:35 ` Artem Bityutskiy
1 sibling, 0 replies; 10+ messages in thread
From: Mike Hench @ 2011-06-15 20:17 UTC (permalink / raw)
To: Matthew L. Creech, MTD list
This is the "read page -1" problem we discussed earlier. Around line
485 of linux-2.6.39.1/drivers/mtd/nand/fsl_elbc_nand.c:
    /* Read back the page in order to fill in the ECC for the
     * caller. Is this really needed?
     */
    if (full_page && elbc_fcm_ctrl->oob_poi) {
        out_be32(&lbc->fbcr, 3);
        set_addr(mtd, 6, page_addr, 1);
page_addr at this point is always -1.
Now, why a read corrupts that last block I do not know.
-1 is not a valid page address; the address 'protocol' allows more bits
than the flash uses, so it does not mean 'last block' or the last page
in the last block.
Maybe it is upsetting the state machine in the flash.
Maybe it is upsetting the ELBC.
You will find that the secondary bad block table (first page, second to
last block) is fine.
The kernel works fine without that block of code, so oob_poi must not be
used anywhere. In any case it is always garbage with the above code.
I raised this issue on this list earlier, with no response.
I think Freescale might be more likely to notice on the PPC list.
I don't think they hang out here.
I am happy without that block of useless code.
* Re: NAND BBT corruption on MPC83xx
2011-06-15 19:48 NAND BBT corruption on MPC83xx Matthew L. Creech
2011-06-15 20:17 ` Mike Hench
@ 2011-06-23 8:35 ` Artem Bityutskiy
1 sibling, 0 replies; 10+ messages in thread
From: Artem Bityutskiy @ 2011-06-23 8:35 UTC (permalink / raw)
To: Matthew L. Creech; +Cc: MTD list
On Wed, 2011-06-15 at 15:48 -0400, Matthew L. Creech wrote:
> Question #1: Is the UBI error here attributable to the blocks which
> are wrongly marked as bad? I would assume that it's a red herring,
> and I should focus on figuring out how the BBT got corrupted, but
> figured I'd check first.
UBI prints error messages if a block is marked bad, and they should go
to syslog. If you are able to access the syslog, you can see whether
anything was marked as bad by UBI. But I really doubt UBI is to blame.
This might be BBT stuff - I have never used the on-flash BBT, and when I
look at the code, I do not trust it...
--
Best Regards,
Artem Bityutskiy
[parent not found: <BANLkTikyKObukzVw+c10HyDz+Q=PH4jozA@mail.gmail.com>]
* Re: NAND BBT corruption on MPC83xx
[not found] <BANLkTikyKObukzVw+c10HyDz+Q=PH4jozA@mail.gmail.com>
@ 2011-06-17 21:34 ` Scott Wood
2011-06-18 17:55 ` Mike Hench
` (2 more replies)
0 siblings, 3 replies; 10+ messages in thread
From: Scott Wood @ 2011-06-17 21:34 UTC (permalink / raw)
To: Matthew L. Creech; +Cc: linux-mtd, linuxppc-dev
On Fri, 17 Jun 2011 16:54:27 -0400
"Matthew L. Creech" <mlcreech@gmail.com> wrote:
> Hi, I posted this on the Linux-MTD list but haven't gotten any hits.
> Since it looks like it could be MPC83xx-specific, I'm reposting here.
> Rick Johnson noted a problem in fsl_elbc_nand.c back in May which
> might be related:
>
> http://lists.infradead.org/pipermail/linux-mtd/2011-May/035372.html
It seems that the generic code always passes -1 with PAGEPROG, and only
provides the actual page address on SEQIN.
I don't think the ECC readback is needed, and the fact that it looks like
it has always been broken would seem to confirm that. It's broken in
other ways, too -- it assumes a particular ECC layout. Let's get rid of it.
As for the corruption, could it be degradation from repeated reads of that
one page?
> More info on this board:
> - MPC 8313 SoC
> - 1GB Samsung NAND flash (K9K8G08U0B)
> - Linux 2.6.31
> - U-Boot 2009.06
Hmm, 2.6.31... it's probably not related to this problem, but you
should cherry pick b3a70f0bc32d1b70584bcaa6019fa4260b0da92e and
476459a6cf46d20ec73d9b211f3894ced5f9871e.
-Scott
* RE: NAND BBT corruption on MPC83xx
2011-06-17 21:34 ` Scott Wood
@ 2011-06-18 17:55 ` Mike Hench
2011-06-20 11:22 ` Atlant Schmidt
2011-06-20 15:20 ` Matthew L. Creech
2011-07-05 19:58 ` Matthew L. Creech
2 siblings, 1 reply; 10+ messages in thread
From: Mike Hench @ 2011-06-18 17:55 UTC (permalink / raw)
To: Scott Wood, Matthew L. Creech; +Cc: linuxppc-dev, linux-mtd
Scott Wood wrote:
> As for the corruption, could it be degradation from repeated reads of
> that one page?
Read disturb. I did not know SLC did that.
It just takes 10x as long as MLC, on the order of a million reads.
Supposedly erasing the block fixes it.
It is not a permanent damage thing.
I was seeing ~9 hours before failure with heavy writes.
~4 GByte/hour = ~2M pages/hour, for a total of ~18 million reads before
errors in that last block showed up.
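
The arithmetic behind that estimate, as a small sketch. It assumes 2 KiB pages (implied by the "2M pages" figure) and exactly one readback of the bogus page per page program; `readback_count` is a hypothetical helper name.

```c
#include <stdint.h>

/*
 * Number of page programs, and therefore readbacks under the assumption
 * of one readback per program, for a sustained write rate over some hours.
 */
static uint64_t readback_count(uint64_t bytes_per_hour, uint64_t hours,
                               uint64_t page_size)
{
    return bytes_per_hour / page_size * hours;
}
```

With 4 GiB per hour, 9 hours and 2048-byte pages this gives 2097152 pages per hour and 18874368 readbacks, in line with the rough figures above.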
Cool. Now we know.
Thanks.
Mike Hench
* RE: NAND BBT corruption on MPC83xx
2011-06-18 17:55 ` Mike Hench
@ 2011-06-20 11:22 ` Atlant Schmidt
2011-06-23 8:31 ` Artem Bityutskiy
0 siblings, 1 reply; 10+ messages in thread
From: Atlant Schmidt @ 2011-06-20 11:22 UTC (permalink / raw)
To: 'Mike Hench', Scott Wood, Matthew L. Creech
Cc: linux-mtd@lists.infradead.org, linuxppc-dev@lists.ozlabs.org
Mike:
> It is not a permanent damage thing.
A "read disturb" does no permanent damage to the chip
but if the read disturb event involves more bits than
can be corrected by your ECC code, it can do permanent
damage to the *DATA* you've stored in that block.
For this reason, a good flash management system manages
to at least occasionally read through *ALL* of the in-use
blocks in the device so that single-bit errors can be
scrubbed out (read and successfully corrected) before
an adjacent bit in the block also fails (which would
eventually lead to a multi-bit error that might be
beyond the ability to be corrected by the ECC).
As far as I know (and I'm sure the list will correct
me if I'm wrong! ;-) ), neither UBI nor UBIFS nor any
Linux layer provides this routine scrubbing; you have
to code it up yourself, probably by accessing the
device at the UBI (underlying block device/LEB) layer.
Atlant
* RE: NAND BBT corruption on MPC83xx
2011-06-20 11:22 ` Atlant Schmidt
@ 2011-06-23 8:31 ` Artem Bityutskiy
0 siblings, 0 replies; 10+ messages in thread
From: Artem Bityutskiy @ 2011-06-23 8:31 UTC (permalink / raw)
To: Atlant Schmidt
Cc: Scott Wood, linuxppc-dev@lists.ozlabs.org,
linux-mtd@lists.infradead.org, Matthew L. Creech,
'Mike Hench'
On Mon, 2011-06-20 at 07:22 -0400, Atlant Schmidt wrote:
>
> As far as I know (and I'm sure the list will correct
> me if I'm wrong! ;-) ), neither UBI nor UBIFS nor any
> Linux layer provides this routine scrubbing; you have
> to code it up yourself, probably by accessing the
> device at the UBI (underlying block device/LEB) layer.
UBI will scrub all LEBs with bit-flips once they are read.
But if you have bit-flips in an LEB and it is never read, it will never
be scrubbed. And erasures of the neighboring PEBs may turn bit-flips
into hard errors.
To force scrubbing, the easiest way is to just read all volumes, like
    dd if=/dev/ubi0_i of=/dev/null bs=4096
for each i.
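
Where dd is not available on the target, the same full read can be done from a small program. A minimal sketch using only POSIX open/read/close; `read_whole_volume` is a hypothetical helper, and the path passed in is whatever UBI volume node exists on the system.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Read `path` to end-of-file in 4096-byte chunks and discard the data.
 * Returns the number of bytes read, or -1 if the open or a read failed.
 */
static long long read_whole_volume(const char *path)
{
    char buf[4096];
    long long total = 0;
    ssize_t n;
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return -1;
    while ((n = read(fd, buf, sizeof(buf))) > 0)
        total += n;
    close(fd);
    return n < 0 ? -1 : total;
}
```

Calling it once per volume node mirrors the dd loop. The read itself is what matters here, since it gives UBI the chance to notice bit-flips and schedule scrubbing.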
--
Best Regards,
Artem Bityutskiy
* Re: NAND BBT corruption on MPC83xx
2011-06-17 21:34 ` Scott Wood
2011-06-18 17:55 ` Mike Hench
@ 2011-06-20 15:20 ` Matthew L. Creech
2011-07-05 19:58 ` Matthew L. Creech
2 siblings, 0 replies; 10+ messages in thread
From: Matthew L. Creech @ 2011-06-20 15:20 UTC (permalink / raw)
To: Scott Wood; +Cc: linux-mtd, linuxppc-dev, mhench
On Fri, Jun 17, 2011 at 5:34 PM, Scott Wood <scottwood@freescale.com> wrote:
>
> As for the corruption, could it be degradation from repeated reads of that
> one page?
>
Could be. I think Mike's theory was that the -1 page_addr sort of
"wrapped around", and caused us to read in the last block on flash
each time NAND_CMD_PAGEPROG was performed. So with a lot of writes
happening, we could end up with a BBT that looks like this.
That makes sense I guess, since set_addr() in fsl_elbc_nand.c uses
page_addr to set FBAR. I don't see anything about it in the manual,
but if FBAR wraps beyond the end of the chip, maybe the bits that
don't make sense are simply ignored. (In which case we should
probably add a check in set_addr() to prevent anything like this in
the future)
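
A sketch of both halves of that idea, assuming the 1 GiB chip and 2 KiB pages from the board description. Whether the controller really ignores the out-of-range address bits is speculation, as noted above, and the helper names are hypothetical rather than anything in fsl_elbc_nand.c.

```c
#include <stdint.h>

#define NAND_CHIP_SIZE  (1ULL << 30) /* 1 GiB, per the board description */
#define NAND_PAGE_SIZE  2048u        /* from the UBI log */
#define NAND_NUM_PAGES  ((uint32_t)(NAND_CHIP_SIZE / NAND_PAGE_SIZE))
#define PAGES_PER_BLOCK 64u          /* 128 KiB block / 2 KiB page */

/*
 * Page that would be addressed if the bits beyond the chip's address
 * range were simply dropped (an assumption, not documented behaviour).
 */
static uint32_t wrapped_page(int page_addr)
{
    return (uint32_t)page_addr & (NAND_NUM_PAGES - 1u);
}

/* The kind of range check set_addr() could apply before using page_addr. */
static int page_addr_is_valid(int page_addr)
{
    return page_addr >= 0 && (uint32_t)page_addr < NAND_NUM_PAGES;
}
```

Under that assumption a page_addr of -1 lands on page 524287, which is in block 8191, the same block that holds the BBT found at page 524224.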
In theory I should be able to prove it out by running 2 devices in
parallel - one with that block of code still there, and one with it
removed. If the former device sees bit-flips in the BBT and the
latter one doesn't, we'll be sure of the culprit. I'll try this and
come back with the results.
Thanks!
--
Matthew L. Creech
* Re: NAND BBT corruption on MPC83xx
2011-06-17 21:34 ` Scott Wood
2011-06-18 17:55 ` Mike Hench
2011-06-20 15:20 ` Matthew L. Creech
@ 2011-07-05 19:58 ` Matthew L. Creech
2011-07-11 15:30 ` Matthew L. Creech
2 siblings, 1 reply; 10+ messages in thread
From: Matthew L. Creech @ 2011-07-05 19:58 UTC (permalink / raw)
To: Scott Wood; +Cc: linux-mtd, linuxppc-dev
On Fri, Jun 17, 2011 at 5:34 PM, Scott Wood <scottwood@freescale.com> wrote:
>
> It seems that the generic code always passes -1 with PAGEPROG, and only
> provides the actual page address on SEQIN.
>
> I don't think the ECC readback is needed, and the fact that it looks like
> it has always been broken would seem to confirm that. It's broken in
> other ways, too -- it assumes a particular ECC layout. Let's get rid of it.
>
> As for the corruption, could it be degradation from repeated reads of that
> one page?
>
I modified nanddump to do repeated reads, and compare the data
obtained from the first iteration with that obtained later (to detect
bit-flips). I tried 3 different variations:
- one which reads the first page (2k) of the last block
- one which reads the second page (2k) of the last block
- one which reads the entire last block (128k), just for comparison
As I understand it, read-disturb would primarily come into play when
the second page is read, since it's adjacent to the first page (please
correct me if I'm wrong there). Anyway, all 3 of these tests were run
for at least 50 million read cycles, with no bit-flips detected. So
I'm somewhat doubtful that this is the cause of the BBT corruption
I've been seeing.
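
The modified nanddump is not shown in this thread. The comparison step it describes could look roughly like the sketch below, where `count_bit_flips` is a hypothetical helper that compares the buffer from the first read against a later read of the same page or block.

```c
#include <stddef.h>
#include <stdint.h>

/* Count the bits that differ between a reference read and a later read. */
static size_t count_bit_flips(const uint8_t *ref, const uint8_t *cur,
                              size_t len)
{
    size_t flips = 0;

    for (size_t i = 0; i < len; i++) {
        uint8_t diff = ref[i] ^ cur[i];

        while (diff) {
            diff &= (uint8_t)(diff - 1); /* clear the lowest set bit */
            flips++;
        }
    }
    return flips;
}
```

A test loop would keep the first read as the reference and report any iteration where the count is nonzero. Note that a comparison like this only sees flips that survive ECC correction if the reads go through the normal ECC path.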
====
Separately, I set up 2 test devices to run while I was away last week.
One of them contained 2 patches:
- Mike Hench's patch which eliminates this block of code in fsl_elbc_nand.c
- Adam Thomson's patch
(http://lists.infradead.org/pipermail/linux-mtd/2011-June/036427.html)
which initializes oob_poi correctly
Upon my return, the device with these patches saw no problems at all,
and had no additional bad blocks. The device without these patches
had some 200+ blocks which had been newly marked as bad in the BBT
over the course of 10 days. After rebooting, this latter device then
failed to boot, as shown here:
http://mcreech.com/work/bbt-ecc-error4.txt
I'm currently running another test to verify which of the two patches
actually fixed this problem (which might take a few days), but it
seems like removing that block of code in fsl_elbc_nand.c is a good
idea.
--
Matthew L. Creech
* Re: NAND BBT corruption on MPC83xx
2011-07-05 19:58 ` Matthew L. Creech
@ 2011-07-11 15:30 ` Matthew L. Creech
0 siblings, 0 replies; 10+ messages in thread
From: Matthew L. Creech @ 2011-07-11 15:30 UTC (permalink / raw)
To: Scott Wood; +Cc: linux-mtd, linuxppc-dev, rick22, Mike Hench
On Tue, Jul 5, 2011 at 3:58 PM, Matthew L. Creech <mlcreech@gmail.com> wrote:
>
> Separately, I set up 2 test devices to run while I was away last week.
> One of them contained 2 patches:
>
> - Mike Hench's patch which eliminates this block of code in fsl_elbc_nand.c
> - Adam Thomson's patch
> (http://lists.infradead.org/pipermail/linux-mtd/2011-June/036427.html)
> which initializes oob_poi correctly
>
> Upon my return, the device with these patches saw no problems at all,
> and had no additional bad blocks. The device without these patches
> had some 200+ blocks which had been newly marked as bad in the BBT
> over the course of 10 days. After rebooting, this latter device then
> failed to boot, as shown here:
>
> http://mcreech.com/work/bbt-ecc-error4.txt
>
> I'm currently running another test to verify which of the two patches
> actually fixed this problem (which might take a few days), but it
> seems like removing that block of code in fsl_elbc_nand.c is a good
> idea.
>
Just an update: my tests confirmed that the patch to fsl_elbc_nand.c
(http://lists.infradead.org/pipermail/linux-mtd/2011-July/036893.html)
seems to have fixed these BBT corruption problems.
I ran a torture test on 2 devices for several days: the one which had
only that patch had no further issues, while the one which didn't have
it (but did have the other oob_poi patch from Adam) experienced BBT
corruption.
Thanks everyone
--
Matthew L. Creech