* Questions about NAND (double)bit errors
From: Wolfgang Mües @ 2006-02-02 11:12 UTC
To: linux-mtd

Hello,

I want to use JFFS2/MTD in an embedded Linux device with frequent writes (worst case 15 KBytes per 10 seconds; the typical case is less than 10% of the worst case). The device will be a 512 MBit SLC NAND from Hynix, Samsung or STM. We have a working prototype, and we have read many of the NAND flash papers available on the net, as well as the recent MTD mailing list archives.

Besides wear-leveling questions, there are program disturb errors (programming a page flips a bit in another page) and read disturb errors (reading a page flips a bit). Rates for these single-bit errors are available in publications from M-Systems and Toshiba.

But since single-bit errors are easily corrected by ECC, I am more interested in errors where more than 1 bit is flipped in a 256-byte ECC area. We cannot calculate these error numbers from the single-bit error rates because we don't know whether these errors are independent of each other.

Is there any information available to estimate or calculate the remaining errors after ECC correction? Or is there any first-hand experience from NAND stress tests or other real-world use?

The NAND project may be terminated if I can't find anything about practical reliability...

best regards
Wolfgang Muees
-- 
Wolfgang Muees, Hardware Development
Auerswald GmbH & Co. KG
Vor den Grashoefen 1, D-38162 Cremlingen, Germany
Tel +49 5306 9219 0, Fax +49 5306 9219 94
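For reference, the calculation that the message above says cannot be trusted would look roughly like the sketch below. It ASSUMES that bit flips are independent, which is exactly the open question, so the result is at best an optimistic baseline. The function name and the raw bit error rates are hypothetical placeholders, not figures from any vendor paper, and the ECC bytes themselves are ignored.

```python
import math

# Data bits in one 256-byte ECC area (the ECC bytes themselves are ignored here).
ECC_REGION_BITS = 256 * 8

def p_multi_bit(p_bit: float, n_bits: int = ECC_REGION_BITS) -> float:
    """Probability of 2 or more flipped bits among n_bits.

    ASSUMPTION: every bit flips independently with probability p_bit.
    Correlated errors (the concern raised in the thread) would make the
    real figure higher than this estimate.
    """
    # P(at least one flip) = 1 - (1 - p)^n, via expm1/log1p to limit rounding error
    p_at_least_one = -math.expm1(n_bits * math.log1p(-p_bit))
    # P(exactly one flip) = n * p * (1 - p)^(n - 1); single-bit ECC corrects this case
    p_exactly_one = n_bits * p_bit * math.exp((n_bits - 1) * math.log1p(-p_bit))
    return max(0.0, p_at_least_one - p_exactly_one)

if __name__ == "__main__":
    for p in (1e-10, 1e-9, 1e-8):  # hypothetical raw bit error rates
        print(f"p_bit={p:.0e}  P(>=2 flips per ECC area)={p_multi_bit(p):.3e}")
```

For small error rates the result is close to C(n, 2) * p^2, so under the independence assumption the uncorrectable rate falls with the square of the raw bit error rate.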
* Re: Questions about NAND (double)bit errors
From: Charles Manning @ 2006-02-08 22:26 UTC
To: linux-mtd; +Cc: Wolfgang Mües

On Friday 03 February 2006 00:12, Wolfgang Mües wrote:
> Hello,
>
> I want to use JFFS2/MTD in an embedded Linux device with frequent
> writes (worst case is 15 KBytes per 10 seconds, typical case is less than
> 10% of the worst case). The device will be a 512 MBit NAND SLC type from
> Hynix, Samsung or STM. We have a working prototype, and we have read many
> NAND flash papers available on the net, and the recent MTD mailing list
> archives.
>
> Beside of wear leveling questions, there are program disturb errors
> (programming a page flips a bit in another page) and read disturb errors
> (reading a page flips a bit). Rates for these single-bit-errors are
> available in publications from M-systems and Toshiba.
>
> But since single bit errors are easily corrected by ECC, I am more
> interested in errors where more than 1 bit is flipped in a 256 byte ECC
> area. We cannot calculate these error numbers from the single bit errors
> because we don't know if these errors are unrelated to each other.

If you have not already done so, read the Toshiba NAND flash application guide:
http://www.dataio.com/pdf/NAND/Toshiba/NandDesignGuide.pdf.pdf
That might give some further info.

> Is there any information available to estimate/calculate the remaining
> errors after ECC correction? Or is there any information about first hand
> experience of NAND stress tests or other real world experience?
>
> Maybe the NAND project is terminated if I don't find anything about
> practical reliability...

I have not used JFFS2, but I have done extensive testing with YAFFS. At the NAND level they should be about the same.

I have done a few accelerated lifetime tests that have gone very well. In one test (run once on 512-byte page devices and once on 2k page devices) I wrote, read back and verified over 120 Gbytes of data to the fs without a single bit getting lost. Other people did similar tests too. This was on non-Linux devices, but that's not material at the NAND level.

From my observations NAND is very reliable and is getting more reliable all the time.

There are at least two factors that might be different for JFFS2 vs YAFFS:

* Most flash reliability is specified based on an assumption that you perform a maximum number of writes per page. I don't know what JFFS2 does, but YAFFS does one major write and then writes a single-byte deletion marker to the OOB area when the page is discarded. YAFFS2 does not write deletion markers. This is generally well within the write limits used for the specification, so the flash should be less stressed than was assumed when deriving the specs. JFFS2 might be different here.

* YAFFS is very conservative in dealing with ECC failures. YAFFS retires a block if one ECC failure is seen. JFFS2, IIRC, allows five or so failures before retiring a block. The Toshiba folk have told me that if a block is going bad, it is most likely to start displaying recoverable 1-bit errors before displaying non-recoverable multi-bit errors. Thus, YAFFS will potentially perform differently in this area.

Still, I think those reliability differences, at the flash level, are more than likely theoretical noise and are unlikely to be material in the real world.

One important factor, IMHO, is how you handle the write protect pin on the NAND. Some people tie the WP to the power supply failure flag. IMHO this is a bad thing to do since it can cause incomplete writes to happen if the WP is asserted during a write or erase cycle.

-- Charles
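The difference in retirement policy described above can be illustrated with a small sketch. This is not code from YAFFS or JFFS2; the class, its method names and the block numbers are hypothetical, and the threshold of 5 is only the "five or so" value recalled in the message.

```python
from collections import Counter

class BlockRetirementPolicy:
    """Retire a block once it has shown `threshold` ECC events (hypothetical helper)."""

    def __init__(self, threshold: int):
        if threshold < 1:
            raise ValueError("threshold must be at least 1")
        self.threshold = threshold
        self.ecc_events = Counter()  # block number -> ECC events seen so far
        self.retired = set()         # block numbers no longer used for new data

    def record_ecc_event(self, block: int) -> bool:
        """Record one ECC event on `block`; return True if the block is now retired."""
        if block in self.retired:
            return True
        self.ecc_events[block] += 1
        if self.ecc_events[block] >= self.threshold:
            self.retired.add(block)
        return block in self.retired

# "Retire on the first ECC failure", as described for YAFFS in the message.
conservative = BlockRetirementPolicy(threshold=1)
# "Five or so failures", the value recalled (IIRC) for JFFS2; treat it as an assumption.
tolerant = BlockRetirementPolicy(threshold=5)
```

If a failing block really does show correctable 1-bit errors before uncorrectable ones, the lower threshold takes the block out of service earlier, at the cost of giving up blocks that only suffered a transient disturb.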
* Re: Questions about NAND (double)bit errors
From: Wolfgang Mües @ 2006-02-10 8:28 UTC
To: linux-mtd

Hello Charles,

thank you for sharing your experience...

You wrote:
> If you have not already done so, read the Toshiba NAND flash application
> guide:

Yes, I have.

> I have done a few accelerated lifetime tests that have gone very well. In
> one test (run once on 512byte page devices and once on 2k page devices) I
> wrote, read back and verified over 120Gbytes of data to the fs without a
> single bit getting lost.

Do you mean without a single error correction, or do you mean that ECC has done its job?

Regarding the 120 GBytes: how many times was each block written/erased? Have you reached the specified lifetime of the flash?

> * YAFFS is very conservative on dealing with ECC failures. YAFFS retires a
> block if one ECC failure is seen. JFFS2, IIRC allows five or so failures
> before retiring a block. The Toshiba folk have told me that if a block is
> going bad, it is most likely to start displaying recoverable 1-bit errors
> before displaying non-recoverable multi-bit errors.

This is valuable information that I have not found in other sources.

> Still, I think those reliability differences, at the flash level, are more
> than likely theoretical noise and are unlikely to be material in the real
> world.

Hmmmm... can you come and tell this to my boss ;-)

> One important factor, IMHO, is how you handle the write protect pin on the
> NAND. Some people tie the WP to the power supply failure flag. IMHO this is
> a bad thing to do since it can cause incomplete writes to happen if the wp
> is asserted during a write or erase cycle.

I will check this.

best regards
-- 
Wolfgang Muees
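The question about erase cycles can be framed with some back-of-the-envelope arithmetic. The sketch below ASSUMES ideal wear leveling and no write amplification, so real figures under JFFS2 or YAFFS garbage collection would be worse. The capacity of the devices used in the 120 Gbyte test is not stated in the thread; the 64 MiB used here is simply the 512 MBit part from the first message. The rated endurance of 100,000 cycles is a placeholder that should be replaced by the datasheet value.

```python
def average_erase_cycles(bytes_written: float, device_bytes: float) -> float:
    """Average erase cycles per block, ASSUMING ideal wear leveling and no write amplification."""
    return bytes_written / device_bytes

def years_to_endurance(bytes_per_day: float, device_bytes: float, rated_cycles: float) -> float:
    """Years until the average block reaches rated_cycles under the same assumptions."""
    cycles_per_day = average_erase_cycles(bytes_per_day, device_bytes)
    return rated_cycles / cycles_per_day / 365.0

DEVICE_BYTES = 512 * 2**20 // 8                  # 512 MBit device = 64 MiB
TEST_BYTES = 120 * 2**30                         # "over 120 Gbytes", read here as GiB
WORST_CASE_PER_DAY = 15 * 1024 * (86400 // 10)   # 15 KBytes every 10 seconds
RATED_CYCLES = 100_000                           # ASSUMPTION: placeholder, check the datasheet

if __name__ == "__main__":
    print("cycles per block in the test:", average_erase_cycles(TEST_BYTES, DEVICE_BYTES))
    print("years at worst-case write rate:",
          round(years_to_endurance(WORST_CASE_PER_DAY, DEVICE_BYTES, RATED_CYCLES), 1))
```

Under these assumptions, writing 120 GiB to a 64 MiB device corresponds to about 1920 erase cycles per block, which is far below a rated endurance of that order, so such a test would say little about end-of-life behaviour.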
* Re: Questions about NAND (double)bit errors
From: Wolfgang Mües @ 2006-02-14 14:10 UTC
To: linux-mtd

Hello Charles,

Charles Manning wrote:
> * YAFFS is very conservative on dealing with ECC failures. YAFFS retires a
> block if one ECC failure is seen. JFFS2, IIRC allows five or so failures
> before retiring a block. The Toshiba folk have told me that if a block is
> going bad, it is most likely to start displaying recoverable 1-bit errors
> before displaying non-recoverable multi-bit errors. Thus, YAFFS will
> potentially perform differently in this area.

About bad block detection: what is your opinion about partitioning the flash (the programs in a read-only partition, the data in r/w)?

How about detection of ECC errors in read-only partitions?

> One important factor, IMHO, is how you handle the write protect pin on the
> NAND. Some people tie the WP to the power supply failure flag. IMHO this is
> a bad thing to do since it can cause incomplete writes to happen if the wp
> is asserted during a write or erase cycle.

I have checked this.

WP is tied to VCC, and VCC is stable for at least 500 ms after a power-fail detect.

best regards
-- 
Wolfgang Muees
* Re: Questions about NAND (double)bit errors
From: Charles Manning @ 2006-02-16 3:17 UTC
To: linux-mtd; +Cc: Wolfgang Mües

On Wednesday 15 February 2006 03:10, Wolfgang Mües wrote:
> Hello Charles,
>
> Charles Manning wrote:
> > * YAFFS is very conservative on dealing with ECC failures. YAFFS retires
> > a block if one ECC failure is seen. JFFS2, IIRC allows five or so failures
> > before retiring a block. The Toshiba folk have told me that if a block is
> > going bad, it is most likely to start displaying recoverable 1-bit errors
> > before displaying non-recoverable multi-bit errors. Thus, YAFFS will
> > potentially perform differently in this area.
>
> About bad block detection: what is your opinion about partitioning the
> flash (the programs in a read-only partition, the data in r/w).

This gets fs specific. With YAFFS (and I assume JFFS2, but consult an expert), garbage collection will force read-only files to get rewritten occasionally. Thus for ultimate reliability it is probably a GoodIdea to separate the read-only stuff into a separate partition. This is also a GoodIdea in that a smaller partition mounts faster (true for YAFFS and JFFS2). So if all your kernel + mount stuff is separated from your rw stuff, things will probably go better.

> How about detection of ECC errors in read only partitions?

ECC should be done on both rw and read-only partitions. Sometimes NAND gets read disturbs, which would affect read-only partitions.

Also, write disturbs from writes to one partition can still corrupt a read-only partition on the same chip.

> > One important factor, IMHO, is how you handle the write protect pin on
> > the NAND. Some people tie the WP to the power supply failure flag. IMHO
> > this is a bad thing to do since it can cause incomplete writes to happen
> > if the wp is asserted during a write or erase cycle.
>
> I have checked this.
>
> WP is tied to VCC, and VCC is stable at least 500ms after a power fail
> detect.

500ms is long enough to grow a beard.

There's been some interesting discussion over in yaffs-land on this. If you don't subscribe to the yaffs list then you can catch up on the yaffs archive.

-- Charles
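Since ECC only reports an error when a page is actually read, one way to catch read disturbs in a read-only partition is a periodic scrub pass. The sketch below is an assumption about how that could be organised, not a description of what YAFFS, JFFS2 or MTD do. The `read_page` callback, the `UncorrectableError` exception and the `scrub` function are all hypothetical names.

```python
from typing import Callable, Iterable, List

class UncorrectableError(Exception):
    """Raised by the hypothetical read_page callback when ECC cannot repair a page."""

def scrub(blocks: Iterable[int], pages_per_block: int,
          read_page: Callable[[int, int], int]) -> List[int]:
    """Read every page and return the blocks that showed any ECC event.

    read_page(block, page) is a HYPOTHETICAL callback that returns the number
    of bits ECC corrected on that page, or raises UncorrectableError.
    """
    flagged_blocks = []
    for block in blocks:
        for page in range(pages_per_block):
            try:
                flagged = read_page(block, page) > 0
            except UncorrectableError:
                flagged = True  # data is already damaged; still report the block
            if flagged:
                flagged_blocks.append(block)
                break  # one event is enough to flag the whole block
    return flagged_blocks
```

A caller would then copy the contents of each flagged block elsewhere and erase or retire the original while the errors are still correctable.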
* Re: Questions about NAND (double)bit errors
From: Wolfgang Mües @ 2006-02-16 8:30 UTC
To: linux-mtd

Hello Charles,

Charles Manning wrote:
> > About bad block detection: what is your opinion about partitioning the
> > flash (the programs in a read-only partition, the data in r/w).
>
> This gets fs specific. With YAFFS (and I assume JFFS2, but consult an
> expert), garbage collection will force read-only files to get rewritten
> occasionally. Thus for ultimate reliability it is probably a GoodIdea to
> separate the read-only stuff into a separate partition. This is also a
> GoodIdea in that a smaller partition mounts faster (true for YAFFS and
> JFFS2). So if all your kernel + mount stuff is separated from your rw stuff
> things will probably go better.

OK.

> > How about detection of ECC errors in read only partitions?
>
> ECC should be done on both rw and read-only partitions. Sometimes NAND gets
> read disturbs which would impact on read-only partitions.

My real question was: does YAFFS regularly read all files in a R/O partition so that one-bit errors can be discovered? Without reading, you will never find them...

> Also, write disturbs from writes to one partition can still corrupt a
> read-only partition on the same chip.

Bad news. Are you sure about this? I know from the Toshiba paper that write disturb is limited to the scope of a block. I don't have any information from other vendors.

> There's been some interesting discussion over in yaffs-land on this. If you
> don't subscribe to yaffs list then you can catch up on the yaffs archive.

I have been reading the YAFFS mailing list for a year now. I am very impressed by your constant engagement for YAFFS and the community.

Regarding YAFFS2 and the mounting time / need for scanning the whole NAND: do you think it will be possible to separate the directory information from the file data? Scanning would then be:

- read the bad block marker and the "is directory information" bit
- if directory: scan it, building the data structures in RAM
- if data: you don't need it.

Obviously, this is only a benefit if reading the bad block marker is very much faster than scanning the whole block.

best regards
-- 
Wolfgang Muees
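The scanning scheme proposed above can be sketched as follows. This models the proposal only; it is not how YAFFS2 actually mounts. The `Block` structure, its `is_directory` flag and the `scan_for_mount` function are hypothetical, and the flash is simulated with plain Python objects.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class Block:
    """Simulated flash block; both flags stand in for markers read from the spare area."""
    bad: bool = False
    is_directory: bool = False  # HYPOTHETICAL "is directory information" bit
    entries: Dict[str, int] = field(default_factory=dict)  # name -> object id

def scan_for_mount(blocks: List[Block]) -> Tuple[Dict[str, int], int]:
    """Return (directory table built in RAM, number of blocks scanned in full)."""
    table: Dict[str, int] = {}
    fully_scanned = 0
    for blk in blocks:
        # Step 1: only the two markers are read for every block.
        if blk.bad:
            continue
        if not blk.is_directory:
            continue  # Step 3: data blocks are skipped at mount time.
        # Step 2: directory blocks are scanned in full.
        table.update(blk.entries)
        fully_scanned += 1
    return table, fully_scanned
```

The count of fully scanned blocks makes the trade-off visible: the saving is the ratio of skipped data blocks to all blocks, reduced by whatever it costs to read the markers of every block.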
* Re: Questions about NAND (double)bit errors
From: Jamie Lokier @ 2006-02-16 22:08 UTC
To: Charles Manning; +Cc: linux-mtd, Wolfgang Mües

Charles Manning wrote:
> > About bad block detection: what is your opinion about partitioning the
> > flash (the programs in a read-only partition, the data in r/w).
>
> This gets fs specific. With YAFFS (and I assume JFFS2, but consult
> an expert), garbage collection will force read-only files to get
> rewritten occasionally. Thus for ultimate reliability it is
> probably a GoodIdea to separate the read-only stuff into a separate
> partition. This is also a GoodIdea in that a smaller partition
> mounts faster (true for YAFFS and JFFS2). So if all your kernel +
> mount stuff is separated from your rw stuff things will probably go
> better.

Absolutely. I've been testing 40 devices lately, and in 2 weeks, 5 of the 40 have ended up with corrupted files in JFFS2, even though those files aren't being written.

I haven't seen any errors in the ROMFS partitions.

I'm still getting round to analysing the corrupt files / filesystems, because that failure rate is too high even for configuration files that are written from time to time.

These are 8MB chips, so presumably NOR.

-- Jamie