Questions about NAND (double)bit errors


* Questions about NAND (double)bit errors
@ 2006-02-02 11:12 Wolfgang Mües
  2006-02-08 22:26 ` Charles Manning
  0 siblings, 1 reply; 7+ messages in thread
From: Wolfgang Mües @ 2006-02-02 11:12 UTC (permalink / raw)
  To: linux-mtd

Hello,

I want to use JFFS2/MTD in an embedded Linux device with frequent
writes (worst case is 15 KBytes per 10 seconds, typical case is less than 10% 
of the worst case). The device will be a 512 MBit NAND SLC type from Hynix,
Samsung or STM. We have a working prototype, and we have read many NAND flash
papers available on the net, and the recent MTD mailing list archives.
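
For scale, the write load above can be turned into a rough wear figure. The sketch below is back-of-envelope arithmetic only. It assumes an endurance of 100,000 erase cycles per block, perfect wear leveling, and a write amplification factor that is simply guessed; none of these values come from this post, and the names are made up for illustration.

```python
# Back-of-envelope wear estimate for the write load described above.
# Assumptions (NOT from the post): 100,000 erase cycles per block,
# perfect wear leveling, and a guessed write amplification factor.

DEVICE_BYTES = 512 * 1024 * 1024 // 8      # 512 Mbit device = 64 MiB
WORST_CASE_BYTES_PER_SEC = 15 * 1024 / 10  # 15 KBytes every 10 seconds
ASSUMED_ERASE_CYCLES = 100_000             # assumed endurance per block
SECONDS_PER_DAY = 24 * 60 * 60


def lifetime_years(bytes_per_sec, write_amplification=1.0):
    """Years until every block has used up its assumed erase cycles."""
    total_writable_bytes = DEVICE_BYTES * ASSUMED_ERASE_CYCLES
    seconds = total_writable_bytes / (bytes_per_sec * write_amplification)
    return seconds / SECONDS_PER_DAY / 365.0


if __name__ == "__main__":
    # Write amplification from garbage collection is unknown here,
    # so a few guessed values are shown side by side.
    for wa in (1.0, 10.0, 50.0):
        years = lifetime_years(WORST_CASE_BYTES_PER_SEC, wa)
        print(f"write amplification {wa:>4}: {years:8.1f} years")
```

With no write amplification this comes out at well over a century, so under these assumptions wear-out alone is unlikely to be the limiting factor, even at the worst-case write rate.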

Besides wear leveling questions, there are program disturb errors
(programming a page flips a bit in another page) and read disturb errors 
(reading a page flips a bit). Rates for these single-bit-errors are available
in publications from M-systems and Toshiba. 

But since single bit errors are easily corrected by ECC, I am more interested 
in errors where more than 1 bit is flipped in a 256 byte ECC area. We cannot 
calculate these error numbers from the single bit errors because we don't 
know if these errors are unrelated to each other. 
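
For reference, this is what the calculation would look like if the bit flips were independent, which is exactly the assumption that cannot be verified here. The sketch sums the binomial probabilities for 2 or more flips among the 2048 bits of one 256 byte ECC area. The per-bit error probabilities are hypothetical placeholders, not vendor figures. If flips cluster within a page, the real rate will be higher than this baseline.

```python
import math

ECC_AREA_BITS = 256 * 8  # one 256-byte ECC area


def p_multi_bit(p_bit, n_bits=ECC_AREA_BITS):
    """P(2 or more flipped bits in one ECC area), ASSUMING independent flips.

    The binomial terms are evaluated in log space so that very small
    per-bit probabilities do not vanish in floating-point cancellation.
    """
    if not 0.0 < p_bit < 1.0:
        raise ValueError("p_bit must be strictly between 0 and 1")
    log_p = math.log(p_bit)
    log_q = math.log1p(-p_bit)
    total = 0.0
    for k in range(2, n_bits + 1):
        log_pmf = (math.lgamma(n_bits + 1) - math.lgamma(k + 1)
                   - math.lgamma(n_bits - k + 1)
                   + k * log_p + (n_bits - k) * log_q)
        total += math.exp(log_pmf)
    return total


if __name__ == "__main__":
    # Hypothetical per-bit error probabilities, for illustration only.
    for p in (1e-6, 1e-9, 1e-12):
        print(f"p_bit = {p:.0e}: P(>=2 flips per ECC area) = {p_multi_bit(p):.3e}")
```

For small p the result is dominated by the two-flip term, roughly (2048 * 2047 / 2) * p^2, so the uncorrectable rate scales with the square of the single-bit rate under this assumption.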

Is there any information available to estimate/calculate the remaining errors
after ECC correction? Or is there any information about first hand experience
of NAND stress tests or other real world experience?

The NAND project may be terminated if I don't find anything about practical
reliability...

best regards
Wolfgang Muees
-- 
Wolfgang Muees                    Vor den Grashoefen 1
Auerswald GmbH & Co. KG       	  D-38162 Cremlingen
Hardware Development              Germany
Tel +49 5306 9219 0               Fax +49 5306 9219 94


* Re: Questions about NAND (double)bit errors
  2006-02-02 11:12 Questions about NAND (double)bit errors Wolfgang Mües
@ 2006-02-08 22:26 ` Charles Manning
  2006-02-10  8:28   ` Wolfgang Mües
  2006-02-14 14:10   ` Wolfgang Mües
  0 siblings, 2 replies; 7+ messages in thread
From: Charles Manning @ 2006-02-08 22:26 UTC (permalink / raw)
  To: linux-mtd; +Cc: Wolfgang Mües

On Friday 03 February 2006 00:12, Wolfgang Mües wrote:
> Hello,
>
> I want to use JFFS2/MTD in an embedded Linux device with frequent
> writes (worst case is 15 KBytes per 10 seconds, typical case is less than
> 10% of the worst case). The device will be a 512 MBit NAND SLC type from
> Hynix, Samsung or STM. We have a working prototype, and we have read many
> NAND flash papers available on the net, and the recent MTD mailing list
> archives.
>
> Besides wear leveling questions, there are program disturb errors
> (programming a page flips a bit in another page) and read disturb errors
> (reading a page flips a bit). Rates for these single-bit-errors are
> available in publications from M-systems and Toshiba.
>
> But since single bit errors are easily corrected by ECC, I am more
> interested in errors where more than 1 bit is flipped in a 256 byte ECC
> area. We cannot calculate these error numbers from the single bit errors
> because we don't know if these errors are unrelated to each other.

If you have not already done so, read the Toshiba NAND flash application 
guide, which might give some further info:
http://www.dataio.com/pdf/NAND/Toshiba/NandDesignGuide.pdf.pdf

>
> Is there any information available to estimate/calculate the remaining
> errors after ECC correction? Or is there any information about first hand
> experience of NAND stress tests or other real world experience?
>
> Maybe the NAND project is terminated if I don't find anything about
> practical reliability...

I have not used JFFS2, but I have done extensive testing with YAFFS. At the 
NAND level they should be about the same.

I have done a few accelerated lifetime tests that have gone very well. In one 
test (run once on 512-byte page devices and once on 2k page devices) I wrote, 
read back and verified over 120 GBytes of data to the fs without a single bit 
getting lost. Other people did similar tests too. This was on non-Linux 
devices, but that's not material at the NAND level.

From my observations NAND is very reliable and is getting more reliable all 
the time.

There are at least two factors that might be different for JFFS2 vs YAFFS:
* Most flash reliability is specified based on an assumption that you perform 
a maximum number of writes per page. I don't know what JFFS2 does, but YAFFS 
does one major write and then writes a single byte deletion marker to the OOB 
area when the page is discarded. YAFFS2 does not write deletion markers. This 
is generally well within the write limits used for the specification, so the 
flash should be less stressed than was used to derive the specs. JFFS2 might 
be different here.
* YAFFS is very conservative on dealing with ECC failures. YAFFS retires a 
block if one ECC failure is seen. JFFS2, IIRC, allows five or so failures 
before retiring a block. The Toshiba folk have told me that if a block is 
going bad, it is most likely to start displaying recoverable 1-bit errors 
before displaying non-recoverable multi-bit errors. Thus, YAFFS will 
potentially perform differently in this area.
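
The difference between the two retirement policies can be sketched as a simple per-block counter. This is not code from YAFFS or JFFS2. The class and method names are hypothetical, and the thresholds 1 and 5 are only the figures quoted above (the JFFS2 one being from memory).

```python
from collections import defaultdict


class BlockRetirementPolicy:
    """Retire a block once it has shown `threshold` ECC events.

    Hypothetical illustration only; not taken from YAFFS or JFFS2.
    """

    def __init__(self, threshold):
        if threshold < 1:
            raise ValueError("threshold must be at least 1")
        self.threshold = threshold
        self.ecc_events = defaultdict(int)  # block number -> events seen
        self.retired = set()                # block numbers taken out of use

    def record_ecc_event(self, block):
        """Record one ECC event; return True if the block is now retired."""
        if block in self.retired:
            return True
        self.ecc_events[block] += 1
        if self.ecc_events[block] >= self.threshold:
            self.retired.add(block)
        return block in self.retired


if __name__ == "__main__":
    strict = BlockRetirementPolicy(threshold=1)   # "retire on first failure"
    lenient = BlockRetirementPolicy(threshold=5)  # "five or so failures"
    for event in range(1, 6):
        print(event, strict.record_ecc_event(7), lenient.record_ecc_event(7))
```

If a failing block really does show correctable 1-bit errors first, the stricter policy takes it out of service earlier, at the cost of retiring some blocks that only had a transient disturb.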

Still, I think those reliability differences, at the flash level, are more than 
likely theoretical noise and are unlikely to be material in the real world.

One important factor, IMHO, is how you handle the write protect pin on the 
NAND. Some people tie the WP to the power supply failure flag. IMHO this is a 
bad thing to do since it can cause incomplete writes to happen if WP is 
asserted during a write or erase cycle.

-- Charles


* Re: Questions about NAND (double)bit errors
  2006-02-08 22:26 ` Charles Manning
@ 2006-02-10  8:28   ` Wolfgang Mües
  2006-02-14 14:10   ` Wolfgang Mües
  1 sibling, 0 replies; 7+ messages in thread
From: Wolfgang Mües @ 2006-02-10  8:28 UTC (permalink / raw)
  To: linux-mtd

Hello Charles,

thank you for sharing your experience...

You wrote:
> If you have not already done so, read the Toshiba NAND flash application
> guide:

Yes, I have.

> I have done a few accelerated lifetime tests that have gone very well. In
> one test (run once on 512-byte page devices and once on 2k page devices) I
> wrote, read back and verified over 120 GBytes of data to the fs without a
> single bit getting lost.

You mean, without a single error correction? Or do you mean that ECC has done 
its job?

Regarding the 120 GBytes: How many times was each block written/erased? Have 
you reached the specified lifetime of the flash?

> * YAFFS is very conservative on dealing with ECC failures. YAFFS retires a
> block if one ECC failure is seen. JFFS2, IIRC, allows five or so failures
> before retiring a block. The Toshiba folk have told me that if a block is
> going bad, it is most likely to start displaying recoverable 1-bit errors
> before displaying non-recoverable multi-bit errors.

This is valuable information that I have not found in other sources.

> Still, I think those reliability differences, at the flash level, are more
> than likely theoretical noise and are unlikely to be material in the real
> world.

Hmmmm... can you come and tell this to my boss ;-)

> One important factor, IMHO, is how you handle the write protect pin on the
> NAND. Some people tie the WP to the power supply failure flag. IMHO this is
> a bad thing to do since it can cause incomplete writes to happen if the wp
> is asserted during a write or erase cycle.

I will check this.

best regards

-- 
Wolfgang Muees                    Vor den Grashoefen 1
Auerswald GmbH & Co. KG       	  D-38162 Cremlingen
Hardware Development              Germany
Tel +49 5306 9219 0               Fax +49 5306 9219 94


* Re: Questions about NAND (double)bit errors
  2006-02-08 22:26 ` Charles Manning
  2006-02-10  8:28   ` Wolfgang Mües
@ 2006-02-14 14:10   ` Wolfgang Mües
  2006-02-16  3:17     ` Charles Manning
  1 sibling, 1 reply; 7+ messages in thread
From: Wolfgang Mües @ 2006-02-14 14:10 UTC (permalink / raw)
  To: linux-mtd

Hello Charles,

Charles Manning wrote:
> * YAFFS is very conservative on dealing with ECC failures. YAFFS retires a
> block if one ECC failure is seen. JFFS2, IIRC, allows five or so failures
> before retiring a block. The Toshiba folk have told me that if a block is
> going bad, it is most likely to start displaying recoverable 1-bit errors
> before displaying non-recoverable multi-bit errors. Thus, YAFFS will
> potentially perform differently in this area.

About bad block detection: what is your opinion about partitioning the flash 
(the programs in a read-only partition, the data in r/w)?

How about detection of ECC errors in read only partitions?

> One important factor, IMHO, is how you handle the write protect pin on the
> NAND. Some people tie the WP to the power supply failure flag. IMHO this is
> a bad thing to do since it can cause incomplete writes to happen if the wp
> is asserted during a write or erase cycle.

I have checked this.

WP is tied to VCC, and VCC is stable at least 500ms after a power fail detect.

best regards
-- 
Wolfgang Muees                    Vor den Grashoefen 1
Auerswald GmbH & Co. KG       	  D-38162 Cremlingen
Hardware Development              Germany
Tel +49 5306 9219 0               Fax +49 5306 9219 94


* Re: Questions about NAND (double)bit errors
  2006-02-14 14:10   ` Wolfgang Mües
@ 2006-02-16  3:17     ` Charles Manning
  2006-02-16  8:30       ` Wolfgang Mües
  2006-02-16 22:08       ` Jamie Lokier
  0 siblings, 2 replies; 7+ messages in thread
From: Charles Manning @ 2006-02-16  3:17 UTC (permalink / raw)
  To: linux-mtd; +Cc: Wolfgang Mües

On Wednesday 15 February 2006 03:10, Wolfgang Mües wrote:
> Hello Charles,
>
> Charles Manning wrote:
> > * YAFFS is very conservative on dealing with ECC failures. YAFFS retires
> > a block if one ECC failure is seen. JFFS2, IIRC, allows five or so failures
> > before retiring a block. The Toshiba folk have told me that if a block is
> > going bad, it is most likely to start displaying recoverable 1-bit errors
> > before displaying non-recoverable multi-bit errors. Thus, YAFFS will
> > potentially perform differently in this area.
>
> About bad block detection: what is your opinion about partitioning the
> flash (the programs in a read-only partition, the data in r/w)?

This gets fs specific. With YAFFS (and I assume JFFS2, but consult an expert), 
garbage collection will force read-only files to get rewritten occasionally. 
Thus for ultimate reliability it is probably a GoodIdea to separate the 
read-only stuff into a separate partition. This is also a GoodIdea in that a 
smaller partition mounts faster (true for YAFFS and JFFS2). So if all your 
kernel + mount stuff is separated from your rw stuff things will probably go 
better.
>
> How about detection of ECC errors in read only partitions?

ECC should be done on both rw and read-only partitions. Sometimes NAND gets 
read disturbs which would impact on read-only partitions. Also, write 
disturbs from writes to one partition can still corrupt a read-only partition 
on the same chip.

>
> > One important factor, IMHO, is how you handle the write protect pin on
> > the NAND. Some people tie the WP to the power supply failure flag. IMHO
> > this is a bad thing to do since it can cause incomplete writes to happen
> > if the wp is asserted during a write or erase cycle.
>
> I have checked this.
>
> WP is tied to VCC, and VCC is stable at least 500ms after a power fail
> detect.

500ms is long enough to grow a beard.

There's been some interesting discussion over in yaffs-land on this. If you 
don't subscribe to yaffs list then you can catch up on the yaffs archive.

-- Charles


* Re: Questions about NAND (double)bit errors
  2006-02-16  3:17     ` Charles Manning
@ 2006-02-16  8:30       ` Wolfgang Mües
  2006-02-16 22:08       ` Jamie Lokier
  1 sibling, 0 replies; 7+ messages in thread
From: Wolfgang Mües @ 2006-02-16  8:30 UTC (permalink / raw)
  To: linux-mtd

Hello Charles,

Charles Manning wrote:

> > About bad block detection: what is your opinion about partitioning the
> > flash (the programs in a read-only partition, the data in r/w)?
>
> This gets fs specific. With YAFFS (and I assume JFFS2, but consult an
> expert), garbage collection will force read-only files to get rewritten
> occasionally. Thus for ultimate reliability it is probably a GoodIdea to
> separate the read-only stuff into a separate partition. This is also a
> GoodIdea in that a smaller partition mounts faster (true for YAFFS and
> JFFS2). So if all your kernel + mount stuff is separated from your rw stuff
> things will probably go better.

OK.

> > How about detection of ECC errors in read only partitions?
>
> ECC should be done on both rw and read-only partitions. Sometimes NAND gets
> read disturbs which would impact on read-only partitions.

My real question was: does YAFFS do regular reads of all files in a R/O 
partition so that one-bit errors can be discovered? Without reading, you will 
never find them...
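
Whatever the file system does internally, the idea of a periodic read pass can be sketched from user space. The function below simply reads every regular file under a mount point so that the ECC layer gets to see every page that holds file data. The function name is made up, and the sketch says nothing about what YAFFS itself does. Note also that reads served from the page cache never reach the flash, so a real scrubber would have to account for caching.

```python
import os


def scrub_tree(root, chunk_size=64 * 1024):
    """Read every regular file under `root` once.

    Returns (files_read, failed_paths). A path ends up in failed_paths
    when opening or reading it raises OSError, for example an I/O error
    reported by the driver after an uncorrectable ECC failure.
    """
    files_read = 0
    failed_paths = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks, device nodes, FIFOs and other non-regular files.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            try:
                with open(path, "rb") as handle:
                    while handle.read(chunk_size):
                        pass  # data is discarded; only the read matters
                files_read += 1
            except OSError:
                failed_paths.append(path)
    return files_read, failed_paths
```

Run from a periodic job against the read-only mount point, this only detects problems; correcting them would still require rewriting the affected data, which a read-only partition does not allow without remounting.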

> Also, write disturbs from writes to one partition can still corrupt a
> read-only partition on the same chip.

Bad news. Are you sure about this?
I know from the Toshiba paper that write disturb is limited to the scope of a 
block. For other vendors, I don't have any information.

> There's been some interesting discussion over in yaffs-land on this. If you
> don't subscribe to yaffs list then you can catch up on the yaffs archive.

I have been reading the YAFFS mailing list for a year now. I am very impressed 
by your constant engagement for YAFFS and the community.

Regarding YAFFS2 and the mount time / need for scanning the whole NAND:
Do you think it will be possible to separate the directory information from 
the file data? So scanning will be:
- read the bad block marker and the "is directory information bit"
- if directory: scan it, building the data structures in RAM
- if data: you don't need it.

Obviously, this is only a benefit if reading the bad block marker is very much 
faster than scanning the whole block.
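
The proposed scan can be sketched on a toy in-memory model. Everything here is hypothetical: the `Block` type, the `is_directory` flag and the entry list stand in for whatever would be stored next to the bad block marker, and none of it reflects the actual YAFFS2 on-flash layout.

```python
from dataclasses import dataclass, field


@dataclass
class Block:
    """Toy model of one NAND block (hypothetical, not the YAFFS2 layout)."""
    bad: bool = False             # bad block marker
    is_directory: bool = False    # the proposed "directory information" bit
    entries: list = field(default_factory=list)  # names stored in the block


def mount_scan(blocks):
    """Build a name -> block number index, fully scanning only directory blocks.

    Returns (index, full_scans), where full_scans counts the blocks whose
    contents had to be read beyond the marker and the directory bit.
    """
    index = {}
    full_scans = 0
    for number, block in enumerate(blocks):
        if block.bad:
            continue              # bad block: skip
        if not block.is_directory:
            continue              # data block: not needed at mount time
        full_scans += 1           # directory block: scan it into RAM
        for name in block.entries:
            index[name] = number
    return index, full_scans


if __name__ == "__main__":
    flash = [
        Block(is_directory=True, entries=["etc", "passwd"]),
        Block(),                                   # data block
        Block(bad=True),                           # bad block
        Block(is_directory=True, entries=["log"]),
        Block(),                                   # data block
    ]
    index, full_scans = mount_scan(flash)
    print(index, f"{full_scans} of {len(flash)} blocks fully scanned")
```

The saving is the ratio of directory blocks to all blocks, and it only materializes if the marker and the directory bit can be read much more cheaply than the block contents.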

best regards
-- 
Wolfgang Muees                    Vor den Grashoefen 1
Auerswald GmbH & Co. KG       	  D-38162 Cremlingen
Hardware Development              Germany
Tel +49 5306 9219 0               Fax +49 5306 9219 94


* Re: Questions about NAND (double)bit errors
  2006-02-16  3:17     ` Charles Manning
  2006-02-16  8:30       ` Wolfgang Mües
@ 2006-02-16 22:08       ` Jamie Lokier
  1 sibling, 0 replies; 7+ messages in thread
From: Jamie Lokier @ 2006-02-16 22:08 UTC (permalink / raw)
  To: Charles Manning; +Cc: linux-mtd, Wolfgang Mües

Charles Manning wrote:
> > About bad block detection: what is your opinion about partitioning the
> > flash (the programs in a read-only partition, the data in r/w)?
> 
> This gets fs specific. With YAFFS (and I assume JFFS2, but consult
> an expert), garbage collection will force read-only files to get
> rewritten occasionally.  Thus for ultimate reliability it is
> probably a GoodIdea to separate the read-only stuff into a separate
> partition. This is also a GoodIdea in that a smaller partition
> mounts faster (true for YAFFS and JFFS2). So if all your kernel +
> mount stuff is separated from your rw stuff things will probably go
> better.

Absolutely.

I've been testing 40 devices lately, and in 2 weeks, 5 of them (out of 40)
have corrupted files in JFFS2 when those files aren't being written.
I haven't seen any errors in the ROMFS partitions.

I'm still getting round to analysing the corrupt files / filesystems,
because that failure rate is too high even for configuration files
that are written from time to time.

These are 8MB chips, so presumably NOR.

-- Jamie

