Mean Time Between Failure - UBI clarifications

public inbox for linux-mtd@lists.infradead.org

* Mean Time Between Failure - UBI clarifications
@ 2011-02-24 12:39 Navaneethan P
  2011-02-24 14:04 ` Ricard Wanderlof
  0 siblings, 1 reply; 7+ messages in thread
From: Navaneethan P @ 2011-02-24 12:39 UTC
  To: linux-mtd

Hi Linux-mtd users,

In our product, we are using 128MB of NAND Flash (Samsung / Micron).
The whole NAND flash is configured as a single MTD partition. We are
using UBI over the MTD partition.

With this input, we wanted to calculate the Mean Time Between Failures
(MTBF) of our product. In this context,

1) We wanted to term 'bitflip' as a failure. Is our understanding
correct or should we only consider a bad block as a failure?

2) Is there any standard way to find out the number of bitflips from
the UBI? If no, is it suggested to modify the UBI subsystem of the
Linux kernel to get the bit flip counter?

3) Is there any standard software / approach which can be used to find
out the reliability / MTBF / MTTF (Mean Time To Failure) of our NAND
Flash?

Could someone clarify in this regard?

Thanks in advance.

Regards,
Navaneethan

* Re: Mean Time Between Failure - UBI clarifications
  2011-02-24 12:39 Mean Time Between Failure - UBI clarifications Navaneethan P
@ 2011-02-24 14:04 ` Ricard Wanderlof
  2011-02-25 12:36   ` Artem Bityutskiy
  2011-02-28 15:22   ` Navaneethan P
  0 siblings, 2 replies; 7+ messages in thread
From: Ricard Wanderlof @ 2011-02-24 14:04 UTC
  To: Navaneethan P; +Cc: linux-mtd@lists.infradead.org

On Thu, 24 Feb 2011, Navaneethan P wrote:

> Hi Linux-mtd users,
>
>
> In our product, we are using 128MB of NAND Flash (Samsung / Micron).
> The whole NAND flash is configured as a single MTD partition. We are
> using UBI over the MTD partition.
>
> With this input, we wanted to calculate the Mean Time Between Failures
> (MTBF) of our product. In this context,
>
> 1) We wanted to term 'bitflip' as a failure. Is our understanding
> correct or should we only consider a bad block as a failure?

I'd say it's a failure in the sense that the raw data from the flash is 
not what you expect, but UBI handles this transparently so it's not a 
failure from the user's point of view. Furthermore, bitflips are inherent 
to the design of nand flashes, and it does not indicate that there is 
actually anything abnormal about a particular bit.

A bad block is more of a failure in that it can contain bits which are 
unreliable, or stuck at a particular bit level. At least this is the case 
for blocks that have been detected bad at the factory and marked as such, 
but they are not really part of the equation since they should not be used 
anyway.

The ordinary way for a block to 'fail' is when the number of erase/write 
cycles performed on the block causes it to physically wear out. A worn-out 
block has lower data retention (i.e. larger susceptibility to bitflips) 
than other blocks. Usually if an erase or write operation times out (i.e. 
the on-chip erase/write algorithm on the flash times out before the 
operation is completed, and indicates a failure status to the host) the 
block is considered 'bad'. However, note that it is not necessarily an 
either-or situation. The block might not suddenly go dead. Instead, its 
data retention characteristics and erase/write cycle times can get worse 
and worse as the block is erased and rewritten. At some time, the on-chip 
algorithm on the flash signals that erase or write took too long, but the 
characteristics of the block might be far below spec before then.

It's up to you as a user to decide when the block is 'bad' in this case.

> 2) Is there any standard way to find out the number of bitflips from
> the UBI? If no, is it suggested to modify the UBI subsystem of the
> Linux kernel to get the bit flip counter?

mtd supplies statistics counters that might help. For each mtd partition 
there is one counter which is increased every time a read operation 
requires ECC to correct a bit (i.e. a correctable single bit error), and 
one counter for ECC failures (two-bit errors).

I don't know about UBI, someone else probably does.

> 3) Is there any standard software / approach which can be used to find 
> out the reliability / MTBF / MTTF (Mean Time To Failure) of our NAND 
> Flash?

The manufacturers provide some data, however my experience has been that 
it is very difficult to get any form of reliability information.

One way would be to take the spec of the number of erase/write cycles that 
the flash can handle (probably 100 000 for your flash), and calculate how 
much data will be written to the flash over a certain amount of time. When 
you reach 100 000 writes to any given block it can constitute a failure.
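
As a rough sketch of that calculation in Python: the function below 
assumes ideal wear leveling (every block absorbs an equal share of the 
writes), which UBI only approximates. The function name, the 
write_amplification parameter and the 1 GiB/day workload are illustrative 
assumptions, not figures from this thread; the 100 000 cycle rating is the 
guess given above and should be checked against the datasheet.

```python
def years_until_wearout(flash_bytes, endurance_cycles, bytes_written_per_day,
                        write_amplification=1.0):
    """Estimate years until the average block reaches its erase-cycle rating.

    Assumes ideal wear leveling, so writes are spread evenly over all blocks.
    """
    if bytes_written_per_day <= 0:
        raise ValueError("bytes_written_per_day must be positive")
    # Total bytes the device can absorb before every block hits its rating.
    total_writable = flash_bytes * endurance_cycles
    # Write amplification (>= 1) models extra flash writes per byte of user data.
    flash_bytes_per_day = bytes_written_per_day * write_amplification
    return total_writable / flash_bytes_per_day / 365.0


MiB = 1024 * 1024
# 128 MiB flash, 100 000 cycles (guessed rating), hypothetical 1 GiB/day workload.
estimate = years_until_wearout(128 * MiB, 100_000, 1024 * MiB)
print(f"{estimate:.1f} years until the rated erase count is reached")
```

Real wear leveling is imperfect and small writes cost whole-page or 
whole-block operations, so a write_amplification well above 1 is probably 
the more realistic input.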

/Ricard
-- 
Ricard Wolf Wanderlöf                           ricardw(at)axis.com
Axis Communications AB, Lund, Sweden            www.axis.com
Phone +46 46 272 2016                           Fax +46 46 13 61 30

* Re: Mean Time Between Failure - UBI clarifications
  2011-02-24 14:04 ` Ricard Wanderlof
@ 2011-02-25 12:36   ` Artem Bityutskiy
  2011-02-25 12:40     ` Ricard Wanderlof
  2011-02-28 15:22   ` Navaneethan P
  1 sibling, 1 reply; 7+ messages in thread
From: Artem Bityutskiy @ 2011-02-25 12:36 UTC
  To: Ricard Wanderlof; +Cc: Navaneethan P, linux-mtd@lists.infradead.org

On Thu, 2011-02-24 at 15:04 +0100, Ricard Wanderlof wrote:
> I don't know about UBI, someone else probably does.

No, UBI does not collect bit-flips statistics. It could be changed
though, someone could send a patch.

But I agree that bit-flips are the wrong metric for MTBF.

-- 
Best Regards,
Artem Bityutskiy (Артём Битюцкий)

* Re: Mean Time Between Failure - UBI clarifications
  2011-02-25 12:36   ` Artem Bityutskiy
@ 2011-02-25 12:40     ` Ricard Wanderlof
  2011-02-25 12:45       ` Artem Bityutskiy
  0 siblings, 1 reply; 7+ messages in thread
From: Ricard Wanderlof @ 2011-02-25 12:40 UTC
  To: Artem Bityutskiy; +Cc: linux-mtd@lists.infradead.org


On Fri, 25 Feb 2011, Artem Bityutskiy wrote:

> On Thu, 2011-02-24 at 15:04 +0100, Ricard Wanderlof wrote:
>> I don't know about UBI, someone else probably does.
>
> No, UBI does not collect bit-flips statistics. It could be changed
> though, someone could send a patch.

But don't bit flips eventually trigger bit scrubbing in UBI?

/Ricard
-- 
Ricard Wolf Wanderlöf                           ricardw(at)axis.com
Axis Communications AB, Lund, Sweden            www.axis.com
Phone +46 46 272 2016                           Fax +46 46 13 61 30

* Re: Mean Time Between Failure - UBI clarifications
  2011-02-25 12:40     ` Ricard Wanderlof
@ 2011-02-25 12:45       ` Artem Bityutskiy
  0 siblings, 0 replies; 7+ messages in thread
From: Artem Bityutskiy @ 2011-02-25 12:45 UTC
  To: Ricard Wanderlof; +Cc: linux-mtd@lists.infradead.org

On Fri, 2011-02-25 at 13:40 +0100, Ricard Wanderlof wrote:
> On Fri, 25 Feb 2011, Artem Bityutskiy wrote:
> 
> > On Thu, 2011-02-24 at 15:04 +0100, Ricard Wanderlof wrote:
> >> I don't know about UBI, someone else probably does.
> >
> > No, UBI does not collect bit-flips statistics. It could be changed
> > though, someone could send a patch.
> 
> But don't bit flips eventually trigger bit scrubbing in UBI?

Yes, they do, but after scrubbing we forget about them.

But we could store the bit-flips counter in UBI headers, so "paranoid"
users could get per-eraseblock bit-flips count.

-- 
Best Regards,
Artem Bityutskiy (Артём Битюцкий)

* Re: Mean Time Between Failure - UBI clarifications
  2011-02-24 14:04 ` Ricard Wanderlof
  2011-02-25 12:36   ` Artem Bityutskiy
@ 2011-02-28 15:22   ` Navaneethan P
  2011-02-28 15:32     ` Ricard Wanderlof
  1 sibling, 1 reply; 7+ messages in thread
From: Navaneethan P @ 2011-02-28 15:22 UTC
  To: Ricard Wanderlof
  Cc: Stefan.Bigler, navaneethan.p, linux-mtd@lists.infradead.org,
	Sundararajan.Somasundaram, Shenbaga.Nathan

Thank you for your response.

On Thu, Feb 24, 2011 at 7:34 PM, Ricard Wanderlof
<ricard.wanderlof@axis.com> wrote:
> mtd supplies statistics counters that might help. For each mtd partition
> there is one counter which is increased every time a read operation requires
> ECC to correct a bit (i.e. a correctable single bit error), and one counter
> for ECC failures (two-bit errors).

I checked the mtd directory of sysfs and also the source code of mtd.
However I could not find the mtd statistics counter from the available
source mtd-2.6.28. Could you please guide me as to where the
statistics counters are implemented inside mtd source.

Thanks and Regards,
Navaneethan P

* Re: Mean Time Between Failure - UBI clarifications
  2011-02-28 15:22   ` Navaneethan P
@ 2011-02-28 15:32     ` Ricard Wanderlof
  0 siblings, 0 replies; 7+ messages in thread
From: Ricard Wanderlof @ 2011-02-28 15:32 UTC
  To: Navaneethan P
  Cc: Stefan.Bigler@keymile.com, navaneethan.p@aricent.com,
	Ricard Wanderlöf, linux-mtd@lists.infradead.org,
	Sundararajan.Somasundaram@aricent.com,
	Shenbaga.Nathan@aricent.com


On Mon, 28 Feb 2011, Navaneethan P wrote:

> Thank you for your response.
>
> On Thu, Feb 24, 2011 at 7:34 PM, Ricard Wanderlof
> <ricard.wanderlof@axis.com> wrote:
>> mtd supplies statistics counters that might help. For each mtd partition
>> there is one counter which is increased every time a read operation requires
>> ECC to correct a bit (i.e. a correctable single bit error), and one counter
>> for ECC failures (two-bit errors).
>
> I checked the mtd directory of sysfs and also the source code of mtd.
> However I could not find the mtd statistics counter from the available
> source mtd-2.6.28. Could you please guide me as to where the
> statistics counters are implemented inside mtd source.

They're in mtd->ecc_stats. See drivers/mtd/nand/nand_base.c for 
instance. drivers/mtd/mtdchar.c:mtd_ioctl (case ECCGETSTATS) retrieves the 
statistics to userspace for the ECCGETSTATS ioctl.
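
A minimal userspace sketch of that ioctl in Python, for illustration only. 
It assumes the struct mtd_ecc_stats layout from <mtd/mtd-abi.h> (four __u32 
fields) and the asm-generic ioctl number encoding; some architectures (for 
example MIPS and PowerPC) encode ioctl numbers differently, so the constant 
should be checked against the target's headers. The device path /dev/mtd0 
and the helper names are assumptions.

```python
import fcntl
import struct

# struct mtd_ecc_stats from <mtd/mtd-abi.h>: four __u32 fields.
ECC_STATS_FORMAT = "=IIII"
ECC_STATS_FIELDS = ("corrected", "failed", "badblocks", "bbtblocks")


def ior(type_char, nr, size):
    """Build a read-direction ioctl number using the asm-generic layout."""
    ioc_read = 2  # assumption: asm-generic value, not valid on every arch
    return (ioc_read << 30) | (size << 16) | (ord(type_char) << 8) | nr


# ECCGETSTATS is defined as _IOR('M', 18, struct mtd_ecc_stats).
ECCGETSTATS = ior("M", 18, struct.calcsize(ECC_STATS_FORMAT))


def parse_ecc_stats(raw):
    """Turn the raw 16-byte structure into a dict keyed by field name."""
    return dict(zip(ECC_STATS_FIELDS, struct.unpack(ECC_STATS_FORMAT, raw)))


def read_ecc_stats(device="/dev/mtd0"):
    """Query the ECC counters of one MTD character device."""
    buf = bytearray(struct.calcsize(ECC_STATS_FORMAT))
    with open(device, "rb") as dev:
        fcntl.ioctl(dev, ECCGETSTATS, buf, True)  # kernel fills buf in place
    return parse_ecc_stats(bytes(buf))


if __name__ == "__main__":
    try:
        print(read_ecc_stats())
    except OSError as err:
        print(f"could not read ECC stats: {err}")
```

The counters are cumulative since the driver was loaded, so a monitoring 
tool would typically sample them periodically and look at the differences.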

/Ricard
-- 
Ricard Wolf Wanderlöf                           ricardw(at)axis.com
Axis Communications AB, Lund, Sweden            www.axis.com
Phone +46 46 272 2016                           Fax +46 46 13 61 30
