problem with ecc errors and ubifs

* problem with ecc errors and ubifs
@ 2013-10-22 11:52 Steffen Kühn
  2013-10-22 12:05 ` Ricard Wanderlof
  0 siblings, 1 reply; 5+ messages in thread
From: Steffen Kühn @ 2013-10-22 11:52 UTC
  To: linux-mtd@lists.infradead.org

Hey,

my question is about ECC errors and the way UBIFS deals with them.
Unfortunately, I have to go into some detail to make my problem clear.

We use MT29F4G08ABBDAHC flash devices from Micron. This type of flash
supports on-die ECC (8 bit) and needs at least 4-bit ECC. When we
started to use this flash, no support for on-die ECC was available in
the kernel (kernel 3.2), so we wrote the on-die ECC support ourselves.

In principle, everything works very well. But we use our hardware in
large numbers and quite intensively, and over the months we have
observed numerous destroyed file systems with various UBI errors. To
find the cause, I wrote a mechanism (in U-Boot) that creates bit
errors.

With that I ran different tests. One test was to create a single bit
error in the whole flash device. The on-die ECC mechanism (which can
correct up to 8 bit errors) had no problem correcting this error. The
kernel code contains a piece of code that reports the occurrence of
bit errors to the layers above. With this information, UBIFS can decide
whether it has to do anything, and what.
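The reporting step described above can be sketched roughly as follows. This is a simplified userspace model, not the self-written driver from this mail and not actual kernel code: `struct ecc_stats_model` and `report_page_read()` are hypothetical names. The only parts taken from Linux MTD conventions are the two error codes: -EUCLEAN signals "data was corrected, bitflips were seen" and -EBADMSG signals "uncorrectable ECC error".

```c
#include <errno.h>

#ifndef EUCLEAN
#define EUCLEAN 117 /* Linux-specific value; fallback for other platforms */
#endif

/* Hypothetical stand-in for the per-device ECC counters a driver keeps. */
struct ecc_stats_model {
    unsigned int corrected; /* total bitflips that ECC repaired */
    unsigned int failed;    /* page reads that ECC could not repair */
};

/*
 * Model of what a read path reports upward after one page read.
 * bitflips >= 0: number of bits the ECC corrected in this page.
 * bitflips <  0: the ECC gave up (uncorrectable page).
 * In this model ANY corrected bitflip is reported as -EUCLEAN,
 * which is the behaviour that makes the upper layers schedule scrubbing.
 */
static int report_page_read(struct ecc_stats_model *stats, int bitflips)
{
    if (bitflips < 0) {
        stats->failed++;
        return -EBADMSG;
    }
    if (bitflips > 0) {
        stats->corrected += (unsigned int)bitflips;
        return -EUCLEAN;
    }
    return 0;
}
```

Removing the error reporting, as discussed further down, corresponds in this model to always returning 0 for `bitflips >= 0`, so the layers above never learn that a page is degrading.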

I have seen that such error reporting usually leads to a page "scrub". I
do not really understand what happens there, but sometimes the result is
catastrophic. Because of that I removed the error reporting (my hope is
that 8 bit errors in one page occur seldom enough). After removing that
code, our problems vanished completely (I even tested with more than 8
bit errors in the same page => no problems). I could not provoke any
faults by creating numerous bit errors in dozens of pages.

What is your opinion? Have I overlooked something? I know that this
method has risks, but I hope that on balance the file system stays
alive longer.

Best
Steffen

* Re: problem with ecc errors and ubifs
  2013-10-22 11:52 problem with ecc errors and ubifs Steffen Kühn
@ 2013-10-22 12:05 ` Ricard Wanderlof
  2013-10-22 13:10   ` Steffen Kühn
  2013-10-23 16:17   ` Artem Bityutskiy
  0 siblings, 2 replies; 5+ messages in thread
From: Ricard Wanderlof @ 2013-10-22 12:05 UTC
  To: Steffen Kühn; +Cc: linux-mtd@lists.infradead.org

On Tue, 22 Oct 2013, Steffen Kühn wrote:

> We use MT29F4G08ABBDAHC flash devices from Micron. This type of flash
> supports on-die ECC (8 bit) and needs at least 4-bit ECC. When we
> started to use this flash, no support for on-die ECC was available in
> the kernel (kernel 3.2), so we wrote the on-die ECC support ourselves.
> ...
>
> I have seen that such error reporting usually leads to a page "scrub". I
> do not really understand what happens there.

I don't know the details, but scrubbing means that the data in the page 
is to be rewritten at another place in the flash, since the presence of a 
bit error indicates that the data in the page will eventually become 
unreliable.
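As a rough illustration of the idea (move the still-readable data elsewhere, then erase the old location), here is a toy model. It is only a sketch under stated assumptions: `struct flash_model`, `scrub_leb()` and the tiny sizes are hypothetical, and the model works on whole erase blocks with a single logical block, which is a simplification and not the actual UBI implementation.

```c
#include <string.h>

#define NUM_PEBS 4 /* physical erase blocks in the toy device */
#define PEB_SIZE 8 /* bytes per block, unrealistically small */

/* Hypothetical model: one logical block mapped onto one physical block. */
struct flash_model {
    unsigned char peb[NUM_PEBS][PEB_SIZE];
    int in_use[NUM_PEBS];
    int leb_to_peb; /* which physical block currently holds the data */
};

/*
 * Scrub the logical block: copy its (ECC-corrected) contents to a free
 * physical block, switch the mapping, then erase the old block so it can
 * be reused. Returns the new physical block number, or -1 if none is free.
 */
static int scrub_leb(struct flash_model *f)
{
    int old = f->leb_to_peb;
    int i;

    for (i = 0; i < NUM_PEBS; i++) {
        if (f->in_use[i])
            continue;
        memcpy(f->peb[i], f->peb[old], PEB_SIZE); /* rewrite elsewhere */
        f->in_use[i] = 1;
        f->leb_to_peb = i;                        /* remap */
        memset(f->peb[old], 0xFF, PEB_SIZE);      /* erased flash reads 0xFF */
        f->in_use[old] = 0;
        return i;
    }
    return -1;
}
```

The order matters in a real system: the old copy is only given up after the new copy is in place, so an interruption in between leaves at least one readable copy.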

The code as it looks today triggers scrubbing whenever a single bit 
correction is detected during read. The reason for this is that the 
classic Hamming algorithm can only handle one incorrect bit, so if another 
bit flips the data becomes unreadable.
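To make the single-bit limit concrete, here is a toy Hamming(7,4) code. It is a sketch for illustration only: real NAND software ECC works on much larger blocks, and the function names below are hypothetical. The point it shows is that one flipped bit is located and repaired via the syndrome, while two flipped bits make the decoder "correct" the wrong bit and return wrong data.

```c
#include <stdint.h>

/* XOR of the 1-based positions of all set bits; 0 for a valid codeword. */
static unsigned int hamming74_syndrome(uint8_t cw)
{
    unsigned int s = 0, pos;

    for (pos = 1; pos <= 7; pos++)
        if (cw & (1u << (pos - 1)))
            s ^= pos;
    return s;
}

/* Encode 4 data bits; data sits at positions 3, 5, 6, 7, parity at 1, 2, 4. */
static uint8_t hamming74_encode(uint8_t data)
{
    static const unsigned int data_pos[4] = {3, 5, 6, 7};
    uint8_t cw = 0;
    unsigned int s, p;
    int i;

    for (i = 0; i < 4; i++)
        if (data & (1u << i))
            cw |= (uint8_t)(1u << (data_pos[i] - 1));
    /* Set the parity bits so that the overall syndrome becomes 0. */
    s = hamming74_syndrome(cw);
    for (p = 1; p <= 4; p <<= 1)
        if (s & p)
            cw |= (uint8_t)(1u << (p - 1));
    return cw;
}

/* Decode: a non-zero syndrome is taken as the position of ONE bad bit. */
static uint8_t hamming74_decode(uint8_t cw)
{
    static const unsigned int data_pos[4] = {3, 5, 6, 7};
    unsigned int s = hamming74_syndrome(cw);
    uint8_t data = 0;
    int i;

    if (s)
        cw ^= (uint8_t)(1u << (s - 1)); /* flip the bit the syndrome names */
    for (i = 0; i < 4; i++)
        if (cw & (1u << (data_pos[i] - 1)))
            data |= (uint8_t)(1u << i);
    return data;
}
```

With two flipped bits the syndrome is the XOR of both positions, which points at a third, innocent bit, so the decoder makes things worse instead of better.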

I know there have been discussions about ECC that can correct more than
a single bit in a given area: rather than triggering scrubbing as soon
as a single bit goes bad, use a threshold mechanism, so that scrubbing
is triggered only when, say, half the maximum number of bits need
correcting (e.g. in your case when 4 bits need correcting). The reason
is that flashes which require multibit ECC tend to have bits here and
there that flip rather quickly after writing, so triggering on them
leads to undue scrubbing and hence wear on the flash.
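The threshold idea can be sketched in a few lines. This is a minimal model, assuming the "half the ECC strength" rule of thumb mentioned above; `half_strength_threshold()` and `status_with_threshold()` are hypothetical names, and only the -EUCLEAN code follows the Linux MTD convention for "bitflips seen, data corrected".

```c
#include <errno.h>

#ifndef EUCLEAN
#define EUCLEAN 117 /* Linux-specific value; fallback for other platforms */
#endif

/* Assumed policy: scrub once half of the correctable bits are used up. */
static unsigned int half_strength_threshold(unsigned int ecc_strength)
{
    unsigned int t = (ecc_strength + 1) / 2;

    return t ? t : 1; /* never 0, or every clean read would ask for a scrub */
}

/*
 * Status for a successfully corrected read: only ask for scrubbing
 * (-EUCLEAN) when the worst ECC step reached the threshold.
 */
static int status_with_threshold(unsigned int max_bitflips,
                                 unsigned int threshold)
{
    return max_bitflips >= threshold ? -EUCLEAN : 0;
}
```

For an 8-bit ECC this gives a threshold of 4: pages with 1 to 3 corrected bits are returned silently, and pages with 4 or more are flagged for scrubbing. A threshold of 1 reproduces the "scrub on any bitflip" behaviour.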

> But sometimes the result is catastrophic. Because of that I removed the
> error reporting (my hope is that 8 bit errors in one page occur seldom
> enough). After removing that code, our problems vanished completely (I
> even tested with more than 8 bit errors in the same page => no
> problems). I could not provoke any faults by creating numerous bit
> errors in dozens of pages.

Removing the 'corrected bit' reporting mechanism avoids scrubbing, and 
hence that code path is never executed. It seems a serious bug if the 
scrubbing mechanism doesn't work as intended, for whatever reason.

> What is your opinion? Have I overlooked something? I know that this
> method has risks, but I hope that on balance the file system stays
> alive longer.

The risk would be that eventually the number of errors would grow past the 
ECC capability and subsequently lead to unreadable data. If there indeed 
is a bug in the scrubbing mechanism, I agree it would be better to just 
correct the bits and hope not too many of them flip. But it would be 
better to try and fix the problem with scrubbing...

/Ricard
-- 
Ricard Wolf Wanderlöf                           ricardw(at)axis.com
Axis Communications AB, Lund, Sweden            www.axis.com
Phone +46 46 272 2016                           Fax +46 46 13 61 30

* Re: problem with ecc errors and ubifs
  2013-10-22 12:05 ` Ricard Wanderlof
@ 2013-10-22 13:10   ` Steffen Kühn
  2013-10-22 14:10     ` Ricard Wanderlof
  2013-10-23 16:17   ` Artem Bityutskiy
  1 sibling, 1 reply; 5+ messages in thread
From: Steffen Kühn @ 2013-10-22 13:10 UTC
  To: Ricard Wanderlof; +Cc: linux-mtd@lists.infradead.org

Dear Ricard,

thanks for your answer.

> The risk would be that eventually the number of errors would grow past
> the ECC capability and subsequently lead to unreadable data. If there
> indeed is a bug in the scrubbing mechanism, I agree it would be better
> to just correct the bits and hope not too many of them flip. But it
> would be better to try and fix the problem with scrubbing...

I agree with you that a fix of the scrubbing mechanism would be better.
Unfortunately, the problem is very hard to reproduce (and debug).

And perhaps it is already solved? My kernel is not the newest, but we
cannot change the version because we have some patches (related to
other hardware) which only work with kernel 3.2.

Do you know of any important bug fixes related to "scrubbing" from the
last few months?

Best
Steffen

* Re: problem with ecc errors and ubifs
  2013-10-22 13:10   ` Steffen Kühn
@ 2013-10-22 14:10     ` Ricard Wanderlof
  0 siblings, 0 replies; 5+ messages in thread
From: Ricard Wanderlof @ 2013-10-22 14:10 UTC
  To: Steffen Kühn; +Cc: linux-mtd@lists.infradead.org


On Tue, 22 Oct 2013, Steffen Kühn wrote:

> thanks for your answer.
>
>> The risk would be that eventually the number of errors would grow past
>> the ECC capability and subsequently lead to unreadable data. If there
>> indeed is a bug in the scrubbing mechanism, I agree it would be better
>> to just correct the bits and hope not too many of them flip. But it
>> would be better to try and fix the problem with scrubbing...
>
> I agree with you that a fix of the scrubbing mechanism would be better.
> Unfortunately, the problem is very hard to reproduce (and debug).

So it happens very seldom then?

> And perhaps it is already solved? My kernel is not the newest, but we
> cannot change the version because we have some patches (related to
> other hardware) which only work with kernel 3.2.
>
> Do you know of any important bug fixes related to "scrubbing" from the
> last few months?

Not offhand, but I don't keep track of all corrected bugs. A quick scan in 
the mailing list archives might be useful.

/Ricard
-- 
Ricard Wolf Wanderlöf                           ricardw(at)axis.com
Axis Communications AB, Lund, Sweden            www.axis.com
Phone +46 46 272 2016                           Fax +46 46 13 61 30

* Re: problem with ecc errors and ubifs
  2013-10-22 12:05 ` Ricard Wanderlof
  2013-10-22 13:10   ` Steffen Kühn
@ 2013-10-23 16:17   ` Artem Bityutskiy
  1 sibling, 0 replies; 5+ messages in thread
From: Artem Bityutskiy @ 2013-10-23 16:17 UTC
  To: Ricard Wanderlof; +Cc: Steffen Kühn, linux-mtd@lists.infradead.org

On Tue, 2013-10-22 at 14:05 +0200, Ricard Wanderlof wrote:
> I know there have been discussions about ECC that can correct more than
> a single bit in a given area: rather than triggering scrubbing as soon
> as a single bit goes bad, use a threshold mechanism, so that scrubbing
> is triggered only when, say, half the maximum number of bits need
> correcting (e.g. in your case when 4 bits need correcting). The reason
> is that flashes which require multibit ECC tend to have bits here and
> there that flip rather quickly after writing, so triggering on them
> leads to undue scrubbing and hence wear on the flash.

We have already had this in mainline kernels for a rather long time. See
the 'bitflip_threshold' field in the mtd_info data structure. There is a
good piece of documentation in
Documentation/ABI/testing/sysfs-class-mtd
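On a kernel that has this attribute, the current value can be read from sysfs. The helper below is a small sketch, not part of any kernel or mtd-utils API: `read_uint_attr()` is a hypothetical name, and the device name `mtd0` in the usage note is an assumption that depends on the partition layout.

```c
#include <stdio.h>

/*
 * Read one unsigned integer from a sysfs-style attribute file.
 * Returns 0 on success and stores the number in *value,
 * returns -1 if the file cannot be opened or does not hold a number.
 */
static int read_uint_attr(const char *path, unsigned int *value)
{
    FILE *fp = fopen(path, "r");
    int ok;

    if (!fp)
        return -1;
    ok = (fscanf(fp, "%u", value) == 1);
    fclose(fp);
    return ok ? 0 : -1;
}
```

A call such as `read_uint_attr("/sys/class/mtd/mtd0/bitflip_threshold", &thr)` would then return the threshold for the first MTD device; a return value of -1 suggests that the running kernel predates the attribute or that the device does not exist.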

-- 
Best Regards,
Artem Bityutskiy
