* problem with ecc errors and ubifs
From: Steffen Kühn @ 2013-10-22 11:52 UTC (permalink / raw)
To: linux-mtd@lists.infradead.org
Hey,
my question is about ECC errors and how UBIFS deals with them.
Unfortunately, I have to go into some detail to make my problem clear.
We use MT29F4G08ABBDAHC flash devices from Micron. This type of flash
supports on-die ECC (8 bit) and needs at least 4-bit ECC. When we
started using this flash, the kernel (3.2) had no support for on-die
ECC, so we wrote the on-die ECC support ourselves.
In principle, everything works very well. But we use our hardware in
large numbers and quite intensively. Over the months we have observed
numerous destroyed file systems with various UBI errors. To find the
cause of the problem, I wrote a mechanism (in U-Boot) that creates bit
errors.
With it I ran several tests. One test was to create just one single
bit error in the whole flash device. The on-die ECC mechanism (which can
correct up to 8 bit errors) had no trouble correcting this error. The
kernel code contains a piece of code that reports the occurrence of a
bit error to the layers above. With this information, UBIFS can decide
whether to act and what to do.
I have seen that such error reporting usually leads to a page "scrub". I
do not really understand what happens there. But sometimes the result is
catastrophic. Because of that I removed the error reporting (my hope is
that 8 bit errors in one page occur seldom enough). After removing that
code, our problems vanished completely (I have even tested with more
than 8 bit errors in the same page => no problems). I could not provoke
any faults by creating numerous bit errors in dozens of pages.
What is your opinion? Have I overlooked something? I know that this
method has risks, but I hope that on balance the file system stays
alive longer.
Best
Steffen
* Re: problem with ecc errors and ubifs
From: Ricard Wanderlof @ 2013-10-22 12:05 UTC (permalink / raw)
To: Steffen Kühn; +Cc: linux-mtd@lists.infradead.org
On Tue, 22 Oct 2013, Steffen Kühn wrote:
> We use MT29F4G08ABBDAHC flash devices from Micron. This type of flash
> supports on-die-ecc (8 bit) and needs at least 4-bit-ecc. When we have
> started to use this flash no support for on die ecc was available in the
> kernel (kernel 3.2). The on die ecc support is - because of that - self
> written.
> ...
>
> I have seen that such error reporting leads usually to a page "scrub". I
> do not really understand what there happens.
I don't know the details, but scrubbing means that the data in the page
is to be rewritten at another place in the flash, since the presence of a
bit error indicates that the data in the page will eventually become
unreliable.
The code as it looks today triggers scrubbing whenever a single bit
correction is detected during read. The reason for this is that the
classic Hamming algorithm can only handle one incorrect bit, so if another
bit flips the data becomes unreadable.
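To make the "any corrected bit triggers scrubbing" behaviour concrete,
here is a minimal user-space sketch in C. It is a model, not kernel code:
the struct and function names are hypothetical, and the numeric fallback
for EUCLEAN is only there so the sketch compiles on non-Linux systems.
It assumes the read path compares ECC counters before and after a read
and signals "data corrected, please scrub" with -EUCLEAN.

```c
#include <errno.h>

#ifndef EUCLEAN
#define EUCLEAN 117 /* Linux value; fallback for other platforms */
#endif

/* Simplified stand-in for per-device ECC counters (hypothetical). */
struct ecc_stats_model {
    unsigned int corrected; /* running count of corrected bit errors */
    unsigned int failed;    /* running count of uncorrectable reads  */
};

/*
 * Status of one read, given counter snapshots taken before and after it.
 *   -EBADMSG : an uncorrectable error occurred, the data is unreliable
 *   -EUCLEAN : at least one bit was corrected, the data is valid, but
 *              the layer above is asked to consider scrubbing
 *   0        : clean read
 * There is no threshold here: a single corrected bit is enough.
 */
int read_status_no_threshold(const struct ecc_stats_model *before,
                             const struct ecc_stats_model *after)
{
    if (after->failed != before->failed)
        return -EBADMSG;
    if (after->corrected != before->corrected)
        return -EUCLEAN;
    return 0;
}
```

With this model, a read that corrects one bit out of a possible eight is
reported exactly like a read that corrects eight.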
I know there have been discussions that, when using ECC that can correct
more than a single bit in a given area, scrubbing should not be triggered
as soon as a single bit goes bad. Instead, a threshold mechanism would
trigger scrubbing only when, say, half the maximum number of bits need
correcting (e.g. in your case when 4 bits need correcting). The reason is
that flashes which require multibit ECC tend to have bits here and there
that flip rather quickly after writing, so triggering on them leads to
undue scrubbing and hence wear on the flash.
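The threshold idea can be sketched in a few lines of user-space C. This
is an illustration under stated assumptions, not an existing
implementation: both function names are hypothetical, "half the ECC
strength" is just the example figure from the paragraph above, and the
numeric fallback for EUCLEAN only exists so the sketch compiles on
non-Linux systems.

```c
#include <errno.h>

#ifndef EUCLEAN
#define EUCLEAN 117 /* Linux value; fallback for other platforms */
#endif

/* Example policy: scrub once half of the correctable bits are used up. */
unsigned int half_strength_threshold(unsigned int ecc_strength)
{
    unsigned int t = ecc_strength / 2;
    return t > 0 ? t : 1; /* never below 1, so 1-bit ECC still scrubs */
}

/*
 * max_bitflips: largest number of bits corrected in any single ECC
 *               area during the read, or a negative errno on failure.
 * Returns the error unchanged, -EUCLEAN if the threshold is reached,
 * and 0 otherwise (corrected bits below the threshold are ignored).
 */
int read_status_with_threshold(int max_bitflips, unsigned int threshold)
{
    if (max_bitflips < 0)
        return max_bitflips;
    return (unsigned int)max_bitflips >= threshold ? -EUCLEAN : 0;
}
```

For an 8-bit ECC the example policy gives a threshold of 4, so a page
with three corrected bits reads as clean, while a page with four asks
for scrubbing.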
> But sometimes the result is catastrophic. Because of that I have removed
> the error reporting (my hope is that 8 bit errors occur seldom enough in
> a page) After that code removing our problems are completely vanished (I
> have even tested with more than 8 bit errors in the same page => no
> problems). I could not provoke any faults by creating numerous bit
> errors in dozens of pages.
Removing the 'corrected bit' reporting mechanism avoids scrubbing, and
hence that code path is never executed. It seems a serious bug if the
scrubbing mechanism doesn't work as intended, for whatever reason.
> What is your opinion? Have I overlooked something? I know that this
> method has risks but I hope that under the line the file system stays
> longer alive.
The risk would be that eventually the number of errors would grow past the
ECC capability and subsequently lead to unreadable data. If there indeed
is a bug in the scrubbing mechanism, I agree it would be better to just
correct the bits and hope not too many of them flip. But it would be
better to try and fix the problem with scrubbing...
/Ricard
--
Ricard Wolf Wanderlöf ricardw(at)axis.com
Axis Communications AB, Lund, Sweden www.axis.com
Phone +46 46 272 2016 Fax +46 46 13 61 30
* Re: problem with ecc errors and ubifs
From: Steffen Kühn @ 2013-10-22 13:10 UTC (permalink / raw)
To: Ricard Wanderlof; +Cc: linux-mtd@lists.infradead.org
Dear Ricard,
thanks for your answer.
> The risk would be that eventually the number of errors would grow past
> the ECC capability and subsequently lead to unreadable data. If there
> indeed is a bug in the scrubbing mechanism, I agree it would be better
> to just correct the bits and hope not too many of them flip. But it
> would be better to try and fix the problem with scrubbing...
I agree with you that a fix of the scrubbing mechanism would be better.
Unfortunately, the problem is very hard to reproduce (and debug).
And perhaps it is already solved? My kernel is not the newest. But it is
not possible to change the version because we have some patches (related
to other hardware) which only work with kernel 3.2.
Do you know of any important bug fixes related to "scrubbing" from the
last few months?
Best
Steffen
* Re: problem with ecc errors and ubifs
From: Ricard Wanderlof @ 2013-10-22 14:10 UTC (permalink / raw)
To: Steffen Kühn; +Cc: linux-mtd@lists.infradead.org
On Tue, 22 Oct 2013, Steffen Kühn wrote:
> thanks for your answer.
>
>> The risk would be that eventually the number of errors would grow past
>> the ECC capability and subsequently lead to unreadable data. If there
>> indeed is a bug in the scrubbing mechanism, I agree it would be better
>> to just correct the bits and hope not too many of them flip. But it
>> would be better to try and fix the problem with scrubbing...
>
> I agree with you that a fix of the scrubbing mechanism would be better.
> Unfortunately, the problem is very hard to reproduce (and debug).
So it happens very seldom then?
> And perhaps it is already solved? My kernel is not the newest. But it is
> not possible to change the version because we have some patches (related
> to other hardware) which only works with kernel 3.2.
>
> Do you know about some important bug fixes which have something to do
> with the "scrubbing" in the last months?
Not offhand, but I don't keep track of all corrected bugs. A quick scan in
the mailing list archives might be useful.
/Ricard
--
Ricard Wolf Wanderlöf ricardw(at)axis.com
Axis Communications AB, Lund, Sweden www.axis.com
Phone +46 46 272 2016 Fax +46 46 13 61 30
* Re: problem with ecc errors and ubifs
From: Artem Bityutskiy @ 2013-10-23 16:17 UTC (permalink / raw)
To: Ricard Wanderlof; +Cc: Steffen Kühn, linux-mtd@lists.infradead.org
On Tue, 2013-10-22 at 14:05 +0200, Ricard Wanderlof wrote:
> I know there have been discussions that when using ECC that can correct
> more than a single bit in a given area, to not trigger scrubbing as soon
> as a single bit goes bad, but use a threshold mechanism, so that scrubbing
> is triggered first when, say, half the maximum amount of bits need
> correcting (e.g. in your case when 4 bits need correcting), the reason
> being that flashes which require multibit ecc tend to have bits here and
> there that flip rather quickly after writing, so triggering on them leads
> to undue scrubbing and hence wear on the flash.
We have already had this in mainline kernels for a rather long time. See
the 'bitflip_threshold' variable in the mtd_info data structure. We
have a good piece of documentation in
Documentation/ABI/testing/sysfs-class-mtd
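
On kernels that provide these attributes, the values can be inspected
from user space. Below is a small C sketch that reads a numeric
attribute such as 'bitflip_threshold' or 'ecc_strength' from
/sys/class/mtd/mtdX/. The helper names are hypothetical, and the device
index is an assumption that depends on the board. A 3.2 kernel without
the threshold support will not have the files, in which case the
helpers return -1.

```c
#include <stdio.h>

/*
 * Read one unsigned decimal value from a sysfs-style text file.
 * Returns 0 on success, -1 if the file cannot be opened or parsed.
 */
int read_uint_attr(const char *path, unsigned int *out)
{
    FILE *f = fopen(path, "r");
    int ok;

    if (!f)
        return -1;
    ok = (fscanf(f, "%u", out) == 1);
    fclose(f);
    return ok ? 0 : -1;
}

/* Build /sys/class/mtd/mtd<index>/<attr> and read its value. */
int read_mtd_attr(int mtd_index, const char *attr, unsigned int *out)
{
    char path[128];
    int n = snprintf(path, sizeof(path), "/sys/class/mtd/mtd%d/%s",
                     mtd_index, attr);

    if (n < 0 || (size_t)n >= sizeof(path))
        return -1; /* path did not fit into the buffer */
    return read_uint_attr(path, out);
}
```

For example, read_mtd_attr(0, "bitflip_threshold", &v) followed by
read_mtd_attr(0, "ecc_strength", &s) shows how many corrected bits per
ECC step are tolerated on mtd0 before a read is reported as needing
scrubbing, relative to what the ECC can correct at all.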
--
Best Regards,
Artem Bityutskiy