* problem with ecc errors and ubifs
From: Steffen Kühn @ 2013-10-22 11:52 UTC
To: linux-mtd@lists.infradead.org

Hey,

my question is about ECC errors and the way UBIFS deals with them.
Unfortunately, I have to go into some detail to make my problem clear.

We use MT29F4G08ABBDAHC flash devices from Micron. This type of flash
supports on-die ECC (8 bit) and needs at least 4-bit ECC. When we started
using this flash, the kernel (3.2) had no support for on-die ECC, so we
wrote the on-die ECC support ourselves.

In principle everything works very well. But we use our hardware in large
numbers and quite intensively, and over the months we have observed
numerous destroyed file systems with various UBI errors.

To find the cause, I wrote a mechanism (in U-Boot) that creates bit errors,
and ran different tests with it. One test was to create a single bit error
in the whole flash device. The on-die ECC mechanism (which can correct up
to 8 bit errors) had no problem correcting this error.

The kernel has a piece of code that reports the occurrence of a bit error
to the layers above. With this information UBIFS can decide whether it has
to do anything, and what. I have seen that such error reporting usually
leads to a page "scrub". I do not really understand what happens there,
but sometimes the result is catastrophic.

Because of that I removed the error reporting (my hope is that 8 bit
errors in one page occur seldom enough). After removing that code our
problems vanished completely (I have even tested with more than 8 bit
errors in the same page => no problems). I could not provoke any faults by
creating numerous bit errors in dozens of pages.

What is your opinion? Have I overlooked something? I know that this method
has risks, but I hope that on balance the file system stays alive longer.

Best
Steffen

* Re: problem with ecc errors and ubifs
From: Ricard Wanderlof @ 2013-10-22 12:05 UTC
To: Steffen Kühn
Cc: linux-mtd@lists.infradead.org

On Tue, 22 Oct 2013, Steffen Kühn wrote:

> We use MT29F4G08ABBDAHC flash devices from Micron. This type of flash
> supports on-die ECC (8 bit) and needs at least 4-bit ECC. When we started
> using this flash, the kernel (3.2) had no support for on-die ECC, so we
> wrote the on-die ECC support ourselves.
> ...
>
> I have seen that such error reporting usually leads to a page "scrub". I
> do not really understand what happens there,

I don't know the details, but scrubbing means that the data in the page is
rewritten to another place in the flash, since the presence of a bit error
indicates that the data in the page will eventually become unreliable.

The code as it looks today triggers scrubbing whenever a single bit
correction is detected during a read. The reason for this is that the
classic Hamming algorithm can only handle one incorrect bit, so if another
bit flips the data becomes unreadable.

I know there have been discussions about ECC that can correct more than a
single bit in a given area: instead of triggering scrubbing as soon as a
single bit goes bad, use a threshold mechanism, so that scrubbing is
triggered only when, say, half the maximum number of bits need correcting
(e.g. in your case when 4 bits need correcting). The reason is that flashes
which require multibit ECC tend to have bits here and there that flip
rather quickly after writing, so triggering on them leads to undue
scrubbing and hence wear on the flash.

> but sometimes the result is catastrophic.
>
> Because of that I removed the error reporting (my hope is that 8 bit
> errors in one page occur seldom enough). After removing that code our
> problems vanished completely (I have even tested with more than 8 bit
> errors in the same page => no problems). I could not provoke any faults
> by creating numerous bit errors in dozens of pages.

Removing the 'corrected bit' reporting mechanism avoids scrubbing, and
hence that code path is never executed. It seems a serious bug if the
scrubbing mechanism doesn't work as intended, for whatever reason.

> What is your opinion? Have I overlooked something? I know that this
> method has risks, but I hope that on balance the file system stays alive
> longer.

The risk would be that eventually the number of errors would grow past the
ECC capability and subsequently lead to unreadable data. If there indeed is
a bug in the scrubbing mechanism, I agree it would be better to just
correct the bits and hope not too many of them flip. But it would be better
to try and fix the problem with scrubbing...

/Ricard
--
Ricard Wolf Wanderlöf                           ricardw(at)axis.com
Axis Communications AB, Lund, Sweden            www.axis.com
Phone +46 46 272 2016                           Fax +46 46 13 61 30

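The threshold idea described in this message can be sketched roughly as
follows. This is an illustration only, not the kernel's code: the function
name `classify_read` and its return strings are made up for this example,
and the "half of the ECC strength" default is simply the figure mentioned
in the message.

```python
def classify_read(corrected_bits, ecc_strength, threshold=None):
    """Decide what a page read with `corrected_bits` bitflips should trigger.

    Hypothetical helper for illustration; not an MTD or UBI API.
    """
    if threshold is None:
        # Assumption: request scrubbing once half of the correctable
        # bits are used up (never less than one bit).
        threshold = max(1, ecc_strength // 2)
    if corrected_bits > ecc_strength:
        return "uncorrectable"  # more flips than the ECC can repair
    if corrected_bits >= threshold:
        return "scrub"          # data still readable, but should be moved
    return "ok"                 # below threshold: correct silently


# 8-bit on-die ECC, as in the original report
print(classify_read(1, 8))  # ok
print(classify_read(4, 8))  # scrub
print(classify_read(9, 8))  # uncorrectable
# 1-bit Hamming ECC: the first flip already requests scrubbing
print(classify_read(1, 1))  # scrub
```

With a threshold of 1 this reproduces the "scrub on any corrected bit"
behaviour described for single-bit ECC; a higher threshold trades some
safety margin for less wear.
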
* Re: problem with ecc errors and ubifs
From: Steffen Kühn @ 2013-10-22 13:10 UTC
To: Ricard Wanderlof
Cc: linux-mtd@lists.infradead.org

Dear Ricard,

thanks for your answer.

> The risk would be that eventually the number of errors would grow past
> the ECC capability and subsequently lead to unreadable data. If there
> indeed is a bug in the scrubbing mechanism, I agree it would be better
> to just correct the bits and hope not too many of them flip. But it
> would be better to try and fix the problem with scrubbing...

I agree with you that fixing the scrubbing mechanism would be better.
Unfortunately, the problem is very hard to reproduce (and debug).

And perhaps it is already solved? My kernel is not the newest. But we
cannot change the version, because we have some patches (related to other
hardware) which only work with kernel 3.2.

Do you know of any important bug fixes related to "scrubbing" from recent
months?

Best
Steffen

* Re: problem with ecc errors and ubifs
From: Ricard Wanderlof @ 2013-10-22 14:10 UTC
To: Steffen Kühn
Cc: linux-mtd@lists.infradead.org

On Tue, 22 Oct 2013, Steffen Kühn wrote:

> thanks for your answer.
>
>> The risk would be that eventually the number of errors would grow past
>> the ECC capability and subsequently lead to unreadable data. If there
>> indeed is a bug in the scrubbing mechanism, I agree it would be better
>> to just correct the bits and hope not too many of them flip. But it
>> would be better to try and fix the problem with scrubbing...
>
> I agree with you that fixing the scrubbing mechanism would be better.
> Unfortunately, the problem is very hard to reproduce (and debug).

So it happens very seldom then?

> And perhaps it is already solved? My kernel is not the newest. But we
> cannot change the version, because we have some patches (related to
> other hardware) which only work with kernel 3.2.
>
> Do you know of any important bug fixes related to "scrubbing" from
> recent months?

Not offhand, but I don't keep track of all corrected bugs. A quick scan of
the mailing list archives might be useful.

/Ricard
--
Ricard Wolf Wanderlöf                           ricardw(at)axis.com
Axis Communications AB, Lund, Sweden            www.axis.com
Phone +46 46 272 2016                           Fax +46 46 13 61 30

* Re: problem with ecc errors and ubifs
From: Artem Bityutskiy @ 2013-10-23 16:17 UTC
To: Ricard Wanderlof
Cc: Steffen Kühn, linux-mtd@lists.infradead.org

On Tue, 2013-10-22 at 14:05 +0200, Ricard Wanderlof wrote:
> I know there have been discussions about ECC that can correct more than
> a single bit in a given area: instead of triggering scrubbing as soon as
> a single bit goes bad, use a threshold mechanism, so that scrubbing is
> triggered only when, say, half the maximum number of bits need
> correcting (e.g. in your case when 4 bits need correcting). The reason
> is that flashes which require multibit ECC tend to have bits here and
> there that flip rather quickly after writing, so triggering on them
> leads to undue scrubbing and hence wear on the flash.

We have had this in mainline kernels for a rather long time already. See
the 'bitflip_threshold' field of the mtd_info data structure. There is a
good piece of documentation in Documentation/ABI/testing/sysfs-class-mtd.

--
Best Regards,
Artem Bityutskiy

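For reference, a small sketch of how the values mentioned in this message
can be inspected from user space. It assumes the `ecc_strength` and
`bitflip_threshold` attributes described in
Documentation/ABI/testing/sysfs-class-mtd are present for the device; older
kernels such as 3.2 may not provide them. The helper name
`read_mtd_ecc_info` and the default device `mtd0` are made up for this
example.

```python
import os


def read_mtd_ecc_info(device="mtd0", sysfs_root="/sys/class/mtd"):
    """Return ecc_strength and bitflip_threshold for one MTD device.

    Hypothetical helper; a value is None if the attribute is missing
    or cannot be parsed as an integer.
    """
    info = {}
    for name in ("ecc_strength", "bitflip_threshold"):
        path = os.path.join(sysfs_root, device, name)
        try:
            with open(path) as f:
                info[name] = int(f.read().strip())
        except (OSError, ValueError):
            info[name] = None  # attribute absent or unreadable
    return info


if __name__ == "__main__":
    print(read_mtd_ecc_info())
```

On a kernel without these attributes the function simply reports None for
both, which is itself a hint that the threshold mechanism is not available
there.
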