linux-mtd.lists.infradead.org archive mirror
* UBIFS recovery fails
@ 2011-10-17 13:29 Daniel Kuhn
  2011-10-17 20:17 ` Artem Bityutskiy
  0 siblings, 1 reply; 24+ messages in thread
From: Daniel Kuhn @ 2011-10-17 13:29 UTC (permalink / raw)
  To: linux-mtd

Hi,

I have a problem with a device which uses UBI + UBIFS on a 32GB NAND Flash
(16*2GB). The filesystem worked without problems for a couple of months, but
now I get an error if I try to mount the volume.
Attaching the UBI device works fine, as you can see in the following
messages:

===============

UBI: max. sequence number:       18803
UBI: attached mtd3 to ubi2
UBI: MTD device name:            "LogData"
UBI: MTD device size:            32544 MiB
UBI: number of good PEBs:        130119
UBI: number of bad PEBs:         57
UBI: number of corrupted PEBs:   0
UBI: max. allowed volumes:       128
UBI: wear-leveling threshold:    4096
UBI: number of internal volumes: 1
UBI: number of user volumes:     1
UBI: available PEBs:             0
UBI: total number of reserved PEBs: 130119
UBI: number of PEBs reserved for bad PEB handling: 1301
UBI: max/mean erase counter: 30/0
UBI: image sequence number:  1813038814
UBI: background thread "ubi_bgt2d" started, PID 1376

===============


UBIFS prints the following error messages during mount (with dmesg -n8, 
UBIFS debugging enabled):

===============

UBIFS: recovery needed
UBIFS error (pid 611): ubifs_recover_leb: corrupt empty space LEB 3550:188416, corruption starts at 64362
UBIFS error (pid 611): ubifs_scanned_corruption: corruption at LEB 3550:252778
UBIFS error (pid 611): ubifs_scanned_corruption: first 1174 bytes from LEB 3550:252778
00000000: ffffffbf ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff  ................................
00000020: ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff  ................................
00000040: ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff  ................................
00000060: ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff  ................................
00000080: ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff  ................................
000000a0: ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff  ................................
000000c0: ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff  ................................
000000e0: ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff  ................................
00000100: ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff  ................................
00000120: ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff  ................................
00000140: ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff  ................................
00000160: ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff  ................................
00000180: ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff  ................................
000001a0: ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff  ................................
000001c0: ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff  ................................
000001e0: ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff  ................................
00000200: ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff  ................................
00000220: ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff  ................................
00000240: ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff  ................................
00000260: ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff  ................................
00000280: ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff  ................................
000002a0: ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff  ................................
000002c0: ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff  ................................
000002e0: ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff  ................................
00000300: ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff  ................................
00000320: ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff  ................................
00000340: ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff  ................................
00000360: ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff  ................................
00000380: ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff  ................................
000003a0: ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff  ................................
000003c0: ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff  ................................
000003e0: ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff  ................................
00000400: ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff  ................................
00000420: ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff  ................................
00000440: ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff  ................................
00000460: ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff  ................................
00000480: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ......................
UBIFS error (pid 611): ubifs_recover_leb: LEB 3550 scanning failed

===============
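
The offsets in the messages above can be cross-checked with a few lines of Python. This is only a sketch: the variable names are made up for illustration, and reading the dump as "one cleared bit in otherwise erased space" is an interpretation of the hex output, not something UBIFS states directly.

```python
# Values copied from the UBIFS messages above.
scan_start = 188416        # "corrupt empty space LEB 3550:188416"
corruption_rel = 64362     # "corruption starts at 64362"
corruption_abs = 252778    # "corruption at LEB 3550:252778"
dumped_bytes = 1174        # "first 1174 bytes from LEB 3550:252778"

# The absolute offset is the scan start plus the relative offset.
assert scan_start + corruption_rel == corruption_abs

# Compare the first dumped word with an erased (all-ones) word.
first_word = 0xFFFFFFBF
erased_word = 0xFFFFFFFF
diff = first_word ^ erased_word
flipped_bits = bin(diff).count("1")

print(f"corruption offset: {corruption_abs}")
print(f"differing bits in first word: {flipped_bits} (mask {diff:#04x})")
print(f"end of dump: {corruption_abs + dumped_bytes}")
```

The dump ends at offset 253952, which equals the 262144-byte eraseblock minus two 4096-byte I/O units, so the dump appears to run up to the end of the LEB. Everything in it is 0xff except for a single bit in the first word.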


The kernel (2.6.38-rc8) runs on an ARM926 CPU (OMAPL138 from TI). I've
taken the latest UBI + UBIFS sources from the git repository
"git://git.infradead.org/~dedekind/ubifs-v2.6.38.git", which eliminated
a segmentation fault I had before updating.

"mtdinfo -a" prints the following information for the broken device mtd3:

===============

mtd3
Name:                           LogData
Type:                           nand
Eraseblock size:                262144 bytes, 256.0 KiB
Amount of eraseblocks:          130176 (34124857344 bytes, 31.8 GiB)
Minimum input/output unit size: 4096 bytes
Sub-page size:                  1024 bytes
OOB size:                       128 bytes
Character device major/minor:   90:6
Bad blocks are allowed:         true
Device is writable:             true

===============
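
As a sanity check, the geometry reported by mtdinfo agrees with the UBI attach messages. The following Python sketch only redoes the arithmetic; the variable names are illustrative and nothing here queries the device.

```python
# Values copied from the UBI attach log and the mtdinfo output above.
eraseblock_size = 262144      # bytes per physical eraseblock (PEB)
eraseblock_count = 130176     # "Amount of eraseblocks"
good_pebs = 130119            # "number of good PEBs"
bad_pebs = 57                 # "number of bad PEBs"
reserved_for_bad = 1301       # "PEBs reserved for bad PEB handling"

total_bytes = eraseblock_size * eraseblock_count
print(total_bytes)                                    # 34124857344
print(total_bytes // 2**20)                           # 32544 (MiB)
print(round(total_bytes / 2**30, 1))                  # 31.8 (GiB)
print(good_pebs + bad_pebs == eraseblock_count)       # True
print(round(100 * reserved_for_bad / good_pebs, 2))   # 1.0 (percent)
```

So the good and bad PEBs add up to the full device, and roughly 1% of the good PEBs are reserved for bad PEB handling.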


Is there a way to recover the data on the device?

Thanks in advance,
Daniel Kuhn

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: UBIFS recovery fails
  2011-10-17 13:29 UBIFS recovery fails Daniel Kuhn
@ 2011-10-17 20:17 ` Artem Bityutskiy
  2011-10-18  8:11   ` Ricard Wanderlof
                     ` (3 more replies)
  0 siblings, 4 replies; 24+ messages in thread
From: Artem Bityutskiy @ 2011-10-17 20:17 UTC (permalink / raw)
  To: Daniel Kuhn; +Cc: linux-mtd

On Mon, 2011-10-17 at 15:29 +0200, Daniel Kuhn wrote:
> Hi,
> 
> I have a problem with a device which uses UBI + UBIFS on a 32GB NAND 
> Flash (16*2GB). The
> filesystem worked without problems for a couple of months but now I get 
> an error if I try to mount the volume.
> Attaching the UBI-Device works fine as you can see in the following 
> messages:

This issue looks like one of the MLC-specific ones. Unfortunately,
no one really invested time into making UBIFS support MLC very well.
It needs some more work. It also has some issues related to unstable
bits in modern SLC.

In short - if you want to use UBIFS on MLC - you should not have unclean
reboots. If you want to make UBIFS 100% unclean-reboot safe on MLC - you
need to work on it some more.

We (the original authors) developed and tested it on very robust SLC
NAND.

Unfortunately, I have not worked with MTD for a few years already and have
no spare time to make it MLC-robust. My last attempt was this spring -
I started making the integck test (in mtd-utils) support emulated power
cuts. The idea was to improve the UBIFS power-cut emulation infrastructure
and make it emulate unstable bits, and then test and fix all issues.
But then I realized that I simply would not have time to finish it,
so I left the work half-done.

If someone wants to see UBIFS 100% or near 100% power-cut safe on
MLC or one of the shitty modern SLCs - they need to invest man-hours.
I can help by suggesting and reviewing. Although the funny thing is
that eMMCs die and lose data in case of power cuts very often :-)

> UBI: wear-leveling threshold:    4096
Are you sure it is good for MLC? What is your eraseblock life-cycle?

> Is there a way to recover the data on the device?

If you just have precious data which you need - I think:
1. Make a dump of the flash.
2. Verify that on another device you can flash it and reproduce this
issue.
3. Just hack the kernel function that fails so that it returns 0 instead
of an error for that case, and I think you will be able to mount your
flash and copy your data.

But if you want this to never happen again - you need to prepare
for a several-month project (provided you have a good kernel
engineer).

HTH :-)

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: UBIFS recovery fails
  2011-10-17 20:17 ` Artem Bityutskiy
@ 2011-10-18  8:11   ` Ricard Wanderlof
  2011-10-18  8:42     ` Ivan Djelic
  2011-10-20 16:36     ` Artem Bityutskiy
  2011-10-18  8:29   ` Ivan Djelic
                     ` (2 subsequent siblings)
  3 siblings, 2 replies; 24+ messages in thread
From: Ricard Wanderlof @ 2011-10-18  8:11 UTC (permalink / raw)
  To: Artem Bityutskiy; +Cc: linux-mtd@lists.infradead.org


On Mon, 17 Oct 2011, Artem Bityutskiy wrote:

> [...] Unfortunately, no one really invested time into making UBIFS
> support MLC very well. It needs some more work. It also has some issues
> related to unstable bits in modern SLC.
>
> In short - if you want to use UBIFS on MLC - you should not have unclean
> reboots. If you want to make UBIFS 100% unclean-reboot safe on MLC - you
> need to work on it some more.
>
> We (the original authors) developed and tested it on very robust SLC
> NAND.

Do you have any specifics on what the issues are with MLC ?

Since UBI implements bit scrubbing and eraseblock torture on questionable 
blocks it would seem that a lot of the work has been done in order to 
handle unstable bits. Is there some issue which is related specifically to 
bit flips happening while the system is powered off?

/Ricard
-- 
Ricard Wolf Wanderlöf                           ricardw(at)axis.com
Axis Communications AB, Lund, Sweden            www.axis.com
Phone +46 46 272 2016                           Fax +46 46 13 61 30

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: UBIFS recovery fails
  2011-10-17 20:17 ` Artem Bityutskiy
  2011-10-18  8:11   ` Ricard Wanderlof
@ 2011-10-18  8:29   ` Ivan Djelic
  2011-10-19 15:15     ` Artem Bityutskiy
  2011-10-18 12:47   ` Jean-Sébastien Gagnon
  2011-10-18 15:29   ` Daniel Kuhn
  3 siblings, 1 reply; 24+ messages in thread
From: Ivan Djelic @ 2011-10-18  8:29 UTC (permalink / raw)
  To: Artem Bityutskiy; +Cc: linux-mtd@lists.infradead.org, Daniel Kuhn

On Mon, Oct 17, 2011 at 09:17:48PM +0100, Artem Bityutskiy wrote:
> 
> If someone wants to see UBIFS 100% or near 100% power-cut safe on
> MLC or one of the shitty modern SLCs - they need to invest man-hours.
> I can help by suggesting and reviewing. Although the funny thing is
> that eMMCs die and lose data in case of power cuts very often :-)

Hi Artem,

That's interesting... Do you have more details or any data on those eMMC
power-cut failures ?

I plan to be working soon (December) on UBIFS robustness issues with unstable
modern SLCs; besides using nandsim to simulate SLC (and maybe MLC) issues,
I also have real hardware with a power-cutting framework ready for testing.

Best Regards,
--
Ivan

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: UBIFS recovery fails
  2011-10-18  8:11   ` Ricard Wanderlof
@ 2011-10-18  8:42     ` Ivan Djelic
  2011-10-20 16:37       ` Artem Bityutskiy
  2011-10-20 16:36     ` Artem Bityutskiy
  1 sibling, 1 reply; 24+ messages in thread
From: Ivan Djelic @ 2011-10-18  8:42 UTC (permalink / raw)
  To: Ricard Wanderlof; +Cc: linux-mtd@lists.infradead.org, Artem Bityutskiy

On Tue, Oct 18, 2011 at 09:11:54AM +0100, Ricard Wanderlof wrote:
> 
> On Mon, 17 Oct 2011, Artem Bityutskiy wrote:
> 
> > [...] Unfortunately, no one really invested time into making UBIFS
> > support MLC very well. It needs some more work. It also has some issues
> > related to unstable bits in modern SLC.
> >
> > In short - if you want to use UBIFS on MLC - you should not have unclean
> > reboots. If you want to make UBIFS 100% unclean-reboot safe on MLC - you
> > need to work on it some more.
> >
> > We (the original authors) developed and tested it on very robust SLC
> > NAND.
> 
> Do you have any specifics on what the issues are with MLC ?
 
Besides unstable bits, lower endurance, higher ecc requirements and NOP1, there
is a "page pairing" phenomenon which is not handled by UBI/UBIFS AFAIK.
When you cut power during a page write, you may lose data in another (paired)
previously programmed page.

BR,
--
Ivan

^ permalink raw reply	[flat|nested] 24+ messages in thread

* RE: UBIFS recovery fails
  2011-10-17 20:17 ` Artem Bityutskiy
  2011-10-18  8:11   ` Ricard Wanderlof
  2011-10-18  8:29   ` Ivan Djelic
@ 2011-10-18 12:47   ` Jean-Sébastien Gagnon
  2011-10-18 14:54     ` Ivan Djelic
  2011-10-18 15:29   ` Daniel Kuhn
  3 siblings, 1 reply; 24+ messages in thread
From: Jean-Sébastien Gagnon @ 2011-10-18 12:47 UTC (permalink / raw)
  To: linux-mtd

Hi, 
Actually, I think the empty space corruption is the only thing to address in
this specific problem, since any other error caused by unstable bits on valid
data should be corrected by the parities in the flash driver.

This problem could be addressed by the NAND driver reading the pages and
correcting empty ones, or by fixing UBIFS to handle this case.
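
The first option could look roughly like the Python sketch below. It is a hypothetical illustration, not the code of any existing NAND driver: the function name `fix_erased_page` and the `max_bitflips` threshold are assumptions. The idea is that a page with only a handful of zero bits is treated as erased and handed to the upper layers as all 0xff.

```python
def fix_erased_page(page: bytes, max_bitflips: int = 1):
    """Return an all-0xFF copy of `page` if it looks erased, else None.

    A page "looks erased" here when at most `max_bitflips` bits are 0.
    """
    zero_bits = sum(8 - bin(b).count("1") for b in page)
    if zero_bits <= max_bitflips:
        return b"\xff" * len(page)
    return None


# One cleared bit in otherwise erased space (a 0xbf byte, as in the dump).
page = bytes([0xBF]) + b"\xff" * 4095
print(fix_erased_page(page) == b"\xff" * 4096)

# A page that holds real data is left alone.
print(fix_erased_page(b"UBI#" + b"\xff" * 4092))
```

A real driver would also have to look at the OOB/ECC bytes and pick the threshold from the ECC strength; both are left out here.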


J-S Gagnon





^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: UBIFS recovery fails
  2011-10-18 12:47   ` Jean-Sébastien Gagnon
@ 2011-10-18 14:54     ` Ivan Djelic
  2011-10-18 15:10       ` Jean-Sébastien Gagnon
  2011-10-20 14:14       ` Artem Bityutskiy
  0 siblings, 2 replies; 24+ messages in thread
From: Ivan Djelic @ 2011-10-18 14:54 UTC (permalink / raw)
  To: Jean-Sébastien Gagnon; +Cc: linux-mtd@lists.infradead.org

On Tue, Oct 18, 2011 at 01:47:26PM +0100, Jean-Sébastien Gagnon wrote:
> Hi, 
> Actually, I think the empty space corruption is the only thing to address in
> this specific problem, since any other error caused by unstable bits on valid
> data should be corrected by the parities in the flash driver.

Hi Jean-Sébastien,

If you cut power during a page programming operation, you can easily get more
unstable bits than what the manufacturer-specified ecc supports (for instance,
3 unstable bits on a 1bit-ecc device). We experienced this on several different
devices.
Having a lot of bitflips (more than what ecc supports) is not the problem here:
the page was indeed partially programmed, it contains garbage and its contents
should be discarded.

The real problem appears when those faulty bits are unstable: during the first
few read attempts, the page may be successfully read (possibly with ecc
corrections); and then, a bit later, the page becomes unreadable because of too
many faulty bits.

Therefore, software using MTD (UBI, UBIFS) cannot just rely on being able to
read a page at some point to decide that this page reliably stores data.
It should also be able to trace power failures, and treat the NAND area being
modified (programmed or erased) during the power cut as potentially unstable.

HTH,
-- 
Ivan

^ permalink raw reply	[flat|nested] 24+ messages in thread

* RE: UBIFS recovery fails
  2011-10-18 14:54     ` Ivan Djelic
@ 2011-10-18 15:10       ` Jean-Sébastien Gagnon
  2011-10-18 15:32         ` Ivan Djelic
  2011-10-20 16:43         ` Artem Bityutskiy
  2011-10-20 14:14       ` Artem Bityutskiy
  1 sibling, 2 replies; 24+ messages in thread
From: Jean-Sébastien Gagnon @ 2011-10-18 15:10 UTC (permalink / raw)
  To: Ivan Djelic; +Cc: linux-mtd

The situation you described should already be handled correctly by UBIFS, if
the NAND driver correctly reports pages with bitflips by returning -EUCLEAN.
In this case, UBI will move the PEB to a new one as soon as possible to avoid
this problem.

My comment was really about the original error posted by Daniel Kuhn:

>> UBIFS: recovery needed
>> UBIFS error (pid 611): ubifs_recover_leb: corrupt empty space LEB 3550:188416, corruption starts at 64362
>> UBIFS error (pid 611): ubifs_scanned_corruption: corruption at LEB 3550:252778
>> UBIFS error (pid 611): ubifs_scanned_corruption: first 1174 bytes from LEB 3550:252778
>> 00000000: ffffffbf ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff  ................................
>> 00000020: ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff  ................................
>> ...

In this case, it seems that a bitflip has occurred on a blank page, or maybe
the page was partially programmed (really partially).




^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: UBIFS recovery fails
  2011-10-17 20:17 ` Artem Bityutskiy
                     ` (2 preceding siblings ...)
  2011-10-18 12:47   ` Jean-Sébastien Gagnon
@ 2011-10-18 15:29   ` Daniel Kuhn
  3 siblings, 0 replies; 24+ messages in thread
From: Daniel Kuhn @ 2011-10-18 15:29 UTC (permalink / raw)
  To: linux-mtd

Thanks for the immediate responses.

> If you just have precious data which you need - I think:
> 1. Make a dump of the flash.
> 2. Verify that on another device you can flash it and reproduce this
> issue.
> 3. Just hack the kernel function that fails so that it returns 0 instead
> of an error for that case, and I think you will be able to mount your
> flash and copy your data.

Luckily I could restore the data on the volume (a single file) after some
tests with various code changes. Finally I succeeded by commenting out the
journal replay function call in the file "super.c". Changing return codes to
zero in several other places led to new errors in other parts of the code.

> This issue looks like one of the MLC-specific ones. Unfortunately,
> no one really invested time into making UBIFS support MLC very well.
> It needs some more work. It also has some issues related to unstable
> bits in modern SLC.
>
The NAND Flash I'm using is a Samsung SLC NAND K9WBG08U1M. So this isn't
MLC-specific.
It seems strange to me that a whole data block becomes empty during
operation or as a result of a power failure. The device on which this
error occurred is a prototype on which I run various tests during
development, and the original bad block markers from the manufacturer
were overwritten by low-level tests with an embedded FPGA soft-core CPU.
So maybe the failed LEB 3550 was originally a bad eraseblock, but I can't
tell for sure.

Any further ideas to improve the UBIFS data security are really appreciated.

Thanks,
Daniel Kuhn

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: UBIFS recovery fails
  2011-10-18 15:10       ` Jean-Sébastien Gagnon
@ 2011-10-18 15:32         ` Ivan Djelic
  2011-10-18 16:05           ` Jean-Sébastien Gagnon
  2011-10-20 16:43         ` Artem Bityutskiy
  1 sibling, 1 reply; 24+ messages in thread
From: Ivan Djelic @ 2011-10-18 15:32 UTC (permalink / raw)
  To: Jean-Sébastien Gagnon; +Cc: linux-mtd@lists.infradead.org

On Tue, Oct 18, 2011 at 04:10:32PM +0100, Jean-Sébastien Gagnon wrote:
> The situation you described should already be handled correctly by UBIFS, if
> the NAND driver correctly reports pages with bitflips by returning -EUCLEAN.
> In this case, UBI will move the PEB to a new one as soon as possible to avoid
> this problem.

This would be true, if you could assume that a page reading failure always
occurs after a previous read on the same page reports an ecc correction.
But this is not the case: we had several unstable pages going from 0 bitflips
(perfect read) to 2 bitflips (failed read). No way to detect any failure, no
way to scrub data before it's too late.
And the answer from the manufacturer was: you should not use any partially
programmed/erased pages anyway, those should be cleaned up after recovering
from a power failure.
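
Here is a toy Python model of why scrubbing cannot help in that case. It is a sketch under stated assumptions, not UBI code: `classify_read`, `scrub_chance` and the 1-bit `ECC_STRENGTH` are hypothetical stand-ins for the driver reporting a clean, corrected, or uncorrectable read.

```python
ECC_STRENGTH = 1  # assumed: ECC corrects at most 1 bitflip per read


def classify_read(bitflips: int) -> str:
    """Map the number of bitflips seen in one read to a driver-style status."""
    if bitflips == 0:
        return "ok"
    if bitflips <= ECC_STRENGTH:
        return "corrected"       # the upper layer would schedule scrubbing
    return "uncorrectable"       # the data is lost


def scrub_chance(history) -> bool:
    """True if a 'corrected' read occurred before the first failed read."""
    for flips in history:
        status = classify_read(flips)
        if status == "corrected":
            return True
        if status == "uncorrectable":
            return False
    return False


print(scrub_chance([0, 0, 1, 2]))  # gradual degradation: True
print(scrub_chance([0, 0, 2]))     # unstable page, no warning: False
```

The second history is the one described above: the page jumps from a perfect read to a failed read without ever reporting a correction in between.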

> My comment was really about the original error posted by Daniel Kuhn :

I agree on your comment about blank page; bitflips on erased space should be
corrected, or the upper layers should be robust to them...

BR,
--
Ivan

^ permalink raw reply	[flat|nested] 24+ messages in thread

* RE: UBIFS recovery fails
  2011-10-18 15:32         ` Ivan Djelic
@ 2011-10-18 16:05           ` Jean-Sébastien Gagnon
  2011-10-19  6:50             ` Ricard Wanderlof
  0 siblings, 1 reply; 24+ messages in thread
From: Jean-Sébastien Gagnon @ 2011-10-18 16:05 UTC (permalink / raw)
  To: Ivan Djelic; +Cc: linux-mtd


>> The situation you described should already be handled correctly by UBIFS, if
>> the NAND driver correctly reports pages with bitflips by returning -EUCLEAN.
>> In this case, UBI will move the PEB to a new one as soon as possible to avoid
>> this problem.

> This would be true, if you could assume that a page reading failure always
> occurs after a previous read on the same page reports an ecc correction.
> But this is not the case: we had several unstable pages going from 0 bitflips
> (perfect read) to 2 bitflips (failed read). No way to detect any failure, no
> way to scrub data before it's too late.
> And the answer from the manufacturer was: you should not use any partially
> programmed/erased pages anyway, those should be cleaned up after recovering
> from a power failure.

This should not happen if you use the ECC bit count recommended for the NAND part.
Can you verify that? 1-bit ECC per 512 bytes is only recommended for older
SLC flash, which is pretty stable. Newer NAND needs more ECC bits, which can be
addressed by using BCH parities.

---
J-S


^ permalink raw reply	[flat|nested] 24+ messages in thread

* RE: UBIFS recovery fails
  2011-10-18 16:05           ` Jean-Sébastien Gagnon
@ 2011-10-19  6:50             ` Ricard Wanderlof
  2011-10-19 10:22               ` Ivan Djelic
  0 siblings, 1 reply; 24+ messages in thread
From: Ricard Wanderlof @ 2011-10-19  6:50 UTC (permalink / raw)
  To: Jean-Sébastien Gagnon; +Cc: Ivan Djelic, linux-mtd@lists.infradead.org


>> This would be true, if you could assume that a page reading failure always
>> occurs after a previous read on the same page reports an ecc correction.
>> But this is not the case: we had several unstable pages going from 0 bitflips
>> (perfect read) to 2 bitflips (failed read). No way to detect any failure, no
>> way to scrub data before it's too late.
>> And the answer from the manufacturer was: you should not use any partially
>> programmed/erased pages anyway, those should be cleaned up after recovering
>> from a power failure.
>
> This should not happen if you use the ECC bit count recommended for the NAND part.
> Can you verify that? 1-bit ECC per 512 bytes is only recommended for older
> SLC flash, which is pretty stable. Newer NAND needs more ECC bits, which can be
> addressed by using BCH parities.

I would think that the manufacturer's ECC recommendation is only valid if 
the page is properly programmed. If a power failure occurs during the 
writing of a page, it has not been programmed according to the 
manufacturer's specification, hence it is not properly programmed, and no 
assumptions can be made as to the validity of the data.

I remember a time in the dark old days of EPROMs, when I programmed an 
EPROM with too low a programming voltage (Vpp) - like 21V when the device 
required 25V. It verified correctly in the EPROM programmer, but when 
installed in the system for which it was intended, the data read out was 
faulty.

That said, it seems to me that power failure during write causing 
excessive bitflips would be a problem with any flash, not just MLC or 
modern "unstable" SLC. Ivan, you said you've only seen it with modern 
flashes, not older SLC ?

/Ricard
-- 
Ricard Wolf Wanderlöf                           ricardw(at)axis.com
Axis Communications AB, Lund, Sweden            www.axis.com
Phone +46 46 272 2016                           Fax +46 46 13 61 30

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: UBIFS recovery fails
  2011-10-19  6:50             ` Ricard Wanderlof
@ 2011-10-19 10:22               ` Ivan Djelic
  2011-10-19 12:17                 ` Atlant Schmidt
  0 siblings, 1 reply; 24+ messages in thread
From: Ivan Djelic @ 2011-10-19 10:22 UTC (permalink / raw)
  To: Ricard Wanderlof
  Cc: linux-mtd@lists.infradead.org, Jean-Sébastien Gagnon

On Wed, Oct 19, 2011 at 07:50:42AM +0100, Ricard Wanderlof wrote:
> That said, it seems to me that power failure during write causing 
> excessive bitflips would be a problem with any flash, not just MLC or 
> modern "unstable" SLC. Ivan, you said you've only seen it with modern 
> flashes, not older SLC ?

Excessive bitflips after a power failure happen on all types of flash (SLC,
MLC, even NOR flash); it is an expected possible consequence of the power cut.
It is not really a problem as long as what you read in flash is _stable_.

On modern SLCs (at least I first saw it on 34 nm SLC flash), those bitflips
can be _unstable_, i.e. they can appear and disappear randomly as you read
pages. I experienced this phenomenon only on pages which were being programmed
or erased during a power cut.

BR,
--
Ivan

^ permalink raw reply	[flat|nested] 24+ messages in thread

* RE: UBIFS recovery fails
  2011-10-19 10:22               ` Ivan Djelic
@ 2011-10-19 12:17                 ` Atlant Schmidt
  2011-10-19 12:52                   ` Ricard Wanderlof
  0 siblings, 1 reply; 24+ messages in thread
From: Atlant Schmidt @ 2011-10-19 12:17 UTC (permalink / raw)
  To: 'Ivan Djelic', Ricard Wanderlof
  Cc: linux-mtd@lists.infradead.org, Jean-Sébastien Gagnon

All:

> On modern SLCs (at least I first saw it on 34 nm SLC flash), those bitflips
> can be _unstable_, i.e. they can appear and disappear randomly as you read
> pages. I experienced this phenomenon only on pages which were being programmed
> or erased during a power cut.

  This makes perfectly good sense. During erasing or programming,
  charge is being deposited-upon or removed from the floating gates
  and that's not an instantaneous process, so it can be interrupted
  while on-going, leaving a gate that's only half charged or half
  discharged.

  At that point, the floating gate may have charge on it that's
  all-too-near the threshold voltage for the cell and any given
  read of that cell could "go either way" depending on minute
  variations in other conditions.
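
  A toy Python model of that effect, purely as an illustration: the
  threshold, the charge levels and the Gaussian read noise below are
  made-up numbers, not measurements of any real NAND part.

```python
import random

THRESHOLD = 0.5  # assumed read threshold, arbitrary units


def read_cell(charge: float, rng: random.Random, noise: float = 0.05) -> int:
    """Read one cell: 0 if the sensed charge exceeds the threshold, else 1."""
    sensed = charge + rng.gauss(0.0, noise)
    return 0 if sensed > THRESHOLD else 1


rng = random.Random(0)  # fixed seed so the run is repeatable
programmed = {read_cell(1.0, rng) for _ in range(1000)}  # fully charged
erased = {read_cell(0.0, rng) for _ in range(1000)}      # fully discharged
marginal = {read_cell(0.5, rng) for _ in range(1000)}    # interrupted write

print(sorted(programmed), sorted(erased), sorted(marginal))
```

  The fully programmed and fully erased cells always read the same
  value, while the cell left at the threshold returns both 0 and 1
  over repeated reads.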

                                   Atlant


* RE: UBIFS recovery fails
  2011-10-19 12:17                 ` Atlant Schmidt
@ 2011-10-19 12:52                   ` Ricard Wanderlof
  2011-10-19 13:30                     ` Atlant Schmidt
  0 siblings, 1 reply; 24+ messages in thread
From: Ricard Wanderlof @ 2011-10-19 12:52 UTC (permalink / raw)
  To: Atlant Schmidt
  Cc: 'Ivan Djelic', linux-mtd@lists.infradead.org,
	Ricard Wanderlöf, Jean-Sébastien Gagnon


On Wed, 19 Oct 2011, Atlant Schmidt wrote:

> All:
>
>> On modern SLCs (at least I first saw it on 34 nm SLC flash), those bitflips
>> can be _unstable_, i.e. they can appear and disappear randomly as you read
>> pages. I experienced this phenomenon only on pages which were being programmed
>> or erased during a power cut.
>
>  This makes perfectly good sense. During erasing or programming,
>  charge is being deposited-upon or removed from the floating gates
>  and that's not an instantaneous process, so it can be interrupted
>  while on-going, leaving a gate that's only half charged or half
>  discharged.
>
>  At that point, the floating gate may have charge on it that's
>  all-too-near the threshold voltage for the cell and any given
>  read of that cell could "go either way" depending on minute
>  variations in other conditions.

That makes sense, but it doesn't explain why the effect is apparently
more pronounced on newer flashes than on older ones.

Of course, it could be that there is some other mechanism that also has 
had to be changed with shrinking geometries, such as for example (note: 
wildly speculating here) that the charge/discharge time for the individual 
bit cells is longer on modern flashes for whatever reason, causing them to 
be more sensitive to power cuts, as the programming operation (per bit) 
takes place over a longer time.

/Ricard
-- 
Ricard Wolf Wanderlöf                           ricardw(at)axis.com
Axis Communications AB, Lund, Sweden            www.axis.com
Phone +46 46 272 2016                           Fax +46 46 13 61 30

* RE: UBIFS recovery fails
  2011-10-19 12:52                   ` Ricard Wanderlof
@ 2011-10-19 13:30                     ` Atlant Schmidt
  0 siblings, 0 replies; 24+ messages in thread
From: Atlant Schmidt @ 2011-10-19 13:30 UTC (permalink / raw)
  To: 'Ricard Wanderlof'
  Cc: 'Ivan Djelic', linux-mtd@lists.infradead.org,
	Ricard Wanderlöf, Jean-Sébastien Gagnon

Ricard:

  One mechanism is that newer flashes simply store fewer
  electrons on the floating gate, so a few electrons one
  way or the other now make a bigger difference.

  I read an article from Siemens/Infineon a few years ago
  that was explaining how at the then-upcoming design node,
  the difference between a "1" and "0" was going to be less
  than 100 electrons!

  At those levels, there's not a lot of room for error
  in the deposited charge!

                               Atlant

* Re: UBIFS recovery fails
  2011-10-18  8:29   ` Ivan Djelic
@ 2011-10-19 15:15     ` Artem Bityutskiy
  2011-10-19 17:27       ` Ivan Djelic
  0 siblings, 1 reply; 24+ messages in thread
From: Artem Bityutskiy @ 2011-10-19 15:15 UTC (permalink / raw)
  To: Ivan Djelic; +Cc: linux-mtd@lists.infradead.org, Daniel Kuhn

On Tue, 2011-10-18 at 10:29 +0200, Ivan Djelic wrote:
> That's interesting... Do you have more details or any data on those eMMC
> power-cut failures ?

Not much. We were testing eMMC and trying to make sure that if we
sync the data and then have a power cut, we never lose the data which
was synced. We had a test which worked directly with the block device,
so no file-system was involved. In some cases, sectors which the eMMC
had reported as already written turned out to be corrupted. The vendor
later said that yes, there is a FW bug, and promised to fix it in the
next revision.

eMMC FW is written by humans as well :-)

> I plan to be working soon (December) on UBIFS robustness issues with unstable
> modern SLCs; besides using nandsim to simulate SLC (and maybe MLC) issues,
> I also have real hardware with a power-cutting framework ready for testing.

I suggest you improve the UBIFS power cut emulation functions and
make them emulate unstable bits, and then use integck, which is already
able to handle emulated power cuts. This will allow you to:

1. Test quickly
2. Continue the half-done work
3. Work with nicer code-base than ugly nandsim
4. Make it possible to emulate unstable bits in interesting places like
   TNC, LPT, orphans area, etc. Otherwise most of the failures will be
   emulated in data area.


Similarly, something like that should be done at the UBI level, which
would emulate power cuts _only_ when writing UBI-specific stuff (e.g.,
the headers).
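
To illustrate what "emulating unstable bits" could mean, here is a small
standalone sketch. It is not the UBIFS or UBI emulation code and does not
use any kernel interface; the class name UnstablePage, the helper
differing_bits and the flip_prob parameter are all hypothetical. The idea
is simply that a page interrupted by a power cut keeps a fixed set of
marginal bit positions, and every read may return a different subset of
them flipped.

```python
import random

class UnstablePage:
    """Toy model of a page whose programming was interrupted (hypothetical)."""

    def __init__(self, data, unstable_bits, flip_prob=0.5, seed=None):
        self.data = bytes(data)                      # the "intended" contents
        self.unstable_bits = sorted(unstable_bits)   # bit offsets into the page
        self.flip_prob = flip_prob                   # assumed per-read flip chance
        self.rng = random.Random(seed)

    def read(self):
        """Return the page contents with a random subset of unstable bits flipped."""
        buf = bytearray(self.data)
        for bit in self.unstable_bits:
            if self.rng.random() < self.flip_prob:
                buf[bit // 8] ^= 1 << (bit % 8)
        return bytes(buf)

def differing_bits(a, b):
    """Return the set of bit offsets at which two equal-length buffers differ."""
    return {
        i * 8 + j
        for i, (x, y) in enumerate(zip(a, b))
        for j in range(8)
        if (x ^ y) >> j & 1
    }

if __name__ == "__main__":
    page = UnstablePage(b"\xff" * 16, {3, 40, 77}, seed=1)
    for _ in range(5):
        print(sorted(differing_bits(page.read(), page.data)))
```

Consecutive reads of the same page can disagree with each other, which is
the property a test would need in order to exercise recovery paths that
read a damaged LEB more than once.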

Something on driver level can also be done later.

I know you are a driver guy and it is more natural for you to start from
the driver, but I suggest starting from UBIFS and fixing 90% of the issues
there, then going down. This way you will also isolate non-UBIFS specific
issues.

Anyway, we should start with _documenting_:
1. What unstable bits are.
2. What work UBIFS/UBI/MTD need to do to handle them.
3. What the MLC-specific issues are.
4. What would have to be done to handle them.

I have ideas about the paired pages in MLC.

But the thing also is that the whole stack is complex and big and
has a lot of states (like any FS), so it is easy to miss something and
you never know the complete list until you actually start stressing the
stack.

But let's document what we know at the moment. Then people who are
interested in having that fixed can start working on it.

-- 
Best Regards,
Artem Bityutskiy

* Re: UBIFS recovery fails
  2011-10-19 15:15     ` Artem Bityutskiy
@ 2011-10-19 17:27       ` Ivan Djelic
  0 siblings, 0 replies; 24+ messages in thread
From: Ivan Djelic @ 2011-10-19 17:27 UTC (permalink / raw)
  To: Artem Bityutskiy; +Cc: linux-mtd@lists.infradead.org, Daniel Kuhn

On Wed, Oct 19, 2011 at 04:15:15PM +0100, Artem Bityutskiy wrote:
> I suggest you improve the UBIFS power cut emulation functions and
> make them emulate unstable bits, and then use integck, which is already
> able to handle emulated power cuts. This will allow you to:
> 
> 1. Test quickly
> 2. Continue the half-done work
> 3. Work with nicer code-base than ugly nandsim
> 4. Make it possible to emulate unstable bits in interesting places like
>    TNC, LPT, orphans area, etc. Otherwise most of the failures will be
>    emulated in data area.
> 
> 
> Similarly, something like that should be done at the UBI level, which
> would emulate power cuts _only_ when writing UBI-specific stuff (e.g.,
> the headers).
 
My first hope was to guarantee stable data at the UBI level, as this would
also secure raw UBI storage. But I have not looked into this seriously yet.

> I know you are a driver guy and it is more natural for you to start from
> the driver, but I suggest starting from UBIFS and fixing 90% of the issues
> there, then going down. This way you will also isolate non-UBIFS specific
> issues.

OK. I also work on filesystem code, so I'm not really obsessed with drivers :-)
 
> Anyway, we should start with _documenting_:
> 1. What unstable bits are.
> 2. What work UBIFS/UBI/MTD need to do to handle them.
> 3. What the MLC-specific issues are.
> 4. What would have to be done to handle them.
> 
> I have ideas about the paired pages in MLC.
> 
> But the thing also is that the whole stack is complex and big and
> has a lot of states (like any FS), so it is easy to miss something and
> you never know the complete list until you actually start stressing the
> stack.
> 
> But let's document what we know at the moment. Then people who are
> interested in having that fixed can start working on it.

OK, sounds good to me.

BR,
--
Ivan

* Re: UBIFS recovery fails
  2011-10-18 14:54     ` Ivan Djelic
  2011-10-18 15:10       ` Jean-Sébastien Gagnon
@ 2011-10-20 14:14       ` Artem Bityutskiy
  1 sibling, 0 replies; 24+ messages in thread
From: Artem Bityutskiy @ 2011-10-20 14:14 UTC (permalink / raw)
  To: Ivan Djelic, Daniel Kuhn
  Cc: linux-mtd@lists.infradead.org, Jean-Sébastien Gagnon

Hi,

I have just updated UBI/UBIFS docs. I was writing quickly because I do
not have much time, so _please_, review and send me corrections
(preferably patches against mtd-www:
http://git.infradead.org/mtd-www.git).

Review these:
http://www.linux-mtd.infradead.org/doc/ubifs.html#L_powercut
http://www.linux-mtd.infradead.org/doc/ubifs.html#L_ubifs_mlc
http://www.linux-mtd.infradead.org/doc/ubifs.html#L_unstable_bits

Thank you!

P.S. you may want to press "refresh" button to force your browser to
re-download the web pages.

-- 
Best Regards,
Artem Bityutskiy

* Re: UBIFS recovery fails
  2011-10-18  8:11   ` Ricard Wanderlof
  2011-10-18  8:42     ` Ivan Djelic
@ 2011-10-20 16:36     ` Artem Bityutskiy
  1 sibling, 0 replies; 24+ messages in thread
From: Artem Bityutskiy @ 2011-10-20 16:36 UTC (permalink / raw)
  To: Ricard Wanderlof; +Cc: linux-mtd@lists.infradead.org

On Tue, 2011-10-18 at 10:11 +0200, Ricard Wanderlof wrote:
> On Mon, 17 Oct 2011, Artem Bityutskiy wrote:
> 
> > [...] Unfortunately, no one really invested time into making UBIFS 
> > support MLC very well. It needs some more work. It also has some issues 
> > related to unstable bits in modern SLC.
> >
> > In short - if you want to use UBIFS on MLC - you should not have unclean
> > reboots. If you want to make UBIFS 100% unclean-reboot safe on MLC - you
> > need to work on it some more.
> >
> > We (the original authors) developed and tested it on very robust SLC
> > NAND.
> 
> Do you have any specifics on what the issues are with MLC ?

Yeah, they are described on the web site; please send patches if you
have something to add!

> Since UBI implements bit scrubbing and eraseblock torture on questionable 
> blocks it would seem that a lot of the work has been done in order to 
> handle unstable bits. Is there some issue which is related specifically to 
> bit flips happening while the system is powered off?

The definition of unstable bits in the MTD community is probably not what
you thought of. I've added docs on the web site as well :-)

-- 
Best Regards,
Artem Bityutskiy

* Re: UBIFS recovery fails
  2011-10-18  8:42     ` Ivan Djelic
@ 2011-10-20 16:37       ` Artem Bityutskiy
  0 siblings, 0 replies; 24+ messages in thread
From: Artem Bityutskiy @ 2011-10-20 16:37 UTC (permalink / raw)
  To: Ivan Djelic; +Cc: linux-mtd@lists.infradead.org, Ricard Wanderlof

On Tue, 2011-10-18 at 10:42 +0200, Ivan Djelic wrote:
> Besides unstable bits, lower endurance, higher ecc requirements and NOP1, there
> is a "page pairing" phenomenon which is not handled by UBI/UBIFS AFAIK.
> When you cut power during a page write, you may lose data in another (paired)
> previously programmed page.

Yeah, I covered this on the web site and gave an idea of what UBIFS could
do to handle it.

-- 
Best Regards,
Artem Bityutskiy

* RE: UBIFS recovery fails
  2011-10-18 15:10       ` Jean-Sébastien Gagnon
  2011-10-18 15:32         ` Ivan Djelic
@ 2011-10-20 16:43         ` Artem Bityutskiy
  2011-10-24  7:00           ` Ricard Wanderlof
  1 sibling, 1 reply; 24+ messages in thread
From: Artem Bityutskiy @ 2011-10-20 16:43 UTC (permalink / raw)
  To: Jean-Sébastien Gagnon, Ivan Djelic; +Cc: linux-mtd

On Tue, 2011-10-18 at 11:10 -0400, Jean-Sébastien Gagnon wrote:
> If you cut power during a page programming operation, you can easily get more
> unstable bits than what the manufacturer-specified ecc supports (for instance,
> 3 unstable bits on a 1bit-ecc device). We experienced this on several different
> devices.
> Having a lot of bitflips (more than what ecc supports) is not the problem here:
> the page was indeed partially programmed, it contains garbage and its contents
> should be discarded.
> 
> The real problem appears when those faulty bits are unstable: during the first
> few read attempts, the page may be successfully read (possibly with ecc
> corrections); and then, a bit later, the page becomes unreadable because of too
> many faulty bits.

Right. If you first get a correctable bit-flip, UBI will schedule this
PEB for scrubbing. When the background thread starts scrubbing, it will
read the PEB _again_, and this time it can end up with an uncorrectable
ECC error.

This is probably also a good point: when UBIFS recovers, it should
probably somehow ask UBI to _not_ do scrubbing, to avoid failures at UBI
level if UBI decides to scrub a PEB before UBIFS erase-cycles it.

IOW, the whole stack (not only UBIFS) should make sure the PEBs with
unstable bits are read only once.
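
A back-of-the-envelope sketch of why re-reading such a PEB is risky. It
assumes, purely for illustration, that each unstable bit flips
independently with the same probability on every read and that reads are
independent of each other; real flash need not behave that way. The
function names and the flip_prob value are hypothetical. The "3 unstable
bits on a 1bit-ecc device" case is the one quoted above.

```python
from math import comb

def p_read_ok(unstable_bits, ecc_strength, flip_prob):
    """Probability that at most ecc_strength unstable bits flip in one read."""
    return sum(
        comb(unstable_bits, k)
        * flip_prob ** k
        * (1 - flip_prob) ** (unstable_bits - k)
        for k in range(ecc_strength + 1)
    )

def p_all_reads_ok(unstable_bits, ecc_strength, flip_prob, reads):
    """Probability that every one of `reads` independent reads is correctable."""
    return p_read_ok(unstable_bits, ecc_strength, flip_prob) ** reads

if __name__ == "__main__":
    # 3 unstable bits, 1-bit ECC, assumed 50% flip chance per bit per read
    for reads in (1, 2, 4):
        print(reads, p_all_reads_ok(3, 1, 0.5, reads))
```

Under these assumptions a single read is correctable with probability
0.5, and two reads in a row (for example the first read plus the
scrubbing read) are both correctable with probability 0.25, so every
extra read of the PEB lowers the chance of getting through cleanly.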

-- 
Best Regards,
Artem Bityutskiy

* RE: UBIFS recovery fails
  2011-10-20 16:43         ` Artem Bityutskiy
@ 2011-10-24  7:00           ` Ricard Wanderlof
  2011-10-29 19:43             ` Artem Bityutskiy
  0 siblings, 1 reply; 24+ messages in thread
From: Ricard Wanderlof @ 2011-10-24  7:00 UTC (permalink / raw)
  To: Artem Bityutskiy
  Cc: Ivan Djelic, linux-mtd@lists.infradead.org,
	Jean-Sébastien Gagnon


On Thu, 20 Oct 2011, Artem Bityutskiy wrote:

>> The real problem appears when those faulty bits are unstable: during 
>> the first few read attempts, the page may be successfully read 
>> (possibly with ecc corrections); and then, a bit later, the page 
>> becomes unreadable because of too many faulty bits.
>
> Right. If you first get a correctable bit-flip, UBI will schedule this
> PEB for scrubbing. When the background thread starts scrubbing, it will
> read the PEB _again_, and this time it can end up with an uncorrectable
> ECC error.
>
> This is probably also a good point: when UBIFS recovers, it should
> probably somehow ask UBI to _not_ do scrubbing, to avoid failures at UBI
> level if UBI decides to scrub a PEB before UBIFS erase-cycles it.
>
> IOW, the whole stack (not only UBIFS) should make sure the PEBs with
> unstable bits are read only once.

How would UBI+ubifs compare to jffs2 in this respect? I would naïvely 
assume that since jffs2 doesn't do any scrubbing, it would just return ok 
if the data happened to be read correctly that time, and something like an 
I/O error if the data was faulty, but without taking any special action in 
that case?

/Ricard
-- 
Ricard Wolf Wanderlöf                           ricardw(at)axis.com
Axis Communications AB, Lund, Sweden            www.axis.com
Phone +46 46 272 2016                           Fax +46 46 13 61 30

* RE: UBIFS recovery fails
  2011-10-24  7:00           ` Ricard Wanderlof
@ 2011-10-29 19:43             ` Artem Bityutskiy
  0 siblings, 0 replies; 24+ messages in thread
From: Artem Bityutskiy @ 2011-10-29 19:43 UTC (permalink / raw)
  To: Ricard Wanderlof
  Cc: Ivan Djelic, linux-mtd@lists.infradead.org,
	Jean-Sébastien Gagnon

On Mon, 2011-10-24 at 09:00 +0200, Ricard Wanderlof wrote:
> How would UBI+ubifs compare to jffs2 in this respect? I would naïvely 
> assume that since jffs2 doesn't do any scrubbing, it would just return ok 
> if the data happened to be read correctly that time, and something like an 
> I/O error if the data was faulty, but without taking any special action in 
> that case?

Yes, I think so.

Artem.

Thread overview: 24+ messages
2011-10-17 13:29 UBIFS recovery fails Daniel Kuhn
2011-10-17 20:17 ` Artem Bityutskiy
2011-10-18  8:11   ` Ricard Wanderlof
2011-10-18  8:42     ` Ivan Djelic
2011-10-20 16:37       ` Artem Bityutskiy
2011-10-20 16:36     ` Artem Bityutskiy
2011-10-18  8:29   ` Ivan Djelic
2011-10-19 15:15     ` Artem Bityutskiy
2011-10-19 17:27       ` Ivan Djelic
2011-10-18 12:47   ` Jean-Sébastien Gagnon
2011-10-18 14:54     ` Ivan Djelic
2011-10-18 15:10       ` Jean-Sébastien Gagnon
2011-10-18 15:32         ` Ivan Djelic
2011-10-18 16:05           ` Jean-Sébastien Gagnon
2011-10-19  6:50             ` Ricard Wanderlof
2011-10-19 10:22               ` Ivan Djelic
2011-10-19 12:17                 ` Atlant Schmidt
2011-10-19 12:52                   ` Ricard Wanderlof
2011-10-19 13:30                     ` Atlant Schmidt
2011-10-20 16:43         ` Artem Bityutskiy
2011-10-24  7:00           ` Ricard Wanderlof
2011-10-29 19:43             ` Artem Bityutskiy
2011-10-20 14:14       ` Artem Bityutskiy
2011-10-18 15:29   ` Daniel Kuhn
