public inbox for linux-mtd@lists.infradead.org
* [BUG] JFFS2 power loss recovery issues on NAND
@ 2008-06-10 13:57 Alexey Korolev
  2008-06-17  1:03 ` Iwo Mergler
  0 siblings, 1 reply; 9+ messages in thread
From: Alexey Korolev @ 2008-06-10 13:57 UTC (permalink / raw)
  To: dwmw2; +Cc: joern, linux-mtd

David,

As I promised, we investigated the JFFS2 power loss issues and found
what exactly caused the data corruption. The issue occurs not only on
recent kernels (>2.6.25) but on old ones as well, so I created a new
thread.

The problem occurs when power is lost while the ECC bytes are being
written to NAND.
As we know, one page of a NAND device has 8 SW ECC regions. Assume
we have written the NAND main area correctly but lost power during the
write of the first 3 ECC bytes. For the first ECC region the algorithm
detects that the checksum does not match, but it reports a one-bit error
(it is common for an ECC algorithm to misreport a 1-bit error when the
actual number of error bits is large). Since ECC reports a one-bit error,
it "corrects" one bit in the NAND main area (in the first 256 bytes). For
the other regions the ECC algorithm returned 2-bit errors and did not
perform any correction.
JFFS2 ignores read errors from NAND since it has its own CRC. On an
attempt to read a fragment from the first 256 bytes, JFFS2 detects a CRC
error because this region has been improperly corrected, and considers
the node invalid. The rest of the data on the page is considered good.
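
A toy model may make the miscorrection concrete. This is only a sketch:
the bit layout is simplified and is not the kernel's software ECC layout,
the names `hamming_ecc`, `check` and `ERASED_ECC` are invented for
illustration, and it assumes that ECC bits which were never programmed
read back as ones. Under those assumptions, a 256-byte region whose data
was written but whose ECC was not can be reported as a single-bit error
at an unrelated address, while another region is reported as
uncorrectable.

```python
from typing import Optional, Tuple

ADDR_MASK = 0x7FF  # 11 address bits: 8 select the byte, 3 select the bit


def hamming_ecc(data: bytes) -> Tuple[int, int]:
    """Return (odd, even) parity words for one 256-byte region.

    Bit k of `odd` is the parity of all set data bits whose address has
    bit k set; bit k of `even` covers those whose address has bit k clear.
    """
    odd = even = 0
    for byte_addr, byte in enumerate(data):
        for bit in range(8):
            if (byte >> bit) & 1:
                addr = (byte_addr << 3) | bit
                odd ^= addr
                even ^= ~addr & ADDR_MASK
    return odd, even


def check(data: bytes, stored: Tuple[int, int]) -> Tuple[str, Optional[int]]:
    """Classify a read the way a single-error-correcting Hamming code does."""
    odd, even = hamming_ecc(data)
    s_odd, s_even = odd ^ stored[0], even ^ stored[1]
    if s_odd == 0 and s_even == 0:
        return "ok", None
    if s_odd ^ s_even == ADDR_MASK:
        # Every parity pair disagrees in exactly one half: this looks like
        # one flipped data bit, and s_odd is taken as its address.
        return "single-bit", s_odd
    if bin(s_odd).count("1") + bin(s_even).count("1") == 1:
        return "ecc-bit", None
    return "uncorrectable", None


# Assumption: ECC that was never programmed reads back as all ones.
ERASED_ECC = (ADDR_MASK, ADDR_MASK)

region = bytearray(256)
region[5] = 0x01  # data fully written, odd number of set bits
status, addr = check(bytes(region), ERASED_ECC)
print(status, addr >> 3, addr & 7)  # single-bit 250 7
```

In this model the data in byte 5 is intact, yet the check asks for bit 7
of byte 250 to be flipped. A region with an even number of set bits and
the same erased ECC comes out as "uncorrectable" instead, which matches
the mixed result described above.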

So if power is lost while we are writing a new file and there are several
nodes in the page, we may end up with a hole in the middle of the file
after power loss. It is a bug.

The attached picture may explain the issue better. 


So for now it is clear how JFFS2 fails. It is not obvious how to fix it.
Do you have any suggestions or ideas on how it could be fixed?
Would it be a good idea to hack JFFS2 so that, after a failed read, it
reads the data one more time but without ECC correction?


Thanks,
Alexey

[-- Attachment #2: explanation of power loss issue --]
[-- Type: IMAGE/gif, Size: 54535 bytes --]

* Re: [BUG] JFFS2 power loss recovery issues on NAND
  2008-06-10 13:57 [BUG] JFFS2 power loss recovery issues on NAND Alexey Korolev
@ 2008-06-17  1:03 ` Iwo Mergler
  2008-06-17  8:13   ` Matthieu CASTET
  0 siblings, 1 reply; 9+ messages in thread
From: Iwo Mergler @ 2008-06-17  1:03 UTC (permalink / raw)
  To: linux-mtd; +Cc: joern, dwmw2, Alexey Korolev

Alexey Korolev wrote:
> JFFS2 ignores read errors from NAND since it has own CRC. On attempt to
> read fragment from first 256 bytes JFFS2 detects CRC error as this
> region has been improperly corrected and considers
> node as invalid. Rest data on page is considered as good. 
>
> So if we write new file during power loss and we have several nodes in
> page, we may face bad case with hole in the middle of the file after
> power loss. It is a bug. 
>
> The attached picture may explain the issue better. 
>
>
> So for now it is clear how JFFS2 fails. It is not obvious how to fix it. 
> Do you have any suggestions or ideas how it could be fixed?
> Would it be a good idea do hack JFFS2 in order to read data one more
> time but without ECC correction in case of failed read?
>   
Alexey,

I know of at least one hardware ECC implementation which can flag
errors within the ECC bytes separately. In other words, not all
implementations will detect/correct bit errors in the case of an ECC
write error.

About how to fix it - what about making the reading without ECC the default
and only re-reading with ECC if JFFS2 finds an invalid checksum?

Kind regards,

Iwo

* Re: [BUG] JFFS2 power loss recovery issues on NAND
  2008-06-17  1:03 ` Iwo Mergler
@ 2008-06-17  8:13   ` Matthieu CASTET
  2008-06-17  9:24     ` Jörn Engel
  2008-06-17 23:51     ` Iwo Mergler
  0 siblings, 2 replies; 9+ messages in thread
From: Matthieu CASTET @ 2008-06-17  8:13 UTC (permalink / raw)
  To: Iwo Mergler; +Cc: dwmw2, joern, linux-mtd, Alexey Korolev

Iwo Mergler wrote:
> Alexey Korolev wrote:
>>   
> Alexey,
> 
> I know of at least one hardware ECC implementation which can flag
> errors within the ECC bytes separately. In other words, not all 
> implementations
> will detect/correct bit errors in the case of a ECC write error.
> 
> About how to fix it - what about making the reading without ECC the default
> and only re-reading with ECC if JFFS2 finds an invalid checksum?
> 
But what happens if a bit flip happens?
If we do this, ECC won't correct it, and the error can happen anywhere,
not only in the checksum: for example, a wrong nodetype.

If only we could put a marker after the ECC data, we could detect such a
case. But Linux puts the ECC at the end of the OOB.

Matthieu

* Re: [BUG] JFFS2 power loss recovery issues on NAND
  2008-06-17  8:13   ` Matthieu CASTET
@ 2008-06-17  9:24     ` Jörn Engel
  2008-06-17 16:00       ` Alexey Korolev
  2008-06-17 23:51     ` Iwo Mergler
  1 sibling, 1 reply; 9+ messages in thread
From: Jörn Engel @ 2008-06-17  9:24 UTC (permalink / raw)
  To: Matthieu CASTET; +Cc: linux-mtd, dwmw2, Alexey Korolev, Iwo Mergler

On Tue, 17 June 2008 10:13:06 +0200, Matthieu CASTET wrote:
> Iwo Mergler wrote:
> > 
> > About how to fix it - what about making the reading without ECC the default
> > and only re-reading with ECC if JFFS2 finds an invalid checksum?
> > 
> But what happens if a bit flip happens?

The exact same problem.

Changing the default is bogus.  Re-reading without ecc on invalid
checksums may be the only solution for jffs2, though.  Either that or
close your eyes and claim that holes in files can happen on power loss.
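
A minimal sketch of that re-read fallback, under stated assumptions: this
is not JFFS2 code, `read_node`, `read_with_ecc` and `read_raw` are
hypothetical names, and `zlib.crc32` merely stands in for the node CRC.
The ECC-corrected read stays the default, so ordinary bit flips are still
corrected; the raw read is tried only when the CRC rejects the corrected
data.

```python
import zlib
from typing import Callable, Optional


def read_node(read_with_ecc: Callable[[], bytes],
              read_raw: Callable[[], bytes],
              expected_crc: int) -> Optional[bytes]:
    """Return node data that matches its CRC, or None if no read does."""
    data = read_with_ecc()
    if zlib.crc32(data) == expected_crc:
        return data  # normal case: ECC did its job, or nothing was wrong

    # The CRC failed after ECC "correction". The ECC bytes themselves may
    # be damaged, so try the uncorrected main-area data before giving up.
    raw = read_raw()
    if zlib.crc32(raw) == expected_crc:
        return raw
    return None  # genuinely bad node


# Simulated flash: the main area is intact, but ECC flipped a good bit.
good = b"jffs2 node payload"
crc = zlib.crc32(good)
miscorrected = bytes([good[0] ^ 0x80]) + good[1:]
print(read_node(lambda: miscorrected, lambda: good, crc) == good)  # True
```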

Jörn

-- 
Fancy algorithms are buggier than simple ones, and they're much harder
to implement. Use simple algorithms as well as simple data structures.
-- Rob Pike

* Re: [BUG] JFFS2 power loss recovery issues on NAND
  2008-06-17  9:24     ` Jörn Engel
@ 2008-06-17 16:00       ` Alexey Korolev
  2008-06-17 16:57         ` Jörn Engel
  0 siblings, 1 reply; 9+ messages in thread
From: Alexey Korolev @ 2008-06-17 16:00 UTC (permalink / raw)
  To: Jörn Engel; +Cc: dwmw2, linux-mtd, Iwo Mergler, Matthieu CASTET

Hi, 

> > > 
> > But what happens if a bit flip happens?
> 
> The exact same problem.
> 
> Changing the default is bogus.  Re-reading without ecc on invalid
> checksums may be the only solution for jffs2, though.  Either that or
> close your eyes and claim that holes in files can happen on power loss.
> 
Correct, it is better not to touch the OOB area at all since it may break
compatibility.

Declaring a bug to be a feature would not be so good. But the issue is rare. I think if we get some time we will fix it in the described way.

Thanks,
Alexey

* Re: [BUG] JFFS2 power loss recovery issues on NAND
  2008-06-17 16:00       ` Alexey Korolev
@ 2008-06-17 16:57         ` Jörn Engel
  0 siblings, 0 replies; 9+ messages in thread
From: Jörn Engel @ 2008-06-17 16:57 UTC (permalink / raw)
  To: Alexey Korolev; +Cc: linux-mtd, dwmw2, Matthieu CASTET, Iwo Mergler

On Tue, 17 June 2008 17:00:42 +0100, Alexey Korolev wrote:
> 
> Declare bug as a feature could be not so good. But the issue is rare. I think if we got some time we will fix it in the described way.

[ You really should switch to a decent mailer that e.g. breaks lines
  before column 136. ;) ]

For 'normal' filesystems, this bug is solved in __sync_single_inode by
writing out pages first and the inode later.  Works well enough for
ext[23] at least.

So I agree, JFFS2 should behave at least as well.

Jörn

-- 
Simplicity is prerequisite for reliability.
-- Edsger W. Dijkstra

* Re: [BUG] JFFS2 power loss recovery issues on NAND
  2008-06-17  8:13   ` Matthieu CASTET
  2008-06-17  9:24     ` Jörn Engel
@ 2008-06-17 23:51     ` Iwo Mergler
  2008-06-18 12:19       ` Jamie Lokier
  1 sibling, 1 reply; 9+ messages in thread
From: Iwo Mergler @ 2008-06-17 23:51 UTC (permalink / raw)
  To: Matthieu CASTET; +Cc: dwmw2, joern, linux-mtd, Alexey Korolev

Matthieu CASTET wrote:
> Iwo Mergler wrote:
>> Alexey Korolev wrote:
>>>   
>> Alexey,
>>
>> I know of at least one hardware ECC implementation which can flag
>> errors within the ECC bytes separately. In other words, not all 
>> implementations
>> will detect/correct bit errors in the case of a ECC write error.
>>
>> About how to fix it - what about making the reading without ECC the 
>> default
>> and only re-reading with ECC if JFFS2 finds an invalid checksum?
>>
> But what happens if a bit flip happens?
> If we do this, ecc won't correct it, the error can happen everywhere 
> not only in checksum, for example wrong nodetype.
Forgive my ignorance - does that mean that not everything in JFFS2 is 
CRC protected?

If that is the case, forget my suggestion. I don't know JFFS2 that 
intimately. :-)

Kind regards,

Iwo

* Re: [BUG] JFFS2 power loss recovery issues on NAND
  2008-06-17 23:51     ` Iwo Mergler
@ 2008-06-18 12:19       ` Jamie Lokier
  2008-06-18 12:33         ` David Woodhouse
  0 siblings, 1 reply; 9+ messages in thread
From: Jamie Lokier @ 2008-06-18 12:19 UTC (permalink / raw)
  To: Iwo Mergler; +Cc: linux-mtd, joern, dwmw2, Alexey Korolev, Matthieu CASTET

Iwo Mergler wrote:
> Forgive my ignorance - does that mean that not everything in JFFS2 is 
> CRC protected?

I think every _individual_ record is CRC protected in JFFS2.

_But_ that doesn't always detect file corruption.  If JFFS2 records
don't match their CRC, they are treated as if not there.

If there are corrupt data records, that causes holes in files.  There's
no I/O error reported to the application, just blocks of zero bytes.

There's no mechanism in JFFS2 to detect that.  It requires checksums
at a higher level than individual records.
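
To illustrate the behaviour just described, here is a toy model of
assembling a file from records. It is a sketch, not JFFS2's actual
read path: `Node` and `assemble` are made-up names, and `zlib.crc32`
stands in for the record CRC.

```python
import zlib
from typing import List, NamedTuple


class Node(NamedTuple):
    offset: int  # position of this record's data within the file
    data: bytes
    crc: int     # CRC stored alongside the data


def assemble(nodes: List[Node], size: int) -> bytes:
    """Build file contents from records, silently dropping bad ones."""
    out = bytearray(size)  # ranges no valid record covers stay zero
    for node in nodes:
        if zlib.crc32(node.data) != node.crc:
            continue  # CRC mismatch: treated as if the record is not there
        out[node.offset:node.offset + len(node.data)] = node.data
    return bytes(out)


nodes = [
    Node(0, b"AAAA", zlib.crc32(b"AAAA")),
    Node(4, b"BxBB", zlib.crc32(b"BBBB")),  # data no longer matches its CRC
    Node(8, b"CCCC", zlib.crc32(b"CCCC")),
]
print(assemble(nodes, 12))  # b'AAAA\x00\x00\x00\x00CCCC'
```

Each record passes or fails on its own, so nothing in this model notices
that the middle of the file has gone missing.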

-- Jamie

* Re: [BUG] JFFS2 power loss recovery issues on NAND
  2008-06-18 12:19       ` Jamie Lokier
@ 2008-06-18 12:33         ` David Woodhouse
  0 siblings, 0 replies; 9+ messages in thread
From: David Woodhouse @ 2008-06-18 12:33 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: joern, linux-mtd, Matthieu CASTET, Iwo Mergler, Alexey Korolev

On Wed, 2008-06-18 at 13:19 +0100, Jamie Lokier wrote:
> Iwo Mergler wrote:
> > Forgive my ignorance - does that mean that not everything in JFFS2 is 
> > CRC protected?
> 
> I think every _individual_ record is CRC protected in JFFS2
> 
> _But_ that doesn't always detect file corruption.  If JFFS2 records
> don't match their CRC, they are treated as if not there.
> 
> If there's corrupt data records, that causes holes in files.  There's
> no I/O error reported to the application, just blocks of zero bytes.
> 
> There's no mechanism in JFFS2 to detect that.  It requires checksums
> at a higher level

Well, in the case where a given region of the file is actually _absent_,
rather than the 'hole' being filled by an earlier version of the data
for that range of the file, we _could_ at least bitch about it, or maybe
even return -EIO. It should never happen, because of the way we 'fill'
holes in files.

It doesn't solve the real issue at hand though.
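
A sketch of that "complain or return -EIO" check, assuming the valid
nodes of a file can be reduced to half-open `(start, end)` byte ranges.
The names `uncovered_ranges` and `check_coverage` are hypothetical and
nothing here is taken from the JFFS2 source.

```python
import errno
from typing import Iterable, List, Tuple

Extent = Tuple[int, int]  # half-open byte range [start, end)


def uncovered_ranges(extents: Iterable[Extent], size: int) -> List[Extent]:
    """Return the parts of [0, size) that no valid node covers."""
    gaps: List[Extent] = []
    pos = 0  # everything before pos is known to be covered
    for start, end in sorted(extents):
        if start > pos:
            gaps.append((pos, min(start, size)))
        pos = max(pos, end)
        if pos >= size:
            break
    if pos < size:
        gaps.append((pos, size))
    return gaps


def check_coverage(extents: Iterable[Extent], size: int) -> None:
    """Raise EIO if some region of the file is actually absent."""
    gaps = uncovered_ranges(extents, size)
    if gaps:
        raise OSError(errno.EIO, "file ranges with no valid node: %r" % gaps)


print(uncovered_ranges([(0, 4), (8, 12)], 12))  # [(4, 8)]
```

A range that is still covered by an older version of the data does not
show up as a gap here, so this only catches the absent-region case.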

-- 
dwmw2
