ubifs_decompress: cannot decompress ...

linux-mtd.lists.infradead.org archive mirror
 help / color / mirror / Atom feed

* ubifs_decompress: cannot decompress ...
@ 2011-05-27 21:12 Matthew L. Creech
  2011-05-30 12:29 ` Ben Gardiner
  2011-06-01  7:48 ` Artem Bityutskiy
  0 siblings, 2 replies; 25+ messages in thread
From: Matthew L. Creech @ 2011-05-27 21:12 UTC (permalink / raw)
  To: MTD list

Hi,

We encountered a device which has the following error:

UBIFS error (pid 428): ubifs_decompress: cannot decompress 1010 bytes,
compressor lzo, error -22
UBIFS error (pid 428): read_block: bad data node (block 388, inode 556196)
UBIFS error (pid 428): do_readpage: cannot read page 388 of inode
556196, error -22
UBIFS error (pid 428): ubifs_decompress: cannot decompress 1010 bytes,
compressor lzo, error -22
UBIFS error (pid 428): read_block: bad data node (block 388, inode 556196)
UBIFS error (pid 428): do_readpage: cannot read page 388 of inode
556196, error -22

The device is still bootable (though not functional), so I can get
more debug info if needed.  Any ideas on what could cause this?

- MPC8313 (using fsl_elbc_nand driver)
- 1GB NAND flash (partitioned into 5MB for U-Boot, remainder for UBI/UBIFS)
- 2.6.39 kernel

Thanks!

-- 
Matthew L. Creech

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: ubifs_decompress: cannot decompress ...
  2011-05-27 21:12 ubifs_decompress: cannot decompress Matthew L. Creech
@ 2011-05-30 12:29 ` Ben Gardiner
  2011-05-31 15:47   ` Matthew L. Creech
  2011-06-01  7:48 ` Artem Bityutskiy
  1 sibling, 1 reply; 25+ messages in thread
From: Ben Gardiner @ 2011-05-30 12:29 UTC (permalink / raw)
  To: Matthew L. Creech; +Cc: MTD list

Hi Matthew,

On Fri, May 27, 2011 at 5:12 PM, Matthew L. Creech <mlcreech@gmail.com> wrote:
> We encountered a device which has the following error:
>
> UBIFS error (pid 428): ubifs_decompress: cannot decompress 1010 bytes,
> compressor lzo, error -22
> UBIFS error (pid 428): read_block: bad data node (block 388, inode 556196)
> UBIFS error (pid 428): do_readpage: cannot read page 388 of inode
> 556196, error -22
> UBIFS error (pid 428): ubifs_decompress: cannot decompress 1010 bytes,
> compressor lzo, error -22
> UBIFS error (pid 428): read_block: bad data node (block 388, inode 556196)
> UBIFS error (pid 428): do_readpage: cannot read page 388 of inode
> 556196, error -22

It looks like the lzo.c decompress function lzo_decompress() maps all
non-ok return codes from lzo1x_decompress_safe() to -EINVAL (-22). So
any one of the possible failures of the decompressor could cause the
error you are observing.

I don't see any debug statements in lzo1x_decompress_safe() that can
be enabled, so you might want to add some printing to
lzo1x_decompress_safe() or at least print the non-ok return code of
lzo1x_decompress_safe() in lzo_decompress() to get an idea of how the
decompressor is failing.

Best Regards,
Ben Gardiner

---
Nanometrics Inc.
http://www.nanometrics.ca

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: ubifs_decompress: cannot decompress ...
  2011-05-30 12:29 ` Ben Gardiner
@ 2011-05-31 15:47   ` Matthew L. Creech
  2011-05-31 16:10     ` Ben Gardiner
  2011-06-01  8:02     ` Artem Bityutskiy
  0 siblings, 2 replies; 25+ messages in thread
From: Matthew L. Creech @ 2011-05-31 15:47 UTC (permalink / raw)
  To: Ben Gardiner; +Cc: MTD list

On Mon, May 30, 2011 at 8:29 AM, Ben Gardiner
<bengardiner@nanometrics.ca> wrote:
>
> I don't see any debug statements in lzo1x_decompress_safe() that can
> be enabled, so you might want to add some printing to
> lzo1x_decompress_safe() or at least print the non-ok return code of
> lzo1x_decompress_safe() in lzo_decompress() to get an idea of how the
> decompressor is failing.
>

Looks like it's returning LZO_E_LOOKBEHIND_OVERRUN.  I don't know what
that indicates, but there is trailing 0xff data in the block to be
decompressed if that matters:

XXXX: LZO_E_LOOKBEHIND_OVERRUN
UBIFS error (pid 428): ubifs_decompress: cannot decompress 1010 bytes,
compressor lzo, error -22
00000000: 00 0f 69 6e 3a 61 74 74 72 00 c2 38 1c 03 39 03  ..in:attr..8..9.
00000010: 2f 30 31 2f 6d 57 43 2f 2e 56 61 6c 75 65 4d 61  /01/mWC/.ValueMa
00000020: 78 3a 61 91 03 94 31 72 00 69 6e d0 03 00 01 37  x:a...1r.in....7
00000030: 1c 03 3d 01 2f 53 65 72 69 61 6c 4e 75 6d 62 65  ..=./SerialNumbe
00000040: 72 2f 98 07 98 03 00 0a 04 2f 03 63 01 2f 53 6f  r/......./.c./So
00000050: 66 74 77 61 72 65 2f 43 6f 6d 6d 61 6e 64 73 2f  ftware/Commands/
00000060: 6e 65 78 74 d2 01 49 44 29 bc 00 03 06 47 04 81  next..ID)....G..
00000070: 0d 03 28 c0 00 00 04 44 61 74 61 20 53 65 72 76  ..(....Data Serv
00000080: 65 72 73 2f 42 41 43 6e 65 74 2d 49 50 c4 01 0b  ers/BACnet-IP...
00000090: 44 65 76 69 63 65 49 6e 73 74 61 6e 63 65 29 14  DeviceInstance).
000000a0: 01 05 00 f7 28 41 04 81 03 02 20 0c 1f 01 4d 41  ....(A.... ...MA
000000b0: 43 29 08 01 03 02 19 42 04 81 05 20 0d 07 01 4e  C).....B... ...N
000000c0: 61 6d 2a 14 02 02 02 18 3c 03 7b 20 02 04 01 07  am*.....<.{ ....
000000d0: 42 4d 44 41 64 64 72 65 73 73 2a fc 01 02 1d 40  BMDAddress*....@
000000e0: 04 81 01 20 05 f4 00 06 54 69 6d 65 54 6f 4c 69  ... ....TimeToLi
000000f0: 76 2b f5 01 1e 20 05 f4 01 02 61 73 65 49 64 80  v+... ....aseId.
00000100: 36 2a f5 01 20 20 05 f4 03 07 44 65 66 61 75 6c  6*..  ....Defaul
00000110: 74 4e 65 74 2f e4 07 03 02 1a 46 04 81 0d 20 01  tNet/.....F... .
00000120: 00 03 00 02 45 6e 61 62 6c 65 42 61 73 65 46 6f  ....EnableBaseFo
00000130: 72 47 61 74 65 77 61 79 2a 24 02 02 1f 45 04 81  rGateway*$...E..
00000140: 0b 20 0e 18 01 03 52 6f 75 74 65 64 2a 14 01 02  . ....Routed*...
00000150: 21 50 04 81 21 20 01 14 01 c2 07 30 31 3e 84 09  !P..! .....01>..
00000160: 03 01 f4 4b 04 81 17 20 17 40 01 2d a8 09 02 24  ...K... .@.-...$
00000170: 4c 04 81 19 20 17 2c 01 2e d0 09 01 23 3a 03 77  L... .,.....#:.w
00000180: 20 07 2c 01 2d 1d 02 1c 20 05 d0 0c b0 24 33 cc   .,.-... ....$3.
00000190: 07 02 1b 00 00 00 11 6c 00 00 3f 6a 68 2e 73 ec  .......l..?jh.s.
000001a0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
000001b0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
000001c0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
000001d0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
000001e0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
000001f0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
00000200: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
00000210: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
00000220: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
00000230: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
00000240: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
00000250: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
00000260: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
00000270: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
00000280: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
00000290: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
000002a0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
000002b0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
000002c0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
000002d0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
000002e0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
000002f0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
00000300: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
00000310: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
00000320: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
00000330: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
00000340: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
00000350: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
00000360: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
00000370: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
00000380: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
00000390: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
000003a0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
000003b0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
000003c0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
000003d0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
000003e0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
000003f0: ff ff
UBIFS error (pid 428): read_block: bad data node (block 388, inode 556196)
UBIFS error (pid 428): do_readpage: cannot read page 388 of inode
556196, error -22

-- 
Matthew L. Creech

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: ubifs_decompress: cannot decompress ...
  2011-05-31 15:47   ` Matthew L. Creech
@ 2011-05-31 16:10     ` Ben Gardiner
  2011-05-31 21:47       ` Matthew L. Creech
  2011-06-01  8:02     ` Artem Bityutskiy
  1 sibling, 1 reply; 25+ messages in thread
From: Ben Gardiner @ 2011-05-31 16:10 UTC (permalink / raw)
  To: Matthew L. Creech; +Cc: MTD list

Hi Matthew,

On Tue, May 31, 2011 at 11:47 AM, Matthew L. Creech <mlcreech@gmail.com> wrote:
> On Mon, May 30, 2011 at 8:29 AM, Ben Gardiner
> <bengardiner@nanometrics.ca> wrote:
>>
>> I don't see any debug statements in lzo1x_decompress_safe() that can
>> be enabled, so you might want to add some printing to
>> lzo1x_decompress_safe() or at least print the non-ok return code of
>> lzo1x_decompress_safe() in lzo_decompress() to get an idea of how the
>> decompressor is failing.
>>
>
> Looks like it's returning LZO_E_LOOKBEHIND_OVERRUN.  I don't know what
> that indicates, but there is trailing 0xff data in the block to be
> decompressed if that matters:

Interesting -- does the trailing 0xff have ECC set, or is it erased
pages of 0xff?

> XXXX: LZO_E_LOOKBEHIND_OVERRUN

Can't say much more about this than lzo-2.05/doc/LZOAPI.TXT [1] does:

  LZO_E_LOOKBEHIND_OVERRUN
    Your data is corrupted.

> UBIFS error (pid 428): ubifs_decompress: cannot decompress 1010 bytes,
> compressor lzo, error -22
> 00000000: 00 0f 69 6e 3a 61 74 74 72 00 c2 38 1c 03 39 03  ..in:attr..8..9.
> 00000010: 2f 30 31 2f 6d 57 43 2f 2e 56 61 6c 75 65 4d 61  /01/mWC/.ValueMa
> 00000020: 78 3a 61 91 03 94 31 72 00 69 6e d0 03 00 01 37  x:a...1r.in....7
> 00000030: 1c 03 3d 01 2f 53 65 72 69 61 6c 4e 75 6d 62 65  ..=./SerialNumbe
> 00000040: 72 2f 98 07 98 03 00 0a 04 2f 03 63 01 2f 53 6f  r/......./.c./So
> 00000050: 66 74 77 61 72 65 2f 43 6f 6d 6d 61 6e 64 73 2f  ftware/Commands/
> 00000060: 6e 65 78 74 d2 01 49 44 29 bc 00 03 06 47 04 81  next..ID)....G..
> 00000070: 0d 03 28 c0 00 00 04 44 61 74 61 20 53 65 72 76  ..(....Data Serv
> 00000080: 65 72 73 2f 42 41 43 6e 65 74 2d 49 50 c4 01 0b  ers/BACnet-IP...
> 00000090: 44 65 76 69 63 65 49 6e 73 74 61 6e 63 65 29 14  DeviceInstance).
> 000000a0: 01 05 00 f7 28 41 04 81 03 02 20 0c 1f 01 4d 41  ....(A.... ...MA
> 000000b0: 43 29 08 01 03 02 19 42 04 81 05 20 0d 07 01 4e  C).....B... ...N
> 000000c0: 61 6d 2a 14 02 02 02 18 3c 03 7b 20 02 04 01 07  am*.....<.{ ....
> 000000d0: 42 4d 44 41 64 64 72 65 73 73 2a fc 01 02 1d 40  BMDAddress*....@
> 000000e0: 04 81 01 20 05 f4 00 06 54 69 6d 65 54 6f 4c 69  ... ....TimeToLi
> 000000f0: 76 2b f5 01 1e 20 05 f4 01 02 61 73 65 49 64 80  v+... ....aseId.
> 00000100: 36 2a f5 01 20 20 05 f4 03 07 44 65 66 61 75 6c  6*..  ....Defaul
> 00000110: 74 4e 65 74 2f e4 07 03 02 1a 46 04 81 0d 20 01  tNet/.....F... .
> 00000120: 00 03 00 02 45 6e 61 62 6c 65 42 61 73 65 46 6f  ....EnableBaseFo
> 00000130: 72 47 61 74 65 77 61 79 2a 24 02 02 1f 45 04 81  rGateway*$...E..
> 00000140: 0b 20 0e 18 01 03 52 6f 75 74 65 64 2a 14 01 02  . ....Routed*...
> 00000150: 21 50 04 81 21 20 01 14 01 c2 07 30 31 3e 84 09  !P..! .....01>..
> 00000160: 03 01 f4 4b 04 81 17 20 17 40 01 2d a8 09 02 24  ...K... .@.-...$
> 00000170: 4c 04 81 19 20 17 2c 01 2e d0 09 01 23 3a 03 77  L... .,.....#:.w
> 00000180: 20 07 2c 01 2d 1d 02 1c 20 05 d0 0c b0 24 33 cc   .,.-... ....$3.
> 00000190: 07 02 1b 00 00 00 11 6c 00 00 3f 6a 68 2e 73 ec  .......l..?jh.s.
> 000001a0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
> 000001b0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
> 000001c0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
> 000001d0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
> 000001e0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
> 000001f0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
> 00000200: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
> 00000210: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
> 00000220: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
> 00000230: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
> 00000240: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
> 00000250: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
> 00000260: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
> 00000270: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
> 00000280: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
> 00000290: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
> 000002a0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
> 000002b0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
> 000002c0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
> 000002d0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
> 000002e0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
> 000002f0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
> 00000300: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
> 00000310: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
> 00000320: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
> 00000330: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
> 00000340: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
> 00000350: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
> 00000360: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
> 00000370: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
> 00000380: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
> 00000390: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
> 000003a0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
> 000003b0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
> 000003c0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
> 000003d0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
> 000003e0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
> 000003f0: ff ff
> UBIFS error (pid 428): read_block: bad data node (block 388, inode 556196)
> UBIFS error (pid 428): do_readpage: cannot read page 388 of inode
> 556196, error -22


Could it be possible that writing the page was interrupted? I guess
the CRC checks above decompress would catch that though.

Sorry I can't be of more help here.

Best Regards,
Ben Gardiner

[1] http://www.oberhumer.com/opensource/lzo/download/

---
Nanometrics Inc.
http://www.nanometrics.ca

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: ubifs_decompress: cannot decompress ...
  2011-05-31 16:10     ` Ben Gardiner
@ 2011-05-31 21:47       ` Matthew L. Creech
  2011-06-01  7:51         ` Artem Bityutskiy
  0 siblings, 1 reply; 25+ messages in thread
From: Matthew L. Creech @ 2011-05-31 21:47 UTC (permalink / raw)
  To: Ben Gardiner; +Cc: MTD list

On Tue, May 31, 2011 at 12:10 PM, Ben Gardiner
<bengardiner@nanometrics.ca> wrote:
>
> Interesting -- does the trailing 0xff have ECC set, or is it erased
> pages of 0xff?
>
...
>
> Could it be possible that writing the page was interrupted? I guess
> the CRC checks above decompress would catch that though.
>

I verified by adding a call to ubi->mtd->read_oob(): the 0xff data
starts on a page boundary (a whole multiple of 2k in my case).  The
associated OOB area for that page is all 0xff as well.

So I guess this is less about the original LZO error, and more about
how a page in the middle of a UBIFS node got erased out from
underneath it.  Which seems suspiciously similar to the
ubifs_read_node() error which I reported last year:

http://lists.infradead.org/pipermail/linux-mtd/2010-July/031069.html

and which is still showing up from time to time on devices in the
field.  In that case the erased page contained the node header and so
"type" was interpreted as 255; in this case the erased page is in the
middle of a data node, resulting in decompression failure instead.

Unfortunately it's not repeatable enough for us to capture adequate
debug output - we've had several devices logging extensive debug
output via netconsole for months now, but none have recreated this
problem so far under those conditions.

-- 
Matthew L. Creech

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: ubifs_decompress: cannot decompress ...
  2011-05-31 21:47       ` Matthew L. Creech
@ 2011-06-01  7:51         ` Artem Bityutskiy
  2011-06-02  4:30           ` Matthew L. Creech
  0 siblings, 1 reply; 25+ messages in thread
From: Artem Bityutskiy @ 2011-06-01  7:51 UTC (permalink / raw)
  To: Matthew L. Creech; +Cc: Ben Gardiner, MTD list

On Tue, 2011-05-31 at 17:47 -0400, Matthew L. Creech wrote:
> Unfortunately it's not repeatable enough for us to capture adequate
> debug output - we've had several devices logging extensive debug
> output via netconsole for months now, but none have recreated this
> problem so far under those conditions.

How this happens? What do you do? Does this happen after mount when you
first read your data? Or this happens at some point while you stress
testing your system? Or this happens after a power cut?

-- 
Best Regards,
Artem Bityutskiy (Артём Битюцкий)

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: ubifs_decompress: cannot decompress ...
  2011-06-01  7:51         ` Artem Bityutskiy
@ 2011-06-02  4:30           ` Matthew L. Creech
  2011-06-02 18:59             ` Matthew L. Creech
  2011-06-03  4:32             ` Artem Bityutskiy
  0 siblings, 2 replies; 25+ messages in thread
From: Matthew L. Creech @ 2011-06-02  4:30 UTC (permalink / raw)
  To: dedekind1; +Cc: Ben Gardiner, MTD list

On Wed, Jun 1, 2011 at 3:51 AM, Artem Bityutskiy <dedekind1@gmail.com> wrote:
>
> How this happens? What do you do? Does this happen after mount when you
> first read your data? Or this happens at some point while you stress
> testing your system? Or this happens after a power cut?
>

So far there's no discernable pattern.  Most of the failed units are
returns from the field, so we don't know what kind of conditions
they've been placed in.  Some are from our test department, but we
haven't found anything that might "trigger" the problem in any way.

The device works fine for some period of time (usually weeks /
months), then we get complaints about various problems.  The reported
symptoms eventually come down to one of these UBIFS errors.  Depending
on the region which happens to go bad, it can result in breakage of a
minor feature (because a file we try to read/write after mount
triggers the error), all the way up to a completely non-functional
device.  I'm not sure if we've ever seen it fail to mount altogether
(I'll check into that), but we've had several cases in which U-Boot
couldn't read the kernel image from UBIFS, so the device wouldn't boot
Linux at all.

Power cuts are probably not common, though.  We have to expect them in
the product of course, but practically speaking, our service guy
assures me that a couple of the bad units he shipped me had stable
power and were rarely/never rebooted.  But I can't rule it out with
certainty.

Aside from that, it's just normal operation.  If the usage pattern
matters, the only files ever written to in the persistent (UBIFS)
filesystem are SQLite databases.  It's generally light usage, logging
a variety of measurements once every 5 minutes.  I've tried
stress-testing by running non-stop SQLite operations, recreating the
normal usage pattern but with a _much_ higher frequency of writes than
normal.  It didn't seem to help reproduce the error - we've yet to
succeed in making this problem happen under controlled conditions.

As for this specific error (ubifs_decompress): tomorrow I'll try to
gather & post additional log data for this device.  Thanks!

-- 
Matthew L. Creech

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: ubifs_decompress: cannot decompress ...
  2011-06-02  4:30           ` Matthew L. Creech
@ 2011-06-02 18:59             ` Matthew L. Creech
  2011-06-06  9:58               ` Artem Bityutskiy
  2011-06-03  4:32             ` Artem Bityutskiy
  1 sibling, 1 reply; 25+ messages in thread
From: Matthew L. Creech @ 2011-06-02 18:59 UTC (permalink / raw)
  To: dedekind1; +Cc: Ben Gardiner, MTD list

On Thu, Jun 2, 2011 at 12:30 AM, Matthew L. Creech <mlcreech@gmail.com> wrote:
>
> As for this specific error (ubifs_decompress): tomorrow I'll try to
> gather & post additional log data for this device.  Thanks!
>

Here is a console dump with more information enabled:

http://mcreech.com/work/ubifs-decompress-err.txt

I turned on mount & recovery debug messages, although it seems to
mount & recover correctly, so presumably any info from past recovery
actions is long gone.  The error actually occurs later on, when our
main application accesses SQLite databases.

I dumped out the corrupted node data in 3 places:

1. In ubifs_decompress(), I dumped the data buffer, prefixed with
"compressed node" (redundant with #3, really)

2. In read_block(), I page-align the starting offset & size, re-fetch
the pertinent pages from flash, and dump those with corresponding OOB
info, prefixed with "data" and "oob" respectively

3. There's already a dbg_dump_node() call in read_block(), so I
enabled that as well

-- 
Matthew L. Creech

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: ubifs_decompress: cannot decompress ...
  2011-06-02 18:59             ` Matthew L. Creech
@ 2011-06-06  9:58               ` Artem Bityutskiy
  2011-06-06 16:04                 ` Matthew L. Creech
  0 siblings, 1 reply; 25+ messages in thread
From: Artem Bityutskiy @ 2011-06-06  9:58 UTC (permalink / raw)
  To: Matthew L. Creech; +Cc: Ben Gardiner, MTD list

On Thu, 2011-06-02 at 14:59 -0400, Matthew L. Creech wrote: 
> On Thu, Jun 2, 2011 at 12:30 AM, Matthew L. Creech <mlcreech@gmail.com> wrote:
> >
> > As for this specific error (ubifs_decompress): tomorrow I'll try to
> > gather & post additional log data for this device.  Thanks!
> >
> 
> Here is a console dump with more information enabled:
> 
> http://mcreech.com/work/ubifs-decompress-err.txt
> 
> I turned on mount & recovery debug messages, although it seems to
> mount & recover correctly, so presumably any info from past recovery
> actions is long gone.  The error actually occurs later on, when our
> main application accesses SQLite databases.
> 
> I dumped out the corrupted node data in 3 places:
> 
> 1. In ubifs_decompress(), I dumped the data buffer, prefixed with
> "compressed node" (redundant with #3, really)
> 
> 2. In read_block(), I page-align the starting offset & size, re-fetch
> the pertinent pages from flash, and dump those with corresponding OOB
> info, prefixed with "data" and "oob" respectively
> 
> 3. There's already a dbg_dump_node() call in read_block(), so I
> enabled that as well

So the corruption starts exactly at the NAND page boundary. This makes
me believe that the reason is most probably power cut recovery. But you
say your client ensures there were none...

Yes, you are right that info from past is gone... What I'd like to see
is a dump of the whole LEB. Could you please add 'dbg_dump_leb()' -
basically I want to look if this LEB was passed through GC.

Because my theory is:

1. You have LEB A which contains this data node, but it is not corrupted
   yet. Let's call this data node X.
2. GC moves valid data from LEB A to LEB B (lnum 3479).
3. We get a power cut while moving the data. We end up with node X
   corrupted in B.
4. UBIFS recovery has a bug and it decides that the copy of node X in
   LEB B is OK, commits, and LEB A is erased at some point.
5. And we are in your situation...

But this is just a theory.

I actually worked on power cut emulation testing improvements lately,
and the current state is that "integck -p" fails sometimes. I need to
investigate it - might turn out to be a bug which cases the effect you
see.

Basically, I've improved UBIFS power cut testing and corrupt the buffer
with random data, not only with 0xFFs, and now integck -p starts
failing. See this commit:

http://git.infradead.org/ubifs-2.6.git/commit/96c32bb596c5a74362a6a825f66fde68b6c3487c

It contains several unrelated changes. Ignore the simple random part,
only the changes in 'cut_data()' are interesting.

But I'll split that change on several changes.

And I'll try to investigate the issue - it might turn out to be
integck's issue, will see.

-- 
Best Regards,
Artem Bityutskiy (Артём Битюцкий)

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: ubifs_decompress: cannot decompress ...
  2011-06-06  9:58               ` Artem Bityutskiy
@ 2011-06-06 16:04                 ` Matthew L. Creech
  2011-06-06 16:18                   ` Artem Bityutskiy
  0 siblings, 1 reply; 25+ messages in thread
From: Matthew L. Creech @ 2011-06-06 16:04 UTC (permalink / raw)
  To: dedekind1; +Cc: Ben Gardiner, MTD list

On Mon, Jun 6, 2011 at 5:58 AM, Artem Bityutskiy <dedekind1@gmail.com> wrote:
>
> So the corruption starts exactly at the NAND page boundary. This makes
> me believe that the reason is most probably power cut recovery. But you
> say your client ensures there were none...
>
> Yes, you are right that info from past is gone... What I'd like to see
> is a dump of the whole LEB. Could you please add 'dbg_dump_leb()' -
> basically I want to look if this LEB was passed through GC.
>

[Repost, gmail mangled my node dump]

The dbg_dump_leb() call fails due to a bad CRC; I've posted its output
plus the corresponding raw LEB dump (obtained via mtd->read_oob())
here:

http://mcreech.com/work/ubifs-decompress-err-rawleb.txt

-- 
Matthew L. Creech

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: ubifs_decompress: cannot decompress ...
  2011-06-06 16:04                 ` Matthew L. Creech
@ 2011-06-06 16:18                   ` Artem Bityutskiy
  2011-06-06 19:52                     ` Matthew L. Creech
  0 siblings, 1 reply; 25+ messages in thread
From: Artem Bityutskiy @ 2011-06-06 16:18 UTC (permalink / raw)
  To: Matthew L. Creech; +Cc: Ben Gardiner, MTD list

On Mon, 2011-06-06 at 12:04 -0400, Matthew L. Creech wrote:
> On Mon, Jun 6, 2011 at 5:58 AM, Artem Bityutskiy <dedekind1@gmail.com> wrote:
> >
> > So the corruption starts exactly at the NAND page boundary. This makes
> > me believe that the reason is most probably power cut recovery. But you
> > say your client ensures there were none...
> >
> > Yes, you are right that info from past is gone... What I'd like to see
> > is a dump of the whole LEB. Could you please add 'dbg_dump_leb()' -
> > basically I want to look if this LEB was passed through GC.
> >
> 
> [Repost, gmail mangled my node dump]
> 
> The dbg_dump_leb() call fails due to a bad CRC; I've posted its output
> plus the corresponding raw LEB dump (obtained via mtd->read_oob())
> here:

I have to go home now - could you please improve dbg_dump_leb().
Currently it calls ubifs_scan(), which scans, finds corrupted node,
prints corruption information and returns -EUCLEAN and destroys the
scanned data.

Instead, we could do:

1. Add another parameter to ubifs_scan() which makes it to _not_ free
scanned data on corruption.

2. In dbg_dump_leb() when you get -EUCLEAN - just go ahead and print the
scanned information instead of exiting. Should not be too difficult to
do.

-- 
Best Regards,
Artem Bityutskiy (Артём Битюцкий)

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: ubifs_decompress: cannot decompress ...
  2011-06-06 16:18                   ` Artem Bityutskiy
@ 2011-06-06 19:52                     ` Matthew L. Creech
  2011-06-07  4:34                       ` Artem Bityutskiy
  2011-06-07 10:24                       ` Artem Bityutskiy
  0 siblings, 2 replies; 25+ messages in thread
From: Matthew L. Creech @ 2011-06-06 19:52 UTC (permalink / raw)
  To: linux-mtd; +Cc: dedekind1

On Mon, Jun 6, 2011 at 12:18 PM, Artem Bityutskiy <dedekind1@gmail.com> wrote:
>
> I have to go home now - could you please improve dbg_dump_leb().
> Currently it calls ubifs_scan(), which scans, finds corrupted node,
> prints corruption information and returns -EUCLEAN and destroys the
> scanned data.
>

Will something like this be okay?  Or do you still want to dump the
partially-parsed data from the corrupt node as well (not just the raw contents
of the LEB)?


Currently an error in ubifs_scan() will cause dbg_dump_leb() to abort without
completing the dump.  Instead, we should abandon parsing the data, but dump
the raw (uninterpreted) LEB contents instead.

Signed-off-by: Matthew L. Creech <mlcreech@gmail.com>
---
 fs/ubifs/debug.c |    3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/fs/ubifs/debug.c b/fs/ubifs/debug.c
index 26d4c61..6ab43e4 100644
--- a/fs/ubifs/debug.c
+++ b/fs/ubifs/debug.c
@@ -901,6 +901,9 @@ void dbg_dump_leb(const struct ubifs_info *c, int lnum)
 	sleb = ubifs_scan(c, lnum, 0, buf, 0);
 	if (IS_ERR(sleb)) {
 		ubifs_err("scan error %d", (int)PTR_ERR(sleb));
+		printk(KERN_DEBUG "\tLEB data buffer:\n");
+		print_hex_dump(KERN_DEBUG, "\t", DUMP_PREFIX_OFFSET, 32, 1,
+			       buf, c->leb_size, 0);
 		goto out;
 	}
 
-- 
1.6.3.3

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: ubifs_decompress: cannot decompress ...
  2011-06-06 19:52                     ` Matthew L. Creech
@ 2011-06-07  4:34                       ` Artem Bityutskiy
  2011-06-07 20:41                         ` Matthew L. Creech
  2011-06-07 10:24                       ` Artem Bityutskiy
  1 sibling, 1 reply; 25+ messages in thread
From: Artem Bityutskiy @ 2011-06-07  4:34 UTC (permalink / raw)
  To: Matthew L. Creech; +Cc: linux-mtd

On Mon, 2011-06-06 at 15:52 -0400, Matthew L. Creech wrote:
>  		ubifs_err("scan error %d", (int)PTR_ERR(sleb));
> +		printk(KERN_DEBUG "\tLEB data buffer:\n");
> +		print_hex_dump(KERN_DEBUG, "\t", DUMP_PREFIX_OFFSET, 32, 1,
> +			       buf, c->leb_size, 0);

No, I have difficulties reading hexdumps. You have set of good nodes
following by one broken node. I wanted to see a human-readable dump of
the good nodes at the beginning of the LEB.

-- 
Best Regards,
Artem Bityutskiy (Артём Битюцкий)

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: ubifs_decompress: cannot decompress ...
  2011-06-07  4:34                       ` Artem Bityutskiy
@ 2011-06-07 20:41                         ` Matthew L. Creech
  2011-06-08 14:11                           ` Artem Bityutskiy
  0 siblings, 1 reply; 25+ messages in thread
From: Matthew L. Creech @ 2011-06-07 20:41 UTC (permalink / raw)
  To: dedekind1; +Cc: linux-mtd

On Tue, Jun 7, 2011 at 12:34 AM, Artem Bityutskiy <dedekind1@gmail.com> wrote:
>
> No, I have difficulties reading hexdumps. You have set of good nodes
> following by one broken node. I wanted to see a human-readable dump of
> the good nodes at the beginning of the LEB.
>

Oh I see - sorry, I thought you wanted to debug the corrupted portion.

Here's the output for my corrupt flash:

http://mcreech.com/work/ubifs-2011-06-07.txt

I'll follow up with a patch.

-- 
Matthew L. Creech

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: ubifs_decompress: cannot decompress ...
  2011-06-07 20:41                         ` Matthew L. Creech
@ 2011-06-08 14:11                           ` Artem Bityutskiy
  2011-06-08 17:50                             ` Matthew L. Creech
  0 siblings, 1 reply; 25+ messages in thread
From: Artem Bityutskiy @ 2011-06-08 14:11 UTC (permalink / raw)
  To: Matthew L. Creech; +Cc: linux-mtd

On Tue, 2011-06-07 at 16:41 -0400, Matthew L. Creech wrote:
> On Tue, Jun 7, 2011 at 12:34 AM, Artem Bityutskiy <dedekind1@gmail.com> wrote:
> >
> > No, I have difficulties reading hexdumps. You have set of good nodes
> > following by one broken node. I wanted to see a human-readable dump of
> > the good nodes at the beginning of the LEB.
> >
> 
> Oh I see - sorry, I thought you wanted to debug the corrupted portion.
> 
> Here's the output for my corrupt flash:
> 
> http://mcreech.com/work/ubifs-2011-06-07.txt
> 
> I'll follow up with a patch.

Yes, it does look like this LEB might be garbage-collected. But it does
not have to be.

Anyway, what I can suggest you is to do several things.

1. If you have many occasions of such error, try to gather some
   information about how the device was used, and if it was uncleanly
   power-cut. Remember, I often saw that embedded devices have incorrect
   reboot. Whe users reboot it "normally" - it does not try to unmount
   the FS-es cleanly and just jumps to som HW reset function.

   You can verify this by rebooting normally and checking if UBIFS says
   "recovery needed" or not. If it does - the reboot was not normal.

2. This error may be due to memory corruptions in some driver (e.g.,
   wireless or video), due to issues in the mtd driver, etc. Try to
   stress your system with slub/slab full checks enabled, and other
   debugging features which you can find in the "hacking" section of
   make menuconfig.

3. If my theory is true, then what may help is adding a check it
   ubifs recovery function. The recovery ends with an ubifs_leb_change()
   call. You need to check the last node there - is it full and correct?
   If not, you should print a loud warning and information like leb dump
   _before_ the change, and dump of the buffer which we are going to
   write with ubifs_leb_change().

   You'd probably need to deploy this check to the field if this issue
   is not easy to reproduce. If you have then this info you may fix the
   bug.

4. Set-up power-cut emulation testing in your office.

P.S. I'm curious where you use UBIFS, if this is not a trade secret, of
course.

-- 
Best Regards,
Artem Bityutskiy (Артём Битюцкий)

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: ubifs_decompress: cannot decompress ...
  2011-06-08 14:11                           ` Artem Bityutskiy
@ 2011-06-08 17:50                             ` Matthew L. Creech
  2011-06-09 12:10                               ` Artem Bityutskiy
  0 siblings, 1 reply; 25+ messages in thread
From: Matthew L. Creech @ 2011-06-08 17:50 UTC (permalink / raw)
  To: dedekind1; +Cc: linux-mtd

On Wed, Jun 8, 2011 at 10:11 AM, Artem Bityutskiy <dedekind1@gmail.com> wrote:
>
> Yes, it does look like this LEB might be garbage-collected. But it does
> not have to be.
>
> Anyway, what I can suggest you is to do several things.
>
> 1. If you have many occasions of such error, try to gather some
>   information about how the device was used, and if it was uncleanly
>   power-cut. Remember, I often saw that embedded devices have incorrect
>   reboot. Whe users reboot it "normally" - it does not try to unmount
>   the FS-es cleanly and just jumps to som HW reset function.
>
>   You can verify this by rebooting normally and checking if UBIFS says
>   "recovery needed" or not. If it does - the reboot was not normal.
>

Yes, it currently reboots uncleanly (though it does do a "sync"
first).  I noticed this a while back, and the next release firmware
will have it fixed.  However, it doesn't make a huge difference to us,
because these devices are probably more likely to experience power
loss than a software reboot, in the field at least.

> 2. This error may be due to memory corruptions in some driver (e.g.,
>   wireless or video), due to issues in the mtd driver, etc. Try to
>   stress your system with slub/slab full checks enabled, and other
>   debugging features which you can find in the "hacking" section of
>   make menuconfig.
>

Will do.

> 3. If my theory is true, then what may help is adding a check it
>   ubifs recovery function. The recovery ends with an ubifs_leb_change()
>   call. You need to check the last node there - is it full and correct?
>   If not, you should print a loud warning and information like leb dump
>   _before_ the change, and dump of the buffer which we are going to
>   write with ubifs_leb_change().
>
>   You'd probably need to deploy this check to the field if this issue
>   is not easy to reproduce. If you have then this info you may fix the
>   bug.
>

Great, I'll add this check and see if we get any hits.  Even if it
takes a while to hit it in the field, this would at least give us a
way to make some progress in finding the issue.

> 4. Set-up power-cut emulation testing in your office.
>

I did this at one point - I have a programmable UPS, so I was able to
automate a test to turn outlet power off & on repeatedly while having
the device do some work.  It didn't seem to help reproduce the
problem, but it's worth trying again on a long-term basis (especially
with the change above to try & catch the corruption in the act).

Thanks again Artem.

-- 
Matthew L. Creech

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: ubifs_decompress: cannot decompress ...
  2011-06-08 17:50                             ` Matthew L. Creech
@ 2011-06-09 12:10                               ` Artem Bityutskiy
  2011-06-20 15:35                                 ` Matthew L. Creech
  0 siblings, 1 reply; 25+ messages in thread
From: Artem Bityutskiy @ 2011-06-09 12:10 UTC (permalink / raw)
  To: Matthew L. Creech; +Cc: linux-mtd

On Wed, 2011-06-08 at 13:50 -0400, Matthew L. Creech wrote:
> On Wed, Jun 8, 2011 at 10:11 AM, Artem Bityutskiy <dedekind1@gmail.com> wrote:
> >
> > Yes, it does look like this LEB might be garbage-collected. But it does
> > not have to be.
> >
> > Anyway, what I can suggest you is to do several things.
> >
> > 1. If you have many occasions of such error, try to gather some
> >   information about how the device was used, and if it was uncleanly
> >   power-cut. Remember, I often saw that embedded devices have incorrect
> >   reboot. Whe users reboot it "normally" - it does not try to unmount
> >   the FS-es cleanly and just jumps to som HW reset function.
> >
> >   You can verify this by rebooting normally and checking if UBIFS says
> >   "recovery needed" or not. If it does - the reboot was not normal.
> >
> 
> Yes, it currently reboots uncleanly (though it does do a "sync"
> first).  I noticed this a while back, and the next release firmware
> will have it fixed.  However, it doesn't make a huge difference to us,
> because these devices are probably more likely to experience power
> loss than a software reboot, in the field at least.
> 
> > 2. This error may be due to memory corruptions in some driver (e.g.,
> >   wireless or video), due to issues in the mtd driver, etc. Try to
> >   stress your system with slub/slab full checks enabled, and other
> >   debugging features which you can find in the "hacking" section of
> >   make menuconfig.
> >
> 
> Will do.
> 
> > 3. If my theory is true, then what may help is adding a check it
> >   ubifs recovery function. The recovery ends with an ubifs_leb_change()
> >   call. You need to check the last node there - is it full and correct?
> >   If not, you should print a loud warning and information like leb dump
> >   _before_ the change, and dump of the buffer which we are going to
> >   write with ubifs_leb_change().
> >
> >   You'd probably need to deploy this check to the field if this issue
> >   is not easy to reproduce. If you have then this info you may fix the
> >   bug.
> >
> 
> Great, I'll add this check and see if we get any hits.  Even if it
> takes a while to hit it in the field, this would at least give us a
> way to make some progress in finding the issue.

With my latest code-base, I am able to inject a hack into
ubifs_leb_change() - but this function does not exist in your code-base.
Anyway, I'm currently running power cut emulation testing with the
following hack:


>From df163f4dd8a1604fe3085c4d11281c530837bc53 Mon Sep 17 00:00:00 2001
From: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Date: Thu, 9 Jun 2011 15:08:59 +0300
Subject: [PATCH] UBIFS: temporary: hack to check recovery

We suspect that recovery cuts nodes sometimes. This is the hack which should
catch such things. We hack ubifs_change_leb and scan the leb right after
changing it - if we wrote corrupted data there, scan should fail.

Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
---
 fs/ubifs/io.c |   24 ++++++++++++++++++++++++
 1 files changed, 24 insertions(+), 0 deletions(-)

diff --git a/fs/ubifs/io.c b/fs/ubifs/io.c
index 9228950..9f7dbbf 100644
--- a/fs/ubifs/io.c
+++ b/fs/ubifs/io.c
@@ -153,6 +153,30 @@ int ubifs_leb_change(struct ubifs_info *c, int lnum, const void *buf, int len,
 		ubifs_ro_mode(c, err);
 		dbg_dump_stack();
 	}
+
+	/* Temporary hack to catch incorrect recovery, if we have such */
+	if (!err && (lnum < c->lpt_first || lnum > c->lpt_last)) {
+		void *buf = vmalloc(c->leb_size);
+		struct ubifs_scan_leb *sleb;
+
+		if (!buf)
+			return 0;
+
+		sleb = ubifs_scan(c, lnum, 0, buf, 0);
+		if (!IS_ERR(sleb)) {
+			/* Scan succeeded */
+			vfree(buf);
+			return 0;
+		}
+
+		ubifs_err("scanning after LEB %d change failed, error %d!", lnum, err);
+		print_hex_dump(KERN_ERR, "", DUMP_PREFIX_OFFSET, 32, 1,
+			       buf, c->leb_size, 1);
+		dump_stack();
+		vfree(buf);
+		return -EINVAL;
+	}
+
 	return err;
 }
 
-- 
1.7.2.3



-- 
Best Regards,
Artem Bityutskiy (Артём Битюцкий)

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: ubifs_decompress: cannot decompress ...
  2011-06-09 12:10                               ` Artem Bityutskiy
@ 2011-06-20 15:35                                 ` Matthew L. Creech
  0 siblings, 0 replies; 25+ messages in thread
From: Matthew L. Creech @ 2011-06-20 15:35 UTC (permalink / raw)
  To: dedekind1; +Cc: linux-mtd

On Thu, Jun 9, 2011 at 8:10 AM, Artem Bityutskiy <dedekind1@gmail.com> wrote:
>
> With my latest code-base, I am able to inject a hack into
> ubifs_leb_change() - but this function does not exist in your code-base.
> Anyway, I'm currently running power cut emulation testing with the
> following hack:
>

Hi Artem, just wanted to post an update to this:

I hacked this change into my kernel last week, and enabled the
in-kernel debugging options you suggested, then set up a power-cut
test:

- the device in question is powered by a UPS with controllable outlets
- its firmware boots and immediately starts writing large amounts of
data to flash (so we'll have a dirty FS)
- a separate script controlling the UPS waits for the device to boot,
then sleeps some random time between 0 and 10 minutes
- it then cuts power to the outlet and later restores it

The test has been running for the better part of a week with no issues
- it has now power-cycled the device over 1500 times, but I've gotten
no errors, and the filesystem is functioning fine.  FYI this is way
more stress than was placed on the "bad" units we've gotten back from
the field - most of those rarely lose power, and when they do, they're
less likely to be in the middle of a write operation.

I'll continue testing, but based on this it seems like UBIFS handles
power cuts very well!

-- 
Matthew L. Creech

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: ubifs_decompress: cannot decompress ...
  2011-06-06 19:52                     ` Matthew L. Creech
  2011-06-07  4:34                       ` Artem Bityutskiy
@ 2011-06-07 10:24                       ` Artem Bityutskiy
  1 sibling, 0 replies; 25+ messages in thread
From: Artem Bityutskiy @ 2011-06-07 10:24 UTC (permalink / raw)
  To: Matthew L. Creech; +Cc: linux-mtd

On Mon, 2011-06-06 at 15:52 -0400, Matthew L. Creech wrote:
> On Mon, Jun 6, 2011 at 12:18 PM, Artem Bityutskiy <dedekind1@gmail.com> wrote:
> >
> > I have to go home now - could you please improve dbg_dump_leb().
> > Currently it calls ubifs_scan(), which scans, finds corrupted node,
> > prints corruption information and returns -EUCLEAN and destroys the
> > scanned data.
> >
> 
> Will something like this be okay?  Or do you still want to dump the
> partially-parsed data from the corrupt node as well (not just the raw contents
> of the LEB)?

Probably a temporary hack like this should work.

diff --git a/fs/ubifs/scan.c b/fs/ubifs/scan.c
index 37383e8..c709045 100644
--- a/fs/ubifs/scan.c
+++ b/fs/ubifs/scan.c
@@ -352,8 +352,9 @@ corrupted:
 		ubifs_err("LEB %d scanning failed", lnum);
 	}
 	err = -EUCLEAN;
-	ubifs_scan_destroy(sleb);
-	return ERR_PTR(err);
+//	ubifs_scan_destroy(sleb);
+//	return ERR_PTR(err);
+	return sleb;
 
 error:
 	ubifs_err("LEB %d scanning failed, error %d", lnum, err);

-- 
Best Regards,
Artem Bityutskiy (Артём Битюцкий)

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: ubifs_decompress: cannot decompress ...
  2011-06-02  4:30           ` Matthew L. Creech
  2011-06-02 18:59             ` Matthew L. Creech
@ 2011-06-03  4:32             ` Artem Bityutskiy
  1 sibling, 0 replies; 25+ messages in thread
From: Artem Bityutskiy @ 2011-06-03  4:32 UTC (permalink / raw)
  To: Matthew L. Creech; +Cc: Ben Gardiner, MTD list

On Thu, 2011-06-02 at 00:30 -0400, Matthew L. Creech wrote:
> On Wed, Jun 1, 2011 at 3:51 AM, Artem Bityutskiy <dedekind1@gmail.com> wrote:
> >
> > How this happens? What do you do? Does this happen after mount when you
> > first read your data? Or this happens at some point while you stress
> > testing your system? Or this happens after a power cut?
> >
> 
> So far there's no discernable pattern.  Most of the failed units are
> returns from the field, so we don't know what kind of conditions
> they've been placed in.  Some are from our test department, but we
> haven't found anything that might "trigger" the problem in any way.
> 
> The device works fine for some period of time (usually weeks /
> months), then we get complaints about various problems.  The reported
> symptoms eventually come down to one of these UBIFS errors.  Depending
> on the region which happens to go bad, it can result in breakage of a
> minor feature (because a file we try to read/write after mount
> triggers the error), all the way up to a completely non-functional
> device.  I'm not sure if we've ever seen it fail to mount altogether
> (I'll check into that), but we've had several cases in which U-Boot
> couldn't read the kernel image from UBIFS, so the device wouldn't boot
> Linux at all.
> 
> Power cuts are probably not common, though.  We have to expect them in
> the product of course, but practically speaking, our service guy
> assures me that a couple of the bad units he shipped me had stable
> power and were rarely/never rebooted.  But I can't rule it out with
> certainty.
> 
> Aside from that, it's just normal operation.  If the usage pattern
> matters, the only files ever written to in the persistent (UBIFS)
> filesystem are SQLite databases.  It's generally light usage, logging
> a variety of measurements once every 5 minutes.  I've tried
> stress-testing by running non-stop SQLite operations, recreating the
> normal usage pattern but with a _much_ higher frequency of writes than
> normal.  It didn't seem to help reproduce the error - we've yet to
> succeed in making this problem happen under controlled conditions.
> 
> As for this specific error (ubifs_decompress): tomorrow I'll try to
> gather & post additional log data for this device.  Thanks!

OK, then this is not about power cuts and unstable bits. First thing
coming to my mind is that your kernel may have some non-UBIFS bugs which
end up in memory corruptions, so UBIFS writes corrupted data to the
flash.

But the hexdump you sent shows that you have some non-0xFFs and then
many 0xFFs. Are those trailing 0xFFs part of the node data or not? If
yes, then it does not look like memory corruption, but more like some
driver/flash issues.

BTW, have you run mtd tests? Would you mind to set up torture test on
one of your boards and let it run fore several weeks. I remember we
found a rare DMA bug in our board by running the torture test for long
time. Also, it might be interesting how your HW and SW behave when you
continuously wear out few eraseblocks.

-- 
Best Regards,
Artem Bityutskiy (Артём Битюцкий)

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: ubifs_decompress: cannot decompress ...
  2011-05-31 15:47   ` Matthew L. Creech
  2011-05-31 16:10     ` Ben Gardiner
@ 2011-06-01  8:02     ` Artem Bityutskiy
  2011-06-01  8:07       ` Artem Bityutskiy
                         ` (2 more replies)
  1 sibling, 3 replies; 25+ messages in thread
From: Artem Bityutskiy @ 2011-06-01  8:02 UTC (permalink / raw)
  To: Matthew L. Creech; +Cc: Ben Gardiner, MTD list

On Tue, 2011-05-31 at 11:47 -0400, Matthew L. Creech wrote:
> On Mon, May 30, 2011 at 8:29 AM, Ben Gardiner
> <bengardiner@nanometrics.ca> wrote:
> >
> > I don't see any debug statements in lzo1x_decompress_safe() that can
> > be enabled, so you might want to add some printing to
> > lzo1x_decompress_safe() or at least print the non-ok return code of
> > lzo1x_decompress_safe() in lzo_decompress() to get an idea of how the
> > decompressor is failing.
> >
> 
> Looks like it's returning LZO_E_LOOKBEHIND_OVERRUN.  I don't know what
> that indicates, but there is trailing 0xff data in the block to be
> decompressed if that matters:
> 
> XXXX: LZO_E_LOOKBEHIND_OVERRUN
> UBIFS error (pid 428): ubifs_decompress: cannot decompress 1010 bytes,
> compressor lzo, error -22
> 00000000: 00 0f 69 6e 3a 61 74 74 72 00 c2 38 1c 03 39 03  ..in:attr..8..9.
> 00000010: 2f 30 31 2f 6d 57 43 2f 2e 56 61 6c 75 65 4d 61  /01/mWC/.ValueMa
> 00000020: 78 3a 61 91 03 94 31 72 00 69 6e d0 03 00 01 37  x:a...1r.in....7
> 00000030: 1c 03 3d 01 2f 53 65 72 69 61 6c 4e 75 6d 62 65  ..=./SerialNumbe
> 00000040: 72 2f 98 07 98 03 00 0a 04 2f 03 63 01 2f 53 6f  r/......./.c./So
> 00000050: 66 74 77 61 72 65 2f 43 6f 6d 6d 61 6e 64 73 2f  ftware/Commands/
> 00000060: 6e 65 78 74 d2 01 49 44 29 bc 00 03 06 47 04 81  next..ID)....G..
> 00000070: 0d 03 28 c0 00 00 04 44 61 74 61 20 53 65 72 76  ..(....Data Serv
> 00000080: 65 72 73 2f 42 41 43 6e 65 74 2d 49 50 c4 01 0b  ers/BACnet-IP...
> 00000090: 44 65 76 69 63 65 49 6e 73 74 61 6e 63 65 29 14  DeviceInstance).
> 000000a0: 01 05 00 f7 28 41 04 81 03 02 20 0c 1f 01 4d 41  ....(A.... ...MA
> 000000b0: 43 29 08 01 03 02 19 42 04 81 05 20 0d 07 01 4e  C).....B... ...N
> 000000c0: 61 6d 2a 14 02 02 02 18 3c 03 7b 20 02 04 01 07  am*.....<.{ ....
> 000000d0: 42 4d 44 41 64 64 72 65 73 73 2a fc 01 02 1d 40  BMDAddress*....@
> 000000e0: 04 81 01 20 05 f4 00 06 54 69 6d 65 54 6f 4c 69  ... ....TimeToLi
> 000000f0: 76 2b f5 01 1e 20 05 f4 01 02 61 73 65 49 64 80  v+... ....aseId.
> 00000100: 36 2a f5 01 20 20 05 f4 03 07 44 65 66 61 75 6c  6*..  ....Defaul
> 00000110: 74 4e 65 74 2f e4 07 03 02 1a 46 04 81 0d 20 01  tNet/.....F... .
> 00000120: 00 03 00 02 45 6e 61 62 6c 65 42 61 73 65 46 6f  ....EnableBaseFo
> 00000130: 72 47 61 74 65 77 61 79 2a 24 02 02 1f 45 04 81  rGateway*$...E..
> 00000140: 0b 20 0e 18 01 03 52 6f 75 74 65 64 2a 14 01 02  . ....Routed*...
> 00000150: 21 50 04 81 21 20 01 14 01 c2 07 30 31 3e 84 09  !P..! .....01>..
> 00000160: 03 01 f4 4b 04 81 17 20 17 40 01 2d a8 09 02 24  ...K... .@.-...$
> 00000170: 4c 04 81 19 20 17 2c 01 2e d0 09 01 23 3a 03 77  L... .,.....#:.w
> 00000180: 20 07 2c 01 2d 1d 02 1c 20 05 d0 0c b0 24 33 cc   .,.-... ....$3.
> 00000190: 07 02 1b 00 00 00 11 6c 00 00 3f 6a 68 2e 73 ec  .......l..?jh.s.
> 000001a0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
> 000001b0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
> 000001c0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
> 000001d0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
> 000001e0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
> 000001f0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
> 00000200: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
> 00000210: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
> 00000220: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
> 00000230: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
> 00000240: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
> 00000250: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
> 00000260: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
> 00000270: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
> 00000280: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
> 00000290: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
> 000002a0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
> 000002b0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
> 000002c0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
> 000002d0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
> 000002e0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
> 000002f0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
> 00000300: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
> 00000310: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
> 00000320: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
> 00000330: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
> 00000340: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
> 00000350: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
> 00000360: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
> 00000370: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
> 00000380: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
> 00000390: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
> 000003a0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
> 000003b0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
> 000003c0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
> 000003d0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
> 000003e0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
> 000003f0: ff ff
> UBIFS error (pid 428): read_block: bad data node (block 388, inode 556196)
> UBIFS error (pid 428): do_readpage: cannot read page 388 of inode
> 556196, error -22

I see the following possibilities:

1. The data has been written like this - then the bug is at writing
side. Check the data node - what is its length, is CRC correct? It would
be useful to dump the node which cannot be decompressed - I'd accept
such patch with great delight.

2. You had power cuts while this peace of data has been written and
recovery did not work correctly. Enabling mount and recovery messages
would help.

3. I merged several changes to 2.6.39 which could in theory break
recovery. Try to reproduce this with 2.6.38.

4. The fixup feature might have broke this - we might for some reason
read less data than there. Although I see FFs start at offset 416, which
is strange.

-- 
Best Regards,
Artem Bityutskiy (Артём Битюцкий)

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: ubifs_decompress: cannot decompress ...
  2011-06-01  8:02     ` Artem Bityutskiy
@ 2011-06-01  8:07       ` Artem Bityutskiy
  2011-06-01  8:39       ` Artem Bityutskiy
  2011-06-02  4:34       ` Matthew L. Creech
  2 siblings, 0 replies; 25+ messages in thread
From: Artem Bityutskiy @ 2011-06-01  8:07 UTC (permalink / raw)
  To: Matthew L. Creech; +Cc: Ben Gardiner, MTD list

On Wed, 2011-06-01 at 11:02 +0300, Artem Bityutskiy wrote:
> I see the following possibilities:
> 
> 1. The data has been written like this - then the bug is at writing
> side. Check the data node - what is its length, is CRC correct? It would
> be useful to dump the node which cannot be decompressed - I'd accept
> such patch with great delight.
> 
> 2. You had power cuts while this peace of data has been written and
> recovery did not work correctly. Enabling mount and recovery messages
> would help.
> 
> 3. I merged several changes to 2.6.39 which could in theory break
> recovery. Try to reproduce this with 2.6.38.
> 
> 4. The fixup feature might have broke this - we might for some reason
> read less data than there. Although I see FFs start at offset 416, which
> is strange.

So Matthew, let's try to exclude some of these possibilities. Please,
send me some description about what you do to reproduce this. Is this
error persistent or it goes away after remount or reboot. And other
information relevant to the above points.

Thanks!

-- 
Best Regards,
Artem Bityutskiy (Артём Битюцкий)

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: ubifs_decompress: cannot decompress ...
  2011-06-01  8:02     ` Artem Bityutskiy
  2011-06-01  8:07       ` Artem Bityutskiy
@ 2011-06-01  8:39       ` Artem Bityutskiy
  2011-06-02  4:34       ` Matthew L. Creech
  2 siblings, 0 replies; 25+ messages in thread
From: Artem Bityutskiy @ 2011-06-01  8:39 UTC (permalink / raw)
  To: Matthew L. Creech; +Cc: Ben Gardiner, MTD list

On Wed, 2011-06-01 at 11:02 +0300, Artem Bityutskiy wrote:
> 3. I merged several changes to 2.6.39 which could in theory break
> recovery. Try to reproduce this with 2.6.38.

Err, sorry, those went to 2.6.40, so disregard this point please.

-- 
Best Regards,
Artem Bityutskiy (Артём Битюцкий)

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: ubifs_decompress: cannot decompress ...
  2011-06-01  8:02     ` Artem Bityutskiy
  2011-06-01  8:07       ` Artem Bityutskiy
  2011-06-01  8:39       ` Artem Bityutskiy
@ 2011-06-02  4:34       ` Matthew L. Creech
  2 siblings, 0 replies; 25+ messages in thread
From: Matthew L. Creech @ 2011-06-02  4:34 UTC (permalink / raw)
  To: dedekind1; +Cc: Ben Gardiner, MTD list

On Wed, Jun 1, 2011 at 4:02 AM, Artem Bityutskiy <dedekind1@gmail.com> wrote:
>
> 4. The fixup feature might have broke this - we might for some reason
> read less data than there. Although I see FFs start at offset 416, which
> is strange.
>

Also, I can rule this possibility out - although the device is now
running a recent kernel with the fix-up code present, it was upgraded
last week from an older version.  So the UBIFS image it was originally
programmed with (and is still using) didn't have the fixup flag set.

-- 
Matthew L. Creech

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: ubifs_decompress: cannot decompress ...
  2011-05-27 21:12 ubifs_decompress: cannot decompress Matthew L. Creech
  2011-05-30 12:29 ` Ben Gardiner
@ 2011-06-01  7:48 ` Artem Bityutskiy
  1 sibling, 0 replies; 25+ messages in thread
From: Artem Bityutskiy @ 2011-06-01  7:48 UTC (permalink / raw)
  To: Matthew L. Creech; +Cc: MTD list

On Fri, 2011-05-27 at 17:12 -0400, Matthew L. Creech wrote:
> Hi,
> 
> We encountered a device which has the following error:
> 
> UBIFS error (pid 428): ubifs_decompress: cannot decompress 1010 bytes,
> compressor lzo, error -22
> UBIFS error (pid 428): read_block: bad data node (block 388, inode 556196)
> UBIFS error (pid 428): do_readpage: cannot read page 388 of inode
> 556196, error -22
> UBIFS error (pid 428): ubifs_decompress: cannot decompress 1010 bytes,
> compressor lzo, error -22
> UBIFS error (pid 428): read_block: bad data node (block 388, inode 556196)
> UBIFS error (pid 428): do_readpage: cannot read page 388 of inode
> 556196, error -22
> 
> The device is still bootable (though not functional), so I can get
> more debug info if needed.  Any ideas on what could cause this?

Yes, you can enable mount and recovery debugging messages and send me
the log.

> - MPC8313 (using fsl_elbc_nand driver)
> - 1GB NAND flash (partitioned into 5MB for U-Boot, remainder for UBI/UBIFS)
> - 2.6.39 kernel
> 
> Thanks!
> 


-- 
Best Regards,
Artem Bityutskiy (Артём Битюцкий)

^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2011-06-20 15:35 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2011-05-27 21:12 ubifs_decompress: cannot decompress Matthew L. Creech
2011-05-30 12:29 ` Ben Gardiner
2011-05-31 15:47   ` Matthew L. Creech
2011-05-31 16:10     ` Ben Gardiner
2011-05-31 21:47       ` Matthew L. Creech
2011-06-01  7:51         ` Artem Bityutskiy
2011-06-02  4:30           ` Matthew L. Creech
2011-06-02 18:59             ` Matthew L. Creech
2011-06-06  9:58               ` Artem Bityutskiy
2011-06-06 16:04                 ` Matthew L. Creech
2011-06-06 16:18                   ` Artem Bityutskiy
2011-06-06 19:52                     ` Matthew L. Creech
2011-06-07  4:34                       ` Artem Bityutskiy
2011-06-07 20:41                         ` Matthew L. Creech
2011-06-08 14:11                           ` Artem Bityutskiy
2011-06-08 17:50                             ` Matthew L. Creech
2011-06-09 12:10                               ` Artem Bityutskiy
2011-06-20 15:35                                 ` Matthew L. Creech
2011-06-07 10:24                       ` Artem Bityutskiy
2011-06-03  4:32             ` Artem Bityutskiy
2011-06-01  8:02     ` Artem Bityutskiy
2011-06-01  8:07       ` Artem Bityutskiy
2011-06-01  8:39       ` Artem Bityutskiy
2011-06-02  4:34       ` Matthew L. Creech
2011-06-01  7:48 ` Artem Bityutskiy

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).