* ubifs_decompress: cannot decompress ... @ 2011-05-27 21:12 Matthew L. Creech 2011-05-30 12:29 ` Ben Gardiner 2011-06-01 7:48 ` Artem Bityutskiy 0 siblings, 2 replies; 25+ messages in thread From: Matthew L. Creech @ 2011-05-27 21:12 UTC (permalink / raw) To: MTD list Hi, We encountered a device which has the following error: UBIFS error (pid 428): ubifs_decompress: cannot decompress 1010 bytes, compressor lzo, error -22 UBIFS error (pid 428): read_block: bad data node (block 388, inode 556196) UBIFS error (pid 428): do_readpage: cannot read page 388 of inode 556196, error -22 UBIFS error (pid 428): ubifs_decompress: cannot decompress 1010 bytes, compressor lzo, error -22 UBIFS error (pid 428): read_block: bad data node (block 388, inode 556196) UBIFS error (pid 428): do_readpage: cannot read page 388 of inode 556196, error -22 The device is still bootable (though not functional), so I can get more debug info if needed. Any ideas on what could cause this? - MPC8313 (using fsl_elbc_nand driver) - 1GB NAND flash (partitioned into 5MB for U-Boot, remainder for UBI/UBIFS) - 2.6.39 kernel Thanks! -- Matthew L. Creech ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: ubifs_decompress: cannot decompress ... 2011-05-27 21:12 ubifs_decompress: cannot decompress Matthew L. Creech @ 2011-05-30 12:29 ` Ben Gardiner 2011-05-31 15:47 ` Matthew L. Creech 2011-06-01 7:48 ` Artem Bityutskiy 1 sibling, 1 reply; 25+ messages in thread From: Ben Gardiner @ 2011-05-30 12:29 UTC (permalink / raw) To: Matthew L. Creech; +Cc: MTD list Hi Matthew, On Fri, May 27, 2011 at 5:12 PM, Matthew L. Creech <mlcreech@gmail.com> wrote: > We encountered a device which has the following error: > > UBIFS error (pid 428): ubifs_decompress: cannot decompress 1010 bytes, > compressor lzo, error -22 > UBIFS error (pid 428): read_block: bad data node (block 388, inode 556196) > UBIFS error (pid 428): do_readpage: cannot read page 388 of inode > 556196, error -22 > UBIFS error (pid 428): ubifs_decompress: cannot decompress 1010 bytes, > compressor lzo, error -22 > UBIFS error (pid 428): read_block: bad data node (block 388, inode 556196) > UBIFS error (pid 428): do_readpage: cannot read page 388 of inode > 556196, error -22 It looks like the lzo.c decompress function lzo_decompress() maps all non-ok return codes from lzo1x_decompress_safe() to -EINVAL (-22). So any one of the possible failures of the decompressor could cause the error you are observing. I don't see any debug statements in lzo1x_decompress_safe() that can be enabled, so you might want to add some printing to lzo1x_decompress_safe() or at least print the non-ok return code of lzo1x_decompress_safe() in lzo_decompress() to get an idea of how the decompressor is failing. Best Regards, Ben Gardiner --- Nanometrics Inc. http://www.nanometrics.ca ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: ubifs_decompress: cannot decompress ... 2011-05-30 12:29 ` Ben Gardiner @ 2011-05-31 15:47 ` Matthew L. Creech 2011-05-31 16:10 ` Ben Gardiner 2011-06-01 8:02 ` Artem Bityutskiy 0 siblings, 2 replies; 25+ messages in thread From: Matthew L. Creech @ 2011-05-31 15:47 UTC (permalink / raw) To: Ben Gardiner; +Cc: MTD list On Mon, May 30, 2011 at 8:29 AM, Ben Gardiner <bengardiner@nanometrics.ca> wrote: > > I don't see any debug statements in lzo1x_decompress_safe() that can > be enabled, so you might want to add some printing to > lzo1x_decompress_safe() or at least print the non-ok return code of > lzo1x_decompress_safe() in lzo_decompress() to get an idea of how the > decompressor is failing. > Looks like it's returning LZO_E_LOOKBEHIND_OVERRUN. I don't know what that indicates, but there is trailing 0xff data in the block to be decompressed if that matters: XXXX: LZO_E_LOOKBEHIND_OVERRUN UBIFS error (pid 428): ubifs_decompress: cannot decompress 1010 bytes, compressor lzo, error -22 00000000: 00 0f 69 6e 3a 61 74 74 72 00 c2 38 1c 03 39 03 ..in:attr..8..9. 00000010: 2f 30 31 2f 6d 57 43 2f 2e 56 61 6c 75 65 4d 61 /01/mWC/.ValueMa 00000020: 78 3a 61 91 03 94 31 72 00 69 6e d0 03 00 01 37 x:a...1r.in....7 00000030: 1c 03 3d 01 2f 53 65 72 69 61 6c 4e 75 6d 62 65 ..=./SerialNumbe 00000040: 72 2f 98 07 98 03 00 0a 04 2f 03 63 01 2f 53 6f r/......./.c./So 00000050: 66 74 77 61 72 65 2f 43 6f 6d 6d 61 6e 64 73 2f ftware/Commands/ 00000060: 6e 65 78 74 d2 01 49 44 29 bc 00 03 06 47 04 81 next..ID)....G.. 00000070: 0d 03 28 c0 00 00 04 44 61 74 61 20 53 65 72 76 ..(....Data Serv 00000080: 65 72 73 2f 42 41 43 6e 65 74 2d 49 50 c4 01 0b ers/BACnet-IP... 00000090: 44 65 76 69 63 65 49 6e 73 74 61 6e 63 65 29 14 DeviceInstance). 000000a0: 01 05 00 f7 28 41 04 81 03 02 20 0c 1f 01 4d 41 ....(A.... ...MA 000000b0: 43 29 08 01 03 02 19 42 04 81 05 20 0d 07 01 4e C).....B... ...N 000000c0: 61 6d 2a 14 02 02 02 18 3c 03 7b 20 02 04 01 07 am*.....<.{ .... 000000d0: 42 4d 44 41 64 64 72 65 73 73 2a fc 01 02 1d 40 BMDAddress*....@ 000000e0: 04 81 01 20 05 f4 00 06 54 69 6d 65 54 6f 4c 69 ... ....TimeToLi 000000f0: 76 2b f5 01 1e 20 05 f4 01 02 61 73 65 49 64 80 v+... ....aseId. 00000100: 36 2a f5 01 20 20 05 f4 03 07 44 65 66 61 75 6c 6*.. ....Defaul 00000110: 74 4e 65 74 2f e4 07 03 02 1a 46 04 81 0d 20 01 tNet/.....F... . 00000120: 00 03 00 02 45 6e 61 62 6c 65 42 61 73 65 46 6f ....EnableBaseFo 00000130: 72 47 61 74 65 77 61 79 2a 24 02 02 1f 45 04 81 rGateway*$...E.. 00000140: 0b 20 0e 18 01 03 52 6f 75 74 65 64 2a 14 01 02 . ....Routed*... 00000150: 21 50 04 81 21 20 01 14 01 c2 07 30 31 3e 84 09 !P..! .....01>.. 00000160: 03 01 f4 4b 04 81 17 20 17 40 01 2d a8 09 02 24 ...K... .@.-...$ 00000170: 4c 04 81 19 20 17 2c 01 2e d0 09 01 23 3a 03 77 L... .,.....#:.w 00000180: 20 07 2c 01 2d 1d 02 1c 20 05 d0 0c b0 24 33 cc .,.-... ....$3. 00000190: 07 02 1b 00 00 00 11 6c 00 00 3f 6a 68 2e 73 ec .......l..?jh.s. 000001a0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................ 000001b0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................ 000001c0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................ 000001d0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................ 000001e0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................ 000001f0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................ 00000200: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................ 00000210: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................ 00000220: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................ 00000230: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................ 00000240: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................ 00000250: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................ 00000260: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................ 00000270: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................ 00000280: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................ 00000290: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................ 000002a0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................ 000002b0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................ 000002c0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................ 000002d0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................ 000002e0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................ 000002f0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................ 00000300: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................ 00000310: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................ 00000320: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................ 00000330: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................ 00000340: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................ 00000350: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................ 00000360: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................ 00000370: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................ 00000380: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................ 00000390: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................ 000003a0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................ 000003b0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................ 000003c0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................ 000003d0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................ 000003e0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................ 000003f0: ff ff UBIFS error (pid 428): read_block: bad data node (block 388, inode 556196) UBIFS error (pid 428): do_readpage: cannot read page 388 of inode 556196, error -22 -- Matthew L. Creech ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: ubifs_decompress: cannot decompress ... 2011-05-31 15:47 ` Matthew L. Creech @ 2011-05-31 16:10 ` Ben Gardiner 2011-05-31 21:47 ` Matthew L. Creech 2011-06-01 8:02 ` Artem Bityutskiy 1 sibling, 1 reply; 25+ messages in thread From: Ben Gardiner @ 2011-05-31 16:10 UTC (permalink / raw) To: Matthew L. Creech; +Cc: MTD list Hi Matthew, On Tue, May 31, 2011 at 11:47 AM, Matthew L. Creech <mlcreech@gmail.com> wrote: > On Mon, May 30, 2011 at 8:29 AM, Ben Gardiner > <bengardiner@nanometrics.ca> wrote: >> >> I don't see any debug statements in lzo1x_decompress_safe() that can >> be enabled, so you might want to add some printing to >> lzo1x_decompress_safe() or at least print the non-ok return code of >> lzo1x_decompress_safe() in lzo_decompress() to get an idea of how the >> decompressor is failing. >> > > Looks like it's returning LZO_E_LOOKBEHIND_OVERRUN. I don't know what > that indicates, but there is trailing 0xff data in the block to be > decompressed if that matters: Interesting -- does the trailing 0xff have ECC set, or is it erased pages of 0xff? > XXXX: LZO_E_LOOKBEHIND_OVERRUN Can't say much more about this than lzo-2.05/doc/LZOAPI.TXT [1] does: LZO_E_LOOKBEHIND_OVERRUN Your data is corrupted. > UBIFS error (pid 428): ubifs_decompress: cannot decompress 1010 bytes, > compressor lzo, error -22 > 00000000: 00 0f 69 6e 3a 61 74 74 72 00 c2 38 1c 03 39 03 ..in:attr..8..9. > 00000010: 2f 30 31 2f 6d 57 43 2f 2e 56 61 6c 75 65 4d 61 /01/mWC/.ValueMa > 00000020: 78 3a 61 91 03 94 31 72 00 69 6e d0 03 00 01 37 x:a...1r.in....7 > 00000030: 1c 03 3d 01 2f 53 65 72 69 61 6c 4e 75 6d 62 65 ..=./SerialNumbe > 00000040: 72 2f 98 07 98 03 00 0a 04 2f 03 63 01 2f 53 6f r/......./.c./So > 00000050: 66 74 77 61 72 65 2f 43 6f 6d 6d 61 6e 64 73 2f ftware/Commands/ > 00000060: 6e 65 78 74 d2 01 49 44 29 bc 00 03 06 47 04 81 next..ID)....G.. > 00000070: 0d 03 28 c0 00 00 04 44 61 74 61 20 53 65 72 76 ..(....Data Serv > 00000080: 65 72 73 2f 42 41 43 6e 65 74 2d 49 50 c4 01 0b ers/BACnet-IP... > 00000090: 44 65 76 69 63 65 49 6e 73 74 61 6e 63 65 29 14 DeviceInstance). > 000000a0: 01 05 00 f7 28 41 04 81 03 02 20 0c 1f 01 4d 41 ....(A.... ...MA > 000000b0: 43 29 08 01 03 02 19 42 04 81 05 20 0d 07 01 4e C).....B... ...N > 000000c0: 61 6d 2a 14 02 02 02 18 3c 03 7b 20 02 04 01 07 am*.....<.{ .... > 000000d0: 42 4d 44 41 64 64 72 65 73 73 2a fc 01 02 1d 40 BMDAddress*....@ > 000000e0: 04 81 01 20 05 f4 00 06 54 69 6d 65 54 6f 4c 69 ... ....TimeToLi > 000000f0: 76 2b f5 01 1e 20 05 f4 01 02 61 73 65 49 64 80 v+... ....aseId. > 00000100: 36 2a f5 01 20 20 05 f4 03 07 44 65 66 61 75 6c 6*.. ....Defaul > 00000110: 74 4e 65 74 2f e4 07 03 02 1a 46 04 81 0d 20 01 tNet/.....F... . > 00000120: 00 03 00 02 45 6e 61 62 6c 65 42 61 73 65 46 6f ....EnableBaseFo > 00000130: 72 47 61 74 65 77 61 79 2a 24 02 02 1f 45 04 81 rGateway*$...E.. > 00000140: 0b 20 0e 18 01 03 52 6f 75 74 65 64 2a 14 01 02 . ....Routed*... > 00000150: 21 50 04 81 21 20 01 14 01 c2 07 30 31 3e 84 09 !P..! .....01>.. > 00000160: 03 01 f4 4b 04 81 17 20 17 40 01 2d a8 09 02 24 ...K... .@.-...$ > 00000170: 4c 04 81 19 20 17 2c 01 2e d0 09 01 23 3a 03 77 L... .,.....#:.w > 00000180: 20 07 2c 01 2d 1d 02 1c 20 05 d0 0c b0 24 33 cc .,.-... ....$3. > 00000190: 07 02 1b 00 00 00 11 6c 00 00 3f 6a 68 2e 73 ec .......l..?jh.s. > 000001a0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................ > 000001b0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................ > 000001c0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................ > 000001d0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................ > 000001e0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................ > 000001f0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................ > 00000200: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................ > 00000210: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................ > 00000220: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................ > 00000230: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................ > 00000240: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................ > 00000250: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................ > 00000260: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................ > 00000270: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................ > 00000280: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................ > 00000290: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................ > 000002a0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................ > 000002b0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................ > 000002c0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................ > 000002d0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................ > 000002e0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................ > 000002f0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................ > 00000300: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................ > 00000310: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................ > 00000320: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................ > 00000330: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................ > 00000340: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................ > 00000350: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................ > 00000360: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................ > 00000370: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................ > 00000380: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................ > 00000390: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................ > 000003a0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................ > 000003b0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................ > 000003c0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................ > 000003d0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................ > 000003e0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................ > 000003f0: ff ff > UBIFS error (pid 428): read_block: bad data node (block 388, inode 556196) > UBIFS error (pid 428): do_readpage: cannot read page 388 of inode > 556196, error -22 Could it be possible that writing the page was interrupted? I guess the CRC checks above decompress would catch that though. Sorry I can't be of more help here. Best Regards, Ben Gardiner [1] http://www.oberhumer.com/opensource/lzo/download/ --- Nanometrics Inc. http://www.nanometrics.ca ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: ubifs_decompress: cannot decompress ... 2011-05-31 16:10 ` Ben Gardiner @ 2011-05-31 21:47 ` Matthew L. Creech 2011-06-01 7:51 ` Artem Bityutskiy 0 siblings, 1 reply; 25+ messages in thread From: Matthew L. Creech @ 2011-05-31 21:47 UTC (permalink / raw) To: Ben Gardiner; +Cc: MTD list On Tue, May 31, 2011 at 12:10 PM, Ben Gardiner <bengardiner@nanometrics.ca> wrote: > > Interesting -- does the trailing 0xff have ECC set, or is it erased > pages of 0xff? > ... > > Could it be possible that writing the page was interrupted? I guess > the CRC checks above decompress would catch that though. > I verified by adding a call to ubi->mtd->read_oob(): the 0xff data starts on a page boundary (a whole multiple of 2k in my case). The associated OOB area for that page is all 0xff as well. So I guess this is less about the original LZO error, and more about how a page in the middle of a UBIFS node got erased out from underneath it. Which seems suspiciously similar to the ubifs_read_node() error which I reported last year: http://lists.infradead.org/pipermail/linux-mtd/2010-July/031069.html and which is still showing up from time to time on devices in the field. In that case the erased page contained the node header and so "type" was interpreted as 255; in this case the erased page is in the middle of a data node, resulting in decompression failure instead. Unfortunately it's not repeatable enough for us to capture adequate debug output - we've had several devices logging extensive debug output via netconsole for months now, but none have recreated this problem so far under those conditions. -- Matthew L. Creech ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: ubifs_decompress: cannot decompress ... 2011-05-31 21:47 ` Matthew L. Creech @ 2011-06-01 7:51 ` Artem Bityutskiy 2011-06-02 4:30 ` Matthew L. Creech 0 siblings, 1 reply; 25+ messages in thread From: Artem Bityutskiy @ 2011-06-01 7:51 UTC (permalink / raw) To: Matthew L. Creech; +Cc: Ben Gardiner, MTD list On Tue, 2011-05-31 at 17:47 -0400, Matthew L. Creech wrote: > Unfortunately it's not repeatable enough for us to capture adequate > debug output - we've had several devices logging extensive debug > output via netconsole for months now, but none have recreated this > problem so far under those conditions. How this happens? What do you do? Does this happen after mount when you first read your data? Or this happens at some point while you stress testing your system? Or this happens after a power cut? -- Best Regards, Artem Bityutskiy (Артём Битюцкий) ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: ubifs_decompress: cannot decompress ... 2011-06-01 7:51 ` Artem Bityutskiy @ 2011-06-02 4:30 ` Matthew L. Creech 2011-06-02 18:59 ` Matthew L. Creech 2011-06-03 4:32 ` Artem Bityutskiy 0 siblings, 2 replies; 25+ messages in thread From: Matthew L. Creech @ 2011-06-02 4:30 UTC (permalink / raw) To: dedekind1; +Cc: Ben Gardiner, MTD list On Wed, Jun 1, 2011 at 3:51 AM, Artem Bityutskiy <dedekind1@gmail.com> wrote: > > How this happens? What do you do? Does this happen after mount when you > first read your data? Or this happens at some point while you stress > testing your system? Or this happens after a power cut? > So far there's no discernable pattern. Most of the failed units are returns from the field, so we don't know what kind of conditions they've been placed in. Some are from our test department, but we haven't found anything that might "trigger" the problem in any way. The device works fine for some period of time (usually weeks / months), then we get complaints about various problems. The reported symptoms eventually come down to one of these UBIFS errors. Depending on the region which happens to go bad, it can result in breakage of a minor feature (because a file we try to read/write after mount triggers the error), all the way up to a completely non-functional device. I'm not sure if we've ever seen it fail to mount altogether (I'll check into that), but we've had several cases in which U-Boot couldn't read the kernel image from UBIFS, so the device wouldn't boot Linux at all. Power cuts are probably not common, though. We have to expect them in the product of course, but practically speaking, our service guy assures me that a couple of the bad units he shipped me had stable power and were rarely/never rebooted. But I can't rule it out with certainty. Aside from that, it's just normal operation. If the usage pattern matters, the only files ever written to in the persistent (UBIFS) filesystem are SQLite databases. It's generally light usage, logging a variety of measurements once every 5 minutes. I've tried stress-testing by running non-stop SQLite operations, recreating the normal usage pattern but with a _much_ higher frequency of writes than normal. It didn't seem to help reproduce the error - we've yet to succeed in making this problem happen under controlled conditions. As for this specific error (ubifs_decompress): tomorrow I'll try to gather & post additional log data for this device. Thanks! -- Matthew L. Creech ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: ubifs_decompress: cannot decompress ... 2011-06-02 4:30 ` Matthew L. Creech @ 2011-06-02 18:59 ` Matthew L. Creech 2011-06-06 9:58 ` Artem Bityutskiy 2011-06-03 4:32 ` Artem Bityutskiy 1 sibling, 1 reply; 25+ messages in thread From: Matthew L. Creech @ 2011-06-02 18:59 UTC (permalink / raw) To: dedekind1; +Cc: Ben Gardiner, MTD list On Thu, Jun 2, 2011 at 12:30 AM, Matthew L. Creech <mlcreech@gmail.com> wrote: > > As for this specific error (ubifs_decompress): tomorrow I'll try to > gather & post additional log data for this device. Thanks! > Here is a console dump with more information enabled: http://mcreech.com/work/ubifs-decompress-err.txt I turned on mount & recovery debug messages, although it seems to mount & recover correctly, so presumably any info from past recovery actions is long gone. The error actually occurs later on, when our main application accesses SQLite databases. I dumped out the corrupted node data in 3 places: 1. In ubifs_decompress(), I dumped the data buffer, prefixed with "compressed node" (redundant with #3, really) 2. In read_block(), I page-align the starting offset & size, re-fetch the pertinent pages from flash, and dump those with corresponding OOB info, prefixed with "data" and "oob" respectively 3. There's already a dbg_dump_node() call in read_block(), so I enabled that as well -- Matthew L. Creech ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: ubifs_decompress: cannot decompress ... 2011-06-02 18:59 ` Matthew L. Creech @ 2011-06-06 9:58 ` Artem Bityutskiy 2011-06-06 16:04 ` Matthew L. Creech 0 siblings, 1 reply; 25+ messages in thread From: Artem Bityutskiy @ 2011-06-06 9:58 UTC (permalink / raw) To: Matthew L. Creech; +Cc: Ben Gardiner, MTD list On Thu, 2011-06-02 at 14:59 -0400, Matthew L. Creech wrote: > On Thu, Jun 2, 2011 at 12:30 AM, Matthew L. Creech <mlcreech@gmail.com> wrote: > > > > As for this specific error (ubifs_decompress): tomorrow I'll try to > > gather & post additional log data for this device. Thanks! > > > > Here is a console dump with more information enabled: > > http://mcreech.com/work/ubifs-decompress-err.txt > > I turned on mount & recovery debug messages, although it seems to > mount & recover correctly, so presumably any info from past recovery > actions is long gone. The error actually occurs later on, when our > main application accesses SQLite databases. > > I dumped out the corrupted node data in 3 places: > > 1. In ubifs_decompress(), I dumped the data buffer, prefixed with > "compressed node" (redundant with #3, really) > > 2. In read_block(), I page-align the starting offset & size, re-fetch > the pertinent pages from flash, and dump those with corresponding OOB > info, prefixed with "data" and "oob" respectively > > 3. There's already a dbg_dump_node() call in read_block(), so I > enabled that as well So the corruption starts exactly at the NAND page boundary. This makes me believe that the reason is most probably power cut recovery. But you say your client ensures there were none... Yes, you are right that info from past is gone... What I'd like to see is a dump of the whole LEB. Could you please add 'dbg_dump_leb()' - basically I want to look if this LEB was passed through GC. Because my theory is: 1. You have LEB A which contains this data node, but it is not corrupted yet. Let's call this data node X. 2. GC moves valid data from LEB A to LEB B (lnum 3479). 3. We get a power cut while moving the data. We end up with node X corrupted in B. 4. UBIFS recovery has a bug and it decides that the copy of node X in LEB B is OK, commits, and LEB A is erased at some point. 5. And we are in your situation... But this is just a theory. I actually worked on power cut emulation testing improvements lately, and the current state is that "integck -p" fails sometimes. I need to investigate it - might turn out to be a bug which cases the effect you see. Basically, I've improved UBIFS power cut testing and corrupt the buffer with random data, not only with 0xFFs, and now integck -p starts failing. See this commit: http://git.infradead.org/ubifs-2.6.git/commit/96c32bb596c5a74362a6a825f66fde68b6c3487c It contains several unrelated changes. Ignore the simple random part, only the changes in 'cut_data()' are interesting. But I'll split that change on several changes. And I'll try to investigate the issue - it might turn out to be integck's issue, will see. -- Best Regards, Artem Bityutskiy (Артём Битюцкий) ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: ubifs_decompress: cannot decompress ... 2011-06-06 9:58 ` Artem Bityutskiy @ 2011-06-06 16:04 ` Matthew L. Creech 2011-06-06 16:18 ` Artem Bityutskiy 0 siblings, 1 reply; 25+ messages in thread From: Matthew L. Creech @ 2011-06-06 16:04 UTC (permalink / raw) To: dedekind1; +Cc: Ben Gardiner, MTD list On Mon, Jun 6, 2011 at 5:58 AM, Artem Bityutskiy <dedekind1@gmail.com> wrote: > > So the corruption starts exactly at the NAND page boundary. This makes > me believe that the reason is most probably power cut recovery. But you > say your client ensures there were none... > > Yes, you are right that info from past is gone... What I'd like to see > is a dump of the whole LEB. Could you please add 'dbg_dump_leb()' - > basically I want to look if this LEB was passed through GC. > [Repost, gmail mangled my node dump] The dbg_dump_leb() call fails due to a bad CRC; I've posted its output plus the corresponding raw LEB dump (obtained via mtd->read_oob()) here: http://mcreech.com/work/ubifs-decompress-err-rawleb.txt -- Matthew L. Creech ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: ubifs_decompress: cannot decompress ... 2011-06-06 16:04 ` Matthew L. Creech @ 2011-06-06 16:18 ` Artem Bityutskiy 2011-06-06 19:52 ` Matthew L. Creech 0 siblings, 1 reply; 25+ messages in thread From: Artem Bityutskiy @ 2011-06-06 16:18 UTC (permalink / raw) To: Matthew L. Creech; +Cc: Ben Gardiner, MTD list On Mon, 2011-06-06 at 12:04 -0400, Matthew L. Creech wrote: > On Mon, Jun 6, 2011 at 5:58 AM, Artem Bityutskiy <dedekind1@gmail.com> wrote: > > > > So the corruption starts exactly at the NAND page boundary. This makes > > me believe that the reason is most probably power cut recovery. But you > > say your client ensures there were none... > > > > Yes, you are right that info from past is gone... What I'd like to see > > is a dump of the whole LEB. Could you please add 'dbg_dump_leb()' - > > basically I want to look if this LEB was passed through GC. > > > > [Repost, gmail mangled my node dump] > > The dbg_dump_leb() call fails due to a bad CRC; I've posted its output > plus the corresponding raw LEB dump (obtained via mtd->read_oob()) > here: I have to go home now - could you please improve dbg_dump_leb(). Currently it calls ubifs_scan(), which scans, finds corrupted node, prints corruption information and returns -EUCLEAN and destroys the scanned data. Instead, we could do: 1. Add another parameter to ubifs_scan() which makes it to _not_ free scanned data on corruption. 2. In dbg_dump_leb() when you get -EUCLEAN - just go ahead and print the scanned information instead of exiting. Should not be too difficult to do. -- Best Regards, Artem Bityutskiy (Артём Битюцкий) ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: ubifs_decompress: cannot decompress ... 2011-06-06 16:18 ` Artem Bityutskiy @ 2011-06-06 19:52 ` Matthew L. Creech 2011-06-07 4:34 ` Artem Bityutskiy 2011-06-07 10:24 ` Artem Bityutskiy 0 siblings, 2 replies; 25+ messages in thread From: Matthew L. Creech @ 2011-06-06 19:52 UTC (permalink / raw) To: linux-mtd; +Cc: dedekind1 On Mon, Jun 6, 2011 at 12:18 PM, Artem Bityutskiy <dedekind1@gmail.com> wrote: > > I have to go home now - could you please improve dbg_dump_leb(). > Currently it calls ubifs_scan(), which scans, finds corrupted node, > prints corruption information and returns -EUCLEAN and destroys the > scanned data. > Will something like this be okay? Or do you still want to dump the partially-parsed data from the corrupt node as well (not just the raw contents of the LEB)? Currently an error in ubifs_scan() will cause dbg_dump_leb() to abort without completing the dump. Instead, we should abandon parsing the data, but dump the raw (uninterpreted) LEB contents instead. Signed-off-by: Matthew L. Creech <mlcreech@gmail.com> --- fs/ubifs/debug.c | 3 +++ 1 files changed, 3 insertions(+), 0 deletions(-) diff --git a/fs/ubifs/debug.c b/fs/ubifs/debug.c index 26d4c61..6ab43e4 100644 --- a/fs/ubifs/debug.c +++ b/fs/ubifs/debug.c @@ -901,6 +901,9 @@ void dbg_dump_leb(const struct ubifs_info *c, int lnum) sleb = ubifs_scan(c, lnum, 0, buf, 0); if (IS_ERR(sleb)) { ubifs_err("scan error %d", (int)PTR_ERR(sleb)); + printk(KERN_DEBUG "\tLEB data buffer:\n"); + print_hex_dump(KERN_DEBUG, "\t", DUMP_PREFIX_OFFSET, 32, 1, + buf, c->leb_size, 0); goto out; } -- 1.6.3.3 ^ permalink raw reply related [flat|nested] 25+ messages in thread
* Re: ubifs_decompress: cannot decompress ... 2011-06-06 19:52 ` Matthew L. Creech @ 2011-06-07 4:34 ` Artem Bityutskiy 2011-06-07 20:41 ` Matthew L. Creech 2011-06-07 10:24 ` Artem Bityutskiy 1 sibling, 1 reply; 25+ messages in thread From: Artem Bityutskiy @ 2011-06-07 4:34 UTC (permalink / raw) To: Matthew L. Creech; +Cc: linux-mtd On Mon, 2011-06-06 at 15:52 -0400, Matthew L. Creech wrote: > ubifs_err("scan error %d", (int)PTR_ERR(sleb)); > + printk(KERN_DEBUG "\tLEB data buffer:\n"); > + print_hex_dump(KERN_DEBUG, "\t", DUMP_PREFIX_OFFSET, 32, 1, > + buf, c->leb_size, 0); No, I have difficulties reading hexdumps. You have set of good nodes following by one broken node. I wanted to see a human-readable dump of the good nodes at the beginning of the LEB. -- Best Regards, Artem Bityutskiy (Артём Битюцкий) ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: ubifs_decompress: cannot decompress ... 2011-06-07 4:34 ` Artem Bityutskiy @ 2011-06-07 20:41 ` Matthew L. Creech 2011-06-08 14:11 ` Artem Bityutskiy 0 siblings, 1 reply; 25+ messages in thread From: Matthew L. Creech @ 2011-06-07 20:41 UTC (permalink / raw) To: dedekind1; +Cc: linux-mtd On Tue, Jun 7, 2011 at 12:34 AM, Artem Bityutskiy <dedekind1@gmail.com> wrote: > > No, I have difficulties reading hexdumps. You have set of good nodes > following by one broken node. I wanted to see a human-readable dump of > the good nodes at the beginning of the LEB. > Oh I see - sorry, I thought you wanted to debug the corrupted portion. Here's the output for my corrupt flash: http://mcreech.com/work/ubifs-2011-06-07.txt I'll follow up with a patch. -- Matthew L. Creech ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: ubifs_decompress: cannot decompress ... 2011-06-07 20:41 ` Matthew L. Creech @ 2011-06-08 14:11 ` Artem Bityutskiy 2011-06-08 17:50 ` Matthew L. Creech 0 siblings, 1 reply; 25+ messages in thread From: Artem Bityutskiy @ 2011-06-08 14:11 UTC (permalink / raw) To: Matthew L. Creech; +Cc: linux-mtd On Tue, 2011-06-07 at 16:41 -0400, Matthew L. Creech wrote: > On Tue, Jun 7, 2011 at 12:34 AM, Artem Bityutskiy <dedekind1@gmail.com> wrote: > > > > No, I have difficulties reading hexdumps. You have set of good nodes > > following by one broken node. I wanted to see a human-readable dump of > > the good nodes at the beginning of the LEB. > > > > Oh I see - sorry, I thought you wanted to debug the corrupted portion. > > Here's the output for my corrupt flash: > > http://mcreech.com/work/ubifs-2011-06-07.txt > > I'll follow up with a patch. Yes, it does look like this LEB might be garbage-collected. But it does not have to be. Anyway, what I can suggest you is to do several things. 1. If you have many occasions of such error, try to gather some information about how the device was used, and if it was uncleanly power-cut. Remember, I often saw that embedded devices have incorrect reboot. Whe users reboot it "normally" - it does not try to unmount the FS-es cleanly and just jumps to som HW reset function. You can verify this by rebooting normally and checking if UBIFS says "recovery needed" or not. If it does - the reboot was not normal. 2. This error may be due to memory corruptions in some driver (e.g., wireless or video), due to issues in the mtd driver, etc. Try to stress your system with slub/slab full checks enabled, and other debugging features which you can find in the "hacking" section of make menuconfig. 3. If my theory is true, then what may help is adding a check it ubifs recovery function. The recovery ends with an ubifs_leb_change() call. You need to check the last node there - is it full and correct? If not, you should print a loud warning and information like leb dump _before_ the change, and dump of the buffer which we are going to write with ubifs_leb_change(). You'd probably need to deploy this check to the field if this issue is not easy to reproduce. If you have then this info you may fix the bug. 4. Set-up power-cut emulation testing in your office. P.S. I'm curious where you use UBIFS, if this is not a trade secret, of course. -- Best Regards, Artem Bityutskiy (Артём Битюцкий) ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: ubifs_decompress: cannot decompress ... 2011-06-08 14:11 ` Artem Bityutskiy @ 2011-06-08 17:50 ` Matthew L. Creech 2011-06-09 12:10 ` Artem Bityutskiy 0 siblings, 1 reply; 25+ messages in thread From: Matthew L. Creech @ 2011-06-08 17:50 UTC (permalink / raw) To: dedekind1; +Cc: linux-mtd On Wed, Jun 8, 2011 at 10:11 AM, Artem Bityutskiy <dedekind1@gmail.com> wrote: > > Yes, it does look like this LEB might be garbage-collected. But it does > not have to be. > > Anyway, what I can suggest you is to do several things. > > 1. If you have many occasions of such error, try to gather some > information about how the device was used, and if it was uncleanly > power-cut. Remember, I often saw that embedded devices have incorrect > reboot. Whe users reboot it "normally" - it does not try to unmount > the FS-es cleanly and just jumps to som HW reset function. > > You can verify this by rebooting normally and checking if UBIFS says > "recovery needed" or not. If it does - the reboot was not normal. > Yes, it currently reboots uncleanly (though it does do a "sync" first). I noticed this a while back, and the next release firmware will have it fixed. However, it doesn't make a huge difference to us, because these devices are probably more likely to experience power loss than a software reboot, in the field at least. > 2. This error may be due to memory corruptions in some driver (e.g., > wireless or video), due to issues in the mtd driver, etc. Try to > stress your system with slub/slab full checks enabled, and other > debugging features which you can find in the "hacking" section of > make menuconfig. > Will do. > 3. If my theory is true, then what may help is adding a check it > ubifs recovery function. The recovery ends with an ubifs_leb_change() > call. You need to check the last node there - is it full and correct? > If not, you should print a loud warning and information like leb dump > _before_ the change, and dump of the buffer which we are going to > write with ubifs_leb_change(). > > You'd probably need to deploy this check to the field if this issue > is not easy to reproduce. If you have then this info you may fix the > bug. > Great, I'll add this check and see if we get any hits. Even if it takes a while to hit it in the field, this would at least give us a way to make some progress in finding the issue. > 4. Set-up power-cut emulation testing in your office. > I did this at one point - I have a programmable UPS, so I was able to automate a test to turn outlet power off & on repeatedly while having the device do some work. It didn't seem to help reproduce the problem, but it's worth trying again on a long-term basis (especially with the change above to try & catch the corruption in the act). Thanks again Artem. -- Matthew L. Creech ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: ubifs_decompress: cannot decompress ... 2011-06-08 17:50 ` Matthew L. Creech @ 2011-06-09 12:10 ` Artem Bityutskiy 2011-06-20 15:35 ` Matthew L. Creech 0 siblings, 1 reply; 25+ messages in thread From: Artem Bityutskiy @ 2011-06-09 12:10 UTC (permalink / raw) To: Matthew L. Creech; +Cc: linux-mtd On Wed, 2011-06-08 at 13:50 -0400, Matthew L. Creech wrote: > On Wed, Jun 8, 2011 at 10:11 AM, Artem Bityutskiy <dedekind1@gmail.com> wrote: > > > > Yes, it does look like this LEB might be garbage-collected. But it does > > not have to be. > > > > Anyway, what I can suggest you is to do several things. > > > > 1. If you have many occasions of such error, try to gather some > > information about how the device was used, and if it was uncleanly > > power-cut. Remember, I often saw that embedded devices have incorrect > > reboot. Whe users reboot it "normally" - it does not try to unmount > > the FS-es cleanly and just jumps to som HW reset function. > > > > You can verify this by rebooting normally and checking if UBIFS says > > "recovery needed" or not. If it does - the reboot was not normal. > > > > Yes, it currently reboots uncleanly (though it does do a "sync" > first). I noticed this a while back, and the next release firmware > will have it fixed. However, it doesn't make a huge difference to us, > because these devices are probably more likely to experience power > loss than a software reboot, in the field at least. > > > 2. This error may be due to memory corruptions in some driver (e.g., > > wireless or video), due to issues in the mtd driver, etc. Try to > > stress your system with slub/slab full checks enabled, and other > > debugging features which you can find in the "hacking" section of > > make menuconfig. > > > > Will do. > > > 3. If my theory is true, then what may help is adding a check it > > ubifs recovery function. The recovery ends with an ubifs_leb_change() > > call. You need to check the last node there - is it full and correct? > > If not, you should print a loud warning and information like leb dump > > _before_ the change, and dump of the buffer which we are going to > > write with ubifs_leb_change(). > > > > You'd probably need to deploy this check to the field if this issue > > is not easy to reproduce. If you have then this info you may fix the > > bug. > > > > Great, I'll add this check and see if we get any hits. Even if it > takes a while to hit it in the field, this would at least give us a > way to make some progress in finding the issue. With my latest code-base, I am able to inject a hack into ubifs_leb_change() - but this function does not exist in your code-base. Anyway, I'm currently running power cut emulation testing with the following hack: >From df163f4dd8a1604fe3085c4d11281c530837bc53 Mon Sep 17 00:00:00 2001 From: Artem Bityutskiy <Artem.Bityutskiy@nokia.com> Date: Thu, 9 Jun 2011 15:08:59 +0300 Subject: [PATCH] UBIFS: temporary: hack to check recovery We suspect that recovery cuts nodes sometimes. This is the hack which should catch such things. We hack ubifs_change_leb and scan the leb right after changing it - if we wrote corrupted data there, scan should fail. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com> --- fs/ubifs/io.c | 24 ++++++++++++++++++++++++ 1 files changed, 24 insertions(+), 0 deletions(-) diff --git a/fs/ubifs/io.c b/fs/ubifs/io.c index 9228950..9f7dbbf 100644 --- a/fs/ubifs/io.c +++ b/fs/ubifs/io.c @@ -153,6 +153,30 @@ int ubifs_leb_change(struct ubifs_info *c, int lnum, const void *buf, int len, ubifs_ro_mode(c, err); dbg_dump_stack(); } + + /* Temporary hack to catch incorrect recovery, if we have such */ + if (!err && (lnum < c->lpt_first || lnum > c->lpt_last)) { + void *buf = vmalloc(c->leb_size); + struct ubifs_scan_leb *sleb; + + if (!buf) + return 0; + + sleb = ubifs_scan(c, lnum, 0, buf, 0); + if (!IS_ERR(sleb)) { + /* Scan succeeded */ + vfree(buf); + return 0; + } + + ubifs_err("scanning after LEB %d change failed, error %d!", lnum, err); + print_hex_dump(KERN_ERR, "", DUMP_PREFIX_OFFSET, 32, 1, + buf, c->leb_size, 1); + dump_stack(); + vfree(buf); + return -EINVAL; + } + return err; } -- 1.7.2.3 -- Best Regards, Artem Bityutskiy (Артём Битюцкий) ^ permalink raw reply related [flat|nested] 25+ messages in thread
* Re: ubifs_decompress: cannot decompress ... 2011-06-09 12:10 ` Artem Bityutskiy @ 2011-06-20 15:35 ` Matthew L. Creech 0 siblings, 0 replies; 25+ messages in thread From: Matthew L. Creech @ 2011-06-20 15:35 UTC (permalink / raw) To: dedekind1; +Cc: linux-mtd On Thu, Jun 9, 2011 at 8:10 AM, Artem Bityutskiy <dedekind1@gmail.com> wrote: > > With my latest code-base, I am able to inject a hack into > ubifs_leb_change() - but this function does not exist in your code-base. > Anyway, I'm currently running power cut emulation testing with the > following hack: > Hi Artem, just wanted to post an update to this: I hacked this change into my kernel last week, and enabled the in-kernel debugging options you suggested, then set up a power-cut test: - the device in question is powered by a UPS with controllable outlets - its firmware boots and immediately starts writing large amounts of data to flash (so we'll have a dirty FS) - a separate script controlling the UPS waits for the device to boot, then sleeps some random time between 0 and 10 minutes - it then cuts power to the outlet and later restores it The test has been running for the better part of a week with no issues - it has now power-cycled the device over 1500 times, but I've gotten no errors, and the filesystem is functioning fine. FYI this is way more stress than was placed on the "bad" units we've gotten back from the field - most of those rarely lose power, and when they do, they're less likely to be in the middle of a write operation. I'll continue testing, but based on this it seems like UBIFS handles power cuts very well! -- Matthew L. Creech ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: ubifs_decompress: cannot decompress ... 2011-06-06 19:52 ` Matthew L. Creech 2011-06-07 4:34 ` Artem Bityutskiy @ 2011-06-07 10:24 ` Artem Bityutskiy 1 sibling, 0 replies; 25+ messages in thread From: Artem Bityutskiy @ 2011-06-07 10:24 UTC (permalink / raw) To: Matthew L. Creech; +Cc: linux-mtd On Mon, 2011-06-06 at 15:52 -0400, Matthew L. Creech wrote: > On Mon, Jun 6, 2011 at 12:18 PM, Artem Bityutskiy <dedekind1@gmail.com> wrote: > > > > I have to go home now - could you please improve dbg_dump_leb(). > > Currently it calls ubifs_scan(), which scans, finds corrupted node, > > prints corruption information and returns -EUCLEAN and destroys the > > scanned data. > > > > Will something like this be okay? Or do you still want to dump the > partially-parsed data from the corrupt node as well (not just the raw contents > of the LEB)? Probably a temporary hack like this should work. diff --git a/fs/ubifs/scan.c b/fs/ubifs/scan.c index 37383e8..c709045 100644 --- a/fs/ubifs/scan.c +++ b/fs/ubifs/scan.c @@ -352,8 +352,9 @@ corrupted: ubifs_err("LEB %d scanning failed", lnum); } err = -EUCLEAN; - ubifs_scan_destroy(sleb); - return ERR_PTR(err); +// ubifs_scan_destroy(sleb); +// return ERR_PTR(err); + return sleb; error: ubifs_err("LEB %d scanning failed, error %d", lnum, err); -- Best Regards, Artem Bityutskiy (Артём Битюцкий) ^ permalink raw reply related [flat|nested] 25+ messages in thread
* Re: ubifs_decompress: cannot decompress ... 2011-06-02 4:30 ` Matthew L. Creech 2011-06-02 18:59 ` Matthew L. Creech @ 2011-06-03 4:32 ` Artem Bityutskiy 1 sibling, 0 replies; 25+ messages in thread From: Artem Bityutskiy @ 2011-06-03 4:32 UTC (permalink / raw) To: Matthew L. Creech; +Cc: Ben Gardiner, MTD list On Thu, 2011-06-02 at 00:30 -0400, Matthew L. Creech wrote: > On Wed, Jun 1, 2011 at 3:51 AM, Artem Bityutskiy <dedekind1@gmail.com> wrote: > > > > How this happens? What do you do? Does this happen after mount when you > > first read your data? Or this happens at some point while you stress > > testing your system? Or this happens after a power cut? > > > > So far there's no discernable pattern. Most of the failed units are > returns from the field, so we don't know what kind of conditions > they've been placed in. Some are from our test department, but we > haven't found anything that might "trigger" the problem in any way. > > The device works fine for some period of time (usually weeks / > months), then we get complaints about various problems. The reported > symptoms eventually come down to one of these UBIFS errors. Depending > on the region which happens to go bad, it can result in breakage of a > minor feature (because a file we try to read/write after mount > triggers the error), all the way up to a completely non-functional > device. I'm not sure if we've ever seen it fail to mount altogether > (I'll check into that), but we've had several cases in which U-Boot > couldn't read the kernel image from UBIFS, so the device wouldn't boot > Linux at all. > > Power cuts are probably not common, though. We have to expect them in > the product of course, but practically speaking, our service guy > assures me that a couple of the bad units he shipped me had stable > power and were rarely/never rebooted. But I can't rule it out with > certainty. > > Aside from that, it's just normal operation. If the usage pattern > matters, the only files ever written to in the persistent (UBIFS) > filesystem are SQLite databases. It's generally light usage, logging > a variety of measurements once every 5 minutes. I've tried > stress-testing by running non-stop SQLite operations, recreating the > normal usage pattern but with a _much_ higher frequency of writes than > normal. It didn't seem to help reproduce the error - we've yet to > succeed in making this problem happen under controlled conditions. > > As for this specific error (ubifs_decompress): tomorrow I'll try to > gather & post additional log data for this device. Thanks! OK, then this is not about power cuts and unstable bits. First thing coming to my mind is that your kernel may have some non-UBIFS bugs which end up in memory corruptions, so UBIFS writes corrupted data to the flash. But the hexdump you sent shows that you have some non-0xFFs and then many 0xFFs. Are those trailing 0xFFs part of the node data or not? If yes, then it does not look like memory corruption, but more like some driver/flash issues. BTW, have you run mtd tests? Would you mind to set up torture test on one of your boards and let it run fore several weeks. I remember we found a rare DMA bug in our board by running the torture test for long time. Also, it might be interesting how your HW and SW behave when you continuously wear out few eraseblocks. -- Best Regards, Artem Bityutskiy (Артём Битюцкий) ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: ubifs_decompress: cannot decompress ... 2011-05-31 15:47 ` Matthew L. Creech 2011-05-31 16:10 ` Ben Gardiner @ 2011-06-01 8:02 ` Artem Bityutskiy 2011-06-01 8:07 ` Artem Bityutskiy ` (2 more replies) 1 sibling, 3 replies; 25+ messages in thread From: Artem Bityutskiy @ 2011-06-01 8:02 UTC (permalink / raw) To: Matthew L. Creech; +Cc: Ben Gardiner, MTD list On Tue, 2011-05-31 at 11:47 -0400, Matthew L. Creech wrote: > On Mon, May 30, 2011 at 8:29 AM, Ben Gardiner > <bengardiner@nanometrics.ca> wrote: > > > > I don't see any debug statements in lzo1x_decompress_safe() that can > > be enabled, so you might want to add some printing to > > lzo1x_decompress_safe() or at least print the non-ok return code of > > lzo1x_decompress_safe() in lzo_decompress() to get an idea of how the > > decompressor is failing. > > > > Looks like it's returning LZO_E_LOOKBEHIND_OVERRUN. I don't know what > that indicates, but there is trailing 0xff data in the block to be > decompressed if that matters: > > XXXX: LZO_E_LOOKBEHIND_OVERRUN > UBIFS error (pid 428): ubifs_decompress: cannot decompress 1010 bytes, > compressor lzo, error -22 > 00000000: 00 0f 69 6e 3a 61 74 74 72 00 c2 38 1c 03 39 03 ..in:attr..8..9. > 00000010: 2f 30 31 2f 6d 57 43 2f 2e 56 61 6c 75 65 4d 61 /01/mWC/.ValueMa > 00000020: 78 3a 61 91 03 94 31 72 00 69 6e d0 03 00 01 37 x:a...1r.in....7 > 00000030: 1c 03 3d 01 2f 53 65 72 69 61 6c 4e 75 6d 62 65 ..=./SerialNumbe > 00000040: 72 2f 98 07 98 03 00 0a 04 2f 03 63 01 2f 53 6f r/......./.c./So > 00000050: 66 74 77 61 72 65 2f 43 6f 6d 6d 61 6e 64 73 2f ftware/Commands/ > 00000060: 6e 65 78 74 d2 01 49 44 29 bc 00 03 06 47 04 81 next..ID)....G.. > 00000070: 0d 03 28 c0 00 00 04 44 61 74 61 20 53 65 72 76 ..(....Data Serv > 00000080: 65 72 73 2f 42 41 43 6e 65 74 2d 49 50 c4 01 0b ers/BACnet-IP... > 00000090: 44 65 76 69 63 65 49 6e 73 74 61 6e 63 65 29 14 DeviceInstance). > 000000a0: 01 05 00 f7 28 41 04 81 03 02 20 0c 1f 01 4d 41 ....(A.... ...MA > 000000b0: 43 29 08 01 03 02 19 42 04 81 05 20 0d 07 01 4e C).....B... ...N > 000000c0: 61 6d 2a 14 02 02 02 18 3c 03 7b 20 02 04 01 07 am*.....<.{ .... > 000000d0: 42 4d 44 41 64 64 72 65 73 73 2a fc 01 02 1d 40 BMDAddress*....@ > 000000e0: 04 81 01 20 05 f4 00 06 54 69 6d 65 54 6f 4c 69 ... ....TimeToLi > 000000f0: 76 2b f5 01 1e 20 05 f4 01 02 61 73 65 49 64 80 v+... ....aseId. > 00000100: 36 2a f5 01 20 20 05 f4 03 07 44 65 66 61 75 6c 6*.. ....Defaul > 00000110: 74 4e 65 74 2f e4 07 03 02 1a 46 04 81 0d 20 01 tNet/.....F... . > 00000120: 00 03 00 02 45 6e 61 62 6c 65 42 61 73 65 46 6f ....EnableBaseFo > 00000130: 72 47 61 74 65 77 61 79 2a 24 02 02 1f 45 04 81 rGateway*$...E.. > 00000140: 0b 20 0e 18 01 03 52 6f 75 74 65 64 2a 14 01 02 . ....Routed*... > 00000150: 21 50 04 81 21 20 01 14 01 c2 07 30 31 3e 84 09 !P..! .....01>.. > 00000160: 03 01 f4 4b 04 81 17 20 17 40 01 2d a8 09 02 24 ...K... .@.-...$ > 00000170: 4c 04 81 19 20 17 2c 01 2e d0 09 01 23 3a 03 77 L... .,.....#:.w > 00000180: 20 07 2c 01 2d 1d 02 1c 20 05 d0 0c b0 24 33 cc .,.-... ....$3. > 00000190: 07 02 1b 00 00 00 11 6c 00 00 3f 6a 68 2e 73 ec .......l..?jh.s. > 000001a0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................ > 000001b0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................ > 000001c0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................ > 000001d0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................ > 000001e0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................ > 000001f0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................ > 00000200: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................ > 00000210: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................ > 00000220: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................ > 00000230: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................ > 00000240: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................ > 00000250: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................ > 00000260: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................ > 00000270: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................ > 00000280: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................ > 00000290: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................ > 000002a0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................ > 000002b0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................ > 000002c0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................ > 000002d0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................ > 000002e0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................ > 000002f0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................ > 00000300: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................ > 00000310: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................ > 00000320: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................ > 00000330: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................ > 00000340: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................ > 00000350: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................ > 00000360: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................ > 00000370: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................ > 00000380: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................ > 00000390: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................ > 000003a0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................ > 000003b0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................ > 000003c0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................ > 000003d0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................ > 000003e0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................ > 000003f0: ff ff > UBIFS error (pid 428): read_block: bad data node (block 388, inode 556196) > UBIFS error (pid 428): do_readpage: cannot read page 388 of inode > 556196, error -22 I see the following possibilities: 1. The data has been written like this - then the bug is at writing side. Check the data node - what is its length, is CRC correct? It would be useful to dump the node which cannot be decompressed - I'd accept such patch with great delight. 2. You had power cuts while this peace of data has been written and recovery did not work correctly. Enabling mount and recovery messages would help. 3. I merged several changes to 2.6.39 which could in theory break recovery. Try to reproduce this with 2.6.38. 4. The fixup feature might have broke this - we might for some reason read less data than there. Although I see FFs start at offset 416, which is strange. -- Best Regards, Artem Bityutskiy (Артём Битюцкий) ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: ubifs_decompress: cannot decompress ... 2011-06-01 8:02 ` Artem Bityutskiy @ 2011-06-01 8:07 ` Artem Bityutskiy 2011-06-01 8:39 ` Artem Bityutskiy 2011-06-02 4:34 ` Matthew L. Creech 2 siblings, 0 replies; 25+ messages in thread From: Artem Bityutskiy @ 2011-06-01 8:07 UTC (permalink / raw) To: Matthew L. Creech; +Cc: Ben Gardiner, MTD list On Wed, 2011-06-01 at 11:02 +0300, Artem Bityutskiy wrote: > I see the following possibilities: > > 1. The data has been written like this - then the bug is at writing > side. Check the data node - what is its length, is CRC correct? It would > be useful to dump the node which cannot be decompressed - I'd accept > such patch with great delight. > > 2. You had power cuts while this peace of data has been written and > recovery did not work correctly. Enabling mount and recovery messages > would help. > > 3. I merged several changes to 2.6.39 which could in theory break > recovery. Try to reproduce this with 2.6.38. > > 4. The fixup feature might have broke this - we might for some reason > read less data than there. Although I see FFs start at offset 416, which > is strange. So Matthew, let's try to exclude some of these possibilities. Please, send me some description about what you do to reproduce this. Is this error persistent or it goes away after remount or reboot. And other information relevant to the above points. Thanks! -- Best Regards, Artem Bityutskiy (Артём Битюцкий) ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: ubifs_decompress: cannot decompress ... 2011-06-01 8:02 ` Artem Bityutskiy 2011-06-01 8:07 ` Artem Bityutskiy @ 2011-06-01 8:39 ` Artem Bityutskiy 2011-06-02 4:34 ` Matthew L. Creech 2 siblings, 0 replies; 25+ messages in thread From: Artem Bityutskiy @ 2011-06-01 8:39 UTC (permalink / raw) To: Matthew L. Creech; +Cc: Ben Gardiner, MTD list On Wed, 2011-06-01 at 11:02 +0300, Artem Bityutskiy wrote: > 3. I merged several changes to 2.6.39 which could in theory break > recovery. Try to reproduce this with 2.6.38. Err, sorry, those went to 2.6.40, so disregard this point please. -- Best Regards, Artem Bityutskiy (Артём Битюцкий) ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: ubifs_decompress: cannot decompress ... 2011-06-01 8:02 ` Artem Bityutskiy 2011-06-01 8:07 ` Artem Bityutskiy 2011-06-01 8:39 ` Artem Bityutskiy @ 2011-06-02 4:34 ` Matthew L. Creech 2 siblings, 0 replies; 25+ messages in thread From: Matthew L. Creech @ 2011-06-02 4:34 UTC (permalink / raw) To: dedekind1; +Cc: Ben Gardiner, MTD list On Wed, Jun 1, 2011 at 4:02 AM, Artem Bityutskiy <dedekind1@gmail.com> wrote: > > 4. The fixup feature might have broke this - we might for some reason > read less data than there. Although I see FFs start at offset 416, which > is strange. > Also, I can rule this possibility out - although the device is now running a recent kernel with the fix-up code present, it was upgraded last week from an older version. So the UBIFS image it was originally programmed with (and is still using) didn't have the fixup flag set. -- Matthew L. Creech ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: ubifs_decompress: cannot decompress ... 2011-05-27 21:12 ubifs_decompress: cannot decompress Matthew L. Creech 2011-05-30 12:29 ` Ben Gardiner @ 2011-06-01 7:48 ` Artem Bityutskiy 1 sibling, 0 replies; 25+ messages in thread From: Artem Bityutskiy @ 2011-06-01 7:48 UTC (permalink / raw) To: Matthew L. Creech; +Cc: MTD list On Fri, 2011-05-27 at 17:12 -0400, Matthew L. Creech wrote: > Hi, > > We encountered a device which has the following error: > > UBIFS error (pid 428): ubifs_decompress: cannot decompress 1010 bytes, > compressor lzo, error -22 > UBIFS error (pid 428): read_block: bad data node (block 388, inode 556196) > UBIFS error (pid 428): do_readpage: cannot read page 388 of inode > 556196, error -22 > UBIFS error (pid 428): ubifs_decompress: cannot decompress 1010 bytes, > compressor lzo, error -22 > UBIFS error (pid 428): read_block: bad data node (block 388, inode 556196) > UBIFS error (pid 428): do_readpage: cannot read page 388 of inode > 556196, error -22 > > The device is still bootable (though not functional), so I can get > more debug info if needed. Any ideas on what could cause this? Yes, you can enable mount and recovery debugging messages and send me the log. > - MPC8313 (using fsl_elbc_nand driver) > - 1GB NAND flash (partitioned into 5MB for U-Boot, remainder for UBI/UBIFS) > - 2.6.39 kernel > > Thanks! > -- Best Regards, Artem Bityutskiy (Артём Битюцкий) ^ permalink raw reply [flat|nested] 25+ messages in thread
end of thread, other threads:[~2011-06-20 15:35 UTC | newest] Thread overview: 25+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2011-05-27 21:12 ubifs_decompress: cannot decompress Matthew L. Creech 2011-05-30 12:29 ` Ben Gardiner 2011-05-31 15:47 ` Matthew L. Creech 2011-05-31 16:10 ` Ben Gardiner 2011-05-31 21:47 ` Matthew L. Creech 2011-06-01 7:51 ` Artem Bityutskiy 2011-06-02 4:30 ` Matthew L. Creech 2011-06-02 18:59 ` Matthew L. Creech 2011-06-06 9:58 ` Artem Bityutskiy 2011-06-06 16:04 ` Matthew L. Creech 2011-06-06 16:18 ` Artem Bityutskiy 2011-06-06 19:52 ` Matthew L. Creech 2011-06-07 4:34 ` Artem Bityutskiy 2011-06-07 20:41 ` Matthew L. Creech 2011-06-08 14:11 ` Artem Bityutskiy 2011-06-08 17:50 ` Matthew L. Creech 2011-06-09 12:10 ` Artem Bityutskiy 2011-06-20 15:35 ` Matthew L. Creech 2011-06-07 10:24 ` Artem Bityutskiy 2011-06-03 4:32 ` Artem Bityutskiy 2011-06-01 8:02 ` Artem Bityutskiy 2011-06-01 8:07 ` Artem Bityutskiy 2011-06-01 8:39 ` Artem Bityutskiy 2011-06-02 4:34 ` Matthew L. Creech 2011-06-01 7:48 ` Artem Bityutskiy
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).