* UBIFS corruption after power cut - possibly unstable bits issue? @ 2015-10-26 19:37 Tim Harvey 2015-10-26 20:01 ` Richard Weinberger 0 siblings, 1 reply; 18+ messages in thread From: Tim Harvey @ 2015-10-26 19:37 UTC (permalink / raw) To: Artem Bityutskiy, Adrian Hunter, linux-mtd, Richard Weinberger Greetings, I'm looking into a re-producible ubifs corruption issue on an IMX6 based board with a 16Gb MT29F2G08 SLC NAND flash. During boot reliability tests we can recreate perhaps 1 in 10000 cases of a UBIFS corruption that results in a permanent failure to mount the rootfs. In this scenario the U-Boot bootloader first mounts the ubifs to load the kernel/dtb from the /boot directory, then the kernel is mounting the rootfs read-only, then later its mounted read/write via /etc/fstab but is never written to directly by our software. Our software runs various tests for several minutes then power-cuts the board. After corruption, U-Boot fails reading the superblock but if I load a kernel/dtb from the net and boot it I see some more useful info from the kernel: [ 0.000000] Linux version 4.2.0-00012-gd722238 (tharvey@tharvey) (gcc version 4.8.3 (OpenWrt/Linaro GCC 4.8-2014.04 r43990) ) #189 SMP Fri Oct 23 16:24:27 PDT 2015 ... [ 2.375663] nand: device found, Manufacturer ID: 0x2c, Chip ID: 0xd5 [ 2.382075] nand: Micron MT29F16G08ADACAH4 [ 2.386190] nand: 2048 MiB, SLC, erase size: 256 KiB, page size: 4096, OOB size: 224 [ 2.394821] gpmi-nand 112000.gpmi-nand: enable the asynchronous EDO mode 5 [ 2.401760] Scanning device for bad blocks [ 4.000436] 3 ofpart partitions found on MTD device gpmi-nand [ 4.006201] Creating 3 MTD partitions on "gpmi-nand": [ 4.011298] 0x000000000000-0x000001000000 : "uboot" [ 4.021174] 0x000001000000-0x000001100000 : "env" [ 4.028287] 0x000001100000-0x000080000000 : "rootfs" [ 4.037134] gpmi-nand 112000.gpmi-nand: driver registered. ... [ 8.678875] ubi0: scanning is finished [ 8.706815] ubi0: attached mtd2 (name "rootfs", size 2031 MiB) [ 8.712710] ubi0: PEB size: 262144 bytes (256 KiB), LEB size: 253952 bytes [ 8.719598] ubi0: min./max. I/O unit sizes: 4096/4096, sub-page size 4096 [ 8.726415] ubi0: VID header offset: 4096 (aligned 4096), data offset: 8192 [ 8.733398] ubi0: good PEBs: 8124, bad PEBs: 0, corrupted PEBs: 0 [ 8.739501] ubi0: user volume: 1, internal volumes: 1, max. volumes count: 128 [ 8.746748] ubi0: max/mean erase counter: 5/2, WL threshold: 4096, image sequence number: 36433972 [ 8.755747] ubi0: available PEBs: 0, total reserved PEBs: 8124, PEBs reserved for bad PEB handling: 160 [ 8.765357] ubi0: background thread "ubi_bgt0d" started, PID 79 ... [ 8.888732] UBIFS (ubi0:0): recovery needed [ 8.916292] ubi0 warning: ubi_io_read: error -74 (ECC error) while reading 36 864 bytes from PEB 5195:225280, read only 36864 bytes, retry [ 8.931116] ubi0 warning: ubi_io_read: error -74 (ECC error) while reading 36 864 bytes from PEB 5195:225280, read only 36864 bytes, retry [ 8.945869] ubi0 warning: ubi_io_read: error -74 (ECC error) while reading 36 864 bytes from PEB 5195:225280, read only 36864 bytes, retry [ 8.960676] ubi0 error: ubi_io_read: error -74 (ECC error) while reading 3686 4 bytes from PEB 5195:225280, read 36864 bytes [ 8.971859] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.2.0-00012-gd722238 #1 89 [ 8.979177] Hardware name: Freescale i.MX6 Quad/DualLite (Device Tree) [ 8.985747] Backtrace: [ 8.988248] [<80013468>] (dump_backtrace) from [<80013654>] (show_stack+0x18/ 0x1c) [ 8.995848] r6:80abf960 r5:80abf960 r4:00000000 r3:00000000 [ 9.001649] [<8001363c>] (show_stack) from [<80798948>] (dump_stack+0x8c/0xa4) [ 9.008894] [<807988bc>] (dump_stack) from [<8045d410>] (ubi_io_read+0x128/0x304) [ 9.016428] r6:00000000 r5:00009000 r4:ffffffb6 r3:ee070000 [ 9.022201] [<8045d2e8>] (ubi_io_read) from [<8045a650>] (ubi_eba_read_leb+0x18c/0x430) [ 9.030210] r10:00000000 r9:f069e000 r8:0000144b r7:00000179 r6:00000000 r5:edab6000 [ 9.038189] r4:edb10000 [ 9.040783] [<8045a4c4>] (ubi_eba_read_leb) from [<80459a08>] (ubi_leb_read+0x7c/0xd4) [ 9.048724] r10:00000000 r9:00000179 r8:f069e000 r7:edb10000 r6:00035000 r5:00009000 [ 9.056677] r4:edab6000 [ 9.059254] [<8045998c>] (ubi_leb_read) from [<80253c2c>] (ubifs_leb_read+0x30/0x9c) [ 9.067044] r10:ee034000 r9:edb55e40 r8:00000179 r7:00035000 r6:00009000 r5:ee034000 [ 9.074996] r4:edb55e40 [ 9.077563] [<80253bfc>] (ubifs_leb_read) from [<8025bfe0>] (ubifs_start_scan+0x74/0xf0) [ 9.085697] r8:f0669000 r7:00000179 r6:ee034000 r5:00035000 r4:edb55e40 [ 9.092575] [<8025bf6c>] (ubifs_start_scan) from [<8026e624>] (ubifs_recover_leb+0x64/0x810) [ 9.101051] r8:00009000 r7:00000179 r6:ee034000 r5:edb55cc0 r4:00000179 [ 9.107850] [<8026e5c0>] (ubifs_recover_leb) from [<8025d890>] (ubifs_replay_journal+0xed0/0x16a0) [ 9.116846] r10:ee034000 r9:edb55e40 r8:edb55e48 r7:ee034000 r6:edb55cc0 r5:edb55cc0 [ 9.124782] r4:00000179 [ 9.127344] [<8025c9c0>] (ubifs_replay_journal) from [<802515ec>] (ubifs_mount+0x1234/0x19fc) [ 9.135906] r10:00000001 r9:ee034008 r8:edb4a300 r7:eda92800 r6:ee034b38 r5:ee034000 [ 9.143842] r4:00000000 [ 9.146411] [<802503b8>] (ubifs_mount) from [<80103c6c>] (mount_fs+0x1c/0xb0) [ 9.153578] r10:00008001 r9:00000000 r8:80abc2e0 r7:801234a8 r6:80abc2e0 r5:edb4a280 [ 9.161514] r4:eda61400 [ 9.164084] [<80103c50>] (mount_fs) from [<8011fedc>] (vfs_kern_mount+0x5c/0x144) [ 9.171602] r6:00008001 r5:edb4a280 r4:eda61400 [ 9.176282] [<8011fe80>] (vfs_kern_mount) from [<801234a8>] (do_mount+0x19c/0xd0c) [ 9.183886] r9:80abc2e0 r8:80ab0bdc r7:edb4a240 r6:edb4a280 r5:00000060 r4:00000000 [ 9.191743] [<8012330c>] (do_mount) from [<801243ac>] (SyS_mount+0x7c/0xa8) [ 9.198721] r10:000000d5 r9:80a8d9c8 r8:8094cfa0 r7:00008001 r6:00000000 r5:edb4a280 [ 9.206660] r4:edb4a240 [ 9.209227] [<80124330>] (SyS_mount) from [<80a3a248>] (mount_block_root+0x114/0x240) [ 9.217091] r8:ef5af1c0 r7:80a8d9b8 r6:edb8e000 r5:edb8e000 r4:00008001 [ 9.223907] [<80a3a134>] (mount_block_root) from [<80a3a634>] (prepare_namespace+0x98/0x1d0) [ 9.232380] r10:000000d5 r9:80a8d998 r8:80a9b5c0 r7:80a8d990 r6:80a8d9b8 r5:80a8d9b8 [ 9.240297] r4:80a8d9c8 [ 9.242877] [<80a3a59c>] (prepare_namespace) from [<80a39ee4>] (kernel_init_freeable+0x1a4/0x1f0) [ 9.251780] r6:80b09000 r5:80b09000 r4:00000007 [ 9.256467] [<80a39d40>] (kernel_init_freeable) from [<80792cdc>] (kernel_init+0x10/0xf4) [ 9.264667] r10:00000000 r9:00000000 r8:00000000 r7:00000000 r6:00000000 r5: 80792ccc [ 9.272614] r4:00000000 [ 9.275177] [<80792ccc>] (kernel_init) from [<8000f938>] (ret_from_fork+0x14/0x3c) [ 9.282782] r4:00000000 r3:00000000 [ 9.286909] UBIFS error (ubi0:0 pid 1): ubifs_recover_leb: corrupt empty space LEB 377:217088, corruption starts at 24762 [ 9.297900] UBIFS error (ubi0:0 pid 1): ubifs_scanned_corruption: corruption at LEB 377:241850 [ 9.306536] UBIFS error (ubi0:0 pid 1): ubifs_scanned_corruption: first 8192 bytes from LEB 377:241850 [ 9.315870] 00000000: ffffffef ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ................................ [ 9.327374] 00000020: ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ................................ [ 9.338883] 00000040: ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ................................ [ 9.350389] 00000060: ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ................................ ... [ 9.352755] 00001fc0: ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff f fffffff ffffffff ................................ [ 9.352765] 00001fe0: ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff f fffffff ffffffff ................................ [ 9.352770] UBIFS error (ubi0:0 pid 1): ubifs_recover_leb: LEB 377 scanning f ailed [ 12.271485] VFS: Cannot open root device "ubi0:rootfs" or unknown-block(0,0): error -117 Thus far we have encountered this with a 16Gb MT29F16G08 and 'not' with a 2Gb MT29F2G08. The two parts have different geometries and the 16Gb part has a much larger block erase time (2ms) compared to the 2Gb (700us). We have also seen several occurrences of errors on both NAND chips that cause a failure to mount rootfs but are not permanent (power-cycle again and all is well): [17:39:02][ 2.666800] ubi0: attaching mtd2 [17:39:05][ 5.388606] ubi0: scanning is finished [17:39:05][ 5.395747] ubi0 error: vtbl_check: bad CRC at record 72: 0xf116c36b, not 0x000000 [17:39:05][ 5.403330] Volume table record 72 dump: [17:39:05][ 5.407279] reserved_pebs 0 [17:39:05][ 5.410342] alignment 0 [17:39:05][ 5.413402] data_pad 0 [17:39:05][ 5.416481] vol_type 0 [17:39:05][ 5.419543] upd_marker 0 [17:39:05][ 5.422603] name_len 0 [17:39:05][ 5.427696] name NULL [17:39:05][ 5.433237] ubi0 error: vtbl_check: bad CRC at record 72: 0xf116c36b, not 0x000000 [17:39:05][ 5.440846] Volume table record 72 dump: [17:39:05][ 5.446806] reserved_pebs 0 [17:39:05][ 5.451881] alignment 0 [17:39:05][ 5.456967] data_pad 0 [17:39:05][ 5.462042] vol_type 0 [17:39:05][ 5.467137] upd_marker 0 [17:39:05][ 5.472216] name_len 0 [17:39:05][ 5.475304] name NULL [17:39:05][ 5.478629] ubi0 error: process_lvol: both volume tables are corrupted [17:39:05][ 5.492599] ubi0 error: ubi_attach_mtd_dev: failed to attach mtd2, error -22 [17:39:05][ 5.499728] UBI error: cannot attach mtd2 Are any/all of the above the 'unstable bits' issue? Any recommendations on what to look for next? Regards, Tim Tim Harvey - Principal Software Engineer Gateworks Corporation - http://www.gateworks.com/ 3026 S. Higuera St. San Luis Obispo CA 93401 805-781-2000 ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: UBIFS corruption after power cut - possibly unstable bits issue? 2015-10-26 19:37 UBIFS corruption after power cut - possibly unstable bits issue? Tim Harvey @ 2015-10-26 20:01 ` Richard Weinberger 2015-10-26 20:31 ` Tim Harvey 2015-11-03 10:06 ` Michal Suchanek 0 siblings, 2 replies; 18+ messages in thread From: Richard Weinberger @ 2015-10-26 20:01 UTC (permalink / raw) To: Tim Harvey, Artem Bityutskiy, Adrian Hunter, linux-mtd Tim, Am 26.10.2015 um 20:37 schrieb Tim Harvey: > Greetings, > > I'm looking into a re-producible ubifs corruption issue on an IMX6 > based board with a 16Gb MT29F2G08 SLC NAND flash. During boot > reliability tests we can recreate perhaps 1 in 10000 cases of a UBIFS > corruption that results in a permanent failure to mount the rootfs. > > In this scenario the U-Boot bootloader first mounts the ubifs to load > the kernel/dtb from the /boot directory, then the kernel is mounting > the rootfs read-only, then later its mounted read/write via /etc/fstab > but is never written to directly by our software. Our software runs > various tests for several minutes then power-cuts the board. > > After corruption, U-Boot fails reading the superblock but if I load a > kernel/dtb from the net and boot it I see some more useful info from > the kernel: > > [ 0.000000] Linux version 4.2.0-00012-gd722238 (tharvey@tharvey) > (gcc version 4.8.3 (OpenWrt/Linaro GCC 4.8-2014.04 r43990) ) #189 SMP > Fri Oct 23 16:24:27 PDT 2015 > ... > [ 2.375663] nand: device found, Manufacturer ID: 0x2c, Chip ID: 0xd5 > [ 2.382075] nand: Micron MT29F16G08ADACAH4 > [ 2.386190] nand: 2048 MiB, SLC, erase size: 256 KiB, page size: > 4096, OOB size: 224 > [ 2.394821] gpmi-nand 112000.gpmi-nand: enable the asynchronous EDO mode 5 > [ 2.401760] Scanning device for bad blocks > [ 4.000436] 3 ofpart partitions found on MTD device gpmi-nand > [ 4.006201] Creating 3 MTD partitions on "gpmi-nand": > [ 4.011298] 0x000000000000-0x000001000000 : "uboot" > [ 4.021174] 0x000001000000-0x000001100000 : "env" > [ 4.028287] 0x000001100000-0x000080000000 : "rootfs" > [ 4.037134] gpmi-nand 112000.gpmi-nand: driver registered. > ... > [ 8.678875] ubi0: scanning is finished > [ 8.706815] ubi0: attached mtd2 (name "rootfs", size 2031 MiB) > [ 8.712710] ubi0: PEB size: 262144 bytes (256 KiB), LEB size: 253952 bytes > [ 8.719598] ubi0: min./max. I/O unit sizes: 4096/4096, sub-page size 4096 > [ 8.726415] ubi0: VID header offset: 4096 (aligned 4096), data offset: 8192 > [ 8.733398] ubi0: good PEBs: 8124, bad PEBs: 0, corrupted PEBs: 0 > [ 8.739501] ubi0: user volume: 1, internal volumes: 1, max. volumes > count: 128 > [ 8.746748] ubi0: max/mean erase counter: 5/2, WL threshold: 4096, > image sequence number: 36433972 > [ 8.755747] ubi0: available PEBs: 0, total reserved PEBs: 8124, > PEBs reserved for bad PEB handling: 160 > [ 8.765357] ubi0: background thread "ubi_bgt0d" started, PID 79 > ... > [ 8.888732] UBIFS (ubi0:0): recovery needed > [ 8.916292] ubi0 warning: ubi_io_read: error -74 (ECC error) while reading 36 > 864 bytes from PEB 5195:225280, read only 36864 bytes, retry > [ 8.931116] ubi0 warning: ubi_io_read: error -74 (ECC error) while reading 36 > 864 bytes from PEB 5195:225280, read only 36864 bytes, retry > [ 8.945869] ubi0 warning: ubi_io_read: error -74 (ECC error) while reading 36 > 864 bytes from PEB 5195:225280, read only 36864 bytes, retry > [ 8.960676] ubi0 error: ubi_io_read: error -74 (ECC error) while reading 3686 > 4 bytes from PEB 5195:225280, read 36864 bytes > [ 8.971859] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.2.0-00012-gd722238 #1 > 89 > [ 8.979177] Hardware name: Freescale i.MX6 Quad/DualLite (Device Tree) > [ 8.985747] Backtrace: > [ 8.988248] [<80013468>] (dump_backtrace) from [<80013654>] (show_stack+0x18/ > 0x1c) > [ 8.995848] r6:80abf960 r5:80abf960 r4:00000000 r3:00000000 > [ 9.001649] [<8001363c>] (show_stack) from [<80798948>] > (dump_stack+0x8c/0xa4) > [ 9.008894] [<807988bc>] (dump_stack) from [<8045d410>] > (ubi_io_read+0x128/0x304) > [ 9.016428] r6:00000000 r5:00009000 r4:ffffffb6 r3:ee070000 > [ 9.022201] [<8045d2e8>] (ubi_io_read) from [<8045a650>] > (ubi_eba_read_leb+0x18c/0x430) > [ 9.030210] r10:00000000 r9:f069e000 r8:0000144b r7:00000179 > r6:00000000 r5:edab6000 > [ 9.038189] r4:edb10000 > [ 9.040783] [<8045a4c4>] (ubi_eba_read_leb) from [<80459a08>] > (ubi_leb_read+0x7c/0xd4) > [ 9.048724] r10:00000000 r9:00000179 r8:f069e000 r7:edb10000 > r6:00035000 r5:00009000 > [ 9.056677] r4:edab6000 > [ 9.059254] [<8045998c>] (ubi_leb_read) from [<80253c2c>] > (ubifs_leb_read+0x30/0x9c) > [ 9.067044] r10:ee034000 r9:edb55e40 r8:00000179 r7:00035000 > r6:00009000 r5:ee034000 > [ 9.074996] r4:edb55e40 > [ 9.077563] [<80253bfc>] (ubifs_leb_read) from [<8025bfe0>] > (ubifs_start_scan+0x74/0xf0) > [ 9.085697] r8:f0669000 r7:00000179 r6:ee034000 r5:00035000 r4:edb55e40 > [ 9.092575] [<8025bf6c>] (ubifs_start_scan) from [<8026e624>] > (ubifs_recover_leb+0x64/0x810) > [ 9.101051] r8:00009000 r7:00000179 r6:ee034000 r5:edb55cc0 r4:00000179 > [ 9.107850] [<8026e5c0>] (ubifs_recover_leb) from [<8025d890>] > (ubifs_replay_journal+0xed0/0x16a0) > [ 9.116846] r10:ee034000 r9:edb55e40 r8:edb55e48 r7:ee034000 > r6:edb55cc0 r5:edb55cc0 > [ 9.124782] r4:00000179 > [ 9.127344] [<8025c9c0>] (ubifs_replay_journal) from [<802515ec>] > (ubifs_mount+0x1234/0x19fc) > [ 9.135906] r10:00000001 r9:ee034008 r8:edb4a300 r7:eda92800 > r6:ee034b38 r5:ee034000 > [ 9.143842] r4:00000000 > [ 9.146411] [<802503b8>] (ubifs_mount) from [<80103c6c>] (mount_fs+0x1c/0xb0) > [ 9.153578] r10:00008001 r9:00000000 r8:80abc2e0 r7:801234a8 > r6:80abc2e0 r5:edb4a280 > [ 9.161514] r4:eda61400 > [ 9.164084] [<80103c50>] (mount_fs) from [<8011fedc>] > (vfs_kern_mount+0x5c/0x144) > [ 9.171602] r6:00008001 r5:edb4a280 r4:eda61400 > [ 9.176282] [<8011fe80>] (vfs_kern_mount) from [<801234a8>] > (do_mount+0x19c/0xd0c) > [ 9.183886] r9:80abc2e0 r8:80ab0bdc r7:edb4a240 r6:edb4a280 > r5:00000060 r4:00000000 > [ 9.191743] [<8012330c>] (do_mount) from [<801243ac>] (SyS_mount+0x7c/0xa8) > [ 9.198721] r10:000000d5 r9:80a8d9c8 r8:8094cfa0 r7:00008001 > r6:00000000 r5:edb4a280 > [ 9.206660] r4:edb4a240 > [ 9.209227] [<80124330>] (SyS_mount) from [<80a3a248>] > (mount_block_root+0x114/0x240) > [ 9.217091] r8:ef5af1c0 r7:80a8d9b8 r6:edb8e000 r5:edb8e000 r4:00008001 > [ 9.223907] [<80a3a134>] (mount_block_root) from [<80a3a634>] > (prepare_namespace+0x98/0x1d0) > [ 9.232380] r10:000000d5 r9:80a8d998 r8:80a9b5c0 r7:80a8d990 > r6:80a8d9b8 r5:80a8d9b8 > [ 9.240297] r4:80a8d9c8 > [ 9.242877] [<80a3a59c>] (prepare_namespace) from [<80a39ee4>] > (kernel_init_freeable+0x1a4/0x1f0) > [ 9.251780] r6:80b09000 r5:80b09000 r4:00000007 > [ 9.256467] [<80a39d40>] (kernel_init_freeable) from [<80792cdc>] > (kernel_init+0x10/0xf4) > [ 9.264667] r10:00000000 r9:00000000 r8:00000000 r7:00000000 r6:00000000 r5: > 80792ccc > [ 9.272614] r4:00000000 > [ 9.275177] [<80792ccc>] (kernel_init) from [<8000f938>] > (ret_from_fork+0x14/0x3c) > [ 9.282782] r4:00000000 r3:00000000 > [ 9.286909] UBIFS error (ubi0:0 pid 1): ubifs_recover_leb: corrupt > empty space LEB 377:217088, corruption starts at 24762 > [ 9.297900] UBIFS error (ubi0:0 pid 1): ubifs_scanned_corruption: > corruption at LEB 377:241850 > [ 9.306536] UBIFS error (ubi0:0 pid 1): ubifs_scanned_corruption: > first 8192 bytes from LEB 377:241850 > [ 9.315870] 00000000: ffffffef ffffffff ffffffff ffffffff ffffffff > ffffffff ffffffff ffffffff ................................ > [ 9.327374] 00000020: ffffffff ffffffff ffffffff ffffffff ffffffff > ffffffff ffffffff ffffffff ................................ > [ 9.338883] 00000040: ffffffff ffffffff ffffffff ffffffff ffffffff > ffffffff ffffffff ffffffff ................................ > [ 9.350389] 00000060: ffffffff ffffffff ffffffff ffffffff ffffffff > ffffffff ffffffff ffffffff ................................ > ... > [ 9.352755] 00001fc0: ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff f > fffffff ffffffff ................................ > [ 9.352765] 00001fe0: ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff f > fffffff ffffffff ................................ > [ 9.352770] UBIFS error (ubi0:0 pid 1): ubifs_recover_leb: LEB 377 scanning f > ailed > [ 12.271485] VFS: Cannot open root device "ubi0:rootfs" or unknown-block(0,0): > error -117 > > Thus far we have encountered this with a 16Gb MT29F16G08 and 'not' > with a 2Gb MT29F2G08. The two parts have different geometries and the > 16Gb part has a much larger block erase time (2ms) compared to the 2Gb > (700us). gpmi-nand is not able to correct bit flips on erased pages. This is why UBI is facing uncorrectable ECC errors and UBIFS gives up. In March there was an attempt to fix that in software. But no mainline ready solution was presented so far: http://lists.infradead.org/pipermail/linux-mtd/2014-March/052521.html It is not clear whether to implement this directly in gpmi-nand or MTD core. Currently UBIFS assumes that empty spaces must contain only 0xff octets. A naive approach would be removing that check from UBIFS, bit this can have disastrous consequences as UBIFS's recovery algorithm relies on that. > We have also seen several occurrences of errors on both NAND chips > that cause a failure to mount rootfs but are not permanent > (power-cycle again and all is well): > [17:39:02][ 2.666800] ubi0: attaching mtd2 > [17:39:05][ 5.388606] ubi0: scanning is finished > [17:39:05][ 5.395747] ubi0 error: vtbl_check: bad CRC at record 72: > 0xf116c36b, not 0x000000 > [17:39:05][ 5.403330] Volume table record 72 dump: > [17:39:05][ 5.407279] reserved_pebs 0 > [17:39:05][ 5.410342] alignment 0 > [17:39:05][ 5.413402] data_pad 0 > [17:39:05][ 5.416481] vol_type 0 > [17:39:05][ 5.419543] upd_marker 0 > [17:39:05][ 5.422603] name_len 0 > [17:39:05][ 5.427696] name NULL > [17:39:05][ 5.433237] ubi0 error: vtbl_check: bad CRC at record 72: > 0xf116c36b, not 0x000000 > [17:39:05][ 5.440846] Volume table record 72 dump: > [17:39:05][ 5.446806] reserved_pebs 0 > [17:39:05][ 5.451881] alignment 0 > [17:39:05][ 5.456967] data_pad 0 > [17:39:05][ 5.462042] vol_type 0 > [17:39:05][ 5.467137] upd_marker 0 > [17:39:05][ 5.472216] name_len 0 > [17:39:05][ 5.475304] name NULL > [17:39:05][ 5.478629] ubi0 error: process_lvol: both volume tables > are corrupted > [17:39:05][ 5.492599] ubi0 error: ubi_attach_mtd_dev: failed to > attach mtd2, error -22 > [17:39:05][ 5.499728] UBI error: cannot attach mtd2 > > Are any/all of the above the 'unstable bits' issue? Any Not really. Why do you think so? > recommendations on what to look for next? How does the corruption look like in the second case? Maybe we can identify a pattern. :-) Thanks, //richard ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: UBIFS corruption after power cut - possibly unstable bits issue? 2015-10-26 20:01 ` Richard Weinberger @ 2015-10-26 20:31 ` Tim Harvey 2015-10-26 21:41 ` Richard Weinberger 2015-11-03 10:06 ` Michal Suchanek 1 sibling, 1 reply; 18+ messages in thread From: Tim Harvey @ 2015-10-26 20:31 UTC (permalink / raw) To: Richard Weinberger; +Cc: Artem Bityutskiy, Adrian Hunter, linux-mtd On Mon, Oct 26, 2015 at 1:01 PM, Richard Weinberger <richard@nod.at> wrote: > Tim, > > Am 26.10.2015 um 20:37 schrieb Tim Harvey: >> Greetings, >> >> I'm looking into a re-producible ubifs corruption issue on an IMX6 >> based board with a 16Gb MT29F2G08 SLC NAND flash. During boot >> reliability tests we can recreate perhaps 1 in 10000 cases of a UBIFS >> corruption that results in a permanent failure to mount the rootfs. >> >> In this scenario the U-Boot bootloader first mounts the ubifs to load >> the kernel/dtb from the /boot directory, then the kernel is mounting >> the rootfs read-only, then later its mounted read/write via /etc/fstab >> but is never written to directly by our software. Our software runs >> various tests for several minutes then power-cuts the board. >> >> After corruption, U-Boot fails reading the superblock but if I load a >> kernel/dtb from the net and boot it I see some more useful info from >> the kernel: >> >> [ 0.000000] Linux version 4.2.0-00012-gd722238 (tharvey@tharvey) >> (gcc version 4.8.3 (OpenWrt/Linaro GCC 4.8-2014.04 r43990) ) #189 SMP >> Fri Oct 23 16:24:27 PDT 2015 >> ... >> [ 2.375663] nand: device found, Manufacturer ID: 0x2c, Chip ID: 0xd5 >> [ 2.382075] nand: Micron MT29F16G08ADACAH4 >> [ 2.386190] nand: 2048 MiB, SLC, erase size: 256 KiB, page size: >> 4096, OOB size: 224 >> [ 2.394821] gpmi-nand 112000.gpmi-nand: enable the asynchronous EDO mode 5 >> [ 2.401760] Scanning device for bad blocks >> [ 4.000436] 3 ofpart partitions found on MTD device gpmi-nand >> [ 4.006201] Creating 3 MTD partitions on "gpmi-nand": >> [ 4.011298] 0x000000000000-0x000001000000 : "uboot" >> [ 4.021174] 0x000001000000-0x000001100000 : "env" >> [ 4.028287] 0x000001100000-0x000080000000 : "rootfs" >> [ 4.037134] gpmi-nand 112000.gpmi-nand: driver registered. >> ... >> [ 8.678875] ubi0: scanning is finished >> [ 8.706815] ubi0: attached mtd2 (name "rootfs", size 2031 MiB) >> [ 8.712710] ubi0: PEB size: 262144 bytes (256 KiB), LEB size: 253952 bytes >> [ 8.719598] ubi0: min./max. I/O unit sizes: 4096/4096, sub-page size 4096 >> [ 8.726415] ubi0: VID header offset: 4096 (aligned 4096), data offset: 8192 >> [ 8.733398] ubi0: good PEBs: 8124, bad PEBs: 0, corrupted PEBs: 0 >> [ 8.739501] ubi0: user volume: 1, internal volumes: 1, max. volumes >> count: 128 >> [ 8.746748] ubi0: max/mean erase counter: 5/2, WL threshold: 4096, >> image sequence number: 36433972 >> [ 8.755747] ubi0: available PEBs: 0, total reserved PEBs: 8124, >> PEBs reserved for bad PEB handling: 160 >> [ 8.765357] ubi0: background thread "ubi_bgt0d" started, PID 79 >> ... >> [ 8.888732] UBIFS (ubi0:0): recovery needed >> [ 8.916292] ubi0 warning: ubi_io_read: error -74 (ECC error) while reading 36 >> 864 bytes from PEB 5195:225280, read only 36864 bytes, retry >> [ 8.931116] ubi0 warning: ubi_io_read: error -74 (ECC error) while reading 36 >> 864 bytes from PEB 5195:225280, read only 36864 bytes, retry >> [ 8.945869] ubi0 warning: ubi_io_read: error -74 (ECC error) while reading 36 >> 864 bytes from PEB 5195:225280, read only 36864 bytes, retry >> [ 8.960676] ubi0 error: ubi_io_read: error -74 (ECC error) while reading 3686 >> 4 bytes from PEB 5195:225280, read 36864 bytes >> [ 8.971859] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.2.0-00012-gd722238 #1 >> 89 >> [ 8.979177] Hardware name: Freescale i.MX6 Quad/DualLite (Device Tree) >> [ 8.985747] Backtrace: >> [ 8.988248] [<80013468>] (dump_backtrace) from [<80013654>] (show_stack+0x18/ >> 0x1c) >> [ 8.995848] r6:80abf960 r5:80abf960 r4:00000000 r3:00000000 >> [ 9.001649] [<8001363c>] (show_stack) from [<80798948>] >> (dump_stack+0x8c/0xa4) >> [ 9.008894] [<807988bc>] (dump_stack) from [<8045d410>] >> (ubi_io_read+0x128/0x304) >> [ 9.016428] r6:00000000 r5:00009000 r4:ffffffb6 r3:ee070000 >> [ 9.022201] [<8045d2e8>] (ubi_io_read) from [<8045a650>] >> (ubi_eba_read_leb+0x18c/0x430) >> [ 9.030210] r10:00000000 r9:f069e000 r8:0000144b r7:00000179 >> r6:00000000 r5:edab6000 >> [ 9.038189] r4:edb10000 >> [ 9.040783] [<8045a4c4>] (ubi_eba_read_leb) from [<80459a08>] >> (ubi_leb_read+0x7c/0xd4) >> [ 9.048724] r10:00000000 r9:00000179 r8:f069e000 r7:edb10000 >> r6:00035000 r5:00009000 >> [ 9.056677] r4:edab6000 >> [ 9.059254] [<8045998c>] (ubi_leb_read) from [<80253c2c>] >> (ubifs_leb_read+0x30/0x9c) >> [ 9.067044] r10:ee034000 r9:edb55e40 r8:00000179 r7:00035000 >> r6:00009000 r5:ee034000 >> [ 9.074996] r4:edb55e40 >> [ 9.077563] [<80253bfc>] (ubifs_leb_read) from [<8025bfe0>] >> (ubifs_start_scan+0x74/0xf0) >> [ 9.085697] r8:f0669000 r7:00000179 r6:ee034000 r5:00035000 r4:edb55e40 >> [ 9.092575] [<8025bf6c>] (ubifs_start_scan) from [<8026e624>] >> (ubifs_recover_leb+0x64/0x810) >> [ 9.101051] r8:00009000 r7:00000179 r6:ee034000 r5:edb55cc0 r4:00000179 >> [ 9.107850] [<8026e5c0>] (ubifs_recover_leb) from [<8025d890>] >> (ubifs_replay_journal+0xed0/0x16a0) >> [ 9.116846] r10:ee034000 r9:edb55e40 r8:edb55e48 r7:ee034000 >> r6:edb55cc0 r5:edb55cc0 >> [ 9.124782] r4:00000179 >> [ 9.127344] [<8025c9c0>] (ubifs_replay_journal) from [<802515ec>] >> (ubifs_mount+0x1234/0x19fc) >> [ 9.135906] r10:00000001 r9:ee034008 r8:edb4a300 r7:eda92800 >> r6:ee034b38 r5:ee034000 >> [ 9.143842] r4:00000000 >> [ 9.146411] [<802503b8>] (ubifs_mount) from [<80103c6c>] (mount_fs+0x1c/0xb0) >> [ 9.153578] r10:00008001 r9:00000000 r8:80abc2e0 r7:801234a8 >> r6:80abc2e0 r5:edb4a280 >> [ 9.161514] r4:eda61400 >> [ 9.164084] [<80103c50>] (mount_fs) from [<8011fedc>] >> (vfs_kern_mount+0x5c/0x144) >> [ 9.171602] r6:00008001 r5:edb4a280 r4:eda61400 >> [ 9.176282] [<8011fe80>] (vfs_kern_mount) from [<801234a8>] >> (do_mount+0x19c/0xd0c) >> [ 9.183886] r9:80abc2e0 r8:80ab0bdc r7:edb4a240 r6:edb4a280 >> r5:00000060 r4:00000000 >> [ 9.191743] [<8012330c>] (do_mount) from [<801243ac>] (SyS_mount+0x7c/0xa8) >> [ 9.198721] r10:000000d5 r9:80a8d9c8 r8:8094cfa0 r7:00008001 >> r6:00000000 r5:edb4a280 >> [ 9.206660] r4:edb4a240 >> [ 9.209227] [<80124330>] (SyS_mount) from [<80a3a248>] >> (mount_block_root+0x114/0x240) >> [ 9.217091] r8:ef5af1c0 r7:80a8d9b8 r6:edb8e000 r5:edb8e000 r4:00008001 >> [ 9.223907] [<80a3a134>] (mount_block_root) from [<80a3a634>] >> (prepare_namespace+0x98/0x1d0) >> [ 9.232380] r10:000000d5 r9:80a8d998 r8:80a9b5c0 r7:80a8d990 >> r6:80a8d9b8 r5:80a8d9b8 >> [ 9.240297] r4:80a8d9c8 >> [ 9.242877] [<80a3a59c>] (prepare_namespace) from [<80a39ee4>] >> (kernel_init_freeable+0x1a4/0x1f0) >> [ 9.251780] r6:80b09000 r5:80b09000 r4:00000007 >> [ 9.256467] [<80a39d40>] (kernel_init_freeable) from [<80792cdc>] >> (kernel_init+0x10/0xf4) >> [ 9.264667] r10:00000000 r9:00000000 r8:00000000 r7:00000000 r6:00000000 r5: >> 80792ccc >> [ 9.272614] r4:00000000 >> [ 9.275177] [<80792ccc>] (kernel_init) from [<8000f938>] >> (ret_from_fork+0x14/0x3c) >> [ 9.282782] r4:00000000 r3:00000000 >> [ 9.286909] UBIFS error (ubi0:0 pid 1): ubifs_recover_leb: corrupt >> empty space LEB 377:217088, corruption starts at 24762 >> [ 9.297900] UBIFS error (ubi0:0 pid 1): ubifs_scanned_corruption: >> corruption at LEB 377:241850 >> [ 9.306536] UBIFS error (ubi0:0 pid 1): ubifs_scanned_corruption: >> first 8192 bytes from LEB 377:241850 >> [ 9.315870] 00000000: ffffffef ffffffff ffffffff ffffffff ffffffff >> ffffffff ffffffff ffffffff ................................ >> [ 9.327374] 00000020: ffffffff ffffffff ffffffff ffffffff ffffffff >> ffffffff ffffffff ffffffff ................................ >> [ 9.338883] 00000040: ffffffff ffffffff ffffffff ffffffff ffffffff >> ffffffff ffffffff ffffffff ................................ >> [ 9.350389] 00000060: ffffffff ffffffff ffffffff ffffffff ffffffff >> ffffffff ffffffff ffffffff ................................ >> ... >> [ 9.352755] 00001fc0: ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff f >> fffffff ffffffff ................................ >> [ 9.352765] 00001fe0: ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff f >> fffffff ffffffff ................................ >> [ 9.352770] UBIFS error (ubi0:0 pid 1): ubifs_recover_leb: LEB 377 scanning f >> ailed >> [ 12.271485] VFS: Cannot open root device "ubi0:rootfs" or unknown-block(0,0): >> error -117 >> >> Thus far we have encountered this with a 16Gb MT29F16G08 and 'not' >> with a 2Gb MT29F2G08. The two parts have different geometries and the >> 16Gb part has a much larger block erase time (2ms) compared to the 2Gb >> (700us). > > gpmi-nand is not able to correct bit flips on erased pages. > This is why UBI is facing uncorrectable ECC errors and UBIFS gives up. Hi Richard, What would be causing the bit-flips on the erased pages? Could it have something to do with the larger flash part having a much longer erase time? There shouldn't be anything writing to ubifs so I'm not clear what caused this to occur. Note that even if I remove the /etc/fstab that causes root to be re-mounted read/write I always see 'UBIFS (ubi0:0): recovery needed' and I'm not understanding what causes that but it makes me think that NAND is getting touched each and every boot by the recovery process. > In March there was an attempt to fix that in software. > But no mainline ready solution was presented so far: > http://lists.infradead.org/pipermail/linux-mtd/2014-March/052521.html > > It is not clear whether to implement this directly in gpmi-nand or MTD core. > Currently UBIFS assumes that empty spaces must contain only 0xff octets. > A naive approach would be removing that check from UBIFS, bit this can have > disastrous consequences as UBIFS's recovery algorithm relies on that. I think I ran across that approach right before the thread you pointed me to: http://thread.gmane.org/gmane.linux.drivers.mtd/52208 > >> We have also seen several occurrences of errors on both NAND chips >> that cause a failure to mount rootfs but are not permanent >> (power-cycle again and all is well): >> [17:39:02][ 2.666800] ubi0: attaching mtd2 >> [17:39:05][ 5.388606] ubi0: scanning is finished >> [17:39:05][ 5.395747] ubi0 error: vtbl_check: bad CRC at record 72: >> 0xf116c36b, not 0x000000 >> [17:39:05][ 5.403330] Volume table record 72 dump: >> [17:39:05][ 5.407279] reserved_pebs 0 >> [17:39:05][ 5.410342] alignment 0 >> [17:39:05][ 5.413402] data_pad 0 >> [17:39:05][ 5.416481] vol_type 0 >> [17:39:05][ 5.419543] upd_marker 0 >> [17:39:05][ 5.422603] name_len 0 >> [17:39:05][ 5.427696] name NULL >> [17:39:05][ 5.433237] ubi0 error: vtbl_check: bad CRC at record 72: >> 0xf116c36b, not 0x000000 >> [17:39:05][ 5.440846] Volume table record 72 dump: >> [17:39:05][ 5.446806] reserved_pebs 0 >> [17:39:05][ 5.451881] alignment 0 >> [17:39:05][ 5.456967] data_pad 0 >> [17:39:05][ 5.462042] vol_type 0 >> [17:39:05][ 5.467137] upd_marker 0 >> [17:39:05][ 5.472216] name_len 0 >> [17:39:05][ 5.475304] name NULL >> [17:39:05][ 5.478629] ubi0 error: process_lvol: both volume tables >> are corrupted >> [17:39:05][ 5.492599] ubi0 error: ubi_attach_mtd_dev: failed to >> attach mtd2, error -22 >> [17:39:05][ 5.499728] UBI error: cannot attach mtd2 >> >> Are any/all of the above the 'unstable bits' issue? Any > > Not really. Why do you think so? No reason - it is just not clear to me how to know if you have run into the 'unstable bits' issue. > >> recommendations on what to look for next? > > How does the corruption look like in the second case? > Maybe we can identify a pattern. :-) the second case is not permanent corruption - when the system is power-cycled it comes up fine the next time. Tim > > Thanks, > //richard ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: UBIFS corruption after power cut - possibly unstable bits issue? 2015-10-26 20:31 ` Tim Harvey @ 2015-10-26 21:41 ` Richard Weinberger 2015-10-27 19:01 ` Tim Harvey 0 siblings, 1 reply; 18+ messages in thread From: Richard Weinberger @ 2015-10-26 21:41 UTC (permalink / raw) To: Tim Harvey; +Cc: Artem Bityutskiy, Adrian Hunter, linux-mtd Tim, Am 26.10.2015 um 21:31 schrieb Tim Harvey: > What would be causing the bit-flips on the erased pages? Could it have > something to do with the larger flash part having a much longer erase > time? According to NAND manufactures can happen also on empty space. At the time when UBIFS was designed this was not known nor was it observed. These days it seems to happen on some large/cheap NANDs. > There shouldn't be anything writing to ubifs so I'm not clear what > caused this to occur. Note that even if I remove the /etc/fstab that > causes root to be re-mounted read/write I always see 'UBIFS (ubi0:0): > recovery needed' and I'm not understanding what causes that but it > makes me think that NAND is getting touched each and every boot by the > recovery process. > >> In March there was an attempt to fix that in software. >> But no mainline ready solution was presented so far: >> http://lists.infradead.org/pipermail/linux-mtd/2014-March/052521.html >> >> It is not clear whether to implement this directly in gpmi-nand or MTD core. >> Currently UBIFS assumes that empty spaces must contain only 0xff octets. >> A naive approach would be removing that check from UBIFS, bit this can have >> disastrous consequences as UBIFS's recovery algorithm relies on that. > > I think I ran across that approach right before the thread you pointed > me to: http://thread.gmane.org/gmane.linux.drivers.mtd/52208 :-) > the second case is not permanent corruption - when the system is > power-cycled it comes up fine the next time. Yeah, but have you checked how the temporary corruption looks like? Thanks, //richard ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: UBIFS corruption after power cut - possibly unstable bits issue? 2015-10-26 21:41 ` Richard Weinberger @ 2015-10-27 19:01 ` Tim Harvey 2015-10-27 19:52 ` Richard Weinberger 0 siblings, 1 reply; 18+ messages in thread From: Tim Harvey @ 2015-10-27 19:01 UTC (permalink / raw) To: Richard Weinberger; +Cc: Artem Bityutskiy, Adrian Hunter, linux-mtd On Mon, Oct 26, 2015 at 2:41 PM, Richard Weinberger <richard@nod.at> wrote: > Tim, > > Am 26.10.2015 um 21:31 schrieb Tim Harvey: >> What would be causing the bit-flips on the erased pages? Could it have >> something to do with the larger flash part having a much longer erase >> time? > > According to NAND manufactures can happen also on empty space. > At the time when UBIFS was designed this was not known nor was it observed. > These days it seems to happen on some large/cheap NANDs. > Richard, I'm not understanding what is making you say that the issue I encountered is 'not' the unstable bits issue described at http://www.linux-mtd.infradead.org/doc/ubifs.html#L_unstable_bits? My understanding is that the 'unstable bit' issue refers to bits which are truly unstable and can read either way each and every read due to not getting properly erased/written. If I understand what you are saying you are thinking that my issue is instead the result of a never-used PEB that had bit-flips from the manufacturer in which case the bits would read the same every time? How can we know this PEB was never before used and isn't one that was being erased/written during a power cut? In my test scenario where the rootfs is mounted from the kernel read-only, but later mounted read-write by userspace (yet not being specifically written to by userspace) then power-cut should 'any' NAND writes would be occurring at all? And if not as I suspect, then how could a subsequent boot end up using a PEB that may have been never previously used and have bit-flips from the manufacturer? Should we be doing an erase block on every NAND block during our board manufacturing process to avoid this? It sounds like this 'unexpected bit-flips on erased pages from the mfg' issue is a ticking time-bomb for people using ubi/ubifs NAND. Shouldn't the http://www.linux-mtd.infradead.org/doc/ubifs.html page be updated to refer to this known issue as well as the unstable bit issue? >> There shouldn't be anything writing to ubifs so I'm not clear what >> caused this to occur. Note that even if I remove the /etc/fstab that >> causes root to be re-mounted read/write I always see 'UBIFS (ubi0:0): >> recovery needed' and I'm not understanding what causes that but it >> makes me think that NAND is getting touched each and every boot by the >> recovery process. >> >>> In March there was an attempt to fix that in software. >>> But no mainline ready solution was presented so far: >>> http://lists.infradead.org/pipermail/linux-mtd/2014-March/052521.html >>> >>> It is not clear whether to implement this directly in gpmi-nand or MTD core. >>> Currently UBIFS assumes that empty spaces must contain only 0xff octets. >>> A naive approach would be removing that check from UBIFS, bit this can have >>> disastrous consequences as UBIFS's recovery algorithm relies on that. >> >> I think I ran across that approach right before the thread you pointed >> me to: http://thread.gmane.org/gmane.linux.drivers.mtd/52208 > > :-) > >> the second case is not permanent corruption - when the system is >> power-cycled it comes up fine the next time. > > Yeah, but have you checked how the temporary corruption looks like? I can add some debugging to find out - what specifically would be helpful to add? Thanks for the help! Tim ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: UBIFS corruption after power cut - possibly unstable bits issue? 2015-10-27 19:01 ` Tim Harvey @ 2015-10-27 19:52 ` Richard Weinberger 2015-11-02 20:27 ` Tim Harvey 0 siblings, 1 reply; 18+ messages in thread From: Richard Weinberger @ 2015-10-27 19:52 UTC (permalink / raw) To: Tim Harvey; +Cc: Artem Bityutskiy, Adrian Hunter, linux-mtd Tim, Am 27.10.2015 um 20:01 schrieb Tim Harvey: > I'm not understanding what is making you say that the issue I > encountered is 'not' the unstable bits issue described at > http://www.linux-mtd.infradead.org/doc/ubifs.html#L_unstable_bits? My > understanding is that the 'unstable bit' issue refers to bits which > are truly unstable and can read either way each and every read due to > not getting properly erased/written. You are right. I was sorting out the unstable bits issue a bit too early. I'm sorry. Let's double check. Can you enable UBI verbose logging while testing? Such that we can see which blocks were written/erased while the power cut happened? > If I understand what you are saying you are thinking that my issue is > instead the result of a never-used PEB that had bit-flips from the > manufacturer in which case the bits would read the same every time? > How can we know this PEB was never before used and isn't one that was > being erased/written during a power cut? I've seen bit flips on cheap SLC NANDs which came out of a sudden. According to the FAE I was talking to this is legit for NAND as long the flipping bits are fixable by the ECC engine. > In my test scenario where the rootfs is mounted from the kernel > read-only, but later mounted read-write by userspace (yet not being > specifically written to by userspace) then power-cut should 'any' NAND > writes would be occurring at all? And if not as I suspect, then how > could a subsequent boot end up using a PEB that may have been never > previously used and have bit-flips from the manufacturer? UBIFS's has a wandering journal. During the remount it moved maybe. But for a more expressive analysis I'd need a nanddump to find out which blocks are in which role. Can you share the nanddump? > Should we be doing an erase block on every NAND block during our board > manufacturing process to avoid this? Sorry, I don't understand this sentence. Do you mean a full erasure of the whole NAND? If so, it would not help as the bit flips can come later. (Without writing/erasing the block) The root cause is that your NFC cannot correct bit flips on empty pages. > It sounds like this 'unexpected bit-flips on erased pages from the > mfg' issue is a ticking time-bomb for people using ubi/ubifs NAND. > Shouldn't the http://www.linux-mtd.infradead.org/doc/ubifs.html page > be updated to refer to this known issue as well as the unstable bit > issue? As I said the root cause is that some NFCs cannot correct bit flips on empty pages. Instead of putting warnings to ubifs.html I'd love to see a solution on the said drivers or MTD core. > I can add some debugging to find out - what specifically would be > helpful to add? A hexdump of the buffer would be a good start. > Thanks for the help! Thanks for sharing your issues. This is the only way to address them. That said, as far on no board I had access to I was able to reproduce the unstable bits issue. It was always something else. Thanks, //richard ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: UBIFS corruption after power cut - possibly unstable bits issue? 2015-10-27 19:52 ` Richard Weinberger @ 2015-11-02 20:27 ` Tim Harvey 2015-11-02 20:31 ` Tim Harvey 2015-11-03 9:10 ` Artem Bityutskiy 0 siblings, 2 replies; 18+ messages in thread From: Tim Harvey @ 2015-11-02 20:27 UTC (permalink / raw) To: Richard Weinberger Cc: Artem Bityutskiy, Adrian Hunter, linux-mtd, Huang Shijie, Elie De Brauwer, Brian Norris On Tue, Oct 27, 2015 at 12:52 PM, Richard Weinberger <richard@nod.at> wrote: > Tim, > > Am 27.10.2015 um 20:01 schrieb Tim Harvey: >> I'm not understanding what is making you say that the issue I >> encountered is 'not' the unstable bits issue described at >> http://www.linux-mtd.infradead.org/doc/ubifs.html#L_unstable_bits? My >> understanding is that the 'unstable bit' issue refers to bits which >> are truly unstable and can read either way each and every read due to >> not getting properly erased/written. > > You are right. I was sorting out the unstable bits issue a bit too > early. I'm sorry. > Let's double check. Can you enable UBI verbose logging while testing? > Such that we can see which blocks were written/erased while the power cut > happened? > Richard, I have 2 IMX6 based boards that I've recreated a UBI/UBIFS NAND corruption using Linux 4.2 with dynamic debug enabled for ubi_io_write. This particular test cycle consisted of 5 different boards and had over 40,000 boot/power-cuts before corruption occurred. In each case the corruption is in a slightly different area. The boards remain in a corrupted state which will allow me to get further debug info. Board1 looks like this: [ 0.000000] Linux version 4.2.0-00015-g64291f7 (tharvey@tharvey) (gcc version 4.8.3 (OpenWrt/Linaro GCC 4.8-2014.04 r43990) ) #195 SMP Thu Oct 29 14:27:36 PDT 2015 [ 0.000000] CPU: ARMv7 Processor [412fc09a] revision 10 (ARMv7), cr=10c5387d [ 0.000000] CPU: PIPT / VIPT nonaliasing data cache, VIPT aliasing instruction cache [ 0.000000] Machine model: Gateworks Ventana i.MX6 Dual/Quad GW52XX ... [ 0.000000] Kernel command line: console=ttymxc1,115200 root=ubi0:rootfs ubi.mtd=2 rootfstype=ubifs dyndbg="func ubi_io_write +p" debug ... [ 2.045065] nand: device found, Manufacturer ID: 0x2c, Chip ID: 0xd5 [ 2.051477] nand: Micron MT29F16G08ADACAH4 [ 2.055590] nand: 2048 MiB, SLC, erase size: 256 KiB, page size: 4096, OOB size: 224 [ 2.087400] gpmi-nand 112000.gpmi-nand: enable the asynchronous EDO mode 5 [ 2.094321] Scanning device for bad blocks ... [ 3.704946] 3 ofpart partitions found on MTD device gpmi-nand [ 3.710735] Creating 3 MTD partitions on "gpmi-nand": [ 3.715823] 0x000000000000-0x000001000000 : "uboot" [ 3.725088] 0x000001000000-0x000001100000 : "env" [ 3.733000] 0x000001100000-0x000080000000 : "rootfs" [ 3.741880] gpmi-nand 112000.gpmi-nand: driver registered. ... [ 4.365104] ubi0: attaching mtd2 [ 8.418862] ubi0: scanning is finished [ 8.445084] ubi0: attached mtd2 (name "rootfs", size 2031 MiB) [ 8.450977] ubi0: PEB size: 262144 bytes (256 KiB), LEB size: 253952 bytes [ 8.457864] ubi0: min./max. I/O unit sizes: 4096/4096, sub-page size 4096 [ 8.464676] ubi0: VID header offset: 4096 (aligned 4096), data offset: 8192 [ 8.471666] ubi0: good PEBs: 8124, bad PEBs: 0, corrupted PEBs: 0 [ 8.477767] ubi0: user volume: 1, internal volumes: 1, max. volumes count: 128 [ 8.485015] ubi0: max/mean erase counter: 7/3, WL threshold: 4096, image sequence number: 42283325 [ 8.494000] ubi0: available PEBs: 0, total reserved PEBs: 8124, PEBs reserved for bad PEB handling: 160 [ 8.503610] ubi0: background thread "ubi_bgt0d" started, PID 81 ... [ 8.615672] UBIFS (ubi0:0): recovery needed [ 8.657083] UBIFS (ubi0:0): recovery deferred [ 8.661965] UBIFS (ubi0:0): UBIFS: mounted UBI device 0, volume 0, name "rootfs", R/O mode [ 8.670290] UBIFS (ubi0:0): LEB size: 253952 bytes (248 KiB), min./max. I/O unit sizes: 4096 bytes/4096 bytes [ 8.680248] UBIFS (ubi0:0): FS size: 2018918400 bytes (1925 MiB, 7950 LEBs), journal size 9404416 bytes (8 MiB, 38 LEBs) [ 8.691145] UBIFS (ubi0:0): reserved for root: 0 bytes (0 KiB) [ 8.697001] UBIFS (ubi0:0): media format: w4/r0 (latest is w4/r0), UUID BB5E5EB5-6A21-48AA-9C31-DE6B4A32ABAD, small LPT model [ 8.709784] VFS: Mounted root (ubifs filesystem) readonly on device 0:14. [ 8.719451] devtmpfs: mounted [ 8.723002] Freeing unused kernel memory: 420K (80aa5000 - 80b0e000) INIT: version 2.88 booting Starting udev [ 9.548551] udevd[114]: starting version 182 [ 10.018775] UBIFS (ubi0:0): completing deferred recovery [ 10.043624] UBI DBG io (pid 145): write 4096 bytes to PEB 2253:4096 [ 10.054878] UBI DBG io (pid 145): write 94208 bytes to PEB 2253:8192 [ 10.096644] UBI DBG io (pid 81): write 4096 bytes to PEB 2250:0 [ 10.143314] UBI DBG io (pid 145): write 4096 bytes to PEB 2254:4096 [ 10.153917] UBI DBG io (pid 145): write 184320 bytes to PEB 2254:8192 ^^^^^ PEB 2254 is the one that becomes un-readable on the next boot [ 10.186288] UBI DBG io (pid 81): write 4096 bytes to PEB 2251:0 [ 10.204620] UBIFS (ubi0:0): background thread "ubifs_bgt0_0" started, PID 173 [ 10.213375] UBI DBG io (pid 145): write 4096 bytes to PEB 2255:4096 [ 10.223864] UBI DBG io (pid 145): write 4096 bytes to PEB 2255:8192 [ 10.239935] UBI DBG io (pid 145): write 4096 bytes to PEB 2237:32768 [ 10.261787] UBI DBG io (pid 145): write 4096 bytes to PEB 2184:102400 [ 10.272978] UBI DBG io (pid 145): write 4096 bytes to PEB 2173:114688 [ 10.285948] UBI DBG io (pid 145): write 4096 bytes to PEB 2174:114688 [ 10.296027] UBIFS (ubi0:0): deferred recovery completed [ 10.299453] UBI DBG io (pid 81): write 4096 bytes to PEB 2252:0 bootlogd: cannot allocate pseudo tty: No such file or directory ALSA: Restoring mixer settings... Mon Oct 26 16:37:16 UTC 2015 INIT: Entering runlevel: 5 Configuring network interfaces... [ 11.471755] fec 2188000.ethernet eth0: Freescale FEC PHY driver [Generic PHY] (mii_bus:phy_addr=2188000.ethernet:00, irq=-1) [ 11.483499] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready udhcpc (v1.23.1) started Sending discover... Sending discover... [ 15.000462] UBI DBG io (pid 173): write 4096 bytes to PEB 2253:102400 [ 16.472720] fec 2188000.ethernet eth0: Link is Up - 1Gbps/Full - flow control rx/tx [ 16.480975] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready Sending discover... Sending select for 192.168.1.1... Lease of 192.168.1.1 obtained, lease time 21600 /etc/udhcpc.d/50default: Adding DNS 192.168.1.100 done. <power cut> [ 0.000000] Booting Linux on physical CPU 0x0 [ 0.000000] Linux version 4.2.0-00015-g64291f7 (tharvey@tharvey) (gcc version 4.8.3 (OpenWrt/Linaro GCC 4.8-2014.04 r43990) ) #195 SMP Thu Oct 29 14:27:36 PDT 2015 [ 0.000000] CPU: ARMv7 Processor [412fc09a] revision 10 (ARMv7), cr=10c5387d [ 0.000000] CPU: PIPT / VIPT nonaliasing data cache, VIPT aliasing instruction cache [ 0.000000] Machine model: Gateworks Ventana i.MX6 Dual/Quad GW52XX ... [ 2.044936] nand: device found, Manufacturer ID: 0x2c, Chip ID: 0xd5 [ 2.051348] nand: Micron MT29F16G08ADACAH4 [ 2.055461] nand: 2048 MiB, SLC, erase size: 256 KiB, page size: 4096, OOB size: 224 [ 2.087272] gpmi-nand 112000.gpmi-nand: enable the asynchronous EDO mode 5 [ 2.094196] Scanning device for bad blocks ... [ 3.710706] Creating 3 MTD partitions on "gpmi-nand": [ 3.715793] 0x000000000000-0x000001000000 : "uboot" [ 3.725027] 0x000001000000-0x000001100000 : "env" [ 3.732929] 0x000001100000-0x000080000000 : "rootfs" [ 3.741781] gpmi-nand 112000.gpmi-nand: driver registered. ... [ 4.366226] ubi0: attaching mtd2 [ 8.445733] ubi0: scanning is finished [ 8.477551] ubi0: attached mtd2 (name "rootfs", size 2031 MiB) [ 8.483443] ubi0: PEB size: 262144 bytes (256 KiB), LEB size: 253952 bytes [ 8.490355] ubi0: min./max. I/O unit sizes: 4096/4096, sub-page size 4096 [ 8.497155] ubi0: VID header offset: 4096 (aligned 4096), data offset: 8192 [ 8.504148] ubi0: good PEBs: 8124, bad PEBs: 0, corrupted PEBs: 0 [ 8.510274] ubi0: user volume: 1, internal volumes: 1, max. volumes count: 128 [ 8.517510] ubi0: max/mean erase counter: 7/3, WL threshold: 4096, image sequence number: 42283325 [ 8.526498] ubi0: available PEBs: 0, total reserved PEBs: 8124, PEBs reserved for bad PEB handling: 160 [ 8.536151] ubi0: background thread "ubi_bgt0d" started, PID 82 ... [ 8.635364] UBIFS (ubi0:0): recovery needed [ 8.676203] ubi0 warning: ubi_io_read: error -74 (ECC error) while reading 69632 bytes from PEB 2254:192512, read only 69632 bytes, retry [ 8.692460] ubi0 warning: ubi_io_read: error -74 (ECC error) while reading 69632 bytes from PEB 2254:192512, read only 69632 bytes, retry [ 8.708741] ubi0 warning: ubi_io_read: error -74 (ECC error) while reading 69632 bytes from PEB 2254:192512, read only 69632 bytes, retry ^^^^ non correctable ecc error on PEB 2254 - I verified that this was not the first time this PEB has been used [ 8.724987] ubi0 error: ubi_io_read: error -74 (ECC error) while reading 69632 bytes from PEB 2254:192512, read 69632 bytes [ 8.736163] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.2.0-00015-g64291f7 dirty #195 [ 8.744021] Hardware name: Freescale i.MX6 Quad/DualLite (Device Tree) [ 8.750582] Backtrace: [ 8.753074] [<8001400c>] (dump_backtrace) from [<800141f8>] (show_stack+0x18/0x1c) [ 8.760667] r6:80b2d960 r5:80b2d960 r4:00000000 r3:00000000 [ 8.766424] [<800141e0>] (show_stack) from [<807ceeb8>] (dump_stack+0x8c/0xa4) [ 8.773687] [<807cee2c>] (dump_stack) from [<80474730>] (ubi_io_read+0x130/0x378) [ 8.781212] r6:00000000 r5:00011000 r4:ffffffb6 r3:be070000 [ 8.786951] [<80474600>] (ubi_io_read) from [<80471564>] (ubi_eba_read_leb+0x19c/0x4c8) [ 8.794975] r10:00000000 r9:c0c63000 r8:000008ce r7:00001f15 r6:00000000 r5:bda3a000 [ 8.802928] r4:bda12000 [ 8.805489] [<804713c8>] (ubi_eba_read_leb) from [<8047077c>] (ubi_leb_read+0x94/0x130) [ 8.813527] r10:00000000 r9:c0c63000 r8:bda3a000 r7:0002d000 r6:00001f15 r5:00011000 [ 8.821467] r4:bda12000 [ 8.824037] [<804706e8>] (ubi_leb_read) from [<80257870>] (ubifs_leb_read+0x30/0x9c) [ 8.831815] r10:be29a000 r9:80b77ff8 r8:00001f15 r7:0002d000 r6:00011000 r5:be29a000 [ 8.839731] r4:bdaea600 [ 8.842312] [<80257840>] (ubifs_leb_read) from [<80260fac>] (ubifs_start_scan+0x84/0x134) [ 8.850522] r8:c0c36000 r7:00001f15 r6:be29a000 r5:0002d000 r4:bdaea600 [ 8.857322] [<80260f28>] (ubifs_start_scan) from [<80276270>] (ubifs_recover_leb+0x80/0xb80) [ 8.865794] r8:00001f15 r7:c0c63000 r6:00011000 r5:00001f15 r4:00000001 [ 8.872609] [<802761f0>] (ubifs_recover_leb) from [<802627f0>] (ubifs_replay_journal+0xd34/0x1a48) [ 8.881601] r10:00017188 r9:0002d000 r8:be29a000 r7:00000000 r6:bdaea608 r5:00001f15 [ 8.889518] r4:00000001 [ 8.892106] [<80261abc>] (ubifs_replay_journal) from [<80254954>] (ubifs_mount+0x136c/0x2170) [ 8.900666] r10:be29a008 r9:00000000 r8:0003db70 r7:be29a000 r6:80b76870 r5:be3f6000 [ 8.908584] r4:00000000 [ 8.911171] [<802535e8>] (ubifs_mount) from [<801055f8>] (mount_fs+0x1c/0xb0) [ 8.918323] r10:00008001 r9:00000000 r8:80b2a2e0 r7:80124e34 r6:80b2a2e0 r5:bd9cfac0 [ 8.926258] r4:be365900 [ 8.928829] [<801055dc>] (mount_fs) from [<80121868>] (vfs_kern_mount+0x5c/0x144) [ 8.936355] r6:00008001 r5:bd9cfac0 r4:be365900 [ 8.941056] [<8012180c>] (vfs_kern_mount) from [<80124e34>] (do_mount+0x19c/0xd0c) [ 8.948642] r9:80b2a2e0 r8:80b1ebdc r7:bd9cfa80 r6:bd9cfac0 r5:00000060 r4:00000000 [ 8.956498] [<80124c98>] (do_mount) from [<80125d38>] (SyS_mount+0x7c/0xa8) [ 8.963493] r10:000000d7 r9:80afa7b0 r8:8099d910 r7:00008001 r6:00000000 r5:bd9cfac0 [ 8.971429] r4:bd9cfa80 [ 8.973993] [<80125cbc>] (SyS_mount) from [<80aa6260>] (mount_block_root+0x114/0x240) [ 8.981856] r8:befa3a40 r7:80afa7a0 r6:bd9d2000 r5:bd9d2000 r4:00008001 [ 8.988651] [<80aa614c>] (mount_block_root) from [<80aa664c>] (prepare_namespace+0x98/0x1d0) [ 8.997123] r10:000000d7 r9:80afa780 r8:80b087e4 r7:80afa778 r6:80afa7a0 r5:80afa7a0 [ 9.005058] r4:80afa7b0 [ 9.007617] [<80aa65b4>] (prepare_namespace) from [<80aa5efc>] (kernel_init_freeable+0x1a4/0x1f0) [ 9.016522] r6:80b8b000 r5:80b8b000 r4:00000007 [ 9.021217] [<80aa5d58>] (kernel_init_freeable) from [<807c91b4>] (kernel_init+0x10/0xf4) [ 9.029412] r10:00000000 r9:00000000 r8:00000000 r7:00000000 r6:00000000 r5:807c91a4 [ 9.037354] r4:00000000 [ 9.039918] [<807c91a4>] (kernel_init) from [<800104b8>] (ret_from_fork+0x14/0x3c) [ 9.047525] r4:00000000 r3:00000000 [ 9.051259] UBIFS error (ubi0:0 pid 1): ubifs_recover_leb: corrupt empty space LEB 7957:184320, corruption starts at 4665 [ 9.062247] UBIFS error (ubi0:0 pid 1): ubifs_scanned_corruption: corruption at LEB 7957:188985 [ 9.070972] UBIFS error (ubi0:0 pid 1): ubifs_scanned_corruption: first 8192 bytes from LEB 7957:188985 [ 9.080396] 00000000: ffffffef ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ................................ [ 9.091903] 00000020: ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ................................ ... [ 12.002997] 00001fc0: ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ................................ [ 12.014501] 00001fe0: ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ................................ [ 12.026000] UBIFS error (ubi0:0 pid 1): ubifs_recover_leb: LEB 7957 scanning failed ^^^^ recovery failed [ 12.036886] VFS: Cannot open root device "ubi0:rootfs" or unknown-block(0,0): error -117 After added Huang Shijie's 'mtd: gpmi: fix the bitflips for erased page' patch from http://patchwork.ozlabs.org/patch/309763/ which shows the following on board1: [ 8.532624] UBIFS (ubi0:0): recovery needed [ 8.569104] gpmi-nand 112000.gpmi-nand: The page(148656) is an erased page(1,6,1,1). [ 8.577748] gpmi-nand 112000.gpmi-nand: The page(148658) is an erased page(1,6,1,1). [ 8.586413] gpmi-nand 112000.gpmi-nand: The page(148660) is an erased page(1,6,1,1). [ 8.595050] gpmi-nand 112000.gpmi-nand: The page(148662) is an erased page(1,6,1,1). [ 8.603705] gpmi-nand 112000.gpmi-nand: The page(148664) is an erased page(1,6,1,1). [ 8.612310] gpmi-nand 112000.gpmi-nand: The page(148666) is an erased page(1,6,1,1). [ 8.620940] gpmi-nand 112000.gpmi-nand: The page(148668) is an erased page(1,6,1,1). [ 8.629580] gpmi-nand 112000.gpmi-nand: The page(148670) is an erased page(1,6,1,1). ^^^^ 1 bit-flip on ecc read, 1 bit-flip on non-ecc read, ecc-str=6, memset's buffer to 0xff to attempt to recover [ 8.644345] UBIFS (ubi0:0): recovery deferred [ 8.649209] UBIFS (ubi0:0): UBIFS: mounted UBI device 0, volume 0, name "rootfs", R/O mode [ 8.657537] UBIFS (ubi0:0): LEB size: 253952 bytes (248 KiB), min./max. I/O unit sizes: 4096 bytes/4096 bytes [ 8.667486] UBIFS (ubi0:0): FS size: 2018918400 bytes (1925 MiB, 7950 LEBs), journal size 9404416 bytes (8 MiB, 38 LEBs) [ 8.678409] UBIFS (ubi0:0): reserved for root: 0 bytes (0 KiB) [ 8.684283] UBIFS (ubi0:0): media format: w4/r0 (latest is w4/r0), UUID BB5E5EB5-6A21-48AA-9C31-DE6B4A32ABAD, small LPT model [ 8.697990] VFS: Mounted root (ubifs filesystem) readonly on device 0:14. ^^^^ able to recover and always ends up this way over further boots ^^^^ after applying this patch I boot with init=foo so as to not allow rootfs to get mounted read/write by udev and perhaps permanently recover the board Board2 looks like this: [ 0.000000] Booting Linux on physical CPU 0x0 [ 0.000000] Linux version 4.2.0-00015-g64291f7 (tharvey@tharvey) (gcc version 4.8.3 (OpenWrt/Linaro GCC 4.8-2014.04 r43990) ) #195 SMP Thu Oct 29 14:27:36 PDT 2015 [ 0.000000] CPU: ARMv7 Processor [412fc09a] revision 10 (ARMv7), cr=10c5387d [ 0.000000] CPU: PIPT / VIPT nonaliasing data cache, VIPT aliasing instruction cache [ 0.000000] Machine model: Gateworks Ventana i.MX6 Dual/Quad GW552X ... [ 0.000000] Kernel command line: console=ttymxc1,115200 root=ubi0:rootfs ubi.mtd=2 rootfstype=ubifs dyndbg="func ubi_io_write +p" debug ... [ 2.341988] nand: device found, Manufacturer ID: 0x2c, Chip ID: 0xd5 [ 2.348382] nand: Micron MT29F16G08ADACAH4 [ 2.352495] nand: 2048 MiB, SLC, erase size: 256 KiB, page size: 4096, OOB size: 224 [ 2.384301] gpmi-nand 112000.gpmi-nand: enable the asynchronous EDO mode 5 [ 2.391224] Scanning device for bad blocks ... [ 4.024472] 3 ofpart partitions found on MTD device gpmi-nand [ 4.030265] Creating 3 MTD partitions on "gpmi-nand": [ 4.035336] 0x000000000000-0x000001000000 : "uboot" [ 4.044635] 0x000001000000-0x000001100000 : "env" [ 4.052559] 0x000001100000-0x000080000000 : "rootfs" [ 4.061452] gpmi-nand 112000.gpmi-nand: driver registered. ... [ 4.426420] ubi0: attaching mtd2 [ 8.514755] ubi0: scanning is finished [ 8.541225] ubi0: attached mtd2 (name "rootfs", size 2031 MiB) [ 8.547108] ubi0: PEB size: 262144 bytes (256 KiB), LEB size: 253952 bytes [ 8.553994] ubi0: min./max. I/O unit sizes: 4096/4096, sub-page size 4096 [ 8.560810] ubi0: VID header offset: 4096 (aligned 4096), data offset: 8192 [ 8.567799] ubi0: good PEBs: 8124, bad PEBs: 0, corrupted PEBs: 0 [ 8.573901] ubi0: user volume: 1, internal volumes: 1, max. volumes count: 128 [ 8.581161] ubi0: max/mean erase counter: 4/2, WL threshold: 4096, image sequence number: 42283325 [ 8.590148] ubi0: available PEBs: 0, total reserved PEBs: 8124, PEBs reserved for bad PEB handling: 160 [ 8.599737] ubi0: background thread "ubi_bgt0d" started, PID 74 ... [ 8.712166] UBIFS (ubi0:0): recovery needed [ 8.749882] UBIFS (ubi0:0): recovery deferred [ 8.754639] UBIFS (ubi0:0): UBIFS: mounted UBI device 0, volume 0, name "rootfs", R/O mode [ 8.762941] UBIFS (ubi0:0): LEB size: 253952 bytes (248 KiB), min./max. I/O unit sizes: 4096 bytes/4096 bytes [ 8.772883] UBIFS (ubi0:0): FS size: 2018918400 bytes (1925 MiB, 7950 LEBs), journal size 9404416 bytes (8 MiB, 38 LEBs) [ 8.783773] UBIFS (ubi0:0): reserved for root: 0 bytes (0 KiB) [ 8.789631] UBIFS (ubi0:0): media format: w4/r0 (latest is w4/r0), UUID BB5E5EB5-6A21-48AA-9C31-DE6B4A32ABAD, small LPT model [ 8.803366] VFS: Mounted root (ubifs filesystem) readonly on device 0:14. [ 8.813172] devtmpfs: mounted [ 8.816746] Freeing unused kernel memory: 420K (80aa5000 - 80b0e000) INIT: version 2.88 booting Starting udev [ 9.621167] udevd[106]: starting version 182 [ 10.525521] UBIFS (ubi0:0): completing deferred recovery [ 10.540072] UBI DBG io (pid 162): write 4096 bytes to PEB 5891:4096 [ 10.547480] UBI DBG io (pid 162): write 110592 bytes to PEB 5891:8192 [ 10.578168] UBI DBG io (pid 162): write 4096 bytes to PEB 5892:4096 [ 10.585053] UBI DBG io (pid 74): write 4096 bytes to PEB 2944:0 [ 10.585639] UBI DBG io (pid 162): write 163840 bytes to PEB 5892:8192 [ 10.607478] UBI DBG io (pid 74): write 4096 bytes to PEB 2945:0 [ 10.615485] UBIFS (ubi0:0): background thread "ubifs_bgt0_0" started, PID 163 [ 10.617094] UBI DBG io (pid 162): write 4096 bytes to PEB 5893:4096 [ 10.618542] UBI DBG io (pid 162): write 4096 bytes to PEB 5893:8192 [ 10.619487] UBI DBG io (pid 162): write 4096 bytes to PEB 2859:212992 [ 10.620347] UBI DBG io (pid 162): write 4096 bytes to PEB 2850:225280 ^^^^^ this is the PEB that gets uncorrectable ECC errors - I verified that this was not the first time this PEB has been used [ 10.621476] UBI DBG io (pid 162): write 4096 bytes to PEB 2851:225280 [ 10.623058] UBIFS (ubi0:0): deferred recovery completed [ 10.628641] UBI DBG io (pid 74): write 4096 bytes to PEB 2946:0 bootlogd: cannot allocate pseudo tty: No such file or directory Populating dev cache ALSA: Restoring mixer settings... /usr/sbin/alsactl: load_state:1729: No soundcards found... Sat Oct 24 11:28:04 UTC 2015 INIT: Entering runlevel: 5 Configuring network interfaces... ifconfig: SIOCGIFFLAGS: No such device Starting system message bus: dbus. Starting Dropbear SSH server: dropbear. Starting rpcbind daemon...done. Starting watchdog: ^[[1 34m[^[[1 32m OK ^[[1 34m]^[[0 39m Starting advanced power management daemon: No APM support in kernel (failed.) starting DNS forwarder and DHCP server: dnsmasq... done. <power cut> [ 0.000000] Booting Linux on physical CPU 0x0 [ 0.000000] Linux version 4.2.0-00015-g64291f7 (tharvey@tharvey) (gcc version 4.8.3 (OpenWrt/Linaro GCC 4.8-2014.04 r43990) ) #195 SMP Thu Oct 29 14:27:36 PDT 2015 ... [ 2.332131] nand: device found, Manufacturer ID: 0x2c, Chip ID: 0xd5 [ 2.338524] nand: Micron MT29F16G08ADACAH4 [ 2.342638] nand: 2048 MiB, SLC, erase size: 256 KiB, page size: 4096, OOB size: 224 [ 2.374442] gpmi-nand 112000.gpmi-nand: enable the asynchronous EDO mode 5 [ 2.381366] Scanning device for bad blocks ... [ 4.013255] 3 ofpart partitions found on MTD device gpmi-nand [ 4.019040] Creating 3 MTD partitions on "gpmi-nand": [ 4.024111] 0x000000000000-0x000001000000 : "uboot" [ 4.033352] 0x000001000000-0x000001100000 : "env" [ 4.041210] 0x000001100000-0x000080000000 : "rootfs" [ 4.050164] gpmi-nand 112000.gpmi-nand: driver registered. ... [ 4.416614] ubi0: attaching mtd2 [ 8.500977] ubi0: scanning is finished [ 8.527328] ubi0: attached mtd2 (name "rootfs", size 2031 MiB) [ 8.533184] ubi0: PEB size: 262144 bytes (256 KiB), LEB size: 253952 bytes [ 8.540089] ubi0: min./max. I/O unit sizes: 4096/4096, sub-page size 4096 [ 8.546906] ubi0: VID header offset: 4096 (aligned 4096), data offset: 8192 [ 8.553878] ubi0: good PEBs: 8124, bad PEBs: 0, corrupted PEBs: 0 [ 8.559998] ubi0: user volume: 1, internal volumes: 1, max. volumes count: 128 [ 8.567248] ubi0: max/mean erase counter: 4/2, WL threshold: 4096, image sequence number: 42283325 [ 8.576216] ubi0: available PEBs: 0, total reserved PEBs: 8124, PEBs reserved for bad PEB handling: 160 [ 8.585813] ubi0: background thread "ubi_bgt0d" started, PID 73 .. [ 8.683337] ubi0 warning: ubi_io_read: error -74 (ECC error) while reading 253952 bytes from PEB 2850:8192, read only 253952 bytes, retry [ 8.710495] ubi0 warning: ubi_io_read: error -74 (ECC error) while reading 253952 bytes from PEB 2850:8192, read only 253952 bytes, retry [ 8.736821] ubi0 warning: ubi_io_read: error -74 (ECC error) while reading 253952 bytes from PEB 2850:8192, read only 253952 bytes, retry [ 8.763179] ubi0 error: ubi_io_read: error -74 (ECC error) while reading 253952 bytes from PEB 2850:8192, read 253952 bytes ^^^^ uncorrectable ECC error on PEB 2850 [ 8.774340] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.2.0-00015-g64291f7 #195 [ 8.782194] Hardware name: Freescale i.MX6 Quad/DualLite (Device Tree) [ 8.788743] Backtrace: [ 8.791237] [<8001400c>] (dump_backtrace) from [<800141f8>] (show_stack+0x18/0x1c) [ 8.798839] r6:80b2d960 r5:80b2d960 r4:00000000 r3:00000000 [ 8.804589] [<800141e0>] (show_stack) from [<807ceeb8>] (dump_stack+0x8c/0xa4) [ 8.811855] [<807cee2c>] (dump_stack) from [<80474730>] (ubi_io_read+0x130/0x378) [ 8.819359] r6:00000000 r5:0003e000 r4:ffffffb6 r3:9e468000 [ 8.825099] [<80474600>] (ubi_io_read) from [<80471564>] (ubi_eba_read_leb+0x19c/0x4c8) [ 8.833125] r10:00000000 r9:a0c0d000 r8:00000b22 r7:00000001 r6:00000000 r5:9dd8a000 [ 8.841065] r4:9dd9e000 [ 8.843628] [<804713c8>] (ubi_eba_read_leb) from [<8047077c>] (ubi_leb_read+0x94/0x130) [ 8.851661] r10:00000000 r9:a0c0d000 r8:9dd8a000 r7:00000000 r6:00000001 r5:0003e000 [ 8.859596] r4:9dd9e000 [ 8.862170] [<804706e8>] (ubi_leb_read) from [<80257870>] (ubifs_leb_read+0x30/0x9c) [ 8.869937] r10:9e673008 r9:00000001 r8:00000001 r7:00000000 r6:0003e000 r5:9e673000 [ 8.877873] r4:9de22240 [ 8.880437] [<80257840>] (ubifs_leb_read) from [<80260fac>] (ubifs_start_scan+0x84/0x134) [ 8.888637] r8:a0c0d000 r7:00000001 r6:9e673000 r5:00000000 r4:9de22240 [ 8.895433] [<80260f28>] (ubifs_start_scan) from [<802613ac>] (ubifs_scan+0x30/0x3d4) [ 8.903284] r8:00000001 r7:9e673000 r6:a0c0d000 r5:00000000 r4:0003e000 [ 8.910097] [<8026137c>] (ubifs_scan) from [<8025fcd8>] (ubifs_read_master+0x48/0xd78) [ 8.918034] r10:9e673008 r9:9dd7bf80 r8:9e673000 r7:00000001 r6:80b76870 r5:9dda1800 [ 8.925952] r4:00000001 [ 8.928533] [<8025fc90>] (ubifs_read_master) from [<80254714>] (ubifs_mount+0x112c/0x2170) [ 8.936819] r10:9e673008 r9:9dd7bf80 r8:9e673b38 r7:9e673000 r6:80b76870 r5:9dda1800 [ 8.944738] r4:00000001 [ 8.947326] [<802535e8>] (ubifs_mount) from [<801055f8>] (mount_fs+0x1c/0xb0) [ 8.954466] r10:00008001 r9:00000000 r8:80b2a2e0 r7:80124e34 r6:80b2a2e0 r5:9dd73c80 [ 8.962404] r4:9dd76500 [ 8.964975] [<801055dc>] (mount_fs) from [<80121868>] (vfs_kern_mount+0x5c/0x144) [ 8.972479] r6:00008001 r5:9dd73c80 r4:9dd76500 [ 8.977177] [<8012180c>] (vfs_kern_mount) from [<80124e34>] (do_mount+0x19c/0xd0c) [ 8.984750] r9:80b2a2e0 r8:80b1ebdc r7:9e7b5d00 r6:9dd73c80 r5:00000060 r4:00000000 [ 8.992605] [<80124c98>] (do_mount) from [<80125d38>] (SyS_mount+0x7c/0xa8) [ 8.999588] r10:000000d7 r9:80afa7b0 r8:8099d910 r7:00008001 r6:00000000 r5:9dd73c80 [ 9.007522] r4:9e7b5d00 [ 9.010087] [<80125cbc>] (SyS_mount) from [<80aa6260>] (mount_block_root+0x114/0x240) [ 9.017941] r8:9efacd20 r7:80afa7a0 r6:9de69000 r5:9de69000 r4:00008001 [ 9.024736] [<80aa614c>] (mount_block_root) from [<80aa664c>] (prepare_namespace+0x98/0x1d0) [ 9.033195] r10:000000d7 r9:80afa780 r8:80b087e4 r7:80afa778 r6:80afa7a0 r5:80afa7a0 [ 9.041129] r4:80afa7b0 [ 9.043690] [<80aa65b4>] (prepare_namespace) from [<80aa5efc>] (kernel_init_freeable+0x1a4/0x1f0) [ 9.052592] r6:80b8b000 r5:80b8b000 r4:00000007 [ 9.057290] [<80aa5d58>] (kernel_init_freeable) from [<807c91b4>] (kernel_init+0x10/0xf4) [ 9.065472] r10:00000000 r9:00000000 r8:00000000 r7:00000000 r6:00000000 r5:807c91a4 [ 9.073407] r4:00000000 [ 9.075970] [<807c91a4>] (kernel_init) from [<800104b8>] (ret_from_fork+0x14/0x3c) [ 9.083560] r4:00000000 r3:00000000 [ 9.102369] ubi0 warning: ubi_io_read: error -74 (ECC error) while reading 253952 bytes from PEB 2850:8192, read only 253952 bytes, retry [ 9.128774] ubi0 warning: ubi_io_read: error -74 (ECC error) while reading 253952 bytes from PEB 2850:8192, read only 253952 bytes, retry [ 9.155184] ubi0 warning: ubi_io_read: error -74 (ECC error) while reading 253952 bytes from PEB 2850:8192, read only 253952 bytes, retry [ 9.181522] ubi0 error: ubi_io_read: error -74 (ECC error) while reading 253952 bytes from PEB 2850:8192, read 253952 bytes ^^^^ unrecoverable ECC error on PEB 2850 [ 9.192683] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.2.0-00015-g64291f7 #195 [ 9.200535] Hardware name: Freescale i.MX6 Quad/DualLite (Device Tree) [ 9.207083] Backtrace: [ 9.209573] [<8001400c>] (dump_backtrace) from [<800141f8>] (show_stack+0x18/0x1c) [ 9.217173] r6:80b2d960 r5:80b2d960 r4:00000000 r3:00000000 [ 9.222916] [<800141e0>] (show_stack) from [<807ceeb8>] (dump_stack+0x8c/0xa4) [ 9.230177] [<807cee2c>] (dump_stack) from [<80474730>] (ubi_io_read+0x130/0x378) [ 9.237688] r6:00000000 r5:0003e000 r4:ffffffb6 r3:9e468000 [ 9.243429] [<80474600>] (ubi_io_read) from [<80471564>] (ubi_eba_read_leb+0x19c/0x4c8) [ 9.251458] r10:00000000 r9:a0c4c000 r8:00000b22 r7:00000001 r6:00000000 r5:9dd8a000 [ 9.259398] r4:9dd9e000 [ 9.261962] [<804713c8>] (ubi_eba_read_leb) from [<8047077c>] (ubi_leb_read+0x94/0x130) [ 9.269988] r10:00000000 r9:a0c4c000 r8:9dd8a000 r7:00000000 r6:00000001 r5:0003e000 [ 9.277925] r4:9dd9e000 [ 9.280495] [<804706e8>] (ubi_leb_read) from [<80257870>] (ubifs_leb_read+0x30/0x9c) [ 9.288264] r10:9e673000 r9:9dd7bf80 r8:00000001 r7:00000000 r6:0003e000 r5:9e673000 [ 9.296182] r4:9e673000 [ 9.298768] [<80257840>] (ubifs_leb_read) from [<8027569c>] (get_master_node+0x54/0x324) [ 9.306880] r8:9e673000 r7:80b77ff8 r6:00001000 r5:00001000 r4:9e673000 [ 9.313681] [<80275648>] (get_master_node) from [<80275d98>] (ubifs_recover_master_node+0x60/0x3ec) [ 9.322749] r10:9e673008 r9:9dd7bf80 r8:9e673000 r7:80b77ff8 r6:00001000 r5:9dda1800 [ 9.330685] r4:9e673000 [ 9.333248] [<80275d38>] (ubifs_recover_master_node) from [<8026080c>] (ubifs_read_master+0xb7c/0xd78) [ 9.342575] r7:00000001 r6:80b76870 r5:9dda1800 r4:ffffff8b [ 9.348335] [<8025fc90>] (ubifs_read_master) from [<80254714>] (ubifs_mount+0x112c/0x2170) [ 9.356621] r10:9e673008 r9:9dd7bf80 r8:9e673b38 r7:9e673000 r6:80b76870 r5:9dda1800 [ 9.364540] r4:00000001 [ 9.367127] [<802535e8>] (ubifs_mount) from [<801055f8>] (mount_fs+0x1c/0xb0) [ 9.374266] r10:00008001 r9:00000000 r8:80b2a2e0 r7:80124e34 r6:80b2a2e0 r5:9dd73c80 [ 9.382204] r4:9dd76500 [ 9.384772] [<801055dc>] (mount_fs) from [<80121868>] (vfs_kern_mount+0x5c/0x144) [ 9.392276] r6:00008001 r5:9dd73c80 r4:9dd76500 [ 9.396974] [<8012180c>] (vfs_kern_mount) from [<80124e34>] (do_mount+0x19c/0xd0c) [ 9.404547] r9:80b2a2e0 r8:80b1ebdc r7:9e7b5d00 r6:9dd73c80 r5:00000060 r4:00000000 [ 9.412404] [<80124c98>] (do_mount) from [<80125d38>] (SyS_mount+0x7c/0xa8) [ 9.419386] r10:000000d7 r9:80afa7b0 r8:8099d910 r7:00008001 r6:00000000 r5:9dd73c80 [ 9.427320] r4:9e7b5d00 [ 9.429887] [<80125cbc>] (SyS_mount) from [<80aa6260>] (mount_block_root+0x114/0x240) [ 9.437743] r8:9efacd20 r7:80afa7a0 r6:9de69000 r5:9de69000 r4:00008001 [ 9.444539] [<80aa614c>] (mount_block_root) from [<80aa664c>] (prepare_namespace+0x98/0x1d0) [ 9.452999] r10:000000d7 r9:80afa780 r8:80b087e4 r7:80afa778 r6:80afa7a0 r5:80afa7a0 [ 9.460933] r4:80afa7b0 [ 9.463494] [<80aa65b4>] (prepare_namespace) from [<80aa5efc>] (kernel_init_freeable+0x1a4/0x1f0) [ 9.472386] r6:80b8b000 r5:80b8b000 r4:00000007 [ 9.477081] [<80aa5d58>] (kernel_init_freeable) from [<807c91b4>] (kernel_init+0x10/0xf4) [ 9.485262] r10:00000000 r9:00000000 r8:00000000 r7:00000000 r6:00000000 r5:807c91a4 [ 9.493196] r4:00000000 [ 9.495758] [<807c91a4>] (kernel_init) from [<800104b8>] (ret_from_fork+0x14/0x3c) [ 9.503349] r4:00000000 r3:00000000 [ 9.507115] UBIFS error (ubi0:0 pid 1): ubifs_recover_master_node: failed to recover master node ^^^ master node recovery fails ... [ 9.618335] Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(0,0) After added Huang Shijie's 'mtd: gpmi: fix the bitflips for erased page' patch from http://patchwork.ozlabs.org/patch/309763/ which shows the following on board2: [ 8.605463] gpmi-nand 112000.gpmi-nand: The page(186809) is an erased page(7,6,1,0). [ 8.628012] gpmi-nand 112000.gpmi-nand: The page(186809) is an erased page(7,6,1,0). ^^^ 1 bit-flip on ecc read, 0 bit-flip on non-ecc read, ecc-str=6, memset's buffer to 0xff to attempt to recover [ 8.637305] UBIFS error (ubi0:0 pid 1): ubifs_recover_master_node: failed to recover master node ^^^ unable to recover and always ends up this way over further boots >> If I understand what you are saying you are thinking that my issue is >> instead the result of a never-used PEB that had bit-flips from the >> manufacturer in which case the bits would read the same every time? >> How can we know this PEB was never before used and isn't one that was >> being erased/written during a power cut? > > I've seen bit flips on cheap SLC NANDs which came out of a sudden. > According to the FAE I was talking to this is legit for NAND > as long the flipping bits are fixable by the ECC engine. > I'm a little confused by the fact that gpmi-nand.c encounters an uncorrectable status yet when the bit-flips are counted by Huang's 'gmpi_erased_check' they come out to 1 which is less than the ecc str. Is this saying that ecc already corrected what it could and there was 'still' 1 more flip? >> In my test scenario where the rootfs is mounted from the kernel >> read-only, but later mounted read-write by userspace (yet not being >> specifically written to by userspace) then power-cut should 'any' NAND >> writes would be occurring at all? And if not as I suspect, then how >> could a subsequent boot end up using a PEB that may have been never >> previously used and have bit-flips from the manufacturer? > > UBIFS's has a wandering journal. During the remount it moved maybe. > But for a more expressive analysis I'd need a nanddump to find out which > blocks are in which role. > Can you share the nanddump? > This makes sense, and I apparently lied when I said the test software wasn't writing to the rootfs. There is at least one process that does write: the dhcp client when it writes to /etc/resolv.con and possibly something else writes as well. So we are power-cutting an an un-clean ubifs and never allowing a safe shutdown so I would expect the journal to be wandering around. >> Should we be doing an erase block on every NAND block during our board >> manufacturing process to avoid this? > > Sorry, I don't understand this sentence. > Do you mean a full erasure of the whole NAND? > If so, it would not help as the bit flips can come later. > (Without writing/erasing the block) > The root cause is that your NFC cannot correct bit flips on empty pages. > >> It sounds like this 'unexpected bit-flips on erased pages from the >> mfg' issue is a ticking time-bomb for people using ubi/ubifs NAND. >> Shouldn't the http://www.linux-mtd.infradead.org/doc/ubifs.html page >> be updated to refer to this known issue as well as the unstable bit >> issue? > > As I said the root cause is that some NFCs cannot correct bit flips on empty > pages. > Instead of putting warnings to ubifs.html I'd love to see a solution on the > said drivers or MTD core. > >> I can add some debugging to find out - what specifically would be >> helpful to add? > > A hexdump of the buffer would be a good start. > >> Thanks for the help! > > Thanks for sharing your issues. This is the only way > to address them. > That said, as far on no board I had access to I was able to reproduce the unstable bits > issue. It was always something else. > > Thanks, > //richard I can provide dumps of various blocks if needed or add more debugging on the reads after the corruption occurred. It takes me a day or two to re-create the issue if you want me to turn on more debugging during the writes. I've cc'd Huang, Elie, and Brian who were involved in the patch to detect bit-flips in gpmi-nand.c reads - perhaps they have some more ideas. I find it interesting that in one case that patch resolves the issue and in the other it does not. Regards, Tim ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: UBIFS corruption after power cut - possibly unstable bits issue? 2015-11-02 20:27 ` Tim Harvey @ 2015-11-02 20:31 ` Tim Harvey 2015-11-02 21:31 ` Richard Weinberger 2015-11-03 13:38 ` Boris Brezillon 2015-11-03 9:10 ` Artem Bityutskiy 1 sibling, 2 replies; 18+ messages in thread From: Tim Harvey @ 2015-11-02 20:31 UTC (permalink / raw) To: Richard Weinberger Cc: Artem Bityutskiy, Adrian Hunter, linux-mtd, Elie De Brauwer, Brian Norris, Huang Shijie On Mon, Nov 2, 2015 at 12:27 PM, Tim Harvey <tharvey@gateworks.com> wrote: > On Tue, Oct 27, 2015 at 12:52 PM, Richard Weinberger <richard@nod.at> wrote: >> Tim, >> >> Am 27.10.2015 um 20:01 schrieb Tim Harvey: >>> I'm not understanding what is making you say that the issue I >>> encountered is 'not' the unstable bits issue described at >>> http://www.linux-mtd.infradead.org/doc/ubifs.html#L_unstable_bits? My >>> understanding is that the 'unstable bit' issue refers to bits which >>> are truly unstable and can read either way each and every read due to >>> not getting properly erased/written. >> >> You are right. I was sorting out the unstable bits issue a bit too >> early. I'm sorry. >> Let's double check. Can you enable UBI verbose logging while testing? >> Such that we can see which blocks were written/erased while the power cut >> happened? >> > > Richard, > > I have 2 IMX6 based boards that I've recreated a UBI/UBIFS NAND > corruption using Linux 4.2 with dynamic debug enabled for > ubi_io_write. This particular test cycle consisted of 5 different > boards and had over 40,000 boot/power-cuts before corruption occurred. > In each case the corruption is in a slightly different area. > > The boards remain in a corrupted state which will allow me to get > further debug info. > > Board1 looks like this: > > [ 0.000000] Linux version 4.2.0-00015-g64291f7 (tharvey@tharvey) > (gcc version 4.8.3 (OpenWrt/Linaro GCC 4.8-2014.04 r43990) ) #195 SMP > Thu Oct 29 14:27:36 PDT 2015 > [ 0.000000] CPU: ARMv7 Processor [412fc09a] revision 10 (ARMv7), cr=10c5387d > [ 0.000000] CPU: PIPT / VIPT nonaliasing data cache, VIPT aliasing > instruction cache > [ 0.000000] Machine model: Gateworks Ventana i.MX6 Dual/Quad GW52XX > ... > [ 0.000000] Kernel command line: console=ttymxc1,115200 > root=ubi0:rootfs ubi.mtd=2 rootfstype=ubifs dyndbg="func ubi_io_write > +p" debug > ... > [ 2.045065] nand: device found, Manufacturer ID: 0x2c, Chip ID: 0xd5 > [ 2.051477] nand: Micron MT29F16G08ADACAH4 > [ 2.055590] nand: 2048 MiB, SLC, erase size: 256 KiB, page size: > 4096, OOB size: 224 > [ 2.087400] gpmi-nand 112000.gpmi-nand: enable the asynchronous EDO mode 5 > [ 2.094321] Scanning device for bad blocks > ... > [ 3.704946] 3 ofpart partitions found on MTD device gpmi-nand > [ 3.710735] Creating 3 MTD partitions on "gpmi-nand": > [ 3.715823] 0x000000000000-0x000001000000 : "uboot" > [ 3.725088] 0x000001000000-0x000001100000 : "env" > [ 3.733000] 0x000001100000-0x000080000000 : "rootfs" > [ 3.741880] gpmi-nand 112000.gpmi-nand: driver registered. > ... > [ 4.365104] ubi0: attaching mtd2 > [ 8.418862] ubi0: scanning is finished > [ 8.445084] ubi0: attached mtd2 (name "rootfs", size 2031 MiB) > [ 8.450977] ubi0: PEB size: 262144 bytes (256 KiB), LEB size: 253952 bytes > [ 8.457864] ubi0: min./max. I/O unit sizes: 4096/4096, sub-page size 4096 > [ 8.464676] ubi0: VID header offset: 4096 (aligned 4096), data offset: 8192 > [ 8.471666] ubi0: good PEBs: 8124, bad PEBs: 0, corrupted PEBs: 0 > [ 8.477767] ubi0: user volume: 1, internal volumes: 1, max. volumes > count: 128 > [ 8.485015] ubi0: max/mean erase counter: 7/3, WL threshold: 4096, > image sequence number: 42283325 > [ 8.494000] ubi0: available PEBs: 0, total reserved PEBs: 8124, > PEBs reserved for bad PEB handling: 160 > [ 8.503610] ubi0: background thread "ubi_bgt0d" started, PID 81 > ... > [ 8.615672] UBIFS (ubi0:0): recovery needed > [ 8.657083] UBIFS (ubi0:0): recovery deferred > [ 8.661965] UBIFS (ubi0:0): UBIFS: mounted UBI device 0, volume 0, > name "rootfs", R/O mode > [ 8.670290] UBIFS (ubi0:0): LEB size: 253952 bytes (248 KiB), > min./max. I/O unit sizes: 4096 bytes/4096 bytes > [ 8.680248] UBIFS (ubi0:0): FS size: 2018918400 bytes (1925 MiB, > 7950 LEBs), journal size 9404416 bytes (8 MiB, 38 LEBs) > [ 8.691145] UBIFS (ubi0:0): reserved for root: 0 bytes (0 KiB) > [ 8.697001] UBIFS (ubi0:0): media format: w4/r0 (latest is w4/r0), > UUID BB5E5EB5-6A21-48AA-9C31-DE6B4A32ABAD, small LPT model > [ 8.709784] VFS: Mounted root (ubifs filesystem) readonly on device 0:14. > [ 8.719451] devtmpfs: mounted > [ 8.723002] Freeing unused kernel memory: 420K (80aa5000 - 80b0e000) > INIT: version 2.88 booting > Starting udev > [ 9.548551] udevd[114]: starting version 182 > [ 10.018775] UBIFS (ubi0:0): completing deferred recovery > [ 10.043624] UBI DBG io (pid 145): write 4096 bytes to PEB 2253:4096 > [ 10.054878] UBI DBG io (pid 145): write 94208 bytes to PEB 2253:8192 > [ 10.096644] UBI DBG io (pid 81): write 4096 bytes to PEB 2250:0 > [ 10.143314] UBI DBG io (pid 145): write 4096 bytes to PEB 2254:4096 > [ 10.153917] UBI DBG io (pid 145): write 184320 bytes to PEB 2254:8192 > ^^^^^ PEB 2254 is the one that becomes un-readable on the next boot > > [ 10.186288] UBI DBG io (pid 81): write 4096 bytes to PEB 2251:0 > [ 10.204620] UBIFS (ubi0:0): background thread "ubifs_bgt0_0" started, PID 173 > [ 10.213375] UBI DBG io (pid 145): write 4096 bytes to PEB 2255:4096 > [ 10.223864] UBI DBG io (pid 145): write 4096 bytes to PEB 2255:8192 > [ 10.239935] UBI DBG io (pid 145): write 4096 bytes to PEB 2237:32768 > [ 10.261787] UBI DBG io (pid 145): write 4096 bytes to PEB 2184:102400 > [ 10.272978] UBI DBG io (pid 145): write 4096 bytes to PEB 2173:114688 > [ 10.285948] UBI DBG io (pid 145): write 4096 bytes to PEB 2174:114688 > [ 10.296027] UBIFS (ubi0:0): deferred recovery completed > [ 10.299453] UBI DBG io (pid 81): write 4096 bytes to PEB 2252:0 > bootlogd: cannot allocate pseudo tty: No such file or directory > ALSA: Restoring mixer settings... > Mon Oct 26 16:37:16 UTC 2015 > INIT: Entering runlevel: 5 > Configuring network interfaces... [ 11.471755] fec 2188000.ethernet > eth0: Freescale FEC PHY driver [Generic PHY] > (mii_bus:phy_addr=2188000.ethernet:00, irq=-1) > [ 11.483499] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready > udhcpc (v1.23.1) started > Sending discover... > Sending discover... > [ 15.000462] UBI DBG io (pid 173): write 4096 bytes to PEB 2253:102400 > [ 16.472720] fec 2188000.ethernet eth0: Link is Up - 1Gbps/Full - > flow control rx/tx > [ 16.480975] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready > Sending discover... > Sending select for 192.168.1.1... > Lease of 192.168.1.1 obtained, lease time 21600 > /etc/udhcpc.d/50default: Adding DNS 192.168.1.100 > done. > > <power cut> > > [ 0.000000] Booting Linux on physical CPU 0x0 > [ 0.000000] Linux version 4.2.0-00015-g64291f7 (tharvey@tharvey) > (gcc version 4.8.3 (OpenWrt/Linaro GCC 4.8-2014.04 r43990) ) #195 SMP > Thu Oct 29 14:27:36 PDT 2015 > [ 0.000000] CPU: ARMv7 Processor [412fc09a] revision 10 (ARMv7), cr=10c5387d > [ 0.000000] CPU: PIPT / VIPT nonaliasing data cache, VIPT aliasing > instruction cache > [ 0.000000] Machine model: Gateworks Ventana i.MX6 Dual/Quad GW52XX > ... > [ 2.044936] nand: device found, Manufacturer ID: 0x2c, Chip ID: 0xd5 > [ 2.051348] nand: Micron MT29F16G08ADACAH4 > [ 2.055461] nand: 2048 MiB, SLC, erase size: 256 KiB, page size: > 4096, OOB size: 224 > [ 2.087272] gpmi-nand 112000.gpmi-nand: enable the asynchronous EDO mode 5 > [ 2.094196] Scanning device for bad blocks > ... > [ 3.710706] Creating 3 MTD partitions on "gpmi-nand": > [ 3.715793] 0x000000000000-0x000001000000 : "uboot" > [ 3.725027] 0x000001000000-0x000001100000 : "env" > [ 3.732929] 0x000001100000-0x000080000000 : "rootfs" > [ 3.741781] gpmi-nand 112000.gpmi-nand: driver registered. > ... > [ 4.366226] ubi0: attaching mtd2 > [ 8.445733] ubi0: scanning is finished > [ 8.477551] ubi0: attached mtd2 (name "rootfs", size 2031 MiB) > [ 8.483443] ubi0: PEB size: 262144 bytes (256 KiB), LEB size: 253952 bytes > [ 8.490355] ubi0: min./max. I/O unit sizes: 4096/4096, sub-page size 4096 > [ 8.497155] ubi0: VID header offset: 4096 (aligned 4096), data offset: 8192 > [ 8.504148] ubi0: good PEBs: 8124, bad PEBs: 0, corrupted PEBs: 0 > [ 8.510274] ubi0: user volume: 1, internal volumes: 1, max. volumes > count: 128 > [ 8.517510] ubi0: max/mean erase counter: 7/3, WL threshold: 4096, > image sequence number: 42283325 > [ 8.526498] ubi0: available PEBs: 0, total reserved PEBs: 8124, > PEBs reserved for bad PEB handling: 160 > [ 8.536151] ubi0: background thread "ubi_bgt0d" started, PID 82 > ... > [ 8.635364] UBIFS (ubi0:0): recovery needed > [ 8.676203] ubi0 warning: ubi_io_read: error -74 (ECC error) while > reading 69632 bytes from PEB 2254:192512, read only 69632 bytes, retry > [ 8.692460] ubi0 warning: ubi_io_read: error -74 (ECC error) while > reading 69632 bytes from PEB 2254:192512, read only 69632 bytes, retry > [ 8.708741] ubi0 warning: ubi_io_read: error -74 (ECC error) while > reading 69632 bytes from PEB 2254:192512, read only 69632 bytes, retry > ^^^^ non correctable ecc error on PEB 2254 - I verified that this was > not the first time this PEB has been used > [ 8.724987] ubi0 error: ubi_io_read: error -74 (ECC error) while > reading 69632 bytes from PEB 2254:192512, read 69632 bytes > [ 8.736163] CPU: 0 PID: 1 Comm: swapper/0 Not tainted > 4.2.0-00015-g64291f7 dirty #195 > [ 8.744021] Hardware name: Freescale i.MX6 Quad/DualLite (Device Tree) > [ 8.750582] Backtrace: > [ 8.753074] [<8001400c>] (dump_backtrace) from [<800141f8>] > (show_stack+0x18/0x1c) > [ 8.760667] r6:80b2d960 r5:80b2d960 r4:00000000 r3:00000000 > [ 8.766424] [<800141e0>] (show_stack) from [<807ceeb8>] > (dump_stack+0x8c/0xa4) > [ 8.773687] [<807cee2c>] (dump_stack) from [<80474730>] > (ubi_io_read+0x130/0x378) > [ 8.781212] r6:00000000 r5:00011000 r4:ffffffb6 r3:be070000 > [ 8.786951] [<80474600>] (ubi_io_read) from [<80471564>] > (ubi_eba_read_leb+0x19c/0x4c8) > [ 8.794975] r10:00000000 r9:c0c63000 r8:000008ce r7:00001f15 > r6:00000000 r5:bda3a000 > [ 8.802928] r4:bda12000 > [ 8.805489] [<804713c8>] (ubi_eba_read_leb) from [<8047077c>] > (ubi_leb_read+0x94/0x130) > [ 8.813527] r10:00000000 r9:c0c63000 r8:bda3a000 r7:0002d000 > r6:00001f15 r5:00011000 > [ 8.821467] r4:bda12000 > [ 8.824037] [<804706e8>] (ubi_leb_read) from [<80257870>] > (ubifs_leb_read+0x30/0x9c) > [ 8.831815] r10:be29a000 r9:80b77ff8 r8:00001f15 r7:0002d000 > r6:00011000 r5:be29a000 > [ 8.839731] r4:bdaea600 > [ 8.842312] [<80257840>] (ubifs_leb_read) from [<80260fac>] > (ubifs_start_scan+0x84/0x134) > [ 8.850522] r8:c0c36000 r7:00001f15 r6:be29a000 r5:0002d000 r4:bdaea600 > [ 8.857322] [<80260f28>] (ubifs_start_scan) from [<80276270>] > (ubifs_recover_leb+0x80/0xb80) > [ 8.865794] r8:00001f15 r7:c0c63000 r6:00011000 r5:00001f15 r4:00000001 > [ 8.872609] [<802761f0>] (ubifs_recover_leb) from [<802627f0>] > (ubifs_replay_journal+0xd34/0x1a48) > [ 8.881601] r10:00017188 r9:0002d000 r8:be29a000 r7:00000000 > r6:bdaea608 r5:00001f15 > [ 8.889518] r4:00000001 > [ 8.892106] [<80261abc>] (ubifs_replay_journal) from [<80254954>] > (ubifs_mount+0x136c/0x2170) > [ 8.900666] r10:be29a008 r9:00000000 r8:0003db70 r7:be29a000 > r6:80b76870 r5:be3f6000 > [ 8.908584] r4:00000000 > [ 8.911171] [<802535e8>] (ubifs_mount) from [<801055f8>] (mount_fs+0x1c/0xb0) > [ 8.918323] r10:00008001 r9:00000000 r8:80b2a2e0 r7:80124e34 > r6:80b2a2e0 r5:bd9cfac0 > [ 8.926258] r4:be365900 > [ 8.928829] [<801055dc>] (mount_fs) from [<80121868>] > (vfs_kern_mount+0x5c/0x144) > [ 8.936355] r6:00008001 r5:bd9cfac0 r4:be365900 > [ 8.941056] [<8012180c>] (vfs_kern_mount) from [<80124e34>] > (do_mount+0x19c/0xd0c) > [ 8.948642] r9:80b2a2e0 r8:80b1ebdc r7:bd9cfa80 r6:bd9cfac0 > r5:00000060 r4:00000000 > [ 8.956498] [<80124c98>] (do_mount) from [<80125d38>] (SyS_mount+0x7c/0xa8) > [ 8.963493] r10:000000d7 r9:80afa7b0 r8:8099d910 r7:00008001 > r6:00000000 r5:bd9cfac0 > [ 8.971429] r4:bd9cfa80 > [ 8.973993] [<80125cbc>] (SyS_mount) from [<80aa6260>] > (mount_block_root+0x114/0x240) > [ 8.981856] r8:befa3a40 r7:80afa7a0 r6:bd9d2000 r5:bd9d2000 r4:00008001 > [ 8.988651] [<80aa614c>] (mount_block_root) from [<80aa664c>] > (prepare_namespace+0x98/0x1d0) > [ 8.997123] r10:000000d7 r9:80afa780 r8:80b087e4 r7:80afa778 > r6:80afa7a0 r5:80afa7a0 > [ 9.005058] r4:80afa7b0 > [ 9.007617] [<80aa65b4>] (prepare_namespace) from [<80aa5efc>] > (kernel_init_freeable+0x1a4/0x1f0) > [ 9.016522] r6:80b8b000 r5:80b8b000 r4:00000007 > [ 9.021217] [<80aa5d58>] (kernel_init_freeable) from [<807c91b4>] > (kernel_init+0x10/0xf4) > [ 9.029412] r10:00000000 r9:00000000 r8:00000000 r7:00000000 > r6:00000000 r5:807c91a4 > [ 9.037354] r4:00000000 > [ 9.039918] [<807c91a4>] (kernel_init) from [<800104b8>] > (ret_from_fork+0x14/0x3c) > [ 9.047525] r4:00000000 r3:00000000 > [ 9.051259] UBIFS error (ubi0:0 pid 1): ubifs_recover_leb: corrupt > empty space LEB 7957:184320, corruption starts at 4665 > [ 9.062247] UBIFS error (ubi0:0 pid 1): ubifs_scanned_corruption: > corruption at LEB 7957:188985 > [ 9.070972] UBIFS error (ubi0:0 pid 1): ubifs_scanned_corruption: > first 8192 bytes from LEB 7957:188985 > [ 9.080396] 00000000: ffffffef ffffffff ffffffff ffffffff ffffffff > ffffffff ffffffff ffffffff ................................ > [ 9.091903] 00000020: ffffffff ffffffff ffffffff ffffffff ffffffff > ffffffff ffffffff ffffffff ................................ > ... > [ 12.002997] 00001fc0: ffffffff ffffffff ffffffff ffffffff ffffffff > ffffffff ffffffff ffffffff ................................ > [ 12.014501] 00001fe0: ffffffff ffffffff ffffffff ffffffff ffffffff > ffffffff ffffffff ffffffff ................................ > [ 12.026000] UBIFS error (ubi0:0 pid 1): ubifs_recover_leb: LEB 7957 > scanning failed > ^^^^ recovery failed > [ 12.036886] VFS: Cannot open root device "ubi0:rootfs" or > unknown-block(0,0): error -117 > > After added Huang Shijie's 'mtd: gpmi: fix the bitflips for erased > page' patch from http://patchwork.ozlabs.org/patch/309763/ which shows > the following on board1: > [ 8.532624] UBIFS (ubi0:0): recovery needed > [ 8.569104] gpmi-nand 112000.gpmi-nand: The page(148656) is an > erased page(1,6,1,1). > [ 8.577748] gpmi-nand 112000.gpmi-nand: The page(148658) is an > erased page(1,6,1,1). > [ 8.586413] gpmi-nand 112000.gpmi-nand: The page(148660) is an > erased page(1,6,1,1). > [ 8.595050] gpmi-nand 112000.gpmi-nand: The page(148662) is an > erased page(1,6,1,1). > [ 8.603705] gpmi-nand 112000.gpmi-nand: The page(148664) is an > erased page(1,6,1,1). > [ 8.612310] gpmi-nand 112000.gpmi-nand: The page(148666) is an > erased page(1,6,1,1). > [ 8.620940] gpmi-nand 112000.gpmi-nand: The page(148668) is an > erased page(1,6,1,1). > [ 8.629580] gpmi-nand 112000.gpmi-nand: The page(148670) is an > erased page(1,6,1,1). > ^^^^ 1 bit-flip on ecc read, 1 bit-flip on non-ecc read, ecc-str=6, > memset's buffer to 0xff to attempt to recover > [ 8.644345] UBIFS (ubi0:0): recovery deferred > [ 8.649209] UBIFS (ubi0:0): UBIFS: mounted UBI device 0, volume 0, > name "rootfs", R/O mode > [ 8.657537] UBIFS (ubi0:0): LEB size: 253952 bytes (248 KiB), > min./max. I/O unit sizes: 4096 bytes/4096 bytes > [ 8.667486] UBIFS (ubi0:0): FS size: 2018918400 bytes (1925 MiB, > 7950 LEBs), journal size 9404416 bytes (8 MiB, 38 LEBs) > [ 8.678409] UBIFS (ubi0:0): reserved for root: 0 bytes (0 KiB) > [ 8.684283] UBIFS (ubi0:0): media format: w4/r0 (latest is w4/r0), > UUID BB5E5EB5-6A21-48AA-9C31-DE6B4A32ABAD, small LPT model > [ 8.697990] VFS: Mounted root (ubifs filesystem) readonly on device 0:14. > ^^^^ able to recover and always ends up this way over further boots > ^^^^ after applying this patch I boot with init=foo so as to not allow > rootfs to get mounted read/write by udev and perhaps permanently > recover the board > > > > Board2 looks like this: > [ 0.000000] Booting Linux on physical CPU 0x0 > [ 0.000000] Linux version 4.2.0-00015-g64291f7 (tharvey@tharvey) > (gcc version 4.8.3 (OpenWrt/Linaro GCC 4.8-2014.04 r43990) ) #195 SMP > Thu Oct 29 14:27:36 PDT 2015 > [ 0.000000] CPU: ARMv7 Processor [412fc09a] revision 10 (ARMv7), cr=10c5387d > [ 0.000000] CPU: PIPT / VIPT nonaliasing data cache, VIPT aliasing > instruction cache > [ 0.000000] Machine model: Gateworks Ventana i.MX6 Dual/Quad GW552X > ... > [ 0.000000] Kernel command line: console=ttymxc1,115200 > root=ubi0:rootfs ubi.mtd=2 rootfstype=ubifs dyndbg="func ubi_io_write > +p" debug > ... > [ 2.341988] nand: device found, Manufacturer ID: 0x2c, Chip ID: 0xd5 > [ 2.348382] nand: Micron MT29F16G08ADACAH4 > [ 2.352495] nand: 2048 MiB, SLC, erase size: 256 KiB, page size: > 4096, OOB size: 224 > [ 2.384301] gpmi-nand 112000.gpmi-nand: enable the asynchronous EDO mode 5 > [ 2.391224] Scanning device for bad blocks > ... > [ 4.024472] 3 ofpart partitions found on MTD device gpmi-nand > [ 4.030265] Creating 3 MTD partitions on "gpmi-nand": > [ 4.035336] 0x000000000000-0x000001000000 : "uboot" > [ 4.044635] 0x000001000000-0x000001100000 : "env" > [ 4.052559] 0x000001100000-0x000080000000 : "rootfs" > [ 4.061452] gpmi-nand 112000.gpmi-nand: driver registered. > ... > [ 4.426420] ubi0: attaching mtd2 > [ 8.514755] ubi0: scanning is finished > [ 8.541225] ubi0: attached mtd2 (name "rootfs", size 2031 MiB) > [ 8.547108] ubi0: PEB size: 262144 bytes (256 KiB), LEB size: 253952 bytes > [ 8.553994] ubi0: min./max. I/O unit sizes: 4096/4096, sub-page size 4096 > [ 8.560810] ubi0: VID header offset: 4096 (aligned 4096), data offset: 8192 > [ 8.567799] ubi0: good PEBs: 8124, bad PEBs: 0, corrupted PEBs: 0 > [ 8.573901] ubi0: user volume: 1, internal volumes: 1, max. volumes > count: 128 > [ 8.581161] ubi0: max/mean erase counter: 4/2, WL threshold: 4096, > image sequence number: 42283325 > [ 8.590148] ubi0: available PEBs: 0, total reserved PEBs: 8124, > PEBs reserved for bad PEB handling: 160 > [ 8.599737] ubi0: background thread "ubi_bgt0d" started, PID 74 > ... > [ 8.712166] UBIFS (ubi0:0): recovery needed > [ 8.749882] UBIFS (ubi0:0): recovery deferred > [ 8.754639] UBIFS (ubi0:0): UBIFS: mounted UBI device 0, volume 0, > name "rootfs", R/O mode > [ 8.762941] UBIFS (ubi0:0): LEB size: 253952 bytes (248 KiB), > min./max. I/O unit sizes: 4096 bytes/4096 bytes > [ 8.772883] UBIFS (ubi0:0): FS size: 2018918400 bytes (1925 MiB, > 7950 LEBs), journal size 9404416 bytes (8 MiB, 38 LEBs) > [ 8.783773] UBIFS (ubi0:0): reserved for root: 0 bytes (0 KiB) > [ 8.789631] UBIFS (ubi0:0): media format: w4/r0 (latest is w4/r0), > UUID BB5E5EB5-6A21-48AA-9C31-DE6B4A32ABAD, small LPT model > [ 8.803366] VFS: Mounted root (ubifs filesystem) readonly on device 0:14. > [ 8.813172] devtmpfs: mounted > [ 8.816746] Freeing unused kernel memory: 420K (80aa5000 - 80b0e000) > INIT: version 2.88 booting > Starting udev > [ 9.621167] udevd[106]: starting version 182 > [ 10.525521] UBIFS (ubi0:0): completing deferred recovery > [ 10.540072] UBI DBG io (pid 162): write 4096 bytes to PEB 5891:4096 > [ 10.547480] UBI DBG io (pid 162): write 110592 bytes to PEB 5891:8192 > [ 10.578168] UBI DBG io (pid 162): write 4096 bytes to PEB 5892:4096 > [ 10.585053] UBI DBG io (pid 74): write 4096 bytes to PEB 2944:0 > [ 10.585639] UBI DBG io (pid 162): write 163840 bytes to PEB 5892:8192 > [ 10.607478] UBI DBG io (pid 74): write 4096 bytes to PEB 2945:0 > [ 10.615485] UBIFS (ubi0:0): background thread "ubifs_bgt0_0" started, PID 163 > [ 10.617094] UBI DBG io (pid 162): write 4096 bytes to PEB 5893:4096 > [ 10.618542] UBI DBG io (pid 162): write 4096 bytes to PEB 5893:8192 > [ 10.619487] UBI DBG io (pid 162): write 4096 bytes to PEB 2859:212992 > [ 10.620347] UBI DBG io (pid 162): write 4096 bytes to PEB 2850:225280 > ^^^^^ this is the PEB that gets uncorrectable ECC errors - I verified > that this was not the first time this PEB has been used > [ 10.621476] UBI DBG io (pid 162): write 4096 bytes to PEB 2851:225280 > [ 10.623058] UBIFS (ubi0:0): deferred recovery completed > [ 10.628641] UBI DBG io (pid 74): write 4096 bytes to PEB 2946:0 > bootlogd: cannot allocate pseudo tty: No such file or directory > Populating dev cache > ALSA: Restoring mixer settings... > /usr/sbin/alsactl: load_state:1729: No soundcards found... > Sat Oct 24 11:28:04 UTC 2015 > INIT: Entering runlevel: 5 > Configuring network interfaces... ifconfig: SIOCGIFFLAGS: No such device > Starting system message bus: dbus. > Starting Dropbear SSH server: dropbear. > Starting rpcbind daemon...done. > Starting watchdog: [1 > 34m[ [1 > 32m OK [1 > 34m] [0 > 39m > Starting advanced power management daemon: No APM support in kernel > (failed.) > starting DNS forwarder and DHCP server: dnsmasq... done. > > <power cut> > > [ 0.000000] Booting Linux on physical CPU 0x0 > [ 0.000000] Linux version 4.2.0-00015-g64291f7 (tharvey@tharvey) > (gcc version 4.8.3 (OpenWrt/Linaro GCC 4.8-2014.04 r43990) ) #195 SMP > Thu Oct 29 14:27:36 PDT 2015 > ... > [ 2.332131] nand: device found, Manufacturer ID: 0x2c, Chip ID: 0xd5 > [ 2.338524] nand: Micron MT29F16G08ADACAH4 > [ 2.342638] nand: 2048 MiB, SLC, erase size: 256 KiB, page size: > 4096, OOB size: 224 > [ 2.374442] gpmi-nand 112000.gpmi-nand: enable the asynchronous EDO mode 5 > [ 2.381366] Scanning device for bad blocks > ... > [ 4.013255] 3 ofpart partitions found on MTD device gpmi-nand > [ 4.019040] Creating 3 MTD partitions on "gpmi-nand": > [ 4.024111] 0x000000000000-0x000001000000 : "uboot" > [ 4.033352] 0x000001000000-0x000001100000 : "env" > [ 4.041210] 0x000001100000-0x000080000000 : "rootfs" > [ 4.050164] gpmi-nand 112000.gpmi-nand: driver registered. > ... > [ 4.416614] ubi0: attaching mtd2 > [ 8.500977] ubi0: scanning is finished > [ 8.527328] ubi0: attached mtd2 (name "rootfs", size 2031 MiB) > [ 8.533184] ubi0: PEB size: 262144 bytes (256 KiB), LEB size: 253952 bytes > [ 8.540089] ubi0: min./max. I/O unit sizes: 4096/4096, sub-page size 4096 > [ 8.546906] ubi0: VID header offset: 4096 (aligned 4096), data offset: 8192 > [ 8.553878] ubi0: good PEBs: 8124, bad PEBs: 0, corrupted PEBs: 0 > [ 8.559998] ubi0: user volume: 1, internal volumes: 1, max. volumes > count: 128 > [ 8.567248] ubi0: max/mean erase counter: 4/2, WL threshold: 4096, > image sequence number: 42283325 > [ 8.576216] ubi0: available PEBs: 0, total reserved PEBs: 8124, > PEBs reserved for bad PEB handling: 160 > [ 8.585813] ubi0: background thread "ubi_bgt0d" started, PID 73 > .. > [ 8.683337] ubi0 warning: ubi_io_read: error -74 (ECC error) while > reading 253952 bytes from PEB 2850:8192, read only 253952 bytes, retry > [ 8.710495] ubi0 warning: ubi_io_read: error -74 (ECC error) while > reading 253952 bytes from PEB 2850:8192, read only 253952 bytes, retry > [ 8.736821] ubi0 warning: ubi_io_read: error -74 (ECC error) while > reading 253952 bytes from PEB 2850:8192, read only 253952 bytes, retry > [ 8.763179] ubi0 error: ubi_io_read: error -74 (ECC error) while > reading 253952 bytes from PEB 2850:8192, read 253952 bytes > ^^^^ uncorrectable ECC error on PEB 2850 > [ 8.774340] CPU: 0 PID: 1 Comm: swapper/0 Not tainted > 4.2.0-00015-g64291f7 #195 > [ 8.782194] Hardware name: Freescale i.MX6 Quad/DualLite (Device Tree) > [ 8.788743] Backtrace: > [ 8.791237] [<8001400c>] (dump_backtrace) from [<800141f8>] > (show_stack+0x18/0x1c) > [ 8.798839] r6:80b2d960 r5:80b2d960 r4:00000000 r3:00000000 > [ 8.804589] [<800141e0>] (show_stack) from [<807ceeb8>] > (dump_stack+0x8c/0xa4) > [ 8.811855] [<807cee2c>] (dump_stack) from [<80474730>] > (ubi_io_read+0x130/0x378) > [ 8.819359] r6:00000000 r5:0003e000 r4:ffffffb6 r3:9e468000 > [ 8.825099] [<80474600>] (ubi_io_read) from [<80471564>] > (ubi_eba_read_leb+0x19c/0x4c8) > [ 8.833125] r10:00000000 r9:a0c0d000 r8:00000b22 r7:00000001 > r6:00000000 r5:9dd8a000 > [ 8.841065] r4:9dd9e000 > [ 8.843628] [<804713c8>] (ubi_eba_read_leb) from [<8047077c>] > (ubi_leb_read+0x94/0x130) > [ 8.851661] r10:00000000 r9:a0c0d000 r8:9dd8a000 r7:00000000 > r6:00000001 r5:0003e000 > [ 8.859596] r4:9dd9e000 > [ 8.862170] [<804706e8>] (ubi_leb_read) from [<80257870>] > (ubifs_leb_read+0x30/0x9c) > [ 8.869937] r10:9e673008 r9:00000001 r8:00000001 r7:00000000 > r6:0003e000 r5:9e673000 > [ 8.877873] r4:9de22240 > [ 8.880437] [<80257840>] (ubifs_leb_read) from [<80260fac>] > (ubifs_start_scan+0x84/0x134) > [ 8.888637] r8:a0c0d000 r7:00000001 r6:9e673000 r5:00000000 r4:9de22240 > [ 8.895433] [<80260f28>] (ubifs_start_scan) from [<802613ac>] > (ubifs_scan+0x30/0x3d4) > [ 8.903284] r8:00000001 r7:9e673000 r6:a0c0d000 r5:00000000 r4:0003e000 > [ 8.910097] [<8026137c>] (ubifs_scan) from [<8025fcd8>] > (ubifs_read_master+0x48/0xd78) > [ 8.918034] r10:9e673008 r9:9dd7bf80 r8:9e673000 r7:00000001 > r6:80b76870 r5:9dda1800 > [ 8.925952] r4:00000001 > [ 8.928533] [<8025fc90>] (ubifs_read_master) from [<80254714>] > (ubifs_mount+0x112c/0x2170) > [ 8.936819] r10:9e673008 r9:9dd7bf80 r8:9e673b38 r7:9e673000 > r6:80b76870 r5:9dda1800 > [ 8.944738] r4:00000001 > [ 8.947326] [<802535e8>] (ubifs_mount) from [<801055f8>] (mount_fs+0x1c/0xb0) > [ 8.954466] r10:00008001 r9:00000000 r8:80b2a2e0 r7:80124e34 > r6:80b2a2e0 r5:9dd73c80 > [ 8.962404] r4:9dd76500 > [ 8.964975] [<801055dc>] (mount_fs) from [<80121868>] > (vfs_kern_mount+0x5c/0x144) > [ 8.972479] r6:00008001 r5:9dd73c80 r4:9dd76500 > [ 8.977177] [<8012180c>] (vfs_kern_mount) from [<80124e34>] > (do_mount+0x19c/0xd0c) > [ 8.984750] r9:80b2a2e0 r8:80b1ebdc r7:9e7b5d00 r6:9dd73c80 > r5:00000060 r4:00000000 > [ 8.992605] [<80124c98>] (do_mount) from [<80125d38>] (SyS_mount+0x7c/0xa8) > [ 8.999588] r10:000000d7 r9:80afa7b0 r8:8099d910 r7:00008001 > r6:00000000 r5:9dd73c80 > [ 9.007522] r4:9e7b5d00 > [ 9.010087] [<80125cbc>] (SyS_mount) from [<80aa6260>] > (mount_block_root+0x114/0x240) > [ 9.017941] r8:9efacd20 r7:80afa7a0 r6:9de69000 r5:9de69000 r4:00008001 > [ 9.024736] [<80aa614c>] (mount_block_root) from [<80aa664c>] > (prepare_namespace+0x98/0x1d0) > [ 9.033195] r10:000000d7 r9:80afa780 r8:80b087e4 r7:80afa778 > r6:80afa7a0 r5:80afa7a0 > [ 9.041129] r4:80afa7b0 > [ 9.043690] [<80aa65b4>] (prepare_namespace) from [<80aa5efc>] > (kernel_init_freeable+0x1a4/0x1f0) > [ 9.052592] r6:80b8b000 r5:80b8b000 r4:00000007 > [ 9.057290] [<80aa5d58>] (kernel_init_freeable) from [<807c91b4>] > (kernel_init+0x10/0xf4) > [ 9.065472] r10:00000000 r9:00000000 r8:00000000 r7:00000000 > r6:00000000 r5:807c91a4 > [ 9.073407] r4:00000000 > [ 9.075970] [<807c91a4>] (kernel_init) from [<800104b8>] > (ret_from_fork+0x14/0x3c) > [ 9.083560] r4:00000000 r3:00000000 > [ 9.102369] ubi0 warning: ubi_io_read: error -74 (ECC error) while > reading 253952 bytes from PEB 2850:8192, read only 253952 bytes, retry > [ 9.128774] ubi0 warning: ubi_io_read: error -74 (ECC error) while > reading 253952 bytes from PEB 2850:8192, read only 253952 bytes, retry > [ 9.155184] ubi0 warning: ubi_io_read: error -74 (ECC error) while > reading 253952 bytes from PEB 2850:8192, read only 253952 bytes, retry > [ 9.181522] ubi0 error: ubi_io_read: error -74 (ECC error) while > reading 253952 bytes from PEB 2850:8192, read 253952 bytes > ^^^^ unrecoverable ECC error on PEB 2850 > [ 9.192683] CPU: 0 PID: 1 Comm: swapper/0 Not tainted > 4.2.0-00015-g64291f7 #195 > [ 9.200535] Hardware name: Freescale i.MX6 Quad/DualLite (Device Tree) > [ 9.207083] Backtrace: > [ 9.209573] [<8001400c>] (dump_backtrace) from [<800141f8>] > (show_stack+0x18/0x1c) > [ 9.217173] r6:80b2d960 r5:80b2d960 r4:00000000 r3:00000000 > [ 9.222916] [<800141e0>] (show_stack) from [<807ceeb8>] > (dump_stack+0x8c/0xa4) > [ 9.230177] [<807cee2c>] (dump_stack) from [<80474730>] > (ubi_io_read+0x130/0x378) > [ 9.237688] r6:00000000 r5:0003e000 r4:ffffffb6 r3:9e468000 > [ 9.243429] [<80474600>] (ubi_io_read) from [<80471564>] > (ubi_eba_read_leb+0x19c/0x4c8) > [ 9.251458] r10:00000000 r9:a0c4c000 r8:00000b22 r7:00000001 > r6:00000000 r5:9dd8a000 > [ 9.259398] r4:9dd9e000 > [ 9.261962] [<804713c8>] (ubi_eba_read_leb) from [<8047077c>] > (ubi_leb_read+0x94/0x130) > [ 9.269988] r10:00000000 r9:a0c4c000 r8:9dd8a000 r7:00000000 > r6:00000001 r5:0003e000 > [ 9.277925] r4:9dd9e000 > [ 9.280495] [<804706e8>] (ubi_leb_read) from [<80257870>] > (ubifs_leb_read+0x30/0x9c) > [ 9.288264] r10:9e673000 r9:9dd7bf80 r8:00000001 r7:00000000 > r6:0003e000 r5:9e673000 > [ 9.296182] r4:9e673000 > [ 9.298768] [<80257840>] (ubifs_leb_read) from [<8027569c>] > (get_master_node+0x54/0x324) > [ 9.306880] r8:9e673000 r7:80b77ff8 r6:00001000 r5:00001000 r4:9e673000 > [ 9.313681] [<80275648>] (get_master_node) from [<80275d98>] > (ubifs_recover_master_node+0x60/0x3ec) > [ 9.322749] r10:9e673008 r9:9dd7bf80 r8:9e673000 r7:80b77ff8 > r6:00001000 r5:9dda1800 > [ 9.330685] r4:9e673000 > [ 9.333248] [<80275d38>] (ubifs_recover_master_node) from > [<8026080c>] (ubifs_read_master+0xb7c/0xd78) > [ 9.342575] r7:00000001 r6:80b76870 r5:9dda1800 r4:ffffff8b > [ 9.348335] [<8025fc90>] (ubifs_read_master) from [<80254714>] > (ubifs_mount+0x112c/0x2170) > [ 9.356621] r10:9e673008 r9:9dd7bf80 r8:9e673b38 r7:9e673000 > r6:80b76870 r5:9dda1800 > [ 9.364540] r4:00000001 > [ 9.367127] [<802535e8>] (ubifs_mount) from [<801055f8>] (mount_fs+0x1c/0xb0) > [ 9.374266] r10:00008001 r9:00000000 r8:80b2a2e0 r7:80124e34 > r6:80b2a2e0 r5:9dd73c80 > [ 9.382204] r4:9dd76500 > [ 9.384772] [<801055dc>] (mount_fs) from [<80121868>] > (vfs_kern_mount+0x5c/0x144) > [ 9.392276] r6:00008001 r5:9dd73c80 r4:9dd76500 > [ 9.396974] [<8012180c>] (vfs_kern_mount) from [<80124e34>] > (do_mount+0x19c/0xd0c) > [ 9.404547] r9:80b2a2e0 r8:80b1ebdc r7:9e7b5d00 r6:9dd73c80 > r5:00000060 r4:00000000 > [ 9.412404] [<80124c98>] (do_mount) from [<80125d38>] (SyS_mount+0x7c/0xa8) > [ 9.419386] r10:000000d7 r9:80afa7b0 r8:8099d910 r7:00008001 > r6:00000000 r5:9dd73c80 > [ 9.427320] r4:9e7b5d00 > [ 9.429887] [<80125cbc>] (SyS_mount) from [<80aa6260>] > (mount_block_root+0x114/0x240) > [ 9.437743] r8:9efacd20 r7:80afa7a0 r6:9de69000 r5:9de69000 r4:00008001 > [ 9.444539] [<80aa614c>] (mount_block_root) from [<80aa664c>] > (prepare_namespace+0x98/0x1d0) > [ 9.452999] r10:000000d7 r9:80afa780 r8:80b087e4 r7:80afa778 > r6:80afa7a0 r5:80afa7a0 > [ 9.460933] r4:80afa7b0 > [ 9.463494] [<80aa65b4>] (prepare_namespace) from [<80aa5efc>] > (kernel_init_freeable+0x1a4/0x1f0) > [ 9.472386] r6:80b8b000 r5:80b8b000 r4:00000007 > [ 9.477081] [<80aa5d58>] (kernel_init_freeable) from [<807c91b4>] > (kernel_init+0x10/0xf4) > [ 9.485262] r10:00000000 r9:00000000 r8:00000000 r7:00000000 > r6:00000000 r5:807c91a4 > [ 9.493196] r4:00000000 > [ 9.495758] [<807c91a4>] (kernel_init) from [<800104b8>] > (ret_from_fork+0x14/0x3c) > [ 9.503349] r4:00000000 r3:00000000 > [ 9.507115] UBIFS error (ubi0:0 pid 1): ubifs_recover_master_node: > failed to recover master node > ^^^ master node recovery fails > ... > [ 9.618335] Kernel panic - not syncing: VFS: Unable to mount root > fs on unknown-block(0,0) > > After added Huang Shijie's 'mtd: gpmi: fix the bitflips for erased > page' patch from http://patchwork.ozlabs.org/patch/309763/ which shows > the following on board2: > [ 8.605463] gpmi-nand 112000.gpmi-nand: The page(186809) is an > erased page(7,6,1,0). > [ 8.628012] gpmi-nand 112000.gpmi-nand: The page(186809) is an > erased page(7,6,1,0). > ^^^ 1 bit-flip on ecc read, 0 bit-flip on non-ecc read, ecc-str=6, > memset's buffer to 0xff to attempt to recover > [ 8.637305] UBIFS error (ubi0:0 pid 1): ubifs_recover_master_node: > failed to recover master node > ^^^ unable to recover and always ends up this way over further boots > >>> If I understand what you are saying you are thinking that my issue is >>> instead the result of a never-used PEB that had bit-flips from the >>> manufacturer in which case the bits would read the same every time? >>> How can we know this PEB was never before used and isn't one that was >>> being erased/written during a power cut? >> >> I've seen bit flips on cheap SLC NANDs which came out of a sudden. >> According to the FAE I was talking to this is legit for NAND >> as long the flipping bits are fixable by the ECC engine. >> > > I'm a little confused by the fact that gpmi-nand.c encounters an > uncorrectable status yet when the bit-flips are counted by Huang's > 'gmpi_erased_check' they come out to 1 which is less than the ecc str. > Is this saying that ecc already corrected what it could and there was > 'still' 1 more flip? > >>> In my test scenario where the rootfs is mounted from the kernel >>> read-only, but later mounted read-write by userspace (yet not being >>> specifically written to by userspace) then power-cut should 'any' NAND >>> writes would be occurring at all? And if not as I suspect, then how >>> could a subsequent boot end up using a PEB that may have been never >>> previously used and have bit-flips from the manufacturer? >> >> UBIFS's has a wandering journal. During the remount it moved maybe. >> But for a more expressive analysis I'd need a nanddump to find out which >> blocks are in which role. >> Can you share the nanddump? >> > > This makes sense, and I apparently lied when I said the test software > wasn't writing to the rootfs. There is at least one process that does > write: the dhcp client when it writes to /etc/resolv.con and possibly > something else writes as well. So we are power-cutting an an un-clean > ubifs and never allowing a safe shutdown so I would expect the journal > to be wandering around. > >>> Should we be doing an erase block on every NAND block during our board >>> manufacturing process to avoid this? >> >> Sorry, I don't understand this sentence. >> Do you mean a full erasure of the whole NAND? >> If so, it would not help as the bit flips can come later. >> (Without writing/erasing the block) >> The root cause is that your NFC cannot correct bit flips on empty pages. >> >>> It sounds like this 'unexpected bit-flips on erased pages from the >>> mfg' issue is a ticking time-bomb for people using ubi/ubifs NAND. >>> Shouldn't the http://www.linux-mtd.infradead.org/doc/ubifs.html page >>> be updated to refer to this known issue as well as the unstable bit >>> issue? >> >> As I said the root cause is that some NFCs cannot correct bit flips on empty >> pages. >> Instead of putting warnings to ubifs.html I'd love to see a solution on the >> said drivers or MTD core. >> >>> I can add some debugging to find out - what specifically would be >>> helpful to add? >> >> A hexdump of the buffer would be a good start. >> >>> Thanks for the help! >> >> Thanks for sharing your issues. This is the only way >> to address them. >> That said, as far on no board I had access to I was able to reproduce the unstable bits >> issue. It was always something else. >> >> Thanks, >> //richard > > I can provide dumps of various blocks if needed or add more debugging > on the reads after the corruption occurred. It takes me a day or two > to re-create the issue if you want me to turn on more debugging during > the writes. > > I've cc'd Huang, Elie, and Brian who were involved in the patch to > detect bit-flips in gpmi-nand.c reads - perhaps they have some more > ideas. I find it interesting that in one case that patch resolves the > issue and in the other it does not. > > Regards, > > Tim Cc: Huang Shijie <shijie.huang@arm.com> Updated Huang's e-mail as the old one bounced Tim ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: UBIFS corruption after power cut - possibly unstable bits issue? 2015-11-02 20:31 ` Tim Harvey @ 2015-11-02 21:31 ` Richard Weinberger 2015-11-02 22:11 ` Brian Norris 2015-11-03 13:38 ` Boris Brezillon 1 sibling, 1 reply; 18+ messages in thread From: Richard Weinberger @ 2015-11-02 21:31 UTC (permalink / raw) To: Tim Harvey Cc: Artem Bityutskiy, Adrian Hunter, linux-mtd, Elie De Brauwer, Brian Norris, Huang Shijie Am 02.11.2015 um 21:31 schrieb Tim Harvey: >> [ 10.018775] UBIFS (ubi0:0): completing deferred recovery >> [ 10.043624] UBI DBG io (pid 145): write 4096 bytes to PEB 2253:4096 >> [ 10.054878] UBI DBG io (pid 145): write 94208 bytes to PEB 2253:8192 >> [ 10.096644] UBI DBG io (pid 81): write 4096 bytes to PEB 2250:0 >> [ 10.143314] UBI DBG io (pid 145): write 4096 bytes to PEB 2254:4096 >> [ 10.153917] UBI DBG io (pid 145): write 184320 bytes to PEB 2254:8192 >> ^^^^^ PEB 2254 is the one that becomes un-readable on the next boot So, the important fact is that it was not written while the power cut happened. Maybe it was erased. Can you please enable that debug print too? Just to make very sure. >> [ 10.186288] UBI DBG io (pid 81): write 4096 bytes to PEB 2251:0 >> [ 10.204620] UBIFS (ubi0:0): background thread "ubifs_bgt0_0" started, PID 173 >> [ 10.213375] UBI DBG io (pid 145): write 4096 bytes to PEB 2255:4096 >> [ 10.223864] UBI DBG io (pid 145): write 4096 bytes to PEB 2255:8192 >> [ 10.239935] UBI DBG io (pid 145): write 4096 bytes to PEB 2237:32768 >> [ 10.261787] UBI DBG io (pid 145): write 4096 bytes to PEB 2184:102400 >> [ 10.272978] UBI DBG io (pid 145): write 4096 bytes to PEB 2173:114688 >> [ 10.285948] UBI DBG io (pid 145): write 4096 bytes to PEB 2174:114688 >> [ 10.296027] UBIFS (ubi0:0): deferred recovery completed >> [ 10.299453] UBI DBG io (pid 81): write 4096 bytes to PEB 2252:0 [...] >> [ 15.000462] UBI DBG io (pid 173): write 4096 bytes to PEB 2253:102400 >> [ 16.472720] fec 2188000.ethernet eth0: Link is Up - 1Gbps/Full - >> flow control rx/tx >> [ 16.480975] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready >> Sending discover... >> Sending select for 192.168.1.1... >> Lease of 192.168.1.1 obtained, lease time 21600 >> /etc/udhcpc.d/50default: Adding DNS 192.168.1.100 >> done. >> >> <power cut> [...] >> [ 8.635364] UBIFS (ubi0:0): recovery needed >> [ 8.676203] ubi0 warning: ubi_io_read: error -74 (ECC error) while >> reading 69632 bytes from PEB 2254:192512, read only 69632 bytes, retry >> [ 8.692460] ubi0 warning: ubi_io_read: error -74 (ECC error) while >> reading 69632 bytes from PEB 2254:192512, read only 69632 bytes, retry >> [ 8.708741] ubi0 warning: ubi_io_read: error -74 (ECC error) while >> reading 69632 bytes from PEB 2254:192512, read only 69632 bytes, retry >> ^^^^ non correctable ecc error on PEB 2254 - I verified that this was >> not the first time this PEB has been used >> [ 8.724987] ubi0 error: ubi_io_read: error -74 (ECC error) while >> reading 69632 bytes from PEB 2254:192512, read 69632 bytes >> [ 8.736163] CPU: 0 PID: 1 Comm: swapper/0 Not tainted >> 4.2.0-00015-g64291f7 dirty #195 >> [ 8.744021] Hardware name: Freescale i.MX6 Quad/DualLite (Device Tree) >> [ 8.750582] Backtrace: >> [ 8.753074] [<8001400c>] (dump_backtrace) from [<800141f8>] >> (show_stack+0x18/0x1c) >> [ 8.760667] r6:80b2d960 r5:80b2d960 r4:00000000 r3:00000000 >> [ 8.766424] [<800141e0>] (show_stack) from [<807ceeb8>] >> (dump_stack+0x8c/0xa4) >> [ 8.773687] [<807cee2c>] (dump_stack) from [<80474730>] >> (ubi_io_read+0x130/0x378) >> [ 8.781212] r6:00000000 r5:00011000 r4:ffffffb6 r3:be070000 >> [ 8.786951] [<80474600>] (ubi_io_read) from [<80471564>] >> (ubi_eba_read_leb+0x19c/0x4c8) >> [ 8.794975] r10:00000000 r9:c0c63000 r8:000008ce r7:00001f15 >> r6:00000000 r5:bda3a000 >> [ 8.802928] r4:bda12000 >> [ 8.805489] [<804713c8>] (ubi_eba_read_leb) from [<8047077c>] >> (ubi_leb_read+0x94/0x130) >> [ 8.813527] r10:00000000 r9:c0c63000 r8:bda3a000 r7:0002d000 >> r6:00001f15 r5:00011000 >> [ 8.821467] r4:bda12000 >> [ 8.824037] [<804706e8>] (ubi_leb_read) from [<80257870>] >> (ubifs_leb_read+0x30/0x9c) >> [ 8.831815] r10:be29a000 r9:80b77ff8 r8:00001f15 r7:0002d000 >> r6:00011000 r5:be29a000 >> [ 8.839731] r4:bdaea600 >> [ 8.842312] [<80257840>] (ubifs_leb_read) from [<80260fac>] >> (ubifs_start_scan+0x84/0x134) >> [ 8.850522] r8:c0c36000 r7:00001f15 r6:be29a000 r5:0002d000 r4:bdaea600 >> [ 8.857322] [<80260f28>] (ubifs_start_scan) from [<80276270>] >> (ubifs_recover_leb+0x80/0xb80) >> [ 8.865794] r8:00001f15 r7:c0c63000 r6:00011000 r5:00001f15 r4:00000001 >> [ 8.872609] [<802761f0>] (ubifs_recover_leb) from [<802627f0>] >> (ubifs_replay_journal+0xd34/0x1a48) >> [ 8.881601] r10:00017188 r9:0002d000 r8:be29a000 r7:00000000 >> r6:bdaea608 r5:00001f15 >> [ 8.889518] r4:00000001 >> [ 8.892106] [<80261abc>] (ubifs_replay_journal) from [<80254954>] >> (ubifs_mount+0x136c/0x2170) >> [ 8.900666] r10:be29a008 r9:00000000 r8:0003db70 r7:be29a000 >> r6:80b76870 r5:be3f6000 >> [ 8.908584] r4:00000000 >> [ 8.911171] [<802535e8>] (ubifs_mount) from [<801055f8>] (mount_fs+0x1c/0xb0) >> [ 8.918323] r10:00008001 r9:00000000 r8:80b2a2e0 r7:80124e34 >> r6:80b2a2e0 r5:bd9cfac0 >> [ 8.926258] r4:be365900 >> [ 8.928829] [<801055dc>] (mount_fs) from [<80121868>] >> (vfs_kern_mount+0x5c/0x144) >> [ 8.936355] r6:00008001 r5:bd9cfac0 r4:be365900 >> [ 8.941056] [<8012180c>] (vfs_kern_mount) from [<80124e34>] >> (do_mount+0x19c/0xd0c) >> [ 8.948642] r9:80b2a2e0 r8:80b1ebdc r7:bd9cfa80 r6:bd9cfac0 >> r5:00000060 r4:00000000 >> [ 8.956498] [<80124c98>] (do_mount) from [<80125d38>] (SyS_mount+0x7c/0xa8) >> [ 8.963493] r10:000000d7 r9:80afa7b0 r8:8099d910 r7:00008001 >> r6:00000000 r5:bd9cfac0 >> [ 8.971429] r4:bd9cfa80 >> [ 8.973993] [<80125cbc>] (SyS_mount) from [<80aa6260>] >> (mount_block_root+0x114/0x240) >> [ 8.981856] r8:befa3a40 r7:80afa7a0 r6:bd9d2000 r5:bd9d2000 r4:00008001 >> [ 8.988651] [<80aa614c>] (mount_block_root) from [<80aa664c>] >> (prepare_namespace+0x98/0x1d0) >> [ 8.997123] r10:000000d7 r9:80afa780 r8:80b087e4 r7:80afa778 >> r6:80afa7a0 r5:80afa7a0 >> [ 9.005058] r4:80afa7b0 >> [ 9.007617] [<80aa65b4>] (prepare_namespace) from [<80aa5efc>] >> (kernel_init_freeable+0x1a4/0x1f0) >> [ 9.016522] r6:80b8b000 r5:80b8b000 r4:00000007 >> [ 9.021217] [<80aa5d58>] (kernel_init_freeable) from [<807c91b4>] >> (kernel_init+0x10/0xf4) >> [ 9.029412] r10:00000000 r9:00000000 r8:00000000 r7:00000000 >> r6:00000000 r5:807c91a4 >> [ 9.037354] r4:00000000 >> [ 9.039918] [<807c91a4>] (kernel_init) from [<800104b8>] >> (ret_from_fork+0x14/0x3c) >> [ 9.047525] r4:00000000 r3:00000000 >> [ 9.051259] UBIFS error (ubi0:0 pid 1): ubifs_recover_leb: corrupt >> empty space LEB 7957:184320, corruption starts at 4665 >> [ 9.062247] UBIFS error (ubi0:0 pid 1): ubifs_scanned_corruption: >> corruption at LEB 7957:188985 >> [ 9.070972] UBIFS error (ubi0:0 pid 1): ubifs_scanned_corruption: >> first 8192 bytes from LEB 7957:188985 >> [ 9.080396] 00000000: ffffffef ffffffff ffffffff ffffffff ffffffff >> ffffffff ffffffff ffffffff ................................ >> [ 9.091903] 00000020: ffffffff ffffffff ffffffff ffffffff ffffffff >> ffffffff ffffffff ffffffff ................................ >> ... >> [ 12.002997] 00001fc0: ffffffff ffffffff ffffffff ffffffff ffffffff >> ffffffff ffffffff ffffffff ................................ >> [ 12.014501] 00001fe0: ffffffff ffffffff ffffffff ffffffff ffffffff >> ffffffff ffffffff ffffffff ................................ >> [ 12.026000] UBIFS error (ubi0:0 pid 1): ubifs_recover_leb: LEB 7957 >> scanning failed >> ^^^^ recovery failed >> [ 12.036886] VFS: Cannot open root device "ubi0:rootfs" or >> unknown-block(0,0): error -117 >> >> After added Huang Shijie's 'mtd: gpmi: fix the bitflips for erased >> page' patch from http://patchwork.ozlabs.org/patch/309763/ which shows >> the following on board1: >> [ 8.532624] UBIFS (ubi0:0): recovery needed >> [ 8.569104] gpmi-nand 112000.gpmi-nand: The page(148656) is an >> erased page(1,6,1,1). >> [ 8.577748] gpmi-nand 112000.gpmi-nand: The page(148658) is an >> erased page(1,6,1,1). >> [ 8.586413] gpmi-nand 112000.gpmi-nand: The page(148660) is an >> erased page(1,6,1,1). >> [ 8.595050] gpmi-nand 112000.gpmi-nand: The page(148662) is an >> erased page(1,6,1,1). >> [ 8.603705] gpmi-nand 112000.gpmi-nand: The page(148664) is an >> erased page(1,6,1,1). >> [ 8.612310] gpmi-nand 112000.gpmi-nand: The page(148666) is an >> erased page(1,6,1,1). >> [ 8.620940] gpmi-nand 112000.gpmi-nand: The page(148668) is an >> erased page(1,6,1,1). >> [ 8.629580] gpmi-nand 112000.gpmi-nand: The page(148670) is an >> erased page(1,6,1,1). >> ^^^^ 1 bit-flip on ecc read, 1 bit-flip on non-ecc read, ecc-str=6, >> memset's buffer to 0xff to attempt to recover >> [ 8.644345] UBIFS (ubi0:0): recovery deferred >> [ 8.649209] UBIFS (ubi0:0): UBIFS: mounted UBI device 0, volume 0, >> name "rootfs", R/O mode >> [ 8.657537] UBIFS (ubi0:0): LEB size: 253952 bytes (248 KiB), >> min./max. I/O unit sizes: 4096 bytes/4096 bytes >> [ 8.667486] UBIFS (ubi0:0): FS size: 2018918400 bytes (1925 MiB, >> 7950 LEBs), journal size 9404416 bytes (8 MiB, 38 LEBs) >> [ 8.678409] UBIFS (ubi0:0): reserved for root: 0 bytes (0 KiB) >> [ 8.684283] UBIFS (ubi0:0): media format: w4/r0 (latest is w4/r0), >> UUID BB5E5EB5-6A21-48AA-9C31-DE6B4A32ABAD, small LPT model >> [ 8.697990] VFS: Mounted root (ubifs filesystem) readonly on device 0:14. >> ^^^^ able to recover and always ends up this way over further boots >> ^^^^ after applying this patch I boot with init=foo so as to not allow >> rootfs to get mounted read/write by udev and perhaps permanently >> recover the board >> >> >> >> Board2 looks like this: >> [ 9.621167] udevd[106]: starting version 182 >> [ 10.525521] UBIFS (ubi0:0): completing deferred recovery >> [ 10.540072] UBI DBG io (pid 162): write 4096 bytes to PEB 5891:4096 >> [ 10.547480] UBI DBG io (pid 162): write 110592 bytes to PEB 5891:8192 >> [ 10.578168] UBI DBG io (pid 162): write 4096 bytes to PEB 5892:4096 >> [ 10.585053] UBI DBG io (pid 74): write 4096 bytes to PEB 2944:0 >> [ 10.585639] UBI DBG io (pid 162): write 163840 bytes to PEB 5892:8192 >> [ 10.607478] UBI DBG io (pid 74): write 4096 bytes to PEB 2945:0 >> [ 10.615485] UBIFS (ubi0:0): background thread "ubifs_bgt0_0" started, PID 163 >> [ 10.617094] UBI DBG io (pid 162): write 4096 bytes to PEB 5893:4096 >> [ 10.618542] UBI DBG io (pid 162): write 4096 bytes to PEB 5893:8192 >> [ 10.619487] UBI DBG io (pid 162): write 4096 bytes to PEB 2859:212992 >> [ 10.620347] UBI DBG io (pid 162): write 4096 bytes to PEB 2850:225280 >> ^^^^^ this is the PEB that gets uncorrectable ECC errors - I verified Again not written while power cut happened. Maybe erased. Please enable print. :) >> that this was not the first time this PEB has been used >> [ 10.621476] UBI DBG io (pid 162): write 4096 bytes to PEB 2851:225280 >> [ 10.623058] UBIFS (ubi0:0): deferred recovery completed >> [ 10.628641] UBI DBG io (pid 74): write 4096 bytes to PEB 2946:0 >> bootlogd: cannot allocate pseudo tty: No such file or directory >> Populating dev cache >> ALSA: Restoring mixer settings... >> /usr/sbin/alsactl: load_state:1729: No soundcards found... >> Sat Oct 24 11:28:04 UTC 2015 >> INIT: Entering runlevel: 5 >> Configuring network interfaces... ifconfig: SIOCGIFFLAGS: No such device >> Starting system message bus: dbus. >> Starting Dropbear SSH server: dropbear. >> Starting rpcbind daemon...done. >> Starting watchdog: [1 >> 34m[ [1 >> 32m OK [1 >> 34m] [0 >> 39m >> Starting advanced power management daemon: No APM support in kernel >> (failed.) >> starting DNS forwarder and DHCP server: dnsmasq... done. >> >> <power cut> [...] >> [ 8.683337] ubi0 warning: ubi_io_read: error -74 (ECC error) while >> reading 253952 bytes from PEB 2850:8192, read only 253952 bytes, retry >> [ 8.710495] ubi0 warning: ubi_io_read: error -74 (ECC error) while >> reading 253952 bytes from PEB 2850:8192, read only 253952 bytes, retry >> [ 8.736821] ubi0 warning: ubi_io_read: error -74 (ECC error) while >> reading 253952 bytes from PEB 2850:8192, read only 253952 bytes, retry >> [ 8.763179] ubi0 error: ubi_io_read: error -74 (ECC error) while >> reading 253952 bytes from PEB 2850:8192, read 253952 bytes >> ^^^^ uncorrectable ECC error on PEB 2850 >> [ 8.774340] CPU: 0 PID: 1 Comm: swapper/0 Not tainted >> 4.2.0-00015-g64291f7 #195 >> [ 8.782194] Hardware name: Freescale i.MX6 Quad/DualLite (Device Tree) >> [ 8.788743] Backtrace: >> [ 8.791237] [<8001400c>] (dump_backtrace) from [<800141f8>] >> (show_stack+0x18/0x1c) >> [ 8.798839] r6:80b2d960 r5:80b2d960 r4:00000000 r3:00000000 >> [ 8.804589] [<800141e0>] (show_stack) from [<807ceeb8>] >> (dump_stack+0x8c/0xa4) >> [ 8.811855] [<807cee2c>] (dump_stack) from [<80474730>] >> (ubi_io_read+0x130/0x378) >> [ 8.819359] r6:00000000 r5:0003e000 r4:ffffffb6 r3:9e468000 >> [ 8.825099] [<80474600>] (ubi_io_read) from [<80471564>] >> (ubi_eba_read_leb+0x19c/0x4c8) >> [ 8.833125] r10:00000000 r9:a0c0d000 r8:00000b22 r7:00000001 >> r6:00000000 r5:9dd8a000 >> [ 8.841065] r4:9dd9e000 >> [ 8.843628] [<804713c8>] (ubi_eba_read_leb) from [<8047077c>] >> (ubi_leb_read+0x94/0x130) >> [ 8.851661] r10:00000000 r9:a0c0d000 r8:9dd8a000 r7:00000000 >> r6:00000001 r5:0003e000 >> [ 8.859596] r4:9dd9e000 >> [ 8.862170] [<804706e8>] (ubi_leb_read) from [<80257870>] >> (ubifs_leb_read+0x30/0x9c) >> [ 8.869937] r10:9e673008 r9:00000001 r8:00000001 r7:00000000 >> r6:0003e000 r5:9e673000 >> [ 8.877873] r4:9de22240 >> [ 8.880437] [<80257840>] (ubifs_leb_read) from [<80260fac>] >> (ubifs_start_scan+0x84/0x134) >> [ 8.888637] r8:a0c0d000 r7:00000001 r6:9e673000 r5:00000000 r4:9de22240 >> [ 8.895433] [<80260f28>] (ubifs_start_scan) from [<802613ac>] >> (ubifs_scan+0x30/0x3d4) >> [ 8.903284] r8:00000001 r7:9e673000 r6:a0c0d000 r5:00000000 r4:0003e000 >> [ 8.910097] [<8026137c>] (ubifs_scan) from [<8025fcd8>] >> (ubifs_read_master+0x48/0xd78) >> [ 8.918034] r10:9e673008 r9:9dd7bf80 r8:9e673000 r7:00000001 >> r6:80b76870 r5:9dda1800 >> [ 8.925952] r4:00000001 >> [ 8.928533] [<8025fc90>] (ubifs_read_master) from [<80254714>] >> (ubifs_mount+0x112c/0x2170) >> [ 8.936819] r10:9e673008 r9:9dd7bf80 r8:9e673b38 r7:9e673000 >> r6:80b76870 r5:9dda1800 >> [ 8.944738] r4:00000001 >> [ 8.947326] [<802535e8>] (ubifs_mount) from [<801055f8>] (mount_fs+0x1c/0xb0) >> [ 8.954466] r10:00008001 r9:00000000 r8:80b2a2e0 r7:80124e34 >> r6:80b2a2e0 r5:9dd73c80 >> [ 8.962404] r4:9dd76500 >> [ 8.964975] [<801055dc>] (mount_fs) from [<80121868>] >> (vfs_kern_mount+0x5c/0x144) >> [ 8.972479] r6:00008001 r5:9dd73c80 r4:9dd76500 >> [ 8.977177] [<8012180c>] (vfs_kern_mount) from [<80124e34>] >> (do_mount+0x19c/0xd0c) >> [ 8.984750] r9:80b2a2e0 r8:80b1ebdc r7:9e7b5d00 r6:9dd73c80 >> r5:00000060 r4:00000000 >> [ 8.992605] [<80124c98>] (do_mount) from [<80125d38>] (SyS_mount+0x7c/0xa8) >> [ 8.999588] r10:000000d7 r9:80afa7b0 r8:8099d910 r7:00008001 >> r6:00000000 r5:9dd73c80 >> [ 9.007522] r4:9e7b5d00 >> [ 9.010087] [<80125cbc>] (SyS_mount) from [<80aa6260>] >> (mount_block_root+0x114/0x240) >> [ 9.017941] r8:9efacd20 r7:80afa7a0 r6:9de69000 r5:9de69000 r4:00008001 >> [ 9.024736] [<80aa614c>] (mount_block_root) from [<80aa664c>] >> (prepare_namespace+0x98/0x1d0) >> [ 9.033195] r10:000000d7 r9:80afa780 r8:80b087e4 r7:80afa778 >> r6:80afa7a0 r5:80afa7a0 >> [ 9.041129] r4:80afa7b0 >> [ 9.043690] [<80aa65b4>] (prepare_namespace) from [<80aa5efc>] >> (kernel_init_freeable+0x1a4/0x1f0) >> [ 9.052592] r6:80b8b000 r5:80b8b000 r4:00000007 >> [ 9.057290] [<80aa5d58>] (kernel_init_freeable) from [<807c91b4>] >> (kernel_init+0x10/0xf4) >> [ 9.065472] r10:00000000 r9:00000000 r8:00000000 r7:00000000 >> r6:00000000 r5:807c91a4 >> [ 9.073407] r4:00000000 >> [ 9.075970] [<807c91a4>] (kernel_init) from [<800104b8>] >> (ret_from_fork+0x14/0x3c) >> [ 9.083560] r4:00000000 r3:00000000 >> [ 9.102369] ubi0 warning: ubi_io_read: error -74 (ECC error) while >> reading 253952 bytes from PEB 2850:8192, read only 253952 bytes, retry >> [ 9.128774] ubi0 warning: ubi_io_read: error -74 (ECC error) while >> reading 253952 bytes from PEB 2850:8192, read only 253952 bytes, retry >> [ 9.155184] ubi0 warning: ubi_io_read: error -74 (ECC error) while >> reading 253952 bytes from PEB 2850:8192, read only 253952 bytes, retry >> [ 9.181522] ubi0 error: ubi_io_read: error -74 (ECC error) while >> reading 253952 bytes from PEB 2850:8192, read 253952 bytes >> ^^^^ unrecoverable ECC error on PEB 2850 >> [ 9.192683] CPU: 0 PID: 1 Comm: swapper/0 Not tainted >> 4.2.0-00015-g64291f7 #195 >> [ 9.200535] Hardware name: Freescale i.MX6 Quad/DualLite (Device Tree) >> [ 9.207083] Backtrace: >> [ 9.209573] [<8001400c>] (dump_backtrace) from [<800141f8>] >> (show_stack+0x18/0x1c) >> [ 9.217173] r6:80b2d960 r5:80b2d960 r4:00000000 r3:00000000 >> [ 9.222916] [<800141e0>] (show_stack) from [<807ceeb8>] >> (dump_stack+0x8c/0xa4) >> [ 9.230177] [<807cee2c>] (dump_stack) from [<80474730>] >> (ubi_io_read+0x130/0x378) >> [ 9.237688] r6:00000000 r5:0003e000 r4:ffffffb6 r3:9e468000 >> [ 9.243429] [<80474600>] (ubi_io_read) from [<80471564>] >> (ubi_eba_read_leb+0x19c/0x4c8) >> [ 9.251458] r10:00000000 r9:a0c4c000 r8:00000b22 r7:00000001 >> r6:00000000 r5:9dd8a000 >> [ 9.259398] r4:9dd9e000 >> [ 9.261962] [<804713c8>] (ubi_eba_read_leb) from [<8047077c>] >> (ubi_leb_read+0x94/0x130) >> [ 9.269988] r10:00000000 r9:a0c4c000 r8:9dd8a000 r7:00000000 >> r6:00000001 r5:0003e000 >> [ 9.277925] r4:9dd9e000 >> [ 9.280495] [<804706e8>] (ubi_leb_read) from [<80257870>] >> (ubifs_leb_read+0x30/0x9c) >> [ 9.288264] r10:9e673000 r9:9dd7bf80 r8:00000001 r7:00000000 >> r6:0003e000 r5:9e673000 >> [ 9.296182] r4:9e673000 >> [ 9.298768] [<80257840>] (ubifs_leb_read) from [<8027569c>] >> (get_master_node+0x54/0x324) >> [ 9.306880] r8:9e673000 r7:80b77ff8 r6:00001000 r5:00001000 r4:9e673000 >> [ 9.313681] [<80275648>] (get_master_node) from [<80275d98>] >> (ubifs_recover_master_node+0x60/0x3ec) >> [ 9.322749] r10:9e673008 r9:9dd7bf80 r8:9e673000 r7:80b77ff8 >> r6:00001000 r5:9dda1800 >> [ 9.330685] r4:9e673000 >> [ 9.333248] [<80275d38>] (ubifs_recover_master_node) from >> [<8026080c>] (ubifs_read_master+0xb7c/0xd78) >> [ 9.342575] r7:00000001 r6:80b76870 r5:9dda1800 r4:ffffff8b >> [ 9.348335] [<8025fc90>] (ubifs_read_master) from [<80254714>] >> (ubifs_mount+0x112c/0x2170) >> [ 9.356621] r10:9e673008 r9:9dd7bf80 r8:9e673b38 r7:9e673000 >> r6:80b76870 r5:9dda1800 >> [ 9.364540] r4:00000001 >> [ 9.367127] [<802535e8>] (ubifs_mount) from [<801055f8>] (mount_fs+0x1c/0xb0) >> [ 9.374266] r10:00008001 r9:00000000 r8:80b2a2e0 r7:80124e34 >> r6:80b2a2e0 r5:9dd73c80 >> [ 9.382204] r4:9dd76500 >> [ 9.384772] [<801055dc>] (mount_fs) from [<80121868>] >> (vfs_kern_mount+0x5c/0x144) >> [ 9.392276] r6:00008001 r5:9dd73c80 r4:9dd76500 >> [ 9.396974] [<8012180c>] (vfs_kern_mount) from [<80124e34>] >> (do_mount+0x19c/0xd0c) >> [ 9.404547] r9:80b2a2e0 r8:80b1ebdc r7:9e7b5d00 r6:9dd73c80 >> r5:00000060 r4:00000000 >> [ 9.412404] [<80124c98>] (do_mount) from [<80125d38>] (SyS_mount+0x7c/0xa8) >> [ 9.419386] r10:000000d7 r9:80afa7b0 r8:8099d910 r7:00008001 >> r6:00000000 r5:9dd73c80 >> [ 9.427320] r4:9e7b5d00 >> [ 9.429887] [<80125cbc>] (SyS_mount) from [<80aa6260>] >> (mount_block_root+0x114/0x240) >> [ 9.437743] r8:9efacd20 r7:80afa7a0 r6:9de69000 r5:9de69000 r4:00008001 >> [ 9.444539] [<80aa614c>] (mount_block_root) from [<80aa664c>] >> (prepare_namespace+0x98/0x1d0) >> [ 9.452999] r10:000000d7 r9:80afa780 r8:80b087e4 r7:80afa778 >> r6:80afa7a0 r5:80afa7a0 >> [ 9.460933] r4:80afa7b0 >> [ 9.463494] [<80aa65b4>] (prepare_namespace) from [<80aa5efc>] >> (kernel_init_freeable+0x1a4/0x1f0) >> [ 9.472386] r6:80b8b000 r5:80b8b000 r4:00000007 >> [ 9.477081] [<80aa5d58>] (kernel_init_freeable) from [<807c91b4>] >> (kernel_init+0x10/0xf4) >> [ 9.485262] r10:00000000 r9:00000000 r8:00000000 r7:00000000 >> r6:00000000 r5:807c91a4 >> [ 9.493196] r4:00000000 >> [ 9.495758] [<807c91a4>] (kernel_init) from [<800104b8>] >> (ret_from_fork+0x14/0x3c) >> [ 9.503349] r4:00000000 r3:00000000 >> [ 9.507115] UBIFS error (ubi0:0 pid 1): ubifs_recover_master_node: >> failed to recover master node >> ^^^ master node recovery fails [...] >> I can provide dumps of various blocks if needed or add more debugging >> on the reads after the corruption occurred. It takes me a day or two >> to re-create the issue if you want me to turn on more debugging during >> the writes. Please retry first with erase debug prints enabled too. >> I've cc'd Huang, Elie, and Brian who were involved in the patch to >> detect bit-flips in gpmi-nand.c reads - perhaps they have some more >> ideas. I find it interesting that in one case that patch resolves the >> issue and in the other it does not. Yeah, good idea! Thanks, //richard ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: UBIFS corruption after power cut - possibly unstable bits issue? 2015-11-02 21:31 ` Richard Weinberger @ 2015-11-02 22:11 ` Brian Norris 0 siblings, 0 replies; 18+ messages in thread From: Brian Norris @ 2015-11-02 22:11 UTC (permalink / raw) To: Richard Weinberger Cc: Tim Harvey, Artem Bityutskiy, Adrian Hunter, linux-mtd, Elie De Brauwer, Huang Shijie, Boris Brezillon + Boris On Mon, Nov 02, 2015 at 10:31:53PM +0100, Richard Weinberger wrote: > Am 02.11.2015 um 21:31 schrieb Tim Harvey: > > >> I've cc'd Huang, Elie, and Brian who were involved in the patch to > >> detect bit-flips in gpmi-nand.c reads - perhaps they have some more > >> ideas. I find it interesting that in one case that patch resolves the > >> issue and in the other it does not. > > Yeah, good idea! Sorry, I don't have much time to care about gpmi-nand right now. I recall that there was ugly support for detecting bitflips in erased pages posted at some point (in case that's relevant; didn't read the whole essay), but that never got merged, and Huang left Freescale. I also recall Boris struggled with that driver at some point. Perhaps he can be persuaded to care. Brian ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: UBIFS corruption after power cut - possibly unstable bits issue? 2015-11-02 20:31 ` Tim Harvey 2015-11-02 21:31 ` Richard Weinberger @ 2015-11-03 13:38 ` Boris Brezillon 2015-11-16 15:01 ` Tim Harvey 1 sibling, 1 reply; 18+ messages in thread From: Boris Brezillon @ 2015-11-03 13:38 UTC (permalink / raw) To: Tim Harvey Cc: Richard Weinberger, Elie De Brauwer, Artem Bityutskiy, Adrian Hunter, linux-mtd, Huang Shijie, Brian Norris Hi Tim, On Mon, 2 Nov 2015 12:31:11 -0800 Tim Harvey <tharvey@gateworks.com> wrote: > On Mon, Nov 2, 2015 at 12:27 PM, Tim Harvey <tharvey@gateworks.com> wrote: > > [ 8.635364] UBIFS (ubi0:0): recovery needed > > [ 8.676203] ubi0 warning: ubi_io_read: error -74 (ECC error) while > > reading 69632 bytes from PEB 2254:192512, read only 69632 bytes, retry > > [ 8.692460] ubi0 warning: ubi_io_read: error -74 (ECC error) while > > reading 69632 bytes from PEB 2254:192512, read only 69632 bytes, retry > > [ 8.708741] ubi0 warning: ubi_io_read: error -74 (ECC error) while > > reading 69632 bytes from PEB 2254:192512, read only 69632 bytes, retry > > ^^^^ non correctable ecc error on PEB 2254 - I verified that this was > > not the first time this PEB has been used I suspect one of the bit in PEB 2254 to be stuck at 0 (even after erasing the block the bit stays at 0). Have you tried to erase this block (flash_erase /dev/mtd2 0x23380000 1) and dump it in raw mode (nanddump -n -l 0x40000 -s 0x23380000 -f /tmp/dump /dev/mtd2)? > > > > I've cc'd Huang, Elie, and Brian who were involved in the patch to > > detect bit-flips in gpmi-nand.c reads - perhaps they have some more > > ideas. I find it interesting that in one case that patch resolves the > > issue and in the other it does not. I posted a slightly reworked version of Huang's patch [1] a while ago addressing the "account for bitflips in OOB area" problem, but maybe we could do better (avoid this extra "read in raw mode" step, or use the generic nand_check_erased_ecc_chunk() function when ECC bytes are aligned). Best Regards, Boris [1]https://patchwork.ozlabs.org/patch/416543/ -- Boris Brezillon, Free Electrons Embedded Linux and Kernel engineering http://free-electrons.com ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: UBIFS corruption after power cut - possibly unstable bits issue? 2015-11-03 13:38 ` Boris Brezillon @ 2015-11-16 15:01 ` Tim Harvey 2015-11-30 21:58 ` Tim Harvey 0 siblings, 1 reply; 18+ messages in thread From: Tim Harvey @ 2015-11-16 15:01 UTC (permalink / raw) To: Boris Brezillon Cc: Richard Weinberger, Elie De Brauwer, Artem Bityutskiy, Adrian Hunter, linux-mtd, Huang Shijie, Brian Norris On Tue, Nov 3, 2015 at 5:38 AM, Boris Brezillon <boris.brezillon@free-electrons.com> wrote: > Hi Tim, > > On Mon, 2 Nov 2015 12:31:11 -0800 > Tim Harvey <tharvey@gateworks.com> wrote: > >> On Mon, Nov 2, 2015 at 12:27 PM, Tim Harvey <tharvey@gateworks.com> wrote: >> > [ 8.635364] UBIFS (ubi0:0): recovery needed >> > [ 8.676203] ubi0 warning: ubi_io_read: error -74 (ECC error) while >> > reading 69632 bytes from PEB 2254:192512, read only 69632 bytes, retry >> > [ 8.692460] ubi0 warning: ubi_io_read: error -74 (ECC error) while >> > reading 69632 bytes from PEB 2254:192512, read only 69632 bytes, retry >> > [ 8.708741] ubi0 warning: ubi_io_read: error -74 (ECC error) while >> > reading 69632 bytes from PEB 2254:192512, read only 69632 bytes, retry >> > ^^^^ non correctable ecc error on PEB 2254 - I verified that this was >> > not the first time this PEB has been used > > I suspect one of the bit in PEB 2254 to be stuck at 0 (even after > erasing the block the bit stays at 0). Have you tried to erase this > block (flash_erase /dev/mtd2 0x23380000 1) and dump it in raw mode > (nanddump -n -l 0x40000 -s 0x23380000 -f /tmp/dump /dev/mtd2)? Boris, I examined the bad PEB on several boards now that I have reproduced this issue with and found no stuck bits (no 0's following erase, no 1's following erase and raw write all ff's). So in this case it doesn't appear to be a bad block. Incidentally for UBI/UBIFS, what is in charge of detecting bad blocks, how are they detected, and when/how are they marked? > >> > >> > I've cc'd Huang, Elie, and Brian who were involved in the patch to >> > detect bit-flips in gpmi-nand.c reads - perhaps they have some more >> > ideas. I find it interesting that in one case that patch resolves the >> > issue and in the other it does not. > > I posted a slightly reworked version of Huang's patch [1] a while ago > addressing the "account for bitflips in OOB area" problem, but maybe we > could do better (avoid this extra "read in raw mode" step, or use the > generic nand_check_erased_ecc_chunk() function when ECC bytes are > aligned). > > Best Regards, > > Boris > > [1]https://patchwork.ozlabs.org/patch/416543/ At this point I likely need to reproduce this problem with additional debugging enabled to show what last erased and/or wrote to the PEB's that are corrupt. I will also try your patch as well and see if that resolves anything. Regards, Tim ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: UBIFS corruption after power cut - possibly unstable bits issue? 2015-11-16 15:01 ` Tim Harvey @ 2015-11-30 21:58 ` Tim Harvey 2015-12-01 9:12 ` Boris Brezillon 0 siblings, 1 reply; 18+ messages in thread From: Tim Harvey @ 2015-11-30 21:58 UTC (permalink / raw) To: Boris Brezillon Cc: Richard Weinberger, Elie De Brauwer, Artem Bityutskiy, Adrian Hunter, linux-mtd, Huang Shijie, Brian Norris On Mon, Nov 16, 2015 at 7:01 AM, Tim Harvey <tharvey@gateworks.com> wrote: > On Tue, Nov 3, 2015 at 5:38 AM, Boris Brezillon > <boris.brezillon@free-electrons.com> wrote: >> Hi Tim, >> >> On Mon, 2 Nov 2015 12:31:11 -0800 >> Tim Harvey <tharvey@gateworks.com> wrote: >> >>> On Mon, Nov 2, 2015 at 12:27 PM, Tim Harvey <tharvey@gateworks.com> wrote: >>> > [ 8.635364] UBIFS (ubi0:0): recovery needed >>> > [ 8.676203] ubi0 warning: ubi_io_read: error -74 (ECC error) while >>> > reading 69632 bytes from PEB 2254:192512, read only 69632 bytes, retry >>> > [ 8.692460] ubi0 warning: ubi_io_read: error -74 (ECC error) while >>> > reading 69632 bytes from PEB 2254:192512, read only 69632 bytes, retry >>> > [ 8.708741] ubi0 warning: ubi_io_read: error -74 (ECC error) while >>> > reading 69632 bytes from PEB 2254:192512, read only 69632 bytes, retry >>> > ^^^^ non correctable ecc error on PEB 2254 - I verified that this was >>> > not the first time this PEB has been used >> >> I suspect one of the bit in PEB 2254 to be stuck at 0 (even after >> erasing the block the bit stays at 0). Have you tried to erase this >> block (flash_erase /dev/mtd2 0x23380000 1) and dump it in raw mode >> (nanddump -n -l 0x40000 -s 0x23380000 -f /tmp/dump /dev/mtd2)? > > Boris, > > I examined the bad PEB on several boards now that I have reproduced > this issue with and found no stuck bits (no 0's following erase, no > 1's following erase and raw write all ff's). > > So in this case it doesn't appear to be a bad block. Incidentally for > UBI/UBIFS, what is in charge of detecting bad blocks, how are they > detected, and when/how are they marked? > >> >>> > >>> > I've cc'd Huang, Elie, and Brian who were involved in the patch to >>> > detect bit-flips in gpmi-nand.c reads - perhaps they have some more >>> > ideas. I find it interesting that in one case that patch resolves the >>> > issue and in the other it does not. >> >> I posted a slightly reworked version of Huang's patch [1] a while ago >> addressing the "account for bitflips in OOB area" problem, but maybe we >> could do better (avoid this extra "read in raw mode" step, or use the >> generic nand_check_erased_ecc_chunk() function when ECC bytes are >> aligned). >> >> Best Regards, >> >> Boris >> >> [1]https://patchwork.ozlabs.org/patch/416543/ > > At this point I likely need to reproduce this problem with additional > debugging enabled to show what last erased and/or wrote to the PEB's > that are corrupt. I will also try your patch as well and see if that > resolves anything. > > Regards, > > Tim Boris, I tried your patch [1] on a week-long test over 10x IMX6 boards booting over 60K times across temperature ranges and the patch resolved many previous failures to mount rootfs errors (previously I would encounter around 1% failure to mount rootfs). In addition I saw no nand corruption where I would have expected to see it several times with those numbers so I suspect this may have resolved that as well. Can you re-submit your patch for inclusion and/or discussion? Regards, Tim [1]https://patchwork.ozlabs.org/patch/416543/ ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: UBIFS corruption after power cut - possibly unstable bits issue? 2015-11-30 21:58 ` Tim Harvey @ 2015-12-01 9:12 ` Boris Brezillon 0 siblings, 0 replies; 18+ messages in thread From: Boris Brezillon @ 2015-12-01 9:12 UTC (permalink / raw) To: Tim Harvey Cc: Richard Weinberger, Elie De Brauwer, Artem Bityutskiy, Adrian Hunter, linux-mtd, Huang Shijie, Brian Norris On Mon, 30 Nov 2015 13:58:34 -0800 Tim Harvey <tharvey@gateworks.com> wrote: > On Mon, Nov 16, 2015 at 7:01 AM, Tim Harvey <tharvey@gateworks.com> wrote: > > On Tue, Nov 3, 2015 at 5:38 AM, Boris Brezillon > > <boris.brezillon@free-electrons.com> wrote: > >> Hi Tim, > >> > >> On Mon, 2 Nov 2015 12:31:11 -0800 > >> Tim Harvey <tharvey@gateworks.com> wrote: > >> > >>> On Mon, Nov 2, 2015 at 12:27 PM, Tim Harvey <tharvey@gateworks.com> wrote: > >>> > [ 8.635364] UBIFS (ubi0:0): recovery needed > >>> > [ 8.676203] ubi0 warning: ubi_io_read: error -74 (ECC error) while > >>> > reading 69632 bytes from PEB 2254:192512, read only 69632 bytes, retry > >>> > [ 8.692460] ubi0 warning: ubi_io_read: error -74 (ECC error) while > >>> > reading 69632 bytes from PEB 2254:192512, read only 69632 bytes, retry > >>> > [ 8.708741] ubi0 warning: ubi_io_read: error -74 (ECC error) while > >>> > reading 69632 bytes from PEB 2254:192512, read only 69632 bytes, retry > >>> > ^^^^ non correctable ecc error on PEB 2254 - I verified that this was > >>> > not the first time this PEB has been used > >> > >> I suspect one of the bit in PEB 2254 to be stuck at 0 (even after > >> erasing the block the bit stays at 0). Have you tried to erase this > >> block (flash_erase /dev/mtd2 0x23380000 1) and dump it in raw mode > >> (nanddump -n -l 0x40000 -s 0x23380000 -f /tmp/dump /dev/mtd2)? > > > > Boris, > > > > I examined the bad PEB on several boards now that I have reproduced > > this issue with and found no stuck bits (no 0's following erase, no > > 1's following erase and raw write all ff's). > > > > So in this case it doesn't appear to be a bad block. Incidentally for > > UBI/UBIFS, what is in charge of detecting bad blocks, how are they > > detected, and when/how are they marked? > > > >> > >>> > > >>> > I've cc'd Huang, Elie, and Brian who were involved in the patch to > >>> > detect bit-flips in gpmi-nand.c reads - perhaps they have some more > >>> > ideas. I find it interesting that in one case that patch resolves the > >>> > issue and in the other it does not. > >> > >> I posted a slightly reworked version of Huang's patch [1] a while ago > >> addressing the "account for bitflips in OOB area" problem, but maybe we > >> could do better (avoid this extra "read in raw mode" step, or use the > >> generic nand_check_erased_ecc_chunk() function when ECC bytes are > >> aligned). > >> > >> Best Regards, > >> > >> Boris > >> > >> [1]https://patchwork.ozlabs.org/patch/416543/ > > > > At this point I likely need to reproduce this problem with additional > > debugging enabled to show what last erased and/or wrote to the PEB's > > that are corrupt. I will also try your patch as well and see if that > > resolves anything. > > > > Regards, > > > > Tim > > Boris, > > I tried your patch [1] on a week-long test over 10x IMX6 boards > booting over 60K times across temperature ranges and the patch > resolved many previous failures to mount rootfs errors (previously I > would encounter around 1% failure to mount rootfs). In addition I saw > no nand corruption where I would have expected to see it several times > with those numbers so I suspect this may have resolved that as well. > > Can you re-submit your patch for inclusion and/or discussion? I'm quite busy on other topics lately, but feel free to adapt/resubmit the patch. Best Regards, Boris -- Boris Brezillon, Free Electrons Embedded Linux and Kernel engineering http://free-electrons.com ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: UBIFS corruption after power cut - possibly unstable bits issue? 2015-11-02 20:27 ` Tim Harvey 2015-11-02 20:31 ` Tim Harvey @ 2015-11-03 9:10 ` Artem Bityutskiy 1 sibling, 0 replies; 18+ messages in thread From: Artem Bityutskiy @ 2015-11-03 9:10 UTC (permalink / raw) To: Tim Harvey, Richard Weinberger Cc: Adrian Hunter, linux-mtd, Huang Shijie, Elie De Brauwer, Brian Norris On Mon, 2015-11-02 at 12:27 -0800, Tim Harvey wrote: > [ 9.507115] UBIFS error (ubi0:0 pid 1): ubifs_recover_master_node: > failed to recover master node > ^^^ master node recovery fails > ... > [ 9.618335] Kernel panic - not syncing: VFS: Unable to mount root > fs on unknown-block(0,0) You need to enable debugging messages, and/or add some custom prints to 'get_master_node()', which for some reason returns an error and makes 'ubifs_recover_master_node()' fail. Figure out where it fails, and we then can try figuring out why and may be fix it. Add custom printk's if needed. ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: UBIFS corruption after power cut - possibly unstable bits issue? 2015-10-26 20:01 ` Richard Weinberger 2015-10-26 20:31 ` Tim Harvey @ 2015-11-03 10:06 ` Michal Suchanek 2015-11-03 10:18 ` Ricard Wanderlof 2015-11-03 10:43 ` Artem Bityutskiy 1 sibling, 2 replies; 18+ messages in thread From: Michal Suchanek @ 2015-11-03 10:06 UTC (permalink / raw) To: Richard Weinberger Cc: Tim Harvey, Artem Bityutskiy, Adrian Hunter, MTD Maling List On 26 October 2015 at 21:01, Richard Weinberger <richard@nod.at> wrote: > Tim, > > Am 26.10.2015 um 20:37 schrieb Tim Harvey: >> >> [ 9.282782] r4:00000000 r3:00000000 >> [ 9.286909] UBIFS error (ubi0:0 pid 1): ubifs_recover_leb: corrupt >> empty space LEB 377:217088, corruption starts at 24762 >> [ 9.297900] UBIFS error (ubi0:0 pid 1): ubifs_scanned_corruption: >> corruption at LEB 377:241850 >> [ 9.306536] UBIFS error (ubi0:0 pid 1): ubifs_scanned_corruption: >> first 8192 bytes from LEB 377:241850 >> [ 9.315870] 00000000: ffffffef ffffffff ffffffff ffffffff ffffffff >> ffffffff ffffffff ffffffff ................................ >> [ 9.327374] 00000020: ffffffff ffffffff ffffffff ffffffff ffffffff >> ffffffff ffffffff ffffffff ................................ >> [ 9.338883] 00000040: ffffffff ffffffff ffffffff ffffffff ffffffff >> ffffffff ffffffff ffffffff ................................ >> [ 9.350389] 00000060: ffffffff ffffffff ffffffff ffffffff ffffffff >> ffffffff ffffffff ffffffff ................................ >> ... >> [ 9.352755] 00001fc0: ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff f >> fffffff ffffffff ................................ >> [ 9.352765] 00001fe0: ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff f >> fffffff ffffffff ................................ >> [ 9.352770] UBIFS error (ubi0:0 pid 1): ubifs_recover_leb: LEB 377 scanning f >> ailed >> [ 12.271485] VFS: Cannot open root device "ubi0:rootfs" or unknown-block(0,0): >> error -117 >> >> Thus far we have encountered this with a 16Gb MT29F16G08 and 'not' >> with a 2Gb MT29F2G08. The two parts have different geometries and the >> 16Gb part has a much larger block erase time (2ms) compared to the 2Gb >> (700us). > > gpmi-nand is not able to correct bit flips on erased pages. > This is why UBI is facing uncorrectable ECC errors and UBIFS gives up. > In March there was an attempt to fix that in software. > But no mainline ready solution was presented so far: > http://lists.infradead.org/pipermail/linux-mtd/2014-March/052521.html > > It is not clear whether to implement this directly in gpmi-nand or MTD core. > Currently UBIFS assumes that empty spaces must contain only 0xff octets. > A naive approach would be removing that check from UBIFS, bit this can have > disastrous consequences as UBIFS's recovery algorithm relies on that. Hello, it has been pointed out that this assumption on the part of UBI that erased pages are composed of 0xff bytes is just wrong. Here it is wrong in the sense that the pages are not bit-perfect but there are other reasons why it may go wrong. - if the nand controller uses randomization layer the pages are 0xff bytes transformed by the randomization layer - if MTD grows full-disk encryption then reading a page of 0xff bytes will yield what this page decrypts to using current encryption scheme The layering is wrong here. The MTD core should provide a function to check if a physical page is empty and the driver should provide driver-specific implementation if needed. Thanks Michal ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: UBIFS corruption after power cut - possibly unstable bits issue? 2015-11-03 10:06 ` Michal Suchanek @ 2015-11-03 10:18 ` Ricard Wanderlof 2015-11-03 10:43 ` Artem Bityutskiy 1 sibling, 0 replies; 18+ messages in thread From: Ricard Wanderlof @ 2015-11-03 10:18 UTC (permalink / raw) To: Michal Suchanek Cc: Richard Weinberger, Tim Harvey, MTD Maling List, Adrian Hunter, Artem Bityutskiy On Tue, 3 Nov 2015, Michal Suchanek wrote: > it has been pointed out that this assumption on the part of UBI that > erased pages are composed of 0xff bytes is just wrong. > > Here it is wrong in the sense that the pages are not bit-perfect but > there are other reasons why it may go wrong. > > - if the nand controller uses randomization layer the pages are 0xff > bytes transformed by the randomization layer > > - if MTD grows full-disk encryption then reading a page of 0xff bytes > will yield what this page decrypts to using current encryption scheme > > The layering is wrong here. The MTD core should provide a function to > check if a physical page is empty and the driver should provide > driver-specific implementation if needed. In a sense, doesn't the driver already do that in that an erased page with bitflips should be corrected by ECC so that there are no bitflips apparent? If, as many hardware controllers do, the ECC check for an erased page fails, then the driver should still be able to determine that the page in question is erased, with bitflips, and should be able to provide a true 0xff page upward. I agree though that an erased-page API function would be good too, as it moves the guesswork to the right place (the actual driver). /Ricard -- Ricard Wolf Wanderlöf ricardw(at)axis.com Axis Communications AB, Lund, Sweden www.axis.com Phone +46 46 272 2016 Fax +46 46 13 61 30 ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: UBIFS corruption after power cut - possibly unstable bits issue? 2015-11-03 10:06 ` Michal Suchanek 2015-11-03 10:18 ` Ricard Wanderlof @ 2015-11-03 10:43 ` Artem Bityutskiy 1 sibling, 0 replies; 18+ messages in thread From: Artem Bityutskiy @ 2015-11-03 10:43 UTC (permalink / raw) To: Michal Suchanek, Richard Weinberger Cc: Tim Harvey, Adrian Hunter, MTD Maling List On Tue, 2015-11-03 at 11:06 +0100, Michal Suchanek wrote: > it has been pointed out that this assumption on the part of UBI that > erased pages are composed of 0xff bytes is just wrong. > > Here it is wrong in the sense that the pages are not bit-perfect but > there are other reasons why it may go wrong. > > - if the nand controller uses randomization layer the pages are 0xff > bytes transformed by the randomization layer > > - if MTD grows full-disk encryption then reading a page of 0xff bytes > will yield what this page decrypts to using current encryption scheme > > The layering is wrong here. The MTD core should provide a function to > check if a physical page is empty and the driver should provide > driver-specific implementation if needed. Yes, this was discussed, and the missing part is patches. The reason I am replying to you is that I find the wording you are using to be inaccurate. The assumptions were not wrong when this stuff was designed, implemented, tested and used in production. And for that hardware they are not wrong. Nowadays the situation is different, the HW is different. It is not the same as the HW UBIFS was implemented and tested for. UBIFS needs to be adjusted in order to support that hardware. "Wrong wrong wrong" is incorrect wording, please do not use it. Please, use "needs more work", "needs extending", "needs to support bit-flips in empty space", etc. And layering is not wrong. We do not implement useless MTD methods for something which does not exist in the Linux kernel. Please, introduce the encryption support, introduce that driver with that method, and then refactor the code. Please, do not use word "wrong", because it is not the correct word for this situation, and it is unnecessarily negative. Thank you! ^ permalink raw reply [flat|nested] 18+ messages in thread
end of thread, other threads:[~2015-12-01 9:13 UTC | newest] Thread overview: 18+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2015-10-26 19:37 UBIFS corruption after power cut - possibly unstable bits issue? Tim Harvey 2015-10-26 20:01 ` Richard Weinberger 2015-10-26 20:31 ` Tim Harvey 2015-10-26 21:41 ` Richard Weinberger 2015-10-27 19:01 ` Tim Harvey 2015-10-27 19:52 ` Richard Weinberger 2015-11-02 20:27 ` Tim Harvey 2015-11-02 20:31 ` Tim Harvey 2015-11-02 21:31 ` Richard Weinberger 2015-11-02 22:11 ` Brian Norris 2015-11-03 13:38 ` Boris Brezillon 2015-11-16 15:01 ` Tim Harvey 2015-11-30 21:58 ` Tim Harvey 2015-12-01 9:12 ` Boris Brezillon 2015-11-03 9:10 ` Artem Bityutskiy 2015-11-03 10:06 ` Michal Suchanek 2015-11-03 10:18 ` Ricard Wanderlof 2015-11-03 10:43 ` Artem Bityutskiy
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).