ubi : kernel panic on erroneous block


* ubi : kernel panic on erroneous block
@ 2010-08-10  9:56 Matthieu CASTET
  2010-08-10 11:42 ` Artem Bityutskiy
  2010-08-29 20:46 ` Artem Bityutskiy
  0 siblings, 2 replies; 5+ messages in thread
From: Matthieu CASTET @ 2010-08-10  9:56 UTC (permalink / raw)
  To: linux-mtd@lists.infradead.org, Artem Bityutskiy

[-- Attachment #1: Type: text/plain, Size: 14390 bytes --]

Hi,


While running tests with UBIFS I found the following crash.
One block is unstable after a power cut (some reads fail with an ECC error, 
correctable or not). This is due to an interrupted write or erase.

Our test first reads the whole UBI volume (cat /dev/ubi3_0 > /dev/null) 
to force a complete read of it.

In this case a correctable ECC error is detected, and scrubbing is scheduled.
But during ubi_eba_copy_leb the block becomes uncorrectable and is added to 
the erroneous list.
When UBIFS is mounted, the read path doesn't check that the block is erroneous 
and returns data.
The block is scheduled for scrubbing again, but prot_queue_del crashes because 
we already removed it during the first scrubbing attempt.
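
To illustrate the failure mode, here is a simplified user-space model. It is not the UBI code, and `struct entry`, `queue_add` and `queue_del_checked` are hypothetical names. Unlinking an entry from a doubly linked queue a second time would write through stale or NULL link pointers, whereas a membership flag turns the second removal into an error the caller can detect.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical queue entry; the real UBI entry is struct ubi_wl_entry. */
struct entry {
	struct entry *next, *prev;
	int pnum;     /* physical eraseblock number */
	int on_queue; /* non-zero while linked into a queue */
};

/* Initialise an empty circular queue headed by @head. */
void queue_init(struct entry *head)
{
	head->next = head->prev = head;
	head->on_queue = 0;
}

/* Link @e at the tail of the queue headed by @head. */
void queue_add(struct entry *head, struct entry *e)
{
	assert(!e->on_queue); /* adding twice would corrupt the links */
	e->prev = head->prev;
	e->next = head;
	head->prev->next = e;
	head->prev = e;
	e->on_queue = 1;
}

/*
 * Unlink @e. Returns 0 on success, -1 if @e is not on a queue.
 * Without the check, a second call would dereference the NULL
 * pointers left behind by the first one.
 */
int queue_del_checked(struct entry *e)
{
	if (!e->on_queue)
		return -1;
	e->prev->next = e->next;
	e->next->prev = e->prev;
	e->next = e->prev = NULL;
	e->on_queue = 0;
	return 0;
}
```

In this model a caller that sees -1 knows the entry was already taken off the queue by an earlier request and can skip it.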

Here is an attempt to fix the problem. It is ugly, and I haven't tried it yet: 
I erased my corrupted flash by accident.

Another solution could be to add the test in ubi_wl_scrub_peb, but I 
don't think it is OK to return data from an erroneous block.

Yet another solution could be to unmap the block (reads would return 0xff), 
but this may break the upper layer?
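
For reference, the read-side behaviours discussed here can be modelled in user space as follows. This is only a sketch with hypothetical names (`struct leb_map`, `read_leb_checked`), not the kernel code: an unmapped LEB reads as 0xFF, a LEB whose PEB is known to be erroneous fails with -EBADMSG, and anything else is copied from the simulated flash.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

#define LEB_UNMAPPED (-1)

/* Hypothetical per-LEB state; not the kernel's data structures. */
struct leb_map {
	int pnum;                  /* mapped PEB, or LEB_UNMAPPED */
	int erroneous;             /* non-zero if the PEB is on the erroneous list */
	const unsigned char *data; /* simulated flash contents */
};

/*
 * Read @len bytes of the LEB described by @m into @buf.
 * Returns 0 on success or -EBADMSG if the backing PEB is known to be bad.
 */
int read_leb_checked(const struct leb_map *m, unsigned char *buf, size_t len)
{
	assert(m != NULL && buf != NULL);

	if (m->pnum == LEB_UNMAPPED) {
		/* Unmapped LEBs read as erased flash. */
		memset(buf, 0xFF, len);
		return 0;
	}
	if (m->erroneous)
		return -EBADMSG;
	memcpy(buf, m->data, len);
	return 0;
}
```

The point of the middle branch is that the caller gets an explicit error rather than data that may differ from one read to the next.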

Matthieu


[    6.613769] UBI DBG (pid 266): ubi_scan: scanning is finished
[    6.627283] UBI: attached mtd3 to ubi3
[    6.630876] UBI: MTD device name:            "P6system"
[    6.636130] UBI: MTD device size:            32 MiB
[    6.640948] UBI: number of good PEBs:        256
[    6.645567] UBI: number of bad PEBs:         0
[    6.649979] UBI: max. allowed volumes:       128
[    6.654591] UBI: wear-leveling threshold:    4096
[    6.659274] UBI: number of internal volumes: 1
[    6.663701] UBI: number of user volumes:     1
[    6.668138] UBI: available PEBs:             0
[    6.672559] UBI: total number of reserved PEBs: 256
[    6.677433] UBI: number of PEBs reserved for bad PEB handling: 2
[    6.683416] UBI: max/mean erase counter: 3717/3558
[    6.688201] UBI: image sequence number: 1403635655
[    6.693008] UBI: background thread "ubi_bgt3d" started, PID 269
UBI device number 3, total 256 LEBs (32505856 bytes, 31.0 MiB), 
available 0 LEBs (0 bytes), LEB size 126976 bytes (124.0 KiB)

----> cat /dev/ubi3_0 > /dev/null

[    6.908524] BA315_STATUS_DEC_ERR : 0 4 on 24525
[    6.912907] ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 
ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 
ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
[    6.931062] UBI DBG (pid 272): ubi_io_read: fixable bit-flip detected 
at PEB 189
[    6.938448] UBI DBG (pid 272): ubi_wl_scrub_peb: schedule PEB 189 for 
scrubbing
[    6.947677] BA315_STATUS_DEC_ERR : 512 4 on 24525
[    6.952226] ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 
ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 
ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
[    6.976651] UBI error: ubi_io_read: error -74 while reading 126976 
bytes from PEB 189:4096, read 126976 bytes
[    6.986429] [<c00279f0>] (dump_stack+0x0/0x14) from [<c0160fa8>] 
(ubi_io_read+0xf0/0x258)
[    6.994594] [<c0160eb8>] (ubi_io_read+0x0/0x258) from [<c01607e8>] 
(ubi_eba_copy_leb+0x204/0x58c)
[    7.003434] [<c01605e4>] (ubi_eba_copy_leb+0x0/0x58c) from 
[<c01638e8>] (wear_leveling_worker+0x2e4/0x630)
[    7.013074] [<c0163604>] (wear_leveling_worker+0x0/0x630) from 
[<c0162b0c>] (do_work+0x94/0xe8)
[    7.021758] [<c0162a78>] (do_work+0x0/0xe8) from [<c0163cc4>] 
(ubi_thread+0x90/0x118)
[    7.029576]  r7:c789bc50 r6:00000000 r5:c7848000 r4:c789b800
[    7.035222] [<c0163c34>] (ubi_thread+0x0/0x118) from [<c0047204>] 
(kthread+0x50/0x7c)
[    7.043036] [<c00471b4>] (kthread+0x0/0x7c) from [<c00360e8>] 
(do_exit+0x0/0x6ac)
[    7.050505]  r5:00000000 r4:00000000
[    7.054071] UBI warning: ubi_eba_copy_leb: error -74 while reading 
data from PEB 189

echo sleeping 38
real	0m 3.52s
user	0m 0.01s
sys	0m 0.25s

----> mounting ubifs on /dev/ubi3_0


info.type      = 0x04
info.flags     = 0x00000400
info.size      = 0x02000000
info.erasesize = 0x00020000
info.writesize = 2048
info.oobsize   = 64
ecc.eccbytes   = 12
ecc.eccpos     = 2,3,4,5,6,7,8,9,10,11,12,13,

Please press Enter to activate this console. starting pid 277, tty '': 
'/bin/sh'


BusyBox v1.16.0 (2010-06-30 18:04:36 CEST) b[   10.423398] UBIFS: 
recovery needed
uilt-in shell (ash)
Enter 'help' for a list of built-in[   10.431750] BA315_STATUS_DEC_ERR : 
512 4 on 24525
  commands.

# echo sleepin[   10.438532] ff g 38
sleeping 38
# ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 
ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 
ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
[   10.467210] UBI error: ubi_io_read: error -74 while reading 126976 
bytes from PEB 189:4096, read 126976 bytes
[   10.477007] [<c00279f0>] (dump_stack+0x0/0x14) from [<c0160fa8>] 
(ubi_io_read+0xf0/0x258)
[   10.485144] [<c0160eb8>] (ubi_io_read+0x0/0x258) from [<c016035c>] 
(ubi_eba_read_leb+0x1a0/0x428)
[   10.494005] [<c01601bc>] (ubi_eba_read_leb+0x0/0x428) from 
[<c015e3c0>] (ubi_leb_read+0xe8/0x138)
[   10.502859] [<c015e2d8>] (ubi_leb_read+0x0/0x138) from [<c00d6918>] 
(ubifs_start_scan+0x7c/0xf4)
[   10.511633]  r7:c79f3000 r6:00000000 r5:c798b8e0 r4:00000000
[   10.517276] [<c00d689c>] (ubifs_start_scan+0x0/0xf4) from 
[<c00d6b4c>] (ubifs_scan+0x2c/0x298)
[   10.525878]  r8:00000003 r7:c79f3000 r6:00000000 r5:c8d01000 r4:0001f000
[   10.532560] [<c00d6b20>] (ubifs_scan+0x0/0x298) from [<c00d71dc>] 
(ubifs_replay_journal+0x14c/0x13a4)
[   10.541769] [<c00d7090>] (ubifs_replay_journal+0x0/0x13a4) from 
[<c00cdd68>] (ubifs_fill_super+0xb84/0x1054)
[   10.551580] [<c00cd1e4>] (ubifs_fill_super+0x0/0x1054) from 
[<c00ced04>] (ubifs_get_sb+0xc4/0x2ac)
[   10.560525] [<c00cec40>] (ubifs_get_sb+0x0/0x2ac) from [<c007f04c>] 
(vfs_kern_mount+0x58/0x94)
[   10.569124] [<c007eff4>] (vfs_kern_mount+0x0/0x94) from [<c007f0e8>] 
(do_kern_mount+0x40/0xe8)
[   10.577736]  r8:c79fa000 r7:c02253ec r6:00000000 r5:c78fb000 r4:00000000
[   10.584408] [<c007f0a8>] (do_kern_mount+0x0/0xe8) from [<c0095628>] 
(do_new_mount+0x68/0x8c)
[   10.592833]  r8:00000000 r7:0000000a r6:c784bef0 r5:00000000 r4:c79fa000
[   10.599519] [<c00955c0>] (do_new_mount+0x0/0x8c) from [<c00957a8>] 
(do_mount+0x15c/0x1b8)
[   10.607694]  r7:c79fa000 r6:c78fb000 r5:c78d9000 r4:00000404
[   10.613327] [<c009564c>] (do_mount+0x0/0x1b8) from [<c0095890>] 
(sys_mount+0x8c/0xd4)
[   10.621144] [<c0095804>] (sys_mount+0x0/0xd4) from [<c0023c00>] 
(ret_fast_syscall+0x0/0x2c)
[   10.629483]  r7:00000015 r6:00008840 r5:00000000 r4:00000000
[   10.638307] BA315_STATUS_DEC_ERR : 0 4 on 24525
[   10.642677] ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 
ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 
ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
[   10.667078] UBI DBG (pid 273): ubi_io_read: fixable bit-flip detected 
at PEB 189
[   10.674321] UBI DBG (pid 273): ubi_wl_scrub_peb: schedule PEB 189 for 
scrubbing
[   10.681652] Unable to handle kernel NULL pointer dereference at 
virtual address 00000000
[   10.689703] pgd = c79e4000
[   10.692381] [00000000] *pgd=47841031, *pte=00000000, *ppte=00000000
[   10.698639] Internal error: Oops: 817 [#1]
[   10.702713] Modules linked in:
[   10.705761] CPU: 0    Not tainted  (2.6.27.47-parrot-dirty #212)
[   10.711769] PC is at prot_queue_del+0x2c/0x50
[   10.716100] LR is at ubi_wl_scrub_peb+0xec/0x13c
[   10.720700] pc : [<c0162430>]    lr : [<c01635b4>]    psr: a0000013
[   10.720716] sp : c784ba70  ip : c78ff290  fp : c784ba7c
[   10.732157] r10: 00000000  r9 : 00000003  r8 : 000000bd
[   10.737371] r7 : c789bbcc  r6 : c789bbc0  r5 : c789b800  r4 : c78ff290
[   10.743883] r3 : 00100100  r2 : 00000001  r1 : 00000000  r0 : ffffffed
[   10.750398] Flags: NzCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM 
Segment user
[   10.757519] Control: 0005317f  Table: 479e4000  DAC: 00000015
[   10.763249] Process endurance (pid: 273, stack limit = 0xc784a268)
[   10.769416] Stack: (0xc784ba70 to 0xc784c000)
[   10.773753] ba60:                                     c784baa4 
c784ba80 c01635b4 c0162414
[   10.782004] ba80: c8d01000 00000000 c8d01000 c789b800 c73fd000 
000000bd c784baec c784baa8
[   10.790254] baa0: c01603bc c01634d8 0001f000 00000001 c0236138 
c8d01000 00000001 00000000
[   10.798505] bac0: 00000000 c73fd000 c8d01000 00000000 0001f000 
00000003 ffffff8b 00000000
[   10.806755] bae0: c784bb1c c784baf0 c015e3c0 c01601cc 00000000 
0001f000 00000000 c0079a30
[   10.815005] bb00: 00000000 c798b8e0 00000000 c79f3000 c784bb54 
c784bb20 c00d6918 c015e2e8
[   10.823256] bb20: 0001f000 00000000 c0023874 c0023010 c79f3000 
c8d01000 c79f3000 c798b8e0
[   10.831506] bb40: c79f3000 00000003 c784bbac c784bb58 c00e3650 
c00d68ac 00000004 c784bbac
[   10.839757] bb60: c784bb78 c8d01000 c00d6d08 c00d6d14 00000000 
00000000 c8d01000 0001f000
[   10.848007] bb80: c784bbac 00000000 c79f3850 c798b8e0 c79f3000 
00000003 ffffff8b 00000000
[   10.856257] bba0: c784bbf4 c784bbb0 c00e444c c00e3624 00000000 
c784bbc0 c8d01000 c00d66b4
[   10.864508] bbc0: 00005800 00000001 c79f3870 00000000 c79f3850 
c79f3870 c79f3000 00000003
[   10.872758] bbe0: ffffff8b c79f3000 c784bd34 c784bbf8 c00d7c20 
c00e4394 00000001 00000000
[   10.881009] bc00: 00000000 c02305c0 c784a000 000000ef c784bc74 
c784bc20 c005ee14 c005de64
[   10.889259] bc20: 00000001 00000044 000200d2 00000000 c79f3850 
00000001 c0007228 c8be5000
[   10.897509] bc40: 00000000 00000000 c8d01000 c79f376c c784a000 
ffffff8b 00000000 00000000
[   10.905760] bc60: c789bbb0 c7972760 c784bcac c784bc78 c7972760 
c789b800 00000007 00000000
[   10.914010] bc80: 00000000 c784bd00 c7972760 c7972760 c789b800 
00000007 c784bcbc c784bca8
[   10.922261] bca0: c015edcc c00792bc c789b800 c73fd000 c784bce4 
c784bcc0 c015eea0 c015ed74
[   10.930511] bcc0: 0001f000 00000001 00000018 c79f3000 00000000 
00000007 c784bcf4 c784bce8
[   10.938761] bce0: c015e15c c015ee30 c784bd34 c784bcf8 c00e08d8 
c015e100 0000000b 00000000
[   10.947012] bd00: c891b00b 00000000 00000000 000000eb 00000000 
c79f3870 c79f3000 000000eb
[   10.955262] bd20: 00000000 00000000 c784bdec c784bd38 c00cdd68 
c00d70a0 00000000 c00fccdc
[   10.963513] bd40: c780c1a0 c780ff60 00000000 c784bdc0 00000001 
00000003 c784bef0 00000000
[   10.971763] bd60: 0001e5a0 00000000 00117000 00000000 c784bd94 
00000003 01d2f000 00000000
[   10.980014] bd80: 00000000 c79b8c00 c73ffd20 c79f3724 c79f3008 
c79f37f8 c79f36a4 c79f371c
[   10.988264] bda0: c00cbf10 c0225404 c784bdec 00000000 00000001 
00000000 0007c001 00000000
[   10.996514] bdc0: c780c1a0 c79b8c00 c73ffd20 c79b8c00 00000000 
c02253ec c784bef0 00000000
[   11.004765] bde0: c784be6c c784bdf0 c00ced04 c00cd1f4 00000000 
c023997c 00000003 00000000
[   11.013015] be00: 000000fa c784be10 01e46000 00000000 c78fb000 
00000003 00000000 00000000
[   11.021266] be20: 00000001 0001f000 00000006 c73fd18c 0fd00001 
c780cf20 c784be6c c784be48
[   11.029516] be40: c0093e4c 00000000 c78fb000 c780cf20 00000000 
c02253ec c784bef0 0000000a
[   11.037766] be60: c784be9c c784be70 c007f04c c00cec50 c780cf20 
c784be80 c00928b0 00000000
[   11.046017] be80: c78fb000 00000000 c02253ec c79fa000 c784bec4 
c784bea0 c007f0e8 c007f004
[   11.054267] bea0: c784bec4 c79fa000 00000000 c784bef0 0000000a 
00000000 c784bee4 c784bec8
[   11.062518] bec0: c0095628 c007f0b8 00000404 c78d9000 c78fb000 
c79fa000 c784bf6c c784bee8
[   11.070768] bee0: c00957a8 c00955d0 c78fb000 00000000 c780c3a0 
c74844b8 c784bf6c c784bf08
[   11.079018] bf00: c002382c 00000001 00000001 00000000 00000000 
00000000 0000038c c784bf7c
[   11.087269] bf20: 00001000 c78fb000 c0023d84 c784a000 40068008 
c784bf6c 000200d0 c784bf50
[   11.095519] bf40: 00000000 00000000 c78d9000 0000a38c 00000404 
c0023d84 c784a000 40068008
[   11.103770] bf60: c784bfa4 c784bf70 c0095890 c009565c 00000000 
c78d9000 beb41f78 c78fb000
[   11.112020] bf80: c79fa000 00000000 00000000 00000000 00008840 
00000015 00000000 c784bfa8
[   11.120270] bfa0: c0023c00 c0095814 00000000 00000000 0000a38c 
0000a384 0000a398 00000404
[   11.128521] bfc0: 00000000 00000000 00008840 00000015 000086cc 
00000001 40068008 00008e60
[   11.136771] bfe0: 4001b424 beb41d10 00008cd8 4001b438 20000010 
0000a38c 7d195ec9 5c1aa6a4
[   11.145022] Backtrace:
[   11.147455] [<c0162404>] (prot_queue_del+0x0/0x50) from [<c01635b4>] 
(ubi_wl_scrub_peb+0xec/0x13c)
[   11.156400] [<c01634c8>] (ubi_wl_scrub_peb+0x0/0x13c) from 
[<c01603bc>] (ubi_eba_read_leb+0x200/0x428)
[   11.165693]  r8:000000bd r7:c73fd000 r6:c789b800 r5:c8d01000 r4:00000000
[   11.172379] [<c01601bc>] (ubi_eba_read_leb+0x0/0x428) from 
[<c015e3c0>] (ubi_leb_read+0xe8/0x138)
[   11.181237] [<c015e2d8>] (ubi_leb_read+0x0/0x138) from [<c00d6918>] 
(ubifs_start_scan+0x7c/0xf4)
[   11.190010]  r7:c79f3000 r6:00000000 r5:c798b8e0 r4:00000000
[   11.195654] [<c00d689c>] (ubifs_start_scan+0x0/0xf4) from 
[<c00e3650>] (ubifs_recover_leb+0x3c/0x730)
[   11.204861]  r8:00000003 r7:c79f3000 r6:c798b8e0 r5:c79f3000 r4:c8d01000
[   11.211547] [<c00e3614>] (ubifs_recover_leb+0x0/0x730) from 
[<c00e444c>] (ubifs_recover_log_leb+0xc8/0x2dc)
[   11.221274] [<c00e4384>] (ubifs_recover_log_leb+0x0/0x2dc) from 
[<c00d7c20>] (ubifs_replay_journal+0xb90/0x13a4)
[   11.231435] [<c00d7090>] (ubifs_replay_journal+0x0/0x13a4) from 
[<c00cdd68>] (ubifs_fill_super+0xb84/0x1054)
[   11.241249] [<c00cd1e4>] (ubifs_fill_super+0x0/0x1054) from 
[<c00ced04>] (ubifs_get_sb+0xc4/0x2ac)
[   11.250194] [<c00cec40>] (ubifs_get_sb+0x0/0x2ac) from [<c007f04c>] 
(vfs_kern_mount+0x58/0x94)
[   11.258793] [<c007eff4>] (vfs_kern_mount+0x0/0x94) from [<c007f0e8>] 
(do_kern_mount+0x40/0xe8)
[   11.267390]  r8:c79fa000 r7:c02253ec r6:00000000 r5:c78fb000 r4:00000000
[   11.274077] [<c007f0a8>] (do_kern_mount+0x0/0xe8) from [<c0095628>] 
(do_new_mount+0x68/0x8c)
[   11.282501]  r8:00000000 r7:0000000a r6:c784bef0 r5:00000000 r4:c79fa000
[   11.289188] [<c00955c0>] (do_new_mount+0x0/0x8c) from [<c00957a8>] 
(do_mount+0x15c/0x1b8)
[   11.297352]  r7:c79fa000 r6:c78fb000 r5:c78d9000 r4:00000404
[   11.302997] [<c009564c>] (do_mount+0x0/0x1b8) from [<c0095890>] 
(sys_mount+0x8c/0xd4)
[   11.310813] [<c0095804>] (sys_mount+0x0/0xd4) from [<c0023c00>] 
(ret_fast_syscall+0x0/0x2c)
[   11.319151]  r7:00000015 r6:00008840 r5:00000000 r4:00000000
[   11.324794] Code: 089da800 e59c1004 e59c2000 e59f3018 (e5812000)
[   11.330924] Kernel panic - not syncing: Fatal exception

[-- Attachment #2: ubifs.diff --]
[-- Type: text/x-diff, Size: 1530 bytes --]

diff --git a/drivers/mtd/ubi/eba.c b/drivers/mtd/ubi/eba.c
index 7fbe0d7..289c003 100644
--- a/drivers/mtd/ubi/eba.c
+++ b/drivers/mtd/ubi/eba.c
@@ -367,6 +367,7 @@ out_unlock:
  * returned for any volume type if an ECC error was detected by the MTD device
  * driver. Other negative error cored may be returned in case of other errors.
  */
+int in_wl_tree(struct ubi_wl_entry *e, struct rb_root *root);
 int ubi_eba_read_leb(struct ubi_device *ubi, struct ubi_volume *vol, int lnum,
 		     void *buf, int offset, int len, int check)
 {
@@ -392,6 +393,19 @@ int ubi_eba_read_leb(struct ubi_device *ubi, struct ubi_volume *vol, int lnum,
 		memset(buf, 0xFF, len);
 		return 0;
 	}
+	{
+		struct ubi_wl_entry *e;
+		int bad;
+
+		spin_lock(&ubi->wl_lock);
+		e = ubi->lookuptbl[pnum];
+		bad = in_wl_tree(e, &ubi->erroneous);
+		spin_unlock(&ubi->wl_lock);
+		/* we should never end up reading an erroneous block */
+		if (bad) {
+			return -EBADMSG;
+		}
+	}
 
 	dbg_eba("read %d bytes from offset %d of LEB %d:%d, PEB %d",
 		len, offset, vol_id, lnum, pnum);
diff --git a/drivers/mtd/ubi/wl.c b/drivers/mtd/ubi/wl.c
index 10b6100..3c4c3ed 100644
--- a/drivers/mtd/ubi/wl.c
+++ b/drivers/mtd/ubi/wl.c
@@ -292,7 +292,7 @@ static int produce_free_peb(struct ubi_device *ubi)
  * This function returns non-zero if @e is in the @root RB-tree and zero if it
  * is not.
  */
-static int in_wl_tree(struct ubi_wl_entry *e, struct rb_root *root)
+/*static */int in_wl_tree(struct ubi_wl_entry *e, struct rb_root *root)
 {
 	struct rb_node *p;
 


* Re: ubi : kernel panic on erroneous block
  2010-08-10  9:56 ubi : kernel panic on erroneous block Matthieu CASTET
@ 2010-08-10 11:42 ` Artem Bityutskiy
  2010-08-23 13:30   ` Matthieu CASTET
  2010-08-29 20:46 ` Artem Bityutskiy
  1 sibling, 1 reply; 5+ messages in thread
From: Artem Bityutskiy @ 2010-08-10 11:42 UTC (permalink / raw)
  To: Matthieu CASTET, Adrian.Hunter
  Cc: Artem Bityutskiy, linux-mtd@lists.infradead.org

On Tue, 2010-08-10 at 11:56 +0200, Matthieu CASTET wrote:
> Hi,
> 
> 
> when running test with ubifs I found the following crash.
> One block is instable (some read fails with ecc error correctable or 
> not) after a power cut. This is due to interrupted write or erase.
> 
> Our test do first a read of the ubi volume (cat /dev/ubi3_0 > /dev/null) 
> to force complete read of it.
> 
> In this case ecc correctable is detected, and scrubbing is scheduled
> But ubi_eba_copy_leb: the block become uncorrectable and added to 
> erroneous list.
> When mounting ubifs read doesn't check that it is erroneous and return data.
> It is added again for scrubbing, but prot_queue_del crash because we 
> already remove it in the first scrubbing try.
> 
> Here an attempt to fix the problem. This is ugly. I didn't try it yet. I 
>   erased my corrupted flash by accident.
> 
> One other solution could be to add the test in ubi_wl_scrub_peb, but I 
> don't think it is ok to return data on erroneous block.
> 
> An other solution could be to unmap the block (read will return 0xff), 
> but this may break upper layer ?

Matthieu, unfortunately I'm on holiday, so I cannot really look at this.
And I already have a lot of UBI/UBIFS issues waiting for me to look at.
I think I'll start looking at things only in mid-September/October.
Sorry for this. But maybe Adrian could take a look at this, if he has
some time? :-)

Artem.


* Re: ubi : kernel panic on erroneous block
  2010-08-10 11:42 ` Artem Bityutskiy
@ 2010-08-23 13:30   ` Matthieu CASTET
  0 siblings, 0 replies; 5+ messages in thread
From: Matthieu CASTET @ 2010-08-23 13:30 UTC (permalink / raw)
  To: dedekind1@gmail.com
  Cc: Artem Bityutskiy, linux-mtd@lists.infradead.org,
	Adrian.Hunter@nokia.com

Hi Artem,

Artem Bityutskiy a écrit :
> On Tue, 2010-08-10 at 11:56 +0200, Matthieu CASTET wrote:
>> Hi,
>>
> 
> Matthieu, unfortunately I'm on holidays so cannot really look at this.
> And I already have a lot of UBI/UBIFS issues waiting for me to look at.
> I think I'll start looking at the things only in mid-September/October.
> Sorry for this. But may be Adrian could take a look at this, if he has
> some time? :-)
I don't know whether you are back from holiday, but since you are posting on 
the ML, I will post my further investigation.


I have done more tests on this flash and I got other failures.

The problem seems to be in the handling of interrupted writes. On some NAND we 
use, the page becomes unstable and reads can return varying values. The 
manufacturer told us we should not use a page where a write was interrupted; 
it needs an erase cycle before it can be used again.

On mounting, for a page where a write was interrupted by a power cut:
- I saw ECC errors. In this case UBIFS should reject the page in its recovery 
handling and everything should be fine.
- I saw correctable errors. In this case UBI moves the block, unless the 
next read, during the copy, returns an ECC error. If the ECC error appears 
during the copy, we see it too late: UBIFS recovery is already done.
   - In this case UBIFS recovery can reject the page if the data is not OK (bad 
CRC, ...). Note that in this case we did the scrubbing move for nothing.
- I saw pages that return correct data (ECC and CRC OK), but later 
return (un)correctable errors. Again this is too late [1]: recovery is 
already done.


It seems UBI/UBIFS doesn't identify pages with an interrupted write during 
scanning/mount ATM. It only relies on ECC/CRC, but this is not enough 
for an unstable page: it can be good (or have a 1-bit error) on one read and 
bad on the next.
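
To make the point concrete, here is a small user-space sketch. It is an illustration with hypothetical names (`classify_page` and the two enums), not existing UBI code: a page is only trusted if several consecutive reads agree, because a single clean read says nothing about an unstable page.

```c
#include <assert.h>
#include <stddef.h>

/* Outcome of one page read, as seen by the layer above the NAND driver. */
enum read_result { READ_OK, READ_BITFLIP, READ_ECC_ERROR };

enum page_verdict { PAGE_STABLE_GOOD, PAGE_STABLE_BAD, PAGE_UNSTABLE };

/*
 * Classify a page from @n consecutive read results (@n must be >= 1).
 * Any disagreement between reads marks the page as unstable. A page that
 * consistently reads with a correctable bit-flip still yields valid data,
 * so it counts as good here.
 */
enum page_verdict classify_page(const enum read_result *r, size_t n)
{
	size_t i;

	assert(n > 0);
	for (i = 1; i < n; i++)
		if (r[i] != r[0])
			return PAGE_UNSTABLE;
	return r[0] == READ_ECC_ERROR ? PAGE_STABLE_BAD : PAGE_STABLE_GOOD;
}
```

Fed with the sequence seen on PEB 189 in the first log (bit-flip, ECC error, ECC error, bit-flip), this model reports the page as unstable, whereas any single one of those reads taken alone would suggest either "scrub it" or "reject it".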

So the problem is to identify pages with an interrupted write during scanning/mount.


For static volumes it should be easy with the interrupted flags.

There is the tricky case of a data move (for wear leveling or scrubbing): 
if the sqnum of the copy is the biggest, we should ignore it or copy it.
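
A sketch of the "ignore it" variant of that rule, with hypothetical names (`struct peb_copy`, `pick_leb_copy`) and a deliberately simplified model of what scanning sees. When two PEBs claim the same LEB, the one with the larger sequence number is newer, and a newer PEB that was written by a move is distrusted in favour of the older source. It models only the proposal above.

```c
#include <assert.h>

/* Hypothetical summary of one PEB found during scanning. */
struct peb_copy {
	int pnum;
	unsigned long long sqnum; /* larger means written later */
	int is_copy;              /* written by a wear-leveling/scrubbing move */
};

/*
 * Pick which of two PEBs mapped to the same LEB should be used.
 * The two sequence numbers are assumed to differ.
 */
const struct peb_copy *pick_leb_copy(const struct peb_copy *a,
				     const struct peb_copy *b)
{
	const struct peb_copy *newer, *older;

	assert(a->sqnum != b->sqnum);
	newer = a->sqnum > b->sqnum ? a : b;
	older = newer == a ? b : a;

	/* The move may have been interrupted: fall back to the source. */
	if (newer->is_copy)
		return older;
	return newer;
}
```

This only works while the source PEB of the move still exists, which is the situation right after a power cut in the middle of the copy.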


But for dynamic volumes/UBIFS that's another story. Maybe by using the UBI 
sqnum + the UBIFS journal it would be possible to do something.


Matthieu

PS: the same story happens for erase, but UBI should handle that correctly.

[1]

[   12.720244] UBIFS: un-mount UBI device 3, volume 0
[   12.760056] UBIFS: mounted UBI device 3, volume 0, name "system"
[   12.765919] UBIFS: file system size:   30601216 bytes (29884 KiB, 29 
MiB, 241 LEBs)
[   12.773642] UBIFS: journal size:       1523712 bytes (1488 KiB, 1 
MiB, 12 LEBs)
[   12.780868] UBIFS: media format:       w4/r0 (latest is w4/r0)
[   12.786668] UBIFS: default compressor: none
[   12.790852] UBIFS: reserved for root:  1445370 bytes (1411 KiB)
writing file '//mnt/dir06/file0046.bin' num=70, size=147120
writing file '//mnt/dir0c/file006c.bin' num=108, size=288146
[   13.491407] UBI error: ubi_io_read: error -74 while reading 60 bytes 
from PEB 106:129480, read 60 bytes
[   13.500785] [<c00279f0>] (dump_stack+0x0/0x14) from [<c0161040>] 
(ubi_io_read+0xf0/0x258)
[   13.508952] [<c0160f50>] (ubi_io_read+0x0/0x258) from [<c01603a0>] 
(ubi_eba_read_leb+0x1b4/0x490)
[   13.517791] [<c01601ec>] (ubi_eba_read_leb+0x0/0x490) from 
[<c015e3f0>] (ubi_leb_read+0xe8/0x138)
[   13.526649] [<c015e308>] (ubi_leb_read+0x0/0x138) from [<c00d0c48>] 
(ubifs_read_node+0x40/0x190)
[   13.535423]  r7:00000002 r6:00000000 r5:c78489a0 r4:c78489a0
[   13.541065] [<c00d0c08>] (ubifs_read_node+0x0/0x190) from 
[<c00d18b8>] (ubifs_read_node_wbuf+0x4c/0x204)
[   13.550547] [<c00d186c>] (ubifs_read_node_wbuf+0x0/0x204) from 
[<c00e6b60>] (ubifs_tnc_read_node+0x5c/0xf8)
[   13.560274] [<c00e6b04>] (ubifs_tnc_read_node+0x0/0xf8) from 
[<c00d32a8>] (matches_name+0x94/0xdc)
[   13.569218] [<c00d3214>] (matches_name+0x0/0xdc) from [<c00d3334>] 
(resolve_collision+0x44/0x204)
[   13.578074] [<c00d32f0>] (resolve_collision+0x0/0x204) from 
[<c00d45e4>] (ubifs_tnc_remove_nm+0xf0/0x108)
[   13.587615] [<c00d44f4>] (ubifs_tnc_remove_nm+0x0/0x108) from 
[<c00c7f08>] (ubifs_jnl_rename+0x4f8/0x70c)
[   13.597169] [<c00c7a10>] (ubifs_jnl_rename+0x0/0x70c) from 
[<c00caaf8>] (ubifs_rename+0x2b0/0x5e4)
[   13.606117] [<c00ca848>] (ubifs_rename+0x0/0x5e4) from [<c008581c>] 
(vfs_rename+0x238/0x270)
[   13.614538] [<c00855e4>] (vfs_rename+0x0/0x270) from [<c0086e54>] 
(sys_renameat+0x1b8/0x1cc)
[   13.622965] [<c0086c9c>] (sys_renameat+0x0/0x1cc) from [<c0086e8c>] 
(sys_rename+0x24/0x28)
[   13.631213] [<c0086e68>] (sys_rename+0x0/0x28) from [<c0023c00>] 
(ret_fast_syscall+0x0/0x2c)
[   13.639670] UBIFS error (pid 273): ubifs_read_node: bad node type (0 
but expected 2)
[   13.647371] UBIFS error (pid 273): ubifs_read_node: bad node at LEB 
47:125384
[   13.654514] UBIFS warning (pid 273): ubifs_ro_mode: switched to 
read-only mode, error -22
/endurance: endurance.c: 197: create_file: Assertion `status == 0' failed.
[   46.357586] UBIFS error (pid 101): make_reservation: cannot reserve 
160 bytes in jhead 1, error -30
[   46.366503] UBIFS error (pid 101): ubifs_write_inode: can't write 
inode 19507, error -30


* Re: ubi : kernel panic on erroneous block
  2010-08-10  9:56 ubi : kernel panic on erroneous block Matthieu CASTET
  2010-08-10 11:42 ` Artem Bityutskiy
@ 2010-08-29 20:46 ` Artem Bityutskiy
  2010-08-29 21:00   ` Artem Bityutskiy
  1 sibling, 1 reply; 5+ messages in thread
From: Artem Bityutskiy @ 2010-08-29 20:46 UTC (permalink / raw)
  To: Matthieu CASTET; +Cc: Artem Bityutskiy, linux-mtd@lists.infradead.org

Hi Matthieu,

I've read both of your mails. It looks like there are several issues.
Let's try to identify them and then deal with them one by one. I think I
see one of them, you'll find the fix at the end of this e-mail.

But let me also comment on your e-mail to let you know what I think was
happening.

On Tue, 2010-08-10 at 11:56 +0200, Matthieu CASTET wrote:
> when running test with ubifs I found the following crash.
> One block is instable (some read fails with ecc error correctable or 
> not) after a power cut. This is due to interrupted write or erase.

OK

> Our test do first a read of the ubi volume (cat /dev/ubi3_0 > /dev/null) 
> to force complete read of it.

OK

> In this case ecc correctable is detected, and scrubbing is scheduled
> But ubi_eba_copy_leb: the block become uncorrectable and added to 
> erroneous list.

OK, this is fine and expected behavior.

> When mounting ubifs read doesn't check that it is erroneous and return data.
> It is added again for scrubbing, but prot_queue_del crash because we 
> already remove it in the first scrubbing try.
> 
> Here an attempt to fix the problem. This is ugly. I didn't try it yet. I 
>   erased my corrupted flash by accident.
> 
> One other solution could be to add the test in ubi_wl_scrub_peb, but I 
> don't think it is ok to return data on erroneous block.
> 
> An other solution could be to unmap the block (read will return 0xff), 
> but this may break upper layer ?

Err, I think the bug is actually very simple - when I introduced the
erroneous list, I just forgot to add a check in 'ubi_wl_scrub_peb()'.
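
A user-space model of such a check, using hypothetical names (`struct peb`, `scrub_request`). It does not reproduce the actual fix, it only shows the idea: a scrub request for a PEB that is already on the erroneous list is ignored instead of trying to take the PEB off a queue it is no longer on.

```c
#include <assert.h>
#include <stddef.h>

/* Where a PEB currently lives; a simplification of UBI's WL bookkeeping. */
enum peb_place { PEB_USED, PEB_PROT_QUEUE, PEB_SCRUB, PEB_ERRONEOUS };

struct peb {
	int pnum;
	enum peb_place place;
};

/*
 * Request scrubbing of @e.
 * Returns 1 if the PEB was newly scheduled, 0 if there was nothing to do.
 */
int scrub_request(struct peb *e)
{
	assert(e != NULL);

	switch (e->place) {
	case PEB_SCRUB:     /* already scheduled */
	case PEB_ERRONEOUS: /* an earlier scrub attempt failed: leave it alone */
		return 0;
	case PEB_USED:
	case PEB_PROT_QUEUE:
		e->place = PEB_SCRUB;
		return 1;
	}
	return 0;
}
```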

> [    6.613769] UBI DBG (pid 266): ubi_scan: scanning is finished
> [    6.627283] UBI: attached mtd3 to ubi3
> [    6.630876] UBI: MTD device name:            "P6system"
> [    6.636130] UBI: MTD device size:            32 MiB
> [    6.640948] UBI: number of good PEBs:        256
> [    6.645567] UBI: number of bad PEBs:         0
> [    6.649979] UBI: max. allowed volumes:       128
> [    6.654591] UBI: wear-leveling threshold:    4096
> [    6.659274] UBI: number of internal volumes: 1
> [    6.663701] UBI: number of user volumes:     1
> [    6.668138] UBI: available PEBs:             0
> [    6.672559] UBI: total number of reserved PEBs: 256
> [    6.677433] UBI: number of PEBs reserved for bad PEB handling: 2
> [    6.683416] UBI: max/mean erase counter: 3717/3558
> [    6.688201] UBI: image sequence number: 1403635655
> [    6.693008] UBI: background thread "ubi_bgt3d" started, PID 269
> UBI device number 3, total 256 LEBs (32505856 bytes, 31.0 MiB), 
> available 0 LEBs (0 bytes), LEB size 126976 bytes (124.0 KiB)
> 
> ----> cat /dev/ubi3_0 > /dev/null

OK, you are reading the whole UBI volume.

> [    6.908524] BA315_STATUS_DEC_ERR : 0 4 on 24525
> [    6.912907] ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 
> ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 
> ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
> [    6.931062] UBI DBG (pid 272): ubi_io_read: fixable bit-flip detected 
> at PEB 189
> [    6.938448] UBI DBG (pid 272): ubi_wl_scrub_peb: schedule PEB 189 for 
> scrubbing

OK, a fixable I/O error happens, and this PEB is scheduled for
scrubbing. So far so good.

> [    6.947677] BA315_STATUS_DEC_ERR : 512 4 on 24525
> [    6.952226] ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 
> ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 
> ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
> [    6.976651] UBI error: ubi_io_read: error -74 while reading 126976 
> bytes from PEB 189:4096, read 126976 bytes

OK. While scrubbing, we hit a hard error, and UBI reports it. As
you described, we are dealing with an unstable NAND page which may
sometimes be read with no errors, sometimes with a correctable ECC error,
and sometimes with an uncorrectable ECC error.

> [    6.986429] [<c00279f0>] (dump_stack+0x0/0x14) from [<c0160fa8>] 
> (ubi_io_read+0xf0/0x258)
> [    6.994594] [<c0160eb8>] (ubi_io_read+0x0/0x258) from [<c01607e8>] 
> (ubi_eba_copy_leb+0x204/0x58c)
> [    7.003434] [<c01605e4>] (ubi_eba_copy_leb+0x0/0x58c) from 
> [<c01638e8>] (wear_leveling_worker+0x2e4/0x630)
> [    7.013074] [<c0163604>] (wear_leveling_worker+0x0/0x630) from 
> [<c0162b0c>] (do_work+0x94/0xe8)
> [    7.021758] [<c0162a78>] (do_work+0x0/0xe8) from [<c0163cc4>] 
> (ubi_thread+0x90/0x118)
> [    7.029576]  r7:c789bc50 r6:00000000 r5:c7848000 r4:c789b800
> [    7.035222] [<c0163c34>] (ubi_thread+0x0/0x118) from [<c0047204>] 
> (kthread+0x50/0x7c)
> [    7.043036] [<c00471b4>] (kthread+0x0/0x7c) from [<c00360e8>] 
> (do_exit+0x0/0x6ac)
> [    7.050505]  r5:00000000 r4:00000000

You have debugging enabled, so you are enjoying extra UBI prints, but they
are useful :-)

> [    7.054071] UBI warning: ubi_eba_copy_leb: error -74 while reading 
> data from PEB 189

The WL code warns that we cannot read the LEB, this is OK. So far so
good.

> ----> mounting ubifs on /dev/ubi3_0

[snip]

> [   10.467210] UBI error: ubi_io_read: error -74 while reading 126976 
> bytes from PEB 189:4096, read 126976 bytes

OK, we are now mounting UBIFS, which reads the LEB which is mapped to
the faulty PEB 189, which is currently sitting in the 'erroneous' list
in UBI, which is kind of logical.

Also note, most of the PEBs which were write-interrupted will belong to the
UBIFS journal. There is another case, as you already identified in your
previous mail: the PEBs which were WL-copied in UBI. But let's
concentrate on your case.

The UBIFS journal scanning code expects -EBADMSG errors, and is designed
to handle them.

> [   10.477007] [<c00279f0>] (dump_stack+0x0/0x14) from [<c0160fa8>] 
> (ubi_io_read+0xf0/0x258)
> [   10.485144] [<c0160eb8>] (ubi_io_read+0x0/0x258) from [<c016035c>] 
> (ubi_eba_read_leb+0x1a0/0x428)
> [   10.494005] [<c01601bc>] (ubi_eba_read_leb+0x0/0x428) from 
> [<c015e3c0>] (ubi_leb_read+0xe8/0x138)
> [   10.502859] [<c015e2d8>] (ubi_leb_read+0x0/0x138) from [<c00d6918>] 
> (ubifs_start_scan+0x7c/0xf4)
> [   10.511633]  r7:c79f3000 r6:00000000 r5:c798b8e0 r4:00000000
> [   10.517276] [<c00d689c>] (ubifs_start_scan+0x0/0xf4) from 
> [<c00d6b4c>] (ubifs_scan+0x2c/0x298)
> [   10.525878]  r8:00000003 r7:c79f3000 r6:00000000 r5:c8d01000 r4:0001f000
> [   10.532560] [<c00d6b20>] (ubifs_scan+0x0/0x298) from [<c00d71dc>] 
> (ubifs_replay_journal+0x14c/0x13a4)
> [   10.541769] [<c00d7090>] (ubifs_replay_journal+0x0/0x13a4) from 
> [<c00cdd68>] (ubifs_fill_super+0xb84/0x1054)
> [   10.551580] [<c00cd1e4>] (ubifs_fill_super+0x0/0x1054) from 
> [<c00ced04>] (ubifs_get_sb+0xc4/0x2ac)
> [   10.560525] [<c00cec40>] (ubifs_get_sb+0x0/0x2ac) from [<c007f04c>] 
> (vfs_kern_mount+0x58/0x94)
> [   10.569124] [<c007eff4>] (vfs_kern_mount+0x0/0x94) from [<c007f0e8>] 
> (do_kern_mount+0x40/0xe8)
> [   10.577736]  r8:c79fa000 r7:c02253ec r6:00000000 r5:c78fb000 r4:00000000
> [   10.584408] [<c007f0a8>] (do_kern_mount+0x0/0xe8) from [<c0095628>] 
> (do_new_mount+0x68/0x8c)
> [   10.592833]  r8:00000000 r7:0000000a r6:c784bef0 r5:00000000 r4:c79fa000
> [   10.599519] [<c00955c0>] (do_new_mount+0x0/0x8c) from [<c00957a8>] 
> (do_mount+0x15c/0x1b8)
> [   10.607694]  r7:c79fa000 r6:c78fb000 r5:c78d9000 r4:00000404
> [   10.613327] [<c009564c>] (do_mount+0x0/0x1b8) from [<c0095890>] 
> (sys_mount+0x8c/0xd4)
> [   10.621144] [<c0095804>] (sys_mount+0x0/0xd4) from [<c0023c00>] 
> (ret_fast_syscall+0x0/0x2c)

OK, this gives us additional info about where we are. 

> [   10.629483]  r7:00000015 r6:00008840 r5:00000000 r4:00000000
> [   10.638307] BA315_STATUS_DEC_ERR : 0 4 on 24525
> [   10.642677] ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 
> ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 
> ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
> [   10.667078] UBI DBG (pid 273): ubi_io_read: fixable bit-flip detected 
> at PEB 189

Because of inlining, your stack-dump lacks some function calls. I think
it was: ubifs_replay_journal() -> replay_log_leb() -> ubifs_scan()
-> ... Then we had -EBADMSG, and returned back to 'replay_log_leb()'.
Then we call 'ubifs_recover_log_leb()', which scans the LEB again.
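
The call sequence described above can be modelled roughly as follows. This is
only a sketch: scan_leb(), recover_log_leb() and the failing_reads_left counter
are hypothetical stand-ins, not the real UBIFS functions, and the flash is
reduced to "the first N reads fail with -EBADMSG".

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for the flaky flash: number of reads that still
 * fail with an uncorrectable ECC error before a read succeeds. */
static int failing_reads_left = 1;
static int scan_count;

/* Models one full scan of the LEB (the role ubifs_scan() plays). */
static int scan_leb(void)
{
	scan_count++;
	if (failing_reads_left > 0) {
		failing_reads_left--;
		return -EBADMSG;
	}
	return 0;
}

/* Models the recovery path: it scans the same LEB a second time. */
static int recover_log_leb(void)
{
	return scan_leb();
}

/* Models replay_log_leb(): normal scan first, recovery on -EBADMSG. */
static int replay_log_leb(void)
{
	int err = scan_leb();

	if (err == -EBADMSG)
		err = recover_log_leb();
	return err;
}
```

With one failing read, replay_log_leb() returns 0 and scan_count ends up at 2,
which is the "scan twice" behaviour discussed next.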

Yes, this is sub-optimal to scan twice, but this is how the thing is
implemented. I can try to fix it later - but this is not so important
now, let's first deal with your issues. But please, feel free to bug me
later and remind about this.

So, we are scanning for the second time. And this time we get a bit-flip
instead of -EBADMSG. 

> [   10.674321] UBI DBG (pid 273): ubi_wl_scrub_peb: schedule PEB 189 for 

And as the stack-dump below shows, we end up here.

> scrubbing
> [   10.681652] Unable to handle kernel NULL pointer dereference at 
> virtual address 00000000
> [   10.689703] pgd = c79e4000
> [   10.692381] [00000000] *pgd=47841031, *pte=00000000, *ppte=00000000
> [   10.698639] Internal error: Oops: 817 [#1]
> [   10.702713] Modules linked in:
> [   10.705761] CPU: 0    Not tainted  (2.6.27.47-parrot-dirty #212)
> [   10.711769] PC is at prot_queue_del+0x2c/0x50
> [   10.716100] LR is at ubi_wl_scrub_peb+0xec/0x13c
> [   10.720700] pc : [<c0162430>]    lr : [<c01635b4>]    psr: a0000013
> [   10.720716] sp : c784ba70  ip : c78ff290  fp : c784ba7c
> [   10.732157] r10: 00000000  r9 : 00000003  r8 : 000000bd
> [   10.737371] r7 : c789bbcc  r6 : c789bbc0  r5 : c789b800  r4 : c78ff290
> [   10.743883] r3 : 00100100  r2 : 00000001  r1 : 00000000  r0 : ffffffed
> [   10.750398] Flags: NzCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM 
> Segment user
> [   10.757519] Control: 0005317f  Table: 479e4000  DAC: 00000015
> [   10.763249] Process endurance (pid: 273, stack limit = 0xc784a268)
> [   10.769416] Stack: (0xc784ba70 to 0xc784c000)

And oops, because 'ubi_wl_scrub_peb()' is buggy and does not handle the
case when the PEB is erroneous.

[snip long hex dump of the stack]

> [   11.145022] Backtrace:
> [   11.147455] [<c0162404>] (prot_queue_del+0x0/0x50) from [<c01635b4>] 
> (ubi_wl_scrub_peb+0xec/0x13c)
> [   11.156400] [<c01634c8>] (ubi_wl_scrub_peb+0x0/0x13c) from 
> [<c01603bc>] (ubi_eba_read_leb+0x200/0x428)
> [   11.165693]  r8:000000bd r7:c73fd000 r6:c789b800 r5:c8d01000 r4:00000000
> [   11.172379] [<c01601bc>] (ubi_eba_read_leb+0x0/0x428) from 
> [<c015e3c0>] (ubi_leb_read+0xe8/0x138)
> [   11.181237] [<c015e2d8>] (ubi_leb_read+0x0/0x138) from [<c00d6918>] 
> (ubifs_start_scan+0x7c/0xf4)
> [   11.190010]  r7:c79f3000 r6:00000000 r5:c798b8e0 r4:00000000
> [   11.195654] [<c00d689c>] (ubifs_start_scan+0x0/0xf4) from 
> [<c00e3650>] (ubifs_recover_leb+0x3c/0x730)
> [   11.204861]  r8:00000003 r7:c79f3000 r6:c798b8e0 r5:c79f3000 r4:c8d01000
> [   11.211547] [<c00e3614>] (ubifs_recover_leb+0x0/0x730) from 
> [<c00e444c>] (ubifs_recover_log_leb+0xc8/0x2dc)
> [   11.221274] [<c00e4384>] (ubifs_recover_log_leb+0x0/0x2dc) from 

Yeah, I was right, it is the log LEB.

> [<c00d7c20>] (ubifs_replay_journal+0xb90/0x13a4)
> [   11.231435] [<c00d7090>] (ubifs_replay_journal+0x0/0x13a4) from 
> [<c00cdd68>] (ubifs_fill_super+0xb84/0x1054)
> [   11.241249] [<c00cd1e4>] (ubifs_fill_super+0x0/0x1054) from 
> [<c00ced04>] (ubifs_get_sb+0xc4/0x2ac)
> [   11.250194] [<c00cec40>] (ubifs_get_sb+0x0/0x2ac) from [<c007f04c>] 
> (vfs_kern_mount+0x58/0x94)
> [   11.258793] [<c007eff4>] (vfs_kern_mount+0x0/0x94) from [<c007f0e8>] 
> (do_kern_mount+0x40/0xe8)
> [   11.267390]  r8:c79fa000 r7:c02253ec r6:00000000 r5:c78fb000 r4:00000000
> [   11.274077] [<c007f0a8>] (do_kern_mount+0x0/0xe8) from [<c0095628>] 
> (do_new_mount+0x68/0x8c)
> [   11.282501]  r8:00000000 r7:0000000a r6:c784bef0 r5:00000000 r4:c79fa000
> [   11.289188] [<c00955c0>] (do_new_mount+0x0/0x8c) from [<c00957a8>] 
> (do_mount+0x15c/0x1b8)
> [   11.297352]  r7:c79fa000 r6:c78fb000 r5:c78d9000 r4:00000404
> [   11.302997] [<c009564c>] (do_mount+0x0/0x1b8) from [<c0095890>] 
> (sys_mount+0x8c/0xd4)
> [   11.310813] [<c0095804>] (sys_mount+0x0/0xd4) from [<c0023c00>] 
> (ret_fast_syscall+0x0/0x2c)
> [   11.319151]  r7:00000015 r6:00008840 r5:00000000 r4:00000000
> [   11.324794] Code: 089da800 e59c1004 e59c2000 e59f3018 (e5812000)
> [   11.330924] Kernel panic - not syncing: Fatal exception

The patch at the end should fix this oops. With it, 'ubi_wl_scrub_peb()'
would just ignore the erroneous PEB and return, instead of oopsing.
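
The effect of that check can be illustrated with a simplified model. Note the
assumptions: the real UBI code keeps PEBs in RB-trees and a protection queue,
while here a hypothetical 'enum peb_place' just records where an entry lives,
and scrub_peb() is an illustrative name, not the kernel function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified bookkeeping: one enum instead of the RB-trees
 * and protection queue used by the real wear-leveling code. */
enum peb_place { PEB_USED, PEB_PROT_QUEUE, PEB_SCRUB, PEB_ERRONEOUS };

struct peb {
	int pnum;
	enum peb_place place;
};

/* Returns true if the PEB was newly scheduled for scrubbing. */
static bool scrub_peb(struct peb *e, const struct peb *move_from)
{
	/* Being moved, already scheduled, or erroneous: nothing to do.
	 * The PEB_ERRONEOUS test corresponds to the check the patch adds. */
	if (e == move_from || e->place == PEB_SCRUB ||
	    e->place == PEB_ERRONEOUS)
		return false;

	/* Only now is it safe to take the entry off its current container
	 * and put it on the scrub list. */
	e->place = PEB_SCRUB;
	return true;
}
```

Without the early return, the code would go on to remove the entry from a
container it is not in, which is what happened in 'prot_queue_del()' above.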

Then 'ubifs_recover_leb()' would finish the scanning, collect all good
nodes, and then refresh this LEB (!!!) using 'ubi_leb_change()'
(see ubifs_recover_leb() -> fix_unclean_leb() -> ubi_leb_change()).

So the result would be that this faulty PEB would be scheduled for
erasure, i.e., exactly what you want.
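
Why refreshing the LEB gets rid of the faulty PEB can be shown with a toy
model. Everything here is an assumption made for illustration: a single LEB,
a four-PEB device, and a leb_change() helper that only mimics the remapping
side of an atomic LEB change, not the real 'ubi_leb_change()'.

```c
#include <assert.h>

#define NPEBS    4
#define UNMAPPED (-1)

enum peb_state { PEB_FREE, PEB_IN_USE, PEB_ERASE_PENDING };

struct dev {
	enum peb_state peb[NPEBS];
	int leb_to_peb;		/* mapping of the single LEB in this model */
};

/* Write the good data to a fresh PEB, remap the LEB to it, and hand the
 * previously mapped PEB over for erasure. Returns 0 on success, -1 if no
 * free PEB is available. */
static int leb_change(struct dev *d)
{
	int old = d->leb_to_peb;
	int pnum;

	for (pnum = 0; pnum < NPEBS; pnum++) {
		if (d->peb[pnum] != PEB_FREE)
			continue;
		d->peb[pnum] = PEB_IN_USE;	/* new copy lives here */
		d->leb_to_peb = pnum;
		if (old != UNMAPPED)
			d->peb[old] = PEB_ERASE_PENDING;
		return 0;
	}
	return -1;
}
```

After the change, the LEB points at a fresh PEB and the old, faulty one is
only waiting to be erased.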

Here is the patch. I only compile-tested it, but it looks correct and
obvious to me, and I'd even send it to Linus for inclusion in 2.6.36 if
you tested it or approved it.

From: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Subject: [PATCH] UBIFS: do not oops when erroneous PEB is scheduled for scrubbing

When an erroneous PEB is scheduled for scrubbing, we end up with the
following oops:

[<c0162404>] (prot_queue_del+0x0/0x50) from [<c01635b4>] (ubi_wl_scrub_peb+0xec/0x13c)
[<c01634c8>] (ubi_wl_scrub_peb+0x0/0x13c) from [<c01603bc>] (ubi_eba_read_leb+0x200/0x428)
[<c01601bc>] (ubi_eba_read_leb+0x0/0x428) from [<c015e3c0>] (ubi_leb_read+0xe8/0x138)
[<c015e2d8>] (ubi_leb_read+0x0/0x138) from [<c00d6918>] (ubifs_start_scan+0x7c/0xf4)
[<c00d689c>] (ubifs_start_scan+0x0/0xf4) from [<c00e3650>] (ubifs_recover_leb+0x3c/0x730)
[<c00e3614>] (ubifs_recover_leb+0x0/0x730) from [<c00e444c>] (ubifs_recover_log_leb+0xc8/0x2dc)
[<c00e4384>] (ubifs_recover_log_leb+0x0/0x2dc) from [<c00d7c20>] (ubifs_replay_journal+0xb90/0x13a4)
[<c00d7090>] (ubifs_replay_journal+0x0/0x13a4) from [<c00cdd68>] (ubifs_fill_super+0xb84/0x1054)
[<c00cd1e4>] (ubifs_fill_super+0x0/0x1054) from [<c00ced04>] (ubifs_get_sb+0xc4/0x2ac)
[<c00cec40>] (ubifs_get_sb+0x0/0x2ac) from [<c007f04c>] (vfs_kern_mount+0x58/0x94)
[<c007eff4>] (vfs_kern_mount+0x0/0x94) from [<c007f0e8>] (do_kern_mount+0x40/0xe8)
[<c007f0a8>] (do_kern_mount+0x0/0xe8) from [<c0095628>] (do_new_mount+0x68/0x8c)
[<c00955c0>] (do_new_mount+0x0/0x8c) from [<c00957a8>] (do_mount+0x15c/0x1b8)
[<c009564c>] (do_mount+0x0/0x1b8) from [<c0095890>] (sys_mount+0x8c/0xd4)
[<c0095804>] (sys_mount+0x0/0xd4) from [<c0023c00>] (ret_fast_syscall+0x0/0x2c)
Kernel panic - not syncing: Fatal exception

The problem is that 'ubi_wl_scrub_peb()' does not expect that PEBs may
be in the erroneous tree, which is a bug. This patch fixes the bug
and adds corresponding check to 'ubi_wl_scrub_peb()'. Now it will simply
ignore erroneous PEBs, instead of causing an oops.

Reported-by: Matthieu CASTET <matthieu.castet@parrot.com>
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
---
 drivers/mtd/ubi/wl.c |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/drivers/mtd/ubi/wl.c b/drivers/mtd/ubi/wl.c
index ee7b1d8..97a4356 100644
--- a/drivers/mtd/ubi/wl.c
+++ b/drivers/mtd/ubi/wl.c
@@ -1212,7 +1212,8 @@ int ubi_wl_scrub_peb(struct ubi_device *ubi, int pnum)
 retry:
 	spin_lock(&ubi->wl_lock);
 	e = ubi->lookuptbl[pnum];
-	if (e == ubi->move_from || in_wl_tree(e, &ubi->scrub)) {
+	if (e == ubi->move_from || in_wl_tree(e, &ubi->scrub) ||
+				   in_wl_tree(e, &ubi->erroneous)) {
 		spin_unlock(&ubi->wl_lock);
 		return 0;
 	}
-- 
1.7.2.2

-- 
Best Regards,
Artem Bityutskiy (Битюцкий Артём)


* Re: ubi : kernel panic on erroneous block
  2010-08-29 20:46 ` Artem Bityutskiy
@ 2010-08-29 21:00   ` Artem Bityutskiy
  0 siblings, 0 replies; 5+ messages in thread
From: Artem Bityutskiy @ 2010-08-29 21:00 UTC (permalink / raw)
  To: Matthieu CASTET; +Cc: Artem Bityutskiy, linux-mtd@lists.infradead.org

On Sun, 2010-08-29 at 23:46 +0300, Artem Bityutskiy wrote:
> Because of inlining, your stack-dump lacks some function calls. I think
> it was: ubifs_replay_journal() -> replay_log_leb() -> ubifs_scan()
> -> ... Then we had -EBADMSG, and returned back to 'replay_log_leb()'.
> Then we call 'ubifs_recover_log_leb()', which scans the LEB again.
> 
> Yes, this is sub-optimal to scan twice, but this is how the thing is
> implemented. I can try to fix it later - but this is not so important
> now, let's first deal with your issues. But please, feel free to bug me
> later and remind about this.

An additional idea: we can strengthen 'ubi_io_read()' and make it re-try
several times if there was -EBADMSG. It already retries for the
'read != len' case, but probably we should make it retry in case of any
error.
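
A retry-on-any-error wrapper could look roughly like this. It is a sketch
under assumptions, not the actual 'ubi_io_read()' change: IO_RETRIES, the
read_fn callback type and read_with_retries() are hypothetical names, and
flaky_read() only simulates an unstable PEB.

```c
#include <assert.h>
#include <errno.h>

#define IO_RETRIES 3	/* assumed retry budget, one initial try plus 3 */

/* Hypothetical read callback: returns 0 on success, negative errno
 * on failure. */
typedef int (*read_fn)(void *ctx);

/* Call do_read() until it succeeds or the retry budget is used up;
 * the last error is returned to the caller. */
static int read_with_retries(read_fn do_read, void *ctx)
{
	int err = 0;
	int attempt;

	for (attempt = 0; attempt <= IO_RETRIES; attempt++) {
		err = do_read(ctx);
		if (!err)
			break;
	}
	return err;
}

/* Simulated unstable block: fails with -EBADMSG until the counter that
 * ctx points to reaches zero. */
static int flaky_read(void *ctx)
{
	int *failures_left = ctx;

	if (*failures_left > 0) {
		(*failures_left)--;
		return -EBADMSG;
	}
	return 0;
}
```

A block that fails twice and then reads fine is hidden from the caller, while
a block that keeps failing still reports -EBADMSG after four attempts.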

But this is not a fix, just an improvement. Let's do this at the end,
when we have addressed your issues, because otherwise it will be more
difficult to reproduce.

So, let's postpone this, but please bug me and remind me about it.

> From: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
> Subject: [PATCH] UBIFS: do not oops when erroneous PEB is scheduled for scrubbing

And of course the prefix should be "UBI:".

I've pushed this patch to the UBI tree as well.

-- 
Best Regards,
Artem Bityutskiy (Битюцкий Артём)


end of thread, other threads:[~2010-08-29 21:00 UTC | newest]

Thread overview: 5+ messages
2010-08-10  9:56 ubi : kernel panic on erroneous block Matthieu CASTET
2010-08-10 11:42 ` Artem Bityutskiy
2010-08-23 13:30   ` Matthieu CASTET
2010-08-29 20:46 ` Artem Bityutskiy
2010-08-29 21:00   ` Artem Bityutskiy
