From mboxrd@z Thu Jan 1 00:00:00 1970 From: Luca Ceresoli Date: Thu, 20 Dec 2012 17:02:00 +0100 Subject: [U-Boot] Bricked when trying to attach UBI In-Reply-To: <50D1FB46.9000209@comelit.it> References: <50D1A4C2.6020207@comelit.it> <50D1DC30.1060701@gmail.com> <50D1E3AD.30208@comelit.it> <50D1E6BF.9090407@gmail.com> <50D1FB46.9000209@comelit.it> Message-ID: <50D33678.4080904@comelit.it> List-Id: MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: u-boot@lists.denx.de Hi, I'm Cc'ing the linux-mtd list as well as the authors of the Linux commits cited below. For these new readers: I reported a problem with U-Boot 2012.04.01 not being able to attach an UBI partition in NAND, while Linux (2.6.37) can attach and repair it. It looks like an U-Boot bug, but I discovered strange things around the chip->badblockbits variable (in the NAND code) by comparing the relevant code in U-Boot and Linux. Sorry for Cc'ing so many people, but following this issue I was lead from one subsystem to another (and from U-Boot to Linux). Previous discussion is here: http://thread.gmane.org/gmane.comp.boot-loaders.u-boot/149624 Luca Ceresoli wrote: > Hi Andreas, > > Andreas Bie?mann wrote: >> Hi Luca, >> >> On 19.12.2012 16:56, Luca Ceresoli wrote: >>> Hi Andreas, >>> >>> Andreas Bie?mann wrote: >>> ... >>>>> Creating 1 MTD partitions on "nand0": >>>>> 0x000000100000-0x000010000000 : "mtd=3" >>>>> UBI: attaching mtd1 to ubi0 >>>>> UBI: physical eraseblock size: 131072 bytes (128 KiB) >>>>> UBI: logical eraseblock size: 129024 bytes >>>>> UBI: smallest flash I/O unit: 2048 >>>>> UBI: sub-page size: 512 >>>>> UBI: VID header offset: 512 (aligned 512) >>>>> UBI: data offset: 2048 >>>>> UBI error: ubi_wl_init_scan: no enough physical eraseblocks (0, >>>>> need 1) >>>>> >>>>> Now the device is totally blocked, and power cycling does not change >>>>> the result. >>>> >>>> have you tried to increase the malloc arena in u-boot >>>> (CONIG_SYS_MALLOC_LEN)? >>>> We had errors like this before [1],[2] and [3], maybe others - >>>> apparently with another error message, but please give it a try. We >>>> know >>>> ubi recovery needs some ram and 1MiB may be not enough. >>> >>> Thanks for your suggestion. >>> >>> Unfortunately this does not seem to be the cause of my problem: I tried >>> increasing my CONFIG_SYS_MALLOC_LEN in include/configs/dig297.h from >>> (1024 << 10) to both (1024 << 12) and (1024 << 14), but without any >>> difference. >> >> Well, ok ... Malloc arena is always my first thought if I read about >> problems with ubi in u-boot. >> Have you looked up the differences in drivers/mtd/ubi/ in your u-boot >> and linux tree? Maybe you can see something obviously different in the >> ubi_wl_init_scan()? > > I had some days ago, but I double-checked now as you suggested. Indeed > there is an important difference: attach_by_scanning() (build.c) calls > ubi_wl_init_scan() and ubi_eba_init_scan() just like Linux does, but in > a swapped order! > > This swap dates back to: > > commit d63894654df72b010de2abb4b3f07d0d755f65b6 > Author: Holger Brunck > Date: Mon Oct 10 13:08:19 2011 +0200 > > UBI: init eba tables before wl when attaching a device > > This fixes that u-boot gets stuck when a bitflip was detected > during "ubi part ". If a bitflip was detected UBI tries > to copy the PEB to a different place. This needs that the eba table > are initialized, but this was done after the wear levelling worker > detects the bitflip. So changes the initialisation of these two > tasks in u-boot. > > This is a u-boot specific patch and not needed in the linux layer, > because due to commit 1b1f9a9d00447d > UBI: Ensure that "background thread" operations are really executed > we schedule these tasks in place and not as in linux after the inital > task which schedule this new task is finished. > > Signed-off-by: Holger Brunck > cc: Stefan Roese > Signed-off-by: Stefan Roese > > I tried reverting that commit and... surprise! U-Boot can now attach UBI > and boot properly! > > But the cited commit actually fixed a bug that bite our board a few > months back, so it should not be reverted without thinking twice. Now > it apparently introduced another bug. :-( > > I'm Cc:ing the commit author for comments. > > Nonetheless, I have evidence of a different behaviour between U-Boot > and Linux even before the two swapped functions are called. > > What attach_by_scanning() does in Linux is (abbreviated): > > static int attach_by_scanning(struct ubi_device *ubi) > { > si = ubi_scan(ubi); > ...fill ubi->some_fields...; > err = ubi_read_volume_table(ubi, si); > /* MARK */ > err = ubi_eba_init_scan(ubi, si); /* swapped in U-Boot */ > err = ubi_wl_init_scan(ubi, si); /* swapped in U-Boot */ > ubi_scan_destroy_si(si); > return 0; > } > > See the two swapped calls. > > At MARK, I printed some of the peb counters in *ubi, and I got > different results for ubi->avail_pebs between U-Boot and Linux: > U-Boot: UBI: POST_TBL: rsvd=2018, avail=21, beb_rsvd_{pebs,level}=0,0 > Linux: UBI: POST_TBL: rsvd=2018, avail=22, beb_rsvd_{pebs,level}=0,0 > > The printed values were equal before calling ubi_read_volume_table(). > I have no idea about where this difference comes from, nor if this > difference can cause my troubles. > I will better investigate tomorrow looking into ubi_read_volume_table(). After half a day of debugging and an insane amount of printk()s added to both U-Boot and Linux, I have some more hints to understand the problem. The two different results quoted above show that U-Boot counted 21 available eraseblocks, while Linux counts 22. I am not sure if this can cause my problem, but it's the first visible difference between U-Boot and Linux. This originates from ubi_scan() (scan.c): in U-Boot, it sets si->bad_peb_count to 1, in Linux to 0. U-Boot's ubi_scan() is very similar to Linux's, and the differences do not seem to relevant in my case. So let's dig down... - ubi_scan() (scan.c) calls process_eb() (scan.c) for each EB - process_eb() calls ubi_io_is_bad() (io.c), and if it returns >0 it increments si->bad_peb_count, which is what is happening to my board when executing U-Boot - ubi_io_is_bad() calls mtd->block_isbad(), which points to nand_block_isbad() (nand_base.c) - nand_block_isbad() is a wrapper to nand_block_checkbad() (nand_base.c) - nand_block_checkbad() differs from the Linux code in something related to lazy bad block scanning (commit fb49454b1b6c7c6, Feb 2012), but this does not seem to change the behaviour I observe; - nand_block_checkbad() calls either chip->block_bad() or nand_isbad_bbt(); I tracked only into the former, but I suspect the latter produces the same effects with regard to the problem I'm facing - chip->block_bad() points to nand_block_bad() (nand_base.c) nand_block_bad() (nand_base.c) does the following: static int nand_block_bad(struct mtd_info *mtd, loff_t ofs, int getchip) { ... if (likely(chip->badblockbits == 8)) res = bad != 0xFF; else res = hweight8(bad) < chip->badblockbits; if (getchip) nand_release_device(mtd); return res; } I don't understand the algorithm, but the relevant variables have these values: U-Boot: nand_block_bad: chip->badblockbits=8, bad=0000, hweight8(bad)=0 Linux: nand_block_bad: chip->badblockbits=0, bad=0000, hweight8(bad)=0 ^ Obviously the U-Boot and Linux produce a different return value. This propagates up to ubi->bad_peb_count in ubi_scan(), and from there it changes the behaviour of the following code, leading to a block in U-Boot and a successful attach in Linux. chip->badblockbits in current Linux master is described as "minimum number of set bits in a good block's bad block marker position; i.e., BBM == 11110111b is not bad when badblockbits == 7". Still a bit obscure to me because I don't have a general picture. Anyway, here's how its value comes to be different between U-Boot (2012.04.01) and Linux (2.6.37). Linux: a) commit e0b58d0a7005, Feb 2010: mtd: nand: add ->badblockbits for minimum number of set bits in bad block byte declared the new variable and introduced in nand_get_flash_type() (nand_base.c) the following line: chip->badblockbits = 8; b) commit c7b28e25cb9, Jul 2010: mtd: nand: refactor BB marker detection removed from nand_get_flash_type() (nand_base.c) the same line: chip->badblockbits = 8; c) commit 26d9be11485e, Apr 2011: mtd: return badblockbits back restored in nand_get_flash_type() (nand_base.c) the following line: chip->badblockbits = 8; claiming it had been accidentally removed in commit b). The version of Linux I'm using (2.6.37), contains commits a) and b), so it has chip->badblockbits equal to 0. According to the log message of commit c), this should be wrong, but the resulting kernel works! The version of U-Boot (2012.04.01) contains the result of all 3 commits, since commit 2a8e0fc8b3dc31a3c571e439fbf04b882c8986be Author: Christian Hitz Date: Wed Oct 12 09:32:02 2011 +0200 nand: Merge changes from Linux nand driver [backport from linux commit 02f8c6aee8df3cdc935e9bdd4f2d020306035dbe] This patch synchronizes the nand driver with the Linux 3.0 state. This looks like an improvement, but it bricks my board! I could not resist, and without even trying to understand what I was doing, I did in U-Boot's nand_get_flash_type() (nand_base.c): - chip->badblockbits = 8; + chip->badblockbits = 0; and guess what? U-Boot attached UBI, loaded Linux from it and booted successfully! No, I don't think changing lines here and there without any real understanding is a way to produce reliable software. But I'm unable to understand why the software that should work better actually bricks the board and the other one runs fine? And how do I know what the correct value for chip->badblockbits should be? And last but most important: how can I properly fix U-Boot? Thanks, Luca