Re: imx27: No space left to write bad block table

From: Miquel Raynal <miquel.raynal@bootlin.com>
To: "Stefan Riedmüller" <S.Riedmueller@phytec.de>
Cc: "festevam@gmail.com" <festevam@gmail.com>,
	"guillaume.tucker@collabora.com" <guillaume.tucker@collabora.com>,
	"kernel@pengutronix.de" <kernel@pengutronix.de>,
	"linux-mtd@lists.infradead.org" <linux-mtd@lists.infradead.org>
Subject: Re: imx27: No space left to write bad block table
Date: Tue, 4 May 2021 10:34:53 +0200
Message-ID: <20210504103453.15786c69@xps13>
In-Reply-To: <50c7f5d96dd4faaabfcf1e8cbf9248a7646d4f9a.camel@phytec.de>

Hi Stefan,

Stefan Riedmüller <S.Riedmueller@phytec.de> wrote on Mon, 26 Apr 2021
15:53:39 +0000:

> Hi Miquel,
> 
> On Mon, 2021-04-19 at 17:36 +0200, Miquel Raynal wrote:
> > Hi Stefan,
> >   
> > > > Interesting. Maybe I overlooked the below commit when applying. Indeed,
> > > > BBT may be considered as bad blocks, so I wonder if the below change is
> > > > valid now...
> > > > 
> > > > Guillaume, would you have a way to revert this patch on top of
> > > > linux-next? Stefan, would you mind giving more details on the testing
> > > > procedure?    
> > > 
> > > > I have tested this on an i.MX 6 by simulating two bad BBT blocks by simply
> > > > returning -EIO in nand_erase_nand when the block to be erased is one of
> > > > the first two BBT blocks.
> > > > 
> > > > I have seen this once on a customer board but was not able to reproduce
> > > > it anymore, thus the simulation of the two bad blocks.
> > > > 
> > > > Without the patch below new versions of the BBT can no longer be written
> > > > to the first two blocks reserved for the BBT but they are still evaluated
> > > > to read the BBT from during boot due to the lack of a test if these blocks
> > > > are bad. So changes to the BBT after these two blocks turn bad are only
> > > > kept and used until the next reboot where again the old version of the
> > > > two worn blocks is used as a basis.
> > > > 
> > > > I tried to use the same mechanism that is used to identify bad blocks
> > > > during a scan for bad blocks. But maybe I missed something there? Or were
> > > > my assumptions wrong in the first place?  
> > 
> > Honestly I don't know what is wrong exactly in this patch.
> > 
> > We will revert the commit as it clearly breaks something fundamental
> > and the merge window is too close to adopt a hackish attitude.
> > 
> > I would propose the following tests with your board:
> > - Hack the core to allow yourself to access bad blocks from userspace
> >   for testing purposes.
> > - With the below commit, you should have the same behavior as
> >   reported by Fabio.  
> 
> On my imx6 board the patch does not lead to the behavior reported by Fabio.
> The BBT is found and can be read:
> 
> [    1.520501] nand: device found, Manufacturer ID: 0xc2, Chip ID: 0xd3
> [    1.526944] nand: Macronix MX60LF8G18AC
> [    1.530803] nand: 1024 MiB, SLC, erase size: 128 KiB, page size: 2048, OOB size: 64
> [    1.539412] Bad block table found at page 524224, version 0x01
> [    1.545790] Bad block table found at page 524160, version 0x01
> [    1.551796] nand_read_bbt: bad block at 0x000001b60000
> [    1.557032] nand_read_bbt: bad block at 0x000008cc0000
> [    1.562204] nand_read_bbt: bad block at 0x00000f480000
> [    1.567395] nand_read_bbt: bad block at 0x0000111c0000
> [    1.572588] nand_read_bbt: bad block at 0x0000205c0000
> [    1.577802] nand_read_bbt: bad block at 0x00002dfc0000
> 
> I dug a little deeper and I think I found the cause for the failure on the
> imx27 board.
> 
> The mxc_nand driver (used by the imx27) uses its own nand_bbt_descr with an
> offset of 0 in the OOB area. This is the same place the bad block marker is
> located on worn or factory bad blocks.
> 
> This explains why the BBT is no longer found with my patch. scan_block_fast
> checks if there is anything else than 0xff in the bad block marker and finds
> the 'B' from 'Bbt0'. The same occurs for the mirrored version where it finds
> the '1' from '1tbB'. 

OK, that explains why the original logic failed. Thanks for looking
into it.
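
A minimal userspace sketch of that collision, not the actual nand_bbt.c
code: the function names and the 64-byte buffer are made up for
illustration, and the marker test is reduced to a single byte compare.
Only the 'Bbt0' / '1tbB' signatures and offset 0 come from the report
above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical userspace model, NOT the kernel's nand_bbt.c code. */

/* Signature bytes quoted in the thread: 'Bbt0' and its mirror '1tbB'. */
static const uint8_t demo_bbt_sig[4]    = { 'B', 'b', 't', '0' };
static const uint8_t demo_mirror_sig[4] = { '1', 't', 'b', 'B' };

/* Naive marker test: any non-0xff byte at marker_offs means "bad". */
static bool marker_says_bad(const uint8_t *oob, size_t oob_len,
			    size_t marker_offs)
{
	assert(marker_offs < oob_len);
	return oob[marker_offs] != 0xff;
}

/* Build an erased (all 0xff) OOB area and place a 4-byte signature in it. */
static void fill_oob(uint8_t *oob, size_t oob_len, const uint8_t sig[4],
		     size_t sig_offs)
{
	assert(sig_offs + 4 <= oob_len);
	memset(oob, 0xff, oob_len);
	memcpy(oob + sig_offs, sig, 4);
}
```

With the signature at offset 0, marker_says_bad() returns true for both
the main and the mirror table, because 'B' and '1' are not 0xff. Placing
the same signature a few bytes further into the OOB area leaves the
marker byte at 0xff.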

> This also explains why the original BBT is detected as bad blocks in the scan
> after the BBT was not found, which results in the BBT being written to the
> remaining two blocks reserved for the BBT.
> 
> 19:38:23.001385  nand: device found, Manufacturer ID: 0x20, Chip ID: 0xa1
> 19:38:23.002635  nand: ST Micro NAND01GR3B2CZA6
> 19:38:23.006666  nand: 128 MiB, SLC, erase size: 128 KiB, page size: 2048, OOB size: 64
> 19:38:23.028413  Bad block table not found for chip 0
> 19:38:23.035625  random: fast init done
> 19:38:23.049144  Bad block table not found for chip 0
> 19:38:23.050024  Scanning device for bad blocks
> 19:38:23.330999  Bad eraseblock 329 at 0x000002920000
> 19:38:23.345958  Bad eraseblock 330 at 0x000002940000
> 19:38:23.356024  Bad eraseblock 331 at 0x000002960000
> 19:38:23.365738  Bad eraseblock 332 at 0x000002980000
> 19:38:23.375590  Bad eraseblock 333 at 0x0000029a0000
> 19:38:23.385505  Bad eraseblock 334 at 0x0000029c0000
> 19:38:23.395548  Bad eraseblock 335 at 0x0000029e0000
> 19:38:23.405501  Bad eraseblock 336 at 0x000002a00000
> 19:38:23.415551  Bad eraseblock 337 at 0x000002a20000
> 19:38:23.425937  Bad eraseblock 338 at 0x000002a40000
> 19:38:23.436028  Bad eraseblock 339 at 0x000002a60000
> 19:38:23.445959  Bad eraseblock 340 at 0x000002a80000
> 19:38:23.456008  Bad eraseblock 341 at 0x000002aa0000
> 19:38:23.466006  Bad eraseblock 342 at 0x000002ac0000
> 19:38:23.475912  Bad eraseblock 343 at 0x000002ae0000
> 19:38:23.486064  Bad eraseblock 344 at 0x000002b00000
> 19:38:23.495925  Bad eraseblock 345 at 0x000002b20000
> 19:38:24.048053  Bad eraseblock 1022 at 0x000007fc0000
> 19:38:24.056117  Bad eraseblock 1023 at 0x000007fe0000
> 19:38:24.067953  Bad block table written to 0x000007fa0000, version 0x01
> 19:38:24.087637  Bad block table written to 0x000007f80000, version 0x01
> 
> 
> On the next boot all four BBT versions in flash are skipped for the same
> reason as before and the two blocks containing the latest BBT are also
> detected as bad blocks. As a result, no blocks remain to write the BBT to.
> 
> 
> 21:22:55.032595  nand: device found, Manufacturer ID: 0x20, Chip ID: 0xa1
> 21:22:55.033333  nand: ST Micro NAND01GR3B2CZA6
> 21:22:55.037804  nand: 128 MiB, SLC, erase size: 128 KiB, page size: 2048, OOB size: 64
> 21:22:55.088475  Bad block table not found for chip 0
> 21:22:55.093807  Bad block table not found for chip 0
> 21:22:55.105995  Scanning device for bad blocks
> 21:22:55.109049  random: fast init done
> 21:22:55.395488  Bad eraseblock 329 at 0x000002920000
> 21:22:55.406832  Bad eraseblock 330 at 0x000002940000
> 21:22:55.416885  Bad eraseblock 331 at 0x000002960000
> 21:22:55.426736  Bad eraseblock 332 at 0x000002980000
> 21:22:55.436732  Bad eraseblock 333 at 0x0000029a0000
> 21:22:55.446864  Bad eraseblock 334 at 0x0000029c0000
> 21:22:55.456662  Bad eraseblock 335 at 0x0000029e0000
> 21:22:55.466785  Bad eraseblock 336 at 0x000002a00000
> 21:22:55.476801  Bad eraseblock 337 at 0x000002a20000
> 21:22:55.486772  Bad eraseblock 338 at 0x000002a40000
> 21:22:55.496768  Bad eraseblock 339 at 0x000002a60000
> 21:22:55.506607  Bad eraseblock 340 at 0x000002a80000
> 21:22:55.516965  Bad eraseblock 341 at 0x000002aa0000
> 21:22:55.526621  Bad eraseblock 342 at 0x000002ac0000
> 21:22:55.536702  Bad eraseblock 343 at 0x000002ae0000
> 21:22:55.546660  Bad eraseblock 344 at 0x000002b00000
> 21:22:55.556745  Bad eraseblock 345 at 0x000002b20000
> 21:22:56.172928  Bad eraseblock 1020 at 0x000007f80000
> 21:22:56.187043  Bad eraseblock 1021 at 0x000007fa0000
> 21:22:56.197437  Bad eraseblock 1022 at 0x000007fc0000
> 21:22:56.212665  Bad eraseblock 1023 at 0x000007fe0000
> 21:22:56.213356  No space left to write bad block table
> 21:22:56.215012  nand_bbt: error while writing bad block table -28
> 21:22:56.239353  mxc_nand: probe of d8000000.nand-controller failed with error -28
> 
> I'm not sure of the best way to address this issue. A few ideas came into my
> mind:
> 
> - Shift the offset of the nand_bbt_descr of mxc_nand to make room for the bad
> block marker. I'm not sure whether this would conflict with the ECC hardware,
> but the ooblayout functions suggest that it could work.

There are thousands of boards out there that would be broken by such a
change: it's too late to make this kind of change in this driver,
unfortunately.

> Unfortunately I don't have any hardware at hand at the moment to test it. I
> think the distinction between small and large pagesizes needs to be reflected
> on the bbt_descr as well.
> 
> - Use NAND_BBT_NO_OOB with the mxc_nand driver since there is a comment saying
> there is an overlap between the generic bbt descriptors and the ECC hardware.
> I'm not sure what other effects it might have to set NAND_BBT_NO_OOB.

Same here: that's not an option.

> - Explicitly check for the bad block marker during a search for the BBT
> instead of using scan_block_fast

This looks more reasonable. You can create a helper which calls
scan_block_fast(), then, if needed, checks the beginning of the OOB
buffer and tries to match it against the ->td and ->md descriptors.
This should work with all the legacy drivers implementing their own
descriptors - hopefully.
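
As a rough illustration of the helper idea, here is a userspace sketch
under stated assumptions. struct bbt_pattern, marker_scan_bad() and
bbt_block_is_bad() are hypothetical names: the first stands in for the
pattern fields of a BBT descriptor, the second for scan_block_fast()
reduced to one marker byte of one page. None of this is the kernel API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the pattern fields of a BBT descriptor. */
struct bbt_pattern {
	size_t offs;		/* offset of the signature in the OOB area */
	size_t len;		/* signature length in bytes */
	const uint8_t *pattern;	/* signature bytes, e.g. 'Bbt0' */
};

/* True if the OOB buffer carries the descriptor's signature. */
static bool pattern_matches(const uint8_t *oob, size_t oob_len,
			    const struct bbt_pattern *p)
{
	if (!p || p->offs + p->len > oob_len)
		return false;
	return memcmp(oob + p->offs, p->pattern, p->len) == 0;
}

/* Stand-in for the marker scan: non-0xff at marker_offs means "bad". */
static bool marker_scan_bad(const uint8_t *oob, size_t oob_len,
			    size_t marker_offs)
{
	assert(marker_offs < oob_len);
	return oob[marker_offs] != 0xff;
}

/*
 * Helper: run the marker scan first; if it flags the block, only trust
 * that verdict when the OOB data does not match the main (td) or the
 * mirror (md) table signature.
 */
static bool bbt_block_is_bad(const uint8_t *oob, size_t oob_len,
			     size_t marker_offs,
			     const struct bbt_pattern *td,
			     const struct bbt_pattern *md)
{
	if (!marker_scan_bad(oob, oob_len, marker_offs))
		return false;
	if (pattern_matches(oob, oob_len, td) ||
	    pattern_matches(oob, oob_len, md))
		return false;
	return true;
}
```

In this model a block whose OOB area starts with 'Bbt0' or '1tbB' is no
longer reported as bad, while a block with an arbitrary non-0xff marker
byte still is. A worn BBT block whose signature is still readable would
also pass, so the sketch does not settle how such blocks should be
handled.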

Other drivers are impacted as well, so maybe you'll find a board for
testing (or someone kind enough to test it for you).

Thanks,
Miquèl
