From mboxrd@z Thu Jan 1 00:00:00 1970
Received: from mail.bootlin.com ([62.4.15.54])
 by bombadil.infradead.org with esmtp (Exim 4.90_1 #2 (Red Hat Linux))
 id 1fBawR-0000FT-FW
 for linux-mtd@lists.infradead.org; Thu, 26 Apr 2018 07:03:47 +0000
Date: Thu, 26 Apr 2018 09:03:21 +0200
From: Miquel Raynal
To: Chris Packham
Cc: Steve deRosier, "linux-mtd@lists.infradead.org",
 "boris.brezillon@bootlin.com", Tobi Wulff
Subject: Re: NAND timeout issues with blank chip and Marvell NFC
Message-ID: <20180426090321.1a5dee5b@xps13>
References: <20180424180837.398957ba@xps13>
 <72ff5349ac6e48a9ab74986947572108@svr-chch-ex1.atlnz.lc>
 <7cd09dc2689643e9a8e0751e1cba3e11@svr-chch-ex1.atlnz.lc>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
List-Id: Linux MTD discussion mailing list

Hi Chris,

On Thu, 26 Apr 2018 05:16:57 +0000, Chris Packham wrote:

> An update for the end of my working day.
>
> On 26/04/18 13:40, Chris Packham wrote:
> > On 26/04/18 09:22, Chris Packham wrote:
> >> Hi Miquel,
> >>
> >> On 25/04/18 04:08, Miquel Raynal wrote:
> >>> Hi Steve, Chris,
> >>>
> >>> On Tue, 24 Apr 2018 08:49:47 -0700, Steve deRosier
> >>> wrote:
> >>>
> >>>> Hi Chris,
> >>>>
> >>>> On Mon, Apr 23, 2018 at 10:31 PM, Chris Packham
> >>>> wrote:
> >>>>> Hi,
> >>>>>
> >>>>> We're in the process of qualifying new NAND chips (Macronix
> >>>>> MX30LF2G18AC) for one of our Armada-385 based devices and we're
> >>>>> experiencing some long startup times on units with factory fresh NAND
> >>>>> chips. Anecdotally I think I've also seen this behaviour on the old
> >>>>> chips as well (Micron MT29F2G08ABAEAWP-ITX:E).
> >>>>>
> >>>>> On 4.17.0-rc2 with the newly re-written NAND infrastructure we see
> >>>>>
> >>>>> nand: device found, Manufacturer ID: 0xc2, Chip ID: 0xda
> >>>>> nand: Macronix MX30LF2G18AC
> >>>>> nand: 256 MiB, SLC, erase size: 128 KiB, page size: 2048, OOB size: 64
> >>>>> marvell-nfc f10d0000.flash: Timeout on CMDD (NDSR: 0x00000080)
> >>>>> marvell-nfc f10d0000.flash: Timeout on CMDD (NDSR: 0x00000280)
> >>>>> Bad block table not found for chip 0
> >>>>> Bad block table not found for chip 0
> >>>>> Scanning device for bad blocks
> >>>>>
> >>>>> (nothing for some time)
> >>>>>
> >>>>> On an older kernel we see
> >>>>>
> >>>>> pxa3xx-nand f10d0000.flash: This platform can't do DMA on this device
> >>>>> nand: device found, Manufacturer ID: 0xc2, Chip ID: 0xda
> >>>>> nand: Macronix MX30LF2G18AC
> >>>>> nand: 256 MiB, SLC, erase size: 128 KiB, page size: 2048, OOB size: 64
> >>>>> pxa3xx-nand f10d0000.flash: ECC strength 16, ECC step size 2048
> >>>>> Bad block table not found for chip 0
> >>>>> Bad block table not found for chip 0
> >>>>> Scanning device for bad blocks
> >>>>> pxa3xx-nand f10d0000.flash: Wait time out!!!
> >>>>> pxa3xx-nand f10d0000.flash: Wait time out!!!
> >>>>> pxa3xx-nand f10d0000.flash: Wait time out!!!
> >>>>> pxa3xx-nand f10d0000.flash: Wait time out!!!
> >>>>> pxa3xx-nand f10d0000.flash: Wait time out!!!
> >>>>> ...
> >>>>> (time outs continue for some time)
> >>>>>
> >>>>> Presumably the new driver in 4.17.0-rc2 is experiencing the same wait
> >>>>> time out but just not complaining about it.
> >>>>>
> >>>>> If we leave the system running long enough (in the order of 30 minutes)
> >>>>> things seem to sort themselves out and bootup continues, the subsequent
> >>>>> boots are fine. If we run 'nand erase.chip' from u-boot on a fresh unit
> >>>>> and then boot into the kernel then things are also fine.
> >>>>>
> >>>>> If we run 'nand scrub.chip -y' from u-boot we are able to re-create the
> >>>>> problem.
> >>>>>
> >>>>> Our suspicion is that erased state of the chip is probably not agreeable
> >>>>> with either the ecc data or the bad block table location (or both). By
> >>>>> erasing it from u-boot this must fill in valid data in the expected
> >>>>> places and the kernel is happy.
> >>>>>
> >>>>
> >>>> During your very first boot, Linux can't find the bad-block table and
> >>>> thus does a full scan of the chip, each and every block, to find the
> >>>> manufacturer bad block marks and then constructs the table. I imagine
> >>>> you've got a parameter incorrect somewhere that's causing it to wait
> >>>> for timeouts at read points, instead of quickly able to read through
> >>>> the 2k or 4k blocks on that flash. On subsequent boots, you don't see
> >>>> this issue because the BBT is found and Linux just uses that. Same
> >>>> deal if you do a `nand erase.chip`, because the BBT is itself marked
> >>>> with a bad-block marker and gets skipped during a normal erase.
> >>>
> >>> I share Steve's thoughts on that, there is probably some
> >>> misconfiguration at some point, having a first long boot is not a
> >>> problem, but 30 minutes for a 256MiB chip... What I don't understand is
> >>> that you should have timeouts with the recent kernel too if there is
> >>> actually something wrong happening.
> >>
> >> As I mentioned in my other reply I may have understated the time. It is
> >> ~30mins with the old pxa3xx driver but the new one seems to block
> >> indefinitely for me.
> >>
> >>>>
> >>>> Now, I don't know if you're aware of this, but by doing the `nand
> >>>> scrub.chip -y`, you've ruined the flash chip. That device can not be
> >>>> relied upon anymore. A scrub will ignore the factory bad-block-marks
> >>>> and erase them. Unless you stored this information off-chip and
> >>>> rewrite the markers, you've now lost the bad-block information from
> >>>> the manufacturer's tests.
> >>>> In any case, this erases the BBT, so your
> >>>> next boot triggers Linux to rebuild the BBT.
> >>>
> >>> I think U-Boot will do it automatically after the scrub. But the result
> >>> is still the same.
> >>>
> >>>>
> >>>>> We could update our manufacturing procedures to run 'nand erase.chip'
> >>>>> before the first boot but this feels wrong. Some of our devices boot
> >>>>> over the network so the nand is not normally touched by the bootloader.
> >>>>> It seems that there is some unhandled error condition that is stopping
> >>>>> the kernel from seeing that the chip is completely blank and making
> >>>>> forward progress.
> >>>>>
> >>>>
> >>>> erase chip won't fix your issue. The BBT scan is going to happen
> >>>> anyway. There is however clearly some parameter that is setup
> >>>> incorrectly that's causing it to wait for the timeout instead of being
> >>>> able to quickly read pages. I don't see why that'd be unique to the
> >>>> BBT scan however, I'd expect you to see the problem on all reads, thus
> >>>> slowing down the system noticeably in general.
> >>>>
> >>>> Your hint is likely these lines:
> >>>> " marvell-nfc f10d0000.flash: Timeout on CMDD (NDSR: 0x00000080)
> >>>> marvell-nfc f10d0000.flash: Timeout on CMDD (NDSR: 0x00000280)"
> >>>>
> >>>> You can go look at that in the driver and compare with the relevant
> >>>> behavior in the datasheets. Sorry, but I can't help more specifically,
> >>>> I'd have to know your particular hardware and datasheets and spend
> >>>> some time looking at the code.
> >>>
> >>> I also reproduce the problem on my Armada 38x, the two timeouts at boot
> >>> time (not specifically the first one) are suspicious, I'm going to look
> >>> into it.
> >>
> >> Thanks for leaping onto it. I'll keep investigating it here as well.
> >>
> >
> > When I add some debugging to marvell_nfc_wait_op I see
> >
> > marvell-nfc f10d0000.flash: timeout_ms = 250
> > marvell-nfc f10d0000.flash: done
> > marvell-nfc f10d0000.flash: timeout_ms = 1
> > marvell-nfc f10d0000.flash: done
> > nand: device found, Manufacturer ID: 0xc2, Chip ID: 0xda
> > nand: Macronix MX30LF2G18AC
> > nand: 256 MiB, SLC, erase size: 128 KiB, page size: 2048, OOB size: 64
> > Bad block table not found for chip 0
> > Bad block table not found for chip 0
> > Scanning device for bad blocks
> > marvell-nfc f10d0000.flash: timeout_ms = 4
> > marvell-nfc f10d0000.flash: done
> > marvell-nfc f10d0000.flash: timeout_ms = 600000000
> >
> > That last line looks quite odd. I think the problem might be related to
> > this line from marvell_nfc_hw_ecc_bch_write_page()
> >
> > 	ret = marvell_nfc_wait_op(chip,
> > 			chip->data_interface.timings.sdr.tPROG_max);
> >
> > Based on the datasheet that number is 600 microseconds(us) not the
> > milliseconds expected by marvell_nfc_wait_op().
> >
>
> So naturally throwing in some PSEC_TO_MSEC() calls stopped the really
> long timeouts but then the probe would fail. It seems that I'm getting
> some "page done" and "command done" interrupts indications (NDSR =
> 0x0000500) while attempting to write the oob data.

My bad, I might have forgotten one of these. Can you send a patch or
show me which delay was wrong?

Can you also add a dump_stack() in the error path of the timeout
(probably *wait_cmdd()) and show the full boot log?

>
> I've also re-done some of my initial tests and it seems that 4.17-rc2
> cannot mount this chip. The 4.16.4 kernel can.
>
> Even if I use the old kernel to create the ubi volumes the new kernel
> seems to hang while mounting in a similar place to what I was seeing
> with the BBT creation.
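
To make the unit mismatch behind the "timeout_ms = 600000000" line
concrete, here is a minimal standalone sketch in plain C (not driver
code). It assumes, as the debug output suggests, that tPROG_max is
stored in picoseconds while the wait helper takes milliseconds. The
helper names below are hypothetical and only for illustration; in
particular psec_to_msec_round_up() is not the kernel's PSEC_TO_MSEC()
macro, and rounding up is my assumption about the behaviour one would
want here.

```c
#include <assert.h>
#include <stdint.h>

/* NAND SDR timings such as tPROG_max are assumed to be in picoseconds. */
#define PSEC_PER_MSEC 1000000000ULL

/*
 * Hypothetical helper (not the kernel's PSEC_TO_MSEC() macro): convert
 * picoseconds to milliseconds, rounding up so that a short but non-zero
 * delay never becomes a zero-length timeout.
 */
uint64_t psec_to_msec_round_up(uint64_t ps)
{
	return (ps + PSEC_PER_MSEC - 1) / PSEC_PER_MSEC;
}

/* What a millisecond-based wait ends up using when handed raw picoseconds. */
uint64_t timeout_ms_unconverted(uint64_t tprog_max_ps)
{
	return tprog_max_ps;
}

/* Same call site with the unit conversion applied first. */
uint64_t timeout_ms_converted(uint64_t tprog_max_ps)
{
	return psec_to_msec_round_up(tprog_max_ps);
}

/* Sanity checks for a 600 us page program time. */
void timeout_selfcheck(void)
{
	const uint64_t tprog_max_ps = 600000000ULL; /* 600 us in ps */

	/* Raw value is taken as 600000000 ms, i.e. 600000 seconds. */
	assert(timeout_ms_unconverted(tprog_max_ps) == 600000000ULL);
	/* Converted value rounds 0.6 ms up to a 1 ms timeout. */
	assert(timeout_ms_converted(tprog_max_ps) == 1);
}
```

Without the conversion the wait is 600000000 ms, which is 600000
seconds or close to a week, so a missed interrupt looks like a hang
rather than a timeout. With it, the same 600 us value becomes a 1 ms
timeout.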

Thanks for your time,
Miquèl

-- 
Miquel Raynal, Bootlin (formerly Free Electrons)
Embedded Linux and Kernel engineering
https://bootlin.com