Date: Tue, 14 Nov 2017 00:18:14 +0100
From: Lukasz Majewski
To: Boris Brezillon
Cc: Richard Weinberger, Miquel Raynal, Marek Vasut,
    "linux-mtd@lists.infradead.org", Cyrille Pitchen, Brian Norris,
    David Woodhouse
Subject: Re: [NAND] Question regarding -EIO error
Message-ID: <20171114001814.4aedd8ad@jawa>
In-Reply-To: <20171113221907.3f20c931@bbrezillon>
References: <20171113212701.01de0c47@jawa> <20171113221907.3f20c931@bbrezillon>
List-Id: Linux MTD discussion mailing list

Hi Boris,

Thanks for your reply.

> +Miquel who is working a lot on NAND stuff lately and might have faced
> the same kind of problems while working on ->exec_op().
>
> Hi Lukasz,
>
> On Mon, 13 Nov 2017 21:27:01 +0100
> Lukasz Majewski wrote:
>
> > Dear All,
> >
> > I was investigating the -EIO issue for page write from 2.6.26
> > kernel up till 4.14-rc7.
> >
> > A foreword:
> > -----------
> >
> > Before the commit (v4.4):
> > mtd: nand: increase ready wait timeout and report timeouts [1]
> > b70af9bef49bd9a5f4e7a2327d9074e29653e665
> >
> > The timeout for nand memory write (nand_write_page()) was ignored
> > (as mentioned in [1]).
> > The nand_write_page() (@nand_base.c) only checks for
> > NAND_STATUS_FAIL (and returns -EIO).
> >
> > In the old days it also used CONFIG_MTD_NAND_VERIFY_WRITE to check
> > if correct data is written (if not, -EIO was returned immediately).
> > This was removed with [2]:
> > "mtd: kill MTD_NAND_VERIFY_WRITE"
> > 657f28f8811c92724db10d18bbbec70d540147d6
> >
> > The commit:
> > "mtd: nand_wait: warn if the nand is busy on exit"
> > f251b8dfdd0721255ea11751cdc282834e43b74e
> >
> > added WARN_ON() on timeout.
> >
> > Setup:
> > -----
> >
> > I've run mtd_*.ko tests on several kernels and two memories.
> >
> > With mtd_torture tests (and timeout set to 20ms):
> > modprobe mtd_torturetest dev=${device} check=1 cycles_count=100
> > gran=10
> >
> > forces both memories to timeout (at random execution place) with
> > -EIO error returned.
> >
> > Please correct me if I'm wrong:
> > -------------------------------
> >
> > With the new kernel (v4.14-rc7) we rely on:
> >
> > 1. Page write timeout increased from 20ms -> 400 ms (as in [1])
> >
> > 2. The WARN_ON() is displayed when we leave nand_wait() with ongoing
> > NAND controller operation.
> >
> > 3. As written in [2] the correctness of written data is checked in
> > upper layers (fs) -> when memory returns no failure, but internal
> > controller still writes data.
> >
>
> Unless I miss something, I think you're correct.
>
> >
> > Problem:
> > --------
> >
> > Normally to exit nand_wait loop I do read RnB GPIO pin
> > (chip->dev_ready).
> >
> > When the timeout expires, the status read from one memory is 0x81.
> > Second one returns no errors (0x80) - but the write data check
> > fails. According to spec bits 5 and 6 (of status register) are 0 ->
> > Internal data operation Busy and overall Busy.
>
> Yep, the NAND is not ready and all other bits in the STATUS reg can't
> be trusted (which might explain why bit0 changes from 1 to 0 between
> the 2 status read operations).

Indeed the memory is not ready. Those two values 0x81 and 0x80 are from
two different memories (when the same test code is run).
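
To make the reading of those two status bytes concrete, here is a
minimal sketch of how they decode. It only illustrates the point made
above (bit 6 clear means busy, so bit 0 cannot be trusted); it is not
code from nand_base.c. The SK_* macro names, the sk_result enum and
sk_decode_status() are made up for this example; the bit positions are
the ones discussed in this thread (bit 0 FAIL, bits 5 and 6 ready
flags).

```c
#include <assert.h>
#include <stdint.h>

/* Status register bits. The macro names are made up for this sketch;
 * the bit positions are the ones discussed in the thread. */
#define SK_SR_FAIL 0x01u /* bit 0: last operation failed */
#define SK_SR_ARDY 0x20u /* bit 5: no internal array operation running */
#define SK_SR_RDY  0x40u /* bit 6: ready for another command */
#define SK_SR_WP_N 0x80u /* bit 7: not write protected */

enum sk_result {
    SK_BUSY,   /* RDY clear: the FAIL bit must be ignored */
    SK_OK,     /* RDY set, FAIL clear */
    SK_FAILED  /* RDY set, FAIL set */
};

/* Interpret a status byte read back with the 0x70 command. */
enum sk_result sk_decode_status(uint8_t status)
{
    if (!(status & SK_SR_RDY))
        return SK_BUSY; /* do not look at FAIL while busy */
    return (status & SK_SR_FAIL) ? SK_FAILED : SK_OK;
}

/* The two values from this thread, plus two hypothetical "ready" ones. */
void sk_decode_selfcheck(void)
{
    assert(sk_decode_status(0x81) == SK_BUSY);   /* first memory */
    assert(sk_decode_status(0x80) == SK_BUSY);   /* second memory */
    assert(sk_decode_status(0xE0) == SK_OK);     /* ready, no failure */
    assert(sk_decode_status(0xE1) == SK_FAILED); /* ready, FAIL set */
}
```

With this reading, 0x81 and 0x80 carry the same information: the chip
is still busy, and the difference in bit 0 says nothing about whether
the write succeeded.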
>
> Quoting the ONFI spec:
>
> "
> RDY:
> If set to one, then the LUN or plane address is ready for another
> command and all other bits in the status value are valid. If cleared
> to zero, then the last command issued is not yet complete and SR bits
> 5:0 are invalid and shall be ignored by the host.
> "

Ok. I see. This means that RnB, if present, has higher priority than
reading the status register (via the 0x70 command).

>
>
> >
> > The problem here is that we exit nand_wait with NAND memory
> > controller still being busy. Timeout change[1] from 20ms -> 400ms
> > just 'masked' this issue.
>
> Theoretically yes, but in practice 400ms should be more than enough to
> complete a PROGRAM operation (actually it should even be enough to
> complete an ERASE operation).
>
> Did you experience any failures with the timeout set to 400ms?

With the timeout changed to 400 ms I do not see any issues (I have run
the mtd_*.ko tests for +10h).

>
> >
> >
> > Question:
> > ---------
> >
> > Shall not we wait more (@nand_wait) for internal operations to be
> > finished?
>
> Well, we need a boundary, we definitely don't want to wait
> indefinitely, especially since the bug can be caused by a bad
> controller. This being said, if the PROGRAM operation times out, we
> should issue a RESET operation to hopefully end up in a well-known
> state.
>
> >
> >
> > To reproduce:
> > -------------
> >
> > Change back the timeout value from 400ms to 20ms and run mtd_*.ko
> > tests.
>
> The problem you report was possible with a 20ms timeout (especially
> for modern NANDs with big pages) but becomes unlikely with a 400ms
> timeout,

Yes. I can confirm that - up till now no issues observed with 400ms
timeout.

> simply because, even if the PROGRAM operation fails, it shouldn't take
> more than 100ms for the NAND chip to report it (put the R/B back to
> ready state and set the FAIL bit to 1 in the STATUS reg).

Ok.
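
For reference, the bounded wait plus RESET-on-timeout idea discussed
above could look roughly like the sketch below. This is a user-space
simulation under stated assumptions, not the kernel's nand_wait():
struct sketch_chip, the sk_* helpers and the simulated 1 ms clock are
all hypothetical, and the real code returns the status byte and lets
nand_write_page() turn NAND_STATUS_FAIL into -EIO, which is collapsed
into one function here.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define SK_STATUS_FAIL 0x01u /* bit 0 of the status register */

/* Simulated chip; all fields are hypothetical test fixtures. */
struct sketch_chip {
    unsigned int busy_ms;    /* how long the PROGRAM keeps R/B busy */
    unsigned int elapsed_ms; /* simulated clock */
    unsigned char status;    /* status reported once the chip is ready */
    unsigned int resets;     /* number of RESET commands issued */
};

/* Stand-in for sampling the R/B pin (chip->dev_ready in the thread). */
static bool sk_dev_ready(const struct sketch_chip *chip)
{
    return chip->elapsed_ms >= chip->busy_ms;
}

/* Stand-in for a 1 ms delay between polls. */
static void sk_delay_1ms(struct sketch_chip *chip)
{
    chip->elapsed_ms++;
}

/* Stand-in for RESET; the simulation assumes it ends the operation. */
static void sk_reset(struct sketch_chip *chip)
{
    chip->resets++;
    chip->busy_ms = chip->elapsed_ms;
}

/* FAIL is only meaningful once the chip has reported ready. */
static int sk_result(const struct sketch_chip *chip)
{
    return (chip->status & SK_STATUS_FAIL) ? -EIO : 0;
}

/* Wait at most timeout_ms for the PROGRAM operation to finish.
 * Returns 0 on success, -EIO on a reported failure or on timeout.
 * On timeout a RESET is issued before giving up. */
int sk_wait_program(struct sketch_chip *chip, unsigned int timeout_ms)
{
    unsigned int waited;

    assert(chip != NULL);

    for (waited = 0; waited < timeout_ms; waited++) {
        if (sk_dev_ready(chip))
            return sk_result(chip);
        sk_delay_1ms(chip);
    }

    /* One last look after the final delay. */
    if (sk_dev_ready(chip))
        return sk_result(chip);

    sk_reset(chip); /* still busy: try to get back to a known state */
    return -EIO;
}
```

In this simulation a chip that stays busy for 50 ms fails with -EIO
(and one RESET) when the limit is 20 ms, and completes normally when
the limit is 400 ms. The numbers are only chosen to mirror the two
timeout values from this thread.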
>
> Just to be sure I understood correctly, is it something you managed to
> reproduce with a 400ms timeout or are you worried that it could happen
> because you've experienced it with an older kernel which had a 20ms
> timeout?

Up till now I was not able to reproduce this issue with the 400ms
timeout.

I was curious if changing the timeout to 400 ms is the "correct"
solution. It seems like it is - since we give the NAND memory enough
time to finish any page program operation.

>
> Note that I'm not against making the code more robust, I'm just trying
> to figure how urgent this is because we're in the middle of a huge
> rework (the ->exec_op() thing I was mentioning at the beginning of
> this reply) that could possibly help us with this kind of problems.

As I said above - no issues with 400 ms. I've backported the change [1]
and it works. I'm mostly curious about the rationale.

>
> Regards,
>
> Boris

Best regards,

Lukasz Majewski

--

DENX Software Engineering GmbH, Managing Director: Wolfgang Denk
HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany
Phone: (+49)-8142-66989-10 Fax: (+49)-8142-66989-80 Email: wd@denx.de