Re: [NAND] Question regarding -EIO error

From: Lukasz Majewski <lukma@denx.de>
To: Boris Brezillon <boris.brezillon@free-electrons.com>
Cc: Richard Weinberger <richard@nod.at>,
	Miquel Raynal <miquel.raynal@free-electrons.com>,
	Marek Vasut <marek.vasut@gmail.com>,
	"linux-mtd@lists.infradead.org" <linux-mtd@lists.infradead.org>,
	Cyrille Pitchen <cyrille.pitchen@wedev4u.fr>,
	Brian Norris <computersforpeace@gmail.com>,
	David Woodhouse <dwmw2@infradead.org>
Subject: Re: [NAND] Question regarding -EIO error
Date: Tue, 14 Nov 2017 00:18:14 +0100	[thread overview]
Message-ID: <20171114001814.4aedd8ad@jawa> (raw)
In-Reply-To: <20171113221907.3f20c931@bbrezillon>

[-- Attachment #1: Type: text/plain, Size: 6180 bytes --]

Hi Boris,

Thanks for your reply.

> +Miquel who is working a lot on NAND stuff lately and might have faced
> the same kind of problems while working on ->exec_op().
> 
> Hi Lukasz,
> 
> On Mon, 13 Nov 2017 21:27:01 +0100
> Lukasz Majewski <lukma@denx.de> wrote:
> 
> > Dear All,
> > 
> > I was investigating the -EIO issue for page write from 2.6.26
> > kernel up till 4.14-rc7.
> > 
> > A foreword:
> > -----------
> > 
> > Before the commit (v4.4):
> > mtd: nand: increase ready wait timeout and report timeouts [1]
> > b70af9bef49bd9a5f4e7a2327d9074e29653e665
> > 
> > The timeout for nand memory write (nand_page_write()) was ignored
> > (as mentioned in [1]).
> > The nand_write_page() (@nand_base.c) only checks for
> > NAND_STATUS_FAIL (and returns -EIO).
> > 
> > In the old days it also used CONFIG_MTD_NAND_VERIFY_WRITE to check
> > if correct data is written (if not -EIO was returned immediately).
> > This was removed with [2]:
> > "mtd: kill MTD_NAND_VERIFY_WRITE"
> > 657f28f8811c92724db10d18bbbec70d540147d6
> > 
> > The commit:
> > "mtd: nand_wait: warn if the nand is busy on exit"
> > f251b8dfdd0721255ea11751cdc282834e43b74e
> > 
> > added WARN_ON() on timeout.
> > 
> > Setup:
> > -----
> > 
> > I've run mtd_*.ko tests on several kernels and two memories.
> > 
> > With mtd_torture tests (and timeout set to 20ms):
> > modprobe mtd_torturetest dev=${device} check=1 cycles_count=100
> > gran=10
> > 
> > forces both memories to timeout (at random execution place) with
> > -EIO error returned.
> > 
> > Please correct me if I'm wrong:
> > -------------------------------
> > 
> > With the new kernel (v4.14-rc7) we rely on:
> > 
> > 1. Page write timeout increased from 20ms -> 400 ms (as in [1])
> > 
> > 2. The WARN_ON() is displayed when we leave nand_wait() with ongoing
> > NAND controller operation.
> > 
> > 3. As written in [2] the correctness of written data is check in
> > upper layers (fs) -> when memory return no fails, but internal
> > controller still writes data.
> >   
> 
> Unless I miss something, I think you're correct.
> 
> > 
> > Problem:
> > --------
> > 
> > Normally to exit nand_wait loop I do read RnB GPIO pin
> > (chip->dev_ready). 
> > 
> > When we got a timeout passed status from one memory is 0x81.
> > Second one returns no errors (0x80) - but the write data check
> > fails. According to spec bits 5 and 6 (of status register) are 0 ->
> > Internal data operation Busy and overall Busy.  
> 
> Yep, the NAND is not ready and all other bits in the STATUS reg can't
> be trusted (which might explain why bit0 changes from 1 to 0 between
> the 2 status read operations).

Indeed the memory is not ready.

Those two values 0x81 and 0x80 are from two different memories (when
the same test code is run).

> 
> Quoting the ONFI spec:
> 
> "
> RDY:
> If set to one, then the LUN or plane address is ready for another
> command and all other bits in the status value are valid. If cleared
> to zero, then the last command issued is not yet complete and SR bits
> 5:0 are invalid and shall be ignored by the host.
> "

Ok. I see. This means that RnB if present has higher priority than
reading status register (via 0x70 command).

> 
> 
> > 
> > The problem here is that we exit nand_wait with NAND memory
> > controller still being busy. Timeout change[1] from 20ms -> 400ms
> > just 'masked' this issue.  
> 
> Theoretically yes, but in practice 400ms should be more than enough to
> complete a PROGRAM operation (actually is should even be enough to
> complete an ERASE operation).
> 
> Did you experience any failures with the timeout set 400ms?

With changing timeout to 400 ms I do not see any issues (I do run
mtd_*.ko tests for +10h)

> 
> > 
> > 
> > Question:
> > ---------
> > 
> > Shall not we wait more (@nand_wait) for internal operations to be
> > finished?  
> 
> Well, we need a boundary, we definitely don't want to wait
> indefinitely, especially since the bug can be caused by a bad
> controller. This being said, if the PROGRAM operation timeouts, we
> should issue a RESET operation to hopefully end up in a well-known
> state.
> 
> > 
> > 
> > To reproduce:
> > -------------
> > 
> > Change back the timeout value from 400ms to 20m and run mtd_*.ko
> > tests.  
> 
> The problem you report was possible with a 20ms (especially for modern
> NANDs with big pages) but becomes unlikely with a 400ms timeout,

Yes. I can confirm that - up till now no issues observed with 400ms
timeout.

> simply because, even if the PROGRAM operation fails, it shouldn't take
> more than 100ms for the NAND chip to report it (put the R/B back to
> ready state and set the FAIL bit to 1 in the STATUS reg).

Ok.

> 
> Just to be sure I understood correctly, is it something you managed to
> reproduce with a 400ms timeout or are you worried that it could happen
> because you've experienced it with an older kernel which had a 20ms
> timeout.

Up till now I was not able to reproduce this issue with 400ms timeout.

I was curious if changing timeout to 400 ms is the "correct" solution.
It seems like it is - since we give NAND memory enough time to finish
any page program operation.

> 
> Note that I'm not against making the code more robust, I'm just trying
> to figure how urgent this is because we're in the middle of a huge
> rework (the ->exec_op() thing I was mentioning at the beginning of
> this reply) that could possibly help us with this kind of problems.

As I said above - no issues with 400 ms. I've backported the change [1]
and it works.

I'm mostly curious about the rationale.

> 
> Regards,
> 
> Boris
> 
> ______________________________________________________
> Linux MTD discussion mailing list
> http://lists.infradead.org/mailman/listinfo/linux-mtd/

Best regards,

Lukasz Majewski

--

DENX Software Engineering GmbH,      Managing Director: Wolfgang Denk
HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany
Phone: (+49)-8142-66989-10 Fax: (+49)-8142-66989-80 Email: wd@denx.de

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]