[NAND] Question regarding -EIO error

* [NAND] Question regarding -EIO error
@ 2017-11-13 20:27 Lukasz Majewski
  2017-11-13 21:19 ` Boris Brezillon
  0 siblings, 1 reply; 3+ messages in thread
From: Lukasz Majewski @ 2017-11-13 20:27 UTC
  To: linux-mtd@lists.infradead.org
  Cc: David Woodhouse, Brian Norris, Boris Brezillon, Marek Vasut,
	Richard Weinberger, Cyrille Pitchen

Dear All,

I have been investigating -EIO errors on page writes across kernels
from 2.6.26 up to 4.14-rc7.

A foreword:
-----------

Before the commit (v4.4):
mtd: nand: increase ready wait timeout and report timeouts [1]
b70af9bef49bd9a5f4e7a2327d9074e29653e665

The timeout for a NAND memory write (nand_write_page()) was ignored (as
mentioned in [1]).
nand_write_page() (in nand_base.c) only checked for NAND_STATUS_FAIL
(and returned -EIO).

In the old days it also used CONFIG_MTD_NAND_VERIFY_WRITE to check
whether the data was written correctly (if not, -EIO was returned
immediately).
This was removed with [2]:
"mtd: kill MTD_NAND_VERIFY_WRITE"
657f28f8811c92724db10d18bbbec70d540147d6

The commit:
"mtd: nand_wait: warn if the nand is busy on exit"
f251b8dfdd0721255ea11751cdc282834e43b74e

added WARN_ON() on timeout.

Setup:
-----

I've run mtd_*.ko tests on several kernels and two memories.

With the timeout set to 20 ms, running the mtd_torturetest module:
modprobe mtd_torturetest dev=${device} check=1 cycles_count=100 gran=10

forces both memories to time out (at a random point during execution),
and -EIO is returned.

Please correct me if I'm wrong:
-------------------------------

With the new kernel (v4.14-rc7) we rely on:

1. Page write timeout increased from 20ms -> 400 ms (as in [1])

2. The WARN_ON() is displayed when we leave nand_wait() with ongoing
NAND controller operation.

3. As written in [2], the correctness of written data is checked in the
upper layers (fs). This covers the case where the memory reports no
failure, but its internal controller is still writing data.

Problem:
--------

Normally, to exit the nand_wait() loop, I read the RnB GPIO pin
(chip->dev_ready).

When the timeout expires, the status read from one memory is 0x81.
The second one reports no error (0x80), but the write data check fails.
According to the spec, bits 5 and 6 of the status register are 0 in
both cases, which means the internal data operation is busy and the
device overall is busy.
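
To make the bit interpretation above easier to check, here is a minimal
user-space sketch (not kernel code) that decodes a status byte. The
mask values match the NAND_STATUS_* defines in the kernel's NAND header;
the struct nand_status_info type and the nand_status_decode() helper are
hypothetical names used only for this illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Status register masks, same values as the kernel's NAND_STATUS_* defines. */
#define NAND_STATUS_FAIL        0x01	/* bit 0: last program/erase failed */
#define NAND_STATUS_TRUE_READY  0x20	/* bit 5: internal data operation done */
#define NAND_STATUS_READY       0x40	/* bit 6: device ready for a new command */
#define NAND_STATUS_WP          0x80	/* bit 7: set when not write protected */

/* Hypothetical helper type, for illustration only. */
struct nand_status_info {
	bool fail;		/* bit 0 set */
	bool array_ready;	/* bit 5 set */
	bool ready;		/* bit 6 set */
	bool not_protected;	/* bit 7 set */
};

/* Hypothetical helper: split a raw status byte into its flags. */
static inline struct nand_status_info nand_status_decode(uint8_t status)
{
	struct nand_status_info info = {
		.fail          = (status & NAND_STATUS_FAIL) != 0,
		.array_ready   = (status & NAND_STATUS_TRUE_READY) != 0,
		.ready         = (status & NAND_STATUS_READY) != 0,
		.not_protected = (status & NAND_STATUS_WP) != 0,
	};

	return info;
}
```

With this helper, 0x81 decodes to "fail set, both ready bits clear" and
0x80 decodes to "no fail, both ready bits clear", i.e. in both cases the
status was sampled while the chip was still busy, so the FAIL bit may
not be meaningful yet.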

The problem here is that we exit nand_wait() while the NAND memory
controller is still busy. The timeout change [1] from 20 ms to 400 ms
just 'masked' this issue.

Question:
---------

Shouldn't we wait longer (in nand_wait()) for internal operations to
finish?
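
To illustrate what I mean, below is a rough user-space sketch of a wait
loop that keeps "still busy" separate from "write failed". It is not the
kernel's nand_wait(): the wait_ready_sketch() function, the
read_status_fn callback type and the struct fake_chip test double are
all hypothetical, and a poll count stands in for the jiffies-based
timeout. The point is only that the FAIL bit is evaluated after the
READY bit has been seen, and that a busy chip gets its own error code.

```c
#include <errno.h>
#include <stdint.h>

#define NAND_STATUS_FAIL   0x01	/* bit 0: last program/erase failed */
#define NAND_STATUS_READY  0x40	/* bit 6: device ready */

/* Hypothetical callback that returns the current status byte. */
typedef uint8_t (*read_status_fn)(void *ctx);

/*
 * Hypothetical sketch: poll until READY is set or max_polls is exhausted.
 * Returns 0 on success, -EIO if the chip is ready and reports FAIL,
 * and -ETIMEDOUT if the chip is still busy when polling stops.
 */
static inline int wait_ready_sketch(read_status_fn read_status, void *ctx,
				    unsigned int max_polls,
				    uint8_t *last_status)
{
	uint8_t status = 0;
	unsigned int i;

	for (i = 0; i < max_polls; i++) {
		status = read_status(ctx);
		if (status & NAND_STATUS_READY)
			break;
	}

	if (last_status)
		*last_status = status;

	/* Still busy: the FAIL bit cannot be trusted yet. */
	if (!(status & NAND_STATUS_READY))
		return -ETIMEDOUT;

	return (status & NAND_STATUS_FAIL) ? -EIO : 0;
}

/* Test double: reports busy (0x80) for busy_polls reads, then final_status. */
struct fake_chip {
	unsigned int busy_polls;
	uint8_t final_status;
};

static inline uint8_t fake_read_status(void *ctx)
{
	struct fake_chip *chip = ctx;

	if (chip->busy_polls > 0) {
		chip->busy_polls--;
		return 0x80;
	}

	return chip->final_status;
}
```

With a fake chip that stays busy for 5 polls, a budget of 3 polls yields
-ETIMEDOUT with a last status of 0x80 (the "no error, but busy" case I
see on the second memory), while a larger budget reaches the final
status and only then looks at the FAIL bit.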

To reproduce:
-------------

Change the timeout value back from 400 ms to 20 ms and run the mtd_*.ko
tests.

Best regards,

Lukasz Majewski

--

DENX Software Engineering GmbH,      Managing Director: Wolfgang Denk
HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany
Phone: (+49)-8142-66989-10 Fax: (+49)-8142-66989-80 Email: wd@denx.de
