From: Lukasz Majewski <lukma@denx.de>
To: Boris Brezillon <boris.brezillon@free-electrons.com>
Cc: Richard Weinberger <richard@nod.at>,
Miquel Raynal <miquel.raynal@free-electrons.com>,
Marek Vasut <marek.vasut@gmail.com>,
"linux-mtd@lists.infradead.org" <linux-mtd@lists.infradead.org>,
Cyrille Pitchen <cyrille.pitchen@wedev4u.fr>,
Brian Norris <computersforpeace@gmail.com>,
David Woodhouse <dwmw2@infradead.org>
Subject: Re: [NAND] Question regarding -EIO error
Date: Tue, 14 Nov 2017 00:18:14 +0100 [thread overview]
Message-ID: <20171114001814.4aedd8ad@jawa> (raw)
In-Reply-To: <20171113221907.3f20c931@bbrezillon>
[-- Attachment #1: Type: text/plain, Size: 6180 bytes --]
Hi Boris,
Thanks for your reply.
> +Miquel who is working a lot on NAND stuff lately and might have faced
> the same kind of problems while working on ->exec_op().
>
> Hi Lukasz,
>
> On Mon, 13 Nov 2017 21:27:01 +0100
> Lukasz Majewski <lukma@denx.de> wrote:
>
> > Dear All,
> >
> > I was investigating the -EIO issue for page write from 2.6.26
> > kernel up till 4.14-rc7.
> >
> > A foreword:
> > -----------
> >
> > Before the commit (v4.4):
> > mtd: nand: increase ready wait timeout and report timeouts [1]
> > b70af9bef49bd9a5f4e7a2327d9074e29653e665
> >
> > The timeout for nand memory write (nand_page_write()) was ignored
> > (as mentioned in [1]).
> > The nand_write_page() (@nand_base.c) only checks for
> > NAND_STATUS_FAIL (and returns -EIO).
> >
> > In the old days it also used CONFIG_MTD_NAND_VERIFY_WRITE to check
> > if correct data is written (if not -EIO was returned immediately).
> > This was removed with [2]:
> > "mtd: kill MTD_NAND_VERIFY_WRITE"
> > 657f28f8811c92724db10d18bbbec70d540147d6
> >
> > The commit:
> > "mtd: nand_wait: warn if the nand is busy on exit"
> > f251b8dfdd0721255ea11751cdc282834e43b74e
> >
> > added WARN_ON() on timeout.
> >
> > Setup:
> > -----
> >
> > I've run mtd_*.ko tests on several kernels and two memories.
> >
> > With mtd_torture tests (and timeout set to 20ms):
> > modprobe mtd_torturetest dev=${device} check=1 cycles_count=100
> > gran=10
> >
> > forces both memories to timeout (at random execution place) with
> > -EIO error returned.
> >
> > Please correct me if I'm wrong:
> > -------------------------------
> >
> > With the new kernel (v4.14-rc7) we rely on:
> >
> > 1. Page write timeout increased from 20ms -> 400 ms (as in [1])
> >
> > 2. The WARN_ON() is displayed when we leave nand_wait() with ongoing
> > NAND controller operation.
> >
> > 3. As written in [2] the correctness of written data is check in
> > upper layers (fs) -> when memory return no fails, but internal
> > controller still writes data.
> >
>
> Unless I miss something, I think you're correct.
>
> >
> > Problem:
> > --------
> >
> > Normally to exit nand_wait loop I do read RnB GPIO pin
> > (chip->dev_ready).
> >
> > When we got a timeout passed status from one memory is 0x81.
> > Second one returns no errors (0x80) - but the write data check
> > fails. According to spec bits 5 and 6 (of status register) are 0 ->
> > Internal data operation Busy and overall Busy.
>
> Yep, the NAND is not ready and all other bits in the STATUS reg can't
> be trusted (which might explain why bit0 changes from 1 to 0 between
> the 2 status read operations).
Indeed the memory is not ready.
Those two values 0x81 and 0x80 are from two different memories (when
the same test code is run).
>
> Quoting the ONFI spec:
>
> "
> RDY:
> If set to one, then the LUN or plane address is ready for another
> command and all other bits in the status value are valid. If cleared
> to zero, then the last command issued is not yet complete and SR bits
> 5:0 are invalid and shall be ignored by the host.
> "
Ok. I see. This means that RnB if present has higher priority than
reading status register (via 0x70 command).
>
>
> >
> > The problem here is that we exit nand_wait with NAND memory
> > controller still being busy. Timeout change[1] from 20ms -> 400ms
> > just 'masked' this issue.
>
> Theoretically yes, but in practice 400ms should be more than enough to
> complete a PROGRAM operation (actually is should even be enough to
> complete an ERASE operation).
>
> Did you experience any failures with the timeout set 400ms?
With changing timeout to 400 ms I do not see any issues (I do run
mtd_*.ko tests for +10h)
>
> >
> >
> > Question:
> > ---------
> >
> > Shall not we wait more (@nand_wait) for internal operations to be
> > finished?
>
> Well, we need a boundary, we definitely don't want to wait
> indefinitely, especially since the bug can be caused by a bad
> controller. This being said, if the PROGRAM operation timeouts, we
> should issue a RESET operation to hopefully end up in a well-known
> state.
>
> >
> >
> > To reproduce:
> > -------------
> >
> > Change back the timeout value from 400ms to 20m and run mtd_*.ko
> > tests.
>
> The problem you report was possible with a 20ms (especially for modern
> NANDs with big pages) but becomes unlikely with a 400ms timeout,
Yes. I can confirm that - up till now no issues observed with 400ms
timeout.
> simply because, even if the PROGRAM operation fails, it shouldn't take
> more than 100ms for the NAND chip to report it (put the R/B back to
> ready state and set the FAIL bit to 1 in the STATUS reg).
Ok.
>
> Just to be sure I understood correctly, is it something you managed to
> reproduce with a 400ms timeout or are you worried that it could happen
> because you've experienced it with an older kernel which had a 20ms
> timeout.
Up till now I was not able to reproduce this issue with 400ms timeout.
I was curious if changing timeout to 400 ms is the "correct" solution.
It seems like it is - since we give NAND memory enough time to finish
any page program operation.
>
> Note that I'm not against making the code more robust, I'm just trying
> to figure how urgent this is because we're in the middle of a huge
> rework (the ->exec_op() thing I was mentioning at the beginning of
> this reply) that could possibly help us with this kind of problems.
As I said above - no issues with 400 ms. I've backported the change [1]
and it works.
I'm mostly curious about the rationale.
>
> Regards,
>
> Boris
>
> ______________________________________________________
> Linux MTD discussion mailing list
> http://lists.infradead.org/mailman/listinfo/linux-mtd/
Best regards,
Lukasz Majewski
--
DENX Software Engineering GmbH, Managing Director: Wolfgang Denk
HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany
Phone: (+49)-8142-66989-10 Fax: (+49)-8142-66989-80 Email: wd@denx.de
[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]
prev parent reply other threads:[~2017-11-13 23:18 UTC|newest]
Thread overview: 3+ messages / expand[flat|nested] mbox.gz Atom feed top
2017-11-13 20:27 [NAND] Question regarding -EIO error Lukasz Majewski
2017-11-13 21:19 ` Boris Brezillon
2017-11-13 23:18 ` Lukasz Majewski [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20171114001814.4aedd8ad@jawa \
--to=lukma@denx.de \
--cc=boris.brezillon@free-electrons.com \
--cc=computersforpeace@gmail.com \
--cc=cyrille.pitchen@wedev4u.fr \
--cc=dwmw2@infradead.org \
--cc=linux-mtd@lists.infradead.org \
--cc=marek.vasut@gmail.com \
--cc=miquel.raynal@free-electrons.com \
--cc=richard@nod.at \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox