From: Lukasz Majewski <lukma@denx.de>
To: Boris Brezillon <boris.brezillon@free-electrons.com>
Cc: Richard Weinberger <richard@nod.at>,
Miquel Raynal <miquel.raynal@free-electrons.com>,
Marek Vasut <marek.vasut@gmail.com>,
"linux-mtd@lists.infradead.org" <linux-mtd@lists.infradead.org>,
Cyrille Pitchen <cyrille.pitchen@wedev4u.fr>,
Brian Norris <computersforpeace@gmail.com>,
David Woodhouse <dwmw2@infradead.org>
Subject: Re: [NAND] Question regarding -EIO error
Date: Tue, 14 Nov 2017 00:18:14 +0100 [thread overview]
Message-ID: <20171114001814.4aedd8ad@jawa> (raw)
In-Reply-To: <20171113221907.3f20c931@bbrezillon>
[-- Attachment #1: Type: text/plain, Size: 6180 bytes --]
Hi Boris,
Thanks for your reply.
> +Miquel who is working a lot on NAND stuff lately and might have faced
> the same kind of problems while working on ->exec_op().
>
> Hi Lukasz,
>
> On Mon, 13 Nov 2017 21:27:01 +0100
> Lukasz Majewski <lukma@denx.de> wrote:
>
> > Dear All,
> >
> > I was investigating the -EIO issue for page write from 2.6.26
> > kernel up till 4.14-rc7.
> >
> > A foreword:
> > -----------
> >
> > Before the commit (v4.4):
> > mtd: nand: increase ready wait timeout and report timeouts [1]
> > b70af9bef49bd9a5f4e7a2327d9074e29653e665
> >
> > The timeout for nand memory write (nand_page_write()) was ignored
> > (as mentioned in [1]).
> > The nand_write_page() (@nand_base.c) only checks for
> > NAND_STATUS_FAIL (and returns -EIO).
> >
> > In the old days it also used CONFIG_MTD_NAND_VERIFY_WRITE to check
> > if correct data is written (if not -EIO was returned immediately).
> > This was removed with [2]:
> > "mtd: kill MTD_NAND_VERIFY_WRITE"
> > 657f28f8811c92724db10d18bbbec70d540147d6
> >
> > The commit:
> > "mtd: nand_wait: warn if the nand is busy on exit"
> > f251b8dfdd0721255ea11751cdc282834e43b74e
> >
> > added WARN_ON() on timeout.
> >
> > Setup:
> > -----
> >
> > I've run mtd_*.ko tests on several kernels and two memories.
> >
> > With mtd_torture tests (and timeout set to 20ms):
> > modprobe mtd_torturetest dev=${device} check=1 cycles_count=100
> > gran=10
> >
> > forces both memories to timeout (at random execution place) with
> > -EIO error returned.
> >
> > Please correct me if I'm wrong:
> > -------------------------------
> >
> > With the new kernel (v4.14-rc7) we rely on:
> >
> > 1. Page write timeout increased from 20ms -> 400 ms (as in [1])
> >
> > 2. The WARN_ON() is displayed when we leave nand_wait() with ongoing
> > NAND controller operation.
> >
> > 3. As written in [2] the correctness of written data is check in
> > upper layers (fs) -> when memory return no fails, but internal
> > controller still writes data.
> >
>
> Unless I miss something, I think you're correct.
>
> >
> > Problem:
> > --------
> >
> > Normally to exit nand_wait loop I do read RnB GPIO pin
> > (chip->dev_ready).
> >
> > When we got a timeout passed status from one memory is 0x81.
> > Second one returns no errors (0x80) - but the write data check
> > fails. According to spec bits 5 and 6 (of status register) are 0 ->
> > Internal data operation Busy and overall Busy.
>
> Yep, the NAND is not ready and all other bits in the STATUS reg can't
> be trusted (which might explain why bit0 changes from 1 to 0 between
> the 2 status read operations).
Indeed the memory is not ready.
Those two values 0x81 and 0x80 are from two different memories (when
the same test code is run).
>
> Quoting the ONFI spec:
>
> "
> RDY:
> If set to one, then the LUN or plane address is ready for another
> command and all other bits in the status value are valid. If cleared
> to zero, then the last command issued is not yet complete and SR bits
> 5:0 are invalid and shall be ignored by the host.
> "
Ok. I see. This means that RnB if present has higher priority than
reading status register (via 0x70 command).
>
>
> >
> > The problem here is that we exit nand_wait with NAND memory
> > controller still being busy. Timeout change[1] from 20ms -> 400ms
> > just 'masked' this issue.
>
> Theoretically yes, but in practice 400ms should be more than enough to
> complete a PROGRAM operation (actually is should even be enough to
> complete an ERASE operation).
>
> Did you experience any failures with the timeout set 400ms?
With changing timeout to 400 ms I do not see any issues (I do run
mtd_*.ko tests for +10h)
>
> >
> >
> > Question:
> > ---------
> >
> > Shall not we wait more (@nand_wait) for internal operations to be
> > finished?
>
> Well, we need a boundary, we definitely don't want to wait
> indefinitely, especially since the bug can be caused by a bad
> controller. This being said, if the PROGRAM operation timeouts, we
> should issue a RESET operation to hopefully end up in a well-known
> state.
>
> >
> >
> > To reproduce:
> > -------------
> >
> > Change back the timeout value from 400ms to 20m and run mtd_*.ko
> > tests.
>
> The problem you report was possible with a 20ms (especially for modern
> NANDs with big pages) but becomes unlikely with a 400ms timeout,
Yes. I can confirm that - up till now no issues observed with 400ms
timeout.
> simply because, even if the PROGRAM operation fails, it shouldn't take
> more than 100ms for the NAND chip to report it (put the R/B back to
> ready state and set the FAIL bit to 1 in the STATUS reg).
Ok.
>
> Just to be sure I understood correctly, is it something you managed to
> reproduce with a 400ms timeout or are you worried that it could happen
> because you've experienced it with an older kernel which had a 20ms
> timeout.
Up till now I was not able to reproduce this issue with 400ms timeout.
I was curious if changing timeout to 400 ms is the "correct" solution.
It seems like it is - since we give NAND memory enough time to finish
any page program operation.
>
> Note that I'm not against making the code more robust, I'm just trying
> to figure how urgent this is because we're in the middle of a huge
> rework (the ->exec_op() thing I was mentioning at the beginning of
> this reply) that could possibly help us with this kind of problems.
As I said above - no issues with 400 ms. I've backported the change [1]
and it works.
I'm mostly curious about the rationale.
>
> Regards,
>
> Boris
>
> ______________________________________________________
> Linux MTD discussion mailing list
> http://lists.infradead.org/mailman/listinfo/linux-mtd/
Best regards,
Lukasz Majewski
--
DENX Software Engineering GmbH, Managing Director: Wolfgang Denk
HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany
Phone: (+49)-8142-66989-10 Fax: (+49)-8142-66989-80 Email: wd@denx.de
[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]
prev parent reply other threads:[~2017-11-13 23:18 UTC|newest]
Thread overview: 3+ messages / expand[flat|nested] mbox.gz Atom feed top
2017-11-13 20:27 [NAND] Question regarding -EIO error Lukasz Majewski
2017-11-13 21:19 ` Boris Brezillon
2017-11-13 23:18 ` Lukasz Majewski [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20171114001814.4aedd8ad@jawa \
--to=lukma@denx.de \
--cc=boris.brezillon@free-electrons.com \
--cc=computersforpeace@gmail.com \
--cc=cyrille.pitchen@wedev4u.fr \
--cc=dwmw2@infradead.org \
--cc=linux-mtd@lists.infradead.org \
--cc=marek.vasut@gmail.com \
--cc=miquel.raynal@free-electrons.com \
--cc=richard@nod.at \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.