Re: PL353 NAND Controller - SW vs HW ECC

From: Miquel Raynal <miquel.raynal@bootlin.com>
To: Andrea Scian <andrea.scian@dave.eu>
Cc: "linux-mtd@lists.infradead.org" <linux-mtd@lists.infradead.org>,
	"michal.simek@amd.com" <michal.simek@amd.com>,
	 "richard@nod.at" <richard@nod.at>,
	 "vigneshr@ti.com" <vigneshr@ti.com>,
	"amit.kumar-mahapatra@amd.com" <amit.kumar-mahapatra@amd.com>
Subject: Re: PL353 NAND Controller - SW vs HW ECC
Date: Tue, 10 Feb 2026 11:12:45 +0100
Message-ID: <87v7g4zz36.fsf@bootlin.com>
In-Reply-To: <MI2P293MB02644DC5515E56A2539C65739765A@MI2P293MB0264.ITAP293.PROD.OUTLOOK.COM> (Andrea Scian's message of "Mon, 9 Feb 2026 11:37:37 +0000")

Hi Andrea,

On 09/02/2026 at 11:37:37 GMT, Andrea Scian <andrea.scian@dave.eu> wrote:

> Dear all,
>
> I hope I am not annoying you by putting you directly in CC, but these are the people who were already involved in my patch to fix SW ECC support in the PL353 NAND controller (mainly used in Xilinx/AMD Zynq7k SoCs), and I think they are the ones who might help me with this follow-up.
>
> Our standard HW/SW validation procedure for BSPs includes (after some basic functional tests) raw NAND MTD tests.
>
> Usually we check ECC functionality with mtd_nandbiterrs but its way
> of testing ECC correction is quite obscure and unmaintained (see a
> thread between me and Miquel on this mailing list in December 2025 on
> this topic).

We stopped developing the kernel modules; for testing, we advise using
the equivalent tools from the mtd-utils test suite, which are actively
maintained.

nandbiterrs -i is the correct tool for testing your ECC engine. It works
this way (from memory, maybe not 100% accurate, but that's the idea):
* One time:
- write (pseudo) random content
- read it back and verify the content, expect 0 bf
- read it back in raw mode, including the OOB area with the ECC bytes

* Repeat:
- insert a bitflip in the first subpage
- write it in raw mode to force the bitflip
- read it back and expect the data to be corrected; the bitflip count
  shall be reported. If the data is incorrect, the page helpers are
  incorrect.

* Finish:
When reaching the threshold of your engine, you should expect an ECC
error, which indicates a success for the test. If you get past the
threshold, the ECC engine is not functional. If the test stops before
the threshold, something is wrong in your configuration or ECC engine.
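
The loop described above can be sketched in a few lines of Python. This
is only an illustration under stated assumptions, not the nandbiterrs
source: SimulatedPage is a hypothetical, idealised stand-in for a page
behind an ECC engine (one ECC step, no OOB modelling), and
incremental_biterrs is a made-up name for the test loop.

```python
import random


class SimulatedPage:
    """Toy NAND page behind an idealised ECC engine (hypothetical model)."""

    def __init__(self, size: int, strength: int) -> None:
        self.size = size
        self.strength = strength
        self.good = bytes(size)
        self.raw = bytearray(size)

    def write(self, data: bytes) -> None:
        # Normal write: the engine protects exactly this content.
        self.good = bytes(data)
        self.raw = bytearray(data)

    def write_raw(self, data: bytes) -> None:
        # Raw write: bypasses the engine, so bitflips can be forced.
        self.raw = bytearray(data)

    def read_raw(self) -> bytes:
        return bytes(self.raw)

    def read(self):
        # Idealised engine: corrects up to `strength` flips, fails beyond.
        flips = sum(bin(a ^ b).count("1") for a, b in zip(self.good, self.raw))
        if flips > self.strength:
            raise IOError("uncorrectable ECC error")
        return self.good, flips


def incremental_biterrs(page: SimulatedPage, max_flips: int = 16, seed: int = 0) -> int:
    """Return how many bitflips were corrected before the first ECC error."""
    rng = random.Random(seed)
    reference = bytes(rng.randrange(256) for _ in range(page.size))

    # One time: write, read back (expect 0 bf), then fetch the raw image.
    page.write(reference)
    data, flips = page.read()
    if data != reference or flips != 0:
        raise AssertionError("clean page did not read back cleanly")
    raw = bytearray(page.read_raw())

    # Repeat: add one more bitflip, force it with a raw write, read back.
    for n in range(1, max_flips + 1):
        raw[(n - 1) // 8] ^= 1 << ((n - 1) % 8)
        page.write_raw(bytes(raw))
        try:
            data, flips = page.read()
        except IOError:
            return n - 1  # Finish: the ECC error marks the threshold.
        if data != reference:
            raise AssertionError("wrong data returned with an OK status")
        if flips != n:
            raise AssertionError("bitflip count not reported correctly")
    raise AssertionError("no ECC error ever reported: engine not functional")


for strength in (1, 4):
    corrected = incremental_biterrs(SimulatedPage(512, strength))
    print(f"strength={strength}: corrected up to {corrected} bitflip(s)")
```

The returned count is meant to be compared with the configured ECC
strength: equal means the test passed, lower means the test stopped
before the threshold.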

> We've thus moved to userspace nandflipbits which gives much more
> control over bitflip generation, making it easier to understand if
> everything's fine or not.

nandflipbits is more flexible but less automated. It works identically,
except I believe it erases before rewriting in raw mode (a subtle
difference which may have an impact with some, rare, chips).

> By using this tool, I'm able to reproduce what I think is a PL353 HW
> ECC malfunction, that I think is hardware related (there's some,
> cryptic IMHO, errata on this) but I may be missing something and it
> may be "just" a software bug. (There's also the obvious 3rd option:
> PEBKAC. I'm doing something wrong with my test setup, either on
> kernel/test configuration/usage or in hw setup ;-) )

I am interested in this erratum; do you have a link? I do not remember
seeing it when I worked on this controller.

> Step 1 - SW ECC
>
> Thanks to my patch (and mailing list review) now I can use SW Hamming ECC on Zynq7k based devices. So this test uses software Hamming ECC (1 bit per 256 bytes).
>
> This is the device tree
>
> &nfc0 {
>   status = "okay";
>   nand@0 {
>     reg = <0x0>;
>     #address-cells = <0x1>;
>     #size-cells = <0x1>;
>
>     nand-ecc-mode = "soft";
>     nand-ecc-algo = "hamming";
>     nand-ecc-strength = <1>;
>     nand-ecc-step-size = <256>;
>
>     nand-on-flash-bbt;
>
>     nand-bus-width = <8>;
>     status = "okay";
>
>     partition@nand-ubi {
>       label = "ubi";
>       reg = <0x00000000 0x0>;
>     };
>   };
> };
>
> To make it quick, I'm using just the first EB, with a simple string on it (in my case,
> this is useful for testing on u-boot too, but this is for another separate thread ;-) )
>
> root@sw0005-devel:~# flash_erase /dev/mtd0 0 1
> Erasing 128 Kibyte @ 0 -- 100 % complete
> root@sw0005-devel:~# echo just testing | nandwrite -p /dev/mtd0
> Writing data to block 0 at offset 0x0
> root@sw0005-devel:~# nanddump -c -s 0 --length=100 /dev/mtd0  | head -n 1
> ECC failed: 0
> ECC corrected: 0
> Number of bad blocks: 0
> Number of bbt blocks: 4
> Block size 131072, page size 2048, OOB size 64
> Dumping data starting at 0x00000000 and ending at 0x00000064...
> 0x00000000: 6a 75 73 74 20 74 65 73 74 69 6e 67 0a ff ff ff  |just testing....|
>
> I'm now inserting one bitflip, which is detected and corrected as expected
>
> root@sw0005-devel:~# nandflipbits /dev/mtd0 0@1
> root@sw0005-devel:~# nanddump -n -c -s 0 --length=100 /dev/mtd0  | head -n 1
> Block size 131072, page size 2048, OOB size 64
> Dumping data starting at 0x00000000 and ending at 0x00000064...
> 0x00000000: 6a 74 73 74 20 74 65 73 74 69 6e 67 0a ff ff ff  |jtst testing....|
> root@sw0005-devel:~# nanddump -c -s 0 --length=100 /dev/mtd0  | head -n 1
> ECC failed: 0
> ECC corrected: 0
> Number of bad blocks: 0
> Number of bbt blocks: 4
> Block size 131072, page size 2048, OOB size 64
> Dumping data starting at 0x00000000 and ending at 0x00000064...
> ECC: 1 corrected bitflip(s) at offset 0x00000000
> 0x00000000: 6a 75 73 74 20 74 65 73 74 69 6e 67 0a ff ff ff  |just testing....|
>
> With an additional bitflip, we have an uncorrectable error (and this is, again, expected)
>
> root@sw0005-devel:~# nandflipbits /dev/mtd0 0@2
> root@sw0005-devel:~# nanddump -n -c -s 0 --length=100 /dev/mtd0  | head -n 1
> Block size 131072, page size 2048, OOB size 64
> Dumping data starting at 0x00000000 and ending at 0x00000064...
> 0x00000000: 6a 74 72 74 20 74 65 73 74 69 6e 67 0a ff ff ff  |jtrt testing....|
> root@sw0005-devel:~# nanddump -c -s 0 --length=100 /dev/mtd0  | head -n 1
> ECC failed: 0
> ECC corrected: 1
> Number of bad blocks: 0
> Number of bbt blocks: 4
> Block size 131072, page size 2048, OOB size 64
> Dumping data starting at 0x00000000 and ending at 0x00000064...
> ECC: 1 uncorrectable bitflip(s) at offset 0x00000000
> 0x00000000: 6a 74 72 74 20 74 65 73 74 69 6e 67 0a ff ff ff  |jtrt testing....|
>
> The same applies to another combination of bitflips (this will be useful later and don't look at ECC counters.. I had to reboot the system ;-) )
>
> root@sw0005-devel:~# flash_erase /dev/mtd0 0 1
> Erasing 128 Kibyte @ 0 -- 100 % complete
> root@sw0005-devel:~# echo just testing | nandwrite -p /dev/mtd0
> Writing data to block 0 at offset 0x0
> root@sw0005-devel:~# nandflipbits /dev/mtd0 0@1
> root@sw0005-devel:~# nandflipbits /dev/mtd0 0@0
> root@sw0005-devel:~# nanddump -c -s 0 --length=100 /dev/mtd0  | head -n 1
> ECC failed: 0
> ECC corrected: 0
> Number of bad blocks: 0
> Number of bbt blocks: 4
> Block size 131072, page size 2048, OOB size 64
> Dumping data starting at 0x00000000 and ending at 0x00000064...
> ECC: 1 uncorrectable bitflip(s) at offset 0x00000000
> 0x00000000: 6b 74 73 74 20 74 65 73 74 69 6e 67 0a ff ff ff  |ktst testing....|
>
>
>
> Step 2 - PL353 HW ECC
>
> The device tree is now
> &nfc0 {
>   status = "okay";
>   nand@0 {
>     reg = <0x0>;
>     #address-cells = <0x1>;
>     #size-cells = <0x1>;
>
>     nand-ecc-mode = "hw";
>     nand-ecc-strength = <1>;
>     nand-ecc-step-size = <256>;
>
>     nand-on-flash-bbt;
>
>     nand-bus-width = <8>;
>     status = "okay";
>
>     partition@nand-ubi {
>       label = "ubi";
>       reg = <0x00000000 0x0>;
>     };
>   };
> };
>
> Please note that PL353 is not using nand-ecc-step-size property
> correctly, but this is a secondary issue (this NAND device requires 1
> bit on 512 byte, so it's fine anyway)

Can you elaborate? Looking at the driver, it takes the ECC configuration
from the core (hence, usually from the DT), otherwise it falls back to
what the chip advertises in terms of requirements, and finally it falls
back to 1b/512B as default.
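
The selection order described above can be written down as a small
sketch. This is not the driver's code: EccConfig and pick_ecc_config
are hypothetical names, and any validation the driver may apply to the
chosen values is deliberately left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EccConfig:
    strength: int   # correctable bits per step
    step_size: int  # bytes covered by one ECC step


# Last-resort default mentioned above: 1 bit per 512 bytes.
DEFAULT_ECC = EccConfig(strength=1, step_size=512)


def pick_ecc_config(
    from_core: Optional[EccConfig],
    chip_requirements: Optional[EccConfig],
) -> EccConfig:
    """Apply the fallback order: core/DT, then chip requirements, then default."""
    if from_core is not None:
        return from_core
    if chip_requirements is not None:
        return chip_requirements
    return DEFAULT_ECC


print(pick_ecc_config(None, None))
print(pick_ecc_config(None, EccConfig(4, 512)))
print(pick_ecc_config(EccConfig(1, 256), EccConfig(4, 512)))
```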

> root@sw0005-devel:/lib/modules# cat /sys/class/mtd/mtd0/ecc_step_size
> 512
>
> Re-doing the same test as above
>
> root@sw0005-devel:~# flash_erase /dev/mtd0 0 1
> Erasing 128 Kibyte @ 0 -- 100 % complete
> root@sw0005-devel:~# echo just testing | nandwrite -p /dev/mtd0
> Writing data to block 0 at offset 0x0
> root@sw0005-devel:~# nanddump -c -s 0 --length=100 /dev/mtd0  | head -n 1
> ECC failed: 0
> ECC corrected: 0
> Number of bad blocks: 0
> Number of bbt blocks: 4
> Block size 131072, page size 2048, OOB size 64
> Dumping data starting at 0x00000000 and ending at 0x00000064...
> 0x00000000: 6a 75 73 74 20 74 65 73 74 69 6e 67 0a ff ff ff  |just testing....|
>
> One single bitflip is detected and corrected as expected:
>
> root@sw0005-devel:~# nandflipbits /dev/mtd0 0@1
> root@sw0005-devel:~# nanddump -n -c -s 0 --length=100 /dev/mtd0  | head -n 1
> Block size 131072, page size 2048, OOB size 64
> Dumping data starting at 0x00000000 and ending at 0x00000064...
> 0x00000000: 6a 74 73 74 20 74 65 73 74 69 6e 67 0a ff ff ff  |jtst testing....|
> root@sw0005-devel:~# nanddump -c -s 0 --length=100 /dev/mtd0  | head -n 1
> ECC failed: 0
> ECC corrected: 0
> Number of bad blocks: 0
> Number of bbt blocks: 4
> Block size 131072, page size 2048, OOB size 64
> Dumping data starting at 0x00000000 and ending at 0x00000064...
> ECC: 1 corrected bitflip(s) at offset 0x00000000
> 0x00000000: 6a 75 73 74 20 74 65 73 74 69 6e 67 0a ff ff ff  |just testing....|
>
> a 2nd bitflip is detected as uncorrectable as expected:
>
> root@sw0005-devel:~# nandflipbits /dev/mtd0 0@2
> root@sw0005-devel:~# nanddump -n -c -s 0 --length=100 /dev/mtd0  | head -n 1
> Block size 131072, page size 2048, OOB size 64
> Dumping data starting at 0x00000000 and ending at 0x00000064...
> 0x00000000: 6a 74 72 74 20 74 65 73 74 69 6e 67 0a ff ff ff  |jtrt testing....|
> root@sw0005-devel:~# nanddump -c -s 0 --length=100 /dev/mtd0  | head -n 1
> ECC failed: 0
> ECC corrected: 1
> Number of bad blocks: 0
> Number of bbt blocks: 4
> Block size 131072, page size 2048, OOB size 64
> Dumping data starting at 0x00000000 and ending at 0x00000064...
> ECC: 1 uncorrectable bitflip(s) at offset 0x00000000
> 0x00000000: 6a 74 72 74 20 74 65 73 74 69 6e 67 0a ff ff ff  |jtrt testing....|
>
> But there are some corner cases, e.g. double bitflips that are (wrongly) detected as a single bitflip and return wrong data:
>
> root@sw0005-devel:~# nandflipbits /dev/mtd0 0@2
> root@sw0005-devel:~# nandflipbits /dev/mtd0 1@1
> root@sw0005-devel:~# nanddump -c -s 0 --length=100 /dev/mtd0  | head -n 1
> ECC failed: 1
> ECC corrected: 2
> Number of bad blocks: 0
> Number of bbt blocks: 4
> Block size 131072, page size 2048, OOB size 64
> Dumping data starting at 0x00000000 and ending at 0x00000064...
> ECC: 1 corrected bitflip(s) at offset 0x00000000
> 0x00000000: 6a 76 73 74 20 74 65 73 74 69 6e 67 0a ff ff ff  |jvst testing....|
>
> Another full test from scratch (ECC corrected counter is bigger than expected because I had to try a few combinations, without rebooting ;-) )
>
> root@sw0005-devel:~# flash_erase /dev/mtd0 0 1
> Erasing 128 Kibyte @ 0 -- 100 % complete
> root@sw0005-devel:~# echo just testing | nandwrite -p /dev/mtd0
> Writing data to block 0 at offset 0x0
> root@sw0005-devel:~# nandflipbits /dev/mtd0 0@1
> root@sw0005-devel:~# nandflipbits /dev/mtd0 0@0
> root@sw0005-devel:~# nanddump -c -s 0 --length=100 /dev/mtd0  | head -n 1
> ECC failed: 1
> ECC corrected: 6
> Number of bad blocks: 0
> Number of bbt blocks: 4
> Block size 131072, page size 2048, OOB size 64
> Dumping data starting at 0x00000000 and ending at 0x00000064...
> ECC: 1 corrected bitflip(s) at offset 0x00000000
> 0x00000000: 6b 74 73 74 20 74 65 73 74 69 6e 67 0a ff ff ff  |ktst testing....|

This is scary.

> Conclusions:
>
> IIUC with the results of the above test, we have an issue on PL353 because it cannot detect double bit errors (at least some combinations of them) and, while this is a rare event on SLC NAND devices (that require 1 bit ECC to guarantee 100k PE cycles), I think that this might cause some catastrophic failures in the field (because, AFAIK, upper MTD layers, like UBI, don't expect this situation).
> Am I wrong?

If you use UBI on top, depending on where the bitflips are, they may be
found due to UBI using checksums quite extensively, but this is clearly
not the nominal case. This is a dangerous hardware bug.
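
To illustrate the principle only (this is not UBI's on-flash layout),
here is a minimal sketch of how a CRC-32 stored next to a payload
catches data that the ECC layer returned as "fine". The seal/verify
helpers and the record layout are made up for the example; the two
flipped bits are the ones from the transcript above.

```python
import zlib


def seal(payload: bytes) -> bytes:
    """Append a little-endian CRC-32 of the payload (illustrative layout)."""
    return payload + zlib.crc32(payload).to_bytes(4, "little")


def verify(record: bytes) -> bool:
    """Recompute the CRC-32 over the payload and compare with the stored one."""
    payload, stored = record[:-4], record[-4:]
    return zlib.crc32(payload).to_bytes(4, "little") == stored


record = seal(b"just testing\n")

# Same two flips as in the thread: bit 0 of byte 0 and bit 0 of byte 1.
damaged = bytearray(record)
damaged[0] ^= 0x01
damaged[1] ^= 0x01

print("intact record verifies: ", verify(record))
print("damaged record verifies:", verify(bytes(damaged)))
```

This only helps when the flipped bits fall inside data covered by a
checksum, which is why it cannot replace a correct ECC status.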

> I kindly ask the MTD experts if I have to worry about this or if we
> can assume that correcting 1 bit error is enough for this subsystem.

No, the expectation is a clear failure upon double bit errors. Be
careful though, Hamming ECC engines carry *no guarantee* for 3 bit
errors. Only 0, 1 and 2 are part of the scope, and 2 bit errors are
uncorrectable, which means:
- 0 bf, ok
- 1 bf, ok + reporting 1 bf
- 2 bf, NOK + reporting an error
- more bf: no guarantee, usually returns incorrect data with a correct
status
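
The four cases can be reproduced with a small simulation. This is a
sketch of a textbook Hamming scheme over one 256-byte step (11 pairs
of parity bits over the bit index), not the PL353 logic nor the kernel
implementation; all function names are made up for the example.

```python
from typing import Tuple

INDEX_BITS = 11                 # 256 bytes = 2048 bits -> 11 index bits
MASK = (1 << INDEX_BITS) - 1


def compute_ecc(data: bytes) -> Tuple[int, int]:
    """Return (odd, even) parity words computed over the bit indices."""
    odd = even = 0
    for i in range(len(data) * 8):
        if (data[i // 8] >> (i % 8)) & 1:
            odd ^= i            # parity of set bits whose index bit is 1
            even ^= ~i & MASK   # parity of set bits whose index bit is 0
    return odd, even


def flip(data: bytearray, *indices: int) -> None:
    for i in indices:
        data[i // 8] ^= 1 << (i % 8)


def decode(data: bytearray, stored: Tuple[int, int]) -> str:
    """Correct `data` in place when possible and return the status."""
    calc = compute_ecc(bytes(data))
    s_odd, s_even = stored[0] ^ calc[0], stored[1] ^ calc[1]
    if s_odd == 0 and s_even == 0:
        return "clean"
    if s_odd ^ s_even == MASK:
        flip(data, s_odd)       # looks like one flipped data bit: fix it
        return "corrected"
    if bin(s_odd).count("1") + bin(s_even).count("1") == 1:
        return "corrected"      # one flipped bit in the ECC bytes themselves
    return "uncorrectable"


original = bytes(range(256))
stored_ecc = compute_ecc(original)


def trial(*bits: int) -> Tuple[str, bool]:
    """Flip the given bit indices, decode, report (status, data intact?)."""
    page = bytearray(original)
    flip(page, *bits)
    status = decode(page, stored_ecc)
    return status, bytes(page) == original


# Bit index 0 is bit 0 of byte 0, index 8 is bit 0 of byte 1, and so on.
for bits in ((), (8,), (0, 8), (0, 8, 16)):
    print(len(bits), "bf:", trial(*bits))
```

In this model the three-bitflip case comes back as "corrected" with
wrong data, which is the kind of silent failure described in the last
bullet.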

What you observe is maybe the last case. If you read the whole page raw,
you may find that you are silently facing a 3 bit error case due to a
lot of repetitions (?); the extra bitflip can be somewhere else in the
page. Be very careful when testing with nandflipbits because the tool
does not check for data integrity, unlike nandbiterrs, which does. Are
you sure your disapproval of nandbiterrs is justified in the first
place? Could it be that the NAND chip you're testing with is a bit
faulty/unstable and nandbiterrs returns errors where you do not expect
them because of that?
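
One way to rule out a hidden extra bitflip is to compare a raw dump of
the whole page against the data that was originally written. A minimal
sketch, assuming both are available as byte strings of equal length
(bitflip_positions is a hypothetical helper; the sample bytes are the
first four from the transcript above):

```python
from typing import List, Tuple


def bitflip_positions(reference: bytes, raw: bytes) -> List[Tuple[int, int]]:
    """List every differing bit as (bit, byte_offset), i.e. bit@offset."""
    if len(reference) != len(raw):
        raise ValueError("dumps must have the same length")
    return [
        (bit, offset)
        for offset, (a, b) in enumerate(zip(reference, raw))
        for bit in range(8)
        if ((a ^ b) >> bit) & 1
    ]


reference = bytes.fromhex("6a757374")  # "just"
raw = bytes.fromhex("6b747374")        # "ktst", as read back in raw mode

flips = bitflip_positions(reference, raw)
print(len(flips), "bitflip(s):", ", ".join(f"{bit}@{off}" for bit, off in flips))
```

Run over the full page (and ideally the OOB area too), more than two
reported positions would point to the "more bf" case rather than to a
missed double bit error.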

This is obviously just speculation, maybe the errata you mentioned above
will bring an obvious hardware failure to our attention. The Arasan IP
used on ZynqMP also suffers from a similar limitation (not able to
correctly report failures) and I decided to implement one path using the
software BCH engine, with a time penalty of course.

Thanks,
Miquèl

______________________________________________________
Linux MTD discussion mailing list
http://lists.infradead.org/mailman/listinfo/linux-mtd/