* PL353 NAND Controller - SW vs HW ECC
@ 2026-02-09 11:37 Andrea Scian
2026-02-10 10:12 ` Miquel Raynal
0 siblings, 1 reply; 6+ messages in thread
From: Andrea Scian @ 2026-02-09 11:37 UTC (permalink / raw)
To: linux-mtd@lists.infradead.org
Cc: Miquel Raynal, michal.simek@amd.com, richard@nod.at,
vigneshr@ti.com, amit.kumar-mahapatra@amd.com
Dear all,
I hope I am not annoying you by putting you directly in CC, but you are the people who were already involved in my patch to fix SW ECC support in the PL353 NAND controller (mainly used in the Xilinx/AMD Zynq7k SoC), and I think you are the ones who might help me with this follow-up.
Our standard HW/SW validation procedure for BSPs includes (after some basic functional tests) raw NAND MTD tests.
Usually we check ECC functionality with mtd_nandbiterrs, but its way of testing ECC correction is quite obscure and the module is unmaintained (see the thread between Miquel and me on this mailing list in December 2025 on this topic).
We have thus moved to the userspace tool nandflipbits, which gives much more control over bitflip generation and makes it easier to understand whether everything is fine or not.
By using this tool, I am able to reproduce what I think is a PL353 HW ECC malfunction. I think it is hardware related (there is an erratum on this, cryptic IMHO), but I may be missing something and it may be "just" a software bug.
There is also the obvious third option: PEBKAC. I may be doing something wrong with my test setup, either in the kernel/test configuration/usage or in the HW setup ;-)
Step 1 - SW ECC
Thanks to my patch (and the mailing list review), I can now use SW Hamming ECC on Zynq7k-based devices. So this test uses software Hamming ECC (1 bit per 256 bytes).
This is the device tree
&nfc0 {
status = "okay";
nand@0 {
reg = <0x0>;
#address-cells = <0x1>;
#size-cells = <0x1>;
nand-ecc-mode = "soft";
nand-ecc-algo = "hamming";
nand-ecc-strength = <1>;
nand-ecc-step-size = <256>;
nand-on-flash-bbt;
nand-bus-width = <8>;
status = "okay";
partition@nand-ubi {
label = "ubi";
reg = <0x00000000 0x0>;
};
};
};
To make it quick, I'm using just the first erase block (EB), with a simple string on it (in my case,
this is useful for testing in U-Boot too, but that is for a separate thread ;-) )
root@sw0005-devel:~# flash_erase /dev/mtd0 0 1
Erasing 128 Kibyte @ 0 -- 100 % complete
root@sw0005-devel:~# echo just testing | nandwrite -p /dev/mtd0
Writing data to block 0 at offset 0x0
root@sw0005-devel:~# nanddump -c -s 0 --length=100 /dev/mtd0 | head -n 1
ECC failed: 0
ECC corrected: 0
Number of bad blocks: 0
Number of bbt blocks: 4
Block size 131072, page size 2048, OOB size 64
Dumping data starting at 0x00000000 and ending at 0x00000064...
0x00000000: 6a 75 73 74 20 74 65 73 74 69 6e 67 0a ff ff ff |just testing....|
I'm now inserting one bitflip, which is detected and corrected as expected
root@sw0005-devel:~# nandflipbits /dev/mtd0 0@1
root@sw0005-devel:~# nanddump -n -c -s 0 --length=100 /dev/mtd0 | head -n 1
Block size 131072, page size 2048, OOB size 64
Dumping data starting at 0x00000000 and ending at 0x00000064...
0x00000000: 6a 74 73 74 20 74 65 73 74 69 6e 67 0a ff ff ff |jtst testing....|
root@sw0005-devel:~# nanddump -c -s 0 --length=100 /dev/mtd0 | head -n 1
ECC failed: 0
ECC corrected: 0
Number of bad blocks: 0
Number of bbt blocks: 4
Block size 131072, page size 2048, OOB size 64
Dumping data starting at 0x00000000 and ending at 0x00000064...
ECC: 1 corrected bitflip(s) at offset 0x00000000
0x00000000: 6a 75 73 74 20 74 65 73 74 69 6e 67 0a ff ff ff |just testing....|
With an additional bitflip, we have an uncorrectable error (and this is, again, expected)
root@sw0005-devel:~# nandflipbits /dev/mtd0 0@2
root@sw0005-devel:~# nanddump -n -c -s 0 --length=100 /dev/mtd0 | head -n 1
Block size 131072, page size 2048, OOB size 64
Dumping data starting at 0x00000000 and ending at 0x00000064...
0x00000000: 6a 74 72 74 20 74 65 73 74 69 6e 67 0a ff ff ff |jtrt testing....|
root@sw0005-devel:~# nanddump -c -s 0 --length=100 /dev/mtd0 | head -n 1
ECC failed: 0
ECC corrected: 1
Number of bad blocks: 0
Number of bbt blocks: 4
Block size 131072, page size 2048, OOB size 64
Dumping data starting at 0x00000000 and ending at 0x00000064...
ECC: 1 uncorrectable bitflip(s) at offset 0x00000000
0x00000000: 6a 74 72 74 20 74 65 73 74 69 6e 67 0a ff ff ff |jtrt testing....|
The same applies to another combination of bitflips (this will be useful later; ignore the ECC counters, I had to reboot the system ;-) )
root@sw0005-devel:~# flash_erase /dev/mtd0 0 1
Erasing 128 Kibyte @ 0 -- 100 % complete
root@sw0005-devel:~# echo just testing | nandwrite -p /dev/mtd0
Writing data to block 0 at offset 0x0
root@sw0005-devel:~# nandflipbits /dev/mtd0 0@1
root@sw0005-devel:~# nandflipbits /dev/mtd0 0@0
root@sw0005-devel:~# nanddump -c -s 0 --length=100 /dev/mtd0 | head -n 1
ECC failed: 0
ECC corrected: 0
Number of bad blocks: 0
Number of bbt blocks: 4
Block size 131072, page size 2048, OOB size 64
Dumping data starting at 0x00000000 and ending at 0x00000064...
ECC: 1 uncorrectable bitflip(s) at offset 0x00000000
0x00000000: 6b 74 73 74 20 74 65 73 74 69 6e 67 0a ff ff ff |ktst testing....|
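The raw dumps in Step 1 can be cross-checked offline. The following is a minimal sketch, not part of mtd-utils: it reads each nandflipbits argument as bit@byte-address (the interpretation consistent with the raw dumps above), and the helper name flip_bits is hypothetical.

```python
def flip_bits(data: bytes, flips: list[str]) -> bytes:
    """Apply 'bit@address' flips (as read from the transcripts) to a copy of data."""
    buf = bytearray(data)
    for spec in flips:
        bit, address = (int(part, 0) for part in spec.split("@"))
        buf[address] ^= 1 << bit  # toggle one bit of one byte
    return bytes(buf)


# Payload written by: echo just testing | nandwrite -p /dev/mtd0
PAYLOAD = b"just testing\n"

for flips in (["0@1"], ["0@1", "0@2"], ["0@1", "0@0"]):
    # Print the first four bytes, to compare with the raw (nanddump -n) output.
    print(flips, flip_bits(PAYLOAD, flips)[:4].hex(" "))
```

The three printed prefixes (6a 74 73 74, 6a 74 72 74 and 6b 74 73 74) match the raw data seen after the corresponding nandflipbits calls.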
Step 2 - PL353 HW ECC
The device tree is now
&nfc0 {
status = "okay";
nand@0 {
reg = <0x0>;
#address-cells = <0x1>;
#size-cells = <0x1>;
nand-ecc-mode = "hw";
nand-ecc-strength = <1>;
nand-ecc-step-size = <256>;
nand-on-flash-bbt;
nand-bus-width = <8>;
status = "okay";
partition@nand-ubi {
label = "ubi";
reg = <0x00000000 0x0>;
};
};
};
Please note that PL353 is not using the nand-ecc-step-size property correctly, but this is a secondary issue (this NAND device requires 1 bit per 512 bytes, so it's fine anyway)
root@sw0005-devel:/lib/modules# cat /sys/class/mtd/mtd0/ecc_step_size
512
Re-doing the same test as above
root@sw0005-devel:~# flash_erase /dev/mtd0 0 1
Erasing 128 Kibyte @ 0 -- 100 % complete
root@sw0005-devel:~# echo just testing | nandwrite -p /dev/mtd0
Writing data to block 0 at offset 0x0
root@sw0005-devel:~# nanddump -c -s 0 --length=100 /dev/mtd0 | head -n 1
ECC failed: 0
ECC corrected: 0
Number of bad blocks: 0
Number of bbt blocks: 4
Block size 131072, page size 2048, OOB size 64
Dumping data starting at 0x00000000 and ending at 0x00000064...
0x00000000: 6a 75 73 74 20 74 65 73 74 69 6e 67 0a ff ff ff |just testing....|
One single bitflip is detected and corrected as expected:
root@sw0005-devel:~# nandflipbits /dev/mtd0 0@1
root@sw0005-devel:~# nanddump -n -c -s 0 --length=100 /dev/mtd0 | head -n 1
Block size 131072, page size 2048, OOB size 64
Dumping data starting at 0x00000000 and ending at 0x00000064...
0x00000000: 6a 74 73 74 20 74 65 73 74 69 6e 67 0a ff ff ff |jtst testing....|
root@sw0005-devel:~# nanddump -c -s 0 --length=100 /dev/mtd0 | head -n 1
ECC failed: 0
ECC corrected: 0
Number of bad blocks: 0
Number of bbt blocks: 4
Block size 131072, page size 2048, OOB size 64
Dumping data starting at 0x00000000 and ending at 0x00000064...
ECC: 1 corrected bitflip(s) at offset 0x00000000
0x00000000: 6a 75 73 74 20 74 65 73 74 69 6e 67 0a ff ff ff |just testing....|
A second bitflip is detected as uncorrectable, as expected:
root@sw0005-devel:~# nandflipbits /dev/mtd0 0@2
root@sw0005-devel:~# nanddump -n -c -s 0 --length=100 /dev/mtd0 | head -n 1
Block size 131072, page size 2048, OOB size 64
Dumping data starting at 0x00000000 and ending at 0x00000064...
0x00000000: 6a 74 72 74 20 74 65 73 74 69 6e 67 0a ff ff ff |jtrt testing....|
root@sw0005-devel:~# nanddump -c -s 0 --length=100 /dev/mtd0 | head -n 1
ECC failed: 0
ECC corrected: 1
Number of bad blocks: 0
Number of bbt blocks: 4
Block size 131072, page size 2048, OOB size 64
Dumping data starting at 0x00000000 and ending at 0x00000064...
ECC: 1 uncorrectable bitflip(s) at offset 0x00000000
0x00000000: 6a 74 72 74 20 74 65 73 74 69 6e 67 0a ff ff ff |jtrt testing....|
But there are some corner cases, e.g. double bitflips that are (wrongly) detected as a single bitflip and return wrong data:
root@sw0005-devel:~# nandflipbits /dev/mtd0 0@2
root@sw0005-devel:~# nandflipbits /dev/mtd0 1@1
root@sw0005-devel:~# nanddump -c -s 0 --length=100 /dev/mtd0 | head -n 1
ECC failed: 1
ECC corrected: 2
Number of bad blocks: 0
Number of bbt blocks: 4
Block size 131072, page size 2048, OOB size 64
Dumping data starting at 0x00000000 and ending at 0x00000064...
ECC: 1 corrected bitflip(s) at offset 0x00000000
0x00000000: 6a 76 73 74 20 74 65 73 74 69 6e 67 0a ff ff ff |jvst testing....|
Another full test from scratch (the ECC corrected counter is bigger than expected because I had to try a few combinations without rebooting ;-) )
root@sw0005-devel:~# flash_erase /dev/mtd0 0 1
Erasing 128 Kibyte @ 0 -- 100 % complete
root@sw0005-devel:~# echo just testing | nandwrite -p /dev/mtd0
Writing data to block 0 at offset 0x0
root@sw0005-devel:~# nandflipbits /dev/mtd0 0@1
root@sw0005-devel:~# nandflipbits /dev/mtd0 0@0
root@sw0005-devel:~# nanddump -c -s 0 --length=100 /dev/mtd0 | head -n 1
ECC failed: 1
ECC corrected: 6
Number of bad blocks: 0
Number of bbt blocks: 4
Block size 131072, page size 2048, OOB size 64
Dumping data starting at 0x00000000 and ending at 0x00000064...
ECC: 1 corrected bitflip(s) at offset 0x00000000
0x00000000: 6b 74 73 74 20 74 65 73 74 69 6e 67 0a ff ff ff |ktst testing....|
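To quantify "wrong data", here is a minimal sketch that counts how many bits differ between the written payload and the two buffers returned above with a "corrected" status. Only the first four bytes shown in the dumps are compared; the helper name bit_distance and the dictionary labels are hypothetical.

```python
def bit_distance(a: bytes, b: bytes) -> int:
    """Number of bit positions in which two equally long buffers differ."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


WRITTEN = bytes.fromhex("6a 75 73 74")  # "just"

# Data returned by nanddump (ECC enabled) in the two HW ECC corner cases.
RETURNED = {
    "jvst": bytes.fromhex("6a 76 73 74"),
    "ktst": bytes.fromhex("6b 74 73 74"),
}

for label, data in RETURNED.items():
    print(label, bit_distance(WRITTEN, data))
```

Both buffers still differ from the written data by two bits, so whatever bit the engine "corrected" was not one of the bits flipped in these bytes.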
Conclusions:
IIUC, the results of the above tests show an issue on PL353: it cannot detect double-bit errors (at least some combinations of them). While this is a rare event on SLC NAND devices (which require 1-bit ECC to guarantee 100k P/E cycles), I think it might cause catastrophic failures in the field (because, AFAIK, upper MTD layers such as UBI don't expect this situation).
Am I wrong?
I kindly ask the MTD experts whether I have to worry about this, or whether we can assume that correcting 1-bit errors is enough for this subsystem.
If this is not acceptable, I think we have to update the driver, at least to warn the user about this and to use SW ECC where possible. WDYT?
If anybody on this list can also help me understand whether I'm doing something wrong with my test, or whether I have some setup/configuration error, it would be appreciated, and I can run additional tests.
In the meantime, as my previous patch anticipated, I'm trying to set up a configuration using only SW ECC (I'm currently stuck on Linux/U-Boot compatibility, but this might not be the right ML to discuss this topic).
Any feedback is appreciated. Kind Regards,
Andrea SCIAN
SW Development Manager
DAVE Embedded Systems
______________________________________________________
Linux MTD discussion mailing list
http://lists.infradead.org/mailman/listinfo/linux-mtd/
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: PL353 NAND Controller - SW vs HW ECC
2026-02-09 11:37 PL353 NAND Controller - SW vs HW ECC Andrea Scian
@ 2026-02-10 10:12 ` Miquel Raynal
2026-02-10 14:14 ` Andrea Scian
0 siblings, 1 reply; 6+ messages in thread
From: Miquel Raynal @ 2026-02-10 10:12 UTC (permalink / raw)
To: Andrea Scian
Cc: linux-mtd@lists.infradead.org, michal.simek@amd.com,
richard@nod.at, vigneshr@ti.com, amit.kumar-mahapatra@amd.com
Hi Andrea,
On 09/02/2026 at 11:37:37 GMT, Andrea Scian <andrea.scian@dave.eu> wrote:
> Dear all,
>
> I hope I don't annoying you by putting directly in CC, but these people are the one that were already involved in my patch to fix SW ECC support in PL353 NAND controller (mainly used in Xilinx/AMD Zynq7k SoC), and I think are the one that might help me with this follow-up.
>
> Our standard HW/SW validation procedure for BSPs includes (after some basic functional tests) raw NAND MTD tests.
>
> Usually we check ECC functionality with mtd_nandbiterrs but it's way
> of testing ECC correction is quite obscure and unmaintained (see a
> thread between me and Miquel on this mailing list in December 2025 on
> this topic).
We stopped developing the kernel modules; for testing, we advise using
the same tools from the mtd-utils test suite, which are actively
maintained.
nandbiterrs -i is the correct tool for testing your ECC engine. It works
this way (from memory, maybe not 100% accurate, but that's the idea):
* One time:
- write (pseudo) random content
- read it back and verify the content, expecting 0 bf
- read it back in raw mode, including the OOB area with the ECC bytes
* Repeat:
- insert a bitflip in the first subpage
- write it in raw mode to force the bitflip
- read it back and expect the data to be corrected; the bitflip count shall
be reported. If the data is incorrect, the page helpers are incorrect.
* Finish:
When reaching the threshold of your engine, you should expect an ECC
error, which indicates a success for the test. If you get past the
threshold, the ECC engine is not functional. If the test stops before
the threshold, something is wrong in your configuration or ECC engine.
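The pass/fail rules above can be written down as a small decision function. This is only a sketch of the rules as described in this mail, not the nandbiterrs source; the Step type, the verdict function and the message strings are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Step:
    inserted: int   # bitflips inserted so far in the subpage
    read_ok: bool   # True if the ECC-enabled read returned without error
    data_ok: bool   # True if the returned data equals what was written


def verdict(step: Step, strength: int) -> str:
    """Classify one iteration for an engine correcting `strength` bits."""
    if not step.read_ok:
        # An ECC error is the expected end of the test, but only once the
        # number of inserted bitflips exceeds the engine strength.
        if step.inserted > strength:
            return "PASS: ECC error at the threshold"
        return "FAIL: read error before the threshold"
    if not step.data_ok:
        return "FAIL: invalid data despite read success"
    if step.inserted > strength:
        return "FAIL: went past the threshold"
    return "continue"


# 1-bit engine: one flip corrected, then two flips.
print(verdict(Step(inserted=1, read_ok=True, data_ok=True), strength=1))
print(verdict(Step(inserted=2, read_ok=False, data_ok=False), strength=1))
print(verdict(Step(inserted=2, read_ok=True, data_ok=False), strength=1))
```

For a 1-bit engine, a read error after the second inserted bitflip is the successful outcome, while a successful read returning wrong data is a failure.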
> We've thus moved to userspace nandflipbits which give much more
> control on bitflip generation, making easier to understand if
> everything's fine or not.
nandflipbits is more flexible but less automated. It works identically,
except I believe it erases before rewriting in raw mode (which is a
subtle difference, this may have an impact with some -rare- chips).
> By using this tool, I'm able to reproduce what I think is a PL353 HW
> ECC malfunction, that I think is hardware related (there's some,
> cryptic IMHO, errata on this) but I may be missing something and it
> may be "just" a software bug There's also the obvious 3rd option:
> PEBKAC. I'm doing something wrong with my test setup, either on
> kernel/test configuration/usage or in hw setup ;-) )
I am interested in this erratum, do you have a link? I do not remember
seeing it when I worked on this controller.
> Step 1 - SW ECC
>
> Thanks to my patch (and mailing list review) now I can use SW Hamming ECC on Zynq7k based devices. So this test is about using software hamming ECC on (1 bit on 256 byte)
>
> This is the device tree
>
> &nfc0 {
> status = "okay";
> nand@0 {
> reg = <0x0>;
> #address-cells = <0x1>;
> #size-cells = <0x1>;
>
> nand-ecc-mode = "soft";
> nand-ecc-algo = "hamming";
> nand-ecc-strength = <1>;
> nand-ecc-step-size = <256>;
>
> nand-on-flash-bbt;
>
> nand-bus-width = <8>;
> status = "okay";
>
> partition@nand-ubi {
> label = "ubi";
> reg = <0x00000000 0x0>;
> };
> };
> };
>
> To make it quick, I'm using just the first EB, with a simple string on it (in my case,
> this is useful for testing on u-boot too, but this is for another separate thread ;-) )
>
> root@sw0005-devel:~# flash_erase /dev/mtd0 0 1
> Erasing 128 Kibyte @ 0 -- 100 % complete
> root@sw0005-devel:~# echo just testing | nandwrite -p /dev/mtd0
> Writing data to block 0 at offset 0x0
> root@sw0005-devel:~# nanddump -c -s 0 --length=100 /dev/mtd0 | head -n 1
> ECC failed: 0
> ECC corrected: 0
> Number of bad blocks: 0
> Number of bbt blocks: 4
> Block size 131072, page size 2048, OOB size 64
> Dumping data starting at 0x00000000 and ending at 0x00000064...
> 0x00000000: 6a 75 73 74 20 74 65 73 74 69 6e 67 0a ff ff ff |just testing....|
>
> I'm now inserting one bitflip, which is detected and corrected as expected
>
> root@sw0005-devel:~# nandflipbits /dev/mtd0 0@1
> root@sw0005-devel:~# nanddump -n -c -s 0 --length=100 /dev/mtd0 | head -n 1
> Block size 131072, page size 2048, OOB size 64
> Dumping data starting at 0x00000000 and ending at 0x00000064...
> 0x00000000: 6a 74 73 74 20 74 65 73 74 69 6e 67 0a ff ff ff |jtst testing....|
> root@sw0005-devel:~# nanddump -c -s 0 --length=100 /dev/mtd0 | head -n 1
> ECC failed: 0
> ECC corrected: 0
> Number of bad blocks: 0
> Number of bbt blocks: 4
> Block size 131072, page size 2048, OOB size 64
> Dumping data starting at 0x00000000 and ending at 0x00000064...
> ECC: 1 corrected bitflip(s) at offset 0x00000000
> 0x00000000: 6a 75 73 74 20 74 65 73 74 69 6e 67 0a ff ff ff |just testing....|
>
> With an additional bitflip, we have an uncorrectable error (and this is, again, expected)
>
> root@sw0005-devel:~# nandflipbits /dev/mtd0 0@2
> root@sw0005-devel:~# nanddump -n -c -s 0 --length=100 /dev/mtd0 | head -n 1
> Block size 131072, page size 2048, OOB size 64
> Dumping data starting at 0x00000000 and ending at 0x00000064...
> 0x00000000: 6a 74 72 74 20 74 65 73 74 69 6e 67 0a ff ff ff |jtrt testing....|
> root@sw0005-devel:~# nanddump -c -s 0 --length=100 /dev/mtd0 | head -n 1
> ECC failed: 0
> ECC corrected: 1
> Number of bad blocks: 0
> Number of bbt blocks: 4
> Block size 131072, page size 2048, OOB size 64
> Dumping data starting at 0x00000000 and ending at 0x00000064...
> ECC: 1 uncorrectable bitflip(s) at offset 0x00000000
> 0x00000000: 6a 74 72 74 20 74 65 73 74 69 6e 67 0a ff ff ff |jtrt testing....|
>
> The same applies to another combination of bitflips (this will be useful later and don't look at ECC counters.. I had to reboot the system ;-) )
>
> root@sw0005-devel:~# flash_erase /dev/mtd0 0 1
> Erasing 128 Kibyte @ 0 -- 100 % complete
> root@sw0005-devel:~# echo just testing | nandwrite -p /dev/mtd0
> Writing data to block 0 at offset 0x0
> root@sw0005-devel:~# nandflipbits /dev/mtd0 0@1
> root@sw0005-devel:~# nandflipbits /dev/mtd0 0@0
> root@sw0005-devel:~# nanddump -c -s 0 --length=100 /dev/mtd0 | head -n 1
> ECC failed: 0
> ECC corrected: 0
> Number of bad blocks: 0
> Number of bbt blocks: 4
> Block size 131072, page size 2048, OOB size 64
> Dumping data starting at 0x00000000 and ending at 0x00000064...
> ECC: 1 uncorrectable bitflip(s) at offset 0x00000000
> 0x00000000: 6b 74 73 74 20 74 65 73 74 69 6e 67 0a ff ff ff |ktst testing....|
>
>
>
> Step 2 - PL353 HW ECC
>
> The device tree is now
> &nfc0 {
> status = "okay";
> nand@0 {
> reg = <0x0>;
> #address-cells = <0x1>;
> #size-cells = <0x1>;
>
> nand-ecc-mode = "hw";
> nand-ecc-strength = <1>;
> nand-ecc-step-size = <256>;
>
> nand-on-flash-bbt;
>
> nand-bus-width = <8>;
> status = "okay";
>
> partition@nand-ubi {
> label = "ubi";
> reg = <0x00000000 0x0>;
> };
> };
> };
>
> Please note that PL353 is not using nand-ecc-step-size property
> correctly, but this is a secondary issue (this NAND device requires 1
> bit on 512 byte, so it's fine anyway)
Can you elaborate? Looking at the driver, it takes the ECC configuration
from the core (hence, usually from the DT), otherwise it falls back to
what the chip advertises in terms of requirements, and finally it falls
back to 1b/512B as default.
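The selection order described above can be sketched as follows. This models only the description in this mail, not the driver code; the function name and the (strength, step size) tuple layout are hypothetical.

```python
from typing import Optional, Tuple

EccConfig = Tuple[int, int]  # (strength in bits, step size in bytes)

DEFAULT_ECC: EccConfig = (1, 512)  # final fallback named in the description


def pick_ecc_config(from_dt: Optional[EccConfig],
                    chip_requirement: Optional[EccConfig]) -> EccConfig:
    """Return the first available configuration: DT, then chip, then default."""
    for candidate in (from_dt, chip_requirement):
        if candidate is not None:
            return candidate
    return DEFAULT_ECC


# DT from the test above: strength 1, step size 256.
print(pick_ecc_config((1, 256), (1, 512)))
# No DT properties: fall back to what the chip advertises.
print(pick_ecc_config(None, (1, 512)))
# Nothing known: default.
print(pick_ecc_config(None, None))
```

Under this order alone, a device tree requesting a 256-byte step would select 256, which is why the 512 reported by sysfs deserves an explanation.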
> root@sw0005-devel:/lib/modules# cat /sys/class/mtd/mtd0/ecc_step_size
> 512
>
> Re-doing the same test as above
>
> root@sw0005-devel:~# flash_erase /dev/mtd0 0 1
> Erasing 128 Kibyte @ 0 -- 100 % complete
> root@sw0005-devel:~# echo just testing | nandwrite -p /dev/mtd0
> Writing data to block 0 at offset 0x0
> root@sw0005-devel:~# nanddump -c -s 0 --length=100 /dev/mtd0 | head -n 1
> ECC failed: 0
> ECC corrected: 0
> Number of bad blocks: 0
> Number of bbt blocks: 4
> Block size 131072, page size 2048, OOB size 64
> Dumping data starting at 0x00000000 and ending at 0x00000064...
> 0x00000000: 6a 75 73 74 20 74 65 73 74 69 6e 67 0a ff ff ff |just testing....|
>
> One single bitflip is detected and corrected as expected:
>
> root@sw0005-devel:~# nandflipbits /dev/mtd0 0@1
> root@sw0005-devel:~# nanddump -n -c -s 0 --length=100 /dev/mtd0 | head -n 1
> Block size 131072, page size 2048, OOB size 64
> Dumping data starting at 0x00000000 and ending at 0x00000064...
> 0x00000000: 6a 74 73 74 20 74 65 73 74 69 6e 67 0a ff ff ff |jtst testing....|
> root@sw0005-devel:~# nanddump -c -s 0 --length=100 /dev/mtd0 | head -n 1
> ECC failed: 0
> ECC corrected: 0
> Number of bad blocks: 0
> Number of bbt blocks: 4
> Block size 131072, page size 2048, OOB size 64
> Dumping data starting at 0x00000000 and ending at 0x00000064...
> ECC: 1 corrected bitflip(s) at offset 0x00000000
> 0x00000000: 6a 75 73 74 20 74 65 73 74 69 6e 67 0a ff ff ff |just testing....|
>
> a 2nd bitflip is detected as uncorrectable as expected:
>
> root@sw0005-devel:~# nandflipbits /dev/mtd0 0@2
> root@sw0005-devel:~# nanddump -n -c -s 0 --length=100 /dev/mtd0 | head -n 1
> Block size 131072, page size 2048, OOB size 64
> Dumping data starting at 0x00000000 and ending at 0x00000064...
> 0x00000000: 6a 74 72 74 20 74 65 73 74 69 6e 67 0a ff ff ff |jtrt testing....|
> root@sw0005-devel:~# nanddump -c -s 0 --length=100 /dev/mtd0 | head -n 1
> ECC failed: 0
> ECC corrected: 1
> Number of bad blocks: 0
> Number of bbt blocks: 4
> Block size 131072, page size 2048, OOB size 64
> Dumping data starting at 0x00000000 and ending at 0x00000064...
> ECC: 1 uncorrectable bitflip(s) at offset 0x00000000
> 0x00000000: 6a 74 72 74 20 74 65 73 74 69 6e 67 0a ff ff ff |jtrt testing....|
>
> But there's some corner case, e.g. double bit flip that are detected (wrongly) as single bitflip and return wrong data:
>
> root@sw0005-devel:~# nandflipbits /dev/mtd0 0@2
> root@sw0005-devel:~# nandflipbits /dev/mtd0 1@1
> root@sw0005-devel:~# nanddump -c -s 0 --length=100 /dev/mtd0 | head -n 1
> ECC failed: 1
> ECC corrected: 2
> Number of bad blocks: 0
> Number of bbt blocks: 4
> Block size 131072, page size 2048, OOB size 64
> Dumping data starting at 0x00000000 and ending at 0x00000064...
> ECC: 1 corrected bitflip(s) at offset 0x00000000
> 0x00000000: 6a 76 73 74 20 74 65 73 74 69 6e 67 0a ff ff ff |jvst testing....|
>
> Another full test from scratch (ECC corrected counter is bigger that expected because I had to try a few combination, without rebooting ;-) )
>
> root@sw0005-devel:~# flash_erase /dev/mtd0 0 1
> Erasing 128 Kibyte @ 0 -- 100 % complete
> root@sw0005-devel:~# echo just testing | nandwrite -p /dev/mtd0
> Writing data to block 0 at offset 0x0
> root@sw0005-devel:~# nandflipbits /dev/mtd0 0@1
> root@sw0005-devel:~# nandflipbits /dev/mtd0 0@0
> root@sw0005-devel:~# nanddump -c -s 0 --length=100 /dev/mtd0 | head -n 1
> ECC failed: 1
> ECC corrected: 6
> Number of bad blocks: 0
> Number of bbt blocks: 4
> Block size 131072, page size 2048, OOB size 64
> Dumping data starting at 0x00000000 and ending at 0x00000064...
> ECC: 1 corrected bitflip(s) at offset 0x00000000
> 0x00000000: 6b 74 73 74 20 74 65 73 74 69 6e 67 0a ff ff ff |ktst testing....|
This is scary.
> Conclusions:
>
> IIUC with the results of the above test, we have an issue on PL353 because it cannot detect double bit errors (at least some combination of them) and, while this is a rare event on SLC NAND devices (that requires 1 bit ECC to guarantee 100k PE cycles), I think that this might give some catastrophic failures on field (because, AFAIK, upper MTD layers, like UBI, don't expect this situation).
> Am I wrong?
If you use UBI on top, depending on where the bitflips are, they may be
found due to UBI using checksums quite extensively, but this is clearly
not the nominal case. This is a dangerous hardware bug.
> I kindly ask to the MTD experts if I have to worry about this or if we
> can assume that correcting 1 bit error is enough for this subsystem.
No, the expectation is a clear failure upon double bit errors. Be
careful though, Hamming ECC engines carry *no guarantee* for 3 bit
errors. Only 0, 1 and 2 are part of the scope, and 2 bit errors are
uncorrectable, which means:
- 0 bf, ok
- 1 bf, ok + reporting 1 bf
- 2 bf, NOK + reporting an error
- more bf: no guarantee, usually returns incorrect data with a correct
status
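The four cases above can be reproduced with a textbook Hamming model. This is a minimal sketch assuming a 512-byte step protected by 12 pairs of parity bits (one pair per bit of the bit address); it is not the kernel's software Hamming code and not the PL353 layout, and all names are hypothetical.

```python
STEP_BYTES = 512
MASK = STEP_BYTES * 8 - 1  # 12 address bits for 4096 data bits


def compute_ecc(data: bytes) -> tuple[int, int]:
    """Return (odd, even): parities over set bits, split per address bit."""
    odd = even = 0
    for index in range(len(data) * 8):
        if (data[index >> 3] >> (index & 7)) & 1:
            odd ^= index           # address bits that are 1
            even ^= ~index & MASK  # address bits that are 0
    return odd, even


def flip(data: bytes, *bit_indices: int) -> bytes:
    buf = bytearray(data)
    for i in bit_indices:
        buf[i >> 3] ^= 1 << (i & 7)
    return bytes(buf)


def read_with_ecc(raw: bytes, stored: tuple[int, int]) -> tuple[bytes, str]:
    odd, even = compute_ecc(raw)
    s_odd, s_even = odd ^ stored[0], even ^ stored[1]
    if s_odd == 0 and s_even == 0:
        return raw, "clean"
    if s_odd ^ s_even == MASK:
        # Every parity pair disagrees exactly once: looks like one data flip.
        return flip(raw, s_odd), "corrected"
    if bin(s_odd).count("1") + bin(s_even).count("1") == 1:
        return raw, "corrected"  # a single flip in the ECC bits themselves
    return raw, "uncorrectable"


original = bytes(range(256)) * 2
stored = compute_ecc(original)
for flips in ((), (9,), (9, 16), (9, 16, 40)):
    data, status = read_with_ecc(flip(original, *flips), stored)
    print(len(flips), "bf:", status, "data ok" if data == original else "data WRONG")
```

In this model, two flips in the data are always reported as uncorrectable, while three flips look exactly like a single flip at another position, so the read reports "corrected" and returns wrong data.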
What you observe is maybe the last case. If you read the whole page raw,
you may find that you are silently facing a 3 bit error case due to a lot
of repetitions (?); the extra bitflip can be somewhere else in the page.
Be very careful when testing with nandflipbits because the tool does not
check for data integrity, unlike nandbiterrs, which does. Are you sure
your disapproval of nandbiterrs is justified in the first place? Could it
be that the NAND chip you're testing with is a bit faulty/unstable and
nandbiterrs returns errors where you do not expect them because of that?
This is obviously just speculation, maybe the errata you mentioned above
will bring an obvious hardware failure to our attention. The Arasan IP
used on ZynqMP also suffers from a similar limitation (not able to
correctly report failures) and I decided to implement one path using the
software BCH engine, with a time penalty of course.
Thanks,
Miquèl
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: PL353 NAND Controller - SW vs HW ECC
2026-02-10 10:12 ` Miquel Raynal
@ 2026-02-10 14:14 ` Andrea Scian
2026-02-12 10:40 ` Miquel Raynal
0 siblings, 1 reply; 6+ messages in thread
From: Andrea Scian @ 2026-02-10 14:14 UTC (permalink / raw)
To: Miquel Raynal
Cc: linux-mtd@lists.infradead.org, michal.simek@amd.com,
richard@nod.at, vigneshr@ti.com, amit.kumar-mahapatra@amd.com
Dear Miquel,
Il 10/02/2026 11:12, Miquel Raynal ha scritto:
> Hi Andrea,
>
> On 09/02/2026 at 11:37:37 GMT, Andrea Scian <andrea.scian@dave.eu> wrote:
>
>> Dear all,
>>
>> I hope I don't annoying you by putting directly in CC, but these people are the one that were already involved in my patch to fix SW ECC support in PL353 NAND controller (mainly used in Xilinx/AMD Zynq7k SoC), and I think are the one that might help me with this follow-up.
>>
>> Our standard HW/SW validation procedure for BSPs includes (after some basic functional tests) raw NAND MTD tests.
>>
>> Usually we check ECC functionality with mtd_nandbiterrs but it's way
>> of testing ECC correction is quite obscure and unmaintained (see a
>> thread between me and Miquel on this mailing list in December 2025 on
>> this topic).
>
> We stopped developing the kernel modules, for testing we advise to use
> the same tools from the mtd-utils test suite which are actively
> maintained.
>
> nandbiterrs -i is the correct tool for testing your ECC engine. It works
> this way (from memory, maybe not 100% accurate, but that's the idea):
[snip]
Got it! This looks very similar to the kernel module and, in
fact, I got the same results, depending on the seed chosen.
With PL353 HW ECC without a seed (a.k.a. seed=0):
root@sw0005-devel:~# nandbiterrs -i /dev/mtd0
incremental biterrors test
Successfully corrected 0 bit errors per subpage
Inserted biterror @ 0/5
Read reported 1 corrected bit errors
Successfully corrected 1 bit errors per subpage
Inserted biterror @ 0/2
Failed to recover 1 bitflips
Read error after 2 bit errors per page
While with seed=1
root@sw0005-devel:~# nandbiterrs -i /dev/mtd0 -s 1
incremental biterrors test
Successfully corrected 0 bit errors per subpage
Inserted biterror @ 0/7
Read reported 1 corrected bit errors
Successfully corrected 1 bit errors per subpage
Inserted biterror @ 0/5
Read reported 1 corrected bit errors
ECC failure, invalid data despite read success
As a comparison, with SW ECC I got:
root@sw0005-devel:~# nandbiterrs -i /dev/mtd0
incremental biterrors test
Successfully corrected 0 bit errors per subpage
[ 117.677553] ecc_sw_hamming_correct: uncorrectable ECC error
Inserted biterror @ 0/5
Read reported 1 corrected bit errors
Successfully corrected 1 bit errors per subpage
Inserted biterror @ 0/2
Failed to recover 1 bitflips
Read error after 2 bit errors per page
root@sw0005-devel:~# ./nandbiterrs -i /dev/mtd0 -s 1
incremental biterrors test
Successfully corrected 0 bit errors per subpage
[ 127.793727] ecc_sw_hamming_correct: uncorrectable ECC error
Inserted biterror @ 0/7
Read reported 1 corrected bit errors
Successfully corrected 1 bit errors per subpage
Inserted biterror @ 0/5
Failed to recover 1 bitflips
Read error after 2 bit errors per page
>> We've thus moved to userspace nandflipbits which give much more
>> control on bitflip generation, making easier to understand if
>> everything's fine or not.
>
> nandflipbits is more flexible but less automated. It works identically,
> except I believe it erases before rewriting in raw mode (which is a
> subtle difference, this may have an impact with some -rare- chips).
>
>> By using this tool, I'm able to reproduce what I think is a PL353 HW
>> ECC malfunction, that I think is hardware related (there's some,
>> cryptic IMHO, errata on this) but I may be missing something and it
>> may be "just" a software bug There's also the obvious 3rd option:
>> PEBKAC. I'm doing something wrong with my test setup, either on
>> kernel/test configuration/usage or in hw setup ;-) )
>
> I am interested by this errata, do you have a link? I do not remember
> seeing it when I worked on this controller.
For the PL353, since it is an ARM IP, you need to look at their
website:
https://developer.arm.com/documentation/rlnc000227/a
Refer to the r2p1 IP revision, which is affected by erratum ID 721059.
Its statement no. 3 says:
"Some double error cases are not correctly identified as uncorrectable fail"
"90 double errors out of the 8485140 possible double error combinations are not correctly identified as uncorrectable fail"
"All double errors in the data (8386560 possible errors) will be correctly identified as uncorrectable fail"
However, this is NOT my experience, as you can see from the above testing.
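The figures quoted from the erratum can be cross-checked with simple counting. This sketch assumes a 512-byte ECC step (4096 data bits) protected by 24 ECC bits; that layout is an assumption, chosen because it reproduces both quoted numbers.

```python
from math import comb

DATA_BITS = 512 * 8  # assumed ECC step: 512 bytes
ECC_BITS = 24        # assumed: 3 ECC bytes per step

# Unordered pairs of distinct bit positions that can both flip.
all_pairs = comb(DATA_BITS + ECC_BITS, 2)
data_only_pairs = comb(DATA_BITS, 2)
pairs_touching_ecc = all_pairs - data_only_pairs

print(all_pairs)           # 8485140
print(data_only_pairs)     # 8386560
print(pairs_touching_ecc)  # 98580
```

Under this assumption, the 90 misidentified cases would have to be among the 98580 pairs involving at least one ECC bit, whereas the failing combinations in the tests above flip two data bits.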
>> Step 1 - SW ECC
>>
>> Thanks to my patch (and mailing list review) now I can use SW Hamming ECC on Zynq7k based devices. So this test is about using software hamming ECC on (1 bit on 256 byte)
>>
>> This is the device tree
>>
>> &nfc0 {
>> status = "okay";
>> nand@0 {
>> reg = <0x0>;
>> #address-cells = <0x1>;
>> #size-cells = <0x1>;
>>
>> nand-ecc-mode = "soft";
>> nand-ecc-algo = "hamming";
>> nand-ecc-strength = <1>;
>> nand-ecc-step-size = <256>;
>>
>> nand-on-flash-bbt;
>>
>> nand-bus-width = <8>;
>> status = "okay";
>>
>> partition@nand-ubi {
>> label = "ubi";
>> reg = <0x00000000 0x0>;
>> };
>> };
>> };
>>
>> To make it quick, I'm using just the first EB, with a simple string on it (in my case,
>> this is useful for testing on u-boot too, but this is for another separate thread ;-) )
>>
>> root@sw0005-devel:~# flash_erase /dev/mtd0 0 1
>> Erasing 128 Kibyte @ 0 -- 100 % complete
>> root@sw0005-devel:~# echo just testing | nandwrite -p /dev/mtd0
>> Writing data to block 0 at offset 0x0
>> root@sw0005-devel:~# nanddump -c -s 0 --length=100 /dev/mtd0 | head -n 1
>> ECC failed: 0
>> ECC corrected: 0
>> Number of bad blocks: 0
>> Number of bbt blocks: 4
>> Block size 131072, page size 2048, OOB size 64
>> Dumping data starting at 0x00000000 and ending at 0x00000064...
>> 0x00000000: 6a 75 73 74 20 74 65 73 74 69 6e 67 0a ff ff ff |just testing....|
>>
>> I'm now inserting one bitflip, which is detected and corrected as expected
>>
>> root@sw0005-devel:~# nandflipbits /dev/mtd0 0@1
>> root@sw0005-devel:~# nanddump -n -c -s 0 --length=100 /dev/mtd0 | head -n 1
>> Block size 131072, page size 2048, OOB size 64
>> Dumping data starting at 0x00000000 and ending at 0x00000064...
>> 0x00000000: 6a 74 73 74 20 74 65 73 74 69 6e 67 0a ff ff ff |jtst testing....|
>> root@sw0005-devel:~# nanddump -c -s 0 --length=100 /dev/mtd0 | head -n 1
>> ECC failed: 0
>> ECC corrected: 0
>> Number of bad blocks: 0
>> Number of bbt blocks: 4
>> Block size 131072, page size 2048, OOB size 64
>> Dumping data starting at 0x00000000 and ending at 0x00000064...
>> ECC: 1 corrected bitflip(s) at offset 0x00000000
>> 0x00000000: 6a 75 73 74 20 74 65 73 74 69 6e 67 0a ff ff ff |just testing....|
>>
>> With an additional bitflip, we have an uncorrectable error (and this is, again, expected)
>>
>> root@sw0005-devel:~# nandflipbits /dev/mtd0 0@2
>> root@sw0005-devel:~# nanddump -n -c -s 0 --length=100 /dev/mtd0 | head -n 1
>> Block size 131072, page size 2048, OOB size 64
>> Dumping data starting at 0x00000000 and ending at 0x00000064...
>> 0x00000000: 6a 74 72 74 20 74 65 73 74 69 6e 67 0a ff ff ff |jtrt testing....|
>> root@sw0005-devel:~# nanddump -c -s 0 --length=100 /dev/mtd0 | head -n 1
>> ECC failed: 0
>> ECC corrected: 1
>> Number of bad blocks: 0
>> Number of bbt blocks: 4
>> Block size 131072, page size 2048, OOB size 64
>> Dumping data starting at 0x00000000 and ending at 0x00000064...
>> ECC: 1 uncorrectable bitflip(s) at offset 0x00000000
>> 0x00000000: 6a 74 72 74 20 74 65 73 74 69 6e 67 0a ff ff ff |jtrt testing....|
>>
>> The same applies to another combination of bitflips (this will be useful later and don't look at ECC counters.. I had to reboot the system ;-) )
>>
>> root@sw0005-devel:~# flash_erase /dev/mtd0 0 1
>> Erasing 128 Kibyte @ 0 -- 100 % complete
>> root@sw0005-devel:~# echo just testing | nandwrite -p /dev/mtd0
>> Writing data to block 0 at offset 0x0
>> root@sw0005-devel:~# nandflipbits /dev/mtd0 0@1
>> root@sw0005-devel:~# nandflipbits /dev/mtd0 0@0
>> root@sw0005-devel:~# nanddump -c -s 0 --length=100 /dev/mtd0 | head -n 1
>> ECC failed: 0
>> ECC corrected: 0
>> Number of bad blocks: 0
>> Number of bbt blocks: 4
>> Block size 131072, page size 2048, OOB size 64
>> Dumping data starting at 0x00000000 and ending at 0x00000064...
>> ECC: 1 uncorrectable bitflip(s) at offset 0x00000000
>> 0x00000000: 6b 74 73 74 20 74 65 73 74 69 6e 67 0a ff ff ff |ktst testing....|
>>
>>
>>
>> Step 2 - PL353 HW ECC
>>
>> The device tree is now
>> &nfc0 {
>> status = "okay";
>> nand@0 {
>> reg = <0x0>;
>> #address-cells = <0x1>;
>> #size-cells = <0x1>;
>>
>> nand-ecc-mode = "hw";
>> nand-ecc-strength = <1>;
>> nand-ecc-step-size = <256>;
>>
>> nand-on-flash-bbt;
>>
>> nand-bus-width = <8>;
>> status = "okay";
>>
>> partition@nand-ubi {
>> label = "ubi";
>> reg = <0x00000000 0x0>;
>> };
>> };
>> };
>>
>> Please note that PL353 is not using nand-ecc-step-size property
>> correctly, but this is a secondary issue (this NAND device requires 1
>> bit on 512 byte, so it's fine anyway)
>
> Can you elaborate? Looking at the driver, it takes the ECC configuration
> from the core (hence, usually from the DT), otherwise it falls back to
> what the chip advertises in terms of requirements, and finally it falls
> back to 1b/512B as default.
I tried with
&nfc0 {
    status = "okay";
    nand@0 {
        reg = <0x0>;
        #address-cells = <0x1>;
        #size-cells = <0x1>;

        nand-ecc-mode = "hw";
        nand-ecc-strength = <1>;
        nand-ecc-step-size = <256>;

        nand-on-flash-bbt;

        nand-bus-width = <8>;
        status = "okay";

        partition@nand-ubi {
            label = "ubi";
            reg = <0x00000000 0x0>;
        };
    };
};
But I still got
root@sw0005-devel:~# cat /sys/class/mtd/mtd0/ecc_step_size
512
Maybe I'm missing something, but it seems that even though the PL353 driver
takes this from the code/NAND requirements in pl35x_nand_attach_chip()
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/mtd/nand/raw/pl35x-nand-controller.c#n948
in the case of (host) HW ECC it is later overwritten when the
PL353 ECC controller is initialized in pl35x_nand_init_hw_ecc_controller()
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/mtd/nand/raw/pl35x-nand-controller.c#n910
>
>> root@sw0005-devel:/lib/modules# cat /sys/class/mtd/mtd0/ecc_step_size
>> 512
>>
>> Re-doing the same test as above
>>
>> root@sw0005-devel:~# flash_erase /dev/mtd0 0 1
>> Erasing 128 Kibyte @ 0 -- 100 % complete
>> root@sw0005-devel:~# echo just testing | nandwrite -p /dev/mtd0
>> Writing data to block 0 at offset 0x0
>> root@sw0005-devel:~# nanddump -c -s 0 --length=100 /dev/mtd0 | head -n 1
>> ECC failed: 0
>> ECC corrected: 0
>> Number of bad blocks: 0
>> Number of bbt blocks: 4
>> Block size 131072, page size 2048, OOB size 64
>> Dumping data starting at 0x00000000 and ending at 0x00000064...
>> 0x00000000: 6a 75 73 74 20 74 65 73 74 69 6e 67 0a ff ff ff |just testing....|
>>
>> One single bitflip is detected and corrected as expected:
>>
>> root@sw0005-devel:~# nandflipbits /dev/mtd0 0@1
>> root@sw0005-devel:~# nanddump -n -c -s 0 --length=100 /dev/mtd0 | head -n 1
>> Block size 131072, page size 2048, OOB size 64
>> Dumping data starting at 0x00000000 and ending at 0x00000064...
>> 0x00000000: 6a 74 73 74 20 74 65 73 74 69 6e 67 0a ff ff ff |jtst testing....|
>> root@sw0005-devel:~# nanddump -c -s 0 --length=100 /dev/mtd0 | head -n 1
>> ECC failed: 0
>> ECC corrected: 0
>> Number of bad blocks: 0
>> Number of bbt blocks: 4
>> Block size 131072, page size 2048, OOB size 64
>> Dumping data starting at 0x00000000 and ending at 0x00000064...
>> ECC: 1 corrected bitflip(s) at offset 0x00000000
>> 0x00000000: 6a 75 73 74 20 74 65 73 74 69 6e 67 0a ff ff ff |just testing....|
>>
>> a 2nd bitflip is detected as uncorrectable as expected:
>>
>> root@sw0005-devel:~# nandflipbits /dev/mtd0 0@2
>> root@sw0005-devel:~# nanddump -n -c -s 0 --length=100 /dev/mtd0 | head -n 1
>> Block size 131072, page size 2048, OOB size 64
>> Dumping data starting at 0x00000000 and ending at 0x00000064...
>> 0x00000000: 6a 74 72 74 20 74 65 73 74 69 6e 67 0a ff ff ff |jtrt testing....|
>> root@sw0005-devel:~# nanddump -c -s 0 --length=100 /dev/mtd0 | head -n 1
>> ECC failed: 0
>> ECC corrected: 1
>> Number of bad blocks: 0
>> Number of bbt blocks: 4
>> Block size 131072, page size 2048, OOB size 64
>> Dumping data starting at 0x00000000 and ending at 0x00000064...
>> ECC: 1 uncorrectable bitflip(s) at offset 0x00000000
>> 0x00000000: 6a 74 72 74 20 74 65 73 74 69 6e 67 0a ff ff ff |jtrt testing....|
>>
>> But there's some corner case, e.g. double bit flip that are detected (wrongly) as single bitflip and return wrong data:
>>
>> root@sw0005-devel:~# nandflipbits /dev/mtd0 0@2
>> root@sw0005-devel:~# nandflipbits /dev/mtd0 1@1
>> root@sw0005-devel:~# nanddump -c -s 0 --length=100 /dev/mtd0 | head -n 1
>> ECC failed: 1
>> ECC corrected: 2
>> Number of bad blocks: 0
>> Number of bbt blocks: 4
>> Block size 131072, page size 2048, OOB size 64
>> Dumping data starting at 0x00000000 and ending at 0x00000064...
>> ECC: 1 corrected bitflip(s) at offset 0x00000000
>> 0x00000000: 6a 76 73 74 20 74 65 73 74 69 6e 67 0a ff ff ff |jvst testing....|
>>
>> Another full test from scratch (ECC corrected counter is bigger that expected because I had to try a few combination, without rebooting ;-) )
>>
>> root@sw0005-devel:~# flash_erase /dev/mtd0 0 1
>> Erasing 128 Kibyte @ 0 -- 100 % complete
>> root@sw0005-devel:~# echo just testing | nandwrite -p /dev/mtd0
>> Writing data to block 0 at offset 0x0
>> root@sw0005-devel:~# nandflipbits /dev/mtd0 0@1
>> root@sw0005-devel:~# nandflipbits /dev/mtd0 0@0
>> root@sw0005-devel:~# nanddump -c -s 0 --length=100 /dev/mtd0 | head -n 1
>> ECC failed: 1
>> ECC corrected: 6
>> Number of bad blocks: 0
>> Number of bbt blocks: 4
>> Block size 131072, page size 2048, OOB size 64
>> Dumping data starting at 0x00000000 and ending at 0x00000064...
>> ECC: 1 corrected bitflip(s) at offset 0x00000000
>> 0x00000000: 6b 74 73 74 20 74 65 73 74 69 6e 67 0a ff ff ff |ktst testing....|
>
> This is scary.
>
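To make the last two HW ECC dumps easier to compare with the data that was written, here is a minimal Python sketch (not run on the target). It only counts which bits differ between the written bytes and the bytes returned with a "corrected" status. The helper name bit_diffs and the constant names are hypothetical; the byte values are copied from the dumps above, and only the first 13 bytes are compared.

```python
# First 13 bytes written by `echo just testing | nandwrite -p /dev/mtd0`.
WRITTEN = bytes.fromhex("6a 75 73 74 20 74 65 73 74 69 6e 67 0a")

# Data returned by nanddump together with "1 corrected bitflip(s)" (HW ECC).
READ_RUN_1 = bytes.fromhex("6a 76 73 74 20 74 65 73 74 69 6e 67 0a")  # "jvst" dump
READ_RUN_2 = bytes.fromhex("6b 74 73 74 20 74 65 73 74 69 6e 67 0a")  # "ktst" dump


def bit_diffs(expected: bytes, actual: bytes):
    """Return a list of (byte_offset, bit_index) for every differing bit."""
    diffs = []
    for offset, (e, a) in enumerate(zip(expected, actual)):
        delta = e ^ a  # set bits mark the positions that differ
        for bit in range(8):
            if delta & (1 << bit):
                diffs.append((offset, bit))
    return diffs


for name, data in (("run 1", READ_RUN_1), ("run 2", READ_RUN_2)):
    print(name, bit_diffs(WRITTEN, data))
```

In both runs the returned data still differs from the written data by two bits within the bytes shown, even though the read reported a single corrected bitflip.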
>> Conclusions:
>>
>> IIUC with the results of the above test, we have an issue on PL353 because it cannot detect double bit errors (at least some combination of them) and, while this is a rare event on SLC NAND devices (that requires 1 bit ECC to guarantee 100k PE cycles), I think that this might give some catastrophic failures on field (because, AFAIK, upper MTD layers, like UBI, don't expect this situation).
>> Am I wrong?
>
> If you use UBI on top, depending on where the bitflips are, they may be
> found due to UBI using checksums quite extensively, but this is clearly
> not the nominal case. This is a dangerous hardware bug.
Thanks for the confirmation.
>> I kindly ask to the MTD experts if I have to worry about this or if we
>> can assume that correcting 1 bit error is enough for this subsystem.
>
> No, the expectation is a clear failure upon double bit errors. Be
> careful though, Hamming ECC engines carry *no guarantee* for 3 bit
> errors. Only 0, 1 and 2 are part of the scope, and 2 bit errors are
> uncorrectable, which means:
> - 0 bf, ok
> - 1 bf, ok + reporting 1 bf
> - 2 bf, NOK + reporting an error
> - more bf: no guarantee, usually returns incorrect data with a correct
> status
Thanks for pointing this out. I was not aware of the last case when
using the Hamming algorithm.
We'll probably have to move to BCH, even if, IIRC, it requires more
CPU horsepower to do the job.
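The four outcomes listed above can be reproduced with a toy model. This is a minimal Python sketch of a 1-bit Hamming-style code over one 512-byte step, assuming the usual "parity per address bit" construction; it is not the kernel's or the PL353's implementation, it ignores errors in the ECC bytes themselves, and all names (ecc, read_with_ecc, trial) are hypothetical. Bit addresses are byte * 8 + bit, so address 8 corresponds to 0@1 in nandflipbits notation.

```python
N_BITS = 4096          # one 512-byte ECC step seen as individual bits
MASK = N_BITS - 1      # 12 address bits, all ones


def ecc(bits):
    """Toy ECC: XOR of the addresses of set bits and XOR of their complements."""
    p = 0
    p_inv = 0
    for addr, bit in enumerate(bits):
        if bit:
            p ^= addr
            p_inv ^= addr ^ MASK
    return p, p_inv


def read_with_ecc(bits, stored):
    """Return (status, data) as a 1-bit correcting engine would report it."""
    p, p_inv = ecc(bits)
    s, s_inv = p ^ stored[0], p_inv ^ stored[1]
    if s == 0 and s_inv == 0:
        return "ok", bits
    if s ^ s_inv == MASK:          # signature of exactly one flipped bit
        fixed = list(bits)
        fixed[s] ^= 1              # s is taken as the address to repair
        return "corrected", fixed
    return "uncorrectable", bits


def trial(flips):
    """Flip the given bit addresses and report (status, data_is_intact)."""
    data = [int(addr % 3 == 0) for addr in range(N_BITS)]  # arbitrary pattern
    stored = ecc(data)
    raw = list(data)
    for addr in flips:
        raw[addr] ^= 1
    status, out = read_with_ecc(raw, stored)
    return status, out == data


print(trial([]))          # no bitflip
print(trial([8]))         # 1 bitflip
print(trial([0, 8]))      # 2 bitflips
print(trial([0, 8, 9]))   # 3 bitflips
```

In this model two flips always give an "uncorrectable" status, while three flips produce the same syndrome shape as a single flip, so the engine "corrects" a fourth, innocent bit and reports success on wrong data.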
> What you observe is maybe the last case. If you read the whole page raw,
> maybe you are silently facing a 3 bit error case due to a lot of
> repetitions (?), it can be somewhere else in the page. Be very careful
> when testing with nandflipbits because the tool does not check for data
> integrity, unlike nandbiterrs, which does. Are you sure your disapproval
> of nandbiterrs is justified in the first place?
No, and I'm sorry about this. I had not understood that there are two
nandbiterrs: one with the .ko suffix (the unmaintained kernel module) and
one without .ko, provided as a userspace tool in mtd-utils ;-)
I'll use the nandbiterrs provided by mtd-utils for the next tests.
Thanks for pointing me to the right tool to use :-)
> Could it be that the
> NAND chip you're testing with is a bit faulty/unstable and nandbiterrs
> returns errors where you do not expect them because of that?
My colleague and I had the same objection. For this reason I tested this
on different SOMs, from different production lots and also with
different SLC NAND devices (same nominal characteristics and nearly 100%
compatible), from Winbond and Spansion.
The latest hardware I'm testing has just come out of the
factory (so the NAND has no more than 1-2 PE cycles, from our factory
functional testing).
Also, I'm using the same hardware to test HW and SW ECC (plus the same
binaries, apart from the two lines inside the device tree that I've already
highlighted).
Since we're using the same algorithm (apart from the ECC block size),
I think this guarantees that we don't have any issue from the hardware
point of view, but please correct me if I'm making a wrong
assumption.
> This is obviously just speculation, maybe the errata you mentioned above
> will bring an obvious hardware failure to our attention. The Arasan IP
> used on ZynqMP also suffers from a similar limitation (not able to
> correctly report failures) and I decided to implement one path using the
> software BCH engine, with a time penalty of course.
So that's nearly the same result I'm trying to get with the Zynq7k platform ;-)
I'm also trying to match this SW ECC support in U-Boot (and I'm having
some trouble with it, but I think this needs a separate thread or
a move to another mailing list).
______________________________________________________
Linux MTD discussion mailing list
http://lists.infradead.org/mailman/listinfo/linux-mtd/
* Re: PL353 NAND Controller - SW vs HW ECC
2026-02-10 14:14 ` Andrea Scian
@ 2026-02-12 10:40 ` Miquel Raynal
2026-02-23 16:39 ` Andrea Scian
0 siblings, 1 reply; 6+ messages in thread
From: Miquel Raynal @ 2026-02-12 10:40 UTC (permalink / raw)
To: Andrea Scian
Cc: linux-mtd@lists.infradead.org, michal.simek@amd.com,
richard@nod.at, vigneshr@ti.com, amit.kumar-mahapatra@amd.com
Hi Andrea,
On 10/02/2026 at 15:14:30 +01, Andrea Scian <andrea.scian@dave.eu> wrote:
> Dear Miquel,
>
>
> On 10/02/2026 11:12, Miquel Raynal wrote:
>> Hi Andrea,
>> On 09/02/2026 at 11:37:37 GMT, Andrea Scian <andrea.scian@dave.eu>
>> wrote:
>>
>>> Dear all,
>>>
>>> I hope I don't annoying you by putting directly in CC, but these people are the one that were already involved in my patch to fix SW ECC support in PL353 NAND controller (mainly used in Xilinx/AMD Zynq7k SoC), and I think are the one that might help me with this follow-up.
>>>
>>> Our standard HW/SW validation procedure for BSPs includes (after some basic functional tests) raw NAND MTD tests.
>>>
>>> Usually we check ECC functionality with mtd_nandbiterrs but it's way
>>> of testing ECC correction is quite obscure and unmaintained (see a
>>> thread between me and Miquel on this mailing list in December 2025 on
>>> this topic).
>> We stopped developing the kernel modules, for testing we advise to use
>> the same tools from the mtd-utils test suite which are actively
>> maintained.
>> nandbiterrs -i is the correct tool for testing your ECC engine. It works
>> this way (from memory, maybe not 100% accurate, but that's the idea):
> [snip]
>
> Got it! This looks very similar to the kernel module and, in
> fact, I got the same results, depending on the seed choosen.
Ah, finally. I was very suspicious about this observation in the first
place. I remember you were reporting failures in the nandbiterrs -i test
with seed=1, which means we must fall into one of the 90 cases that are
not properly covered by the ECC engine?
[...]
>>> control on bitflip generation, making easier to understand if
>>> everything's fine or not.
>> nandflipbits is more flexible but less automated. It works identically,
>> except I believe it erases before rewriting in raw mode (which is a
>> subtle difference, this may have an impact with some -rare- chips).
>>
>>> By using this tool, I'm able to reproduce what I think is a PL353 HW
>>> ECC malfunction, that I think is hardware related (there's some,
>>> cryptic IMHO, errata on this) but I may be missing something and it
>>> may be "just" a software bug There's also the obvious 3rd option:
>>> PEBKAC. I'm doing something wrong with my test setup, either on
>>> kernel/test configuration/usage or in hw setup ;-) )
>> I am interested by this errata, do you have a link? I do not remember
>> seeing it when I worked on this controller.
>
> For PL353, due the fact that it's an ARM IP, you need to look at their
> website:
>
> https://developer.arm.com/documentation/rlnc000227/a
Thanks!
> Refer to r2p1 IP revision which is affected by errata ID 721059
> It's statement nr 3 says
>
> "Some double error cases are not correctly identified as uncorrectable fail"
>
> "90 double errors out of the 8485140 possible double error combinations
> are not correctly identified as
> uncorrectable fail"
>
> "All double errors in the data (8386560 possible errors) will be
> correctly identified as uncorrectable fail"
This is only one issue out of 3.
> However, this is NOT my experience, as you can see from the above testing.
Maybe you fell into one of the two other cases?
[...]
>>> Please note that PL353 is not using nand-ecc-step-size property
>>> correctly, but this is a secondary issue (this NAND device requires 1
>>> bit on 512 byte, so it's fine anyway)
>> Can you elaborate? Looking at the driver, it takes the ECC configuration
>> from the core (hence, usually from the DT), otherwise it falls back to
>> what the chip advertises in terms of requirements, and finally it falls
>> back to 1b/512B as default.
>
> I tried with
>
> &nfc0 {
> status = "okay";
> nand@0 {
> reg = <0x0>;
> #address-cells = <0x1>;
> #size-cells = <0x1>;
>
> nand-ecc-mode = "hw";
> nand-ecc-strength = <1>;
> nand-ecc-step-size = <256>;
>
> nand-on-flash-bbt;
>
> nand-bus-width = <8>;
> status = "okay";
>
> partition@nand-ubi {
> label = "ubi";
> reg = <0x00000000 0x0>;
> };
> };
> };
>
> But I still got
>
> root@sw0005-devel:~# cat /sys/class/mtd/mtd0/ecc_step_size
> 512
>
>
> Maybe I'm missing something, but it seems that even if PL353 get this
> from code/NAND requirements in pl35x_nand_attach_chip()
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/mtd/nand/raw/pl35x-nand-controller.c#n948
>
> In case of (host) HW ECC this gets later overwritten when initializing
> PL353 ECC controller in pl35x_nand_init_hw_ecc_controller()
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/mtd/nand/raw/pl35x-nand-controller.c#n910
This is a bug. Either you remove the '= 512' assignment (if the
ECC engine is already configured correctly, there is nothing to do
except removing this limit), or you refuse any value that is not 512
if the engine cannot be configured for 256B steps, or else you add
the logic to configure the ECC engine for steps != 512.
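As a rough illustration of the fallback order described earlier (DT, then chip requirements, then the 1b/512B default) combined with the "refuse any value that is not 512" option, here is a small Python model. It is not driver code: the function names are hypothetical, and SUPPORTED_HW_STEPS encodes the unverified assumption that the hardware engine only handles 512-byte steps.

```python
DEFAULT_ECC = (1, 512)        # (strength in bits, step size in bytes)
SUPPORTED_HW_STEPS = {512}    # assumption: the HW engine cannot do 256B steps


def resolve_ecc(dt=None, chip_req=None):
    """Pick the ECC config: DT first, then chip requirements, then default."""
    for candidate in (dt, chip_req):
        if candidate is not None:
            return candidate
    return DEFAULT_ECC


def attach_hw_ecc(dt=None, chip_req=None):
    """Reject an unsupported step size instead of silently overriding it."""
    strength, step = resolve_ecc(dt, chip_req)
    if step not in SUPPORTED_HW_STEPS:
        raise ValueError(f"HW ECC does not support {step}-byte steps")
    return strength, step


print(attach_hw_ecc())                  # falls back to the default
print(attach_hw_ecc(chip_req=(1, 512)))
try:
    attach_hw_ecc(dt=(1, 256))          # the DT from this thread
except ValueError as err:
    print("refused:", err)
```

With this behaviour, a DT asking for 256-byte steps fails loudly rather than ending up with ecc_step_size reporting 512.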
[...]
>>> I kindly ask to the MTD experts if I have to worry about this or if we
>>> can assume that correcting 1 bit error is enough for this subsystem.
>> No, the expectation is a clear failure upon double bit errors. Be
>> careful though, Hamming ECC engines carry *no guarantee* for 3 bit
>> errors. Only 0, 1 and 2 are part of the scope, and 2 bit errors are
>> uncorrectable, which means:
>> - 0 bf, ok
>> - 1 bf, ok + reporting 1 bf
>> - 2 bf, NOK + reporting an error
>> - more bf: no guarantee, usually returns incorrect data with a correct
>> status
>
> Thanks for pointing this out. I was not aware about the last case when
> using Hamming algo.
> Probably we'll have to move to BCH, even if, IIRC, it requires more
> CPU horsepower to do the job.
It does, and you can observe the impact with a speed test, e.g.:
flash_speed -dc10 /dev/mtdx
[...]
>> This is obviously just speculation, maybe the errata you mentioned above
>> will bring an obvious hardware failure to our attention. The Arasan IP
>> used on ZynqMP also suffers from a similar limitation (not able to
>> correctly report failures) and I decided to implement one path using the
>> software BCH engine, with a time penalty of course.
>
> So that's nearly the same result I'm trying to get with Zynq7k
> platform ;-)
In this case, I can only recommend the blog post I wrote for the Arasan
controller. I believe it should be easier to do as you won't need to
reverse-engineer the polynomial.
Link: https://bootlin.com/blog/supporting-a-misbehaving-nand-ecc-engine/
Thanks,
Miquèl
* Re: PL353 NAND Controller - SW vs HW ECC
2026-02-12 10:40 ` Miquel Raynal
@ 2026-02-23 16:39 ` Andrea Scian
2026-02-24 8:07 ` Miquel Raynal
0 siblings, 1 reply; 6+ messages in thread
From: Andrea Scian @ 2026-02-23 16:39 UTC (permalink / raw)
To: Miquel Raynal
Cc: linux-mtd@lists.infradead.org, michal.simek@amd.com,
richard@nod.at, vigneshr@ti.com, amit.kumar-mahapatra@amd.com
Dear Miquel,
(as usual, sorry for the late response)
On 12/02/2026 11:40, Miquel Raynal wrote:
> Hi Andrea,
>
> On 10/02/2026 at 15:14:30 +01, Andrea Scian <andrea.scian@dave.eu> wrote:
>
>> Dear Miquel,
>>
>>
>> On 10/02/2026 11:12, Miquel Raynal wrote:
>>> Hi Andrea,
>>> On 09/02/2026 at 11:37:37 GMT, Andrea Scian <andrea.scian@dave.eu>
>>> wrote:
>>>
>>>> Dear all,
>>>>
>>>> I hope I don't annoying you by putting directly in CC, but these people are the one that were already involved in my patch to fix SW ECC support in PL353 NAND controller (mainly used in Xilinx/AMD Zynq7k SoC), and I think are the one that might help me with this follow-up.
>>>>
>>>> Our standard HW/SW validation procedure for BSPs includes (after some basic functional tests) raw NAND MTD tests.
>>>>
>>>> Usually we check ECC functionality with mtd_nandbiterrs but it's way
>>>> of testing ECC correction is quite obscure and unmaintained (see a
>>>> thread between me and Miquel on this mailing list in December 2025 on
>>>> this topic).
>>> We stopped developing the kernel modules, for testing we advise to use
>>> the same tools from the mtd-utils test suite which are actively
>>> maintained.
>>> nandbiterrs -i is the correct tool for testing your ECC engine. It works
>>> this way (from memory, maybe not 100% accurate, but that's the idea):
>> [snip]
>>
>> Got it! This looks very similar to the kernel module and, in
>> fact, I got the same results, depending on the seed choosen.
>
> Ah, finally. I was very suspicious about this observation in the first
> place. I remember you were reporting failures in the nandbiterrs -i test
> with seed=1, which means we must fall into one of the 90 cases that are
> not properly covered by the ECC engine?
>
> [...]
Yes. This does not match the error cases listed in the errata (if I
understand that document correctly), but it does match the fact that it
fails in some cases (which, to me, means that this ECC engine is unusable
anyway).
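The two totals quoted from the errata can be cross-checked with a little arithmetic. The sketch below assumes a 512-byte ECC step protected by 3 ECC bytes; that split is an assumption on my side, but it reproduces both figures exactly. Variable names are purely illustrative.

```python
from math import comb

DATA_BITS = 512 * 8               # 4096 data bits per ECC step (assumed)
ECC_BITS = 3 * 8                  # 24 ECC bits per step (assumed)
TOTAL_BITS = DATA_BITS + ECC_BITS

data_only_pairs = comb(DATA_BITS, 2)    # both flips in the data area
all_pairs = comb(TOTAL_BITS, 2)         # flips anywhere in data + ECC
pairs_touching_ecc = all_pairs - data_only_pairs

print(data_only_pairs)     # errata: 8386560 double errors in the data
print(all_pairs)           # errata: 8485140 possible double errors
print(pairs_touching_ecc)  # pairs with at least one flip in the ECC bytes
```

If this reading is right, the 90 misreported combinations all involve at least one ECC bit, so two flips inside the data area (as in the nandflipbits tests) should not be among them according to the errata.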
[...]
>> Refer to r2p1 IP revision which is affected by errata ID 721059
>> It's statement nr 3 says
>>
>> "Some double error cases are not correctly identified as uncorrectable fail"
>>
>> "90 double errors out of the 8485140 possible double error combinations
>> are not correctly identified as
>> uncorrectable fail"
>>
>> "All double errors in the data (8386560 possible errors) will be
>> correctly identified as uncorrectable fail"
>
> This is only one issue over 3.
>
>> However, this is NOT my experience, as you can see from the above testing.
>
> Maybe you fell into one of the two other cases?
I don't think so. That is why I left them out of my previous email,
but here they are for the sake of completeness:
1) A single bit error in data byte 0, bit 0 will not be detected
2) A single bit error in second 12 bits of the 3 parity bytes read from
the spare area, will incorrectly be identified as having passed
the error check
So these cases are about single bit errors, which were never an issue for
me ¯\_(ツ)_/¯
>
> [...]
>
>>>> Please note that PL353 is not using nand-ecc-step-size property
>>>> correctly, but this is a secondary issue (this NAND device requires 1
>>>> bit on 512 byte, so it's fine anyway)
>>> Can you elaborate? Looking at the driver, it takes the ECC configuration
>>> from the core (hence, usually from the DT), otherwise it falls back to
>>> what the chip advertises in terms of requirements, and finally it falls
>>> back to 1b/512B as default.
>>
>> I tried with
>>
>> &nfc0 {
>> status = "okay";
>> nand@0 {
>> reg = <0x0>;
>> #address-cells = <0x1>;
>> #size-cells = <0x1>;
>>
>> nand-ecc-mode = "hw";
>> nand-ecc-strength = <1>;
>> nand-ecc-step-size = <256>;
>>
>> nand-on-flash-bbt;
>>
>> nand-bus-width = <8>;
>> status = "okay";
>>
>> partition@nand-ubi {
>> label = "ubi";
>> reg = <0x00000000 0x0>;
>> };
>> };
>> };
>>
>> But I still got
>>
>> root@sw0005-devel:~# cat /sys/class/mtd/mtd0/ecc_step_size
>> 512
>>
>>
>> Maybe I'm missing something, but it seems that even if PL353 get this
>> from code/NAND requirements in pl35x_nand_attach_chip()
>>
>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/mtd/nand/raw/pl35x-nand-controller.c#n948
>>
>> In case of (host) HW ECC this gets later overwritten when initializing
>> PL353 ECC controller in pl35x_nand_init_hw_ecc_controller()
>>
>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/mtd/nand/raw/pl35x-nand-controller.c#n910
>
> This is a bug, either you shall remove the '= 512' assignment (if the
> configuration of the ECC engine is already done correctly, there is
> nothing to do except removing this limit), or you should refuse any
> value that is not 512 if the engine cannot be configured for 256B steps,
> else you should add the logic to configure the ECC engine logic for
> steps != 512.
Thanks for the explanation.
I'll try to figure this out, but it's not my main objective.
>>>> I kindly ask to the MTD experts if I have to worry about this or if we
>>>> can assume that correcting 1 bit error is enough for this subsystem.
>>> No, the expectation is a clear failure upon double bit errors. Be
>>> careful though, Hamming ECC engines carry *no guarantee* for 3 bit
>>> errors. Only 0, 1 and 2 are part of the scope, and 2 bit errors are
>>> uncorrectable, which means:
>>> - 0 bf, ok
>>> - 1 bf, ok + reporting 1 bf
>>> - 2 bf, NOK + reporting an error
>>> - more bf: no guarantee, usually returns incorrect data with a correct
>>> status
>>
>> Thanks for pointing this out. I was not aware about the last case when
>> using Hamming algo.
>> Probably we'll have to move to BCH, even if, IIRC, it requires more
>> CPU horsepower to do the job.
>
> It does, and you can observe the impact with a speed test, eg:
>
> flash_speed -dc10 /dev/mtdx
>
> [...]
Another userspace tool that replaces mtd_speedtest.ko, thanks ;-)
>>> This is obviously just speculation, maybe the errata you mentioned above
>>> will bring an obvious hardware failure to our attention. The Arasan IP
>>> used on ZynqMP also suffers from a similar limitation (not able to
>>> correctly report failures) and I decided to implement one path using the
>>> software BCH engine, with a time penalty of course.
>>
>> So that's nearly the same result I'm trying to get with Zynq7k
>> platform ;-)
>
> In this case, I can only recommend the blog post I wrote for the Arasan
> controller. I believe it should be easier to do as you won't need the
> polynomial retro-engineering.
>
> Link: https://bootlin.com/blog/supporting-a-misbehaving-nand-ecc-engine/
Thanks for this detailed article!
I think I'll move to Hamming code (if I can make it work in U-Boot) or BCH
(after evaluating the performance impact).
I have an issue with Linux to U-Boot interoperability (single bit
errors are not correctly handled in the U-Boot Hamming implementation). Is
this the right mailing list to ask for advice, or should I move to the U-Boot ML?
Kind Regards,
Andrea Scian
* Re: PL353 NAND Controller - SW vs HW ECC
2026-02-23 16:39 ` Andrea Scian
@ 2026-02-24 8:07 ` Miquel Raynal
0 siblings, 0 replies; 6+ messages in thread
From: Miquel Raynal @ 2026-02-24 8:07 UTC (permalink / raw)
To: Andrea Scian
Cc: linux-mtd@lists.infradead.org, michal.simek@amd.com,
richard@nod.at, vigneshr@ti.com, amit.kumar-mahapatra@amd.com
Hi Andrea,
>>>> This is obviously just speculation, maybe the errata you mentioned above
>>>> will bring an obvious hardware failure to our attention. The Arasan IP
>>>> used on ZynqMP also suffers from a similar limitation (not able to
>>>> correctly report failures) and I decided to implement one path using the
>>>> software BCH engine, with a time penalty of course.
>>>
>>> So that's nearly the same result I'm trying to get with Zynq7k
>>> platform ;-)
>> In this case, I can only recommend the blog post I wrote for the Arasan
>> controller. I believe it should be easier to do as you won't need the
>> polynomial retro-engineering.
>> Link: https://bootlin.com/blog/supporting-a-misbehaving-nand-ecc-engine/
>
> Thanks for this detailed article!
> I think I'll move to hamming code (if I make it work in u-boot) or BCH
> (after evaluating the performance impact)
Either you want to leverage the engine in the write path (because it is
faster than doing it with the CPU) and you must implement Hamming as
that's the only capability of the controller, or you want better
strength and in this case you do not have anything to do except
configuring for software ECC engines in the DT.
But for an upstream fix, which I would welcome, you need to use the
Hamming hardware for the write path and software in the read path.
> I have some issue with Linux to U-Boot interoperability (single bit
> error are not correctly handled in u-boot hamming implementation), is
> this the right mailing list to ask for advice or should I move to
> u-boot ML?
You could create a thread with both, I guess, if there is an
interoperability issue.
Thanks,
Miquèl
end of thread, other threads:[~2026-02-24 8:08 UTC | newest]
Thread overview: 6+ messages
2026-02-09 11:37 PL353 NAND Controller - SW vs HW ECC Andrea Scian
2026-02-10 10:12 ` Miquel Raynal
2026-02-10 14:14 ` Andrea Scian
2026-02-12 10:40 ` Miquel Raynal
2026-02-23 16:39 ` Andrea Scian
2026-02-24 8:07 ` Miquel Raynal