public inbox for linux-mtd@lists.infradead.org
* PL353 NAND Controller - SW vs HW ECC
@ 2026-02-09 11:37 Andrea Scian
  2026-02-10 10:12 ` Miquel Raynal
From: Andrea Scian @ 2026-02-09 11:37 UTC
  To: linux-mtd@lists.infradead.org
  Cc: Miquel Raynal, michal.simek@amd.com, richard@nod.at,
	vigneshr@ti.com, amit.kumar-mahapatra@amd.com

Dear all,

I hope I'm not annoying you by putting you directly in CC, but you are the people who were already involved in my patch to fix SW ECC support in the PL353 NAND controller (mainly used in Xilinx/AMD Zynq7k SoCs), and I think you are the ones who might help me with this follow-up.

Our standard HW/SW validation procedure for BSPs includes (after some basic functional tests) raw NAND MTD tests.

Usually we check ECC functionality with mtd_nandbiterrs, but its way of testing ECC correction is quite obscure and unmaintained (see the thread between Miquel and me on this mailing list in December 2025 on this topic).

We've thus moved to the userspace tool nandflipbits, which gives much more control over bitflip generation and makes it easier to understand whether everything is fine or not.

By using this tool, I'm able to reproduce what I think is a PL353 HW ECC malfunction. I think it is hardware related (there are some errata on this, cryptic IMHO), but I may be missing something and it may be "just" a software bug.
There's also the obvious third option: PEBKAC. I may be doing something wrong in my test setup, either in the kernel/test configuration/usage or in the HW setup ;-)

Step 1 - SW ECC

Thanks to my patch (and the mailing list review), I can now use SW Hamming ECC on Zynq7k-based devices. So this test uses software Hamming ECC (1 bit per 256 bytes).

This is the device tree:

&nfc0 {
  status = "okay";
  nand@0 {
    reg = <0x0>;
    #address-cells = <0x1>;
    #size-cells = <0x1>;

    nand-ecc-mode = "soft";
    nand-ecc-algo = "hamming";
    nand-ecc-strength = <1>;
    nand-ecc-step-size = <256>;

    nand-on-flash-bbt;

    nand-bus-width = <8>;
    status = "okay";

    partition@nand-ubi {
      label = "ubi";
      reg = <0x00000000 0x0>;
    };
  };
};

To keep it quick, I'm using just the first eraseblock, with a simple string in it (in my case,
this is useful for testing in U-Boot too, but that is for a separate thread ;-) )

root@sw0005-devel:~# flash_erase /dev/mtd0 0 1
Erasing 128 Kibyte @ 0 -- 100 % complete
root@sw0005-devel:~# echo just testing | nandwrite -p /dev/mtd0
Writing data to block 0 at offset 0x0
root@sw0005-devel:~# nanddump -c -s 0 --length=100 /dev/mtd0  | head -n 1
ECC failed: 0
ECC corrected: 0
Number of bad blocks: 0
Number of bbt blocks: 4
Block size 131072, page size 2048, OOB size 64
Dumping data starting at 0x00000000 and ending at 0x00000064...
0x00000000: 6a 75 73 74 20 74 65 73 74 69 6e 67 0a ff ff ff  |just testing....|

I'm now inserting one bitflip, which is detected and corrected as expected:

root@sw0005-devel:~# nandflipbits /dev/mtd0 0@1
root@sw0005-devel:~# nanddump -n -c -s 0 --length=100 /dev/mtd0  | head -n 1
Block size 131072, page size 2048, OOB size 64
Dumping data starting at 0x00000000 and ending at 0x00000064...
0x00000000: 6a 74 73 74 20 74 65 73 74 69 6e 67 0a ff ff ff  |jtst testing....|
root@sw0005-devel:~# nanddump -c -s 0 --length=100 /dev/mtd0  | head -n 1
ECC failed: 0
ECC corrected: 0
Number of bad blocks: 0
Number of bbt blocks: 4
Block size 131072, page size 2048, OOB size 64
Dumping data starting at 0x00000000 and ending at 0x00000064...
ECC: 1 corrected bitflip(s) at offset 0x00000000
0x00000000: 6a 75 73 74 20 74 65 73 74 69 6e 67 0a ff ff ff  |just testing....|
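For readers not used to the nandflipbits arguments, here is a minimal Python sketch of how I read the `bit@offset` notation, assuming it means "flip bit number `bit` of the byte at `offset`", which is consistent with the raw dumps in this mail. The helper `flip_bits` is hypothetical and only models the effect on a buffer; it does not touch any flash device.

```python
def flip_bits(data: bytes, flips: list[str]) -> bytes:
    """Return a copy of data with each 'bit@offset' flip applied.

    Assumption: 'bit@offset' flips bit number `bit` (0 = LSB) of the byte
    at `offset`, as suggested by the dumps in this thread.
    """
    buf = bytearray(data)
    for spec in flips:
        bit, offset = (int(part, 0) for part in spec.split("@"))
        if not 0 <= bit <= 7:
            raise ValueError(f"bit index out of range in {spec!r}")
        buf[offset] ^= 1 << bit
    return bytes(buf)


original = b"just testing\n"
# One flip: byte 1 goes from 0x75 ('u') to 0x74 ('t').
print(flip_bits(original, ["0@1"]).hex(" "))
# Two flips: byte 2 additionally goes from 0x73 ('s') to 0x72 ('r').
print(flip_bits(original, ["0@1", "0@2"]).hex(" "))
```

Applying the same flip twice restores the original byte, which is why a bit can be "undone" by repeating the command.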

With an additional bitflip, we have an uncorrectable error (and this is, again, expected):

root@sw0005-devel:~# nandflipbits /dev/mtd0 0@2
root@sw0005-devel:~# nanddump -n -c -s 0 --length=100 /dev/mtd0  | head -n 1
Block size 131072, page size 2048, OOB size 64
Dumping data starting at 0x00000000 and ending at 0x00000064...
0x00000000: 6a 74 72 74 20 74 65 73 74 69 6e 67 0a ff ff ff  |jtrt testing....|
root@sw0005-devel:~# nanddump -c -s 0 --length=100 /dev/mtd0  | head -n 1
ECC failed: 0
ECC corrected: 1
Number of bad blocks: 0
Number of bbt blocks: 4
Block size 131072, page size 2048, OOB size 64
Dumping data starting at 0x00000000 and ending at 0x00000064...
ECC: 1 uncorrectable bitflip(s) at offset 0x00000000
0x00000000: 6a 74 72 74 20 74 65 73 74 69 6e 67 0a ff ff ff  |jtrt testing....|

The same applies to another combination of bitflips (this will be useful later; please ignore the ECC counters, I had to reboot the system ;-) ):

root@sw0005-devel:~# flash_erase /dev/mtd0 0 1
Erasing 128 Kibyte @ 0 -- 100 % complete
root@sw0005-devel:~# echo just testing | nandwrite -p /dev/mtd0
Writing data to block 0 at offset 0x0
root@sw0005-devel:~# nandflipbits /dev/mtd0 0@1
root@sw0005-devel:~# nandflipbits /dev/mtd0 0@0
root@sw0005-devel:~# nanddump -c -s 0 --length=100 /dev/mtd0  | head -n 1
ECC failed: 0
ECC corrected: 0
Number of bad blocks: 0
Number of bbt blocks: 4
Block size 131072, page size 2048, OOB size 64
Dumping data starting at 0x00000000 and ending at 0x00000064...
ECC: 1 uncorrectable bitflip(s) at offset 0x00000000
0x00000000: 6b 74 73 74 20 74 65 73 74 69 6e 67 0a ff ff ff  |ktst testing....|
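Since the "ECC failed" / "ECC corrected" lines printed by nanddump are cumulative counters, they are easier to read as a difference between two runs than as absolute values. A minimal Python sketch of that bookkeeping follows; it only parses the two lines in the format shown in the transcripts above, and the function names are hypothetical.

```python
import re

# Matches the two counter lines as they appear in the nanddump output above.
STAT_RE = re.compile(r"^ECC (failed|corrected): (\d+)$", re.MULTILINE)


def parse_ecc_stats(output: str) -> dict[str, int]:
    """Extract the 'failed' and 'corrected' counters from captured output."""
    return {name: int(value) for name, value in STAT_RE.findall(output)}


def stats_delta(before: dict[str, int], after: dict[str, int]) -> dict[str, int]:
    """Return how much each counter grew between two captures."""
    return {name: after[name] - before[name] for name in before}


first = parse_ecc_stats("ECC failed: 0\nECC corrected: 0\nNumber of bad blocks: 0\n")
second = parse_ecc_stats("ECC failed: 0\nECC corrected: 1\nNumber of bad blocks: 0\n")
print(stats_delta(first, second))
```

With this approach a reboot is not needed to get a clean reading; only the growth between two captures matters.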



Step 2 - PL353 HW ECC

The device tree is now:
&nfc0 {
  status = "okay";
  nand@0 {
    reg = <0x0>;
    #address-cells = <0x1>;
    #size-cells = <0x1>;

    nand-ecc-mode = "hw";
    nand-ecc-strength = <1>;
    nand-ecc-step-size = <256>;

    nand-on-flash-bbt;

    nand-bus-width = <8>;
    status = "okay";

    partition@nand-ubi {
      label = "ubi";
      reg = <0x00000000 0x0>;
    };
  };
};

Please note that the PL353 driver is not using the nand-ecc-step-size property correctly, but this is a secondary issue (this NAND device requires 1 bit per 512 bytes, so it's fine anyway):

root@sw0005-devel:/lib/modules# cat /sys/class/mtd/mtd0/ecc_step_size
512
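The same check can be scripted. The sketch below reads the ECC-related attributes from an MTD sysfs directory, assuming the usual attribute names (`ecc_strength`, `ecc_step_size`, `corrected_bits`, `ecc_failures`); attributes that are missing on a given kernel are simply skipped. The function name `read_ecc_attrs` is hypothetical.

```python
from pathlib import Path

# Attribute names assumed from the standard MTD sysfs interface.
ATTRS = ("ecc_strength", "ecc_step_size", "corrected_bits", "ecc_failures")


def read_ecc_attrs(mtd_dir) -> dict[str, int]:
    """Read the integer ECC attributes found in an MTD sysfs directory."""
    values = {}
    for name in ATTRS:
        path = Path(mtd_dir) / name
        if path.exists():
            values[name] = int(path.read_text().strip())
    return values


if __name__ == "__main__":
    # On the board used in this mail this should report an ecc_step_size of 512.
    print(read_ecc_attrs("/sys/class/mtd/mtd0"))
```

Comparing `ecc_step_size` against the value requested in the device tree makes a mismatch like the one above visible at a glance.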

Re-doing the same test as above:

root@sw0005-devel:~# flash_erase /dev/mtd0 0 1
Erasing 128 Kibyte @ 0 -- 100 % complete
root@sw0005-devel:~# echo just testing | nandwrite -p /dev/mtd0
Writing data to block 0 at offset 0x0
root@sw0005-devel:~# nanddump -c -s 0 --length=100 /dev/mtd0  | head -n 1
ECC failed: 0
ECC corrected: 0
Number of bad blocks: 0
Number of bbt blocks: 4
Block size 131072, page size 2048, OOB size 64
Dumping data starting at 0x00000000 and ending at 0x00000064...
0x00000000: 6a 75 73 74 20 74 65 73 74 69 6e 67 0a ff ff ff  |just testing....|

One single bitflip is detected and corrected as expected:

root@sw0005-devel:~# nandflipbits /dev/mtd0 0@1
root@sw0005-devel:~# nanddump -n -c -s 0 --length=100 /dev/mtd0  | head -n 1
Block size 131072, page size 2048, OOB size 64
Dumping data starting at 0x00000000 and ending at 0x00000064...
0x00000000: 6a 74 73 74 20 74 65 73 74 69 6e 67 0a ff ff ff  |jtst testing....|
root@sw0005-devel:~# nanddump -c -s 0 --length=100 /dev/mtd0  | head -n 1
ECC failed: 0
ECC corrected: 0
Number of bad blocks: 0
Number of bbt blocks: 4
Block size 131072, page size 2048, OOB size 64
Dumping data starting at 0x00000000 and ending at 0x00000064...
ECC: 1 corrected bitflip(s) at offset 0x00000000
0x00000000: 6a 75 73 74 20 74 65 73 74 69 6e 67 0a ff ff ff  |just testing....|

A second bitflip is detected as uncorrectable, as expected:

root@sw0005-devel:~# nandflipbits /dev/mtd0 0@2
root@sw0005-devel:~# nanddump -n -c -s 0 --length=100 /dev/mtd0  | head -n 1
Block size 131072, page size 2048, OOB size 64
Dumping data starting at 0x00000000 and ending at 0x00000064...
0x00000000: 6a 74 72 74 20 74 65 73 74 69 6e 67 0a ff ff ff  |jtrt testing....|
root@sw0005-devel:~# nanddump -c -s 0 --length=100 /dev/mtd0  | head -n 1
ECC failed: 0
ECC corrected: 1
Number of bad blocks: 0
Number of bbt blocks: 4
Block size 131072, page size 2048, OOB size 64
Dumping data starting at 0x00000000 and ending at 0x00000064...
ECC: 1 uncorrectable bitflip(s) at offset 0x00000000
0x00000000: 6a 74 72 74 20 74 65 73 74 69 6e 67 0a ff ff ff  |jtrt testing....|

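But there are some corner cases, e.g. double bitflips that are (wrongly) detected as a single bitflip, so wrong data is returned. Here the first command reverts the earlier flip at offset 2, leaving two flipped bits in the byte at offset 1: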

root@sw0005-devel:~# nandflipbits /dev/mtd0 0@2
root@sw0005-devel:~# nandflipbits /dev/mtd0 1@1
root@sw0005-devel:~# nanddump -c -s 0 --length=100 /dev/mtd0  | head -n 1
ECC failed: 1
ECC corrected: 2
Number of bad blocks: 0
Number of bbt blocks: 4
Block size 131072, page size 2048, OOB size 64
Dumping data starting at 0x00000000 and ending at 0x00000064...
ECC: 1 corrected bitflip(s) at offset 0x00000000
0x00000000: 6a 76 73 74 20 74 65 73 74 69 6e 67 0a ff ff ff  |jvst testing....|

Another full test from scratch (the ECC corrected counter is bigger than expected because I had to try a few combinations without rebooting ;-) ):

root@sw0005-devel:~# flash_erase /dev/mtd0 0 1
Erasing 128 Kibyte @ 0 -- 100 % complete
root@sw0005-devel:~# echo just testing | nandwrite -p /dev/mtd0
Writing data to block 0 at offset 0x0
root@sw0005-devel:~# nandflipbits /dev/mtd0 0@1
root@sw0005-devel:~# nandflipbits /dev/mtd0 0@0
root@sw0005-devel:~# nanddump -c -s 0 --length=100 /dev/mtd0  | head -n 1
ECC failed: 1
ECC corrected: 6
Number of bad blocks: 0
Number of bbt blocks: 4
Block size 131072, page size 2048, OOB size 64
Dumping data starting at 0x00000000 and ending at 0x00000064...
ECC: 1 corrected bitflip(s) at offset 0x00000000
0x00000000: 6b 74 73 74 20 74 65 73 74 69 6e 67 0a ff ff ff  |ktst testing....|
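To make the two HW ECC outcomes easier to compare, the following Python sketch counts how many bits differ between the string that was written and the bytes returned by the ECC-enabled reads. The byte values are copied from the two dumps in this step; the helper name `bit_errors` is hypothetical.

```python
def bit_errors(expected: bytes, actual: bytes) -> int:
    """Count the bits that differ between two equally sized buffers."""
    if len(expected) != len(actual):
        raise ValueError("buffers must have the same length")
    return sum(bin(a ^ b).count("1") for a, b in zip(expected, actual))


written = b"just testing\n"
# Data returned by the two HW ECC reads reported as "1 corrected bitflip(s)".
first_case = bytes.fromhex("6a 76 73 74 20 74 65 73 74 69 6e 67 0a")
second_case = bytes.fromhex("6b 74 73 74 20 74 65 73 74 69 6e 67 0a")

print(bit_errors(written, first_case))   # 2
print(bit_errors(written, second_case))  # 2
```

In both cases the read is reported as corrected while the returned data still differs from the written data by two bits.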


Conclusions:

IIUC, the results of the above tests show an issue with the PL353: its HW ECC fails to detect double-bit errors (at least some combinations of them). While this is a rare event on SLC NAND devices (which require 1-bit ECC to guarantee 100k P/E cycles), I think it might cause catastrophic failures in the field because, AFAIK, the layers on top of MTD, like UBI, don't expect this situation.
Am I wrong?
I kindly ask the MTD experts whether I have to worry about this, or whether we can assume that correcting 1-bit errors is enough for this subsystem.
If this is not acceptable, I think we have to update the driver, at least to warn the user about this and to suggest using SW ECC where possible. WDYT?
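As background for why a double bitflip is expected to be reported as uncorrectable, here is a simplified Python model of a Hamming-style code built from paired parity words. It is only a sketch under my own assumptions: it is neither the kernel's SW Hamming layout nor the PL353 algorithm, it ignores bitflips in the ECC bytes themselves, and all names are hypothetical. Each set data bit XORs its position into an "odd" word and the complemented position into an "even" word; one flipped bit then yields complementary syndromes, while two flipped bits yield identical ones.

```python
def compute_ecc(data: bytes) -> tuple[int, int]:
    """Return the (odd, even) parity words of a block (bit count: power of two)."""
    nbits = len(data) * 8
    mask = nbits - 1
    if nbits == 0 or nbits & mask:
        raise ValueError("block size in bits must be a power of two")
    odd = even = 0
    for pos in range(nbits):
        if (data[pos >> 3] >> (pos & 7)) & 1:
            odd ^= pos           # parity over the position bits
            even ^= ~pos & mask  # parity over the complemented position bits
    return odd, even


def classify(data: bytes, stored: tuple[int, int]):
    """Compare stored and recomputed ECC; return (status, flipped bit position)."""
    mask = len(data) * 8 - 1
    calc_odd, calc_even = compute_ecc(data)
    s_odd, s_even = stored[0] ^ calc_odd, stored[1] ^ calc_even
    if s_odd == 0 and s_even == 0:
        return "clean", None
    if s_odd ^ s_even == mask:   # complementary syndromes: one data bit flipped
        return "corrected", s_odd
    return "uncorrectable", None  # e.g. two flips give s_odd == s_even


def flip(data: bytes, offset: int, bit: int) -> bytes:
    """Return a copy of data with one bit flipped."""
    buf = bytearray(data)
    buf[offset] ^= 1 << bit
    return bytes(buf)


block = b"just testing\n".ljust(512, b"\xff")
stored = compute_ecc(block)
print(classify(flip(block, 1, 0), stored))              # single flip
print(classify(flip(flip(block, 1, 0), 0, 0), stored))  # two flips
```

In this model every pair of data bitflips is flagged as uncorrectable, which is the behaviour seen with SW ECC in Step 1 and the behaviour I would also expect from the HW ECC.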

If anybody on this list can also help me understand whether I'm doing something wrong in my test, or whether I have some setup/configuration error, it would be appreciated, and I can run some additional tests.

In the meantime, as my previous patch anticipated, I'm trying to set up a configuration using only SW ECC (I'm currently stuck on Linux/U-Boot compatibility, but this might not be the right ML to discuss that topic).

Any feedback is appreciated. Kind Regards,

Andrea SCIAN

SW Development Manager

DAVE Embedded Systems

______________________________________________________
Linux MTD discussion mailing list
http://lists.infradead.org/mailman/listinfo/linux-mtd/

Thread overview: 6+ messages
2026-02-09 11:37 PL353 NAND Controller - SW vs HW ECC Andrea Scian
2026-02-10 10:12 ` Miquel Raynal
2026-02-10 14:14   ` Andrea Scian
2026-02-12 10:40     ` Miquel Raynal
2026-02-23 16:39       ` Andrea Scian
2026-02-24  8:07         ` Miquel Raynal