Re: PL353 NAND Controller - SW vs HW ECC

From: Miquel Raynal <miquel.raynal@bootlin.com>
To: Andrea Scian <andrea.scian@dave.eu>
Cc: "linux-mtd@lists.infradead.org" <linux-mtd@lists.infradead.org>,
	"michal.simek@amd.com" <michal.simek@amd.com>,
	 "richard@nod.at" <richard@nod.at>,
	 "vigneshr@ti.com" <vigneshr@ti.com>,
	"amit.kumar-mahapatra@amd.com" <amit.kumar-mahapatra@amd.com>
Subject: Re: PL353 NAND Controller - SW vs HW ECC
Date: Thu, 12 Feb 2026 11:40:37 +0100	[thread overview]
Message-ID: <87wm0ixn16.fsf@bootlin.com> (raw)
In-Reply-To: <f05f27ac-abaf-43fc-8f77-8f729741d03f@dave.eu> (Andrea Scian's message of "Tue, 10 Feb 2026 15:14:30 +0100")

Hi Andrea,

On 10/02/2026 at 15:14:30 +01, Andrea Scian <andrea.scian@dave.eu> wrote:

> Dear Miquel,
>
>
> Il 10/02/2026 11:12, Miquel Raynal ha scritto:
>> Hi Andrea,
>> On 09/02/2026 at 11:37:37 GMT, Andrea Scian <andrea.scian@dave.eu>
>> wrote:
>> 
>>> Dear all,
>>>
>>> I hope I don't annoying you by putting directly in CC, but these people are the one that were already involved in my patch to fix SW ECC support in PL353 NAND controller (mainly used in Xilinx/AMD Zynq7k SoC), and I think are the one that might help me with this follow-up.
>>>
>>> Our standard HW/SW validation procedure for BSPs includes (after some basic functional tests) raw NAND MTD tests.
>>>
>>> Usually we check ECC functionality with mtd_nandbiterrs but it's way
>>> of testing ECC correction is quite obscure and unmaintained (see a
>>> thread between me and Miquel on this mailing list in December 2025 on
>>> this topic).
>> We stopped developing the kernel modules, for testing we advise to use
>> the same tools from the mtd-utils test suite which are actively
>> maintained.
>> nandbiterrs -i is the correct tool for testing your ECC engine. It
>> works
>> this way (from memory, maybe not 100% accurate, but that's the idea):
> [snip]
>
> Got it! This looks very similar to the kernel module and, in
> fact, I got the same results, depending on the seed choosen.

Ah, finally. I was very suspicious about this observation in the first
place. I remember you were reporting failures in the nandbiterrs -i test
with seed=1, which means we must fall into one of the 90 cases that are
not properly covered by the ECC engine?

[...]

>>> control on bitflip generation, making easier to understand if
>>> everything's fine or not.
>> nandflipbits is more flexible but less automated. It works
>> identically,
>> except I believe it erases before rewriting in raw mode (which is a
>> subtle difference, this may have an impact with some -rare- chips).
>> 
>>> By using this tool, I'm able to reproduce what I think is a PL353 HW
>>> ECC malfunction, that I think is hardware related (there's some,
>>> cryptic IMHO, errata on this) but I may be missing something and it
>>> may be "just" a software bug There's also the obvious 3rd option:
>>> PEBKAC. I'm doing something wrong with my test setup, either on
>>> kernel/test configuration/usage or in hw setup ;-) )
>> I am interested by this errata, do you have a link? I do not remember
>> seeing it when I worked on this controller.
>
> For PL353, due the fact that it's an ARM IP, you need to look at their
> website:
>
> https://developer.arm.com/documentation/rlnc000227/a

Thanks!

> Refer to r2p1 IP revision which is affected by errata ID 721059
> It's statement nr 3 says
>
> "Some double error cases are not correctly identified as uncorrectable fail"
>
> "90 double errors out of the 8485140 possible double error combinations
> are not correctly identified as
> uncorrectable fail"
>
> "All double errors in the data (8386560 possible errors) will be
> correctly identified as uncorrectable fail"

This is only one issue over 3.

> However, this is NOT my experience, as you can see from the above testing.

Maybe you fell into one of the two other cases?

[...]

>>> Please note that PL353 is not using nand-ecc-step-size property
>>> correctly, but this is a secondary issue (this NAND device requires 1
>>> bit on 512 byte, so it's fine anyway)
>> Can you elaborate? Looking at the driver, it takes the ECC
>> configuration
>> from the core (hence, usually from the DT), otherwise it falls back to
>> what the chip advertises in terms of requirements, and finally it falls
>> back to 1b/512B as default.
>
> I tried with
>
> &nfc0 {
> 	status = "okay";
> 	nand@0 {
> 		reg = <0x0>;
> 		#address-cells = <0x1>;
> 		#size-cells = <0x1>;
>
> 		nand-ecc-mode = "hw";
> 		nand-ecc-strength = <1>;
> 		nand-ecc-step-size = <256>;
>
> 		nand-on-flash-bbt;
>
> 		nand-bus-width = <8>;
> 		status = "okay";
>
> 		partition@nand-ubi {
> 			label = "ubi";
> 			reg = <0x00000000 0x0>;
> 		};
> 	};
> };
>
> But I still got
>
> root@sw0005-devel:~# cat /sys/class/mtd/mtd0/ecc_step_size
> 512
>
>
> Maybe I'm missing something, but it seems that even if PL353 get this
> from code/NAND requirements in pl35x_nand_attach_chip()
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/mtd/nand/raw/pl35x-nand-controller.c#n948
>
> In case of (host) HW ECC this gets later overwritten when initializing
> PL353 ECC controller in pl35x_nand_init_hw_ecc_controller()
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/mtd/nand/raw/pl35x-nand-controller.c#n910

This is a bug, either you shall remove the '= 512' assignment (if the
configuration of the ECC engine is already done correctly, there is
nothing to do except removing this limit), or you should refuse any
value that is not 512 if the engine cannot be configured for 256B steps,
else you should add the logic to configure the ECC engine logic for
steps != 512.

[...]

>>> I kindly ask to the MTD experts if I have to worry about this or if we
>>> can assume that correcting 1 bit error is enough for this subsystem.
>> No, the expectation is a clear failure upon double bit errors. Be
>> careful though, Hamming ECC engines carry *no guaranty* for 3 bit
>> errors. Only 0, 1 and 2 are part of the scope, and 2 bit errors are
>> uncorrectable, which means:
>> - 0 bf, ok
>> - 1 bf, ok + reporting 1 bf
>> - 2 bf, NOK + reporting an error
>> - more bf: no guarantee, usually returns incorrect data with a correct
>> status
>
> Thanks for pointing this out. I was not aware about the last case when
> using Hamming algo.
> Probably we'll have to move to BCH, even if, IIRC, it requires more
> CPU horsepower to do the job.

It does, and you can observe the impact with a speed test, eg:

        flash_speed -dc10 /dev/mtdx

[...]

>> This is obviously just speculation, maybe the errata you mentioned above
>> will bring an obvious hardware failure to our attention. The Arasan IP
>> used on ZynqMP also suffers from a similar limitation (not able to
>> correctly report failures) and I decided to implement one path using the
>> software BCH engine, with a time penalty of course.
>
> So that's nearly the same result I'm trying to get with Zynq7k
> platform ;-)

In this case, I can only recommend the blog post I wrote for the Arasan
controller. I believe it should be easier to do as you won't need the
polynomial retro-engineering.

Link: https://bootlin.com/blog/supporting-a-misbehaving-nand-ecc-engine/

Thanks,
Miquèl

______________________________________________________
Linux MTD discussion mailing list
http://lists.infradead.org/mailman/listinfo/linux-mtd/