public inbox for linux-mips@vger.kernel.org
* [PATCH 0/3] MIPS: DEC: Rate-limit memory errors
@ 2026-03-28 15:49 Maciej W. Rozycki
  2026-03-28 15:49 ` [PATCH 1/3] MIPS: DEC: Rate-limit memory errors for ECC systems Maciej W. Rozycki
                   ` (3 more replies)
  0 siblings, 4 replies; 5+ messages in thread
From: Maciej W. Rozycki @ 2026-03-28 15:49 UTC
  To: Thomas Bogendoerfer; +Cc: linux-mips, linux-kernel

Hi,

 A recent failure of one of my systems revealed an issue with memory error 
logging: the flood of messages reporting corrected ECC errors made the 
system unusable, even though the errors themselves had been recovered from 
and the messages were purely informational.

 I took the opportunity to verify that the rate-limiting serves its 
purpose on the offending system before cleaning the memory module contacts, 
which cured the original problem.  This is the third time it has happened 
in the ~25 years I've had the system: not too bad, but clearly a recurring 
issue.

 For consistency I have also updated support for the other two DEC memory 
system designs.  They are parity-based, so memory errors are fatal and 
consequently less likely to cause a message flood, although one is still 
possible in principle where a faulty memory location causes a bus error 
exception to kill user processes repeatedly.  These systems seem not to 
have the issue with memory contacts though, as they use the common SIMM 
design rather than 0.1"-pitch PCB connectors.

 Please apply.

  Maciej

* [PATCH 1/3] MIPS: DEC: Rate-limit memory errors for ECC systems
  2026-03-28 15:49 [PATCH 0/3] MIPS: DEC: Rate-limit memory errors Maciej W. Rozycki
@ 2026-03-28 15:49 ` Maciej W. Rozycki
  2026-03-28 15:50 ` [PATCH 2/3] MIPS: DEC: Rate-limit memory errors for KN01 systems Maciej W. Rozycki
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 5+ messages in thread
From: Maciej W. Rozycki @ 2026-03-28 15:49 UTC
  To: Thomas Bogendoerfer; +Cc: linux-mips, linux-kernel

Prevent the system from becoming unusable due to a flood of memory error 
messages on DECstation and DECsystem models using ECC, that is KN02, KN03 
and KN05 systems.  Gradual oxidation of memory module contacts seems to 
commonly cause memory errors to develop over time.  ECC takes care of 
correcting them, so the affected system can continue operating normally 
until the contacts have been cleaned, but the unlimited messages make the 
system spend all its time producing them, preventing it from being used.

Rate-limiting removes the load from the system and enables its normal 
operation, e.g.:

Bus error interrupt: CPU memory read ECC error at 0x139cfb04
  ECC syndrome 0x54 -- corrected single bit error at data bit D3
Bus error interrupt: CPU partial memory write ECC error at 0x138c1f5c
  ECC syndrome 0x54 -- corrected single bit error at data bit D3
Bus error interrupt: CPU partial memory write ECC error at 0x138c1f6c
  ECC syndrome 0x54 -- corrected single bit error at data bit D3
Bus error interrupt: CPU memory read ECC error at 0x139cff64
  ECC syndrome 0x54 -- corrected single bit error at data bit D3
Bus error interrupt: CPU memory read ECC error at 0x136af00c
  ECC syndrome 0x54 -- corrected single bit error at data bit D3
Bus error interrupt: CPU memory read ECC error at 0x136af044
  ECC syndrome 0x54 -- corrected single bit error at data bit D3
Bus error interrupt: CPU memory read ECC error at 0x136af0cc
  ECC syndrome 0x54 -- corrected single bit error at data bit D3
Bus error interrupt: CPU memory read ECC error at 0x136af0cc
  ECC syndrome 0x54 -- corrected single bit error at data bit D3
Bus error interrupt: CPU memory read ECC error at 0x136af0e4
  ECC syndrome 0x54 -- corrected single bit error at data bit D3
Bus error interrupt: CPU memory read ECC error at 0x136af104
  ECC syndrome 0x54 -- corrected single bit error at data bit D3
dec_ecc_be_backend: 34455 callbacks suppressed

Signed-off-by: Maciej W. Rozycki <macro@orcam.me.uk>
---
 arch/mips/dec/ecc-berr.c |   16 +++++++++++-----
 1 file changed, 11 insertions(+), 5 deletions(-)
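
A note for readers unfamiliar with the interval/burst scheme: the example
log above shows ten message pairs followed by a suppression count, which
is consistent with the kernel defaults of a burst of 10 messages per
5 * HZ interval.  Below is a minimal userspace sketch of that model.  It
is not the kernel implementation; `struct rl_model`, `rl_model_allow()`
and `rl_model_run()` are hypothetical names introduced for illustration
only, and the tick values are arbitrary.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical userspace model; NOT the kernel's struct ratelimit_state. */
struct rl_model {
	long interval;	/* window length in ticks */
	int burst;	/* messages allowed per window */
	long begin;	/* tick at which the current window opened */
	int printed;	/* messages let through in this window */
	long missed;	/* messages dropped in this window */
	bool started;	/* false until the first event is seen */
};

/* Return true if the caller may print a message at tick `now`. */
bool rl_model_allow(struct rl_model *rs, long now)
{
	assert(rs != NULL && rs->interval > 0 && rs->burst > 0);

	/* Open a new window on first use or once the old one has expired. */
	if (!rs->started || now - rs->begin >= rs->interval) {
		if (rs->missed)
			printf("%ld callbacks suppressed\n", rs->missed);
		rs->begin = now;
		rs->printed = 0;
		rs->missed = 0;
		rs->started = true;
	}

	if (rs->printed < rs->burst) {
		rs->printed++;
		return true;
	}

	rs->missed++;
	return false;
}

/* Feed `n` events, one per tick from `start`; return how many passed. */
long rl_model_run(struct rl_model *rs, long start, long n)
{
	long passed = 0;

	for (long t = start; t < start + n; t++)
		if (rl_model_allow(rs, t))
			passed++;
	return passed;
}
```

With an interval of 5000 ticks and a burst of 10, feeding 5000
back-to-back events lets 10 through and counts 4990 as suppressed; in
this model the count is reported when the next window opens.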

linux-mips-dec-berr-ratelimit-ecc.diff
Index: linux-macro/arch/mips/dec/ecc-berr.c
===================================================================
--- linux-macro.orig/arch/mips/dec/ecc-berr.c
+++ linux-macro/arch/mips/dec/ecc-berr.c
@@ -5,12 +5,13 @@
  *	5000/240 (KN03), 5000/260 (KN05) and DECsystem 5900 (KN03),
  *	5900/260 (KN05) systems.
  *
- *	Copyright (c) 2003, 2005  Maciej W. Rozycki
+ *	Copyright (c) 2003, 2005, 2026  Maciej W. Rozycki
  */
 
 #include <linux/init.h>
 #include <linux/interrupt.h>
 #include <linux/kernel.h>
+#include <linux/ratelimit.h>
 #include <linux/sched.h>
 #include <linux/types.h>
 
@@ -51,6 +52,10 @@ static int dec_ecc_be_backend(struct pt_
 	static const char overstr[] = "overrun";
 	static const char eccstr[] = "ECC error";
 
+	static DEFINE_RATELIMIT_STATE(rs,
+				      DEFAULT_RATELIMIT_INTERVAL,
+				      DEFAULT_RATELIMIT_BURST);
+
 	const char *kind, *agent, *cycle, *event;
 	const char *status = "", *xbit = "", *fmt = "";
 	unsigned long address;
@@ -70,7 +75,7 @@ static int dec_ecc_be_backend(struct pt_
 
 	if (!(erraddr & KN0X_EAR_VALID)) {
 		/* No idea what happened. */
-		printk(KERN_ALERT "Unidentified bus error %s\n", kind);
+		pr_alert_ratelimited("Unidentified bus error %s\n", kind);
 		return action;
 	}
 
@@ -180,12 +185,13 @@ static int dec_ecc_be_backend(struct pt_
 		}
 	}
 
-	if (action != MIPS_BE_FIXUP)
+	if (action != MIPS_BE_FIXUP && __ratelimit(&rs)) {
 		printk(KERN_ALERT "Bus error %s: %s %s %s at %#010lx\n",
 			kind, agent, cycle, event, address);
 
-	if (action != MIPS_BE_FIXUP && erraddr & KN0X_EAR_ECCERR)
-		printk(fmt, "  ECC syndrome ", syn, status, xbit, i);
+		if (erraddr & KN0X_EAR_ECCERR)
+			printk(fmt, "  ECC syndrome ", syn, status, xbit, i);
+	}
 
 	return action;
 }

* [PATCH 2/3] MIPS: DEC: Rate-limit memory errors for KN01 systems
  2026-03-28 15:49 [PATCH 0/3] MIPS: DEC: Rate-limit memory errors Maciej W. Rozycki
  2026-03-28 15:49 ` [PATCH 1/3] MIPS: DEC: Rate-limit memory errors for ECC systems Maciej W. Rozycki
@ 2026-03-28 15:50 ` Maciej W. Rozycki
  2026-03-28 15:50 ` [PATCH 3/3] MIPS: DEC: Rate-limit memory errors for non-KN01 parity systems Maciej W. Rozycki
  2026-04-06 12:33 ` [PATCH 0/3] MIPS: DEC: Rate-limit memory errors Thomas Bogendoerfer
  3 siblings, 0 replies; 5+ messages in thread
From: Maciej W. Rozycki @ 2026-03-28 15:50 UTC
  To: Thomas Bogendoerfer; +Cc: linux-mips, linux-kernel

Similarly to memory errors in ECC systems, also rate-limit memory parity 
errors for KN01 DECstation and DECsystem models.  Unlike with ECC, these 
events are always fatal and are less likely to cause a message flood, 
but handle them the same way for consistency.

Signed-off-by: Maciej W. Rozycki <macro@orcam.me.uk>
---
 arch/mips/dec/kn01-berr.c |    6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

linux-mips-dec-berr-ratelimit-kn01.diff
Index: linux-macro/arch/mips/dec/kn01-berr.c
===================================================================
--- linux-macro.orig/arch/mips/dec/kn01-berr.c
+++ linux-macro/arch/mips/dec/kn01-berr.c
@@ -4,7 +4,7 @@
  *	and 2100 (KN01) systems equipped with parity error detection
  *	logic.
  *
- *	Copyright (c) 2005  Maciej W. Rozycki
+ *	Copyright (c) 2005, 2026  Maciej W. Rozycki
  */
 
 #include <linux/init.h>
@@ -134,8 +134,8 @@ static int dec_kn01_be_backend(struct pt
 		action = MIPS_BE_FIXUP;
 
 	if (action != MIPS_BE_FIXUP)
-		printk(KERN_ALERT "Bus error %s: %s %s %s at %#010lx\n",
-			kind, agent, cycle, event, address);
+		pr_alert_ratelimited("Bus error %s: %s %s %s at %#010lx\n",
+				     kind, agent, cycle, event, address);
 
 	return action;
 }

* [PATCH 3/3] MIPS: DEC: Rate-limit memory errors for non-KN01 parity systems
  2026-03-28 15:49 [PATCH 0/3] MIPS: DEC: Rate-limit memory errors Maciej W. Rozycki
  2026-03-28 15:49 ` [PATCH 1/3] MIPS: DEC: Rate-limit memory errors for ECC systems Maciej W. Rozycki
  2026-03-28 15:50 ` [PATCH 2/3] MIPS: DEC: Rate-limit memory errors for KN01 systems Maciej W. Rozycki
@ 2026-03-28 15:50 ` Maciej W. Rozycki
  2026-04-06 12:33 ` [PATCH 0/3] MIPS: DEC: Rate-limit memory errors Thomas Bogendoerfer
  3 siblings, 0 replies; 5+ messages in thread
From: Maciej W. Rozycki @ 2026-03-28 15:50 UTC
  To: Thomas Bogendoerfer; +Cc: linux-mips, linux-kernel

Similarly to memory errors in ECC systems, also rate-limit memory parity 
errors for KN02-BA, KN02-CA, KN04-BA, KN04-CA DECstation and DECsystem 
models.  Unlike with ECC, these events are always fatal and are less 
likely to cause a message flood, but handle them the same way for 
consistency.

Signed-off-by: Maciej W. Rozycki <macro@orcam.me.uk>
---
 arch/mips/dec/kn02xa-berr.c |   26 ++++++++++++++++----------
 1 file changed, 16 insertions(+), 10 deletions(-)

linux-mips-dec-berr-ratelimit-kn02xa.diff
Index: linux-macro/arch/mips/dec/kn02xa-berr.c
===================================================================
--- linux-macro.orig/arch/mips/dec/kn02xa-berr.c
+++ linux-macro/arch/mips/dec/kn02xa-berr.c
@@ -6,12 +6,13 @@
  *	DECstation/DECsystem 5000/20, /25, /33 (KN02-CA), 5000/50
  *	(KN04-CA) systems.
  *
- *	Copyright (c) 2005  Maciej W. Rozycki
+ *	Copyright (c) 2005, 2026  Maciej W. Rozycki
  */
 
 #include <linux/init.h>
 #include <linux/interrupt.h>
 #include <linux/kernel.h>
+#include <linux/ratelimit.h>
 #include <linux/types.h>
 
 #include <asm/addrspace.h>
@@ -50,6 +51,10 @@ static int dec_kn02xa_be_backend(struct
 	static const char paritystr[] = "parity error";
 	static const char lanestat[][4] = { " OK", "BAD" };
 
+	static DEFINE_RATELIMIT_STATE(rs,
+				      DEFAULT_RATELIMIT_INTERVAL,
+				      DEFAULT_RATELIMIT_BURST);
+
 	const char *kind, *agent, *cycle, *event;
 	unsigned long address;
 
@@ -79,18 +84,19 @@ static int dec_kn02xa_be_backend(struct
 	if (is_fixup)
 		action = MIPS_BE_FIXUP;
 
-	if (action != MIPS_BE_FIXUP)
+	if (action != MIPS_BE_FIXUP && __ratelimit(&rs)) {
 		printk(KERN_ALERT "Bus error %s: %s %s %s at %#010lx\n",
 			kind, agent, cycle, event, address);
 
-	if (action != MIPS_BE_FIXUP && address < 0x10000000)
-		printk(KERN_ALERT "  Byte lane status %#3x -- "
-		       "#3: %s, #2: %s, #1: %s, #0: %s\n",
-		       (mer & KN02XA_MER_BYTERR) >> 8,
-		       lanestat[(mer & KN02XA_MER_BYTERR_3) != 0],
-		       lanestat[(mer & KN02XA_MER_BYTERR_2) != 0],
-		       lanestat[(mer & KN02XA_MER_BYTERR_1) != 0],
-		       lanestat[(mer & KN02XA_MER_BYTERR_0) != 0]);
+		if (address < 0x10000000)
+			printk(KERN_ALERT "  Byte lane status %#3x -- "
+			       "#3: %s, #2: %s, #1: %s, #0: %s\n",
+			       (mer & KN02XA_MER_BYTERR) >> 8,
+			       lanestat[(mer & KN02XA_MER_BYTERR_3) != 0],
+			       lanestat[(mer & KN02XA_MER_BYTERR_2) != 0],
+			       lanestat[(mer & KN02XA_MER_BYTERR_1) != 0],
+			       lanestat[(mer & KN02XA_MER_BYTERR_0) != 0]);
+	}
 
 	return action;
 }

* Re: [PATCH 0/3] MIPS: DEC: Rate-limit memory errors
  2026-03-28 15:49 [PATCH 0/3] MIPS: DEC: Rate-limit memory errors Maciej W. Rozycki
                   ` (2 preceding siblings ...)
  2026-03-28 15:50 ` [PATCH 3/3] MIPS: DEC: Rate-limit memory errors for non-KN01 parity systems Maciej W. Rozycki
@ 2026-04-06 12:33 ` Thomas Bogendoerfer
  3 siblings, 0 replies; 5+ messages in thread
From: Thomas Bogendoerfer @ 2026-04-06 12:33 UTC
  To: Maciej W. Rozycki; +Cc: linux-mips, linux-kernel

On Sat, Mar 28, 2026 at 03:49:52PM +0000, Maciej W. Rozycki wrote:
> Hi,
> 
>  A recent failure of one of my systems revealed an issue with memory error 
> logging: the flood of messages reporting corrected ECC errors made the 
> system unusable, even though the errors themselves had been recovered from 
> and the messages were purely informational.
> 
>  I took the opportunity to verify that the rate-limiting serves its 
> purpose on the offending system before cleaning the memory module contacts, 
> which cured the original problem.  This is the third time it has happened 
> in the ~25 years I've had the system: not too bad, but clearly a recurring 
> issue.
> 
>  For consistency I have also updated support for the other two DEC memory 
> system designs.  They are parity-based, so memory errors are fatal and 
> consequently less likely to cause a message flood, although one is still 
> possible in principle where a faulty memory location causes a bus error 
> exception to kill user processes repeatedly.  These systems seem not to 
> have the issue with memory contacts though, as they use the common SIMM 
> design rather than 0.1"-pitch PCB connectors.
> 
>  Please apply.

series applied to mips-next

Thomas.

-- 
Crap can work. Given enough thrust pigs will fly, but it's not necessarily a
good idea.                                                [ RFC1925, 2.3 ]
