public inbox for linux-kernel@vger.kernel.org
* [RFC] RAS/CEC: Should cec_notifier() set MCE_HANDLED_CEC after a soft-offline?
@ 2024-10-01 18:02 Kyle Meyer
  2024-10-01 18:24 ` Luck, Tony
  0 siblings, 1 reply; 3+ messages in thread
From: Kyle Meyer @ 2024-10-01 18:02 UTC
  To: bp, tony.luck; +Cc: linux-kernel

Hi Boris and Tony,

I noticed CEC should indicate whether it took action to log or handle an error
by setting MCE_HANDLED_CEC (commit 1de08dc) and that EDAC and dev-mcelog should
skip errors that have been processed by CEC (commit 23ba710).

cec_notifier() does not set MCE_HANDLED_CEC when the offlining threshold
is reached in cec_add_elem() because the return code is not zero. Is that
intentional?

Thanks,
Kyle Meyer


* RE: [RFC] RAS/CEC: Should cec_notifier() set MCE_HANDLED_CEC after a soft-offline?
  2024-10-01 18:02 [RFC] RAS/CEC: Should cec_notifier() set MCE_HANDLED_CEC after a soft-offline? Kyle Meyer
@ 2024-10-01 18:24 ` Luck, Tony
  2024-10-01 18:45   ` Kyle Meyer
  0 siblings, 1 reply; 3+ messages in thread
From: Luck, Tony @ 2024-10-01 18:24 UTC
  To: Meyer, Kyle, bp@alien8.de; +Cc: linux-kernel@vger.kernel.org

> I noticed CEC should indicate whether it took action to log or handle an error
> by setting MCE_HANDLED_CEC (commit 1de08dc) and that EDAC and dev-mcelog should
> skip errors that have been processed by CEC (commit 23ba710).
>
> cec_notifier() does not set MCE_HANDLED_CEC when the offlining threshold
> is reached in cec_add_elem() because the return code is not zero. Is that
> intentional?

Kyle,

It seems a bit murky. You are right that cec_add_elem() appears to expect three
different actions from its caller based on the return value being <0, 0, >0. But
cec_notifier() only has two actions (0 and !0).

But I think this may be OK. The main purpose of CEC is to avoid over-reacting
to simple corrected memory errors. Many (most?) are due to particle bit flips and
no action is needed. So setting MCE_HANDLED_CEC for the case where CEC
counted the error, but took no action feels like the right thing to do.

Conversely, if action was taken (because this was an error that repeated
enough to hit the threshold) then we do want mcelog/EDAC to give additional
reporting.
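
The mismatch between the three-way return value and the two-way check can be
modelled in user space. This is only a rough sketch of the behavior described
above, not the kernel source: every sketch_* name, the flag value, the
threshold of 3, and the single shared counter are assumptions made for
illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative flag value; the kernel defines MCE_HANDLED_CEC in its own headers. */
#define SKETCH_HANDLED_CEC (1u << 0)

/* Hypothetical threshold, kept small so the behavior is easy to follow. */
#define SKETCH_THRESHOLD 3

/* Hypothetical stand-in for the fields of an error record used here. */
struct sketch_mce {
	unsigned long pfn;
	unsigned int kflags;
};

/* One shared counter for brevity; a real collector tracks errors per page. */
static unsigned int sketch_count;

/*
 * Models the three-way return convention from the thread:
 * <0 on error, 0 when the error was only counted, >0 when the
 * threshold was reached and the page would be soft-offlined.
 */
static int sketch_cec_add_elem(unsigned long pfn)
{
	if (pfn == 0)
		return -1;	/* the sketch treats pfn 0 as invalid */

	if (++sketch_count >= SKETCH_THRESHOLD) {
		sketch_count = 0;
		return 1;	/* action taken: page would be offlined */
	}

	return 0;		/* counted only, no action */
}

/*
 * Two-way handling in the caller: only a zero return sets the flag.
 * Returns 1 when the record is consumed, 0 when it is passed on.
 */
static int sketch_cec_notifier(struct sketch_mce *m)
{
	assert(m != NULL);	/* the sketch assumes a valid record */

	if (!sketch_cec_add_elem(m->pfn)) {
		m->kflags |= SKETCH_HANDLED_CEC;
		return 1;	/* later consumers would skip this record */
	}

	return 0;		/* error or offline: flag stays clear */
}
```

Feeding the same page to sketch_cec_notifier() three times marks the first two
records as handled and leaves the third unflagged, so in this model the
offlining event is the one that stays visible to later consumers.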

-Tony



* Re: [RFC] RAS/CEC: Should cec_notifier() set MCE_HANDLED_CEC after a soft-offline?
  2024-10-01 18:24 ` Luck, Tony
@ 2024-10-01 18:45   ` Kyle Meyer
  0 siblings, 0 replies; 3+ messages in thread
From: Kyle Meyer @ 2024-10-01 18:45 UTC
  To: Luck, Tony; +Cc: bp@alien8.de, linux-kernel@vger.kernel.org

On Tue, Oct 01, 2024 at 06:24:19PM +0000, Luck, Tony wrote:
> > I noticed CEC should indicate whether it took action to log or handle an error
> > by setting MCE_HANDLED_CEC (commit 1de08dc) and that EDAC and dev-mcelog should
> > skip errors that have been processed by CEC (commit 23ba710).
> >
> > cec_notifier() does not set MCE_HANDLED_CEC when the offlining threshold
> > is reached in cec_add_elem() because the return code is not zero. Is that
> > intentional?
> 
> Kyle,
> 
> It seems a bit murky. You are right that cec_add_elem() appears to expect three
> different actions from its caller based on the return value being <0, 0, >0. But
> cec_notifier() only has two actions (0 and !0).
> 
> But I think this may be OK. The main purpose of CEC is to avoid over-reacting
> to simple corrected memory errors. Many (most?) are due to particle bit flips and
> no action is needed. So setting MCE_HANDLED_CEC for the case where CEC
> counted the error, but took no action feels like the right thing to do.
> 
> Conversely, if action was taken (because this was an error that repeated
> enough to hit the threshold) then we do want mcelog/EDAC to give additional
> reporting.

That makes sense. Thank you for the explanation.

Thanks,
Kyle Meyer
