From: "Luck, Tony" <tony.luck@intel.com>
To: Borislav Petkov <bp@alien8.de>
Cc: "Li, Rongqing" <lirongqing@baidu.com>,
Nikolay Borisov <nik.borisov@suse.com>,
Thomas Gleixner <tglx@kernel.org>, Ingo Molnar <mingo@redhat.com>,
Dave Hansen <dave.hansen@linux.intel.com>,
"x86@kernel.org" <x86@kernel.org>,
"H . Peter Anvin" <hpa@zytor.com>,
"Yazen Ghannam" <yazen.ghannam@amd.com>,
"Zhuo, Qiuxu" <qiuxu.zhuo@intel.com>,
Avadhut Naik <avadhut.naik@amd.com>,
"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
"linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org>
Subject: Re: [PATCH] x86/mce: Fix timer interval adjustment after logging a MCE event
Date: Mon, 9 Feb 2026 09:37:35 -0800
Message-ID: <aYobX83_0kElO3NZ@agluck-desk3>
In-Reply-To: <20260207115142.GBaYcnTp7maUDVv3Nc@fat_crate.local>

On Sat, Feb 07, 2026 at 12:51:42PM +0100, Borislav Petkov wrote:
> On Wed, Jan 14, 2026 at 02:50:34PM +0100, Borislav Petkov wrote:
> > On Tue, Jan 13, 2026 at 04:30:08PM -0800, Luck, Tony wrote:
> > > Seems to work (though you've deleted all the places where mce_need_notify
> > > is used, so you can also delete the declaration.
> >
> > Right.
> >
> > > I see time delta between logs reducing while I'm injecting errors.
> > >
> > > When I pause injection for several minutes, and then restart I see the
> > > interval went back up again.
> >
> > Thanks Tony, I'll play with this too and ponder over what would be the proper
> > fix which to take to stable too.
>
> Hmm, so looking at this more while it is all peaceful and I can actually hear
> the thoughts in my head... :-)
>
> The whole dance here on the MCE logging path:
>
> mce_log -> ... mce_irq_work -> ... mce_work -> mce_gen_pool_process
>
> can happen in between two mce_timer_fn() function firings - just think of
> the default timer running once every 5 mins.
>
> So in-between those runs with 5 min timeout, errors can get logged and when
> mce_notify_irq() runs, it won't see either that the genpool is not empty
> - it will be empty - and mce_need_notify will be 0 too because we would've
> set and cleared it.

The algorithm to halve the interval when errors are found, and double it
when they are not found, was originally for a "poll-only" configuration.
So there wasn't an option for an error to be logged between timer
invocations. This all dates back to before #MC was recoverable.
If the system is now running in some mixed mode of polling and
interrupts, then it is unclear what should be done in various
new cases.
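To make the poll-only behaviour concrete, here is a minimal user-space
sketch of the halve/double rule. It is not the kernel code: the function
name, the millisecond units, and the lower bound are assumptions made for
illustration. Only the 5 minute upper bound comes from the default
mentioned in this thread.

```c
#include <stdbool.h>

/*
 * Hypothetical bounds in milliseconds. The 5 minute maximum mirrors the
 * default polling period discussed in this thread; the minimum is an
 * assumed floor so the interval cannot shrink to zero.
 */
#define POLL_MAX_MS (5UL * 60 * 1000)
#define POLL_MIN_MS 10UL

/*
 * Return the next polling interval, given the current interval and
 * whether the poll that just ran found any errors.
 */
static unsigned long next_poll_interval(unsigned long cur_ms, bool found_errors)
{
	unsigned long next;

	if (found_errors) {
		/* Errors seen: poll twice as often, down to the floor. */
		next = cur_ms / 2;
		if (next < POLL_MIN_MS)
			next = POLL_MIN_MS;
	} else {
		/* Quiet period: back off, up to the ceiling. */
		next = cur_ms * 2;
		if (next > POLL_MAX_MS)
			next = POLL_MAX_MS;
	}
	return next;
}
```

In a pure polling setup the only input is what the poll itself found,
which is why nothing in this rule accounts for errors logged by some
other path between two timer runs.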
>
> So basically, the timer fires, we log errors without it noticing anything, and
> it won't halve.
>
> The only way it would halve is if it manages to notice an error being
> in-flight to being logged and it fires right then and there. Then its interval
> would get halved.
>
> And this sounds kinda weird and not what we want perhaps.
>
> So fixing that would mean, we'd have to write down the fact that in-between
> two timer invocations, we have logged errors. Maybe a per-CPU counter
> somewhere which says "this CPU logged so many errors after the timer ran the
> last time".
>
> The timer would fire, check that counter for != 0, and if so, decrease
> interval and clear it.
>
> And it doesn't even have to be a counter - it suffices to be a single bit
> which gets set.
>
> A scheme like that would solve this accurately I'd say.
>
> But the real question actually is, do we really care?

I don't think we care. If we miss out on halving the interval because an
error was logged between timer-based polls, nothing really bad will
happen. The interval might get sorted out on the next poll.
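For reference only, the single-bit scheme quoted above could look roughly
like the user-space sketch below. Everything in it is hypothetical: the
array, the function names, and the fixed CPU count are made up for
illustration, and C11 atomics stand in for whatever per-CPU primitives
the kernel would actually use.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Assumed CPU count, purely for the sketch. */
#define NR_CPUS_SKETCH 4

/* One flag per CPU: "an error was logged since the timer last ran". */
static atomic_bool logged_since_poll[NR_CPUS_SKETCH];

/* Logging path: record that this CPU logged an error. */
static void note_error_logged(int cpu)
{
	atomic_store(&logged_since_poll[cpu], true);
}

/*
 * Timer path: test and clear the flag in one step. Returns true if an
 * error was logged on this CPU since the previous call, in which case
 * the caller would halve its polling interval.
 */
static bool errors_logged_since_last_poll(int cpu)
{
	return atomic_exchange(&logged_since_poll[cpu], false);
}
```

The exchange reads and clears the flag in one atomic step, so an error
logged while the timer is running is either seen by this poll or left
set for the next one.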
> I mean, this thing went unnoticed for so long and frankly, people should run
> the CEC anyway which has a better MCE-has-been-logged stifling capability so
> that I wanna say, let's do the simplest thing and be done with it.
>
> Or?
>
> Do we care about some real use case here...?
>

Unless someone has a real-world case where something is going badly
wrong, I don't think any changes are needed to cover this race.

-Tony