From: Tony W Wang-oc <TonyWWang-oc@zhaoxin.com>
To: Thomas Gleixner <tglx@linutronix.de>,
Dave Hansen <dave.hansen@intel.com>, <mingo@redhat.com>,
<bp@alien8.de>, <dave.hansen@linux.intel.com>, <x86@kernel.org>,
<hpa@zytor.com>, <keescook@chromium.org>, <tony.luck@intel.com>,
<gpiccoli@igalia.com>, <mat.jonczyk@o2.pl>,
<rdunlap@infradead.org>, <alexandre.belloni@bootlin.com>,
<mario.limonciello@amd.com>, <yaolu@kylinos.cn>,
<bhelgaas@google.com>, <justinstitt@google.com>,
<linux-kernel@vger.kernel.org>, <linux-hardening@vger.kernel.org>
Cc: <CobeChen@zhaoxin.com>, <TimGuo@zhaoxin.com>,
<LeoLiu-oc@zhaoxin.com>,
Linus Torvalds <torvalds@linux-foundation.org>
Subject: Re: [PATCH] x86/hpet: Read HPET directly if panic in progress
Date: Mon, 1 Jul 2024 19:09:18 +0800 [thread overview]
Message-ID: <93030e7f-dcd2-45dd-a3a2-efa0128753f1@zhaoxin.com> (raw)
In-Reply-To: <87wmnda8mc.ffs@tglx>
On 2024/5/29 06:12, Thomas Gleixner wrote:
>
>
> [这封邮件来自外部发件人 谨防风险]
>
> On Tue, May 28 2024 at 07:18, Dave Hansen wrote:
>> On 5/27/24 23:38, Tony W Wang-oc wrote:
>> ...> diff --git a/arch/x86/kernel/hpet.c b/arch/x86/kernel/hpet.c
>>> index c96ae8fee95e..ecadd0698d6a 100644
>>> --- a/arch/x86/kernel/hpet.c
>>> +++ b/arch/x86/kernel/hpet.c
>>> @@ -804,6 +804,12 @@ static u64 read_hpet(struct clocksource *cs)
>>> if (in_nmi())
>>> return (u64)hpet_readl(HPET_COUNTER);
>>>
>>> + /*
>>> + * Read HPET directly if panic in progress.
>>> + */
>>> + if (unlikely(atomic_read(&panic_cpu) != PANIC_CPU_INVALID))
>>> + return (u64)hpet_readl(HPET_COUNTER);
>>> +
>>
>> There is literally one other piece of the code in the kernel doing
>> something similar: the printk() implementation. There's no other
>> clocksource or timekeeping code that does this on any architecture.
>>
>> Why doesn't this problem apply to any other clock sources?
>
> I principle it applies to any clocksource which needs a spinlock to
> serialize access. HPET is not the only insanity here.
>
> Think about i8253 :)
>
> Most real clocksources, like TSC and the majority of the preferred clock
> sources on other architectures are perfectly fine. They just read and be
> done.
>
>> Why should the problem be fixed in the clock sources themselves? Why
>> doesn't printk() deadlock on systems using the HPET?
>
> Because regular printk()s are deferred to irq work when in NMI and
> similar contexts, but that obviously does not apply to panic
> situations. Also NMI is treated special even in the HPET code. See
> below.
>
>> In other words, I think we should fix pstore to be more like printk
>> rather than hacking around this in each clock source.
>
> pstore is perfectly fine. It uses a NMI safe time accessor function
> which is then tripping over the HPET lock. That's really a HPET specific
> problem.
>
> Though what I read out of the changelog is that the MCE hits the same
> CPU 'x' which holds the lock. But that's fairy tale material as you can
> see in the patch above:
>
> if (in_nmi())
> return (u64)hpet_readl(HPET_COUNTER);
>
> For that particular case the dead lock, which would actually be a live
> lock, cannot happen because in kernel MCEs are NMI class exceptions and
> therefore in_nmi() evaluates to true and that new voodoo can't be
> reached at all.
>
> Now there are two other scenarios which really can make that happen:
>
> 1) A non-NMI class exception within the lock held region
>
> CPU A
> acquire(hpet_lock);
> ... <- #PF, #GP, #DE, ... -> panic()
>
> If any of that happens within that lock held section then the live
> lock on the hpet_lock is the least of your worries. Seriously, I
> don't care about this at all.
>
> 2) The actual scenario is:
>
> CPU A CPU B
> lock(hpet_lock)
> MCE hits user space
> ...
> exc_machine_check_user()
> irqentry_enter_from_user_mode(regs);
>
> irqentry_enter_from_user_mode() obviously does not mark the
> exception as NMI class, so in_nmi() evaluates to false. That would
> actually dead lock if CPU A is not making progress and releases
> hpet_lock.
>
> Sounds unlikely to happen, right? But in reality it can because of
> MCE broadcast. Assume that both CPUs go into MCE:
>
> CPU A CPU B
> lock(hpet_lock)
> exc_machine_check_user()
> irqentry_enter_from_user_mode();
> exc_machine_check_kernel() do_machine_check()
> irqentry_nmi_enter(); mce_panic()
> do_machine_check() if (atomic_inc_return(&mce_panicked) > 1)
> mce_panic() wait_for_panic(); <- Not taken
>
> if (atomic_inc_return(&mce_panicked) > 1)
> wait_for_panic(); <- Taken
>
> ....
> hpet_read()
>
> -> Dead lock because in_nmi() evaluates to false on CPU B and CPU A
> obviously can't release the lock.
>
For this scenario, an experiment was designed for the printk:
a, Install a driver module that repeatedly sending IPIs to multiple
cores to executes printk.
b, Run a user-level testing tool like stream on all cores.
c, Trigger a MCE hardware error.
During burnin tests a-c, reproduce the following case:
CPU A CPU B
printk()
console_owner
exc_machine_check_user()
irqentry_enter_from_user_mode()
exc_machine_check_kernel() do_machine_check()
irqentry_nmi_enter() mce_panic()
do_machine_check() print_mce()
...
... while(console_waiter)
cpu_relax(); <- deadloop
mce_timed_out() <-timeout
wait_for_panic()
panic("Panicing machine check CPU died");
In this case CPU B is the monarch CPU in MCE handler, CPU B waiting to
be the console_owner and CPU A can't release the console_owner.
So the monarch CPU B deadloop happened, as a result other CPU witch
waiting the monarch CPU timeout will call the panic function.
This problem is caused by the use of printk in the MCE handler.
Actually, I found the comments for the MCE handler like:
* This is executed in #MC context not subject to normal locking rules.
* This implies that most kernel services cannot be safely used. Don't even
* think about putting a printk in there!
Should consider not using printk in the MCE handler?
Sincerely!
TonyWWang-oc
> So the proposed patch makes sense to some extent. But it only cures the
> symptom. The real underlying questions are:
>
> 1) Should we provide a panic mode read callback for clocksources which
> are affected by this?
>
> 2) Is it correct to claim that a MCE which hits user space and ends up in
> mce_panic() is still just a regular exception or should we upgrade to
> NMI class context when we enter mce_panic() or even go as far to
> upgrade to NMI class context for any panic() invocation?
>
> #1 Solves it at the clocksource level. It still needs HPET specific
> changes.
>
> #2 Solves a whole class of issues
>
> ... while potentially introducing new ones :)
>
> To me upgrading any panic() invocation to NMI class context makes a
> lot of sense because in that case all bets are off.
>
> in_nmi() is used in quite some places to avoid such problems. IOW,
> that would kill a whole class of issues instead of "curing" the HPET
> problem locally for the price of an extra conditional. Not that the
> extra conditional matters much if HPET is the clocksource as that's
> awfully slow anyway and I really don't care about that.
>
> But I very much care about avoiding to sprinkle panic_cpu checks all
> over the place.
>
> Thanks,
>
> tglx
prev parent reply other threads:[~2024-07-01 11:09 UTC|newest]
Thread overview: 19+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-05-28 6:38 [PATCH] x86/hpet: Read HPET directly if panic in progress Tony W Wang-oc
2024-05-28 14:18 ` Dave Hansen
2024-05-28 22:12 ` Thomas Gleixner
2024-05-28 23:22 ` Linus Torvalds
2024-05-29 7:42 ` Thomas Gleixner
2024-06-05 6:23 ` Tony W Wang-oc
2024-06-05 9:36 ` Borislav Petkov
2024-06-05 10:10 ` Tony W Wang-oc
2024-06-05 11:33 ` Borislav Petkov
2024-06-05 12:02 ` Tony W Wang-oc
2024-06-05 15:51 ` Luck, Tony
2024-06-06 8:44 ` Tony W Wang-oc
2024-06-06 17:00 ` Luck, Tony
2024-05-29 4:39 ` Tony W Wang-oc
2024-05-29 6:44 ` Tony W Wang-oc
2024-05-29 7:45 ` Thomas Gleixner
2024-05-29 8:15 ` Tony W Wang-oc
2024-05-29 7:44 ` Thomas Gleixner
2024-07-01 11:09 ` Tony W Wang-oc [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=93030e7f-dcd2-45dd-a3a2-efa0128753f1@zhaoxin.com \
--to=tonywwang-oc@zhaoxin.com \
--cc=CobeChen@zhaoxin.com \
--cc=LeoLiu-oc@zhaoxin.com \
--cc=TimGuo@zhaoxin.com \
--cc=alexandre.belloni@bootlin.com \
--cc=bhelgaas@google.com \
--cc=bp@alien8.de \
--cc=dave.hansen@intel.com \
--cc=dave.hansen@linux.intel.com \
--cc=gpiccoli@igalia.com \
--cc=hpa@zytor.com \
--cc=justinstitt@google.com \
--cc=keescook@chromium.org \
--cc=linux-hardening@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=mario.limonciello@amd.com \
--cc=mat.jonczyk@o2.pl \
--cc=mingo@redhat.com \
--cc=rdunlap@infradead.org \
--cc=tglx@linutronix.de \
--cc=tony.luck@intel.com \
--cc=torvalds@linux-foundation.org \
--cc=x86@kernel.org \
--cc=yaolu@kylinos.cn \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox