* [PATCH] x86/MCE: Prevent CPU offline for SMCA CPUs with non-core banks
From: Yazen Ghannam @ 2024-08-21 14:00 UTC
To: linux-edac
Cc: linux-kernel, tony.luck, x86, avadhut.naik, john.allen,
boris.ostrovsky, Yazen Ghannam
Logical CPUs in AMD Scalable MCA (SMCA) systems can manage non-core
banks. Each of these banks represents unique and separate hardware
located within the system. Each bank is managed by a single logical CPU;
they are not shared. Furthermore, the "CPU to MCA bank" assignment
cannot be modified at run time.

The MCE subsystem supports run time CPU hotplug. Many vendors have
non-core MCA banks, so MCA settings are not cleared when a CPU is
offlined for these vendors.

Even though the non-core MCA banks remain enabled, MCA errors will not
be handled (reported, cleared, etc.) on SMCA systems when the managing
CPU is offline.

Check if a CPU manages non-core MCA banks and, if so, prevent it from
being taken offline.

Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
---
arch/x86/kernel/cpu/mce/core.c | 24 ++++++++++++++++++++++++
1 file changed, 24 insertions(+)
diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 2a938f429c4d..cf1529d0e6b1 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -2770,10 +2770,34 @@ static int mce_cpu_online(unsigned int cpu)
 	return 0;
 }
 
+static bool mce_cpu_is_hotpluggable(void)
+{
+	if (!mce_flags.smca)
+		return true;
+
+	/*
+	 * SMCA systems use banks 0-6 for core units. Banks 7 and later are
+	 * used for non-core units.
+	 *
+	 * Logical CPUs with 7 or fewer banks can be offlined, since they are not
+	 * managing any non-core units.
+	 *
+	 * Check if non-core banks are enabled using MCG_CTL. The hardware may
+	 * report MCG_CAP[Count] greater than is actually present, so it is not a
+	 * good indicator that a CPU has non-core banks.
+	 */
+	return fls_long(mce_rdmsrl(MSR_IA32_MCG_CTL)) <= 7;
+}
+
 static int mce_cpu_pre_down(unsigned int cpu)
 {
 	struct timer_list *t = this_cpu_ptr(&mce_timer);
 
+	if (!mce_cpu_is_hotpluggable()) {
+		pr_info("CPU%d is not hotpluggable\n", cpu);
+		return -EOPNOTSUPP;
+	}
+
 	mce_disable_cpu();
 	del_timer_sync(t);
 	mce_threshold_remove_device(cpu);
--
2.34.1
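
The bank check in the patch can be tried outside the kernel with a small userspace sketch. It assumes, as the patch comment states, that bit N of MCG_CTL corresponds to bank N and that banks 0-6 are core banks. The names fls64_sketch() and mcg_ctl_allows_offline() are hypothetical stand-ins for illustration; the real code reads the MSR with mce_rdmsrl() and uses the kernel's fls_long().

```c
#include <stdbool.h>
#include <stdint.h>

/* Banks 0-6 are core banks on SMCA systems, per the patch comment. */
#define SMCA_CORE_BANKS 7

/*
 * Userspace stand-in for the kernel's fls_long(): returns the 1-based
 * position of the most significant set bit, or 0 if no bit is set.
 */
static unsigned int fls64_sketch(uint64_t x)
{
	unsigned int pos = 0;

	while (x) {
		pos++;
		x >>= 1;
	}
	return pos;
}

/*
 * Hypothetical helper: takes an MCG_CTL value as a plain integer instead
 * of reading the MSR, then applies the same comparison as the patch.
 */
static bool mcg_ctl_allows_offline(uint64_t mcg_ctl)
{
	return fls64_sketch(mcg_ctl) <= SMCA_CORE_BANKS;
}
```

With this sketch, a value of 0x7f (banks 0-6 enabled) allows offlining, while 0x80 or 0xff (bank 7 enabled) does not.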
* Re: [PATCH] x86/MCE: Prevent CPU offline for SMCA CPUs with non-core banks
From: boris.ostrovsky @ 2024-08-21 18:35 UTC
To: Yazen Ghannam, linux-edac
Cc: linux-kernel, tony.luck, x86, avadhut.naik, john.allen
On 8/21/24 10:00 AM, Yazen Ghannam wrote:
>
> + if (!mce_cpu_is_hotpluggable()) {
> + pr_info("CPU%d is not hotpluggable\n", cpu);
Can this error message be made a little more informative, e.g. that it
is emitted from MCA code? And is info the right level? I think notice
would be more appropriate.
-boris
* Re: [PATCH] x86/MCE: Prevent CPU offline for SMCA CPUs with non-core banks
From: Yazen Ghannam @ 2024-08-22 14:14 UTC
To: boris.ostrovsky
Cc: linux-edac, linux-kernel, tony.luck, x86, avadhut.naik,
john.allen
On Wed, Aug 21, 2024 at 02:35:23PM -0400, boris.ostrovsky@oracle.com wrote:
>
>
> On 8/21/24 10:00 AM, Yazen Ghannam wrote:
>
> > + if (!mce_cpu_is_hotpluggable()) {
> > + pr_info("CPU%d is not hotpluggable\n", cpu);
>
> Can this error message be made a little more informative, e.g. that it is
> emitted from MCA code? And is info the right level? I think notice would be
> more appropriate.
>
Sure thing. I can change the level to notice.
The MCE subsystem has a global prefix set already by pr_fmt. This is
defined in arch/x86/kernel/cpu/mce/internal.h.
Thanks,
Yazen
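
The pr_fmt mechanism mentioned above can be sketched in userspace as follows. The "mce: " prefix is an assumption about what internal.h defines, and format_offline_refusal() is a hypothetical helper that formats into a buffer instead of calling pr_notice(). The point is only that the prefix is joined to the format string at compile time, so each call site gets it without repeating it.

```c
#include <stddef.h>
#include <stdio.h>

/* Assumed to mirror the prefix set in arch/x86/kernel/cpu/mce/internal.h. */
#define pr_fmt(fmt) "mce: " fmt

/*
 * Hypothetical userspace stand-in for a pr_notice() call site: adjacent
 * string literals are concatenated, so the prefix costs nothing at run time.
 */
static int format_offline_refusal(char *buf, size_t len, unsigned int cpu)
{
	return snprintf(buf, len, pr_fmt("CPU%u is not hotpluggable\n"), cpu);
}
```

Under that assumption, the message for CPU 3 would read "mce: CPU3 is not hotpluggable".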
* Re: [PATCH] x86/MCE: Prevent CPU offline for SMCA CPUs with non-core banks
From: Thomas Gleixner @ 2024-08-25 11:16 UTC
To: Yazen Ghannam, linux-edac
Cc: linux-kernel, tony.luck, x86, avadhut.naik, john.allen,
boris.ostrovsky, Yazen Ghannam
On Wed, Aug 21 2024 at 09:00, Yazen Ghannam wrote:
> Logical CPUs in AMD Scalable MCA (SMCA) systems can manage non-core
> banks. Each of these banks represents unique and separate hardware
> located within the system. Each bank is managed by a single logical CPU;
> they are not shared. Furthermore, the "CPU to MCA bank" assignment
> cannot be modified at run time.
>
> The MCE subsystem supports run time CPU hotplug. Many vendors have
> non-core MCA banks, so MCA settings are not cleared when a CPU is
> offlined for these vendors.
>
> Even though the non-core MCA banks remain enabled, MCA errors will not
> be handled (reported, cleared, etc.) on SMCA systems when the managing
> CPU is offline.
>
> Check if a CPU manages non-core MCA banks and, if so, prevent it from
> being taken offline.
Which in turn breaks hibernation and kexec...
Thanks,
tglx
* Re: [PATCH] x86/MCE: Prevent CPU offline for SMCA CPUs with non-core banks
From: Yazen Ghannam @ 2024-08-26 13:20 UTC
To: Thomas Gleixner
Cc: linux-edac, linux-kernel, tony.luck, x86, avadhut.naik,
john.allen, boris.ostrovsky
On Sun, Aug 25, 2024 at 01:16:37PM +0200, Thomas Gleixner wrote:
> Which in turn breaks hibernation and kexec...
>
Right, good point.
Maybe this change can apply only to a user-initiated (sysfs) case?
Thanks,
Yazen
* Re: [PATCH] x86/MCE: Prevent CPU offline for SMCA CPUs with non-core banks
From: Borislav Petkov @ 2024-08-27 12:50 UTC
To: Yazen Ghannam, Thomas Gleixner
Cc: linux-edac, linux-kernel, tony.luck, x86, avadhut.naik,
john.allen, boris.ostrovsky
On August 26, 2024 3:20:57 PM GMT+02:00, Yazen Ghannam <yazen.ghannam@amd.com> wrote:
>Right, good point.
>
>Maybe this change can apply only to a user-initiated (sysfs) case?
>
>Thanks,
>Yazen
>
Or, you can simply say that the MCE cannot be processed because the user took the managing CPU offline.
What is this actually really fixing anyway?
--
Sent from a small device: formatting sucks and brevity is inevitable.
* Re: [PATCH] x86/MCE: Prevent CPU offline for SMCA CPUs with non-core banks
From: Yazen Ghannam @ 2024-08-27 13:47 UTC
To: Borislav Petkov
Cc: Thomas Gleixner, linux-edac, linux-kernel, tony.luck, x86,
avadhut.naik, john.allen, boris.ostrovsky
On Tue, Aug 27, 2024 at 02:50:40PM +0200, Borislav Petkov wrote:
> Or, you can simply say that the MCE cannot be processed because the user took the managing CPU offline.
>
I found that we can simply skip creating the "cpuN/online" file. This would
prevent a user from offlining a CPU, but it shouldn't prevent the system
from doing what it needs.
This is already done for CPU0 and, I think, for other cases.
> What is this actually really fixing anyway?
There are times when a user wants to take CPUs offline due to software
licensing. This change would prevent the user from unintentionally
offlining CPUs whose absence would affect MCA handling.
Thanks,
Yazen
* Re: [PATCH] x86/MCE: Prevent CPU offline for SMCA CPUs with non-core banks
From: Borislav Petkov @ 2024-08-29 8:39 UTC
To: Yazen Ghannam
Cc: Thomas Gleixner, linux-edac, linux-kernel, tony.luck, x86,
avadhut.naik, john.allen, boris.ostrovsky
On August 27, 2024 3:47:06 PM GMT+02:00, Yazen Ghannam <yazen.ghannam@amd.com> wrote:
>I found that we can simply skip creating the "cpuN/online" file. This would
>prevent a user from offlining a CPU, but it shouldn't prevent the system
>from doing what it needs.
>
>This is already done for CPU0 and, I think, for other cases.
>
>> What is this actually really fixing anyway?
>
>There are times when a user wants to take CPUs offline due to software
>licensing. This change would prevent the user from unintentionally
>offlining CPUs whose absence would affect MCA handling.
>
>Thanks,
>Yazen
If the user offlines CPUs and some MCEs cannot be handled as a result, then that's her/his problem, no?
- Why does it hurt when I do this?
- Well, don't do that then.
--
Sent from a small device: formatting sucks and brevity is inevitable.
* Re: [PATCH] x86/MCE: Prevent CPU offline for SMCA CPUs with non-core banks
From: Yazen Ghannam @ 2024-08-29 14:03 UTC
To: Borislav Petkov
Cc: Thomas Gleixner, linux-edac, linux-kernel, tony.luck, x86,
avadhut.naik, john.allen, boris.ostrovsky
On Thu, Aug 29, 2024 at 10:39:41AM +0200, Borislav Petkov wrote:
> If the user offlines CPUs and some MCEs cannot be handled as a result, then that's her/his problem, no?
>
> - Why does it hurt when I do this?
> - Well, don't do that then.
> --
Right, that was our initial feedback.
But there was an ask to have the kernel manage this.
Do you think we should continue to pursue this or not?
Thanks,
Yazen
* Re: [PATCH] x86/MCE: Prevent CPU offline for SMCA CPUs with non-core banks
From: Borislav Petkov @ 2024-08-29 14:14 UTC
To: Yazen Ghannam
Cc: Thomas Gleixner, linux-edac, linux-kernel, tony.luck, x86,
avadhut.naik, john.allen, boris.ostrovsky
On Thu, Aug 29, 2024 at 10:03:05AM -0400, Yazen Ghannam wrote:
> Do you think we should continue to pursue this or not?
You mean the kernel should prevent those folks from shooting themselves in the
foot?
What would that patch look like?
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
* Re: [PATCH] x86/MCE: Prevent CPU offline for SMCA CPUs with non-core banks
From: Yazen Ghannam @ 2024-08-29 14:18 UTC
To: Borislav Petkov
Cc: Thomas Gleixner, linux-edac, linux-kernel, tony.luck, x86,
avadhut.naik, john.allen, boris.ostrovsky
On Thu, Aug 29, 2024 at 04:14:15PM +0200, Borislav Petkov wrote:
> On Thu, Aug 29, 2024 at 10:03:05AM -0400, Yazen Ghannam wrote:
> > Do you think we should continue to pursue this or not?
>
> You mean the kernel should prevent those folks from shooting themselves in the
> foot?
>
> What would that patch look like?
>
Right, I'm working on another revision. I'll try to send it today.
The gist is that we can hide the sysfs interface for CPUs that shouldn't
be hotplugged. We already do this today for other special cases like
CPU0.
Thanks,
Yazen