* [PATCH] cpu, AMD: Fix another bug in the new errata checking code
@ 2011-05-12 23:59 Chuck Ebbert
2011-05-13 10:21 ` Hans Rosenfeld
0 siblings, 1 reply; 7+ messages in thread
From: Chuck Ebbert @ 2011-05-12 23:59 UTC (permalink / raw)
To: linux-kernel; +Cc: Hans Rosenfeld, Boris Ostrovsky, Borislav Petkov
Fix a bug that causes CPU hangs due to missing timer interrupts,
introduced by these three patches:
(1) commit d78d671db478eb8b14c78501c0cee1cc7baf6967
"x86, cpu: AMD errata checking framework"
(2) commit 9d8888c2a214aece2494a49e699a097c2ba9498b
"x86, cpu: Clean up AMD erratum 400 workaround"
(3) commit b87cf80af3ba4b4c008b4face3c68d604e1715c6
"x86, AMD: Set ARAT feature on AMD processors"
Patch (1) introduced a new framework that allowed checking for errata
using AMD's OSVW (OS visible workaround) feature combined with
explicit lists of models. It checked OSVW first, and completely
relied on that if it was present and usable.
Patch (2) switched the checking for erratum 400 to use the new
framework. But the original code checked for an explicit model range
first, then used OSVW if the CPU was not within that range. Patch (2)
also inexplicably added a second model range (for Family 10h) that
was never in the original code.
Then patch (3) used the new erratum 400 checks to decide whether
to enable the ARAT feature (always running APIC timer.) However,
this causes notebooks using the Sempron processor (Family 10h
Model 6 Stepping 2) to enable ARAT when they shouldn't because the
explicit check for that model gets skipped.
The fix is to check the model list first, then use OSVW if the CPU
is not in that list.
Signed-off-by: Chuck Ebbert <cebbert@redhat.com>
---
NOTE: Untested, but this looks like the obvious fix.
--- a/arch/x86/kernel/cpu/amd.c
+++ b/arch/x86/kernel/cpu/amd.c
@@ -723,6 +723,17 @@ bool cpu_has_amd_erratum(const int *erra
if (cpu->x86_vendor != X86_VENDOR_AMD)
return false;
+ /*
+ * Must match family-model-stepping range first so that the
+ * range checks will override OSVW checking.
+ */
+ ms = (cpu->x86_model << 4) | cpu->x86_mask;
+ while ((range = *erratum++))
+ if ((cpu->x86 == AMD_MODEL_RANGE_FAMILY(range)) &&
+ (ms >= AMD_MODEL_RANGE_START(range)) &&
+ (ms <= AMD_MODEL_RANGE_END(range)))
+ return true;
+
if (osvw_id >= 0 && osvw_id < 65536 &&
cpu_has(cpu, X86_FEATURE_OSVW)) {
u64 osvw_len;
@@ -737,15 +748,6 @@ bool cpu_has_amd_erratum(const int *erra
}
}
- /* OSVW unavailable or ID unknown, match family-model-stepping range */
- ms = (cpu->x86_model << 4) | cpu->x86_mask;
- while ((range = *erratum++))
- if ((cpu->x86 == AMD_MODEL_RANGE_FAMILY(range)) &&
- (ms >= AMD_MODEL_RANGE_START(range)) &&
- (ms <= AMD_MODEL_RANGE_END(range)))
- return true;
-
return false;
}
-
EXPORT_SYMBOL_GPL(cpu_has_amd_erratum);
_
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH] cpu, AMD: Fix another bug in the new errata checking code
2011-05-12 23:59 [PATCH] cpu, AMD: Fix another bug in the new errata checking code Chuck Ebbert
@ 2011-05-13 10:21 ` Hans Rosenfeld
2011-05-13 13:03 ` Boris Ostrovsky
0 siblings, 1 reply; 7+ messages in thread
From: Hans Rosenfeld @ 2011-05-13 10:21 UTC (permalink / raw)
To: Chuck Ebbert
Cc: linux-kernel@vger.kernel.org, Boris Ostrovsky, Petkov, Borislav
On Thu, May 12, 2011 at 07:59:38PM -0400, Chuck Ebbert wrote:
> Fix a bug that causes CPU hangs due to missing timer interrupts,
> introduced by these three patches:
>
> (1) commit d78d671db478eb8b14c78501c0cee1cc7baf6967
> "x86, cpu: AMD errata checking framework"
>
> (2) commit 9d8888c2a214aece2494a49e699a097c2ba9498b
> "x86, cpu: Clean up AMD erratum 400 workaround"
>
> (3) commit b87cf80af3ba4b4c008b4face3c68d604e1715c6
> "x86, AMD: Set ARAT feature on AMD processors"
>
> Patch (1) introduced a new framework that allowed checking for errata
> using AMD's OSVW (OS visible workaround) feature combined with
> explicit lists of models. It checked OSVW first, and completely
> relied on that if it was present and usable.
Thats how it is specified to work.
> Patch (2) switched the checking for erratum 400 to use the new
> framework. But the original code checked for an explicit model range
> first, then used OSVW if the CPU was not within that range. Patch (2)
> also inexplicably added a second model range (for Family 10h) that
> was never in the original code.
The original code checked just for family 0x10, and thats what the new
code does: define a model range that covers all of family 0x10.
> Then patch (3) used the new erratum 400 checks to decide whether
> to enable the ARAT feature (always running APIC timer.) However,
> this causes notebooks using the Sempron processor (Family 10h
> Model 6 Stepping 2) to enable ARAT when they shouldn't because the
> explicit check for that model gets skipped.
>
> The fix is to check the model list first, then use OSVW if the CPU
> is not in that list.
No, that is wrong. The whole point of OSVW is to check it first. The
model ranges are only to be used for older systems that either don't
have OSVW or don't know about a particular erratum yet.
The revision guide states that family 0x10 model 6 stepping 2 has E400.
So I would expect that OSVW length is >= 2 and that OSVW status has bit
1 set, or that OSVW length is < 2. This indicates that the workaround is
necessary, without any need to check the family-model-stepping ranges.
It would also be correct if the BIOS disabled C1E and cleared the
corresponding OSVW status bit. Anything else would probably be a very
nasty BIOS bug.
Could you send me the contents of MSRs 0xc0010140, 0xc0010141 and
0xc0010055?
Hans
--
%SYSTEM-F-ANARCHISM, The operating system has been overthrown
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH] cpu, AMD: Fix another bug in the new errata checking code
2011-05-13 10:21 ` Hans Rosenfeld
@ 2011-05-13 13:03 ` Boris Ostrovsky
2011-05-13 14:59 ` Chuck Ebbert
0 siblings, 1 reply; 7+ messages in thread
From: Boris Ostrovsky @ 2011-05-13 13:03 UTC (permalink / raw)
To: Hans Rosenfeld, Chuck Ebbert
Cc: linux-kernel@vger.kernel.org, Petkov, Borislav
On 05/13/2011 06:21 AM, Hans Rosenfeld wrote:
> On Thu, May 12, 2011 at 07:59:38PM -0400, Chuck Ebbert wrote:
> The revision guide states that family 0x10 model 6 stepping 2 has E400.
> So I would expect that OSVW length is>= 2 and that OSVW status has bit
> 1 set, or that OSVW length is< 2. This indicates that the workaround is
> necessary, without any need to check the family-model-stepping ranges.
>
> It would also be correct if the BIOS disabled C1E and cleared the
> corresponding OSVW status bit. Anything else would probably be a very
> nasty BIOS bug.
>
> Could you send me the contents of MSRs 0xc0010140, 0xc0010141 and
> 0xc0010055?
Knowing whether any C state above C1 is declared could be useful too.
-boris
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH] cpu, AMD: Fix another bug in the new errata checking code
2011-05-13 13:03 ` Boris Ostrovsky
@ 2011-05-13 14:59 ` Chuck Ebbert
2011-05-13 15:19 ` Hans Rosenfeld
0 siblings, 1 reply; 7+ messages in thread
From: Chuck Ebbert @ 2011-05-13 14:59 UTC (permalink / raw)
To: Boris Ostrovsky
Cc: Hans Rosenfeld, linux-kernel@vger.kernel.org, Petkov, Borislav
On Fri, 13 May 2011 09:03:59 -0400
Boris Ostrovsky <boris.ostrovsky@amd.com> wrote:
> On 05/13/2011 06:21 AM, Hans Rosenfeld wrote:
> > On Thu, May 12, 2011 at 07:59:38PM -0400, Chuck Ebbert wrote:
> > The revision guide states that family 0x10 model 6 stepping 2 has E400.
> > So I would expect that OSVW length is>= 2 and that OSVW status has bit
> > 1 set, or that OSVW length is< 2. This indicates that the workaround is
> > necessary, without any need to check the family-model-stepping ranges.
> >
> > It would also be correct if the BIOS disabled C1E and cleared the
> > corresponding OSVW status bit. Anything else would probably be a very
> > nasty BIOS bug.
> >
> > Could you send me the contents of MSRs 0xc0010140, 0xc0010141 and
> > 0xc0010055?
>
> Knowing whether any C state above C1 is declared could be useful too.
>
rdmsr 0xc0010140 gives 2
rdmsr 0xc0010141 gives 0
rdmsr 0xc0010055 gives 0
And ARAT is definitely set where it wasn't before these updates.
BTW we now have multiple reports of this, one system is a Compaq Presario
CQ61 with an AMD Sempron M120 processor.
I'll check on the C-states next.
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH] cpu, AMD: Fix another bug in the new errata checking code
2011-05-13 14:59 ` Chuck Ebbert
@ 2011-05-13 15:19 ` Hans Rosenfeld
2011-05-16 12:43 ` Chuck Ebbert
0 siblings, 1 reply; 7+ messages in thread
From: Hans Rosenfeld @ 2011-05-13 15:19 UTC (permalink / raw)
To: Chuck Ebbert
Cc: Ostrovsky, Boris, linux-kernel@vger.kernel.org, Petkov, Borislav
On Fri, May 13, 2011 at 10:59:28AM -0400, Chuck Ebbert wrote:
> On Fri, 13 May 2011 09:03:59 -0400
> Boris Ostrovsky <boris.ostrovsky@amd.com> wrote:
>
> > On 05/13/2011 06:21 AM, Hans Rosenfeld wrote:
> > > On Thu, May 12, 2011 at 07:59:38PM -0400, Chuck Ebbert wrote:
> > > The revision guide states that family 0x10 model 6 stepping 2 has E400.
> > > So I would expect that OSVW length is>= 2 and that OSVW status has bit
> > > 1 set, or that OSVW length is< 2. This indicates that the workaround is
> > > necessary, without any need to check the family-model-stepping ranges.
> > >
> > > It would also be correct if the BIOS disabled C1E and cleared the
> > > corresponding OSVW status bit. Anything else would probably be a very
> > > nasty BIOS bug.
> > >
>
> > > Could you send me the contents of MSRs 0xc0010140, 0xc0010141 and
> > > 0xc0010055?
> >
> > Knowing whether any C state above C1 is declared could be useful too.
> >
> rdmsr 0xc0010140 gives 2
This means that E400 is known ...
> rdmsr 0xc0010141 gives 0
... and no workaround is necessary ...
> rdmsr 0xc0010055 gives 0
... because C1E is not enabled.
> And ARAT is definitely set where it wasn't before these updates.
I don't see how that could possibly make a difference if C1E is not even
enabled. This is all very strange.
Hans
--
%SYSTEM-F-ANARCHISM, The operating system has been overthrown
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH] cpu, AMD: Fix another bug in the new errata checking code
2011-05-13 15:19 ` Hans Rosenfeld
@ 2011-05-16 12:43 ` Chuck Ebbert
2011-05-16 13:38 ` Boris Ostrovsky
0 siblings, 1 reply; 7+ messages in thread
From: Chuck Ebbert @ 2011-05-16 12:43 UTC (permalink / raw)
To: Hans Rosenfeld
Cc: Ostrovsky, Boris, linux-kernel@vger.kernel.org, Petkov, Borislav
On Fri, 13 May 2011 17:19:23 +0200
Hans Rosenfeld <hans.rosenfeld@amd.com> wrote:
> > > > Could you send me the contents of MSRs 0xc0010140, 0xc0010141 and
> > > > 0xc0010055?
> > >
> > > Knowing whether any C state above C1 is declared could be useful too.
> > >
> > rdmsr 0xc0010140 gives 2
>
> This means that E400 is known ...
>
> > rdmsr 0xc0010141 gives 0
>
> ... and no workaround is necessary ...
>
> > rdmsr 0xc0010055 gives 0
>
> ... because C1E is not enabled.
>
> > And ARAT is definitely set where it wasn't before these updates.
>
> I don't see how that could possibly make a difference if C1E is not even
> enabled. This is all very strange.
>
Looking at commit e20a2d205c05cef6b5783df339a7d54adeb50962 ("x86, AMD: Fix
APIC timer erratum 400 affecting K8 Rev.A-E processors") I see that it
extended the E400 workaround to cover a whole range of processors that
have never supported C1E. Isn't this just more of the same problem, only
happening with processors that support C1E but have it disabled?
They are using C3 for idle states, I can confirm that now.
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH] cpu, AMD: Fix another bug in the new errata checking code
2011-05-16 12:43 ` Chuck Ebbert
@ 2011-05-16 13:38 ` Boris Ostrovsky
0 siblings, 0 replies; 7+ messages in thread
From: Boris Ostrovsky @ 2011-05-16 13:38 UTC (permalink / raw)
To: Chuck Ebbert
Cc: Rosenfeld, Hans, linux-kernel@vger.kernel.org, Petkov, Borislav
On 05/16/2011 08:43 AM, Chuck Ebbert wrote:
> On Fri, 13 May 2011 17:19:23 +0200
> Hans Rosenfeld<hans.rosenfeld@amd.com> wrote:
>>>>> Could you send me the contents of MSRs 0xc0010140, 0xc0010141 and
>>>>> 0xc0010055?
>>>>
>>>> Knowing whether any C state above C1 is declared could be useful too.
>>>>
>>> rdmsr 0xc0010140 gives 2
>>
>> This means that E400 is known ...
>>
>>> rdmsr 0xc0010141 gives 0
>>
>> ... and no workaround is necessary ...
>>
>>> rdmsr 0xc0010055 gives 0
>>
>> ... because C1E is not enabled.
>>
>>> And ARAT is definitely set where it wasn't before these updates.
>>
>> I don't see how that could possibly make a difference if C1E is not even
>> enabled. This is all very strange.
>>
>
> Looking at commit e20a2d205c05cef6b5783df339a7d54adeb50962 ("x86, AMD: Fix
> APIC timer erratum 400 affecting K8 Rev.A-E processors") I see that it
> extended the E400 workaround to cover a whole range of processors that
> have never supported C1E. Isn't this just more of the same problem, only
> happening with processors that support C1E but have it disabled?
Erratum 400 covers not just C1E but also C3 and the latter is not
covered by OSVW so we may need to update cpu_has_amd_erratum().
Fortunately, only a few processors in family 10h support C3. I think, in
fact, it's only the part that you have (model 6 stepping 2) but we need
to confirm it. If you know of other FMSs please let us know.
As for expansion of ranges covering this erratum in that commit, it only
affected family fh.
>
> They are using C3 for idle states, I can confirm that now.
Thanks.
-boris
^ permalink raw reply [flat|nested] 7+ messages in thread
end of thread, other threads:[~2011-05-16 13:43 UTC | newest]
Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2011-05-12 23:59 [PATCH] cpu, AMD: Fix another bug in the new errata checking code Chuck Ebbert
2011-05-13 10:21 ` Hans Rosenfeld
2011-05-13 13:03 ` Boris Ostrovsky
2011-05-13 14:59 ` Chuck Ebbert
2011-05-13 15:19 ` Hans Rosenfeld
2011-05-16 12:43 ` Chuck Ebbert
2011-05-16 13:38 ` Boris Ostrovsky
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox