public inbox for linux-kernel@vger.kernel.org
* Error reports at boot time in Ampere Altra machines since c733ebb7c
@ 2023-03-02 20:17 Aristeu Rozanski
  2023-03-02 23:25 ` Marc Zyngier
  0 siblings, 1 reply; 9+ messages in thread
From: Aristeu Rozanski @ 2023-03-02 20:17 UTC (permalink / raw)
  To: Marc Zyngier; +Cc: linux-kernel

Hi Marc,

Since c733ebb7cb67d ("irqchip/gic-v3-its: Reset each ITS's BASERn
register before probe"), Ampere Altra machines are reporting corrected
errors during boot:

	[    0.294334] HEST: Table parsing has been initialized.
	[    0.294397] sdei: SDEIv1.0 (0x0) detected in firmware.
	[    0.299622] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0
	[    0.299626] {1}[Hardware Error]: event severity: recoverable
	[    0.299629] {1}[Hardware Error]:  Error 0, type: recoverable
	[    0.299633] {1}[Hardware Error]:   section type: unknown, e8ed898d-df16-43cc-8ecc-54f060ef157f
	[    0.299638] {1}[Hardware Error]:   section length: 0x30
	[    0.299645] {1}[Hardware Error]:   00000000: 00000005 ec30000e 00080110 80001001  ......0.........
	[    0.299648] {1}[Hardware Error]:   00000010: 00000300 00000000 00000000 00000000  ................
	[    0.299650] {1}[Hardware Error]:   00000020: 00000000 00000000 00000000 00000000  ................
	[    0.299714] {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 3
	[    0.299716] {2}[Hardware Error]: event severity: recoverable
	[    0.299717] {2}[Hardware Error]:  Error 0, type: recoverable
	[    0.299718] {2}[Hardware Error]:   section type: unknown, e8ed898d-df16-43cc-8ecc-54f060ef157f
	[    0.299720] {2}[Hardware Error]:   section length: 0x30
	[    0.299722] {2}[Hardware Error]:   00000000: 40000005 ec30000e 00080110 80005001  ...@..0......P..
	[    0.299724] {2}[Hardware Error]:   00000010: 00000300 00000000 00000000 00000000  ................
	[    0.299726] {2}[Hardware Error]:   00000020: 00000000 00000000 00000000 00000000  ................
	[    0.299912] GHES: APEI firmware first mode is enabled by APEI bit.

Because the errors are reported later in boot, it's hard to
pinpoint exactly what's causing them without decoding the error information,
which I currently don't know how to do.
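
For anyone who wants to inspect the raw section, here is a minimal Python sketch that turns the hex-dump rows of the first record above back into words and bytes. It assumes the rows are printed as little-endian 32-bit words (the ASCII column is consistent with that). `ROW` and `parse_rows` are names made up for this illustration. The layout of this vendor-specific section is not described in this thread, so the sketch does not try to name any field.

```python
import re
import struct

# One hex-dump row: an 8-digit offset, a colon, then four 32-bit words.
ROW = re.compile(r"\b([0-9a-f]{8}):((?: [0-9a-f]{8}){4})")

def parse_rows(lines):
    """Collect the 32-bit words from every hex-dump row found in `lines`."""
    words = []
    for line in lines:
        match = ROW.search(line)
        if match:
            words.extend(int(w, 16) for w in match.group(2).split())
    return words

# Rows copied from record {1} in the log above.
record_1 = [
    "[    0.299638] {1}[Hardware Error]:   section length: 0x30",
    "[    0.299645] {1}[Hardware Error]:   00000000: 00000005 ec30000e 00080110 80001001  ......0.........",
    "[    0.299648] {1}[Hardware Error]:   00000010: 00000300 00000000 00000000 00000000  ................",
    "[    0.299650] {1}[Hardware Error]:   00000020: 00000000 00000000 00000000 00000000  ................",
]

words = parse_rows(record_1)
# Assumption: words were dumped little-endian, so repack them that way.
payload = struct.pack("<%dI" % len(words), *words)

print(len(payload))            # 48, i.e. the reported section length 0x30
print([hex(w) for w in words[:5]])
```

The recovered payload is 0x30 bytes long, which matches the "section length" line, so the rows above are the whole section.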

There are no problems other than, of course, tests being tripped by
the warnings.

Do you know what's going on here?

Thanks

-- 
Aristeu



* Re: Error reports at boot time in Ampere Altra machines since c733ebb7c
  2023-03-02 20:17 Error reports at boot time in Ampere Altra machines since c733ebb7c Aristeu Rozanski
@ 2023-03-02 23:25 ` Marc Zyngier
  2023-03-03  3:04   ` Aristeu Rozanski
  2023-03-03 19:38   ` Darren Hart
  0 siblings, 2 replies; 9+ messages in thread
From: Marc Zyngier @ 2023-03-02 23:25 UTC (permalink / raw)
  To: Aristeu Rozanski, Darren Hart; +Cc: linux-kernel

On Thu, 02 Mar 2023 20:17:32 +0000,
Aristeu Rozanski <aris@redhat.com> wrote:
> 
> Hi Marc,
> 
> Since c733ebb7cb67d ("irqchip/gic-v3-its: Reset each ITS's BASERn
> register before probe"), Ampere Altra machines are reporting corrected
> errors during boot:
> 
> 	[    0.294334] HEST: Table parsing has been initialized.
> 	[    0.294397] sdei: SDEIv1.0 (0x0) detected in firmware.
> 	[    0.299622] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0
> 	[    0.299626] {1}[Hardware Error]: event severity: recoverable
> 	[    0.299629] {1}[Hardware Error]:  Error 0, type: recoverable
> 	[    0.299633] {1}[Hardware Error]:   section type: unknown, e8ed898d-df16-43cc-8ecc-54f060ef157f
> 	[    0.299638] {1}[Hardware Error]:   section length: 0x30
> 	[    0.299645] {1}[Hardware Error]:   00000000: 00000005 ec30000e 00080110 80001001  ......0.........
> 	[    0.299648] {1}[Hardware Error]:   00000010: 00000300 00000000 00000000 00000000  ................
> 	[    0.299650] {1}[Hardware Error]:   00000020: 00000000 00000000 00000000 00000000  ................
> 	[    0.299714] {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 3
> 	[    0.299716] {2}[Hardware Error]: event severity: recoverable
> 	[    0.299717] {2}[Hardware Error]:  Error 0, type: recoverable
> 	[    0.299718] {2}[Hardware Error]:   section type: unknown, e8ed898d-df16-43cc-8ecc-54f060ef157f
> 	[    0.299720] {2}[Hardware Error]:   section length: 0x30
> 	[    0.299722] {2}[Hardware Error]:   00000000: 40000005 ec30000e 00080110 80005001  ...@..0......P..
> 	[    0.299724] {2}[Hardware Error]:   00000010: 00000300 00000000 00000000 00000000  ................
> 	[    0.299726] {2}[Hardware Error]:   00000020: 00000000 00000000 00000000 00000000  ................
> 	[    0.299912] GHES: APEI firmware first mode is enabled by APEI bit.
> 
> Because the errors are reported later in boot, it's hard to
> pinpoint exactly what's causing them without decoding the error information,
> which I currently don't know how to do.

+ Darren

Hopefully someone at Ampere can decode this and tell us what is happening.

> There are no problems other than, of course, tests being tripped by
> the warnings.

It says "Hardware Error". In my book, that's pretty bad. Do you see
this on more than a single machine?

> Do you know what's going on here?

No idea. I haven't seen this on the Altra I have access to so far.

It could be related to firmware and/or things like power management,
but again, someone needs to help us with the error report above.

Thanks,

	M.

-- 
Without deviation from the norm, progress is not possible.


* Re: Error reports at boot time in Ampere Altra machines since c733ebb7c
  2023-03-02 23:25 ` Marc Zyngier
@ 2023-03-03  3:04   ` Aristeu Rozanski
  2023-03-03 19:38   ` Darren Hart
  1 sibling, 0 replies; 9+ messages in thread
From: Aristeu Rozanski @ 2023-03-03  3:04 UTC (permalink / raw)
  To: Marc Zyngier; +Cc: Darren Hart, linux-kernel

On Thu, Mar 02, 2023 at 11:25:37PM +0000, Marc Zyngier wrote:
> It says "Hardware Error". In my book, that's pretty bad. Do you see
> this on more than a single machine?

At least two of these machines in our lab. While looking up
e8ed898d-df16-43cc-8ecc-54f060ef157f I found another report but this one
with a lot more of the same errors:

https://github.com/nodejs/build/issues/2894#issuecomment-1129229236

-- 
Aristeu



* Re: Error reports at boot time in Ampere Altra machines since c733ebb7c
  2023-03-02 23:25 ` Marc Zyngier
  2023-03-03  3:04   ` Aristeu Rozanski
@ 2023-03-03 19:38   ` Darren Hart
  2023-03-03 20:10     ` Marc Zyngier
  1 sibling, 1 reply; 9+ messages in thread
From: Darren Hart @ 2023-03-03 19:38 UTC (permalink / raw)
  To: Marc Zyngier; +Cc: Aristeu Rozanski, linux-kernel, D. Scott Phillips

On Thu, Mar 02, 2023 at 11:25:37PM +0000, Marc Zyngier wrote:
> On Thu, 02 Mar 2023 20:17:32 +0000,
> Aristeu Rozanski <aris@redhat.com> wrote:
> > 
> > Hi Marc,
> > 
> > Since c733ebb7cb67d ("irqchip/gic-v3-its: Reset each ITS's BASERn
> > register before probe"), Ampere Altra machines are reporting corrected
> > errors during boot:
> > 
> > 	[    0.294334] HEST: Table parsing has been initialized.
> > 	[    0.294397] sdei: SDEIv1.0 (0x0) detected in firmware.
> > 	[    0.299622] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0
> > 	[    0.299626] {1}[Hardware Error]: event severity: recoverable
> > 	[    0.299629] {1}[Hardware Error]:  Error 0, type: recoverable
> > 	[    0.299633] {1}[Hardware Error]:   section type: unknown, e8ed898d-df16-43cc-8ecc-54f060ef157f
> > 	[    0.299638] {1}[Hardware Error]:   section length: 0x30
> > 	[    0.299645] {1}[Hardware Error]:   00000000: 00000005 ec30000e 00080110 80001001  ......0.........
> > 	[    0.299648] {1}[Hardware Error]:   00000010: 00000300 00000000 00000000 00000000  ................
> > 	[    0.299650] {1}[Hardware Error]:   00000020: 00000000 00000000 00000000 00000000  ................
> > 	[    0.299714] {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 3
> > 	[    0.299716] {2}[Hardware Error]: event severity: recoverable
> > 	[    0.299717] {2}[Hardware Error]:  Error 0, type: recoverable
> > 	[    0.299718] {2}[Hardware Error]:   section type: unknown, e8ed898d-df16-43cc-8ecc-54f060ef157f
> > 	[    0.299720] {2}[Hardware Error]:   section length: 0x30
> > 	[    0.299722] {2}[Hardware Error]:   00000000: 40000005 ec30000e 00080110 80005001  ...@..0......P..
> > 	[    0.299724] {2}[Hardware Error]:   00000010: 00000300 00000000 00000000 00000000  ................
> > 	[    0.299726] {2}[Hardware Error]:   00000020: 00000000 00000000 00000000 00000000  ................
> > 	[    0.299912] GHES: APEI firmware first mode is enabled by APEI bit.
> > 
> > Because the errors are reported later in boot, it's hard to
> > pinpoint exactly what's causing them without decoding the error information,
> > which I currently don't know how to do.
> 
> + Darren
> 
> Hopefully someone at Ampere can decode this and tell us what is happening.

Hi Marc,

+ D Scott

Thanks for the connection.

This is reporting that something attempted to access GITS2_BASER2, the base
register for the gicv4 vcpu table. Altra doesn't support gicv4. Is c733ebb7c
assuming GITS_BASER2 should be accessible on gicv3?



-- 
Darren Hart
Ampere Computing / OS and Kernel


* Re: Error reports at boot time in Ampere Altra machines since c733ebb7c
  2023-03-03 19:38   ` Darren Hart
@ 2023-03-03 20:10     ` Marc Zyngier
  2023-03-03 20:23       ` Darren Hart
  0 siblings, 1 reply; 9+ messages in thread
From: Marc Zyngier @ 2023-03-03 20:10 UTC (permalink / raw)
  To: Darren Hart; +Cc: Aristeu Rozanski, linux-kernel, D. Scott Phillips

On Fri, 03 Mar 2023 19:38:40 +0000,
Darren Hart <darren@os.amperecomputing.com> wrote:
> 
> On Thu, Mar 02, 2023 at 11:25:37PM +0000, Marc Zyngier wrote:
> > On Thu, 02 Mar 2023 20:17:32 +0000,
> > Aristeu Rozanski <aris@redhat.com> wrote:
> > > 
> > > Hi Marc,
> > > 
> > > Since c733ebb7cb67d ("irqchip/gic-v3-its: Reset each ITS's BASERn
> > > register before probe"), Ampere Altra machines are reporting corrected
> > > errors during boot:
> > > 
> > > 	[    0.294334] HEST: Table parsing has been initialized.
> > > 	[    0.294397] sdei: SDEIv1.0 (0x0) detected in firmware.
> > > 	[    0.299622] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0
> > > 	[    0.299626] {1}[Hardware Error]: event severity: recoverable
> > > 	[    0.299629] {1}[Hardware Error]:  Error 0, type: recoverable
> > > 	[    0.299633] {1}[Hardware Error]:   section type: unknown, e8ed898d-df16-43cc-8ecc-54f060ef157f
> > > 	[    0.299638] {1}[Hardware Error]:   section length: 0x30
> > > 	[    0.299645] {1}[Hardware Error]:   00000000: 00000005 ec30000e 00080110 80001001  ......0.........
> > > 	[    0.299648] {1}[Hardware Error]:   00000010: 00000300 00000000 00000000 00000000  ................
> > > 	[    0.299650] {1}[Hardware Error]:   00000020: 00000000 00000000 00000000 00000000  ................
> > > 	[    0.299714] {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 3
> > > 	[    0.299716] {2}[Hardware Error]: event severity: recoverable
> > > 	[    0.299717] {2}[Hardware Error]:  Error 0, type: recoverable
> > > 	[    0.299718] {2}[Hardware Error]:   section type: unknown, e8ed898d-df16-43cc-8ecc-54f060ef157f
> > > 	[    0.299720] {2}[Hardware Error]:   section length: 0x30
> > > 	[    0.299722] {2}[Hardware Error]:   00000000: 40000005 ec30000e 00080110 80005001  ...@..0......P..
> > > 	[    0.299724] {2}[Hardware Error]:   00000010: 00000300 00000000 00000000 00000000  ................
> > > 	[    0.299726] {2}[Hardware Error]:   00000020: 00000000 00000000 00000000 00000000  ................
> > > 	[    0.299912] GHES: APEI firmware first mode is enabled by APEI bit.
> > > 
> > > Because the errors are reported later in boot, it's hard to
> > > pinpoint exactly what's causing them without decoding the error information,
> > > which I currently don't know how to do.
> > 
> > + Darren
> > 
> > Hopefully someone at Ampere can decode this and tell us what is happening.
> 
> Hi Marc,
> 
> + D Scott
> 
> Thanks for the connection.
> 
> This is reporting that something attempted to access GITS2_BASER2, the base
> register for the gicv4 vcpu table. Altra doesn't support gicv4. Is c733ebb7c
> assuming GITS_BASER2 should be accessible on gicv3?

All the GITS_BASERn registers should be RES0 if not implemented, as
per the spec (12.19.1 GITS_BASER<n>, ITS Translation Table
Descriptors, n = 0 - 7)

<quote>
A maximum of 8 GITS_BASER<n> registers can be provided. Unimplemented
registers are RES 0.
</quote>

Returning an error on access is thus definitely a violation of the
spec.
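
To make the register layout concrete, here is a small Python model of the behaviour the spec quote describes. The offsets (GITS_BASER0 at 0x0100 in the ITS control frame, eight 64-bit registers) follow the GICv3 architecture. `ModelITS` and `reset_basers` are hypothetical names for this sketch, not the kernel's code. The model treats unimplemented registers as RES0: writes are ignored and reads return zero.

```python
GITS_BASER = 0x0100        # offset of GITS_BASER0 in the ITS control frame
GITS_BASER_NR_REGS = 8     # GITS_BASER<n>, n = 0 - 7

def baser_offset(n):
    """Offset of GITS_BASER<n>; each register is 64 bits wide."""
    if not 0 <= n < GITS_BASER_NR_REGS:
        raise ValueError("GITS_BASER index out of range: %r" % n)
    return GITS_BASER + (n << 3)

class ModelITS:
    """Toy ITS (illustration only): unimplemented BASERs behave as RES0."""

    def __init__(self, implemented):
        # Only implemented registers have backing storage.
        self.regs = {baser_offset(n): 0 for n in implemented}

    def write(self, offset, value):
        if offset in self.regs:
            self.regs[offset] = value
        # Otherwise RES0: the write is silently ignored, no error is raised.

    def read(self, offset):
        # RES0 registers read as zero.
        return self.regs.get(offset, 0)

def reset_basers(its):
    """Write zero to all eight BASER slots, implemented or not."""
    for n in range(GITS_BASER_NR_REGS):
        its.write(baser_offset(n), 0)

its = ModelITS(implemented=[0, 1])      # assumed: only BASER0/1 exist
its.write(baser_offset(0), 0xDEAD)
reset_basers(its)
print(hex(baser_offset(2)))             # 0x110
print(its.read(baser_offset(0)), its.read(baser_offset(2)))
```

On such a model, zeroing all eight registers is harmless even when only some of them are implemented. GITS_BASER2 sits at offset 0x110.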

So either the GIC implementation you are using is buggy, or you have
some sort of HW firewalling between the CPU and the GIC that is
trigger happy. My hunch is that this is the latter, as buggy
implementations tend to return an SError when missing this sort of
detail.

Thanks,

	M.

-- 
Without deviation from the norm, progress is not possible.


* Re: Error reports at boot time in Ampere Altra machines since c733ebb7c
  2023-03-03 20:10     ` Marc Zyngier
@ 2023-03-03 20:23       ` Darren Hart
  2023-04-03 16:26         ` Aristeu Rozanski
  0 siblings, 1 reply; 9+ messages in thread
From: Darren Hart @ 2023-03-03 20:23 UTC (permalink / raw)
  To: Marc Zyngier; +Cc: Aristeu Rozanski, linux-kernel, D. Scott Phillips

On Fri, Mar 03, 2023 at 08:10:17PM +0000, Marc Zyngier wrote:
> On Fri, 03 Mar 2023 19:38:40 +0000,
> Darren Hart <darren@os.amperecomputing.com> wrote:
> > 
> > On Thu, Mar 02, 2023 at 11:25:37PM +0000, Marc Zyngier wrote:
> > > On Thu, 02 Mar 2023 20:17:32 +0000,
> > > Aristeu Rozanski <aris@redhat.com> wrote:
> > > > 
> > > > Hi Marc,
> > > > 
> > > > Since c733ebb7cb67d ("irqchip/gic-v3-its: Reset each ITS's BASERn
> > > > register before probe"), Ampere Altra machines are reporting corrected
> > > > errors during boot:
> > > > 
> > > > 	[    0.294334] HEST: Table parsing has been initialized.
> > > > 	[    0.294397] sdei: SDEIv1.0 (0x0) detected in firmware.
> > > > 	[    0.299622] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0
> > > > 	[    0.299626] {1}[Hardware Error]: event severity: recoverable
> > > > 	[    0.299629] {1}[Hardware Error]:  Error 0, type: recoverable
> > > > 	[    0.299633] {1}[Hardware Error]:   section type: unknown, e8ed898d-df16-43cc-8ecc-54f060ef157f
> > > > 	[    0.299638] {1}[Hardware Error]:   section length: 0x30
> > > > 	[    0.299645] {1}[Hardware Error]:   00000000: 00000005 ec30000e 00080110 80001001  ......0.........
> > > > 	[    0.299648] {1}[Hardware Error]:   00000010: 00000300 00000000 00000000 00000000  ................
> > > > 	[    0.299650] {1}[Hardware Error]:   00000020: 00000000 00000000 00000000 00000000  ................
> > > > 	[    0.299714] {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 3
> > > > 	[    0.299716] {2}[Hardware Error]: event severity: recoverable
> > > > 	[    0.299717] {2}[Hardware Error]:  Error 0, type: recoverable
> > > > 	[    0.299718] {2}[Hardware Error]:   section type: unknown, e8ed898d-df16-43cc-8ecc-54f060ef157f
> > > > 	[    0.299720] {2}[Hardware Error]:   section length: 0x30
> > > > 	[    0.299722] {2}[Hardware Error]:   00000000: 40000005 ec30000e 00080110 80005001  ...@..0......P..
> > > > 	[    0.299724] {2}[Hardware Error]:   00000010: 00000300 00000000 00000000 00000000  ................
> > > > 	[    0.299726] {2}[Hardware Error]:   00000020: 00000000 00000000 00000000 00000000  ................
> > > > 	[    0.299912] GHES: APEI firmware first mode is enabled by APEI bit.
> > > > 
> > > > Because the errors are reported later in boot, it's hard to
> > > > pinpoint exactly what's causing them without decoding the error information,
> > > > which I currently don't know how to do.
> > > 
> > > + Darren
> > > 
> > > Hopefully someone at Ampere can decode this and tell us what is happening.
> > 
> > Hi Marc,
> > 
> > + D Scott
> > 
> > Thanks for the connection.
> > 
> > This is reporting that something attempted to access GITS2_BASER2, the base
> > register for the gicv4 vcpu table. Altra doesn't support gicv4. Is c733ebb7c
> > assuming GITS_BASER2 should be accessible on gicv3?
> 
> All the GITS_BASERn registers should be RES0 if not implemented, as
> per the spec (12.19.1 GITS_BASER<n>, ITS Translation Table
> Descriptors, n = 0 - 7)
> 
> <quote>
> A maximum of 8 GITS_BASER<n> registers can be provided. Unimplemented
> registers are RES 0.
> </quote>
> 
> Returning an error on access is thus definitely a violation of the
> spec.
> 
> So either the GIC implementation you are using is buggy, or you have
> some sort of HW firewalling between the CPU and the GIC that is
> trigger happy. My hunch is that this is the latter, as buggy
> implementations tend to return an SError when missing this sort of
> detail.

Thanks for the detail Marc, let me see what I can learn and will follow up.

-- 
Darren Hart
Ampere Computing / OS and Kernel


* Re: Error reports at boot time in Ampere Altra machines since c733ebb7c
  2023-03-03 20:23       ` Darren Hart
@ 2023-04-03 16:26         ` Aristeu Rozanski
  2023-04-03 23:50           ` Darren Hart
  0 siblings, 1 reply; 9+ messages in thread
From: Aristeu Rozanski @ 2023-04-03 16:26 UTC (permalink / raw)
  To: Darren Hart; +Cc: Marc Zyngier, linux-kernel, D. Scott Phillips

Hi Darren,

On Fri, Mar 03, 2023 at 12:23:29PM -0800, Darren Hart wrote:
> Thanks for the detail Marc, let me see what I can learn and will follow up.

any updates on this?

Thanks

-- 
Aristeu



* Re: Error reports at boot time in Ampere Altra machines since c733ebb7c
  2023-04-03 16:26         ` Aristeu Rozanski
@ 2023-04-03 23:50           ` Darren Hart
  2023-04-04 12:57             ` Aristeu Rozanski
  0 siblings, 1 reply; 9+ messages in thread
From: Darren Hart @ 2023-04-03 23:50 UTC (permalink / raw)
  To: Aristeu Rozanski; +Cc: Marc Zyngier, linux-kernel, D. Scott Phillips

On Mon, Apr 03, 2023 at 12:26:20PM -0400, Aristeu Rozanski wrote:
> Hi Darren,
> 
> On Fri, Mar 03, 2023 at 12:23:29PM -0800, Darren Hart wrote:
> > Thanks for the detail Marc, let me see what I can learn and will follow up.
> 
> any updates on this?

Hi Aristeu,

Thanks for the patience. Yes. These error messages are benign, as the error is
managed by hardware and has no adverse effects on software other than the
severe-looking messages. A firmware fix will address the issue so the messages
are no longer printed.

Thanks,

-- 
Darren Hart
Ampere Computing / OS and Kernel


* Re: Error reports at boot time in Ampere Altra machines since c733ebb7c
  2023-04-03 23:50           ` Darren Hart
@ 2023-04-04 12:57             ` Aristeu Rozanski
  0 siblings, 0 replies; 9+ messages in thread
From: Aristeu Rozanski @ 2023-04-04 12:57 UTC (permalink / raw)
  To: Darren Hart; +Cc: Marc Zyngier, linux-kernel, D. Scott Phillips

Hi Darren,

On Mon, Apr 03, 2023 at 04:50:10PM -0700, Darren Hart wrote:
> Thanks for the patience. Yes. These error messages are benign, as the error is
> managed by hardware and has no adverse effects on software other than the
> severe-looking messages. A firmware fix will address the issue so the messages
> are no longer printed.

thanks!

-- 
Aristeu

