public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
From: Hans-Peter Jansen <hpj@urpla.net>
To: linux-kernel@vger.kernel.org
Subject: Re: AMD PCI Bridge: Hardware error from APEI
Date: Tue, 07 Jul 2020 08:56:41 +0200	[thread overview]
Message-ID: <10516708.ThPc60jCtT@xrated> (raw)
In-Reply-To: <2559180.cUrjzdZFCD@xrated>

Am Samstag, 27. Juni 2020, 20:23:35 CEST schrieben Sie:
> Dear hacker from the order of the penguins,
> 
> we're facing a disturbing issue here after swapping a motherboard of a
> mission critical system:
> 
> Jun 27 20:05:29 server kernel: {10}[Hardware Error]: Hardware error from
> APEI Generic Hardware Error Source: 4 Jun 27 20:05:29 server kernel:
> {10}[Hardware Error]: It has been corrected by h/w and requires no further
> action Jun 27 20:05:29 server kernel: {10}[Hardware Error]: event severity:
> corrected Jun 27 20:05:29 server kernel: {10}[Hardware Error]:  Error 0,
> type: corrected Jun 27 20:05:29 server kernel: {10}[Hardware Error]: 
> fru_text: PcieError Jun 27 20:05:29 server kernel: {10}[Hardware Error]:  
> section_type: PCIe error Jun 27 20:05:29 server kernel: {10}[Hardware
> Error]:   port_type: 4, root port Jun 27 20:05:29 server kernel:
> {10}[Hardware Error]:   version: 0.2 Jun 27 20:05:29 server kernel:
> {10}[Hardware Error]:   command: 0x0407, status: 0x0010 Jun 27 20:05:29
> server kernel: {10}[Hardware Error]:   device_id: 0000:60:03.1 Jun 27
> 20:05:29 server kernel: {10}[Hardware Error]:   slot: 19
> Jun 27 20:05:29 server kernel: {10}[Hardware Error]:   secondary_bus: 0x62
> Jun 27 20:05:29 server kernel: {10}[Hardware Error]:   vendor_id: 0x1022,
> device_id: 0x1453 Jun 27 20:05:29 server kernel: {10}[Hardware Error]:  
> class_code: 060400 Jun 27 20:05:29 server kernel: {10}[Hardware Error]:  
> bridge: secondary_status: 0x2000, control: 0x0012 Jun 27 20:05:29 server
> kernel: pcieport 0000:60:03.1: AER: aer_status: 0x00001000, aer_mask:
> 0x00006000 Jun 27 20:05:29 server kernel: pcieport 0000:60:03.1: AER:   
> [12] Timeout Jun 27 20:05:29 server kernel: pcieport 0000:60:03.1: AER:
> aer_layer=Data Link Layer, aer_agent=Transmitter ID
> 
> 
> Jun 27 20:05:51 server kernel: {11}[Hardware Error]: Hardware error from
> APEI Generic Hardware Error Source: 4 Jun 27 20:05:51 server kernel:
> {11}[Hardware Error]: It has been corrected by h/w and requires no further
> action Jun 27 20:05:51 server kernel: {11}[Hardware Error]: event severity:
> corrected Jun 27 20:05:51 server kernel: {11}[Hardware Error]:  Error 0,
> type: corrected Jun 27 20:05:51 server kernel: {11}[Hardware Error]: 
> fru_text: PcieError Jun 27 20:05:51 server kernel: {11}[Hardware Error]:  
> section_type: PCIe error Jun 27 20:05:51 server kernel: {11}[Hardware
> Error]:   port_type: 4, root port Jun 27 20:05:51 server kernel:
> {11}[Hardware Error]:   version: 0.2 Jun 27 20:05:51 server kernel:
> {11}[Hardware Error]:   command: 0x0407, status: 0x0010 Jun 27 20:05:51
> server kernel: {11}[Hardware Error]:   device_id: 0000:60:03.1 Jun 27
> 20:05:51 server kernel: {11}[Hardware Error]:   slot: 19
> Jun 27 20:05:51 server kernel: {11}[Hardware Error]:   secondary_bus: 0x62
> Jun 27 20:05:51 server kernel: {11}[Hardware Error]:   vendor_id: 0x1022,
> device_id: 0x1453 Jun 27 20:05:51 server kernel: {11}[Hardware Error]:  
> class_code: 060400 Jun 27 20:05:51 server kernel: {11}[Hardware Error]:  
> bridge: secondary_status: 0x2000, control: 0x0012 Jun 27 20:05:51 server
> kernel: pcieport 0000:60:03.1: AER: aer_status: 0x00001000, aer_mask:
> 0x00006000 Jun 27 20:05:51 server kernel: pcieport 0000:60:03.1: AER:   
> [12] Timeout Jun 27 20:05:51 server kernel: pcieport 0000:60:03.1: AER:
> aer_layer=Data Link Layer, aer_agent=Transmitter ID
> 
> 
> 60:03.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models
> 00h-0fh) PCIe GPP Bridge (prog-if 00 [Normal decode]) Flags: bus master,
> fast devsel, latency 0, IRQ 44, NUMA node 3 Bus: primary=60, secondary=62,
> subordinate=62, sec-latency=0 I/O behind bridge: None
>         Memory behind bridge: e3600000-e36fffff [size=1M]
>         Prefetchable memory behind bridge: None
>         Capabilities: [50] Power Management version 3
>         Capabilities: [58] Express Root Port (Slot+), MSI 00
>         Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+
>         Capabilities: [c0] Subsystem: Advanced Micro Devices, Inc. [AMD]
> Family 17h (Models 00h-0fh) PCIe GPP Bridge Capabilities: [c8]
> HyperTransport: MSI Mapping Enable+ Fixed+ Capabilities: [100] Vendor
> Specific Information: ID=0001 Rev=1 Len=010 <?> Capabilities: [150]
> Advanced Error Reporting
>         Capabilities: [270] #19
>         Capabilities: [2a0] Access Control Services
>         Capabilities: [370] L1 PM Substates
>         Capabilities: [380] Downstream Port Containment
>         Capabilities: [3c4] #23
>         Kernel driver in use: pcieport
> 
> It's probably related to a satellite receiver card, since it only appeared
> after plugging:
> 
> 62:00.0 Multimedia controller: Digital Devices GmbH Max
>         Subsystem: Digital Devices GmbH Max S8 4/8
>         Flags: bus master, fast devsel, latency 0, IRQ 161, NUMA node 3
>         Memory at e3600000 (64-bit, non-prefetchable) [size=64K]
>         Capabilities: [50] Power Management version 3
>         Capabilities: [70] MSI: Enable- Count=1/2 Maskable- 64bit+
>         Capabilities: [90] Express Endpoint, MSI 00
>         Capabilities: [100] Vendor Specific Information: ID=0000 Rev=0
> Len=00c <?> Kernel driver in use: ddbridge
>         Kernel modules: ddbridge
> 
> Specs:
> ASUS KNPA-U16 with an AMD EPYC 7261, 2x32 GB Kingston KSM26RD4/32MEI
> (officially supported RAM modules)
>
> openSUSE 15.1, Kernel 5.7.5

Not sure, how to proceed with this one?

After 9½ days uptime, it cumulated about 34,000 incidents:

2020-07-07T08:31:37.275535+02:00 tyrex kernel: [822744.211956] {34202}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
2020-07-07T08:31:37.275557+02:00 tyrex kernel: [822744.211969] {34202}[Hardware Error]: It has been corrected by h/w and requires no further action
2020-07-07T08:31:37.275559+02:00 tyrex kernel: [822744.211977] {34202}[Hardware Error]: event severity: corrected
2020-07-07T08:31:37.275567+02:00 tyrex kernel: [822744.211983] {34202}[Hardware Error]:  Error 0, type: corrected
2020-07-07T08:31:37.275569+02:00 tyrex kernel: [822744.211989] {34202}[Hardware Error]:  fru_text: PcieError
2020-07-07T08:31:37.275570+02:00 tyrex kernel: [822744.211995] {34202}[Hardware Error]:   section_type: PCIe error
2020-07-07T08:31:37.275571+02:00 tyrex kernel: [822744.212001] {34202}[Hardware Error]:   port_type: 4, root port
2020-07-07T08:31:37.275573+02:00 tyrex kernel: [822744.212006] {34202}[Hardware Error]:   version: 0.2
2020-07-07T08:31:37.275574+02:00 tyrex kernel: [822744.212012] {34202}[Hardware Error]:   command: 0x0407, status: 0x0010
2020-07-07T08:31:37.275615+02:00 tyrex kernel: [822744.212019] {34202}[Hardware Error]:   device_id: 0000:60:03.1
2020-07-07T08:31:37.275631+02:00 tyrex kernel: [822744.212024] {34202}[Hardware Error]:   slot: 19
2020-07-07T08:31:37.275634+02:00 tyrex kernel: [822744.212029] {34202}[Hardware Error]:   secondary_bus: 0x62
2020-07-07T08:31:37.275636+02:00 tyrex kernel: [822744.212035] {34202}[Hardware Error]:   vendor_id: 0x1022, device_id: 0x1453
2020-07-07T08:31:37.275637+02:00 tyrex kernel: [822744.212041] {34202}[Hardware Error]:   class_code: 060400
2020-07-07T08:31:37.275639+02:00 tyrex kernel: [822744.212046] {34202}[Hardware Error]:   bridge: secondary_status: 0x2000, control: 0x0012
2020-07-07T08:31:37.275661+02:00 tyrex kernel: [822744.212148] pcieport 0000:60:03.1: AER: aer_status: 0x00001000, aer_mask: 0x00006000
2020-07-07T08:31:37.275664+02:00 tyrex kernel: [822744.212158] pcieport 0000:60:03.1: AER:    [12] Timeout               
2020-07-07T08:31:37.275667+02:00 tyrex kernel: [822744.212167] pcieport 0000:60:03.1: AER: aer_layer=Data Link Layer, aer_agent=Transmitter ID
2020-07-07T08:31:43.167444+02:00 tyrex kernel: [822750.099749] pcieport 0000:60:03.1: AER: aer_status: 0x00001000, aer_mask: 0x00006000
2020-07-07T08:31:43.167474+02:00 tyrex kernel: [822750.099765] pcieport 0000:60:03.1: AER:    [12] Timeout               
2020-07-07T08:31:43.167476+02:00 tyrex kernel: [822750.099773] pcieport 0000:60:03.1: AER: aer_layer=Data Link Layer, aer_agent=Transmitter ID

Needless so say, that this is no permanent solution.

Any ideas anybody?

In fact, with 25 years of build and operation experience of Linux based 
systems and servers, I wouldn't have expected such amount of hassle with 
setting up commodity hardware in 2020. It almost feels as if diverging 
forces are dominating hardware, BIOS, firmware, and kernel developers 
more than ever before..

Certainly, one could argue that the decision for a low end AMD EPYC based 
system was a courageous one.

Pete



  reply	other threads:[~2020-07-07  6:56 UTC|newest]

Thread overview: 5+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-06-27 18:23 AMD PCI Bridge: Hardware error from APEI Hans-Peter Jansen
2020-07-07  6:56 ` Hans-Peter Jansen [this message]
2020-07-11 16:32   ` Hans-Peter Jansen
2020-07-15  8:11     ` Hans-Peter Jansen
2021-05-15 17:11       ` Hans-Peter Jansen

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=10516708.ThPc60jCtT@xrated \
    --to=hpj@urpla.net \
    --cc=linux-kernel@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox