Re: [RFC PATCH v2 3/4] acpi: apei: Do not panic() when correctable errors are marked as fatal.


From: "Alex G." <mr.nuke.me@gmail.com>
To: Borislav Petkov <bp@alien8.de>
Cc: linux-acpi@vger.kernel.org, linux-edac@vger.kernel.org,
	rjw@rjwysocki.net, lenb@kernel.org, tony.luck@intel.com,
	tbaicar@codeaurora.org, will.deacon@arm.com, james.morse@arm.com,
	shiju.jose@huawei.com, zjzhang@codeaurora.org,
	gengdongjiu@huawei.com, linux-kernel@vger.kernel.org,
	alex_gagniuc@dellteam.com, austin_bolen@dell.com,
	shyam_iyer@dell.com, devel@acpica.org, mchehab@kernel.org,
	robert.moore@intel.com, erik.schmauss@intel.com,
	Yazen Ghannam <yazen.ghannam@amd.com>,
	Ard Biesheuvel <ard.biesheuvel@linaro.org>
Subject: Re: [RFC PATCH v2 3/4] acpi: apei: Do not panic() when correctable errors are marked as fatal.
Date: Mon, 23 Apr 2018 23:19:25 -0500
Message-ID: <70c43399-e8e5-5061-b5a5-451deb5f02fa@gmail.com>
In-Reply-To: <20180422104849.GA32754@pd.tnic>



On 04/22/2018 05:48 AM, Borislav Petkov wrote:
> On Thu, Apr 19, 2018 at 05:55:08PM -0500, Alex G. wrote:
>>> How does such an error look like, in detail?
>>
>> It's green on the soft side, with lots of red accents, as well as some
>> textured white shades:
>>
>> [   51.414616] pciehp 0000:b0:06.0:pcie204: Slot(176): Link Down
>> [   51.414634] pciehp 0000:b0:05.0:pcie204: Slot(179): Link Down
>> [   52.703343] FIRMWARE BUG: Firmware sent fatal error that we were able
>> to correct
>> [   52.703345] BROKEN FIRMWARE: Complain to your hardware vendor
>> [   52.703347] {1}[Hardware Error]: Hardware error from APEI Generic
>> Hardware Error Source: 1
>> [   52.703358] pciehp 0000:b0:06.0:pcie204: Slot(176): Link Up
>> [   52.711616] {1}[Hardware Error]: event severity: fatal
>> [   52.716754] {1}[Hardware Error]:  Error 0, type: fatal
>> [   52.721891] {1}[Hardware Error]:   section_type: PCIe error
>> [   52.727463] {1}[Hardware Error]:   port_type: 6, downstream switch port
>> [   52.734075] {1}[Hardware Error]:   version: 3.0
>> [   52.738607] {1}[Hardware Error]:   command: 0x0407, status: 0x0010
>> [   52.744786] {1}[Hardware Error]:   device_id: 0000:b0:06.0
>> [   52.750271] {1}[Hardware Error]:   slot: 4
>> [   52.754371] {1}[Hardware Error]:   secondary_bus: 0xb3
>> [   52.759509] {1}[Hardware Error]:   vendor_id: 0x10b5, device_id: 0x9733
>> [   52.766123] {1}[Hardware Error]:   class_code: 000406
>> [   52.771182] {1}[Hardware Error]:   bridge: secondary_status: 0x0000,
>> control: 0x0003
>> [   52.779038] pcieport 0000:b0:06.0: aer_status: 0x00100000, aer_mask:
>> 0x01a10000
>> [   52.782303] nvme0n1: detected capacity change from 3200631791616 to 0
>> [   52.786348] pcieport 0000:b0:06.0:    [20] Unsupported Request
>> [   52.786349] pcieport 0000:b0:06.0: aer_layer=Transaction Layer,
>> aer_agent=Requester ID
>> [   52.786350] pcieport 0000:b0:06.0: aer_uncor_severity: 0x004eb030
>> [   52.786352] pcieport 0000:b0:06.0:   TLP Header: 40000001 0000020f
>> e12023bc 01000000
>> [   52.786357] pcieport 0000:b0:06.0: broadcast error_detected message
>> [   52.883895] pci 0000:b3:00.0: device has no driver
>> [   52.883976] pciehp 0000:b0:06.0:pcie204: Slot(176): Link Down
>> [   52.884184] pciehp 0000:b0:06.0:pcie204: Slot(176): Link Down event
>> queued; currently getting powered on
>> [   52.967175] pciehp 0000:b0:06.0:pcie204: Slot(176): Link Up
> 
> Btw, from another discussion we're having with Yazen:
> 
> @Yazen, do you see how this error record is worth shit?
> 
>   class_code: 000406
>   command: 0x0407, status: 0x0010
>   bridge: secondary_status: 0x0000, control: 0x0003
>   aer_status: 0x00100000, aer_mask: 0x01a10000
>   aer_uncor_severity: 0x004eb030

That tells you what FFS said about the error. Keep in mind that FFS has 
cleared the hardware error bits, which the AER handler would normally 
read from the PCI device.

> those above are only some of the fields which are purely useless
> undecoded. Makes me wonder what's worse for the user: dump the
> half-decoded error or not dump an error at all...

It's immediately obvious if there's a glaring FFS bug and if we get 
bogus data. If you distrust firmware as much as I do, then you will find 
great value in having such info in the logs. It's probably not too 
useful to a casual user, but then neither is a majority of the system log.
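
To illustrate how much can be read out of the raw fields, here is a 
minimal user-space sketch that decodes the three AER values quoted 
above. The helper name decode_uncor and the lookup table are 
illustrative assumptions and not code from this thread or the kernel; 
the bit positions follow the Uncorrectable Error Status layout of the 
PCIe AER capability (the same positions as the PCI_ERR_UNC_* constants 
in pci_regs.h).

```python
# Uncorrectable error bits of the PCIe AER capability. Bit positions match
# the PCI_ERR_UNC_* definitions; the table itself is an illustrative helper.
AER_UNCOR_BITS = {
    4: "Data Link Protocol",
    5: "Surprise Down",
    12: "Poisoned TLP",
    13: "Flow Control Protocol",
    14: "Completion Timeout",
    15: "Completer Abort",
    16: "Unexpected Completion",
    17: "Receiver Overflow",
    18: "Malformed TLP",
    19: "ECRC",
    20: "Unsupported Request",
    21: "ACS Violation",
    22: "Uncorrectable Internal Error",
    23: "MC Blocked TLP",
    24: "AtomicOp Egress Blocked",
    25: "TLP Prefix Blocked",
}


def decode_uncor(status, mask, severity):
    """Return (bit, name, masked, fatal) for every known bit set in status."""
    decoded = []
    for bit, name in sorted(AER_UNCOR_BITS.items()):
        flag = 1 << bit
        if status & flag:
            decoded.append((bit, name, bool(mask & flag), bool(severity & flag)))
    return decoded


if __name__ == "__main__":
    # Values copied from the error record quoted in this message.
    for bit, name, masked, fatal in decode_uncor(0x00100000, 0x01A10000, 0x004EB030):
        sev = "fatal" if fatal else "non-fatal"
        print(f"[{bit}] {name}: masked={masked}, AER severity={sev}")
```

If the values in the record can be trusted, the only bit set in 
aer_status is bit 20 (Unsupported Request), it is not masked, and 
aer_uncor_severity does not mark it as fatal, even though the record 
itself is reported with "event severity: fatal".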

> Anyway, Alex, I see this in the logs:
> 
> [   66.581121] pciehp 0000:b0:06.0:pcie204: Slot(176): Link Down
> [   66.591939] pciehp 0000:b0:05.0:pcie204: Slot(179): Card not present
> [   66.592102] pciehp 0000:b0:06.0:pcie204: Slot(176): Card not present
> 
> and that comes from that pciehp_isr() interrupt handler AFAICT.
> 
> So there *is* a way to know that the card is not present anymore. So,
> theoretically, and ignoring the code layering for now, we can connect
> that error to the card not present event and then ignore the error...

You're missing the timing and assuming you will get the hotplug 
interrupt. In this example, you have 22ms between the link down and 
presence detect state change. This is a fairly fast removal.

Hotplug dependencies aside (you can have the kernel run without PCIe 
hotplug support), I don't think you want to just linger in NMI for 
dozens of milliseconds waiting for presence detect confirmation.

For enterprise SFF NVMe drives, the data lanes will disconnect before 
the presence detect. FFS relies on presence detect, and these are two of 
the reasons why slow removal is such a problem. You might not get a 
presence detect interrupt at all.

Presence detect is optional for PCIe. PD is such a reliable heuristic 
that it guarantees worse error handling than the crackmonkey firmware. I 
don't see how it might be useful in a way which gives us better handling 
than firmware.

> Hmmm.

Hmmm

Anyway, heuristics about PCIe error recovery belong in the recovery 
handler. I don't think it's smart to apply policy before we get there.

Alex

