* [patch v2] tg3: Disable tg3 PCIe AER on system reboot
@ 2024-11-29 20:36 Lenny Szubowicz
2024-12-02 7:00 ` Pavan Chebbi
` (3 more replies)
0 siblings, 4 replies; 9+ messages in thread
From: Lenny Szubowicz @ 2024-11-29 20:36 UTC (permalink / raw)
To: pavan.chebbi, mchan, andrew+netdev, davem, edumazet, kuba, pabeni,
george.shuklin, andrea.fois, netdev, linux-kernel
Disable PCIe AER on the tg3 device on system reboot on a limited
list of Dell PowerEdge systems. This prevents a fatal PCIe AER event
on the tg3 device during the ACPI _PTS (prepare to sleep) method for
S5 on those systems. The _PTS is invoked by acpi_enter_sleep_state_prep()
as part of the kernel's reboot sequence as a result of commit
38f34dba806a ("PM: ACPI: reboot: Reinstate S5 for reboot").

There was an earlier fix for this problem by commit 2ca1c94ce0b6
("tg3: Disable tg3 device on system reboot to avoid triggering AER").
But it was discovered that this earlier fix caused a reboot hang
when some Dell PowerEdge servers were booted via ipxe. To address
this reboot hang, the earlier fix was essentially reverted by commit
9fc3bc764334 ("tg3: power down device only on SYSTEM_POWER_OFF").
This re-exposed the tg3 PCIe AER on reboot problem.

This fix is not an ideal solution because the root cause of the AER
is in system firmware. Instead, it's a targeted work-around in the
tg3 driver.

Note also that the PCIe AER must be disabled on the tg3 device even
if the system is configured to use "firmware first" error handling.

Fixes: 9fc3bc764334 ("tg3: power down device only on SYSTEM_POWER_OFF")
Signed-off-by: Lenny Szubowicz <lszubowi@redhat.com>
---
drivers/net/ethernet/broadcom/tg3.c | 59 +++++++++++++++++++++++++++++
1 file changed, 59 insertions(+)
diff --git a/drivers/net/ethernet/broadcom/tg3.c b/drivers/net/ethernet/broadcom/tg3.c
index 9cc8db10a8d6..12ae5a976ca7 100644
--- a/drivers/net/ethernet/broadcom/tg3.c
+++ b/drivers/net/ethernet/broadcom/tg3.c
@@ -55,6 +55,7 @@
 #include <linux/hwmon.h>
 #include <linux/hwmon-sysfs.h>
 #include <linux/crc32poly.h>
+#include <linux/dmi.h>
 
 #include <net/checksum.h>
 #include <net/gso.h>
@@ -18192,6 +18193,51 @@ static int tg3_resume(struct device *device)
 
 static SIMPLE_DEV_PM_OPS(tg3_pm_ops, tg3_suspend, tg3_resume);
 
+/*
+ * Systems where ACPI _PTS (Prepare To Sleep) S5 will result in a fatal
+ * PCIe AER event on the tg3 device if the tg3 device is not, or cannot
+ * be, powered down.
+ */
+static const struct dmi_system_id tg3_restart_aer_quirk_table[] = {
+	{
+		.matches = {
+			DMI_MATCH(DMI_SYS_VENDOR, "Dell Inc."),
+			DMI_MATCH(DMI_PRODUCT_NAME, "PowerEdge R440"),
+		},
+	},
+	{
+		.matches = {
+			DMI_MATCH(DMI_SYS_VENDOR, "Dell Inc."),
+			DMI_MATCH(DMI_PRODUCT_NAME, "PowerEdge R540"),
+		},
+	},
+	{
+		.matches = {
+			DMI_MATCH(DMI_SYS_VENDOR, "Dell Inc."),
+			DMI_MATCH(DMI_PRODUCT_NAME, "PowerEdge R640"),
+		},
+	},
+	{
+		.matches = {
+			DMI_MATCH(DMI_SYS_VENDOR, "Dell Inc."),
+			DMI_MATCH(DMI_PRODUCT_NAME, "PowerEdge R650"),
+		},
+	},
+	{
+		.matches = {
+			DMI_MATCH(DMI_SYS_VENDOR, "Dell Inc."),
+			DMI_MATCH(DMI_PRODUCT_NAME, "PowerEdge R740"),
+		},
+	},
+	{
+		.matches = {
+			DMI_MATCH(DMI_SYS_VENDOR, "Dell Inc."),
+			DMI_MATCH(DMI_PRODUCT_NAME, "PowerEdge R750"),
+		},
+	},
+	{}
+};
+
 static void tg3_shutdown(struct pci_dev *pdev)
 {
 	struct net_device *dev = pci_get_drvdata(pdev);
@@ -18208,6 +18254,19 @@ static void tg3_shutdown(struct pci_dev *pdev)
 
 	if (system_state == SYSTEM_POWER_OFF)
 		tg3_power_down(tp);
+	else if (system_state == SYSTEM_RESTART &&
+		 dmi_first_match(tg3_restart_aer_quirk_table) &&
+		 pdev->current_state <= PCI_D3hot) {
+		/*
+		 * Disable PCIe AER on the tg3 to avoid a fatal
+		 * error during this system restart.
+		 */
+		pcie_capability_clear_word(pdev, PCI_EXP_DEVCTL,
+					   PCI_EXP_DEVCTL_CERE |
+					   PCI_EXP_DEVCTL_NFERE |
+					   PCI_EXP_DEVCTL_FERE |
+					   PCI_EXP_DEVCTL_URRE);
+	}
 
 	rtnl_unlock();
 
--
2.45.2
* Re: [patch v2] tg3: Disable tg3 PCIe AER on system reboot
2024-11-29 20:36 [patch v2] tg3: Disable tg3 PCIe AER on system reboot Lenny Szubowicz
@ 2024-12-02 7:00 ` Pavan Chebbi
2025-01-30 19:40 ` Lenny Szubowicz
2024-12-02 22:57 ` Jakub Kicinski
` (2 subsequent siblings)
3 siblings, 1 reply; 9+ messages in thread
From: Pavan Chebbi @ 2024-12-02 7:00 UTC (permalink / raw)
To: Lenny Szubowicz
Cc: mchan, andrew+netdev, davem, edumazet, kuba, pabeni,
george.shuklin, andrea.fois, netdev, linux-kernel
On Sat, Nov 30, 2024 at 2:06 AM Lenny Szubowicz <lszubowi@redhat.com> wrote:
>
> Disable PCIe AER on the tg3 device on system reboot on a limited
> list of Dell PowerEdge systems. This prevents a fatal PCIe AER event
> on the tg3 device during the ACPI _PTS (prepare to sleep) method for
> S5 on those systems. The _PTS is invoked by acpi_enter_sleep_state_prep()
> as part of the kernel's reboot sequence as a result of commit
> 38f34dba806a ("PM: ACPI: reboot: Reinstate S5 for reboot").
>
> There was an earlier fix for this problem by commit 2ca1c94ce0b6
> ("tg3: Disable tg3 device on system reboot to avoid triggering AER").

Are you saying that if we have tg3_power_down() done then the current
new issue won't be seen?

> But it was discovered that this earlier fix caused a reboot hang
> when some Dell PowerEdge servers were booted via ipxe. To address
> this reboot hang, the earlier fix was essentially reverted by commit
> 9fc3bc764334 ("tg3: power down device only on SYSTEM_POWER_OFF").
> This re-exposed the tg3 PCIe AER on reboot problem.
>
> This fix is not an ideal solution because the root cause of the AER
> is in system firmware. Instead, it's a targeted work-around in the
> tg3 driver.
>
> Note also that the PCIe AER must be disabled on the tg3 device even
> if the system is configured to use "firmware first" error handling.

Not too sure about this. The list has some widely used latest Dell
servers. The first fix only did pci_disable_device().
But looks like this fix should be the right one for the first ever
reported issue in commit 2ca1c94ce0b6?

Also you may want to address the warnings generated. Also note that
netdev requires you to wait 24 hours before posting a new revision of
the patch.

>
> Fixes: 9fc3bc764334 ("tg3: power down device only on SYSTEM_POWER_OFF")
> Signed-off-by: Lenny Szubowicz <lszubowi@redhat.com>
* Re: [patch v2] tg3: Disable tg3 PCIe AER on system reboot
2024-11-29 20:36 [patch v2] tg3: Disable tg3 PCIe AER on system reboot Lenny Szubowicz
2024-12-02 7:00 ` Pavan Chebbi
@ 2024-12-02 22:57 ` Jakub Kicinski
2025-01-30 21:57 ` [PATCH net v3] " Lenny Szubowicz
2025-01-31 12:56 ` [patch v2] " Przemek Kitszel
3 siblings, 0 replies; 9+ messages in thread
From: Jakub Kicinski @ 2024-12-02 22:57 UTC (permalink / raw)
To: Lenny Szubowicz
Cc: pavan.chebbi, mchan, andrew+netdev, davem, edumazet, pabeni,
george.shuklin, andrea.fois, netdev, linux-kernel
On Fri, 29 Nov 2024 15:36:40 -0500 Lenny Szubowicz wrote:
> Disable PCIe AER on the tg3 device on system reboot on a limited
> list of Dell PowerEdge systems. This prevents a fatal PCIe AER event
> on the tg3 device during the ACPI _PTS (prepare to sleep) method for
> S5 on those systems. The _PTS is invoked by acpi_enter_sleep_state_prep()
> as part of the kernel's reboot sequence as a result of commit
> 38f34dba806a ("PM: ACPI: reboot: Reinstate S5 for reboot").
>
> There was an earlier fix for this problem by commit 2ca1c94ce0b6
> ("tg3: Disable tg3 device on system reboot to avoid triggering AER").
> But it was discovered that this earlier fix caused a reboot hang
> when some Dell PowerEdge servers were booted via ipxe. To address
> this reboot hang, the earlier fix was essentially reverted by commit
> 9fc3bc764334 ("tg3: power down device only on SYSTEM_POWER_OFF").
> This re-exposed the tg3 PCIe AER on reboot problem.
>
> This fix is not an ideal solution because the root cause of the AER
> is in system firmware. Instead, it's a targeted work-around in the
> tg3 driver.
>
> Note also that the PCIe AER must be disabled on the tg3 device even
> if the system is configured to use "firmware first" error handling.
sparse (make C=1) complains:
drivers/net/ethernet/broadcom/tg3.c:18259:22: warning: restricted pci_power_t degrades to integer
--
pw-bot: cr
* Re: [patch v2] tg3: Disable tg3 PCIe AER on system reboot
2024-12-02 7:00 ` Pavan Chebbi
@ 2025-01-30 19:40 ` Lenny Szubowicz
0 siblings, 0 replies; 9+ messages in thread
From: Lenny Szubowicz @ 2025-01-30 19:40 UTC (permalink / raw)
To: Pavan Chebbi
Cc: mchan, andrew+netdev, davem, edumazet, kuba, pabeni,
george.shuklin, andrea.fois, netdev, linux-kernel
On 12/2/24 2:00 AM, Pavan Chebbi wrote:
> On Sat, Nov 30, 2024 at 2:06 AM Lenny Szubowicz <lszubowi@redhat.com> wrote:
>>
>> Disable PCIe AER on the tg3 device on system reboot on a limited
>> list of Dell PowerEdge systems. This prevents a fatal PCIe AER event
>> on the tg3 device during the ACPI _PTS (prepare to sleep) method for
>> S5 on those systems. The _PTS is invoked by acpi_enter_sleep_state_prep()
>> as part of the kernel's reboot sequence as a result of commit
>> 38f34dba806a ("PM: ACPI: reboot: Reinstate S5 for reboot").
>>
>> There was an earlier fix for this problem by commit 2ca1c94ce0b6
>> ("tg3: Disable tg3 device on system reboot to avoid triggering AER").
>
> Are you saying that if we have tg3_power_down() done then the current
> new issue won't be seen?

First, thank you for your review and sorry for not responding sooner.
I did not see your comments until now because I had some unhelpful
email filters in place.

Yes, if tg3_power_down() is called on a restart, then the PCIe AER does
not occur. The other calls added by upstream commit 2ca1c94ce0b6 ("tg3:
Disable tg3 device on system reboot to avoid triggering AER"), i.e.
tg3_reset_task_cancel() and pci_disable_device(), are not
sufficient to prevent the AER.

>
>> But it was discovered that this earlier fix caused a reboot hang
>> when some Dell PowerEdge servers were booted via ipxe. To address
>> this reboot hang, the earlier fix was essentially reverted by commit
>> 9fc3bc764334 ("tg3: power down device only on SYSTEM_POWER_OFF").
>> This re-exposed the tg3 PCIe AER on reboot problem.
>>
>> This fix is not an ideal solution because the root cause of the AER
>> is in system firmware. Instead, it's a targeted work-around in the
>> tg3 driver.
>>
>> Note also that the PCIe AER must be disabled on the tg3 device even
>> if the system is configured to use "firmware first" error handling.
>
> Not too sure about this. The list has some widely used latest Dell
> servers. The first fix only did pci_disable_device().
> But looks like this fix should be the right one for the first ever
> reported issue in commit 2ca1c94ce0b6?

I have reproduced this problem on an example system for every entry in
the DMI match table tg3_restart_aer_quirk_table that is in the patch.
I don't have access to all models of Dell PowerEdge servers, so that
list might not be comprehensive. I was not able to reproduce the
problem on newer PowerEdge servers that I tested.

I reproduced the problem this morning with the latest upstream
kernel on a Dell PowerEdge R640 with the latest BIOS system firmware.

root@dell-per640-02:~# uname -a
Linux dell-per640-02.khw.eng.rdu2.dc.redhat.com 6.13.0.lss001+ #23 SMP PREEMPT_DYNAMIC Thu Jan 30 08:56:44 EST 2025 x86_64 GNU/Linux
root@dell-per640-02:~# lspci -s 01:00.1
01:00.1 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 Gigabit Ethernet PCIe
root@dell-per640-02:~# dmidecode
...
System Information
Manufacturer: Dell Inc.
Product Name: PowerEdge R640
...
BIOS Information
Vendor: Dell Inc.
Version: 2.22.2
Release Date: 09/12/2024
root@dell-per640-02:~# reboot
...
[ 68.497693] ACPI: PM: Preparing to enter system sleep state S5
[ 71.100607] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 5
[ 71.100610] {1}[Hardware Error]: event severity: fatal
[ 71.100611] {1}[Hardware Error]: Error 0, type: fatal
[ 71.100613] {1}[Hardware Error]: section_type: PCIe error
[ 71.100614] {1}[Hardware Error]: port_type: 0, PCIe end point
[ 71.100616] {1}[Hardware Error]: version: 3.0
[ 71.100618] {1}[Hardware Error]: command: 0x0002, status: 0x0010
[ 71.100619] {1}[Hardware Error]: device_id: 0000:01:00.1
[ 71.100621] {1}[Hardware Error]: slot: 0
[ 71.100622] {1}[Hardware Error]: secondary_bus: 0x00
[ 71.100623] {1}[Hardware Error]: vendor_id: 0x14e4, device_id: 0x165f
[ 71.100625] {1}[Hardware Error]: class_code: 020000
[ 71.100626] {1}[Hardware Error]: aer_cor_status: 0x00002000, aer_cor_mask: 0x000031c0
[ 71.100627] {1}[Hardware Error]: aer_uncor_status: 0x00100000, aer_uncor_mask: 0x00010000
[ 71.100629] {1}[Hardware Error]: aer_uncor_severity: 0x000ef030
[ 71.100630] {1}[Hardware Error]: TLP Header: 40000001 0000030f 90028090 00000000
[ 71.100632] GHES: Fatal hardware error but panic disabled
[ 71.100633] Kernel panic - not syncing: GHES: Fatal hardware error
[ 71.100635] CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Not tainted 6.13.0.lss001+ #23
[ 71.100638] Hardware name: Dell Inc. PowerEdge R640/0W23H8, BIOS 2.22.2 09/12/2024
[ 71.100640] Call Trace:
[ 71.100642] <NMI>
[ 71.100643] dump_stack_lvl+0x4e/0x70
[ 71.100649] panic+0x113/0x2dd
[ 71.100654] __ghes_panic.cold+0x28/0x28
[ 71.100657] ghes_in_nmi_queue_one_entry.constprop.0+0x23f/0x2c0
[ 71.100662] ghes_notify_nmi+0x5d/0xd0
[ 71.100665] nmi_handle+0x5e/0x120
[ 71.100668] default_do_nmi+0x40/0x130
[ 71.100673] exc_nmi+0x103/0x180
[ 71.100676] end_repeat_nmi+0xf/0x53
[ 71.100681] RIP: 0010:intel_idle+0x59/0xa0
...
>
> Also you may want to address the warnings generated. Also note that
> netdev requires you to wait 24 hours before posting a new revision of
> the patch.

Yes, I'll have a fix for the sparse warnings in V3 of the patch.
That was a sloppy error on my part.

>
>>
>> Fixes: 9fc3bc764334 ("tg3: power down device only on SYSTEM_POWER_OFF")
>> Signed-off-by: Lenny Szubowicz <lszubowi@redhat.com>
* [PATCH net v3] tg3: Disable tg3 PCIe AER on system reboot
2024-11-29 20:36 [patch v2] tg3: Disable tg3 PCIe AER on system reboot Lenny Szubowicz
2024-12-02 7:00 ` Pavan Chebbi
2024-12-02 22:57 ` Jakub Kicinski
@ 2025-01-30 21:57 ` Lenny Szubowicz
2025-01-31 5:08 ` Pavan Chebbi
` (2 more replies)
2025-01-31 12:56 ` [patch v2] " Przemek Kitszel
3 siblings, 3 replies; 9+ messages in thread
From: Lenny Szubowicz @ 2025-01-30 21:57 UTC (permalink / raw)
To: pavan.chebbi, mchan, andrew+netdev, davem, edumazet, kuba, pabeni,
george.shuklin, andrea.fois, netdev, linux-kernel
Disable PCIe AER on the tg3 device on system reboot on a limited
list of Dell PowerEdge systems. This prevents a fatal PCIe AER event
on the tg3 device during the ACPI _PTS (prepare to sleep) method for
S5 on those systems. The _PTS is invoked by acpi_enter_sleep_state_prep()
as part of the kernel's reboot sequence as a result of commit
38f34dba806a ("PM: ACPI: reboot: Reinstate S5 for reboot").

There was an earlier fix for this problem by commit 2ca1c94ce0b6
("tg3: Disable tg3 device on system reboot to avoid triggering AER").
But it was discovered that this earlier fix caused a reboot hang
when some Dell PowerEdge servers were booted via ipxe. To address
this reboot hang, the earlier fix was essentially reverted by commit
9fc3bc764334 ("tg3: power down device only on SYSTEM_POWER_OFF").
This re-exposed the tg3 PCIe AER on reboot problem.

This fix is not an ideal solution because the root cause of the AER
is in system firmware. Instead, it's a targeted work-around in the
tg3 driver.

Note also that the PCIe AER must be disabled on the tg3 device even
if the system is configured to use "firmware first" error handling.

V3:
   - Fix sparse warning on improper comparison of pdev->current_state
   - Adhere to netdev comment style

Fixes: 9fc3bc764334 ("tg3: power down device only on SYSTEM_POWER_OFF")
Signed-off-by: Lenny Szubowicz <lszubowi@redhat.com>
---
drivers/net/ethernet/broadcom/tg3.c | 58 +++++++++++++++++++++++++++++
1 file changed, 58 insertions(+)
diff --git a/drivers/net/ethernet/broadcom/tg3.c b/drivers/net/ethernet/broadcom/tg3.c
index 9cc8db10a8d6..5ba22fe0995f 100644
--- a/drivers/net/ethernet/broadcom/tg3.c
+++ b/drivers/net/ethernet/broadcom/tg3.c
@@ -55,6 +55,7 @@
 #include <linux/hwmon.h>
 #include <linux/hwmon-sysfs.h>
 #include <linux/crc32poly.h>
+#include <linux/dmi.h>
 
 #include <net/checksum.h>
 #include <net/gso.h>
@@ -18192,6 +18193,50 @@ static int tg3_resume(struct device *device)
 
 static SIMPLE_DEV_PM_OPS(tg3_pm_ops, tg3_suspend, tg3_resume);
 
+/* Systems where ACPI _PTS (Prepare To Sleep) S5 will result in a fatal
+ * PCIe AER event on the tg3 device if the tg3 device is not, or cannot
+ * be, powered down.
+ */
+static const struct dmi_system_id tg3_restart_aer_quirk_table[] = {
+	{
+		.matches = {
+			DMI_MATCH(DMI_SYS_VENDOR, "Dell Inc."),
+			DMI_MATCH(DMI_PRODUCT_NAME, "PowerEdge R440"),
+		},
+	},
+	{
+		.matches = {
+			DMI_MATCH(DMI_SYS_VENDOR, "Dell Inc."),
+			DMI_MATCH(DMI_PRODUCT_NAME, "PowerEdge R540"),
+		},
+	},
+	{
+		.matches = {
+			DMI_MATCH(DMI_SYS_VENDOR, "Dell Inc."),
+			DMI_MATCH(DMI_PRODUCT_NAME, "PowerEdge R640"),
+		},
+	},
+	{
+		.matches = {
+			DMI_MATCH(DMI_SYS_VENDOR, "Dell Inc."),
+			DMI_MATCH(DMI_PRODUCT_NAME, "PowerEdge R650"),
+		},
+	},
+	{
+		.matches = {
+			DMI_MATCH(DMI_SYS_VENDOR, "Dell Inc."),
+			DMI_MATCH(DMI_PRODUCT_NAME, "PowerEdge R740"),
+		},
+	},
+	{
+		.matches = {
+			DMI_MATCH(DMI_SYS_VENDOR, "Dell Inc."),
+			DMI_MATCH(DMI_PRODUCT_NAME, "PowerEdge R750"),
+		},
+	},
+	{}
+};
+
 static void tg3_shutdown(struct pci_dev *pdev)
 {
 	struct net_device *dev = pci_get_drvdata(pdev);
@@ -18208,6 +18253,19 @@ static void tg3_shutdown(struct pci_dev *pdev)
 
 	if (system_state == SYSTEM_POWER_OFF)
 		tg3_power_down(tp);
+	else if (system_state == SYSTEM_RESTART &&
+		 dmi_first_match(tg3_restart_aer_quirk_table) &&
+		 pdev->current_state != PCI_D3cold &&
+		 pdev->current_state != PCI_UNKNOWN) {
+		/* Disable PCIe AER on the tg3 to avoid a fatal
+		 * error during this system restart.
+		 */
+		pcie_capability_clear_word(pdev, PCI_EXP_DEVCTL,
+					   PCI_EXP_DEVCTL_CERE |
+					   PCI_EXP_DEVCTL_NFERE |
+					   PCI_EXP_DEVCTL_FERE |
+					   PCI_EXP_DEVCTL_URRE);
+	}
 
 	rtnl_unlock();
 
--
2.47.1
* Re: [PATCH net v3] tg3: Disable tg3 PCIe AER on system reboot
2025-01-30 21:57 ` [PATCH net v3] " Lenny Szubowicz
@ 2025-01-31 5:08 ` Pavan Chebbi
2025-01-31 9:42 ` Simon Horman
2025-02-03 10:20 ` patchwork-bot+netdevbpf
2 siblings, 0 replies; 9+ messages in thread
From: Pavan Chebbi @ 2025-01-31 5:08 UTC (permalink / raw)
To: Lenny Szubowicz
Cc: mchan, andrew+netdev, davem, edumazet, kuba, pabeni,
george.shuklin, andrea.fois, netdev, linux-kernel
On Fri, Jan 31, 2025 at 3:28 AM Lenny Szubowicz <lszubowi@redhat.com> wrote:
>
> Disable PCIe AER on the tg3 device on system reboot on a limited
> list of Dell PowerEdge systems. This prevents a fatal PCIe AER event
> on the tg3 device during the ACPI _PTS (prepare to sleep) method for
> S5 on those systems. The _PTS is invoked by acpi_enter_sleep_state_prep()
> as part of the kernel's reboot sequence as a result of commit
> 38f34dba806a ("PM: ACPI: reboot: Reinstate S5 for reboot").
>
> There was an earlier fix for this problem by commit 2ca1c94ce0b6
> ("tg3: Disable tg3 device on system reboot to avoid triggering AER").
> But it was discovered that this earlier fix caused a reboot hang
> when some Dell PowerEdge servers were booted via ipxe. To address
> this reboot hang, the earlier fix was essentially reverted by commit
> 9fc3bc764334 ("tg3: power down device only on SYSTEM_POWER_OFF").
> This re-exposed the tg3 PCIe AER on reboot problem.
>
> This fix is not an ideal solution because the root cause of the AER
> is in system firmware. Instead, it's a targeted work-around in the
> tg3 driver.
>
> Note also that the PCIe AER must be disabled on the tg3 device even
> if the system is configured to use "firmware first" error handling.
>
> V3:
> - Fix sparse warning on improper comparison of pdev->current_state
> - Adhere to netdev comment style
>
> Fixes: 9fc3bc764334 ("tg3: power down device only on SYSTEM_POWER_OFF")
> Signed-off-by: Lenny Szubowicz <lszubowi@redhat.com>
> ---
Thanks Lenny. LGTM.
Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
* Re: [PATCH net v3] tg3: Disable tg3 PCIe AER on system reboot
2025-01-30 21:57 ` [PATCH net v3] " Lenny Szubowicz
2025-01-31 5:08 ` Pavan Chebbi
@ 2025-01-31 9:42 ` Simon Horman
2025-02-03 10:20 ` patchwork-bot+netdevbpf
2 siblings, 0 replies; 9+ messages in thread
From: Simon Horman @ 2025-01-31 9:42 UTC (permalink / raw)
To: Lenny Szubowicz
Cc: pavan.chebbi, mchan, andrew+netdev, davem, edumazet, kuba, pabeni,
george.shuklin, andrea.fois, netdev, linux-kernel
On Thu, Jan 30, 2025 at 04:57:54PM -0500, Lenny Szubowicz wrote:
> Disable PCIe AER on the tg3 device on system reboot on a limited
> list of Dell PowerEdge systems. This prevents a fatal PCIe AER event
> on the tg3 device during the ACPI _PTS (prepare to sleep) method for
> S5 on those systems. The _PTS is invoked by acpi_enter_sleep_state_prep()
> as part of the kernel's reboot sequence as a result of commit
> 38f34dba806a ("PM: ACPI: reboot: Reinstate S5 for reboot").
>
> There was an earlier fix for this problem by commit 2ca1c94ce0b6
> ("tg3: Disable tg3 device on system reboot to avoid triggering AER").
> But it was discovered that this earlier fix caused a reboot hang
> when some Dell PowerEdge servers were booted via ipxe. To address
> this reboot hang, the earlier fix was essentially reverted by commit
> 9fc3bc764334 ("tg3: power down device only on SYSTEM_POWER_OFF").
> This re-exposed the tg3 PCIe AER on reboot problem.
>
> This fix is not an ideal solution because the root cause of the AER
> is in system firmware. Instead, it's a targeted work-around in the
> tg3 driver.
>
> Note also that the PCIe AER must be disabled on the tg3 device even
> if the system is configured to use "firmware first" error handling.
>
> V3:
> - Fix sparse warning on improper comparison of pdev->current_state
> - Adhere to netdev comment style
>
> Fixes: 9fc3bc764334 ("tg3: power down device only on SYSTEM_POWER_OFF")
> Signed-off-by: Lenny Szubowicz <lszubowi@redhat.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Hi Lenny,
For future reference, please post new versions of patches to netdev
in new email threads.
Ref: https://docs.kernel.org/process/maintainer-netdev.html#resending-after-review
* Re: [patch v2] tg3: Disable tg3 PCIe AER on system reboot
2024-11-29 20:36 [patch v2] tg3: Disable tg3 PCIe AER on system reboot Lenny Szubowicz
` (2 preceding siblings ...)
2025-01-30 21:57 ` [PATCH net v3] " Lenny Szubowicz
@ 2025-01-31 12:56 ` Przemek Kitszel
3 siblings, 0 replies; 9+ messages in thread
From: Przemek Kitszel @ 2025-01-31 12:56 UTC (permalink / raw)
To: Lenny Szubowicz, pavan.chebbi, mchan, andrew+netdev, davem,
edumazet, kuba, pabeni, george.shuklin, andrea.fois
Cc: netdev, linux-kernel, Yue Zhao, chunguang.xu, haifeng.xu,
Dawid Osuchowski
On 11/29/24 21:36, Lenny Szubowicz wrote:
> Disable PCIe AER on the tg3 device on system reboot on a limited
> list of Dell PowerEdge systems. This prevents a fatal PCIe AER event
> on the tg3 device during the ACPI _PTS (prepare to sleep) method for
> S5 on those systems. The _PTS is invoked by acpi_enter_sleep_state_prep()
> as part of the kernel's reboot sequence as a result of commit
> 38f34dba806a ("PM: ACPI: reboot: Reinstate S5 for reboot").
>
> There was an earlier fix for this problem by commit 2ca1c94ce0b6
> ("tg3: Disable tg3 device on system reboot to avoid triggering AER").
> But it was discovered that this earlier fix caused a reboot hang
> when some Dell PowerEdge servers were booted via ipxe. To address
> this reboot hang, the earlier fix was essentially reverted by commit
> 9fc3bc764334 ("tg3: power down device only on SYSTEM_POWER_OFF").
> This re-exposed the tg3 PCIe AER on reboot problem.
>
> This fix is not an ideal solution because the root cause of the AER
> is in system firmware. Instead, it's a targeted work-around in the
> tg3 driver.
>
> Note also that the PCIe AER must be disabled on the tg3 device even
> if the system is configured to use "firmware first" error handling.
>
> Fixes: 9fc3bc764334 ("tg3: power down device only on SYSTEM_POWER_OFF")
> Signed-off-by: Lenny Szubowicz <lszubowi@redhat.com>

The bug also occurs with Intel drivers; we even had nearly the same fix
proposed:
https://lore.kernel.org/netdev/20241227035459.90602-1-yue.zhao@shopee.com/T/

I believe that such a fix should be centralized instead of being repeated
for each driver, especially since the list of platforms is likely to be
extended in the future.

It's sad that we don't have Dell CCed here. I'm trying to get some
relevant contacts, but without success so far.
* Re: [PATCH net v3] tg3: Disable tg3 PCIe AER on system reboot
2025-01-30 21:57 ` [PATCH net v3] " Lenny Szubowicz
2025-01-31 5:08 ` Pavan Chebbi
2025-01-31 9:42 ` Simon Horman
@ 2025-02-03 10:20 ` patchwork-bot+netdevbpf
2 siblings, 0 replies; 9+ messages in thread
From: patchwork-bot+netdevbpf @ 2025-02-03 10:20 UTC (permalink / raw)
To: Lenny Szubowicz
Cc: pavan.chebbi, mchan, andrew+netdev, davem, edumazet, kuba, pabeni,
george.shuklin, andrea.fois, netdev, linux-kernel
Hello:
This patch was applied to netdev/net.git (main)
by David S. Miller <davem@davemloft.net>:
On Thu, 30 Jan 2025 16:57:54 -0500 you wrote:
> Disable PCIe AER on the tg3 device on system reboot on a limited
> list of Dell PowerEdge systems. This prevents a fatal PCIe AER event
> on the tg3 device during the ACPI _PTS (prepare to sleep) method for
> S5 on those systems. The _PTS is invoked by acpi_enter_sleep_state_prep()
> as part of the kernel's reboot sequence as a result of commit
> 38f34dba806a ("PM: ACPI: reboot: Reinstate S5 for reboot").
>
> [...]
Here is the summary with links:
- [net,v3] tg3: Disable tg3 PCIe AER on system reboot
https://git.kernel.org/netdev/net/c/e0efe83ed325
You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html