From: "Noam D. Eliyahu" <noam.d.eliyahu@gmail.com>
To: pavan.chebbi@broadcom.com
Cc: mchan@broadcom.com, netdev@vger.kernel.org
Subject: Re: [DISCUSS] tg3 reboot handling on Dell T440 (BCM5720)
Date: Wed, 24 Dec 2025 18:53:01 +0200 [thread overview]
Message-ID: <20251224165301.2794-1-noam.d.eliyahu@gmail.com> (raw)
In-Reply-To: <CALs4sv0EYR=bMSW6pF6W=W_mZHhQBpkeg=ugwTtpBc7_FyPDug@mail.gmail.com>
Thanks for the quick reply!
On Mon, Dec 22, 2025 at 11:23 AM Pavan Chebbi <pavan.chebbi@broadcom.com> wrote:
>
> On Sun, Dec 21, 2025 at 11:20 PM Noam D. Eliyahu
> <noam.d.eliyahu@gmail.com> wrote:
> >
> > ### Relevant driver evolution
> >
> > * **9931c9d04f4d – tg3: power down device only on SYSTEM_POWER_OFF**
>
> I think this commit id is wrong. Anyway, I know the commit.
My apologies, I likely copied a local hash; the upstream commit ID is 9fc3bc764334.
> This is going to be a problem, please follow the discussion here:
> https://lore.kernel.org/netdev/CALs4sv1-6mgQ2JfF9MYiRADxumJD7m7OGWhCB5aWj1tGP0OPJg@mail.gmail.com/
> where regression risk is flagged and it came true in
> https://lore.kernel.org/netdev/CALs4sv2_JZd5K-ZgBkjL=QpXVEXnoJrjuqwwKg0+jo2-4taHJw@mail.gmail.com/
Thank you for the links.
I understand the need to prevent regressions, especially in an area where it happened before.
That said, I still think the design of the first and second fixes is problematic and needs adjusting.
The original bug (Fixed in: 9fc3bc764334) was triggered by SNP initialization on specific models (R650xs with BCM5720).
The fix, the conditional tg3_power_down call, *was applied globally regardless of models*.
The second bug I mentioned (Fixed in: e0efe83ed3252) was triggered mainly due to the conditional tg3_power_down call.
Look again at the changes made in the commit referenced by e0efe83ed3252 (2ca1c94ce0b6 as the original fix):
```
+ tg3_reset_task_cancel(tp);
+
rtnl_lock();
+
netif_device_detach(dev);
if (netif_running(dev))
dev_close(dev);
- if (system_state == SYSTEM_POWER_OFF)
- tg3_power_down(tp);
+ tg3_power_down(tp); /* NOTE: the conditional system state based tg3_power_down call was problematic */
rtnl_unlock();
+
+ pci_disable_device(pdev);
}
```
The changes in 2ca1c94ce0b6 caused the regression which later led to the AER disablement in e0efe83ed3252.
The problem is that it was decided to apply the change to a specific set of models, even though it originated from 9fc3bc764334 which was applied globally.
If we apply the conditional tg3_power_down specifically for the R650xs, we can guarantee no regression, as the logic for the models with the first bug stays the same, just now limited to their set of models.
By applying the conditional tg3_power_down this way, we won't need the AER disablement at all.
> >
> > 2. **Flip the conditioning**
> > Keep the DMI list, but use it to guard the conditional tg3_power_down instead (only for models where the original issue was observed, e.g. R650xs). Drop the AER handling entirely. This limits risk to known systems while simplifying the flow.
> But I am not sure how systems affected in the commit e0efe83ed3252 will react. Can't tell 100pc without testing.
Regarding your concern about systems affected in e0efe83ed3252: my hardware (Dell PowerEdge T440) is one of the models affected in e0efe83ed3252 but not listed in the initial DMI match list.
I tested all of the solutions I suggested, others have reported the same regarding my first suggestion (to remove both the conditional tg3_power_down and the AER disablement) online, and most importantly, the e0efe83ed3252 commit itself references the original fix (2ca1c94ce0b6) which didn't include a specific set of models and was considered a viable fix.
If we restrict the system_state check (from 9fc3bc764334) to a DMI table for the R650xs, all other systems would revert to the 'unconditional' tg3_power_down which was the standard for years. This would naturally prevent the AER errors (as I and others had seen on our machines) without needing to touch the AER registers at all.
For reference:
- https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=e0efe83ed3252
- https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=2ca1c94ce0b6
- https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=9fc3bc764334
Best regards,
Noam D. Eliyahu
prev parent reply other threads:[~2025-12-24 16:53 UTC|newest]
Thread overview: 3+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-12-21 17:50 [DISCUSS] tg3 reboot handling on Dell T440 (BCM5720) Noam D. Eliyahu
2025-12-22 5:53 ` Pavan Chebbi
2025-12-24 16:53 ` Noam D. Eliyahu [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20251224165301.2794-1-noam.d.eliyahu@gmail.com \
--to=noam.d.eliyahu@gmail.com \
--cc=mchan@broadcom.com \
--cc=netdev@vger.kernel.org \
--cc=pavan.chebbi@broadcom.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).