* e1000e: Report link down after "Detected Hardware Unit Hang" ?
From: Helge Deller @ 2026-06-14 21:48 UTC
To: Tony Nguyen, Przemek Kitszel, intel-wired-lan, netdev

I'm regularly facing the known "eno1: Detected Hardware Unit Hang:"
with my on-board Intel e1000e NIC hardware.
Since none of the various tips on the internet helped, I had the idea
to set up master/slave bond networking to fail over to another NIC when
the Intel chip hangs.

Sadly this doesn't work as intended, because the link of the Intel NIC
isn't reported "down", so the failover never happens unless I manually
run "ifconfig eno1 down".

My question: Shouldn't the Intel NIC ideally report Link Down if we know
it hangs? That way a failover should at least happen, right?

Below is a completely untested patch.
Does it make sense that I try to test and/or develop such a patch, or
are there things I am missing?

Helge

diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c
index 7ce0cc8ab8f4..c6edcf4ac032 100644
--- a/drivers/net/ethernet/intel/e1000e/netdev.c
+++ b/drivers/net/ethernet/intel/e1000e/netdev.c
@@ -1157,6 +1157,10 @@ static void e1000_print_hw_hang(struct work_struct *work)
 
 	e1000e_dump(adapter);
 
+	/* The NIC hangs. Force link down in e1000e_has_link() such that a
+	 * failover can happen */
+	hw->phy.media_type = e1000_media_type_unknown;
+
 	/* Suggest workaround for known h/w issue */
 	if ((hw->mac.type == e1000_pchlan) && (er32(CTRL) & E1000_CTRL_TFCE))
 		e_err("Try turning off Tx pause (flow control) via ethtool\n");

* Re: e1000e: Report link down after "Detected Hardware Unit Hang" ?
From: Andrew Lunn @ 2026-06-15 16:41 UTC
To: Helge Deller
Cc: Tony Nguyen, Przemek Kitszel, intel-wired-lan, netdev

On Sun, Jun 14, 2026 at 11:48:08PM +0200, Helge Deller wrote:
> I'm regularly facing the known "eno1: Detected Hardware Unit Hang:"
> with my on-board Intel e1000e NIC hardware.
> Since none of the various tips on the internet helped, I had the idea
> to set up master/slave bond networking to fail over to another NIC when
> the Intel chip hangs.
>
> Sadly this doesn't work as intended, because the link of the Intel NIC
> isn't reported "down", so the failover never happens unless I manually
> run "ifconfig eno1 down".
>
> My question: Shouldn't the Intel NIC ideally report Link Down if we know
> it hangs? That way a failover should at least happen, right?
>
> Below is a completely untested patch.
> Does it make sense that I try to test and/or develop such a patch, or
> are there things I am missing?

If the interface is dead, then setting the carrier down makes a lot of
sense.

One question I have is, what do you need to do to recover the
hardware? Will it correctly set the carrier up when you do the
recovery?

Also, just looking at your proposed change, it is not clear to me why
such an assignment will result in carrier down. It would be good to
explain it in the commit message.

	Andrew

* Re: e1000e: Report link down after "Detected Hardware Unit Hang" ?
From: Helge Deller @ 2026-06-15 20:36 UTC
To: Andrew Lunn, Helge Deller
Cc: Tony Nguyen, Przemek Kitszel, intel-wired-lan, netdev

On 6/15/26 18:41, Andrew Lunn wrote:
> If the interface is dead, then setting the carrier down makes a lot of
> sense.

That's what I think as well. Thanks for confirming.

> One question I have is, what do you need to do to recover the
> hardware? Will it correctly set the carrier up when you do the
> recovery?

The only way I could recover was to unplug the network cable and
re-insert it.
I have not tested bringing the NIC down.
But in both cases the driver will need to re-detect the media & link.

> Also, just looking at your proposed change, it is not clear to me why
> such an assignment will result in carrier down. It would be good to
> explain it in the commit message.

Sure. The patch I attached was completely untested and just based on
my analysis of the flow and of how the link could be made to report
as down.
Maybe someone knowledgeable about the driver has a better suggestion
for how to report the link-down situation in a clean way?

Helge

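Andrew's question about why the assignment would lead to "carrier down" can be illustrated with a small user-space model. This is only a sketch under an assumption taken from the patch comment: that e1000e_has_link() switches on hw->phy.media_type and leaves its result at "no link" for a media type it does not recognise. All names below (toy_adapter, toy_has_link, toy_mark_hung) are hypothetical and are not driver symbols.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the driver's media types. */
enum toy_media_type {
	TOY_MEDIA_UNKNOWN = 0,
	TOY_MEDIA_COPPER,
	TOY_MEDIA_FIBER,
};

struct toy_adapter {
	enum toy_media_type media_type;
	bool phy_reports_link;	/* what the hardware would say if asked */
};

/* Toy link check: only known media types ever consult the hardware. */
static bool toy_has_link(const struct toy_adapter *a)
{
	bool link_active = false;

	switch (a->media_type) {
	case TOY_MEDIA_COPPER:
	case TOY_MEDIA_FIBER:
		link_active = a->phy_reports_link;
		break;
	default:
		/* Unknown media: link_active stays false. */
		break;
	}
	return link_active;
}

/* Toy equivalent of the one-line change in the proposed patch. */
static void toy_mark_hung(struct toy_adapter *a)
{
	a->media_type = TOY_MEDIA_UNKNOWN;
}
```

In this model nothing ever sets the media type back, so once toy_mark_hung() has run the link check stays false even while phy_reports_link is true. That is the recovery concern raised in the review, shown here only for the toy, not verified against the driver.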
* Re: [Intel-wired-lan] e1000e: Report link down after "Detected Hardware Unit Hang" ?
From: Ruinskiy, Dima @ 2026-06-16 16:20 UTC
To: Helge Deller, Andrew Lunn, Helge Deller
Cc: Tony Nguyen, Przemek Kitszel, intel-wired-lan, netdev

On 15/06/2026 23:36, Helge Deller wrote:
> On 6/15/26 18:41, Andrew Lunn wrote:
>> If the interface is dead, then setting the carrier down makes a lot of
>> sense.
>
> That's what I think as well. Thanks for confirming.
>
>> One question I have is, what do you need to do to recover the
>> hardware? Will it correctly set the carrier up when you do the
>> recovery?
>
> The only way I could recover was to unplug the network cable and
> re-insert it.
> I have not tested bringing the NIC down.
> But in both cases the driver will need to re-detect the media & link.
>
>> Also, just looking at your proposed change, it is not clear to me why
>> such an assignment will result in carrier down. It would be good to
>> explain it in the commit message.
>
> Sure. The patch I attached was completely untested and just based on
> my analysis of the flow and of how the link could be made to report
> as down.
> Maybe someone knowledgeable about the driver has a better suggestion
> for how to report the link-down situation in a clean way?
>
> Helge

This does not seem like the right direction to me.

The "Detected Hardware Unit Hang" print does not indicate that the
interface is dead, but that the transmitter is stalled.

This can be due to an unusually high load, or a HW fault / race
condition with another component, etc.

When a hang is detected, the transmitter is stopped with
netif_stop_queue() and eventually ndo_tx_timeout triggers a full reset
of the device, which in many cases recovers it from the hang.

If the hang is persistent, we try to understand the cause and debug it.
Permanently marking the device as 'down' because it hung once is not
going to be the optimal solution.

* Re: [Intel-wired-lan] e1000e: Report link down after "Detected Hardware Unit Hang" ?
From: Helge Deller @ 2026-06-16 16:55 UTC
To: Ruinskiy, Dima, Andrew Lunn, Helge Deller
Cc: Tony Nguyen, Przemek Kitszel, intel-wired-lan, netdev

Hello Dima,

On 6/16/26 18:20, Ruinskiy, Dima wrote:
> This does not seem like the right direction to me.
>
> The "Detected Hardware Unit Hang" print does not indicate that the
> interface is dead, but that the transmitter is stalled.

OK. But effectively it means that nothing can be transmitted at this
stage, which is much the same as if the link were down.

> This can be due to an unusually high load, or a HW fault / race
> condition with another component, etc.
>
> When a hang is detected, the transmitter is stopped with
> netif_stop_queue() and eventually ndo_tx_timeout triggers a full
> reset to the device, which in many cases recovers it from the hang.

That would be optimal, but in years I have never seen it recover from
such stalls. Also, looking at the many reports on the internet, people
say it just hangs and does not recover until the cable is unplugged
(I might be wrong!).

> If the hang is persistent, we try to understand the cause and debug
> it. Permanently marking the device as 'down' because it hung once is
> not going to be the optimal solution.

Of course debugging this situation is preferred, but it does not help
when the production remote system stays unreachable forever.
Right now it just fills the syslog with the same stuck message.
Even a module option like "report_link_down_on_hang after 5 automatic
retries" would be a good compromise. You would still be able to get
the necessary debug info then.

Helge

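The "report link down after 5 automatic retries" compromise can be sketched as a small counter policy. This is a user-space illustration only, not driver code: the struct and function names are hypothetical, no such module option exists in e1000e as far as this thread shows, and the exact threshold semantics (link reported down on the hang that follows the last permitted retry) are an assumption.

```c
#include <stdbool.h>

/* Hypothetical policy state; none of these names exist in e1000e. */
struct hang_policy {
	unsigned int max_retries;	/* e.g. 5, as suggested in the mail */
	unsigned int consecutive_hangs;	/* hangs seen since the last progress */
	bool carrier_forced_down;	/* true once the policy gives up */
};

static void hang_policy_init(struct hang_policy *p, unsigned int max_retries)
{
	p->max_retries = max_retries;
	p->consecutive_hangs = 0;
	p->carrier_forced_down = false;
}

/*
 * Call once per detected hang. Returns false while an automatic reset
 * should still be tried, true once the link should be reported down.
 */
static bool hang_policy_on_hang(struct hang_policy *p)
{
	p->consecutive_hangs++;
	if (p->consecutive_hangs > p->max_retries)
		p->carrier_forced_down = true;
	return p->carrier_forced_down;
}

/* Call when transmission makes progress again: start counting afresh. */
static void hang_policy_on_tx_progress(struct hang_policy *p)
{
	p->consecutive_hangs = 0;
	p->carrier_forced_down = false;
}
```

With max_retries set to 5, the first five hangs leave the normal reset path in charge and the sixth reports the link as down, so the debug output from the earlier hangs is still produced. Clearing the state on transmit progress keeps a single past hang from marking the device down permanently, which was the objection raised above.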
* Re: [Intel-wired-lan] e1000e: Report link down after "Detected Hardware Unit Hang" ?
From: Andrew Lunn @ 2026-06-16 21:59 UTC
To: Ruinskiy, Dima
Cc: Helge Deller, Helge Deller, Tony Nguyen, Przemek Kitszel, intel-wired-lan, netdev

> This does not seem like the right direction to me.
>
> The "Detected Hardware Unit Hang" print does not indicate that the interface
> is dead, but that the transmitter is stalled.
>
> This can be due to an unusually high load, or a HW fault / race condition
> with another component, etc.
>
> When a hang is detected, the transmitter is stopped with netif_stop_queue()
> and eventually ndo_tx_timeout triggers a full reset to the device, which in
> many cases recovers it from the hang.

Does a full reset cause the link to be negotiated again? If so, there
is no harm in setting the carrier down. If the reset is successful,
the carrier will be restored. However, if the reset does not recover
the system, does the carrier stay down?

	Andrew

* Re: [Intel-wired-lan] e1000e: Report link down after "Detected Hardware Unit Hang" ?
From: Ruinskiy, Dima @ 2026-06-21 13:22 UTC
To: Andrew Lunn, Helge Deller, Helge Deller
Cc: Tony Nguyen, Przemek Kitszel, intel-wired-lan, netdev

On 17/06/2026 0:59, Andrew Lunn wrote:
>> This does not seem like the right direction to me.
>>
>> The "Detected Hardware Unit Hang" print does not indicate that the interface
>> is dead, but that the transmitter is stalled.
>>
>> This can be due to an unusually high load, or a HW fault / race condition
>> with another component, etc.
>>
>> When a hang is detected, the transmitter is stopped with netif_stop_queue()
>> and eventually ndo_tx_timeout triggers a full reset to the device, which in
>> many cases recovers it from the hang.
>
> Does a full reset cause the link to be negotiated again? If so, there
> is no harm in setting the carrier down. If the reset is successful,
> the carrier will be restored. However, if the reset does not recover
> the system, does the carrier stay down?
>
> 	Andrew
>

The way it is written, a reset triggered by the Tx timeout path will go
through e1000e_reinit_locked(), which calls e1000e_down() followed by
e1000e_up().

e1000e_down() calls netif_carrier_off() at the start, and
e1000e_reset() later. e1000e_up() triggers a link state recheck, which
should restore the carrier.

So if everything works as it should, the change proposed here would be
redundant.

However, we have been getting reports of these unrecoverable hangs from
time to time, so I suspect things do not always work as they should.

There is one issue under investigation at present, where a persistent
hang was reported following an aborted hibernation attempt. We are
testing a patch against it.

I did not see anything in the original description of this report tying
the hang to a power state change, but I will happily share the patch
once we get preliminary positive results.

--Dima

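The reset sequence described in this message can be modelled in a few lines of user-space C. This is a sketch of the ordering only, under stated assumptions: the toy_* names and fields are hypothetical, and the model assumes the link recheck in the "up" step looks solely at whether a link is detected, not at whether the transmitter has recovered.

```c
#include <stdbool.h>

/* Toy state; field names are illustrative, not taken from e1000e. */
struct toy_nic {
	bool carrier;		/* what the stack and bonding would see */
	bool phy_link;		/* whether the link recheck finds a link */
	int carrier_off_calls;	/* how often the carrier was dropped */
	int resets;		/* how often the device was reset */
};

/* Models the "down" step: carrier off first, device reset later. */
static void toy_down(struct toy_nic *n)
{
	n->carrier = false;
	n->carrier_off_calls++;
	n->resets++;
}

/* Models the "up" step: the link recheck decides the carrier state. */
static void toy_up(struct toy_nic *n)
{
	n->carrier = n->phy_link;
}

/* Models the reinit path taken after a Tx timeout: down, then up. */
static void toy_reinit(struct toy_nic *n)
{
	toy_down(n);
	toy_up(n);
}
```

In this toy the carrier drops on every reinit and returns whenever the recheck finds a link. Under that assumption a transmitter that stays hung while the link itself is fine would end up with the carrier reported up again, which is the case Andrew asks about and which the thread leaves open for the real driver.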