e1000e: Report link down after "Detected Hardware Unit Hang" ?

* e1000e: Report link down after "Detected Hardware Unit Hang" ?
@ 2026-06-14 21:48 Helge Deller
  2026-06-15 16:41 ` Andrew Lunn
  0 siblings, 1 reply; 7+ messages in thread
From: Helge Deller @ 2026-06-14 21:48 UTC (permalink / raw)
  To: Tony Nguyen, Przemek Kitszel, intel-wired-lan, netdev

I'm regularly hitting the known "eno1: Detected Hardware Unit Hang:"
with my on-board Intel e1000e NIC.
Since none of the various tips on the internet helped, I had the idea
to set up a master/slave bond to fail over to another NIC when
the Intel chip hangs.

Sadly this doesn't work as intended, because the link of the Intel NIC
isn't reported "down", so the failover never happens unless I manually
run "ifconfig eno1 down".

My question: Shouldn't the Intel NIC ideally report Link Down if we know
it hangs? That way a failover should at least happen, right?

Below is a completely untested patch.
Does it make sense for me to test and/or develop such a patch, or
is there something I'm missing?

Helge 

diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c
index 7ce0cc8ab8f4..c6edcf4ac032 100644
--- a/drivers/net/ethernet/intel/e1000e/netdev.c
+++ b/drivers/net/ethernet/intel/e1000e/netdev.c
@@ -1157,6 +1157,10 @@ static void e1000_print_hw_hang(struct work_struct *work)

 	e1000e_dump(adapter);

+	/* The NIC hangs. Force link down in e1000e_has_link() such that a
+	 * failover can happen */
+	hw->phy.media_type = e1000_media_type_unknown;
+
 	/* Suggest workaround for known h/w issue */
 	if ((hw->mac.type == e1000_pchlan) && (er32(CTRL) & E1000_CTRL_TFCE))
 		e_err("Try turning off Tx pause (flow control) via ethtool\n");

* Re: e1000e: Report link down after "Detected Hardware Unit Hang" ?
  2026-06-14 21:48 e1000e: Report link down after "Detected Hardware Unit Hang" ? Helge Deller
@ 2026-06-15 16:41 ` Andrew Lunn
  2026-06-15 20:36   ` Helge Deller
  0 siblings, 1 reply; 7+ messages in thread
From: Andrew Lunn @ 2026-06-15 16:41 UTC (permalink / raw)
  To: Helge Deller; +Cc: Tony Nguyen, Przemek Kitszel, intel-wired-lan, netdev

On Sun, Jun 14, 2026 at 11:48:08PM +0200, Helge Deller wrote:
> I'm regularily facing the known "eno1: Detected Hardware Unit Hang:"
> with my on-board intel e1000e NIC hardware.
> Since none of he various tips on the internet helped, I had the idea
> to setup a master/slave bond networking to fail over to another NIC when
> the Intel chip hangs.
> 
> Sadly this doesn't work as intended, because the link of the intel NIC 
> isn't reported "down", so the failover never happens, unless I manually
> start "ifconfig eno1 down".
> 
> My question: Shouldn't the intel NIC ideally report Link Down if we know
> it hangs? That way a fail-over should at least happen, right?
> 
> Below is a completely untested patch.
> Does it make sense that I try to test and/or develop such a patch, or
> are there things I miss?

If the interface is dead, then setting the carrier down makes a lot of
sense. One question I have is: what do you need to do to recover the
hardware? Will it correctly set the carrier up when you do the
recovery?

Also, just looking at your proposed change, it is not clear to me why
such an assignment will result in carrier down. It would be good to
explain it in the commit message.

	Andrew

* Re: e1000e: Report link down after "Detected Hardware Unit Hang" ?
  2026-06-15 16:41 ` Andrew Lunn
@ 2026-06-15 20:36   ` Helge Deller
  2026-06-16 16:20     ` [Intel-wired-lan] " Ruinskiy, Dima
  0 siblings, 1 reply; 7+ messages in thread
From: Helge Deller @ 2026-06-15 20:36 UTC (permalink / raw)
  To: Andrew Lunn, Helge Deller
  Cc: Tony Nguyen, Przemek Kitszel, intel-wired-lan, netdev

On 6/15/26 18:41, Andrew Lunn wrote:
> On Sun, Jun 14, 2026 at 11:48:08PM +0200, Helge Deller wrote:
>> I'm regularily facing the known "eno1: Detected Hardware Unit Hang:"
>> with my on-board intel e1000e NIC hardware.
>> Since none of he various tips on the internet helped, I had the idea
>> to setup a master/slave bond networking to fail over to another NIC when
>> the Intel chip hangs.
>>
>> Sadly this doesn't work as intended, because the link of the intel NIC
>> isn't reported "down", so the failover never happens, unless I manually
>> start "ifconfig eno1 down".
>>
>> My question: Shouldn't the intel NIC ideally report Link Down if we know
>> it hangs? That way a fail-over should at least happen, right?
>>
>> Below is a completely untested patch.
>> Does it make sense that I try to test and/or develop such a patch, or
>> are there things I miss?
> 
> If the interface is dead, then setting the carrier down makes a lot of
> sense. 

That's what I think as well. Thanks for confirming.

> One question i have is, what do you need to do to recover the
> hardware? Will it correctly set the carrier up when you do the
> recovery?

The only way I could recover was to unplug the network cable and re-insert it.
I have not tested bringing the NIC down.
But in both cases the driver will need to re-detect the media & link.

> Also, just looking at your proposed change, it is not clear to me why
> such an assignment will result in carrier down. It would be good to
> explain it in the commit message.

Sure. The patch I attached was completely untested and just based on
my analysis of the code flow and of how to make the link report as down.
Maybe someone who knows the driver has a better suggestion for how to
report the link-down situation in a clean way?
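
For readers following the patch: its comment says the assignment is meant to force link down in e1000e_has_link(). The sketch below is a simplified user-space model of that idea, not the driver code. It assumes the link check dispatches on the media type and that an unknown media type falls through without running any check; all names (model_hw, model_has_link, the MEDIA_TYPE_* values) are hypothetical stand-ins, and the real function in netdev.c should be consulted before relying on this.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the driver's media types. */
enum media_type {
	MEDIA_TYPE_UNKNOWN,
	MEDIA_TYPE_COPPER,
	MEDIA_TYPE_FIBER,
};

/* Hypothetical, heavily reduced hardware state. */
struct model_hw {
	enum media_type media_type;
	bool get_link_status;	/* true while a link check is still pending */
	bool status_link_up;	/* models a link-up bit in a status register */
};

/* Returns true if the model considers the link active. */
static bool model_has_link(const struct model_hw *hw)
{
	bool link_active = false;

	assert(hw != NULL);

	switch (hw->media_type) {
	case MEDIA_TYPE_COPPER:
		/* Link counts as up once no link check is pending. */
		link_active = !hw->get_link_status;
		break;
	case MEDIA_TYPE_FIBER:
		link_active = hw->status_link_up;
		break;
	case MEDIA_TYPE_UNKNOWN:
	default:
		/* Assumption: no check runs here, so the result stays false. */
		break;
	}

	return link_active;
}
```

In this model, setting media_type to MEDIA_TYPE_UNKNOWN makes model_has_link() return false regardless of the other fields. Nothing in the model ever sets the media type back, which is the recovery concern raised earlier in the thread.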

Helge

* Re: [Intel-wired-lan] e1000e: Report link down after "Detected Hardware Unit Hang" ?
  2026-06-15 20:36   ` Helge Deller
@ 2026-06-16 16:20     ` Ruinskiy, Dima
  2026-06-16 16:55       ` Helge Deller
  2026-06-16 21:59       ` Andrew Lunn
  0 siblings, 2 replies; 7+ messages in thread
From: Ruinskiy, Dima @ 2026-06-16 16:20 UTC (permalink / raw)
  To: Helge Deller, Andrew Lunn, Helge Deller
  Cc: Tony Nguyen, Przemek Kitszel, intel-wired-lan, netdev

On 15/06/2026 23:36, Helge Deller wrote:
> On 6/15/26 18:41, Andrew Lunn wrote:
>> On Sun, Jun 14, 2026 at 11:48:08PM +0200, Helge Deller wrote:
>>> I'm regularily facing the known "eno1: Detected Hardware Unit Hang:"
>>> with my on-board intel e1000e NIC hardware.
>>> Since none of he various tips on the internet helped, I had the idea
>>> to setup a master/slave bond networking to fail over to another NIC when
>>> the Intel chip hangs.
>>>
>>> Sadly this doesn't work as intended, because the link of the intel NIC
>>> isn't reported "down", so the failover never happens, unless I manually
>>> start "ifconfig eno1 down".
>>>
>>> My question: Shouldn't the intel NIC ideally report Link Down if we know
>>> it hangs? That way a fail-over should at least happen, right?
>>>
>>> Below is a completely untested patch.
>>> Does it make sense that I try to test and/or develop such a patch, or
>>> are there things I miss?
>>
>> If the interface is dead, then setting the carrier down makes a lot of
>> sense. 
> 
> That's what I think as well. Thanks for confirming.
> 
>> One question i have is, what do you need to do to recover the
>> hardware? Will it correctly set the carrier up when you do the
>> recovery?
> 
> The only way I could recover was to plug the network cable and re-insert 
> it.
> I have not tested bringing the NIC down.
> But in both cases the driver will need to re-detect the media & link
> 
>> Also, just looking at your proposed change, it is not clear to me why
>> such an assignment will result in carrier down. It would be good to
>> explain it in the commit message.
> 
> Sure. The patch I attached was completely untested and just based on
> the analysis of the flow and how to make the Link possibly report to be 
> down.
> Maybe someone knowledgeable of the driver has a better suggestion how to
> report the link down situation in a clean way?
> 
> Helge

This does not seem like the right direction to me.

The "Detected Hardware Unit Hang" print does not indicate that the 
interface is dead, but that the transmitter is stalled.

This can be due to an unusually high load, or a HW fault / race 
condition with another component, etc.

When a hang is detected, the transmitter is stopped with 
netif_stop_queue() and eventually ndo_tx_timeout triggers a full reset 
to the device, which in many cases recovers it from the hang.

If the hang is persistent, we try to understand the cause and debug it. 
Permanently marking the device as 'down' because it hung once is not 
going to be the optimal solution.


* Re: [Intel-wired-lan] e1000e: Report link down after "Detected Hardware Unit Hang" ?
  2026-06-16 16:20     ` [Intel-wired-lan] " Ruinskiy, Dima
@ 2026-06-16 16:55       ` Helge Deller
  2026-06-16 21:59       ` Andrew Lunn
  1 sibling, 0 replies; 7+ messages in thread
From: Helge Deller @ 2026-06-16 16:55 UTC (permalink / raw)
  To: Ruinskiy, Dima, Andrew Lunn, Helge Deller
  Cc: Tony Nguyen, Przemek Kitszel, intel-wired-lan, netdev

Hello Dima,

On 6/16/26 18:20, Ruinskiy, Dima wrote:
> On 15/06/2026 23:36, Helge Deller wrote:
>> On 6/15/26 18:41, Andrew Lunn wrote:
>>> On Sun, Jun 14, 2026 at 11:48:08PM +0200, Helge Deller wrote:
>>>> I'm regularily facing the known "eno1: Detected Hardware Unit Hang:"
>>>> with my on-board intel e1000e NIC hardware.
>>>> Since none of he various tips on the internet helped, I had the idea
>>>> to setup a master/slave bond networking to fail over to another NIC when
>>>> the Intel chip hangs.
>>>>
>>>> Sadly this doesn't work as intended, because the link of the intel NIC
>>>> isn't reported "down", so the failover never happens, unless I manually
>>>> start "ifconfig eno1 down".
>>>>
>>>> My question: Shouldn't the intel NIC ideally report Link Down if we know
>>>> it hangs? That way a fail-over should at least happen, right?
>>>>
>>>> Below is a completely untested patch.
>>>> Does it make sense that I try to test and/or develop such a patch, or
>>>> are there things I miss?
>>>
>>> If the interface is dead, then setting the carrier down makes a lot of
>>> sense. 
>>
>> That's what I think as well. Thanks for confirming.
>>
>>> One question i have is, what do you need to do to recover the
>>> hardware? Will it correctly set the carrier up when you do the
>>> recovery?
>>
>> The only way I could recover was to plug the network cable and re-insert it.
>> I have not tested bringing the NIC down.
>> But in both cases the driver will need to re-detect the media & link
>>
>>> Also, just looking at your proposed change, it is not clear to me why
>>> such an assignment will result in carrier down. It would be good to
>>> explain it in the commit message.
>>
>> Sure. The patch I attached was completely untested and just based on
>> the analysis of the flow and how to make the Link possibly report to be down.
>> Maybe someone knowledgeable of the driver has a better suggestion how to
>> report the link down situation in a clean way?
>>
>> Helge
> This does not seem like the right direction to me.
> 
> The "Detected Hardware Unit Hang" print does not indicate that the
> interface is dead, but that the transmitter is stalled.

OK. But effectively it means that nothing can be transmitted at this stage,
which is much the same as if the link were down.

> This can be due to an unusually high load, or a HW fault / race condition with another component, etc.
>
> When a hang is detected, the transmitter is stopped with
> netif_stop_queue() and eventually ndo_tx_timeout triggers a full
> reset to the device, which in many cases recovers it from the hang.

That would be optimal, but in years I have never seen it recover from such stalls.
Also, judging by the many reports on the internet, people say it just
hangs and does not recover until the cable is unplugged (I might be wrong!).

> If the hang is persistent, we try to understand the cause and debug
> it. Permanently marking the device as 'down' because it hung once is
> not going to be the optimal solution.

Of course debugging this situation is preferred, but it does not help when
the remote production system stays unreachable forever.
Right now it just fills the syslog with the same stuck message.
Even a module option like "report_link_down_on_hang after 5 automatic retries"
would be a good compromise. You should still be able to get the necessary
debug info then.
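
A minimal sketch of such a retry policy, as a self-contained user-space model rather than driver code. Everything here is hypothetical: no such option exists in e1000e, the struct and function names are invented for illustration, and the sketch reads "after 5 retries" as "once the count of consecutive hangs reaches the configured limit".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical policy state; none of these names exist in e1000e. */
struct hang_policy {
	unsigned int max_retries;	/* 0 disables the policy */
	unsigned int consecutive_hangs;	/* hangs seen since the last recovery */
	bool carrier_forced_off;	/* true once the limit has been reached */
};

/*
 * Call when a Tx hang is detected.
 * Returns true if the link should now be reported as down.
 */
static bool hang_policy_on_hang(struct hang_policy *p)
{
	assert(p != NULL);

	if (p->max_retries == 0)
		return false;

	/* Saturate at the limit so the counter cannot wrap around. */
	if (p->consecutive_hangs < p->max_retries)
		p->consecutive_hangs++;

	if (p->consecutive_hangs >= p->max_retries)
		p->carrier_forced_off = true;

	return p->carrier_forced_off;
}

/* Call when transmission is seen to work again, e.g. after a good reset. */
static void hang_policy_on_recovery(struct hang_policy *p)
{
	assert(p != NULL);

	p->consecutive_hangs = 0;
	p->carrier_forced_off = false;
}
```

With max_retries set to 5, the first four hangs leave the reported link state alone, so the existing reset path still gets its chance and the hang messages are still logged. Only the fifth consecutive hang flips the result, and a successful recovery clears the state again.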

Helge

* Re: [Intel-wired-lan] e1000e: Report link down after "Detected Hardware Unit Hang" ?
  2026-06-16 16:20     ` [Intel-wired-lan] " Ruinskiy, Dima
  2026-06-16 16:55       ` Helge Deller
@ 2026-06-16 21:59       ` Andrew Lunn
  2026-06-21 13:22         ` Ruinskiy, Dima
  1 sibling, 1 reply; 7+ messages in thread
From: Andrew Lunn @ 2026-06-16 21:59 UTC (permalink / raw)
  To: Ruinskiy, Dima
  Cc: Helge Deller, Helge Deller, Tony Nguyen, Przemek Kitszel,
	intel-wired-lan, netdev

> This does not seem like the right direction to me.
> 
> The "Detected Hardware Unit Hang" print does not indicate that the interface
> is dead, but that the transmitter is stalled.
> 
> This can be due to an unusually high load, or a HW fault / race condition
> with another component, etc.
> 
> When a hang is detected, the transmitter is stopped with netif_stop_queue()
> and eventually ndo_tx_timeout triggers a full reset to the device, which in
> many cases recovers it from the hang.

Does a full reset cause the link to be negotiated again? If so, there
is no harm in setting the carrier down. If the reset is successful,
the carrier will be restored. However, if the reset does not recover
the system, does the carrier stay down?

    Andrew


* Re: [Intel-wired-lan] e1000e: Report link down after "Detected Hardware Unit Hang" ?
  2026-06-16 21:59       ` Andrew Lunn
@ 2026-06-21 13:22         ` Ruinskiy, Dima
  0 siblings, 0 replies; 7+ messages in thread
From: Ruinskiy, Dima @ 2026-06-21 13:22 UTC (permalink / raw)
  To: Andrew Lunn, Helge Deller, Helge Deller
  Cc: Tony Nguyen, Przemek Kitszel, intel-wired-lan, netdev

On 17/06/2026 0:59, Andrew Lunn wrote:
>> This does not seem like the right direction to me.
>>
>> The "Detected Hardware Unit Hang" print does not indicate that the interface
>> is dead, but that the transmitter is stalled.
>>
>> This can be due to an unusually high load, or a HW fault / race condition
>> with another component, etc.
>>
>> When a hang is detected, the transmitter is stopped with netif_stop_queue()
>> and eventually ndo_tx_timeout triggers a full reset to the device, which in
>> many cases recovers it from the hang.
> 
> Does a full reset cause the link to be negotiated again? If so, there
> is no harm in setting the carrier down. If the reset is successful,
> the carrier will be restored. However, if the reset does not recover
> the system, does the carrier say down?
> 
>      Andrew
> 

The way it is written - a reset triggered by the Tx timeout path will go 
through e1000e_reinit_locked(), which calls e1000e_down() followed by 
e1000e_up().

e1000e_down() calls netif_carrier_off() at the start, and e1000e_reset() 
later. e1000e_up() triggers a link state recheck, which should restore 
the carrier.
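
The sequence above can be modelled in user space as follows. This is only a sketch under stated assumptions, not the driver: the struct and function names are hypothetical, and the model assumes that the link recheck on the way up looks only at the physical link state, not at whether the transmitter actually recovered.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model state; these fields are not driver structures. */
struct model_nic {
	bool phys_link_up;		/* cable plugged in, link partner present */
	bool reset_clears_hang;		/* does the reset fix the Tx stall? */
	bool tx_hung;			/* transmitter currently stalled */
	bool carrier_ok;		/* what the stack (and bonding) sees */
	unsigned int carrier_off_events;
};

/* Models the down path: carrier off first, device reset later. */
static void model_down(struct model_nic *nic)
{
	nic->carrier_ok = false;
	nic->carrier_off_events++;

	if (nic->reset_clears_hang)
		nic->tx_hung = false;
}

/* Models the up path: the link recheck restores carrier if link is up. */
static void model_up(struct model_nic *nic)
{
	/* Assumption: the recheck ignores tx_hung entirely. */
	nic->carrier_ok = nic->phys_link_up;
}

/* Models the Tx-timeout reset path: down followed by up. */
static bool model_reinit(struct model_nic *nic)
{
	assert(nic != NULL);

	model_down(nic);
	model_up(nic);

	return nic->carrier_ok;
}
```

Under that assumption, a reset that fails to clear the hang still ends with the carrier reported up, because the carrier only drops briefly between the two steps. Whether the real driver behaves this way is the open question in this thread.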

So if everything works as it should, the change proposed here would be
redundant. However, we have been getting reports of
these unrecoverable hangs from time to time, so I suspect things do not
always work as they should.

There is one issue under investigation at present, where a persistent 
hang was reported following an aborted hibernation attempt. We are 
testing a patch against it.

I did not see anything in the original description of this report tying 
the hang to a power state change, but I will happily share the patch 
once we get preliminary positive results.

--Dima
