e1000e: Report link down after "Detected Hardware Unit Hang" ?

* e1000e: Report link down after "Detected Hardware Unit Hang" ?
From: Helge Deller @ 2026-06-14 21:48 UTC (permalink / raw)
  To: Tony Nguyen, Przemek Kitszel, intel-wired-lan, netdev

I regularly hit the known "eno1: Detected Hardware Unit Hang:" error
with my on-board Intel e1000e NIC.
Since none of the various tips on the internet helped, I had the idea
to set up master/slave bonding to fail over to another NIC when
the Intel chip hangs.

Sadly this doesn't work as intended, because the link of the Intel NIC
isn't reported as "down", so the failover never happens unless I manually
run "ifconfig eno1 down".

My question: Shouldn't the Intel NIC ideally report link down if we know
it hangs? That way a failover should at least happen, right?

Below is a completely untested patch.
Does it make sense for me to test and/or develop such a patch, or
is there something I am missing?

Helge 

diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c
index 7ce0cc8ab8f4..c6edcf4ac032 100644
--- a/drivers/net/ethernet/intel/e1000e/netdev.c
+++ b/drivers/net/ethernet/intel/e1000e/netdev.c
@@ -1157,6 +1157,10 @@ static void e1000_print_hw_hang(struct work_struct *work)

 	e1000e_dump(adapter);

+	/* The NIC hangs. Force link down in e1000e_has_link() such that a
+	 * failover can happen */
+	hw->phy.media_type = e1000_media_type_unknown;
+
 	/* Suggest workaround for known h/w issue */
 	if ((hw->mac.type == e1000_pchlan) && (er32(CTRL) & E1000_CTRL_TFCE))
 		e_err("Try turning off Tx pause (flow control) via ethtool\n");

* Re: e1000e: Report link down after "Detected Hardware Unit Hang" ?
From: Andrew Lunn @ 2026-06-15 16:41 UTC (permalink / raw)
  To: Helge Deller; +Cc: Tony Nguyen, Przemek Kitszel, intel-wired-lan, netdev

On Sun, Jun 14, 2026 at 11:48:08PM +0200, Helge Deller wrote:
> I'm regularly facing the known "eno1: Detected Hardware Unit Hang:"
> with my on-board intel e1000e NIC hardware.
> Since none of the various tips on the internet helped, I had the idea
> to set up a master/slave bond networking to fail over to another NIC when
> the Intel chip hangs.
> 
> Sadly this doesn't work as intended, because the link of the intel NIC
> isn't reported "down", so the failover never happens, unless I manually
> start "ifconfig eno1 down".
> 
> My question: Shouldn't the intel NIC ideally report Link Down if we know
> it hangs? That way a fail-over should at least happen, right?
> 
> Below is a completely untested patch.
> Does it make sense that I try to test and/or develop such a patch, or
> are there things I miss?

If the interface is dead, then setting the carrier down makes a lot of
sense. One question I have is: what do you need to do to recover the
hardware? Will it correctly set the carrier up when you do the
recovery?

Also, just looking at your proposed change, it is not clear to me why
such an assignment will result in carrier down. It would be good to
explain it in the commit message.

	Andrew

* Re: e1000e: Report link down after "Detected Hardware Unit Hang" ?
From: Helge Deller @ 2026-06-15 20:36 UTC (permalink / raw)
  To: Andrew Lunn, Helge Deller
  Cc: Tony Nguyen, Przemek Kitszel, intel-wired-lan, netdev

On 6/15/26 18:41, Andrew Lunn wrote:
> On Sun, Jun 14, 2026 at 11:48:08PM +0200, Helge Deller wrote:
>> I'm regularly facing the known "eno1: Detected Hardware Unit Hang:"
>> with my on-board intel e1000e NIC hardware.
>> Since none of the various tips on the internet helped, I had the idea
>> to set up a master/slave bond networking to fail over to another NIC when
>> the Intel chip hangs.
>>
>> Sadly this doesn't work as intended, because the link of the intel NIC
>> isn't reported "down", so the failover never happens, unless I manually
>> start "ifconfig eno1 down".
>>
>> My question: Shouldn't the intel NIC ideally report Link Down if we know
>> it hangs? That way a fail-over should at least happen, right?
>>
>> Below is a completely untested patch.
>> Does it make sense that I try to test and/or develop such a patch, or
>> are there things I miss?
> 
> If the interface is dead, then setting the carrier down makes a lot of
> sense. 

That's what I think as well. Thanks for confirming.

> One question I have is: what do you need to do to recover the
> hardware? Will it correctly set the carrier up when you do the
> recovery?

The only way I could recover was to unplug the network cable and re-insert it.
I have not tested bringing the NIC down.
But in both cases the driver will need to re-detect the media and link.

> Also, just looking at your proposed change, it is not clear to me why
> such an assignment will result in carrier down. It would be good to
> explain it in the commit message.

Sure. The patch I attached was completely untested and just based on
an analysis of the code flow and of how the link could be made to report
as down. Maybe someone knowledgeable about the driver has a better
suggestion for reporting the link-down situation in a clean way?
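
To make the intended mechanism concrete, here is a minimal user-space
model of that flow. It is only a sketch: it assumes that
e1000e_has_link() selects its link check with a switch on
hw->phy.media_type and leaves the result false when the type is unknown.
All names below (nic_model, model_has_link, model_mark_hung) are
hypothetical stand-ins and not driver code.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the driver's types; not the kernel definitions. */
enum media_type {
	MEDIA_TYPE_UNKNOWN,
	MEDIA_TYPE_COPPER,
	MEDIA_TYPE_FIBER,
};

struct nic_model {
	enum media_type media_type;
	bool phy_link_up;	/* what the hardware status would report */
};

/* Model of a link check that dispatches on the media type. */
bool model_has_link(const struct nic_model *nic)
{
	bool link_active = false;

	switch (nic->media_type) {
	case MEDIA_TYPE_COPPER:
	case MEDIA_TYPE_FIBER:
		link_active = nic->phy_link_up;
		break;
	case MEDIA_TYPE_UNKNOWN:
	default:
		/* No media-specific check runs, so the link stays down. */
		break;
	}

	return link_active;
}

/* Model of the proposed hang handling: forget the media type. */
void model_mark_hung(struct nic_model *nic)
{
	nic->media_type = MEDIA_TYPE_UNKNOWN;
}
```

In this model, a NIC marked as hung reports no link even while
phy_link_up is still true, and nothing restores the media type
afterwards. That matches the open recovery question: something would
have to re-detect the media before the link could be reported up again.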

Helge
