* e1000: Detected Tx Unit Hang
@ 2008-02-15 22:52 Bernd Schubert
2008-02-15 23:29 ` Kok, Auke
0 siblings, 1 reply; 12+ messages in thread
From: Bernd Schubert @ 2008-02-15 22:52 UTC (permalink / raw)
To: netdev
Hello,
I can't log in to one of our servers and just got this in an IPMI SOL
session:
[18169.209181] e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
[18169.209183] Tx Queue <0>
[18169.209184] TDH <e3>
[18169.209185] TDT <e3>
[18169.209186] next_to_use <e3>
[18169.209187] next_to_clean <bd>
[18169.209188] buffer_info[next_to_clean]
[18169.209189] time_stamp <10043e4d2>
[18169.209190] next_to_watch <be>
[18169.209191] jiffies <10043e6f6>
[18169.209192] next_to_watch.status <1>
[18169.256978] e1000: eth2: e1000_clean_tx_irq: Detected Tx Unit Hang
[18169.256979] Tx Queue <0>
[18169.256980] TDH <de>
[18169.256982] TDT <de>
[18169.256983] next_to_use <de>
[18169.256984] next_to_clean <bc>
[18169.256985] buffer_info[next_to_clean]
[18169.256986] time_stamp <10043e511>
[18169.256987] next_to_watch <bd>
[18169.256988] jiffies <10043e701>
[18169.256989] next_to_watch.status <1>
This is with 2.6.22.18. Is there any chance to recover the system? For several
reasons I would prefer not to reboot now.
Thanks,
Bernd
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: e1000: Detected Tx Unit Hang
2008-02-15 22:52 e1000: Detected Tx Unit Hang Bernd Schubert
@ 2008-02-15 23:29 ` Kok, Auke
2008-02-16 0:26 ` Bernd Schubert
0 siblings, 1 reply; 12+ messages in thread
From: Kok, Auke @ 2008-02-15 23:29 UTC (permalink / raw)
To: Bernd Schubert; +Cc: netdev
Bernd Schubert wrote:
> Hello,
>
> I can't login to one of our servers and just got this in an ipmi sol
> session:
>
> [18169.209181] e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
> [18169.209183] Tx Queue <0>
> [18169.209184] TDH <e3>
> [18169.209185] TDT <e3>
> [18169.209186] next_to_use <e3>
> [18169.209187] next_to_clean <bd>
> [18169.209188] buffer_info[next_to_clean]
> [18169.209189] time_stamp <10043e4d2>
> [18169.209190] next_to_watch <be>
> [18169.209191] jiffies <10043e6f6>
> [18169.209192] next_to_watch.status <1>
> [18169.256978] e1000: eth2: e1000_clean_tx_irq: Detected Tx Unit Hang
> [18169.256979] Tx Queue <0>
> [18169.256980] TDH <de>
> [18169.256982] TDT <de>
> [18169.256983] next_to_use <de>
> [18169.256984] next_to_clean <bc>
> [18169.256985] buffer_info[next_to_clean]
> [18169.256986] time_stamp <10043e511>
> [18169.256987] next_to_watch <bd>
> [18169.256988] jiffies <10043e701>
> [18169.256989] next_to_watch.status <1>
>
> This is with 2.6.22.18. Is there any chance to recover the system? For some
> reasons I would prefer not to reboot now.
If that's all you have, then it was a false alarm. There should be a 'netdev timeout
- link reset' following those messages. Can you send some more context on those
messages?
In real tx hang cases, the hardware is reset within 2 seconds, and everything
continues as normal.
Auke
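The dump itself carries enough information to estimate how long the watched descriptor had been pending when the message fired. Below is a minimal Python sketch that parses the eth0 fields quoted in this thread and subtracts time_stamp from jiffies. The helper name parse_tx_hang is made up for illustration, and the HZ values are assumptions: the thread never states the kernel's tick rate, so the conversion to seconds is only indicative. Reading TDH == TDT as "the NIC has nothing left to fetch" is likewise an interpretation, not something stated in the thread.

```python
import re

# Fields copied from the eth0 dump quoted in this thread; all values are hex.
ETH0_DUMP = """\
TDH <e3>
TDT <e3>
next_to_use <e3>
next_to_clean <bd>
time_stamp <10043e4d2>
next_to_watch <be>
jiffies <10043e6f6>
next_to_watch.status <1>
"""


def parse_tx_hang(text):
    """Map each 'name <hex>' line of a Tx hang dump to an integer value."""
    pairs = re.findall(r"([\w.]+)\s+<([0-9a-fA-F]+)>", text)
    return {name: int(value, 16) for name, value in pairs}


fields = parse_tx_hang(ETH0_DUMP)

# Age of the watched descriptor in timer ticks.
age_jiffies = fields["jiffies"] - fields["time_stamp"]
print("age in jiffies:", age_jiffies)

# HZ is a kernel build option and is NOT given in the thread; these common
# values are assumptions used only to show the conversion.
for assumed_hz in (100, 250, 1000):
    print(f"HZ={assumed_hz}: {age_jiffies / assumed_hz:.2f} s")

# Head equal to tail suggests the NIC has no descriptors left to fetch.
print("TDH == TDT:", fields["TDH"] == fields["TDT"])
```

The same function can be pointed at the eth2 dump, or at any other report in this format, by pasting its lines into a string.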
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: e1000: Detected Tx Unit Hang
2008-02-15 23:29 ` Kok, Auke
@ 2008-02-16 0:26 ` Bernd Schubert
2008-02-19 16:47 ` Kok, Auke
0 siblings, 1 reply; 12+ messages in thread
From: Bernd Schubert @ 2008-02-16 0:26 UTC (permalink / raw)
To: Kok, Auke; +Cc: netdev
On Saturday 16 February 2008, Kok, Auke wrote:
> Bernd Schubert wrote:
> > Hello,
> >
> > I can't login to one of our servers and just got this in an ipmi sol
> > session:
> >
> > [18169.209181] e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
> > [18169.209183] Tx Queue <0>
> > [18169.209184] TDH <e3>
> > [18169.209185] TDT <e3>
> > [18169.209186] next_to_use <e3>
> > [18169.209187] next_to_clean <bd>
> > [18169.209188] buffer_info[next_to_clean]
> > [18169.209189] time_stamp <10043e4d2>
> > [18169.209190] next_to_watch <be>
> > [18169.209191] jiffies <10043e6f6>
> > [18169.209192] next_to_watch.status <1>
> > [18169.256978] e1000: eth2: e1000_clean_tx_irq: Detected Tx Unit Hang
> > [18169.256979] Tx Queue <0>
> > [18169.256980] TDH <de>
> > [18169.256982] TDT <de>
> > [18169.256983] next_to_use <de>
> > [18169.256984] next_to_clean <bc>
> > [18169.256985] buffer_info[next_to_clean]
> > [18169.256986] time_stamp <10043e511>
> > [18169.256987] next_to_watch <bd>
> > [18169.256988] jiffies <10043e701>
> > [18169.256989] next_to_watch.status <1>
> >
> > This is with 2.6.22.18. Is there any chance to recover the system? For
> > some reasons I would prefer not to reboot now.
>
> if that's all you have then it was false alarm. there should be a 'netdev
> timeout - link reset' following those messages. can you send some more
> context on those messages?
All I presently know is that there are 20 servers and login doesn't work any
more. sysrq+t shows me it hangs in fuse, which is accessing the
underlying nfs (we are using unionfs-fuse). While I was checking the sysrq-t
output, these e1000 messages suddenly appeared.
Thinking a bit about it, either 2.6.22.18 has an e1000 bug that
2.6.22.X didn't have (X=16, I think, but I'm not sure), or someone
misconfigured the switch/network environment today.
Hmm, now that I think about the last part, there had already been other
networking problems today, which were supposed to have been fixed several hours ago.
It seems they weren't fixed properly.
>
> in real tx hang cases, the hardware is reset within 2 seconds, and
> everything continues as normal.
Thanks, this gives me hope I don't need to reboot the servers (a reboot would
mean I would need to start 60 md-raid rebuilds...).
Thanks,
Bernd
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: e1000: Detected Tx Unit Hang
2008-02-16 0:26 ` Bernd Schubert
@ 2008-02-19 16:47 ` Kok, Auke
0 siblings, 0 replies; 12+ messages in thread
From: Kok, Auke @ 2008-02-19 16:47 UTC (permalink / raw)
To: Bernd Schubert; +Cc: netdev
Bernd Schubert wrote:
> On Saturday 16 February 2008, Kok, Auke wrote:
>> Bernd Schubert wrote:
>>> Hello,
>>>
>>> I can't login to one of our servers and just got this in an ipmi sol
>>> session:
>>>
>>> [18169.209181] e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
>>> [18169.209183] Tx Queue <0>
>>> [18169.209184] TDH <e3>
>>> [18169.209185] TDT <e3>
>>> [18169.209186] next_to_use <e3>
>>> [18169.209187] next_to_clean <bd>
>>> [18169.209188] buffer_info[next_to_clean]
>>> [18169.209189] time_stamp <10043e4d2>
>>> [18169.209190] next_to_watch <be>
>>> [18169.209191] jiffies <10043e6f6>
>>> [18169.209192] next_to_watch.status <1>
>>> [18169.256978] e1000: eth2: e1000_clean_tx_irq: Detected Tx Unit Hang
>>> [18169.256979] Tx Queue <0>
>>> [18169.256980] TDH <de>
>>> [18169.256982] TDT <de>
>>> [18169.256983] next_to_use <de>
>>> [18169.256984] next_to_clean <bc>
>>> [18169.256985] buffer_info[next_to_clean]
>>> [18169.256986] time_stamp <10043e511>
>>> [18169.256987] next_to_watch <bd>
>>> [18169.256988] jiffies <10043e701>
>>> [18169.256989] next_to_watch.status <1>
>>>
>>> This is with 2.6.22.18. Is there any chance to recover the system? For
>>> some reasons I would prefer not to reboot now.
>> if that's all you have then it was false alarm. there should be a 'netdev
>> timeout - link reset' following those messages. can you send some more
>> context on those messages?
>
> All I presently know is that there are 20 servers and login doesn't work any
> more - sysrq+t does show me it hangs in fuse, which is accessing the
> underlying nfs (we are using unionfs-fuse). While I checked the sysrq-t
> output suddenly these e1000 messages appeared.
> Thinking a bit about it, it either could be 2.6.22.18 has an e1000 bug, which
> 2.6.22.X didn't have (X=16, I think, but I'm not sure) or someone
> mis-configured the switch/network environment today.
> Hmm, now that I think about the last part, there already had been other
> networking problems today, which were supposed to be fixed several hours ago.
> Seems they didn't fix it properly.
>
>> in real tx hang cases, the hardware is reset within 2 seconds, and
>> everything continues as normal.
>
> Thanks, this gives me hope I don't need to reboot the serves (reboot would
> mean I would need to start 60 md-raid rebuilds...).
My first thought after reading this e-mail is that the tx-hang message is just a
symptom of your system not responding or being spinlocked all the time. These TX
hang issues normally do not interfere with normal system operation at all, and
unless you have continuous TX resets you would be able to log on perfectly fine.
I think you might have hit another kernel bug here, perhaps even a unionfs/fuse
related one; that certainly looks plausible from your problem description.
Looking at the changelog for 2.6.22.16->2.6.22.18, I can't see anything relevant
(see
http://git.kernel.org/?p=linux/kernel/git/stable/linux-2.6.22.y.git;a=shortlog),
and there are definitely no e1000 driver changes in that range anyway.
I don't suppose you can do a git-bisect? That would certainly help. I don't think
we can rule anything out just yet.
At least try to revert some of your systems to the previous kernel version and see
if the problem goes away...
Auke
^ permalink raw reply [flat|nested] 12+ messages in thread
* e1000 Detected Tx Unit Hang
@ 2006-09-02 14:39 Paul Aviles
2006-09-03 17:45 ` Jesse Brandeburg
0 siblings, 1 reply; 12+ messages in thread
From: Paul Aviles @ 2006-09-02 14:39 UTC (permalink / raw)
To: netdev
I am getting "e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang" using
stock 2.6.17.11, 2.6.17.5 or 2.6.17.4 kernels on CentOS 4.3.
The server is a Tyan GS12 (82541GI/PI and 82547GI) and is connected to a
Netgear GS724T Gig switch. I can easily reproduce the problem by trying to
do a large ftp transfer to the server. It does not happen if the server is
connected to a dummy 100 Mb switch, only when it is connected to the Gig
switch.
I have also tried the options line below disabling tso, tx and rx in the
modprobe.conf without any luck.
options e1000 XsumRX=0 Speed=1000 Duplex=2 InterruptThrottleRate=0
FlowControl=3 RxDescriptors=4096 TxDescriptors=4096 RxIntDelay=0
TxIntDelay=0
in /var/log/kernel I get the following...
Sep 1 23:53:01 www kernel: e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
Sep 1 23:53:01 www kernel: Tx Queue <0>
Sep 1 23:53:01 www kernel: TDH <4c4>
Sep 1 23:53:01 www kernel: TDT <4c9>
Sep 1 23:53:01 www kernel: next_to_use <4c9>
Sep 1 23:53:01 www kernel: next_to_clean <4c4>
Sep 1 23:53:01 www kernel: buffer_info[next_to_clean]
Sep 1 23:53:01 www kernel: time_stamp <ffff9c60>
Sep 1 23:53:01 www kernel: next_to_watch <4c4>
Sep 1 23:53:01 www kernel: jiffies <ffff9d96>
Sep 1 23:53:01 www kernel: next_to_watch.status <0>
.
repeats the same as above a few times....
.
Sep 1 23:53:10 www kernel: NETDEV WATCHDOG: eth0: transmit timed out
Sep 1 23:53:13 www kernel: e1000: eth0: e1000_watchdog_task: NIC Link is Up 1000 Mbps Full Duplex
then the server locks up, no response from the keyboard at all and must be
forced down with a power kill. The suggested tips on how to deal with this
issue are not working so if I can help troubleshoot this let me know.
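In this dump TDH and TDT differ, so it can help to work out how many descriptors sit between them. A rough Python sketch follows, assuming TDH and TDT are the head and tail indices of a circular transmit ring and that the ring really has the 4096 entries requested by TxDescriptors=4096 (the driver may have applied a different size). The function name pending_descriptors is hypothetical.

```python
def pending_descriptors(head, tail, ring_size):
    """Count descriptors between head and tail, allowing for ring wraparound."""
    return (tail - head) % ring_size


# TxDescriptors=4096 comes from the options line; whether the driver actually
# applied that size on this NIC is an assumption.
RING_SIZE = 4096

# TDH and TDT from the dump above (hexadecimal).
tdh, tdt = 0x4C4, 0x4C9
print(pending_descriptors(tdh, tdt, RING_SIZE))

# Hypothetical wraparound case: the tail has wrapped past the end of the ring.
print(pending_descriptors(0xFFE, 0x003, RING_SIZE))
```

The modulo keeps the count correct when the tail index has wrapped around to the start of the ring while the head has not.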
Here is my system info,
driver: e1000
version: 7.0.33-k2-NAPI
firmware-version: N/A
bus-info: 0000:02:01.0
lspci -vv output below..
00:00.0 Host bridge: Intel Corporation 82875P/E7210 Memory Controller Hub
(rev 02)
Subsystem: Intel Corporation 82875P/E7210 Memory Controller Hub
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
Stepping- SERR+ FastB2B-
Status: Cap+ 66Mhz- UDF- FastB2B+ ParErr- DEVSEL=fast >TAbort-
<TAbort- <MAbort+ >SERR- <PERR-
Latency: 0
Region 0: Memory at 90000000 (32-bit, prefetchable) [size=128M]
Capabilities: [e4] Vendor Specific Information
Capabilities: [a0] AGP version 3.0
Status: RQ=32 Iso- ArqSz=2 Cal=0 SBA+ ITACoh- GART64-
HTrans- 64bit- FW+ AGP3- Rate=x1,x2,x4
Command: RQ=1 ArqSz=0 Cal=0 SBA- AGP- GART64- 64bit- FW-
Rate=<none>
00:01.0 PCI bridge: Intel Corporation 82875P Processor to AGP Controller
(rev 02) (prog-if 00 [Normal decode])
Control: I/O- Mem- BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
Stepping- SERR- FastB2B-
Status: Cap- 66Mhz+ UDF- FastB2B+ ParErr- DEVSEL=fast >TAbort-
<TAbort- <MAbort- >SERR- <PERR-
Latency: 64
Bus: primary=00, secondary=01, subordinate=01, sec-latency=0
Secondary status: 66Mhz+ FastB2B+ ParErr- DEVSEL=medium >TAbort-
<TAbort- <MAbort+ <SERR- <PERR-
BridgeCtl: Parity- SERR- NoISA- VGA- MAbort- >Reset- FastB2B-
00:03.0 PCI bridge: Intel Corporation 82875P/E7210 Processor to PCI to CSA
Bridge (rev 02) (prog-if 00 [Normal decode])
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
Stepping- SERR+ FastB2B-
Status: Cap- 66Mhz+ UDF- FastB2B+ ParErr- DEVSEL=fast >TAbort-
<TAbort- <MAbort- >SERR- <PERR-
Latency: 32
Bus: primary=00, secondary=02, subordinate=02, sec-latency=0
I/O behind bridge: 00002000-00002fff
Memory behind bridge: fc100000-fc1fffff
Secondary status: 66Mhz+ FastB2B+ ParErr- DEVSEL=medium >TAbort-
<TAbort- <MAbort- <SERR- <PERR-
BridgeCtl: Parity- SERR- NoISA+ VGA- MAbort- >Reset- FastB2B-
00:1d.0 USB Controller: Intel Corporation 82801EB/ER (ICH5/ICH5R) USB UHCI
Controller #1 (rev 02) (prog-if 00 [UHCI])
Subsystem: Intel Corporation: Unknown device 24c0
Control: I/O+ Mem- BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
Stepping- SERR- FastB2B-
Status: Cap- 66Mhz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort-
<TAbort- <MAbort- >SERR- <PERR-
Latency: 0
Interrupt: pin A routed to IRQ 18
Region 4: I/O ports at 1400 [size=32]
00:1d.1 USB Controller: Intel Corporation 82801EB/ER (ICH5/ICH5R) USB UHCI
Controller #2 (rev 02) (prog-if 00 [UHCI])
Subsystem: Intel Corporation: Unknown device 24c0
Control: I/O+ Mem- BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
Stepping- SERR- FastB2B-
Status: Cap- 66Mhz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort-
<TAbort- <MAbort- >SERR- <PERR-
Latency: 0
Interrupt: pin B routed to IRQ 19
Region 4: I/O ports at 1420 [size=32]
00:1d.2 USB Controller: Intel Corporation 82801EB/ER (ICH5/ICH5R) USB UHCI
Controller #3 (rev 02) (prog-if 00 [UHCI])
Subsystem: Intel Corporation: Unknown device 24c0
Control: I/O+ Mem- BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
Stepping- SERR- FastB2B-
Status: Cap- 66Mhz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort-
<TAbort- <MAbort- >SERR- <PERR-
Latency: 0
Interrupt: pin C routed to IRQ 16
Region 4: I/O ports at 1440 [size=32]
00:1d.3 USB Controller: Intel Corporation 82801EB/ER (ICH5/ICH5R) USB UHCI
Controller #4 (rev 02) (prog-if 00 [UHCI])
Subsystem: Intel Corporation: Unknown device 24c0
Control: I/O+ Mem- BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
Stepping- SERR- FastB2B-
Status: Cap- 66Mhz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort-
<TAbort- <MAbort- >SERR- <PERR-
Latency: 0
Interrupt: pin A routed to IRQ 18
Region 4: I/O ports at 1460 [size=32]
00:1d.7 USB Controller: Intel Corporation 82801EB/ER (ICH5/ICH5R) USB2 EHCI
Controller (rev 02) (prog-if 20 [EHCI])
Subsystem: Intel Corporation: Unknown device 24d0
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
Stepping- SERR+ FastB2B-
Status: Cap+ 66Mhz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort-
<TAbort- <MAbort- >SERR- <PERR-
Latency: 0
Interrupt: pin D routed to IRQ 17
Region 0: Memory at fc000000 (32-bit, non-prefetchable) [size=1K]
Capabilities: [50] Power Management version 2
Flags: PMEClk- DSI- D1- D2- AuxCurrent=375mA
PME(D0+,D1-,D2-,D3hot+,D3cold+)
Status: D0 PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [58] Debug port
00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev c2) (prog-if 00
[Normal decode])
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
Stepping- SERR+ FastB2B-
Status: Cap- 66Mhz- UDF- FastB2B+ ParErr- DEVSEL=fast >TAbort-
<TAbort- <MAbort- >SERR- <PERR-
Latency: 0
Bus: primary=00, secondary=03, subordinate=03, sec-latency=32
I/O behind bridge: 00003000-00003fff
Memory behind bridge: fc200000-fdffffff
Prefetchable memory behind bridge: 88000000-880fffff
Secondary status: 66Mhz- FastB2B+ ParErr- DEVSEL=medium >TAbort-
<TAbort- <MAbort- <SERR- <PERR-
BridgeCtl: Parity- SERR- NoISA+ VGA+ MAbort- >Reset- FastB2B-
00:1f.0 ISA bridge: Intel Corporation 82801EB/ER (ICH5/ICH5R) LPC Interface
Bridge (rev 02)
Control: I/O+ Mem+ BusMaster+ SpecCycle+ MemWINV- VGASnoop- ParErr-
Stepping- SERR- FastB2B-
Status: Cap- 66Mhz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort-
<TAbort- <MAbort- >SERR- <PERR-
Latency: 0
00:1f.1 IDE interface: Intel Corporation 82801EB/ER (ICH5/ICH5R) IDE
Controller (rev 02) (prog-if 8a [Master SecP PriP])
Subsystem: Intel Corporation: Unknown device 24c0
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
Stepping- SERR- FastB2B-
Status: Cap- 66Mhz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort-
<TAbort- <MAbort- >SERR- <PERR-
Latency: 0
Interrupt: pin A routed to IRQ 16
Region 0: I/O ports at <unassigned>
Region 1: I/O ports at <unassigned>
Region 2: I/O ports at <unassigned>
Region 3: I/O ports at <unassigned>
Region 4: I/O ports at 14a0 [size=16]
Region 5: Memory at 88100000 (32-bit, non-prefetchable) [size=1K]
00:1f.3 SMBus: Intel Corporation 82801EB/ER (ICH5/ICH5R) SMBus Controller
(rev 02)
Subsystem: Intel Corporation: Unknown device 24c0
Control: I/O+ Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr-
Stepping- SERR- FastB2B-
Status: Cap- 66Mhz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort-
<TAbort- <MAbort- >SERR- <PERR-
Interrupt: pin B routed to IRQ 10
Region 4: I/O ports at 1480 [size=32]
02:01.0 Ethernet controller: Intel Corporation 82547GI Gigabit Ethernet
Controller
Subsystem: Intel Corporation PRO/1000 CT Network Connection
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
Stepping- SERR- FastB2B-
Status: Cap+ 66Mhz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort-
<TAbort- <MAbort- >SERR- <PERR-
Latency: 0 (63750ns min), Cache Line Size 08
Interrupt: pin A routed to IRQ 16
Region 0: Memory at fc100000 (32-bit, non-prefetchable) [size=128K]
Region 2: I/O ports at 2000 [size=32]
Capabilities: [dc] Power Management version 2
Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA
PME(D0+,D1-,D2-,D3hot+,D3cold+)
Status: D0 PME-Enable- DSel=0 DScale=1 PME-
03:00.0 VGA compatible controller: ATI Technologies Inc Rage XL (rev 27)
(prog-if 00 [VGA])
Subsystem: ATI Technologies Inc Rage XL
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
Stepping+ SERR- FastB2B-
Status: Cap+ 66Mhz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort-
<TAbort- <MAbort- >SERR- <PERR-
Latency: 66 (2000ns min), Cache Line Size 08
Interrupt: pin A routed to IRQ 5
Region 0: Memory at fd000000 (32-bit, non-prefetchable) [size=16M]
Region 1: I/O ports at 3000 [size=256]
Region 2: Memory at fc240000 (32-bit, non-prefetchable) [size=4K]
[virtual] Expansion ROM at 88000000 [disabled] [size=128K]
Capabilities: [5c] Power Management version 2
Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA
PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 PME-Enable- DSel=0 DScale=0 PME-
03:02.0 Ethernet controller: Intel Corporation 82541GI/PI Gigabit Ethernet
Controller
Subsystem: Intel Corporation PRO/1000 MT Network Connection
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop- ParErr-
Stepping- SERR+ FastB2B-
Status: Cap+ 66Mhz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort-
<TAbort- <MAbort- >SERR- <PERR-
Latency: 52 (63750ns min), Cache Line Size 08
Interrupt: pin A routed to IRQ 16
Region 0: Memory at fc220000 (32-bit, non-prefetchable) [size=128K]
Region 1: Memory at fc200000 (32-bit, non-prefetchable) [size=128K]
Region 2: I/O ports at 3400 [size=64]
[virtual] Expansion ROM at 88020000 [disabled] [size=128K]
Capabilities: [dc] Power Management version 2
Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA
PME(D0+,D1-,D2-,D3hot+,D3cold+)
Status: D0 PME-Enable- DSel=0 DScale=1 PME-
Capabilities: [e4] PCI-X non-bridge device.
Command: DPERE- ERO+ RBC=0 OST=0
Status: Bus=0 Dev=0 Func=0 64bit- 133MHz- SCD- USC-,
DC=simple, DMMRBC=2, DMOST=0, DMCRS=0, RSCEM-
Regards,
Paul Aviles
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: e1000 Detected Tx Unit Hang
2006-09-02 14:39 e1000 " Paul Aviles
@ 2006-09-03 17:45 ` Jesse Brandeburg
2006-09-03 23:37 ` Paul Aviles
0 siblings, 1 reply; 12+ messages in thread
From: Jesse Brandeburg @ 2006-09-03 17:45 UTC (permalink / raw)
To: Paul Aviles; +Cc: netdev
On 9/2/06, Paul Aviles <paul.aviles@palei.com> wrote:
> I am getting "e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang" using
> stock 2.6.17.11, 2.6.17.5 or 2.6.17.4 kernels on centos 4.3.
>
> The server is a Tyan GS12 ( 82541GI/PI and 82547GI) and is connected to a
> Netgear GS724T Gig switch. I can easily reproduce the problem by trying to
> do a large ftp transfer to the server. It does not happen if the server is
> connected to a dummy 100 Mb switch, only when is connected to the Gig
> switch.
> I have also tried the options line below disabling tso, tx and rx in the
> modprobe.conf without any luck.
Hi Paul, sorry to hear about your problem. You're getting hangs on
the 82547, right? Can you send the output of cat /proc/interrupts?
I'm curious whether you are sharing interrupts while running NAPI.
Also, please try the driver without CONFIG_E1000_NAPI enabled in your
kernel .config, and let us know the results.
Someone has posted (what they think is) a theoretical problem with
irq_sem on the 82547 at e1000.sf.net and I haven't had a chance to
figure it out yet.
Jesse
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: e1000 Detected Tx Unit Hang
2006-09-03 17:45 ` Jesse Brandeburg
@ 2006-09-03 23:37 ` Paul Aviles
2006-09-05 16:09 ` Jesse Brandeburg
0 siblings, 1 reply; 12+ messages in thread
From: Paul Aviles @ 2006-09-03 23:37 UTC (permalink / raw)
To: Jesse Brandeburg; +Cc: netdev
Hey Jesse, thanks for your reply. Here is the output from /proc. The weird
part is that I have several other identical systems and only one is
affected. Today I moved the hard drive to another similar system and I am
not seeing the problem, so I am wondering whether something is wrong with
the card's EEPROM. Is there a way to check that?
Regards,
Paul
cat /proc/interrupts
           CPU0       CPU1
  0:    7716253          0    IO-APIC-edge  timer
  3:      11538          0    IO-APIC-edge  serial
  8:          1          0    IO-APIC-edge  rtc
  9:          0          0   IO-APIC-level  acpi
 14:      93406          0    IO-APIC-edge  ide0
 16:      70540          0   IO-APIC-level  uhci_hcd:usb4, eth0
 17:          2          0   IO-APIC-level  ehci_hcd:usb1
 18:          0          0   IO-APIC-level  uhci_hcd:usb2, uhci_hcd:usb5
 19:         90          0   IO-APIC-level  uhci_hcd:usb3
NMI:          0          0
LOC:    7715839    7715838
ERR:          0
MIS:          0
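Since the question on the table is whether eth0 shares an interrupt line, the listing above can be checked mechanically. Here is a small Python sketch that reports every IRQ with more than one handler. It assumes the layout shown in this message (one count column per CPU, a single-token controller name, then a comma-separated handler list); other kernel versions print this file differently, and shared_irqs is a name made up for this example.

```python
# A subset of the /proc/interrupts listing from this message.
PROC_INTERRUPTS = """\
           CPU0       CPU1
  0:    7716253          0    IO-APIC-edge  timer
 16:      70540          0   IO-APIC-level  uhci_hcd:usb4, eth0
 17:          2          0   IO-APIC-level  ehci_hcd:usb1
 18:          0          0   IO-APIC-level  uhci_hcd:usb2, uhci_hcd:usb5
 19:         90          0   IO-APIC-level  uhci_hcd:usb3
NMI:          0          0
"""


def shared_irqs(text):
    """Return {irq_number: [handlers]} for IRQs that have several handlers."""
    lines = text.splitlines()
    cpu_count = len(lines[0].split())  # header row lists one name per CPU
    shared = {}
    for line in lines[1:]:
        irq, _, rest = line.partition(":")
        irq = irq.strip()
        if not irq.isdigit():  # skip NMI, LOC, ERR, MIS summary rows
            continue
        # Expected fields: per-CPU counts, controller name, handler list.
        parts = rest.split(None, cpu_count + 1)
        if len(parts) < cpu_count + 2:
            continue
        handlers = [name.strip() for name in parts[-1].split(",")]
        if len(handlers) > 1:
            shared[int(irq)] = handlers
    return shared


for irq, handlers in shared_irqs(PROC_INTERRUPTS).items():
    print(irq, handlers)
```

On the listing above this reports IRQ 16 (uhci_hcd:usb4 and eth0) and IRQ 18 (two USB controllers).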
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: e1000 Detected Tx Unit Hang
2006-09-03 23:37 ` Paul Aviles
@ 2006-09-05 16:09 ` Jesse Brandeburg
2006-09-06 1:33 ` Paul Aviles
` (2 more replies)
0 siblings, 3 replies; 12+ messages in thread
From: Jesse Brandeburg @ 2006-09-05 16:09 UTC (permalink / raw)
To: Paul Aviles; +Cc: netdev
On 9/3/06, Paul Aviles <paul.aviles@palei.com> wrote:
> Hey Jesse, thanks for your reply. Here is the stuff on /procs. The weird
no problem,
> part is that I have several other identical systems and only one is
> affected. Today I moved the hard drive to another similar system and I am
> not seeing the problem so I am wondering if is something maybe wrong with
> the card eeprom? Is there a way to check that?
I doubt it is an EEPROM problem. You can dump the EEPROMs with
ethtool -e eth0 on both machines and compare them. It is odd that only
one system is having the problem. Could it be that the hardware on
that box is having issues? Are you sure the machines are running the
same BIOS version with the same settings? Any overclocking?
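One way to do the comparison is to save the output of ethtool -e eth0 from each machine to a text file and diff the two. The Python sketch below uses the standard difflib module. The sample strings are placeholders, not real EEPROM contents, and diff_dumps is a hypothetical helper name. Some differences between two cards are presumably expected (each NIC has its own MAC address, for instance), so a non-empty diff does not by itself mean a fault.

```python
import difflib


def diff_dumps(dump_a, dump_b, name_a="good-host", name_b="bad-host"):
    """Return unified-diff lines between two saved text dumps."""
    return list(difflib.unified_diff(
        dump_a.splitlines(), dump_b.splitlines(),
        fromfile=name_a, tofile=name_b, lineterm=""))


# Placeholder text standing in for two saved dumps; NOT real EEPROM data.
dump_good = "line one\nline two\nline three"
dump_bad = "line one\nline 2\nline three"

for line in diff_dumps(dump_good, dump_bad):
    print(line)
```

In practice the two strings would be read from the saved files, for example with open(path).read().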
> cat /proc/interrupts
> CPU0 CPU1
> 16: 70540 0 IO-APIC-level uhci_hcd:usb4, eth0
This could contribute to your problem. Were you able to test without NAPI?
Jesse
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: e1000 Detected Tx Unit Hang
2006-09-05 16:09 ` Jesse Brandeburg
@ 2006-09-06 1:33 ` Paul Aviles
2006-09-11 4:03 ` Paul Aviles
2006-09-17 2:05 ` Paul Aviles
2 siblings, 0 replies; 12+ messages in thread
From: Paul Aviles @ 2006-09-06 1:33 UTC (permalink / raw)
To: Jesse Brandeburg; +Cc: netdev
I haven't done the NAPI test yet. These are identical systems altogether; maybe
the CPU is a different stepping at most, but that is all.
The "16: 70540 0 IO-APIC-level uhci_hcd:usb4, eth0" line is the
same on every GS12 I have. No overclocking and same BIOS. Tyan released ver
1.8 about a month ago; I did the upgrade and saw the same effect. Then I thought
about upgrading to 2.6.17.11 just to see if the driver would have any issues,
and nothing changed, same deal. The only way I was able to control it was using a
dummy 10/100 non-managed switch. Then we had no issues.
I will try without NAPI tomorrow (9-6-06) and will report back. My
understanding of NAPI was that it will drop packets by design on overload.
Why would that cause a system lockup?
Are there any other kernel options you would like me to enable to track this
better? If you need remote access to the system I can accommodate that too;
just let me know what time zone you are in so we can schedule it.
Regards,
Paul Aviles
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: e1000 Detected Tx Unit Hang
2006-09-05 16:09 ` Jesse Brandeburg
2006-09-06 1:33 ` Paul Aviles
@ 2006-09-11 4:03 ` Paul Aviles
2006-09-17 2:05 ` Paul Aviles
2 siblings, 0 replies; 12+ messages in thread
From: Paul Aviles @ 2006-09-11 4:03 UTC (permalink / raw)
To: Jesse Brandeburg; +Cc: netdev
Jesse, testing without NAPI, will see how it behaves.
Paul Aviles
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: e1000 Detected Tx Unit Hang
2006-09-05 16:09 ` Jesse Brandeburg
2006-09-06 1:33 ` Paul Aviles
2006-09-11 4:03 ` Paul Aviles
@ 2006-09-17 2:05 ` Paul Aviles
2 siblings, 0 replies; 12+ messages in thread
From: Paul Aviles @ 2006-09-17 2:05 UTC (permalink / raw)
To: Jesse Brandeburg, netdev
Jesse, today the server froze and I was not able to see anything in the logs.
Nothing at all about any error, it just plain froze. Just in case, this is a
different unit altogether: still the same model as the units having the Tx
Unit Hang, but with different memory, motherboard and CPU. The only thing that
is the same is the hard drive, a regular IDE...
The one thing I noticed that is very weird, to me at least, is that after
powering off the unit following the crash and rebooting it, I saw some lines like
this in the logs:
Sep 16 11:08:03 www kernel: checking if image is initramfs... it is
Sep 16 07:05:19 www sysctl: kernel.msgmnb = 65536
The odd part is the difference in the time stamps between one entry and the very
next one in the log. Any ideas what can cause this? Also, is there any way to get a
dump, or some way to prevent the system from locking up without any log entries?
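To put a number on that jump, the two syslog stamps can be subtracted directly. The Python sketch below does only that arithmetic. The year 2006 is taken from the date of this message (syslog stamps carry no year), and gap_seconds is a name invented for the example. A gap close to a whole number of hours might point at a clock or timezone setting changing during boot rather than at real elapsed time, but that is a guess and not something established in this thread.

```python
from datetime import datetime


def gap_seconds(first, second, year=2006):
    """Seconds from one syslog-style stamp to the next (negative = backwards)."""
    fmt = "%Y %b %d %H:%M:%S"
    start = datetime.strptime(f"{year} {first}", fmt)
    end = datetime.strptime(f"{year} {second}", fmt)
    return (end - start).total_seconds()


# The two consecutive log entries quoted above.
gap = gap_seconds("Sep 16 11:08:03", "Sep 16 07:05:19")
hours, remainder = divmod(abs(gap), 3600)
print(gap, int(hours), "h", int(remainder), "s")
```

The result is negative because the second entry carries an earlier time than the first.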
Regards,
Paul
^ permalink raw reply [flat|nested] 12+ messages in thread
[parent not found: <000f01c6ce49$affd37e0$3224050a@avilespaxp>]
* Re: e1000 Detected Tx Unit Hang
[not found] <000f01c6ce49$affd37e0$3224050a@avilespaxp>
@ 2006-09-02 5:45 ` Auke Kok
0 siblings, 0 replies; 12+ messages in thread
From: Auke Kok @ 2006-09-02 5:45 UTC (permalink / raw)
To: Paul Aviles; +Cc: linux-kernel, NetDev
Paul Aviles wrote:
> I am getting "e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang"
> using stock 2.6.17.11, 2.6.17.5 or 2.6.17.4 kernels on centos 4.3.
>
> The server is a Tyan GS10 and is connected to a Netgear GS724T Gig
> switch. I can easily reproduce the problem by trying to do a large ftp
> transfer to the server. It does not happen if the server is connected to
> a dummy 100 Mb switch, only when is connected to the Gig switch.
> I have also tried the options line below disabling tso, tx and rx in the
> modprobe.conf without any luck.
>
> options e1000 XsumRX=0 Speed=1000 Duplex=2 InterruptThrottleRate=0
> FlowControl=3 RxDescriptors=4096 TxDescriptors=4096 RxIntDelay=0
> TxIntDelay=0
>
> in /var/log/kernel I get the following...
>
> Sep 1 23:53:01 www kernel: e1000: eth0: e1000_clean_tx_irq: Detected Tx
> Unit Hang
> Sep 1 23:53:01 www kernel: Tx Queue <0>
> Sep 1 23:53:01 www kernel: TDH <4c4>
> Sep 1 23:53:01 www kernel: TDT <4c9>
> Sep 1 23:53:01 www kernel: next_to_use <4c9>
> Sep 1 23:53:01 www kernel: next_to_clean <4c4>
> Sep 1 23:53:01 www kernel: buffer_info[next_to_clean]
> Sep 1 23:53:01 www kernel: time_stamp <ffff9c60>
> Sep 1 23:53:01 www kernel: next_to_watch <4c4>
> Sep 1 23:53:01 www kernel: jiffies <ffff9d96>
> Sep 1 23:53:01 www kernel: next_to_watch.status <0>
> .
> repeats the same as above a few times....
> .
> Sep 1 23:53:10 www kernel: NETDEV WATCHDOG: eth0: transmit timed out
> Sep 1 23:53:13 www kernel: e1000: eth0: e1000_watchdog_task: NIC Link
> is Up 1000 Mbps Full Duplex
>
> then the server locks up, no response from the keyboard at all and must
> be forced down with a power kill.
>
> Here is my driver info,
>
> driver: e1000
> version: 7.0.33-k2-NAPI
> firmware-version: N/A
> bus-info: 0000:02:01.0
>
> What else could I check?
[adding netdev to cc, this is a NET issue]
This is a known issue and there are several discussions and bugs filed on this.
Please read this one where most is documented, and also the netdev
http://sourceforge.net/tracker/index.php?func=detail&aid=1463045&group_id=42302&atid=447449
more links and information available on http://e1000.sf.net/
Your debugging information might be needed and helpful, so please take the
trouble of digging in the previous bugreports and reporting anything that might
be relevant there.
The full lockup is certainly not good, but should not necessarily be related to
the tx hang (or the cause of that). It is likely that interrupt sharing might
be a problem here; what kind of e1000 nic is this? lspci -vv?
Cheers,
Auke
^ permalink raw reply [flat|nested] 12+ messages in thread