Re: e1000 Detected Tx Unit Hang

netdev.vger.kernel.org archive mirror

* Re: e1000 Detected Tx Unit Hang
       [not found] <000f01c6ce49$affd37e0$3224050a@avilespaxp>
@ 2006-09-02  5:45 ` Auke Kok
  0 siblings, 0 replies; 12+ messages in thread
From: Auke Kok @ 2006-09-02  5:45 UTC (permalink / raw)
  To: Paul Aviles; +Cc: linux-kernel, NetDev

Paul Aviles wrote:
> I am getting "e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang"
> using stock 2.6.17.11, 2.6.17.5 or 2.6.17.4 kernels on CentOS 4.3.
>
> The server is a Tyan GS10 and is connected to a Netgear GS724T Gig
> switch. I can easily reproduce the problem by trying to do a large ftp
> transfer to the server. It does not happen if the server is connected to
> a dummy 100 Mb switch, only when it is connected to the Gig switch.
> I have also tried the options line below, disabling tso, tx and rx in
> modprobe.conf, without any luck.
> 
> options e1000 XsumRX=0 Speed=1000 Duplex=2 InterruptThrottleRate=0 FlowControl=3 RxDescriptors=4096 TxDescriptors=4096 RxIntDelay=0 TxIntDelay=0
> 
> in /var/log/kernel I get the following...
> 
> Sep  1 23:53:01 www kernel: e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
> Sep  1 23:53:01 www kernel:   Tx Queue             <0>
> Sep  1 23:53:01 www kernel:   TDH                  <4c4>
> Sep  1 23:53:01 www kernel:   TDT                  <4c9>
> Sep  1 23:53:01 www kernel:   next_to_use          <4c9>
> Sep  1 23:53:01 www kernel:   next_to_clean        <4c4>
> Sep  1 23:53:01 www kernel: buffer_info[next_to_clean]
> Sep  1 23:53:01 www kernel:   time_stamp           <ffff9c60>
> Sep  1 23:53:01 www kernel:   next_to_watch        <4c4>
> Sep  1 23:53:01 www kernel:   jiffies              <ffff9d96>
> Sep  1 23:53:01 www kernel:   next_to_watch.status <0>
> .
> repeats the same as above a few times....
> .
> Sep  1 23:53:10 www kernel: NETDEV WATCHDOG: eth0: transmit timed out
> Sep  1 23:53:13 www kernel: e1000: eth0: e1000_watchdog_task: NIC Link is Up 1000 Mbps Full Duplex
> 
> then the server locks up, no response from the keyboard at all and must 
> be forced down with a power kill.
> 
> Here is my driver info,
> 
> driver: e1000
> version: 7.0.33-k2-NAPI
> firmware-version: N/A
> bus-info: 0000:02:01.0
> 
> What else could I check?

[adding netdev to cc, this is a NET issue]

This is a known issue and there are several discussions and bugs filed on this.
Please read this one, where most of it is documented, and also the netdev
list archives:

http://sourceforge.net/tracker/index.php?func=detail&aid=1463045&group_id=42302&atid=447449

more links and information available on http://e1000.sf.net/

Your debugging information might be needed and helpful, so please take the
trouble to dig through the previous bug reports and report anything that might
be relevant there.

The full lockup is certainly not good, but it is not necessarily related to
the tx hang (or the cause of it). Interrupt sharing might be a problem here;
what kind of e1000 NIC is this? Can you send the output of lspci -vv?

Cheers,

Auke


* e1000 Detected Tx Unit Hang
@ 2006-09-02 14:39 Paul Aviles
  2006-09-03 17:45 ` Jesse Brandeburg
  0 siblings, 1 reply; 12+ messages in thread
From: Paul Aviles @ 2006-09-02 14:39 UTC (permalink / raw)
  To: netdev

I am getting "e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang" using
stock 2.6.17.11, 2.6.17.5 or 2.6.17.4 kernels on CentOS 4.3.

The server is a Tyan GS12 (82541GI/PI and 82547GI) and is connected to a
Netgear GS724T Gig switch. I can easily reproduce the problem by trying to
do a large ftp transfer to the server. It does not happen if the server is
connected to a dummy 100 Mb switch, only when it is connected to the Gig
switch.
I have also tried the options line below, disabling tso, tx and rx in
modprobe.conf, without any luck.

options e1000 XsumRX=0 Speed=1000 Duplex=2 InterruptThrottleRate=0 FlowControl=3 RxDescriptors=4096 TxDescriptors=4096 RxIntDelay=0 TxIntDelay=0

 in /var/log/kernel I get the following...

Sep  1 23:53:01 www kernel: e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
Sep  1 23:53:01 www kernel:   Tx Queue             <0>
Sep  1 23:53:01 www kernel:   TDH                  <4c4>
Sep  1 23:53:01 www kernel:   TDT                  <4c9>
Sep  1 23:53:01 www kernel:   next_to_use          <4c9>
Sep  1 23:53:01 www kernel:   next_to_clean        <4c4>
Sep  1 23:53:01 www kernel: buffer_info[next_to_clean]
Sep  1 23:53:01 www kernel:   time_stamp           <ffff9c60>
Sep  1 23:53:01 www kernel:   next_to_watch        <4c4>
Sep  1 23:53:01 www kernel:   jiffies              <ffff9d96>
Sep  1 23:53:01 www kernel:   next_to_watch.status <0>
.
repeats the same as above a few times....
.
Sep  1 23:53:10 www kernel: NETDEV WATCHDOG: eth0: transmit timed out
Sep  1 23:53:13 www kernel: e1000: eth0: e1000_watchdog_task: NIC Link is Up 1000 Mbps Full Duplex

 then the server locks up, no response from the keyboard at all and must be 
forced down with a power kill. The suggested tips on how to deal with this 
issue are not working so if I can help troubleshoot this let me know.
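
For anyone decoding the dump above, here is a minimal Python sketch (not part of the driver) that parses the "name <hex>" pairs and derives two numbers: how many descriptors sit between TDH and TDT, and how many jiffies have passed since time_stamp. The function names, the 32-bit jiffies width, and the use of the TxDescriptors=4096 setting as the ring size are assumptions made for illustration.

```python
import re

# Values copied from the hang report above (hex, as printed by the driver).
DUMP = """\
TDH                  <4c4>
TDT                  <4c9>
next_to_use          <4c9>
next_to_clean        <4c4>
time_stamp           <ffff9c60>
next_to_watch        <4c4>
jiffies              <ffff9d96>
next_to_watch.status <0>
"""


def parse_dump(text):
    """Return {field_name: int} for every 'name <hex>' pair in the text."""
    pairs = re.findall(r"([\w.]+)\s+<([0-9a-fA-F]+)>", text)
    return {name: int(value, 16) for name, value in pairs}


def summarize(fields, ring_size=4096, jiffies_bits=32):
    """Return (descriptors between head and tail, jiffies since time_stamp).

    ring_size is assumed to match TxDescriptors=4096 from the options line;
    jiffies_bits=32 is assumed because the values print as 8 hex digits.
    Both subtractions are done modulo the counter size so wraparound works.
    """
    pending = (fields["TDT"] - fields["TDH"]) % ring_size
    age = (fields["jiffies"] - fields["time_stamp"]) % (1 << jiffies_bits)
    return pending, age


fields = parse_dump(DUMP)
pending, age = summarize(fields)
print(f"descriptors between TDH and TDT: {pending}")
print(f"jiffies since time_stamp:        {age}")
```

For this dump the sketch reports 5 descriptors between head and tail and an age of 310 jiffies. Converting jiffies to seconds needs the kernel's HZ value, which the report does not state.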

 Here is my system info,

 driver: e1000
version: 7.0.33-k2-NAPI
firmware-version: N/A
bus-info: 0000:02:01.0

lspci -vv output below..

00:00.0 Host bridge: Intel Corporation 82875P/E7210 Memory Controller Hub (rev 02)
        Subsystem: Intel Corporation 82875P/E7210 Memory Controller Hub
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B-
        Status: Cap+ 66Mhz- UDF- FastB2B+ ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort+ >SERR- <PERR-
        Latency: 0
        Region 0: Memory at 90000000 (32-bit, prefetchable) [size=128M]
        Capabilities: [e4] Vendor Specific Information
        Capabilities: [a0] AGP version 3.0
                Status: RQ=32 Iso- ArqSz=2 Cal=0 SBA+ ITACoh- GART64- HTrans- 64bit- FW+ AGP3- Rate=x1,x2,x4
                Command: RQ=1 ArqSz=0 Cal=0 SBA- AGP- GART64- 64bit- FW- Rate=<none>

00:01.0 PCI bridge: Intel Corporation 82875P Processor to AGP Controller (rev 02) (prog-if 00 [Normal decode])
        Control: I/O- Mem- BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
        Status: Cap- 66Mhz+ UDF- FastB2B+ ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR-
        Latency: 64
        Bus: primary=00, secondary=01, subordinate=01, sec-latency=0
        Secondary status: 66Mhz+ FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort+ <SERR- <PERR-
        BridgeCtl: Parity- SERR- NoISA- VGA- MAbort- >Reset- FastB2B-

00:03.0 PCI bridge: Intel Corporation 82875P/E7210 Processor to PCI to CSA Bridge (rev 02) (prog-if 00 [Normal decode])
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B-
        Status: Cap- 66Mhz+ UDF- FastB2B+ ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR-
        Latency: 32
        Bus: primary=00, secondary=02, subordinate=02, sec-latency=0
        I/O behind bridge: 00002000-00002fff
        Memory behind bridge: fc100000-fc1fffff
        Secondary status: 66Mhz+ FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- <SERR- <PERR-
        BridgeCtl: Parity- SERR- NoISA+ VGA- MAbort- >Reset- FastB2B-

00:1d.0 USB Controller: Intel Corporation 82801EB/ER (ICH5/ICH5R) USB UHCI Controller #1 (rev 02) (prog-if 00 [UHCI])
        Subsystem: Intel Corporation: Unknown device 24c0
        Control: I/O+ Mem- BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
        Status: Cap- 66Mhz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
        Latency: 0
        Interrupt: pin A routed to IRQ 18
        Region 4: I/O ports at 1400 [size=32]

00:1d.1 USB Controller: Intel Corporation 82801EB/ER (ICH5/ICH5R) USB UHCI Controller #2 (rev 02) (prog-if 00 [UHCI])
        Subsystem: Intel Corporation: Unknown device 24c0
        Control: I/O+ Mem- BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
        Status: Cap- 66Mhz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
        Latency: 0
        Interrupt: pin B routed to IRQ 19
        Region 4: I/O ports at 1420 [size=32]

00:1d.2 USB Controller: Intel Corporation 82801EB/ER (ICH5/ICH5R) USB UHCI Controller #3 (rev 02) (prog-if 00 [UHCI])
        Subsystem: Intel Corporation: Unknown device 24c0
        Control: I/O+ Mem- BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
        Status: Cap- 66Mhz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
        Latency: 0
        Interrupt: pin C routed to IRQ 16
        Region 4: I/O ports at 1440 [size=32]

00:1d.3 USB Controller: Intel Corporation 82801EB/ER (ICH5/ICH5R) USB UHCI Controller #4 (rev 02) (prog-if 00 [UHCI])
        Subsystem: Intel Corporation: Unknown device 24c0
        Control: I/O+ Mem- BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
        Status: Cap- 66Mhz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
        Latency: 0
        Interrupt: pin A routed to IRQ 18
        Region 4: I/O ports at 1460 [size=32]

00:1d.7 USB Controller: Intel Corporation 82801EB/ER (ICH5/ICH5R) USB2 EHCI Controller (rev 02) (prog-if 20 [EHCI])
        Subsystem: Intel Corporation: Unknown device 24d0
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B-
        Status: Cap+ 66Mhz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
        Latency: 0
        Interrupt: pin D routed to IRQ 17
        Region 0: Memory at fc000000 (32-bit, non-prefetchable) [size=1K]
        Capabilities: [50] Power Management version 2
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=375mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
                Status: D0 PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [58] Debug port

00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev c2) (prog-if 00 [Normal decode])
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B-
        Status: Cap- 66Mhz- UDF- FastB2B+ ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR-
        Latency: 0
        Bus: primary=00, secondary=03, subordinate=03, sec-latency=32
        I/O behind bridge: 00003000-00003fff
        Memory behind bridge: fc200000-fdffffff
        Prefetchable memory behind bridge: 88000000-880fffff
        Secondary status: 66Mhz- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- <SERR- <PERR-
        BridgeCtl: Parity- SERR- NoISA+ VGA+ MAbort- >Reset- FastB2B-

00:1f.0 ISA bridge: Intel Corporation 82801EB/ER (ICH5/ICH5R) LPC Interface Bridge (rev 02)
        Control: I/O+ Mem+ BusMaster+ SpecCycle+ MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
        Status: Cap- 66Mhz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
        Latency: 0

00:1f.1 IDE interface: Intel Corporation 82801EB/ER (ICH5/ICH5R) IDE Controller (rev 02) (prog-if 8a [Master SecP PriP])
        Subsystem: Intel Corporation: Unknown device 24c0
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
        Status: Cap- 66Mhz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
        Latency: 0
        Interrupt: pin A routed to IRQ 16
        Region 0: I/O ports at <unassigned>
        Region 1: I/O ports at <unassigned>
        Region 2: I/O ports at <unassigned>
        Region 3: I/O ports at <unassigned>
        Region 4: I/O ports at 14a0 [size=16]
        Region 5: Memory at 88100000 (32-bit, non-prefetchable) [size=1K]

00:1f.3 SMBus: Intel Corporation 82801EB/ER (ICH5/ICH5R) SMBus Controller (rev 02)
        Subsystem: Intel Corporation: Unknown device 24c0
        Control: I/O+ Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
        Status: Cap- 66Mhz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
        Interrupt: pin B routed to IRQ 10
        Region 4: I/O ports at 1480 [size=32]

02:01.0 Ethernet controller: Intel Corporation 82547GI Gigabit Ethernet Controller
        Subsystem: Intel Corporation PRO/1000 CT Network Connection
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
        Status: Cap+ 66Mhz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
        Latency: 0 (63750ns min), Cache Line Size 08
        Interrupt: pin A routed to IRQ 16
        Region 0: Memory at fc100000 (32-bit, non-prefetchable) [size=128K]
        Region 2: I/O ports at 2000 [size=32]
        Capabilities: [dc] Power Management version 2
                Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
                Status: D0 PME-Enable- DSel=0 DScale=1 PME-

03:00.0 VGA compatible controller: ATI Technologies Inc Rage XL (rev 27) (prog-if 00 [VGA])
        Subsystem: ATI Technologies Inc Rage XL
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping+ SERR- FastB2B-
        Status: Cap+ 66Mhz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
        Latency: 66 (2000ns min), Cache Line Size 08
        Interrupt: pin A routed to IRQ 5
        Region 0: Memory at fd000000 (32-bit, non-prefetchable) [size=16M]
        Region 1: I/O ports at 3000 [size=256]
        Region 2: Memory at fc240000 (32-bit, non-prefetchable) [size=4K]
        [virtual] Expansion ROM at 88000000 [disabled] [size=128K]
        Capabilities: [5c] Power Management version 2
                Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 PME-Enable- DSel=0 DScale=0 PME-

03:02.0 Ethernet controller: Intel Corporation 82541GI/PI Gigabit Ethernet Controller
        Subsystem: Intel Corporation PRO/1000 MT Network Connection
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop- ParErr- Stepping- SERR+ FastB2B-
        Status: Cap+ 66Mhz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
        Latency: 52 (63750ns min), Cache Line Size 08
        Interrupt: pin A routed to IRQ 16
        Region 0: Memory at fc220000 (32-bit, non-prefetchable) [size=128K]
        Region 1: Memory at fc200000 (32-bit, non-prefetchable) [size=128K]
        Region 2: I/O ports at 3400 [size=64]
        [virtual] Expansion ROM at 88020000 [disabled] [size=128K]
        Capabilities: [dc] Power Management version 2
                Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
                Status: D0 PME-Enable- DSel=0 DScale=1 PME-
        Capabilities: [e4] PCI-X non-bridge device.
                Command: DPERE- ERO+ RBC=0 OST=0
                Status: Bus=0 Dev=0 Func=0 64bit- 133MHz- SCD- USC-, DC=simple, DMMRBC=2, DMOST=0, DMCRS=0, RSCEM-

Regards,

Paul Aviles 




* Re: e1000 Detected Tx Unit Hang
  2006-09-02 14:39 Paul Aviles
@ 2006-09-03 17:45 ` Jesse Brandeburg
  2006-09-03 23:37   ` Paul Aviles
  0 siblings, 1 reply; 12+ messages in thread
From: Jesse Brandeburg @ 2006-09-03 17:45 UTC (permalink / raw)
  To: Paul Aviles; +Cc: netdev

On 9/2/06, Paul Aviles <paul.aviles@palei.com> wrote:
> I am getting "e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang"  using
> stock 2.6.17.11, 2.6.17.5 or 2.6.17.4 kernels on centos 4.3.
>
>  The server is a Tyan GS12 ( 82541GI/PI and 82547GI) and is connected to a
> Netgear GS724T Gig  switch. I can easily reproduce the problem by trying to
> do a large ftp transfer to the server. It does not happen if the server is
> connected to a dummy 100 Mb switch, only when is connected to the Gig
> switch.
> I have also tried the options line below disabling tso, tx and rx in the
> modprobe.conf without any luck.

Hi Paul, sorry to hear about your problem.  You're getting hangs on
the 82547, right?  Can you send the output of cat /proc/interrupts?
I'm curious whether you are sharing interrupts while running NAPI.

Also, please try the driver without CONFIG_E1000_NAPI enabled in your
kernel .config, and let us know the results.
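
As a quick way to confirm what a given build has set, here is a small Python sketch that checks a kernel .config for an option. It relies only on the standard .config syntax ("CONFIG_FOO=y" for enabled, "# CONFIG_FOO is not set" for disabled); the function name and the two sample snippets are made up for illustration.

```python
def kconfig_enabled(config_text, option):
    """Return True if `option` is set to y or m in the given .config text."""
    prefix = option + "="
    for line in config_text.splitlines():
        line = line.strip()
        # Disabled options appear as comments ("# CONFIG_FOO is not set"),
        # so only an exact "OPTION=" prefix counts as a setting.
        if line.startswith(prefix):
            return line[len(prefix):] in ("y", "m")
    return False


# Hypothetical excerpts from two kernel configurations.
WITH_NAPI = "CONFIG_E1000=m\nCONFIG_E1000_NAPI=y\n"
WITHOUT_NAPI = "CONFIG_E1000=m\n# CONFIG_E1000_NAPI is not set\n"

print(kconfig_enabled(WITH_NAPI, "CONFIG_E1000_NAPI"))
print(kconfig_enabled(WITHOUT_NAPI, "CONFIG_E1000_NAPI"))
```

On a real system the text would come from the .config used to build the running kernel; where that file lives depends on the distribution.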

Someone has posted (what they think is) a theoretical problem with
irq_sem on the 82547 at e1000.sf.net and I haven't had a chance to
figure it out yet.

Jesse


* Re: e1000 Detected Tx Unit Hang
  2006-09-03 17:45 ` Jesse Brandeburg
@ 2006-09-03 23:37   ` Paul Aviles
  2006-09-05 16:09     ` Jesse Brandeburg
  0 siblings, 1 reply; 12+ messages in thread
From: Paul Aviles @ 2006-09-03 23:37 UTC (permalink / raw)
  To: Jesse Brandeburg; +Cc: netdev

Hey Jesse, thanks for your reply. Here is the output from /proc. The weird
part is that I have several other identical systems and only one is
affected. Today I moved the hard drive to another similar system and I am
not seeing the problem, so I am wondering whether something is wrong with
the card's EEPROM. Is there a way to check that?

Regards,

Paul

 cat /proc/interrupts
           CPU0       CPU1
  0:    7716253          0    IO-APIC-edge  timer
  3:      11538          0    IO-APIC-edge  serial
  8:          1          0    IO-APIC-edge  rtc
  9:          0          0   IO-APIC-level  acpi
 14:      93406          0    IO-APIC-edge  ide0
 16:      70540          0   IO-APIC-level  uhci_hcd:usb4, eth0
 17:          2          0   IO-APIC-level  ehci_hcd:usb1
 18:          0          0   IO-APIC-level  uhci_hcd:usb2, uhci_hcd:usb5
 19:         90          0   IO-APIC-level  uhci_hcd:usb3
NMI:          0          0
LOC:    7715839    7715838
ERR:          0
MIS:          0
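
To spot shared interrupt lines in output like the above without reading it by eye, here is a minimal Python sketch. It assumes the layout shown here: one count column per CPU, then a single-token controller type, then a comma-separated handler list. Newer kernels print extra columns, so treat the parsing as specific to this format. The function name is made up for illustration.

```python
# Excerpt of the /proc/interrupts output above.
SAMPLE = """\
           CPU0       CPU1
  0:    7716253          0    IO-APIC-edge  timer
 16:      70540          0   IO-APIC-level  uhci_hcd:usb4, eth0
 17:          2          0   IO-APIC-level  ehci_hcd:usb1
 18:          0          0   IO-APIC-level  uhci_hcd:usb2, uhci_hcd:usb5
NMI:          0          0
"""


def shared_irqs(text):
    """Return {irq_number: [handlers]} for IRQs with more than one handler."""
    lines = text.splitlines()
    ncpu = len(lines[0].split())          # header row: one name per CPU
    shared = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        label = label.strip()
        if not label.isdigit():
            continue                      # skip NMI, LOC, ERR, MIS rows
        # Split off the per-CPU counts and the controller type; whatever
        # remains is the handler list.
        parts = rest.split(None, ncpu + 1)
        if len(parts) < ncpu + 2:
            continue
        handlers = [h.strip() for h in parts[ncpu + 1].split(",")]
        if len(handlers) > 1:
            shared[int(label)] = handlers
    return shared


for irq, handlers in sorted(shared_irqs(SAMPLE).items()):
    print(f"IRQ {irq} is shared by: {', '.join(handlers)}")
```

Run on the excerpt, this flags IRQ 16 (uhci_hcd:usb4 and eth0) and IRQ 18 (two UHCI controllers).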


* Re: e1000 Detected Tx Unit Hang
  2006-09-03 23:37   ` Paul Aviles
@ 2006-09-05 16:09     ` Jesse Brandeburg
  2006-09-06  1:33       ` Paul Aviles
                         ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Jesse Brandeburg @ 2006-09-05 16:09 UTC (permalink / raw)
  To: Paul Aviles; +Cc: netdev

On 9/3/06, Paul Aviles <paul.aviles@palei.com> wrote:
> Hey Jesse, thanks for your reply. Here is the stuff on /procs. The weird
no problem,

> part is that I have several other identical systems and only one is
> affected. Today I moved the hard drive to another similar system and I am
> not seeing the problem so I am wondering if is something maybe wrong with
> the card eeprom? Is there a way to check that?

I doubt it is an EEPROM problem.  You can dump the EEPROMs with
ethtool -e eth0 on both machines and compare them.  Odd that only
one system is having the problem.  Could it be that the hardware on
that box is having issues?  Are you sure the machines are running the
same BIOS version with the same settings?  Any overclocking?
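
A rough Python sketch of the comparison step, assuming the two dumps were saved as text and that each data line looks like "0x0000:  00 e0 81 ..." (offset, then hex bytes). The sample bytes below are invented, not taken from any real card, and the function names are made up for illustration.

```python
import re

# One data line: hex offset, optional colon, then space-separated hex bytes.
LINE = re.compile(r"^0x([0-9a-fA-F]+):?\s+((?:[0-9a-fA-F]{2}\s*)+)$")


def parse_eeprom(text):
    """Map byte offset -> value for every data line of a hex dump."""
    data = {}
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if not match:
            continue                      # header and separator lines
        base = int(match.group(1), 16)
        for i, byte in enumerate(match.group(2).split()):
            data[base + i] = int(byte, 16)
    return data


def diff_eeprom(dump_a, dump_b):
    """Return [(offset, byte_a, byte_b)] for every offset that differs."""
    a, b = parse_eeprom(dump_a), parse_eeprom(dump_b)
    return [(off, a.get(off), b.get(off))
            for off in sorted(set(a) | set(b))
            if a.get(off) != b.get(off)]


# Invented example dumps (hypothetical contents).
GOOD = ("Offset\t\tValues\n------\t\t------\n"
        "0x0000:\t\t00 e0 81 11 22 33 10 02\n"
        "0x0008:\t\tff ff ff ff 00 00 00 00\n")
BAD = ("Offset\t\tValues\n------\t\t------\n"
       "0x0000:\t\t00 e0 81 11 22 44 10 02\n"
       "0x0008:\t\tff ff ff ff 00 00 00 01\n")

for offset, byte_a, byte_b in diff_eeprom(GOOD, BAD):
    print(f"offset 0x{offset:04x}: {byte_a:02x} != {byte_b:02x}")
```

When comparing two different cards, some differences are expected: the first six bytes normally hold the MAC address, so only differences elsewhere are likely to be interesting.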

>  cat /proc/interrupts
>            CPU0       CPU1
>  16:      70540          0   IO-APIC-level  uhci_hcd:usb4, eth0

This could contribute to your problem. Were you able to test without NAPI?

Jesse


* Re: e1000 Detected Tx Unit Hang
  2006-09-05 16:09     ` Jesse Brandeburg
@ 2006-09-06  1:33       ` Paul Aviles
  2006-09-11  4:03       ` Paul Aviles
  2006-09-17  2:05       ` Paul Aviles
  2 siblings, 0 replies; 12+ messages in thread
From: Paul Aviles @ 2006-09-06  1:33 UTC (permalink / raw)
  To: Jesse Brandeburg; +Cc: netdev

I haven't tested without NAPI yet. These are identical systems altogether; at
most the CPU is a different stepping, but that is all.
The "16:      70540          0   IO-APIC-level  uhci_hcd:usb4, eth0" line is the
same on every GS12 I have. No overclocking and same BIOS. Tyan released version
1.8 about a month ago; I did the upgrade and saw the same effect. Then I
upgraded to 2.6.17.11 just to see whether the driver would still have issues,
and nothing changed, same deal. The only way I was able to control it was by
using a dummy 10/100 unmanaged switch. Then we had no issues.

I will try without NAPI tomorrow (9-6-06) and will report back. My
understanding of NAPI was that it drops packets by design on overload.
Why would that cause a system lockup?

Are there any other kernel options you would like me to enable to track this
better? If you need remote access to the system I can accommodate that too;
just let me know what time zone you are in so we can schedule it.

Regards,

Paul Aviles


* Re: e1000 Detected Tx Unit Hang
  2006-09-05 16:09     ` Jesse Brandeburg
  2006-09-06  1:33       ` Paul Aviles
@ 2006-09-11  4:03       ` Paul Aviles
  2006-09-17  2:05       ` Paul Aviles
  2 siblings, 0 replies; 12+ messages in thread
From: Paul Aviles @ 2006-09-11  4:03 UTC (permalink / raw)
  To: Jesse Brandeburg; +Cc: netdev

Jesse, testing without NAPI, will see how it behaves.

Paul Aviles


* Re: e1000 Detected Tx Unit Hang
  2006-09-05 16:09     ` Jesse Brandeburg
  2006-09-06  1:33       ` Paul Aviles
  2006-09-11  4:03       ` Paul Aviles
@ 2006-09-17  2:05       ` Paul Aviles
  2 siblings, 0 replies; 12+ messages in thread
From: Paul Aviles @ 2006-09-17  2:05 UTC (permalink / raw)
  To: Jesse Brandeburg, netdev

Jesse, today the server froze and I was not able to see anything in the logs.
Nothing at all about any error; it just plain froze.  Just in case it matters,
this is a different unit altogether: still the same model as the units having
the Tx Unit Hang, but with different memory, motherboard and CPU. The only
thing that is the same is the hard drive, a regular IDE...

The one thing I noticed that is very weird, to me at least, is that after
powering off the unit following the crash and rebooting it, I saw some lines
like this in the logs..

Sep 16 11:08:03 www kernel: checking if image is initramfs... it is
Sep 16 07:05:19 www sysctl: kernel.msgmnb = 65536

The odd part is the difference in the time stamps between one entry and the
very next one in the log. Any ideas what can cause this? Also, is there any way
to get a dump, or some way to prevent the system from locking up without any
log entries?
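
To find every place where the log goes backwards in time like this, here is a small Python sketch. It only detects such jumps and says nothing about their cause. It assumes the classic syslog prefix ("Mon DD HH:MM:SS") in the first 15 characters of each line and, since that prefix carries no year, takes the year as a parameter; the function name is made up for illustration.

```python
from datetime import datetime

# The two adjacent log lines quoted above.
LOG = """\
Sep 16 11:08:03 www kernel: checking if image is initramfs... it is
Sep 16 07:05:19 www sysctl: kernel.msgmnb = 65536
"""


def backward_jumps(text, year=2006):
    """Return [(previous, current, gap)] wherever a line's timestamp is
    earlier than that of the line before it."""
    jumps = []
    previous = None
    for line in text.splitlines():
        try:
            stamp = datetime.strptime(f"{year} {line[:15]}",
                                      "%Y %b %d %H:%M:%S")
        except ValueError:
            continue                      # line without a syslog timestamp
        if previous is not None and stamp < previous:
            jumps.append((previous, stamp, previous - stamp))
        previous = stamp
    return jumps


for before, after, gap in backward_jumps(LOG):
    print(f"{before:%b %d %H:%M:%S} -> {after:%b %d %H:%M:%S} "
          f"(back by {gap})")
```

For the two lines above, the sketch reports a single backward jump of 4:02:44.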

Regards,

Paul


* e1000: Detected Tx Unit Hang
@ 2008-02-15 22:52 Bernd Schubert
  2008-02-15 23:29 ` Kok, Auke
  0 siblings, 1 reply; 12+ messages in thread
From: Bernd Schubert @ 2008-02-15 22:52 UTC (permalink / raw)
  To: netdev

Hello,

I can't login to one of our servers and just got this in an ipmi sol
session:

[18169.209181] e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
[18169.209183]   Tx Queue             <0>
[18169.209184]   TDH                  <e3>
[18169.209185]   TDT                  <e3>
[18169.209186]   next_to_use          <e3>
[18169.209187]   next_to_clean        <bd>
[18169.209188] buffer_info[next_to_clean]
[18169.209189]   time_stamp           <10043e4d2>
[18169.209190]   next_to_watch        <be>
[18169.209191]   jiffies              <10043e6f6>
[18169.209192]   next_to_watch.status <1>
[18169.256978] e1000: eth2: e1000_clean_tx_irq: Detected Tx Unit Hang
[18169.256979]   Tx Queue             <0>
[18169.256980]   TDH                  <de>
[18169.256982]   TDT                  <de>
[18169.256983]   next_to_use          <de>
[18169.256984]   next_to_clean        <bc>
[18169.256985] buffer_info[next_to_clean]
[18169.256986]   time_stamp           <10043e511>
[18169.256987]   next_to_watch        <bd>
[18169.256988]   jiffies              <10043e701>
[18169.256989]   next_to_watch.status <1>

This is with 2.6.22.18. Is there any chance to recover the system? For some
reasons I would prefer not to reboot now.

Thanks,
Bernd



* Re: e1000: Detected Tx Unit Hang
  2008-02-15 22:52 e1000: " Bernd Schubert
@ 2008-02-15 23:29 ` Kok, Auke
  2008-02-16  0:26   ` Bernd Schubert
  0 siblings, 1 reply; 12+ messages in thread
From: Kok, Auke @ 2008-02-15 23:29 UTC (permalink / raw)
  To: Bernd Schubert; +Cc: netdev

Bernd Schubert wrote:
> Hello,
> 
> I can't login to one of our servers and just got this in an ipmi sol
> session:
> 
> [18169.209181] e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
> [18169.209183]   Tx Queue             <0>
> [18169.209184]   TDH                  <e3>
> [18169.209185]   TDT                  <e3>
> [18169.209186]   next_to_use          <e3>
> [18169.209187]   next_to_clean        <bd>
> [18169.209188] buffer_info[next_to_clean]
> [18169.209189]   time_stamp           <10043e4d2>
> [18169.209190]   next_to_watch        <be>
> [18169.209191]   jiffies              <10043e6f6>
> [18169.209192]   next_to_watch.status <1>
> [18169.256978] e1000: eth2: e1000_clean_tx_irq: Detected Tx Unit Hang
> [18169.256979]   Tx Queue             <0>
> [18169.256980]   TDH                  <de>
> [18169.256982]   TDT                  <de>
> [18169.256983]   next_to_use          <de>
> [18169.256984]   next_to_clean        <bc>
> [18169.256985] buffer_info[next_to_clean]
> [18169.256986]   time_stamp           <10043e511>
> [18169.256987]   next_to_watch        <bd>
> [18169.256988]   jiffies              <10043e701>
> [18169.256989]   next_to_watch.status <1>
> 
> This is with 2.6.22.18. Is there any chance to recover the system? For some
> reasons I would prefer not to reboot now.

If that's all you have then it was a false alarm. There should be a 'netdev timeout
- link reset' following those messages. Can you send some more context on those
messages?

In real tx hang cases, the hardware is reset within 2 seconds, and everything
continues as normal.
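
One possible way to read the two dumps quoted above, sketched in Python as a rough heuristic and not as the driver's own detection logic: when TDH equals TDT the hardware has no descriptors left to fetch, and a status of 1 means the descriptor-done bit is set. The function name and the returned strings are made up for illustration; the bit value 0x1 corresponds to E1000_TXD_STAT_DD in the driver headers.

```python
E1000_TXD_STAT_DD = 0x1  # "descriptor done" bit in the Tx descriptor status


def classify(tdh, tdt, next_to_clean, status):
    """Heuristic reading of a Tx hang dump (an assumption, not driver code)."""
    hw_idle = tdh == tdt                      # hardware ring is empty
    done = bool(status & E1000_TXD_STAT_DD)   # watched descriptor completed
    if hw_idle and done and next_to_clean != tdh:
        return "hardware finished; driver cleanup is lagging"
    if not hw_idle and not done:
        return "hardware has not completed pending descriptors"
    return "inconclusive"


# Values from the eth0 and eth2 dumps quoted above.
print("eth0:", classify(tdh=0xe3, tdt=0xe3, next_to_clean=0xbd, status=0x1))
print("eth2:", classify(tdh=0xde, tdt=0xde, next_to_clean=0xbc, status=0x1))
```

Both dumps fall into the first category under this reading: the hardware had already finished its work when the message was printed.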

Auke


* Re: e1000: Detected Tx Unit Hang
  2008-02-15 23:29 ` Kok, Auke
@ 2008-02-16  0:26   ` Bernd Schubert
  2008-02-19 16:47     ` Kok, Auke
  0 siblings, 1 reply; 12+ messages in thread
From: Bernd Schubert @ 2008-02-16  0:26 UTC (permalink / raw)
  To: Kok, Auke; +Cc: netdev

On Saturday 16 February 2008, Kok, Auke wrote:
> Bernd Schubert wrote:
> > Hello,
> >
> > I can't login to one of our servers and just got this in an ipmi sol
> > session:
> >
> > [18169.209181] e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
> > [18169.209183]   Tx Queue             <0>
> > [18169.209184]   TDH                  <e3>
> > [18169.209185]   TDT                  <e3>
> > [18169.209186]   next_to_use          <e3>
> > [18169.209187]   next_to_clean        <bd>
> > [18169.209188] buffer_info[next_to_clean]
> > [18169.209189]   time_stamp           <10043e4d2>
> > [18169.209190]   next_to_watch        <be>
> > [18169.209191]   jiffies              <10043e6f6>
> > [18169.209192]   next_to_watch.status <1>
> > [18169.256978] e1000: eth2: e1000_clean_tx_irq: Detected Tx Unit Hang
> > [18169.256979]   Tx Queue             <0>
> > [18169.256980]   TDH                  <de>
> > [18169.256982]   TDT                  <de>
> > [18169.256983]   next_to_use          <de>
> > [18169.256984]   next_to_clean        <bc>
> > [18169.256985] buffer_info[next_to_clean]
> > [18169.256986]   time_stamp           <10043e511>
> > [18169.256987]   next_to_watch        <bd>
> > [18169.256988]   jiffies              <10043e701>
> > [18169.256989]   next_to_watch.status <1>
> >
> > This is with 2.6.22.18. Is there any chance to recover the system? For
> > some reasons I would prefer not to reboot now.
>
> if that's all you have then it was a false alarm. there should be a 'netdev
> timeout - link reset' following those messages. can you send some more
> context on those messages?

All I presently know is that there are 20 servers and login doesn't work
any more. sysrq+t shows that it hangs in fuse, which is accessing the
underlying NFS (we are using unionfs-fuse). While I was checking the sysrq+t
output, these e1000 messages suddenly appeared.
Thinking a bit about it, either 2.6.22.18 has an e1000 bug that 2.6.22.X
didn't have (X=16, I think, but I'm not sure), or someone misconfigured the
switch/network environment today.
Hmm, now that I think about the last part, there had already been other
networking problems today, which were supposed to be fixed several hours ago.
Seems they weren't fixed properly.

>
> in real tx hang cases, the hardware is reset within 2 seconds, and
> everything continues as normal.

Thanks, this gives me hope I don't need to reboot the servers (a reboot would
mean I would need to start 60 md-raid rebuilds...).

Thanks,
Bernd


* Re: e1000: Detected Tx Unit Hang
  2008-02-16  0:26   ` Bernd Schubert
@ 2008-02-19 16:47     ` Kok, Auke
  0 siblings, 0 replies; 12+ messages in thread
From: Kok, Auke @ 2008-02-19 16:47 UTC (permalink / raw)
  To: Bernd Schubert; +Cc: netdev

Bernd Schubert wrote:
> On Saturday 16 February 2008, Kok, Auke wrote:
>> Bernd Schubert wrote:
>>> Hello,
>>>
>>> I can't login to one of our servers and just got this in an ipmi sol
>>> session:
>>>
>>> [18169.209181] e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
>>> [18169.209183]   Tx Queue             <0>
>>> [18169.209184]   TDH                  <e3>
>>> [18169.209185]   TDT                  <e3>
>>> [18169.209186]   next_to_use          <e3>
>>> [18169.209187]   next_to_clean        <bd>
>>> [18169.209188] buffer_info[next_to_clean]
>>> [18169.209189]   time_stamp           <10043e4d2>
>>> [18169.209190]   next_to_watch        <be>
>>> [18169.209191]   jiffies              <10043e6f6>
>>> [18169.209192]   next_to_watch.status <1>
>>> [18169.256978] e1000: eth2: e1000_clean_tx_irq: Detected Tx Unit Hang
>>> [18169.256979]   Tx Queue             <0>
>>> [18169.256980]   TDH                  <de>
>>> [18169.256982]   TDT                  <de>
>>> [18169.256983]   next_to_use          <de>
>>> [18169.256984]   next_to_clean        <bc>
>>> [18169.256985] buffer_info[next_to_clean]
>>> [18169.256986]   time_stamp           <10043e511>
>>> [18169.256987]   next_to_watch        <bd>
>>> [18169.256988]   jiffies              <10043e701>
>>> [18169.256989]   next_to_watch.status <1>
>>>
>>> This is with 2.6.22.18. Is there any chance to recover the system? For
>>> some reasons I would prefer not to reboot now.
>> if that's all you have then it was a false alarm. there should be a 'netdev
>> timeout - link reset' following those messages. can you send some more
>> context on those messages?
> 
> All I presently know is that there are 20 servers and login doesn't work
> any more. sysrq+t shows that it hangs in fuse, which is accessing the
> underlying NFS (we are using unionfs-fuse). While I was checking the sysrq+t
> output, these e1000 messages suddenly appeared.
> Thinking a bit about it, either 2.6.22.18 has an e1000 bug that 2.6.22.X
> didn't have (X=16, I think, but I'm not sure), or someone misconfigured the
> switch/network environment today.
> Hmm, now that I think about the last part, there had already been other
> networking problems today, which were supposed to be fixed several hours ago.
> Seems they weren't fixed properly.
> 
>> in real tx hang cases, the hardware is reset within 2 seconds, and
>> everything continues as normal.
> 
> Thanks, this gives me hope I don't need to reboot the servers (a reboot would
> mean I would need to start 60 md-raid rebuilds...).

my first thought after reading this e-mail is that the tx-hang message is just a
symptom of your system not responding or being spinlocked all the time. These TX
hang issues normally do not interfere with system operation at all, and unless
you have continuous TX resets you should be able to log on perfectly fine.

I think you might have hit another kernel bug here... perhaps even one related
to unionfs/fuse, which certainly looks plausible from your problem description.

looking at the changelog for 2.6.22.16->2.6.22.18 I can't see anything relevant
(see
http://git.kernel.org/?p=linux/kernel/git/stable/linux-2.6.22.y.git;a=shortlog),
and there are definitely no e1000 driver changes in that range anyway.

I don't suppose you can do a git-bisect? that would certainly help. I don't think
we can rule out anything just yet here.

At least try to revert some of your systems to the previous kernel version and see
if the problem goes away...

Auke


Thread overview: 12+ messages
     [not found] <000f01c6ce49$affd37e0$3224050a@avilespaxp>
2006-09-02  5:45 ` e1000 Detected Tx Unit Hang Auke Kok
2006-09-02 14:39 Paul Aviles
2006-09-03 17:45 ` Jesse Brandeburg
2006-09-03 23:37   ` Paul Aviles
2006-09-05 16:09     ` Jesse Brandeburg
2006-09-06  1:33       ` Paul Aviles
2006-09-11  4:03       ` Paul Aviles
2006-09-17  2:05       ` Paul Aviles
  -- strict thread matches above, loose matches on Subject: below --
2008-02-15 22:52 e1000: " Bernd Schubert
2008-02-15 23:29 ` Kok, Auke
2008-02-16  0:26   ` Bernd Schubert
2008-02-19 16:47     ` Kok, Auke
