From mboxrd@z Thu Jan 1 00:00:00 1970
From: Ben Greear
Subject: Re: e1000e tx queue timeout in 3.3.0 (bisected to BQL support for e1000e)
Date: Fri, 20 Apr 2012 12:00:32 -0700
Message-ID: <4F91B250.8090509@candelatech.com>
References: <4F909F4B.1010707@candelatech.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Cc: netdev , e1000-devel list , Eric Dumazet
To: Tom Herbert
Return-path:
Received: from mail.candelatech.com ([208.74.158.172]:49742 "EHLO ns3.lanforge.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754973Ab2DTTBM (ORCPT ); Fri, 20 Apr 2012 15:01:12 -0400
In-Reply-To:
Sender: netdev-owner@vger.kernel.org
List-ID:

On 04/19/2012 07:39 PM, Tom Herbert wrote:
> Thanks, will try to reproduce.

I am seeing something similar with the 'igb' driver, though this NIC
also involves a side-driver that does bypass. When I enable/disable
bypass, the links bounce (as expected), and igb reports the same tx
queue timeout that I was seeing with e1000e.

If I revert the igb BQL patch, then it works fine (kernel 3.3.2+).

I did not do an exhaustive bisect, nor have I tested non-bypass igb
hardware, but since reverting the patch fixes the problem....

Maybe there is some fundamental issue with BQL when a NIC resets itself
(in this case, due to the remote port doing a reset).

Maybe we are not properly accounting for pkts cleared from the xmit
queue on reset, or something of that nature?

I'm happy to test patches if someone has suggestions...
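To make the accounting guess above concrete, here is a minimal userspace sketch. It is not kernel code and not the real dql implementation. Every name in it (toy_queue, toy_sent, toy_completed, toy_reset, toy_run) is made up for illustration; they only loosely mirror the roles of netdev_sent_queue(), netdev_completed_queue() and netdev_reset_queue(). Whether the e1000e or igb reset paths actually miss such a call is not established here. The sketch only shows why a byte-limited queue could stay stopped if a reset discards ring contents without reporting them.

```c
#include <stdbool.h>

/* Toy model of a byte-limited tx queue (hypothetical, illustration only). */
struct toy_queue {
	unsigned int limit;    /* max bytes allowed in flight */
	unsigned int inflight; /* bytes handed to hw, not yet reported complete */
	bool stopped;          /* true: the stack may not queue more packets */
};

/* Driver queued 'bytes' to the ring; stop the queue once the limit is hit. */
static void toy_sent(struct toy_queue *q, unsigned int bytes)
{
	q->inflight += bytes;
	if (q->inflight >= q->limit)
		q->stopped = true;
}

/* Hardware reported 'bytes' as transmitted; wake the queue if below limit. */
static void toy_completed(struct toy_queue *q, unsigned int bytes)
{
	q->inflight -= bytes;
	if (q->inflight < q->limit)
		q->stopped = false;
}

/*
 * Adapter reset: the ring contents are thrown away in either case.
 * With 'account' set, the counters are cleared too. Without it, the
 * stale inflight count and stopped flag survive the reset.
 */
static void toy_reset(struct toy_queue *q, bool account)
{
	if (account) {
		q->inflight = 0;
		q->stopped = false;
	}
}

/*
 * Fill the queue to its limit, reset, and report whether the queue is
 * still stopped. After the reset nothing is left in the ring, so no
 * completion will ever arrive to wake a queue that is still stopped.
 */
static bool toy_run(bool account_on_reset)
{
	struct toy_queue q = { .limit = 3000, .inflight = 0, .stopped = false };

	toy_sent(&q, 1500);
	toy_completed(&q, 1500); /* normal traffic: sent, then completed */
	toy_sent(&q, 1500);
	toy_sent(&q, 1500);      /* limit reached, queue stops */
	toy_reset(&q, account_on_reset);
	return q.stopped;
}
```

In this model toy_run(false) leaves the queue stopped with an empty ring, which is the kind of state the netdev watchdog would eventually flag as a transmit timeout, while toy_run(true) leaves it running.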
Thanks,
Ben

>
> Tom
>
> On Thu, Apr 19, 2012 at 4:27 PM, Ben Greear wrote:
>> Test case:
>>
>> Run full duplex traffic (900Mbps rx, 400Mbps tx) UDP traffic
>> (moderate speeds of traffic has issues as well, maybe not as easy to
>> reproduce)
>> reset peer interface
>> ----> tx queue timeout
>>
>>
>> Apr 19 16:12:48 localhost kernel: e1000e: eth2 NIC Link is Down
>> Apr 19 16:12:48 localhost kernel: e1000e 0000:08:00.0: eth2: Reset adapter
>> Apr 19 16:12:48 localhost kernel: e1000e: eth3 NIC Link is Down
>> Apr 19 16:12:50 localhost kernel: e1000e: eth2 NIC Link is Up 1000 Mbps Full
>> Duplex, Flow Control: Rx/Tx
>> Apr 19 16:12:50 localhost kernel: ADDRCONF(NETDEV_CHANGE): eth2: link
>> becomes ready
>> Apr 19 16:12:50 localhost kernel: e1000e: eth3 NIC Link is Up 1000 Mbps Full
>> Duplex, Flow Control: Rx/Tx
>> Apr 19 16:12:50 localhost kernel: ADDRCONF(NETDEV_CHANGE): eth3: link
>> becomes ready
>> Apr 19 16:12:54 localhost /usr/sbin/irqbalance: Load average increasing,
>> re-enabling all cpus for irq balancing
>> Apr 19 16:12:55 localhost kernel: ------------[ cut here ]------------
>> Apr 19 16:12:55 localhost kernel: WARNING: at
>> /home/greearb/git/linux-3.3.dev.y/net/sched/sch_generic.c:256
>> dev_watchdog+0xf4/0x154()
>> Apr 19 16:12:55 localhost kernel: Hardware name: X7DBU
>> Apr 19 16:12:55 localhost kernel: NETDEV WATCHDOG: eth2 (e1000e): transmit
>> queue 0 timed out
>> Apr 19 16:12:55 localhost kernel: Modules linked in: xt_CT iptable_raw 8021q
>> garp stp llc veth ppdev parport_pc lp parport fuse macvlan pktgen iscsi_tcp
>> libiscsi_tcp libiscsi scsi_transport_iscsi lockd w83793 w83627hf hwmon_vid
>> coretemp iTCO_wdt microcode iTCO_vendor_support pcspkr i5k_amb ioatdma
>> i2c_i801
>> i5000_edac dca edac_core e1000e shpchp uinput sunrpc ipv6 autofs4 floppy
>> radeon ttm drm_kms_helper drm hwmon i2c_algo_bit i2c_core [last unloaded:
>> nf_nat]
>> Apr 19 16:12:55 localhost kernel: Pid: 0, comm: kworker/0:1 Not tainted
>> 3.2.0-rc2+ #36
>> Apr 19 16:12:55 localhost kernel: Call Trace:
>> Apr 19 16:12:55 localhost kernel: []
>> warn_slowpath_common+0x80/0x98
>> Apr 19 16:12:55 localhost kernel: []
>> warn_slowpath_fmt+0x41/0x43
>> Apr 19 16:12:55 localhost kernel: []
>> dev_watchdog+0xf4/0x154
>> Apr 19 16:12:55 localhost kernel: []
>> run_timer_softirq+0x16f/0x201
>> Apr 19 16:12:55 localhost kernel: [] ?
>> netif_tx_unlock+0x57/0x57
>> Apr 19 16:12:55 localhost kernel: []
>> __do_softirq+0x86/0x12f
>> Apr 19 16:12:55 localhost kernel: [] ?
>> hrtimer_interrupt+0x12b/0x1bd
>> Apr 19 16:12:55 localhost kernel: []
>> call_softirq+0x1c/0x30
>> Apr 19 16:12:55 localhost kernel: [] do_softirq+0x41/0x7e
>> Apr 19 16:12:55 localhost kernel: [] irq_exit+0x3f/0xbb
>> Apr 19 16:12:55 localhost kernel: []
>> smp_apic_timer_interrupt+0x85/0x93
>> Apr 19 16:12:55 localhost kernel: []
>> apic_timer_interrupt+0x6e/0x80
>> Apr 19 16:12:55 localhost kernel: [] ?
>> mwait_idle+0x6e/0x8c
>> Apr 19 16:12:55 localhost kernel: [] ?
>> mwait_idle+0x61/0x8c
>> Apr 19 16:12:55 localhost kernel: [] cpu_idle+0x67/0xbe
>> Apr 19 16:12:55 localhost kernel: []
>> start_secondary+0x194/0x199
>> Apr 19 16:12:55 localhost kernel: ---[ end trace e3ca12fc1a8b85da ]---
>> Apr 19 16:12:55 localhost kernel: e1000e 0000:08:00.0: eth2: Reset adapter
>> Apr 19 16:12:57 localhost abrt-dump-oops[898]: abrt-dump-oops: Found oopses:
>> 1
>> Apr 19 16:12:57 localhost abrt-dump-oops[898]: abrt-dump-oops: Creating dump
>> directories
>> Apr 19 16:12:57 localhost abrtd: Directory 'oops-2012-04-19-16:12:57-898-0'
>> creation detected
>> Apr 19 16:12:57 localhost abrt-dump-oops: Reported 1 kernel oopses to Abrt
>> Apr 19 16:12:57 localhost abrtd: Can't open file
>> '/var/spool/abrt/oops-2012-04-19-16:12:57-898-0/uid': No such file or
>> directory
>> Apr 19 16:12:57 localhost abrtd: DUP_OF_DIR:
>> /var/spool/abrt/oops-2012-04-19-15:02:13-862-0
>> Apr 19 16:12:57 localhost abrtd: Dump directory is a duplicate of
>> /var/spool/abrt/oops-2012-04-19-15:02:13-862-0
>> Apr 19 16:12:57 localhost abrtd: Deleting dump directory
>> oops-2012-04-19-16:12:57-898-0 (dup of oops-2012-04-19-15:02:13-862-0),
>> sending dbus signal
>> Apr 19 16:12:58 localhost kernel: e1000e: eth2 NIC Link is Up 1000 Mbps Full
>> Duplex, Flow Control: Rx/Tx
>> Apr 19 16:12:58 localhost kernel: ADDRCONF(NETDEV_CHANGE): eth2: link
>> becomes ready
>> Apr 19 16:13:03 localhost /usr/sbin/irqbalance: Load average increasing,
>> re-enabling all cpus for irq balancing
>> Apr 19 16:13:04 localhost kernel: e1000e 0000:08:00.0: eth2: Reset adapter
>> Apr 19 16:13:05 localhost chronyd[1003]: Selected source 108.59.2.194
>> Apr 19 16:13:07 localhost kernel: e1000e: eth2 NIC Link is Up 1000 Mbps Full
>> Duplex, Flow Control: Rx/Tx
>> Apr 19 16:13:07 localhost kernel: ADDRCONF(NETDEV_CHANGE): eth2: link
>> becomes ready
>> ....
>>
>> lspci:
>>
>> 08:00.0 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet
>> Controller (rev 06)
>> Subsystem: Intel Corporation PRO/1000 PT Dual Port Server Adapter
>> Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+
>> Stepping- SERR+ FastB2B- DisINTx+
>> Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast>TAbort- >> SERR-> Latency: 0, Cache Line Size: 32 bytes
>> Interrupt: pin A routed to IRQ 74
>> Region 0: Memory at d8300000 (32-bit, non-prefetchable) [size=128K]
>> Region 2: I/O ports at 3000 [size=32]
>> [virtual] Expansion ROM at d8d00000 [disabled] [size=128K]
>> Capabilities: [c8] Power Management version 2
>> Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA
>> PME(D0+,D1-,D2-,D3hot+,D3cold-)
>> Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=1 PME-
>> Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+
>> Address: 00000000feeff00c Data: 41a3
>> Capabilities: [e0] Express (v1) Endpoint, MSI 00
>> DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s
>> <512ns, L1<64us
>> ExtTag- AttnBtn- AttnInd- PwrInd- RBE- FLReset-
>> DevCtl: Report errors: Correctable+ Non-Fatal+ Fatal+
>> Unsupported+
>> RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
>> MaxPayload 128 bytes, MaxReadReq 4096 bytes
>> DevSta: CorrErr- UncorrErr+ FatalErr- UnsuppReq+ AuxPwr-
>> TransPend-
>> LnkCap: Port #1, Speed 2.5GT/s, Width x2, ASPM L0s L1,
>> Latency L0<4us, L1<64us
>> ClockPM- Surprise- LLActRep- BwNot-
>> LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain-
>> CommClk-
>> ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
>> LnkSta: Speed 2.5GT/s, Width x2, TrErr- Train- SlotClk+
>> DLActive- BWMgmt- ABWMgmt-
>> Capabilities: [100 v1] Advanced Error Reporting
>> UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt-
>> RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
>> UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt-
>> RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
>> UESvrt: DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt-
>> RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
>> CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout-
>> NonFatalErr-
>> CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout-
>> NonFatalErr-
>> AERCap: First Error Pointer: 14, GenCap- CGenEn- ChkCap-
>> ChkEn-
>> Capabilities: [140 v1] Device Serial Number 00-e0-ed-ff-ff-0c-11-6e
>> Kernel driver in use: e1000e
>> Kernel modules: e1000e
>>
>>
>> 3f0cfa3bc11e7f00c9994e0f469cbc0e7da7b00c is the first bad commit
>> commit 3f0cfa3bc11e7f00c9994e0f469cbc0e7da7b00c
>> Author: Tom Herbert
>> Date: Mon Nov 28 16:33:16 2011 +0000
>>
>> e1000e: Support for byte queue limits
>>
>> Changes to e1000e to use byte queue limits.
>>
>> Signed-off-by: Tom Herbert
>> Acked-by: Eric Dumazet
>> Signed-off-by: David S. Miller
>>
>> :040000 040000 bf3e2ec64fd74253563e1ab39797b27a5f2df3fe
>> 51914e221547b95a989b5c7e9b037c9370fd734e M drivers
>>
>>
>> Thanks,
>> Ben
>>
>> --
>> Ben Greear
>> Candela Technologies Inc http://www.candelatech.com
>>

--
Ben Greear
Candela Technologies Inc http://www.candelatech.com