From mboxrd@z Thu Jan 1 00:00:00 1970
From: Guru Anbalagane
Subject: Re: Broadcom BCM5709 (bnx2) on Dell PowerEdge R610, Issues
Date: Fri, 18 Mar 2011 08:53:36 -0700
Message-ID: <4D838000.8030307@oracle.com>
References: <201103181536.p2IFa9Un010697@acsinet12.oracle.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
In-Reply-To: <201103181536.p2IFa9Un010697@acsinet12.oracle.com>
Sender: xen-devel-bounces@lists.xensource.com
Errors-To: xen-devel-bounces@lists.xensource.com
To: xen-devel@lists.xensource.com
List-Id: xen-devel@lists.xenproject.org

This is likely related to Xen losing interrupts while certain CPUs go
into the C6 state. The patch below addresses an issue in this area:

http://xenbits.xen.org/hg/xen-unstable.hg/rev/1087f9a03ab6

An easy workaround would be to turn off C-states in the BIOS or to limit
the maximum C-state in Xen.

Hope this helps.

Thanks
Guru

> Message: 5
> Date: Fri, 18 Mar 2011 11:39:07 -0400
> From: Joshua West
> Subject: [Xen-devel] Broadcom BCM5709 (bnx2) on Dell PowerEdge R610
>   Issues
> To: xen-devel@lists.xensource.com
> Message-ID:<4D837C9B.6030107@brandeis.edu>
> Content-Type: text/plain; charset="iso-8859-1"
>
> Hey folks,
>
> Unfortunately, ever since we went live with Xen on Dell PowerEdge
> R610's, we've been having some odd and aggravating issues. The NIC's
> tend to drop out when under heavy traffic after 1-7 days of uptime
> (random, difficult to reproduce). But before I get into the issue's
> specifics, here's some information about our setup:
>
> * Dell PowerEdge R610's w/ 4 Onboard Broadcom BCM5709 1-GbE NIC's.
> * RHEL 5.6.
> * Xen 3.4.3 (from xen.org; our own compile)
> * Kernel 2.6.18.8 (http://xenbits.xensource.com/linux-2.6.18-xen.hg)
>   checkout 1073.
> * bnx2 driver 2.0.18c from Broadcom's netxtreme2-6.0.53 package.
>     * bnx2 that ships with 2.6.18.8 doesn't support BCM5709's.
>     * Had to use driver package from broadcom.com in order to get
>       networking.
> * NIC bonding in pairs (eth0 + eth1, etc), with options "mode=4
>   lacp_rate=fast miimon=100 use_carrier=1".
>
> What occurs is suddenly one of the NIC's in the bond stops responding.
> Gets stuck on transmitting from what I understand. Kernel logs show the
> following, which includes extra debug information as the developers from
> Broadcom (Michael Chan and Benjamin Li) were assisting in
> troubleshooting and gave me a version of bnx2 2.0.18c to run, that
> prints out extra debug information upon NIC crash:
>
> Mar 18 01:40:26 xen-san-gb1 kernel: NETDEV WATCHDOG: eth0: transmit timed out
> Mar 18 01:40:26 xen-san-gb1 kernel: bnx2:<--- start FTQ dump on eth0 --->
> Mar 18 01:40:26 xen-san-gb1 kernel: bnx2: eth0: BNX2_RV2P_PFTQ_CTL 10000
> Mar 18 01:40:26 xen-san-gb1 kernel: bnx2: eth0: BNX2_RV2P_TFTQ_CTL 20000
> Mar 18 01:40:26 xen-san-gb1 kernel: bnx2: eth0: BNX2_RV2P_MFTQ_CTL 4000
> Mar 18 01:40:26 xen-san-gb1 kernel: bnx2: eth0: BNX2_TBDR_FTQ_CTL 4002
> Mar 18 01:40:26 xen-san-gb1 kernel: bnx2: eth0: BNX2_TDMA_FTQ_CTL 10002
> Mar 18 01:40:26 xen-san-gb1 kernel: bnx2: eth0: BNX2_TXP_FTQ_CTL 10002
> Mar 18 01:40:26 xen-san-gb1 kernel: bnx2: eth0: BNX2_TPAT_FTQ_CTL 10000
> Mar 18 01:40:26 xen-san-gb1 kernel: bnx2: eth0: BNX2_RXP_CFTQ_CTL 8000
> Mar 18 01:40:26 xen-san-gb1 kernel: bnx2: eth0: BNX2_RXP_FTQ_CTL 100000
> Mar 18 01:40:26 xen-san-gb1 kernel: bnx2: eth0: BNX2_COM_COMXQ_FTQ_CTL 10000
> Mar 18 01:40:26 xen-san-gb1 kernel: bnx2: eth0: BNX2_COM_COMTQ_FTQ_CTL 20000
> Mar 18 01:40:26 xen-san-gb1 kernel: bnx2: eth0: BNX2_COM_COMQ_FTQ_CTL 10000
> Mar 18 01:40:26 xen-san-gb1 kernel: bnx2: eth0: BNX2_CP_CPQ_FTQ_CTL 4000
> Mar 18 01:40:26 xen-san-gb1 kernel: bnx2: eth0: TXP mode b84c state 80001000 evt_mask 500 pc 8001284 pc 8001284 instr 1440fffc
> Mar 18 01:40:26 xen-san-gb1 kernel: bnx2: eth0: TPAT mode b84c state 80001000 evt_mask 500 pc 8000a50 pc 8000a4c instr 38420001
> Mar 18 01:40:26 xen-san-gb1 kernel: bnx2: eth0: RXP mode b84c state 80001000 evt_mask 500 pc 8004ad0 pc 8004adc instr 14e0005d
> Mar 18 01:40:26 xen-san-gb1 kernel: bnx2: eth0: COM mode b8cc state 80008000 evt_mask 500 pc 8000a98 pc 8000a8c instr 8821
> Mar 18 01:40:26 xen-san-gb1 kernel: bnx2: eth0: CP mode b8cc state 80000000 evt_mask 500 pc 8000c7c pc 8000928 instr 8ce800e8
> Mar 18 01:40:26 xen-san-gb1 kernel: bnx2:<--- end FTQ dump on eth0 --->
> Mar 18 01:40:26 xen-san-gb1 kernel: bnx2: eth0 DEBUG: intr_sem[0]
> Mar 18 01:40:26 xen-san-gb1 kernel: bnx2: eth0 DEBUG: intr_sem[0] PCI_CMD[00100406]
> Mar 18 01:40:26 xen-san-gb1 kernel: bnx2: eth0 DEBUG: PCI_PM[19002008] PCI_MISC_CFG[92000088]
> Mar 18 01:40:26 xen-san-gb1 kernel: bnx2: eth0 DEBUG: EMAC_TX_STATUS[00000008] EMAC_RX_STATUS[00000000]
> Mar 18 01:40:26 xen-san-gb1 kernel: bnx2: eth0 RPM_MGMT_PKT_CTRL[40000088]
> Mar 18 01:40:27 xen-san-gb1 kernel: bnx2: eth0 DEBUG: MCP_STATE_P0[0003610e] MCP_STATE_P1[0003610e]
> Mar 18 01:40:27 xen-san-gb1 kernel: bnx2: eth0 DEBUG: HC_STATS_INTERRUPT_STATUS[01fe0001]
> Mar 18 01:40:27 xen-san-gb1 kernel: Ring state for ring 0 napi state 12
> Mar 18 01:40:27 xen-san-gb1 kernel: netdev state 7
> Mar 18 01:40:27 xen-san-gb1 kernel: hw status idx 3267 last status idx 307c irq jiffies 100759890
> Mar 18 01:40:27 xen-san-gb1 kernel: hw tx cons a669 hw rx cons 103c
> Mar 18 01:40:27 xen-san-gb1 kernel: sw tx cons a57c a57c prod a669
> Mar 18 01:40:27 xen-san-gb1 kernel: sw rx cons f3c prod 103c
> Mar 18 01:40:27 xen-san-gb1 kernel: Current jiffies 1008f4741 HZ fa tx 1008f41e2 poll 100759890
> Mar 18 01:40:27 xen-san-gb1 kernel: tx stop jiffies 1008f41e2 tx start jiffies 0
> Mar 18 01:40:27 xen-san-gb1 kernel: irq_event c68c36 napi_event c68c37
> Mar 18 01:40:27 xen-san-gb1 kernel: Ring state for ring 0 napi state 12
> Mar 18 01:40:27 xen-san-gb1 kernel: netdev state 77
> Mar 18 01:40:27 xen-san-gb1 kernel: hw status idx 3267 last status idx 307c irq jiffies 100759890
> Mar 18 01:40:27 xen-san-gb1 kernel: hw tx cons a669 hw rx cons 103c
> Mar 18 01:40:27 xen-san-gb1 kernel: sw tx cons a57c a57c prod a669
> Mar 18 01:40:27 xen-san-gb1 kernel: sw rx cons f3c prod 103c
> Mar 18 01:40:27 xen-san-gb1 kernel: Current jiffies 1008f4741 HZ fa tx 1008f41e2 poll 100759890
> Mar 18 01:40:27 xen-san-gb1 kernel: tx stop jiffies 1008f41e2 tx start jiffies 0
> Mar 18 01:40:27 xen-san-gb1 kernel: irq_event c68c36 napi_event c68c37
> Mar 18 01:40:27 xen-san-gb1 kernel: bnx2: eth0 NIC Copper Link is Down
> Mar 18 01:40:27 xen-san-gb1 kernel: bonding: bond0: link status definitely down for interface eth0, disabling it
>
> This was then followed rather quickly by a failure with the second NIC
> (eth1) in the bond:
>
> Mar 18 01:42:26 xen-san-gb1 kernel: NETDEV WATCHDOG: eth1: transmit timed out
> Mar 18 01:42:26 xen-san-gb1 kernel: bnx2:<--- start FTQ dump on eth1 --->
> Mar 18 01:42:26 xen-san-gb1 kernel: bnx2: eth1: BNX2_RV2P_PFTQ_CTL 10000
> Mar 18 01:42:26 xen-san-gb1 kernel: bnx2: eth1: BNX2_RV2P_TFTQ_CTL 20000
> Mar 18 01:42:26 xen-san-gb1 kernel: bnx2: eth1: BNX2_RV2P_MFTQ_CTL 4000
> Mar 18 01:42:26 xen-san-gb1 kernel: bnx2: eth1: BNX2_TBDR_FTQ_CTL 4002
> Mar 18 01:42:26 xen-san-gb1 kernel: bnx2: eth1: BNX2_TDMA_FTQ_CTL 10000
> Mar 18 01:42:26 xen-san-gb1 kernel: bnx2: eth1: BNX2_TXP_FTQ_CTL 10002
> Mar 18 01:42:26 xen-san-gb1 kernel: bnx2: eth1: BNX2_TPAT_FTQ_CTL 10000
> Mar 18 01:42:26 xen-san-gb1 kernel: bnx2: eth1: BNX2_RXP_CFTQ_CTL 8000
> Mar 18 01:42:26 xen-san-gb1 kernel: bnx2: eth1: BNX2_RXP_FTQ_CTL 100000
> Mar 18 01:42:26 xen-san-gb1 kernel: bnx2: eth1: BNX2_COM_COMXQ_FTQ_CTL 10000
> Mar 18 01:42:26 xen-san-gb1 kernel: bnx2: eth1: BNX2_COM_COMTQ_FTQ_CTL 20000
> Mar 18 01:42:26 xen-san-gb1 kernel: bnx2: eth1: BNX2_COM_COMQ_FTQ_CTL 10000
> Mar 18 01:42:26 xen-san-gb1 kernel: bnx2: eth1: BNX2_CP_CPQ_FTQ_CTL 4000
> Mar 18 01:42:26 xen-san-gb1 kernel: bnx2: eth1: TXP mode b84c state 80005000 evt_mask 500 pc 8001294 pc 8001284 instr 38640001
> Mar 18 01:42:26 xen-san-gb1 kernel: bnx2: eth1: TPAT mode b84c state 80001000 evt_mask 500 pc 8000a58 pc 8000a5c instr 8f820014
> Mar 18 01:42:26 xen-san-gb1 kernel: bnx2: eth1: RXP mode b84c state 80001000 evt_mask 500 pc 8004ad0 pc 8004adc instr 14e0005d
> Mar 18 01:42:26 xen-san-gb1 kernel: bnx2: eth1: COM mode b8cc state 80000000 evt_mask 500 pc 8000a9c pc 8000a94 instr 3c028000
> Mar 18 01:42:26 xen-san-gb1 kernel: bnx2: eth1: CP mode b8cc state 80008000 evt_mask 500 pc 8000c58 pc 8000c6c instr 27bdffe8
> Mar 18 01:42:26 xen-san-gb1 kernel: bnx2:<--- end FTQ dump on eth1 --->
> Mar 18 01:42:26 xen-san-gb1 kernel: bnx2: eth1 DEBUG: intr_sem[0]
> Mar 18 01:42:26 xen-san-gb1 kernel: bnx2: eth1 DEBUG: intr_sem[0] PCI_CMD[00100406]
> Mar 18 01:42:26 xen-san-gb1 kernel: bnx2: eth1 DEBUG: PCI_PM[19002008] PCI_MISC_CFG[92000088]
> Mar 18 01:42:26 xen-san-gb1 kernel: bnx2: eth1 DEBUG: EMAC_TX_STATUS[00000008] EMAC_RX_STATUS[00000000]
> Mar 18 01:42:26 xen-san-gb1 kernel: bnx2: eth1 RPM_MGMT_PKT_CTRL[40000088]
> Mar 18 01:42:27 xen-san-gb1 kernel: bnx2: eth1 DEBUG: MCP_STATE_P0[0003610e] MCP_STATE_P1[0003610e]
> Mar 18 01:42:27 xen-san-gb1 kernel: bnx2: eth1 DEBUG: HC_STATS_INTERRUPT_STATUS[01fe0001]
> Mar 18 01:42:27 xen-san-gb1 kernel: Ring state for ring 0 napi state 12
> Mar 18 01:42:27 xen-san-gb1 kernel: netdev state 7
> Mar 18 01:42:27 xen-san-gb1 kernel: hw status idx 2bb0 last status idx 29c4 irq jiffies 100759898
> Mar 18 01:42:27 xen-san-gb1 kernel: hw tx cons e421 hw rx cons a8ce
> Mar 18 01:42:27 xen-san-gb1 kernel: sw tx cons e334 e334 prod e421
> Mar 18 01:42:27 xen-san-gb1 kernel: sw rx cons a7ce prod a8ce
> Mar 18 01:42:27 xen-san-gb1 kernel: Current jiffies 1008fbc71 HZ fa tx 1008fb744 poll 100759898
> Mar 18 01:42:27 xen-san-gb1 kernel: tx stop jiffies 1008fb744 tx start jiffies 100239dfd
> Mar 18 01:42:27 xen-san-gb1 kernel: irq_event ab2e13 napi_event ab2e14
> Mar 18 01:42:27 xen-san-gb1 kernel: Ring state for ring 0 napi state 12
> Mar 18 01:42:27 xen-san-gb1 kernel: netdev state 77
> Mar 18 01:42:27 xen-san-gb1 kernel: hw status idx 2bb0 last status idx 29c4 irq jiffies 100759898
> Mar 18 01:42:27 xen-san-gb1 kernel: hw tx cons e421 hw rx cons a8ce
> Mar 18 01:42:27 xen-san-gb1 kernel: sw tx cons e334 e334 prod e421
> Mar 18 01:42:27 xen-san-gb1 kernel: sw rx cons a7ce prod a8ce
> Mar 18 01:42:27 xen-san-gb1 kernel: Current jiffies 1008fbc72 HZ fa tx 1008fb744 poll 100759898
> Mar 18 01:42:27 xen-san-gb1 kernel: tx stop jiffies 1008fb744 tx start jiffies 100239dfd
> Mar 18 01:42:27 xen-san-gb1 kernel: irq_event ab2e13 napi_event ab2e14
> Mar 18 01:42:27 xen-san-gb1 kernel: bnx2: eth1 NIC Copper Link is Down
> Mar 18 01:42:27 xen-san-gb1 kernel: bonding: bond0: link status definitely down for interface eth1, disabling it
> Mar 18 01:42:27 xen-san-gb1 kernel: bonding: bond0: Warning: No 802.3ad response from the link partner for any adapters in the bond
>
> Onto more technical details...
>
> The kernel we were running (2.6.18.8 from xenbits) was compiled without
> support for MSI/MSI-X originally. So, we were experiencing these
> problems with plain standard IRQ's. Michael Chan @ Broadcom, the author
> of bnx2 if you modinfo, has told me via email:
>
> * "The logs show that we haven't had an interrupt for a very long
>   time. It's not clear how that interrupt was lost."
> * "So far the logs don't show any inconsistent state in the hardware
>   or software. It is possible that the Xen kernel is missing an interrupt
>   and not delivering to the driver. Normally, in INTA mode, the IRQ is
>   level triggered and should remain asserted until it is seen by the
>   driver and de-asserted by the driver."
>
> But, just in case, I compiled 2.6.18.8 with support for MSI/MSI-X and
> was able to confirm (via dmesg and lspci -vv) that the NIC's began to
> use MSI for interrupts. Unfortunately, the NIC crash happened anyways
> (the above kernel logs are actually from when running with MSI).
>
> Here's what's really bugging me. We have a Dell PowerEdge R610, running
> Xen along with the bnx2 drivers from Broadcom, that's been online for
> ~220 days. Without a failure. The only difference is the system is not
> making use of bonding. It has just one NIC connected to the network
> with no VLAN's trunked down etc.
>
> It looks like I'm not alone out there, as there's a Red Hat bugzilla
> report for this issue:
>
> https://bugzilla.redhat.com/show_bug.cgi?id=520888
>
> ^^ The above has an indication of *Status*: CLOSED DUPLICATE of bug
> 511368, but looks like I don't have access to view 511368. Grrr.
>
> Anyways...
>
> 1) Has anybody else experienced this issue?
> 2) Any developers care to comment on possible causes of this problem?
> 3) Anybody know of a solution?
> 4) What can I do to troubleshoot further, and get developers necessary
>    information?
>
> Lastly...
>
> 5) Is anybody running Intel NIC's within Dell PowerEdge R610's, using
>    bonding + Xen 3.4.3 + 2.6.18.8, and can safely report success? I may
>    switch to Intel...
>
> Thanks!
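
As a rough illustration of Michael Chan's remark in the quoted message that no interrupt had arrived "for a very long time", the jiffies values in the eth0 dump can be converted to seconds. This is only a sketch under assumptions that this thread does not confirm: that the debug driver prints HZ in hexadecimal (fa = 250), that "irq jiffies" is the time of the last interrupt the driver saw, and that "tx stop jiffies" is when the transmit queue stalled. The function and variable names are made up for the example.

```python
# Sketch: estimate how long eth0 went without an interrupt, using values
# copied from the eth0 debug dump quoted above. The meaning of each field
# is an assumption, not something the thread confirms.


def jiffies_to_seconds(later_hex: str, earlier_hex: str, hz_hex: str) -> float:
    """Return the seconds elapsed between two hexadecimal jiffies counters."""
    later = int(later_hex, 16)
    earlier = int(earlier_hex, 16)
    hz = int(hz_hex, 16)  # assumption: HZ is printed in hex (fa == 250)
    return (later - earlier) / hz


# Values from the line "Current jiffies 1008f4741 HZ fa tx 1008f41e2 ..."
# and from "irq jiffies 100759890" in the eth0 dump.
CURRENT = "1008f4741"
LAST_IRQ = "100759890"  # assumed: last interrupt seen by the driver
TX_STOP = "1008f41e2"   # assumed: moment the transmit queue stalled
HZ = "fa"

since_irq = jiffies_to_seconds(CURRENT, LAST_IRQ, HZ)
since_tx_stop = jiffies_to_seconds(CURRENT, TX_STOP, HZ)

print(f"since last interrupt: {since_irq:.1f} s ({since_irq / 60:.0f} min)")
print(f"since tx stop:        {since_tx_stop:.1f} s")
```

If those assumptions hold, the gap since the last interrupt works out to roughly 112 minutes, while the transmit queue had been stopped for only about 5.5 seconds when the watchdog fired. That would be consistent with an interrupt being lost well before the transmit timeout was reported.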