From mboxrd@z Thu Jan 1 00:00:00 1970
From: Joshua West
Subject: Re: Broadcom BCM5709 (bnx2) on Dell PowerEdge R610, Issues
Date: Fri, 18 Mar 2011 15:24:53 -0400
Message-ID: <4D83B185.7050703@brandeis.edu>
References: <201103181536.p2IFa9Un010697@acsinet12.oracle.com> <4D838000.8030307@oracle.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
In-Reply-To: <4D838000.8030307@oracle.com>
Sender: xen-devel-bounces@lists.xensource.com
Errors-To: xen-devel-bounces@lists.xensource.com
To: xen-devel@lists.xensource.com
List-Id: xen-devel@lists.xenproject.org

Hi Guru,

Awesome, thanks for the tip. I'll test out disabling C-states in the
BIOS, as I don't believe Xen 3.4.x lets you set max_cstate as an
argument to xen.gz in grub.conf.

The patch in the changeset you mention applies to Xen 3.4.3 code. Do
you have any experience with that patch functioning/helping/working
with Xen 3.4.x? And if so, do you think it will end up as part of Xen
3.4.4 (if that ever gets tagged/released)?

Assuming disabling C-states in the BIOS alleviates my problem, I'll
probably give that patch a whirl with C-states enabled and see if the
issue comes back. Just wondering if anybody else has used that patch
with Xen 3.4.3 and found success.

Thanks.

On 03/18/11 11:53, Guru Anbalagane wrote:
> This is likely related to Xen losing interrupts while certain CPUs go
> to the C6 state.
> The patch below addresses an issue around this.
> http://xenbits.xen.org/hg/xen-unstable.hg/rev/1087f9a03ab6
> An easy workaround would be to turn off C-states in the BIOS or limit
> the C-state in Xen.
>
> Hope this helps.
> Thanks
> Guru
>> Message: 5
>> Date: Fri, 18 Mar 2011 11:39:07 -0400
>> From: Joshua West
>> Subject: [Xen-devel] Broadcom BCM5709 (bnx2) on Dell PowerEdge R610
>>  Issues
>> To: xen-devel@lists.xensource.com
>> Message-ID:<4D837C9B.6030107@brandeis.edu>
>> Content-Type: text/plain; charset="iso-8859-1"
>>
>> Hey folks,
>>
>> Unfortunately, ever since we went live with Xen on Dell PowerEdge
>> R610s, we've been having some odd and aggravating issues. The NICs
>> tend to drop out under heavy traffic after 1-7 days of uptime
>> (random, difficult to reproduce). But before I get into the issue's
>> specifics, here's some information about our setup:
>>
>> * Dell PowerEdge R610s w/ 4 onboard Broadcom BCM5709 1-GbE NICs.
>> * RHEL 5.6.
>> * Xen 3.4.3 (from xen.org; our own compile)
>> * Kernel 2.6.18.8
>>   (http://xenbits.xensource.com/linux-2.6.18-xen.hg)
>>   checkout 1073.
>> * bnx2 driver 2.0.18c from Broadcom's netxtreme2-6.0.53 package.
>>     * bnx2 that ships with 2.6.18.8 doesn't support BCM5709s.
>>     * Had to use driver package from broadcom.com in order to get
>>       networking.
>> * NIC bonding in pairs (eth0 + eth1, etc), with options "mode=4
>>   lacp_rate=fast miimon=100 use_carrier=1".
>>
>> What occurs is that suddenly one of the NICs in the bond stops
>> responding. It gets stuck on transmitting, from what I understand.
>> Kernel logs show the following, which includes extra debug
>> information, as the developers from Broadcom (Michael Chan and
>> Benjamin Li) were assisting in troubleshooting and gave me a version
>> of bnx2 2.0.18c to run that prints out extra debug information upon
>> NIC crash:
>>
>> Mar 18 01:40:26 xen-san-gb1 kernel: NETDEV WATCHDOG: eth0: transmit timed out
>> Mar 18 01:40:26 xen-san-gb1 kernel: bnx2:<--- start FTQ dump on eth0 --->
>> Mar 18 01:40:26 xen-san-gb1 kernel: bnx2: eth0: BNX2_RV2P_PFTQ_CTL 10000
>> Mar 18 01:40:26 xen-san-gb1 kernel: bnx2: eth0: BNX2_RV2P_TFTQ_CTL 20000
>> Mar 18 01:40:26 xen-san-gb1 kernel: bnx2: eth0: BNX2_RV2P_MFTQ_CTL 4000
>> Mar 18 01:40:26 xen-san-gb1 kernel: bnx2: eth0: BNX2_TBDR_FTQ_CTL 4002
>> Mar 18 01:40:26 xen-san-gb1 kernel: bnx2: eth0: BNX2_TDMA_FTQ_CTL 10002
>> Mar 18 01:40:26 xen-san-gb1 kernel: bnx2: eth0: BNX2_TXP_FTQ_CTL 10002
>> Mar 18 01:40:26 xen-san-gb1 kernel: bnx2: eth0: BNX2_TPAT_FTQ_CTL 10000
>> Mar 18 01:40:26 xen-san-gb1 kernel: bnx2: eth0: BNX2_RXP_CFTQ_CTL 8000
>> Mar 18 01:40:26 xen-san-gb1 kernel: bnx2: eth0: BNX2_RXP_FTQ_CTL 100000
>> Mar 18 01:40:26 xen-san-gb1 kernel: bnx2: eth0: BNX2_COM_COMXQ_FTQ_CTL 10000
>> Mar 18 01:40:26 xen-san-gb1 kernel: bnx2: eth0: BNX2_COM_COMTQ_FTQ_CTL 20000
>> Mar 18 01:40:26 xen-san-gb1 kernel: bnx2: eth0: BNX2_COM_COMQ_FTQ_CTL 10000
>> Mar 18 01:40:26 xen-san-gb1 kernel: bnx2: eth0: BNX2_CP_CPQ_FTQ_CTL 4000
>> Mar 18 01:40:26 xen-san-gb1 kernel: bnx2: eth0: TXP mode b84c state 80001000 evt_mask 500 pc 8001284 pc 8001284 instr 1440fffc
>> Mar 18 01:40:26 xen-san-gb1 kernel: bnx2: eth0: TPAT mode b84c state 80001000 evt_mask 500 pc 8000a50 pc 8000a4c instr 38420001
>> Mar 18 01:40:26 xen-san-gb1 kernel: bnx2: eth0: RXP mode b84c state 80001000 evt_mask 500 pc 8004ad0 pc 8004adc instr 14e0005d
>> Mar 18 01:40:26 xen-san-gb1 kernel: bnx2: eth0: COM mode b8cc state 80008000 evt_mask 500 pc 8000a98 pc 8000a8c instr 8821
>> Mar 18 01:40:26 xen-san-gb1 kernel: bnx2: eth0: CP mode b8cc state 80000000 evt_mask 500 pc 8000c7c pc 8000928 instr 8ce800e8
>> Mar 18 01:40:26 xen-san-gb1 kernel: bnx2:<--- end FTQ dump on eth0 --->
>> Mar 18 01:40:26 xen-san-gb1 kernel: bnx2: eth0 DEBUG: intr_sem[0]
>> Mar 18 01:40:26 xen-san-gb1 kernel: bnx2: eth0 DEBUG: intr_sem[0] PCI_CMD[00100406]
>> Mar 18 01:40:26 xen-san-gb1 kernel: bnx2: eth0 DEBUG: PCI_PM[19002008] PCI_MISC_CFG[92000088]
>> Mar 18 01:40:26 xen-san-gb1 kernel: bnx2: eth0 DEBUG: EMAC_TX_STATUS[00000008] EMAC_RX_STATUS[00000000]
>> Mar 18 01:40:26 xen-san-gb1 kernel: bnx2: eth0 RPM_MGMT_PKT_CTRL[40000088]
>> Mar 18 01:40:27 xen-san-gb1 kernel: bnx2: eth0 DEBUG: MCP_STATE_P0[0003610e] MCP_STATE_P1[0003610e]
>> Mar 18 01:40:27 xen-san-gb1 kernel: bnx2: eth0 DEBUG: HC_STATS_INTERRUPT_STATUS[01fe0001]
>> Mar 18 01:40:27 xen-san-gb1 kernel: Ring state for ring 0 napi state 12
>> Mar 18 01:40:27 xen-san-gb1 kernel: netdev state 7
>> Mar 18 01:40:27 xen-san-gb1 kernel: hw status idx 3267 last status idx 307c irq jiffies 100759890
>> Mar 18 01:40:27 xen-san-gb1 kernel: hw tx cons a669 hw rx cons 103c
>> Mar 18 01:40:27 xen-san-gb1 kernel: sw tx cons a57c a57c prod a669
>> Mar 18 01:40:27 xen-san-gb1 kernel: sw rx cons f3c prod 103c
>> Mar 18 01:40:27 xen-san-gb1 kernel: Current jiffies 1008f4741 HZ fa tx 1008f41e2 poll 100759890
>> Mar 18 01:40:27 xen-san-gb1 kernel: tx stop jiffies 1008f41e2 tx start jiffies 0
>> Mar 18 01:40:27 xen-san-gb1 kernel: irq_event c68c36 napi_event c68c37
>> Mar 18 01:40:27 xen-san-gb1 kernel: Ring state for ring 0 napi state 12
>> Mar 18 01:40:27 xen-san-gb1 kernel: netdev state 77
>> Mar 18 01:40:27 xen-san-gb1 kernel: hw status idx 3267 last status idx 307c irq jiffies 100759890
>> Mar 18 01:40:27 xen-san-gb1 kernel: hw tx cons a669 hw rx cons 103c
>> Mar 18 01:40:27 xen-san-gb1 kernel: sw tx cons a57c a57c prod a669
>> Mar 18 01:40:27 xen-san-gb1 kernel: sw rx cons f3c prod 103c
>> Mar 18 01:40:27 xen-san-gb1 kernel: Current jiffies 1008f4741 HZ fa tx 1008f41e2 poll 100759890
>> Mar 18 01:40:27 xen-san-gb1 kernel: tx stop jiffies 1008f41e2 tx start jiffies 0
>> Mar 18 01:40:27 xen-san-gb1 kernel: irq_event c68c36 napi_event c68c37
>> Mar 18 01:40:27 xen-san-gb1 kernel: bnx2: eth0 NIC Copper Link is Down
>> Mar 18 01:40:27 xen-san-gb1 kernel: bonding: bond0: link status definitely down for interface eth0, disabling it
>>
>> This was then followed rather quickly by a failure with the second
>> NIC (eth1) in the bond:
>>
>> Mar 18 01:42:26 xen-san-gb1 kernel: NETDEV WATCHDOG: eth1: transmit timed out
>> Mar 18 01:42:26 xen-san-gb1 kernel: bnx2:<--- start FTQ dump on eth1 --->
>> Mar 18 01:42:26 xen-san-gb1 kernel: bnx2: eth1: BNX2_RV2P_PFTQ_CTL 10000
>> Mar 18 01:42:26 xen-san-gb1 kernel: bnx2: eth1: BNX2_RV2P_TFTQ_CTL 20000
>> Mar 18 01:42:26 xen-san-gb1 kernel: bnx2: eth1: BNX2_RV2P_MFTQ_CTL 4000
>> Mar 18 01:42:26 xen-san-gb1 kernel: bnx2: eth1: BNX2_TBDR_FTQ_CTL 4002
>> Mar 18 01:42:26 xen-san-gb1 kernel: bnx2: eth1: BNX2_TDMA_FTQ_CTL 10000
>> Mar 18 01:42:26 xen-san-gb1 kernel: bnx2: eth1: BNX2_TXP_FTQ_CTL 10002
>> Mar 18 01:42:26 xen-san-gb1 kernel: bnx2: eth1: BNX2_TPAT_FTQ_CTL 10000
>> Mar 18 01:42:26 xen-san-gb1 kernel: bnx2: eth1: BNX2_RXP_CFTQ_CTL 8000
>> Mar 18 01:42:26 xen-san-gb1 kernel: bnx2: eth1: BNX2_RXP_FTQ_CTL 100000
>> Mar 18 01:42:26 xen-san-gb1 kernel: bnx2: eth1: BNX2_COM_COMXQ_FTQ_CTL 10000
>> Mar 18 01:42:26 xen-san-gb1 kernel: bnx2: eth1: BNX2_COM_COMTQ_FTQ_CTL 20000
>> Mar 18 01:42:26 xen-san-gb1 kernel: bnx2: eth1: BNX2_COM_COMQ_FTQ_CTL 10000
>> Mar 18 01:42:26 xen-san-gb1 kernel: bnx2: eth1: BNX2_CP_CPQ_FTQ_CTL 4000
>> Mar 18 01:42:26 xen-san-gb1 kernel: bnx2: eth1: TXP mode b84c state 80005000 evt_mask 500 pc 8001294 pc 8001284 instr 38640001
>> Mar 18 01:42:26 xen-san-gb1 kernel: bnx2: eth1: TPAT mode b84c state 80001000 evt_mask 500 pc 8000a58 pc 8000a5c instr 8f820014
>> Mar 18 01:42:26 xen-san-gb1 kernel: bnx2: eth1: RXP mode b84c state 80001000 evt_mask 500 pc 8004ad0 pc 8004adc instr 14e0005d
>> Mar 18 01:42:26 xen-san-gb1 kernel: bnx2: eth1: COM mode b8cc state 80000000 evt_mask 500 pc 8000a9c pc 8000a94 instr 3c028000
>> Mar 18 01:42:26 xen-san-gb1 kernel: bnx2: eth1: CP mode b8cc state 80008000 evt_mask 500 pc 8000c58 pc 8000c6c instr 27bdffe8
>> Mar 18 01:42:26 xen-san-gb1 kernel: bnx2:<--- end FTQ dump on eth1 --->
>> Mar 18 01:42:26 xen-san-gb1 kernel: bnx2: eth1 DEBUG: intr_sem[0]
>> Mar 18 01:42:26 xen-san-gb1 kernel: bnx2: eth1 DEBUG: intr_sem[0] PCI_CMD[00100406]
>> Mar 18 01:42:26 xen-san-gb1 kernel: bnx2: eth1 DEBUG: PCI_PM[19002008] PCI_MISC_CFG[92000088]
>> Mar 18 01:42:26 xen-san-gb1 kernel: bnx2: eth1 DEBUG: EMAC_TX_STATUS[00000008] EMAC_RX_STATUS[00000000]
>> Mar 18 01:42:26 xen-san-gb1 kernel: bnx2: eth1 RPM_MGMT_PKT_CTRL[40000088]
>> Mar 18 01:42:27 xen-san-gb1 kernel: bnx2: eth1 DEBUG: MCP_STATE_P0[0003610e] MCP_STATE_P1[0003610e]
>> Mar 18 01:42:27 xen-san-gb1 kernel: bnx2: eth1 DEBUG: HC_STATS_INTERRUPT_STATUS[01fe0001]
>> Mar 18 01:42:27 xen-san-gb1 kernel: Ring state for ring 0 napi state 12
>> Mar 18 01:42:27 xen-san-gb1 kernel: netdev state 7
>> Mar 18 01:42:27 xen-san-gb1 kernel: hw status idx 2bb0 last status idx 29c4 irq jiffies 100759898
>> Mar 18 01:42:27 xen-san-gb1 kernel: hw tx cons e421 hw rx cons a8ce
>> Mar 18 01:42:27 xen-san-gb1 kernel: sw tx cons e334 e334 prod e421
>> Mar 18 01:42:27 xen-san-gb1 kernel: sw rx cons a7ce prod a8ce
>> Mar 18 01:42:27 xen-san-gb1 kernel: Current jiffies 1008fbc71 HZ fa tx 1008fb744 poll 100759898
>> Mar 18 01:42:27 xen-san-gb1 kernel: tx stop jiffies 1008fb744 tx start jiffies 100239dfd
>> Mar 18 01:42:27 xen-san-gb1 kernel: irq_event ab2e13 napi_event ab2e14
>> Mar 18 01:42:27 xen-san-gb1 kernel: Ring state for ring 0 napi state 12
>> Mar 18 01:42:27 xen-san-gb1 kernel: netdev state 77
>> Mar 18 01:42:27 xen-san-gb1 kernel: hw status idx 2bb0 last status idx 29c4 irq jiffies 100759898
>> Mar 18 01:42:27 xen-san-gb1 kernel: hw tx cons e421 hw rx cons a8ce
>> Mar 18 01:42:27 xen-san-gb1 kernel: sw tx cons e334 e334 prod e421
>> Mar 18 01:42:27 xen-san-gb1 kernel: sw rx cons a7ce prod a8ce
>> Mar 18 01:42:27 xen-san-gb1 kernel: Current jiffies 1008fbc72 HZ fa tx 1008fb744 poll 100759898
>> Mar 18 01:42:27 xen-san-gb1 kernel: tx stop jiffies 1008fb744 tx start jiffies 100239dfd
>> Mar 18 01:42:27 xen-san-gb1 kernel: irq_event ab2e13 napi_event ab2e14
>> Mar 18 01:42:27 xen-san-gb1 kernel: bnx2: eth1 NIC Copper Link is Down
>> Mar 18 01:42:27 xen-san-gb1 kernel: bonding: bond0: link status definitely down for interface eth1, disabling it
>> Mar 18 01:42:27 xen-san-gb1 kernel: bonding: bond0: Warning: No 802.3ad response from the link partner for any adapters in the bond
>>
>> Onto more technical details...
>>
>> The kernel we were running (2.6.18.8 from xenbits) was originally
>> compiled without support for MSI/MSI-X. So, we were experiencing
>> these problems with plain standard IRQs. Michael Chan @ Broadcom
>> (the author of bnx2, per modinfo) has told me via email:
>>
>> * "The logs show that we haven't had an interrupt for a very long
>>   time. It's not clear how that interrupt was lost."
>> * "So far the logs don't show any inconsistent state in the hardware
>>   or software. It is possible that the Xen kernel is missing an
>>   interrupt and not delivering to the driver. Normally, in INTA mode,
>>   the IRQ is level triggered and should remain asserted until it is
>>   seen by the driver and de-asserted by the driver."
>>
>> But, just in case, I compiled 2.6.18.8 with support for MSI/MSI-X and
>> was able to confirm (via dmesg and lspci -vv) that the NICs began to
>> use MSI for interrupts. Unfortunately, the NIC crash happened anyway
>> (the kernel logs above are actually from a run with MSI).
>>
>> Here's what's really bugging me.
>> We have a Dell PowerEdge R610, running Xen along with the bnx2
>> drivers from Broadcom, that's been online for ~220 days without a
>> failure. The only difference is that the system is not making use of
>> bonding. It has just one NIC connected to the network, with no VLANs
>> trunked down, etc.
>>
>> It looks like I'm not alone out there, as there's a Red Hat bugzilla
>> report for this issue:
>>
>> https://bugzilla.redhat.com/show_bug.cgi?id=520888
>>
>> ^^ The above is marked CLOSED DUPLICATE of bug 511368, but it looks
>> like I don't have access to view 511368. Grrr.
>>
>> Anyways...
>>
>> 1) Has anybody else experienced this issue?
>> 2) Any developers care to comment on possible causes of this problem?
>> 3) Anybody know of a solution?
>> 4) What can I do to troubleshoot further, and get developers the
>>    necessary information?
>>
>> Lastly...
>>
>> 5) Is anybody running Intel NICs within Dell PowerEdge R610s, using
>>    bonding + Xen 3.4.3 + 2.6.18.8, and can safely report success? I
>>    may switch to Intel...
>>
>> Thanks!
>>
>
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xensource.com
> http://lists.xensource.com/xen-devel

--
Joshua West
Senior Systems Engineer
Brandeis University
http://www.brandeis.edu
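
As a rough check on Michael Chan's remark that the driver had not seen an interrupt "for a very long time", the jiffies counters in the eth0 dump can be converted to seconds. This is only a sketch under assumptions the thread does not confirm: that the dumped values are hexadecimal, that "HZ fa" is the tick rate (0xfa = 250), and that "irq jiffies" records when the driver last saw an interrupt. The function name is hypothetical.

```python
def jiffies_to_seconds(now_hex: str, then_hex: str, hz_hex: str) -> float:
    """Elapsed seconds between two hex jiffies values at a hex tick rate."""
    return (int(now_hex, 16) - int(then_hex, 16)) / int(hz_hex, 16)


# Values copied from the eth0 debug dump quoted in the message.
CURRENT_JIFFIES = "1008f4741"  # "Current jiffies"
IRQ_JIFFIES = "100759890"      # "irq jiffies" (assumed: last interrupt seen)
TX_JIFFIES = "1008f41e2"       # "tx" (assumed: last transmit)
HZ = "fa"                      # "HZ fa" (assumed: 0xfa = 250 ticks/second)

since_irq = jiffies_to_seconds(CURRENT_JIFFIES, IRQ_JIFFIES, HZ)
since_tx = jiffies_to_seconds(CURRENT_JIFFIES, TX_JIFFIES, HZ)

print(f"since last IRQ: {since_irq:.1f} s")
print(f"since last TX:  {since_tx:.1f} s")
```

If those assumptions hold, the dump shows about 6732 s (roughly 1.9 hours) since the last interrupt but only 5.5 s since the last transmit, which would fit a lost interrupt better than an idle link.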