xen-devel.lists.xenproject.org archive mirror
From: Guru Anbalagane <guru.anbalagane@oracle.com>
To: xen-devel@lists.xensource.com
Subject: Re: Broadcom BCM5709 (bnx2) on Dell PowerEdge R610, Issues
Date: Fri, 18 Mar 2011 08:53:36 -0700
Message-ID: <4D838000.8030307@oracle.com>
In-Reply-To: <201103181536.p2IFa9Un010697@acsinet12.oracle.com>

This is likely related to Xen losing interrupts while certain CPUs go
into the C6 state.
The patch below addresses an issue in this area:
http://xenbits.xen.org/hg/xen-unstable.hg/rev/1087f9a03ab6
An easy workaround would be to turn off C-states in the BIOS or to limit
the maximum C-state in Xen.
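
For illustration, limiting C-states in Xen is usually done with the
hypervisor's max_cstate= boot option. The GRUB (menu.lst) entry below is
only a sketch: the title, device, file names and dom0 root argument are
placeholders rather than values from this system, and whether
max_cstate=1 is available and sufficient on Xen 3.4.3 is an assumption
that should be checked against that version's boot-option documentation.

```
# Hypothetical menu.lst entry; only "max_cstate=1" is the point here.
title Xen (C-states limited)
    root (hd0,0)
    kernel /xen.gz max_cstate=1
    module /vmlinuz-2.6.18.8-xen ro root=/dev/VolGroup00/LogVol00
    module /initrd-2.6.18.8-xen.img
```

The option goes on the xen.gz (hypervisor) line, not on the dom0 kernel
module line, since Xen rather than dom0 decides when a CPU goes idle.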

Hope this helps.
Thanks
Guru
> Message: 5
> Date: Fri, 18 Mar 2011 11:39:07 -0400
> From: Joshua West<jwest@brandeis.edu>
> Subject: [Xen-devel] Broadcom BCM5709 (bnx2) on Dell PowerEdge R610
> 	Issues
> To: xen-devel@lists.xensource.com
> Message-ID:<4D837C9B.6030107@brandeis.edu>
> Content-Type: text/plain; charset="iso-8859-1"
>
> Hey folks,
>
> Unfortunately, ever since we went live with Xen on Dell PowerEdge
> R610's, we've been having some odd and aggravating issues.  The NIC's
> tend to drop out when under heavy traffic after 1-7 days of uptime
> (random, difficult to reproduce).  But before I get into the issue's
> specifics, here's some information about our setup:
>
>     * Dell PowerEdge R610's w/ 4 Onboard Broadcom BCM5709 1-GbE NIC's.
>     * RHEL 5.6.
>     * Xen 3.4.3 (from xen.org; our own compile)
>     * Kernel 2.6.18.8 (http://xenbits.xensource.com/linux-2.6.18-xen.hg)
> checkout 1073.
>     * bnx2 driver 2.0.18c from Broadcom's netxtreme2-6.0.53 package.
>       * bnx2 that ships with 2.6.18.8 doesn't support BCM5709's.
>       * Had to use driver package from broadcom.com in order to get
> networking.
>     * NIC bonding in pairs (eth0 + eth1, etc), with options "mode=4
> lacp_rate=fast miimon=100 use_carrier=1".
>
> What occurs is suddenly one of the NIC's in the bond stops responding.
> Gets stuck on transmitting from what I understand.  Kernel logs show the
> following, which includes extra debug information as the developers from
> Broadcom (Michael Chan and Benjamin Li) were assisting in
> troubleshooting and gave me a version of bnx2 2.0.18c to run, that
> prints out extra debug information upon NIC crash:
>
> Mar 18 01:40:26 xen-san-gb1 kernel: NETDEV WATCHDOG: eth0: transmit
> timed out
> Mar 18 01:40:26 xen-san-gb1 kernel: bnx2:<--- start FTQ dump on eth0 --->
> Mar 18 01:40:26 xen-san-gb1 kernel: bnx2: eth0: BNX2_RV2P_PFTQ_CTL 10000
> Mar 18 01:40:26 xen-san-gb1 kernel: bnx2: eth0: BNX2_RV2P_TFTQ_CTL 20000
> Mar 18 01:40:26 xen-san-gb1 kernel: bnx2: eth0: BNX2_RV2P_MFTQ_CTL 4000
> Mar 18 01:40:26 xen-san-gb1 kernel: bnx2: eth0: BNX2_TBDR_FTQ_CTL 4002
> Mar 18 01:40:26 xen-san-gb1 kernel: bnx2: eth0: BNX2_TDMA_FTQ_CTL 10002
> Mar 18 01:40:26 xen-san-gb1 kernel: bnx2: eth0: BNX2_TXP_FTQ_CTL 10002
> Mar 18 01:40:26 xen-san-gb1 kernel: bnx2: eth0: BNX2_TPAT_FTQ_CTL 10000
> Mar 18 01:40:26 xen-san-gb1 kernel: bnx2: eth0: BNX2_RXP_CFTQ_CTL 8000
> Mar 18 01:40:26 xen-san-gb1 kernel: bnx2: eth0: BNX2_RXP_FTQ_CTL 100000
> Mar 18 01:40:26 xen-san-gb1 kernel: bnx2: eth0: BNX2_COM_COMXQ_FTQ_CTL
> 10000
> Mar 18 01:40:26 xen-san-gb1 kernel: bnx2: eth0: BNX2_COM_COMTQ_FTQ_CTL
> 20000
> Mar 18 01:40:26 xen-san-gb1 kernel: bnx2: eth0: BNX2_COM_COMQ_FTQ_CTL 10000
> Mar 18 01:40:26 xen-san-gb1 kernel: bnx2: eth0: BNX2_CP_CPQ_FTQ_CTL 4000
> Mar 18 01:40:26 xen-san-gb1 kernel: bnx2: eth0: TXP mode b84c state
> 80001000 evt_mask 500 pc 8001284 pc 8001284 instr 1440fffc
> Mar 18 01:40:26 xen-san-gb1 kernel: bnx2: eth0: TPAT mode b84c state
> 80001000 evt_mask 500 pc 8000a50 pc 8000a4c instr 38420001
> Mar 18 01:40:26 xen-san-gb1 kernel: bnx2: eth0: RXP mode b84c state
> 80001000 evt_mask 500 pc 8004ad0 pc 8004adc instr 14e0005d
> Mar 18 01:40:26 xen-san-gb1 kernel: bnx2: eth0: COM mode b8cc state
> 80008000 evt_mask 500 pc 8000a98 pc 8000a8c instr 8821
> Mar 18 01:40:26 xen-san-gb1 kernel: bnx2: eth0: CP mode b8cc state
> 80000000 evt_mask 500 pc 8000c7c pc 8000928 instr 8ce800e8
> Mar 18 01:40:26 xen-san-gb1 kernel: bnx2:<--- end FTQ dump on eth0 --->
> Mar 18 01:40:26 xen-san-gb1 kernel: bnx2: eth0 DEBUG: intr_sem[0]
> Mar 18 01:40:26 xen-san-gb1 kernel: bnx2: eth0 DEBUG: intr_sem[0]
> PCI_CMD[00100406]
> Mar 18 01:40:26 xen-san-gb1 kernel: bnx2: eth0 DEBUG: PCI_PM[19002008]
> PCI_MISC_CFG[92000088]
> Mar 18 01:40:26 xen-san-gb1 kernel: bnx2: eth0 DEBUG:
> EMAC_TX_STATUS[00000008] EMAC_RX_STATUS[00000000]
> Mar 18 01:40:26 xen-san-gb1 kernel: bnx2: eth0 RPM_MGMT_PKT_CTRL[40000088]
> Mar 18 01:40:27 xen-san-gb1 kernel: bnx2: eth0 DEBUG:
> MCP_STATE_P0[0003610e] MCP_STATE_P1[0003610e]
> Mar 18 01:40:27 xen-san-gb1 kernel: bnx2: eth0 DEBUG:
> HC_STATS_INTERRUPT_STATUS[01fe0001]
> Mar 18 01:40:27 xen-san-gb1 kernel: Ring state for ring 0 napi state 12
> Mar 18 01:40:27 xen-san-gb1 kernel: netdev state 7
> Mar 18 01:40:27 xen-san-gb1 kernel: hw status idx 3267 last status idx
> 307c irq jiffies 100759890
> Mar 18 01:40:27 xen-san-gb1 kernel: hw tx cons a669 hw rx cons 103c
> Mar 18 01:40:27 xen-san-gb1 kernel: sw tx cons a57c a57c prod a669
> Mar 18 01:40:27 xen-san-gb1 kernel: sw rx cons f3c prod 103c
> Mar 18 01:40:27 xen-san-gb1 kernel: Current jiffies 1008f4741 HZ fa tx
> 1008f41e2 poll 100759890
> Mar 18 01:40:27 xen-san-gb1 kernel: tx stop jiffies 1008f41e2 tx start
> jiffies 0
> Mar 18 01:40:27 xen-san-gb1 kernel: irq_event c68c36 napi_event c68c37
> Mar 18 01:40:27 xen-san-gb1 kernel: Ring state for ring 0 napi state 12
> Mar 18 01:40:27 xen-san-gb1 kernel: netdev state 77
> Mar 18 01:40:27 xen-san-gb1 kernel: hw status idx 3267 last status idx
> 307c irq jiffies 100759890
> Mar 18 01:40:27 xen-san-gb1 kernel: hw tx cons a669 hw rx cons 103c
> Mar 18 01:40:27 xen-san-gb1 kernel: sw tx cons a57c a57c prod a669
> Mar 18 01:40:27 xen-san-gb1 kernel: sw rx cons f3c prod 103c
> Mar 18 01:40:27 xen-san-gb1 kernel: Current jiffies 1008f4741 HZ fa tx
> 1008f41e2 poll 100759890
> Mar 18 01:40:27 xen-san-gb1 kernel: tx stop jiffies 1008f41e2 tx start
> jiffies 0
> Mar 18 01:40:27 xen-san-gb1 kernel: irq_event c68c36 napi_event c68c37
> Mar 18 01:40:27 xen-san-gb1 kernel: bnx2: eth0 NIC Copper Link is Down
> Mar 18 01:40:27 xen-san-gb1 kernel: bonding: bond0: link status
> definitely down for interface eth0, disabling it
>
> This was then followed rather quickly by a failure with the second NIC
> (eth1) in the bond:
>
> Mar 18 01:42:26 xen-san-gb1 kernel: NETDEV WATCHDOG: eth1: transmit
> timed out
> Mar 18 01:42:26 xen-san-gb1 kernel: bnx2:<--- start FTQ dump on eth1 --->
> Mar 18 01:42:26 xen-san-gb1 kernel: bnx2: eth1: BNX2_RV2P_PFTQ_CTL 10000
> Mar 18 01:42:26 xen-san-gb1 kernel: bnx2: eth1: BNX2_RV2P_TFTQ_CTL 20000
> Mar 18 01:42:26 xen-san-gb1 kernel: bnx2: eth1: BNX2_RV2P_MFTQ_CTL 4000
> Mar 18 01:42:26 xen-san-gb1 kernel: bnx2: eth1: BNX2_TBDR_FTQ_CTL 4002
> Mar 18 01:42:26 xen-san-gb1 kernel: bnx2: eth1: BNX2_TDMA_FTQ_CTL 10000
> Mar 18 01:42:26 xen-san-gb1 kernel: bnx2: eth1: BNX2_TXP_FTQ_CTL 10002
> Mar 18 01:42:26 xen-san-gb1 kernel: bnx2: eth1: BNX2_TPAT_FTQ_CTL 10000
> Mar 18 01:42:26 xen-san-gb1 kernel: bnx2: eth1: BNX2_RXP_CFTQ_CTL 8000
> Mar 18 01:42:26 xen-san-gb1 kernel: bnx2: eth1: BNX2_RXP_FTQ_CTL 100000
> Mar 18 01:42:26 xen-san-gb1 kernel: bnx2: eth1: BNX2_COM_COMXQ_FTQ_CTL
> 10000
> Mar 18 01:42:26 xen-san-gb1 kernel: bnx2: eth1: BNX2_COM_COMTQ_FTQ_CTL
> 20000
> Mar 18 01:42:26 xen-san-gb1 kernel: bnx2: eth1: BNX2_COM_COMQ_FTQ_CTL 10000
> Mar 18 01:42:26 xen-san-gb1 kernel: bnx2: eth1: BNX2_CP_CPQ_FTQ_CTL 4000
> Mar 18 01:42:26 xen-san-gb1 kernel: bnx2: eth1: TXP mode b84c state
> 80005000 evt_mask 500 pc 8001294 pc 8001284 instr 38640001
> Mar 18 01:42:26 xen-san-gb1 kernel: bnx2: eth1: TPAT mode b84c state
> 80001000 evt_mask 500 pc 8000a58 pc 8000a5c instr 8f820014
> Mar 18 01:42:26 xen-san-gb1 kernel: bnx2: eth1: RXP mode b84c state
> 80001000 evt_mask 500 pc 8004ad0 pc 8004adc instr 14e0005d
> Mar 18 01:42:26 xen-san-gb1 kernel: bnx2: eth1: COM mode b8cc state
> 80000000 evt_mask 500 pc 8000a9c pc 8000a94 instr 3c028000
> Mar 18 01:42:26 xen-san-gb1 kernel: bnx2: eth1: CP mode b8cc state
> 80008000 evt_mask 500 pc 8000c58 pc 8000c6c instr 27bdffe8
> Mar 18 01:42:26 xen-san-gb1 kernel: bnx2:<--- end FTQ dump on eth1 --->
> Mar 18 01:42:26 xen-san-gb1 kernel: bnx2: eth1 DEBUG: intr_sem[0]
> Mar 18 01:42:26 xen-san-gb1 kernel: bnx2: eth1 DEBUG: intr_sem[0]
> PCI_CMD[00100406]
> Mar 18 01:42:26 xen-san-gb1 kernel: bnx2: eth1 DEBUG: PCI_PM[19002008]
> PCI_MISC_CFG[92000088]
> Mar 18 01:42:26 xen-san-gb1 kernel: bnx2: eth1 DEBUG:
> EMAC_TX_STATUS[00000008] EMAC_RX_STATUS[00000000]
> Mar 18 01:42:26 xen-san-gb1 kernel: bnx2: eth1 RPM_MGMT_PKT_CTRL[40000088]
> Mar 18 01:42:27 xen-san-gb1 kernel: bnx2: eth1 DEBUG:
> MCP_STATE_P0[0003610e] MCP_STATE_P1[0003610e]
> Mar 18 01:42:27 xen-san-gb1 kernel: bnx2: eth1 DEBUG:
> HC_STATS_INTERRUPT_STATUS[01fe0001]
> Mar 18 01:42:27 xen-san-gb1 kernel: Ring state for ring 0 napi state 12
> Mar 18 01:42:27 xen-san-gb1 kernel: netdev state 7
> Mar 18 01:42:27 xen-san-gb1 kernel: hw status idx 2bb0 last status idx
> 29c4 irq jiffies 100759898
> Mar 18 01:42:27 xen-san-gb1 kernel: hw tx cons e421 hw rx cons a8ce
> Mar 18 01:42:27 xen-san-gb1 kernel: sw tx cons e334 e334 prod e421
> Mar 18 01:42:27 xen-san-gb1 kernel: sw rx cons a7ce prod a8ce
> Mar 18 01:42:27 xen-san-gb1 kernel: Current jiffies 1008fbc71 HZ fa tx
> 1008fb744 poll 100759898
> Mar 18 01:42:27 xen-san-gb1 kernel: tx stop jiffies 1008fb744 tx start
> jiffies 100239dfd
> Mar 18 01:42:27 xen-san-gb1 kernel: irq_event ab2e13 napi_event ab2e14
> Mar 18 01:42:27 xen-san-gb1 kernel: Ring state for ring 0 napi state 12
> Mar 18 01:42:27 xen-san-gb1 kernel: netdev state 77
> Mar 18 01:42:27 xen-san-gb1 kernel: hw status idx 2bb0 last status idx
> 29c4 irq jiffies 100759898
> Mar 18 01:42:27 xen-san-gb1 kernel: hw tx cons e421 hw rx cons a8ce
> Mar 18 01:42:27 xen-san-gb1 kernel: sw tx cons e334 e334 prod e421
> Mar 18 01:42:27 xen-san-gb1 kernel: sw rx cons a7ce prod a8ce
> Mar 18 01:42:27 xen-san-gb1 kernel: Current jiffies 1008fbc72 HZ fa tx
> 1008fb744 poll 100759898
> Mar 18 01:42:27 xen-san-gb1 kernel: tx stop jiffies 1008fb744 tx start
> jiffies 100239dfd
> Mar 18 01:42:27 xen-san-gb1 kernel: irq_event ab2e13 napi_event ab2e14
> Mar 18 01:42:27 xen-san-gb1 kernel: bnx2: eth1 NIC Copper Link is Down
> Mar 18 01:42:27 xen-san-gb1 kernel: bonding: bond0: link status
> definitely down for interface eth1, disabling it
> Mar 18 01:42:27 xen-san-gb1 kernel: bonding: bond0: Warning: No 802.3ad
> response from the link partner for any adapters in the bond
>
> Onto more technical details...
>
> The kernel we were running (2.6.18.8 from xenbits) was compiled without
> support for MSI/MSI-X originally.  So, we were experiencing these
> problems with plain standard IRQ's.  Michael Chan @ Broadcom, the author
> of bnx2 if you modinfo, has told me via email:
>
>     * "The logs show that we haven't had an interrupt for a very long
> time. It's not clear how that interrupt was lost."
>     * "So far the logs don't show any inconsistent state in the hardware
> or software. It is possible that the Xen kernel is missing an interrupt
> and not delivering to the driver. Normally, in INTA mode, the IRQ is
> level triggered and should remain asserted until it is seen by the
> driver and de-asserted by the driver."
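
[Annotation on the quoted log] The "no interrupt for a very long time"
remark can be roughly checked against the eth0 dump above. The sketch
below is Python and assumes, without the debug driver's source to
confirm it, that "poll" / "irq jiffies" is the timestamp of the last
serviced interrupt, that "HZ fa" is the tick rate in hex, and that the
tx consumer indices are 16-bit ring counters. The variable names are
mine, not the driver's.

```python
# Values copied from the eth0 debug dump (all hexadecimal in the log).
current_jiffies = 0x1008f4741    # "Current jiffies"
last_poll_jiffies = 0x100759890  # "poll" / "irq jiffies" (assumed: last IRQ seen)
hz = 0xfa                        # "HZ fa" = 250 ticks per second

hw_tx_cons = 0xa669              # "hw tx cons": what the NIC reports as completed
sw_tx_cons = 0xa57c              # "sw tx cons": what the driver has processed

# Time elapsed since the driver last saw an interrupt, in seconds.
seconds_since_irq = (current_jiffies - last_poll_jiffies) / hz

# Completions the hardware has posted but the driver has not yet processed.
# The mask handles wrap-around of the (assumed) 16-bit index.
pending_tx = (hw_tx_cons - sw_tx_cons) & 0xffff

print(f"seconds since last interrupt: {seconds_since_irq:.0f}")
print(f"tx completions not yet processed: {pending_tx}")
```

Under those assumptions this gives about 6732 seconds (close to two
hours) without an interrupt, with 237 transmit completions waiting,
which is consistent with an interrupt that never reached the driver
rather than with a hung NIC.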
>
> But, just in case, I compiled 2.6.18.8 with support for MSI/MSI-X and
> was able to confirm (via dmesg and lspci -vv) that the NIC's began to
> use MSI for interrupts.  Unfortunately, the NIC crash happened anyway
> (the above kernel logs are actually from a run with MSI enabled).
>
> Here's what's really bugging me.  We have a Dell PowerEdge R610, running
> Xen along with the bnx2 drivers from Broadcom, that's been online for
> ~220 days without a failure.  The only difference is the system is not
> making use of bonding.  It has just one NIC connected to the network
> with no VLAN's trunked down etc.
>
> It looks like I'm not alone out there, as there's a Red Hat bugzilla
> report for this issue:
>
> https://bugzilla.redhat.com/show_bug.cgi?id=520888
>
> ^^ The above shows Status: CLOSED DUPLICATE of bug 511368
> (https://bugzilla.redhat.com/show_bug.cgi?id=511368), but it looks like
> I don't have access to view 511368. Grrr.
>
> Anyways...
>
> 1) Has anybody else experienced this issue?
> 2) Any developers care to comment on possible causes of this problem?
> 3) Anybody know of a solution?
> 4) What can I do to troubleshoot further, and get developers necessary
> information?
>
> Lastly...
>
> 5) Is anybody running Intel NIC's within Dell PowerEdge R610's, using
> bonding + Xen 3.4.3 + 2.6.18.8, and can safely report success?  I may
> switch to Intel...
>
> Thanks!
>
>    


