From: "Michael Chan" <mchan@broadcom.com>
To: "Sven Ulland" <sveniu@opera.com>
Cc: netdev@vger.kernel.org, "Marc A. Donges" <marc.donges@1und1.de>
Subject: Re: bnx2 cards intermittantly going offline
Date: Thu, 13 Sep 2012 13:30:20 -0700
Message-ID: <1347568220.7890.10.camel@HP1>
In-Reply-To: <5051FFAA.8060501@opera.com>
On Thu, 2012-09-13 at 17:45 +0200, Sven Ulland wrote:
> On 09/13/2012 03:51 PM, Marc A. Donges wrote:
> > After 55 days of operation the machine (A) suddenly was no longer
> > reachable via network. Strangely, a second machine (B) that should
> > take over the IP addresses (keepalived) did not take over. Only
> > after shutting the switchport to which A is attached did B take
> > over.
The rx_ftq_discards problem is a firmware problem. FTQ discards mean
that the firmware is no longer running and the packets are dropped at
the FTQ. This is likely fixed in:

commit 22fa159d37efbfe781bbb99279efe83f58b87d29
Author: Michael Chan <mchan@broadcom.com>
Date:   Mon Oct 11 16:12:00 2010 -0700

    bnx2: Update firmware to 6.0.x.

>
> Hi. We've had the same symptom with our BCM5709S [14e4:163a] on
> Debian. Like you, we were on stable's 2.6.32-41squeeze2. Google led us
> to many similar issues [1,2,3]. They concluded with the fix being in
> mainline commit c441b8d2 [4]: "bnx2: Fix lost MSI-X problem on 5709
> NICs".
This is a different problem and will not result in FTQ discards.
>
> Broadcom: Can you publish a tool that decodes ethtool -d dumps to make
> debugging easier, or do you deem it no longer necessary with the
> register dump commits in 555069da?
The register dump during tx timeout is now quite comprehensive.
>
> Now, Debian's 2.6.32-41squeeze2 is based on longterm release 2.6.32.54
> [5]. That version includes commit 0b7817ed [6], which is a backport of
> the already mentioned mainline commit c441b8d2.
>
> So we tried digging further and applying some seemingly relevant
> commits [7,8] to our 2.6.32, but without any change in behaviour. Our
> temporary fix was to run 'ethtool -t ethX' to reset the device every
> time it locked up.
>
> This dragged on with various builds, until we ended up on mainline
> 2.6.38 where we no longer saw any symptoms. I don't know in which
> kernel version it was fixed, but we ended up on that one, sort of by
> chance. Unfortunately, it had severe issues with kswapd memory
> compaction causing CPU soft lockups [9], so we went straight to
> squeeze-backports' 3.2.23-1~bpo60+2. We've been happy since then.
>
> > We have five pairs of basically identical machines performing the
> > same task (each pair for one site). The error has not occurred with
> > any other one, but this site is the busiest:
>
> We also saw the issue only at a site with generally higher load
> compared to other sites.
>
> I'd love to know exactly which commit fixed the issue, but it's fairly
> tricky to reproduce the issue, and the bisect count is fairly high (it
> need not be a specific fix for bnx2).
If you see the same FTQ discards, please try that firmware commit
mentioned above. Thanks.
>
> sven
>
>
> [1]: bnx2 driver crashes under random circumstances
> https://bugzilla.redhat.com/show_bug.cgi?id=520888
>
> [2]: Access denied. Come on, Red Hat!
> https://bugzilla.redhat.com/show_bug.cgi?id=511368
>
> [3]: NIC doesn't register packets [rhel-5.5.z]
> https://bugzilla.redhat.com/show_bug.cgi?id=587799
>
> [4]: bnx2: Fix lost MSI-X problem on 5709 NICs.
> http://git.kernel.org/?p=linux/kernel/git/stable/linux-stable.git;a=object;h=c441b8d2cb2194b05550a558d6d95d8944e56a84
>
> [5]: Debian Changelog linux-2.6 (2.6.32-45)
> http://packages.debian.org/changelogs/pool/main/l/linux-2.6/linux-2.6_2.6.32-45/changelog#version2.6.32-41
>
> [6]: bnx2: Fix lost MSI-X problem on 5709 NICs.
> http://git.kernel.org/?p=linux/kernel/git/stable/linux-stable.git;a=commit;h=0b7817edda5e44e5fa769645bd1220f5e7b0beb5
>
> [7]: bnx2: reset_task is crashing the kernel. Fixing it.
> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=4529819c45161e4a119134f56ef504e69420bc98
>
> [8]: bnx2: fixing a timout error due not refreshing TX timers correctly
> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=e6bf95ffa8d6f8f4b7ee33ea01490d95b0bbeb6e
>
> [9]: [PATCH] remove compaction from kswapd
> http://thread.gmane.org/gmane.linux.kernel.mm/58962
> https://lkml.org/lkml/2011/3/25/664
>
>
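
To tell whether a lockup matches the FTQ-discard case described above rather than the lost MSI-X case, one option is to sample the driver statistics twice and compare the rx_ftq_discards counter. Below is a minimal Python sketch, not a tested diagnostic tool. It assumes that `ethtool -S <iface>` prints one `name: value` pair per line and that the bnx2 driver reports a counter named rx_ftq_discards. The function names and the sampling interval are hypothetical.

```python
import subprocess
import time


def parse_ethtool_stats(text):
    """Parse `ethtool -S` style output into a {counter_name: int} dict."""
    stats = {}
    for line in text.splitlines():
        # Split on the last colon; header lines such as "NIC statistics:"
        # have no numeric value after it and are skipped.
        name, sep, value = line.rpartition(":")
        value = value.strip()
        if sep and value.isdigit():
            stats[name.strip()] = int(value)
    return stats


def ftq_discards_increased(before, after, counter="rx_ftq_discards"):
    """Return True if the counter grew between the two samples."""
    return after.get(counter, 0) > before.get(counter, 0)


def read_stats(iface):
    """Run `ethtool -S iface` and return the parsed counters."""
    result = subprocess.run(
        ["ethtool", "-S", iface],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_ethtool_stats(result.stdout)


def watch(iface, interval_seconds=10):
    """Take two samples and report whether FTQ discards are growing."""
    before = read_stats(iface)
    time.sleep(interval_seconds)  # interval is an arbitrary choice
    after = read_stats(iface)
    return ftq_discards_increased(before, after)
```

Calling `watch("eth0")` on an affected host would return True if the counter is still climbing. A counter that stays flat during a lockup would point to a different problem, such as the MSI-X issue discussed in the thread.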