From: David Gibson <david@gibson.dropbear.id.au>
To: Manish Chopra <manish.chopra@qlogic.com>
Cc: Sony Chacko <sony.chacko@qlogic.com>,
Rajesh Borundia <rajesh.borundia@qlogic.com>,
netdev <netdev@vger.kernel.org>,
"snagarka@redhat.com" <snagarka@redhat.com>,
"tcamuso@redhat.com" <tcamuso@redhat.com>,
"vdasgupt@redhat.com" <vdasgupt@redhat.com>
Subject: Re: [0/2] netxen: bug fix and diagnostics for possible (hardware?) bug
Date: Fri, 24 Jan 2014 17:44:11 +1100
Message-ID: <20140124064411.GC4361@voom.redhat.com>
In-Reply-To: <31AFFC7280259C4184970ABA9AFE8B938CF868E9@avmb3.qlogic.org>
On Thu, Dec 19, 2013 at 09:11:33AM +0000, Manish Chopra wrote:
> >> >From: David Gibson [mailto:david@gibson.dropbear.id.au]
> >> >Sent: Tuesday, December 17, 2013 10:53 AM
> >> >To: Manish Chopra; Sony Chacko; Rajesh Borundia
> >> >Cc: netdev; snagarka@redhat.com; tcamuso@redhat.com;
> >> >vdasgupt@redhat.com
> >> >Subject: [0/2] netxen: bug fix and diagnostics for possible
> >> >(hardware?) bug
> >> >
> >> >At Red Hat, we've hit a couple of customer cases with crashes in the
> >> >netxen driver due to list corruption. This seems to be very rarely
> >> >triggered, and unfortunately the dumps we have don't have enough
> >> >information to be certain of the cause, although we have a possible theory.
> >> >
> >> >I'm suggesting, therefore a patch to add some sanity checking which
> >> >should help to at least localize and mitigate the problem when someone hits it
> >in future.
> >> >Please let me know if there's a better approach to doing this.
> >> >
> >> >That's 2/2. 1/2 is a fix for a clear bug I spotted along the way,
> >> >but not one that could cause the symptoms we've seen.
> >>
> >> David,
> >>
> >> Having these checks in data path(Rx path) may have some performance
> >> impact. It's better to root cause it instead of putting some sanity
> >> checks.
> >
> >Obviously, but this was the best way I could think of to try narrowing down the
> >root cause (at least trying to eliminate driver vs. firmware bug).
>
> David, Instead of making permanent changes in driver, can you please
> run your modified driver in selective customer environment where
> this issues is seen?
Yeah, the difficulty with that is that the problem has never triggered
twice for a single customer. Well, technically there is one customer
that's hit it twice, but I'm pretty sure those were entirely unrelated
systems in different parts of a large organisation. The only reason I
can see enough cases to suspect a pattern is from looking across
Red Hat's whole case history.
> Which may give some data point that what's the issue exactly and then we go by that.
>
> >
> >> We will get back to you on this.
> >
> >If you have a better idea for locating the root cause, please let me know. I have
> >access to a vmcore which I can poke around in.
>
> We will also try to reproduce the problem in our environment and debug this.
> Can you please give some details?
Apologies for the long delay. I'd been hoping for some more
confirmation of things, but it hasn't happened. I'll give you what I
can.
> 1) what's the driver and firmware version used?
I'm not sure what the most useful way of giving a driver version is.
I've given the kernel version below, but it's an RH kernel, so I'm not
sure how much has been backported.
As to firmware, the driver reports:
netxen_nic 0000:04:00.0: Gen2 strapping detected
netxen_nic 0000:04:00.0: using 64-bit dma mask
netxen_nic: NX3031 Gigabit Ethernet Board S/N
<FF><FF><FF><FF><FF><FF><FF><FF><FF>
<FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF>
<FF><FF><FF>NX3031 Gigabit Ethernet Chip rev 0x42
netxen_nic 0000:04:00.0: firmware v4.0.585 [legacy]
> 2) which operating system and kernel version?
RHEL5,
Linux hostname 2.6.18-308.el5 #1 SMP Fri Jan 27 17:17:51 EST 2012 x86_64 x86_64 x86_64 GNU/Linux
> 3) please send the vmcore also with backtrace if available which can
> give some idea what can trigger this issue.
I can't send the vmcore itself, since it will include customer data.
I can give you the backtrace below, and look up specific things if you
can give me an idea of what you need:
crash> bt
PID: 0 TASK: ffff81207f8bd7e0 CPU: 44 COMMAND: "swapper"
#0 [ffff81107fd33b40] crash_kexec at ffffffff800b0938
#1 [ffff81107fd33c00] __die at ffffffff80065137
#2 [ffff81107fd33c40] die at ffffffff8006c789
#3 [ffff81107fd33c70] do_invalid_op at ffffffff8006cd49
#4 [ffff81107fd33d30] error_exit at ffffffff8005dde9
[exception RIP: list_del+71]
RIP: ffffffff8015a793 RSP: ffff81107fd33de0 RFLAGS: 00010286
RAX: 0000000000000058 RBX: 0000000000000427 RCX: ffffffff80323028
RDX: ffffffff80323028 RSI: 0000000000000000 RDI: ffffffff80323020
RBP: ffff81407f4e8680 R8: ffffffff80323028 R9: 0000000000000001
R10: 0000000000000000 R11: 0000000000000000 R12: ffffc200104494a0
R13: 0000000000000002 R14: ffff81107a2cf500 R15: ffff81407e1bf400
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#5 [ffff81107fd33de8] netxen_process_rcv_ring at ffffffff8830b050 [netxen_nic]
#6 [ffff81107fd33eb8] netxen_nic_poll at ffffffff88306e71 [netxen_nic]
#7 [ffff81107fd33ef8] net_rx_action at ffffffff8000c9b9
#8 [ffff81107fd33f38] __do_softirq at ffffffff80012551
#9 [ffff81107fd33f68] call_softirq at ffffffff8005e2fc
#10 [ffff81107fd33f80] do_softirq at ffffffff8006d646
#11 [ffff81107fd33f90] do_IRQ at ffffffff8006d4d6
--- <IRQ stack> ---
#12 [ffff81307fe2fe38] ret_from_intr at ffffffff8005d615
[exception RIP: mwait_idle_with_hints+102]
RIP: ffffffff8006b9cf RSP: ffff81307fe2fee8 RFLAGS: 00000246
RAX: 0000000000000000 RBX: 00000000000000ff RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
RBP: 00007f319bfb6f27 R8: ffff81307fe2e000 R9: 0000000000000013
R10: ffff8110b8288510 R11: 00000000ffffffff R12: ffff81306273d040
R13: ffff81207f8bd7e0 R14: 0000000000000001 R15: 0000000000000000
ORIG_RAX: ffffffffffffff64 CS: 0010 SS: 0018
#13 [ffff81307fe2fee8] mwait_idle at ffffffff80056c65
#14 [ffff81307fe2fef0] cpu_idle at ffffffff80048f92
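For anyone reading the trace without the patch to hand: the
do_invalid_op underneath list_del looks like a list consistency check
firing, though that is an inference from the trace and not something
confirmed. The general kind of sanity check being proposed can be
sketched in userspace C roughly as below. This is not the netxen code.
rx_ring, rx_buffer, ring_complete(), list_del_checked() and RX_BUFS are
all hypothetical names, and the idea that the hardware hands back a
buffer index is an assumption made purely for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Minimal doubly linked list, modelled loosely on struct list_head. */
struct list_head {
    struct list_head *next, *prev;
};

static void list_init(struct list_head *h)
{
    h->next = h->prev = h;
}

static void list_add_tail(struct list_head *n, struct list_head *h)
{
    n->prev = h->prev;
    n->next = h;
    h->prev->next = n;
    h->prev = n;
}

/*
 * Unlink an entry, but refuse (return false) if its neighbours no
 * longer point back at it, instead of corrupting the list further.
 */
static bool list_del_checked(struct list_head *e)
{
    if (e->prev == NULL || e->next == NULL)
        return false;
    if (e->prev->next != e || e->next->prev != e)
        return false;
    e->next->prev = e->prev;
    e->prev->next = e->next;
    e->next = e->prev = NULL;   /* mark as unlinked */
    return true;
}

#define RX_BUFS 4               /* hypothetical ring size */

struct rx_buffer {
    struct list_head list;
    bool posted;                /* true while the "hardware" owns it */
};

struct rx_ring {
    struct list_head posted_list;
    struct rx_buffer buf[RX_BUFS];
};

/* Post every buffer to the (simulated) hardware. */
static void ring_init(struct rx_ring *r)
{
    list_init(&r->posted_list);
    for (size_t i = 0; i < RX_BUFS; i++) {
        r->buf[i].posted = true;
        list_add_tail(&r->buf[i].list, &r->posted_list);
    }
}

/*
 * Handle a completion for buffer 'index' as reported by the hardware.
 * Returns 0 on success, -1 if the completion fails a sanity check.
 */
static int ring_complete(struct rx_ring *r, unsigned int index)
{
    struct rx_buffer *b;

    if (index >= RX_BUFS) {
        fprintf(stderr, "rx: index %u out of range\n", index);
        return -1;
    }
    b = &r->buf[index];
    if (!b->posted) {
        fprintf(stderr, "rx: buffer %u returned but not posted\n", index);
        return -1;
    }
    if (!list_del_checked(&b->list)) {
        fprintf(stderr, "rx: list links for buffer %u are inconsistent\n", index);
        return -1;
    }
    b->posted = false;
    return 0;
}
```

In this sketch, completing the same index twice is rejected by the
posted flag before list_del_checked() is ever reached, so the message
says which buffer the hardware returned unexpectedly. That is the sort
of information the dumps we have are missing.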
> 4) Test case details:- what type of test is running on the system?
> Just to make sure we also try the same test cases in our
> environment.
No particular type of test; it's an Oracle server in production.
> 5) Server details (Number of CPus, memory etc.) if available.
64 x Xeon X7550 CPUs, 256G RAM
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson