From: David Gibson <david@gibson.dropbear.id.au>
To: Manish Chopra <manish.chopra@qlogic.com>
Cc: Sony Chacko <sony.chacko@qlogic.com>,
	Rajesh Borundia <rajesh.borundia@qlogic.com>,
	netdev <netdev@vger.kernel.org>,
	"snagarka@redhat.com" <snagarka@redhat.com>,
	"tcamuso@redhat.com" <tcamuso@redhat.com>,
	"vdasgupt@redhat.com" <vdasgupt@redhat.com>
Subject: Re: [0/2] netxen: bug fix and diagnostics for possible (hardware?) bug
Date: Fri, 24 Jan 2014 17:44:11 +1100
Message-ID: <20140124064411.GC4361@voom.redhat.com>
In-Reply-To: <31AFFC7280259C4184970ABA9AFE8B938CF868E9@avmb3.qlogic.org>

On Thu, Dec 19, 2013 at 09:11:33AM +0000, Manish Chopra wrote:
> >> >From: David Gibson [mailto:david@gibson.dropbear.id.au]
> >> >Sent: Tuesday, December 17, 2013 10:53 AM
> >> >To: Manish Chopra; Sony Chacko; Rajesh Borundia
> >> >Cc: netdev; snagarka@redhat.com; tcamuso@redhat.com;
> >> >vdasgupt@redhat.com
> >> >Subject: [0/2] netxen: bug fix and diagnostics for possible
> >> >(hardware?) bug
> >> >
> >> >At Red Hat, we've hit a couple of customer cases with crashes in the
> >> >netxen driver due to list corruption.  This seems to be very rarely
> >> >triggered, and unfortunately the dumps we have don't have enough
> >> >information to be certain of the cause, although we have a possible theory.
> >> >
> >> >I'm suggesting, therefore a patch to add some sanity checking which
> >> >should help to at least localize and mitigate the problem when someone hits it
> >in future.
> >> >Please let me know if there's a better approach to doing this.
> >> >
> >> >That's 2/2.  1/2 is a fix for a clear bug I spotted along the way,
> >> >but not one that could cause the symptoms we've seen.
> >>
> >> David,
> >>
> >> Having these checks in data path(Rx path) may have some performance
> >> impact. It's better to root cause it instead of putting some sanity
> >> checks.
> >
> >Obviously, but this was the best way I could think of to try narrowing down the
> >root cause (at least trying to eliminate driver vs. firmware bug).
> 
> David, Instead of making permanent changes in driver, can you please
> run your modified driver in selective customer environment where
> this issues is seen?

Yeah, the trouble with that is that the bug has never triggered
twice for a single customer.  Well, technically there is one customer
that's hit it twice, but I'm pretty sure that was on entirely unrelated
systems in different sections of a large organization.  The only reason
I can see enough cases to suspect a pattern is from looking across
Red Hat's whole case history.

> Which may give some data point that what's the issue exactly and then we go by that.
> 
> >
> >> We will get back to you on this.
> >
> >If you have a better idea for locating the root cause, please let me know.  I have
> >access to a vmcore which I can poke around in.
> 
> We will also try to reproduce the problem in our environment and debug this.
> Can you please give some details?

Apologies for the long delay, I'd been hoping for some more
confirmation of things, but it hasn't happened.  I'll give you what I
can.

> 1) what's the driver and firmware version used?

I'm not sure what the most useful way of giving a driver version is.
I've given the kernel version below, but it's an RH kernel, so I'm not
sure how much has been backported.

As to firmware, the driver reports:

netxen_nic 0000:04:00.0: Gen2 strapping detected
netxen_nic 0000:04:00.0: using 64-bit dma mask
netxen_nic: NX3031 Gigabit Ethernet Board S/N
<FF><FF><FF><FF><FF><FF><FF><FF><FF>
<FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF>
<FF><FF><FF>NX3031 Gigabit Ethernet  Chip rev 0x42
netxen_nic 0000:04:00.0: firmware v4.0.585 [legacy]

> 2) which operating system and kernel version?

RHEL5:

Linux hostname 2.6.18-308.el5 #1 SMP Fri Jan 27 17:17:51 EST 2012 x86_64 x86_64 x86_64 GNU/Linux

> 3) please send the vmcore also with backtrace if available which can
> give some idea what can trigger this issue.

I can't send the vmcore itself, since it will include customer data.
I can give you the backtrace below, and look up specific things if you
can give me an idea of what you need:

crash> bt
PID: 0      TASK: ffff81207f8bd7e0  CPU: 44  COMMAND: "swapper"
 #0 [ffff81107fd33b40] crash_kexec at ffffffff800b0938
 #1 [ffff81107fd33c00] __die at ffffffff80065137
 #2 [ffff81107fd33c40] die at ffffffff8006c789
 #3 [ffff81107fd33c70] do_invalid_op at ffffffff8006cd49
 #4 [ffff81107fd33d30] error_exit at ffffffff8005dde9
    [exception RIP: list_del+71]
    RIP: ffffffff8015a793  RSP: ffff81107fd33de0  RFLAGS: 00010286
    RAX: 0000000000000058  RBX: 0000000000000427  RCX: ffffffff80323028
    RDX: ffffffff80323028  RSI: 0000000000000000  RDI: ffffffff80323020
    RBP: ffff81407f4e8680   R8: ffffffff80323028   R9: 0000000000000001
    R10: 0000000000000000  R11: 0000000000000000  R12: ffffc200104494a0
    R13: 0000000000000002  R14: ffff81107a2cf500  R15: ffff81407e1bf400
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #5 [ffff81107fd33de8] netxen_process_rcv_ring at ffffffff8830b050 [netxen_nic]
 #6 [ffff81107fd33eb8] netxen_nic_poll at ffffffff88306e71 [netxen_nic]
 #7 [ffff81107fd33ef8] net_rx_action at ffffffff8000c9b9
 #8 [ffff81107fd33f38] __do_softirq at ffffffff80012551
 #9 [ffff81107fd33f68] call_softirq at ffffffff8005e2fc
#10 [ffff81107fd33f80] do_softirq at ffffffff8006d646
#11 [ffff81107fd33f90] do_IRQ at ffffffff8006d4d6
--- <IRQ stack> ---
#12 [ffff81307fe2fe38] ret_from_intr at ffffffff8005d615
    [exception RIP: mwait_idle_with_hints+102]
    RIP: ffffffff8006b9cf  RSP: ffff81307fe2fee8  RFLAGS: 00000246
    RAX: 0000000000000000  RBX: 00000000000000ff  RCX: 0000000000000000
    RDX: 0000000000000000  RSI: 0000000000000000  RDI: 0000000000000000
    RBP: 00007f319bfb6f27   R8: ffff81307fe2e000   R9: 0000000000000013
    R10: ffff8110b8288510  R11: 00000000ffffffff  R12: ffff81306273d040
    R13: ffff81207f8bd7e0  R14: 0000000000000001  R15: 0000000000000000
    ORIG_RAX: ffffffffffffff64  CS: 0010  SS: 0018
#13 [ffff81307fe2fee8] mwait_idle at ffffffff80056c65
#14 [ffff81307fe2fef0] cpu_idle at ffffffff80048f92
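
For anyone reading the trace without the source to hand: do_invalid_op
directly above an exception RIP inside list_del is consistent with a
list consistency check failing and calling BUG() when
netxen_process_rcv_ring tried to unlink an Rx buffer.  The following is
only a user-space sketch of that kind of check, not the RHEL5 code;
struct list_node, list_init, list_add_after and list_del_checked are
hypothetical names, and the sketch returns false where a debug kernel
would report the pointers and BUG().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for the kernel's struct list_head (hypothetical). */
struct list_node {
	struct list_node *next;
	struct list_node *prev;
};

/* Make head an empty circular list that points at itself. */
static void list_init(struct list_node *head)
{
	head->next = head;
	head->prev = head;
}

/* Link entry into the list immediately after head. */
static void list_add_after(struct list_node *entry, struct list_node *head)
{
	assert(entry != NULL && head != NULL);
	entry->next = head->next;
	entry->prev = head;
	head->next->prev = entry;
	head->next = entry;
}

/*
 * Unlink entry only if both neighbours still point back at it.
 * On any inconsistency the list is left untouched and false is
 * returned, so the caller can log and recover instead of crashing.
 */
static bool list_del_checked(struct list_node *entry)
{
	if (entry->prev == NULL || entry->next == NULL)
		return false;	/* already unlinked, or never linked */
	if (entry->prev->next != entry || entry->next->prev != entry)
		return false;	/* a neighbour no longer agrees */

	entry->next->prev = entry->prev;
	entry->prev->next = entry->next;
	entry->next = NULL;
	entry->prev = NULL;
	return true;
}
```

With a check of this shape, a second unlink of the same buffer (for
example, if the same buffer handle came back from the hardware twice) is
rejected rather than corrupting the list, which is the sort of case the
sanity checks in 2/2 are meant to localize.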

> 4) Test case details:- what type of test is running on the system?
> Just to make sure we also try the same test cases in our
> environment.

No particular type of test, it's an Oracle server in production.

> 5) Server details (Number of CPus, memory etc.) if available.

64 x Xeon X7550 CPUs, 256G RAM

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


Thread overview: 12+ messages
2013-12-17  5:22 [0/2] netxen: bug fix and diagnostics for possible (hardware?) bug David Gibson
2013-12-17  5:22 ` [PATCH 1/2] netxen: Correct off-by-one error in bounds check David Gibson
2013-12-17  6:37   ` Jitendra Kalsaria
2013-12-19 11:51   ` Manish Chopra
2013-12-20  4:11     ` David Gibson
2013-12-17  5:22 ` [PATCH 2/2] netxen: Add sanity checks for Rx buffers returning from hardware David Gibson
2013-12-19 20:05   ` David Miller
2014-01-24  5:21     ` David Gibson
2013-12-17 21:50 ` [0/2] netxen: bug fix and diagnostics for possible (hardware?) bug Manish Chopra
2013-12-18  6:22   ` David Gibson
2013-12-19  9:11     ` Manish Chopra
2014-01-24  6:44       ` David Gibson [this message]
