kernel 3.14.53 + bnx2x loss of connectivity / parity errors / MCP SCPAD


* kernel 3.14.53 + bnx2x loss of connectivity / parity errors / MCP SCPAD
@ 2015-11-04 11:31 Patrick Schaaf
  2015-11-05  6:45 ` Yuval Mintz
  0 siblings, 1 reply; 3+ messages in thread
From: Patrick Schaaf @ 2015-11-04 11:31 UTC (permalink / raw)
  To: NETDEV; +Cc: Greg KH, ariele

Dear netdevs,

on a production server (HP DL380 Gen9 with HP 10GE dual port card - bnx2x 
driver), I just encountered a full loss of connectivity through the 10 GE 
ports. Kernel in use is vanilla 3.14.53.

On the console I could see this (timestamps omitted, have to type by hand, 
damn ILO console does not let me copy+paste text...)

MCP SCPAD
MCP SCPAD
bnx2x 0000:04:00.1 eth1: Parity errors detected in blocks:
MCP SCPAD
MCP SCPAD
bnx2x 0000:04:00.0 eth0: Parity errors detected in blocks:
bnx2x: [bnx2x_attn_int_deasserted3:4080(eth0)]LATCHED attention 0x80000000 (masked)
MCP SCPAD
...
systemd-journald[491]: /dev/kmsg buffer overrun, some messages lost.

Some googling around finds:

https://github.com/torvalds/linux/commit/ad6afbe9578d1fa26680faf78c846bd8c00d1d6e 

which might be related. If I read that correctly (and I have no real idea what 
I'm talking about, sorry...), that patch removes superfluous printks which 
might, e.g. in our case, hide the real cause. I.e., even with that patch we 
would have had a problem / loss of connectivity, but we might know better why.

Maybe that changeset would be suitable for backporting to long term stable 
kernels?

Incidentally, how should these parity events be judged generally? Hope it's a 
one time cosmic ray incident? Cry "faulty hardware, please repair" to the 
supplier? Anything else?

best regards
  Patrick


* RE: kernel 3.14.53 + bnx2x loss of connectivity / parity errors / MCP SCPAD
  2015-11-04 11:31 kernel 3.14.53 + bnx2x loss of connectivity / parity errors / MCP SCPAD Patrick Schaaf
@ 2015-11-05  6:45 ` Yuval Mintz
  2015-11-05  8:25   ` Patrick Schaaf
  0 siblings, 1 reply; 3+ messages in thread
From: Yuval Mintz @ 2015-11-05  6:45 UTC (permalink / raw)
  To: Patrick Schaaf, netdev; +Cc: Greg KH, ariele@broadcom.com

> on a production server (HP DL380 Gen9 with HP 10GE dual port card - bnx2x
> driver), I just encountered a full loss of connectivity through the 10 GE ports.
> Kernel in use is vanilla 3.14.53.
> 
> On the console I could see this (timestamps omitted, have to type by hand,
> damn ILO console does not let me copy+paste text...)
> 
> MCP SCPAD
> MCP SCPAD
> bnx2x 0000:04:00.1 eth1: Parity errors detected in blocks:
> MCP SCPAD
> MCP SCPAD
> bnx2x 0000:04:00.0 eth0: Parity errors detected in blocks:
> bnx2x: [bnx2x_attn_int_deasserted3:4080(eth0)]LATCHED attention 0x80000000 (masked)
> MCP SCPAD
> ...
> systemd-journald[491]: /dev/kmsg buffer overrun, some messages lost.
> 
> Some googling around finds:
> 
> https://github.com/torvalds/linux/commit/ad6afbe9578d1fa26680faf78c846bd8c00d1d6e
> 
> which might be related. If I read that correctly (and I have no real idea what I'm
> talking about, sorry...), that patch removes superfluous printks which might, e.g. in
> our case, hide the real cause. I.e., even with that patch we would have had a
> problem / loss of connectivity, but we might know better why.
>
> Maybe that changeset would be suitable for backporting to long term stable
> kernels?
> 
> Incidentally, how should these parity events be judged generally? Hope it's a one
> time cosmic ray incident? Cry "faulty hardware, please repair" to the supplier?
> Anything else?

A couple of things to note - 
1. On older kernels, an MCP SCPAD parity error on its own would have
triggered the parity recovery flows and, assuming those failed, left the
adapter in an unstable state.
But 3.14.53 should be past that point, and should only log MCP SCPAD errors
instead of initiating recovery.

2. Since the SCPAD is not on the datapath, even if a real parity error
did occur, if that's the only problem then it shouldn't have stopped traffic.

3. In most cases SCPAD errors are due to utilities, e.g., `ethtool -d' or
`ethtool -t', that are run on the adapter's network interface; theoretically,
if there's some unexpected incompatibility between driver and management FW
it might also happen.

4. The patch you've listed merely removes the MCP SCPAD prints, as they're
unavoidable in certain scenarios; it doesn't actually solve anything.

Having said that, do you know if anything happened to the setup that
triggered this? E.g., some configuration change, new utility, etc.?
Alternatively, did the log show anything else in addition to the MCP SCPAD?
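
One way to answer the last question is to strip the three message types already quoted in this thread from the log and look at what remains. The following is only a minimal sketch: the function name `other_messages` is hypothetical, the patterns are taken from the console output quoted above, and the sample lines stand in for a real log.

```python
import re

# Patterns for the messages already quoted in this thread.
# Anything that matches none of them is reported as "other".
KNOWN_NOISE = [
    re.compile(r"MCP SCPAD"),
    re.compile(r"Parity errors detected in blocks"),
    re.compile(r"LATCHED attention 0x[0-9a-f]+ \(masked\)"),
]


def other_messages(lines):
    """Return the non-empty lines that match none of the known patterns."""
    return [
        line.rstrip("\n")
        for line in lines
        if line.strip() and not any(p.search(line) for p in KNOWN_NOISE)
    ]


# Sample input copied from the console output quoted in this thread.
sample_log = [
    "MCP SCPAD\n",
    "bnx2x 0000:04:00.1 eth1: Parity errors detected in blocks:\n",
    "bnx2x: [bnx2x_attn_int_deasserted3:4080(eth0)]LATCHED attention 0x80000000 (masked)\n",
    "systemd-journald[491]: /dev/kmsg buffer overrun, some messages lost.\n",
]

for message in other_messages(sample_log):
    print(message)
```

On a real system the same function could be fed the lines of a saved kernel log instead of `sample_log`. Messages dropped by the kmsg buffer overrun cannot be recovered this way.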


* Re: kernel 3.14.53 + bnx2x loss of connectivity / parity errors / MCP SCPAD
  2015-11-05  6:45 ` Yuval Mintz
@ 2015-11-05  8:25   ` Patrick Schaaf
  0 siblings, 0 replies; 3+ messages in thread
From: Patrick Schaaf @ 2015-11-05  8:25 UTC (permalink / raw)
  To: Yuval Mintz; +Cc: netdev, Greg KH

Hi Yuval,

thanks for your notes.

> 4. The patch you've listed merely removes the MCP SCPAD prints, as they're
> unavoidable in certain scenarios; it doesn't actually solve anything.

I also thought so, thanks for confirming. Do you know whether the messages 
might have hidden earlier messages pointing to the real problem?

> Having said that, do you know if anything happened to the setup that
> triggered this? E.g., some configuration change, new utility, etc.?
> Alternatively, did the log show anything else in addition to the MCP SCPAD?

There was no update or configuration activity on the box, it was just running 
along as usual, operating some virtual machines. Uptime was about 22 days. I 
have a second, practically identical server, running pretty much the same 
workload, which is still up + running nicely.

I was a bit overeager to reboot the server (power reset) and didn't even check 
whether I could still log in (shame on me). After the reset the virtual 
machines all came up fine, so at least filesystem flushing was still working 
properly during the network breakage event.

The systemd journal logged a vast amount of the messages I've shown (with lots 
of "missed kernel messages" too), for a duration of about 8 seconds. In total, 
including the suppressions, it would have been over 1 million messages during 
the 8 seconds. Running a "sort|uniq" over the visibly logged messages I see:

   8786 kernel: bnx2x 0000:04:00.0 eth0: Parity errors detected in blocks:
   8768 kernel: bnx2x 0000:04:00.1 eth1: Parity errors detected in blocks:
   1583 kernel: bnx2x: [bnx2x_attn_int_deasserted3:4080(eth0)]LATCHED attention 0x80000000 (masked)
   1743 kernel: bnx2x: [bnx2x_attn_int_deasserted3:4080(eth1)]LATCHED attention 0x80000000 (masked)
  36092 kernel: MCP SCPAD:
      1 kernel: RAX: 0000000000000000 RBX: 000000198111fb67 RCX: 00000000ffffffff
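
For reference, a tally like the one above can be reproduced with a few lines of Python. This is a minimal sketch rather than the exact procedure used here: `tally` is a hypothetical helper that counts identical lines, and the figure of 1,000,000 is the rough lower bound mentioned above, not a measured value.

```python
from collections import Counter


def tally(lines):
    """Count identical non-empty lines and return a Counter."""
    return Counter(line.strip() for line in lines if line.strip())


# Counts copied from the tally in this message (visibly logged lines only).
COUNTS = {
    "eth0 parity errors": 8786,
    "eth1 parity errors": 8768,
    "eth0 LATCHED attention": 1583,
    "eth1 LATCHED attention": 1743,
    "MCP SCPAD": 36092,
    "RAX register line": 1,
}

visible_total = sum(COUNTS.values())

# Assumption: "over 1 million messages during the 8 seconds", used as a lower bound.
ESTIMATED_TOTAL = 1_000_000
DURATION_SECONDS = 8

print(f"visibly logged: {visible_total}")
print(f"share of estimated total: {visible_total / ESTIMATED_TOTAL:.1%}")
print(f"estimated rate: {ESTIMATED_TOTAL / DURATION_SECONDS:.0f} messages/s")
```

Under these assumptions only about 5.7% of the messages made it into the journal, at an estimated rate of at least 125,000 messages per second, so any earlier message pointing to the real cause could easily be among those lost.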

I'll now try to backport that "MCP SCPAD" logging suppression patch to the 
latest 3.14 kernel and reboot the boxes with that, hoping to learn more if 
the situation recurs.

best regards
  Patrick
