netdev.vger.kernel.org archive mirror
* Supermicro AOC-STGN-i2S w intel 82599ES on Brocade ICX6610 - random link failures
@ 2016-01-25 10:08 Nikola Ciprich
  2016-01-25 10:44 ` zhuyj
  2016-01-31 22:01 ` Nikola Ciprich
  0 siblings, 2 replies; 4+ messages in thread
From: Nikola Ciprich @ 2016-01-25 10:08 UTC (permalink / raw)
  To: netdev; +Cc: nik, Stanislav Schattke

Hello netdev readers,

I'd like to ask for advice on the following problem we're dealing with:

I have a cluster of three nodes connected to stacked Brocade ICX6610
switches using bonded AOC-STGN-i2S adapters (they're using 82599ES
chipsets).

The problem is that I see random link failures on practically all
interfaces. The link always goes down for a very short time, then the
adapter is reset and the link comes up again.

Here's a dmesg snippet:

[Jan22 22:09] ixgbe 0000:03:00.0 eth0: NIC Link is Down
[  +0.005610] ixgbe 0000:03:00.0 eth0: initiating reset to clear Tx work after link loss
[  +0.012792] bond0: link status definitely down for interface eth0, disabling it
[  +1.105826] ixgbe 0000:03:00.0 eth0: Reset adapter
[  +0.307518] ixgbe 0000:03:00.0 eth0: detected SFP+: 3
[  +0.145881] ixgbe 0000:03:00.0 eth0: NIC Link is Up 10 Gbps, Flow Control: RX/TX
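
To quantify how short these outages are, the relative timestamps can be summed from each "NIC Link is Down" line to the next "NIC Link is Up" line. The following is only a minimal sketch: the function name `outage_durations` and the regular expression are assumptions made for illustration, not anything provided by ixgbe or dmesg.

```python
import re

# Matches relative timestamps in the style shown above, e.g. "[  +0.005610]".
DELTA_RE = re.compile(r"^\[\s*\+(\d+\.\d+)\]")


def outage_durations(lines):
    """Return seconds elapsed between each link-down and the next link-up line."""
    durations = []
    elapsed = None  # None means no outage is currently being tracked
    for line in lines:
        match = DELTA_RE.match(line)
        # Lines without a relative delta count as 0, so results are approximate.
        delta = float(match.group(1)) if match else 0.0
        if "NIC Link is Down" in line:
            elapsed = 0.0
            continue
        if elapsed is None:
            continue
        elapsed += delta
        if "NIC Link is Up" in line:
            durations.append(round(elapsed, 6))
            elapsed = None
    return durations


# The dmesg snippet quoted in this message.
log = """\
[Jan22 22:09] ixgbe 0000:03:00.0 eth0: NIC Link is Down
[  +0.005610] ixgbe 0000:03:00.0 eth0: initiating reset to clear Tx work after link loss
[  +0.012792] bond0: link status definitely down for interface eth0, disabling it
[  +1.105826] ixgbe 0000:03:00.0 eth0: Reset adapter
[  +0.307518] ixgbe 0000:03:00.0 eth0: detected SFP+: 3
[  +0.145881] ixgbe 0000:03:00.0 eth0: NIC Link is Up 10 Gbps, Flow Control: RX/TX
"""

if __name__ == "__main__":
    print(outage_durations(log.splitlines()))
```

For the snippet above, the deltas add up to roughly 1.58 seconds between link down and link up. Feeding a longer log through the same function would show whether every outage has a similar length.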

Since I'm using bonding, it doesn't disrupt traffic, but I'd still like to
resolve it. We're using 5m passive SFP cables; we tried replacing one with a
3m cable, to no avail.

All three boxes are Supermicro X10DRW, running a vanilla x86_64 4.0.5 kernel (I'll upgrade it to 4.1.16 soon).

We were using Broadcom adapters before, and they worked without such problems
(except for one particular port, which showed mysterious packet drops every few
months; that's why we switched to Intel-based adapters), so I think the cables
and switches should be fine, but of course I'm not sure.

I think I've seen similar problems before and they were power-management (PM) related, but I'm not sure.

Has anyone seen a similar problem?

Or does anyone have tips on how I could debug it?

If I can provide more information, please let me know.
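
One possible starting point for narrowing this down, sketched here under assumptions: the bonding driver reports a per-slave "Link Failure Count" in /proc/net/bonding/bond0, so comparing the counts across interfaces and nodes could show whether the flaps cluster on particular ports. The function name `link_failure_counts` and the sample text are hypothetical and for illustration only; the sample numbers are not measurements from these machines.

```python
def link_failure_counts(text):
    """Map each slave interface name to its reported 'Link Failure Count'."""
    counts = {}
    current = None  # name of the slave section currently being read
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("Slave Interface:"):
            current = line.split(":", 1)[1].strip()
        elif line.startswith("Link Failure Count:") and current is not None:
            counts[current] = int(line.split(":", 1)[1])
    return counts


# Made-up sample in the layout of /proc/net/bonding/bond0 (illustrative values).
SAMPLE = """\
Slave Interface: eth0
MII Status: up
Link Failure Count: 3

Slave Interface: eth1
MII Status: up
Link Failure Count: 1
"""

if __name__ == "__main__":
    try:
        # bond0 is the bond name used in this thread; adjust as needed.
        with open("/proc/net/bonding/bond0") as f:
            print(link_failure_counts(f.read()))
    except OSError:
        # Fall back to the made-up sample when no bond is present.
        print(link_failure_counts(SAMPLE))
```

Running this periodically on each node and diffing the results would give a rough per-port flap history without having to search dmesg.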

BR

nik

-- 
-------------------------------------
Ing. Nikola CIPRICH
LinuxBox.cz, s.r.o.
28.rijna 168, 709 00 Ostrava

tel.:   +420 591 166 214
fax:    +420 596 621 273
mobil:  +420 777 093 799
www.linuxbox.cz

mobil servis: +420 737 238 656
email servis: servis@linuxbox.cz
-------------------------------------


* Re: Supermicro AOC-STGN-i2S w intel 82599ES on Brocade ICX6610 - random link failures
  2016-01-25 10:08 Supermicro AOC-STGN-i2S w intel 82599ES on Brocade ICX6610 - random link failures Nikola Ciprich
@ 2016-01-25 10:44 ` zhuyj
  2016-01-28  8:48   ` Nikola Ciprich
  2016-01-31 22:01 ` Nikola Ciprich
  1 sibling, 1 reply; 4+ messages in thread
From: zhuyj @ 2016-01-25 10:44 UTC (permalink / raw)
  To: Nikola Ciprich, netdev; +Cc: nik, Stanislav Schattke

https://www.mail-archive.com/netdev@vger.kernel.org/msg94109.html

Maybe this link can help you. If it works, please let me know.

Thanks a lot.
Zhu Yanjun


* Re: Supermicro AOC-STGN-i2S w intel 82599ES on Brocade ICX6610 - random link failures
  2016-01-25 10:44 ` zhuyj
@ 2016-01-28  8:48   ` Nikola Ciprich
  0 siblings, 0 replies; 4+ messages in thread
From: Nikola Ciprich @ 2016-01-28  8:48 UTC (permalink / raw)
  To: zhuyj; +Cc: netdev, nik, Stanislav Schattke, Nikola Ciprich

Hello Zhu,

I'm sorry for the late reply. I can test the patch, but if I understand
correctly, it deals with a bonding issue, not (lower-level) interface
link problems, no? Bonding works well for me; I'm trying to find
out why the link failures occur.

Please correct me if I'm wrong.

thanks 

nik







* Re: Supermicro AOC-STGN-i2S w intel 82599ES on Brocade ICX6610 - random link failures
  2016-01-25 10:08 Supermicro AOC-STGN-i2S w intel 82599ES on Brocade ICX6610 - random link failures Nikola Ciprich
  2016-01-25 10:44 ` zhuyj
@ 2016-01-31 22:01 ` Nikola Ciprich
  1 sibling, 0 replies; 4+ messages in thread
From: Nikola Ciprich @ 2016-01-31 22:01 UTC (permalink / raw)
  To: netdev; +Cc: nik, Stanislav Schattke, emil.s.tantilov

Hi,

I've updated all three boxes to 4.1.15. I've just had a link outage again,
but this time I got a more detailed backtrace.

I'm not sure, but maybe it could be of some help?

[Jan30 23:53] ixgbe 0000:03:00.0 eth0: NIC Link is Down
[  +0.097285] bond0: link status definitely down for interface eth0, disabling it
[  +0.007695] bond0: first active interface up!
[  +0.000224] ------------[ cut here ]------------
[  +0.000007] WARNING: CPU: 6 PID: 19351 at kernel/softirq.c:150 __local_bh_enable_ip+0x7a/0xb0()
[  +0.000031] Modules linked in: cbc ceph libceph fscache dlm sctp crc32c_intel crc32c_generic libcrc32c configfs netconsole autofs4 sunrpc ipmi_devintf bridge stp llc 8021
[  +0.000002] CPU: 6 PID: 19351 Comm: kworker/u32:1 Not tainted 4.1.15lb6.03 #1
[  +0.000000] Hardware name: Supermicro X10DRW/X10DRW-i, BIOS 1.0c 01/07/2015
[  +0.000005] Workqueue: bond0 bond_mii_monitor [bonding]
[  +0.000002]  0000000000000096 ffff8804c2213798 ffffffff814c104b 0000000000000096
[  +0.000001]  0000000000000000 ffff8804c22137d8 ffffffff810535a5 ffff881036f03e00
[  +0.000002]  0000000000000200 ffff8804c2213830 0000000000000000 ffffffffa05250c0
[  +0.000000] Call Trace:
[  +0.000004]  [<ffffffff814c104b>] dump_stack+0x4f/0x74
[  +0.000002]  [<ffffffff810535a5>] warn_slowpath_common+0x95/0xe0
[  +0.000002]  [<ffffffff8105360a>] warn_slowpath_null+0x1a/0x20
[  +0.000002]  [<ffffffff81057b4a>] __local_bh_enable_ip+0x7a/0xb0
[  +0.000003]  [<ffffffffa07abc41>] bond_poll_controller+0x111/0x150 [bonding]
[  +0.000003]  [<ffffffff814242cc>] netpoll_poll_dev+0x5c/0x1b0
[  +0.000003]  [<ffffffff814072be>] ? netif_skb_features+0xfe/0x1f0
[  +0.000001]  [<ffffffff81424589>] netpoll_send_skb_on_dev+0x169/0x250
[  +0.000002]  [<ffffffffa07d3975>] vlan_dev_hard_start_xmit+0x105/0x120 [8021q]
[  +0.000001]  [<ffffffff81423c2c>] netpoll_start_xmit+0x15c/0x1f0
[  +0.000002]  [<ffffffff8142456b>] netpoll_send_skb_on_dev+0x14b/0x250
[  +0.000001]  [<ffffffff8142492f>] netpoll_send_udp+0x2bf/0x400
[  +0.000002]  [<ffffffffa087b234>] write_msg+0xb4/0xf0 [netconsole]
[  +0.000003]  [<ffffffff810a2154>] call_console_drivers.clone.1+0xa4/0x120
[  +0.000002]  [<ffffffff810a2454>] console_unlock+0x284/0x400
[  +0.000002]  [<ffffffff810a2e7b>] vprintk_emit+0x20b/0x4a0
[  +0.000002]  [<ffffffff810a312f>] vprintk_default+0x1f/0x30
[  +0.000001]  [<ffffffff814c0f39>] printk+0x46/0x48
[  +0.000002]  [<ffffffff81402ef6>] __netdev_printk+0x176/0x2e0
[  +0.000002]  [<ffffffff814030b3>] netdev_info+0x53/0x60
[  +0.000003]  [<ffffffffa07b30f7>] ? bond_3ad_set_carrier+0x57/0xa0 [bonding]
[  +0.000003]  [<ffffffffa07ae468>] ? bond_set_carrier+0xb8/0xd0 [bonding]
[  +0.000003]  [<ffffffffa07ae5fe>] bond_select_active_slave+0x17e/0x200 [bonding]
[  +0.000002]  [<ffffffffa07aeb3f>] bond_mii_monitor+0x4bf/0x700 [bonding]
[  +0.000003]  [<ffffffff8106b119>] process_one_work+0x139/0x470
[  +0.000001]  [<ffffffff8106b573>] worker_thread+0x123/0x520
[  +0.000002]  [<ffffffff8106b450>] ? process_one_work+0x470/0x470
[  +0.000001]  [<ffffffff8106b450>] ? process_one_work+0x470/0x470
[  +0.000002]  [<ffffffff810707ce>] kthread+0xde/0x100
[  +0.000001]  [<ffffffff810706f0>] ? __init_kthread_worker+0x40/0x40
[  +0.000003]  [<ffffffff814c6b52>] ret_from_fork+0x42/0x70
[  +0.000001]  [<ffffffff810706f0>] ? __init_kthread_worker+0x40/0x40
[  +0.000001] ---[ end trace c168d14d53373934 ]---
[  +1.635277] ixgbe 0000:03:00.0 eth0: NIC Link is Up 10 Gbps, Flow Control: RX/TX

Anyway, our next step is a switch firmware update (although only one minor
update is available, so I don't expect much).

BR

nik

