* Fwd: 802.3ad bonding aggregator reselection
[not found] <CAKdSkDUb94mR7cDiZbxsc6fgm_8O5wagjUec4g-t5DF7R_GFDw@mail.gmail.com>
@ 2016-06-17 10:40 ` Veli-Matti Lintu
[not found] ` <CAD=hENdGOFY5027964=f3xk_qeNmVccHYvr2rvTJtpFmaeFG2w@mail.gmail.com>
0 siblings, 1 reply; 7+ messages in thread
From: Veli-Matti Lintu @ 2016-06-17 10:40 UTC (permalink / raw)
To: netdev; +Cc: Jay Vosburgh, Andy Gospodarek
Hello,
I have been trying to get the bonding driver working with multiple
aggregators across two switches in mode=802.3ad so that failing links
are handled properly. The goal is to always have the best possible
bonded link in use if one or more physical links fail.
The bonding documentation describes that 802.3ad with
ad_select=bandwidth/count should do this, but I wasn't able to get
those or ad_select=stable working without patching the kernel. As I'm
not really familiar with the codebase, I'm not sure if this is really
a kernel problem or a configuration problem.
Documentation/networking/bonding.txt
ad_select
...
The bandwidth and count selection policies permit failover of
802.3ad aggregations when partial failure of the active aggregator
occurs. This keeps the aggregator with the highest availability
(either in bandwidth or in number of ports) active at all times.
This option was added in bonding version 3.4.0.
The hardware setup consists of two HP 2530-48G switches and servers
that have six ports in total, connected to both switches with 3x1Gbps
links each. The port groups are configured as LACP on the switches. The
switches are connected to each other, but they do not form a single
aggregator, so all six links cannot be active at the same time. The
NICs use the ixgbe and igb drivers.
Here are the steps I tested:
ad_select=stable
1. Enable all links on both switches and boot the server; 3 ports are up
2. Disable one link on the switch that hosts the active aggregator
expected: the link goes down and the port count in /proc/net/bonding/bond0 goes down
result: the link goes down and the port count in /proc/net/bonding/bond0 does not change
3. Disable all links on the switch that hosts the active aggregator
expected: the links go down and the bond switches to the aggregator that has links up
result: the links go down, the port count in /proc/net/bonding/bond0 does
not change, and the connection is lost as there are no links up in the
active aggregator.
4. Enable a single link on the active aggregator that has all links down
expected: ?
result: the aggregator with the most links up is activated (in this case the
previously non-active switch that had 3 links up all the time)
ad_select=bandwidth/count
1. Enable all links on both switches and boot the server; 3 ports are up
2. Disable one link on the switch that hosts the active aggregator
expected: the link goes down, aggregator reselection starts, and the
non-active aggregator with 3 links up becomes active
result: the link goes down, the port count in /proc/net/bonding/bond0 does
not change, and aggregator reselection does not occur
3. Same as with ad_select=stable
4. Enable a single link on the active aggregator that has all links down
expected: the aggregator with the most links up is activated
result: the aggregator with the most links up is activated (in this case the
previously non-active switch that had 3 links up all the time)
In all cases miimon does detect the link going down, and if I bring one
slave interface in the non-active aggregator down and back up
(ifconfig/ip), aggregator reselection is done. To me it looks like the
problem is that when a link goes down, nothing re-evaluates the
remaining state of the bond.
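For example, the bounce and a quick way to follow the state look roughly
like this (just a sketch; ens5f0 stands for any slave in the non-active
aggregator, and the watch line is only one convenient way to read
/proc/net/bonding/bond0):

  # bounce one slave in the non-active aggregator; this triggers reselection
  ip link set dev ens5f0 down
  ip link set dev ens5f0 up

  # follow the active aggregator ID and port count while toggling links
  watch -n 1 'grep -A 5 "Active Aggregator Info" /proc/net/bonding/bond0'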
I could get reselection to happen with the following patch, but I'm not
sure what side effects it might cause. Most of the examples I found by
googling seemed to refer to Cisco gear, so I'm wondering if there's
something hardware-specific here.
--- a/drivers/net/bonding/bond_3ad.c 2016-06-17 09:49:56.236636742 +0300
+++ b/drivers/net/bonding/bond_3ad.c 2016-06-17 10:04:34.309353452 +0300
@@ -2458,6 +2458,7 @@
/* link has failed */
port->is_enabled = false;
ad_update_actor_keys(port, true);
+ port->sm_vars &= ~AD_PORT_SELECTED;
}
netdev_dbg(slave->bond->dev, "Port %d changed link status to %s\n",
port->actor_port_number,
Here's /proc/net/bonding/bond0 on an unmodified 4.7-rc3 kernel after
disabling two ports on the switch with the active aggregator. The active
aggregator info still shows 3 ports. The results are the same on 4.4.x
and 4.6.x kernels.
The following options were used:
options bonding mode=4 miimon=100 downdelay=200 updelay=200
xmit_hash_policy=layer3+4 ad_select=1 max_bonds=0 min_links=0
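For what it's worth, the effective values can also be checked at runtime
through the bonding sysfs interface (a convenience sketch, not part of
the test procedure itself):

  cat /sys/class/net/bond0/bonding/ad_select
  cat /sys/class/net/bond0/bonding/miimon
  cat /sys/class/net/bond0/bonding/min_links
  cat /sys/class/net/bond0/bonding/xmit_hash_policy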
Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)
Bonding Mode: IEEE 802.3ad Dynamic link aggregation
Transmit Hash Policy: layer3+4 (1)
MII Status: up
MII Polling Interval (ms): 1000
Up Delay (ms): 2000
Down Delay (ms): 2000
802.3ad info
LACP rate: fast
Min links: 0
Aggregator selection policy (ad_select): bandwidth
System priority: 65535
System MAC address: f2:07:89:4a:7c:9f
Active Aggregator Info:
Aggregator ID: 1
Number of ports: 3
Actor Key: 9
Partner Key: 57
Partner Mac Address: 6c:3b:e5:df:7a:80
Slave Interface: enp5s0f1
MII Status: up
Speed: 1000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 0c:c4:7a:34:c7:f1
Slave queue ID: 0
Aggregator ID: 1
Actor Churn State: none
Partner Churn State: none
Actor Churned Count: 0
Partner Churned Count: 0
details actor lacp pdu:
system priority: 65535
system mac address: f2:07:89:4a:7c:9f
port key: 9
port priority: 255
port number: 1
port state: 63
details partner lacp pdu:
system priority: 31360
system mac address: 6c:3b:e5:df:7a:80
oper key: 57
port priority: 0
port number: 23
port state: 61
Slave Interface: enp5s0f0
MII Status: up
Speed: 1000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 0c:c4:7a:34:c7:f0
Slave queue ID: 0
Aggregator ID: 2
Actor Churn State: none
Partner Churn State: none
Actor Churned Count: 0
Partner Churned Count: 0
details actor lacp pdu:
system priority: 65535
system mac address: f2:07:89:4a:7c:9f
port key: 9
port priority: 255
port number: 2
port state: 63
details partner lacp pdu:
system priority: 36992
system mac address: 6c:3b:e5:e0:90:80
oper key: 57
port priority: 0
port number: 23
port state: 61
Slave Interface: ens6f1
MII Status: down
Speed: Unknown
Duplex: Unknown
Link Failure Count: 1
Permanent HW addr: a0:36:9f:83:3c:41
Slave queue ID: 0
Aggregator ID: 1
Actor Churn State: none
Partner Churn State: none
Actor Churned Count: 0
Partner Churned Count: 0
details actor lacp pdu:
system priority: 65535
system mac address: f2:07:89:4a:7c:9f
port key: 0
port priority: 255
port number: 3
port state: 63
details partner lacp pdu:
system priority: 31360
system mac address: 6c:3b:e5:df:7a:80
oper key: 57
port priority: 0
port number: 29
port state: 61
Slave Interface: ens6f0
MII Status: up
Speed: 1000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: a0:36:9f:83:3c:40
Slave queue ID: 0
Aggregator ID: 2
Actor Churn State: churned
Partner Churn State: churned
Actor Churned Count: 1
Partner Churned Count: 1
details actor lacp pdu:
system priority: 65535
system mac address: f2:07:89:4a:7c:9f
port key: 9
port priority: 255
port number: 4
port state: 7
details partner lacp pdu:
system priority: 36992
system mac address: 6c:3b:e5:e0:90:80
oper key: 57
port priority: 0
port number: 29
port state: 53
Slave Interface: ens5f1
MII Status: down
Speed: Unknown
Duplex: Unknown
Link Failure Count: 1
Permanent HW addr: a0:36:9f:83:3d:1f
Slave queue ID: 0
Aggregator ID: 1
Actor Churn State: none
Partner Churn State: churned
Actor Churned Count: 0
Partner Churned Count: 1
details actor lacp pdu:
system priority: 65535
system mac address: f2:07:89:4a:7c:9f
port key: 0
port priority: 255
port number: 5
port state: 143
details partner lacp pdu:
system priority: 31360
system mac address: 6c:3b:e5:df:7a:80
oper key: 57
port priority: 0
port number: 28
port state: 55
Slave Interface: ens5f0
MII Status: up
Speed: 1000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: a0:36:9f:83:3d:1e
Slave queue ID: 0
Aggregator ID: 2
Actor Churn State: none
Partner Churn State: none
Actor Churned Count: 0
Partner Churned Count: 0
details actor lacp pdu:
system priority: 65535
system mac address: f2:07:89:4a:7c:9f
port key: 9
port priority: 255
port number: 6
port state: 63
details partner lacp pdu:
system priority: 36992
system mac address: 6c:3b:e5:e0:90:80
oper key: 57
port priority: 0
port number: 28
port state: 61
Here are the results with the patch applied, after disabling the links
and after the aggregator has been reselected:
Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)
Bonding Mode: IEEE 802.3ad Dynamic link aggregation
Transmit Hash Policy: layer3+4 (1)
MII Status: up
MII Polling Interval (ms): 1000
Up Delay (ms): 2000
Down Delay (ms): 2000
802.3ad info
LACP rate: fast
Min links: 0
Aggregator selection policy (ad_select): bandwidth
System priority: 65535
System MAC address: f2:07:89:4a:7c:9f
Active Aggregator Info:
Aggregator ID: 2
Number of ports: 2
Actor Key: 9
Partner Key: 57
Partner Mac Address: 6c:3b:e5:e0:90:80
Slave Interface: enp5s0f1
MII Status: up
Speed: 1000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 0c:c4:7a:34:c7:f1
Slave queue ID: 0
Aggregator ID: 1
Actor Churn State: none
Partner Churn State: none
Actor Churned Count: 0
Partner Churned Count: 0
details actor lacp pdu:
system priority: 65535
system mac address: f2:07:89:4a:7c:9f
port key: 9
port priority: 255
port number: 1
port state: 63
details partner lacp pdu:
system priority: 31360
system mac address: 6c:3b:e5:df:7a:80
oper key: 57
port priority: 0
port number: 23
port state: 61
Slave Interface: enp5s0f0
MII Status: up
Speed: 1000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 0c:c4:7a:34:c7:f0
Slave queue ID: 0
Aggregator ID: 2
Actor Churn State: none
Partner Churn State: none
Actor Churned Count: 1
Partner Churned Count: 1
details actor lacp pdu:
system priority: 65535
system mac address: f2:07:89:4a:7c:9f
port key: 9
port priority: 255
port number: 2
port state: 63
details partner lacp pdu:
system priority: 36992
system mac address: 6c:3b:e5:e0:90:80
oper key: 57
port priority: 0
port number: 23
port state: 61
Slave Interface: ens6f1
MII Status: down
Speed: Unknown
Duplex: Unknown
Link Failure Count: 1
Permanent HW addr: a0:36:9f:83:3c:41
Slave queue ID: 0
Aggregator ID: 3
Actor Churn State: none
Partner Churn State: none
Actor Churned Count: 0
Partner Churned Count: 0
details actor lacp pdu:
system priority: 65535
system mac address: f2:07:89:4a:7c:9f
port key: 0
port priority: 255
port number: 3
port state: 7
details partner lacp pdu:
system priority: 31360
system mac address: 6c:3b:e5:df:7a:80
oper key: 57
port priority: 0
port number: 29
port state: 61
Slave Interface: ens6f0
MII Status: down
Speed: Unknown
Duplex: Unknown
Link Failure Count: 1
Permanent HW addr: a0:36:9f:83:3c:40
Slave queue ID: 0
Aggregator ID: 4
Actor Churn State: monitoring
Partner Churn State: monitoring
Actor Churned Count: 1
Partner Churned Count: 1
details actor lacp pdu:
system priority: 65535
system mac address: f2:07:89:4a:7c:9f
port key: 0
port priority: 255
port number: 4
port state: 135
details partner lacp pdu:
system priority: 36992
system mac address: 6c:3b:e5:e0:90:80
oper key: 57
port priority: 0
port number: 29
port state: 55
Slave Interface: ens5f1
MII Status: down
Speed: Unknown
Duplex: Unknown
Link Failure Count: 1
Permanent HW addr: a0:36:9f:83:3d:1f
Slave queue ID: 0
Aggregator ID: 5
Actor Churn State: churned
Partner Churn State: churned
Actor Churned Count: 1
Partner Churned Count: 1
details actor lacp pdu:
system priority: 65535
system mac address: f2:07:89:4a:7c:9f
port key: 0
port priority: 255
port number: 5
port state: 135
details partner lacp pdu:
system priority: 31360
system mac address: 6c:3b:e5:df:7a:80
oper key: 57
port priority: 0
port number: 28
port state: 55
Slave Interface: ens5f0
MII Status: up
Speed: 1000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: a0:36:9f:83:3d:1e
Slave queue ID: 0
Aggregator ID: 2
Actor Churn State: none
Partner Churn State: none
Actor Churned Count: 1
Partner Churned Count: 1
details actor lacp pdu:
system priority: 65535
system mac address: f2:07:89:4a:7c:9f
port key: 9
port priority: 255
port number: 6
port state: 63
details partner lacp pdu:
system priority: 36992
system mac address: 6c:3b:e5:e0:90:80
oper key: 57
port priority: 0
port number: 28
port state: 61
Happy hacking!
Veli-Matti
* Re: 802.3ad bonding aggregator reselection
[not found] ` <CAD=hENdGOFY5027964=f3xk_qeNmVccHYvr2rvTJtpFmaeFG2w@mail.gmail.com>
@ 2016-06-21 10:50 ` Veli-Matti Lintu
2016-06-21 15:46 ` Jay Vosburgh
0 siblings, 1 reply; 7+ messages in thread
From: Veli-Matti Lintu @ 2016-06-21 10:50 UTC (permalink / raw)
To: zhuyj; +Cc: netdev, Jay Vosburgh, Andy Gospodarek
2016-06-20 17:11 GMT+03:00 zhuyj <zyjzyj2000@gmail.com>:
> 5. Switch Configuration
> =======================
>
> For this section, "switch" refers to whatever system the
> bonded devices are directly connected to (i.e., where the other end of
> the cable plugs into). This may be an actual dedicated switch device,
> or it may be another regular system (e.g., another computer running
> Linux),
>
> The active-backup, balance-tlb and balance-alb modes do not
> require any specific configuration of the switch.
>
> The 802.3ad mode requires that the switch have the appropriate
> ports configured as an 802.3ad aggregation. The precise method used
> to configure this varies from switch to switch, but, for example, a
> Cisco 3550 series switch requires that the appropriate ports first be
> grouped together in a single etherchannel instance, then that
> etherchannel is set to mode "lacp" to enable 802.3ad (instead of
> standard EtherChannel).
The ports are configured in the switch settings (HP ProCurve 2530-48G)
in the same trunk group (TrkX), and the trunk group type is set to LACP.
/proc/net/bonding/bond0 also shows that the three ports belong to the
same aggregator, and bandwidth tests support this. In my understanding a
ProCurve trunk group is pretty much the same as an EtherChannel in
Cisco's terminology. The bonded link always comes up properly, but
handling of links going down is the problem. Are there known
differences between vendors here?
Veli-Matti
* Re: 802.3ad bonding aggregator reselection
2016-06-21 10:50 ` Veli-Matti Lintu
@ 2016-06-21 15:46 ` Jay Vosburgh
2016-06-21 20:48 ` Veli-Matti Lintu
0 siblings, 1 reply; 7+ messages in thread
From: Jay Vosburgh @ 2016-06-21 15:46 UTC (permalink / raw)
To: Veli-Matti Lintu; +Cc: zhuyj, netdev, Andy Gospodarek
Veli-Matti Lintu <veli-matti.lintu@opinsys.fi> wrote:
>2016-06-20 17:11 GMT+03:00 zhuyj <zyjzyj2000@gmail.com>:
>> 5. Switch Configuration
>> =======================
>>
>> For this section, "switch" refers to whatever system the
>> bonded devices are directly connected to (i.e., where the other end of
>> the cable plugs into). This may be an actual dedicated switch device,
>> or it may be another regular system (e.g., another computer running
>> Linux),
>>
>> The active-backup, balance-tlb and balance-alb modes do not
>> require any specific configuration of the switch.
>>
>> The 802.3ad mode requires that the switch have the appropriate
>> ports configured as an 802.3ad aggregation. The precise method used
>> to configure this varies from switch to switch, but, for example, a
>> Cisco 3550 series switch requires that the appropriate ports first be
>> grouped together in a single etherchannel instance, then that
>> etherchannel is set to mode "lacp" to enable 802.3ad (instead of
>> standard EtherChannel).
>
>The ports are configured in switch settings (HP Procurve 2530-48G) in
>same trunk group (TrkX) and trunk group type is set as LACP.
>/proc/net/bonding/bond0 also shows that the three ports belong to same
>aggregator and bandwidth tests also support this. In my understanding
>Procurve's trunk group is pretty much the same as etherchannel in
>Cisco's terminology. The bonded link comes always up properly, but
>handling of links going down is the problem. Are there known
>differences between different vendors there?
I did the original LACP reselection testing on a Cisco switch,
but I have an HP 2530 now; I'll test it later today or tomorrow and see
if it behaves properly, and whether your proposed patch is needed.
-J
---
-Jay Vosburgh, jay.vosburgh@canonical.com
* Re: 802.3ad bonding aggregator reselection
2016-06-21 15:46 ` Jay Vosburgh
@ 2016-06-21 20:48 ` Veli-Matti Lintu
2016-06-22 0:49 ` Jay Vosburgh
0 siblings, 1 reply; 7+ messages in thread
From: Veli-Matti Lintu @ 2016-06-21 20:48 UTC (permalink / raw)
To: Jay Vosburgh; +Cc: zhuyj, netdev, Andy Gospodarek
2016-06-21 18:46 GMT+03:00 Jay Vosburgh <jay.vosburgh@canonical.com>:
> Veli-Matti Lintu <veli-matti.lintu@opinsys.fi> wrote:
>
>>2016-06-20 17:11 GMT+03:00 zhuyj <zyjzyj2000@gmail.com>:
>>> 5. Switch Configuration
>>> =======================
>>>
>>> For this section, "switch" refers to whatever system the
>>> bonded devices are directly connected to (i.e., where the other end of
>>> the cable plugs into). This may be an actual dedicated switch device,
>>> or it may be another regular system (e.g., another computer running
>>> Linux),
>>>
>>> The active-backup, balance-tlb and balance-alb modes do not
>>> require any specific configuration of the switch.
>>>
>>> The 802.3ad mode requires that the switch have the appropriate
>>> ports configured as an 802.3ad aggregation. The precise method used
>>> to configure this varies from switch to switch, but, for example, a
>>> Cisco 3550 series switch requires that the appropriate ports first be
>>> grouped together in a single etherchannel instance, then that
>>> etherchannel is set to mode "lacp" to enable 802.3ad (instead of
>>> standard EtherChannel).
>>
>>The ports are configured in switch settings (HP Procurve 2530-48G) in
>>same trunk group (TrkX) and trunk group type is set as LACP.
>>/proc/net/bonding/bond0 also shows that the three ports belong to same
>>aggregator and bandwidth tests also support this. In my understanding
>>Procurve's trunk group is pretty much the same as etherchannel in
>>Cisco's terminology. The bonded link comes always up properly, but
>>handling of links going down is the problem. Are there known
>>differences between different vendors there?
>
> I did the original LACP reselection testing on a Cisco switch,
> but I have an HP 2530 now; I'll test it later today or tomorrow and see
> if it behaves properly, and whether your proposed patch is needed.
Thanks for taking a look at this. Here are some more details about the
setup, as Zhu Yanjun also requested.
The server in question has two internal 10Gbps ports (using ixgbe) and
two Intel I350-T2 dual-port 1Gbps PCIe cards (using igb). All ports are
running at 1Gbps.
05:00.0 Ethernet controller: Intel Corporation Ethernet Controller
10-Gigabit X540-AT2 (rev 01)
05:00.1 Ethernet controller: Intel Corporation Ethernet Controller
10-Gigabit X540-AT2 (rev 01)
81:00.0 Ethernet controller: Intel Corporation I350 Gigabit Network
Connection (rev 01)
81:00.1 Ethernet controller: Intel Corporation I350 Gigabit Network
Connection (rev 01)
82:00.0 Ethernet controller: Intel Corporation I350 Gigabit Network
Connection (rev 01)
82:00.1 Ethernet controller: Intel Corporation I350 Gigabit Network
Connection (rev 01)
In the test setup the bonds are set up as:
05:00.0 + 81:00.0 + 82:00.0 and
05:00.1 + 81:00.1 + 82:00.1
So each bond uses one ixgbe port and two igb ports.
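For reference, a minimal sketch of how one of these bonds could be put
together with iproute2 (the commands and interface names here are only
an illustration of the setup described above, not copied from the actual
configuration):

  modprobe bonding max_bonds=0
  ip link add bond0 type bond mode 802.3ad miimon 100 updelay 200 \
      downdelay 200 lacp_rate fast ad_select bandwidth \
      xmit_hash_policy layer3+4
  for ifc in enp5s0f0 ens5f0 ens6f0; do
      ip link set dev "$ifc" down      # slaves must be down before enslaving
      ip link set dev "$ifc" master bond0
  done
  ip link set dev bond0 up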
When testing, I have disabled the port in the switch configuration,
which brings down the link, and miimon also sees the link going down on
the server. This should be the same as unplugging the cable, so nothing
comes through the wire to the server.
Veli-Matti
* Re: 802.3ad bonding aggregator reselection
2016-06-21 20:48 ` Veli-Matti Lintu
@ 2016-06-22 0:49 ` Jay Vosburgh
2016-06-22 17:43 ` Veli-Matti Lintu
0 siblings, 1 reply; 7+ messages in thread
From: Jay Vosburgh @ 2016-06-22 0:49 UTC (permalink / raw)
To: Veli-Matti Lintu; +Cc: zhuyj, netdev, Andy Gospodarek, Mahesh Bandewar
Veli-Matti Lintu <veli-matti.lintu@opinsys.fi> wrote:
[...]
>>>The ports are configured in switch settings (HP Procurve 2530-48G) in
>>>same trunk group (TrkX) and trunk group type is set as LACP.
>>>/proc/net/bonding/bond0 also shows that the three ports belong to same
>>>aggregator and bandwidth tests also support this. In my understanding
>>>Procurve's trunk group is pretty much the same as etherchannel in
>>>Cisco's terminology. The bonded link comes always up properly, but
>>>handling of links going down is the problem. Are there known
>>>differences between different vendors there?
>>
>> I did the original LACP reselection testing on a Cisco switch,
>> but I have an HP 2530 now; I'll test it later today or tomorrow and see
>> if it behaves properly, and whether your proposed patch is needed.
>
>Thanks for taking a look at this. Here are some more details about the
>setup as Zhu Yanjun also requested.
Summary (because anything involving a standard tends to get
long-winded):
This is not a switch problem. Bonding appears to be following
the standard in this case. I've identified when this behavior changed,
and I think we should violate the standard in this case for ad_select
set to "bandwidth" or "count," neither of which is the default value.
Long-winded version:
I've reproduced the issue locally, and it does not appear to be
anything particular to the switch. It appears to be due to changes from
commit 7bb11dc9f59ddcb33ee317da77b235235aaa582a
Author: Mahesh Bandewar <maheshb@google.com>
Date: Sat Oct 31 12:45:06 2015 -0700
bonding: unify all places where actor-oper key needs to be updated.
Specifically this block:
void bond_3ad_handle_link_change(struct slave *slave, char link)
[...]
- /* there is no need to reselect a new aggregator, just signal the
- * state machines to reinitialize
- */
- port->sm_vars |= AD_PORT_BEGIN;
Previously, setting BEGIN would cause the port in question to be
reinitialized, which in turn would trigger reselection.
I'm not sure that adding this section back is the correct fix
from the point of view of the standard, however, as 802.1AX 5.2.3.1.2
defines BEGIN as:
A Boolean variable that is set to TRUE when the System is
initialized or reinitialized, and is set to FALSE when
(re-)initialization has completed.
and in this case we're not reinitializing the System (i.e., the
bond).
Further, 802.1AX 5.4.12 says:
If the port becomes inoperable and a BEGIN event has not
occurred, the state machine enters the PORT_DISABLED
state. Partner_Oper_Port_State.Synchronization is set to
FALSE. This state allows the current Selection state to remain
undisturbed, so that, in the event that the port is still
connected to the same Partner and Partner port when it becomes
operable again, there will be no disturbance caused to higher
layers by unneccessary re-configuration.
At the moment, bonding is doing what 5.4.12 specifies, by
placing the port into PORT_DISABLED state. bond_3ad_handle_link_change
clears port->is_enabled, which causes ad_rx_machine to clear
AD_PORT_MATCHED but leave AD_PORT_SELECTED set. This in turn causes the
selection logic to skip this port, resulting in the observed behavior
(that the port is link down, but stays in the aggregator).
Bonding will still remove the slave from the bond->slave_arr, so
it won't actually try to send on this slave. I'll further note that
802.1AX 5.4.7 defines port_enabled as:
A variable indicating that the physical layer has indicated that
the link has been established and the port is operable.
Value: Boolean
TRUE if the physical layer has indicated that the port is operable.
FALSE otherwise.
So, it appears that bonding is in conformance with the standard
in this case.
I don't see an issue with the above behavior when ad_select is
set to the default value of "stable"; bonding does reselect a new
aggregator when all links fail, and it appears to follow the standard.
I think a reasonable compromise here is to utilize a modified
version of your patch that clears SELECTED (to trigger reselection) when
a link goes down, but only if ad_select is not "stable", for example:
diff --git a/drivers/net/bonding/bond_3ad.c b/drivers/net/bonding/bond_3ad.c
index b9304a295f86..1ee5a3a5e658 100644
--- a/drivers/net/bonding/bond_3ad.c
+++ b/drivers/net/bonding/bond_3ad.c
@@ -2458,6 +2458,8 @@ void bond_3ad_handle_link_change(struct slave *slave, char link)
/* link has failed */
port->is_enabled = false;
ad_update_actor_keys(port, true);
+ if (__get_agg_selection_mode(port) != BOND_AD_STABLE)
+ port->sm_vars &= ~AD_PORT_SELECTED;
}
netdev_dbg(slave->bond->dev, "Port %d changed link status to %s\n",
port->actor_port_number,
I'll test this locally and will submit a formal patch with an
update to bonding.txt tomorrow (if it works).
-J
---
-Jay Vosburgh, jay.vosburgh@canonical.com
* Re: 802.3ad bonding aggregator reselection
2016-06-22 0:49 ` Jay Vosburgh
@ 2016-06-22 17:43 ` Veli-Matti Lintu
2016-06-23 5:58 ` Jay Vosburgh
0 siblings, 1 reply; 7+ messages in thread
From: Veli-Matti Lintu @ 2016-06-22 17:43 UTC (permalink / raw)
To: Jay Vosburgh; +Cc: zhuyj, netdev, Andy Gospodarek, Mahesh Bandewar
2016-06-22 3:49 GMT+03:00 Jay Vosburgh <jay.vosburgh@canonical.com>:
>
> Veli-Matti Lintu <veli-matti.lintu@opinsys.fi> wrote:
> [...]
>>>>The ports are configured in switch settings (HP Procurve 2530-48G) in
>>>>same trunk group (TrkX) and trunk group type is set as LACP.
>>>>/proc/net/bonding/bond0 also shows that the three ports belong to same
>>>>aggregator and bandwidth tests also support this. In my understanding
>>>>Procurve's trunk group is pretty much the same as etherchannel in
>>>>Cisco's terminology. The bonded link comes always up properly, but
>>>>handling of links going down is the problem. Are there known
>>>>differences between different vendors there?
>>>
>>> I did the original LACP reselection testing on a Cisco switch,
>>> but I have an HP 2530 now; I'll test it later today or tomorrow and see
>>> if it behaves properly, and whether your proposed patch is needed.
>>
>>Thanks for taking a look at this. Here are some more details about the
>>setup as Zhu Yanjun also requested.
>
> Summary (because anything involving a standard tends to get long
> winded):
>
> This is not a switch problem. Bonding appears to be following
> the standard in this case. I've identified when this behavior changed,
> and I think we should violate the standard in this case for ad_select
> set to "bandwidth" or "count," neither of which is the default value.
>
> Long winded version:
>
> I've reproduced the issue locally, and it does not appear to be
> anything particular to the switch. It appears to be due to changes from
>
> commit 7bb11dc9f59ddcb33ee317da77b235235aaa582a
> Author: Mahesh Bandewar <maheshb@google.com>
> Date: Sat Oct 31 12:45:06 2015 -0700
>
> bonding: unify all places where actor-oper key needs to be updated.
>
> Specifically this block:
>
> void bond_3ad_handle_link_change(struct slave *slave, char link)
> [...]
> - /* there is no need to reselect a new aggregator, just signal the
> - * state machines to reinitialize
> - */
> - port->sm_vars |= AD_PORT_BEGIN;
>
> Previously, setting BEGIN would cause the port in question to be
> reinitialized, which in turn would trigger reselection.
>
> I'm not sure that adding this section back is the correct fix
> from the point of view of the standard, however, as 802.1AX 5.2.3.1.2
> defines BEGIN as:
>
> A Boolean variable that is set to TRUE when the System is
> initialized or reinitialized, and is set to FALSE when
> (re-)initialization has completed.
>
> and in this case we're not reinitializing the System (i.e., the
> bond).
>
> Further, 802.1AX 5.4.12 says:
>
> If the port becomes inoperable and a BEGIN event has not
> occurred, the state machine enters the PORT_DISABLED
> state. Partner_Oper_Port_State.Synchronization is set to
> FALSE. This state allows the current Selection state to remain
> undisturbed, so that, in the event that the port is still
> connected to the same Partner and Partner port when it becomes
> operable again, there will be no disturbance caused to higher
> layers by unneccessary re-configuration.
>
> At the moment, bonding is doing what 5.4.12 specifies, by
> placing the port into PORT_DISABLED state. bond_3ad_handle_link_change
> clears port->is_enabled, which causes ad_rx_machine to clear
> AD_PORT_MATCHED but leave AD_PORT_SELECTED set. This in turn cause the
> selection logic to skip this port, resulting in the observed behavior
> (that the port is link down, but stays in the aggregator).
>
> Bonding will still remove the slave from the bond->slave_arr, so
> it won't actually try to send on this slave. I'll further note that
> 802.1AX 5.4.7 defines port_enabled as:
>
> A variable indicating that the physical layer has indicated that
> the link has been established and the port is operable.
> Value: Boolean
> TRUE if the physical layer has indicated that the port is operable.
> FALSE otherwise.
>
> So, it appears that bonding is in conformance with the standard
> in this case.
I haven't done extensive testing on this, but I haven't noticed
anything that would indicate that traffic is sent to failed ports, so
this part should be working.
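One simple way to sanity-check that is to watch the per-slave TX
counters while traffic is flowing (a sketch; the interface names are
just examples from this setup):

  for ifc in enp5s0f1 ens5f1 ens6f1; do
      printf '%s: ' "$ifc"
      cat "/sys/class/net/$ifc/statistics/tx_packets"
  done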
> I don't see an issue with the above behavior when ad_select is
> set to the default value of "stable"; bonding does reselect a new
> aggregator when all links fail, and it appears to follow the standard.
In my testing ad_select=stable does not reselect a new aggregator when
all links have failed. Reselection seems to occur only when a link
comes up after the failure. Here's an example from a setup with two
bonds that have three links each. Aggregator ID 3 is active with three
ports, and ID 2 also has three ports up.
802.3ad info
LACP rate: fast
Min links: 0
Aggregator selection policy (ad_select): stable
System priority: 65535
System MAC address: 0c:c4:7a:34:c7:f1
Active Aggregator Info:
Aggregator ID: 3
Number of ports: 3
Actor Key: 9
Partner Key: 57
Partner Mac Address: 6c:3b:e5:df:7a:80
Disable all ports in aggregator ID 2 (enp5s0f1, ens5f1 and ens6f1) in
the switch configuration at the same time:
[ 146.783003] ixgbe 0000:05:00.1 enp5s0f1: NIC Link is Down
[ 146.783223] ixgbe 0000:05:00.1 enp5s0f1: speed changed to 0 for port enp5s0f1
[ 146.858824] bond0: link status down for interface enp5s0f1,
disabling it in 200 ms
[ 147.058932] bond0: link status definitely down for interface
enp5s0f1, disabling it
[ 147.291259] igb 0000:81:00.1 ens5f1: igb: ens5f1 NIC Link is Down
[ 147.303303] igb 0000:82:00.1 ens6f1: igb: ens6f1 NIC Link is Down
[ 147.358862] bond0: link status down for interface ens6f1, disabling
it in 200 ms
[ 147.358868] bond0: link status down for interface ens5f1, disabling
it in 200 ms
[ 147.558929] bond0: link status definitely down for interface
ens6f1, disabling it
[ 147.558987] bond0: link status definitely down for interface
ens5f1, disabling it
At this point there is no connection to the host and the aggregator
with all failed links is still active.
802.3ad info
LACP rate: fast
Min links: 0
Aggregator selection policy (ad_select): stable
System priority: 65535
System MAC address: 0c:c4:7a:34:c7:f1
Active Aggregator Info:
Aggregator ID: 3
Number of ports: 3
Actor Key: 9
Partner Key: 57
Partner Mac Address: 6c:3b:e5:df:7a:80
If I then bring down an interface that is connected to an active
switch port and bring it back up, reselection is done:
# ifconfig ens5f0 down
# ifconfig ens5f0 up
[ 190.258900] bond0: link status down for interface ens5f0, disabling
it in 200 ms
[ 190.458934] bond0: link status definitely down for interface
ens5f0, disabling it
[ 193.192453] 8021q: adding VLAN 0 to HW filter on device ens5f0
[ 196.156105] igb 0000:81:00.0 ens5f0: igb: ens5f0 NIC Link is Up
1000 Mbps Full Duplex, Flow Control: RX
[ 196.158912] bond0: link status up for interface ens5f0, enabling it in 200 ms
[ 196.360471] bond0: link status definitely up for interface ens5f0,
1000 Mbps full duplex
802.3ad info
LACP rate: fast
Min links: 0
Aggregator selection policy (ad_select): stable
System priority: 65535
System MAC address: 0c:c4:7a:34:c7:f1
Active Aggregator Info:
Aggregator ID: 2
Number of ports: 3
Actor Key: 9
Partner Key: 57
Partner Mac Address: 6c:3b:e5:e0:90:80
At this point all connections resume normally.
Are you able to reproduce this or is reselection working as expected?
> I think a reasonable compromise here is to utilize a modified
> version of your patch that clears SELECTED (to trigger reselection) when
> a link goes down, but only if ad_select is not "stable", for example:
>
> diff --git a/drivers/net/bonding/bond_3ad.c b/drivers/net/bonding/bond_3ad.c
> index b9304a295f86..1ee5a3a5e658 100644
> --- a/drivers/net/bonding/bond_3ad.c
> +++ b/drivers/net/bonding/bond_3ad.c
> @@ -2458,6 +2458,8 @@ void bond_3ad_handle_link_change(struct slave *slave, char link)
> /* link has failed */
> port->is_enabled = false;
> ad_update_actor_keys(port, true);
> + if (__get_agg_selection_mode(port) != BOND_AD_STABLE)
> + port->sm_vars &= ~AD_PORT_SELECTED;
> }
> netdev_dbg(slave->bond->dev, "Port %d changed link status to %s\n",
> port->actor_port_number,
>
> I'll test this locally and will submit a formal patch with an
> update to bonding.txt tomorrow (if it works).
>
> -J
>
> ---
> -Jay Vosburgh, jay.vosburgh@canonical.com
* Re: 802.3ad bonding aggregator reselection
2016-06-22 17:43 ` Veli-Matti Lintu
@ 2016-06-23 5:58 ` Jay Vosburgh
0 siblings, 0 replies; 7+ messages in thread
From: Jay Vosburgh @ 2016-06-23 5:58 UTC (permalink / raw)
To: Veli-Matti Lintu; +Cc: zhuyj, netdev, Andy Gospodarek, Mahesh Bandewar
Veli-Matti Lintu <veli-matti.lintu@opinsys.fi> wrote:
[...]
>> I don't see an issue with the above behavior when ad_select is
>> set to the default value of "stable"; bonding does reselect a new
>> aggregator when all links fail, and it appears to follow the standard.
>
>In my testing ad_select=stable does not reselect a new aggregator when
>all links have failed. Reselection seems to occur only when a link
>comes up the failure. Here's an example of two bonds having three
>links each. Aggregator ID 3 is active with three ports and ID 2 has
>also three ports up.
Yes, I've since observed that as well.
[...]
>Are you able to reproduce this or is reselection working as expected?
Reselection is not working correctly at all.
I'm working up a more comprehensive fix; the setting of BEGIN in
the older code masked a number of issues in the reselection logic that
never came up because setting BEGIN would do a full reselection from
scratch at every slave carrier state change (meaning that no aggregator
ever ended up with link down ports as members).
My test patch at the moment is below (this is against net); any
testing or review would be appreciated. I have not tested the ad_select
bandwidth behavior of this yet; I've been testing stable and count
first.
This patch should be conformant to the standard, which requires
link down ports to remain selected, but implementations are free to
choose an active aggregator however they wish.
diff --git a/drivers/net/bonding/bond_3ad.c b/drivers/net/bonding/bond_3ad.c
index b9304a295f86..57be940c4c37 100644
--- a/drivers/net/bonding/bond_3ad.c
+++ b/drivers/net/bonding/bond_3ad.c
@@ -657,6 +657,20 @@ static void __set_agg_ports_ready(struct aggregator *aggregator, int val)
}
}
+static int __agg_active_ports(struct aggregator *agg)
+{
+ struct port *port;
+ int active = 0;
+
+ for (port = agg->lag_ports; port;
+ port = port->next_port_in_aggregator) {
+ if (port->is_enabled)
+ active++;
+ }
+
+ return active;
+}
+
/**
* __get_agg_bandwidth - get the total bandwidth of an aggregator
* @aggregator: the aggregator we're looking at
@@ -665,38 +679,39 @@ static void __set_agg_ports_ready(struct aggregator *aggregator, int val)
static u32 __get_agg_bandwidth(struct aggregator *aggregator)
{
u32 bandwidth = 0;
+ int nports = __agg_active_ports(aggregator);
- if (aggregator->num_of_ports) {
+ if (nports) {
switch (__get_link_speed(aggregator->lag_ports)) {
case AD_LINK_SPEED_1MBPS:
- bandwidth = aggregator->num_of_ports;
+ bandwidth = nports;
break;
case AD_LINK_SPEED_10MBPS:
- bandwidth = aggregator->num_of_ports * 10;
+ bandwidth = nports * 10;
break;
case AD_LINK_SPEED_100MBPS:
- bandwidth = aggregator->num_of_ports * 100;
+ bandwidth = nports * 100;
break;
case AD_LINK_SPEED_1000MBPS:
- bandwidth = aggregator->num_of_ports * 1000;
+ bandwidth = nports * 1000;
break;
case AD_LINK_SPEED_2500MBPS:
- bandwidth = aggregator->num_of_ports * 2500;
+ bandwidth = nports * 2500;
break;
case AD_LINK_SPEED_10000MBPS:
- bandwidth = aggregator->num_of_ports * 10000;
+ bandwidth = nports * 10000;
break;
case AD_LINK_SPEED_20000MBPS:
- bandwidth = aggregator->num_of_ports * 20000;
+ bandwidth = nports * 20000;
break;
case AD_LINK_SPEED_40000MBPS:
- bandwidth = aggregator->num_of_ports * 40000;
+ bandwidth = nports * 40000;
break;
case AD_LINK_SPEED_56000MBPS:
- bandwidth = aggregator->num_of_ports * 56000;
+ bandwidth = nports * 56000;
break;
case AD_LINK_SPEED_100000MBPS:
- bandwidth = aggregator->num_of_ports * 100000;
+ bandwidth = nports * 100000;
break;
default:
bandwidth = 0; /* to silence the compiler */
@@ -1530,10 +1545,10 @@ static struct aggregator *ad_agg_selection_test(struct aggregator *best,
switch (__get_agg_selection_mode(curr->lag_ports)) {
case BOND_AD_COUNT:
- if (curr->num_of_ports > best->num_of_ports)
+ if (__agg_active_ports(curr) > __agg_active_ports(best))
return curr;
- if (curr->num_of_ports < best->num_of_ports)
+ if (__agg_active_ports(curr) < __agg_active_ports(best))
return best;
/*FALLTHROUGH*/
@@ -1561,8 +1576,14 @@ static int agg_device_up(const struct aggregator *agg)
if (!port)
return 0;
- return netif_running(port->slave->dev) &&
- netif_carrier_ok(port->slave->dev);
+ for (port = agg->lag_ports; port;
+ port = port->next_port_in_aggregator) {
+ if (netif_running(port->slave->dev) &&
+ netif_carrier_ok(port->slave->dev))
+ return 1;
+ }
+
+ return 0;
}
/**
@@ -1610,7 +1631,7 @@ static void ad_agg_selection_logic(struct aggregator *agg,
agg->is_active = 0;
- if (agg->num_of_ports && agg_device_up(agg))
+ if (__agg_active_ports(agg) && agg_device_up(agg))
best = ad_agg_selection_test(best, agg);
}
@@ -1622,7 +1643,7 @@ static void ad_agg_selection_logic(struct aggregator *agg,
* answering partner.
*/
if (active && active->lag_ports &&
- active->lag_ports->is_enabled &&
+ __agg_active_ports(active) &&
(__agg_has_partner(active) ||
(!__agg_has_partner(active) &&
!__agg_has_partner(best)))) {
@@ -2432,7 +2453,9 @@ void bond_3ad_adapter_speed_duplex_changed(struct slave *slave)
*/
void bond_3ad_handle_link_change(struct slave *slave, char link)
{
+ struct aggregator *agg;
struct port *port;
+ bool dummy;
port = &(SLAVE_AD_INFO(slave)->port);
@@ -2459,6 +2482,9 @@ void bond_3ad_handle_link_change(struct slave *slave, char link)
port->is_enabled = false;
ad_update_actor_keys(port, true);
}
+ agg = __get_first_agg(port);
+ ad_agg_selection_logic(agg, &dummy);
+
netdev_dbg(slave->bond->dev, "Port %d changed link status to %s\n",
port->actor_port_number,
link == BOND_LINK_UP ? "UP" : "DOWN");
@@ -2499,7 +2525,7 @@ int bond_3ad_set_carrier(struct bonding *bond)
active = __get_active_agg(&(SLAVE_AD_INFO(first_slave)->aggregator));
if (active) {
/* are enough slaves available to consider link up? */
- if (active->num_of_ports < bond->params.min_links) {
+ if (__agg_active_ports(active) < bond->params.min_links) {
if (netif_carrier_ok(bond->dev)) {
netif_carrier_off(bond->dev);
goto out;
-J
---
-Jay Vosburgh, jay.vosburgh@canonical.com