* [PATCH net-next v2 1/4] net: bonding: add broadcast_neighbor option for 802.3ad
[not found] <20250513094750.23387-1-tonghao@bamaicloud.com>
@ 2025-05-13 9:47 ` Tonghao Zhang
2025-05-13 10:18 ` Nikolay Aleksandrov
2025-05-13 9:47 ` [PATCH net-next v2 2/4] net: bonding: add broadcast_neighbor netlink option Tonghao Zhang
` (2 subsequent siblings)
3 siblings, 1 reply; 12+ messages in thread
From: Tonghao Zhang @ 2025-05-13 9:47 UTC (permalink / raw)
To: netdev
Cc: Tonghao Zhang, Jay Vosburgh, David S. Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Simon Horman, Jonathan Corbet,
Andrew Lunn, Zengbing Tu
Stacking technology is a type of technology used to expand ports on
Ethernet switches. It is widely used as a common access method in
large-scale Internet data center architectures. Years of practice
have proved that stacking technology has advantages and disadvantages
in high-reliability network architecture scenarios. For instance,
in stacking networking arch, conventional switch system upgrades
require multiple stacked devices to restart at the same time.
Therefore, it is inevitable that the business will be interrupted
for a while. It is for this reason that "no-stacking" in data centers
has become a trend. Additionally, when the stacking link connecting
the switches fails or is abnormal, the stack will split. Although it is
not common, it still happens in actual operation. The problem is that
after the split, it is equivalent to two switches with the same configuration
appearing in the network, causing network configuration conflicts and
ultimately interrupting the services carried by the stacking system.
To improve network stability, "non-stacking" solutions have been increasingly
adopted, particularly by public cloud providers and tech companies
like Alibaba, Tencent, and Didi. "non-stacking" is a method of mimicing switch
stacking that convinces a LACP peer, bonding in this case, connected to a set of
"non-stacked" switches that all of its ports are connected to a single
switch (i.e., LACP aggregator), as if those switches were stacked. This
enables the LACP peer's ports to aggregate together, and requires (a)
special switch configuration, described in the linked article, and (b)
modifications to the bonding 802.3ad (LACP) mode to send all ARP / ND
packets across all ports of the active aggregator.
----------- -----------
| switch1 | | switch2 |
----------- -----------
^ ^
| |
---------------
| bond4 lacp |
---------------
| NIC1 | NIC2
---------------------
| server |
---------------------
- https://www.ruijie.com/fr-fr/support/tech-gallery/de-stack-data-center-network-architecture/
Cc: Jay Vosburgh <jv@jvosburgh.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Simon Horman <horms@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Andrew Lunn <andrew+netdev@lunn.ch>
Signed-off-by: Tonghao Zhang <tonghao@bamaicloud.com>
Signed-off-by: Zengbing Tu <tuzengbing@didiglobal.com>
---
Documentation/networking/bonding.rst | 6 +++++
drivers/net/bonding/bond_main.c | 39 ++++++++++++++++++++++++++++
drivers/net/bonding/bond_options.c | 34 ++++++++++++++++++++++++
drivers/net/bonding/bond_sysfs.c | 18 +++++++++++++
include/net/bond_options.h | 1 +
include/net/bonding.h | 3 +++
6 files changed, 101 insertions(+)
diff --git a/Documentation/networking/bonding.rst b/Documentation/networking/bonding.rst
index a4c1291d2561..14f7593d888d 100644
--- a/Documentation/networking/bonding.rst
+++ b/Documentation/networking/bonding.rst
@@ -562,6 +562,12 @@ lacp_rate
The default is slow.
+broadcast_neighbor
+
+ Option specifying whether to broadcast ARP/ND packets to all
+ active slaves. This option has no effect in modes other than
+ 802.3ad mode. The default is off (0).
+
max_bonds
Specifies the number of bonding devices to create for this
diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index d05226484c64..8ee26ddddbc8 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -212,6 +212,9 @@ atomic_t netpoll_block_tx = ATOMIC_INIT(0);
unsigned int bond_net_id __read_mostly;
+DEFINE_STATIC_KEY_FALSE(bond_bcast_neigh_enabled);
+EXPORT_SYMBOL_GPL(bond_bcast_neigh_enabled);
+
static const struct flow_dissector_key flow_keys_bonding_keys[] = {
{
.key_id = FLOW_DISSECTOR_KEY_CONTROL,
@@ -4480,6 +4483,9 @@ static int bond_close(struct net_device *bond_dev)
bond_alb_deinitialize(bond);
bond->recv_probe = NULL;
+ if (bond->params.broadcast_neighbor)
+ static_branch_dec(&bond_bcast_neigh_enabled);
+
if (bond_uses_primary(bond)) {
rcu_read_lock();
slave = rcu_dereference(bond->curr_active_slave);
@@ -5316,6 +5322,35 @@ static struct slave *bond_xdp_xmit_3ad_xor_slave_get(struct bonding *bond,
return slaves->arr[hash % count];
}
+static inline bool bond_should_broadcast_neighbor(struct sk_buff *skb,
+ struct net_device *dev)
+{
+ struct bonding *bond = netdev_priv(dev);
+
+ if (!static_branch_unlikely(&bond_bcast_neigh_enabled))
+ return false;
+
+ if (!bond->params.broadcast_neighbor)
+ return false;
+
+ if (skb->protocol == htons(ETH_P_ARP))
+ return true;
+
+ if (skb->protocol == htons(ETH_P_IPV6) &&
+ pskb_may_pull(skb,
+ sizeof(struct ipv6hdr) + sizeof(struct icmp6hdr))) {
+ if (ipv6_hdr(skb)->nexthdr == IPPROTO_ICMPV6) {
+ struct icmp6hdr *icmph = icmp6_hdr(skb);
+
+ if ((icmph->icmp6_type == NDISC_NEIGHBOUR_SOLICITATION) ||
+ (icmph->icmp6_type == NDISC_NEIGHBOUR_ADVERTISEMENT))
+ return true;
+ }
+ }
+
+ return false;
+}
+
/* Use this Xmit function for 3AD as well as XOR modes. The current
* usable slave array is formed in the control path. The xmit function
* just calculates hash and sends the packet out.
@@ -5583,6 +5618,9 @@ static netdev_tx_t __bond_start_xmit(struct sk_buff *skb, struct net_device *dev
case BOND_MODE_ACTIVEBACKUP:
return bond_xmit_activebackup(skb, dev);
case BOND_MODE_8023AD:
+ if (bond_should_broadcast_neighbor(skb, dev))
+ return bond_xmit_broadcast(skb, dev);
+ fallthrough;
case BOND_MODE_XOR:
return bond_3ad_xor_xmit(skb, dev);
case BOND_MODE_BROADCAST:
@@ -6462,6 +6500,7 @@ static int __init bond_check_params(struct bond_params *params)
eth_zero_addr(params->ad_actor_system);
params->ad_user_port_key = ad_user_port_key;
params->coupled_control = 1;
+ params->broadcast_neighbor = 0;
if (packets_per_slave > 0) {
params->reciprocal_packets_per_slave =
reciprocal_value(packets_per_slave);
diff --git a/drivers/net/bonding/bond_options.c b/drivers/net/bonding/bond_options.c
index 91893c29b899..dca52d93f513 100644
--- a/drivers/net/bonding/bond_options.c
+++ b/drivers/net/bonding/bond_options.c
@@ -87,6 +87,8 @@ static int bond_option_missed_max_set(struct bonding *bond,
const struct bond_opt_value *newval);
static int bond_option_coupled_control_set(struct bonding *bond,
const struct bond_opt_value *newval);
+static int bond_option_broadcast_neigh_set(struct bonding *bond,
+ const struct bond_opt_value *newval);
static const struct bond_opt_value bond_mode_tbl[] = {
{ "balance-rr", BOND_MODE_ROUNDROBIN, BOND_VALFLAG_DEFAULT},
@@ -240,6 +242,12 @@ static const struct bond_opt_value bond_coupled_control_tbl[] = {
{ NULL, -1, 0},
};
+static const struct bond_opt_value bond_broadcast_neigh_tbl[] = {
+ { "on", 1, 0},
+ { "off", 0, BOND_VALFLAG_DEFAULT},
+ { NULL, -1, 0}
+};
+
static const struct bond_option bond_opts[BOND_OPT_LAST] = {
[BOND_OPT_MODE] = {
.id = BOND_OPT_MODE,
@@ -513,6 +521,14 @@ static const struct bond_option bond_opts[BOND_OPT_LAST] = {
.flags = BOND_OPTFLAG_IFDOWN,
.values = bond_coupled_control_tbl,
.set = bond_option_coupled_control_set,
+ },
+ [BOND_OPT_BROADCAST_NEIGH] = {
+ .id = BOND_OPT_BROADCAST_NEIGH,
+ .name = "broadcast_neighbor",
+ .desc = "Broadcast neighbor packets to all slaves",
+ .unsuppmodes = BOND_MODE_ALL_EX(BIT(BOND_MODE_8023AD)),
+ .values = bond_broadcast_neigh_tbl,
+ .set = bond_option_broadcast_neigh_set,
}
};
@@ -1840,3 +1856,21 @@ static int bond_option_coupled_control_set(struct bonding *bond,
bond->params.coupled_control = newval->value;
return 0;
}
+
+static int bond_option_broadcast_neigh_set(struct bonding *bond,
+ const struct bond_opt_value *newval)
+{
+ netdev_dbg(bond->dev, "Setting broadcast_neighbor to %llu\n",
+ newval->value);
+
+ if (bond->params.broadcast_neighbor == newval->value)
+ return 0;
+
+ bond->params.broadcast_neighbor = newval->value;
+ if (bond->params.broadcast_neighbor)
+ static_branch_inc(&bond_bcast_neigh_enabled);
+ else
+ static_branch_dec(&bond_bcast_neigh_enabled);
+
+ return 0;
+}
diff --git a/drivers/net/bonding/bond_sysfs.c b/drivers/net/bonding/bond_sysfs.c
index 1e13bb170515..4a53850b2c68 100644
--- a/drivers/net/bonding/bond_sysfs.c
+++ b/drivers/net/bonding/bond_sysfs.c
@@ -752,6 +752,23 @@ static ssize_t bonding_show_ad_user_port_key(struct device *d,
static DEVICE_ATTR(ad_user_port_key, 0644,
bonding_show_ad_user_port_key, bonding_sysfs_store_option);
+static ssize_t bonding_show_broadcast_neighbor(struct device *d,
+ struct device_attribute *attr,
+ char *buf)
+{
+ struct bonding *bond = to_bond(d);
+ const struct bond_opt_value *val;
+
+ val = bond_opt_get_val(BOND_OPT_BROADCAST_NEIGH,
+ bond->params.broadcast_neighbor);
+
+ return sysfs_emit(buf, "%s %d\n", val->string,
+ bond->params.broadcast_neighbor);
+}
+
+static DEVICE_ATTR(broadcast_neighbor, 0644,
+ bonding_show_broadcast_neighbor, bonding_sysfs_store_option);
+
static struct attribute *per_bond_attrs[] = {
&dev_attr_slaves.attr,
&dev_attr_mode.attr,
@@ -791,6 +808,7 @@ static struct attribute *per_bond_attrs[] = {
&dev_attr_ad_actor_system.attr,
&dev_attr_ad_user_port_key.attr,
&dev_attr_arp_missed_max.attr,
+ &dev_attr_broadcast_neighbor.attr,
NULL,
};
diff --git a/include/net/bond_options.h b/include/net/bond_options.h
index 18687ccf0638..022b122a9fb6 100644
--- a/include/net/bond_options.h
+++ b/include/net/bond_options.h
@@ -77,6 +77,7 @@ enum {
BOND_OPT_NS_TARGETS,
BOND_OPT_PRIO,
BOND_OPT_COUPLED_CONTROL,
+ BOND_OPT_BROADCAST_NEIGH,
BOND_OPT_LAST
};
diff --git a/include/net/bonding.h b/include/net/bonding.h
index 95f67b308c19..e06f0d63b2c1 100644
--- a/include/net/bonding.h
+++ b/include/net/bonding.h
@@ -115,6 +115,8 @@ static inline int is_netpoll_tx_blocked(struct net_device *dev)
#define is_netpoll_tx_blocked(dev) (0)
#endif
+DECLARE_STATIC_KEY_FALSE(bond_bcast_neigh_enabled);
+
struct bond_params {
int mode;
int xmit_policy;
@@ -149,6 +151,7 @@ struct bond_params {
struct in6_addr ns_targets[BOND_MAX_NS_TARGETS];
#endif
int coupled_control;
+ int broadcast_neighbor;
/* 2 bytes of padding : see ether_addr_equal_64bits() */
u8 ad_actor_system[ETH_ALEN + 2];
--
2.34.1
^ permalink raw reply related [flat|nested] 12+ messages in thread
* [PATCH net-next v2 2/4] net: bonding: add broadcast_neighbor netlink option
[not found] <20250513094750.23387-1-tonghao@bamaicloud.com>
2025-05-13 9:47 ` [PATCH net-next v2 1/4] net: bonding: add broadcast_neighbor option for 802.3ad Tonghao Zhang
@ 2025-05-13 9:47 ` Tonghao Zhang
2025-05-13 10:19 ` Nikolay Aleksandrov
2025-05-13 9:47 ` [PATCH net-next v2 3/4] net: bonding: send peer notify when failure recovery Tonghao Zhang
2025-05-13 9:47 ` [PATCH net-next v2 4/4] net: bonding: add tracepoint for 802.3ad Tonghao Zhang
3 siblings, 1 reply; 12+ messages in thread
From: Tonghao Zhang @ 2025-05-13 9:47 UTC (permalink / raw)
To: netdev
Cc: Tonghao Zhang, Jay Vosburgh, David S. Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Simon Horman, Jonathan Corbet,
Andrew Lunn, Zengbing Tu
User can config or display the bonding broadcast_neighbor option via
iproute2/netlink.
Cc: Jay Vosburgh <jv@jvosburgh.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Simon Horman <horms@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Andrew Lunn <andrew+netdev@lunn.ch>
Signed-off-by: Tonghao Zhang <tonghao@bamaicloud.com>
Signed-off-by: Zengbing Tu <tuzengbing@didiglobal.com>
---
drivers/net/bonding/bond_netlink.c | 16 ++++++++++++++++
include/uapi/linux/if_link.h | 1 +
2 files changed, 17 insertions(+)
diff --git a/drivers/net/bonding/bond_netlink.c b/drivers/net/bonding/bond_netlink.c
index ac5e402c34bc..57fff2421f1b 100644
--- a/drivers/net/bonding/bond_netlink.c
+++ b/drivers/net/bonding/bond_netlink.c
@@ -124,6 +124,7 @@ static const struct nla_policy bond_policy[IFLA_BOND_MAX + 1] = {
[IFLA_BOND_MISSED_MAX] = { .type = NLA_U8 },
[IFLA_BOND_NS_IP6_TARGET] = { .type = NLA_NESTED },
[IFLA_BOND_COUPLED_CONTROL] = { .type = NLA_U8 },
+ [IFLA_BOND_BROADCAST_NEIGH] = { .type = NLA_U8 },
};
static const struct nla_policy bond_slave_policy[IFLA_BOND_SLAVE_MAX + 1] = {
@@ -561,6 +562,16 @@ static int bond_changelink(struct net_device *bond_dev, struct nlattr *tb[],
return err;
}
+ if (data[IFLA_BOND_BROADCAST_NEIGH]) {
+ int broadcast_neigh = nla_get_u8(data[IFLA_BOND_BROADCAST_NEIGH]);
+
+ bond_opt_initval(&newval, broadcast_neigh);
+ err = __bond_opt_set(bond, BOND_OPT_BROADCAST_NEIGH, &newval,
+ data[IFLA_BOND_BROADCAST_NEIGH], extack);
+ if (err)
+ return err;
+ }
+
return 0;
}
@@ -630,6 +641,7 @@ static size_t bond_get_size(const struct net_device *bond_dev)
nla_total_size(sizeof(struct nlattr)) +
nla_total_size(sizeof(struct in6_addr)) * BOND_MAX_NS_TARGETS +
nla_total_size(sizeof(u8)) + /* IFLA_BOND_COUPLED_CONTROL */
+ nla_total_size(sizeof(u8)) + /* IFLA_BOND_BROADCAST_NEIGH */
0;
}
@@ -793,6 +805,10 @@ static int bond_fill_info(struct sk_buff *skb,
bond->params.coupled_control))
goto nla_put_failure;
+ if (nla_put_u8(skb, IFLA_BOND_BROADCAST_NEIGH,
+ bond->params.broadcast_neighbor))
+ goto nla_put_failure;
+
if (BOND_MODE(bond) == BOND_MODE_8023AD) {
struct ad_info info;
diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
index 3ad2d5d98034..53b2f6ebda8b 100644
--- a/include/uapi/linux/if_link.h
+++ b/include/uapi/linux/if_link.h
@@ -1534,6 +1534,7 @@ enum {
IFLA_BOND_MISSED_MAX,
IFLA_BOND_NS_IP6_TARGET,
IFLA_BOND_COUPLED_CONTROL,
+ IFLA_BOND_BROADCAST_NEIGH,
__IFLA_BOND_MAX,
};
--
2.34.1
^ permalink raw reply related [flat|nested] 12+ messages in thread
* [PATCH net-next v2 3/4] net: bonding: send peer notify when failure recovery
[not found] <20250513094750.23387-1-tonghao@bamaicloud.com>
2025-05-13 9:47 ` [PATCH net-next v2 1/4] net: bonding: add broadcast_neighbor option for 802.3ad Tonghao Zhang
2025-05-13 9:47 ` [PATCH net-next v2 2/4] net: bonding: add broadcast_neighbor netlink option Tonghao Zhang
@ 2025-05-13 9:47 ` Tonghao Zhang
2025-05-13 10:37 ` Nikolay Aleksandrov
2025-05-13 9:47 ` [PATCH net-next v2 4/4] net: bonding: add tracepoint for 802.3ad Tonghao Zhang
3 siblings, 1 reply; 12+ messages in thread
From: Tonghao Zhang @ 2025-05-13 9:47 UTC (permalink / raw)
To: netdev
Cc: Tonghao Zhang, Jay Vosburgh, David S. Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Simon Horman, Jonathan Corbet,
Andrew Lunn, Zengbing Tu
After LACP protocol recovery, the port can transmit packets.
However, if the bond port doesn't send gratuitous ARP/ND
packets to the switch, the switch won't return packets through
the current interface. This causes traffic imbalance. To resolve
this issue, when LACP protocol recovers, send ARP/ND packets.
Cc: Jay Vosburgh <jv@jvosburgh.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Simon Horman <horms@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Andrew Lunn <andrew+netdev@lunn.ch>
Signed-off-by: Tonghao Zhang <tonghao@bamaicloud.com>
Signed-off-by: Zengbing Tu <tuzengbing@didiglobal.com>
---
Documentation/networking/bonding.rst | 5 +++--
drivers/net/bonding/bond_3ad.c | 13 +++++++++++++
drivers/net/bonding/bond_main.c | 21 ++++++++++++++++-----
3 files changed, 32 insertions(+), 7 deletions(-)
diff --git a/Documentation/networking/bonding.rst b/Documentation/networking/bonding.rst
index 14f7593d888d..f8f5766703d4 100644
--- a/Documentation/networking/bonding.rst
+++ b/Documentation/networking/bonding.rst
@@ -773,8 +773,9 @@ num_unsol_na
greater than 1.
The valid range is 0 - 255; the default value is 1. These options
- affect only the active-backup mode. These options were added for
- bonding versions 3.3.0 and 3.4.0 respectively.
+ affect the active-backup or 802.3ad (broadcast_neighbor enabled) mode.
+ These options were added for bonding versions 3.3.0 and 3.4.0
+ respectively.
From Linux 3.0 and bonding version 3.7.1, these notifications
are generated by the ipv4 and ipv6 code and the numbers of
diff --git a/drivers/net/bonding/bond_3ad.c b/drivers/net/bonding/bond_3ad.c
index c6807e473ab7..d1c2d416ac87 100644
--- a/drivers/net/bonding/bond_3ad.c
+++ b/drivers/net/bonding/bond_3ad.c
@@ -982,6 +982,17 @@ static int ad_marker_send(struct port *port, struct bond_marker *marker)
return 0;
}
+static void ad_cond_set_peer_notif(struct port *port)
+{
+ struct bonding *bond = port->slave->bond;
+
+ if (bond->params.broadcast_neighbor && rtnl_trylock()) {
+ bond->send_peer_notif = bond->params.num_peer_notif *
+ max(1, bond->params.peer_notif_delay);
+ rtnl_unlock();
+ }
+}
+
/**
* ad_mux_machine - handle a port's mux state machine
* @port: the port we're looking at
@@ -2061,6 +2072,8 @@ static void ad_enable_collecting_distributing(struct port *port,
__enable_port(port);
/* Slave array needs update */
*update_slave_arr = true;
+ /* Should notify peers if possible */
+ ad_cond_set_peer_notif(port);
}
}
diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 8ee26ddddbc8..70bb1e33cee2 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -1243,17 +1243,28 @@ static struct slave *bond_find_best_slave(struct bonding *bond)
/* must be called in RCU critical section or with RTNL held */
static bool bond_should_notify_peers(struct bonding *bond)
{
- struct slave *slave = rcu_dereference_rtnl(bond->curr_active_slave);
+ struct bond_up_slave *usable;
+ struct slave *slave = NULL;
- if (!slave || !bond->send_peer_notif ||
+ if (!bond->send_peer_notif ||
bond->send_peer_notif %
max(1, bond->params.peer_notif_delay) != 0 ||
- !netif_carrier_ok(bond->dev) ||
- test_bit(__LINK_STATE_LINKWATCH_PENDING, &slave->dev->state))
+ !netif_carrier_ok(bond->dev))
return false;
+ if (BOND_MODE(bond) == BOND_MODE_8023AD) {
+ usable = rtnl_dereference(bond->usable_slaves);
+ if (!usable || !READ_ONCE(usable->count))
+ return false;
+ } else {
+ slave = rcu_dereference_rtnl(bond->curr_active_slave);
+ if (!slave || test_bit(__LINK_STATE_LINKWATCH_PENDING,
+ &slave->dev->state))
+ return false;
+ }
+
netdev_dbg(bond->dev, "bond_should_notify_peers: slave %s\n",
- slave ? slave->dev->name : "NULL");
+ slave ? slave->dev->name : "all");
return true;
}
--
2.34.1
^ permalink raw reply related [flat|nested] 12+ messages in thread
* [PATCH net-next v2 4/4] net: bonding: add tracepoint for 802.3ad
[not found] <20250513094750.23387-1-tonghao@bamaicloud.com>
` (2 preceding siblings ...)
2025-05-13 9:47 ` [PATCH net-next v2 3/4] net: bonding: send peer notify when failure recovery Tonghao Zhang
@ 2025-05-13 9:47 ` Tonghao Zhang
3 siblings, 0 replies; 12+ messages in thread
From: Tonghao Zhang @ 2025-05-13 9:47 UTC (permalink / raw)
To: netdev
Cc: Tonghao Zhang, Jay Vosburgh, David S. Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Simon Horman, Jonathan Corbet,
Andrew Lunn, Zengbing Tu
Users can monitor NIC link status changes through netlink. However, LACP
protocol failures may occur despite operational physical links. There is
no way to detect LACP state changes. This patch adds tracepoint at LACP state
transition.
Cc: Jay Vosburgh <jv@jvosburgh.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Simon Horman <horms@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Andrew Lunn <andrew+netdev@lunn.ch>
Signed-off-by: Tonghao Zhang <tonghao@bamaicloud.com>
Signed-off-by: Zengbing Tu <tuzengbing@didiglobal.com>
---
drivers/net/bonding/bond_3ad.c | 6 ++++++
include/trace/events/bonding.h | 37 ++++++++++++++++++++++++++++++++++
2 files changed, 43 insertions(+)
create mode 100644 include/trace/events/bonding.h
diff --git a/drivers/net/bonding/bond_3ad.c b/drivers/net/bonding/bond_3ad.c
index d1c2d416ac87..55703230ab29 100644
--- a/drivers/net/bonding/bond_3ad.c
+++ b/drivers/net/bonding/bond_3ad.c
@@ -16,6 +16,9 @@
#include <net/bond_3ad.h>
#include <net/netlink.h>
+#define CREATE_TRACE_POINTS
+#include <trace/events/bonding.h>
+
/* General definitions */
#define AD_SHORT_TIMEOUT 1
#define AD_LONG_TIMEOUT 0
@@ -1146,6 +1149,9 @@ static void ad_mux_machine(struct port *port, bool *update_slave_arr)
port->actor_port_number,
last_state,
port->sm_mux_state);
+
+ trace_3ad_mux_state(port->slave->dev, last_state, port->sm_mux_state);
+
switch (port->sm_mux_state) {
case AD_MUX_DETACHED:
port->actor_oper_port_state &= ~LACP_STATE_SYNCHRONIZATION;
diff --git a/include/trace/events/bonding.h b/include/trace/events/bonding.h
new file mode 100644
index 000000000000..1ee4b07d912a
--- /dev/null
+++ b/include/trace/events/bonding.h
@@ -0,0 +1,37 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#if !defined(_TRACE_BONDING_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_BONDING_H
+
+#include <linux/netdevice.h>
+#include <linux/tracepoint.h>
+
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM bonding
+
+TRACE_EVENT(3ad_mux_state,
+ TP_PROTO(struct net_device *dev, u32 last_state, u32 curr_state),
+ TP_ARGS(dev, last_state, curr_state),
+
+ TP_STRUCT__entry(
+ __field(int, ifindex)
+ __string(dev_name, dev->name)
+ __field(u32, last_state)
+ __field(u32, curr_state)
+ ),
+
+ TP_fast_assign(
+ __entry->ifindex = dev->ifindex;
+ __assign_str(dev_name);
+ __entry->last_state = last_state;
+ __entry->curr_state = curr_state;
+ ),
+
+ TP_printk("ifindex %d dev %s last_state 0x%x curr_state 0x%x",
+ __entry->ifindex, __get_str(dev_name),
+ __entry->last_state, __entry->curr_state)
+);
+
+#endif /* _TRACE_BONDING_H */
+
+#include <trace/define_trace.h>
--
2.34.1
^ permalink raw reply related [flat|nested] 12+ messages in thread
* Re: [PATCH net-next v2 1/4] net: bonding: add broadcast_neighbor option for 802.3ad
2025-05-13 9:47 ` [PATCH net-next v2 1/4] net: bonding: add broadcast_neighbor option for 802.3ad Tonghao Zhang
@ 2025-05-13 10:18 ` Nikolay Aleksandrov
2025-05-13 12:13 ` Tonghao Zhang
0 siblings, 1 reply; 12+ messages in thread
From: Nikolay Aleksandrov @ 2025-05-13 10:18 UTC (permalink / raw)
To: Tonghao Zhang, netdev
Cc: Jay Vosburgh, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Simon Horman, Jonathan Corbet, Andrew Lunn,
Zengbing Tu
On 5/13/25 12:47, Tonghao Zhang wrote:
> Stacking technology is a type of technology used to expand ports on
> Ethernet switches. It is widely used as a common access method in
> large-scale Internet data center architectures. Years of practice
> have proved that stacking technology has advantages and disadvantages
> in high-reliability network architecture scenarios. For instance,
> in stacking networking arch, conventional switch system upgrades
> require multiple stacked devices to restart at the same time.
> Therefore, it is inevitable that the business will be interrupted
> for a while. It is for this reason that "no-stacking" in data centers
> has become a trend. Additionally, when the stacking link connecting
> the switches fails or is abnormal, the stack will split. Although it is
> not common, it still happens in actual operation. The problem is that
> after the split, it is equivalent to two switches with the same configuration
> appearing in the network, causing network configuration conflicts and
> ultimately interrupting the services carried by the stacking system.
>
> To improve network stability, "non-stacking" solutions have been increasingly
> adopted, particularly by public cloud providers and tech companies
> like Alibaba, Tencent, and Didi. "non-stacking" is a method of mimicing switch
> stacking that convinces a LACP peer, bonding in this case, connected to a set of
> "non-stacked" switches that all of its ports are connected to a single
> switch (i.e., LACP aggregator), as if those switches were stacked. This
> enables the LACP peer's ports to aggregate together, and requires (a)
> special switch configuration, described in the linked article, and (b)
> modifications to the bonding 802.3ad (LACP) mode to send all ARP / ND
> packets across all ports of the active aggregator.
>
> ----------- -----------
> | switch1 | | switch2 |
> ----------- -----------
> ^ ^
> | |
> ---------------
> | bond4 lacp |
> ---------------
> | NIC1 | NIC2
> ---------------------
> | server |
> ---------------------
>
> - https://www.ruijie.com/fr-fr/support/tech-gallery/de-stack-data-center-network-architecture/
>
> Cc: Jay Vosburgh <jv@jvosburgh.net>
> Cc: "David S. Miller" <davem@davemloft.net>
> Cc: Eric Dumazet <edumazet@google.com>
> Cc: Jakub Kicinski <kuba@kernel.org>
> Cc: Paolo Abeni <pabeni@redhat.com>
> Cc: Simon Horman <horms@kernel.org>
> Cc: Jonathan Corbet <corbet@lwn.net>
> Cc: Andrew Lunn <andrew+netdev@lunn.ch>
> Signed-off-by: Tonghao Zhang <tonghao@bamaicloud.com>
> Signed-off-by: Zengbing Tu <tuzengbing@didiglobal.com>
> ---
> Documentation/networking/bonding.rst | 6 +++++
> drivers/net/bonding/bond_main.c | 39 ++++++++++++++++++++++++++++
> drivers/net/bonding/bond_options.c | 34 ++++++++++++++++++++++++
> drivers/net/bonding/bond_sysfs.c | 18 +++++++++++++
> include/net/bond_options.h | 1 +
> include/net/bonding.h | 3 +++
> 6 files changed, 101 insertions(+)
>
> diff --git a/Documentation/networking/bonding.rst b/Documentation/networking/bonding.rst
> index a4c1291d2561..14f7593d888d 100644
> --- a/Documentation/networking/bonding.rst
> +++ b/Documentation/networking/bonding.rst
> @@ -562,6 +562,12 @@ lacp_rate
>
> The default is slow.
>
> +broadcast_neighbor
> +
> + Option specifying whether to broadcast ARP/ND packets to all
> + active slaves. This option has no effect in modes other than
> + 802.3ad mode. The default is off (0).
> +
> max_bonds
>
> Specifies the number of bonding devices to create for this
> diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
> index d05226484c64..8ee26ddddbc8 100644
> --- a/drivers/net/bonding/bond_main.c
> +++ b/drivers/net/bonding/bond_main.c
> @@ -212,6 +212,9 @@ atomic_t netpoll_block_tx = ATOMIC_INIT(0);
>
> unsigned int bond_net_id __read_mostly;
>
> +DEFINE_STATIC_KEY_FALSE(bond_bcast_neigh_enabled);
> +EXPORT_SYMBOL_GPL(bond_bcast_neigh_enabled);
No need to export the symbol, you can add bond helpers to inc/dec it.
> +
> static const struct flow_dissector_key flow_keys_bonding_keys[] = {
> {
> .key_id = FLOW_DISSECTOR_KEY_CONTROL,
> @@ -4480,6 +4483,9 @@ static int bond_close(struct net_device *bond_dev)
> bond_alb_deinitialize(bond);
> bond->recv_probe = NULL;
>
> + if (bond->params.broadcast_neighbor)
> + static_branch_dec(&bond_bcast_neigh_enabled);
> +
This branch doesn't get re-enabled if the bond is brought up afterwards.
> if (bond_uses_primary(bond)) {
> rcu_read_lock();
> slave = rcu_dereference(bond->curr_active_slave);
> @@ -5316,6 +5322,35 @@ static struct slave *bond_xdp_xmit_3ad_xor_slave_get(struct bonding *bond,
> return slaves->arr[hash % count];
> }
>
> +static inline bool bond_should_broadcast_neighbor(struct sk_buff *skb,
don't use inline in .c files
> + struct net_device *dev)
> +{
> + struct bonding *bond = netdev_priv(dev);
> +
> + if (!static_branch_unlikely(&bond_bcast_neigh_enabled))
> + return false;
> +
> + if (!bond->params.broadcast_neighbor)
> + return false;
> +
> + if (skb->protocol == htons(ETH_P_ARP))
> + return true;
> +
> + if (skb->protocol == htons(ETH_P_IPV6) &&
> + pskb_may_pull(skb,
> + sizeof(struct ipv6hdr) + sizeof(struct icmp6hdr))) {
> + if (ipv6_hdr(skb)->nexthdr == IPPROTO_ICMPV6) {
> + struct icmp6hdr *icmph = icmp6_hdr(skb);
> +
> + if ((icmph->icmp6_type == NDISC_NEIGHBOUR_SOLICITATION) ||
> + (icmph->icmp6_type == NDISC_NEIGHBOUR_ADVERTISEMENT))
> + return true;
> + }
> + }
> +
> + return false;
> +}
> +
> /* Use this Xmit function for 3AD as well as XOR modes. The current
> * usable slave array is formed in the control path. The xmit function
> * just calculates hash and sends the packet out.
> @@ -5583,6 +5618,9 @@ static netdev_tx_t __bond_start_xmit(struct sk_buff *skb, struct net_device *dev
> case BOND_MODE_ACTIVEBACKUP:
> return bond_xmit_activebackup(skb, dev);
> case BOND_MODE_8023AD:
> + if (bond_should_broadcast_neighbor(skb, dev))
> + return bond_xmit_broadcast(skb, dev);
> + fallthrough;
> case BOND_MODE_XOR:
> return bond_3ad_xor_xmit(skb, dev);
> case BOND_MODE_BROADCAST:
> @@ -6462,6 +6500,7 @@ static int __init bond_check_params(struct bond_params *params)
> eth_zero_addr(params->ad_actor_system);
> params->ad_user_port_key = ad_user_port_key;
> params->coupled_control = 1;
> + params->broadcast_neighbor = 0;
> if (packets_per_slave > 0) {
> params->reciprocal_packets_per_slave =
> reciprocal_value(packets_per_slave);
> diff --git a/drivers/net/bonding/bond_options.c b/drivers/net/bonding/bond_options.c
> index 91893c29b899..dca52d93f513 100644
> --- a/drivers/net/bonding/bond_options.c
> +++ b/drivers/net/bonding/bond_options.c
> @@ -87,6 +87,8 @@ static int bond_option_missed_max_set(struct bonding *bond,
> const struct bond_opt_value *newval);
> static int bond_option_coupled_control_set(struct bonding *bond,
> const struct bond_opt_value *newval);
> +static int bond_option_broadcast_neigh_set(struct bonding *bond,
> + const struct bond_opt_value *newval);
>
> static const struct bond_opt_value bond_mode_tbl[] = {
> { "balance-rr", BOND_MODE_ROUNDROBIN, BOND_VALFLAG_DEFAULT},
> @@ -240,6 +242,12 @@ static const struct bond_opt_value bond_coupled_control_tbl[] = {
> { NULL, -1, 0},
> };
>
> +static const struct bond_opt_value bond_broadcast_neigh_tbl[] = {
> + { "on", 1, 0},
> + { "off", 0, BOND_VALFLAG_DEFAULT},
I know the option above is using this order, but it is a bit counter-intuitive to
have their places reversed wrt their values, could you please re-order these as
the other bond on/off options? This is a small nit, I don't have a strong preference
but it is more intuitive to have them in their value order. :)
> + { NULL, -1, 0}> +};
> +
> static const struct bond_option bond_opts[BOND_OPT_LAST] = {
> [BOND_OPT_MODE] = {
> .id = BOND_OPT_MODE,
> @@ -513,6 +521,14 @@ static const struct bond_option bond_opts[BOND_OPT_LAST] = {
> .flags = BOND_OPTFLAG_IFDOWN,
> .values = bond_coupled_control_tbl,
> .set = bond_option_coupled_control_set,
> + },
> + [BOND_OPT_BROADCAST_NEIGH] = {
> + .id = BOND_OPT_BROADCAST_NEIGH,
> + .name = "broadcast_neighbor",
> + .desc = "Broadcast neighbor packets to all slaves",
> + .unsuppmodes = BOND_MODE_ALL_EX(BIT(BOND_MODE_8023AD)),
> + .values = bond_broadcast_neigh_tbl,
> + .set = bond_option_broadcast_neigh_set,
> }
> };
>
> @@ -1840,3 +1856,21 @@ static int bond_option_coupled_control_set(struct bonding *bond,
> bond->params.coupled_control = newval->value;
> return 0;
> }
> +
> +static int bond_option_broadcast_neigh_set(struct bonding *bond,
> + const struct bond_opt_value *newval)
> +{
> + netdev_dbg(bond->dev, "Setting broadcast_neighbor to %llu\n",
> + newval->value);
> +
> + if (bond->params.broadcast_neighbor == newval->value)
> + return 0;
> +
> + bond->params.broadcast_neighbor = newval->value;
> + if (bond->params.broadcast_neighbor)
> + static_branch_inc(&bond_bcast_neigh_enabled);
> + else
> + static_branch_dec(&bond_bcast_neigh_enabled);
If the bond has been brought down then the branch has been already decremented.
You'll have to synchronize this with bond open/close or alternatively mark the option
as being able to be changed only when the bond is up (there is an option flag for that).
> +
> + return 0;
> +}
> diff --git a/drivers/net/bonding/bond_sysfs.c b/drivers/net/bonding/bond_sysfs.c
> index 1e13bb170515..4a53850b2c68 100644
> --- a/drivers/net/bonding/bond_sysfs.c
> +++ b/drivers/net/bonding/bond_sysfs.c
> @@ -752,6 +752,23 @@ static ssize_t bonding_show_ad_user_port_key(struct device *d,
> static DEVICE_ATTR(ad_user_port_key, 0644,
> bonding_show_ad_user_port_key, bonding_sysfs_store_option);
>
> +static ssize_t bonding_show_broadcast_neighbor(struct device *d,
> + struct device_attribute *attr,
> + char *buf)
> +{
> + struct bonding *bond = to_bond(d);
> + const struct bond_opt_value *val;
> +
> + val = bond_opt_get_val(BOND_OPT_BROADCAST_NEIGH,
> + bond->params.broadcast_neighbor);
> +
> + return sysfs_emit(buf, "%s %d\n", val->string,
> + bond->params.broadcast_neighbor);
> +}
> +
> +static DEVICE_ATTR(broadcast_neighbor, 0644,
> + bonding_show_broadcast_neighbor, bonding_sysfs_store_option);
> +
sysfs options are deprecated, please don't extend sysfs
netlink is the preferred way for new options
> static struct attribute *per_bond_attrs[] = {
> &dev_attr_slaves.attr,
> &dev_attr_mode.attr,
> @@ -791,6 +808,7 @@ static struct attribute *per_bond_attrs[] = {
> &dev_attr_ad_actor_system.attr,
> &dev_attr_ad_user_port_key.attr,
> &dev_attr_arp_missed_max.attr,
> + &dev_attr_broadcast_neighbor.attr,
> NULL,
> };
>
> diff --git a/include/net/bond_options.h b/include/net/bond_options.h
> index 18687ccf0638..022b122a9fb6 100644
> --- a/include/net/bond_options.h
> +++ b/include/net/bond_options.h
> @@ -77,6 +77,7 @@ enum {
> BOND_OPT_NS_TARGETS,
> BOND_OPT_PRIO,
> BOND_OPT_COUPLED_CONTROL,
> + BOND_OPT_BROADCAST_NEIGH,
> BOND_OPT_LAST
> };
>
> diff --git a/include/net/bonding.h b/include/net/bonding.h
> index 95f67b308c19..e06f0d63b2c1 100644
> --- a/include/net/bonding.h
> +++ b/include/net/bonding.h
> @@ -115,6 +115,8 @@ static inline int is_netpoll_tx_blocked(struct net_device *dev)
> #define is_netpoll_tx_blocked(dev) (0)
> #endif
>
> +DECLARE_STATIC_KEY_FALSE(bond_bcast_neigh_enabled);
> +
> struct bond_params {
> int mode;
> int xmit_policy;
> @@ -149,6 +151,7 @@ struct bond_params {
> struct in6_addr ns_targets[BOND_MAX_NS_TARGETS];
> #endif
> int coupled_control;
> + int broadcast_neighbor;
>
> /* 2 bytes of padding : see ether_addr_equal_64bits() */
> u8 ad_actor_system[ETH_ALEN + 2];
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH net-next v2 2/4] net: bonding: add broadcast_neighbor netlink option
2025-05-13 9:47 ` [PATCH net-next v2 2/4] net: bonding: add broadcast_neighbor netlink option Tonghao Zhang
@ 2025-05-13 10:19 ` Nikolay Aleksandrov
0 siblings, 0 replies; 12+ messages in thread
From: Nikolay Aleksandrov @ 2025-05-13 10:19 UTC (permalink / raw)
To: Tonghao Zhang, netdev
Cc: Jay Vosburgh, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Simon Horman, Jonathan Corbet, Andrew Lunn,
Zengbing Tu
On 5/13/25 12:47, Tonghao Zhang wrote:
> User can config or display the bonding broadcast_neighbor option via
> iproute2/netlink.
>
> Cc: Jay Vosburgh <jv@jvosburgh.net>
> Cc: "David S. Miller" <davem@davemloft.net>
> Cc: Eric Dumazet <edumazet@google.com>
> Cc: Jakub Kicinski <kuba@kernel.org>
> Cc: Paolo Abeni <pabeni@redhat.com>
> Cc: Simon Horman <horms@kernel.org>
> Cc: Jonathan Corbet <corbet@lwn.net>
> Cc: Andrew Lunn <andrew+netdev@lunn.ch>
> Signed-off-by: Tonghao Zhang <tonghao@bamaicloud.com>
> Signed-off-by: Zengbing Tu <tuzengbing@didiglobal.com>
> ---
> drivers/net/bonding/bond_netlink.c | 16 ++++++++++++++++
> include/uapi/linux/if_link.h | 1 +
> 2 files changed, 17 insertions(+)
>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH net-next v2 3/4] net: bonding: send peer notify when failure recovery
2025-05-13 9:47 ` [PATCH net-next v2 3/4] net: bonding: send peer notify when failure recovery Tonghao Zhang
@ 2025-05-13 10:37 ` Nikolay Aleksandrov
2025-05-14 6:07 ` Tonghao Zhang
0 siblings, 1 reply; 12+ messages in thread
From: Nikolay Aleksandrov @ 2025-05-13 10:37 UTC (permalink / raw)
To: Tonghao Zhang, netdev
Cc: Jay Vosburgh, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Simon Horman, Jonathan Corbet, Andrew Lunn,
Zengbing Tu
On 5/13/25 12:47, Tonghao Zhang wrote:
> After LACP protocol recovery, the port can transmit packets.
> However, if the bond port doesn't send gratuitous ARP/ND
> packets to the switch, the switch won't return packets through
> the current interface. This causes traffic imbalance. To resolve
> this issue, when LACP protocol recovers, send ARP/ND packets.
>
> Cc: Jay Vosburgh <jv@jvosburgh.net>
> Cc: "David S. Miller" <davem@davemloft.net>
> Cc: Eric Dumazet <edumazet@google.com>
> Cc: Jakub Kicinski <kuba@kernel.org>
> Cc: Paolo Abeni <pabeni@redhat.com>
> Cc: Simon Horman <horms@kernel.org>
> Cc: Jonathan Corbet <corbet@lwn.net>
> Cc: Andrew Lunn <andrew+netdev@lunn.ch>
> Signed-off-by: Tonghao Zhang <tonghao@bamaicloud.com>
> Signed-off-by: Zengbing Tu <tuzengbing@didiglobal.com>
> ---
> Documentation/networking/bonding.rst | 5 +++--
> drivers/net/bonding/bond_3ad.c | 13 +++++++++++++
> drivers/net/bonding/bond_main.c | 21 ++++++++++++++++-----
> 3 files changed, 32 insertions(+), 7 deletions(-)
>
> diff --git a/Documentation/networking/bonding.rst b/Documentation/networking/bonding.rst
> index 14f7593d888d..f8f5766703d4 100644
> --- a/Documentation/networking/bonding.rst
> +++ b/Documentation/networking/bonding.rst
> @@ -773,8 +773,9 @@ num_unsol_na
> greater than 1.
>
> The valid range is 0 - 255; the default value is 1. These options
> - affect only the active-backup mode. These options were added for
> - bonding versions 3.3.0 and 3.4.0 respectively.
> + affect the active-backup or 802.3ad (broadcast_neighbor enabled) mode.
> + These options were added for bonding versions 3.3.0 and 3.4.0
> + respectively.
>
> From Linux 3.0 and bonding version 3.7.1, these notifications
> are generated by the ipv4 and ipv6 code and the numbers of
> diff --git a/drivers/net/bonding/bond_3ad.c b/drivers/net/bonding/bond_3ad.c
> index c6807e473ab7..d1c2d416ac87 100644
> --- a/drivers/net/bonding/bond_3ad.c
> +++ b/drivers/net/bonding/bond_3ad.c
> @@ -982,6 +982,17 @@ static int ad_marker_send(struct port *port, struct bond_marker *marker)
> return 0;
> }
>
> +static void ad_cond_set_peer_notif(struct port *port)
> +{
> + struct bonding *bond = port->slave->bond;
> +
> + if (bond->params.broadcast_neighbor && rtnl_trylock()) {
> + bond->send_peer_notif = bond->params.num_peer_notif *
> + max(1, bond->params.peer_notif_delay);
> + rtnl_unlock();
> + }> +}
> +
> /**
> * ad_mux_machine - handle a port's mux state machine
> * @port: the port we're looking at
> @@ -2061,6 +2072,8 @@ static void ad_enable_collecting_distributing(struct port *port,
> __enable_port(port);
> /* Slave array needs update */
> *update_slave_arr = true;
> + /* Should notify peers if possible */
> + ad_cond_set_peer_notif(port);
> }
> }
>
> diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
> index 8ee26ddddbc8..70bb1e33cee2 100644
> --- a/drivers/net/bonding/bond_main.c
> +++ b/drivers/net/bonding/bond_main.c
> @@ -1243,17 +1243,28 @@ static struct slave *bond_find_best_slave(struct bonding *bond)
> /* must be called in RCU critical section or with RTNL held */
> static bool bond_should_notify_peers(struct bonding *bond)
> {
> - struct slave *slave = rcu_dereference_rtnl(bond->curr_active_slave);
> + struct bond_up_slave *usable;
> + struct slave *slave = NULL;
>
> - if (!slave || !bond->send_peer_notif ||
> + if (!bond->send_peer_notif ||
> bond->send_peer_notif %
> max(1, bond->params.peer_notif_delay) != 0 ||
> - !netif_carrier_ok(bond->dev) ||
> + !netif_carrier_ok(bond->dev))> return false;
>
> + if (BOND_MODE(bond) == BOND_MODE_8023AD) {
> + usable = rtnl_dereference(bond->usable_slaves);
If mii monitor is enabled in 802.3ad bond mode then this will be
dereferenced in bond_mii_monitor() with rcu, so you'd have to use
rcu_dereference_rtnl() instead.
> + if (!usable || !READ_ONCE(usable->count))
> + return false;
> + } else {
> + slave = rcu_dereference_rtnl(bond->curr_active_slave);
> + if (!slave || test_bit(__LINK_STATE_LINKWATCH_PENDING,
> + &slave->dev->state))
> + return false;
> + }
> +
> netdev_dbg(bond->dev, "bond_should_notify_peers: slave %s\n",
> - slave ? slave->dev->name : "NULL");
> + slave ? slave->dev->name : "all");
>
> return true;
> }
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH net-next v2 1/4] net: bonding: add broadcast_neighbor option for 802.3ad
2025-05-13 10:18 ` Nikolay Aleksandrov
@ 2025-05-13 12:13 ` Tonghao Zhang
2025-05-13 12:20 ` Nikolay Aleksandrov
0 siblings, 1 reply; 12+ messages in thread
From: Tonghao Zhang @ 2025-05-13 12:13 UTC (permalink / raw)
To: Nikolay Aleksandrov, jv
Cc: netdev, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Simon Horman, Jonathan Corbet, Andrew Lunn,
Zengbing Tu
> 2025年5月13日 18:18,Nikolay Aleksandrov <razor@blackwall.org> 写道:
>
> On 5/13/25 12:47, Tonghao Zhang wrote:
>> Stacking technology is a type of technology used to expand ports on
>> Ethernet switches. It is widely used as a common access method in
>> large-scale Internet data center architectures. Years of practice
>> have proved that stacking technology has advantages and disadvantages
>> in high-reliability network architecture scenarios. For instance,
>> in stacking networking arch, conventional switch system upgrades
>> require multiple stacked devices to restart at the same time.
>> Therefore, it is inevitable that the business will be interrupted
>> for a while. It is for this reason that "no-stacking" in data centers
>> has become a trend. Additionally, when the stacking link connecting
>> the switches fails or is abnormal, the stack will split. Although it is
>> not common, it still happens in actual operation. The problem is that
>> after the split, it is equivalent to two switches with the same configuration
>> appearing in the network, causing network configuration conflicts and
>> ultimately interrupting the services carried by the stacking system.
>>
>> To improve network stability, "non-stacking" solutions have been increasingly
>> adopted, particularly by public cloud providers and tech companies
>> like Alibaba, Tencent, and Didi. "non-stacking" is a method of mimicing switch
>> stacking that convinces a LACP peer, bonding in this case, connected to a set of
>> "non-stacked" switches that all of its ports are connected to a single
>> switch (i.e., LACP aggregator), as if those switches were stacked. This
>> enables the LACP peer's ports to aggregate together, and requires (a)
>> special switch configuration, described in the linked article, and (b)
>> modifications to the bonding 802.3ad (LACP) mode to send all ARP / ND
>> packets across all ports of the active aggregator.
>>
>> ----------- -----------
>> | switch1 | | switch2 |
>> ----------- -----------
>> ^ ^
>> | |
>> ---------------
>> | bond4 lacp |
>> ---------------
>> | NIC1 | NIC2
>> ---------------------
>> | server |
>> ---------------------
>>
>> - https://www.ruijie.com/fr-fr/support/tech-gallery/de-stack-data-center-network-architecture/
>>
>> Cc: Jay Vosburgh <jv@jvosburgh.net>
>> Cc: "David S. Miller" <davem@davemloft.net>
>> Cc: Eric Dumazet <edumazet@google.com>
>> Cc: Jakub Kicinski <kuba@kernel.org>
>> Cc: Paolo Abeni <pabeni@redhat.com>
>> Cc: Simon Horman <horms@kernel.org>
>> Cc: Jonathan Corbet <corbet@lwn.net>
>> Cc: Andrew Lunn <andrew+netdev@lunn.ch>
>> Signed-off-by: Tonghao Zhang <tonghao@bamaicloud.com>
>> Signed-off-by: Zengbing Tu <tuzengbing@didiglobal.com>
>> ---
>> Documentation/networking/bonding.rst | 6 +++++
>> drivers/net/bonding/bond_main.c | 39 ++++++++++++++++++++++++++++
>> drivers/net/bonding/bond_options.c | 34 ++++++++++++++++++++++++
>> drivers/net/bonding/bond_sysfs.c | 18 +++++++++++++
>> include/net/bond_options.h | 1 +
>> include/net/bonding.h | 3 +++
>> 6 files changed, 101 insertions(+)
>>
>> diff --git a/Documentation/networking/bonding.rst b/Documentation/networking/bonding.rst
>> index a4c1291d2561..14f7593d888d 100644
>> --- a/Documentation/networking/bonding.rst
>> +++ b/Documentation/networking/bonding.rst
>> @@ -562,6 +562,12 @@ lacp_rate
>>
>> The default is slow.
>>
>> +broadcast_neighbor
>> +
>> + Option specifying whether to broadcast ARP/ND packets to all
>> + active slaves. This option has no effect in modes other than
>> + 802.3ad mode. The default is off (0).
>> +
>> max_bonds
>>
>> Specifies the number of bonding devices to create for this
>> diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
>> index d05226484c64..8ee26ddddbc8 100644
>> --- a/drivers/net/bonding/bond_main.c
>> +++ b/drivers/net/bonding/bond_main.c
>> @@ -212,6 +212,9 @@ atomic_t netpoll_block_tx = ATOMIC_INIT(0);
>>
>> unsigned int bond_net_id __read_mostly;
>>
>> +DEFINE_STATIC_KEY_FALSE(bond_bcast_neigh_enabled);
>> +EXPORT_SYMBOL_GPL(bond_bcast_neigh_enabled);
>
> No need to export the symbol, you can add bond helpers to inc/dec it.
>
>> +
>> static const struct flow_dissector_key flow_keys_bonding_keys[] = {
>> {
>> .key_id = FLOW_DISSECTOR_KEY_CONTROL,
>> @@ -4480,6 +4483,9 @@ static int bond_close(struct net_device *bond_dev)
>> bond_alb_deinitialize(bond);
>> bond->recv_probe = NULL;
>>
>> + if (bond->params.broadcast_neighbor)
>> + static_branch_dec(&bond_bcast_neigh_enabled);
>> +
>
> This branch doesn't get re-enabled if the bond is brought up afterwards.
This is not right place to dec bond_bcast_neigh_enabled, I should dec this value in bond_uninit(). Because we can destroy a bond net device which broadcast_neighbor enabled.
If we don’t check broadcast_neighbor in destroy path. bond_bcast_neigh_enabled always is enabled. For example:
ip link add bondx type bond mode 802.3ad ... broadcast_neighbor on
ip link add bondy type bond mode 802.3ad ... broadcast_neighbor off
ip li del dev bondx
In this case, bond_bcast_neigh_enabled is enabled for bondy while broadcast_neighbor is off.
>
>> if (bond_uses_primary(bond)) {
>> rcu_read_lock();
>> slave = rcu_dereference(bond->curr_active_slave);
>> @@ -5316,6 +5322,35 @@ static struct slave *bond_xdp_xmit_3ad_xor_slave_get(struct bonding *bond,
>> return slaves->arr[hash % count];
>> }
>>
>> +static inline bool bond_should_broadcast_neighbor(struct sk_buff *skb,
>
> don't use inline in .c files
As suggested by Jay, inline the codes for performance. I think it is better to keep inline. By the way, there are many inline function in bond_main.c and other *.c
>> + struct net_device *dev)
>> +{
>> + struct bonding *bond = netdev_priv(dev);
>> +
>> + if (!static_branch_unlikely(&bond_bcast_neigh_enabled))
>> + return false;
>> +
>> + if (!bond->params.broadcast_neighbor)
>> + return false;
>> +
>> + if (skb->protocol == htons(ETH_P_ARP))
>> + return true;
>> +
>> + if (skb->protocol == htons(ETH_P_IPV6) &&
>> + pskb_may_pull(skb,
>> + sizeof(struct ipv6hdr) + sizeof(struct icmp6hdr))) {
>> + if (ipv6_hdr(skb)->nexthdr == IPPROTO_ICMPV6) {
>> + struct icmp6hdr *icmph = icmp6_hdr(skb);
>> +
>> + if ((icmph->icmp6_type == NDISC_NEIGHBOUR_SOLICITATION) ||
>> + (icmph->icmp6_type == NDISC_NEIGHBOUR_ADVERTISEMENT))
>> + return true;
>> + }
>> + }
>> +
>> + return false;
>> +}
>> +
>> /* Use this Xmit function for 3AD as well as XOR modes. The current
>> * usable slave array is formed in the control path. The xmit function
>> * just calculates hash and sends the packet out.
>> @@ -5583,6 +5618,9 @@ static netdev_tx_t __bond_start_xmit(struct sk_buff *skb, struct net_device *dev
>> case BOND_MODE_ACTIVEBACKUP:
>> return bond_xmit_activebackup(skb, dev);
>> case BOND_MODE_8023AD:
>> + if (bond_should_broadcast_neighbor(skb, dev))
>> + return bond_xmit_broadcast(skb, dev);
>> + fallthrough;
>> case BOND_MODE_XOR:
>> return bond_3ad_xor_xmit(skb, dev);
>> case BOND_MODE_BROADCAST:
>> @@ -6462,6 +6500,7 @@ static int __init bond_check_params(struct bond_params *params)
>> eth_zero_addr(params->ad_actor_system);
>> params->ad_user_port_key = ad_user_port_key;
>> params->coupled_control = 1;
>> + params->broadcast_neighbor = 0;
>> if (packets_per_slave > 0) {
>> params->reciprocal_packets_per_slave =
>> reciprocal_value(packets_per_slave);
>> diff --git a/drivers/net/bonding/bond_options.c b/drivers/net/bonding/bond_options.c
>> index 91893c29b899..dca52d93f513 100644
>> --- a/drivers/net/bonding/bond_options.c
>> +++ b/drivers/net/bonding/bond_options.c
>> @@ -87,6 +87,8 @@ static int bond_option_missed_max_set(struct bonding *bond,
>> const struct bond_opt_value *newval);
>> static int bond_option_coupled_control_set(struct bonding *bond,
>> const struct bond_opt_value *newval);
>> +static int bond_option_broadcast_neigh_set(struct bonding *bond,
>> + const struct bond_opt_value *newval);
>>
>> static const struct bond_opt_value bond_mode_tbl[] = {
>> { "balance-rr", BOND_MODE_ROUNDROBIN, BOND_VALFLAG_DEFAULT},
>> @@ -240,6 +242,12 @@ static const struct bond_opt_value bond_coupled_control_tbl[] = {
>> { NULL, -1, 0},
>> };
>>
>> +static const struct bond_opt_value bond_broadcast_neigh_tbl[] = {
>> + { "on", 1, 0},
>> + { "off", 0, BOND_VALFLAG_DEFAULT},
>
> I know the option above is using this order, but it is a bit counter-intuitive to
> have their places reversed wrt their values, could you please re-order these as
> the other bond on/off options? This is a small nit, I don't have a strong preference
> but it is more intuitive to have them in their value order. :)
Ok
>
>> + { NULL, -1, 0}> +};
>> +
>> static const struct bond_option bond_opts[BOND_OPT_LAST] = {
>> [BOND_OPT_MODE] = {
>> .id = BOND_OPT_MODE,
>> @@ -513,6 +521,14 @@ static const struct bond_option bond_opts[BOND_OPT_LAST] = {
>> .flags = BOND_OPTFLAG_IFDOWN,
>> .values = bond_coupled_control_tbl,
>> .set = bond_option_coupled_control_set,
>> + },
>> + [BOND_OPT_BROADCAST_NEIGH] = {
>> + .id = BOND_OPT_BROADCAST_NEIGH,
>> + .name = "broadcast_neighbor",
>> + .desc = "Broadcast neighbor packets to all slaves",
>> + .unsuppmodes = BOND_MODE_ALL_EX(BIT(BOND_MODE_8023AD)),
>> + .values = bond_broadcast_neigh_tbl,
>> + .set = bond_option_broadcast_neigh_set,
>> }
>> };
>>
>> @@ -1840,3 +1856,21 @@ static int bond_option_coupled_control_set(struct bonding *bond,
>> bond->params.coupled_control = newval->value;
>> return 0;
>> }
>> +
>> +static int bond_option_broadcast_neigh_set(struct bonding *bond,
>> + const struct bond_opt_value *newval)
>> +{
>> + netdev_dbg(bond->dev, "Setting broadcast_neighbor to %llu\n",
>> + newval->value);
>> +
>> + if (bond->params.broadcast_neighbor == newval->value)
>> + return 0;
>> +
>> + bond->params.broadcast_neighbor = newval->value;
>> + if (bond->params.broadcast_neighbor)
>> + static_branch_inc(&bond_bcast_neigh_enabled);
>> + else
>> + static_branch_dec(&bond_bcast_neigh_enabled);
>
> If the bond has been brought down then the branch has been already decremented.
> You'll have to synchronize this with bond open/close or alternatively mark the option
> as being able to be changed only when the bond is up (there is an option flag for that).
I will check bond_bcast_neigh_enabled in bond_unint() instead of in bond_close().
>
>
>> +
>> + return 0;
>> +}
>> diff --git a/drivers/net/bonding/bond_sysfs.c b/drivers/net/bonding/bond_sysfs.c
>> index 1e13bb170515..4a53850b2c68 100644
>> --- a/drivers/net/bonding/bond_sysfs.c
>> +++ b/drivers/net/bonding/bond_sysfs.c
>> @@ -752,6 +752,23 @@ static ssize_t bonding_show_ad_user_port_key(struct device *d,
>> static DEVICE_ATTR(ad_user_port_key, 0644,
>> bonding_show_ad_user_port_key, bonding_sysfs_store_option);
>>
>> +static ssize_t bonding_show_broadcast_neighbor(struct device *d,
>> + struct device_attribute *attr,
>> + char *buf)
>> +{
>> + struct bonding *bond = to_bond(d);
>> + const struct bond_opt_value *val;
>> +
>> + val = bond_opt_get_val(BOND_OPT_BROADCAST_NEIGH,
>> + bond->params.broadcast_neighbor);
>> +
>> + return sysfs_emit(buf, "%s %d\n", val->string,
>> + bond->params.broadcast_neighbor);
>> +}
>> +
>> +static DEVICE_ATTR(broadcast_neighbor, 0644,
>> + bonding_show_broadcast_neighbor, bonding_sysfs_store_option);
>> +
>
> sysfs options are deprecated, please don't extend sysfs
> netlink is the preferred way for new options
I think it is still useful to config option via sysfs, and I find other new option still use the sysfs.
>
>> static struct attribute *per_bond_attrs[] = {
>> &dev_attr_slaves.attr,
>> &dev_attr_mode.attr,
>> @@ -791,6 +808,7 @@ static struct attribute *per_bond_attrs[] = {
>> &dev_attr_ad_actor_system.attr,
>> &dev_attr_ad_user_port_key.attr,
>> &dev_attr_arp_missed_max.attr,
>> + &dev_attr_broadcast_neighbor.attr,
>> NULL,
>> };
>>
>> diff --git a/include/net/bond_options.h b/include/net/bond_options.h
>> index 18687ccf0638..022b122a9fb6 100644
>> --- a/include/net/bond_options.h
>> +++ b/include/net/bond_options.h
>> @@ -77,6 +77,7 @@ enum {
>> BOND_OPT_NS_TARGETS,
>> BOND_OPT_PRIO,
>> BOND_OPT_COUPLED_CONTROL,
>> + BOND_OPT_BROADCAST_NEIGH,
>> BOND_OPT_LAST
>> };
>>
>> diff --git a/include/net/bonding.h b/include/net/bonding.h
>> index 95f67b308c19..e06f0d63b2c1 100644
>> --- a/include/net/bonding.h
>> +++ b/include/net/bonding.h
>> @@ -115,6 +115,8 @@ static inline int is_netpoll_tx_blocked(struct net_device *dev)
>> #define is_netpoll_tx_blocked(dev) (0)
>> #endif
>>
>> +DECLARE_STATIC_KEY_FALSE(bond_bcast_neigh_enabled);
>> +
>> struct bond_params {
>> int mode;
>> int xmit_policy;
>> @@ -149,6 +151,7 @@ struct bond_params {
>> struct in6_addr ns_targets[BOND_MAX_NS_TARGETS];
>> #endif
>> int coupled_control;
>> + int broadcast_neighbor;
>>
>> /* 2 bytes of padding : see ether_addr_equal_64bits() */
>> u8 ad_actor_system[ETH_ALEN + 2];
>
>
>
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH net-next v2 1/4] net: bonding: add broadcast_neighbor option for 802.3ad
2025-05-13 12:13 ` Tonghao Zhang
@ 2025-05-13 12:20 ` Nikolay Aleksandrov
2025-05-13 13:30 ` Jay Vosburgh
0 siblings, 1 reply; 12+ messages in thread
From: Nikolay Aleksandrov @ 2025-05-13 12:20 UTC (permalink / raw)
To: Tonghao Zhang, jv
Cc: netdev, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Simon Horman, Jonathan Corbet, Andrew Lunn,
Zengbing Tu
On 5/13/25 15:13, Tonghao Zhang wrote:
>
>
>> 2025年5月13日 18:18,Nikolay Aleksandrov <razor@blackwall.org> 写道:
>>
>> On 5/13/25 12:47, Tonghao Zhang wrote:
>>> Stacking technology is a type of technology used to expand ports on
>>> Ethernet switches. It is widely used as a common access method in
>>> large-scale Internet data center architectures. Years of practice
>>> have proved that stacking technology has advantages and disadvantages
>>> in high-reliability network architecture scenarios. For instance,
>>> in stacking networking arch, conventional switch system upgrades
>>> require multiple stacked devices to restart at the same time.
>>> Therefore, it is inevitable that the business will be interrupted
>>> for a while. It is for this reason that "no-stacking" in data centers
>>> has become a trend. Additionally, when the stacking link connecting
>>> the switches fails or is abnormal, the stack will split. Although it is
>>> not common, it still happens in actual operation. The problem is that
>>> after the split, it is equivalent to two switches with the same configuration
>>> appearing in the network, causing network configuration conflicts and
>>> ultimately interrupting the services carried by the stacking system.
>>>
>>> To improve network stability, "non-stacking" solutions have been increasingly
>>> adopted, particularly by public cloud providers and tech companies
>>> like Alibaba, Tencent, and Didi. "non-stacking" is a method of mimicing switch
>>> stacking that convinces a LACP peer, bonding in this case, connected to a set of
>>> "non-stacked" switches that all of its ports are connected to a single
>>> switch (i.e., LACP aggregator), as if those switches were stacked. This
>>> enables the LACP peer's ports to aggregate together, and requires (a)
>>> special switch configuration, described in the linked article, and (b)
>>> modifications to the bonding 802.3ad (LACP) mode to send all ARP / ND
>>> packets across all ports of the active aggregator.
>>>
>>> ----------- -----------
>>> | switch1 | | switch2 |
>>> ----------- -----------
>>> ^ ^
>>> | |
>>> ---------------
>>> | bond4 lacp |
>>> ---------------
>>> | NIC1 | NIC2
>>> ---------------------
>>> | server |
>>> ---------------------
>>>
>>> - https://www.ruijie.com/fr-fr/support/tech-gallery/de-stack-data-center-network-architecture/
>>>
>>> Cc: Jay Vosburgh <jv@jvosburgh.net>
>>> Cc: "David S. Miller" <davem@davemloft.net>
>>> Cc: Eric Dumazet <edumazet@google.com>
>>> Cc: Jakub Kicinski <kuba@kernel.org>
>>> Cc: Paolo Abeni <pabeni@redhat.com>
>>> Cc: Simon Horman <horms@kernel.org>
>>> Cc: Jonathan Corbet <corbet@lwn.net>
>>> Cc: Andrew Lunn <andrew+netdev@lunn.ch>
>>> Signed-off-by: Tonghao Zhang <tonghao@bamaicloud.com>
>>> Signed-off-by: Zengbing Tu <tuzengbing@didiglobal.com>
>>> ---
>>> Documentation/networking/bonding.rst | 6 +++++
>>> drivers/net/bonding/bond_main.c | 39 ++++++++++++++++++++++++++++
>>> drivers/net/bonding/bond_options.c | 34 ++++++++++++++++++++++++
>>> drivers/net/bonding/bond_sysfs.c | 18 +++++++++++++
>>> include/net/bond_options.h | 1 +
>>> include/net/bonding.h | 3 +++
>>> 6 files changed, 101 insertions(+)
>>>
>>> diff --git a/Documentation/networking/bonding.rst b/Documentation/networking/bonding.rst
>>> index a4c1291d2561..14f7593d888d 100644
>>> --- a/Documentation/networking/bonding.rst
>>> +++ b/Documentation/networking/bonding.rst
>>> @@ -562,6 +562,12 @@ lacp_rate
>>>
>>> The default is slow.
>>>
>>> +broadcast_neighbor
>>> +
>>> + Option specifying whether to broadcast ARP/ND packets to all
>>> + active slaves. This option has no effect in modes other than
>>> + 802.3ad mode. The default is off (0).
>>> +
>>> max_bonds
>>>
>>> Specifies the number of bonding devices to create for this
>>> diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
>>> index d05226484c64..8ee26ddddbc8 100644
>>> --- a/drivers/net/bonding/bond_main.c
>>> +++ b/drivers/net/bonding/bond_main.c
>>> @@ -212,6 +212,9 @@ atomic_t netpoll_block_tx = ATOMIC_INIT(0);
>>>
>>> unsigned int bond_net_id __read_mostly;
>>>
>>> +DEFINE_STATIC_KEY_FALSE(bond_bcast_neigh_enabled);
>>> +EXPORT_SYMBOL_GPL(bond_bcast_neigh_enabled);
>>
>> No need to export the symbol, you can add bond helpers to inc/dec it.
>>
>>> +
>>> static const struct flow_dissector_key flow_keys_bonding_keys[] = {
>>> {
>>> .key_id = FLOW_DISSECTOR_KEY_CONTROL,
>>> @@ -4480,6 +4483,9 @@ static int bond_close(struct net_device *bond_dev)
>>> bond_alb_deinitialize(bond);
>>> bond->recv_probe = NULL;
>>>
>>> + if (bond->params.broadcast_neighbor)
>>> + static_branch_dec(&bond_bcast_neigh_enabled);
>>> +
>>
>> This branch doesn't get re-enabled if the bond is brought up afterwards.
> This is not right place to dec bond_bcast_neigh_enabled, I should dec this value in bond_uninit(). Because we can destroy a bond net device which broadcast_neighbor enabled.
> If we don’t check broadcast_neighbor in destroy path. bond_bcast_neigh_enabled always is enabled. For example:
> ip link add bondx type bond mode 802.3ad ... broadcast_neighbor on
> ip link add bondy type bond mode 802.3ad ... broadcast_neighbor off
>
> ip li del dev bondx
> In this case, bond_bcast_neigh_enabled is enabled for bondy while broadcast_neighbor is off.
>
>>>>> if (bond_uses_primary(bond)) {
>>> rcu_read_lock();
>>> slave = rcu_dereference(bond->curr_active_slave);
>>> @@ -5316,6 +5322,35 @@ static struct slave *bond_xdp_xmit_3ad_xor_slave_get(struct bonding *bond,
>>> return slaves->arr[hash % count];
>>> }
>>>
>>> +static inline bool bond_should_broadcast_neighbor(struct sk_buff *skb,
>>
>> don't use inline in .c files
> As suggested by Jay, inline the codes for performance. I think it is better to keep inline. By the way, there are many inline function in bond_main.c and other *.c
This is a general rule, the compiler knows what to do. The inlines in bond_main are old
and can be removed. Do not add new ones.
>>> + struct net_device *dev)
>>> +{
>>> + struct bonding *bond = netdev_priv(dev);
>>> +
>>> + if (!static_branch_unlikely(&bond_bcast_neigh_enabled))
>>> + return false;
>>> +
>>> + if (!bond->params.broadcast_neighbor)
>>> + return false;
>>> +
>>> + if (skb->protocol == htons(ETH_P_ARP))
>>> + return true;
>>> +
>>> + if (skb->protocol == htons(ETH_P_IPV6) &&
>>> + pskb_may_pull(skb,
>>> + sizeof(struct ipv6hdr) + sizeof(struct icmp6hdr))) {
>>> + if (ipv6_hdr(skb)->nexthdr == IPPROTO_ICMPV6) {
>>> + struct icmp6hdr *icmph = icmp6_hdr(skb);
>>> +
>>> + if ((icmph->icmp6_type == NDISC_NEIGHBOUR_SOLICITATION) ||
>>> + (icmph->icmp6_type == NDISC_NEIGHBOUR_ADVERTISEMENT))
>>> + return true;
>>> + }
>>> + }
>>> +
>>> + return false;
>>> +}
>>> +
>>> /* Use this Xmit function for 3AD as well as XOR modes. The current
>>> * usable slave array is formed in the control path. The xmit function
>>> * just calculates hash and sends the packet out.
>>> @@ -5583,6 +5618,9 @@ static netdev_tx_t __bond_start_xmit(struct sk_buff *skb, struct net_device *dev
>>> case BOND_MODE_ACTIVEBACKUP:
>>> return bond_xmit_activebackup(skb, dev);
>>> case BOND_MODE_8023AD:
>>> + if (bond_should_broadcast_neighbor(skb, dev))
>>> + return bond_xmit_broadcast(skb, dev);
>>> + fallthrough;
>>> case BOND_MODE_XOR:
>>> return bond_3ad_xor_xmit(skb, dev);
>>> case BOND_MODE_BROADCAST:
>>> @@ -6462,6 +6500,7 @@ static int __init bond_check_params(struct bond_params *params)
>>> eth_zero_addr(params->ad_actor_system);
>>> params->ad_user_port_key = ad_user_port_key;
>>> params->coupled_control = 1;
>>> + params->broadcast_neighbor = 0;
>>> if (packets_per_slave > 0) {
>>> params->reciprocal_packets_per_slave =
>>> reciprocal_value(packets_per_slave);
>>> diff --git a/drivers/net/bonding/bond_options.c b/drivers/net/bonding/bond_options.c
>>> index 91893c29b899..dca52d93f513 100644
>>> --- a/drivers/net/bonding/bond_options.c
>>> +++ b/drivers/net/bonding/bond_options.c
>>> @@ -87,6 +87,8 @@ static int bond_option_missed_max_set(struct bonding *bond,
>>> const struct bond_opt_value *newval);
>>> static int bond_option_coupled_control_set(struct bonding *bond,
>>> const struct bond_opt_value *newval);
>>> +static int bond_option_broadcast_neigh_set(struct bonding *bond,
>>> + const struct bond_opt_value *newval);
>>>
>>> static const struct bond_opt_value bond_mode_tbl[] = {
>>> { "balance-rr", BOND_MODE_ROUNDROBIN, BOND_VALFLAG_DEFAULT},
>>> @@ -240,6 +242,12 @@ static const struct bond_opt_value bond_coupled_control_tbl[] = {
>>> { NULL, -1, 0},
>>> };
>>>
>>> +static const struct bond_opt_value bond_broadcast_neigh_tbl[] = {
>>> + { "on", 1, 0},
>>> + { "off", 0, BOND_VALFLAG_DEFAULT},
>>
>> I know the option above is using this order, but it is a bit counter-intuitive to
>> have their places reversed wrt their values, could you please re-order these as
>> the other bond on/off options? This is a small nit, I don't have a strong preference
>> but it is more intuitive to have them in their value order. :)
> Ok
>>
>>> + { NULL, -1, 0}> +};
>>> +
>>> static const struct bond_option bond_opts[BOND_OPT_LAST] = {
>>> [BOND_OPT_MODE] = {
>>> .id = BOND_OPT_MODE,
>>> @@ -513,6 +521,14 @@ static const struct bond_option bond_opts[BOND_OPT_LAST] = {
>>> .flags = BOND_OPTFLAG_IFDOWN,
>>> .values = bond_coupled_control_tbl,
>>> .set = bond_option_coupled_control_set,
>>> + },
>>> + [BOND_OPT_BROADCAST_NEIGH] = {
>>> + .id = BOND_OPT_BROADCAST_NEIGH,
>>> + .name = "broadcast_neighbor",
>>> + .desc = "Broadcast neighbor packets to all slaves",
>>> + .unsuppmodes = BOND_MODE_ALL_EX(BIT(BOND_MODE_8023AD)),
>>> + .values = bond_broadcast_neigh_tbl,
>>> + .set = bond_option_broadcast_neigh_set,
>>> }
>>> };
>>>
>>> @@ -1840,3 +1856,21 @@ static int bond_option_coupled_control_set(struct bonding *bond,
>>> bond->params.coupled_control = newval->value;
>>> return 0;
>>> }
>>> +
>>> +static int bond_option_broadcast_neigh_set(struct bonding *bond,
>>> + const struct bond_opt_value *newval)
>>> +{
>>> + netdev_dbg(bond->dev, "Setting broadcast_neighbor to %llu\n",
>>> + newval->value);
>>> +
>>> + if (bond->params.broadcast_neighbor == newval->value)
>>> + return 0;
>>> +
>>> + bond->params.broadcast_neighbor = newval->value;
>>> + if (bond->params.broadcast_neighbor)
>>> + static_branch_inc(&bond_bcast_neigh_enabled);
>>> + else
>>> + static_branch_dec(&bond_bcast_neigh_enabled);
>>
>> If the bond has been brought down then the branch has been already decremented.
>> You'll have to synchronize this with bond open/close or alternatively mark the option
>> as being able to be changed only when the bond is up (there is an option flag for that).
> I will check bond_bcast_neigh_enabled in bond_unint() instead of in bond_close().
>>>>
>>> +
>>> + return 0;
>>> +}
>>> diff --git a/drivers/net/bonding/bond_sysfs.c b/drivers/net/bonding/bond_sysfs.c
>>> index 1e13bb170515..4a53850b2c68 100644
>>> --- a/drivers/net/bonding/bond_sysfs.c
>>> +++ b/drivers/net/bonding/bond_sysfs.c
>>> @@ -752,6 +752,23 @@ static ssize_t bonding_show_ad_user_port_key(struct device *d,
>>> static DEVICE_ATTR(ad_user_port_key, 0644,
>>> bonding_show_ad_user_port_key, bonding_sysfs_store_option);
>>>
>>> +static ssize_t bonding_show_broadcast_neighbor(struct device *d,
>>> + struct device_attribute *attr,
>>> + char *buf)
>>> +{
>>> + struct bonding *bond = to_bond(d);
>>> + const struct bond_opt_value *val;
>>> +
>>> + val = bond_opt_get_val(BOND_OPT_BROADCAST_NEIGH,
>>> + bond->params.broadcast_neighbor);
>>> +
>>> + return sysfs_emit(buf, "%s %d\n", val->string,
>>> + bond->params.broadcast_neighbor);
>>> +}
>>> +
>>> +static DEVICE_ATTR(broadcast_neighbor, 0644,
>>> + bonding_show_broadcast_neighbor, bonding_sysfs_store_option);
>>> +
>>
>> sysfs options are deprecated, please don't extend sysfs
>> netlink is the preferred way for new options
> I think it is still useful to config option via sysfs, and I find other new option still use the sysfs.
This is wrong, there are no new options that have been added to sysfs recently,
the latest option being "coupled_control". As I already said - sysfs has been deprecated
for quite some time, don't add new options to it.
>>
>>> static struct attribute *per_bond_attrs[] = {
>>> &dev_attr_slaves.attr,
>>> &dev_attr_mode.attr,
>>> @@ -791,6 +808,7 @@ static struct attribute *per_bond_attrs[] = {
>>> &dev_attr_ad_actor_system.attr,
>>> &dev_attr_ad_user_port_key.attr,
>>> &dev_attr_arp_missed_max.attr,
>>> + &dev_attr_broadcast_neighbor.attr,
>>> NULL,
>>> };
>>>
>>> diff --git a/include/net/bond_options.h b/include/net/bond_options.h
>>> index 18687ccf0638..022b122a9fb6 100644
>>> --- a/include/net/bond_options.h
>>> +++ b/include/net/bond_options.h
>>> @@ -77,6 +77,7 @@ enum {
>>> BOND_OPT_NS_TARGETS,
>>> BOND_OPT_PRIO,
>>> BOND_OPT_COUPLED_CONTROL,
>>> + BOND_OPT_BROADCAST_NEIGH,
>>> BOND_OPT_LAST
>>> };
>>>
>>> diff --git a/include/net/bonding.h b/include/net/bonding.h
>>> index 95f67b308c19..e06f0d63b2c1 100644
>>> --- a/include/net/bonding.h
>>> +++ b/include/net/bonding.h
>>> @@ -115,6 +115,8 @@ static inline int is_netpoll_tx_blocked(struct net_device *dev)
>>> #define is_netpoll_tx_blocked(dev) (0)
>>> #endif
>>>
>>> +DECLARE_STATIC_KEY_FALSE(bond_bcast_neigh_enabled);
>>> +
>>> struct bond_params {
>>> int mode;
>>> int xmit_policy;
>>> @@ -149,6 +151,7 @@ struct bond_params {
>>> struct in6_addr ns_targets[BOND_MAX_NS_TARGETS];
>>> #endif
>>> int coupled_control;
>>> + int broadcast_neighbor;
>>>
>>> /* 2 bytes of padding : see ether_addr_equal_64bits() */
>>> u8 ad_actor_system[ETH_ALEN + 2];
>>
>>
>>
>
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH net-next v2 1/4] net: bonding: add broadcast_neighbor option for 802.3ad
2025-05-13 12:20 ` Nikolay Aleksandrov
@ 2025-05-13 13:30 ` Jay Vosburgh
2025-05-14 6:07 ` Tonghao Zhang
0 siblings, 1 reply; 12+ messages in thread
From: Jay Vosburgh @ 2025-05-13 13:30 UTC (permalink / raw)
To: Nikolay Aleksandrov
Cc: Tonghao Zhang, netdev, David S. Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Simon Horman, Jonathan Corbet,
Andrew Lunn, Zengbing Tu
Nikolay Aleksandrov <razor@blackwall.org> wrote:
>On 5/13/25 15:13, Tonghao Zhang wrote:
>>
>>
>>> 2025年5月13日 18:18,Nikolay Aleksandrov <razor@blackwall.org> 写道:
>>>
>>> On 5/13/25 12:47, Tonghao Zhang wrote:
>>>> Stacking technology is a type of technology used to expand ports on
>>>> Ethernet switches. It is widely used as a common access method in
>>>> large-scale Internet data center architectures. Years of practice
>>>> have proved that stacking technology has advantages and disadvantages
>>>> in high-reliability network architecture scenarios. For instance,
>>>> in stacking networking arch, conventional switch system upgrades
>>>> require multiple stacked devices to restart at the same time.
>>>> Therefore, it is inevitable that the business will be interrupted
>>>> for a while. It is for this reason that "no-stacking" in data centers
>>>> has become a trend. Additionally, when the stacking link connecting
>>>> the switches fails or is abnormal, the stack will split. Although it is
>>>> not common, it still happens in actual operation. The problem is that
>>>> after the split, it is equivalent to two switches with the same configuration
>>>> appearing in the network, causing network configuration conflicts and
>>>> ultimately interrupting the services carried by the stacking system.
>>>>
>>>> To improve network stability, "non-stacking" solutions have been increasingly
>>>> adopted, particularly by public cloud providers and tech companies
>>>> like Alibaba, Tencent, and Didi. "non-stacking" is a method of mimicing switch
>>>> stacking that convinces a LACP peer, bonding in this case, connected to a set of
>>>> "non-stacked" switches that all of its ports are connected to a single
>>>> switch (i.e., LACP aggregator), as if those switches were stacked. This
>>>> enables the LACP peer's ports to aggregate together, and requires (a)
>>>> special switch configuration, described in the linked article, and (b)
>>>> modifications to the bonding 802.3ad (LACP) mode to send all ARP / ND
>>>> packets across all ports of the active aggregator.
>>>>
>>>> ----------- -----------
>>>> | switch1 | | switch2 |
>>>> ----------- -----------
>>>> ^ ^
>>>> | |
>>>> ---------------
>>>> | bond4 lacp |
>>>> ---------------
>>>> | NIC1 | NIC2
>>>> ---------------------
>>>> | server |
>>>> ---------------------
>>>>
>>>> - https://www.ruijie.com/fr-fr/support/tech-gallery/de-stack-data-center-network-architecture/
>>>>
>>>> Cc: Jay Vosburgh <jv@jvosburgh.net>
>>>> Cc: "David S. Miller" <davem@davemloft.net>
>>>> Cc: Eric Dumazet <edumazet@google.com>
>>>> Cc: Jakub Kicinski <kuba@kernel.org>
>>>> Cc: Paolo Abeni <pabeni@redhat.com>
>>>> Cc: Simon Horman <horms@kernel.org>
>>>> Cc: Jonathan Corbet <corbet@lwn.net>
>>>> Cc: Andrew Lunn <andrew+netdev@lunn.ch>
>>>> Signed-off-by: Tonghao Zhang <tonghao@bamaicloud.com>
>>>> Signed-off-by: Zengbing Tu <tuzengbing@didiglobal.com>
>>>> ---
>>>> Documentation/networking/bonding.rst | 6 +++++
>>>> drivers/net/bonding/bond_main.c | 39 ++++++++++++++++++++++++++++
>>>> drivers/net/bonding/bond_options.c | 34 ++++++++++++++++++++++++
>>>> drivers/net/bonding/bond_sysfs.c | 18 +++++++++++++
>>>> include/net/bond_options.h | 1 +
>>>> include/net/bonding.h | 3 +++
>>>> 6 files changed, 101 insertions(+)
>>>>
>>>> diff --git a/Documentation/networking/bonding.rst b/Documentation/networking/bonding.rst
>>>> index a4c1291d2561..14f7593d888d 100644
>>>> --- a/Documentation/networking/bonding.rst
>>>> +++ b/Documentation/networking/bonding.rst
>>>> @@ -562,6 +562,12 @@ lacp_rate
>>>>
>>>> The default is slow.
>>>>
>>>> +broadcast_neighbor
>>>> +
>>>> + Option specifying whether to broadcast ARP/ND packets to all
>>>> + active slaves. This option has no effect in modes other than
>>>> + 802.3ad mode. The default is off (0).
>>>> +
>>>> max_bonds
>>>>
>>>> Specifies the number of bonding devices to create for this
>>>> diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
>>>> index d05226484c64..8ee26ddddbc8 100644
>>>> --- a/drivers/net/bonding/bond_main.c
>>>> +++ b/drivers/net/bonding/bond_main.c
>>>> @@ -212,6 +212,9 @@ atomic_t netpoll_block_tx = ATOMIC_INIT(0);
>>>>
>>>> unsigned int bond_net_id __read_mostly;
>>>>
>>>> +DEFINE_STATIC_KEY_FALSE(bond_bcast_neigh_enabled);
>>>> +EXPORT_SYMBOL_GPL(bond_bcast_neigh_enabled);
>>>
>>> No need to export the symbol, you can add bond helpers to inc/dec it.
Agreed.
>>>> +
>>>> static const struct flow_dissector_key flow_keys_bonding_keys[] = {
>>>> {
>>>> .key_id = FLOW_DISSECTOR_KEY_CONTROL,
>>>> @@ -4480,6 +4483,9 @@ static int bond_close(struct net_device *bond_dev)
>>>> bond_alb_deinitialize(bond);
>>>> bond->recv_probe = NULL;
>>>>
>>>> + if (bond->params.broadcast_neighbor)
>>>> + static_branch_dec(&bond_bcast_neigh_enabled);
>>>> +
>>>
>>> This branch doesn't get re-enabled if the bond is brought up afterwards.
>> This is not right place to dec bond_bcast_neigh_enabled, I should dec this value in bond_uninit(). Because we can destroy a bond net device which broadcast_neighbor enabled.
>> If we don’t check broadcast_neighbor in destroy path. bond_bcast_neigh_enabled always is enabled. For example:
>> ip link add bondx type bond mode 802.3ad ... broadcast_neighbor on
>> ip link add bondy type bond mode 802.3ad ... broadcast_neighbor off
>>
>> ip li del dev bondx
>> In this case, bond_bcast_neigh_enabled is enabled for bondy while broadcast_neighbor is off.
So it sounds like the flow should be:
- increment the static key when:
- option enabled on IFF_UP bond (transition from off to on)
- bond set to up with option enabled (bond_open)
- decrement the static key when:
- option disabled on IFF_UP bond (transition from on to off)
- bond with option enabled is set to down (bond_close)
The unregister path in unregister_netdevice_many_notify that
calls bond_uninit will call dev_close before the ndo_uninit, so I don't
think we need to handle the static key in bond_uninit separately.
>>>>>> if (bond_uses_primary(bond)) {
>>>> rcu_read_lock();
>>>> slave = rcu_dereference(bond->curr_active_slave);
>>>> @@ -5316,6 +5322,35 @@ static struct slave *bond_xdp_xmit_3ad_xor_slave_get(struct bonding *bond,
>>>> return slaves->arr[hash % count];
>>>> }
>>>>
>>>> +static inline bool bond_should_broadcast_neighbor(struct sk_buff *skb,
>>>
>>> don't use inline in .c files
>> As suggested by Jay, inline the codes for performance. I think it is better to keep inline. By the way, there are many inline function in bond_main.c and other *.c
>
>This is a general rule, the compiler knows what to do. The inlines in bond_main are old
>and can be removed. Do not add new ones.
To clarify, the intent of my question was to ask whether or not
the compiler was inlining the existing code in Tonghao's earlier patch.
I did not intend to suggest that an explicit inline should be added.
That said, I agree with Nikolay in that inline should not be
added explicitly.
>>>> + struct net_device *dev)
>>>> +{
>>>> + struct bonding *bond = netdev_priv(dev);
>>>> +
>>>> + if (!static_branch_unlikely(&bond_bcast_neigh_enabled))
>>>> + return false;
>>>> +
>>>> + if (!bond->params.broadcast_neighbor)
>>>> + return false;
>>>> +
>>>> + if (skb->protocol == htons(ETH_P_ARP))
>>>> + return true;
>>>> +
>>>> + if (skb->protocol == htons(ETH_P_IPV6) &&
>>>> + pskb_may_pull(skb,
>>>> + sizeof(struct ipv6hdr) + sizeof(struct icmp6hdr))) {
>>>> + if (ipv6_hdr(skb)->nexthdr == IPPROTO_ICMPV6) {
>>>> + struct icmp6hdr *icmph = icmp6_hdr(skb);
>>>> +
>>>> + if ((icmph->icmp6_type == NDISC_NEIGHBOUR_SOLICITATION) ||
>>>> + (icmph->icmp6_type == NDISC_NEIGHBOUR_ADVERTISEMENT))
>>>> + return true;
>>>> + }
>>>> + }
>>>> +
>>>> + return false;
>>>> +}
>>>> +
>>>> /* Use this Xmit function for 3AD as well as XOR modes. The current
>>>> * usable slave array is formed in the control path. The xmit function
>>>> * just calculates hash and sends the packet out.
>>>> @@ -5583,6 +5618,9 @@ static netdev_tx_t __bond_start_xmit(struct sk_buff *skb, struct net_device *dev
>>>> case BOND_MODE_ACTIVEBACKUP:
>>>> return bond_xmit_activebackup(skb, dev);
>>>> case BOND_MODE_8023AD:
>>>> + if (bond_should_broadcast_neighbor(skb, dev))
>>>> + return bond_xmit_broadcast(skb, dev);
>>>> + fallthrough;
>>>> case BOND_MODE_XOR:
>>>> return bond_3ad_xor_xmit(skb, dev);
>>>> case BOND_MODE_BROADCAST:
>>>> @@ -6462,6 +6500,7 @@ static int __init bond_check_params(struct bond_params *params)
>>>> eth_zero_addr(params->ad_actor_system);
>>>> params->ad_user_port_key = ad_user_port_key;
>>>> params->coupled_control = 1;
>>>> + params->broadcast_neighbor = 0;
>>>> if (packets_per_slave > 0) {
>>>> params->reciprocal_packets_per_slave =
>>>> reciprocal_value(packets_per_slave);
>>>> diff --git a/drivers/net/bonding/bond_options.c b/drivers/net/bonding/bond_options.c
>>>> index 91893c29b899..dca52d93f513 100644
>>>> --- a/drivers/net/bonding/bond_options.c
>>>> +++ b/drivers/net/bonding/bond_options.c
>>>> @@ -87,6 +87,8 @@ static int bond_option_missed_max_set(struct bonding *bond,
>>>> const struct bond_opt_value *newval);
>>>> static int bond_option_coupled_control_set(struct bonding *bond,
>>>> const struct bond_opt_value *newval);
>>>> +static int bond_option_broadcast_neigh_set(struct bonding *bond,
>>>> + const struct bond_opt_value *newval);
>>>>
>>>> static const struct bond_opt_value bond_mode_tbl[] = {
>>>> { "balance-rr", BOND_MODE_ROUNDROBIN, BOND_VALFLAG_DEFAULT},
>>>> @@ -240,6 +242,12 @@ static const struct bond_opt_value bond_coupled_control_tbl[] = {
>>>> { NULL, -1, 0},
>>>> };
>>>>
>>>> +static const struct bond_opt_value bond_broadcast_neigh_tbl[] = {
>>>> + { "on", 1, 0},
>>>> + { "off", 0, BOND_VALFLAG_DEFAULT},
>>>
>>> I know the option above is using this order, but it is a bit counter-intuitive to
>>> have their places reversed wrt their values, could you please re-order these as
>>> the other bond on/off options? This is a small nit, I don't have a strong preference
>>> but it is more intuitive to have them in their value order. :)
>> Ok
>>>
>>>> + { NULL, -1, 0}> +};
>>>> +
>>>> static const struct bond_option bond_opts[BOND_OPT_LAST] = {
>>>> [BOND_OPT_MODE] = {
>>>> .id = BOND_OPT_MODE,
>>>> @@ -513,6 +521,14 @@ static const struct bond_option bond_opts[BOND_OPT_LAST] = {
>>>> .flags = BOND_OPTFLAG_IFDOWN,
>>>> .values = bond_coupled_control_tbl,
>>>> .set = bond_option_coupled_control_set,
>>>> + },
>>>> + [BOND_OPT_BROADCAST_NEIGH] = {
>>>> + .id = BOND_OPT_BROADCAST_NEIGH,
>>>> + .name = "broadcast_neighbor",
>>>> + .desc = "Broadcast neighbor packets to all slaves",
>>>> + .unsuppmodes = BOND_MODE_ALL_EX(BIT(BOND_MODE_8023AD)),
>>>> + .values = bond_broadcast_neigh_tbl,
>>>> + .set = bond_option_broadcast_neigh_set,
>>>> }
>>>> };
>>>>
>>>> @@ -1840,3 +1856,21 @@ static int bond_option_coupled_control_set(struct bonding *bond,
>>>> bond->params.coupled_control = newval->value;
>>>> return 0;
>>>> }
>>>> +
>>>> +static int bond_option_broadcast_neigh_set(struct bonding *bond,
>>>> + const struct bond_opt_value *newval)
>>>> +{
>>>> + netdev_dbg(bond->dev, "Setting broadcast_neighbor to %llu\n",
>>>> + newval->value);
>>>> +
>>>> + if (bond->params.broadcast_neighbor == newval->value)
>>>> + return 0;
>>>> +
>>>> + bond->params.broadcast_neighbor = newval->value;
>>>> + if (bond->params.broadcast_neighbor)
>>>> + static_branch_inc(&bond_bcast_neigh_enabled);
>>>> + else
>>>> + static_branch_dec(&bond_bcast_neigh_enabled);
>>>
>>> If the bond has been brought down then the branch has been already decremented.
>>> You'll have to synchronize this with bond open/close or alternatively mark the option
>>> as being able to be changed only when the bond is up (there is an option flag for that).
>> I will check bond_bcast_neigh_enabled in bond_unint() instead of in bond_close().
>>>>>
>>>> +
>>>> + return 0;
>>>> +}
>>>> diff --git a/drivers/net/bonding/bond_sysfs.c b/drivers/net/bonding/bond_sysfs.c
>>>> index 1e13bb170515..4a53850b2c68 100644
>>>> --- a/drivers/net/bonding/bond_sysfs.c
>>>> +++ b/drivers/net/bonding/bond_sysfs.c
>>>> @@ -752,6 +752,23 @@ static ssize_t bonding_show_ad_user_port_key(struct device *d,
>>>> static DEVICE_ATTR(ad_user_port_key, 0644,
>>>> bonding_show_ad_user_port_key, bonding_sysfs_store_option);
>>>>
>>>> +static ssize_t bonding_show_broadcast_neighbor(struct device *d,
>>>> + struct device_attribute *attr,
>>>> + char *buf)
>>>> +{
>>>> + struct bonding *bond = to_bond(d);
>>>> + const struct bond_opt_value *val;
>>>> +
>>>> + val = bond_opt_get_val(BOND_OPT_BROADCAST_NEIGH,
>>>> + bond->params.broadcast_neighbor);
>>>> +
>>>> + return sysfs_emit(buf, "%s %d\n", val->string,
>>>> + bond->params.broadcast_neighbor);
>>>> +}
>>>> +
>>>> +static DEVICE_ATTR(broadcast_neighbor, 0644,
>>>> + bonding_show_broadcast_neighbor, bonding_sysfs_store_option);
>>>> +
>>>
>>> sysfs options are deprecated, please don't extend sysfs
>>> netlink is the preferred way for new options
>> I think it is still useful to config option via sysfs, and I find other new option still use the sysfs.
>
>This is wrong, there are no new options that have been added to sysfs recently,
>the latest option being "coupled_control". As I already said - sysfs has been deprecated
>for quite some time, don't add new options to it.
Agreed, no new options in bonding's sysfs. We are intentionally
encouraging the use of netlink / iproute for bonding configuration
instead of sysfs.
-J
>>>> static struct attribute *per_bond_attrs[] = {
>>>> &dev_attr_slaves.attr,
>>>> &dev_attr_mode.attr,
>>>> @@ -791,6 +808,7 @@ static struct attribute *per_bond_attrs[] = {
>>>> &dev_attr_ad_actor_system.attr,
>>>> &dev_attr_ad_user_port_key.attr,
>>>> &dev_attr_arp_missed_max.attr,
>>>> + &dev_attr_broadcast_neighbor.attr,
>>>> NULL,
>>>> };
>>>>
>>>> diff --git a/include/net/bond_options.h b/include/net/bond_options.h
>>>> index 18687ccf0638..022b122a9fb6 100644
>>>> --- a/include/net/bond_options.h
>>>> +++ b/include/net/bond_options.h
>>>> @@ -77,6 +77,7 @@ enum {
>>>> BOND_OPT_NS_TARGETS,
>>>> BOND_OPT_PRIO,
>>>> BOND_OPT_COUPLED_CONTROL,
>>>> + BOND_OPT_BROADCAST_NEIGH,
>>>> BOND_OPT_LAST
>>>> };
>>>>
>>>> diff --git a/include/net/bonding.h b/include/net/bonding.h
>>>> index 95f67b308c19..e06f0d63b2c1 100644
>>>> --- a/include/net/bonding.h
>>>> +++ b/include/net/bonding.h
>>>> @@ -115,6 +115,8 @@ static inline int is_netpoll_tx_blocked(struct net_device *dev)
>>>> #define is_netpoll_tx_blocked(dev) (0)
>>>> #endif
>>>>
>>>> +DECLARE_STATIC_KEY_FALSE(bond_bcast_neigh_enabled);
>>>> +
>>>> struct bond_params {
>>>> int mode;
>>>> int xmit_policy;
>>>> @@ -149,6 +151,7 @@ struct bond_params {
>>>> struct in6_addr ns_targets[BOND_MAX_NS_TARGETS];
>>>> #endif
>>>> int coupled_control;
>>>> + int broadcast_neighbor;
>>>>
>>>> /* 2 bytes of padding : see ether_addr_equal_64bits() */
>>>> u8 ad_actor_system[ETH_ALEN + 2];
>>>
>>>
>>>
>>
>
---
-Jay Vosburgh, jv@jvosburgh.net
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH net-next v2 1/4] net: bonding: add broadcast_neighbor option for 802.3ad
2025-05-13 13:30 ` Jay Vosburgh
@ 2025-05-14 6:07 ` Tonghao Zhang
0 siblings, 0 replies; 12+ messages in thread
From: Tonghao Zhang @ 2025-05-14 6:07 UTC (permalink / raw)
To: Jay Vosburgh
Cc: Nikolay Aleksandrov, netdev, David S. Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Simon Horman, Jonathan Corbet,
Andrew Lunn, Zengbing Tu
> 2025年5月13日 21:30,Jay Vosburgh <jv@jvosburgh.net> 写道:
>
> Nikolay Aleksandrov <razor@blackwall.org> wrote:
>
>> On 5/13/25 15:13, Tonghao Zhang wrote:
>>>
>>>
>>>> 2025年5月13日 18:18,Nikolay Aleksandrov <razor@blackwall.org> 写道:
>>>>
>>>> On 5/13/25 12:47, Tonghao Zhang wrote:
>>>>> Stacking technology is a type of technology used to expand ports on
>>>>> Ethernet switches. It is widely used as a common access method in
>>>>> large-scale Internet data center architectures. Years of practice
>>>>> have proved that stacking technology has advantages and disadvantages
>>>>> in high-reliability network architecture scenarios. For instance,
>>>>> in stacking networking arch, conventional switch system upgrades
>>>>> require multiple stacked devices to restart at the same time.
>>>>> Therefore, it is inevitable that the business will be interrupted
>>>>> for a while. It is for this reason that "no-stacking" in data centers
>>>>> has become a trend. Additionally, when the stacking link connecting
>>>>> the switches fails or is abnormal, the stack will split. Although it is
>>>>> not common, it still happens in actual operation. The problem is that
>>>>> after the split, it is equivalent to two switches with the same configuration
>>>>> appearing in the network, causing network configuration conflicts and
>>>>> ultimately interrupting the services carried by the stacking system.
>>>>>
>>>>> To improve network stability, "non-stacking" solutions have been increasingly
>>>>> adopted, particularly by public cloud providers and tech companies
>>>>> like Alibaba, Tencent, and Didi. "non-stacking" is a method of mimicing switch
>>>>> stacking that convinces a LACP peer, bonding in this case, connected to a set of
>>>>> "non-stacked" switches that all of its ports are connected to a single
>>>>> switch (i.e., LACP aggregator), as if those switches were stacked. This
>>>>> enables the LACP peer's ports to aggregate together, and requires (a)
>>>>> special switch configuration, described in the linked article, and (b)
>>>>> modifications to the bonding 802.3ad (LACP) mode to send all ARP / ND
>>>>> packets across all ports of the active aggregator.
>>>>>
>>>>> ----------- -----------
>>>>> | switch1 | | switch2 |
>>>>> ----------- -----------
>>>>> ^ ^
>>>>> | |
>>>>> ---------------
>>>>> | bond4 lacp |
>>>>> ---------------
>>>>> | NIC1 | NIC2
>>>>> ---------------------
>>>>> | server |
>>>>> ---------------------
>>>>>
>>>>> - https://www.ruijie.com/fr-fr/support/tech-gallery/de-stack-data-center-network-architecture/
>>>>>
>>>>> Cc: Jay Vosburgh <jv@jvosburgh.net>
>>>>> Cc: "David S. Miller" <davem@davemloft.net>
>>>>> Cc: Eric Dumazet <edumazet@google.com>
>>>>> Cc: Jakub Kicinski <kuba@kernel.org>
>>>>> Cc: Paolo Abeni <pabeni@redhat.com>
>>>>> Cc: Simon Horman <horms@kernel.org>
>>>>> Cc: Jonathan Corbet <corbet@lwn.net>
>>>>> Cc: Andrew Lunn <andrew+netdev@lunn.ch>
>>>>> Signed-off-by: Tonghao Zhang <tonghao@bamaicloud.com>
>>>>> Signed-off-by: Zengbing Tu <tuzengbing@didiglobal.com>
>>>>> ---
>>>>> Documentation/networking/bonding.rst | 6 +++++
>>>>> drivers/net/bonding/bond_main.c | 39 ++++++++++++++++++++++++++++
>>>>> drivers/net/bonding/bond_options.c | 34 ++++++++++++++++++++++++
>>>>> drivers/net/bonding/bond_sysfs.c | 18 +++++++++++++
>>>>> include/net/bond_options.h | 1 +
>>>>> include/net/bonding.h | 3 +++
>>>>> 6 files changed, 101 insertions(+)
>>>>>
>>>>> diff --git a/Documentation/networking/bonding.rst b/Documentation/networking/bonding.rst
>>>>> index a4c1291d2561..14f7593d888d 100644
>>>>> --- a/Documentation/networking/bonding.rst
>>>>> +++ b/Documentation/networking/bonding.rst
>>>>> @@ -562,6 +562,12 @@ lacp_rate
>>>>>
>>>>> The default is slow.
>>>>>
>>>>> +broadcast_neighbor
>>>>> +
>>>>> + Option specifying whether to broadcast ARP/ND packets to all
>>>>> + active slaves. This option has no effect in modes other than
>>>>> + 802.3ad mode. The default is off (0).
>>>>> +
>>>>> max_bonds
>>>>>
>>>>> Specifies the number of bonding devices to create for this
>>>>> diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
>>>>> index d05226484c64..8ee26ddddbc8 100644
>>>>> --- a/drivers/net/bonding/bond_main.c
>>>>> +++ b/drivers/net/bonding/bond_main.c
>>>>> @@ -212,6 +212,9 @@ atomic_t netpoll_block_tx = ATOMIC_INIT(0);
>>>>>
>>>>> unsigned int bond_net_id __read_mostly;
>>>>>
>>>>> +DEFINE_STATIC_KEY_FALSE(bond_bcast_neigh_enabled);
>>>>> +EXPORT_SYMBOL_GPL(bond_bcast_neigh_enabled);
>>>>
>>>> No need to export the symbol, you can add bond helpers to inc/dec it.
>
> Agreed.
>
>>>>> +
>>>>> static const struct flow_dissector_key flow_keys_bonding_keys[] = {
>>>>> {
>>>>> .key_id = FLOW_DISSECTOR_KEY_CONTROL,
>>>>> @@ -4480,6 +4483,9 @@ static int bond_close(struct net_device *bond_dev)
>>>>> bond_alb_deinitialize(bond);
>>>>> bond->recv_probe = NULL;
>>>>>
>>>>> + if (bond->params.broadcast_neighbor)
>>>>> + static_branch_dec(&bond_bcast_neigh_enabled);
>>>>> +
>>>>
>>>> This branch doesn't get re-enabled if the bond is brought up afterwards.
>>> This is not right place to dec bond_bcast_neigh_enabled, I should dec this value in bond_uninit(). Because we can destroy a bond net device which broadcast_neighbor enabled.
>>> If we don’t check broadcast_neighbor in destroy path. bond_bcast_neigh_enabled always is enabled. For example:
>>> ip link add bondx type bond mode 802.3ad ... broadcast_neighbor on
>>> ip link add bondy type bond mode 802.3ad ... broadcast_neighbor off
>>>
>>> ip li del dev bondx
>>> In this case, bond_bcast_neigh_enabled is enabled for bondy while broadcast_neighbor is off.
>
> So it sounds like the flow should be:
>
> - increment the static key when:
> - option enabled on IFF_UP bond (transition from off to on)
> - bond set to up with option enabled (bond_open)
>
> - decrement the static key when:
> - option disabled on IFF_UP bond (transition from on to off)
> - bond with option enabled is set to down (bond_close)
>
> The unregister path in unregister_netdevice_many_notify that
> calls bond_uninit will call dev_close before the ndo_uninit, so I don't
> think we need to handle the static key in bond_uninit separately.
Thanks Jay, updated in v3
>
>
>>>>>>> if (bond_uses_primary(bond)) {
>>>>> rcu_read_lock();
>>>>> slave = rcu_dereference(bond->curr_active_slave);
>>>>> @@ -5316,6 +5322,35 @@ static struct slave *bond_xdp_xmit_3ad_xor_slave_get(struct bonding *bond,
>>>>> return slaves->arr[hash % count];
>>>>> }
>>>>>
>>>>> +static inline bool bond_should_broadcast_neighbor(struct sk_buff *skb,
>>>>
>>>> don't use inline in .c files
>>> As suggested by Jay, inline the codes for performance. I think it is better to keep inline. By the way, there are many inline function in bond_main.c and other *.c
>>
>> This is a general rule, the compiler knows what to do. The inlines in bond_main are old
>> and can be removed. Do not add new ones.
>
> To clarify, the intent of my question was to ask whether or not
> the compiler was inlining the existing code in Tonghao's earlier patch.
> I did not intend to suggest that an explicit inline should be added.
>
> That said, I agree with Nikolay in that inline should not be
> added explicitly.
Ok
>
>>>>> + struct net_device *dev)
>>>>> +{
>>>>> + struct bonding *bond = netdev_priv(dev);
>>>>> +
>>>>> + if (!static_branch_unlikely(&bond_bcast_neigh_enabled))
>>>>> + return false;
>>>>> +
>>>>> + if (!bond->params.broadcast_neighbor)
>>>>> + return false;
>>>>> +
>>>>> + if (skb->protocol == htons(ETH_P_ARP))
>>>>> + return true;
>>>>> +
>>>>> + if (skb->protocol == htons(ETH_P_IPV6) &&
>>>>> + pskb_may_pull(skb,
>>>>> + sizeof(struct ipv6hdr) + sizeof(struct icmp6hdr))) {
>>>>> + if (ipv6_hdr(skb)->nexthdr == IPPROTO_ICMPV6) {
>>>>> + struct icmp6hdr *icmph = icmp6_hdr(skb);
>>>>> +
>>>>> + if ((icmph->icmp6_type == NDISC_NEIGHBOUR_SOLICITATION) ||
>>>>> + (icmph->icmp6_type == NDISC_NEIGHBOUR_ADVERTISEMENT))
>>>>> + return true;
>>>>> + }
>>>>> + }
>>>>> +
>>>>> + return false;
>>>>> +}
>>>>> +
>>>>> /* Use this Xmit function for 3AD as well as XOR modes. The current
>>>>> * usable slave array is formed in the control path. The xmit function
>>>>> * just calculates hash and sends the packet out.
>>>>> @@ -5583,6 +5618,9 @@ static netdev_tx_t __bond_start_xmit(struct sk_buff *skb, struct net_device *dev
>>>>> case BOND_MODE_ACTIVEBACKUP:
>>>>> return bond_xmit_activebackup(skb, dev);
>>>>> case BOND_MODE_8023AD:
>>>>> + if (bond_should_broadcast_neighbor(skb, dev))
>>>>> + return bond_xmit_broadcast(skb, dev);
>>>>> + fallthrough;
>>>>> case BOND_MODE_XOR:
>>>>> return bond_3ad_xor_xmit(skb, dev);
>>>>> case BOND_MODE_BROADCAST:
>>>>> @@ -6462,6 +6500,7 @@ static int __init bond_check_params(struct bond_params *params)
>>>>> eth_zero_addr(params->ad_actor_system);
>>>>> params->ad_user_port_key = ad_user_port_key;
>>>>> params->coupled_control = 1;
>>>>> + params->broadcast_neighbor = 0;
>>>>> if (packets_per_slave > 0) {
>>>>> params->reciprocal_packets_per_slave =
>>>>> reciprocal_value(packets_per_slave);
>>>>> diff --git a/drivers/net/bonding/bond_options.c b/drivers/net/bonding/bond_options.c
>>>>> index 91893c29b899..dca52d93f513 100644
>>>>> --- a/drivers/net/bonding/bond_options.c
>>>>> +++ b/drivers/net/bonding/bond_options.c
>>>>> @@ -87,6 +87,8 @@ static int bond_option_missed_max_set(struct bonding *bond,
>>>>> const struct bond_opt_value *newval);
>>>>> static int bond_option_coupled_control_set(struct bonding *bond,
>>>>> const struct bond_opt_value *newval);
>>>>> +static int bond_option_broadcast_neigh_set(struct bonding *bond,
>>>>> + const struct bond_opt_value *newval);
>>>>>
>>>>> static const struct bond_opt_value bond_mode_tbl[] = {
>>>>> { "balance-rr", BOND_MODE_ROUNDROBIN, BOND_VALFLAG_DEFAULT},
>>>>> @@ -240,6 +242,12 @@ static const struct bond_opt_value bond_coupled_control_tbl[] = {
>>>>> { NULL, -1, 0},
>>>>> };
>>>>>
>>>>> +static const struct bond_opt_value bond_broadcast_neigh_tbl[] = {
>>>>> + { "on", 1, 0},
>>>>> + { "off", 0, BOND_VALFLAG_DEFAULT},
>>>>
>>>> I know the option above is using this order, but it is a bit counter-intuitive to
>>>> have their places reversed wrt their values, could you please re-order these as
>>>> the other bond on/off options? This is a small nit, I don't have a strong preference
>>>> but it is more intuitive to have them in their value order. :)
>>> Ok
>>>>
>>>>> + { NULL, -1, 0}> +};
>>>>> +
>>>>> static const struct bond_option bond_opts[BOND_OPT_LAST] = {
>>>>> [BOND_OPT_MODE] = {
>>>>> .id = BOND_OPT_MODE,
>>>>> @@ -513,6 +521,14 @@ static const struct bond_option bond_opts[BOND_OPT_LAST] = {
>>>>> .flags = BOND_OPTFLAG_IFDOWN,
>>>>> .values = bond_coupled_control_tbl,
>>>>> .set = bond_option_coupled_control_set,
>>>>> + },
>>>>> + [BOND_OPT_BROADCAST_NEIGH] = {
>>>>> + .id = BOND_OPT_BROADCAST_NEIGH,
>>>>> + .name = "broadcast_neighbor",
>>>>> + .desc = "Broadcast neighbor packets to all slaves",
>>>>> + .unsuppmodes = BOND_MODE_ALL_EX(BIT(BOND_MODE_8023AD)),
>>>>> + .values = bond_broadcast_neigh_tbl,
>>>>> + .set = bond_option_broadcast_neigh_set,
>>>>> }
>>>>> };
>>>>>
>>>>> @@ -1840,3 +1856,21 @@ static int bond_option_coupled_control_set(struct bonding *bond,
>>>>> bond->params.coupled_control = newval->value;
>>>>> return 0;
>>>>> }
>>>>> +
>>>>> +static int bond_option_broadcast_neigh_set(struct bonding *bond,
>>>>> + const struct bond_opt_value *newval)
>>>>> +{
>>>>> + netdev_dbg(bond->dev, "Setting broadcast_neighbor to %llu\n",
>>>>> + newval->value);
>>>>> +
>>>>> + if (bond->params.broadcast_neighbor == newval->value)
>>>>> + return 0;
>>>>> +
>>>>> + bond->params.broadcast_neighbor = newval->value;
>>>>> + if (bond->params.broadcast_neighbor)
>>>>> + static_branch_inc(&bond_bcast_neigh_enabled);
>>>>> + else
>>>>> + static_branch_dec(&bond_bcast_neigh_enabled);
>>>>
>>>> If the bond has been brought down then the branch has been already decremented.
>>>> You'll have to synchronize this with bond open/close or alternatively mark the option
>>>> as being able to be changed only when the bond is up (there is an option flag for that).
>>> I will check bond_bcast_neigh_enabled in bond_unint() instead of in bond_close().
>>>>>>
>>>>> +
>>>>> + return 0;
>>>>> +}
>>>>> diff --git a/drivers/net/bonding/bond_sysfs.c b/drivers/net/bonding/bond_sysfs.c
>>>>> index 1e13bb170515..4a53850b2c68 100644
>>>>> --- a/drivers/net/bonding/bond_sysfs.c
>>>>> +++ b/drivers/net/bonding/bond_sysfs.c
>>>>> @@ -752,6 +752,23 @@ static ssize_t bonding_show_ad_user_port_key(struct device *d,
>>>>> static DEVICE_ATTR(ad_user_port_key, 0644,
>>>>> bonding_show_ad_user_port_key, bonding_sysfs_store_option);
>>>>>
>>>>> +static ssize_t bonding_show_broadcast_neighbor(struct device *d,
>>>>> + struct device_attribute *attr,
>>>>> + char *buf)
>>>>> +{
>>>>> + struct bonding *bond = to_bond(d);
>>>>> + const struct bond_opt_value *val;
>>>>> +
>>>>> + val = bond_opt_get_val(BOND_OPT_BROADCAST_NEIGH,
>>>>> + bond->params.broadcast_neighbor);
>>>>> +
>>>>> + return sysfs_emit(buf, "%s %d\n", val->string,
>>>>> + bond->params.broadcast_neighbor);
>>>>> +}
>>>>> +
>>>>> +static DEVICE_ATTR(broadcast_neighbor, 0644,
>>>>> + bonding_show_broadcast_neighbor, bonding_sysfs_store_option);
>>>>> +
>>>>
>>>> sysfs options are deprecated, please don't extend sysfs
>>>> netlink is the preferred way for new options
>>> I think it is still useful to config option via sysfs, and I find other new option still use the sysfs.
>>
>> This is wrong, there are no new options that have been added to sysfs recently,
>> the latest option being "coupled_control". As I already said - sysfs has been deprecated
>> for quite some time, don't add new options to it.
>
> Agreed, no new options in bonding's sysfs. We are intentionally
> encouraging the use of netlink / iproute for bonding configuration
> instead of sysfs.
sysfs option of this patch, will be removed in v3.
>
> -J
>
>>>>> static struct attribute *per_bond_attrs[] = {
>>>>> &dev_attr_slaves.attr,
>>>>> &dev_attr_mode.attr,
>>>>> @@ -791,6 +808,7 @@ static struct attribute *per_bond_attrs[] = {
>>>>> &dev_attr_ad_actor_system.attr,
>>>>> &dev_attr_ad_user_port_key.attr,
>>>>> &dev_attr_arp_missed_max.attr,
>>>>> + &dev_attr_broadcast_neighbor.attr,
>>>>> NULL,
>>>>> };
>>>>>
>>>>> diff --git a/include/net/bond_options.h b/include/net/bond_options.h
>>>>> index 18687ccf0638..022b122a9fb6 100644
>>>>> --- a/include/net/bond_options.h
>>>>> +++ b/include/net/bond_options.h
>>>>> @@ -77,6 +77,7 @@ enum {
>>>>> BOND_OPT_NS_TARGETS,
>>>>> BOND_OPT_PRIO,
>>>>> BOND_OPT_COUPLED_CONTROL,
>>>>> + BOND_OPT_BROADCAST_NEIGH,
>>>>> BOND_OPT_LAST
>>>>> };
>>>>>
>>>>> diff --git a/include/net/bonding.h b/include/net/bonding.h
>>>>> index 95f67b308c19..e06f0d63b2c1 100644
>>>>> --- a/include/net/bonding.h
>>>>> +++ b/include/net/bonding.h
>>>>> @@ -115,6 +115,8 @@ static inline int is_netpoll_tx_blocked(struct net_device *dev)
>>>>> #define is_netpoll_tx_blocked(dev) (0)
>>>>> #endif
>>>>>
>>>>> +DECLARE_STATIC_KEY_FALSE(bond_bcast_neigh_enabled);
>>>>> +
>>>>> struct bond_params {
>>>>> int mode;
>>>>> int xmit_policy;
>>>>> @@ -149,6 +151,7 @@ struct bond_params {
>>>>> struct in6_addr ns_targets[BOND_MAX_NS_TARGETS];
>>>>> #endif
>>>>> int coupled_control;
>>>>> + int broadcast_neighbor;
>>>>>
>>>>> /* 2 bytes of padding : see ether_addr_equal_64bits() */
>>>>> u8 ad_actor_system[ETH_ALEN + 2];
>>>>
>>>>
>>>>
>>>
>>
>
> ---
> -Jay Vosburgh, jv@jvosburgh.net
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH net-next v2 3/4] net: bonding: send peer notify when failure recovery
2025-05-13 10:37 ` Nikolay Aleksandrov
@ 2025-05-14 6:07 ` Tonghao Zhang
0 siblings, 0 replies; 12+ messages in thread
From: Tonghao Zhang @ 2025-05-14 6:07 UTC (permalink / raw)
To: Nikolay Aleksandrov
Cc: netdev, Jay Vosburgh, David S. Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Simon Horman, Jonathan Corbet,
Andrew Lunn, Zengbing Tu
> 2025年5月13日 18:37,Nikolay Aleksandrov <razor@blackwall.org> 写道:
>
> On 5/13/25 12:47, Tonghao Zhang wrote:
>> After LACP protocol recovery, the port can transmit packets.
>> However, if the bond port doesn't send gratuitous ARP/ND
>> packets to the switch, the switch won't return packets through
>> the current interface. This causes traffic imbalance. To resolve
>> this issue, when LACP protocol recovers, send ARP/ND packets.
>>
>> Cc: Jay Vosburgh <jv@jvosburgh.net>
>> Cc: "David S. Miller" <davem@davemloft.net>
>> Cc: Eric Dumazet <edumazet@google.com>
>> Cc: Jakub Kicinski <kuba@kernel.org>
>> Cc: Paolo Abeni <pabeni@redhat.com>
>> Cc: Simon Horman <horms@kernel.org>
>> Cc: Jonathan Corbet <corbet@lwn.net>
>> Cc: Andrew Lunn <andrew+netdev@lunn.ch>
>> Signed-off-by: Tonghao Zhang <tonghao@bamaicloud.com>
>> Signed-off-by: Zengbing Tu <tuzengbing@didiglobal.com>
>> ---
>> Documentation/networking/bonding.rst | 5 +++--
>> drivers/net/bonding/bond_3ad.c | 13 +++++++++++++
>> drivers/net/bonding/bond_main.c | 21 ++++++++++++++++-----
>> 3 files changed, 32 insertions(+), 7 deletions(-)
>>
>> diff --git a/Documentation/networking/bonding.rst b/Documentation/networking/bonding.rst
>> index 14f7593d888d..f8f5766703d4 100644
>> --- a/Documentation/networking/bonding.rst
>> +++ b/Documentation/networking/bonding.rst
>> @@ -773,8 +773,9 @@ num_unsol_na
>> greater than 1.
>>
>> The valid range is 0 - 255; the default value is 1. These options
>> - affect only the active-backup mode. These options were added for
>> - bonding versions 3.3.0 and 3.4.0 respectively.
>> + affect the active-backup or 802.3ad (broadcast_neighbor enabled) mode.
>> + These options were added for bonding versions 3.3.0 and 3.4.0
>> + respectively.
>>
>> From Linux 3.0 and bonding version 3.7.1, these notifications
>> are generated by the ipv4 and ipv6 code and the numbers of
>> diff --git a/drivers/net/bonding/bond_3ad.c b/drivers/net/bonding/bond_3ad.c
>> index c6807e473ab7..d1c2d416ac87 100644
>> --- a/drivers/net/bonding/bond_3ad.c
>> +++ b/drivers/net/bonding/bond_3ad.c
>> @@ -982,6 +982,17 @@ static int ad_marker_send(struct port *port, struct bond_marker *marker)
>> return 0;
>> }
>>
>> +static void ad_cond_set_peer_notif(struct port *port)
>> +{
>> + struct bonding *bond = port->slave->bond;
>> +
>> + if (bond->params.broadcast_neighbor && rtnl_trylock()) {
>> + bond->send_peer_notif = bond->params.num_peer_notif *
>> + max(1, bond->params.peer_notif_delay);
>> + rtnl_unlock();
>> + }> +}
>> +
>> /**
>> * ad_mux_machine - handle a port's mux state machine
>> * @port: the port we're looking at
>> @@ -2061,6 +2072,8 @@ static void ad_enable_collecting_distributing(struct port *port,
>> __enable_port(port);
>> /* Slave array needs update */
>> *update_slave_arr = true;
>> + /* Should notify peers if possible */
>> + ad_cond_set_peer_notif(port);
>> }
>> }
>>
>> diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
>> index 8ee26ddddbc8..70bb1e33cee2 100644
>> --- a/drivers/net/bonding/bond_main.c
>> +++ b/drivers/net/bonding/bond_main.c
>> @@ -1243,17 +1243,28 @@ static struct slave *bond_find_best_slave(struct bonding *bond)
>> /* must be called in RCU critical section or with RTNL held */
>> static bool bond_should_notify_peers(struct bonding *bond)
>> {
>> - struct slave *slave = rcu_dereference_rtnl(bond->curr_active_slave);
>> + struct bond_up_slave *usable;
>> + struct slave *slave = NULL;
>>
>> - if (!slave || !bond->send_peer_notif ||
>> + if (!bond->send_peer_notif ||
>> bond->send_peer_notif %
>> max(1, bond->params.peer_notif_delay) != 0 ||
>> - !netif_carrier_ok(bond->dev) ||
>> + !netif_carrier_ok(bond->dev))> return false;
>>
>> + if (BOND_MODE(bond) == BOND_MODE_8023AD) {
>> + usable = rtnl_dereference(bond->usable_slaves);
>
> If mii monitor is enabled in 802.3ad bond mode then this will be
> dereferenced in bond_mii_monitor() with rcu, so you'd have to use
> rcu_dereference_rtnl() instead.
Ok, thanks for your review.
>
>> + if (!usable || !READ_ONCE(usable->count))
>> + return false;
>> + } else {
>> + slave = rcu_dereference_rtnl(bond->curr_active_slave);
>> + if (!slave || test_bit(__LINK_STATE_LINKWATCH_PENDING,
>> + &slave->dev->state))
>> + return false;
>> + }
>> +
>> netdev_dbg(bond->dev, "bond_should_notify_peers: slave %s\n",
>> - slave ? slave->dev->name : "NULL");
>> + slave ? slave->dev->name : "all");
>>
>> return true;
>> }
^ permalink raw reply [flat|nested] 12+ messages in thread
end of thread, other threads:[~2025-05-14 6:09 UTC | newest]
Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
[not found] <20250513094750.23387-1-tonghao@bamaicloud.com>
2025-05-13 9:47 ` [PATCH net-next v2 1/4] net: bonding: add broadcast_neighbor option for 802.3ad Tonghao Zhang
2025-05-13 10:18 ` Nikolay Aleksandrov
2025-05-13 12:13 ` Tonghao Zhang
2025-05-13 12:20 ` Nikolay Aleksandrov
2025-05-13 13:30 ` Jay Vosburgh
2025-05-14 6:07 ` Tonghao Zhang
2025-05-13 9:47 ` [PATCH net-next v2 2/4] net: bonding: add broadcast_neighbor netlink option Tonghao Zhang
2025-05-13 10:19 ` Nikolay Aleksandrov
2025-05-13 9:47 ` [PATCH net-next v2 3/4] net: bonding: send peer notify when failure recovery Tonghao Zhang
2025-05-13 10:37 ` Nikolay Aleksandrov
2025-05-14 6:07 ` Tonghao Zhang
2025-05-13 9:47 ` [PATCH net-next v2 4/4] net: bonding: add tracepoint for 802.3ad Tonghao Zhang
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).