* [PATCH net v5 0/4] bonding: 3ad: fix carrier state with no usable slaves
From: Louis Scalbert @ 2026-05-06 16:11 UTC
To: netdev
Cc: andrew+netdev, jv, edumazet, kuba, pabeni, fbl, andy, shemminger,
maheshb, jonas.gorski, Louis Scalbert
Hi everyone,
This series addresses a blackholing issue and a subsequent link-flapping
issue in the 802.3ad bonding driver when dealing with inactive slaves
and the `min_links` parameter.
When an 802.3ad (LACP) bonding interface has no slaves in the
collecting/distributing state, the bonding master still reports
carrier as up as long as at least 'min_links' slaves have carrier.
In this situation, only one slave is effectively used for TX/RX,
while traffic received on other slaves is dropped. Upper-layer
daemons therefore consider the interface operational, even though
traffic may be blackholed if the lack of LACP negotiation means
the partner is not ready to deal with traffic.
This patchset introduces an optional behavior, widely adopted across
the industry, to address this issue. It consists of bringing the
bonding master interface down to signal to upper-layer processes
that it is not usable.
This patchset depends on the following iproute2 change:
ip/bond: add lacp_strict support
Patch 1 introduces the lacp_strict configuration knob, which is
applied in the subsequent patch. The default (off) mode preserves
the existing behavior, while the strict mode (on) is intended to force
the bonding master carrier down in this situation.
Patch 2 addresses the core issue when lacp_strict is set to strict.
It ensures that carrier is asserted only when at least 'min_links'
slaves are in the Collecting/Distributing state.
Patch 3 fixes a side effect of the second patch. Tightening the carrier
logic exposes a state persistence bug: when a physical link goes down,
the LACP collecting/distributing flags remain set. When the link returns,
the interface briefly hallucinates that it is ready, bounces the carrier
up, and then drops it again once LACP renegotiation starts. Fix by
resetting Collecting and Distributing state as soon as the link goes
down.
Patch 4 adds a selftest covering both lacp_strict modes.
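As a rough illustration of the rule described for patches 1 and 2, the
sketch below counts the ports whose actor state has both Collecting
(0x10) and Distributing (0x20) set and compares that count with
min_links. The function names and the way port states are passed as
arguments are hypothetical; the kernel performs this check in C in
bond_3ad_set_carrier(), and this is only a sketch of the comparison, not
the implementation.

```shell
#!/bin/sh
# Illustrative sketch only: not the kernel implementation.
# LACP actor state bits from the port state octet.
COLLECTING=16      # 0x10
DISTRIBUTING=32    # 0x20

# usable_ports STATE...: print how many states have both bits set.
# (hypothetical helper)
usable_ports() {
	count=0
	for state in "$@"; do
		if [ $((state & COLLECTING)) -ne 0 ] &&
		   [ $((state & DISTRIBUTING)) -ne 0 ]; then
			count=$((count + 1))
		fi
	done
	echo "$count"
}

# strict_carrier MIN_LINKS STATE...: print "up" or "down" as lacp_strict
# on would decide it, using the same "fewer than min_links" comparison.
# (hypothetical helper)
strict_carrier() {
	min_links=$1
	shift
	if [ "$(usable_ports "$@")" -lt "$min_links" ]; then
		echo down
	else
		echo up
	fi
}

strict_carrier 1 61 13   # one port in Collecting/Distributing: prints up
strict_carrier 1 13 13   # no port has negotiated: prints down
```

Here 61 (0x3d) is a fully negotiated actor state and 13 (0x0d) is a port
with Activity, Aggregation and Synchronization set but neither
Collecting nor Distributing.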
Changelog:
v4 -> v5
- Patch 4: replace the use of netem, which is not included in the
bonding selftests configuration. Instead, use a dedicated netns to
forward frames between the partner and DUT. The partner and DUT are
bridged within that netns. Since Linux bridges do not forward LACP
special frames by default, group_fwd_mask is configured on bridge
interfaces to allow them.
Link: https://lore.kernel.org/netdev/20260417140505.3860237-1-louis.scalbert@6wind.com/
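For context on the group_fwd_mask value of 4 used by the test: bit N of
the mask covers the link-local group address 01:80:C2:00:00:0N, and LACP
frames are sent to 01:80:C2:00:00:02, hence bit 2. A minimal sketch of
that arithmetic follows; fwd_mask_bit is a hypothetical helper and is
not part of the selftest.

```shell
#!/bin/sh
# Illustrative sketch: derive a bridge group_fwd_mask bit from the last
# octet of a 01:80:C2:00:00:0X link-local group address.
# fwd_mask_bit is a hypothetical helper, not part of the selftest.
fwd_mask_bit() {
	last_octet=$1          # e.g. 2 for 01:80:C2:00:00:02
	echo $((1 << last_octet))
}

fwd_mask_bit 2   # LACP (slow protocols) address: prints 4
```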
v3 -> v4
- Rename the configuration knob to lacp_strict on/off instead of
lacp_fallback legacy/strict.
- Patch 1: change the command documentation accordingly and wrap
text at approximately 75 columns.
- Use "usable" wording instead of "valid" for LACP Collecting /
Distributing state in code and commit log.
- Patch 2: test collecting and distributing state regardless of
coupled_control
- Patch 3: Reworked because removing the SELECTED flag was not
compliant with 802.1AX. Instead, transition to the WAITING state
when the port is disabled, except when already in the DETACHED state,
and clear the Collecting and Distributing state in the WAITING state.
- Patch 4 is removed. It was a fix for patch 3 but is no longer
needed since patch 3 was reworked.
Link: https://lore.kernel.org/netdev/20260408152353.276204-1-louis.scalbert@6wind.com/
v2 -> v3
- Add an initial patch introducing the lacp_fallback configuration
knob (no behavior change yet).
- Patch 2 (was patch 1 in v2): apply the new behavior only when
lacp_fallback is set to strict, and re-evaluate the bonding
master carrier when the setting changes.
Link: https://lore.kernel.org/netdev/20260325134439.3048615-1-louis.scalbert@6wind.com/
v1 -> v2
- Patch 1: split a comment line that exceeded 80 characters.
- Move the change from patch 2 in __agg_ports_are_ready() into a third
patch, as it is actually a side effect of the fix introduced in
patch 2.
- Patch 2: Expand the commit message and add a code comment describing
the change in ad_port_selection_logic().
- Patch 3: Check the READY_N flag only on ports in the WAITING state,
rather than on all enabled ports. This more closely matches 802.3ad.
Link: https://lore.kernel.org/netdev/20260316131838.3257889-1-louis.scalbert@6wind.com/
Louis Scalbert (4):
bonding: 3ad: add lacp_strict configuration knob
bonding: 3ad: fix carrier when no usable slaves
bonding: 3ad: fix mux port state on oper down
selftests: bonding: add test for lacp_strict mode
Documentation/networking/bonding.rst | 23 ++
drivers/net/bonding/bond_3ad.c | 28 +-
drivers/net/bonding/bond_main.c | 1 +
drivers/net/bonding/bond_netlink.c | 16 +
drivers/net/bonding/bond_options.c | 27 ++
include/net/bond_options.h | 1 +
include/net/bonding.h | 1 +
include/uapi/linux/if_link.h | 1 +
.../selftests/drivers/net/bonding/Makefile | 1 +
.../drivers/net/bonding/bond_lacp_strict.sh | 347 ++++++++++++++++++
10 files changed, 445 insertions(+), 1 deletion(-)
create mode 100755 tools/testing/selftests/drivers/net/bonding/bond_lacp_strict.sh
--
2.39.2
* [PATCH net v5 2/4] bonding: 3ad: fix carrier when no usable slaves
From: Louis Scalbert @ 2026-05-06 16:11 UTC
To: netdev
Cc: andrew+netdev, jv, edumazet, kuba, pabeni, fbl, andy, shemminger,
maheshb, jonas.gorski, Louis Scalbert
In-Reply-To: <20260506161144.465485-1-louis.scalbert@6wind.com>
Apply the "lacp_strict" configuration from the previous commit.
"lacp_strict" mode "on" asserts that the bonding master carrier is up
only when at least 'min_links' slaves are in the Collecting_Distributing
state.
Fixes: 655f8919d549 ("bonding: add min links parameter to 802.3ad")
Signed-off-by: Louis Scalbert <louis.scalbert@6wind.com>
---
drivers/net/bonding/bond_3ad.c | 21 ++++++++++++++++++++-
drivers/net/bonding/bond_options.c | 1 +
2 files changed, 21 insertions(+), 1 deletion(-)
diff --git a/drivers/net/bonding/bond_3ad.c b/drivers/net/bonding/bond_3ad.c
index f0aa7d2f2171..1247a1e048df 100644
--- a/drivers/net/bonding/bond_3ad.c
+++ b/drivers/net/bonding/bond_3ad.c
@@ -745,6 +745,21 @@ static void __set_agg_ports_ready(struct aggregator *aggregator, int val)
}
}
+static int __agg_usable_ports(struct aggregator *agg)
+{
+ struct port *port;
+ int valid = 0;
+
+ for (port = agg->lag_ports; port;
+ port = port->next_port_in_aggregator) {
+ if (port->actor_oper_port_state & LACP_STATE_COLLECTING &&
+ port->actor_oper_port_state & LACP_STATE_DISTRIBUTING)
+ valid++;
+ }
+
+ return valid;
+}
+
static int __agg_active_ports(struct aggregator *agg)
{
struct port *port;
@@ -2128,6 +2143,7 @@ static void ad_enable_collecting_distributing(struct port *port,
port->actor_port_number,
aggregator->aggregator_identifier);
__enable_port(port);
+ bond_3ad_set_carrier(port->slave->bond);
/* Slave array needs update */
*update_slave_arr = true;
/* Should notify peers if possible */
@@ -2151,6 +2167,7 @@ static void ad_disable_collecting_distributing(struct port *port,
port->actor_port_number,
aggregator->aggregator_identifier);
__disable_port(port);
+ bond_3ad_set_carrier(port->slave->bond);
/* Slave array needs an update */
*update_slave_arr = true;
}
@@ -2830,7 +2847,9 @@ int bond_3ad_set_carrier(struct bonding *bond)
active = __get_active_agg(&(SLAVE_AD_INFO(first_slave)->aggregator));
if (active) {
/* are enough slaves available to consider link up? */
- if (__agg_active_ports(active) < bond->params.min_links) {
+ if ((bond->params.lacp_strict ? __agg_usable_ports(active)
+ : __agg_active_ports(active)) <
+ bond->params.min_links) {
if (netif_carrier_ok(bond->dev)) {
netif_carrier_off(bond->dev);
goto out;
diff --git a/drivers/net/bonding/bond_options.c b/drivers/net/bonding/bond_options.c
index d358b831df77..94b7b0851f16 100644
--- a/drivers/net/bonding/bond_options.c
+++ b/drivers/net/bonding/bond_options.c
@@ -1706,6 +1706,7 @@ static int bond_option_lacp_strict_set(struct bonding *bond,
netdev_dbg(bond->dev, "Setting LACP fallback to %s (%llu)\n",
newval->string, newval->value);
bond->params.lacp_strict = newval->value;
+ bond_3ad_set_carrier(bond);
return 0;
}
--
2.39.2
* [PATCH net v5 3/4] bonding: 3ad: fix mux port state on oper down
From: Louis Scalbert @ 2026-05-06 16:11 UTC
To: netdev
Cc: andrew+netdev, jv, edumazet, kuba, pabeni, fbl, andy, shemminger,
maheshb, jonas.gorski, Louis Scalbert
In-Reply-To: <20260506161144.465485-1-louis.scalbert@6wind.com>
When the bonding interface has carrier down due to the absence of
usable slaves and a slave transitions from down to up, the bonding
interface briefly goes carrier up, then down again, and finally up
once LACP negotiates collecting and distributing on the port.
When lacp_strict mode is on, the interface should not transition to
carrier up until LACP negotiation is complete.
This happens because the actor and partner port states remain in
Collecting_Distributing when the port goes down. When the port
comes back up, it temporarily remains in this state until LACP
renegotiation occurs.
Previously this was mostly cosmetic, but since the bonding carrier
state may depend on the LACP negotiation state, it causes the
interface to flap.
Move an operationally down port to the Mux WAITING state and clear the
Synchronization, Collecting, and Distributing states, in accordance with
the 802.1AX Mux state machine diagram.
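The effect on the actor state octet can be sketched as a bit-clearing
operation. The helper name below is hypothetical and the real change
lives in ad_mux_machine(); the bit values are the standard LACP state
bits (Synchronization 0x08, Collecting 0x10, Distributing 0x20).

```shell
#!/bin/sh
# Illustrative sketch only: clear the bits that are reset when a port
# enters the Mux WAITING state.
SYNCHRONIZATION=8   # 0x08
COLLECTING=16       # 0x10
DISTRIBUTING=32     # 0x20

# enter_waiting STATE: print STATE with the three bits cleared.
# (hypothetical helper)
enter_waiting() {
	mask=$((SYNCHRONIZATION | COLLECTING | DISTRIBUTING))
	echo $(($1 & ~mask))
}

enter_waiting 61   # 0x3d: prints 5 (Activity and Aggregation remain)
```

With the bits cleared on link down, a port that comes back up no longer
counts as usable until LACP renegotiation sets them again.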
Fixes: 655f8919d549 ("bonding: add min links parameter to 802.3ad")
Signed-off-by: Louis Scalbert <louis.scalbert@6wind.com>
---
drivers/net/bonding/bond_3ad.c | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/drivers/net/bonding/bond_3ad.c b/drivers/net/bonding/bond_3ad.c
index 1247a1e048df..b531f68a24b0 100644
--- a/drivers/net/bonding/bond_3ad.c
+++ b/drivers/net/bonding/bond_3ad.c
@@ -1055,6 +1055,8 @@ static void ad_mux_machine(struct port *port, bool *update_slave_arr)
aggregator = rcu_dereference(port->aggregator);
if (port->sm_vars & AD_PORT_BEGIN) {
port->sm_mux_state = AD_MUX_DETACHED;
+ } else if (!port->is_enabled && port->sm_mux_state != AD_MUX_DETACHED) {
+ port->sm_mux_state = AD_MUX_WAITING;
} else {
switch (port->sm_mux_state) {
case AD_MUX_DETACHED:
@@ -1202,6 +1204,11 @@ static void ad_mux_machine(struct port *port, bool *update_slave_arr)
break;
case AD_MUX_WAITING:
port->sm_mux_timer_counter = __ad_timer_to_ticks(AD_WAIT_WHILE_TIMER, 0);
+ port->actor_oper_port_state &= ~LACP_STATE_SYNCHRONIZATION;
+ ad_disable_collecting_distributing(port,
+ update_slave_arr);
+ port->actor_oper_port_state &= ~LACP_STATE_COLLECTING;
+ port->actor_oper_port_state &= ~LACP_STATE_DISTRIBUTING;
break;
case AD_MUX_ATTACHED:
if (aggregator->is_active)
--
2.39.2
* [PATCH net v5 1/4] bonding: 3ad: add lacp_strict configuration knob
From: Louis Scalbert @ 2026-05-06 16:11 UTC
To: netdev
Cc: andrew+netdev, jv, edumazet, kuba, pabeni, fbl, andy, shemminger,
maheshb, jonas.gorski, Louis Scalbert
In-Reply-To: <20260506161144.465485-1-louis.scalbert@6wind.com>
When an 802.3ad (LACP) bonding interface has no slaves in the
collecting/distributing state, the bonding master still reports
carrier as up as long as at least 'min_links' slaves have carrier.
In this situation, only one slave is effectively used for TX/RX,
while traffic received on other slaves is dropped. Upper-layer
daemons therefore consider the interface operational, even though
traffic may be blackholed if the lack of LACP negotiation means
the partner is not ready to deal with traffic.
Introduce a configuration knob to control this behavior. It allows
the bonding master to assert carrier only when at least 'min_links'
slaves are in Collecting_Distributing state.
The default mode preserves the existing behavior. This patch only
introduces the knob; its behavior is implemented in the subsequent
commit.
Fixes: 655f8919d549 ("bonding: add min links parameter to 802.3ad")
Signed-off-by: Louis Scalbert <louis.scalbert@6wind.com>
---
Documentation/networking/bonding.rst | 23 +++++++++++++++++++++++
drivers/net/bonding/bond_main.c | 1 +
drivers/net/bonding/bond_netlink.c | 16 ++++++++++++++++
drivers/net/bonding/bond_options.c | 26 ++++++++++++++++++++++++++
include/net/bond_options.h | 1 +
include/net/bonding.h | 1 +
include/uapi/linux/if_link.h | 1 +
7 files changed, 69 insertions(+)
diff --git a/Documentation/networking/bonding.rst b/Documentation/networking/bonding.rst
index e700bf1d095c..33ca5afafdf6 100644
--- a/Documentation/networking/bonding.rst
+++ b/Documentation/networking/bonding.rst
@@ -619,6 +619,29 @@ min_links
aggregator cannot be active without at least one available link,
setting this option to 0 or to 1 has the exact same effect.
+lacp_strict
+
+ Specifies the fallback behavior of a bond when LACP negotiation
+ fails on all slave links, i.e. when no slave is in the
+ Collecting_Distributing state, while at least `min_links` links still
+ report carrier up.
+
+ This option is only applicable to 802.3ad mode (mode 4).
+
+ Valid values are:
+
+ off or 0
+ One interface of the bond is selected to be active, in order to
+ facilitate communication with peer devices that do not implement
+ LACP.
+
+ on or 1
+ Interfaces are only permitted to be made active if they have an
+ active LACP partner and have successfully reached
+ Collecting_Distributing state.
+
+ The default value is 0 (off).
+
mode
Specifies one of the bonding policies. The default is
diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index af82a3df2c5d..47d7ce3240da 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -6440,6 +6440,7 @@ static int __init bond_check_params(struct bond_params *params)
params->ad_user_port_key = ad_user_port_key;
params->coupled_control = 1;
params->broadcast_neighbor = 0;
+ params->lacp_strict = 0;
if (packets_per_slave > 0) {
params->reciprocal_packets_per_slave =
reciprocal_value(packets_per_slave);
diff --git a/drivers/net/bonding/bond_netlink.c b/drivers/net/bonding/bond_netlink.c
index c7d3e0602c83..08d4d0814f67 100644
--- a/drivers/net/bonding/bond_netlink.c
+++ b/drivers/net/bonding/bond_netlink.c
@@ -143,6 +143,7 @@ static const struct nla_policy bond_policy[IFLA_BOND_MAX + 1] = {
[IFLA_BOND_NS_IP6_TARGET] = { .type = NLA_NESTED },
[IFLA_BOND_COUPLED_CONTROL] = { .type = NLA_U8 },
[IFLA_BOND_BROADCAST_NEIGH] = { .type = NLA_U8 },
+ [IFLA_BOND_LACP_STRICT] = { .type = NLA_U8 },
};
static const struct nla_policy bond_slave_policy[IFLA_BOND_SLAVE_MAX + 1] = {
@@ -599,6 +600,16 @@ static int bond_changelink(struct net_device *bond_dev, struct nlattr *tb[],
return err;
}
+ if (data[IFLA_BOND_LACP_STRICT]) {
+ int fallback_mode = nla_get_u8(data[IFLA_BOND_LACP_STRICT]);
+
+ bond_opt_initval(&newval, fallback_mode);
+ err = __bond_opt_set(bond, BOND_OPT_LACP_STRICT, &newval,
+ data[IFLA_BOND_LACP_STRICT], extack);
+ if (err)
+ return err;
+ }
+
return 0;
}
@@ -671,6 +682,7 @@ static size_t bond_get_size(const struct net_device *bond_dev)
nla_total_size(sizeof(struct in6_addr)) * BOND_MAX_NS_TARGETS +
nla_total_size(sizeof(u8)) + /* IFLA_BOND_COUPLED_CONTROL */
nla_total_size(sizeof(u8)) + /* IFLA_BOND_BROADCAST_NEIGH */
+ nla_total_size(sizeof(u8)) + /* IFLA_BOND_LACP_STRICT */
0;
}
@@ -838,6 +850,10 @@ static int bond_fill_info(struct sk_buff *skb,
bond->params.broadcast_neighbor))
goto nla_put_failure;
+ if (nla_put_u8(skb, IFLA_BOND_LACP_STRICT,
+ bond->params.lacp_strict))
+ goto nla_put_failure;
+
if (BOND_MODE(bond) == BOND_MODE_8023AD) {
struct ad_info info;
diff --git a/drivers/net/bonding/bond_options.c b/drivers/net/bonding/bond_options.c
index 7380cc4ee75a..d358b831df77 100644
--- a/drivers/net/bonding/bond_options.c
+++ b/drivers/net/bonding/bond_options.c
@@ -68,6 +68,8 @@ static int bond_option_lacp_active_set(struct bonding *bond,
const struct bond_opt_value *newval);
static int bond_option_lacp_rate_set(struct bonding *bond,
const struct bond_opt_value *newval);
+static int bond_option_lacp_strict_set(struct bonding *bond,
+ const struct bond_opt_value *newval);
static int bond_option_ad_select_set(struct bonding *bond,
const struct bond_opt_value *newval);
static int bond_option_queue_id_set(struct bonding *bond,
@@ -162,6 +164,12 @@ static const struct bond_opt_value bond_lacp_rate_tbl[] = {
{ NULL, -1, 0},
};
+static const struct bond_opt_value bond_lacp_strict_tbl[] = {
+ { "off", 0, BOND_VALFLAG_DEFAULT},
+ { "on", 1, 0},
+ { NULL, -1, 0 }
+};
+
static const struct bond_opt_value bond_ad_select_tbl[] = {
{ "stable", BOND_AD_STABLE, BOND_VALFLAG_DEFAULT},
{ "bandwidth", BOND_AD_BANDWIDTH, 0},
@@ -363,6 +371,14 @@ static const struct bond_option bond_opts[BOND_OPT_LAST] = {
.values = bond_lacp_rate_tbl,
.set = bond_option_lacp_rate_set
},
+ [BOND_OPT_LACP_STRICT] = {
+ .id = BOND_OPT_LACP_STRICT,
+ .name = "lacp_strict",
+ .desc = "Define the LACP fallback mode when no slaves have negotiated",
+ .unsuppmodes = BOND_MODE_ALL_EX(BIT(BOND_MODE_8023AD)),
+ .values = bond_lacp_strict_tbl,
+ .set = bond_option_lacp_strict_set
+ },
[BOND_OPT_MINLINKS] = {
.id = BOND_OPT_MINLINKS,
.name = "min_links",
@@ -1684,6 +1700,16 @@ static int bond_option_lacp_rate_set(struct bonding *bond,
return 0;
}
+static int bond_option_lacp_strict_set(struct bonding *bond,
+ const struct bond_opt_value *newval)
+{
+ netdev_dbg(bond->dev, "Setting LACP fallback to %s (%llu)\n",
+ newval->string, newval->value);
+ bond->params.lacp_strict = newval->value;
+
+ return 0;
+}
+
static int bond_option_ad_select_set(struct bonding *bond,
const struct bond_opt_value *newval)
{
diff --git a/include/net/bond_options.h b/include/net/bond_options.h
index e6eedf23aea1..52b966e92793 100644
--- a/include/net/bond_options.h
+++ b/include/net/bond_options.h
@@ -79,6 +79,7 @@ enum {
BOND_OPT_COUPLED_CONTROL,
BOND_OPT_BROADCAST_NEIGH,
BOND_OPT_ACTOR_PORT_PRIO,
+ BOND_OPT_LACP_STRICT,
BOND_OPT_LAST
};
diff --git a/include/net/bonding.h b/include/net/bonding.h
index edd1942dcd73..2c54a36a8477 100644
--- a/include/net/bonding.h
+++ b/include/net/bonding.h
@@ -129,6 +129,7 @@ struct bond_params {
int peer_notif_delay;
int lacp_active;
int lacp_fast;
+ int lacp_strict;
unsigned int min_links;
int ad_select;
char primary[IFNAMSIZ];
diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
index 79ce4bc24cba..9ef5784e78e8 100644
--- a/include/uapi/linux/if_link.h
+++ b/include/uapi/linux/if_link.h
@@ -1584,6 +1584,7 @@ enum {
IFLA_BOND_NS_IP6_TARGET,
IFLA_BOND_COUPLED_CONTROL,
IFLA_BOND_BROADCAST_NEIGH,
+ IFLA_BOND_LACP_STRICT,
__IFLA_BOND_MAX,
};
--
2.39.2
* [PATCH net v5 4/4] selftests: bonding: add test for lacp_strict mode
From: Louis Scalbert @ 2026-05-06 16:11 UTC
To: netdev
Cc: andrew+netdev, jv, edumazet, kuba, pabeni, fbl, andy, shemminger,
maheshb, jonas.gorski, Louis Scalbert
In-Reply-To: <20260506161144.465485-1-louis.scalbert@6wind.com>
Add a test for the bonding lacp_strict mode.
Signed-off-by: Louis Scalbert <louis.scalbert@6wind.com>
---
.../selftests/drivers/net/bonding/Makefile | 1 +
.../drivers/net/bonding/bond_lacp_strict.sh | 347 ++++++++++++++++++
2 files changed, 348 insertions(+)
create mode 100755 tools/testing/selftests/drivers/net/bonding/bond_lacp_strict.sh
diff --git a/tools/testing/selftests/drivers/net/bonding/Makefile b/tools/testing/selftests/drivers/net/bonding/Makefile
index 9af5f84edd37..91269e7ceb63 100644
--- a/tools/testing/selftests/drivers/net/bonding/Makefile
+++ b/tools/testing/selftests/drivers/net/bonding/Makefile
@@ -7,6 +7,7 @@ TEST_PROGS := \
bond-eth-type-change.sh \
bond-lladdr-target.sh \
bond_ipsec_offload.sh \
+ bond_lacp_strict.sh \
bond_lacp_prio.sh \
bond_macvlan_ipvlan.sh \
bond_options.sh \
diff --git a/tools/testing/selftests/drivers/net/bonding/bond_lacp_strict.sh b/tools/testing/selftests/drivers/net/bonding/bond_lacp_strict.sh
new file mode 100755
index 000000000000..f1a93a1d952f
--- /dev/null
+++ b/tools/testing/selftests/drivers/net/bonding/bond_lacp_strict.sh
@@ -0,0 +1,347 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# Testing if bond lacp_strict works
+#
+# Partner (p_ns)
+# +--------------------------+
+# | bond0 |
+# | + |
+# | eth0 | eth1 |
+# | +---+---+ |
+# | | | |
+# +--------------------------+
+# | | eth0 | eth1 |
+# | | | |
+# |(br_ns) | br0 | br1 |
+# | | | |
+# | | eth2 | eth3 |
+# +--------------------------+
+# | | | |
+# | +---+---+ |
+# | eth0 | eth1 |
+# | + |
+# | bond0 |
+# +--------------------------+
+# Dut (d_ns)
+
+lib_dir=$(dirname "$0")
+# shellcheck disable=SC1090
+source "$lib_dir"/../../../net/lib.sh
+
+COLLECTING_DISTRIBUTING_MASK=48
+COLLECTING_DISTRIBUTING=48
+FAILED=0
+
+setup_links()
+{
+ # shellcheck disable=SC2154
+ ip -n "${p_ns}" link add eth0 type veth peer name eth0 netns "${br_ns}"
+ ip -n "${p_ns}" link add eth1 type veth peer name eth1 netns "${br_ns}"
+ ip -n "${d_ns}" link add eth0 type veth peer name eth2 netns "${br_ns}"
+ ip -n "${d_ns}" link add eth1 type veth peer name eth3 netns "${br_ns}"
+
+ ip -n "${br_ns}" link add br0 type bridge
+ ip -n "${br_ns}" link add br1 type bridge
+
+ ip -n "${br_ns}" link set dev br0 type bridge stp_state 0
+ ip -n "${br_ns}" link set dev br1 type bridge stp_state 0
+
+ ip -n "${br_ns}" link set eth0 master br0
+ ip -n "${br_ns}" link set eth2 master br0
+ ip -n "${br_ns}" link set eth1 master br1
+ ip -n "${br_ns}" link set eth3 master br1
+
+ # Allow LACP frame forwarding on bridge ports
+ ip netns exec "${br_ns}" sh -c "echo 4 > /sys/class/net/br0/brif/eth0/group_fwd_mask"
+ ip netns exec "${br_ns}" sh -c "echo 4 > /sys/class/net/br1/brif/eth1/group_fwd_mask"
+ ip netns exec "${br_ns}" sh -c "echo 4 > /sys/class/net/br0/brif/eth2/group_fwd_mask"
+ ip netns exec "${br_ns}" sh -c "echo 4 > /sys/class/net/br1/brif/eth3/group_fwd_mask"
+
+ ip -n "${br_ns}" link set eth0 up
+ ip -n "${br_ns}" link set eth2 up
+ ip -n "${br_ns}" link set eth1 up
+ ip -n "${br_ns}" link set eth3 up
+
+ ip -n "${br_ns}" link set br0 up
+ ip -n "${br_ns}" link set br1 up
+
+ ip -n "${d_ns}" link add bond0 type bond mode 802.3ad miimon 100 \
+ lacp_rate fast min_links 1
+ ip -n "${p_ns}" link add bond0 type bond mode 802.3ad miimon 100 \
+ lacp_rate fast min_links 1
+
+ ip -n "${d_ns}" link set eth0 master bond0
+ ip -n "${d_ns}" link set eth1 master bond0
+ ip -n "${p_ns}" link set eth0 master bond0
+ ip -n "${p_ns}" link set eth1 master bond0
+
+ ip -n "${d_ns}" link set bond0 up
+ ip -n "${p_ns}" link set bond0 up
+}
+
+test_master_carrier() {
+ local expected=$1
+ local mode_name=$2
+ local carrier
+
+ carrier=$(ip netns exec "${d_ns}" cat /sys/class/net/bond0/carrier)
+ [ "$carrier" == "1" ] && carrier="up" || carrier="down"
+
+ [ "$carrier" == "$expected" ] && return
+
+ echo "FAIL: Expected carrier $expected in lacp_strict $mode_name mode, got $carrier"
+
+ RET=1
+
+}
+
+compare_state() {
+ local actual_state=$1
+ local expected_state=$2
+ local iface=$3
+ local last_attempt=$4
+
+ [ $((actual_state & COLLECTING_DISTRIBUTING_MASK)) -eq "$expected_state" ] \
+ && return 0
+
+ [ "$last_attempt" -ne 1 ] && return 1
+
+ printf "FAIL: Expected LACP %s actor state to " "$iface"
+ if [ "$expected_state" -eq $COLLECTING_DISTRIBUTING ]; then
+ echo "be in Collecting/Distributing state"
+ else
+ echo "have neither Collecting nor Distributing set."
+ fi
+
+ return 1
+}
+
+_test_lacp_port_state() {
+ local interface=$1
+ local expected=$2
+ local last_attempt=$3
+ local eth0_actor_state eth1_actor_state
+ local ret=0
+
+ # shellcheck disable=SC2016
+ while IFS='=' read -r k v; do
+ printf -v "$k" '%s' "$v"
+ done < <(
+ ip netns exec "${d_ns}" awk '
+ /^Slave Interface: / { iface=$3 }
+ /details actor lacp pdu:/ { ctx="actor" }
+ /details partner lacp pdu:/ { ctx="partner" }
+ /^[[:space:]]+port state: / {
+ if (ctx == "actor") {
+ gsub(":", "", iface)
+ printf "%s_%s_state=%s\n", iface, ctx, $3
+ }
+ }
+ ' /proc/net/bonding/bond0
+ )
+
+ if [ "$interface" == "eth0" ] || [ "$interface" == "both" ]; then
+ compare_state "$eth0_actor_state" "$expected" eth0 "$last_attempt" || ret=1
+ fi
+
+ if [ "$interface" == "eth1" ] || [ "$interface" == "both" ]; then
+ compare_state "$eth1_actor_state" "$expected" eth1 "$last_attempt" || ret=1
+ fi
+
+ return $ret
+}
+
+test_lacp_port_state() {
+ local interface=$1
+ local expected=$2
+ local retry=$3
+ local last_attempt=0
+ local attempt=1
+ local ret=1
+
+ while [ $attempt -le $((retry + 1)) ]; do
+ [ $attempt -eq $((retry + 1)) ] && last_attempt=1
+ _test_lacp_port_state "$interface" "$expected" "$last_attempt" && return
+ ((attempt++))
+ sleep 1
+ done
+
+ RET=1
+}
+
+
+trap cleanup_all_ns EXIT
+setup_ns d_ns p_ns br_ns
+setup_links
+
+# Initial state
+RET=0
+mode=off
+test_lacp_port_state both $COLLECTING_DISTRIBUTING 3
+test_master_carrier up $mode
+log_test "bond LACP" "lacp_strict $mode - eth0 and eth1 up"
+
+# partner eth0 down, eth1 up
+# (replicate eth0 state to dut eth0 by shutting a bridge port)
+RET=0
+ip -n "${p_ns}" link set eth0 down
+ip -n "${br_ns}" link set eth2 down
+test_lacp_port_state eth0 $FAILED 5
+test_lacp_port_state eth1 $COLLECTING_DISTRIBUTING 1
+test_master_carrier up $mode
+log_test "bond LACP" "lacp_strict $mode - eth0 down"
+
+# partner eth0 and eth1 down
+RET=0
+ip -n "${p_ns}" link set eth1 down
+ip -n "${br_ns}" link set eth3 down
+test_lacp_port_state both $FAILED 5
+test_master_carrier down $mode # down because of min_links
+log_test "bond LACP" "lacp_strict $mode - eth0 and eth1 down"
+
+# partner eth0 up, eth1 down
+RET=0
+ip -n "${p_ns}" link set eth0 up
+ip -n "${br_ns}" link set eth2 up
+test_lacp_port_state eth0 $COLLECTING_DISTRIBUTING 60
+test_lacp_port_state eth1 $FAILED 1
+test_master_carrier up $mode
+log_test "bond LACP" "lacp_strict $mode - eth0 up, eth1 down"
+
+# partner eth0 and eth1 up
+RET=0
+ip -n "${p_ns}" link set eth1 up
+ip -n "${br_ns}" link set eth3 up
+test_lacp_port_state both $COLLECTING_DISTRIBUTING 60
+test_master_carrier up $mode
+log_test "bond LACP" "lacp_strict $mode - eth0 and eth1 up"
+
+# partner eth0 stops LACP and eth1 up
+RET=0
+ip netns exec "${br_ns}" sh -c "echo 0 > /sys/class/net/br0/brif/eth0/group_fwd_mask"
+ip netns exec "${br_ns}" sh -c "echo 0 > /sys/class/net/br0/brif/eth2/group_fwd_mask"
+test_lacp_port_state eth0 $FAILED 5
+test_lacp_port_state eth1 $COLLECTING_DISTRIBUTING 1
+test_master_carrier up $mode
+log_test "bond LACP" "lacp_strict $mode - eth0 stopped sending LACP"
+
+# partner eth0 and eth1 stop LACP
+RET=0
+ip netns exec "${br_ns}" sh -c "echo 0 > /sys/class/net/br1/brif/eth1/group_fwd_mask"
+ip netns exec "${br_ns}" sh -c "echo 0 > /sys/class/net/br1/brif/eth3/group_fwd_mask"
+test_lacp_port_state both $FAILED 5
+test_master_carrier up $mode
+log_test "bond LACP" "lacp_strict $mode - eth0 and eth1 stopped sending LACP"
+
+# switch to lacp_strict on
+RET=0
+mode=on
+ip -n "${d_ns}" link set dev bond0 type bond lacp_strict $mode
+test_lacp_port_state both $FAILED 1
+test_master_carrier down $mode
+log_test "bond LACP" "lacp_strict $mode - eth0 and eth1 stopped sending LACP"
+
+# switch back to lacp_strict off mode
+RET=0
+mode=off
+ip -n "${d_ns}" link set dev bond0 type bond lacp_strict $mode
+test_lacp_port_state both $FAILED 1
+test_master_carrier up $mode
+log_test "bond LACP" "lacp_strict $mode - eth0 and eth1 stopped sending LACP"
+
+# eth0 recovers LACP
+RET=0
+ip netns exec "${br_ns}" sh -c "echo 4 > /sys/class/net/br0/brif/eth0/group_fwd_mask"
+ip netns exec "${br_ns}" sh -c "echo 4 > /sys/class/net/br0/brif/eth2/group_fwd_mask"
+test_lacp_port_state eth0 $COLLECTING_DISTRIBUTING 60
+test_lacp_port_state eth1 $FAILED 1
+test_master_carrier up $mode
+log_test "bond LACP" "lacp_strict $mode - eth0 recovered and eth1 stopped sending LACP"
+
+# eth1 recovers LACP
+RET=0
+ip netns exec "${br_ns}" sh -c "echo 4 > /sys/class/net/br1/brif/eth1/group_fwd_mask"
+ip netns exec "${br_ns}" sh -c "echo 4 > /sys/class/net/br1/brif/eth3/group_fwd_mask"
+test_lacp_port_state both $COLLECTING_DISTRIBUTING 60
+test_master_carrier up $mode
+log_test "bond LACP" "lacp_strict $mode - eth0 and eth1 recovered LACP"
+
+# switch to lacp_strict on
+RET=0
+mode=on
+ip -n "${d_ns}" link set dev bond0 type bond lacp_strict $mode
+test_lacp_port_state both $COLLECTING_DISTRIBUTING 1
+test_master_carrier up $mode
+log_test "bond LACP" "lacp_strict $mode - eth0 and eth1 up"
+
+# partner eth0 down, eth1 up
+RET=0
+ip -n "${p_ns}" link set eth0 down
+ip -n "${br_ns}" link set eth2 down
+test_lacp_port_state eth0 $FAILED 5
+test_lacp_port_state eth1 $COLLECTING_DISTRIBUTING 1
+test_master_carrier up $mode
+log_test "bond LACP" "lacp_strict $mode - eth0 down"
+
+# partner eth0 and eth1 down
+RET=0
+ip -n "${p_ns}" link set eth1 down
+ip -n "${br_ns}" link set eth3 down
+test_lacp_port_state both $FAILED 5
+test_master_carrier down $mode # down because of min_links
+log_test "bond LACP" "lacp_strict $mode - eth0 and eth1 down"
+
+# partner eth0 up, eth1 down
+RET=0
+ip -n "${p_ns}" link set eth0 up
+ip -n "${br_ns}" link set eth2 up
+test_lacp_port_state eth0 $COLLECTING_DISTRIBUTING 60
+test_lacp_port_state eth1 $FAILED 1
+test_master_carrier up $mode
+log_test "bond LACP" "lacp_strict $mode - eth0 up, eth1 down"
+
+# partner eth0 and eth1 up
+RET=0
+ip -n "${p_ns}" link set eth1 up
+ip -n "${br_ns}" link set eth3 up
+test_lacp_port_state both $COLLECTING_DISTRIBUTING 60
+test_master_carrier up $mode
+log_test "bond LACP" "lacp_strict $mode - eth0 and eth1 up"
+
+# partner eth0 stops LACP and eth1 up
+RET=0
+ip netns exec "${br_ns}" sh -c "echo 0 > /sys/class/net/br0/brif/eth0/group_fwd_mask"
+ip netns exec "${br_ns}" sh -c "echo 0 > /sys/class/net/br0/brif/eth2/group_fwd_mask"
+test_lacp_port_state eth0 $FAILED 5
+test_lacp_port_state eth1 $COLLECTING_DISTRIBUTING 1
+test_master_carrier up $mode
+log_test "bond LACP" "lacp_strict $mode - eth0 stopped sending LACP"
+
+# partner eth0 and eth1 stop LACP
+RET=0
+ip netns exec "${br_ns}" sh -c "echo 0 > /sys/class/net/br1/brif/eth1/group_fwd_mask"
+ip netns exec "${br_ns}" sh -c "echo 0 > /sys/class/net/br1/brif/eth3/group_fwd_mask"
+test_lacp_port_state both $FAILED 5
+test_master_carrier down $mode
+log_test "bond LACP" "lacp_strict $mode - eth0 and eth1 stopped sending LACP"
+
+# eth0 recovers LACP
+RET=0
+ip netns exec "${br_ns}" sh -c "echo 4 > /sys/class/net/br0/brif/eth0/group_fwd_mask"
+ip netns exec "${br_ns}" sh -c "echo 4 > /sys/class/net/br0/brif/eth2/group_fwd_mask"
+test_lacp_port_state eth0 $COLLECTING_DISTRIBUTING 60
+test_lacp_port_state eth1 $FAILED 1
+test_master_carrier up $mode
+log_test "bond LACP" "lacp_strict $mode - eth0 recovered and eth1 stopped sending LACP"
+
+# eth1 recovers LACP
+# shellcheck disable=SC2034
+RET=0
+ip netns exec "${br_ns}" sh -c "echo 4 > /sys/class/net/br1/brif/eth1/group_fwd_mask"
+ip netns exec "${br_ns}" sh -c "echo 4 > /sys/class/net/br1/brif/eth3/group_fwd_mask"
+test_lacp_port_state both $COLLECTING_DISTRIBUTING 60
+test_master_carrier up $mode
+log_test "bond LACP" "lacp_strict $mode - eth0 and eth1 recovered LACP"
+
+exit "${EXIT_STATUS}"
--
2.39.2
* Re: [PATCH net-next 5/5] ionic: Add .get_fec_stats ethtool handler
From: Eric Joyner @ 2026-05-06 16:17 UTC
To: Vadim Fedorenko, netdev
Cc: Brett Creeley, Andrew Lunn, David S. Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni
In-Reply-To: <45677056-ff03-442f-a7dd-19d34c7b612d@linux.dev>
On 5/6/2026 3:00 AM, Vadim Fedorenko wrote:
> On 05/05/2026 20:43, Eric Joyner wrote:
>> On 5/5/2026 6:54 AM, Vadim Fedorenko wrote:
>>> On 01/05/2026 04:15, Eric Joyner wrote:
>>>> Several FEC error statistics being collected can be reported in a
>>>> dedicated ethtool callback for FEC errors, so implement the handler that
>>>> does so. This includes 802.3ck FEC histogram data that some newer
>>>> hardware collects.
>>>>
>>>> Assisted-by: Claude:claude-4.6-sonnet
>>>> Signed-off-by: Eric Joyner <eric.joyner@amd.com>
>>>> ---
>>>> .../ethernet/pensando/ionic/ionic_ethtool.c | 51 +++++++++++++++++++
>>>> 1 file changed, 51 insertions(+)
>>>>
>>>> diff --git a/drivers/net/ethernet/pensando/ionic/ionic_ethtool.c b/drivers/
>>>> net/ethernet/pensando/ionic/ionic_ethtool.c
>>>> index 78a802eb159f..fe1f753b6115 100644
>>>> --- a/drivers/net/ethernet/pensando/ionic/ionic_ethtool.c
>>>> +++ b/drivers/net/ethernet/pensando/ionic/ionic_ethtool.c
>>>> @@ -418,6 +418,56 @@ static int ionic_get_fecparam(struct net_device *netdev,
>>>> return 0;
>>>> }
>>>>
>>>> +static const struct ethtool_fec_hist_range ionic_fec_ranges[] = {
>>>> + { 0, 0},
>>>> + { 1, 1},
>>>> + { 2, 2},
>>>> + { 3, 3},
>>>> + { 4, 4},
>>>> + { 5, 5},
>>>> + { 6, 6},
>>>> + { 7, 7},
>>>> + { 8, 8},
>>>> + { 9, 9},
>>>> + { 10, 10},
>>>> + { 11, 11},
>>>> + { 12, 12},
>>>> + { 13, 13},
>>>> + { 14, 14},
>>>> + { 15, 15},
>>>> + { 0, 0},
>>>> +};
>>>> +
>>>> +static void
>>>> +ionic_fill_fec_hist(const struct ionic_port_extra_stats *extra_stats,
>>>> + struct ethtool_fec_hist *hist)
>>>> +{
>>>> + int i;
>>>> +
>>>> + hist->ranges = ionic_fec_ranges;
>>>> + for (i = 0; i < ETHTOOL_FEC_HIST_MAX - 1; i++)
>>>> + hist->values[i].sum = extra_stats->fec_codeword_error_bin[i];
>>>> +}
>>>
>>> ETHTOOL_FEC_HIST_MAX = 17, you defined 16 bins, but iterating over 15 of
>>> them. Looks like bin {15, 15} will be lost in stats.
>>>
>>>
>>
>> This looks correct to me -- (ETHTOOL_FEC_HIST_MAX - 1) = 16, so starting with
>> i=0, it'll iterate through the 16 bins and ignore the 17th end marker bin. Bin
>> 15 does get included and gets its sum set.
>
> Ahh, yeah, you're right, something was wrong with my math yesterday. By the
> way, how does it work for a different FEC? RS(528, 514) will not give you 16 bins.
>
>
According to the datasheet for the device that does support reporting these
histogram bins, there's no support for "Clause 91 RS(528,514) "KR4" FEC for
100GbE". I only see support for PAM4 and not NRZ.
- Eric
^ permalink raw reply
* Re: [PATCH net-next v6 00/10] enic: SR-IOV V2 admin channel and MBOX protocol
From: Simon Horman @ 2026-05-06 16:19 UTC (permalink / raw)
To: Satish Kharat
Cc: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, netdev, linux-kernel, Sesidhar Baddela, Breno Leitao
In-Reply-To: <20260503-enic-sriov-v2-admin-channel-v2-v6-0-0af4fbc2d86d@cisco.com>
On Sun, May 03, 2026 at 04:22:37AM -0700, Satish Kharat wrote:
> This series adds the admin channel infrastructure and mailbox (MBOX)
> protocol needed for V2 SR-IOV support in the enic driver.
>
> The V2 SR-IOV design uses a direct PF-VF communication channel built on
> dedicated WQ/RQ/CQ hardware resources and an MSI-X interrupt.
>
> Firmware capability and admin channel infrastructure (patches 1-4):
> - Probe-time firmware feature check for V2 SR-IOV support
> - Admin channel open/close, RQ buffer management, CQ service
> with MSI-X interrupt and NAPI polling
>
> MBOX protocol and VF enable (patches 5-10):
> - MBOX message types, core send/receive, PF and VF handlers
> - V2 SR-IOV enable wiring with admin channel setup
> - V2 VF probe with admin channel and PF registration
>
> Signed-off-by: Satish Kharat <satishkh@cisco.com>
Hi Satish,
There are AI-generated reviews of this patch available at both
https://netdev-ai.bots.linux.dev/sashiko/ and https://sashiko.dev/
And it seems to me that some of the issues raised there do warrant
investigation. I'd appreciate it if you could do so with
a view to addressing any issues that are not pre-existing in v7.
...
^ permalink raw reply
* Re: [PATCH v1] gve: Use generic power management
From: Vaibhav Gupta @ 2026-05-06 16:21 UTC (permalink / raw)
To: Harshitha Ramamurthy
Cc: Joshua Washington, Andrew Lunn, David S. Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Willem de Bruijn, Ankit Garg,
Tim Hostetler, Alok Tiwari, John Fraker, Matt Olson,
Praveen Kaligineedi, netdev, linux-kernel
In-Reply-To: <CAEAWyHf3x=sCefMKkUvRq6hq8qd_ALM=MPvnN57xmyE5VCN+CA@mail.gmail.com>
> > - .suspend = gve_suspend,
> > - .resume = gve_resume,
> > -#endif
> > + .driver.pm = &gve_pm_ops,
>
> Thanks for the patch! A minor suggestion: could you wrap this in pm_sleep_ptr()?
>
> Also, please include the net-next prefix in the subject line when posting v2.
>
> With those changes, feel free to add:
> Reviewed-by: Harshitha Ramamurthy <hramamurthy@google.com>
>
> Thanks,
> Harshitha
>
Thanks for your input, Harshitha! I have made the suggested changes, and I will
send the v2 in reply to my v1 patch, as I have to include Alexander's review
tag as well.
Thanks
Vaibhav
> > };
> >
> > module_pci_driver(gve_driver);
>
> > --
> > 2.53.0
> >
^ permalink raw reply
* Re: [PATCH net-next v2] ipv4: Flush the FIB once per dev nexthop removal
From: David Ahern @ 2026-05-06 16:26 UTC (permalink / raw)
To: Ido Schimmel, Cosmin Ratiu
Cc: netdev, David S . Miller, Eric Dumazet, Jakub Kicinski,
Simon Horman, Paolo Abeni
In-Reply-To: <20260506130129.GA665477@shredder>
On 5/6/26 7:01 AM, Ido Schimmel wrote:
> ... it would have been easier to review if split into
> multiple patches (not saying you should do it). Something like:
>
> 1. Change the various nexthop remove functions to return an indication if
> flushing is required, but keep doing the flushing in
> __remove_nexthop_fib(). Referring to these functions:
>
> remove_nexthop()
> __remove_nexthop()
> __remove_nexthop_fib()
> remove_nexthop_from_groups()
> remove_nh_grp_entry()
>
> 2. Act upon the flushing indication in the various callers of
> remove_nexthop() and remove the flushing from __remove_nexthop_fib().
>
> 3. Add __must_check annotations.
>
+1. Always send the smallest patches possible to evolve the code. Make
it easy for reviewers - and for yourself, should you introduce an unintended
side effect.
^ permalink raw reply
* Re: [syzbot] [kernel?] WARNING: ODEBUG bug in smpboot_thread_fn
From: Thomas Gleixner @ 2026-05-06 16:29 UTC (permalink / raw)
To: syzbot, linux-kernel, peterz, syzkaller-bugs
Cc: bridge, Nikolay Aleksandrov, Ido Schimmel, netdev
In-Reply-To: <69f88fc0.050a0220.3460d5.0004.GAE@google.com>
On Mon, May 04 2026 at 05:23, syzbot wrote:
Cc'ed network/bridge people as that's clearly in their realm.
> Hello,
>
> syzbot found the following issue on:
>
> HEAD commit: 6d35786de281 Merge tag 'for-linus' of git://git.kernel.org..
> git tree: upstream
> console output: https://syzkaller.appspot.com/x/log.txt?x=126bf21f980000
> kernel config: https://syzkaller.appspot.com/x/.config?x=f2e8ebfec4636d32
> dashboard link: https://syzkaller.appspot.com/bug?extid=ae231e0552fa77b26ea1
> compiler: Debian clang version 21.1.8 (++20251221033036+2078da43e25a-1~exp1~20251221153213.50), Debian LLD 21.1.8
>
> Unfortunately, I don't have any reproducer for this issue yet.
>
> Downloadable assets:
> disk image: https://storage.googleapis.com/syzbot-assets/b3ab34c4c807/disk-6d35786d.raw.xz
> vmlinux: https://storage.googleapis.com/syzbot-assets/03f465478178/vmlinux-6d35786d.xz
> kernel image: https://storage.googleapis.com/syzbot-assets/29d261604425/bzImage-6d35786d.xz
>
> IMPORTANT: if you fix the issue, please add the following tag to the commit:
> Reported-by: syzbot+ae231e0552fa77b26ea1@syzkaller.appspotmail.com
>
> ------------[ cut here ]------------
> ODEBUG: free active (active state 0) object: ffff888033a47278 object type: timer_list hint: br_ip6_multicast_port_query_expired+0x0/0x380 net/bridge/br_multicast.c:-1
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
An object which contains an active timer is RCU freed....
> WARNING: lib/debugobjects.c:632 at debug_print_object lib/debugobjects.c:629 [inline], CPU#1: rcuc/1/28
> WARNING: lib/debugobjects.c:632 at __debug_check_no_obj_freed lib/debugobjects.c:1116 [inline], CPU#1: rcuc/1/28
> WARNING: lib/debugobjects.c:632 at debug_check_no_obj_freed+0x405/0x550 lib/debugobjects.c:1146, CPU#1: rcuc/1/28
> Modules linked in:
> CPU: 1 UID: 0 PID: 28 Comm: rcuc/1 Tainted: G L syzkaller #0 PREEMPT_{RT,(full)}
> Tainted: [L]=SOFTLOCKUP
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 04/18/2026
> RIP: 0010:debug_print_object lib/debugobjects.c:629 [inline]
> RIP: 0010:__debug_check_no_obj_freed lib/debugobjects.c:1116 [inline]
> RIP: 0010:debug_check_no_obj_freed+0x44a/0x550 lib/debugobjects.c:1146
> Code: 89 44 24 20 e8 27 81 7f fd 48 8b 44 24 20 4c 8b 4d 00 4c 89 ef 48 c7 c6 a0 53 a7 8b 48 c7 c2 20 59 a7 8b 8b 0c 24 4d 89 f8 50 <67> 48 0f b9 3a 48 83 c4 08 4c 8b 6c 24 18 48 b9 00 00 00 00 00 fc
> RSP: 0018:ffffc90000a2fa90 EFLAGS: 00010246
> RAX: ffffffff89f32240 RBX: ffffffff99a8edc0 RCX: 0000000000000000
> RDX: ffffffff8ba75920 RSI: ffffffff8ba753a0 RDI: ffffffff8f933740
> RBP: ffffffff8b4f5380 R08: ffff888033a47278 R09: ffffffff8b4f6700
> R10: dffffc0000000000 R11: ffffffff81b0c180 R12: ffff888033a47400
> R13: ffffffff8f933740 R14: ffff888033a47000 R15: ffff888033a47278
> FS: 0000000000000000(0000) GS:ffff888126279000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00007fb873a4da08 CR3: 000000004cc78000 CR4: 00000000003526f0
> Call Trace:
> <TASK>
> slab_free_hook mm/slub.c:2620 [inline]
> slab_free mm/slub.c:6250 [inline]
> kfree+0x13e/0x6c0 mm/slub.c:6565
> kobject_cleanup lib/kobject.c:689 [inline]
> kobject_release lib/kobject.c:720 [inline]
> kref_put include/linux/kref.h:65 [inline]
> kobject_put+0x228/0x560 lib/kobject.c:737
> rcu_do_batch kernel/rcu/tree.c:2617 [inline]
> rcu_core kernel/rcu/tree.c:2869 [inline]
> rcu_cpu_kthread+0x99e/0x1470 kernel/rcu/tree.c:2957
> smpboot_thread_fn+0x541/0xa50 kernel/smpboot.c:160
> kthread+0x388/0x470 kernel/kthread.c:436
> ret_from_fork+0x514/0xb70 arch/x86/kernel/process.c:158
> ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
> </TASK>
^ permalink raw reply
* Re: [PATCH net v5 4/4] selftests: bonding: add test for lacp_strict mode
From: Breno Leitao @ 2026-05-06 16:33 UTC (permalink / raw)
To: Louis Scalbert
Cc: netdev, andrew+netdev, jv, edumazet, kuba, pabeni, fbl, andy,
shemminger, maheshb, jonas.gorski
In-Reply-To: <20260506161144.465485-5-louis.scalbert@6wind.com>
On Wed, May 06, 2026 at 06:11:44PM +0200, Louis Scalbert wrote:
> Add a test for the bonding lacp_strict mode.
The test seems fine, although there seems to be a fair amount of copy & paste.
Any chance you can make it a bit more modular with some helpers that can be
reused?
> diff --git a/tools/testing/selftests/drivers/net/bonding/Makefile b/tools/testing/selftests/drivers/net/bonding/Makefile
> index 9af5f84edd37..91269e7ceb63 100644
> --- a/tools/testing/selftests/drivers/net/bonding/Makefile
> +++ b/tools/testing/selftests/drivers/net/bonding/Makefile
> @@ -7,6 +7,7 @@ TEST_PROGS := \
> bond-eth-type-change.sh \
> bond-lladdr-target.sh \
> bond_ipsec_offload.sh \
> + bond_lacp_strict.sh \
TEST_PROGS is sorted alphabetically; bond_lacp_strict.sh should sort
after bond_lacp_prio.sh, not before it.
> +COLLECTING_DISTRIBUTING_MASK=48
> +COLLECTING_DISTRIBUTING=48
> +FAILED=0
FAILED is a bit misleading here.
> + # Allow LACP trames forwarding on bridge ports
s/trames/frames ?
> + RET=1
> +
> +}
Remove the extra line above.
> +
> +compare_state() {
> + local actual_state=$1
> + local expected_state=$2
> + local iface=$3
> + local last_attempt=$4
> +
> + [ $((actual_state & COLLECTING_DISTRIBUTING_MASK)) -eq "$expected_state" ] \
> + && return 0
The line above is hard to read, and the indentation is a bit off as well.
^ permalink raw reply
* Re: [PATCH net 06/12] selftests: drv-net: add shaper test for duplicate leaves
From: Breno Leitao @ 2026-05-06 16:40 UTC (permalink / raw)
To: Jakub Kicinski
Cc: davem, netdev, edumazet, pabeni, andrew+netdev, horms, shuah,
linux-kselftest
In-Reply-To: <20260506000628.1501691-7-kuba@kernel.org>
On Tue, May 05, 2026 at 05:06:22PM -0700, Jakub Kicinski wrote:
> Add test exercising duplicate leaves.
>
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Breno Leitao <leitao@debian.org>
> diff --git a/tools/testing/selftests/drivers/net/shaper.py b/tools/testing/selftests/drivers/net/shaper.py
> index 11310f19bfa0..f7872c931bf6 100755
> --- a/tools/testing/selftests/drivers/net/shaper.py
> +++ b/tools/testing/selftests/drivers/net/shaper.py
> def main() -> None:
> with NetDrvEnv(__file__, queue_count=4) as cfg:
> cfg.queues = False
> @@ -453,7 +471,10 @@ from lib.py import cmd
> basic_groups,
> qgroups,
> delegation,
> - queue_update], args=(cfg, NetshaperFamily()))
> + queue_update,
> + dup_leaves
> + ],
Nit: dup_leaves], on a single line matches the surrounding style and
avoids the dangling bracket on its own line. Not worth a respin on its
own.
^ permalink raw reply
* Re: assert in phylink.c with lan7801 and dp83tc811 since kernel 6.18
From: Andrew Lunn @ 2026-05-06 16:40 UTC (permalink / raw)
To: Sven Schuchmann; +Cc: Maxime Chevallier, netdev@vger.kernel.org
In-Reply-To: <BEZP281MB22457317618130F28E84E3B4D93F2@BEZP281MB2245.DEUP281.PROD.OUTLOOK.COM>
> phylink_dbg(pl, "phy: --3.4 supported 0x%x\n", supported);
supported is a pointer, so you cannot get anything useful from
printing its value.
> [ 2.771616] lan78xx 1-1.4:1.0 (unnamed net_device) (uninitialized): validation of rgmii-id with support 00000000,00000000,00000000,00006280 and advertisement 00000000,00000000,00000000,00006280 failed: -EINVAL
> But I do not get what phylink_is_empty_linkmode() is doing...
The supported value is a combination of a few things. It indicates if
the PHY supports auto negotiation, pause, asym pause, if the link is
using twisted pair, fibre etc. It also lists the speeds the PHY
supports. There is a list here:
https://elixir.bootlin.com/linux/v7.0.1/source/include/uapi/linux/ethtool.h#L1963
Decoding 00006280 we get:
ETHTOOL_LINK_MODE_TP_BIT,
ETHTOOL_LINK_MODE_MII_BIT,
ETHTOOL_LINK_MODE_Pause_BIT,
ETHTOOL_LINK_MODE_Asym_Pause_BIT
So Maxime was correct, it is not listing any link speeds. From what I
understand, it is a 100BaseT1 PHY? So you would expect to see:
ETHTOOL_LINK_MODE_100baseT1_Full_BIT
which is bit 67.
phylib asks the PHY what it can do. For BaseT1 that would be
genphy_c45_pma_baset1_read_abilities()
https://elixir.bootlin.com/linux/v7.0.1/source/drivers/net/phy/phy-c45.c#L971
Either the PHY has the wrong value in its register, or the function is
not getting called.
The PHY is a dp83tc811?
https://elixir.bootlin.com/linux/v7.0.1/source/drivers/net/phy/dp83tc811.c#L387
It does not set .get_features. However, dp83tg720.c does:
https://elixir.bootlin.com/linux/v7.0.1/source/drivers/net/phy/dp83tg720.c#L654
So try adding:
.get_features = genphy_c45_pma_read_ext_abilities,
Andrew
^ permalink raw reply
* [PATCH net-next v2] gve: Use generic power management
From: Vaibhav Gupta @ 2026-05-06 16:47 UTC (permalink / raw)
To: Joshua Washington, Harshitha Ramamurthy, Andrew Lunn,
David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Willem de Bruijn, Ankit Garg, Tim Hostetler, Alok Tiwari,
John Fraker, Matt Olson, Praveen Kaligineedi
Cc: Vaibhav Gupta, netdev, linux-kernel, Alexander Lobakin
In-Reply-To: <20260504182139.604925-1-vaibhavgupta40@gmail.com>
Switch to the generic power management and remove the usage of legacy
(pci_driver) hooks.
Signed-off-by: Vaibhav Gupta <vaibhavgupta40@gmail.com>
Reviewed-by: Harshitha Ramamurthy <hramamurthy@google.com>
Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
---
v2: wrap "&gve_pm_ops" inside pm_sleep_ptr(), and add 'net-next' tag, as
suggested by Harshitha.
---
drivers/net/ethernet/google/gve/gve_main.c | 15 +++++++--------
1 file changed, 7 insertions(+), 8 deletions(-)
diff --git a/drivers/net/ethernet/google/gve/gve_main.c b/drivers/net/ethernet/google/gve/gve_main.c
index 424d973c97f2..00750643e614 100644
--- a/drivers/net/ethernet/google/gve/gve_main.c
+++ b/drivers/net/ethernet/google/gve/gve_main.c
@@ -2967,9 +2967,9 @@ static void gve_shutdown(struct pci_dev *pdev)
rtnl_unlock();
}
-#ifdef CONFIG_PM
-static int gve_suspend(struct pci_dev *pdev, pm_message_t state)
+static int gve_suspend(struct device *dev)
{
+ struct pci_dev *pdev = to_pci_dev(dev);
struct net_device *netdev = pci_get_drvdata(pdev);
struct gve_priv *priv = netdev_priv(netdev);
bool was_up = netif_running(priv->dev);
@@ -2990,8 +2990,9 @@ static int gve_suspend(struct pci_dev *pdev, pm_message_t state)
return 0;
}
-static int gve_resume(struct pci_dev *pdev)
+static int gve_resume(struct device *dev)
{
+ struct pci_dev *pdev = to_pci_dev(dev);
struct net_device *netdev = pci_get_drvdata(pdev);
struct gve_priv *priv = netdev_priv(netdev);
int err;
@@ -3004,7 +3005,8 @@ static int gve_resume(struct pci_dev *pdev)
rtnl_unlock();
return err;
}
-#endif /* CONFIG_PM */
+
+static DEFINE_SIMPLE_DEV_PM_OPS(gve_pm_ops, gve_suspend, gve_resume);
static const struct pci_device_id gve_id_table[] = {
{ PCI_DEVICE(PCI_VENDOR_ID_GOOGLE, PCI_DEV_ID_GVNIC) },
@@ -3017,10 +3019,7 @@ static struct pci_driver gve_driver = {
.probe = gve_probe,
.remove = gve_remove,
.shutdown = gve_shutdown,
-#ifdef CONFIG_PM
- .suspend = gve_suspend,
- .resume = gve_resume,
-#endif
+ .driver.pm = pm_sleep_ptr(&gve_pm_ops),
};
module_pci_driver(gve_driver);
--
2.53.0
^ permalink raw reply related
* [PATCH net-next v2] gve: Use generic power management
From: Vaibhav Gupta @ 2026-05-06 16:50 UTC (permalink / raw)
To: Joshua Washington, Harshitha Ramamurthy, Andrew Lunn,
David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Willem de Bruijn, Ankit Garg, Tim Hostetler, Alok Tiwari,
John Fraker, Matt Olson, Praveen Kaligineedi
Cc: Vaibhav Gupta, netdev, linux-kernel, Alexander Lobakin
Switch to the generic power management and remove the usage of legacy
(pci_driver) hooks.
Signed-off-by: Vaibhav Gupta <vaibhavgupta40@gmail.com>
Reviewed-by: Harshitha Ramamurthy <hramamurthy@google.com>
Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
---
v2: wrap "&gve_pm_ops" inside pm_sleep_ptr(), and add 'net-next' tag, as
suggested by Harshitha.
---
drivers/net/ethernet/google/gve/gve_main.c | 15 +++++++--------
1 file changed, 7 insertions(+), 8 deletions(-)
diff --git a/drivers/net/ethernet/google/gve/gve_main.c b/drivers/net/ethernet/google/gve/gve_main.c
index 424d973c97f2..00750643e614 100644
--- a/drivers/net/ethernet/google/gve/gve_main.c
+++ b/drivers/net/ethernet/google/gve/gve_main.c
@@ -2967,9 +2967,9 @@ static void gve_shutdown(struct pci_dev *pdev)
rtnl_unlock();
}
-#ifdef CONFIG_PM
-static int gve_suspend(struct pci_dev *pdev, pm_message_t state)
+static int gve_suspend(struct device *dev)
{
+ struct pci_dev *pdev = to_pci_dev(dev);
struct net_device *netdev = pci_get_drvdata(pdev);
struct gve_priv *priv = netdev_priv(netdev);
bool was_up = netif_running(priv->dev);
@@ -2990,8 +2990,9 @@ static int gve_suspend(struct pci_dev *pdev, pm_message_t state)
return 0;
}
-static int gve_resume(struct pci_dev *pdev)
+static int gve_resume(struct device *dev)
{
+ struct pci_dev *pdev = to_pci_dev(dev);
struct net_device *netdev = pci_get_drvdata(pdev);
struct gve_priv *priv = netdev_priv(netdev);
int err;
@@ -3004,7 +3005,8 @@ static int gve_resume(struct pci_dev *pdev)
rtnl_unlock();
return err;
}
-#endif /* CONFIG_PM */
+
+static DEFINE_SIMPLE_DEV_PM_OPS(gve_pm_ops, gve_suspend, gve_resume);
static const struct pci_device_id gve_id_table[] = {
{ PCI_DEVICE(PCI_VENDOR_ID_GOOGLE, PCI_DEV_ID_GVNIC) },
@@ -3017,10 +3019,7 @@ static struct pci_driver gve_driver = {
.probe = gve_probe,
.remove = gve_remove,
.shutdown = gve_shutdown,
-#ifdef CONFIG_PM
- .suspend = gve_suspend,
- .resume = gve_resume,
-#endif
+ .driver.pm = pm_sleep_ptr(&gve_pm_ops),
};
module_pci_driver(gve_driver);
--
2.53.0
^ permalink raw reply related
* Re: [PATCH net-next v6 0/2] net: mana: add ethtool private flag for full-page RX buffers
From: Dipayaan Roy @ 2026-05-06 16:52 UTC (permalink / raw)
To: Jakub Kicinski
Cc: David Wei, kys, haiyangz, wei.liu, decui, andrew+netdev, davem,
edumazet, pabeni, leon, longli, kotaranov, horms, shradhagupta,
ssengar, ernis, shirazsaleem, linux-hyperv, netdev, linux-kernel,
linux-rdma, stephen, jacob.e.keller, leitao, kees, john.fastabend,
hawk, bpf, daniel, ast, sdf, dipayanroy
In-Reply-To: <20260427131745.2eac52ef@kernel.org>
On Mon, Apr 27, 2026 at 01:17:45PM -0700, Jakub Kicinski wrote:
> On Sat, 25 Apr 2026 01:05:43 -0700 Dipayaan Roy wrote:
> > Hi Jakub,
> > with this new data from David, is it convincing enough for a mana driver
> > specific private flag, which can be set from user space by a udev rule
> > by detecting the underlying platform? If not then I will send the next
> > version with the other rxbuflen approach.
>
> I think so, thank you both for the testing.
> Please look out for the net-next opening up and repost the patches.
> (The reopening is delayed, it was supposed to happen already but I
> can't get a clean run out of our CI, sigh)
Hi Jakub,
I have reposted the patches now.
Regards
Dipayaan Roy
^ permalink raw reply
* [PATCH mptcp-next v6 0/4] mptcp: MSG_ERRQUEUE support on the parent socket
From: David Carlier @ 2026-05-06 16:55 UTC (permalink / raw)
To: mptcp
Cc: Matthieu Baerts, Paolo Abeni, Mat Martineau, Geliang Tang, netdev,
David Carlier
This series lets MPTCP applications use poll(EPOLLERR) and
recvmsg(MSG_ERRQUEUE) on the parent socket to drain TX timestamps,
MSG_ZEROCOPY completion notifications and SO_EE_ORIGIN_LOCAL events
that are produced by the subflows, the same way they would on a plain
TCP socket. ICMP-derived errors stay on the subflow queue: the legacy
RECVERR ABI cannot convey their per-subflow peer identity, and they
are intended for a future MPTCP_RECERR channel.
Patch 1 factors the existing inet_flags subflow-propagation hard-coded
list into a mask, so subsequent patches can extend it without churn.
Patch 2 makes IP_RECVERR / IPV6_RECVERR (and the RFC4884 variants)
propagate to the subflows. The parent stores the bit so MPTCP-aware
helpers can branch on it.
Patch 3 splices subflow err-skbs onto the parent's sk_error_queue at
error-report time. mptcp_poll() and __mptcp_subflow_error_report()
already handle the parent path, so user-visible behaviour matches
plain TCP.
Patch 4 is a selftest covering the propagation path.
Changes in v6 (addresses sashiko v5 review,
https://sashiko.dev/#/patchset/cover.1777756707.git.devnexen@gmail.com):
- patch 2/4: take lock_sock() before the parent ip_setsockopt() and
re-read the freshly stored RECVERR bit via inet_test_bit() inside
the critical section, then propagate that to subflows. Two racing
setsockopt() callers can no longer leave parent and subflows
desynchronised. (sashiko v5 #1, High)
- patch 2/4: drop the local 4-byte snapshot and pass the user buffer
straight through to ip_setsockopt() / ipv6_setsockopt(), so 1-byte
boolean writes (char on=1; setsockopt(.., IP_RECVERR, &on, 1))
keep the same ABI as plain TCP. (sashiko v5 #2, High)
- patch 3/4: drain the parent err-queue first in mptcp_recv_error(),
then splice from the subflows. A previous splice that failed under
rmem pressure is retried once recvmsg(MSG_ERRQUEUE) frees parent
space, and the successful sock_queue_err_skb() re-asserts EPOLLERR
so userspace knows to drain again. No permanent event loss.
(sashiko v5 #3, High)
- patch 3/4: use skb_queue_empty_lockless() in mptcp_recv_error()'s
subflow loop, matching what mptcp_poll() already does. The plain
skb_queue_empty() pointer compare tripped KCSAN against softirq
writers. (sashiko v5 #4, Medium)
v5: https://lore.kernel.org/mptcp/cover.1777756707.git.devnexen@gmail.com/
David Carlier (4):
mptcp: sockopt: factor inet_flags propagation into a mask
mptcp: propagate RECVERR sockopts to subflows
mptcp: support MSG_ERRQUEUE on the parent socket
selftests: mptcp: cover IP_RECVERR sockopt propagation
net/mptcp/protocol.c | 74 +++++++++-
net/mptcp/sockopt.c | 136 ++++++++++++++----
.../selftests/net/mptcp/mptcp_sockopt.c | 55 +++++++
3 files changed, 235 insertions(+), 30 deletions(-)
base-commit: aa15c271d79edde595fb6f4eedb52fbc16325a83
--
2.53.0
^ permalink raw reply
* [PATCH mptcp-next v6 1/4] mptcp: sockopt: factor inet_flags propagation into a mask
From: David Carlier @ 2026-05-06 16:55 UTC (permalink / raw)
To: mptcp
Cc: Matthieu Baerts, Paolo Abeni, Mat Martineau, Geliang Tang, netdev,
David Carlier
In-Reply-To: <cover.1778086500.git.devnexen@gmail.com>
Introduce MPTCP_INET_FLAGS_MASK and replace the per-flag
inet_assign_bit() calls in sync_socket_options() with a loop driven
by the mask that calls assign_bit() per set bit, preserving the
per-bit atomicity of the original. Further flags propagated by MPTCP
can be added by extending the mask rather than touching the call
site.
No functional change.
Suggested-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David Carlier <devnexen@gmail.com>
---
net/mptcp/sockopt.c | 15 ++++++++++++---
1 file changed, 12 insertions(+), 3 deletions(-)
diff --git a/net/mptcp/sockopt.c b/net/mptcp/sockopt.c
index 1cf608e7357b..20c737b2e704 100644
--- a/net/mptcp/sockopt.c
+++ b/net/mptcp/sockopt.c
@@ -16,6 +16,10 @@
#define MIN_INFO_OPTLEN_SIZE 16
#define MIN_FULL_INFO_OPTLEN_SIZE 40
+#define MPTCP_INET_FLAGS_MASK \
+ (BIT(INET_FLAGS_TRANSPARENT) | \
+ BIT(INET_FLAGS_FREEBIND) | \
+ BIT(INET_FLAGS_BIND_ADDRESS_NO_PORT))
static struct sock *__mptcp_tcp_fallback(struct mptcp_sock *msk)
{
@@ -1540,6 +1544,9 @@ static void sync_socket_options(struct mptcp_sock *msk, struct sock *ssk)
{
static const unsigned int tx_rx_locks = SOCK_RCVBUF_LOCK | SOCK_SNDBUF_LOCK;
struct sock *sk = (struct sock *)msk;
+ unsigned long mask = MPTCP_INET_FLAGS_MASK;
+ unsigned long src;
+ int b;
bool keep_open;
keep_open = sock_flag(sk, SOCK_KEEPOPEN);
@@ -1586,9 +1593,11 @@ static void sync_socket_options(struct mptcp_sock *msk, struct sock *ssk)
tcp_sock_set_keepcnt(ssk, msk->keepalive_cnt);
tcp_sock_set_maxseg(ssk, msk->maxseg);
- inet_assign_bit(TRANSPARENT, ssk, inet_test_bit(TRANSPARENT, sk));
- inet_assign_bit(FREEBIND, ssk, inet_test_bit(FREEBIND, sk));
- inet_assign_bit(BIND_ADDRESS_NO_PORT, ssk, inet_test_bit(BIND_ADDRESS_NO_PORT, sk));
+ src = READ_ONCE(inet_sk(sk)->inet_flags);
+
+ for_each_set_bit(b, &mask, BITS_PER_LONG)
+ assign_bit(b, &inet_sk(ssk)->inet_flags, src & BIT(b));
+
WRITE_ONCE(inet_sk(ssk)->local_port_range, READ_ONCE(inet_sk(sk)->local_port_range));
}
--
2.53.0
^ permalink raw reply related
* [PATCH mptcp-next v6 2/4] mptcp: propagate RECVERR sockopts to subflows
From: David Carlier @ 2026-05-06 16:55 UTC (permalink / raw)
To: mptcp
Cc: Matthieu Baerts, Paolo Abeni, Mat Martineau, Geliang Tang, netdev,
David Carlier
In-Reply-To: <cover.1778086500.git.devnexen@gmail.com>
Propagate IP_RECVERR/IP_RECVERR_RFC4884 and
IPV6_RECVERR/IPV6_RECVERR_RFC4884 from the MPTCP socket to existing
and future subflows.
mptcp_setsockopt_recverr() passes optval straight through to
ip_setsockopt() / ipv6_setsockopt() on the parent socket, re-reads the
stored RECVERR bit under lock_sock(), and forwards that value to every
subflow via mptcp_setsockopt_all_sf(), which bumps msk->setsockopt_seq
on success. Newly-joining subflows pick up the four RECVERR bits through
sync_socket_options() now that MPTCP_INET_FLAGS_MASK covers them.
mptcp_setsockopt_all_sf() skips IPv4 subflows when called with
SOL_IPV6 to avoid the -ENOPROTOOPT that ip_setsockopt() returns on
level mismatch in AF_INET6 msks carrying IPv4 subflows.
Suggested-by: Paolo Abeni <pabeni@redhat.com>
Assisted-by: Codex:gpt-5
Signed-off-by: David Carlier <devnexen@gmail.com>
---
net/mptcp/sockopt.c | 123 ++++++++++++++++++++++++++++++++++++--------
1 file changed, 101 insertions(+), 22 deletions(-)
diff --git a/net/mptcp/sockopt.c b/net/mptcp/sockopt.c
index 20c737b2e704..3046e9eb2c33 100644
--- a/net/mptcp/sockopt.c
+++ b/net/mptcp/sockopt.c
@@ -8,6 +8,8 @@
#include <linux/kernel.h>
#include <linux/module.h>
+#include <net/ip.h>
+#include <net/ipv6.h>
#include <net/sock.h>
#include <net/protocol.h>
#include <net/tcp.h>
@@ -19,7 +21,11 @@
#define MPTCP_INET_FLAGS_MASK \
(BIT(INET_FLAGS_TRANSPARENT) | \
BIT(INET_FLAGS_FREEBIND) | \
- BIT(INET_FLAGS_BIND_ADDRESS_NO_PORT))
+ BIT(INET_FLAGS_BIND_ADDRESS_NO_PORT) | \
+ BIT(INET_FLAGS_RECVERR) | \
+ BIT(INET_FLAGS_RECVERR_RFC4884) | \
+ BIT(INET_FLAGS_RECVERR6) | \
+ BIT(INET_FLAGS_RECVERR6_RFC4884))
static struct sock *__mptcp_tcp_fallback(struct mptcp_sock *msk)
{
@@ -388,6 +394,81 @@ static int mptcp_setsockopt_sol_socket(struct mptcp_sock *msk, int optname,
return -EOPNOTSUPP;
}
+static int mptcp_setsockopt_all_sf(struct mptcp_sock *msk, int level,
+ int optname, sockptr_t optval,
+ unsigned int optlen)
+{
+ struct mptcp_subflow_context *subflow;
+ int ret = 0;
+
+ mptcp_for_each_subflow(msk, subflow) {
+ struct sock *ssk = mptcp_subflow_tcp_sock(subflow);
+
+ if (level == SOL_IPV6 && ssk->sk_family != AF_INET6)
+ continue;
+
+ ret = tcp_setsockopt(ssk, level, optname, optval, optlen);
+ if (ret)
+ break;
+ }
+
+ if (!ret)
+ sockopt_seq_inc(msk);
+
+ return ret;
+}
+
+static int mptcp_setsockopt_recverr(struct mptcp_sock *msk, int level,
+ int optname, sockptr_t optval,
+ unsigned int optlen)
+{
+ struct sock *sk = (struct sock *)msk;
+ int val, ret;
+
+ /* Let ip_setsockopt() / ipv6_setsockopt() validate optval and optlen
+ * (so 1-byte boolean writes keep the same ABI as plain TCP) and update
+ * the parent's RECVERR bit. Re-read that bit under lock_sock() and
+ * push it to the subflows: concurrent setsockopt callers cannot leave
+ * parent and subflows desynchronized this way.
+ */
+ if (level == SOL_IP)
+ ret = ip_setsockopt(sk, level, optname, optval, optlen);
+#if IS_ENABLED(CONFIG_IPV6)
+ else if (level == SOL_IPV6)
+ ret = ipv6_setsockopt(sk, level, optname, optval, optlen);
+#endif
+ else
+ return -EOPNOTSUPP;
+ if (ret)
+ return ret;
+
+ lock_sock(sk);
+ switch (optname) {
+ case IP_RECVERR:
+ val = inet_test_bit(RECVERR, sk);
+ break;
+ case IP_RECVERR_RFC4884:
+ val = inet_test_bit(RECVERR_RFC4884, sk);
+ break;
+#if IS_ENABLED(CONFIG_IPV6)
+ case IPV6_RECVERR:
+ val = inet6_test_bit(RECVERR6, sk);
+ break;
+ case IPV6_RECVERR_RFC4884:
+ val = inet6_test_bit(RECVERR6_RFC4884, sk);
+ break;
+#endif
+ default:
+ release_sock(sk);
+ return -EOPNOTSUPP;
+ }
+
+ ret = mptcp_setsockopt_all_sf(msk, level, optname,
+ KERNEL_SOCKPTR(&val), sizeof(val));
+ release_sock(sk);
+ return ret;
+}
+
static int mptcp_setsockopt_v6(struct mptcp_sock *msk, int optname,
sockptr_t optval, unsigned int optlen)
{
@@ -430,6 +511,10 @@ static int mptcp_setsockopt_v6(struct mptcp_sock *msk, int optname,
release_sock(sk);
break;
+ case IPV6_RECVERR:
+ case IPV6_RECVERR_RFC4884:
+ ret = mptcp_setsockopt_recverr(msk, SOL_IPV6, optname, optval, optlen);
+ break;
}
return ret;
@@ -775,6 +860,9 @@ static int mptcp_setsockopt_v4(struct mptcp_sock *msk, int optname,
return mptcp_setsockopt_sol_ip_set(msk, optname, optval, optlen);
case IP_TOS:
return mptcp_setsockopt_v4_set_tos(msk, optname, optval, optlen);
+ case IP_RECVERR:
+ case IP_RECVERR_RFC4884:
+ return mptcp_setsockopt_recverr(msk, SOL_IP, optname, optval, optlen);
}
return -EOPNOTSUPP;
@@ -802,27 +890,6 @@ static int mptcp_setsockopt_first_sf_only(struct mptcp_sock *msk, int level, int
return ret;
}
-static int mptcp_setsockopt_all_sf(struct mptcp_sock *msk, int level,
- int optname, sockptr_t optval,
- unsigned int optlen)
-{
- struct mptcp_subflow_context *subflow;
- int ret = 0;
-
- mptcp_for_each_subflow(msk, subflow) {
- struct sock *ssk = mptcp_subflow_tcp_sock(subflow);
-
- ret = tcp_setsockopt(ssk, level, optname, optval, optlen);
- if (ret)
- break;
- }
-
- if (!ret)
- sockopt_seq_inc(msk);
-
- return ret;
-}
-
static int mptcp_setsockopt_sol_tcp(struct mptcp_sock *msk, int optname,
sockptr_t optval, unsigned int optlen)
{
@@ -1467,6 +1534,12 @@ static int mptcp_getsockopt_v4(struct mptcp_sock *msk, int optname,
case IP_LOCAL_PORT_RANGE:
return mptcp_put_int_option(msk, optval, optlen,
READ_ONCE(inet_sk(sk)->local_port_range));
+ case IP_RECVERR:
+ return mptcp_put_int_option(msk, optval, optlen,
+ inet_test_bit(RECVERR, sk));
+ case IP_RECVERR_RFC4884:
+ return mptcp_put_int_option(msk, optval, optlen,
+ inet_test_bit(RECVERR_RFC4884, sk));
}
return -EOPNOTSUPP;
@@ -1487,6 +1560,12 @@ static int mptcp_getsockopt_v6(struct mptcp_sock *msk, int optname,
case IPV6_FREEBIND:
return mptcp_put_int_option(msk, optval, optlen,
inet_test_bit(FREEBIND, sk));
+ case IPV6_RECVERR:
+ return mptcp_put_int_option(msk, optval, optlen,
+ inet6_test_bit(RECVERR6, sk));
+ case IPV6_RECVERR_RFC4884:
+ return mptcp_put_int_option(msk, optval, optlen,
+ inet6_test_bit(RECVERR6_RFC4884, sk));
}
return -EOPNOTSUPP;
--
2.53.0
* [PATCH mptcp-next v6 3/4] mptcp: support MSG_ERRQUEUE on the parent socket
From: David Carlier @ 2026-05-06 16:55 UTC (permalink / raw)
To: mptcp
Cc: Matthieu Baerts, Paolo Abeni, Mat Martineau, Geliang Tang, netdev,
David Carlier
In-Reply-To: <cover.1778086500.git.devnexen@gmail.com>
Splice pending err skbs from each subflow's error queue onto the parent
msk's error queue at error-report time, so poll() and recvmsg(MSG_ERRQUEUE)
on the parent socket observe TX timestamps and MSG_ZEROCOPY completion
notifications through the standard inet ABI.
The splice filters by SO_EE_ORIGIN: TIMESTAMPING, ZEROCOPY and LOCAL
events are forwarded to the parent because they are tied to user-handed data,
not to a specific path; subflow-level ICMP errors are dropped because
the legacy RECVERR ABI cannot meaningfully convey their per-subflow peer
identity to single-path-aware userspace. Such events will be carried by
a future MPTCP_RECERR channel.
mptcp_recv_error() retries the splice on the pull side: if
sock_queue_err_skb() previously failed under rmem pressure, the skb
stays on the subflow queue, and the next recvmsg(MSG_ERRQUEUE) splices
it once the parent's queue has been drained.
Suggested-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David Carlier <devnexen@gmail.com>
---
net/mptcp/protocol.c | 74 ++++++++++++++++++++++++++++++++++++++++----
1 file changed, 68 insertions(+), 6 deletions(-)
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 0db50e3715c3..203ee37f57e0 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -11,6 +11,7 @@
#include <linux/netdevice.h>
#include <linux/sched/signal.h>
#include <linux/atomic.h>
+#include <linux/errqueue.h>
#include <net/aligned_data.h>
#include <net/rps.h>
#include <net/sock.h>
@@ -815,21 +816,52 @@ static bool __mptcp_ofo_queue(struct mptcp_sock *msk)
return moved;
}
+static bool mptcp_errqueue_skb_forwardable(const struct sk_buff *skb)
+{
+ u8 origin = SKB_EXT_ERR(skb)->ee.ee_origin;
+
+ return origin == SO_EE_ORIGIN_TIMESTAMPING ||
+ origin == SO_EE_ORIGIN_ZEROCOPY ||
+ origin == SO_EE_ORIGIN_LOCAL;
+}
+
+static bool __mptcp_subflow_splice_errqueue(struct sock *sk, struct sock *ssk)
+{
+ struct sk_buff *skb;
+ bool moved = false;
+
+ while ((skb = skb_dequeue(&ssk->sk_error_queue))) {
+ if (!mptcp_errqueue_skb_forwardable(skb)) {
+ kfree_skb(skb); /* path-specific (ICMP) — belongs in MPTCP_RECERR */
+ continue;
+ }
+ if (sock_queue_err_skb(sk, skb)) {
+ skb_queue_head(&ssk->sk_error_queue, skb);
+ break;
+ }
+ moved = true;
+ }
+
+ return moved;
+}
+
static bool __mptcp_subflow_error_report(struct sock *sk, struct sock *ssk)
{
int ssk_state;
+ bool report;
int err;
+ report = __mptcp_subflow_splice_errqueue(sk, ssk);
+
/* only propagate errors on fallen-back sockets or
* on MPC connect
*/
if (sk->sk_state != TCP_SYN_SENT && !__mptcp_check_fallback(mptcp_sk(sk)))
- return false;
+ goto out;
err = sock_error(ssk);
if (!err)
- return false;
-
+ goto out;
/* We need to propagate only transition to CLOSE state.
* Orphaned socket will see such state change via
* subflow_sched_work_if_closed() and that path will properly
@@ -839,6 +871,11 @@ static bool __mptcp_subflow_error_report(struct sock *sk, struct sock *ssk)
if (ssk_state == TCP_CLOSE && !sock_flag(sk, SOCK_DEAD))
mptcp_set_state(sk, ssk_state);
WRITE_ONCE(sk->sk_err, -err);
+ report = true;
+
+out:
+ if (!report)
+ return false;
/* This barrier is coupled with smp_rmb() in mptcp_poll() */
smp_wmb();
@@ -2286,6 +2323,31 @@ static unsigned int mptcp_inq_hint(const struct sock *sk)
return 0;
}
+static int mptcp_recv_error(struct sock *sk, struct msghdr *msg, int len)
+{
+ struct mptcp_sock *msk = mptcp_sk(sk);
+ struct mptcp_subflow_context *subflow;
+ int ret;
+
+ /* Drain the parent first: a previous splice may have failed under
+ * rmem pressure and the skb stayed on a subflow. Freeing space here
+ * lets the splice below succeed; sock_queue_err_skb() then re-asserts
+ * EPOLLERR so userspace knows to drain again on the next poll.
+ */
+ ret = inet_recv_error(sk, msg, len);
+
+ lock_sock(sk);
+ mptcp_for_each_subflow(msk, subflow) {
+ struct sock *ssk = mptcp_subflow_tcp_sock(subflow);
+
+ if (!skb_queue_empty_lockless(&ssk->sk_error_queue))
+ __mptcp_subflow_splice_errqueue(sk, ssk);
+ }
+ release_sock(sk);
+
+ return ret;
+}
+
static int mptcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
int flags)
{
@@ -2295,9 +2357,8 @@ static int mptcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
int target;
long timeo;
- /* MSG_ERRQUEUE is really a no-op till we support IP_RECVERR */
if (unlikely(flags & MSG_ERRQUEUE))
- return inet_recv_error(sk, msg, len);
+ return mptcp_recv_error(sk, msg, len);
lock_sock(sk);
if (unlikely(sk->sk_state == TCP_LISTEN)) {
@@ -4340,7 +4401,8 @@ static __poll_t mptcp_poll(struct file *file, struct socket *sock,
/* This barrier is coupled with smp_wmb() in __mptcp_error_report() */
smp_rmb();
- if (READ_ONCE(sk->sk_err))
+ if (READ_ONCE(sk->sk_err) ||
+ !skb_queue_empty_lockless(&sk->sk_error_queue))
mask |= EPOLLERR;
return mask;
--
2.53.0
* [PATCH mptcp-next v6 4/4] selftests: mptcp: cover IP_RECVERR sockopt propagation
From: David Carlier @ 2026-05-06 16:55 UTC (permalink / raw)
To: mptcp
Cc: Matthieu Baerts, Paolo Abeni, Mat Martineau, Geliang Tang, netdev,
David Carlier
In-Reply-To: <cover.1778086500.git.devnexen@gmail.com>
Exercise setsockopt/getsockopt of IP_RECVERR and IPV6_RECVERR on the
MPTCP parent socket, including the empty-errqueue EAGAIN contract on
MSG_ERRQUEUE|MSG_DONTWAIT.
End-to-end errqueue delivery (ICMP, TX timestamps, zerocopy) depends on
subflow-side producers that are out of scope for this series and will be
covered by follow-up work.
Assisted-by: Codex:gpt-5
Signed-off-by: David Carlier <devnexen@gmail.com>
---
.../selftests/net/mptcp/mptcp_sockopt.c | 55 +++++++++++++++++++
1 file changed, 55 insertions(+)
diff --git a/tools/testing/selftests/net/mptcp/mptcp_sockopt.c b/tools/testing/selftests/net/mptcp/mptcp_sockopt.c
index b6e58d936ebe..95bb2cc8e2ff 100644
--- a/tools/testing/selftests/net/mptcp/mptcp_sockopt.c
+++ b/tools/testing/selftests/net/mptcp/mptcp_sockopt.c
@@ -769,6 +769,60 @@ static void test_ip_tos_sockopt(int fd)
xerror("expect socklen_t == -1");
}
+static void test_ip_recverr_sockopt(int fd)
+{
+ struct iovec iov = {
+ .iov_base = &(char){ 0 },
+ .iov_len = 1,
+ };
+ struct msghdr msg = {
+ .msg_iov = &iov,
+ .msg_iovlen = 1,
+ };
+ int one = 1, zero = 0, val = -1;
+ socklen_t s = sizeof(val);
+ int level, optname, r;
+
+ switch (pf) {
+ case AF_INET:
+ level = SOL_IP;
+ optname = IP_RECVERR;
+ break;
+ case AF_INET6:
+ level = SOL_IPV6;
+ optname = IPV6_RECVERR;
+ break;
+ default:
+ xerror("Unknown pf %d\n", pf);
+ }
+
+ r = setsockopt(fd, level, optname, &one, sizeof(one));
+ if (r)
+ die_perror("setsockopt recverr on");
+
+ r = getsockopt(fd, level, optname, &val, &s);
+ if (r)
+ die_perror("getsockopt recverr on");
+ if (s != sizeof(val) || val != one)
+ xerror("recverr on mismatch val=%d len=%u", val, s);
+
+ r = recvmsg(fd, &msg, MSG_ERRQUEUE | MSG_DONTWAIT);
+ if (r != -1 || errno != EAGAIN)
+ xerror("expected empty errqueue to return EAGAIN, ret=%d errno=%d", r, errno);
+
+ r = setsockopt(fd, level, optname, &zero, sizeof(zero));
+ if (r)
+ die_perror("setsockopt recverr off");
+
+ val = -1;
+ s = sizeof(val);
+ r = getsockopt(fd, level, optname, &val, &s);
+ if (r)
+ die_perror("getsockopt recverr off");
+ if (s != sizeof(val) || val != zero)
+ xerror("recverr off mismatch val=%d len=%u", val, s);
+}
+
static int client(int pipefd)
{
int fd = -1;
@@ -787,6 +841,7 @@ static int client(int pipefd)
}
test_ip_tos_sockopt(fd);
+ test_ip_recverr_sockopt(fd);
connect_one_server(fd, pipefd);
--
2.53.0
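The empty-errqueue contract exercised by the selftest above can also be
probed in isolation. The sketch below is an assumption-laden stand-in: it
uses a plain TCP socket instead of an MPTCP one so that it runs on kernels
without MPTCP, and probe_empty_errqueue() is a hypothetical name.

```c
#include <assert.h>
#include <errno.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Returns 0 if reading an empty error queue fails with EAGAIN, 1 if the
 * environment could not set up the socket, and -1 on any other outcome.
 * Assumption: a plain TCP socket stands in for the MPTCP parent socket.
 */
static int probe_empty_errqueue(void)
{
	char byte = 0;
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1 };
	int one = 1, ret = -1;
	ssize_t r;
	int fd;

	fd = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
	if (fd < 0)
		return 1;

	/* IPPROTO_IP has the same value as SOL_IP on Linux */
	if (setsockopt(fd, IPPROTO_IP, IP_RECVERR, &one, sizeof(one))) {
		close(fd);
		return 1;
	}

	/* Nothing has been queued, so a non-blocking read must not succeed */
	r = recvmsg(fd, &msg, MSG_ERRQUEUE | MSG_DONTWAIT);
	if (r == -1 && (errno == EAGAIN || errno == EWOULDBLOCK))
		ret = 0;

	close(fd);
	return ret;
}
```

Swapping IPPROTO_TCP for IPPROTO_MPTCP would exercise the parent-socket
path that the selftest covers, on kernels and libcs that provide it.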
* [PATCH net-next v7 0/2] net: mana: add ethtool private flag for full-page RX buffers
From: Dipayaan Roy @ 2026-05-06 16:58 UTC (permalink / raw)
To: kys, haiyangz, wei.liu, decui, andrew+netdev, davem, edumazet,
kuba, pabeni, leon, longli, kotaranov, horms, shradhagupta,
ssengar, ernis, shirazsaleem, linux-hyperv, netdev, linux-kernel,
linux-rdma, stephen, jacob.e.keller, dipayanroy, leitao, kees,
john.fastabend, hawk, bpf, daniel, ast, sdf, yury.norov
On some ARM64 platforms with 4K PAGE_SIZE, using page_pool
fragments for allocation in the RX refill path (~2kB buffer per fragment)
causes a 15-20% throughput regression under high connection counts
(>16 TCP streams at 180+ Gbps). Using full-page buffers on these
platforms shows no regression and restores line-rate performance.
This behavior is observed on a single platform; other platforms
perform better with page_pool fragments, indicating that this is a
platform-specific issue rather than a page_pool issue.
This series adds an ethtool private flag "full-page-rx" to let the
user opt in to one RX buffer per page:
ethtool --set-priv-flags eth0 full-page-rx on
There is no behavioral change by default. The flag can be persisted
via a udev rule for affected platforms.
Changes in v7:
- Rebased onto net-next.
- Retained private flag approach after David Wei's testing on
Grace (ARM64) confirmed that fragment mode outperforms
full-page mode on other platforms, validating this is a
single-platform workaround rather than a generic issue.
Changes in v6:
- Added missed maintainers.
Changes in v5:
- Split prep refactor into separate patch (patch 1/2)
Changes in v4:
- Dropped the SMBIOS string parsing and added an ethtool priv flag
to reconfigure the queues with full-page RX buffers.
Changes in v3:
- changed u8* to char*
Changes in v2:
- separate reading string index and the string, remove inline.
Dipayaan Roy (2):
net: mana: refactor mana_get_strings() and mana_get_sset_count() to
use switch
net: mana: force full-page RX buffers via ethtool private flag
drivers/net/ethernet/microsoft/mana/mana_en.c | 22 ++-
.../ethernet/microsoft/mana/mana_ethtool.c | 164 ++++++++++++++----
include/net/mana/mana.h | 8 +
3 files changed, 163 insertions(+), 31 deletions(-)
--
2.43.0
* [PATCH net-next v7 1/2] net: mana: refactor mana_get_strings() and mana_get_sset_count() to use switch
From: Dipayaan Roy @ 2026-05-06 16:58 UTC (permalink / raw)
To: kys, haiyangz, wei.liu, decui, andrew+netdev, davem, edumazet,
kuba, pabeni, leon, longli, kotaranov, horms, shradhagupta,
ssengar, ernis, shirazsaleem, linux-hyperv, netdev, linux-kernel,
linux-rdma, stephen, jacob.e.keller, dipayanroy, leitao, kees,
john.fastabend, hawk, bpf, daniel, ast, sdf, yury.norov
In-Reply-To: <20260506170034.327907-1-dipayanroy@linux.microsoft.com>
Refactor mana_get_strings() and mana_get_sset_count() from if/else to
switch statements in preparation for adding ethtool private flags
support which requires handling ETH_SS_PRIV_FLAGS.
No functional change.
Signed-off-by: Dipayaan Roy <dipayanroy@linux.microsoft.com>
---
.../ethernet/microsoft/mana/mana_ethtool.c | 75 ++++++++++++-------
1 file changed, 46 insertions(+), 29 deletions(-)
diff --git a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
index 6a4b42fe0944..a28ca461c135 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
@@ -138,53 +138,70 @@ static int mana_get_sset_count(struct net_device *ndev, int stringset)
struct mana_port_context *apc = netdev_priv(ndev);
unsigned int num_queues = apc->num_queues;
- if (stringset != ETH_SS_STATS)
+ switch (stringset) {
+ case ETH_SS_STATS:
+ return ARRAY_SIZE(mana_eth_stats) +
+ ARRAY_SIZE(mana_phy_stats) +
+ ARRAY_SIZE(mana_hc_stats) +
+ num_queues * (MANA_STATS_RX_COUNT + MANA_STATS_TX_COUNT);
+ default:
return -EINVAL;
-
- return ARRAY_SIZE(mana_eth_stats) + ARRAY_SIZE(mana_phy_stats) + ARRAY_SIZE(mana_hc_stats) +
- num_queues * (MANA_STATS_RX_COUNT + MANA_STATS_TX_COUNT);
+ }
}
-static void mana_get_strings(struct net_device *ndev, u32 stringset, u8 *data)
+static void mana_get_strings_stats(struct mana_port_context *apc, u8 **data)
{
- struct mana_port_context *apc = netdev_priv(ndev);
unsigned int num_queues = apc->num_queues;
int i, j;
- if (stringset != ETH_SS_STATS)
- return;
for (i = 0; i < ARRAY_SIZE(mana_eth_stats); i++)
- ethtool_puts(&data, mana_eth_stats[i].name);
+ ethtool_puts(data, mana_eth_stats[i].name);
for (i = 0; i < ARRAY_SIZE(mana_hc_stats); i++)
- ethtool_puts(&data, mana_hc_stats[i].name);
+ ethtool_puts(data, mana_hc_stats[i].name);
for (i = 0; i < ARRAY_SIZE(mana_phy_stats); i++)
- ethtool_puts(&data, mana_phy_stats[i].name);
+ ethtool_puts(data, mana_phy_stats[i].name);
for (i = 0; i < num_queues; i++) {
- ethtool_sprintf(&data, "rx_%d_packets", i);
- ethtool_sprintf(&data, "rx_%d_bytes", i);
- ethtool_sprintf(&data, "rx_%d_xdp_drop", i);
- ethtool_sprintf(&data, "rx_%d_xdp_tx", i);
- ethtool_sprintf(&data, "rx_%d_xdp_redirect", i);
- ethtool_sprintf(&data, "rx_%d_pkt_len0_err", i);
+ ethtool_sprintf(data, "rx_%d_packets", i);
+ ethtool_sprintf(data, "rx_%d_bytes", i);
+ ethtool_sprintf(data, "rx_%d_xdp_drop", i);
+ ethtool_sprintf(data, "rx_%d_xdp_tx", i);
+ ethtool_sprintf(data, "rx_%d_xdp_redirect", i);
+ ethtool_sprintf(data, "rx_%d_pkt_len0_err", i);
for (j = 0; j < MANA_RXCOMP_OOB_NUM_PPI - 1; j++)
- ethtool_sprintf(&data, "rx_%d_coalesced_cqe_%d", i, j + 2);
+ ethtool_sprintf(data,
+ "rx_%d_coalesced_cqe_%d",
+ i,
+ j + 2);
}
for (i = 0; i < num_queues; i++) {
- ethtool_sprintf(&data, "tx_%d_packets", i);
- ethtool_sprintf(&data, "tx_%d_bytes", i);
- ethtool_sprintf(&data, "tx_%d_xdp_xmit", i);
- ethtool_sprintf(&data, "tx_%d_tso_packets", i);
- ethtool_sprintf(&data, "tx_%d_tso_bytes", i);
- ethtool_sprintf(&data, "tx_%d_tso_inner_packets", i);
- ethtool_sprintf(&data, "tx_%d_tso_inner_bytes", i);
- ethtool_sprintf(&data, "tx_%d_long_pkt_fmt", i);
- ethtool_sprintf(&data, "tx_%d_short_pkt_fmt", i);
- ethtool_sprintf(&data, "tx_%d_csum_partial", i);
- ethtool_sprintf(&data, "tx_%d_mana_map_err", i);
+ ethtool_sprintf(data, "tx_%d_packets", i);
+ ethtool_sprintf(data, "tx_%d_bytes", i);
+ ethtool_sprintf(data, "tx_%d_xdp_xmit", i);
+ ethtool_sprintf(data, "tx_%d_tso_packets", i);
+ ethtool_sprintf(data, "tx_%d_tso_bytes", i);
+ ethtool_sprintf(data, "tx_%d_tso_inner_packets", i);
+ ethtool_sprintf(data, "tx_%d_tso_inner_bytes", i);
+ ethtool_sprintf(data, "tx_%d_long_pkt_fmt", i);
+ ethtool_sprintf(data, "tx_%d_short_pkt_fmt", i);
+ ethtool_sprintf(data, "tx_%d_csum_partial", i);
+ ethtool_sprintf(data, "tx_%d_mana_map_err", i);
+ }
+}
+
+static void mana_get_strings(struct net_device *ndev, u32 stringset, u8 *data)
+{
+ struct mana_port_context *apc = netdev_priv(ndev);
+
+ switch (stringset) {
+ case ETH_SS_STATS:
+ mana_get_strings_stats(apc, &data);
+ break;
+ default:
+ break;
}
}
--
2.43.0
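The refactor in the 1/2 patch above hands the string cursor to a helper
by reference, so the helper can keep advancing it slot by slot. A minimal
userspace sketch of that pattern follows. put_name() is a hypothetical
stand-in for ethtool_puts(), and SLOT_LEN matches ETH_GSTRING_LEN (32);
the remaining names are made up for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define SLOT_LEN 32 /* same size as ETH_GSTRING_LEN */

/* Hypothetical stand-in for ethtool_puts(): copy one name into the
 * current fixed-size slot and advance the caller's cursor to the next.
 */
static void put_name(char **cursor, const char *name)
{
	snprintf(*cursor, SLOT_LEN, "%s", name);
	*cursor += SLOT_LEN;
}

/* The helper receives the cursor by reference, so the caller's position
 * stays in sync after the helper returns.
 */
static void fill_queue_names(char **cursor, unsigned int num_queues)
{
	char name[SLOT_LEN];
	unsigned int i;

	for (i = 0; i < num_queues; i++) {
		snprintf(name, sizeof(name), "rx_%u_packets", i);
		put_name(cursor, name);
	}
}
```

With this shape a second helper for another string set can take the same
char ** and does not need to know how many slots were already written.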
* [PATCH net-next v7 2/2] net: mana: force full-page RX buffers via ethtool private flag
From: Dipayaan Roy @ 2026-05-06 16:58 UTC (permalink / raw)
To: kys, haiyangz, wei.liu, decui, andrew+netdev, davem, edumazet,
kuba, pabeni, leon, longli, kotaranov, horms, shradhagupta,
ssengar, ernis, shirazsaleem, linux-hyperv, netdev, linux-kernel,
linux-rdma, stephen, jacob.e.keller, dipayanroy, leitao, kees,
john.fastabend, hawk, bpf, daniel, ast, sdf, yury.norov
In-Reply-To: <20260506170034.327907-1-dipayanroy@linux.microsoft.com>
On some ARM64 platforms with 4K PAGE_SIZE, page_pool fragment
allocation in the RX refill path can cause a 15-20% throughput
regression under high connection counts (>16 TCP streams).
Add an ethtool private flag "full-page-rx" that allows the user to
force one RX buffer per page, bypassing the page_pool fragment path.
This restores line-rate (180+ Gbps) performance on affected platforms.
Usage:
ethtool --set-priv-flags eth0 full-page-rx on
There is no behavioral change by default. The flag must be explicitly
enabled by the user or by a udev rule.
The existing single-buffer-per-page logic for XDP and jumbo frames is
consolidated into a new helper mana_use_single_rxbuf_per_page() which
is now the single decision point for both the automatic and
user-controlled paths.
Signed-off-by: Dipayaan Roy <dipayanroy@linux.microsoft.com>
---
drivers/net/ethernet/microsoft/mana/mana_en.c | 22 ++++-
.../ethernet/microsoft/mana/mana_ethtool.c | 89 +++++++++++++++++++
include/net/mana/mana.h | 8 ++
3 files changed, 117 insertions(+), 2 deletions(-)
diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index 49c65cc1697c..59a1626c2be1 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -744,6 +744,25 @@ static void *mana_get_rxbuf_pre(struct mana_rxq *rxq, dma_addr_t *da)
return va;
}
+static bool
+mana_use_single_rxbuf_per_page(struct mana_port_context *apc, u32 mtu)
+{
+ /* On some platforms with 4K PAGE_SIZE, page_pool fragment allocation
+ * in the RX refill path (~2kB buffer) can cause significant throughput
+ * regression under high connection counts. Allow user to force one RX
+ * buffer per page via ethtool private flag to bypass the fragment
+ * path.
+ */
+ if (apc->priv_flags & BIT(MANA_PRIV_FLAG_USE_FULL_PAGE_RXBUF))
+ return true;
+
+ /* For xdp and jumbo frames make sure only one packet fits per page. */
+ if (mtu + MANA_RXBUF_PAD > PAGE_SIZE / 2 || mana_xdp_get(apc))
+ return true;
+
+ return false;
+}
+
/* Get RX buffer's data size, alloc size, XDP headroom based on MTU */
static void mana_get_rxbuf_cfg(struct mana_port_context *apc,
int mtu, u32 *datasize, u32 *alloc_size,
@@ -754,8 +773,7 @@ static void mana_get_rxbuf_cfg(struct mana_port_context *apc,
/* Calculate datasize first (consistent across all cases) */
*datasize = mtu + ETH_HLEN;
- /* For xdp and jumbo frames make sure only one packet fits per page */
- if (mtu + MANA_RXBUF_PAD > PAGE_SIZE / 2 || mana_xdp_get(apc)) {
+ if (mana_use_single_rxbuf_per_page(apc, mtu)) {
if (mana_xdp_get(apc)) {
*headroom = XDP_PACKET_HEADROOM;
*alloc_size = PAGE_SIZE;
diff --git a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
index a28ca461c135..0547c903f613 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
@@ -133,6 +133,10 @@ static const struct mana_stats_desc mana_phy_stats[] = {
{ "hc_tc7_tx_pause_phy", offsetof(struct mana_ethtool_phy_stats, tx_pause_tc7_phy) },
};
+static const char mana_priv_flags[MANA_PRIV_FLAG_MAX][ETH_GSTRING_LEN] = {
+ [MANA_PRIV_FLAG_USE_FULL_PAGE_RXBUF] = "full-page-rx"
+};
+
static int mana_get_sset_count(struct net_device *ndev, int stringset)
{
struct mana_port_context *apc = netdev_priv(ndev);
@@ -144,6 +148,10 @@ static int mana_get_sset_count(struct net_device *ndev, int stringset)
ARRAY_SIZE(mana_phy_stats) +
ARRAY_SIZE(mana_hc_stats) +
num_queues * (MANA_STATS_RX_COUNT + MANA_STATS_TX_COUNT);
+
+ case ETH_SS_PRIV_FLAGS:
+ return MANA_PRIV_FLAG_MAX;
+
default:
return -EINVAL;
}
@@ -192,6 +200,14 @@ static void mana_get_strings_stats(struct mana_port_context *apc, u8 **data)
}
}
+static void mana_get_strings_priv_flags(u8 **data)
+{
+ int i;
+
+ for (i = 0; i < MANA_PRIV_FLAG_MAX; i++)
+ ethtool_puts(data, mana_priv_flags[i]);
+}
+
static void mana_get_strings(struct net_device *ndev, u32 stringset, u8 *data)
{
struct mana_port_context *apc = netdev_priv(ndev);
@@ -200,6 +216,9 @@ static void mana_get_strings(struct net_device *ndev, u32 stringset, u8 *data)
case ETH_SS_STATS:
mana_get_strings_stats(apc, &data);
break;
+ case ETH_SS_PRIV_FLAGS:
+ mana_get_strings_priv_flags(&data);
+ break;
default:
break;
}
@@ -590,6 +609,74 @@ static int mana_get_link_ksettings(struct net_device *ndev,
return 0;
}
+static u32 mana_get_priv_flags(struct net_device *ndev)
+{
+ struct mana_port_context *apc = netdev_priv(ndev);
+
+ return apc->priv_flags;
+}
+
+static int mana_set_priv_flags(struct net_device *ndev, u32 priv_flags)
+{
+ struct mana_port_context *apc = netdev_priv(ndev);
+ u32 changed = apc->priv_flags ^ priv_flags;
+ u32 old_priv_flags = apc->priv_flags;
+ bool schedule_port_reset = false;
+ int err = 0;
+
+ if (!changed)
+ return 0;
+
+ /* Reject unknown bits */
+ if (priv_flags & ~GENMASK(MANA_PRIV_FLAG_MAX - 1, 0))
+ return -EINVAL;
+
+ if (changed & BIT(MANA_PRIV_FLAG_USE_FULL_PAGE_RXBUF)) {
+ apc->priv_flags = priv_flags;
+
+ if (!apc->port_is_up) {
+ /* Port is down, flag updated to apply on next up
+ * so just return.
+ */
+ return 0;
+ }
+
+ /* Pre-allocate buffers to prevent failure in mana_attach
+ * later
+ */
+ err = mana_pre_alloc_rxbufs(apc, ndev->mtu, apc->num_queues);
+ if (err) {
+ netdev_err(ndev,
+ "Insufficient memory for new allocations\n");
+ apc->priv_flags = old_priv_flags;
+ return err;
+ }
+
+ err = mana_detach(ndev, false);
+ if (err) {
+ netdev_err(ndev, "mana_detach failed: %d\n", err);
+ apc->priv_flags = old_priv_flags;
+ goto out;
+ }
+
+ err = mana_attach(ndev);
+ if (err) {
+ netdev_err(ndev, "mana_attach failed: %d\n", err);
+ apc->priv_flags = old_priv_flags;
+ schedule_port_reset = true;
+ }
+ }
+
+out:
+ mana_pre_dealloc_rxbufs(apc);
+
+ if (err && schedule_port_reset)
+ queue_work(apc->ac->per_port_queue_reset_wq,
+ &apc->queue_reset_work);
+
+ return err;
+}
+
const struct ethtool_ops mana_ethtool_ops = {
.supported_coalesce_params = ETHTOOL_COALESCE_RX_CQE_FRAMES,
.get_ethtool_stats = mana_get_ethtool_stats,
@@ -608,4 +695,6 @@ const struct ethtool_ops mana_ethtool_ops = {
.set_ringparam = mana_set_ringparam,
.get_link_ksettings = mana_get_link_ksettings,
.get_link = ethtool_op_get_link,
+ .get_priv_flags = mana_get_priv_flags,
+ .set_priv_flags = mana_set_priv_flags,
};
diff --git a/include/net/mana/mana.h b/include/net/mana/mana.h
index 3336688fed5e..fd87e3d6c1f4 100644
--- a/include/net/mana/mana.h
+++ b/include/net/mana/mana.h
@@ -30,6 +30,12 @@ enum TRI_STATE {
TRI_STATE_TRUE = 1
};
+/* MANA ethtool private flag bit positions */
+enum mana_priv_flag_bits {
+ MANA_PRIV_FLAG_USE_FULL_PAGE_RXBUF = 0,
+ MANA_PRIV_FLAG_MAX,
+};
+
/* Number of entries for hardware indirection table must be in power of 2 */
#define MANA_INDIRECT_TABLE_MAX_SIZE 512
#define MANA_INDIRECT_TABLE_DEF_SIZE 64
@@ -531,6 +537,8 @@ struct mana_port_context {
u32 rxbpre_headroom;
u32 rxbpre_frag_count;
+ u32 priv_flags;
+
struct bpf_prog *bpf_prog;
/* Create num_queues EQs, SQs, SQ-CQs, RQs and RQ-CQs, respectively. */
--
2.43.0
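The decision logic of the 2/2 patch above can be tried out in userspace
with a few lines of arithmetic. This is a sketch under stated assumptions:
PAGE_SZ assumes 4K pages, RXBUF_PAD is a placeholder value and not the
driver's MANA_RXBUF_PAD, and all function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SZ 4096u        /* assumption: 4K pages */
#define RXBUF_PAD 334u       /* assumption: placeholder padding value */
#define FLAG_FULL_PAGE_RX 0  /* bit position, first entry of the enum */
#define FLAG_MAX 1           /* number of known flag bits */

/* Reject any bit above the known flags, like the GENMASK() check. */
static bool flags_valid(uint32_t flags)
{
	uint32_t known = (1u << FLAG_MAX) - 1;

	return (flags & ~known) == 0;
}

/* One buffer per page if the user asked for it, if XDP is attached, or
 * if two padded buffers would not fit in a single page.
 */
static bool use_single_rxbuf_per_page(uint32_t flags, uint32_t mtu, bool xdp)
{
	if (flags & (1u << FLAG_FULL_PAGE_RX))
		return true;

	return mtu + RXBUF_PAD > PAGE_SZ / 2 || xdp;
}
```

With these placeholder numbers a 1500-byte MTU stays on the fragment path
unless the flag or XDP forces a full page, while a 9000-byte MTU always
takes the full-page path.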
* Re: [PATCH v4] Bluetooth: serialize accept_q access
From: Jann Horn @ 2026-05-06 17:04 UTC (permalink / raw)
To: Ren Wei, luiz.dentz
Cc: linux-bluetooth, netdev, marcel, davem, edumazet, kuba, pabeni,
horms, yuantan098, yifanwucs, tomapufckgml, bird, wangjiexun2025
In-Reply-To: <20260506114338.2873496-1-n05ec@lzu.edu.cn>
On Wed, May 6, 2026 at 1:43 PM Ren Wei <n05ec@lzu.edu.cn> wrote:
> bt_sock_poll() walks the accept queue without synchronization, while
> child teardown can unlink the same socket and drop its last reference.
> The unsynchronized accept queue walk has existed since the initial
> Bluetooth import.
>
> Protect accept_q with a dedicated lock for queue updates and polling.
> Also rework bt_accept_dequeue() to take temporary child references under
> the queue lock before dropping it and locking the child socket.
>
> Fixes: 1da177e4c3f41524e886b7f1b8a0c1fc7321cac2 ("Linux-2.6.12-rc2")
> Cc: stable@vger.kernel.org
> Reported-by: Jann Horn <jannh@google.com>
> Reported-by: Yuan Tan <yuantan098@gmail.com>
> Reported-by: Yifan Wu <yifanwucs@gmail.com>
> Reported-by: Juefei Pu <tomapufckgml@gmail.com>
> Reported-by: Xin Liu <bird@lzu.edu.cn>
> Signed-off-by: Jiexun Wang <wangjiexun2025@gmail.com>
> Signed-off-by: Ren Wei <n05ec@lzu.edu.cn>
The patch looks good to me. I have some comments below, but they're
not important - from my perspective, this patch is ready to land in
the tree.
Reviewed-by: Jann Horn <jannh@google.com>
> ---
> Changes in v4:
> - no functional changes
> - clarify that the race dates back to the initial Bluetooth import
> - update trailers
> I noticed Jann also proposed a fix at
> https://patchwork.kernel.org/project/bluetooth/patch/20260504-bluetooth-accept-uaf-fix-v1-1-1ca63c0efadd@google.com/,
> so we're adding his Reported-by tag here. Please let us know if this
> isn't appropriate.
Thanks for letting me know that my patch was redundant, and for
listing me in Reported-by.
This addresses the race I described.
(You could add the line
"Closes: https://lore.kernel.org/r/20260504-bluetooth-accept-uaf-fix-v1-1-1ca63c0efadd@google.com"
after the "Reported-by: Jann Horn <jannh@google.com>".)
[...]
> @@ -254,45 +258,72 @@ EXPORT_SYMBOL(bt_accept_enqueue);
> */
> void bt_accept_unlink(struct sock *sk)
> {
> + struct sock *parent = bt_sk(sk)->parent;
> +
> BT_DBG("sk %p state %d", sk, sk->sk_state);
>
> + spin_lock_bh(&bt_sk(parent)->accept_q_lock);
> list_del_init(&bt_sk(sk)->accept_q);
> - sk_acceptq_removed(bt_sk(sk)->parent);
> + sk_acceptq_removed(parent);
> + spin_unlock_bh(&bt_sk(parent)->accept_q_lock);
> bt_sk(sk)->parent = NULL;
> sock_put(sk);
> }
> EXPORT_SYMBOL(bt_accept_unlink);
>
> +static struct sock *bt_accept_get(struct sock *parent, struct sock *sk)
> +{
> + struct bt_sock *bt = bt_sk(parent);
> + struct sock *next = NULL;
> +
> + /* accept_q is modified from child teardown paths too, so take a
> + * temporary reference before dropping the queue lock.
> + */
> + spin_lock_bh(&bt->accept_q_lock);
> +
> + if (sk) {
> + if (bt_sk(sk)->parent != parent)
> + goto out;
This check seems redundant? The caller already bailed out if
"bt_sk(sk)->parent != parent", and lock_sock(sk) ensures that
bt_sk(sk)->parent can't change concurrently because bt_accept_unlink()
is also protected by lock_sock() or lock_sock_nested(), as the comment
above bt_accept_unlink() documents.
> +
> + if (!list_is_last(&bt_sk(sk)->accept_q, &bt->accept_q)) {
> + next = &list_next_entry(bt_sk(sk), accept_q)->sk;
> + sock_hold(next);
> + }
> + } else if (!list_empty(&bt->accept_q)) {
> + next = &list_first_entry(&bt->accept_q,
> + struct bt_sock, accept_q)->sk;
> + sock_hold(next);
> + }
> +
> +out:
> + spin_unlock_bh(&bt->accept_q_lock);
> + return next;
> +}
Hmm. This looks a bit complicated to me, and I find it hard to reason
about how accept_q walks are restarted after temporarily dropping the
lock; I think it would be nice if you could instead walk the
->accept_q while holding the accept_q_lock until you identify a socket
with the right ->sk_state, then drop the accept_q_lock and lock the
sock. Something like this diff on top of your patch (completely
untested); I have attached a properly formatted version of this diff
that you can apply with "git apply":
```
diff --git a/net/bluetooth/af_bluetooth.c b/net/bluetooth/af_bluetooth.c
index 9d68dd86023c..26e7c7198522 100644
--- a/net/bluetooth/af_bluetooth.c
+++ b/net/bluetooth/af_bluetooth.c
@@ -271,50 +271,36 @@ void bt_accept_unlink(struct sock *sk)
}
EXPORT_SYMBOL(bt_accept_unlink);
-static struct sock *bt_accept_get(struct sock *parent, struct sock *sk)
-{
- struct bt_sock *bt = bt_sk(parent);
- struct sock *next = NULL;
-
- /* accept_q is modified from child teardown paths too, so take a
- * temporary reference before dropping the queue lock.
- */
- spin_lock_bh(&bt->accept_q_lock);
-
- if (sk) {
- if (bt_sk(sk)->parent != parent)
- goto out;
-
- if (!list_is_last(&bt_sk(sk)->accept_q, &bt->accept_q)) {
- next = &list_next_entry(bt_sk(sk), accept_q)->sk;
- sock_hold(next);
- }
- } else if (!list_empty(&bt->accept_q)) {
- next = &list_first_entry(&bt->accept_q,
- struct bt_sock, accept_q)->sk;
- sock_hold(next);
- }
-
-out:
- spin_unlock_bh(&bt->accept_q_lock);
- return next;
-}
-
struct sock *bt_accept_dequeue(struct sock *parent, struct socket *newsock)
{
- struct sock *sk, *next;
+ struct bt_sock *s;
+ struct sock *sk;
BT_DBG("parent %p", parent);
restart:
- for (sk = bt_accept_get(parent, NULL); sk; sk = next) {
+ spin_lock_bh(&bt_sk(parent)->accept_q_lock);
+ list_for_each_entry(s, &bt_sk(parent)->accept_q, accept_q) {
+ unsigned char state;
+
+ sk = &s->sk;
+
+ /* lockless version of the checks below */
+ state = data_race(READ_ONCE(sk->sk_state));
+ if (state != BT_CLOSED && state != BT_CONNECTED && newsock &&
+ !test_bit(BT_SK_DEFER_SETUP, &bt_sk(parent)->flags))
+ continue;
+
/* Prevent early freeing of sk due to unlink and sock_kill */
+ sock_hold(sk);
+ spin_unlock_bh(&bt_sk(parent)->accept_q_lock);
lock_sock(sk);
+ /* socket is now locked, redo checks reliably */
/* Check sk has not already been unlinked via
* bt_accept_unlink() due to serialisation caused by sk locking
*/
- if (bt_sk(sk)->parent != parent) {
+ if (s->parent != parent) {
BT_DBG("sk %p, already unlinked", sk);
release_sock(sk);
sock_put(sk);
@@ -322,8 +308,6 @@ struct sock *bt_accept_dequeue(struct sock *parent, struct socket *newsock)
goto restart;
}
- next = bt_accept_get(parent, sk);
-
/* sk is safely in the parent list so reduce reference count */
sock_put(sk);
@@ -331,7 +315,7 @@ struct sock *bt_accept_dequeue(struct sock *parent, struct socket *newsock)
if (sk->sk_state == BT_CLOSED) {
bt_accept_unlink(sk);
release_sock(sk);
- continue;
+ goto restart;
}
if (sk->sk_state == BT_CONNECTED || !newsock ||
@@ -341,12 +325,11 @@ struct sock *bt_accept_dequeue(struct sock *parent, struct socket *newsock)
sock_graft(sk, newsock);
release_sock(sk);
- if (next)
- sock_put(next);
return sk;
}
release_sock(sk);
+ goto restart;
}
return NULL;
```
I think this makes the code simpler, and it reduces the line count;
however, I think your approach is okay too, so it would also be fine
to keep your approach if you prefer.
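The locking pattern discussed above (walk the queue under the queue lock,
pin a candidate, drop the queue lock, lock the candidate, then recheck)
can be modelled in userspace. The following is a rough sketch with
pthread mutexes and C11 atomics, not kernel code; every type and function
name in it is hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

enum { ST_CONNECTING, ST_CONNECTED };

struct parent;

struct child {
	struct child *next;     /* queue link, protected by parent q_lock */
	struct parent *parent;  /* protected by the child's own lock */
	pthread_mutex_t lock;
	atomic_int state;       /* peeked at without the child lock */
	atomic_int refs;
};

struct parent {
	pthread_mutex_t q_lock;
	struct child *head;
};

static void enqueue(struct parent *p, struct child *c, int state)
{
	struct child **pp;

	pthread_mutex_init(&c->lock, NULL);
	c->parent = p;
	c->next = NULL;
	atomic_store(&c->state, state);
	atomic_store(&c->refs, 1);          /* reference owned by the queue */
	pthread_mutex_lock(&p->q_lock);
	for (pp = &p->head; *pp; pp = &(*pp)->next)
		;
	*pp = c;                            /* append at the tail */
	pthread_mutex_unlock(&p->q_lock);
}

/* Caller holds c->lock. Lock order is child lock, then queue lock. */
static void unlink_child(struct child *c)
{
	struct parent *p = c->parent;
	struct child **pp;

	pthread_mutex_lock(&p->q_lock);
	for (pp = &p->head; *pp; pp = &(*pp)->next) {
		if (*pp == c) {
			*pp = c->next;
			break;
		}
	}
	pthread_mutex_unlock(&p->q_lock);
	c->next = NULL;
	c->parent = NULL;
	atomic_fetch_sub(&c->refs, 1);      /* drop the queue's reference */
}

static struct child *dequeue_connected(struct parent *p)
{
	struct child *c;

restart:
	pthread_mutex_lock(&p->q_lock);
	for (c = p->head; c; c = c->next) {
		/* Cheap pre-check without the child lock; may be stale. */
		if (atomic_load(&c->state) != ST_CONNECTED)
			continue;

		atomic_fetch_add(&c->refs, 1);  /* pin c before unlocking */
		pthread_mutex_unlock(&p->q_lock);
		pthread_mutex_lock(&c->lock);

		/* Redo the checks now that the child is locked. */
		if (c->parent != p || atomic_load(&c->state) != ST_CONNECTED) {
			pthread_mutex_unlock(&c->lock);
			atomic_fetch_sub(&c->refs, 1);
			goto restart;
		}

		unlink_child(c);
		pthread_mutex_unlock(&c->lock);
		return c;                       /* caller owns the pinned ref */
	}
	pthread_mutex_unlock(&p->q_lock);   /* no candidate: unlock first */
	return NULL;
}
```

In this model the queue lock is never held while a child lock is taken,
which keeps the lock order consistent with unlink_child(), and every exit
from the walk releases the queue lock before returning.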
[...]
> @@ -518,18 +551,28 @@ EXPORT_SYMBOL(bt_sock_stream_recvmsg);
>
> static inline __poll_t bt_accept_poll(struct sock *parent)
> {
> - struct bt_sock *s, *n;
> + struct bt_sock *bt = bt_sk(parent);
> + struct bt_sock *s;
> struct sock *sk;
> + __poll_t mask = 0;
> +
> + spin_lock_bh(&bt->accept_q_lock);
> + list_for_each_entry(s, &bt->accept_q, accept_q) {
> + int state;
>
> - list_for_each_entry_safe(s, n, &bt_sk(parent)->accept_q, accept_q) {
> sk = (struct sock *)s;
> - if (sk->sk_state == BT_CONNECTED ||
> - (test_bit(BT_SK_DEFER_SETUP, &bt_sk(parent)->flags) &&
> - sk->sk_state == BT_CONNECT2))
> - return EPOLLIN | EPOLLRDNORM;
> + state = READ_ONCE(sk->sk_state);
nitpick: This READ_ONCE() is not synchronized with a corresponding
WRITE_ONCE(); that's not really clean, and it might be appropriate to
mark this with data_race() if this is intentionally racy with
potentially-torn stores. But that's a minor detail.
> +
> + if (state == BT_CONNECTED ||
> + (test_bit(BT_SK_DEFER_SETUP, &bt->flags) &&
> + state == BT_CONNECT2)) {
> + mask = EPOLLIN | EPOLLRDNORM;
> + break;
> + }
> }
> + spin_unlock_bh(&bt->accept_q_lock);
>
> - return 0;
> + return mask;
> }
>
> __poll_t bt_sock_poll(struct file *file, struct socket *sock,
> --
> 2.34.1
>
[-- Attachment #2: locked-walk.diff --]
[-- Type: application/x-patch, Size: 2914 bytes --]