Netdev List


* Re: [PATCH net-next v2 2/8] net: mdio: realtek-rtl9300: Add page tracking
From: Markus Stockhausen @ 2026-06-29 17:25 UTC
  To: 'Andrew Lunn'
  Cc: hkallweit1, linux, davem, edumazet, kuba, pabeni, netdev,
	chris.packham, daniel, robh, krzk+dt, conor+dt, devicetree
In-Reply-To: <f3b0ddab-372a-49c5-977e-59c7a104d0a8@lunn.ch>

> From: Andrew Lunn <andrew@lunn.ch>
> Sent: Monday, 29 June 2026 18:29
> To: Markus Stockhausen <markus.stockhausen@gmx.de>
> Subject: Re: [PATCH net-next v2 2/8] net: mdio: realtek-rtl9300: Add page tracking
> ...
> > Intercept access to register 31 and store the desired value for each port
> > in the driver. When issuing access to other registers add the saved page.
> > This given, the hardware will run two consecutive c22 commands that are
> > not interrupted by polling.
> > 
> >   ... hardware poll ...
> >   phy_write(phy, 31, page)
> >   phy_write(phy, reg, value)
> >   ... hardware poll ...
>
> How do you guarantee the polling will not get between?

The page is part of a single command to the controller.
If a page is given, the hardware generates the two MDIO commands
above from that one command, with no polling in between.

I will clarify this in the next version.

Markus 
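
For illustration only, the page tracking described above could be modelled
in plain C roughly as follows. This is a userspace sketch, not the driver
code: all sketch_* names, the port count, and the command layout are
assumptions made up for the example. Only the idea (absorb writes to
register 31, attach the saved page to every other access) comes from the
thread.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_NUM_PORTS 8   /* hypothetical port count */
#define SKETCH_PAGE_REG  31  /* page select register, as in the thread */

/* One combined controller command: page + register + value (assumed layout). */
struct sketch_cmd {
	uint8_t  port;
	uint16_t page;
	uint8_t  reg;
	uint16_t value;
};

struct sketch_bus {
	uint16_t page[SKETCH_NUM_PORTS]; /* saved page per port */
	struct sketch_cmd last;          /* last command handed to the controller */
	unsigned int issued;             /* number of commands issued so far */
};

/* Returns 1 if a command was issued to the (simulated) controller,
 * 0 if the write was absorbed by the page tracker.
 */
static int sketch_phy_write(struct sketch_bus *bus, uint8_t port,
			    uint8_t reg, uint16_t value)
{
	assert(port < SKETCH_NUM_PORTS);

	if (reg == SKETCH_PAGE_REG) {
		/* Do not touch the bus: just remember the page for this port. */
		bus->page[port] = value;
		return 0;
	}

	/* Page and register travel in one command, so the hardware can
	 * emit both c22 accesses back to back.
	 */
	bus->last = (struct sketch_cmd){
		.port = port, .page = bus->page[port],
		.reg = reg, .value = value,
	};
	bus->issued++;
	return 1;
}
```

A caller would zero-initialize a struct sketch_bus, write the page via
register 31, and then see that page attached to the next register access on
the same port.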



* [PATCH net-next v10 06/12] enic: define MBOX message types and header structures
From: Satish Kharat @ 2026-06-29 17:25 UTC
  To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni
  Cc: netdev, linux-kernel, Sesidhar Baddela, Satish Kharat
In-Reply-To: <20260629-enic-sriov-v2-admin-channel-v2-v10-0-62569af83417@cisco.com>

Define the mailbox protocol structures for PF-VF communication:
message header, generic reply, and per-message-type payloads for
capability negotiation, VF registration/unregistration, and link
state notification/acknowledgment.

Include linux/types.h and linux/bits.h for __le16/__le32/__le64
and BIT() used in the header.

Message types use an even=request / odd=reply convention.  The
header carries source and destination VNIC IDs, a monotonically
increasing message number, and the total message length.
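
As a rough illustration of the even=request / odd=reply convention, the
helpers below classify a message type and pair a request with its reply.
They are not part of this patch: the sketch_* names are hypothetical and the
limit of 8 simply mirrors ENIC_MBOX_MAX from the enum in this patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Mirrors ENIC_MBOX_MAX: valid message types are 0..7. */
enum { SKETCH_MBOX_MAX = 8 };

/* An even number of types means every request has a paired reply. */
static_assert((SKETCH_MBOX_MAX & 1) == 0, "request without a paired reply");

static bool sketch_mbox_is_request(uint8_t type)
{
	return type < SKETCH_MBOX_MAX && (type & 1) == 0;
}

static bool sketch_mbox_is_reply(uint8_t type)
{
	return type < SKETCH_MBOX_MAX && (type & 1) == 1;
}

/* Reply type expected for a request, or -1 if 'type' is not a request. */
static int sketch_mbox_reply_for(uint8_t type)
{
	if (!sketch_mbox_is_request(type))
		return -1;
	return type + 1;
}
```

With the values from this patch, a register request (2) pairs with the
register reply (3), and a link state notification (6) with its ack (7).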

Signed-off-by: Satish Kharat <satishkh@cisco.com>
---
 drivers/net/ethernet/cisco/enic/enic_mbox.h | 83 +++++++++++++++++++++++++++++
 1 file changed, 83 insertions(+)

diff --git a/drivers/net/ethernet/cisco/enic/enic_mbox.h b/drivers/net/ethernet/cisco/enic/enic_mbox.h
new file mode 100644
index 000000000000..a52f1d25cb21
--- /dev/null
+++ b/drivers/net/ethernet/cisco/enic/enic_mbox.h
@@ -0,0 +1,83 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/* Copyright 2025 Cisco Systems, Inc.  All rights reserved. */
+
+#ifndef _ENIC_MBOX_H_
+#define _ENIC_MBOX_H_
+
+#include <linux/bits.h>
+#include <linux/types.h>
+
+/*
+ * Mailbox protocol for PF-VF communication over the admin channel.
+ *
+ * Even numbers are requests, odd numbers are replies/acks.
+ * The prefix indicates the initiator: VF_ = VF-initiated, PF_ = PF-initiated.
+ */
+enum enic_mbox_msg_type {
+	ENIC_MBOX_VF_CAPABILITY_REQUEST		= 0,
+	ENIC_MBOX_VF_CAPABILITY_REPLY		= 1,
+	ENIC_MBOX_VF_REGISTER_REQUEST		= 2,
+	ENIC_MBOX_VF_REGISTER_REPLY		= 3,
+	ENIC_MBOX_VF_UNREGISTER_REQUEST		= 4,
+	ENIC_MBOX_VF_UNREGISTER_REPLY		= 5,
+	ENIC_MBOX_PF_LINK_STATE_NOTIF		= 6,
+	ENIC_MBOX_PF_LINK_STATE_ACK		= 7,
+	ENIC_MBOX_MAX
+};
+
+struct enic_mbox_hdr {
+	__le16 src_vnic_id;
+	__le16 dst_vnic_id;
+	u8 msg_type;
+	u8 flags;
+	__le16 msg_len;
+	__le64 msg_num;
+};
+
+struct enic_mbox_generic_reply {
+	__le16 ret_major;
+	__le16 ret_minor;
+};
+
+#define ENIC_MBOX_ERR_GENERIC		BIT(0)
+#define ENIC_MBOX_ERR_VF_NOT_REGISTERED	BIT(1)
+#define ENIC_MBOX_ERR_MSG_NOT_SUPPORTED	BIT(2)
+
+/* ENIC_MBOX_VF_CAPABILITY_REQUEST / _REPLY */
+#define ENIC_MBOX_CAP_VERSION_0		0
+#define ENIC_MBOX_CAP_VERSION_1		1
+
+struct enic_mbox_vf_capability_msg {
+	__le32 version;
+	__le32 reserved[32];
+};
+
+/* The embedded enic_mbox_generic_reply has 2-byte alignment, but the
+ * __le32 members give this struct 4-byte natural alignment.  Receive
+ * buffers come from kmalloc (>= 8-byte aligned), so there is no
+ * misaligned access risk when casting from the receive buffer.
+ */
+struct enic_mbox_vf_capability_reply_msg {
+	struct enic_mbox_generic_reply reply;
+	__le32 version;
+	__le32 reserved[32];
+};
+
+/* ENIC_MBOX_VF_REGISTER / _UNREGISTER */
+struct enic_mbox_vf_register_reply_msg {
+	struct enic_mbox_generic_reply reply;
+};
+
+/* ENIC_MBOX_PF_LINK_STATE_NOTIF / _ACK */
+#define ENIC_MBOX_LINK_STATE_DISABLE	0
+#define ENIC_MBOX_LINK_STATE_ENABLE	1
+
+struct enic_mbox_pf_link_state_notif_msg {
+	__le32 link_state;
+};
+
+struct enic_mbox_pf_link_state_ack_msg {
+	struct enic_mbox_generic_reply ack;
+};
+
+#endif /* _ENIC_MBOX_H_ */

-- 
2.43.0



* [PATCH net-next v10 07/12] enic: add MBOX core send and receive for admin channel
From: Satish Kharat @ 2026-06-29 17:25 UTC
  To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni
  Cc: netdev, linux-kernel, Sesidhar Baddela, Satish Kharat
In-Reply-To: <20260629-enic-sriov-v2-admin-channel-v2-v10-0-62569af83417@cisco.com>

Implement the mailbox protocol engine used for PF-VF communication
over the admin channel.

The send path (enic_mbox_send_msg) builds a message with a common
header, DMA-maps it, posts a single WQ descriptor with the
destination vnic ID encoded in the VLAN tag field, and polls
the WQ CQ for completion.
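
The VLAN tag encoding used for routing (0 = PF, 1-based = VF index) can be
shown in isolation. The standalone sketch below restates the mapping from
enic_mbox_send_msg() and its inverse from enic_admin_rq_cq_service(); the
sketch_* names are hypothetical and only the 0xFFFF sentinel is taken from
ENIC_MBOX_DST_PF in this patch.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_MBOX_DST_PF 0xFFFF /* mirrors ENIC_MBOX_DST_PF */

/* The PF sentinel is the top u16 value, so it cannot be confused with a
 * small zero-based VF index.
 */
static_assert(SKETCH_MBOX_DST_PF == UINT16_MAX, "unexpected PF sentinel");

/* Send side: destination vnic ID -> VLAN tag placed in the WQ descriptor. */
static uint16_t sketch_vnic_to_vlan(uint16_t dst_vnic_id)
{
	if (dst_vnic_id == SKETCH_MBOX_DST_PF)
		return 0;
	return (uint16_t)(dst_vnic_id + 1);
}

/* Receive side: VLAN field from the completion -> source vnic ID. */
static uint16_t sketch_vlan_to_vnic(uint16_t vlan)
{
	if (vlan == 0)
		return SKETCH_MBOX_DST_PF;
	return (uint16_t)(vlan - 1);
}
```

The two helpers are inverses of each other for the PF sentinel and for any
VF index below 0xFFFE.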

MBOX sends are gated by enic->mbox_send_disabled: enic_mbox_send_msg()
returns early while it is set. The flag is cleared in
enic_admin_channel_open() only once the admin WQ/RQ/CQ and interrupt
are fully programmed, and set again at the start of
enic_admin_channel_close(), so a send can never race a not-yet-ready
or torn-down admin channel.

The receive path (enic_mbox_recv_handler) is installed as the admin
RQ callback and validates incoming message headers. PF/VF-specific
dispatch will be added in subsequent commits.
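
The header validation done on the receive path can be sketched outside the
kernel as below. This is an approximation under stated assumptions: the
struct uses host-endian fixed-width types instead of __le16/__le64, the
sketch_* names and the verdict enum are hypothetical, and the type limit of 8
mirrors ENIC_MBOX_MAX.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Same field order as struct enic_mbox_hdr, host-endian for simplicity. */
struct sketch_mbox_hdr {
	uint16_t src_vnic_id;
	uint16_t dst_vnic_id;
	uint8_t  msg_type;
	uint8_t  flags;
	uint16_t msg_len;
	uint64_t msg_num;
};

enum sketch_mbox_verdict {
	SKETCH_MBOX_OK = 0,
	SKETCH_MBOX_TRUNCATED,    /* buffer shorter than the header */
	SKETCH_MBOX_UNKNOWN_TYPE, /* msg_type outside the known range */
};

#define SKETCH_MBOX_MAX 8 /* mirrors ENIC_MBOX_MAX */

static enum sketch_mbox_verdict sketch_mbox_validate(const void *buf,
						     size_t len)
{
	struct sketch_mbox_hdr hdr;

	assert(buf != NULL);

	/* Check the length first so the header is never read past the end. */
	if (len < sizeof(hdr))
		return SKETCH_MBOX_TRUNCATED;

	/* Copy out instead of casting, to stay alignment-safe in userspace. */
	memcpy(&hdr, buf, sizeof(hdr));

	if (hdr.msg_type >= SKETCH_MBOX_MAX)
		return SKETCH_MBOX_UNKNOWN_TYPE;

	return SKETCH_MBOX_OK;
}
```

The order of the two checks matters: the type field is only looked at once
the buffer is known to hold a complete header.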

Signed-off-by: Satish Kharat <satishkh@cisco.com>
---
 drivers/net/ethernet/cisco/enic/Makefile     |   2 +-
 drivers/net/ethernet/cisco/enic/enic.h       |   6 +
 drivers/net/ethernet/cisco/enic/enic_admin.c |  35 +++++-
 drivers/net/ethernet/cisco/enic/enic_mbox.c  | 170 +++++++++++++++++++++++++++
 drivers/net/ethernet/cisco/enic/enic_mbox.h  |   8 ++
 5 files changed, 218 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/cisco/enic/Makefile b/drivers/net/ethernet/cisco/enic/Makefile
index 7ae72fefc99a..e38aaf34c148 100644
--- a/drivers/net/ethernet/cisco/enic/Makefile
+++ b/drivers/net/ethernet/cisco/enic/Makefile
@@ -4,5 +4,5 @@ obj-$(CONFIG_ENIC) := enic.o
 enic-y := enic_main.o vnic_cq.o vnic_intr.o vnic_wq.o \
 	enic_res.o enic_dev.o enic_pp.o vnic_dev.o vnic_rq.o vnic_vic.o \
 	enic_ethtool.o enic_api.o enic_clsf.o enic_rq.o enic_wq.o \
-	enic_admin.o
+	enic_admin.o enic_mbox.o
 
diff --git a/drivers/net/ethernet/cisco/enic/enic.h b/drivers/net/ethernet/cisco/enic/enic.h
index 401123e6df1d..b009d87da4bd 100644
--- a/drivers/net/ethernet/cisco/enic/enic.h
+++ b/drivers/net/ethernet/cisco/enic/enic.h
@@ -297,6 +297,8 @@ struct enic {
 	 * left the resources freed.
 	 */
 	bool admin_chan_up;
+	/* set on send timeout; cleared on channel re-open */
+	bool mbox_send_disabled;
 	struct vnic_wq admin_wq;
 	struct vnic_rq admin_rq;
 	struct vnic_cq admin_cq[2];
@@ -309,6 +311,10 @@ struct enic {
 	unsigned int admin_msg_count;	/* current depth of admin_msg_list */
 	void (*admin_rq_handler)(struct enic *enic, void *buf,
 				 unsigned int len);
+
+	/* MBOX protocol state — mbox_lock serializes admin WQ sends */
+	struct mutex mbox_lock;
+	u64 mbox_msg_num;
 };
 
 static inline struct net_device *vnic_get_netdev(struct vnic_dev *vdev)
diff --git a/drivers/net/ethernet/cisco/enic/enic_admin.c b/drivers/net/ethernet/cisco/enic/enic_admin.c
index 6062a18043ba..d695b16765a1 100644
--- a/drivers/net/ethernet/cisco/enic/enic_admin.c
+++ b/drivers/net/ethernet/cisco/enic/enic_admin.c
@@ -19,6 +19,7 @@
 #include "cq_enet_desc.h"
 #include "wq_enet_desc.h"
 #include "rq_enet_desc.h"
+#include "enic_mbox.h"
 
 /* Clean up any admin WQ buffers still held by hardware at close time.
  * Normally buffers are freed inline after send completion, but a timed-out
@@ -213,7 +214,26 @@ unsigned int enic_admin_rq_cq_service(struct enic *enic)
 			goto next_desc;
 		}
 
-		enic_admin_msg_enqueue(enic, buf->os_buf, bytes_written);
+		if (enic->admin_rq_handler) {
+			u16 sender_vlan;
+
+			/* Firmware sets the CQ VLAN field to identify the
+			 * sender: 0 = PF, 1-based = VF index.  Overwrite
+			 * the untrusted src_vnic_id in the MBOX header with
+			 * the hardware-verified value.
+			 */
+			sender_vlan = le16_to_cpu(rq_desc->vlan);
+			if (bytes_written >= sizeof(struct enic_mbox_hdr)) {
+				struct enic_mbox_hdr *hdr = buf->os_buf;
+
+				hdr->src_vnic_id = (sender_vlan == 0) ?
+					cpu_to_le16(ENIC_MBOX_DST_PF) :
+					cpu_to_le16(sender_vlan - 1);
+			}
+
+			enic_admin_msg_enqueue(enic, buf->os_buf,
+					       bytes_written);
+		}
 
 next_desc:
 		enic_admin_rq_buf_clean(rq, rq->to_clean);
@@ -456,8 +476,9 @@ static void enic_admin_init_resources(struct enic *enic)
 		     VNIC_CQ_MSG_DISABLE,
 		     intr_offset,
 		     0 /* cq_message_addr */);
+	/* coalescing_timer, coalescing_type, mask_on_assertion */
 	vnic_intr_init(&enic->admin_intr,
-		       0, 0, 1); /* coalescing_timer, coalescing_type, mask_on_assertion */
+		       0, 0, 1);
 }
 
 static void enic_admin_msg_drain(struct enic *enic)
@@ -522,6 +543,14 @@ int enic_admin_channel_open(struct enic *enic)
 
 	vnic_intr_unmask(&enic->admin_intr);
 
+	/* Only now that the admin WQ/RQ/CQ and interrupt are fully allocated,
+	 * programmed and enabled is it safe to allow MBOX sends.  Clearing this
+	 * earlier opened a window where a concurrent sender (e.g. link-notify
+	 * work scheduled by a post-reset link-up) could call enic_mbox_send_msg()
+	 * against a not-yet-allocated admin_wq and crash.
+	 */
+	WRITE_ONCE(enic->mbox_send_disabled, false);
+
 	netdev_dbg(enic->netdev,
 		   "admin channel open: intr=%u wq_avail=%u rq_avail=%u cq0_color=%u cq1_color=%u\n",
 		   enic->admin_intr_index,
@@ -563,6 +592,8 @@ void enic_admin_channel_close(struct enic *enic)
 	if (!enic->admin_chan_up)
 		return;
 
+	WRITE_ONCE(enic->mbox_send_disabled, true);
+
 	netdev_dbg(enic->netdev, "admin channel close\n");
 
 	vnic_intr_mask(&enic->admin_intr);
diff --git a/drivers/net/ethernet/cisco/enic/enic_mbox.c b/drivers/net/ethernet/cisco/enic/enic_mbox.c
new file mode 100644
index 000000000000..3709704bee02
--- /dev/null
+++ b/drivers/net/ethernet/cisco/enic/enic_mbox.c
@@ -0,0 +1,170 @@
+// SPDX-License-Identifier: GPL-2.0-only
+// Copyright 2025 Cisco Systems, Inc.  All rights reserved.
+
+#include <linux/kernel.h>
+#include <linux/netdevice.h>
+#include <linux/dma-mapping.h>
+#include <linux/delay.h>
+
+#include "vnic_dev.h"
+#include "vnic_wq.h"
+#include "vnic_cq.h"
+#include "enic.h"
+#include "enic_admin.h"
+#include "enic_mbox.h"
+#include "wq_enet_desc.h"
+
+#define ENIC_MBOX_POLL_TIMEOUT_US	5000000
+#define ENIC_MBOX_POLL_INTERVAL_US	100
+
+static void enic_mbox_fill_hdr(struct enic *enic, struct enic_mbox_hdr *hdr,
+			       u8 msg_type, u16 dst_vnic_id, u16 msg_len)
+{
+	memset(hdr, 0, sizeof(*hdr));
+	hdr->dst_vnic_id = cpu_to_le16(dst_vnic_id);
+	hdr->msg_type = msg_type;
+	hdr->msg_len = cpu_to_le16(msg_len);
+	hdr->msg_num = cpu_to_le64(++enic->mbox_msg_num);
+}
+
+int enic_mbox_send_msg(struct enic *enic, u8 msg_type, u16 dst_vnic_id,
+		       void *payload, u16 payload_len)
+{
+	u16 total_len = sizeof(struct enic_mbox_hdr) + payload_len;
+	struct vnic_wq *wq = &enic->admin_wq;
+	struct wq_enet_desc *desc;
+	unsigned long timeout;
+	dma_addr_t dma_addr;
+	u16 vlan_tag;
+	void *buf;
+	int err;
+
+	/* Serialize MBOX sends. The admin channel is a low-frequency
+	 * control path; holding the mutex across the poll is acceptable.
+	 */
+	mutex_lock(&enic->mbox_lock);
+
+	if (!enic->has_admin_channel || READ_ONCE(enic->mbox_send_disabled)) {
+		err = -ENODEV;
+		goto unlock;
+	}
+
+	if (vnic_wq_desc_avail(wq) == 0) {
+		err = -ENOSPC;
+		goto unlock;
+	}
+
+	buf = kmalloc(total_len, GFP_KERNEL);
+	if (!buf) {
+		err = -ENOMEM;
+		goto unlock;
+	}
+
+	enic_mbox_fill_hdr(enic, buf, msg_type, dst_vnic_id, total_len);
+	if (payload_len) {
+		void *dst = buf + sizeof(struct enic_mbox_hdr);
+
+		memcpy(dst, payload, payload_len);
+	}
+
+	dma_addr = dma_map_single(&enic->pdev->dev, buf, total_len,
+				  DMA_TO_DEVICE);
+	if (dma_mapping_error(&enic->pdev->dev, dma_addr)) {
+		kfree(buf);
+		err = -ENOMEM;
+		goto unlock;
+	}
+
+	/* Firmware uses vlan field for routing: 0 = PF, 1-based = VF index */
+	if (dst_vnic_id == ENIC_MBOX_DST_PF)
+		vlan_tag = 0;
+	else
+		vlan_tag = dst_vnic_id + 1;
+
+	desc = vnic_wq_next_desc(wq);
+	wq_enet_desc_enc(desc, (u64)dma_addr | VNIC_PADDR_TARGET,
+			 total_len,
+			 0, 0, 0,       /* mss, hdr_len, offload_mode */
+			 1, 1,          /* eop, cq_entry */
+			 0,             /* fcoe_encap */
+			 1, vlan_tag,   /* vlan_tag_insert, vlan_tag */
+			 0);            /* loopback */
+	vnic_wq_post(wq, buf, dma_addr, total_len,
+		     1, 1,              /* sop, eop */
+		     1, 1,              /* desc_skip_cnt, cq_entry */
+		     0, 0);             /* compressed_send, wrid */
+	vnic_wq_doorbell(wq);
+
+	timeout = jiffies + usecs_to_jiffies(ENIC_MBOX_POLL_TIMEOUT_US);
+	err = -ETIMEDOUT;
+	while (time_before(jiffies, timeout)) {
+		if (enic_admin_wq_cq_service(enic)) {
+			err = 0;
+			break;
+		}
+		usleep_range(ENIC_MBOX_POLL_INTERVAL_US,
+			     ENIC_MBOX_POLL_INTERVAL_US + 50);
+	}
+	/* Final check in case completion arrived during the last sleep */
+	if (err && enic_admin_wq_cq_service(enic))
+		err = 0;
+
+	if (!err) {
+		wq->to_clean = wq->to_clean->next;
+		wq->ring.desc_avail++;
+		dma_unmap_single(&enic->pdev->dev, dma_addr, total_len,
+				 DMA_TO_DEVICE);
+		kfree(buf);
+	} else {
+		netdev_err(enic->netdev,
+			   "MBOX send timed out (type %u dst %u), disabling channel\n",
+			   msg_type, dst_vnic_id);
+		/*
+		 * The WQ descriptor is still live in hardware. Do not unmap
+		 * or free the buffer: the device may still DMA from dma_addr.
+		 * Mark the channel unusable so no further sends are attempted.
+		 */
+		WRITE_ONCE(enic->mbox_send_disabled, true);
+	}
+
+	netdev_dbg(enic->netdev,
+		   "MBOX send msg_type %u dst %u vlan %u err %d\n",
+		   msg_type, dst_vnic_id, vlan_tag, err);
+unlock:
+	mutex_unlock(&enic->mbox_lock);
+	return err;
+}
+
+static void enic_mbox_recv_handler(struct enic *enic, void *buf,
+				   unsigned int len)
+{
+	struct enic_mbox_hdr *hdr = buf;
+
+	if (len < sizeof(*hdr)) {
+		if (net_ratelimit())
+			netdev_warn(enic->netdev,
+				    "MBOX: truncated message (len %u < %zu)\n",
+				    len, sizeof(*hdr));
+		return;
+	}
+
+	if (hdr->msg_type >= ENIC_MBOX_MAX) {
+		if (net_ratelimit())
+			netdev_warn(enic->netdev,
+				    "MBOX: unknown msg type %u\n",
+				    hdr->msg_type);
+		return;
+	}
+
+	netdev_dbg(enic->netdev,
+		   "MBOX recv: type %u from vnic %u len %u\n",
+		   hdr->msg_type, le16_to_cpu(hdr->src_vnic_id),
+		   le16_to_cpu(hdr->msg_len));
+}
+
+void enic_mbox_init(struct enic *enic)
+{
+	enic->mbox_msg_num = 0;
+	mutex_init(&enic->mbox_lock);
+	enic->admin_rq_handler = enic_mbox_recv_handler;
+}
diff --git a/drivers/net/ethernet/cisco/enic/enic_mbox.h b/drivers/net/ethernet/cisco/enic/enic_mbox.h
index a52f1d25cb21..73fd7f783ee2 100644
--- a/drivers/net/ethernet/cisco/enic/enic_mbox.h
+++ b/drivers/net/ethernet/cisco/enic/enic_mbox.h
@@ -80,4 +80,12 @@ struct enic_mbox_pf_link_state_ack_msg {
 	struct enic_mbox_generic_reply ack;
 };
 
+#define ENIC_MBOX_DST_PF	0xFFFF
+
+struct enic;
+
+void enic_mbox_init(struct enic *enic);
+int enic_mbox_send_msg(struct enic *enic, u8 msg_type, u16 dst_vnic_id,
+		       void *payload, u16 payload_len);
+
 #endif /* _ENIC_MBOX_H_ */

-- 
2.43.0



* [PATCH net-next v10 11/12] enic: add V2 VF probe with admin channel and PF registration
From: Satish Kharat @ 2026-06-29 17:26 UTC
  To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni
  Cc: netdev, linux-kernel, Sesidhar Baddela, Satish Kharat
In-Reply-To: <20260629-enic-sriov-v2-admin-channel-v2-v10-0-62569af83417@cisco.com>

When a V2 SR-IOV VF probes, open the admin channel, initialize the
MBOX protocol, perform the capability check with the PF, and register
with the PF. This establishes the PF-VF communication path that the PF
uses to send link state notifications.

The admin channel and MBOX registration happen after enic_dev_init()
(which discovers admin channel resources) and before register_netdev()
so the VF is fully initialized before the interface is visible to
userspace.

The admin channel is opened before enic_mbox_init() installs the
receive handler.  This is safe because enic_admin_rq_cq_service()
checks admin_rq_handler before enqueuing received buffers, so any
interrupt that fires between open and mbox_init is harmlessly
discarded.

On remove, the VF unregisters from the PF and closes its admin channel
before tearing down data path resources.

V2 VFs are not provisioned with an RES_TYPE_SRIOV_INTR resource by
firmware, so bypass that check in the admin channel capability
detection for V2 VFs. The PF still requires this resource.

The admin MSI-X vector reserved by enic_set_intr_mode()
is used for the admin channel interrupt.
enic_adjust_resources() ensures the reserved slot is within
intr_avail bounds even at maximum queue configurations.  The
admin INTR uses a RES_TYPE_INTR_CTRL slot shared with the
data path.
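
To make the MSI-X reservation concrete, here is a deliberately simplified
model of the accounting. It is not the logic of enic_adjust_resources(),
which also balances wq/rq/cq counts; the sketch_* names and the single
"wanted" input are assumptions. It only shows why reserving one slot keeps
the admin interrupt, placed at index intr_count, inside intr_avail.

```c
#include <assert.h>
#include <stdbool.h>

struct sketch_intr_plan {
	unsigned int intr_count; /* data-path interrupts actually used */
	int admin_index;         /* MSI-X index of the admin INTR, or -1 */
};

static struct sketch_intr_plan sketch_plan_intr(unsigned int wanted,
						unsigned int intr_avail,
						bool has_admin_channel)
{
	unsigned int reserve = has_admin_channel ? 1 : 0;
	unsigned int budget = intr_avail > reserve ? intr_avail - reserve : 0;
	struct sketch_intr_plan plan;

	/* Data-path interrupts are capped by what is left after the reserve. */
	plan.intr_count = wanted < budget ? wanted : budget;

	/* The admin INTR sits right after the data-path interrupts. */
	plan.admin_index = (has_admin_channel && intr_avail > 0) ?
			   (int)plan.intr_count : -1;

	/* When present, the admin slot must fit inside intr_avail. */
	assert(plan.admin_index < 0 ||
	       (unsigned int)plan.admin_index < intr_avail);
	return plan;
}
```

For example, with 8 interrupts available and 8 wanted, the model leaves 7 for
the data path and places the admin interrupt at index 7.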

Signed-off-by: Satish Kharat <satishkh@cisco.com>
---
 drivers/net/ethernet/cisco/enic/enic.h      |   1 +
 drivers/net/ethernet/cisco/enic/enic_main.c | 101 +++++++++++++++++++++++++---
 drivers/net/ethernet/cisco/enic/enic_res.c  |   3 +-
 3 files changed, 94 insertions(+), 11 deletions(-)

diff --git a/drivers/net/ethernet/cisco/enic/enic.h b/drivers/net/ethernet/cisco/enic/enic.h
index b5a43fe04877..62b8941489d7 100644
--- a/drivers/net/ethernet/cisco/enic/enic.h
+++ b/drivers/net/ethernet/cisco/enic/enic.h
@@ -452,6 +452,7 @@ void enic_reset_addr_lists(struct enic *enic);
 int enic_sriov_enabled(struct enic *enic);
 int enic_is_valid_vf(struct enic *enic, int vf);
 int enic_is_dynamic(struct enic *enic);
+int enic_is_sriov_vf_v2(struct enic *enic);
 void enic_set_ethtool_ops(struct net_device *netdev);
 int __enic_set_rsskey(struct enic *enic);
 void enic_ext_cq(struct enic *enic);
diff --git a/drivers/net/ethernet/cisco/enic/enic_main.c b/drivers/net/ethernet/cisco/enic/enic_main.c
index 185da2fbc5c7..abb30e5457c1 100644
--- a/drivers/net/ethernet/cisco/enic/enic_main.c
+++ b/drivers/net/ethernet/cisco/enic/enic_main.c
@@ -316,6 +316,11 @@ static int enic_is_sriov_vf(struct enic *enic)
 	       enic->pdev->device == PCI_DEVICE_ID_CISCO_VIC_ENET_VF_V2;
 }
 
+int enic_is_sriov_vf_v2(struct enic *enic)
+{
+	return enic->pdev->device == PCI_DEVICE_ID_CISCO_VIC_ENET_VF_V2;
+}
+
 int enic_is_valid_vf(struct enic *enic, int vf)
 {
 #ifdef CONFIG_PCI_IOV
@@ -2399,15 +2404,19 @@ static int enic_adjust_resources(struct enic *enic)
 		enic->intr_count = enic->intr_avail;
 		break;
 	case VNIC_DEV_INTR_MODE_MSIX: {
-		/* Reserve one MSI-X slot for the admin channel interrupt
-		 * when V2 SR-IOV admin channel resources are present.
-		 */
-		unsigned int admin_reserve =
-			enic->has_admin_channel ? 1 : 0;
-
 		/* Adjust the number of wqs/rqs/cqs/interrupts that will be
-		 * used based on which resource is the most constrained
+		 * used based on which resource is the most constrained.
+		 * Reserve one extra MSI-X slot for the admin channel INTR
+		 * when has_admin_channel is set so that
+		 * enic_admin_setup_intr() can allocate at intr_count
+		 * within the intr_avail bounds even when the data queue
+		 * count is maxed out.  intr_count counts only the data-path
+		 * IRQs (registered by enic_request_intr()); the admin INTR
+		 * lives at msix index intr_count and is set up later by
+		 * enic_admin_setup_intr().
 		 */
+		unsigned int admin_reserve = enic->has_admin_channel ? 1 : 0;
+
 		wq_avail = min(enic->wq_avail, ENIC_WQ_MAX);
 		rq_default = max(netif_get_num_default_rss_queues(),
 				 ENIC_RQ_MIN_DEFAULT);
@@ -3096,6 +3105,44 @@ static int enic_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
 		goto err_out_dev_close;
 	}
 
+	/* Initialise link_notify_work before the V2-VF admin-open block below:
+	 * its error path (err_out_admin_close -> enic_admin_channel_close() ->
+	 * cancel_work_sync()) would otherwise act on an uninitialised work.
+	 */
+	INIT_WORK(&enic->link_notify_work, enic_link_notify_work_handler);
+
+	/* V2 VF: open admin channel and register with PF.
+	 * Must happen before register_netdev so the VF is fully
+	 * initialized before the interface is visible to userspace.
+	 *
+	 * admin_channel_open() runs before enic_mbox_init() installs
+	 * the receive handler.  This is safe because
+	 * enic_admin_rq_cq_service() checks admin_rq_handler before
+	 * enqueuing any received buffer, so interrupts that fire
+	 * between open and mbox_init are harmlessly discarded.
+	 */
+	if (enic_is_sriov_vf_v2(enic)) {
+		err = enic_admin_channel_open(enic);
+		if (err) {
+			dev_err(dev,
+				"Failed to open admin channel: %d\n", err);
+			goto err_out_dev_deinit;
+		}
+		enic_mbox_init(enic);
+		err = enic_mbox_vf_capability_check(enic);
+		if (err) {
+			dev_err(dev,
+				"MBOX capability check failed: %d\n", err);
+			goto err_out_admin_close;
+		}
+		err = enic_mbox_vf_register(enic);
+		if (err) {
+			dev_err(dev,
+				"MBOX VF registration failed: %d\n", err);
+			goto err_out_admin_close;
+		}
+	}
+
 	netif_set_real_num_tx_queues(netdev, enic->wq_count);
 	netif_set_real_num_rx_queues(netdev, enic->rq_count);
 
@@ -3108,7 +3155,6 @@ static int enic_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
 	INIT_WORK(&enic->reset, enic_reset);
 	INIT_WORK(&enic->tx_hang_reset, enic_tx_hang_reset);
 	INIT_WORK(&enic->change_mtu_work, enic_change_mtu_work);
-	INIT_WORK(&enic->link_notify_work, enic_link_notify_work_handler);
 
 	for (i = 0; i < enic->wq_count; i++)
 		spin_lock_init(&enic->wq[i].lock);
@@ -3121,7 +3167,7 @@ static int enic_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
 	err = enic_set_mac_addr(netdev, enic->mac_addr);
 	if (err) {
 		dev_err(dev, "Invalid MAC address, aborting\n");
-		goto err_out_dev_deinit;
+		goto err_out_admin_close;
 	}
 
 	enic->tx_coalesce_usecs = enic->config.intr_timer_usec;
@@ -3219,11 +3265,23 @@ static int enic_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
 	err = register_netdev(netdev);
 	if (err) {
 		dev_err(dev, "Cannot register net device, aborting\n");
-		goto err_out_dev_deinit;
+		goto err_out_admin_close;
 	}
 
 	return 0;
 
+err_out_admin_close:
+	if (enic_is_sriov_vf_v2(enic)) {
+		if (enic->vf_registered) {
+			int unreg_err = enic_mbox_vf_unregister(enic);
+
+			if (unreg_err)
+				netdev_warn(netdev,
+					    "Failed to unregister from PF: %d\n",
+					    unreg_err);
+		}
+		enic_admin_channel_close(enic);
+	}
 err_out_dev_deinit:
 	enic_dev_deinit(enic);
 err_out_dev_close:
@@ -3261,7 +3319,30 @@ static void enic_remove(struct pci_dev *pdev)
 		cancel_work_sync(&enic->reset);
 		cancel_work_sync(&enic->tx_hang_reset);
 		cancel_work_sync(&enic->change_mtu_work);
+
+		/* Close the admin channel and unregister from the PF before
+		 * unregister_netdev() to prevent a late PF notification from
+		 * touching a netdev that has been freed.
+		 */
+		if (enic_is_sriov_vf_v2(enic)) {
+			if (enic->vf_registered) {
+				int unreg_err = enic_mbox_vf_unregister(enic);
+
+				if (unreg_err)
+					netdev_warn(netdev,
+						    "Failed to unregister from PF: %d\n",
+						    unreg_err);
+			}
+			enic_admin_channel_close(enic);
+		}
+
 		unregister_netdev(netdev);
+		/* unregister_netdev() -> enic_stop() stops the notify timer, so
+		 * no new link_notify_work can be queued past this point.  Cancel
+		 * unconditionally to cover the narrow window where
+		 * enic_link_check() scheduled it just as SR-IOV was disabled.
+		 */
+		cancel_work_sync(&enic->link_notify_work);
 #ifdef CONFIG_PCI_IOV
 		if (enic_sriov_enabled(enic)) {
 			if (enic->vf_type == ENIC_VF_TYPE_V2)
diff --git a/drivers/net/ethernet/cisco/enic/enic_res.c b/drivers/net/ethernet/cisco/enic/enic_res.c
index 436326ace049..74cd2ee3af5c 100644
--- a/drivers/net/ethernet/cisco/enic/enic_res.c
+++ b/drivers/net/ethernet/cisco/enic/enic_res.c
@@ -211,7 +211,8 @@ void enic_get_res_counts(struct enic *enic)
 		vnic_dev_get_res_count(enic->vdev, RES_TYPE_ADMIN_RQ) >= 1 &&
 		vnic_dev_get_res_count(enic->vdev, RES_TYPE_ADMIN_CQ) >=
 			ARRAY_SIZE(enic->admin_cq) &&
-		vnic_dev_get_res_count(enic->vdev, RES_TYPE_SRIOV_INTR) >= 1;
+		(enic_is_sriov_vf_v2(enic) ||
+		 vnic_dev_get_res_count(enic->vdev, RES_TYPE_SRIOV_INTR) >= 1);
 
 	dev_info(enic_get_dev(enic),
 		"vNIC resources avail: wq %d rq %d cq %d intr %d admin %s\n",

-- 
2.43.0



* [PATCH net-next v10 00/12] enic: SR-IOV V2 admin channel and MBOX protocol
From: Satish Kharat @ 2026-06-29 17:25 UTC
  To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni
  Cc: netdev, linux-kernel, Sesidhar Baddela, Satish Kharat,
	Breno Leitao

This series adds the admin channel infrastructure and mailbox (MBOX)
protocol needed for V2 SR-IOV support in the enic driver.

The V2 SR-IOV design uses a direct PF-VF communication channel built on
dedicated WQ/RQ/CQ hardware resources and an MSI-X interrupt.

Patch 1 is an independent fix for a pre-existing use-after-free in
enic_remove() (the tx_hang_reset work item was never cancelled on
removal).  It is unrelated to SR-IOV but lives in the same teardown
path the later patches touch, so it is carried at the head of the
series.

Firmware capability and admin channel infrastructure (patches 2-5):
  - Probe-time firmware feature check for V2 SR-IOV support
  - Admin channel open/close, RQ buffer management, CQ service
    with MSI-X interrupt and workqueue-based polling

MBOX protocol and VF enable (patches 6-11):
  - MBOX message types, core send/receive, PF and VF handlers
  - V2 SR-IOV enable wiring with admin channel setup
  - V2 VF probe with admin channel and PF registration

Patch 12 completes reset recovery for V2 VFs: the reset paths added
earlier in the series re-establish the admin channel only for the PF,
which left a VF unregistered and unable to exchange MBOX traffic after
a reset taken on the VF.

Signed-off-by: Satish Kharat <satishkh@cisco.com>
---
Changes in v10:
- Cancel tx_hang_reset work in enic_remove() to fix a pre-existing
  use-after-free when a TX timeout fires during device removal; carried
  as an independent fix at the head of the series (new patch 1)
  [Sashiko]
- Clarify in the patch 2 commit message that V2 VFs are only enabled via
  .sriov_configure, which rejects firmware without V2 support, so such
  firmware never exposes VFs (patch 2) [Sashiko]
- Track admin-channel up/down state and gate admin/MBOX operations on
  it, fixing a NULL pointer dereference when close() runs after a failed
  open() and when a reset fails to reopen the channel (patch 3)
  [Sashiko]
- Bound the admin message list with ENIC_ADMIN_MSG_MAX (256) to prevent
  a malicious VF from exhausting PF memory (patch 5) [Sashiko]
- Name the admin MSI-X interrupt with pci_name() instead of the
  not-yet-registered netdev name so it no longer appears as
  "eth%d-admin" in /proc/interrupts (patch 5) [Sashiko]
- Document the in-order admin CQ/RQ completion guarantee in a comment
  (patch 5) [Sashiko]
- On a VF, validate that admin MBOX messages are sourced from the PF
  before acting on them, rejecting spoofed link-state messages (patch 9)
  [Sashiko]
- Move the link_notify_work initialisation ahead of the VF setup block
  so a VF probe error path cannot cancel_work_sync() an uninitialised
  work item (patch 11) [Sashiko]
- Cancel link_notify_work in enic_remove() after unregister_netdev() to
  close the narrow window where enic_link_check() could schedule it just
  as SR-IOV was disabled, leaving the work to outlive vf_state (patch 11)
  [Sashiko]
- Re-establish the V2 VF admin channel and re-run PF registration after
  a driver-initiated device reset (the soft reset from a WQ/RQ error and
  the tx-hang reset from a TX timeout); previously only the PF recovered,
  so a reset taken on a VF left it unable to exchange MBOX traffic
  (new patch 12) [Sashiko]

Testing:
- Exercised on a Cisco VIC with multiple V2 VFs under a KASAN + lockdep
  + DMA-API-debug kernel.  VF resets (soft and tx-hang) and PF reset,
  including a 10x reset stress loop, re-established the admin channel and
  re-registered the VFs with no use-after-free, lockdep, or DMA-API
  warnings.  MBOX control-plane operations (VF MAC/VLAN/spoofchk/trust/
  MTU) were verified to survive resets.

- Link to v9: https://patch.msgid.link/20260617-enic-sriov-v2-admin-channel-v2-v9-0-37f5f5af4c93@cisco.com

Changes in v9:
- Use dma_rmb() instead of rmb() when reading admin RQ completion
  descriptors written by DMA (patch 4) [Sashiko]
- Use GFP_KERNEL instead of GFP_ATOMIC for admin RQ refill and for
  received-message allocation; both run in workqueue (process)
  context after the v8 NAPI-to-workqueue switch (patch 4) [Sashiko]
- Correct the enic_admin_msg comment to describe the workqueue
  enqueue path rather than NAPI (patch 4) [Sashiko]
- Set mbox_send_disabled in enic_admin_channel_close() so a MBOX
  send cannot race with channel teardown (patch 6) [Sashiko]
- Send the actual PF carrier state to a VF on registration instead
  of unconditionally reporting link up (patch 7) [Sashiko]
- Call reinit_completion() before setting mbox_expected_reply so a
  reply arriving between the two is not missed (patch 8) [Sashiko]
- Defer PF->VF link state notification to a workqueue and gate it on
  carrier transitions; enic_link_check() runs in the notify (atomic)
  context while the MBOX send sleeps on a mutex/completion (patch 9)
  [Sashiko]
- Clear ENIC_SRIOV_ENABLED and cancel the link-notify work before
  freeing per-VF state in the SR-IOV disable path, closing a
  use-after-free window against a concurrent link notification
  (patch 9) [Sashiko]
- Link to v8: https://patch.msgid.link/20260609-enic-sriov-v2-admin-channel-v2-v8-0-8ad8babbb826@cisco.com

Changes in v8:
- Replace NAPI polling with workqueue for admin CQ service — admin
  channel is low-frequency control traffic, not data path (patch 4)
  [Jakub Kicinski]
- Use explicit enum value (= 4) for VIC_FEATURE_SRIOV instead of
  placeholder VIC_FEATURE_PTP entry (patch 1) [Breno Leitao]
- Remove unnecessary rmb() in WQ CQ service (patch 4) [Jakub Kicinski]
- Remove admin_msg_drop_cnt counter (patch 4) [Simon Horman]
- Drop NAPI reschedule on RQ refill failure — the NAPI-to-workqueue
  switch removes the livelock and budget issues (patch 4) [Simon Horman]
- Remove unnecessary READ_ONCE/WRITE_ONCE on admin_rq_handler — all
  access is serialized by probe/remove (patch 6) [Jakub Kicinski]
- Fix checkpatch line-length warnings (patches 3, 5, 6)
- Rate-limit link state send failure and ACK error warnings (patch 7)
  [Jakub Kicinski]
- Correct enic_link_check comment to describe actual PF link state
  notification flow (patch 7) [Simon Horman]
- Correct mbox_expected_reply comment — serialization is by
  RTNL/probe, not mbox_lock (patch 8) [Jakub Kicinski]
- Wire enic_mbox_send_link_state() from enic_link_check() so PF
  notifies VFs on carrier change (patch 9) [Simon Horman]
- Fix commit message wording about MSI-X reservation (patch 10)
  [Simon Horman]
- Link to v7: https://patch.msgid.link/20260513-enic-sriov-v2-admin-channel-v2-v7-0-68b9f4141f4c@cisco.com

Changes in v7:
- Replace magic numbers in admin channel init with named macros
  and inline comments for MBOX descriptor encoding
  (patches 2, 6) [Paolo Abeni]
- Add defense-in-depth bounds check on admin RQ bytes_written (patch 4)
- Force NAPI reschedule on admin RQ refill failure (patch 4)
- Always unmask admin interrupt even with zero credits (patch 4)
- Reorder NAPI init before request_irq in admin channel open (patch 4)
- Remove redundant netdev_warn on admin msg enqueue kmalloc failure
  (patch 4) [Paolo Abeni]
- Add netdev_warn on admin WQ/RQ disable failure in close path
  (patch 2)
- Remove incorrect RES_TYPE_SRIOV_INTR interrupt allocation from
  admin channel open (patch 2); interrupt setup handled entirely
  in patch 4 using RES_TYPE_INTR_CTRL
- Rate-limit VF register/unregister log messages (patch 7) [Paolo Abeni]
- Add __aligned(8) to admin message data[] for strict-alignment
  safety (patch 4)
- Rate-limit MBOX handler error warnings (patch 7)
- Pre-allocate port profile array before pci_disable_sriov in V1
  disable path to avoid half-torn-down state on alloc failure (patch 9)
- Account for admin channel interrupt reservation in
  enic_set_intr_mode() and enic_adjust_resources() (patch 9) [Paolo Abeni]
- Clear admin_rq_handler in enic_admin_channel_close (patch 9)
- Quiesce admin channel (mask interrupt, disable NAPI, block MBOX
  sends) around soft reset (patch 9)
- Use WRITE_ONCE/READ_ONCE for mbox_send_disabled and
  admin_rq_handler across data-path/reset boundaries
  (patches 4, 6, 9)
- Fix commit message: reference enic_adjust_resources() alongside
  enic_set_intr_mode() (patch 10)

Investigated findings from automated review (Simon Horman / Sashiko):
- Race between probe-time feature check and VF proxy: false positive;
  detection runs at probe, enable runs from sriov_configure
- Struct alignment of __le32 after 2-byte mbox_hdr_embed: compiler
  inserts correct padding, no manual alignment needed
- Stale MBOX reply matching / reinit_completion race: single-flight
  design with mutex serialization prevents this
- cancel_work_sync vs MBOX unregister race: work cannot be
  re-triggered during the close window
- Link to v6: https://patch.msgid.link/20260503-enic-sriov-v2-admin-channel-v2-v6-0-0af4fbc2d86d@cisco.com

Changes in v6:
- Add explanatory comments documenting admin_cq[0] (WQ CQE size) and
  admin_cq[1] (RQ CQE size matching firmware enic_ext_cq() programming)
  allocations (patch 2)
- Enforce bytes_written from CQ descriptor when enqueuing admin RQ
  message; previously buf->len (allocation size) was passed, exposing
  uninitialized buffer memory beyond the real payload (patch 4)
- Drop admin RQ messages with TRUNCATED set or FCS_OK clear, gated by
  netdev_warn_once() (patch 4)
- Disable interrupt_enable on admin_cq[0]: WQ completions are polled
  synchronously inside enic_mbox_send_msg() and never raise an
  interrupt; matches admin_cq[1] (RQ) which does NAPI polling (patch 4)
- Add mbox_expected_reply gating in VF reply handlers (capability,
  register, unregister): drop replies whose type does not match the
  current waiter's expected type, avoiding spurious wakeup of an
  unrelated waiter from a stale reply that arrives after timeout
  (patch 8)
- Distinguish error returns in enic_mbox_vf_unregister(): -ETIMEDOUT
  (no reply received), -EACCES (PF rejected the unregister), 0 on
  success.  Previously all paths collapsed to a single -ETIMEDOUT
  (patch 8)
- Reserve one extra MSI-X slot in enic_set_intr_mode() when
  has_admin_channel is set so enic_admin_setup_intr() always has room
  to allocate at intr_count without exceeding intr_avail bounds when
  data queue count is maxed out (patch 10)
- Clarify in commit messages that .sriov_configure is intentionally
  not yet wired in this series and will be added in a follow-up after
  the necessary devcmd hardening lands (patch 9)
- Link to v5: https://patch.msgid.link/20260423-enic-sriov-v2-admin-channel-v2-v5-0-caa9f504a3dc@cisco.com

Changes in v5:
- Fix DMA-into-freed-memory race: call enic_admin_qp_type_set() before
  disabling RQ/WQ in both error and close paths (patch 3)
- Fix DMA mapping leak: enic_admin_wq_buf_clean() now unmaps and frees
  WQ buffers still held at close time after a send timeout (patch 3)
- Log rate-limited warning on admin RQ refill failure (patch 4)
- Add missing linux/types.h and linux/bits.h includes to enic_mbox.h
  (patch 5)
- Guard mbox_lock/mbox_comp init with mbox_initialized flag to prevent
  re-initialization on sriov_configure re-entry (patch 7)
- Clear VF registered state before sending unregister reply so PF does
  not treat a dead VF as still registered (patch 8)
- Gate VF-facing log messages with net_ratelimit() to prevent malicious
  VF from flooding PF dmesg (patch 8)
- Reject VF port profile requests when V2 SR-IOV is active since
  enic->pp is not reallocated for V2 VFs (patch 9)
- Move enic_sriov_detect_vf_type() before auto-enable check; skip
  probe-time auto-enable for V2 VFs (patch 9)
- Move admin channel close and VF unregister before unregister_netdev()
  in enic_remove() to prevent use-after-free on netdev (patch 10)
- Add comment in enic_reset() documenting that admin channel is not
  recovered after soft reset (patch 10)
- Bypass RES_TYPE_SRIOV_INTR check for V2 VFs in admin channel
  capability detection (patch 10)
- Link to v4: https://patch.msgid.link/20260411-enic-sriov-v2-admin-channel-v2-v4-0-f052326c2a57@cisco.com

Changes in v4:
- Fix reverse xmas tree variable ordering (patches 1, 6)
- Use kzalloc_obj instead of kzalloc with sizeof (patch 9)
- Add NULL check for pp allocation in V1 SR-IOV disable path (patch 9)
- Link to v3: https://lore.kernel.org/r/20260408-enic-sriov-v2-admin-channel-v2-v3-0-1d4999a03cec@cisco.com

Changes in v3:
- Use early-return pattern in enic_sriov_detect_vf_type to reduce
  nesting (patch 1) [Breno Leitao]
- Link to v2: https://lore.kernel.org/r/20260408-enic-sriov-v2-admin-channel-v2-v2-0-d05dd3623fd3@cisco.com

Changes in v2:
- Fix lines exceeding 80 columns (patches 4, 6, 7, 8)
- Add __maybe_unused to enic_sriov_configure and enic_sriov_v2_enable;
  .sriov_configure wiring deferred to a later series after devcmd
  hardening is in place (patch 9)
- Guard probe-time auto-enable to skip V2 VFs (patch 9)
- Link to v1: https://lore.kernel.org/r/20260406-enic-sriov-v2-admin-channel-v2-v1-0-82cc47636a78@cisco.com

---
Satish Kharat (12):
      enic: cancel tx_hang_reset work on device removal
      enic: verify firmware supports V2 SR-IOV at probe time
      enic: add admin channel open and close for SR-IOV
      enic: add admin RQ buffer management
      enic: add admin CQ service with MSI-X interrupt and workqueue polling
      enic: define MBOX message types and header structures
      enic: add MBOX core send and receive for admin channel
      enic: add MBOX PF handlers for VF register and capability
      enic: add MBOX VF handlers for capability, register and link state
      enic: wire V2 SR-IOV enable with admin channel and MBOX
      enic: add V2 VF probe with admin channel and PF registration
      enic: re-establish V2 VF admin channel and PF registration after reset

 drivers/net/ethernet/cisco/enic/Makefile      |   3 +-
 drivers/net/ethernet/cisco/enic/enic.h        |  40 +-
 drivers/net/ethernet/cisco/enic/enic_admin.c  | 626 +++++++++++++++++++++++++
 drivers/net/ethernet/cisco/enic/enic_admin.h  |  27 ++
 drivers/net/ethernet/cisco/enic/enic_main.c   | 386 +++++++++++++++-
 drivers/net/ethernet/cisco/enic/enic_mbox.c   | 640 ++++++++++++++++++++++++++
 drivers/net/ethernet/cisco/enic/enic_mbox.h   |  95 ++++
 drivers/net/ethernet/cisco/enic/enic_pp.c     |   5 +
 drivers/net/ethernet/cisco/enic/enic_res.c    |   4 +-
 drivers/net/ethernet/cisco/enic/vnic_cq.h     |   9 +
 drivers/net/ethernet/cisco/enic/vnic_devcmd.h |  13 +
 drivers/net/ethernet/cisco/enic/vnic_enet.h   |   4 +-
 12 files changed, 1832 insertions(+), 20 deletions(-)
---
base-commit: b85966adbf5de0668a815c6e3527f87e0c387fb4
change-id: 20260404-enic-sriov-v2-admin-channel-v2-c0aa3e988833

Best regards,
--  
Satish Kharat <satishkh@cisco.com>


^ permalink raw reply

* [PATCH net-next v10 05/12] enic: add admin CQ service with MSI-X interrupt and workqueue polling
From: Satish Kharat @ 2026-06-29 17:25 UTC (permalink / raw)
  To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni
  Cc: netdev, linux-kernel, Sesidhar Baddela, Satish Kharat
In-Reply-To: <20260629-enic-sriov-v2-admin-channel-v2-v10-0-62569af83417@cisco.com>

Add completion queue (CQ) service for the admin channel work queue
(WQ) and receive queue (RQ), driven by a dedicated MSI-X interrupt
and a workqueue-based CQ poller.

The admin WQ CQ service advances the completion ring and returns the
number of descriptors consumed.  The admin RQ CQ service does the
same for receive completions and copies each received message into a
preallocated buffer.  Received messages are enqueued for deferred
dispatch by a separate work_struct so the CQ poller stays short.
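
As an illustration only, the color-bit draining described above can be modeled in user space. This is a minimal sketch, not the driver code: the names model_cq, model_cq_post() and model_cq_service(), the ring depth, and the bit position of the color flag are all assumptions made for the example. It shows how a consumer counts new completions by comparing each descriptor's color against last_color, and flips last_color when the ring wraps.

```c
#include <assert.h>
#include <stdint.h>

#define RING_SIZE   8u	/* hypothetical ring depth */
#define COLOR_SHIFT 7u	/* assumed: color is the top bit of the last byte */
#define COLOR_MASK  1u

struct model_cq {
	uint8_t type_color[RING_SIZE];	/* last byte of each descriptor */
	unsigned int to_clean;		/* next descriptor to inspect */
	unsigned int last_color;	/* color of entries not yet written */
};

/* Extract the color bit of the descriptor at to_clean. */
static unsigned int model_cq_color(const struct model_cq *cq)
{
	return (cq->type_color[cq->to_clean] >> COLOR_SHIFT) & COLOR_MASK;
}

/* Producer side (stands in for hardware): write one completion. */
static void model_cq_post(struct model_cq *cq, unsigned int slot,
			  unsigned int color)
{
	assert(slot < RING_SIZE);
	cq->type_color[slot] = (uint8_t)((color & COLOR_MASK) << COLOR_SHIFT);
}

/* Consumer side: advance until a descriptor still carries last_color,
 * and return the number of descriptors consumed.
 */
static unsigned int model_cq_service(struct model_cq *cq)
{
	unsigned int work = 0;

	while (model_cq_color(cq) != cq->last_color) {
		cq->to_clean++;
		if (cq->to_clean == RING_SIZE) {
			cq->to_clean = 0;
			cq->last_color = !cq->last_color; /* flip on wrap */
		}
		work++;
	}

	return work;
}
```

The model leaves out the DMA read barrier and the per-descriptor payload handling that a real receive path needs between the color check and the field reads.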

When the MSI-X interrupt fires, the ISR schedules the CQ poll
work_struct.  The work handler drains all pending completions, kicks
message dispatch if work was done, and returns credits to unmask the
interrupt.
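
The handoff between the CQ poller and the dispatch work can be pictured as a bounded backlog that the dispatcher detaches in one step before processing. The following is a single-threaded user-space sketch under stated assumptions: model_queue, model_enqueue() and model_dispatch() are hypothetical names, locking is omitted entirely, and the cap is checked before allocating, which differs from the ordering in the patch. Only the cap value of 256 is taken from ENIC_ADMIN_MSG_MAX in this patch.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define MODEL_MSG_MAX 256u	/* same cap as ENIC_ADMIN_MSG_MAX */

struct model_msg {
	struct model_msg *next;
	unsigned int len;
	unsigned char data[];
};

struct model_queue {
	struct model_msg *head, *tail;
	unsigned int count;	/* current backlog depth */
};

/* Poller side: copy the payload and append it to the backlog.
 * Returns 0 on success, -1 if the message was dropped.
 */
static int model_enqueue(struct model_queue *q, const void *buf,
			 unsigned int len)
{
	struct model_msg *msg;

	assert(q && buf);
	if (q->count >= MODEL_MSG_MAX)
		return -1;	/* backlog full: drop */

	msg = malloc(sizeof(*msg) + len);
	if (!msg)
		return -1;

	msg->next = NULL;
	msg->len = len;
	memcpy(msg->data, buf, len);

	if (q->tail)
		q->tail->next = msg;
	else
		q->head = msg;
	q->tail = msg;
	q->count++;

	return 0;
}

/* Dispatcher side: detach the whole backlog first, then hand each
 * message to the handler and free it.  Returns the number handled.
 */
static unsigned int model_dispatch(struct model_queue *q,
				   void (*handler)(const void *buf,
						   unsigned int len))
{
	struct model_msg *msg = q->head, *next;
	unsigned int handled = 0;

	q->head = NULL;
	q->tail = NULL;
	q->count = 0;

	for (; msg; msg = next) {
		next = msg->next;
		if (handler)
			handler(msg->data, msg->len);
		free(msg);
		handled++;
	}

	return handled;
}
```

Detaching the list before walking it keeps the time spent on the shared queue short, so new messages can be appended while earlier ones are still being handled.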

Log a rate-limited warning when admin RQ buffer refill fails so that
transient memory pressure is visible without flooding the log.

Signed-off-by: Satish Kharat <satishkh@cisco.com>
---
 drivers/net/ethernet/cisco/enic/enic.h       |   8 +
 drivers/net/ethernet/cisco/enic/enic_admin.c | 311 ++++++++++++++++++++++++++-
 drivers/net/ethernet/cisco/enic/enic_admin.h |  12 ++
 3 files changed, 327 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/cisco/enic/enic.h b/drivers/net/ethernet/cisco/enic/enic.h
index 398227448b37..401123e6df1d 100644
--- a/drivers/net/ethernet/cisco/enic/enic.h
+++ b/drivers/net/ethernet/cisco/enic/enic.h
@@ -301,6 +301,14 @@ struct enic {
 	struct vnic_rq admin_rq;
 	struct vnic_cq admin_cq[2];
 	struct vnic_intr admin_intr;
+	struct work_struct admin_poll_work;
+	unsigned int admin_intr_index;
+	struct work_struct admin_msg_work;
+	spinlock_t admin_msg_lock;	/* protects admin_msg_list */
+	struct list_head admin_msg_list;
+	unsigned int admin_msg_count;	/* current depth of admin_msg_list */
+	void (*admin_rq_handler)(struct enic *enic, void *buf,
+				 unsigned int len);
 };
 
 static inline struct net_device *vnic_get_netdev(struct vnic_dev *vdev)
diff --git a/drivers/net/ethernet/cisco/enic/enic_admin.c b/drivers/net/ethernet/cisco/enic/enic_admin.c
index b2be42092106..6062a18043ba 100644
--- a/drivers/net/ethernet/cisco/enic/enic_admin.c
+++ b/drivers/net/ethernet/cisco/enic/enic_admin.c
@@ -4,6 +4,7 @@
 #include <linux/kernel.h>
 #include <linux/netdevice.h>
 #include <linux/dma-mapping.h>
+#include <linux/interrupt.h>
 
 #include "vnic_dev.h"
 #include "vnic_wq.h"
@@ -15,6 +16,7 @@
 #include "enic.h"
 #include "enic_admin.h"
 #include "cq_desc.h"
+#include "cq_enet_desc.h"
 #include "wq_enet_desc.h"
 #include "rq_enet_desc.h"
 
@@ -94,6 +96,254 @@ static void enic_admin_rq_drain(struct enic *enic)
 	vnic_rq_clean(&enic->admin_rq, enic_admin_rq_buf_clean);
 }
 
+static unsigned int enic_admin_cq_color(void *cq_desc, unsigned int desc_size)
+{
+	u8 type_color = *((u8 *)cq_desc + desc_size - 1);
+
+	return (type_color >> CQ_DESC_COLOR_SHIFT) & CQ_DESC_COLOR_MASK;
+}
+
+unsigned int enic_admin_wq_cq_service(struct enic *enic)
+{
+	struct vnic_cq *cq = &enic->admin_cq[0];
+	unsigned int work = 0;
+	void *desc;
+
+	desc = vnic_cq_to_clean(cq);
+	while (enic_admin_cq_color(desc, cq->ring.desc_size) !=
+	       cq->last_color) {
+		vnic_cq_inc_to_clean(cq);
+		work++;
+		desc = vnic_cq_to_clean(cq);
+	}
+
+	return work;
+}
+
+/* Upper bound on pending admin messages.  A buggy or hostile VF could flood
+ * the PF admin channel faster than admin_msg_work drains it; cap the backlog
+ * so a guest cannot drive the host out of memory.
+ */
+#define ENIC_ADMIN_MSG_MAX	256
+
+static void enic_admin_msg_enqueue(struct enic *enic, void *buf,
+				   unsigned int len)
+{
+	struct enic_admin_msg *msg;
+
+	msg = kmalloc(struct_size(msg, data, len), GFP_KERNEL);
+	if (!msg)
+		return;
+
+	msg->len = len;
+	memcpy(msg->data, buf, len);
+
+	spin_lock(&enic->admin_msg_lock);
+	if (enic->admin_msg_count >= ENIC_ADMIN_MSG_MAX) {
+		spin_unlock(&enic->admin_msg_lock);
+		kfree(msg);
+		if (net_ratelimit())
+			netdev_warn(enic->netdev,
+				    "admin msg backlog full (%u); dropping\n",
+				    ENIC_ADMIN_MSG_MAX);
+		return;
+	}
+	list_add_tail(&msg->list, &enic->admin_msg_list);
+	enic->admin_msg_count++;
+	spin_unlock(&enic->admin_msg_lock);
+}
+
+unsigned int enic_admin_rq_cq_service(struct enic *enic)
+{
+	struct vnic_cq *cq = &enic->admin_cq[1];
+	struct vnic_rq *rq = &enic->admin_rq;
+	struct cq_enet_rq_desc *rq_desc;
+	struct vnic_rq_buf *buf;
+	u16 bwf, bytes_written;
+	unsigned int work = 0;
+	void *desc;
+
+	/* The admin RQ and its CQ form a single in-order channel: firmware
+	 * posts exactly one CQE per consumed RQ descriptor, in submission
+	 * order.  Each CQE therefore pairs with rq->to_clean below without a
+	 * completed_index cross-check, mirroring the in-order assumption of
+	 * the main enic RX path.
+	 */
+	desc = vnic_cq_to_clean(cq);
+	while (enic_admin_cq_color(desc, cq->ring.desc_size) !=
+	       cq->last_color) {
+		/* Ensure DMA descriptor fields are read after
+		 * the color/valid check.  dma_rmb() is the
+		 * correct barrier for DMA-written descriptors.
+		 */
+		dma_rmb();
+		buf = rq->to_clean;
+
+		/* Decode the actual number of bytes hardware wrote into
+		 * the RX buffer.  buf->len is the static allocation size
+		 * (ENIC_ADMIN_BUF_SIZE) and would expose uninitialised
+		 * heap memory beyond the real payload.  bytes_written_flags
+		 * is at the same offset in every cq_enet_rq_desc[_32|_64]
+		 * variant.
+		 */
+		rq_desc = desc;
+		bwf = le16_to_cpu(rq_desc->bytes_written_flags);
+		bytes_written = bwf & CQ_ENET_RQ_DESC_BYTES_WRITTEN_MASK;
+		if (bytes_written > buf->len)
+			goto next_desc;
+
+		dma_sync_single_for_cpu(&enic->pdev->dev,
+					buf->dma_addr, buf->len,
+					DMA_FROM_DEVICE);
+
+		/* Drop on hardware error indications.  Admin messages
+		 * are internal to the VIC, not received over the wire.
+		 * Firmware sets TRUNCATED when the message does not fit
+		 * in the posted buffer, and FCS_OK is always set on
+		 * healthy admin completions.
+		 */
+		if (bwf & CQ_ENET_RQ_DESC_FLAGS_TRUNCATED) {
+			netdev_warn_once(enic->netdev,
+					 "admin RQ: truncated message dropped\n");
+			goto next_desc;
+		}
+		if (!(rq_desc->flags & CQ_ENET_RQ_DESC_FLAGS_FCS_OK)) {
+			netdev_warn_once(enic->netdev,
+					 "admin RQ: bad FCS, dropping message\n");
+			goto next_desc;
+		}
+
+		enic_admin_msg_enqueue(enic, buf->os_buf, bytes_written);
+
+next_desc:
+		enic_admin_rq_buf_clean(rq, rq->to_clean);
+		rq->to_clean = rq->to_clean->next;
+		rq->ring.desc_avail++;
+
+		vnic_cq_inc_to_clean(cq);
+		work++;
+		desc = vnic_cq_to_clean(cq);
+	}
+
+	if (enic_admin_rq_fill(enic, GFP_KERNEL) && net_ratelimit())
+		netdev_warn(enic->netdev,
+			    "admin RQ refill failed\n");
+
+	return work;
+}
+
+static irqreturn_t enic_admin_isr_msix(int irq, void *data)
+{
+	struct enic *enic = data;
+
+	schedule_work(&enic->admin_poll_work);
+
+	return IRQ_HANDLED;
+}
+
+static void enic_admin_msg_work_handler(struct work_struct *work)
+{
+	struct enic *enic = container_of(work, struct enic, admin_msg_work);
+	struct enic_admin_msg *msg, *tmp;
+	LIST_HEAD(local_list);
+
+	spin_lock_bh(&enic->admin_msg_lock);
+	list_splice_init(&enic->admin_msg_list, &local_list);
+	enic->admin_msg_count = 0;
+	spin_unlock_bh(&enic->admin_msg_lock);
+
+	list_for_each_entry_safe(msg, tmp, &local_list, list) {
+		if (enic->admin_rq_handler)
+			enic->admin_rq_handler(enic, msg->data, msg->len);
+		list_del(&msg->list);
+		kfree(msg);
+	}
+}
+
+static void enic_admin_poll_work_handler(struct work_struct *work)
+{
+	struct enic *enic = container_of(work, struct enic, admin_poll_work);
+	unsigned int credits;
+	unsigned int rq_work;
+
+	credits = vnic_intr_credits(&enic->admin_intr);
+
+	rq_work = enic_admin_rq_cq_service(enic);
+
+	if (rq_work > 0)
+		schedule_work(&enic->admin_msg_work);
+
+	vnic_intr_return_credits(&enic->admin_intr,
+				 credits ?: 1,
+				 1 /* unmask */, 0);
+}
+
+static int enic_admin_setup_intr(struct enic *enic)
+{
+	unsigned int intr_index = enic->intr_count;
+	int err;
+
+	if (vnic_dev_get_intr_mode(enic->vdev) != VNIC_DEV_INTR_MODE_MSIX ||
+	    intr_index >= enic->intr_avail)
+		return -ENODEV;
+
+	/* The admin INTR uses a slot in the same RES_TYPE_INTR_CTRL
+	 * strided array of per-vector control blocks (mask, coalescing
+	 * timer, credit return) that the data-path IRQs occupy in BAR0.
+	 * vnic_intr_alloc() defaults to RES_TYPE_INTR_CTRL, which is what
+	 * we want here.
+	 */
+	err = vnic_intr_alloc(enic->vdev, &enic->admin_intr, intr_index);
+	if (err) {
+		netdev_warn(enic->netdev,
+			    "Failed to alloc admin intr at index %u: %d\n",
+			    intr_index, err);
+		return err;
+	}
+
+	enic->admin_intr_index = intr_index;
+
+	/* A V2 VF opens the admin channel during probe, before
+	 * register_netdev() resolves the "eth%d" name template, so using
+	 * netdev->name here would register the literal "eth%d-admin" in
+	 * /proc/interrupts.  Use the already-stable PCI device name instead.
+	 */
+	snprintf(enic->msix[intr_index].devname,
+		 sizeof(enic->msix[intr_index].devname),
+		 "%s-admin", pci_name(enic->pdev));
+	enic->msix[intr_index].isr = enic_admin_isr_msix;
+	enic->msix[intr_index].devid = enic;
+
+	err = request_irq(enic->msix_entry[intr_index].vector,
+			  enic->msix[intr_index].isr, 0,
+			  enic->msix[intr_index].devname,
+			  enic->msix[intr_index].devid);
+	if (err) {
+		netdev_warn(enic->netdev,
+			    "Failed to request admin MSI-X irq: %d\n", err);
+		vnic_intr_free(&enic->admin_intr);
+		return err;
+	}
+
+	enic->msix[intr_index].requested = 1;
+
+	netdev_dbg(enic->netdev,
+		   "admin channel using MSI-X interrupt (index %u)\n",
+		   intr_index);
+
+	return 0;
+}
+
+static void enic_admin_teardown_intr(struct enic *enic)
+{
+	unsigned int intr_index = enic->admin_intr_index;
+
+	free_irq(enic->msix_entry[intr_index].vector,
+		 enic->msix[intr_index].devid);
+	cancel_work_sync(&enic->admin_poll_work);
+	enic->msix[intr_index].requested = 0;
+}
+
 static int enic_admin_qp_type_set(struct enic *enic, u32 enable)
 {
 	u64 a0 = QP_TYPE_ADMIN, a1 = enable;
@@ -173,6 +423,7 @@ static int enic_admin_alloc_resources(struct enic *enic)
 
 static void enic_admin_free_resources(struct enic *enic)
 {
+	vnic_intr_free(&enic->admin_intr);
 	vnic_cq_free(&enic->admin_cq[1]);
 	vnic_cq_free(&enic->admin_cq[0]);
 	vnic_rq_free(&enic->admin_rq);
@@ -181,6 +432,8 @@ static void enic_admin_free_resources(struct enic *enic)
 
 static void enic_admin_init_resources(struct enic *enic)
 {
+	unsigned int intr_offset = enic->admin_intr_index;
+
 	vnic_wq_init(&enic->admin_wq,
 		     0, 0, 0); /* cq_index, err_intr_enable, err_intr_offset */
 	vnic_rq_init(&enic->admin_rq,
@@ -189,20 +442,35 @@ static void enic_admin_init_resources(struct enic *enic)
 		     VNIC_CQ_FC_DISABLE,
 		     VNIC_CQ_COLOR_ENABLE,
 		     0, 0, 1, /* cq_head, cq_tail, cq_tail_color */
-		     VNIC_CQ_INTR_DISABLE,
+		     VNIC_CQ_INTR_DISABLE, /* polled synchronously by mbox send */
 		     VNIC_CQ_ENTRY_ENABLE,
 		     VNIC_CQ_MSG_DISABLE,
-		     0, /* interrupt_offset */
+		     intr_offset,
 		     0 /* cq_message_addr */);
 	vnic_cq_init(&enic->admin_cq[1],
 		     VNIC_CQ_FC_DISABLE,
 		     VNIC_CQ_COLOR_ENABLE,
 		     0, 0, 1, /* cq_head, cq_tail, cq_tail_color */
-		     VNIC_CQ_INTR_DISABLE,
+		     VNIC_CQ_INTR_ENABLE,
 		     VNIC_CQ_ENTRY_ENABLE,
 		     VNIC_CQ_MSG_DISABLE,
-		     0, /* interrupt_offset */
+		     intr_offset,
 		     0 /* cq_message_addr */);
+	vnic_intr_init(&enic->admin_intr,
+		       0, 0, 1); /* coalescing_timer, coalescing_type, mask_on_assertion */
+}
+
+static void enic_admin_msg_drain(struct enic *enic)
+{
+	struct enic_admin_msg *msg, *tmp;
+
+	spin_lock_bh(&enic->admin_msg_lock);
+	list_for_each_entry_safe(msg, tmp, &enic->admin_msg_list, list) {
+		list_del(&msg->list);
+		kfree(msg);
+	}
+	enic->admin_msg_count = 0;
+	spin_unlock_bh(&enic->admin_msg_lock);
 }
 
 int enic_admin_channel_open(struct enic *enic)
@@ -220,6 +488,19 @@ int enic_admin_channel_open(struct enic *enic)
 		return err;
 	}
 
+	spin_lock_init(&enic->admin_msg_lock);
+	INIT_LIST_HEAD(&enic->admin_msg_list);
+	INIT_WORK(&enic->admin_msg_work, enic_admin_msg_work_handler);
+	INIT_WORK(&enic->admin_poll_work, enic_admin_poll_work_handler);
+
+	err = enic_admin_setup_intr(enic);
+	if (err) {
+		netdev_err(enic->netdev,
+			   "Admin channel requires MSI-X, SR-IOV unavailable: %d\n",
+			   err);
+		goto free_resources;
+	}
+
 	enic_admin_init_resources(enic);
 
 	vnic_wq_enable(&enic->admin_wq);
@@ -239,17 +520,31 @@ int enic_admin_channel_open(struct enic *enic)
 		goto disable_queues;
 	}
 
+	vnic_intr_unmask(&enic->admin_intr);
+
+	netdev_dbg(enic->netdev,
+		   "admin channel open: intr=%u wq_avail=%u rq_avail=%u cq0_color=%u cq1_color=%u\n",
+		   enic->admin_intr_index,
+		   vnic_wq_desc_avail(&enic->admin_wq),
+		   vnic_rq_desc_avail(&enic->admin_rq),
+		   enic->admin_cq[0].last_color,
+		   enic->admin_cq[1].last_color);
+
 	enic->admin_chan_up = true;
 
 	return 0;
 
 disable_queues:
+	enic_admin_teardown_intr(enic);
 	enic_admin_qp_type_set(enic, QP_DISABLE);
 	if (vnic_wq_disable(&enic->admin_wq))
 		netdev_warn(enic->netdev, "Failed to disable admin WQ\n");
 	if (vnic_rq_disable(&enic->admin_rq))
 		netdev_warn(enic->netdev, "Failed to disable admin RQ\n");
+	cancel_work_sync(&enic->admin_msg_work);
+	enic_admin_msg_drain(enic);
 	enic_admin_rq_drain(enic);
+free_resources:
 	enic_admin_free_resources(enic);
 	return err;
 }
@@ -268,6 +563,13 @@ void enic_admin_channel_close(struct enic *enic)
 	if (!enic->admin_chan_up)
 		return;
 
+	netdev_dbg(enic->netdev, "admin channel close\n");
+
+	vnic_intr_mask(&enic->admin_intr);
+	enic_admin_teardown_intr(enic);
+	cancel_work_sync(&enic->admin_msg_work);
+	enic_admin_msg_drain(enic);
+
 	enic_admin_qp_type_set(enic, QP_DISABLE);
 
 	err = vnic_wq_disable(&enic->admin_wq);
@@ -283,6 +585,7 @@ void enic_admin_channel_close(struct enic *enic)
 	enic_admin_rq_drain(enic);
 	vnic_cq_clean(&enic->admin_cq[0]);
 	vnic_cq_clean(&enic->admin_cq[1]);
+	vnic_intr_clean(&enic->admin_intr);
 	enic_admin_free_resources(enic);
 
 	enic->admin_chan_up = false;
diff --git a/drivers/net/ethernet/cisco/enic/enic_admin.h b/drivers/net/ethernet/cisco/enic/enic_admin.h
index 569aadeb9312..62c80220b0ca 100644
--- a/drivers/net/ethernet/cisco/enic/enic_admin.h
+++ b/drivers/net/ethernet/cisco/enic/enic_admin.h
@@ -9,7 +9,19 @@
 
 struct enic;
 
+/* Wrapper for received admin messages queued for deferred processing.
+ * The admin CQ poll work handler enqueues these; a separate work handler
+ * processes them where sleeping (mutex, GFP_KERNEL) is safe.
+ */
+struct enic_admin_msg {
+	struct list_head list;
+	unsigned int len;
+	u8 data[] __aligned(8);
+};
+
 int enic_admin_channel_open(struct enic *enic);
 void enic_admin_channel_close(struct enic *enic);
+unsigned int enic_admin_wq_cq_service(struct enic *enic);
+unsigned int enic_admin_rq_cq_service(struct enic *enic);
 
 #endif /* _ENIC_ADMIN_H_ */

-- 
2.43.0


^ permalink raw reply related

* [PATCH net-next v10 08/12] enic: add MBOX PF handlers for VF register and capability
From: Satish Kharat @ 2026-06-29 17:26 UTC (permalink / raw)
  To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni
  Cc: netdev, linux-kernel, Sesidhar Baddela, Satish Kharat
In-Reply-To: <20260629-enic-sriov-v2-admin-channel-v2-v10-0-62569af83417@cisco.com>

Implement PF-side mailbox message processing for SR-IOV V2
admin channel communication.

When the PF receives a message from a VF, the dispatcher routes
it to a type-specific handler:
  - VF_CAPABILITY_REQUEST: reply with protocol version 1
  - VF_REGISTER_REQUEST: send the register reply, mark the
    VF registered on success, then send PF_LINK_STATE_NOTIF
    reflecting the PF's current carrier state
  - VF_UNREGISTER_REQUEST: mark VF unregistered, send reply
  - PF_LINK_STATE_ACK: log errors from VF acknowledgment

Per-VF state (struct enic_vf_state) is tracked via enic->vf_state,
which will be allocated when SR-IOV V2 is enabled.

Remove the CONFIG_PCI_IOV guard from num_vfs in struct enic. The
PF handlers reference enic->num_vfs for VF ID bounds checking in
enic_mbox.c, which is compiled unconditionally. The field must be
visible regardless of CONFIG_PCI_IOV to avoid build failures.

Add enic_mbox_send_link_state() helper for PF-initiated link
state notifications, also used later by ndo_set_vf_link_state.
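
To make the per-VF bookkeeping concrete, here is a small user-space model of the bounds check and the registered flag. It is a sketch under assumptions, not the driver's code: struct model_pf, the model_msg_type values, and the function names are hypothetical, no messages are actually sent, and the return codes are simplified (the real dispatcher is void and only logs).

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical message types; the numeric values are not the wire values. */
enum model_msg_type {
	MODEL_VF_CAPABILITY_REQUEST,
	MODEL_VF_REGISTER_REQUEST,
	MODEL_VF_UNREGISTER_REQUEST,
};

struct model_pf {
	bool *registered;	/* per-VF state; NULL until SR-IOV V2 is on */
	uint16_t num_vfs;
};

/* Validate the sender, then update per-VF state according to the type. */
static int model_pf_process(struct model_pf *pf, uint16_t vf_id,
			    enum model_msg_type type)
{
	assert(pf);
	if (!pf->registered || vf_id >= pf->num_vfs)
		return -EINVAL;

	switch (type) {
	case MODEL_VF_CAPABILITY_REQUEST:
		return 0;	/* a reply would carry the protocol version */
	case MODEL_VF_REGISTER_REQUEST:
		pf->registered[vf_id] = true;
		return 0;
	case MODEL_VF_UNREGISTER_REQUEST:
		pf->registered[vf_id] = false;
		return 0;
	default:
		return -EOPNOTSUPP;
	}
}

/* Link state is only worth sending to a valid, registered VF. */
static bool model_should_notify(const struct model_pf *pf, uint16_t vf_id)
{
	assert(pf);
	return pf->registered && vf_id < pf->num_vfs &&
	       pf->registered[vf_id];
}
```

The same three-part condition (state allocated, ID in range, VF registered) guards both the handlers and the link state helper, which is why num_vfs has to be visible wherever enic_mbox.c is built.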

Signed-off-by: Satish Kharat <satishkh@cisco.com>
---
 drivers/net/ethernet/cisco/enic/enic.h      |   7 +-
 drivers/net/ethernet/cisco/enic/enic_mbox.c | 190 +++++++++++++++++++++++++++-
 drivers/net/ethernet/cisco/enic/enic_mbox.h |   1 +
 3 files changed, 194 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/cisco/enic/enic.h b/drivers/net/ethernet/cisco/enic/enic.h
index b009d87da4bd..d459318c46fc 100644
--- a/drivers/net/ethernet/cisco/enic/enic.h
+++ b/drivers/net/ethernet/cisco/enic/enic.h
@@ -256,9 +256,7 @@ struct enic {
 	struct enic_rx_coal rx_coalesce_setting;
 	u32 rx_coalesce_usecs;
 	u32 tx_coalesce_usecs;
-#ifdef CONFIG_PCI_IOV
 	u16 num_vfs;
-#endif
 	enum enic_vf_type vf_type;
 	unsigned int enable_count;
 	spinlock_t enic_api_lock;
@@ -315,6 +313,11 @@ struct enic {
 	/* MBOX protocol state — mbox_lock serializes admin WQ sends */
 	struct mutex mbox_lock;
 	u64 mbox_msg_num;
+
+	/* PF: per-VF MBOX state, allocated when SRIOV V2 is enabled */
+	struct enic_vf_state {
+		bool registered;
+	} *vf_state;
 };
 
 static inline struct net_device *vnic_get_netdev(struct vnic_dev *vdev)
diff --git a/drivers/net/ethernet/cisco/enic/enic_mbox.c b/drivers/net/ethernet/cisco/enic/enic_mbox.c
index 3709704bee02..b6f05b03ae26 100644
--- a/drivers/net/ethernet/cisco/enic/enic_mbox.c
+++ b/drivers/net/ethernet/cisco/enic/enic_mbox.c
@@ -135,10 +135,183 @@ int enic_mbox_send_msg(struct enic *enic, u8 msg_type, u16 dst_vnic_id,
 	return err;
 }
 
+int enic_mbox_send_link_state(struct enic *enic, u16 vf_id, u32 link_state)
+{
+	struct enic_mbox_pf_link_state_notif_msg notif = {};
+
+	if (!enic->vf_state || vf_id >= enic->num_vfs ||
+	    !enic->vf_state[vf_id].registered) {
+		netdev_dbg(enic->netdev,
+			   "MBOX: skip link state to unregistered VF %u\n",
+			   vf_id);
+		return 0;
+	}
+
+	notif.link_state = cpu_to_le32(link_state);
+	return enic_mbox_send_msg(enic, ENIC_MBOX_PF_LINK_STATE_NOTIF, vf_id,
+				  &notif, sizeof(notif));
+}
+
+static int enic_mbox_pf_handle_capability(struct enic *enic, void *msg,
+					  u16 vf_id, u64 msg_num)
+{
+	struct enic_mbox_vf_capability_reply_msg reply = {};
+
+	reply.reply.ret_major = cpu_to_le16(0);
+	reply.version = cpu_to_le32(ENIC_MBOX_CAP_VERSION_1);
+
+	return enic_mbox_send_msg(enic, ENIC_MBOX_VF_CAPABILITY_REPLY, vf_id,
+				  &reply, sizeof(reply));
+}
+
+static int enic_mbox_pf_handle_register(struct enic *enic, void *msg,
+					u16 vf_id, u64 msg_num)
+{
+	struct enic_mbox_vf_register_reply_msg reply = {};
+	u32 link_state;
+	int err;
+
+	if (!enic->vf_state || vf_id >= enic->num_vfs) {
+		if (net_ratelimit())
+			netdev_warn(enic->netdev,
+				    "MBOX: register from invalid VF %u\n",
+				    vf_id);
+		return -EINVAL;
+	}
+
+	/* VF re-registering (e.g. guest reboot without clean unregister):
+	 * mark the previous registration inactive before accepting the new one.
+	 */
+	if (enic->vf_state[vf_id].registered) {
+		netdev_dbg(enic->netdev,
+			   "MBOX: VF %u re-register, cleaning previous state\n",
+			   vf_id);
+		enic->vf_state[vf_id].registered = false;
+	}
+
+	reply.reply.ret_major = cpu_to_le16(0);
+	err = enic_mbox_send_msg(enic, ENIC_MBOX_VF_REGISTER_REPLY, vf_id,
+				 &reply, sizeof(reply));
+	if (err)
+		return err;
+
+	enic->vf_state[vf_id].registered = true;
+	if (net_ratelimit())
+		netdev_info(enic->netdev, "VF %u registered via MBOX\n", vf_id);
+
+	link_state = netif_carrier_ok(enic->netdev) ?
+		ENIC_MBOX_LINK_STATE_ENABLE :
+		ENIC_MBOX_LINK_STATE_DISABLE;
+	err = enic_mbox_send_link_state(enic, vf_id, link_state);
+	if (err && net_ratelimit())
+		netdev_warn(enic->netdev,
+			    "VF %u: failed to send initial link state: %d\n",
+			    vf_id, err);
+	/* Registration succeeded; initial link state notification sent
+	 * above.  Subsequent link state changes are sent from the PF
+	 * when enic_link_check() detects carrier changes.
+	 */
+	return 0;
+}
+
+static int enic_mbox_pf_handle_unregister(struct enic *enic, void *msg,
+					  u16 vf_id, u64 msg_num)
+{
+	struct enic_mbox_vf_register_reply_msg reply = {};
+	int err;
+
+	if (!enic->vf_state || vf_id >= enic->num_vfs) {
+		if (net_ratelimit())
+			netdev_warn(enic->netdev,
+				    "MBOX: unregister from invalid VF %u\n",
+				    vf_id);
+		return -EINVAL;
+	}
+
+	/* VF is unloading; clear local state regardless of whether
+	 * the reply is successfully delivered to avoid the PF treating
+	 * a dead VF as still registered.
+	 */
+	enic->vf_state[vf_id].registered = false;
+
+	reply.reply.ret_major = cpu_to_le16(0);
+	err = enic_mbox_send_msg(enic, ENIC_MBOX_VF_UNREGISTER_REPLY, vf_id,
+				 &reply, sizeof(reply));
+
+	if (net_ratelimit())
+		netdev_info(enic->netdev,
+			    "VF %u unregistered via MBOX\n", vf_id);
+
+	return err;
+}
+
+static void enic_mbox_pf_process_msg(struct enic *enic,
+				     struct enic_mbox_hdr *hdr, void *payload)
+{
+	u16 vf_id = le16_to_cpu(hdr->src_vnic_id);
+	u16 msg_len = le16_to_cpu(hdr->msg_len);
+	int err = 0;
+
+	if (!enic->vf_state) {
+		netdev_dbg(enic->netdev,
+			   "MBOX: PF received msg but SRIOV not active\n");
+		return;
+	}
+
+	if (vf_id >= enic->num_vfs) {
+		if (net_ratelimit())
+			netdev_warn(enic->netdev,
+				    "MBOX: PF received msg from invalid VF %u\n",
+				    vf_id);
+		return;
+	}
+
+	switch (hdr->msg_type) {
+	case ENIC_MBOX_VF_CAPABILITY_REQUEST:
+		err = enic_mbox_pf_handle_capability(enic, payload, vf_id,
+						     le64_to_cpu(hdr->msg_num));
+		break;
+	case ENIC_MBOX_VF_REGISTER_REQUEST:
+		err = enic_mbox_pf_handle_register(enic, payload, vf_id,
+						   le64_to_cpu(hdr->msg_num));
+		break;
+	case ENIC_MBOX_VF_UNREGISTER_REQUEST:
+		err = enic_mbox_pf_handle_unregister(enic, payload, vf_id,
+						     le64_to_cpu(hdr->msg_num));
+		break;
+	case ENIC_MBOX_PF_LINK_STATE_ACK: {
+		struct enic_mbox_pf_link_state_ack_msg *ack = payload;
+
+		if (msg_len < sizeof(*hdr) + sizeof(*ack))
+			break;
+		if (le16_to_cpu(ack->ack.ret_major) && net_ratelimit())
+			netdev_warn(enic->netdev,
+				    "MBOX: VF %u link state ACK error %u/%u\n",
+				    vf_id,
+				    le16_to_cpu(ack->ack.ret_major),
+				    le16_to_cpu(ack->ack.ret_minor));
+		break;
+	}
+	default:
+		netdev_dbg(enic->netdev,
+			   "MBOX: PF unhandled msg type %u from VF %u\n",
+			   hdr->msg_type, vf_id);
+		err = -EOPNOTSUPP;
+		break;
+	}
+
+	if (err && net_ratelimit())
+		netdev_warn(enic->netdev,
+			    "MBOX: PF handler for msg type %u from VF %u failed: %d\n",
+			    hdr->msg_type, vf_id, err);
+}
+
 static void enic_mbox_recv_handler(struct enic *enic, void *buf,
 				   unsigned int len)
 {
 	struct enic_mbox_hdr *hdr = buf;
+	void *payload;
+	u16 msg_len;
 
 	if (len < sizeof(*hdr)) {
 		if (net_ratelimit())
@@ -156,10 +329,23 @@ static void enic_mbox_recv_handler(struct enic *enic, void *buf,
 		return;
 	}
 
+	msg_len = le16_to_cpu(hdr->msg_len);
+	if (msg_len < sizeof(*hdr) || msg_len > len) {
+		if (net_ratelimit())
+			netdev_warn(enic->netdev,
+				    "MBOX: invalid msg_len %u (buf len %u)\n",
+				    msg_len, len);
+		return;
+	}
+
 	netdev_dbg(enic->netdev,
 		   "MBOX recv: type %u from vnic %u len %u\n",
-		   hdr->msg_type, le16_to_cpu(hdr->src_vnic_id),
-		   le16_to_cpu(hdr->msg_len));
+		   hdr->msg_type, le16_to_cpu(hdr->src_vnic_id), msg_len);
+
+	payload = buf + sizeof(*hdr);
+
+	if (enic->vf_state)
+		enic_mbox_pf_process_msg(enic, hdr, payload);
 }
 
 void enic_mbox_init(struct enic *enic)
diff --git a/drivers/net/ethernet/cisco/enic/enic_mbox.h b/drivers/net/ethernet/cisco/enic/enic_mbox.h
index 73fd7f783ee2..f1de67db1273 100644
--- a/drivers/net/ethernet/cisco/enic/enic_mbox.h
+++ b/drivers/net/ethernet/cisco/enic/enic_mbox.h
@@ -87,5 +87,6 @@ struct enic;
 void enic_mbox_init(struct enic *enic);
 int enic_mbox_send_msg(struct enic *enic, u8 msg_type, u16 dst_vnic_id,
 		       void *payload, u16 payload_len);
+int enic_mbox_send_link_state(struct enic *enic, u16 vf_id, u32 link_state);
 
 #endif /* _ENIC_MBOX_H_ */

-- 
2.43.0
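
For illustration only (not part of the patch): the length validation
added to enic_mbox_recv_handler() above can be modelled in userspace
roughly as follows. The header layout and the names mbox_hdr_model and
mbox_len_ok are hypothetical; only the two checks (buffer shorter than a
header, and msg_len outside [header size, buffer length]) mirror the hunk.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical, simplified header. The real struct enic_mbox_hdr has
 * more fields; only a little-endian 16-bit msg_len is modelled here.
 */
struct mbox_hdr_model {
	uint8_t msg_type;
	uint8_t reserved;
	uint8_t msg_len_le[2];	/* total message length, little-endian */
};

/*
 * Return true and store the decoded length when the message is
 * plausible: the buffer holds at least a header, and msg_len covers at
 * least the header without exceeding what was actually received.
 */
static bool mbox_len_ok(const void *buf, size_t len, uint16_t *msg_len)
{
	const struct mbox_hdr_model *hdr = buf;
	uint16_t n;

	if (len < sizeof(*hdr))
		return false;

	/* Decode byte by byte so the result is host-endian independent. */
	n = (uint16_t)(hdr->msg_len_le[0] | (hdr->msg_len_le[1] << 8));
	if (n < sizeof(*hdr) || n > len)
		return false;

	*msg_len = n;
	return true;
}
```

Checking the buffer length before touching any header field is what
makes the later payload pointer arithmetic safe in this model.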


^ permalink raw reply related

* [PATCH bpf-next v3 0/2] bpf, sockmap: disallow sockmap mutation from tc, xdp and flow_dissector
From: Sechang Lim @ 2026-06-29 17:26 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	John Fastabend, Eduard Zingerman, Kumar Kartikeya Dwivedi,
	David S . Miller, Jakub Kicinski, Jesper Dangaard Brouer
  Cc: Martin KaFai Lau, Song Liu, Yonghong Song, Jiri Olsa,
	Stanislav Fomichev, Emil Tsalapatis, Lorenz Bauer, Jakub Sitnicki,
	Jiayuan Chen, Shuah Khan, bpf, netdev, linux-kselftest,
	linux-kernel

A tc, xdp or flow_dissector program that updates or deletes a sockmap
entry can deadlock on stab->lock vs sk_callback_lock, and such programs
have no reason to mutate a sockmap. Patch 1 disallows it in
may_update_sockmap(); patch 2 drops the selftests that exercised it.

v3:
 - drop the broken selftests (Jiayuan Chen)
 - drop the Fixes tag and target bpf-next (Jiayuan Chen)

v2:
 - https://lore.kernel.org/all/20260620034632.2308-1-rhkrqnwk98@gmail.com/

v1:
 - https://lore.kernel.org/all/20260616091153.2966617-1-rhkrqnwk98@gmail.com/

Sechang Lim (2):
  bpf, sockmap: disallow update and delete from tc, xdp and
    flow_dissector
  selftests/bpf: drop tc/xdp/flow_dissector sockmap mutation tests

 kernel/bpf/verifier.c                         |  4 --
 .../selftests/bpf/prog_tests/fexit_bpf2bpf.c  | 13 -----
 .../selftests/bpf/prog_tests/sockmap_basic.c  | 52 -------------------
 .../bpf/progs/freplace_cls_redirect.c         | 34 ------------
 .../selftests/bpf/progs/test_sockmap_update.c | 48 -----------------
 .../bpf/progs/verifier_sockmap_mutate.c       | 10 ++--
 6 files changed, 5 insertions(+), 156 deletions(-)
 delete mode 100644 tools/testing/selftests/bpf/progs/freplace_cls_redirect.c
 delete mode 100644 tools/testing/selftests/bpf/progs/test_sockmap_update.c

-- 
2.43.0


^ permalink raw reply

* [PATCH bpf-next v3 1/2] bpf, sockmap: disallow update and delete from tc, xdp and flow_dissector
From: Sechang Lim @ 2026-06-29 17:27 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	John Fastabend, Eduard Zingerman, Kumar Kartikeya Dwivedi,
	David S . Miller, Jakub Kicinski, Jesper Dangaard Brouer
  Cc: Martin KaFai Lau, Song Liu, Yonghong Song, Jiri Olsa,
	Stanislav Fomichev, Emil Tsalapatis, Lorenz Bauer, Jakub Sitnicki,
	Jiayuan Chen, Shuah Khan, bpf, netdev, linux-kselftest,
	linux-kernel
In-Reply-To: <20260629172704.1302218-1-rhkrqnwk98@gmail.com>

sock_map_update_common() and __sock_map_delete() hold stab->lock and call
sock_map_unref() -> sock_map_del_link(), which takes sk_callback_lock for
write. That gives the order stab->lock -> sk_callback_lock.

The reverse order comes from the SK_SKB stream parser.
sk_psock_strp_data_ready() holds sk_callback_lock for read, and after the
verdict, tcp_bpf_strp_read_sock() acks the consumed data inline via
__tcp_cleanup_rbuf(). The ACK goes out on egress, where a sched_cls
program deletes from the sockmap and takes stab->lock:

  WARNING: possible circular locking dependency detected
  ------------------------------------------------------
  syz.9.8824 is trying to acquire lock:
  (&stab->lock){+.-.}-{3:3}, at: __sock_map_delete net/core/sock_map.c:421
  but task is already holding lock:
  (clock-AF_INET){++.-}-{3:3}, at: sk_psock_strp_data_ready net/core/skmsg.c:1173

  -> #1 (clock-AF_INET){++.-}-{3:3}:
         _raw_write_lock_bh
         sock_map_del_link net/core/sock_map.c:167
         sock_map_unref net/core/sock_map.c:184
         sock_map_update_common net/core/sock_map.c:509
         sock_map_update_elem_sys net/core/sock_map.c:588
         map_update_elem kernel/bpf/syscall.c:1805

  -> #0 (&stab->lock){+.-.}-{3:3}:
         _raw_spin_lock_bh
         __sock_map_delete net/core/sock_map.c:421
         sock_map_delete_elem net/core/sock_map.c:452
         bpf_prog_06044d24140080b6
         tcx_run net/core/dev.c:4451
         sch_handle_egress net/core/dev.c:4541
         __dev_queue_xmit net/core/dev.c:4808
         ...
         tcp_bpf_strp_read_sock net/ipv4/tcp_bpf.c:701
         strp_data_ready net/strparser/strparser.c:402
         sk_psock_strp_data_ready net/core/skmsg.c:1174
         tcp_data_queue net/ipv4/tcp_input.c:5661

  Possible unsafe locking scenario:

         CPU0                    CPU1
         ----                    ----
    rlock(clock-AF_INET);
                                 lock(&stab->lock);
                                 lock(clock-AF_INET);
    lock(&stab->lock);

   *** DEADLOCK ***
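
For illustration only (not part of the patch): the inversion in the
report above can be modelled with a tiny lock-order table. This is a
userspace sketch, not how lockdep is implemented; the names lock_id,
order_graph and record_order are hypothetical, and the read/write
distinction on sk_callback_lock is deliberately ignored.

```c
#include <stdbool.h>

/* Hypothetical identifiers named after the two locks in the report. */
enum lock_id { STAB_LOCK, SK_CALLBACK_LOCK, NR_LOCKS };

/* seen[a][b] is true once lock b was acquired while a was held. */
struct order_graph {
	bool seen[NR_LOCKS][NR_LOCKS];
};

/*
 * Record "b taken while holding a". Returns false when the reverse
 * order b -> a was recorded earlier, i.e. two CPUs taking the locks in
 * the two recorded orders could block each other forever.
 */
static bool record_order(struct order_graph *g, enum lock_id a,
			 enum lock_id b)
{
	if (g->seen[b][a])
		return false;
	g->seen[a][b] = true;
	return true;
}
```

Recording stab->lock -> sk_callback_lock (the update path) and then
sk_callback_lock -> stab->lock (the strparser path) makes the second
call return false, which is the condition the splat reports.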

A tc, xdp or flow_dissector program has no reason to update or delete a
sockmap, and redirect does not go through here. Drop these program types
from may_update_sockmap() so the verifier rejects such calls. This also
closes the matching sockhash inversion.
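
For illustration only (not part of the patch): a userspace model of the
switch arm after this change. The enum values are hypothetical stand-ins
for the BPF_PROG_TYPE_* constants, and the model assumes the default arm
ends in rejection; the rest of may_update_sockmap() is not reproduced.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the BPF_PROG_TYPE_* values in the hunk. */
enum prog_type_model {
	PROG_SOCKET_FILTER,
	PROG_SCHED_CLS,
	PROG_SCHED_ACT,
	PROG_XDP,
	PROG_SK_REUSEPORT,
	PROG_FLOW_DISSECTOR,
	PROG_SK_LOOKUP,
};

/*
 * Models only the arm touched by the patch: the three remaining
 * program types may still update or delete, everything else listed
 * here is rejected (assumed behaviour of the default arm).
 */
static bool may_mutate_sockmap_model(enum prog_type_model type)
{
	switch (type) {
	case PROG_SOCKET_FILTER:
	case PROG_SK_REUSEPORT:
	case PROG_SK_LOOKUP:
		return true;
	default:
		return false;
	}
}
```

In this model the tc (cls and act), xdp and flow_dissector types all
fall through to the rejecting arm, which matches the intent stated
above.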

Suggested-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Sechang Lim <rhkrqnwk98@gmail.com>
---
 kernel/bpf/verifier.c | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 25aea4271cd0..58d766c34626 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -8489,11 +8489,7 @@ static bool may_update_sockmap(struct bpf_verifier_env *env, int func_id)
 			return true;
 		break;
 	case BPF_PROG_TYPE_SOCKET_FILTER:
-	case BPF_PROG_TYPE_SCHED_CLS:
-	case BPF_PROG_TYPE_SCHED_ACT:
-	case BPF_PROG_TYPE_XDP:
 	case BPF_PROG_TYPE_SK_REUSEPORT:
-	case BPF_PROG_TYPE_FLOW_DISSECTOR:
 	case BPF_PROG_TYPE_SK_LOOKUP:
 		return true;
 	default:
-- 
2.43.0


^ permalink raw reply related

* [PATCH bpf-next v3 2/2] selftests/bpf: drop tc/xdp/flow_dissector sockmap mutation tests
From: Sechang Lim @ 2026-06-29 17:27 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	John Fastabend, Eduard Zingerman, Kumar Kartikeya Dwivedi,
	David S . Miller, Jakub Kicinski, Jesper Dangaard Brouer
  Cc: Martin KaFai Lau, Song Liu, Yonghong Song, Jiri Olsa,
	Stanislav Fomichev, Emil Tsalapatis, Lorenz Bauer, Jakub Sitnicki,
	Jiayuan Chen, Shuah Khan, bpf, netdev, linux-kselftest,
	linux-kernel
In-Reply-To: <20260629172704.1302218-1-rhkrqnwk98@gmail.com>

tc, xdp and flow_dissector programs can no longer update or delete a
sockmap. Adjust the tests:

 - verifier_sockmap_mutate: the tc, xdp and flow_dissector cases now
   expect __failure with "cannot update sockmap in this context".
 - sockmap_basic: drop "sockmap update" / "sockhash update", which load
   a SEC("tc") program that copies a sock between maps.
 - fexit_bpf2bpf: drop "func_sockmap_update", whose freplace program
   updates a sockmap in the tc cls_redirect context.

Remove the now-unused test_sockmap_update.c and freplace_cls_redirect.c.

Signed-off-by: Sechang Lim <rhkrqnwk98@gmail.com>
---
 .../selftests/bpf/prog_tests/fexit_bpf2bpf.c  | 13 -----
 .../selftests/bpf/prog_tests/sockmap_basic.c  | 52 -------------------
 .../bpf/progs/freplace_cls_redirect.c         | 34 ------------
 .../selftests/bpf/progs/test_sockmap_update.c | 48 -----------------
 .../bpf/progs/verifier_sockmap_mutate.c       | 10 ++--
 5 files changed, 5 insertions(+), 152 deletions(-)
 delete mode 100644 tools/testing/selftests/bpf/progs/freplace_cls_redirect.c
 delete mode 100644 tools/testing/selftests/bpf/progs/test_sockmap_update.c

diff --git a/tools/testing/selftests/bpf/prog_tests/fexit_bpf2bpf.c b/tools/testing/selftests/bpf/prog_tests/fexit_bpf2bpf.c
index 92c20803ea76..d3a954158c33 100644
--- a/tools/testing/selftests/bpf/prog_tests/fexit_bpf2bpf.c
+++ b/tools/testing/selftests/bpf/prog_tests/fexit_bpf2bpf.c
@@ -336,17 +336,6 @@ static void test_fmod_ret_freplace(void)
 }
 
 
-static void test_func_sockmap_update(void)
-{
-	const char *prog_name[] = {
-		"freplace/cls_redirect",
-	};
-	test_fexit_bpf2bpf_common("./freplace_cls_redirect.bpf.o",
-				  "./test_cls_redirect.bpf.o",
-				  ARRAY_SIZE(prog_name),
-				  prog_name, false, NULL);
-}
-
 static void test_func_replace_void(void)
 {
 	const char *prog_name[] = {
@@ -599,8 +588,6 @@ void serial_test_fexit_bpf2bpf(void)
 		test_func_replace();
 	if (test__start_subtest("func_replace_verify"))
 		test_func_replace_verify();
-	if (test__start_subtest("func_sockmap_update"))
-		test_func_sockmap_update();
 	if (test__start_subtest("func_replace_return_code"))
 		test_func_replace_return_code();
 	if (test__start_subtest("func_map_prog_compatibility"))
diff --git a/tools/testing/selftests/bpf/prog_tests/sockmap_basic.c b/tools/testing/selftests/bpf/prog_tests/sockmap_basic.c
index cb3229711f93..33f788e2786d 100644
--- a/tools/testing/selftests/bpf/prog_tests/sockmap_basic.c
+++ b/tools/testing/selftests/bpf/prog_tests/sockmap_basic.c
@@ -7,7 +7,6 @@
 
 #include "test_progs.h"
 #include "test_skmsg_load_helpers.skel.h"
-#include "test_sockmap_update.skel.h"
 #include "test_sockmap_invalid_update.skel.h"
 #include "test_sockmap_skb_verdict_attach.skel.h"
 #include "test_sockmap_progs_query.skel.h"
@@ -235,53 +234,6 @@ static void test_skmsg_helpers_with_link(enum bpf_map_type map_type)
 	test_skmsg_load_helpers__destroy(skel);
 }
 
-static void test_sockmap_update(enum bpf_map_type map_type)
-{
-	int err, prog, src;
-	struct test_sockmap_update *skel;
-	struct bpf_map *dst_map;
-	const __u32 zero = 0;
-	char dummy[14] = {0};
-	LIBBPF_OPTS(bpf_test_run_opts, topts,
-		.data_in = dummy,
-		.data_size_in = sizeof(dummy),
-		.repeat = 1,
-	);
-	__s64 sk;
-
-	sk = connected_socket_v4();
-	if (!ASSERT_NEQ(sk, -1, "connected_socket_v4"))
-		return;
-
-	skel = test_sockmap_update__open_and_load();
-	if (!ASSERT_OK_PTR(skel, "open_and_load"))
-		goto close_sk;
-
-	prog = bpf_program__fd(skel->progs.copy_sock_map);
-	src = bpf_map__fd(skel->maps.src);
-	if (map_type == BPF_MAP_TYPE_SOCKMAP)
-		dst_map = skel->maps.dst_sock_map;
-	else
-		dst_map = skel->maps.dst_sock_hash;
-
-	err = bpf_map_update_elem(src, &zero, &sk, BPF_NOEXIST);
-	if (!ASSERT_OK(err, "update_elem(src)"))
-		goto out;
-
-	err = bpf_prog_test_run_opts(prog, &topts);
-	if (!ASSERT_OK(err, "test_run"))
-		goto out;
-	if (!ASSERT_NEQ(topts.retval, 0, "test_run retval"))
-		goto out;
-
-	compare_cookies(skel->maps.src, dst_map);
-
-out:
-	test_sockmap_update__destroy(skel);
-close_sk:
-	close(sk);
-}
-
 static void test_sockmap_invalid_update(void)
 {
 	struct test_sockmap_invalid_update *skel;
@@ -1385,10 +1337,6 @@ void test_sockmap_basic(void)
 		test_skmsg_helpers(BPF_MAP_TYPE_SOCKMAP);
 	if (test__start_subtest("sockhash sk_msg load helpers"))
 		test_skmsg_helpers(BPF_MAP_TYPE_SOCKHASH);
-	if (test__start_subtest("sockmap update"))
-		test_sockmap_update(BPF_MAP_TYPE_SOCKMAP);
-	if (test__start_subtest("sockhash update"))
-		test_sockmap_update(BPF_MAP_TYPE_SOCKHASH);
 	if (test__start_subtest("sockmap update in unsafe context"))
 		test_sockmap_invalid_update();
 	if (test__start_subtest("sockmap copy"))
diff --git a/tools/testing/selftests/bpf/progs/freplace_cls_redirect.c b/tools/testing/selftests/bpf/progs/freplace_cls_redirect.c
deleted file mode 100644
index 7e94412d47a5..000000000000
--- a/tools/testing/selftests/bpf/progs/freplace_cls_redirect.c
+++ /dev/null
@@ -1,34 +0,0 @@
-// SPDX-License-Identifier: GPL-2.0
-// Copyright (c) 2020 Facebook
-
-#include <linux/stddef.h>
-#include <linux/bpf.h>
-#include <linux/pkt_cls.h>
-#include <bpf/bpf_endian.h>
-#include <bpf/bpf_helpers.h>
-
-struct {
-	__uint(type, BPF_MAP_TYPE_SOCKMAP);
-	__type(key, int);
-	__type(value, int);
-	__uint(max_entries, 2);
-} sock_map SEC(".maps");
-
-SEC("freplace/cls_redirect")
-int freplace_cls_redirect_test(struct __sk_buff *skb)
-{
-	int ret = 0;
-	const int zero = 0;
-	struct bpf_sock *sk;
-
-	sk = bpf_map_lookup_elem(&sock_map, &zero);
-	if (!sk)
-		return TC_ACT_SHOT;
-
-	ret = bpf_map_update_elem(&sock_map, &zero, sk, 0);
-	bpf_sk_release(sk);
-
-	return ret == 0 ? TC_ACT_OK : TC_ACT_SHOT;
-}
-
-char _license[] SEC("license") = "GPL";
diff --git a/tools/testing/selftests/bpf/progs/test_sockmap_update.c b/tools/testing/selftests/bpf/progs/test_sockmap_update.c
deleted file mode 100644
index 6d64ea536e3d..000000000000
--- a/tools/testing/selftests/bpf/progs/test_sockmap_update.c
+++ /dev/null
@@ -1,48 +0,0 @@
-// SPDX-License-Identifier: GPL-2.0
-// Copyright (c) 2020 Cloudflare
-#include "vmlinux.h"
-#include <bpf/bpf_helpers.h>
-
-struct {
-	__uint(type, BPF_MAP_TYPE_SOCKMAP);
-	__uint(max_entries, 1);
-	__type(key, __u32);
-	__type(value, __u64);
-} src SEC(".maps");
-
-struct {
-	__uint(type, BPF_MAP_TYPE_SOCKMAP);
-	__uint(max_entries, 1);
-	__type(key, __u32);
-	__type(value, __u64);
-} dst_sock_map SEC(".maps");
-
-struct {
-	__uint(type, BPF_MAP_TYPE_SOCKHASH);
-	__uint(max_entries, 1);
-	__type(key, __u32);
-	__type(value, __u64);
-} dst_sock_hash SEC(".maps");
-
-SEC("tc")
-int copy_sock_map(void *ctx)
-{
-	struct bpf_sock *sk;
-	bool failed = false;
-	__u32 key = 0;
-
-	sk = bpf_map_lookup_elem(&src, &key);
-	if (!sk)
-		return SK_DROP;
-
-	if (bpf_map_update_elem(&dst_sock_map, &key, sk, 0))
-		failed = true;
-
-	if (bpf_map_update_elem(&dst_sock_hash, &key, sk, 0))
-		failed = true;
-
-	bpf_sk_release(sk);
-	return failed ? SK_DROP : SK_PASS;
-}
-
-char _license[] SEC("license") = "GPL";
diff --git a/tools/testing/selftests/bpf/progs/verifier_sockmap_mutate.c b/tools/testing/selftests/bpf/progs/verifier_sockmap_mutate.c
index fe4b123187b8..b11026123690 100644
--- a/tools/testing/selftests/bpf/progs/verifier_sockmap_mutate.c
+++ b/tools/testing/selftests/bpf/progs/verifier_sockmap_mutate.c
@@ -74,7 +74,7 @@ static __always_inline void test_sockmap_lookup_and_mutate(void)
 }
 
 SEC("action")
-__success
+__failure __msg("cannot update sockmap in this context")
 int test_sched_act(struct __sk_buff *skb)
 {
 	test_sockmap_mutate(skb->sk);
@@ -82,7 +82,7 @@ int test_sched_act(struct __sk_buff *skb)
 }
 
 SEC("classifier")
-__success
+__failure __msg("cannot update sockmap in this context")
 int test_sched_cls(struct __sk_buff *skb)
 {
 	test_sockmap_mutate(skb->sk);
@@ -90,7 +90,7 @@ int test_sched_cls(struct __sk_buff *skb)
 }
 
 SEC("flow_dissector")
-__success
+__failure __msg("cannot update sockmap in this context")
 int test_flow_dissector_delete(struct __sk_buff *skb __always_unused)
 {
 	test_sockmap_delete();
@@ -98,7 +98,7 @@ int test_flow_dissector_delete(struct __sk_buff *skb __always_unused)
 }
 
 SEC("flow_dissector")
-__failure __msg("program of this type cannot use helper bpf_sk_release")
+__failure __msg("cannot update sockmap in this context")
 int test_flow_dissector_update(struct __sk_buff *skb __always_unused)
 {
 	test_sockmap_lookup_and_update(); /* no access to skb->sk */
@@ -179,7 +179,7 @@ int test_sockops_update_dedicated(struct bpf_sock_ops *ctx)
 }
 
 SEC("xdp")
-__success
+__failure __msg("cannot update sockmap in this context")
 int test_xdp(struct xdp_md *ctx __always_unused)
 {
 	test_sockmap_lookup_and_mutate();
-- 
2.43.0


^ permalink raw reply related

* RE: the confusing 10000base_CR. Shouldn't it be 10000_SFI_DA?
From: D H, Siddaraju @ 2026-06-29 17:29 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Maxime Chevallier, Michal Kubecek, netdev@vger.kernel.org,
	Das, Shubham, Chintalapalle, Balaji, Srinivasan, Vijay,
	Lindberg, Magnus, Niklas Damberg, Wirandi, Jonas, Siddaraju DH
In-Reply-To: <deee3864-69dd-4eb5-bcfc-4ab771f4530c@lunn.ch>

On 6/29/26, Andrew Lunn wrote,
> Is there confusion? git blame suggests it has been there 10 years,
> and this is the first time somebody has questioned it.
>
> I would limit changes to Documentation, man pages, help etc.

Yes, Andrew. As summarized in the first email of this thread, no standard
explicitly defines 10000baseCR, and it conflicts with many
characteristics of a typical IEEE *baseCR mode. Based on internal
discussion and feedback, people appear to be using 10000baseCR with
10G-SFI-DA cables, but the question "why should I set it to 10000baseCR
while using 10G-SFI-DA cables?" keeps coming back, along with follow-up
questions seeking clarity against the IEEE *baseCR definitions.

This is an effort to settle the matter once and for all. Option (c) only
adds the clarification that 10000baseCR refers to the SFF-8431 SFP+
Direct Attach (DA) standard with an SFI interface.

- Thank you,
Siddaraju D H

^ permalink raw reply

* AW: [PATCH net-next v2 5/8] net: mdio: realtek-rtl9300: Add c45 over c22 mitigation
From: Markus Stockhausen @ 2026-06-29 17:29 UTC (permalink / raw)
  To: 'Andrew Lunn'
  Cc: hkallweit1, linux, davem, edumazet, kuba, pabeni, netdev,
	chris.packham, daniel, robh, krzk+dt, conor+dt, devicetree
In-Reply-To: <af7ba89b-bac6-4e04-b606-8e51a61d4be0@lunn.ch>

> From: Andrew Lunn <andrew@lunn.ch>
> Sent: Monday, 29 June 2026 18:40
> To: Markus Stockhausen <markus.stockhausen@gmx.de>
> Subject: Re: [PATCH net-next v2 5/8] net: mdio: realtek-rtl9300: Add c45
> over c22 mitigation
> 
> > Enhance the driver to detect this register 13/14/13/14 access sequence.
> 
> I still think this is the wrong way to do this, and you should look at
> MDIO bus lock/unlock.

The C45-over-C22 interception only tried to mitigate a downstream nit
that is not a showstopper. We already live perfectly well without it.
I will drop that part.

Markus 


^ permalink raw reply

* [PATCH v4 net-next] bonding: no longer rely on RTNL in bond_fill_info()
From: Eric Dumazet @ 2026-06-29 17:32 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski, Paolo Abeni
  Cc: Simon Horman, netdev, eric.dumazet, Eric Dumazet, Jay Vosburgh,
	Andrew Lunn

Add READ_ONCE()/WRITE_ONCE() annotations on port->is_enabled.
While this field is written under bond->mode_lock protection,
it is read without this lock being held.

Change bond_fill_info() to acquire RCU and use READ_ONCE()
to read bond->params fields that can be updated concurrently
from sysfs/procfs/rtnetlink.

Add const qualifiers to bond_uses_primary(), __agg_active_ports(),
bond_option_active_slave_get_rcu(), bond_3ad_get_active_agg_info(),
__bond_3ad_get_active_agg_info() helpers.
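
For illustration only (not part of the patch): a userspace sketch of the
snapshot pattern used in bond_fill_info(), where miimon is read once and
the same value is reused for every derived attribute. The READ_ONCE()
and WRITE_ONCE() macros below are simplified volatile-cast
approximations, not the kernel definitions, and params_model,
delays_model, snapshot_delays and set_miimon are hypothetical names.

```c
#include <stdint.h>

/* Simplified approximations; the kernel macros do more than this. */
#define READ_ONCE(x)     (*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, v) (*(volatile __typeof__(x) *)&(x) = (v))

/* Hypothetical subset of the tunables a writer may change at any time. */
struct params_model {
	int miimon;
	int updelay;
	int downdelay;
};

/* Values as they would be reported to userspace. */
struct delays_model {
	uint32_t miimon;
	uint32_t updelay_scaled;
	uint32_t downdelay_scaled;
};

/* Writer side: a single marked store. */
static void set_miimon(struct params_model *p, int val)
{
	WRITE_ONCE(p->miimon, val);
}

/*
 * Reader side: load miimon exactly once, then scale both delays by the
 * same snapshot so the reported values are consistent with each other
 * even if a writer changes miimon in between.
 */
static struct delays_model snapshot_delays(const struct params_model *p)
{
	int miimon = READ_ONCE(p->miimon);
	struct delays_model d;

	d.miimon = (uint32_t)miimon;
	d.updelay_scaled = (uint32_t)(READ_ONCE(p->updelay) * miimon);
	d.downdelay_scaled = (uint32_t)(READ_ONCE(p->downdelay) * miimon);
	return d;
}
```

Reading miimon into a local first is the point of the sketch: rereading
it for each product could mix two different values in one reply.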

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jay Vosburgh <jv@jvosburgh.net>
Cc: Andrew Lunn <andrew+netdev@lunn.ch>
---
v4: addressed Sashiko/Jakub feedback

 drivers/net/bonding/bond_3ad.c     |  24 ++++---
 drivers/net/bonding/bond_netlink.c | 109 ++++++++++++++++-------------
 drivers/net/bonding/bond_options.c |   8 +--
 include/net/bond_3ad.h             |   4 +-
 include/net/bonding.h              |   8 +--
 5 files changed, 85 insertions(+), 68 deletions(-)

diff --git a/drivers/net/bonding/bond_3ad.c b/drivers/net/bonding/bond_3ad.c
index acbba08dbdfada00c118bd41b89a60e60f8521e7..b8e4b4d68dd65f1381ba7c0aad84609fcfbabbef 100644
--- a/drivers/net/bonding/bond_3ad.c
+++ b/drivers/net/bonding/bond_3ad.c
@@ -760,14 +760,14 @@ static int __agg_usable_ports(struct aggregator *agg)
 	return valid;
 }
 
-static int __agg_active_ports(struct aggregator *agg)
+static int __agg_active_ports(const struct aggregator *agg)
 {
-	struct port *port;
+	const struct port *port;
 	int active = 0;
 
 	for (port = agg->lag_ports; port;
 	     port = port->next_port_in_aggregator) {
-		if (port->is_enabled)
+		if (READ_ONCE(port->is_enabled))
 			active++;
 	}
 
@@ -2801,11 +2801,11 @@ void bond_3ad_handle_link_change(struct slave *slave, char link)
 	 * some of he adaptors(ce1000.lan) report.
 	 */
 	if (link == BOND_LINK_UP) {
-		port->is_enabled = true;
+		WRITE_ONCE(port->is_enabled, true);
 		ad_update_actor_keys(port, false);
 	} else {
 		/* link has failed */
-		port->is_enabled = false;
+		WRITE_ONCE(port->is_enabled, false);
 		ad_update_actor_keys(port, true);
 	}
 	agg = __get_first_agg(port);
@@ -2878,16 +2878,20 @@ int bond_3ad_set_carrier(struct bonding *bond)
  * Returns:   0 on success
  *          < 0 on error
  */
-int __bond_3ad_get_active_agg_info(struct bonding *bond,
+int __bond_3ad_get_active_agg_info(const struct bonding *bond,
 				   struct ad_info *ad_info)
 {
-	struct aggregator *aggregator = NULL, *tmp;
+	const struct aggregator *aggregator = NULL, *tmp;
+	struct ad_slave_info *ad_slave_info;
+	const struct port *port;
 	struct list_head *iter;
 	struct slave *slave;
-	struct port *port;
 
 	bond_for_each_slave_rcu(bond, slave, iter) {
-		port = &(SLAVE_AD_INFO(slave)->port);
+		ad_slave_info = SLAVE_AD_INFO(slave);
+		if (!ad_slave_info)
+			continue;
+		port = &ad_slave_info->port;
 		tmp = rcu_dereference(port->aggregator);
 		if (tmp && tmp->is_active) {
 			aggregator = tmp;
@@ -2907,7 +2911,7 @@ int __bond_3ad_get_active_agg_info(struct bonding *bond,
 	return 0;
 }
 
-int bond_3ad_get_active_agg_info(struct bonding *bond, struct ad_info *ad_info)
+int bond_3ad_get_active_agg_info(const struct bonding *bond, struct ad_info *ad_info)
 {
 	int ret;
 
diff --git a/drivers/net/bonding/bond_netlink.c b/drivers/net/bonding/bond_netlink.c
index 4a11572f663d3127fb2901468939556d34df16ed..55d2f8a539d46ad825768a0fb91ffe5648028d28 100644
--- a/drivers/net/bonding/bond_netlink.c
+++ b/drivers/net/bonding/bond_netlink.c
@@ -686,53 +686,58 @@ static size_t bond_get_size(const struct net_device *bond_dev)
 		0;
 }
 
-static int bond_option_active_slave_get_ifindex(struct bonding *bond)
+static int bond_option_active_slave_get_ifindex_rcu(const struct bonding *bond)
 {
-	const struct net_device *slave;
-	int ifindex;
+	const struct net_device *dev = NULL;
+	const struct slave *slave;
 
-	rcu_read_lock();
-	slave = bond_option_active_slave_get_rcu(bond);
-	ifindex = slave ? slave->ifindex : 0;
-	rcu_read_unlock();
-	return ifindex;
+	slave = rcu_dereference(bond->curr_active_slave);
+	if (slave)
+		dev = slave->dev;
+	return dev ? dev->ifindex : 0;
 }
 
 static int bond_fill_info(struct sk_buff *skb,
 			  const struct net_device *bond_dev)
 {
-	struct bonding *bond = netdev_priv(bond_dev);
-	unsigned int packets_per_slave;
-	int ifindex, i, targets_added;
+	const struct bonding *bond = netdev_priv(bond_dev);
+	int i, targets_added, miimon, mode;
+	const struct slave *primary;
 	struct nlattr *targets;
-	struct slave *primary;
 
-	if (nla_put_u8(skb, IFLA_BOND_MODE, BOND_MODE(bond)))
+	rcu_read_lock();
+	mode = READ_ONCE(bond->params.mode);
+	if (nla_put_u8(skb, IFLA_BOND_MODE, mode))
 		goto nla_put_failure;
 
-	ifindex = bond_option_active_slave_get_ifindex(bond);
-	if (ifindex && nla_put_u32(skb, IFLA_BOND_ACTIVE_SLAVE, ifindex))
-		goto nla_put_failure;
+	if (bond_mode_uses_primary(mode)) {
+		int ifindex = bond_option_active_slave_get_ifindex_rcu(bond);
+
+		if (ifindex && nla_put_u32(skb, IFLA_BOND_ACTIVE_SLAVE, ifindex))
+			goto nla_put_failure;
+	}
 
-	if (nla_put_u32(skb, IFLA_BOND_MIIMON, bond->params.miimon))
+	miimon = READ_ONCE(bond->params.miimon);
+	if (nla_put_u32(skb, IFLA_BOND_MIIMON, miimon))
 		goto nla_put_failure;
 
 	if (nla_put_u32(skb, IFLA_BOND_UPDELAY,
-			bond->params.updelay * bond->params.miimon))
+			READ_ONCE(bond->params.updelay) * miimon))
 		goto nla_put_failure;
 
 	if (nla_put_u32(skb, IFLA_BOND_DOWNDELAY,
-			bond->params.downdelay * bond->params.miimon))
+			READ_ONCE(bond->params.downdelay) * miimon))
 		goto nla_put_failure;
 
 	if (nla_put_u32(skb, IFLA_BOND_PEER_NOTIF_DELAY,
-			bond->params.peer_notif_delay * bond->params.miimon))
+			READ_ONCE(bond->params.peer_notif_delay) * miimon))
 		goto nla_put_failure;
 
 	if (nla_put_u8(skb, IFLA_BOND_USE_CARRIER, 1))
 		goto nla_put_failure;
 
-	if (nla_put_u32(skb, IFLA_BOND_ARP_INTERVAL, bond->params.arp_interval))
+	if (nla_put_u32(skb, IFLA_BOND_ARP_INTERVAL,
+			READ_ONCE(bond->params.arp_interval)))
 		goto nla_put_failure;
 
 	targets = nla_nest_start_noflag(skb, IFLA_BOND_ARP_IP_TARGET);
@@ -741,8 +746,10 @@ static int bond_fill_info(struct sk_buff *skb,
 
 	targets_added = 0;
 	for (i = 0; i < BOND_MAX_ARP_TARGETS; i++) {
-		if (bond->params.arp_targets[i]) {
-			if (nla_put_be32(skb, i, bond->params.arp_targets[i]))
+		__be32 t = READ_ONCE(bond->params.arp_targets[i]);
+
+		if (t) {
+			if (nla_put_be32(skb, i, t))
 				goto nla_put_failure;
 			targets_added = 1;
 		}
@@ -753,11 +760,12 @@ static int bond_fill_info(struct sk_buff *skb,
 	else
 		nla_nest_cancel(skb, targets);
 
-	if (nla_put_u32(skb, IFLA_BOND_ARP_VALIDATE, bond->params.arp_validate))
+	if (nla_put_u32(skb, IFLA_BOND_ARP_VALIDATE,
+			READ_ONCE(bond->params.arp_validate)))
 		goto nla_put_failure;
 
 	if (nla_put_u32(skb, IFLA_BOND_ARP_ALL_TARGETS,
-			bond->params.arp_all_targets))
+			READ_ONCE(bond->params.arp_all_targets)))
 		goto nla_put_failure;
 
 #if IS_ENABLED(CONFIG_IPV6)
@@ -767,6 +775,9 @@ static int bond_fill_info(struct sk_buff *skb,
 
 	targets_added = 0;
 	for (i = 0; i < BOND_MAX_NS_TARGETS; i++) {
+		/* Note: IPv6 addresses can not be read in an atomic READ_ONCE() yet.
+		 * We accept this minor race for the moment.
+		 */
 		if (!ipv6_addr_any(&bond->params.ns_targets[i])) {
 			if (nla_put_in6_addr(skb, i, &bond->params.ns_targets[i]))
 				goto nla_put_failure;
@@ -780,97 +791,97 @@ static int bond_fill_info(struct sk_buff *skb,
 		nla_nest_cancel(skb, targets);
 #endif
 
-	primary = rtnl_dereference(bond->primary_slave);
+	primary = rcu_dereference(bond->primary_slave);
 	if (primary &&
 	    nla_put_u32(skb, IFLA_BOND_PRIMARY, primary->dev->ifindex))
 		goto nla_put_failure;
 
 	if (nla_put_u8(skb, IFLA_BOND_PRIMARY_RESELECT,
-		       bond->params.primary_reselect))
+		       READ_ONCE(bond->params.primary_reselect)))
 		goto nla_put_failure;
 
 	if (nla_put_u8(skb, IFLA_BOND_FAIL_OVER_MAC,
-		       bond->params.fail_over_mac))
+		       READ_ONCE(bond->params.fail_over_mac)))
 		goto nla_put_failure;
 
 	if (nla_put_u8(skb, IFLA_BOND_XMIT_HASH_POLICY,
-		       bond->params.xmit_policy))
+		       READ_ONCE(bond->params.xmit_policy)))
 		goto nla_put_failure;
 
 	if (nla_put_u32(skb, IFLA_BOND_RESEND_IGMP,
-			bond->params.resend_igmp))
+			READ_ONCE(bond->params.resend_igmp)))
 		goto nla_put_failure;
 
 	if (nla_put_u8(skb, IFLA_BOND_NUM_PEER_NOTIF,
-		       bond->params.num_peer_notif))
+		       READ_ONCE(bond->params.num_peer_notif)))
 		goto nla_put_failure;
 
 	if (nla_put_u8(skb, IFLA_BOND_ALL_SLAVES_ACTIVE,
-		       bond->params.all_slaves_active))
+		       READ_ONCE(bond->params.all_slaves_active)))
 		goto nla_put_failure;
 
 	if (nla_put_u32(skb, IFLA_BOND_MIN_LINKS,
-			bond->params.min_links))
+			READ_ONCE(bond->params.min_links)))
 		goto nla_put_failure;
 
 	if (nla_put_u32(skb, IFLA_BOND_LP_INTERVAL,
-			bond->params.lp_interval))
+			READ_ONCE(bond->params.lp_interval)))
 		goto nla_put_failure;
 
-	packets_per_slave = bond->params.packets_per_slave;
 	if (nla_put_u32(skb, IFLA_BOND_PACKETS_PER_SLAVE,
-			packets_per_slave))
+			READ_ONCE(bond->params.packets_per_slave)))
 		goto nla_put_failure;
 
 	if (nla_put_u8(skb, IFLA_BOND_AD_LACP_ACTIVE,
-		       bond->params.lacp_active))
+		       READ_ONCE(bond->params.lacp_active)))
 		goto nla_put_failure;
 
 	if (nla_put_u8(skb, IFLA_BOND_AD_LACP_RATE,
-		       bond->params.lacp_fast))
+		       READ_ONCE(bond->params.lacp_fast)))
 		goto nla_put_failure;
 
 	if (nla_put_u8(skb, IFLA_BOND_AD_SELECT,
-		       bond->params.ad_select))
+		       READ_ONCE(bond->params.ad_select)))
 		goto nla_put_failure;
 
 	if (nla_put_u8(skb, IFLA_BOND_TLB_DYNAMIC_LB,
-		       bond->params.tlb_dynamic_lb))
+		       READ_ONCE(bond->params.tlb_dynamic_lb)))
 		goto nla_put_failure;
 
 	if (nla_put_u8(skb, IFLA_BOND_MISSED_MAX,
-		       bond->params.missed_max))
+		       READ_ONCE(bond->params.missed_max)))
 		goto nla_put_failure;
 
 	if (nla_put_u8(skb, IFLA_BOND_COUPLED_CONTROL,
-		       bond->params.coupled_control))
+		       READ_ONCE(bond->params.coupled_control)))
 		goto nla_put_failure;
 
 	if (nla_put_u8(skb, IFLA_BOND_BROADCAST_NEIGH,
-		       bond->params.broadcast_neighbor))
+		       READ_ONCE(bond->params.broadcast_neighbor)))
 		goto nla_put_failure;
 
 	if (nla_put_u8(skb, IFLA_BOND_LACP_STRICT,
-		       bond->params.lacp_strict))
+		       READ_ONCE(bond->params.lacp_strict)))
 		goto nla_put_failure;
 
-	if (BOND_MODE(bond) == BOND_MODE_8023AD) {
+	if (mode == BOND_MODE_8023AD) {
 		struct ad_info info;
 
 		if (capable(CAP_NET_ADMIN)) {
 			if (nla_put_u16(skb, IFLA_BOND_AD_ACTOR_SYS_PRIO,
-					bond->params.ad_actor_sys_prio))
+					READ_ONCE(bond->params.ad_actor_sys_prio)))
 				goto nla_put_failure;
 
 			if (nla_put_u16(skb, IFLA_BOND_AD_USER_PORT_KEY,
-					bond->params.ad_user_port_key))
+					READ_ONCE(bond->params.ad_user_port_key)))
 				goto nla_put_failure;
 
+			/* Small race here, this is a minor trade off. */
 			if (nla_put(skb, IFLA_BOND_AD_ACTOR_SYSTEM,
 				    ETH_ALEN, &bond->params.ad_actor_system))
 				goto nla_put_failure;
 		}
-		if (!bond_3ad_get_active_agg_info(bond, &info)) {
+		if (!__bond_3ad_get_active_agg_info(bond, &info)) {
 			struct nlattr *nest;
 
 			nest = nla_nest_start_noflag(skb, IFLA_BOND_AD_INFO);
@@ -898,9 +909,11 @@ static int bond_fill_info(struct sk_buff *skb,
 		}
 	}
 
+	rcu_read_unlock();
 	return 0;
 
 nla_put_failure:
+	rcu_read_unlock();
 	return -EMSGSIZE;
 }
 
diff --git a/drivers/net/bonding/bond_options.c b/drivers/net/bonding/bond_options.c
index e590c8dee86e1e7168e400b1321fc22c8f75b87c..36b8d89387ee5d67a51087fa2c6edd0de579ab3f 100644
--- a/drivers/net/bonding/bond_options.c
+++ b/drivers/net/bonding/bond_options.c
@@ -934,14 +934,14 @@ static int bond_option_mode_set(struct bonding *bond,
 
 	/* don't cache arp_validate between modes */
 	WRITE_ONCE(bond->params.arp_validate, BOND_ARP_VALIDATE_NONE);
-	bond->params.mode = newval->value;
+	WRITE_ONCE(bond->params.mode, newval->value);
 
 	/* When changing mode, the bond device is down, we may reduce
 	 * the bond_bcast_neigh_enabled in bond_close() if broadcast_neighbor
 	 * enabled in 8023ad mode. Therefore, only clear broadcast_neighbor
 	 * to 0.
 	 */
-	bond->params.broadcast_neighbor = 0;
+	WRITE_ONCE(bond->params.broadcast_neighbor, 0);
 
 	if (bond->dev->reg_state == NETREG_REGISTERED) {
 		bool update = false;
@@ -1706,7 +1706,7 @@ static int bond_option_lacp_strict_set(struct bonding *bond,
 {
 	netdev_dbg(bond->dev, "Setting LACP fallback to %s (%llu)\n",
 		   newval->string, newval->value);
-	bond->params.lacp_strict = newval->value;
+	WRITE_ONCE(bond->params.lacp_strict, newval->value);
 	bond_3ad_set_carrier(bond);
 
 	return 0;
@@ -1927,7 +1927,7 @@ static int bond_option_broadcast_neigh_set(struct bonding *bond,
 	if (bond->params.broadcast_neighbor == newval->value)
 		return 0;
 
-	bond->params.broadcast_neighbor = newval->value;
+	WRITE_ONCE(bond->params.broadcast_neighbor, newval->value);
 	if (bond->dev->flags & IFF_UP) {
 		if (bond->params.broadcast_neighbor)
 			static_branch_inc(&bond_bcast_neigh_enabled);
diff --git a/include/net/bond_3ad.h b/include/net/bond_3ad.h
index 05572c19e14b7ae97d497cc9c5d97d4314eab295..ef667dff297294dc78d98701aae10c31ef09df3f 100644
--- a/include/net/bond_3ad.h
+++ b/include/net/bond_3ad.h
@@ -302,8 +302,8 @@ void bond_3ad_state_machine_handler(struct work_struct *);
 void bond_3ad_initiate_agg_selection(struct bonding *bond, int timeout);
 void bond_3ad_adapter_speed_duplex_changed(struct slave *slave);
 void bond_3ad_handle_link_change(struct slave *slave, char link);
-int  bond_3ad_get_active_agg_info(struct bonding *bond, struct ad_info *ad_info);
-int  __bond_3ad_get_active_agg_info(struct bonding *bond,
+int  bond_3ad_get_active_agg_info(const struct bonding *bond, struct ad_info *ad_info);
+int  __bond_3ad_get_active_agg_info(const struct bonding *bond,
 				    struct ad_info *ad_info);
 int bond_3ad_lacpdu_recv(const struct sk_buff *skb, struct bonding *bond,
 			 struct slave *slave);
diff --git a/include/net/bonding.h b/include/net/bonding.h
index 2c54a36a8477b98dc7a4a1f45d27f7972efecba0..598d56b1bc97063365c88dd002ec1423850e56eb 100644
--- a/include/net/bonding.h
+++ b/include/net/bonding.h
@@ -345,14 +345,14 @@ static inline bool bond_mode_uses_primary(int mode)
 	       mode == BOND_MODE_ALB;
 }
 
-static inline bool bond_uses_primary(struct bonding *bond)
+static inline bool bond_uses_primary(const struct bonding *bond)
 {
 	return bond_mode_uses_primary(BOND_MODE(bond));
 }
 
-static inline struct net_device *bond_option_active_slave_get_rcu(struct bonding *bond)
+static inline struct net_device *bond_option_active_slave_get_rcu(const struct bonding *bond)
 {
-	struct slave *slave = rcu_dereference_rtnl(bond->curr_active_slave);
+	const struct slave *slave = rcu_dereference_rtnl(bond->curr_active_slave);
 
 	return bond_uses_primary(bond) && slave ? slave->dev : NULL;
 }
@@ -703,7 +703,7 @@ void bond_setup(struct net_device *bond_dev);
 unsigned int bond_get_num_tx_queues(void);
 int bond_netlink_init(void);
 void bond_netlink_fini(void);
-struct net_device *bond_option_active_slave_get_rcu(struct bonding *bond);
+struct net_device *bond_option_active_slave_get_rcu(const struct bonding *bond);
 const char *bond_slave_link_status(s8 link);
 struct bond_vlan_tag *bond_verify_device_path(struct net_device *start_dev,
 					      struct net_device *end_dev,
-- 
2.55.0.rc0.799.gd6f94ed593-goog
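
For readers less familiar with the annotation used in the hunks above, here is a minimal userspace sketch of the WRITE_ONCE()/READ_ONCE() pairing. The macros are simplified approximations of the kernel ones (GNU C __typeof__ is assumed), and struct demo_params and the demo_* helpers are hypothetical stand-ins, not bonding code.

```c
/* Simplified userspace approximations of the kernel macros. */
#define WRITE_ONCE(x, val) (*(volatile __typeof__(x) *)&(x) = (val))
#define READ_ONCE(x)       (*(const volatile __typeof__(x) *)&(x))

/* Hypothetical stand-in for the few parameters touched above. */
struct demo_params {
	int mode;
	int lacp_strict;
	int broadcast_neighbor;
};

/* Writer side: every store that a lockless reader may observe is marked. */
static void demo_set_mode(struct demo_params *p, int new_mode)
{
	WRITE_ONCE(p->mode, new_mode);
	WRITE_ONCE(p->broadcast_neighbor, 0);
}

/* Reader side: a const pointer is enough because nothing is modified. */
static int demo_get_mode(const struct demo_params *p)
{
	return READ_ONCE(p->mode);
}
```

The marked accesses only prevent the compiler from tearing, fusing or re-reading the access; they do not order it against other memory operations.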


^ permalink raw reply related

* Re: [PATCH net] net: sgi: ioc3-eth: fix split TX DMA mapping lengths
From: Thomas Bogendoerfer @ 2026-06-29 17:16 UTC (permalink / raw)
  To: raoxu
  Cc: andrew+netdev, davem, edumazet, kuba, pabeni, linux-mips, netdev,
	linux-kernel, stable
In-Reply-To: <4E1486BC4536407E+20260629080623.908426-1-raoxu@uniontech.com>

On Mon, Jun 29, 2026 at 04:06:23PM +0800, raoxu wrote:
> From: Xu Rao <raoxu@uniontech.com>
> 
> When a linear skb crosses a 16 KiB boundary, ioc3_start_xmit()
> splits it into two buffers of lengths s1 and s2.  The descriptor
> advertises those lengths through B1CNT and B2CNT.
> 
> The first buffer is mapped with s1, but the second buffer is also
> mapped with s1 even though the device is told to fetch s2 bytes from
> it.  When the lengths differ, the DMA mapping does not cover the same
> region as the second descriptor buffer, which can result in incorrect
> cache maintenance or a DMA fault on implementations that enforce the
> mapped range.
> 
> There is a separate mismatch in the error path.  If mapping the second
> buffer fails, only d1 needs to be unmapped.  d1 was mapped for s1 bytes,
> but the driver unmaps it using the full packet length.  Streaming DMA
> mappings must be unmapped with the same size used for the corresponding
> map operation.
> 
> Map the second buffer with s2 and unmap the first buffer with s1 when
> the second mapping fails.
> 
> Fixes: ed870f6a7aa2 ("net: sgi: ioc3-eth: use dma-direct for dma allocations")
> Cc: stable@vger.kernel.org
> Signed-off-by: Xu Rao <raoxu@uniontech.com>
> ---
>  drivers/net/ethernet/sgi/ioc3-eth.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/ethernet/sgi/ioc3-eth.c b/drivers/net/ethernet/sgi/ioc3-eth.c
> index 3973106..261f2d3 100644
> --- a/drivers/net/ethernet/sgi/ioc3-eth.c
> +++ b/drivers/net/ethernet/sgi/ioc3-eth.c
> @@ -1061,9 +1061,9 @@ static netdev_tx_t ioc3_start_xmit(struct sk_buff *skb, struct net_device *dev)
>  		d1 = dma_map_single(ip->dma_dev, skb->data, s1, DMA_TO_DEVICE);
>  		if (dma_mapping_error(ip->dma_dev, d1))
>  			goto drop_packet;
> -		d2 = dma_map_single(ip->dma_dev, (void *)b2, s1, DMA_TO_DEVICE);
> +		d2 = dma_map_single(ip->dma_dev, (void *)b2, s2, DMA_TO_DEVICE);
>  		if (dma_mapping_error(ip->dma_dev, d2)) {
> -			dma_unmap_single(ip->dma_dev, d1, len, DMA_TO_DEVICE);
> +			dma_unmap_single(ip->dma_dev, d1, s1, DMA_TO_DEVICE);
>  			goto drop_packet;
>  		}
>  		desc->p1     = cpu_to_be64(ioc3_map(d1, PCI64_ATTR_PREF));
> -- 
> 2.47.3

Reviewed-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
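
A small userspace sketch of the length bookkeeping described in the commit message, in case it helps other readers. It assumes the packet is at most 16 KiB long, so it crosses at most one boundary; split_at_16k() and struct tx_split are hypothetical names, not driver code. The point is that each buffer has its own length, and that length is what must be passed to both the map and the unmap call for that buffer.

```c
#include <stddef.h>
#include <stdint.h>

#define SPLIT_BOUNDARY 0x4000UL /* 16 KiB */

struct tx_split {
	uintptr_t b2; /* start of the second buffer, 0 if not split */
	size_t s1;    /* bytes in the first buffer */
	size_t s2;    /* bytes in the second buffer, 0 if not split */
};

/* Split [data, data + len) at the 16 KiB boundary it crosses, if any. */
static struct tx_split split_at_16k(uintptr_t data, size_t len)
{
	struct tx_split r = { 0, len, 0 };
	uintptr_t last = data + len - 1;

	if (len && (data / SPLIT_BOUNDARY) != (last / SPLIT_BOUNDARY)) {
		r.b2 = (data | (SPLIT_BOUNDARY - 1)) + 1;
		r.s1 = r.b2 - data;
		r.s2 = data + len - r.b2;
	}
	return r;
}
```

With data = 0x3ff0 and len = 0x40 this gives s1 = 0x10 and s2 = 0x30, so mapping the second buffer with s1 would cover only a third of what the descriptor advertises.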

-- 
Crap can work. Given enough thrust pigs will fly, but it's not necessarily a
good idea.                                                [ RFC1925, 2.3 ]

^ permalink raw reply

* Re: [PATCH net] net: sgi: ioc3-eth: unregister netdev before freeing DMA rings
From: Thomas Bogendoerfer @ 2026-06-29 17:18 UTC (permalink / raw)
  To: raoxu
  Cc: andrew+netdev, davem, edumazet, kuba, pabeni, linux-mips, netdev,
	linux-kernel, stable
In-Reply-To: <40CD736C4911C181+20260629085053.964383-1-raoxu@uniontech.com>

On Mon, Jun 29, 2026 at 04:50:53PM +0800, raoxu wrote:
> From: Xu Rao <raoxu@uniontech.com>
> 
> ioc3eth_remove() frees the coherent RX and TX descriptor rings before
> unregistering the netdev. If the interface is running,
> unregister_netdev() invokes ioc3_close() through ndo_stop.
> 
> ioc3_close() stops the device and then calls ioc3_free_rx_bufs() and
> ioc3_clean_tx_ring(). Both cleanup functions access descriptors in the
> rings, so the current ordering causes CPU accesses to freed coherent
> memory. Until ioc3_stop() disables RX and TX DMA, the device may also
> continue using the freed ring addresses.
> 
> Unregister the netdev before releasing the rings. This lets the core
> close a running interface and quiesce the device while the rings are
> still valid. Keep the explicit timer deletion because ndo_stop is not
> called when the interface is already down.
> 
> Fixes: c7b572747549 ("net: sgi: ioc3-eth: allocate space for desc rings only once")
> Cc: stable@vger.kernel.org
> Signed-off-by: Xu Rao <raoxu@uniontech.com>
> ---
>  drivers/net/ethernet/sgi/ioc3-eth.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/ethernet/sgi/ioc3-eth.c b/drivers/net/ethernet/sgi/ioc3-eth.c
> index 261f2d35d579..009f37105eaf 100644
> --- a/drivers/net/ethernet/sgi/ioc3-eth.c
> +++ b/drivers/net/ethernet/sgi/ioc3-eth.c
> @@ -967,11 +967,12 @@ static void ioc3eth_remove(struct platform_device *pdev)
>  	struct net_device *dev = platform_get_drvdata(pdev);
>  	struct ioc3_private *ip = netdev_priv(dev);
>  
> +	unregister_netdev(dev);
> +	timer_delete_sync(&ip->ioc3_timer);
> +
>  	dma_free_coherent(ip->dma_dev, RX_RING_SIZE, ip->rxr, ip->rxr_dma);
>  	dma_free_coherent(ip->dma_dev, TX_RING_SIZE + SZ_16K - 1, ip->tx_ring, ip->txr_dma);
>  
> -	unregister_netdev(dev);
> -	timer_delete_sync(&ip->ioc3_timer);
>  	free_netdev(dev);
>  }
>  
> -- 
> 2.50.1

Reviewed-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>

-- 
Crap can work. Given enough thrust pigs will fly, but it's not necessarily a
good idea.                                                [ RFC1925, 2.3 ]

^ permalink raw reply

* Re: [PATCH net-next] net: neigh: avoid calling neigh_forced_gc on every alloc when table is full
From: Kuniyuki Iwashima @ 2026-06-29 18:05 UTC (permalink / raw)
  To: Vimal Agrawal; +Cc: kuba, edumazet, netdev, vimal.agrawal
In-Reply-To: <CALkUMdSzLQSpjpK+cnt=c+dPDZGnyB2zUD3z3KKXr=-DhM7t=Q@mail.gmail.com>

On Mon, Jun 29, 2026 at 12:57 AM Vimal Agrawal <avimalin@gmail.com> wrote:
>
> Hi Kuniyuki,
> Thank you for the feedback.
> However, the rate limiting issue exists independently of the threshold
> values. If entries genuinely exceed gc_thresh3 — regardless of what it
> is set to — neigh_forced_gc() is called on every allocation attempt
> with no rate limiting. In my workload, most entries are
> active/reachable with refcnt > 1, so the GC walk traverses the entire
> table without reclaiming anything.

This suggests your gc_thresh2/3 do not fit your use case.

If GC does not help, there is no point in running it or rate-limiting
in the first place.


> Increasing gc_thresh3 would make
> this worse, not better, as GC now has a larger table to scan on each
> call.

If you just increase gc_thresh3 slightly, then yes, it won't help.


>
> Regarding neigh_hash_shift: in my workload, neigh_alloc() returns
> ENOBUFS before reaching do_alloc() since GC cannot reclaim any
> entries. kzalloc() is never called, so neigh_hash_grow() is not
> involved in the latency I observed. The pre-lock time check in
> neigh_forced_gc() is a low-cost safeguard that prevents repeated full
> table scans regardless of gc_thresh3 value. It does not interfere with
> correct GC behaviour — if entries are still above the threshold, GC
> runs normally.
>
>
> Hi Jakub,
> I tested with different threshold values, filling the table completely
> with 32k reachable entries and attempting 1000 additional allocations.
> I exported neigh_forced_gc() so that it could be profiled:
>                          no change  10ms   50ms   100ms
> max cpu usage %          44%        11.8%  2.56%  1.42%
> calls > 100us (of 1000)  101        31     13     7
>
> At 10ms, max CPU usage is still 11.8% and 31 out of 1000 calls take
> more than 100us. Given that 50ms reduces this to 2.56% and 13 calls
> respectively, I would prefer 50ms as the threshold. However, I am open
> to further discussion on the right value.
>
> Thanks,
> Vimal
>
>
> On Fri, Jun 26, 2026 at 3:17 AM Kuniyuki Iwashima <kuniyu@google.com> wrote:
> >
> > From: Vimal Agrawal <avimalin@gmail.com>
> > Date: Thu, 25 Jun 2026 10:20:20 +0000
> > > Once the neighbour table exceeds gc_thresh3, neigh_forced_gc() is called
> > > on every allocation attempt with no rate limiting. In workloads with mostly
> > > active/reachable entries, the GC walk traverses a large portion of the
> > > neighbour table without reclaiming entries, holding tbl->lock for an
> > > extended period. This causes severe lock contention and allocation
> > > latencies exceeding 16ms under sustained neighbour creation.
> > >
> > > Add a pre-lock check in neigh_forced_gc() to skip the GC run if one was
> > > performed within the last second, avoiding repeated full table scans and
> > > lock acquisitions on the hot allocation path.
> > >
> > > Profiling of neigh_create() shows ~3 orders of magnitude latency
> > > improvement with this change.
> > >
> > > Link: https://lore.kernel.org/netdev/CALkUMdSCpx_ywYCx_ePLdm6yioO1nQWx7sSM=AEgsq0kywHxTw@mail.gmail.com/
> >
> > From the thread, these look misconfigured.
> >
> > ---8<---
> > net.ipv6.neigh.default.gc_thresh2 = 32768
> > net.ipv6.neigh.default.gc_thresh3 = 32768
> > ---8<---
> >
> > If gc_thresh3 is large enough, gc_thresh2 will give you 5s
> > rate limiting.
> >
> > If the number of active neigh entries constantly exceeds
> > gc_thresh3, that number is the right gc_thresh2 for you.
> >
> > Also, I guess you want a new kernel param for the first
> > neigh_hash_alloc(), whose shift is currently fixed at 3, which
> > is too small for some hosts.
> >
> > 50000 entries require neigh_hash_grow() 13 times.
> >
> > Can you test this on your real workload, starting from
> > neigh_hash_shift=16 and appropriate gc_thresh2/3 ?
> >
> > ---8<---
> > diff --git a/net/core/neighbour.c b/net/core/neighbour.c
> > index 1349c0eedb64..a75b3750eec9 100644
> > --- a/net/core/neighbour.c
> > +++ b/net/core/neighbour.c
> > @@ -1817,6 +1817,22 @@ EXPORT_SYMBOL(neigh_parms_release);
> >  static struct lock_class_key neigh_table_proxy_queue_class;
> >
> >  static struct neigh_table __rcu *neigh_tables[NEIGH_NR_TABLES] __read_mostly;
> > +static __initdata unsigned long neigh_hash_shift = 3;
> > +
> > +static int __init neigh_set_hash_shift(char *str)
> > +{
> > +       ssize_t ret;
> > +
> > +       if (!str)
> > +               return 0;
> > +
> > +       ret = kstrtoul(str, 0, &neigh_hash_shift);
> > +       if (ret)
> > +               return 0;
> > +
> > +       return 1;
> > +}
> > +__setup("neigh_hash_shift=", neigh_set_hash_shift);
> >
> >  void neigh_table_init(int index, struct neigh_table *tbl)
> >  {
> > @@ -1843,7 +1859,7 @@ void neigh_table_init(int index, struct neigh_table *tbl)
> >                 panic("cannot create neighbour proc dir entry");
> >  #endif
> >
> > -       RCU_INIT_POINTER(tbl->nht, neigh_hash_alloc(3));
> > +       RCU_INIT_POINTER(tbl->nht, neigh_hash_alloc(neigh_hash_shift));
> >
> >         phsize = (PNEIGH_HASHMASK + 1) * sizeof(struct pneigh_entry *);
> >         tbl->phash_buckets = kzalloc(phsize, GFP_KERNEL);
> > ---8<---
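
The "13 times" figure above can be reproduced with a short sketch. It assumes the table doubles (shift + 1) whenever the entry count exceeds the current bucket count 2^shift; hash_grow_count() is a hypothetical helper, not kernel code.

```c
/* Count doublings needed from start_shift until 2^shift >= entries. */
static unsigned int hash_grow_count(unsigned long entries,
				    unsigned int start_shift)
{
	unsigned int max_shift = 8 * sizeof(unsigned long) - 1;
	unsigned int shift = start_shift;
	unsigned int grows = 0;

	while (shift < max_shift && entries > (1UL << shift)) {
		shift++;
		grows++;
	}
	return grows;
}
```

Under that assumption, hash_grow_count(50000, 3) walks from 8 buckets up to 65536 buckets, which is 13 doublings, while starting from a shift of 16 needs none.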
> >
> >
> >
> > > Signed-off-by: Vimal Agrawal <vimal.agrawal@sophos.com>
> > > ---
> > >  net/core/neighbour.c | 3 +++
> > >  1 file changed, 3 insertions(+)
> > >
> > > diff --git a/net/core/neighbour.c b/net/core/neighbour.c
> > > index 1349c0eedb64..078842db3c5f 100644
> > > --- a/net/core/neighbour.c
> > > +++ b/net/core/neighbour.c
> > > @@ -260,6 +260,9 @@ static int neigh_forced_gc(struct neigh_table *tbl)
> > >       int shrunk = 0;
> > >       int loop = 0;
> > >
> > > +     if (!time_after(jiffies, READ_ONCE(tbl->last_flush) + HZ))
> > > +             return 0;
> > > +
> > >       NEIGH_CACHE_STAT_INC(tbl, forced_gc_runs);
> > >
> > >       spin_lock_bh(&tbl->lock);
> > > --
> > > 2.17.1
> > > v
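
For reference, a userspace sketch of the pre-lock rate-limit check from the quoted patch. tick_after() follows the same wraparound-safe idea as the kernel's time_after(); struct demo_table and demo_forced_gc() are hypothetical, and the sketch assumes that a GC pass records its own timestamp in last_flush.

```c
#include <stdbool.h>

/* Wraparound-safe "a is after b" on a free-running tick counter. */
static bool tick_after(unsigned long a, unsigned long b)
{
	return (long)(b - a) < 0;
}

struct demo_table {
	unsigned long last_flush; /* tick of the last GC pass */
	unsigned int gc_runs;
};

/* Returns true if a GC pass ran, false if it was rate-limited. */
static bool demo_forced_gc(struct demo_table *t, unsigned long now,
			   unsigned long min_interval)
{
	/* Cheap check before any lock would be taken. */
	if (!tick_after(now, t->last_flush + min_interval))
		return false;

	t->last_flush = now; /* assumption: the pass stamps its own run */
	t->gc_runs++;
	return true;
}
```

The comparison is strict, so a call arriving exactly min_interval ticks after the last pass is still skipped.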

^ permalink raw reply

* Re: [PATCH bpf-next v3 1/2] bpf, sockmap: disallow update and delete from tc, xdp and flow_dissector
From: bot+bpf-ci @ 2026-06-29 18:06 UTC (permalink / raw)
  To: rhkrqnwk98, ast, daniel, andrii, john.fastabend, eddyz87, memxor,
	davem, kuba, hawk
  Cc: martin.lau, song, yonghong.song, jolsa, sdf, emil, lmb, jakub,
	jiayuan.chen, shuah, bpf, netdev, linux-kselftest, linux-kernel,
	ast, andrii, daniel, martin.lau, eddyz87, yonghong.song, clm,
	ihor.solodrai
In-Reply-To: <20260629172704.1302218-2-rhkrqnwk98@gmail.com>

[-- Attachment #1: Type: text/plain, Size: 1475 bytes --]

>     sock_map_update_common() and __sock_map_delete() hold stab->lock and call
>     sock_map_unref() -> sock_map_del_link(), which takes sk_callback_lock for
>     write. That gives the order stab->lock -> sk_callback_lock.
>
>     The reverse order comes from the SK_SKB stream parser.
>     sk_psock_strp_data_ready() holds sk_callback_lock for read, and after the
>     verdict tcp_bpf_strp_read_sock() acks the consumed data inline via
>     __tcp_cleanup_rbuf(). The ACK goes out egress, where a sched_cls program
>     deletes from the sockmap and takes stab->lock:
>
>     A tc, xdp or flow_dissector program has no reason to update or delete a
>     sockmap, and redirect does not go through here. Drop them from
>     may_update_sockmap() so the verifier rejects it. It also closes the
>     matching sockhash inversion.
>
>     Suggested-by: John Fastabend <john.fastabend@gmail.com>
>     Signed-off-by: Sechang Lim <rhkrqnwk98@gmail.com>

This fixes the behaviour added by commit 0126240f448d ("bpf: sockmap: Allow
update from BPF"), which introduced may_update_sockmap() and added the four
BPF_PROG_TYPE_* cases that this change removes.  Should it carry:

  Fixes: 0126240f448d ("bpf: sockmap: Allow update from BPF")


---
AI reviewed your patch. Please fix the bug or reply by email explaining why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

CI run summary: https://github.com/kernel-patches/bpf/actions/runs/28391303635

^ permalink raw reply

* [PATCH v1 net-next 00/10] net: fib_rules: RTNL-less RTM_NEWRULE and RTM_DELRULE.
From: Kuniyuki Iwashima @ 2026-06-29 18:10 UTC (permalink / raw)
  To: David Ahern, Ido Schimmel, David S . Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni
  Cc: Simon Horman, Kuniyuki Iwashima, Kuniyuki Iwashima, netdev

RTM_NEWRULE and RTM_DELRULE acquire rtnl_net_lock(), but this is
only for fib_unmerge() in IPv4.

Since commit d954a67a7dfa ("ipv4: fib_rule: Move fib4_rules_exit()
to ->exit()."), RTM_DELRULE no longer needs RTNL.

fib_unmerge() is a one-time event for each netns, so we only need
RTNL for the first IPv4 rule.

This series introduces a per-fib_rules_ops mutex and drops RTNL
from fib_rules code except for the first IPv4 RTM_NEWRULE.

The script below creates 1K rules in parallel in 4K netns, and
it got roughly 25x/29x faster (wall-clock time) for IPv4/IPv6.

  #!/bin/bash
  N=4096
  F=rules.txt

  for i in $(seq $N); do ip netns add ns-$i; done
  printf 'rule add from all table %d\n' {1..1024} > $F

  for v in 4 6; do
        echo "=== IPv${v} ==="
        time { for i in $(seq $N); do nsenter \
        --net=/var/run/netns/ns-$i ip -$v -batch $F & done; wait; }
  done

  for i in $(seq $N); do ip netns del ns-$i; done
  rm -f $F

Without this series:

  # ./test.sh
  === IPv4 ===

  real  0m22.752s
  user  0m7.834s
  sys   92m46.721s
  === IPv6 ===

  real  0m35.181s
  user  0m8.635s
  sys   142m30.479s

With this series:

  # ./test.sh
  === IPv4 ===

  real  0m0.918s
  user  0m5.675s
  sys   2m7.024s
  === IPv6 ===

  real  0m1.214s
  user  0m7.917s
  sys   4m19.489s

Kuniyuki Iwashima (10):
  net: fib_rules: Make fib_rules_ops.delete() return void.
  ipv4: fib_rules: Make the need for fib_unmerge() explicit.
  ipv4: fib: Protect fib_new_table() with spinlock.
  ipv4: fib: Drop RTNL annotation for net->ipv4.fib_table_hash[].
  net: fib_rules: Add fib_rules_ops.lock.
  net: fib_rules: Remove unnecessary EXPORT_SYMBOL.
  net: fib_rules: Drop RTNL assertions.
  net: fib_rules: Use dev_get_by_name_rcu().
  net: fib_rules: Only hold RTNL for the first IPv4 RTM_NEWRULE.
  ipv6: fib_rules: Convert fib6_rules_net_exit_rtnl() to ->exit().

 include/net/fib_rules.h  |  4 +-
 include/net/ip_fib.h     |  3 +-
 include/net/netns/ipv4.h |  1 +
 net/core/fib_rules.c     | 82 +++++++++++++++++++++-------------------
 net/ipv4/fib_frontend.c  | 48 ++++++++++++++++-------
 net/ipv4/fib_rules.c     | 20 ++++++----
 net/ipv4/fib_trie.c      |  3 +-
 net/ipv6/fib6_rules.c    | 17 ++-------
 8 files changed, 101 insertions(+), 77 deletions(-)

-- 
2.55.0.rc0.799.gd6f94ed593-goog

^ permalink raw reply

* [PATCH v1 net-next 01/10] net: fib_rules: Make fib_rules_ops.delete() return void.
From: Kuniyuki Iwashima @ 2026-06-29 18:10 UTC (permalink / raw)
  To: David Ahern, Ido Schimmel, David S . Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni
  Cc: Simon Horman, Kuniyuki Iwashima, Kuniyuki Iwashima, netdev
In-Reply-To: <20260629181226.1929658-1-kuniyu@google.com>

Since commit d954a67a7dfa ("ipv4: fib_rule: Move fib4_rules_exit()
to ->exit()."), both fib4_rule_delete() and fib6_rule_delete() always
return 0.

Let's change the return type to void.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
---
 include/net/fib_rules.h | 2 +-
 net/core/fib_rules.c    | 7 ++-----
 net/ipv4/fib_rules.c    | 4 +---
 net/ipv6/fib6_rules.c   | 4 +---
 4 files changed, 5 insertions(+), 12 deletions(-)

diff --git a/include/net/fib_rules.h b/include/net/fib_rules.h
index 7dee0ae616e3..f9a4bca51eda 100644
--- a/include/net/fib_rules.h
+++ b/include/net/fib_rules.h
@@ -82,7 +82,7 @@ struct fib_rules_ops {
 					     struct fib_rule_hdr *,
 					     struct nlattr **,
 					     struct netlink_ext_ack *);
-	int			(*delete)(struct fib_rule *);
+	void			(*delete)(struct fib_rule *);
 	int			(*compare)(struct fib_rule *,
 					   struct fib_rule_hdr *,
 					   struct nlattr **);
diff --git a/net/core/fib_rules.c b/net/core/fib_rules.c
index cf374c208732..961eb709f256 100644
--- a/net/core/fib_rules.c
+++ b/net/core/fib_rules.c
@@ -1055,11 +1055,8 @@ int fib_delrule(struct net *net, struct sk_buff *skb, struct nlmsghdr *nlh,
 		goto errout_free;
 	}
 
-	if (ops->delete) {
-		err = ops->delete(rule);
-		if (err)
-			goto errout_free;
-	}
+	if (ops->delete)
+		ops->delete(rule);
 
 	if (rule->tun_id)
 		ip_tunnel_unneed_metadata();
diff --git a/net/ipv4/fib_rules.c b/net/ipv4/fib_rules.c
index e068a5bace73..51d0ab423ed4 100644
--- a/net/ipv4/fib_rules.c
+++ b/net/ipv4/fib_rules.c
@@ -349,7 +349,7 @@ static int fib4_rule_configure(struct fib_rule *rule, struct sk_buff *skb,
 	return err;
 }
 
-static int fib4_rule_delete(struct fib_rule *rule)
+static void fib4_rule_delete(struct fib_rule *rule)
 {
 	struct net *net = rule->fr_net;
 
@@ -361,8 +361,6 @@ static int fib4_rule_delete(struct fib_rule *rule)
 	if (net->ipv4.fib_rules_require_fldissect &&
 	    fib_rule_requires_fldissect(rule))
 		net->ipv4.fib_rules_require_fldissect--;
-
-	return 0;
 }
 
 static int fib4_rule_compare(struct fib_rule *rule, struct fib_rule_hdr *frh,
diff --git a/net/ipv6/fib6_rules.c b/net/ipv6/fib6_rules.c
index e1b2b4fa6e18..5ab4dde07225 100644
--- a/net/ipv6/fib6_rules.c
+++ b/net/ipv6/fib6_rules.c
@@ -480,15 +480,13 @@ static int fib6_rule_configure(struct fib_rule *rule, struct sk_buff *skb,
 	return err;
 }
 
-static int fib6_rule_delete(struct fib_rule *rule)
+static void fib6_rule_delete(struct fib_rule *rule)
 {
 	struct net *net = rule->fr_net;
 
 	if (net->ipv6.fib6_rules_require_fldissect &&
 	    fib_rule_requires_fldissect(rule))
 		net->ipv6.fib6_rules_require_fldissect--;
-
-	return 0;
 }
 
 static int fib6_rule_compare(struct fib_rule *rule, struct fib_rule_hdr *frh,
-- 
2.55.0.rc0.799.gd6f94ed593-goog


^ permalink raw reply related

* [PATCH v1 net-next 02/10] ipv4: fib_rules: Make the need for fib_unmerge() explicit.
From: Kuniyuki Iwashima @ 2026-06-29 18:10 UTC (permalink / raw)
  To: David Ahern, Ido Schimmel, David S . Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni
  Cc: Simon Horman, Kuniyuki Iwashima, Kuniyuki Iwashima, netdev
In-Reply-To: <20260629181226.1929658-1-kuniyu@google.com>

IPv4 local and main route tables are merged by default to avoid
unnecessary rule lookups.

When the first IPv4 rule is created, fib_unmerge() splits the
two tables.

However, fib4_rule_configure() currently always calls fib_unmerge(),
and even fetching a table via fib_get_table() requires RTNL (or RCU).

We will drop RTNL from fib_newrule() if not needed.

Let's call fib_unmerge() only once for the first rule.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
---
 net/ipv4/fib_rules.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/net/ipv4/fib_rules.c b/net/ipv4/fib_rules.c
index 51d0ab423ed4..16d202246a36 100644
--- a/net/ipv4/fib_rules.c
+++ b/net/ipv4/fib_rules.c
@@ -301,10 +301,12 @@ static int fib4_rule_configure(struct fib_rule *rule, struct sk_buff *skb,
 	    fib4_nl2rule_dscp_mask(tb[FRA_DSCP_MASK], rule4, extack) < 0)
 		goto errout;
 
-	/* split local/main if they are not already split */
-	err = fib_unmerge(net);
-	if (err)
-		goto errout;
+	if (!net->ipv4.fib_has_custom_rules) {
+		/* split local/main if they are not already split */
+		err = fib_unmerge(net);
+		if (err)
+			goto errout;
+	}
 
 	if (rule->table == RT_TABLE_UNSPEC && !rule->l3mdev) {
 		if (rule->action == FR_ACT_TO_TBL) {
-- 
2.55.0.rc0.799.gd6f94ed593-goog


^ permalink raw reply related

* [PATCH v1 net-next 03/10] ipv4: fib: Protect fib_new_table() with spinlock.
From: Kuniyuki Iwashima @ 2026-06-29 18:10 UTC (permalink / raw)
  To: David Ahern, Ido Schimmel, David S . Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni
  Cc: Simon Horman, Kuniyuki Iwashima, Kuniyuki Iwashima, netdev
In-Reply-To: <20260629181226.1929658-1-kuniyu@google.com>

fib_newrule() will drop RTNL except for the first IPv4 rule.

Then, fib4_rule_configure() could call fib_empty_table() and create
a new IPv4 fib_table without RTNL.

Currently, net->ipv4.fib_table_hash[] is only protected by RTNL.

As a prep, let's protect net->ipv4.fib_table_hash[] with a dedicated
spinlock.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
---
 include/net/netns/ipv4.h |  1 +
 net/ipv4/fib_frontend.c  | 25 +++++++++++++++++++++----
 2 files changed, 22 insertions(+), 4 deletions(-)

diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index 6e27c56514df..59506320558a 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -127,6 +127,7 @@ struct netns_ipv4 {
 	atomic_t		fib_num_tclassid_users;
 #endif
 	struct hlist_head	*fib_table_hash;
+	spinlock_t		fib_table_hash_lock;
 	struct sock		*fibnl;
 	struct hlist_head	*fib_info_hash;
 	unsigned int		fib_info_hash_bits;
diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index 42212970d735..336d70649eb9 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -76,7 +76,7 @@ static int __net_init fib4_rules_init(struct net *net)
 
 struct fib_table *fib_new_table(struct net *net, u32 id)
 {
-	struct fib_table *tb, *alias = NULL;
+	struct fib_table *tb, *new_tb, *alias = NULL;
 	unsigned int h;
 
 	if (id == 0)
@@ -85,14 +85,27 @@ struct fib_table *fib_new_table(struct net *net, u32 id)
 	if (tb)
 		return tb;
 
+	if (!check_net(net))
+		return NULL;
+
 	if (id == RT_TABLE_LOCAL && !net->ipv4.fib_has_custom_rules)
 		alias = fib_new_table(net, RT_TABLE_MAIN);
 
-	if (check_net(net))
-		tb = fib_trie_table(id, alias);
-	if (!tb)
+	new_tb = fib_trie_table(id, alias);
+	if (!new_tb)
 		return NULL;
 
+	spin_lock(&net->ipv4.fib_table_hash_lock);
+
+	tb = fib_get_table(net, id);
+	if (tb) {
+		spin_unlock(&net->ipv4.fib_table_hash_lock);
+		fib_free_table(new_tb);
+		return tb;
+	}
+
+	tb = new_tb;
+
 	switch (id) {
 	case RT_TABLE_MAIN:
 		rcu_assign_pointer(net->ipv4.fib_main, tb);
@@ -106,6 +119,9 @@ struct fib_table *fib_new_table(struct net *net, u32 id)
 
 	h = id & (FIB_TABLE_HASHSZ - 1);
 	hlist_add_head_rcu(&tb->tb_hlist, &net->ipv4.fib_table_hash[h]);
+
+	spin_unlock(&net->ipv4.fib_table_hash_lock);
+
 	return tb;
 }
 EXPORT_SYMBOL_GPL(fib_new_table);
@@ -1565,6 +1581,7 @@ static int __net_init ip_fib_net_init(struct net *net)
 	net->ipv4.sysctl_fib_multipath_hash_fields =
 		FIB_MULTIPATH_HASH_FIELD_DEFAULT_MASK;
 #endif
+	spin_lock_init(&net->ipv4.fib_table_hash_lock);
 
 	/* Avoid false sharing : Use at least a full cache line */
 	size = max_t(size_t, size, L1_CACHE_BYTES);
-- 
2.55.0.rc0.799.gd6f94ed593-goog
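
The pattern in fib_new_table() above (allocate outside the lock, re-check under the lock, free the loser's copy) can be sketched in userspace as follows. All demo_* names are hypothetical. Unlike the kernel code, the first lookup here also takes the lock so that the sketch stays free of data races without RCU, and a C11 atomic_flag stands in for the spinlock.

```c
#include <stdatomic.h>
#include <stdlib.h>

struct demo_tb {
	unsigned int id;
	struct demo_tb *next;
};

struct demo_ns {
	atomic_flag lock;      /* stand-in for the hash spinlock */
	struct demo_tb *head;
};

static void demo_lock(atomic_flag *l)
{
	while (atomic_flag_test_and_set_explicit(l, memory_order_acquire))
		;
}

static void demo_unlock(atomic_flag *l)
{
	atomic_flag_clear_explicit(l, memory_order_release);
}

/* Caller must hold ns->lock. */
static struct demo_tb *demo_find(struct demo_ns *ns, unsigned int id)
{
	struct demo_tb *tb;

	for (tb = ns->head; tb; tb = tb->next)
		if (tb->id == id)
			return tb;
	return NULL;
}

static struct demo_tb *demo_new_table(struct demo_ns *ns, unsigned int id)
{
	struct demo_tb *tb, *new_tb;

	demo_lock(&ns->lock);
	tb = demo_find(ns, id);
	demo_unlock(&ns->lock);
	if (tb)
		return tb;

	/* Allocate with no lock held. */
	new_tb = calloc(1, sizeof(*new_tb));
	if (!new_tb)
		return NULL;
	new_tb->id = id;

	demo_lock(&ns->lock);
	tb = demo_find(ns, id);
	if (tb) {
		/* Lost the race: drop our copy, return the winner. */
		demo_unlock(&ns->lock);
		free(new_tb);
		return tb;
	}
	new_tb->next = ns->head;
	ns->head = new_tb;
	demo_unlock(&ns->lock);

	return new_tb;
}
```

Keeping the allocation outside the critical section is what allows a non-sleeping lock to be used here; the cost is one wasted allocation when two callers race for the same id.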


^ permalink raw reply related

* [PATCH v1 net-next 04/10] ipv4: fib: Drop RTNL annotation for net->ipv4.fib_table_hash[].
From: Kuniyuki Iwashima @ 2026-06-29 18:10 UTC (permalink / raw)
  To: David Ahern, Ido Schimmel, David S . Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni
  Cc: Simon Horman, Kuniyuki Iwashima, Kuniyuki Iwashima, netdev
In-Reply-To: <20260629181226.1929658-1-kuniyu@google.com>

fib_newrule() will drop RTNL except for the first IPv4 rule.

net->ipv4.fib_table_hash[] will be read with no protection,
but this is fine because fib_table is not destroyed until
netns dismantle except for the merged main/local table.

fib_unmerge() will continue to be called under RTNL, so other
readers (fib_flush() and fib_info_notify_update()) just have
to care about the concurrent hlist_add().

IPv6 and IPMR/IP6MR also take this strategy and use RCU helpers
to avoid data race against concurrent hlist_add().

Let's not use lockdep_rtnl_is_held() and rcu_dereference_rtnl()
for net->ipv4.fib_table_hash[].

Note that commit a7e53531234d ("fib_trie: Make fib_table rcu
safe") started to use the _safe version in fib_flush(), but it
is not needed thanks to RTNL.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
---
 include/net/ip_fib.h    |  3 ++-
 net/ipv4/fib_frontend.c | 23 +++++++++++++----------
 net/ipv4/fib_trie.c     |  3 +--
 3 files changed, 16 insertions(+), 13 deletions(-)

diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
index c63a3c4967ae..0a35355fb0f3 100644
--- a/include/net/ip_fib.h
+++ b/include/net/ip_fib.h
@@ -302,7 +302,8 @@ static inline struct fib_table *fib_get_table(struct net *net, u32 id)
 		&net->ipv4.fib_table_hash[TABLE_LOCAL_INDEX] :
 		&net->ipv4.fib_table_hash[TABLE_MAIN_INDEX];
 
-	tb_hlist = rcu_dereference_rtnl(hlist_first_rcu(ptr));
+	/* Only fib4_rules_init() adds fib_table. */
+	tb_hlist = rcu_dereference_protected(hlist_first_rcu(ptr), true);
 
 	return hlist_entry(tb_hlist, struct fib_table, tb_hlist);
 }
diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index 336d70649eb9..54eb72695093 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -126,24 +126,28 @@ struct fib_table *fib_new_table(struct net *net, u32 id)
 }
 EXPORT_SYMBOL_GPL(fib_new_table);
 
-/* caller must hold either rtnl or rcu read lock */
 struct fib_table *fib_get_table(struct net *net, u32 id)
 {
-	struct fib_table *tb;
+	struct fib_table *tb = NULL;
 	struct hlist_head *head;
 	unsigned int h;
 
 	if (id == 0)
 		id = RT_TABLE_MAIN;
 	h = id & (FIB_TABLE_HASHSZ - 1);
-
 	head = &net->ipv4.fib_table_hash[h];
-	hlist_for_each_entry_rcu(tb, head, tb_hlist,
-				 lockdep_rtnl_is_held()) {
+
+	/* fib_table is not destroyed until ip_fib_net_exit()
+	 * except for the merged main/local table.
+	 * fib_unmerge() is called under RTNL, so other readers
+	 * under RTNL (e.g. fib_flush(), fib_info_notify_update())
+	 * can safely traverse the list with rcu_dereference_raw().
+	 */
+	hlist_for_each_entry_rcu(tb, head, tb_hlist, true)
 		if (tb->tb_id == id)
-			return tb;
-	}
-	return NULL;
+			break;
+
+	return tb;
 }
 #endif /* CONFIG_IP_MULTIPLE_TABLES */
 
@@ -206,10 +210,9 @@ void fib_flush(struct net *net)
 
 	for (h = 0; h < FIB_TABLE_HASHSZ; h++) {
 		struct hlist_head *head = &net->ipv4.fib_table_hash[h];
-		struct hlist_node *tmp;
 		struct fib_table *tb;
 
-		hlist_for_each_entry_safe(tb, tmp, head, tb_hlist)
+		hlist_for_each_entry_rcu(tb, head, tb_hlist, true)
 			flushed += fib_table_flush(net, tb, false);
 	}
 
diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index e11dc86ceda0..d1d342d7148e 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -2137,8 +2137,7 @@ void fib_info_notify_update(struct net *net, struct nl_info *info)
 		struct hlist_head *head = &net->ipv4.fib_table_hash[h];
 		struct fib_table *tb;
 
-		hlist_for_each_entry_rcu(tb, head, tb_hlist,
-					 lockdep_rtnl_is_held())
+		hlist_for_each_entry_rcu(tb, head, tb_hlist, true)
 			__fib_info_notify_update(net, tb, info);
 	}
 }
-- 
2.55.0.rc0.799.gd6f94ed593-goog
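
A userspace sketch of why readers only "have to care about the concurrent hlist_add()": if nodes are never freed while readers exist, it is enough for the writer to publish a fully initialised node with a release store and for readers to load pointers with acquire semantics. This uses C11 atomics as a rough analogue of rcu_assign_pointer() and the RCU list helpers; the demo_* names are hypothetical, and writers are assumed to be serialised by an external lock.

```c
#include <stdatomic.h>
#include <stddef.h>

struct demo_node {
	unsigned int id;
	struct demo_node *_Atomic next;
};

struct demo_list {
	struct demo_node *_Atomic head;
};

/* Writer: initialise the node first, then publish it at the head. */
static void demo_add_head(struct demo_list *l, struct demo_node *n,
			  unsigned int id)
{
	struct demo_node *first;

	first = atomic_load_explicit(&l->head, memory_order_relaxed);
	n->id = id;
	atomic_store_explicit(&n->next, first, memory_order_relaxed);
	/* Release: a reader that sees n also sees n->id and n->next. */
	atomic_store_explicit(&l->head, n, memory_order_release);
}

/* Reader: no lock; sees the list either with or without a new node. */
static struct demo_node *demo_find(struct demo_list *l, unsigned int id)
{
	struct demo_node *n;

	for (n = atomic_load_explicit(&l->head, memory_order_acquire); n;
	     n = atomic_load_explicit(&n->next, memory_order_acquire))
		if (n->id == id)
			return n;
	return NULL;
}
```

The sketch deliberately has no removal path; removal is the part that needs a grace period or, as in the patch, an outer lock shared with the readers that care.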


^ permalink raw reply related

* [PATCH v1 net-next 05/10] net: fib_rules: Add fib_rules_ops.lock.
From: Kuniyuki Iwashima @ 2026-06-29 18:10 UTC (permalink / raw)
  To: David Ahern, Ido Schimmel, David S . Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni
  Cc: Simon Horman, Kuniyuki Iwashima, Kuniyuki Iwashima, netdev
In-Reply-To: <20260629181226.1929658-1-kuniyu@google.com>

We will no longer hold RTNL for RTM_NEWRULE and RTM_DELRULE
except for the first IPv4 RTM_NEWRULE.

Let's add a per-fib_rules_ops mutex that nests inside RTNL.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
---
 include/net/fib_rules.h |  1 +
 net/core/fib_rules.c    | 20 ++++++++++++++++++--
 2 files changed, 19 insertions(+), 2 deletions(-)

diff --git a/include/net/fib_rules.h b/include/net/fib_rules.h
index f9a4bca51eda..7636ef4da5ad 100644
--- a/include/net/fib_rules.h
+++ b/include/net/fib_rules.h
@@ -98,6 +98,7 @@ struct fib_rules_ops {
 	struct list_head	rules_list;
 	struct module		*owner;
 	struct net		*fro_net;
+	struct mutex		lock;
 	struct rcu_head		rcu;
 };
 
diff --git a/net/core/fib_rules.c b/net/core/fib_rules.c
index 961eb709f256..8b9dac1bd4a7 100644
--- a/net/core/fib_rules.c
+++ b/net/core/fib_rules.c
@@ -172,6 +172,7 @@ fib_rules_register(const struct fib_rules_ops *tmpl, struct net *net)
 		return ERR_PTR(-ENOMEM);
 
 	INIT_LIST_HEAD(&ops->rules_list);
+	mutex_init(&ops->lock);
 	ops->fro_net = net;
 
 	err = __fib_rules_register(ops);
@@ -392,6 +393,7 @@ static int call_fib_rule_notifiers(struct net *net,
 	};
 
 	ASSERT_RTNL_NET(net);
+	lockdep_assert_held(&ops->lock);
 
 	/* Paired with READ_ONCE() in fib_rules_seq() */
 	WRITE_ONCE(ops->fib_rules_seq, ops->fib_rules_seq + 1);
@@ -910,6 +912,7 @@ int fib_newrule(struct net *net, struct sk_buff *skb, struct nlmsghdr *nlh,
 
 	if (!rtnl_held)
 		rtnl_net_lock(net);
+	mutex_lock(&ops->lock);
 
 	err = fib_nl2rule_rtnl(rule, ops, tb, extack);
 	if (err)
@@ -978,6 +981,7 @@ int fib_newrule(struct net *net, struct sk_buff *skb, struct nlmsghdr *nlh,
 
 	fib_rule_get(rule);
 
+	mutex_unlock(&ops->lock);
 	if (!rtnl_held)
 		rtnl_net_unlock(net);
 
@@ -988,6 +992,7 @@ int fib_newrule(struct net *net, struct sk_buff *skb, struct nlmsghdr *nlh,
 	return 0;
 
 errout_free:
+	mutex_unlock(&ops->lock);
 	if (!rtnl_held)
 		rtnl_net_unlock(net);
 	kfree(rule);
@@ -1039,6 +1044,7 @@ int fib_delrule(struct net *net, struct sk_buff *skb, struct nlmsghdr *nlh,
 
 	if (!rtnl_held)
 		rtnl_net_lock(net);
+	mutex_lock(&ops->lock);
 
 	err = fib_nl2rule_rtnl(nlrule, ops, tb, extack);
 	if (err)
@@ -1093,6 +1099,7 @@ int fib_delrule(struct net *net, struct sk_buff *skb, struct nlmsghdr *nlh,
 
 	call_fib_rule_notifiers(net, FIB_EVENT_RULE_DEL, rule, ops, NULL);
 
+	mutex_unlock(&ops->lock);
 	if (!rtnl_held)
 		rtnl_net_unlock(net);
 
@@ -1104,6 +1111,7 @@ int fib_delrule(struct net *net, struct sk_buff *skb, struct nlmsghdr *nlh,
 	return 0;
 
 errout_free:
+	mutex_unlock(&ops->lock);
 	if (!rtnl_held)
 		rtnl_net_unlock(net);
 	kfree(nlrule);
@@ -1403,20 +1411,28 @@ static int fib_rules_event(struct notifier_block *this, unsigned long event,
 
 	switch (event) {
 	case NETDEV_REGISTER:
-		list_for_each_entry(ops, &net->rules_ops, list)
+		list_for_each_entry(ops, &net->rules_ops, list) {
+			mutex_lock(&ops->lock);
 			attach_rules(&ops->rules_list, dev);
+			mutex_unlock(&ops->lock);
+		}
 		break;
 
 	case NETDEV_CHANGENAME:
 		list_for_each_entry(ops, &net->rules_ops, list) {
+			mutex_lock(&ops->lock);
 			detach_rules(&ops->rules_list, dev);
 			attach_rules(&ops->rules_list, dev);
+			mutex_unlock(&ops->lock);
 		}
 		break;
 
 	case NETDEV_UNREGISTER:
-		list_for_each_entry(ops, &net->rules_ops, list)
+		list_for_each_entry(ops, &net->rules_ops, list) {
+			mutex_lock(&ops->lock);
 			detach_rules(&ops->rules_list, dev);
+			mutex_unlock(&ops->lock);
+		}
 		break;
 	}
 
-- 
2.55.0.rc0.799.gd6f94ed593-goog


^ permalink raw reply related

* [PATCH v1 net-next 06/10] net: fib_rules: Remove unnecessary EXPORT_SYMBOL.
From: Kuniyuki Iwashima @ 2026-06-29 18:10 UTC (permalink / raw)
  To: David Ahern, Ido Schimmel, David S . Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni
  Cc: Simon Horman, Kuniyuki Iwashima, Kuniyuki Iwashima, netdev
In-Reply-To: <20260629181226.1929658-1-kuniyu@google.com>

None of the fib_rule users can be built as a module.

  $ grep -E "config (INET|IPV6|IP_MROUTE|IPV6_MROUTE)\b" -A1 \
    net/{Kconfig,{ipv4,ipv6}/Kconfig}
  net/Kconfig:config INET
  net/Kconfig-	bool "TCP/IP networking"
  --
  net/ipv4/Kconfig:config IP_MROUTE
  net/ipv4/Kconfig-	bool "IP: multicast routing"
  --
  net/ipv6/Kconfig:menuconfig IPV6
  net/ipv6/Kconfig-	bool "The IPv6 protocol"
  --
  net/ipv6/Kconfig:config IPV6_MROUTE
  net/ipv6/Kconfig-	bool "IPv6: multicast routing"

Let's remove EXPORT_SYMBOL() and EXPORT_SYMBOL_GPL() for fib_rule.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
---
 net/core/fib_rules.c | 7 -------
 1 file changed, 7 deletions(-)

diff --git a/net/core/fib_rules.c b/net/core/fib_rules.c
index 8b9dac1bd4a7..25a3fd997782 100644
--- a/net/core/fib_rules.c
+++ b/net/core/fib_rules.c
@@ -51,7 +51,6 @@ bool fib_rule_matchall(const struct fib_rule *rule)
 		return false;
 	return true;
 }
-EXPORT_SYMBOL_GPL(fib_rule_matchall);
 
 int fib_default_rule_add(struct fib_rules_ops *ops,
 			 u32 pref, u32 table)
@@ -78,7 +77,6 @@ int fib_default_rule_add(struct fib_rules_ops *ops,
 	list_add_tail(&r->list, &ops->rules_list);
 	return 0;
 }
-EXPORT_SYMBOL(fib_default_rule_add);
 
 static u32 fib_default_rule_pref(struct fib_rules_ops *ops)
 {
@@ -183,7 +181,6 @@ fib_rules_register(const struct fib_rules_ops *tmpl, struct net *net)
 
 	return ops;
 }
-EXPORT_SYMBOL_GPL(fib_rules_register);
 
 static void fib_rules_cleanup_ops(struct fib_rules_ops *ops)
 {
@@ -208,7 +205,6 @@ void fib_rules_unregister(struct fib_rules_ops *ops)
 	fib_rules_cleanup_ops(ops);
 	kfree_rcu(ops, rcu);
 }
-EXPORT_SYMBOL_GPL(fib_rules_unregister);
 
 static int uid_range_set(struct fib_kuid_range *range)
 {
@@ -364,7 +360,6 @@ int fib_rules_lookup(struct fib_rules_ops *ops, struct flowi *fl,
 
 	return err;
 }
-EXPORT_SYMBOL_GPL(fib_rules_lookup);
 
 static int call_fib_rule_notifier(struct notifier_block *nb,
 				  enum fib_event_type event_type,
@@ -425,7 +420,6 @@ int fib_rules_dump(struct net *net, struct notifier_block *nb, int family,
 
 	return err;
 }
-EXPORT_SYMBOL_GPL(fib_rules_dump);
 
 unsigned int fib_rules_seq_read(const struct net *net, int family)
 {
@@ -441,7 +435,6 @@ unsigned int fib_rules_seq_read(const struct net *net, int family)
 
 	return fib_rules_seq;
 }
-EXPORT_SYMBOL_GPL(fib_rules_seq_read);
 
 static struct fib_rule *rule_find(struct fib_rules_ops *ops,
 				  struct fib_rule_hdr *frh,
-- 
2.55.0.rc0.799.gd6f94ed593-goog


^ permalink raw reply related

* [PATCH v1 net-next 07/10] net: fib_rules: Drop RTNL assertions.
From: Kuniyuki Iwashima @ 2026-06-29 18:10 UTC (permalink / raw)
  To: David Ahern, Ido Schimmel, David S . Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni
  Cc: Simon Horman, Kuniyuki Iwashima, Kuniyuki Iwashima, netdev
In-Reply-To: <20260629181226.1929658-1-kuniyu@google.com>

Now, fib_rule structs are protected by the per-fib_rules_ops mutex.

Let's drop ASSERT_RTNL_NET() and rtnl_dereference().

Note that fib_rules_event() iterates over net->rules_ops without
net->rules_mod_lock, but this is fine because all fib_rule users
are built-in and concurrent fib_rules_unregister() does not happen.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
---
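Note: the converted call sites only compare r->ctarget against NULL or
against another pointer and never dereference the result, which is the
case rcu_access_pointer() is meant for. A rough userspace analogy using
C11 atomics follows. It is a sketch under assumptions, not kernel code:
struct rule_sketch, ctarget_value(), ctarget_publish() and resolve_goto()
are hypothetical names, the relaxed load and release store only
approximate rcu_access_pointer() and rcu_assign_pointer(), and the
action/target checks done in fib_newrule() are left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for struct fib_rule. */
struct rule_sketch {
	int pref;
	_Atomic(struct rule_sketch *) ctarget;	/* stands in for an __rcu pointer */
};

/* Value-only read: good for comparisons, the result is never dereferenced. */
struct rule_sketch *ctarget_value(struct rule_sketch *r)
{
	return atomic_load_explicit(&r->ctarget, memory_order_relaxed);
}

/* Publish a target so that readers see a fully initialised rule. */
void ctarget_publish(struct rule_sketch *r, struct rule_sketch *target)
{
	atomic_store_explicit(&r->ctarget, target, memory_order_release);
}

/*
 * Resolve a goto rule whose target is still unset. Returns 1 if the rule
 * was resolved, 0 if it already had a target.
 */
int resolve_goto(struct rule_sketch *r, struct rule_sketch *target,
		 int *unresolved)
{
	assert(r && target && unresolved);

	if (ctarget_value(r))
		return 0;

	ctarget_publish(r, target);
	(*unresolved)--;
	return 1;
}
```

In the kernel, the writer side is serialised by ops->lock, so the
value-only read cannot race with another update of the same pointer.
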
 net/core/fib_rules.c | 9 +++------
 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/net/core/fib_rules.c b/net/core/fib_rules.c
index 25a3fd997782..5eef5d6ace82 100644
--- a/net/core/fib_rules.c
+++ b/net/core/fib_rules.c
@@ -387,7 +387,6 @@ static int call_fib_rule_notifiers(struct net *net,
 		.rule = rule,
 	};
 
-	ASSERT_RTNL_NET(net);
 	lockdep_assert_held(&ops->lock);
 
 	/* Paired with READ_ONCE() in fib_rules_seq() */
@@ -955,7 +954,7 @@ int fib_newrule(struct net *net, struct sk_buff *skb, struct nlmsghdr *nlh,
 		list_for_each_entry(r, &ops->rules_list, list) {
 			if (r->action == FR_ACT_GOTO &&
 			    r->target == rule->pref &&
-			    rtnl_dereference(r->ctarget) == NULL) {
+			    !rcu_access_pointer(r->ctarget)) {
 				rcu_assign_pointer(r->ctarget, rule);
 				if (--ops->unresolved_rules == 0)
 					break;
@@ -1064,7 +1063,7 @@ int fib_delrule(struct net *net, struct sk_buff *skb, struct nlmsghdr *nlh,
 
 	if (rule->action == FR_ACT_GOTO) {
 		ops->nr_goto_rules--;
-		if (rtnl_dereference(rule->ctarget) == NULL)
+		if (!rcu_access_pointer(rule->ctarget))
 			ops->unresolved_rules--;
 	}
 
@@ -1082,7 +1081,7 @@ int fib_delrule(struct net *net, struct sk_buff *skb, struct nlmsghdr *nlh,
 		if (&n->list == &ops->rules_list || n->pref != rule->pref)
 			n = NULL;
 		list_for_each_entry(r, &ops->rules_list, list) {
-			if (rtnl_dereference(r->ctarget) != rule)
+			if (rcu_access_pointer(r->ctarget) != rule)
 				continue;
 			rcu_assign_pointer(r->ctarget, n);
 			if (!n)
@@ -1400,8 +1399,6 @@ static int fib_rules_event(struct notifier_block *this, unsigned long event,
 	struct net *net = dev_net(dev);
 	struct fib_rules_ops *ops;
 
-	ASSERT_RTNL();
-
 	switch (event) {
 	case NETDEV_REGISTER:
 		list_for_each_entry(ops, &net->rules_ops, list) {
-- 
2.55.0.rc0.799.gd6f94ed593-goog


^ permalink raw reply related
