* [PATCH net-next v9 00/10] enic: SR-IOV V2 admin channel and MBOX protocol
From: Satish Kharat @ 2026-06-18 1:53 UTC
To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni
Cc: netdev, linux-kernel, Sesidhar Baddela, Satish Kharat,
Breno Leitao
This series adds the admin channel infrastructure and mailbox (MBOX)
protocol needed for V2 SR-IOV support in the enic driver.
The V2 SR-IOV design uses a direct PF-VF communication channel built on
dedicated WQ/RQ/CQ hardware resources and an MSI-X interrupt.
Firmware capability and admin channel infrastructure (patches 1-4):
- Probe-time firmware feature check for V2 SR-IOV support
- Admin channel open/close, RQ buffer management, CQ service
with MSI-X interrupt and workqueue-based polling
MBOX protocol and VF enable (patches 5-10):
- MBOX message types, core send/receive, PF and VF handlers
- V2 SR-IOV enable wiring with admin channel setup
- V2 VF probe with admin channel and PF registration
Signed-off-by: Satish Kharat <satishkh@cisco.com>
---
Changes in v9:
- Use dma_rmb() instead of rmb() when reading admin RQ completion
descriptors written by DMA (patch 4) [Sashiko]
- Use GFP_KERNEL instead of GFP_ATOMIC for admin RQ refill and for
received-message allocation; both run in workqueue (process)
context after the v8 NAPI-to-workqueue switch (patch 4) [Sashiko]
- Correct the enic_admin_msg comment to describe the workqueue
enqueue path rather than NAPI (patch 4) [Sashiko]
- Set mbox_send_disabled in enic_admin_channel_close() so a MBOX
send cannot race with channel teardown (patch 6) [Sashiko]
- Send the actual PF carrier state to a VF on registration instead
of unconditionally reporting link up (patch 7) [Sashiko]
- Call reinit_completion() before setting mbox_expected_reply so a
reply arriving between the two is not missed (patch 8) [Sashiko]
- Defer PF->VF link state notification to a workqueue and gate it on
carrier transitions; enic_link_check() runs in the notify (atomic)
context while the MBOX send sleeps on a mutex/completion (patch 9)
[Sashiko]
- Clear ENIC_SRIOV_ENABLED and cancel the link-notify work before
freeing per-VF state in the SR-IOV disable path, closing a
use-after-free window against a concurrent link notification
(patch 9) [Sashiko]
- Link to v8: https://patch.msgid.link/20260609-enic-sriov-v2-admin-channel-v2-v8-0-8ad8babbb826@cisco.com
Changes in v8:
- Replace NAPI polling with workqueue for admin CQ service; the admin
channel is low-frequency control traffic, not data path (patch 4)
[Jakub Kicinski]
- Use explicit enum value (= 4) for VIC_FEATURE_SRIOV instead of
placeholder VIC_FEATURE_PTP entry (patch 1) [Breno Leitao]
- Remove unnecessary rmb() in WQ CQ service (patch 4) [Jakub Kicinski]
- Remove admin_msg_drop_cnt counter (patch 4) [Simon Horman]
- Drop NAPI reschedule on RQ refill failure; the NAPI-to-workqueue
switch removes the livelock and budget issues (patch 4) [Simon Horman]
- Remove unnecessary READ_ONCE/WRITE_ONCE on admin_rq_handler — all
access is serialized by probe/remove (patch 6) [Jakub Kicinski]
- Fix checkpatch line-length warnings (patches 3, 5, 6)
- Rate-limit link state send failure and ACK error warnings (patch 7)
[Jakub Kicinski]
- Correct enic_link_check comment to describe actual PF link state
notification flow (patch 7) [Simon Horman]
- Correct mbox_expected_reply comment — serialization is by
RTNL/probe, not mbox_lock (patch 8) [Jakub Kicinski]
- Wire enic_mbox_send_link_state() from enic_link_check() so PF
notifies VFs on carrier change (patch 9) [Simon Horman]
- Fix commit message wording about MSI-X reservation (patch 10)
[Simon Horman]
- Link to v7: https://patch.msgid.link/20260513-enic-sriov-v2-admin-channel-v2-v7-0-68b9f4141f4c@cisco.com
Changes in v7:
- Replace magic numbers in admin channel init with named macros
and inline comments for MBOX descriptor encoding
(patches 2, 6) [Paolo Abeni]
- Add defense-in-depth bounds check on admin RQ bytes_written (patch 4)
- Force NAPI reschedule on admin RQ refill failure (patch 4)
- Always unmask admin interrupt even with zero credits (patch 4)
- Reorder NAPI init before request_irq in admin channel open (patch 4)
- Remove redundant netdev_warn on admin msg enqueue kmalloc failure
(patch 4) [Paolo Abeni]
- Add netdev_warn on admin WQ/RQ disable failure in close path
(patch 2)
- Remove incorrect RES_TYPE_SRIOV_INTR interrupt allocation from
admin channel open (patch 2); interrupt setup handled entirely
in patch 4 using RES_TYPE_INTR_CTRL
- Rate-limit VF register/unregister log messages (patch 7) [Paolo Abeni]
- Add __aligned(8) to admin message data[] for strict-alignment
safety (patch 4)
- Rate-limit MBOX handler error warnings (patch 7)
- Pre-allocate port profile array before pci_disable_sriov in V1
disable path to avoid half-torn-down state on alloc failure (patch 9)
- Account for admin channel interrupt reservation in
enic_set_intr_mode() and enic_adjust_resources() (patch 9) [Paolo Abeni]
- Clear admin_rq_handler in enic_admin_channel_close (patch 9)
- Quiesce admin channel (mask interrupt, disable NAPI, block MBOX
sends) around soft reset (patch 9)
- Use WRITE_ONCE/READ_ONCE for mbox_send_disabled and
admin_rq_handler across data-path/reset boundaries
(patches 4, 6, 9)
- Fix commit message: reference enic_adjust_resources() alongside
enic_set_intr_mode() (patch 10)
Investigated findings from automated review (Simon Horman / Sashiko):
- Race between probe-time feature check and VF proxy: false positive;
detection runs at probe, enable runs from sriov_configure
- Struct alignment of __le32 after 2-byte mbox_hdr_embed: compiler
inserts correct padding, no manual alignment needed
- Stale MBOX reply matching / reinit_completion race: single-flight
design with mutex serialization prevents this
- cancel_work_sync vs MBOX unregister race: work cannot be
re-triggered during the close window
- Link to v6: https://patch.msgid.link/20260503-enic-sriov-v2-admin-channel-v2-v6-0-0af4fbc2d86d@cisco.com
Changes in v6:
- Add explanatory comments documenting admin_cq[0] (WQ CQE size) and
admin_cq[1] (RQ CQE size matching firmware enic_ext_cq() programming)
allocations (patch 2)
- Enforce bytes_written from CQ descriptor when enqueuing admin RQ
message; previously buf->len (allocation size) was passed, exposing
uninitialized buffer memory beyond the real payload (patch 4)
- Drop admin RQ messages with TRUNCATED set or FCS_OK clear, gated by
netdev_warn_once() (patch 4)
- Disable interrupt_enable on admin_cq[0]: WQ completions are polled
synchronously inside enic_mbox_send_msg() and never raise an
interrupt; matches admin_cq[1] (RQ) which does NAPI polling (patch 4)
- Add mbox_expected_reply gating in VF reply handlers (capability,
register, unregister): drop replies whose type does not match the
current waiter's expected type, avoiding spurious wakeup of an
unrelated waiter from a stale reply that arrives after timeout
(patch 8)
- Distinguish error returns in enic_mbox_vf_unregister(): -ETIMEDOUT
(no reply received), -EACCES (PF rejected the unregister), 0 on
success. Previously all paths collapsed to a single -ETIMEDOUT
(patch 8)
- Reserve one extra MSI-X slot in enic_set_intr_mode() when
has_admin_channel is set so enic_admin_setup_intr() always has room
to allocate at intr_count without exceeding intr_avail bounds when
data queue count is maxed out (patch 10)
- Clarify in commit messages that .sriov_configure is intentionally
not yet wired in this series and will be added in a follow-up after
the necessary devcmd hardening lands (patch 9)
- Link to v5: https://patch.msgid.link/20260423-enic-sriov-v2-admin-channel-v2-v5-0-caa9f504a3dc@cisco.com
Changes in v5:
- Fix DMA-into-freed-memory race: call enic_admin_qp_type_set() before
disabling RQ/WQ in both error and close paths (patch 3)
- Fix DMA mapping leak: enic_admin_wq_buf_clean() now unmaps and frees
WQ buffers still held at close time after a send timeout (patch 3)
- Log rate-limited warning on admin RQ refill failure (patch 4)
- Add missing linux/types.h and linux/bits.h includes to enic_mbox.h
(patch 5)
- Guard mbox_lock/mbox_comp init with mbox_initialized flag to prevent
re-initialization on sriov_configure re-entry (patch 7)
- Clear VF registered state before sending unregister reply so PF does
not treat a dead VF as still registered (patch 8)
- Gate VF-facing log messages with net_ratelimit() to prevent malicious
VF from flooding PF dmesg (patch 8)
- Reject VF port profile requests when V2 SR-IOV is active since
enic->pp is not reallocated for V2 VFs (patch 9)
- Move enic_sriov_detect_vf_type() before auto-enable check; skip
probe-time auto-enable for V2 VFs (patch 9)
- Move admin channel close and VF unregister before unregister_netdev()
in enic_remove() to prevent use-after-free on netdev (patch 10)
- Add comment in enic_reset() documenting that admin channel is not
recovered after soft reset (patch 10)
- Bypass RES_TYPE_SRIOV_INTR check for V2 VFs in admin channel
capability detection (patch 10)
- Link to v4: https://patch.msgid.link/20260411-enic-sriov-v2-admin-channel-v2-v4-0-f052326c2a57@cisco.com
Changes in v4:
- Fix reverse xmas tree variable ordering (patches 1, 6)
- Use kzalloc_obj instead of kzalloc with sizeof (patch 9)
- Add NULL check for pp allocation in V1 SR-IOV disable path (patch 9)
- Link to v3: https://lore.kernel.org/r/20260408-enic-sriov-v2-admin-channel-v2-v3-0-1d4999a03cec@cisco.com
Changes in v3:
- Use early-return pattern in enic_sriov_detect_vf_type to reduce
nesting (patch 1) [Breno Leitao]
- Link to v2: https://lore.kernel.org/r/20260408-enic-sriov-v2-admin-channel-v2-v2-0-d05dd3623fd3@cisco.com
Changes in v2:
- Fix lines exceeding 80 columns (patches 4, 6, 7, 8)
- Add __maybe_unused to enic_sriov_configure and enic_sriov_v2_enable;
.sriov_configure wiring deferred to a later series after devcmd
hardening is in place (patch 9)
- Guard probe-time auto-enable to skip V2 VFs (patch 9)
- Link to v1: https://lore.kernel.org/r/20260406-enic-sriov-v2-admin-channel-v2-v1-0-82cc47636a78@cisco.com
---
Satish Kharat (10):
enic: verify firmware supports V2 SR-IOV at probe time
enic: add admin channel open and close for SR-IOV
enic: add admin RQ buffer management
enic: add admin CQ service with MSI-X interrupt and workqueue polling
enic: define MBOX message types and header structures
enic: add MBOX core send and receive for admin channel
enic: add MBOX PF handlers for VF register and capability
enic: add MBOX VF handlers for capability, register and link state
enic: wire V2 SR-IOV enable with admin channel and MBOX
enic: add V2 VF probe with admin channel and PF registration
drivers/net/ethernet/cisco/enic/Makefile | 3 +-
drivers/net/ethernet/cisco/enic/enic.h | 34 +-
drivers/net/ethernet/cisco/enic/enic_admin.c | 586 ++++++++++++++++++++++++
drivers/net/ethernet/cisco/enic/enic_admin.h | 27 ++
drivers/net/ethernet/cisco/enic/enic_main.c | 349 +++++++++++++-
drivers/net/ethernet/cisco/enic/enic_mbox.c | 630 ++++++++++++++++++++++++++
drivers/net/ethernet/cisco/enic/enic_mbox.h | 95 ++++
drivers/net/ethernet/cisco/enic/enic_pp.c | 5 +
drivers/net/ethernet/cisco/enic/enic_res.c | 4 +-
drivers/net/ethernet/cisco/enic/vnic_cq.h | 9 +
drivers/net/ethernet/cisco/enic/vnic_devcmd.h | 13 +
drivers/net/ethernet/cisco/enic/vnic_enet.h | 4 +-
12 files changed, 1739 insertions(+), 20 deletions(-)
---
base-commit: 2319688890d97c63da423a3c57c23b4ab5952dfc
change-id: 20260404-enic-sriov-v2-admin-channel-v2-c0aa3e988833
Best regards,
--
Satish Kharat <satishkh@cisco.com>
* [PATCH net-next v9 03/10] enic: add admin RQ buffer management
From: Satish Kharat @ 2026-06-18 1:53 UTC
To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni
Cc: netdev, linux-kernel, Sesidhar Baddela, Satish Kharat
In-Reply-To: <20260617-enic-sriov-v2-admin-channel-v2-v9-0-37f5f5af4c93@cisco.com>
The admin receive queue needs pre-posted DMA buffers for incoming
mailbox messages from VFs. Each buffer is a kmalloc'd region mapped
for DMA (2048 bytes, sufficient for any MBOX message).
Add enic_admin_rq_fill(gfp) to post buffers at open time, and
enic_admin_rq_drain() to unmap and free them at close time.
Wire both into the admin channel open/close paths. The gfp_t
parameter lets the caller pass the allocation context; both current
callers -- channel open and the CQ-poll work handler that refills
after draining (added in the next patch) -- run in process context
and use GFP_KERNEL.
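For readers following the fill/drain contract outside the kernel tree,
here is a minimal user-space sketch of the same loop shape. It is an
illustration only and not part of the patch: struct model_rq, the
model_* helpers, MODEL_DESC_COUNT and the fail_after test hook are all
hypothetical names, malloc()/free() stand in for kmalloc() plus DMA
mapping, and the ring bookkeeping is simplified compared to vnic_rq.

```c
#include <stddef.h>
#include <stdlib.h>
#include <errno.h>

#define MODEL_DESC_COUNT 8    /* hypothetical; the patch uses ENIC_ADMIN_DESC_COUNT */
#define MODEL_BUF_SIZE   2048 /* mirrors ENIC_ADMIN_BUF_SIZE */

/* Hypothetical stand-in for the admin RQ: one buffer slot per descriptor. */
struct model_rq {
	void *bufs[MODEL_DESC_COUNT]; /* NULL means the slot holds no buffer */
	size_t posted;                /* slots currently holding a buffer */
	size_t fail_after;            /* test hook: posting fails once posted reaches this */
};

static size_t model_rq_desc_avail(const struct model_rq *rq)
{
	return MODEL_DESC_COUNT - rq->posted;
}

/* Allocate one buffer and post it; on failure nothing is left half-posted. */
static int model_rq_post_one(struct model_rq *rq)
{
	void *buf;

	if (rq->posted >= rq->fail_after)
		return -ENOMEM; /* simulated allocation failure */

	buf = malloc(MODEL_BUF_SIZE);
	if (!buf)
		return -ENOMEM;

	rq->bufs[rq->posted++] = buf;
	return 0;
}

/* Post buffers until the ring is full, stopping at the first error. */
static int model_rq_fill(struct model_rq *rq)
{
	while (model_rq_desc_avail(rq) > 0) {
		int err = model_rq_post_one(rq);

		if (err)
			return err;
	}
	return 0;
}

/* Release every posted buffer; safe after a partial fill. */
static void model_rq_drain(struct model_rq *rq)
{
	for (size_t i = 0; i < MODEL_DESC_COUNT; i++) {
		free(rq->bufs[i]);
		rq->bufs[i] = NULL;
	}
	rq->posted = 0;
}
```

The point of the sketch is that a failed fill can leave some buffers
posted, so the caller is expected to drain on the error path, which is
what the disable_queues label in the open path does in this patch.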
Signed-off-by: Satish Kharat <satishkh@cisco.com>
---
drivers/net/ethernet/cisco/enic/enic_admin.c | 66 +++++++++++++++++++++++++++-
1 file changed, 64 insertions(+), 2 deletions(-)
diff --git a/drivers/net/ethernet/cisco/enic/enic_admin.c b/drivers/net/ethernet/cisco/enic/enic_admin.c
index aa21868a9209..b28fc6c656cc 100644
--- a/drivers/net/ethernet/cisco/enic/enic_admin.c
+++ b/drivers/net/ethernet/cisco/enic/enic_admin.c
@@ -3,6 +3,7 @@
#include <linux/kernel.h>
#include <linux/netdevice.h>
+#include <linux/dma-mapping.h>
#include "vnic_dev.h"
#include "vnic_wq.h"
@@ -34,10 +35,63 @@ static void enic_admin_wq_buf_clean(struct vnic_wq *wq,
}
}
-/* No-op: admin RQ buffer teardown is handled in enic_admin_channel_close */
static void enic_admin_rq_buf_clean(struct vnic_rq *rq,
struct vnic_rq_buf *buf)
{
+ struct enic *enic = vnic_dev_priv(rq->vdev);
+
+ if (!buf->os_buf)
+ return;
+
+ dma_unmap_single(&enic->pdev->dev, buf->dma_addr, buf->len,
+ DMA_FROM_DEVICE);
+ kfree(buf->os_buf);
+ buf->os_buf = NULL;
+}
+
+static int enic_admin_rq_post_one(struct enic *enic, gfp_t gfp)
+{
+ struct vnic_rq *rq = &enic->admin_rq;
+ struct rq_enet_desc *desc;
+ dma_addr_t dma_addr;
+ void *buf;
+
+ buf = kmalloc(ENIC_ADMIN_BUF_SIZE, gfp);
+ if (!buf)
+ return -ENOMEM;
+
+ dma_addr = dma_map_single(&enic->pdev->dev, buf, ENIC_ADMIN_BUF_SIZE,
+ DMA_FROM_DEVICE);
+ if (dma_mapping_error(&enic->pdev->dev, dma_addr)) {
+ kfree(buf);
+ return -ENOMEM;
+ }
+
+ desc = vnic_rq_next_desc(rq);
+ rq_enet_desc_enc(desc, (u64)dma_addr | VNIC_PADDR_TARGET,
+ RQ_ENET_TYPE_ONLY_SOP, ENIC_ADMIN_BUF_SIZE);
+ vnic_rq_post(rq, buf, 0, dma_addr, ENIC_ADMIN_BUF_SIZE, 0);
+
+ return 0;
+}
+
+static int enic_admin_rq_fill(struct enic *enic, gfp_t gfp)
+{
+ struct vnic_rq *rq = &enic->admin_rq;
+ int err;
+
+ while (vnic_rq_desc_avail(rq) > 0) {
+ err = enic_admin_rq_post_one(enic, gfp);
+ if (err)
+ return err;
+ }
+
+ return 0;
+}
+
+static void enic_admin_rq_drain(struct enic *enic)
+{
+ vnic_rq_clean(&enic->admin_rq, enic_admin_rq_buf_clean);
}
static int enic_admin_qp_type_set(struct enic *enic, u32 enable)
@@ -171,6 +225,13 @@ int enic_admin_channel_open(struct enic *enic)
vnic_wq_enable(&enic->admin_wq);
vnic_rq_enable(&enic->admin_rq);
+ err = enic_admin_rq_fill(enic, GFP_KERNEL);
+ if (err) {
+ netdev_err(enic->netdev,
+ "Failed to fill admin RQ buffers: %d\n", err);
+ goto disable_queues;
+ }
+
err = enic_admin_qp_type_set(enic, QP_ENABLE);
if (err) {
netdev_err(enic->netdev,
@@ -186,6 +247,7 @@ int enic_admin_channel_open(struct enic *enic)
netdev_warn(enic->netdev, "Failed to disable admin WQ\n");
if (vnic_rq_disable(&enic->admin_rq))
netdev_warn(enic->netdev, "Failed to disable admin RQ\n");
+ enic_admin_rq_drain(enic);
enic_admin_free_resources(enic);
return err;
}
@@ -209,7 +271,7 @@ void enic_admin_channel_close(struct enic *enic)
"Failed to disable admin RQ: %d\n", err);
vnic_wq_clean(&enic->admin_wq, enic_admin_wq_buf_clean);
- vnic_rq_clean(&enic->admin_rq, enic_admin_rq_buf_clean);
+ enic_admin_rq_drain(enic);
vnic_cq_clean(&enic->admin_cq[0]);
vnic_cq_clean(&enic->admin_cq[1]);
enic_admin_free_resources(enic);
--
2.43.0
* [PATCH net-next v9 02/10] enic: add admin channel open and close for SR-IOV
From: Satish Kharat @ 2026-06-18 1:53 UTC
To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni
Cc: netdev, linux-kernel, Sesidhar Baddela, Satish Kharat
In-Reply-To: <20260617-enic-sriov-v2-admin-channel-v2-v9-0-37f5f5af4c93@cisco.com>
The V2 SR-IOV design uses a dedicated admin channel (WQ/RQ/CQ/INTR
on separate BAR resources) for PF-VF mailbox communication rather
than firmware-proxied devcmds.
Introduce enic_admin_channel_open() and enic_admin_channel_close().
Open allocates and initialises the admin WQ, RQ, and two CQs (one per
direction), then issues CMD_QP_TYPE_SET to tell firmware the queues are
admin-type. Close reverses the sequence.
enic_admin_wq_buf_clean() unmaps and frees any WQ buffers still held
at close time, fixing a DMA mapping leak when a send times out.
Add CMD_QP_TYPE_SET (97), QP_TYPE_ADMIN/DATA, and QP_ENABLE/QP_DISABLE
defines to vnic_devcmd.h. Add VNIC_CQ_* named constants to vnic_cq.h
so CQ initialisation parameters are self-documenting from their first
introduction.
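The allocate-in-order, unwind-in-reverse structure described above can
be sketched in user space as follows. This is a simplified model under
stated assumptions, not the driver code: struct model_chan, the model_*
functions and the fail_at/fail_type_set test hooks are hypothetical,
boolean flags stand in for the real WQ/RQ/CQ allocations, and the
-EIO return for a failed type-set is chosen only for illustration.

```c
#include <stdbool.h>
#include <errno.h>

/* Hypothetical resource indices, in the order the channel allocates them. */
enum model_res { MODEL_WQ, MODEL_RQ, MODEL_CQ0, MODEL_CQ1, MODEL_RES_MAX };

struct model_chan {
	bool allocated[MODEL_RES_MAX];
	bool admin_type_set; /* stands in for the CMD_QP_TYPE_SET state */
	int fail_at;         /* test hook: index whose allocation fails, or -1 */
	bool fail_type_set;  /* test hook: make the type-set step fail */
};

static int model_alloc(struct model_chan *c, enum model_res r)
{
	if (c->fail_at == (int)r)
		return -ENOMEM;
	c->allocated[r] = true;
	return 0;
}

static void model_free(struct model_chan *c, enum model_res r)
{
	c->allocated[r] = false;
}

/* Each failure jumps to the label that frees only what was already set up. */
static int model_alloc_resources(struct model_chan *c)
{
	int err;

	err = model_alloc(c, MODEL_WQ);
	if (err)
		return err;
	err = model_alloc(c, MODEL_RQ);
	if (err)
		goto free_wq;
	err = model_alloc(c, MODEL_CQ0);
	if (err)
		goto free_rq;
	err = model_alloc(c, MODEL_CQ1);
	if (err)
		goto free_cq0;
	return 0;

free_cq0:
	model_free(c, MODEL_CQ0);
free_rq:
	model_free(c, MODEL_RQ);
free_wq:
	model_free(c, MODEL_WQ);
	return err;
}

static void model_free_resources(struct model_chan *c)
{
	model_free(c, MODEL_CQ1);
	model_free(c, MODEL_CQ0);
	model_free(c, MODEL_RQ);
	model_free(c, MODEL_WQ);
}

static int model_channel_open(struct model_chan *c)
{
	int err = model_alloc_resources(c);

	if (err)
		return err;
	if (c->fail_type_set) {
		model_free_resources(c); /* unwind everything on late failure */
		return -EIO;             /* illustrative error code */
	}
	c->admin_type_set = true;
	return 0;
}

static void model_channel_close(struct model_chan *c)
{
	c->admin_type_set = false; /* clear the queue type first */
	model_free_resources(c);
}

static int model_live_count(const struct model_chan *c)
{
	int n = 0;

	for (int i = 0; i < MODEL_RES_MAX; i++)
		n += c->allocated[i] ? 1 : 0;
	return n;
}
```

Whichever step fails, the model ends with no live resources, which is
the property the goto ladder in enic_admin_alloc_resources() and the
disable_queues path in enic_admin_channel_open() are meant to provide.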
Signed-off-by: Satish Kharat <satishkh@cisco.com>
---
drivers/net/ethernet/cisco/enic/Makefile | 3 +-
drivers/net/ethernet/cisco/enic/enic_admin.c | 216 ++++++++++++++++++++++++++
drivers/net/ethernet/cisco/enic/enic_admin.h | 15 ++
drivers/net/ethernet/cisco/enic/vnic_cq.h | 9 ++
drivers/net/ethernet/cisco/enic/vnic_devcmd.h | 11 ++
5 files changed, 253 insertions(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/cisco/enic/Makefile b/drivers/net/ethernet/cisco/enic/Makefile
index a96b8332e6e2..7ae72fefc99a 100644
--- a/drivers/net/ethernet/cisco/enic/Makefile
+++ b/drivers/net/ethernet/cisco/enic/Makefile
@@ -3,5 +3,6 @@ obj-$(CONFIG_ENIC) := enic.o
enic-y := enic_main.o vnic_cq.o vnic_intr.o vnic_wq.o \
enic_res.o enic_dev.o enic_pp.o vnic_dev.o vnic_rq.o vnic_vic.o \
- enic_ethtool.o enic_api.o enic_clsf.o enic_rq.o enic_wq.o
+ enic_ethtool.o enic_api.o enic_clsf.o enic_rq.o enic_wq.o \
+ enic_admin.o
diff --git a/drivers/net/ethernet/cisco/enic/enic_admin.c b/drivers/net/ethernet/cisco/enic/enic_admin.c
new file mode 100644
index 000000000000..aa21868a9209
--- /dev/null
+++ b/drivers/net/ethernet/cisco/enic/enic_admin.c
@@ -0,0 +1,216 @@
+// SPDX-License-Identifier: GPL-2.0-only
+// Copyright 2025 Cisco Systems, Inc. All rights reserved.
+
+#include <linux/kernel.h>
+#include <linux/netdevice.h>
+
+#include "vnic_dev.h"
+#include "vnic_wq.h"
+#include "vnic_rq.h"
+#include "vnic_cq.h"
+#include "vnic_intr.h"
+#include "vnic_resource.h"
+#include "vnic_devcmd.h"
+#include "enic.h"
+#include "enic_admin.h"
+#include "cq_desc.h"
+#include "wq_enet_desc.h"
+#include "rq_enet_desc.h"
+
+/* Clean up any admin WQ buffers still held by hardware at close time.
+ * Normally buffers are freed inline after send completion, but a timed-out
+ * send intentionally leaves the buffer live until the queue is stopped.
+ */
+static void enic_admin_wq_buf_clean(struct vnic_wq *wq,
+ struct vnic_wq_buf *buf)
+{
+ struct enic *enic = vnic_dev_priv(wq->vdev);
+
+ if (buf->os_buf) {
+ dma_unmap_single(&enic->pdev->dev, buf->dma_addr,
+ buf->len, DMA_TO_DEVICE);
+ kfree(buf->os_buf);
+ buf->os_buf = NULL;
+ }
+}
+
+/* No-op: admin RQ buffer teardown is handled in enic_admin_channel_close */
+static void enic_admin_rq_buf_clean(struct vnic_rq *rq,
+ struct vnic_rq_buf *buf)
+{
+}
+
+static int enic_admin_qp_type_set(struct enic *enic, u32 enable)
+{
+ u64 a0 = QP_TYPE_ADMIN, a1 = enable;
+ int wait = 1000;
+ int err;
+
+ spin_lock_bh(&enic->devcmd_lock);
+ err = vnic_dev_cmd(enic->vdev, CMD_QP_TYPE_SET, &a0, &a1, wait);
+ spin_unlock_bh(&enic->devcmd_lock);
+
+ return err;
+}
+
+static int enic_admin_alloc_resources(struct enic *enic)
+{
+ int err;
+
+ err = vnic_wq_alloc_with_type(enic->vdev, &enic->admin_wq, 0,
+ ENIC_ADMIN_DESC_COUNT,
+ sizeof(struct wq_enet_desc),
+ RES_TYPE_ADMIN_WQ);
+ if (err)
+ return err;
+
+ err = vnic_rq_alloc_with_type(enic->vdev, &enic->admin_rq, 0,
+ ENIC_ADMIN_DESC_COUNT,
+ sizeof(struct rq_enet_desc),
+ RES_TYPE_ADMIN_RQ);
+ if (err)
+ goto free_wq;
+
+ /* admin_cq[0] is the WQ completion queue. WQ CQEs are always
+ * 16 bytes wide; firmware always writes 16-byte CQEs for WQ
+ * completions on every WQ, including the admin channel WQ.
+ * Use sizeof(struct cq_desc) accordingly.
+ */
+ err = vnic_cq_alloc_with_type(enic->vdev, &enic->admin_cq[0], 0,
+ ENIC_ADMIN_DESC_COUNT,
+ sizeof(struct cq_desc),
+ RES_TYPE_ADMIN_CQ);
+ if (err)
+ goto free_rq;
+
+ /* admin_cq[1] is the RQ completion queue. Its descriptor size
+ * must match what firmware writes. enic_ext_cq() called earlier
+ * in probe issues CMD_CQ_ENTRY_SIZE_SET for VNIC_RQ_ALL,
+ * programming firmware to write CQ entries of (16 << enic->ext_cq)
+ * bytes for every RQ CQ on the vNIC, including the admin RQ CQ.
+ * Allocating with the same size keeps the host poller and
+ * firmware in lockstep:
+ *
+ * - The color/valid bit lives at byte (desc_size - 1) of every
+ * cq_enet_rq_desc[_32|_64] variant, so enic_admin_cq_color()
+ * reads it from the correct offset.
+ * - Only the first 15 bytes of the descriptor (vlan,
+ * bytes_written_flags, ...) are accessed by the admin path;
+ * these fields are identical across all three variants (see
+ * comment in enic_rq.c above cq_enet_rq_desc_dec()).
+ */
+ err = vnic_cq_alloc_with_type(enic->vdev, &enic->admin_cq[1], 1,
+ ENIC_ADMIN_DESC_COUNT,
+ 16 << enic->ext_cq,
+ RES_TYPE_ADMIN_CQ);
+ if (err)
+ goto free_cq0;
+
+ return 0;
+
+free_cq0:
+ vnic_cq_free(&enic->admin_cq[0]);
+free_rq:
+ vnic_rq_free(&enic->admin_rq);
+free_wq:
+ vnic_wq_free(&enic->admin_wq);
+ return err;
+}
+
+static void enic_admin_free_resources(struct enic *enic)
+{
+ vnic_cq_free(&enic->admin_cq[1]);
+ vnic_cq_free(&enic->admin_cq[0]);
+ vnic_rq_free(&enic->admin_rq);
+ vnic_wq_free(&enic->admin_wq);
+}
+
+static void enic_admin_init_resources(struct enic *enic)
+{
+ vnic_wq_init(&enic->admin_wq,
+ 0, 0, 0); /* cq_index, err_intr_enable, err_intr_offset */
+ vnic_rq_init(&enic->admin_rq,
+ 1, 0, 0); /* cq_index, err_intr_enable, err_intr_offset */
+ vnic_cq_init(&enic->admin_cq[0],
+ VNIC_CQ_FC_DISABLE,
+ VNIC_CQ_COLOR_ENABLE,
+ 0, 0, 1, /* cq_head, cq_tail, cq_tail_color */
+ VNIC_CQ_INTR_DISABLE,
+ VNIC_CQ_ENTRY_ENABLE,
+ VNIC_CQ_MSG_DISABLE,
+ 0, /* interrupt_offset */
+ 0 /* cq_message_addr */);
+ vnic_cq_init(&enic->admin_cq[1],
+ VNIC_CQ_FC_DISABLE,
+ VNIC_CQ_COLOR_ENABLE,
+ 0, 0, 1, /* cq_head, cq_tail, cq_tail_color */
+ VNIC_CQ_INTR_DISABLE,
+ VNIC_CQ_ENTRY_ENABLE,
+ VNIC_CQ_MSG_DISABLE,
+ 0, /* interrupt_offset */
+ 0 /* cq_message_addr */);
+}
+
+int enic_admin_channel_open(struct enic *enic)
+{
+ int err;
+
+ if (!enic->has_admin_channel)
+ return -ENODEV;
+
+ err = enic_admin_alloc_resources(enic);
+ if (err) {
+ netdev_err(enic->netdev,
+ "Failed to alloc admin channel resources: %d\n",
+ err);
+ return err;
+ }
+
+ enic_admin_init_resources(enic);
+
+ vnic_wq_enable(&enic->admin_wq);
+ vnic_rq_enable(&enic->admin_rq);
+
+ err = enic_admin_qp_type_set(enic, QP_ENABLE);
+ if (err) {
+ netdev_err(enic->netdev,
+ "Failed to set admin QP type: %d\n", err);
+ goto disable_queues;
+ }
+
+ return 0;
+
+disable_queues:
+ enic_admin_qp_type_set(enic, QP_DISABLE);
+ if (vnic_wq_disable(&enic->admin_wq))
+ netdev_warn(enic->netdev, "Failed to disable admin WQ\n");
+ if (vnic_rq_disable(&enic->admin_rq))
+ netdev_warn(enic->netdev, "Failed to disable admin RQ\n");
+ enic_admin_free_resources(enic);
+ return err;
+}
+
+void enic_admin_channel_close(struct enic *enic)
+{
+ int err;
+
+ if (!enic->has_admin_channel)
+ return;
+
+ enic_admin_qp_type_set(enic, QP_DISABLE);
+
+ err = vnic_wq_disable(&enic->admin_wq);
+ if (err)
+ netdev_warn(enic->netdev,
+ "Failed to disable admin WQ: %d\n", err);
+ err = vnic_rq_disable(&enic->admin_rq);
+ if (err)
+ netdev_warn(enic->netdev,
+ "Failed to disable admin RQ: %d\n", err);
+
+ vnic_wq_clean(&enic->admin_wq, enic_admin_wq_buf_clean);
+ vnic_rq_clean(&enic->admin_rq, enic_admin_rq_buf_clean);
+ vnic_cq_clean(&enic->admin_cq[0]);
+ vnic_cq_clean(&enic->admin_cq[1]);
+ enic_admin_free_resources(enic);
+}
diff --git a/drivers/net/ethernet/cisco/enic/enic_admin.h b/drivers/net/ethernet/cisco/enic/enic_admin.h
new file mode 100644
index 000000000000..569aadeb9312
--- /dev/null
+++ b/drivers/net/ethernet/cisco/enic/enic_admin.h
@@ -0,0 +1,15 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/* Copyright 2025 Cisco Systems, Inc. All rights reserved. */
+
+#ifndef _ENIC_ADMIN_H_
+#define _ENIC_ADMIN_H_
+
+#define ENIC_ADMIN_DESC_COUNT 64
+#define ENIC_ADMIN_BUF_SIZE 2048
+
+struct enic;
+
+int enic_admin_channel_open(struct enic *enic);
+void enic_admin_channel_close(struct enic *enic);
+
+#endif /* _ENIC_ADMIN_H_ */
diff --git a/drivers/net/ethernet/cisco/enic/vnic_cq.h b/drivers/net/ethernet/cisco/enic/vnic_cq.h
index d46d4d2ef6bb..35ffa3230713 100644
--- a/drivers/net/ethernet/cisco/enic/vnic_cq.h
+++ b/drivers/net/ethernet/cisco/enic/vnic_cq.h
@@ -76,6 +76,15 @@ int vnic_cq_alloc(struct vnic_dev *vdev, struct vnic_cq *cq, unsigned int index,
int vnic_cq_alloc_with_type(struct vnic_dev *vdev, struct vnic_cq *cq,
unsigned int index, unsigned int desc_count,
unsigned int desc_size, unsigned int res_type);
+#define VNIC_CQ_FC_ENABLE 1
+#define VNIC_CQ_FC_DISABLE 0
+#define VNIC_CQ_COLOR_ENABLE 1
+#define VNIC_CQ_INTR_ENABLE 1
+#define VNIC_CQ_INTR_DISABLE 0
+#define VNIC_CQ_ENTRY_ENABLE 1
+#define VNIC_CQ_MSG_ENABLE 1
+#define VNIC_CQ_MSG_DISABLE 0
+
void vnic_cq_init(struct vnic_cq *cq, unsigned int flow_control_enable,
unsigned int color_enable, unsigned int cq_head, unsigned int cq_tail,
unsigned int cq_tail_color, unsigned int interrupt_enable,
diff --git a/drivers/net/ethernet/cisco/enic/vnic_devcmd.h b/drivers/net/ethernet/cisco/enic/vnic_devcmd.h
index 3b6efa743dba..90ca06691ebd 100644
--- a/drivers/net/ethernet/cisco/enic/vnic_devcmd.h
+++ b/drivers/net/ethernet/cisco/enic/vnic_devcmd.h
@@ -455,8 +455,19 @@ enum vnic_devcmd_cmd {
*/
CMD_CQ_ENTRY_SIZE_SET = _CMDC(_CMD_DIR_WRITE, _CMD_VTYPE_ENET, 90),
+ /*
+ * Set queue pair type (admin or data)
+ * in: (u32) a0 = queue pair type (0 = admin, 1 = data)
+ * in: (u32) a1 = enable (1) / disable (0)
+ */
+ CMD_QP_TYPE_SET = _CMDC(_CMD_DIR_WRITE, _CMD_VTYPE_ENET, 97),
};
+#define QP_TYPE_ADMIN 0
+#define QP_TYPE_DATA 1
+#define QP_ENABLE 1
+#define QP_DISABLE 0
+
/* CMD_ENABLE2 flags */
#define CMD_ENABLE2_STANDBY 0x0
#define CMD_ENABLE2_ACTIVE 0x1
--
2.43.0
* [PATCH net-next v9 08/10] enic: add MBOX VF handlers for capability, register and link state
From: Satish Kharat @ 2026-06-18 1:53 UTC
To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni
Cc: netdev, linux-kernel, Sesidhar Baddela, Satish Kharat
In-Reply-To: <20260617-enic-sriov-v2-admin-channel-v2-v9-0-37f5f5af4c93@cisco.com>
Implement VF-side mailbox message processing for SR-IOV V2
admin channel communication.
VF receive handlers:
- VF_CAPABILITY_REPLY: store PF protocol version, signal
completion
- VF_REGISTER_REPLY: mark VF as registered, signal completion
- VF_UNREGISTER_REPLY: mark VF as unregistered, signal
completion
- PF_LINK_STATE_NOTIF: update carrier state via
netif_carrier_on/off, send ACK back to PF
VF initiation functions for the probe-time handshake:
- enic_mbox_vf_capability_check: send capability request,
wait for PF reply via completion
- enic_mbox_vf_register: send register request, wait for
PF confirmation via completion
- enic_mbox_vf_unregister: send unregister request, wait
for PF confirmation
The wait helper (enic_mbox_wait_reply) blocks in
wait_for_completion_timeout() and returns -ETIMEDOUT if no reply
arrives in time. The completion is signaled once the admin ISR and
CQ-poll/dispatch workqueue pipeline delivers the reply message.
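The single-flight request/reply gating can be modeled in a few lines
of single-threaded C. This is a sketch for illustration, not the driver
code: struct model_mbox and the model_* functions are hypothetical, a
plain bool stands in for struct completion, the reply-type constants
are placeholders for the ENIC_MBOX_*_REPLY values, and clearing the
expected type at the end of a request is an assumption of the model.

```c
#include <stdbool.h>
#include <errno.h>

/* Placeholder reply types; the real values live in enic_mbox.h. */
enum {
	MODEL_REPLY_NONE = 0,
	MODEL_REPLY_CAPABILITY,
	MODEL_REPLY_REGISTER,
	MODEL_REPLY_UNREGISTER,
};

struct model_mbox {
	bool done;              /* stands in for mbox_comp */
	unsigned char expected; /* stands in for mbox_expected_reply */
	unsigned int dropped;   /* replies discarded as stale or unexpected */
};

/* Requester side: re-arm the completion, then publish the expected type. */
static void model_begin_request(struct model_mbox *m, unsigned char reply_type)
{
	m->done = false;          /* plays the role of reinit_completion() */
	m->expected = reply_type; /* only now can a matching reply complete us */
}

/* Receive side: a reply wakes the waiter only if its type matches. */
static void model_handle_reply(struct model_mbox *m, unsigned char reply_type)
{
	if (m->expected != reply_type) {
		m->dropped++; /* stale reply: drop it without a wakeup */
		return;
	}
	m->done = true; /* plays the role of complete() */
}

/* Requester side: report the outcome the way enic_mbox_wait_reply() does. */
static int model_end_request(struct model_mbox *m)
{
	int ret = m->done ? 0 : -ETIMEDOUT;

	m->expected = MODEL_REPLY_NONE; /* assumption: no waiter after this point */
	return ret;
}
```

In the model, a reply of the wrong type, or one that arrives after the
requester has given up, is counted and dropped rather than completing
an unrelated waiter, which mirrors the mbox_expected_reply checks in
the VF reply handlers.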
Signed-off-by: Satish Kharat <satishkh@cisco.com>
---
drivers/net/ethernet/cisco/enic/enic.h | 11 ++
drivers/net/ethernet/cisco/enic/enic_mbox.c | 265 ++++++++++++++++++++++++++++
drivers/net/ethernet/cisco/enic/enic_mbox.h | 3 +
3 files changed, 279 insertions(+)
diff --git a/drivers/net/ethernet/cisco/enic/enic.h b/drivers/net/ethernet/cisco/enic/enic.h
index cace8e04e9ce..294b751b7cb6 100644
--- a/drivers/net/ethernet/cisco/enic/enic.h
+++ b/drivers/net/ethernet/cisco/enic/enic.h
@@ -258,6 +258,8 @@ struct enic {
u32 tx_coalesce_usecs;
u16 num_vfs;
enum enic_vf_type vf_type;
+ bool vf_registered;
+ u32 pf_cap_version;
unsigned int enable_count;
spinlock_t enic_api_lock;
bool enic_api_busy;
@@ -307,6 +309,15 @@ struct enic {
/* MBOX protocol state — mbox_lock serializes admin WQ sends */
struct mutex mbox_lock;
u64 mbox_msg_num;
+ /* MBOX request-reply state. Written by the process-context request
+ * helpers (capability/register/unregister) and read/cleared by the
+ * admin_msg_work receive handlers. No explicit lock is needed because
+ * only one request is in flight at a time: requesters run under RTNL or
+ * single-threaded probe/remove, so each request is serialized and its
+ * reply completes mbox_comp before the next request is issued.
+ */
+ struct completion mbox_comp;
+ u8 mbox_expected_reply;
/* PF: per-VF MBOX state, allocated when SRIOV V2 is enabled */
struct enic_vf_state {
diff --git a/drivers/net/ethernet/cisco/enic/enic_mbox.c b/drivers/net/ethernet/cisco/enic/enic_mbox.c
index b6f05b03ae26..eb084adae810 100644
--- a/drivers/net/ethernet/cisco/enic/enic_mbox.c
+++ b/drivers/net/ethernet/cisco/enic/enic_mbox.c
@@ -5,6 +5,7 @@
#include <linux/netdevice.h>
#include <linux/dma-mapping.h>
#include <linux/delay.h>
+#include <linux/completion.h>
#include "vnic_dev.h"
#include "vnic_wq.h"
@@ -135,6 +136,16 @@ int enic_mbox_send_msg(struct enic *enic, u8 msg_type, u16 dst_vnic_id,
return err;
}
+static int enic_mbox_wait_reply(struct enic *enic, unsigned long timeout_ms)
+{
+ unsigned long left;
+
+ left = wait_for_completion_timeout(&enic->mbox_comp,
+ msecs_to_jiffies(timeout_ms));
+
+ return left ? 0 : -ETIMEDOUT;
+}
+
int enic_mbox_send_link_state(struct enic *enic, u16 vf_id, u32 link_state)
{
struct enic_mbox_pf_link_state_notif_msg notif = {};
@@ -306,6 +317,166 @@ static void enic_mbox_pf_process_msg(struct enic *enic,
hdr->msg_type, vf_id, err);
}
+static void enic_mbox_vf_handle_capability_reply(struct enic *enic,
+ void *payload)
+{
+ struct enic_mbox_vf_capability_reply_msg *reply = payload;
+
+ if (enic->mbox_expected_reply != ENIC_MBOX_VF_CAPABILITY_REPLY) {
+ netdev_warn(enic->netdev,
+ "MBOX: stale capability reply (expected %u), drop\n",
+ enic->mbox_expected_reply);
+ return;
+ }
+
+ if (le16_to_cpu(reply->reply.ret_major) == 0)
+ enic->pf_cap_version = le32_to_cpu(reply->version);
+ else
+ netdev_warn(enic->netdev,
+ "MBOX: PF rejected capability request: %u/%u\n",
+ le16_to_cpu(reply->reply.ret_major),
+ le16_to_cpu(reply->reply.ret_minor));
+ complete(&enic->mbox_comp);
+}
+
+static void enic_mbox_vf_handle_register_reply(struct enic *enic,
+ void *payload)
+{
+ struct enic_mbox_vf_register_reply_msg *reply = payload;
+
+ if (enic->mbox_expected_reply != ENIC_MBOX_VF_REGISTER_REPLY) {
+ netdev_warn(enic->netdev,
+ "MBOX: stale register reply (expected %u), drop\n",
+ enic->mbox_expected_reply);
+ return;
+ }
+
+ if (le16_to_cpu(reply->reply.ret_major)) {
+ netdev_warn(enic->netdev,
+ "MBOX: VF register rejected by PF: %u/%u\n",
+ le16_to_cpu(reply->reply.ret_major),
+ le16_to_cpu(reply->reply.ret_minor));
+ } else {
+ enic->vf_registered = true;
+ }
+ complete(&enic->mbox_comp);
+}
+
+static void enic_mbox_vf_handle_unregister_reply(struct enic *enic,
+ void *payload)
+{
+ struct enic_mbox_vf_register_reply_msg *reply = payload;
+
+ if (enic->mbox_expected_reply != ENIC_MBOX_VF_UNREGISTER_REPLY) {
+ netdev_warn(enic->netdev,
+ "MBOX: stale unregister reply (expected %u), drop\n",
+ enic->mbox_expected_reply);
+ return;
+ }
+
+ if (le16_to_cpu(reply->reply.ret_major)) {
+ netdev_warn(enic->netdev,
+ "MBOX: VF unregister rejected by PF: %u/%u\n",
+ le16_to_cpu(reply->reply.ret_major),
+ le16_to_cpu(reply->reply.ret_minor));
+ } else {
+ enic->vf_registered = false;
+ }
+ complete(&enic->mbox_comp);
+}
+
+static void enic_mbox_vf_handle_link_state(struct enic *enic, void *payload)
+{
+ struct enic_mbox_pf_link_state_notif_msg *notif = payload;
+ struct enic_mbox_pf_link_state_ack_msg ack = {};
+ int err;
+
+ switch (le32_to_cpu(notif->link_state)) {
+ case ENIC_MBOX_LINK_STATE_ENABLE:
+ if (!netif_carrier_ok(enic->netdev))
+ netif_carrier_on(enic->netdev);
+ netdev_dbg(enic->netdev, "MBOX: link state -> UP\n");
+ break;
+ case ENIC_MBOX_LINK_STATE_DISABLE:
+ if (netif_carrier_ok(enic->netdev))
+ netif_carrier_off(enic->netdev);
+ netdev_dbg(enic->netdev, "MBOX: link state -> DOWN\n");
+ break;
+ default:
+ netdev_warn(enic->netdev, "MBOX: unknown link state %u\n",
+ le32_to_cpu(notif->link_state));
+ ack.ack.ret_major = cpu_to_le16(ENIC_MBOX_ERR_GENERIC);
+ break;
+ }
+
+ err = enic_mbox_send_msg(enic, ENIC_MBOX_PF_LINK_STATE_ACK,
+ ENIC_MBOX_DST_PF, &ack, sizeof(ack));
+ if (err && net_ratelimit())
+ netdev_warn(enic->netdev,
+ "MBOX: failed to send link state ACK: %d\n", err);
+}
+
+static bool enic_mbox_vf_payload_ok(struct enic *enic, u8 msg_type,
+ u16 payload_len, size_t min_len)
+{
+ if (payload_len < min_len) {
+ netdev_warn(enic->netdev,
+ "MBOX: short payload for type %u (%u < %zu)\n",
+ msg_type, payload_len, min_len);
+ return false;
+ }
+ return true;
+}
+
+static void enic_mbox_vf_process_msg(struct enic *enic,
+ struct enic_mbox_hdr *hdr, void *payload,
+ u16 payload_len)
+{
+ switch (hdr->msg_type) {
+ case ENIC_MBOX_VF_CAPABILITY_REPLY: {
+ size_t exp = sizeof(struct enic_mbox_vf_capability_reply_msg);
+
+ if (!enic_mbox_vf_payload_ok(enic, hdr->msg_type,
+ payload_len, exp))
+ return;
+ enic_mbox_vf_handle_capability_reply(enic, payload);
+ break;
+ }
+ case ENIC_MBOX_VF_REGISTER_REPLY: {
+ size_t exp = sizeof(struct enic_mbox_vf_register_reply_msg);
+
+ if (!enic_mbox_vf_payload_ok(enic, hdr->msg_type,
+ payload_len, exp))
+ return;
+ enic_mbox_vf_handle_register_reply(enic, payload);
+ break;
+ }
+ case ENIC_MBOX_VF_UNREGISTER_REPLY: {
+ size_t exp = sizeof(struct enic_mbox_vf_register_reply_msg);
+
+ if (!enic_mbox_vf_payload_ok(enic, hdr->msg_type,
+ payload_len, exp))
+ return;
+ enic_mbox_vf_handle_unregister_reply(enic, payload);
+ break;
+ }
+ case ENIC_MBOX_PF_LINK_STATE_NOTIF: {
+ size_t exp = sizeof(struct enic_mbox_pf_link_state_notif_msg);
+
+ if (!enic_mbox_vf_payload_ok(enic, hdr->msg_type,
+ payload_len, exp))
+ return;
+ enic_mbox_vf_handle_link_state(enic, payload);
+ break;
+ }
+ default:
+ netdev_dbg(enic->netdev,
+ "MBOX: VF unhandled msg type %u\n",
+ hdr->msg_type);
+ break;
+ }
+}
+
static void enic_mbox_recv_handler(struct enic *enic, void *buf,
unsigned int len)
{
@@ -346,11 +517,105 @@ static void enic_mbox_recv_handler(struct enic *enic, void *buf,
if (enic->vf_state)
enic_mbox_pf_process_msg(enic, hdr, payload);
+ else
+ enic_mbox_vf_process_msg(enic, hdr, payload,
+ msg_len - (u16)sizeof(*hdr));
+}
+
+int enic_mbox_vf_capability_check(struct enic *enic)
+{
+ struct enic_mbox_vf_capability_msg req = {};
+ int err;
+
+ enic->pf_cap_version = 0;
+ reinit_completion(&enic->mbox_comp);
+ enic->mbox_expected_reply = ENIC_MBOX_VF_CAPABILITY_REPLY;
+ req.version = cpu_to_le32(ENIC_MBOX_CAP_VERSION_1);
+
+ err = enic_mbox_send_msg(enic, ENIC_MBOX_VF_CAPABILITY_REQUEST,
+ ENIC_MBOX_DST_PF, &req, sizeof(req));
+ if (err) {
+ enic->mbox_expected_reply = 0;
+ return err;
+ }
+
+ err = enic_mbox_wait_reply(enic, 3000);
+ enic->mbox_expected_reply = 0;
+ if (err) {
+ netdev_warn(enic->netdev,
+ "MBOX: no capability reply from PF\n");
+ return err;
+ }
+
+ if (enic->pf_cap_version < ENIC_MBOX_CAP_VERSION_1) {
+ netdev_warn(enic->netdev,
+ "MBOX: PF version %u too old\n",
+ enic->pf_cap_version);
+ return -EOPNOTSUPP;
+ }
+
+ return 0;
+}
+
+int enic_mbox_vf_register(struct enic *enic)
+{
+ int err;
+
+ enic->vf_registered = false;
+ reinit_completion(&enic->mbox_comp);
+ enic->mbox_expected_reply = ENIC_MBOX_VF_REGISTER_REPLY;
+
+ err = enic_mbox_send_msg(enic, ENIC_MBOX_VF_REGISTER_REQUEST,
+ ENIC_MBOX_DST_PF, NULL, 0);
+ if (err) {
+ enic->mbox_expected_reply = 0;
+ return err;
+ }
+
+ err = enic_mbox_wait_reply(enic, 3000);
+ enic->mbox_expected_reply = 0;
+ if (err) {
+ netdev_warn(enic->netdev,
+ "MBOX: VF registration with PF timed out\n");
+ return err;
+ }
+
+ if (!enic->vf_registered)
+ return -ENODEV;
+
+ return 0;
+}
+
+int enic_mbox_vf_unregister(struct enic *enic)
+{
+ int err;
+
+ if (!enic->vf_registered)
+ return 0;
+
+ reinit_completion(&enic->mbox_comp);
+ enic->mbox_expected_reply = ENIC_MBOX_VF_UNREGISTER_REPLY;
+
+ err = enic_mbox_send_msg(enic, ENIC_MBOX_VF_UNREGISTER_REQUEST,
+ ENIC_MBOX_DST_PF, NULL, 0);
+ if (err) {
+ enic->mbox_expected_reply = 0;
+ return err;
+ }
+
+ err = enic_mbox_wait_reply(enic, 3000);
+ enic->mbox_expected_reply = 0;
+ if (err)
+ return err;
+ if (enic->vf_registered)
+ return -EACCES;
+ return 0;
}
void enic_mbox_init(struct enic *enic)
{
enic->mbox_msg_num = 0;
mutex_init(&enic->mbox_lock);
+ init_completion(&enic->mbox_comp);
enic->admin_rq_handler = enic_mbox_recv_handler;
}
diff --git a/drivers/net/ethernet/cisco/enic/enic_mbox.h b/drivers/net/ethernet/cisco/enic/enic_mbox.h
index f1de67db1273..15e30ee2b0ed 100644
--- a/drivers/net/ethernet/cisco/enic/enic_mbox.h
+++ b/drivers/net/ethernet/cisco/enic/enic_mbox.h
@@ -88,5 +88,8 @@ void enic_mbox_init(struct enic *enic);
int enic_mbox_send_msg(struct enic *enic, u8 msg_type, u16 dst_vnic_id,
void *payload, u16 payload_len);
int enic_mbox_send_link_state(struct enic *enic, u16 vf_id, u32 link_state);
+int enic_mbox_vf_capability_check(struct enic *enic);
+int enic_mbox_vf_register(struct enic *enic);
+int enic_mbox_vf_unregister(struct enic *enic);
#endif /* _ENIC_MBOX_H_ */
--
2.43.0
* [PATCH net-next v9 05/10] enic: define MBOX message types and header structures
From: Satish Kharat @ 2026-06-18 1:53 UTC (permalink / raw)
To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni
Cc: netdev, linux-kernel, Sesidhar Baddela, Satish Kharat
In-Reply-To: <20260617-enic-sriov-v2-admin-channel-v2-v9-0-37f5f5af4c93@cisco.com>
Define the mailbox protocol structures for PF-VF communication:
message header, generic reply, and per-message-type payloads for
capability negotiation, VF registration/unregistration, and link
state notification/acknowledgment.
Include linux/types.h and linux/bits.h for __le16/__le32/__le64
and BIT() used in the header.
Message types use an even=request / odd=reply convention. The
header carries source and destination VNIC IDs, a monotonically
increasing message number, and the total message length.
Signed-off-by: Satish Kharat <satishkh@cisco.com>
---
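Note, not part of the patch: with the even=request / odd=reply
convention described above, the reply type paired with a request is the
request value with the low bit set. A minimal userspace sketch of that
relationship, assuming hypothetical helper names (mbox_is_reply,
mbox_reply_for) that do not appear anywhere in this series:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Local copy of the message type values defined in this patch. */
enum mbox_msg_type {
	MBOX_VF_CAPABILITY_REQUEST = 0,
	MBOX_VF_CAPABILITY_REPLY = 1,
	MBOX_VF_REGISTER_REQUEST = 2,
	MBOX_VF_REGISTER_REPLY = 3,
	MBOX_VF_UNREGISTER_REQUEST = 4,
	MBOX_VF_UNREGISTER_REPLY = 5,
	MBOX_PF_LINK_STATE_NOTIF = 6,
	MBOX_PF_LINK_STATE_ACK = 7,
	MBOX_MAX
};

/* Hypothetical helper: odd values are replies/acks. */
static bool mbox_is_reply(uint8_t msg_type)
{
	return msg_type & 1;
}

/* Hypothetical helper: the reply paired with an even request type. */
static uint8_t mbox_reply_for(uint8_t request_type)
{
	assert(!mbox_is_reply(request_type));
	return (uint8_t)(request_type | 1);
}
```

The driver itself names each reply type explicitly rather than deriving
it; the helpers only illustrate the numbering.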
drivers/net/ethernet/cisco/enic/enic_mbox.h | 83 +++++++++++++++++++++++++++++
1 file changed, 83 insertions(+)
diff --git a/drivers/net/ethernet/cisco/enic/enic_mbox.h b/drivers/net/ethernet/cisco/enic/enic_mbox.h
new file mode 100644
index 000000000000..a52f1d25cb21
--- /dev/null
+++ b/drivers/net/ethernet/cisco/enic/enic_mbox.h
@@ -0,0 +1,83 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/* Copyright 2025 Cisco Systems, Inc. All rights reserved. */
+
+#ifndef _ENIC_MBOX_H_
+#define _ENIC_MBOX_H_
+
+#include <linux/bits.h>
+#include <linux/types.h>
+
+/*
+ * Mailbox protocol for PF-VF communication over the admin channel.
+ *
+ * Even numbers are requests, odd numbers are replies/acks.
+ * The prefix indicates the initiator: VF_ = VF-initiated, PF_ = PF-initiated.
+ */
+enum enic_mbox_msg_type {
+ ENIC_MBOX_VF_CAPABILITY_REQUEST = 0,
+ ENIC_MBOX_VF_CAPABILITY_REPLY = 1,
+ ENIC_MBOX_VF_REGISTER_REQUEST = 2,
+ ENIC_MBOX_VF_REGISTER_REPLY = 3,
+ ENIC_MBOX_VF_UNREGISTER_REQUEST = 4,
+ ENIC_MBOX_VF_UNREGISTER_REPLY = 5,
+ ENIC_MBOX_PF_LINK_STATE_NOTIF = 6,
+ ENIC_MBOX_PF_LINK_STATE_ACK = 7,
+ ENIC_MBOX_MAX
+};
+
+struct enic_mbox_hdr {
+ __le16 src_vnic_id;
+ __le16 dst_vnic_id;
+ u8 msg_type;
+ u8 flags;
+ __le16 msg_len;
+ __le64 msg_num;
+};
+
+struct enic_mbox_generic_reply {
+ __le16 ret_major;
+ __le16 ret_minor;
+};
+
+#define ENIC_MBOX_ERR_GENERIC BIT(0)
+#define ENIC_MBOX_ERR_VF_NOT_REGISTERED BIT(1)
+#define ENIC_MBOX_ERR_MSG_NOT_SUPPORTED BIT(2)
+
+/* ENIC_MBOX_VF_CAPABILITY_REQUEST / _REPLY */
+#define ENIC_MBOX_CAP_VERSION_0 0
+#define ENIC_MBOX_CAP_VERSION_1 1
+
+struct enic_mbox_vf_capability_msg {
+ __le32 version;
+ __le32 reserved[32];
+};
+
+/* The embedded enic_mbox_generic_reply has 2-byte alignment, but the
+ * __le32 members give this struct 4-byte natural alignment. Receive
+ * buffers come from kmalloc (>= 8-byte aligned), so there is no
+ * misaligned access risk when casting from the receive buffer.
+ */
+struct enic_mbox_vf_capability_reply_msg {
+ struct enic_mbox_generic_reply reply;
+ __le32 version;
+ __le32 reserved[32];
+};
+
+/* ENIC_MBOX_VF_REGISTER / _UNREGISTER */
+struct enic_mbox_vf_register_reply_msg {
+ struct enic_mbox_generic_reply reply;
+};
+
+/* ENIC_MBOX_PF_LINK_STATE_NOTIF / _ACK */
+#define ENIC_MBOX_LINK_STATE_DISABLE 0
+#define ENIC_MBOX_LINK_STATE_ENABLE 1
+
+struct enic_mbox_pf_link_state_notif_msg {
+ __le32 link_state;
+};
+
+struct enic_mbox_pf_link_state_ack_msg {
+ struct enic_mbox_generic_reply ack;
+};
+
+#endif /* _ENIC_MBOX_H_ */
--
2.43.0
* [PATCH net-next v9 06/10] enic: add MBOX core send and receive for admin channel
From: Satish Kharat @ 2026-06-18 1:53 UTC (permalink / raw)
To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni
Cc: netdev, linux-kernel, Sesidhar Baddela, Satish Kharat
In-Reply-To: <20260617-enic-sriov-v2-admin-channel-v2-v9-0-37f5f5af4c93@cisco.com>
Implement the mailbox protocol engine used for PF-VF communication
over the admin channel.
The send path (enic_mbox_send_msg) builds a message with a common
header, DMA-maps it, posts a single WQ descriptor with the
destination vnic ID encoded in the VLAN tag field, and polls
the WQ CQ for completion.
MBOX sends are gated by enic->mbox_send_disabled: enic_mbox_send_msg()
returns early while it is set. The flag is cleared in
enic_admin_channel_open() only once the admin WQ/RQ/CQ and interrupt
are fully programmed, and set again at the start of
enic_admin_channel_close(), so a send can never race a not-yet-ready
or torn-down admin channel.
The receive path (enic_mbox_recv_handler) is installed as the admin
RQ callback and validates incoming message headers. PF/VF-specific
dispatch will be added in subsequent commits.
Signed-off-by: Satish Kharat <satishkh@cisco.com>
---
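Note, not part of the patch: the send path encodes the destination in
the VLAN tag field (0 = PF, 1-based = VF index), and the receive path in
enic_admin_rq_cq_service() inverts that mapping. A minimal userspace
sketch of the round trip, assuming hypothetical function names
(mbox_vnic_to_vlan, mbox_vlan_to_vnic) that are not in the driver:

```c
#include <assert.h>
#include <stdint.h>

#define MBOX_DST_PF 0xFFFF	/* same value as ENIC_MBOX_DST_PF in this patch */

/* Send side: 0 routes to the PF, VF index n is encoded as n + 1. */
static uint16_t mbox_vnic_to_vlan(uint16_t dst_vnic_id)
{
	if (dst_vnic_id == MBOX_DST_PF)
		return 0;
	return (uint16_t)(dst_vnic_id + 1);
}

/* Receive side: invert the encoding to recover the sender's vnic ID. */
static uint16_t mbox_vlan_to_vnic(uint16_t vlan)
{
	if (vlan == 0)
		return MBOX_DST_PF;
	return (uint16_t)(vlan - 1);
}
```

The sketch does not model the width of the hardware VLAN field; it only
shows that the two mappings are inverses for PF and VF indices.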
drivers/net/ethernet/cisco/enic/Makefile | 2 +-
drivers/net/ethernet/cisco/enic/enic.h | 6 +
drivers/net/ethernet/cisco/enic/enic_admin.c | 35 +++++-
drivers/net/ethernet/cisco/enic/enic_mbox.c | 170 +++++++++++++++++++++++++++
drivers/net/ethernet/cisco/enic/enic_mbox.h | 8 ++
5 files changed, 218 insertions(+), 3 deletions(-)
diff --git a/drivers/net/ethernet/cisco/enic/Makefile b/drivers/net/ethernet/cisco/enic/Makefile
index 7ae72fefc99a..e38aaf34c148 100644
--- a/drivers/net/ethernet/cisco/enic/Makefile
+++ b/drivers/net/ethernet/cisco/enic/Makefile
@@ -4,5 +4,5 @@ obj-$(CONFIG_ENIC) := enic.o
enic-y := enic_main.o vnic_cq.o vnic_intr.o vnic_wq.o \
enic_res.o enic_dev.o enic_pp.o vnic_dev.o vnic_rq.o vnic_vic.o \
enic_ethtool.o enic_api.o enic_clsf.o enic_rq.o enic_wq.o \
- enic_admin.o
+ enic_admin.o enic_mbox.o
diff --git a/drivers/net/ethernet/cisco/enic/enic.h b/drivers/net/ethernet/cisco/enic/enic.h
index 42f2ac3df212..1d6a88d7f8ac 100644
--- a/drivers/net/ethernet/cisco/enic/enic.h
+++ b/drivers/net/ethernet/cisco/enic/enic.h
@@ -292,6 +292,8 @@ struct enic {
/* Admin channel resources for SR-IOV MBOX */
bool has_admin_channel;
+ /* set on send timeout or channel close; cleared on channel open */
+ bool mbox_send_disabled;
struct vnic_wq admin_wq;
struct vnic_rq admin_rq;
struct vnic_cq admin_cq[2];
@@ -303,6 +305,10 @@ struct enic {
struct list_head admin_msg_list;
void (*admin_rq_handler)(struct enic *enic, void *buf,
unsigned int len);
+
+ /* MBOX protocol state; mbox_lock serializes admin WQ sends */
+ struct mutex mbox_lock;
+ u64 mbox_msg_num;
};
static inline struct net_device *vnic_get_netdev(struct vnic_dev *vdev)
diff --git a/drivers/net/ethernet/cisco/enic/enic_admin.c b/drivers/net/ethernet/cisco/enic/enic_admin.c
index ec85fd446d02..8edf7ad4557d 100644
--- a/drivers/net/ethernet/cisco/enic/enic_admin.c
+++ b/drivers/net/ethernet/cisco/enic/enic_admin.c
@@ -19,6 +19,7 @@
#include "cq_enet_desc.h"
#include "wq_enet_desc.h"
#include "rq_enet_desc.h"
+#include "enic_mbox.h"
/* Clean up any admin WQ buffers still held by hardware at close time.
* Normally buffers are freed inline after send completion, but a timed-out
@@ -191,7 +192,26 @@ unsigned int enic_admin_rq_cq_service(struct enic *enic)
goto next_desc;
}
- enic_admin_msg_enqueue(enic, buf->os_buf, bytes_written);
+ if (enic->admin_rq_handler) {
+ u16 sender_vlan;
+
+ /* Firmware sets the CQ VLAN field to identify the
+ * sender: 0 = PF, 1-based = VF index. Overwrite
+ * the untrusted src_vnic_id in the MBOX header with
+ * the hardware-verified value.
+ */
+ sender_vlan = le16_to_cpu(rq_desc->vlan);
+ if (bytes_written >= sizeof(struct enic_mbox_hdr)) {
+ struct enic_mbox_hdr *hdr = buf->os_buf;
+
+ hdr->src_vnic_id = (sender_vlan == 0) ?
+ cpu_to_le16(ENIC_MBOX_DST_PF) :
+ cpu_to_le16(sender_vlan - 1);
+ }
+
+ enic_admin_msg_enqueue(enic, buf->os_buf,
+ bytes_written);
+ }
next_desc:
enic_admin_rq_buf_clean(rq, rq->to_clean);
@@ -428,8 +448,9 @@ static void enic_admin_init_resources(struct enic *enic)
VNIC_CQ_MSG_DISABLE,
intr_offset,
0 /* cq_message_addr */);
+ /* coalescing_timer, coalescing_type, mask_on_assertion */
vnic_intr_init(&enic->admin_intr,
- 0, 0, 1); /* coalescing_timer, coalescing_type, mask_on_assertion */
+ 0, 0, 1);
}
static void enic_admin_msg_drain(struct enic *enic)
@@ -493,6 +514,14 @@ int enic_admin_channel_open(struct enic *enic)
vnic_intr_unmask(&enic->admin_intr);
+ /* Only now that the admin WQ/RQ/CQ and interrupt are fully allocated,
+ * programmed and enabled is it safe to allow MBOX sends. Clearing this
+ * earlier would open a window where a concurrent sender (e.g.
+ * link-notify work scheduled by a post-reset link-up) could call
+ * enic_mbox_send_msg() against a not-yet-allocated admin_wq and crash.
+ */
+ WRITE_ONCE(enic->mbox_send_disabled, false);
+
netdev_dbg(enic->netdev,
"admin channel open: intr=%u wq_avail=%u rq_avail=%u cq0_color=%u cq1_color=%u\n",
enic->admin_intr_index,
@@ -525,6 +554,8 @@ void enic_admin_channel_close(struct enic *enic)
if (!enic->has_admin_channel)
return;
+ WRITE_ONCE(enic->mbox_send_disabled, true);
+
netdev_dbg(enic->netdev, "admin channel close\n");
vnic_intr_mask(&enic->admin_intr);
diff --git a/drivers/net/ethernet/cisco/enic/enic_mbox.c b/drivers/net/ethernet/cisco/enic/enic_mbox.c
new file mode 100644
index 000000000000..3709704bee02
--- /dev/null
+++ b/drivers/net/ethernet/cisco/enic/enic_mbox.c
@@ -0,0 +1,170 @@
+// SPDX-License-Identifier: GPL-2.0-only
+// Copyright 2025 Cisco Systems, Inc. All rights reserved.
+
+#include <linux/kernel.h>
+#include <linux/netdevice.h>
+#include <linux/dma-mapping.h>
+#include <linux/delay.h>
+
+#include "vnic_dev.h"
+#include "vnic_wq.h"
+#include "vnic_cq.h"
+#include "enic.h"
+#include "enic_admin.h"
+#include "enic_mbox.h"
+#include "wq_enet_desc.h"
+
+#define ENIC_MBOX_POLL_TIMEOUT_US 5000000
+#define ENIC_MBOX_POLL_INTERVAL_US 100
+
+static void enic_mbox_fill_hdr(struct enic *enic, struct enic_mbox_hdr *hdr,
+ u8 msg_type, u16 dst_vnic_id, u16 msg_len)
+{
+ memset(hdr, 0, sizeof(*hdr));
+ hdr->dst_vnic_id = cpu_to_le16(dst_vnic_id);
+ hdr->msg_type = msg_type;
+ hdr->msg_len = cpu_to_le16(msg_len);
+ hdr->msg_num = cpu_to_le64(++enic->mbox_msg_num);
+}
+
+int enic_mbox_send_msg(struct enic *enic, u8 msg_type, u16 dst_vnic_id,
+ void *payload, u16 payload_len)
+{
+ u16 total_len = sizeof(struct enic_mbox_hdr) + payload_len;
+ struct vnic_wq *wq = &enic->admin_wq;
+ struct wq_enet_desc *desc;
+ unsigned long timeout;
+ dma_addr_t dma_addr;
+ u16 vlan_tag;
+ void *buf;
+ int err;
+
+ /* Serialize MBOX sends. The admin channel is a low-frequency
+ * control path; holding the mutex across the poll is acceptable.
+ */
+ mutex_lock(&enic->mbox_lock);
+
+ if (!enic->has_admin_channel || READ_ONCE(enic->mbox_send_disabled)) {
+ err = -ENODEV;
+ goto unlock;
+ }
+
+ if (vnic_wq_desc_avail(wq) == 0) {
+ err = -ENOSPC;
+ goto unlock;
+ }
+
+ buf = kmalloc(total_len, GFP_KERNEL);
+ if (!buf) {
+ err = -ENOMEM;
+ goto unlock;
+ }
+
+ enic_mbox_fill_hdr(enic, buf, msg_type, dst_vnic_id, total_len);
+ if (payload_len) {
+ void *dst = buf + sizeof(struct enic_mbox_hdr);
+
+ memcpy(dst, payload, payload_len);
+ }
+
+ dma_addr = dma_map_single(&enic->pdev->dev, buf, total_len,
+ DMA_TO_DEVICE);
+ if (dma_mapping_error(&enic->pdev->dev, dma_addr)) {
+ kfree(buf);
+ err = -ENOMEM;
+ goto unlock;
+ }
+
+ /* Firmware uses vlan field for routing: 0 = PF, 1-based = VF index */
+ if (dst_vnic_id == ENIC_MBOX_DST_PF)
+ vlan_tag = 0;
+ else
+ vlan_tag = dst_vnic_id + 1;
+
+ desc = vnic_wq_next_desc(wq);
+ wq_enet_desc_enc(desc, (u64)dma_addr | VNIC_PADDR_TARGET,
+ total_len,
+ 0, 0, 0, /* mss, hdr_len, offload_mode */
+ 1, 1, /* eop, cq_entry */
+ 0, /* fcoe_encap */
+ 1, vlan_tag, /* vlan_tag_insert, vlan_tag */
+ 0); /* loopback */
+ vnic_wq_post(wq, buf, dma_addr, total_len,
+ 1, 1, /* sop, eop */
+ 1, 1, /* desc_skip_cnt, cq_entry */
+ 0, 0); /* compressed_send, wrid */
+ vnic_wq_doorbell(wq);
+
+ timeout = jiffies + usecs_to_jiffies(ENIC_MBOX_POLL_TIMEOUT_US);
+ err = -ETIMEDOUT;
+ while (time_before(jiffies, timeout)) {
+ if (enic_admin_wq_cq_service(enic)) {
+ err = 0;
+ break;
+ }
+ usleep_range(ENIC_MBOX_POLL_INTERVAL_US,
+ ENIC_MBOX_POLL_INTERVAL_US + 50);
+ }
+ /* Final check in case completion arrived during the last sleep */
+ if (err && enic_admin_wq_cq_service(enic))
+ err = 0;
+
+ if (!err) {
+ wq->to_clean = wq->to_clean->next;
+ wq->ring.desc_avail++;
+ dma_unmap_single(&enic->pdev->dev, dma_addr, total_len,
+ DMA_TO_DEVICE);
+ kfree(buf);
+ } else {
+ netdev_err(enic->netdev,
+ "MBOX send timed out (type %u dst %u), disabling channel\n",
+ msg_type, dst_vnic_id);
+ /*
+ * The WQ descriptor is still live in hardware. Do not unmap
+ * or free the buffer: the device may still DMA from dma_addr.
+ * Mark the channel unusable so no further sends are attempted.
+ */
+ WRITE_ONCE(enic->mbox_send_disabled, true);
+ }
+
+ netdev_dbg(enic->netdev,
+ "MBOX send msg_type %u dst %u vlan %u err %d\n",
+ msg_type, dst_vnic_id, vlan_tag, err);
+unlock:
+ mutex_unlock(&enic->mbox_lock);
+ return err;
+}
+
+static void enic_mbox_recv_handler(struct enic *enic, void *buf,
+ unsigned int len)
+{
+ struct enic_mbox_hdr *hdr = buf;
+
+ if (len < sizeof(*hdr)) {
+ if (net_ratelimit())
+ netdev_warn(enic->netdev,
+ "MBOX: truncated message (len %u < %zu)\n",
+ len, sizeof(*hdr));
+ return;
+ }
+
+ if (hdr->msg_type >= ENIC_MBOX_MAX) {
+ if (net_ratelimit())
+ netdev_warn(enic->netdev,
+ "MBOX: unknown msg type %u\n",
+ hdr->msg_type);
+ return;
+ }
+
+ netdev_dbg(enic->netdev,
+ "MBOX recv: type %u from vnic %u len %u\n",
+ hdr->msg_type, le16_to_cpu(hdr->src_vnic_id),
+ le16_to_cpu(hdr->msg_len));
+}
+
+void enic_mbox_init(struct enic *enic)
+{
+ enic->mbox_msg_num = 0;
+ mutex_init(&enic->mbox_lock);
+ enic->admin_rq_handler = enic_mbox_recv_handler;
+}
diff --git a/drivers/net/ethernet/cisco/enic/enic_mbox.h b/drivers/net/ethernet/cisco/enic/enic_mbox.h
index a52f1d25cb21..73fd7f783ee2 100644
--- a/drivers/net/ethernet/cisco/enic/enic_mbox.h
+++ b/drivers/net/ethernet/cisco/enic/enic_mbox.h
@@ -80,4 +80,12 @@ struct enic_mbox_pf_link_state_ack_msg {
struct enic_mbox_generic_reply ack;
};
+#define ENIC_MBOX_DST_PF 0xFFFF
+
+struct enic;
+
+void enic_mbox_init(struct enic *enic);
+int enic_mbox_send_msg(struct enic *enic, u8 msg_type, u16 dst_vnic_id,
+ void *payload, u16 payload_len);
+
#endif /* _ENIC_MBOX_H_ */
--
2.43.0
* [PATCH net-next v9 10/10] enic: add V2 VF probe with admin channel and PF registration
From: Satish Kharat @ 2026-06-18 1:53 UTC (permalink / raw)
To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni
Cc: netdev, linux-kernel, Sesidhar Baddela, Satish Kharat
In-Reply-To: <20260617-enic-sriov-v2-admin-channel-v2-v9-0-37f5f5af4c93@cisco.com>
When a V2 SR-IOV VF probes, open the admin channel, initialize the
MBOX protocol, perform the capability check with the PF, and register
with the PF. This establishes the PF-VF communication path that the PF
uses to send link state notifications.
The admin channel and MBOX registration happen after enic_dev_init()
(which discovers admin channel resources) and before register_netdev()
so the VF is fully initialized before the interface is visible to
userspace.
The admin channel is opened before enic_mbox_init() installs the
receive handler. This is safe because enic_admin_rq_cq_service()
checks admin_rq_handler before enqueuing received buffers, so any
message that arrives between open and mbox_init is harmlessly
discarded.
On remove, the VF unregisters from the PF and closes its admin channel
before tearing down data path resources.
V2 VFs are not provisioned with an RES_TYPE_SRIOV_INTR resource by
firmware, so bypass that check in the admin channel capability
detection for V2 VFs. The PF still requires this resource.
The admin MSI-X vector reserved by enic_set_intr_mode() is used for the
admin channel interrupt. enic_adjust_resources() ensures the reserved
slot is within intr_avail bounds even at maximum queue configurations.
The admin INTR uses a RES_TYPE_INTR_CTRL slot shared with the data path.
Signed-off-by: Satish Kharat <satishkh@cisco.com>
---
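Note, not part of the patch: the probe ordering above (open the admin
channel, capability check, register with the PF, then expose the netdev)
and the matching unwind can be modelled in userspace. A minimal sketch,
assuming a hypothetical struct vf_sim and step() helper that only log
the order of operations; none of these names exist in the driver:

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-in for driver state; log records the call order. */
struct vf_sim {
	char log[64];
	int fail_at;		/* step number that should fail, 0 = none */
	bool registered;
};

static int step(struct vf_sim *s, int nr, const char *tag)
{
	strcat(s->log, tag);
	return s->fail_at == nr ? -1 : 0;
}

/* Mirrors the ordering in the probe hunk: open, capability, register. */
static int vf_probe_sim(struct vf_sim *s)
{
	int err;

	err = step(s, 1, "open;");
	if (err)
		return err;		/* channel never opened, no close */
	err = step(s, 2, "cap;");
	if (err)
		goto err_admin_close;
	err = step(s, 3, "reg;");
	if (err)
		goto err_admin_close;
	s->registered = true;
	err = step(s, 4, "netdev;");	/* stands in for register_netdev() */
	if (err)
		goto err_admin_close;
	return 0;

err_admin_close:
	if (s->registered) {
		strcat(s->log, "unreg;");	/* unregister before closing */
		s->registered = false;
	}
	strcat(s->log, "close;");
	return err;
}
```

A failure after registration unwinds as "unreg;close;", the same order
the remove path uses.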
drivers/net/ethernet/cisco/enic/enic.h | 1 +
drivers/net/ethernet/cisco/enic/enic_main.c | 88 ++++++++++++++++++++++++++---
drivers/net/ethernet/cisco/enic/enic_res.c | 3 +-
3 files changed, 82 insertions(+), 10 deletions(-)
diff --git a/drivers/net/ethernet/cisco/enic/enic.h b/drivers/net/ethernet/cisco/enic/enic.h
index a6abd6fd04dc..1999403bd969 100644
--- a/drivers/net/ethernet/cisco/enic/enic.h
+++ b/drivers/net/ethernet/cisco/enic/enic.h
@@ -446,6 +446,7 @@ void enic_reset_addr_lists(struct enic *enic);
int enic_sriov_enabled(struct enic *enic);
int enic_is_valid_vf(struct enic *enic, int vf);
int enic_is_dynamic(struct enic *enic);
+int enic_is_sriov_vf_v2(struct enic *enic);
void enic_set_ethtool_ops(struct net_device *netdev);
int __enic_set_rsskey(struct enic *enic);
void enic_ext_cq(struct enic *enic);
diff --git a/drivers/net/ethernet/cisco/enic/enic_main.c b/drivers/net/ethernet/cisco/enic/enic_main.c
index 04b9ae4be29b..6bc7ead860a9 100644
--- a/drivers/net/ethernet/cisco/enic/enic_main.c
+++ b/drivers/net/ethernet/cisco/enic/enic_main.c
@@ -316,6 +316,11 @@ static int enic_is_sriov_vf(struct enic *enic)
enic->pdev->device == PCI_DEVICE_ID_CISCO_VIC_ENET_VF_V2;
}
+int enic_is_sriov_vf_v2(struct enic *enic)
+{
+ return enic->pdev->device == PCI_DEVICE_ID_CISCO_VIC_ENET_VF_V2;
+}
+
int enic_is_valid_vf(struct enic *enic, int vf)
{
#ifdef CONFIG_PCI_IOV
@@ -2399,15 +2404,19 @@ static int enic_adjust_resources(struct enic *enic)
enic->intr_count = enic->intr_avail;
break;
case VNIC_DEV_INTR_MODE_MSIX: {
- /* Reserve one MSI-X slot for the admin channel interrupt
- * when V2 SR-IOV admin channel resources are present.
- */
- unsigned int admin_reserve =
- enic->has_admin_channel ? 1 : 0;
-
/* Adjust the number of wqs/rqs/cqs/interrupts that will be
- * used based on which resource is the most constrained
+ * used based on which resource is the most constrained.
+ * Reserve one extra MSI-X slot for the admin channel INTR
+ * when has_admin_channel is set so that
+ * enic_admin_setup_intr() can allocate at intr_count
+ * within the intr_avail bounds even when the data queue
+ * count is maxed out. intr_count counts only the data-path
+ * IRQs (registered by enic_request_intr()); the admin INTR
+ * lives at msix index intr_count and is set up later by
+ * enic_admin_setup_intr().
*/
+ unsigned int admin_reserve = enic->has_admin_channel ? 1 : 0;
+
wq_avail = min(enic->wq_avail, ENIC_WQ_MAX);
rq_default = max(netif_get_num_default_rss_queues(),
ENIC_RQ_MIN_DEFAULT);
@@ -3096,6 +3105,38 @@ static int enic_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
goto err_out_dev_close;
}
+ /* V2 VF: open admin channel and register with PF.
+ * Must happen before register_netdev so the VF is fully
+ * initialized before the interface is visible to userspace.
+ *
+ * admin_channel_open() runs before enic_mbox_init() installs
+ * the receive handler. This is safe because
+ * enic_admin_rq_cq_service() checks admin_rq_handler before
+ * enqueuing any received buffer, so messages that arrive
+ * between open and mbox_init are harmlessly discarded.
+ */
+ if (enic_is_sriov_vf_v2(enic)) {
+ err = enic_admin_channel_open(enic);
+ if (err) {
+ dev_err(dev,
+ "Failed to open admin channel: %d\n", err);
+ goto err_out_dev_deinit;
+ }
+ enic_mbox_init(enic);
+ err = enic_mbox_vf_capability_check(enic);
+ if (err) {
+ dev_err(dev,
+ "MBOX capability check failed: %d\n", err);
+ goto err_out_admin_close;
+ }
+ err = enic_mbox_vf_register(enic);
+ if (err) {
+ dev_err(dev,
+ "MBOX VF registration failed: %d\n", err);
+ goto err_out_admin_close;
+ }
+ }
+
netif_set_real_num_tx_queues(netdev, enic->wq_count);
netif_set_real_num_rx_queues(netdev, enic->rq_count);
@@ -3121,7 +3162,7 @@ static int enic_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
err = enic_set_mac_addr(netdev, enic->mac_addr);
if (err) {
dev_err(dev, "Invalid MAC address, aborting\n");
- goto err_out_dev_deinit;
+ goto err_out_admin_close;
}
enic->tx_coalesce_usecs = enic->config.intr_timer_usec;
@@ -3219,11 +3260,23 @@ static int enic_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
err = register_netdev(netdev);
if (err) {
dev_err(dev, "Cannot register net device, aborting\n");
- goto err_out_dev_deinit;
+ goto err_out_admin_close;
}
return 0;
+err_out_admin_close:
+ if (enic_is_sriov_vf_v2(enic)) {
+ if (enic->vf_registered) {
+ int unreg_err = enic_mbox_vf_unregister(enic);
+
+ if (unreg_err)
+ netdev_warn(netdev,
+ "Failed to unregister from PF: %d\n",
+ unreg_err);
+ }
+ enic_admin_channel_close(enic);
+ }
err_out_dev_deinit:
enic_dev_deinit(enic);
err_out_dev_close:
@@ -3260,6 +3313,23 @@ static void enic_remove(struct pci_dev *pdev)
cancel_work_sync(&enic->reset);
cancel_work_sync(&enic->change_mtu_work);
+
+ /* Unregister from the PF and close the admin channel before
+ * unregister_netdev() to prevent a late PF notification from
+ * touching a netdev that has been freed.
+ */
+ if (enic_is_sriov_vf_v2(enic)) {
+ if (enic->vf_registered) {
+ int unreg_err = enic_mbox_vf_unregister(enic);
+
+ if (unreg_err)
+ netdev_warn(netdev,
+ "Failed to unregister from PF: %d\n",
+ unreg_err);
+ }
+ enic_admin_channel_close(enic);
+ }
+
unregister_netdev(netdev);
#ifdef CONFIG_PCI_IOV
if (enic_sriov_enabled(enic)) {
diff --git a/drivers/net/ethernet/cisco/enic/enic_res.c b/drivers/net/ethernet/cisco/enic/enic_res.c
index 436326ace049..74cd2ee3af5c 100644
--- a/drivers/net/ethernet/cisco/enic/enic_res.c
+++ b/drivers/net/ethernet/cisco/enic/enic_res.c
@@ -211,7 +211,8 @@ void enic_get_res_counts(struct enic *enic)
vnic_dev_get_res_count(enic->vdev, RES_TYPE_ADMIN_RQ) >= 1 &&
vnic_dev_get_res_count(enic->vdev, RES_TYPE_ADMIN_CQ) >=
ARRAY_SIZE(enic->admin_cq) &&
- vnic_dev_get_res_count(enic->vdev, RES_TYPE_SRIOV_INTR) >= 1;
+ (enic_is_sriov_vf_v2(enic) ||
+ vnic_dev_get_res_count(enic->vdev, RES_TYPE_SRIOV_INTR) >= 1);
dev_info(enic_get_dev(enic),
"vNIC resources avail: wq %d rq %d cq %d intr %d admin %s\n",
--
2.43.0
* [PATCH net-next v9 01/10] enic: verify firmware supports V2 SR-IOV at probe time
From: Satish Kharat @ 2026-06-18 1:53 UTC (permalink / raw)
To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni
Cc: netdev, linux-kernel, Sesidhar Baddela, Satish Kharat,
Breno Leitao
In-Reply-To: <20260617-enic-sriov-v2-admin-channel-v2-v9-0-37f5f5af4c93@cisco.com>
During PF probe, query the firmware get-supported-feature interface
to verify that the running firmware supports V2 SR-IOV. Firmware
versions 5.3(4.72) and later report VIC_FEATURE_SRIOV via
CMD_GET_SUPP_FEATURE_VER. If the firmware does not support the
feature, set vf_type to ENIC_VF_TYPE_NONE and log a warning so the
admin knows a firmware upgrade is needed.
VIC_FEATURE_SRIOV is assigned the explicit value 4 to match the
firmware ABI. Slot 3 (firmware's VIC_FEATURE_PTP) is reserved with
a comment rather than a placeholder enum entry, since PTP is not
used by the upstream driver.
Suggested-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Satish Kharat <satishkh@cisco.com>
---
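Note, not part of the patch: pinning VIC_FEATURE_SRIOV to 4 while
leaving slot 3 unnamed means the enumerators that follow continue from
the explicit value. A minimal userspace sketch with a local copy of the
enum (the FEATURE_* names are renamed copies for illustration, not
driver symbols):

```c
#include <assert.h>

/* Local copy of the enum as it looks after this patch. */
enum vic_feature_sketch {
	FEATURE_VXLAN,		/* 0 */
	FEATURE_RDMA,		/* 1 */
	FEATURE_VXLAN_PATCH,	/* 2 */
	/* slot 3 reserved for firmware VIC_FEATURE_PTP */
	FEATURE_SRIOV = 4,
	FEATURE_MAX,		/* 5: continues from the explicit value */
};

/* Compile-time check that the value matches the firmware ABI slot. */
_Static_assert(FEATURE_SRIOV == 4, "SRIOV must stay at ABI slot 4");
```

One consequence worth keeping in mind: any table sized by the MAX
enumerator gains an unused entry at index 3.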
drivers/net/ethernet/cisco/enic/enic_main.c | 21 ++++++++++++++++++++-
drivers/net/ethernet/cisco/enic/vnic_devcmd.h | 2 ++
2 files changed, 22 insertions(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/cisco/enic/enic_main.c b/drivers/net/ethernet/cisco/enic/enic_main.c
index e7125b818087..53d68272d06a 100644
--- a/drivers/net/ethernet/cisco/enic/enic_main.c
+++ b/drivers/net/ethernet/cisco/enic/enic_main.c
@@ -2641,8 +2641,10 @@ static void enic_iounmap(struct enic *enic)
static void enic_sriov_detect_vf_type(struct enic *enic)
{
struct pci_dev *pdev = enic->pdev;
- int pos;
+ u64 supported_versions, a1 = 0;
u16 vf_dev_id;
+ int pos;
+ int err;
if (enic_is_sriov_vf(enic) || enic_is_dynamic(enic))
return;
@@ -2669,6 +2671,23 @@ static void enic_sriov_detect_vf_type(struct enic *enic)
enic->vf_type = ENIC_VF_TYPE_NONE;
break;
}
+
+ if (enic->vf_type != ENIC_VF_TYPE_V2)
+ return;
+
+ /* A successful command means firmware recognizes
+ * VIC_FEATURE_SRIOV; supported_versions is available
+ * for sub-feature versioning in the future.
+ */
+ err = vnic_dev_get_supported_feature_ver(enic->vdev,
+ VIC_FEATURE_SRIOV,
+ &supported_versions,
+ &a1);
+ if (err) {
+ dev_warn(&pdev->dev,
+ "SR-IOV V2 not supported by current firmware. Upgrade to VIC FW 5.3(4.72) or higher.\n");
+ enic->vf_type = ENIC_VF_TYPE_NONE;
+ }
}
#endif
diff --git a/drivers/net/ethernet/cisco/enic/vnic_devcmd.h b/drivers/net/ethernet/cisco/enic/vnic_devcmd.h
index 605ef17f967e..3b6efa743dba 100644
--- a/drivers/net/ethernet/cisco/enic/vnic_devcmd.h
+++ b/drivers/net/ethernet/cisco/enic/vnic_devcmd.h
@@ -734,6 +734,8 @@ enum vic_feature_t {
VIC_FEATURE_VXLAN,
VIC_FEATURE_RDMA,
VIC_FEATURE_VXLAN_PATCH,
+ /* slot 3 reserved for firmware VIC_FEATURE_PTP */
+ VIC_FEATURE_SRIOV = 4,
VIC_FEATURE_MAX,
};
--
2.43.0
* [PATCH net-next v9 09/10] enic: wire V2 SR-IOV enable with admin channel and MBOX
From: Satish Kharat @ 2026-06-18 1:53 UTC (permalink / raw)
To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni
Cc: netdev, linux-kernel, Sesidhar Baddela, Satish Kharat
In-Reply-To: <20260617-enic-sriov-v2-admin-channel-v2-v9-0-37f5f5af4c93@cisco.com>
Extend enic_sriov_configure() to handle V2 SR-IOV VFs. When the PF
detects V2 VF device IDs, the enable path allocates per-VF MBOX state,
opens the admin channel, initializes the MBOX protocol, and then calls
pci_enable_sriov(). The admin channel must be ready before VFs are
created so that VF drivers can immediately begin the MBOX capability
and registration handshake during their probe.
The enic_sriov_configure() dispatcher and its V2 helpers
(enic_sriov_v2_enable, enic_sriov_v2_disable) are defined here but
intentionally not yet wired into struct pci_driver via
.sriov_configure -- hence the __maybe_unused annotations. This
series introduces only the admin channel and MBOX infrastructure;
sysfs-driven V2 enable/disable will be activated in a follow-up
patch by adding ".sriov_configure = enic_sriov_configure," to
enic_driver.
The disable path first clears ENIC_SRIOV_ENABLED and flushes the
link-notify work, so no further VF link-state broadcast can run, then
calls pci_disable_sriov() (VF drivers unregister via MBOX), closes the
admin channel, and frees per-VF state. Clearing the flag and flushing
the work before vf_state is freed closes a use-after-free window
against the link-notify path.
Notify registered VFs of PF link transitions: enic_link_check()
schedules link_notify_work on each carrier up/down edge, and the work
handler sends PF_LINK_STATE_NOTIF to the VFs from process context.
The broadcast cannot run directly in enic_link_check() because the
MBOX send path may sleep, whereas enic_link_check() runs in atomic
(notify timer/ISR) context.
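
The edge-triggered deferral can be illustrated with a minimal userspace sketch. The names struct link_model, model_link_check() and model_run_notify_work() are hypothetical; notify_queued stands in for schedule_work() and the broadcast counter stands in for the PF_LINK_STATE_NOTIF sends.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct link_model {
	bool carrier_ok;     /* last carrier state seen by the model */
	bool notify_queued;  /* stands in for schedule_work() */
	int broadcasts;      /* how many times the deferred handler ran */
};

/* Atomic-context side: record the edge and queue the work, nothing else. */
static void model_link_check(struct link_model *m, bool link_status)
{
	assert(m != NULL);
	if (link_status == m->carrier_ok)
		return;               /* no transition, nothing to do */
	m->carrier_ok = link_status;
	m->notify_queued = true;
}

/* Process-context side: the only place a sleeping send would happen. */
static void model_run_notify_work(struct link_model *m)
{
	assert(m != NULL);
	if (!m->notify_queued)
		return;
	m->notify_queued = false;
	m->broadcasts++;
}
```

Repeated link checks that see the same state queue nothing, so VFs are only notified on real up/down transitions.
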
Re-establish the admin/MBOX channel across a PF reset. enic_reset()
and enic_tx_hang_reset() fully close the admin channel before the
soft/hang reset (which wipes all hardware queues, including the admin
WQ/RQ), then reopen it and re-run enic_mbox_init() after the data path
is back up, and re-push the current link state to registered VFs.
Reject VF port profile requests when V2 SR-IOV is active
(enic_is_valid_pp_vf), since enic->pp is not reallocated for V2 VFs
and the V2 protocol uses MBOX instead of port profiles.
Update enic_remove() to run enic_dev_deinit() and vnic_dev_close()
after SR-IOV teardown, so the PF device remains functional while VFs
are being cleaned up. This ordering applies to both V1 and V2 SR-IOV
paths.
Signed-off-by: Satish Kharat <satishkh@cisco.com>
---
drivers/net/ethernet/cisco/enic/enic.h | 2 +
drivers/net/ethernet/cisco/enic/enic_admin.c | 3 +
drivers/net/ethernet/cisco/enic/enic_main.c | 252 +++++++++++++++++++++++++--
drivers/net/ethernet/cisco/enic/enic_mbox.c | 13 +-
drivers/net/ethernet/cisco/enic/enic_pp.c | 5 +
drivers/net/ethernet/cisco/enic/enic_res.c | 1 +
drivers/net/ethernet/cisco/enic/vnic_enet.h | 4 +-
7 files changed, 266 insertions(+), 14 deletions(-)
diff --git a/drivers/net/ethernet/cisco/enic/enic.h b/drivers/net/ethernet/cisco/enic/enic.h
index 294b751b7cb6..a6abd6fd04dc 100644
--- a/drivers/net/ethernet/cisco/enic/enic.h
+++ b/drivers/net/ethernet/cisco/enic/enic.h
@@ -300,6 +300,7 @@ struct enic {
struct vnic_intr admin_intr;
struct work_struct admin_poll_work;
unsigned int admin_intr_index;
+ struct work_struct link_notify_work;
struct work_struct admin_msg_work;
spinlock_t admin_msg_lock; /* protects admin_msg_list */
struct list_head admin_msg_list;
@@ -318,6 +319,7 @@ struct enic {
*/
struct completion mbox_comp;
u8 mbox_expected_reply;
+ bool mbox_initialized;
/* PF: per-VF MBOX state, allocated when SRIOV V2 is enabled */
struct enic_vf_state {
diff --git a/drivers/net/ethernet/cisco/enic/enic_admin.c b/drivers/net/ethernet/cisco/enic/enic_admin.c
index 8edf7ad4557d..6bc3cc850fac 100644
--- a/drivers/net/ethernet/cisco/enic/enic_admin.c
+++ b/drivers/net/ethernet/cisco/enic/enic_admin.c
@@ -560,6 +560,7 @@ void enic_admin_channel_close(struct enic *enic)
vnic_intr_mask(&enic->admin_intr);
enic_admin_teardown_intr(enic);
+ cancel_work_sync(&enic->link_notify_work);
cancel_work_sync(&enic->admin_msg_work);
enic_admin_msg_drain(enic);
@@ -579,5 +580,7 @@ void enic_admin_channel_close(struct enic *enic)
vnic_cq_clean(&enic->admin_cq[0]);
vnic_cq_clean(&enic->admin_cq[1]);
vnic_intr_clean(&enic->admin_intr);
+
+ enic->admin_rq_handler = NULL;
enic_admin_free_resources(enic);
}
diff --git a/drivers/net/ethernet/cisco/enic/enic_main.c b/drivers/net/ethernet/cisco/enic/enic_main.c
index 53d68272d06a..04b9ae4be29b 100644
--- a/drivers/net/ethernet/cisco/enic/enic_main.c
+++ b/drivers/net/ethernet/cisco/enic/enic_main.c
@@ -60,6 +60,8 @@
#include "enic_clsf.h"
#include "enic_rq.h"
#include "enic_wq.h"
+#include "enic_admin.h"
+#include "enic_mbox.h"
#define ENIC_NOTIFY_TIMER_PERIOD (2 * HZ)
@@ -411,6 +413,24 @@ static void enic_set_rx_coal_setting(struct enic *enic)
rx_coal->use_adaptive_rx_coalesce = 1;
}
+static void enic_link_notify_work_handler(struct work_struct *work)
+{
+ struct enic *enic = container_of(work, struct enic,
+ link_notify_work);
+ u32 state;
+ u16 i;
+
+ if (!enic_sriov_enabled(enic) || !enic->vf_state)
+ return;
+
+ state = netif_carrier_ok(enic->netdev) ?
+ ENIC_MBOX_LINK_STATE_ENABLE :
+ ENIC_MBOX_LINK_STATE_DISABLE;
+
+ for (i = 0; i < enic->num_vfs; i++)
+ enic_mbox_send_link_state(enic, i, state);
+}
+
static void enic_link_check(struct enic *enic)
{
int link_status = vnic_dev_link_status(enic->vdev);
@@ -420,9 +440,13 @@ static void enic_link_check(struct enic *enic)
netdev_info(enic->netdev, "Link UP\n");
netif_carrier_on(enic->netdev);
enic_set_rx_coal_setting(enic);
+ if (enic_sriov_enabled(enic) && enic->vf_state)
+ schedule_work(&enic->link_notify_work);
} else if (!link_status && carrier_ok) {
netdev_info(enic->netdev, "Link DOWN\n");
netif_carrier_off(enic->netdev);
+ if (enic_sriov_enabled(enic) && enic->vf_state)
+ schedule_work(&enic->link_notify_work);
}
}
@@ -2154,15 +2178,47 @@ static void enic_reset(struct work_struct *work)
/* Stop any activity from infiniband */
enic_set_api_busy(enic, true);
+ /* Fully tear down the V2 admin/MBOX channel before the soft reset.
+ * The reset wipes all hardware queues including the admin WQ/RQ;
+ * closing first tells firmware to stop the admin QP (so it no longer
+ * DMAs from the about-to-be-reset rings) and frees the admin resources
+ * so they are cleanly re-allocated afterwards.
+ */
+ if (enic_sriov_enabled(enic) &&
+ enic->vf_type == ENIC_VF_TYPE_V2)
+ enic_admin_channel_close(enic);
+
enic_stop(enic->netdev);
+
enic_dev_soft_reset(enic);
enic_reset_addr_lists(enic);
enic_init_vnic_resources(enic);
enic_set_rss_nic_cfg(enic);
enic_dev_set_ig_vlan_rewrite_mode(enic);
enic_ext_cq(enic);
+
enic_open(enic->netdev);
+ /* Re-establish the admin/MBOX channel after the data path is back up,
+ * mirroring the SR-IOV enable path (channel open + mbox init). The
+ * channel was fully torn down by enic_admin_channel_close() above.
+ */
+ if (enic_sriov_enabled(enic) &&
+ enic->vf_type == ENIC_VF_TYPE_V2) {
+ if (enic_admin_channel_open(enic)) {
+ netdev_err(enic->netdev,
+ "admin channel reopen after reset failed\n");
+ } else {
+ enic_mbox_init(enic);
+ /* The link came back up during enic_open() above
+ * while MBOX sends were still disabled (channel not
+ * yet reopened), so that link-notify was dropped.
+ * Re-push current link state to registered VFs now.
+ */
+ schedule_work(&enic->link_notify_work);
+ }
+ }
+
/* Allow infiniband to fiddle with the device again */
enic_set_api_busy(enic, false);
@@ -2180,16 +2236,46 @@ static void enic_tx_hang_reset(struct work_struct *work)
/* Stop any activity from infiniband */
enic_set_api_busy(enic, true);
+ /* Fully tear down the V2 admin/MBOX channel before the hang reset, for
+ * the same reason as the soft reset path: stop the admin QP and free
+ * the admin resources before the hardware queues are wiped.
+ */
+ if (enic_sriov_enabled(enic) &&
+ enic->vf_type == ENIC_VF_TYPE_V2)
+ enic_admin_channel_close(enic);
+
enic_dev_hang_notify(enic);
enic_stop(enic->netdev);
+
enic_dev_hang_reset(enic);
enic_reset_addr_lists(enic);
enic_init_vnic_resources(enic);
enic_set_rss_nic_cfg(enic);
enic_dev_set_ig_vlan_rewrite_mode(enic);
enic_ext_cq(enic);
+
enic_open(enic->netdev);
+ /* Re-establish the admin/MBOX channel after the data path is back up,
+ * mirroring the SR-IOV enable path (channel open + mbox init). The
+ * channel was fully torn down by enic_admin_channel_close() above.
+ */
+ if (enic_sriov_enabled(enic) &&
+ enic->vf_type == ENIC_VF_TYPE_V2) {
+ if (enic_admin_channel_open(enic)) {
+ netdev_err(enic->netdev,
+ "admin channel reopen after reset failed\n");
+ } else {
+ enic_mbox_init(enic);
+ /* The link came back up during enic_open() above
+ * while MBOX sends were still disabled (channel not
+ * yet reopened), so that link-notify was dropped.
+ * Re-push current link state to registered VFs now.
+ */
+ schedule_work(&enic->link_notify_work);
+ }
+ }
+
/* Allow infiniband to fiddle with the device again */
enic_set_api_busy(enic, false);
@@ -2200,6 +2286,8 @@ static void enic_tx_hang_reset(struct work_struct *work)
static int enic_set_intr_mode(struct enic *enic)
{
+ unsigned int admin_reserve = enic->has_admin_channel ? 1 : 0;
+ unsigned int min_intr = ENIC_MSIX_MIN_INTR + admin_reserve;
unsigned int i;
int num_intr;
@@ -2210,12 +2298,12 @@ static int enic_set_intr_mode(struct enic *enic)
*/
if (enic->config.intr_mode < 1 &&
- enic->intr_avail >= ENIC_MSIX_MIN_INTR) {
+ enic->intr_avail >= min_intr) {
for (i = 0; i < enic->intr_avail; i++)
enic->msix_entry[i].entry = i;
num_intr = pci_enable_msix_range(enic->pdev, enic->msix_entry,
- ENIC_MSIX_MIN_INTR,
+ min_intr,
enic->intr_avail);
if (num_intr > 0) {
vnic_dev_set_intr_mode(enic->vdev,
@@ -2310,7 +2398,13 @@ static int enic_adjust_resources(struct enic *enic)
enic->cq_count = 2;
enic->intr_count = enic->intr_avail;
break;
- case VNIC_DEV_INTR_MODE_MSIX:
+ case VNIC_DEV_INTR_MODE_MSIX: {
+ /* Reserve one MSI-X slot for the admin channel interrupt
+ * when V2 SR-IOV admin channel resources are present.
+ */
+ unsigned int admin_reserve =
+ enic->has_admin_channel ? 1 : 0;
+
/* Adjust the number of wqs/rqs/cqs/interrupts that will be
* used based on which resource is the most constrained
*/
@@ -2319,7 +2413,8 @@ static int enic_adjust_resources(struct enic *enic)
ENIC_RQ_MIN_DEFAULT);
rq_avail = min3(enic->rq_avail, ENIC_RQ_MAX, rq_default);
max_queues = min(enic->cq_avail,
- enic->intr_avail - ENIC_MSIX_RESERVED_INTR);
+ enic->intr_avail - ENIC_MSIX_RESERVED_INTR -
+ admin_reserve);
if (wq_avail + rq_avail <= max_queues) {
enic->rq_count = rq_avail;
enic->wq_count = wq_avail;
@@ -2337,6 +2432,7 @@ static int enic_adjust_resources(struct enic *enic)
enic->intr_count = enic->cq_count + ENIC_MSIX_RESERVED_INTR;
break;
+ }
default:
dev_err(enic_get_dev(enic), "Unknown interrupt mode\n");
return -EINVAL;
@@ -2689,6 +2785,132 @@ static void enic_sriov_detect_vf_type(struct enic *enic)
enic->vf_type = ENIC_VF_TYPE_NONE;
}
}
+
+static int __maybe_unused
+enic_sriov_v2_enable(struct enic *enic, int num_vfs)
+{
+ int err;
+
+ if (!enic->has_admin_channel) {
+ netdev_err(enic->netdev,
+ "V2 SR-IOV requires admin channel resources\n");
+ return -EOPNOTSUPP;
+ }
+
+ enic->vf_state = kcalloc(num_vfs, sizeof(*enic->vf_state), GFP_KERNEL);
+ if (!enic->vf_state)
+ return -ENOMEM;
+
+ err = enic_admin_channel_open(enic);
+ if (err) {
+ netdev_err(enic->netdev,
+ "Failed to open admin channel: %d\n", err);
+ goto free_vf_state;
+ }
+
+ enic_mbox_init(enic);
+
+ enic->num_vfs = num_vfs;
+
+ err = pci_enable_sriov(enic->pdev, num_vfs);
+ if (err) {
+ netdev_err(enic->netdev,
+ "pci_enable_sriov failed: %d\n", err);
+ goto close_admin;
+ }
+
+ enic->priv_flags |= ENIC_SRIOV_ENABLED;
+ return num_vfs;
+
+close_admin:
+ enic->num_vfs = 0;
+ enic_admin_channel_close(enic);
+free_vf_state:
+ kfree(enic->vf_state);
+ enic->vf_state = NULL;
+ return err;
+}
+
+static void enic_sriov_v2_disable(struct enic *enic)
+{
+ /* Stop new VF link-state broadcasts before tearing down vf_state.
+ * Clearing ENIC_SRIOV_ENABLED makes enic_link_check() (called from
+ * the notify timer/ISR) skip the VF notify path, and cancelling
+ * link_notify_work ensures any already-queued broadcast has finished
+ * before vf_state is freed, closing a use-after-free window.
+ */
+ enic->priv_flags &= ~ENIC_SRIOV_ENABLED;
+ cancel_work_sync(&enic->link_notify_work);
+
+ pci_disable_sriov(enic->pdev);
+ enic_admin_channel_close(enic);
+ kfree(enic->vf_state);
+ enic->vf_state = NULL;
+ enic->num_vfs = 0;
+}
+
+static int __maybe_unused
+enic_sriov_configure(struct pci_dev *pdev, int num_vfs)
+{
+ struct net_device *netdev = pci_get_drvdata(pdev);
+ struct enic *enic = netdev_priv(netdev);
+ struct enic_port_profile *pp;
+ int err;
+
+ if (num_vfs > 0) {
+ if (enic->config.mq_subvnic_count) {
+ netdev_err(netdev,
+ "SR-IOV not supported with multi-queue sub-vnics\n");
+ return -EOPNOTSUPP;
+ }
+
+ if (enic->vf_type == ENIC_VF_TYPE_NONE) {
+ netdev_err(netdev,
+ "SR-IOV not supported on this firmware version\n");
+ return -EOPNOTSUPP;
+ }
+
+ if (enic->vf_type == ENIC_VF_TYPE_V2)
+ return enic_sriov_v2_enable(enic, num_vfs);
+
+ pp = kcalloc(num_vfs, sizeof(*pp), GFP_KERNEL);
+ if (!pp)
+ return -ENOMEM;
+
+ err = pci_enable_sriov(pdev, num_vfs);
+ if (err) {
+ kfree(pp);
+ return err;
+ }
+
+ kfree(enic->pp);
+ enic->pp = pp;
+ enic->num_vfs = num_vfs;
+ enic->priv_flags |= ENIC_SRIOV_ENABLED;
+ return num_vfs;
+ }
+
+ if (!enic_sriov_enabled(enic))
+ return 0;
+
+ if (enic->vf_type == ENIC_VF_TYPE_V2) {
+ enic_sriov_v2_disable(enic);
+ return 0;
+ }
+
+ pp = kzalloc_obj(*enic->pp, GFP_KERNEL);
+ if (!pp)
+ return -ENOMEM;
+
+ pci_disable_sriov(pdev);
+ enic->num_vfs = 0;
+ enic->priv_flags &= ~ENIC_SRIOV_ENABLED;
+
+ kfree(enic->pp);
+ enic->pp = pp;
+
+ return 0;
+}
#endif
static int enic_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
@@ -2787,12 +3009,18 @@ static int enic_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
goto err_out_vnic_unregister;
#ifdef CONFIG_PCI_IOV
- /* Get number of subvnics */
+ enic_sriov_detect_vf_type(enic);
+
+ /* Auto-enable SR-IOV if VFs were pre-configured (e.g. at boot).
+ * V2 VFs require the admin channel, which is not yet set up at probe
+ * time; use sysfs (enic_sriov_configure) to enable V2 SR-IOV instead.
+ */
pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_SRIOV);
if (pos) {
pci_read_config_word(pdev, pos + PCI_SRIOV_TOTAL_VF,
&enic->num_vfs);
- if (enic->num_vfs) {
+ if (enic->num_vfs &&
+ enic->vf_type != ENIC_VF_TYPE_V2) {
err = pci_enable_sriov(pdev, enic->num_vfs);
if (err) {
dev_err(dev, "SRIOV enable failed, aborting."
@@ -2804,7 +3032,6 @@ static int enic_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
num_pps = enic->num_vfs;
}
}
- enic_sriov_detect_vf_type(enic);
#endif
/* Allocate structure for port profiles */
@@ -2881,6 +3108,7 @@ static int enic_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
INIT_WORK(&enic->reset, enic_reset);
INIT_WORK(&enic->tx_hang_reset, enic_tx_hang_reset);
INIT_WORK(&enic->change_mtu_work, enic_change_mtu_work);
+ INIT_WORK(&enic->link_notify_work, enic_link_notify_work_handler);
for (i = 0; i < enic->wq_count; i++)
spin_lock_init(&enic->wq[i].lock);
@@ -3033,14 +3261,16 @@ static void enic_remove(struct pci_dev *pdev)
cancel_work_sync(&enic->reset);
cancel_work_sync(&enic->change_mtu_work);
unregister_netdev(netdev);
- enic_dev_deinit(enic);
- vnic_dev_close(enic->vdev);
#ifdef CONFIG_PCI_IOV
if (enic_sriov_enabled(enic)) {
- pci_disable_sriov(pdev);
- enic->priv_flags &= ~ENIC_SRIOV_ENABLED;
+ if (enic->vf_type == ENIC_VF_TYPE_V2)
+ enic_sriov_v2_disable(enic);
+ else
+ pci_disable_sriov(pdev);
}
#endif
+ enic_dev_deinit(enic);
+ vnic_dev_close(enic->vdev);
kfree(enic->pp);
vnic_dev_unregister(enic->vdev);
enic_iounmap(enic);
diff --git a/drivers/net/ethernet/cisco/enic/enic_mbox.c b/drivers/net/ethernet/cisco/enic/enic_mbox.c
index eb084adae810..b90a112703c1 100644
--- a/drivers/net/ethernet/cisco/enic/enic_mbox.c
+++ b/drivers/net/ethernet/cisco/enic/enic_mbox.c
@@ -614,8 +614,17 @@ int enic_mbox_vf_unregister(struct enic *enic)
void enic_mbox_init(struct enic *enic)
{
+ /* mbox_lock and mbox_comp must be initialized exactly once per
+ * device lifetime; the PF sriov_configure path can re-enter this
+ * on each enable cycle where these primitives are already set up.
+ */
+ if (!enic->mbox_initialized) {
+ mutex_init(&enic->mbox_lock);
+ init_completion(&enic->mbox_comp);
+ enic->mbox_initialized = true;
+ } else {
+ reinit_completion(&enic->mbox_comp);
+ }
enic->mbox_msg_num = 0;
- mutex_init(&enic->mbox_lock);
- init_completion(&enic->mbox_comp);
enic->admin_rq_handler = enic_mbox_recv_handler;
}
diff --git a/drivers/net/ethernet/cisco/enic/enic_pp.c b/drivers/net/ethernet/cisco/enic/enic_pp.c
index 4720a952725d..3f611e240c25 100644
--- a/drivers/net/ethernet/cisco/enic/enic_pp.c
+++ b/drivers/net/ethernet/cisco/enic/enic_pp.c
@@ -25,6 +25,11 @@ int enic_is_valid_pp_vf(struct enic *enic, int vf, int *err)
if (vf != PORT_SELF_VF) {
#ifdef CONFIG_PCI_IOV
if (enic_sriov_enabled(enic)) {
+ /* V2 SR-IOV uses MBOX, not port profiles */
+ if (enic->vf_type == ENIC_VF_TYPE_V2) {
+ *err = -EOPNOTSUPP;
+ goto err_out;
+ }
if (vf < 0 || vf >= enic->num_vfs) {
*err = -EINVAL;
goto err_out;
diff --git a/drivers/net/ethernet/cisco/enic/enic_res.c b/drivers/net/ethernet/cisco/enic/enic_res.c
index 2b7545d6a67f..436326ace049 100644
--- a/drivers/net/ethernet/cisco/enic/enic_res.c
+++ b/drivers/net/ethernet/cisco/enic/enic_res.c
@@ -59,6 +59,7 @@ int enic_get_vnic_config(struct enic *enic)
GET_CONFIG(intr_timer_usec);
GET_CONFIG(loop_tag);
GET_CONFIG(num_arfs);
+ GET_CONFIG(mq_subvnic_count);
GET_CONFIG(max_rq_ring);
GET_CONFIG(max_wq_ring);
GET_CONFIG(max_cq_ring);
diff --git a/drivers/net/ethernet/cisco/enic/vnic_enet.h b/drivers/net/ethernet/cisco/enic/vnic_enet.h
index 9e8e86262a3f..519d2969990b 100644
--- a/drivers/net/ethernet/cisco/enic/vnic_enet.h
+++ b/drivers/net/ethernet/cisco/enic/vnic_enet.h
@@ -21,7 +21,9 @@ struct vnic_enet_config {
u16 loop_tag;
u16 vf_rq_count;
u16 num_arfs;
- u8 reserved[66];
+ u8 reserved1[32];
+ u16 mq_subvnic_count;
+ u8 reserved2[32];
u32 max_rq_ring; // MAX RQ ring size
u32 max_wq_ring; // MAX WQ ring size
u32 max_cq_ring; // MAX CQ ring size
--
2.43.0
* [PATCH net-next v9 04/10] enic: add admin CQ service with MSI-X interrupt and workqueue polling
From: Satish Kharat @ 2026-06-18 1:53 UTC (permalink / raw)
To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni
Cc: netdev, linux-kernel, Sesidhar Baddela, Satish Kharat
In-Reply-To: <20260617-enic-sriov-v2-admin-channel-v2-v9-0-37f5f5af4c93@cisco.com>
Add completion queue (CQ) service for the admin channel work queue
(WQ) and receive queue (RQ), driven by a dedicated MSI-X interrupt
and a workqueue-based CQ poller.
The admin WQ CQ service advances the completion ring and returns the
number of descriptors consumed. The admin RQ CQ service does the
same for receive completions and copies each received message into a
preallocated buffer. Received messages are enqueued for deferred
dispatch by a separate work_struct so the CQ poller stays short.
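
The completion-ring walk can be sketched in userspace as follows. This is a simplified model, assuming the color flag sits in the top bit of the last descriptor byte and that the expected color flips each time the ring wraps; RING_SIZE, DESC_SIZE, struct cq_model and the cq_* helpers are hypothetical names, not the vnic_cq API.

```c
#include <assert.h>
#include <stdint.h>

#define RING_SIZE    4u  /* hypothetical ring depth for the sketch */
#define DESC_SIZE    16u /* hypothetical descriptor size in bytes */
#define COLOR_SHIFT  7   /* assumption: color is the top bit of the last byte */
#define COLOR_MASK   1u

struct cq_model {
	uint8_t desc[RING_SIZE][DESC_SIZE];
	unsigned int to_clean;    /* next descriptor the driver will look at */
	unsigned int last_color;  /* color of descriptors already consumed */
};

static unsigned int cq_color(const uint8_t *desc)
{
	return (desc[DESC_SIZE - 1] >> COLOR_SHIFT) & COLOR_MASK;
}

/* Device side of the model: write one completion with the given color. */
static void cq_post(struct cq_model *cq, unsigned int idx, unsigned int color)
{
	assert(idx < RING_SIZE);
	cq->desc[idx][DESC_SIZE - 1] =
		(uint8_t)((color & COLOR_MASK) << COLOR_SHIFT);
}

/* Driver side: consume every descriptor whose color differs from
 * last_color and return how many were consumed.
 */
static unsigned int cq_service(struct cq_model *cq)
{
	unsigned int work = 0;

	while (cq_color(cq->desc[cq->to_clean]) != cq->last_color) {
		if (++cq->to_clean == RING_SIZE) {
			cq->to_clean = 0;
			cq->last_color ^= 1u;  /* expected color flips on wrap */
		}
		work++;
	}
	return work;
}
```

Because the expected color flips on wrap, stale descriptors from the previous pass are never mistaken for new completions, and the loop stops on its own once the ring is drained.
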
When the MSI-X interrupt fires, the ISR schedules the CQ poll
work_struct. The work handler drains all pending completions, kicks
message dispatch if work was done, and returns credits to unmask the
interrupt.
Log a rate-limited warning when admin RQ buffer refill fails so that
transient memory pressure is visible without flooding the log.
Signed-off-by: Satish Kharat <satishkh@cisco.com>
---
drivers/net/ethernet/cisco/enic/enic.h | 7 +
drivers/net/ethernet/cisco/enic/enic_admin.c | 282 ++++++++++++++++++++++++++-
drivers/net/ethernet/cisco/enic/enic_admin.h | 12 ++
3 files changed, 297 insertions(+), 4 deletions(-)
diff --git a/drivers/net/ethernet/cisco/enic/enic.h b/drivers/net/ethernet/cisco/enic/enic.h
index 08472420f3a1..42f2ac3df212 100644
--- a/drivers/net/ethernet/cisco/enic/enic.h
+++ b/drivers/net/ethernet/cisco/enic/enic.h
@@ -296,6 +296,13 @@ struct enic {
struct vnic_rq admin_rq;
struct vnic_cq admin_cq[2];
struct vnic_intr admin_intr;
+ struct work_struct admin_poll_work;
+ unsigned int admin_intr_index;
+ struct work_struct admin_msg_work;
+ spinlock_t admin_msg_lock; /* protects admin_msg_list */
+ struct list_head admin_msg_list;
+ void (*admin_rq_handler)(struct enic *enic, void *buf,
+ unsigned int len);
};
static inline struct net_device *vnic_get_netdev(struct vnic_dev *vdev)
diff --git a/drivers/net/ethernet/cisco/enic/enic_admin.c b/drivers/net/ethernet/cisco/enic/enic_admin.c
index b28fc6c656cc..ec85fd446d02 100644
--- a/drivers/net/ethernet/cisco/enic/enic_admin.c
+++ b/drivers/net/ethernet/cisco/enic/enic_admin.c
@@ -4,6 +4,7 @@
#include <linux/kernel.h>
#include <linux/netdevice.h>
#include <linux/dma-mapping.h>
+#include <linux/interrupt.h>
#include "vnic_dev.h"
#include "vnic_wq.h"
@@ -15,6 +16,7 @@
#include "enic.h"
#include "enic_admin.h"
#include "cq_desc.h"
+#include "cq_enet_desc.h"
#include "wq_enet_desc.h"
#include "rq_enet_desc.h"
@@ -94,6 +96,226 @@ static void enic_admin_rq_drain(struct enic *enic)
vnic_rq_clean(&enic->admin_rq, enic_admin_rq_buf_clean);
}
+static unsigned int enic_admin_cq_color(void *cq_desc, unsigned int desc_size)
+{
+ u8 type_color = *((u8 *)cq_desc + desc_size - 1);
+
+ return (type_color >> CQ_DESC_COLOR_SHIFT) & CQ_DESC_COLOR_MASK;
+}
+
+unsigned int enic_admin_wq_cq_service(struct enic *enic)
+{
+ struct vnic_cq *cq = &enic->admin_cq[0];
+ unsigned int work = 0;
+ void *desc;
+
+ desc = vnic_cq_to_clean(cq);
+ while (enic_admin_cq_color(desc, cq->ring.desc_size) !=
+ cq->last_color) {
+ vnic_cq_inc_to_clean(cq);
+ work++;
+ desc = vnic_cq_to_clean(cq);
+ }
+
+ return work;
+}
+
+static void enic_admin_msg_enqueue(struct enic *enic, void *buf,
+ unsigned int len)
+{
+ struct enic_admin_msg *msg;
+
+ msg = kmalloc(struct_size(msg, data, len), GFP_KERNEL);
+ if (!msg)
+ return;
+
+ msg->len = len;
+ memcpy(msg->data, buf, len);
+
+ spin_lock(&enic->admin_msg_lock);
+ list_add_tail(&msg->list, &enic->admin_msg_list);
+ spin_unlock(&enic->admin_msg_lock);
+}
+
+unsigned int enic_admin_rq_cq_service(struct enic *enic)
+{
+ struct vnic_cq *cq = &enic->admin_cq[1];
+ struct vnic_rq *rq = &enic->admin_rq;
+ struct cq_enet_rq_desc *rq_desc;
+ struct vnic_rq_buf *buf;
+ u16 bwf, bytes_written;
+ unsigned int work = 0;
+ void *desc;
+
+ desc = vnic_cq_to_clean(cq);
+ while (enic_admin_cq_color(desc, cq->ring.desc_size) !=
+ cq->last_color) {
+ /* Ensure DMA descriptor fields are read after
+ * the color/valid check. dma_rmb() is the
+ * correct barrier for DMA-written descriptors.
+ */
+ dma_rmb();
+ buf = rq->to_clean;
+
+ /* Decode the actual number of bytes hardware wrote into
+ * the RX buffer. buf->len is the static allocation size
+ * (ENIC_ADMIN_BUF_SIZE) and would expose uninitialised
+ * heap memory beyond the real payload. bytes_written_flags
+ * is at the same offset in every cq_enet_rq_desc[_32|_64]
+ * variant.
+ */
+ rq_desc = desc;
+ bwf = le16_to_cpu(rq_desc->bytes_written_flags);
+ bytes_written = bwf & CQ_ENET_RQ_DESC_BYTES_WRITTEN_MASK;
+ if (bytes_written > buf->len)
+ goto next_desc;
+
+ dma_sync_single_for_cpu(&enic->pdev->dev,
+ buf->dma_addr, buf->len,
+ DMA_FROM_DEVICE);
+
+ /* Drop on hardware error indications. Admin messages
+ * are internal to the VIC, not received over the wire.
+ * Firmware sets TRUNCATED when the message does not fit
+ * in the posted buffer, and FCS_OK is always set on
+ * healthy admin completions.
+ */
+ if (bwf & CQ_ENET_RQ_DESC_FLAGS_TRUNCATED) {
+ netdev_warn_once(enic->netdev,
+ "admin RQ: truncated message dropped\n");
+ goto next_desc;
+ }
+ if (!(rq_desc->flags & CQ_ENET_RQ_DESC_FLAGS_FCS_OK)) {
+ netdev_warn_once(enic->netdev,
+ "admin RQ: bad FCS, dropping message\n");
+ goto next_desc;
+ }
+
+ enic_admin_msg_enqueue(enic, buf->os_buf, bytes_written);
+
+next_desc:
+ enic_admin_rq_buf_clean(rq, rq->to_clean);
+ rq->to_clean = rq->to_clean->next;
+ rq->ring.desc_avail++;
+
+ vnic_cq_inc_to_clean(cq);
+ work++;
+ desc = vnic_cq_to_clean(cq);
+ }
+
+ if (enic_admin_rq_fill(enic, GFP_KERNEL) && net_ratelimit())
+ netdev_warn(enic->netdev,
+ "admin RQ refill failed\n");
+
+ return work;
+}
+
+static irqreturn_t enic_admin_isr_msix(int irq, void *data)
+{
+ struct enic *enic = data;
+
+ schedule_work(&enic->admin_poll_work);
+
+ return IRQ_HANDLED;
+}
+
+static void enic_admin_msg_work_handler(struct work_struct *work)
+{
+ struct enic *enic = container_of(work, struct enic, admin_msg_work);
+ struct enic_admin_msg *msg, *tmp;
+ LIST_HEAD(local_list);
+
+ spin_lock_bh(&enic->admin_msg_lock);
+ list_splice_init(&enic->admin_msg_list, &local_list);
+ spin_unlock_bh(&enic->admin_msg_lock);
+
+ list_for_each_entry_safe(msg, tmp, &local_list, list) {
+ if (enic->admin_rq_handler)
+ enic->admin_rq_handler(enic, msg->data, msg->len);
+ list_del(&msg->list);
+ kfree(msg);
+ }
+}
+
+static void enic_admin_poll_work_handler(struct work_struct *work)
+{
+ struct enic *enic = container_of(work, struct enic, admin_poll_work);
+ unsigned int credits;
+ unsigned int rq_work;
+
+ credits = vnic_intr_credits(&enic->admin_intr);
+
+ rq_work = enic_admin_rq_cq_service(enic);
+
+ if (rq_work > 0)
+ schedule_work(&enic->admin_msg_work);
+
+ vnic_intr_return_credits(&enic->admin_intr,
+ credits ?: 1,
+ 1 /* unmask */, 0);
+}
+
+static int enic_admin_setup_intr(struct enic *enic)
+{
+ unsigned int intr_index = enic->intr_count;
+ int err;
+
+ if (vnic_dev_get_intr_mode(enic->vdev) != VNIC_DEV_INTR_MODE_MSIX ||
+ intr_index >= enic->intr_avail)
+ return -ENODEV;
+
+ /* The admin INTR uses a slot in the same RES_TYPE_INTR_CTRL
+ * strided array of per-vector control blocks (mask, coalescing
+ * timer, credit return) that the data-path IRQs occupy in BAR0.
+ * vnic_intr_alloc() defaults to RES_TYPE_INTR_CTRL, which is what
+ * we want here.
+ */
+ err = vnic_intr_alloc(enic->vdev, &enic->admin_intr, intr_index);
+ if (err) {
+ netdev_warn(enic->netdev,
+ "Failed to alloc admin intr at index %u: %d\n",
+ intr_index, err);
+ return err;
+ }
+
+ enic->admin_intr_index = intr_index;
+
+ snprintf(enic->msix[intr_index].devname,
+ sizeof(enic->msix[intr_index].devname),
+ "%s-admin", enic->netdev->name);
+ enic->msix[intr_index].isr = enic_admin_isr_msix;
+ enic->msix[intr_index].devid = enic;
+
+ err = request_irq(enic->msix_entry[intr_index].vector,
+ enic->msix[intr_index].isr, 0,
+ enic->msix[intr_index].devname,
+ enic->msix[intr_index].devid);
+ if (err) {
+ netdev_warn(enic->netdev,
+ "Failed to request admin MSI-X irq: %d\n", err);
+ vnic_intr_free(&enic->admin_intr);
+ return err;
+ }
+
+ enic->msix[intr_index].requested = 1;
+
+ netdev_dbg(enic->netdev,
+ "admin channel using MSI-X interrupt (index %u)\n",
+ intr_index);
+
+ return 0;
+}
+
+static void enic_admin_teardown_intr(struct enic *enic)
+{
+ unsigned int intr_index = enic->admin_intr_index;
+
+ free_irq(enic->msix_entry[intr_index].vector,
+ enic->msix[intr_index].devid);
+ cancel_work_sync(&enic->admin_poll_work);
+ enic->msix[intr_index].requested = 0;
+}
+
static int enic_admin_qp_type_set(struct enic *enic, u32 enable)
{
u64 a0 = QP_TYPE_ADMIN, a1 = enable;
@@ -173,6 +395,7 @@ static int enic_admin_alloc_resources(struct enic *enic)
static void enic_admin_free_resources(struct enic *enic)
{
+ vnic_intr_free(&enic->admin_intr);
vnic_cq_free(&enic->admin_cq[1]);
vnic_cq_free(&enic->admin_cq[0]);
vnic_rq_free(&enic->admin_rq);
@@ -181,6 +404,8 @@ static void enic_admin_free_resources(struct enic *enic)
static void enic_admin_init_resources(struct enic *enic)
{
+ unsigned int intr_offset = enic->admin_intr_index;
+
vnic_wq_init(&enic->admin_wq,
0, 0, 0); /* cq_index, err_intr_enable, err_intr_offset */
vnic_rq_init(&enic->admin_rq,
@@ -189,20 +414,34 @@ static void enic_admin_init_resources(struct enic *enic)
VNIC_CQ_FC_DISABLE,
VNIC_CQ_COLOR_ENABLE,
0, 0, 1, /* cq_head, cq_tail, cq_tail_color */
- VNIC_CQ_INTR_DISABLE,
+ VNIC_CQ_INTR_DISABLE, /* polled synchronously by mbox send */
VNIC_CQ_ENTRY_ENABLE,
VNIC_CQ_MSG_DISABLE,
- 0, /* interrupt_offset */
+ intr_offset,
0 /* cq_message_addr */);
vnic_cq_init(&enic->admin_cq[1],
VNIC_CQ_FC_DISABLE,
VNIC_CQ_COLOR_ENABLE,
0, 0, 1, /* cq_head, cq_tail, cq_tail_color */
- VNIC_CQ_INTR_DISABLE,
+ VNIC_CQ_INTR_ENABLE,
VNIC_CQ_ENTRY_ENABLE,
VNIC_CQ_MSG_DISABLE,
- 0, /* interrupt_offset */
+ intr_offset,
0 /* cq_message_addr */);
+ vnic_intr_init(&enic->admin_intr,
+ 0, 0, 1); /* coalescing_timer, coalescing_type, mask_on_assertion */
+}
+
+static void enic_admin_msg_drain(struct enic *enic)
+{
+ struct enic_admin_msg *msg, *tmp;
+
+ spin_lock_bh(&enic->admin_msg_lock);
+ list_for_each_entry_safe(msg, tmp, &enic->admin_msg_list, list) {
+ list_del(&msg->list);
+ kfree(msg);
+ }
+ spin_unlock_bh(&enic->admin_msg_lock);
}
int enic_admin_channel_open(struct enic *enic)
@@ -220,6 +459,19 @@ int enic_admin_channel_open(struct enic *enic)
return err;
}
+ spin_lock_init(&enic->admin_msg_lock);
+ INIT_LIST_HEAD(&enic->admin_msg_list);
+ INIT_WORK(&enic->admin_msg_work, enic_admin_msg_work_handler);
+ INIT_WORK(&enic->admin_poll_work, enic_admin_poll_work_handler);
+
+ err = enic_admin_setup_intr(enic);
+ if (err) {
+ netdev_err(enic->netdev,
+ "Admin channel requires MSI-X, SR-IOV unavailable: %d\n",
+ err);
+ goto free_resources;
+ }
+
enic_admin_init_resources(enic);
vnic_wq_enable(&enic->admin_wq);
@@ -239,15 +491,29 @@ int enic_admin_channel_open(struct enic *enic)
goto disable_queues;
}
+ vnic_intr_unmask(&enic->admin_intr);
+
+ netdev_dbg(enic->netdev,
+ "admin channel open: intr=%u wq_avail=%u rq_avail=%u cq0_color=%u cq1_color=%u\n",
+ enic->admin_intr_index,
+ vnic_wq_desc_avail(&enic->admin_wq),
+ vnic_rq_desc_avail(&enic->admin_rq),
+ enic->admin_cq[0].last_color,
+ enic->admin_cq[1].last_color);
+
return 0;
disable_queues:
+ enic_admin_teardown_intr(enic);
enic_admin_qp_type_set(enic, QP_DISABLE);
if (vnic_wq_disable(&enic->admin_wq))
netdev_warn(enic->netdev, "Failed to disable admin WQ\n");
if (vnic_rq_disable(&enic->admin_rq))
netdev_warn(enic->netdev, "Failed to disable admin RQ\n");
+ cancel_work_sync(&enic->admin_msg_work);
+ enic_admin_msg_drain(enic);
enic_admin_rq_drain(enic);
+free_resources:
enic_admin_free_resources(enic);
return err;
}
@@ -259,6 +525,13 @@ void enic_admin_channel_close(struct enic *enic)
if (!enic->has_admin_channel)
return;
+ netdev_dbg(enic->netdev, "admin channel close\n");
+
+ vnic_intr_mask(&enic->admin_intr);
+ enic_admin_teardown_intr(enic);
+ cancel_work_sync(&enic->admin_msg_work);
+ enic_admin_msg_drain(enic);
+
enic_admin_qp_type_set(enic, QP_DISABLE);
err = vnic_wq_disable(&enic->admin_wq);
@@ -274,5 +547,6 @@ void enic_admin_channel_close(struct enic *enic)
enic_admin_rq_drain(enic);
vnic_cq_clean(&enic->admin_cq[0]);
vnic_cq_clean(&enic->admin_cq[1]);
+ vnic_intr_clean(&enic->admin_intr);
enic_admin_free_resources(enic);
}
diff --git a/drivers/net/ethernet/cisco/enic/enic_admin.h b/drivers/net/ethernet/cisco/enic/enic_admin.h
index 569aadeb9312..62c80220b0ca 100644
--- a/drivers/net/ethernet/cisco/enic/enic_admin.h
+++ b/drivers/net/ethernet/cisco/enic/enic_admin.h
@@ -9,7 +9,19 @@
struct enic;
+/* Wrapper for received admin messages queued for deferred processing.
+ * The admin CQ poll work handler enqueues these; a separate work handler
+ * processes them where sleeping (mutex, GFP_KERNEL) is safe.
+ */
+struct enic_admin_msg {
+ struct list_head list;
+ unsigned int len;
+ u8 data[] __aligned(8);
+};
+
int enic_admin_channel_open(struct enic *enic);
void enic_admin_channel_close(struct enic *enic);
+unsigned int enic_admin_wq_cq_service(struct enic *enic);
+unsigned int enic_admin_rq_cq_service(struct enic *enic);
#endif /* _ENIC_ADMIN_H_ */
--
2.43.0
* [PATCH net-next v9 07/10] enic: add MBOX PF handlers for VF register and capability
From: Satish Kharat @ 2026-06-18 1:53 UTC (permalink / raw)
To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni
Cc: netdev, linux-kernel, Sesidhar Baddela, Satish Kharat
In-Reply-To: <20260617-enic-sriov-v2-admin-channel-v2-v9-0-37f5f5af4c93@cisco.com>
Implement PF-side mailbox message processing for SR-IOV V2
admin channel communication.
When the PF receives messages from VFs, the dispatch routes
them to type-specific handlers:
- VF_CAPABILITY_REQUEST: reply with protocol version 1
- VF_REGISTER_REQUEST: send the register reply, mark the
VF registered on success, then send PF_LINK_STATE_NOTIF
reflecting the PF's current carrier state
- VF_UNREGISTER_REQUEST: mark VF unregistered, send reply
- PF_LINK_STATE_ACK: log errors from VF acknowledgment
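
The dispatch listed above can be modelled in a few lines of userspace C. Everything here is a hedged sketch: the MODEL_* message types, struct pf_model and model_pf_dispatch() are hypothetical, the enum values are not the wire values used by the series, and "sending" a reply is reduced to updating counters.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical message types; the real values are defined by the series. */
enum model_msg_type {
	MODEL_VF_CAPABILITY_REQUEST,
	MODEL_VF_REGISTER_REQUEST,
	MODEL_VF_UNREGISTER_REQUEST,
	MODEL_PF_LINK_STATE_ACK,
};

#define MODEL_MAX_VFS 8

struct pf_model {
	uint16_t num_vfs;
	bool registered[MODEL_MAX_VFS];
	bool carrier_ok;
	int reply_version;     /* version carried by the last capability reply */
	int link_notifs_sent;  /* link state notifications "sent" so far */
	int last_link_state;   /* 1 = up, 0 = down, as last notified */
};

static int model_pf_dispatch(struct pf_model *pf, enum model_msg_type type,
			     uint16_t vf_id)
{
	assert(pf->num_vfs <= MODEL_MAX_VFS);
	if (vf_id >= pf->num_vfs)
		return -EINVAL;   /* bounds check before touching per-VF state */

	switch (type) {
	case MODEL_VF_CAPABILITY_REQUEST:
		pf->reply_version = 1;
		return 0;
	case MODEL_VF_REGISTER_REQUEST:
		pf->registered[vf_id] = true;
		/* Report the PF's actual carrier state, not "always up". */
		pf->last_link_state = pf->carrier_ok ? 1 : 0;
		pf->link_notifs_sent++;
		return 0;
	case MODEL_VF_UNREGISTER_REQUEST:
		pf->registered[vf_id] = false;
		return 0;
	case MODEL_PF_LINK_STATE_ACK:
		return 0;         /* the driver only logs errors here */
	}
	return -EOPNOTSUPP;
}
```

The model keeps the two properties the commit message calls out: the VF ID is bounds-checked against num_vfs before any per-VF access, and registration is followed by a link notification that mirrors the current carrier state.
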
Per-VF state (struct enic_vf_state) is tracked via enic->vf_state,
which will be allocated when SR-IOV V2 is enabled.
Remove the CONFIG_PCI_IOV guard from num_vfs in struct enic. The
PF handlers reference enic->num_vfs for VF ID bounds checking in
enic_mbox.c, which is compiled unconditionally. The field must be
visible regardless of CONFIG_PCI_IOV to avoid build failures.
Add enic_mbox_send_link_state() helper for PF-initiated link
state notifications, also used later by ndo_set_vf_link_state.
Signed-off-by: Satish Kharat <satishkh@cisco.com>
---
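For readers following the receive path, the length check and per-type dispatch described above can be sketched in user space roughly as follows. This is only an illustration: the header layout, the numeric message type values and every `sketch_` name are assumptions rather than the driver's definitions, endianness conversion is omitted, and the replies the PF sends are not modelled.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical message types; the real values live in enic_mbox.h. */
enum sketch_msg_type {
	SKETCH_VF_CAPABILITY_REQUEST = 1,
	SKETCH_VF_REGISTER_REQUEST,
	SKETCH_VF_UNREGISTER_REQUEST,
};

/* Hypothetical header holding only the fields the dispatch needs. */
struct sketch_hdr {
	uint16_t src_vnic_id;
	uint16_t msg_len;	/* header plus payload, in bytes */
	uint8_t msg_type;
};

struct sketch_pf {
	uint16_t num_vfs;
	bool *registered;	/* one flag per VF, NULL while SR-IOV is off */
};

static int sketch_pf_recv(struct sketch_pf *pf, const void *buf, size_t len)
{
	const struct sketch_hdr *hdr = buf;

	/* The buffer must hold a header, and msg_len must fit in it. */
	if (len < sizeof(*hdr))
		return -EINVAL;
	if (hdr->msg_len < sizeof(*hdr) || hdr->msg_len > len)
		return -EINVAL;

	/* Bounds-check the sender before indexing per-VF state. */
	if (!pf->registered || hdr->src_vnic_id >= pf->num_vfs)
		return -EINVAL;

	switch (hdr->msg_type) {
	case SKETCH_VF_CAPABILITY_REQUEST:
		return 0;
	case SKETCH_VF_REGISTER_REQUEST:
		pf->registered[hdr->src_vnic_id] = true;
		return 0;
	case SKETCH_VF_UNREGISTER_REQUEST:
		pf->registered[hdr->src_vnic_id] = false;
		return 0;
	default:
		return -EOPNOTSUPP;
	}
}
```

Unlike the sketch, the patch itself sends the register reply first and marks the VF registered only if that send succeeds.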
drivers/net/ethernet/cisco/enic/enic.h | 7 +-
drivers/net/ethernet/cisco/enic/enic_mbox.c | 190 +++++++++++++++++++++++++++-
drivers/net/ethernet/cisco/enic/enic_mbox.h | 1 +
3 files changed, 194 insertions(+), 4 deletions(-)
diff --git a/drivers/net/ethernet/cisco/enic/enic.h b/drivers/net/ethernet/cisco/enic/enic.h
index 1d6a88d7f8ac..cace8e04e9ce 100644
--- a/drivers/net/ethernet/cisco/enic/enic.h
+++ b/drivers/net/ethernet/cisco/enic/enic.h
@@ -256,9 +256,7 @@ struct enic {
struct enic_rx_coal rx_coalesce_setting;
u32 rx_coalesce_usecs;
u32 tx_coalesce_usecs;
-#ifdef CONFIG_PCI_IOV
u16 num_vfs;
-#endif
enum enic_vf_type vf_type;
unsigned int enable_count;
spinlock_t enic_api_lock;
@@ -309,6 +307,11 @@ struct enic {
/* MBOX protocol state — mbox_lock serializes admin WQ sends */
struct mutex mbox_lock;
u64 mbox_msg_num;
+
+ /* PF: per-VF MBOX state, allocated when SRIOV V2 is enabled */
+ struct enic_vf_state {
+ bool registered;
+ } *vf_state;
};
static inline struct net_device *vnic_get_netdev(struct vnic_dev *vdev)
diff --git a/drivers/net/ethernet/cisco/enic/enic_mbox.c b/drivers/net/ethernet/cisco/enic/enic_mbox.c
index 3709704bee02..b6f05b03ae26 100644
--- a/drivers/net/ethernet/cisco/enic/enic_mbox.c
+++ b/drivers/net/ethernet/cisco/enic/enic_mbox.c
@@ -135,10 +135,183 @@ int enic_mbox_send_msg(struct enic *enic, u8 msg_type, u16 dst_vnic_id,
return err;
}
+int enic_mbox_send_link_state(struct enic *enic, u16 vf_id, u32 link_state)
+{
+ struct enic_mbox_pf_link_state_notif_msg notif = {};
+
+ if (!enic->vf_state || vf_id >= enic->num_vfs ||
+ !enic->vf_state[vf_id].registered) {
+ netdev_dbg(enic->netdev,
+ "MBOX: skip link state to unregistered VF %u\n",
+ vf_id);
+ return 0;
+ }
+
+ notif.link_state = cpu_to_le32(link_state);
+ return enic_mbox_send_msg(enic, ENIC_MBOX_PF_LINK_STATE_NOTIF, vf_id,
+ &notif, sizeof(notif));
+}
+
+static int enic_mbox_pf_handle_capability(struct enic *enic, void *msg,
+ u16 vf_id, u64 msg_num)
+{
+ struct enic_mbox_vf_capability_reply_msg reply = {};
+
+ reply.reply.ret_major = cpu_to_le16(0);
+ reply.version = cpu_to_le32(ENIC_MBOX_CAP_VERSION_1);
+
+ return enic_mbox_send_msg(enic, ENIC_MBOX_VF_CAPABILITY_REPLY, vf_id,
+ &reply, sizeof(reply));
+}
+
+static int enic_mbox_pf_handle_register(struct enic *enic, void *msg,
+ u16 vf_id, u64 msg_num)
+{
+ struct enic_mbox_vf_register_reply_msg reply = {};
+ u32 link_state;
+ int err;
+
+ if (!enic->vf_state || vf_id >= enic->num_vfs) {
+ if (net_ratelimit())
+ netdev_warn(enic->netdev,
+ "MBOX: register from invalid VF %u\n",
+ vf_id);
+ return -EINVAL;
+ }
+
+ /* VF re-registering (e.g. guest reboot without clean unregister):
+ * mark the previous registration inactive before accepting the new one.
+ */
+ if (enic->vf_state[vf_id].registered) {
+ netdev_dbg(enic->netdev,
+ "MBOX: VF %u re-register, cleaning previous state\n",
+ vf_id);
+ enic->vf_state[vf_id].registered = false;
+ }
+
+ reply.reply.ret_major = cpu_to_le16(0);
+ err = enic_mbox_send_msg(enic, ENIC_MBOX_VF_REGISTER_REPLY, vf_id,
+ &reply, sizeof(reply));
+ if (err)
+ return err;
+
+ enic->vf_state[vf_id].registered = true;
+ if (net_ratelimit())
+ netdev_info(enic->netdev, "VF %u registered via MBOX\n", vf_id);
+
+ link_state = netif_carrier_ok(enic->netdev) ?
+ ENIC_MBOX_LINK_STATE_ENABLE :
+ ENIC_MBOX_LINK_STATE_DISABLE;
+ err = enic_mbox_send_link_state(enic, vf_id, link_state);
+ if (err && net_ratelimit())
+ netdev_warn(enic->netdev,
+ "VF %u: failed to send initial link state: %d\n",
+ vf_id, err);
+ /* Registration succeeded; initial link state notification sent
+ * above. Subsequent link state changes are sent from the PF
+ * when enic_link_check() detects carrier changes.
+ */
+ return 0;
+}
+
+static int enic_mbox_pf_handle_unregister(struct enic *enic, void *msg,
+ u16 vf_id, u64 msg_num)
+{
+ struct enic_mbox_vf_register_reply_msg reply = {};
+ int err;
+
+ if (!enic->vf_state || vf_id >= enic->num_vfs) {
+ if (net_ratelimit())
+ netdev_warn(enic->netdev,
+ "MBOX: unregister from invalid VF %u\n",
+ vf_id);
+ return -EINVAL;
+ }
+
+ /* VF is unloading; clear local state regardless of whether
+ * the reply is successfully delivered to avoid the PF treating
+ * a dead VF as still registered.
+ */
+ enic->vf_state[vf_id].registered = false;
+
+ reply.reply.ret_major = cpu_to_le16(0);
+ err = enic_mbox_send_msg(enic, ENIC_MBOX_VF_UNREGISTER_REPLY, vf_id,
+ &reply, sizeof(reply));
+
+ if (net_ratelimit())
+ netdev_info(enic->netdev,
+ "VF %u unregistered via MBOX\n", vf_id);
+
+ return err;
+}
+
+static void enic_mbox_pf_process_msg(struct enic *enic,
+ struct enic_mbox_hdr *hdr, void *payload)
+{
+ u16 vf_id = le16_to_cpu(hdr->src_vnic_id);
+ u16 msg_len = le16_to_cpu(hdr->msg_len);
+ int err = 0;
+
+ if (!enic->vf_state) {
+ netdev_dbg(enic->netdev,
+ "MBOX: PF received msg but SRIOV not active\n");
+ return;
+ }
+
+ if (vf_id >= enic->num_vfs) {
+ if (net_ratelimit())
+ netdev_warn(enic->netdev,
+ "MBOX: PF received msg from invalid VF %u\n",
+ vf_id);
+ return;
+ }
+
+ switch (hdr->msg_type) {
+ case ENIC_MBOX_VF_CAPABILITY_REQUEST:
+ err = enic_mbox_pf_handle_capability(enic, payload, vf_id,
+ le64_to_cpu(hdr->msg_num));
+ break;
+ case ENIC_MBOX_VF_REGISTER_REQUEST:
+ err = enic_mbox_pf_handle_register(enic, payload, vf_id,
+ le64_to_cpu(hdr->msg_num));
+ break;
+ case ENIC_MBOX_VF_UNREGISTER_REQUEST:
+ err = enic_mbox_pf_handle_unregister(enic, payload, vf_id,
+ le64_to_cpu(hdr->msg_num));
+ break;
+ case ENIC_MBOX_PF_LINK_STATE_ACK: {
+ struct enic_mbox_pf_link_state_ack_msg *ack = payload;
+
+ if (msg_len < sizeof(*hdr) + sizeof(*ack))
+ break;
+ if (le16_to_cpu(ack->ack.ret_major) && net_ratelimit())
+ netdev_warn(enic->netdev,
+ "MBOX: VF %u link state ACK error %u/%u\n",
+ vf_id,
+ le16_to_cpu(ack->ack.ret_major),
+ le16_to_cpu(ack->ack.ret_minor));
+ break;
+ }
+ default:
+ netdev_dbg(enic->netdev,
+ "MBOX: PF unhandled msg type %u from VF %u\n",
+ hdr->msg_type, vf_id);
+ err = -EOPNOTSUPP;
+ break;
+ }
+
+ if (err && net_ratelimit())
+ netdev_warn(enic->netdev,
+ "MBOX: PF handler for msg type %u from VF %u failed: %d\n",
+ hdr->msg_type, vf_id, err);
+}
+
static void enic_mbox_recv_handler(struct enic *enic, void *buf,
unsigned int len)
{
struct enic_mbox_hdr *hdr = buf;
+ void *payload;
+ u16 msg_len;
if (len < sizeof(*hdr)) {
if (net_ratelimit())
@@ -156,10 +329,23 @@ static void enic_mbox_recv_handler(struct enic *enic, void *buf,
return;
}
+ msg_len = le16_to_cpu(hdr->msg_len);
+ if (msg_len < sizeof(*hdr) || msg_len > len) {
+ if (net_ratelimit())
+ netdev_warn(enic->netdev,
+ "MBOX: invalid msg_len %u (buf len %u)\n",
+ msg_len, len);
+ return;
+ }
+
netdev_dbg(enic->netdev,
"MBOX recv: type %u from vnic %u len %u\n",
- hdr->msg_type, le16_to_cpu(hdr->src_vnic_id),
- le16_to_cpu(hdr->msg_len));
+ hdr->msg_type, le16_to_cpu(hdr->src_vnic_id), msg_len);
+
+ payload = buf + sizeof(*hdr);
+
+ if (enic->vf_state)
+ enic_mbox_pf_process_msg(enic, hdr, payload);
}
void enic_mbox_init(struct enic *enic)
diff --git a/drivers/net/ethernet/cisco/enic/enic_mbox.h b/drivers/net/ethernet/cisco/enic/enic_mbox.h
index 73fd7f783ee2..f1de67db1273 100644
--- a/drivers/net/ethernet/cisco/enic/enic_mbox.h
+++ b/drivers/net/ethernet/cisco/enic/enic_mbox.h
@@ -87,5 +87,6 @@ struct enic;
void enic_mbox_init(struct enic *enic);
int enic_mbox_send_msg(struct enic *enic, u8 msg_type, u16 dst_vnic_id,
void *payload, u16 payload_len);
+int enic_mbox_send_link_state(struct enic *enic, u16 vf_id, u32 link_state);
#endif /* _ENIC_MBOX_H_ */
--
2.43.0
^ permalink raw reply related
* Re: [PATCH] net: ethtool: mm: Increase FPE verification retry count
From: Nazle Asmade, Muhammad Nazim Amirul @ 2026-06-18 2:27 UTC (permalink / raw)
To: Vladimir Oltean, Simon Horman
Cc: netdev@vger.kernel.org, andrew@lunn.ch, kuba@kernel.org,
davem@davemloft.net, edumazet@google.com, pabeni@redhat.com,
faizal.abdul.rahim@linux.intel.com, linux-kernel@vger.kernel.org
In-Reply-To: <20260616151640.qi4bobvcbeyptvgg@skbuf>
On 16/6/2026 11:16 pm, Vladimir Oltean wrote:
> On Tue, Jun 16, 2026 at 08:19:25AM +0100, Simon Horman wrote:
>> + Vladimir
>>
>> On Mon, Jun 15, 2026 at 12:24:36AM -0700, muhammad.nazim.amirul.nazle.asmade@altera.com wrote:
>>> From: Nazim Amirul <muhammad.nazim.amirul.nazle.asmade@altera.com>
>>>
>>> The current FPE verification retry count is set to 3. However,
>>> the IEEE 802.3br standard does not specify a fixed value for this.
>>> A retry count of 3 may be insufficient when the remote device is
>>> slow to respond during link-up. Increase the retry count to 20 to
>>> improve robustness.
>>>
>>> Signed-off-by: Rohan G Thomas <rohan.g.thomas@altera.com>
>>> Signed-off-by: Nazim Amirul <muhammad.nazim.amirul.nazle.asmade@altera.com>
>>
>> Vladimir, I'm wondering if you could take a look at this one.
>
> IEEE 802.3br is an obsolete standard, I don't even have access to it.
>
> IEEE 802.3-2022 is the current one for the MAC Merge layer. Clause
> 99.4.7.2 Constants states:
>
> verifyLimit: the integer 3, the number of verification attempts
>
> I don't have something in principle against making the verifyLimit
> configurable past IEEE 802.3 for debugging purposes or non-standard
> applications, but keep the default to 3.
Thanks Simon and Vladimir for your review!
BR,
Nazim
^ permalink raw reply
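The approach suggested in the quoted review, keeping 3 as the default while allowing an override, could look roughly like the user-space sketch below. It is not the ethtool MAC Merge code: `SKETCH_VERIFY_LIMIT_DEFAULT`, `sketch_verify()` and the callback type are hypothetical names, and timing between attempts is left out.

```c
#include <assert.h>
#include <stdbool.h>

/* verifyLimit as quoted from IEEE 802.3-2022, 99.4.7.2. */
#define SKETCH_VERIFY_LIMIT_DEFAULT 3

/* Hypothetical callback: sends one verify request and returns true if
 * the link partner responded in time.
 */
typedef bool (*sketch_verify_once_fn)(void *ctx);

/* limit == 0 selects the standard default. Returns the 1-based attempt
 * that succeeded, or 0 if every attempt timed out.
 */
static unsigned int sketch_verify(sketch_verify_once_fn once, void *ctx,
				  unsigned int limit)
{
	unsigned int attempt;

	if (!limit)
		limit = SKETCH_VERIFY_LIMIT_DEFAULT;

	for (attempt = 1; attempt <= limit; attempt++)
		if (once(ctx))
			return attempt;

	return 0;
}

/* Example partner that answers only the Nth request; *ctx holds N and
 * must be at least 1.
 */
static bool sketch_slow_partner(void *ctx)
{
	unsigned int *remaining = ctx;

	return --(*remaining) == 0;
}
```

A partner that needs five requests fails with the default limit and succeeds once the limit is raised, which is the case the original patch was trying to address.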
* [PATCHv2] net: emac: Fix NULL pointer dereference in emac_probe
From: Rosen Penev @ 2026-06-18 2:34 UTC (permalink / raw)
To: netdev
Cc: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Rosen Penev, open list
Move devm_request_irq() after devm_platform_ioremap_resource() so that
dev->emacp is mapped before the interrupt handler can fire. An early
interrupt hitting emac_irq() would dereference the NULL dev->emacp and
crash.
Also remove the redundant error message: devm_platform_ioremap_resource()
already reports a mapping failure through dev_err_probe().
Fixes: dcc34ef7c834 ("net: ibm: emac: manage emac_irq with devm")
Assisted-by: Opencode:Big-Pickle
Signed-off-by: Rosen Penev <rosenp@gmail.com>
---
v2: remove redundant error message.
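The ordering rule behind this fix can be illustrated with a small user-space model. All names below are hypothetical stand-ins (`regs` plays the role of dev->emacp) and nothing here is the driver's code; the point is only that the handler becomes reachable after the state it dereferences is valid.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct sketch_regs {
	unsigned int isr;
};

struct sketch_dev {
	struct sketch_regs *regs;	/* stands in for dev->emacp */
	void (*irq_handler)(struct sketch_dev *dev);
	unsigned int seen;
};

static void sketch_irq(struct sketch_dev *dev)
{
	/* Would dereference NULL if regs were not mapped yet. */
	dev->seen = dev->regs->isr;
}

/* Stand-in for the hardware raising an interrupt: it is delivered only
 * once a handler has been installed.
 */
static void sketch_raise_irq(struct sketch_dev *dev)
{
	if (dev->irq_handler)
		dev->irq_handler(dev);
}

static int sketch_probe(struct sketch_dev *dev, struct sketch_regs *mmio)
{
	if (!mmio)
		return -ENOMEM;		/* mapping failed, no handler yet */

	dev->regs = mmio;		/* 1. map the registers */
	dev->irq_handler = sketch_irq;	/* 2. only then expose the handler */
	sketch_raise_irq(dev);		/* an early interrupt is now harmless */
	return 0;
}
```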
drivers/net/ethernet/ibm/emac/core.c | 13 ++++++-------
1 file changed, 6 insertions(+), 7 deletions(-)
diff --git a/drivers/net/ethernet/ibm/emac/core.c b/drivers/net/ethernet/ibm/emac/core.c
index 80f0c8985845..62ee1b70c3e7 100644
--- a/drivers/net/ethernet/ibm/emac/core.c
+++ b/drivers/net/ethernet/ibm/emac/core.c
@@ -3044,6 +3044,12 @@ static int emac_probe(struct platform_device *ofdev)
if (err)
goto err_gone;
+ dev->emacp = devm_platform_ioremap_resource(ofdev, 0);
+ if (IS_ERR(dev->emacp)) {
+ err = PTR_ERR(dev->emacp);
+ goto err_gone;
+ }
+
/* Setup error IRQ handler */
dev->emac_irq = platform_get_irq(ofdev, 0);
if (dev->emac_irq < 0) {
@@ -3061,13 +3067,6 @@ static int emac_probe(struct platform_device *ofdev)
ndev->irq = dev->emac_irq;
- dev->emacp = devm_platform_ioremap_resource(ofdev, 0);
- if (IS_ERR(dev->emacp)) {
- dev_err(&ofdev->dev, "can't map device registers");
- err = PTR_ERR(dev->emacp);
- goto err_gone;
- }
-
/* Wait for dependent devices */
err = emac_wait_deps(dev);
if (err)
--
2.54.0
^ permalink raw reply related
* Re: [PATCH bpf v2] bpf, sockmap: fix use-after-free when the stream parser resizes the skb
From: Jiayuan Chen @ 2026-06-18 2:45 UTC (permalink / raw)
To: Sechang Lim, John Fastabend, Jakub Sitnicki, David S . Miller,
Eric Dumazet, Jakub Kicinski, Paolo Abeni
Cc: Simon Horman, Bobby Eshleman, netdev, bpf, linux-kernel
In-Reply-To: <20260612123553.2724240-1-rhkrqnwk98@gmail.com>
On 6/12/26 8:35 PM, Sechang Lim wrote:
> sk_psock_strp_parse() runs the BPF_PROG_TYPE_SK_SKB stream-parser program
> to find the length of the next message. strparser assembles a message out
> of several received skbs by chaining them onto the head's frag_list and
> recording where to append the next one in strp->skb_nextp:
>
> *strp->skb_nextp = skb;
> strp->skb_nextp = &skb->next;
>
> and then calls the parser on the head:
>
> len = (*strp->cb.parse_msg)(strp, head);
>
> The parser is only meant to inspect the skb, but the program may call
> bpf_skb_change_tail() -- or the sibling bpf_skb_pull_data(),
> bpf_skb_change_head(), bpf_skb_adjust_room(), all allowed for SK_SKB.
> Once the head carries a frag_list these go
>
> ... -> skb_ensure_writable -> pskb_may_pull -> __pskb_pull_tail
>
> and __pskb_pull_tail() frees the frag_list skbs that strparser still
> tracks through skb_nextp:
>
> while ((list = skb_shinfo(skb)->frag_list) != insp) {
> skb_shinfo(skb)->frag_list = list->next;
> consume_skb(list);
> }
>
> strp->skb_nextp now points into a freed sk_buff. The next segment of
> the same message arrives in __strp_recv(), which links it with
> *strp->skb_nextp = skb, an 8-byte write into the freed skb. The free
> and the write happen in different __strp_recv() calls, so the message
> has to span at least three segments before it triggers.
>
> BUG: KASAN: slab-use-after-free in __strp_recv+0x447/0xda0
> Write of size 8 at addr ffff88810db86140 by task repro/349
>
> Call Trace:
> <IRQ>
> __strp_recv+0x447/0xda0
> __tcp_read_sock+0x13d/0x590
> tcp_bpf_strp_read_sock+0x195/0x320
> strp_data_ready+0x267/0x340
> sk_psock_strp_data_ready+0x1ce/0x350
> tcp_data_queue+0x1364/0x2fd0
> tcp_rcv_established+0xe07/0x1640
> [...]
>
> Allocated by task 349:
> skb_clone+0x17b/0x210
> __strp_recv+0x2c3/0xda0
> __tcp_read_sock+0x13d/0x590
> [...]
>
> Freed by task 349:
> kmem_cache_free+0x150/0x570
> __pskb_pull_tail+0x57b/0xc20
> skb_ensure_writable+0x236/0x260
> __bpf_skb_change_tail+0x1d4/0x590
> sk_skb_change_tail+0x2a/0x40
> bpf_prog_1b285dcd6c41373e+0x27/0x30
> bpf_prog_run_pin_on_cpu+0xf3/0x260
> sk_psock_strp_parse+0x118/0x1e0
> __strp_recv+0x4f6/0xda0
> [...]
>
> The same resize also leaves the head's length inconsistent with its
> frags, so a later __pskb_pull_tail() can instead hit the
> BUG_ON(skb_copy_bits(...)) in net/core/skbuff.c.
>
> Run the parser on a private clone of the head when the message spans more
> than one skb and the program can modify the packet
> (prog->aux->changes_pkt_data), so a resizing helper can only touch the
> clone and strparser's head and skb_nextp stay valid. Single-skb messages
> have no frag_list and read-only parsers cannot resize, so both are still
> parsed in place. If the clone cannot be allocated, return 0 so the caller
> retries on the next read rather than failing the parser.
>
> Fixes: 8a31db561566 ("bpf: add access to sock fields and pkt data from sk_skb programs")
Please consider Kuniyuki Iwashima's suggestion.
But it only covers the ATTACH path; the other two paths should be
covered as well:
- BPF_PROG_ATTACH → sock_map_get_from_fd → sock_map_prog_update
- BPF_LINK_CREATE → sock_map_link_create → sock_map_prog_update
- replace prog → sock_map_link_update_prog
A new helper for this check is probably needed, called from both
sock_map_prog_update() and sock_map_link_update_prog().
Since this rejects the program at attach time rather than fixing a
runtime crash,
I'm not sure a Fixes tag is appropriate here - thoughts?
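A rough user-space sketch of such a shared helper follows. It assumes the check is "a stream parser that can modify packet data is refused at attach time", which is a reading of this thread and not something quoted in this message. The `sketch_` names, the -EINVAL return value and the decision to leave verdict programs unrestricted are likewise assumptions.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

enum sketch_attach_type {
	SKETCH_SK_SKB_STREAM_PARSER,
	SKETCH_SK_SKB_STREAM_VERDICT,
};

struct sketch_prog {
	bool changes_pkt_data;	/* mirrors prog->aux->changes_pkt_data */
};

/* Hypothetical shared check used by every path that installs a program. */
static int sketch_prog_check(const struct sketch_prog *prog,
			     enum sketch_attach_type which)
{
	if (prog && which == SKETCH_SK_SKB_STREAM_PARSER &&
	    prog->changes_pkt_data)
		return -EINVAL;
	return 0;
}

/* Stand-in for the path shared by BPF_PROG_ATTACH and BPF_LINK_CREATE. */
static int sketch_prog_update(const struct sketch_prog **slot,
			      const struct sketch_prog *prog,
			      enum sketch_attach_type which)
{
	int err = sketch_prog_check(prog, which);

	if (err)
		return err;
	*slot = prog;
	return 0;
}

/* Stand-in for the link "replace prog" path, which must not bypass it. */
static int sketch_link_update_prog(const struct sketch_prog **slot,
				   const struct sketch_prog *prog,
				   enum sketch_attach_type which)
{
	int err = sketch_prog_check(prog, which);

	if (err)
		return err;
	*slot = prog;
	return 0;
}
```

Routing both entry points through one helper keeps the three user-visible paths listed above from drifting apart.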
^ permalink raw reply
* Re: [PATCH net-next v1] net: wangxun: don't advertise IFF_SUPP_NOFCS
From: mengyuanlou @ 2026-06-18 3:06 UTC (permalink / raw)
To: Rongguang Wei; +Cc: netdev, jiawenwu, pabeni, kuba, Rongguang Wei
In-Reply-To: <20260617092854.133992-1-clementwei90@163.com>
> On June 17, 2026, at 17:28, Rongguang Wei <clementwei90@163.com> wrote:
>
> From: Rongguang Wei <weirongguang@kylinos.cn>
>
> Like commit a24162f18825 ("i40e: don't advertise IFF_SUPP_NOFCS"),
> ngbe and txgbe also advertise IFF_SUPP_NOFCS, which allows users
> to set the SO_NOFCS socket option. But the drivers do not check
> skb->no_fcs, so the option is silently ignored.
>
> With this change, send() fails with -EPROTONOSUPPORT when the SO_NOFCS
> option is set on an AF_PACKET socket.
In fact, the hardware supports this function, but it seems that no one is using it at present.
To ensure that the interface does not report any errors, it can be removed.
>
> Signed-off-by: Rongguang Wei <weirongguang@kylinos.cn>
> ---
> drivers/net/ethernet/wangxun/ngbe/ngbe_main.c | 1 -
> drivers/net/ethernet/wangxun/txgbe/txgbe_main.c | 1 -
> 2 files changed, 2 deletions(-)
>
> diff --git a/drivers/net/ethernet/wangxun/ngbe/ngbe_main.c b/drivers/net/ethernet/wangxun/ngbe/ngbe_main.c
> index d8e3827a8b1f..1e4ebac8e495 100644
> --- a/drivers/net/ethernet/wangxun/ngbe/ngbe_main.c
> +++ b/drivers/net/ethernet/wangxun/ngbe/ngbe_main.c
> @@ -713,7 +713,6 @@ static int ngbe_probe(struct pci_dev *pdev,
> netdev->features |= NETIF_F_GRO;
>
> netdev->priv_flags |= IFF_UNICAST_FLT;
> - netdev->priv_flags |= IFF_SUPP_NOFCS;
> netdev->priv_flags |= IFF_LIVE_ADDR_CHANGE;
>
> netdev->min_mtu = ETH_MIN_MTU;
> diff --git a/drivers/net/ethernet/wangxun/txgbe/txgbe_main.c b/drivers/net/ethernet/wangxun/txgbe/txgbe_main.c
> index 8b7c3753bb6a..db9262b00a66 100644
> --- a/drivers/net/ethernet/wangxun/txgbe/txgbe_main.c
> +++ b/drivers/net/ethernet/wangxun/txgbe/txgbe_main.c
> @@ -801,7 +801,6 @@ static int txgbe_probe(struct pci_dev *pdev,
> netdev->features |= NETIF_F_RX_UDP_TUNNEL_PORT;
>
> netdev->priv_flags |= IFF_UNICAST_FLT;
> - netdev->priv_flags |= IFF_SUPP_NOFCS;
> netdev->priv_flags |= IFF_LIVE_ADDR_CHANGE;
>
> netdev->min_mtu = ETH_MIN_MTU;
> --
> 2.25.1
>
>
>
^ permalink raw reply
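The user-visible effect described in the quoted commit message can be modelled in a few lines of user-space C. The flag's bit position and all `sketch_` names are placeholders, and this is a simplified model of the send-path check, not the AF_PACKET code itself.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define SKETCH_IFF_SUPP_NOFCS (1u << 0)	/* placeholder bit position */

struct sketch_netdev {
	unsigned int priv_flags;
};

/* A socket that asked for SO_NOFCS may only transmit on a device that
 * advertises support; otherwise the send is refused up front instead of
 * the option being silently ignored by the driver.
 */
static int sketch_packet_send_check(const struct sketch_netdev *dev,
				    bool so_nofcs)
{
	if (so_nofcs && !(dev->priv_flags & SKETCH_IFF_SUPP_NOFCS))
		return -EPROTONOSUPPORT;
	return 0;
}
```

Sockets that never set SO_NOFCS are unaffected by whether the flag is advertised.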
* [PATCH net v3 1/2] geneve: gate GRO hint in geneve_gro_complete() on gs->gro_hint
From: Xiang Mei @ 2026-06-18 3:26 UTC (permalink / raw)
To: netdev, Paolo Abeni
Cc: Jakub Kicinski, Eric Dumazet, Andrew Lunn, David S . Miller,
Weiming Shi, Kyle Zeng, Xiang Mei
geneve_gro_receive() reads the GRO hint through geneve_sk_gro_hint_off(),
which honours it only when the socket enabled IFLA_GENEVE_GRO_HINT
(gs->gro_hint). geneve_gro_complete() instead calls the low-level
geneve_opt_gro_hint_off() and acts on the hint unconditionally.
On a tunnel without the hint, receive aggregates the frames as plain
ETH_P_TEB while complete still honours an attacker-supplied hint option: it
inflates gh_len by gro_hint->nested_hdr_len (u8) and redirects the dispatch
type, so the inner gro_complete handler runs at nhoff + gh_len, an offset
receive never pulled nor validated, reading out of bounds of the skb head:
BUG: KASAN: slab-out-of-bounds in ipv6_gro_complete (net/ipv6/ip6_offload.c:196)
Read of size 1 at addr ffff88800fe91980 by task exploit/153
ipv6_gro_complete (net/ipv6/ip6_offload.c:196)
geneve_gro_complete (drivers/net/geneve.c:965)
udp_gro_complete (net/ipv4/udp_offload.c:940)
inet_gro_complete (net/ipv4/af_inet.c:1621)
__gro_flush (net/core/gro.c:306)
Gate the complete path on gs->gro_hint too via geneve_sk_gro_hint_off(), so
both paths agree. Tunnels that enable the hint are unaffected.
Fixes: fd0dd796576e ("geneve: use GRO hint option in the RX path")
Reported-by: Weiming Shi <bestswngs@gmail.com>
Reported-by: Kyle Zeng <kylebot@openai.com>
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Xiang Mei <xmei5@asu.edu>
---
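The mismatch between the two paths can be pictured with a simplified user-space model. The structures and `sketch_` functions below are illustrative stand-ins, not the geneve.c helpers, and the redirected dispatch type is left out; only the header length is modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct sketch_pkt {
	unsigned int gh_len;	/* base Geneve header length */
	bool has_hint;		/* packet carries a GRO hint option */
	uint8_t nested_hdr_len;	/* sender-controlled */
};

struct sketch_sock {
	bool gro_hint;		/* tunnel enabled the hint */
};

/* Low-level parse: trusts whatever the packet says. */
static unsigned int sketch_opt_hint_len(const struct sketch_pkt *p)
{
	return p->has_hint ? p->gh_len + p->nested_hdr_len : p->gh_len;
}

/* Socket-gated wrapper: the hint counts only if the tunnel enabled it. */
static unsigned int sketch_sk_hint_len(const struct sketch_sock *gs,
				       const struct sketch_pkt *p)
{
	return gs->gro_hint ? sketch_opt_hint_len(p) : p->gh_len;
}
```

With the hint disabled, the gated wrapper returns the plain header length, so receive and complete agree; the ungated variant still adds the sender-supplied length.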
drivers/net/geneve.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/net/geneve.c b/drivers/net/geneve.c
index 9afff7bcaa0b..7cf7aaac8ee1 100644
--- a/drivers/net/geneve.c
+++ b/drivers/net/geneve.c
@@ -954,13 +954,13 @@ static int geneve_gro_complete(struct sock *sk, struct sk_buff *skb,
struct genevehdr *gh;
struct packet_offload *ptype;
__be16 type;
- int gh_len;
+ unsigned int gh_len;
int err = -ENOSYS;
gh = (struct genevehdr *)(skb->data + nhoff);
gh_len = geneve_hlen(gh);
type = gh->proto_type;
- geneve_opt_gro_hint_off(gh, &type, &gh_len);
+ geneve_sk_gro_hint_off(sk, gh, &type, &gh_len);
/* since skb->encapsulation is set, eth_gro_complete() sets the inner mac header */
if (likely(type == htons(ETH_P_TEB)))
--
2.43.0
^ permalink raw reply related
* [PATCH net v3 2/2] geneve: validate inner network offset in geneve_gro_complete()
From: Xiang Mei @ 2026-06-18 3:26 UTC (permalink / raw)
To: netdev, Paolo Abeni
Cc: Jakub Kicinski, Eric Dumazet, Andrew Lunn, David S . Miller,
Weiming Shi, Kyle Zeng, Xiang Mei
In-Reply-To: <20260618032622.484720-1-xmei5@asu.edu>
Even with both paths gated on gs->gro_hint, geneve_gro_complete()
re-derives the inner dispatch type and length from the packet and the
current gs->gro_hint, independently of geneve_gro_receive(). The two can
disagree if gs->gro_hint flips under a concurrent geneve_quiesce()/
geneve_unquiesce() (sk_user_data is NULL across a synchronize_net()), or if
the re-read option bytes differ from the ones receive parsed.
geneve_gro_receive() already records the inner network header position in
NAPI_GRO_CB()->inner_network_offset. Have geneve_gro_complete() compute the
offset it is about to dispatch at, adding ETH_HLEN in the ETH_P_TEB case
where eth_gro_complete() steps over the inner MAC header, and bail out if
it lands past inner_network_offset.
Use a lower bound rather than exact equality: between gh_len and the inner
L3 header, geneve_gro_receive() may also have pulled an inner VLAN tag
(vlan_gro_receive() advances the recorded offset past it), which only moves
inner_network_offset further out. A valid frame therefore always satisfies
inner_nh <= inner_network_offset, while a gh_len inflated by a hint
gro_receive() did not honour dispatches past the validated inner header,
i.e. the out-of-bounds completion. Only the latter is rejected.
Fixes: fd0dd796576e ("geneve: use GRO hint option in the RX path")
Suggested-by: Paolo Abeni <pabeni@redhat.com>
Co-developed-by: Weiming Shi <bestswngs@gmail.com>
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Xiang Mei <xmei5@asu.edu>
---
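A worked example of the lower-bound check, as a user-space sketch with hypothetical names. Assume nhoff = 42 and a base Geneve header of 8 bytes carrying ETH_P_TEB: the dispatch offset is 42 + 8 + 14 = 64. If receive recorded 64 (or 68 with an inner VLAN tag) the frame passes, while a gh_len inflated by 200 gives 264 and is rejected.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define SKETCH_ETH_HLEN 14

/* Returns 0 if completion would dispatch at or before the inner network
 * header that receive validated, -EINVAL if it would dispatch past it.
 */
static int sketch_check_inner(unsigned int nhoff, unsigned int gh_len,
			      bool is_teb, unsigned int inner_network_offset)
{
	unsigned int inner_nh = nhoff + gh_len;

	/* For ETH_P_TEB the inner MAC header is stepped over as well. */
	if (is_teb)
		inner_nh += SKETCH_ETH_HLEN;

	if (inner_nh > inner_network_offset)
		return -EINVAL;
	return 0;
}
```

The patch additionally applies the check only when skb->encapsulation is set, which this sketch does not model.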
drivers/net/geneve.c | 14 ++++++++++++++
1 file changed, 14 insertions(+)
diff --git a/drivers/net/geneve.c b/drivers/net/geneve.c
index 7cf7aaac8ee1..396e1a113cd4 100644
--- a/drivers/net/geneve.c
+++ b/drivers/net/geneve.c
@@ -962,6 +962,20 @@ static int geneve_gro_complete(struct sock *sk, struct sk_buff *skb,
type = gh->proto_type;
geneve_sk_gro_hint_off(sk, gh, &type, &gh_len);
+ /* Bail out if we are about to dispatch past the inner network header
+ * gro_receive() validated. An inner VLAN tag only pushes
+ * inner_network_offset out, so use a lower bound.
+ */
+ if (skb->encapsulation) {
+ unsigned int inner_nh = nhoff + gh_len;
+
+ if (type == htons(ETH_P_TEB))
+ inner_nh += ETH_HLEN;
+
+ if (unlikely(inner_nh > NAPI_GRO_CB(skb)->inner_network_offset))
+ return -EINVAL;
+ }
+
/* since skb->encapsulation is set, eth_gro_complete() sets the inner mac header */
if (likely(type == htons(ETH_P_TEB)))
return eth_gro_complete(skb, nhoff + gh_len);
--
2.43.0
^ permalink raw reply related
* Re: [net-next PATCH 06/10] net: dsa: realtek: rtl8365mb: add VLAN support
From: Luiz Angelo Daros de Luca @ 2026-06-18 3:30 UTC (permalink / raw)
To: Gabor Juhos
Cc: Linus Walleij, Andrew Lunn, Vladimir Oltean, David S. Miller,
Eric Dumazet, Jakub Kicinski, Paolo Abeni, Simon Horman,
Alvin Šipraga, Yury Norov, Rasmus Villemoes, Russell King,
netdev, linux-kernel
In-Reply-To: <232d7c53-f413-4c8d-bc70-f03a12d601f5@gmail.com>
Hi Gabor, Linus,
(Sorry for the late answer. I left it as a draft and forgot to send it)
> The driver was based on various code found in GPL sources of the TP-Link
> TL-WR2543ND and ASUS RT-N56U devices. Those sources were using the RTL8370
> specific API for these chips.
The GPL code from both the TL-WR2543ND and ASUS RT-N56U devices was a
dead end. However, the mention of "RTL8370" changed everything: I had
been searching for an RTL8367 API.
> It was not clear that the two models really belongs to the RTL8370 family
> or simply the vendors were using the RTL8370 specific code as a base, so
> I have used RTL8367 prefix in the swconfig driver. Probably, it would have
> been better to keep the RTL8370 prefix to avoid confusion.
Maybe. I don't know if the RTL8367 family exists. I'm adapting the
rtl8365mb driver to work with multiple switch generations, from
RTL8370/RTL8367 through RTL8367D. Internally I simply renamed
RTL8370/RTL8367 to RTL8367A so that I can refer to it as Family-A (and
B, C, D...).
> > The main challenge with the base RTL8367 is the lack of a public API.
> > Most vendors support it via binary managers (ASUS) or proprietary
> > kernel modules (TP-Link). The only available references I’ve found are
> > the OpenWrt swconfig driver you mentioned and some U-Boot
> > initialization code. I do have the rtl8367{b,c,d} APIs. While the
> > rtl8367b seems close to the original RTL8367, it has fewer ports. Does
> > anyone happen to have access to the original RTL8367 API
> > documentation?
>
> I have no documentation, but the RTL8370 API source can be found at various
> places [2], [3].
I did have it checked out in my dev dir for years but the missing
piece was to match RTL8367R with RTL8370.
> To be honest, I don't remember all the details, but I hope that this helps.
It helped a lot.
I have a mostly working rtl8365mb version that runs on RTL8367R. I
still need more tests with FDB but I think it will be possible to
replace the swconfig rtl8367.c with DSA rtl8365mb. I am still looking
for a RTL8367B and RTL8367D device to complete the family.
Regards,
Luiz
^ permalink raw reply
* [PATCH net] net: mana: Sync page pool RX frags for CPU
From: Dexuan Cui @ 2026-06-18 3:50 UTC (permalink / raw)
To: kys, haiyangz, wei.liu, decui, longli, andrew+netdev, davem,
edumazet, kuba, pabeni, kotaranov, horms, ernis, dipayanroy, kees,
jacob.e.keller, ssengar, linux-hyperv, netdev, linux-kernel,
linux-rdma
Cc: stable
MANA allocates RX buffers from page pool fragments when frag_count is
greater than 1. In that case the buffers remain DMA mapped by page pool
and the RX completion path does not call dma_unmap_single(). As a result,
the implicit sync-for-CPU normally performed by dma_unmap_single() is
missing before the packet data is passed to the networking stack.
This breaks RX on configurations which require explicit DMA syncing, for
example when booted with swiotlb=force.
Fix this by recording the page pool page and DMA sync offset when the RX
buffer is allocated, and syncing the received packet range for CPU access
before handing the RX buffer to the stack.
Also validate the packet length reported in the RX CQE before using it as
a DMA sync length or passing it to skb processing. The CQE is supplied
by the device and should not be blindly trusted by Confidential VMs.
Fixes: 730ff06d3f5c ("net: mana: Use page pool fragments for RX buffers instead of full pages to improve memory efficiency.")
Cc: stable@vger.kernel.org
Signed-off-by: Dexuan Cui <decui@microsoft.com>
---
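The per-completion decision this patch introduces can be summarised in a small user-space sketch. It only models which branch is taken; it performs no DMA operations, uses hypothetical names throughout, and ignores the refill-failure case, in which the patch returns before touching the old buffer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum sketch_rx_action {
	SKETCH_RX_DROP,		/* CQE length is not plausible */
	SKETCH_RX_UNMAP,	/* dma_unmap_single() syncs implicitly */
	SKETCH_RX_SYNC_FOR_CPU,	/* page pool frag: explicit sync needed */
};

static enum sketch_rx_action sketch_rx_action(uint32_t pktlen,
					      uint32_t datasize,
					      uint32_t frag_count,
					      bool from_pool)
{
	/* The length comes from the device, so bound it first. */
	if (pktlen > datasize)
		return SKETCH_RX_DROP;

	/* Buffers mapped with dma_map_single() are unmapped on completion. */
	if (!from_pool || frag_count == 1)
		return SKETCH_RX_UNMAP;

	/* Page pool fragments stay mapped, so sync the received bytes. */
	return SKETCH_RX_SYNC_FOR_CPU;
}
```

Checking the length before choosing a branch matters because the same value is later used as the sync length.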
drivers/net/ethernet/microsoft/mana/mana_en.c | 61 +++++++++++++++----
include/net/mana/mana.h | 8 +++
2 files changed, 57 insertions(+), 12 deletions(-)
diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index c9b1df1ed109..d8906169666d 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -2044,15 +2044,19 @@ static void mana_rx_skb(void *buf_va, bool from_pool,
}
static void *mana_get_rxfrag(struct mana_rxq *rxq, struct device *dev,
- dma_addr_t *da, bool *from_pool)
+ dma_addr_t *da, bool *from_pool,
+ struct page **pp_page, u32 *dma_sync_offset)
{
struct page *page;
u32 offset;
void *va;
+
*from_pool = false;
+ *pp_page = NULL;
+ *dma_sync_offset = 0;
/* Don't use fragments for jumbo frames or XDP where it's 1 fragment
- * per page.
+ * per page. These buffers are mapped with dma_map_single().
*/
if (rxq->frag_count == 1) {
/* Reuse XDP dropped page if available */
@@ -2087,31 +2091,47 @@ static void *mana_get_rxfrag(struct mana_rxq *rxq, struct device *dev,
va = page_to_virt(page) + offset;
*da = page_pool_get_dma_addr(page) + offset + rxq->headroom;
*from_pool = true;
+ *pp_page = page;
+ *dma_sync_offset = offset + rxq->headroom;
return va;
}
/* Allocate frag for rx buffer, and save the old buf */
static void mana_refill_rx_oob(struct device *dev, struct mana_rxq *rxq,
- struct mana_recv_buf_oob *rxoob, void **old_buf,
- bool *old_fp)
+ struct mana_recv_buf_oob *rxoob, u32 pktlen,
+ void **old_buf, bool *old_fp)
{
+ u32 dma_sync_offset;
+ struct page *pp_page;
bool from_pool;
dma_addr_t da;
void *va;
- va = mana_get_rxfrag(rxq, dev, &da, &from_pool);
+ va = mana_get_rxfrag(rxq, dev, &da, &from_pool, &pp_page,
+ &dma_sync_offset);
if (!va)
return;
- if (!rxoob->from_pool || rxq->frag_count == 1)
+ if (!rxoob->from_pool || rxq->frag_count == 1) {
dma_unmap_single(dev, rxoob->sgl[0].address, rxq->datasize,
DMA_FROM_DEVICE);
+ } else {
+ /* The page pool maps the whole page and only syncs for device
+ * automatically (PP_FLAG_DMA_SYNC_DEV). Sync the received bytes
+ * for the CPU before they are read: this is required if DMA
+ * is incoherent or bounce buffers are used.
+ */
+ page_pool_dma_sync_for_cpu(rxq->page_pool, rxoob->pp_page,
+ rxoob->dma_sync_offset, pktlen);
+ }
*old_buf = rxoob->buf_va;
*old_fp = rxoob->from_pool;
rxoob->buf_va = va;
rxoob->sgl[0].address = da;
rxoob->from_pool = from_pool;
+ rxoob->pp_page = pp_page;
+ rxoob->dma_sync_offset = dma_sync_offset;
}
static void mana_process_rx_cqe(struct mana_rxq *rxq, struct mana_cq *cq,
@@ -2170,12 +2190,24 @@ static void mana_process_rx_cqe(struct mana_rxq *rxq, struct mana_cq *cq,
rxbuf_oob = &rxq->rx_oobs[curr];
WARN_ON_ONCE(rxbuf_oob->wqe_inf.wqe_size_in_bu != 1);
- mana_refill_rx_oob(dev, rxq, rxbuf_oob, &old_buf, &old_fp);
+ if (unlikely(pktlen > rxq->datasize)) {
+ /* Increase it even if mana_rx_skb() isn't called. */
+ rxq->rx_cq.work_done++;
- /* Unsuccessful refill will have old_buf == NULL.
- * In this case, mana_rx_skb() will drop the packet.
- */
- mana_rx_skb(old_buf, old_fp, oob, rxq, i);
+ ++ndev->stats.rx_dropped;
+ netdev_warn_once(ndev,
+ "Dropped oversized RX packet: len=%u, datasize=%u\n",
+ pktlen, rxq->datasize);
+
+ /* Reuse the RX buffer since rxbuf_oob is unchanged. */
+ } else {
+ mana_refill_rx_oob(dev, rxq, rxbuf_oob, pktlen, &old_buf, &old_fp);
+
+ /* Unsuccessful refill will have old_buf == NULL.
+ * In this case, mana_rx_skb() will drop the packet.
+ */
+ mana_rx_skb(old_buf, old_fp, oob, rxq, i);
+ }
mana_move_wq_tail(rxq->gdma_rq,
rxbuf_oob->wqe_inf.wqe_size_in_bu);
@@ -2566,6 +2598,8 @@ static int mana_fill_rx_oob(struct mana_recv_buf_oob *rx_oob, u32 mem_key,
struct mana_rxq *rxq, struct device *dev)
{
struct mana_port_context *mpc = netdev_priv(rxq->ndev);
+ struct page *pp_page = NULL;
+ u32 dma_sync_offset = 0;
bool from_pool = false;
dma_addr_t da;
void *va;
@@ -2573,13 +2607,16 @@ static int mana_fill_rx_oob(struct mana_recv_buf_oob *rx_oob, u32 mem_key,
if (mpc->rxbufs_pre)
va = mana_get_rxbuf_pre(rxq, &da);
else
- va = mana_get_rxfrag(rxq, dev, &da, &from_pool);
+ va = mana_get_rxfrag(rxq, dev, &da, &from_pool, &pp_page,
+ &dma_sync_offset);
if (!va)
return -ENOMEM;
rx_oob->buf_va = va;
rx_oob->from_pool = from_pool;
+ rx_oob->pp_page = pp_page;
+ rx_oob->dma_sync_offset = dma_sync_offset;
rx_oob->sgl[0].address = da;
rx_oob->sgl[0].size = rxq->datasize;
diff --git a/include/net/mana/mana.h b/include/net/mana/mana.h
index 8f721cd4e4a7..4111b93169d2 100644
--- a/include/net/mana/mana.h
+++ b/include/net/mana/mana.h
@@ -305,6 +305,14 @@ struct mana_recv_buf_oob {
void *buf_va;
bool from_pool; /* allocated from a page pool */
+ /* head page of the page_pool fragment; valid only when
+ * from_pool && frag_count > 1.
+ */
+ struct page *pp_page;
+ /* Fragment offset plus rxq->headroom, passed to
+ * page_pool_dma_sync_for_cpu().
+ */
+ u32 dma_sync_offset;
/* SGL of the buffer going to be sent as part of the work request. */
u32 num_sge;
--
2.34.1
^ permalink raw reply related
* [PATCH net] octeontx2-af: npc: cn20k: Fix subbank free list indexing for search order
From: Ratheesh Kannoth @ 2026-06-18 3:59 UTC (permalink / raw)
To: kuba, linux-kernel, netdev, rkannoth
Cc: andrew+netdev, davem, edumazet, pabeni, sgoutham
subbank_srch_order[i] is the physical subbank at search-order slot i,
so each subbank's arr_idx must be i (its slot), not
subbank_srch_order[sb->idx]. The old logic mis-keyed xa_sb_free
and broke allocation traversal order.
Populate arr_idx and xa_sb_free in a single pass over the search
order after subbank structs are initialized.
Fixes: 7ac9d4c4075c ("octeontx2-af: npc: cn20k: add subbank search order control")
Signed-off-by: Ratheesh Kannoth <rkannoth@marvell.com>
---
.../ethernet/marvell/octeontx2/af/cn20k/npc.c | 47 ++++++++++++++-----
1 file changed, 36 insertions(+), 11 deletions(-)
diff --git a/drivers/net/ethernet/marvell/octeontx2/af/cn20k/npc.c b/drivers/net/ethernet/marvell/octeontx2/af/cn20k/npc.c
index 354c4e881c6a..d38e848add93 100644
--- a/drivers/net/ethernet/marvell/octeontx2/af/cn20k/npc.c
+++ b/drivers/net/ethernet/marvell/octeontx2/af/cn20k/npc.c
@@ -3423,6 +3423,36 @@ static int npc_create_srch_order(int cnt)
return 0;
}
+static int npc_subbanks_srch_oder_init(struct rvu *rvu)
+{
+ struct npc_subbank *sb;
+ int sb_idx;
+ int i, j;
+ int rc;
+
+ for (i = 0; i < npc_priv->num_subbanks; i++) {
+ sb_idx = subbank_srch_order[i];
+ sb = &npc_priv->sb[sb_idx];
+ sb->arr_idx = i;
+
+ dev_dbg(rvu->dev, "%s: sb->idx=%u sb->arr_idx=%u\n",
+ __func__, sb->idx, sb->arr_idx);
+
+ rc = xa_err(xa_store(&npc_priv->xa_sb_free, sb->arr_idx,
+ xa_mk_value(sb->idx), GFP_KERNEL));
+ if (rc) {
+ dev_err(rvu->dev,
+ "%s: xa_store(xa_sb_free) failed at slot %d (sb=%d): %d\n",
+ __func__, i, sb_idx, rc);
+ for (j = 0; j < i; j++)
+ xa_erase(&npc_priv->xa_sb_free, j);
+ return rc;
+ }
+ }
+
+ return 0;
+}
+
static void npc_subbank_init(struct rvu *rvu, struct npc_subbank *sb, int idx)
{
mutex_init(&sb->lock);
@@ -3435,16 +3465,6 @@ static void npc_subbank_init(struct rvu *rvu, struct npc_subbank *sb, int idx)
sb->flags = NPC_SUBBANK_FLAG_FREE;
sb->idx = idx;
- sb->arr_idx = subbank_srch_order[idx];
-
- dev_dbg(rvu->dev, "%s: sb->idx=%u sb->arr_idx=%u\n",
- __func__, sb->idx, sb->arr_idx);
-
- /* Keep first and last subbank at end of free array; so that
- * it will be used at last
- */
- xa_store(&npc_priv->xa_sb_free, sb->arr_idx,
- xa_mk_value(sb->idx), GFP_KERNEL);
}
static int npc_pcifunc_map_create(struct rvu *rvu)
@@ -4635,6 +4655,7 @@ static int npc_priv_init(struct rvu *rvu)
int num_subbanks, subbank_depth;
u64 npc_const1, npc_const2 = 0;
struct npc_subbank *sb;
+ int ret = -ENOMEM;
u64 cfg;
int i;
@@ -4727,6 +4748,10 @@ static int npc_priv_init(struct rvu *rvu)
for (i = 0, sb = npc_priv->sb; i < num_subbanks; i++, sb++)
npc_subbank_init(rvu, sb, i);
+ ret = npc_subbanks_srch_oder_init(rvu);
+ if (ret)
+ goto fail2;
+
/* Get number of pcifuncs in the system */
npc_priv->pf_cnt = npc_pcifunc_map_create(rvu);
npc_priv->xa_pf2idx_map = kcalloc(npc_priv->pf_cnt,
@@ -4760,7 +4785,7 @@ static int npc_priv_init(struct rvu *rvu)
fail1:
kfree(npc_priv);
npc_priv = NULL;
- return -ENOMEM;
+ return ret;
}
void npc_cn20k_deinit(struct rvu *rvu)
--
2.43.0
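
For illustration only, here is a minimal user-space sketch of the indexing rule described in the commit message above: srch_order[i] names the physical subbank at slot i, so a subbank's arr_idx is the inverse mapping. The struct, the function name and the plain free_slots[] array (standing in for xa_sb_free) are hypothetical and are not the driver's code.

```c
/* Hypothetical model: only the two fields the commit message discusses. */
struct subbank_model {
	int idx;	/* physical subbank number */
	int arr_idx;	/* slot of this subbank in the search order */
};

/*
 * srch_order[i] is the physical subbank at search-order slot i, so the
 * subbank srch_order[i] gets arr_idx = i. free_slots[] is a plain array
 * standing in for the xa_sb_free xarray (an assumption made for brevity).
 */
static void srch_order_init_model(struct subbank_model *sb,
				  const int *srch_order,
				  int *free_slots, int n)
{
	for (int i = 0; i < n; i++) {
		int sb_idx = srch_order[i];

		sb[sb_idx].idx = sb_idx;
		sb[sb_idx].arr_idx = i;		/* keyed by slot */
		free_slots[i] = sb_idx;		/* slot -> physical subbank */
	}
}
```

With a search order of {2, 0, 3, 1}, subbank 2 ends up in slot 0, whereas the removed assignment arr_idx = subbank_srch_order[idx] would have put it in slot 3. The two only agree when the permutation is its own inverse.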
* Re: [PATCH net] net: dst_metadata: fix false-positive memcpy overflow in tun_dst_unclone
From: Gustavo A. R. Silva @ 2026-06-18 4:02 UTC (permalink / raw)
To: Ilya Maximets, netdev
Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Simon Horman, Kees Cook, Gustavo A. R. Silva, Nathan Chancellor,
Nick Desaulniers, Bill Wendling, Justin Stitt, linux-kernel,
linux-hardening, llvm, Johan Thomsen
In-Reply-To: <a95aabe9-6294-42ed-8327-b7d74bb4a8c8@embeddedor.com>
On 6/17/26 16:59, Gustavo A. R. Silva wrote:
>
>
> On 6/17/26 16:01, Ilya Maximets wrote:
>> On 6/17/26 10:08 PM, Gustavo A. R. Silva wrote:
>>> Hi,
>>>
>>> On 6/16/26 04:03, Ilya Maximets wrote:
>>>> kmalloc_flex() in metadata_dst_alloc() sets __counted_by for the
>>>> structure to the options_len, which is then initialized to zero.
>>>> Later, we're initializing the structure by copying the tunnel info
>>>> together with the options, and this triggers a warning for a potential
>>>> memcpy overflow, since the compiler estimates that the options can't
>>>> fit into the structure, even though the memory for them is actually
>>>> allocated.
>>>>
>>>> memcpy: detected buffer overflow: 104 byte write of buffer size 96
>>>> WARNING: CPU: X PID: Y at lib/string_helpers.c:1036 __fortify_report
>>>> skb_tunnel_info_unclone+0x179/0x190
>>>> geneve_xmit+0x7fe/0xe00
>>>
>>> This warning has nothing to do with counted_by. See below for more
>>> comments.
>>>
>>>>
>>>> The issue is triggered when built with clang and source fortification.
>>>>
>>>> Fix that by doing the copy in two stages: first - the main data with
>>>> the options_len, then the options. This way the correct length should
>>>> be known at the time of the copy.
>>>>
>>>> It would be better if the options_len never changed after allocation,
>>>> but the allocation code is a little separate from the initialization
>>>> and it would be awkward and potentially dangerous to return a struct
>>>> with options_len set to a non-zero value from the metadata_dst_alloc().
>>>>
>>>> Another option would be to use ip_tunnel_info_opts_set(), but it is
>>>> doing too many unnecessary operations for the use case here.
>>>>
>>>> Fixes: 69050f8d6d07 ("treewide: Replace kmalloc with kmalloc_obj for non-scalar types")
>>>> Reported-by: Johan Thomsen <write@ownrisk.dk>
>>>> Closes: https://lore.kernel.org/netdev/CAKv6aAM8_EWgXScnKmKYm_4SwGDVBK++dzfP+Y6msUXbp99QUw@mail.gmail.com/
>>>> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
>>>> ---
>>>>
>>>> Johan, if you can test this one in your setup as well, that would
>>>> be great. Thanks.
>>>>
>>>> include/net/dst_metadata.h | 7 +++++--
>>>> 1 file changed, 5 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/include/net/dst_metadata.h b/include/net/dst_metadata.h
>>>> index 1fc2fb03ce3f..f45d1e3163f0 100644
>>>> --- a/include/net/dst_metadata.h
>>>> +++ b/include/net/dst_metadata.h
>>>> @@ -164,8 +164,11 @@ static inline struct metadata_dst *tun_dst_unclone(struct sk_buff *skb)
>>>> if (!new_md)
>>>> return ERR_PTR(-ENOMEM);
>>>> - memcpy(&new_md->u.tun_info, &md_dst->u.tun_info,
>>>> - sizeof(struct ip_tunnel_info) + md_size);
>>>
>>> What's going on here is that, internally, fortified memcpy() retrieves
>>> the destination size via __builtin_dynamic_object_size() in mode 1.
>>>
>>> That is:
>>>
>>> __builtin_dynamic_object_size(&new_md->u.tun_info, 1)
>>>
>>> For the above case, Clang returns sizeof(new_md->u.tun_info) == 96.
>>>
>>> So the warning is reporting that 104 bytes don't fit in an object of
>>> size 96 bytes, regardless of any counted_by annotation or allocation.
>>
>> Hmm. Does __builtin_dynamic_object_size(&new_md->u.tun_info, 1) return
>> 104 when the options_len is 8? If so, isn't that because it is counted
>> by that field? Asking because the fortification doesn't complain if we
>> keep the full 104-byte copy as-is, but set the options_len beforehand,
>> as tested by Johan.
>
> I see. If that is the case, then, internally, fortified memcpy() ends up
> using mode 0 instead of mode 1. Something like this:
>
> __builtin_dynamic_object_size(&new_md->u.tun_info, 0)
>
> The above will effectively consider the allocation and counted_by because
> it will interpret new_md->u.tun_info as an open-ended object due to the
> flexible-array member (in struct ip_tunnel_info) whose size is determined
> by counted_by.
Indeed. The execution stops here:
fortify_memcpy_chk():
588 /*
589 * Always stop accesses beyond the struct that contains the
590 * field, when the buffer's remaining size is known.
591 * (The SIZE_MAX test is to optimize away checks where the buffer
592 * lengths are unknown.)
593 */
594 if (p_size != SIZE_MAX && p_size < size)
595 fortify_panic(func, FORTIFY_WRITE, p_size, size, true);
with p_size = __builtin_dynamic_object_size(&new_md->u.tun_info, 0)
The code never reaches the part where p_size_field (__bdos(&new_md->u.tun_info, 1))
is checked at runtime because there is no need for that.
So yep, this patch is okay as-is.
Thanks
-Gustavo
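
As a rough user-space sketch of the two points discussed above: the predicate below mirrors the quoted fortify_memcpy_chk() test (p_size != SIZE_MAX && p_size < size), and the clone helper shows the two-stage copy the patch describes, with the fixed part (including options_len) copied first and the options second. The struct and function names are hypothetical stand-ins, not the kernel's struct ip_tunnel_info or tun_dst_unclone().

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in: a fixed part followed by variable-length options. */
struct tun_info_model {
	uint32_t key;
	uint32_t options_len;
	uint8_t options[];	/* flexible array member */
};

/* Mirrors the quoted condition: trip only when the size is known and too small. */
static int fortify_would_panic(size_t p_size, size_t size)
{
	return p_size != SIZE_MAX && p_size < size;
}

/* Two-stage copy: fixed part (with options_len) first, then the options. */
static struct tun_info_model *tun_info_model_clone(const struct tun_info_model *src)
{
	struct tun_info_model *dst;

	dst = calloc(1, sizeof(*dst) + src->options_len);
	if (!dst)
		return NULL;

	/* Stage 1: after this, dst->options_len holds the real length. */
	memcpy(dst, src, sizeof(*dst));
	/* Stage 2: copy exactly options_len bytes into the flexible array. */
	memcpy(dst->options, src->options, src->options_len);
	return dst;
}
```

With the numbers from the warning, fortify_would_panic(96, 104) is true, while a known size of 104 or an unknown size (SIZE_MAX) lets the copy through.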
* Re: [PATCH net] ipv6: ndisc: fix NULL deref in accept_untracked_na()
From: Jiayuan Chen @ 2026-06-18 4:08 UTC (permalink / raw)
To: Weiming Shi, David S . Miller, David Ahern, Eric Dumazet,
Jakub Kicinski, Paolo Abeni
Cc: Simon Horman, netdev, linux-kernel, Xiang Mei
In-Reply-To: <DJBD6SGYRIHX.1IHLCVG9YYTNJ@gmail.com>
On 6/17/26 9:38 PM, Weiming Shi wrote:
> On Wed Jun 17, 2026 at 4:32 PM CST, Jiayuan Chen wrote:
>> On 6/17/26 2:55 PM, Weiming Shi wrote:
>>> accept_untracked_na() re-fetches the inet6_dev with __in6_dev_get(dev)
>>> and dereferences idev->cnf.accept_untracked_na without a NULL check,
>>
>> Does ipv6_rpl_srh_rcv have same problem?
> Hi,
>
> Yes, ipv6_rpl_srh_rcv() has the same missing check. It reads
> idev->cnf.rpl_seg_enabled right after __in6_dev_get(skb->dev) with no
> NULL check, while seg6 and ioam6 in the same file both check it.
>
> But I tried to trigger it and couldn't. With a guard added as an instrument,
> idev never came back NULL over tens of millions of RPL packets while
> flapping the MTU, so I can't say it's actually reachable.
Could you add an mdelay() to enlarge the race window and reproduce it?
I believe we need more precise traffic and timing control, instead of
aggressively ramping up traffic and load in an attempt to reproduce the
issue.
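
For reference, the guard being discussed can be modelled in user space as below. This is only a sketch: the types and helper are hypothetical stand-ins for struct inet6_dev and __in6_dev_get(), and returning 0 when the lookup yields NULL is an assumption, not necessarily what the actual fix does.

```c
#include <stddef.h>

/* Hypothetical stand-ins for struct inet6_dev and struct net_device. */
struct inet6_dev_model {
	int accept_untracked_na;
};

struct net_device_model {
	struct inet6_dev_model *ip6_ptr;	/* may be NULL */
};

/* Stand-in for __in6_dev_get(): the result can be NULL. */
static struct inet6_dev_model *
in6_dev_get_model(const struct net_device_model *dev)
{
	return dev->ip6_ptr;
}

static int accept_untracked_na_model(const struct net_device_model *dev)
{
	struct inet6_dev_model *idev = in6_dev_get_model(dev);

	/* Check before dereferencing; the 0 fallback is an assumption. */
	if (!idev)
		return 0;

	return idev->accept_untracked_na;
}
```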
* Re: [PATCH net v5 1/4] net: ethernet: oa_tc6: Interrupt is active low, level triggered.
From: Parthiban.Veerasooran @ 2026-06-18 4:26 UTC (permalink / raw)
To: Selvamani.Rajagopal, andrew+netdev, davem, edumazet, kuba, pabeni,
robh, krzk+dt, conor+dt, Pier.Beruto
Cc: andrew, netdev, linux-kernel, Conor.Dooley, devicetree
In-Reply-To: <CYYPR02MB9828B41845A534BDF0B0C17083E42@CYYPR02MB9828.namprd02.prod.outlook.com>
Hi Selvamani,
On 17/06/26 10:24 am, Selvamani Rajagopal wrote:
>
>> Subject: Re: [PATCH net v5 1/4] net: ethernet: oa_tc6: Interrupt is active low, level
>> triggered.
>>
>>
>> Hi Selvamani,
>>
>> I did a quick test by connecting Mikroe LAN8651 Click to a Raspberry Pi
>> 4 and shared the feedback below. Please let me know if you need any
>> further details.
>
> Parthiban,
>
> Thanks for testing this.
>
> Though the NULL pointer reference after skb_put is a clue, I am working with our team to see we can see this crash in our setup.
> Will keep you updated.
Sure, thank you.
Best regards,
Parthiban V
>
>>
>> [ 8276.691064] eth1: Receive buffer overflow error
>> [ 8281.662600] Unable to handle kernel NULL pointer dereference at
>> virtual address 0000000000000074> drm_panel_orientation_quirks backlight nfnetlink
>> [ 8281.839427] pc : skb_put+0x14/0x80
>> [ 8281.842864] lr : oa_tc6_macphy_threaded_irq+0x428/0x880 [lan865x_t1s]
>
* RE: [PATCH net v5 1/4] net: ethernet: oa_tc6: Interrupt is active low, level triggered.
From: Selvamani Rajagopal @ 2026-06-18 4:26 UTC (permalink / raw)
To: Parthiban.Veerasooran@microchip.com, andrew+netdev@lunn.ch,
davem@davemloft.net, edumazet@google.com, kuba@kernel.org,
pabeni@redhat.com, robh@kernel.org, krzk+dt@kernel.org,
conor+dt@kernel.org, Piergiorgio Beruto
Cc: andrew@lunn.ch, netdev@vger.kernel.org,
linux-kernel@vger.kernel.org, Conor.Dooley@microchip.com,
devicetree@vger.kernel.org
In-Reply-To: <7c89df6b-32ac-46c8-8400-945879037f2e@microchip.com>
> Subject: Re: [PATCH net v5 1/4] net: ethernet: oa_tc6: Interrupt is active low, level
> triggered.
>
>
>
> Test case 2: Two LAN8651 instances on the same RPI4
>
> Setup:
>
> RPI4 #1 + LAN8651 (IP: 192.168.10.101) <--- RPI4 #2 + EVB-LAN8670-USB
> (IP: 192.168.10.102)
> RPI4 #1 + LAN8651 (IP: 192.168.20.101) <--- RPI4 #2 + EVB-LAN8670-USB
> (IP: 192.168.20.102)
>
> Result:
>
Parthiban,
It appears that we can't reproduce the crash you saw in your setup. The code has been running
all day with more than 5 million "Receive buffer overflow error" messages (yes, I added a counter
to see how many times the code returns the EAGAIN error code).
One likely reason is that our EVB has only one network interface, just like your setup in Test case 1,
where you didn't see any issue.
The AI review bot Sashiko pointed out one potential issue where skb pointers aren't protected, but those
concerns are in the transmit path, and this crash seems to be in the receive path. If you think that might help,
I can generate a patch for that.
What do you suggest? Since you are able to see the crash, would you have time to investigate?
Sincerely
Selva
* Re: [PATCH bpf v2] bpf, sockmap: fix use-after-free when the stream parser resizes the skb
From: Sechang Lim @ 2026-06-18 4:57 UTC (permalink / raw)
To: Kuniyuki Iwashima
Cc: bobbyeshleman, bpf, davem, edumazet, horms, jakub, john.fastabend,
kuba, linux-kernel, netdev, pabeni
In-Reply-To: <20260618002559.1479884-1-kuniyu@google.com>
On Thu, Jun 18, 2026 at 12:25:57AM +0000, Kuniyuki Iwashima wrote:
>From: Sechang Lim <rhkrqnwk98@gmail.com>
>Date: Fri, 12 Jun 2026 12:35:51 +0000
>> sk_psock_strp_parse() runs the BPF_PROG_TYPE_SK_SKB stream-parser program
>> to find the length of the next message. strparser assembles a message out
>> of several received skbs by chaining them onto the head's frag_list and
>> recording where to append the next one in strp->skb_nextp:
>>
>> *strp->skb_nextp = skb;
>> strp->skb_nextp = &skb->next;
>>
>> and then calls the parser on the head:
>>
>> len = (*strp->cb.parse_msg)(strp, head);
>>
>> The parser is only meant to inspect the skb, but the program may call
>> bpf_skb_change_tail() -- or the sibling bpf_skb_pull_data(),
>> bpf_skb_change_head(), bpf_skb_adjust_room(), all allowed for SK_SKB.
>
>It's bpf prog's responsibility not to abuse them.
>
>Even setting aside that, why not simply block such BPF prog ?
>
>It cannot be done at load time, but doable at attach time.
>
>>
Thanks, this is cleaner than cloning. Will fix in v3.
Best,
Sechang
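
For readers unfamiliar with the append idiom quoted above (*strp->skb_nextp = skb; strp->skb_nextp = &skb->next;), here is a minimal user-space sketch. The struct and function names are hypothetical models, not the strparser implementation.

```c
#include <stddef.h>

/* Hypothetical model of an skb: just a link and a length. */
struct skb_model {
	struct skb_model *next;
	int len;
};

/* Hypothetical model of the strparser state for one message. */
struct strp_model {
	struct skb_model *frag_list;	/* chain hanging off the head */
	struct skb_model **nextp;	/* where the next skb gets linked */
};

static void strp_model_init(struct strp_model *s)
{
	s->frag_list = NULL;
	s->nextp = &s->frag_list;
}

/* Same two steps as the quoted code: link the skb, then advance nextp. */
static void strp_model_append(struct strp_model *s, struct skb_model *skb)
{
	skb->next = NULL;
	*s->nextp = skb;
	s->nextp = &skb->next;
}

static int strp_model_total_len(const struct strp_model *s)
{
	int total = 0;

	for (const struct skb_model *p = s->frag_list; p; p = p->next)
		total += p->len;
	return total;
}
```

Note that after an append, nextp holds an address inside the most recently linked element, so it is only meaningful while that element's storage stays where it is.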