* Re: [PATCH net-next v2 3/4] bpf-timestamp: keep track of the skb when wait_for_space occurs
From: Willem de Bruijn @ 2026-04-08 15:15 UTC
To: Jason Xing, Willem de Bruijn
Cc: davem, edumazet, kuba, pabeni, horms, willemb, martin.lau, netdev,
bpf, Jason Xing, Yushan Zhou
In-Reply-To: <CAL+tcoBt1A5GYFvimkcRFUtmy298y13AiF-XFF8b8E8Y6fg8Xw@mail.gmail.com>
> > > > Since we're modifying the kernel, how about adding a new member to
> > > > record sendmsg time which bpf script is able to read. The whole
> > > > scenario looks like this:
> > > > 1) in tcp_sendmsg_locked(), record the sendmsg time for each skb
> > > > 2) in either tso_fragment() or tcp_gso_tstamp(), each new skb will get
> > > > a copy of its original skb
> > > > 3) in each stage, the bpf script reads the skb's sendmsg time and the
> > > > current time, and then effortlessly does the math.
> > > >
> > > > At this point, what I had in mind is we have two options:
> > > > 1) only handle the skb from the view of the send syscall layer, which
> > > > is, for sure, very simple but not thorough.
> > > > 2) stick to a pure, authentic per-packet basis, in which case adding a
> > > > new member seems inevitable. So the question is where to add it. The
> > > > space in the skb structure is very precious :(
> > >
> > > Finding a suitable place to put this timestamp is really hard. IIRC,
> > > we can't expand the size of struct skb_shared_info so easily since
> > > it's a global effect.
> > >
> > > I'm wondering if we can turn the per-packet mode into a non-compatible
> > > feature by reusing 'u32 tskey' to store a microsecond timestamp of
> > > sendmsg.
> >
> > Agreed that an extra field is hard. We should avoid that.
>
> Avoiding a new field makes the whole effort extremely hard. Since we
> already have hwtstamp in shared info, I'm wondering why not add a
> software one for timestamping use? Then we could support more
> protocols at more stages with finer granularity, which is the rough
> big picture I have in mind.
I don't understand the need to store more data in the skb for BPF.
With BPF hooks, the bpf program can record the relevant data directly
in a BPF map.
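To illustrate the map-based alternative, here is a userspace C sketch of the bookkeeping such a program might do: a small table (standing in for a BPF hash map) keyed by the skb's tskey, filled at sendmsg time and read at a later timestamping stage. The names `ts_map`, `on_sendmsg()` and `on_tx_stage()` are hypothetical, and using tskey as the key assumes it is unique among the skbs being tracked at any one time.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a BPF hash map keyed by tskey. */
#define TS_MAP_SIZE 64

struct ts_entry {
	bool used;
	uint32_t tskey;
	uint64_t sendmsg_ns;
};

static struct ts_entry ts_map[TS_MAP_SIZE];

/* Sendmsg-time hook: remember when the data ending at tskey was queued.
 * A colliding key simply overwrites the older entry. */
static void on_sendmsg(uint32_t tskey, uint64_t now_ns)
{
	struct ts_entry *e = &ts_map[tskey % TS_MAP_SIZE];

	e->used = true;
	e->tskey = tskey;
	e->sendmsg_ns = now_ns;
}

/* Later timestamping hook: latency since sendmsg in ns, or -1 if the key
 * was never recorded or has been evicted by a collision. */
static int64_t on_tx_stage(uint32_t tskey, uint64_t now_ns)
{
	const struct ts_entry *e = &ts_map[tskey % TS_MAP_SIZE];

	if (!e->used || e->tskey != tskey)
		return -1;
	assert(now_ns >= e->sendmsg_ns);   /* monotonic clock assumed */
	return (int64_t)(now_ns - e->sendmsg_ns);
}
```

The point of the sketch is that no per-skb storage beyond the existing key is needed: the program owns the state, and eviction policy (here, overwrite on collision) is its own choice.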
> Adding such a software bit would greatly reduce the overall complexity
> and be very easy to use. Would you like to see a draft that adds such a
> bit first?
>
> Or just like I mentioned, repurposing tskey seems an alternative,
> which, however, makes the new feature incompatible.
>
> >
> > If the purpose is to group skbs by sendmsg call (e.g., to filter out
> > all but the last one), it is probably also unnecessary.
> >
> > From a process PoV, since the process knows the sendmsg len and each
> > skb has a tskey in byte offset, it can correlate the skb with a given
> > sendmsg buffer.
> >
> > The BPF program is under control of a third-party admin. So that does
> > not follow directly. But it can be passed additional metadata.
> >
> > I thought about passing the offset of the skb from the start of the
> > sendmsg buffer to identify all consecutive skbs for a sendmsg call,
> > as each new buffer will start with an skb with offset 0 ..
> >
> > .. but that won't work as there is no guarantee that a sendmsg call
> > will not append to an existing outstanding skb.
>
> Right. TCP is way too complex and we indeed see some tough issues when
> trying to deploy the feature. So my humble take is to make the design
> as simple as possible.
>
> >
> > Anyway, the general idea is to pass to the BPF program through
> > bpf_skops_tx_timestamping some relevant signal, without having to
> > expand either skb or sk itself.
> >
> > I hear you that measuring every skb is too frequent. But is calling
> > the BPF program and letting it decide whether to measure too costly
> > as well? BPF program invocation itself should be cheap.
>
> Oh, I wasn't clear enough. Sorry. I meant tracing per skb is definitely
> an awesome way to go. My ultimate goal is to do so. Instead of letting
> people implement various fine grained bpf progs, we can provide a very
> easy/understandable/efficient approach with more samples. It should be
> very beneficial.
>
> >
> > If per-push is preferable, with a filter ability like the above, it
> > seems more useful to me already.
>
> Push-level is a compromise plan. Packet-level is what I always pursue :)
Then why not implement per-packet directly, if the BPF call is cheap
and the BPF program can choose to selectively track packets?
A reminder that you do not want to break (BPF) users by changing
behavior, let alone more than once. If per-push is going to be
obsoleted, skip it entirely.
> The current series has this ability: the bpf prog notices that it is
> in the SENDMSG sock_ops callback and selectively calls
> bpf_sock_ops_enable_tx_tstamp(). An skb is tracked only if
> bpf_sock_ops_enable_tx_tstamp() has been called for it.
>
> Thanks,
> Jason
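Willem's earlier point that a process can correlate an skb with a sendmsg buffer via tskey can be sketched as a lookup over the running byte counts of past sendmsg calls. This assumes, as with SOF_TIMESTAMPING_OPT_ID on TCP, that tskey is the byte offset of the last byte being timestamped; 32-bit wraparound is ignored, and `sendmsg_index_for_tskey()` is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * ends[i] is the running byte count after the i-th sendmsg call
 * (strictly increasing). Returns the index of the call whose buffer
 * contains byte offset 'tskey', or -1 if it lies beyond all calls.
 */
static long sendmsg_index_for_tskey(const uint32_t *ends, size_t n,
				    uint32_t tskey)
{
	size_t lo = 0, hi = n;   /* binary search over [lo, hi) */

	assert(n == 0 || ends != NULL);
	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (tskey < ends[mid])
			hi = mid;
		else
			lo = mid + 1;
	}
	return lo < n ? (long)lo : -1;
}
```

With ends = {100, 250, 400}, byte 99 maps to call 0 and byte 100 to call 1. This also shows the caveat raised above: an skb carrying bytes 80..119 reports tskey 119 and maps only to call 1, so the fact that it also carried the tail of call 0 is not recoverable from tskey alone.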
* Re: [PATCH bpf-next v3 5/6] bpf: clear decap tunnel GSO state in skb_adjust_room
From: Willem de Bruijn @ 2026-04-08 15:10 UTC
To: Nick Hudson, bpf, netdev, Willem de Bruijn, Martin KaFai Lau
Cc: Nick Hudson, Max Tottenham, Anna Glasgall, Alexei Starovoitov,
Daniel Borkmann, Andrii Nakryiko, Eduard Zingerman,
Kumar Kartikeya Dwivedi, David S. Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, linux-kernel
In-Reply-To: <20260407105005.1639815-6-nhudson@akamai.com>
Nick Hudson wrote:
> On shrink in bpf_skb_adjust_room(), clear tunnel-specific GSO flags
> according to the decapsulation flags:
>
> - BPF_F_ADJ_ROOM_DECAP_L4_UDP clears SKB_GSO_UDP_TUNNEL{,_CSUM}, and
> SKB_GSO_TUNNEL_REMCSUM
> - BPF_F_ADJ_ROOM_DECAP_L4_GRE clears SKB_GSO_GRE{,_CSUM}
> - BPF_F_ADJ_ROOM_DECAP_IPXIP4 clears SKB_GSO_IPXIP4
> - BPF_F_ADJ_ROOM_DECAP_IPXIP6 clears SKB_GSO_IPXIP6
>
> When all tunnel-related GSO bits are cleared, also clear
> skb->encapsulation.
>
> Handle the case of ESP inside a UDP tunnel, where encapsulation should
> remain set.
>
> If UDP decap is performed and the GSO state is removed, also reset
> encap_hdr_csum and remcsum_offload.
>
> Co-developed-by: Max Tottenham <mtottenh@akamai.com>
> Signed-off-by: Max Tottenham <mtottenh@akamai.com>
> Co-developed-by: Anna Glasgall <aglasgal@akamai.com>
> Signed-off-by: Anna Glasgall <aglasgal@akamai.com>
> Signed-off-by: Nick Hudson <nhudson@akamai.com>
> ---
> net/core/filter.c | 40 ++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 40 insertions(+)
>
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 7f8d43420afb..04059d07d368 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -3667,6 +3667,46 @@ static int bpf_skb_net_shrink(struct sk_buff *skb, u32 off, u32 len_diff,
> if (!(flags & BPF_F_ADJ_ROOM_FIXED_GSO))
> skb_increase_gso_size(shinfo, len_diff);
>
> + /* Selective GSO flag clearing based on decap type.
> + * Only clear the flags for the tunnel layer being removed.
> + */
> + if ((flags & BPF_F_ADJ_ROOM_DECAP_L4_UDP) &&
> + (shinfo->gso_type & (SKB_GSO_UDP_TUNNEL |
> + SKB_GSO_UDP_TUNNEL_CSUM |
> + SKB_GSO_TUNNEL_REMCSUM)))
> + shinfo->gso_type &= ~(SKB_GSO_UDP_TUNNEL |
> + SKB_GSO_UDP_TUNNEL_CSUM |
> + SKB_GSO_TUNNEL_REMCSUM);
REMCSUM was previously not included in the series.
It is a non-obvious and rare enough feature that I would exclude it,
or move it to a separate patch.
> + if ((flags & BPF_F_ADJ_ROOM_DECAP_L4_GRE) &&
> + (shinfo->gso_type & (SKB_GSO_GRE | SKB_GSO_GRE_CSUM)))
> + shinfo->gso_type &= ~(SKB_GSO_GRE |
> + SKB_GSO_GRE_CSUM);
> + if ((flags & BPF_F_ADJ_ROOM_DECAP_IPXIP4) &&
> + (shinfo->gso_type & SKB_GSO_IPXIP4))
> + shinfo->gso_type &= ~SKB_GSO_IPXIP4;
> + if ((flags & BPF_F_ADJ_ROOM_DECAP_IPXIP6) &&
> + (shinfo->gso_type & SKB_GSO_IPXIP6))
> + shinfo->gso_type &= ~SKB_GSO_IPXIP6;
> +
> + /* Clear encapsulation flag only when no tunnel GSO flags remain */
> + if (flags & (BPF_F_ADJ_ROOM_DECAP_L4_MASK |
> + BPF_F_ADJ_ROOM_DECAP_IPXIP_MASK)) {
> + if (!(shinfo->gso_type & (SKB_GSO_UDP_TUNNEL |
> + SKB_GSO_UDP_TUNNEL_CSUM |
> + SKB_GSO_GRE |
> + SKB_GSO_GRE_CSUM |
> + SKB_GSO_IPXIP4 |
> + SKB_GSO_IPXIP6 |
> + SKB_GSO_ESP)))
> + if (skb->encapsulation)
> + skb->encapsulation = 0;
> +
> + if (flags & BPF_F_ADJ_ROOM_DECAP_L4_UDP) {
> + skb->encap_hdr_csum = !!(shinfo->gso_type & SKB_GSO_UDP_TUNNEL_CSUM);
Since the flag is never set, only possibly cleared: just clear this field when clearing the flag?
It appears that this is only used for deprecated UFO anyway.
> + skb->remcsum_offload = !!(shinfo->gso_type & SKB_GSO_TUNNEL_REMCSUM);
Always zero?
* [PATCH net-next v2 06/10] enic: add MBOX core send and receive for admin channel
From: Satish Kharat via B4 Relay @ 2026-04-08 15:08 UTC
To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni
Cc: netdev, linux-kernel,
20260401-enic-sriov-v2-prep-v4-0-d5834b2ef1b9, Satish Kharat
In-Reply-To: <20260408-enic-sriov-v2-admin-channel-v2-v2-0-d05dd3623fd3@cisco.com>
From: Satish Kharat <satishkh@cisco.com>
Implement the mailbox protocol engine used for PF-VF communication
over the admin channel.
The send path (enic_mbox_send_msg) builds a message with a common
header, DMA-maps it, posts a single WQ descriptor with the
destination vnic ID encoded in the VLAN tag field, and polls
the WQ CQ for completion.
The receive path (enic_mbox_recv_handler) is installed as the admin
RQ callback and validates incoming message headers. PF/VF-specific
dispatch will be added in subsequent commits.
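The VLAN-tag addressing used by the send path, and mirrored on the receive side in enic_admin_rq_cq_service(), amounts to a pair of pure functions. The driver open-codes this mapping inline; the helper names below are hypothetical and this is only a userspace C sketch.

```c
#include <assert.h>
#include <stdint.h>

#define MBOX_DST_PF 0xFFFF   /* mirrors ENIC_MBOX_DST_PF in enic_mbox.h */

/* Send side: destination vnic ID -> VLAN tag (0 = PF, VF index + 1). */
static uint16_t mbox_dst_to_vlan(uint16_t dst_vnic_id)
{
	if (dst_vnic_id == MBOX_DST_PF)
		return 0;
	return (uint16_t)(dst_vnic_id + 1);
}

/* Receive side: hardware-reported VLAN -> sender vnic ID. */
static uint16_t mbox_vlan_to_src(uint16_t vlan)
{
	return vlan == 0 ? MBOX_DST_PF : (uint16_t)(vlan - 1);
}

/* Round-trip property the two directions should satisfy. */
static void mbox_vlan_selftest(uint16_t id)
{
	assert(mbox_vlan_to_src(mbox_dst_to_vlan(id)) == id);
}
```

Keeping both directions next to each other makes the off-by-one convention easy to check, for example that VF 0 travels as VLAN 1 and VLAN 0 always means the PF.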
Signed-off-by: Satish Kharat <satishkh@cisco.com>
---
drivers/net/ethernet/cisco/enic/Makefile | 2 +-
drivers/net/ethernet/cisco/enic/enic.h | 6 ++
drivers/net/ethernet/cisco/enic/enic_admin.c | 23 +++-
drivers/net/ethernet/cisco/enic/enic_mbox.c | 156 +++++++++++++++++++++++++++
drivers/net/ethernet/cisco/enic/enic_mbox.h | 8 ++
5 files changed, 193 insertions(+), 2 deletions(-)
diff --git a/drivers/net/ethernet/cisco/enic/Makefile b/drivers/net/ethernet/cisco/enic/Makefile
index 7ae72fefc99a..e38aaf34c148 100644
--- a/drivers/net/ethernet/cisco/enic/Makefile
+++ b/drivers/net/ethernet/cisco/enic/Makefile
@@ -4,5 +4,5 @@ obj-$(CONFIG_ENIC) := enic.o
enic-y := enic_main.o vnic_cq.o vnic_intr.o vnic_wq.o \
enic_res.o enic_dev.o enic_pp.o vnic_dev.o vnic_rq.o vnic_vic.o \
enic_ethtool.o enic_api.o enic_clsf.o enic_rq.o enic_wq.o \
- enic_admin.o
+ enic_admin.o enic_mbox.o
diff --git a/drivers/net/ethernet/cisco/enic/enic.h b/drivers/net/ethernet/cisco/enic/enic.h
index 1c09da3c0b1a..42f345aceced 100644
--- a/drivers/net/ethernet/cisco/enic/enic.h
+++ b/drivers/net/ethernet/cisco/enic/enic.h
@@ -292,6 +292,8 @@ struct enic {
/* Admin channel resources for SR-IOV MBOX */
bool has_admin_channel;
+ /* set on send timeout; cleared on channel re-open */
+ bool mbox_send_disabled;
struct vnic_wq admin_wq;
struct vnic_rq admin_rq;
struct vnic_cq admin_cq[2];
@@ -304,6 +306,10 @@ struct enic {
u64 admin_msg_drop_cnt;
void (*admin_rq_handler)(struct enic *enic, void *buf,
unsigned int len);
+
+ /* MBOX protocol state */
+ struct mutex mbox_lock;
+ u64 mbox_msg_num;
};
static inline struct net_device *vnic_get_netdev(struct vnic_dev *vdev)
diff --git a/drivers/net/ethernet/cisco/enic/enic_admin.c b/drivers/net/ethernet/cisco/enic/enic_admin.c
index 345d194c6eeb..c96268adc173 100644
--- a/drivers/net/ethernet/cisco/enic/enic_admin.c
+++ b/drivers/net/ethernet/cisco/enic/enic_admin.c
@@ -19,6 +19,7 @@
#include "cq_enet_desc.h"
#include "wq_enet_desc.h"
#include "rq_enet_desc.h"
+#include "enic_mbox.h"
/* No-op: admin WQ buffers are freed inline after completion polling */
static void enic_admin_wq_buf_clean(struct vnic_wq *wq,
@@ -156,7 +157,26 @@ unsigned int enic_admin_rq_cq_service(struct enic *enic, unsigned int budget)
buf->dma_addr, buf->len,
DMA_FROM_DEVICE);
- enic_admin_msg_enqueue(enic, buf->os_buf, buf->len);
+ if (enic->admin_rq_handler) {
+ struct cq_enet_rq_desc *rq_desc = desc;
+ u16 sender_vlan;
+
+ /* Firmware sets the CQ VLAN field to identify the
+ * sender: 0 = PF, 1-based = VF index. Overwrite
+ * the untrusted src_vnic_id in the MBOX header with
+ * the hardware-verified value.
+ */
+ sender_vlan = le16_to_cpu(rq_desc->vlan);
+ if (buf->len >= sizeof(struct enic_mbox_hdr)) {
+ struct enic_mbox_hdr *hdr = buf->os_buf;
+
+ hdr->src_vnic_id = (sender_vlan == 0) ?
+ cpu_to_le16(ENIC_MBOX_DST_PF) :
+ cpu_to_le16(sender_vlan - 1);
+ }
+
+ enic_admin_msg_enqueue(enic, buf->os_buf, buf->len);
+ }
enic_admin_rq_buf_clean(rq, rq->to_clean);
rq->to_clean = rq->to_clean->next;
@@ -389,6 +409,7 @@ int enic_admin_channel_open(struct enic *enic)
if (!enic->has_admin_channel)
return -ENODEV;
+ enic->mbox_send_disabled = false;
err = enic_admin_alloc_resources(enic);
if (err) {
netdev_err(enic->netdev,
diff --git a/drivers/net/ethernet/cisco/enic/enic_mbox.c b/drivers/net/ethernet/cisco/enic/enic_mbox.c
new file mode 100644
index 000000000000..00ab76a47a35
--- /dev/null
+++ b/drivers/net/ethernet/cisco/enic/enic_mbox.c
@@ -0,0 +1,156 @@
+// SPDX-License-Identifier: GPL-2.0-only
+// Copyright 2025 Cisco Systems, Inc. All rights reserved.
+
+#include <linux/kernel.h>
+#include <linux/netdevice.h>
+#include <linux/dma-mapping.h>
+#include <linux/delay.h>
+
+#include "vnic_dev.h"
+#include "vnic_wq.h"
+#include "vnic_cq.h"
+#include "enic.h"
+#include "enic_admin.h"
+#include "enic_mbox.h"
+#include "wq_enet_desc.h"
+
+#define ENIC_MBOX_POLL_TIMEOUT_US 5000000
+#define ENIC_MBOX_POLL_INTERVAL_US 100
+
+static void enic_mbox_fill_hdr(struct enic *enic, struct enic_mbox_hdr *hdr,
+ u8 msg_type, u16 dst_vnic_id, u16 msg_len)
+{
+ memset(hdr, 0, sizeof(*hdr));
+ hdr->dst_vnic_id = cpu_to_le16(dst_vnic_id);
+ hdr->msg_type = msg_type;
+ hdr->msg_len = cpu_to_le16(msg_len);
+ hdr->msg_num = cpu_to_le64(++enic->mbox_msg_num);
+}
+
+int enic_mbox_send_msg(struct enic *enic, u8 msg_type, u16 dst_vnic_id,
+ void *payload, u16 payload_len)
+{
+ u16 total_len = sizeof(struct enic_mbox_hdr) + payload_len;
+ struct vnic_wq *wq = &enic->admin_wq;
+ struct wq_enet_desc *desc;
+ dma_addr_t dma_addr;
+ unsigned long timeout;
+ u16 vlan_tag;
+ void *buf;
+ int err;
+
+ /* Serialize MBOX sends. The admin channel is a low-frequency
+ * control path; holding the mutex across the poll is acceptable.
+ */
+ mutex_lock(&enic->mbox_lock);
+
+ if (!enic->has_admin_channel || enic->mbox_send_disabled) {
+ err = -ENODEV;
+ goto unlock;
+ }
+
+ if (vnic_wq_desc_avail(wq) == 0) {
+ err = -ENOSPC;
+ goto unlock;
+ }
+
+ buf = kmalloc(total_len, GFP_KERNEL);
+ if (!buf) {
+ err = -ENOMEM;
+ goto unlock;
+ }
+
+ enic_mbox_fill_hdr(enic, buf, msg_type, dst_vnic_id, total_len);
+ if (payload_len) {
+ void *dst = buf + sizeof(struct enic_mbox_hdr);
+
+ memcpy(dst, payload, payload_len);
+ }
+
+ dma_addr = dma_map_single(&enic->pdev->dev, buf, total_len,
+ DMA_TO_DEVICE);
+ if (dma_mapping_error(&enic->pdev->dev, dma_addr)) {
+ kfree(buf);
+ err = -ENOMEM;
+ goto unlock;
+ }
+
+ /* Firmware uses vlan field for routing: 0 = PF, 1-based = VF index */
+ if (dst_vnic_id == ENIC_MBOX_DST_PF)
+ vlan_tag = 0;
+ else
+ vlan_tag = dst_vnic_id + 1;
+
+ desc = vnic_wq_next_desc(wq);
+ wq_enet_desc_enc(desc, (u64)dma_addr | VNIC_PADDR_TARGET,
+ total_len, 0, 0, 0, 1, 1, 0, 1, vlan_tag, 0);
+ vnic_wq_post(wq, buf, dma_addr, total_len, 1, 1, 1, 1, 0, 0);
+ vnic_wq_doorbell(wq);
+
+ timeout = jiffies + usecs_to_jiffies(ENIC_MBOX_POLL_TIMEOUT_US);
+ err = -ETIMEDOUT;
+ while (time_before(jiffies, timeout)) {
+ if (enic_admin_wq_cq_service(enic)) {
+ err = 0;
+ break;
+ }
+ usleep_range(ENIC_MBOX_POLL_INTERVAL_US,
+ ENIC_MBOX_POLL_INTERVAL_US + 50);
+ }
+
+ if (!err) {
+ wq->to_clean = wq->to_clean->next;
+ wq->ring.desc_avail++;
+ dma_unmap_single(&enic->pdev->dev, dma_addr, total_len,
+ DMA_TO_DEVICE);
+ kfree(buf);
+ } else {
+ netdev_err(enic->netdev,
+ "MBOX send timed out (type %u dst %u), disabling channel\n",
+ msg_type, dst_vnic_id);
+ /*
+ * The WQ descriptor is still live in hardware. Do not unmap
+ * or free the buffer: the device may still DMA from dma_addr.
+ * Mark the channel unusable so no further sends are attempted.
+ */
+ enic->mbox_send_disabled = true;
+ }
+
+ netdev_dbg(enic->netdev,
+ "MBOX send msg_type %u dst %u vlan %u err %d\n",
+ msg_type, dst_vnic_id, vlan_tag, err);
+unlock:
+ mutex_unlock(&enic->mbox_lock);
+ return err;
+}
+
+static void enic_mbox_recv_handler(struct enic *enic, void *buf,
+ unsigned int len)
+{
+ struct enic_mbox_hdr *hdr = buf;
+
+ if (len < sizeof(*hdr)) {
+ netdev_warn(enic->netdev,
+ "MBOX: truncated message (len %u < %zu)\n",
+ len, sizeof(*hdr));
+ return;
+ }
+
+ if (hdr->msg_type >= ENIC_MBOX_MAX) {
+ netdev_warn(enic->netdev, "MBOX: unknown msg type %u\n",
+ hdr->msg_type);
+ return;
+ }
+
+ netdev_dbg(enic->netdev,
+ "MBOX recv: type %u from vnic %u len %u\n",
+ hdr->msg_type, le16_to_cpu(hdr->src_vnic_id),
+ le16_to_cpu(hdr->msg_len));
+}
+
+void enic_mbox_init(struct enic *enic)
+{
+ enic->mbox_msg_num = 0;
+ mutex_init(&enic->mbox_lock);
+ enic->admin_rq_handler = enic_mbox_recv_handler;
+}
diff --git a/drivers/net/ethernet/cisco/enic/enic_mbox.h b/drivers/net/ethernet/cisco/enic/enic_mbox.h
index 84cb6bbc1ead..554269b78780 100644
--- a/drivers/net/ethernet/cisco/enic/enic_mbox.h
+++ b/drivers/net/ethernet/cisco/enic/enic_mbox.h
@@ -72,4 +72,12 @@ struct enic_mbox_pf_link_state_ack_msg {
struct enic_mbox_generic_reply ack;
};
+#define ENIC_MBOX_DST_PF 0xFFFF
+
+struct enic;
+
+void enic_mbox_init(struct enic *enic);
+int enic_mbox_send_msg(struct enic *enic, u8 msg_type, u16 dst_vnic_id,
+ void *payload, u16 payload_len);
+
#endif /* _ENIC_MBOX_H_ */
--
2.43.0
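The completion wait in enic_mbox_send_msg() is a bounded poll: check, sleep, repeat until a deadline. A userspace C analogue of that loop is sketched below, using CLOCK_MONOTONIC and nanosleep() in place of jiffies and usleep_range(). `poll_until()` and `countdown_done()` are hypothetical names; the predicate stands in for enic_admin_wq_cq_service().

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

static uint64_t mono_us(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000u + (uint64_t)ts.tv_nsec / 1000u;
}

/*
 * Poll done(ctx) every interval_us until it returns true or timeout_us
 * elapses. Returns 0 on success, -ETIMEDOUT otherwise. The condition is
 * checked once more after the final sleep, before giving up.
 */
static int poll_until(bool (*done)(void *), void *ctx,
		      uint64_t timeout_us, uint64_t interval_us)
{
	uint64_t deadline = mono_us() + timeout_us;
	struct timespec nap = {
		.tv_sec = (time_t)(interval_us / 1000000u),
		.tv_nsec = (long)(interval_us % 1000000u) * 1000L,
	};

	assert(done != NULL);
	for (;;) {
		if (done(ctx))
			return 0;
		if (mono_us() >= deadline)
			return -ETIMEDOUT;
		nanosleep(&nap, NULL);
	}
}

/* Example predicate: becomes true on the *ctx-th poll. */
static bool countdown_done(void *ctx)
{
	int *left = ctx;

	return --*left <= 0;
}
```

One design difference worth noting: the loop in the patch tests the deadline before servicing the CQ, so a completion that lands during the last usleep_range() appears to be reported as a timeout, which then disables the channel. Checking the condition once after the deadline, as above and as the kernel's read_poll_timeout() style helpers do, avoids that narrow window.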
* [PATCH net-next v2 05/10] enic: define MBOX message types and header structures
From: Satish Kharat via B4 Relay @ 2026-04-08 15:08 UTC
To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni
Cc: netdev, linux-kernel,
20260401-enic-sriov-v2-prep-v4-0-d5834b2ef1b9, Satish Kharat
In-Reply-To: <20260408-enic-sriov-v2-admin-channel-v2-v2-0-d05dd3623fd3@cisco.com>
From: Satish Kharat <satishkh@cisco.com>
Define the mailbox protocol used for PF-VF communication over the
admin channel. The protocol uses request/reply pairs where even
message types are requests and odd are replies.
Initial message types cover the core SR-IOV handshake:
- VF_CAPABILITY: version negotiation
- VF_REGISTER/UNREGISTER: VF lifecycle management
- PF_LINK_STATE_NOTIF: PF-initiated link state changes
Each message carries a common header (src/dst vnic ID, type,
length, sequence number) followed by a type-specific payload.
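The request/reply numbering and the header layout described above can be checked with a small userspace mirror of the header. The struct and helper names are hypothetical and use <stdint.h> types instead of __le16/__le64. The packing helper assumes the wire format is the little-endian field order of struct enic_mbox_hdr as defined in this patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace mirror of struct enic_mbox_hdr. */
struct mbox_hdr {
	uint16_t src_vnic_id;
	uint16_t dst_vnic_id;
	uint8_t  msg_type;
	uint8_t  flags;
	uint16_t msg_len;
	uint64_t msg_num;
};

/* 2 + 2 + 1 + 1 + 2 bytes put msg_num at offset 8 with no padding. */
static_assert(offsetof(struct mbox_hdr, msg_num) == 8, "msg_num offset");
static_assert(sizeof(struct mbox_hdr) == 16, "header size");

/* Even types are requests; the matching reply is the next odd value. */
static bool mbox_is_request(uint8_t type)
{
	return (type & 1) == 0;
}

static uint8_t mbox_reply_type(uint8_t request_type)
{
	assert(mbox_is_request(request_type));
	return request_type | 1;
}

static void put_le16(uint8_t *p, uint16_t v)
{
	p[0] = (uint8_t)v;
	p[1] = (uint8_t)(v >> 8);
}

static void put_le64(uint8_t *p, uint64_t v)
{
	for (int i = 0; i < 8; i++)
		p[i] = (uint8_t)(v >> (8 * i));
}

/* Serialize the header in little-endian field order into out[16]. */
static void mbox_hdr_pack(const struct mbox_hdr *h, uint8_t out[16])
{
	put_le16(out + 0, h->src_vnic_id);
	put_le16(out + 2, h->dst_vnic_id);
	out[4] = h->msg_type;
	out[5] = h->flags;
	put_le16(out + 6, h->msg_len);
	put_le64(out + 8, h->msg_num);
}
```

For example, VF_REGISTER_REQUEST (2) pairs with VF_REGISTER_REPLY (3), and PF_LINK_STATE_NOTIF (6) with PF_LINK_STATE_ACK (7).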
Signed-off-by: Satish Kharat <satishkh@cisco.com>
---
drivers/net/ethernet/cisco/enic/enic_mbox.h | 75 +++++++++++++++++++++++++++++
1 file changed, 75 insertions(+)
diff --git a/drivers/net/ethernet/cisco/enic/enic_mbox.h b/drivers/net/ethernet/cisco/enic/enic_mbox.h
new file mode 100644
index 000000000000..84cb6bbc1ead
--- /dev/null
+++ b/drivers/net/ethernet/cisco/enic/enic_mbox.h
@@ -0,0 +1,75 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/* Copyright 2025 Cisco Systems, Inc. All rights reserved. */
+
+#ifndef _ENIC_MBOX_H_
+#define _ENIC_MBOX_H_
+
+/*
+ * Mailbox protocol for PF-VF communication over the admin channel.
+ *
+ * Even numbers are requests, odd numbers are replies/acks.
+ * The prefix indicates the initiator: VF_ = VF-initiated, PF_ = PF-initiated.
+ */
+enum enic_mbox_msg_type {
+ ENIC_MBOX_VF_CAPABILITY_REQUEST = 0,
+ ENIC_MBOX_VF_CAPABILITY_REPLY = 1,
+ ENIC_MBOX_VF_REGISTER_REQUEST = 2,
+ ENIC_MBOX_VF_REGISTER_REPLY = 3,
+ ENIC_MBOX_VF_UNREGISTER_REQUEST = 4,
+ ENIC_MBOX_VF_UNREGISTER_REPLY = 5,
+ ENIC_MBOX_PF_LINK_STATE_NOTIF = 6,
+ ENIC_MBOX_PF_LINK_STATE_ACK = 7,
+ ENIC_MBOX_MAX
+};
+
+struct enic_mbox_hdr {
+ __le16 src_vnic_id;
+ __le16 dst_vnic_id;
+ u8 msg_type;
+ u8 flags;
+ __le16 msg_len;
+ __le64 msg_num;
+};
+
+struct enic_mbox_generic_reply {
+ __le16 ret_major;
+ __le16 ret_minor;
+};
+
+#define ENIC_MBOX_ERR_GENERIC BIT(0)
+#define ENIC_MBOX_ERR_VF_NOT_REGISTERED BIT(1)
+#define ENIC_MBOX_ERR_MSG_NOT_SUPPORTED BIT(2)
+
+/* ENIC_MBOX_VF_CAPABILITY_REQUEST / _REPLY */
+#define ENIC_MBOX_CAP_VERSION_0 0
+#define ENIC_MBOX_CAP_VERSION_1 1
+
+struct enic_mbox_vf_capability_msg {
+ __le32 version;
+ __le32 reserved[32];
+};
+
+struct enic_mbox_vf_capability_reply_msg {
+ struct enic_mbox_generic_reply reply;
+ __le32 version;
+ __le32 reserved[32];
+};
+
+/* ENIC_MBOX_VF_REGISTER / _UNREGISTER */
+struct enic_mbox_vf_register_reply_msg {
+ struct enic_mbox_generic_reply reply;
+};
+
+/* ENIC_MBOX_PF_LINK_STATE_NOTIF / _ACK */
+#define ENIC_MBOX_LINK_STATE_DISABLE 0
+#define ENIC_MBOX_LINK_STATE_ENABLE 1
+
+struct enic_mbox_pf_link_state_notif_msg {
+ __le32 link_state;
+};
+
+struct enic_mbox_pf_link_state_ack_msg {
+ struct enic_mbox_generic_reply ack;
+};
+
+#endif /* _ENIC_MBOX_H_ */
--
2.43.0
* [PATCH net-next v2 09/10] enic: wire V2 SR-IOV enable with admin channel and MBOX
From: Satish Kharat via B4 Relay @ 2026-04-08 15:08 UTC
To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni
Cc: netdev, linux-kernel,
20260401-enic-sriov-v2-prep-v4-0-d5834b2ef1b9, Satish Kharat
In-Reply-To: <20260408-enic-sriov-v2-admin-channel-v2-v2-0-d05dd3623fd3@cisco.com>
From: Satish Kharat <satishkh@cisco.com>
Extend enic_sriov_configure() to handle V2 SR-IOV VFs. When the PF
detects V2 VF device IDs, the enable path allocates per-VF MBOX state,
opens the admin channel, initializes the MBOX protocol, and then calls
pci_enable_sriov(). The admin channel must be ready before VFs are
created so that VF drivers can immediately begin the MBOX capability
and registration handshake during their probe.
The disable path reverses this order: pci_disable_sriov() first (so VF
drivers unregister via MBOX), then the admin channel is closed and
per-VF state is freed.
The existing V1/USNIC SR-IOV paths are unchanged.
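The ordering described above follows the usual goto-unwind pattern: set up in dependency order, tear down in reverse, and on failure unwind only what already succeeded. The C sketch below uses hypothetical stub functions that only record the order in which the steps run. They are not the driver's functions.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/* Hypothetical stubs; each appends its name to a trace string. */
static char trace[128];
static int fail_enable_sriov;   /* set non-zero to simulate failure */

static void step(const char *s)
{
	assert(strlen(trace) + strlen(s) + 2 < sizeof(trace));
	strcat(trace, s);
	strcat(trace, " ");
}

static int  alloc_vf_state(void) { step("alloc"); return 0; }
static void free_vf_state(void)  { step("free"); }
static int  admin_open(void)     { step("open"); return 0; }
static void admin_close(void)    { step("close"); }
static void mbox_init(void)      { step("init"); }
static void disable_sriov(void)  { step("sriov_off"); }

static int enable_sriov(void)
{
	step("sriov_on");
	return fail_enable_sriov ? -EIO : 0;
}

static int v2_enable(void)
{
	int err;

	err = alloc_vf_state();
	if (err)
		return err;
	err = admin_open();
	if (err)
		goto free_state;
	mbox_init();
	err = enable_sriov();   /* VFs may start the MBOX handshake now */
	if (err)
		goto close_admin;
	return 0;

close_admin:
	admin_close();
free_state:
	free_vf_state();
	return err;
}

/* Teardown mirrors setup: VFs go away first so they can unregister. */
static void v2_disable(void)
{
	disable_sriov();
	admin_close();
	free_vf_state();
}
```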
Signed-off-by: Satish Kharat <satishkh@cisco.com>
---
drivers/net/ethernet/cisco/enic/enic_main.c | 137 ++++++++++++++++++++++++++--
drivers/net/ethernet/cisco/enic/enic_res.c | 1 +
drivers/net/ethernet/cisco/enic/vnic_enet.h | 4 +-
3 files changed, 134 insertions(+), 8 deletions(-)
diff --git a/drivers/net/ethernet/cisco/enic/enic_main.c b/drivers/net/ethernet/cisco/enic/enic_main.c
index 3a4afd6da41f..48c38347d6ce 100644
--- a/drivers/net/ethernet/cisco/enic/enic_main.c
+++ b/drivers/net/ethernet/cisco/enic/enic_main.c
@@ -60,6 +60,8 @@
#include "enic_clsf.h"
#include "enic_rq.h"
#include "enic_wq.h"
+#include "enic_admin.h"
+#include "enic_mbox.h"
#define ENIC_NOTIFY_TIMER_PERIOD (2 * HZ)
@@ -2688,6 +2690,120 @@ static void enic_sriov_detect_vf_type(struct enic *enic)
}
}
}
+
+static int __maybe_unused
+enic_sriov_v2_enable(struct enic *enic, int num_vfs)
+{
+ int err;
+
+ if (!enic->has_admin_channel) {
+ netdev_err(enic->netdev,
+ "V2 SR-IOV requires admin channel resources\n");
+ return -EOPNOTSUPP;
+ }
+
+ enic->vf_state = kcalloc(num_vfs, sizeof(*enic->vf_state), GFP_KERNEL);
+ if (!enic->vf_state)
+ return -ENOMEM;
+
+ err = enic_admin_channel_open(enic);
+ if (err) {
+ netdev_err(enic->netdev,
+ "Failed to open admin channel: %d\n", err);
+ goto free_vf_state;
+ }
+
+ enic_mbox_init(enic);
+
+ enic->num_vfs = num_vfs;
+
+ err = pci_enable_sriov(enic->pdev, num_vfs);
+ if (err) {
+ netdev_err(enic->netdev,
+ "pci_enable_sriov failed: %d\n", err);
+ goto close_admin;
+ }
+
+ enic->priv_flags |= ENIC_SRIOV_ENABLED;
+ return num_vfs;
+
+close_admin:
+ enic->num_vfs = 0;
+ enic_admin_channel_close(enic);
+free_vf_state:
+ kfree(enic->vf_state);
+ enic->vf_state = NULL;
+ return err;
+}
+
+static void enic_sriov_v2_disable(struct enic *enic)
+{
+ pci_disable_sriov(enic->pdev);
+ enic_admin_channel_close(enic);
+ kfree(enic->vf_state);
+ enic->vf_state = NULL;
+ enic->num_vfs = 0;
+ enic->priv_flags &= ~ENIC_SRIOV_ENABLED;
+}
+
+static int __maybe_unused
+enic_sriov_configure(struct pci_dev *pdev, int num_vfs)
+{
+ struct net_device *netdev = pci_get_drvdata(pdev);
+ struct enic *enic = netdev_priv(netdev);
+ struct enic_port_profile *pp;
+ int err;
+
+ if (num_vfs > 0) {
+ if (enic->config.mq_subvnic_count) {
+ netdev_err(netdev,
+ "SR-IOV not supported with multi-queue sub-vnics\n");
+ return -EOPNOTSUPP;
+ }
+
+ if (enic->vf_type == ENIC_VF_TYPE_NONE) {
+ netdev_err(netdev,
+ "SR-IOV not supported on this firmware version\n");
+ return -EOPNOTSUPP;
+ }
+
+ if (enic->vf_type == ENIC_VF_TYPE_V2)
+ return enic_sriov_v2_enable(enic, num_vfs);
+
+ pp = kcalloc(num_vfs, sizeof(*pp), GFP_KERNEL);
+ if (!pp)
+ return -ENOMEM;
+
+ err = pci_enable_sriov(pdev, num_vfs);
+ if (err) {
+ kfree(pp);
+ return err;
+ }
+
+ kfree(enic->pp);
+ enic->pp = pp;
+ enic->num_vfs = num_vfs;
+ enic->priv_flags |= ENIC_SRIOV_ENABLED;
+ return num_vfs;
+ }
+
+ if (!enic_sriov_enabled(enic))
+ return 0;
+
+ if (enic->vf_type == ENIC_VF_TYPE_V2) {
+ enic_sriov_v2_disable(enic);
+ return 0;
+ }
+
+ pci_disable_sriov(pdev);
+ enic->num_vfs = 0;
+ enic->priv_flags &= ~ENIC_SRIOV_ENABLED;
+
+ kfree(enic->pp);
+ enic->pp = kzalloc(sizeof(*enic->pp), GFP_KERNEL);
+
+ return 0;
+}
#endif
static int enic_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
@@ -2786,12 +2902,18 @@ static int enic_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
goto err_out_vnic_unregister;
#ifdef CONFIG_PCI_IOV
- /* Get number of subvnics */
+ enic_sriov_detect_vf_type(enic);
+
+ /* Auto-enable SR-IOV if VFs were pre-configured (e.g. at boot).
+ * V2 VFs require the admin channel, which is not yet set up at probe
+ * time; use sysfs (enic_sriov_configure) to enable V2 SR-IOV instead.
+ */
pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_SRIOV);
if (pos) {
pci_read_config_word(pdev, pos + PCI_SRIOV_TOTAL_VF,
&enic->num_vfs);
- if (enic->num_vfs) {
+ if (enic->num_vfs &&
+ enic->vf_type != ENIC_VF_TYPE_V2) {
err = pci_enable_sriov(pdev, enic->num_vfs);
if (err) {
dev_err(dev, "SRIOV enable failed, aborting."
@@ -2803,7 +2925,6 @@ static int enic_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
num_pps = enic->num_vfs;
}
}
- enic_sriov_detect_vf_type(enic);
#endif
/* Allocate structure for port profiles */
@@ -3032,14 +3153,16 @@ static void enic_remove(struct pci_dev *pdev)
cancel_work_sync(&enic->reset);
cancel_work_sync(&enic->change_mtu_work);
unregister_netdev(netdev);
- enic_dev_deinit(enic);
- vnic_dev_close(enic->vdev);
#ifdef CONFIG_PCI_IOV
if (enic_sriov_enabled(enic)) {
- pci_disable_sriov(pdev);
- enic->priv_flags &= ~ENIC_SRIOV_ENABLED;
+ if (enic->vf_type == ENIC_VF_TYPE_V2)
+ enic_sriov_v2_disable(enic);
+ else
+ pci_disable_sriov(pdev);
}
#endif
+ enic_dev_deinit(enic);
+ vnic_dev_close(enic->vdev);
kfree(enic->pp);
vnic_dev_unregister(enic->vdev);
enic_iounmap(enic);
diff --git a/drivers/net/ethernet/cisco/enic/enic_res.c b/drivers/net/ethernet/cisco/enic/enic_res.c
index 2b7545d6a67f..436326ace049 100644
--- a/drivers/net/ethernet/cisco/enic/enic_res.c
+++ b/drivers/net/ethernet/cisco/enic/enic_res.c
@@ -59,6 +59,7 @@ int enic_get_vnic_config(struct enic *enic)
GET_CONFIG(intr_timer_usec);
GET_CONFIG(loop_tag);
GET_CONFIG(num_arfs);
+ GET_CONFIG(mq_subvnic_count);
GET_CONFIG(max_rq_ring);
GET_CONFIG(max_wq_ring);
GET_CONFIG(max_cq_ring);
diff --git a/drivers/net/ethernet/cisco/enic/vnic_enet.h b/drivers/net/ethernet/cisco/enic/vnic_enet.h
index 9e8e86262a3f..519d2969990b 100644
--- a/drivers/net/ethernet/cisco/enic/vnic_enet.h
+++ b/drivers/net/ethernet/cisco/enic/vnic_enet.h
@@ -21,7 +21,9 @@ struct vnic_enet_config {
u16 loop_tag;
u16 vf_rq_count;
u16 num_arfs;
- u8 reserved[66];
+ u8 reserved1[32];
+ u16 mq_subvnic_count;
+ u8 reserved2[32];
u32 max_rq_ring; // MAX RQ ring size
u32 max_wq_ring; // MAX WQ ring size
u32 max_cq_ring; // MAX CQ ring size
--
2.43.0
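The vnic_enet.h hunk splits reserved[66] into 32 + 2 + 32 bytes, which should leave every later field at the same offset. A compile-time check of that property could be sketched as follows. The two structs are hypothetical, trimmed copies of the tail of struct vnic_enet_config (fields before num_arfs omitted), which works because num_arfs, as a u16, always sits at an even offset.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Before the change (trimmed). */
struct cfg_old {
	uint16_t num_arfs;
	uint8_t  reserved[66];
	uint32_t max_rq_ring;
};

/* After the change: the new field occupies two of the 66 reserved bytes. */
struct cfg_new {
	uint16_t num_arfs;
	uint8_t  reserved1[32];
	uint16_t mq_subvnic_count;
	uint8_t  reserved2[32];
	uint32_t max_rq_ring;
};

static_assert(offsetof(struct cfg_new, mq_subvnic_count) == 2 + 32,
	      "mq_subvnic_count must be naturally aligned, no padding");
static_assert(offsetof(struct cfg_new, max_rq_ring) ==
	      offsetof(struct cfg_old, max_rq_ring),
	      "fields after the reserved area must not move");
static_assert(sizeof(struct cfg_new) == sizeof(struct cfg_old),
	      "config block size must not change");
```

Since this struct describes a layout shared with firmware, a BUILD_BUG_ON() of the same kind in the driver would catch an accidental size change early.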
* [PATCH net-next v2 08/10] enic: add MBOX VF handlers for capability, register and link state
From: Satish Kharat via B4 Relay @ 2026-04-08 15:08 UTC
To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni
Cc: netdev, linux-kernel,
20260401-enic-sriov-v2-prep-v4-0-d5834b2ef1b9, Satish Kharat
In-Reply-To: <20260408-enic-sriov-v2-admin-channel-v2-v2-0-d05dd3623fd3@cisco.com>
From: Satish Kharat <satishkh@cisco.com>
Implement VF-side mailbox message processing for SR-IOV V2
admin channel communication.
VF receive handlers:
- VF_CAPABILITY_REPLY: store PF protocol version, signal
completion
- VF_REGISTER_REPLY: mark VF as registered, signal completion
- VF_UNREGISTER_REPLY: mark VF as unregistered, signal
completion
- PF_LINK_STATE_NOTIF: update carrier state via
netif_carrier_on/off, send ACK back to PF
VF initiation functions for the probe-time handshake:
- enic_mbox_vf_capability_check: send capability request,
wait for PF reply via completion
- enic_mbox_vf_register: send register request, wait for
PF confirmation via completion
- enic_mbox_vf_unregister: send unregister request, wait
for PF confirmation
The wait helper (enic_mbox_wait_reply) uses
wait_for_completion_timeout, signaled when the admin ISR/NAPI/
workqueue pipeline delivers the reply message.
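The wait-for-reply pattern can be modelled in userspace with a mutex, a condition variable and a counter, which is roughly what a kernel completion provides. `struct ucompletion` and the helpers are hypothetical names. `ucomp_wait_ms()` mirrors the 0 / -ETIMEDOUT convention of enic_mbox_wait_reply(), and `deliver_reply()` stands in for the RX path that calls complete().

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <time.h>

/* Minimal userspace analogue of struct completion. */
struct ucompletion {
	pthread_mutex_t lock;
	pthread_cond_t cond;
	unsigned int done;   /* counts completions not yet consumed */
};

static void ucomp_init(struct ucompletion *c)
{
	pthread_mutex_init(&c->lock, NULL);
	pthread_cond_init(&c->cond, NULL);
	c->done = 0;
}

/* Like complete(): record one completion and wake one waiter. */
static void ucomp_complete(struct ucompletion *c)
{
	pthread_mutex_lock(&c->lock);
	c->done++;
	pthread_cond_signal(&c->cond);
	pthread_mutex_unlock(&c->lock);
}

/* 0 if a completion was consumed, -ETIMEDOUT if none arrived in time. */
static int ucomp_wait_ms(struct ucompletion *c, long timeout_ms)
{
	struct timespec deadline;
	int rc = 0;

	assert(timeout_ms >= 0);
	clock_gettime(CLOCK_REALTIME, &deadline);   /* condvar default clock */
	deadline.tv_sec += timeout_ms / 1000;
	deadline.tv_nsec += (timeout_ms % 1000) * 1000000L;
	if (deadline.tv_nsec >= 1000000000L) {
		deadline.tv_sec++;
		deadline.tv_nsec -= 1000000000L;
	}

	pthread_mutex_lock(&c->lock);
	while (c->done == 0 && rc == 0)
		rc = pthread_cond_timedwait(&c->cond, &c->lock, &deadline);
	if (c->done > 0) {
		c->done--;   /* consume exactly one completion */
		rc = 0;
	}
	pthread_mutex_unlock(&c->lock);
	return rc == 0 ? 0 : -ETIMEDOUT;
}

/* Stands in for the RX path that signals a reply. */
static void *deliver_reply(void *arg)
{
	ucomp_complete(arg);
	return NULL;
}
```

Because completions are counted, a reply that arrives after a waiter has already timed out is left pending and will satisfy the next wait immediately. The kernel's struct completion behaves the same way. With the single-flight scheme described in enic.h, it may therefore be worth calling reinit_completion() before each request is sent.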
Signed-off-by: Satish Kharat <satishkh@cisco.com>
---
drivers/net/ethernet/cisco/enic/enic.h | 9 +-
drivers/net/ethernet/cisco/enic/enic_mbox.c | 220 ++++++++++++++++++++++++++++
drivers/net/ethernet/cisco/enic/enic_mbox.h | 3 +
3 files changed, 231 insertions(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/cisco/enic/enic.h b/drivers/net/ethernet/cisco/enic/enic.h
index 9b1fa3857df5..29ce26284493 100644
--- a/drivers/net/ethernet/cisco/enic/enic.h
+++ b/drivers/net/ethernet/cisco/enic/enic.h
@@ -258,6 +258,8 @@ struct enic {
u32 tx_coalesce_usecs;
u16 num_vfs;
enum enic_vf_type vf_type;
+ bool vf_registered;
+ u32 pf_cap_version;
unsigned int enable_count;
spinlock_t enic_api_lock;
bool enic_api_busy;
@@ -305,9 +307,14 @@ struct enic {
void (*admin_rq_handler)(struct enic *enic, void *buf,
unsigned int len);
- /* MBOX protocol state */
+ /* MBOX protocol state -- single-flight: on the VF, all callers
+ * that wait on mbox_comp run under RTNL or during probe/remove,
+ * so only one completion is outstanding at a time. mbox_lock
+ * protects the shared admin WQ from concurrent senders.
+ */
struct mutex mbox_lock;
u64 mbox_msg_num;
+ struct completion mbox_comp;
/* PF: per-VF MBOX state, allocated when SRIOV V2 is enabled */
struct enic_vf_state {
diff --git a/drivers/net/ethernet/cisco/enic/enic_mbox.c b/drivers/net/ethernet/cisco/enic/enic_mbox.c
index eb5049b538b1..0db05b124557 100644
--- a/drivers/net/ethernet/cisco/enic/enic_mbox.c
+++ b/drivers/net/ethernet/cisco/enic/enic_mbox.c
@@ -5,6 +5,7 @@
#include <linux/netdevice.h>
#include <linux/dma-mapping.h>
#include <linux/delay.h>
+#include <linux/completion.h>
#include "vnic_dev.h"
#include "vnic_wq.h"
@@ -124,6 +125,16 @@ int enic_mbox_send_msg(struct enic *enic, u8 msg_type, u16 dst_vnic_id,
return err;
}
+static int enic_mbox_wait_reply(struct enic *enic, unsigned long timeout_ms)
+{
+ unsigned long left;
+
+ left = wait_for_completion_timeout(&enic->mbox_comp,
+ msecs_to_jiffies(timeout_ms));
+
+ return left ? 0 : -ETIMEDOUT;
+}
+
int enic_mbox_send_link_state(struct enic *enic, u16 vf_id, u32 link_state)
{
struct enic_mbox_pf_link_state_notif_msg notif = {};
@@ -280,6 +291,136 @@ static void enic_mbox_pf_process_msg(struct enic *enic,
hdr->msg_type, vf_id, err);
}
+static void enic_mbox_vf_handle_capability_reply(struct enic *enic,
+ void *payload)
+{
+ struct enic_mbox_vf_capability_reply_msg *reply = payload;
+
+ if (le16_to_cpu(reply->reply.ret_major) == 0)
+ enic->pf_cap_version = le32_to_cpu(reply->version);
+ complete(&enic->mbox_comp);
+}
+
+static void enic_mbox_vf_handle_register_reply(struct enic *enic,
+ void *payload)
+{
+ struct enic_mbox_vf_register_reply_msg *reply = payload;
+
+ if (le16_to_cpu(reply->reply.ret_major)) {
+ netdev_warn(enic->netdev,
+ "MBOX: VF register rejected by PF: %u/%u\n",
+ le16_to_cpu(reply->reply.ret_major),
+ le16_to_cpu(reply->reply.ret_minor));
+ } else {
+ enic->vf_registered = true;
+ }
+ complete(&enic->mbox_comp);
+}
+
+static void enic_mbox_vf_handle_unregister_reply(struct enic *enic,
+ void *payload)
+{
+ struct enic_mbox_vf_register_reply_msg *reply = payload;
+
+ if (le16_to_cpu(reply->reply.ret_major)) {
+ netdev_warn(enic->netdev,
+ "MBOX: VF unregister rejected by PF: %u/%u\n",
+ le16_to_cpu(reply->reply.ret_major),
+ le16_to_cpu(reply->reply.ret_minor));
+ } else {
+ enic->vf_registered = false;
+ }
+ complete(&enic->mbox_comp);
+}
+
+static void enic_mbox_vf_handle_link_state(struct enic *enic, void *payload)
+{
+ struct enic_mbox_pf_link_state_notif_msg *notif = payload;
+ struct enic_mbox_pf_link_state_ack_msg ack = {};
+
+ switch (le32_to_cpu(notif->link_state)) {
+ case ENIC_MBOX_LINK_STATE_ENABLE:
+ if (!netif_carrier_ok(enic->netdev))
+ netif_carrier_on(enic->netdev);
+ netdev_dbg(enic->netdev, "MBOX: link state -> UP\n");
+ break;
+ case ENIC_MBOX_LINK_STATE_DISABLE:
+ if (netif_carrier_ok(enic->netdev))
+ netif_carrier_off(enic->netdev);
+ netdev_dbg(enic->netdev, "MBOX: link state -> DOWN\n");
+ break;
+ default:
+ netdev_warn(enic->netdev, "MBOX: unknown link state %u\n",
+ le32_to_cpu(notif->link_state));
+ ack.ack.ret_major = cpu_to_le16(ENIC_MBOX_ERR_GENERIC);
+ break;
+ }
+
+ enic_mbox_send_msg(enic, ENIC_MBOX_PF_LINK_STATE_ACK, ENIC_MBOX_DST_PF,
+ &ack, sizeof(ack));
+}
+
+static bool enic_mbox_vf_payload_ok(struct enic *enic, u8 msg_type,
+ u16 payload_len, size_t min_len)
+{
+ if (payload_len < min_len) {
+ netdev_warn(enic->netdev,
+ "MBOX: short payload for type %u (%u < %zu)\n",
+ msg_type, payload_len, min_len);
+ return false;
+ }
+ return true;
+}
+
+static void enic_mbox_vf_process_msg(struct enic *enic,
+ struct enic_mbox_hdr *hdr, void *payload,
+ u16 payload_len)
+{
+ switch (hdr->msg_type) {
+ case ENIC_MBOX_VF_CAPABILITY_REPLY: {
+ size_t exp = sizeof(struct enic_mbox_vf_capability_reply_msg);
+
+ if (!enic_mbox_vf_payload_ok(enic, hdr->msg_type,
+ payload_len, exp))
+ return;
+ enic_mbox_vf_handle_capability_reply(enic, payload);
+ break;
+ }
+ case ENIC_MBOX_VF_REGISTER_REPLY: {
+ size_t exp = sizeof(struct enic_mbox_vf_register_reply_msg);
+
+ if (!enic_mbox_vf_payload_ok(enic, hdr->msg_type,
+ payload_len, exp))
+ return;
+ enic_mbox_vf_handle_register_reply(enic, payload);
+ break;
+ }
+ case ENIC_MBOX_VF_UNREGISTER_REPLY: {
+ size_t exp = sizeof(struct enic_mbox_vf_register_reply_msg);
+
+ if (!enic_mbox_vf_payload_ok(enic, hdr->msg_type,
+ payload_len, exp))
+ return;
+ enic_mbox_vf_handle_unregister_reply(enic, payload);
+ break;
+ }
+ case ENIC_MBOX_PF_LINK_STATE_NOTIF: {
+ size_t exp = sizeof(struct enic_mbox_pf_link_state_notif_msg);
+
+ if (!enic_mbox_vf_payload_ok(enic, hdr->msg_type,
+ payload_len, exp))
+ return;
+ enic_mbox_vf_handle_link_state(enic, payload);
+ break;
+ }
+ default:
+ netdev_dbg(enic->netdev,
+ "MBOX: VF unhandled msg type %u\n",
+ hdr->msg_type);
+ break;
+ }
+}
+
static void enic_mbox_recv_handler(struct enic *enic, void *buf,
unsigned int len)
{
@@ -316,11 +457,90 @@ static void enic_mbox_recv_handler(struct enic *enic, void *buf,
if (enic->vf_state)
enic_mbox_pf_process_msg(enic, hdr, payload);
+ else
+ enic_mbox_vf_process_msg(enic, hdr, payload,
+ msg_len - (u16)sizeof(*hdr));
+}
+
+int enic_mbox_vf_capability_check(struct enic *enic)
+{
+ struct enic_mbox_vf_capability_msg req = {};
+ int err;
+
+ enic->pf_cap_version = 0;
+ reinit_completion(&enic->mbox_comp);
+ req.version = cpu_to_le32(ENIC_MBOX_CAP_VERSION_1);
+
+ err = enic_mbox_send_msg(enic, ENIC_MBOX_VF_CAPABILITY_REQUEST,
+ ENIC_MBOX_DST_PF, &req, sizeof(req));
+ if (err)
+ return err;
+
+ err = enic_mbox_wait_reply(enic, 3000);
+ if (err) {
+ netdev_warn(enic->netdev,
+ "MBOX: no capability reply from PF\n");
+ return err;
+ }
+
+ if (enic->pf_cap_version < ENIC_MBOX_CAP_VERSION_1) {
+ netdev_warn(enic->netdev,
+ "MBOX: PF version %u too old\n",
+ enic->pf_cap_version);
+ return -EOPNOTSUPP;
+ }
+
+ return 0;
+}
+
+int enic_mbox_vf_register(struct enic *enic)
+{
+ int err;
+
+ enic->vf_registered = false;
+ reinit_completion(&enic->mbox_comp);
+
+ err = enic_mbox_send_msg(enic, ENIC_MBOX_VF_REGISTER_REQUEST,
+ ENIC_MBOX_DST_PF, NULL, 0);
+ if (err)
+ return err;
+
+ err = enic_mbox_wait_reply(enic, 3000);
+ if (err) {
+ netdev_warn(enic->netdev,
+ "MBOX: VF registration with PF timed out\n");
+ return err;
+ }
+
+ if (!enic->vf_registered)
+ return -ENODEV;
+
+ return 0;
+}
+
+int enic_mbox_vf_unregister(struct enic *enic)
+{
+ int err;
+
+ if (!enic->vf_registered)
+ return 0;
+
+ reinit_completion(&enic->mbox_comp);
+
+ err = enic_mbox_send_msg(enic, ENIC_MBOX_VF_UNREGISTER_REQUEST,
+ ENIC_MBOX_DST_PF, NULL, 0);
+ if (err)
+ return err;
+
+ err = enic_mbox_wait_reply(enic, 3000);
+
+ return enic->vf_registered ? -ETIMEDOUT : 0;
}
void enic_mbox_init(struct enic *enic)
{
enic->mbox_msg_num = 0;
mutex_init(&enic->mbox_lock);
+ init_completion(&enic->mbox_comp);
enic->admin_rq_handler = enic_mbox_recv_handler;
}
diff --git a/drivers/net/ethernet/cisco/enic/enic_mbox.h b/drivers/net/ethernet/cisco/enic/enic_mbox.h
index a6f6798d14f4..fa2fb08bf7d0 100644
--- a/drivers/net/ethernet/cisco/enic/enic_mbox.h
+++ b/drivers/net/ethernet/cisco/enic/enic_mbox.h
@@ -80,5 +80,8 @@ void enic_mbox_init(struct enic *enic);
int enic_mbox_send_msg(struct enic *enic, u8 msg_type, u16 dst_vnic_id,
void *payload, u16 payload_len);
int enic_mbox_send_link_state(struct enic *enic, u16 vf_id, u32 link_state);
+int enic_mbox_vf_capability_check(struct enic *enic);
+int enic_mbox_vf_register(struct enic *enic);
+int enic_mbox_vf_unregister(struct enic *enic);
#endif /* _ENIC_MBOX_H_ */
--
2.43.0
* [PATCH net-next v2 10/10] enic: add V2 VF probe with admin channel and PF registration
From: Satish Kharat via B4 Relay @ 2026-04-08 15:08 UTC
To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni
Cc: netdev, linux-kernel,
20260401-enic-sriov-v2-prep-v4-0-d5834b2ef1b9, Satish Kharat
In-Reply-To: <20260408-enic-sriov-v2-admin-channel-v2-v2-0-d05dd3623fd3@cisco.com>
From: Satish Kharat <satishkh@cisco.com>
When a V2 SR-IOV VF probes, open the admin channel, initialize the
MBOX protocol, perform the capability check with the PF, and register
with the PF. This establishes the PF-VF communication path that the PF
uses to send link state notifications.
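The capability check and the registration are both synchronous exchanges built on the completion added earlier in this series. The VF re-arms enic->mbox_comp, sends the request, and then sleeps in enic_mbox_wait_reply() until the receive path calls complete() or 3000 ms pass. Below is a userspace analogue using POSIX threads. struct completion_sim and its helpers are hypothetical stand-ins for the kernel completion API, and a bool replaces the kernel's done counter, which is a simplification that only holds for a single outstanding reply.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical userspace stand-in for the kernel's struct completion. */
struct completion_sim {
	pthread_mutex_t lock;
	pthread_cond_t cond;
	bool done;
};

static void completion_sim_init(struct completion_sim *c)
{
	pthread_mutex_init(&c->lock, NULL);
	pthread_cond_init(&c->cond, NULL);
	c->done = false;
}

/* Like reinit_completion(): re-arm before sending the next request. */
static void completion_sim_reinit(struct completion_sim *c)
{
	pthread_mutex_lock(&c->lock);
	c->done = false;
	pthread_mutex_unlock(&c->lock);
}

/* Like complete(): called from the receive path when the reply lands. */
static void completion_sim_complete(struct completion_sim *c)
{
	pthread_mutex_lock(&c->lock);
	c->done = true;
	pthread_cond_signal(&c->cond);
	pthread_mutex_unlock(&c->lock);
}

/* Mirrors enic_mbox_wait_reply(): 0 on reply, -ETIMEDOUT otherwise. */
static int wait_reply_sim(struct completion_sim *c, long timeout_ms)
{
	struct timespec deadline;
	int rc = 0;

	clock_gettime(CLOCK_REALTIME, &deadline);
	deadline.tv_sec += timeout_ms / 1000;
	deadline.tv_nsec += (timeout_ms % 1000) * 1000000L;
	if (deadline.tv_nsec >= 1000000000L) {
		deadline.tv_sec++;
		deadline.tv_nsec -= 1000000000L;
	}

	pthread_mutex_lock(&c->lock);
	/* Loop to absorb spurious wakeups until done or timed out. */
	while (!c->done && rc == 0)
		rc = pthread_cond_timedwait(&c->cond, &c->lock, &deadline);
	if (c->done) {
		c->done = false;	/* consume it, as the kernel wait does */
		rc = 0;
	} else {
		rc = -ETIMEDOUT;
	}
	pthread_mutex_unlock(&c->lock);
	return rc;
}
```

Calling reinit_completion() before the send (not after) matters: a reply that races in between send and wait must not be wiped out.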
The admin channel and MBOX registration happen after enic_dev_init()
(which discovers admin channel resources) and before register_netdev()
so the VF is fully initialized before the interface is visible to
userspace.
On remove, the VF unregisters from the PF and closes its admin channel
before tearing down data path resources.
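The probe and remove ordering above uses the usual goto-unwind pattern: each error label undoes only what is already set up, in reverse order. This is a minimal sketch. The step names are hypothetical and write into a trace string instead of touching hardware. In the actual patch the PF unregister sits under err_out_admin_close and is gated on enic->vf_registered; here it has its own label for clarity.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical trace of setup/teardown steps, for illustration only. */
static char trace[128];

static void step(const char *name)
{
	strcat(trace, name);
	strcat(trace, " ");
}

/* fail_at selects which stage fails (0 = none). */
static int probe_sketch(int fail_at)
{
	trace[0] = '\0';

	step("dev_init");
	if (fail_at == 1)
		goto err_dev_deinit;	/* admin channel open failed */
	step("admin_open");
	if (fail_at == 2)
		goto err_admin_close;	/* capability check failed */
	step("cap_check");
	if (fail_at == 3)
		goto err_admin_close;	/* PF rejected registration */
	step("vf_register");
	if (fail_at == 4)
		goto err_vf_unregister;	/* register_netdev() failed */
	step("register_netdev");
	return 0;

err_vf_unregister:
	step("vf_unregister");
err_admin_close:
	step("admin_close");
err_dev_deinit:
	step("dev_deinit");
	return -1;
}

/* Remove: hide the netdev first, then leave the PF, then tear down. */
static void remove_sketch(void)
{
	trace[0] = '\0';
	step("unregister_netdev");
	step("vf_unregister");
	step("admin_close");
	step("dev_deinit");
}
```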
V2 VFs are not provisioned with an RES_TYPE_SRIOV_INTR resource by
firmware, so bypass that check in the admin channel capability
detection for V2 VFs. The PF still requires this resource.
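The capability detection change comes down to a single predicate. This sketch assumes a hypothetical struct res_counts in place of the vnic_dev_get_res_count() calls. The admin WQ term is also an assumption, because that part of the condition is above the hunk shown here.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical snapshot of firmware-provisioned resource counts. */
struct res_counts {
	unsigned int admin_wq, admin_rq, admin_cq, sriov_intr;
};

/* admin_cq[2] in struct enic: one CQ for the WQ, one for the RQ. */
#define ADMIN_CQ_NEEDED 2

static bool has_admin_channel(const struct res_counts *r, bool is_v2_vf)
{
	return r->admin_wq >= 1 &&		/* assumed, not in the hunk */
	       r->admin_rq >= 1 &&
	       r->admin_cq >= ADMIN_CQ_NEEDED &&
	       /* V2 VFs never get SRIOV_INTR, so skip that requirement. */
	       (is_v2_vf || r->sriov_intr >= 1);
}
```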
Signed-off-by: Satish Kharat <satishkh@cisco.com>
---
drivers/net/ethernet/cisco/enic/enic.h | 1 +
drivers/net/ethernet/cisco/enic/enic_main.c | 58 ++++++++++++++++++++++++++++-
drivers/net/ethernet/cisco/enic/enic_res.c | 3 +-
3 files changed, 59 insertions(+), 3 deletions(-)
diff --git a/drivers/net/ethernet/cisco/enic/enic.h b/drivers/net/ethernet/cisco/enic/enic.h
index 29ce26284493..6301930903ee 100644
--- a/drivers/net/ethernet/cisco/enic/enic.h
+++ b/drivers/net/ethernet/cisco/enic/enic.h
@@ -441,6 +441,7 @@ void enic_reset_addr_lists(struct enic *enic);
int enic_sriov_enabled(struct enic *enic);
int enic_is_valid_vf(struct enic *enic, int vf);
int enic_is_dynamic(struct enic *enic);
+int enic_is_sriov_vf_v2(struct enic *enic);
void enic_set_ethtool_ops(struct net_device *netdev);
int __enic_set_rsskey(struct enic *enic);
void enic_ext_cq(struct enic *enic);
diff --git a/drivers/net/ethernet/cisco/enic/enic_main.c b/drivers/net/ethernet/cisco/enic/enic_main.c
index 48c38347d6ce..c664ca43ab3a 100644
--- a/drivers/net/ethernet/cisco/enic/enic_main.c
+++ b/drivers/net/ethernet/cisco/enic/enic_main.c
@@ -316,6 +316,11 @@ static int enic_is_sriov_vf(struct enic *enic)
enic->pdev->device == PCI_DEVICE_ID_CISCO_VIC_ENET_VF_V2;
}
+int enic_is_sriov_vf_v2(struct enic *enic)
+{
+ return enic->pdev->device == PCI_DEVICE_ID_CISCO_VIC_ENET_VF_V2;
+}
+
int enic_is_valid_vf(struct enic *enic, int vf)
{
#ifdef CONFIG_PCI_IOV
@@ -2989,6 +2994,32 @@ static int enic_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
goto err_out_dev_close;
}
+ /* V2 VF: open admin channel and register with PF.
+ * Must happen before register_netdev so the VF is fully
+ * initialized before the interface is visible to userspace.
+ */
+ if (enic_is_sriov_vf_v2(enic)) {
+ err = enic_admin_channel_open(enic);
+ if (err) {
+ dev_err(dev,
+ "Failed to open admin channel: %d\n", err);
+ goto err_out_dev_deinit;
+ }
+ enic_mbox_init(enic);
+ err = enic_mbox_vf_capability_check(enic);
+ if (err) {
+ dev_err(dev,
+ "MBOX capability check failed: %d\n", err);
+ goto err_out_admin_close;
+ }
+ err = enic_mbox_vf_register(enic);
+ if (err) {
+ dev_err(dev,
+ "MBOX VF registration failed: %d\n", err);
+ goto err_out_admin_close;
+ }
+ }
+
netif_set_real_num_tx_queues(netdev, enic->wq_count);
netif_set_real_num_rx_queues(netdev, enic->rq_count);
@@ -3013,7 +3044,7 @@ static int enic_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
err = enic_set_mac_addr(netdev, enic->mac_addr);
if (err) {
dev_err(dev, "Invalid MAC address, aborting\n");
- goto err_out_dev_deinit;
+ goto err_out_admin_close;
}
enic->tx_coalesce_usecs = enic->config.intr_timer_usec;
@@ -3111,11 +3142,23 @@ static int enic_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
err = register_netdev(netdev);
if (err) {
dev_err(dev, "Cannot register net device, aborting\n");
- goto err_out_dev_deinit;
+ goto err_out_admin_close;
}
return 0;
+err_out_admin_close:
+ if (enic_is_sriov_vf_v2(enic)) {
+ if (enic->vf_registered) {
+ int unreg_err = enic_mbox_vf_unregister(enic);
+
+ if (unreg_err)
+ netdev_warn(netdev,
+ "Failed to unregister from PF: %d\n",
+ unreg_err);
+ }
+ enic_admin_channel_close(enic);
+ }
err_out_dev_deinit:
enic_dev_deinit(enic);
err_out_dev_close:
@@ -3153,6 +3196,17 @@ static void enic_remove(struct pci_dev *pdev)
cancel_work_sync(&enic->reset);
cancel_work_sync(&enic->change_mtu_work);
unregister_netdev(netdev);
+ if (enic_is_sriov_vf_v2(enic)) {
+ if (enic->vf_registered) {
+ int unreg_err = enic_mbox_vf_unregister(enic);
+
+ if (unreg_err)
+ netdev_warn(netdev,
+ "Failed to unregister from PF: %d\n",
+ unreg_err);
+ }
+ enic_admin_channel_close(enic);
+ }
#ifdef CONFIG_PCI_IOV
if (enic_sriov_enabled(enic)) {
if (enic->vf_type == ENIC_VF_TYPE_V2)
diff --git a/drivers/net/ethernet/cisco/enic/enic_res.c b/drivers/net/ethernet/cisco/enic/enic_res.c
index 436326ace049..74cd2ee3af5c 100644
--- a/drivers/net/ethernet/cisco/enic/enic_res.c
+++ b/drivers/net/ethernet/cisco/enic/enic_res.c
@@ -211,7 +211,8 @@ void enic_get_res_counts(struct enic *enic)
vnic_dev_get_res_count(enic->vdev, RES_TYPE_ADMIN_RQ) >= 1 &&
vnic_dev_get_res_count(enic->vdev, RES_TYPE_ADMIN_CQ) >=
ARRAY_SIZE(enic->admin_cq) &&
- vnic_dev_get_res_count(enic->vdev, RES_TYPE_SRIOV_INTR) >= 1;
+ (enic_is_sriov_vf_v2(enic) ||
+ vnic_dev_get_res_count(enic->vdev, RES_TYPE_SRIOV_INTR) >= 1);
dev_info(enic_get_dev(enic),
"vNIC resources avail: wq %d rq %d cq %d intr %d admin %s\n",
--
2.43.0
* [PATCH net-next v2 07/10] enic: add MBOX PF handlers for VF register and capability
From: Satish Kharat via B4 Relay @ 2026-04-08 15:08 UTC
To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni
Cc: netdev, linux-kernel,
20260401-enic-sriov-v2-prep-v4-0-d5834b2ef1b9, Satish Kharat
In-Reply-To: <20260408-enic-sriov-v2-admin-channel-v2-v2-0-d05dd3623fd3@cisco.com>
From: Satish Kharat <satishkh@cisco.com>
Implement PF-side mailbox message processing for SR-IOV V2
admin channel communication.
When the PF receives messages from VFs, the dispatcher routes
them to type-specific handlers:
- VF_CAPABILITY_REQUEST: reply with protocol version 1
- VF_REGISTER_REQUEST: mark VF registered, reply, then
send PF_LINK_STATE_NOTIF with link enabled
- VF_UNREGISTER_REQUEST: mark VF unregistered, send reply
- PF_LINK_STATE_ACK: log errors from VF acknowledgment
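The PF receive path amounts to header validation followed by a switch on msg_type. This sketch assumes hypothetical numeric values for the message types and a 16-byte header; this patch shows neither. The function returns the reply type the PF would send, 0 when it sends no reply, or a negative errno in the cases where the driver drops the message or warns.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical type codes; the real ENIC_MBOX_* values are not shown. */
enum mbox_type {
	VF_CAPABILITY_REQUEST = 1,
	VF_CAPABILITY_REPLY,
	VF_REGISTER_REQUEST,
	VF_REGISTER_REPLY,
	VF_UNREGISTER_REQUEST,
	VF_UNREGISTER_REPLY,
	PF_LINK_STATE_NOTIF,
	PF_LINK_STATE_ACK,
};

#define HDR_LEN 16u	/* assumed sizeof(struct enic_mbox_hdr) */

static int pf_dispatch(uint8_t type, uint16_t vf_id, uint16_t num_vfs,
		       uint16_t msg_len, unsigned int buf_len)
{
	/* enic_mbox_recv_handler(): msg_len must fit header..buffer. */
	if (msg_len < HDR_LEN || msg_len > buf_len)
		return -EINVAL;
	/* enic_mbox_pf_process_msg(): reject unknown senders. */
	if (vf_id >= num_vfs)
		return -EINVAL;

	switch (type) {
	case VF_CAPABILITY_REQUEST:
		return VF_CAPABILITY_REPLY;
	case VF_REGISTER_REQUEST:
		/* the real handler follows up with PF_LINK_STATE_NOTIF */
		return VF_REGISTER_REPLY;
	case VF_UNREGISTER_REQUEST:
		return VF_UNREGISTER_REPLY;
	case PF_LINK_STATE_ACK:
		return 0;	/* errors are only logged, never answered */
	default:
		return -EOPNOTSUPP;
	}
}
```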
Per-VF state (struct enic_vf_state) is tracked via enic->vf_state
which will be allocated when SRIOV V2 is enabled.
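Per-VF registration state also gates link state notifications: enic_mbox_send_link_state() silently skips any VF that is out of range or not registered. Below is a userspace sketch of that state tracking. It assumes a hypothetical struct pf_sketch that counts notifications instead of doing the actual mailbox send.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Same shape as struct enic_vf_state in this patch. */
struct vf_state {
	bool registered;
};

/* Hypothetical PF context; notifs_sent exists only for the sketch. */
struct pf_sketch {
	struct vf_state *vf_state;	/* NULL until SR-IOV V2 is enabled */
	unsigned int num_vfs;
	unsigned int notifs_sent;
};

/* Like enic_mbox_send_link_state(): skip unregistered VFs. */
static void pf_send_link_state(struct pf_sketch *pf, unsigned int vf)
{
	if (!pf->vf_state || vf >= pf->num_vfs || !pf->vf_state[vf].registered)
		return;
	pf->notifs_sent++;	/* stands in for PF_LINK_STATE_NOTIF */
}

/* VF_REGISTER_REQUEST: (re-)register, then announce link up. */
static int pf_register(struct pf_sketch *pf, unsigned int vf)
{
	if (!pf->vf_state || vf >= pf->num_vfs)
		return -1;
	/* A re-register after an unclean guest reboot simply starts over. */
	pf->vf_state[vf].registered = true;
	pf_send_link_state(pf, vf);
	return 0;
}

/* VF_UNREGISTER_REQUEST: stop sending notifications to this VF. */
static int pf_unregister(struct pf_sketch *pf, unsigned int vf)
{
	if (!pf->vf_state || vf >= pf->num_vfs)
		return -1;
	pf->vf_state[vf].registered = false;
	return 0;
}
```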
Remove the CONFIG_PCI_IOV guard from num_vfs in struct enic. The
PF handlers reference enic->num_vfs for VF ID bounds checking in
enic_mbox.c, which is compiled unconditionally. The field must be
visible regardless of CONFIG_PCI_IOV to avoid build failures.
Add an enic_mbox_send_link_state() helper for PF-initiated link
state notifications; a later patch also uses it for ndo_set_vf_link_state.
Signed-off-by: Satish Kharat <satishkh@cisco.com>
---
drivers/net/ethernet/cisco/enic/enic.h | 7 +-
drivers/net/ethernet/cisco/enic/enic_mbox.c | 174 +++++++++++++++++++++++++++-
drivers/net/ethernet/cisco/enic/enic_mbox.h | 1 +
3 files changed, 178 insertions(+), 4 deletions(-)
diff --git a/drivers/net/ethernet/cisco/enic/enic.h b/drivers/net/ethernet/cisco/enic/enic.h
index 42f345aceced..9b1fa3857df5 100644
--- a/drivers/net/ethernet/cisco/enic/enic.h
+++ b/drivers/net/ethernet/cisco/enic/enic.h
@@ -256,9 +256,7 @@ struct enic {
struct enic_rx_coal rx_coalesce_setting;
u32 rx_coalesce_usecs;
u32 tx_coalesce_usecs;
-#ifdef CONFIG_PCI_IOV
u16 num_vfs;
-#endif
enum enic_vf_type vf_type;
unsigned int enable_count;
spinlock_t enic_api_lock;
@@ -310,6 +308,11 @@ struct enic {
/* MBOX protocol state */
struct mutex mbox_lock;
u64 mbox_msg_num;
+
+ /* PF: per-VF MBOX state, allocated when SRIOV V2 is enabled */
+ struct enic_vf_state {
+ bool registered;
+ } *vf_state;
};
static inline struct net_device *vnic_get_netdev(struct vnic_dev *vdev)
diff --git a/drivers/net/ethernet/cisco/enic/enic_mbox.c b/drivers/net/ethernet/cisco/enic/enic_mbox.c
index 00ab76a47a35..eb5049b538b1 100644
--- a/drivers/net/ethernet/cisco/enic/enic_mbox.c
+++ b/drivers/net/ethernet/cisco/enic/enic_mbox.c
@@ -124,10 +124,168 @@ int enic_mbox_send_msg(struct enic *enic, u8 msg_type, u16 dst_vnic_id,
return err;
}
+int enic_mbox_send_link_state(struct enic *enic, u16 vf_id, u32 link_state)
+{
+ struct enic_mbox_pf_link_state_notif_msg notif = {};
+
+ if (!enic->vf_state || vf_id >= enic->num_vfs ||
+ !enic->vf_state[vf_id].registered) {
+ netdev_dbg(enic->netdev,
+ "MBOX: skip link state to unregistered VF %u\n",
+ vf_id);
+ return 0;
+ }
+
+ notif.link_state = cpu_to_le32(link_state);
+ return enic_mbox_send_msg(enic, ENIC_MBOX_PF_LINK_STATE_NOTIF, vf_id,
+ &notif, sizeof(notif));
+}
+
+static int enic_mbox_pf_handle_capability(struct enic *enic, void *msg,
+ u16 vf_id, u64 msg_num)
+{
+ struct enic_mbox_vf_capability_reply_msg reply = {};
+
+ reply.reply.ret_major = cpu_to_le16(0);
+ reply.version = cpu_to_le32(ENIC_MBOX_CAP_VERSION_1);
+
+ return enic_mbox_send_msg(enic, ENIC_MBOX_VF_CAPABILITY_REPLY, vf_id,
+ &reply, sizeof(reply));
+}
+
+static int enic_mbox_pf_handle_register(struct enic *enic, void *msg,
+ u16 vf_id, u64 msg_num)
+{
+ struct enic_mbox_vf_register_reply_msg reply = {};
+ int err;
+
+ if (!enic->vf_state || vf_id >= enic->num_vfs) {
+ netdev_warn(enic->netdev,
+ "MBOX: register from invalid VF %u\n", vf_id);
+ return -EINVAL;
+ }
+
+ /* VF re-registering (e.g. guest reboot without clean unregister):
+ * mark the previous registration inactive before accepting the new one.
+ */
+ if (enic->vf_state[vf_id].registered) {
+ netdev_dbg(enic->netdev,
+ "MBOX: VF %u re-register, cleaning previous state\n",
+ vf_id);
+ enic->vf_state[vf_id].registered = false;
+ }
+
+ reply.reply.ret_major = cpu_to_le16(0);
+ err = enic_mbox_send_msg(enic, ENIC_MBOX_VF_REGISTER_REPLY, vf_id,
+ &reply, sizeof(reply));
+ if (err)
+ return err;
+
+ enic->vf_state[vf_id].registered = true;
+ netdev_info(enic->netdev, "VF %u registered via MBOX\n", vf_id);
+
+ err = enic_mbox_send_link_state(enic, vf_id,
+ ENIC_MBOX_LINK_STATE_ENABLE);
+ if (err)
+ netdev_warn(enic->netdev,
+ "VF %u: failed to send initial link state: %d\n",
+ vf_id, err);
+ /* Registration succeeded; link state will be (re-)sent on next
+ * enic_link_check() event.
+ */
+ return 0;
+}
+
+static int enic_mbox_pf_handle_unregister(struct enic *enic, void *msg,
+ u16 vf_id, u64 msg_num)
+{
+ struct enic_mbox_vf_register_reply_msg reply = {};
+ int err;
+
+ if (!enic->vf_state || vf_id >= enic->num_vfs) {
+ netdev_warn(enic->netdev,
+ "MBOX: unregister from invalid VF %u\n", vf_id);
+ return -EINVAL;
+ }
+
+ reply.reply.ret_major = cpu_to_le16(0);
+ err = enic_mbox_send_msg(enic, ENIC_MBOX_VF_UNREGISTER_REPLY, vf_id,
+ &reply, sizeof(reply));
+ if (err)
+ return err;
+
+ enic->vf_state[vf_id].registered = false;
+
+ netdev_info(enic->netdev, "VF %u unregistered via MBOX\n", vf_id);
+
+ return 0;
+}
+
+static void enic_mbox_pf_process_msg(struct enic *enic,
+ struct enic_mbox_hdr *hdr, void *payload)
+{
+ u16 vf_id = le16_to_cpu(hdr->src_vnic_id);
+ u16 msg_len = le16_to_cpu(hdr->msg_len);
+ int err = 0;
+
+ if (!enic->vf_state) {
+ netdev_dbg(enic->netdev,
+ "MBOX: PF received msg but SRIOV not active\n");
+ return;
+ }
+
+ if (vf_id >= enic->num_vfs) {
+ netdev_warn(enic->netdev,
+ "MBOX: PF received msg from invalid VF %u\n",
+ vf_id);
+ return;
+ }
+
+ switch (hdr->msg_type) {
+ case ENIC_MBOX_VF_CAPABILITY_REQUEST:
+ err = enic_mbox_pf_handle_capability(enic, payload, vf_id,
+ le64_to_cpu(hdr->msg_num));
+ break;
+ case ENIC_MBOX_VF_REGISTER_REQUEST:
+ err = enic_mbox_pf_handle_register(enic, payload, vf_id,
+ le64_to_cpu(hdr->msg_num));
+ break;
+ case ENIC_MBOX_VF_UNREGISTER_REQUEST:
+ err = enic_mbox_pf_handle_unregister(enic, payload, vf_id,
+ le64_to_cpu(hdr->msg_num));
+ break;
+ case ENIC_MBOX_PF_LINK_STATE_ACK: {
+ struct enic_mbox_pf_link_state_ack_msg *ack = payload;
+
+ if (msg_len < sizeof(*hdr) + sizeof(*ack))
+ break;
+ if (le16_to_cpu(ack->ack.ret_major))
+ netdev_warn(enic->netdev,
+ "MBOX: VF %u link state ACK error %u/%u\n",
+ vf_id, le16_to_cpu(ack->ack.ret_major),
+ le16_to_cpu(ack->ack.ret_minor));
+ break;
+ }
+ default:
+ netdev_dbg(enic->netdev,
+ "MBOX: PF unhandled msg type %u from VF %u\n",
+ hdr->msg_type, vf_id);
+ err = -EOPNOTSUPP;
+ break;
+ }
+
+ if (err)
+ netdev_warn(enic->netdev,
+ "MBOX: PF handler for msg type %u from VF %u failed: %d\n",
+ hdr->msg_type, vf_id, err);
+}
+
static void enic_mbox_recv_handler(struct enic *enic, void *buf,
unsigned int len)
{
struct enic_mbox_hdr *hdr = buf;
+ void *payload;
+ u16 msg_len;
if (len < sizeof(*hdr)) {
netdev_warn(enic->netdev,
@@ -142,10 +300,22 @@ static void enic_mbox_recv_handler(struct enic *enic, void *buf,
return;
}
+ msg_len = le16_to_cpu(hdr->msg_len);
+ if (msg_len < sizeof(*hdr) || msg_len > len) {
+ netdev_warn(enic->netdev,
+ "MBOX: invalid msg_len %u (buf len %u)\n",
+ msg_len, len);
+ return;
+ }
+
netdev_dbg(enic->netdev,
"MBOX recv: type %u from vnic %u len %u\n",
- hdr->msg_type, le16_to_cpu(hdr->src_vnic_id),
- le16_to_cpu(hdr->msg_len));
+ hdr->msg_type, le16_to_cpu(hdr->src_vnic_id), msg_len);
+
+ payload = buf + sizeof(*hdr);
+
+ if (enic->vf_state)
+ enic_mbox_pf_process_msg(enic, hdr, payload);
}
void enic_mbox_init(struct enic *enic)
diff --git a/drivers/net/ethernet/cisco/enic/enic_mbox.h b/drivers/net/ethernet/cisco/enic/enic_mbox.h
index 554269b78780..a6f6798d14f4 100644
--- a/drivers/net/ethernet/cisco/enic/enic_mbox.h
+++ b/drivers/net/ethernet/cisco/enic/enic_mbox.h
@@ -79,5 +79,6 @@ struct enic;
void enic_mbox_init(struct enic *enic);
int enic_mbox_send_msg(struct enic *enic, u8 msg_type, u16 dst_vnic_id,
void *payload, u16 payload_len);
+int enic_mbox_send_link_state(struct enic *enic, u16 vf_id, u32 link_state);
#endif /* _ENIC_MBOX_H_ */
--
2.43.0
* [PATCH net-next v2 04/10] enic: add admin CQ service with MSI-X interrupt and NAPI polling
From: Satish Kharat via B4 Relay @ 2026-04-08 15:08 UTC
To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni
Cc: netdev, linux-kernel,
20260401-enic-sriov-v2-prep-v4-0-d5834b2ef1b9, Satish Kharat
In-Reply-To: <20260408-enic-sriov-v2-admin-channel-v2-v2-0-d05dd3623fd3@cisco.com>
From: Satish Kharat <satishkh@cisco.com>
Add completion queue service for the admin channel WQ and RQ, driven
by an MSI-X interrupt and NAPI polling.
The receive pipeline is: MSI-X ISR -> NAPI poll -> RQ CQ service ->
message enqueue -> workqueue handler -> admin_rq_handler callback.
NAPI drains the RQ CQ in softirq context, copying each received
buffer into an enic_admin_msg and appending it to a spinlock-protected
list. A system workqueue handler then processes each message in
process context where sleeping (mutex, GFP_KERNEL allocations) is
safe.
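The two-stage receive pipeline is a producer/consumer queue. The softirq side copies each buffer and appends it under a lock. The work handler detaches the whole list in one step and processes it without holding the lock. This userspace sketch uses a pthread mutex in place of admin_msg_lock and a singly linked tail-pointer list in place of list_head. All names are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

/* Analogue of struct enic_admin_msg: link, length, copied bytes. */
struct admin_msg {
	struct admin_msg *next;
	unsigned int len;
	unsigned char data[];
};

struct admin_queue {
	pthread_mutex_t lock;		/* stands in for admin_msg_lock */
	struct admin_msg *head, **tail;
	unsigned long drops;		/* like admin_msg_drop_cnt */
};

static void queue_init(struct admin_queue *q)
{
	pthread_mutex_init(&q->lock, NULL);
	q->head = NULL;
	q->tail = &q->head;
	q->drops = 0;
}

/* Producer (NAPI side): copy so the RX buffer can be reposted at once. */
static void msg_enqueue(struct admin_queue *q, const void *buf, unsigned int len)
{
	struct admin_msg *msg = malloc(sizeof(*msg) + len);

	if (!msg) {
		q->drops++;
		return;
	}
	msg->next = NULL;
	msg->len = len;
	memcpy(msg->data, buf, len);

	pthread_mutex_lock(&q->lock);
	*q->tail = msg;
	q->tail = &msg->next;
	pthread_mutex_unlock(&q->lock);
}

/* Consumer (work handler): splice under the lock, process outside it. */
static unsigned int msg_work(struct admin_queue *q,
			     void (*handler)(const void *, unsigned int))
{
	struct admin_msg *msg, *next;
	unsigned int n = 0;

	pthread_mutex_lock(&q->lock);
	msg = q->head;
	q->head = NULL;
	q->tail = &q->head;
	pthread_mutex_unlock(&q->lock);

	for (; msg; msg = next) {
		next = msg->next;
		handler(msg->data, msg->len);	/* may sleep in the driver */
		free(msg);
		n++;
	}
	return n;
}

/* Example handler: record the first byte of each message, in order. */
static char seen[16];
static unsigned int seen_len;

static void record_first_byte(const void *data, unsigned int len)
{
	if (len > 0 && seen_len < sizeof(seen))
		seen[seen_len++] = ((const char *)data)[0];
}
```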
The WQ CQ service counts transmit completions and is called from the
synchronous MBOX send path.
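Neither CQ service loop uses a producer index to find new completions; both rely on the color bit. The device writes each entry with the current pass color and flips that color on every wrap. The driver stops at the first entry whose color still equals cq->last_color, and flips last_color whenever to_clean wraps. This is a simplified sketch: it assumes a 4-entry ring that stores only the color bit, and hw_post() is a hypothetical stand-in for the device.

```c
#include <assert.h>
#include <stdint.h>

#define RING_SIZE 4

/* Hypothetical minimal completion ring; only the color bit matters. */
struct cq_sketch {
	uint8_t color[RING_SIZE];	/* color the "device" wrote per slot */
	unsigned int to_clean;		/* next slot the driver inspects */
	unsigned int last_color;	/* color of already-consumed entries */
	unsigned int hw_index;		/* device's next write slot */
	unsigned int hw_color;		/* color the device writes this pass */
};

static void cq_sketch_init(struct cq_sketch *cq)
{
	for (unsigned int i = 0; i < RING_SIZE; i++)
		cq->color[i] = 0;	/* zeroed ring looks "already seen" */
	cq->to_clean = 0;
	cq->last_color = 0;
	cq->hw_index = 0;
	cq->hw_color = 1;		/* cq_tail_color = 1 for the first pass */
}

/* Device side: post one completion, flipping the color on wrap. */
static void hw_post(struct cq_sketch *cq)
{
	cq->color[cq->hw_index] = cq->hw_color;
	if (++cq->hw_index == RING_SIZE) {
		cq->hw_index = 0;
		cq->hw_color ^= 1;
	}
}

/* Driver side: like enic_admin_wq_cq_service(), count new entries. */
static unsigned int cq_service(struct cq_sketch *cq)
{
	unsigned int work = 0;

	while (cq->color[cq->to_clean] != cq->last_color) {
		/* the kernel code issues rmb() here before reading fields */
		if (++cq->to_clean == RING_SIZE) {
			cq->to_clean = 0;
			cq->last_color ^= 1;	/* expect the other color next */
		}
		work++;
	}
	return work;
}
```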
enic_admin_rq_fill() and enic_admin_rq_post_one() now take a gfp_t:
the initial fill at open time uses GFP_KERNEL, while the refill from
NAPI context during CQ processing uses GFP_ATOMIC.
The admin channel open/close paths set up and tear down the MSI-X
interrupt, NAPI instance, and workqueue. CQ init enables interrupt
delivery and sets the interrupt offset so completions trigger the
admin ISR.
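The interrupt/NAPI handshake in enic_admin_napi_poll() reduces to one decision per poll round. This sketch of that decision uses a hypothetical pending count instead of real CQ entries, and a boolean instead of the return value of napi_complete_done(). Credits are returned in both branches. The interrupt is unmasked only if the ring drained within budget and NAPI completed.

```c
#include <assert.h>
#include <stdbool.h>

/* Outcome of one poll round in this sketch. */
struct poll_result {
	unsigned int work;
	bool schedule_worker;	/* schedule_work(&enic->admin_msg_work) */
	bool unmask;		/* return credits with unmask = 1 */
};

static struct poll_result admin_poll(unsigned int pending, unsigned int budget,
				     bool napi_completed)
{
	struct poll_result r;

	/* enic_admin_rq_cq_service() never exceeds the budget. */
	r.work = pending < budget ? pending : budget;
	/* Anything received is handed to process context. */
	r.schedule_worker = r.work > 0;
	/* Budget exhausted: stay masked, NAPI will poll again. */
	r.unmask = r.work < budget && napi_completed;
	return r;
}
```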
Signed-off-by: Satish Kharat <satishkh@cisco.com>
---
drivers/net/ethernet/cisco/enic/enic.h | 8 +
drivers/net/ethernet/cisco/enic/enic_admin.c | 297 +++++++++++++++++++++++++--
drivers/net/ethernet/cisco/enic/enic_admin.h | 12 ++
3 files changed, 295 insertions(+), 22 deletions(-)
diff --git a/drivers/net/ethernet/cisco/enic/enic.h b/drivers/net/ethernet/cisco/enic/enic.h
index 08472420f3a1..1c09da3c0b1a 100644
--- a/drivers/net/ethernet/cisco/enic/enic.h
+++ b/drivers/net/ethernet/cisco/enic/enic.h
@@ -296,6 +296,14 @@ struct enic {
struct vnic_rq admin_rq;
struct vnic_cq admin_cq[2];
struct vnic_intr admin_intr;
+ struct napi_struct admin_napi;
+ unsigned int admin_intr_index;
+ struct work_struct admin_msg_work;
+ spinlock_t admin_msg_lock; /* protects admin_msg_list */
+ struct list_head admin_msg_list;
+ u64 admin_msg_drop_cnt;
+ void (*admin_rq_handler)(struct enic *enic, void *buf,
+ unsigned int len);
};
static inline struct net_device *vnic_get_netdev(struct vnic_dev *vdev)
diff --git a/drivers/net/ethernet/cisco/enic/enic_admin.c b/drivers/net/ethernet/cisco/enic/enic_admin.c
index a8fcd5f116d1..345d194c6eeb 100644
--- a/drivers/net/ethernet/cisco/enic/enic_admin.c
+++ b/drivers/net/ethernet/cisco/enic/enic_admin.c
@@ -4,6 +4,7 @@
#include <linux/kernel.h>
#include <linux/netdevice.h>
#include <linux/dma-mapping.h>
+#include <linux/interrupt.h>
#include "vnic_dev.h"
#include "vnic_wq.h"
@@ -15,6 +16,7 @@
#include "enic.h"
#include "enic_admin.h"
#include "cq_desc.h"
+#include "cq_enet_desc.h"
#include "wq_enet_desc.h"
#include "rq_enet_desc.h"
@@ -38,14 +40,14 @@ static void enic_admin_rq_buf_clean(struct vnic_rq *rq,
buf->os_buf = NULL;
}
-static int enic_admin_rq_post_one(struct enic *enic)
+static int enic_admin_rq_post_one(struct enic *enic, gfp_t gfp)
{
struct vnic_rq *rq = &enic->admin_rq;
struct rq_enet_desc *desc;
dma_addr_t dma_addr;
void *buf;
- buf = kmalloc(ENIC_ADMIN_BUF_SIZE, GFP_KERNEL);
+ buf = kmalloc(ENIC_ADMIN_BUF_SIZE, gfp);
if (!buf)
return -ENOMEM;
@@ -64,13 +66,13 @@ static int enic_admin_rq_post_one(struct enic *enic)
return 0;
}
-static int enic_admin_rq_fill(struct enic *enic)
+static int enic_admin_rq_fill(struct enic *enic, gfp_t gfp)
{
struct vnic_rq *rq = &enic->admin_rq;
int err;
while (vnic_rq_desc_avail(rq) > 0) {
- err = enic_admin_rq_post_one(enic);
+ err = enic_admin_rq_post_one(enic, gfp);
if (err)
return err;
}
@@ -83,6 +85,207 @@ static void enic_admin_rq_drain(struct enic *enic)
vnic_rq_clean(&enic->admin_rq, enic_admin_rq_buf_clean);
}
+static unsigned int enic_admin_cq_color(void *cq_desc, unsigned int desc_size)
+{
+ u8 type_color = *((u8 *)cq_desc + desc_size - 1);
+
+ return (type_color >> CQ_DESC_COLOR_SHIFT) & CQ_DESC_COLOR_MASK;
+}
+
+unsigned int enic_admin_wq_cq_service(struct enic *enic)
+{
+ struct vnic_cq *cq = &enic->admin_cq[0];
+ unsigned int work = 0;
+ void *desc;
+
+ desc = vnic_cq_to_clean(cq);
+ while (enic_admin_cq_color(desc, cq->ring.desc_size) !=
+ cq->last_color) {
+ /* Ensure color bit is read before descriptor fields */
+ rmb();
+ vnic_cq_inc_to_clean(cq);
+ work++;
+ desc = vnic_cq_to_clean(cq);
+ }
+
+ return work;
+}
+
+static void enic_admin_msg_enqueue(struct enic *enic, void *buf,
+ unsigned int len)
+{
+ struct enic_admin_msg *msg;
+
+ msg = kmalloc(struct_size(msg, data, len), GFP_ATOMIC);
+ if (!msg) {
+ enic->admin_msg_drop_cnt++;
+ if (net_ratelimit())
+ netdev_warn(enic->netdev,
+ "admin msg enqueue drop (len=%u drops=%llu)\n",
+ len, enic->admin_msg_drop_cnt);
+ return;
+ }
+
+ msg->len = len;
+ memcpy(msg->data, buf, len);
+
+ spin_lock(&enic->admin_msg_lock);
+ list_add_tail(&msg->list, &enic->admin_msg_list);
+ spin_unlock(&enic->admin_msg_lock);
+}
+
+unsigned int enic_admin_rq_cq_service(struct enic *enic, unsigned int budget)
+{
+ struct vnic_cq *cq = &enic->admin_cq[1];
+ struct vnic_rq *rq = &enic->admin_rq;
+ struct vnic_rq_buf *buf;
+ unsigned int work = 0;
+ void *desc;
+
+ desc = vnic_cq_to_clean(cq);
+ while (work < budget &&
+ enic_admin_cq_color(desc, cq->ring.desc_size) !=
+ cq->last_color) {
+ /* Ensure CQ descriptor fields are read after
+ * the color/valid check.
+ */
+ rmb();
+ buf = rq->to_clean;
+
+ dma_sync_single_for_cpu(&enic->pdev->dev,
+ buf->dma_addr, buf->len,
+ DMA_FROM_DEVICE);
+
+ enic_admin_msg_enqueue(enic, buf->os_buf, buf->len);
+
+ enic_admin_rq_buf_clean(rq, rq->to_clean);
+ rq->to_clean = rq->to_clean->next;
+ rq->ring.desc_avail++;
+
+ vnic_cq_inc_to_clean(cq);
+ work++;
+ desc = vnic_cq_to_clean(cq);
+ }
+
+ enic_admin_rq_fill(enic, GFP_ATOMIC);
+
+ return work;
+}
+
+static irqreturn_t enic_admin_isr_msix(int irq, void *data)
+{
+ struct napi_struct *napi = data;
+
+ napi_schedule_irqoff(napi);
+
+ return IRQ_HANDLED;
+}
+
+static void enic_admin_msg_work_handler(struct work_struct *work)
+{
+ struct enic *enic = container_of(work, struct enic, admin_msg_work);
+ struct enic_admin_msg *msg, *tmp;
+ LIST_HEAD(local_list);
+
+ spin_lock_bh(&enic->admin_msg_lock);
+ list_splice_init(&enic->admin_msg_list, &local_list);
+ spin_unlock_bh(&enic->admin_msg_lock);
+
+ list_for_each_entry_safe(msg, tmp, &local_list, list) {
+ if (enic->admin_rq_handler)
+ enic->admin_rq_handler(enic, msg->data, msg->len);
+ list_del(&msg->list);
+ kfree(msg);
+ }
+}
+
+static int enic_admin_napi_poll(struct napi_struct *napi, int budget)
+{
+ struct enic *enic = container_of(napi, struct enic, admin_napi);
+ unsigned int credits;
+ unsigned int rq_work;
+
+ credits = vnic_intr_credits(&enic->admin_intr);
+
+ rq_work = enic_admin_rq_cq_service(enic, budget);
+
+ if (rq_work > 0)
+ schedule_work(&enic->admin_msg_work);
+
+ if (rq_work < budget && napi_complete_done(napi, rq_work)) {
+ if (credits)
+ vnic_intr_return_credits(&enic->admin_intr, credits,
+ 1 /* unmask */, 0);
+ } else {
+ if (credits)
+ vnic_intr_return_credits(&enic->admin_intr, credits,
+ 0 /* don't unmask */, 0);
+ }
+
+ return rq_work;
+}
+
+static int enic_admin_setup_intr(struct enic *enic)
+{
+ unsigned int intr_index = enic->intr_count;
+ int err;
+
+ if (vnic_dev_get_intr_mode(enic->vdev) != VNIC_DEV_INTR_MODE_MSIX ||
+ intr_index >= enic->intr_avail)
+ return -ENODEV;
+
+ err = vnic_intr_alloc(enic->vdev, &enic->admin_intr, intr_index);
+ if (err) {
+ netdev_warn(enic->netdev,
+ "Failed to alloc admin intr at index %u: %d\n",
+ intr_index, err);
+ return err;
+ }
+
+ enic->admin_intr_index = intr_index;
+
+ snprintf(enic->msix[intr_index].devname,
+ sizeof(enic->msix[intr_index].devname),
+ "%s-admin", enic->netdev->name);
+ enic->msix[intr_index].isr = enic_admin_isr_msix;
+ enic->msix[intr_index].devid = &enic->admin_napi;
+
+ err = request_irq(enic->msix_entry[intr_index].vector,
+ enic->msix[intr_index].isr, 0,
+ enic->msix[intr_index].devname,
+ enic->msix[intr_index].devid);
+ if (err) {
+ netdev_warn(enic->netdev,
+ "Failed to request admin MSI-X irq: %d\n", err);
+ vnic_intr_free(&enic->admin_intr);
+ return err;
+ }
+
+ enic->msix[intr_index].requested = 1;
+
+ netif_napi_add(enic->netdev, &enic->admin_napi,
+ enic_admin_napi_poll);
+ napi_enable(&enic->admin_napi);
+
+ netdev_dbg(enic->netdev,
+ "admin channel using MSI-X interrupt (index %u)\n",
+ intr_index);
+
+ return 0;
+}
+
+static void enic_admin_teardown_intr(struct enic *enic)
+{
+ unsigned int intr_index = enic->admin_intr_index;
+
+ napi_disable(&enic->admin_napi);
+ netif_napi_del(&enic->admin_napi);
+
+ free_irq(enic->msix_entry[intr_index].vector,
+ enic->msix[intr_index].devid);
+ enic->msix[intr_index].requested = 0;
+}
+
static int enic_admin_qp_type_set(struct enic *enic, u32 enable)
{
u64 a0 = QP_TYPE_ADMIN, a1 = enable;
@@ -128,23 +331,8 @@ static int enic_admin_alloc_resources(struct enic *enic)
if (err)
goto free_cq0;
- /* PFs have dedicated SRIOV_INTR resources for admin channel.
- * VFs lack SRIOV_INTR; use a regular INTR_CTRL slot instead.
- */
- if (vnic_dev_get_res_count(enic->vdev, RES_TYPE_SRIOV_INTR) >= 1)
- err = vnic_intr_alloc_with_type(enic->vdev,
- &enic->admin_intr, 0,
- RES_TYPE_SRIOV_INTR);
- else
- err = vnic_intr_alloc(enic->vdev, &enic->admin_intr,
- enic->intr_count);
- if (err)
- goto free_cq1;
-
return 0;
-free_cq1:
- vnic_cq_free(&enic->admin_cq[1]);
free_cq0:
vnic_cq_free(&enic->admin_cq[0]);
free_rq:
@@ -165,10 +353,32 @@ static void enic_admin_free_resources(struct enic *enic)
static void enic_admin_init_resources(struct enic *enic)
{
+ unsigned int intr_offset = enic->admin_intr_index;
+
vnic_wq_init(&enic->admin_wq, 0, 0, 0);
vnic_rq_init(&enic->admin_rq, 1, 0, 0);
- vnic_cq_init(&enic->admin_cq[0], 0, 1, 0, 0, 1, 0, 1, 0, 0, 0);
- vnic_cq_init(&enic->admin_cq[1], 0, 1, 0, 0, 1, 0, 1, 0, 0, 0);
+ vnic_cq_init(&enic->admin_cq[0],
+ 0 /* flow_control_enable */,
+ 1 /* color_enable */,
+ 0 /* cq_head */,
+ 0 /* cq_tail */,
+ 1 /* cq_tail_color */,
+ 1 /* interrupt_enable */,
+ 1 /* cq_entry_enable */,
+ 0 /* cq_message_enable */,
+ intr_offset,
+ 0 /* cq_message_addr */);
+ vnic_cq_init(&enic->admin_cq[1],
+ 0 /* flow_control_enable */,
+ 1 /* color_enable */,
+ 0 /* cq_head */,
+ 0 /* cq_tail */,
+ 1 /* cq_tail_color */,
+ 1 /* interrupt_enable */,
+ 1 /* cq_entry_enable */,
+ 0 /* cq_message_enable */,
+ intr_offset,
+ 0 /* cq_message_addr */);
vnic_intr_init(&enic->admin_intr, 0, 0, 1);
}
@@ -187,12 +397,24 @@ int enic_admin_channel_open(struct enic *enic)
return err;
}
+ err = enic_admin_setup_intr(enic);
+ if (err) {
+ netdev_err(enic->netdev,
+ "Admin channel requires MSI-X, SR-IOV unavailable: %d\n",
+ err);
+ goto free_resources;
+ }
+
+ spin_lock_init(&enic->admin_msg_lock);
+ INIT_LIST_HEAD(&enic->admin_msg_list);
+ INIT_WORK(&enic->admin_msg_work, enic_admin_msg_work_handler);
+
enic_admin_init_resources(enic);
vnic_wq_enable(&enic->admin_wq);
vnic_rq_enable(&enic->admin_rq);
- err = enic_admin_rq_fill(enic);
+ err = enic_admin_rq_fill(enic, GFP_KERNEL);
if (err) {
netdev_err(enic->netdev,
"Failed to fill admin RQ buffers: %d\n", err);
@@ -206,22 +428,53 @@ int enic_admin_channel_open(struct enic *enic)
goto disable_queues;
}
+ vnic_intr_unmask(&enic->admin_intr);
+
+ netdev_dbg(enic->netdev,
+ "admin channel open: intr=%u wq_avail=%u rq_avail=%u cq0_color=%u cq1_color=%u\n",
+ enic->admin_intr_index,
+ vnic_wq_desc_avail(&enic->admin_wq),
+ vnic_rq_desc_avail(&enic->admin_rq),
+ enic->admin_cq[0].last_color,
+ enic->admin_cq[1].last_color);
+
return 0;
disable_queues:
+ enic_admin_teardown_intr(enic);
vnic_wq_disable(&enic->admin_wq);
vnic_rq_disable(&enic->admin_rq);
enic_admin_qp_type_set(enic, 0);
enic_admin_rq_drain(enic);
+free_resources:
enic_admin_free_resources(enic);
return err;
}
+static void enic_admin_msg_drain(struct enic *enic)
+{
+ struct enic_admin_msg *msg, *tmp;
+
+ spin_lock_bh(&enic->admin_msg_lock);
+ list_for_each_entry_safe(msg, tmp, &enic->admin_msg_list, list) {
+ list_del(&msg->list);
+ kfree(msg);
+ }
+ spin_unlock_bh(&enic->admin_msg_lock);
+}
+
void enic_admin_channel_close(struct enic *enic)
{
if (!enic->has_admin_channel)
return;
+ netdev_dbg(enic->netdev, "admin channel close\n");
+
+ vnic_intr_mask(&enic->admin_intr);
+ enic_admin_teardown_intr(enic);
+ cancel_work_sync(&enic->admin_msg_work);
+ enic_admin_msg_drain(enic);
+
vnic_wq_disable(&enic->admin_wq);
vnic_rq_disable(&enic->admin_rq);
diff --git a/drivers/net/ethernet/cisco/enic/enic_admin.h b/drivers/net/ethernet/cisco/enic/enic_admin.h
index 569aadeb9312..73cdd3dac7ec 100644
--- a/drivers/net/ethernet/cisco/enic/enic_admin.h
+++ b/drivers/net/ethernet/cisco/enic/enic_admin.h
@@ -9,7 +9,19 @@
struct enic;
+/* Wrapper for received admin messages queued for deferred processing.
+ * NAPI enqueues these; a workqueue handler processes them in process context
+ * where sleeping (mutex, GFP_KERNEL) is safe.
+ */
+struct enic_admin_msg {
+ struct list_head list;
+ unsigned int len;
+ u8 data[];
+};
+
int enic_admin_channel_open(struct enic *enic);
void enic_admin_channel_close(struct enic *enic);
+unsigned int enic_admin_wq_cq_service(struct enic *enic);
+unsigned int enic_admin_rq_cq_service(struct enic *enic, unsigned int budget);
#endif /* _ENIC_ADMIN_H_ */
--
2.43.0
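The `struct enic_admin_msg` comment above describes a handoff in which NAPI copies each received message into a wrapper, a workqueue consumes it later, and `enic_admin_msg_drain()` frees anything left over at close. What follows is a minimal userspace sketch of that allocate/queue/drain pattern, not the driver's code. It assumes `malloc`/`free` in place of `kmalloc`/`kfree`, a hand-rolled singly linked list in place of `struct list_head`, and no locking (the driver takes `admin_msg_lock`). All names in the sketch are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical userspace analogue of struct enic_admin_msg: the payload
 * lives in a flexible array member, so one allocation holds both the
 * header and the message bytes. */
struct admin_msg {
	struct admin_msg *next;
	unsigned int len;
	unsigned char data[];
};

/* FIFO with a tail pointer so appends are O(1). */
struct msg_queue {
	struct admin_msg *head;
	struct admin_msg **tail; /* address of the last 'next' field */
};

static void msg_queue_init(struct msg_queue *q)
{
	q->head = NULL;
	q->tail = &q->head;
}

/* Copy a received buffer into a new wrapper and append it
 * (the producer side; NAPI context in the driver). */
static int msg_enqueue(struct msg_queue *q, const void *buf, unsigned int len)
{
	struct admin_msg *msg = malloc(sizeof(*msg) + len);

	if (!msg)
		return -1;
	msg->next = NULL;
	msg->len = len;
	memcpy(msg->data, buf, len);
	*q->tail = msg;
	q->tail = &msg->next;
	return 0;
}

/* Free every queued message and reset the queue, mirroring what
 * enic_admin_msg_drain() does at close. Returns how many were freed. */
static unsigned int msg_drain(struct msg_queue *q)
{
	struct admin_msg *msg = q->head;
	unsigned int freed = 0;

	while (msg) {
		struct admin_msg *next = msg->next;

		free(msg);
		msg = next;
		freed++;
	}
	msg_queue_init(q);
	return freed;
}
```

Because header and payload share one allocation, a single `free()` (or `kfree()`) releases both, so the drain loop needs no per-field cleanup. In the patch, the drain runs only after `vnic_intr_mask()`, interrupt teardown and `cancel_work_sync()`, so nothing else should still be adding to or walking the list.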
* [PATCH net-next v2 03/10] enic: add admin RQ buffer management
From: Satish Kharat via B4 Relay @ 2026-04-08 15:08 UTC
To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni
Cc: netdev, linux-kernel,
20260401-enic-sriov-v2-prep-v4-0-d5834b2ef1b9, Satish Kharat
In-Reply-To: <20260408-enic-sriov-v2-admin-channel-v2-v2-0-d05dd3623fd3@cisco.com>
From: Satish Kharat <satishkh@cisco.com>
The admin receive queue needs pre-posted DMA buffers for incoming
mailbox messages from VFs. Each buffer is a kmalloc'd region mapped
for DMA (2048 bytes, sufficient for any MBOX message).
Add enic_admin_rq_fill() to post buffers at open time, and
enic_admin_rq_drain() to unmap and free them at close time.
Wire both into the admin channel open/close paths.
Signed-off-by: Satish Kharat <satishkh@cisco.com>
---
drivers/net/ethernet/cisco/enic/enic_admin.c | 66 +++++++++++++++++++++++++++-
1 file changed, 64 insertions(+), 2 deletions(-)
diff --git a/drivers/net/ethernet/cisco/enic/enic_admin.c b/drivers/net/ethernet/cisco/enic/enic_admin.c
index d1abe6a50095..a8fcd5f116d1 100644
--- a/drivers/net/ethernet/cisco/enic/enic_admin.c
+++ b/drivers/net/ethernet/cisco/enic/enic_admin.c
@@ -3,6 +3,7 @@
#include <linux/kernel.h>
#include <linux/netdevice.h>
+#include <linux/dma-mapping.h>
#include "vnic_dev.h"
#include "vnic_wq.h"
@@ -23,10 +24,63 @@ static void enic_admin_wq_buf_clean(struct vnic_wq *wq,
{
}
-/* No-op: admin RQ buffer teardown is handled in enic_admin_channel_close */
static void enic_admin_rq_buf_clean(struct vnic_rq *rq,
struct vnic_rq_buf *buf)
{
+ struct enic *enic = vnic_dev_priv(rq->vdev);
+
+ if (!buf->os_buf)
+ return;
+
+ dma_unmap_single(&enic->pdev->dev, buf->dma_addr, buf->len,
+ DMA_FROM_DEVICE);
+ kfree(buf->os_buf);
+ buf->os_buf = NULL;
+}
+
+static int enic_admin_rq_post_one(struct enic *enic)
+{
+ struct vnic_rq *rq = &enic->admin_rq;
+ struct rq_enet_desc *desc;
+ dma_addr_t dma_addr;
+ void *buf;
+
+ buf = kmalloc(ENIC_ADMIN_BUF_SIZE, GFP_KERNEL);
+ if (!buf)
+ return -ENOMEM;
+
+ dma_addr = dma_map_single(&enic->pdev->dev, buf, ENIC_ADMIN_BUF_SIZE,
+ DMA_FROM_DEVICE);
+ if (dma_mapping_error(&enic->pdev->dev, dma_addr)) {
+ kfree(buf);
+ return -ENOMEM;
+ }
+
+ desc = vnic_rq_next_desc(rq);
+ rq_enet_desc_enc(desc, (u64)dma_addr | VNIC_PADDR_TARGET,
+ RQ_ENET_TYPE_ONLY_SOP, ENIC_ADMIN_BUF_SIZE);
+ vnic_rq_post(rq, buf, 0, dma_addr, ENIC_ADMIN_BUF_SIZE, 0);
+
+ return 0;
+}
+
+static int enic_admin_rq_fill(struct enic *enic)
+{
+ struct vnic_rq *rq = &enic->admin_rq;
+ int err;
+
+ while (vnic_rq_desc_avail(rq) > 0) {
+ err = enic_admin_rq_post_one(enic);
+ if (err)
+ return err;
+ }
+
+ return 0;
+}
+
+static void enic_admin_rq_drain(struct enic *enic)
+{
+ vnic_rq_clean(&enic->admin_rq, enic_admin_rq_buf_clean);
}
static int enic_admin_qp_type_set(struct enic *enic, u32 enable)
@@ -138,6 +192,13 @@ int enic_admin_channel_open(struct enic *enic)
vnic_wq_enable(&enic->admin_wq);
vnic_rq_enable(&enic->admin_rq);
+ err = enic_admin_rq_fill(enic);
+ if (err) {
+ netdev_err(enic->netdev,
+ "Failed to fill admin RQ buffers: %d\n", err);
+ goto disable_queues;
+ }
+
err = enic_admin_qp_type_set(enic, 1);
if (err) {
netdev_err(enic->netdev,
@@ -151,6 +212,7 @@ int enic_admin_channel_open(struct enic *enic)
vnic_wq_disable(&enic->admin_wq);
vnic_rq_disable(&enic->admin_rq);
enic_admin_qp_type_set(enic, 0);
+ enic_admin_rq_drain(enic);
enic_admin_free_resources(enic);
return err;
}
@@ -166,7 +228,7 @@ void enic_admin_channel_close(struct enic *enic)
enic_admin_qp_type_set(enic, 0);
vnic_wq_clean(&enic->admin_wq, enic_admin_wq_buf_clean);
- vnic_rq_clean(&enic->admin_rq, enic_admin_rq_buf_clean);
+ enic_admin_rq_drain(enic);
vnic_cq_clean(&enic->admin_cq[0]);
vnic_cq_clean(&enic->admin_cq[1]);
vnic_intr_clean(&enic->admin_intr);
--
2.43.0
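Patch 03 posts buffers until the ring is full and relies on `enic_admin_rq_drain()` to release whatever was posted, including after a partial fill failure on the open path. The following is a minimal userspace sketch of that fill/drain contract, not the driver's code. It assumes a hypothetical fixed-size `rx_ring` with one buffer pointer per descriptor, `malloc` standing in for `kmalloc` plus `dma_map_single()`, and a `fail_after` test hook that simulates an allocation or mapping failure. None of these names come from the driver.

```c
#include <assert.h>
#include <stdlib.h>

#define RING_SIZE 64   /* mirrors ENIC_ADMIN_DESC_COUNT */
#define BUF_SIZE 2048  /* mirrors ENIC_ADMIN_BUF_SIZE */

/* Hypothetical stand-in for vnic_rq: one buffer pointer per descriptor. */
struct rx_ring {
	void *os_buf[RING_SIZE];
	unsigned int posted;
	unsigned int fail_after; /* test hook: fail once this many are posted */
};

static unsigned int ring_desc_avail(const struct rx_ring *r)
{
	return RING_SIZE - r->posted;
}

static int ring_post_one(struct rx_ring *r)
{
	void *buf;

	if (r->posted >= r->fail_after)
		return -1; /* simulate kmalloc or dma_map_single() failure */
	buf = malloc(BUF_SIZE);
	if (!buf)
		return -1;
	r->os_buf[r->posted++] = buf;
	return 0;
}

/* Same shape as enic_admin_rq_fill(): post until the ring is full and
 * stop at the first error, leaving cleanup of posted buffers to drain. */
static int ring_fill(struct rx_ring *r)
{
	while (ring_desc_avail(r) > 0) {
		int err = ring_post_one(r);

		if (err)
			return err;
	}
	return 0;
}

/* Applies the enic_admin_rq_buf_clean() logic to every slot: the NULL
 * check skips empty slots and makes repeated calls harmless. */
static unsigned int ring_drain(struct rx_ring *r)
{
	unsigned int freed = 0;

	for (unsigned int i = 0; i < RING_SIZE; i++) {
		if (!r->os_buf[i])
			continue;
		free(r->os_buf[i]);
		r->os_buf[i] = NULL;
		freed++;
	}
	r->posted = 0;
	return freed;
}
```

Clearing the pointer after freeing is what makes the drain safe to call on a partially filled ring or a second time; no buffer is freed twice.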
* [PATCH net-next v2 02/10] enic: add admin channel open and close for SR-IOV
From: Satish Kharat via B4 Relay @ 2026-04-08 15:08 UTC
To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni
Cc: netdev, linux-kernel,
20260401-enic-sriov-v2-prep-v4-0-d5834b2ef1b9, Satish Kharat
In-Reply-To: <20260408-enic-sriov-v2-admin-channel-v2-v2-0-d05dd3623fd3@cisco.com>
From: Satish Kharat <satishkh@cisco.com>
The V2 SR-IOV design uses a dedicated admin channel (WQ/RQ/CQ/INTR
on separate BAR resources) for PF-VF mailbox communication rather
than firmware-proxied devcmds.
Introduce enic_admin_channel_open() and enic_admin_channel_close().
Open allocates and initialises the admin WQ, RQ, two CQs (one per
direction) and one SR-IOV interrupt, then issues CMD_QP_TYPE_SET to
tell firmware the queues are admin-type. Close reverses the sequence.
Add CMD_QP_TYPE_SET (97) and QP_TYPE_ADMIN/DATA defines to
vnic_devcmd.h.
Signed-off-by: Satish Kharat <satishkh@cisco.com>
---
drivers/net/ethernet/cisco/enic/Makefile | 3 +-
drivers/net/ethernet/cisco/enic/enic_admin.c | 175 ++++++++++++++++++++++++++
drivers/net/ethernet/cisco/enic/enic_admin.h | 15 +++
drivers/net/ethernet/cisco/enic/vnic_devcmd.h | 9 ++
4 files changed, 201 insertions(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/cisco/enic/Makefile b/drivers/net/ethernet/cisco/enic/Makefile
index a96b8332e6e2..7ae72fefc99a 100644
--- a/drivers/net/ethernet/cisco/enic/Makefile
+++ b/drivers/net/ethernet/cisco/enic/Makefile
@@ -3,5 +3,6 @@ obj-$(CONFIG_ENIC) := enic.o
enic-y := enic_main.o vnic_cq.o vnic_intr.o vnic_wq.o \
enic_res.o enic_dev.o enic_pp.o vnic_dev.o vnic_rq.o vnic_vic.o \
- enic_ethtool.o enic_api.o enic_clsf.o enic_rq.o enic_wq.o
+ enic_ethtool.o enic_api.o enic_clsf.o enic_rq.o enic_wq.o \
+ enic_admin.o
diff --git a/drivers/net/ethernet/cisco/enic/enic_admin.c b/drivers/net/ethernet/cisco/enic/enic_admin.c
new file mode 100644
index 000000000000..d1abe6a50095
--- /dev/null
+++ b/drivers/net/ethernet/cisco/enic/enic_admin.c
@@ -0,0 +1,175 @@
+// SPDX-License-Identifier: GPL-2.0-only
+// Copyright 2025 Cisco Systems, Inc. All rights reserved.
+
+#include <linux/kernel.h>
+#include <linux/netdevice.h>
+
+#include "vnic_dev.h"
+#include "vnic_wq.h"
+#include "vnic_rq.h"
+#include "vnic_cq.h"
+#include "vnic_intr.h"
+#include "vnic_resource.h"
+#include "vnic_devcmd.h"
+#include "enic.h"
+#include "enic_admin.h"
+#include "cq_desc.h"
+#include "wq_enet_desc.h"
+#include "rq_enet_desc.h"
+
+/* No-op: admin WQ buffers are freed inline after completion polling */
+static void enic_admin_wq_buf_clean(struct vnic_wq *wq,
+ struct vnic_wq_buf *buf)
+{
+}
+
+/* No-op: admin RQ buffer teardown is handled in enic_admin_channel_close */
+static void enic_admin_rq_buf_clean(struct vnic_rq *rq,
+ struct vnic_rq_buf *buf)
+{
+}
+
+static int enic_admin_qp_type_set(struct enic *enic, u32 enable)
+{
+ u64 a0 = QP_TYPE_ADMIN, a1 = enable;
+ int wait = 1000;
+ int err;
+
+ spin_lock_bh(&enic->devcmd_lock);
+ err = vnic_dev_cmd(enic->vdev, CMD_QP_TYPE_SET, &a0, &a1, wait);
+ spin_unlock_bh(&enic->devcmd_lock);
+
+ return err;
+}
+
+static int enic_admin_alloc_resources(struct enic *enic)
+{
+ int err;
+
+ err = vnic_wq_alloc_with_type(enic->vdev, &enic->admin_wq, 0,
+ ENIC_ADMIN_DESC_COUNT,
+ sizeof(struct wq_enet_desc),
+ RES_TYPE_ADMIN_WQ);
+ if (err)
+ return err;
+
+ err = vnic_rq_alloc_with_type(enic->vdev, &enic->admin_rq, 0,
+ ENIC_ADMIN_DESC_COUNT,
+ sizeof(struct rq_enet_desc),
+ RES_TYPE_ADMIN_RQ);
+ if (err)
+ goto free_wq;
+
+ err = vnic_cq_alloc_with_type(enic->vdev, &enic->admin_cq[0], 0,
+ ENIC_ADMIN_DESC_COUNT,
+ sizeof(struct cq_desc),
+ RES_TYPE_ADMIN_CQ);
+ if (err)
+ goto free_rq;
+
+ err = vnic_cq_alloc_with_type(enic->vdev, &enic->admin_cq[1], 1,
+ ENIC_ADMIN_DESC_COUNT,
+ 16 << enic->ext_cq,
+ RES_TYPE_ADMIN_CQ);
+ if (err)
+ goto free_cq0;
+
+ /* PFs have dedicated SRIOV_INTR resources for admin channel.
+ * VFs lack SRIOV_INTR; use a regular INTR_CTRL slot instead.
+ */
+ if (vnic_dev_get_res_count(enic->vdev, RES_TYPE_SRIOV_INTR) >= 1)
+ err = vnic_intr_alloc_with_type(enic->vdev,
+ &enic->admin_intr, 0,
+ RES_TYPE_SRIOV_INTR);
+ else
+ err = vnic_intr_alloc(enic->vdev, &enic->admin_intr,
+ enic->intr_count);
+ if (err)
+ goto free_cq1;
+
+ return 0;
+
+free_cq1:
+ vnic_cq_free(&enic->admin_cq[1]);
+free_cq0:
+ vnic_cq_free(&enic->admin_cq[0]);
+free_rq:
+ vnic_rq_free(&enic->admin_rq);
+free_wq:
+ vnic_wq_free(&enic->admin_wq);
+ return err;
+}
+
+static void enic_admin_free_resources(struct enic *enic)
+{
+ vnic_intr_free(&enic->admin_intr);
+ vnic_cq_free(&enic->admin_cq[1]);
+ vnic_cq_free(&enic->admin_cq[0]);
+ vnic_rq_free(&enic->admin_rq);
+ vnic_wq_free(&enic->admin_wq);
+}
+
+static void enic_admin_init_resources(struct enic *enic)
+{
+ vnic_wq_init(&enic->admin_wq, 0, 0, 0);
+ vnic_rq_init(&enic->admin_rq, 1, 0, 0);
+ vnic_cq_init(&enic->admin_cq[0], 0, 1, 0, 0, 1, 0, 1, 0, 0, 0);
+ vnic_cq_init(&enic->admin_cq[1], 0, 1, 0, 0, 1, 0, 1, 0, 0, 0);
+ vnic_intr_init(&enic->admin_intr, 0, 0, 1);
+}
+
+int enic_admin_channel_open(struct enic *enic)
+{
+ int err;
+
+ if (!enic->has_admin_channel)
+ return -ENODEV;
+
+ err = enic_admin_alloc_resources(enic);
+ if (err) {
+ netdev_err(enic->netdev,
+ "Failed to alloc admin channel resources: %d\n",
+ err);
+ return err;
+ }
+
+ enic_admin_init_resources(enic);
+
+ vnic_wq_enable(&enic->admin_wq);
+ vnic_rq_enable(&enic->admin_rq);
+
+ err = enic_admin_qp_type_set(enic, 1);
+ if (err) {
+ netdev_err(enic->netdev,
+ "Failed to set admin QP type: %d\n", err);
+ goto disable_queues;
+ }
+
+ return 0;
+
+disable_queues:
+ vnic_wq_disable(&enic->admin_wq);
+ vnic_rq_disable(&enic->admin_rq);
+ enic_admin_qp_type_set(enic, 0);
+ enic_admin_free_resources(enic);
+ return err;
+}
+
+void enic_admin_channel_close(struct enic *enic)
+{
+ if (!enic->has_admin_channel)
+ return;
+
+ vnic_wq_disable(&enic->admin_wq);
+ vnic_rq_disable(&enic->admin_rq);
+
+ enic_admin_qp_type_set(enic, 0);
+
+ vnic_wq_clean(&enic->admin_wq, enic_admin_wq_buf_clean);
+ vnic_rq_clean(&enic->admin_rq, enic_admin_rq_buf_clean);
+ vnic_cq_clean(&enic->admin_cq[0]);
+ vnic_cq_clean(&enic->admin_cq[1]);
+ vnic_intr_clean(&enic->admin_intr);
+
+ enic_admin_free_resources(enic);
+}
diff --git a/drivers/net/ethernet/cisco/enic/enic_admin.h b/drivers/net/ethernet/cisco/enic/enic_admin.h
new file mode 100644
index 000000000000..569aadeb9312
--- /dev/null
+++ b/drivers/net/ethernet/cisco/enic/enic_admin.h
@@ -0,0 +1,15 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/* Copyright 2025 Cisco Systems, Inc. All rights reserved. */
+
+#ifndef _ENIC_ADMIN_H_
+#define _ENIC_ADMIN_H_
+
+#define ENIC_ADMIN_DESC_COUNT 64
+#define ENIC_ADMIN_BUF_SIZE 2048
+
+struct enic;
+
+int enic_admin_channel_open(struct enic *enic);
+void enic_admin_channel_close(struct enic *enic);
+
+#endif /* _ENIC_ADMIN_H_ */
diff --git a/drivers/net/ethernet/cisco/enic/vnic_devcmd.h b/drivers/net/ethernet/cisco/enic/vnic_devcmd.h
index 7a4bce736105..a1c8f522c7d7 100644
--- a/drivers/net/ethernet/cisco/enic/vnic_devcmd.h
+++ b/drivers/net/ethernet/cisco/enic/vnic_devcmd.h
@@ -455,8 +455,17 @@ enum vnic_devcmd_cmd {
*/
CMD_CQ_ENTRY_SIZE_SET = _CMDC(_CMD_DIR_WRITE, _CMD_VTYPE_ENET, 90),
+ /*
+ * Set queue pair type (admin or data)
+ * in: (u32) a0 = queue pair type (0 = admin, 1 = data)
+ * in: (u32) a1 = enable (1) / disable (0)
+ */
+ CMD_QP_TYPE_SET = _CMDC(_CMD_DIR_WRITE, _CMD_VTYPE_ENET, 97),
};
+#define QP_TYPE_ADMIN 0
+#define QP_TYPE_DATA 1
+
/* CMD_ENABLE2 flags */
#define CMD_ENABLE2_STANDBY 0x0
#define CMD_ENABLE2_ACTIVE 0x1
--
2.43.0
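`enic_admin_alloc_resources()` in patch 02 uses the usual kernel goto ladder: each failure jumps to a label that frees only what was already allocated, newest first. The following is a minimal userspace sketch of the same shape, assuming five hypothetical resources indexed 0 to 4 (standing in for the admin WQ, RQ, CQ0, CQ1 and interrupt) and a `fail_at` hook that makes one allocation fail. The release log lets the unwind order be checked.

```c
#include <assert.h>
#include <string.h>

#define NRES 5

/* Hypothetical resources standing in for admin WQ, RQ, CQ0, CQ1, INTR. */
static int fail_at = -1; /* index whose allocation fails, or -1 for none */
static int held[NRES];
static int release_log[NRES];
static int nreleased;

static void reset(int fail)
{
	fail_at = fail;
	nreleased = 0;
	memset(held, 0, sizeof(held));
	memset(release_log, 0, sizeof(release_log));
}

static int res_alloc(int i)
{
	if (i == fail_at)
		return -12; /* -ENOMEM */
	held[i] = 1;
	return 0;
}

static void res_free(int i)
{
	held[i] = 0;
	release_log[nreleased++] = i;
}

/* Each label frees one resource and falls through to the next, so a
 * failure at step N releases N-1 .. 0 in reverse allocation order. */
static int alloc_all(void)
{
	int err;

	err = res_alloc(0);
	if (err)
		return err;
	err = res_alloc(1);
	if (err)
		goto free0;
	err = res_alloc(2);
	if (err)
		goto free1;
	err = res_alloc(3);
	if (err)
		goto free2;
	err = res_alloc(4);
	if (err)
		goto free3;
	return 0;

free3:
	res_free(3);
free2:
	res_free(2);
free1:
	res_free(1);
free0:
	res_free(0);
	return err;
}
```

Because the labels fall through, adding another resource needs one new allocation step and one new label at the top of the ladder, with no change to the existing error paths.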
* [PATCH net-next v2 00/10] enic: SR-IOV V2 admin channel and MBOX protocol
From: Satish Kharat via B4 Relay @ 2026-04-08 15:08 UTC
To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni
Cc: netdev, linux-kernel,
20260401-enic-sriov-v2-prep-v4-0-d5834b2ef1b9, Satish Kharat
This series adds the admin channel infrastructure and mailbox (MBOX)
protocol needed for V2 SR-IOV support in the enic driver.
The V2 SR-IOV design uses a direct PF-VF communication channel built on
dedicated WQ/RQ/CQ hardware resources and an MSI-X interrupt.
Firmware capability and admin channel infrastructure (patches 1-4):
- Probe-time firmware feature check for V2 SR-IOV support
- Admin channel open/close, RQ buffer management, CQ service
with MSI-X interrupt and NAPI polling
MBOX protocol and VF enable (patches 5-10):
- MBOX message types, core send/receive, PF and VF handlers
- V2 SR-IOV enable wiring with admin channel setup
- V2 VF probe with admin channel and PF registration
This series depends on "enic: SR-IOV V2 resource discovery and VF
type detection" (Series 1), which has been accepted.
Depends-on: 20260401-enic-sriov-v2-prep-v4-0-d5834b2ef1b9@cisco.com
Signed-off-by: Satish Kharat <satishkh@cisco.com>
---
Changes in v2:
- Fix lines exceeding 80 columns (patches 4, 6, 7, 8)
- Add __maybe_unused to enic_sriov_configure and enic_sriov_v2_enable;
.sriov_configure wiring deferred to a later series after devcmd
hardening is in place (patch 9)
- Guard probe-time auto-enable to skip V2 VFs (patch 9)
- Link to v1: https://lore.kernel.org/r/20260406-enic-sriov-v2-admin-channel-v2-v1-0-82cc47636a78@cisco.com
---
Satish Kharat (10):
enic: verify firmware supports V2 SR-IOV at probe time
enic: add admin channel open and close for SR-IOV
enic: add admin RQ buffer management
enic: add admin CQ service with MSI-X interrupt and NAPI polling
enic: define MBOX message types and header structures
enic: add MBOX core send and receive for admin channel
enic: add MBOX PF handlers for VF register and capability
enic: add MBOX VF handlers for capability, register and link state
enic: wire V2 SR-IOV enable with admin channel and MBOX
enic: add V2 VF probe with admin channel and PF registration
drivers/net/ethernet/cisco/enic/Makefile | 3 +-
drivers/net/ethernet/cisco/enic/enic.h | 29 +-
drivers/net/ethernet/cisco/enic/enic_admin.c | 511 ++++++++++++++++++++++++
drivers/net/ethernet/cisco/enic/enic_admin.h | 27 ++
drivers/net/ethernet/cisco/enic/enic_main.c | 213 +++++++++-
drivers/net/ethernet/cisco/enic/enic_mbox.c | 546 ++++++++++++++++++++++++++
drivers/net/ethernet/cisco/enic/enic_mbox.h | 87 ++++
drivers/net/ethernet/cisco/enic/enic_res.c | 4 +-
drivers/net/ethernet/cisco/enic/vnic_devcmd.h | 11 +
drivers/net/ethernet/cisco/enic/vnic_enet.h | 4 +-
10 files changed, 1421 insertions(+), 14 deletions(-)
---
base-commit: 3e6ef4fb822c971b464d44910a1561b4e7f9efa7
change-id: 20260404-enic-sriov-v2-admin-channel-v2-c0aa3e988833
Best regards,
--
Satish Kharat <satishkh@cisco.com>
* [PATCH net-next v2 01/10] enic: verify firmware supports V2 SR-IOV at probe time
From: Satish Kharat via B4 Relay @ 2026-04-08 15:08 UTC
To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni
Cc: netdev, linux-kernel,
20260401-enic-sriov-v2-prep-v4-0-d5834b2ef1b9, Satish Kharat
In-Reply-To: <20260408-enic-sriov-v2-admin-channel-v2-v2-0-d05dd3623fd3@cisco.com>
From: Satish Kharat <satishkh@cisco.com>
During PF probe, query the firmware get-supported-feature interface
to verify that the running firmware supports V2 SR-IOV. Firmware
version 5.3(4.72) and later report VIC_FEATURE_SRIOV via
CMD_GET_SUPP_FEATURE_VER. If the firmware does not support the
feature, set vf_type to ENIC_VF_TYPE_NONE and log a warning so the
admin knows a firmware upgrade is needed.
The VIC_FEATURE_SRIOV enum value (4) matches the firmware ABI. A
placeholder entry (VIC_FEATURE_PTP at position 3) is added to keep
the enum in sync with firmware's feature numbering.
Signed-off-by: Satish Kharat <satishkh@cisco.com>
---
drivers/net/ethernet/cisco/enic/enic_main.c | 18 ++++++++++++++++++
drivers/net/ethernet/cisco/enic/vnic_devcmd.h | 2 ++
2 files changed, 20 insertions(+)
diff --git a/drivers/net/ethernet/cisco/enic/enic_main.c b/drivers/net/ethernet/cisco/enic/enic_main.c
index e7125b818087..3a4afd6da41f 100644
--- a/drivers/net/ethernet/cisco/enic/enic_main.c
+++ b/drivers/net/ethernet/cisco/enic/enic_main.c
@@ -2641,8 +2641,10 @@ static void enic_iounmap(struct enic *enic)
static void enic_sriov_detect_vf_type(struct enic *enic)
{
struct pci_dev *pdev = enic->pdev;
+ u64 supported_versions, a1 = 0;
int pos;
u16 vf_dev_id;
+ int err;
if (enic_is_sriov_vf(enic) || enic_is_dynamic(enic))
return;
@@ -2669,6 +2671,22 @@ static void enic_sriov_detect_vf_type(struct enic *enic)
enic->vf_type = ENIC_VF_TYPE_NONE;
break;
}
+
+ if (enic->vf_type == ENIC_VF_TYPE_V2) {
+ /* A successful command means firmware recognizes
+ * VIC_FEATURE_SRIOV; supported_versions is available
+ * for sub-feature versioning in the future.
+ */
+ err = vnic_dev_get_supported_feature_ver(enic->vdev,
+ VIC_FEATURE_SRIOV,
+ &supported_versions,
+ &a1);
+ if (err) {
+ dev_warn(&pdev->dev,
+ "SR-IOV V2 not supported by current firmware. Upgrade to VIC FW 5.3(4.72) or higher.\n");
+ enic->vf_type = ENIC_VF_TYPE_NONE;
+ }
+ }
}
#endif
diff --git a/drivers/net/ethernet/cisco/enic/vnic_devcmd.h b/drivers/net/ethernet/cisco/enic/vnic_devcmd.h
index 605ef17f967e..7a4bce736105 100644
--- a/drivers/net/ethernet/cisco/enic/vnic_devcmd.h
+++ b/drivers/net/ethernet/cisco/enic/vnic_devcmd.h
@@ -734,6 +734,8 @@ enum vic_feature_t {
VIC_FEATURE_VXLAN,
VIC_FEATURE_RDMA,
VIC_FEATURE_VXLAN_PATCH,
+ VIC_FEATURE_PTP,
+ VIC_FEATURE_SRIOV,
VIC_FEATURE_MAX,
};
--
2.43.0
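The commit message in patch 01 notes that `VIC_FEATURE_SRIOV` must equal 4 to match the firmware ABI, which is why `VIC_FEATURE_PTP` is added as a placeholder. The following is a minimal sketch, assuming the enum starts at `VIC_FEATURE_VXLAN = 0` (consistent with PTP at position 3). A `_Static_assert` turns an accidental renumbering into a build failure; the in-kernel equivalent would be `static_assert()` or `BUILD_BUG_ON()`. `gate_v2()` models the probe-time downgrade. The `vf_type` enum, `feature_query_fn` and the two fake firmware functions are hypothetical stand-ins for `enic->vf_type` and `vnic_dev_get_supported_feature_ver()`.

```c
#include <assert.h>

/* Mirrors enum vic_feature_t after the patch, assuming VXLAN is entry 0.
 * Positions are firmware ABI, so PTP stays even though it is unused. */
enum vic_feature_t {
	VIC_FEATURE_VXLAN,
	VIC_FEATURE_RDMA,
	VIC_FEATURE_VXLAN_PATCH,
	VIC_FEATURE_PTP,
	VIC_FEATURE_SRIOV,
	VIC_FEATURE_MAX,
};

/* Fails the build if someone inserts or removes an entry above SRIOV. */
_Static_assert(VIC_FEATURE_SRIOV == 4, "VIC_FEATURE_SRIOV must match firmware ABI");

enum vf_type { VF_TYPE_NONE, VF_TYPE_V2 }; /* hypothetical subset */

/* Hypothetical query hook: returns 0 if firmware recognizes the feature. */
typedef int (*feature_query_fn)(enum vic_feature_t feature);

/* Keep V2 only if firmware accepts the SR-IOV feature query;
 * the driver additionally logs an upgrade hint in the failure case. */
static enum vf_type gate_v2(enum vf_type detected, feature_query_fn query)
{
	if (detected == VF_TYPE_V2 && query(VIC_FEATURE_SRIOV) != 0)
		return VF_TYPE_NONE;
	return detected;
}

/* Fake firmware that rejects every feature query (-EINVAL). */
static int fw_old(enum vic_feature_t f)
{
	(void)f;
	return -22;
}

/* Fake firmware that recognizes every known feature. */
static int fw_new(enum vic_feature_t f)
{
	return f < VIC_FEATURE_MAX ? 0 : -22;
}
```

Only a V2-capable PF is ever downgraded; a device already classified as `VF_TYPE_NONE` never triggers the query.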
* Re: linux-next: manual merge of the net-next tree with the netfilter tree
From: Mark Brown @ 2026-04-08 15:08 UTC
To: David Miller, Jakub Kicinski, Paolo Abeni, Networking
Cc: Andrea Mayer, Justin Iurman, Linux Kernel Mailing List,
Linux Next Mailing List
In-Reply-To: <adZhwtOYfo-0ImSa@sirena.org.uk>
On Wed, Apr 08, 2026 at 03:10:10PM +0100, Mark Brown wrote:
> Hi all,
>
> Today's linux-next merge of the net-next tree got a conflict in:
>
> net/ipv6/seg6_iptunnel.c
>
> between commit:
>
> c3812651b522f ("seg6: separate dst_cache for input and output paths in seg6 lwtunnel")
>
> from the netfilter tree and commit:
>
> 78723a62b969a ("seg6: add per-route tunnel source address")
>
> from the net-next tree.
>
> I fixed it up (see below) and can carry the fix as necessary. This
> is now fixed as far as linux-next is concerned, but any non trivial
> conflicts should be mentioned to your upstream maintainer when your tree
> is submitted for merging. You may also want to consider cooperating
> with the maintainer of the conflicting tree to minimise any particularly
> complex conflicts.
>
> diff --cc net/ipv6/seg6_iptunnel.c
> index d6a0f7df90807,e76cc0cc481ec..0000000000000
> --- a/net/ipv6/seg6_iptunnel.c
> +++ b/net/ipv6/seg6_iptunnel.c
> @@@ -48,8 -48,8 +48,9 @@@ static size_t seg6_lwt_headroom(struct
> }
>
> struct seg6_lwt {
> - struct dst_cache cache;
> + struct dst_cache cache_input;
> + struct dst_cache cache_output;
> + struct in6_addr tunsrc;
> struct seg6_iptunnel_encap tuninfo[];
> };
>
This also needs a fixup for a new jump to the error handling paths that
was added in seg6_build_state().
* Re: [PATCH net-next v2 2/9] dt-bindings: net: lan9645x: add LAN9645X switch bindings
From: Jens Emil Schulz Ostergaard @ 2026-04-08 14:59 UTC
To: Rob Herring
Cc: UNGLinuxDriver, Andrew Lunn, Vladimir Oltean, David S. Miller,
Eric Dumazet, Jakub Kicinski, Paolo Abeni, Simon Horman,
Krzysztof Kozlowski, Conor Dooley, Woojung Huh, Russell King,
Steen Hegelund, Daniel Machon, linux-kernel, netdev, devicetree
In-Reply-To: <20260407171854.GA2970003-robh@kernel.org>
On Tue, 2026-04-07 at 12:18 -0500, Rob Herring wrote:
> On Tue, Mar 24, 2026 at 11:46:45AM +0100, Jens Emil Schulz Østergaard wrote:
> > Add bindings for LAN9645X switch. We use a fallback compatible for the
> > smallest SKU microchip,lan96455s-switch.
> >
> > Reviewed-by: Steen Hegelund <Steen.Hegelund@microchip.com>
> > Signed-off-by: Jens Emil Schulz Østergaard <jensemil.schulzostergaard@microchip.com>
> > ---
> > Changes in v2:
> > - rename file to microchip,lan96455s-switch.yaml
> > - remove led vendor property
> > - add {rx,tx}-internal-delay-ps for rgmii delay
> > - remove labels from example
> > - remove container node from example
> > ---
> > .../net/dsa/microchip,lan96455s-switch.yaml | 119 +++++++++++++++++++++
> > MAINTAINERS | 1 +
> > 2 files changed, 120 insertions(+)
> >
> > diff --git a/Documentation/devicetree/bindings/net/dsa/microchip,lan96455s-switch.yaml b/Documentation/devicetree/bindings/net/dsa/microchip,lan96455s-switch.yaml
> > new file mode 100644
> > index 000000000000..0282e25c05d4
> > --- /dev/null
> > +++ b/Documentation/devicetree/bindings/net/dsa/microchip,lan96455s-switch.yaml
> > @@ -0,0 +1,119 @@
> > +# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause
> > +%YAML 1.2
> > +---
> > +$id: http://devicetree.org/schemas/net/dsa/microchip,lan96455s-switch.yaml#
> > +$schema: http://devicetree.org/meta-schemas/core.yaml#
> > +
> > +title: Microchip LAN9645x Ethernet switch
> > +
> > +maintainers:
> > + - Jens Emil Schulz Østergaard <jensemil.schulzostergaard@microchip.com>
> > +
> > +description: |
>
> Don't need '|'
I will remove this.
>
> > + The LAN9645x switch is a multi-port Gigabit AVB/TSN Ethernet switch with
> > + five integrated 10/100/1000Base-T PHYs. In addition to the integrated PHYs,
> > + it supports up to 2 RGMII/RMII, up to 2 BASE-X/SERDES/2.5GBASE-X and one
> > + Quad-SGMII interfaces.
> > +
> > +properties:
> > + compatible:
> > + oneOf:
> > + - enum:
> > + - microchip,lan96455s-switch
> > + - items:
> > + - enum:
> > + - microchip,lan96455f-switch
> > + - microchip,lan96457f-switch
> > + - microchip,lan96459f-switch
> > + - microchip,lan96457s-switch
> > + - microchip,lan96459s-switch
> > + - const: microchip,lan96455s-switch
> > +
> > + reg:
> > + maxItems: 1
> > +
> > +$ref: dsa.yaml#
>
> Since you don't have any custom properties (just constraints), this ref
> should be "dsa.yaml#/$defs/ethernet-ports".
Right, I will update the ref.
>
> > +
> > +patternProperties:
> > + "^(ethernet-)?ports$":
>
> For a new binding, use the preferred name which is ethernet-ports. ports
> and port collide with the graph binding.
>
OK, I will use ethernet-ports and move it from patternProperties to properties.
> > + type: object
> > + additionalProperties: true
> > + patternProperties:
> > + "^(ethernet-)?port@[0-8]$":
>
> And 'ethernet-port'
I will change this and update the example.
>
> > + type: object
> > + description: Ethernet switch ports
> > +
> > + $ref: dsa-port.yaml#
> > +
> > + properties:
> > + rx-internal-delay-ps:
> > + const: 2000
> > +
> > + tx-internal-delay-ps:
> > + const: 2000
> > +
> > + unevaluatedProperties: false
>
> Place this after the $ref.
I will move this.
>
> > +
> > +oneOf:
> > + - required:
> > + - ports
> > + - required:
> > + - ethernet-ports
> > +
> > +required:
> > + - compatible
> > + - reg
> > +
> > +unevaluatedProperties: false
> > +
> > +examples:
> > + - |
> > + ethernet-switch@4000 {
> > + compatible = "microchip,lan96459f-switch", "microchip,lan96455s-switch";
> > + reg = <0x4000 0x244>;
> > +
> > + ethernet-ports {
> > + #address-cells = <1>;
> > + #size-cells = <0>;
> > +
> > + port@0 {
> > + reg = <0>;
> > + phy-mode = "gmii";
> > + phy-handle = <&cuphy0>;
> > + };
> > +
> > + port@1 {
> > + reg = <1>;
> > + phy-mode = "gmii";
> > + phy-handle = <&cuphy1>;
> > + };
> > +
> > + port@2 {
> > + reg = <2>;
> > + phy-mode = "gmii";
> > + phy-handle = <&cuphy2>;
> > + };
> > +
> > + port@3 {
> > + reg = <3>;
> > + phy-mode = "gmii";
> > + phy-handle = <&cuphy3>;
> > + };
> > +
> > + port@7 {
> > + reg = <7>;
> > + phy-mode = "rgmii";
> > + ethernet = <&cpu_host_port>;
> > + rx-internal-delay-ps = <2000>;
> > + tx-internal-delay-ps = <2000>;
> > +
> > + fixed-link {
> > + speed = <1000>;
> > + full-duplex;
> > + pause;
> > + };
> > + };
> > + };
> > + };
> > +...
> > +
> > diff --git a/MAINTAINERS b/MAINTAINERS
> > index 7ae698067c41..8232da1b3951 100644
> > --- a/MAINTAINERS
> > +++ b/MAINTAINERS
> > @@ -17278,6 +17278,7 @@ M: Jens Emil Schulz Østergaard <jensemil.schulzostergaard@microchip.com>
> > M: UNGLinuxDriver@microchip.com
> > L: netdev@vger.kernel.org
> > S: Maintained
> > +F: Documentation/devicetree/bindings/net/dsa/microchip,lan96455s-switch.yaml
> > F: include/linux/dsa/lan9645x.h
> > F: net/dsa/tag_lan9645x.c
> >
> >
> > --
> > 2.52.0
> >
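The binding lists `microchip,lan96455s-switch` as the fallback compatible, so a board can name its exact SKU first and still bind to a driver that only knows the smallest one. The following is a simplified C model of that lookup, assuming a hypothetical one-entry driver table. The kernel's `of_match_node()` scores every table entry and prefers earlier compatible strings; for a single-entry table that amounts to asking whether any listed string matches, which is what the first-match walk here checks.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical driver match table: only the fallback compatible is listed. */
static const char *const driver_compat[] = {
	"microchip,lan96455s-switch",
	NULL,
};

/* Walk the node's compatible list from most to least specific and return
 * the index of the first entry the driver knows, or -1 if none match. */
static int match_compatible(const char *const *node_compat, size_t n)
{
	for (size_t i = 0; i < n; i++)
		for (size_t j = 0; driver_compat[j]; j++)
			if (strcmp(node_compat[i], driver_compat[j]) == 0)
				return (int)i;
	return -1;
}
```

With the example node from the binding (`"microchip,lan96459f-switch", "microchip,lan96455s-switch"`), the SKU-specific string is skipped and the fallback at index 1 matches.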
* Re: [Intel-wired-lan] [PATCH iwl-next 0/2] i40e: implement per-queue stats
From: Paolo Abeni @ 2026-04-08 14:58 UTC
To: Paul Menzel
Cc: intel-wired-lan, Tony Nguyen, Przemek Kitszel, Andrew Lunn,
David S. Miller, Eric Dumazet, Jakub Kicinski, Alexei Starovoitov,
Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend,
Stanislav Fomichev, netdev
In-Reply-To: <3c7e8261-7528-431c-adb9-3b90124cef99@molgen.mpg.de>
On 4/8/26 1:50 PM, Paul Menzel wrote:
> Dear Paolo,
>
> Thank you for your patches.
>
> Am 08.04.26 um 13:43 schrieb Paolo Abeni:
>> The i40e driver already collects some per queue statistics, but does
>> not expose them to the user-space using the standard interface.
>>
>> Implement the stat_ops callbacks and extends the already collected info
>
> s/extends/extend/
>
>> with basic GSO counters. Overall this allows passing the kernel NIC
>> drivers TSO test cases.
>
> It’d be great, if you added the commands to show the stats, and to run
> the test cases.
I'll add some "show stats" examples in the v2 cover, hopefully fixing
the typos while at it.
How to run the NIC driver H/W self-tests is IMHO too broad a topic for
the cover letter, but there is already a fairly comprehensive README in
the selftests directory:
https://elixir.bootlin.com/linux/v7.0-rc7/source/tools/testing/selftests/drivers/net/README.rst
/P
* Re: [PATCH net-next v2 4/9] net: dsa: lan9645x: add basic dsa driver for LAN9645X
From: Jens Emil Schulz Ostergaard @ 2026-04-08 14:57 UTC
To: Vladimir Oltean
Cc: Jakub Kicinski, UNGLinuxDriver, andrew, davem, edumazet, pabeni,
horms, robh, krzk+dt, conor+dt, woojung.huh, linux,
Steen.Hegelund, daniel.machon, linux-kernel, netdev, devicetree
In-Reply-To: <20260407091217.rrwzho3lcmfip3y5@skbuf>
On Tue, 2026-04-07 at 12:12 +0300, Vladimir Oltean wrote:
> On Wed, Apr 01, 2026 at 01:46:19PM +0200, Jens Emil Schulz Ostergaard wrote:
> > On Sun, 2026-03-29 at 12:56 -0700, Jakub Kicinski wrote:
> > > This is an AI-generated review of your patch. The human sending this
> > > email has considered the AI review valid, or at least plausible.
> > > ---
> > > Subject: net: dsa: lan9645x: add basic dsa driver for LAN9645X
> > >
> > > This commit adds a basic DSA driver for the Microchip LAN9645X switch. It
> > > includes initialization, port module setup for NPI, CPU, and front ports,
> > > and phylink integration for MAC side configuration.
> > >
> > > diff --git a/drivers/net/dsa/microchip/lan9645x/lan9645x_npi.c b/drivers/net/dsa/microchip/lan9645x/lan9645x_npi.c
> > > --- a/drivers/net/dsa/microchip/lan9645x/lan9645x_npi.c
> > > +++ b/drivers/net/dsa/microchip/lan9645x/lan9645x_npi.c
> > > [ ... ]
> > > > +static void disable_conduit_ipv6(struct lan9645x *lan9645x,
> > > > + struct net_device *conduit)
> > > > +{
> > > > + struct inet6_dev *dev_v6;
> > > > +
> > > > + if (!conduit)
> > > > + return;
> > > > +
> > > > + /* IPv6 on the conduit will generate frames bypassing our tag driver, so
> > > > + * they lack an IFH. This will be garbage in garbage out and we want to
> > > > + * avoid this.
> > > > + */
> > > > + rtnl_lock();
> > > > + dev_v6 = __in6_dev_get(conduit);
> > > > + if (dev_v6) {
> > > > + WRITE_ONCE(dev_v6->cnf.disable_ipv6, 1);
> > > > + dev_warn(lan9645x->dev, "Disabled IPv6 on conduit device: %s\n",
> > > > + netdev_name(conduit));
> > > > + }
> > > > + rtnl_unlock();
> > > > +}
> > >
> > > Does directly modifying dev_v6->cnf.disable_ipv6 bypass the proper sysctl
> > > teardown path (addrconf_disable_ipv6) and leave the IPv6 stack in an
> > > inconsistent state on the conduit device by failing to clean up routes,
> > > addresses, and MLD memberships?
> > >
> > > Is it a layering violation for a switch driver to alter the networking
> > > configuration of an external upper device, especially since this change
> > > isn't reverted during driver unbind?
> >
> > Yes this will bypass the sysctl teardown path. I can change this to instead
> > read disable_ipv6 state and warn the user that their device is misconfigured.
>
> I don't want to have to deal with a DSA driver altering the conduit
> configuration outside of the DSA core. NACK to any attempts of trying
> this.
>
> Define "garbage in, garbage out" for your case. The packets are sent to
> random ports? Normally, switches are configured by their drivers to drop
> those packets that don't contain DSA tags. You can also do that, by
> setting up the switch to require an injection prefix.
>
> > > [ ... ]
> > > > + /* Configure IFH prefix mode for NPI port. */
> > > > + lan_rmw(SYS_PORT_MODE_INCL_XTR_HDR_SET(LAN9645X_TAG_PREFIX_LONG) |
> > > > + SYS_PORT_MODE_INCL_INJ_HDR_SET(LAN9645X_TAG_PREFIX_NONE),
> > > > + SYS_PORT_MODE_INCL_XTR_HDR |
> > > > + SYS_PORT_MODE_INCL_INJ_HDR,
> > > > + lan9645x,
> > > > + SYS_PORT_MODE(p->chip_port));
> > >
> > > Setting LAN9645X_TAG_PREFIX_NONE means the hardware interprets the first
> > > bytes of any frame received from the host as an IFH command structure.
> > >
> > > Does this allow a local user to transmit crafted frames (for example, via
> > > raw sockets or standard protocols) whose MAC address bytes map to malicious
> > > IFH commands, thereby bypassing VLAN isolation or flooding the switch?
> > >
> > > Should the hardware be configured to require a strict prefix, such as
> > > LAN9645X_TAG_PREFIX_LONG, for injected frames to prevent this?
> >
> > A prefix offers no protection against the described scenario. The current
> > configuration is correct. If a prefix is configured frames MUST contain the
> > prefix. With the conduit exposed in userspace this is not something we can
> > guarantee.
>
> The short prefix prevents user space from sending frames by mistake
> which will reach the wire (daemons attaching to all interfaces, like for DHCP).
> It still allows user space to intentionally inject frames into the
> switch (for example from XDP on the conduit).
>
> I recommend you go for the short prefix and drop the conduit alterations.
>
By "garbage in, garbage out" I mean that frames will be interpreted as
having an IFH, and may be sent to random ports depending on the bytes in
the frame.
Unfortunately, configuring an injection prefix is not a good solution.
If an injection prefix is configured, frames MUST contain this prefix. If
frames are received that do not contain the prefix, they will corrupt
the queue system, leaving the port unusable.
Since the driver does not have control over frames sent to the NPI port,
an injection prefix configuration is highly undesirable.
I will remove the conduit alterations.
> > > diff --git a/drivers/net/dsa/microchip/lan9645x/lan9645x_main.c b/drivers/net/dsa/microchip/lan9645x/lan9645x_main.c
> > > --- a/drivers/net/dsa/microchip/lan9645x/lan9645x_main.c
> > > +++ b/drivers/net/dsa/microchip/lan9645x/lan9645x_main.c
> > > [ ... ]
> > > > +static int lan9645x_setup(struct dsa_switch *ds)
> > > > +{
> > > > + struct lan9645x *lan9645x = ds->priv;
> > > > + struct device *dev = lan9645x->dev;
> > > > + u32 all_phys_ports;
> > > > + int err;
> > > > +
> > > > + lan9645x->num_phys_ports = ds->num_ports;
> > > > + all_phys_ports = GENMASK(lan9645x->num_phys_ports - 1, 0);
> > > > +
> > > > + err = lan9645x_reset_switch(lan9645x);
> > > > + if (err)
> > > > + return err;
> > > > +
> > > > + lan9645x->ports = devm_kcalloc(lan9645x->dev, lan9645x->num_phys_ports,
> > > > + sizeof(struct lan9645x_port *),
> > > > + GFP_KERNEL);
> > >
> > > Since the DSA .setup callback can be invoked multiple times during the
> > > platform device's lifetime (for example, when the DSA master interface is
> > > unbound and bound again), does using devm_kcalloc here cause a memory leak?
> > > Should these allocations be moved to .probe or managed with kcalloc and
> > > freed in .teardown?
> > >
> >
> > Hmm, I do not think this is right. I think dsa_tree_teardown() is called from
> > dsa_switch_remove(), which is called by the switch driver's .remove.
> >
> > So I do not think you can get multiple setup/teardown cycles during the device
> > lifetime?
> >
> > As far as I can tell both felix and ksz use device managed allocations in their
> > setup callback.
>
> It's valid feedback in general. In a multi-switch tree, ds->ops->setup()
> is called when the tree is complete, and ds->ops->teardown() when the
> first switch disappears. If there are 2 switches and you unbind and
> rebind the driver from A, then switch B will see a ds->ops->teardown()
> call followed by ds->ops->setup().
>
> In your case, since the switch doesn't support cascading, it probably
> doesn't matter. But the LLM doesn't have the context to know that.
>
OK, I was not aware of this. I tried following the call stack and it did not seem
like the described scenario was possible. If device managed allocations are fine
when you do not support cascading, I would prefer to keep them.
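A userspace model of the multi-switch concern raised above may help. It assumes device-managed memory is only released at driver unbind, and uses hypothetical names (this is not the kernel devres API): every extra .setup() call without an intervening unbind leaves one more live port array behind.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical model of device-managed allocations: everything registered
 * here is freed only by devres_model_release_all(), which stands in for
 * driver unbind/.remove. Not the kernel devres API. */
#define DEVRES_MODEL_MAX 16

struct devres_model {
	void *allocs[DEVRES_MODEL_MAX];
	int count;
};

static void *devres_model_calloc(struct devres_model *dr, size_t n,
				 size_t size)
{
	void *p;

	assert(dr->count < DEVRES_MODEL_MAX);
	p = calloc(n, size);
	if (p)
		dr->allocs[dr->count++] = p;
	return p;
}

/* Stand-in for unbind: frees every registered allocation. */
static void devres_model_release_all(struct devres_model *dr)
{
	while (dr->count > 0)
		free(dr->allocs[--dr->count]);
}

/* Stand-in for a .setup() that allocates its port array with a
 * device-managed allocator; a matching .teardown() cannot free it. */
static void **model_setup(struct devres_model *dr, size_t num_ports)
{
	return devres_model_calloc(dr, num_ports, sizeof(void *));
}
```

In a single-switch tree model_setup() runs once per probe, so the count never exceeds one; in a multi-switch tree each teardown/setup cycle triggered by another switch adds an entry. kcalloc() in .setup() paired with kfree() in .teardown() would avoid the growth.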
> > > [ ... ]
> > > > + /* Set all the entries to obey VLAN. */
> > > > + for (int i = 0; i < PGID_ENTRIES; ++i)
> > > > + lan_wr(ANA_PGID_CFG_OBEY_VLAN_SET(1),
> > > > + lan9645x, ANA_PGID_CFG(i));
> > >
> > > PGID_ENTRIES is defined as 89, so this loop initializes indices 0 through
> > > 88. Since the CPU port is index 9, its source PGID is PGID_SRC + CPU_PORT
> > > (80 + 9 = 89).
> > >
> > > Is index 89 left uninitialized, breaking the OBEY_VLAN rule and allowing
> > > CPU-injected frames to leak across VLAN boundaries?
> > >
> >
> > No, this misunderstands OBEY_VLAN. When set, the VLAN table can control
> > whether the CPU copy from the PGID table is enabled. It makes no sense for PGID 89.
>
> Explain that in a comment.
>
I will add a comment about this.
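The index arithmetic behind the review comment can be checked in isolation. This sketch only reuses the constants quoted in the review (PGID_ENTRIES = 89, PGID_SRC = 80, CPU_PORT = 9); they are not read from the driver headers, and per the reply above, leaving PGID 89 out of the loop is intentional because OBEY_VLAN has no meaning there.

```c
#include <assert.h>
#include <stdbool.h>

/* Values as quoted in the review; not taken from the driver headers. */
#define PGID_ENTRIES 89
#define PGID_SRC     80
#define CPU_PORT      9

/* True if "for (int i = 0; i < PGID_ENTRIES; ++i)" writes index idx. */
static bool pgid_loop_covers(int idx)
{
	return idx >= 0 && idx < PGID_ENTRIES;
}

/* Source PGID for a given port. */
static int pgid_src_index(int port)
{
	return PGID_SRC + port;
}
```

The CPU port's source PGID (89) falls just past the loop bound, while every front port's source PGID is covered; a short comment at the loop stating that this is deliberate would answer the question for future readers.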
> > > [ ... ]
> > > > + /* Multicast to all front ports */
> > > > + lan_wr(all_phys_ports, lan9645x, ANA_PGID(PGID_MC));
> > > > +
> > > > + /* IP multicast to all front ports */
> > > > + lan_wr(all_phys_ports, lan9645x, ANA_PGID(PGID_MCIPV4));
> > > > + lan_wr(all_phys_ports, lan9645x, ANA_PGID(PGID_MCIPV6));
> > > > +
> > > > + /* Unicast to all front ports */
> > > > + lan_wr(all_phys_ports, lan9645x, ANA_PGID(PGID_UC));
> > > > +
> > > > + /* Broadcast to all ports */
> > > > + lan_wr(BIT(CPU_PORT) | all_phys_ports, lan9645x, ANA_PGID(PGID_BC));
> > >
> > > PGID_BC includes BIT(CPU_PORT) and all_phys_ports (which includes the NPI
> > > port). Will this forward broadcast frames to both the CPU extraction queue
> > > and the NPI port's normal egress queue, causing duplicate frames for the host?
> > >
> > > Conversely, the multicast masks and PGID_UC exclude BIT(CPU_PORT). Does
> > > this cause them to bypass the CPU extraction queue entirely, thereby
> > > lacking the LONG extraction prefix and breaking the host's DSA tagger parsing?
> > >
> >
> > No, this is not how it works. Generally, when you configure the CPU port to use
> > an NPI port, the hardware manages this internally. You do not have to start
> > using the NPI port number all of a sudden.
>
> The comment is fair though. Why would you set all_phys_ports to
> GENMASK(lan9645x->num_phys_ports - 1, 0) when you can compute the mask
> of user ports which are enabled? It seems sloppy at best, and also
> contradictory (comments say "all front ports", but code includes the NPI
> port in this mask).
>
I will use dsa_user_ports(ds) as the base mask for the flood masks.
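To make the difference between the two masks concrete, here is a userspace sketch. genmask_low() mirrors GENMASK(h, 0), and user_ports_mask() is a hypothetical stand-in for dsa_user_ports() built from a per-port role table; the 9-port layout with the NPI port at index 8 and an unused port 3 is an assumption for illustration only.

```c
#include <assert.h>
#include <stdint.h>

/* Userspace equivalent of GENMASK(h, 0), limited to h < 31. */
static uint32_t genmask_low(unsigned int h)
{
	assert(h < 31);
	return (UINT32_C(1) << (h + 1)) - 1;
}

enum port_role { PORT_UNUSED, PORT_USER, PORT_CPU_NPI };

/* Hypothetical stand-in for dsa_user_ports(): one bit per user port. */
static uint32_t user_ports_mask(const enum port_role *roles, unsigned int n)
{
	uint32_t mask = 0;

	for (unsigned int i = 0; i < n; i++)
		if (roles[i] == PORT_USER)
			mask |= UINT32_C(1) << i;
	return mask;
}
```

With that layout, GENMASK(8, 0) yields 0x1FF and includes the NPI port (bit 8) as well as the unused port, while the user-port mask yields 0xF7, which matches the "all front ports" wording in the comments.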
> > > [ ... ]
> > > > +int lan9645x_port_setup(struct dsa_switch *ds, int port)
> > > > +{
> > > > + struct dsa_port *dp = dsa_to_port(ds, port);
> > > > + struct lan9645x *lan9645x = ds->priv;
> > > > + struct lan9645x_port *p;
> > > > +
> > > > + p = lan9645x_to_port(lan9645x, port);
> > > > +
> > > > + if (dp->dn) {
> > > > + p->rx_internal_delay =
> > > > + of_property_present(dp->dn, "rx-internal-delay-ps");
> > > > + p->tx_internal_delay =
> > > > + of_property_present(dp->dn, "tx-internal-delay-ps");
> > > > + }
> > >
> > > These are standard integer properties specifying delays in picoseconds. If
> > > a user explicitly disables the delay via devicetree using a value of 0,
> > > will of_property_present evaluate to true and enable the hardware delay
> > > anyway? Should of_property_read_u32 be used instead to check the value?
> >
> > A value of 0 is not allowed per the bindings. The bindings enforce that if this
> > is present the value must be 2000.
>
> Bindings may change. Please use of_property_read_u32().
>
I will switch to of_property_read_u32.
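The distinction the reviewer is drawing, presence versus value, fits in a few lines. The sketch uses a hypothetical struct in place of a devicetree node; the value-based helper follows the idea of reading the property (of_property_read_u32() returns 0 on success) and enabling the delay only for a nonzero value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of an optional u32 DT property; not the OF API. */
struct dt_u32_prop {
	bool present;
	uint32_t value;
};

/* Current patch: any presence of the property enables the delay. */
static bool delay_from_presence(const struct dt_u32_prop *p)
{
	return p->present;
}

/* Suggested approach: a failed read (property absent) means no delay;
 * otherwise enable the delay only for a nonzero picosecond value. */
static bool delay_from_value(const struct dt_u32_prop *p)
{
	if (!p->present)
		return false;
	return p->value != 0;
}
```

Both helpers agree for the values the current bindings allow (absent, or 2000); they only diverge for a value of 0, which is exactly the case a future binding change could introduce.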
* Re: [PATCH net-next] net: phy: fix a return path in get_phy_c45_ids()
From: Charles Perry @ 2026-04-08 14:48 UTC
To: Andrew Lunn
Cc: Charles Perry, netdev, Heiner Kallweit, Russell King,
David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Florian Fainelli, linux-kernel
In-Reply-To: <bcc4943d-137f-46d0-b5fd-385577c1259e@lunn.ch>
On Wed, Apr 08, 2026 at 03:48:22PM +0200, Andrew Lunn wrote:
> On Wed, Apr 08, 2026 at 06:31:44AM -0700, Charles Perry wrote:
> > The return value of phy_c45_probe_present() is stored in "ret", not
> > "phy_reg", fix this. "phy_reg" always has a positive value if we reach
> > this return path (since it would have returned earlier otherwise), which
> > means that the original goal of the patch of not considering -ENODEV
> > fatal wasn't achieved.
> >
> > Fixes: 17b447539408 ("net: phy: c45 scanning: Don't consider -ENODEV fatal")
> > Signed-off-by: Charles Perry <charles.perry@microchip.com>
>
> Thanks for fixing this.
>
> The Subject line should be [PATCH net] since this is a fix.
Ok, I'll resend this with the proper patch prefix.
Thanks,
Charles
>
> Otherwise:
>
> Reviewed-by: Andrew Lunn <andrew@lunn.ch>
>
> Andrew
* Re: [PATCH net-next v2 1/2] tcp: rehash onto different ECMP path on retransmit timeout
From: Neal Cardwell @ 2026-04-08 14:45 UTC
To: Eric Dumazet; +Cc: Neil Spring, netdev, davem, kuba
In-Reply-To: <CANn89i+6ZaTKZbQrjF_d0mbMkYYK4a5B_Neef9VsgpuVCsU6Ag@mail.gmail.com>
On Wed, Apr 8, 2026 at 3:13 AM Eric Dumazet <edumazet@google.com> wrote:
>
> On Wed, Apr 8, 2026 at 12:05 AM Neil Spring <ntspring@meta.com> wrote:
> >
> > Add sk_dst_reset() alongside sk_rethink_txhash() in the RTO, PLB,
> > and spurious-retrans paths so that the next transmit triggers a fresh
> > route lookup. Propagate sk_txhash into fl6->mp_hash in
> > inet6_csk_route_req() and inet6_csk_route_socket() so
> > fib6_select_path() uses the socket's current hash for ECMP selection.
> >
> > The ir_iif update in tcp_check_req() covers both IPv4 and IPv6
> > because it was cleaner than gating on address family; IPv4 is
> > otherwise unaltered, and not having autoflowlabel in IPv4 means
> > I wouldn't expect a new path on timeout.
> >
> > It is possible that PLB does not need this (that there are other
> > methods of reacting to local congestion); I added the sk_dst_reset
> > for consistency.
> >
> > Signed-off-by: Neil Spring <ntspring@meta.com>
For the next rev, I would suggest in the commit titles to rephrase
"ECMP" as "local ECMP" (e.g., rather than "tcp: rehash onto different
ECMP path on retransmit timeout" using "tcp: rehash onto different
local ECMP path on retransmit timeout").
Rationale: the existing Protective ReRoute (from 3acf3ec3f4b0f "tcp:
Change txhash on every SYN and RTO retransmit", and discussed in
https://dl.acm.org/doi/10.1145/3603269.3604867) already changes flow
labels in an attempt to change the ECMP/WCMP path in the middle of the
network. So I found the original titles (like "tcp: rehash onto
different ECMP path on retransmit timeout") a little confusing, since
they read a bit like they are saying they implement something that is
already implemented. :-)
Thanks,
neal
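For readers less familiar with the local side, a simplified model of how a per-socket hash picks among local next hops may help. It uses equal-weight hash-threshold selection (as described in RFC 2992) over a 31-bit hash space; it is a sketch of the idea, not fib6_select_path() itself, and ignores weights and nexthop liveness.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified equal-weight hash-threshold selection: the 31-bit hash space
 * [0, 2^31) is split into n_paths contiguous ranges and the path owning
 * the range that contains mp_hash is chosen. Model only. */
static unsigned int select_path(uint32_t mp_hash, unsigned int n_paths)
{
	const uint64_t space = UINT64_C(1) << 31;

	assert(n_paths > 0);
	mp_hash &= 0x7fffffff;
	/* Path i owns hashes in [i * space / n, (i + 1) * space / n). */
	return (unsigned int)(((uint64_t)mp_hash * n_paths) / space);
}
```

Because the choice depends only on the hash, rethinking sk_txhash and resetting the cached dst is what lets the next route lookup land on a different local next hop; flow-label changes, by contrast, steer hashing in routers further along the path.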
* Re: [Intel-wired-lan] [PATCH iwl-next 1/2] i40e: implement basic per-queue stats
From: Paolo Abeni @ 2026-04-08 14:44 UTC
To: Loktionov, Aleksandr, intel-wired-lan@lists.osuosl.org
Cc: Nguyen, Anthony L, Kitszel, Przemyslaw, Andrew Lunn,
David S. Miller, Eric Dumazet, Jakub Kicinski, Alexei Starovoitov,
Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend,
Stanislav Fomichev, netdev@vger.kernel.org
In-Reply-To: <IA3PR11MB89864FAAC7BCD8459A2BCD15E55BA@IA3PR11MB8986.namprd11.prod.outlook.com>
On 4/8/26 2:07 PM, Loktionov, Aleksandr wrote:
>> -----Original Message-----
>> From: Intel-wired-lan <intel-wired-lan-bounces@osuosl.org> On Behalf
>> Of Paolo Abeni
>> Sent: Wednesday, April 8, 2026 1:44 PM
>> To: intel-wired-lan@lists.osuosl.org
>> Cc: Nguyen, Anthony L <anthony.l.nguyen@intel.com>; Kitszel,
>> Przemyslaw <przemyslaw.kitszel@intel.com>; Andrew Lunn
>> <andrew+netdev@lunn.ch>; David S. Miller <davem@davemloft.net>; Eric
>> Dumazet <edumazet@google.com>; Jakub Kicinski <kuba@kernel.org>;
>> Alexei Starovoitov <ast@kernel.org>; Daniel Borkmann
>> <daniel@iogearbox.net>; Jesper Dangaard Brouer <hawk@kernel.org>; John
>> Fastabend <john.fastabend@gmail.com>; Stanislav Fomichev
>> <sdf@fomichev.me>; netdev@vger.kernel.org
>> Subject: [Intel-wired-lan] [PATCH iwl-next 1/2] i40e: implement basic
>> per-queue stats
>>
>> Only expose the counters currently available (bytes, packets); add
>> accounting for base stats to deal with ring clears.
>>
>> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
>> ---
>> drivers/net/ethernet/intel/i40e/i40e.h | 7 ++
>> drivers/net/ethernet/intel/i40e/i40e_main.c | 133 ++++++++++++++++++++
>> 2 files changed, 140 insertions(+)
>>
>> diff --git a/drivers/net/ethernet/intel/i40e/i40e.h b/drivers/net/ethernet/intel/i40e/i40e.h
>> index dcb50c2e1aa2..fe642c464e9c 100644
>> --- a/drivers/net/ethernet/intel/i40e/i40e.h
>> +++ b/drivers/net/ethernet/intel/i40e/i40e.h
>> @@ -836,16 +836,23 @@ struct i40e_vsi {
>> struct i40e_eth_stats eth_stats;
>> struct i40e_eth_stats eth_stats_offsets;
>> u64 tx_restart;
>
> ...
>
>> +static void i40e_zero_tx_ring_stats(struct netdev_queue_stats_tx *tx)
>> {
>> + tx->bytes = 0;
>> + tx->packets = 0;
>> + tx->stop = 0;
>> + tx->wake = 0;
>> + tx->hw_drops = 0;
>> +}
>> +
>> +static void i40e_add_tx_ring_stats(struct i40e_ring *tx_ring,
>> + struct netdev_queue_stats_tx *tx)
>> +{
>> + u64 bytes, packets;
>> + unsigned int start;
>> +
>> + do {
>> + start = u64_stats_fetch_begin(&tx_ring->syncp);
>> + bytes = tx_ring->stats.bytes;
>> + packets = tx_ring->stats.packets;
>> + } while (u64_stats_fetch_retry(&tx_ring->syncp, start));
>> +
>> + tx->bytes += bytes;
>> + tx->packets += packets;
>> +
>> + tx->stop += tx_ring->tx_stats.tx_stopped;
>> + tx->wake += tx_ring->tx_stats.restart_queue;
>> + tx->hw_drops += tx_ring->tx_stats.tx_busy;
>> +}
> Why are these reads outside the seqlock region?
> On 32-bit kernels, unprotected u64 reads can tear, IMHO.
Currently there is no seqlock on the write side; to keep the series
small, I preferred to avoid fixing the pre-existing issue. In any case, I
think moving stop, wake, hw_drops (and others) under seqlock protection
is an orthogonal change.
/P
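The point that a read-side retry loop only helps when writers also bump the sequence can be shown with a small userspace model of the u64_stats_sync idea. Names are hypothetical, a single writer per ring is assumed (as on a per-queue TX path), and the counters are relaxed atomics so that concurrent access would not be a C11 data race.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of seqcount-protected ring counters. */
struct stats_model {
	atomic_uint seq;	/* odd while a writer is updating */
	_Atomic uint64_t bytes;
	_Atomic uint64_t packets;
};

/* Writer side (single writer assumed): make seq odd before updating. */
static void stats_write_begin(struct stats_model *s)
{
	unsigned int v = atomic_load_explicit(&s->seq, memory_order_relaxed);

	atomic_store_explicit(&s->seq, v + 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);
}

/* Writer side: make seq even again, publishing the new values. */
static void stats_write_end(struct stats_model *s)
{
	unsigned int v = atomic_load_explicit(&s->seq, memory_order_relaxed);

	atomic_store_explicit(&s->seq, v + 1, memory_order_release);
}

static void stats_add(struct stats_model *s, uint64_t len)
{
	stats_write_begin(s);
	atomic_store_explicit(&s->bytes,
			      atomic_load_explicit(&s->bytes, memory_order_relaxed) + len,
			      memory_order_relaxed);
	atomic_store_explicit(&s->packets,
			      atomic_load_explicit(&s->packets, memory_order_relaxed) + 1,
			      memory_order_relaxed);
	stats_write_end(s);
}

/* One read attempt; false means a writer was active or raced with us,
 * and the caller retries (the do/while loop in the patch). */
static bool stats_try_read(struct stats_model *s, uint64_t *bytes,
			   uint64_t *packets)
{
	unsigned int start = atomic_load_explicit(&s->seq, memory_order_acquire);

	if (start & 1)
		return false;
	*bytes = atomic_load_explicit(&s->bytes, memory_order_relaxed);
	*packets = atomic_load_explicit(&s->packets, memory_order_relaxed);
	atomic_thread_fence(memory_order_acquire);
	return atomic_load_explicit(&s->seq, memory_order_relaxed) == start;
}
```

A counter updated without stats_write_begin()/stats_write_end() gains nothing from being read inside the retry loop, which is why moving the stop/wake/hw_drops reads alone would not address the 32-bit tearing concern.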
* [PATCH net-next v2] ipv6: move IFA_F_PERMANENT percpu allocation in process scope
From: Paolo Abeni @ 2026-04-08 14:36 UTC
To: netdev
Cc: David S. Miller, David Ahern, Eric Dumazet, Jakub Kicinski,
Simon Horman
Observed at boot time:
CPU: 43 UID: 0 PID: 3595 Comm: (t-daemon) Not tainted 6.12.0 #1
Call Trace:
<TASK>
dump_stack_lvl+0x4e/0x70
pcpu_alloc_noprof.cold+0x1f/0x4b
fib_nh_common_init+0x4c/0x110
fib6_nh_init+0x387/0x740
ip6_route_info_create+0x46d/0x640
addrconf_f6i_alloc+0x13b/0x180
addrconf_permanent_addr+0xd0/0x220
addrconf_notify+0x93/0x540
notifier_call_chain+0x5a/0xd0
__dev_notify_flags+0x5c/0xf0
dev_change_flags+0x54/0x70
do_setlink+0x36c/0xce0
rtnl_setlink+0x11f/0x1d0
rtnetlink_rcv_msg+0x142/0x3f0
netlink_rcv_skb+0x50/0x100
netlink_unicast+0x242/0x390
netlink_sendmsg+0x21b/0x470
__sys_sendto+0x1dc/0x1f0
__x64_sys_sendto+0x24/0x30
do_syscall_64+0x7d/0x160
entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7f5c3852f127
Code: 0c 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b8 0f 1f 00 f3 0f 1e fa 80 3d 85 ef 0c 00 00 41 89 ca 74 10 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 71 c3 55 48 83 ec 30 44 89 4c 24 2c 4c 89 44
RSP: 002b:00007ffe86caf4c8 EFLAGS: 00000202 ORIG_RAX: 000000000000002c
RAX: ffffffffffffffda RBX: 0000556c5cd93210 RCX: 00007f5c3852f127
RDX: 0000000000000020 RSI: 0000556c5cd938b0 RDI: 0000000000000003
RBP: 00007ffe86caf5a0 R08: 00007ffe86caf4e0 R09: 0000000000000080
R10: 0000000000000000 R11: 0000000000000202 R12: 0000556c5cd932d0
R13: 00000000021d05d1 R14: 00000000021d05d1 R15: 0000000000000001
IFA_F_PERMANENT addresses require the allocation of a bunch of percpu
pointers, currently in atomic scope.
Similar to commit 51454ea42c1a ("ipv6: fix locking issues with loops
over idev->addr_list"), move fixup_permanent_addr() outside the
&idev->lock scope, and do the allocations with GFP_KERNEL. With this
change, fixup_permanent_addr() is invoked with BH enabled, so the ifp
lock acquired there needs the BH variant.
Note that we don't need to acquire a reference to the permanent
addresses before releasing the mentioned write lock, because
addrconf_permanent_addr() runs under RTNL and ifa removal always happens
under RTNL, too.
Also the PERMANENT flag is constant in the relevant scope, as it can be
cleared only by inet6_addr_modify() under the RTNL lock.
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
v1 -> v2:
- rebased on top of "ipv6: prevent possible UaF in
addrconf_permanent_addr()"
v1: https://lore.kernel.org/netdev/d972e1e2-7090-47bd-988a-1ea854cbfb42@kernel.org/
explicitly targeting net-next as this is IMHO an improvement more than a
bug fix
---
net/ipv6/addrconf.c | 31 +++++++++++++++++++------------
1 file changed, 19 insertions(+), 12 deletions(-)
diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index dd0b4d80e0f8..77c77e843c96 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -3585,15 +3585,15 @@ static int fixup_permanent_addr(struct net *net,
struct fib6_info *f6i, *prev;
f6i = addrconf_f6i_alloc(net, idev, &ifp->addr, false,
- GFP_ATOMIC, NULL);
+ GFP_KERNEL, NULL);
if (IS_ERR(f6i))
return PTR_ERR(f6i);
/* ifp->rt can be accessed outside of rtnl */
- spin_lock(&ifp->lock);
+ spin_lock_bh(&ifp->lock);
prev = ifp->rt;
ifp->rt = f6i;
- spin_unlock(&ifp->lock);
+ spin_unlock_bh(&ifp->lock);
fib6_info_release(prev);
}
@@ -3601,7 +3601,7 @@ static int fixup_permanent_addr(struct net *net,
if (!(ifp->flags & IFA_F_NOPREFIXROUTE)) {
addrconf_prefix_route(&ifp->addr, ifp->prefix_len,
ifp->rt_priority, idev->dev, 0, 0,
- GFP_ATOMIC);
+ GFP_KERNEL);
}
if (ifp->state == INET6_IFADDR_STATE_PREDAD)
@@ -3612,29 +3612,36 @@ static int fixup_permanent_addr(struct net *net,
static void addrconf_permanent_addr(struct net *net, struct net_device *dev)
{
- struct inet6_ifaddr *ifp, *tmp;
+ struct inet6_ifaddr *ifp;
+ LIST_HEAD(tmp_addr_list);
struct inet6_dev *idev;
+ /* Mutual exclusion with other if_list_aux users. */
+ ASSERT_RTNL();
+
idev = __in6_dev_get(dev);
if (!idev)
return;
write_lock_bh(&idev->lock);
+ list_for_each_entry(ifp, &idev->addr_list, if_list) {
+ if (ifp->flags & IFA_F_PERMANENT)
+ list_add_tail(&ifp->if_list_aux, &tmp_addr_list);
+ }
+ write_unlock_bh(&idev->lock);
- list_for_each_entry_safe(ifp, tmp, &idev->addr_list, if_list) {
- if ((ifp->flags & IFA_F_PERMANENT) &&
- fixup_permanent_addr(net, idev, ifp) < 0) {
- write_unlock_bh(&idev->lock);
+ while (!list_empty(&tmp_addr_list)) {
+ ifp = list_first_entry(&tmp_addr_list,
+ struct inet6_ifaddr, if_list_aux);
+ list_del(&ifp->if_list_aux);
+ if (fixup_permanent_addr(net, idev, ifp) < 0) {
net_info_ratelimited("%s: Failed to add prefix route for address %pI6c; dropping\n",
idev->dev->name, &ifp->addr);
in6_ifa_hold(ifp);
ipv6_del_addr(ifp);
- write_lock_bh(&idev->lock);
}
}
-
- write_unlock_bh(&idev->lock);
}
static int addrconf_notify(struct notifier_block *this, unsigned long event,
--
2.53.0
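The collect-then-process pattern used by the addrconf_permanent_addr() change above can be sketched in userspace. The structs, the atomic_flag spinlock standing in for idev->lock, and fixup_model() are all hypothetical; the sketch assumes, as the commit message argues for RTNL, that some outer serialization keeps entries from being removed once the inner lock is dropped.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical address entry with two linkages, like if_list (main list,
 * lock-protected) and if_list_aux (private scratch list). */
struct addr_model {
	bool permanent;
	bool fixed_up;
	struct addr_model *next;
	struct addr_model *aux_next;
};

struct idev_model {
	atomic_flag lock;	/* stands in for idev->lock */
	struct addr_model *addr_list;
};

static void model_lock(atomic_flag *l)
{
	while (atomic_flag_test_and_set_explicit(l, memory_order_acquire))
		;	/* spin */
}

static void model_unlock(atomic_flag *l)
{
	atomic_flag_clear_explicit(l, memory_order_release);
}

/* Placeholder for work that may sleep, e.g. a GFP_KERNEL allocation. */
static int fixup_model(struct addr_model *a)
{
	a->fixed_up = true;
	return 0;
}

/* Returns the number of entries successfully processed. */
static int process_permanent(struct idev_model *idev)
{
	struct addr_model *head = NULL, **tail = &head;
	int n = 0;

	/* Phase 1: with the lock held, only collect matching entries. */
	model_lock(&idev->lock);
	for (struct addr_model *a = idev->addr_list; a; a = a->next) {
		if (a->permanent) {
			a->aux_next = NULL;
			*tail = a;
			tail = &a->aux_next;
		}
	}
	model_unlock(&idev->lock);

	/* Phase 2: lock released, so sleeping work is allowed here. */
	while (head) {
		struct addr_model *a = head;

		head = a->aux_next;
		if (fixup_model(a) == 0)
			n++;
	}
	return n;
}
```

The scratch linkage is what allows the main list to stay untouched while the lock is dropped, which is also why the patch asserts RTNL: two concurrent users of if_list_aux would corrupt each other's scratch list.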
* [PATCH net] can: raw: fix ro->uniq use-after-free in raw_rcv()
From: Sam P @ 2026-04-08 14:30 UTC
To: netdev; +Cc: socketcan, mkl, linux-kernel, linux-can
raw_release() unregisters raw CAN receive filters via can_rx_unregister(),
but receiver deletion is deferred with call_rcu(). This leaves a window
where raw_rcv() may still be running in an RCU read-side critical section
after raw_release() frees ro->uniq, leading to a use-after-free of the
percpu uniq storage.
Move free_percpu(ro->uniq) out of raw_release() and into a raw-specific
socket destructor. can_rx_unregister() takes an extra reference to the
socket and only drops it from the RCU callback, so freeing uniq from
sk_destruct ensures the percpu area is not released until the relevant
callbacks have drained.
Fixes: 514ac99c64b2 ("can: fix multiple delivery of a single CAN frame for overlapping CAN filters")
Cc: stable@vger.kernel.org # v4.1+
Assisted-by: Bynario AI
Signed-off-by: Samuel Page <sam@bynar.io>
---
net/can/raw.c | 11 ++++++++++-
1 file changed, 10 insertions(+), 1 deletion(-)
diff --git a/net/can/raw.c b/net/can/raw.c
index eee244ffc31e..f042c4316890 100644
--- a/net/can/raw.c
+++ b/net/can/raw.c
@@ -361,6 +361,14 @@ static int raw_notifier(struct notifier_block *nb, unsigned long msg,
return NOTIFY_DONE;
}
+static void raw_sock_destruct(struct sock *sk)
+{
+ struct raw_sock *ro = raw_sk(sk);
+
+ free_percpu(ro->uniq);
+ can_sock_destruct(sk);
+}
+
static int raw_init(struct sock *sk)
{
struct raw_sock *ro = raw_sk(sk);
@@ -387,6 +395,8 @@ static int raw_init(struct sock *sk)
if (unlikely(!ro->uniq))
return -ENOMEM;
+ sk->sk_destruct = raw_sock_destruct;
+
/* set notifier */
spin_lock(&raw_notifier_lock);
list_add_tail(&ro->notifier, &raw_notifier_list);
@@ -436,7 +446,6 @@ static int raw_release(struct socket *sock)
ro->bound = 0;
ro->dev = NULL;
ro->count = 0;
- free_percpu(ro->uniq);
sock_orphan(sk);
sock->sk = NULL;
--
2.49.0
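The lifetime rule the patch relies on (free shared per-socket state from the destructor that runs when the last reference is dropped, not from release()) can be modelled in a few lines. This is a single-threaded sketch with hypothetical names and a plain int refcount; it only illustrates the ordering, not the CAN or RCU code.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical model: uniq stands in for the percpu ro->uniq that a
 * receive callback may still read after release(). */
struct sock_model {
	int refcnt;
	int *uniq;
};

/* Runs only when the last reference goes away (like sk_destruct). */
static void sock_model_destruct(struct sock_model *sk)
{
	free(sk->uniq);
	sk->uniq = NULL;
}

static void sock_model_put(struct sock_model *sk)
{
	if (--sk->refcnt == 0)
		sock_model_destruct(sk);
}

/* release(): drop the owner's reference but do not free uniq here,
 * because a deferred callback may still hold a reference. */
static void sock_model_release(struct sock_model *sk)
{
	sock_model_put(sk);
}
```

With one reference held by the owner and one by a pending deferred callback, uniq survives release() and is freed only once that callback drops its reference.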
* Re: [PATCH 3/4] drm/drm_ras: Add DRM RAS netlink error event notification
From: Tauro, Riana @ 2026-04-08 14:29 UTC
To: Raag Jadav, aravind.iddamsetty, rodrigo.vivi
Cc: intel-xe, dri-devel, netdev, anshuman.gupta, joonas.lahtinen,
simona.vetter, airlied, pratik.bari, joshua.santosh.ranjan,
ashwin.kumar.kulkarni, shubham.kumar, ravi.kishore.koppuravuri,
anvesh.bakwad, maarten.lankhorst, Zack McKevitt, Lijo Lazar,
Hawking Zhang, David S. Miller, Paolo Abeni, Eric Dumazet,
Jakub Kicinski
In-Reply-To: <acPjxf5XTYuR7sbM@black.igk.intel.com>
On 3/25/2026 7:01 PM, Raag Jadav wrote:
> On Wed, Mar 11, 2026 at 03:59:17PM +0530, Riana Tauro wrote:
>> Add support for asynchronous error notifications in drm_ras.
> It's either drm_ras or DRM RAS, make it consistent in all patches
> (both commit message and subject).
Sure.
>
>> Define a new `error-event` netlink event and a new multicast
>> group `error-notify` in drm_ras spec. Each event contains
>> a node-id and error-id to identify the type and source
>> of error.
>>
>> Add drm_ras_error_notify() to trigger this event from drivers.
>> Userspace can receive this event by subscribing to the
>> multicast group error-notify.
>>
>> Example: Using ynl tool
> Ditto. Either Usage or Example, make it consistent in all patches.
>
> Also, please utilize the full 75 character space where possible.
Will fix.
>
>> $ sudo ynl --family drm_ras --subscribe error-notify
>>
>> Cc: Jakub Kicinski <kuba@kernel.org>
>> Cc: Zack McKevitt <zachary.mckevitt@oss.qualcomm.com>
>> Cc: Lijo Lazar <lijo.lazar@amd.com>
>> Cc: Hawking Zhang <Hawking.Zhang@amd.com>
>> Cc: David S. Miller <davem@davemloft.net>
>> Cc: Paolo Abeni <pabeni@redhat.com>
>> Cc: Eric Dumazet <edumazet@google.com>
>> Signed-off-by: Riana Tauro <riana.tauro@intel.com>
>> ---
>> Documentation/gpu/drm-ras.rst | 9 +++++
>> Documentation/netlink/specs/drm_ras.yaml | 14 +++++++
>> drivers/gpu/drm/drm_ras.c | 48 ++++++++++++++++++++++++
>> drivers/gpu/drm/drm_ras_nl.c | 6 +++
>> drivers/gpu/drm/drm_ras_nl.h | 4 ++
>> include/drm/drm_ras.h | 2 +
>> include/uapi/drm/drm_ras.h | 3 ++
>> 7 files changed, 86 insertions(+)
>>
>> diff --git a/Documentation/gpu/drm-ras.rst b/Documentation/gpu/drm-ras.rst
>> index 4636e68f5678..09b2918f67bd 100644
>> --- a/Documentation/gpu/drm-ras.rst
>> +++ b/Documentation/gpu/drm-ras.rst
>> @@ -54,6 +54,8 @@ User space tools can:
>> ``node-id`` and ``error-id`` as parameters.
>> * Clear specific error counters with the ``clear-error-counter`` command, using both
>> ``node-id`` and ``error-id`` as parameters.
>> +* Listen to ``error-event`` notifications for error events by subscribing to the
>> + ``error-notify`` multicast group.
>>
>> YAML-based Interface
>> --------------------
>> @@ -109,3 +111,10 @@ Example: Clear an error counter for a given node
>>
>> sudo ynl --family drm_ras --do clear-error-counter --json '{"node-id":0, "error-id":1}'
>> None
>> +
>> +Example: Listen to error events
>> +
>> +.. code-block:: bash
>> +
>> + sudo ynl --family drm_ras --subscribe error-notify
>> + {'msg': {'error-id': 1, 'node-id': 1}, 'name': 'error-event'}
> Can we also have error-name and node-name? I'd be pulling my hair off
> if I need to remember all the ids.
Yeah makes sense. We can add the node_name, error_name.
Adding device_name would also be more useful in the event.
@Rodrigo/@aravind thoughts?
>
> On that note, I think it'll be good to have them as part of request
> attributes as an alternative to ids (also for existing commands), but
> that can be done as a follow-up.
>
We cannot use names as an alternative because it won't work for multiple cards.
Example in xe: suppose there are 2 cards and each has 2 nodes. We cannot
query using node_name+error_name.
Also, most netlink implementations use IDs as unique identifiers.
$ sudo ./cli.py --family drm_ras --dump list-nodes
[{'device-name': 'bdf_1', 'node-id': 0, 'node-name':
'correctable-errors', 'node-type': 'error-counter'},
{'device-name': 'bdf_1', 'node-id': 1, 'node-name':
'uncorrectable-errors', 'node-type': 'error-counter'},
{'device-name': 'bdf_2', 'node-id': 2, 'node-name':
'correctable-errors', 'node-type': 'error-counter'},
{'device-name': 'bdf_2', 'node-id': 3, 'node-name':
'uncorrectable-errors', 'node-type': 'error-counter'}]
>
> Also, what if I have multiple devices with multiple nodes. Do they need
> separate subscription?
>
No, we subscribe only to the group not the nodes. In this case the group
is 'error-notify'
$ sudo ./cli.py --family drm_ras --subscribe error-notify
{'msg': {'error-id': 1, 'node-id': 1}, 'name': 'error-event'}
{'msg': {'error-id': 1, 'node-id': 3}, 'name': 'error-event'}
Thanks
Riana
>
> Raag
* Re: [Intel-wired-lan] [PATCH iwl-next v2] ice: call netif_keep_dst() once when entering switchdev mode
From: Paul Menzel @ 2026-04-08 14:27 UTC
To: Aleksandr Loktionov
Cc: intel-wired-lan, anthony.l.nguyen, netdev, Marcin Szycik
In-Reply-To: <20260408141429.2798589-1-aleksandr.loktionov@intel.com>
Dear Aleksandr, dear Marcin,
Thank you for the patch.
Am 08.04.26 um 16:14 schrieb Aleksandr Loktionov:
> From: Marcin Szycik <marcin.szycik@intel.com>
>
> netif_keep_dst() only needs to be called once for the uplink VSI, not
> once for each port representor. Move it from ice_eswitch_setup_repr()
> to ice_eswitch_enable_switchdev().
It’d be great if you could share the commands to verify your change.
> Fixes: defd52455aee ("ice: do Tx through PF netdev in slow-path")
> Signed-off-by: Marcin Szycik <marcin.szycik@intel.com>
> Signed-off-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
> ---
> v1 -> v2:
> - Verified Fixes: tag via bisect - defd52455aee introduced the redundant
> per-repr call to netif_keep_dst(uplink_vsi->netdev) by changing the
> target netdev to the uplink VSI inside the per-representor setup
> function. Before that commit, each call was on a distinct repr->netdev
> so no Fixes: predating it applies.
>
> drivers/net/ethernet/intel/ice/ice_eswitch.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/net/ethernet/intel/ice/ice_eswitch.c b/drivers/net/ethernet/intel/ice/ice_eswitch.c
> index 2e4f096..c30e27b 100644
> --- a/drivers/net/ethernet/intel/ice/ice_eswitch.c
> +++ b/drivers/net/ethernet/intel/ice/ice_eswitch.c
> @@ -117,8 +117,6 @@ static int ice_eswitch_setup_repr(struct ice_pf *pf, struct ice_repr *repr)
> if (!repr->dst)
> return -ENOMEM;
>
> - netif_keep_dst(uplink_vsi->netdev);
> -
> dst = repr->dst;
> dst->u.port_info.port_id = vsi->vsi_num;
> dst->u.port_info.lower_dev = uplink_vsi->netdev;
> @@ -312,6 +310,8 @@ static int ice_eswitch_enable_switchdev(struct ice_pf *pf)
> if (ice_eswitch_br_offloads_init(pf))
> goto err_br_offloads;
>
> + netif_keep_dst(uplink_vsi->netdev);
> +
> pf->eswitch.is_running = true;
>
> return 0;
Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de>
Kind regards,
Paul
* [PATCH iwl-next v2] ice: call netif_keep_dst() once when entering switchdev mode
From: Aleksandr Loktionov @ 2026-04-08 14:14 UTC
To: intel-wired-lan, anthony.l.nguyen, aleksandr.loktionov
Cc: netdev, Marcin Szycik
From: Marcin Szycik <marcin.szycik@intel.com>
netif_keep_dst() only needs to be called once for the uplink VSI, not
once for each port representor. Move it from ice_eswitch_setup_repr()
to ice_eswitch_enable_switchdev().
Fixes: defd52455aee ("ice: do Tx through PF netdev in slow-path")
Signed-off-by: Marcin Szycik <marcin.szycik@intel.com>
Signed-off-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
---
v1 -> v2:
- Verified Fixes: tag via bisect - defd52455aee introduced the redundant
per-repr call to netif_keep_dst(uplink_vsi->netdev) by changing the
target netdev to the uplink VSI inside the per-representor setup
function. Before that commit, each call was on a distinct repr->netdev
so no Fixes: predating it applies.
drivers/net/ethernet/intel/ice/ice_eswitch.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/net/ethernet/intel/ice/ice_eswitch.c b/drivers/net/ethernet/intel/ice/ice_eswitch.c
index 2e4f096..c30e27b 100644
--- a/drivers/net/ethernet/intel/ice/ice_eswitch.c
+++ b/drivers/net/ethernet/intel/ice/ice_eswitch.c
@@ -117,8 +117,6 @@ static int ice_eswitch_setup_repr(struct ice_pf *pf, struct ice_repr *repr)
if (!repr->dst)
return -ENOMEM;
- netif_keep_dst(uplink_vsi->netdev);
-
dst = repr->dst;
dst->u.port_info.port_id = vsi->vsi_num;
dst->u.port_info.lower_dev = uplink_vsi->netdev;
@@ -312,6 +310,8 @@ static int ice_eswitch_enable_switchdev(struct ice_pf *pf)
if (ice_eswitch_br_offloads_init(pf))
goto err_br_offloads;
+ netif_keep_dst(uplink_vsi->netdev);
+
pf->eswitch.is_running = true;
return 0;
--
2.52.0
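Why the move is behaviour-preserving can be sketched with a model of the call: netif_keep_dst() clears the dst-release flags on a netdev, so calling it once per representor on the same uplink netdev has the same effect as calling it once. The flag values and struct below are hypothetical stand-ins, not the real IFF_* definitions.

```c
#include <assert.h>

/* Hypothetical stand-ins for IFF_XMIT_DST_RELEASE and
 * IFF_XMIT_DST_RELEASE_PERM; values are illustrative only. */
#define MODEL_XMIT_DST_RELEASE      (1u << 0)
#define MODEL_XMIT_DST_RELEASE_PERM (1u << 1)

struct netdev_model {
	unsigned int priv_flags;
};

/* Model of netif_keep_dst(): clears both release flags; idempotent. */
static void keep_dst_model(struct netdev_model *dev)
{
	dev->priv_flags &= ~(MODEL_XMIT_DST_RELEASE |
			     MODEL_XMIT_DST_RELEASE_PERM);
}

/* Before: called once per representor, always on the uplink netdev. */
static void setup_reprs_per_repr(struct netdev_model *uplink, int nr_reprs)
{
	for (int i = 0; i < nr_reprs; i++)
		keep_dst_model(uplink);
}

/* After: called once when switchdev mode is enabled. */
static void enable_switchdev_once(struct netdev_model *uplink)
{
	keep_dst_model(uplink);
}
```

One way to verify the change on hardware would be to compare the uplink's behaviour with one and several VF representors before and after the patch; the model only shows that the resulting flag state is identical.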