* [PATCH net-next v4 12/13] misc: lan966x-pci: dts: extend cpu reg to cover PCIE DBI space
From: Daniel Machon @ 2026-05-08 7:35 UTC (permalink / raw)
To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Horatiu Vultur, Steen Hegelund, UNGLinuxDriver,
Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
John Fastabend, Stanislav Fomichev, Herve Codina, Arnd Bergmann,
Greg Kroah-Hartman, Mohsin Bashir
Cc: netdev, linux-kernel, bpf, linux-arm-kernel
In-Reply-To: <20260508-lan966x-pci-fdma-v4-0-14e0c89d8d63@microchip.com>
The ATU outbound windows used by the FDMA engine are programmed through
registers at offset 0x400000+, which falls outside the current cpu reg
mapping. Extend the cpu reg size from 0x100000 (1MB) to 0x800000 (8MB)
to cover the full PCIE DBI and iATU register space.
Tested-by: Herve Codina <herve.codina@bootlin.com>
Signed-off-by: Daniel Machon <daniel.machon@microchip.com>
---
drivers/misc/lan966x_pci.dtso | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/misc/lan966x_pci.dtso b/drivers/misc/lan966x_pci.dtso
index 7b196b0a0eb6..7bb726550caf 100644
--- a/drivers/misc/lan966x_pci.dtso
+++ b/drivers/misc/lan966x_pci.dtso
@@ -135,7 +135,7 @@ lan966x_phy1: ethernet-lan966x_phy@2 {
switch: switch@e0000000 {
compatible = "microchip,lan966x-switch";
- reg = <0xe0000000 0x0100000>,
+ reg = <0xe0000000 0x0800000>,
<0xe2000000 0x0800000>;
reg-names = "cpu", "gcb";
--
2.34.1
^ permalink raw reply related
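The size change in patch 12/13 can be checked with a little arithmetic. The following is a minimal sketch, not driver code: the macro and helper names are hypothetical, and only the three numeric values (0x100000, 0x800000, 0x400000) come from the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Values quoted in the commit message; all names here are hypothetical. */
#define CPU_REG_SIZE_OLD 0x100000u /* 1MB, old "cpu" reg size */
#define CPU_REG_SIZE_NEW 0x800000u /* 8MB, new "cpu" reg size */
#define ATU_REGS_OFFSET  0x400000u /* where the ATU outbound window registers start */

/* True if a 32-bit register at @offset lies entirely inside a region of @size bytes. */
static bool reg_in_window(uint32_t offset, uint32_t size)
{
	return offset < size && size - offset >= sizeof(uint32_t);
}

/* Self-check: the ATU offset is out of range for the old size, in range for the new one. */
static void window_self_check(void)
{
	assert(!reg_in_window(ATU_REGS_OFFSET, CPU_REG_SIZE_OLD));
	assert(reg_in_window(ATU_REGS_OFFSET, CPU_REG_SIZE_NEW));
}
```

Calling window_self_check() from a small test program exercises both cases; it says nothing about how the driver itself maps or accesses the region.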
* [PATCH net-next v4 13/13] misc: lan966x-pci: dts: add fdma interrupt to overlay
From: Daniel Machon @ 2026-05-08 7:35 UTC (permalink / raw)
To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Horatiu Vultur, Steen Hegelund, UNGLinuxDriver,
Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
John Fastabend, Stanislav Fomichev, Herve Codina, Arnd Bergmann,
Greg Kroah-Hartman, Mohsin Bashir
Cc: netdev, linux-kernel, bpf, linux-arm-kernel
In-Reply-To: <20260508-lan966x-pci-fdma-v4-0-14e0c89d8d63@microchip.com>
Add the fdma interrupt (OIC interrupt 14) to the lan966x PCI device
tree overlay, enabling FDMA-based frame injection/extraction when
the switch is connected over PCIe.
Tested-by: Herve Codina <herve.codina@bootlin.com>
Signed-off-by: Daniel Machon <daniel.machon@microchip.com>
---
drivers/misc/lan966x_pci.dtso | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/misc/lan966x_pci.dtso b/drivers/misc/lan966x_pci.dtso
index 7bb726550caf..5bb12dbc0843 100644
--- a/drivers/misc/lan966x_pci.dtso
+++ b/drivers/misc/lan966x_pci.dtso
@@ -141,8 +141,9 @@ switch: switch@e0000000 {
interrupt-parent = <&oic>;
interrupts = <12 IRQ_TYPE_LEVEL_HIGH>,
+ <14 IRQ_TYPE_LEVEL_HIGH>,
<9 IRQ_TYPE_LEVEL_HIGH>;
- interrupt-names = "xtr", "ana";
+ interrupt-names = "xtr", "fdma", "ana";
resets = <&reset 0>;
reset-names = "switch";
--
2.34.1
^ permalink raw reply related
* Re: [PATCH net v7] net: Validate protocol in skb_steal_sock() for BPF-assigned sockets
From: Kuniyuki Iwashima @ 2026-05-08 7:44 UTC (permalink / raw)
To: Jiayuan Chen
Cc: netdev, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Simon Horman, Willem de Bruijn, Joe Stringer,
Martin KaFai Lau, Alexei Starovoitov, linux-kernel, bpf
In-Reply-To: <20260507104612.236253-1-jiayuan.chen@linux.dev>
On Thu, May 7, 2026 at 3:46 AM Jiayuan Chen <jiayuan.chen@linux.dev> wrote:
>
> bpf_sk_assign_tcp_reqsk() can assign a TCP reqsk to a non-TCP skb,
> causing a panic when the skb enters the wrong L4 receive path [1].
> An initial attempt tried to fix this in the BPF helper by checking
> iph->protocol, but Sashiko [2] revealed that BPF programs can bypass
> this check via a TOCTOU attack by modifying iph->protocol around the
> call:
>
> iph->protocol = IPPROTO_TCP;
> bpf_sk_assign_tcp_reqsk(udp_skb, tcp_sk);
> iph->protocol = IPPROTO_UDP;
>
> Furthermore, bpf_sk_assign() has had the same class of vulnerability
> since its introduction — it can assign any socket type to any skb
> without protocol validation. Since the BPF helper check alone cannot
> prevent a malicious BPF program from crashing the kernel, add protocol
> validation in skb_steal_sock() to reject mismatched sockets regardless
> of how they were assigned.
>
> The check is applied to all prefetched sockets. Early demux paths
> already only assign matching protocols (e.g., UDP early demux only
> assigns UDP sockets to UDP skbs), so they pass the check naturally and
> the extra branch is negligible.
>
> Pass the expected protocol from callers rather than extracting it from
> the IP header. For IPv6, walking extension headers to find the L4
> protocol is complex and unnecessary since each caller already knows
> the protocol it handles.
>
> [1] https://lore.kernel.org/bpf/20260403015851.148209-1-jiayuan.chen@linux.dev/
Why did you drop the selftest?
Also this patch should go to bpf.git instead of net.git.
> [2] https://sashiko.dev/#/patchset/20260403015851.148209-1-jiayuan.chen%40linux.dev
>
> Fixes: cf7fbe660f2d ("bpf: Add socket assign support")
> Fixes: e472f88891ab ("bpf: tcp: Support arbitrary SYN Cookie.")
> Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
> ---
> include/net/inet6_hashtables.h | 7 ++++---
> include/net/inet_hashtables.h | 7 ++++---
> include/net/request_sock.h | 16 +++++++++++++++-
> net/ipv4/udp.c | 2 +-
> net/ipv6/udp.c | 2 +-
> 5 files changed, 25 insertions(+), 9 deletions(-)
>
> diff --git a/include/net/inet6_hashtables.h b/include/net/inet6_hashtables.h
> index 2cc5d416bbb5..218498373a9c 100644
> --- a/include/net/inet6_hashtables.h
> +++ b/include/net/inet6_hashtables.h
> @@ -106,12 +106,13 @@ static inline
> struct sock *inet6_steal_sock(struct net *net, struct sk_buff *skb, int doff,
> const struct in6_addr *saddr, const __be16 sport,
> const struct in6_addr *daddr, const __be16 dport,
> - bool *refcounted, inet6_ehashfn_t *ehashfn)
> + bool *refcounted, inet6_ehashfn_t *ehashfn,
> + int protocol)
> {
> struct sock *sk, *reuse_sk;
> bool prefetched;
>
> - sk = skb_steal_sock(skb, refcounted, &prefetched);
> + sk = skb_steal_sock(skb, refcounted, &prefetched, protocol);
> if (!sk)
> return NULL;
>
> @@ -153,7 +154,7 @@ static inline struct sock *__inet6_lookup_skb(struct sk_buff *skb, int doff,
> struct sock *sk;
>
> sk = inet6_steal_sock(net, skb, doff, &ip6h->saddr, sport, &ip6h->daddr, dport,
> - refcounted, inet6_ehashfn);
> + refcounted, inet6_ehashfn, IPPROTO_TCP);
> if (IS_ERR(sk))
> return NULL;
> if (sk)
> diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
> index 6e2fe186d0dc..a2a044f93cc4 100644
> --- a/include/net/inet_hashtables.h
> +++ b/include/net/inet_hashtables.h
> @@ -446,12 +446,13 @@ static inline
> struct sock *inet_steal_sock(struct net *net, struct sk_buff *skb, int doff,
> const __be32 saddr, const __be16 sport,
> const __be32 daddr, const __be16 dport,
> - bool *refcounted, inet_ehashfn_t *ehashfn)
> + bool *refcounted, inet_ehashfn_t *ehashfn,
> + int protocol)
> {
> struct sock *sk, *reuse_sk;
> bool prefetched;
>
> - sk = skb_steal_sock(skb, refcounted, &prefetched);
> + sk = skb_steal_sock(skb, refcounted, &prefetched, protocol);
> if (!sk)
> return NULL;
>
> @@ -494,7 +495,7 @@ static inline struct sock *__inet_lookup_skb(struct sk_buff *skb,
> struct sock *sk;
>
> sk = inet_steal_sock(net, skb, doff, iph->saddr, sport, iph->daddr, dport,
> - refcounted, inet_ehashfn);
> + refcounted, inet_ehashfn, IPPROTO_TCP);
> if (IS_ERR(sk))
> return NULL;
> if (sk)
> diff --git a/include/net/request_sock.h b/include/net/request_sock.h
> index 5a9c826a7092..3469e4903aed 100644
> --- a/include/net/request_sock.h
> +++ b/include/net/request_sock.h
> @@ -89,9 +89,11 @@ static inline struct sock *req_to_sk(struct request_sock *req)
> * @skb: sk_buff to steal the socket from
> * @refcounted: is set to true if the socket is reference-counted
> * @prefetched: is set to true if the socket was assigned from bpf
> + * @protocol: expected L4 protocol
> */
> static inline struct sock *skb_steal_sock(struct sk_buff *skb,
> - bool *refcounted, bool *prefetched)
> + bool *refcounted, bool *prefetched,
> + int protocol)
> {
> struct sock *sk = skb->sk;
>
> @@ -103,6 +105,18 @@ static inline struct sock *skb_steal_sock(struct sk_buff *skb,
>
> *prefetched = skb_sk_is_prefetched(skb);
> if (*prefetched) {
> + /* A non-full socket here is either a reqsk or a
> + * timewait sock, both only contain sock_common and
> + * lack sk_protocol. Since both can only be TCP,
> + * use IPPROTO_TCP as the protocol.
> + */
> + if ((sk_fullsock(sk) ? sk->sk_protocol : IPPROTO_TCP) != protocol) {
Please add unlikely() here.
> + skb_orphan(skb);
> + *prefetched = false;
> + *refcounted = false;
> + return NULL;
> + }
> +
> #if IS_ENABLED(CONFIG_SYN_COOKIES)
> if (sk->sk_state == TCP_NEW_SYN_RECV && inet_reqsk(sk)->syncookie) {
> struct request_sock *req = inet_reqsk(sk);
> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> index 0ac2bf4f8759..ceb4d29a64ac 100644
> --- a/net/ipv4/udp.c
> +++ b/net/ipv4/udp.c
> @@ -2618,7 +2618,7 @@ int udp_rcv(struct sk_buff *skb)
> goto csum_error;
>
> sk = inet_steal_sock(net, skb, sizeof(struct udphdr), saddr, uh->source, daddr, uh->dest,
> - &refcounted, udp_ehashfn);
> + &refcounted, udp_ehashfn, IPPROTO_UDP);
> if (IS_ERR(sk))
> goto no_sk;
>
> diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
> index 15e032194ecc..d9c12cce5ace 100644
> --- a/net/ipv6/udp.c
> +++ b/net/ipv6/udp.c
> @@ -1106,7 +1106,7 @@ INDIRECT_CALLABLE_SCOPE int udpv6_rcv(struct sk_buff *skb)
>
> /* Check if the socket is already available, e.g. due to early demux */
> sk = inet6_steal_sock(net, skb, sizeof(struct udphdr), saddr, uh->source, daddr, uh->dest,
> - &refcounted, udp6_ehashfn);
> + &refcounted, udp6_ehashfn, IPPROTO_UDP);
> if (IS_ERR(sk))
> goto no_sk;
>
> --
> 2.43.0
>
^ permalink raw reply
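The protocol check discussed in this thread can be modelled in userspace. This is a minimal sketch under stated assumptions: struct mock_sock, the MOCK_IPPROTO_* constants, and prefetched_sock_matches() are hypothetical stand-ins, and only the decision rule (non-full sockets are treated as TCP, then compared with the protocol the caller expects) is taken from the quoted patch.

```c
#include <assert.h>
#include <stdbool.h>

/* IANA protocol numbers, redefined locally so the sketch has no kernel dependency. */
enum { MOCK_IPPROTO_TCP = 6, MOCK_IPPROTO_UDP = 17 };

/* Hypothetical stand-in for the two socket properties the check reads. */
struct mock_sock {
	bool fullsock; /* false for request and timewait sockets */
	int protocol;  /* only meaningful when fullsock is true */
};

/*
 * Mirror of the rule in the quoted patch: a non-full socket can only be
 * TCP, so treat it as such, then compare with what the caller expects.
 */
static bool prefetched_sock_matches(const struct mock_sock *sk, int expected)
{
	int proto = sk->fullsock ? sk->protocol : MOCK_IPPROTO_TCP;

	return proto == expected;
}

/* Self-check: a reqsk attached to a UDP skb is rejected by the UDP receive path. */
static void steal_sock_self_check(void)
{
	struct mock_sock reqsk = { .fullsock = false, .protocol = 0 };

	assert(!prefetched_sock_matches(&reqsk, MOCK_IPPROTO_UDP));
	assert(prefetched_sock_matches(&reqsk, MOCK_IPPROTO_TCP));
}
```

Because the expected protocol is passed in by the caller, the outcome does not depend on the IP header contents, which is what defeats the TOCTOU sequence shown in the commit message.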
* [PATCH] net: phy: DP83TC811: add reading of abilities
From: Sven Schuchmann @ 2026-05-08 7:49 UTC (permalink / raw)
To: Andrew Lunn, Heiner Kallweit, Russell King, David S. Miller,
Eric Dumazet, Jakub Kicinski, Paolo Abeni, netdev, linux-kernel
Cc: maxime.chevallier, Sven Schuchmann
The driver currently does not list any link modes it supports. For the
DP83TC811 this should be ETHTOOL_LINK_MODE_100baseT1_Full_BIT. Add the
missing get_features callback so that phylib reads the abilities.
Signed-off-by: Sven Schuchmann <schuchmann@schleissheimer.de>
Suggested-by: Andrew Lunn <andrew@lunn.ch>
---
drivers/net/phy/dp83tc811.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/drivers/net/phy/dp83tc811.c b/drivers/net/phy/dp83tc811.c
index e480c2a07450..6492b60d1f7d 100644
--- a/drivers/net/phy/dp83tc811.c
+++ b/drivers/net/phy/dp83tc811.c
@@ -392,6 +392,7 @@ static struct phy_driver dp83811_driver[] = {
/* PHY_BASIC_FEATURES */
.config_init = dp83811_config_init,
.config_aneg = dp83811_config_aneg,
+ .get_features = genphy_c45_pma_read_ext_abilities,
.soft_reset = dp83811_phy_reset,
.get_wol = dp83811_get_wol,
.set_wol = dp83811_set_wol,
--
2.43.0
^ permalink raw reply related
* Re: [PATCH] MAINTAINERS: change maintainers for macb Ethernet driver
From: Claudiu Beznea @ 2026-05-08 7:51 UTC (permalink / raw)
To: nicolas.ferre, theo.lebrun, conor.dooley, netdev
Cc: Alexandre Belloni, andrew+netdev, davem, edumazet, kuba, pabeni,
linux-arm-kernel, linux-kernel
In-Reply-To: <20260507120444.9733-1-nicolas.ferre@microchip.com>
On 5/7/26 15:04, nicolas.ferre@microchip.com wrote:
> From: Nicolas Ferre <nicolas.ferre@microchip.com>
>
> I would like to hand over the macb maintenance to Théo, as I'm unable to
> keep up with the recent flow of patches for this driver. After speaking
> with Claudiu, he indicated that he is in the same position as me.
> To help with this work, Conor has agreed to act as a reviewer.
>
> I was given responsibility for this driver years ago, and I'm glad to
> see it continue with talented developers.
>
> Signed-off-by: Nicolas Ferre <nicolas.ferre@microchip.com>
Acked-by: Claudiu Beznea <claudiu.beznea@tuxon.dev>
> ---
> MAINTAINERS | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 37b105a443dd..ad60104f30a1 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -4181,8 +4181,8 @@ F: include/uapi/linux/sonet.h
> F: net/atm/
>
> ATMEL MACB ETHERNET DRIVER
> -M: Nicolas Ferre <nicolas.ferre@microchip.com>
> -M: Claudiu Beznea <claudiu.beznea@tuxon.dev>
> +M: Théo Lebrun <theo.lebrun@bootlin.com>
> +R: Conor Dooley <conor.dooley@microchip.com>
> S: Maintained
> F: drivers/net/ethernet/cadence/
>
^ permalink raw reply
* Re: [PATCH net-next v7 0/2] selftests: openvswitch: add pop_vlan test
From: Minxi Hou @ 2026-05-08 7:55 UTC (permalink / raw)
To: netdev; +Cc: aconole, Minxi Hou
In-Reply-To: <20260507131541.2331771-1-houminxi@gmail.com>
Hi Aaron,
Just checking in on the v7 series. I addressed both items from your
v5 review:
- Removed the slot number comments from encap_ovskey nla_map,
keeping only comments on entries that differ from the base class.
- Dropped the explicit modprobe 8021q + sysfs pre-flight check,
relying on ip link add type vlan to handle module loading.
Is there anything else you'd like me to change, or does this look
ready to move forward?
Thanks,
Minxi
^ permalink raw reply
* Re: [PATCH net v2] rxrpc: Also unshare DATA/RESPONSE packets when paged frags are present
From: Jiayuan Chen @ 2026-05-08 7:58 UTC (permalink / raw)
To: Hyunwoo Kim, dhowells, marc.dionne, davem, edumazet, kuba, pabeni,
horms, qingfang.deng
Cc: linux-afs, netdev, stable
In-Reply-To: <af2F1FU5d4Q_Gn1W@v4bel>
On 5/8/26 2:42 PM, Hyunwoo Kim wrote:
> The DATA-packet handler in rxrpc_input_call_event() and the RESPONSE
> handler in rxrpc_verify_response() copy the skb to a linear one before
> calling into the security ops only when skb_cloned() is true. An skb
> that is not cloned but still carries paged fragments (skb->data_len != 0)
> falls through to the in-place decryption path, which binds the frag
> pages directly into the AEAD/skcipher SGL via skb_to_sgvec().
>
> Extend the gate so that any skb with non-linear data is also copied,
> ensuring the security handler always operates on a fully linear skb.
> The OOM/trace handling already in place is reused.
>
> Fixes: d0d5c0cd1e71 ("rxrpc: Use skb_unshare() rather than skb_cow_data()")
> Cc: stable@vger.kernel.org
> Signed-off-by: Hyunwoo Kim <imv4bel@gmail.com>
> ---
> Changes in v2:
> - Use skb_is_nonlinear() instead of skb->data_len
> - v1: https://lore.kernel.org/all/afKV2zGR6rrelPC7@v4bel/
> ---
> net/rxrpc/call_event.c | 2 +-
> net/rxrpc/conn_event.c | 2 +-
> 2 files changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/net/rxrpc/call_event.c b/net/rxrpc/call_event.c
> index fdd683261226..a6ad5ff6ec5f 100644
> --- a/net/rxrpc/call_event.c
> +++ b/net/rxrpc/call_event.c
> @@ -334,7 +334,7 @@ bool rxrpc_input_call_event(struct rxrpc_call *call)
>
> if (sp->hdr.type == RXRPC_PACKET_TYPE_DATA &&
> sp->hdr.securityIndex != 0 &&
> - skb_cloned(skb)) {
> + (skb_cloned(skb) || skb_is_nonlinear(skb))) {
> /* Unshare the packet so that it can be
> * modified by in-place decryption.
> */
> diff --git a/net/rxrpc/conn_event.c b/net/rxrpc/conn_event.c
> index a2130d25aaa9..632cbeff1f5d 100644
> --- a/net/rxrpc/conn_event.c
> +++ b/net/rxrpc/conn_event.c
> @@ -245,7 +245,7 @@ static int rxrpc_verify_response(struct rxrpc_connection *conn,
> {
> int ret;
>
> - if (skb_cloned(skb)) {
> + if (skb_cloned(skb) || skb_is_nonlinear(skb)) {
> /* Copy the packet if shared so that we can do in-place
> * decryption.
> */
Why not adopt the same gate as the ESP fix:
skb_cloned(skb) || skb_has_frag_list(skb) || skb_has_shared_frag(skb)
so NIC page_pool RX keeps its zero-copy path while still catching the
splice-loopback vector?
^ permalink raw reply
* Re: [PATCH net-next v5 3/5] veth: implement Byte Queue Limits (BQL) for latency reduction
From: Simon Schippers @ 2026-05-08 8:01 UTC (permalink / raw)
To: Jesper Dangaard Brouer, Paolo Abeni, netdev
Cc: kernel-team, Andrew Lunn, David S. Miller, Eric Dumazet,
Jakub Kicinski, Alexei Starovoitov, Daniel Borkmann,
John Fastabend, Stanislav Fomichev, linux-kernel, bpf
In-Reply-To: <3e43117f-356d-4086-a176-abd7fe2e6f0a@kernel.org>
On 5/7/26 22:45, Jesper Dangaard Brouer wrote:
>
>
> On 07/05/2026 22.12, Simon Schippers wrote:
>> On 5/7/26 21:09, Jesper Dangaard Brouer wrote:
>>>
>>>
>>> On 07/05/2026 16.46, Simon Schippers wrote:
>>>>
>>>>
>>>> On 5/7/26 16:34, Paolo Abeni wrote:
>>>>> On 5/7/26 8:54 AM, Simon Schippers wrote:
>>>>>> On 5/5/26 15:21, hawk@kernel.org wrote:
>>>>>>> @@ -928,9 +968,13 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
>>>>>>> }
>>>>>>> } else {
>>>>>>> /* ndo_start_xmit */
>>>>>>> - struct sk_buff *skb = ptr;
>>>>>>> + bool bql_charged = veth_ptr_is_bql(ptr);
>>>>>>> + struct sk_buff *skb = veth_ptr_to_skb(ptr);
>>>>>>> stats->xdp_bytes += skb->len;
>>>>>>> + if (peer_txq && bql_charged)
>>>>>>> + netdev_tx_completed_queue(peer_txq, 1, VETH_BQL_UNIT);
>>>>>>
>>>>>> In the discussion with Jonas [1], I left a comment explaining why I think
>>>>>> this doesn’t work.
>>>>>>
>>>
>>> I've experimented with doing the "completion" at NAPI-end in
>>> veth_poll(), but that resulted in BQL limit being 128 packets, which
>>> leads to bad latency results (not acceptable).
>>> (See detailed report later)
>>>
>>>
>>>>>> I still think first that adding an option to modify the hard-coded
>>>>>> VETH_RING_SIZE is the way to go.
>>>>>>
>>>
>>> Not against being able to modify VETH_RING_SIZE, but I don't think it is
>>> the solution here.
>>>
>>> The simple solution is to configure BQL limit_min:
>>> `/sys/class/net/<dev>/queues/tx-N/byte_queue_limits/limit_min`
>>>
>>> My experiments (below) find that limit_min=8 gives good performance.
>>> We can simply set the default to 8, as this still allows userspace to change
>>> it later if lower latency is preferred.
>>>
>>>>>> Thanks!
>>>>>>
>>>>>> [1] Link: https://lore.kernel.org/netdev/e8cdba04-aa9a-45c6-9807-8274b62920df@tu-dortmund.de/
>>>>>
>>>>> In the above discussion a 20% regression is reported, which IMHO can't
>>>>> be ignored. Still the tput figures in the data are extremely low,
>>>>> something is possibly off?!? I would expect a few Mpps with pktgen on
>>>>> top of veth, while the reported data is ~20-30Kpps.
>>>>>
>>>>> /P
>>>>>
>>>>
>>>> The ~20-30Kpps occur when thousands of iptables rules are applied and
>>>> an UDP userspace application is sending.
>>>>
>>>> And there is a 20% pktgen regression (no iptables rules applied).
>>>>
>>>
>>> The pktgen test is a little dubious/weird and Jonas had to modify pktgen
>>> to test this. John Fastabend added a config to pktgen that allows us
>>> to benchmark the egress qdisc path, which might be better to use.
>>> The samples/pktgen/pktgen_bench_xmit_mode_queue_xmit.sh is a demo usage.
>>>
>>> If redoing the tests, can you adjust limit_min to see the effect?
>>> /sys/class/net/<dev>/queues/tx-N/byte_queue_limits/limit_min
>>>
>>> A 20% throughput regression is of course too much, but I will
>>> remind us, that adding a qdisc will "cost" some overhead, that is a
>>> configuration choice. Our purpose here is to reduce bufferbloat and
>>> latency, not optimize for throughput.
>>>
>>>
>>>> I am pretty sure the reason is because the BQL limit is stuck at 2
>>>> packets (because the completed queue is always called with 1 packet
>>>> and not in a interrupt/timer with multiple packets...).
>>>>
>>>
>>> I've run a lot of experiments and had an AI write up a report on them, see attachment. The TL;DR is that the best performance vs latency tradeoff is defaulting BQL/DQL limit_min to 8 packets.
>>>
>>> I fear this patchset will stall forever, if we keep searching for a perfect solution without any overhead. The qdisc layer will be a baseline overhead. The limit=2 packets is actually the optimal darkbuffer queue size, but I acknowledge that this causes too many qdisc requeue events (leading to overhead). I suggest that I add another patch in V6, that defaults limit_min to 8 (separate patch to make it easier to revert/adjust later).
>>>
>>> I've talked with Jonas, and we want to experiment with different solutions to make BQL/DQL work better with virtual devices.
>>>
>>> This patchset helps our (production) use-case reduce mice-flow latency
>>> from approx 22ms to 1.3ms for latency under-load. Due to the consumer
>>> namespace being the bottleneck the requeue overhead is negligible in
>>> comparison.
>>>
>>> -Jesper
>>
>> First of all, thanks for your work; I really see the advantages of
>> avoiding bufferbloat :)
>>
>> But the key of the BQL algorithm, which is the *dynamic* adaptation of the
>> limit, is not working. Always calling netdev_completed_queue() with
>> 1 packet results in a static limit of 2 packets (as seen in Jonas'
>> measurements), which you force up to 8 packets.
>>
>> So in the end this patchset has the same effect as just setting
>> VETH_RING_SIZE to 8 (and giving an option to change this value).
>>
>
> I've coded up a time-based BQL implementation, see attachment.
> WDYT?
>
> --Jesper
>
A step in the right direction, but I dislike that you call
netdev_sent_queue() with at least 1 packet (never 0 packets).
I am not sure if it works, and I am not sure about the parameter.
I would propose doing it like other BQL implementations do
(for example usbnet for which I adapted BQL [1] :) ):
Call netdev_sent_queue() with n_bql in a periodic work. n_bql would
still be counted in veth_xdp_rcv() like you currently do (synchronized
with the work via ring.consumer_lock?).
The only weird thing that remains is that BQL's inflight != number of
packets in the ring and BQL's limit != "current ring size". Instead,
the BQL limit describes the maximum number of packets allowed between
calls to netdev_sent_queue(), which occur periodically at a roughly
fixed time interval.
I guess that could be fine, but it surely needs testing.
[1] Link: https://lore.kernel.org/netdev/20251106175615.26948-1-simon.schippers@tu-dortmund.de/
^ permalink raw reply
* [PATCH] ethtool: fix inverted memchr_inv condition in ethnl_bitmap32_not_zero()
From: Chenguang Zhao @ 2026-05-08 8:02 UTC (permalink / raw)
To: Andrew Lunn, Jakub Kicinski, David S. Miller, Eric Dumazet,
Paolo Abeni
Cc: Chenguang Zhao, Simon Horman, Maxime Chevallier, netdev
memchr_inv() returns non-NULL when a byte differs from the given value.
Return true in that case, not when the scanned words are all zero.
Signed-off-by: Chenguang Zhao <zhaochenguang@kylinos.cn>
---
net/ethtool/bitset.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/net/ethtool/bitset.c b/net/ethtool/bitset.c
index 8bb98d3ea3db..56b0c4867ed2 100644
--- a/net/ethtool/bitset.c
+++ b/net/ethtool/bitset.c
@@ -105,7 +105,7 @@ static bool ethnl_bitmap32_not_zero(const u32 *map, unsigned int start,
start_word++;
}
- if (!memchr_inv(map + start_word, '\0',
+ if (memchr_inv(map + start_word, '\0',
(end_word - start_word) * sizeof(u32)))
return true;
if (end % 32 == 0)
--
2.25.1
^ permalink raw reply related
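The return-value convention this patch relies on can be shown with a userspace stand-in. The sketch below is an illustration only: memchr_inv_sketch() and words_not_zero() are hypothetical names, and the loop is a simple byte-wise reimplementation rather than the kernel's optimized memchr_inv().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Userspace stand-in for memchr_inv(): return a pointer to the first byte
 * that differs from @c, or NULL if all @bytes bytes are equal to @c.
 */
static const void *memchr_inv_sketch(const void *start, int c, size_t bytes)
{
	const unsigned char *p = start;

	for (size_t i = 0; i < bytes; i++)
		if (p[i] != (unsigned char)c)
			return p + i;
	return NULL;
}

/* "Not zero" means a differing byte was found, i.e. a non-NULL result. */
static bool words_not_zero(const uint32_t *map, size_t nwords)
{
	return memchr_inv_sketch(map, 0, nwords * sizeof(uint32_t)) != NULL;
}

/* Self-check: an all-zero map is not "not zero"; a map with one bit set is. */
static void bitmap_self_check(void)
{
	uint32_t zeros[4] = { 0 };
	uint32_t one_set[4] = { 0, 0, 0x10, 0 };

	assert(!words_not_zero(zeros, 4));
	assert(words_not_zero(one_set, 4));
}
```

With the negated condition from before the patch, the same two inputs would produce the opposite answers, which is the inversion the subject line refers to.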
* Re: [PATCH net v2] rxrpc: Also unshare DATA/RESPONSE packets when paged frags are present
From: Hyunwoo Kim @ 2026-05-08 8:05 UTC (permalink / raw)
To: Jiayuan Chen
Cc: dhowells, marc.dionne, davem, edumazet, kuba, pabeni, horms,
qingfang.deng, linux-afs, netdev, stable, imv4bel
In-Reply-To: <6a1a50d1-9aa8-406d-90b1-4d5ca9fe0afb@linux.dev>
On Fri, May 08, 2026 at 03:58:34PM +0800, Jiayuan Chen wrote:
>
> On 5/8/26 2:42 PM, Hyunwoo Kim wrote:
> > The DATA-packet handler in rxrpc_input_call_event() and the RESPONSE
> > handler in rxrpc_verify_response() copy the skb to a linear one before
> > calling into the security ops only when skb_cloned() is true. An skb
> > that is not cloned but still carries paged fragments (skb->data_len != 0)
> > falls through to the in-place decryption path, which binds the frag
> > pages directly into the AEAD/skcipher SGL via skb_to_sgvec().
> >
> > Extend the gate so that any skb with non-linear data is also copied,
> > ensuring the security handler always operates on a fully linear skb.
> > The OOM/trace handling already in place is reused.
> >
> > Fixes: d0d5c0cd1e71 ("rxrpc: Use skb_unshare() rather than skb_cow_data()")
> > Cc: stable@vger.kernel.org
> > Signed-off-by: Hyunwoo Kim <imv4bel@gmail.com>
> > ---
> > Changes in v2:
> > - Use skb_is_nonlinear() instead of skb->data_len
> > - v1: https://lore.kernel.org/all/afKV2zGR6rrelPC7@v4bel/
> > ---
> > net/rxrpc/call_event.c | 2 +-
> > net/rxrpc/conn_event.c | 2 +-
> > 2 files changed, 2 insertions(+), 2 deletions(-)
> >
> > diff --git a/net/rxrpc/call_event.c b/net/rxrpc/call_event.c
> > index fdd683261226..a6ad5ff6ec5f 100644
> > --- a/net/rxrpc/call_event.c
> > +++ b/net/rxrpc/call_event.c
> > @@ -334,7 +334,7 @@ bool rxrpc_input_call_event(struct rxrpc_call *call)
> > if (sp->hdr.type == RXRPC_PACKET_TYPE_DATA &&
> > sp->hdr.securityIndex != 0 &&
> > - skb_cloned(skb)) {
> > + (skb_cloned(skb) || skb_is_nonlinear(skb))) {
> > /* Unshare the packet so that it can be
> > * modified by in-place decryption.
> > */
> > diff --git a/net/rxrpc/conn_event.c b/net/rxrpc/conn_event.c
> > index a2130d25aaa9..632cbeff1f5d 100644
> > --- a/net/rxrpc/conn_event.c
> > +++ b/net/rxrpc/conn_event.c
> > @@ -245,7 +245,7 @@ static int rxrpc_verify_response(struct rxrpc_connection *conn,
> > {
> > int ret;
> > - if (skb_cloned(skb)) {
> > + if (skb_cloned(skb) || skb_is_nonlinear(skb)) {
> > /* Copy the packet if shared so that we can do in-place
> > * decryption.
> > */
>
>
> Why not adopt the same gate as the ESP fix:
>
>
> skb_cloned(skb) || skb_has_frag_list(skb) || skb_has_shared_frag(skb)
>
>
> so NIC page_pool RX keeps its zero-copy path while still catching the
> splice-loopback vector?
Yeah, that approach preserves the fast path. I'll test and submit v3.
Best regards,
Hyunwoo Kim
^ permalink raw reply
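The difference between the two gates discussed in this thread can be sketched with a mock. Everything below is a hypothetical model: struct mock_skb reduces an skb to four booleans, and the two predicates only mirror the boolean expressions quoted above. The assumption that page_pool RX buffers carry paged frags that are neither shared nor on a frag list follows the reviewer's description, not an inspection of any driver.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical reduction of an skb to the properties the two gates test. */
struct mock_skb {
	bool cloned;
	bool has_paged_frags; /* payload held in page fragments */
	bool has_frag_list;   /* payload held in chained skbs */
	bool has_shared_frag; /* frag pages may be written by someone else */
};

/* Non-linear here means any payload outside the linear area. */
static bool mock_is_nonlinear(const struct mock_skb *skb)
{
	return skb->has_paged_frags || skb->has_frag_list;
}

/* Gate from the v2 patch: copy whenever the skb is cloned or non-linear. */
static bool gate_v2(const struct mock_skb *skb)
{
	return skb->cloned || mock_is_nonlinear(skb);
}

/* Gate suggested in review: copy only when sharing is actually possible. */
static bool gate_esp_style(const struct mock_skb *skb)
{
	return skb->cloned || skb->has_frag_list || skb->has_shared_frag;
}

/* Self-check: the gates differ only for private paged frags. */
static void gate_self_check(void)
{
	struct mock_skb rx_private = { .has_paged_frags = true };
	struct mock_skb spliced = { .has_paged_frags = true, .has_shared_frag = true };

	assert(gate_v2(&rx_private) && !gate_esp_style(&rx_private));
	assert(gate_v2(&spliced) && gate_esp_style(&spliced));
}
```

In this model the suggested gate skips the copy for unshared paged frags while still copying the shared-frag case, which matches the trade-off described in the review comment.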
* [linux-next:master] [tcp] 026dfef287: stress-ng.sigurg.ops_per_sec 7.2% improvement
From: kernel test robot @ 2026-05-08 8:10 UTC (permalink / raw)
To: Jakub Kicinski
Cc: oe-lkp, lkp, Eric Dumazet, Kuniyuki Iwashima, netdev, oliver.sang
Hello,
kernel test robot noticed a 7.2% improvement of stress-ng.sigurg.ops_per_sec on:
commit: 026dfef287c07f37d4d4eef7a0b5a4bfdb29b32d ("tcp: give up on stronger sk_rcvbuf checks (for now)")
https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master
testcase: stress-ng
config: x86_64-rhel-9.4
compiler: gcc-14
test machine: 224 threads 2 sockets Intel(R) Xeon(R) Platinum 8480CTDX (Sapphire Rapids) with 256G memory
parameters:
nr_threads: 100%
testtime: 60s
test: sigurg
cpufreq_governor: performance
Details are as below:
-------------------------------------------------------------------------------------------------->
The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20260508/202605081516.e48509cb-lkp@intel.com
=========================================================================================
compiler/cpufreq_governor/kconfig/nr_threads/rootfs/tbox_group/test/testcase/testtime:
gcc-14/performance/x86_64-rhel-9.4/100%/debian-13-x86_64-20250902.cgz/lkp-spr-r02/sigurg/stress-ng/60s
commit:
6996a2d2d0 ("udp: Unhash auto-bound connected sk from 4-tuple hash table when disconnected.")
026dfef287 ("tcp: give up on stronger sk_rcvbuf checks (for now)")
6996a2d2d0a64808 026dfef287c07f37d4d4eef7a0b
---------------- ---------------------------
%stddev %change %stddev
\ | \
3.161e+08 +7.2% 3.389e+08 stress-ng.sigurg.ops
5270683 +7.2% 5651486 stress-ng.sigurg.ops_per_sec
2111132 -7.4% 1954920 stress-ng.time.involuntary_context_switches
17103 -4.9% 16270 stress-ng.time.percent_of_cpu_this_job_got
8621 -3.8% 8290 stress-ng.time.system_time
1655 ± 2% -10.3% 1485 stress-ng.time.user_time
233661 +4.6% 244480 stress-ng.time.voluntary_context_switches
1187968 +10.0% 1306428 ± 2% meminfo.SUnreclaim
1337542 +8.8% 1455830 ± 2% meminfo.Slab
21.62 ± 3% +3.4 24.99 mpstat.cpu.all.soft%
12.33 ± 2% -1.3 11.04 ± 2% mpstat.cpu.all.usr%
297818 +9.5% 326144 ± 2% proc-vmstat.nr_slab_unreclaimable
2652433 ± 4% -10.1% 2383573 proc-vmstat.pgalloc_normal
40509 -5.8% 38147 vmstat.system.cs
5072202 ± 2% +8.6% 5506325 ± 2% vmstat.system.in
18227 ± 11% +30.4% 23763 ± 8% perf-c2c.DRAM.local
57496 ± 9% +21.0% 69555 ± 6% perf-c2c.HITM.local
72755 ± 9% +19.8% 87175 ± 7% perf-c2c.HITM.total
7586 ± 2% -5.0% 7209 ± 2% sched_debug.cfs_rq:/.left_deadline.stddev
7586 ± 2% -5.0% 7207 ± 2% sched_debug.cfs_rq:/.left_vruntime.stddev
7586 ± 2% -5.0% 7207 ± 2% sched_debug.cfs_rq:/.right_vruntime.stddev
0.84 ± 2% -11.2% 0.74 turbostat.IPC
3.193e+08 ± 2% +9.1% 3.484e+08 turbostat.IRQ
110.21 ± 9% -20.0 90.16 ± 6% turbostat.PKG_%
28.31 ± 2% +5.1% 29.75 turbostat.RAMWatt
3.87 ± 2% +8.6% 4.20 perf-sched.sch_delay.avg.ms.[unknown].[unknown].[unknown].[unknown].[unknown]
3.87 ± 2% +8.6% 4.20 perf-sched.total_sch_delay.average.ms
20.88 +7.1% 22.36 ± 2% perf-sched.total_wait_and_delay.average.ms
17.01 +6.8% 18.16 ± 2% perf-sched.total_wait_time.average.ms
20.88 +7.1% 22.36 ± 2% perf-sched.wait_and_delay.avg.ms.[unknown].[unknown].[unknown].[unknown].[unknown]
17.01 +6.8% 18.16 ± 2% perf-sched.wait_time.avg.ms.[unknown].[unknown].[unknown].[unknown].[unknown]
0.48 ± 10% +30.6% 0.63 ± 3% perf-stat.i.MPKI
1.106e+11 -10.4% 9.901e+10 perf-stat.i.branch-instructions
0.09 ± 4% +0.0 0.11 perf-stat.i.branch-miss-rate%
1.018e+08 ± 2% +10.0% 1.119e+08 perf-stat.i.branch-misses
2.65e+08 ± 8% +16.8% 3.095e+08 ± 3% perf-stat.i.cache-misses
8.252e+08 ± 3% +9.7% 9.05e+08 perf-stat.i.cache-references
41977 -5.5% 39685 perf-stat.i.context-switches
1.17 ± 2% +11.9% 1.31 perf-stat.i.cpi
2435 ± 8% -15.0% 2071 ± 3% perf-stat.i.cycles-between-cache-misses
5.489e+11 -10.8% 4.899e+11 perf-stat.i.instructions
0.86 ± 2% -10.6% 0.77 perf-stat.i.ipc
0.48 ± 10% +30.6% 0.63 ± 3% perf-stat.overall.MPKI
0.09 ± 4% +0.0 0.11 perf-stat.overall.branch-miss-rate%
1.17 ± 2% +12.0% 1.31 perf-stat.overall.cpi
2431 ± 8% -14.9% 2069 ± 3% perf-stat.overall.cycles-between-cache-misses
0.86 ± 2% -10.7% 0.76 perf-stat.overall.ipc
1.063e+11 ± 2% -11.1% 9.45e+10 perf-stat.ps.branch-instructions
96467947 +9.5% 1.056e+08 perf-stat.ps.branch-misses
2.552e+08 ± 8% +15.9% 2.957e+08 ± 3% perf-stat.ps.cache-misses
7.926e+08 ± 3% +8.9% 8.633e+08 perf-stat.ps.cache-references
40211 -6.1% 37740 perf-stat.ps.context-switches
2.175e+11 -0.7% 2.159e+11 perf-stat.ps.cpu-clock
5.279e+11 ± 2% -11.4% 4.675e+11 perf-stat.ps.instructions
2.175e+11 -0.7% 2.159e+11 perf-stat.ps.task-clock
3.161e+13 -10.6% 2.825e+13 perf-stat.total.instructions
5.34 ± 9% -3.0 2.30 ± 7% perf-profile.calltrace.cycles-pp.__x64_sys_recvfrom.do_syscall_64.entry_SYSCALL_64_after_hwframe
5.27 ± 9% -3.0 2.29 ± 7% perf-profile.calltrace.cycles-pp.__sys_recvfrom.__x64_sys_recvfrom.do_syscall_64.entry_SYSCALL_64_after_hwframe
4.73 ± 8% -2.6 2.17 ± 7% perf-profile.calltrace.cycles-pp.sock_recvmsg.__sys_recvfrom.__x64_sys_recvfrom.do_syscall_64.entry_SYSCALL_64_after_hwframe
4.60 ± 8% -2.4 2.15 ± 7% perf-profile.calltrace.cycles-pp.inet_recvmsg.sock_recvmsg.__sys_recvfrom.__x64_sys_recvfrom.do_syscall_64
4.49 ± 8% -2.4 2.12 ± 7% perf-profile.calltrace.cycles-pp.tcp_recvmsg.inet_recvmsg.sock_recvmsg.__sys_recvfrom.__x64_sys_recvfrom
48.77 -1.7 47.03 perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe
48.46 -1.6 46.91 perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe
1.74 ± 10% -0.8 0.94 ± 4% perf-profile.calltrace.cycles-pp.skb_do_copy_data_nocache.tcp_sendmsg_locked.tcp_sendmsg.__sys_sendto.__x64_sys_sendto
0.53 ± 2% +0.0 0.56 ± 2% perf-profile.calltrace.cycles-pp.restore_sigcontext.__x64_sys_rt_sigreturn.do_syscall_64.entry_SYSCALL_64_after_hwframe.ioctl
0.53 ± 2% +0.0 0.57 perf-profile.calltrace.cycles-pp.__x64_sys_rt_sigreturn.do_syscall_64.entry_SYSCALL_64_after_hwframe.ioctl
0.53 +0.0 0.56 perf-profile.calltrace.cycles-pp.copy_fpstate_to_sigframe.get_sigframe.x64_setup_rt_frame.arch_do_signal_or_restart.exit_to_user_mode_loop
0.94 +0.1 1.00 perf-profile.calltrace.cycles-pp.x64_setup_rt_frame.arch_do_signal_or_restart.exit_to_user_mode_loop.do_syscall_64.entry_SYSCALL_64_after_hwframe
0.81 +0.1 0.86 perf-profile.calltrace.cycles-pp.get_sigframe.x64_setup_rt_frame.arch_do_signal_or_restart.exit_to_user_mode_loop.do_syscall_64
0.59 ± 4% +0.1 0.65 ± 3% perf-profile.calltrace.cycles-pp.asm_sysvec_reschedule_ipi._raw_spin_unlock_irq.signal_setup_done.arch_do_signal_or_restart.exit_to_user_mode_loop
1.14 +0.1 1.20 ± 3% perf-profile.calltrace.cycles-pp.release_sock.tcp_sendmsg.__sys_sendto.__x64_sys_sendto.do_syscall_64
0.66 ± 3% +0.1 0.73 perf-profile.calltrace.cycles-pp.__kfree_skb.tcp_urg.tcp_rcv_established.tcp_v4_do_rcv.tcp_v4_rcv
0.70 ± 3% +0.1 0.78 ± 2% perf-profile.calltrace.cycles-pp.kmem_cache_alloc_node_noprof.__alloc_skb.tcp_stream_alloc_skb.tcp_sendmsg_locked.tcp_sendmsg
0.77 ± 3% +0.1 0.86 perf-profile.calltrace.cycles-pp.kmem_cache_alloc_node_noprof.kmalloc_reserve.__alloc_skb.tcp_stream_alloc_skb.tcp_sendmsg_locked
0.68 ± 2% +0.1 0.78 perf-profile.calltrace.cycles-pp.tcp_schedule_loss_probe.tcp_write_xmit.__tcp_push_pending_frames.tcp_sendmsg_locked.tcp_sendmsg
0.64 ± 2% +0.1 0.74 ± 2% perf-profile.calltrace.cycles-pp.skb_defer_free_flush.net_rx_action.handle_softirqs.do_softirq.__local_bh_enable_ip
0.87 ± 3% +0.1 0.98 perf-profile.calltrace.cycles-pp.kmalloc_reserve.__alloc_skb.tcp_stream_alloc_skb.tcp_sendmsg_locked.tcp_sendmsg
0.43 ± 44% +0.1 0.55 perf-profile.calltrace.cycles-pp.fpu__restore_sig.restore_sigcontext.__x64_sys_rt_sigreturn.do_syscall_64.entry_SYSCALL_64_after_hwframe
0.60 ± 3% +0.1 0.71 ± 3% perf-profile.calltrace.cycles-pp.loopback_xmit.xmit_one.dev_hard_start_xmit.__dev_queue_xmit.ip_finish_output2
0.69 ± 3% +0.1 0.83 ± 3% perf-profile.calltrace.cycles-pp.xmit_one.dev_hard_start_xmit.__dev_queue_xmit.ip_finish_output2.ip_output
0.45 ± 44% +0.1 0.59 ± 2% perf-profile.calltrace.cycles-pp.__pcs_replace_empty_main.kmem_cache_alloc_node_noprof.kmalloc_reserve.__alloc_skb.tcp_stream_alloc_skb
0.74 ± 3% +0.1 0.88 ± 3% perf-profile.calltrace.cycles-pp.dev_hard_start_xmit.__dev_queue_xmit.ip_finish_output2.ip_output.__ip_queue_xmit
0.83 ± 3% +0.2 1.02 ± 2% perf-profile.calltrace.cycles-pp._raw_spin_unlock_irq.signal_setup_done.arch_do_signal_or_restart.exit_to_user_mode_loop.do_syscall_64
1.36 ± 2% +0.2 1.57 ± 2% perf-profile.calltrace.cycles-pp.tcp_data_queue.tcp_rcv_established.tcp_v4_do_rcv.tcp_v4_rcv.ip_protocol_deliver_rcu
0.34 ± 70% +0.2 0.57 perf-profile.calltrace.cycles-pp.skb_release_head_state.__kfree_skb.tcp_urg.tcp_rcv_established.tcp_v4_do_rcv
0.35 ± 70% +0.2 0.59 ± 4% perf-profile.calltrace.cycles-pp.ip_rcv.__netif_receive_skb_one_core.process_backlog.__napi_poll.net_rx_action
0.77 ± 5% +0.2 1.02 ± 3% perf-profile.calltrace.cycles-pp.tcp_clean_rtx_queue.tcp_ack.tcp_rcv_established.tcp_v4_do_rcv.__release_sock
0.78 ± 5% +0.2 1.02 ± 3% perf-profile.calltrace.cycles-pp.tcp_ack.tcp_rcv_established.tcp_v4_do_rcv.__release_sock.release_sock
0.78 ± 5% +0.2 1.02 ± 3% perf-profile.calltrace.cycles-pp.tcp_rcv_established.tcp_v4_do_rcv.__release_sock.release_sock.tcp_sendmsg
0.78 ± 5% +0.2 1.02 ± 3% perf-profile.calltrace.cycles-pp.tcp_v4_do_rcv.__release_sock.release_sock.tcp_sendmsg.__sys_sendto
0.78 ± 5% +0.2 1.03 ± 3% perf-profile.calltrace.cycles-pp.__release_sock.release_sock.tcp_sendmsg.__sys_sendto.__x64_sys_sendto
0.60 ± 9% +0.3 0.89 ± 28% perf-profile.calltrace.cycles-pp.tcp_ack.tcp_rcv_established.tcp_v4_do_rcv.tcp_v4_rcv.ip_protocol_deliver_rcu
2.27 ± 2% +0.3 2.56 ± 2% perf-profile.calltrace.cycles-pp.tcp_event_new_data_sent.tcp_write_xmit.__tcp_push_pending_frames.tcp_sendmsg_locked.tcp_sendmsg
2.46 ± 3% +0.3 2.76 perf-profile.calltrace.cycles-pp.__alloc_skb.tcp_stream_alloc_skb.tcp_sendmsg_locked.tcp_sendmsg.__sys_sendto
2.73 ± 2% +0.4 3.09 perf-profile.calltrace.cycles-pp.tcp_stream_alloc_skb.tcp_sendmsg_locked.tcp_sendmsg.__sys_sendto.__x64_sys_sendto
0.09 ±223% +0.5 0.55 ± 2% perf-profile.calltrace.cycles-pp.__pcs_replace_empty_main.kmem_cache_alloc_node_noprof.__alloc_skb.tcp_stream_alloc_skb.tcp_sendmsg_locked
0.08 ±223% +0.5 0.55 ± 4% perf-profile.calltrace.cycles-pp.tcp_try_rmem_schedule.tcp_data_queue.tcp_rcv_established.tcp_v4_do_rcv.tcp_v4_rcv
0.00 +0.5 0.52 ± 2% perf-profile.calltrace.cycles-pp.__inet_lookup_skb.tcp_v4_rcv.ip_protocol_deliver_rcu.ip_local_deliver_finish.ip_local_deliver
0.00 +0.5 0.53 perf-profile.calltrace.cycles-pp.sk_reset_timer.tcp_event_new_data_sent.tcp_write_xmit.__tcp_push_pending_frames.tcp_sendmsg_locked
5.83 ± 11% +0.9 6.78 ± 6% perf-profile.calltrace.cycles-pp._raw_spin_lock_irq.signal_setup_done.arch_do_signal_or_restart.exit_to_user_mode_loop.do_syscall_64
5.74 ± 11% +0.9 6.69 ± 6% perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irq.signal_setup_done.arch_do_signal_or_restart.exit_to_user_mode_loop
3.41 ± 10% +1.0 4.44 ± 8% perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irqsave.do_send_sig_info.send_sigurg.sk_send_sigurg
3.86 ± 8% +1.1 4.92 ± 7% perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.do_send_sig_info.send_sigurg.sk_send_sigurg.tcp_urg
13.97 ± 2% +1.1 15.07 perf-profile.calltrace.cycles-pp.entry_SYSCALL_64.ioctl
7.66 ± 9% +1.2 8.86 ± 4% perf-profile.calltrace.cycles-pp.signal_setup_done.arch_do_signal_or_restart.exit_to_user_mode_loop.do_syscall_64.entry_SYSCALL_64_after_hwframe
8.19 ± 8% +1.2 9.41 ± 4% perf-profile.calltrace.cycles-pp.arch_do_signal_or_restart.exit_to_user_mode_loop.do_syscall_64.entry_SYSCALL_64_after_hwframe.stress_sigurg_handler
8.25 ± 8% +1.2 9.47 ± 4% perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.stress_sigurg_handler
8.21 ± 8% +1.2 9.43 ± 4% perf-profile.calltrace.cycles-pp.exit_to_user_mode_loop.do_syscall_64.entry_SYSCALL_64_after_hwframe.stress_sigurg_handler
8.24 ± 8% +1.2 9.47 ± 4% perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.stress_sigurg_handler
9.01 ± 7% +1.3 10.27 ± 3% perf-profile.calltrace.cycles-pp.stress_sigurg_handler
4.36 ± 15% +1.3 5.63 ± 10% perf-profile.calltrace.cycles-pp.dec_rlimit_put_ucounts.__sigqueue_free.dequeue_signal.get_signal.arch_do_signal_or_restart
4.36 ± 15% +1.3 5.64 ± 10% perf-profile.calltrace.cycles-pp.__sigqueue_free.dequeue_signal.get_signal.arch_do_signal_or_restart.exit_to_user_mode_loop
5.53 ± 9% +1.3 6.82 ± 6% perf-profile.calltrace.cycles-pp.dequeue_signal.get_signal.arch_do_signal_or_restart.exit_to_user_mode_loop.do_syscall_64
4.07 ± 15% +1.3 5.42 ± 11% perf-profile.calltrace.cycles-pp.sig_get_ucounts.__send_signal_locked.do_send_sig_info.send_sigurg.sk_send_sigurg
4.06 ± 15% +1.3 5.41 ± 11% perf-profile.calltrace.cycles-pp.inc_rlimit_get_ucounts.sig_get_ucounts.__send_signal_locked.do_send_sig_info.send_sigurg
6.12 ± 8% +1.4 7.56 ± 6% perf-profile.calltrace.cycles-pp.get_signal.arch_do_signal_or_restart.exit_to_user_mode_loop.do_syscall_64.entry_SYSCALL_64_after_hwframe
24.12 ± 2% +1.5 25.59 ± 2% perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.ioctl
22.82 ± 2% +1.5 24.31 ± 2% perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.ioctl
7.12 ± 7% +1.5 8.63 ± 6% perf-profile.calltrace.cycles-pp.exit_to_user_mode_loop.do_syscall_64.entry_SYSCALL_64_after_hwframe.ioctl
7.10 ± 7% +1.5 8.61 ± 6% perf-profile.calltrace.cycles-pp.arch_do_signal_or_restart.exit_to_user_mode_loop.do_syscall_64.entry_SYSCALL_64_after_hwframe.ioctl
35.18 ± 2% +1.5 36.69 perf-profile.calltrace.cycles-pp.ioctl
11.85 ± 5% +2.1 13.98 perf-profile.calltrace.cycles-pp.do_send_sig_info.send_sigurg.sk_send_sigurg.tcp_urg.tcp_rcv_established
41.17 +2.2 43.37 perf-profile.calltrace.cycles-pp.__x64_sys_sendto.do_syscall_64.entry_SYSCALL_64_after_hwframe
12.55 ± 5% +2.2 14.75 perf-profile.calltrace.cycles-pp.send_sigurg.sk_send_sigurg.tcp_urg.tcp_rcv_established.tcp_v4_do_rcv
12.61 ± 5% +2.2 14.82 perf-profile.calltrace.cycles-pp.sk_send_sigurg.tcp_urg.tcp_rcv_established.tcp_v4_do_rcv.tcp_v4_rcv
41.05 +2.2 43.29 perf-profile.calltrace.cycles-pp.__sys_sendto.__x64_sys_sendto.do_syscall_64.entry_SYSCALL_64_after_hwframe
13.77 ± 5% +2.3 16.11 perf-profile.calltrace.cycles-pp.tcp_urg.tcp_rcv_established.tcp_v4_do_rcv.tcp_v4_rcv.ip_protocol_deliver_rcu
40.22 +2.6 42.84 perf-profile.calltrace.cycles-pp.tcp_sendmsg.__sys_sendto.__x64_sys_sendto.do_syscall_64.entry_SYSCALL_64_after_hwframe
16.58 ± 4% +2.9 19.50 perf-profile.calltrace.cycles-pp.tcp_rcv_established.tcp_v4_do_rcv.tcp_v4_rcv.ip_protocol_deliver_rcu.ip_local_deliver_finish
38.33 +2.9 41.26 perf-profile.calltrace.cycles-pp.tcp_sendmsg_locked.tcp_sendmsg.__sys_sendto.__x64_sys_sendto.do_syscall_64
16.89 ± 4% +3.0 19.86 perf-profile.calltrace.cycles-pp.tcp_v4_do_rcv.tcp_v4_rcv.ip_protocol_deliver_rcu.ip_local_deliver_finish.ip_local_deliver
18.40 ± 4% +3.1 21.53 perf-profile.calltrace.cycles-pp.tcp_v4_rcv.ip_protocol_deliver_rcu.ip_local_deliver_finish.ip_local_deliver.__netif_receive_skb_one_core
18.57 ± 4% +3.1 21.71 perf-profile.calltrace.cycles-pp.ip_protocol_deliver_rcu.ip_local_deliver_finish.ip_local_deliver.__netif_receive_skb_one_core.process_backlog
18.64 ± 4% +3.2 21.80 perf-profile.calltrace.cycles-pp.ip_local_deliver_finish.ip_local_deliver.__netif_receive_skb_one_core.process_backlog.__napi_poll
18.69 ± 4% +3.2 21.85 perf-profile.calltrace.cycles-pp.ip_local_deliver.__netif_receive_skb_one_core.process_backlog.__napi_poll.net_rx_action
19.88 ± 4% +3.2 23.07 perf-profile.calltrace.cycles-pp.process_backlog.__napi_poll.net_rx_action.handle_softirqs.do_softirq
19.95 ± 4% +3.2 23.15 perf-profile.calltrace.cycles-pp.__napi_poll.net_rx_action.handle_softirqs.do_softirq.__local_bh_enable_ip
19.52 ± 4% +3.3 22.82 perf-profile.calltrace.cycles-pp.__netif_receive_skb_one_core.process_backlog.__napi_poll.net_rx_action.handle_softirqs
20.79 ± 4% +3.3 24.11 perf-profile.calltrace.cycles-pp.net_rx_action.handle_softirqs.do_softirq.__local_bh_enable_ip.__dev_queue_xmit
21.33 ± 3% +3.4 24.72 perf-profile.calltrace.cycles-pp.handle_softirqs.do_softirq.__local_bh_enable_ip.__dev_queue_xmit.ip_finish_output2
21.41 ± 3% +3.4 24.82 perf-profile.calltrace.cycles-pp.do_softirq.__local_bh_enable_ip.__dev_queue_xmit.ip_finish_output2.ip_output
21.45 ± 3% +3.4 24.86 perf-profile.calltrace.cycles-pp.__local_bh_enable_ip.__dev_queue_xmit.ip_finish_output2.ip_output.__ip_queue_xmit
24.01 ± 3% +3.8 27.82 perf-profile.calltrace.cycles-pp.__dev_queue_xmit.ip_finish_output2.ip_output.__ip_queue_xmit.__tcp_transmit_skb
24.28 ± 3% +3.8 28.12 perf-profile.calltrace.cycles-pp.ip_finish_output2.ip_output.__ip_queue_xmit.__tcp_transmit_skb.tcp_write_xmit
24.52 ± 3% +3.9 28.39 perf-profile.calltrace.cycles-pp.ip_output.__ip_queue_xmit.__tcp_transmit_skb.tcp_write_xmit.__tcp_push_pending_frames
31.62 ± 2% +4.0 35.63 perf-profile.calltrace.cycles-pp.__tcp_push_pending_frames.tcp_sendmsg_locked.tcp_sendmsg.__sys_sendto.__x64_sys_sendto
25.77 ± 3% +4.0 29.78 perf-profile.calltrace.cycles-pp.__ip_queue_xmit.__tcp_transmit_skb.tcp_write_xmit.__tcp_push_pending_frames.tcp_sendmsg_locked
31.50 ± 2% +4.1 35.56 perf-profile.calltrace.cycles-pp.tcp_write_xmit.__tcp_push_pending_frames.tcp_sendmsg_locked.tcp_sendmsg.__sys_sendto
27.22 ± 3% +4.2 31.42 perf-profile.calltrace.cycles-pp.__tcp_transmit_skb.tcp_write_xmit.__tcp_push_pending_frames.tcp_sendmsg_locked.tcp_sendmsg
5.36 ± 9% -3.1 2.31 ± 7% perf-profile.children.cycles-pp.__x64_sys_recvfrom
5.32 ± 9% -3.0 2.30 ± 7% perf-profile.children.cycles-pp.__sys_recvfrom
4.77 ± 8% -2.6 2.18 ± 7% perf-profile.children.cycles-pp.sock_recvmsg
4.61 ± 8% -2.5 2.15 ± 7% perf-profile.children.cycles-pp.inet_recvmsg
4.53 ± 8% -2.4 2.13 ± 7% perf-profile.children.cycles-pp.tcp_recvmsg
1.98 ± 17% -1.8 0.22 ± 23% perf-profile.children.cycles-pp.tcp_recvmsg_locked
1.80 ± 10% -0.8 0.98 ± 4% perf-profile.children.cycles-pp.skb_do_copy_data_nocache
1.19 ± 12% -0.6 0.54 ± 5% perf-profile.children.cycles-pp.__check_object_size
2.17 ± 5% -0.6 1.59 ± 7% perf-profile.children.cycles-pp._raw_spin_lock_bh
4.11 -0.4 3.67 perf-profile.children.cycles-pp.entry_SYSRETQ_unsafe_stack
0.54 ± 16% -0.4 0.10 ± 14% perf-profile.children.cycles-pp._copy_to_iter
0.86 ± 12% -0.4 0.43 ± 4% perf-profile.children.cycles-pp.check_heap_object
0.82 ± 11% -0.4 0.39 ± 6% perf-profile.children.cycles-pp._copy_from_iter
1.83 ± 6% -0.4 1.47 ± 8% perf-profile.children.cycles-pp.lock_sock_nested
2.26 ± 3% -0.3 1.92 perf-profile.children.cycles-pp.arch_exit_to_user_mode_prepare
0.78 ± 8% -0.3 0.52 ± 3% perf-profile.children.cycles-pp.ktime_get
0.63 ± 9% -0.2 0.40 ± 4% perf-profile.children.cycles-pp.tcp_send_mss
0.34 ± 13% -0.2 0.11 ± 6% perf-profile.children.cycles-pp.__tcp_select_window
0.51 ± 8% -0.2 0.32 ± 3% perf-profile.children.cycles-pp.read_tsc
1.63 ± 3% -0.2 1.44 ± 3% perf-profile.children.cycles-pp.release_sock
0.55 ± 8% -0.2 0.36 ± 4% perf-profile.children.cycles-pp.tcp_current_mss
0.40 ± 9% -0.2 0.23 ± 4% perf-profile.children.cycles-pp.__virt_addr_valid
1.24 ± 3% -0.2 1.08 perf-profile.children.cycles-pp.fdget
1.38 ± 2% -0.1 1.23 perf-profile.children.cycles-pp.entry_SYSCALL_64_safe_stack
0.96 ± 2% -0.1 0.83 perf-profile.children.cycles-pp.syscall_return_via_sysret
0.17 ± 13% -0.1 0.06 ± 6% perf-profile.children.cycles-pp.import_ubuf
0.16 ± 11% -0.1 0.07 ± 5% perf-profile.children.cycles-pp.tcp_tso_segs
0.96 ± 2% -0.1 0.88 perf-profile.children.cycles-pp.x64_sys_call
0.13 ± 13% -0.1 0.06 ± 7% perf-profile.children.cycles-pp.inet_sendmsg
0.32 ± 6% -0.1 0.26 ± 3% perf-profile.children.cycles-pp.tcp_established_options
0.11 ± 10% -0.1 0.05 ± 7% perf-profile.children.cycles-pp.tcp_push
0.15 ± 8% -0.0 0.10 ± 9% perf-profile.children.cycles-pp.ipv4_mtu
0.11 ± 11% -0.0 0.06 ± 6% perf-profile.children.cycles-pp.tcp_release_cb
0.12 ± 10% -0.0 0.08 ± 5% perf-profile.children.cycles-pp.sk_page_frag_refill
0.09 ± 10% -0.0 0.06 ± 8% perf-profile.children.cycles-pp.__send
0.08 ± 7% -0.0 0.05 perf-profile.children.cycles-pp.skb_page_frag_refill
0.42 ± 2% -0.0 0.40 ± 2% perf-profile.children.cycles-pp.stress_sigurg_server
0.39 ± 2% -0.0 0.37 ± 2% perf-profile.children.cycles-pp.hrtimer_interrupt
0.05 +0.0 0.06 perf-profile.children.cycles-pp.__get_user_nocheck_4
0.07 +0.0 0.08 perf-profile.children.cycles-pp.raw_v4_input
0.10 ± 3% +0.0 0.11 perf-profile.children.cycles-pp.__copy_skb_header
0.14 ± 2% +0.0 0.15 ± 2% perf-profile.children.cycles-pp.__fpu_restore_sig
0.11 ± 4% +0.0 0.12 perf-profile.children.cycles-pp.netif_skb_features
0.09 ± 4% +0.0 0.10 ± 3% perf-profile.children.cycles-pp.enqueue_timer
0.13 ± 2% +0.0 0.14 ± 2% perf-profile.children.cycles-pp.check_xstate_in_sigframe
0.10 ± 4% +0.0 0.12 ± 3% perf-profile.children.cycles-pp.refill_stock
0.09 ± 5% +0.0 0.11 ± 3% perf-profile.children.cycles-pp.tcp_send_delayed_ack
0.06 ± 6% +0.0 0.08 ± 6% perf-profile.children.cycles-pp.tcp_measure_rcv_mss
0.12 ± 4% +0.0 0.14 perf-profile.children.cycles-pp._raw_read_lock_irqsave
0.31 ± 2% +0.0 0.32 perf-profile.children.cycles-pp.fpregs_mark_activate
0.13 ± 2% +0.0 0.14 ± 3% perf-profile.children.cycles-pp.detach_if_pending
0.12 ± 4% +0.0 0.14 ± 3% perf-profile.children.cycles-pp.ip_send_check
0.11 ± 3% +0.0 0.13 perf-profile.children.cycles-pp.tcp_wfree
0.08 ± 6% +0.0 0.10 ± 5% perf-profile.children.cycles-pp._find_first_bit
0.20 ± 2% +0.0 0.22 perf-profile.children.cycles-pp.tcp_queue_rcv
0.37 ± 2% +0.0 0.39 perf-profile.children.cycles-pp.restore_fpregs_from_user
0.14 ± 4% +0.0 0.16 ± 4% perf-profile.children.cycles-pp.tcp_event_data_recv
0.24 ± 2% +0.0 0.26 perf-profile.children.cycles-pp._raw_spin_lock
0.16 ± 2% +0.0 0.18 perf-profile.children.cycles-pp.try_charge_memcg
0.18 ± 2% +0.0 0.20 perf-profile.children.cycles-pp.native_sched_clock
0.07 ± 8% +0.0 0.09 ± 7% perf-profile.children.cycles-pp.netdev_core_pick_tx
0.35 ± 2% +0.0 0.37 perf-profile.children.cycles-pp.prepare_signal
0.28 +0.0 0.31 perf-profile.children.cycles-pp.sock_def_readable
0.22 ± 2% +0.0 0.25 perf-profile.children.cycles-pp.sched_clock
0.49 +0.0 0.51 perf-profile.children.cycles-pp.fpu__clear_user_states
0.14 ± 5% +0.0 0.17 ± 4% perf-profile.children.cycles-pp.__ip_finish_output
0.19 ± 3% +0.0 0.22 perf-profile.children.cycles-pp.__ip_local_out
0.26 ± 2% +0.0 0.29 perf-profile.children.cycles-pp.sched_clock_cpu
0.25 ± 3% +0.0 0.28 perf-profile.children.cycles-pp.rb_erase
0.22 ± 3% +0.0 0.25 perf-profile.children.cycles-pp.validate_xmit_skb
0.62 ± 3% +0.0 0.65 perf-profile.children.cycles-pp.native_irq_return_iret
0.52 ± 3% +0.0 0.55 ± 2% perf-profile.children.cycles-pp.fpu__restore_sig
0.23 ± 3% +0.0 0.26 perf-profile.children.cycles-pp.ip_local_out
0.22 ± 2% +0.0 0.26 perf-profile.children.cycles-pp.tcp_mstamp_refresh
0.54 +0.0 0.58 perf-profile.children.cycles-pp.copy_fpstate_to_sigframe
0.18 ± 3% +0.0 0.21 ± 3% perf-profile.children.cycles-pp.mem_cgroup_sk_uncharge
0.35 ± 3% +0.0 0.39 ± 3% perf-profile.children.cycles-pp.tcp_rearm_rto
0.35 ± 2% +0.0 0.38 perf-profile.children.cycles-pp.irqtime_account_irq
0.44 ± 3% +0.0 0.48 perf-profile.children.cycles-pp._raw_read_unlock_irqrestore
0.25 ± 3% +0.0 0.29 ± 2% perf-profile.children.cycles-pp.enqueue_to_backlog
0.22 ± 9% +0.0 0.26 ± 6% perf-profile.children.cycles-pp.dst_release
0.29 ± 2% +0.0 0.33 ± 4% perf-profile.children.cycles-pp.mod_memcg_state
0.28 ± 3% +0.0 0.32 perf-profile.children.cycles-pp.netif_rx_internal
0.63 ± 2% +0.0 0.67 perf-profile.children.cycles-pp.restore_sigcontext
0.23 ± 3% +0.0 0.28 ± 2% perf-profile.children.cycles-pp.ip_rcv_core
0.32 ± 2% +0.0 0.36 perf-profile.children.cycles-pp.rb_insert_color
0.30 ± 3% +0.0 0.34 perf-profile.children.cycles-pp.__netif_rx
0.31 ± 3% +0.0 0.36 perf-profile.children.cycles-pp.lock_timer_base
1.16 +0.0 1.21 perf-profile.children.cycles-pp.__x64_sys_rt_sigreturn
0.34 ± 3% +0.0 0.39 ± 2% perf-profile.children.cycles-pp.skb_clone
0.25 ± 6% +0.0 0.30 ± 4% perf-profile.children.cycles-pp.barn_replace_full_sheaf
0.36 +0.0 0.41 perf-profile.children.cycles-pp.mem_cgroup_sk_charge
0.00 +0.1 0.05 perf-profile.children.cycles-pp.__xfrm_policy_check2
0.00 +0.1 0.05 perf-profile.children.cycles-pp.bpf_skops_write_hdr_opt
0.00 +0.1 0.05 perf-profile.children.cycles-pp.skb_clone_tx_timestamp
0.34 ± 2% +0.1 0.40 perf-profile.children.cycles-pp.__sk_mem_reduce_allocated
0.32 ± 3% +0.1 0.37 perf-profile.children.cycles-pp.tcp_skb_entail
0.30 +0.1 0.35 ± 2% perf-profile.children.cycles-pp.complete_signal
0.82 +0.1 0.87 perf-profile.children.cycles-pp.get_sigframe
0.28 ± 2% +0.1 0.33 ± 3% perf-profile.children.cycles-pp.__netif_receive_skb_core
0.29 ± 3% +0.1 0.34 ± 2% perf-profile.children.cycles-pp.kfree_skbmem
0.96 +0.1 1.02 perf-profile.children.cycles-pp.x64_setup_rt_frame
0.68 ± 2% +0.1 0.74 perf-profile.children.cycles-pp.__refill_objects_node
0.68 ± 2% +0.1 0.74 perf-profile.children.cycles-pp.refill_objects
0.67 ± 3% +0.1 0.73 ± 3% perf-profile.children.cycles-pp.recalc_sigpending
0.34 ± 5% +0.1 0.40 ± 4% perf-profile.children.cycles-pp.barn_replace_empty_sheaf
0.37 ± 3% +0.1 0.43 ± 2% perf-profile.children.cycles-pp.__inet_lookup_established
0.55 ± 3% +0.1 0.62 perf-profile.children.cycles-pp.skb_release_head_state
0.76 ± 3% +0.1 0.82 ± 2% perf-profile.children.cycles-pp.asm_sysvec_reschedule_ipi
0.47 ± 4% +0.1 0.54 ± 3% perf-profile.children.cycles-pp.__pcs_replace_full_main
0.50 ± 2% +0.1 0.56 ± 4% perf-profile.children.cycles-pp.tcp_try_rmem_schedule
0.44 ± 2% +0.1 0.51 ± 2% perf-profile.children.cycles-pp._find_next_bit
0.46 ± 2% +0.1 0.53 ± 2% perf-profile.children.cycles-pp.__inet_lookup_skb
0.53 ± 4% +0.1 0.61 ± 3% perf-profile.children.cycles-pp.ip_rcv
0.56 ± 2% +0.1 0.64 perf-profile.children.cycles-pp.__sk_mem_raise_allocated
0.77 ± 3% +0.1 0.86 perf-profile.children.cycles-pp.skb_release_data
0.60 ± 2% +0.1 0.69 perf-profile.children.cycles-pp.__sk_mem_schedule
0.70 ± 2% +0.1 0.80 perf-profile.children.cycles-pp.tcp_schedule_loss_probe
0.88 ± 3% +0.1 0.99 perf-profile.children.cycles-pp.kmalloc_reserve
0.68 ± 3% +0.1 0.79 ± 2% perf-profile.children.cycles-pp.skb_defer_free_flush
0.62 ± 3% +0.1 0.73 ± 3% perf-profile.children.cycles-pp.loopback_xmit
1.05 ± 3% +0.1 1.18 ± 2% perf-profile.children.cycles-pp.__pcs_replace_empty_main
0.93 ± 2% +0.1 1.06 perf-profile.children.cycles-pp.__mod_timer
0.98 ± 2% +0.1 1.12 perf-profile.children.cycles-pp.sk_reset_timer
2.46 ± 2% +0.1 2.60 ± 3% perf-profile.children.cycles-pp.sysvec_apic_timer_interrupt
2.50 ± 2% +0.1 2.64 ± 3% perf-profile.children.cycles-pp.asm_sysvec_apic_timer_interrupt
0.70 ± 3% +0.1 0.84 ± 3% perf-profile.children.cycles-pp.xmit_one
0.74 ± 3% +0.1 0.89 ± 3% perf-profile.children.cycles-pp.dev_hard_start_xmit
2.06 ± 2% +0.2 2.22 ± 4% perf-profile.children.cycles-pp.__irq_exit_rcu
1.36 ± 3% +0.2 1.52 perf-profile.children.cycles-pp.__kfree_skb
1.48 ± 3% +0.2 1.65 perf-profile.children.cycles-pp.kmem_cache_alloc_node_noprof
1.41 ± 2% +0.2 1.62 ± 2% perf-profile.children.cycles-pp.tcp_data_queue
0.89 ± 5% +0.2 1.14 ± 3% perf-profile.children.cycles-pp.__release_sock
2.29 ± 2% +0.3 2.58 ± 2% perf-profile.children.cycles-pp.tcp_event_new_data_sent
2.50 ± 3% +0.3 2.82 perf-profile.children.cycles-pp.__alloc_skb
2.74 ± 2% +0.4 3.10 perf-profile.children.cycles-pp.tcp_stream_alloc_skb
1.74 ± 2% +0.4 2.13 ± 3% perf-profile.children.cycles-pp._raw_spin_unlock_irq
3.20 ± 3% +0.5 3.66 perf-profile.children.cycles-pp.tcp_clean_rtx_queue
3.56 ± 3% +0.5 4.06 perf-profile.children.cycles-pp.tcp_ack
81.40 +0.9 82.32 perf-profile.children.cycles-pp.entry_SYSCALL_64_after_hwframe
6.26 ± 11% +1.0 7.23 ± 6% perf-profile.children.cycles-pp._raw_spin_lock_irq
80.20 +1.1 81.27 perf-profile.children.cycles-pp.do_syscall_64
7.67 ± 9% +1.2 8.86 ± 4% perf-profile.children.cycles-pp.signal_setup_done
9.03 ± 7% +1.3 10.29 ± 3% perf-profile.children.cycles-pp.stress_sigurg_handler
4.39 ± 15% +1.3 5.67 ± 10% perf-profile.children.cycles-pp.dec_rlimit_put_ucounts
4.39 ± 15% +1.3 5.68 ± 10% perf-profile.children.cycles-pp.__sigqueue_free
5.56 ± 9% +1.3 6.86 ± 7% perf-profile.children.cycles-pp.dequeue_signal
4.08 ± 15% +1.3 5.43 ± 11% perf-profile.children.cycles-pp.inc_rlimit_get_ucounts
4.10 ± 15% +1.4 5.44 ± 11% perf-profile.children.cycles-pp.sig_get_ucounts
6.18 ± 8% +1.5 7.64 ± 6% perf-profile.children.cycles-pp.get_signal
39.09 +1.5 40.61 perf-profile.children.cycles-pp.ioctl
11.90 ± 5% +2.1 14.03 perf-profile.children.cycles-pp.do_send_sig_info
41.20 +2.2 43.38 perf-profile.children.cycles-pp.__x64_sys_sendto
12.61 ± 5% +2.2 14.82 perf-profile.children.cycles-pp.send_sigurg
41.12 +2.2 43.32 perf-profile.children.cycles-pp.__sys_sendto
12.65 ± 5% +2.2 14.86 perf-profile.children.cycles-pp.sk_send_sigurg
13.85 ± 5% +2.3 16.19 perf-profile.children.cycles-pp.tcp_urg
40.28 +2.6 42.86 perf-profile.children.cycles-pp.tcp_sendmsg
15.38 ± 2% +2.7 18.12 perf-profile.children.cycles-pp.exit_to_user_mode_loop
15.35 ± 2% +2.7 18.08 perf-profile.children.cycles-pp.arch_do_signal_or_restart
38.45 +2.9 41.32 perf-profile.children.cycles-pp.tcp_sendmsg_locked
20.61 ± 3% +3.1 23.72 perf-profile.children.cycles-pp.tcp_v4_rcv
20.75 ± 3% +3.1 23.87 perf-profile.children.cycles-pp.ip_protocol_deliver_rcu
20.81 ± 3% +3.1 23.94 perf-profile.children.cycles-pp.ip_local_deliver_finish
20.86 ± 3% +3.1 23.99 perf-profile.children.cycles-pp.ip_local_deliver
19.64 ± 4% +3.1 22.78 perf-profile.children.cycles-pp.tcp_rcv_established
21.99 ± 3% +3.2 25.16 perf-profile.children.cycles-pp.__local_bh_enable_ip
19.95 ± 4% +3.2 23.13 perf-profile.children.cycles-pp.tcp_v4_do_rcv
21.69 ± 3% +3.3 24.97 perf-profile.children.cycles-pp.__netif_receive_skb_one_core
22.08 ± 3% +3.3 25.42 perf-profile.children.cycles-pp.process_backlog
22.12 ± 3% +3.3 25.47 perf-profile.children.cycles-pp.__napi_poll
21.54 ± 3% +3.4 24.94 perf-profile.children.cycles-pp.do_softirq
22.99 ± 3% +3.5 26.47 perf-profile.children.cycles-pp.net_rx_action
23.53 ± 3% +3.5 27.07 perf-profile.children.cycles-pp.handle_softirqs
24.05 ± 3% +3.8 27.86 perf-profile.children.cycles-pp.__dev_queue_xmit
24.30 ± 3% +3.8 28.14 perf-profile.children.cycles-pp.ip_finish_output2
24.54 ± 3% +3.9 28.41 perf-profile.children.cycles-pp.ip_output
31.64 ± 2% +4.0 35.64 perf-profile.children.cycles-pp.__tcp_push_pending_frames
25.80 ± 3% +4.0 29.82 perf-profile.children.cycles-pp.__ip_queue_xmit
31.57 ± 2% +4.0 35.61 perf-profile.children.cycles-pp.tcp_write_xmit
27.28 ± 3% +4.2 31.49 perf-profile.children.cycles-pp.__tcp_transmit_skb
0.87 ± 13% -0.5 0.34 ± 6% perf-profile.self.cycles-pp._raw_spin_lock_bh
3.99 -0.4 3.57 perf-profile.self.cycles-pp.entry_SYSRETQ_unsafe_stack
0.51 ± 17% -0.4 0.10 ± 15% perf-profile.self.cycles-pp._copy_to_iter
0.76 ± 11% -0.4 0.36 ± 5% perf-profile.self.cycles-pp._copy_from_iter
2.91 ± 2% -0.4 2.51 perf-profile.self.cycles-pp.do_syscall_64
1.00 ± 9% -0.3 0.66 ± 3% perf-profile.self.cycles-pp.tcp_sendmsg_locked
0.39 ± 16% -0.3 0.08 ± 12% perf-profile.self.cycles-pp.tcp_recvmsg_locked
2.08 ± 3% -0.3 1.78 perf-profile.self.cycles-pp.arch_exit_to_user_mode_prepare
0.35 ± 17% -0.3 0.07 ± 15% perf-profile.self.cycles-pp.__sys_recvfrom
0.32 ± 13% -0.2 0.10 ± 6% perf-profile.self.cycles-pp.__tcp_select_window
0.38 ± 13% -0.2 0.17 ± 5% perf-profile.self.cycles-pp.check_heap_object
1.90 ± 2% -0.2 1.70 perf-profile.self.cycles-pp.entry_SYSCALL_64
0.41 ± 11% -0.2 0.21 ± 3% perf-profile.self.cycles-pp.__local_bh_enable_ip
0.58 ± 9% -0.2 0.40 ± 3% perf-profile.self.cycles-pp.tcp_write_xmit
0.27 ± 15% -0.2 0.09 ± 7% perf-profile.self.cycles-pp.__check_object_size
0.47 ± 8% -0.2 0.29 ± 2% perf-profile.self.cycles-pp.read_tsc
1.32 ± 2% -0.2 1.15 perf-profile.self.cycles-pp.entry_SYSCALL_64_after_hwframe
0.38 ± 11% -0.2 0.21 ± 5% perf-profile.self.cycles-pp.__sys_sendto
0.37 ± 9% -0.1 0.22 ± 4% perf-profile.self.cycles-pp.__virt_addr_valid
1.04 ± 3% -0.1 0.91 perf-profile.self.cycles-pp.fdget
0.96 ± 2% -0.1 0.82 perf-profile.self.cycles-pp.syscall_return_via_sysret
0.23 ± 11% -0.1 0.10 ± 3% perf-profile.self.cycles-pp.release_sock
0.89 ± 2% -0.1 0.79 perf-profile.self.cycles-pp.entry_SYSCALL_64_safe_stack
0.38 ± 5% -0.1 0.28 perf-profile.self.cycles-pp.tcp_recvmsg
0.14 ± 14% -0.1 0.05 ± 7% perf-profile.self.cycles-pp.import_ubuf
0.18 ± 10% -0.1 0.10 ± 3% perf-profile.self.cycles-pp.tcp_current_mss
0.14 ± 11% -0.1 0.07 ± 5% perf-profile.self.cycles-pp.tcp_tso_segs
0.28 ± 8% -0.1 0.22 ± 5% perf-profile.self.cycles-pp.ktime_get
0.15 ± 9% -0.1 0.09 ± 4% perf-profile.self.cycles-pp.skb_do_copy_data_nocache
0.69 ± 2% -0.1 0.63 perf-profile.self.cycles-pp.x64_sys_call
0.29 ± 5% -0.0 0.24 ± 3% perf-profile.self.cycles-pp.tcp_established_options
0.11 ± 11% -0.0 0.06 ± 6% perf-profile.self.cycles-pp.tcp_sendmsg
0.09 ± 9% -0.0 0.06 ± 9% perf-profile.self.cycles-pp.tcp_release_cb
0.13 ± 7% -0.0 0.10 ± 7% perf-profile.self.cycles-pp.ipv4_mtu
0.07 ± 10% -0.0 0.04 ± 44% perf-profile.self.cycles-pp.__send
0.05 +0.0 0.06 perf-profile.self.cycles-pp.tcp_send_delayed_ack
0.07 +0.0 0.08 perf-profile.self.cycles-pp.__ip_local_out
0.07 +0.0 0.08 perf-profile.self.cycles-pp.skb_release_head_state
0.08 +0.0 0.09 perf-profile.self.cycles-pp.__usecs_to_jiffies
0.06 +0.0 0.07 perf-profile.self.cycles-pp.ip_local_deliver_finish
0.09 +0.0 0.10 perf-profile.self.cycles-pp.irqtime_account_irq
0.06 +0.0 0.07 perf-profile.self.cycles-pp.netif_skb_features
0.06 +0.0 0.07 perf-profile.self.cycles-pp.tcp_event_data_recv
0.06 +0.0 0.07 perf-profile.self.cycles-pp.tcp_inbound_hash
0.06 +0.0 0.07 perf-profile.self.cycles-pp.tcp_options_write
0.09 ± 5% +0.0 0.10 ± 4% perf-profile.self.cycles-pp.__copy_skb_header
0.09 ± 4% +0.0 0.10 perf-profile.self.cycles-pp.refill_stock
0.06 ± 6% +0.0 0.08 ± 6% perf-profile.self.cycles-pp.__ip_finish_output
0.19 +0.0 0.20 ± 2% perf-profile.self.cycles-pp.__get_user_nocheck_8
0.05 ± 7% +0.0 0.06 ± 7% perf-profile.self.cycles-pp.qdisc_pkt_len_segs_init
0.12 ± 5% +0.0 0.14 ± 3% perf-profile.self.cycles-pp.barn_replace_full_sheaf
0.07 ± 6% +0.0 0.09 ± 4% perf-profile.self.cycles-pp.loopback_xmit
0.09 ± 4% +0.0 0.11 ± 4% perf-profile.self.cycles-pp.lock_timer_base
0.06 ± 7% +0.0 0.08 perf-profile.self.cycles-pp.tcp_stream_alloc_skb
0.08 ± 5% +0.0 0.10 perf-profile.self.cycles-pp.__inet_lookup_skb
0.12 ± 4% +0.0 0.14 ± 3% perf-profile.self.cycles-pp._raw_read_lock_irqsave
0.30 ± 2% +0.0 0.31 perf-profile.self.cycles-pp.fpregs_mark_activate
0.29 +0.0 0.31 perf-profile.self.cycles-pp.__restore_fpregs_from_user
0.06 ± 9% +0.0 0.08 ± 6% perf-profile.self.cycles-pp._find_first_bit
0.08 +0.0 0.10 ± 4% perf-profile.self.cycles-pp.ip_rcv
0.08 ± 5% +0.0 0.10 perf-profile.self.cycles-pp.inet_ehashfn
0.10 +0.0 0.12 ± 3% perf-profile.self.cycles-pp.validate_xmit_skb
0.11 +0.0 0.13 ± 2% perf-profile.self.cycles-pp.detach_if_pending
0.10 ± 4% +0.0 0.12 ± 3% perf-profile.self.cycles-pp.tcp_wfree
0.13 ± 4% +0.0 0.15 ± 2% perf-profile.self.cycles-pp.enqueue_to_backlog
0.08 ± 4% +0.0 0.10 perf-profile.self.cycles-pp.tcp_v4_do_rcv
0.17 ± 2% +0.0 0.19 perf-profile.self.cycles-pp.tcp_urg
0.13 ± 2% +0.0 0.15 ± 2% perf-profile.self.cycles-pp.try_charge_memcg
0.22 ± 2% +0.0 0.24 perf-profile.self.cycles-pp._raw_spin_lock
0.17 ± 2% +0.0 0.19 perf-profile.self.cycles-pp.native_sched_clock
0.08 ± 7% +0.0 0.10 ± 8% perf-profile.self.cycles-pp.xmit_one
0.11 ± 5% +0.0 0.13 perf-profile.self.cycles-pp.ip_send_check
0.08 ± 6% +0.0 0.10 ± 4% perf-profile.self.cycles-pp.kmalloc_reserve
0.06 ± 9% +0.0 0.08 ± 10% perf-profile.self.cycles-pp.netdev_core_pick_tx
0.12 ± 4% +0.0 0.14 ± 3% perf-profile.self.cycles-pp.send_sigurg
0.18 ± 5% +0.0 0.21 ± 3% perf-profile.self.cycles-pp.barn_replace_empty_sheaf
0.20 ± 3% +0.0 0.22 ± 2% perf-profile.self.cycles-pp.net_rx_action
0.34 ± 2% +0.0 0.36 ± 2% perf-profile.self.cycles-pp.prepare_signal
0.14 ± 2% +0.0 0.16 perf-profile.self.cycles-pp.process_backlog
0.18 ± 2% +0.0 0.20 perf-profile.self.cycles-pp.__sk_mem_raise_allocated
0.17 ± 2% +0.0 0.19 ± 2% perf-profile.self.cycles-pp.skb_defer_free_flush
0.37 +0.0 0.39 perf-profile.self.cycles-pp.copy_fpstate_to_sigframe
0.31 ± 2% +0.0 0.33 ± 2% perf-profile.self.cycles-pp.__send_signal_locked
0.06 ± 11% +0.0 0.09 ± 12% perf-profile.self.cycles-pp.eth_type_trans
0.26 +0.0 0.29 perf-profile.self.cycles-pp.sock_def_readable
0.18 ± 3% +0.0 0.21 perf-profile.self.cycles-pp.handle_softirqs
0.24 ± 2% +0.0 0.27 perf-profile.self.cycles-pp.rb_erase
0.62 ± 3% +0.0 0.65 perf-profile.self.cycles-pp.native_irq_return_iret
0.31 ± 3% +0.0 0.34 ± 3% perf-profile.self.cycles-pp.tcp_rearm_rto
0.22 ± 2% +0.0 0.25 ± 5% perf-profile.self.cycles-pp.mod_memcg_state
0.32 ± 2% +0.0 0.35 perf-profile.self.cycles-pp.skb_release_data
0.24 ± 2% +0.0 0.27 perf-profile.self.cycles-pp.skb_clone
0.27 ± 2% +0.0 0.30 perf-profile.self.cycles-pp.tcp_v4_rcv
0.51 +0.0 0.55 perf-profile.self.cycles-pp.__refill_objects_node
0.45 ± 5% +0.0 0.49 perf-profile.self.cycles-pp.kmem_cache_free
0.30 +0.0 0.34 ± 2% perf-profile.self.cycles-pp.rb_insert_color
0.32 ± 2% +0.0 0.36 perf-profile.self.cycles-pp.__mod_timer
0.22 ± 2% +0.0 0.26 ± 3% perf-profile.self.cycles-pp.tcp_schedule_loss_probe
0.20 ± 2% +0.0 0.24 ± 3% perf-profile.self.cycles-pp.complete_signal
0.33 ± 2% +0.0 0.37 perf-profile.self.cycles-pp.tcp_ack
0.26 ± 2% +0.0 0.30 perf-profile.self.cycles-pp.tcp_rcv_established
0.22 ± 4% +0.0 0.26 ± 2% perf-profile.self.cycles-pp.ip_rcv_core
0.20 ± 4% +0.0 0.25 ± 4% perf-profile.self.cycles-pp.tcp_data_queue
0.27 ± 2% +0.0 0.32 ± 3% perf-profile.self.cycles-pp.__netif_receive_skb_core
0.44 ± 3% +0.0 0.49 ± 2% perf-profile.self.cycles-pp.kmem_cache_alloc_node_noprof
0.30 ± 3% +0.0 0.35 perf-profile.self.cycles-pp.tcp_skb_entail
0.28 ± 3% +0.1 0.33 ± 3% perf-profile.self.cycles-pp.__inet_lookup_established
0.28 ± 3% +0.1 0.33 ± 2% perf-profile.self.cycles-pp.kfree_skbmem
0.59 ± 3% +0.1 0.64 ± 2% perf-profile.self.cycles-pp._raw_spin_unlock_irq
0.66 ± 2% +0.1 0.72 ± 3% perf-profile.self.cycles-pp.recalc_sigpending
0.42 ± 2% +0.1 0.49 ± 2% perf-profile.self.cycles-pp._find_next_bit
0.88 ± 3% +0.1 0.96 perf-profile.self.cycles-pp._raw_spin_lock_irqsave
0.93 ± 4% +0.1 1.03 perf-profile.self.cycles-pp.__ip_queue_xmit
0.80 ± 2% +0.1 0.90 perf-profile.self.cycles-pp.__tcp_transmit_skb
0.94 ± 2% +0.1 1.06 perf-profile.self.cycles-pp.__alloc_skb
1.15 ± 3% +0.1 1.30 ± 2% perf-profile.self.cycles-pp.tcp_event_new_data_sent
1.48 ± 4% +0.2 1.67 perf-profile.self.cycles-pp.__dev_queue_xmit
1.58 ± 3% +0.2 1.82 perf-profile.self.cycles-pp.tcp_clean_rtx_queue
4.38 ± 15% +1.3 5.66 ± 10% perf-profile.self.cycles-pp.dec_rlimit_put_ucounts
4.08 ± 15% +1.3 5.43 ± 11% perf-profile.self.cycles-pp.inc_rlimit_get_ucounts
Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply
* Re: [PATCH 0/6] SUNRPC: Address remaining cache_check_rcu() UAF in cache content files
From: yangerkun @ 2026-05-08 8:16 UTC (permalink / raw)
To: Chuck Lever, Misbah Anjum N, Jeff Layton, NeilBrown,
Olga Kornievskaia, Dai Ngo, Tom Talpey, Trond Myklebust,
Anna Schumaker, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Simon Horman, yi.zhang, Zhihao Cheng, Li Lingfeng
Cc: linux-nfs, linux-kernel, netdev, Chuck Lever
In-Reply-To: <05f93fc4-59d7-4735-bc7d-a00d1497687a@huawei.com>
On 2026/5/8 11:08, yangerkun wrote:
>
>
> On 2026/5/8 10:45, yangerkun wrote:
>> Hello Chuck,
>>
>> On 2026/5/8 0:12, Chuck Lever wrote:
>>> Hello Erkun -
>>>
>>> On Thu, May 7, 2026, at 11:09 AM, yangerkun wrote:
>>>> Hi,
>>>>
>>>> On 2026/5/1 22:51, Chuck Lever wrote:
>>>>> Misbah Anjum reported a use-after-free in cache_check_rcu()
>>>>> reached through e_show() while sosreport was reading
>>>>> /proc/fs/nfsd/exports on ppc64le. Two fixes for that report
>>>>> landed in v7.0:
>>>>>
>>>>> 48db892356d6 ("NFSD: Defer sub-object cleanup in export put callbacks")
>>>>> e7fcf179b82d ("NFSD: Hold net reference for the lifetime of /proc/fs/nfs/exports fd")
>>>>
>>>> Back to the problem fixed by these patches, I'm a little confused why
>>>> this UAF can be triggered.
>>>>
>>>> Before these patches, svc_export_put looked as follows:
>>>>
>>>> static void svc_export_put(struct kref *ref)
>>>> {
>>>> 	struct svc_export *exp = container_of(ref, struct svc_export, h.ref);
>>>>
>>>> 	path_put(&exp->ex_path);
>>>> 	auth_domain_put(exp->ex_client);
>>>> 	call_rcu(&exp->ex_rcu, svc_export_release);
>>>> }
>>>>
>>>> The auth_domain_put function releases ->name using call_rcu, and
>>>> path_put may release the dentry also via call_rcu. All of this seems to
>>>> prevent e_show from causing a UAF. Could you point out which line in
>>>> d_path triggers the issue?
>>>
>>> The dentry, the mount, and the auth_domain ->name buffer all
>>> end up RCU-freed (dentry_free() and delayed_free_vfsmnt in
>>> fs/, svcauth_unix_domain_release_rcu() in svcauth_unix.c).
>>> The eventual kfree isn't the problem.
>>>
>>> The problem is the synchronous teardown inside path_put(),
>>> which runs before svc_export_put() ever reaches its own
>>> call_rcu():
>>>
>>> path_put(&exp->ex_path)
>>> -> dput(dentry)
>>> -> __dentry_kill() [if last ref]
>>> -> __d_drop() /* unhashes */
>>> -> dentry_unlink_inode() /* d_inode = NULL */
>>> -> d_op->d_release() if set
>>> -> drops parent d_lockref /* may cascade up */
>>> -> dentry_free() /* call_rcu deferred */
>>> -> mntput(mnt) /* deferred via task_work */
>>>
>>> The dentry pointer itself is RCU-safe, so prepend_path()'s walk
>>> of d_parent and d_name doesn't read freed memory. But by the
>>> time the reader gets there, __d_clear_type_and_inode() has
>>> already stored NULL into d_inode, __d_drop() has broken the
>>> hash linkage, and the parent's d_lockref has been decremented
>>> -- which can in turn fire __dentry_kill() on the parent, and
>>> on up the tree. An e_show() that's still inside its cache RCU
>>> read section walks into that half-dismantled state through
>>> seq_path(), and that's the NULL deref Misbah reported.
>>
>> Thank you for your detailed explanation! Yes, e_show might be called
>> when the state is partially dismantled, but after carefully reviewing
>> the code with dput up to __dentry_kill, I still cannot find anything
>> that could cause this issue. Additionally, the comments for
>> prepend_path indicate that they have already taken into account that
>> the dentry can be removed concurrently. I have also run some tests on
>> my arm64 QEMU, but I couldn't reproduce the problem either. Could you
>> please help me identify the specific line or pointer in the dentry
>> that triggers this use-after-free or null pointer issue?
>>
>> Maybe I am not very familiar with the code, which caused me to fail
>> to identify the real root cause. I'm so sorry for that.
>>
>>
>> char *d_path(const struct path *path, char *buf, int buflen)
>> {
>> 	DECLARE_BUFFER(b, buf, buflen);
>> 	struct path root;
>>
>> 	/*
>> 	 * We have various synthetic filesystems that never get mounted. On
>> 	 * these filesystems dentries are never used for lookup purposes, and
>> 	 * thus don't need to be hashed. They also don't need a name until a
>> 	 * user wants to identify the object in /proc/pid/fd/. The little hack
>> 	 * below allows us to generate a name for these objects on demand:
>> 	 *
>> 	 * Some pseudo inodes are mountable. When they are mounted
>> 	 * path->dentry == path->mnt->mnt_root. In that case don't call d_dname
>> 	 * and instead have d_path return the mounted path.
>> 	 */
>> 	if (path->dentry->d_op && path->dentry->d_op->d_dname &&
>> 	    (!IS_ROOT(path->dentry) || path->dentry != path->mnt->mnt_root))
>> 		return path->dentry->d_op->d_dname(path->dentry, buf, buflen);
>>
>> 	rcu_read_lock();
>> 	get_fs_root_rcu(current->fs, &root);
>> 	if (unlikely(d_unlinked(path->dentry)))
>> 		prepend(&b, " (deleted)", 11);
>> 	else
>> 		prepend_char(&b, 0);
>> 	prepend_path(path, &root, &b);
>> 	rcu_read_unlock();
>>
>> 	return extract_string(&b);
>> }
>>
>>
>>>
>>> The earlier fix (2530766492ec, "nfsd: fix UAF when access
>>> ex_uuid or ex_stats") moved the kfree of ex_uuid and ex_stats
>>> into svc_export_release() so those are RCU-safe now.
>>> path_put() and auth_domain_put() couldn't go in there because
>>> both may sleep, and call_rcu callbacks run in softirq context.
>>> This series uses queue_rcu_work() instead: it defers past the
>>> grace period AND runs the callback in process context, so the
>>> sleeping puts move into the deferred path and the window
>>> closes.
>>
>> Yeah, I can get this! Thanks again for your detailed explanation!
>
> Also, could the scenario described in this commit be triggered again?
>
> commit 69d803c40edeaf94089fbc8751c9b746cdc35044
> Author: Yang Erkun <yangerkun@huawei.com>
> Date: Mon Dec 16 22:21:52 2024 +0800
>
> nfsd: Revert "nfsd: release svc_expkey/svc_export with rcu_work"
>
> This reverts commit f8c989a0c89a75d30f899a7cabdc14d72522bb8d.
>
> Before this commit, svc_export_put or expkey_put will call path_put with
> sync mode. After this commit, path_put will be called with async mode.
> And this can lead the unexpected results show as follow.
>
> mkfs.xfs -f /dev/sda
> echo "/ *(rw,no_root_squash,fsid=0)" > /etc/exports
> echo "/mnt *(rw,no_root_squash,fsid=1)" >> /etc/exports
> exportfs -ra
> service nfs-server start
> mount -t nfs -o vers=4.0 127.0.0.1:/mnt /mnt1
> mount /dev/sda /mnt/sda
> touch /mnt1/sda/file
> exportfs -r
> umount /mnt/sda # failed unexcepted
>
> The touch will finally call nfsd_cross_mnt, add refcount to mount, and
> then add cache_head. Before this commit, exportfs -r will call
> cache_flush to cleanup all cache_head, and path_put in
> svc_export_put/expkey_put will be finished with sync mode. So, the
> latter umount will always success. However, after this commit, path_put
> will be called with async mode, the latter umount may failed, and if
> we add some delay, umount will success too. Personally I think this bug
> and should be fixed. We first revert before bugfix patch, and then fix
> the original bug with a different way.
>
> Fixes: f8c989a0c89a ("nfsd: release svc_expkey/svc_export with rcu_work")
> Signed-off-by: Yang Erkun <yangerkun@huawei.com>
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
>
>
After reviewing these two commits:
e7fcf179b82d NFSD: Hold net reference for the lifetime of /proc/fs/nfs/exports fd
48db892356d6 NFSD: Defer sub-object cleanup in export put callbacks
I believe that the issue described in commit e7fcf179b82d might be the
root cause of the null pointer dereferences mentioned in [1]. This is
because we do not call get_net when opening /proc/fs/nfs/exports. As a
result, when the network namespace exits, nfsd_net_exit is triggered.
If, at the same time, the contents of /proc/fs/nfs/exports are being
read, a use-after-free (UAF) can occur on the struct cache_detail. I
think all three bugs referenced in [1] stem from this issue. Therefore,
commit e7fcf179b82d has already addressed the problem. To prevent the
issue described in commit 69d803c40ede, should we consider reverting
commit 48db892356d6 first? Please let me know if I have misunderstood
any aspect of this problem.
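To make the lifetime argument concrete, here is a minimal userspace model
of what I mean. It is not kernel code: every model_* name is made up for
illustration, and it assumes (as a simplification) that dropping the last
namespace reference is what tears down the cache_detail.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct net plus the cache_detail it owns. */
struct model_net {
	int refcount;		/* models the namespace reference count */
	bool cache_alive;	/* models the lifetime of the cache_detail */
};

static void model_get_net(struct model_net *net)
{
	net->refcount++;
}

/* Dropping the last reference models nfsd_net_exit() freeing the cache. */
static void model_put_net(struct model_net *net)
{
	assert(net->refcount > 0);
	if (--net->refcount == 0)
		net->cache_alive = false;
}

/* open(): when hold_ref is true, the fd pins the namespace until release(). */
static void model_exports_open(struct model_net *net, bool hold_ref)
{
	if (hold_ref)
		model_get_net(net);
}

/* read(): returns true only if the cache has not been torn down yet. */
static bool model_exports_read_is_safe(const struct model_net *net)
{
	return net->cache_alive;
}

/* release(): drops the reference taken at open(), if any. */
static void model_exports_release(struct model_net *net, bool held_ref)
{
	if (held_ref)
		model_put_net(net);
}
```

Without the reference taken at open time, a namespace exit between open
and read leaves the reader looking at a dead cache; with it, teardown is
postponed until the fd is released.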
>>
>> Thanks,
>> Erkun.
>>
>>>
>>>
>>
>
>
>
^ permalink raw reply
* Re: [REGRESSION] stmmac: Random DMA reset failure on RK3399 since v6.18
From: Jensen Huang @ 2026-05-08 8:19 UTC (permalink / raw)
To: Thorsten Leemhuis, Maxime Chevallier
Cc: Ovidiu Panait, Russell King, Heiner Kallweit, Andrew Lunn,
regressions, netdev, LKML
In-Reply-To: <5308c658-7d4c-4292-b091-a51546ea4d23@leemhuis.info>
Hi Thorsten, Maxime,
On Thu, May 7, 2026 at 9:45 PM Thorsten Leemhuis
<regressions@leemhuis.info> wrote:
>
> [+Ovidiu Panait]
>
> On 5/7/26 14:49, Jensen Huang wrote:
> > On Tue, May 5, 2026 at 4:26 PM Thorsten Leemhuis
> > <regressions@leemhuis.info> wrote:
> >> On 4/29/26 14:53, Jensen Huang wrote:
> >
> >>> I'm reporting a regression on RK3399 (stmmac) observed in v6.18.24.
> >>> When a network cable is connected during boot, the DMA reset
> >>> occasionally fails with the error message: "Failed to reset the dma".
> >>>
> >>> This appears to be a timing issue related to the EEE RX clock-stop
> >>> logic. Based on my investigation with the RTL8211E PHY, I monitored
> >>> the PHY register PS1R (MMD device 3, address 0x01) and observed a
> >>> value of 0x0f40. This indicates that the PHY is in LPI mode and the RX
> >>> clock may have already stopped.
> >>>
> >>> While commit dd557266cf5f ("net: stmmac: block PHY RXC clock-stop")
> >>
> >> Just wondering: have you tried if mainline (e.g. 7.1-rc1) is still
> >> affected? This is something that is always advisable (some people
> >> would call it required). In this case even more so, as it has for a
> >> while contained a fix for the change you mentioned that wasn't backported:
> >> c171e679ee66d7 ("net: stmmac: Disable EEE RX clock stop when VLAN is
> >> enabled"). But this is not my area of expertise (and in a different area
> >> of the code), so that fix might be unrelated to your issue.
> >
> > Thanks for the pointer.
> > As you suggested, I have tested the mainline and confirmed that the
> > issue is not present in v7.1-rc2, nor as early as v6.19-rc1. However,
> > I verified that the issue persists in the latest stable v6.18.26.
> > I performed a git bisect and the result pointed exactly to the commit
> > you mentioned: c171e679ee66d7 ("net: stmmac: Disable EEE RX clock stop
> > when VLAN is enabled").
>
> Great! Could you please cherry-pick c171e679ee66d7 to 6.18.y and see if
> that fixes things? It sounds like it should.
>
> @Ovidiu Panait: c171e679ee66d7 is a commit of yours. If Jensen confirms
> that cherry-picking fixed the problem, I'd say we ask Greg to pick it up
> for 6.18.y -- unless you see any reasons why that might be a bad idea.
>
> > Additionally, I tested the case where CONFIG_VLAN_8021Q is not set,
> > and the DMA reset issue occurs again.
>
> I'd say that is likely best discussed in a new thread you might want to
> start. Also wondering if it was like that earlier. Or iow: if that is a
> regression or not.
I have tested v6.18.26 and here are the results:
1. I ran "ip link add link eth0 name eth0.5 type vlan id 5" more than 10
times and did not encounter timeout issues. This might be because the
RK3399 GMAC does not support EEE.
2. Cherry-picking c171e679ee66d7 to v6.18.26 avoids the DMA reset failure.
Additionally, I am considering proposing a new DT property (e.g.,
snps,no-eee-rx-clk-stop) to explicitly control eee_rx_clk_stop_enable.
This would provide a more robust solution for hardware combinations
that require a continuous RX clock for stability, regardless of VLAN
configuration. However, this would be better discussed in a new thread
too.
On Thu, May 7, 2026 at 9:16 PM Maxime Chevallier
<maxime.chevallier@bootlin.com> wrote:
>
> Hi,
>
> On 07/05/2026 14:49, Jensen Huang wrote:
> > On Tue, May 5, 2026 at 4:26 PM Thorsten Leemhuis
> > <regressions@leemhuis.info> wrote:
> >>
> >> [Jumping in here, as there are no replies yet]
> >>
> >> BTW, Russell, just in case you missed this: looks like this regression
> >> is caused by a change of yours.
>
> I think Russell is dealing with unpleasant personal stuff, let's see if we
> can figure this out while he's away.
>
> >>
> >> On 4/29/26 14:53, Jensen Huang wrote:
> >>>
> >>> I'm reporting a regression on RK3399 (stmmac) observed in v6.18.24.
> >>> When a network cable is connected during boot, the DMA reset
> >>> occasionally fails with the error message: "Failed to reset the dma".
> >>>
> >>> This appears to be a timing issue related to the EEE RX clock-stop
> >>> logic. Based on my investigation with the RTL8211E PHY, I monitored
> >>> the PHY register PS1R (MMD device 3, address 0x01) and observed a
> >>> value of 0x0f40. This indicates that the PHY is in LPI mode and the RX
> >>> clock may have already stopped.
>
> From what I get, your current hypothesis is that it takes a while for that
> clock to stabilize and therefore we're accessing the DMA registers too soon?
>
> Can you confirm that with the addition of a small delay?
I added msleep(100) between phylink_rx_clk_stop_block() and
stmmac_init_dma_engine(), but it did not help.
> Do you mean that c171e679ee66d7 ("net: stmmac: Disable EEE RX clock stop
> when VLAN is enabled") introduces the bug on 6.18.26 ?
>
> do you have the possibility of bisecting to verify when exactly the issue
> was solved between v6.18 and v6.19 ?
Sorry for the confusion. Commit c171e679ee66d7 is actually the fix. My
git bisect pointed to this commit as the one that avoided the issue
between v6.18 and v6.19-rc1.
Best regards,
Jensen Huang
>
> Ciao, Thorsten
>
> >>> ensures the clock is running before the DMA reset, my tests suggest
> >>> that the phylink_rx_clk_stop_block() call might not provide a
> >>> sufficiently stable RX clock in time for the immediate DMA reset that
> >>> follows.
> >>>
> >>> Since stmmac already sets mac_requires_rxc = true, I modified
> >>> phylink_bringup_phy() to honor this flag. This avoids toggling the
> >>> PHY's clk_stop_enable during the initialization sequence, ensuring the
> >>> RX clock remains active and stable throughout.
> >>> With the change below, I achieved 200/200 successful reboots with the
> >>> cable connected (previously ~50% failure rate).
> >>>
> >>> --- a/drivers/net/phy/phylink.c
> >>> +++ b/drivers/net/phy/phylink.c
> >>> @@ -2171,7 +2171,7 @@ static int phylink_bringup_phy(struct phylink
> >>> *pl, struct phy_device *phy,
> >>> /* Allow the MAC to stop its clock if the PHY has the capability */
> >>> pl->mac_tx_clk_stop = phy_eee_tx_clock_stop_capable(phy) > 0;
> >>>
> >>> - if (pl->mac_supports_eee_ops) {
> >>> + if (pl->mac_supports_eee_ops && !pl->config->mac_requires_rxc) {
> >>> /* Explicitly configure whether the PHY is allowed to stop it's
> >>> * receive clock.
> >>> */
> >>>
> >>> Any feedback/testing on this would be appreciated.
> >>>
> >>> Best regards,
> >>> Jensen Huang
> >>>
> >>
>
^ permalink raw reply
* Re: [PATCH net-next v2 1/5] mctp: convert to getsockopt_iter
From: Breno Leitao @ 2026-05-08 8:21 UTC (permalink / raw)
To: Adam Young
Cc: Jeremy Kerr, Matt Johnston, Martin Schiller, David S. Miller,
Eric Dumazet, Jakub Kicinski, Paolo Abeni, Simon Horman,
Shuah Khan, linux-x25, linux-kernel, netdev, linux-kselftest,
kernel-team
In-Reply-To: <9b8f8cbd-9b94-47f0-a4af-713060d9aef1@amperemail.onmicrosoft.com>
Hello Adam,
On Thu, May 07, 2026 at 04:10:36PM -0400, Adam Young wrote:
> Why is this 1/5 and where are the other 4?
The entire series was sent as one submission, and I've verified that all
recipients received the complete patchset. Did you not receive the other
patches?
You can find the full patchset here on lore:
https://lore.kernel.org/all/20260507-getsock_two-v2-0-5873111d9c12@debian.org/
^ permalink raw reply
* Re: [PATCH 6.12] block: fix memory leak in in bio_map_user_iov()
From: Dmitry Antipov @ 2026-05-08 8:30 UTC (permalink / raw)
To: Fedor Pchelkin
Cc: Greg Kroah-Hartman, stable, Jens Axboe, linux-block,
Christoph Hellwig, lvc-project, netdev
In-Reply-To: <20260507212200-2614841ccc112a082cab6938-pchelkin@ispras>
On Thu, 2026-05-07 at 21:52 +0300, Fedor Pchelkin wrote:
> In some form the issue is present in current upstream as well. For
> example, there is another callsite of iov_iter_extract_pages() in
> block/bio-integrity.c where the same pattern still persists.
Good point, and skb_splice_from_iter() looks suspicious as well:
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 7dad68e3b518..bf053372acb2 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -7343,12 +7343,16 @@ ssize_t skb_splice_from_iter(struct sk_buff *skb, struct iov_iter *iter,
len = iov_iter_extract_pages(iter, &ppages, maxsize, nr, 0, &off);
if (len <= 0) {
+ /* Possible memory leak - ppages should be vfree()'d
+ if reallocated (ppages != pages)? */
ret = len ?: -EIO;
break;
}
i = 0;
do {
+ /* This looks wrong if reallocated - ppages[i++]
+ should be used instead? */
struct page *page = pages[i++];
size_t part = min_t(size_t, PAGE_SIZE - off, len);
This issue likely crosses the boundaries of block subsystem so netdev
people are encouraged to look as well.
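To spell out the ownership pattern I have in mind, here is a userspace
sketch. It is only a model: model_extract() is a hypothetical stand-in,
and whether iov_iter_extract_pages() can really substitute its own array
for a caller-supplied one in this call path is exactly the open question.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/*
 * Hypothetical "extract" helper: fills *items, and if the caller-supplied
 * array is too small it substitutes a heap allocation.
 * Returns the number of items produced, or -1 on allocation failure.
 */
static int model_extract(int **items, size_t cap, size_t want)
{
	assert(items && *items);
	if (want > cap) {
		int *bigger = malloc(want * sizeof(*bigger));

		if (!bigger)
			return -1;
		*items = bigger;	/* no longer the caller's stack array */
	}
	for (size_t i = 0; i < want; i++)
		(*items)[i] = (int)i + 1;
	return (int)want;
}

/* Sums the extracted items, always reading through the returned pointer. */
static long model_consume(size_t want)
{
	int stack_items[4];
	int *pitems = stack_items;
	long sum = 0;
	int n = model_extract(&pitems, 4, want);

	for (int i = 0; i < n; i++)
		sum += pitems[i];	/* pitems[i], not stack_items[i] */

	if (pitems != stack_items)
		free(pitems);		/* release on every exit path */
	return n < 0 ? -1 : sum;
}
```

The two points are the same as in the comments added to the diff: index
through the pointer that was handed back, and free it on all exits if it
no longer points at the on-stack array.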
Dmitry
^ permalink raw reply related
* Re: [RFC 1/4] net: fec: do not use readl()/writel() for ColdFire
From: David Laight @ 2026-05-08 8:40 UTC (permalink / raw)
To: Wei Fang
Cc: Greg Ungerer, linux-m68k@lists.linux-m68k.org,
linux-kernel@vger.kernel.org, arnd@kernel.org, Greg Ungerer,
Frank Li, Shenwei Wang, netdev@vger.kernel.org
In-Reply-To: <DBBPR04MB750044F98B1E719AD9DBD93E883D2@DBBPR04MB7500.eurprd04.prod.outlook.com>
On Fri, 8 May 2026 02:46:38 +0000
Wei Fang <wei.fang@nxp.com> wrote:
> > static void
> > fec_stop(struct net_device *ndev)
> > {
> > struct fec_enet_private *fep = netdev_priv(ndev);
> > - u32 rmii_mode = readl(fep->hwp + FEC_R_CNTRL) & FEC_RCR_RMII;
> > + u32 rmii_mode = fec_readl(fep->hwp + FEC_R_CNTRL) & FEC_RCR_RMII;
>
> This is not an issue, but since you changed this line, the new code should
> follow the "reverse xmas tree" style.
Looking at the code, rmii_mode isn't even used until much later in the
function (and then not very often).
Much better to read it just before it is needed.
David
>
> See: https://elixir.bootlin.com/linux/v7.0.1/source/Documentation/process/maintainer-netdev.rst#L380
>
> > u32 val;
> >
> > /* We cannot expect a graceful transmit stop without link !!! */
> > if (fep->link) {
> > - writel(1, fep->hwp + FEC_X_CNTRL); /* Graceful transmit stop */
> > + fec_writel(1, fep->hwp + FEC_X_CNTRL); /* Graceful transmit stop */
> > udelay(10);
> > - if (!(readl(fep->hwp + FEC_IEVENT) & FEC_ENET_GRA))
> > + if (!(fec_readl(fep->hwp + FEC_IEVENT) & FEC_ENET_GRA))
> > netdev_err(ndev, "Graceful transmit stop did not complete!\n");
> > }
> >
>
>
^ permalink raw reply
* RE: [PATCH net-next v2 5/6] net: wangxun: clear stored DMA addresses after dma_free_coherent()
From: Jiawen Wu @ 2026-05-08 8:43 UTC (permalink / raw)
To: 'Jakub Kicinski'
Cc: netdev, mengyuanlou, andrew+netdev, davem, edumazet, pabeni,
richardcochran, linux, horms, kees, larysa.zaremba, leitao, joe,
jacob.e.keller, fabio.baltieri
In-Reply-To: <20260503021536.4127361-1-kuba@kernel.org>
On Sun, May 3, 2026 10:16 AM, Jakub Kicinski wrote:
> This is an AI-generated review of your patch. The human sending this
> email has considered the AI review valid, or at least plausible.
> ---
> net: wangxun: clear stored DMA addresses after dma_free_coherent()
>
> This commit clears the ring DMA addresses after freeing their resources via
> dma_free_coherent() in the wangxun driver. It aims to prevent use-after-free
> of the DMA rings at shutdown when WOL is enabled.
>
> > diff --git a/drivers/net/ethernet/wangxun/libwx/wx_lib.c b/drivers/net/ethernet/wangxun/libwx/wx_lib.c
> > index 3216dee778be7..51599f6b878e7 100644
> > --- a/drivers/net/ethernet/wangxun/libwx/wx_lib.c
> > +++ b/drivers/net/ethernet/wangxun/libwx/wx_lib.c
> > @@ -2462,6 +2462,7 @@ void wx_free_isb_resources(struct wx *wx)
> > dma_free_coherent(&pdev->dev, sizeof(u32) * 4,
> > wx->isb_mem, wx->isb_dma);
> > wx->isb_mem = NULL;
> > + wx->isb_dma = 0;
> > }
>
> This isn't a regression introduced by this patch, but can dma_free_coherent()
> be called with a NULL wx->isb_mem virtual address here?
>
> Since wx->isb_mem is only allocated when the interface is brought up, if the
> module is removed while the interface is down, wx->isb_mem will be NULL. Does
> passing a NULL virtual address to dma_free_coherent() cause a panic here, and
> should it be guarded by a NULL check?
>
> [ ... ]
>
> > @@ -2678,6 +2679,7 @@ static void wx_free_rx_resources(struct wx_ring *rx_ring)
> > rx_ring->desc, rx_ring->dma);
> >
> > rx_ring->desc = NULL;
> > + rx_ring->dma = 0;
>
> If rx_ring->dma is set to 0 here, and Wake-on-LAN is enabled, does this result
> in the hardware being programmed to DMA directly to physical address 0?
>
> Looking at ngbe_dev_shutdown(), it frees these resources and then calls
> wx_configure_rx(). Inside wx_configure_rx_ring(), the driver programs the
> base address registers WX_PX_RR_BAL with ring->dma, which is now 0. When a
> WOL packet arrives, could this cause silent memory corruption or fatal
> IOMMU faults?
By hardware design on this device, WOL packets are handled ahead of the DMA
ring, so programming a zeroed ring->dma does not result in DMA transactions
to physical address 0. On this point, the original code is fine.
>
> Additionally, this wasn't introduced by this patch, but does the shutdown
> sequence trigger a NULL pointer dereference when WOL is active?
>
> Earlier in wx_free_rx_resources(), rx_ring->rx_buffer_info is freed and set
> to NULL, but ring->count retains its original value. During WOL shutdown,
> ngbe_dev_shutdown() frees these resources and then reconfigures the RX rings
> via wx_configure_rx_ring().
>
> If wx_configure_rx_ring() executes:
> memset(ring->rx_buffer_info, 0, sizeof(struct wx_rx_buffer) * ring->count);
>
> with a NULL pointer and a count > 0, won't this cause a kernel panic?
>
> [ ... ]
>
> > @@ -2906,6 +2910,7 @@ static int wx_setup_rx_resources(struct wx_ring *rx_ring)
> >
> > err_desc:
> > dma_free_coherent(dev, rx_ring->size, rx_ring->desc, rx_ring->dma);
> > + rx_ring->dma = 0;
> > err:
> > kvfree(rx_ring->rx_buffer_info);
> > rx_ring->rx_buffer_info = NULL;
>
> Should rx_ring->desc also be set to NULL here?
>
> If wx_alloc_page_pool() fails, the driver jumps to err_desc and frees the DMA
> memory, but leaves rx_ring->desc pointing to the freed memory. Since wx_ring
> structures can be reused across interface up/down cycles, could this dangling
> pointer bypass the if (!rx_ring->desc) safety check in wx_free_rx_resources()
> and cause a double-free later?
>
^ permalink raw reply
* [PATCH net v3] rxrpc: Also unshare DATA/RESPONSE packets when paged frags are present
From: Hyunwoo Kim @ 2026-05-08 8:53 UTC (permalink / raw)
To: dhowells, marc.dionne, davem, edumazet, kuba, pabeni, horms,
qingfang.deng, jiayuan.chen
Cc: linux-afs, netdev, stable, imv4bel
The DATA-packet handler in rxrpc_input_call_event() and the RESPONSE
handler in rxrpc_verify_response() copy the skb to a linear one before
calling into the security ops only when skb_cloned() is true. An skb
that is not cloned but still carries externally-owned paged fragments
(e.g. SKBFL_SHARED_FRAG set by splice() into a UDP socket via
__ip_append_data, or a chained skb_has_frag_list()) falls through to
the in-place decryption path, which binds the frag pages directly into
the AEAD/skcipher SGL via skb_to_sgvec().
Extend the gate to also unshare when skb_has_frag_list() or
skb_has_shared_frag() is true. This catches the splice-loopback vector
and other externally-shared frag sources while preserving the
zero-copy fast path for skbs whose frags are kernel-private (e.g. NIC
page_pool RX, GRO). The OOM/trace handling already in place is reused.
Fixes: d0d5c0cd1e71 ("rxrpc: Use skb_unshare() rather than skb_cow_data()")
Cc: stable@vger.kernel.org
Signed-off-by: Hyunwoo Kim <imv4bel@gmail.com>
---
Changes in v3:
- Use skb_has_frag_list() || skb_has_shared_frag() instead of skb_is_nonlinear()
- v2: https://lore.kernel.org/all/af2F1FU5d4Q_Gn1W@v4bel/
Changes in v2:
- Use skb_is_nonlinear() instead of skb->data_len
- v1: https://lore.kernel.org/all/afKV2zGR6rrelPC7@v4bel/
---
net/rxrpc/call_event.c | 4 +++-
net/rxrpc/conn_event.c | 3 ++-
2 files changed, 5 insertions(+), 2 deletions(-)
diff --git a/net/rxrpc/call_event.c b/net/rxrpc/call_event.c
index fdd683261226..2b19b252225e 100644
--- a/net/rxrpc/call_event.c
+++ b/net/rxrpc/call_event.c
@@ -334,7 +334,9 @@ bool rxrpc_input_call_event(struct rxrpc_call *call)
if (sp->hdr.type == RXRPC_PACKET_TYPE_DATA &&
sp->hdr.securityIndex != 0 &&
- skb_cloned(skb)) {
+ (skb_cloned(skb) ||
+ skb_has_frag_list(skb) ||
+ skb_has_shared_frag(skb))) {
/* Unshare the packet so that it can be
* modified by in-place decryption.
*/
diff --git a/net/rxrpc/conn_event.c b/net/rxrpc/conn_event.c
index a2130d25aaa9..442414d90ba1 100644
--- a/net/rxrpc/conn_event.c
+++ b/net/rxrpc/conn_event.c
@@ -245,7 +245,8 @@ static int rxrpc_verify_response(struct rxrpc_connection *conn,
{
int ret;
- if (skb_cloned(skb)) {
+ if (skb_cloned(skb) || skb_has_frag_list(skb) ||
+ skb_has_shared_frag(skb)) {
/* Copy the packet if shared so that we can do in-place
* decryption.
*/
--
2.43.0
^ permalink raw reply related
* Re: [PATCH net v1] net/mlx5: Fix HWS action unwind NULL dereference
From: Simon Horman @ 2026-05-08 8:53 UTC (permalink / raw)
To: Prathamesh Deshpande
Cc: Saeed Mahameed, Leon Romanovsky, Moshe Shemesh, Tariq Toukan,
Yevgeny Kliteynik, Jakub Kicinski, netdev, linux-rdma,
linux-kernel
In-Reply-To: <20260504220725.46686-1-prathameshdeshpande7@gmail.com>
On Mon, May 04, 2026 at 11:06:46PM +0100, Prathamesh Deshpande wrote:
> mlx5_fs_fte_get_hws_actions() stores some destination actions in
> fs_actions[] before checking whether action creation succeeded.
>
> If creating a table-number or range destination action fails, or if
> fetching a sampler destination action fails, dest_action is NULL but
> num_fs_actions has already been incremented. The shared error path then
> calls mlx5_fs_destroy_fs_action(), which dereferences fs_action->action
> to get the HWS action type, causing a NULL pointer dereference while
> unwinding the original failure.
>
> Track whether the current destination action needs fs_actions[] cleanup,
> but append it only after dest_action has been validated.
>
> Fixes: 2ec6786ad0a6b ("net/mlx5: fs, add HWS fte API functions")
> Fixes: 32e658c84b6d ("net/mlx5: fs, add support for dest flow sampler HWS action")
> Signed-off-by: Prathamesh Deshpande <prathameshdeshpande7@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
^ permalink raw reply
* Re: [PATCH v12 net-next 2/9] net/mlx5e: trim stack use in PCIe congestion threshold helper
From: David Laight @ 2026-05-08 9:02 UTC (permalink / raw)
To: Ratheesh Kannoth
Cc: intel-wired-lan, linux-kernel, linux-rdma, netdev, oss-drivers,
akiyano, andrew+netdev, anthony.l.nguyen, arkadiusz.kubalewski,
brett.creeley, darinzon, davem, donald.hunter, edumazet, horms,
idosch, ivecera, jiri, kuba, leon, mbloch, michael.chan, pabeni,
pavan.chebbi, petrm, Prathosh.Satish, przemyslaw.kitszel, saeedm,
sgoutham, tariqt, vadim.fedorenko
In-Reply-To: <20260508034912.4082520-3-rkannoth@marvell.com>
On Fri, 8 May 2026 09:19:05 +0530
Ratheesh Kannoth <rkannoth@marvell.com> wrote:
> union devlink_param_value grew when U64 array parameters were added.
> Keeping a four-element array of that union in
> mlx5e_pcie_cong_get_thresh_config() inflated the stack frame past the
> -Wframe-larger-than limit.
>
> Read each driverinit value into a single reused union, then store the
> four u16 thresholds in struct mlx5e_pcie_cong_thresh field order via a
> temporary u16 pointer to config.
>
> Signed-off-by: Ratheesh Kannoth <rkannoth@marvell.com>
> ---
> .../mellanox/mlx5/core/en/pcie_cong_event.c | 34 +++++++++++--------
> 1 file changed, 19 insertions(+), 15 deletions(-)
>
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/pcie_cong_event.c b/drivers/net/ethernet/mellanox/mlx5/core/en/pcie_cong_event.c
> index 2eb666a46f39..88e76be3a73d 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en/pcie_cong_event.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en/pcie_cong_event.c
> @@ -252,28 +252,32 @@ static int
> mlx5e_pcie_cong_get_thresh_config(struct mlx5_core_dev *dev,
> struct mlx5e_pcie_cong_thresh *config)
> {
> + enum {
> + INBOUND_HIGH,
> + INBOUND_LOW,
> + OUTBOUND_HIGH,
> + OUTBOUND_LOW,
> + };
> +
> u32 ids[4] = {
Someone will suggest that should be 'static const'.
It may make the code smaller.
> - MLX5_DEVLINK_PARAM_ID_PCIE_CONG_IN_LOW,
> - MLX5_DEVLINK_PARAM_ID_PCIE_CONG_IN_HIGH,
> - MLX5_DEVLINK_PARAM_ID_PCIE_CONG_OUT_LOW,
> - MLX5_DEVLINK_PARAM_ID_PCIE_CONG_OUT_HIGH,
> + [INBOUND_LOW] = MLX5_DEVLINK_PARAM_ID_PCIE_CONG_IN_LOW,
> + [INBOUND_HIGH] = MLX5_DEVLINK_PARAM_ID_PCIE_CONG_IN_HIGH,
> + [OUTBOUND_LOW] = MLX5_DEVLINK_PARAM_ID_PCIE_CONG_OUT_LOW,
> + [OUTBOUND_HIGH] = MLX5_DEVLINK_PARAM_ID_PCIE_CONG_OUT_HIGH,
> };
> - struct devlink *devlink = priv_to_devlink(dev);
> - union devlink_param_value val[4];
>
> - for (int i = 0; i < 4; i++) {
> - u32 id = ids[i];
> - int err;
> + struct devlink *devlink = priv_to_devlink(dev);
> + union devlink_param_value val;
> + u16 *dst = (u16 *)config;
You can't do that - far too fragile.
Maybe &config->inbound_low - but even that assumes the values are in order.
A safer way would be using a temporary 'u16 val16[4]'.
(Or even overwrite ids[] with the result.)
But the code might even be smaller if you just unroll the loop:
err = devl_param_driverinit_value_get(devlink, MLX5_DEVLINK_PARAM_ID_PCIE_CONG_IN_LOW, &val);
if (err)
return err;
config->inbound_low = val.vu16;
err = devl_param_driverinit_value_get(devlink, MLX5_DEVLINK_PARAM_ID_PCIE_CONG_IN_HIGH, &val);
if (err)
return err;
config->inbound_high = val.vu16;
err = devl_param_driverinit_value_get(devlink, MLX5_DEVLINK_PARAM_ID_PCIE_CONG_OUT_LOW, &val);
if (err)
return err;
config->outbound_low = val.vu16;
err = devl_param_driverinit_value_get(devlink, MLX5_DEVLINK_PARAM_ID_PCIE_CONG_OUT_HIGH, &val);
if (err)
return err;
config->outbound_high = val.vu16;
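For comparison, the temporary 'u16 val16[4]' variant could look roughly
like the userspace sketch below. Everything named model_* or MODEL_* is a
made-up stand-in (including the getter, its values and the size of the
union), so treat it as an illustration of the shape only.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the (now large) devlink value union. */
union model_param_value {
	uint16_t vu16;
	uint64_t vu64[8];	/* arbitrary size, only to make the union big */
};

struct model_thresh {
	uint16_t inbound_low, inbound_high, outbound_low, outbound_high;
};

enum { IN_LOW, IN_HIGH, OUT_LOW, OUT_HIGH, NUM_THRESH };
enum { MODEL_ID_IN_LOW = 100, MODEL_ID_IN_HIGH, MODEL_ID_OUT_LOW,
       MODEL_ID_OUT_HIGH };

/* Stand-in getter: yields 10, 20, 30, 40 for the four ids, else an error. */
static int model_value_get(unsigned int id, union model_param_value *val)
{
	if (id < MODEL_ID_IN_LOW || id > MODEL_ID_OUT_HIGH)
		return -1;
	val->vu16 = (uint16_t)((id - MODEL_ID_IN_LOW + 1) * 10);
	return 0;
}

static int model_get_thresh_config(struct model_thresh *config)
{
	static const unsigned int ids[NUM_THRESH] = {
		[IN_LOW]   = MODEL_ID_IN_LOW,
		[IN_HIGH]  = MODEL_ID_IN_HIGH,
		[OUT_LOW]  = MODEL_ID_OUT_LOW,
		[OUT_HIGH] = MODEL_ID_OUT_HIGH,
	};
	union model_param_value val;	/* one reused union keeps the frame small */
	uint16_t val16[NUM_THRESH];

	assert(config);
	for (int i = 0; i < NUM_THRESH; i++) {
		int err = model_value_get(ids[i], &val);

		if (err)
			return err;
		val16[i] = val.vu16;
	}

	/* Named assignments: nothing depends on the struct's field order. */
	config->inbound_low = val16[IN_LOW];
	config->inbound_high = val16[IN_HIGH];
	config->outbound_low = val16[OUT_LOW];
	config->outbound_high = val16[OUT_HIGH];
	return 0;
}
```

That keeps only 8 bytes of results on the stack and never casts the
config struct to a u16 pointer.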
-- David
> + int err;
>
> - err = devl_param_driverinit_value_get(devlink, id, &val[i]);
> + for (int i = 0; i < ARRAY_SIZE(ids); i++) {
> + err = devl_param_driverinit_value_get(devlink, ids[i], &val);
> if (err)
> return err;
> - }
>
> - config->inbound_low = val[0].vu16;
> - config->inbound_high = val[1].vu16;
> - config->outbound_low = val[2].vu16;
> - config->outbound_high = val[3].vu16;
> + dst[i] = val.vu16;
> + }
>
> return 0;
> }
^ permalink raw reply
* Re: [PATCH net v1] net/mlx5e: Fix PTP TX SQ cleanup on metadata DB failure
From: Simon Horman @ 2026-05-08 9:06 UTC (permalink / raw)
To: Prathamesh Deshpande
Cc: Saeed Mahameed, Leon Romanovsky, Richard Cochran, Tariq Toukan,
Eran Ben Elisha, Jakub Kicinski, netdev, linux-kernel
In-Reply-To: <20260504223018.49556-1-prathameshdeshpande7@gmail.com>
On Mon, May 04, 2026 at 11:30:05PM +0100, Prathamesh Deshpande wrote:
> mlx5e_ptp_open_txqsq() creates the hardware SQ before allocating the PTP
> traffic metadata database.
>
> If mlx5e_ptp_alloc_traffic_db() fails, the error path frees the software
> TX queue state but skips destroying the already-created hardware SQ.
>
> Add a dedicated unwind label that destroys the SQ before freeing the TXQ
> state.
>
> Fixes: 1880bc4e4a96 ("net/mlx5e: Add TX port timestamp support")
> Signed-off-by: Prathamesh Deshpande <prathameshdeshpande7@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
^ permalink raw reply
* Re: [PATCH] ethtool: fix inverted memchr_inv condition in ethnl_bitmap32_not_zero()
From: Breno Leitao @ 2026-05-08 9:13 UTC (permalink / raw)
To: Chenguang Zhao
Cc: Andrew Lunn, Jakub Kicinski, David S. Miller, Eric Dumazet,
Paolo Abeni, Simon Horman, Maxime Chevallier, netdev
In-Reply-To: <20260508080211.636177-1-zhaochenguang@kylinos.cn>
On Fri, May 08, 2026 at 04:02:11PM +0800, Chenguang Zhao wrote:
> memchr_inv() returns non-NULL when a byte differs from the given value.
> Return true in that case, not when the scanned words are all zero.
>
> Signed-off-by: Chenguang Zhao <zhaochenguang@kylinos.cn>
There is no Fixes: tag. It seems the buggy check was introduced in
commit 10b518d4e6dd ("ethtool: netlink bitset handling").
> ---
> net/ethtool/bitset.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/net/ethtool/bitset.c b/net/ethtool/bitset.c
> index 8bb98d3ea3db..56b0c4867ed2 100644
> --- a/net/ethtool/bitset.c
> +++ b/net/ethtool/bitset.c
> @@ -105,7 +105,7 @@ static bool ethnl_bitmap32_not_zero(const u32 *map, unsigned int start,
> start_word++;
> }
>
> - if (!memchr_inv(map + start_word, '\0',
> + if (memchr_inv(map + start_word, '\0',
> (end_word - start_word) * sizeof(u32)))
> return true;
> if (end % 32 == 0)
The fix itself looks correct, but is the rest of
ethnl_bitmap32_not_zero() consistent with the documented "true if there
is a non-zero bit in [start, end)" semantics?
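
To make the return-value convention concrete, here is a small userspace model of memchr_inv(), assuming only the semantics stated in the commit message (non-NULL when some byte differs from the given value). `memchr_inv_model()` and `words_not_zero()` are illustrative names, not kernel functions, and the sketch covers whole words only, not the partial first and last words that ethnl_bitmap32_not_zero() also handles.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace model of memchr_inv(): return a pointer to the first byte
 * that is NOT equal to c, or NULL if every byte equals c.
 */
static const void *memchr_inv_model(const void *start, int c, size_t bytes)
{
	const unsigned char *p = start;

	assert(start != NULL || bytes == 0);

	for (size_t i = 0; i < bytes; i++)
		if (p[i] != (unsigned char)c)
			return &p[i];
	return NULL;
}

/* True if any of the nwords 32-bit words has a bit set. A non-NULL
 * result means "found a non-zero byte", which is the polarity the
 * fixed condition relies on.
 */
static bool words_not_zero(const uint32_t *map, size_t nwords)
{
	return memchr_inv_model(map, 0, nwords * sizeof(uint32_t)) != NULL;
}
```

With the original `!memchr_inv(...)` test, the equivalent of `words_not_zero()` would report true for an all-zero range, which is the inversion the patch removes.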
^ permalink raw reply
* Re: [PATCH net-next V6 2/3] net/mlx5e: Avoid copying payload to the skb's linear part
From: Dragos Tatulea @ 2026-05-08 9:15 UTC (permalink / raw)
To: Amery Hung
Cc: Tariq Toukan, Christoph Paasch, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Andrew Lunn, David S. Miller, Saeed Mahameed,
Mark Bloch, Leon Romanovsky, netdev, linux-rdma, linux-kernel,
Gal Pressman, Daniel Borkmann, Jesper Dangaard Brouer,
John Fastabend, Stanislav Fomichev, Alexei Starovoitov
In-Reply-To: <CAMB2axPNhveQaDPs-ttu4uFcpvAfJCdzJ3d05HWQf4+p7uVUsg@mail.gmail.com>
On 07.05.26 22:50, Amery Hung wrote:
> On Thu, May 7, 2026 at 4:50 PM Dragos Tatulea <dtatulea@nvidia.com> wrote:
>>
>>
>> Hi Amery,
>>
>> On 07.05.26 15:53, Amery Hung wrote:
>>> [...]
>>> Am I understanding correctly that the better performance comes with
>>> the assumption that the XDP does not change headers?
>>>
>>> headlen is determined before the XDP program runs. If it pushes/pops
>>> headers, there could be headers in frags or data in the linear region
>>> after __pskb_pull_tail().
>>>
>> That's right.
>>
>>>> if (__test_and_clear_bit(MLX5E_RQ_FLAG_XDP_XMIT, rq->flags)) {
>>>> struct mlx5e_frag_page *pfp;
>>>> @@ -2060,8 +2066,7 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
>>>> pagep->frags++;
>>>> while (++pagep < frag_page);
>>>>
>>>> - headlen = min_t(u16, MLX5E_RX_MAX_HEAD - len,
>>>> - skb->data_len);
>>>> + headlen = min_t(u16, headlen - len, skb->data_len);
>>>
>>> headlen - len can underflow but will be capped by skb->data_len, so
>>> this should be okay, right?
>> It is safe. But it might trigger an extra allocation in the pull when
>> len > headlen. We could also skip the pull in that case. Or do a
>> min(headlen - len, min(skb->data_len, MLX5E_RX_MAX_HEAD)). WDYT?
>
> Makes sense, but this line took me a bit to understand. Maybe consider
> checking len < headlen first?
>
> if (len < headlen) {
> headlen = min_t(u32, headlen - len, skb->data_len);
> __pskb_pull_tail(skb, headlen);
> }
>
Yes, that's what I had in mind when skipping the pull. I would also
tag this as likely.
> Another clarifying question. So this patch will improve the
> performance when the XDP programs don't change header length. For
> those that encap/decap, they should precisely pull only headers into
> the linear area for optimal performance. Is it correct?
>
Right for encap, but for decap not quite:
Let's say that the XDP program pulls 64B header into the linear part
and snips 4B of the encap out. This would result in a pull of an
additional 4B (headlen (64B) - len (60B) = 4B) which are now
data bytes => sub-optimal layout.
I don't see how we can improve this corner case though.
Thanks,
Dragos
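
The arithmetic discussed in this thread can be checked with a small userspace sketch. It assumes 16-bit headlen and len and models min_t(u16, a, b) as casting both operands to u16 before comparing; the function names are hypothetical and nothing here touches an skb.

```c
#include <assert.h>
#include <stdint.h>

/* Model of: headlen = min_t(u16, headlen - len, skb->data_len);
 * When len > headlen the subtraction wraps to a large u16 value, so the
 * result is capped by data_len and the whole fragment data gets pulled.
 */
static uint16_t pull_len_unguarded(uint16_t headlen, uint16_t len,
				   uint32_t data_len)
{
	uint16_t a = (uint16_t)(headlen - len);
	uint16_t b = (uint16_t)data_len;

	return a < b ? a : b;
}

/* Model of the guarded variant: skip the pull unless len < headlen. */
static uint32_t pull_len_guarded(uint16_t headlen, uint16_t len,
				 uint32_t data_len)
{
	uint32_t want;

	assert(headlen > 0);

	if (len >= headlen)
		return 0;	/* linear part already covers headlen */

	want = (uint32_t)headlen - len;
	return want < data_len ? want : data_len;
}
```

For the decap case described above (headlen 64, len 60), both variants return 4, which is the extra 4B pull of data bytes. For len 100 and data_len 1000, the unguarded form returns 1000 while the guarded form returns 0.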
^ permalink raw reply
* Re: [PATCH net-next v5 3/5] veth: implement Byte Queue Limits (BQL) for latency reduction
From: Simon Schippers @ 2026-05-08 9:20 UTC (permalink / raw)
To: Jesper Dangaard Brouer, Paolo Abeni, netdev
Cc: kernel-team, Andrew Lunn, David S. Miller, Eric Dumazet,
Jakub Kicinski, Alexei Starovoitov, Daniel Borkmann,
John Fastabend, Stanislav Fomichev, linux-kernel, bpf
In-Reply-To: <21d639fc-e244-486e-8368-8891b3c43215@schippers-hamm.de>
On 5/8/26 10:01, Simon Schippers wrote:
> On 5/7/26 22:45, Jesper Dangaard Brouer wrote:
>>
>>
>> On 07/05/2026 22.12, Simon Schippers wrote:
>>> On 5/7/26 21:09, Jesper Dangaard Brouer wrote:
>>>>
>>>>
>>>> On 07/05/2026 16.46, Simon Schippers wrote:
>>>>>
>>>>>
>>>>> On 5/7/26 16:34, Paolo Abeni wrote:
>>>>>> On 5/7/26 8:54 AM, Simon Schippers wrote:
>>>>>>> On 5/5/26 15:21, hawk@kernel.org wrote:
>>>>>>>> @@ -928,9 +968,13 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
>>>>>>>> }
>>>>>>>> } else {
>>>>>>>> /* ndo_start_xmit */
>>>>>>>> - struct sk_buff *skb = ptr;
>>>>>>>> + bool bql_charged = veth_ptr_is_bql(ptr);
>>>>>>>> + struct sk_buff *skb = veth_ptr_to_skb(ptr);
>>>>>>>> stats->xdp_bytes += skb->len;
>>>>>>>> + if (peer_txq && bql_charged)
>>>>>>>> + netdev_tx_completed_queue(peer_txq, 1, VETH_BQL_UNIT);
>>>>>>>
>>>>>>> In the discussion with Jonas [1], I left a comment explaining why I think
>>>>>>> this doesn’t work.
>>>>>>>
>>>>
>>>> I've experimented with doing the "completion" at NAPI-end in
>>>> veth_poll(), but that resulted in BQL limit being 128 packets, which
>>>> leads to bad latency results (not acceptable).
>>>> (See detailed report later)
>>>>
>>>>
>>>>>>> I still think first that adding an option to modify the hard-coded
>>>>>>> VETH_RING_SIZE is the way to go.
>>>>>>>
>>>>
>>>> Not against being able to modify VETH_RING_SIZE, but I don't think it is
>>>> the solution here.
>>>>
>>>> The simple solution is to configure BQL limit_min:
>>>> `/sys/class/net/<dev>/queues/tx-N/byte_queue_limits/limit_min`
>>>>
>>>> My experiments (below) find that limit_min=8 gives good performance.
>>>> We can simply set default to 8 as this still allows userspace to change
>>>> this later if lower latency is preferred.
>>>>
>>>>>>> Thanks!
>>>>>>>
>>>>>>> [1] Link: https://lore.kernel.org/netdev/e8cdba04-aa9a-45c6-9807-8274b62920df@tu-dortmund.de/
>>>>>>
>>>>>> In the above discussion a 20% regression is reported, which IMHO can't
>>>>>> be ignored. Still the tput figures in the data are extremely low,
>>>>>> something is possibly off?!? I would expect a few Mpps with pktgen on
>>>>>> top of veth, while the reported data is ~20-30Kpps.
>>>>>>
>>>>>> /P
>>>>>>
>>>>>
>>>>> The ~20-30Kpps occur when thousands of iptables rules are applied and
>>>>> an UDP userspace application is sending.
>>>>>
>>>>> And there is a 20% pktgen regression (no iptables rules applied).
>>>>>
>>>>
>>>> The pktgen test is a little dubious/weird and Jonas had to modify pktgen
>>>> to test this. John Fastabend added a config to pktgen that allows us
>>>> to benchmark the egress qdisc path; it might be better to use that.
>>>> The samples/pktgen/pktgen_bench_xmit_mode_queue_xmit.sh is a demo usage.
>>>>
>>>> If redoing the tests, can you adjust limit_min to see the effect?
>>>> /sys/class/net/<dev>/queues/tx-N/byte_queue_limits/limit_min
>>>>
>>>> 20% throughput performance regression is of course too much, but I will
>>>> remind us, that adding a qdisc will "cost" some overhead, that is a
>>>> configuration choice. Our purpose here is to reduce bufferbloat and
>>>> latency, not optimize for throughput.
>>>>
>>>>
>>>>> I am pretty sure the reason is because the BQL limit is stuck at 2
>>>>> packets (because the completed queue is always called with 1 packet
>>>>> and not in a interrupt/timer with multiple packets...).
>>>>>
>>>>
>>>> I've run a lot of experiments and had an AI write a report on them; see attachment. The TL;DR is that the best performance vs latency tradeoff is defaulting BQL/DQL limit_min to 8 packets.
>>>>
>>>> I fear this patchset will stall forever, if we keep searching for a perfect solution without any overhead. The qdisc layer will be a baseline overhead. The limit=2 packets is actually the optimal darkbuffer queue size, but I acknowledge that this causes too many qdisc requeue events (leading to overhead). I suggest that I add another patch in V6, that defaults limit_min to 8 (separate patch to make it easier to revert/adjust later).
>>>>
>>>> I've talked with Jonas, and we want to experiment with different solutions to make BQL/DQL work better with virtual devices.
>>>>
>>>> This patchset helps our (production) use-case reduce mice-flow latency
>>>> from approx 22ms to 1.3ms for latency under-load. Due to the consumer
>>>> namespace being the bottleneck the requeue overhead is negligible in
>>>> comparison.
>>>>
>>>> -Jesper
>>>
>>> First of all thanks for your work and I really see the advantages of
>>> avoiding bufferbloat :)
>>>
>>> But the key of the BQL algorithm, which is the *dynamic* adaption of the
>>> limit, is not working. Always calling netdev_completed_queue() with
>>> 1 packet results in a static limit of 2 packets (as seen by Jonas
>>> measurements), which you force up to 8 packets.
>>>
>>> So in the end this patchset has the same effect as just setting
>>> VETH_RING_SIZE to 8 (and giving an option to change this value).
>>>
>>
>> I've coded up a time-based BQL implementation, see attachment.
>> WDYT?
>>
>> --Jesper
>>
>
> A step in the right direction, but I dislike that you call
> netdev_sent_queue() with at least 1 packet (never 0 packets).
> I am not sure if it works, and I am not sure about the parameter.
>
Thinking about it again, this could be fine, but it really needs testing because
the weird thing is that BQL's inflight != number of packets
in the ring and BQL's limit != "current ring size". Instead, the BQL
limit describes the maximum number of packets allowed between
calls of netdev_sent_queue().
I messed up in my approach below. Forget it :P
>
> I would propose doing it like other BQL implementations do
> (for example usbnet for which I adapted BQL [1] :) ):
>
> Call netdev_sent_queue() with n_bql in a periodic work. n_bql would
> still be counted in veth_xdp_rcv() like you currently do (synchronized
> with the work via ring.consumer_lock?).
>
> The only weird thing that remains is that BQL's inflight != number of
> packets in the ring and BQL's limit != "current ring size". Instead
> the BQL limit describes the number of maximal allowed packets between
> calls of netdev_sent_queue(), which occur periodically in a somewhat
> fixed time interval.
> I guess that could be fine, but it surely needs testing.
>
> [1] Link: https://lore.kernel.org/netdev/20251106175615.26948-1-simon.schippers@tu-dortmund.de/
>
^ permalink raw reply
* [PATCH net] net: wwan: iosm: fix potential memory leaks in ipc_imem_init()
From: Abdun Nihaal @ 2026-05-08 9:21 UTC (permalink / raw)
To: loic.poulain
Cc: Abdun Nihaal, ryazanov.s.a, johannes, andrew+netdev, davem,
edumazet, kuba, pabeni, netdev, linux-kernel, m.chetan.kumar,
stable
The memory allocated in ipc_protocol_init() is not freed on the error
paths that follow in ipc_imem_init(). Fix that by calling the
corresponding release function ipc_protocol_deinit() in the error path.
Fixes: 3670970dd8c6 ("net: iosm: shared memory IPC interface")
Cc: stable@vger.kernel.org
Signed-off-by: Abdun Nihaal <nihaal@cse.iitm.ac.in>
---
Compile tested only. Issue found using static analysis.
drivers/net/wwan/iosm/iosm_ipc_imem.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/drivers/net/wwan/iosm/iosm_ipc_imem.c b/drivers/net/wwan/iosm/iosm_ipc_imem.c
index 1b7bc7d63a2e..f4edb277efd9 100644
--- a/drivers/net/wwan/iosm/iosm_ipc_imem.c
+++ b/drivers/net/wwan/iosm/iosm_ipc_imem.c
@@ -1422,6 +1422,7 @@ struct iosm_imem *ipc_imem_init(struct iosm_pcie *pcie, unsigned int device_id,
hrtimer_cancel(&ipc_imem->fast_update_timer);
hrtimer_cancel(&ipc_imem->tdupdate_timer);
hrtimer_cancel(&ipc_imem->startup_timer);
+ ipc_protocol_deinit(ipc_imem->ipc_protocol);
protocol_init_fail:
cancel_work_sync(&ipc_imem->run_state_worker);
ipc_task_deinit(ipc_imem->ipc_task);
--
2.43.0
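
The unwind ordering that the patch relies on can be illustrated with a minimal userspace sketch. Everything here is hypothetical (`struct demo_state`, `enum fail_at`, `demo_init()`); it only models the pattern where a release is placed above the `protocol_init_fail` label so that it runs for later failures but not when the protocol allocation itself failed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flags tracking which resources are currently held. */
struct demo_state {
	bool task;
	bool protocol;
	bool timers;
};

enum fail_at { FAIL_NONE, FAIL_PROTOCOL, FAIL_LATE };

static int demo_init(struct demo_state *s, enum fail_at fail)
{
	assert(s != NULL);

	s->task = true;
	if (fail == FAIL_PROTOCOL)
		goto protocol_init_fail;	/* protocol was never allocated */
	s->protocol = true;
	s->timers = true;
	if (fail == FAIL_LATE)
		goto late_fail;
	return 0;

late_fail:
	s->timers = false;
	s->protocol = false;	/* the release step the patch adds */
protocol_init_fail:
	s->task = false;
	return -1;
}
```

A late failure falls through both labels and releases everything in reverse order of acquisition; an early failure jumps past the protocol release, since there is nothing to free at that point.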
^ permalink raw reply related