Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH v3 1/2] net/sched: dualpi2: fix GSO backlog accounting
From: Jamal Hadi Salim @ 2026-06-20 13:35 UTC (permalink / raw)
  To: Xingquan Liu; +Cc: netdev, Jiri Pirko, Victor Nogueira, Chia-Yu Chang, stable
In-Reply-To: <20260619151447.223640-1-b1n@b1n.io>

On Fri, Jun 19, 2026 at 11:15 AM Xingquan Liu <b1n@b1n.io> wrote:
>
> When DualPI2 splits a GSO skb into N segments, it propagates N
> additional packets to its parent before returning NET_XMIT_SUCCESS.
> The parent then accounts for the original skb once more, leaving its
> qlen one larger than the number of packets actually queued.
>
> With QFQ as the parent, after all real packets are dequeued, QFQ still
> has a non-zero qlen while its in-service aggregate has no active
> classes. qfq_choose_next_agg() returns NULL and qfq_dequeue() passes
> the result to qfq_peek_skb(), causing a NULL pointer dereference.
>
> Follow the same pattern used by tbf_segment() and taprio: count only
> successfully queued segments, propagate the difference between the
> original skb and those segments, and return NET_XMIT_SUCCESS whenever
> at least one segment was queued.
>
> Fixes: 8f9516daedd6 ("sched: Add enqueue/dequeue of dualpi2 qdisc")
> Cc: stable@vger.kernel.org
> Signed-off-by: Xingquan Liu <b1n@b1n.io>

Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>

cheers,
jamal

> ---
> v3:
> - Move the UDP GSO sender into tdc_gso.py.
>
> v2:
> - Change patch commit message.
> - Add tdc test.
>
>  net/sched/sch_dualpi2.c | 11 +++++------
>  1 file changed, 5 insertions(+), 6 deletions(-)
>
> diff --git a/net/sched/sch_dualpi2.c b/net/sched/sch_dualpi2.c
> index d7c3254ef800..5434df6ca8ef 100644
> --- a/net/sched/sch_dualpi2.c
> +++ b/net/sched/sch_dualpi2.c
> @@ -461,7 +461,7 @@ static int dualpi2_qdisc_enqueue(struct sk_buff *skb, struct Qdisc *sch,
>                 if (IS_ERR_OR_NULL(nskb))
>                         return qdisc_drop(skb, sch, to_free);
>
> -               cnt = 1;
> +               cnt = 0;
>                 byte_len = 0;
>                 orig_len = qdisc_pkt_len(skb);
>                 skb_list_walk_safe(nskb, nskb, next) {
> @@ -488,16 +488,15 @@ static int dualpi2_qdisc_enqueue(struct sk_buff *skb, struct Qdisc *sch,
>                                 byte_len += nskb->len;
>                         }
>                 }
> -               if (cnt > 1) {
> +               if (cnt > 0) {
>                         /* The caller will add the original skb stats to its
>                          * backlog, compensate this if any nskb is enqueued.
>                          */
> -                       --cnt;
> -                       byte_len -= orig_len;
> +                       qdisc_tree_reduce_backlog(sch, 1 - cnt,
> +                                                 orig_len - byte_len);
>                 }
> -               qdisc_tree_reduce_backlog(sch, -cnt, -byte_len);
>                 consume_skb(skb);
> -               return err;
> +               return cnt > 0 ? NET_XMIT_SUCCESS : err;
>         }
>         return dualpi2_enqueue_skb(skb, sch, to_free);
>  }
>
> base-commit: 96e7f9122aae0ed000ee321f324b812a447906d9
> --
> Xingquan Liu
>

^ permalink raw reply

* Re: [PATCH v3 2/2] selftests/tc-testing: Add DualPI2 GSO backlog accounting test
From: Jamal Hadi Salim @ 2026-06-20 13:36 UTC (permalink / raw)
  To: Xingquan Liu; +Cc: netdev, Jiri Pirko, Victor Nogueira, Chia-Yu Chang, stable
In-Reply-To: <20260619151447.223640-2-b1n@b1n.io>

On Fri, Jun 19, 2026 at 11:16 AM Xingquan Liu <b1n@b1n.io> wrote:
>
> Add a regression test for DualPI2 GSO backlog accounting when it is
> used as a child qdisc of QFQ.
>
> The test sends one UDP GSO datagram through a QFQ class with DualPI2 as
> the leaf qdisc. DualPI2 splits the skb into two segments. After the
> traffic drains, both QFQ and DualPI2 must report zero backlog and zero
> qlen.
>
> On kernels with the broken accounting, QFQ can keep a stale non-zero
> qlen after all real packets have been dequeued.
>
> Signed-off-by: Xingquan Liu <b1n@b1n.io>

Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>

cheers,
jamal

> ---
>  .../tc-testing/tc-tests/qdiscs/dualpi2.json   | 44 +++++++++++++++++++
>  tools/testing/selftests/tc-testing/tdc_gso.py | 43 ++++++++++++++++++
>  2 files changed, 87 insertions(+)
>  create mode 100755 tools/testing/selftests/tc-testing/tdc_gso.py
>
> diff --git a/tools/testing/selftests/tc-testing/tc-tests/qdiscs/dualpi2.json b/tools/testing/selftests/tc-testing/tc-tests/qdiscs/dualpi2.json
> index cd1f2ee8f354..ed6a900bb568 100644
> --- a/tools/testing/selftests/tc-testing/tc-tests/qdiscs/dualpi2.json
> +++ b/tools/testing/selftests/tc-testing/tc-tests/qdiscs/dualpi2.json
> @@ -250,5 +250,49 @@
>          "teardown": [
>              "$TC qdisc del dev $DUMMY handle 1: root"
>          ]
> +    },
> +    {
> +        "id": "891f",
> +        "name": "Verify DualPI2 GSO backlog accounting with QFQ parent",
> +        "category": [
> +            "qdisc",
> +            "dualpi2",
> +            "qfq",
> +            "gso"
> +        ],
> +        "plugins": {
> +            "requires": "nsPlugin"
> +        },
> +        "setup": [
> +            "$IP link set dev $DUMMY up || true",
> +            "$IP addr add 10.10.10.10/24 dev $DUMMY || true",
> +            "$TC qdisc add dev $DUMMY root handle 1: qfq",
> +            "$TC class add dev $DUMMY parent 1: classid 1:1 qfq weight 1 maxpkt 4096",
> +            "$TC qdisc add dev $DUMMY parent 1:1 handle 2: dualpi2",
> +            "$TC filter add dev $DUMMY parent 1: matchall classid 1:1"
> +        ],
> +        "cmdUnderTest": "./tdc_gso.py 10.10.10.10 10.10.10.1 9000 1200 2400",
> +        "expExitCode": "0",
> +        "verifyCmd": "$TC -j -s qdisc ls dev $DUMMY",
> +        "matchJSON": [
> +            {
> +                "kind": "qfq",
> +                "handle": "1:",
> +                "packets": 2,
> +                "backlog": 0,
> +                "qlen": 0
> +            },
> +            {
> +                "kind": "dualpi2",
> +                "handle": "2:",
> +                "packets": 2,
> +                "backlog": 0,
> +                "qlen": 0
> +            }
> +        ],
> +        "teardown": [
> +            "$TC qdisc del dev $DUMMY root",
> +            "$IP addr del 10.10.10.10/24 dev $DUMMY || true"
> +        ]
>      }
>  ]
> diff --git a/tools/testing/selftests/tc-testing/tdc_gso.py b/tools/testing/selftests/tc-testing/tdc_gso.py
> new file mode 100755
> index 000000000000..b66528ea4b68
> --- /dev/null
> +++ b/tools/testing/selftests/tc-testing/tdc_gso.py
> @@ -0,0 +1,43 @@
> +#!/usr/bin/env python3
> +# SPDX-License-Identifier: GPL-2.0
> +
> +"""
> +tdc_gso.py - send a UDP GSO datagram
> +
> +Copyright (C) 2026 Xingquan Liu <b1n@b1n.io>
> +"""
> +
> +import argparse
> +import socket
> +import struct
> +import sys
> +
> +UDP_MAX_SEGMENTS = 1 << 7
> +
> +
> +parser = argparse.ArgumentParser(description="UDP GSO datagram sender")
> +parser.add_argument("src", help="source IPv4 address")
> +parser.add_argument("dst", help="destination IPv4 address")
> +parser.add_argument("port", type=int, help="destination UDP port")
> +parser.add_argument("gso_size", type=int, help="UDP GSO segment payload size")
> +parser.add_argument("payload_len", type=int, help="total UDP payload length")
> +args = parser.parse_args()
> +
> +if args.gso_size <= 0 or args.gso_size > 0xFFFF:
> +    parser.error("gso_size must fit in an unsigned 16-bit integer")
> +if args.payload_len <= args.gso_size:
> +    parser.error("payload_len must be larger than gso_size")
> +if args.payload_len > args.gso_size * UDP_MAX_SEGMENTS:
> +    parser.error("payload_len exceeds UDP_MAX_SEGMENTS")
> +
> +SOL_UDP = getattr(socket, "SOL_UDP", socket.IPPROTO_UDP)
> +UDP_SEGMENT = getattr(socket, "UDP_SEGMENT", 103)
> +
> +sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
> +sock.bind((args.src, 0))
> +
> +payload = b"b" * args.payload_len
> +cmsg = [(SOL_UDP, UDP_SEGMENT, struct.pack("=H", args.gso_size))]
> +
> +sent = sock.sendmsg([payload], cmsg, 0, (args.dst, args.port))
> +sys.exit(sent != len(payload))
> --
> Xingquan Liu
>

^ permalink raw reply

* Re: [PATCH v3 1/2] net/sched: dualpi2: fix GSO backlog accounting
From: Jamal Hadi Salim @ 2026-06-20 13:39 UTC (permalink / raw)
  To: Xingquan Liu; +Cc: netdev, Jiri Pirko, Victor Nogueira, Chia-Yu Chang, stable
In-Reply-To: <20260619151447.223640-1-b1n@b1n.io>

On Fri, Jun 19, 2026 at 11:15 AM Xingquan Liu <b1n@b1n.io> wrote:
>
> When DualPI2 splits a GSO skb into N segments, it propagates N
> additional packets to its parent before returning NET_XMIT_SUCCESS.
> The parent then accounts for the original skb once more, leaving its
> qlen one larger than the number of packets actually queued.
>
> With QFQ as the parent, after all real packets are dequeued, QFQ still
> has a non-zero qlen while its in-service aggregate has no active
> classes. qfq_choose_next_agg() returns NULL and qfq_dequeue() passes
> the result to qfq_peek_skb(), causing a NULL pointer dereference.
>
> Follow the same pattern used by tbf_segment() and taprio: count only
> successfully queued segments, propagate the difference between the
> original skb and those segments, and return NET_XMIT_SUCCESS whenever
> at least one segment was queued.
>
> Fixes: 8f9516daedd6 ("sched: Add enqueue/dequeue of dualpi2 qdisc")
> Cc: stable@vger.kernel.org
> Signed-off-by: Xingquan Liu <b1n@b1n.io>

BTW, in the future make sure you wait at least 24 hours before you
send an updated version.
You didnt do that here..

cheers,
jamal
> ---
> v3:
> - Move the UDP GSO sender into tdc_gso.py.
>
> v2:
> - Change patch commit message.
> - Add tdc test.
>
>  net/sched/sch_dualpi2.c | 11 +++++------
>  1 file changed, 5 insertions(+), 6 deletions(-)
>
> diff --git a/net/sched/sch_dualpi2.c b/net/sched/sch_dualpi2.c
> index d7c3254ef800..5434df6ca8ef 100644
> --- a/net/sched/sch_dualpi2.c
> +++ b/net/sched/sch_dualpi2.c
> @@ -461,7 +461,7 @@ static int dualpi2_qdisc_enqueue(struct sk_buff *skb, struct Qdisc *sch,
>                 if (IS_ERR_OR_NULL(nskb))
>                         return qdisc_drop(skb, sch, to_free);
>
> -               cnt = 1;
> +               cnt = 0;
>                 byte_len = 0;
>                 orig_len = qdisc_pkt_len(skb);
>                 skb_list_walk_safe(nskb, nskb, next) {
> @@ -488,16 +488,15 @@ static int dualpi2_qdisc_enqueue(struct sk_buff *skb, struct Qdisc *sch,
>                                 byte_len += nskb->len;
>                         }
>                 }
> -               if (cnt > 1) {
> +               if (cnt > 0) {
>                         /* The caller will add the original skb stats to its
>                          * backlog, compensate this if any nskb is enqueued.
>                          */
> -                       --cnt;
> -                       byte_len -= orig_len;
> +                       qdisc_tree_reduce_backlog(sch, 1 - cnt,
> +                                                 orig_len - byte_len);
>                 }
> -               qdisc_tree_reduce_backlog(sch, -cnt, -byte_len);
>                 consume_skb(skb);
> -               return err;
> +               return cnt > 0 ? NET_XMIT_SUCCESS : err;
>         }
>         return dualpi2_enqueue_skb(skb, sch, to_free);
>  }
>
> base-commit: 96e7f9122aae0ed000ee321f324b812a447906d9
> --
> Xingquan Liu
>

^ permalink raw reply

* RE: Ethtool : PRBS feature
From: Das, Shubham @ 2026-06-20 13:48 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Alexander H Duyck, lee@trager.us, netdev@vger.kernel.org,
	mkubecek@suse.cz, D H, Siddaraju, Chintalapalle, Balaji,
	Lindberg, Magnus, niklas.damberg@ericsson.com
In-Reply-To: <26971635-b13a-4ed3-b2cb-ce35da9b63d3@lunn.ch>

> Can you change the firmware to expose the 802.3 registers for PRBS?
> You can then write a library which both plylib and your driver can use.

Andrew,

No, exposing the PRBS registers to drivers is not possible in our design (the registers are buried deep within the Accelerator/NIC/PHY/Analog IP hierarchy).

Additionally, the PHY PRBS registers are not in accordance with the IEEE Clause 45 definitions. For instance, the PRBS registers are paged and 32-bit wide.

Given these constraints, we think ethtool --phy-test is a reasonable starting point for exposing the long-established Ethernet PRBS functionality to Linux userspace, as it aligns well with the driver-owned NIC architecture model. If you think a more generic layered approach would be preferable, we would appreciate guidance on the expected architecture. That would help us better understand the implementation complexity, required effort, and delivery timelines.

Thanks,
Shubham D

> -----Original Message-----
> From: Andrew Lunn <andrew@lunn.ch>
> Sent: 20 June 2026 00:07
> To: Das, Shubham <shubham.das@intel.com>
> Cc: Alexander H Duyck <alexander.duyck@gmail.com>; lee@trager.us;
> netdev@vger.kernel.org; mkubecek@suse.cz; D H, Siddaraju
> <siddaraju.dh@intel.com>; Chintalapalle, Balaji
> <balaji.chintalapalle@intel.com>; Lindberg, Magnus
> <magnus.k.lindberg@ericsson.com>; niklas.damberg@ericsson.com
> Subject: Re: Ethtool : PRBS feature
> 
> > The host driver does not directly access any registers but requests
> > the PHY FW to manage PRBS on behalf of it.
> 
> Maybe a dumb question. Why?
> 
> Can you change the firmware to expose the 802.3 registers for PRBS?
> You can then write a library which both plylib and your driver can use.
> 
> 	Andrew

^ permalink raw reply

* Re: [PATCH bpf] bpf: tcp: Fix use-after-free in bpf_iter_tcp_established_batch()
From: Jiayuan Chen @ 2026-06-20 14:06 UTC (permalink / raw)
  To: Jose Fernandez (Anthropic), Eric Dumazet, Neal Cardwell,
	Kuniyuki Iwashima, David S. Miller, Jakub Kicinski, Paolo Abeni,
	Simon Horman, Andrii Nakryiko, Yonghong Song, Martin KaFai Lau
  Cc: netdev, linux-kernel, bpf, Ben Cressey
In-Reply-To: <20260620-bpf-iter-tcp-refcnt-v1-1-883bf9e69495@linux.dev>


On 6/20/26 8:32 AM, Jose Fernandez (Anthropic) wrote:
> reqsk_queue_hash_req() publishes a TCP_NEW_SYN_RECV request_sock onto
> the ehash chain (via inet_ehash_insert(), which drops the bucket lock on
> return) and only afterwards refcount_set()s rsk_refcnt to 3.
>
> Lockless readers such as __inet_lookup_established() account for this by
> using refcount_inc_not_zero(), but bpf_iter_tcp_established_batch() uses
> plain sock_hold() while holding the bucket lock, on the assumption that
> the lock guarantees sk_refcnt > 0. That assumption does not hold for
> request_sock:
>
>    CPU 0                                CPU 1
>    -----                                -----
>    tcp_conn_request()
>     reqsk_queue_hash_req()
>      inet_ehash_insert(req)
>       spin_lock(bucket)
>       __sk_nulls_add_node_rcu(req)      // rsk_refcnt == 0
>       spin_unlock(bucket)
>                                         bpf_iter_tcp_established_batch()
>                                          spin_lock(bucket)
>                                          sock_hold(req)   <-- addition on 0
>                                          spin_unlock(bucket)
>      refcount_set(&req->rsk_refcnt, 3)  // clobbers saturated value
>
> which surfaces as:
>
>    refcount_t: addition on 0; use-after-free.
>    WARNING: lib/refcount.c:25 at refcount_warn_saturate+0x48/0x90, CPU#1
>    Call Trace:
>     bpf_iter_tcp_established_batch+0x14e/0x170
>     bpf_iter_tcp_batch+0x53/0x200
>     bpf_iter_tcp_seq_next+0x27/0x70
>     bpf_seq_read+0x107/0x410
>     vfs_read+0xb9/0x380
>
> refcount_warn_saturate() then saturates the count, the publishing CPU's
> refcount_set() clobbers it, and the socket is left one reference short.
> When the last legitimate owner drops its reference the reqsk is freed
> while still reachable, leading to use-after-free panics in e.g.
> inet_csk_accept() or inet_csk_listen_stop().
>
> This reproduces in seconds with tcp_syncookies=0, a handful of threads
> doing connect()/close() to a local listener while others read an
> iter/tcp link in a tight loop.
>
> Use refcount_inc_not_zero() and skip the socket on failure, the same way
> every other ehash walker does. The listening hash is unaffected as
> listeners are always inserted into lhash2 with sk_refcnt >= 1, so
> bpf_iter_tcp_listening_batch() is left as-is.
>
> If every matching socket in a bucket is mid-init, end_sk can stay at 0;
> advance to the next bucket in that case rather than terminating the
> whole iteration on a stale batch[0].
>
> Fixes: 04c7820b776f ("bpf: tcp: Bpf iter batching and lock_sock")
> Reviewed-by: Ben Cressey <ben@cressey.dev>
> Assisted-by: Claude:unspecified
> Signed-off-by: Jose Fernandez (Anthropic) <jose.fernandez@linux.dev>


LGTM.

Reviewed-by: Jiayuan Chen <jiayuan.chen@linux.dev>


> ---
>   net/ipv4/tcp_ipv4.c | 35 ++++++++++++++++++++---------------
>   1 file changed, 20 insertions(+), 15 deletions(-)
>
> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> index fdc81150ff6c..92342dcc6892 100644
> --- a/net/ipv4/tcp_ipv4.c
> +++ b/net/ipv4/tcp_ipv4.c
> @@ -3074,25 +3074,25 @@ static unsigned int bpf_iter_tcp_established_batch(struct seq_file *seq,
>   {
>   	struct bpf_tcp_iter_state *iter = seq->private;
>   	struct hlist_nulls_node *node;
> -	unsigned int expected = 1;
> -	struct sock *sk;
> +	unsigned int expected = 0;
> +	struct sock *sk = *start_sk;
>   
> -	sock_hold(*start_sk);
> -	iter->batch[iter->end_sk++].sk = *start_sk;
> -
> -	sk = sk_nulls_next(*start_sk);



Folding the open-coded first *start_sk into the loop is a good
cleanup — it was the one socket that bypassed the refcnt check.


The double-put on the realloc-failure path reported by ai is a separate,
pre-existing issue and can be addressed in a follow-up.



^ permalink raw reply

* Re: Ethtool : PRBS feature
From: Maxime Chevallier @ 2026-06-20 14:39 UTC (permalink / raw)
  To: Das, Shubham, Andrew Lunn
  Cc: Alexander H Duyck, lee@trager.us, netdev@vger.kernel.org,
	mkubecek@suse.cz, D H, Siddaraju, Chintalapalle, Balaji,
	Lindberg, Magnus, niklas.damberg@ericsson.com
In-Reply-To: <SN7PR11MB81096CF160930DB8FA278B3CFFE12@SN7PR11MB8109.namprd11.prod.outlook.com>

Hi,

On 6/20/26 15:48, Das, Shubham wrote:
>> Can you change the firmware to expose the 802.3 registers for PRBS?
>> You can then write a library which both plylib and your driver can use.
> 
> Andrew,
> 
> No, exposing the PRBS registers to drivers is not possible in our design (the registers are buried deep within the Accelerator/NIC/PHY/Analog IP hierarchy).
> 
> Additionally, the PHY PRBS registers are not in accordance with the IEEE Clause 45 definitions. For instance, the PRBS registers are paged and 32-bit wide.
> 
> Given these constraints, we think ethtool --phy-test is a reasonable starting point for exposing the long-established Ethernet PRBS functionality to Linux userspace, as it aligns well with the driver-owned NIC architecture model. If you think a more generic layered approach would be preferable, we would appreciate guidance on the expected architecture. That would help us better understand the implementation complexity, required effort, and delivery timelines.

Can you elaborate on what you have in mind for now ? what would the
"ethtool --phy-test" command look like in terms of its behaviour and
parameters ?

This feature is interesting for multiple people, each having different
hardware designs and constraints. It's good to consider an iterative
approach to build this, however we need to have in mind that this is
uAPI, so once we commit to a design choice, we have to live with it.

We do have flexibility on the kernel side of the API. We can implement PRBS
in generic PHY, phylib, some MAC driver that talks to a firmware, etc. and
hide away these implementation details to userspace, but we need to make
sure the uAPI we come up with allows us to support all of that.

Let's figure this out together, if you already have some ideas in mind we
can use that as a starting point for the discussion :)

Maxime

> 
> Thanks,
> Shubham D
> 
>> -----Original Message-----
>> From: Andrew Lunn <andrew@lunn.ch>
>> Sent: 20 June 2026 00:07
>> To: Das, Shubham <shubham.das@intel.com>
>> Cc: Alexander H Duyck <alexander.duyck@gmail.com>; lee@trager.us;
>> netdev@vger.kernel.org; mkubecek@suse.cz; D H, Siddaraju
>> <siddaraju.dh@intel.com>; Chintalapalle, Balaji
>> <balaji.chintalapalle@intel.com>; Lindberg, Magnus
>> <magnus.k.lindberg@ericsson.com>; niklas.damberg@ericsson.com
>> Subject: Re: Ethtool : PRBS feature
>>
>>> The host driver does not directly access any registers but requests
>>> the PHY FW to manage PRBS on behalf of it.
>>
>> Maybe a dumb question. Why?
>>
>> Can you change the firmware to expose the 802.3 registers for PRBS?
>> You can then write a library which both plylib and your driver can use.
>>
>> 	Andrew
> 


^ permalink raw reply

* [PATCH net v3] net: airoha: fix BQL underflow in shared QDMA TX ring
From: Lorenzo Bianconi @ 2026-06-20 15:04 UTC (permalink / raw)
  To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Lorenzo Bianconi
  Cc: Wayen Yan, linux-arm-kernel, linux-mediatek, netdev

When multiple netdevs share a QDMA TX ring and one device is stopped,
netdev_tx_reset_subqueue() zeroes that device's BQL counters while its
pending skbs remain in the shared HW TX ring. When NAPI later completes
those skbs via netdev_tx_completed_queue(), the already-zeroed
dql->num_queued counter underflows.

Fix the issue:
- Remove netdev_tx_reset_subqueue() from airoha_dev_stop() so pending
  skbs are completed naturally by NAPI with proper BQL accounting.
- Rework airoha_qdma_tx_cleanup() to disable TX DMA, flush BQL
  counters, DMA-unmap and free all pending skbs while skb->dev
  references are still valid. Use a per-queue flushing flag checked
  under q->lock in airoha_dev_xmit() to prevent races between teardown
  and transmit. Call airoha_qdma_stop_napi() before
  airoha_qdma_tx_cleanup() at the call sites.
- Move DMA engine start into probe. Split DMA teardown so TX DMA is
  disabled in airoha_qdma_tx_cleanup() and RX DMA in
  airoha_qdma_cleanup().
- Remove qdma->users counter since DMA lifetime is now tied to
  probe/cleanup rather than per-netdev open/stop.

Fixes: a9c2ca61fec7 ("net: airoha: Support multiple net_devices for a single FE GDM port")
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
---
Changes in v3:
- Remove airoha_qdma users counter.
- Drop DEV_STATE_FLUSH bit and add per-queue flushing bool to avoid any
  race between airoha_qdma_tx_flush() and airoha_dev_xmit().
- Refactor airoha_qdma_cleanup_tx_queue().
- Rename airoha_qdma_cleanup_tx_queue() in airoha_qdma_tx_cleanup().
- Link to v2: https://lore.kernel.org/r/20260619-airoha-bql-fixes-v2-1-4351d6a24484@kernel.org

Changes in v2:
- Introduce airoha_qdma_tx_flush() to account BQL in airoha_remove() or
  airoha_probe() error path.
- Fix possible NULL pointer dereference in airoha_qdma_cleanup().
- Introduce DEV_STATE_FLUSH().
- Move back airoha_hw_cleanup().
- Set proper Fixes tag.
- Link to v1: https://lore.kernel.org/r/20260618-airoha-bql-fixes-v1-1-ffd2c2089518@kernel.org
---
 drivers/net/ethernet/airoha/airoha_eth.c | 166 +++++++++++++++++--------------
 drivers/net/ethernet/airoha/airoha_eth.h |   3 +-
 2 files changed, 93 insertions(+), 76 deletions(-)

diff --git a/drivers/net/ethernet/airoha/airoha_eth.c b/drivers/net/ethernet/airoha/airoha_eth.c
index 64dde6464f3f..8b81e932d535 100644
--- a/drivers/net/ethernet/airoha/airoha_eth.c
+++ b/drivers/net/ethernet/airoha/airoha_eth.c
@@ -1004,6 +1004,7 @@ static int airoha_qdma_tx_napi_poll(struct napi_struct *napi, int budget)
 
 		e = &q->entry[index];
 		skb = e->skb;
+		e->skb = NULL;
 
 		dma_unmap_single(eth->dev, e->dma_addr, e->dma_len,
 				 DMA_TO_DEVICE);
@@ -1147,55 +1148,76 @@ static int airoha_qdma_init_tx(struct airoha_qdma *qdma)
 	return 0;
 }
 
-static void airoha_qdma_cleanup_tx_queue(struct airoha_queue *q)
+static void airoha_qdma_tx_cleanup(struct airoha_qdma *qdma)
 {
-	struct airoha_qdma *qdma = q->qdma;
-	struct airoha_eth *eth = qdma->eth;
-	int i, qid = q - &qdma->q_tx[0];
-	u16 index = 0;
+	u32 status;
+	int i;
 
-	spin_lock_bh(&q->lock);
-	for (i = 0; i < q->ndesc; i++) {
-		struct airoha_queue_entry *e = &q->entry[i];
-		struct airoha_qdma_desc *desc = &q->desc[i];
+	airoha_qdma_clear(qdma, REG_QDMA_GLOBAL_CFG,
+			  GLOBAL_CFG_TX_DMA_EN_MASK);
+	if (read_poll_timeout(airoha_qdma_rr, status,
+			      !(status & GLOBAL_CFG_TX_DMA_BUSY_MASK),
+			      USEC_PER_MSEC, 50 * USEC_PER_MSEC, true,
+			      qdma, REG_QDMA_GLOBAL_CFG))
+		dev_warn(qdma->eth->dev, "QDMA TX DMA busy timeout\n");
 
-		if (!e->dma_addr)
+	for (i = 0; i < ARRAY_SIZE(qdma->q_tx); i++) {
+		struct airoha_queue *q = &qdma->q_tx[i];
+		u16 index = 0;
+		int j;
+
+		if (!q->ndesc)
 			continue;
 
-		dma_unmap_single(eth->dev, e->dma_addr, e->dma_len,
-				 DMA_TO_DEVICE);
-		dev_kfree_skb_any(e->skb);
-		e->dma_addr = 0;
-		e->skb = NULL;
-		list_add_tail(&e->list, &q->tx_list);
+		spin_lock_bh(&q->lock);
 
-		/* Reset DMA descriptor */
-		WRITE_ONCE(desc->ctrl, 0);
-		WRITE_ONCE(desc->addr, 0);
-		WRITE_ONCE(desc->data, 0);
-		WRITE_ONCE(desc->msg0, 0);
-		WRITE_ONCE(desc->msg1, 0);
-		WRITE_ONCE(desc->msg2, 0);
+		q->flushing = true;
+		for (j = 0; j < q->ndesc; j++) {
+			struct airoha_queue_entry *e = &q->entry[j];
+			struct airoha_qdma_desc *desc = &q->desc[j];
+			struct sk_buff *skb = e->skb;
 
-		q->queued--;
-	}
+			if (!e->dma_addr)
+				continue;
 
-	if (!list_empty(&q->tx_list)) {
-		struct airoha_queue_entry *e;
+			dma_unmap_single(qdma->eth->dev, e->dma_addr,
+					 e->dma_len, DMA_TO_DEVICE);
+			e->dma_addr = 0;
+			list_add_tail(&e->list, &q->tx_list);
+
+			WRITE_ONCE(desc->ctrl, 0);
+			WRITE_ONCE(desc->addr, 0);
+			WRITE_ONCE(desc->data, 0);
+			WRITE_ONCE(desc->msg0, 0);
+			WRITE_ONCE(desc->msg1, 0);
+			WRITE_ONCE(desc->msg2, 0);
+
+			if (skb) {
+				struct netdev_queue *txq;
+
+				txq = skb_get_tx_queue(skb->dev, skb);
+				netdev_tx_completed_queue(txq, 1, skb->len);
+				dev_kfree_skb_any(skb);
+				e->skb = NULL;
+			}
 
-		e = list_first_entry(&q->tx_list, struct airoha_queue_entry,
-				     list);
-		index = e - q->entry;
-	}
-	/* Set TX_DMA_IDX to TX_CPU_IDX to notify the hw the QDMA TX ring is
-	 * empty.
-	 */
-	airoha_qdma_rmw(qdma, REG_TX_CPU_IDX(qid), TX_RING_CPU_IDX_MASK,
-			FIELD_PREP(TX_RING_CPU_IDX_MASK, index));
-	airoha_qdma_rmw(qdma, REG_TX_DMA_IDX(qid), TX_RING_DMA_IDX_MASK,
-			FIELD_PREP(TX_RING_DMA_IDX_MASK, index));
+			q->queued--;
+		}
 
-	spin_unlock_bh(&q->lock);
+		if (!list_empty(&q->tx_list)) {
+			struct airoha_queue_entry *e;
+
+			e = list_first_entry(&q->tx_list,
+					     struct airoha_queue_entry, list);
+			index = e - q->entry;
+		}
+		airoha_qdma_rmw(qdma, REG_TX_CPU_IDX(i), TX_RING_CPU_IDX_MASK,
+				FIELD_PREP(TX_RING_CPU_IDX_MASK, index));
+		airoha_qdma_rmw(qdma, REG_TX_DMA_IDX(i), TX_RING_DMA_IDX_MASK,
+				FIELD_PREP(TX_RING_DMA_IDX_MASK, index));
+
+		spin_unlock_bh(&q->lock);
+	}
 }
 
 static int airoha_qdma_init_hfwd_queues(struct airoha_qdma *qdma)
@@ -1523,10 +1545,23 @@ static int airoha_qdma_init(struct platform_device *pdev,
 	return airoha_qdma_hw_init(qdma);
 }
 
-static void airoha_qdma_cleanup(struct airoha_qdma *qdma)
+static void airoha_qdma_cleanup(struct airoha_eth *eth,
+				struct airoha_qdma *qdma)
 {
 	int i;
 
+	if (test_bit(DEV_STATE_INITIALIZED, &eth->state)) {
+		u32 status;
+
+		airoha_qdma_clear(qdma, REG_QDMA_GLOBAL_CFG,
+				  GLOBAL_CFG_RX_DMA_EN_MASK);
+		if (read_poll_timeout(airoha_qdma_rr, status,
+				      !(status & GLOBAL_CFG_RX_DMA_BUSY_MASK),
+				      USEC_PER_MSEC, 50 * USEC_PER_MSEC, true,
+				      qdma, REG_QDMA_GLOBAL_CFG))
+			dev_warn(eth->dev, "QDMA RX DMA busy timeout\n");
+	}
+
 	for (i = 0; i < ARRAY_SIZE(qdma->q_rx); i++) {
 		if (!qdma->q_rx[i].ndesc)
 			continue;
@@ -1546,12 +1581,6 @@ static void airoha_qdma_cleanup(struct airoha_qdma *qdma)
 		netif_napi_del(&qdma->q_tx_irq[i].napi);
 	}
 
-	for (i = 0; i < ARRAY_SIZE(qdma->q_tx); i++) {
-		if (!qdma->q_tx[i].ndesc)
-			continue;
-
-		airoha_qdma_cleanup_tx_queue(&qdma->q_tx[i]);
-	}
 }
 
 static int airoha_hw_init(struct platform_device *pdev,
@@ -1593,7 +1622,7 @@ static int airoha_hw_init(struct platform_device *pdev,
 	return 0;
 error:
 	for (i = 0; i < ARRAY_SIZE(eth->qdma); i++)
-		airoha_qdma_cleanup(&eth->qdma[i]);
+		airoha_qdma_cleanup(eth, &eth->qdma[i]);
 
 	return err;
 }
@@ -1603,7 +1632,7 @@ static void airoha_hw_cleanup(struct airoha_eth *eth)
 	int i;
 
 	for (i = 0; i < ARRAY_SIZE(eth->qdma); i++)
-		airoha_qdma_cleanup(&eth->qdma[i]);
+		airoha_qdma_cleanup(eth, &eth->qdma[i]);
 	airoha_ppe_deinit(eth);
 }
 
@@ -1837,11 +1866,6 @@ static int airoha_dev_open(struct net_device *netdev)
 	}
 	port->users++;
 
-	airoha_qdma_set(qdma, REG_QDMA_GLOBAL_CFG,
-			GLOBAL_CFG_TX_DMA_EN_MASK |
-			GLOBAL_CFG_RX_DMA_EN_MASK);
-	qdma->users++;
-
 	if (!airoha_is_lan_gdm_dev(dev) &&
 	    airoha_ppe_is_enabled(qdma->eth, 1))
 		pse_port = FE_PSE_PORT_PPE2;
@@ -1880,12 +1904,9 @@ static int airoha_dev_stop(struct net_device *netdev)
 	struct airoha_gdm_dev *dev = netdev_priv(netdev);
 	struct airoha_gdm_port *port = dev->port;
 	struct airoha_qdma *qdma = dev->qdma;
-	int i;
 
 	netif_tx_disable(netdev);
 	airoha_set_vip_for_gdm_port(dev, false);
-	for (i = 0; i < netdev->num_tx_queues; i++)
-		netdev_tx_reset_subqueue(netdev, i);
 
 	if (--port->users)
 		airoha_set_port_mtu(dev->eth, port);
@@ -1893,20 +1914,6 @@ static int airoha_dev_stop(struct net_device *netdev)
 		airoha_set_gdm_port_fwd_cfg(qdma->eth,
 					    REG_GDM_FWD_CFG(port->id),
 					    FE_PSE_PORT_DROP);
-
-	if (!--qdma->users) {
-		airoha_qdma_clear(qdma, REG_QDMA_GLOBAL_CFG,
-				  GLOBAL_CFG_TX_DMA_EN_MASK |
-				  GLOBAL_CFG_RX_DMA_EN_MASK);
-
-		for (i = 0; i < ARRAY_SIZE(qdma->q_tx); i++) {
-			if (!qdma->q_tx[i].ndesc)
-				continue;
-
-			airoha_qdma_cleanup_tx_queue(&qdma->q_tx[i]);
-		}
-	}
-
 	return 0;
 }
 
@@ -2229,6 +2236,9 @@ static netdev_tx_t airoha_dev_xmit(struct sk_buff *skb,
 
 	spin_lock_bh(&q->lock);
 
+	if (q->flushing)
+		goto error_unlock;
+
 	txq = skb_get_tx_queue(netdev, skb);
 	nr_frags = 1 + skb_shinfo(skb)->nr_frags;
 
@@ -2309,7 +2319,7 @@ static netdev_tx_t airoha_dev_xmit(struct sk_buff *skb,
 		e->dma_addr = 0;
 	}
 	list_splice(&tx_list, &q->tx_list);
-
+error_unlock:
 	spin_unlock_bh(&q->lock);
 error:
 	dev_kfree_skb_any(skb);
@@ -3413,8 +3423,12 @@ static int airoha_probe(struct platform_device *pdev)
 	if (err)
 		goto error_netdev_free;
 
-	for (i = 0; i < ARRAY_SIZE(eth->qdma); i++)
+	for (i = 0; i < ARRAY_SIZE(eth->qdma); i++) {
 		airoha_qdma_start_napi(&eth->qdma[i]);
+		airoha_qdma_set(&eth->qdma[i], REG_QDMA_GLOBAL_CFG,
+				GLOBAL_CFG_TX_DMA_EN_MASK |
+				GLOBAL_CFG_RX_DMA_EN_MASK);
+	}
 
 	for_each_child_of_node(pdev->dev.of_node, np) {
 		if (!of_device_is_compatible(np, "airoha,eth-mac"))
@@ -3437,8 +3451,10 @@ static int airoha_probe(struct platform_device *pdev)
 	return 0;
 
 error_napi_stop:
-	for (i = 0; i < ARRAY_SIZE(eth->qdma); i++)
+	for (i = 0; i < ARRAY_SIZE(eth->qdma); i++) {
 		airoha_qdma_stop_napi(&eth->qdma[i]);
+		airoha_qdma_tx_cleanup(&eth->qdma[i]);
+	}
 
 	for (i = 0; i < ARRAY_SIZE(eth->ports); i++) {
 		struct airoha_gdm_port *port = eth->ports[i];
@@ -3474,8 +3490,10 @@ static void airoha_remove(struct platform_device *pdev)
 	struct airoha_eth *eth = platform_get_drvdata(pdev);
 	int i;
 
-	for (i = 0; i < ARRAY_SIZE(eth->qdma); i++)
+	for (i = 0; i < ARRAY_SIZE(eth->qdma); i++) {
 		airoha_qdma_stop_napi(&eth->qdma[i]);
+		airoha_qdma_tx_cleanup(&eth->qdma[i]);
+	}
 
 	for (i = 0; i < ARRAY_SIZE(eth->ports); i++) {
 		struct airoha_gdm_port *port = eth->ports[i];
diff --git a/drivers/net/ethernet/airoha/airoha_eth.h b/drivers/net/ethernet/airoha/airoha_eth.h
index 41d2e7a1f9fb..d7ff8c5200e2 100644
--- a/drivers/net/ethernet/airoha/airoha_eth.h
+++ b/drivers/net/ethernet/airoha/airoha_eth.h
@@ -197,6 +197,7 @@ struct airoha_queue {
 	int free_thr;
 	int buf_size;
 	bool txq_stopped;
+	bool flushing;
 
 	struct napi_struct napi;
 	struct page_pool *page_pool;
@@ -524,8 +525,6 @@ struct airoha_qdma {
 	struct airoha_eth *eth;
 	void __iomem *regs;
 
-	int users;
-
 	struct airoha_irq_bank irq_banks[AIROHA_MAX_NUM_IRQ_BANKS];
 
 	struct airoha_tx_irq_queue q_tx_irq[AIROHA_NUM_TX_IRQ];

---
base-commit: a887f2c7da66a805a55fd8706d45faec85f646db
change-id: 20260618-airoha-bql-fixes-f57b2d108573

Best regards,
-- 
Lorenzo Bianconi <lorenzo@kernel.org>


^ permalink raw reply related

* [PATCH net v2 0/2] sctp: validate INIT in COOKIE-ECHO when auth disabled
From: Xin Long @ 2026-06-20 15:09 UTC (permalink / raw)
  To: network dev, linux-sctp
  Cc: davem, kuba, Eric Dumazet, Paolo Abeni, Simon Horman,
	Marcelo Ricardo Leitner

This series fixes a security gap in SCTP's COOKIE-ECHO handling when
cookie authentication is disabled.

Currently, INIT chunks embedded in cookies are not re-verified after
unpacking, creating a vulnerability when cookie_auth_enable=0. This
series first refactors error handling, then adds the missing validation.

Changes in v2: see individual patch changelogs for details.

Xin Long (2):
  sctp: factor out INIT verification failure handling
  sctp: add INIT verification after cookie unpacking

 net/sctp/sm_make_chunk.c |   3 +-
 net/sctp/sm_statefuns.c  | 220 ++++++++++++++++++++-------------------
 2 files changed, 117 insertions(+), 106 deletions(-)

-- 
2.47.1

^ permalink raw reply

* [PATCH net v2 1/2] sctp: factor out INIT verification failure handling
From: Xin Long @ 2026-06-20 15:09 UTC (permalink / raw)
  To: network dev, linux-sctp
  Cc: davem, kuba, Eric Dumazet, Paolo Abeni, Simon Horman,
	Marcelo Ricardo Leitner
In-Reply-To: <cover.1781968162.git.lucien.xin@gmail.com>

Extract the duplicated INIT/INIT-ACK error handling logic into a new
helper, sctp_abort_on_init_err().

Several state functions open-code the same pattern after
sctp_verify_init() fails: construct an ABORT with error causes if
available, send it when allocation succeeds, or fall back to T-bit ABORT
handling when no error chunk is present. INIT-ACK handling also includes
additional teardown logic for malformed packets.

Move this logic into sctp_abort_on_init_err() to reduce duplication and
centralize INIT/INIT-ACK failure handling.

No functional change intended. The helper will be used in a subsequent
patch.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
---
v2:
  - Pass cid to sctp_abort_on_init_err().
  - Delete chunk param from sctp_abort_on_init_err() and get chunk from
    arg param.
  - Jump to label 'out:' when err_chunk is NULL and cid is INIT_ACK in
    sctp_abort_on_init_err(), noted by Sashiko.
---
 net/sctp/sm_statefuns.c | 187 ++++++++++++++++++----------------------
 1 file changed, 85 insertions(+), 102 deletions(-)

diff --git a/net/sctp/sm_statefuns.c b/net/sctp/sm_statefuns.c
index 9b23c11cbb9e..8c636f045e45 100644
--- a/net/sctp/sm_statefuns.c
+++ b/net/sctp/sm_statefuns.c
@@ -68,6 +68,13 @@ static void sctp_send_stale_cookie_err(struct net *net,
 				       const struct sctp_chunk *chunk,
 				       struct sctp_cmd_seq *commands,
 				       struct sctp_chunk *err_chunk);
+static enum sctp_disposition sctp_abort_on_init_err(
+					struct net *net,
+					const struct sctp_endpoint *ep,
+					const struct sctp_association *asoc,
+					enum sctp_cid cid, void *arg,
+					struct sctp_cmd_seq *commands,
+					struct sctp_chunk *err_chunk);
 static enum sctp_disposition sctp_sf_do_5_2_6_stale(
 					struct net *net,
 					const struct sctp_endpoint *ep,
@@ -325,7 +332,7 @@ enum sctp_disposition sctp_sf_do_5_1B_init(struct net *net,
 	struct sctp_chunk *chunk = arg, *repl, *err_chunk;
 	struct sctp_unrecognized_param *unk_param;
 	struct sctp_association *new_asoc;
-	struct sctp_packet *packet;
+	enum sctp_cid cid;
 	int len;
 
 	/* 6.10 Bundling
@@ -373,34 +380,12 @@ enum sctp_disposition sctp_sf_do_5_1B_init(struct net *net,
 
 	/* Verify the INIT chunk before processing it. */
 	err_chunk = NULL;
-	if (!sctp_verify_init(net, ep, asoc, chunk->chunk_hdr->type,
+	cid = chunk->chunk_hdr->type;
+	if (!sctp_verify_init(net, ep, asoc, cid,
 			      (struct sctp_init_chunk *)chunk->chunk_hdr, chunk,
-			      &err_chunk)) {
-		/* This chunk contains fatal error. It is to be discarded.
-		 * Send an ABORT, with causes if there is any.
-		 */
-		if (err_chunk) {
-			packet = sctp_abort_pkt_new(net, ep, asoc, arg,
-					(__u8 *)(err_chunk->chunk_hdr) +
-					sizeof(struct sctp_chunkhdr),
-					ntohs(err_chunk->chunk_hdr->length) -
-					sizeof(struct sctp_chunkhdr));
-
-			sctp_chunk_free(err_chunk);
-
-			if (packet) {
-				sctp_add_cmd_sf(commands, SCTP_CMD_SEND_PKT,
-						SCTP_PACKET(packet));
-				SCTP_INC_STATS(net, SCTP_MIB_OUTCTRLCHUNKS);
-				return SCTP_DISPOSITION_CONSUME;
-			} else {
-				return SCTP_DISPOSITION_NOMEM;
-			}
-		} else {
-			return sctp_sf_tabort_8_4_8(net, ep, asoc, type, arg,
-						    commands);
-		}
-	}
+			      &err_chunk))
+		return sctp_abort_on_init_err(net, ep, asoc, cid, arg, commands,
+					      err_chunk);
 
 	/* Grab the INIT header.  */
 	chunk->subh.init_hdr = (struct sctp_inithdr *)chunk->skb->data;
@@ -525,7 +510,7 @@ enum sctp_disposition sctp_sf_do_5_1C_ack(struct net *net,
 	struct sctp_init_chunk *initchunk;
 	struct sctp_chunk *chunk = arg;
 	struct sctp_chunk *err_chunk;
-	struct sctp_packet *packet;
+	enum sctp_cid cid;
 
 	if (!sctp_vtag_verify(chunk, asoc))
 		return sctp_sf_pdiscard(net, ep, asoc, type, arg, commands);
@@ -546,52 +531,12 @@ enum sctp_disposition sctp_sf_do_5_1C_ack(struct net *net,
 
 	/* Verify the INIT chunk before processing it. */
 	err_chunk = NULL;
-	if (!sctp_verify_init(net, ep, asoc, chunk->chunk_hdr->type,
+	cid = chunk->chunk_hdr->type;
+	if (!sctp_verify_init(net, ep, asoc, cid,
 			      (struct sctp_init_chunk *)chunk->chunk_hdr, chunk,
-			      &err_chunk)) {
-
-		enum sctp_error error = SCTP_ERROR_NO_RESOURCE;
-
-		/* This chunk contains fatal error. It is to be discarded.
-		 * Send an ABORT, with causes.  If there are no causes,
-		 * then there wasn't enough memory.  Just terminate
-		 * the association.
-		 */
-		if (err_chunk) {
-			packet = sctp_abort_pkt_new(net, ep, asoc, arg,
-					(__u8 *)(err_chunk->chunk_hdr) +
-					sizeof(struct sctp_chunkhdr),
-					ntohs(err_chunk->chunk_hdr->length) -
-					sizeof(struct sctp_chunkhdr));
-
-			sctp_chunk_free(err_chunk);
-
-			if (packet) {
-				sctp_add_cmd_sf(commands, SCTP_CMD_SEND_PKT,
-						SCTP_PACKET(packet));
-				SCTP_INC_STATS(net, SCTP_MIB_OUTCTRLCHUNKS);
-				error = SCTP_ERROR_INV_PARAM;
-			}
-		}
-
-		/* SCTP-AUTH, Section 6.3:
-		 *    It should be noted that if the receiver wants to tear
-		 *    down an association in an authenticated way only, the
-		 *    handling of malformed packets should not result in
-		 *    tearing down the association.
-		 *
-		 * This means that if we only want to abort associations
-		 * in an authenticated way (i.e AUTH+ABORT), then we
-		 * can't destroy this association just because the packet
-		 * was malformed.
-		 */
-		if (sctp_auth_recv_cid(SCTP_CID_ABORT, asoc))
-			return sctp_sf_pdiscard(net, ep, asoc, type, arg, commands);
-
-		SCTP_INC_STATS(net, SCTP_MIB_ABORTEDS);
-		return sctp_stop_t1_and_abort(net, commands, error, ECONNREFUSED,
-						asoc, chunk->transport);
-	}
+			      &err_chunk))
+		return sctp_abort_on_init_err(net, ep, asoc, cid, arg, commands,
+					      err_chunk);
 
 	/* Tag the variable length parameters.  Note that we never
 	 * convert the parameters in an INIT chunk.
@@ -1522,7 +1467,7 @@ static enum sctp_disposition sctp_sf_do_unexpected_init(
 	struct sctp_unrecognized_param *unk_param;
 	struct sctp_association *new_asoc;
 	enum sctp_disposition retval;
-	struct sctp_packet *packet;
+	enum sctp_cid cid;
 	int len;
 
 	/* 6.10 Bundling
@@ -1564,33 +1509,12 @@ static enum sctp_disposition sctp_sf_do_unexpected_init(
 
 	/* Verify the INIT chunk before processing it. */
 	err_chunk = NULL;
-	if (!sctp_verify_init(net, ep, asoc, chunk->chunk_hdr->type,
+	cid = chunk->chunk_hdr->type;
+	if (!sctp_verify_init(net, ep, asoc, cid,
 			      (struct sctp_init_chunk *)chunk->chunk_hdr, chunk,
-			      &err_chunk)) {
-		/* This chunk contains fatal error. It is to be discarded.
-		 * Send an ABORT, with causes if there is any.
-		 */
-		if (err_chunk) {
-			packet = sctp_abort_pkt_new(net, ep, asoc, arg,
-					(__u8 *)(err_chunk->chunk_hdr) +
-					sizeof(struct sctp_chunkhdr),
-					ntohs(err_chunk->chunk_hdr->length) -
-					sizeof(struct sctp_chunkhdr));
-
-			if (packet) {
-				sctp_add_cmd_sf(commands, SCTP_CMD_SEND_PKT,
-						SCTP_PACKET(packet));
-				SCTP_INC_STATS(net, SCTP_MIB_OUTCTRLCHUNKS);
-				retval = SCTP_DISPOSITION_CONSUME;
-			} else {
-				retval = SCTP_DISPOSITION_NOMEM;
-			}
-			goto cleanup;
-		} else {
-			return sctp_sf_tabort_8_4_8(net, ep, asoc, type, arg,
-						    commands);
-		}
-	}
+			      &err_chunk))
+		return sctp_abort_on_init_err(net, ep, asoc, cid, arg, commands,
+					      err_chunk);
 
 	/*
 	 * Other parameters for the endpoint SHOULD be copied from the
@@ -1691,7 +1615,6 @@ static enum sctp_disposition sctp_sf_do_unexpected_init(
 nomem_retval:
 	if (new_asoc)
 		sctp_association_free(new_asoc);
-cleanup:
 	if (err_chunk)
 		sctp_chunk_free(err_chunk);
 	return retval;
@@ -6485,6 +6408,66 @@ static void sctp_send_stale_cookie_err(struct net *net,
 	}
 }
 
+static enum sctp_disposition sctp_abort_on_init_err(
+					struct net *net,
+					const struct sctp_endpoint *ep,
+					const struct sctp_association *asoc,
+					enum sctp_cid cid, void *arg,
+					struct sctp_cmd_seq *commands,
+					struct sctp_chunk *err_chunk)
+{
+	enum sctp_error error = SCTP_ERROR_NO_RESOURCE;
+	struct sctp_chunk *chunk = arg;
+	struct sctp_packet *packet;
+	struct sctp_chunkhdr *ch;
+
+	if (!err_chunk) {
+		if (cid == SCTP_CID_INIT_ACK)
+			goto out;
+		return sctp_sf_tabort_8_4_8(net, ep, asoc, SCTP_ST_CHUNK(0),
+					    arg, commands);
+	}
+
+	ch = err_chunk->chunk_hdr;
+	packet = sctp_abort_pkt_new(net, ep, asoc, arg,
+				    (__u8 *)ch + sizeof(*ch),
+				    ntohs(ch->length) - sizeof(*ch));
+
+	sctp_chunk_free(err_chunk);
+
+	if (packet) {
+		sctp_add_cmd_sf(commands, SCTP_CMD_SEND_PKT,
+				SCTP_PACKET(packet));
+		SCTP_INC_STATS(net, SCTP_MIB_OUTCTRLCHUNKS);
+		error = SCTP_ERROR_INV_PARAM;
+	}
+
+	if (cid != SCTP_CID_INIT_ACK) {
+		if (!packet)
+			return SCTP_DISPOSITION_NOMEM;
+		return SCTP_DISPOSITION_CONSUME;
+	}
+
+out:
+	/* SCTP-AUTH, Section 6.3:
+	 *    It should be noted that if the receiver wants to tear
+	 *    down an association in an authenticated way only, the
+	 *    handling of malformed packets should not result in
+	 *    tearing down the association.
+	 *
+	 * This means that if we only want to abort associations
+	 * in an authenticated way (i.e AUTH+ABORT), then we
+	 * can't destroy this association just because the packet
+	 * was malformed.
+	 */
+	if (sctp_auth_recv_cid(SCTP_CID_ABORT, asoc))
+		return sctp_sf_pdiscard(net, ep, asoc, SCTP_ST_CHUNK(0), arg,
+					commands);
+
+	SCTP_INC_STATS(net, SCTP_MIB_ABORTEDS);
+	return sctp_stop_t1_and_abort(net, commands, error, ECONNREFUSED,
+				      asoc, chunk->transport);
+}
 
 /* Process a data chunk */
 static int sctp_eat_data(const struct sctp_association *asoc,
-- 
2.47.1


^ permalink raw reply related

* [PATCH net v2 2/2] sctp: add INIT verification after cookie unpacking
From: Xin Long @ 2026-06-20 15:09 UTC (permalink / raw)
  To: network dev, linux-sctp
  Cc: davem, kuba, Eric Dumazet, Paolo Abeni, Simon Horman,
	Marcelo Ricardo Leitner
In-Reply-To: <cover.1781968162.git.lucien.xin@gmail.com>

In SCTP handshake, the INIT chunk is initially processed by the server
and embedded into the cookie carried in INIT-ACK. The client then
returns this cookie via COOKIE-ECHO, where the server unpacks it and
reconstructs the original INIT chunk.

When cookie authentication is enabled, the cookie contents are protected
against tampering, so reusing the unpacked INIT without re-verification
is safe.

However, when cookie authentication is disabled, the reconstructed INIT
can no longer be trusted. In this case, the INIT must be explicitly
validated after unpacking to avoid processing potentially tampered data.

Add sctp_verify_init() checks after cookie unpacking in COOKIE-ECHO
processing paths (sctp_sf_do_5_1D_ce() and sctp_sf_do_5_2_4_dupcook())
when cookie_auth_enable is disabled. On failure, the association is
freed and an ABORT is generated via sctp_abort_on_init_err().

Also update sctp_verify_init() to validate parameter bounds against the
actual peer_init length rather than chunk->chunk_end, since peer_init
may not span the full chunk buffer in COOKIE-ECHO.

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
---
v2:
  - Because of sctp_abort_on_init_err() param change in patch 1/2,
    pass cid and not chunk.
  - Use SCTP_PAD4() around ntohs(peer_init->chunk_hdr.length) when
    checking param.v in sctp_verify_init() to make Sashiko happy.
---
 net/sctp/sm_make_chunk.c |  3 ++-
 net/sctp/sm_statefuns.c  | 33 ++++++++++++++++++++++++++++++---
 2 files changed, 32 insertions(+), 4 deletions(-)

diff --git a/net/sctp/sm_make_chunk.c b/net/sctp/sm_make_chunk.c
index 41958b8e59fd..d5ee81934d93 100644
--- a/net/sctp/sm_make_chunk.c
+++ b/net/sctp/sm_make_chunk.c
@@ -2298,7 +2298,8 @@ int sctp_verify_init(struct net *net, const struct sctp_endpoint *ep,
 	 * VIOLATION error.  We build the ERROR chunk here and let the normal
 	 * error handling code build and send the packet.
 	 */
-	if (param.v != (void *)chunk->chunk_end)
+	if (param.v != (void *)peer_init +
+		       SCTP_PAD4(ntohs(peer_init->chunk_hdr.length)))
 		return sctp_process_inv_paramlength(asoc, param.p, chunk, errp);
 
 	/* The only missing mandatory param possible today is
diff --git a/net/sctp/sm_statefuns.c b/net/sctp/sm_statefuns.c
index 8c636f045e45..6967e889d1bd 100644
--- a/net/sctp/sm_statefuns.c
+++ b/net/sctp/sm_statefuns.c
@@ -650,11 +650,12 @@ enum sctp_disposition sctp_sf_do_5_1D_ce(struct net *net,
 					 struct sctp_cmd_seq *commands)
 {
 	struct sctp_ulpevent *ev, *ai_ev = NULL, *auth_ev = NULL;
+	struct sctp_chunk *err_chk_p = NULL;
 	struct sctp_association *new_asoc;
 	struct sctp_init_chunk *peer_init;
 	struct sctp_chunk *chunk = arg;
-	struct sctp_chunk *err_chk_p;
 	struct sctp_chunk *repl;
+	enum sctp_cid cid;
 	struct sock *sk;
 	int error = 0;
 
@@ -728,6 +729,18 @@ enum sctp_disposition sctp_sf_do_5_1D_ce(struct net *net,
 		}
 	}
 
+	peer_init = (struct sctp_init_chunk *)(chunk->subh.cookie_hdr + 1);
+	cid = peer_init->chunk_hdr.type;
+	if (!sctp_sk(sk)->cookie_auth_enable &&
+	    !sctp_verify_init(net, ep, asoc, cid, peer_init, chunk,
+			      &err_chk_p)) {
+		sctp_association_free(new_asoc);
+		return sctp_abort_on_init_err(net, ep, asoc, cid, arg, commands,
+					      err_chk_p);
+	}
+	if (err_chk_p)
+		sctp_chunk_free(err_chk_p);
+
 	if (security_sctp_assoc_request(new_asoc, chunk->head_skb ?: chunk->skb)) {
 		sctp_association_free(new_asoc);
 		return sctp_sf_pdiscard(net, ep, asoc, type, arg, commands);
@@ -741,7 +754,6 @@ enum sctp_disposition sctp_sf_do_5_1D_ce(struct net *net,
 	/* This is a brand-new association, so these are not yet side
 	 * effects--it is safe to run them here.
 	 */
-	peer_init = (struct sctp_init_chunk *)(chunk->subh.cookie_hdr + 1);
 	if (!sctp_process_init(new_asoc, chunk,
 			       &chunk->subh.cookie_hdr->c.peer_addr,
 			       peer_init, GFP_ATOMIC))
@@ -2133,10 +2145,12 @@ enum sctp_disposition sctp_sf_do_5_2_4_dupcook(
 					void *arg,
 					struct sctp_cmd_seq *commands)
 {
+	struct sctp_chunk *err_chk_p = NULL;
 	struct sctp_association *new_asoc;
+	struct sctp_init_chunk *peer_init;
 	struct sctp_chunk *chunk = arg;
 	enum sctp_disposition retval;
-	struct sctp_chunk *err_chk_p;
+	enum sctp_cid cid;
 	int error = 0;
 	char action;
 
@@ -2205,6 +2219,19 @@ enum sctp_disposition sctp_sf_do_5_2_4_dupcook(
 	switch (action) {
 	case 'A': /* Association restart. */
 	case 'B': /* Collision case B. */
+		peer_init = (struct sctp_init_chunk *)
+				(chunk->subh.cookie_hdr + 1);
+		cid = peer_init->chunk_hdr.type;
+		if (!sctp_sk(ep->base.sk)->cookie_auth_enable &&
+		    !sctp_verify_init(net, ep, asoc, cid, peer_init, chunk,
+				      &err_chk_p)) {
+			sctp_association_free(new_asoc);
+			return sctp_abort_on_init_err(net, ep, asoc, cid, arg,
+						      commands, err_chk_p);
+		}
+		if (err_chk_p)
+			sctp_chunk_free(err_chk_p);
+		fallthrough;
 	case 'D': /* Collision case D. */
 		/* Update socket peer label if first association. */
 		if (security_sctp_assoc_request((struct sctp_association *)asoc,
-- 
2.47.1


^ permalink raw reply related

* Re: [PATCH net] net: phylink: print correct c45 phy id when missing PHY driver
From: Maxime Chevallier @ 2026-06-20 15:39 UTC (permalink / raw)
  To: Aleksander Jan Bajkowski, linux, andrew, hkallweit1, davem,
	edumazet, kuba, pabeni, rmk+kernel, vladimir.oltean, netdev,
	linux-kernel
In-Reply-To: <20260620131130.949298-1-olek2@wp.pl>

Hi Aleksander,

On 6/20/26 15:11, Aleksander Jan Bajkowski wrote:
> If no PHY driver is found, `phy_id` is returned. `phy_id` holds the c22 ID.
> Modules with a rollball bridge support only c45 transfers. The c45 IDs are
> stored in the `c45_ids` structure. In the current code these modules report
> an ID 0x00000000. This may lead users to mistakenly conclude that the
> rollball bridge isn't properly implemented in their SFP module. This patch
> fixes the wrong IDs for c45 modules when a driver cannot be found.
> 
> Tested on Fiberstore SFP-GB-BE-T (C22) and ONTi ONT-C1TE-R05 (Rollball).
> 
> Before:
> [ 2440.373985] mtk_soc_eth 15100000.ethernet sfp-lan: PHY i2c:sfp2:11 (id 0x00000000) has no driver loaded
> [ 2440.383385] mtk_soc_eth 15100000.ethernet sfp-lan: Drivers which handle known common cases: CONFIG_BCM84881_PHY, CONFIG_MARVELL_PHY
> [ 2440.395274] sfp sfp2: sfp_add_phy failed: -EINVAL
> 
> After:
> [   82.573700] mtk_soc_eth 15100000.ethernet sfp-lan: PHY i2c:sfp2:11 (id 0x001cc898) has no driver loaded
> [   82.583098] mtk_soc_eth 15100000.ethernet sfp-lan: Drivers which handle known common cases: CONFIG_BCM84881_PHY, CONFIG_MARVELL_PHY
> [   82.594996] sfp sfp2: sfp_add_phy failed: -EINVAL
> 
> Fixes: ffcbfb5f9779 ("net: phylink: improve phylink_sfp_config_phy() error message with missing PHY driver")
> Signed-off-by: Aleksander Jan Bajkowski <olek2@wp.pl>

TBH I'm not convinced this counts as a net-worthy fix, I'd send that
to net-next when it reopens instead. Sure this is useful debug info,
but when you hit that error message you're already in for some digging
around in the code/config :)

> ---
>  drivers/net/phy/phylink.c | 21 +++++++++++++++++++--
>  1 file changed, 19 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/phy/phylink.c b/drivers/net/phy/phylink.c
> index 087ac63f9193..7d7595158bf9 100644
> --- a/drivers/net/phy/phylink.c
> +++ b/drivers/net/phy/phylink.c
> @@ -3917,13 +3917,30 @@ static void phylink_sfp_link_up(void *upstream)
>  	phylink_enable_and_run_resolve(pl, PHYLINK_DISABLE_LINK);
>  }
>  
> +static u32 phylink_get_phy_id(struct phy_device *phy)
> +{
> +	if (phy->is_c45) {
> +		const int num_ids = ARRAY_SIZE(phy->c45_ids.device_ids);
> +		int i;
> +
> +		for (i = 1; i < num_ids; i++) {
> +			if (phy->c45_ids.mmds_present & BIT(i))
> +				return (phy->c45_ids.device_ids[i]);
> +		}
> +
> +		return 0;
> +	} else {
> +		return phy->phy_id;
> +	}
> +}

The function name is misleading, you don't really get the id, you get either
the c22 id or the first non-zero C45 id.

> +
>  static int phylink_sfp_connect_phy(void *upstream, struct phy_device *phy)
>  {
>  	struct phylink *pl = upstream;
>  
>  	if (!phy->drv) {
> -		phylink_err(pl, "PHY %s (id 0x%.8lx) has no driver loaded\n",
> -			    phydev_name(phy), (unsigned long)phy->phy_id);
> +		phylink_err(pl, "PHY %s (id 0x%.8x) has no driver loaded\n",

Why change the printk format from 0x%.8lx to 0x%.8x ?

> +			    phydev_name(phy), phylink_get_phy_id(phy));
>  		phylink_err(pl, "Drivers which handle known common cases: CONFIG_BCM84881_PHY, CONFIG_MARVELL_PHY\n");
>  		return -EINVAL;
>  	}


After reading all that, I'm actually not really convinced the overall patch
is the best approach. It's a lot of logic for a very niche case. This is really for
debug purposes, so why not instead print either the phy_id for a C22 PHY, or
just "C45 PHY" and no id at all for C45 ? This removes the confusion about the
id being 0, while still being cleat that the user needs to figure-out what's
going on with their module...

Maxime

^ permalink raw reply

* [PATCH net] sctp: fix err_chunk memory leaks in INIT handling
From: Xin Long @ 2026-06-20 15:48 UTC (permalink / raw)
  To: network dev, linux-sctp
  Cc: davem, kuba, Eric Dumazet, Paolo Abeni, Simon Horman,
	Marcelo Ricardo Leitner

When sctp_verify_init() encounters unrecognized parameters, it allocates an
err_chunk to report them. However, this chunk is leaked in several code
paths:

1. In sctp_sf_do_5_1B_init(), if security_sctp_assoc_request() fails after
   sctp_verify_init() has populated err_chunk, the function returns
   immediately without freeing it.

2. In sctp_sf_do_unexpected_init(), the same leak occurs on the
   security_sctp_assoc_request() failure path.

3. In sctp_sf_do_unexpected_init(), on the success path after copying
   unrecognized parameters to the INIT-ACK, the function returns without
   freeing err_chunk, unlike sctp_sf_do_5_1B_init() which properly frees
   it.

Fix all three leaks by adding sctp_chunk_free(err_chunk) calls before
returning in the error paths and on the success path in
sctp_sf_do_unexpected_init().

Fixes: c081d53f97a1 ("security: pass asoc to sctp_assoc_request and sctp_sk_clone")
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Reported-by: Sashiko <sashiko-bot@kernel.org>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
---
 net/sctp/sm_statefuns.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/net/sctp/sm_statefuns.c b/net/sctp/sm_statefuns.c
index 6967e889d1bd..2f484c678093 100644
--- a/net/sctp/sm_statefuns.c
+++ b/net/sctp/sm_statefuns.c
@@ -400,6 +400,8 @@ enum sctp_disposition sctp_sf_do_5_1B_init(struct net *net,
 	/* Update socket peer label if first association. */
 	if (security_sctp_assoc_request(new_asoc, chunk->skb)) {
 		sctp_association_free(new_asoc);
+		if (err_chunk)
+			sctp_chunk_free(err_chunk);
 		return sctp_sf_pdiscard(net, ep, asoc, type, arg, commands);
 	}

@@ -1542,6 +1544,8 @@ static enum sctp_disposition sctp_sf_do_unexpected_init(
 	/* Update socket peer label if first association. */
 	if (security_sctp_assoc_request(new_asoc, chunk->skb)) {
 		sctp_association_free(new_asoc);
+		if (err_chunk)
+			sctp_chunk_free(err_chunk);
 		return sctp_sf_pdiscard(net, ep, asoc, type, arg, commands);
 	}

@@ -1607,6 +1611,7 @@ static enum sctp_disposition sctp_sf_do_unexpected_init(
 		 * parameter type.
 		 */
 		sctp_addto_chunk(repl, len, unk_param);
+		sctp_chunk_free(err_chunk);
 	}

 	sctp_add_cmd_sf(commands, SCTP_CMD_NEW_ASOC, SCTP_ASOC(new_asoc));
-- 
2.47.1

^ permalink raw reply related

* [PATCH] net: sungem: fix probe error cleanup
From: Ruoyu Wang @ 2026-06-20 15:53 UTC (permalink / raw)
  To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni
  Cc: netdev, linux-kernel, Ruoyu Wang

gem_init_one() calls gem_remove_one() when register_netdev() fails.
That path unregisters and frees resources owned by the net_device,
then probe continues into its own cleanup labels and touches the same
state again.

Clear the driver data and remove the NAPI instance on this error path,
then let the existing probe cleanup labels release the resources once.

Signed-off-by: Ruoyu Wang <ruoyuw560@gmail.com>
---
 drivers/net/ethernet/sun/sungem.c | 13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/sun/sungem.c b/drivers/net/ethernet/sun/sungem.c
index 8e69d917d827..26974ee71352 100644
--- a/drivers/net/ethernet/sun/sungem.c
+++ b/drivers/net/ethernet/sun/sungem.c
@@ -2986,10 +2986,10 @@ static int gem_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
 	dev->max_mtu = GEM_MAX_MTU;
 
 	/* Register with kernel */
-	if (register_netdev(dev)) {
+	err = register_netdev(dev);
+	if (err) {
 		pr_err("Cannot register net device, aborting\n");
-		err = -ENOMEM;
-		goto err_out_free_consistent;
+		goto err_out_clear_drvdata;
 	}
 
 	/* Undo the get_cell with appropriate locking (we could use
@@ -3003,8 +3003,13 @@ static int gem_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
 		    dev->dev_addr);
 	return 0;
 
+err_out_clear_drvdata:
+	pci_set_drvdata(pdev, NULL);
+	netif_napi_del(&gp->napi);
+
 err_out_free_consistent:
-	gem_remove_one(pdev);
+	dma_free_coherent(&pdev->dev, sizeof(struct gem_init_block),
+			  gp->init_block, gp->gblock_dvma);
 err_out_iounmap:
 	gem_put_cell(gp);
 	iounmap(gp->regs);

^ permalink raw reply related

* Re: [PATCH v3 2/2] selftests/tc-testing: Add DualPI2 GSO backlog accounting test
From: Victor Nogueira @ 2026-06-20 15:53 UTC (permalink / raw)
  To: Xingquan Liu; +Cc: Jamal Hadi Salim, netdev, Jiri Pirko, Chia-Yu Chang, stable
In-Reply-To: <20260619151447.223640-2-b1n@b1n.io>

On Fri, Jun 19, 2026 at 12:16 PM Xingquan Liu <b1n@b1n.io> wrote:
>
> Add a regression test for DualPI2 GSO backlog accounting when it is
> used as a child qdisc of QFQ.
>
> The test sends one UDP GSO datagram through a QFQ class with DualPI2 as
> the leaf qdisc. DualPI2 splits the skb into two segments. After the
> traffic drains, both QFQ and DualPI2 must report zero backlog and zero
> qlen.
>
> On kernels with the broken accounting, QFQ can keep a stale non-zero
> qlen after all real packets have been dequeued.
>
> Signed-off-by: Xingquan Liu <b1n@b1n.io>

Reviewed-by: Victor Nogueira <victor@mojatatu.com>

^ permalink raw reply

* Re: [PATCH v3 1/2] net/sched: dualpi2: fix GSO backlog accounting
From: Victor Nogueira @ 2026-06-20 15:54 UTC (permalink / raw)
  To: Xingquan Liu; +Cc: Jamal Hadi Salim, netdev, Jiri Pirko, Chia-Yu Chang, stable
In-Reply-To: <20260619151447.223640-1-b1n@b1n.io>

On Fri, Jun 19, 2026 at 12:15 PM Xingquan Liu <b1n@b1n.io> wrote:
>
> When DualPI2 splits a GSO skb into N segments, it propagates N
> additional packets to its parent before returning NET_XMIT_SUCCESS.
> The parent then accounts for the original skb once more, leaving its
> qlen one larger than the number of packets actually queued.
>
> With QFQ as the parent, after all real packets are dequeued, QFQ still
> has a non-zero qlen while its in-service aggregate has no active
> classes. qfq_choose_next_agg() returns NULL and qfq_dequeue() passes
> the result to qfq_peek_skb(), causing a NULL pointer dereference.
>
> Follow the same pattern used by tbf_segment() and taprio: count only
> successfully queued segments, propagate the difference between the
> original skb and those segments, and return NET_XMIT_SUCCESS whenever
> at least one segment was queued.
>
> Fixes: 8f9516daedd6 ("sched: Add enqueue/dequeue of dualpi2 qdisc")
> Cc: stable@vger.kernel.org
> Signed-off-by: Xingquan Liu <b1n@b1n.io>

Reviewed-by: Victor Nogueira <victor@mojatatu.com>

^ permalink raw reply

* [PATCH net] seg6: validate SRH length before reading fixed fields
From: Nuoqi Gui @ 2026-06-20 15:55 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, Andrea Mayer
  Cc: netdev, bpf, linux-kernel, Nuoqi Gui

seg6_validate_srh() reads fixed SRH fields such as srh->type and
srh->hdrlen before checking that the supplied length covers the fixed
struct ipv6_sr_hdr fields.  Callers that pass a length smaller than
sizeof(struct ipv6_sr_hdr) therefore expose those reads to memory
outside the validated range.

The BPF SEG6 encap path (bpf_lwt_push_encap() -> bpf_push_seg6_encap())
is one such caller: it forwards a BPF program-supplied pointer and
length straight to seg6_validate_srh() with no minimum-size guard, so a
2-byte SEG6 encap header lets the validator read srh->type at offset 2
beyond the caller-supplied buffer.

Reject lengths shorter than the fixed SRH at the top of
seg6_validate_srh(), before any field is read.  This fixes the BPF helper
path and hardens the common validator for any other caller that reaches it
with a too-short SRH.

Fixes: fe94cc290f53 ("bpf: Add IPv6 Segment Routing helpers")
Signed-off-by: Nuoqi Gui <gnq25@mails.tsinghua.edu.cn>
---
 net/ipv6/seg6.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/ipv6/seg6.c b/net/ipv6/seg6.c
index 1c3ad25700c4c..d2cb32a1058af 100644
--- a/net/ipv6/seg6.c
+++ b/net/ipv6/seg6.c
@@ -29,6 +29,9 @@ bool seg6_validate_srh(struct ipv6_sr_hdr *srh, int len, bool reduced)
 	int max_last_entry;
 	int trailing;

+	if (len < (int)sizeof(*srh))
+		return false;
+
 	if (srh->type != IPV6_SRCRT_TYPE_4)
 		return false;

---
base-commit: 96e7f9122aae0ed000ee321f324b812a447906d9
change-id: 20260619-f01-17-seg6-srh-len-a85f35427e0b

Best regards,
--  
Nuoqi Gui <gnq25@mails.tsinghua.edu.cn>

^ permalink raw reply related

* Re: [RFC net-next 3/4] net: dsa: motorcomm: Dynamically allocate port structures
From: David Yang @ 2026-06-20 16:13 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: netdev, Vladimir Oltean, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, linux-kernel
In-Reply-To: <fc82aaa6-ff6b-4d17-beb2-e6f156e64059@lunn.ch>

On Sat, Jun 20, 2026 at 4:03 PM Andrew Lunn <andrew@lunn.ch> wrote:
>
> On Fri, Jun 19, 2026 at 02:46:24PM +0800, David Yang wrote:
> > On Fri, Jun 19, 2026 at 2:06 PM Andrew Lunn <andrew@lunn.ch> wrote:
> > >
> > > On Fri, Jun 19, 2026 at 04:26:31AM +0800, David Yang wrote:
> > > > With support for LED introduced later, struct yt921x_priv will be 17k
> > > > which is not very good for a single kmalloc(). Convert the ports array
> > > > to a array of pointers to stop bloating the priv struct.
> > > >
> > > > Signed-off-by: David Yang <mmyangfl@gmail.com>
> > > > ---
> > > >  drivers/net/dsa/motorcomm/chip.c | 95 ++++++++++++++++++++++++--------
> > > >  drivers/net/dsa/motorcomm/chip.h |  3 +-
> > > >  2 files changed, 75 insertions(+), 23 deletions(-)
> > > >
> > > > diff --git a/drivers/net/dsa/motorcomm/chip.c b/drivers/net/dsa/motorcomm/chip.c
> > > > index 6dee25b6754a..d44f7749de02 100644
> > > > --- a/drivers/net/dsa/motorcomm/chip.c
> > > > +++ b/drivers/net/dsa/motorcomm/chip.c
> > > > @@ -548,11 +548,14 @@ yt921x_mbus_ext_init(struct yt921x_priv *priv, struct device_node *mnp)
> > > >  /* Read and handle overflow of 32bit MIBs. MIB buffer must be zeroed before. */
> > > >  static int yt921x_read_mib(struct yt921x_priv *priv, int port)
> > > >  {
> > > > -     struct yt921x_port *pp = &priv->ports[port];
> > > > +     struct yt921x_port *pp = priv->ports[port];
> > > >       struct device *dev = to_device(priv);
> > > >       struct yt921x_mib *mib = &pp->mib;
> > > >       int res = 0;
> > > >
> > > > +     if (!pp)
> > > > +             return -ENODEV;
> > > > +
> > >
> > > Are all these tests actually needed? If you cannot allocate the
> > > memory, i would expect the probe to fail, so you can never get here.
> > >
> > >         Andrew
> >
> > Dummy ports are no longer assigned control blocks (in yt921x_dsa_setup).
>
> This seems pretty error prone. A missing check will result in an
> opps. At least it will be obvious. How big is each port structure? Is
> the memory saving worth it?
>
>     Andrew

It's about 1.4k per port for 5 dummy ports.

^ permalink raw reply

* [PATCH net v2 0/7] ipv6: fix sysctl error handling and missing notifications
From: Fernando Fernandez Mancera @ 2026-06-20 16:18 UTC (permalink / raw)
  To: netdev
  Cc: nicolas.dichtel, stephen, brian.haley, horms, pabeni, kuba,
	edumazet, davem, idosch, dsahern, Fernando Fernandez Mancera

While working on a different IPv6 patch series I have spotted multiple
minor bugs around sysctl error handling and notifications. In general,
they are not serious issues.

In addition, there is one more issue in forwarding sysctl as it does not
check for CAP_NET_ADMIN for the namespace. I am keeping that patch out
of this series and I am aiming it at the net-next tree once it re-opens.

Changes:
  v2: Patch 3: fix return code of addrconf_fixup_forwarding()
      Patch 5: acquire lock before calling proc_dointvec()
      Patch 7: new on this revision of the series

Fernando Fernandez Mancera (7):
  ipv6: fix error handling in disable_ipv6 sysctl
  ipv6: fix error handling in ignore_routes_with_linkdown sysctl
  ipv6: fix error handling in forwarding sysctl
  ipv6: fix error handling in disable_policy sysctl
  ipv6: reset value and position for proxy_ndp sysctl restart
  ipv6: fix missing notification for ignore_routes_with_linkdown
  ipv6: reset position for force_forwarding sysctl restart

 net/ipv6/addrconf.c | 48 +++++++++++++++++++++++++++++----------------
 1 file changed, 31 insertions(+), 17 deletions(-)

-- 
2.54.0


^ permalink raw reply

* [PATCH net v2 1/7] ipv6: fix error handling in disable_ipv6 sysctl
From: Fernando Fernandez Mancera @ 2026-06-20 16:18 UTC (permalink / raw)
  To: netdev
  Cc: nicolas.dichtel, stephen, brian.haley, horms, pabeni, kuba,
	edumazet, davem, idosch, dsahern, Fernando Fernandez Mancera
In-Reply-To: <20260620161850.7114-1-fmancera@suse.de>

When writing to the disable_ipv6 sysctl, if proc_dointvec() fails to
parse the input, it returns a negative error code. The current
implementation is overwriting that error for write operations.

This results in a silent failure, it returns a successful write although
the configuration was not modified at all. When modifying the "all"
variant it can also modify the configuration of existing interfaces to
the wrong value.

Fix this by checking the return value of proc_dointvec() and returning
early on failure.

Fixes: 56d417b12e57 ("IPv6: Add 'autoconf' and 'disable_ipv6' module parameters")
Reviewed-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
---
 net/ipv6/addrconf.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 1f21ccb55caa..c901b444a995 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -6467,6 +6467,8 @@ static int addrconf_sysctl_disable(const struct ctl_table *ctl, int write,
 	lctl.data = &val;

 	ret = proc_dointvec(&lctl, write, buffer, lenp, ppos);
+	if (ret)
+		return ret;

 	if (write)
 		ret = addrconf_disable_ipv6(ctl, valp, val);
-- 
2.54.0

^ permalink raw reply related

* [PATCH net v2 4/7] ipv6: fix error handling in disable_policy sysctl
From: Fernando Fernandez Mancera @ 2026-06-20 16:18 UTC (permalink / raw)
  To: netdev
  Cc: nicolas.dichtel, stephen, brian.haley, horms, pabeni, kuba,
	edumazet, davem, idosch, dsahern, Fernando Fernandez Mancera
In-Reply-To: <20260620161850.7114-1-fmancera@suse.de>

When writing to the disable_policy sysctl, if proc_dointvec() fails to
parse the input, it returns a negative error code. The current
implementation is resetting the position argument even if an error
occurred during proc_dointvec() and not only during sysctl restart.

Fix this by checking the return value of proc_dointvec() and returning
early on failure.

Fixes: df789fe75206 ("ipv6: Provide ipv6 version of "disable_policy" sysctl")
Reviewed-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
---
 net/ipv6/addrconf.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index d23a89b07eed..5d96cbf76134 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -6769,6 +6769,8 @@ static int addrconf_sysctl_disable_policy(const struct ctl_table *ctl, int write
 	lctl = *ctl;
 	lctl.data = &val;
 	ret = proc_dointvec(&lctl, write, buffer, lenp, ppos);
+	if (ret)
+		return ret;
 
 	if (write && (*valp != val))
 		ret = addrconf_disable_policy(ctl, valp, val);
-- 
2.54.0


^ permalink raw reply related

* [PATCH net v2 2/7] ipv6: fix error handling in ignore_routes_with_linkdown sysctl
From: Fernando Fernandez Mancera @ 2026-06-20 16:18 UTC (permalink / raw)
  To: netdev
  Cc: nicolas.dichtel, stephen, brian.haley, horms, pabeni, kuba,
	edumazet, davem, idosch, dsahern, Fernando Fernandez Mancera
In-Reply-To: <20260620161850.7114-1-fmancera@suse.de>

When writing to the ignore_routes_with_linkdown sysctl, if
proc_dointvec() fails to parse the input, it returns a negative error
code. The current implementation is overwriting that error for write
operations.

This results in a silent failure, it returns a successful write although
the configuration was not modified at all. When modifying the "all"
variant it can also modify the configuration of existing interfaces to
the wrong value.

Fix this by checking the return value of proc_dointvec() and returning
early on failure.

Fixes: 35103d11173b ("net: ipv6 sysctl option to ignore routes when nexthop link is down")
Reviewed-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
---
 net/ipv6/addrconf.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index c901b444a995..70058d971205 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -6671,6 +6671,8 @@ int addrconf_sysctl_ignore_routes_with_linkdown(const struct ctl_table *ctl,
 	lctl.data = &val;

 	ret = proc_dointvec(&lctl, write, buffer, lenp, ppos);
+	if (ret)
+		return ret;

 	if (write)
 		ret = addrconf_fixup_linkdown(ctl, valp, val);
-- 
2.54.0

^ permalink raw reply related

* [PATCH net v2 5/7] ipv6: reset value and position for proxy_ndp sysctl restart
From: Fernando Fernandez Mancera @ 2026-06-20 16:18 UTC (permalink / raw)
  To: netdev
  Cc: nicolas.dichtel, stephen, brian.haley, horms, pabeni, kuba,
	edumazet, davem, idosch, dsahern, Fernando Fernandez Mancera
In-Reply-To: <20260620161850.7114-1-fmancera@suse.de>

When handling proxy_ndp, if rtnl_net_trylock() fails, the operation is
retried but as the value was already modified by the initial
proc_dointvec() call, the restarted syscall will read the newly modified
value as the 'old' state.

Fix this by taking the RTNL lock before parsing the input value if the
operation is a write.

Fixes: c92d5491a6d9 ("netconf: add support for IPv6 proxy_ndp")
Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
---
 net/ipv6/addrconf.c | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 5d96cbf76134..82b6f603faa0 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -6482,20 +6482,19 @@ static int addrconf_sysctl_disable(const struct ctl_table *ctl, int write,
 static int addrconf_sysctl_proxy_ndp(const struct ctl_table *ctl, int write,
 		void *buffer, size_t *lenp, loff_t *ppos)
 {
+	struct net *net = ctl->extra2;
 	int *valp = ctl->data;
-	int ret;
 	int old, new;
+	int ret;
+
+	if (write && !rtnl_net_trylock(net))
+		return restart_syscall();
 
 	old = *valp;
 	ret = proc_dointvec(ctl, write, buffer, lenp, ppos);
 	new = *valp;
 
 	if (write && old != new) {
-		struct net *net = ctl->extra2;
-
-		if (!rtnl_net_trylock(net))
-			return restart_syscall();
-
 		if (valp == &net->ipv6.devconf_dflt->proxy_ndp) {
 			inet6_netconf_notify_devconf(net, RTM_NEWNETCONF,
 						     NETCONFA_PROXY_NEIGH,
@@ -6514,8 +6513,9 @@ static int addrconf_sysctl_proxy_ndp(const struct ctl_table *ctl, int write,
 						     idev->dev->ifindex,
 						     &idev->cnf);
 		}
-		rtnl_net_unlock(net);
 	}
+	if (write)
+		rtnl_net_unlock(net);
 
 	return ret;
 }
-- 
2.54.0


^ permalink raw reply related

* [PATCH net v2 3/7] ipv6: fix error handling in forwarding sysctl
From: Fernando Fernandez Mancera @ 2026-06-20 16:18 UTC (permalink / raw)
  To: netdev
  Cc: nicolas.dichtel, stephen, brian.haley, horms, pabeni, kuba,
	edumazet, davem, idosch, dsahern, Fernando Fernandez Mancera
In-Reply-To: <20260620161850.7114-1-fmancera@suse.de>

When writing to the forwarding sysctl, if proc_dointvec() fails to parse
the input, it returns a negative error code. The current implementation
is overwriting that error for write operations.

This results in a silent failure, it returns a successful write although
the configuration was not modified at all. When modifying the "all"
variant it can also modify the configuration of existing interfaces to
the wrong value.

Fix this by checking the return value of proc_dointvec() and returning
early on failure. In addition, adjust return code of
addrconf_fixup_forwarding() for successful operation.

Fixes: b325fddb7f86 ("ipv6: Fix sysctl unregistration deadlock")
Reviewed-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
---
 net/ipv6/addrconf.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 70058d971205..d23a89b07eed 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -913,7 +913,7 @@ static int addrconf_fixup_forwarding(const struct ctl_table *table, int *p, int

 	if (newf)
 		rt6_purge_dflt_routers(net);
-	return 1;
+	return 0;
 }

 static void addrconf_linkdown_change(struct net *net, __s32 newf)
@@ -6370,6 +6370,8 @@ static int addrconf_sysctl_forward(const struct ctl_table *ctl, int write,
 	lctl.data = &val;

 	ret = proc_dointvec(&lctl, write, buffer, lenp, ppos);
+	if (ret)
+		return ret;

 	if (write)
 		ret = addrconf_fixup_forwarding(ctl, valp, val);
-- 
2.54.0

^ permalink raw reply related

* [PATCH net v2 7/7] ipv6: reset position for force_forwarding sysctl restart
From: Fernando Fernandez Mancera @ 2026-06-20 16:18 UTC (permalink / raw)
  To: netdev
  Cc: nicolas.dichtel, stephen, brian.haley, horms, pabeni, kuba,
	edumazet, davem, idosch, dsahern, Fernando Fernandez Mancera
In-Reply-To: <20260620161850.7114-1-fmancera@suse.de>

When handling proxy_ndp, if rtnl_net_trylock() fails, the operation is
retried but the position pointer was already advanced meaning that the
restarted sysctl will read from an incorrect offset.

Fix this by restoring the original position pointer before restarting
the syscall.

In addition, remove the redundant position pointer restoration at the
end of the function.

Fixes: f24987ef6959 ("ipv6: add `force_forwarding` sysctl to enable per-interface forwarding")
Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
---
 net/ipv6/addrconf.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index cbe681de3818..8c0741e9dfcc 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -6825,8 +6825,10 @@ static int addrconf_sysctl_force_forwarding(const struct ctl_table *ctl, int wri
 	ret = proc_douintvec_minmax(&tmp_ctl, write, buffer, lenp, ppos);
 
 	if (write && old_val != new_val) {
-		if (!rtnl_net_trylock(net))
+		if (!rtnl_net_trylock(net)) {
+			*ppos = pos;
 			return restart_syscall();
+		}
 
 		WRITE_ONCE(*valp, new_val);
 
@@ -6851,8 +6853,6 @@ static int addrconf_sysctl_force_forwarding(const struct ctl_table *ctl, int wri
 		rtnl_net_unlock(net);
 	}
 
-	if (ret)
-		*ppos = pos;
 	return ret;
 }
 
-- 
2.54.0


^ permalink raw reply related

* [PATCH net v2 6/7] ipv6: fix missing notification for ignore_routes_with_linkdown
From: Fernando Fernandez Mancera @ 2026-06-20 16:18 UTC (permalink / raw)
  To: netdev
  Cc: nicolas.dichtel, stephen, brian.haley, horms, pabeni, kuba,
	edumazet, davem, idosch, dsahern, Fernando Fernandez Mancera
In-Reply-To: <20260620161850.7114-1-fmancera@suse.de>

When changing the ignore_routes_with_linkdown sysctl for a specific
interface, the RTM_NEWNETCONF netlink notification was not being emitted
to userspace. Fix this by emitting the notification when needed.

In addition, fix bogus return value for successful "all" and specific
interface write operation leading to a wrong reset of the position
pointer.

Fixes: 35103d11173b ("net: ipv6 sysctl option to ignore routes when nexthop link is down")
Reviewed-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
---
 net/ipv6/addrconf.c | 18 ++++++++++++------
 1 file changed, 12 insertions(+), 6 deletions(-)

diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 82b6f603faa0..cbe681de3818 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -955,11 +955,7 @@ static int addrconf_fixup_linkdown(const struct ctl_table *table, int *p, int ne
 						     NETCONFA_IGNORE_ROUTES_WITH_LINKDOWN,
 						     NETCONFA_IFINDEX_DEFAULT,
 						     net->ipv6.devconf_dflt);
-		rtnl_net_unlock(net);
-		return 0;
-	}
-
-	if (p == &net->ipv6.devconf_all->ignore_routes_with_linkdown) {
+	} else if (p == &net->ipv6.devconf_all->ignore_routes_with_linkdown) {
 		WRITE_ONCE(net->ipv6.devconf_dflt->ignore_routes_with_linkdown, newf);
 		addrconf_linkdown_change(net, newf);
 		if ((!newf) ^ (!old))
@@ -968,11 +964,21 @@ static int addrconf_fixup_linkdown(const struct ctl_table *table, int *p, int ne
 						     NETCONFA_IGNORE_ROUTES_WITH_LINKDOWN,
 						     NETCONFA_IFINDEX_ALL,
 						     net->ipv6.devconf_all);
+	} else {
+		if (!newf ^ !old) {
+			struct inet6_dev *idev = table->extra1;
+
+			inet6_netconf_notify_devconf(net,
+						     RTM_NEWNETCONF,
+						     NETCONFA_IGNORE_ROUTES_WITH_LINKDOWN,
+						     idev->dev->ifindex,
+						     &idev->cnf);
+		}
 	}
 
 	rtnl_net_unlock(net);
 
-	return 1;
+	return 0;
 }
 
 #endif
-- 
2.54.0


^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox