Netdev List

* [PATCH net v2] net: dsa: sja1105: round up PTP perout pin duration
From: Aleksandrova Alyona @ 2026-06-18 11:05 UTC
  To: Vladimir Oltean
  Cc: Andrew Lunn, Florian Fainelli, David S . Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Richard Cochran, linux-kernel,
	netdev, lvc-project

pin_duration is converted from the user-provided period to SJA1105
clock ticks and is later passed as the cycle_time argument to
future_base_time().

Very small period values may become zero after the conversion,
which can lead to a division by zero in future_base_time().

Round zero pin_duration up to 1 tick so that the smallest unsupported
periods use the minimum non-zero hardware duration instead of passing
zero to future_base_time().

Found by Linux Verification Center (linuxtesting.org) with SVACE.

Fixes: 747e5eb31d59 ("net: dsa: sja1105: configure the PTP_CLK pin as EXT_TS or PER_OUT")
Signed-off-by: Aleksandrova Alyona <aga@itb.spb.ru>
---
v2:
- Round up zero pin_duration to 1 instead of rejecting it, as suggested
  by Andrew Lunn.

 drivers/net/dsa/sja1105/sja1105_ptp.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/dsa/sja1105/sja1105_ptp.c b/drivers/net/dsa/sja1105/sja1105_ptp.c
index a7d41e781398..afb11690c217 100644
--- a/drivers/net/dsa/sja1105/sja1105_ptp.c
+++ b/drivers/net/dsa/sja1105/sja1105_ptp.c
@@ -755,7 +755,7 @@ static int sja1105_per_out_enable(struct sja1105_private *priv,
 		 * 2 edges on PTP_CLK. So check for truncation which happens
 		 * at periods larger than around 68.7 seconds.
 		 */
-		pin_duration = ns_to_sja1105_ticks(pin_duration / 2);
+		pin_duration = max_t(u64, ns_to_sja1105_ticks(pin_duration / 2), 1);
 		if (pin_duration > U32_MAX) {
 			rc = -ERANGE;
 			goto out;
-- 
2.26.2
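
The truncation described in the commit message can be sketched in plain C. This is a minimal userspace model, not the driver code: the 8 ns tick length and the `sketch_*` names are assumptions made for illustration.

```c
#include <stdint.h>

/* Assumption: one SJA1105 PTP tick lasts 8 ns. */
#define SKETCH_TICK_NS 8ULL

/* Integer division truncates, so anything below one tick becomes 0. */
static uint64_t sketch_ns_to_ticks(uint64_t ns)
{
	return ns / SKETCH_TICK_NS;
}

/* Half the period in ticks, rounded up to at least one tick. */
static uint64_t sketch_pin_duration(uint64_t period_ns)
{
	uint64_t ticks = sketch_ns_to_ticks(period_ns / 2);

	return ticks ? ticks : 1;
}
```

Under these assumptions any period shorter than 16 ns would convert to zero ticks without the clamp, which is the value that must not reach a divisor.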


* Re: [PATCH net v3] net: airoha: Fix skb->priority underflow in airoha_dev_select_queue()
From: Wayen Yan @ 2026-06-18 14:10 UTC
  To: Jakub Kicinski
  Cc: Lorenzo Bianconi, netdev, horms, pabeni, edumazet, andrew+netdev,
	angelogioacchino.delregno, matthias.bgg, linux-arm-kernel,
	linux-mediatek
In-Reply-To: <20260617161951.52abe413@kernel.org>

On Wed, Jun 18, 2026 at 08:19:51AM +0200, Jakub Kicinski wrote:
> Hi Lorenzo, is there a reason we're subtracting 1 here in the first
> place? Could be just me, but may be worth adding a comment here.
>
> Please respin with some sort of an explanation..

Hi Jakub,

The (priority - 1) mapping predates my involvement — I only addressed
the underflow bug when skb->priority is 0, where the unsigned
subtraction wraps and routes best-effort packets to the highest-priority
queue.

Lorenzo, could you clarify the intended priority-to-queue mapping so
I can add a proper comment in the respin?

Regards,
Wayen
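
The wraparound described above can be reproduced outside the kernel. The sketch below is hypothetical: the queue count of 8 and the function names are assumptions, and the guarded variant only illustrates one way to avoid the wrap, not necessarily what the patch does.

```c
#include <stdint.h>

#define SKETCH_NUM_QUEUES 8u	/* hypothetical queue count */

/* Unguarded form: priority 0 wraps to UINT32_MAX before the modulo. */
static uint32_t sketch_queue_wrapping(uint32_t priority)
{
	return (priority - 1) % SKETCH_NUM_QUEUES;
}

/* Guarded form: priority 0 is pinned to queue 0 instead of wrapping. */
static uint32_t sketch_queue_guarded(uint32_t priority)
{
	return priority ? (priority - 1) % SKETCH_NUM_QUEUES : 0;
}
```

With eight queues the unguarded form maps priority 0 to queue 7, the highest index, which matches the misrouting of best-effort traffic described in the reply.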

* Re: (subset) [PATCH net-next 1/3] leds: trigger: netdev: Extend speeds up to 100G
From: Lee Jones @ 2026-06-18 11:01 UTC
  To: Lee Jones, Pavel Machek, Alexander Duyck, Jakub Kicinski,
	kernel-team, Andrew Lunn, David S. Miller, Eric Dumazet,
	Paolo Abeni, Russell King, Daniel Golle, Kees Cook, Simon Horman,
	Dimitri Daskalakis, Jacob Keller, Lee Trager, Mohsin Bashir,
	Alok Tiwari, Chengfeng Ye, Andy Shevchenko, Andrew Lunn,
	mike.marciniszyn
  Cc: linux-kernel, linux-leds, netdev
In-Reply-To: <20260520200337.204431-2-mike.marciniszyn@gmail.com>

On Wed, 20 May 2026 16:03:35 -0400, mike.marciniszyn@gmail.com wrote:
> Add 25G, 40G, 50G, and 100G as available speeds to the netdev LED trigger.

Applied, thanks!

[1/3] leds: trigger: netdev: Extend speeds up to 100G
      commit: bbd8b5bdb88bf15006b078f6a2a3b452ffaa10b4

--
Lee Jones [李琼斯]


* Re: [PATCH net-next 1/3] net: bcmgenet: collapse TX priority queues to a single queue
From: Florian Fainelli @ 2026-06-18 10:50 UTC
  To: Nicolai Buchwitz, Doug Berger, bcm-kernel-feedback-list,
	Andrew Lunn, David S . Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni
  Cc: Justin Chen, Ovidiu Panait, netdev, linux-kernel
In-Reply-To: <20260612205915.3156127-2-nb@tipi-net.de>



On 6/12/2026 1:59 PM, Nicolai Buchwitz wrote:
> The strict-priority TX queues can starve under multi-queue load and
> trip NETDEV_WATCHDOG. Justin's earlier series [1] worked around the
> symptom but kept the design.
> 
> The multi-queue design was originally used for STB use cases that are
> no longer needed, as confirmed by Justin. v1 hw_params already
> exercises a single-queue path. Point v2-v4 at the same configuration:
> ring 0 takes the full BD pool, every per-ring loop collapses to one
> iteration, and netif_set_real_num_tx_queues drops to 1 via the
> existing tx_queues + 1 arithmetic.
> 
> Tested on Raspberry Pi CM4 (BCM2711). The baseline kernel trips
> NETDEV_WATCHDOG within seconds under iperf3 UDP saturation
> (-u -b0 -P16 -t60). After the change the same test completes
> without a watchdog, and a single-stream 60 s UDP run sustains
> 956 Mbit/s with 0/4952890 datagrams lost. Single-stream TCP
> throughput is unchanged at 943 Mbit/s.
> 
> [1] https://lore.kernel.org/netdev/20260406175756.134567-1-justin.chen@broadcom.com/
> 
> Signed-off-by: Nicolai Buchwitz <nb@tipi-net.de>

Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
-- 
Florian


* Re: [PATCH net-next 2/3] net: bcmgenet: remove dead priority queue plumbing
From: Florian Fainelli @ 2026-06-18 10:50 UTC
  To: Nicolai Buchwitz, Doug Berger, bcm-kernel-feedback-list,
	Andrew Lunn, David S . Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni
  Cc: Justin Chen, Ovidiu Panait, netdev, linux-kernel
In-Reply-To: <20260612205915.3156127-3-nb@tipi-net.de>



On 6/12/2026 1:59 PM, Nicolai Buchwitz wrote:
> With a single TX ring there is nothing left to prioritize. Drop the
> unused register writes, enum entries, helper macros, and the dead
> "flow period for ring != 0" branch in bcmgenet_init_tx_ring().
> 
> The DMA_ARBITER_{RR,WRR,SP} and DMA_RING_BUF_PRIORITY_* HW defines
> are kept as register documentation.
> 
> No functional change.
> 
> Signed-off-by: Nicolai Buchwitz <nb@tipi-net.de>

Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
-- 
Florian


* Re: [PATCH net-next 3/3] net: bcmgenet: allocate a single-queue netdev
From: Florian Fainelli @ 2026-06-18 10:50 UTC
  To: Nicolai Buchwitz, Doug Berger, bcm-kernel-feedback-list,
	Andrew Lunn, David S . Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni
  Cc: Justin Chen, Ovidiu Panait, netdev, linux-kernel
In-Reply-To: <20260612205915.3156127-4-nb@tipi-net.de>



On 6/12/2026 1:59 PM, Nicolai Buchwitz wrote:
> The driver only uses TX ring 0 and RX ring 0, so allocating a netdev
> with GENET_MAX_MQ_CNT + 1 = 5 TX and 5 RX slots leaves four of each
> unused. Switch to alloc_etherdev() which allocates exactly one queue
> of each kind.
> 
> No functional change: netif_set_real_num_{tx,rx}_queues() already
> clamps the visible queue count to 1.
> 
> Signed-off-by: Nicolai Buchwitz <nb@tipi-net.de>

Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
-- 
Florian


* Re: [PATCH bpf v3 2/2] selftests/bpf: Cover partial copy of non-linear test_run output
From: sun jian @ 2026-06-18 10:45 UTC
  To: Paul Chaignon
  Cc: bot+bpf-ci, bpf, netdev, linux-kselftest, linux-kernel, ast,
	daniel, andrii, martin.lau, eddyz87, memxor, song, yonghong.song,
	jolsa, davem, edumazet, kuba, pabeni, horms, shuah, hawk,
	john.fastabend, sdf, toke, lorenzo, martin.lau, clm,
	ihor.solodrai
In-Reply-To: <ajOv_oOd1zInaW1b@mail.gmail.com>

On Thu, Jun 18, 2026 at 4:44 PM Paul Chaignon <paul.chaignon@gmail.com> wrote:
>
> On Wed, Jun 17, 2026 at 10:19:52PM +0800, sun jian wrote:
> > On Wed, Jun 17, 2026 at 6:31 PM <bot+bpf-ci@kernel.org> wrote:
> > >
> > > > diff --git a/tools/testing/selftests/bpf/prog_tests/prog_run_opts.c b/tools/testing/selftests/bpf/prog_tests/prog_run_opts.c
> > > > index 01f1d1b6715a..9cc898e6a9f7 100644
> > > > --- a/tools/testing/selftests/bpf/prog_tests/prog_run_opts.c
> > > > +++ b/tools/testing/selftests/bpf/prog_tests/prog_run_opts.c
> > > > @@ -4,6 +4,10 @@
> > > >
> > > >  #include "test_pkt_access.skel.h"
> > > >
> > > > +#define NONLINEAR_PKT_LEN 9000
> > > > +#define NONLINEAR_LINEAR_DATA_LEN 64
> > > > +#define SHORT_OUT_LEN 100
> > > > +
> > >
> > > [ ... ]
> > >
> > > > @@ -20,6 +24,69 @@ static void check_run_cnt(int prog_fd, __u64 run_cnt)
> > > >             "incorrect number of repetitions, want %llu have %llu\n", run_cnt, info.run_cnt);
> > > >  }
> > > >
> > > > +static void init_pkt(__u8 *pkt, size_t len)
> > > > +{
> > > > +     size_t i;
> > > > +
> > > > +     for (i = 0; i < len; i++)
> > > > +             pkt[i] = i & 0xff;
> > > > +}
> > >
> > > A question was raised on v2 about whether pkt_v4 could be reused by
> > > reducing the linear area to ETH_HLEN, rather than introducing a custom
> > > init_pkt() with a 9000-byte stack packet.
> > >
> > > Can't we reuse pkt_v4 here by reducing the linear area to ETH_HLEN?
> > > The v3 still adds init_pkt() and the NONLINEAR_PKT_LEN packet, so this
> > > doesn't seem to have been picked up.
> > >
> > > > +
> > > > +static void test_skb_nonlinear_data_out_partial(struct test_pkt_access *skel)
> > > > +{
> > > > +     LIBBPF_OPTS(bpf_test_run_opts, topts);
> > > > +     __u8 pkt[NONLINEAR_PKT_LEN];
> > > > +     __u8 out[SHORT_OUT_LEN];
> > > > +     struct __sk_buff skb = {};
> > > > +     int prog_fd, err;
> > > > +
> > > > +     init_pkt(pkt, sizeof(pkt));
> > > > +
> > > > +     skb.data_end = NONLINEAR_LINEAR_DATA_LEN;
> > > > +
> > > > +     topts.data_in = pkt;
> > > > +     topts.data_size_in = sizeof(pkt);
> > > > +     topts.data_out = out;
> > > > +     topts.data_size_out = sizeof(out);
> > > > +     topts.ctx_in = &skb;
> > > > +     topts.ctx_size_in = sizeof(skb);
> > > > +
> > > > +     prog_fd = bpf_program__fd(skel->progs.tc_pass_prog);
> > >
> > > [ ... ]
> > >
> > > > diff --git a/tools/testing/selftests/bpf/progs/test_pkt_access.c b/tools/testing/selftests/bpf/progs/test_pkt_access.c
> > > > index bce7173152c6..cd284401eebd 100644
> > > > --- a/tools/testing/selftests/bpf/progs/test_pkt_access.c
> > > > +++ b/tools/testing/selftests/bpf/progs/test_pkt_access.c
> > > > @@ -150,3 +150,15 @@ int test_pkt_access(struct __sk_buff *skb)
> > > >
> > > >       return TC_ACT_UNSPEC;
> > > >  }
> > > > +
> > > > +SEC("tc")
> > > > +int tc_pass_prog(struct __sk_buff *skb)
> > > > +{
> > > > +     return TC_ACT_OK;
> > > > +}
> > > > +
> > > > +SEC("xdp.frags")
> > > > +int xdp_frags_pass_prog(struct xdp_md *ctx)
> > > > +{
> > > > +     return XDP_PASS;
> > > > +}
> > >
> > > A related suggestion on v2 was that, once pkt_v4 is reused, the existing
> > > BPF program could be reused instead of adding new pass-through programs.
> > >
> > > Could tc_pass_prog and xdp_frags_pass_prog be dropped in favour of the
> > > existing program? The v3 still adds both of these, so this point also
> > > seems to be open.
> > >
> > >
> > > ---
> > > AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
> > > See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md
> > >
> > > CI run summary: https://github.com/kernel-patches/bpf/actions/runs/27680511802
> >
> > Hi,
> >
> > Thanks for checking this.
>
> Hi Sun Jian,
>
> It would help if you could reply inline instead of at the end of the
> messages, especially when there are multiple comments. See [1] for an
> explanation of how that works.
>
> 1: https://kernelnewbies.org/FirstKernelPatch#Responding_inline

Acknowledged, I will reply inline.

>
> >
> > I tried reusing pkt_v4 and the existing TC program, but they do not fit
> > the skb case this test is trying to cover.
> >
> > For skb test_run, IPv4/IPv6 inputs with a too-short L3 header in the
> > linear area are rejected before bpf_test_finish(). With pkt_v4 and a
> > linear area of ETH_HLEN, the test fails with -EINVAL before reaching the
> > partial copy-out path. If the linear area is increased enough to pass the
> > IPv4 check, pkt_v4 is too small to both trigger the old
> > copy_size - frag_size path and verify that the copied prefix spans the
> > linear data and the first fragment. pkt_v6 has the same issue: after
> > making the IPv6 header linear, only 20 bytes remain in frags.
> >
> > The existing test_pkt_access program has its own packet-access coverage
> > goals and is not just a pass-through carrier. With such a short linear
> > area or small packet fixture, it can fail before the test hits the
> > bpf_test_finish() partial copy-out path. A pass-through TC program is
> > therefore a better fit, because it keeps the test focused on the
> > bpf_test_finish() copy-out semantics.
>
> If we're keeping tc_pass_prog() then can't we use pkt_v4 and get rid of
> init_pkt?
>

pkt_v4 is too small to construct a meaningful nonlinear skb with a stable
linear/frag split while still exercising the partial copy-out boundary in
bpf_test_finish().

With pkt_v4, we either do not reach a fragmented layout, or lose control over
the linear/frag boundary needed to exercise the regression path.

This test uses a 9000B packet so it does not depend on small-packet
allocation details. Smaller packets might work depending on allocation
state, but 9000B reliably gives us a non-linear skb with page frags and a
stable linear/frag boundary for the copy-out regression.

init_pkt() is needed to ensure deterministic byte content across both linear
and fragmented regions so that the memcmp-based validation is stable.

Thanks,
Sun Jian


> >
> > For XDP, this object does not have an existing xdp.frags pass-through
> > program, so the small XDP frags program is needed to cover the other
> > caller of the shared bpf_test_finish() path.
> >
> > Thanks,
> > Sun Jian
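
The property this test checks, a short output buffer receiving a prefix that spans the linear area and the first fragment, can be modelled in userspace. This is only a sketch under stated assumptions: the packet is represented as one linear buffer plus one fragment, and the `sketch_*` helpers are hypothetical stand-ins, not the kernel's bpf_test_finish() logic.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Fill a buffer with the same i & 0xff pattern that init_pkt() uses. */
static void sketch_init_pkt(uint8_t *pkt, size_t len)
{
	for (size_t i = 0; i < len; i++)
		pkt[i] = i & 0xff;
}

/*
 * Copy the first out_len bytes of a packet stored as a linear area
 * followed by one fragment. Returns the number of bytes copied.
 */
static size_t sketch_copy_prefix(uint8_t *out, size_t out_len,
				 const uint8_t *linear, size_t linear_len,
				 const uint8_t *frag, size_t frag_len)
{
	size_t from_linear = out_len < linear_len ? out_len : linear_len;
	size_t rest = out_len - from_linear;
	size_t from_frag = rest < frag_len ? rest : frag_len;

	memcpy(out, linear, from_linear);
	memcpy(out + from_linear, frag, from_frag);
	return from_linear + from_frag;
}

/* Returns 1 when a 100-byte prefix of a 64 + 8936 split matches the input. */
static int sketch_check_partial_copy(void)
{
	static uint8_t pkt[9000];
	uint8_t out[100];
	size_t n;

	sketch_init_pkt(pkt, sizeof(pkt));
	n = sketch_copy_prefix(out, sizeof(out), pkt, 64,
			       pkt + 64, sizeof(pkt) - 64);
	return n == sizeof(out) && memcmp(out, pkt, sizeof(out)) == 0;
}
```

The sizes mirror NONLINEAR_PKT_LEN, NONLINEAR_LINEAR_DATA_LEN and SHORT_OUT_LEN from the quoted patch, so the 100-byte prefix takes 64 bytes from the linear area and 36 from the fragment.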

* [PATCH v3] net: mvneta: re-enable percpu interrupt on resume
From: Yun Zhou @ 2026-06-18 10:43 UTC
  To: marcin.s.wojtas, andrew+netdev, davem, edumazet, kuba, pabeni,
	maxime.chevallier, bigeasy
  Cc: netdev, linux-kernel, yun.zhou

On Armada XP (non-armada3700), mvneta uses percpu interrupts where
the ISR (mvneta_percpu_isr) calls disable_percpu_irq() to mask the
MPIC percpu IRQ, then schedules NAPI. NAPI poll completion calls
enable_percpu_irq() to unmask.

If suspend occurs while NAPI is actively polling (between
disable_percpu_irq in the ISR and enable_percpu_irq in
napi_complete_done), the MPIC percpu interrupt remains masked.
mvneta_stop_dev/mvneta_start_dev do not manage the percpu IRQ
enable state -- they only control mvneta's own INTR_NEW_MASK register.

After resume, the MPIC percpu IRQ stays masked permanently: the
network hardware generates interrupts (INTR_NEW_CAUSE != 0) but the
CPU never receives them (irq count stops incrementing), causing a
complete loss of network connectivity.

Fix by calling on_each_cpu(mvneta_percpu_enable) after
mvneta_start_dev() in the resume path, ensuring the MPIC percpu
IRQ is always unmasked regardless of the pre-suspend state.

Fixes: 12bb03b436da ("net: mvneta: Handle per-cpu interrupts")
Signed-off-by: Yun Zhou <yun.zhou@windriver.com>
---
v3:
  - Dropped the free_irq/request_irq approach (incorrect root cause).
  - Instead, call on_each_cpu(mvneta_percpu_enable) in the resume path
    to ensure the MPIC percpu IRQ is unmasked, matching mvneta_open().
  - Updated commit message with correct root cause analysis.

v2:
  - Move request_irq before cpuhp registration in resume (matching
    mvneta_open ordering) so that failure does not leave cpuhp
    callbacks registered on a non-functional device.
  - On request_irq failure, call netif_device_detach() to prevent
    further traffic on the dead interface.

 drivers/net/ethernet/marvell/mvneta.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c
index b4a845f04c05..5ef79e70e319 100644
--- a/drivers/net/ethernet/marvell/mvneta.c
+++ b/drivers/net/ethernet/marvell/mvneta.c
@@ -5907,6 +5907,9 @@ static int mvneta_resume(struct device *device)
 	rtnl_unlock();
 	mvneta_set_rx_mode(dev);

+	if (!pp->neta_armada3700)
+		on_each_cpu(mvneta_percpu_enable, pp, true);
+
 	return 0;
 }
 #endif
-- 
2.43.0
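
The sequence in the commit message can be followed with a small state model. This is a hypothetical sketch, not driver code: the struct, its fields and the `sketch_*` functions are invented for illustration, and suspend is assumed to stop polling without touching the per-CPU mask.

```c
#include <stdbool.h>

struct sketch_port {
	bool percpu_irq_masked;	/* stand-in for the MPIC per-CPU mask state */
	bool napi_pending;
};

/* ISR: mask the per-CPU IRQ, then hand the work over to NAPI. */
static void sketch_isr(struct sketch_port *p)
{
	p->percpu_irq_masked = true;
	p->napi_pending = true;
}

/* Poll completion: the only place that normally unmasks again. */
static void sketch_napi_complete(struct sketch_port *p)
{
	p->napi_pending = false;
	p->percpu_irq_masked = false;
}

/* Assumed model of suspend: polling stops, the mask is left untouched. */
static void sketch_suspend(struct sketch_port *p)
{
	p->napi_pending = false;
}

/* Resume, with or without the explicit re-enable the patch adds. */
static void sketch_resume(struct sketch_port *p, bool reenable)
{
	if (reenable)
		p->percpu_irq_masked = false;
}
```

In this model an ISR followed by suspend and a resume without the re-enable leaves the mask set forever, while the re-enabling resume clears it regardless of the state before suspend.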

* [PATCH net] ipv6: ioam: fix type confusion of dst_entry
From: Jiayuan Chen @ 2026-06-18 10:43 UTC
  To: netdev
  Cc: Jiayuan Chen, Justin Iurman, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Simon Horman, linux-kernel

IOAM uses a dummy dst_entry (null_dst) to mark that the destination should
not be changed after the transformation. This dst is stored in the IOAM lwt
state and may be passed to dst_cache_set_ip6().

However, the IPv6 dst cache path eventually calls rt6_get_cookie(), which
treats the dst_entry as part of a struct rt6_info. Since the null_dst was
embedded directly as a struct dst_entry in struct ioam6_lwt, this resulted
in an invalid cast and rt6_get_cookie() reading fields from the wrong
object.

In practice, the wrong cookie is not used while dst->obsolete is zero, but
rt6_get_cookie() may also access a per-cpu value when rt->sernum is
zero. In this case, rt->sernum aliases ioam6_lwt::cache::reset_ts, which
can become zero, making this a potential invalid pointer access.

Fix this by embedding a full struct rt6_info for the dummy IPv6 route and
passing its dst member to the dst APIs.

Fixes: 47ce7c854563 ("net: ipv6: ioam6: fix double reallocation")
Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
---
 net/ipv6/ioam6_iptunnel.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/net/ipv6/ioam6_iptunnel.c b/net/ipv6/ioam6_iptunnel.c
index b9f6d892a566..cfb2c41634a0 100644
--- a/net/ipv6/ioam6_iptunnel.c
+++ b/net/ipv6/ioam6_iptunnel.c
@@ -35,7 +35,7 @@ struct ioam6_lwt_freq {
 };

 struct ioam6_lwt {
-	struct dst_entry null_dst;
+	struct rt6_info null_rt;
 	struct dst_cache cache;
 	struct ioam6_lwt_freq freq;
 	atomic_t pkt_cnt;
@@ -176,7 +176,7 @@ static int ioam6_build_state(struct net *net, struct nlattr *nla,
 	 * it is stored in the cache. Then, +1/-1 each time we read the cache
 	 * and release it. Long story short, we're fine.
 	 */
-	dst_init(&ilwt->null_dst, NULL, NULL, DST_OBSOLETE_NONE, DST_NOCOUNT);
+	dst_init(&ilwt->null_rt.dst, NULL, NULL, DST_OBSOLETE_NONE, DST_NOCOUNT);

 	atomic_set(&ilwt->pkt_cnt, 0);
 	ilwt->freq.k = freq_k;
@@ -360,7 +360,7 @@ static int ioam6_output(struct net *net, struct sock *sk, struct sk_buff *skb)
 	/* This is how we notify that the destination does not change after
 	 * transformation and that we need to use orig_dst instead of the cache
 	 */
-	if (dst == &ilwt->null_dst) {
+	if (dst == &ilwt->null_rt.dst) {
 		dst_release(dst);

 		dst = orig_dst;
@@ -429,7 +429,7 @@ static int ioam6_output(struct net *net, struct sock *sk, struct sk_buff *skb)
 		local_bh_disable();
 		if (orig_dst->lwtstate == dst->lwtstate)
 			dst_cache_set_ip6(&ilwt->cache,
-					  &ilwt->null_dst, &fl6.saddr);
+					  &ilwt->null_rt.dst, &fl6.saddr);
 		else
 			dst_cache_set_ip6(&ilwt->cache, dst, &fl6.saddr);
 		local_bh_enable();
-- 
2.43.0
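
Why embedding the full route object makes the cast valid can be shown with a simplified layout. The structs below are hypothetical stand-ins with invented fields; they only mirror the shape of the fix, where the dst handed to the cache is a member of a complete enclosing object.

```c
#include <stddef.h>

struct sketch_dst {
	int obsolete;
};

/* Stand-in for struct rt6_info: the dst is embedded in a larger object. */
struct sketch_rt6 {
	struct sketch_dst dst;
	unsigned int sernum;
};

/* Stand-in for struct ioam6_lwt after the patch. */
struct sketch_lwt {
	struct sketch_rt6 null_rt;
	long cache_reset_ts;
};

#define sketch_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Recover the enclosing route from a dst pointer, as IPv6 code expects to. */
static struct sketch_rt6 *sketch_dst_to_rt6(struct sketch_dst *dst)
{
	return sketch_container_of(dst, struct sketch_rt6, dst);
}

/* Reads sernum through the dst, which is only valid for an embedded route. */
static unsigned int sketch_read_sernum(struct sketch_lwt *lwt)
{
	return sketch_dst_to_rt6(&lwt->null_rt.dst)->sernum;
}
```

With a bare dst member in place of null_rt, the same conversion would yield a pointer whose sernum field overlaps whatever follows the dst in the enclosing struct, which is the aliasing the commit message describes.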

* Re: [PATCH net] net/sched: act_ct: fix nf_connlabels leak on two error paths
From: Jamal Hadi Salim @ 2026-06-18 10:41 UTC
  To: Michael Bommarito
  Cc: Jiri Pirko, Pablo Neira Ayuso, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, netdev, linux-kernel
In-Reply-To: <20260617215708.1115818-1-michael.bommarito@gmail.com>

On Wed, Jun 17, 2026 at 5:57 PM Michael Bommarito
<michael.bommarito@gmail.com> wrote:
>
> tcf_ct_fill_params() calls nf_connlabels_get() (setting put_labels) when
> TCA_CT_LABELS is present, but two later error sites use a bare return
> instead of "goto err", skipping the err: nf_connlabels_put() cleanup.
> They also precede the "p->put_labels = put_labels" assignment, so the
> tcf_ct_params_free() fallback does not release the count either. Each
> failed RTM_NEWACTION on these paths leaks one nf_connlabels reference:
> net->ct.labels_used is incremented and never released. The action is
> reachable with CAP_NET_ADMIN over the netns, i.e. from an unprivileged
> user namespace on default-userns kernels.
>
> Impact: an unprivileged user with CAP_NET_ADMIN over a network namespace
> (e.g. via user namespaces) leaks one nf_connlabels reference per failed
> RTM_NEWACTION on the two error paths; net->ct.labels_used is never
> released.
>
> The err: label is safe to reach from both sites: p->tmpl is still NULL
> there (kzalloc'd, not yet assigned) and nf_ct_put(NULL) is a no-op, so
> no inline release is needed.
>
> Fixes: 70f06c115bcc ("sched: act_ct: switch to per-action label counting")
> Assisted-by: Claude:claude-opus-4-8
> Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com>
> ---
> Testing: refcount/counter leak (CWE-772); no sanitizer for this class, so
> the oracle is the nf_connlabels accounting counter net->ct.labels_used.
>
> Reproduction (UML, before/after, same trigger): CONFIG_NET_ACT_CT=y,
> NF_CONNTRACK_LABELS=y, NF_CONNTRACK_ZONES=n (forces the zone-disabled
> path). A raw RTM_NEWACTION trigger adds "action ct label 0x1/0x1 zone 1"
> 20 times; each returns -EOPNOTSUPP.
>   stock:   net->ct.labels_used climbs 1,2,...,20 (get, then bare return,
>            no put) -- 20 leaked counts, never recovered.
>   patched: counter stays balanced (get then goto err -> put); baseline.
> Control: the same loop without "label" (no nf_connlabels_get) leaves the
> counter unchanged on both trees -- the trigger reached the labels path and
> the synthesis is not itself the cause.
>
> Conditions: reachable via RTM_NEWACTION with CAP_NET_ADMIN over the netns,
> i.e. an unprivileged user in a fresh user+net namespace on default-userns
> distros. The easy path needs CONFIG_NF_CONNTRACK_ZONES=n; the
> nf_ct_tmpl_alloc ENOMEM path leaks on any config under memory pressure.
>
> Mitigations: restrict unprivileged user namespaces; otherwise none short of
> the fix. Harness (trigger.c, init) available on request.
>
>  net/sched/act_ct.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/net/sched/act_ct.c b/net/sched/act_ct.c
> index 6158e13c98d35..f5866a364a74a 100644
> --- a/net/sched/act_ct.c
> +++ b/net/sched/act_ct.c
> @@ -1295,7 +1295,8 @@ static int tcf_ct_fill_params(struct net *net,
>         if (tb[TCA_CT_ZONE]) {
>                 if (!IS_ENABLED(CONFIG_NF_CONNTRACK_ZONES)) {
>                         NL_SET_ERR_MSG_MOD(extack, "Conntrack zones isn't enabled.");
> -                       return -EOPNOTSUPP;
> +                       err = -EOPNOTSUPP;
> +                       goto err;
>                 }
>
>                 tcf_ct_set_key_val(tb,
> @@ -1308,7 +1309,8 @@ static int tcf_ct_fill_params(struct net *net,
>         tmpl = nf_ct_tmpl_alloc(net, &zone, GFP_KERNEL);
>         if (!tmpl) {
>                 NL_SET_ERR_MSG_MOD(extack, "Failed to allocate conntrack template");
> -               return -ENOMEM;
> +               err = -ENOMEM;
> +               goto err;
>         }
>         p->tmpl = tmpl;
>         if (tb[TCA_CT_HELPER_NAME]) {

Looks sane to me.
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>

cheers,
jamal
> --
> 2.53.0
>
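
The difference between the bare return and the goto in the quoted patch comes down to the usual single-exit cleanup pattern. The sketch below is a userspace model with hypothetical names; the counter stands in for net->ct.labels_used and nothing here is the real act_ct code.

```c
#include <errno.h>
#include <stdbool.h>

static int sketch_labels_used;	/* stand-in for net->ct.labels_used */

static void sketch_labels_get(void)
{
	sketch_labels_used++;
}

static void sketch_labels_put(void)
{
	sketch_labels_used--;
}

static int sketch_fill_params(bool want_labels, bool zones_supported)
{
	bool put_labels = false;
	int err;

	if (want_labels) {
		sketch_labels_get();
		put_labels = true;
	}

	if (!zones_supported) {
		err = -EOPNOTSUPP;
		goto err;	/* a bare return here would skip the put below */
	}

	/* Success: the reference stays with the caller. */
	return 0;

err:
	if (put_labels)
		sketch_labels_put();
	return err;
}
```

A failed call leaves the counter where it started, which is the balanced behaviour reported for the patched tree in the reproduction notes.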

* [PATCH bpf v3 2/2] selftests/bpf: test rejection of a packet-modifying SK_SKB stream parser
From: Sechang Lim @ 2026-06-18 10:27 UTC
  To: John Fastabend, Jakub Sitnicki, Eric Dumazet, Kuniyuki Iwashima,
	Paolo Abeni, Willem de Bruijn, David S . Miller, Jakub Kicinski
  Cc: Simon Horman, Bobby Eshleman, Jiayuan Chen, netdev, bpf,
	linux-kernel
In-Reply-To: <20260618102718.2331468-1-rhkrqnwk98@gmail.com>

Verify that attaching an SK_SKB stream parser that can modify the packet
is rejected, while a read-only parser still attaches.

Signed-off-by: Sechang Lim <rhkrqnwk98@gmail.com>
---
 .../selftests/bpf/prog_tests/sockmap_strp.c   | 31 +++++++++++++++++++
 .../selftests/bpf/progs/test_sockmap_strp.c   |  7 +++++
 2 files changed, 38 insertions(+)

diff --git a/tools/testing/selftests/bpf/prog_tests/sockmap_strp.c b/tools/testing/selftests/bpf/prog_tests/sockmap_strp.c
index 621b3b71888e..1d7231728eaf 100644
--- a/tools/testing/selftests/bpf/prog_tests/sockmap_strp.c
+++ b/tools/testing/selftests/bpf/prog_tests/sockmap_strp.c
@@ -431,6 +431,35 @@ static void test_sockmap_strp_verdict(int family, int sotype)
 	test_sockmap_strp__destroy(strp);
 }
 
+static void test_sockmap_strp_parser_reject(void)
+{
+	struct test_sockmap_strp *strp = NULL;
+	int parser_mod, parser_ro, link;
+	int err, map;
+
+	strp = test_sockmap_strp__open_and_load();
+	if (!ASSERT_OK_PTR(strp, "test_sockmap_strp__open_and_load"))
+		return;
+
+	map = bpf_map__fd(strp->maps.sock_map);
+	parser_mod = bpf_program__fd(strp->progs.prog_skb_parser_resize);
+	parser_ro = bpf_program__fd(strp->progs.prog_skb_parser);
+
+	err = bpf_prog_attach(parser_mod, map, BPF_SK_SKB_STREAM_PARSER, 0);
+	ASSERT_ERR(err, "bpf_prog_attach parser_mod");
+
+	link = bpf_link_create(parser_ro, map, BPF_SK_SKB_STREAM_PARSER, NULL);
+	if (!ASSERT_GE(link, 0, "bpf_link_create parser_ro"))
+		goto out;
+
+	err = bpf_link_update(link, parser_mod, NULL);
+	ASSERT_ERR(err, "bpf_link_update parser_mod");
+out:
+	if (link >= 0)
+		close(link);
+	test_sockmap_strp__destroy(strp);
+}
+
 void test_sockmap_strp(void)
 {
 	if (test__start_subtest("sockmap strp tcp pass"))
@@ -451,4 +480,6 @@ void test_sockmap_strp(void)
 		test_sockmap_strp_multiple_pkt(AF_INET, SOCK_STREAM);
 	if (test__start_subtest("sockmap strp tcp dispatch"))
 		test_sockmap_strp_dispatch_pkt(AF_INET, SOCK_STREAM);
+	if (test__start_subtest("sockmap strp parser reject pkt mod"))
+		test_sockmap_strp_parser_reject();
 }
diff --git a/tools/testing/selftests/bpf/progs/test_sockmap_strp.c b/tools/testing/selftests/bpf/progs/test_sockmap_strp.c
index dde3d5bec515..fe88fa6d40bc 100644
--- a/tools/testing/selftests/bpf/progs/test_sockmap_strp.c
+++ b/tools/testing/selftests/bpf/progs/test_sockmap_strp.c
@@ -50,4 +50,11 @@ int prog_skb_parser_partial(struct __sk_buff *skb)
 	return 10;
 }
 
+SEC("sk_skb/stream_parser")
+int prog_skb_parser_resize(struct __sk_buff *skb)
+{
+	bpf_skb_change_tail(skb, skb->len, 0);
+	return skb->len;
+}
+
 char _license[] SEC("license") = "GPL";
-- 
2.43.0


* [PATCH bpf v3 1/2] bpf, sockmap: fix use-after-free when the stream parser resizes the skb
From: Sechang Lim @ 2026-06-18 10:27 UTC
  To: John Fastabend, Jakub Sitnicki, Eric Dumazet, Kuniyuki Iwashima,
	Paolo Abeni, Willem de Bruijn, David S . Miller, Jakub Kicinski
  Cc: Simon Horman, Bobby Eshleman, Jiayuan Chen, netdev, bpf,
	linux-kernel
In-Reply-To: <20260618102718.2331468-1-rhkrqnwk98@gmail.com>

sk_psock_strp_parse() runs the BPF_PROG_TYPE_SK_SKB stream-parser program
to find the length of the next message. strparser assembles a message out
of several received skbs by chaining them onto the head's frag_list and
recording where to append the next one in strp->skb_nextp:

	*strp->skb_nextp = skb;
	strp->skb_nextp = &skb->next;

and then calls the parser on the head:

	len = (*strp->cb.parse_msg)(strp, head);

The parser is only meant to inspect the skb, but the program may call
bpf_skb_change_tail() -- or the sibling bpf_skb_pull_data(),
bpf_skb_change_head(), bpf_skb_adjust_room(), all allowed for SK_SKB.
Once the head carries a frag_list these go

	... -> skb_ensure_writable -> pskb_may_pull -> __pskb_pull_tail

and __pskb_pull_tail() frees the frag_list skbs that strparser still
tracks through skb_nextp:

	while ((list = skb_shinfo(skb)->frag_list) != insp) {
		skb_shinfo(skb)->frag_list = list->next;
		consume_skb(list);
	}

strp->skb_nextp now points into a freed sk_buff. The next segment of
the same message arrives in __strp_recv(), which links it with
*strp->skb_nextp = skb, an 8-byte write into the freed skb. The free
and the write happen in different __strp_recv() calls, so the message
has to span at least three segments before it triggers.

  BUG: KASAN: slab-use-after-free in __strp_recv+0x447/0xda0
  Write of size 8 at addr ffff88810db86140 by task repro/349

  Call Trace:
   <IRQ>
   __strp_recv+0x447/0xda0
   __tcp_read_sock+0x13d/0x590
   tcp_bpf_strp_read_sock+0x195/0x320
   strp_data_ready+0x267/0x340
   sk_psock_strp_data_ready+0x1ce/0x350
   tcp_data_queue+0x1364/0x2fd0
   tcp_rcv_established+0xe07/0x1640
   [...]

  Allocated by task 349:
   skb_clone+0x17b/0x210
   __strp_recv+0x2c3/0xda0
   __tcp_read_sock+0x13d/0x590
   [...]

  Freed by task 349:
   kmem_cache_free+0x150/0x570
   __pskb_pull_tail+0x57b/0xc20
   skb_ensure_writable+0x236/0x260
   __bpf_skb_change_tail+0x1d4/0x590
   sk_skb_change_tail+0x2a/0x40
   bpf_prog_1b285dcd6c41373e+0x27/0x30
   bpf_prog_run_pin_on_cpu+0xf3/0x260
   sk_psock_strp_parse+0x118/0x1e0
   __strp_recv+0x4f6/0xda0
   [...]

The same resize also leaves the head's length inconsistent with its
frags, so a later __pskb_pull_tail() can instead hit the
BUG_ON(skb_copy_bits(...)) in net/core/skbuff.c.

A stream parser is only meant to measure the next message, not to modify
the packet. Reject a parser whose program can change packet data
(prog->aux->changes_pkt_data) at attach time. The check is shared by
sock_map_prog_update() and sock_map_link_update_prog(), which between them
cover prog attach, link create and link update. Verdict programs are
unaffected and may still modify the skb.

Fixes: 8a31db561566 ("bpf: add access to sock fields and pkt data from sk_skb programs")
Signed-off-by: Sechang Lim <rhkrqnwk98@gmail.com>
---
 net/core/sock_map.c | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/net/core/sock_map.c b/net/core/sock_map.c
index 99e3789492a0..c60ba6d292f9 100644
--- a/net/core/sock_map.c
+++ b/net/core/sock_map.c
@@ -1515,6 +1515,17 @@ static int sock_map_prog_link_lookup(struct bpf_map *map, struct bpf_prog ***ppr
 	return 0;
 }

+static int sock_map_prog_attach_check(enum bpf_attach_type attach_type,
+				      struct bpf_prog *prog)
+{
+	/* A stream parser must not modify the skb, only measure it. */
+	if (prog && attach_type == BPF_SK_SKB_STREAM_PARSER &&
+	    prog->aux->changes_pkt_data)
+		return -EINVAL;
+
+	return 0;
+}
+
 /* Handle the following four cases:
  * prog_attach: prog != NULL, old == NULL, link == NULL
  * prog_detach: prog == NULL, old != NULL, link == NULL
@@ -1533,6 +1544,10 @@ static int sock_map_prog_update(struct bpf_map *map, struct bpf_prog *prog,
 	if (ret)
 		return ret;

+	ret = sock_map_prog_attach_check(which, prog);
+	if (ret)
+		return ret;
+
 	/* for prog_attach/prog_detach/link_attach, return error if a bpf_link
 	 * exists for that prog.
 	 */
@@ -1776,6 +1791,11 @@ static int sock_map_link_update_prog(struct bpf_link *link,
 		ret = -EINVAL;
 		goto out;
 	}
+
+	ret = sock_map_prog_attach_check(link->attach_type, prog);
+	if (ret)
+		goto out;
+
 	if (!sockmap_link->map) {
 		ret = -ENOLINK;
 		goto out;
-- 
2.43.0
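
The attach-time rule introduced by this patch can be exercised in isolation with a userspace model. The enum, struct and function names below are hypothetical stand-ins; only the shape of the condition follows the diff above.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

enum sketch_attach_type {
	SKETCH_STREAM_PARSER,
	SKETCH_STREAM_VERDICT,
};

struct sketch_prog {
	bool changes_pkt_data;	/* mirrors prog->aux->changes_pkt_data */
};

/* Refuse a packet-modifying program only when it is attached as a parser. */
static int sketch_attach_check(enum sketch_attach_type type,
			       const struct sketch_prog *prog)
{
	if (prog && type == SKETCH_STREAM_PARSER && prog->changes_pkt_data)
		return -EINVAL;

	return 0;
}
```

A NULL program, which corresponds to the detach case, and verdict programs pass through unchanged, matching the statement that verdict programs may still modify the skb.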

* [PATCH bpf v3 0/2] bpf, sockmap: reject a packet-modifying SK_SKB stream parser
From: Sechang Lim @ 2026-06-18 10:27 UTC
  To: John Fastabend, Jakub Sitnicki, Eric Dumazet, Kuniyuki Iwashima,
	Paolo Abeni, Willem de Bruijn, David S . Miller, Jakub Kicinski
  Cc: Simon Horman, Bobby Eshleman, Jiayuan Chen, netdev, bpf,
	linux-kernel

A BPF_PROG_TYPE_SK_SKB stream parser runs on strparser's message head,
which can chain skbs through frag_list. A parser that resizes the skb
frees the frag_list segments that strparser still tracks through
skb_nextp, leading to a use-after-free.

A stream parser is only meant to measure the next message, not to modify
the packet, so reject a packet-modifying parser at attach time rather
than working around the resize at runtime.

v3:
 - reject the parser at attach time instead of cloning the skb at
   runtime (Kuniyuki Iwashima, Jiayuan Chen)
 - add a selftest (Bobby Eshleman)

v2:
 - https://lore.kernel.org/all/20260612123553.2724240-1-rhkrqnwk98@gmail.com/

v1:
 - https://lore.kernel.org/all/20260609112316.3685738-1-rhkrqnwk98@gmail.com/

Sechang Lim (2):
  bpf, sockmap: fix use-after-free when the stream parser resizes the
    skb
  selftests/bpf: test rejection of a packet-modifying SK_SKB stream
    parser

 net/core/sock_map.c                           | 20 ++++++++++++
 .../selftests/bpf/prog_tests/sockmap_strp.c   | 31 +++++++++++++++++++
 .../selftests/bpf/progs/test_sockmap_strp.c   |  7 +++++
 3 files changed, 58 insertions(+)

-- 
2.43.0



* [PATCH bpf-next v2 2/2] selftests/bpf: Cover small conntrack opts error writes
From: Yiyang Chen @ 2026-06-18 10:18 UTC (permalink / raw)
  To: bpf, netfilter-devel
  Cc: Yiyang Chen, pablo, fw, phil, davem, edumazet, kuba, pabeni,
	horms, andrii, eddyz87, ast, daniel, memxor, martin.lau, song,
	yonghong.song, jolsa, emil, shuah, kartikey406, coreteam, netdev,
	linux-kernel, linux-kselftest
In-Reply-To: <cover.1781765747.git.chenyy23@mails.tsinghua.edu.cn>

Add a conntrack kfunc regression check for opts__sz values that do not
cover opts->error. The BPF program initializes opts->error with a guard
value, calls the lookup and allocation kfuncs with opts__sz set to
sizeof(opts->netns_id), and verifies that the guard is still intact
after the kfunc returns NULL.

Without the conntrack wrapper guard, the kfunc error path overwrites
that guard with -EINVAL even though the verifier checked only the first
four bytes of the options object.

Fixes: b4c2b9593a1c ("net/netfilter: Add unstable CT lookup helpers for XDP and TC-BPF")
Fixes: d7e79c97c00c ("net: netfilter: Add kfuncs to allocate and insert CT")
Signed-off-by: Yiyang Chen <chenyy23@mails.tsinghua.edu.cn>
---
 .../testing/selftests/bpf/prog_tests/bpf_nf.c |  6 +++++
 .../testing/selftests/bpf/progs/test_bpf_nf.c | 26 +++++++++++++++++++
 2 files changed, 32 insertions(+)

diff --git a/tools/testing/selftests/bpf/prog_tests/bpf_nf.c b/tools/testing/selftests/bpf/prog_tests/bpf_nf.c
index b33dba4b126e2..14d4c1793aed5 100644
--- a/tools/testing/selftests/bpf/prog_tests/bpf_nf.c
+++ b/tools/testing/selftests/bpf/prog_tests/bpf_nf.c
@@ -5,6 +5,8 @@
 #include "test_bpf_nf.skel.h"
 #include "test_bpf_nf_fail.skel.h"
 
+#define CT_OPTS_ERROR_GUARD 0x12345678
+
 static char log_buf[1024 * 1024];
 
 struct {
@@ -119,6 +121,10 @@ static void test_bpf_nf_ct(int mode)
 	ASSERT_EQ(skel->bss->test_einval_reserved_new, -EINVAL, "Test EINVAL for reserved in new struct not set to 0");
 	ASSERT_EQ(skel->bss->test_einval_netns_id, -EINVAL, "Test EINVAL for netns_id < -1");
 	ASSERT_EQ(skel->bss->test_einval_len_opts, -EINVAL, "Test EINVAL for len__opts != NF_BPF_CT_OPTS_SZ");
+	ASSERT_EQ(skel->bss->test_einval_len_opts_small_lookup, CT_OPTS_ERROR_GUARD,
+		  "Test no error write for lookup opts__sz before error field");
+	ASSERT_EQ(skel->bss->test_einval_len_opts_small_alloc, CT_OPTS_ERROR_GUARD,
+		  "Test no error write for alloc opts__sz before error field");
 	ASSERT_EQ(skel->bss->test_eproto_l4proto, -EPROTO, "Test EPROTO for l4proto != TCP or UDP");
 	ASSERT_EQ(skel->bss->test_enonet_netns_id, -ENONET, "Test ENONET for bad but valid netns_id");
 	ASSERT_EQ(skel->bss->test_enoent_lookup, -ENOENT, "Test ENOENT for failed lookup");
diff --git a/tools/testing/selftests/bpf/progs/test_bpf_nf.c b/tools/testing/selftests/bpf/progs/test_bpf_nf.c
index 076fbf03a1268..df43649ecb785 100644
--- a/tools/testing/selftests/bpf/progs/test_bpf_nf.c
+++ b/tools/testing/selftests/bpf/progs/test_bpf_nf.c
@@ -10,6 +10,8 @@
 #define EINVAL 22
 #define ENOENT 2
 
+#define CT_OPTS_ERROR_GUARD 0x12345678
+
 #define NF_CT_ZONE_DIR_ORIG (1 << IP_CT_DIR_ORIGINAL)
 #define NF_CT_ZONE_DIR_REPL (1 << IP_CT_DIR_REPLY)
 
@@ -19,6 +21,8 @@ int test_einval_reserved = 0;
 int test_einval_reserved_new = 0;
 int test_einval_netns_id = 0;
 int test_einval_len_opts = 0;
+int test_einval_len_opts_small_lookup = 0;
+int test_einval_len_opts_small_alloc = 0;
 int test_eproto_l4proto = 0;
 int test_enonet_netns_id = 0;
 int test_enoent_lookup = 0;
@@ -124,6 +128,28 @@ nf_ct_test(struct nf_conn *(*lookup_fn)(void *, struct bpf_sock_tuple *, u32,
 	else
 		test_einval_len_opts = opts_def.error;
 
+	opts_def.error = CT_OPTS_ERROR_GUARD;
+	ct = lookup_fn(ctx, &bpf_tuple, sizeof(bpf_tuple.ipv4), &opts_def,
+		       sizeof(opts_def.netns_id));
+	if (ct) {
+		bpf_ct_release(ct);
+		test_einval_len_opts_small_lookup = -EINVAL;
+	} else {
+		test_einval_len_opts_small_lookup = opts_def.error;
+	}
+
+	opts_def.error = CT_OPTS_ERROR_GUARD;
+	ct = alloc_fn(ctx, &bpf_tuple, sizeof(bpf_tuple.ipv4), &opts_def,
+		      sizeof(opts_def.netns_id));
+	if (ct) {
+		ct = bpf_ct_insert_entry(ct);
+		if (ct)
+			bpf_ct_release(ct);
+		test_einval_len_opts_small_alloc = -EINVAL;
+	} else {
+		test_einval_len_opts_small_alloc = opts_def.error;
+	}
+
 	opts_def.l4proto = IPPROTO_ICMP;
 	ct = lookup_fn(ctx, &bpf_tuple, sizeof(bpf_tuple.ipv4), &opts_def,
 		       sizeof(opts_def));
-- 
2.34.1



* [PATCH bpf-next v2 0/2] bpf: Guard conntrack opts error writes
From: Yiyang Chen @ 2026-06-18 10:18 UTC (permalink / raw)
  To: bpf, netfilter-devel
  Cc: Yiyang Chen, pablo, fw, phil, davem, edumazet, kuba, pabeni,
	horms, andrii, eddyz87, ast, daniel, memxor, martin.lau, song,
	yonghong.song, jolsa, emil, shuah, kartikey406, coreteam, netdev,
	linux-kernel, linux-kselftest
In-Reply-To: <cover.1781586477.git.chenyy23@mails.tsinghua.edu.cn>

The conntrack lookup/allocation kfuncs expose an opts/opts__sz pair.
The verifier checks the caller-provided opts__sz range, but the wrappers
currently write opts->error after internal errors even when opts__sz is too
small to include that field.

Patch 1 writes opts->error only when opts__sz includes it, and uses a
single helper to fold ERR_PTR returns into the kfunc ABI result while keeping
the local nfct result variable in each wrapper.
Patch 2 adds a bpf_nf regression check that keeps a guard in opts->error
while passing opts__sz covering only netns_id.

The regression check follows the existing bpf_nf test shape.  Before the
fix, the guard is overwritten with -EINVAL even though opts__sz covers only
the first four bytes of the options object.  After the fix, the kfunc still
returns NULL for the invalid size, but the guard remains intact.
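
As a rough userspace illustration of the guard described above (not part
of the series): the sketch below writes an error field only when the
caller-declared size covers it. struct mock_ct_opts, mock_offsetofend(),
mock_report_error() and MOCK_EINVAL are hypothetical stand-ins invented
for this example; they only mimic the idea behind the kernel's
struct bpf_ct_opts and offsetofend().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the options object; only the position of
 * the first two fields matters for this sketch.
 */
struct mock_ct_opts {
	int32_t netns_id;
	int32_t error;
	uint8_t rest[8];
};

/* Userspace equivalent of an "offset of the end of a member" macro. */
#define mock_offsetofend(type, member) \
	(offsetof(type, member) + sizeof(((type *)0)->member))

#define MOCK_EINVAL 22

/* Store err in opts->error only if opts_sz covers that field.
 * Returns 1 when the field was written, 0 when the write was skipped.
 */
static int mock_report_error(struct mock_ct_opts *opts, uint32_t opts_sz,
			     int err)
{
	assert(opts != NULL);

	if (opts_sz < mock_offsetofend(struct mock_ct_opts, error))
		return 0;

	opts->error = err;
	return 1;
}
```

Called with opts_sz set to sizeof(opts->netns_id), the helper leaves a
guard value in opts->error untouched; called with sizeof(*opts), it
stores the error as before.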

Validation, rebased and tested on bpf-next master e771677c937d
("Merge tag 'for-linus-iommufd' of git://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd"):

  git diff --check origin/master..HEAD: OK
  scripts/checkpatch.pl --strict on 1/2 and 2/2: OK
  make O=/root/ebpf-verifier-bug-detection/kernel-build/bpf-next \
    net/netfilter/nf_conntrack_bpf.o: OK
  Focused QEMU direct-runner against XDP and TC lookup/alloc paths:
    unpatched bpf-next e771677c937d: guard overwritten with -EINVAL
    patched v2 007dfd0341cd: guard preserved as 0x12345678
  QEMU upstream bpf_nf selftest with CONFIG_NF_CONNTRACK_MARK,
  CONFIG_NF_CONNTRACK_ZONES, and legacy iptables enabled:
    ./test_progs -t bpf_nf -vv: OK
  git am of exported 1/2 and 2/2 on a fresh worktree at base: OK
  range-diff between branch commits and git-am result: equivalent

Changes in v2:
  - Rebased onto current bpf-next master.
  - Reworked patch 1 to use bpf_ct_opts_result() for the ERR_PTR-to-NULL
    conversion and guarded opts->error write, as suggested by Alexei.
  - Kept the local nfct result variable in each wrapper before returning
    through bpf_ct_opts_result().
  - Added matching Fixes tags to the selftest patch so the regression test
    can be backported with the fix.

v1: https://lore.kernel.org/bpf/cover.1781586477.git.chenyy23@mails.tsinghua.edu.cn/

Yiyang Chen (2):
  bpf: Guard conntrack opts error writes
  selftests/bpf: Cover small conntrack opts error writes

 net/netfilter/nf_conntrack_bpf.c              | 35 +++++++------------
 .../testing/selftests/bpf/prog_tests/bpf_nf.c |  6 ++++
 .../testing/selftests/bpf/progs/test_bpf_nf.c | 26 ++++++++++++++
 3 files changed, 45 insertions(+), 22 deletions(-)

base-commit: e771677c937da5808f7b6c1f0e4a97ec1a84f8a8
-- 
2.34.1


* [PATCH bpf-next v2 1/2] bpf: Guard conntrack opts error writes
From: Yiyang Chen @ 2026-06-18 10:18 UTC (permalink / raw)
  To: bpf, netfilter-devel
  Cc: Yiyang Chen, pablo, fw, phil, davem, edumazet, kuba, pabeni,
	horms, andrii, eddyz87, ast, daniel, memxor, martin.lau, song,
	yonghong.song, jolsa, emil, shuah, kartikey406, coreteam, netdev,
	linux-kernel, linux-kselftest
In-Reply-To: <cover.1781765747.git.chenyy23@mails.tsinghua.edu.cn>

The conntrack lookup and allocation kfuncs take an opts pointer
together with an opts__sz argument. The verifier checks only the memory
range described by opts__sz, but the wrappers unconditionally write
opts->error whenever the internal lookup or allocation helper returns an
error.

For an invalid size smaller than the end of opts->error, that write can
land outside the verifier-checked range. Keep returning NULL for invalid
arguments, but only report the error through opts->error when the
supplied size includes the field.

This preserves error reporting for the supported 12-byte and 16-byte
layouts, and for other invalid sizes that still include opts->error.

Fixes: b4c2b9593a1c ("net/netfilter: Add unstable CT lookup helpers for XDP and TC-BPF")
Fixes: d7e79c97c00c ("net: netfilter: Add kfuncs to allocate and insert CT")
Signed-off-by: Yiyang Chen <chenyy23@mails.tsinghua.edu.cn>
---
 net/netfilter/nf_conntrack_bpf.c | 35 ++++++++++++--------------------
 1 file changed, 13 insertions(+), 22 deletions(-)

diff --git a/net/netfilter/nf_conntrack_bpf.c b/net/netfilter/nf_conntrack_bpf.c
index 40c261cd0af38..f98d1d4b42c3d 100644
--- a/net/netfilter/nf_conntrack_bpf.c
+++ b/net/netfilter/nf_conntrack_bpf.c
@@ -65,6 +65,15 @@ enum {
 	NF_BPF_CT_OPTS_SZ = 16,
 };
 
+static void *bpf_ct_opts_result(struct bpf_ct_opts *opts, u32 opts__sz, void *ret)
+{
+	if (!IS_ERR(ret))
+		return ret;
+	if (opts__sz >= offsetofend(struct bpf_ct_opts, error))
+		opts->error = PTR_ERR(ret);
+	return NULL;
+}
+
 static int bpf_nf_ct_tuple_parse(struct bpf_sock_tuple *bpf_tuple,
 				 u32 tuple_len, u8 protonum, u8 dir,
 				 struct nf_conntrack_tuple *tuple)
@@ -297,12 +306,7 @@ bpf_xdp_ct_alloc(struct xdp_md *xdp_ctx, struct bpf_sock_tuple *bpf_tuple,
 
 	nfct = __bpf_nf_ct_alloc_entry(dev_net(ctx->rxq->dev), bpf_tuple, tuple__sz,
 				       opts, opts__sz, 10);
-	if (IS_ERR(nfct)) {
-		opts->error = PTR_ERR(nfct);
-		return NULL;
-	}
-
-	return (struct nf_conn___init *)nfct;
+	return (struct nf_conn___init *)bpf_ct_opts_result(opts, opts__sz, nfct);
 }
 
 /* bpf_xdp_ct_lookup - Lookup CT entry for the given tuple, and acquire a
@@ -331,11 +335,7 @@ bpf_xdp_ct_lookup(struct xdp_md *xdp_ctx, struct bpf_sock_tuple *bpf_tuple,
 
 	caller_net = dev_net(ctx->rxq->dev);
 	nfct = __bpf_nf_ct_lookup(caller_net, bpf_tuple, tuple__sz, opts, opts__sz);
-	if (IS_ERR(nfct)) {
-		opts->error = PTR_ERR(nfct);
-		return NULL;
-	}
-	return nfct;
+	return bpf_ct_opts_result(opts, opts__sz, nfct);
 }
 
 /* bpf_skb_ct_alloc - Allocate a new CT entry
@@ -363,12 +363,7 @@ bpf_skb_ct_alloc(struct __sk_buff *skb_ctx, struct bpf_sock_tuple *bpf_tuple,
 
 	net = skb->dev ? dev_net(skb->dev) : sock_net(skb->sk);
 	nfct = __bpf_nf_ct_alloc_entry(net, bpf_tuple, tuple__sz, opts, opts__sz, 10);
-	if (IS_ERR(nfct)) {
-		opts->error = PTR_ERR(nfct);
-		return NULL;
-	}
-
-	return (struct nf_conn___init *)nfct;
+	return (struct nf_conn___init *)bpf_ct_opts_result(opts, opts__sz, nfct);
 }
 
 /* bpf_skb_ct_lookup - Lookup CT entry for the given tuple, and acquire a
@@ -397,11 +392,7 @@ bpf_skb_ct_lookup(struct __sk_buff *skb_ctx, struct bpf_sock_tuple *bpf_tuple,
 
 	caller_net = skb->dev ? dev_net(skb->dev) : sock_net(skb->sk);
 	nfct = __bpf_nf_ct_lookup(caller_net, bpf_tuple, tuple__sz, opts, opts__sz);
-	if (IS_ERR(nfct)) {
-		opts->error = PTR_ERR(nfct);
-		return NULL;
-	}
-	return nfct;
+	return bpf_ct_opts_result(opts, opts__sz, nfct);
 }
 
 /* bpf_ct_insert_entry - Add the provided entry into a CT map
-- 
2.34.1



* [PATCH v1 3/3] thunderbold: Drop comma after device id array terminator
From: Uwe Kleine-König (The Capable Hub) @ 2026-06-18 10:14 UTC (permalink / raw)
  To: Mika Westerberg, Yehezkel Bernat, Andreas Noever
  Cc: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, netdev, linux-kernel, linux-usb
In-Reply-To: <cover.1781776904.git.u.kleine-koenig@baylibre.com>

The usual style for other device id arrays doesn't have a comma after
the initializer.

Signed-off-by: Uwe Kleine-König (The Capable Hub) <u.kleine-koenig@baylibre.com>
---
 drivers/net/thunderbolt/main.c | 2 +-
 drivers/thunderbolt/dma_test.c | 2 +-
 drivers/thunderbolt/stream.c   | 2 +-
 3 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/net/thunderbolt/main.c b/drivers/net/thunderbolt/main.c
index edfcfc41a316..c1003e06a8bd 100644
--- a/drivers/net/thunderbolt/main.c
+++ b/drivers/net/thunderbolt/main.c
@@ -1455,7 +1455,7 @@ static DEFINE_SIMPLE_DEV_PM_OPS(tbnet_pm_ops, tbnet_suspend, tbnet_resume);
 
 static const struct tb_service_id tbnet_ids[] = {
 	{ TB_SERVICE("network", 1) },
-	{ },
+	{ }
 };
 MODULE_DEVICE_TABLE(tbsvc, tbnet_ids);
 
diff --git a/drivers/thunderbolt/dma_test.c b/drivers/thunderbolt/dma_test.c
index 63e6bbf00e12..519c67678b08 100644
--- a/drivers/thunderbolt/dma_test.c
+++ b/drivers/thunderbolt/dma_test.c
@@ -689,7 +689,7 @@ static const struct dev_pm_ops dma_test_pm_ops = {
 
 static const struct tb_service_id dma_test_ids[] = {
 	{ TB_SERVICE("dma_test", 1) },
-	{ },
+	{ }
 };
 MODULE_DEVICE_TABLE(tbsvc, dma_test_ids);
 
diff --git a/drivers/thunderbolt/stream.c b/drivers/thunderbolt/stream.c
index b28e4e95b422..68d81958262e 100644
--- a/drivers/thunderbolt/stream.c
+++ b/drivers/thunderbolt/stream.c
@@ -1630,7 +1630,7 @@ static const struct dev_pm_ops tbstream_pm_ops = {
 
 static const struct tb_service_id tbstream_ids[] = {
 	{ TB_SERVICE("stream", 1) },
-	{ },
+	{ }
 };
 MODULE_DEVICE_TABLE(tbsvc, tbstream_ids);
 
-- 
2.47.3



* [PATCH v1 1/3] thunderbold: Stop passing matched device ID to .probe()
From: Uwe Kleine-König (The Capable Hub) @ 2026-06-18 10:14 UTC (permalink / raw)
  To: Mika Westerberg, Yehezkel Bernat, Andreas Noever
  Cc: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, netdev, linux-kernel, linux-usb
In-Reply-To: <cover.1781776904.git.u.kleine-koenig@baylibre.com>

No driver makes use of that parameter, so drop it and don't spend the
effort to determine the matching entry.

Signed-off-by: Uwe Kleine-König (The Capable Hub) <u.kleine-koenig@baylibre.com>
---
 drivers/net/thunderbolt/main.c | 2 +-
 drivers/thunderbolt/dma_test.c | 2 +-
 drivers/thunderbolt/domain.c   | 4 +---
 drivers/thunderbolt/stream.c   | 2 +-
 include/linux/thunderbolt.h    | 2 +-
 5 files changed, 5 insertions(+), 7 deletions(-)

diff --git a/drivers/net/thunderbolt/main.c b/drivers/net/thunderbolt/main.c
index f8f97e8e2226..edfcfc41a316 100644
--- a/drivers/net/thunderbolt/main.c
+++ b/drivers/net/thunderbolt/main.c
@@ -1335,7 +1335,7 @@ static void tbnet_generate_mac(struct net_device *dev)
 	dev->priv_flags |= IFF_LIVE_ADDR_CHANGE;
 }
 
-static int tbnet_probe(struct tb_service *svc, const struct tb_service_id *id)
+static int tbnet_probe(struct tb_service *svc)
 {
 	struct tb_xdomain *xd = tb_service_parent(svc);
 	struct net_device *dev;
diff --git a/drivers/thunderbolt/dma_test.c b/drivers/thunderbolt/dma_test.c
index 7877319b1b03..63e6bbf00e12 100644
--- a/drivers/thunderbolt/dma_test.c
+++ b/drivers/thunderbolt/dma_test.c
@@ -636,7 +636,7 @@ static void dma_test_debugfs_init(struct tb_service *svc)
 	debugfs_create_file("test", 0200, debugfs_dir, svc, &test_fops);
 }
 
-static int dma_test_probe(struct tb_service *svc, const struct tb_service_id *id)
+static int dma_test_probe(struct tb_service *svc)
 {
 	struct tb_xdomain *xd = tb_service_parent(svc);
 	struct dma_test *dt;
diff --git a/drivers/thunderbolt/domain.c b/drivers/thunderbolt/domain.c
index 479fa4d265c2..24611f05b3cd 100644
--- a/drivers/thunderbolt/domain.c
+++ b/drivers/thunderbolt/domain.c
@@ -77,12 +77,10 @@ static int tb_service_probe(struct device *dev)
 {
 	struct tb_service *svc = tb_to_service(dev);
 	struct tb_service_driver *driver;
-	const struct tb_service_id *id;
 
 	driver = container_of(dev->driver, struct tb_service_driver, driver);
-	id = __tb_service_match(dev, &driver->driver);
 
-	return driver->probe(svc, id);
+	return driver->probe(svc);
 }
 
 static void tb_service_remove(struct device *dev)
diff --git a/drivers/thunderbolt/stream.c b/drivers/thunderbolt/stream.c
index c1f5c55583d0..b28e4e95b422 100644
--- a/drivers/thunderbolt/stream.c
+++ b/drivers/thunderbolt/stream.c
@@ -1540,7 +1540,7 @@ static void tbstream_group_detach_stream(struct tbstream *stream)
 	config_group_put(&sg->group);
 }
 
-static int tbstream_probe(struct tb_service *svc, const struct tb_service_id *id)
+static int tbstream_probe(struct tb_service *svc)
 {
 	struct tbstream *stream;
 
diff --git a/include/linux/thunderbolt.h b/include/linux/thunderbolt.h
index feb1af175cfd..d9dec4322aa0 100644
--- a/include/linux/thunderbolt.h
+++ b/include/linux/thunderbolt.h
@@ -465,7 +465,7 @@ static inline struct tb_service *tb_to_service(struct device *dev)
  */
 struct tb_service_driver {
 	struct device_driver driver;
-	int (*probe)(struct tb_service *svc, const struct tb_service_id *id);
+	int (*probe)(struct tb_service *svc);
 	void (*remove)(struct tb_service *svc);
 	void (*shutdown)(struct tb_service *svc);
 	const struct tb_service_id *id_table;
-- 
2.47.3



* [PATCH v1 0/3] thunderbold: A few cleanups
From: Uwe Kleine-König (The Capable Hub) @ 2026-06-18 10:14 UTC (permalink / raw)
  To: Mika Westerberg, Yehezkel Bernat, Andreas Noever
  Cc: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, netdev, linux-kernel, linux-usb

Hello,

I'm currently working on a project that includes looking at all device
ID structures from <linux/mod_devicetable.h>. While doing that for
tb_service_id, I spotted these patch opportunities.

These are all non-critical, and my quest doesn't depend on them, so
there is no urgency to apply these patches. My suggestion is to apply them
via the thunderbolt tree during the next merge window with an ack from
the network guys.

The first patch touches drivers/net and drivers/thunderbolt. It could
theoretically be split, but that would result in at least 3 commits, which
seems excessive for handling three drivers, so I kept it as a single patch.

The third patch is a style change and so is subjective. Drop it if you
don't like it. Here splitting would be easy, but given that patch #1
already touches the same files, letting these go in together without
splitting seems sensible.

Best regards
Uwe

Uwe Kleine-König (The Capable Hub) (3):
  thunderbold: Stop passing matched device ID to .probe()
  thunderbold: Assert that a service driver has a probe callback
  thunderbold: Drop comma after device id array terminator

 drivers/net/thunderbolt/main.c | 4 ++--
 drivers/thunderbolt/dma_test.c | 4 ++--
 drivers/thunderbolt/domain.c   | 4 +---
 drivers/thunderbolt/stream.c   | 4 ++--
 drivers/thunderbolt/xdomain.c  | 3 +++
 include/linux/thunderbolt.h    | 2 +-
 6 files changed, 11 insertions(+), 10 deletions(-)

base-commit: 4fa3f5fabb30bf00d7475d5a33459ea83d639bf9
-- 
2.47.3


* Re: [Intel-wired-lan] [PATCH 1/2] igc: Wait for MAC passthrough after reset
From: Paul Menzel @ 2026-06-18 10:11 UTC (permalink / raw)
  To: Chia-Lin Kao (AceLan)
  Cc: Tony Nguyen, Przemek Kitszel, Andrew Lunn, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, intel-wired-lan,
	netdev, linux-kernel
In-Reply-To: <20260618073324.1843310-1-acelan.kao@canonical.com>

Dear Chia-Lin,


Thank you for your patch.

Am 18.06.26 um 09:33 schrieb Chia-Lin Kao (AceLan) via Intel-wired-lan:
> Some systems support MAC passthrough for dock Ethernet controllers by
> having firmware rewrite the receive address registers after the controller
> reset completes.

Please give one example system.

> igc resets the controller before reading RAL0/RAH0, so that reset can
> restore the controller native MAC address temporarily. If the driver reads
> the registers immediately, it can race the firmware rewrite and keep the
> native dock MAC instead of the host passthrough MAC.
> 
> For LMVP devices, poll RAL0/RAH0 after reset and before reading the MAC

What is LMVP?

> address. Stop once the address registers change to another valid Ethernet
> address, allowing firmware a bounded window to complete the passthrough
> update.

What are the downsides of this approach? Longer reset times?

Please add instructions on how to test this.

> Signed-off-by: Chia-Lin Kao (AceLan) <acelan.kao@canonical.com>
> ---
>   drivers/net/ethernet/intel/igc/igc_main.c | 48 +++++++++++++++++++++++
>   1 file changed, 48 insertions(+)
> 
> diff --git a/drivers/net/ethernet/intel/igc/igc_main.c b/drivers/net/ethernet/intel/igc/igc_main.c
> index 2c9e2dfd8499..fa9752ed8bc5 100644
> --- a/drivers/net/ethernet/intel/igc/igc_main.c
> +++ b/drivers/net/ethernet/intel/igc/igc_main.c
> @@ -11,6 +11,7 @@
>   #include <net/pkt_sched.h>
>   #include <linux/bpf_trace.h>
>   #include <net/xdp_sock_drv.h>
> +#include <linux/etherdevice.h>
>   #include <linux/pci.h>
>   #include <linux/mdio.h>
>   
> @@ -69,6 +70,52 @@ static const struct pci_device_id igc_pci_tbl[] = {
>   
>   MODULE_DEVICE_TABLE(pci, igc_pci_tbl);
>   
> +static void igc_read_rar0(struct igc_hw *hw, u8 *addr, u32 *ral, u32 *rah)
> +{
> +	*ral = rd32(IGC_RAL(0));
> +	*rah = rd32(IGC_RAH(0));
> +
> +	addr[0] = *ral & 0xff;
> +	addr[1] = (*ral >> 8) & 0xff;
> +	addr[2] = (*ral >> 16) & 0xff;
> +	addr[3] = (*ral >> 24) & 0xff;
> +	addr[4] = *rah & 0xff;
> +	addr[5] = (*rah >> 8) & 0xff;

This looks like a common pattern, but there does not seem to be a 
generic Linux implementation. Maybe `igc_read_mac_addr()` in 
`drivers/net/ethernet/intel/igc/igc_nvm.c` can be used?

> +}
> +
> +static bool igc_is_lmvp_device(struct pci_dev *pdev)
> +{
> +	switch (pdev->device) {
> +	case IGC_DEV_ID_I225_LMVP:
> +	case IGC_DEV_ID_I226_LMVP:
> +		return true;
> +	default:
> +		return false;
> +	}
> +}
> +
> +static void igc_wait_for_lmvp_mac_passthrough(struct pci_dev *pdev,
> +					      struct igc_hw *hw)
> +{
> +	u8 addr[ETH_ALEN] __aligned(2);
> +	u32 orig_ral, orig_rah;
> +	u32 ral, rah;
> +	int i;
> +
> +	if (!igc_is_lmvp_device(pdev))
> +		return;
> +
> +	igc_read_rar0(hw, addr, &orig_ral, &orig_rah);
> +
> +	for (i = 0; i < 100; i++) {
> +		msleep(100);

Up to ten seconds delay(?) sounds excessive. Please elaborate in the 
commit message.

> +		igc_read_rar0(hw, addr, &ral, &rah);
> +		if ((ral != orig_ral || rah != orig_rah) &&
> +		    is_valid_ether_addr(addr))
> +			return;
> +	}

No error in case this didn’t work?

> +}
> +
>   enum latency_range {
>   	lowest_latency = 0,
>   	low_latency = 1,
> @@ -7259,6 +7306,7 @@ static int igc_probe(struct pci_dev *pdev,
>   	 * known good starting state
>   	 */
>   	hw->mac.ops.reset_hw(hw);
> +	igc_wait_for_lmvp_mac_passthrough(pdev, hw);
>   
>   	if (igc_get_flash_presence_i225(hw)) {
>   		if (hw->nvm.ops.validate(hw) < 0) {


Kind regards,

Paul


* Re: [PATCH v2 2/2] drm/xe/xe_drm_ras: Add error-event support in XE drm_ras
From: Raag Jadav @ 2026-06-18 10:07 UTC (permalink / raw)
  To: Riana Tauro
  Cc: intel-xe, dri-devel, netdev, aravind.iddamsetty, anshuman.gupta,
	rodrigo.vivi, joonas.lahtinen, kuba, simona.vetter, airlied,
	pratik.bari, joshua.santosh.ranjan, ashwin.kumar.kulkarni,
	shubham.kumar, ravi.kishore.koppuravuri, maarten.lankhorst,
	mallesh.koujalagi, soham.purkait
In-Reply-To: <20260611052144.784969-6-riana.tauro@intel.com>

On Thu, Jun 11, 2026 at 10:51:47AM +0530, Riana Tauro wrote:
> Add error-event support in XE drm_ras to notify userspace
> when an error occurs.
> 
> $ sudo ynl --family drm_ras --output-json --subscribe error-notify

Same comment as the first patch, but up to you.

> {
>     "name": "error-event",
>      "msg": {
>          "device-name": "0000:03:00.0",
>          "node-id": 1,
>          "node-name": "uncorrectable-errors",
>          "error-id": 1,
>          "error-name": "core-compute",
>          "error-value": 1
>      }
> }
> 
> Signed-off-by: Riana Tauro <riana.tauro@intel.com>

Reviewed-by: Raag Jadav <raag.jadav@intel.com>


* Re: [PATCH bpf-next v3 0/3] bpf: bidirectional VLAN support for bpf_fib_lookup()
From: Toke Høiland-Jørgensen @ 2026-06-18 10:07 UTC (permalink / raw)
  To: Avinash Duduskar, ast, daniel, andrii
  Cc: ameryhung, a.s.protopopov, bpf, davem, dsahern, eddyz87, edumazet,
	emil, eyal.birger, hawk, horms, john.fastabend, jolsa, kpsingh,
	kuba, leon.hwang, linux-kernel, linux-kselftest, martin.lau,
	memxor, netdev, pabeni, rongtao, sdf, shuah, song, yatsenko,
	yonghong.song
In-Reply-To: <20260617224729.1428662-1-avinash.duduskar@gmail.com>

Avinash Duduskar <avinash.duduskar@gmail.com> writes:

> This series adds VLAN awareness to bpf_fib_lookup() in both directions.
> BPF_FIB_LOOKUP_VLAN resolves a VLAN egress to its underlying real device
> plus the VLAN tag (XDP programs need this because VLAN devices have no XDP
> xmit), and BPF_FIB_LOOKUP_VLAN_INPUT runs the lookup as if a tagged frame
> had arrived on the matching VLAN subinterface, for iif policy routing and
> VRF table selection.
>
> The l3mdev/VRF flow-init fix that was patch 1 in v1 and v2 has been split
> out and sent to bpf on its own, since it is an independent Fixes:-tagged
> fix that routes to stable on its own schedule. This series is otherwise
> independent of it: on the default CONFIG_INIT_STACK_ALL_ZERO the VRF
> selftests pass with or without the fix. Only the one full-lookup VRF arm
> ("IPv4 VLAN input, tag selects VRF table") depends on it, and only on
> INIT_STACK_ALL_PATTERN or NONE builds, where the uninitialized
> flowi_l3mdev otherwise misses the l3mdev rule and the lookup falls
> through to the main table. Applying the l3mdev fix first closes that
> window.
>
> Changes v2 -> v3 (all from Toke's review unless noted):
>
> - Split the l3mdev/VRF flow-init fix out to a standalone bpf submission
>   (it was patch 1 in v2).
>
> - Patch 2 (VLAN_INPUT): bpf_fib_vlan_input_dev() returns a
>   struct net_device * with ERR_PTR() for the -EINVAL case and NULL for
>   NOT_FWDED, instead of an int return and a **dev out-parameter.
>
> - Trim the BPF_FIB_LOOKUP_VLAN and BPF_FIB_LOOKUP_VLAN_INPUT UAPI doc
>   blocks, and drop the in-function comments that restated the commit
>   message or the flag doc.
>
> - Patch 1 (VLAN egress): on the skb path without tot_len, the deferred mtu
>   check now runs against the resolved egress (VLAN) device, not the parent
>   params->ifindex was swapped to, so a VLAN device with a smaller mtu than
>   its parent is no longer checked against, or reported as, the parent's
>   larger mtu. Found by the bpf ci bot; this was an open question in v2.
>
> - Patch 3 (selftests): re-run every case through bpf_xdp_fib_lookup() as
>   well, since the feature targets XDP; and flip the no-tot_len mtu arm to
>   expect the VLAN device's mtu after the fix above.
>
> Open questions (defaults chosen, noted here in case a maintainer
> prefers otherwise):
>
> 1. An unmatched, down, or foreign-netns tag returns
>    BPF_FIB_LKUP_RET_NOT_FWDED, matching the DIRECT path when
>    fib_get_table() finds no table, rather than a new return code.
>
> 2. BPF_FIB_LOOKUP_OUTPUT | BPF_FIB_LOOKUP_VLAN_INPUT is rejected with
>    -EINVAL; restricting now keeps relaxing later backward-compatible.
>
> 3. The name BPF_FIB_LOOKUP_VLAN_INPUT reads oddly next to
>    BPF_FIB_LOOKUP_OUTPUT. A pair like _VLAN_EGRESS/_VLAN_INGRESS is an
>    option while nothing is merged.

These three are fine as-is, I think.

> 4. The egress flag leaves a VLAN it cannot reduce to a physical parent
>    plus one tag (QinQ, or a parent in another namespace) as SUCCESS with
>    the VLAN device's ifindex and the vlan fields zero, like a plain
>    lookup. The input side instead fails closed (NOT_FWDED) on the
>    cross-namespace case. An XDP caller cannot xmit on a VLAN device, and
>    a zero h_vlan_proto does not distinguish this result from a physical
>    egress, so returning NOT_FWDED would be safer for XDP. But the two
>    cases differ: a foreign-netns parent is clearly fail-worthy, while a
>    QinQ egress is still a forwardable route (tc xmits on the inner VLAN
>    device), so failing it closed would reject a usable route. Should
>    egress signal NOT_FWDED, for both or only foreign-netns? I left it
>    best-effort, but will change it if you prefer.

This one is a bit more ambiguous. Specifically, the inability for an XDP
program to distinguish between a route that actually targets a physical
device, and one that targets a VLAN device that couldn't be resolved for
whatever reason.

Since this is a new feature that's opt-in, I think I would lean towards
failing lookups with a new error code (BPF_FIB_LKUP_RET_VLAN_FAILURE,
say) if the lookup finds a VLAN device but can't actually resolve the
parent. That way the XDP program can repeat the lookup without the
BPF_FIB_LOOKUP_VLAN flag if it really wants the ifindex of that VLAN
device, but that will be explicit and not hidden.
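
As a rough illustration of the caller-side pattern suggested above (not
code from the series): the sketch below retries a lookup without the
VLAN flag only when the first attempt reports a dedicated VLAN failure.
Every name in it (MOCK_FIB_LOOKUP_VLAN, enum mock_fib_ret,
struct mock_fib_result, mock_unresolvable_vlan_lookup(),
mock_lookup_explicit_fallback()) is a hypothetical userspace mock, and
the ifindex value 42 is arbitrary; none of this is the bpf_fib_lookup()
UAPI.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MOCK_FIB_LOOKUP_VLAN (1u << 0)

enum mock_fib_ret {
	MOCK_FIB_RET_SUCCESS,
	MOCK_FIB_RET_NOT_FWDED,
	MOCK_FIB_RET_VLAN_FAILURE,
};

struct mock_fib_result {
	int ifindex;
	uint16_t vlan_tci;
};

typedef enum mock_fib_ret (*mock_lookup_fn)(uint32_t flags,
					    struct mock_fib_result *res);

/* Mock lookup: the route targets a VLAN device whose parent cannot be
 * resolved, so the VLAN-aware lookup fails with the dedicated code.
 */
static enum mock_fib_ret
mock_unresolvable_vlan_lookup(uint32_t flags, struct mock_fib_result *res)
{
	if (flags & MOCK_FIB_LOOKUP_VLAN)
		return MOCK_FIB_RET_VLAN_FAILURE;

	res->ifindex = 42;	/* arbitrary ifindex of the VLAN device */
	res->vlan_tci = 0;
	return MOCK_FIB_RET_SUCCESS;
}

/* Try the VLAN-aware lookup first; repeat without the flag only on the
 * dedicated failure code, and record that the fallback was taken.
 */
static enum mock_fib_ret
mock_lookup_explicit_fallback(mock_lookup_fn lookup,
			      struct mock_fib_result *res, int *fell_back)
{
	enum mock_fib_ret ret;

	assert(lookup != NULL && res != NULL && fell_back != NULL);

	*fell_back = 0;
	ret = lookup(MOCK_FIB_LOOKUP_VLAN, res);
	if (ret != MOCK_FIB_RET_VLAN_FAILURE)
		return ret;

	*fell_back = 1;
	return lookup(0, res);
}
```

With a distinct failure code, the caller always knows whether the
returned ifindex came from the plain lookup, which is the ambiguity
raised in open question 4.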

-Toke



* Re: [PATCH v2 1/2] drm/drm_ras: Add drm_ras netlink error event
From: Raag Jadav @ 2026-06-18 10:06 UTC (permalink / raw)
  To: Riana Tauro
  Cc: intel-xe, dri-devel, netdev, aravind.iddamsetty, anshuman.gupta,
	rodrigo.vivi, joonas.lahtinen, kuba, simona.vetter, airlied,
	pratik.bari, joshua.santosh.ranjan, ashwin.kumar.kulkarni,
	shubham.kumar, ravi.kishore.koppuravuri, maarten.lankhorst,
	mallesh.koujalagi, soham.purkait, Zack McKevitt, Lijo Lazar,
	Hawking Zhang, David S. Miller, Paolo Abeni, Eric Dumazet
In-Reply-To: <20260611052144.784969-5-riana.tauro@intel.com>

On Thu, Jun 11, 2026 at 10:51:46AM +0530, Riana Tauro wrote:
> Define a new netlink event 'error-event' and a new multicast group
> 'error-notify' in drm_ras. Each event contains device name, node and
> error information to identify the error triggering the event.
> 
> Add drm_ras_nl_error_event() to trigger an event from the driver.
> Userspace must subscribe to 'error-notify' to receive 'error-event'
> notifications.
> 
> Usage:
> 
> $ sudo ynl --family drm_ras --subscribe error-notify

...

>  operations:
>    list:
> @@ -124,3 +151,24 @@ operations:
>        do:
>          request:
>            attributes: *id-attrs
> +    -
> +      name: error-event
> +      doc: >-
> +           Notify userspace of an error event.
> +           The event includes the device, node and error information
> +           of the error that triggered the event.
> +      attribute-set: error-event-attrs
> +      mcgrp: error-notify

This looks much closer to "notify:" property, which IIUC it's not. Looking
at some of the existing examples, a better name could be something like
'error-monitor' or 'error-report' to make it a bit distinguishable.

Or perhaps it could be just me without the coffee :(
so I'll leave it to you.

Reviewed-by: Raag Jadav <raag.jadav@intel.com>

> +      event:
> +        attributes:
> +          - device-name
> +          - node-id
> +          - node-name
> +          - error-id
> +          - error-name
> +          - error-value
> +
> +mcast-groups:
> +  list:
> +    -
> +      name: error-notify


* [PATCH net v2] net: ti: icssg: Fix XSK zero copy TX during application wakeup
From: Meghana Malladi @ 2026-06-18 10:03 UTC (permalink / raw)
  To: diogo.ivo, vadim.fedorenko, haokexin, devnexen, horms,
	jacob.e.keller, m-malladi, pabeni, kuba, edumazet, davem,
	andrew+netdev
  Cc: linux-kernel, netdev, linux-arm-kernel, srk, Vignesh Raghavendra,
	Roger Quadros, danishanwar

emac_xsk_xmit_zc() handles zero copy TX and is called in NAPI
context. The user application wakes up the kernel when initiating
a transmit, which triggers NAPI to start processing the TX packets.
The num_tx check inside emac_tx_complete_packets() returns early
if no packets were transferred, which prevents the call to
emac_xsk_xmit_zc(). Remove this check to let an application
wakeup initiate zero copy xmit traffic.

Add __netif_tx_lock() to ensure that the TX queue is protected
from concurrent access during the transmission of XDP frames.
This fixes a netdev watchdog timeout seen during long runs.

Fixes: e2dc7bfd677f ("net: ti: icssg-prueth: Move common functions into a separate file")
Signed-off-by: Meghana Malladi <m-malladi@ti.com>
---

v2-v1:
- Added back xsk_tx_release() inside emac_xsk_xmit_zc()
- Added a check for budget>0 to protect the AF_XDP path
- Moved txq_trans_cond_update() inside the xsk_frames_done check
The above changes address the comments given by Jakub Kicinski <kuba@kernel.org>

v1: https://lore.kernel.org/all/20260611185744.2498070-5-m-malladi@ti.com/

 drivers/net/ethernet/ti/icssg/icssg_common.c | 23 ++++++++++----------
 1 file changed, 11 insertions(+), 12 deletions(-)

diff --git a/drivers/net/ethernet/ti/icssg/icssg_common.c b/drivers/net/ethernet/ti/icssg/icssg_common.c
index 82ddef9c17d5..6973d4714246 100644
--- a/drivers/net/ethernet/ti/icssg/icssg_common.c
+++ b/drivers/net/ethernet/ti/icssg/icssg_common.c
@@ -93,8 +93,8 @@ void prueth_ndev_del_tx_napi(struct prueth_emac *emac, int num)
 }
 EXPORT_SYMBOL_GPL(prueth_ndev_del_tx_napi);
 
-static int emac_xsk_xmit_zc(struct prueth_emac *emac,
-			    unsigned int q_idx)
+static void emac_xsk_xmit_zc(struct prueth_emac *emac,
+			     unsigned int q_idx)
 {
 	struct prueth_tx_chn *tx_chn = &emac->tx_chns[q_idx];
 	struct xsk_buff_pool *pool = tx_chn->xsk_pool;
@@ -115,7 +115,7 @@ static int emac_xsk_xmit_zc(struct prueth_emac *emac,
 	 * necessary
 	 */
 	if (descs_avail <= MAX_SKB_FRAGS)
-		return 0;
+		return;
 
 	descs_avail -= MAX_SKB_FRAGS;
 
@@ -170,8 +170,8 @@ static int emac_xsk_xmit_zc(struct prueth_emac *emac,
 		num_tx++;
 	}
 
-	xsk_tx_release(tx_chn->xsk_pool);
-	return num_tx;
+	if (num_tx)
+		xsk_tx_release(tx_chn->xsk_pool);
 }
 
 void prueth_xmit_free(struct prueth_tx_chn *tx_chn,
@@ -279,9 +279,6 @@ int emac_tx_complete_packets(struct prueth_emac *emac, int chn,
 		num_tx++;
 	}
 
-	if (!num_tx)
-		return 0;
-
 	netif_txq = netdev_get_tx_queue(ndev, chn);
 	netdev_tx_completed_queue(netif_txq, num_tx, total_bytes);
 
@@ -297,16 +294,18 @@ int emac_tx_complete_packets(struct prueth_emac *emac, int chn,
 		__netif_tx_unlock(netif_txq);
 	}
 
-	if (tx_chn->xsk_pool) {
-		if (xsk_frames_done)
+	if (budget && tx_chn->xsk_pool) {
+		if (xsk_frames_done) {
 			xsk_tx_completed(tx_chn->xsk_pool, xsk_frames_done);
+			txq_trans_cond_update(netif_txq);
+		}
 
 		if (xsk_uses_need_wakeup(tx_chn->xsk_pool))
 			xsk_set_tx_need_wakeup(tx_chn->xsk_pool);
 
-		netif_txq = netdev_get_tx_queue(ndev, chn);
-		txq_trans_cond_update(netif_txq);
+		__netif_tx_lock(netif_txq, smp_processor_id());
 		emac_xsk_xmit_zc(emac, chn);
+		__netif_tx_unlock(netif_txq);
 	}
 
 	return num_tx;

base-commit: 7d8297e26b4e20b5d1c3c3fe51fe81a1c7fbc823
-- 
2.43.0


^ permalink raw reply related

* Re: [PATCH net v3] net: airoha: Fix skb->priority underflow in airoha_dev_select_queue()
From: Lorenzo Bianconi @ 2026-06-18 10:03 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Wayen Yan, netdev, horms, pabeni, edumazet, andrew+netdev,
	angelogioacchino.delregno, matthias.bgg, linux-arm-kernel,
	linux-mediatek
In-Reply-To: <20260617161951.52abe413@kernel.org>

> On Sun, 14 Jun 2026 07:30:54 +0800 Wayen Yan wrote:
> > In airoha_dev_select_queue(), the expression:
> > 
> >   queue = (skb->priority - 1) % AIROHA_NUM_QOS_QUEUES;
> > 
> > implicitly converts to unsigned arithmetic: when skb->priority is 0
> > (the default for unclassified traffic), (0u - 1u) wraps to UINT_MAX,
> > and UINT_MAX % 8 = 7, routing default best-effort packets to the
> > highest-priority QoS queue. This causes QoS inversion where the
> > majority of traffic on a PON gateway starves actual high-priority
> > flows (VoIP, gaming, etc.).
> > 
> > Fix by guarding the subtraction: when priority is 0, map to queue 0
> > (lowest priority), otherwise apply the original (priority - 1) % 8
> > mapping.
> > 
> > Fixes: 2b288b81560b ("net: airoha: Introduce ndo_select_queue callback")
> > Acked-by: Lorenzo Bianconi <lorenzo@kernel.org>
> > Reviewed-by: Joe Damato <joe@dama.to>
> > Signed-off-by: Wayen Yan <win847@gmail.com>
> > ---
> >  drivers/net/ethernet/airoha/airoha_eth.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/drivers/net/ethernet/airoha/airoha_eth.c b/drivers/net/ethernet/airoha/airoha_eth.c
> > index 31cdb11cd7..d476ef83c3 100644
> > --- a/drivers/net/ethernet/airoha/airoha_eth.c
> > +++ b/drivers/net/ethernet/airoha/airoha_eth.c
> > @@ -1933,7 +1933,7 @@ static u16 airoha_dev_select_queue(struct net_device *dev, struct sk_buff *skb,
> >  	 */
> >  	channel = netdev_uses_dsa(dev) ? skb_get_queue_mapping(skb) : port->id;
> >  	channel = channel % AIROHA_NUM_QOS_CHANNELS;
> > -	queue = (skb->priority - 1) % AIROHA_NUM_QOS_QUEUES; /* QoS queue */
> > +	queue = skb->priority ? (skb->priority - 1) % AIROHA_NUM_QOS_QUEUES : 0;
> 
> Hi Lorenzo, is there a reason we're subtracting 1 here in the first
> place? Could be just me, but may be worth adding a comment here.
> 
> Intuitively if we are "narrowing" 16 prios to 8 queues it'd make most
> sense to group the adjacent ones -- divide by two.
> 
> Please respin with some sort of an explanation..

IIRC this is a leftover of the ETS offload support.
I agree it is right to just do:

	queue = skb->priority % AIROHA_NUM_QOS_QUEUES; /* QoS queue */
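
For reference, the three mappings discussed in this thread can be compared
in a small userspace sketch. This is not the driver code: the function names
and the NUM_QOS_QUEUES macro are hypothetical, the value 8 is taken from the
"UINT_MAX % 8" arithmetic in the commit message, and uint32_t stands in for
the type of skb->priority.

```c
#include <stdint.h>

/* Assumed value for illustration; the thread computes "UINT_MAX % 8". */
#define NUM_QOS_QUEUES 8u

/* Original expression: priority 0 wraps to UINT32_MAX before the modulo. */
static uint32_t queue_original(uint32_t priority)
{
	return (priority - 1u) % NUM_QOS_QUEUES;
}

/* v3 patch: guard the subtraction, mapping priority 0 to queue 0. */
static uint32_t queue_guarded(uint32_t priority)
{
	return priority ? (priority - 1u) % NUM_QOS_QUEUES : 0u;
}

/* Mapping suggested in this reply: drop the subtraction entirely. */
static uint32_t queue_modulo(uint32_t priority)
{
	return priority % NUM_QOS_QUEUES;
}
```

With these definitions, queue_original(0) evaluates to 7, while
queue_guarded(0) and queue_modulo(0) both evaluate to 0. Note that the
guarded variant maps priorities 0 and 1 to the same queue, which the plain
modulo avoids.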

@Wayen: can you please respin fixing the issue? Please also add my Acked-by:

Acked-by: Lorenzo Bianconi <lorenzo@kernel.org>

Regards,
Lorenzo

> 
> >  	queue = channel * AIROHA_NUM_QOS_QUEUES + queue;
> >  
> >  	return queue < dev->num_tx_queues ? queue : 0;
> -- 
> pw-bot: cr

^ permalink raw reply
