Netdev List

Netdev List
 help / color / mirror / Atom feed

* [PATCH] net: stmmac: fix missed le32_to_cpu()
From: Ben Dooks @ 2026-06-22 14:37 UTC (permalink / raw)
  To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Maxime Coquelin, Alexandre Torgue,
	Russell King (Oracle), Ben Dooks, Maxime Chevallier, netdev,
	linux-stm32, linux-arm-kernel, linux-kernel

The print in ndesc_display_ring() sends the des2 and des3
to the pr_info() without passing them through the relevant
conversion to cpu order.

Fix the (prototype) sparse warnings by using le32_to_cpu():
drivers/net/ethernet/stmicro/stmmac/norm_desc.c:258:17: warning: incorrect type in argument 6 (different base types)
drivers/net/ethernet/stmicro/stmmac/norm_desc.c:258:17:    expected unsigned int
drivers/net/ethernet/stmicro/stmmac/norm_desc.c:258:17:    got restricted __le32 [usertype] des2
drivers/net/ethernet/stmicro/stmmac/norm_desc.c:258:17: warning: incorrect type in argument 7 (different base types)
drivers/net/ethernet/stmicro/stmmac/norm_desc.c:258:17:    expected unsigned int
drivers/net/ethernet/stmicro/stmmac/norm_desc.c:258:17:    got restricted __le32 [usertype] des3

Signed-off-by: Ben Dooks <ben.dooks@codethink.co.uk>
---
 drivers/net/ethernet/stmicro/stmmac/norm_desc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/stmicro/stmmac/norm_desc.c b/drivers/net/ethernet/stmicro/stmmac/norm_desc.c
index c4b613564f87..74c9b7b1fe8f 100644
--- a/drivers/net/ethernet/stmicro/stmmac/norm_desc.c
+++ b/drivers/net/ethernet/stmicro/stmmac/norm_desc.c
@@ -258,7 +258,7 @@ static void ndesc_display_ring(void *head, unsigned int size, bool rx,
 		pr_info("%03d [%pad]: 0x%x 0x%x 0x%x 0x%x",
 			i, &dma_addr,
 			(unsigned int)x, (unsigned int)(x >> 32),
-			p->des2, p->des3);
+			le32_to_cpu(p->des2), le32_to_cpu(p->des3));
 		p++;
 	}
 	pr_info("\n");
-- 
2.37.2.352.g3c44437643


^ permalink raw reply related

* Re: [RFC net-next 08/15] ipxlat: add translation engine and dispatch core
From: Toke Høiland-Jørgensen @ 2026-06-22 14:36 UTC (permalink / raw)
  To: Ralf Lici
  Cc: netdev, Daniel Gröber, Antonio Quartulli, Andrew Lunn,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	linux-kernel, Pablo Neira Ayuso, Florian Westphal, Phil Sutter,
	Beniamino Galvani
In-Reply-To: <20260622133452.432257-1-ralf@mandelbit.com>

[ skipping some of the netfilter-related context until we hear from the
netfilter devs ]

>> > My second concern is that the SIIT boundary would be a property of
>> > rule and hook placement. That gives flexibility, but it also means the
>> > translation point has to be constrained and documented very carefully
>> > to avoid ambiguous TTL/Hop Limit, PMTU/ICMP, and hook-order behavior.
>> > For this use case I would rather have the route that matches the
>> > translation prefix also be the object that says: leave this family
>> > here and continue in the other one.
>>
>> Yeah, with flexibility comes the ability to shoot yourself in the foot.
>> But that's not really different from much of the other functionality we
>> have in the kernel today, is it? For netfilter in particular it's
>> certainly possible to configure a broken NAT configuration that leads to
>> packet drops (or just invalid packets being sent out on a network
>> device).
>>
>
> True, misconfiguration is always possible and that alone is not an
> argument against the netfilter model. But what do we actually gain in
> capability from that flexibility? I agree on the UX argument (an admin
> would look in nft first), but in terms of what the feature can do, I
> can't yet see what the nft model unlocks. More on this just below.
>
>> > After looking at the available kernel mechanisms again, I think the
>> > better model is probably LWT: routes carry an ipxlat encap referencing a
>> > named translator domain configured over netlink. That should represent
>> > the stateless, prefix-based and symmetric nature of ipxlat.
>>
>> I think this description actually hits the nail on the head: What are we
>> implementing here? Is it a product feature, or a building block for one?
>> The properties you mention wrt consistency, symmetry etc are properties
>> of the high-level feature (which is also generally the level things are
>> specified in RFCs). Whereas other packet mangling features in the kernel
>> are more in the "building block" category, where it's possible to
>> configure things to implement a particular feature set / compliance with
>> a particular RFC, but it's also possible to do things that are outside
>> of that.
>>
>> I think this relates to the "mechanism, not policy" approach that we
>> take to most things in the kernel: implement the building blocks to do
>> something in the most general way we can, and then leave it up to
>> userspace to configure things in a way that results in a consistent
>> high-level system behaviour.
>>
>
> That's a good point, and I agree that we should not bake a high-level
> product policy into the kernel if what we need is a reusable mechanism
> (the LWT idea was my attempt at exactly that). What I am still trying to
> understand is whether there is a useful generic trigger for stateless
> cross-family translation beyond the route/prefix/policy-routing cases.
>
> Routes and policy routing already cover the selectors I can make
> coherent for a stateless, per-packet translator: destination/source
> prefix, iif/oif/VRF, mark, TOS/DSCP, and so on. nft can of course match
> much more than that, but the additional selectors that would materially
> change the translation decision seem to be selectors such as L4 fields,
> payload state, or conntrack state. Those are exactly the selectors I am
> struggling to make correct for a stateless translator:
>
> - non-first fragments carry no L4 header at all, yet the translator must
>   rewrite every fragment (an nft ... tcp dport trigger cannot fire on
>   them);
>
> - ICMP errors must be translated too, but the flow identity lives in the
>   quoted inner header (reversed), not in anything an L4/ct match on the
>   error packet can see and there is no conntrack to associate them,
>   since this is stateless.

True in principle, but if (say) you deploy this on a network that is
configured so it will never fragment packets, this won't be an issue in
practice.

I.e., you're quite right that arbitrary matching criteria cannot be
guaranteed to result in coherent translation. But I think that goes into
the "use it wrong, get wrong results" bin. E.g., if you match on
something that results in only a subset of the packets of a flow being
translated, well, only that subset of the packets will make it to the
destination. The SIIT translator itself should not try to fix this, but
neither should it prevent it; that's what I mean by "building block" -
it's up to the builder using the blocks to make sure the building
doesn't collapse, that's out of scope for the block manufacturer to
worry about :)

> So an L4-conditional trigger does not look like a good primitive for
> correct stateless SIIT unless the action also defragments/refragments or
> uses conntrack-like state. Those may be valid mechanisms, but they move
> the design away from the stateless per-packet SIIT boundary this RFC is
> trying to model.
>
> So my first question is: is there a useful nft configuration this should
> enable that is not naturally expressible as route selection, while still
> remaining stateless SIIT rather than a NAT64-like stateful feature?
> Maybe there is a real use case there, but I cannot construct one yet.

So the poster child for "match on arbitrary criteria" is of course BPF.
You can write BPF programs that match on arbitrary parts of the packet
header, custom encapsulation headers,or even on out of band things like
system state, phase of the moon, or what have you. And we should
certainly allow a BPF program to make the decision on whether to perform
the SIIT translation.

Which... maybe is an argument to keep it as a device like you do in this
RFC series? Redirecting to a device is trivially supported from TC-BPF,
which also makes it possible to use the translation mechanism without
going through the routing subsystem at all, saving a bit of overhead.
Whereas making it a route action ties it very closely to the routing
subsystem.

WDYT?

-Toke

^ permalink raw reply

* Re: [PATCH net 0/2] nfc: llcp: fix OOB reads and integer bugs in TLV parsers
From: Muhammad Bilal @ 2026-06-22 13:58 UTC (permalink / raw)
  To: david; +Cc: netdev, linux-kernel, oe-linux-nfc, horms, stable
In-Reply-To: <78e283f0-1493-4d72-92fe-e6444458fb91@ixit.cz>

On Sun, Jun 21, 2026 at 09:52 PM David Heidelberg wrote:
> could I ask for the patches rebase against for-next?

Hi David,

Rebased onto nfc/for-next.

v2 is now sent and available here:
https://lore.kernel.org/netdev/20260622131802.239035-1-meatuni001@gmail.com/

Patch 2/2 was dropped because the equivalent fixes for nfc_llcp_recv_snl()
were already merged (ed85d4cbbfaa), so only llcp_commands.c remains.

Thanks,
Muhammad Bilal

^ permalink raw reply

* Re: [Intel-wired-lan] [PATCH iwl-net] ice: clear the default forwarding VSI rule when releasing a VSI
From: Marcin Szycik @ 2026-06-22 13:52 UTC (permalink / raw)
  To: Petr Oros, netdev
  Cc: Przemek Kitszel, Eric Dumazet, linux-kernel, Andrew Lunn,
	Tony Nguyen, Michal Swiatkowski, Jacob Keller, Jakub Kicinski,
	Paolo Abeni, David S. Miller, intel-wired-lan
In-Reply-To: <20260622081030.2312129-1-poros@redhat.com>



On 22/06/2026 10:10, Petr Oros wrote:
> When a VSI is configured as the switch's default forwarding VSI
> (ICE_SW_LKUP_DFLT) and is then torn down, the rule is left behind in
> the switch. ice_vsi_release() no longer removes it, and the SR-IOV VF
> free path (ice_free_vfs() -> ice_free_vf_res() -> ice_vf_vsi_release()
> -> ice_vsi_release()) does not disable promiscuous mode either, which
> only happens on VF reset in ice_vf_clear_all_promisc_modes().
> 
> A trusted VF that enters unicast promiscuous mode becomes the default
> forwarding VSI (this is the default mode, when the PF does not have VF
> true-promiscuous mode enabled). If the VFs are then destroyed without
> the VF first leaving promiscuous mode, the ICE_SW_LKUP_DFLT rule for
> the now-freed VSI is leaked. When VFs are recreated, a VSI reuses the
> freed hw_vsi_id. If it is assigned a different VSI handle than the
> leaked rule holds, ice_set_dflt_vsi() does not recognize it as
> already-default, and ice_add_update_vsi_list() folds the dangling
> (freed) handle into a VSI list, which the firmware rejects. The VSI
> handle assigned on re-creation varies, so the failure is intermittent
> rather than every cycle.
> 
> Reproduce by repeatedly running the cycle below on the two ports of the
> same card, where $VF0 and $VF1 are the netdevs of vf 15 once they
> appear. The VF must be brought up so iavf actually pushes the unicast
> promiscuous request, and the rule must settle before the VFs are torn
> down again:
> 
>   echo 16 > /sys/class/net/$PF0/device/sriov_numvfs
>   echo 16 > /sys/class/net/$PF1/device/sriov_numvfs
>   ip link set $PF0 vf 15 trust on
>   ip link set $PF1 vf 15 trust on
>   ip link set $VF0 up
>   ip link set $VF1 up
>   ip link set $VF0 promisc on
>   ip link set $VF1 promisc on
>   sleep 1
>   echo 0 > /sys/class/net/$PF0/device/sriov_numvfs
>   echo 0 > /sys/class/net/$PF1/device/sriov_numvfs
> 
> Within a few cycles the ice PF and iavf VF log:
> 
>   Failed to set VSI 25 as the default forwarding VSI, error -22
>   Turning on/off promiscuous mode for VF 63 failed, error: -22
>   PF returned error -53 (IAVF_ERR_ADMIN_QUEUE_ERROR) to our request 14
> 
> This cleanup used to live in ice_vsi_release() but was dropped by the
> referenced refactor. Restore it. Clear the default forwarding VSI rule
> in ice_vsi_release() when this VSI owns it, which covers every teardown
> path.
> 
> Fixes: 6624e780a577 ("ice: split ice_vsi_setup into smaller functions")
> Signed-off-by: Petr Oros <poros@redhat.com>
> ---
>  drivers/net/ethernet/intel/ice/ice_lib.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/drivers/net/ethernet/intel/ice/ice_lib.c b/drivers/net/ethernet/intel/ice/ice_lib.c
> index 2717cc31bff8fe..408464434506ef 100644
> --- a/drivers/net/ethernet/intel/ice/ice_lib.c
> +++ b/drivers/net/ethernet/intel/ice/ice_lib.c
> @@ -2872,6 +2872,9 @@ int ice_vsi_release(struct ice_vsi *vsi)
>  		return -ENODEV;
>  	pf = vsi->back;
>  
> +	if (ice_is_vsi_dflt_vsi(vsi))
> +		ice_clear_dflt_vsi(vsi);

In the referenced commit, the chunk of code that contained these missing 2 lines
was moved to ice_vsi_decfg(). It also sounds like a good place for them and will
be called from ice_vsi_release(). Are you sure we should place them directly in
ice_vsi_release() instead?

Thanks,
Marcin

> +
>  	if (test_bit(ICE_FLAG_RSS_ENA, pf->flags))
>  		ice_rss_clean(vsi);
>  


^ permalink raw reply

* Re: [PATCH net 3/3] net/mlx5e: Reject unsupported CB Shaper TSA in ETS validation
From: Pavan Chebbi @ 2026-06-22 13:50 UTC (permalink / raw)
  To: Tariq Toukan
  Cc: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	netdev, Paolo Abeni, Alexei Lazar, Carolina Jubran,
	Leon Romanovsky, linux-kernel, linux-rdma, Mark Bloch,
	Saeed Mahameed, Gal Pressman
In-Reply-To: <20260622112925.624795-4-tariqt@nvidia.com>

[-- Attachment #1: Type: text/plain, Size: 622 bytes --]

On Mon, Jun 22, 2026 at 5:02 PM Tariq Toukan <tariqt@nvidia.com> wrote:
>
> From: Alexei Lazar <alazar@nvidia.com>
>
> Credit Based (CB) TSA is not supported by the mlx5 driver, so reject
> any configurations that specify it.
>
> Fixes: 08fb1dacdd76 ("net/mlx5e: Support DCBNL IEEE ETS")
> Signed-off-by: Alexei Lazar <alazar@nvidia.com>
> Reviewed-by: Carolina Jubran <cjubran@nvidia.com>
> Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
> ---
>  drivers/net/ethernet/mellanox/mlx5/core/en_dcbnl.c | 6 ++++++
>  1 file changed, 6 insertions(+)
>

Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com>

[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 5469 bytes --]

^ permalink raw reply

* Re: [PATCH net 2/3] net/mlx5e: Validate bandwidth for non-ETS traffic classes
From: Pavan Chebbi @ 2026-06-22 13:49 UTC (permalink / raw)
  To: Tariq Toukan
  Cc: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	netdev, Paolo Abeni, Alexei Lazar, Carolina Jubran,
	Leon Romanovsky, linux-kernel, linux-rdma, Mark Bloch,
	Saeed Mahameed, Gal Pressman
In-Reply-To: <20260622112925.624795-3-tariqt@nvidia.com>

[-- Attachment #1: Type: text/plain, Size: 720 bytes --]

On Mon, Jun 22, 2026 at 5:00 PM Tariq Toukan <tariqt@nvidia.com> wrote:
>
> From: Alexei Lazar <alazar@nvidia.com>
>
> The IEEE 802.1Qaz standard defines that bandwidth allocation percentages
> only apply to ETS traffic classes.
>
> Reject ETS configurations that specify non-zero bandwidth for traffic
> classes.
>
> Fixes: 08fb1dacdd76 ("net/mlx5e: Support DCBNL IEEE ETS")
> Signed-off-by: Alexei Lazar <alazar@nvidia.com>
> Reviewed-by: Carolina Jubran <cjubran@nvidia.com>
> Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
> ---
>  drivers/net/ethernet/mellanox/mlx5/core/en_dcbnl.c | 11 +++++++++++
>  1 file changed, 11 insertions(+)
>

Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com>

[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 5469 bytes --]

^ permalink raw reply

* Re: [PATCH] net: liquidio: fix BAR resource leak on PF number failure
From: Simon Horman @ 2026-06-22 13:48 UTC (permalink / raw)
  To: Haoxiang Li
  Cc: andrew+netdev, davem, kuba, pabeni, felix.manlunas,
	ricardo.farrington, netdev, linux-kernel, stable
In-Reply-To: <20260620083728.2722895-1-haoxiang_li2024@163.com>

On Sat, Jun 20, 2026 at 04:37:28PM +0800, Haoxiang Li wrote:
> If cn23xx_get_pf_num() fails, the function returns without
> unmapping either BAR. Unmap both BARs before returning from
> the error path.

I think it would be worth noting how this problem was found,
and if a publicly available tool was used, naming it.

> 
> Fixes: 0c45d7fe12c7 ("liquidio: fix use of pf in pass-through mode in a virtual machine")
> Cc: stable@vger.kernel.org
> Signed-off-by: Haoxiang Li <haoxiang_li2024@163.com>

There is an AI-generated review of this patch available on sashiko.dev.
I don't think the issue raised there directly affects this patch.
But you may want to consider looking into that in the context of
a separate follow-up.

> ---
>  drivers/net/ethernet/cavium/liquidio/cn23xx_pf_device.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/net/ethernet/cavium/liquidio/cn23xx_pf_device.c b/drivers/net/ethernet/cavium/liquidio/cn23xx_pf_device.c
> index 75f22f74774c..a1548ca81ecd 100644
> --- a/drivers/net/ethernet/cavium/liquidio/cn23xx_pf_device.c
> +++ b/drivers/net/ethernet/cavium/liquidio/cn23xx_pf_device.c
> @@ -1167,8 +1167,11 @@ int setup_cn23xx_octeon_pf_device(struct octeon_device *oct)
>  		return 1;
>  	}
>  
> -	if (cn23xx_get_pf_num(oct) != 0)
> +	if (cn23xx_get_pf_num(oct) != 0) {
> +		octeon_unmap_pci_barx(oct, 0);
> +		octeon_unmap_pci_barx(oct, 1);
>  		return 1;
> +	}
>  
>  	if (cn23xx_sriov_config(oct)) {
>  		octeon_unmap_pci_barx(oct, 0);

I think this would be best handled by introducing an idiomatic goto unwind
ladder to this function. Something like this (compile tested only!):

diff --git a/drivers/net/ethernet/cavium/liquidio/cn23xx_pf_device.c b/drivers/net/ethernet/cavium/liquidio/cn23xx_pf_device.c
index 75f22f74774c..73362b92d0fd 100644
--- a/drivers/net/ethernet/cavium/liquidio/cn23xx_pf_device.c
+++ b/drivers/net/ethernet/cavium/liquidio/cn23xx_pf_device.c
@@ -1163,18 +1163,14 @@ int setup_cn23xx_octeon_pf_device(struct octeon_device *oct)
 	if (octeon_map_pci_barx(oct, 1, MAX_BAR1_IOREMAP_SIZE)) {
 		dev_err(&oct->pci_dev->dev, "%s CN23XX BAR1 map failed\n",
 			__func__);
-		octeon_unmap_pci_barx(oct, 0);
-		return 1;
+		goto err_free_barx_0;
 	}
 
 	if (cn23xx_get_pf_num(oct) != 0)
-		return 1;
+		goto err_free_barx_1;
 
-	if (cn23xx_sriov_config(oct)) {
-		octeon_unmap_pci_barx(oct, 0);
-		octeon_unmap_pci_barx(oct, 1);
-		return 1;
-	}
+	if (cn23xx_sriov_config(oct))
+		goto err_free_barx_1;
 
 	octeon_write_csr64(oct, CN23XX_SLI_MAC_CREDIT_CNT, 0x3F802080802080ULL);
 
@@ -1205,6 +1201,12 @@ int setup_cn23xx_octeon_pf_device(struct octeon_device *oct)
 	oct->coproc_clock_rate = 1000000ULL * cn23xx_coprocessor_clock(oct);
 
 	return 0;
+
+err_free_barx_0:
+	octeon_unmap_pci_barx(oct, 0);
+err_free_barx_1:
+	octeon_unmap_pci_barx(oct, 1);
+	return 1;
 }
 EXPORT_SYMBOL_GPL(setup_cn23xx_octeon_pf_device);
 
-- 
pw-bot: changes-requested

^ permalink raw reply related

* Re: [PATCH net 1/3] net/mlx5e: Report zero bandwidth for non-ETS traffic classes
From: Pavan Chebbi @ 2026-06-22 13:48 UTC (permalink / raw)
  To: Tariq Toukan
  Cc: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	netdev, Paolo Abeni, Alexei Lazar, Carolina Jubran,
	Leon Romanovsky, linux-kernel, linux-rdma, Mark Bloch,
	Saeed Mahameed, Gal Pressman
In-Reply-To: <20260622112925.624795-2-tariqt@nvidia.com>

[-- Attachment #1: Type: text/plain, Size: 1023 bytes --]

On Mon, Jun 22, 2026 at 5:00 PM Tariq Toukan <tariqt@nvidia.com> wrote:
>
> From: Alexei Lazar <alazar@nvidia.com>
>
> The IEEE 802.1Qaz standard defines that bandwidth allocation percentages
> only apply to Enhanced Transmission Selection (ETS) traffic classes.
> For STRICT and VENDOR transmission selection algorithms, bandwidth
> percentage values are not applicable.
>
> Currently for non-ETS 100 bandwidth is being reported for all traffic
> classes in the get operation due to hardware limitation, regardless of
> their TSA type.
>
> Fix this by reporting 0 for non-ETS traffic classes.
>
> Fixes: 820c2c5e773d ("net/mlx5e: Read ETS settings directly from firmware")
> Signed-off-by: Alexei Lazar <alazar@nvidia.com>
> Reviewed-by: Carolina Jubran <cjubran@nvidia.com>
> Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
> ---
>  drivers/net/ethernet/mellanox/mlx5/core/en_dcbnl.c | 7 +++++++
>  1 file changed, 7 insertions(+)
>

LGTM. Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com>

[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 5469 bytes --]

^ permalink raw reply

* Re: [RFC net-next 08/15] ipxlat: add translation engine and dispatch core
From: Ralf Lici @ 2026-06-22 13:34 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: netdev, Daniel Gröber, Antonio Quartulli, Andrew Lunn,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	linux-kernel, Pablo Neira Ayuso, Florian Westphal, Phil Sutter,
	Beniamino Galvani
In-Reply-To: <87tsr4gcag.fsf@toke.dk>

On Mon, 15 Jun 2026 15:31:51 +0200, Toke Høiland-Jørgensen <toke@kernel.org> wrote:
> >> >> I think a better model is to treat the device as basically a loopback
> >> >> device that translates packets before looping them back (so when they
> >> >> come back they appear to be coming from that device).
> >> >>
> >> >> Any reason why that wouldn't work?
> >> >>
> >> >
> >> > That's indeed the intended model for the ipxlat netdevice: route packets
> >> > to it, translate them, then loop them back into the stack as packets
> >> > received from that same device. That seemed like the simplest model and
> >> > the one that exposes the translation point most clearly.
> >>
> >> Right. I think this could be made a bit more explicit in the
> >> documentation as well, since it's a bit of an unusual model.
> >>
> >> And, well, taking a step back: is it really the right model? Regular NAT
> >> lives in netfilter, why can't this be a netfilter module as well? Seems
> >> to me you could have something like:
> >>
> >> table ip xlat4 {
> >> 	chain postrouting {
> >> 		type nat hook postrouting priority srcnat; policy accept;
> >> 		ip daddr 0.0.0.0/0 oifname "eth0" xlat to 64:ff9b::/96
> >> 	}
> >> }
> >> table ip6 xlat6 {
> >> 	chain prerouting {
> >> 		type nat hook prerouting priority dstnat; policy accept;
> >> 		ip6 saddr 64::ff0b::/96 iifname "eth0" xlat from 64::ff9b::/96
> >> 	}
> >> }
> >>
> >> and that would provide the functionality without having to implement a
> >> new interface type and the associated multiple traversals through the
> >> stack? Did you consider this as an alternative to the new device type?
> >>
> >
> > We did consider netfilter, and your example is syntactically attractive,
> > but I am no longer convinced it is the cleanest model for SIIT.
> >
> > An nft expression cannot simply rewrite ETH_P_IP <-> ETH_P_IPV6 and
> > return ACCEPT as if this were normal NAT because the current hook
> > invocation, dst, and conntrack-related state were established for the
> > packet as it entered that hook. A cross-family translator would need to
> > consume the skb, clear or rebuild route and ct metadata as appropriate,
> > do an other-family route lookup, and resume at a well-defined point in
> > that family. That seems possible, but it would be a new stateless
> > cross-family action, not just a new mode of the existing nft nat
> > expression (which is built around nf_nat_setup_info and assumes the
> > packet's L3 family does not change AFAICT).
>
> Right, I did not expect it would be possible to actually share code with
> the existing NAT functionality, but conceptually they're similar. I.e.,
> if I was an admin trying to figure out if my system supported SIIT
> translation, my chain of thought would be something along the line of:
> "SIIT is a variant of NAT, and I know NAT is a long-standing feature of
> netfilter, so I wonder if SIIT exists there as well".
>
> Adding the netfilter folks to Cc to try to get their attention and an
> opinion on this :)
>

That's the right move, it would be great to have their opinion on these
architectural questions.

For the netfilter folks just added: the context is a stateless
IPv6<>IPv4 SIIT translator (RFC 7915). The open design question is
whether the stateless cross-family translation should live as an nft
action, or as a route/LWT action (ILA / seg6_local lineage).

> > My second concern is that the SIIT boundary would be a property of
> > rule and hook placement. That gives flexibility, but it also means the
> > translation point has to be constrained and documented very carefully
> > to avoid ambiguous TTL/Hop Limit, PMTU/ICMP, and hook-order behavior.
> > For this use case I would rather have the route that matches the
> > translation prefix also be the object that says: leave this family
> > here and continue in the other one.
>
> Yeah, with flexibility comes the ability to shoot yourself in the foot.
> But that's not really different from much of the other functionality we
> have in the kernel today, is it? For netfilter in particular it's
> certainly possible to configure a broken NAT configuration that leads to
> packet drops (or just invalid packets being sent out on a network
> device).
>

True, misconfiguration is always possible and that alone is not an
argument against the netfilter model. But what do we actually gain in
capability from that flexibility? I agree on the UX argument (an admin
would look in nft first), but in terms of what the feature can do, I
can't yet see what the nft model unlocks. More on this just below.

> > After looking at the available kernel mechanisms again, I think the
> > better model is probably LWT: routes carry an ipxlat encap referencing a
> > named translator domain configured over netlink. That should represent
> > the stateless, prefix-based and symmetric nature of ipxlat.
>
> I think this description actually hits the nail on the head: What are we
> implementing here? Is it a product feature, or a building block for one?
> The properties you mention wrt consistency, symmetry etc are properties
> of the high-level feature (which is also generally the level things are
> specified in RFCs). Whereas other packet mangling features in the kernel
> are more in the "building block" category, where it's possible to
> configure things to implement a particular feature set / compliance with
> a particular RFC, but it's also possible to do things that are outside
> of that.
>
> I think this relates to the "mechanism, not policy" approach that we
> take to most things in the kernel: implement the building blocks to do
> something in the most general way we can, and then leave it up to
> userspace to configure things in a way that results in a consistent
> high-level system behaviour.
>

That's a good point, and I agree that we should not bake a high-level
product policy into the kernel if what we need is a reusable mechanism
(the LWT idea was my attempt at exactly that). What I am still trying to
understand is whether there is a useful generic trigger for stateless
cross-family translation beyond the route/prefix/policy-routing cases.

Routes and policy routing already cover the selectors I can make
coherent for a stateless, per-packet translator: destination/source
prefix, iif/oif/VRF, mark, TOS/DSCP, and so on. nft can of course match
much more than that, but the additional selectors that would materially
change the translation decision seem to be selectors such as L4 fields,
payload state, or conntrack state. Those are exactly the selectors I am
struggling to make correct for a stateless translator:

- non-first fragments carry no L4 header at all, yet the translator must
  rewrite every fragment (an nft ... tcp dport trigger cannot fire on
  them);

- ICMP errors must be translated too, but the flow identity lives in the
  quoted inner header (reversed), not in anything an L4/ct match on the
  error packet can see and there is no conntrack to associate them,
  since this is stateless.

So an L4-conditional trigger does not look like a good primitive for
correct stateless SIIT unless the action also defragments/refragments or
uses conntrack-like state. Those may be valid mechanisms, but they move
the design away from the stateless per-packet SIIT boundary this RFC is
trying to model.

So my first question is: is there a useful nft configuration this should
enable that is not naturally expressible as route selection, while still
remaining stateless SIIT rather than a NAT64-like stateful feature?
Maybe there is a real use case there, but I cannot construct one yet.

> That being said:
>
> > Very roughly, userspace could look like:
> >
> >     ip xlat add siit0 prefix6 64:ff9b::/96
> >     ip route add ... encap ipxlat id siit0
> >     ip -6 route add ... encap ipxlat id siit0
> >
> > There are some useful precedents for this: ILA is stateless address
> > translation as LWT, seg6_local already has cross-family LWT actions, and
> > ioam6 has a similar split between separately configured objects and
> > route attachments.
> >
> > The invariant I would like v2 to follow is that the original-family
> > route lookup selects translation as its terminal route action. The
> > translated skb then gets a fresh lookup in the other family. From that
> > point on, TTL/Hop Limit where applicable, PMTU, ICMP errors, and
> > netfilter visibility belong to the translated family.
> >
> > So I think your question addresses the core design issue in this RFC. My
> > current preference is to rework the next version around an LWT/domain
> > model instead of the virtual netdevice model, unless prototyping shows a
> > fundamental problem with that approach.
> >
> > Does that model make sense to you?
>
> I did consider this as well before suggesting netfilter as the right
> place to hook things, and I do think the route object model has some
> appeal. I agree it's a better model than the magical loopback interface,
> certainly.
>
> I think in the end this comes down to whether flexibility in how to use
> this translation mechanism is a bug or a feature, as outlined above. I'm
> leaning towards "feature", but could probably be persuaded otherwise :)
>

I agree this is the tradeoff to settle now. The reason I suggested LWT
is that making the family-A route lookup the handoff point gives the
TTL/PMTU/ICMP-and-reroute semantics a natural place to live.

So my other question, where the netfilter maintainers' input would be
very useful, is what semantics would be acceptable for changing address
family inside a netfilter hook. Concretely:

- What verdict should such an expression return? My assumption is it
  has to consume the skb and reinject into the other family (NF_STOLEN
  + reinject), since ACCEPT would resume traversal in a family whose
  dst/chain no longer apply.

- What becomes of the original-family conntrack entry, which is now
  orphaned (and is pure overhead in the stateless case)?

- Where do TTL/Hop-Limit, PMTU/ICMP generation, and netfilter
  visibility end up belonging after the family changes?

All that said, the netfilter model has its own strengths, and once we
settle the architectural questions it could turn into a very clean
design. Curious to hear your thoughts on the points above.

-- 
Ralf Lici
Mandelbit Srl

^ permalink raw reply

* Re: [PATCH] netdevsim: fix use-after-free in __nsim_dev_port_del
From: Simon Horman @ 2026-06-22 13:35 UTC (permalink / raw)
  To: hrushirajg23
  Cc: Simon Horman, kuba, andrew+netdev, davem, edumazet, pabeni, jiri,
	netdev, linux-kernel, syzbot+6c25f4750230faf70be9
In-Reply-To: <20260620064937.16301-1-hrushirajg23@gmail.com>

This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
Full review at: https://netdev-ai.bots.linux.dev/sashiko/
---
Subject: netdevsim: fix use-after-free in __nsim_dev_port_del

The patch removes the per-port debugfs directory in nsim_destroy()
before free_netdev() so that debugfs files referencing fields inside
the netdevsim struct cannot be reached after the slab is released.

Link: https://syzkaller.appspot.com/bug?extid=6c25f4750230faf70be9

> diff --git a/drivers/net/netdevsim/netdev.c b/drivers/net/netdevsim/netdev.c
> index 27e5f109f933..08136e7990cb 100644
> --- a/drivers/net/netdevsim/netdev.c
> +++ b/drivers/net/netdevsim/netdev.c
> @@ -1214,6 +1214,13 @@ void nsim_destroy(struct netdevsim *ns)
>  		ns->page = NULL;
>  	}
>  
> +	/*
> +	 * Remove per-port debugfs files before free_netdev() releases the
> +	 * netdevsim struct to prevent use-after-free in concurrent readers.
> +	 */
> +	debugfs_remove_recursive(ns->nsim_dev_port->ddir);
> +	ns->nsim_dev_port->ddir = NULL;
> +
>  	free_netdev(dev);
>  }

[High]
Does the same use-after-free window remain on the nsim_create() error
path?

nsim_create() unconditionally calls nsim_ethtool_init(ns) before the
init step that can fail:

    SET_NETDEV_DEVLINK_PORT(dev, &nsim_dev_port->devlink_port);
    nsim_ethtool_init(ns);
    if (nsim_dev_port_is_pf(nsim_dev_port))
        err = nsim_init_netdevsim(ns);
    else
        err = nsim_init_netdevsim_vf(ns);
    if (err)
        goto err_free_netdev;
    ...
err_free_netdev:
    free_netdev(dev);
    return ERR_PTR(err);

nsim_ethtool_init() populates files under nsim_dev_port->ddir whose
data pointers reference fields inside ns, for example:

    ethtool = debugfs_create_dir("ethtool", ns->nsim_dev_port->ddir);
    debugfs_create_u32("get_err", 0600, ethtool, &ns->ethtool.get_err);
    debugfs_create_u32("set_err", 0600, ethtool, &ns->ethtool.set_err);
    ...
    dir = debugfs_create_dir("ring", ethtool);
    debugfs_create_u32("rx_max_pending", 0600, dir,
                      &ns->ethtool.ring.rx_max_pending);

If nsim_init_netdevsim() (or nsim_init_netdevsim_vf()) fails, the
err_free_netdev label runs free_netdev(dev) directly and returns
ERR_PTR(err). The caller then calls nsim_dev_port_debugfs_exit() to
tear down ddir, but in the meantime the netdevsim slab has already
been freed while these debugfs files are still live with data pointers
into it.

Should the same debugfs_remove_recursive(nsim_dev_port->ddir) be done
on the err_free_netdev path before free_netdev(dev), or should the
debugfs file creation be moved past the point of no return so the
fuzzer-reachable failure paths (alloc pressure on mock_phc_create,
register_netdevice, nsim_psp_init, etc.) cannot leave dangling
debugfs entries?

The Fixes: tag e05b2d141fef appears to cover this path as well.

^ permalink raw reply

* Re: [PATCH net] veth: fix NAPI leak in XDP enable error path
From: Pavan Chebbi @ 2026-06-22 13:31 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
	netdev, eric.dumazet, Guenter Roeck, Björn Töpel,
	Daniel Borkmann, Ilias Apalodimas, Michael S. Tsirkin,
	Tariq Toukan
In-Reply-To: <20260622111825.88337-1-edumazet@google.com>

[-- Attachment #1: Type: text/plain, Size: 1610 bytes --]

On Mon, Jun 22, 2026 at 4:48 PM Eric Dumazet <edumazet@google.com> wrote:
>
> During XDP enablement in veth, if xdp_rxq_info_reg() or
> xdp_rxq_info_reg_mem_model() fails, the driver rolls back the changes.
>
> However, the rollback loop:
>         for (i--; i >= start; i--) {
>
> decrements the loop index 'i' before the first iteration. This
> correctly skips unregistering the rxq for the failed index 'i' (as
> registration failed or was already cleaned up), but it also
> erroneously skips calling netif_napi_deli() for rq[i].xdp_napi.
>
> Since netif_napi_add() was already called for index 'i', this leaves
> a dangling napi_struct in the device's napi_list. When the veth
> device is later destroyed, the freed queue memory (which contains the
> leaked NAPI structure) can be reused.
>
> The subsequent device teardown iterates the NAPI list and
> corrupts the reallocated memory, leading to UAF.
>
> Fix this by explicitly deleting the NAPI association for the failed
> index 'i' before rolling back the successfully configured queues.
>
> Fixes: b02e5a0ebb17 ("xsk: Propagate napi_id to XDP socket Rx path")
> Reported-by: Guenter Roeck <groeck@google.com>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: Björn Töpel <bjorn.topel@intel.com>
> Cc: Daniel Borkmann <daniel@iogearbox.net>
> Cc: Ilias Apalodimas <ilias.apalodimas@linaro.org>
> Cc: Michael S. Tsirkin <mst@redhat.com>
> Cc: Tariq Toukan <tariqt@nvidia.com>
> ---
>  drivers/net/veth.c | 2 ++
>  1 file changed, 2 insertions(+)
>

Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com>

[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 5469 bytes --]

^ permalink raw reply

* Re: [PATCH net 1/5] net: hns3: unify copper port ksettings configuration path
From: Jijie Shao @ 2026-06-22 13:30 UTC (permalink / raw)
  To: Simon Horman
  Cc: shaojijie, davem, edumazet, kuba, pabeni, andrew+netdev,
	shenjian15, liuyonglong, chenhao418, huangdonghua3, yangshuaisong,
	netdev, linux-kernel
In-Reply-To: <20260618153356.909862-1-horms@kernel.org>


on 2026/6/18 23:33, Simon Horman wrote:
> This is an AI-generated review of your patch. The human sending this
> email has considered the AI review valid, or at least plausible.
> Full review at: https://sashiko.dev
> ---
> net: hns3: unify copper port ksettings configuration path
>
> Refactor hns3_set_link_ksettings() and hclge_set_phy_link_ksettings()
> to unify the configuration path for copper ports.
>
>> diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c b/drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c
>> --- a/drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c
>> +++ b/drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c
> [ ... ]
>> @@ -863,6 +859,8 @@ static int hns3_set_link_ksettings(struct net_device *netdev,
>>   	struct hnae3_handle *handle = hns3_get_handle(netdev);
>>   	struct hnae3_ae_dev *ae_dev = hns3_get_ae_dev(handle);
>>   	const struct hnae3_ae_ops *ops = hns3_get_ops(handle);
>> +	u8 module_type = HNAE3_MODULE_TYPE_UNKNOWN;
>> +	u8 media_type = HNAE3_MEDIA_TYPE_UNKNOWN;
>>   	int ret;
>>   
>>   	/* Chip don't support this mode. */
>> @@ -878,22 +876,20 @@ static int hns3_set_link_ksettings(struct net_device *netdev,
>>   		  cmd->base.autoneg, cmd->base.speed, cmd->base.duplex,
>>   		  cmd->lanes);
>>   
>> -	/* Only support ksettings_set for netdev with phy attached for now */
>> -	if (netdev->phydev) {
>> -		if (cmd->base.speed == SPEED_1000 &&
>> -		    cmd->base.autoneg == AUTONEG_DISABLE)
>> -			return -EINVAL;
> [Severity: Medium]
> Does removing this validation allow invalid configurations for native copper
> PHYs?
>
> The previous code explicitly rejected SPEED_1000 with AUTONEG_DISABLE because
> IEEE 802.3 1000BASE-T strictly requires autonegotiation. By delegating
> directly to phy_ethtool_ksettings_set(), which permits this for 1000BASE-X,
> the driver might now silently accept an invalid configuration.

[Accepted] Yes, this is indeed an effective bug.

>
>> -		return phy_ethtool_ksettings_set(netdev->phydev, cmd);
>> -	} else if (test_bit(HNAE3_DEV_SUPPORT_PHY_IMP_B, ae_dev->caps) &&
>> -		   ops->set_phy_link_ksettings) {
>> +	if (!ops->get_media_type)
>> +		return -EOPNOTSUPP;
>> +	ops->get_media_type(handle, &media_type, &module_type);
>>   
>> -		return ops->set_phy_link_ksettings(handle, cmd);
>> -	}
>> +	if (media_type == HNAE3_MEDIA_TYPE_COPPER) {
>> +		if (!ops->set_phy_link_ksettings)
>> +			return -EOPNOTSUPP;
>> +		return ops->set_phy_link_ksettings(handle, cmd);
>> +	}
> [ ... ]
>> diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
>> --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
>> +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
> [ ... ]
>> @@ -3400,10 +3400,28 @@ hclge_set_phy_link_ksettings(struct hnae3_handle *handle,
>>   		return ret;
>>   	}
>>   
>> +	linkmode_copy(hdev->hw.mac.advertising, cmd->link_modes.advertising);
>> +	return 0;
>> +}
>> +
>> +static int
>> +hclge_set_phy_link_ksettings(struct hnae3_handle *handle,
>> +			     const struct ethtool_link_ksettings *cmd)
>> +{
>> +	struct hclge_vport *vport = hclge_get_vport(handle);
>> +	struct hclge_dev *hdev = vport->back;
>> +	int ret;
>> +
>> +	if (hnae3_dev_phy_imp_supported(hdev))
>> +		ret = hclge_ethtool_ksettings_set(handle, cmd);
>> +	else
>> +		ret = phy_ethtool_ksettings_set(handle->netdev->phydev, cmd);
> [Severity: Critical]
> Can handle->netdev->phydev be NULL here, leading to a kernel panic when called
> from userspace?
>
> A copper port can legitimately have a NULL phydev if the port's PHY address
> is configured as PHY_INEXISTENT in the firmware.
>
> Because the updated hns3_set_link_ksettings() now routes all
> HNAE3_MEDIA_TYPE_COPPER configurations here without the previous
> if (netdev->phydev) guard, phy_ethtool_ksettings_set() will unconditionally
> dereference phydev->mdio.addr.

[Accepted] Yes, this is indeed an effective bug.


Thanks,
Jijie Shao


^ permalink raw reply

* Re: [PATCH] fsl/fman: Free init resources on KeyGen failure in fman_init()
From: Pavan Chebbi @ 2026-06-22 13:27 UTC (permalink / raw)
  To: Haoxiang Li
  Cc: madalin.bucur, sean.anderson, andrew+netdev, davem, edumazet,
	kuba, pabeni, florinel.iordache, netdev, linux-kernel, stable
In-Reply-To: <20260622111638.997251-1-haoxiang_li2024@163.com>

[-- Attachment #1: Type: text/plain, Size: 1668 bytes --]

On Mon, Jun 22, 2026 at 4:47 PM Haoxiang Li <haoxiang_li2024@163.com> wrote:
>
> fman_muram_alloc() allocates initialization resources before
> initializing the KeyGen block. If keygen_init() fails, the
> function returns -EINVAL directly and leaves those resources
> allocated. Free the initialization resources before returning
> from the KeyGen failure path.
>
> While at it, drop the unused error check around enable(), which
> always returns 0.
>
> Fixes: 7472f4f281d0 ("fsl/fman: enable FMan Keygen")
> Cc: stable@kernel.org
> Signed-off-by: Haoxiang Li <haoxiang_li2024@163.com>
> ---
>  drivers/net/ethernet/freescale/fman/fman.c | 8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)
>

LGTM. Note that you should add "net" to the fixes patch titles.
Otherwise you see patch automation has guessed this one for net-next.
Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com>

> diff --git a/drivers/net/ethernet/freescale/fman/fman.c b/drivers/net/ethernet/freescale/fman/fman.c
> index 013273a2de32..3a2a57207e55 100644
> --- a/drivers/net/ethernet/freescale/fman/fman.c
> +++ b/drivers/net/ethernet/freescale/fman/fman.c
> @@ -1995,12 +1995,12 @@ static int fman_init(struct fman *fman)
>
>         /* Init KeyGen */
>         fman->keygen = keygen_init(fman->kg_regs);
> -       if (!fman->keygen)
> +       if (!fman->keygen) {
> +               free_init_resources(fman);
>                 return -EINVAL;
> +       }
>
> -       err = enable(fman, cfg);
> -       if (err != 0)
> -               return err;
> +       enable(fman, cfg);
>
>         enable_time_stamp(fman);
>
> --
> 2.25.1
>
>

[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 5469 bytes --]

^ permalink raw reply

* Re: [PATCH net-next v3] virtio-net: xsk: support tx wake up
From: Michael S. Tsirkin @ 2026-06-22 13:24 UTC (permalink / raw)
  To: Menglong Dong
  Cc: Menglong Dong, xuanzhuo, eperezma, jasowang, andrew+netdev, davem,
	edumazet, kuba, pabeni, netdev, virtualization, linux-kernel
In-Reply-To: <BMgCMK8nRYC94QK7N1SXpQ@linux.dev>

On Mon, Jun 22, 2026 at 08:27:12PM +0800, Menglong Dong wrote:
> On 2026/6/22 06:31 Michael S. Tsirkin <mst@redhat.com> write:
> > On Tue, Jun 16, 2026 at 07:59:12PM +0800, Menglong Dong wrote:
> [...]
> > >  
> > > +	vring_size = virtqueue_get_vring_size(sq->vq);
> > > +	need_wakeup = xsk_uses_need_wakeup(pool);
> > > +
> > > +	if (need_wakeup && vring_size == sq->vq->num_free)
> > > +		xsk_set_tx_need_wakeup(pool);
> > > +
> > 
> > why are we doing this here?
> > the check after virtnet_xsk_xmit_batch not enough?
> > I vaguely think it's some kind of race we are closing?
> > Pls add a comment to explain.
> 
> Hi, Michael. Thanks for your review.
> 
> Yeah, it's for a race condition between user space and kernel
> space. I added a comment in V2, which is too confusing, and
> I removed it 😢. I'll make it more clear and add it in the V4. The
> origin comment is:
> 
>  * If the sq->vq is empty, and the tx ring is empty, and the user
>  * submit an entry to the tx ring after virtnet_xsk_xmit_batch() and
>  * before xsk_set_tx_need_wakeup(), we will lose the chance to wake
>  * up the tx napi, so we have to set the need_wakeup flag here.
> 
> And the logic is like this:
> 
> Kernel: tx NAPI is waked up from skb_xmit_done() ->
> Kernel: sq->vq and xsk->tx_ring are both empty ->
> Kernel: call virtnet_xsk_xmit_batch()
> 
>     User: submit a entry to the xsk->tx_ring
>     User: check the wakeup flag
>     User: wakeup flag is not set, skip send()
> 
> Kernel: call xsk_set_tx_need_wakeup(), because sq->vq is empty
> 
> If we don't send more data, the data in the xsk->tx_ring will
> not be sent forever.

I'm not 100% sure I understand, but when someone fixes cross-CPU races
with no synchronization or CPU memory barriers just with extra checks,
this always gives me pause.

AI helped write this for me, for example:
  1. Kernel: xsk_set_tx_need_wakeup stores NEED_WAKEUP (sits in store buffer)
  2. Kernel: xsk_tx_peek_release_desc_batch - load, sees empty (reordered before the store is globally visible)
  3. Kernel: peek finds nothing, returns 0
  4. Userspace: stores entry + producer
  5. Userspace: loads flags - doesn't see NEED_WAKEUP yet (still in kernel's store buffer)
  6. Userspeace: skips send()
  7. Kernel: NEED_WAKEUP store finally becomes visible - too late

Seems legit?



> > 
> > >  	sent = virtnet_xsk_xmit_batch(sq, pool, budget, &kicks);
> > >  
> > > +	if (need_wakeup) {
> > > +		if (vring_size == sq->vq->num_free)
> > > +			/* we can't wake up by ourself, and it should be done
> > > +			 * by the user.
> > > +			 */
> > > +			xsk_set_tx_need_wakeup(pool);
> > > +		else
> > > +			/* we can wake up from skb_xmit_done() */
> > > +			xsk_clear_tx_need_wakeup(pool);
> > 
> > But what if we don't have get tx napi so no wakeup in skb_xmit_done?
> 
> Sorry that I'm not sure what "get tx napi" means here ;(
> 
> There are entry in sq->vq, so skb_xmit_done() will be called after
> the entries in the ring is consumed by the HOST, right?
> Then, the corresponding sq->napi will be scheduled, as we ensure
> that tx napi is always enabled, which means napi->weight is not
> zero, in this commit:
> 1df5116a41a8 ("virtio_net: xsk: prevent disable tx napi")

Oh I forgot we did that. But can xsk bind when tx napi has already
been disabled previously?


> Right?
> 
> Thanks!
> Menglong Dong
> 
> > 
> > 
> > > +	}
> > > +
> > >  	if (!is_xdp_raw_buffer_queue(vi, sq - vi->sq))
> > >  		check_sq_full_and_disable(vi, vi->dev, sq);
> > >  
> > > @@ -1470,9 +1488,6 @@ static bool virtnet_xsk_xmit(struct send_queue *sq, struct xsk_buff_pool *pool,
> > >  	u64_stats_add(&sq->stats.xdp_tx,  sent);
> > >  	u64_stats_update_end(&sq->stats.syncp);
> > >  
> > > -	if (xsk_uses_need_wakeup(pool))
> > > -		xsk_set_tx_need_wakeup(pool);
> > > -
> > >  	return sent;
> > >  }
> > >  
> > > -- 
> > > 2.54.0
> > 
> > 
> > 
> 
> 
> 


^ permalink raw reply

* Re: [PATCH net 2/5] net: hns3: refactor MAC autoneg and speed configuration
From: Jijie Shao @ 2026-06-22 13:18 UTC (permalink / raw)
  To: Simon Horman
  Cc: shaojijie, davem, edumazet, kuba, pabeni, andrew+netdev,
	shenjian15, liuyonglong, chenhao418, huangdonghua3, yangshuaisong,
	netdev, linux-kernel
In-Reply-To: <20260618153419.909913-1-horms@kernel.org>


on 2026/6/18 23:34, Simon Horman wrote:
> This is an AI-generated review of your patch. The human sending this
> email has considered the AI review valid, or at least plausible.
> Full review at: https://netdev-ai.bots.linux.dev/sashiko/
> ---
> net: hns3: refactor MAC autoneg and speed configuration
>
> This refactor extracts the MAC autoneg and speed/duplex/lane configuration
> out of hclge_mac_init() into a new helper hclge_set_autoneg_speed_dup(),
> and invokes it from both hclge_init_ae_dev() and hclge_reset_ae_dev().
>
>> diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
>> index 5a00797d9252..2c74675b149f 100644
>> --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
>> +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
>> @@ -2957,20 +2957,6 @@ static int hclge_mac_init(struct hclge_dev *hdev)
>>   	if (!test_bit(HCLGE_STATE_RST_HANDLING, &hdev->state))
>>   		hdev->hw.mac.duplex = HCLGE_MAC_FULL;
>>   
>> -	if (hdev->hw.mac.support_autoneg) {
>> -		ret = hclge_set_autoneg_en(hdev, hdev->hw.mac.autoneg);
>> -		if (ret)
>> -			return ret;
>> -	}
>> -
>> -	if (!hdev->hw.mac.autoneg) {
>> -		ret = hclge_cfg_mac_speed_dup_hw(hdev, hdev->hw.mac.req_speed,
>> -						 hdev->hw.mac.req_duplex,
>> -						 hdev->hw.mac.lane_num);
>> -		if (ret)
>> -			return ret;
>> -	}
>> -
>>   	mac->link = 0;
> [ ... ]
>
>> @@ -11901,6 +11908,13 @@ static int hclge_init_ae_dev(struct hnae3_ae_dev *ae_dev)
>>   	if (ret)
>>   		goto err_ptp_uninit;
>>   
>> +	ret = hclge_set_autoneg_speed_dup(hdev);
>> +	if (ret) {
>> +		dev_err(&pdev->dev,
>> +			"failed to set autoneg speed duplex, ret = %d\n", ret);
>> +		goto err_ptp_uninit;
>> +	}
>> +
> [High]
>
> Has the relocation of the autoneg/speed configuration in hclge_init_ae_dev()
> been considered against the surrounding init order?
>
> Previously the equivalent calls ran inside hclge_mac_init() before
> hclge_update_port_info(). After this patch the helper runs after
> hclge_update_port_info(), and hclge_update_port_info() ->
> hclge_get_sfp_info() unconditionally overwrites hdev->hw.mac.autoneg,
> hdev->hw.mac.support_autoneg and hdev->hw.mac.lane_num from firmware
> whenever resp->speed_ability is non-zero.
>
> The new helper then feeds the firmware-supplied lane_num back through:
>
>      if (!hdev->hw.mac.autoneg) {
>          ret = hclge_cfg_mac_speed_dup_hw(hdev, hdev->hw.mac.req_speed,
>                                           hdev->hw.mac.req_duplex,
>                                           hdev->hw.mac.lane_num);
>
> On a non-copper port where firmware reports forced mode (autoneg=0) and a
> stale lane_num inherited from a previous link lifecycle (for example
> lane_num=4 from a prior 100G state), the (req_speed, lane_num) pair given
> to hclge_cfg_mac_speed_dup_hw() can be inconsistent (for example 25G with
> 4 lanes), and firmware rejects it with -EINVAL, which propagates back to
> hclge_init_ae_dev() and unwinds via err_ptp_uninit, failing the entire
> probe.
>
> The follow-up commit a7e52c16ecbf "net: hns3: fix init failure caused by
> lane_num contamination" later in this series fixes this by introducing a
> separate req_lane_num=0 and passing that to hclge_cfg_mac_speed_dup_hw()
> instead of mac.lane_num. Should the fix be folded into this patch so the
> series does not introduce a probe-failure regression at this point and
> then repair it?

[Accepted] This issue has been fixed in the patch "net: hns3: Fix initialization failure caused by lane_num pollution."
		It is recommended that I merge that patch with the current patch. Sure.

>
> [Medium]
>
> After this change, the last write to MAC speed/duplex on the init path is
> hclge_cfg_mac_speed_dup_hw() inside the new helper, which programs
> hardware but does not update the cached hdev->hw.mac.speed and
> hdev->hw.mac.duplex. Previously the last writer was
> hclge_cfg_mac_speed_dup() inside hclge_update_port_info(), which updates
> both hardware and the cached fields.
>
> When req_speed differs from the firmware-reported mac.speed (which can
> happen if the active SFP speed differs from cfg.default_speed parsed in
> hclge_configure()), is hardware left programmed with req_speed while
> mac.speed retains the firmware-reported value, leaving the cache out of
> sync with the actual hardware configuration until the periodic service
> task refreshes mac.speed?

[Reject] Yes, Therefore, this issue is not a valid question.

>
>> @@ -12231,6 +12245,13 @@ static int hclge_reset_ae_dev(struct hnae3_ae_dev *ae_dev)
>>   		return ret;
>>   	}
>>   
>> +	ret = hclge_set_autoneg_speed_dup(hdev);
>> +	if (ret) {
>> +		dev_err(&pdev->dev,
>> +			"failed to set autoneg speed duplex, ret = %d\n", ret);
>> +		return ret;
>> +	}
>> +
>>   	ret = hclge_tp_port_init(hdev);
>>   	if (ret) {
>>   		dev_err(&pdev->dev, "failed to init tp port, ret = %d\n",
> [Low]
>
> Could the commit message be tightened to match the resulting code?
>
> The message says the helper "will also be invoked during the hardware
> reset recovery path to re-apply link settings without repeating
> unnecessary buffer or MTU initializations." However hclge_reset_ae_dev()
> still calls hclge_mac_init() immediately before hclge_set_autoneg_speed_dup(),
> and hclge_mac_init() still performs hclge_set_mac_mtu(),
> hclge_set_default_loopback() and hclge_buffer_alloc(), so the avoidance
> of MTU/buffer re-initialization described in the changelog is not
> realized here.
>
> The commit message is also framed as a pure readability/cyclomatic-
> complexity refactor, but the autoneg/speed configuration is not just
> extracted: in hclge_init_ae_dev() it is moved to a materially later
> point (after hclge_update_port_info(), TSO, GRO, VLAN, TM, RSS, FD and
> PTP setup), versus the original code where it was the very first
> hardware-touching step inside hclge_mac_init(). The reset path keeps the
> original relative ordering (helper runs immediately after
> hclge_mac_init()), so the init and reset paths are now asymmetric in
> placement, and the rationale for the late placement on the init path is
> not stated.

[Accepted] ok


^ permalink raw reply

* [PATCH net v2] nfc: llcp: fix OOB read and u8 offset wrap in TLV parsers
From: Muhammad Bilal @ 2026-06-22 13:18 UTC (permalink / raw)
  To: David Heidelberg, netdev
  Cc: davem, edumazet, kuba, pabeni, horms, krzk, oe-linux-nfc,
	linux-kernel, Muhammad Bilal, stable

nfc_llcp_parse_gb_tlv() and nfc_llcp_parse_connection_tlv() contain
three related bugs in their TLV parsing loops:

1. 'offset' is declared u8 but tlv_array_len is u16. When TLV data
   advances offset past 255 it silently wraps to zero, causing
   infinite loops or double-processing of buffer data.

2. Before reading tlv[0] (type) and tlv[1] (length) there is no
   check that offset+2 <= tlv_array_len. A truncated TLV causes
   an OOB read of one byte past the buffer end.

3. After reading the length field, the value bytes are accessed
   without checking offset+2+length <= tlv_array_len. A crafted
   length=0xFF on a short buffer causes up to 255 bytes of OOB
   read past the buffer end.

Both functions are reachable without authentication via
nfc_llcp_set_remote_gb() which feeds remote LLCP general bytes
directly into nfc_llcp_parse_gb_tlv() with no additional
validation.

Fix all three issues by widening offset from u8 to u16 and adding
bounds checks for both the TLV header and value field before each
access.

Fixes: 3df40eb3a2ea ("nfc: constify several pointers to u8, char and sk_buff")
Cc: stable@vger.kernel.org
Signed-off-by: Muhammad Bilal <meatuni001@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
---

Notes:
    v2:
     - Rebased onto current nfc/for-next.
     - Dropped the previous nfc_llcp_recv_snl() fix since equivalent checks
       were merged by commit ed85d4cbbfaa ("nfc: llcp: bound SNL TLV parsing
       to the skb and add length checks").
     - Retain only the fixes for u8 offset wraparound and missing TLV bounds
       checks in nfc_llcp_parse_gb_tlv() and nfc_llcp_parse_connection_tlv().
     - Reject invalid TLVs silently with -EINVAL; dropped the v1 pr_err()
       logging, which was reachable from a remote peer.
    
    Link: https://lore.kernel.org/netdev/20260519011937.12903-1-meatuni001@gmail.com/

 net/nfc/llcp_commands.c | 18 ++++++++++++++++--
 1 file changed, 16 insertions(+), 2 deletions(-)

diff --git a/net/nfc/llcp_commands.c b/net/nfc/llcp_commands.c
index 291f26facbf3a..ca89fe967d6a2 100644
--- a/net/nfc/llcp_commands.c
+++ b/net/nfc/llcp_commands.c
@@ -193,7 +193,8 @@ int nfc_llcp_parse_gb_tlv(struct nfc_llcp_local *local,
 			  const u8 *tlv_array, u16 tlv_array_len)
 {
 	const u8 *tlv = tlv_array;
-	u8 type, length, offset = 0;
+	u8 type, length;
+	u16 offset = 0;
 
 	pr_debug("TLV array length %d\n", tlv_array_len);
 
@@ -201,9 +202,15 @@ int nfc_llcp_parse_gb_tlv(struct nfc_llcp_local *local,
 		return -ENODEV;
 
 	while (offset < tlv_array_len) {
+		if (offset + 2 > tlv_array_len)
+			return -EINVAL;
+
 		type = tlv[0];
 		length = tlv[1];
 
+		if (offset + 2 + length > tlv_array_len)
+			return -EINVAL;
+
 		pr_debug("type 0x%x length %d\n", type, length);
 
 		switch (type) {
@@ -243,7 +250,8 @@ int nfc_llcp_parse_connection_tlv(struct nfc_llcp_sock *sock,
 				  const u8 *tlv_array, u16 tlv_array_len)
 {
 	const u8 *tlv = tlv_array;
-	u8 type, length, offset = 0;
+	u8 type, length;
+	u16 offset = 0;
 
 	pr_debug("TLV array length %d\n", tlv_array_len);
 
@@ -251,9 +259,15 @@ int nfc_llcp_parse_connection_tlv(struct nfc_llcp_sock *sock,
 		return -ENOTCONN;
 
 	while (offset < tlv_array_len) {
+		if (offset + 2 > tlv_array_len)
+			return -EINVAL;
+
 		type = tlv[0];
 		length = tlv[1];
 
+		if (offset + 2 + length > tlv_array_len)
+			return -EINVAL;
+
 		pr_debug("type 0x%x length %d\n", type, length);
 
 		switch (type) {

base-commit: ed85d4cbbfaa4e630c5aa0d607348b42620d976b
-- 
2.54.0


^ permalink raw reply related

* Re: [RFC v2] Enabling CONFIG_NTP_PPS for NOHZ by adding ntp_error to system_time_snapshot
From: David Woodhouse @ 2026-06-22 13:11 UTC (permalink / raw)
  To: Thomas Gleixner, John Stultz, Stephen Boyd, Miroslav Lichvar,
	Richard Cochran, linux-kernel, netdev
  Cc: Rodolfo Giometti, Alexander Gordeev
In-Reply-To: <fea6f61960132ffafc94a8b74adc37d8dd2f79be.camel@infradead.org>

[-- Attachment #1: Type: text/plain, Size: 4094 bytes --]

On Mon, 2026-06-22 at 10:04 +0100, David Woodhouse wrote:
> Hm, I'm leaning towards adding it unconditionally in
> ktime_get_snapshot_id() and get_device_system_crosststamp(), and not
> adding the extra field to the system_time_snapshot at all...

That works.

https://git.infradead.org/?p=users/dwmw2/linux.git;a=shortlog;h=refs/heads/pps

David Woodhouse (4):
      timekeeping: Apply extrapolated ntp_error to clock snapshots
      pps: Drop the !NO_HZ_COMMON dependency from NTP_PPS
      pps: Always use ktime_get_snapshot_id() for pps_get_ts()
      ptp: ptp_vmclock: Add simulated 1PPS support

Booted with NOHZ_FULL in QEMU with a test program as /init that sets
the clock coarsely from vmclock using clock_settime(), then binds the
simulated 1PPS signal to the kernel's hardpps support.

Then uses PTP_SYS_OFFSET_PRECISE() every second to see the difference
between the kernel's timekeeping, and the vmclock that it's supposed to
be converging to.

The only "problem" is that if there's occasionally a nanosecond of
jitter after ntpdata->pps_jitter has literally reached zero,
hardpps_update_phase() complains:
   [   33.917103] hardpps: PPSJITTER: jitter=1, limit=0

That's *probably* only an issue for the vmclock hack; even with
upcoming PTM-based PPS devices that literally latch the counter value
on the pulse, I don't think it'll literally get to *zero*.

When chronyd on the host adjusts the time, we see it propagate to the
guest and the guest's timekeeping quickly converge from +36ns to +0ns
again...

[    0.653414] Run /init as init process
PPS: coarse-set CLOCK_REALTIME from vmclock
[    1.655824] pps pps0: bound kernel consumer: edge=0x1
PPS: enabled, bound to hardpps, STA_PPSTIME|STA_PPSFREQ set
PRECISE: dev-sys(adj)=+5609ns OK
EXT[ 0] diff=+5608ns
EXT[ 1] diff=+5262ns
[    2.917076] hardpps: PPSJITTER: jitter=5173, limit=0
EXT[ 2] diff=+4914ns
EXT[ 3] diff=+1197ns
EXT[ 4] diff=-297ns
EXT[ 5] diff=-103ns
EXT[ 6] diff=+0ns
EXT[ 7] diff=+0ns
EXT[ 8] diff=-1ns
EXT[ 9] diff=+0ns
EXT[10] diff=+0ns
EXT[11] diff=+0ns
EXT[12] diff=+0ns
EXT[13] diff=+0ns
EXT[14] diff=+0ns
EXT[15] diff=+0ns
EXT[16] diff=+1ns
EXT[17] diff=+0ns
EXT[18] diff=+0ns
EXT[19] diff=-1ns
EXT[20] diff=-1ns
EXT[21] diff=+0ns
EXT[22] diff=+0ns
EXT[23] diff=-1ns
EXT[24] diff=-1ns
EXT[25] diff=-1ns
EXT[26] diff=+0ns
EXT[27] diff=+0ns
EXT[28] diff=+0ns
EXT[29] diff=+0ns
EXT[30] diff=+0ns
EXT[31] diff=+0ns
EXT[32] diff=+1ns
[   33.917103] hardpps: PPSJITTER: jitter=1, limit=0
EXT[33] diff=+1ns
[   34.917102] hardpps: PPSJITTER: jitter=1, limit=0
EXT[34] diff=+0ns
EXT[35] diff=+0ns
EXT[36] diff=+0ns
[   37.917105] hardpps: PPSJITTER: jitter=1, limit=0
EXT[37] diff=+0ns
[   38.917010] hardpps: PPSJITTER: jitter=1, limit=0
EXT[38] diff=-1ns
EXT[39] diff=+0ns
EXT[40] diff=+0ns
EXT[41] diff=-1ns
EXT[42] diff=-1ns
EXT[43] diff=+0ns
[   44.917107] hardpps: PPSJITTER: jitter=1, limit=0
EXT[44] diff=+0ns
[   45.917105] hardpps: PPSJITTER: jitter=1, limit=0
EXT[45] diff=-1ns
EXT[46] diff=+0ns
EXT[47] diff=-1ns
EXT[48] diff=+0ns
EXT[49] diff=-1ns
EXT[50] diff=-1ns
EXT[51] diff=+0ns
EXT[52] diff=-1ns
[   53.917104] hardpps: PPSJITTER: jitter=1, limit=0
EXT[53] diff=+0ns
[   54.917099] hardpps: PPSJITTER: jitter=1, limit=0
EXT[54] diff=+0ns
EXT[55] diff=+0ns
EXT[56] diff=+0ns
vmclock: host_cv=0x998433b116699 offset=0xfff667dd86c7d42d guest_cv=0x20c1d93ac6 tsc_khz=2400000
EXT[57] diff=+0ns
[   58.917101] hardpps: PPSJITTER: jitter=18, limit=0
EXT[58] diff=+36ns
EXT[59] diff=+26ns
EXT[60] diff=+41ns
EXT[61] diff=+44ns
EXT[62] diff=+26ns
EXT[63] diff=+40ns
EXT[64] diff=+31ns
EXT[65] diff=+33ns
EXT[66] diff=+17ns
EXT[67] diff=+21ns
EXT[68] diff=+23ns
EXT[69] diff=+14ns
EXT[70] diff=+10ns
EXT[71] diff=+21ns
EXT[72] diff=+23ns
EXT[73] diff=+24ns
EXT[74] diff=+14ns
EXT[75] diff=+21ns
EXT[76] diff=+23ns
EXT[77] diff=+14ns
EXT[78] diff=+20ns
EXT[79] diff=+23ns
EXT[80] diff=+24ns
EXT[81] diff=+14ns
EXT[82] diff=+2ns
EXT[83] diff=+0ns
EXT[84] diff=+0ns

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]

^ permalink raw reply

* Re: [PATCH net v2 2/2] net: airoha: fix netif_set_real_num_tx_queues for sparse QoS channels
From: Lorenzo Bianconi @ 2026-06-22 13:11 UTC (permalink / raw)
  To: Simon Horman
  Cc: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Wayen Yan, linux-arm-kernel, linux-mediatek, netdev
In-Reply-To: <20260622123136.GC827683@horms.kernel.org>

[-- Attachment #1: Type: text/plain, Size: 1689 bytes --]

> On Fri, Jun 19, 2026 at 01:37:14PM +0200, Lorenzo Bianconi wrote:
> > airoha_tc_htb_alloc_leaf_queue() assigns queue IDs based on the channel
> > index (opt->qid = AIROHA_NUM_TX_RING + channel), but updates
> > real_num_tx_queues with a simple increment (num_tx_queues + 1). When QoS
> > channels are allocated sparsely (e.g., channels 0 and 3 without 1 and
> > 2), the returned qid can exceed real_num_tx_queues, causing out-of-bounds
> > accesses in the networking stack.
> > For example, allocating channel 0 then channel 3 results in
> > real_num_tx_queues = 34 but qid = 35, which is out of range [0, 34).
> > Fix this by computing real_num_tx_queues based on the highest active
> > channel index rather than using a simple counter, in both the allocation
> > and deletion paths.
> > 
> > Fixes: ef1ca9271313b ("net: airoha: Add sched HTB offload support")
> > Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
> 
> Thanks for the update since v1.
> 
> Reviewed-by: Simon Horman <horms@kernel.org>

Hi Simon,

thx for the review.

> 
> FTR, there is an AI-generated review of this patch on sashiko.dev.
> I do not think that should impede the progress of this patch but
> you may want to consider it in the context of follow-up.

Even if it is not introduced by this patch, I do not think what is reported
by Sashiko is a real issue since airoha_eth driver implements
ndo_select_queue() callback and the selected queue is always in the range
[0, AIROHA_NUM_TX_RING[. HTB queues (in the range
[AIROHA_NUM_TX_RING, AIROHA_NUM_TX_RING + AIROHA_NUM_QOS_CHANNELS[) are just 
'offloaded' and never used in the TC sw path. Agree?

Regards,
Lorenzo

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply

* Re: [PATCH net 4/5] net: hns3: differentiate autoneg default values between copper and fiber
From: Jijie Shao @ 2026-06-22 13:11 UTC (permalink / raw)
  To: Simon Horman
  Cc: shaojijie, davem, edumazet, kuba, pabeni, andrew+netdev,
	shenjian15, liuyonglong, chenhao418, huangdonghua3, yangshuaisong,
	netdev, linux-kernel
In-Reply-To: <20260618153450.909964-1-horms@kernel.org>


on 2026/6/18 23:34, Simon Horman wrote:
> This is an AI-generated review of your patch. The human sending this
> email has considered the AI review valid, or at least plausible.
> Full review at: https://sashiko.dev
> ---
> net: hns3: differentiate autoneg default values between copper and fiber
>
> This commit addresses a link loss issue during driver initialization on
> optical ports connected to forced-mode remote switches. It modifies
> hclge_init_ae_dev to set autonegotiation based on media type. Copper ports
> retain the AUTONEG_ENABLE default, while optical ports inherit the firmware
> pre-configured autoneg setting.
>
>> diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
>> --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
>> +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
>> @@ -11916,6 +11916,9 @@ static int hclge_init_ae_dev(struct hnae3_ae_dev *ae_dev)
>>   	if (ret)
>>   		goto err_ptp_uninit;
>>   
>> +	if (hdev->hw.mac.media_type != HNAE3_MEDIA_TYPE_COPPER)
>> +		hdev->hw.mac.req_autoneg = hdev->hw.mac.autoneg;
>> +
> [Severity: High]
> Does this apply the wrong speed when autonegotiation is disabled on fiber
> ports?
>
> During driver initialization, hclge_update_port_info() dynamically probes
> the SFP module and updates mac->speed, but does not update mac->req_speed,
> which retains its statically configured default capability.
>
> When hdev->hw.mac.autoneg is 0 (forced-mode), req_autoneg becomes 0, and
> then hclge_set_autoneg_speed_dup() uses req_speed:
>
> drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c:hclge_set_autoneg_speed_dup() {
> ...
>      if (!hdev->hw.mac.req_autoneg) {
>          ret = hclge_cfg_mac_speed_dup_hw(hdev, hdev->hw.mac.req_speed,
> ...
> }
>
> Could this override the dynamically probed SFP speed with the static default
> speed?

[Reject] This is initializing the path. mac->speed has been parsed in hclge_configure()
          and an initial value has been assigned to req_speed.

>
> [Severity: High]
> Does this unconditionally disable autonegotiation for V1 optical hardware?
>
> Looking at hclge_update_port_info(), it only queries SFP information via
> hclge_get_sfp_info() for V2 hardware. For V1 hardware, it falls back to
> hclge_get_sfp_speed(), which leaves mac->autoneg uninitialized (0):
>
> drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c:hclge_update_port_info() {
> ...
>      if (hdev->ae_dev->dev_version >= HNAE3_DEVICE_VERSION_V2) {
>          speed = mac->speed;
>          ret = hclge_get_sfp_info(hdev, mac);
>      } else {
>          speed = HCLGE_MAC_SPEED_UNKNOWN;
>          ret = hclge_get_sfp_speed(hdev, &speed);
>      }
> ...
> }
>
> If mac->autoneg is left as 0, wouldn't req_autoneg unconditionally be set
> to 0 here, overriding the speed and breaking connectivity for V1 devices?

[Reject] No, this process is the same as before the modification.
        Before the modification, mac->autoneg was not actively assigned a value (it was initialized to 0 by default).

Thanks,
Jijie Shao



^ permalink raw reply

* [PATCH net v3 6/6] ipv6: fix missing notification for ignore_routes_with_linkdown
From: Fernando Fernandez Mancera @ 2026-06-22 13:08 UTC (permalink / raw)
  To: netdev
  Cc: nicolas.dichtel, stephen, horms, pabeni, kuba, edumazet, davem,
	idosch, dsahern, Fernando Fernandez Mancera
In-Reply-To: <20260622130857.5115-1-fmancera@suse.de>

When changing the ignore_routes_with_linkdown sysctl for a specific
interface, the RTM_NEWNETCONF netlink notification was not being emitted
to userspace. Fix this by emitting the notification when needed.

In addition, fix bogus return value for successful "all" and specific
interface write operation leading to a wrong reset of the position
pointer.

Fixes: 35103d11173b ("net: ipv6 sysctl option to ignore routes when nexthop link is down")
Reviewed-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
---
 net/ipv6/addrconf.c | 18 ++++++++++++------
 1 file changed, 12 insertions(+), 6 deletions(-)

diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 82b6f603faa0..cbe681de3818 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -955,11 +955,7 @@ static int addrconf_fixup_linkdown(const struct ctl_table *table, int *p, int ne
 						     NETCONFA_IGNORE_ROUTES_WITH_LINKDOWN,
 						     NETCONFA_IFINDEX_DEFAULT,
 						     net->ipv6.devconf_dflt);
-		rtnl_net_unlock(net);
-		return 0;
-	}
-
-	if (p == &net->ipv6.devconf_all->ignore_routes_with_linkdown) {
+	} else if (p == &net->ipv6.devconf_all->ignore_routes_with_linkdown) {
 		WRITE_ONCE(net->ipv6.devconf_dflt->ignore_routes_with_linkdown, newf);
 		addrconf_linkdown_change(net, newf);
 		if ((!newf) ^ (!old))
@@ -968,11 +964,21 @@ static int addrconf_fixup_linkdown(const struct ctl_table *table, int *p, int ne
 						     NETCONFA_IGNORE_ROUTES_WITH_LINKDOWN,
 						     NETCONFA_IFINDEX_ALL,
 						     net->ipv6.devconf_all);
+	} else {
+		if (!newf ^ !old) {
+			struct inet6_dev *idev = table->extra1;
+
+			inet6_netconf_notify_devconf(net,
+						     RTM_NEWNETCONF,
+						     NETCONFA_IGNORE_ROUTES_WITH_LINKDOWN,
+						     idev->dev->ifindex,
+						     &idev->cnf);
+		}
 	}
 
 	rtnl_net_unlock(net);
 
-	return 1;
+	return 0;
 }
 
 #endif
-- 
2.54.0


^ permalink raw reply related

* [PATCH net v3 5/6] ipv6: fix state corruption during proxy_ndp sysctl restart
From: Fernando Fernandez Mancera @ 2026-06-22 13:08 UTC (permalink / raw)
  To: netdev
  Cc: nicolas.dichtel, stephen, horms, pabeni, kuba, edumazet, davem,
	idosch, dsahern, Fernando Fernandez Mancera
In-Reply-To: <20260622130857.5115-1-fmancera@suse.de>

When handling proxy_ndp, if rtnl_net_trylock() fails, the operation is
retried but as the value was already modified by the initial
proc_dointvec() call, the restarted syscall will read the newly modified
value as the 'old' state.

Fix this by taking the RTNL lock before parsing the input value if the
operation is a write.

Fixes: c92d5491a6d9 ("netconf: add support for IPv6 proxy_ndp")
Reviewed-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
---
 net/ipv6/addrconf.c | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 5d96cbf76134..82b6f603faa0 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -6482,20 +6482,19 @@ static int addrconf_sysctl_disable(const struct ctl_table *ctl, int write,
 static int addrconf_sysctl_proxy_ndp(const struct ctl_table *ctl, int write,
 		void *buffer, size_t *lenp, loff_t *ppos)
 {
+	struct net *net = ctl->extra2;
 	int *valp = ctl->data;
-	int ret;
 	int old, new;
+	int ret;
+
+	if (write && !rtnl_net_trylock(net))
+		return restart_syscall();
 
 	old = *valp;
 	ret = proc_dointvec(ctl, write, buffer, lenp, ppos);
 	new = *valp;
 
 	if (write && old != new) {
-		struct net *net = ctl->extra2;
-
-		if (!rtnl_net_trylock(net))
-			return restart_syscall();
-
 		if (valp == &net->ipv6.devconf_dflt->proxy_ndp) {
 			inet6_netconf_notify_devconf(net, RTM_NEWNETCONF,
 						     NETCONFA_PROXY_NEIGH,
@@ -6514,8 +6513,9 @@ static int addrconf_sysctl_proxy_ndp(const struct ctl_table *ctl, int write,
 						     idev->dev->ifindex,
 						     &idev->cnf);
 		}
-		rtnl_net_unlock(net);
 	}
+	if (write)
+		rtnl_net_unlock(net);
 
 	return ret;
 }
-- 
2.54.0


^ permalink raw reply related

* [PATCH net v3 4/6] ipv6: fix error handling in disable_policy sysctl
From: Fernando Fernandez Mancera @ 2026-06-22 13:08 UTC (permalink / raw)
  To: netdev
  Cc: nicolas.dichtel, stephen, horms, pabeni, kuba, edumazet, davem,
	idosch, dsahern, Fernando Fernandez Mancera
In-Reply-To: <20260622130857.5115-1-fmancera@suse.de>

When writing to the disable_policy sysctl, if proc_dointvec() fails to
parse the input, it returns a negative error code. The current
implementation is resetting the position argument even if an error
occurred during proc_dointvec() and not only during sysctl restart.

Fix this by checking the return value of proc_dointvec() and returning
early on failure.

Fixes: df789fe75206 ("ipv6: Provide ipv6 version of "disable_policy" sysctl")
Reviewed-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
---
 net/ipv6/addrconf.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index d23a89b07eed..5d96cbf76134 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -6769,6 +6769,8 @@ static int addrconf_sysctl_disable_policy(const struct ctl_table *ctl, int write
 	lctl = *ctl;
 	lctl.data = &val;
 	ret = proc_dointvec(&lctl, write, buffer, lenp, ppos);
+	if (ret)
+		return ret;
 
 	if (write && (*valp != val))
 		ret = addrconf_disable_policy(ctl, valp, val);
-- 
2.54.0


^ permalink raw reply related

* [PATCH net v3 3/6] ipv6: fix error handling in forwarding sysctl
From: Fernando Fernandez Mancera @ 2026-06-22 13:08 UTC (permalink / raw)
  To: netdev
  Cc: nicolas.dichtel, stephen, horms, pabeni, kuba, edumazet, davem,
	idosch, dsahern, Fernando Fernandez Mancera
In-Reply-To: <20260622130857.5115-1-fmancera@suse.de>

When writing to the forwarding sysctl, if proc_dointvec() fails to parse
the input, it returns a negative error code. The current implementation
is overwriting that error for write operations.

This results in a silent failure, it returns a successful write although
the configuration was not modified at all. When modifying the "all"
variant it can also modify the configuration of existing interfaces to
the wrong value.

Fix this by checking the return value of proc_dointvec() and returning
early on failure. In addition, adjust return code of
addrconf_fixup_forwarding() for successful operation.

Fixes: b325fddb7f86 ("ipv6: Fix sysctl unregistration deadlock")
Reviewed-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
---
 net/ipv6/addrconf.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 70058d971205..d23a89b07eed 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -913,7 +913,7 @@ static int addrconf_fixup_forwarding(const struct ctl_table *table, int *p, int

 	if (newf)
 		rt6_purge_dflt_routers(net);
-	return 1;
+	return 0;
 }

 static void addrconf_linkdown_change(struct net *net, __s32 newf)
@@ -6370,6 +6370,8 @@ static int addrconf_sysctl_forward(const struct ctl_table *ctl, int write,
 	lctl.data = &val;

 	ret = proc_dointvec(&lctl, write, buffer, lenp, ppos);
+	if (ret)
+		return ret;

 	if (write)
 		ret = addrconf_fixup_forwarding(ctl, valp, val);
-- 
2.54.0

^ permalink raw reply related

* [PATCH net v3 2/6] ipv6: fix error handling in ignore_routes_with_linkdown sysctl
From: Fernando Fernandez Mancera @ 2026-06-22 13:08 UTC (permalink / raw)
  To: netdev
  Cc: nicolas.dichtel, stephen, horms, pabeni, kuba, edumazet, davem,
	idosch, dsahern, Fernando Fernandez Mancera
In-Reply-To: <20260622130857.5115-1-fmancera@suse.de>

When writing to the ignore_routes_with_linkdown sysctl, if
proc_dointvec() fails to parse the input, it returns a negative error
code. The current implementation is overwriting that error for write
operations.

This results in a silent failure, it returns a successful write although
the configuration was not modified at all. When modifying the "all"
variant it can also modify the configuration of existing interfaces to
the wrong value.

Fix this by checking the return value of proc_dointvec() and returning
early on failure.

Fixes: 35103d11173b ("net: ipv6 sysctl option to ignore routes when nexthop link is down")
Reviewed-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
---
 net/ipv6/addrconf.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index c901b444a995..70058d971205 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -6671,6 +6671,8 @@ int addrconf_sysctl_ignore_routes_with_linkdown(const struct ctl_table *ctl,
 	lctl.data = &val;

 	ret = proc_dointvec(&lctl, write, buffer, lenp, ppos);
+	if (ret)
+		return ret;

 	if (write)
 		ret = addrconf_fixup_linkdown(ctl, valp, val);
-- 
2.54.0

^ permalink raw reply related

* [PATCH net v3 1/6] ipv6: fix error handling in disable_ipv6 sysctl
From: Fernando Fernandez Mancera @ 2026-06-22 13:08 UTC (permalink / raw)
  To: netdev
  Cc: nicolas.dichtel, stephen, horms, pabeni, kuba, edumazet, davem,
	idosch, dsahern, Fernando Fernandez Mancera
In-Reply-To: <20260622130857.5115-1-fmancera@suse.de>

When writing to the disable_ipv6 sysctl, if proc_dointvec() fails to
parse the input, it returns a negative error code. The current
implementation is overwriting that error for write operations.

This results in a silent failure, it returns a successful write although
the configuration was not modified at all. When modifying the "all"
variant it can also modify the configuration of existing interfaces to
the wrong value.

Fix this by checking the return value of proc_dointvec() and returning
early on failure.

Fixes: 56d417b12e57 ("IPv6: Add 'autoconf' and 'disable_ipv6' module parameters")
Reviewed-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
---
 net/ipv6/addrconf.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 1f21ccb55caa..c901b444a995 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -6467,6 +6467,8 @@ static int addrconf_sysctl_disable(const struct ctl_table *ctl, int write,
 	lctl.data = &val;

 	ret = proc_dointvec(&lctl, write, buffer, lenp, ppos);
+	if (ret)
+		return ret;

 	if (write)
 		ret = addrconf_disable_ipv6(ctl, valp, val);
-- 
2.54.0

^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox