Netdev List

* Re: [PATCH net v3 2/2] net: macb: drop in-flight Tx SKBs on close
From: Nicolai Buchwitz @ 2026-06-17  9:49 UTC
  To: Théo Lebrun
  Cc: Nicolas Ferre, Claudiu Beznea, Andrew Lunn, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Haavard Skinnemoen,
	Jeff Garzik, Conor Dooley, Paolo Valerio, netdev, linux-kernel,
	Vladimir Kondratiev, Gregory CLEMENT, Benoît Monin,
	Tawfik Bayouk, Thomas Petazzoni, Maxime Chevallier, stable
In-Reply-To: <20260617-macb-drop-tx-v3-2-d4c7e57d890b@bootlin.com>

On 17.6.2026 11:17, Théo Lebrun wrote:
> The MACB driver has since forever leaked the outgoing SKBs that
> have not yet been marked as completed. They live in queue->tx_skb
> which gets freed without remorse nor checking.
> 
> macb_free_consistent() gets called in a few codepaths, but only
> close will trigger the added expressions. In macb_open() and
> macb_alloc_consistent() failure cases, tx_skb just got allocated
> and is empty.
> 
> Use the new macb_tx_unmap() prototype to report our error as
> SKB_DROP_REASON_NOT_SPECIFIED rather than SKB_CONSUMED which makes it
> sound like no error occurred. Equivalent to dev_kfree_skb_any().
> 
> Fixes: 89e5785fc8a6 ("[PATCH] Atmel MACB ethernet driver")
> Cc: stable@vger.kernel.org
> Signed-off-by: Théo Lebrun <theo.lebrun@bootlin.com>
> ---

> [...]

Reviewed-by: Nicolai Buchwitz <nb@tipi-net.de>

Thanks,
Nicolai

* Re: [PATCH net v3 1/2] net: macb: give reasons for Tx SKB kfree
From: Nicolai Buchwitz @ 2026-06-17  9:49 UTC
  To: Théo Lebrun
  Cc: Nicolas Ferre, Claudiu Beznea, Andrew Lunn, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Haavard Skinnemoen,
	Jeff Garzik, Conor Dooley, Paolo Valerio, netdev, linux-kernel,
	Vladimir Kondratiev, Gregory CLEMENT, Benoît Monin,
	Tawfik Bayouk, Thomas Petazzoni, Maxime Chevallier, stable
In-Reply-To: <20260617-macb-drop-tx-v3-1-d4c7e57d890b@bootlin.com>

On 17.6.2026 11:17, Théo Lebrun wrote:
> Using dev_consume_skb_any() marks the drop reason as SKB_CONSUMED every
> time we free a Tx SKB. Instead, replace by SKB_DROP_REASON_NOT_SPECIFIED
> when packet has been dropped without sending.
> 
> It is not precise but at least differs from SKB_CONSUMED and is used by
> many drivers for their error codepaths through dev_kfree_skb_{any,irq}().
> 
> Pass a reason around rather than call dev_consume_skb_any() or
> dev_kfree_skb_any() because macb_tx_unmap() is called for cleanup in
> all cases.
> 
> macb_tx_error_task() is made complex because some SKBs encountered have
> been successfully sent.
> 
> Fixes: 89e5785fc8a6 ("[PATCH] Atmel MACB ethernet driver")
> Cc: stable@vger.kernel.org
> Signed-off-by: Théo Lebrun <theo.lebrun@bootlin.com>
> ---

Looks like my r-b from v2 was lost, but here it goes again :)

Reviewed-by: Nicolai Buchwitz <nb@tipi-net.de>

Thanks,
Nicolai

* Re: [PATCH RFC 8/9] arm64: dts: qcom: shikra-cqs-evk: Enable ethernet0
From: Andrew Lunn @ 2026-06-17  9:48 UTC
  To: Konrad Dybcio
  Cc: Mohd Ayaan Anwar, Andrew Lunn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Rob Herring, Krzysztof Kozlowski,
	Conor Dooley, Richard Cochran, Bjorn Andersson, Konrad Dybcio,
	Maxime Coquelin, Alexandre Torgue, Russell King, linux-arm-msm,
	netdev, devicetree, linux-kernel, linux-stm32, linux-arm-kernel
In-Reply-To: <4f3c6bee-3ccb-467e-a466-89fece0e6a7f@oss.qualcomm.com>

> >>> +	emac0_phy_en_hog: emac0-phy-en-hog {
> >>> +		gpio-hog;
> >>> +		gpios = <149 GPIO_ACTIVE_HIGH>;
> >>> +		output-high;
> >>> +		line-name = "emac0-phy-en";
> >>> +	};
> >>
> >> This looks like a hack - what does this pin actually do?
> >>
> > 
> > The power supply to both PHYs on Shikra is gated by a GPIO pin. I am
> > unsure whether they should be modelled as a fixed, enable-on-boot
> > regulator or just like this. They need to be powered on early so that
> > MDIO can detect them.
> 
> If it's a regulator, then it should be described as a regulator.

Agreed.

> There
> was some discussion regarding the power resources of PHYs over here:
> 
> https://lore.kernel.org/linux-arm-msm/SN7PR19MB67369F7DD02F702437C0F1919D1B2@SN7PR19MB6736.namprd19.prod.outlook.com/

MDIO detection is nice to have, but only works well on simple
boards. I would suggest hard coding the PHY ID in the compatible.

	Andrew

* Re: [PATCH iwl-next v1] ixgbe: Implement PCI reset handler
From: Andrew Lunn @ 2026-06-17  9:44 UTC
  To: Sergey Temerkhanov; +Cc: intel-wired-lan, netdev
In-Reply-To: <20260617084329.199110-1-sergey.temerkhanov@intel.com>

> +static void ixgbe_reset_prep(struct pci_dev *pdev)
> +{
> +	struct ixgbe_adapter *adapter = pci_get_drvdata(pdev);
> +	unsigned int timeout = IXGBE_PCIE_RESET_RETRIES;
> +
> +	if (!adapter)
> +		return;
> +
> +	/* Prevent the service task from being requeued in the timer callback
> +	 * while we're resetting.
> +	 */
> +	if (test_bit(__IXGBE_SERVICE_INITED, &adapter->state)) {
> +		timer_delete_sync(&adapter->service_timer);
> +		/* Prevent the service task from running while we're resetting. */
> +		cancel_work_sync(&adapter->service_task);
> +	}
> +
> +	pci_clear_master(pdev);
> +
> +	while (test_and_set_bit(__IXGBE_RESETTING, &adapter->state) && --timeout)
> +		usleep_range(1000, 2000);

Please consider using something from iopoll.h

> +
> +	if (!timeout) {
> +		e_err(drv, "Timed out waiting for __IXGBE_RESETTING to be released. Reset is needed\n");
> +		pci_set_master(pdev);
> +		return;
> +	}

because this is broken. You need to retest the condition before
declaring ETIMEDOUT.

	Andrew
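
The "retest before declaring a timeout" point can be sketched in plain
userspace C. This is only an illustration of the pattern the polling
helpers in iopoll.h are understood to follow, not code from the patch or
from the kernel: struct cond_state, cond_ready() and poll_cond_timeout()
are hypothetical names, and a count of polls stands in for a real time
budget.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical condition source: reports "ready" from the Nth call on. */
struct cond_state {
	int calls;	/* how many times the condition was tested */
	int ready_at;	/* call number at which it becomes true */
};

static bool cond_ready(struct cond_state *s)
{
	return ++s->calls >= s->ready_at;
}

/*
 * Poll until cond_ready() succeeds or the budget of polls runs out.
 * The verdict always comes from one last test of the condition, never
 * from the budget counter alone.
 */
static int poll_cond_timeout(struct cond_state *s, unsigned int budget)
{
	bool ready;

	assert(budget > 0);

	for (;;) {
		ready = cond_ready(s);
		if (ready)
			break;
		if (--budget == 0) {
			/* Budget exhausted: retest once before giving up. */
			ready = cond_ready(s);
			break;
		}
		/* A driver would sleep here, e.g. with usleep_range(). */
	}

	return ready ? 0 : -ETIMEDOUT;
}
```

The return value depends only on the last observed state of the
condition, so a condition that becomes true during the final wait is
still reported as success.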

* Re: [PATCH net-next 0/3] selftests/xsk: stabilize timeout test behavior
From: Maciej Fijalkowski @ 2026-06-17  9:43 UTC
  To: Jason Xing
  Cc: Tushar Vyavahare, netdev, magnus.karlsson, stfomichev, kernelxing,
	davem, kuba, pabeni, ast, daniel, tirthendu.sarkar, bpf
In-Reply-To: <CAL+tcoDr0gtCPeGi1yOUtg+ZD2YxEbjAy41LBgG63b8-=CStcw@mail.gmail.com>

On Wed, Jun 17, 2026 at 07:39:06AM +0800, Jason Xing wrote:
> Hi Tushar,
> 
> On Tue, Jun 16, 2026 at 11:50 PM Tushar Vyavahare
> <tushar.vyavahare@intel.com> wrote:
> >
> > This series improves AF_XDP selftests by making timeout handling
> > explicit and fixing sources of non-determinism in xsk timeout tests.
> >
> > Patch 1 introduces test_spec::poll_tmout and removes implicit
> > dependence on RX UMEM setup state for timeout behavior.
> >
> > Patch 2 fixes thread harness sequencing by attaching XDP programs
> > before worker startup, removing signal-based termination, and using
> > barrier synchronization only for dual-thread runs.
> >
> > Patch 3 restores shared_umem after POLL_TXQ_FULL so test-local
> > configuration does not leak into subsequent cases on shared-netdev
> > runs.
> >
> > Together these changes make timeout handling easier to follow and
> > improve selftest stability, especially on real NIC runs.
> 
> net-next is closed, but in the meantime I'll review the series ASAP.
> 
> BTW, another thing about selftests I had in my mind is that are you
> planning to work on this [1]?

This one is on me. I took your changes, Jason, and aligned the ZC batching
side with this behavior, followed by an xskxceiver adjustment. I am planning
to send this today EOD; however, let's see how badly internal Sashiko will
kick my ass.

> 
> [1]: https://lore.kernel.org/all/20260520004244.55663-1-kerneljasonxing@gmail.com/
> 
> Thanks,
> Jason
> 
> >
> > Tushar Vyavahare (3):
> >   selftests/xsk: make poll timeout mode explicit
> >   selftests/xsk: fix timeout thread harness sequencing
> >   selftests/xsk: restore shared_umem after POLL_TXQ_FULL
> >
> >  .../selftests/bpf/prog_tests/test_xsk.c       | 96 +++++++++++--------
> >  .../selftests/bpf/prog_tests/test_xsk.h       |  2 +
> >  2 files changed, 56 insertions(+), 42 deletions(-)
> >
> > --
> > 2.43.0
> >
> >
> 

* Re: [PATCH RFC 8/9] arm64: dts: qcom: shikra-cqs-evk: Enable ethernet0
From: Konrad Dybcio @ 2026-06-17  9:42 UTC
  To: Mohd Ayaan Anwar
  Cc: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Rob Herring, Krzysztof Kozlowski, Conor Dooley,
	Richard Cochran, Bjorn Andersson, Konrad Dybcio, Maxime Coquelin,
	Alexandre Torgue, Russell King, linux-arm-msm, netdev, devicetree,
	linux-kernel, linux-stm32, linux-arm-kernel
In-Reply-To: <ajF+xlipLuZtf4HL@oss.qualcomm.com>

On 6/16/26 6:50 PM, Mohd Ayaan Anwar wrote:
> On Tue, Jun 16, 2026 at 11:50:26AM +0200, Konrad Dybcio wrote:
>> On 6/11/26 8:37 PM, Mohd Ayaan Anwar wrote:
>>
>>> +&tlmm {
>>> +	ethernet0_defaults: ethernet0-defaults-state {
>>
>> s/defaults/default
>>
>> Please move this definition to shikra.dtsi
>>
> 
> The CQM and CQS variants have identical GPIO mapping but the IQS is
> different. So should I keep this in shikra.dtsi and overwrite for IQS in
> shikra-iqs-evk.dts?
> 
> 
>>> +
>>> +	emac0_phy_en_hog: emac0-phy-en-hog {
>>> +		gpio-hog;
>>> +		gpios = <149 GPIO_ACTIVE_HIGH>;
>>> +		output-high;
>>> +		line-name = "emac0-phy-en";
>>> +	};
>>
>> This looks like a hack - what does this pin actually do?
>>
> 
> The power supply to both PHYs on Shikra is gated by a GPIO pin. I am
> unsure whether they should be modelled as a fixed, enable-on-boot
> regulator or just like this. They need to be powered on early so that
> MDIO can detect them.

If it's a regulator, then it should be described as a regulator. There
was some discussion regarding the power resources of PHYs over here:

https://lore.kernel.org/linux-arm-msm/SN7PR19MB67369F7DD02F702437C0F1919D1B2@SN7PR19MB6736.namprd19.prod.outlook.com/

Konrad

* Re: [PATCH bpf-next v2 3/4] bpf: Add BPF_FIB_LOOKUP_VLAN_INPUT flag to bpf_fib_lookup() helper
From: Toke Høiland-Jørgensen @ 2026-06-17  9:42 UTC
  To: Avinash Duduskar, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko
  Cc: Eduard Zingerman, Kumar Kartikeya Dwivedi, Martin KaFai Lau,
	Song Liu, Yonghong Song, Jiri Olsa, Emil Tsalapatis,
	John Fastabend, Stanislav Fomichev, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Simon Horman, David Ahern,
	Shuah Khan, Jesper Dangaard Brouer, Mykyta Yatsenko, Leon Hwang,
	KP Singh, Anton Protopopov, Amery Hung, Eyal Birger, Rong Tao,
	bpf, netdev, linux-kselftest, linux-kernel
In-Reply-To: <20260616223426.3568080-4-avinash.duduskar@gmail.com>

Avinash Duduskar <avinash.duduskar@gmail.com> writes:

> BPF_FIB_LOOKUP_VLAN resolves a VLAN egress. The reverse is also
> useful: an XDP program receiving a VLAN-tagged frame on a physical
> device wants the lookup to behave as if the packet had arrived on the
> corresponding VLAN subinterface, so iif-based policy routing and VRF
> table selection use the right ingress.
>
> Add BPF_FIB_LOOKUP_VLAN_INPUT. When set, params->h_vlan_proto and
> params->h_vlan_TCI are read as an input VLAN tag and the matching VLAN
> device of params->ifindex is resolved with __vlan_find_dev_deep_rcu().
> The device must be up and in the same network namespace as
> params->ifindex (a VLAN device can be moved to another netns while
> registered on its parent; receive would deliver into that other
> namespace, which a lookup here cannot represent). If params->ifindex
> is itself a VLAN device, its inner (QinQ) subinterface is matched.
> For a bond or team, a tag on a port matches no device and returns
> NOT_FWDED; pass the master's ifindex.
> The lookup then runs with the resolved device as the ingress;
> params->ifindex itself is not modified on the input side. When the
> resolved device is enslaved to a VRF, both the full lookup (via the
> l3mdev rule) and BPF_FIB_LOOKUP_DIRECT (via l3mdev_fib_table_rcu())
> select the VRF's table from the resolved ingress. That follows from
> feeding the resolved device to the flow as the ingress
> (fl4.flowi4_iif = dev->ifindex), which is what makes l3mdev resolve
> the VRF master from the subinterface rather than from
> params->ifindex.
>
> The two failure classes get different treatment on purpose. A
> h_vlan_proto other than 802.1Q/802.1ad is API misuse and returns
> -EINVAL, since it would otherwise reach the WARN in vlan_proto_idx()
> with a program-controlled value. An unmatched VID, a device that is
> down, or one in another namespace is a data outcome and returns
> BPF_FIB_LKUP_RET_NOT_FWDED, matching the DIRECT path when
> fib_get_table() finds no table and mirroring real ingress, where the
> receive path drops such frames. A VID of 0 (a priority tag) is looked
> up literally and normally fails the same way; receive instead
> processes such frames untagged, so callers should not set the flag for
> priority tags. Proceeding on the physical device for any of these
> would be fail-open for the policy-routing cases above.
>
> The h_vlan fields share a union with tbid, so the flag cannot be
> combined with BPF_FIB_LOOKUP_TBID. It describes ingress, so it also
> cannot be combined with BPF_FIB_LOOKUP_OUTPUT. Both combinations
> return -EINVAL; restricting now keeps a later relaxation backward
> compatible. Combining with BPF_FIB_LOOKUP_VLAN is allowed: the tag is
> consumed on the ingress side and the egress tag is written on
> success.
>
> Under !CONFIG_VLAN_8021Q the __vlan_find_dev_deep_rcu() stub returns
> NULL, so every lookup with the flag returns NOT_FWDED, which is
> correct since no VLAN device can exist.
>
> Suggested-by: Toke Høiland-Jørgensen <toke@redhat.com>
> Signed-off-by: Avinash Duduskar <avinash.duduskar@gmail.com>
> ---
>  include/uapi/linux/bpf.h       | 34 ++++++++++++++-
>  net/core/filter.c              | 80 +++++++++++++++++++++++++++++++---
>  tools/include/uapi/linux/bpf.h | 34 ++++++++++++++-
>  3 files changed, 141 insertions(+), 7 deletions(-)
>
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index f77aa9472bf1..57e28da3336a 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -3552,6 +3552,35 @@ union bpf_attr {
>   *			reports the route mtu in *params*->mtu_result, and on
>   *			the tc path without tot_len the mtu check runs after
>   *			the swap, against the parent device.
> + *		**BPF_FIB_LOOKUP_VLAN_INPUT**
> + *			Treat *params*->h_vlan_proto and *params*->h_vlan_TCI
> + *			as an input VLAN tag (e.g. parsed from the packet) and
> + *			run the lookup as if ingress had happened on the VLAN
> + *			subinterface carrying that tag for *params*->ifindex,
> + *			rather than on *params*->ifindex itself. The VID is the
> + *			low 12 bits of *params*->h_vlan_TCI;
> + *			*params*->h_vlan_proto must be ETH_P_8021Q or
> + *			ETH_P_8021AD in network byte order (any other value
> + *			returns **-EINVAL**). The
> + *			subinterface is the one configured for that tag on
> + *			*params*->ifindex; if *params*->ifindex is itself a
> + *			VLAN device, its inner (QinQ) subinterface is matched.
> + *			For a bond or team, a tag on a port matches no
> + *			device and returns NOT_FWDED; pass the master's
> + *			ifindex.
> + *			If no matching subinterface exists, or it is not up,
> + *			or it was moved to another network namespace, the
> + *			lookup returns **BPF_FIB_LKUP_RET_NOT_FWDED**,
> + *			mirroring real ingress, which drops a frame whose tag
> + *			is unconfigured or whose VLAN device is down. A VID of
> + *			0 (a priority-tagged frame) is looked up literally like
> + *			any other VID; receive instead processes such frames
> + *			untagged on the device itself, so do not set this flag
> + *			for priority tags.
> + *			Cannot be combined with **BPF_FIB_LOOKUP_TBID** (both
> + *			use the same input fields) or **BPF_FIB_LOOKUP_OUTPUT**
> + *			(this flag is ingress-only); doing so returns
> + *			**-EINVAL**.

This comment is also overly long - please trim.

>   *
>   *		*ctx* is either **struct xdp_md** for XDP programs or
>   *		**struct sk_buff** tc cls_act programs.
> @@ -7348,6 +7377,7 @@ enum {
>  	BPF_FIB_LOOKUP_SRC     = (1U << 4),
>  	BPF_FIB_LOOKUP_MARK    = (1U << 5),
>  	BPF_FIB_LOOKUP_VLAN    = (1U << 6),
> +	BPF_FIB_LOOKUP_VLAN_INPUT = (1U << 7),
>  };
>  
>  enum {
> @@ -7416,7 +7446,9 @@ struct bpf_fib_lookup {
>  		struct {
>  			/* output with BPF_FIB_LOOKUP_VLAN: set from the
>  			 * resolved egress VLAN device (see the flag); zeroed
> -			 * on other successful lookups.
> +			 * on other successful lookups. input with
> +			 * BPF_FIB_LOOKUP_VLAN_INPUT: the VLAN tag to scope
> +			 * the lookup by.
>  			 */
>  			__be16	h_vlan_proto;
>  			__be16	h_vlan_TCI;
> diff --git a/net/core/filter.c b/net/core/filter.c
> index b37a12321fba..cfbdd842ce61 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -6158,6 +6158,41 @@ static int bpf_fib_set_fwd_params(struct net_device *dev,
>  
>  	return 0;
>  }
> +
> +/* With BPF_FIB_LOOKUP_VLAN_INPUT the caller passes the packet's VLAN tag in
> + * params->h_vlan_proto and params->h_vlan_TCI; the lookup is done as if
> + * ingress had happened on the matching VLAN subinterface of *dev. Resolve
> + * it and store it in *dev. params is not modified.
> + *
> + * A protocol other than 802.1Q/802.1AD is API misuse (it would otherwise
> + * reach the WARN in vlan_proto_idx()), so it is rejected with -EINVAL. An
> + * unmatched VID, a matching device that is down, or one that was moved
> + * to another netns (receive would deliver into that netns' stack, which
> + * a lookup here cannot represent) is a data outcome, reported as
> + * NOT_FWDED, the same way the DIRECT path reports a missing table. Under
> + * !CONFIG_VLAN_8021Q __vlan_find_dev_deep_rcu() returns NULL, so every
> + * call returns NOT_FWDED, which is correct since no subinterface can
> + * exist.
> + */

As in the previous patch, please drop this comment.

> +static int bpf_fib_vlan_input_dev(struct net_device **dev,
> +				  const struct bpf_fib_lookup *params)
> +{

Just return the dev pointer and use ERR_PTR for errors? That's what we
usually do for these kinds of functions.

-Toke
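
The "return the dev pointer and use ERR_PTR" shape can be sketched in
userspace C as follows. Everything here is an assumption made for
illustration: ERR_PTR(), PTR_ERR() and IS_ERR() are simplified stand-ins
for the kernel helpers in linux/err.h, struct fake_dev, vlan_input_dev()
and fib_lookup_sketch() are hypothetical, protocols are compared in host
byte order for brevity, and mapping "no matching device" to a NULL return
is one possible choice, not something stated in the thread.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ERRNO 4095
#define FAKE_RET_NOT_FWDED 4	/* stand-in for BPF_FIB_LKUP_RET_NOT_FWDED */

/* Simplified userspace stand-ins for the helpers in linux/err.h. */
static inline void *ERR_PTR(long error) { return (void *)error; }
static inline long PTR_ERR(const void *ptr) { return (long)ptr; }
static inline int IS_ERR(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical device table standing in for VLAN subinterfaces. */
struct fake_dev {
	int ifindex;
	uint16_t vid;
	int up;
};

static struct fake_dev devs[] = {
	{ 10, 100, 1 },
	{ 11, 200, 0 },	/* configured but down */
};

/*
 * Returns the matching device, NULL when nothing usable matches (a data
 * outcome), or ERR_PTR(-EINVAL) for an unsupported protocol (API misuse).
 */
static struct fake_dev *vlan_input_dev(uint16_t proto, uint16_t tci)
{
	uint16_t vid = tci & 0x0fff;	/* the VID is the low 12 bits */
	size_t i;

	if (proto != 0x8100 && proto != 0x88a8)	/* 802.1Q / 802.1ad */
		return ERR_PTR(-EINVAL);

	for (i = 0; i < sizeof(devs) / sizeof(devs[0]); i++)
		if (devs[i].vid == vid && devs[i].up)
			return &devs[i];

	return NULL;
}

/* Caller side: one pointer carries the device or the error code. */
static int fib_lookup_sketch(uint16_t proto, uint16_t tci, int *ifindex)
{
	struct fake_dev *dev;

	assert(ifindex != NULL);

	dev = vlan_input_dev(proto, tci);
	if (IS_ERR(dev))
		return (int)PTR_ERR(dev);
	if (!dev)
		return FAKE_RET_NOT_FWDED;

	*ifindex = dev->ifindex;
	return 0;
}
```

With this shape the helper needs no output parameter, and the two failure
classes described in the commit message stay distinguishable at the call
site.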


* Re: [PATCH v27 4/5] sfc: obtain and map cxl range using devm_cxl_probe_mem
From: Alejandro Lucero Palau @ 2026-06-17  9:42 UTC
  To: Dan Williams (nvidia), alejandro.lucero-palau, linux-cxl, netdev,
	edward.cree, davem, kuba, pabeni, edumazet, dave.jiang
In-Reply-To: <6a31a948d61f5_9b8551006b@djbw-dev.notmuch>


On 6/16/26 20:51, Dan Williams (nvidia) wrote:
> Alejandro Lucero Palau wrote:
>> On 6/10/26 14:56, Alejandro Lucero Palau wrote:
>>> On 6/10/26 07:10, Alejandro Lucero Palau wrote:
>>>> On 6/10/26 00:30, Dan Williams (nvidia) wrote:
>>>>> alejandro.lucero-palau@ wrote:
>>>>>> From: Alejandro Lucero <alucerop@amd.com>
>>>>>>
>>>>>> Use the core API to safely obtain the CXL range linked to an HDM
>>>>>> committed by the BIOS. Map that range for use as the ctpio buffer.
>>>>>>
>>>>>> A user space action, such as a sysfs unbind or removal of the core
>>>>>> cxl modules, will trigger sfc driver device detachment. That case
>>>>>> does not race with this mapping, because the mapping is done during
>>>>>> driver probe and is therefore protected by the device lock against
>>>>>> those user space actions.
>>>>>>
>>>>>> Signed-off-by: Alejandro Lucero <alucerop@amd.com>
>>>>>> ---
>>>>>>    drivers/net/ethernet/sfc/efx.c     |  1 +
>>>>>>    drivers/net/ethernet/sfc/efx_cxl.c | 24 ++++++++++++++++++++++++
>>>>>>    drivers/net/ethernet/sfc/efx_cxl.h |  3 +++
>>>>>>    3 files changed, 28 insertions(+)
>>>>>>
>>>>>> diff --git a/drivers/net/ethernet/sfc/efx.c b/drivers/net/ethernet/sfc/efx.c
>>>>>> index 90ccbe310386..578054c21e79 100644
>>>>>> --- a/drivers/net/ethernet/sfc/efx.c
>>>>>> +++ b/drivers/net/ethernet/sfc/efx.c
>>>>>> @@ -984,6 +984,7 @@ static void efx_pci_remove(struct pci_dev *pci_dev)
>>>>>>        efx_fini_io(efx);
>>>>>>          probe_data = container_of(efx, struct efx_probe_data, efx);
>>>>>> +    efx_cxl_exit(probe_data);
>>>>>>          pci_dbg(efx->pci_dev, "shutdown successful\n");
>>>>>> diff --git a/drivers/net/ethernet/sfc/efx_cxl.c b/drivers/net/ethernet/sfc/efx_cxl.c
>>>>>> index 4d55c08cf2a1..d5766a40e2cf 100644
>>>>>> --- a/drivers/net/ethernet/sfc/efx_cxl.c
>>>>>> +++ b/drivers/net/ethernet/sfc/efx_cxl.c
>>>>>> @@ -18,6 +18,7 @@ int efx_cxl_init(struct efx_probe_data *probe_data)
>>>>>>    {
>>>>>>        struct efx_nic *efx = &probe_data->efx;
>>>>>>        struct pci_dev *pci_dev = efx->pci_dev;
>>>>>> +    struct range cxl_pio_range;
>>>>>>        struct efx_cxl *cxl;
>>>>>>        u16 dvsec;
>>>>>>        int rc;
>>>>>> @@ -75,9 +76,32 @@ int efx_cxl_init(struct efx_probe_data *probe_data)
>>>>>>            return -ENODEV;
>>>>>>        }
>>>>>>    +    cxl->cxlmd = devm_cxl_probe_mem(&cxl->cxlds, &cxl_pio_range);
>>>>>> +    if (IS_ERR(cxl->cxlmd)) {
>>>>>> +        pci_err(pci_dev, "CXL accel memdev creation failed\n");
>>>>>> +        return PTR_ERR(cxl->cxlmd);
>>>>>> +    }
>>>>>> +
>>>>>> +    cxl->ctpio_cxl = ioremap_wc(cxl_pio_range.start,
>>>>>> +                    range_len(&cxl_pio_range));
>>>>>> +    if (!cxl->ctpio_cxl) {
>>>>>> +        pci_err(pci_dev, "CXL ioremap region (%pra) failed\n",
>>>>>> +            &cxl_pio_range);
>>>>>> +        return -ENOMEM;
>>>>> Dave caught the iounmap leak, but another concern is since you want to
>>>>> continue operation if efx_cxl_init() fails then you probably also want
>>>>> to release the successful attachment to the CXL domain if this happens.
>>>>
>>>> I will do that.
>>>>
>>> Looking at this issue, I think an error when creating the memdev or
>>> during the region attach triggers the memdev removal, but ...
>>>
>>>
>>>>> Minor since something else is likely to fail if ioremap is not
>>>>> reliable.
>>>
>>> .. if we want to specifically do that with an unlikely (but possible)
>>> ioremap error something else needs to be exported like
>>> cxl_memdev_unregister(). Are you happy with that approach?
>>>
>> I have just tested with this:
>>
>> +void cxl_memdev_remove(void *_cxlmd)
>> +{
>> +       struct cxl_memdev *cxlmd = _cxlmd;
>> +       struct device *dev = &cxlmd->dev;
>> +
>> +       devm_remove_action_nowarn(cxlmd->cxlds->dev, cxl_memdev_unregister,
>> +                                 cxlmd);
>> +
>> +       cdev_device_del(&cxlmd->cdev, dev);
>> +       cxl_memdev_shutdown(dev);
>> +       put_device(dev);
>> +}
>> +EXPORT_SYMBOL_NS_GPL(cxl_memdev_remove, "CXL");
>>
>>
>> only called if the ioremap fails.
>>
>>
>> Please, let me know if you like this approach before sending another
>> version.
> A devres group can automatically cleanup after devm_cxl_probe_mem()
> in the error path with no new exports needed from the CXL core.
> Something like:
>
>          void *group = devres_open_group(cxl->cxlds.dev, NULL, GFP_KERNEL);
>          int rc = 0;
>
>          if (!group)
>                  return -ENOMEM;
>          
>          cxl->cxlmd = devm_cxl_probe_mem(&cxl->cxlds, &cxl_pio_range);
>          if (IS_ERR(cxl->cxlmd)) {
>                  pci_err(pci_dev, "CXL accel memdev creation failed\n");
>                  rc = PTR_ERR(cxl->cxlmd);
>                  goto out;
>          }
>
>          cxl->ctpio_cxl =
>                  ioremap_wc(cxl_pio_range.start, range_len(&cxl_pio_range));
>          if (!cxl->ctpio_cxl) {
>                  pci_err(pci_dev, "CXL ioremap region (%pra) failed\n",
>                          &cxl_pio_range);
>                  rc = -ENOMEM;
>          }
>
> out:
>          if (rc)
>                  devres_release_group(group);
>          else
>                  devres_remove_group(group);
>          return rc;


OK. I will use this in v28 instead of that export.


Thanks


* Re: [PATCH net] net: rnpgbe: fix mailbox endianness handling
From: Andrew Lunn @ 2026-06-17  9:40 UTC
  To: Dong Yibo
  Cc: andrew+netdev, davem, edumazet, kuba, pabeni, vadim.fedorenko,
	netdev, linux-kernel, yaojun
In-Reply-To: <20260617083531.251119-1-dong100@mucse.com>

On Wed, Jun 17, 2026 at 04:35:31PM +0800, Dong Yibo wrote:
> Mailbox data is exchanged through 32-bit MMIO accesses but the
> mailbox payload is defined using little-endian FW structures with
> __le16 and __le32 fields.

Given you are using __le16 and __le32, why did sparse not find these
issues? It would be good to understand this, because if sparse missed
this, what else has sparse missed which is also broken?

	Andrew

* Re: [PATCH net-next 1/3] net: busy-poll: introduce sk_tx_busy_loop()
From: Maciej Fijalkowski @ 2026-06-17  9:40 UTC
  To: Menglong Dong
  Cc: menglong8.dong, Jakub Kicinski, jasowang, mst, xuanzhuo, eperezma,
	andrew+netdev, davem, edumazet, pabeni, magnus.karlsson, sdf,
	horms, ast, daniel, hawk, john.fastabend, bjorn, kerneljasonxing,
	netdev, virtualization, linux-kernel, bpf
In-Reply-To: <TYn10tJ2SIGF1pAhF26DRQ@linux.dev>

On Sun, Jun 14, 2026 at 06:12:46PM +0800, Menglong Dong wrote:
> On 2026/6/14 02:21, Jakub Kicinski wrote:
> > On Thu, 11 Jun 2026 15:12:40 +0800 menglong8.dong@gmail.com wrote:
> > > For now, we use sk_busy_loop() for both rx and tx path. The sk_busy_loop()
> > > will call napi_busy_loop() for the specified napi_id. However, some
> > > nic drivers have tx napi, such as virtio-net. In this case, sk_busy_loop()
> > > doesn't work, as it can only schedule the NAPI for the rx queue.
> > > 
> > > Therefore, introduce sk_tx_busy_loop() for the nic drivers that support tx
> > > napi, which will schedule the tx napi if available.
> > 
> > First, I thought the only difference with Tx NAPI is that it can't be
> > busy polled. So if you want to poll an instance don't register it as 
> > a Tx one instead of adding all this "tx polling" stuff in the core?
> 
> I see. Register the tx NAPI with netif_napi_add_config() allow us
> busy poll it. But we still have two NAPI instance: rx NAPI and tx NAPI.
> sk_busy_loop() can only busy poll on one of them.
> 
> Before AF_XDP, we don't have the need to send packet via tx NAPI, which
> means that we don't need to busy poll it.
> 
> I analyzed how some NIC drivers implement AF_XDP. Some of them
> check the xsk tx ring of the current queue and send the data in it from
> the rx NAPI, such as mlx5. Some of them allocate an extra "rxtx" NAPI
> for the AF_XDP zero-copy queue, which polls both data receiving
> and sending.
> 
> In the cases above, they do the data sending and receiving for
> AF_XDP in a single NAPI instance.
> 
> However, some drivers receive data in the rx NAPI and send data in
> the tx NAPI for AF_XDP. In this case, we can't use sk_busy_loop() for both
> the rx path and the tx path, as we need to wake different NAPI instances.
> 
> > 
> > Second, can this problem happen for any other NIC or is it purely 
> > an artifact of virtio's delayed Tx completion handling?
> 
> According to my analysis, only virtio-net and ICSSG driver have
> split NAPI for AF_XDP. I don't have a ICSSG nic, but the codex tell
> me that it does have the same problem.
> 
> I'm not sure if it is a good idea to introduce the sk_tx_busy_loop().
> Maybe we can modify the driver instead by using the same NAPI
> for both data sending and receiving, just like others do. The
> advantage of introduce sk_tx_busy_loop() is that we can split the
> data sending and receiving, which maybe more efficient.

It would be good to back your changes with performance numbers. I
believe that drivers do tx processing via rx napi because, before AF_XDP,
it was only about cleaning up writebacks; AF_XDP added more weight via
actual tx descriptor submission.

Maybe you can vibe-code virtio-net to work only with rx napi and see what
the results are.

Side note/question - Do you have a tx-only use case for AF_XDP ? I am
planning (for a long time actually) to implement asymmetric AF_XDP
sockets. Currently for ZC scenarios xsk socket occupies both rx and tx
queues even when you do rx or tx only.

> 
> > 
> > Third, this series does not apply.
> 
> Ah, I'll rebase this series if a V2 is acceptable.
> 
> Thanks!
> Menglong Dong
> 
> > 
> > 
> 
> 
> 
> 

* [PATCH bpf v3 2/2] selftests/bpf: Cover partial copy of non-linear test_run output
From: Sun Jian @ 2026-06-17  9:35 UTC
  To: bpf
  Cc: netdev, linux-kselftest, linux-kernel, ast, daniel, andrii,
	martin.lau, eddyz87, memxor, song, yonghong.song, jolsa, davem,
	edumazet, kuba, pabeni, horms, shuah, hawk, john.fastabend, sdf,
	toke, lorenzo, paul.chaignon, Sun Jian
In-Reply-To: <20260617093557.63880-1-sun.jian.kdev@gmail.com>

prog_run_opts already verifies that BPF_PROG_TEST_RUN returns -ENOSPC
for a short data_out buffer while still reporting the full output size
through data_size_out.

Add the same coverage for non-linear test_run output. Use pass-through
TC and XDP programs with a 9000-byte packet, a 64-byte linear data area,
and a 100-byte data_out buffer. The expected output spans both the linear
data and the first fragment.

Verify that test_run returns -ENOSPC, reports the full packet length
through data_size_out, and copies the packet prefix into data_out for
both non-linear skb and XDP frags paths.

Signed-off-by: Sun Jian <sun.jian.kdev@gmail.com>
---
 .../selftests/bpf/prog_tests/prog_run_opts.c  | 70 +++++++++++++++++++
 .../selftests/bpf/progs/test_pkt_access.c     | 12 ++++
 2 files changed, 82 insertions(+)

diff --git a/tools/testing/selftests/bpf/prog_tests/prog_run_opts.c b/tools/testing/selftests/bpf/prog_tests/prog_run_opts.c
index 01f1d1b6715a..9cc898e6a9f7 100644
--- a/tools/testing/selftests/bpf/prog_tests/prog_run_opts.c
+++ b/tools/testing/selftests/bpf/prog_tests/prog_run_opts.c
@@ -4,6 +4,10 @@
 
 #include "test_pkt_access.skel.h"
 
+#define NONLINEAR_PKT_LEN 9000
+#define NONLINEAR_LINEAR_DATA_LEN 64
+#define SHORT_OUT_LEN 100
+
 static const __u32 duration;
 
 static void check_run_cnt(int prog_fd, __u64 run_cnt)
@@ -20,6 +24,69 @@ static void check_run_cnt(int prog_fd, __u64 run_cnt)
 	      "incorrect number of repetitions, want %llu have %llu\n", run_cnt, info.run_cnt);
 }
 
+static void init_pkt(__u8 *pkt, size_t len)
+{
+	size_t i;
+
+	for (i = 0; i < len; i++)
+		pkt[i] = i & 0xff;
+}
+
+static void test_skb_nonlinear_data_out_partial(struct test_pkt_access *skel)
+{
+	LIBBPF_OPTS(bpf_test_run_opts, topts);
+	__u8 pkt[NONLINEAR_PKT_LEN];
+	__u8 out[SHORT_OUT_LEN];
+	struct __sk_buff skb = {};
+	int prog_fd, err;
+
+	init_pkt(pkt, sizeof(pkt));
+
+	skb.data_end = NONLINEAR_LINEAR_DATA_LEN;
+
+	topts.data_in = pkt;
+	topts.data_size_in = sizeof(pkt);
+	topts.data_out = out;
+	topts.data_size_out = sizeof(out);
+	topts.ctx_in = &skb;
+	topts.ctx_size_in = sizeof(skb);
+
+	prog_fd = bpf_program__fd(skel->progs.tc_pass_prog);
+	err = bpf_prog_test_run_opts(prog_fd, &topts);
+
+	ASSERT_EQ(err, -ENOSPC, "skb_nonlinear_partial_err");
+	ASSERT_EQ(topts.data_size_out, sizeof(pkt), "skb_nonlinear_partial_data_size_out");
+	ASSERT_OK(memcmp(out, pkt, sizeof(out)), "skb_nonlinear_partial_data_out");
+}
+
+static void test_xdp_nonlinear_data_out_partial(struct test_pkt_access *skel)
+{
+	LIBBPF_OPTS(bpf_test_run_opts, topts);
+	__u8 pkt[NONLINEAR_PKT_LEN];
+	__u8 out[SHORT_OUT_LEN];
+	struct xdp_md ctx = {};
+	int prog_fd, err;
+
+	init_pkt(pkt, sizeof(pkt));
+
+	ctx.data = 0;
+	ctx.data_end = NONLINEAR_LINEAR_DATA_LEN;
+
+	topts.data_in = pkt;
+	topts.data_size_in = sizeof(pkt);
+	topts.data_out = out;
+	topts.data_size_out = sizeof(out);
+	topts.ctx_in = &ctx;
+	topts.ctx_size_in = sizeof(ctx);
+
+	prog_fd = bpf_program__fd(skel->progs.xdp_frags_pass_prog);
+	err = bpf_prog_test_run_opts(prog_fd, &topts);
+
+	ASSERT_EQ(err, -ENOSPC, "xdp_nonlinear_partial_err");
+	ASSERT_EQ(topts.data_size_out, sizeof(pkt), "xdp_nonlinear_partial_data_size_out");
+	ASSERT_OK(memcmp(out, pkt, sizeof(out)), "xdp_nonlinear_partial_data_out");
+}
+
 void test_prog_run_opts(void)
 {
 	struct test_pkt_access *skel;
@@ -69,6 +136,9 @@ void test_prog_run_opts(void)
 	run_cnt += topts.repeat;
 	check_run_cnt(prog_fd, run_cnt);
 
+	test_skb_nonlinear_data_out_partial(skel);
+	test_xdp_nonlinear_data_out_partial(skel);
+
 cleanup:
 	if (skel)
 		test_pkt_access__destroy(skel);
diff --git a/tools/testing/selftests/bpf/progs/test_pkt_access.c b/tools/testing/selftests/bpf/progs/test_pkt_access.c
index bce7173152c6..cd284401eebd 100644
--- a/tools/testing/selftests/bpf/progs/test_pkt_access.c
+++ b/tools/testing/selftests/bpf/progs/test_pkt_access.c
@@ -150,3 +150,15 @@ int test_pkt_access(struct __sk_buff *skb)
 
 	return TC_ACT_UNSPEC;
 }
+
+SEC("tc")
+int tc_pass_prog(struct __sk_buff *skb)
+{
+	return TC_ACT_OK;
+}
+
+SEC("xdp.frags")
+int xdp_frags_pass_prog(struct xdp_md *ctx)
+{
+	return XDP_PASS;
+}
-- 
2.43.0



* [PATCH bpf v3 1/2] bpf: Fix partial copy of non-linear test_run output
From: Sun Jian @ 2026-06-17  9:35 UTC (permalink / raw)
  To: bpf
  Cc: netdev, linux-kselftest, linux-kernel, ast, daniel, andrii,
	martin.lau, eddyz87, memxor, song, yonghong.song, jolsa, davem,
	edumazet, kuba, pabeni, horms, shuah, hawk, john.fastabend, sdf,
	toke, lorenzo, paul.chaignon, Sun Jian
In-Reply-To: <20260617093557.63880-1-sun.jian.kdev@gmail.com>

For non-linear test_run output, bpf_test_finish() derives the linear
data copy length from copy_size - frag_size. This only matches the
linear data length when copy_size is the full packet size.

When userspace provides a short data_out buffer, copy_size is clamped to
that buffer size. If copy_size is smaller than frag_size, the computed
length becomes negative and bpf_test_finish() returns -ENOSPC before
copying the packet prefix or updating data_size_out.

Compute the linear data length from the packet layout instead, and clamp
the linear copy length to copy_size. This preserves the expected
partial-copy semantics: return -ENOSPC, copy the packet prefix that fits
in data_out, and report the full packet length through data_size_out.
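
The difference between the two length calculations can be sketched in plain userspace C. This is only a model of the arithmetic described above, not the kernel code; the function names are hypothetical, and the sizes used in the usage note are made-up values.

```c
#include <assert.h>
#include <stdint.h>

/* Old calculation: subtract the fragment bytes from the already-clamped
 * copy_size. The result goes negative as soon as data_out is shorter
 * than the fragment area.
 */
static int old_linear_len(uint32_t copy_size, uint32_t frag_size)
{
	return (int)copy_size - (int)frag_size;
}

/* New calculation: take the linear length from the packet layout
 * (size - frag_size), then clamp it to what fits in data_out.
 */
static uint32_t new_linear_len(uint32_t size, uint32_t frag_size,
			       uint32_t copy_size)
{
	uint32_t head_len;

	assert(frag_size <= size);	/* fragments are part of the packet */
	head_len = size - frag_size;
	return copy_size < head_len ? copy_size : head_len;
}
```

For a hypothetical 4000-byte packet with 3000 bytes in fragments and a 64-byte data_out, old_linear_len(64, 3000) is negative, while new_linear_len(4000, 3000, 64) yields 64, so the prefix can still be copied.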

Fixes: 7855e0db150ad ("bpf: test_run: add xdp_shared_info pointer in bpf_test_finish signature")
Signed-off-by: Sun Jian <sun.jian.kdev@gmail.com>
---
 net/bpf/test_run.c | 8 ++------
 1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/net/bpf/test_run.c b/net/bpf/test_run.c
index 2bc04feadfab..f15c613aaa4e 100644
--- a/net/bpf/test_run.c
+++ b/net/bpf/test_run.c
@@ -453,12 +453,8 @@ static int bpf_test_finish(const union bpf_attr *kattr,
 	}

 	if (data_out) {
-		int len = sinfo ? copy_size - frag_size : copy_size;
-
-		if (len < 0) {
-			err = -ENOSPC;
-			goto out;
-		}
+		u32 head_len = size - frag_size;
+		u32 len = min(copy_size, head_len);

 		if (copy_to_user(data_out, data, len))
 			goto out;
-- 
2.43.0


* Re: [PATCH] e1000: Remove redundant else after return
From: Andrew Lunn @ 2026-06-17  9:36 UTC (permalink / raw)
  To: Lovekesh Solanki
  Cc: andrew+netdev, anthony.l.nguyen, davem, edumazet, kuba, netdev,
	pabeni, przemyslaw.kitszel
In-Reply-To: <20260617075855.113719-1-lovekeshsolanki00@gmail.com>

On Wed, Jun 17, 2026 at 01:28:55PM +0530, Lovekesh Solanki wrote:
> Hi Andrew,
> 
> I read the documentation you linked and understand simple standalone
> cleanups are discouraged.

You also said in the commit message it reduced the indentation level,
but you did not actually reduce the indentation!

    Andrew


* [PATCH bpf v3 0/2] Fix partial copy of non-linear test_run output
From: Sun Jian @ 2026-06-17  9:35 UTC (permalink / raw)
  To: bpf
  Cc: netdev, linux-kselftest, linux-kernel, ast, daniel, andrii,
	martin.lau, eddyz87, memxor, song, yonghong.song, jolsa, davem,
	edumazet, kuba, pabeni, horms, shuah, hawk, john.fastabend, sdf,
	toke, lorenzo, paul.chaignon, Sun Jian

When BPF_PROG_TEST_RUN returns non-linear output and userspace provides a
short data_out buffer, bpf_test_finish() can return -ENOSPC before copying
the packet prefix or updating data_size_out.

Fix this by deriving the linear copy length from the packet layout rather
than from the already-clamped copy_size. Add selftest coverage for both
non-linear skb and XDP frags paths.

Changes in v3:

* Keep the fix patch minimal by leaving the existing offset declaration
  unchanged.
* Drop unnecessary memset() calls from the new selftests.
* Keep the pass-through TC program and larger test packet for the skb
  case. pkt_v4 is too small once the short IPv4 input check is accounted
  for, and the existing packet-access program fails before reaching the
  partial copy-out path with such a short linear area.

Changes in v2:

* Fix the Fixes tag to point to the commit that introduced the shared
  non-linear copy-out logic.
* Drop skb-specific wording from the fix commit.
* Move the selftest from skb_load_bytes.c to prog_run_opts.c.
* Add XDP frags coverage in addition to non-linear skb coverage.

v2:
https://lore.kernel.org/bpf/20260616093103.471444-1-sun.jian.kdev@gmail.com/

v1:
https://lore.kernel.org/bpf/20260615073856.152479-1-sun.jian.kdev@gmail.com/

Tested with:
  ./test_progs -t prog_run_opts -v
  ./test_progs -t skb_load_bytes -v
  ./test_progs -t xdp_pull_data -v

Sun Jian (2):
  bpf: Fix partial copy of non-linear test_run output
  selftests/bpf: Cover partial copy of non-linear test_run output

 net/bpf/test_run.c                            |  8 +--
 .../selftests/bpf/prog_tests/prog_run_opts.c  | 70 +++++++++++++++++++
 .../selftests/bpf/progs/test_pkt_access.c     | 12 ++++
 3 files changed, 84 insertions(+), 6 deletions(-)

-- 
2.43.0


* [PATCH net-next v1] net: wangxun: don't advertise IFF_SUPP_NOFCS
From: Rongguang Wei @ 2026-06-17  9:28 UTC (permalink / raw)
  To: netdev; +Cc: jiawenwu, mengyuanlou, pabeni, kuba, Rongguang Wei

From: Rongguang Wei <weirongguang@kylinos.cn>

As with commit a24162f18825 ("i40e: don't advertise IFF_SUPP_NOFCS"),
ngbe and txgbe advertise IFF_SUPP_NOFCS, which allows users to set the
SO_NOFCS socket option. But the drivers do not check skb->no_fcs, so
the option is silently ignored.

With this change, send() fails with -EPROTONOSUPPORT when SO_NOFCS is
set on an AF_PACKET socket.
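
The resulting behaviour can be sketched as a small userspace model. It assumes the AF_PACKET send path rejects a socket with SO_NOFCS set when the device does not advertise the capability; the flag value and function name below are hypothetical stand-ins, not the kernel definitions.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the capability bit; the real IFF_SUPP_NOFCS
 * value is defined in the kernel's netdevice headers.
 */
#define MODEL_IFF_SUPP_NOFCS (1u << 0)

/* Returns 0 if the send may proceed, or a negative errno. */
static int model_packet_send_check(unsigned int dev_priv_flags,
				   bool sock_nofcs)
{
	if (sock_nofcs && !(dev_priv_flags & MODEL_IFF_SUPP_NOFCS))
		return -EPROTONOSUPPORT;
	return 0;
}
```

In this model, a device that no longer sets the flag makes model_packet_send_check(0, true) return -EPROTONOSUPPORT, while sockets that never set SO_NOFCS are unaffected.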

Signed-off-by: Rongguang Wei <weirongguang@kylinos.cn>
---
 drivers/net/ethernet/wangxun/ngbe/ngbe_main.c   | 1 -
 drivers/net/ethernet/wangxun/txgbe/txgbe_main.c | 1 -
 2 files changed, 2 deletions(-)

diff --git a/drivers/net/ethernet/wangxun/ngbe/ngbe_main.c b/drivers/net/ethernet/wangxun/ngbe/ngbe_main.c
index d8e3827a8b1f..1e4ebac8e495 100644
--- a/drivers/net/ethernet/wangxun/ngbe/ngbe_main.c
+++ b/drivers/net/ethernet/wangxun/ngbe/ngbe_main.c
@@ -713,7 +713,6 @@ static int ngbe_probe(struct pci_dev *pdev,
 	netdev->features |= NETIF_F_GRO;
 
 	netdev->priv_flags |= IFF_UNICAST_FLT;
-	netdev->priv_flags |= IFF_SUPP_NOFCS;
 	netdev->priv_flags |= IFF_LIVE_ADDR_CHANGE;
 
 	netdev->min_mtu = ETH_MIN_MTU;
diff --git a/drivers/net/ethernet/wangxun/txgbe/txgbe_main.c b/drivers/net/ethernet/wangxun/txgbe/txgbe_main.c
index 8b7c3753bb6a..db9262b00a66 100644
--- a/drivers/net/ethernet/wangxun/txgbe/txgbe_main.c
+++ b/drivers/net/ethernet/wangxun/txgbe/txgbe_main.c
@@ -801,7 +801,6 @@ static int txgbe_probe(struct pci_dev *pdev,
 	netdev->features |= NETIF_F_RX_UDP_TUNNEL_PORT;
 
 	netdev->priv_flags |= IFF_UNICAST_FLT;
-	netdev->priv_flags |= IFF_SUPP_NOFCS;
 	netdev->priv_flags |= IFF_LIVE_ADDR_CHANGE;
 
 	netdev->min_mtu = ETH_MIN_MTU;
-- 
2.25.1



* Re: [PATCH] rocker: Fix memory leak in ofdpa_port_fdb()
From: Andrew Lunn @ 2026-06-17  9:26 UTC (permalink / raw)
  To: Jacob Keller
  Cc: Ziran Zhang, Jiri Pirko, Andrew Lunn, David S . Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, netdev, linux-kernel
In-Reply-To: <1446e974-0df0-4956-b2af-7a9403da3c8d@intel.com>

On Tue, Jun 16, 2026 at 04:29:59PM -0700, Jacob Keller wrote:
> On 6/15/2026 6:32 PM, Ziran Zhang wrote:
> > In ofdpa_port_fdb(), the hash_del() only unlinks the node from
> > hash table, but does not free it.
> > 
> > Fix this by adding kfree(found) after the !found == removing check,
> > where the pointer value is no longer needed.
> > 
> > Found by Coccinelle kfree script.
> > 

Is rocker actually used any more? I'm not too sure of the history, but
was it not added as a way to develop the early switchdev code? There
was a qemu implementation of the 'hardware'?

Is it still useful? Should we actually just remove the driver?

	Andrew


* Re: [PATCH bpf-next v2 2/4] bpf: Add BPF_FIB_LOOKUP_VLAN flag to bpf_fib_lookup() helper
From: Toke Høiland-Jørgensen @ 2026-06-17  9:26 UTC (permalink / raw)
  To: Avinash Duduskar, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko
  Cc: Eduard Zingerman, Kumar Kartikeya Dwivedi, Martin KaFai Lau,
	Song Liu, Yonghong Song, Jiri Olsa, Emil Tsalapatis,
	John Fastabend, Stanislav Fomichev, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Simon Horman, David Ahern,
	Shuah Khan, Jesper Dangaard Brouer, Mykyta Yatsenko, Leon Hwang,
	KP Singh, Anton Protopopov, Amery Hung, Eyal Birger, Rong Tao,
	bpf, netdev, linux-kselftest, linux-kernel
In-Reply-To: <20260616223426.3568080-3-avinash.duduskar@gmail.com>

Avinash Duduskar <avinash.duduskar@gmail.com> writes:

> bpf_fib_lookup() returns the FIB-resolved egress ifindex straight
> from the fib result. When the egress is a VLAN device, the returned
> ifindex is the VLAN netdev's, which has no XDP xmit handler; XDP
> programs that want to forward the frame (e.g. xdp-forward) must
> instead target the underlying physical device and push the VLAN tag
> themselves. Today the program has no way to learn either the
> underlying ifindex or the VLAN tag without maintaining its own
> VLAN-to-ifindex map in userspace and refreshing it on netlink
> events.
>
> Add BPF_FIB_LOOKUP_VLAN. When the caller sets this flag and the fib
> result is a VLAN device whose immediate parent is a real (non-VLAN)
> device in the same network namespace, populate the existing output
> fields params->h_vlan_proto and params->h_vlan_TCI from the VLAN
> device and replace params->ifindex with the parent's ifindex.
> params->h_vlan_TCI carries the VID only, with PCP and DEI bits zero; a
> consumer wanting to set egress priority writes PCP itself.
> params->smac is the VLAN device's own address, which can differ from
> the parent's.
>
> Only the immediate parent is resolved, via vlan_dev_priv(dev)->real_dev
> and not vlan_dev_real_dev(), which walks to the bottom of a stack. For a
> stacked VLAN (QinQ) the immediate parent is itself a VLAN device; since
> one h_vlan_proto/h_vlan_TCI pair cannot describe two tags, ifindex is
> left unchanged and the vlan fields remain zero in that case. The swap
> is also skipped when the parent lives in another network namespace (a
> VLAN device can be moved while its parent stays), since its ifindex
> would be meaningless or match an unrelated device in the caller's
> namespace. The swap and the vlan fields are written only on success;
> other output fields keep their existing behaviour, so a frag-needed
> result still reports the route mtu in params->mtu_result. When the
> flag is not set, behaviour is unchanged: h_vlan_proto and h_vlan_TCI
> are zeroed and ifindex is left at the FIB result.
>
> The new block is compiled only under CONFIG_VLAN_8021Q since
> vlan_dev_priv() is not defined otherwise; without that config
> is_vlan_dev() is constant false and the flag is accepted but never
> acts.
>
> This lets an XDP redirect target the physical device and learn the
> tag to push in a single lookup, which xdp-forward's optional VLAN
> mode (xdp-project/xdp-tools#504) wants from the kernel side.
>
> The helper's input semantics are unchanged; the reverse direction
> (supplying a tag as lookup input) is added in the following patch.
>
> Suggested-by: Toke Høiland-Jørgensen <toke@redhat.com>
> Signed-off-by: Avinash Duduskar <avinash.duduskar@gmail.com>
> ---
>  include/uapi/linux/bpf.h       | 31 ++++++++++++++++++++++++++-
>  net/core/filter.c              | 39 ++++++++++++++++++++++++++++++----
>  tools/include/uapi/linux/bpf.h | 31 ++++++++++++++++++++++++++-
>  3 files changed, 95 insertions(+), 6 deletions(-)
>
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 11dd610fa5fa..f77aa9472bf1 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -3527,6 +3527,31 @@ union bpf_attr {
>   *			Use the mark present in *params*->mark for the fib lookup.
>   *			This option should not be used with BPF_FIB_LOOKUP_DIRECT,
>   *			as it only has meaning for full lookups.
> + *		**BPF_FIB_LOOKUP_VLAN**
> + *			If the fib lookup resolves to a VLAN device whose
> + *			parent is a real (non-VLAN) device, set
> + *			*params*->h_vlan_proto and *params*->h_vlan_TCI from
> + *			the VLAN device and replace *params*->ifindex with the
> + *			parent's ifindex. This lets XDP programs that target
> + *			the underlying physical device (VLAN devices have no
> + *			XDP xmit) discover both the real egress ifindex and
> + *			the VLAN tag to push in one call. *params*->h_vlan_TCI
> + *			carries the VID only, with PCP and DEI bits zero; a
> + *			consumer wanting to set egress priority writes PCP
> + *			itself. *params*->smac is the VLAN device's own
> + *			address, which can differ from the parent's. Only the
> + *			immediate parent is resolved: for a stacked VLAN (QinQ)
> + *			the parent is itself a VLAN device, and since one tag
> + *			pair cannot describe two tags, *params*->ifindex is
> + *			left unchanged and the vlan fields remain zero. The
> + *			same applies when the parent is in another network
> + *			namespace, where its ifindex would be meaningless.
> + *			The swap and the vlan fields are written only on
> + *			success; other output fields keep the helper's
> + *			existing behaviour, so a frag-needed result still
> + *			reports the route mtu in *params*->mtu_result, and on
> + *			the tc path without tot_len the mtu check runs after
> + *			the swap, against the parent device.

This comment is quite long, please trim. At the very least drop:

"This lets XDP programs that target the underlying physical device (VLAN
devices have no XDP xmit) discover both the real egress ifindex and the
VLAN tag to push in one call."

and shorten:

"Only the immediate parent is resolved: for a stacked VLAN
(QinQ) the parent is itself a VLAN device, and since one tag pair cannot
describe two tags, *params*->ifindex is left unchanged and the vlan
fields remain zero. The same applies when the parent is in another
network namespace, where its ifindex would be meaningless."

to:

"The lookup only resolves the immediate parent (QinQ is not supported),
and fails if the parent is in a different namespace."

>   *
>   *		*ctx* is either **struct xdp_md** for XDP programs or
>   *		**struct sk_buff** tc cls_act programs.
> @@ -7322,6 +7347,7 @@ enum {
>  	BPF_FIB_LOOKUP_TBID    = (1U << 3),
>  	BPF_FIB_LOOKUP_SRC     = (1U << 4),
>  	BPF_FIB_LOOKUP_MARK    = (1U << 5),
> +	BPF_FIB_LOOKUP_VLAN    = (1U << 6),
>  };
>  
>  enum {
> @@ -7388,7 +7414,10 @@ struct bpf_fib_lookup {
>  
>  	union {
>  		struct {
> -			/* output */
> +			/* output with BPF_FIB_LOOKUP_VLAN: set from the
> +			 * resolved egress VLAN device (see the flag); zeroed
> +			 * on other successful lookups.
> +			 */
>  			__be16	h_vlan_proto;
>  			__be16	h_vlan_TCI;
>  		};
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 6fa172cb1348..b37a12321fba 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -6119,10 +6119,40 @@ static const struct bpf_func_proto bpf_skb_get_xfrm_state_proto = {
>  #endif
>  
>  #if IS_ENABLED(CONFIG_INET) || IS_ENABLED(CONFIG_IPV6)
> -static int bpf_fib_set_fwd_params(struct bpf_fib_lookup *params, u32 mtu)
> +static int bpf_fib_set_fwd_params(struct net_device *dev,
> +				  struct bpf_fib_lookup *params,
> +				  u32 flags, u32 mtu)
>  {
>  	params->h_vlan_TCI = 0;
>  	params->h_vlan_proto = 0;
> +
> +#if IS_ENABLED(CONFIG_VLAN_8021Q)
> +	/* vlan_dev_priv() is only defined when 8021q is built in or as a
> +	 * module; under !CONFIG_VLAN_8021Q is_vlan_dev() is constant false
> +	 * so this would be dead, but it still has to compile.
> +	 */

Superfluous comment - please drop.

> +	if ((flags & BPF_FIB_LOOKUP_VLAN) && is_vlan_dev(dev)) {
> +		struct net_device *real_dev = vlan_dev_priv(dev)->real_dev;
> +
> +		/* Resolve the immediate parent only. For a stacked VLAN
> +		 * (QinQ) the parent is itself a VLAN device, and a single
> +		 * h_vlan_proto/h_vlan_TCI pair cannot describe both tags;
> +		 * leave ifindex and the vlan fields untouched in that case
> +		 * rather than report the lower device with only one tag.
> +		 * The same applies when the parent lives in another netns
> +		 * (a VLAN device can be moved while its parent stays):
> +		 * its ifindex would be meaningless, or match an unrelated
> +		 * device, in the caller's namespace.
> +		 */

And this one - it's redundant with the flag description (and commit message).

-Toke
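
The flag semantics described in the quoted commit message can be sketched as a userspace model. Everything here is hypothetical (the struct layouts, the netns id, host-order fields); it only mirrors the stated rules: swap to the immediate parent when it is a non-VLAN device in the same namespace, otherwise leave ifindex alone and keep the vlan fields zero.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct model_dev {
	int ifindex;
	int netns;			/* hypothetical namespace id */
	const struct model_dev *parent;	/* non-NULL only for VLAN devices */
	uint16_t vlan_proto;		/* e.g. 0x8100, host order in this model */
	uint16_t vlan_id;
};

struct model_result {
	int ifindex;
	uint16_t h_vlan_proto;
	uint16_t h_vlan_TCI;
};

static struct model_result model_fib_vlan(const struct model_dev *egress,
					  bool flag_vlan)
{
	struct model_result res = { .ifindex = egress->ifindex };
	const struct model_dev *parent = egress->parent;

	if (!flag_vlan || !parent)
		return res;		/* flag unset or not a VLAN device */
	if (parent->parent)
		return res;		/* stacked VLAN (QinQ): no swap */
	if (parent->netns != egress->netns)
		return res;		/* parent lives in another namespace */

	res.ifindex = parent->ifindex;
	res.h_vlan_proto = egress->vlan_proto;
	res.h_vlan_TCI = egress->vlan_id & 0x0fff;	/* VID only, PCP/DEI zero */
	return res;
}
```

A VLAN device on top of a physical device reports the physical ifindex plus the tag; a VLAN on top of another VLAN comes back unchanged.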



* [PATCH v2] net: mvneta: free/request IRQ across suspend/resume
From: Yun Zhou @ 2026-06-17  9:20 UTC (permalink / raw)
  To: marcin.s.wojtas, andrew+netdev, davem, edumazet, kuba, pabeni,
	bigeasy, clrkwllms, rostedt
  Cc: netdev, linux-kernel, linux-rt-devel, yun.zhou

On PREEMPT_RT, the mvneta IRQ handler is force-threaded. Under high
network traffic, the IRQ can enter suspend with desc->depth == 1
(masked by the oneshot mechanism between handler invocations).

During suspend, the kernel increments depth to 2 and masks the
interrupt at the MPIC level (clearing the SRC_CTL CPU routing bit,
due to IRQCHIP_MASK_ON_SUSPEND). On resume, depth is decremented
back to 1, but since it does not reach 0, the unmask is never
called. The MPIC CPU routing remains cleared, permanently disabling
interrupt delivery.

Fix by freeing the IRQ in suspend and re-requesting it in resume.
This ensures a clean IRQ state (depth=0, proper hardware routing)
on every resume cycle, regardless of the pre-suspend depth. This
follows the approach used by other drivers (e.g. igb).
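
The depth counting described above can be illustrated with a tiny userspace model. It is a simplification under stated assumptions: it only tracks a nested disable counter and a single "masked" bit, and none of the names correspond to genirq or MPIC code.

```c
#include <assert.h>
#include <stdbool.h>

struct model_irq {
	unsigned int depth;	/* nested disable count; 0 means enabled */
	bool hw_masked;		/* state at the interrupt controller */
};

static void model_disable(struct model_irq *irq)
{
	irq->depth++;
	irq->hw_masked = true;
}

static void model_enable(struct model_irq *irq)
{
	assert(irq->depth > 0);
	/* The hardware is only unmasked when the count drops to zero. */
	if (--irq->depth == 0)
		irq->hw_masked = false;
}

/* Returns true if the interrupt is deliverable after one
 * suspend/resume cycle that started at the given depth.
 */
static bool model_suspend_resume(unsigned int depth_before)
{
	struct model_irq irq = {
		.depth = depth_before,
		.hw_masked = depth_before > 0,
	};

	model_disable(&irq);	/* suspend: depth + 1, mask */
	model_enable(&irq);	/* resume: depth - 1, unmask only at 0 */
	return !irq.hw_masked;
}
```

Starting from depth 0 the interrupt comes back unmasked; starting from depth 1 the count goes 1, 2, 1 and the unmask step is never reached, which is the case the patch avoids by freeing and re-requesting the IRQ.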

Fixes: 9768b45ceb0b ("net: mvneta: support suspend and resume")
Signed-off-by: Yun Zhou <yun.zhou@windriver.com>
---
v2:
  - Move request_irq before cpuhp registration in resume (matching
    mvneta_open ordering) so that failure does not leave cpuhp
    callbacks registered on a non-functional device.
  - On request_irq failure, call netif_device_detach() to prevent
    further traffic on the dead interface.

 drivers/net/ethernet/marvell/mvneta.c | 29 +++++++++++++++++++++++++++
 1 file changed, 29 insertions(+)

diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c
index b4a845f04c05..02ea867d07a3 100644
--- a/drivers/net/ethernet/marvell/mvneta.c
+++ b/drivers/net/ethernet/marvell/mvneta.c
@@ -5826,6 +5826,20 @@ static int mvneta_suspend(struct device *device)
 	mvneta_stop_dev(pp);
 	rtnl_unlock();
 
+	/* Release IRQ to avoid stale MPIC mask state on resume.
+	 * On PREEMPT_RT, forced-threaded oneshot IRQs may leave the
+	 * interrupt masked (depth>0) at suspend time. This prevents
+	 * resume_device_irqs() from restoring the MPIC CPU routing,
+	 * permanently disabling the interrupt. Re-requesting the IRQ
+	 * on resume guarantees a clean state.
+	 */
+	if (pp->neta_armada3700)
+		free_irq(dev->irq, pp);
+	else {
+		on_each_cpu(mvneta_percpu_disable, pp, true);
+		free_percpu_irq(dev->irq, pp->ports);
+	}
+
 	for (queue = 0; queue < rxq_number; queue++) {
 		struct mvneta_rx_queue *rxq = &pp->rxqs[queue];
 
@@ -5892,6 +5906,21 @@ static int mvneta_resume(struct device *device)
 		mvneta_txq_hw_init(pp, txq);
 	}
 
+	/* Re-request IRQ (see comment in mvneta_suspend) */
+	if (pp->neta_armada3700) {
+		err = request_irq(dev->irq, mvneta_isr, 0, dev->name, pp);
+	} else {
+		err = request_percpu_irq(dev->irq, mvneta_percpu_isr,
+					dev->name, pp->ports);
+		if (!err)
+			on_each_cpu(mvneta_percpu_enable, pp, true);
+	}
+	if (err) {
+		netdev_err(dev, "cannot request irq %d\n", dev->irq);
+		netif_device_detach(dev);
+		return err;
+	}
+
 	if (!pp->neta_armada3700) {
 		spin_lock(&pp->lock);
 		pp->is_stopped = false;
-- 
2.43.0



* Re: [RESEND PATCH v1] net: dsa: motorcomm: add yt92xx dsa driver
From: Andrew Lunn @ 2026-06-17  9:19 UTC (permalink / raw)
  To: Kyle Switch
  Cc: mmyangfl, olteanv, davem, edumazet, kuba, pabeni, horms, netdev,
	linux-kernel, ming.xu, xiaolin.xu, jianmin.wang, de.ge
In-Reply-To: <adc1dc4b-1d9d-42e6-8983-f0a7c650a6cd@motor-comm.com>

> Thank you for the reminder; this patch will be submitted as a series
> of small patches.
> This patch mainly contains three parts:
> 1. The underlying function interface for the different Motorcomm switch
> series, in drivers/net/dsa/motorcomm/switch/.
> 2. Optimization of the existing yt921x dsa driver using the common
> switch function APIs.
> 3. A new yt922x dsa driver.
> Would you accept breaking this up into many small patches along these lines?
> If not, do you have any suggestions? Thank you.

We need to see the patches, but keep them small, and add functionality
little by little. I had a very quick look at your first version, and i
see a lot you will need to change. So just post small bits of code,
learn from the review comments, and fixup all the rest of the code
before posting it.

One thing i need to point out. Linux has a long tradition of not
replacing existing code with a new implementation. You take the
existing code and step by step improve it. So i do not want to see a
patch which just removes the existing driver. I want to see patches
which add new functionality to the existing driver, doing refactors as
needed.

Please also remember:

MOTORCOMM YT921X ETHERNET SWITCH DRIVER
M:      David Yang <mmyangfl@gmail.com>
L:      netdev@vger.kernel.org
S:      Maintained
F:      Documentation/devicetree/bindings/net/dsa/motorcomm,yt921x.yaml
F:      drivers/net/dsa/yt921x.*
F:      net/dsa/tag_yt921x.c

David is the Maintainer of this driver. Listen to what he says. In
time, when you have shown you can also do the job of being a
Maintainer, your name will be added.

      Andrew


* Re: [PATCH net] net/mlx5e: macsec: fix use-after-free of metadata_dst on RX SC delete
From: Simon Horman @ 2026-06-17  9:16 UTC (permalink / raw)
  To: doruk
  Cc: Simon Horman, saeedm, leon, tariqt, mbloch, andrew+netdev, davem,
	edumazet, kuba, pabeni, borisp, sd, raeds, ehakim, netdev,
	linux-rdma, linux-kernel, stable
In-Reply-To: <20260615140534.52691-1-doruk@0sec.ai>

This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
Full review at: https://sashiko.dev
---
net/mlx5e: macsec: fix use-after-free of metadata_dst on RX SC delete

This patch replaces metadata_dst_free() with dst_release() during MACsec
RX SC deletion. It relies on the dst subsystem's refcounting to defer freeing
the metadata_dst until the last reference is dropped, aiming to fix a
use-after-free when an skb still references it.

> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/macsec.c b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/macsec.c
> index 71b3a059c9647..2a4e7ed76d315 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/macsec.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/macsec.c
> @@ -829,7 +829,7 @@ static void macsec_del_rxsc_ctx(struct mlx5e_macsec *macsec, struct mlx5e_macsec
>  	 */
>  	list_del_rcu(&rx_sc->rx_sc_list_element);
>  	xa_erase(&macsec->sc_xarray, rx_sc->sc_xarray_element->fs_id);
> -	metadata_dst_free(rx_sc->md_dst);
> +	dst_release(&rx_sc->md_dst->dst);

[Severity: High]
Does this change introduce a new use-after-free in the RX datapath?

When dst_release() drops the refcount to 0, concurrent readers in
mlx5e_macsec_offload_handle_rx_skb() can still find the object via RCU
and call dst_hold(). Because dst_hold() uses rcuref_get(), it fails on a
0-refcount object without incrementing the count.

However, the datapath still attaches the 0-refcount dst to the skb. When the
skb is later freed, skb_dst_drop() will call dst_release() on already-freed
memory.

Should the datapath be updated to use dst_hold_safe() instead?
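
The failure mode can be sketched with a plain atomic counter. This is a hypothetical userspace model, not the kernel's rcuref implementation: it only shows a "take a reference unless the count is already zero" helper and a datapath step that refuses to attach the object when no reference was taken.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct model_dst {
	atomic_int refcnt;	/* 0 means the last reference is gone */
};

/* Take a reference only if the object is still live. */
static bool model_hold_safe(struct model_dst *dst)
{
	int old = atomic_load(&dst->refcnt);

	while (old > 0) {
		if (atomic_compare_exchange_weak(&dst->refcnt, &old, old + 1))
			return true;
		/* old was reloaded by the failed compare-exchange; retry */
	}
	return false;
}

/* Datapath step: attach dst only when a reference was really taken. */
static struct model_dst *model_attach(struct model_dst *dst)
{
	return model_hold_safe(dst) ? dst : NULL;
}
```

With this shape, a reader that races with the final release gets NULL back instead of a pointer whose later release would touch freed memory.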

>  	kfree(rx_sc->sc_xarray_element);

[Severity: High]
This is a pre-existing issue, but does this synchronous kfree() cause a
use-after-free for concurrent datapath readers?

The element was just removed from the RCU-protected xarray via xa_erase(),
but the RX datapath can still be accessing it via xa_load() under
rcu_read_lock() in mlx5e_macsec_offload_handle_rx_skb().

Should this use an RCU-deferred free mechanism instead?

[Severity: Critical]
This is a pre-existing issue, but does mlx5e_macsec_offload_handle_rx_skb()
blindly dereference the result of xa_load()?

drivers/net/ethernet/mellanox/mlx5/core/en_accel/macsec.c:mlx5e_macsec_offload_handle_rx_skb() {
    ...
    sc_xarray_element = xa_load(&macsec->sc_xarray, fs_id);
    rx_sc = sc_xarray_element->rx_sc;
    ...
}

If xa_load() returns NULL (e.g., if the entry was concurrently deleted by
xa_erase() or the packet has an unrecognized fs_id), will this cause a
deterministic kernel panic?

Is a NULL check needed before accessing the returned element?

>  	kfree_rcu_mightsleep(rx_sc);
>  }


* [PATCH net v3 2/2] net: macb: drop in-flight Tx SKBs on close
From: Théo Lebrun @ 2026-06-17  9:17 UTC (permalink / raw)
  To: Nicolas Ferre, Claudiu Beznea, Andrew Lunn, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Haavard Skinnemoen,
	Jeff Garzik, Conor Dooley
  Cc: Paolo Valerio, Nicolai Buchwitz, netdev, linux-kernel,
	Vladimir Kondratiev, Gregory CLEMENT, Benoît Monin,
	Tawfik Bayouk, Thomas Petazzoni, Maxime Chevallier,
	Théo Lebrun, stable
In-Reply-To: <20260617-macb-drop-tx-v3-0-d4c7e57d890b@bootlin.com>

The MACB driver has since forever leaked the outgoing SKBs that
have not yet been marked as completed. They live in queue->tx_skb
which gets freed without remorse nor checking.

macb_free_consistent() gets called in a few codepaths, but only
close will trigger the added expressions. In macb_open() and
macb_alloc_consistent() failure cases, tx_skb just got allocated
and is empty.

Use the new macb_tx_unmap() prototype to report our error as
SKB_DROP_REASON_NOT_SPECIFIED rather than SKB_CONSUMED which makes it
sound like no error occurred. Equivalent to dev_kfree_skb_any().
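
The cleanup walk can be sketched as a userspace model. It assumes free-running head and tail indices that are wrapped on access, and that only the last buffer of a frame carries the skb; the ring size, struct names, and helpers are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_RING_SIZE 8u	/* hypothetical; must be a power of two */

struct model_tx_slot {
	bool has_skb;	/* only the last buffer of a frame carries the skb */
	bool mapped;	/* DMA mapping still held */
};

struct model_queue {
	struct model_tx_slot slots[MODEL_RING_SIZE];
	unsigned int tx_head;	/* free-running producer index */
	unsigned int tx_tail;	/* free-running consumer index */
};

static struct model_tx_slot *model_slot(struct model_queue *q,
					unsigned int idx)
{
	return &q->slots[idx & (MODEL_RING_SIZE - 1)];
}

/* Release every slot between tail and head, reset the cursors, and
 * return the number of skbs that were dropped.
 */
static unsigned int model_drop_in_flight(struct model_queue *q)
{
	unsigned int dropped = 0, tail;

	for (tail = q->tx_tail; tail != q->tx_head; tail++) {
		struct model_tx_slot *slot = model_slot(q, tail);

		if (slot->has_skb)
			dropped++;	/* count once per skb, not per buffer */
		slot->mapped = false;
		slot->has_skb = false;
	}

	q->tx_head = 0;
	q->tx_tail = 0;
	return dropped;
}
```

A two-buffer frame therefore counts as one drop, since only its last slot holds an skb, while both of its mappings are released.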

Fixes: 89e5785fc8a6 ("[PATCH] Atmel MACB ethernet driver")
Cc: stable@vger.kernel.org
Signed-off-by: Théo Lebrun <theo.lebrun@bootlin.com>
---
 drivers/net/ethernet/cadence/macb_main.c | 22 ++++++++++++++++++++--
 1 file changed, 20 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/cadence/macb_main.c b/drivers/net/ethernet/cadence/macb_main.c
index 9caae1ef52b1..5a2500bd59a6 100644
--- a/drivers/net/ethernet/cadence/macb_main.c
+++ b/drivers/net/ethernet/cadence/macb_main.c
@@ -2678,8 +2678,26 @@ static void macb_free_consistent(struct macb *bp)
 	dma_free_coherent(dev, size, bp->queues[0].rx_ring, bp->queues[0].rx_ring_dma);
 
 	for (q = 0, queue = bp->queues; q < bp->num_queues; ++q, ++queue) {
-		kfree(queue->tx_skb);
-		queue->tx_skb = NULL;
+		if (queue->tx_skb) {
+			unsigned int dropped = 0, tail;
+
+			for (tail = queue->tx_tail; tail != queue->tx_head;
+			     tail++) {
+				if (macb_tx_skb(queue, tail)->skb)
+					dropped++;
+				macb_tx_unmap(bp, macb_tx_skb(queue, tail), 0,
+					      SKB_DROP_REASON_NOT_SPECIFIED);
+			}
+
+			queue->stats.tx_dropped += dropped;
+			bp->dev->stats.tx_dropped += dropped;
+
+			kfree(queue->tx_skb);
+			queue->tx_skb = NULL;
+		}
+
+		queue->tx_head = 0;
+		queue->tx_tail = 0;
 		queue->tx_ring = NULL;
 		queue->rx_ring = NULL;
 	}

-- 
2.54.0



* [PATCH net v3 1/2] net: macb: give reasons for Tx SKB kfree
From: Théo Lebrun @ 2026-06-17  9:17 UTC (permalink / raw)
  To: Nicolas Ferre, Claudiu Beznea, Andrew Lunn, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Haavard Skinnemoen,
	Jeff Garzik, Conor Dooley
  Cc: Paolo Valerio, Nicolai Buchwitz, netdev, linux-kernel,
	Vladimir Kondratiev, Gregory CLEMENT, Benoît Monin,
	Tawfik Bayouk, Thomas Petazzoni, Maxime Chevallier,
	Théo Lebrun, stable
In-Reply-To: <20260617-macb-drop-tx-v3-0-d4c7e57d890b@bootlin.com>

Using dev_consume_skb_any() marks the drop reason as SKB_CONSUMED every
time we free a Tx SKB. Instead, use SKB_DROP_REASON_NOT_SPECIFIED
when the packet has been dropped without being sent.

It is not precise but at least differs from SKB_CONSUMED and is used by
many drivers for their error codepaths through dev_kfree_skb_{any,irq}().

Pass a reason around rather than call dev_consume_skb_any() or
dev_kfree_skb_any() because macb_tx_unmap() is called for cleanup in
all cases.

macb_tx_error_task() becomes more complex because some of the SKBs it
encounters have been sent successfully.

Fixes: 89e5785fc8a6 ("[PATCH] Atmel MACB ethernet driver")
Cc: stable@vger.kernel.org
Signed-off-by: Théo Lebrun <theo.lebrun@bootlin.com>
---
 drivers/net/ethernet/cadence/macb_main.c | 24 +++++++++++++++++-------
 1 file changed, 17 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/cadence/macb_main.c b/drivers/net/ethernet/cadence/macb_main.c
index a12aa21244e8..9caae1ef52b1 100644
--- a/drivers/net/ethernet/cadence/macb_main.c
+++ b/drivers/net/ethernet/cadence/macb_main.c
@@ -1201,7 +1201,8 @@ static int macb_halt_tx(struct macb *bp)
 					bp, TSR);
 }
 
-static void macb_tx_unmap(struct macb *bp, struct macb_tx_skb *tx_skb, int budget)
+static void macb_tx_unmap(struct macb *bp, struct macb_tx_skb *tx_skb,
+			  int budget, enum skb_drop_reason reason)
 {
 	if (tx_skb->mapping) {
 		if (tx_skb->mapped_as_page)
@@ -1214,7 +1215,7 @@ static void macb_tx_unmap(struct macb *bp, struct macb_tx_skb *tx_skb, int budge
 	}
 
 	if (tx_skb->skb) {
-		dev_consume_skb_any(tx_skb->skb);
+		dev_kfree_skb_any_reason(tx_skb->skb, reason);
 		tx_skb->skb = NULL;
 	}
 }
@@ -1297,7 +1298,8 @@ static void macb_tx_error_task(struct work_struct *work)
 	 * Free transmit buffers in upper layer.
 	 */
 	for (tail = queue->tx_tail; tail != queue->tx_head; tail++) {
-		u32	ctrl;
+		enum skb_drop_reason reason = SKB_DROP_REASON_NOT_SPECIFIED;
+		u32 ctrl;
 
 		desc = macb_tx_desc(queue, tail);
 		ctrl = desc->ctrl;
@@ -1307,7 +1309,10 @@ static void macb_tx_error_task(struct work_struct *work)
 		if (ctrl & MACB_BIT(TX_USED)) {
 			/* skb is set for the last buffer of the frame */
 			while (!skb) {
-				macb_tx_unmap(bp, tx_skb, 0);
+				/* The reason parameter is unused because it
+				 * only matters when skb is valid.
+				 */
+				macb_tx_unmap(bp, tx_skb, 0, SKB_CONSUMED);
 				tail++;
 				tx_skb = macb_tx_skb(queue, tail);
 				skb = tx_skb->skb;
@@ -1326,6 +1331,7 @@ static void macb_tx_error_task(struct work_struct *work)
 				bp->dev->stats.tx_bytes += skb->len;
 				queue->stats.tx_bytes += skb->len;
 				bytes += skb->len;
+				reason = SKB_CONSUMED;
 			}
 		} else {
 			/* "Buffers exhausted mid-frame" errors may only happen
@@ -1339,7 +1345,7 @@ static void macb_tx_error_task(struct work_struct *work)
 			desc->ctrl = ctrl | MACB_BIT(TX_USED);
 		}
 
-		macb_tx_unmap(bp, tx_skb, 0);
+		macb_tx_unmap(bp, tx_skb, 0, reason);
 	}
 
 	netdev_tx_completed_queue(netdev_get_tx_queue(bp->dev, queue_index),
@@ -1458,7 +1464,7 @@ static int macb_tx_complete(struct macb_queue *queue, int budget)
 			}
 
 			/* Now we can safely release resources */
-			macb_tx_unmap(bp, tx_skb, budget);
+			macb_tx_unmap(bp, tx_skb, budget, SKB_CONSUMED);
 
 			/* skb is set only for the last buffer of the frame.
 			 * WARNING: at this point skb has been freed by
@@ -2357,7 +2363,11 @@ static unsigned int macb_tx_map(struct macb *bp,
 	for (i = queue->tx_head; i != tx_head; i++) {
 		tx_skb = macb_tx_skb(queue, i);
 
-		macb_tx_unmap(bp, tx_skb, 0);
+		/* The reason parameter is unused, tx_skb->skb has not yet
+		 * been assigned. Parent caller is responsible for freeing
+		 * the SKB.
+		 */
+		macb_tx_unmap(bp, tx_skb, 0, SKB_DROP_REASON_NOT_SPECIFIED);
 	}
 
 	return -ENOMEM;

-- 
2.54.0


^ permalink raw reply related

* [PATCH net v3 0/2] Drop in-flight Tx SKBs on MACB close
From: Théo Lebrun @ 2026-06-17  9:17 UTC (permalink / raw)
  To: Nicolas Ferre, Claudiu Beznea, Andrew Lunn, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Haavard Skinnemoen,
	Jeff Garzik, Conor Dooley
  Cc: Paolo Valerio, Nicolai Buchwitz, netdev, linux-kernel,
	Vladimir Kondratiev, Gregory CLEMENT, Benoît Monin,
	Tawfik Bayouk, Thomas Petazzoni, Maxime Chevallier,
	Théo Lebrun, stable

The first patch adds a drop reason parameter to macb_tx_unmap().
That lets us distinguish consumed packets from dropped ones.

The second patch is the main one: it drops unsent packets on close.
Until now the MACB driver never freed those SKBs (nor their DMA mappings).
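
For readers skimming the series, here is a rough userspace model of the
close-time behaviour described above. It is a sketch only, not the driver
code: the types, RING_SIZE, the reason enum and the freed[] counters are
hypothetical stand-ins, and the free-running head/tail cursors masked into
the ring are an assumption of the model.

```c
#include <stdlib.h>

#define RING_SIZE 8	/* hypothetical; must be a power of two */

/* Stand-ins for the kernel's consumed vs. dropped distinction. */
enum drop_reason { REASON_CONSUMED, REASON_NOT_SPECIFIED, REASON_MAX };

struct tx_slot {
	void *skb;	/* set only on the last buffer of a frame */
	int mapped;	/* stand-in for a live DMA mapping */
};

struct tx_queue {
	struct tx_slot ring[RING_SIZE];
	unsigned int head, tail;	/* free-running cursors */
	unsigned long tx_dropped;
	unsigned long freed[REASON_MAX];	/* frees accounted per reason */
};

/* Release one slot: undo the mapping, then free the SKB if the slot owns one. */
static void slot_unmap(struct tx_queue *q, struct tx_slot *slot,
		       enum drop_reason reason)
{
	slot->mapped = 0;
	if (slot->skb) {
		q->freed[reason]++;
		free(slot->skb);
		slot->skb = NULL;
	}
}

/* On close, nothing between tail and head was completed, so drop it all. */
static void queue_drop_in_flight(struct tx_queue *q)
{
	unsigned int i;

	for (i = q->tail; i != q->head; i++) {
		struct tx_slot *slot = &q->ring[i & (RING_SIZE - 1)];

		if (slot->skb)
			q->tx_dropped++;	/* once per SKB, not per buffer */
		slot_unmap(q, slot, REASON_NOT_SPECIFIED);
	}

	/* Reset the cursors so no stale position survives a reopen. */
	q->head = 0;
	q->tail = 0;
}
```

The point of passing the reason down is that the same unmap helper can
serve both the completion path (consumed) and the close path (dropped).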

---
Changes in v3:
- Drop stats fixing. A proper fix deserves its own net-next refactoring
  series to migrate to netdev_stat_ops (ynltool uAPI), which will come
  later. We keep the tx_dropped++ increments because they are safe: every
  other context is disabled when macb_free_consistent() is called.
- Rebased to latest net/main (406e8a651a7b), nothing to report.
- Link to v2: https://patch.msgid.link/20260428-macb-drop-tx-v2-0-647f5199d8df@bootlin.com

Changes in v2:
- Increment tx_dropped stat once per SKB, not once per frame.
- Reset tx_head & tx_tail to avoid keeping stalled cursors.
- Fix SKB dropped reasons throughout by adding the reason as parameter
  to macb_tx_unmap(). This is a new patch. Then the drop-all-on-close
  fix can use this ability to report we are not consuming SKBs.
- Add increment to stats->tx_dropped on DMA mapping failure and
  tx_error_task. Done as separate patches (3 and 4).
- Rebase upon net/main @ 46f74a3f7d57, nothing to report.
- Link to v1: https://patch.msgid.link/20260424-macb-drop-tx-v1-1-b3ecb787d84d@bootlin.com

To: Nicolas Ferre <nicolas.ferre@microchip.com>
To: Claudiu Beznea <claudiu.beznea@tuxon.dev>
To: Andrew Lunn <andrew+netdev@lunn.ch>
To: "David S. Miller" <davem@davemloft.net>
To: Eric Dumazet <edumazet@google.com>
To: Jakub Kicinski <kuba@kernel.org>
To: Paolo Abeni <pabeni@redhat.com>
To: Haavard Skinnemoen <hskinnemoen@atmel.com>
To: Jeff Garzik <jeff@garzik.org>
To: Conor Dooley <conor.dooley@microchip.com>
Cc: Paolo Valerio <pvalerio@redhat.com>
Cc: Nicolai Buchwitz <nb@tipi-net.de>
Cc: netdev@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: Vladimir Kondratiev <vladimir.kondratiev@mobileye.com>
Cc: Gregory CLEMENT <gregory.clement@bootlin.com>
Cc: Benoît Monin <benoit.monin@bootlin.com>
Cc: Tawfik Bayouk <tawfik.bayouk@mobileye.com>
Cc: Thomas Petazzoni <thomas.petazzoni@bootlin.com>
Cc: Maxime Chevallier <maxime.chevallier@bootlin.com>
Signed-off-by: Théo Lebrun <theo.lebrun@bootlin.com>

---
Théo Lebrun (2):
      net: macb: give reasons for Tx SKB kfree
      net: macb: drop in-flight Tx SKBs on close

 drivers/net/ethernet/cadence/macb_main.c | 46 +++++++++++++++++++++++++-------
 1 file changed, 37 insertions(+), 9 deletions(-)
---
base-commit: 712927eaa34199bb62cf370af591c0550ba977de
change-id: 20260423-macb-drop-tx-f9ce72720d05

Best regards,
--  
Théo Lebrun <theo.lebrun@bootlin.com>

^ permalink raw reply

* Re: [PATCH net] net: airoha: Fix TX scheduler queue mask loop upper bound
From: Lorenzo Bianconi @ 2026-06-17  9:17 UTC (permalink / raw)
  To: Wayen Yan; +Cc: netdev, nbd, linux-arm-kernel, linux-mediatek
In-Reply-To: <178168650178.2224380.3950331731013129336@gmail.com>

> On Tue, Jun 17, 2026, Lorenzo Bianconi wrote:
> > Even if the current codebase supports just AIROHA_NUM_QOS_CHANNEL (4), the hw
> > exposes 32 hw QoS channels (AIROHA_NUM_TX_RING). Here we are just clearing the
> > configuration, so I guess the current implementation is correct.
> 
> Hi Lorenzo,
> 
> You are right that there is no functional impact, and I agree this
> should not go to net. Let me explain the register layout I was worried
> about, and you can decide whether it is worth a net-next cleanup or
> should just be dropped.
> 
> The two macros are:
> 
> 	REG_QUEUE_CLOSE_CFG(_n)             = 0x00a0 + ((_n) & 0xfc)
> 	TXQ_DISABLE_CHAN_QUEUE_MASK(_n, _m) = BIT((_m) + (((_n) & 0x3) << 3))
> 
> REG_QUEUE_CLOSE_CFG() masks the channel with 0xfc, and the bit macro
> folds the channel with & 0x3 (mod 4) shifted by 3. So one 32-bit
> register holds 4 channels x 8 queues, 8 queue bits per channel:
> 
> 	channel 0 -> reg 0x00a0, bits  0..7
> 	channel 1 -> reg 0x00a0, bits  8..15
> 	channel 2 -> reg 0x00a0, bits 16..23
> 	channel 3 -> reg 0x00a0, bits 24..31
> 	channel 4 -> reg 0x00a4, bits  0..7
> 	...
> 
> In airoha_qdma_set_chan_tx_sched() the loop variable 'i' is passed as
> the *queue* argument _m, not as a channel:
> 
> 	for (i = 0; i < AIROHA_NUM_TX_RING; i++)   // i = 0..31
> 		airoha_qdma_clear(qdma, REG_QUEUE_CLOSE_CFG(channel),
> 				  TXQ_DISABLE_CHAN_QUEUE_MASK(channel, i));
> 
> Since each channel only has AIROHA_NUM_QOS_QUEUES (8) queues, the correct
> logic is to clear the 8 queue bits belonging to 'channel'. With i running
> up to 31 the BIT() shift instead walks past those 8 bits and into the bit
> ranges of the other channels folded into the same register. For channel 0
> the accumulated mask becomes 0xffffffff, i.e. it touches channels 1..3 as
> well.
> 
> This is harmless today only because REG_QUEUE_CLOSE_CFG is written
> exclusively here, via airoha_qdma_clear() (RMW clear), and the register
> resets to 0 and is never set anywhere -- so clearing extra bits is a
> no-op. Functionally the current code is fine, as you say.
> 
> The point is just the loop-bound semantics: 'i' is a per-channel queue
> index, so the bound should be AIROHA_NUM_QOS_QUEUES (8), not
> AIROHA_NUM_TX_RING (32). The two happen to be related (32 == 4 channels *
> 8 queues) but mean different things.
> 
> Since there is no functional change, feel free to drop this if you would
> rather not carry a cosmetic patch. If you think the clarity is worth it I
> can resend against net-next without the Fixes tag.
> 
> Thanks,
> Wayen
> 

Sorry, you are right: I misread the code and your patch is correct. Since, as
you pointed out, REG_QUEUE_CLOSE_CFG() is never actually set at the moment and
the default register value is 0, I would repost this patch for net-next as soon
as it opens (this will avoid merge conflicts).
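
For the archive, the mask arithmetic quoted above can be checked with a
small userspace sketch. The two helpers mirror the macros as quoted in the
thread; the function names are made up for illustration, and the shift is
done in 64 bits and truncated to the 32-bit register width, which is an
assumption of this model rather than something taken from the driver.

```c
#include <stdint.h>

/* Constants as quoted in the thread. */
#define AIROHA_NUM_QOS_QUEUES	8
#define AIROHA_NUM_TX_RING	32

/* Register offset: four consecutive channels share one 32-bit register. */
static uint32_t queue_close_cfg_reg(uint32_t channel)
{
	return 0x00a0 + (channel & 0xfc);
}

/* Model of TXQ_DISABLE_CHAN_QUEUE_MASK(channel, queue), truncated to 32 bits
 * so that shifts past bit 31 stay well defined in this sketch.
 */
static uint32_t txq_disable_mask(uint32_t channel, uint32_t queue)
{
	return (uint32_t)(UINT64_C(1) << (queue + ((channel & 0x3) << 3)));
}

/* OR together every bit the loop clears for 'channel' with a given bound. */
static uint32_t accumulated_mask(uint32_t channel, uint32_t bound)
{
	uint32_t mask = 0;
	uint32_t i;

	for (i = 0; i < bound; i++)
		mask |= txq_disable_mask(channel, i);

	return mask;
}
```

With a bound of AIROHA_NUM_TX_RING the accumulated mask for channel 0 is
0xffffffff, while a bound of AIROHA_NUM_QOS_QUEUES gives 0x000000ff, i.e.
only the eight queue bits that belong to that channel.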

Regards,
Lorenzo

^ permalink raw reply

* Re: [RESEND PATCH v1] net: dsa: motorcomm: add yt92xx dsa driver
From: Andrew Lunn @ 2026-06-17  9:07 UTC (permalink / raw)
  To: Kyle Switch
  Cc: David Yang, olteanv, davem, edumazet, kuba, pabeni, horms, netdev,
	linux-kernel, ming.xu, xiaolin.xu, jianmin.wang, de.ge
In-Reply-To: <88f726d5-1617-4d2e-8fbb-d3da9478b386@motor-comm.com>

> >> +#define CMM_PARAM_CHK(expr, err_code)    \
> >> +       do {                             \
> >> +               if ((u32)(expr)) {       \
> >> +                       return err_code; \
> >> +               }                        \
> >> +       } while (0)
> >> +
> >> +#define CMM_ERR_CHK(op, ret)           \
> >> +       do {                           \
> >> +               ret = (op);            \
> >> +               if (ret != CMM_ERR_OK) \
> >> +                       return ret;    \
> >> +       } while (0)
> > 
> > Do not use macros like this.
> 
> Ans: Acknowledged, i will consider how to optimize them in the future.

It is not about optimization. Hiding a return statement in a macro is
very bad style. It will lead to locking bugs, and resource leaks,
because nobody knows the return is there.
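
To make the failure mode concrete, here is a minimal userspace sketch. It
is not taken from the driver under review: fake_lock(), fake_unlock(),
do_op() and ERR_CHK() are hypothetical stand-ins, with ERR_CHK() shaped
like the quoted CMM_ERR_CHK().

```c
static int lock_depth;	/* stand-in for a held lock */

static void fake_lock(void)   { lock_depth++; }
static void fake_unlock(void) { lock_depth--; }

/* Hypothetical operation that fails on request. */
static int do_op(int fail)
{
	return fail ? -1 : 0;
}

/* Same shape as the quoted macro: the return is invisible at the call site. */
#define ERR_CHK(op, ret)		\
	do {				\
		ret = (op);		\
		if (ret != 0)		\
			return ret;	\
	} while (0)

/* Looks balanced, but on failure the macro returns before fake_unlock(). */
static int update_with_macro(int fail)
{
	int ret;

	fake_lock();
	ERR_CHK(do_op(fail), ret);
	fake_unlock();

	return 0;
}

/* Explicit error path: the early exit is visible and the unlock always runs. */
static int update_explicit(int fail)
{
	int ret;

	fake_lock();
	ret = do_op(fail);
	if (ret)
		goto out_unlock;

	/* further work under the lock would go here */

out_unlock:
	fake_unlock();
	return ret;
}
```

After a failing call, update_with_macro() leaves lock_depth at 1 while
update_explicit() leaves it at 0. A reviewer reading the first function
has no visual hint that the leak is there.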

> >> +/*
> >> + * Macro Definition
> >> + */
> >> +#ifndef NULL
> >> +#define NULL 0
> >> +#endif
> >> +
> >> +#ifndef FALSE
> >> +#define FALSE 0
> >> +#endif
> >> +
> >> +#ifndef TRUE
> >> +#define TRUE 1
> >> +#endif
> > 
> > Nonsense.
> 
> Ans: Acknowledge, will be fixed later.

No. They will be fixed now.

> >> +       /* Print chipid here since we are interested in lower 16 bits */
> >> +       dev_info(dev,
> >> +                "Motorcomm %s ethernet switch.\n",
> >> +                info->name);
> > 
> > Stop copy-n-paste.
> 
> Ans: Sry for this, i will recheck the code to make sure each line of comments and code
> meaningful again.

Also, consider the comments. Do the comments add anything useful which
is not already obvious from the code? Comments should be about "Why?".

> >> --- a/include/uapi/linux/if_ether.h
> >> +++ b/include/uapi/linux/if_ether.h
> >> @@ -118,7 +118,7 @@
> >>  #define ETH_P_QINQ1    0x9100          /* deprecated QinQ VLAN [ NOT AN OFFICIALLY REGISTERED ID ] */
> >>  #define ETH_P_QINQ2    0x9200          /* deprecated QinQ VLAN [ NOT AN OFFICIALLY REGISTERED ID ] */
> >>  #define ETH_P_QINQ3    0x9300          /* deprecated QinQ VLAN [ NOT AN OFFICIALLY REGISTERED ID ] */
> >> -#define ETH_P_YT921X   0x9988          /* Motorcomm YT921x DSA [ NOT AN OFFICIALLY REGISTERED ID ] */
> >> +#define ETH_P_YT92XX   0x9988          /* Motorcomm YT92xx DSA [ NOT AN OFFICIALLY REGISTERED ID ] */
> >>  #define ETH_P_EDSA     0xDADA          /* Ethertype DSA [ NOT AN OFFICIALLY REGISTERED ID ] */
> >>  #define ETH_P_DSA_8021Q        0xDADB          /* Fake VLAN Header for DSA [ NOT AN OFFICIALLY REGISTERED ID ] */
> >>  #define ETH_P_DSA_A5PSW        0xE001          /* A5PSW Tag Value [ NOT AN OFFICIALLY REGISTERED ID ] */
> > 
> > UAPI stands for User-space API. Do not change it unless there is a
> > very very good reason.
> > 
> 
> Ans: The default tpid both yt921x and yt922x is 0x9988. I have modified this to 
> allow for simultaneous use in both yt922x and yt921x scenarios.

As pointed out, this is UAPI. Any changes to this file need a good
explanation of how they do not change the user API. Does this break
backwards compatibility with user space applications? Maybe tcpdump or
wireshark has a dissector which expects ETH_P_YT921X and you have just
broken it?

> >> +#define YT922X_TAG_FORMAT2_NAME "yt922x-8b"
> >> +#define YT922X_FORMAT2_TAG_LEN                  8
> >> +#define YT922X_PKT_TYPE          GENMASK(15, 14)
> >> +#define YT922X_8B_CPUTAG_PKT_FROM_CPU      0x1
> >> +#define YT922X_8B_CPUTAG_SRC_PORT          GENMASK(6, 2)
> >> +#define YT922X_8B_CPUTAG_DST_PORTMASK      GENMASK(8, 0)
> >> +#define YT922X_8B_CPUTAG_DST_PORTMASK_0      BIT(15)
> >> +#define YT922X_8B_CPUTAG_DST_PORTMASK_0_EN      0x1
> >> +#define YT922X_8B_CPUTAG_FORCE_DST         BIT(9)
> >> +#define YT922X_8B_CPUTAG_FORCE_DST_EN      0x1
> > 
> > If yt922x tag format shares no common with yt921x, make a new tag driver.
> 
> Ans: thank you for your suggestion, we will consider whether to create a new driver in the new file.

When you look at other tag drivers, you will also notice some drivers
implement two taggers in one file. So consider this if there is any
shared code.

> >> +static struct dsa_tag_driver *dsa_tag_driver_array[] = {
> >> +       &DSA_TAG_DRIVER_NAME(yt921x_netdev_ops),
> >> +       &DSA_TAG_DRIVER_NAME(yt922x_4b_netdev_ops),
> >> +       &DSA_TAG_DRIVER_NAME(yt922x_8b_netdev_ops),
> >> +};
> > 
> > If both are supported by the chip and 4b does nothing more than 8b
> > does, do not bother with it.
> 
> Ans: 4b and 8b dsa tag may have different application scenarios. from my opinion,
>      1. 4b dsa tag can save 4 bytes of payload
>      2. 8b dsa tag carry more package info.

How do you plan to swap between the different formats?

The user perspective is that the machine has a collection of interfaces
which are used just as normal, using Linux tools like iproute2. If the
user enables a feature which requires the 8b tag format, will you change
the format from the DSA driver? And swap back to the 4 byte format when
the feature is no longer needed?

	Andrew

^ permalink raw reply
