Netdev List
* Re: [PATCH net 10/13] ice: fix locking in ice_dcb_rebuild()
From: Jacob Keller @ 2026-05-06 21:13 UTC (permalink / raw)
  To: Przemek Kitszel, Andrew Lunn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Piotr Kwapulinski,
	Aleksandr Loktionov, Arkadiusz Kubalewski, Maciej Fijalkowski,
	Michal Kubiak, Joshua Hay, Madhu Chittim, Willem de Bruijn,
	Dave Ertman, Ivan Vecera, Grzegorz Nitka
  Cc: netdev, stable, Bart Van Assche, intel-wired-lan, Arpana Arland
In-Reply-To: <20260504-jk-iwl-net-2026-05-04-v1-10-a222a88bd962@intel.com>

On 5/4/2026 10:14 PM, Jacob Keller wrote:
> From: Bart Van Assche <bvanassche@acm.org>
> 
> Move the mutex_lock() call up to prevent that DCB settings change after
> the first ice_query_port_ets() call. The second ice_query_port_ets()
> call in ice_dcb_rebuild() is already protected by pf->tc_mutex.
> 
> This also fixes a bug in an error path, as before taking the first
> "goto dcb_error" in the function jumped over mutex_lock() to
> mutex_unlock().
> 
> This bug has been detected by the clang thread-safety analyzer.
> 
> Cc: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
> Cc: intel-wired-lan@lists.osuosl.org
> Fixes: 242b5e068b25 ("ice: Fix DCB rebuild after reset")
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
> Tested-by: Arpana Arland <arpanax.arland@intel.com>
> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
> ---
>  drivers/net/ethernet/intel/ice/ice_dcb_lib.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/ethernet/intel/ice/ice_dcb_lib.c b/drivers/net/ethernet/intel/ice/ice_dcb_lib.c
> index 16aa25535152..0bc6dd375687 100644
> --- a/drivers/net/ethernet/intel/ice/ice_dcb_lib.c
> +++ b/drivers/net/ethernet/intel/ice/ice_dcb_lib.c
> @@ -537,14 +537,14 @@ void ice_dcb_rebuild(struct ice_pf *pf)
>  	struct ice_dcbx_cfg *err_cfg;
>  	int ret;
>  
> +	mutex_lock(&pf->tc_mutex);
> +
>  	ret = ice_query_port_ets(pf->hw.port_info, &buf, sizeof(buf), NULL);
>  	if (ret) {
>  		dev_err(dev, "Query Port ETS failed\n");
>  		goto dcb_error;
>  	}
>  
> -	mutex_lock(&pf->tc_mutex);
> -
>  	if (!pf->hw.port_info->qos_cfg.is_sw_lldp)
>  		ice_cfg_etsrec_defaults(pf->hw.port_info);
>  
> 

Sashiko thinks there is a problem here:

> Does moving mutex_lock(&pf->tc_mutex) to the top of ice_dcb_rebuild()
> introduce an AB-BA deadlock with rtnl_lock?
> In standard DCB netlink operations, the netlink layer calls into the driver
> with rtnl_lock already held, and the driver subsequently acquires
> pf->tc_mutex. This establishes a required lock ordering of rtnl_lock
> followed by pf->tc_mutex.
> With this change, if the first ice_query_port_ets() fails, we take the goto
> dcb_error path while already holding pf->tc_mutex.
> Looking at the error cleanup path lower down in ice_dcb_rebuild():
> dcb_error:
>     dev_err(dev, "Disabling DCB until new settings occur\n");
>     [ ... ]
>     ice_pf_dcb_cfg(pf, err_cfg, false);
> Because the locked parameter is passed as false, ice_pf_dcb_cfg() will
> unconditionally acquire rtnl_lock().
> Does this create a lock inversion (pf->tc_mutex followed by rtnl_lock)
> against concurrent DCB netlink operations?

This seems like an entirely pre-existing issue. We already jump to dcb_error
elsewhere in the function.

I don't know if this locking order really is an ABBA violation (I did
not review any of the other flows that take tc_mutex to confirm), but I
don't think it should hold up this fix.

Someone from the ice team will need to investigate and see what the best
solution is. I suspect we'll have to take the RTNL lock first, then
tc_mutex, and pass true to ice_pf_dcb_cfg(). Or, better yet, check whether
this can be converted to the netdev per-instance lock, so that we could
drop tc_mutex entirely and rely on netdev_lock.
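
To make the first option concrete, here is a rough user-space model of the
ordering rule. It is only a sketch and not driver code: every model_* name
is hypothetical, the ranks are an assumption (RTNL before tc_mutex), and
model_cfg() merely stands in for a helper with a "locked" flag like
ice_pf_dcb_cfg(). No real locks are taken; the model only records the
acquisition order and counts violations.

```c
#include <stdbool.h>

/* Toy lock-order checker: rank bookkeeping only, no real locking.
 * Assumed rule: a lower rank must be taken before a higher rank.
 */
enum model_rank { MODEL_RANK_RTNL = 1, MODEL_RANK_TC_MUTEX = 2 };

#define MODEL_MAX_HELD 8

struct model_order {
	int held[MODEL_MAX_HELD];	/* ranks held, in acquisition order */
	int depth;
	int violations;
};

static void model_lock(struct model_order *s, int rank)
{
	/* Taking a rank that is not above the most recent one breaks the rule. */
	if (s->depth > 0 && s->held[s->depth - 1] >= rank)
		s->violations++;
	if (s->depth < MODEL_MAX_HELD)
		s->held[s->depth++] = rank;
}

static void model_unlock(struct model_order *s, int rank)
{
	/* The model only supports releasing the most recently taken lock. */
	if (s->depth > 0 && s->held[s->depth - 1] == rank)
		s->depth--;
	else
		s->violations++;
}

/* Hypothetical stand-in for a config helper with a "locked" flag. */
static void model_cfg(struct model_order *s, bool rtnl_held)
{
	if (!rtnl_held)
		model_lock(s, MODEL_RANK_RTNL);
	/* ... configuration would be applied here ... */
	if (!rtnl_held)
		model_unlock(s, MODEL_RANK_RTNL);
}

/* Error path as questioned: tc_mutex held, helper takes RTNL itself. */
int model_error_path_current(void)
{
	struct model_order s = { {0}, 0, 0 };

	model_lock(&s, MODEL_RANK_TC_MUTEX);
	model_cfg(&s, false);
	model_unlock(&s, MODEL_RANK_TC_MUTEX);
	return s.violations;
}

/* Suggested ordering: RTNL first, then tc_mutex, helper told RTNL is held. */
int model_error_path_reordered(void)
{
	struct model_order s = { {0}, 0, 0 };

	model_lock(&s, MODEL_RANK_RTNL);
	model_lock(&s, MODEL_RANK_TC_MUTEX);
	model_cfg(&s, true);
	model_unlock(&s, MODEL_RANK_TC_MUTEX);
	model_unlock(&s, MODEL_RANK_RTNL);
	return s.violations;
}
```

In this model the first path records one ordering violation and the second
records none. Whether the real flows actually follow the assumed ranks is
exactly what still needs to be reviewed.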

Thanks,
Jake

* Re: [PATCH net 09/13] ice: fix setting RSS VSI hash for E830
From: Jacob Keller @ 2026-05-06 21:06 UTC (permalink / raw)
  To: Przemek Kitszel, Andrew Lunn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Piotr Kwapulinski,
	Aleksandr Loktionov, Arkadiusz Kubalewski, Maciej Fijalkowski,
	Joshua Hay, Madhu Chittim, Willem de Bruijn, Dave Ertman,
	Ivan Vecera, Grzegorz Nitka
  Cc: netdev, stable, Marcin Szycik
In-Reply-To: <20260504-jk-iwl-net-2026-05-04-v1-9-a222a88bd962@intel.com>

On 5/4/2026 10:14 PM, Jacob Keller wrote:
> From: Marcin Szycik <marcin.szycik@linux.intel.com>
> 
> ice_set_rss_hfunc() performs a VSI update, in which it sets hashing
> function, leaving other VSI options unchanged. However, ::q_opt_flags is
> mistakenly set to the value of another field, instead of its original
> value, probably due to a typo. What happens next is hardware-dependent:
> 
> On E810, only the first bit is meaningful (see
> ICE_AQ_VSI_Q_OPT_PE_FLTR_EN) and can potentially end up in a different
> state than before VSI update.
> 
> On E830, some of the remaining bits are not reserved. Setting them
> to some unrelated values can cause the firmware to reject the update
> because of invalid settings, or worse - succeed.
> 
> Reproducer:
>   sudo ethtool -X $PF1 equal 8
> 
> Output in dmesg:
>   Failed to configure RSS hash for VSI 6, error -5
> 
> Fixes: 352e9bf23813 ("ice: enable symmetric-xor RSS for Toeplitz hash function")
> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
> Signed-off-by: Marcin Szycik <marcin.szycik@linux.intel.com>
> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
> ---
>  drivers/net/ethernet/intel/ice/ice_main.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/net/ethernet/intel/ice/ice_main.c b/drivers/net/ethernet/intel/ice/ice_main.c
> index 1d1947a7fe11..c52c465280f7 100644
> --- a/drivers/net/ethernet/intel/ice/ice_main.c
> +++ b/drivers/net/ethernet/intel/ice/ice_main.c
> @@ -8046,7 +8046,7 @@ int ice_set_rss_hfunc(struct ice_vsi *vsi, u8 hfunc)
>  	ctx->info.q_opt_rss |=
>  		FIELD_PREP(ICE_AQ_VSI_Q_OPT_RSS_HASH_M, hfunc);
>  	ctx->info.q_opt_tc = vsi->info.q_opt_tc;
> -	ctx->info.q_opt_flags = vsi->info.q_opt_rss;
> +	ctx->info.q_opt_flags = vsi->info.q_opt_flags;
>  
>  	err = ice_update_vsi(hw, vsi->idx, ctx, NULL);
>  	if (err) {
> 

Sashiko complains about ice_set_rss_hfunc() but it is unrelated to this fix:

> While looking at this function, I noticed a pre-existing issue regarding the
> hardware cache. Does calling ice_update_vsi() with a local context leave the
> global hw->vsi_ctx[vsi->idx] out of sync?
> If ice_update_vsi() succeeds, vsi->info.q_opt_rss is updated, but
> hw->vsi_ctx[vsi->idx]->info.q_opt_rss is not.
> When an unrelated feature such as RDMA filtering is subsequently toggled via
> ice_cfg_rdma_fltr(), could it retrieve this stale cached context via
> ice_get_vsi_ctx() and copy the stale q_opt_rss value back into its command
> buffer?
> Could this cause the firmware to silently revert the RSS hash function to its
> previous configuration during the next ice_update_vsi() call?
> There also appears to be a pre-existing issue with state desynchronization
> when handling failures. After ice_update_vsi() succeeds, vsi->rss_hfunc
> is updated, but ice_set_rss_hfunc() then calls ice_set_rss_cfg_symm() at the
> end of the function:
> ice_set_rss_hfunc() {
>         ...
>         err = ice_update_vsi(hw, vsi->idx, ctx, NULL);
>         if (err) {
>                 ...
>         } else {
>                 vsi->info.q_opt_rss = ctx->info.q_opt_rss;
>                 vsi->rss_hfunc = hfunc;
>                 ...
>         }
>         ...
>         return ice_set_rss_cfg_symm(hw, vsi, symm);
> }
> If ice_set_rss_cfg_symm() fails, the error is returned but vsi->rss_hfunc
> remains updated. If the user retries the command, the early check:
> if (hfunc == vsi->rss_hfunc)
>         return 0;
> evaluates to true and returns success immediately.
> Could this skip the ice_set_rss_cfg_symm() retry entirely, leaving the
> hardware's flow director rules in a non-symmetric state permanently while
> falsely reporting success?

Someone from the ice team should look into this and determine whether or
not it's valid.

* Re: [PATCH net 08/13] idpf: fix double free and use-after-free in aux device error paths
From: Jacob Keller @ 2026-05-06 21:04 UTC (permalink / raw)
  To: Przemek Kitszel, Andrew Lunn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Piotr Kwapulinski,
	Aleksandr Loktionov, Arkadiusz Kubalewski, Maciej Fijalkowski,
	Michal Kubiak, Joshua Hay, Madhu Chittim, Willem de Bruijn,
	Dave Ertman, Ivan Vecera, Grzegorz Nitka
  Cc: netdev, stable, Greg Kroah-Hartman, Tony Nguyen, stable,
	Paul Menzel
In-Reply-To: <20260504-jk-iwl-net-2026-05-04-v1-8-a222a88bd962@intel.com>

On 5/4/2026 10:14 PM, Jacob Keller wrote:
> From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> 
> When auxiliary_device_add() fails in idpf_plug_vport_aux_dev() or
> idpf_plug_core_aux_dev(), the err_aux_dev_add label calls
> auxiliary_device_uninit() and falls through to err_aux_dev_init.  The
> uninit call will trigger put_device(), which invokes the release
> callback (idpf_vport_adev_release / idpf_core_adev_release) that frees
> iadev.  The fall-through then reads adev->id from the freed iadev for
> ida_free() and double-frees iadev with kfree().
> 
> Free the IDA slot and clear the back-pointer before uninit, while adev
> is still valid, then return immediately.
> 
> Commit 65637c3a1811 ("idpf: fix UAF in RDMA core aux dev deinitialization")
> fixed the same use-after-free in the matching unplug path in this file but
> missed both probe error paths.
> 
> Cc: Tony Nguyen <anthony.l.nguyen@intel.com>
> Cc: Przemek Kitszel <przemyslaw.kitszel@intel.com>
> Cc: Andrew Lunn <andrew+netdev@lunn.ch>
> Cc: "David S. Miller" <davem@davemloft.net>
> Cc: Eric Dumazet <edumazet@google.com>
> Cc: Jakub Kicinski <kuba@kernel.org>
> Cc: Paolo Abeni <pabeni@redhat.com>
> Cc: stable@kernel.org
> Fixes: be91128c579c ("idpf: implement RDMA vport auxiliary dev create, init, and destroy")
> Fixes: f4312e6bfa2a ("idpf: implement core RDMA auxiliary dev create, init, and destroy")
> Assisted-by: gregkh_clanker_t1000
> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
> Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de>
> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
> ---
>  drivers/net/ethernet/intel/idpf/idpf_idc.c | 6 ++++++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/drivers/net/ethernet/intel/idpf/idpf_idc.c b/drivers/net/ethernet/intel/idpf/idpf_idc.c
> index 7e4f4ac92653..b7d6b08fc89e 100644
> --- a/drivers/net/ethernet/intel/idpf/idpf_idc.c
> +++ b/drivers/net/ethernet/intel/idpf/idpf_idc.c
> @@ -90,7 +90,10 @@ static int idpf_plug_vport_aux_dev(struct iidc_rdma_core_dev_info *cdev_info,
>  	return 0;
>  

Sashiko points out some possible flaws in the idpf_plug_vport_aux_dev
function that are unrelated to this change:

> This isn't an issue introduced by this patch, but does earlier code prematurely
> publish vdev_info->adev before initialization is complete?
> vdev_info->adev is set near the beginning of idpf_plug_vport_aux_dev():
> 	adev = &iadev->adev;
> 	vdev_info->adev = &iadev->adev;
> 	iadev->vdev_info = vdev_info;
> 	...
> 	ret = auxiliary_device_init(adev);
> If a concurrent MTU event triggers idpf_idc_vdev_mtu_event(), it will read
> the pointer locklessly:
> void idpf_idc_vdev_mtu_event(...)
> {
> 	...
> 	device_lock(&vdev_info->adev->dev);
> 	adev = vdev_info->adev;
> 	if (!adev || !adev->dev.driver)
> 		goto unlock;
> If device_initialize() hasn't executed yet, the device's mutex is uninitialized
> and attempting to lock it could result in a panic.
> Furthermore, if auxiliary_device_add() fails and the error path sets the
> pointer to NULL and frees the device, could a concurrent reader that fetched
> the pointer just before it was cleared suffer a use-after-free when attempting
> to lock the freed device?
> Should the assignment to vdev_info->adev be deferred until after the device has
> been successfully initialized and added?
> Also, in idpf_idc_vdev_mtu_event(), is it possible for vdev_info->adev to be
> NULL when device_lock() is called?
> 	device_lock(&vdev_info->adev->dev);
> 	adev = vdev_info->adev;
> 	if (!adev || !adev->dev.driver)
> 		goto unlock;
> If vdev_info->adev is NULL, evaluating &vdev_info->adev->dev evaluates to
> NULL, which is then passed to mutex_lock() inside device_lock(). The subsequent
> check for (!adev) indicates this NULL state is anticipated. Should the
> NULL check happen before attempting to acquire the lock?

I do not believe these should block this fix, but we'll need someone
from the idpf team to review this code, determine whether the concerns
are valid, and find a fix if they are.

Thanks,
Jake

* Re: [Patch net-next v1 0/7] r8169: add RSS support for RTL8127
From: Heiner Kallweit @ 2026-05-06 21:02 UTC (permalink / raw)
  To: javen, nic_swsd, andrew+netdev, davem, edumazet, kuba, pabeni,
	horms
  Cc: netdev, linux-kernel
In-Reply-To: <20260506081326.767-1-javen_xu@realsil.com.cn>

On 06.05.2026 10:13, javen wrote:
> From: Javen Xu <javen_xu@realsil.com.cn>
> 
> This patch series adds RSS (Receive Side Scaling) support for the r8169
> ethernet driver, specifically for RTL8127 (RTL_GIGA_MAC_VER_80).

The series adds RSS support for RTL8127 only. Is it generic enough to retrofit
RSS support for other chip versions such as RTL8126 without major refactoring?

> 
> RSS enables packet distribution across multiple receive queues, which can
> significantly improve network throughput on multi-core systems by allowing
> parallel processing of incoming packets.
> 
> Key features:
> - Multi-queue RX support (up to 8 queues)
> - MSI-X interrupt with vector mapping
> - Dynamic queue configuration via ethtool (-L)
> - RSS hash computation for flow classification
> 
> Experiments:
> Platform: AMD Ryzen Embedded R2514 with Radeon Graphics(4 Cores/8 Threads)
> Arch: x86_64
> Test command: 
>   Server: iperf3 -s
>   Client: iperf3 -c 192.168.2.1 -P 20 -t 3600
> Monitor: mpstat -P ALL 1
> 
> Before this patch (Without RSS):
>   Throughput: Unstable, fluctuating between 3.76 Gbits/sec and
>   8.2 Gbits/sec.
>   CPU Usage: A single CPU core is fully occupied with softirq reaching 
>   up to 96%.
> 
> After this patch (With RSS enabled):
>   Throughput: Stable at 9.42 Gbits/sec.
>   CPU Usage: The traffic load is evenly distributed across multiple CPU
>   cores. The maximum softirq on a single core dropped to 63%.
>   
> Other Experiments:
> Link: https://lore.kernel.org/netdev/0A5279953D81BB9C+f50c9b49-3e5d-467f-b69a-7e49ed223383@radxa.com/
> 
> Javen Xu (7):
>   r8169: add support for multi irqs
>   r8169: add support for multi rx queues
>   r8169: add support for new interrupt mapping
>   r8169: enable new interrupt mapping
>   r8169: add support and enable rss
>   r8169: move struct ethtool_ops
>   r8169: add support for ethtool
> 
>  drivers/net/ethernet/realtek/r8169_main.c | 1202 ++++++++++++++++++---
>  1 file changed, 1080 insertions(+), 122 deletions(-)
> 


* Re: [PATCH net 05/13] idpf: do not enable XDP if queue based scheduling is not supported
From: Jacob Keller @ 2026-05-06 20:59 UTC (permalink / raw)
  To: Przemek Kitszel, Andrew Lunn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Piotr Kwapulinski,
	Aleksandr Loktionov, Arkadiusz Kubalewski, Maciej Fijalkowski,
	Michal Kubiak, Joshua Hay, Madhu Chittim, Willem de Bruijn,
	Dave Ertman, Ivan Vecera, Grzegorz Nitka
  Cc: netdev, stable, Patryk Holda
In-Reply-To: <20260504-jk-iwl-net-2026-05-04-v1-5-a222a88bd962@intel.com>

On 5/4/2026 10:14 PM, Jacob Keller wrote:
> From: Joshua Hay <joshua.a.hay@intel.com>
> 
> The current XDP implementation uses queue based scheduling for its TxQs.
> If the FW does not advertise support for queue based scheduling, do not
> enable XDP. Add the missing capability check at the start of the XDP
> configuration. This will temporarily break XDP while a flow based
> implementation is worked on, as well as while FWs with queue based by
> default are rolled out.
> 
> Fixes: 705457e7211f ("idpf: implement XDP_SETUP_PROG in ndo_bpf for splitq")
> Signed-off-by: Joshua Hay <joshua.a.hay@intel.com>
> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
> Reviewed-by: Madhu Chittim <madhu.chittim@intel.com>
> Tested-by: Patryk Holda <patryk.holda@intel.com>
> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
> ---
>  drivers/net/ethernet/intel/idpf/xdp.c | 7 +++++++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/drivers/net/ethernet/intel/idpf/xdp.c b/drivers/net/ethernet/intel/idpf/xdp.c
> index cbccd4546768..dcd867517a5f 100644
> --- a/drivers/net/ethernet/intel/idpf/xdp.c
> +++ b/drivers/net/ethernet/intel/idpf/xdp.c
> @@ -510,6 +510,13 @@ int idpf_xdp(struct net_device *dev, struct netdev_bpf *xdp)
>  	if (!idpf_is_queue_model_split(vport->dflt_qv_rsrc.txq_model))
>  		goto notsupp;
>  
> +	if (!idpf_is_cap_ena(vport->adapter, IDPF_OTHER_CAPS,
> +			     VIRTCHNL2_CAP_SPLITQ_QSCHED)) {
> +		NL_SET_ERR_MSG_MOD(xdp->extack,
> +				   "Device does not support requested XDP Tx scheduling mode");
> +		goto notsupp;
> +	}
> +

Sashiko points out that xdp->extack is only valid for XDP_SETUP_PROG:

> Could accessing xdp->extack here cause an uninitialized memory dereference?
> idpf_xdp() handles multiple commands like XDP_SETUP_PROG and
> XDP_SETUP_XSK_POOL. In struct netdev_bpf, extack is part of a union and
> is only valid when xdp->command is XDP_SETUP_PROG.
> Since this capability check happens before the switch (xdp->command) block,
> if the command is XDP_SETUP_XSK_POOL, xdp->extack overlaps with the xsk
> sub-struct and might contain uninitialized stack data.
> If that memory is non-NULL, NL_SET_ERR_MSG_MOD could write the error message
> pointer to an arbitrary memory address.
> Should this capability check be moved inside the XDP_SETUP_PROG case,
> or should the command type be verified before accessing xdp->extack?

I checked and the netdev_bpf structure indeed only has extack for the
XDP_SETUP_PROG case.

Because extack is only valid for that command, you can't use extack here.

I don't know whether this check belongs inside idpf_xdp_setup_prog(), or
whether you should only call NL_SET_ERR_MSG_MOD() when the command is
XDP_SETUP_PROG.
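
As a rough illustration of the second option, here is a user-space sketch
of a command-tagged union where the message pointer is only read for the
command that makes it valid. This is not the real netdev_bpf layout: all
model_* names, the struct members, and the message text are made up for
the example.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

enum model_cmd { MODEL_SETUP_PROG, MODEL_SETUP_XSK_POOL };

struct model_extack { const char *msg; };

/* Hypothetical request: the valid union member depends on command. */
struct model_bpf {
	enum model_cmd command;
	union {
		struct { struct model_extack *extack; } prog;	/* SETUP_PROG only */
		struct { void *pool; } xsk;			/* SETUP_XSK_POOL only */
	} u;
};

/* Return the extack only when the command says that member is valid. */
static struct model_extack *model_extack_of(const struct model_bpf *req)
{
	return req->command == MODEL_SETUP_PROG ? req->u.prog.extack : NULL;
}

/* Capability check that runs before the per-command dispatch. */
int model_check_cap(const struct model_bpf *req, bool cap_enabled)
{
	struct model_extack *extack = model_extack_of(req);

	if (cap_enabled)
		return 0;
	if (extack)
		extack->msg = "scheduling mode not supported";
	return -EOPNOTSUPP;
}

/* Returns 1 if a message was written, 0 if it was skipped, -1 on error. */
int model_demo_message_written(enum model_cmd cmd)
{
	struct model_extack ea = { NULL };
	int dummy_pool = 0;
	struct model_bpf req = { .command = cmd };

	if (cmd == MODEL_SETUP_PROG)
		req.u.prog.extack = &ea;
	else
		req.u.xsk.pool = &dummy_pool;	/* non-NULL bytes overlap extack */

	if (model_check_cap(&req, false) != -EOPNOTSUPP)
		return -1;
	return ea.msg != NULL;
}
```

With the guard in place, the pool command still gets the error code, but
the overlapping pointer is never treated as an extack. Moving the whole
check into the setup-prog handler would avoid the question entirely, at the
cost of not rejecting the other commands early.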

Josh, could you figure out which solution is better and prepare an
updated version?

>  	switch (xdp->command) {
>  	case XDP_SETUP_PROG:
>  		ret = idpf_xdp_setup_prog(vport, xdp);
> 


* Re: [PATCH net 03/13] i40e: keep q_vectors array in sync with channel count changes
From: Jacob Keller @ 2026-05-06 20:53 UTC (permalink / raw)
  To: Przemek Kitszel, Andrew Lunn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Piotr Kwapulinski,
	Aleksandr Loktionov, Arkadiusz Kubalewski, Maciej Fijalkowski,
	Joshua Hay, Madhu Chittim, Willem de Bruijn, Dave Ertman,
	Ivan Vecera, Grzegorz Nitka
  Cc: netdev, stable, Simon Horman, Sunitha Mekala
In-Reply-To: <20260504-jk-iwl-net-2026-05-04-v1-3-a222a88bd962@intel.com>

On 5/4/2026 10:14 PM, Jacob Keller wrote:
> From: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
> 
> For the main VSI, i40e_set_num_rings_in_vsi() always derives
> num_q_vectors from pf->num_lan_msix. At the same time, ethtool -L stores
> the user requested channel count in vsi->req_queue_pairs and the queue
> setup path uses that value for the effective number of queue pairs.
> 
> This leaves queue and vector counts out of sync after shrinking channel
> count via ethtool -L. The active queue configuration is reduced, but the
> VSI still keeps the full PF-sized q_vector topology.
> 
> That mismatch breaks reconfiguration flows which rely on vector/NAPI
> state matching the effective channel configuration. In particular,
> toggling /sys/class/net/<dev>/threaded after reducing the channel count
> can hang, and later channel-count changes can fail because VSI reinit
> does not rebuild q_vectors to match the new vector count.
> 
> Fix this by making the main VSI num_q_vectors follow the effective
> requested channel count, capped by the available MSI-X vectors. Update
> i40e_vsi_reinit_setup() to rebuild q_vectors during VSI reinit so the
> vector topology is refreshed together with the ring arrays when channel
> count changes.
> 
> Keep alloc_queue_pairs unchanged and based on pf->num_lan_qps so the VSI
> retains its full queue capacity.
> 
> Selftest napi_threaded.py was originally used when Jakub reported hang
> on /sys/class/net/<dev>/threaded toggle. In order to make it pass on
> i40e, use persistent NAPI configuration for q_vector NAPIs so NAPI
> identity and threaded settings survive q_vector reallocation across
> channel-count changes. This is achieved by using netif_napi_add_config()
> when configuring q_vectors.
> 
> $ export NETIF=ens259f1np1
> $ sudo -E env PATH="$PATH" ./tools/testing/selftests/drivers/net/napi_threaded.py
> TAP version 13
> 1..3
> ok 1 napi_threaded.napi_init
> ok 2 napi_threaded.change_num_queues
> ok 3 napi_threaded.enable_dev_threaded_disable_napi_threaded
> Totals: pass:3 fail:0 xfail:0 xpass:0 skip:0 error:0
> 
> [Jake: use min() and clamp() as suggested by Simon on Intel Wired LAN]
> 
> Reported-by: Jakub Kicinski <kuba@kernel.org>
> Closes: https://lore.kernel.org/intel-wired-lan/20260316133100.6054a11f@kernel.org/
> Fixes: d2a69fefd756 ("i40e: Fix changing previously set num_queue_pairs for PFs")
> Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
> Reviewed-by: Simon Horman <horms@kernel.org>
> Tested-by: Sunitha Mekala <sunithax.d.mekala@intel.com>
> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
> ---

Sashiko points out some possible issues that might need to be addressed:

>  drivers/net/ethernet/intel/i40e/i40e_main.c | 34 ++++++++++++++++++++++++-----
>  1 file changed, 28 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
> index 6d4f9218dc68..23156015ed86 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e_main.c
> +++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
> @@ -11403,10 +11403,14 @@ static void i40e_service_timer(struct timer_list *t)
>  static int i40e_set_num_rings_in_vsi(struct i40e_vsi *vsi)
>  {
>  	struct i40e_pf *pf = vsi->back;
> +	u16 qps;
>  
>  	switch (vsi->type) {
>  	case I40E_VSI_MAIN:
>  		vsi->alloc_queue_pairs = pf->num_lan_qps;
> +		qps = vsi->req_queue_pairs ?
> +		      min(vsi->req_queue_pairs, pf->num_lan_qps) :
> +		      pf->num_lan_qps;
>  		if (!vsi->num_tx_desc)
>  			vsi->num_tx_desc = ALIGN(I40E_DEFAULT_NUM_DESCRIPTORS,
>  						 I40E_REQ_DESCRIPTOR_MULTIPLE);
> @@ -11414,7 +11418,7 @@ static int i40e_set_num_rings_in_vsi(struct i40e_vsi *vsi)
>  			vsi->num_rx_desc = ALIGN(I40E_DEFAULT_NUM_DESCRIPTORS,
>  						 I40E_REQ_DESCRIPTOR_MULTIPLE);
>  		if (test_bit(I40E_FLAG_MSIX_ENA, pf->flags))
> -			vsi->num_q_vectors = pf->num_lan_msix;
> +			vsi->num_q_vectors = clamp(qps, 1, pf->num_lan_msix);
>  		else
>  			vsi->num_q_vectors = 1;
>  
> @@ -11503,6 +11507,7 @@ static int i40e_vsi_alloc_arrays(struct i40e_vsi *vsi, bool alloc_qvectors)
>  
>  err_vectors:
>  	kfree(vsi->tx_rings);
> +	vsi->tx_rings = NULL;
>  	return ret;
>  }
>  
> @@ -12043,7 +12048,8 @@ static int i40e_vsi_alloc_q_vector(struct i40e_vsi *vsi, int v_idx)
>  	cpumask_copy(&q_vector->affinity_mask, cpu_possible_mask);
>  
>  	if (vsi->netdev)
> -		netif_napi_add(vsi->netdev, &q_vector->napi, i40e_napi_poll);
> +		netif_napi_add_config(vsi->netdev, &q_vector->napi,
> +				      i40e_napi_poll, v_idx);
>  
>  	/* tie q_vector and vsi together */
>  	vsi->q_vectors[v_idx] = q_vector;
> @@ -14264,12 +14270,27 @@ static struct i40e_vsi *i40e_vsi_reinit_setup(struct i40e_vsi *vsi)
>  
>  	pf = vsi->back;
>  
> +	if (test_bit(I40E_FLAG_MSIX_ENA, pf->flags)) {
> +		i40e_put_lump(pf->irq_pile, vsi->base_vector, vsi->idx);
> +		vsi->base_vector = 0;
> +	}
> +


It says:
> 
> Does vsi->base_vector need to be cleared unconditionally?
> If the hardware falls back to legacy/MSI mode, the I40E_FLAG_MSIX_ENA flag is
> false, which leaves vsi->base_vector with a stale non-zero value.
> Subsequent calls to i40e_vsi_setup_vectors() check if (vsi->base_vector) and
> unconditionally return -EEXIST, which causes the VSI reinitialization to
> fail and the VSI to be destroyed.

I'm not sure this is an issue. base_vector should only be non-zero when
MSIX is enabled, so this will clear it to 0 in that case. In MSI or
Legacy mode it will already be zero so it doesn't need to be cleared.
When switching modes we have to go through this function to
reinitialize, so we should clear it.

It might nominally make more sense to always clear the value, to ensure
it's zero even if it somehow picked up a stale value, but I don't think
it's actually required.


>  	i40e_put_lump(pf->qp_pile, vsi->base_queue, vsi->idx);
>  	i40e_vsi_clear_rings(vsi);
>  
> -	i40e_vsi_free_arrays(vsi, false);
> +	i40e_vsi_free_q_vectors(vsi);
> +	i40e_vsi_free_arrays(vsi, true);

Sashiko says:

> Can this result in a use-after-free on the ring structures?
> The i40e_vsi_clear_rings() function passes the ring structures to kfree_rcu().
> Immediately after, i40e_vsi_free_q_vectors() iterates over the q_vector->tx
> list, which still points to these RCU-freed rings, and attempts to modify
> them.

I think this is correct, and we'll need to free these in the order
i40e_vsi_free_q_vectors -> i40e_vsi_clear_rings -> i40e_vsi_free_arrays.

It does seem like the q_vectors maintain their pointers to the rings, and
i40e_for_each_ring would access them. The i40e_vsi_clear_rings
function does assign the pointers to NULL, but those don't appear to be
the same pointers that are checked in the q_vector.

I will note that there is at least one other teardown flow that also
appears incorrect and calls i40e_vsi_clear_rings first before calling
i40e_vsi_free_q_vectors: the teardown flow of i40e_vsi_setup().

This needs to be addressed, so this patch should be dropped from the series.
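
To spell out the ordering I have in mind, here is a small user-space
sketch. It is not the i40e code: the model_* types and functions are
hypothetical, plain free() stands in for kfree_rcu(), and a single vector
services all rings. The only point is that the step which walks the
vector's ring list has to run while the rings are still allocated.

```c
#include <stdlib.h>
#include <string.h>

#define MODEL_NUM_RINGS 2

struct model_vector;

struct model_ring {
	struct model_vector *vec;	/* back-pointer to the servicing vector */
	struct model_ring *next;	/* next ring on the same vector */
};

struct model_vector {
	struct model_ring *ring_list;	/* borrowed pointers, not owning */
};

struct model_vsi {
	struct model_ring *rings[MODEL_NUM_RINGS];	/* owning pointers */
	struct model_vector *vec;
};

/* Step 1: walks the vector's ring list, so the rings must still be alive. */
static int model_free_vectors(struct model_vsi *vsi)
{
	struct model_ring *ring, *next;
	int detached = 0;

	if (!vsi->vec)
		return 0;
	for (ring = vsi->vec->ring_list; ring; ring = next) {
		next = ring->next;
		ring->vec = NULL;
		ring->next = NULL;
		detached++;
	}
	free(vsi->vec);
	vsi->vec = NULL;
	return detached;
}

/* Step 2: nothing references the rings any more, so release them. */
static void model_clear_rings(struct model_vsi *vsi)
{
	int i;

	for (i = 0; i < MODEL_NUM_RINGS; i++) {
		free(vsi->rings[i]);
		vsi->rings[i] = NULL;
	}
}

/* Teardown in the safe order: detach vectors first, then free rings. */
int model_teardown(struct model_vsi *vsi)
{
	int detached = model_free_vectors(vsi);

	model_clear_rings(vsi);
	return detached;
}

int model_setup(struct model_vsi *vsi)
{
	int i;

	memset(vsi, 0, sizeof(*vsi));
	vsi->vec = calloc(1, sizeof(*vsi->vec));
	if (!vsi->vec)
		return -1;
	for (i = 0; i < MODEL_NUM_RINGS; i++) {
		vsi->rings[i] = calloc(1, sizeof(*vsi->rings[i]));
		if (!vsi->rings[i]) {
			model_teardown(vsi);
			return -1;
		}
		vsi->rings[i]->vec = vsi->vec;
		vsi->rings[i]->next = vsi->vec->ring_list;
		vsi->vec->ring_list = vsi->rings[i];
	}
	return 0;
}

/* Returns the number of rings detached, or -1 on failure. */
int model_demo(void)
{
	struct model_vsi vsi;
	int detached;

	if (model_setup(&vsi))
		return -1;
	detached = model_teardown(&vsi);
	if (vsi.vec || vsi.rings[0] || vsi.rings[1])
		return -1;
	return detached;
}
```

Swapping the two steps in model_teardown() would make step 1 dereference
freed rings, which is the pattern Sashiko is flagging.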

>  	i40e_set_num_rings_in_vsi(vsi);
> -	ret = i40e_vsi_alloc_arrays(vsi, false);
> +
> +	ret = i40e_vsi_alloc_arrays(vsi, true);
> +	if (ret)
> +		goto err_vsi;
> +
> +	/* Rebuild q_vectors during VSI reinit because the effective channel
> +	 * count may change num_q_vectors. Keep vector topology aligned with the
> +	 * queue configuration after ethtool's .set_channels() callback.
> +	 */
> +	ret = i40e_vsi_setup_vectors(vsi);
>  	if (ret)
>  		goto err_vsi;
>  


Finally, Sashiko says this:

> Is it possible for the dynamic reallocation of vsi->num_q_vectors to cause
> irq_pile fragmentation?
> Because i40e_get_lump() requires a contiguous block of vectors, reducing
> the channel count and later increasing it might fail if another VSI
> (such as a Macvlan interface) allocated vectors from the freed hole in
> the meantime.
> If this contiguous allocation fails, i40e_vsi_reinit_setup() aborts and
> destroys the VSI, which could render the interface unusable until a driver
> reload.
> Also, if i40e_vsi_setup_vectors() fails here and jumps to err_vsi, will the
> driver skip unregister_netdev()?
> The err_vsi label is placed after the err_rings block, which skips cleaning up
> the netdev. The code then falls through to i40e_vsi_clear() which frees the
> VSI memory.
> Could this leave the active net_device registered in the kernel with a
> dangling pointer to the freed VSI memory?


I'm not sure exactly what it's trying to say here, since it seems like a
few jumbled issues. However, I suspect that even if the fragmentation issue
is real, it shouldn't be solved by this fix.

I am also not certain that the err_vsi issue is real. The function
returns NULL when it fails to reinit, and that will stop probe or
rebuild. It is likely that this entire chunk of code is flawed and
should be redone, but I think that is also out of scope for this fix.

Still, we need a new version that fixes the use-after-free in the
q_vector teardown.

* Re: [PATCH net v2] net: napi: Avoid gro timer misfiring at end of busypoll
From: Joe Damato @ 2026-05-06 20:49 UTC (permalink / raw)
  To: Dragos Tatulea
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, Björn Töpel, Daniel Borkmann,
	Gal Pressman, Frederik Deweerdt, Martin Karsten, Tariq Toukan,
	Cosmin Ratiu, netdev, linux-kernel
In-Reply-To: <20260506090808.820559-2-dtatulea@nvidia.com>

On Wed, May 06, 2026 at 09:08:08AM +0000, Dragos Tatulea wrote:
> When in irq deferral mode (defer-hard-irqs > 0), a short enough
> gro-flush timeout can trigger before NAPI_STATE_SCHED is cleared if the
> last poll in busy_poll_stop() takes too long. This can have the effect
> of leaving the queue stuck with interrupts disabled and no timer armed
> which results in a tx timeout if there is no subsequent busypoll cycle.
> 
> To prevent this, defer the gro-flush timer arm after the last poll.
> 
> Fixes: 7fd3253a7de6 ("net: Introduce preferred busy-polling")
> Co-developed-by: Martin Karsten <mkarsten@uwaterloo.ca>
> Signed-off-by: Martin Karsten <mkarsten@uwaterloo.ca>
> Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
> Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
> ---
> Changes since RFC [1]:
> - Sending only fix to net.
> - Made commit message clearer and more succint.
> - Fixed timer arming to happen after clearing the NAPI_STATE_SCHED bit
> - Arm timer after clearing NAPI_STATE_SCHED and drop IRQ disable.
> 
> [1] https://lore.kernel.org/all/20260428175134.1197036-3-dtatulea@nvidia.com/
> ---
>  net/core/dev.c | 21 ++++++++++++---------
>  1 file changed, 12 insertions(+), 9 deletions(-)

Good catch on this bug. This fix looks right to me.

Reviewed-by: Joe Damato <joe@dama.to>

* [GIT PULL] bluetooth 2026-05-06
From: Luiz Augusto von Dentz @ 2026-05-06 20:45 UTC (permalink / raw)
  To: davem, kuba; +Cc: linux-bluetooth, netdev

The following changes since commit b89e0100a5f6885f9748bbacc3f4e3bcff654e4c:

  Merge tag 'wireless-2026-05-06' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless (2026-05-06 07:29:31 -0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth.git tags/for-net-2026-05-06

for you to fetch changes up to c5d415596cb6fbdf6334b06cc87a1a5a268d8725:

  Bluetooth: HIDP: serialise l2cap_unregister_user via hidp_session_sem (2026-05-06 16:27:53 -0400)

----------------------------------------------------------------
bluetooth pull request for net:

 - hci_conn: fix potential UAF in create_big_sync
 - hci_event: fix memset typo
 - hci_event: Fix OOB read and infinite loop in hci_le_create_big_complete_evt
 - L2CAP: fix MPS check in l2cap_ecred_reconf_req
 - L2CAP: defer conn param update to avoid conn->lock/hdev->lock inversion
 - L2CAP: Fix null-ptr-deref in l2cap_sock_state_change_cb()
 - L2CAP: Fix null-ptr-deref in l2cap_sock_get_sndtimeo_cb()
 - L2CAP: Fix null-ptr-deref in l2cap_sock_new_connection_cb()
 - RFCOMM: pull credit byte with skb_pull_data()
 - SCO: fix sleeping under spinlock in sco_conn_ready
 - SCO: hold sk properly in sco_conn_ready
 - ISO: Fix data-race on dst in iso_sock_connect()
 - ISO: Fix data-race on iso_pi(sk) in socket and HCI event paths
 - bnep: fix incorrect length parsing in bnep_rx_frame() extension handling
 - hci_uart: Fix NULL deref in recv callbacks when priv is uninitialized
 - virtio_bt: clamp rx length before skb_put
 - virtio_bt: validate rx pkt_type header length
 - HIDP: serialise l2cap_unregister_user via hidp_session_sem
 - btintel_pcie: treat boot stage bit 12 as warning
 - btmtk: validate WMT event SKB length before struct access

----------------------------------------------------------------
Aurelien DESBRIERES (1):
      Bluetooth: hci_uart: Fix NULL deref in recv callbacks when priv is uninitialized

David Carlier (1):
      Bluetooth: hci_conn: fix potential UAF in create_big_sync

Dudu Lu (2):
      Bluetooth: bnep: fix incorrect length parsing in bnep_rx_frame() extension handling
      Bluetooth: l2cap: fix MPS check in l2cap_ecred_reconf_req

Jann Horn (1):
      Bluetooth: hci_event: fix memset typo

Luiz Augusto von Dentz (1):
      Bluetooth: hci_event: Fix OOB read and infinite loop in hci_le_create_big_complete_evt

Michael Bommarito (3):
      Bluetooth: virtio_bt: clamp rx length before skb_put
      Bluetooth: virtio_bt: validate rx pkt_type header length
      Bluetooth: HIDP: serialise l2cap_unregister_user via hidp_session_sem

Mikhail Gavrilov (1):
      Bluetooth: l2cap: defer conn param update to avoid conn->lock/hdev->lock inversion

Pauli Virtanen (2):
      Bluetooth: SCO: fix sleeping under spinlock in sco_conn_ready
      Bluetooth: SCO: hold sk properly in sco_conn_ready

Pengpeng Hou (1):
      Bluetooth: RFCOMM: pull credit byte with skb_pull_data()

Sai Teja Aluvala (1):
      Bluetooth: btintel_pcie: treat boot stage bit 12 as warning

SeungJu Cheon (2):
      Bluetooth: ISO: Fix data-race on dst in iso_sock_connect()
      Bluetooth: ISO: Fix data-race on iso_pi(sk) in socket and HCI event paths

Siwei Zhang (3):
      Bluetooth: L2CAP: Fix null-ptr-deref in l2cap_sock_state_change_cb()
      Bluetooth: L2CAP: Fix null-ptr-deref in l2cap_sock_get_sndtimeo_cb()
      Bluetooth: L2CAP: Fix null-ptr-deref in l2cap_sock_new_connection_cb()

Tristan Madani (1):
      Bluetooth: btmtk: validate WMT event SKB length before struct access

 drivers/bluetooth/btintel_pcie.c |  13 +++-
 drivers/bluetooth/btintel_pcie.h |   2 +-
 drivers/bluetooth/btmtk.c        |  15 ++++-
 drivers/bluetooth/hci_ath.c      |   3 +
 drivers/bluetooth/hci_bcsp.c     |   3 +
 drivers/bluetooth/hci_h4.c       |   3 +
 drivers/bluetooth/hci_h5.c       |   3 +
 drivers/bluetooth/virtio_bt.c    |  39 +++++++++---
 include/net/bluetooth/hci_core.h |   2 +-
 net/bluetooth/bnep/core.c        |  13 +++-
 net/bluetooth/hci_conn.c         | 124 ++++++++++++++++++++++++++++++++-------
 net/bluetooth/hci_event.c        |  29 ++++++++-
 net/bluetooth/hidp/core.c        |  27 ++++++++-
 net/bluetooth/iso.c              |  56 ++++++++++--------
 net/bluetooth/l2cap_core.c       |  14 +----
 net/bluetooth/l2cap_sock.c       |   9 +++
 net/bluetooth/rfcomm/core.c      |   7 ++-
 net/bluetooth/sco.c              |  60 ++++++++++++-------
 18 files changed, 320 insertions(+), 102 deletions(-)

^ permalink raw reply

* Re: [PATCH net 3/3] netdevsim: psp: rcu protect psp_dev reference
From: Willem de Bruijn @ 2026-05-06 20:43 UTC (permalink / raw)
  To: Daniel Zahka, Willem de Bruijn, Jakub Kicinski, Andrew Lunn,
	David S. Miller, Eric Dumazet, Paolo Abeni, Willem de Bruijn
  Cc: netdev, linux-kernel
In-Reply-To: <9e691c63-ef7d-4e35-8cc4-77f83a16c970@gmail.com>

Daniel Zahka wrote:
> 
> On 5/6/26 3:34 PM, Willem de Bruijn wrote:
> > Daniel Zahka wrote
> >>   static ssize_t
> >> @@ -228,16 +237,23 @@ nsim_psp_rereg_write(struct file *file, const char __user *data, size_t count,
> >>   		     loff_t *ppos)
> >>   {
> >>   	struct netdevsim *ns = file->private_data;
> >> -	int err;
> >> +	struct psp_dev *psd;
> >> +	ssize_t ret;
> >>   
> >>   	mutex_lock(&ns->psp.rereg_lock);
> >> -	__nsim_psp_uninit(ns);
> >> +	__nsim_psp_uninit(ns, false);
> >> +
> >> +	psd = psp_dev_create(ns->netdev, &nsim_psp_ops, &nsim_psp_caps, ns);
> >> +	if (IS_ERR(psd)) {
> >> +		ret = PTR_ERR(psd);
> >> +		goto out;
> >> +	}
> > Do you want to create the new device first and only delete the old
> > state if that succeeds? To avoid a netdevsim in state without dev.
> >
> >>   
> >> -	ns->psp.dev = psp_dev_create(ns->netdev, &nsim_psp_ops,
> >> -				     &nsim_psp_caps, ns);
> >> -	err = PTR_ERR_OR_ZERO(ns->psp.dev);
> >> +	rcu_assign_pointer(ns->psp.dev, psd);
> >> +	ret = count;
> >> +out:
> >>   	mutex_unlock(&ns->psp.rereg_lock);
> >> -	return err ?: count;
> >> +	return ret;
> >>   }
> >>   
> 
> Unfortunately, the way we have psp_dev_unregister() written, it would 
> clear out the main_netdev->psp_dev field from the second 
> psp_dev_create() call. I don't believe there is use case to do this in a 
> real driver, so I'm not sure we need to change how the create/unregister 
> paths work.

Thanks for that context.

Reviewed-by: Willem de Bruijn <willemb@google.com>



^ permalink raw reply

* Re: [PATCH net-next v2 0/3] net: Fix protodown with macvlan
From: Ido Schimmel @ 2026-05-06 20:39 UTC (permalink / raw)
  To: netdev; +Cc: davem, kuba, pabeni, edumazet, horms, petrm
In-Reply-To: <20260505081656.463158-1-idosch@nvidia.com>

On Tue, May 05, 2026 at 11:16:52AM +0300, Ido Schimmel wrote:
> When protodown is enabled on a macvlan, two bugs cause the macvlan to
> incorrectly report an UP operational state:
> 
> 1. Toggling the lower device's carrier while protodown is enabled on the
> macvlan causes the macvlan to inherit the UP operational state,
> effectively bypassing the protodown mechanism.
> 
> 2. Toggling protodown on and then off on the macvlan while the lower
> device has no carrier causes the macvlan to report UP instead of
> LOWERLAYERDOWN, since netif_change_proto_down() unconditionally turns
> the carrier on.
> 
> Patch #1 solves the first problem by making
> netif_stacked_transfer_operstate() return early when protodown is on.
> 
> Patch #2 solves the second problem by calling
> netif_stacked_transfer_operstate() instead of netif_carrier_on() when
> protodown is disabled on a net device that has a linked net device.
> 
> Patch #3 adds a selftest covering both bugs and the basic protodown
> functionality.
> 
> Targeting at net-next since these are not regressions (i.e., never
> worked).
> 
> Note that while these changes are in the core, they should only affect
> macvlan as protodown is only supported by macvlan and vxlan and only the
> former has a linked net device.
> 
> v2:
> - Move protodown handling away from drivers to the core (Jakub).
> - Add a new test case for vxlan.
> v1: https://lore.kernel.org/netdev/20260429124624.835335-1-idosch@nvidia.com/

I thought about this again and since protodown is only about carrier, I
think it's best to avoid netif_stacked_transfer_operstate(). Instead,
make netif_carrier_on() a NOP when protodown is turned on and when
protodown is turned off, only turn on the carrier if the linked net
device (assuming it exists) also has a carrier.

IOW, I will make these changes for v3:

diff --git a/net/core/dev.c b/net/core/dev.c
index 46f8a2efd982..0c272f6e9aaa 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -10169,10 +10169,8 @@ int netif_change_proto_down(struct net_device *dev, bool proto_down)
 	WRITE_ONCE(dev->proto_down, proto_down);
 	if (proto_down)
 		netif_carrier_off(dev);
-	else if (dev == iflink_dev)
+	else if (dev == iflink_dev || netif_carrier_ok(iflink_dev))
 		netif_carrier_on(dev);
-	else
-		netif_stacked_transfer_operstate(iflink_dev, dev);
 	return 0;
 }
 
@@ -11134,9 +11132,6 @@ EXPORT_SYMBOL(netdev_change_features);
 void netif_stacked_transfer_operstate(const struct net_device *rootdev,
 					struct net_device *dev)
 {
-	if (dev->proto_down)
-		return;
-
 	if (rootdev->operstate == IF_OPER_DORMANT)
 		netif_dormant_on(dev);
 	else
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index a93321db8fd7..05c250c483f0 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -609,6 +609,9 @@ static void netdev_watchdog_down(struct net_device *dev)
  */
 void netif_carrier_on(struct net_device *dev)
 {
+	if (READ_ONCE(dev->proto_down))
+		return;
+
 	if (test_and_clear_bit(__LINK_STATE_NOCARRIER, &dev->state)) {
 		if (dev->reg_state == NETREG_UNINITIALIZED)
 			return;
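
As an aside for readers, the carrier rules proposed above can be modeled
in userspace. This is a minimal sketch, not kernel code: struct model_dev
and the model_*() helpers are hypothetical names, the iflink pointer
stands in for the iflink_dev lookup, and locking, notifiers and
READ_ONCE()/WRITE_ONCE() are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for struct net_device. */
struct model_dev {
	bool proto_down;		/* models dev->proto_down */
	bool carrier;			/* models a cleared __LINK_STATE_NOCARRIER */
	struct model_dev *iflink;	/* linked (lower) device, or the device itself */
};

/* Models netif_carrier_on(): a no-op while protodown is set. */
void model_carrier_on(struct model_dev *dev)
{
	assert(dev != NULL);
	if (dev->proto_down)
		return;
	dev->carrier = true;
}

/* Models netif_carrier_off(): always allowed. */
void model_carrier_off(struct model_dev *dev)
{
	assert(dev != NULL);
	dev->carrier = false;
}

/* Models netif_change_proto_down() with the rule planned for v3. */
void model_change_proto_down(struct model_dev *dev, bool proto_down)
{
	assert(dev != NULL && dev->iflink != NULL);
	dev->proto_down = proto_down;
	if (proto_down)
		model_carrier_off(dev);
	else if (dev == dev->iflink || dev->iflink->carrier)
		model_carrier_on(dev);
}
```

With a macvlan-like device stacked on a lower device, setting protodown
and then calling model_carrier_on() leaves the carrier off, and clearing
protodown while the lower device has no carrier also leaves it off. These
are the two cases from the cover letter.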

^ permalink raw reply related

* Re: [PATCH net 02/13] i40e: Cleanup PTP pins on probe failure
From: Jacob Keller @ 2026-05-06 20:28 UTC (permalink / raw)
  To: Przemek Kitszel, Andrew Lunn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Piotr Kwapulinski,
	Aleksandr Loktionov, Arkadiusz Kubalewski, Maciej Fijalkowski,
	Michal Kubiak, Joshua Hay, Madhu Chittim, Willem de Bruijn,
	Dave Ertman, Ivan Vecera, Grzegorz Nitka
  Cc: netdev, stable, Matt Vollrath, Kohei Enju, Paul Menzel,
	Sunitha Mekala
In-Reply-To: <20260504-jk-iwl-net-2026-05-04-v1-2-a222a88bd962@intel.com>

On 5/4/2026 10:14 PM, Jacob Keller wrote:
> From: Matt Vollrath <tactii@gmail.com>
> 
> PTP pin structs are allocated early in probe, but never cleaned up.
> 
> Fix this by calling i40e_ptp_free_pins in the error path.
> 
> To support this, i40e_ptp_free_pins is added to the header and
> pin_config is correctly nullified after being freed.
> 
> This has been an issue since i40e_ptp_alloc_pins was introduced.
> 
> Fixes: 1050713026a08 ("i40e: add support for PTP external synchronization clock")
> Reported-by: Kohei Enju <kohei@enjuk.jp>
> Cc: stable@vger.kernel.org
> Signed-off-by: Matt Vollrath <tactii@gmail.com>
> Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de>
> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
> Reviewed-by: Kohei Enju <kohei@enjuk.jp>
> Tested-by: Sunitha Mekala <sunithax.d.mekala@intel.com>
> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
> ---
>  drivers/net/ethernet/intel/i40e/i40e.h      | 1 +
>  drivers/net/ethernet/intel/i40e/i40e_main.c | 1 +
>  drivers/net/ethernet/intel/i40e/i40e_ptp.c  | 3 ++-
>  3 files changed, 4 insertions(+), 1 deletion(-)
> 
Sashiko points out some related issues that predate this fix:

> This isn't a bug in this patch, but looking at i40e_probe(), the return
> value of i40e_ptp_alloc_pins() is ignored earlier in the probe sequence.
> If the memory allocation fails, pf->ptp_pins remains NULL, but the probe
> routine continues successfully.
> Later, if a user attempts to configure the PTP pins via the PTP clock ioctl
> (such as PTP_EXTTS_REQUEST), i40e_ptp_enable_pin() unconditionally dereferences
> the pointer:
> drivers/net/ethernet/intel/i40e/i40e_ptp.c:i40e_ptp_enable_pin() {
>     ...
> 	pins.sdp3_2 = pf->ptp_pins->sdp3_2;
>     ...
> }
> Could this lead to a NULL pointer dereference panic triggered by userspace?


We probably just need to add error handling for the pin allocation,
since a failure to allocate memory is likely to hit a lot of other
places as well. This is also a pre-existing issue.

> This isn't a bug in this patch, but while looking at PTP cleanup, it appears
> the pf->ptp_extts0_work work item is never canceled during device removal.
> In i40e_remove(), i40e_ptp_stop() masks the PTP event interrupts, but it
> doesn't call cancel_work_sync(&pf->ptp_extts0_work) to flush any already-queued
> work.
> i40e_remove() later cancels service_task but completely omits ptp_extts0_work.
> If an interrupt triggers and schedules the work immediately before the teardown
> path masks the interrupt, will the work item execute after the pf structure is
> completely freed, resulting in a use-after-free?

This is also a pre-existing issue in the PTP teardown that was reported
on patch 1/13 as well: we need to cancel the extts0_work item.


^ permalink raw reply

* Re: [PATCH net-next 09/12] gpio: tc956x: add TC956x/QPS615 support
From: Alex Elder @ 2026-05-06 20:25 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: andrew+netdev, davem, edumazet, kuba, pabeni, maxime.chevallier,
	rmk+kernel, andersson, konradybcio, robh, krzk+dt, conor+dt,
	linusw, brgl, arnd, gregkh, daniel, mohd.anwar, a0987203069,
	alexandre.torgue, ast, boon.khai.ng, chenchuangyu, chenhuacai,
	daniel, hawk, hkallweit1, inochiama, john.fastabend, julianbraha,
	livelycarpet87, matthew.gerlach, mcoquelin.stm32, me,
	prabhakar.mahadev-lad.rj, richardcochran, rohan.g.thomas, sdf,
	siyanteng, weishangjuan, wens, netdev, bpf, linux-arm-msm,
	devicetree, linux-gpio, linux-stm32, linux-arm-kernel,
	linux-kernel
In-Reply-To: <7d7b6b89-3ef4-4891-a794-c8b11f39db34@lunn.ch>

On 5/6/26 2:43 PM, Andrew Lunn wrote:
>>>                 ----------------------------------
>>>                 |              Host              |
>>>                 ------+...+----------+........+---
>>>                       |i2c|          |  PCIe  |
>>>       ----------------+...+----------+........+------
>>>       | TC956x        |I2C|          |upstream|     |
>>>       |               -----        --+--------+---  |
>>>       |  -----  ------  -------    | PCIe switch |  |
>>>       |  |SPI|  |GPIO|  |reset|    |             |  |
>>>       |  -----  ------  |clock|    | DS3 DS2 DS1 |  |
>>>       |                 -------    ---++--++--++--  |
>>>       |  -----  ------     downstream//    \\  \\   |  downstream
>>>       |  |MCU|  |SRAM|    /==========/      \\  \===== PCIe port 1
>>>       |  -----  ------   //PCIe port 3       \\     |
>>>       |                  ||                   \======= downstream
>>>       |  ----+-----------++-----------+----         |  PCIe port 2
>>>       |  | M | internal PCIe endpoint | M |         |
>>>       |  | S |------------------------| S |  ------ |
>>>       |  | I |   PCIe   |  |   PCIe   | I |  |UART| |
>>>       |  | G |function 0|  |function 1| G |  ------ |
>>>       |  | E |----++----|  |----++----| E |         |
>>>       |  | N |  eMAC 0  |  |  eMAC 1  | N |         |
>>>       --------+.......+------+.....+-----------------
>>>               |USXGMII|      |SGMII|
>>>             --+.......+--  --+.....+--
>>>             |  ARQ113C  |  | QEP8121 |
>>>             |    PHY    |  |   PHY   |
>>>             -------------  -----------
>>>
> 
> 
>> Because the internal endpoint won't operate until the PCIe
>> power controller has enabled power, this GPIO driver and
>> the PCIe power control driver won't interfere with each
>> other's access to the shared registers.
> 
> What i find interesting is that there are two GPIOs, and two external
> downstream PCIe ports. A naive way of looking at this is that each
> external PCIe port has one GPIO. And the internal PCIe port does not
> have one. Hence the internal port might well work without any
> additional setup?  That was my thinking.

I see what you're saying.  I don't actually know what effect those
two reset signals have on the internal PCIe endpoint or its port.

Here is what the power control driver does:
- asserts those two reset signals (via direct register writes)
     - for every port on the switch:
         - disables the port (which programs a sequence of values to
           specific addresses)
         - sets several PCIe configuration options
             - l0s_entry_delay
             - l1_entry_delay
             - TX amplitude
             - NFTS
             - disable DFE
- Finally deasserts those two reset signals again.

And "every port on the switch" is:
- USP (upstream port)
- DSP 1, 2, 3 (downstream ports, including the embedded one)
- Ethernet (which tells me maybe we need to update that driver
   to support two eMACs?)

The whole point of this power control driver is that it doesn't
actually power up the PCIe switch at all until *after* this
configuration step is complete.  So I believe the internal
endpoint and its two functions aren't powered until after the
power control driver finishes probing.

The GPIO controller is obviously alive when the power control
driver runs though.

> But you are saying it is not as simple as this, and two GPIOs affect
> three ports? Do you have any idea what they actually do?

To be honest, for the most part we haven't looked closely at
the PCIe power control driver--though it's relatively simple
and I understand how the code works...

So I don't know the answer, but I expect with some work I
might be able to find out.


To be clear, the reason you're asking is that you're suggesting
we might want to model the GPIO controller differently, correct?

I.e., model it as *not* associated with the embedded PCIe
functions.  Then we need to think about what its parent device
would be (the power control device, which I think somehow
duplicates the switch device?).

					-Alex

>        Andrew


^ permalink raw reply

* Re: [PATCH net 01/13] i40e: Cleanup PTP registration on probe failure
From: Jacob Keller @ 2026-05-06 20:24 UTC (permalink / raw)
  To: Przemek Kitszel, Andrew Lunn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Piotr Kwapulinski,
	Aleksandr Loktionov, Arkadiusz Kubalewski, Maciej Fijalkowski,
	Michal Kubiak, Joshua Hay, Madhu Chittim, Willem de Bruijn,
	Dave Ertman, Ivan Vecera, Grzegorz Nitka
  Cc: netdev, stable, Matt Vollrath, Sunitha Mekala
In-Reply-To: <20260504-jk-iwl-net-2026-05-04-v1-1-a222a88bd962@intel.com>

On 5/4/2026 10:14 PM, Jacob Keller wrote:
> From: Matt Vollrath <tactii@gmail.com>
> 
> Fix two conditions which would leak PTP registration on probe failure:
> 
> 1. i40e_setup_pf_switch can encounter an error in
>    i40e_setup_pf_filter_control, call i40e_ptp_init, then return
>    non-zero, sending i40e_probe to err_vsis.
> 
> 2. i40e_setup_misc_vector can return non-zero, sending i40e_probe to
>    err_vsis.
> 
> Both of these conditions have been present since PTP was introduced in
> this driver.
> 
> Found with coccinelle.
> 
> Fixes: beb0dff1251db ("i40e: enable PTP")
> Signed-off-by: Matt Vollrath <tactii@gmail.com>
> Tested-by: Sunitha Mekala <sunithax.d.mekala@intel.com>
> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
> ---
>  drivers/net/ethernet/intel/i40e/i40e_main.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
> index 028bd500603a..f06fcef644e5 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e_main.c
> +++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
> @@ -16108,6 +16108,7 @@ static int i40e_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
>  	/* Unwind what we've done if something failed in the setup */
>  err_vsis:
>  	set_bit(__I40E_DOWN, pf->state);
> +	i40e_ptp_stop(pf);
>  	i40e_clear_interrupt_scheme(pf);
>  	kfree(pf->vsi);
>  err_switch_setup:
> 

Sashiko points out two possible issues, both of which I think are better
served by follow-up fixes:

> This isn't a bug introduced by this patch, but is it safe to use
> i40e_ptp_stop() in this error path without canceling its workqueue task?
> The PTP initialization logic initializes a background work item
> pf->ptp_extts0_work, which is scheduled asynchronously by the hardware
> interrupt handler. The i40e_ptp_stop() function disables hardware
> interrupts and unregisters the clock without calling
> cancel_work_sync(&pf->ptp_extts0_work). 
> Could this leave the work item pending if an event fires right before
> interrupts are disabled, leading to a use-after-free when the module
> teardown frees the pf structure?


I think we need to double check that extts0_work is cancelled properly
when stopping PTP. Or confirm that it won't cause a problem as-is.

Sashiko also says:

> This isn't a bug introduced by this patch, but does this error path leak VSI
> structures and leave netdevs registered?
> If i40e_setup_pf_switch() fails late, the main VSI has already been
> allocated and its netdev registered via register_netdev(). The err_vsis
> cleanup block only executes kfree(pf->vsi), which frees the array of
> pointers but fails to invoke i40e_vsi_release(vsi) or unregister the netdev.
> Since the driver then proceeds to i40e_free_pf(pf) and frees the underlying
> driver structure, could subsequent userspace operations on the
> still-registered netdev execute driver callbacks that dereference the freed
> pf structure?

The only way that the function can fail late appears to be if
i40e_validate_filter_settings() fails, but there is a comment around
i40e_setup_pf_filter_control() which claims that we shouldn't fail the
probe just because that failed. So I think there may be an issue, but
I'm not 100% sure how to resolve it properly. It does seem like there
may be a pre-existing issue in the i40e probe error cleanup.


^ permalink raw reply

* [BUG] EM9293 PCIe (MHI) WWAN RX path broken on kernel 7.0
From: Jorge Mayorga @ 2026-05-06 20:21 UTC (permalink / raw)
  To: netdev, linux-arm-msm; +Cc: linux-wireless

Hello,

We are observing a data-plane failure with a PCIe-based WWAN modem
(Sierra Wireless EM9293) using the MHI subsystem on Linux kernel 7.0.

Environment:
  Kernel: 7.0.0-15-generic
  OS: Ubuntu 26.04
  ModemManager: 1.25.95
  Modem: Sierra Wireless EM9293
  Bus: PCIe (MHI)
  Driver: mhi_net, mhi-pci-generic

Issue summary:

Two distinct failure modes depending on protocol:

1. MBIM mode
   Interface created: mhi_hwip0
   State: UP, LOWER_UP, POINTOPOINT, NOARP
   IP assigned correctly
   Traffic:
     TX packets increase normally
     RX packets remain near zero
   Ping: 100% packet loss
   Counters:
     TX: ~15000 packets
     RX: ~7 packets
   This indicates TX path is functional but RX path is not working.

2. QMI multiplexed mode
   Interface qmapmux0.0 is created
   Traffic works briefly
   Interface is removed shortly after connection

Interpretation:
  MBIM:
    control plane OK
    TX OK
    RX not functional
  QMI:
    data path partially works
    netdev lifecycle unstable

Low-level interpretation:
  TX path: host -> mhi_net -> modem  OK
  RX path: modem -> mhi_net -> host  FAIL

Additional observations:
 - No routing issues
 - No firewall interference
 - No userspace misconfiguration
 - Same failure persists regardless of routing setup
 - USB (cdc_mbim) works correctly on same hardware

This appears to be a kernel-level issue in the MHI WWAN data path.

Likely areas:
 - mhi_net RX handling
 - downlink channel setup
 - rmnet/qmap lifecycle
 - possible race condition in netdev teardown

Request:

Any guidance on debugging the RX path in mhi_net would be appreciated.
Also confirming whether current MHI WWAN support is expected to fully
support RX on PCIe devices like EM9293.

Thanks

^ permalink raw reply

* Re: [PATCH net 3/3] netdevsim: psp: rcu protect psp_dev reference
From: Daniel Zahka @ 2026-05-06 20:14 UTC (permalink / raw)
  To: Willem de Bruijn, Jakub Kicinski, Andrew Lunn, David S. Miller,
	Eric Dumazet, Paolo Abeni, Willem de Bruijn
  Cc: netdev, linux-kernel
In-Reply-To: <willemdebruijn.kernel.173236793ca@gmail.com>


On 5/6/26 3:34 PM, Willem de Bruijn wrote:
> Daniel Zahka wrote
>>   static ssize_t
>> @@ -228,16 +237,23 @@ nsim_psp_rereg_write(struct file *file, const char __user *data, size_t count,
>>   		     loff_t *ppos)
>>   {
>>   	struct netdevsim *ns = file->private_data;
>> -	int err;
>> +	struct psp_dev *psd;
>> +	ssize_t ret;
>>   
>>   	mutex_lock(&ns->psp.rereg_lock);
>> -	__nsim_psp_uninit(ns);
>> +	__nsim_psp_uninit(ns, false);
>> +
>> +	psd = psp_dev_create(ns->netdev, &nsim_psp_ops, &nsim_psp_caps, ns);
>> +	if (IS_ERR(psd)) {
>> +		ret = PTR_ERR(psd);
>> +		goto out;
>> +	}
> Do you want to create the new device first and only delete the old
> state if that succeeds? To avoid a netdevsim in state without dev.
>
>>   
>> -	ns->psp.dev = psp_dev_create(ns->netdev, &nsim_psp_ops,
>> -				     &nsim_psp_caps, ns);
>> -	err = PTR_ERR_OR_ZERO(ns->psp.dev);
>> +	rcu_assign_pointer(ns->psp.dev, psd);
>> +	ret = count;
>> +out:
>>   	mutex_unlock(&ns->psp.rereg_lock);
>> -	return err ?: count;
>> +	return ret;
>>   }
>>   

Unfortunately, the way we have psp_dev_unregister() written, it would
clear out the main_netdev->psp_dev field from the second
psp_dev_create() call. I don't believe there is a use case to do this in
a real driver, so I'm not sure we need to change how the
create/unregister paths work.
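
The ordering problem can be illustrated with a small userspace model. It
is only a sketch under one assumption: that unregister clears the
netdev's pointer unconditionally, as described above. All model_* names
are hypothetical, and the real psp_dev_create()/psp_dev_unregister()
take different arguments and do much more.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the PSP core objects. */
struct model_netdev;

struct model_psp_dev {
	struct model_netdev *main_netdev;
};

struct model_netdev {
	struct model_psp_dev *psp_dev;	/* a single PSP device slot per netdev */
};

/* Models psp_dev_create(): bind the new device to the netdev. */
void model_psp_dev_create(struct model_psp_dev *psd, struct model_netdev *nd)
{
	assert(psd != NULL && nd != NULL);
	psd->main_netdev = nd;
	nd->psp_dev = psd;
}

/*
 * Models psp_dev_unregister() under the stated assumption: the netdev's
 * slot is cleared without checking which device it currently refers to.
 */
void model_psp_dev_unregister(struct model_psp_dev *psd)
{
	assert(psd != NULL && psd->main_netdev != NULL);
	psd->main_netdev->psp_dev = NULL;
	psd->main_netdev = NULL;
}
```

Creating the new device first and then unregistering the old one leaves
the netdev with no PSP device in this model, while unregistering first
and creating second leaves it pointing at the new device.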



^ permalink raw reply

* [PATCH net v2] genetlink: free the skb on 'group >= family->n_mcgrps'
From: Alice Ryhl @ 2026-05-06 20:07 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, Andrew Lunn, Matthew Maurer
  Cc: netdev, linux-kernel, Alice Ryhl

These methods generally consume ownership of the provided skb, so even
if an error path is encountered, the skb is freed. This is because the
very first thing they do after some initial setup is to unconditionally
consume the skb via consume_skb(skb). Any subsequent errors lead to the
core netlink layer freeing the skb.

However, there is one check that occurs before ownership is passed,
which is the check for the group index. So if this error condition is
encountered, then the skb is leaked. This error condition is generally
considered a violation of the netlink API, so it's not expected to occur
under normal circumstances. For the same reason, no callers check for
this error condition, and no callers need to be adjusted. However, we
should still follow the same ownership semantics of the rest of the
function. Thus, free the skb in this codepath.
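
The ownership rule can be sketched in plain userspace C. This is a
simplified model, not the kernel implementation: struct model_skb,
model_alloc_skb(), model_free_skb() and model_multicast() are
hypothetical names, and delivery is reduced to releasing the buffer.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct sk_buff; not the kernel type. */
struct model_skb {
	int *freed;	/* caller-owned flag, set to 1 when the buffer is released */
};

/* Allocate a model buffer and reset the caller's "freed" flag. */
struct model_skb *model_alloc_skb(int *freed)
{
	struct model_skb *skb;

	assert(freed != NULL);
	skb = malloc(sizeof(*skb));
	if (skb) {
		skb->freed = freed;
		*freed = 0;
	}
	return skb;
}

/* Release a model buffer and record that it was released. */
void model_free_skb(struct model_skb *skb)
{
	assert(skb != NULL);
	*skb->freed = 1;
	free(skb);
}

/*
 * Models a multicast helper that owns @skb on every path, including
 * the early group-index check that previously leaked it.
 */
int model_multicast(struct model_skb *skb, unsigned int group,
		    unsigned int n_mcgrps)
{
	if (group >= n_mcgrps) {
		model_free_skb(skb);	/* free on the -EINVAL path too */
		return -EINVAL;
	}
	/* Delivery is not modeled; the buffer is simply consumed. */
	model_free_skb(skb);
	return 0;
}
```

A caller can then hand over the buffer and never touch it again,
whatever the return value is.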

Assisted-by: Antigravity:gemini
Suggested-by: Andrew Lunn <andrew@lunn.ch>
Suggested-by: Matthew Maurer <mmaurer@google.com>
Fixes: 2a94fe48f32c ("genetlink: make multicast groups const, prevent abuse")
Link: https://lore.kernel.org/r/845b36ba-7b3a-41f2-acb2-b284f253e2ca@lunn.ch
Signed-off-by: Alice Ryhl <aliceryhl@google.com>
---
Changes in v2:
- Add Fixes: tag.
- Specify target branch.
- Link to v1: https://lore.kernel.org/r/20260504-genlmsg-return-v1-1-093f3ba970af@google.com
---
 include/net/genetlink.h | 4 +++-
 net/netlink/genetlink.c | 8 ++++++--
 2 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/include/net/genetlink.h b/include/net/genetlink.h
index 7b84f2cef8b1..d70510ac31ab 100644
--- a/include/net/genetlink.h
+++ b/include/net/genetlink.h
@@ -489,8 +489,10 @@ genlmsg_multicast_netns_filtered(const struct genl_family *family,
 				 netlink_filter_fn filter,
 				 void *filter_data)
 {
-	if (WARN_ON_ONCE(group >= family->n_mcgrps))
+	if (WARN_ON_ONCE(group >= family->n_mcgrps)) {
+		nlmsg_free(skb);
 		return -EINVAL;
+	}
 	group = family->mcgrp_offset + group;
 	return nlmsg_multicast_filtered(net->genl_sock, skb, portid, group,
 					flags, filter, filter_data);
diff --git a/net/netlink/genetlink.c b/net/netlink/genetlink.c
index d251d894afd4..0da39eaed255 100644
--- a/net/netlink/genetlink.c
+++ b/net/netlink/genetlink.c
@@ -1972,8 +1972,10 @@ int genlmsg_multicast_allns(const struct genl_family *family,
 			    struct sk_buff *skb, u32 portid,
 			    unsigned int group)
 {
-	if (WARN_ON_ONCE(group >= family->n_mcgrps))
+	if (WARN_ON_ONCE(group >= family->n_mcgrps)) {
+		kfree_skb(skb);
 		return -EINVAL;
+	}
 
 	group = family->mcgrp_offset + group;
 	return genlmsg_mcast(skb, portid, group);
@@ -1986,8 +1988,10 @@ void genl_notify(const struct genl_family *family, struct sk_buff *skb,
 	struct net *net = genl_info_net(info);
 	struct sock *sk = net->genl_sock;
 
-	if (WARN_ON_ONCE(group >= family->n_mcgrps))
+	if (WARN_ON_ONCE(group >= family->n_mcgrps)) {
+		kfree_skb(skb);
 		return;
+	}
 
 	group = family->mcgrp_offset + group;
 	nlmsg_notify(sk, skb, info->snd_portid, group,

---
base-commit: 7fd2df204f342fc17d1a0bfcd474b24232fb0f32
change-id: 20260504-genlmsg-return-1e5d6a74d440

Best regards,
-- 
Alice Ryhl <aliceryhl@google.com>


^ permalink raw reply related

* [PATCH net] net: ti: icssm-prueth: fix eth_ports_node leak in probe
From: Shitalkumar Gandhi @ 2026-05-06 19:58 UTC (permalink / raw)
  To: MD Danish Anwar, Parvathi Pudi
  Cc: Roger Quadros, Mohan Reddy Putluru, Jakub Kicinski,
	David S . Miller, Eric Dumazet, Paolo Abeni, Andrew Lunn,
	Simon Horman, Dan Carpenter, netdev, linux-arm-kernel,
	linux-kernel, Shitalkumar Gandhi

The error path on of_property_read_u32() failure inside
icssm_prueth_probe() returns without putting eth_ports_node,
which was acquired before the for_each_child_of_node() loop.

Drop it before returning.

Fixes: 511f6c1ae093 ("net: ti: icssm-prueth: Adds ICSSM Ethernet driver")
Signed-off-by: Shitalkumar Gandhi <shitalkumar.gandhi@cambiumnetworks.com>
---
 drivers/net/ethernet/ti/icssm/icssm_prueth.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/ethernet/ti/icssm/icssm_prueth.c b/drivers/net/ethernet/ti/icssm/icssm_prueth.c
index 53bbd9290904..b7e94244355a 100644
--- a/drivers/net/ethernet/ti/icssm/icssm_prueth.c
+++ b/drivers/net/ethernet/ti/icssm/icssm_prueth.c
@@ -1825,6 +1825,7 @@ static int icssm_prueth_probe(struct platform_device *pdev)
 			dev_err(dev, "%pOF error reading port_id %d\n",
 				eth_node, ret);
 			of_node_put(eth_node);
+			of_node_put(eth_ports_node);
 			return ret;
 		}
 
-- 
2.25.1


^ permalink raw reply related

* Re: [PATCH net-next 10/12] net: stmmac: tc956x: add TC956x/QPS615 support
From: Andrew Lunn @ 2026-05-06 19:52 UTC (permalink / raw)
  To: Daniel Thompson
  Cc: Xilin Wu, Alex Elder, andrew+netdev, davem, edumazet, kuba,
	pabeni, maxime.chevallier, rmk+kernel, andersson, konradybcio,
	robh, krzk+dt, conor+dt, linusw, brgl, arnd, gregkh, mohd.anwar,
	a0987203069, alexandre.torgue, ast, boon.khai.ng, chenchuangyu,
	chenhuacai, daniel, hawk, hkallweit1, inochiama, john.fastabend,
	julianbraha, livelycarpet87, matthew.gerlach, mcoquelin.stm32, me,
	prabhakar.mahadev-lad.rj, richardcochran, rohan.g.thomas, sdf,
	siyanteng, weishangjuan, wens, netdev, bpf, linux-arm-msm,
	devicetree, linux-gpio, linux-stm32, linux-arm-kernel,
	linux-kernel
In-Reply-To: <afteD2d8d87Nyvl7@aspen.lan>

> TL;DR - there are conceivable (and sane) hardware designs where the
> interrupt goes only to the TC9564 GPIO, but they are too different to
> RB3gen2 (and related SBC designs) for them to be supported before
> they exist!

Agreed. I'm just trying to be cautious. It would be bad to say that
WoL is supported in general, because a board might come along where it
does not work because the TC9564 itself cannot wake the system. If the
driver needs to say WoL is supported, it should try to validate the
system is using a supported set of features.

But let's get the basic features supported first; WoL can be added
later.

       Andrew

^ permalink raw reply

* Re: [PATCH net-next 09/12] gpio: tc956x: add TC956x/QPS615 support
From: Andrew Lunn @ 2026-05-06 19:43 UTC (permalink / raw)
  To: Alex Elder
  Cc: andrew+netdev, davem, edumazet, kuba, pabeni, maxime.chevallier,
	rmk+kernel, andersson, konradybcio, robh, krzk+dt, conor+dt,
	linusw, brgl, arnd, gregkh, daniel, mohd.anwar, a0987203069,
	alexandre.torgue, ast, boon.khai.ng, chenchuangyu, chenhuacai,
	daniel, hawk, hkallweit1, inochiama, john.fastabend, julianbraha,
	livelycarpet87, matthew.gerlach, mcoquelin.stm32, me,
	prabhakar.mahadev-lad.rj, richardcochran, rohan.g.thomas, sdf,
	siyanteng, weishangjuan, wens, netdev, bpf, linux-arm-msm,
	devicetree, linux-gpio, linux-stm32, linux-arm-kernel,
	linux-kernel
In-Reply-To: <0751a051-9894-45be-92d6-0d46f2c39293@riscstar.com>

> >                ----------------------------------
> >                |              Host              |
> >                ------+...+----------+........+---
> >                      |i2c|          |  PCIe  |
> >      ----------------+...+----------+........+------
> >      | TC956x        |I2C|          |upstream|     |
> >      |               -----        --+--------+---  |
> >      |  -----  ------  -------    | PCIe switch |  |
> >      |  |SPI|  |GPIO|  |reset|    |             |  |
> >      |  -----  ------  |clock|    | DS3 DS2 DS1 |  |
> >      |                 -------    ---++--++--++--  |
> >      |  -----  ------     downstream//    \\  \\   |  downstream
> >      |  |MCU|  |SRAM|    /==========/      \\  \===== PCIe port 1
> >      |  -----  ------   //PCIe port 3       \\     |
> >      |                  ||                   \======= downstream
> >      |  ----+-----------++-----------+----         |  PCIe port 2
> >      |  | M | internal PCIe endpoint | M |         |
> >      |  | S |------------------------| S |  ------ |
> >      |  | I |   PCIe   |  |   PCIe   | I |  |UART| |
> >      |  | G |function 0|  |function 1| G |  ------ |
> >      |  | E |----++----|  |----++----| E |         |
> >      |  | N |  eMAC 0  |  |  eMAC 1  | N |         |
> >      --------+.......+------+.....+-----------------
> >              |USXGMII|      |SGMII|
> >            --+.......+--  --+.....+--
> >            |  ARQ113C  |  | QEP8121 |
> >            |    PHY    |  |   PHY   |
> >            -------------  -----------
> > 


> Because the internal endpoint won't operate until the PCIe
> power controller has enabled power, this GPIO driver and
> the PCIe power control driver won't interfere with each
> other's access to the shared registers.

What I find interesting is that there are two GPIOs, and two external
downstream PCIe ports. A naive way of looking at this is that each
external PCIe port has one GPIO, and the internal PCIe port does not
have one. Hence the internal port might well work without any
additional setup. That was my thinking.

But you are saying it is not as simple as this, and two GPIOs affect
three ports? Do you have any idea what they actually do?

      Andrew

^ permalink raw reply

* Re: [PATCH v1 bpf] bpf: Free reuseport cBPF prog after RCU grace period.
From: kernel test robot @ 2026-05-06 19:36 UTC (permalink / raw)
  To: Kuniyuki Iwashima, Martin KaFai Lau, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Eduard Zingerman,
	Kumar Kartikeya Dwivedi
  Cc: oe-kbuild-all, Kuniyuki Iwashima, bpf, netdev, Eulgyu Kim
In-Reply-To: <20260424235247.1990272-1-kuniyu@google.com>

Hi Kuniyuki,

kernel test robot noticed the following build errors:

[auto build test ERROR on bpf/master]

url:    https://github.com/intel-lab-lkp/linux/commits/Kuniyuki-Iwashima/bpf-Free-reuseport-cBPF-prog-after-RCU-grace-period/20260425-093050
base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git master
patch link:    https://lore.kernel.org/r/20260424235247.1990272-1-kuniyu%40google.com
patch subject: [PATCH v1 bpf] bpf: Free reuseport cBPF prog after RCU grace period.
config: x86_64-rhel-9.4-ltp (https://download.01.org/0day-ci/archive/20260506/202605062104.y7Xps52z-lkp@intel.com/config)
compiler: gcc-14 (Debian 14.2.0-19) 14.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260506/202605062104.y7Xps52z-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add the following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202605062104.y7Xps52z-lkp@intel.com/

All errors (new ones prefixed by >>):

>> net/core/filter.c:1666:6: error: conflicting types for 'sk_reuseport_prog_free'; have 'void(struct bpf_prog *, bool)' {aka 'void(struct bpf_prog *, _Bool)'}
    1666 | void sk_reuseport_prog_free(struct bpf_prog *prog, bool wait_rcu)
         |      ^~~~~~~~~~~~~~~~~~~~~~
   In file included from include/linux/bpf_verifier.h:9,
                    from net/core/filter.c:21:
   include/linux/filter.h:1146:6: note: previous declaration of 'sk_reuseport_prog_free' with type 'void(struct bpf_prog *)'
    1146 | void sk_reuseport_prog_free(struct bpf_prog *prog);
         |      ^~~~~~~~~~~~~~~~~~~~~~


vim +1666 net/core/filter.c

  1665	
> 1666	void sk_reuseport_prog_free(struct bpf_prog *prog, bool wait_rcu)
  1667	{
  1668		if (!prog)
  1669			return;
  1670	
  1671		if (bpf_prog_was_classic(prog))
  1672			call_rcu(&prog->aux->rcu, sk_reuseport_prog_free_rcu);
  1673		else
  1674			bpf_prog_put(prog);
  1675	}
  1676	
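
The error comes from the definition taking two parameters while the
prototype in include/linux/filter.h still takes one. As a minimal
userspace sketch (struct bpf_prog here is a stand-in with hypothetical
fields, and the counters only model call_rcu()/bpf_prog_put()), a
prototype and definition that agree look like this:

```c
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for the kernel's struct bpf_prog; all fields are hypothetical. */
struct bpf_prog {
	bool was_classic;	/* models bpf_prog_was_classic() */
	int freed_via_rcu;	/* counts modelled call_rcu() deferrals */
	int put;		/* counts modelled bpf_prog_put() calls */
};

/* Prototype with the same two-parameter signature as the definition. */
void sk_reuseport_prog_free(struct bpf_prog *prog, bool wait_rcu);

/* Definition; wait_rcu is unused here, as in the listing above. */
void sk_reuseport_prog_free(struct bpf_prog *prog, bool wait_rcu)
{
	(void)wait_rcu;

	if (!prog)
		return;

	if (prog->was_classic)
		prog->freed_via_rcu++;
	else
		prog->put++;
}
```

Any other caller of the one-argument form would need the same update;
which callers exist is not shown in this report.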

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply

* Re: [PATCH net 3/3] netdevsim: psp: rcu protect psp_dev reference
From: Willem de Bruijn @ 2026-05-06 19:34 UTC (permalink / raw)
  To: Daniel Zahka, Jakub Kicinski, Andrew Lunn, David S. Miller,
	Eric Dumazet, Paolo Abeni, Willem de Bruijn, Willem de Bruijn
  Cc: netdev, linux-kernel
In-Reply-To: <20260505-psd-rcu-v1-3-a8f69ec1ab96@gmail.com>

Daniel Zahka wrote:
> There are two issues with the way psp_dev is used in nsim_do_psp():
> 
> 1. There is no IS_ERR() check on the peer's psp_dev before
>    dereferencing it.
> 2. The refcount on this psp_dev can be dropped by
>    nsim_psp_rereg_write()
> 
> To fix this, we can make netdevsim's reference to its psp_dev an rcu
> reference, and then nsim_do_psp() can read the fields it needs from an
> rcu critical section.
> 
> Fixes: f857478d6206 ("netdevsim: a basic test PSP implementation")
> Assisted-by: Claude:claude-opus-4.6
> Signed-off-by: Daniel Zahka <daniel.zahka@gmail.com>

>  static ssize_t
> @@ -228,16 +237,23 @@ nsim_psp_rereg_write(struct file *file, const char __user *data, size_t count,
>  		     loff_t *ppos)
>  {
>  	struct netdevsim *ns = file->private_data;
> -	int err;
> +	struct psp_dev *psd;
> +	ssize_t ret;
>  
>  	mutex_lock(&ns->psp.rereg_lock);
> -	__nsim_psp_uninit(ns);
> +	__nsim_psp_uninit(ns, false);
> +
> +	psd = psp_dev_create(ns->netdev, &nsim_psp_ops, &nsim_psp_caps, ns);
> +	if (IS_ERR(psd)) {
> +		ret = PTR_ERR(psd);
> +		goto out;
> +	}

Do you want to create the new device first and only delete the old
state if that succeeds? That would avoid leaving a netdevsim in a
state without a dev.
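
For illustration only, a userspace sketch of that ordering. The
fake_* names are hypothetical stand-ins, malloc()/free() model
psp_dev_create() and __nsim_psp_uninit(), and failure is signalled
with NULL rather than ERR_PTR(). Whether psp_dev_create() can run
while the old device is still registered is not established in this
thread, so treat this as the shape of the idea rather than a drop-in
change:

```c
#include <stdlib.h>

/* Hypothetical stand-ins for struct psp_dev and the netdevsim PSP state. */
struct fake_dev { int id; };
struct fake_ns { struct fake_dev *dev; };

/* Models psp_dev_create(); returns NULL on failure instead of ERR_PTR(). */
static struct fake_dev *fake_dev_create(int id, int should_fail)
{
	struct fake_dev *d;

	if (should_fail)
		return NULL;

	d = malloc(sizeof(*d));
	if (d)
		d->id = id;
	return d;
}

/* Create the replacement first; tear down the old device only on success. */
static int fake_rereg(struct fake_ns *ns, int new_id, int should_fail)
{
	struct fake_dev *fresh = fake_dev_create(new_id, should_fail);

	if (!fresh)
		return -1;	/* old device stays in place */

	free(ns->dev);		/* models __nsim_psp_uninit() */
	ns->dev = fresh;	/* models rcu_assign_pointer() */
	return 0;
}
```

With this ordering a failed create leaves ns->dev pointing at the old
device, so readers never observe a netdevsim without one.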

>  
> -	ns->psp.dev = psp_dev_create(ns->netdev, &nsim_psp_ops,
> -				     &nsim_psp_caps, ns);
> -	err = PTR_ERR_OR_ZERO(ns->psp.dev);
> +	rcu_assign_pointer(ns->psp.dev, psd);
> +	ret = count;
> +out:
>  	mutex_unlock(&ns->psp.rereg_lock);
> -	return err ?: count;
> +	return ret;
>  }
>  


^ permalink raw reply

* Re: [PATCH net-next v2 2/2] net: bcmasp: Keep PHY link during WoL sleep cycle
From: Justin Chen @ 2026-05-06 19:34 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: netdev, bcm-kernel-feedback-list, pabeni, edumazet, davem,
	andrew+netdev, florian.fainelli
In-Reply-To: <20260506002536.1521518-1-kuba@kernel.org>



On 5/5/26 5:25 PM, Jakub Kicinski wrote:
> This is an AI-generated review of your patch. The human sending this
> email has considered the AI review valid, or at least plausible.
> ---
> net: bcmasp: Keep PHY link during WoL sleep cycle
> 
> This patch splits the suspend/resume paths so that when WoL is enabled
> the PHY is left running and the MAC is only partially re-initialized on
> resume. phy_stop() is skipped during suspend and phy_start() is replaced
> with a forced relink via phy_trigger_machine() when the MAC was reset.
> 
>> diff --git a/drivers/net/ethernet/broadcom/asp2/bcmasp_intf.c b/drivers/net/ethernet/broadcom/asp2/bcmasp_intf.c
>> index e2b51ec903af..9ffd0dfa5969 100644
>> --- a/drivers/net/ethernet/broadcom/asp2/bcmasp_intf.c
>> +++ b/drivers/net/ethernet/broadcom/asp2/bcmasp_intf.c
> 
> [ ... ]
> 
>> @@ -946,7 +946,8 @@ static void bcmasp_netif_deinit(struct net_device *dev)
>>   
>>   	umac_enable_set(intf, UMC_CMD_TX_EN, 0);
>>   
>> -	phy_stop(dev->phydev);
>> +	if (stop_phy)
>> +		phy_stop(dev->phydev);
> 
> Skipping phy_stop() in the WoL path means phy_stop_machine() and the
> cancel_delayed_work_sync() on phydev->state_queue are no longer executed
> before the ASP clock is disabled in bcmasp_interface_suspend().
> 
> The PHY state machine runs on system_power_efficient_wq, which is not
> WQ_FREEZABLE, so task freeze does not stop it.  If a PHY IRQ fires (or
> the polling timer expires) between clk_disable_unprepare(intf->parent->clk)
> and the matching clk_prepare_enable() on resume, the state machine can
> reach bcmasp_adj_link() via phy_check_link_status() ->
> phy_link_up()/phy_link_down() -> phydev->adjust_link.
> 
> bcmasp_adj_link() reads and writes UMC_CMD, UMC_EEE_LPI_TIMER,
> UMC_EEE_CTRL and RGMII_OOB_CNTRL, which sit behind the disabled ASP
> clock.  Can those accesses hang the bus or return 0xffffffff on these
> SoCs?
> 
> bcmgenet pins phydev->state = PHY_READY during WoL suspend specifically
> to freeze the state machine in a no-op state.  Is an equivalent
> protection needed here?
> 

Will mirror something closer to bcmgenet for v3.

>> @@ -1385,15 +1386,20 @@ int bcmasp_interface_suspend(struct bcmasp_intf *intf)
>>   {
>>   	struct device *kdev = &intf->parent->pdev->dev;
>>   	struct net_device *dev = intf->ndev;
>> +	bool wake;
>>   
>>   	if (!netif_running(dev))
>>   		return 0;
>>   
>>   	netif_device_detach(dev);
>>   
>> -	bcmasp_netif_deinit(dev);
>> +	wake = device_may_wakeup(kdev) && intf->wolopts;
>>   
>> -	if (!intf->wolopts) {
>> +	bcmasp_netif_deinit(dev, !wake);
>> +
>> +	if (wake) {
>> +		bcmasp_suspend_to_wol(intf);
>> +	} else {
>>   		bcmasp_phy_hw_unprepare(intf);
> 
> [ ... ]
> 
>> @@ -1438,17 +1444,34 @@ int bcmasp_interface_resume(struct bcmasp_intf *intf)
>>   	if (ret)
>>   		return ret;
>>   
>> -	bcmasp_core_clock_set_intf(intf, true);
>> -
>> -	bcmasp_resume_from_wol(intf);
>> +	wake = device_may_wakeup(kdev) && intf->wolopts;
>>   
>> -	bcmasp_phy_hw_prepare(intf);
>> +	bcmasp_core_clock_set_intf(intf, true);
>>   
>> -	umac_reset_and_init(intf, dev->dev_addr);
>> +	/* The interface might be HW reset in some suspend modes, so we may
>> +	 * need to restore the UNIMAC/PHY if that is the case.
>> +	 */
>> +	reg = umac_rl(intf, UMC_CMD);
>> +	if (wake && (reg & UMC_CMD_RX_EN)) {
>> +		umac_enable_set(intf, UMC_CMD_TX_EN, 1);
>> +		bcmasp_resume_from_wol(intf);
>> +	} else {
>> +		bcmasp_phy_hw_prepare(intf);
>> +		umac_reset_and_init(intf, dev->dev_addr);
>> +	}
> 
> In the wake && !(reg & UMC_CMD_RX_EN) branch (MAC was HW reset during
> suspend), bcmasp_resume_from_wol() is not called.

In this case, the HW is reset. So bcmasp_resume_from_wol() doesn't need 
to restore what bcmasp_suspend_to_wol() did as it was erased.

Same thing for PROMISC. The default state is SW_RESET, so when we force 
the state machine to run again it will reset PROMISC. So everything 
below should be irrelevant. I also verified on resume in this case the 
registers are the same, so we resumed back to the original state.
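
To make the three resume cases in the quoted hunk explicit, here is a
small sketch of the decision. The enum and function names are
hypothetical; only the conditions (wake, and UMC_CMD_RX_EN still set
in UMC_CMD) come from the patch:

```c
#include <stdbool.h>

enum resume_path {
	RESUME_FAST_WOL,	/* MAC still enabled: re-enable TX, undo WoL setup */
	RESUME_WOL_HW_RESET,	/* WoL armed but MAC was reset: reinit, force relink */
	RESUME_COLD,		/* no WoL: full reinit, then phy_start() */
};

/* wake: device_may_wakeup() && wolopts; rx_en: UMC_CMD_RX_EN read back. */
static enum resume_path pick_resume_path(bool wake, bool rx_en)
{
	if (wake && rx_en)
		return RESUME_FAST_WOL;
	if (wake)
		return RESUME_WOL_HW_RESET;
	return RESUME_COLD;
}
```

The second case is the one discussed here: the hardware reset has
already erased what bcmasp_suspend_to_wol() programmed.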

> 
> bcmasp_suspend_to_wol() unconditionally does two things that live in
> the shared parent register block, not the per-interface UMC block:
> 
> 	wakeup_intr2_core_wl(intf->parent, 0xffffffff,
> 			     ASP_WAKEUP_INTR2_MASK_CLEAR);
> 
> 	if (ndev->phydev && ndev->phydev->eee_cfg.eee_enabled &&
> 	    intf->parent->eee_fixup)
> 		intf->parent->eee_fixup(intf, true);
> 
> The wakeup interrupt mask and the EEE fixup bits (phy_lpi_overwrite in
> ASP_EDPKT_SPARE_REG) are not cleared by a reset that only wipes UMC_CMD.
> 
> In this branch the wakeup IRQ remains unmasked (bcmasp_isr_wol would
> then fire on subsequent MPD/FILT events during normal traffic) and the
> EEE fixup stays asserted.  Should this branch also invoke the parts of
> bcmasp_resume_from_wol() / eee_fixup(false) that live in the shared
> register block?
> 
> Separately, bcmasp_suspend_to_wol() explicitly programs
> UMC_CMD_PROMISC:
> 
> 	reg |= UMC_CMD_RX_EN | UMC_CMD_PROMISC;
> 	umac_wl(intf, reg, UMC_CMD);
> 
> In the wake && (reg & UMC_CMD_RX_EN) fast-resume branch, the only
> writes to UMC_CMD are:
> 
> 	umac_enable_set(intf, UMC_CMD_TX_EN, 1);
> 
> and whatever bcmasp_resume_from_wol() does, which only touches
> UMC_MPD_CTRL and ASP_WAKEUP_INTR2_MASK_SET.  bcmasp_netif_init() calls
> bcmasp_set_rx_mode() which manages ASP_RX_FILTER_MDA but not
> UMC_CMD_PROMISC, and bcmasp_adj_link() only reprograms PROMISC inside
> its if (reg & UMC_CMD_SW_RESET) branch (not taken here).  Since
> old_link/old_duplex/old_pause are preserved across the fast path,
> adj_link's changed flag stays zero and it returns without rewriting
> UMC_CMD.
> 
> Does this leave the MAC stuck in UMC_CMD_PROMISC after every WoL
> resume, until the interface is toggled down/up?
> 
>>   
>>   	bcmasp_netif_init(dev);
>>   
>> -	phy_start(dev->phydev);
>> +	/* If HW was reset, we need to force a relink */
>> +	if (wake && !(reg & UMC_CMD_RX_EN)) {
>> +		mutex_lock(&dev->phydev->lock);
>> +		if (dev->phydev->state == PHY_RUNNING)
>> +			dev->phydev->state = PHY_UP;
>> +		mutex_unlock(&dev->phydev->lock);
>> +		phy_trigger_machine(dev->phydev);
>> +	} else if (!wake) {
>> +		phy_start(dev->phydev);
>> +	}
> 
> The forced-relink branch only demotes PHY_RUNNING -> PHY_UP before
> phy_trigger_machine().  Since phy_stop() was skipped during suspend,
> phydev->state is whatever it was at suspend time, which may instead be
> PHY_NOLINK (cable was unplugged), PHY_HALTED or PHY_ERROR.
> 
> For PHY_NOLINK, _phy_state_machine() runs phy_check_link_status() +
> phy_update_stats() only; it does not trigger _phy_start_aneg() (that is
> the PHY_UP case).  If the HW reset that cleared UMC_CMD_RX_EN also
> touched the PHY chip (shared reset/power domain), autoneg may need to
> be restarted.
> 
> Does this path handle states other than PHY_RUNNING, and does it
> guarantee the PHY is reinitialized and autoneg restarted when the MAC
> has been reset?
> 

Will mirror to be more like bcmgenet in v3. The PHY chip is not reset or 
powered off in the WoL case, so we do not need to restart autoneg. It 
should look similar to the non HW reset case except we need to reprogram 
the UNIMAC.

Thanks,
Justin

>> +
>>   	netif_device_attach(dev);
>>   
>>   	return 0;


^ permalink raw reply

* Re: [PATCH net 2/3] netdevsim: psp: serialize calls to nsim_psp_uninit()
From: Willem de Bruijn @ 2026-05-06 19:31 UTC (permalink / raw)
  To: Daniel Zahka, Jakub Kicinski, Andrew Lunn, David S. Miller,
	Eric Dumazet, Paolo Abeni, Willem de Bruijn, Willem de Bruijn
  Cc: netdev, linux-kernel
In-Reply-To: <20260505-psd-rcu-v1-2-a8f69ec1ab96@gmail.com>

Daniel Zahka wrote:
> The debugfs write handler, nsim_psp_rereg_write(), can race against
> nsim_destroy() and against itself, causing nsim_psp_uninit() to run
> more than once concurrently. Two complementary changes serialize all
> callers:

nit: this sounds and looks as if it would be simpler as two patches,
but no need to respin just for that.

> 
> 1. Delete the psp_rereg debugfs file from nsim_psp_uninit() before
>    doing the actual teardown. debugfs_remove() drains any in-flight
>    writers and prevents new ones from starting.
> 
> 2. Add a mutex around the body of nsim_psp_rereg_write() so that two
>    concurrent userspace writers cannot both enter the teardown path
>    at once.
> 
> The teardown work itself is moved into a new __nsim_psp_uninit() that
> the rereg handler calls under the mutex, while the public
> nsim_psp_uninit() wraps it with the debugfs_remove()/mutex_destroy()
> pair so nsim_destroy() doesn't have to know about the psp internals.
> 
> Fixes: f857478d6206 ("netdevsim: a basic test PSP implementation")
> Assisted-by: Claude:claude-opus-4.6
> Signed-off-by: Daniel Zahka <daniel.zahka@gmail.com>

Reviewed-by: Willem de Bruijn <willemb@google.com>

^ permalink raw reply

* Re: [PATCH net 1/3] netdevsim: psp: only call nsim_psp_uninit() on PFs
From: Willem de Bruijn @ 2026-05-06 19:30 UTC (permalink / raw)
  To: Daniel Zahka, Jakub Kicinski, Andrew Lunn, David S. Miller,
	Eric Dumazet, Paolo Abeni, Willem de Bruijn, Willem de Bruijn
  Cc: netdev, linux-kernel
In-Reply-To: <20260505-psd-rcu-v1-1-a8f69ec1ab96@gmail.com>

Daniel Zahka wrote:
> VFs go through nsim_init_netdevsim_vf() which never calls
> nsim_psp_init(), so ns->psp.dev stays NULL. nsim_psp_uninit() guards
> with !IS_ERR(ns->psp.dev), so destroying a VF reaches
> psp_dev_unregister(NULL) and dereferences NULL on the first
> mutex_lock(&psd->lock):
> 
>   BUG: kernel NULL pointer dereference, address: 0000000000000020
>   RIP: 0010:mutex_lock+0x1c/0x30
>   Call Trace:
>    psp_dev_unregister+0x2a/0x1a0
>    nsim_psp_uninit+0x1f/0x40 [netdevsim]
>    nsim_destroy+0x61/0x1e0 [netdevsim]
>    __nsim_dev_port_del+0x47/0x90 [netdevsim]
>    nsim_drv_configure_vfs+0xc9/0x130 [netdevsim]
>    nsim_bus_dev_numvfs_store+0x79/0xb0 [netdevsim]
> 
> Gate nsim_psp_uninit() on nsim_dev_port_is_pf(), matching the pattern
> already used for nsim_exit_netdevsim() and the bpf/ipsec/macsec/queue
> teardowns.
> 
> Reproducer:
>   modprobe netdevsim
>   echo "10 1" > /sys/bus/netdevsim/new_device
>   echo 1 > /sys/bus/netdevsim/devices/netdevsim10/sriov_numvfs
>   devlink dev eswitch set netdevsim/netdevsim10 mode switchdev
>   echo 0 > /sys/bus/netdevsim/devices/netdevsim10/sriov_numvfs
> 
> Fixes: f857478d6206 ("netdevsim: a basic test PSP implementation")
> Assisted-by: Claude:claude-opus-4.6
> Signed-off-by: Daniel Zahka <daniel.zahka@gmail.com>

Reviewed-by: Willem de Bruijn <willemb@google.com>
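
For readers wondering why the !IS_ERR() guard lets a NULL through: the
check only rejects pointers in the top error range. A userspace
re-creation of that range check (is_err() and the two guard helpers
are hypothetical names, not kernel functions) shows the difference
between the old guard and one that also rejects NULL:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* Userspace re-creation of the kernel's IS_ERR() range check. */
static bool is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Guard as described in the commit message: only rejects ERR_PTR values. */
static bool old_guard_allows_teardown(const void *dev)
{
	return !is_err(dev);
}

/* Stricter guard that also rejects NULL, in the spirit of IS_ERR_OR_NULL(). */
static bool strict_guard_allows_teardown(const void *dev)
{
	return dev && !is_err(dev);
}
```

Note the patch does not take this route. It gates the call on
nsim_dev_port_is_pf() instead, which matches the other teardowns.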

^ permalink raw reply

* [net-next v7 2/3] net: ethernet: mtk_eth_soc: Add RSS support
From: Frank Wunderlich @ 2026-05-06 19:28 UTC (permalink / raw)
  To: Felix Fietkau, Lorenzo Bianconi, Andrew Lunn, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Matthias Brugger,
	AngeloGioacchino Del Regno, Russell King
  Cc: Frank Wunderlich, netdev, linux-kernel, linux-arm-kernel,
	linux-mediatek, Mason Chang, Daniel Golle
In-Reply-To: <20260506192806.143725-1-linux@fw-web.de>

From: Mason Chang <mason-cw.chang@mediatek.com>

Add support for Receive Side Scaling.

SMP affinity can be adjusted with the following command:
echo [CPU bitmap num] > /proc/irq/[virtual IRQ ID]/smp_affinity
With interrupts evenly assigned to 4 CPUs, we were able to measure
an RX throughput of 7.3Gbps using iperf3 on the MT7988. Further
optimizations will be carried out in the future.

The test used the following commands:
PC: iperf3 -c [IP] -P 10
DUT: iperf3 -s

The entire indirection table can be imagined as 128 buckets. The
ethtool command selects which RX ring receives the packets that
hash into each bucket.

Show RSS RX ring parameters in indirection table and RSS hash key:
ethtool -x [interface]
Change RSS RX rings weight under uniform distribution:
ethtool --set-rxfh-indir [interface] equal [ring num]
Change RSS RX rings weight under non-uniform distribution:
ethtool --set-rxfh-indir [interface] weight [ring0 weight]
[ring1 weight] [ring2 weight] [ring3 weight]
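
As a rough userspace sketch of how the 128 buckets map to registers:
the patch packs sixteen 2-bit ring IDs into each 32-bit word, and the
default table is a round-robin over the rings (index % n_rings, which
is what ethtool_rxfh_indir_default() computes). The names below are
hypothetical and the "& 0x3" mask is an addition for safety that the
patch itself does not have:

```c
#include <stdint.h>

#define INDIR_ENTRIES 128	/* bucket count from the commit message */
#define ENTRIES_PER_WORD 16	/* sixteen 2-bit ring IDs per 32-bit word */

/* Round-robin default; n_rings must be non-zero. */
static void indir_fill_default(uint8_t *table, unsigned int n_rings)
{
	for (unsigned int i = 0; i < INDIR_ENTRIES; i++)
		table[i] = (uint8_t)(i % n_rings);
}

/* Pack table entries 16*index .. 16*index+15 into one 32-bit word. */
static uint32_t indir_pack_word(const uint8_t *table, unsigned int index)
{
	uint32_t val = 0;

	for (unsigned int i = 0; i < ENTRIES_PER_WORD; i++)
		val |= (uint32_t)(table[ENTRIES_PER_WORD * index + i] & 0x3) << (2 * i);

	return val;
}
```

With 4 rings every word packs the repeating pattern 0,1,2,3, so all
eight words come out identical.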

Signed-off-by: Mason Chang <mason-cw.chang@mediatek.com>
Signed-off-by: Frank Wunderlich <frank-w@public-files.de>
---
v6:
- e33bd8dd7f1f ("net: mediatek: convert to use .get_rx_ring_count") moved
  ETHTOOL_GRXRINGS handling from mtk_get_rxnfc to mtk_get_rx_ring_count;
  move the changes to this new function too
- fix some "Macro argument '...' may be better as '(...)' to avoid precedence
  issues" warnings

v5:
- fix too long line reported by checkpatch
  MTK_RSS_HASH_KEY_DW
  MTK_RSS_INDR_TABLE_DW
  MTK_LRO_CTRL_DW[123]_CFG

v4:
- drop unrelated file
- rss-changes suggested by andrew
  - fix MTK_HW_LRO_RING_NUM macro (add eth)
  - fix MTK_LRO_CTRL_DW[123]_CFG (add reg_map param)
  - fix MTK_RX_DONE_INT (add eth param)

v3:
- changes requested by jakub
- readded rss fix for mt7986
- name all PDMA-IRQ the same way

v2:
- drop wrong change (MTK_CDMP_IG_CTRL is only netsys v1)
- Fix immutable string IRQ setup (thx to Emilia Schotte)
- drop link to no longer existing 6.6 patch in comment
---
 drivers/net/ethernet/mediatek/mtk_eth_soc.c | 544 +++++++++++++++-----
 drivers/net/ethernet/mediatek/mtk_eth_soc.h |  98 +++-
 2 files changed, 496 insertions(+), 146 deletions(-)

diff --git a/drivers/net/ethernet/mediatek/mtk_eth_soc.c b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
index d25e0b96c26e..908fd88287ac 100644
--- a/drivers/net/ethernet/mediatek/mtk_eth_soc.c
+++ b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
@@ -1297,6 +1297,7 @@ static bool mtk_rx_get_desc(struct mtk_eth *eth, struct mtk_rx_dma_v2 *rxd,
 	if (mtk_is_netsys_v3_or_greater(eth)) {
 		rxd->rxd5 = READ_ONCE(dma_rxd->rxd5);
 		rxd->rxd6 = READ_ONCE(dma_rxd->rxd6);
+		rxd->rxd7 = READ_ONCE(dma_rxd->rxd7);
 	}
 
 	return true;
@@ -1864,47 +1865,9 @@ static netdev_tx_t mtk_start_xmit(struct sk_buff *skb, struct net_device *dev)
 	return NETDEV_TX_OK;
 }
 
-static struct mtk_rx_ring *mtk_get_rx_ring(struct mtk_eth *eth)
+static void mtk_update_rx_cpu_idx(struct mtk_eth *eth, struct mtk_rx_ring *ring)
 {
-	int i;
-	struct mtk_rx_ring *ring;
-	int idx;
-
-	if (!eth->hwlro)
-		return &eth->rx_ring[0];
-
-	for (i = 0; i < MTK_MAX_RX_RING_NUM; i++) {
-		struct mtk_rx_dma *rxd;
-
-		ring = &eth->rx_ring[i];
-		idx = NEXT_DESP_IDX(ring->calc_idx, ring->dma_size);
-		rxd = ring->dma + idx * eth->soc->rx.desc_size;
-		if (rxd->rxd2 & RX_DMA_DONE) {
-			ring->calc_idx_update = true;
-			return ring;
-		}
-	}
-
-	return NULL;
-}
-
-static void mtk_update_rx_cpu_idx(struct mtk_eth *eth)
-{
-	struct mtk_rx_ring *ring;
-	int i;
-
-	if (!eth->hwlro) {
-		ring = &eth->rx_ring[0];
-		mtk_w32(eth, ring->calc_idx, ring->crx_idx_reg);
-	} else {
-		for (i = 0; i < MTK_MAX_RX_RING_NUM; i++) {
-			ring = &eth->rx_ring[i];
-			if (ring->calc_idx_update) {
-				ring->calc_idx_update = false;
-				mtk_w32(eth, ring->calc_idx, ring->crx_idx_reg);
-			}
-		}
-	}
+	mtk_w32(eth, ring->calc_idx, ring->crx_idx_reg);
 }
 
 static bool mtk_page_pool_enabled(struct mtk_eth *eth)
@@ -1935,7 +1898,7 @@ static struct page_pool *mtk_create_page_pool(struct mtk_eth *eth,
 		return pp;
 
 	err = __xdp_rxq_info_reg(xdp_q, eth->dummy_dev, id,
-				 eth->rx_napi.napi_id, PAGE_SIZE);
+				 eth->rx_napi[id].napi.napi_id, PAGE_SIZE);
 	if (err < 0)
 		goto err_free_pp;
 
@@ -2224,7 +2187,8 @@ static int mtk_poll_rx(struct napi_struct *napi, int budget,
 		       struct mtk_eth *eth)
 {
 	struct dim_sample dim_sample = {};
-	struct mtk_rx_ring *ring;
+	struct mtk_napi *rx_napi = container_of(napi, struct mtk_napi, napi);
+	struct mtk_rx_ring *ring = rx_napi->rx_ring;
 	bool xdp_flush = false;
 	int idx;
 	struct sk_buff *skb;
@@ -2235,16 +2199,15 @@ static int mtk_poll_rx(struct napi_struct *napi, int budget,
 	dma_addr_t dma_addr = DMA_MAPPING_ERROR;
 	int ppe_idx = 0;
 
+	if (unlikely(!ring))
+		goto rx_done;
+
 	while (done < budget) {
 		unsigned int pktlen, *rxdcsum;
 		struct net_device *netdev;
 		u32 hash, reason;
 		int mac = 0;
 
-		ring = mtk_get_rx_ring(eth);
-		if (unlikely(!ring))
-			goto rx_done;
-
 		idx = NEXT_DESP_IDX(ring->calc_idx, ring->dma_size);
 		rxd = ring->dma + idx * eth->soc->rx.desc_size;
 		data = ring->data[idx];
@@ -2436,7 +2399,7 @@ static int mtk_poll_rx(struct napi_struct *napi, int budget,
 		 * we continue
 		 */
 		wmb();
-		mtk_update_rx_cpu_idx(eth);
+		mtk_update_rx_cpu_idx(eth, ring);
 	}
 
 	eth->rx_packets += done;
@@ -2645,7 +2608,9 @@ static int mtk_napi_tx(struct napi_struct *napi, int budget)
 
 static int mtk_napi_rx(struct napi_struct *napi, int budget)
 {
-	struct mtk_eth *eth = container_of(napi, struct mtk_eth, rx_napi);
+	struct mtk_napi *rx_napi = container_of(napi, struct mtk_napi, napi);
+	struct mtk_eth *eth = rx_napi->eth;
+	struct mtk_rx_ring *ring = rx_napi->rx_ring;
 	const struct mtk_reg_map *reg_map = eth->soc->reg_map;
 	int rx_done_total = 0;
 
@@ -2654,7 +2619,7 @@ static int mtk_napi_rx(struct napi_struct *napi, int budget)
 	do {
 		int rx_done;
 
-		mtk_w32(eth, eth->soc->rx.irq_done_mask,
+		mtk_w32(eth, MTK_RX_DONE_INT(eth, ring->ring_no),
 			reg_map->pdma.irq_status);
 		rx_done = mtk_poll_rx(napi, budget - rx_done_total, eth);
 		rx_done_total += rx_done;
@@ -2670,10 +2635,10 @@ static int mtk_napi_rx(struct napi_struct *napi, int budget)
 			return budget;
 
 	} while (mtk_r32(eth, reg_map->pdma.irq_status) &
-		 eth->soc->rx.irq_done_mask);
+		 MTK_RX_DONE_INT(eth, ring->ring_no));
 
 	if (napi_complete_done(napi, rx_done_total))
-		mtk_rx_irq_enable(eth, eth->soc->rx.irq_done_mask);
+		mtk_rx_irq_enable(eth, MTK_RX_DONE_INT(eth, ring->ring_no));
 
 	return rx_done_total;
 }
@@ -2918,6 +2883,7 @@ static int mtk_rx_alloc(struct mtk_eth *eth, int ring_no, int rx_flag)
 	else
 		ring->crx_idx_reg = reg_map->pdma.pcrx_ptr +
 				    ring_no * MTK_QRX_OFFSET;
+	ring->ring_no = ring_no;
 	/* make sure that all changes to the dma ring are flushed before we
 	 * continue
 	 */
@@ -2986,6 +2952,7 @@ static void mtk_rx_clean(struct mtk_eth *eth, struct mtk_rx_ring *ring, bool in_
 
 static int mtk_hwlro_rx_init(struct mtk_eth *eth)
 {
+	const struct mtk_reg_map *reg_map = eth->soc->reg_map;
 	int i;
 	u32 ring_ctrl_dw1 = 0, ring_ctrl_dw2 = 0, ring_ctrl_dw3 = 0;
 	u32 lro_ctrl_dw0 = 0, lro_ctrl_dw3 = 0;
@@ -3008,9 +2975,9 @@ static int mtk_hwlro_rx_init(struct mtk_eth *eth)
 	ring_ctrl_dw3 |= MTK_RING_MAX_AGG_CNT_H;
 
 	for (i = 1; i < MTK_MAX_RX_RING_NUM; i++) {
-		mtk_w32(eth, ring_ctrl_dw1, MTK_LRO_CTRL_DW1_CFG(i));
-		mtk_w32(eth, ring_ctrl_dw2, MTK_LRO_CTRL_DW2_CFG(i));
-		mtk_w32(eth, ring_ctrl_dw3, MTK_LRO_CTRL_DW3_CFG(i));
+		mtk_w32(eth, ring_ctrl_dw1, MTK_LRO_CTRL_DW1_CFG(reg_map, i));
+		mtk_w32(eth, ring_ctrl_dw2, MTK_LRO_CTRL_DW2_CFG(reg_map, i));
+		mtk_w32(eth, ring_ctrl_dw3, MTK_LRO_CTRL_DW3_CFG(reg_map, i));
 	}
 
 	/* IPv4 checksum update enable */
@@ -3046,6 +3013,7 @@ static int mtk_hwlro_rx_init(struct mtk_eth *eth)
 
 static void mtk_hwlro_rx_uninit(struct mtk_eth *eth)
 {
+	const struct mtk_reg_map *reg_map = eth->soc->reg_map;
 	int i;
 	u32 val;
 
@@ -3064,7 +3032,7 @@ static void mtk_hwlro_rx_uninit(struct mtk_eth *eth)
 
 	/* invalidate lro rings */
 	for (i = 1; i < MTK_MAX_RX_RING_NUM; i++)
-		mtk_w32(eth, 0, MTK_LRO_CTRL_DW2_CFG(i));
+		mtk_w32(eth, 0, MTK_LRO_CTRL_DW2_CFG(reg_map, i));
 
 	/* disable HW LRO */
 	mtk_w32(eth, 0, MTK_PDMA_LRO_CTRL_DW0);
@@ -3072,27 +3040,29 @@ static void mtk_hwlro_rx_uninit(struct mtk_eth *eth)
 
 static void mtk_hwlro_val_ipaddr(struct mtk_eth *eth, int idx, __be32 ip)
 {
+	const struct mtk_reg_map *reg_map = eth->soc->reg_map;
 	u32 reg_val;
 
-	reg_val = mtk_r32(eth, MTK_LRO_CTRL_DW2_CFG(idx));
+	reg_val = mtk_r32(eth, MTK_LRO_CTRL_DW2_CFG(reg_map, idx));
 
 	/* invalidate the IP setting */
-	mtk_w32(eth, (reg_val & ~MTK_RING_MYIP_VLD), MTK_LRO_CTRL_DW2_CFG(idx));
+	mtk_w32(eth, (reg_val & ~MTK_RING_MYIP_VLD), MTK_LRO_CTRL_DW2_CFG(reg_map, idx));
 
 	mtk_w32(eth, ip, MTK_LRO_DIP_DW0_CFG(idx));
 
 	/* validate the IP setting */
-	mtk_w32(eth, (reg_val | MTK_RING_MYIP_VLD), MTK_LRO_CTRL_DW2_CFG(idx));
+	mtk_w32(eth, (reg_val | MTK_RING_MYIP_VLD), MTK_LRO_CTRL_DW2_CFG(reg_map, idx));
 }
 
 static void mtk_hwlro_inval_ipaddr(struct mtk_eth *eth, int idx)
 {
+	const struct mtk_reg_map *reg_map = eth->soc->reg_map;
 	u32 reg_val;
 
-	reg_val = mtk_r32(eth, MTK_LRO_CTRL_DW2_CFG(idx));
+	reg_val = mtk_r32(eth, MTK_LRO_CTRL_DW2_CFG(reg_map, idx));
 
 	/* invalidate the IP setting */
-	mtk_w32(eth, (reg_val & ~MTK_RING_MYIP_VLD), MTK_LRO_CTRL_DW2_CFG(idx));
+	mtk_w32(eth, (reg_val & ~MTK_RING_MYIP_VLD), MTK_LRO_CTRL_DW2_CFG(reg_map, idx));
 
 	mtk_w32(eth, 0, MTK_LRO_DIP_DW0_CFG(idx));
 }
@@ -3222,6 +3192,105 @@ static int mtk_hwlro_get_fdir_all(struct net_device *dev,
 	return 0;
 }
 
+static u32 mtk_rss_indr_table(struct mtk_rss_params *rss_params, int index)
+{
+	u32 val = 0;
+	int i;
+
+	for (i = 16 * index; i < 16 * index + 16; i++)
+		val |= (rss_params->indirection_table[i] << (2 * (i % 16)));
+
+	return val;
+}
+
+static int mtk_rss_init(struct mtk_eth *eth)
+{
+	const struct mtk_soc_data *soc = eth->soc;
+	const struct mtk_reg_map *reg_map = eth->soc->reg_map;
+	struct mtk_rss_params *rss_params = &eth->rss_params;
+	u32 val;
+	int i;
+
+	netdev_rss_key_fill(rss_params->hash_key, MTK_RSS_HASH_KEYSIZE);
+
+	for (i = 0; i < MTK_RSS_MAX_INDIRECTION_TABLE; i++)
+		rss_params->indirection_table[i] = ethtool_rxfh_indir_default(i, eth->soc->rss_num);
+
+	if (soc->rx.desc_size == sizeof(struct mtk_rx_dma)) {
+		/* Set RSS rings to PSE modes */
+		for (i = 1; i <= MTK_HW_LRO_RING_NUM(eth); i++) {
+			val = mtk_r32(eth, MTK_LRO_CTRL_DW2_CFG(reg_map, i));
+			val |= MTK_RING_PSE_MODE;
+			mtk_w32(eth, val, MTK_LRO_CTRL_DW2_CFG(reg_map, i));
+		}
+
+		/* Enable non-lro multiple rx */
+		val = mtk_r32(eth, reg_map->pdma.lro_ctrl_dw0);
+		val |= MTK_NON_LRO_MULTI_EN;
+		mtk_w32(eth, val, reg_map->pdma.lro_ctrl_dw0);
+
+		/* Enable RSS dly int support */
+		val |= MTK_LRO_DLY_INT_EN;
+		mtk_w32(eth, val, reg_map->pdma.lro_ctrl_dw0);
+	}
+
+	/* Hash Type */
+	val = mtk_r32(eth, reg_map->pdma.rss_glo_cfg);
+	val |= MTK_RSS_IPV4_STATIC_HASH;
+	val |= MTK_RSS_IPV6_STATIC_HASH;
+	mtk_w32(eth, val, reg_map->pdma.rss_glo_cfg);
+
+	/* Hash Key */
+	for (i = 0; i < MTK_RSS_HASH_KEYSIZE / sizeof(u32); i++)
+		mtk_w32(eth, rss_params->hash_key[i], MTK_RSS_HASH_KEY_DW(reg_map, i));
+
+	/* Select the size of indirection table */
+	for (i = 0; i < MTK_RSS_MAX_INDIRECTION_TABLE / 16; i++)
+		mtk_w32(eth, mtk_rss_indr_table(rss_params, i),
+			MTK_RSS_INDR_TABLE_DW(reg_map, i));
+
+	/* Pause */
+	val |= MTK_RSS_CFG_REQ;
+	mtk_w32(eth, val, reg_map->pdma.rss_glo_cfg);
+
+	/* Enable RSS */
+	val |= MTK_RSS_EN;
+	mtk_w32(eth, val, reg_map->pdma.rss_glo_cfg);
+
+	/* Release pause */
+	val &= ~(MTK_RSS_CFG_REQ);
+	mtk_w32(eth, val, reg_map->pdma.rss_glo_cfg);
+
+	/* Set perRSS GRP INT */
+	mtk_m32(eth, MTK_RX_DONE_INT(eth, MTK_RSS_RING(1)),
+		MTK_RX_DONE_INT(eth, MTK_RSS_RING(1)), reg_map->pdma.int_grp);
+	mtk_m32(eth, MTK_RX_DONE_INT(eth, MTK_RSS_RING(2)),
+		MTK_RX_DONE_INT(eth, MTK_RSS_RING(2)), reg_map->pdma.int_grp + 0x4);
+	mtk_m32(eth, MTK_RX_DONE_INT(eth, MTK_RSS_RING(3)),
+		MTK_RX_DONE_INT(eth, MTK_RSS_RING(3)), reg_map->pdma.int_grp3);
+
+	return 0;
+}
+
+static void mtk_rss_uninit(struct mtk_eth *eth)
+{
+	const struct mtk_reg_map *reg_map = eth->soc->reg_map;
+	u32 val;
+
+	/* Pause */
+	val = mtk_r32(eth, reg_map->pdma.rss_glo_cfg);
+	val |= MTK_RSS_CFG_REQ;
+	mtk_w32(eth, val, reg_map->pdma.rss_glo_cfg);
+
+	/* Disable RSS */
+	val &= ~(MTK_RSS_EN);
+	mtk_w32(eth, val, reg_map->pdma.rss_glo_cfg);
+
+	/* Release pause */
+	val &= ~(MTK_RSS_CFG_REQ);
+	mtk_w32(eth, val, reg_map->pdma.rss_glo_cfg);
+}
+
 static netdev_features_t mtk_fix_features(struct net_device *dev,
 					  netdev_features_t features)
 {
@@ -3312,6 +3381,17 @@ static int mtk_dma_init(struct mtk_eth *eth)
 			return err;
 	}
 
+	if (MTK_HAS_CAPS(eth->soc->caps, MTK_RSS)) {
+		for (i = 1; i < MTK_RX_RSS_NUM(eth); i++) {
+			err = mtk_rx_alloc(eth, MTK_RSS_RING(i), MTK_RX_FLAGS_NORMAL);
+			if (err)
+				return err;
+		}
+		err = mtk_rss_init(eth);
+		if (err)
+			return err;
+	}
+
 	if (MTK_HAS_CAPS(eth->soc->caps, MTK_QDMA)) {
 		/* Enable random early drop and set drop threshold
 		 * automatically
@@ -3358,6 +3438,12 @@ static void mtk_dma_free(struct mtk_eth *eth)
 			mtk_rx_clean(eth, &eth->rx_ring[i], false);
 	}
 
+	if (MTK_HAS_CAPS(eth->soc->caps, MTK_RSS)) {
+		mtk_rss_uninit(eth);
+		for (i = 1; i < MTK_RX_RSS_NUM(eth); i++)
+			mtk_rx_clean(eth, &eth->rx_ring[MTK_RSS_RING(i)], true);
+	}
+
 	for (i = 0; i < DIV_ROUND_UP(soc->tx.fq_dma_size, MTK_FQ_DMA_LENGTH); i++) {
 		kfree(eth->scratch_head[i]);
 		eth->scratch_head[i] = NULL;
@@ -3390,23 +3476,23 @@ static void mtk_tx_timeout(struct net_device *dev, unsigned int txqueue)
 	schedule_work(&eth->pending_work);
 }
 
-static int mtk_get_irqs(struct platform_device *pdev, struct mtk_eth *eth)
+static int mtk_get_irqs_fe(struct platform_device *pdev, struct mtk_eth *eth)
 {
 	int i;
 
 	/* future SoCs beginning with MT7988 should use named IRQs in dts */
-	eth->irq[MTK_FE_IRQ_TX] = platform_get_irq_byname_optional(pdev, "fe1");
-	eth->irq[MTK_FE_IRQ_RX] = platform_get_irq_byname_optional(pdev, "fe2");
-	if (eth->irq[MTK_FE_IRQ_TX] >= 0 && eth->irq[MTK_FE_IRQ_RX] >= 0)
+	eth->irq_fe[MTK_FE_IRQ_TX] = platform_get_irq_byname_optional(pdev, "fe1");
+	eth->irq_fe[MTK_FE_IRQ_RX] = platform_get_irq_byname_optional(pdev, "fe2");
+	if (eth->irq_fe[MTK_FE_IRQ_TX] >= 0 && eth->irq_fe[MTK_FE_IRQ_RX] >= 0)
 		return 0;
 
 	/* only use legacy mode if platform_get_irq_byname_optional returned -ENXIO */
-	if (eth->irq[MTK_FE_IRQ_TX] != -ENXIO)
-		return dev_err_probe(&pdev->dev, eth->irq[MTK_FE_IRQ_TX],
+	if (eth->irq_fe[MTK_FE_IRQ_TX] != -ENXIO)
+		return dev_err_probe(&pdev->dev, eth->irq_fe[MTK_FE_IRQ_TX],
 				     "Error requesting FE TX IRQ\n");
 
-	if (eth->irq[MTK_FE_IRQ_RX] != -ENXIO)
-		return dev_err_probe(&pdev->dev, eth->irq[MTK_FE_IRQ_RX],
+	if (eth->irq_fe[MTK_FE_IRQ_RX] != -ENXIO)
+		return dev_err_probe(&pdev->dev, eth->irq_fe[MTK_FE_IRQ_RX],
 				     "Error requesting FE RX IRQ\n");
 
 	if (!MTK_HAS_CAPS(eth->soc->caps, MTK_SHARED_INT))
@@ -3421,14 +3507,14 @@ static int mtk_get_irqs(struct platform_device *pdev, struct mtk_eth *eth)
 	for (i = 0; i < MTK_FE_IRQ_NUM; i++) {
 		if (MTK_HAS_CAPS(eth->soc->caps, MTK_SHARED_INT)) {
 			if (i == MTK_FE_IRQ_SHARED)
-				eth->irq[MTK_FE_IRQ_SHARED] = platform_get_irq(pdev, i);
+				eth->irq_fe[MTK_FE_IRQ_SHARED] = platform_get_irq(pdev, i);
 			else
-				eth->irq[i] = eth->irq[MTK_FE_IRQ_SHARED];
+				eth->irq_fe[i] = eth->irq_fe[MTK_FE_IRQ_SHARED];
 		} else {
-			eth->irq[i] = platform_get_irq(pdev, i + 1);
+			eth->irq_fe[i] = platform_get_irq(pdev, i + 1);
 		}
 
-		if (eth->irq[i] < 0) {
+		if (eth->irq_fe[i] < 0) {
 			dev_err(&pdev->dev, "no IRQ%d resource found\n", i);
 			return -ENXIO;
 		}
@@ -3437,14 +3523,36 @@ static int mtk_get_irqs(struct platform_device *pdev, struct mtk_eth *eth)
 	return 0;
 }
 
-static irqreturn_t mtk_handle_irq_rx(int irq, void *_eth)
+static int mtk_get_irqs_pdma(struct platform_device *pdev, struct mtk_eth *eth)
 {
-	struct mtk_eth *eth = _eth;
+	char rxring[] = "pdma0";
+	int i;
+
+	for (i = 0; i < MTK_PDMA_IRQ_NUM; i++) {
+		rxring[4] = '0' + i;
+		eth->irq_pdma[i] = platform_get_irq_byname(pdev, rxring);
+		if (eth->irq_pdma[i] < 0)
+			return eth->irq_pdma[i];
+	}
+
+	return 0;
+}
+
+static irqreturn_t mtk_handle_irq_rx(int irq, void *priv)
+{
+	struct mtk_napi *rx_napi = priv;
+	struct mtk_eth *eth = rx_napi->eth;
+	struct mtk_rx_ring *ring = rx_napi->rx_ring;
 
 	eth->rx_events++;
-	if (likely(napi_schedule_prep(&eth->rx_napi))) {
-		mtk_rx_irq_disable(eth, eth->soc->rx.irq_done_mask);
-		__napi_schedule(&eth->rx_napi);
+	if (unlikely(!(mtk_r32(eth, eth->soc->reg_map->pdma.irq_status) &
+		       mtk_r32(eth, eth->soc->reg_map->pdma.irq_mask) &
+		       MTK_RX_DONE_INT(eth, ring->ring_no))))
+		return IRQ_NONE;
+
+	if (likely(napi_schedule_prep(&rx_napi->napi))) {
+		mtk_rx_irq_disable(eth, MTK_RX_DONE_INT(eth, ring->ring_no));
+		__napi_schedule(&rx_napi->napi);
 	}
 
 	return IRQ_HANDLED;
@@ -3469,10 +3577,10 @@ static irqreturn_t mtk_handle_irq(int irq, void *_eth)
 	const struct mtk_reg_map *reg_map = eth->soc->reg_map;
 
 	if (mtk_r32(eth, reg_map->pdma.irq_mask) &
-	    eth->soc->rx.irq_done_mask) {
+	    MTK_RX_DONE_INT(eth, 0)) {
 		if (mtk_r32(eth, reg_map->pdma.irq_status) &
-		    eth->soc->rx.irq_done_mask)
-			mtk_handle_irq_rx(irq, _eth);
+		    MTK_RX_DONE_INT(eth, 0))
+			mtk_handle_irq_rx(irq, &eth->rx_napi[0]);
 	}
 	if (mtk_r32(eth, reg_map->tx_irq_mask) & MTK_TX_DONE_INT) {
 		if (mtk_r32(eth, reg_map->tx_irq_status) & MTK_TX_DONE_INT)
@@ -3489,10 +3597,10 @@ static void mtk_poll_controller(struct net_device *dev)
 	struct mtk_eth *eth = mac->hw;
 
 	mtk_tx_irq_disable(eth, MTK_TX_DONE_INT);
-	mtk_rx_irq_disable(eth, eth->soc->rx.irq_done_mask);
-	mtk_handle_irq_rx(eth->irq[MTK_FE_IRQ_RX], dev);
+	mtk_rx_irq_disable(eth, MTK_RX_DONE_INT(eth, 0));
+	mtk_handle_irq_rx(eth->irq_fe[MTK_FE_IRQ_RX], &eth->rx_napi[0]);
 	mtk_tx_irq_enable(eth, MTK_TX_DONE_INT);
-	mtk_rx_irq_enable(eth, eth->soc->rx.irq_done_mask);
+	mtk_rx_irq_enable(eth, MTK_RX_DONE_INT(eth, 0));
 }
 #endif
 
@@ -3679,9 +3787,17 @@ static int mtk_open(struct net_device *dev)
 			mtk_ppe_update_mtu(eth->ppe[i], mtu);
 
 		napi_enable(&eth->tx_napi);
-		napi_enable(&eth->rx_napi);
+		napi_enable(&eth->rx_napi[0].napi);
 		mtk_tx_irq_enable(eth, MTK_TX_DONE_INT);
-		mtk_rx_irq_enable(eth, soc->rx.irq_done_mask);
+		mtk_rx_irq_enable(eth, MTK_RX_DONE_INT(eth, 0));
+
+		if (MTK_HAS_CAPS(eth->soc->caps, MTK_RSS)) {
+			for (i = 1; i < MTK_RX_RSS_NUM(eth); i++) {
+				napi_enable(&eth->rx_napi[MTK_RSS_RING(i)].napi);
+				mtk_rx_irq_enable(eth, MTK_RX_DONE_INT(eth, MTK_RSS_RING(i)));
+			}
+		}
+
 		refcount_set(&eth->dma_refcnt, 1);
 	} else {
 		refcount_inc(&eth->dma_refcnt);
@@ -3766,9 +3882,16 @@ static int mtk_stop(struct net_device *dev)
 		mtk_gdm_config(eth, i, MTK_GDMA_DROP_ALL);
 
 	mtk_tx_irq_disable(eth, MTK_TX_DONE_INT);
-	mtk_rx_irq_disable(eth, eth->soc->rx.irq_done_mask);
+	mtk_rx_irq_disable(eth, MTK_RX_DONE_INT(eth, 0));
 	napi_disable(&eth->tx_napi);
-	napi_disable(&eth->rx_napi);
+	napi_disable(&eth->rx_napi[0].napi);
+
+	if (MTK_HAS_CAPS(eth->soc->caps, MTK_RSS)) {
+		for (i = 1; i < MTK_RX_RSS_NUM(eth); i++) {
+			mtk_rx_irq_disable(eth, MTK_RX_DONE_INT(eth, MTK_RSS_RING(i)));
+			napi_disable(&eth->rx_napi[MTK_RSS_RING(i)].napi);
+		}
+	}
 
 	cancel_work_sync(&eth->rx_dim.work);
 	cancel_work_sync(&eth->tx_dim.work);
@@ -3888,9 +4011,7 @@ static void mtk_dim_rx(struct work_struct *work)
 						dim->profile_ix);
 	spin_lock_bh(&eth->dim_lock);
 
-	val = mtk_r32(eth, reg_map->pdma.delay_irq);
-	val &= MTK_PDMA_DELAY_TX_MASK;
-	val |= MTK_PDMA_DELAY_RX_EN;
+	val = MTK_PDMA_DELAY_RX_EN;
 
 	cur = min_t(u32, DIV_ROUND_UP(cur_profile.usec, 20), MTK_PDMA_DELAY_PTIME_MASK);
 	val |= cur << MTK_PDMA_DELAY_RX_PTIME_SHIFT;
@@ -3898,9 +4019,19 @@ static void mtk_dim_rx(struct work_struct *work)
 	cur = min_t(u32, cur_profile.pkts, MTK_PDMA_DELAY_PINT_MASK);
 	val |= cur << MTK_PDMA_DELAY_RX_PINT_SHIFT;
 
-	mtk_w32(eth, val, reg_map->pdma.delay_irq);
 	if (MTK_HAS_CAPS(eth->soc->caps, MTK_QDMA))
-		mtk_w32(eth, val, reg_map->qdma.delay_irq);
+		mtk_m32(eth, MTK_PDMA_DELAY_TX_MASK,
+			val << MTK_PDMA_DELAY_TX_PTIME_SHIFT, reg_map->qdma.delay_irq);
+
+	if (eth->soc->rx.desc_size == sizeof(struct mtk_rx_dma)) {
+		mtk_m32(eth, MTK_PDMA_DELAY_RX_MASK, val, reg_map->pdma.delay_irq);
+		mtk_w32(eth, val, reg_map->pdma.lro_rx1_dly_int);
+		mtk_w32(eth, val, reg_map->pdma.lro_rx1_dly_int + 0x4);
+		mtk_w32(eth, val, reg_map->pdma.lro_rx1_dly_int + 0x8);
+	} else {
+		val = val | (val << MTK_PDMA_DELAY_RX_RING_SHIFT);
+		mtk_w32(eth, val, reg_map->pdma.rx_delay_irq);
+	}
 
 	spin_unlock_bh(&eth->dim_lock);
 
@@ -3919,9 +4050,7 @@ static void mtk_dim_tx(struct work_struct *work)
 						dim->profile_ix);
 	spin_lock_bh(&eth->dim_lock);
 
-	val = mtk_r32(eth, reg_map->pdma.delay_irq);
-	val &= MTK_PDMA_DELAY_RX_MASK;
-	val |= MTK_PDMA_DELAY_TX_EN;
+	val = MTK_PDMA_DELAY_TX_EN;
 
 	cur = min_t(u32, DIV_ROUND_UP(cur_profile.usec, 20), MTK_PDMA_DELAY_PTIME_MASK);
 	val |= cur << MTK_PDMA_DELAY_TX_PTIME_SHIFT;
@@ -3929,9 +4058,16 @@ static void mtk_dim_tx(struct work_struct *work)
 	cur = min_t(u32, cur_profile.pkts, MTK_PDMA_DELAY_PINT_MASK);
 	val |= cur << MTK_PDMA_DELAY_TX_PINT_SHIFT;
 
-	mtk_w32(eth, val, reg_map->pdma.delay_irq);
 	if (MTK_HAS_CAPS(eth->soc->caps, MTK_QDMA))
-		mtk_w32(eth, val, reg_map->qdma.delay_irq);
+		mtk_m32(eth, MTK_PDMA_DELAY_RX_MASK,
+			val >> MTK_PDMA_DELAY_TX_PTIME_SHIFT, reg_map->qdma.delay_irq);
+
+	if (eth->soc->rx.desc_size == sizeof(struct mtk_rx_dma)) {
+		mtk_m32(eth, MTK_PDMA_DELAY_TX_MASK, val, reg_map->pdma.delay_irq);
+	} else {
+		mtk_w32(eth, val >> MTK_PDMA_DELAY_TX_PTIME_SHIFT,
+			reg_map->pdma.tx_delay_irq);
+	}
 
 	spin_unlock_bh(&eth->dim_lock);
 
@@ -4149,6 +4285,25 @@ static void mtk_hw_reset_monitor_work(struct work_struct *work)
 			      MTK_DMA_MONITOR_TIMEOUT);
 }
 
+static int mtk_napi_init(struct mtk_eth *eth)
+{
+	struct mtk_napi *rx_napi = &eth->rx_napi[0];
+	int i;
+
+	rx_napi->eth = eth;
+	rx_napi->rx_ring = &eth->rx_ring[0];
+
+	if (MTK_HAS_CAPS(eth->soc->caps, MTK_RSS)) {
+		for (i = 1; i < MTK_RX_RSS_NUM(eth); i++) {
+			rx_napi = &eth->rx_napi[MTK_RSS_RING(i)];
+			rx_napi->eth = eth;
+			rx_napi->rx_ring = &eth->rx_ring[MTK_RSS_RING(i)];
+		}
+	}
+
+	return 0;
+}
+
 static int mtk_hw_init(struct mtk_eth *eth, bool reset)
 {
 	u32 dma_mask = ETHSYS_DMA_AG_MAP_PDMA | ETHSYS_DMA_AG_MAP_QDMA |
@@ -4238,12 +4393,11 @@ static int mtk_hw_init(struct mtk_eth *eth, bool reset)
 	 */
 	val = mtk_r32(eth, MTK_CDMQ_IG_CTRL);
 	mtk_w32(eth, val | MTK_CDMQ_STAG_EN, MTK_CDMQ_IG_CTRL);
-	if (mtk_is_netsys_v1(eth)) {
-		val = mtk_r32(eth, MTK_CDMP_IG_CTRL);
-		mtk_w32(eth, val | MTK_CDMP_STAG_EN, MTK_CDMP_IG_CTRL);
+	val = mtk_r32(eth, MTK_CDMP_IG_CTRL);
+	mtk_w32(eth, val | MTK_CDMP_STAG_EN, MTK_CDMP_IG_CTRL);
 
+	if (mtk_is_netsys_v1(eth))
 		mtk_w32(eth, 1, MTK_CDMP_EG_CTRL);
-	}
 
 	/* set interrupt delays based on current Net DIM sample */
 	mtk_dim_rx(&eth->rx_dim.work);
@@ -4254,11 +4408,17 @@ static int mtk_hw_init(struct mtk_eth *eth, bool reset)
 	mtk_rx_irq_disable(eth, ~0);
 
 	/* FE int grouping */
-	mtk_w32(eth, MTK_TX_DONE_INT, reg_map->pdma.int_grp);
-	mtk_w32(eth, eth->soc->rx.irq_done_mask, reg_map->pdma.int_grp + 4);
+
 	mtk_w32(eth, MTK_TX_DONE_INT, reg_map->qdma.int_grp);
-	mtk_w32(eth, eth->soc->rx.irq_done_mask, reg_map->qdma.int_grp + 4);
-	mtk_w32(eth, 0x21021000, MTK_FE_INT_GRP);
+	mtk_w32(eth, MTK_RX_DONE_INT(eth, 0), reg_map->qdma.int_grp + 4);
+
+	if (MTK_HAS_CAPS(eth->soc->caps, MTK_PDMA_INT)) {
+		mtk_w32(eth, 0x210FFFF2, MTK_FE_INT_GRP);
+	} else {
+		mtk_w32(eth, MTK_TX_DONE_INT, reg_map->pdma.int_grp);
+		mtk_w32(eth, MTK_RX_DONE_INT(eth, 0), reg_map->pdma.int_grp + 4);
+		mtk_w32(eth, 0x21021000, MTK_FE_INT_GRP);
+	}
 
 	if (mtk_is_netsys_v3_or_greater(eth)) {
 		/* PSE dummy page mechanism */
@@ -4700,8 +4860,13 @@ static void mtk_get_ethtool_stats(struct net_device *dev,
 
 static u32 mtk_get_rx_ring_count(struct net_device *dev)
 {
+	struct mtk_mac *mac = netdev_priv(dev);
+	struct mtk_eth *eth = mac->hw;
+
 	if (dev->hw_features & NETIF_F_LRO)
 		return MTK_MAX_RX_RING_NUM;
+	else if (MTK_HAS_CAPS(eth->soc->caps, MTK_RSS))
+		return MTK_RX_RSS_NUM(eth);
 
 	return 0;
 }
@@ -4784,6 +4949,70 @@ static int mtk_set_eee(struct net_device *dev, struct ethtool_keee *eee)
 	return phylink_ethtool_set_eee(mac->phylink, eee);
 }
 
+static u32 mtk_get_rxfh_key_size(struct net_device *dev)
+{
+	return MTK_RSS_HASH_KEYSIZE;
+}
+
+static u32 mtk_get_rxfh_indir_size(struct net_device *dev)
+{
+	return MTK_RSS_MAX_INDIRECTION_TABLE;
+}
+
+static int mtk_get_rxfh(struct net_device *dev, struct ethtool_rxfh_param *rxfh)
+{
+	struct mtk_mac *mac = netdev_priv(dev);
+	struct mtk_eth *eth = mac->hw;
+	struct mtk_rss_params *rss_params = &eth->rss_params;
+	int i;
+
+	rxfh->hfunc = ETH_RSS_HASH_TOP;	/* Toeplitz */
+
+	if (rxfh->key) {
+		memcpy(rxfh->key, rss_params->hash_key,
+		       sizeof(rss_params->hash_key));
+	}
+
+	if (rxfh->indir) {
+		for (i = 0; i < MTK_RSS_MAX_INDIRECTION_TABLE; i++)
+			rxfh->indir[i] = rss_params->indirection_table[i];
+	}
+
+	return 0;
+}
+
+static int mtk_set_rxfh(struct net_device *dev, struct ethtool_rxfh_param *rxfh,
+			struct netlink_ext_ack *extack)
+{
+	struct mtk_mac *mac = netdev_priv(dev);
+	struct mtk_eth *eth = mac->hw;
+	struct mtk_rss_params *rss_params = &eth->rss_params;
+	const struct mtk_reg_map *reg_map = eth->soc->reg_map;
+	int i;
+
+	if (rxfh->hfunc != ETH_RSS_HASH_NO_CHANGE &&
+	    rxfh->hfunc != ETH_RSS_HASH_TOP)
+		return -EOPNOTSUPP;
+
+	if (rxfh->key) {
+		memcpy(rss_params->hash_key, rxfh->key,
+		       sizeof(rss_params->hash_key));
+		for (i = 0; i < MTK_RSS_HASH_KEYSIZE / sizeof(u32); i++)
+			mtk_w32(eth, rss_params->hash_key[i],
+				MTK_RSS_HASH_KEY_DW(reg_map, i));
+	}
+
+	if (rxfh->indir) {
+		for (i = 0; i < MTK_RSS_MAX_INDIRECTION_TABLE; i++)
+			rss_params->indirection_table[i] = rxfh->indir[i];
+		for (i = 0; i < MTK_RSS_MAX_INDIRECTION_TABLE / 16; i++)
+			mtk_w32(eth, mtk_rss_indr_table(rss_params, i),
+				MTK_RSS_INDR_TABLE_DW(reg_map, i));
+	}
+
+	return 0;
+}
+
 static u16 mtk_select_queue(struct net_device *dev, struct sk_buff *skb,
 			    struct net_device *sb_dev)
 {
@@ -4819,6 +5048,10 @@ static const struct ethtool_ops mtk_ethtool_ops = {
 	.get_rx_ring_count	= mtk_get_rx_ring_count,
 	.get_eee		= mtk_get_eee,
 	.set_eee		= mtk_set_eee,
+	.get_rxfh_key_size	= mtk_get_rxfh_key_size,
+	.get_rxfh_indir_size	= mtk_get_rxfh_indir_size,
+	.get_rxfh		= mtk_get_rxfh,
+	.set_rxfh		= mtk_set_rxfh,
 };
 
 static const struct net_device_ops mtk_netdev_ops = {
@@ -5012,7 +5245,7 @@ static int mtk_add_mac(struct mtk_eth *eth, struct device_node *np)
 	eth->netdev[id]->features |= eth->soc->hw_features;
 	eth->netdev[id]->ethtool_ops = &mtk_ethtool_ops;
 
-	eth->netdev[id]->irq = eth->irq[MTK_FE_IRQ_SHARED];
+	eth->netdev[id]->irq = eth->irq_fe[MTK_FE_IRQ_SHARED];
 	eth->netdev[id]->dev.of_node = np;
 
 	if (MTK_HAS_CAPS(eth->soc->caps, MTK_SOC_MT7628))
@@ -5120,6 +5353,7 @@ static int mtk_probe(struct platform_device *pdev)
 	struct resource *res = NULL;
 	struct device_node *mac_np;
 	struct mtk_eth *eth;
+	char *irqname;
 	int err, i;
 
 	eth = devm_kzalloc(&pdev->dev, sizeof(*eth), GFP_KERNEL);
@@ -5251,10 +5485,16 @@ static int mtk_probe(struct platform_device *pdev)
 		}
 	}
 
-	err = mtk_get_irqs(pdev, eth);
+	err = mtk_get_irqs_fe(pdev, eth);
 	if (err)
 		goto err_wed_exit;
 
+	if (MTK_HAS_CAPS(eth->soc->caps, MTK_PDMA_INT)) {
+		err = mtk_get_irqs_pdma(pdev, eth);
+		if (err)
+			goto err_wed_exit;
+	}
+
 	for (i = 0; i < ARRAY_SIZE(eth->clks); i++) {
 		eth->clks[i] = devm_clk_get(eth->dev,
 					    mtk_clks_source_name[i]);
@@ -5297,23 +5537,56 @@ static int mtk_probe(struct platform_device *pdev)
 		}
 	}
 
+	err = mtk_napi_init(eth);
+	if (err)
+		goto err_free_dev;
+
 	if (MTK_HAS_CAPS(eth->soc->caps, MTK_SHARED_INT)) {
-		err = devm_request_irq(eth->dev, eth->irq[MTK_FE_IRQ_SHARED],
+		err = devm_request_irq(eth->dev, eth->irq_fe[MTK_FE_IRQ_SHARED],
 				       mtk_handle_irq, 0,
 				       dev_name(eth->dev), eth);
 	} else {
-		err = devm_request_irq(eth->dev, eth->irq[MTK_FE_IRQ_TX],
+		irqname = devm_kasprintf(eth->dev, GFP_KERNEL, "%s TX",
+					 dev_name(eth->dev));
+		err = devm_request_irq(eth->dev, eth->irq_fe[MTK_FE_IRQ_TX],
 				       mtk_handle_irq_tx, 0,
-				       dev_name(eth->dev), eth);
+				       irqname, eth);
 		if (err)
 			goto err_free_dev;
 
-		err = devm_request_irq(eth->dev, eth->irq[MTK_FE_IRQ_RX],
-				       mtk_handle_irq_rx, 0,
-				       dev_name(eth->dev), eth);
+		if (MTK_HAS_CAPS(eth->soc->caps, MTK_PDMA_INT)) {
+			irqname = devm_kasprintf(eth->dev, GFP_KERNEL, "%s PDMA RX %d",
+						 dev_name(eth->dev), 0);
+			err = devm_request_irq(eth->dev, eth->irq_pdma[0],
+					       mtk_handle_irq_rx, IRQF_SHARED,
+					       irqname, &eth->rx_napi[0]);
+			if (err)
+				goto err_free_dev;
+
+			if (MTK_HAS_CAPS(eth->soc->caps, MTK_RSS)) {
+				for (i = 1; i < MTK_RX_RSS_NUM(eth); i++) {
+					irqname = devm_kasprintf(eth->dev, GFP_KERNEL,
+								 "%s PDMA RX %d",
+								 dev_name(eth->dev), i);
+					err = devm_request_irq(eth->dev,
+							       eth->irq_pdma[MTK_RSS_RING(i)],
+							       mtk_handle_irq_rx, IRQF_SHARED,
+							       irqname,
+							       &eth->rx_napi[MTK_RSS_RING(i)]);
+					if (err)
+						goto err_free_dev;
+				}
+			}
+		} else {
+			irqname = devm_kasprintf(eth->dev, GFP_KERNEL, "%s RX",
+						 dev_name(eth->dev));
+			err = devm_request_irq(eth->dev, eth->irq_fe[MTK_FE_IRQ_RX],
+					       mtk_handle_irq_rx, 0,
+					       irqname, &eth->rx_napi[0]);
+			if (err)
+				goto err_free_dev;
+		}
 	}
-	if (err)
-		goto err_free_dev;
 
 	/* No MT7628/88 support yet */
 	if (!MTK_HAS_CAPS(eth->soc->caps, MTK_SOC_MT7628)) {
@@ -5354,7 +5627,7 @@ static int mtk_probe(struct platform_device *pdev)
 		} else
 			netif_info(eth, probe, eth->netdev[i],
 				   "mediatek frame engine at 0x%08lx, irq %d\n",
-				   eth->netdev[i]->base_addr, eth->irq[MTK_FE_IRQ_SHARED]);
+				   eth->netdev[i]->base_addr, eth->irq_fe[MTK_FE_IRQ_SHARED]);
 	}
 
 	/* we run 2 devices on the same DMA ring so we need a dummy device
@@ -5367,7 +5640,13 @@ static int mtk_probe(struct platform_device *pdev)
 		goto err_unreg_netdev;
 	}
 	netif_napi_add(eth->dummy_dev, &eth->tx_napi, mtk_napi_tx);
-	netif_napi_add(eth->dummy_dev, &eth->rx_napi, mtk_napi_rx);
+	netif_napi_add(eth->dummy_dev, &eth->rx_napi[0].napi, mtk_napi_rx);
+
+	if (MTK_HAS_CAPS(eth->soc->caps, MTK_RSS)) {
+		for (i = 1; i < MTK_RX_RSS_NUM(eth); i++)
+			netif_napi_add(eth->dummy_dev, &eth->rx_napi[MTK_RSS_RING(i)].napi,
+				       mtk_napi_rx);
+	}
 
 	platform_set_drvdata(pdev, eth);
 	schedule_delayed_work(&eth->reset.monitor_work,
@@ -5411,7 +5690,12 @@ static void mtk_remove(struct platform_device *pdev)
 	mtk_hw_deinit(eth);
 
 	netif_napi_del(&eth->tx_napi);
-	netif_napi_del(&eth->rx_napi);
+	netif_napi_del(&eth->rx_napi[0].napi);
+
+	if (MTK_HAS_CAPS(eth->soc->caps, MTK_RSS)) {
+		for (i = 1; i < MTK_RX_RSS_NUM(eth); i++)
+			netif_napi_del(&eth->rx_napi[MTK_RSS_RING(i)].napi);
+	}
 	mtk_cleanup(eth);
 	free_netdev(eth->dummy_dev);
 	mtk_mdio_cleanup(eth);
@@ -5424,6 +5708,7 @@ static const struct mtk_soc_data mt2701_data = {
 	.required_clks = MT7623_CLKS_BITMAP,
 	.required_pctl = true,
 	.version = 1,
+	.rss_num = 0,
 	.tx = {
 		.desc_size = sizeof(struct mtk_tx_dma),
 		.dma_max_len = MTK_TX_DMA_BUF_LEN,
@@ -5433,7 +5718,6 @@ static const struct mtk_soc_data mt2701_data = {
 	},
 	.rx = {
 		.desc_size = sizeof(struct mtk_rx_dma),
-		.irq_done_mask = MTK_RX_DONE_INT,
 		.dma_l4_valid = RX_DMA_L4_VALID,
 		.dma_size = MTK_DMA_SIZE(2K),
 		.dma_max_len = MTK_TX_DMA_BUF_LEN,
@@ -5452,6 +5736,7 @@ static const struct mtk_soc_data mt7621_data = {
 	.ppe_num = 1,
 	.hash_offset = 2,
 	.foe_entry_size = MTK_FOE_ENTRY_V1_SIZE,
+	.rss_num = 0,
 	.tx = {
 		.desc_size = sizeof(struct mtk_tx_dma),
 		.dma_max_len = MTK_TX_DMA_BUF_LEN,
@@ -5461,7 +5746,6 @@ static const struct mtk_soc_data mt7621_data = {
 	},
 	.rx = {
 		.desc_size = sizeof(struct mtk_rx_dma),
-		.irq_done_mask = MTK_RX_DONE_INT,
 		.dma_l4_valid = RX_DMA_L4_VALID,
 		.dma_size = MTK_DMA_SIZE(2K),
 		.dma_max_len = MTK_TX_DMA_BUF_LEN,
@@ -5482,6 +5766,7 @@ static const struct mtk_soc_data mt7622_data = {
 	.hash_offset = 2,
 	.has_accounting = true,
 	.foe_entry_size = MTK_FOE_ENTRY_V1_SIZE,
+	.rss_num = 0,
 	.tx = {
 		.desc_size = sizeof(struct mtk_tx_dma),
 		.dma_max_len = MTK_TX_DMA_BUF_LEN,
@@ -5491,7 +5776,6 @@ static const struct mtk_soc_data mt7622_data = {
 	},
 	.rx = {
 		.desc_size = sizeof(struct mtk_rx_dma),
-		.irq_done_mask = MTK_RX_DONE_INT,
 		.dma_l4_valid = RX_DMA_L4_VALID,
 		.dma_size = MTK_DMA_SIZE(2K),
 		.dma_max_len = MTK_TX_DMA_BUF_LEN,
@@ -5511,6 +5795,7 @@ static const struct mtk_soc_data mt7623_data = {
 	.hash_offset = 2,
 	.foe_entry_size = MTK_FOE_ENTRY_V1_SIZE,
 	.disable_pll_modes = true,
+	.rss_num = 0,
 	.tx = {
 		.desc_size = sizeof(struct mtk_tx_dma),
 		.dma_max_len = MTK_TX_DMA_BUF_LEN,
@@ -5520,7 +5805,6 @@ static const struct mtk_soc_data mt7623_data = {
 	},
 	.rx = {
 		.desc_size = sizeof(struct mtk_rx_dma),
-		.irq_done_mask = MTK_RX_DONE_INT,
 		.dma_l4_valid = RX_DMA_L4_VALID,
 		.dma_size = MTK_DMA_SIZE(2K),
 		.dma_max_len = MTK_TX_DMA_BUF_LEN,
@@ -5537,6 +5821,7 @@ static const struct mtk_soc_data mt7629_data = {
 	.required_pctl = false,
 	.has_accounting = true,
 	.version = 1,
+	.rss_num = 0,
 	.tx = {
 		.desc_size = sizeof(struct mtk_tx_dma),
 		.dma_max_len = MTK_TX_DMA_BUF_LEN,
@@ -5546,7 +5831,6 @@ static const struct mtk_soc_data mt7629_data = {
 	},
 	.rx = {
 		.desc_size = sizeof(struct mtk_rx_dma),
-		.irq_done_mask = MTK_RX_DONE_INT,
 		.dma_l4_valid = RX_DMA_L4_VALID,
 		.dma_size = MTK_DMA_SIZE(2K),
 		.dma_max_len = MTK_TX_DMA_BUF_LEN,
@@ -5567,16 +5851,16 @@ static const struct mtk_soc_data mt7981_data = {
 	.hash_offset = 4,
 	.has_accounting = true,
 	.foe_entry_size = MTK_FOE_ENTRY_V2_SIZE,
+	.rss_num = 4,
 	.tx = {
 		.desc_size = sizeof(struct mtk_tx_dma_v2),
 		.dma_max_len = MTK_TX_DMA_BUF_LEN_V2,
 		.dma_len_offset = 8,
-		.dma_size = MTK_DMA_SIZE(2K),
+		.dma_size = MTK_DMA_SIZE(4K),
 		.fq_dma_size = MTK_DMA_SIZE(2K),
 	},
 	.rx = {
 		.desc_size = sizeof(struct mtk_rx_dma),
-		.irq_done_mask = MTK_RX_DONE_INT,
 		.dma_l4_valid = RX_DMA_L4_VALID_V2,
 		.dma_max_len = MTK_TX_DMA_BUF_LEN,
 		.dma_len_offset = 16,
@@ -5597,6 +5881,7 @@ static const struct mtk_soc_data mt7986_data = {
 	.hash_offset = 4,
 	.has_accounting = true,
 	.foe_entry_size = MTK_FOE_ENTRY_V2_SIZE,
+	.rss_num = 4,
 	.tx = {
 		.desc_size = sizeof(struct mtk_tx_dma_v2),
 		.dma_max_len = MTK_TX_DMA_BUF_LEN_V2,
@@ -5606,7 +5891,6 @@ static const struct mtk_soc_data mt7986_data = {
 	},
 	.rx = {
 		.desc_size = sizeof(struct mtk_rx_dma),
-		.irq_done_mask = MTK_RX_DONE_INT,
 		.dma_l4_valid = RX_DMA_L4_VALID_V2,
 		.dma_max_len = MTK_TX_DMA_BUF_LEN,
 		.dma_len_offset = 16,
@@ -5627,20 +5911,20 @@ static const struct mtk_soc_data mt7988_data = {
 	.hash_offset = 4,
 	.has_accounting = true,
 	.foe_entry_size = MTK_FOE_ENTRY_V3_SIZE,
+	.rss_num = 4,
 	.tx = {
 		.desc_size = sizeof(struct mtk_tx_dma_v2),
 		.dma_max_len = MTK_TX_DMA_BUF_LEN_V2,
 		.dma_len_offset = 8,
-		.dma_size = MTK_DMA_SIZE(2K),
+		.dma_size = MTK_DMA_SIZE(4K),
 		.fq_dma_size = MTK_DMA_SIZE(4K),
 	},
 	.rx = {
 		.desc_size = sizeof(struct mtk_rx_dma_v2),
-		.irq_done_mask = MTK_RX_DONE_INT_V2,
 		.dma_l4_valid = RX_DMA_L4_VALID_V2,
 		.dma_max_len = MTK_TX_DMA_BUF_LEN_V2,
 		.dma_len_offset = 8,
-		.dma_size = MTK_DMA_SIZE(2K),
+		.dma_size = MTK_DMA_SIZE(1K),
 	},
 };
 
@@ -5651,6 +5935,7 @@ static const struct mtk_soc_data rt5350_data = {
 	.required_clks = MT7628_CLKS_BITMAP,
 	.required_pctl = false,
 	.version = 1,
+	.rss_num = 0,
 	.tx = {
 		.desc_size = sizeof(struct mtk_tx_dma),
 		.dma_max_len = MTK_TX_DMA_BUF_LEN,
@@ -5659,7 +5944,6 @@ static const struct mtk_soc_data rt5350_data = {
 	},
 	.rx = {
 		.desc_size = sizeof(struct mtk_rx_dma),
-		.irq_done_mask = MTK_RX_DONE_INT,
 		.dma_l4_valid = RX_DMA_L4_VALID_PDMA,
 		.dma_max_len = MTK_TX_DMA_BUF_LEN,
 		.dma_len_offset = 16,
diff --git a/drivers/net/ethernet/mediatek/mtk_eth_soc.h b/drivers/net/ethernet/mediatek/mtk_eth_soc.h
index 334625814b97..378cf47913ef 100644
--- a/drivers/net/ethernet/mediatek/mtk_eth_soc.h
+++ b/drivers/net/ethernet/mediatek/mtk_eth_soc.h
@@ -76,6 +76,8 @@
 #define	MTK_HW_LRO_BW_THRE		3000
 #define	MTK_HW_LRO_REPLACE_DELTA	1000
 #define	MTK_HW_LRO_SDL_REMAIN_ROOM	1522
+#define MTK_RSS_HASH_KEYSIZE		40
+#define MTK_RSS_MAX_INDIRECTION_TABLE	128
 
 /* Frame Engine Global Configuration */
 #define MTK_FE_GLO_CFG(x)	(((x) == MTK_GMAC3_ID) ? 0x24 : 0x00)
@@ -97,6 +99,8 @@
 #define MTK_GDM1_AF		BIT(28)
 #define MTK_GDM2_AF		BIT(29)
 
+#define MTK_PDMA_IRQ_NUM	(4)
+
 /* PDMA HW LRO Alter Flow Timer Register */
 #define MTK_PDMA_LRO_ALT_REFRESH_TIMER	0x1c
 
@@ -179,7 +183,10 @@
 
 /* PDMA HW LRO Control Registers */
 #define MTK_PDMA_LRO_CTRL_DW0	0x980
+#define MTK_HW_LRO_RING_NUM(eth)		(mtk_is_netsys_v3_or_greater(eth) ? 4 : 3)
 #define MTK_LRO_EN			BIT(0)
+#define MTK_NON_LRO_MULTI_EN		BIT(2)
+#define MTK_LRO_DLY_INT_EN		BIT(5)
 #define MTK_L3_CKS_UPD_EN		BIT(7)
 #define MTK_L3_CKS_UPD_EN_V2		BIT(19)
 #define MTK_LRO_ALT_PKT_CNT_MODE	BIT(21)
@@ -198,6 +205,19 @@
 #define MTK_MULTI_EN		BIT(10)
 #define MTK_PDMA_SIZE_8DWORDS	(1 << 4)
 
+/* PDMA RSS Control Registers */
+#define MTK_RX_NAPI_NUM			(4)
+#define MTK_RX_RSS_NUM(eth)		((eth)->soc->rss_num)
+#define MTK_RSS_RING(x)			(x)
+#define MTK_RSS_EN			BIT(0)
+#define MTK_RSS_CFG_REQ			BIT(2)
+#define MTK_RSS_IPV6_STATIC_HASH	(0x7 << 8)
+#define MTK_RSS_IPV4_STATIC_HASH	(0x7 << 12)
+#define MTK_RSS_HASH_KEY_DW(reg_map, x)		((reg_map)->pdma.rss_glo_cfg + \
+						0x20 + ((x) * 0x4))
+#define MTK_RSS_INDR_TABLE_DW(reg_map, x)	((reg_map)->pdma.rss_glo_cfg + \
+						0x50 + ((x) * 0x4))
+
 /* PDMA Global Configuration Register */
 #define MTK_PDMA_LRO_SDL	0x3000
 #define MTK_RX_CFG_SDL_OFFSET	16
@@ -209,6 +229,7 @@
 /* PDMA Delay Interrupt Register */
 #define MTK_PDMA_DELAY_RX_MASK		GENMASK(15, 0)
 #define MTK_PDMA_DELAY_RX_EN		BIT(15)
+#define MTK_PDMA_DELAY_RX_RING_SHIFT	16
 #define MTK_PDMA_DELAY_RX_PINT_SHIFT	8
 #define MTK_PDMA_DELAY_RX_PTIME_SHIFT	0
 
@@ -229,14 +250,15 @@
 #define MTK_RING_MYIP_VLD		BIT(9)
 
 /* PDMA HW LRO Ring Control Registers */
-#define MTK_LRO_RX_RING0_CTRL_DW1	0xb28
-#define MTK_LRO_RX_RING0_CTRL_DW2	0xb2c
-#define MTK_LRO_RX_RING0_CTRL_DW3	0xb30
-#define MTK_LRO_CTRL_DW1_CFG(x)		(MTK_LRO_RX_RING0_CTRL_DW1 + (x * 0x40))
-#define MTK_LRO_CTRL_DW2_CFG(x)		(MTK_LRO_RX_RING0_CTRL_DW2 + (x * 0x40))
-#define MTK_LRO_CTRL_DW3_CFG(x)		(MTK_LRO_RX_RING0_CTRL_DW3 + (x * 0x40))
+#define MTK_LRO_CTRL_DW1_CFG(reg_map, x)	((reg_map)->pdma.lro_ring_ctrl_dw1 + \
+						((x) * 0x40))
+#define MTK_LRO_CTRL_DW2_CFG(reg_map, x)	((reg_map)->pdma.lro_ring_ctrl_dw1 + \
+						0x4 + ((x) * 0x40))
+#define MTK_LRO_CTRL_DW3_CFG(reg_map, x)	((reg_map)->pdma.lro_ring_ctrl_dw1 + \
+						0x8 + ((x) * 0x40))
 #define MTK_RING_AGE_TIME_L		((MTK_HW_LRO_AGE_TIME & 0x3ff) << 22)
 #define MTK_RING_AGE_TIME_H		((MTK_HW_LRO_AGE_TIME >> 10) & 0x3f)
+#define MTK_RING_PSE_MODE		BIT(6)
 #define MTK_RING_AUTO_LERAN_MODE	(3 << 6)
 #define MTK_RING_VLD			BIT(8)
 #define MTK_RING_MAX_AGG_TIME		((MTK_HW_LRO_AGG_TIME & 0xffff) << 10)
@@ -290,7 +312,20 @@
 #define FC_THRES_MIN		0x4444
 
 /* QDMA Interrupt Status Register */
-#define MTK_RX_DONE_DLY		BIT(30)
+#define MTK_RX_DONE_INT_V1(ring_no) \
+	( \
+		(ring_no) ? \
+		BIT(24 + (ring_no)) : \
+		BIT(30) \
+	)
+
+#define MTK_RX_DONE_INT_V2(ring_no)	BIT(24 + (ring_no))
+
+#define MTK_RX_DONE_INT(eth, ring_no)		\
+	(mtk_is_netsys_v3_or_greater(eth) ?  \
+	 MTK_RX_DONE_INT_V2(ring_no) : \
+	 MTK_RX_DONE_INT_V1(ring_no))
+
 #define MTK_TX_DONE_DLY		BIT(28)
 #define MTK_RX_DONE_INT3	BIT(19)
 #define MTK_RX_DONE_INT2	BIT(18)
@@ -300,11 +335,8 @@
 #define MTK_TX_DONE_INT2	BIT(2)
 #define MTK_TX_DONE_INT1	BIT(1)
 #define MTK_TX_DONE_INT0	BIT(0)
-#define MTK_RX_DONE_INT		MTK_RX_DONE_DLY
 #define MTK_TX_DONE_INT		MTK_TX_DONE_DLY
 
-#define MTK_RX_DONE_INT_V2	BIT(14)
-
 #define MTK_CDM_TXFIFO_RDY	BIT(7)
 
 /* QDMA Interrupt grouping registers */
@@ -942,6 +974,7 @@ struct mtk_tx_ring {
 	struct mtk_tx_dma *dma_pdma;	/* For MT7628/88 PDMA handling */
 	dma_addr_t phys_pdma;
 	int cpu_idx;
+	bool in_sram;
 };
 
 /* PDMA rx ring mode */
@@ -967,13 +1000,38 @@ struct mtk_rx_ring {
 	u16 buf_size;
 	u16 dma_size;
 	bool calc_idx_update;
+	bool in_sram;
 	u16 calc_idx;
 	u32 crx_idx_reg;
+	u32 ring_no;
 	/* page_pool */
 	struct page_pool *page_pool;
 	struct xdp_rxq_info xdp_q;
 };
 
+/* struct mtk_rss_params -	This is the structure holding parameters
+ *				for the RSS rings
+ * @hash_key:			Secret key used for the RSS hash
+ *				(MTK_RSS_HASH_KEYSIZE bytes)
+ * @indirection_table:		Indirection table mapping hash buckets
+ *				to RX rings
+ */
+struct mtk_rss_params {
+	u32		hash_key[MTK_RSS_HASH_KEYSIZE / sizeof(u32)];
+	u8		indirection_table[MTK_RSS_MAX_INDIRECTION_TABLE];
+};
+
+/* struct mtk_napi -	NAPI context bound to one interrupt group
+ * @napi:		The NAPI struct
+ * @eth:		Pointer to the owning mtk_eth instance
+ * @rx_ring:		Pointer to the memory holding info about the RX ring
+ */
+struct mtk_napi {
+	struct napi_struct	napi;
+	struct mtk_eth		*eth;
+	struct mtk_rx_ring	*rx_ring;
+};
+
 enum mkt_eth_capabilities {
 	MTK_RGMII_BIT = 0,
 	MTK_TRGMII_BIT,
@@ -985,7 +1043,9 @@ enum mkt_eth_capabilities {
 	MTK_INFRA_BIT,
 	MTK_SHARED_SGMII_BIT,
 	MTK_HWLRO_BIT,
+	MTK_RSS_BIT,
 	MTK_SHARED_INT_BIT,
+	MTK_PDMA_INT_BIT,
 	MTK_TRGMII_MT7621_CLK_BIT,
 	MTK_QDMA_BIT,
 	MTK_SOC_MT7628_BIT,
@@ -1025,7 +1085,9 @@ enum mkt_eth_capabilities {
 #define MTK_INFRA		BIT_ULL(MTK_INFRA_BIT)
 #define MTK_SHARED_SGMII	BIT_ULL(MTK_SHARED_SGMII_BIT)
 #define MTK_HWLRO		BIT_ULL(MTK_HWLRO_BIT)
+#define MTK_RSS			BIT_ULL(MTK_RSS_BIT)
 #define MTK_SHARED_INT		BIT_ULL(MTK_SHARED_INT_BIT)
+#define MTK_PDMA_INT		BIT_ULL(MTK_PDMA_INT_BIT)
 #define MTK_TRGMII_MT7621_CLK	BIT_ULL(MTK_TRGMII_MT7621_CLK_BIT)
 #define MTK_QDMA		BIT_ULL(MTK_QDMA_BIT)
 #define MTK_SOC_MT7628		BIT_ULL(MTK_SOC_MT7628_BIT)
@@ -1117,15 +1179,15 @@ enum mkt_eth_capabilities {
 #define MT7981_CAPS  (MTK_GMAC1_SGMII | MTK_GMAC2_SGMII | MTK_GMAC2_GEPHY | \
 		      MTK_MUX_GMAC12_TO_GEPHY_SGMII | MTK_QDMA | \
 		      MTK_MUX_U3_GMAC2_TO_QPHY | MTK_U3_COPHY_V2 | \
-		      MTK_RSTCTRL_PPE1 | MTK_SRAM)
+		      MTK_RSTCTRL_PPE1 | MTK_SRAM | MTK_PDMA_INT)
 
 #define MT7986_CAPS  (MTK_GMAC1_SGMII | MTK_GMAC2_SGMII | \
 		      MTK_MUX_GMAC12_TO_GEPHY_SGMII | MTK_QDMA | \
-		      MTK_RSTCTRL_PPE1 | MTK_SRAM)
+		      MTK_RSTCTRL_PPE1 | MTK_SRAM | MTK_PDMA_INT)
 
 #define MT7988_CAPS  (MTK_36BIT_DMA | MTK_GDM1_ESW | MTK_GMAC2_2P5GPHY | \
 		      MTK_MUX_GMAC2_TO_2P5GPHY | MTK_QDMA | MTK_RSTCTRL_PPE1 | \
-		      MTK_RSTCTRL_PPE2 | MTK_SRAM)
+		      MTK_RSTCTRL_PPE2 | MTK_SRAM | MTK_PDMA_INT | MTK_RSS)
 
 struct mtk_tx_dma_desc_info {
 	dma_addr_t	addr;
@@ -1223,6 +1285,7 @@ struct mtk_reg_map {
 struct mtk_soc_data {
 	const struct mtk_reg_map *reg_map;
 	u32             ana_rgc3;
+	u32		rss_num;
 	u64		caps;
 	u64		required_clks;
 	bool		required_pctl;
@@ -1270,7 +1333,8 @@ struct mtk_soc_data {
  *			dummy for NAPI to work
  * @netdev:		The netdev instances
  * @mac:		Each netdev is linked to a physical MAC
- * @irq:		The IRQ that we are using
+ * @irq_fe:		Array of IRQs of the frame engine
+ * @irq_pdma:		Array of IRQs of the PDMA used for RSS
  * @msg_enable:		Ethtool msg level
  * @ethsys:		The register map pointing at the range used to setup
  *			MII modes
@@ -1314,7 +1378,8 @@ struct mtk_eth {
 	struct net_device		*dummy_dev;
 	struct net_device		*netdev[MTK_MAX_DEVS];
 	struct mtk_mac			*mac[MTK_MAX_DEVS];
-	int				irq[MTK_FE_IRQ_NUM];
+	int				irq_fe[MTK_FE_IRQ_NUM];
+	int				irq_pdma[MTK_PDMA_IRQ_NUM];
 	u32				msg_enable;
 	unsigned long			sysclk;
 	struct regmap			*ethsys;
@@ -1327,7 +1392,8 @@ struct mtk_eth {
 	struct mtk_rx_ring		rx_ring[MTK_MAX_RX_RING_NUM];
 	struct mtk_rx_ring		rx_ring_qdma;
 	struct napi_struct		tx_napi;
-	struct napi_struct		rx_napi;
+	struct mtk_napi			rx_napi[MTK_RX_NAPI_NUM];
+	struct mtk_rss_params		rss_params;
 	void				*scratch_ring;
 	dma_addr_t			phy_scratch_ring;
 	void				*scratch_head[MTK_FQ_DMA_HEAD];
-- 
2.43.0
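
For readers following the new per-ring interrupt masks, here is a minimal
user-space sketch of the bit layout implied by the MTK_RX_DONE_INT*() macros
in this patch. It is not driver code: the function names and the is_v3 flag
(standing in for mtk_is_netsys_v3_or_greater(eth)) are illustrative
assumptions, and the ring bound of 4 is taken from MTK_PDMA_IRQ_NUM.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define BIT(n)		(UINT32_C(1) << (n))
#define PDMA_IRQ_NUM	4	/* same value as MTK_PDMA_IRQ_NUM in the patch */

/* Mirrors MTK_RX_DONE_INT_V1(): ring 0 keeps the legacy bit 30,
 * any other ring uses bit 24 + ring_no.
 */
static inline uint32_t rx_done_int_v1(unsigned int ring_no)
{
	return ring_no ? BIT(24 + ring_no) : BIT(30);
}

/* Mirrors MTK_RX_DONE_INT_V2(): every ring uses bit 24 + ring_no. */
static inline uint32_t rx_done_int_v2(unsigned int ring_no)
{
	return BIT(24 + ring_no);
}

/* Mirrors MTK_RX_DONE_INT(eth, ring_no); is_v3 is a hypothetical
 * stand-in for mtk_is_netsys_v3_or_greater(eth).
 */
static inline uint32_t rx_done_int(bool is_v3, unsigned int ring_no)
{
	assert(ring_no < PDMA_IRQ_NUM);
	return is_v3 ? rx_done_int_v2(ring_no) : rx_done_int_v1(ring_no);
}
```

Under these assumptions, ring 0 maps to bit 30 on pre-v3 hardware and to
bit 24 on v3 and later, while rings 1 to 3 map to bits 25 to 27 on both.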

