* Re: [PATCH net v2] eth: fbnic: fix double-free of PCS on phylink creation failure
From: Bobby Eshleman @ 2026-05-07 15:35 UTC (permalink / raw)
To: Jakub Kicinski
Cc: Paolo Abeni, Alexander Duyck, kernel-team, Andrew Lunn,
David S. Miller, Eric Dumazet, Russell King, netdev, linux-kernel,
Bobby Eshleman
In-Reply-To: <20260507072954.263ae8dd@kernel.org>
On Thu, May 07, 2026 at 07:29:54AM -0700, Jakub Kicinski wrote:
> On Thu, 7 May 2026 07:24:53 -0700 Jakub Kicinski wrote:
> > On Thu, 7 May 2026 12:34:24 +0200 Paolo Abeni wrote:
> > > > Clearing fbd->netdev to NULL avoids UAF in init_failure_mode where
> > > > callers guard by checking !fbd->netdev, such as fbnic_mdio_read_pmd().
> > > > These callers remain active even after a failed probe, so fdb->netdev
> > > > still needs to be cleared.
> > > >
> > > > Fixes: d0fe7104c795 ("fbnic: Replace use of internal PCS w/ Designware XPCS")
> > > > Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com>
> > >
> > > Note that sashiko-gemini spotted a pre-existing issue:
> > >
> > > https://sashiko.dev/#/patchset/20260504-fbnic-pcs-fix-v2-1-de45192821d9%40meta.com
> > >
> > > does not block this patch but could deserve a follow-up.
> >
> > fbd is a devlink priv, not netdev priv, touching it after free_netdev()
> > is perfectly fine. I wish Gemini tried a *little* harder instead of
> > guessing :| Sorry for not commenting earlier.
>
> Ugh, not enough coffee. It's complaining about MDIO reads, I think
> that's valid.
It is, but I think the race pre-exists.
static int
fbnic_mdio_read_pmd(struct fbnic_dev *fbd, int addr, int regnum)
[...]
if (fbd->netdev) {
fbn = netdev_priv(fbd->netdev);
if (fbn->aui < FBNIC_AUI_UNKNOWN)
aui = fbn->aui;
}
Definitely possible that ->netdev gets freed concurrently with
fbd->netdev evaluating to true... but fbnic_netdev_free() faces the same
race.
I'm open to fixing this all at once, if preferred. Probably need to look
at some of the other fbnic_net ptr guards too.
Best,
Bobby
^ permalink raw reply
* Re: [PATCH net-next v2 2/6] net: devmem: support TX over NETMEM_TX_NO_DMA devices
From: Bobby Eshleman @ 2026-05-07 15:38 UTC (permalink / raw)
To: Jakub Kicinski
Cc: Andrew Lunn, David S. Miller, Eric Dumazet, Paolo Abeni,
Simon Horman, Jonathan Corbet, Shuah Khan, Alex Shi, Yanteng Si,
Dongliang Mu, Michael Chan, Pavan Chebbi, Joshua Washington,
Harshitha Ramamurthy, Saeed Mahameed, Tariq Toukan, Mark Bloch,
Leon Romanovsky, Alexander Duyck, kernel-team, Daniel Borkmann,
Nikolay Aleksandrov, Shuah Khan, netdev, linux-doc, linux-kernel,
linux-rdma, bpf, linux-kselftest, Stanislav Fomichev,
Mina Almasry, Bobby Eshleman
In-Reply-To: <20260506193420.575e1806@kernel.org>
On Wed, May 06, 2026 at 07:34:20PM -0700, Jakub Kicinski wrote:
> On Mon, 04 May 2026 17:27:49 -0700 Bobby Eshleman wrote:
> > + if (bind_dev != netdev)
> > + netdev_lock(bind_dev);
> > + dma_dev = netdev_queue_get_dma_dev(bind_dev, 0, NETDEV_QUEUE_TYPE_TX);
> > + if (bind_dev != netdev)
> > + netdev_unlock(bind_dev);
> > + binding = net_devmem_bind_dmabuf(bind_dev,
> > + bind_dev != netdev ? netdev : NULL,
> > + dma_dev, DMA_TO_DEVICE, dmabuf_fd,
> > + priv, info->extack);
>
> Not sure if it matters but are we intentionally releasing the bind_dev
> lock before calling net_devmem_bind_dmabuf() ? Previously more code here
> was covered by the physical netdev's lock.
True, lock needs to be held at least until after binding. I have this
fixed in the next rev.
Best,
Bobby
^ permalink raw reply
* Re: [PATCH net-next V2 3/3] net/mlx5: Add VHCA_ID page management mode support
From: Moshe Shemesh @ 2026-05-07 15:39 UTC (permalink / raw)
To: Tariq Toukan, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Andrew Lunn, David S. Miller
Cc: Saeed Mahameed, Leon Romanovsky, Mark Bloch, Akiva Goldberger,
netdev, linux-rdma, linux-kernel, Gal Pressman, Dragos Tatulea
In-Reply-To: <20260506133239.276237-4-tariqt@nvidia.com>
On 5/6/2026 4:32 PM, Tariq Toukan wrote:
> From: Moshe Shemesh <moshe@nvidia.com>
>
> Add support for VHCA_ID-based page management mode. When the device
> firmware advertises the icm_mng_function_id_mode capability with
> MLX5_ID_MODE_FUNCTION_VHCA_ID, page management operations between the
> driver and firmware may use vhca_id instead of function_id as the
> effective function identifier, and the ec_function field is ignored.
>
> Update page management commands to conditionally set ec_function field
> only in FUNC_ID mode. Boot page allocation always uses FUNC_ID mode
> semantics for backward compatibility, as the capability bit is only
> available after set_hca_cap(). If after set_hca_cap() VHCA_ID mode was
> set, modify the tracking of the boot pages in page_root_xa to use
> vhca_id too.
>
> Add mlx5_esw_vhca_id_to_func_type() to resolve the function type in
> VHCA_ID mode, enabling per-type debugfs counters. Use a dedicated
> vhca_type_map xarray, to provide lockless lookup. Store the resolved
> type on each fw_page at allocation time so reclaim and release paths
> read it directly without any lookup.
>
> Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
> Reviewed-by: Akiva Goldberger <agoldberger@nvidia.com>
> Reviewed-by: Mark Bloch <mbloch@nvidia.com>
> Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
> ---
> .../net/ethernet/mellanox/mlx5/core/eswitch.c | 45 +++-
> .../net/ethernet/mellanox/mlx5/core/eswitch.h | 8 +
> .../net/ethernet/mellanox/mlx5/core/main.c | 3 +
> .../ethernet/mellanox/mlx5/core/pagealloc.c | 250 +++++++++++++-----
> include/linux/mlx5/driver.h | 7 +
> 5 files changed, 251 insertions(+), 62 deletions(-)
>
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
> index e0eafcf0c52a..125129ef43e3 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
> @@ -852,6 +852,38 @@ bool mlx5_esw_vport_vhca_id(struct mlx5_eswitch *esw, u16 vportn, u16 *vhca_id)
> return true;
> }
>
> +static enum mlx5_func_type
> +esw_vport_to_func_type(struct mlx5_eswitch *esw, struct mlx5_vport *vport)
> +{
> + u16 vport_num = vport->vport;
> +
> + if (vport_num == MLX5_VPORT_HOST_PF)
> + return MLX5_HOST_PF;
> + if (xa_get_mark(&esw->vports, vport_num, MLX5_ESW_VPT_SF))
> + return MLX5_SF;
> + if (xa_get_mark(&esw->vports, vport_num, MLX5_ESW_VPT_VF))
> + return MLX5_VF;
> + return MLX5_EC_VF;
> +}
> +
> +u16 mlx5_esw_vhca_id_to_func_type(struct mlx5_core_dev *dev, u16 vhca_id)
> +{
> + struct mlx5_eswitch *esw = dev->priv.eswitch;
> + void *entry;
> +
> + if (vhca_id == MLX5_CAP_GEN(dev, vhca_id))
> + return MLX5_SELF;
> +
> + if (!esw)
> + return MLX5_FUNC_TYPE_NONE;
> +
> + entry = xa_load(&esw->vhca_type_map, vhca_id);
> + if (entry)
> + return xa_to_value(entry);
> +
> + return MLX5_FUNC_TYPE_NONE;
> +}
> +
> static int esw_vport_setup(struct mlx5_eswitch *esw, struct mlx5_vport *vport)
> {
> bool vst_mode_steering = esw_vst_mode_is_steering(esw);
> @@ -942,6 +974,11 @@ int mlx5_esw_vport_enable(struct mlx5_eswitch *esw, struct mlx5_vport *vport,
> ret = mlx5_esw_vport_vhca_id_map(esw, vport);
> if (ret)
> goto err_vhca_mapping;
> + ret = xa_insert(&esw->vhca_type_map, vport->vhca_id,
> + xa_mk_value(esw_vport_to_func_type(esw, vport)),
> + GFP_KERNEL);
> + if (ret)
> + goto err_type_map;
Sashiko says:
"
If xa_insert() fails here, the error path goes to err_type_map but does
not appear to revert vport->enabled or the increment of
esw->enabled_ipsec_vf_count that occurred earlier in the function.
Since esw->enabled_vports is only incremented on success, could this leave
the vport in an inconsistent state? A later call to
mlx5_esw_vport_disable() might see vport->enabled as true and decrement
esw->enabled_vports, potentially causing an integer underflow.
"
The same inconsistency existed before this patch with the
err_vhca_mapping path for the mlx5_esw_vport_vhca_id_map() failure.
So it will be addressed in a separate fix patch.
- Moshe
> }
>
> esw_vport_change_handle_locked(vport);
> @@ -952,6 +989,8 @@ int mlx5_esw_vport_enable(struct mlx5_eswitch *esw, struct mlx5_vport *vport,
> mutex_unlock(&esw->state_lock);
> return ret;
>
> +err_type_map:
> + mlx5_esw_vport_vhca_id_unmap(esw, vport);
> err_vhca_mapping:
> esw_vport_cleanup(esw, vport);
> mutex_unlock(&esw->state_lock);
> @@ -976,8 +1015,10 @@ void mlx5_esw_vport_disable(struct mlx5_eswitch *esw, struct mlx5_vport *vport)
> arm_vport_context_events_cmd(esw->dev, vport_num, 0);
>
> if (!mlx5_esw_is_manager_vport(esw, vport_num) &&
> - MLX5_CAP_GEN(esw->dev, vport_group_manager))
> + MLX5_CAP_GEN(esw->dev, vport_group_manager)) {
> + xa_erase(&esw->vhca_type_map, vport->vhca_id);
> mlx5_esw_vport_vhca_id_unmap(esw, vport);
> + }
>
> if (vport->vport != MLX5_VPORT_HOST_PF &&
> (vport->info.ipsec_crypto_enabled || vport->info.ipsec_packet_enabled))
> @@ -2084,6 +2125,7 @@ int mlx5_eswitch_init(struct mlx5_core_dev *dev)
> atomic64_set(&esw->offloads.num_flows, 0);
> ida_init(&esw->offloads.vport_metadata_ida);
> xa_init_flags(&esw->offloads.vhca_map, XA_FLAGS_ALLOC);
> + xa_init(&esw->vhca_type_map);
> mutex_init(&esw->state_lock);
> init_rwsem(&esw->mode_lock);
> refcount_set(&esw->qos.refcnt, 0);
> @@ -2133,6 +2175,7 @@ void mlx5_eswitch_cleanup(struct mlx5_eswitch *esw)
> mutex_destroy(&esw->state_lock);
> WARN_ON(!xa_empty(&esw->offloads.vhca_map));
> xa_destroy(&esw->offloads.vhca_map);
> + xa_destroy(&esw->vhca_type_map);
> ida_destroy(&esw->offloads.vport_metadata_ida);
> mlx5e_mod_hdr_tbl_destroy(&esw->offloads.mod_hdr);
> mutex_destroy(&esw->offloads.encap_tbl_lock);
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
> index 2fd601bd102f..b06d097824ad 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
> @@ -373,6 +373,7 @@ struct mlx5_eswitch {
> struct dentry *debugfs_root;
> struct workqueue_struct *work_queue;
> struct xarray vports;
> + struct xarray vhca_type_map;
> u32 flags;
> int total_vports;
> int enabled_vports;
> @@ -863,6 +864,7 @@ void mlx5_esw_vport_vhca_id_unmap(struct mlx5_eswitch *esw,
> struct mlx5_vport *vport);
> int mlx5_eswitch_vhca_id_to_vport(struct mlx5_eswitch *esw, u16 vhca_id, u16 *vport_num);
> bool mlx5_esw_vport_vhca_id(struct mlx5_eswitch *esw, u16 vportn, u16 *vhca_id);
> +u16 mlx5_esw_vhca_id_to_func_type(struct mlx5_core_dev *dev, u16 vhca_id);
>
> void mlx5_esw_offloads_rep_remove(struct mlx5_eswitch *esw,
> const struct mlx5_vport *vport);
> @@ -1034,6 +1036,12 @@ mlx5_esw_vport_vhca_id(struct mlx5_eswitch *esw, u16 vportn, u16 *vhca_id)
> return false;
> }
>
> +static inline u16
> +mlx5_esw_vhca_id_to_func_type(struct mlx5_core_dev *dev, u16 vhca_id)
> +{
> + return MLX5_FUNC_TYPE_NONE;
> +}
> +
> static inline void
> mlx5_eswitch_safe_aux_devs_remove(struct mlx5_core_dev *dev) {}
> static inline struct mlx5_flow_handle *
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c b/drivers/net/ethernet/mellanox/mlx5/core/main.c
> index 0c1c906b60fa..296c5223cf61 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
> @@ -597,6 +597,9 @@ static int handle_hca_cap(struct mlx5_core_dev *dev, void *set_ctx)
> if (MLX5_CAP_GEN_MAX(dev, release_all_pages))
> MLX5_SET(cmd_hca_cap, set_hca_cap, release_all_pages, 1);
>
> + if (MLX5_CAP_GEN_MAX(dev, icm_mng_function_id_mode))
> + MLX5_SET(cmd_hca_cap, set_hca_cap, icm_mng_function_id_mode, 1);
> +
> if (MLX5_CAP_GEN_MAX(dev, mkey_by_name))
> MLX5_SET(cmd_hca_cap, set_hca_cap, mkey_by_name, 1);
>
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/pagealloc.c b/drivers/net/ethernet/mellanox/mlx5/core/pagealloc.c
> index 77ffa31cc505..ce2f7fa9bd48 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/pagealloc.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/pagealloc.c
> @@ -38,6 +38,7 @@
> #include "mlx5_core.h"
> #include "lib/eq.h"
> #include "lib/tout.h"
> +#include "eswitch.h"
>
> enum {
> MLX5_PAGES_CANT_GIVE = 0,
> @@ -59,6 +60,7 @@ struct fw_page {
> u64 addr;
> struct page *page;
> u32 function;
> + u16 func_type;
> unsigned long bitmask;
> struct list_head list;
> unsigned int free_count;
> @@ -69,9 +71,24 @@ enum {
> MLX5_NUM_4K_IN_PAGE = PAGE_SIZE / MLX5_ADAPTER_PAGE_SIZE,
> };
>
> -static u32 get_function(u16 func_id, bool ec_function)
> +static bool mlx5_page_mgt_mode_is_vhca_id(const struct mlx5_core_dev *dev)
> {
> - return (u32)func_id | (ec_function << 16);
> + return dev->priv.page_mgt_mode == MLX5_PAGE_MGT_MODE_VHCA_ID;
> +}
> +
> +static void mlx5_page_mgt_mode_set(struct mlx5_core_dev *dev,
> + enum mlx5_page_mgt_mode mode)
> +{
> + dev->priv.page_mgt_mode = mode;
> +}
> +
> +static u32 get_function_key(struct mlx5_core_dev *dev, u16 func_vhca_id,
> + bool ec_function)
> +{
> + if (mlx5_page_mgt_mode_is_vhca_id(dev))
> + return (u32)func_vhca_id;
> +
> + return (u32)func_vhca_id | (ec_function << 16);
> }
>
> static u16 func_id_to_type(struct mlx5_core_dev *dev, u16 func_id, bool ec_function)
> @@ -89,12 +106,21 @@ static u16 func_id_to_type(struct mlx5_core_dev *dev, u16 func_id, bool ec_funct
> return MLX5_SF;
> }
>
> +static u16 func_vhca_id_to_type(struct mlx5_core_dev *dev, u16 func_vhca_id,
> + bool ec_function)
> +{
> + if (mlx5_page_mgt_mode_is_vhca_id(dev))
> + return mlx5_esw_vhca_id_to_func_type(dev, func_vhca_id);
> +
> + return func_id_to_type(dev, func_vhca_id, ec_function);
> +}
> +
> static u32 mlx5_get_ec_function(u32 function)
> {
> return function >> 16;
> }
>
> -static u32 mlx5_get_func_id(u32 function)
> +static u32 mlx5_get_func_vhca_id(u32 function)
> {
> return function & 0xffff;
> }
> @@ -123,7 +149,8 @@ static struct rb_root *page_root_per_function(struct mlx5_core_dev *dev, u32 fun
> return root;
> }
>
> -static int insert_page(struct mlx5_core_dev *dev, u64 addr, struct page *page, u32 function)
> +static int insert_page(struct mlx5_core_dev *dev, u64 addr, struct page *page,
> + u32 function, u16 func_type)
> {
> struct rb_node *parent = NULL;
> struct rb_root *root;
> @@ -156,6 +183,7 @@ static int insert_page(struct mlx5_core_dev *dev, u64 addr, struct page *page, u
> nfp->addr = addr;
> nfp->page = page;
> nfp->function = function;
> + nfp->func_type = func_type;
> nfp->free_count = MLX5_NUM_4K_IN_PAGE;
> for (i = 0; i < MLX5_NUM_4K_IN_PAGE; i++)
> set_bit(i, &nfp->bitmask);
> @@ -196,7 +224,7 @@ static struct fw_page *find_fw_page(struct mlx5_core_dev *dev, u64 addr,
> return result;
> }
>
> -static int mlx5_cmd_query_pages(struct mlx5_core_dev *dev, u16 *func_id,
> +static int mlx5_cmd_query_pages(struct mlx5_core_dev *dev, u16 *func_vhca_id,
> s32 *npages, int boot)
> {
> u32 out[MLX5_ST_SZ_DW(query_pages_out)] = {};
> @@ -207,14 +235,20 @@ static int mlx5_cmd_query_pages(struct mlx5_core_dev *dev, u16 *func_id,
> MLX5_SET(query_pages_in, in, op_mod, boot ?
> MLX5_QUERY_PAGES_IN_OP_MOD_BOOT_PAGES :
> MLX5_QUERY_PAGES_IN_OP_MOD_INIT_PAGES);
> - MLX5_SET(query_pages_in, in, embedded_cpu_function, mlx5_core_is_ecpf(dev));
> +
> + if (mlx5_page_mgt_mode_is_vhca_id(dev))
> + MLX5_SET(query_pages_in, in, function_id,
> + MLX5_CAP_GEN(dev, vhca_id));
> + else
> + MLX5_SET(query_pages_in, in, embedded_cpu_function,
> + mlx5_core_is_ecpf(dev));
>
> err = mlx5_cmd_exec_inout(dev, query_pages, in, out);
> if (err)
> return err;
>
> *npages = MLX5_GET(query_pages_out, out, num_pages);
> - *func_id = MLX5_GET(query_pages_out, out, function_id);
> + *func_vhca_id = MLX5_GET(query_pages_out, out, function_id);
>
> return err;
> }
> @@ -245,6 +279,10 @@ static int alloc_4k(struct mlx5_core_dev *dev, u64 *addr, u32 function)
> if (!fp->free_count)
> list_del(&fp->list);
>
> + if (fp->func_type != MLX5_FUNC_TYPE_NONE)
> + dev->priv.page_counters[fp->func_type]++;
> + dev->priv.fw_pages++;
> +
> *addr = fp->addr + n * MLX5_ADAPTER_PAGE_SIZE;
>
> return 0;
> @@ -280,6 +318,11 @@ static void free_4k(struct mlx5_core_dev *dev, u64 addr, u32 function)
> mlx5_core_warn_rl(dev, "page not found\n");
> return;
> }
> +
> + if (fwp->func_type != MLX5_FUNC_TYPE_NONE)
> + dev->priv.page_counters[fwp->func_type]--;
> + dev->priv.fw_pages--;
> +
> n = (addr & ~MLX5_U64_4K_PAGE_MASK) >> MLX5_ADAPTER_PAGE_SHIFT;
> fwp->free_count++;
> set_bit(n, &fwp->bitmask);
> @@ -289,7 +332,8 @@ static void free_4k(struct mlx5_core_dev *dev, u64 addr, u32 function)
> list_add(&fwp->list, &dev->priv.free_list);
> }
>
> -static int alloc_system_page(struct mlx5_core_dev *dev, u32 function)
> +static int alloc_system_page(struct mlx5_core_dev *dev, u32 function,
> + u16 func_type)
> {
> struct device *device = mlx5_core_dma_dev(dev);
> int nid = dev->priv.numa_node;
> @@ -317,7 +361,7 @@ static int alloc_system_page(struct mlx5_core_dev *dev, u32 function)
> goto map;
> }
>
> - err = insert_page(dev, addr, page, function);
> + err = insert_page(dev, addr, page, function, func_type);
> if (err) {
> mlx5_core_err(dev, "failed to track allocated page\n");
> dma_unmap_page(device, addr, PAGE_SIZE, DMA_BIDIRECTIONAL);
> @@ -334,7 +378,7 @@ static int alloc_system_page(struct mlx5_core_dev *dev, u32 function)
> return err;
> }
>
> -static void page_notify_fail(struct mlx5_core_dev *dev, u16 func_id,
> +static void page_notify_fail(struct mlx5_core_dev *dev, u16 func_vhca_id,
> bool ec_function)
> {
> u32 in[MLX5_ST_SZ_DW(manage_pages_in)] = {};
> @@ -342,19 +386,23 @@ static void page_notify_fail(struct mlx5_core_dev *dev, u16 func_id,
>
> MLX5_SET(manage_pages_in, in, opcode, MLX5_CMD_OP_MANAGE_PAGES);
> MLX5_SET(manage_pages_in, in, op_mod, MLX5_PAGES_CANT_GIVE);
> - MLX5_SET(manage_pages_in, in, function_id, func_id);
> - MLX5_SET(manage_pages_in, in, embedded_cpu_function, ec_function);
> + MLX5_SET(manage_pages_in, in, function_id, func_vhca_id);
> +
> + if (!mlx5_page_mgt_mode_is_vhca_id(dev))
> + MLX5_SET(manage_pages_in, in, embedded_cpu_function,
> + ec_function);
>
> err = mlx5_cmd_exec_in(dev, manage_pages, in);
> if (err)
> - mlx5_core_warn(dev, "page notify failed func_id(%d) err(%d)\n",
> - func_id, err);
> + mlx5_core_warn(dev,
> + "page notify failed func_vhca_id(%d) err(%d)\n",
> + func_vhca_id, err);
> }
>
> -static int give_pages(struct mlx5_core_dev *dev, u16 func_id, int npages,
> +static int give_pages(struct mlx5_core_dev *dev, u16 func_vhca_id, int npages,
> int event, bool ec_function)
> {
> - u32 function = get_function(func_id, ec_function);
> + u32 function = get_function_key(dev, func_vhca_id, ec_function);
> u32 out[MLX5_ST_SZ_DW(manage_pages_out)] = {0};
> int inlen = MLX5_ST_SZ_BYTES(manage_pages_in);
> int notify_fail = event;
> @@ -364,6 +412,8 @@ static int give_pages(struct mlx5_core_dev *dev, u16 func_id, int npages,
> u32 *in;
> int i;
>
> + func_type = func_vhca_id_to_type(dev, func_vhca_id, ec_function);
> +
> inlen += npages * MLX5_FLD_SZ_BYTES(manage_pages_in, pas[0]);
> in = kvzalloc(inlen, GFP_KERNEL);
> if (!in) {
> @@ -377,7 +427,8 @@ static int give_pages(struct mlx5_core_dev *dev, u16 func_id, int npages,
> err = alloc_4k(dev, &addr, function);
> if (err) {
> if (err == -ENOMEM)
> - err = alloc_system_page(dev, function);
> + err = alloc_system_page(dev, function,
> + func_type);
> if (err) {
> dev->priv.fw_pages_alloc_failed += (npages - i);
> goto out_4k;
> @@ -390,9 +441,12 @@ static int give_pages(struct mlx5_core_dev *dev, u16 func_id, int npages,
>
> MLX5_SET(manage_pages_in, in, opcode, MLX5_CMD_OP_MANAGE_PAGES);
> MLX5_SET(manage_pages_in, in, op_mod, MLX5_PAGES_GIVE);
> - MLX5_SET(manage_pages_in, in, function_id, func_id);
> + MLX5_SET(manage_pages_in, in, function_id, func_vhca_id);
> MLX5_SET(manage_pages_in, in, input_num_entries, npages);
> - MLX5_SET(manage_pages_in, in, embedded_cpu_function, ec_function);
> +
> + if (!mlx5_page_mgt_mode_is_vhca_id(dev))
> + MLX5_SET(manage_pages_in, in, embedded_cpu_function,
> + ec_function);
>
> err = mlx5_cmd_do(dev, in, inlen, out, sizeof(out));
> if (err == -EREMOTEIO) {
> @@ -405,17 +459,15 @@ static int give_pages(struct mlx5_core_dev *dev, u16 func_id, int npages,
> }
> err = mlx5_cmd_check(dev, err, in, out);
> if (err) {
> - mlx5_core_warn(dev, "func_id 0x%x, npages %d, err %d\n",
> - func_id, npages, err);
> + mlx5_core_warn(dev, "func_vhca_id 0x%x, npages %d, err %d\n",
> + func_vhca_id, npages, err);
> goto out_dropped;
> }
>
> - func_type = func_id_to_type(dev, func_id, ec_function);
> - dev->priv.page_counters[func_type] += npages;
> - dev->priv.fw_pages += npages;
> -
> - mlx5_core_dbg(dev, "npages %d, ec_function %d, func_id 0x%x, err %d\n",
> - npages, ec_function, func_id, err);
> + mlx5_core_dbg(dev,
> + "npages %d, ec_function %d, func 0x%x, mode %d, err %d\n",
> + npages, ec_function, func_vhca_id,
> + mlx5_page_mgt_mode_is_vhca_id(dev), err);
>
> kvfree(in);
> return 0;
> @@ -428,18 +480,17 @@ static int give_pages(struct mlx5_core_dev *dev, u16 func_id, int npages,
> out_free:
> kvfree(in);
> if (notify_fail)
> - page_notify_fail(dev, func_id, ec_function);
> + page_notify_fail(dev, func_vhca_id, ec_function);
> return err;
> }
>
> -static void release_all_pages(struct mlx5_core_dev *dev, u16 func_id,
> +static void release_all_pages(struct mlx5_core_dev *dev, u16 func_vhca_id,
> bool ec_function)
> {
> - u32 function = get_function(func_id, ec_function);
> + u32 function = get_function_key(dev, func_vhca_id, ec_function);
> struct rb_root *root;
> struct rb_node *p;
> int npages = 0;
> - u16 func_type;
>
> root = xa_load(&dev->priv.page_root_xa, function);
> if (WARN_ON_ONCE(!root))
> @@ -448,18 +499,20 @@ static void release_all_pages(struct mlx5_core_dev *dev, u16 func_id,
> p = rb_first(root);
> while (p) {
> struct fw_page *fwp = rb_entry(p, struct fw_page, rb_node);
> + int used = MLX5_NUM_4K_IN_PAGE - fwp->free_count;
>
> p = rb_next(p);
> - npages += (MLX5_NUM_4K_IN_PAGE - fwp->free_count);
> + npages += used;
> + if (fwp->func_type != MLX5_FUNC_TYPE_NONE)
> + dev->priv.page_counters[fwp->func_type] -= used;
> free_fwp(dev, fwp, fwp->free_count);
> }
>
> - func_type = func_id_to_type(dev, func_id, ec_function);
> - dev->priv.page_counters[func_type] -= npages;
> dev->priv.fw_pages -= npages;
>
> - mlx5_core_dbg(dev, "npages %d, ec_function %d, func_id 0x%x\n",
> - npages, ec_function, func_id);
> + mlx5_core_dbg(dev, "npages %d, ec_function %d, func 0x%x, mode %d\n",
> + npages, ec_function, func_vhca_id,
> + mlx5_page_mgt_mode_is_vhca_id(dev));
> }
>
> static u32 fwp_fill_manage_pages_out(struct fw_page *fwp, u32 *out, u32 index,
> @@ -487,7 +540,7 @@ static int reclaim_pages_cmd(struct mlx5_core_dev *dev,
> struct fw_page *fwp;
> struct rb_node *p;
> bool ec_function;
> - u32 func_id;
> + u32 func_vhca_id;
> u32 npages;
> u32 i = 0;
> int err;
> @@ -499,10 +552,11 @@ static int reclaim_pages_cmd(struct mlx5_core_dev *dev,
>
> /* No hard feelings, we want our pages back! */
> npages = MLX5_GET(manage_pages_in, in, input_num_entries);
> - func_id = MLX5_GET(manage_pages_in, in, function_id);
> + func_vhca_id = MLX5_GET(manage_pages_in, in, function_id);
> ec_function = MLX5_GET(manage_pages_in, in, embedded_cpu_function);
>
> - root = xa_load(&dev->priv.page_root_xa, get_function(func_id, ec_function));
> + root = xa_load(&dev->priv.page_root_xa,
> + get_function_key(dev, func_vhca_id, ec_function));
> if (WARN_ON_ONCE(!root))
> return -EEXIST;
>
> @@ -518,14 +572,14 @@ static int reclaim_pages_cmd(struct mlx5_core_dev *dev,
> return 0;
> }
>
> -static int reclaim_pages(struct mlx5_core_dev *dev, u16 func_id, int npages,
> - int *nclaimed, bool event, bool ec_function)
> +static int reclaim_pages(struct mlx5_core_dev *dev, u16 func_vhca_id,
> + int npages, int *nclaimed, bool event,
> + bool ec_function)
> {
> - u32 function = get_function(func_id, ec_function);
> + u32 function = get_function_key(dev, func_vhca_id, ec_function);
> int outlen = MLX5_ST_SZ_BYTES(manage_pages_out);
> u32 in[MLX5_ST_SZ_DW(manage_pages_in)] = {};
> int num_claimed;
> - u16 func_type;
> u32 *out;
> int err;
> int i;
> @@ -540,12 +594,16 @@ static int reclaim_pages(struct mlx5_core_dev *dev, u16 func_id, int npages,
>
> MLX5_SET(manage_pages_in, in, opcode, MLX5_CMD_OP_MANAGE_PAGES);
> MLX5_SET(manage_pages_in, in, op_mod, MLX5_PAGES_TAKE);
> - MLX5_SET(manage_pages_in, in, function_id, func_id);
> + MLX5_SET(manage_pages_in, in, function_id, func_vhca_id);
> MLX5_SET(manage_pages_in, in, input_num_entries, npages);
> - MLX5_SET(manage_pages_in, in, embedded_cpu_function, ec_function);
>
> - mlx5_core_dbg(dev, "func 0x%x, npages %d, outlen %d\n",
> - func_id, npages, outlen);
> + if (!mlx5_page_mgt_mode_is_vhca_id(dev))
> + MLX5_SET(manage_pages_in, in, embedded_cpu_function,
> + ec_function);
> +
> + mlx5_core_dbg(dev, "func 0x%x, npages %d, outlen %d mode %d\n",
> + func_vhca_id, npages, outlen,
> + mlx5_page_mgt_mode_is_vhca_id(dev));
> err = reclaim_pages_cmd(dev, in, sizeof(in), out, outlen);
> if (err) {
> npages = MLX5_GET(manage_pages_in, in, input_num_entries);
> @@ -577,10 +635,6 @@ static int reclaim_pages(struct mlx5_core_dev *dev, u16 func_id, int npages,
> if (nclaimed)
> *nclaimed = num_claimed;
>
> - func_type = func_id_to_type(dev, func_id, ec_function);
> - dev->priv.page_counters[func_type] -= num_claimed;
> - dev->priv.fw_pages -= num_claimed;
> -
> out_free:
> kvfree(out);
> return err;
> @@ -658,30 +712,102 @@ static int req_pages_handler(struct notifier_block *nb,
> * req->npages (and not min ()).
> */
> req->npages = max_t(s32, npages, MAX_RECLAIM_NPAGES);
> - req->ec_function = ec_function;
> + if (!mlx5_page_mgt_mode_is_vhca_id(dev))
> + req->ec_function = ec_function;
> req->release_all = release_all;
> INIT_WORK(&req->work, pages_work_handler);
> queue_work(dev->priv.pg_wq, &req->work);
> return NOTIFY_OK;
> }
>
> +/*
> + * After set_hca_cap(), the second satisfy_startup_pages(dev, 0) may see
> + * VHCA_ID mode. If page_root_xa already has the PF entry from the first
> + * (boot) call under FUNC_ID keys 0 or (ec_function << 16), migrate that
> + * entry to the device vhca_id key so lookups use VHCA_ID semantics.
> + */
> +static int mlx5_pagealloc_migrate_pf_to_vhca_id(struct mlx5_core_dev *dev)
> +{
> + u32 vhca_id_key, old_key;
> + struct rb_root *root;
> + struct fw_page *fwp;
> + struct rb_node *p;
> + bool ec_function;
> + int err;
> +
> + if (xa_empty(&dev->priv.page_root_xa))
> + return 0;
> +
> + vhca_id_key = MLX5_CAP_GEN(dev, vhca_id);
> + ec_function = mlx5_core_is_ecpf(dev);
> +
> + old_key = ec_function ? (1U << 16) : 0;
> + root = xa_load(&dev->priv.page_root_xa, old_key);
> + if (!root)
> + return 0;
> +
> + if (old_key == vhca_id_key)
> + return 0;
> +
> + err = xa_insert(&dev->priv.page_root_xa, vhca_id_key, root, GFP_KERNEL);
> + if (err) {
> + mlx5_core_warn(dev,
> + "failed to migrate page root key 0x%x to vhca_id 0x%x\n",
> + old_key, vhca_id_key);
> + return err;
> + }
> +
> + for (p = rb_first(root); p; p = rb_next(p)) {
> + fwp = rb_entry(p, struct fw_page, rb_node);
> + fwp->function = vhca_id_key;
> + }
> +
> + xa_erase(&dev->priv.page_root_xa, old_key);
> +
> + return 0;
> +}
> +
> int mlx5_satisfy_startup_pages(struct mlx5_core_dev *dev, int boot)
> {
> - u16 func_id;
> + bool ec_function = false;
> + u16 func_vhca_id;
> s32 npages;
> int err;
>
> - err = mlx5_cmd_query_pages(dev, &func_id, &npages, boot);
> + /* Boot pages are requested before set_hca_cap(), so the capability
> + * is not negotiated yet; use FUNC_ID mode for backward compatibility.
> + * Init pages are requested after set_hca_cap(), which unconditionally
> + * enables CAP_GEN_MAX. Current caps are not re-queried at this point,
> + * so check CAP_GEN_MAX directly.
> + */
> + if (boot) {
> + mlx5_page_mgt_mode_set(dev, MLX5_PAGE_MGT_MODE_FUNC_ID);
> + } else {
> + if (MLX5_CAP_GEN_MAX(dev, icm_mng_function_id_mode) ==
> + MLX5_ID_MODE_FUNCTION_VHCA_ID) {
> + err = mlx5_pagealloc_migrate_pf_to_vhca_id(dev);
> + if (err)
> + return err;
> + mlx5_page_mgt_mode_set(dev, MLX5_PAGE_MGT_MODE_VHCA_ID);
> + }
> + }
> +
> + err = mlx5_cmd_query_pages(dev, &func_vhca_id, &npages, boot);
> if (err)
> return err;
>
> - mlx5_core_dbg(dev, "requested %d %s pages for func_id 0x%x\n",
> - npages, boot ? "boot" : "init", func_id);
> + mlx5_core_dbg(dev,
> + "requested %d %s pages for func_vhca_id 0x%x\n",
> + npages, boot ? "boot" : "init", func_vhca_id);
>
> if (!npages)
> return 0;
>
> - return give_pages(dev, func_id, npages, 0, mlx5_core_is_ecpf(dev));
> + /* In VHCA_ID mode, ec_function remains false (not used). */
> + if (!mlx5_page_mgt_mode_is_vhca_id(dev))
> + ec_function = mlx5_core_is_ecpf(dev);
> +
> + return give_pages(dev, func_vhca_id, npages, 0, ec_function);
> }
>
> enum {
> @@ -709,15 +835,17 @@ static int mlx5_reclaim_root_pages(struct mlx5_core_dev *dev,
>
> while (!RB_EMPTY_ROOT(root)) {
> u32 ec_function = mlx5_get_ec_function(function);
> - u32 function_id = mlx5_get_func_id(function);
> + u32 func_vhca_id = mlx5_get_func_vhca_id(function);
> int nclaimed;
> int err;
>
> - err = reclaim_pages(dev, function_id, optimal_reclaimed_pages(),
> + err = reclaim_pages(dev, func_vhca_id,
> + optimal_reclaimed_pages(),
> &nclaimed, false, ec_function);
> if (err) {
> - mlx5_core_warn(dev, "reclaim_pages err (%d) func_id=0x%x ec_func=0x%x\n",
> - err, function_id, ec_function);
> + mlx5_core_warn(dev,
> + "reclaim_pages err (%d) func_vhca_id=0x%x ec_func=0x%x\n",
> + err, func_vhca_id, ec_function);
> return err;
> }
>
> diff --git a/include/linux/mlx5/driver.h b/include/linux/mlx5/driver.h
> index d1751c5d01c7..8b4d384125d1 100644
> --- a/include/linux/mlx5/driver.h
> +++ b/include/linux/mlx5/driver.h
> @@ -558,6 +558,12 @@ enum mlx5_func_type {
> MLX5_HOST_PF,
> MLX5_EC_VF,
> MLX5_FUNC_TYPE_NUM,
> + MLX5_FUNC_TYPE_NONE = MLX5_FUNC_TYPE_NUM,
> +};
> +
> +enum mlx5_page_mgt_mode {
> + MLX5_PAGE_MGT_MODE_FUNC_ID,
> + MLX5_PAGE_MGT_MODE_VHCA_ID,
> };
>
> struct mlx5_frag_buf_node_pools;
> @@ -578,6 +584,7 @@ struct mlx5_priv {
> u32 fw_pages_alloc_failed;
> u32 give_pages_dropped;
> u32 reclaim_pages_discard;
> + enum mlx5_page_mgt_mode page_mgt_mode;
>
> struct mlx5_core_health health;
> struct list_head traps;
^ permalink raw reply
* Re: [PATCH net v2] net: ethtool: fix NULL pointer dereference in phy_reply_size
From: Maxime Chevallier @ 2026-05-07 15:39 UTC (permalink / raw)
To: Quan Sun, netdev, andrew; +Cc: kuba, edumazet, pabeni
In-Reply-To: <20260507131738.1173835-1-2022090917019@std.uestc.edu.cn>
Hi,
On 07/05/2026 15:17, Quan Sun wrote:
> In phy_prepare_data(), several strings such as 'name', 'drvname',
> 'upstream_sfp_name', and 'downstream_sfp_name' are allocated using
> kstrdup(). However, these allocations were not checked for failure.
>
> If kstrdup() fails for 'name', it returns NULL while the function
> continues. This leads to a kernel NULL pointer dereference and panic
> later in phy_reply_size() when it unconditionally calls strlen() on
> the NULL pointer.
>
> While other strings like 'upstream_sfp_name' might be checked before
> access in certain code paths, failing to handle these allocations
> consistently can lead to incomplete data reporting or hidden bugs.
>
> Fix this by adding proper NULL checks for all kstrdup() calls in
> phy_prepare_data() and implement a centralized error handling path
> using goto labels to ensure all previously allocated resources are
> freed on failure.
>
> Fixes: 9dd2ad5e92b9 ("net: ethtool: phy: Convert the PHY_GET command to generic phy dump")
> Signed-off-by: Quan Sun <2022090917019@std.uestc.edu.cn>
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Thanks,
Maxime
> ---
> Changes in v2:
> - Add Fixes: tag.
> - Expand the fix to cover all kstrdup() allocations in the function.
> - Use goto labels for a cleaner and more robust error handling path.
> ---
> net/ethtool/phy.c | 32 ++++++++++++++++++++++++++++++--
> 1 file changed, 30 insertions(+), 2 deletions(-)
>
> diff --git a/net/ethtool/phy.c b/net/ethtool/phy.c
> index d4e6887055ab1..f76d94d848d6d 100644
> --- a/net/ethtool/phy.c
> +++ b/net/ethtool/phy.c
> @@ -76,6 +76,7 @@ static int phy_prepare_data(const struct ethnl_req_info *req_info,
> struct nlattr **tb = info->attrs;
> struct phy_device_node *pdn;
> struct phy_device *phydev;
> + int ret;
>
> /* RTNL is held by the caller */
> phydev = ethnl_req_get_phydev(req_info, tb, ETHTOOL_A_PHY_HEADER,
> @@ -88,8 +89,17 @@ static int phy_prepare_data(const struct ethnl_req_info *req_info,
> return -EOPNOTSUPP;
>
> rep_data->phyindex = phydev->phyindex;
> +
> rep_data->name = kstrdup(dev_name(&phydev->mdio.dev), GFP_KERNEL);
> + if (!rep_data->name)
> + return -ENOMEM;
> +
> rep_data->drvname = kstrdup(phydev->drv->name, GFP_KERNEL);
> + if (!rep_data->drvname) {
> + ret = -ENOMEM;
> + goto err_free_name;
> + }
> +
> rep_data->upstream_type = pdn->upstream_type;
>
> if (pdn->upstream_type == PHY_UPSTREAM_PHY) {
> @@ -97,15 +107,33 @@ static int phy_prepare_data(const struct ethnl_req_info *req_info,
> rep_data->upstream_index = upstream->phyindex;
> }
>
> - if (pdn->parent_sfp_bus)
> + if (pdn->parent_sfp_bus) {
> rep_data->upstream_sfp_name = kstrdup(sfp_get_name(pdn->parent_sfp_bus),
> GFP_KERNEL);
> + if (!rep_data->upstream_sfp_name) {
> + ret = -ENOMEM;
> + goto err_free_drvname;
> + }
> + }
>
> - if (phydev->sfp_bus)
> + if (phydev->sfp_bus) {
> rep_data->downstream_sfp_name = kstrdup(sfp_get_name(phydev->sfp_bus),
> GFP_KERNEL);
> + if (!rep_data->downstream_sfp_name) {
> + ret = -ENOMEM;
> + goto err_free_upstream_sfp;
> + }
> + }
>
> return 0;
> +
> +err_free_upstream_sfp:
> + kfree(rep_data->upstream_sfp_name);
> +err_free_drvname:
> + kfree(rep_data->drvname);
> +err_free_name:
> + kfree(rep_data->name);
> + return ret;
> }
>
> static int phy_fill_reply(struct sk_buff *skb,
^ permalink raw reply
* Re: [PATCH] dt-bindings: net: lan966x: Accept standard ethernet prefixes
From: Rob Herring (Arm) @ 2026-05-07 15:41 UTC (permalink / raw)
To: Linus Walleij
Cc: Andrew Lunn, Jakub Kicinski, Eric Dumazet, Krzysztof Kozlowski,
Herve Codina, Paolo Abeni, Horatiu Vultur, devicetree,
Conor Dooley, David S. Miller, netdev
In-Reply-To: <20260507-lan966-binding-v1-1-e99293d2a4ec@kernel.org>
On Thu, 07 May 2026 11:26:01 +0200, Linus Walleij wrote:
> The dsa.yaml and ethernet-switch.yaml bindings recommend
> prefixing ethernet switches and ports with "ethernet-" so
> make the LAN966x do the same.
>
> Reported-by: Herve Codina <herve.codina@bootlin.com>
> Signed-off-by: Linus Walleij <linusw@kernel.org>
> ---
> .../devicetree/bindings/net/microchip,lan966x-switch.yaml | 10 +++++-----
> 1 file changed, 5 insertions(+), 5 deletions(-)
>
Reviewed-by: Rob Herring (Arm) <robh@kernel.org>
^ permalink raw reply
* [PATCH net-next v3] net: eth: fbnic: Fix addr validation in pcs write
From: mike.marciniszyn @ 2026-05-07 15:42 UTC (permalink / raw)
To: Alexander Duyck, Jakub Kicinski, kernel-team, Andrew Lunn,
David S. Miller, Eric Dumazet, Paolo Abeni, Andrew Lunn
Cc: mike.marciniszyn, netdev, linux-kernel
From: "Mike Marciniszyn (Meta)" <mike.marciniszyn@gmail.com>
The DW IP has two distinct PCS address ranges cooresponding
to the C45 PCS registers.
The shim translates the PCS addr/regno into specific CSR writes
into one of those two zero-relative ranges.
This patch fixes a one off in the test that could allow an invalid
CSR write if an addr == 2 was called.
There are is of yet, no real impact for the bug as no PCS writes are
present.
Signed-off-by: Mike Marciniszyn (Meta) <mike.marciniszyn@gmail.com>
---
v3:
- put back into the series based on https://lore.kernel.org/all/9ec11642-8035-419c-a896-52f902020bb8@lunn.ch/
- revised commit message will additional details
v2:
- omitted from patch series
v1: https://lore.kernel.org/all/20260428172810.175077-2-mike.marciniszyn@gmail.com/
drivers/net/ethernet/meta/fbnic/fbnic_mdio.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/meta/fbnic/fbnic_mdio.c b/drivers/net/ethernet/meta/fbnic/fbnic_mdio.c
index 709041f7fc43..d6a124889f52 100644
--- a/drivers/net/ethernet/meta/fbnic/fbnic_mdio.c
+++ b/drivers/net/ethernet/meta/fbnic/fbnic_mdio.c
@@ -125,7 +125,7 @@ fbnic_mdio_write_pcs(struct fbnic_dev *fbd, int addr, int regnum, u16 val)
addr, regnum, val);
/* Allow access to both halves of PCS for 50R2 config */
- if (addr > 2)
+ if (addr >= 2)
return;
/* Skip write for reserved registers */
--
2.43.0
^ permalink raw reply related
* Re: [PATCH RFC net-next v3] hsr: Allow to send a specific port and with HSR header
From: Sebastian Andrzej Siewior @ 2026-05-07 15:43 UTC (permalink / raw)
To: Willem de Bruijn
Cc: netdev, Andrew Lunn, Chintan Vankar, Danish Anwar, Daolin Qiu,
David S. Miller, Eric Dumazet, Felix Maurer, Jakub Kicinski,
Neelima Muralidharan, Paolo Abeni, Praneeth Bajjuri,
Pratheesh Gangadhar TK, Richard Cochran, Simon Horman,
Vignesh Raghavendra
In-Reply-To: <willemdebruijn.kernel.b7832f991167@gmail.com>
On 2026-05-06 09:20:41 [-0400], Willem de Bruijn wrote:
> > No, I parse the header. As far as I can tell this set based on the
> > header bits from data passed to af_packet. Even if I could change this
> > to something else, these 16bit should better not collide with anything
> > else so I think used the eth-type and expecting nobody sending a PTP
> > packet is good.
>
> Is that more fragile than a specific protocol? And introduces that
> dependency on mirroring the Ethernet format. But your call.
>
> Ethernet does have requirements on MAC address format, so a layout
> that corresponds to an invalid address could be a signal.
perfect. Thank you.
Sebastian
^ permalink raw reply
* Re: [PATCH net-next V6 2/3] net/mlx5e: Avoid copying payload to the skb's linear part
From: Dragos Tatulea @ 2026-05-07 15:49 UTC (permalink / raw)
To: Amery Hung, Tariq Toukan
Cc: Christoph Paasch, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Andrew Lunn, David S. Miller, Saeed Mahameed, Mark Bloch,
Leon Romanovsky, netdev, linux-rdma, linux-kernel, Gal Pressman,
Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend,
Stanislav Fomichev, Alexei Starovoitov
In-Reply-To: <CAMB2axOFQN2f=veYgeJs+4tbZmb9PuNHk03TH_bmE8UL_REd7w@mail.gmail.com>
Hi Amery,
On 07.05.26 15:53, Amery Hung wrote:
> [...]
> Am I understanding correctly that the better performance comes with
> the assumption that the XDP does not change headers?
>
> headlen is determined before the XDP program runs. If it push/pop
> headers, there could be headers in frags or data in the linear region
> after __pskb_pull_tail().
>
That's right.
>> if (__test_and_clear_bit(MLX5E_RQ_FLAG_XDP_XMIT, rq->flags)) {
>> struct mlx5e_frag_page *pfp;
>> @@ -2060,8 +2066,7 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
>> pagep->frags++;
>> while (++pagep < frag_page);
>>
>> - headlen = min_t(u16, MLX5E_RX_MAX_HEAD - len,
>> - skb->data_len);
>> + headlen = min_t(u16, headlen - len, skb->data_len);
>
> headlen - len can underflow but will be capped by skb->data_len, so
> this should be okay, right?
It is safe. But it might trigger an extra allocation in the pull when
len > headlen. We could also skip the pull in that case. Or do a
min(headlen - len, min(skb->data_len, MLX5E_RX_MAX_HEAD)). WDYT?
Thanks,
Dragos
^ permalink raw reply
* [PATCH batadv 1/1] batman-adv: fix tp_meter counter underflow during shutdown
From: Ren Wei @ 2026-05-07 15:49 UTC (permalink / raw)
To: b.a.t.m.a.n, netdev
Cc: marek.lindner, sw, antonio, sven, yuantan098, yifanwucs,
tomapufckgml, bird, rakukuip, n05ec
In-Reply-To: <cover.1778118303.git.rakukuip@gmail.com>
From: Luxiao Xu <rakukuip@gmail.com>
batadv_tp_sender_shutdown() unconditionally decrements the "sending"
atomic counter. If multiple paths (e.g. timeout, user cancel, and
normal finish) call this function, the counter can underflow to -1.
Since the sender logic treats any non-zero value as "still sending",
a negative value causes the sender kthread to loop indefinitely.
This leads to a use-after-free when the interface is removed while
the zombie thread is still active.
Fix this by using atomic_xchg() to ensure the counter only transitions
from 1 to 0 once.
Fixes: 33a3bb4a3345 ("batman-adv: throughput meter implementation")
Cc: stable@kernel.org
Reported-by: Yuan Tan <yuantan098@gmail.com>
Reported-by: Yifan Wu <yifanwucs@gmail.com>
Reported-by: Juefei Pu <tomapufckgml@gmail.com>
Reported-by: Xin Liu <bird@lzu.edu.cn>
Signed-off-by: Luxiao Xu <rakukuip@gmail.com>
Signed-off-by: Ren Wei <n05ec@lzu.edu.cn>
---
net/batman-adv/tp_meter.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/net/batman-adv/tp_meter.c b/net/batman-adv/tp_meter.c
index 2e42f6b348c8..4c582443f67c 100644
--- a/net/batman-adv/tp_meter.c
+++ b/net/batman-adv/tp_meter.c
@@ -435,7 +435,7 @@ static void batadv_tp_sender_end(struct batadv_priv *bat_priv,
static void batadv_tp_sender_shutdown(struct batadv_tp_vars *tp_vars,
enum batadv_tp_meter_reason reason)
{
- if (!atomic_dec_and_test(&tp_vars->sending))
+ if (atomic_xchg(&tp_vars->sending, 0) != 1)
return;
tp_vars->reason = reason;
--
2.43.0
^ permalink raw reply related
* Re: [PATCH v1 net] tcp: Fix dst leak in tcp_v6_connect().
From: patchwork-bot+netdevbpf @ 2026-05-07 15:50 UTC (permalink / raw)
To: Kuniyuki Iwashima
Cc: edumazet, ncardwell, davem, kuba, pabeni, horms, kuni1840, netdev,
melotti
In-Reply-To: <20260506070443.1699879-1-kuniyu@google.com>
Hello:
This patch was applied to netdev/net.git (main)
by Jakub Kicinski <kuba@kernel.org>:
On Wed, 6 May 2026 07:04:42 +0000 you wrote:
> If a socket is bound to a wildcard address, tcp_v[46]_connect()
> updates it with a non-wildcard address based on the route lookup.
>
> After bhash2 was introduced in the cited commit, we must call
> inet_bhash2_update_saddr() to update the bhash2 entry as well.
>
> If inet_bhash2_update_saddr() fails, we must release the refcount
> for dst by ip_route_connect() or ip6_dst_lookup_flow().
>
> [...]
Here is the summary with links:
- [v1,net] tcp: Fix dst leak in tcp_v6_connect().
https://git.kernel.org/netdev/net/c/ecddc523cfdb
You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html
^ permalink raw reply
* Re: [PATCH v1 net] ipmr: Call ipmr_fib_lookup() under RCU.
From: patchwork-bot+netdevbpf @ 2026-05-07 15:50 UTC (permalink / raw)
To: Kuniyuki Iwashima
Cc: dsahern, idosch, davem, edumazet, kuba, pabeni, horms, kuni1840,
netdev, syzkaller, yi1.lai
In-Reply-To: <20260506065955.1695753-1-kuniyu@google.com>
Hello:
This patch was applied to netdev/net.git (main)
by Jakub Kicinski <kuba@kernel.org>:
On Wed, 6 May 2026 06:59:53 +0000 you wrote:
> Yi Lai reported RCU splat in reg_vif_xmit() below. [0]
>
> When CONFIG_IP_MROUTE_MULTIPLE_TABLES=n, ipmr_fib_lookup()
> uses rcu_dereference() without explicit rcu_read_lock().
>
> Although rcu_read_lock_bh() is already held by the caller
> __dev_queue_xmit(), lockdep requires explicit rcu_read_lock()
> for rcu_dereference().
>
> [...]
Here is the summary with links:
- [v1,net] ipmr: Call ipmr_fib_lookup() under RCU.
https://git.kernel.org/netdev/net/c/019c892e4654
You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html
^ permalink raw reply
* Re: [PATCH net 0/1] net: stmmac: dwmac-nuvoton: fix NULL pointer dereference in nvt_set_phy_intf_sel()
From: patchwork-bot+netdevbpf @ 2026-05-07 15:50 UTC (permalink / raw)
To: Joey Lu
Cc: andrew+netdev, davem, edumazet, kuba, pabeni, mcoquelin.stm32,
alexandre.torgue, linux-arm-kernel, netdev, linux-stm32,
linux-kernel
In-Reply-To: <20260506084614.192894-1-a0987203069@gmail.com>
Hello:
This patch was applied to netdev/net.git (main)
by Jakub Kicinski <kuba@kernel.org>:
On Wed, 6 May 2026 16:46:12 +0800 you wrote:
> This series fixes a NULL pointer dereference bug introduced in the
> initial dwmac-nuvoton glue driver. The struct nvt_priv_data::dev field
> was never initialized after devm_kzalloc(), leaving it NULL. When
> phylink later calls nvt_set_phy_intf_sel() for interface selection,
> the callback dereferences priv->dev via nvt_gmac_get_delay(), triggering
> a NULL pointer dereference.
>
> [...]
Here is the summary with links:
- [net,1/1] net: stmmac: dwmac-nuvoton: fix NULL pointer dereference in nvt_set_phy_intf_sel()
https://git.kernel.org/netdev/net/c/dedf6c90386d
You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html
^ permalink raw reply
* Re: [PATCH net v3] net: phy: broadcom: Save PHY counters during suspend
From: patchwork-bot+netdevbpf @ 2026-05-07 15:50 UTC (permalink / raw)
To: Justin Chen
Cc: netdev, pabeni, kuba, edumazet, davem, linux, hkallweit1, andrew,
bcm-kernel-feedback-list, florian.fainelli
In-Reply-To: <20260505173926.2870069-1-justin.chen@broadcom.com>
Hello:
This patch was applied to netdev/net.git (main)
by Jakub Kicinski <kuba@kernel.org>:
On Tue, 5 May 2026 10:39:26 -0700 you wrote:
> The PHY counters can be lost if the PHY is reset during suspend. We
> need to save the values into the shadow counters or the accounting
> will be incorrect over multiple suspend and resume cycles.
>
> Fixes: 820ee17b8d3b ("net: phy: broadcom: Add support code for reading PHY counters")
> Signed-off-by: Justin Chen <justin.chen@broadcom.com>
> Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
>
> [...]
Here is the summary with links:
- [net,v3] net: phy: broadcom: Save PHY counters during suspend
https://git.kernel.org/netdev/net/c/32cd651d14fc
You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html
^ permalink raw reply
* Re: [PATCH net] net/smc: fix missing sk_err when TCP handshake fails
From: patchwork-bot+netdevbpf @ 2026-05-07 15:50 UTC (permalink / raw)
To: D. Wythe
Cc: davem, dust.li, edumazet, kuba, pabeni, sidraya, wenjia, kgraul,
mjambigi, horms, tonylu, ubraun, guwen, linux-kernel, linux-rdma,
linux-s390, netdev, oliver.yang, pasic
In-Reply-To: <20260506014105.27093-1-alibuda@linux.alibaba.com>
Hello:
This patch was applied to netdev/net.git (main)
by Jakub Kicinski <kuba@kernel.org>:
On Wed, 6 May 2026 09:41:05 +0800 you wrote:
> In smc_connect_work(), when the underlying TCP handshake fails, the error
> code (rc) must be propagated to sk_err to ensure userspace can correctly
> retrieve the error status via SO_ERROR. Currently, the code only handles
> a restricted set of error codes (e.g., EPIPE, ECONNREFUSED). If other
> errors occurs, such as EHOSTUNREACH, sk_err remains unset (zero).
>
> This affects applications that rely on SO_ERROR to determine connect
> outcome. For example, higher versions of Go's netpoller treats
> SO_ERROR == 0 combined with a failed getpeername() as a spurious wakeup
> and re-enters epoll_wait(). Under ET mode, no further edge will be
> generated since the socket is already in a terminal state, causing the
> connect to hang indefinitely or until a user-specified timeout, if one
> is set.
>
> [...]
Here is the summary with links:
- [net] net/smc: fix missing sk_err when TCP handshake fails
https://git.kernel.org/netdev/net/c/9032f7676935
You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html
^ permalink raw reply
* Re: [PATCH net v3] af_unix: Reject SIOCATMARK on non-stream sockets
From: patchwork-bot+netdevbpf @ 2026-05-07 15:50 UTC (permalink / raw)
To: Ren Wei
Cc: netdev, kuniyu, davem, edumazet, kuba, pabeni, horms, rao.shoaib,
yuantan098, yifanwucs, tomapufckgml, bird, wangjiexun2025
In-Reply-To: <20260506140825.2987635-1-n05ec@lzu.edu.cn>
Hello:
This patch was applied to netdev/net.git (main)
by Jakub Kicinski <kuba@kernel.org>:
On Wed, 6 May 2026 22:08:23 +0800 you wrote:
> From: Jiexun Wang <wangjiexun2025@gmail.com>
>
> SIOCATMARK reports whether the receive queue is at the urgent mark for
> MSG_OOB.
>
> In AF_UNIX, MSG_OOB is supported only for SOCK_STREAM sockets.
> SOCK_DGRAM and SOCK_SEQPACKET reject MSG_OOB in sendmsg() and recvmsg(),
> so they should not support SIOCATMARK either.
>
> [...]
Here is the summary with links:
- [net,v3] af_unix: Reject SIOCATMARK on non-stream sockets
https://git.kernel.org/netdev/net/c/d119775f2bad
You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html
^ permalink raw reply
* [PATCH net v3 0/2] tcp: protect locked SO_RCVBUF from Silly Window Syndrome
From: Ankit Jain @ 2026-05-07 15:48 UTC (permalink / raw)
To: edumazet, netdev
Cc: kuba, davem, pabeni, ncardwell, kuniyu, horms, shuah,
quic_subashab, quic_stranche, linux-kselftest, linux-kernel,
karen.badiryan, ajay.kaher, alexey.makhalov,
vamsi-krishna.brahmajosyula, yin.ding, tapas.kundu, Ankit Jain
This patch series fixes Silly Window Syndrome (SWS) for sockets using
locked SO_RCVBUF.
When applications like Tomcat lock SO_RCVBUF, receiving small packets
causes the memory truesize penalty to drop the scaling_ratio to 1.
This shrinks the internal window clamp and leads to 504 Gateway Timeouts.
Patch 1 bypasses this penalty for locked sockets, except for GRO packets.
Patch 2 adds a packetdrill test to validate this fix.
Link to v1:
https://lore.kernel.org/all/20260427152756.1205-1-ankit-aj.jain@broadcom.com/
Link to v2:
https://lore.kernel.org/all/20260504144945.13477-1-ankit-aj.jain@broadcom.com/
v2 -> v3:
- Changed GRO detection from checking tp->advmss to skb->len > len
based on Eric Dumazet's suggestion. This correctly detects GRO
packets even if they contain tiny segments.
- Updated packetdrill script. Removed the ad-hoc mss 48 configuration.
It now uses a standard 1460 MSS and sends varying packet sizes
(600, 700, 800 bytes) to naturally trigger the scaling_ratio
recalculation.
Testing:
- Verified fix in a live Java/Tomcat environment (504 timeouts resolved).
- Passed the newly added packetdrill test demonstrating the clamp bypass.
- Passed upstream regression tests: tcp_rcv_neg_window.pkt,
tcp_rcv_wnd_shrink_allowed.pkt, tcp_rcv_wnd_shrink_nomem.pkt,
tcp_rcv_zero_wnd_fin.pkt, and tcp_rcv_big_endseq.pkt
Ankit Jain (2):
tcp: protect locked SO_RCVBUF from Silly Window Syndrome
selftests/net: add packetdrill test for locked SO_RCVBUF SWS
net/ipv4/tcp_input.c | 7 ++++-
.../net/packetdrill/tcp_locked_rcvbuf_sws.pkt | 29 +++++++++++++++++++
2 files changed, 35 insertions(+), 1 deletion(-)
create mode 100644 tools/testing/selftests/net/packetdrill/tcp_locked_rcvbuf_sws.pkt
--
2.53.0
^ permalink raw reply
* [PATCH net v3 1/2] tcp: protect locked SO_RCVBUF from Silly Window Syndrome
From: Ankit Jain @ 2026-05-07 15:48 UTC (permalink / raw)
To: edumazet, netdev
Cc: kuba, davem, pabeni, ncardwell, kuniyu, horms, shuah,
quic_subashab, quic_stranche, linux-kselftest, linux-kernel,
karen.badiryan, ajay.kaher, alexey.makhalov,
vamsi-krishna.brahmajosyula, yin.ding, tapas.kundu, Ankit Jain
In-Reply-To: <20260507154806.18635-1-ankit-aj.jain@broadcom.com>
When an application locks SO_RCVBUF, it disables TCP window auto-tuning.
However, the kernel still applies dynamic truesize penalties to the
scaling_ratio.
For small packets, this penalty drops the scaling_ratio to 1. This
reduces the advertised window and causes Silly Window Syndrome (SWS)
along with 504 Gateway Timeouts in applications like Tomcat.
This patch bypasses the truesize penalty if SOCK_RCVBUF_LOCK is set.
To prevent memory exhaustion from large aggregate payloads, the penalty
is still applied for GRO packets (skb->len > len).
Fixes: a2cbb1603943 ("tcp: Update window clamping condition")
Reported-by: Karen Badiryan <karen.badiryan@broadcom.com>
Signed-off-by: Ankit Jain <ankit-aj.jain@broadcom.com>
---
net/ipv4/tcp_input.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index d5c9e65d9760..4b1832b3face 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -240,8 +240,13 @@ static void tcp_measure_rcv_mss(struct sock *sk, const struct sk_buff *skb)
/* Note: divides are still a bit expensive.
* For the moment, only adjust scaling_ratio
* when we update icsk_ack.rcv_mss.
+ *
+ * Bypass truesize penalty for locked SO_RCVBUF to prevent
+ * window collapse. Still apply it to GRO packets.
*/
- if (unlikely(len != icsk->icsk_ack.rcv_mss)) {
+ if (unlikely(len != icsk->icsk_ack.rcv_mss &&
+ (!(sk->sk_userlocks & SOCK_RCVBUF_LOCK) ||
+ skb->len > len))) {
u64 val = (u64)skb->len << TCP_RMEM_TO_WIN_SCALE;
u8 old_ratio = tcp_sk(sk)->scaling_ratio;
--
2.53.0
^ permalink raw reply related
* [PATCH net v3 2/2] selftests/net: add packetdrill test for locked SO_RCVBUF SWS
From: Ankit Jain @ 2026-05-07 15:48 UTC (permalink / raw)
To: edumazet, netdev
Cc: kuba, davem, pabeni, ncardwell, kuniyu, horms, shuah,
quic_subashab, quic_stranche, linux-kselftest, linux-kernel,
karen.badiryan, ajay.kaher, alexey.makhalov,
vamsi-krishna.brahmajosyula, yin.ding, tapas.kundu, Ankit Jain
In-Reply-To: <20260507154806.18635-1-ankit-aj.jain@broadcom.com>
Add a packetdrill test to verify that locked SO_RCVBUF sockets do not
suffer from scaling_ratio truesize penalties.
The test uses a standard 1460 MSS and sends medium-sized packets
(600, 700, 800 bytes) to trigger the recalculation logic. It checks
that the internal window clamp (tcpi_rcv_ssthresh) does not drop
unexpectedly.
Signed-off-by: Ankit Jain <ankit-aj.jain@broadcom.com>
---
.../net/packetdrill/tcp_locked_rcvbuf_sws.pkt | 29 +++++++++++++++++++
1 file changed, 29 insertions(+)
create mode 100644 tools/testing/selftests/net/packetdrill/tcp_locked_rcvbuf_sws.pkt
diff --git a/tools/testing/selftests/net/packetdrill/tcp_locked_rcvbuf_sws.pkt b/tools/testing/selftests/net/packetdrill/tcp_locked_rcvbuf_sws.pkt
new file mode 100644
index 000000000000..43e1d00d5f26
--- /dev/null
+++ b/tools/testing/selftests/net/packetdrill/tcp_locked_rcvbuf_sws.pkt
@@ -0,0 +1,29 @@
+// SPDX-License-Identifier: GPL-2.0
+// Test that TCP does not reduce scaling_ratio for locked SO_RCVBUF.
+
+0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
++0 setsockopt(3, SOL_SOCKET, SO_RCVBUF, [32768], 4) = 0
++0 bind(3, ..., ...) = 0
++0 listen(3, 1) = 0
+
+// Establish connection with standard MSS.
++0 < S 0:0(0) win 65535 <mss 1460,nop,wscale 8>
++0 > S. 0:0(0) ack 1 <...>
++0 < . 1:1(0) ack 1 win 65535
++0 accept(3, ..., ...) = 4
+
+// Inject varying payload sizes to force scaling_ratio recalculation.
++0.1 < P. 1:601(600) ack 1 win 65535
++0 > . 1:1(0) ack 601 <...>
+
++0.1 < P. 601:1301(700) ack 1 win 65535
++0 > . 1:1(0) ack 1301 <...>
+
++0.1 < P. 1301:2101(800) ack 1 win 65535
++0 > . 1:1(0) ack 2101 <...>
+
+// Check that truesize penalty did not reduce the window clamp.
+// On unpatched kernels, rcv_ssthresh drops to ~22K.
++0.1 %{
+assert tcpi_rcv_ssthresh > 28000, f"rcv_ssthresh dropped unexpectedly: {tcpi_rcv_ssthresh}"
+}%
--
2.53.0
^ permalink raw reply related
* [PATCH v3 net] net: wan: fsl_ucc_hdlc: free tx_skbuff in uhdlc_memclean
From: Holger Brunck @ 2026-05-07 15:53 UTC (permalink / raw)
To: netdev
Cc: linuxppc-dev, andrew+netdev, chleroy, qiang.zhao, horms, kuba,
Holger Brunck
When the device is removed all allocated resources should be freed.
In uhdlc_memclean the netdev transmit queue was already stopped. But at
this point we may have pending skb in the transmit queue which must be
freed. Therefore iterate over the tx_skbuff pointers and free all
pending skb. The issue was discovered by sashiko.
Tested on a ls1043a board running HDLC in bus mode on kernel 6.12.
https://sashiko.dev/#/patchset/20260429114208.941011-1-holger.brunck%40hitachienergy.com
Fixes: c19b6d246a35 ("drivers/net: support hdlc function for QE-UCC")
Signed-off-by: Holger Brunck <holger.brunck@hitachienergy.com>
---
v3: - add test setup to commit message
v2: https://lore.kernel.org/linuxppc-dev/20260506111529.2919079-1-holger.brunck@hitachienergy.com/
- use dev_kfree_skb instead of kfree
- improve commit message
- add missing paramter in for statement
v1: https://lore.kernel.org/linuxppc-dev/20260504161145.2217950-1-holger.brunck@hitachienergy.com/
drivers/net/wan/fsl_ucc_hdlc.c | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/drivers/net/wan/fsl_ucc_hdlc.c b/drivers/net/wan/fsl_ucc_hdlc.c
index bc7c2e9e6554..417e8e4c111f 100644
--- a/drivers/net/wan/fsl_ucc_hdlc.c
+++ b/drivers/net/wan/fsl_ucc_hdlc.c
@@ -739,6 +739,8 @@ static int uhdlc_open(struct net_device *dev)
static void uhdlc_memclean(struct ucc_hdlc_private *priv)
{
+ int i;
+
qe_muram_free(ioread16be(&priv->ucc_pram->riptr));
qe_muram_free(ioread16be(&priv->ucc_pram->tiptr));
@@ -769,6 +771,11 @@ static void uhdlc_memclean(struct ucc_hdlc_private *priv)
kfree(priv->rx_skbuff);
priv->rx_skbuff = NULL;
+ for (i = 0; i < TX_BD_RING_LEN; i++) {
+ dev_kfree_skb(priv->tx_skbuff[i]);
+ priv->tx_skbuff[i] = NULL;
+ }
+
kfree(priv->tx_skbuff);
priv->tx_skbuff = NULL;
--
2.52.0.120.gb31ab939fe
^ permalink raw reply related
* Re: [PATCH net v2 1/5] ionic: Allow the first devcmd to trigger deferred probe
From: Jakub Kicinski @ 2026-05-07 15:57 UTC (permalink / raw)
To: Eric Joyner
Cc: netdev, Brett Creeley, Andrew Lunn, David S. Miller, Eric Dumazet,
Paolo Abeni
In-Reply-To: <20260506035706.12373-2-eric.joyner@amd.com>
On Tue, 5 May 2026 20:57:02 -0700 Eric Joyner wrote:
> There's a chance the register signature value is set before the
> firmware is ready to respond to the driver. This doesn't mean the
> device isn't there, but just means it's not yet ready. If the first
> devcmd fails, then return -EPROBE_DEFER so the device can be probed
> at a later time. As part of this make sure the reset devcmd, which
> is the first devcmd, is not so alarming when it fails by printing
> an information message instead of the standard devcmd failure
> messages.
I don't think that's how PROBE_DEFER works / is supposed to be used.
It re-tries probe after other devices have been probed. So if your
driver is the last one "one the list" and the system is completely
stable afterwards you will never get another probe attempt.
At least that's my recollection. IOW it's for use when you have
dependencies on other components in the system. If you just need
to wait for you own device to become ready you should simply wait..
^ permalink raw reply
* Re: [PATCH net-next v3 1/4] net: eth: fbnic: Fix addr validation in pcs write
From: Mike Marciniszyn @ 2026-05-07 15:58 UTC (permalink / raw)
To: Paolo Abeni
Cc: Jakub Kicinski, Alexander Duyck, kernel-team, Andrew Lunn,
David S. Miller, Eric Dumazet, Heiner Kallweit, Russell King,
Jacob Keller, Mohsin Bashir, Simon Horman, Lee Trager,
Andrew Lunn, netdev, linux-kernel
In-Reply-To: <ee4fb21a-2420-4e35-9798-413eb3619dc6@redhat.com>
On Thu, May 07, 2026 at 03:50:04PM +0200, Paolo Abeni wrote:
> On 5/7/26 3:48 PM, Mike Marciniszyn wrote:
> > On Thu, May 07, 2026 at 09:20:45AM -0400, Mike Marciniszyn wrote:
> >> On Thu, May 07, 2026 at 09:20:53AM +0200, Paolo Abeni wrote:
> >>> On 5/7/26 3:58 AM, Jakub Kicinski wrote:
> >>>> On Mon, 4 May 2026 09:58:12 -0400 mike.marciniszyn@gmail.com wrote:
> >>>>> From: "Mike Marciniszyn (Meta)" <mike.marciniszyn@gmail.com>
> >>>>>
> >>>>> The DW IP has two distinct PCS address ranges cooresponding
> >>>>> to the C45 PCS registers.
> >>>>>
> >>>>> The shim translates the PCS addr/regno into specific CSR writes
> >>>>> into one of those two zero-relative.
> >>>>>
> >>>>> This patch fixes a one off in the test that could allow an invalid
> >>>>> CSR write if an addr == 2 was called.
> >>>>>
> >>>>> This patch contains a fix for addr validation in fbnic_mdio_write_pcs()
> >>>>> to only return actual CSR reads for addr 0 and 1.
> >>>>>
> >>>>> There are as of yet, no real impact for the bug as no PCS writes are
> >>>>> not yet present.
> >>>>
> >>>> Hi Paolo! Was there a reason / do you recall why this was not applied?
> >>>> (I dropped it from patchwork now. If the omission was accidental it has
> >>>> to be reposted)
> >>>
> >>> Darn, limited capacity here plus re-submission glitch: v3 had a slightly
> >>> different cover title (due to typo) WRT v2 so PW did not mark v2 as
> >>> superseded. I process patches via PW in sequence, when I reached v2 I
> >>> considered the sashiko comment not blocking and I apply it. I was unable
> >>> to reach v3 until now.
> >>>
> >>> TL;DR: @Mike: please re-submit 1/4 and double check there are not other
> >>> differences between v2 and v3 - otherwise more patches needed. Also
> >>> please ensure you keep the series title consistent among revision, or at
> >>> least manually remove old revisions from PW upon resubmission.
> >>>
> >>> Thanks,
> >>>
> >>> Paolo
> >>>
> >>
> >> I double checked v2 -> v3 for the other patches are ok.
> >>
> >> I'm just now resending 1/4 of the series. I reworded the commit message
> >> to fix the AI review comment.
> >>
> >> Mike
> >
> > The patch isn't showing up on either lore or patchwork.
> >
> > I may need to bump the rev?
>
> Wait a bit more. 1h latency can happen quite easily.
>
> /P
>
The patch in is patchwork:
https://patchwork.kernel.org/project/netdevbpf/patch/20260507154203.3667-1-mike.marciniszyn@gmail.com/
Mike
^ permalink raw reply
* Re: [PATCH net v2 1/5] ionic: Allow the first devcmd to trigger deferred probe
From: Jakub Kicinski @ 2026-05-07 15:59 UTC (permalink / raw)
To: eric.joyner
Cc: Jakub Kicinski, netdev, brett.creeley, andrew+netdev, davem,
edumazet, pabeni, prabu.ponrajthayalan
In-Reply-To: <20260506035706.12373-2-eric.joyner@amd.com>
This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
---
ionic: Allow the first devcmd to trigger deferred probe
There's a chance the register signature value is set before the
firmware is ready to respond to the driver. This doesn't mean the
device isn't there, but just means it's not yet ready. If the first
devcmd fails, then return -EPROBE_DEFER so the device can be probed
at a later time. As part of this make sure the reset devcmd, which
is the first devcmd, is not so alarming when it fails by printing
an information message instead of the standard devcmd failure
messages.
[ ... ]
> diff --git a/drivers/net/ethernet/pensando/ionic/ionic_bus_pci.c b/drivers/net/ethernet/pensando/ionic/ionic_bus_pci.c
> index 05f19489ec5c8..59ce35404e532 100644
> --- a/drivers/net/ethernet/pensando/ionic/ionic_bus_pci.c
> +++ b/drivers/net/ethernet/pensando/ionic/ionic_bus_pci.c
> @@ -260,7 +260,8 @@ static int ionic_setup_one(struct ionic *ionic)
> /* Configure the device */
> err = ionic_setup(ionic);
> if (err) {
> - dev_err(dev, "Cannot setup device: %d, aborting\n", err);
> + if (err != -EPROBE_DEFER)
> + dev_err(dev, "Cannot setup device: %d, aborting\n", err);
> goto err_out_clear_pci;
Does this code leak the workqueue?
ionic_setup() first calls ionic_dev_setup(), which allocates the workqueue
via ionic_watchdog_init(). If ionic_reset() then fails and returns
-EPROBE_DEFER, the goto here jumps to err_out_clear_pci.
This bypasses err_out_teardown and skips ionic_dev_teardown(), which is
responsible for calling destroy_workqueue(ionic->wq). Because -EPROBE_DEFER
will cause the kernel to retry probing, could this leak a new workqueue on
every failed probe attempt?
> }
[ ... ]
> diff --git a/drivers/net/ethernet/pensando/ionic/ionic_main.c b/drivers/net/ethernet/pensando/ionic/ionic_main.c
> index 3c5200e2fdb7d..91f89b9ff8075 100644
> --- a/drivers/net/ethernet/pensando/ionic/ionic_main.c
> +++ b/drivers/net/ethernet/pensando/ionic/ionic_main.c
> @@ -603,7 +603,11 @@ int ionic_setup(struct ionic *ionic)
> err = ionic_dev_setup(ionic);
> if (err)
> return err;
> - ionic_reset(ionic);
> +
> + err = ionic_reset(ionic);
> + /* firmware may not be ready to respond yet */
> + if (err == -EAGAIN || err == -ETIMEDOUT)
> + return -EPROBE_DEFER;
Will this permanently fail the probe with -ENXIO if the firmware is so early
in its boot process that the IONIC_FW_STS_F_RUNNING bit isn't set yet?
If ionic_is_fw_running() is false, ionic_reset() skips execution and
returns 0:
drivers/net/ethernet/pensando/ionic/ionic_main.c:ionic_reset() {
if (!ionic_is_fw_running(idev))
return 0;
...
}
If this happens, ionic_setup() returns success, and the driver proceeds to
ionic_identify(). When it issues a command and calls __ionic_dev_cmd_wait(),
the wait loop is skipped entirely because the firmware isn't running,
returning -ENXIO. This bypasses the -EPROBE_DEFER logic here and causes a
permanent probe failure.
>
> return 0;
> }
> @@ -687,7 +691,7 @@ int ionic_reset(struct ionic *ionic)
>
> mutex_lock(&ionic->dev_cmd_lock);
> ionic_dev_cmd_reset(idev);
> - err = ionic_dev_cmd_wait(ionic, DEVCMD_TIMEOUT);
> + err = ionic_dev_cmd_wait_nomsg(ionic, DEVCMD_TIMEOUT);
Does this actually suppress the timeout warning?
The commit message says this aims to make the first devcmd failure "not so
alarming" by suppressing standard devcmd failure messages. However,
ionic_dev_cmd_wait_nomsg() passes do_msg = false to __ionic_dev_cmd_wait(),
which only suppresses firmware-returned error codes parsed by
ionic_dev_cmd_dev_err_print().
If the command times out, __ionic_dev_cmd_wait() will still unconditionally
print the dev_warn() for the timeout, so the alarming message might still be
printed.
> mutex_unlock(&ionic->dev_cmd_lock);
^ permalink raw reply
* Re: [PATCH net v2 3/5] ionic: Fix unexpected dev_cmd failures
From: Jakub Kicinski @ 2026-05-07 15:59 UTC (permalink / raw)
To: eric.joyner
Cc: Jakub Kicinski, netdev, brett.creeley, andrew+netdev, davem,
edumazet, pabeni, prabu.ponrajthayalan
In-Reply-To: <20260506035706.12373-4-eric.joyner@amd.com>
This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
---
ionic: Fix unexpected dev_cmd failures
This commit adds a final check of the done bit after exiting the dev_cmd
poll loop. It attempts to prevent timing out commands that have actually
completed.
> diff --git a/drivers/net/ethernet/pensando/ionic/ionic_main.c b/drivers/net/ethernet/pensando/ionic/ionic_main.c
> --- a/drivers/net/ethernet/pensando/ionic/ionic_main.c
> +++ b/drivers/net/ethernet/pensando/ionic/ionic_main.c
> @@ -541,6 +541,10 @@ static int __ionic_dev_cmd_wait(struct ionic *ionic, unsigned long max_seconds,
> }
> duration = jiffies - start_time;
>
> + /* one final check to prevent unexpected timeout */
> + if (!done)
> + done = ionic_dev_cmd_done(idev);
If the polling loop exits early because the firmware crashed or the device
was hot-removed, fw_up evaluates to false. Will this final check then read
from a dead device?
On a disconnected device, MMIO reads typically return all 1s (0xffffffff).
Looking at ionic_dev_cmd_done():
drivers/net/ethernet/pensando/ionic/ionic_dev.c:ionic_dev_cmd_done() {
...
return ioread32(&idev->dev_cmd_regs->done) & IONIC_DEV_CMD_DONE;
}
Since IONIC_DEV_CMD_DONE is 0x1, the bitwise AND with 0xffffffff evaluates
to true, making done = true.
If done becomes true on a dead device, does it bypass the firmware down
error path further down in this function?
__ionic_dev_cmd_wait() {
...
if (!done && !fw_up) {
ionic_dev_cmd_clean(ionic);
dev_warn(ionic->dev, "DEVCMD %s (%d) interrupted - FW is down\n",
ionic_opcode_to_str(opcode), opcode);
return -ENXIO;
}
...
}
It appears the driver will instead proceed to ionic_dev_cmd_status(), read
0xff from the status register, log an unknown error, and return -EIO rather
than correctly handling the device removal with -ENXIO.
^ permalink raw reply
* Re: [PATCH net v2 5/5] ionic: fix completion descriptor access with 2x desc size
From: Jakub Kicinski @ 2026-05-07 15:59 UTC (permalink / raw)
To: eric.joyner
Cc: Jakub Kicinski, netdev, brett.creeley, andrew+netdev, davem,
edumazet, pabeni, prabu.ponrajthayalan
In-Reply-To: <20260506035706.12373-6-eric.joyner@amd.com>
This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
---
ionic: fix completion descriptor access with 2x desc size
This commit fixes completion descriptor access when the IONIC_Q_F_2X_CQ_DESC
flag is set. It introduces pointer arithmetic to correctly calculate the
offset of the 32-byte completion descriptor slot.
> diff --git a/drivers/net/ethernet/pensando/ionic/ionic_txrx.c b/drivers/net/ethernet/pensando/ionic/ionic_txrx.c
> --- a/drivers/net/ethernet/pensando/ionic/ionic_txrx.c
> +++ b/drivers/net/ethernet/pensando/ionic/ionic_txrx.c
[ ... ]
> @@ -729,7 +725,12 @@ static bool __ionic_rx_service(struct ionic_cq *cq, struct bpf_prog *xdp_prog)
> struct ionic_queue *q = cq->bound_q;
> struct ionic_rxq_comp *comp;
>
> - comp = &((struct ionic_rxq_comp *)cq->base)[cq->tail_idx];
> + if (likely(cq->desc_size == sizeof(*comp)))
> + comp = &((struct ionic_rxq_comp *)cq->base)[cq->tail_idx];
> + else
> + comp = cq->base +
> + cq->desc_size * cq->tail_idx +
> + cq->desc_size - sizeof(*comp);
>
> if (!color_match(comp->pkt_type_color, cq->done_color))
> return false;
This is a pre-existing issue, but since the descriptor access is being
modified here, should there be a dma_rmb() after the color_match() check?
Without a read memory barrier, weakly-ordered CPU architectures might
speculatively read other fields of the descriptor from memory before checking
the color bit. This could result in processing stale data from the previous
ring cycle.
[ ... ]
> @@ -1244,7 +1240,12 @@ static bool ionic_tx_service(struct ionic_cq *cq,
> unsigned int pkts = 0;
> u16 index;
>
> - comp = &((struct ionic_txq_comp *)cq->base)[cq->tail_idx];
> + if (likely(cq->desc_size == sizeof(*comp)))
> + comp = &((struct ionic_txq_comp *)cq->base)[cq->tail_idx];
> + else
> + comp = cq->base +
> + cq->desc_size * cq->tail_idx +
> + cq->desc_size - sizeof(*comp);
>
> if (!color_match(comp->color, cq->done_color))
> return false;
Does this color_match() check also need a dma_rmb() immediately following it
to prevent stale descriptor reads on the transmit side?
^ permalink raw reply
* Re: [net-next v39] mctp pcc: Implement MCTP over PCC Transport
From: Adam Young @ 2026-05-07 16:00 UTC (permalink / raw)
To: Jeremy Kerr, Adam Young, Matt Johnston, Andrew Lunn,
David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni
Cc: netdev, linux-kernel, Sudeep Holla, Jonathan Cameron, Huisong Li
In-Reply-To: <04015745fb7223f6d7dc262ae505daf0272586fb.camel@codeconstruct.com.au>
All the other will be taken in as changes
On 5/6/26 23:02, Jeremy Kerr wrote:
>> +static void mctp_pcc_tx_done(struct mbox_client *c, void *mssg, int r)
>> +{
>> + struct mctp_pcc_ndev *mctp_pcc_ndev;
>> + struct sk_buff *skb = mssg;
>> +
>> + /*
>> + * If there is a packet in flight during driver cleanup
>> + * It may have been freed already.
>> + */
>> + if (!mssg)
>> + return;
>> + /*
>> + * If the return code is non-zero, we should not report the packet
>> + * as transmitted. However, we are in IRQ context right now, and we
>> + * cannot safely write transmission statistics.
>> + */
> This reads as if you're not updating stats at all, but you do so in
> mctp_pcc_tx_prepare(). I don't think this comment is necessary - if
> you really want to mention this, add a comment on the
> dev_dstats_tx_add() to indicate why you're calling it early.
This comment is in prep for a fairly large change in the PCC layer to
address it.
This statistic should be reported in tx_done, but cannot be done safely
yet. The fix is to get tx_done out of a hard-irq handler. I will submit
that as a follow on changes to mailbox/pcc.c and mctp-pc.c
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox