* Re: [PATCH for-next] RDMA/efa: Validate SQ depth based on WQE size
From: Yonatan Nachum @ 2026-05-14 20:45 UTC (permalink / raw)
To: Leon Romanovsky
Cc: jgg, linux-rdma, mrgolin, sleybo, matua, gal.pressman,
Daniel Kranzdorf
In-Reply-To: <20260514162812.GS15586@unreal>
On Thu, May 14, 2026 at 07:28:12PM +0300, Leon Romanovsky wrote:
> On Wed, May 13, 2026 at 07:24:51PM +0000, Yonatan Nachum wrote:
> > On Wed, May 13, 2026 at 08:38:12PM +0300, Leon Romanovsky wrote:
> > > On Thu, May 07, 2026 at 11:21:10AM +0000, Yonatan Nachum wrote:
> > > > From: Michael Margolin <mrgolin@amazon.com>
> > > >
> > > > Change the SQ depth validation to take into account the SQ WQE size.
> > > > This is needed since when using 128-byte WQE the max SQ depth is cut in
> > > > half. On create QP command, userspace provides SQ ring size which is SQ
> > > > depth X WQE size so we can calculate the requested WQE size in the
> > > > kernel.
> > > >
> > > > Reviewed-by: Daniel Kranzdorf <dkkranzd@amazon.com>
> > > > Reviewed-by: Michael Margolin <mrgolin@amazon.com>
> > > > Signed-off-by: Yonatan Nachum <ynachum@amazon.com>
> > > > ---
> > > > drivers/infiniband/hw/efa/efa_verbs.c | 39 ++++++++++++++++++---------
> > > > 1 file changed, 27 insertions(+), 12 deletions(-)
> > >
> > > Please add Fixes line.
> > >
> > > Thanks
> >
> > There is no Fixes tag as this is not a bug fix. The existing validation
> > works but is overly permissive — it doesn't account for WQE size when
> > checking max SQ depth. Without it, the device would reject the request
> > downstream. This patch tightens the validation to fail early in the
> > kernel.
>
> So why do we need kernel patch after all?
>
> Thanks
The driver already validates max_send_wr against max_sq_depth; this
patch only makes that check accurate for the 128-byte WQE case. Failing
early in the kernel also gives clearer error reporting than a rejection
by the device.
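
For readers following the arithmetic, here is a minimal Python sketch of
the kind of check being discussed. It is not the driver code: the
function name, the 64-byte base WQE size, and the example limit of 1024
entries are assumptions made for illustration. The thread itself only
states that the ring size is SQ depth times WQE size and that a 128-byte
WQE halves the maximum SQ depth.

```python
# Assumed base WQE size in bytes; the thread only mentions the 128-byte case.
BASE_WQE_SIZE = 64


def validate_sq_depth(sq_ring_size, max_send_wr, max_sq_depth):
    """Return True if the requested SQ depth fits the device limit.

    sq_ring_size -- ring size in bytes supplied by userspace
    max_send_wr  -- requested SQ depth (number of WQEs)
    max_sq_depth -- device limit, assumed to be expressed in base-size WQEs
    """
    if max_send_wr <= 0 or sq_ring_size % max_send_wr:
        return False

    # Ring size is depth * WQE size, so the WQE size can be derived.
    wqe_size = sq_ring_size // max_send_wr
    if wqe_size < BASE_WQE_SIZE or wqe_size % BASE_WQE_SIZE:
        return False

    # A WQE twice the base size leaves room for half as many entries.
    effective_max = max_sq_depth * BASE_WQE_SIZE // wqe_size
    return max_send_wr <= effective_max
```

Under these assumptions, a depth of 1024 is accepted with 64-byte WQEs
but rejected with 128-byte WQEs, where the effective limit drops to 512.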
^ permalink raw reply
* [recipe build #4039958] of ~linux-rdma rdma-core-daily in xenial: Dependency wait
From: noreply @ 2026-05-14 20:32 UTC (permalink / raw)
To: Linux RDMA
* State: Dependency wait
* Recipe: linux-rdma/rdma-core-daily
* Archive: ~linux-rdma/ubuntu/rdma-core-daily
* Distroseries: xenial
* Duration: 2 minutes
* Build Log: https://launchpad.net/~linux-rdma/+archive/ubuntu/rdma-core-daily/+recipebuild/4039958/+files/buildlog.txt.gz
* Upload Log:
* Builder: https://launchpad.net/builders/lcy02-amd64-078
--
https://launchpad.net/~linux-rdma/+archive/ubuntu/rdma-core-daily/+recipebuild/4039958
Your team Linux RDMA is the requester of the build.
^ permalink raw reply
* [rdma:for-rc] BUILD SUCCESS f6b079629becfa977f9c51fe53ad2e6dcc55ef44
From: kernel test robot @ 2026-05-14 19:47 UTC (permalink / raw)
To: Leon Romanovsky; +Cc: Doug Ledford, Jason Gunthorpe, linux-rdma
tree/branch: https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git for-rc
branch HEAD: f6b079629becfa977f9c51fe53ad2e6dcc55ef44 RDMA/bnxt_re: zero shared page before exposing to userspace
elapsed time: 783m
configs tested: 265
configs skipped: 4
The following configs have been built successfully.
More configs may be tested in the coming days.
tested configs:
alpha allnoconfig gcc-15.2.0
alpha allyesconfig gcc-15.2.0
alpha defconfig gcc-15.2.0
arc allmodconfig clang-16
arc allnoconfig gcc-15.2.0
arc allyesconfig clang-23
arc allyesconfig gcc-15.2.0
arc defconfig gcc-15.2.0
arc randconfig-001-20260514 clang-23
arc randconfig-001-20260515 clang-23
arc randconfig-002-20260514 clang-23
arc randconfig-002-20260515 clang-23
arm allnoconfig gcc-15.2.0
arm allyesconfig clang-16
arm defconfig gcc-15.2.0
arm randconfig-001-20260514 clang-23
arm randconfig-001-20260515 clang-23
arm randconfig-002-20260514 clang-23
arm randconfig-002-20260515 clang-23
arm randconfig-003-20260514 clang-23
arm randconfig-003-20260515 clang-23
arm randconfig-004-20260514 clang-23
arm randconfig-004-20260515 clang-23
arm rpc_defconfig clang-18
arm64 allmodconfig clang-19
arm64 allmodconfig clang-23
arm64 allnoconfig gcc-15.2.0
arm64 defconfig gcc-15.2.0
arm64 randconfig-001-20260514 clang-23
arm64 randconfig-001-20260515 gcc-11.5.0
arm64 randconfig-002-20260514 clang-23
arm64 randconfig-002-20260515 gcc-11.5.0
arm64 randconfig-003-20260514 clang-23
arm64 randconfig-003-20260515 gcc-11.5.0
arm64 randconfig-004-20260514 clang-23
arm64 randconfig-004-20260515 gcc-11.5.0
csky allmodconfig gcc-15.2.0
csky allnoconfig gcc-15.2.0
csky defconfig gcc-15.2.0
csky randconfig-001-20260514 clang-23
csky randconfig-001-20260515 gcc-11.5.0
csky randconfig-002-20260514 clang-23
csky randconfig-002-20260515 gcc-11.5.0
hexagon allmodconfig clang-17
hexagon allmodconfig gcc-15.2.0
hexagon allnoconfig gcc-15.2.0
hexagon defconfig gcc-15.2.0
hexagon randconfig-001-20260514 gcc-10.5.0
hexagon randconfig-002-20260514 gcc-10.5.0
i386 allmodconfig clang-20
i386 allmodconfig gcc-14
i386 allnoconfig gcc-15.2.0
i386 allyesconfig clang-20
i386 allyesconfig gcc-14
i386 buildonly-randconfig-001 gcc-14
i386 buildonly-randconfig-001-20260514 gcc-14
i386 buildonly-randconfig-002 gcc-14
i386 buildonly-randconfig-002-20260514 gcc-14
i386 buildonly-randconfig-003 gcc-14
i386 buildonly-randconfig-003-20260514 gcc-14
i386 buildonly-randconfig-004 gcc-14
i386 buildonly-randconfig-004-20260514 gcc-14
i386 buildonly-randconfig-005 gcc-14
i386 buildonly-randconfig-005-20260514 gcc-14
i386 buildonly-randconfig-006 gcc-14
i386 buildonly-randconfig-006-20260514 gcc-14
i386 defconfig gcc-15.2.0
i386 randconfig-001-20260514 clang-20
i386 randconfig-001-20260515 clang-20
i386 randconfig-002-20260514 clang-20
i386 randconfig-002-20260515 clang-20
i386 randconfig-003-20260514 clang-20
i386 randconfig-003-20260515 clang-20
i386 randconfig-004-20260514 clang-20
i386 randconfig-004-20260515 clang-20
i386 randconfig-005-20260514 clang-20
i386 randconfig-005-20260515 clang-20
i386 randconfig-006-20260514 clang-20
i386 randconfig-006-20260515 clang-20
i386 randconfig-007-20260514 clang-20
i386 randconfig-007-20260515 clang-20
i386 randconfig-011-20260514 clang-20
i386 randconfig-012-20260514 clang-20
i386 randconfig-013-20260514 clang-20
i386 randconfig-014-20260514 clang-20
i386 randconfig-015-20260514 clang-20
i386 randconfig-016-20260514 clang-20
i386 randconfig-017-20260514 clang-20
loongarch allmodconfig clang-23
loongarch allnoconfig gcc-15.2.0
loongarch defconfig clang-19
loongarch randconfig-001-20260514 gcc-10.5.0
loongarch randconfig-002-20260514 gcc-10.5.0
m68k allmodconfig gcc-15.2.0
m68k allnoconfig gcc-15.2.0
m68k allyesconfig clang-16
m68k defconfig clang-19
microblaze allnoconfig gcc-15.2.0
microblaze allyesconfig gcc-15.2.0
microblaze defconfig clang-19
mips allmodconfig gcc-15.2.0
mips allnoconfig gcc-15.2.0
mips allyesconfig gcc-15.2.0
nios2 allmodconfig clang-23
nios2 allmodconfig gcc-11.5.0
nios2 allnoconfig clang-23
nios2 allnoconfig gcc-11.5.0
nios2 defconfig clang-19
nios2 randconfig-001-20260514 gcc-10.5.0
nios2 randconfig-002-20260514 gcc-10.5.0
openrisc allmodconfig clang-23
openrisc allmodconfig gcc-15.2.0
openrisc allnoconfig clang-23
openrisc allnoconfig gcc-15.2.0
openrisc defconfig gcc-15.2.0
openrisc virt_defconfig gcc-15.2.0
parisc allmodconfig gcc-15.2.0
parisc allnoconfig clang-23
parisc allnoconfig gcc-15.2.0
parisc allyesconfig clang-19
parisc allyesconfig gcc-15.2.0
parisc defconfig gcc-15.2.0
parisc randconfig-001 gcc-13.4.0
parisc randconfig-001-20260514 gcc-13.4.0
parisc randconfig-001-20260515 gcc-8.5.0
parisc randconfig-002 gcc-13.4.0
parisc randconfig-002-20260514 gcc-13.4.0
parisc randconfig-002-20260515 gcc-8.5.0
parisc64 defconfig clang-19
powerpc allmodconfig gcc-15.2.0
powerpc allnoconfig clang-23
powerpc allnoconfig gcc-15.2.0
powerpc mpc885_ads_defconfig clang-23
powerpc randconfig-001 gcc-13.4.0
powerpc randconfig-001-20260514 gcc-13.4.0
powerpc randconfig-001-20260515 gcc-8.5.0
powerpc randconfig-002 gcc-13.4.0
powerpc randconfig-002-20260514 gcc-13.4.0
powerpc randconfig-002-20260515 gcc-8.5.0
powerpc tqm8541_defconfig clang-23
powerpc64 randconfig-001 gcc-13.4.0
powerpc64 randconfig-001-20260514 gcc-13.4.0
powerpc64 randconfig-001-20260515 gcc-8.5.0
powerpc64 randconfig-002 gcc-13.4.0
powerpc64 randconfig-002-20260514 gcc-13.4.0
powerpc64 randconfig-002-20260515 gcc-8.5.0
riscv allmodconfig clang-23
riscv allnoconfig clang-23
riscv allnoconfig gcc-15.2.0
riscv allyesconfig clang-16
riscv defconfig gcc-15.2.0
riscv randconfig-001-20260514 gcc-14.3.0
riscv randconfig-001-20260515 gcc-15.2.0
riscv randconfig-002-20260514 gcc-14.3.0
riscv randconfig-002-20260515 gcc-15.2.0
s390 allmodconfig clang-18
s390 allmodconfig clang-19
s390 allnoconfig clang-23
s390 allyesconfig gcc-15.2.0
s390 defconfig clang-23
s390 defconfig gcc-15.2.0
s390 randconfig-001-20260514 gcc-14.3.0
s390 randconfig-001-20260515 gcc-15.2.0
s390 randconfig-002-20260514 gcc-14.3.0
s390 randconfig-002-20260515 gcc-15.2.0
sh allmodconfig gcc-15.2.0
sh allnoconfig clang-23
sh allnoconfig gcc-15.2.0
sh allyesconfig clang-19
sh allyesconfig gcc-15.2.0
sh defconfig gcc-14
sh randconfig-001-20260514 gcc-14.3.0
sh randconfig-001-20260515 gcc-15.2.0
sh randconfig-002-20260514 gcc-14.3.0
sh randconfig-002-20260515 gcc-15.2.0
sh sh7757lcr_defconfig gcc-15.2.0
sparc allnoconfig clang-23
sparc allnoconfig gcc-15.2.0
sparc defconfig gcc-15.2.0
sparc randconfig-001 gcc-15.2.0
sparc randconfig-001-20260514 gcc-15.2.0
sparc randconfig-002 gcc-15.2.0
sparc randconfig-002-20260514 gcc-15.2.0
sparc64 allmodconfig clang-23
sparc64 defconfig gcc-14
sparc64 randconfig-001 gcc-15.2.0
sparc64 randconfig-001-20260514 gcc-15.2.0
sparc64 randconfig-002 gcc-15.2.0
sparc64 randconfig-002-20260514 gcc-15.2.0
um allmodconfig clang-19
um allnoconfig clang-23
um allyesconfig gcc-14
um allyesconfig gcc-15.2.0
um defconfig gcc-14
um i386_defconfig gcc-14
um randconfig-001 gcc-15.2.0
um randconfig-001-20260514 gcc-15.2.0
um randconfig-002 gcc-15.2.0
um randconfig-002-20260514 gcc-15.2.0
um x86_64_defconfig gcc-14
x86_64 allmodconfig clang-20
x86_64 allnoconfig clang-20
x86_64 allnoconfig clang-23
x86_64 allyesconfig clang-20
x86_64 buildonly-randconfig-001 clang-20
x86_64 buildonly-randconfig-001-20260514 clang-20
x86_64 buildonly-randconfig-001-20260514 gcc-14
x86_64 buildonly-randconfig-002 clang-20
x86_64 buildonly-randconfig-002-20260514 clang-20
x86_64 buildonly-randconfig-003 clang-20
x86_64 buildonly-randconfig-003-20260514 clang-20
x86_64 buildonly-randconfig-003-20260514 gcc-14
x86_64 buildonly-randconfig-004 clang-20
x86_64 buildonly-randconfig-004-20260514 clang-20
x86_64 buildonly-randconfig-005 clang-20
x86_64 buildonly-randconfig-005-20260514 clang-20
x86_64 buildonly-randconfig-006 clang-20
x86_64 buildonly-randconfig-006-20260514 clang-20
x86_64 buildonly-randconfig-006-20260514 gcc-14
x86_64 defconfig gcc-14
x86_64 kexec clang-20
x86_64 randconfig-001 gcc-14
x86_64 randconfig-001-20260514 gcc-14
x86_64 randconfig-002 gcc-14
x86_64 randconfig-002-20260514 gcc-14
x86_64 randconfig-003 gcc-14
x86_64 randconfig-003-20260514 gcc-14
x86_64 randconfig-004 gcc-14
x86_64 randconfig-004-20260514 gcc-14
x86_64 randconfig-005 gcc-14
x86_64 randconfig-005-20260514 gcc-14
x86_64 randconfig-006 gcc-14
x86_64 randconfig-006-20260514 gcc-14
x86_64 randconfig-011 clang-20
x86_64 randconfig-011-20260514 clang-20
x86_64 randconfig-012 clang-20
x86_64 randconfig-012-20260514 clang-20
x86_64 randconfig-013 clang-20
x86_64 randconfig-013-20260514 clang-20
x86_64 randconfig-014 clang-20
x86_64 randconfig-014-20260514 clang-20
x86_64 randconfig-015 clang-20
x86_64 randconfig-015-20260514 clang-20
x86_64 randconfig-016 clang-20
x86_64 randconfig-016-20260514 clang-20
x86_64 randconfig-071-20260514 clang-20
x86_64 randconfig-072-20260514 clang-20
x86_64 randconfig-073-20260514 clang-20
x86_64 randconfig-074-20260514 clang-20
x86_64 randconfig-075-20260514 clang-20
x86_64 randconfig-076-20260514 clang-20
x86_64 rhel-9.4 clang-20
x86_64 rhel-9.4-bpf gcc-14
x86_64 rhel-9.4-func clang-20
x86_64 rhel-9.4-kselftests clang-20
x86_64 rhel-9.4-kunit gcc-14
x86_64 rhel-9.4-ltp gcc-14
x86_64 rhel-9.4-rust clang-20
xtensa allnoconfig clang-23
xtensa allnoconfig gcc-15.2.0
xtensa allyesconfig clang-23
xtensa randconfig-001 gcc-15.2.0
xtensa randconfig-001-20260514 gcc-15.2.0
xtensa randconfig-002 gcc-15.2.0
xtensa randconfig-002-20260514 gcc-15.2.0
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply
* Re: [PATCH net-next 1/8] net/mlx5: Use helper to parse host PF info
From: Simon Horman @ 2026-05-14 19:19 UTC (permalink / raw)
To: Tariq Toukan
Cc: Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andrew Lunn,
David S. Miller, Saeed Mahameed, Leon Romanovsky, Mark Bloch,
Moshe Shemesh, Akiva Goldberger, netdev, linux-rdma, linux-kernel,
Gal Pressman, Dragos Tatulea
In-Reply-To: <20260510053448.326823-2-tariqt@nvidia.com>
On Sun, May 10, 2026 at 08:34:41AM +0300, Tariq Toukan wrote:
> From: Moshe Shemesh <moshe@nvidia.com>
>
> Add a helper mlx5_esw_get_host_pf_info() to retrieve host PF data from
> the query_esw_functions command output, so callers no longer need to
> parse the layout to obtain the required information.
>
> Convert all callers of mlx5_esw_query_functions() to use the new helper,
> preparing for upcoming support of the new op_mod that returns data in
> the network_function_params layout.
>
> Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
> Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
^ permalink raw reply
* Re: [PATCH net-next 2/8] net/mlx5: Use v1 response layout for query_esw_functions
From: Simon Horman @ 2026-05-14 19:18 UTC (permalink / raw)
To: tariqt
Cc: 'Simon Horman', edumazet, kuba, pabeni, andrew+netdev,
davem, saeedm, leon, mbloch, moshe, agoldberger, netdev,
linux-rdma, linux-kernel, gal, dtatulea
In-Reply-To: <20260510053448.326823-3-tariqt@nvidia.com>
From: 'Simon Horman' <horms@kernel.org>
This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
Full review at: https://sashiko.dev
---
net/mlx5: Use v1 response layout for query_esw_functions
This patch teaches mlx5_esw_query_functions() to use the v1 response
layout (MLX5_QUERY_ESW_FUNC_OP_MOD_LAYOUT_V1) when the firmware
advertises query_host_net_function_v1, allocates the output buffer
based on query_host_net_function_num_max, and adds a parser for the
new network_function_params layout in mlx5_esw_get_host_pf_info().
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
> index 861e79ddb489..8b62dde7eb70 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
[ ... ]
> @@ -1100,12 +1133,55 @@ mlx5_esw_host_pf_from_host_params(const void *entry)
> };
> }
>
> -struct mlx5_esw_pf_info mlx5_esw_get_host_pf_info(const u32 *out)
> +static struct mlx5_esw_pf_info
> +mlx5_esw_host_pf_from_net_func_params(const u8 *entry, int num_entries)
> +{
> + int i;
> +
> + for (i = 0; i < num_entries; i++) {
> + int pf_type, state;
> +
> + pf_type = MLX5_GET(network_function_params, entry, pci_pf_type);
> + if (pf_type != MLX5_PCI_PF_TYPE_EXTERNAL_HOST_PF) {
> + entry += MLX5_UN_SZ_BYTES(net_function_params);
> + continue;
> + }
> +
> + state = MLX5_GET(network_function_params, entry, vhca_state);
> +
> + return (struct mlx5_esw_pf_info) {
> + .pf_disabled = state != MLX5_VHCA_STATE_IN_USE,
[Medium]
The commit message says "this change does not introduce new
functionality," but the legacy parser reads an explicit firmware
host_pf_disabled bit while the v1 parser derives pf_disabled from a
multi-valued vhca_state enum:
.pf_disabled = state != MLX5_VHCA_STATE_IN_USE,
Is this mapping intended to be exactly equivalent to the legacy
host_pf_disabled bit across all five vhca_state values
(INVALID/ALLOCATED/ACTIVE/IN_USE/TEARDOWN_REQUEST)?
The same vhca_state value is interpreted differently elsewhere in
mlx5 for SFs. mlx5_sf_to_devlink_opstate() in
drivers/net/ethernet/mellanox/mlx5/core/sf/devlink.c reports:
case MLX5_VHCA_STATE_IN_USE:
case MLX5_VHCA_STATE_TEARDOWN_REQUEST:
return DEVLINK_PORT_FN_OPSTATE_ATTACHED;
case MLX5_VHCA_STATE_INVALID:
case MLX5_VHCA_STATE_ALLOCATED:
case MLX5_VHCA_STATE_ACTIVE:
default:
return DEVLINK_PORT_FN_OPSTATE_DETACHED;
so SFs in TEARDOWN_REQUEST are reported as ATTACHED, while a host
PF in TEARDOWN_REQUEST would now end up as pf_disabled = true and
therefore DETACHED via mlx5_devlink_pf_port_fn_state_get():
> + .num_of_vfs = MLX5_GET(network_function_params,
> + entry, pci_num_vfs),
> + .total_vfs = MLX5_GET(network_function_params,
> + entry, pci_total_vfs),
> + .host_number = MLX5_GET(network_function_params,
> + entry, host_number),
> + };
> + }
[Medium]
pf_disabled also feeds esw_vfs_changed_event_handler(), which skips
VF reload when it is true:
> @@ -3716,7 +3716,7 @@ static void esw_vfs_changed_event_handler(struct mlx5_eswitch *esw)
[ ... ]
> - host_pf_info = mlx5_esw_get_host_pf_info(out);
> + host_pf_info = mlx5_esw_get_host_pf_info(esw->dev, out);
> new_num_vfs = host_pf_info.num_of_vfs;
>
> if (new_num_vfs == esw->esw_funcs.num_vfs || host_pf_info.pf_disabled)
Could the commit message spell out the firmware/PRM mapping between
the legacy host_pf_disabled bit and the vhca_state enum so it is
clear that ACTIVE, ALLOCATED and TEARDOWN_REQUEST really should be
treated as disabled here?
Would it also be worth aligning with mlx5_sf_to_devlink_opstate()
(at least for TEARDOWN_REQUEST) so the devlink opstate reported via
mlx5_devlink_pf_port_fn_state_get() stays consistent between SFs and
host PFs on v1-capable firmware?
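
The mismatch described above can be checked mechanically. The following
is a small Python sketch, not mlx5 code: the enum member values are
arbitrary, and the two helper functions are hypothetical stand-ins that
only mirror the mappings quoted in this review (pf_disabled is
"state != IN_USE"; SFs are ATTACHED in IN_USE and TEARDOWN_REQUEST).

```python
from enum import Enum, auto


class VhcaState(Enum):
    # Names follow the five states listed in the review; values are arbitrary.
    INVALID = auto()
    ALLOCATED = auto()
    ACTIVE = auto()
    IN_USE = auto()
    TEARDOWN_REQUEST = auto()


def v1_pf_disabled(state):
    """Mirror of the v1 parser: anything other than IN_USE is 'disabled'."""
    return state is not VhcaState.IN_USE


def sf_attached(state):
    """Mirror of the quoted SF opstate switch statement."""
    return state in (VhcaState.IN_USE, VhcaState.TEARDOWN_REQUEST)


def mismatches():
    """States where 'PF not disabled' and 'SF attached' disagree."""
    return [s for s in VhcaState if (not v1_pf_disabled(s)) != sf_attached(s)]
```

With these mirrored mappings, TEARDOWN_REQUEST is the only state on which
the two interpretations differ, which is the case the question above is
about.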
^ permalink raw reply
* Re: [PATCH 05/11] selftests: Add additional kernel functions to tools/include/
From: Jason Gunthorpe @ 2026-05-14 19:03 UTC (permalink / raw)
To: David Matlack
Cc: Alex Williamson, kvm, Leon Romanovsky, linux-kselftest,
linux-rdma, Mark Bloch, netdev, Saeed Mahameed, Shuah Khan,
Tariq Toukan, patches
In-Reply-To: <afkUO56H6KPy5afA@google.com>
On Mon, May 04, 2026 at 09:48:43PM +0000, David Matlack wrote:
> On 2026-04-30 09:08 PM, Jason Gunthorpe wrote:
> > These are needed by the VFIO mlx5 selftest in the following patches,
> > which includes some headers from mlx5 and also needs a few more
> > MMIO-related features.
> >
> > - DECLARE_FLEX_ARRAY in new tools/include/linux/stddef.h (wraps
> > existing __DECLARE_FLEX_ARRAY from uapi/linux/stddef.h)
>
> Is this needed? I don't see it used anywhere.
>
> $ git grep DECLARE_FLEX_ARRAY tools/testing/selftests/vfio
Turns out it is needed implicitly in existing headers:
In file included from /home/jgg/oss/wip/mlx5st/tools/testing/selftests/vfio/lib/drivers/mlx5/mlx5_hw.h:16:
/home/jgg/oss/wip/mlx5st/tools/testing/selftests/vfio/lib/drivers/mlx5/mlx5_ifc.h:4352:3: error: type name requires a specifier or qualifier
4352 | DECLARE_FLEX_ARRAY(struct mlx5_ifc_rq_num_bits, rq_num);
| ^
/home/jgg/oss/wip/mlx5st/tools/testing/selftests/vfio/lib/drivers/mlx5/mlx5_ifc.h:4352:51: error: type specifier missing, defaults to 'int'; ISO C99 and later do not support implicit int [-Wimplicit-int]
4352 | DECLARE_FLEX_ARRAY(struct mlx5_ifc_rq_num_bits, rq_num);
Jason
^ permalink raw reply
* [PATCH net-next v5 8/8] selftests: drv-net: add netkit devmem tests
From: Bobby Eshleman @ 2026-05-14 17:22 UTC (permalink / raw)
To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Simon Horman, Jonathan Corbet, Shuah Khan, Alex Shi,
Yanteng Si, Dongliang Mu, Michael Chan, Pavan Chebbi,
Joshua Washington, Harshitha Ramamurthy, Saeed Mahameed,
Tariq Toukan, Mark Bloch, Leon Romanovsky, Alexander Duyck,
kernel-team, Daniel Borkmann, Nikolay Aleksandrov, Shuah Khan
Cc: dw, sdf.kernel, mohsin.bashr, willemb, jiang.kun2, xu.xin16,
wang.yaxin, netdev, linux-doc, linux-kernel, linux-rdma, bpf,
linux-kselftest, Stanislav Fomichev, Mina Almasry, Bobby Eshleman
In-Reply-To: <20260514-tcp-dm-netkit-v5-0-408c59b91e66@meta.com>
From: Bobby Eshleman <bobbyeshleman@meta.com>
Add nk_devmem.py with four tests for TCP devmem through a netkit device.
These tests duplicate the original devmem tests with a few adjusted
parameters, such as telling ncdevmem to skip device setup (it only has
access to the netkit device, not a physical device).
Each test uses NetDrvContEnv with primary_rx_redirect=True to set up the
BPF redirect program on the primary netkit interface, then calls a
shared run_*() helper which probes for devmem support and configures
the NIC (HDS, RSS, queue lease) before driving the test. NIC state is
restored per-test via defer() callbacks registered inside the helper.
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com>
---
Changes in v5:
- Move require_devmem() inside test functions so ksft_run() reports it
as a SKIP (Sashiko).
- Drop the inaccurate "mirroring the nk_qlease.py pattern" claim from
v4 (Sashiko).
Changes in v4:
- Call configure_nic()/cleanup_nic() once around ksft_run() rather than
relying on per-test configuration inside the run_* helpers.
Changes in v3:
- Reorder os.path expressions
- Drop @ksft_disruptive from check_nk_rx_hds to mirror the original
check_rx_hds in devmem.py
Changes in v2:
- Add nk_devmem.py to TEST_PROGS in Makefile (Sashiko)
---
tools/testing/selftests/drivers/net/hw/Makefile | 1 +
.../testing/selftests/drivers/net/hw/nk_devmem.py | 46 ++++++++++++++++++++++
2 files changed, 47 insertions(+)
diff --git a/tools/testing/selftests/drivers/net/hw/Makefile b/tools/testing/selftests/drivers/net/hw/Makefile
index 5e49d7bffced..c7a1206880ea 100644
--- a/tools/testing/selftests/drivers/net/hw/Makefile
+++ b/tools/testing/selftests/drivers/net/hw/Makefile
@@ -35,6 +35,7 @@ TEST_PROGS = \
irq.py \
loopback.sh \
nic_timestamp.py \
+ nk_devmem.py \
nk_netns.py \
nk_qlease.py \
ntuple.py \
diff --git a/tools/testing/selftests/drivers/net/hw/nk_devmem.py b/tools/testing/selftests/drivers/net/hw/nk_devmem.py
new file mode 100755
index 000000000000..300ed2a70ab4
--- /dev/null
+++ b/tools/testing/selftests/drivers/net/hw/nk_devmem.py
@@ -0,0 +1,46 @@
+#!/usr/bin/env python3
+# SPDX-License-Identifier: GPL-2.0
+"""Test devmem TCP with netkit."""
+
+import os
+from devmem_lib import setup_test, run_rx, run_tx, run_tx_chunks, run_rx_hds
+from lib.py import ksft_run, ksft_exit, ksft_disruptive
+from lib.py import NetDrvContEnv
+
+
+@ksft_disruptive
+def check_nk_rx(cfg) -> None:
+ """Run the devmem RX test through netkit."""
+ run_rx(cfg)
+
+
+@ksft_disruptive
+def check_nk_tx(cfg) -> None:
+ """Run the devmem TX test through netkit."""
+ run_tx(cfg)
+
+
+@ksft_disruptive
+def check_nk_tx_chunks(cfg) -> None:
+ """Run the devmem TX chunking test through netkit."""
+ run_tx_chunks(cfg)
+
+
+def check_nk_rx_hds(cfg) -> None:
+ """Run the HDS test through netkit."""
+ run_rx_hds(cfg)
+
+
+def main() -> None:
+ """Run the netkit devmem test cases."""
+ with NetDrvContEnv(__file__, rxqueues=2, primary_rx_redirect=True) as cfg:
+ setup_test(cfg,
+ os.path.join(os.path.dirname(os.path.abspath(__file__)),
+ "ncdevmem"))
+ ksft_run([check_nk_rx, check_nk_tx, check_nk_tx_chunks,
+ check_nk_rx_hds], args=(cfg,))
+ ksft_exit()
+
+
+if __name__ == "__main__":
+ main()
--
2.53.0-Meta
^ permalink raw reply related
* [PATCH net-next v5 7/8] selftests: drv-net: add primary_rx_redirect support to NetDrvContEnv
From: Bobby Eshleman @ 2026-05-14 17:22 UTC (permalink / raw)
To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Simon Horman, Jonathan Corbet, Shuah Khan, Alex Shi,
Yanteng Si, Dongliang Mu, Michael Chan, Pavan Chebbi,
Joshua Washington, Harshitha Ramamurthy, Saeed Mahameed,
Tariq Toukan, Mark Bloch, Leon Romanovsky, Alexander Duyck,
kernel-team, Daniel Borkmann, Nikolay Aleksandrov, Shuah Khan
Cc: dw, sdf.kernel, mohsin.bashr, willemb, jiang.kun2, xu.xin16,
wang.yaxin, netdev, linux-doc, linux-kernel, linux-rdma, bpf,
linux-kselftest, Stanislav Fomichev, Mina Almasry, Bobby Eshleman
In-Reply-To: <20260514-tcp-dm-netkit-v5-0-408c59b91e66@meta.com>
From: Bobby Eshleman <bobbyeshleman@meta.com>
When sending from a namespace that has access to a netkit device with a
leased queue, the nk primary in the host namespace needs to redirect its
RX to the physical device. This patch adds that redirection BPF program
and teaches the harness to install it.
Add a primary_rx_redirect=False parameter to NetDrvContEnv.__init__().
When enabled, _attach_primary_rx_redirect_bpf() attaches a new BPF TC
program (nk_primary_rx_redirect.bpf.c) to the primary (host-side) netkit
interface. The program redirects non-ICMPv6 IPv6 packets to the physical
NIC via bpf_redirect_neigh(), with the physical ifindex configured via
the .bss map. ICMPv6 is left on the host's netkit primary so IPv6
neighbor discovery still works locally.
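
As an illustration of the filtering rule described above, here is a
minimal Python model of the pass/redirect decision. It is only a sketch
for reasoning about the behaviour, not part of the patch: classify() and
make_frame() are hypothetical helpers, and the string results stand in
for TC_ACT_OK and the bpf_redirect_neigh() call in the BPF program.

```python
ETH_HLEN = 14          # Ethernet header length
IPV6_HLEN = 40         # fixed IPv6 header length
ETH_P_IPV6 = 0x86DD    # EtherType for IPv6
IPPROTO_ICMPV6 = 58    # IPv6 next-header value for ICMPv6


def classify(frame: bytes) -> str:
    """Return "redirect" for non-ICMPv6 IPv6 frames, "pass" otherwise."""
    if len(frame) < ETH_HLEN:
        return "pass"
    # EtherType is the last two bytes of the Ethernet header, big-endian.
    if int.from_bytes(frame[12:14], "big") != ETH_P_IPV6:
        return "pass"
    if len(frame) < ETH_HLEN + IPV6_HLEN:
        return "pass"
    # The next-header field sits at byte offset 6 of the IPv6 header.
    if frame[ETH_HLEN + 6] == IPPROTO_ICMPV6:
        return "pass"
    return "redirect"


def make_frame(ethertype: int, next_header: int) -> bytes:
    """Build a zero-filled frame with only the two inspected fields set."""
    eth = bytes(12) + ethertype.to_bytes(2, "big")
    ip6 = bytes(6) + bytes([next_header]) + bytes(IPV6_HLEN - 7)
    return eth + ip6
```

For example, a TCP-over-IPv6 frame (next header 6) is classified as
"redirect", while an ICMPv6 frame or an IPv4 frame is left alone.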
Extract _find_bss_map_id() from _attach_bpf() into a reusable helper so
other BPF attachment methods can use it.
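
The physical ifindex is written into the .bss map as four bytes in host
byte order, formatted as the space-separated hex string that bpftool's
"value hex" argument expects. A minimal Python sketch of that encoding
follows; ifindex_to_bpftool_hex() is a hypothetical helper name, and the
byteorder parameter exists only so the example can be exercised for both
endiannesses.

```python
import sys


def ifindex_to_bpftool_hex(ifindex, byteorder=sys.byteorder):
    """Encode a 32-bit ifindex as space-separated hex bytes."""
    raw = ifindex.to_bytes(4, byteorder=byteorder)
    return ' '.join(f'{b:02x}' for b in raw)
```

On a little-endian host an ifindex of 5 is encoded with the low byte
first; on a big-endian host the same value is encoded with the low byte
last, so hardcoding either order would write the wrong value on the
other kind of machine.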
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com>
---
Changes in v5:
- Use sys.byteorder when packing phys_ifindex into the BPF .bss map
(Sashiko).
Changes in v3:
- nk_primary_rx_redirect.bpf.c: add header includes to avoid hardcoding
values
- update commit message explaining why ICMP is passed through
- env.py: re-use _tc_ensure_clsact() (had to add ifname parameter)
- env.py: gate the remote IPv6 host route install on primary_rx_redirect
by moving it from _setup_ns() into _attach_primary_rx_redirect_bpf()
---
.../drivers/net/hw/nk_primary_rx_redirect.bpf.c | 39 +++++++++
tools/testing/selftests/drivers/net/lib/py/env.py | 94 +++++++++++++++++-----
2 files changed, 115 insertions(+), 18 deletions(-)
diff --git a/tools/testing/selftests/drivers/net/hw/nk_primary_rx_redirect.bpf.c b/tools/testing/selftests/drivers/net/hw/nk_primary_rx_redirect.bpf.c
new file mode 100644
index 000000000000..46ff494b23de
--- /dev/null
+++ b/tools/testing/selftests/drivers/net/hw/nk_primary_rx_redirect.bpf.c
@@ -0,0 +1,39 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/bpf.h>
+#include <linux/pkt_cls.h>
+#include <linux/if_ether.h>
+#include <linux/in.h>
+#include <linux/ipv6.h>
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_endian.h>
+
+#define ctx_ptr(field) ((void *)(long)(field))
+
+volatile __u32 phys_ifindex;
+
+SEC("tc/ingress")
+int nk_primary_rx_redirect(struct __sk_buff *skb)
+{
+ void *data_end = ctx_ptr(skb->data_end);
+ void *data = ctx_ptr(skb->data);
+ struct ethhdr *eth;
+ struct ipv6hdr *ip6h;
+
+ eth = data;
+ if ((void *)(eth + 1) > data_end)
+ return TC_ACT_OK;
+
+ if (eth->h_proto != bpf_htons(ETH_P_IPV6))
+ return TC_ACT_OK;
+
+ ip6h = data + sizeof(struct ethhdr);
+ if ((void *)(ip6h + 1) > data_end)
+ return TC_ACT_OK;
+
+ if (ip6h->nexthdr == IPPROTO_ICMPV6)
+ return TC_ACT_OK;
+
+ return bpf_redirect_neigh(phys_ifindex, NULL, 0, 0);
+}
+
+char __license[] SEC("license") = "GPL";
diff --git a/tools/testing/selftests/drivers/net/lib/py/env.py b/tools/testing/selftests/drivers/net/lib/py/env.py
index 409b41922245..ef317aef3a0a 100644
--- a/tools/testing/selftests/drivers/net/lib/py/env.py
+++ b/tools/testing/selftests/drivers/net/lib/py/env.py
@@ -2,6 +2,7 @@
import ipaddress
import os
+import sys
import time
import json
from pathlib import Path
@@ -336,15 +337,18 @@ class NetDrvContEnv(NetDrvEpEnv):
+---------------+
"""
- def __init__(self, src_path, rxqueues=1, **kwargs):
+ def __init__(self, src_path, rxqueues=1, primary_rx_redirect=False, **kwargs):
self.netns = None
self._nk_host_ifname = None
self.nk_guest_ifname = None
self._tc_clsact_added = False
self._tc_attached = False
+ self._primary_rx_redirect_attached = False
+ self._primary_rx_redirect_clsact_added = False
self._bpf_prog_pref = None
self._bpf_prog_id = None
self._init_ns_attached = False
+ self._remote_route_added = False
self._old_fwd = None
self._old_accept_ra = None
@@ -396,8 +400,18 @@ class NetDrvContEnv(NetDrvEpEnv):
self._setup_ns()
self._attach_bpf()
+ if primary_rx_redirect:
+ self._attach_primary_rx_redirect_bpf()
def __del__(self):
+ if self._primary_rx_redirect_attached:
+ cmd(f"tc filter del dev {self._nk_host_ifname} ingress", fail=False)
+ self._primary_rx_redirect_attached = False
+
+ if self._primary_rx_redirect_clsact_added:
+ cmd(f"tc qdisc del dev {self._nk_host_ifname} clsact", fail=False)
+ self._primary_rx_redirect_clsact_added = False
+
if self._tc_attached:
cmd(f"tc filter del dev {self.ifname} ingress pref {self._bpf_prog_pref}")
self._tc_attached = False
@@ -406,6 +420,11 @@ class NetDrvContEnv(NetDrvEpEnv):
cmd(f"tc qdisc del dev {self.ifname} clsact")
self._tc_clsact_added = False
+ if self._remote_route_added:
+ cmd(f"ip -6 route del {self.nk_guest_ipv6}/128",
+ host=self.remote, fail=False)
+ self._remote_route_added = False
+
if self._nk_host_ifname:
cmd(f"ip link del dev {self._nk_host_ifname}")
self._nk_host_ifname = None
@@ -459,13 +478,19 @@ class NetDrvContEnv(NetDrvEpEnv):
ip(f"-6 addr add {self.nk_guest_ipv6}/64 dev {self.nk_guest_ifname} nodad", ns=self.netns)
ip(f"-6 route add default via fe80::1 dev {self.nk_guest_ifname}", ns=self.netns)
- def _tc_ensure_clsact(self):
- qdisc = json.loads(cmd(f"tc -j qdisc show dev {self.ifname}").stdout)
+ def _tc_ensure_clsact(self, ifname=None):
+ """Ensure a clsact qdisc exists on @ifname.
+
+ Returns True if this call added the qdisc, otherwise returns False.
+ """
+ if ifname is None:
+ ifname = self.ifname
+ qdisc = json.loads(cmd(f"tc -j qdisc show dev {ifname}").stdout)
for q in qdisc:
if q['kind'] == 'clsact':
- return
- cmd(f"tc qdisc add dev {self.ifname} clsact")
- self._tc_clsact_added = True
+ return False
+ cmd(f"tc qdisc add dev {ifname} clsact")
+ return True
def _get_bpf_prog_ids(self):
filters = json.loads(cmd(f"tc -j filter show dev {self.ifname} ingress").stdout)
@@ -476,28 +501,28 @@ class NetDrvContEnv(NetDrvEpEnv):
return (bpf['pref'], bpf['options']['prog']['id'])
raise Exception("Failed to get BPF prog ID")
+ def _find_bss_map_id(self, prog_id):
+ """Find the .bss map ID for a loaded BPF program."""
+ prog_info = bpftool(f"prog show id {prog_id}", json=True)
+ for map_id in prog_info.get("map_ids", []):
+ map_info = bpftool(f"map show id {map_id}", json=True)
+ if map_info.get("name", "").endswith("bss"):
+ return map_id
+ raise Exception(f"Failed to find .bss map for prog {prog_id}")
+
def _attach_bpf(self):
bpf_obj = self.test_dir / "nk_forward.bpf.o"
if not bpf_obj.exists():
raise KsftSkipEx("BPF prog not found")
- self._tc_ensure_clsact()
+ if self._tc_ensure_clsact():
+ self._tc_clsact_added = True
cmd(f"tc filter add dev {self.ifname} ingress bpf obj {bpf_obj}"
" sec tc/ingress direct-action")
self._tc_attached = True
(self._bpf_prog_pref, self._bpf_prog_id) = self._get_bpf_prog_ids()
- prog_info = bpftool(f"prog show id {self._bpf_prog_id}", json=True)
- map_ids = prog_info.get("map_ids", [])
-
- bss_map_id = None
- for map_id in map_ids:
- map_info = bpftool(f"map show id {map_id}", json=True)
- if map_info.get("name").endswith("bss"):
- bss_map_id = map_id
-
- if bss_map_id is None:
- raise Exception("Failed to find .bss map")
+ bss_map_id = self._find_bss_map_id(self._bpf_prog_id)
ipv6_addr = ipaddress.IPv6Address(self.ipv6_prefix)
ipv6_bytes = ipv6_addr.packed
@@ -505,3 +530,36 @@ class NetDrvContEnv(NetDrvEpEnv):
value = ipv6_bytes + ifindex_bytes
value_hex = ' '.join(f'{b:02x}' for b in value)
bpftool(f"map update id {bss_map_id} key hex 00 00 00 00 value hex {value_hex}")
+
+ def _attach_primary_rx_redirect_bpf(self):
+ """Attach BPF redirect program on the primary netkit ingress."""
+ bpf_obj = self.test_dir / "nk_primary_rx_redirect.bpf.o"
+ if not bpf_obj.exists():
+ raise KsftSkipEx("Primary RX redirect BPF prog not found")
+
+ if self._tc_ensure_clsact(self._nk_host_ifname):
+ self._primary_rx_redirect_clsact_added = True
+ cmd(f"tc filter add dev {self._nk_host_ifname} ingress"
+ f" bpf obj {bpf_obj} sec tc/ingress direct-action")
+ self._primary_rx_redirect_attached = True
+
+ ip(f"-6 route add {self.nk_guest_ipv6}/128 via {self.addr_v['6']}",
+ host=self.remote)
+ self._remote_route_added = True
+
+ filters = json.loads(
+ cmd(f"tc -j filter show dev {self._nk_host_ifname} ingress").stdout)
+ redirect_prog_id = None
+ for bpf in filters:
+ if 'options' not in bpf:
+ continue
+ if bpf['options']['bpf_name'].startswith('nk_primary_rx_redirect'):
+ redirect_prog_id = bpf['options']['prog']['id']
+ break
+ if redirect_prog_id is None:
+ raise Exception("Failed to get primary RX redirect BPF prog ID")
+
+ bss_map_id = self._find_bss_map_id(redirect_prog_id)
+ phys_ifindex_bytes = self.ifindex.to_bytes(4, byteorder=sys.byteorder)
+ value_hex = ' '.join(f'{b:02x}' for b in phys_ifindex_bytes)
+ bpftool(f"map update id {bss_map_id} key hex 00 00 00 00 value hex {value_hex}")
--
2.53.0-Meta
^ permalink raw reply related
* [PATCH net-next v5 6/8] selftests: drv-net: refactor devmem command builders into lib module
From: Bobby Eshleman @ 2026-05-14 17:22 UTC (permalink / raw)
To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Simon Horman, Jonathan Corbet, Shuah Khan, Alex Shi,
Yanteng Si, Dongliang Mu, Michael Chan, Pavan Chebbi,
Joshua Washington, Harshitha Ramamurthy, Saeed Mahameed,
Tariq Toukan, Mark Bloch, Leon Romanovsky, Alexander Duyck,
kernel-team, Daniel Borkmann, Nikolay Aleksandrov, Shuah Khan
Cc: dw, sdf.kernel, mohsin.bashr, willemb, jiang.kun2, xu.xin16,
wang.yaxin, netdev, linux-doc, linux-kernel, linux-rdma, bpf,
linux-kselftest, Stanislav Fomichev, Mina Almasry, Bobby Eshleman
In-Reply-To: <20260514-tcp-dm-netkit-v5-0-408c59b91e66@meta.com>
From: Bobby Eshleman <bobbyeshleman@meta.com>
Adding netkit-based devmem tests is a straightforward copy of the devmem
test commands plus some args for the nk cases, so this patch breaks
these command builders out into helpers used by both.
Though we tried to avoid libraries so as not to raise the barrier to
entry or add complexity (see selftests/drivers/net/README.md, section
"Avoid libraries and frameworks"), factoring out these functions seemed
like the lesser of two evils here, since both test files use the same
commands with only slightly different args per environment.
I experimented with keeping all of the tests in one file to avoid having
helpers in a library file. However, ksft_run() is limited to a single
call per file and the new tests require different environments
(NetDrvContEnv/NetDrvEpEnv), so each test would have had to set up its
own environment instead of sharing one for the entire ksft_run() run.
That ballooned the test time (from under 5s to 30s on my test system).
To strike a balance, the new tests (introduced in subsequent patches)
are placed in separate files, so that all tests using the same env type
share one environment across a single ksft_run() run.
The helpers work transparently with both plain and netkit environments
by inspecting cfg for netkit-specific attributes (netns, nk_queue,
etc.).
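As a rough illustration of that attribute-based dispatch, here is a
minimal standalone sketch. pick_rx_target() and the SimpleNamespace
stand-ins are hypothetical and exist only for this example; the real
helpers take NetDrvEpEnv/NetDrvContEnv objects, and the field values
below are made up.

```python
from types import SimpleNamespace


def pick_rx_target(cfg):
    """Return (ifname, addr, extra_args) for the RX listener.

    A netkit environment is detected by the presence of a 'netns'
    attribute on cfg, mirroring the hasattr() checks in devmem_lib.py.
    """
    if hasattr(cfg, "netns"):
        # Netkit case: listen on the guest device and skip NIC setup (-n).
        return (cfg.nk_guest_ifname, cfg.nk_guest_ipv6,
                [f"-t {cfg.nk_queue}", "-q 1", "-n"])
    # Plain case: listen directly on the physical interface.
    return (cfg.ifname, cfg.addr, [])


# Hypothetical stand-in cfg objects with made-up values.
plain = SimpleNamespace(ifname="eth0", addr="2001:db8::1")
netkit = SimpleNamespace(ifname="eth0", addr="2001:db8::1", netns="ns0",
                         nk_guest_ifname="nk_guest",
                         nk_guest_ipv6="2001:db8::2", nk_queue=1)

print(pick_rx_target(plain))
print(pick_rx_target(netkit))
```

The point of keying on hasattr() rather than on the env class is that
the helpers need no import of, or knowledge about, the netkit env type.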
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com>
---
Changes in v5:
- Place the shared helpers in devmem_lib.py next to the test scripts
rather than under lib/py/ (Jakub).
- Add devmem_lib.py to TEST_FILES (Jakub).
- configure_nic(): register cleanup via defer() and drop the separate
cleanup_nic() helper. (Sashiko)
- Move configure_nic() into the run_*() helpers (with an early return
outside a netkit env) so the queue lease doesn't break
require_devmem()'s bind-rx check.
- Make the queue lease idempotent across run_*() calls.
- Fix pylint import-error and import-order.
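The two configure_nic() items above (cleanup registered at the point of
change, and a lease that is created only once) can be sketched roughly
as follows. This is not the selftest code: contextlib.ExitStack stands
in for the ksft defer() mechanism, and configure_once() and the calls
list are hypothetical names used only to make the ordering visible.

```python
from contextlib import ExitStack
from types import SimpleNamespace

calls = []


def configure_once(cfg, cleanups):
    """Apply a setting, register its undo, and create the lease only once."""
    calls.append("set")
    # Register the undo right after the change, similar in spirit to defer().
    cleanups.callback(calls.append, "restore")

    # Idempotency guard: skip the lease if an earlier call already made it.
    if not hasattr(cfg, "nk_queue"):
        calls.append("lease")
        cfg.nk_queue = 1


cfg = SimpleNamespace()
for _ in range(2):  # two run_*()-style invocations sharing one cfg
    with ExitStack() as cleanups:
        configure_once(cfg, cleanups)

print(calls)
```

The setting is applied and restored on every invocation, while the lease
step runs only on the first one.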
Changes in v4:
- Fixed bad change list version, v4 -> v3 (Stan)
Changes in v3:
- Make socat_send() always bind the source; drop its bind= parameter
and the matching bind=not_ns at the run_rx call site.
- Drop socat_send()'s nodelay= arg; have buf_size>0 imply TCP_NODELAY
since they are only meaningful together.
- configure_nic(): stash originals on cfg instead of using defer(); add
paired cleanup_nic() helper. Drop the per-test configure_nic() calls
from run_rx/run_tx/run_tx_chunks/run_rx_hds; the netkit test file
invokes configure_nic/cleanup_nic once around ksft_run().
- make cfg.devmem_supported and cfg.devmem_probed public attrs (no '_')
for sake of linting
- general cleanup of the code, linting fixes
- In setup_test, drop the unused cfg.listen_ns = getattr(cfg, 'netns',
None) assignment.
- In run_rx, pass flow_steer=not_ns to ncdevmem_rx and bind=not_ns to
socat_send to avoid changing functionality (we want just a straight
refactor here)
Changes in v2:
- Move require_devmem() into individual test functions so KsftSkipEx
propagates up to ksft_run() (Sashiko)
- in ncdevmem_rx(), move -v 7 to take effect for both netns and
non-netns when verify=True
---
tools/testing/selftests/drivers/net/hw/Makefile | 1 +
tools/testing/selftests/drivers/net/hw/devmem.py | 77 ++-----
.../testing/selftests/drivers/net/hw/devmem_lib.py | 222 +++++++++++++++++++++
3 files changed, 236 insertions(+), 64 deletions(-)
diff --git a/tools/testing/selftests/drivers/net/hw/Makefile b/tools/testing/selftests/drivers/net/hw/Makefile
index 82809d5b2478..5e49d7bffced 100644
--- a/tools/testing/selftests/drivers/net/hw/Makefile
+++ b/tools/testing/selftests/drivers/net/hw/Makefile
@@ -52,6 +52,7 @@ TEST_PROGS = \
#
TEST_FILES := \
+ devmem_lib.py \
ethtool_lib.sh \
#
diff --git a/tools/testing/selftests/drivers/net/hw/devmem.py b/tools/testing/selftests/drivers/net/hw/devmem.py
index ee863e90d1e0..031cf9905f65 100755
--- a/tools/testing/selftests/drivers/net/hw/devmem.py
+++ b/tools/testing/selftests/drivers/net/hw/devmem.py
@@ -2,91 +2,40 @@
# SPDX-License-Identifier: GPL-2.0
from os import path
-from lib.py import ksft_run, ksft_exit
-from lib.py import ksft_eq, KsftSkipEx
+from devmem_lib import setup_test, run_rx, run_tx, run_tx_chunks, run_rx_hds
+from lib.py import ksft_run, ksft_exit, ksft_disruptive
from lib.py import NetDrvEpEnv
-from lib.py import bkg, cmd, rand_port, wait_port_listen
-from lib.py import ksft_disruptive
-
-
-def require_devmem(cfg):
- if not hasattr(cfg, "_devmem_probed"):
- probe_command = f"{cfg.bin_local} -f {cfg.ifname}"
- cfg._devmem_supported = cmd(probe_command, fail=False, shell=True).ret == 0
- cfg._devmem_probed = True
-
- if not cfg._devmem_supported:
- raise KsftSkipEx("Test requires devmem support")
@ksft_disruptive
def check_rx(cfg) -> None:
- require_devmem(cfg)
-
- port = rand_port()
- socat = f"socat -u - TCP{cfg.addr_ipver}:{cfg.baddr}:{port},bind={cfg.remote_baddr}:{port}"
- listen_cmd = f"{cfg.bin_local} -l -f {cfg.ifname} -s {cfg.addr} -p {port} -c {cfg.remote_addr} -v 7"
-
- with bkg(listen_cmd, exit_wait=True) as ncdevmem:
- wait_port_listen(port)
- cmd(f"yes $(echo -e \x01\x02\x03\x04\x05\x06) | \
- head -c 1K | {socat}", host=cfg.remote, shell=True)
-
- ksft_eq(ncdevmem.ret, 0)
+ """Run the devmem RX test."""
+ run_rx(cfg)
@ksft_disruptive
def check_tx(cfg) -> None:
- require_devmem(cfg)
-
- port = rand_port()
- listen_cmd = f"socat -U - TCP{cfg.addr_ipver}-LISTEN:{port}"
-
- with bkg(listen_cmd, host=cfg.remote, exit_wait=True) as socat:
- wait_port_listen(port, host=cfg.remote)
- cmd(f"echo -e \"hello\\nworld\"| {cfg.bin_local} -f {cfg.ifname} -s {cfg.remote_addr} -p {port}", shell=True)
-
- ksft_eq(socat.stdout.strip(), "hello\nworld")
+ """Run the devmem TX test."""
+ run_tx(cfg)
@ksft_disruptive
def check_tx_chunks(cfg) -> None:
- require_devmem(cfg)
-
- port = rand_port()
- listen_cmd = f"socat -U - TCP{cfg.addr_ipver}-LISTEN:{port}"
-
- with bkg(listen_cmd, host=cfg.remote, exit_wait=True) as socat:
- wait_port_listen(port, host=cfg.remote)
- cmd(f"echo -e \"hello\\nworld\"| {cfg.bin_local} -f {cfg.ifname} -s {cfg.remote_addr} -p {port} -z 3", shell=True)
-
- ksft_eq(socat.stdout.strip(), "hello\nworld")
+ """Run the devmem TX chunking test."""
+ run_tx_chunks(cfg)
def check_rx_hds(cfg) -> None:
- """Test HDS splitting across payload sizes."""
- require_devmem(cfg)
-
- for size in [1, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192]:
- port = rand_port()
- listen_cmd = f"{cfg.bin_local} -L -l -f {cfg.ifname} -s {cfg.addr} -p {port}"
-
- with bkg(listen_cmd, exit_wait=True) as ncdevmem:
- wait_port_listen(port)
- cmd(f"dd if=/dev/zero bs={size} count=1 2>/dev/null | " +
- f"socat -b {size} -u - TCP{cfg.addr_ipver}:{cfg.baddr}:{port},nodelay",
- host=cfg.remote, shell=True)
-
- ksft_eq(ncdevmem.ret, 0, f"HDS failed for payload size {size}")
+ """Run the HDS test."""
+ run_rx_hds(cfg)
def main() -> None:
+ """Run the devmem test cases."""
with NetDrvEpEnv(__file__) as cfg:
- cfg.bin_local = path.abspath(path.dirname(__file__) + "/ncdevmem")
- cfg.bin_remote = cfg.remote.deploy(cfg.bin_local)
-
+ setup_test(cfg, path.abspath(path.dirname(__file__) + "/ncdevmem"))
ksft_run([check_rx, check_tx, check_tx_chunks, check_rx_hds],
- args=(cfg, ))
+ args=(cfg,))
ksft_exit()
diff --git a/tools/testing/selftests/drivers/net/hw/devmem_lib.py b/tools/testing/selftests/drivers/net/hw/devmem_lib.py
new file mode 100644
index 000000000000..0921ff03eb81
--- /dev/null
+++ b/tools/testing/selftests/drivers/net/hw/devmem_lib.py
@@ -0,0 +1,222 @@
+# SPDX-License-Identifier: GPL-2.0
+"""Shared helpers for devmem TCP selftests."""
+
+import re
+
+from lib.py import (bkg, cmd, defer, ethtool, rand_port, wait_port_listen,
+ ksft_eq, KsftSkipEx, NetNSEnter, EthtoolFamily,
+ NetdevFamily)
+
+
+def require_devmem(cfg):
+ """Probe ncdevmem on cfg.ifname and SKIP the test if devmem isn't supported."""
+ if not hasattr(cfg, "devmem_probed"):
+ probe_command = f"{cfg.bin_local} -f {cfg.ifname}"
+ cfg.devmem_supported = cmd(probe_command, fail=False, shell=True).ret == 0
+ cfg.devmem_probed = True
+
+ if not cfg.devmem_supported:
+ raise KsftSkipEx("Test requires devmem support")
+
+
+def configure_nic(cfg):
+ """Channels, rings, RSS, queue lease for netkit devmem."""
+ if not hasattr(cfg, 'netns'):
+ return
+
+ cfg.require_ipver('6')
+ ethnl = EthtoolFamily()
+
+ channels = ethnl.channels_get({'header': {'dev-index': cfg.ifindex}})
+ channels = channels['combined-count']
+ if channels < 2:
+ raise KsftSkipEx(
+ 'Test requires NETIF with at least 2 combined channels'
+ )
+
+ rings = ethnl.rings_get({'header': {'dev-index': cfg.ifindex}})
+ orig_rx_rings = rings['rx']
+ orig_hds_thresh = rings.get('hds-thresh', 0)
+ orig_data_split = rings.get('tcp-data-split', 'unknown')
+
+ ethnl.rings_set({'header': {'dev-index': cfg.ifindex},
+ 'tcp-data-split': 'enabled',
+ 'hds-thresh': 0,
+ 'rx': min(64, orig_rx_rings)})
+ defer(ethnl.rings_set, {'header': {'dev-index': cfg.ifindex},
+ 'tcp-data-split': orig_data_split,
+ 'hds-thresh': orig_hds_thresh,
+ 'rx': orig_rx_rings})
+
+ cfg.src_queue = channels - 1
+ ethtool(f"-X {cfg.ifname} equal {cfg.src_queue}")
+ defer(ethtool, f"-X {cfg.ifname} default")
+
+ if not hasattr(cfg, 'nk_queue'):
+ with NetNSEnter(str(cfg.netns)):
+ netdevnl = NetdevFamily()
+ lease_result = netdevnl.queue_create({
+ "ifindex": cfg.nk_guest_ifindex,
+ "type": "rx",
+ "lease": {
+ "ifindex": cfg.ifindex,
+ "queue": {"id": cfg.src_queue, "type": "rx"},
+ "netns-id": 0,
+ },
+ })
+ cfg.nk_queue = lease_result['id']
+
+
+def set_flow_rule(cfg, port):
+ """Install a flow rule steering to src_queue and return the flow rule ID."""
+ output = ethtool(
+ f"-N {cfg.ifname} flow-type tcp6 dst-port {port}"
+ f" action {cfg.src_queue}"
+ ).stdout
+ return int(re.search(r'ID (\d+)', output).group(1))
+
+
+def ncdevmem_rx(cfg, port, verify=True, fail_on_linear=False, flow_steer=False):
+ """Build the ncdevmem RX listener command."""
+ if hasattr(cfg, 'netns'):
+ flow_rule_id = set_flow_rule(cfg, port)
+ defer(ethtool, f"-N {cfg.ifname} delete {flow_rule_id}")
+
+ ifname = cfg.nk_guest_ifname
+ addr = cfg.nk_guest_ipv6
+ extras = [f"-t {cfg.nk_queue}", "-q 1", "-n"]
+ else:
+ ifname = cfg.ifname
+ addr = cfg.addr
+ extras = []
+ if flow_steer:
+ extras.append(f"-c {cfg.remote_addr}")
+
+ if verify:
+ extras.append("-v 7")
+ if fail_on_linear:
+ extras.append("-L")
+
+ parts = [cfg.bin_local, "-l", f"-f {ifname}", f"-s {addr}",
+ f"-p {port}", *extras]
+ return " ".join(parts)
+
+
+def ncdevmem_tx(cfg, port, chunk_size=0):
+ """Build the ncdevmem TX send command."""
+ if hasattr(cfg, 'netns'):
+ ifname = cfg.nk_guest_ifname
+ addr = cfg.remote_addr_v['6']
+ extras = ["-t 0", "-q 1", "-n"]
+ else:
+ ifname = cfg.ifname
+ addr = cfg.remote_addr
+ extras = []
+
+ if chunk_size:
+ extras.append(f"-z {chunk_size}")
+
+ parts = [cfg.bin_local, f"-f {ifname}", f"-s {addr}",
+ f"-p {port}", *extras]
+ return " ".join(parts)
+
+
+def socat_send(cfg, port, buf_size=0):
+ """Socat command for sending to the devmem listener.
+
+ When buf_size > 0, force one TCP segment per write of exactly that size by
+ setting socat's buffer (-b) and disabling Nagle (TCP_NODELAY).
+ """
+ proto = f"TCP{cfg.addr_ipver}"
+
+ if hasattr(cfg, 'netns'):
+ addr = f"[{cfg.nk_guest_ipv6}]"
+ else:
+ addr = cfg.baddr
+
+ suffix = f",bind={cfg.remote_baddr}:{port}"
+
+ buf = ""
+ if buf_size:
+ buf = f"-b {buf_size}"
+ suffix += ",nodelay"
+
+ return f"socat {buf} -u - {proto}:{addr}:{port}{suffix}"
+
+
+def socat_listen(cfg, port):
+ """Socat listen command for TX tests."""
+ return f"socat -U - TCP{cfg.addr_ipver}-LISTEN:{port}"
+
+
+def setup_test(cfg, bin_local):
+ """Stash the local ncdevmem path on cfg and deploy it to the remote."""
+ cfg.bin_local = bin_local
+ cfg.bin_remote = cfg.remote.deploy(cfg.bin_local)
+
+
+def run_rx(cfg):
+ """Run the devmem RX test."""
+ require_devmem(cfg)
+ configure_nic(cfg)
+ port = rand_port()
+ socat = socat_send(cfg, port)
+ data_pipe = (f"yes $(echo -e \x01\x02\x03\x04\x05\x06) | head -c 1K"
+ f" | {socat}")
+ netns = getattr(cfg, "netns", None)
+
+ listen_cmd = ncdevmem_rx(cfg, port, flow_steer=not hasattr(cfg, 'netns'))
+ with bkg(listen_cmd, exit_wait=True, ns=netns) as ncdevmem:
+ wait_port_listen(port, proto="tcp", ns=netns)
+ cmd(data_pipe, host=cfg.remote, shell=True)
+ ksft_eq(ncdevmem.ret, 0)
+
+
+def run_tx(cfg):
+ """Run the devmem TX test."""
+ require_devmem(cfg)
+ configure_nic(cfg)
+ netns = getattr(cfg, "netns", None)
+ port = rand_port()
+ tx_cmd = ncdevmem_tx(cfg, port)
+ listen_cmd = socat_listen(cfg, port)
+
+ with bkg(listen_cmd, host=cfg.remote, exit_wait=True) as socat:
+ wait_port_listen(port, host=cfg.remote)
+ cmd(f"bash -c 'echo -e \"hello\\nworld\" | {tx_cmd}'", ns=netns, shell=True)
+ ksft_eq(socat.stdout.strip(), "hello\nworld")
+
+
+def run_tx_chunks(cfg):
+ """Run the devmem TX chunking test."""
+ require_devmem(cfg)
+ configure_nic(cfg)
+ netns = getattr(cfg, "netns", None)
+ port = rand_port()
+ tx_cmd = ncdevmem_tx(cfg, port, chunk_size=3)
+ listen_cmd = socat_listen(cfg, port)
+
+ with bkg(listen_cmd, host=cfg.remote, exit_wait=True) as socat:
+ wait_port_listen(port, host=cfg.remote)
+ cmd(f"bash -c 'echo -e \"hello\\nworld\" | {tx_cmd}'", ns=netns, shell=True)
+ ksft_eq(socat.stdout.strip(), "hello\nworld")
+
+
+def run_rx_hds(cfg):
+ """Run the HDS test by running devmem RX across a segment size sweep."""
+ require_devmem(cfg)
+ configure_nic(cfg)
+ netns = getattr(cfg, "netns", None)
+
+ for size in [1, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192]:
+ port = rand_port()
+
+ listen_cmd = ncdevmem_rx(cfg, port, verify=False,
+ fail_on_linear=True)
+ socat = socat_send(cfg, port, buf_size=size)
+
+ with bkg(listen_cmd, exit_wait=True, ns=netns) as ncdevmem:
+ wait_port_listen(port, proto="tcp", ns=netns)
+ cmd(f"dd if=/dev/zero bs={size} count=1 2>/dev/null | "
+ f"{socat}", host=cfg.remote, shell=True)
+ ksft_eq(ncdevmem.ret, 0, f"HDS failed for payload size {size}")
--
2.53.0-Meta
^ permalink raw reply related
* [PATCH net-next v5 5/8] selftests: drv-net: make attr _nk_guest_ifname public
From: Bobby Eshleman @ 2026-05-14 17:22 UTC (permalink / raw)
To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Simon Horman, Jonathan Corbet, Shuah Khan, Alex Shi,
Yanteng Si, Dongliang Mu, Michael Chan, Pavan Chebbi,
Joshua Washington, Harshitha Ramamurthy, Saeed Mahameed,
Tariq Toukan, Mark Bloch, Leon Romanovsky, Alexander Duyck,
kernel-team, Daniel Borkmann, Nikolay Aleksandrov, Shuah Khan
Cc: dw, sdf.kernel, mohsin.bashr, willemb, jiang.kun2, xu.xin16,
wang.yaxin, netdev, linux-doc, linux-kernel, linux-rdma, bpf,
linux-kselftest, Stanislav Fomichev, Mina Almasry, Bobby Eshleman
In-Reply-To: <20260514-tcp-dm-netkit-v5-0-408c59b91e66@meta.com>
From: Bobby Eshleman <bobbyeshleman@meta.com>
Subsequent patches will use the _nk_guest_ifname as a public attr for
setting up devmem. Rename to nk_guest_ifname to avoid angering the
linter about the '_' prefix being used for a non-private attr.
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com>
---
tools/testing/selftests/drivers/net/hw/nk_qlease.py | 8 ++++----
tools/testing/selftests/drivers/net/lib/py/env.py | 16 ++++++++--------
2 files changed, 12 insertions(+), 12 deletions(-)
diff --git a/tools/testing/selftests/drivers/net/hw/nk_qlease.py b/tools/testing/selftests/drivers/net/hw/nk_qlease.py
index aa83dc321328..139a91ebd229 100755
--- a/tools/testing/selftests/drivers/net/hw/nk_qlease.py
+++ b/tools/testing/selftests/drivers/net/hw/nk_qlease.py
@@ -71,7 +71,7 @@ def test_iou_zcrx(cfg) -> None:
flow_rule_id = set_flow_rule(cfg)
defer(ethtool, f"-N {cfg.ifname} delete {flow_rule_id}")
- rx_cmd = f"ip netns exec {cfg.netns.name} {cfg.bin_local} -s -p {cfg.port} -i {cfg._nk_guest_ifname} -q {cfg.nk_queue}"
+ rx_cmd = f"ip netns exec {cfg.netns.name} {cfg.bin_local} -s -p {cfg.port} -i {cfg.nk_guest_ifname} -q {cfg.nk_queue}"
tx_cmd = f"{cfg.bin_remote} -c -h {cfg.nk_guest_ipv6} -p {cfg.port} -l 12840"
with bkg(rx_cmd, exit_wait=True):
wait_port_listen(cfg.port, proto="tcp", ns=cfg.netns)
@@ -128,7 +128,7 @@ def test_attach_xdp_with_mp(cfg) -> None:
netdevnl = NetdevFamily()
- rx_cmd = f"ip netns exec {cfg.netns.name} {cfg.bin_local} -s -p {cfg.port} -i {cfg._nk_guest_ifname} -q {cfg.nk_queue}"
+ rx_cmd = f"ip netns exec {cfg.netns.name} {cfg.bin_local} -s -p {cfg.port} -i {cfg.nk_guest_ifname} -q {cfg.nk_queue}"
with bkg(rx_cmd):
wait_port_listen(cfg.port, proto="tcp", ns=cfg.netns)
@@ -178,7 +178,7 @@ def test_destroy(cfg) -> None:
ethtool(f"-X {cfg.ifname} equal {cfg.src_queue}")
defer(ethtool, f"-X {cfg.ifname} default")
- rx_cmd = f"ip netns exec {cfg.netns.name} {cfg.bin_local} -s -p {cfg.port} -i {cfg._nk_guest_ifname} -q {cfg.nk_queue}"
+ rx_cmd = f"ip netns exec {cfg.netns.name} {cfg.bin_local} -s -p {cfg.port} -i {cfg.nk_guest_ifname} -q {cfg.nk_queue}"
rx_proc = cmd(rx_cmd, background=True)
wait_port_listen(cfg.port, proto="tcp", ns=cfg.netns)
@@ -196,7 +196,7 @@ def test_destroy(cfg) -> None:
ip(f"link del dev {cfg._nk_host_ifname}")
kill_timer.join()
cfg._nk_host_ifname = None
- cfg._nk_guest_ifname = None
+ cfg.nk_guest_ifname = None
queue_info = netdevnl.queue_get(
{"ifindex": cfg.ifindex, "id": cfg.src_queue, "type": "rx"}
diff --git a/tools/testing/selftests/drivers/net/lib/py/env.py b/tools/testing/selftests/drivers/net/lib/py/env.py
index 24ce122abd9c..409b41922245 100644
--- a/tools/testing/selftests/drivers/net/lib/py/env.py
+++ b/tools/testing/selftests/drivers/net/lib/py/env.py
@@ -339,7 +339,7 @@ class NetDrvContEnv(NetDrvEpEnv):
def __init__(self, src_path, rxqueues=1, **kwargs):
self.netns = None
self._nk_host_ifname = None
- self._nk_guest_ifname = None
+ self.nk_guest_ifname = None
self._tc_clsact_added = False
self._tc_attached = False
self._bpf_prog_pref = None
@@ -390,7 +390,7 @@ class NetDrvContEnv(NetDrvEpEnv):
netkit_links.sort(key=lambda x: x['ifindex'])
self._nk_host_ifname = netkit_links[1]['ifname']
- self._nk_guest_ifname = netkit_links[0]['ifname']
+ self.nk_guest_ifname = netkit_links[0]['ifname']
self.nk_host_ifindex = netkit_links[1]['ifindex']
self.nk_guest_ifindex = netkit_links[0]['ifindex']
@@ -409,7 +409,7 @@ class NetDrvContEnv(NetDrvEpEnv):
if self._nk_host_ifname:
cmd(f"ip link del dev {self._nk_host_ifname}")
self._nk_host_ifname = None
- self._nk_guest_ifname = None
+ self.nk_guest_ifname = None
if self._init_ns_attached:
cmd("ip netns del init", fail=False)
@@ -448,16 +448,16 @@ class NetDrvContEnv(NetDrvEpEnv):
cmd("ip netns attach init 1")
self._init_ns_attached = True
ip("netns set init 0", ns=self.netns)
- ip(f"link set dev {self._nk_guest_ifname} netns {self.netns.name}")
+ ip(f"link set dev {self.nk_guest_ifname} netns {self.netns.name}")
ip(f"link set dev {self._nk_host_ifname} up")
ip(f"-6 addr add fe80::1/64 dev {self._nk_host_ifname} nodad")
ip(f"-6 route add {self.nk_guest_ipv6}/128 via fe80::2 dev {self._nk_host_ifname}")
ip("link set lo up", ns=self.netns)
- ip(f"link set dev {self._nk_guest_ifname} up", ns=self.netns)
- ip(f"-6 addr add fe80::2/64 dev {self._nk_guest_ifname}", ns=self.netns)
- ip(f"-6 addr add {self.nk_guest_ipv6}/64 dev {self._nk_guest_ifname} nodad", ns=self.netns)
- ip(f"-6 route add default via fe80::1 dev {self._nk_guest_ifname}", ns=self.netns)
+ ip(f"link set dev {self.nk_guest_ifname} up", ns=self.netns)
+ ip(f"-6 addr add fe80::2/64 dev {self.nk_guest_ifname}", ns=self.netns)
+ ip(f"-6 addr add {self.nk_guest_ipv6}/64 dev {self.nk_guest_ifname} nodad", ns=self.netns)
+ ip(f"-6 route add default via fe80::1 dev {self.nk_guest_ifname}", ns=self.netns)
def _tc_ensure_clsact(self):
qdisc = json.loads(cmd(f"tc -j qdisc show dev {self.ifname}").stdout)
--
2.53.0-Meta
^ permalink raw reply related
* [PATCH net-next v5 4/8] selftests: drv-net: ncdevmem: add -n flag to skip NIC configuration
From: Bobby Eshleman @ 2026-05-14 17:22 UTC (permalink / raw)
To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Simon Horman, Jonathan Corbet, Shuah Khan, Alex Shi,
Yanteng Si, Dongliang Mu, Michael Chan, Pavan Chebbi,
Joshua Washington, Harshitha Ramamurthy, Saeed Mahameed,
Tariq Toukan, Mark Bloch, Leon Romanovsky, Alexander Duyck,
kernel-team, Daniel Borkmann, Nikolay Aleksandrov, Shuah Khan
Cc: dw, sdf.kernel, mohsin.bashr, willemb, jiang.kun2, xu.xin16,
wang.yaxin, netdev, linux-doc, linux-kernel, linux-rdma, bpf,
linux-kselftest, Stanislav Fomichev, Mina Almasry, Bobby Eshleman
In-Reply-To: <20260514-tcp-dm-netkit-v5-0-408c59b91e66@meta.com>
From: Bobby Eshleman <bobbyeshleman@meta.com>
Add a -n (skip_config) flag that causes ncdevmem to skip NIC
configuration when operating as an RX server. When -n is passed,
ncdevmem skips configuring header split, RSS, and flow steering, as well
as their teardown on exit.
This allows ksft tests to pre-configure the NIC in the host namespace
before launching ncdevmem in the guest namespace. This is needed for
netkit devmem tests where the test harness namespace has direct access
to the NIC and the ncdevmem namespace does not.
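For readers less familiar with getopt option strings, the sketch below
uses Python's getopt module with the same option string as the patched
getopt() call, to show that -n is a bare flag (no trailing ':') while -p
and friends take a value. parse() is a hypothetical helper, and the
argument values are made up for illustration.

```python
import getopt

OPTSTRING = "Lls:c:p:v:q:t:f:z:n"  # same string as the patched getopt() call


def parse(argv):
    """Parse ncdevmem-style arguments and report whether -n was given."""
    opts, _rest = getopt.getopt(argv, OPTSTRING)
    parsed = dict(opts)
    # Options without a trailing ':' carry an empty-string value.
    return {"skip_config": "-n" in parsed, "port": parsed.get("-p")}


print(parse(["-l", "-f", "nk_guest", "-p", "5201", "-t", "1", "-q", "1", "-n"]))
```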
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com>
---
tools/testing/selftests/drivers/net/hw/ncdevmem.c | 58 +++++++++++++----------
1 file changed, 34 insertions(+), 24 deletions(-)
diff --git a/tools/testing/selftests/drivers/net/hw/ncdevmem.c b/tools/testing/selftests/drivers/net/hw/ncdevmem.c
index e098d6534c3c..d96e8a3b5a65 100644
--- a/tools/testing/selftests/drivers/net/hw/ncdevmem.c
+++ b/tools/testing/selftests/drivers/net/hw/ncdevmem.c
@@ -93,6 +93,7 @@ static char *port;
static size_t do_validation;
static int start_queue = -1;
static int num_queues = -1;
+static int skip_config;
static char *ifname;
static unsigned int ifindex;
static unsigned int dmabuf_id;
@@ -828,7 +829,7 @@ static struct netdev_queue_id *create_queues(void)
static int do_server(struct memory_buffer *mem)
{
- struct ethtool_rings_get_rsp *ring_config;
+ struct ethtool_rings_get_rsp *ring_config = NULL;
char ctrl_data[sizeof(int) * 20000];
size_t non_page_aligned_frags = 0;
struct sockaddr_in6 client_addr;
@@ -851,27 +852,29 @@ static int do_server(struct memory_buffer *mem)
return -1;
}
- ring_config = get_ring_config();
- if (!ring_config) {
- pr_err("Failed to get current ring configuration");
- return -1;
- }
+ if (!skip_config) {
+ ring_config = get_ring_config();
+ if (!ring_config) {
+ pr_err("Failed to get current ring configuration");
+ return -1;
+ }
- if (configure_headersplit(ring_config, 1)) {
- pr_err("Failed to enable TCP header split");
- goto err_free_ring_config;
- }
+ if (configure_headersplit(ring_config, 1)) {
+ pr_err("Failed to enable TCP header split");
+ goto err_free_ring_config;
+ }
- /* Configure RSS to divert all traffic from our devmem queues */
- if (configure_rss()) {
- pr_err("Failed to configure rss");
- goto err_reset_headersplit;
- }
+ /* Configure RSS to divert all traffic from our devmem queues */
+ if (configure_rss()) {
+ pr_err("Failed to configure rss");
+ goto err_reset_headersplit;
+ }
- /* Flow steer our devmem flows to start_queue */
- if (configure_flow_steering(&server_sin)) {
- pr_err("Failed to configure flow steering");
- goto err_reset_rss;
+ /* Flow steer our devmem flows to start_queue */
+ if (configure_flow_steering(&server_sin)) {
+ pr_err("Failed to configure flow steering");
+ goto err_reset_rss;
+ }
}
if (bind_rx_queue(ifindex, mem->fd, create_queues(), num_queues, &ys)) {
@@ -1052,13 +1055,17 @@ static int do_server(struct memory_buffer *mem)
err_unbind:
ynl_sock_destroy(ys);
err_reset_flow_steering:
- reset_flow_steering();
+ if (!skip_config)
+ reset_flow_steering();
err_reset_rss:
- reset_rss();
+ if (!skip_config)
+ reset_rss();
err_reset_headersplit:
- restore_ring_config(ring_config);
+ if (!skip_config)
+ restore_ring_config(ring_config);
err_free_ring_config:
- ethtool_rings_get_rsp_free(ring_config);
+ if (!skip_config)
+ ethtool_rings_get_rsp_free(ring_config);
return err;
}
@@ -1404,7 +1411,7 @@ int main(int argc, char *argv[])
int is_server = 0, opt;
int ret, err = 1;
- while ((opt = getopt(argc, argv, "Lls:c:p:v:q:t:f:z:")) != -1) {
+ while ((opt = getopt(argc, argv, "Lls:c:p:v:q:t:f:z:n")) != -1) {
switch (opt) {
case 'L':
fail_on_linear = true;
@@ -1436,6 +1443,9 @@ int main(int argc, char *argv[])
case 'z':
max_chunk = atoi(optarg);
break;
+ case 'n':
+ skip_config = 1;
+ break;
case '?':
fprintf(stderr, "unknown option: %c\n", optopt);
break;
--
2.53.0-Meta
^ permalink raw reply related
* [PATCH net-next v5 3/8] net: devmem: support TX over NETMEM_TX_NO_DMA devices
From: Bobby Eshleman @ 2026-05-14 17:22 UTC (permalink / raw)
To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Simon Horman, Jonathan Corbet, Shuah Khan, Alex Shi,
Yanteng Si, Dongliang Mu, Michael Chan, Pavan Chebbi,
Joshua Washington, Harshitha Ramamurthy, Saeed Mahameed,
Tariq Toukan, Mark Bloch, Leon Romanovsky, Alexander Duyck,
kernel-team, Daniel Borkmann, Nikolay Aleksandrov, Shuah Khan
Cc: dw, sdf.kernel, mohsin.bashr, willemb, jiang.kun2, xu.xin16,
wang.yaxin, netdev, linux-doc, linux-kernel, linux-rdma, bpf,
linux-kselftest, Stanislav Fomichev, Mina Almasry, Bobby Eshleman
In-Reply-To: <20260514-tcp-dm-netkit-v5-0-408c59b91e66@meta.com>
From: Bobby Eshleman <bobbyeshleman@meta.com>
When a netkit virtual device leases queues from a physical NIC, devmem
TX bindings created on the netkit device must still result in the dmabuf
being mapped for DMA by the physical device. This patch accomplishes
this by teaching the bind handler to find the underlying DMA-capable
device via the leased rx queues. The lookup function,
netdev_find_netmem_tx_dev(), can be extended to support other non-netkit
NETMEM_TX_NO_DMA devices in the future if needed.
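One consequence, visible in the devmem.c hunk of this patch, is that a
binding now remembers two devices: the physical NIC (dev) and the
virtual device bind-tx was called on (vdev), and a socket's destination
device may match either. The following is a simplified Python model of
that comparison only; Binding and dst_allowed() are hypothetical names,
and locking, RCU and READ_ONCE() are deliberately not modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Binding:
    dev: object   # physical NIC that performs the DMA
    vdev: object  # virtual device bind-tx was called on (compared, never used)


def dst_allowed(binding, dst_dev):
    """Reject only if dst_dev is missing or matches neither device."""
    if dst_dev is None:
        return False
    return dst_dev is binding.dev or dst_dev is binding.vdev


phys, netkit, other = object(), object(), object()
b = Binding(dev=phys, vdev=netkit)
print(dst_allowed(b, phys), dst_allowed(b, netkit), dst_allowed(b, other))
```

Identity comparison (is) is used on purpose here, matching the "pointer
comparison only, never dereferenced" comment on vdev in devmem.h.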
Additionally, this patch extends validate_xmit_unreadable_skb() to
support the netkit case, where the skb is validated twice: once on the
netkit guest device and again on the physical NIC after BPF redirect or
IP forwarding.
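A small model of the first check in validate_xmit_unreadable_skb() as
changed by the dev.c hunk may help. It covers only the early exit; the
remaining checks in the function are not modeled. passes_early() is a
hypothetical name, and the netmem_tx modes are represented as plain
strings rather than the kernel enum.

```python
def passes_early(frags_readable, netmem_tx):
    """True if the skb takes the early 'goto out' path after this patch."""
    return frags_readable or netmem_tx == "NETMEM_TX_NO_DMA"


# First validation: unreadable skb on the netkit guest device.
print(passes_early(False, "NETMEM_TX_NO_DMA"))
# Ordinary readable skb on any device: unchanged behaviour.
print(passes_early(True, "NETMEM_TX_NONE"))
# Unreadable skb on a device without netmem TX support: no early exit,
# so it falls through to the remaining checks (not modeled here).
print(passes_early(False, "NETMEM_TX_NONE"))
```

In the netkit case the first pass is therefore a pass-through, and the
full validation happens when the skb reaches the physical NIC.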
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com>
---
Changes in v4:
- Fold the `NETMEM_TX_NO_DMA` check in `validate_xmit_unreadable_skb()`
(Stan, Jakub)
- Convert `binding->vdev` to void* opaque cookie with comment (Jakub)
Changes in v3:
- Fix validate_xmit_unreadable_skb() bug for non-devmem
unreadable niovs (should not be dropped)
- Major simplification of validate_xmit_unreadable_skb()
- Fix prematurely released lock in bind-tx handler (Jakub)
Changes in v2:
- Make validate_xmit_unreadable_skb() check the netmem_tx mode before
inspecting frags (Jakub)
- Lock bind_dev around netdev_queue_get_dma_dev() when bind_dev !=
netdev to fix lockdep (Sashiko)
---
net/core/dev.c | 3 ++-
net/core/devmem.c | 6 +++--
net/core/devmem.h | 10 ++++++--
net/core/netdev-genl.c | 63 ++++++++++++++++++++++++++++++++++++++++++++++----
4 files changed, 72 insertions(+), 10 deletions(-)
diff --git a/net/core/dev.c b/net/core/dev.c
index 2da2688fe490..bbc93b181ef9 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3993,7 +3993,8 @@ static struct sk_buff *validate_xmit_unreadable_skb(struct sk_buff *skb,
struct skb_shared_info *shinfo;
struct net_iov *niov;
- if (likely(skb_frags_readable(skb)))
+ if (likely(skb_frags_readable(skb) ||
+ dev->netmem_tx == NETMEM_TX_NO_DMA))
goto out;
if (dev->netmem_tx == NETMEM_TX_NONE)
diff --git a/net/core/devmem.c b/net/core/devmem.c
index 468344739db2..893643909f6a 100644
--- a/net/core/devmem.c
+++ b/net/core/devmem.c
@@ -181,7 +181,7 @@ int net_devmem_bind_dmabuf_to_queue(struct net_device *dev, u32 rxq_idx,
}
struct net_devmem_dmabuf_binding *
-net_devmem_bind_dmabuf(struct net_device *dev,
+net_devmem_bind_dmabuf(struct net_device *dev, void *vdev,
struct device *dma_dev,
enum dma_data_direction direction,
unsigned int dmabuf_fd, struct netdev_nl_sock *priv,
@@ -212,6 +212,7 @@ net_devmem_bind_dmabuf(struct net_device *dev,
}
binding->dev = dev;
+ binding->vdev = vdev;
xa_init_flags(&binding->bound_rxqs, XA_FLAGS_ALLOC);
err = percpu_ref_init(&binding->ref,
@@ -396,7 +397,8 @@ struct net_devmem_dmabuf_binding *net_devmem_get_binding(struct sock *sk,
*/
dst_dev = dst_dev_rcu(dst);
if (unlikely(!dst_dev) ||
- unlikely(dst_dev != READ_ONCE(binding->dev))) {
+ unlikely(dst_dev != READ_ONCE(binding->dev) &&
+ dst_dev != READ_ONCE(binding->vdev))) {
err = -ENODEV;
goto out_unlock;
}
diff --git a/net/core/devmem.h b/net/core/devmem.h
index 1c5c18581fcb..3852a56036cb 100644
--- a/net/core/devmem.h
+++ b/net/core/devmem.h
@@ -19,7 +19,13 @@ struct net_devmem_dmabuf_binding {
struct dma_buf *dmabuf;
struct dma_buf_attachment *attachment;
struct sg_table *sgt;
+ /* Physical NIC that does the actual DMA for this binding. */
struct net_device *dev;
+ /* Opaque cookie identifying the virtual device (e.g. netkit) the user
+ * called bind-tx on. Used only for pointer comparison. Never
+ * dereferenced.
+ */
+ void *vdev;
struct gen_pool *chunk_pool;
/* Protect dev */
struct mutex lock;
@@ -84,7 +90,7 @@ struct dmabuf_genpool_chunk_owner {
void __net_devmem_dmabuf_binding_free(struct work_struct *wq);
struct net_devmem_dmabuf_binding *
-net_devmem_bind_dmabuf(struct net_device *dev,
+net_devmem_bind_dmabuf(struct net_device *dev, void *vdev,
struct device *dma_dev,
enum dma_data_direction direction,
unsigned int dmabuf_fd, struct netdev_nl_sock *priv,
@@ -165,7 +171,7 @@ static inline void net_devmem_put_net_iov(struct net_iov *niov)
}
static inline struct net_devmem_dmabuf_binding *
-net_devmem_bind_dmabuf(struct net_device *dev,
+net_devmem_bind_dmabuf(struct net_device *dev, void *vdev,
struct device *dma_dev,
enum dma_data_direction direction,
unsigned int dmabuf_fd,
diff --git a/net/core/netdev-genl.c b/net/core/netdev-genl.c
index 4d2c49371cdb..b4d48f3672a5 100644
--- a/net/core/netdev-genl.c
+++ b/net/core/netdev-genl.c
@@ -1077,7 +1077,7 @@ int netdev_nl_bind_rx_doit(struct sk_buff *skb, struct genl_info *info)
goto err_rxq_bitmap;
}
- binding = net_devmem_bind_dmabuf(netdev, dma_dev, DMA_FROM_DEVICE,
+ binding = net_devmem_bind_dmabuf(netdev, NULL, dma_dev, DMA_FROM_DEVICE,
dmabuf_fd, priv, info->extack);
if (IS_ERR(binding)) {
err = PTR_ERR(binding);
@@ -1119,9 +1119,43 @@ int netdev_nl_bind_rx_doit(struct sk_buff *skb, struct genl_info *info)
return err;
}
+/* Find the DMA-capable device for a netmem TX binding.
+ *
+ * For NETMEM_TX_DMA devices, return the device itself.
+ * For NETMEM_TX_NO_DMA devices, walk leased RX queues to find the underlying
+ * physical device and return it.
+ */
+static struct net_device *
+netdev_find_netmem_tx_dev(struct net_device *dev)
+{
+ struct netdev_rx_queue *lease_rxq;
+ struct net_device *phys_dev;
+ int i;
+
+ if (dev->netmem_tx == NETMEM_TX_DMA)
+ return dev;
+
+ if (dev->netmem_tx != NETMEM_TX_NO_DMA)
+ return NULL;
+
+ for (i = 0; i < dev->real_num_rx_queues; i++) {
+ lease_rxq = READ_ONCE(__netif_get_rx_queue(dev, i)->lease);
+ if (!lease_rxq)
+ continue;
+
+ phys_dev = lease_rxq->dev;
+ if (netif_device_present(phys_dev) &&
+ phys_dev->netmem_tx == NETMEM_TX_DMA)
+ return phys_dev;
+ }
+
+ return NULL;
+}
+
int netdev_nl_bind_tx_doit(struct sk_buff *skb, struct genl_info *info)
{
struct net_devmem_dmabuf_binding *binding;
+ struct net_device *bind_dev;
struct netdev_nl_sock *priv;
struct net_device *netdev;
struct device *dma_dev;
@@ -1171,22 +1205,41 @@ int netdev_nl_bind_tx_doit(struct sk_buff *skb, struct genl_info *info)
goto err_unlock_netdev;
}
- dma_dev = netdev_queue_get_dma_dev(netdev, 0, NETDEV_QUEUE_TYPE_TX);
- binding = net_devmem_bind_dmabuf(netdev, dma_dev, DMA_TO_DEVICE,
- dmabuf_fd, priv, info->extack);
+ bind_dev = netdev_find_netmem_tx_dev(netdev);
+ if (!bind_dev) {
+ err = -EOPNOTSUPP;
+ NL_SET_ERR_MSG(info->extack,
+ "No DMA-capable device found for netmem TX");
+ goto err_unlock_netdev;
+ }
+
+ if (bind_dev != netdev)
+ netdev_lock(bind_dev);
+
+ dma_dev = netdev_queue_get_dma_dev(bind_dev, 0, NETDEV_QUEUE_TYPE_TX);
+
+ binding = net_devmem_bind_dmabuf(bind_dev,
+ bind_dev != netdev ? netdev : NULL,
+ dma_dev, DMA_TO_DEVICE, dmabuf_fd,
+ priv, info->extack);
if (IS_ERR(binding)) {
err = PTR_ERR(binding);
- goto err_unlock_netdev;
+ goto err_unlock_bind_dev;
}
nla_put_u32(rsp, NETDEV_A_DMABUF_ID, binding->id);
genlmsg_end(rsp, hdr);
+ if (bind_dev != netdev)
+ netdev_unlock(bind_dev);
netdev_unlock(netdev);
mutex_unlock(&priv->lock);
return genlmsg_reply(rsp, info);
+err_unlock_bind_dev:
+ if (bind_dev != netdev)
+ netdev_unlock(bind_dev);
err_unlock_netdev:
netdev_unlock(netdev);
err_unlock_sock:
--
2.53.0-Meta
^ permalink raw reply related
* [PATCH net-next v5 2/8] net: netkit: declare NETMEM_TX_NO_DMA mode
From: Bobby Eshleman @ 2026-05-14 17:22 UTC (permalink / raw)
To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Simon Horman, Jonathan Corbet, Shuah Khan, Alex Shi,
Yanteng Si, Dongliang Mu, Michael Chan, Pavan Chebbi,
Joshua Washington, Harshitha Ramamurthy, Saeed Mahameed,
Tariq Toukan, Mark Bloch, Leon Romanovsky, Alexander Duyck,
kernel-team, Daniel Borkmann, Nikolay Aleksandrov, Shuah Khan
Cc: dw, sdf.kernel, mohsin.bashr, willemb, jiang.kun2, xu.xin16,
wang.yaxin, netdev, linux-doc, linux-kernel, linux-rdma, bpf,
linux-kselftest, Stanislav Fomichev, Mina Almasry, Bobby Eshleman
In-Reply-To: <20260514-tcp-dm-netkit-v5-0-408c59b91e66@meta.com>
From: Bobby Eshleman <bobbyeshleman@meta.com>
Some virtual devices like netkit (or ifb) never DMA and never touch frag
contents; they just forward the skb to another device. They are unable
to forward unreadable skbs, however, because they fail to pass TX
validation checks on dev->netmem_tx. The existing two-state
NETMEM_TX_NONE / NETMEM_TX_DMA doesn't give the TX validator enough
information to differentiate devices that will attempt DMA on the
unreadable skb from those that will simply route it untouched.
Add a third mode to the enum so drivers can indicate 1) if they have
netmem TX support, and 2) if they do, whether they are DMA-capable:
NETMEM_TX_NO_DMA - pass-through, device never DMAs
Widen dev->netmem_tx from a 1-bit field to 2 bits to fit the new value,
and declare netkit as NETMEM_TX_NO_DMA. Devmem TX support over these
devices comes in a follow-up patch.
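
Why the field has to grow can be shown with a small userspace sketch. The struct names and helper functions here are hypothetical, and unsigned int is used in place of unsigned long so the bit-field types stay strictly portable; only the enum values are taken from the patch.

```c
#include <assert.h>

enum netmem_tx_mode { NETMEM_TX_NONE, NETMEM_TX_DMA, NETMEM_TX_NO_DMA };

/* Hypothetical stand-ins for the flag group before and after the patch. */
struct model_flags_1bit {
	unsigned int lltx:1;
	unsigned int netmem_tx:1;
};

struct model_flags_2bit {
	unsigned int lltx:1;
	unsigned int netmem_tx:2;
};

/* Store a mode in the old 1-bit field and read it back. */
static unsigned int model_store_1bit(unsigned int mode)
{
	struct model_flags_1bit f = { 0 };

	f.netmem_tx = mode;	/* value is reduced modulo 2 */
	return f.netmem_tx;
}

/* Store a mode in the widened 2-bit field and read it back. */
static unsigned int model_store_2bit(unsigned int mode)
{
	struct model_flags_2bit f = { 0 };

	assert(mode <= 3);	/* everything up to 3 fits in two bits */
	f.netmem_tx = mode;
	return f.netmem_tx;
}
```

With the 1-bit layout, storing NETMEM_TX_NO_DMA (2) reads back as 0, which is indistinguishable from NETMEM_TX_NONE; with two bits the value survives the round trip.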
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com>
---
Changes in v3:
- net_cachelines/net_device.rst: align the netmem_tx row's type column
with the rest of the table by using "unsigned_long:2" instead of
"unsigned long:2"
- Split this into a distinct patch (Jakub)
---
Documentation/networking/net_cachelines/net_device.rst | 2 +-
Documentation/networking/netmem.rst | 3 +++
Documentation/translations/zh_CN/networking/netmem.rst | 3 +++
drivers/net/netkit.c | 1 +
include/linux/netdevice.h | 3 ++-
5 files changed, 10 insertions(+), 2 deletions(-)
diff --git a/Documentation/networking/net_cachelines/net_device.rst b/Documentation/networking/net_cachelines/net_device.rst
index 1c19bb7705df..7b3392553fd6 100644
--- a/Documentation/networking/net_cachelines/net_device.rst
+++ b/Documentation/networking/net_cachelines/net_device.rst
@@ -10,7 +10,7 @@ Type Name fastpath_tx_acce
=================================== =========================== =================== =================== ===================================================================================
unsigned_long:32 priv_flags read_mostly __dev_queue_xmit(tx)
unsigned_long:1 lltx read_mostly HARD_TX_LOCK,HARD_TX_TRYLOCK,HARD_TX_UNLOCK(tx)
-unsigned long:1 netmem_tx:1; read_mostly
+unsigned_long:2 netmem_tx:2; read_mostly
char name[16]
struct netdev_name_node* name_node
struct dev_ifalias* ifalias
diff --git a/Documentation/networking/netmem.rst b/Documentation/networking/netmem.rst
index 5ccadba4f373..217869d1108d 100644
--- a/Documentation/networking/netmem.rst
+++ b/Documentation/networking/netmem.rst
@@ -99,3 +99,6 @@ Driver TX Requirements
appropriate mode:
- `NETMEM_TX_DMA`: for physical devices that perform DMA.
+
+ - `NETMEM_TX_NO_DMA`: for virtual or passthrough devices that do
+ not DMA, but still support handling of netmem-backed skbs.
diff --git a/Documentation/translations/zh_CN/networking/netmem.rst b/Documentation/translations/zh_CN/networking/netmem.rst
index 9c84423b7528..320f3eacf51b 100644
--- a/Documentation/translations/zh_CN/networking/netmem.rst
+++ b/Documentation/translations/zh_CN/networking/netmem.rst
@@ -92,3 +92,6 @@ dma-mapping API 去处理。
2. 驱动程序应将 `netdev->netmem_tx` 设置为适当的模式:
- `NETMEM_TX_DMA`:适用于执行 DMA 的物理设备。
+
+ - `NETMEM_TX_NO_DMA`:适用于不执行 DMA 的虚拟或透传设备,但仍支持
+ 处理 netmem 支持的 skb。
diff --git a/drivers/net/netkit.c b/drivers/net/netkit.c
index 5e2eecc3165d..0ad6a806d7d5 100644
--- a/drivers/net/netkit.c
+++ b/drivers/net/netkit.c
@@ -466,6 +466,7 @@ static void netkit_setup(struct net_device *dev)
dev->priv_flags |= IFF_NO_QUEUE;
dev->priv_flags |= IFF_DISABLE_NETPOLL;
dev->lltx = true;
+ dev->netmem_tx = NETMEM_TX_NO_DMA;
dev->netdev_ops = &netkit_netdev_ops;
dev->ethtool_ops = &netkit_ethtool_ops;
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index b7a4503f7cdb..bf3dd9b2c1a7 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1797,6 +1797,7 @@ enum netdev_stat_type {
enum netmem_tx_mode {
NETMEM_TX_NONE, /* no netmem TX support */
NETMEM_TX_DMA, /* DMA-capable netmem TX (real HW) */
+ NETMEM_TX_NO_DMA, /* no DMA, e.g. passthrough for virtual devs */
};
enum netdev_reg_state {
@@ -2143,7 +2144,7 @@ struct net_device {
struct_group(priv_flags_fast,
unsigned long priv_flags:32;
unsigned long lltx:1;
- unsigned long netmem_tx:1;
+ unsigned long netmem_tx:2;
);
const struct net_device_ops *netdev_ops;
const struct header_ops *header_ops;
--
2.53.0-Meta
^ permalink raw reply related
* [PATCH net-next v5 1/8] net: convert netmem_tx flag to enum
From: Bobby Eshleman @ 2026-05-14 17:22 UTC (permalink / raw)
To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Simon Horman, Jonathan Corbet, Shuah Khan, Alex Shi,
Yanteng Si, Dongliang Mu, Michael Chan, Pavan Chebbi,
Joshua Washington, Harshitha Ramamurthy, Saeed Mahameed,
Tariq Toukan, Mark Bloch, Leon Romanovsky, Alexander Duyck,
kernel-team, Daniel Borkmann, Nikolay Aleksandrov, Shuah Khan
Cc: dw, sdf.kernel, mohsin.bashr, willemb, jiang.kun2, xu.xin16,
wang.yaxin, netdev, linux-doc, linux-kernel, linux-rdma, bpf,
linux-kselftest, Stanislav Fomichev, Mina Almasry, Bobby Eshleman
In-Reply-To: <20260514-tcp-dm-netkit-v5-0-408c59b91e66@meta.com>
From: Bobby Eshleman <bobbyeshleman@meta.com>
Devices that support netmem TX previously set dev->netmem_tx = true.
This was checked in validate_xmit_unreadable_skb() to drop unreadable
skbs (skbs with dmabuf-backed frags) before they reach drivers that
would mishandle them or devices that would not have the iommu mappings
for them.
A subsequent patch will introduce a third state for virtual devices
that forward unreadable skbs without ever performing DMA on them. To
prepare for that, convert the boolean dev->netmem_tx into an enum:
NETMEM_TX_NONE - no netmem TX support (drop unreadable skbs)
NETMEM_TX_DMA - full support, device does DMA
Update the existing NIC drivers (bnxt, gve, mlx5, fbnic) and the
validators in net/core to use the new enum. No functional change.
Acked-by: Harshitha Ramamurthy <hramamurthy@google.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com>
---
Changes in v4:
- netdevice.h: netmem_tx enum list -> only "device netmem TX mode" in
comment (Stan)
Changes in v3:
- Split NO_DMA changes into subsequent commit (Jakub)
- Move !netdev->netmem_tx -> netdev->netmem_tx ==
NETMEM_TX_NONE conversions to this patch (Jakub)
Changes in v2:
- Squash driver conversion patches (2-5) into patch 1 (Jakub)
---
Documentation/networking/netmem.rst | 5 ++++-
Documentation/translations/zh_CN/networking/netmem.rst | 4 +++-
drivers/net/ethernet/broadcom/bnxt/bnxt.c | 2 +-
drivers/net/ethernet/google/gve/gve_main.c | 2 +-
drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 2 +-
drivers/net/ethernet/meta/fbnic/fbnic_netdev.c | 2 +-
include/linux/netdevice.h | 7 ++++++-
net/core/dev.c | 2 +-
net/core/netdev-genl.c | 2 +-
9 files changed, 19 insertions(+), 9 deletions(-)
diff --git a/Documentation/networking/netmem.rst b/Documentation/networking/netmem.rst
index b63aded46337..5ccadba4f373 100644
--- a/Documentation/networking/netmem.rst
+++ b/Documentation/networking/netmem.rst
@@ -95,4 +95,7 @@ Driver TX Requirements
netdev@, or reach out to the maintainers and/or almasrymina@google.com for
help adding the netmem API.
-2. Driver should declare support by setting `netdev->netmem_tx = true`
+2. Driver should declare support by setting `netdev->netmem_tx` to the
+ appropriate mode:
+
+ - `NETMEM_TX_DMA`: for physical devices that perform DMA.
diff --git a/Documentation/translations/zh_CN/networking/netmem.rst b/Documentation/translations/zh_CN/networking/netmem.rst
index fe351a240f02..9c84423b7528 100644
--- a/Documentation/translations/zh_CN/networking/netmem.rst
+++ b/Documentation/translations/zh_CN/networking/netmem.rst
@@ -89,4 +89,6 @@ dma-mapping API 去处理。
使用某个还不存在的 netmem API,你可以自行添加并提交到 netdev@,也可以联系维护
人员或者发送邮件至 almasrymina@google.com 寻求帮助。
-2. 驱动程序应通过设置 netdev->netmem_tx = true 来表明自身支持 netmem 功能。
+2. 驱动程序应将 `netdev->netmem_tx` 设置为适当的模式:
+
+ - `NETMEM_TX_DMA`:适用于执行 DMA 的物理设备。
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index 945a86696f2f..d4f93e62f583 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -17123,7 +17123,7 @@ static int bnxt_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
dev->queue_mgmt_ops = &bnxt_queue_mgmt_ops_unsupp;
if (BNXT_SUPPORTS_QUEUE_API(bp))
dev->queue_mgmt_ops = &bnxt_queue_mgmt_ops;
- dev->netmem_tx = true;
+ dev->netmem_tx = NETMEM_TX_DMA;
rc = register_netdev(dev);
if (rc)
diff --git a/drivers/net/ethernet/google/gve/gve_main.c b/drivers/net/ethernet/google/gve/gve_main.c
index 00750643e614..e4d78ae52daf 100644
--- a/drivers/net/ethernet/google/gve/gve_main.c
+++ b/drivers/net/ethernet/google/gve/gve_main.c
@@ -2894,7 +2894,7 @@ static int gve_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
goto abort_with_wq;
if (!gve_is_gqi(priv) && !gve_is_qpl(priv))
- dev->netmem_tx = true;
+ dev->netmem_tx = NETMEM_TX_DMA;
err = register_netdev(dev);
if (err)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 85b1ccbd351f..90d2979f1a4f 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -5966,7 +5966,7 @@ static void mlx5e_build_nic_netdev(struct net_device *netdev)
netdev->priv_flags |= IFF_UNICAST_FLT;
- netdev->netmem_tx = true;
+ netdev->netmem_tx = NETMEM_TX_DMA;
netif_set_tso_max_size(netdev, GSO_MAX_SIZE);
mlx5e_set_xdp_feature(priv);
diff --git a/drivers/net/ethernet/meta/fbnic/fbnic_netdev.c b/drivers/net/ethernet/meta/fbnic/fbnic_netdev.c
index 4dea2bb58d2f..f99ca551c1ce 100644
--- a/drivers/net/ethernet/meta/fbnic/fbnic_netdev.c
+++ b/drivers/net/ethernet/meta/fbnic/fbnic_netdev.c
@@ -752,7 +752,7 @@ struct net_device *fbnic_netdev_alloc(struct fbnic_dev *fbd)
netdev->netdev_ops = &fbnic_netdev_ops;
netdev->stat_ops = &fbnic_stat_ops;
netdev->queue_mgmt_ops = &fbnic_queue_mgmt_ops;
- netdev->netmem_tx = true;
+ netdev->netmem_tx = NETMEM_TX_DMA;
fbnic_set_ethtool_ops(netdev);
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index e7af71491a47..b7a4503f7cdb 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1794,6 +1794,11 @@ enum netdev_stat_type {
NETDEV_PCPU_STAT_DSTATS, /* struct pcpu_dstats */
};
+enum netmem_tx_mode {
+ NETMEM_TX_NONE, /* no netmem TX support */
+ NETMEM_TX_DMA, /* DMA-capable netmem TX (real HW) */
+};
+
enum netdev_reg_state {
NETREG_UNINITIALIZED = 0,
NETREG_REGISTERED, /* completed register_netdevice */
@@ -1815,7 +1820,7 @@ enum netdev_reg_state {
* @lltx: device supports lockless Tx. Deprecated for real HW
* drivers. Mainly used by logical interfaces, such as
* bonding and tunnels
- * @netmem_tx: device support netmem_tx.
+ * @netmem_tx: device netmem TX mode
*
* @name: This is the first field of the "visible" part of this structure
* (i.e. as seen by users in the "Space.c" file). It is the name
diff --git a/net/core/dev.c b/net/core/dev.c
index b0691e03dd6b..2da2688fe490 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3996,7 +3996,7 @@ static struct sk_buff *validate_xmit_unreadable_skb(struct sk_buff *skb,
if (likely(skb_frags_readable(skb)))
goto out;
- if (!dev->netmem_tx)
+ if (dev->netmem_tx == NETMEM_TX_NONE)
goto out_free;
shinfo = skb_shinfo(skb);
diff --git a/net/core/netdev-genl.c b/net/core/netdev-genl.c
index b8f6076d8007..4d2c49371cdb 100644
--- a/net/core/netdev-genl.c
+++ b/net/core/netdev-genl.c
@@ -1164,7 +1164,7 @@ int netdev_nl_bind_tx_doit(struct sk_buff *skb, struct genl_info *info)
goto err_unlock_netdev;
}
- if (!netdev->netmem_tx) {
+ if (netdev->netmem_tx == NETMEM_TX_NONE) {
err = -EOPNOTSUPP;
NL_SET_ERR_MSG(info->extack,
"Driver does not support netmem TX");
--
2.53.0-Meta
^ permalink raw reply related
* [PATCH net-next v5 0/8] net: devmem: support devmem with netkit devices
From: Bobby Eshleman @ 2026-05-14 17:22 UTC (permalink / raw)
To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Simon Horman, Jonathan Corbet, Shuah Khan, Alex Shi,
Yanteng Si, Dongliang Mu, Michael Chan, Pavan Chebbi,
Joshua Washington, Harshitha Ramamurthy, Saeed Mahameed,
Tariq Toukan, Mark Bloch, Leon Romanovsky, Alexander Duyck,
kernel-team, Daniel Borkmann, Nikolay Aleksandrov, Shuah Khan
Cc: dw, sdf.kernel, mohsin.bashr, willemb, jiang.kun2, xu.xin16,
wang.yaxin, netdev, linux-doc, linux-kernel, linux-rdma, bpf,
linux-kselftest, Stanislav Fomichev, Mina Almasry, Bobby Eshleman
This series enables TCP devmem TX through netkit devices.
Netkit now supports queue leasing. A physical NIC's RX queue can be
leased to a netkit guest interface inside a container namespace. This
gives the container a devmem-capable data path on the RX side (bind-rx,
etc...). On the TX side, the container process binds to its netkit guest
interface and sends traffic that netkit redirects (via BPF or ip
forwarding) to the physical NIC for DMA.
Two things in the existing devmem TX path prevent this from working:
1. validate_xmit_unreadable_skb() requires dev->netmem_tx before it will
forward a dmabuf-backed (unreadable) skb. This protects skbs from
landing on devices that don't have the IOMMU mappings for the backing
dmabuf or that don't speak netmem. Netkit, however, does not support
DMA, doesn't attempt to read unreadable skb pages and so doesn't
break netmem (it is pure skb routing and redirection). It is
functionally capable of routing unreadable skbs, but there is no way
for the TX validation pathway to distinguish between a device that
will actually attempt DMA-ing the skb and another device
(like netkit) that does not DMA but also does not break
netmem.
2. bind_tx_doit uses the bound device as the DMA device. When the user
binds devmem TX to the netkit guest, the bind handler attempts to
create DMA mappings against netkit, which has no DMA capability and
no IOMMU mappings.
This series solves these problems as follows:
1. Extend netmem_tx to two bits, assigned to one of three values:
NETMEM_TX_NONE - netmem not supported
NETMEM_TX_DMA - netmem supported and performs DMA
NETMEM_TX_NO_DMA - netmem supported, but does not DMA
With these bits, phys devices can set NETMEM_TX_DMA and devices like
netkit set NETMEM_TX_NO_DMA. The validation TX path ensures that any
DMA-capable netdev exactly matches the bound device, guaranteeing the
correct mapping of the bound dmabuf. The validation TX path also
allows devices with NETMEM_TX_NO_DMA to pass, knowing these devices
will not misuse netmem or run into IOMMU faults. Once the skb has been
redirected or routed and finally reaches a physical device's TX path,
the above NETMEM_TX_DMA check is performed again to guarantee the
device has the appropriate binding/mappings.
2. On TX bind, the bind handler recognizes NETMEM_TX_NO_DMA devices and
finds the phys TX device and binds to that instead. For the netkit
case, if it has been leased a queue from a DMA-capable device
already, then the bind action is performed on the DMA-capable device
instead and the dmabuf is mapped correctly.
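
The two-pass TX validation described in point 1 can be modelled in userspace as follows. This is a deliberately reduced sketch under stated assumptions: struct model_dev, struct model_skb and both helpers are hypothetical, every unreadable skb is assumed to be devmem-backed with a single bound device, and the handling of non-devmem unreadable niovs mentioned in the v3 changelog is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum netmem_tx_mode { NETMEM_TX_NONE, NETMEM_TX_DMA, NETMEM_TX_NO_DMA };

/* Hypothetical stand-in for a net_device. */
struct model_dev {
	enum netmem_tx_mode netmem_tx;
};

/* Hypothetical stand-in for an skb: either readable or bound to one device. */
struct model_skb {
	bool frags_readable;
	const struct model_dev *bound_dev;	/* device the dmabuf is mapped for */
};

/* Return true if the skb may be transmitted on dev, false if it is dropped. */
static bool model_validate_xmit(const struct model_skb *skb,
				const struct model_dev *dev)
{
	assert(skb != NULL && dev != NULL);

	/* Readable skbs and pass-through devices need no further checks. */
	if (skb->frags_readable || dev->netmem_tx == NETMEM_TX_NO_DMA)
		return true;

	/* Devices without netmem TX support must never see unreadable skbs. */
	if (dev->netmem_tx == NETMEM_TX_NONE)
		return false;

	/* DMA-capable device: it must be the one holding the mapping. */
	return skb->bound_dev == dev;
}

/* Model the netkit path: validate on the virtual device, then on the NIC. */
static bool model_xmit_via(const struct model_skb *skb,
			   const struct model_dev *virt,
			   const struct model_dev *phys)
{
	return model_validate_xmit(skb, virt) && model_validate_xmit(skb, phys);
}
```

In this model an unreadable skb bound to one NIC passes the netkit hop but is still dropped if redirection lands it on a different DMA-capable device, which is the guarantee the second check is meant to provide.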
---
Changes in v5:
- configure_nic(): register cleanup via defer() and drop the separate
cleanup_nic() helper, this avoids leaking resources init'd during
setup. (Sashiko)
- Use sys.byteorder when packing phys_ifindex into the BPF .bss map (Sashiko).
- fix unhandled KsftSkip in tests (Sashiko).
- see per-patch changes for more details
- Link to v4: https://lore.kernel.org/r/20260511-tcp-dm-netkit-v4-0-841b78b99d74@meta.com
Changes in v4:
- remove enum list from netmem_tx comment (Stan)
- fold NETMEM_TX_NO_DMA check in validate_xmit_unreadable_skb() into
skb_frags_readable check (Stan, Jakub)
- change binding->vdev to void ptr cookie with comment (Jakub)
- Fixed the bad change list version number (Stan)
- Link to v3: https://lore.kernel.org/r/20260507-tcp-dm-netkit-v3-0-52821445867c@meta.com
Changes in v3:
- Fix validate_xmit_unreadable_skb() logic for non-devmem
unreadable niovs (should not be dropped) (Sashiko)
- Simplify lock handling in bind_tx, no premature release (Jakub)
- split NO_DMA changes into separate patch (Jakub)
- fixed some pylint issues, one required an additional patch ("selftests:
drv-net: make attr _nk_guest_ifname public") to rename a variable from
private to public
- see per-patch changelist for more detailed changes
- Link to v2: https://lore.kernel.org/r/20260504-tcp-dm-netkit-v2-0-56d52ac72fd4@meta.com
Changes in v2:
- Squash driver conversion patches (2-5) into patch 1 (Jakub)
- Check netmem_tx mode in validate_xmit_unreadable_skb() before inspecting
frags (Jakub)
- Lock bind_dev around netdev_queue_get_dma_dev() when bind_dev != netdev to
fix lockdep (Sashiko)
- Move require_devmem() into individual test functions so KsftSkipEx goes up to
ksft_run() (Sashiko)
- Add nk_devmem.py to TEST_PROGS in Makefile (Sashiko)
- Link to v1:
https://lore.kernel.org/all/20260428-tcp-dm-netkit-v1-0-719280eba4d2@meta.com/
Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com>
---
Bobby Eshleman (8):
net: convert netmem_tx flag to enum
net: netkit: declare NETMEM_TX_NO_DMA mode
net: devmem: support TX over NETMEM_TX_NO_DMA devices
selftests: drv-net: ncdevmem: add -n flag to skip NIC configuration
selftests: drv-net: make attr _nk_guest_ifname public
selftests: drv-net: refactor devmem command builders into lib module
selftests: drv-net: add primary_rx_redirect support to NetDrvContEnv
selftests: drv-net: add netkit devmem tests
.../networking/net_cachelines/net_device.rst | 2 +-
Documentation/networking/netmem.rst | 8 +-
.../translations/zh_CN/networking/netmem.rst | 7 +-
drivers/net/ethernet/broadcom/bnxt/bnxt.c | 2 +-
drivers/net/ethernet/google/gve/gve_main.c | 2 +-
drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 2 +-
drivers/net/ethernet/meta/fbnic/fbnic_netdev.c | 2 +-
drivers/net/netkit.c | 1 +
include/linux/netdevice.h | 10 +-
net/core/dev.c | 5 +-
net/core/devmem.c | 6 +-
net/core/devmem.h | 10 +-
net/core/netdev-genl.c | 65 +++++-
tools/testing/selftests/drivers/net/hw/Makefile | 2 +
tools/testing/selftests/drivers/net/hw/devmem.py | 77 ++-----
.../testing/selftests/drivers/net/hw/devmem_lib.py | 222 +++++++++++++++++++++
tools/testing/selftests/drivers/net/hw/ncdevmem.c | 58 +++---
.../testing/selftests/drivers/net/hw/nk_devmem.py | 46 +++++
.../drivers/net/hw/nk_primary_rx_redirect.bpf.c | 39 ++++
.../testing/selftests/drivers/net/hw/nk_qlease.py | 8 +-
tools/testing/selftests/drivers/net/lib/py/env.py | 110 +++++++---
21 files changed, 545 insertions(+), 139 deletions(-)
---
base-commit: 8ebd24a7822cbae25beeafba49b2159d6a68a5f2
change-id: 20260423-tcp-dm-netkit-2bd78b638d30
Best regards,
--
Bobby Eshleman <bobbyeshleman@meta.com>
^ permalink raw reply
* Re: [PATCH net-next 0/4] RDMA/net/ionic: Misc updates
From: Creeley, Brett @ 2026-05-14 17:19 UTC (permalink / raw)
To: Leon Romanovsky, Abhijit Gangurde
Cc: Jakub Kicinski, netdev, linux-rdma, Brett Creeley, Andrew Lunn,
David S. Miller, Eric Dumazet, Paolo Abeni, Allen Hubbe,
Jason Gunthorpe, Eric Joyner
In-Reply-To: <20260514164029.GU15586@unreal>
On 5/14/2026 9:40 AM, Leon Romanovsky wrote:
> On Thu, May 14, 2026 at 08:03:09PM +0530, Abhijit Gangurde wrote:
>> On 5/7/26 04:29, Jakub Kicinski wrote:
>>> On Tue, 5 May 2026 21:19:31 -0700 Eric Joyner wrote:
>>>> Other smaller additions add a devlink parameter to the ionic ethernet
>>>> driver for enabling and disabling RDMA,
>>> My understanding is that the devlink param was expected to change
>>> the configuration of the device. IOW user can enable/disable RDMA
>>> to save internal device resources. You seem to be purely preventing
>>> the auxbus device to be added. So there's nothing gained here compared
>>> to simply not loading the RDMA driver. What am I missing?
>> You're right that the current implementation controls only the auxiliary bus
>> device registration and doesn't reconfigure firmware resource allocation.
>> The intent behind this devlink param is to provide per-device granularity
>> for enabling/disabling RDMA. In a system with multiple ionic NICs, an
>> administrator may want RDMA active on some devices but not others.
>> That said, is this per-device control justification sufficient on its
>> own, or are firmware-side changes a hard requirement for this to be
>> acceptable?
> I'm confident that the administrator can vibe code an appropriate udev
> rule and disable autoprobing for this case.
>
> The real advantage of a devlink knob here is the ability to control the
> firmware.
>
> Thanks
Based on the documentation in devlink-params.rst, the devlink knob for
enable_rdma indicates that when enabled the driver will instantiate RDMA
specific auxiliary device of the devlink device. The documentation
doesn't state what to do when enable_rdma is disabled, but removing the
auxiliary device seemed like the natural opposite of the enabled
behavior.
If that's not the case, does the documentation need to be updated
accordingly?
Thanks,
Brett
>
>> Thanks,
>> Abhijit
^ permalink raw reply
* Re: [PATCH 1/2] RDMA/siw: reject MPA FPDU length underflow before signed receive math
From: Bernard Metzler @ 2026-05-14 17:10 UTC (permalink / raw)
To: Michael Bommarito, Jason Gunthorpe, Leon Romanovsky, linux-rdma
Cc: linux-kernel
In-Reply-To: <20260513175325.2042630-2-michael.bommarito@gmail.com>
On 13.05.2026 19:53, Michael Bommarito wrote:
> A malicious connected siw peer can send an iWARP FPDU whose MPA length
> field (c_hdr->mpa_len, 16 bit big-endian, peer-controlled) is smaller
> than the fixed DDP/RDMAP header for the announced opcode. Soft-iWARP
> parses the full header in siw_get_hdr() based on iwarp_pktinfo[opcode]
> .hdr_len, but never compares mpa_len against that header length.
>
> siw_tcp_rx_data() then derives
>
> srx->fpdu_part_rem = be16_to_cpu(mpa_len) - fpdu_part_rcvd
> + MPA_HDR_SIZE;
>
> where fpdu_part_rcvd equals iwarp_pktinfo[opcode].hdr_len at this
> point. For a tagged WRITE (hdr_len 16, MPA_HDR_SIZE 2) the smallest
> on-wire mpa_len of 0 yields fpdu_part_rem = -14, and any mpa_len below
> hdr_len - MPA_HDR_SIZE underflows to a negative int.
>
> The signed value then flows into siw_proc_write()/siw_proc_rresp() as
>
> bytes = min(srx->fpdu_part_rem, srx->skb_new);
>
> is handed to siw_check_mem() as an int len (whose interval check
> addr + len > mem->va + mem->len is satisfied for a valid base when
> len is negative), and reaches siw_rx_data() -> siw_rx_kva() /
> siw_rx_umem() -> skb_copy_bits() as a signed copy length. The header
> copy branch in skb_copy_bits() promotes that to size_t, producing a
> multi-gigabyte read.
>
> KASAN under a KUnit harness that drives the real kernel TCP receive
> path -- a loopback AF_INET socketpair, the malformed FPDU written via
> kernel_sendmsg, sk_data_ready firing in softirq, tcp_read_sock
> dispatching to siw_tcp_rx_data -- reports:
>
> BUG: KASAN: use-after-free in skb_copy_bits+0x284/0x480
> Read of size 4294967295 at addr ffff888...
> Call Trace:
> skb_copy_bits
> siw_rx_kva
> siw_rx_data
> siw_check_mem
> siw_proc_write
> siw_tcp_rx_data
> __tcp_read_sock
> siw_qp_llp_data_ready
> tcp_data_ready
> tcp_data_queue
>
> Add the missing invariant at the earliest point where the peer header
> is fully assembled. iwarp_pktinfo[*].hdr_len - MPA_HDR_SIZE is exactly
> the value the siw transmitter uses as the minimum mpa_len for each
> opcode (drivers/infiniband/sw/siw/siw_qp.c:33), so this matches the
> protocol contract. Out-of-range FPDUs terminate the connection with
> TERM_ERROR_LAYER_LLP / LLP_ETYPE_MPA / LLP_ECODE_FPDU_START -- which
> is RFC 5044 Section 8 error code 3 ("Marker and ULPDU Length fields
> do not agree on the start of an FPDU"), the correct framing-error
> class for this inconsistency.
>
> Fixes: 8b6a361b8c48 ("rdma/siw: receive path")
> Cc: stable@vger.kernel.org
> Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com>
> Assisted-by: Claude:claude-opus-4-7
> ---
> See cover letter for full root cause, series rationale, and test
> summary. [2/2] adds the KUnit regression harness used to validate
> this fix.
>
> drivers/infiniband/sw/siw/siw_qp_rx.c | 15 +++++++++++++++
> 1 file changed, 15 insertions(+)
>
> diff --git a/drivers/infiniband/sw/siw/siw_qp_rx.c b/drivers/infiniband/sw/siw/siw_qp_rx.c
> index e8a88b378d51..34d03584160c 100644
> --- a/drivers/infiniband/sw/siw/siw_qp_rx.c
> +++ b/drivers/infiniband/sw/siw/siw_qp_rx.c
> @@ -1081,6 +1081,21 @@ static int siw_get_hdr(struct siw_rx_stream *srx)
> return -EAGAIN;
> }
>
> + /*
> + * Peer-controlled mpa_len must not underflow srx->fpdu_part_rem
> + * in siw_tcp_rx_data(); a negative value flows as a signed copy
> + * length into siw_check_mem() and skb_copy_bits().
> + */
Excellent finding. This was an open gateway for all evil.
> + if (unlikely(be16_to_cpu(c_hdr->mpa_len) + MPA_HDR_SIZE <
> + iwarp_pktinfo[opcode].hdr_len)) {
> + pr_warn_ratelimited("siw: short mpa_len %u for opcode %u (hdr_len %u)\n",
I think we should stay within 80 characters per line, so let's
wrap the line above.
Otherwise
Acked-by: Bernard Metzler <bernard.metzler@linux.dev>
> + be16_to_cpu(c_hdr->mpa_len), opcode,
> + iwarp_pktinfo[opcode].hdr_len);
> + siw_init_terminate(rx_qp(srx), TERM_ERROR_LAYER_LLP,
> + LLP_ETYPE_MPA, LLP_ECODE_FPDU_START, 0);
> + return -EINVAL;
> + }
> +
> /*
> * DDP/RDMAP header receive completed. Check if the current
> * DDP segment starts a new RDMAP message or continues a previously
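For readers following the thread, the underflow can be modelled in a few lines of userspace C. This is only a sketch under stated assumptions: the remaining-bytes formula is my reading of the comment in the hunk above, MPA_HDR_SIZE is assumed to be 2, the helper names are hypothetical, and the header length of 28 is an illustrative number rather than a value taken from iwarp_pktinfo[].

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed value: the MPA length field itself occupies 2 bytes. */
#define MPA_HDR_SIZE 2

/*
 * Assumed model of the bytes still expected after the header:
 * signed arithmetic on a peer-controlled 16-bit length.
 */
static int fpdu_bytes_remaining(uint16_t mpa_len, int hdr_len)
{
	return (int)mpa_len - hdr_len + MPA_HDR_SIZE;
}

/* The invariant the patch enforces: mpa_len must cover the header. */
static bool mpa_len_ok(uint16_t mpa_len, int hdr_len)
{
	return (int)mpa_len + MPA_HDR_SIZE >= hdr_len;
}

/* Illustrative check with a hypothetical 28-byte header. */
static void mpa_len_selftest(void)
{
	/* One byte short: the remainder goes negative ... */
	assert(fpdu_bytes_remaining(25, 28) == -1);
	/* ... and reads as a huge length once treated as unsigned. */
	assert((uint32_t)fpdu_bytes_remaining(25, 28) == 4294967295u);
	/* The added check rejects it before the remainder is computed. */
	assert(!mpa_len_ok(25, 28));
	/* The shortest legal FPDU leaves nothing after the header. */
	assert(mpa_len_ok(26, 28));
	assert(fpdu_bytes_remaining(26, 28) == 0);
}
```

In this model a length that is one byte too short yields -1, which matches the "Read of size 4294967295" in the KASAN report once the value is interpreted as an unsigned 32-bit quantity.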
^ permalink raw reply
* Re: [PATCH net-next 0/4] RDMA/net/ionic: Misc updates
From: Leon Romanovsky @ 2026-05-14 16:40 UTC (permalink / raw)
To: Abhijit Gangurde
Cc: Jakub Kicinski, netdev, linux-rdma, Brett Creeley, Andrew Lunn,
David S. Miller, Eric Dumazet, Paolo Abeni, Allen Hubbe,
Jason Gunthorpe, Eric Joyner
In-Reply-To: <4dc23648-7ec1-b68c-0e1b-282e014e534c@amd.com>
On Thu, May 14, 2026 at 08:03:09PM +0530, Abhijit Gangurde wrote:
>
> On 5/7/26 04:29, Jakub Kicinski wrote:
> > On Tue, 5 May 2026 21:19:31 -0700 Eric Joyner wrote:
> > > Other smaller additions add a devlink parameter to the ionic ethernet
> > > driver for enabling and disabling RDMA,
> > My understanding is that the devlink param was expected to change
> > the configuration of the device. IOW user can enable/disable RDMA
> > to save internal device resources. You seem to be purely preventing
> > the auxbus device to be added. So there's nothing gained here compared
> > to simply not loading the RDMA driver. What am I missing?
> You're right that the current implementation controls only the auxiliary bus
> device registration and doesn't reconfigure firmware resource allocation.
> The intent behind this devlink param is to provide per-device granularity
> for enabling/disabling RDMA. In a system with multiple ionic NICs, an
> administrator may want RDMA active on some devices but not others.
> That said, is this per-device control justification sufficient on its
> own, or are firmware-side changes a hard requirement for this to be
> acceptable?
I'm confident that the administrator can vibe code an appropriate udev
rule and disable autoprobing for this case.
The real advantage of a devlink knob here is the ability to control the
firmware.
Thanks
>
> Thanks,
> Abhijit
^ permalink raw reply
* Re: [PATCH net-next 3/4] RDMA/ionic: Add debugfs support
From: Leon Romanovsky @ 2026-05-14 16:35 UTC (permalink / raw)
To: Greg Kroah-Hartman
Cc: Jakub Kicinski, Eric Joyner, netdev, linux-rdma, Brett Creeley,
Andrew Lunn, David S. Miller, Eric Dumazet, Paolo Abeni,
Abhijit Gangurde, Allen Hubbe, Jason Gunthorpe
In-Reply-To: <2026051440-devourer-appendix-4326@gregkh>
On Thu, May 14, 2026 at 09:04:08AM +0200, Greg Kroah-Hartman wrote:
> On Thu, May 14, 2026 at 09:00:48AM +0300, Leon Romanovsky wrote:
> > On Wed, May 13, 2026 at 05:23:14PM -0700, Jakub Kicinski wrote:
> > > On Wed, 13 May 2026 10:21:13 +0300 Leon Romanovsky wrote:
> > > > 3. The patch is too large and exposes too many details that should be
> > > > gathered through the FW (fwctl).
> > >
> > > Why? What's wrong with debugfs? Much easier for people to access.
> >
> > There is nothing inherently wrong with debugfs. You can see recently
> > accepted debugfs patches from hns [1].
> >
> > The issue here is what data is being dumped through debugfs, and in what
> > quantity. From a quick look, ionic_dev_info_show() appears to print
> > raw data coming straight from the FW.
> >
> > In my view, debugfs should expose in‑kernel structures that are shaped
> > and controlled by the kernel itself. IMHO it is not the right place to
> > debug FW state. There can always be exceptions, of course, but in this
> > case the driver is effectively dumping everything from pds_core/FW in
> > the RDMA layer.
>
> debugfs is for anything you want, there is nothing wrong with doing
> this in debugfs, in fact, it's preferred. Don't spread debug info out
> into other areas, that makes it harder for admins and users to properly
> secure things from stuff they don't want users to have access to.
We are not discussing a general principle here, but this specific patchset,
which dumps information gathered and consumed by one subsystem into another.
If netdev wants to print raw firmware data, they are free to do so. I do not
want to see this in RDMA.
Thanks
>
> thanks,
>
> greg k-h
^ permalink raw reply
* [PATCH rdma-next v5 7/7] RDMA/bnxt_re: Enable app allocated QPs
From: Sriharsha Basavapatna @ 2026-05-14 16:23 UTC (permalink / raw)
To: leon, jgg
Cc: linux-rdma, andrew.gospodarek, selvin.xavier,
kalesh-anakkur.purayil, Sriharsha Basavapatna
In-Reply-To: <20260514162336.72644-1-sriharsha.basavapatna@broadcom.com>
The driver supports a new comp_mask: REQ_MASK_FIXED_QUE_ATTR.
The application sets this comp_mask bit in the CREATE_QP ureq
to indicate direct control of the QP. The driver goes through
the required processing for app allocated QPs (previous patches).
Only variable WQE mode is supported for these QPs.
Signed-off-by: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com>
Reviewed-by: Selvin Xavier <selvin.xavier@broadcom.com>
---
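Note for reviewers: the parameter check this patch adds can be tried out in userspace with a minimal sketch such as the one below. It is not the driver code. roundup_pow2() is a stand-in for the kernel's roundup_pow_of_two(), check_fixed_que_attr() is a hypothetical name, and the sq_slots < 2 guard is an addition of the sketch (the kernel helper's result is undefined for 0), not something the patch does.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Userspace stand-in for roundup_pow_of_two(); n must be > 0. */
static uint64_t roundup_pow2(uint32_t n)
{
	uint64_t p = 1;

	assert(n > 0);
	while (p < n)
		p <<= 1;
	return p;
}

/*
 * Mirrors the checks applied when FIXED_QUE_ATTR is set:
 * variable WQE mode only, and 0 < sq_npsn <= roundup(sq_slots / 2).
 */
static int check_fixed_que_attr(bool var_wqe_mode, uint32_t sq_slots,
				uint32_t sq_npsn)
{
	if (!var_wqe_mode)
		return -EOPNOTSUPP;
	/* Sketch-only guard, not in the patch: avoids rounding up zero. */
	if (sq_slots < 2)
		return -EINVAL;
	if (!sq_npsn || sq_npsn > roundup_pow2(sq_slots / 2))
		return -EINVAL;
	return 0;
}
```

For example, with sq_slots = 100 the upper bound is roundup_pow2(50) = 64, so sq_npsn = 64 is accepted and sq_npsn = 65 is rejected with -EINVAL.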
drivers/infiniband/hw/bnxt_re/ib_verbs.c | 19 +++++++++++++++----
include/uapi/rdma/bnxt_re-abi.h | 2 +-
2 files changed, 16 insertions(+), 5 deletions(-)
diff --git a/drivers/infiniband/hw/bnxt_re/ib_verbs.c b/drivers/infiniband/hw/bnxt_re/ib_verbs.c
index 9905f1c039d7..51958c5515b6 100644
--- a/drivers/infiniband/hw/bnxt_re/ib_verbs.c
+++ b/drivers/infiniband/hw/bnxt_re/ib_verbs.c
@@ -1717,11 +1717,11 @@ static int bnxt_re_init_qp_attr(struct bnxt_re_qp *qp, struct bnxt_re_pd *pd,
struct ib_qp_init_attr *init_attr,
struct bnxt_re_ucontext *uctx,
struct bnxt_re_qp_req *ureq,
- struct bnxt_re_dbr_obj *dbr_obj)
+ struct bnxt_re_dbr_obj *dbr_obj,
+ bool fixed_que_attr)
{
struct bnxt_qplib_dev_attr *dev_attr;
struct bnxt_qplib_qp *qplqp;
- bool fixed_que_attr = false;
struct bnxt_re_dev *rdev;
struct bnxt_re_cq *cq;
int rc = 0, qptype;
@@ -1741,6 +1741,13 @@ static int bnxt_re_init_qp_attr(struct bnxt_re_qp *qp, struct bnxt_re_pd *pd,
return qptype;
qplqp->type = (u8)qptype;
qplqp->wqe_mode = bnxt_re_is_var_size_supported(rdev, uctx);
+ if (fixed_que_attr) {
+ if (qplqp->wqe_mode != BNXT_QPLIB_WQE_MODE_VARIABLE)
+ return -EOPNOTSUPP;
+ if (!ureq->sq_npsn ||
+ ureq->sq_npsn > roundup_pow_of_two(ureq->sq_slots / 2))
+ return -EINVAL;
+ }
qplqp->dev_cap_flags = dev_attr->dev_cap_flags;
qplqp->cctx = rdev->chip_ctx;
if (init_attr->qp_type == IB_QPT_RC) {
@@ -1925,6 +1932,7 @@ int bnxt_re_create_qp(struct ib_qp *ib_qp, struct ib_qp_init_attr *qp_init_attr,
struct bnxt_qplib_dev_attr *dev_attr;
struct uverbs_attr_bundle *attrs;
struct bnxt_re_ucontext *uctx;
+ bool fixed_que_attr = false;
struct bnxt_re_qp_req ureq;
struct bnxt_re_dev *rdev;
struct bnxt_re_pd *pd;
@@ -1941,7 +1949,8 @@ int bnxt_re_create_qp(struct ib_qp *ib_qp, struct ib_qp_init_attr *qp_init_attr,
uctx = rdma_udata_to_drv_context(udata, struct bnxt_re_ucontext, ib_uctx);
if (udata) {
- rc = ib_copy_validate_udata_in_cm(udata, ureq, qp_handle, 0);
+ rc = ib_copy_validate_udata_in_cm(udata, ureq, qp_handle,
+ BNXT_RE_QP_REQ_MASK_FIXED_QUE_ATTR);
if (rc)
return rc;
@@ -1955,6 +1964,8 @@ int bnxt_re_create_qp(struct ib_qp *ib_qp, struct ib_qp_init_attr *qp_init_attr,
kref_get(&dbr_obj->usecnt);
qp->dbr_obj = dbr_obj;
}
+ if (ureq.comp_mask & BNXT_RE_QP_REQ_MASK_FIXED_QUE_ATTR)
+ fixed_que_attr = true;
}
rc = bnxt_re_test_qp_limits(rdev, qp_init_attr, dev_attr);
@@ -1965,7 +1976,7 @@ int bnxt_re_create_qp(struct ib_qp *ib_qp, struct ib_qp_init_attr *qp_init_attr,
qp->rdev = rdev;
rc = bnxt_re_init_qp_attr(qp, pd, qp_init_attr, uctx, &ureq,
- dbr_obj);
+ dbr_obj, fixed_que_attr);
if (rc)
goto fail;
diff --git a/include/uapi/rdma/bnxt_re-abi.h b/include/uapi/rdma/bnxt_re-abi.h
index 4da8cda337dc..a4599d7b736a 100644
--- a/include/uapi/rdma/bnxt_re-abi.h
+++ b/include/uapi/rdma/bnxt_re-abi.h
@@ -126,7 +126,7 @@ struct bnxt_re_resize_cq_req {
};
enum bnxt_re_qp_mask {
- BNXT_RE_QP_REQ_MASK_VAR_WQE_SQ_SLOTS = 0x1,
+ BNXT_RE_QP_REQ_MASK_FIXED_QUE_ATTR = 0x1,
};
struct bnxt_re_qp_req {
--
2.51.2.636.ga99f379adf
^ permalink raw reply related
* [PATCH rdma-next v5 6/7] RDMA/bnxt_re: Support doorbells for app allocated QPs
From: Sriharsha Basavapatna @ 2026-05-14 16:23 UTC (permalink / raw)
To: leon, jgg
Cc: linux-rdma, andrew.gospodarek, selvin.xavier,
kalesh-anakkur.purayil, Sriharsha Basavapatna
In-Reply-To: <20260514162336.72644-1-sriharsha.basavapatna@broadcom.com>
App allocated QPs can use a separate doorbell for each QP.
This doorbell region can be passed through a new driver specific
DBR_HANDLE attribute, during QP creation. When this attribute
is set, associate the QP with the given doorbell region.
While a QP holds a reference to the dbr, the dbr cannot be
destroyed; an explicit destroy request is rejected with -EBUSY.
The current atomic usecnt doesn't handle implicit teardown of the
dbr (process exit or driver removal), which ignores the EBUSY error.
To address this, convert the counter to a kref so that the uobject
(and the associated doorbell resource) is freed only when usecnt
drops to zero.
Signed-off-by: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com>
Reviewed-by: Selvin Xavier <selvin.xavier@broadcom.com>
---
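Note for reviewers: the intended lifetime rules can be summarized with a small userspace model. This is a sketch only; struct dbr_model and its helpers are hypothetical and use a plain counter in place of struct kref, so none of the kernel's atomicity guarantees are represented.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the dbr object and its kref. */
struct dbr_model {
	unsigned int usecnt;	/* 1 (uobject) + n (QPs using this dbr) */
	bool freed;		/* set where the release callback would run */
};

static void dbr_init(struct dbr_model *d)
{
	d->usecnt = 1;		/* the uobject's own reference */
	d->freed = false;
}

/* Taken when a QP is created on this dbr. */
static void dbr_get(struct dbr_model *d)
{
	d->usecnt++;
}

/* Dropped on QP destroy, or by cleanup for the uobject reference. */
static void dbr_put(struct dbr_model *d)
{
	assert(d->usecnt > 0);
	if (--d->usecnt == 0)
		d->freed = true;
}

/* Explicit destroy fails while QPs hold references; teardown never fails. */
static int dbr_cleanup(struct dbr_model *d, bool explicit_destroy)
{
	if (explicit_destroy && d->usecnt > 1)
		return -EBUSY;
	dbr_put(d);
	return 0;
}
```

In this model, an explicit destroy with one QP attached returns -EBUSY and leaves the count untouched, while an implicit teardown drops only the uobject reference, so the object is released when the last QP lets go.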
drivers/infiniband/hw/bnxt_re/ib_verbs.c | 35 ++++++++++++++++++---
drivers/infiniband/hw/bnxt_re/ib_verbs.h | 4 ++-
drivers/infiniband/hw/bnxt_re/uapi.c | 39 ++++++++++++++++++++++--
include/uapi/rdma/bnxt_re-abi.h | 4 +++
4 files changed, 73 insertions(+), 9 deletions(-)
diff --git a/drivers/infiniband/hw/bnxt_re/ib_verbs.c b/drivers/infiniband/hw/bnxt_re/ib_verbs.c
index 9fd85d81bcea..9905f1c039d7 100644
--- a/drivers/infiniband/hw/bnxt_re/ib_verbs.c
+++ b/drivers/infiniband/hw/bnxt_re/ib_verbs.c
@@ -1024,6 +1024,9 @@ int bnxt_re_destroy_qp(struct ib_qp *ib_qp, struct ib_udata *udata)
if (rc)
ibdev_err(&rdev->ibdev, "Failed to destroy HW QP");
+ if (qp->dbr_obj)
+ kref_put(&qp->dbr_obj->usecnt, bnxt_re_dbr_kref_release);
+
if (rdma_is_kernel_res(&qp->ib_qp.res)) {
flags = bnxt_re_lock_cqs(qp);
bnxt_qplib_clean_qp(&qp->qplib_qp);
@@ -1191,7 +1194,8 @@ static int bnxt_re_get_psn_bytes(struct bnxt_re_dev *rdev,
static int bnxt_re_init_user_qp(struct bnxt_re_dev *rdev, struct bnxt_re_pd *pd,
struct bnxt_re_qp *qp, struct bnxt_re_ucontext *cntx,
struct bnxt_re_qp_req *ureq,
- bool fixed_que_attr)
+ bool fixed_que_attr,
+ struct bnxt_re_dbr_obj *dbr_obj)
{
struct bnxt_qplib_qp *qplib_qp;
struct ib_umem *umem;
@@ -1234,8 +1238,11 @@ static int bnxt_re_init_user_qp(struct bnxt_re_dev *rdev, struct bnxt_re_pd *pd,
goto rqfail;
done:
+ if (dbr_obj)
+ qplib_qp->dpi = &dbr_obj->dpi;
+ else
+ qplib_qp->dpi = &cntx->dpi;
qplib_qp->qp_handle = ureq->qp_handle;
- qplib_qp->dpi = &cntx->dpi;
qplib_qp->is_user = true;
return 0;
@@ -1709,7 +1716,8 @@ static void bnxt_re_qp_calculate_msn_psn_size(struct bnxt_re_qp *qp,
static int bnxt_re_init_qp_attr(struct bnxt_re_qp *qp, struct bnxt_re_pd *pd,
struct ib_qp_init_attr *init_attr,
struct bnxt_re_ucontext *uctx,
- struct bnxt_re_qp_req *ureq)
+ struct bnxt_re_qp_req *ureq,
+ struct bnxt_re_dbr_obj *dbr_obj)
{
struct bnxt_qplib_dev_attr *dev_attr;
struct bnxt_qplib_qp *qplqp;
@@ -1776,7 +1784,8 @@ static int bnxt_re_init_qp_attr(struct bnxt_re_qp *qp, struct bnxt_re_pd *pd,
bnxt_re_adjust_gsi_sq_attr(qp, init_attr, uctx);
if (uctx) { /* This will update DPI and qp_handle */
- rc = bnxt_re_init_user_qp(rdev, pd, qp, uctx, ureq, fixed_que_attr);
+ rc = bnxt_re_init_user_qp(rdev, pd, qp, uctx, ureq, fixed_que_attr,
+ dbr_obj);
if (rc)
return rc;
}
@@ -1912,7 +1921,9 @@ static int bnxt_re_add_unique_gid(struct bnxt_re_dev *rdev)
int bnxt_re_create_qp(struct ib_qp *ib_qp, struct ib_qp_init_attr *qp_init_attr,
struct ib_udata *udata)
{
+ struct bnxt_re_dbr_obj *dbr_obj = NULL;
struct bnxt_qplib_dev_attr *dev_attr;
+ struct uverbs_attr_bundle *attrs;
struct bnxt_re_ucontext *uctx;
struct bnxt_re_qp_req ureq;
struct bnxt_re_dev *rdev;
@@ -1933,6 +1944,17 @@ int bnxt_re_create_qp(struct ib_qp *ib_qp, struct ib_qp_init_attr *qp_init_attr,
rc = ib_copy_validate_udata_in_cm(udata, ureq, qp_handle, 0);
if (rc)
return rc;
+
+ attrs = rdma_udata_to_uverbs_attr_bundle(udata);
+ if (uverbs_attr_is_valid(attrs,
+ BNXT_RE_CREATE_QP_ATTR_DBR_HANDLE)) {
+ dbr_obj = uverbs_attr_get_obj(attrs,
+ BNXT_RE_CREATE_QP_ATTR_DBR_HANDLE);
+ if (IS_ERR(dbr_obj))
+ return PTR_ERR(dbr_obj);
+ kref_get(&dbr_obj->usecnt);
+ qp->dbr_obj = dbr_obj;
+ }
}
rc = bnxt_re_test_qp_limits(rdev, qp_init_attr, dev_attr);
@@ -1942,7 +1964,8 @@ int bnxt_re_create_qp(struct ib_qp *ib_qp, struct ib_qp_init_attr *qp_init_attr,
}
qp->rdev = rdev;
- rc = bnxt_re_init_qp_attr(qp, pd, qp_init_attr, uctx, &ureq);
+ rc = bnxt_re_init_qp_attr(qp, pd, qp_init_attr, uctx, &ureq,
+ dbr_obj);
if (rc)
goto fail;
@@ -2012,6 +2035,8 @@ int bnxt_re_create_qp(struct ib_qp *ib_qp, struct ib_qp_init_attr *qp_init_attr,
bnxt_qplib_free_qp_res(&rdev->qplib_res, &qp->qplib_qp);
bnxt_re_qp_free_umem(qp);
fail:
+ if (dbr_obj)
+ kref_put(&dbr_obj->usecnt, bnxt_re_dbr_kref_release);
return rc;
}
diff --git a/drivers/infiniband/hw/bnxt_re/ib_verbs.h b/drivers/infiniband/hw/bnxt_re/ib_verbs.h
index 08f71a94d55d..cdc403bf9e5d 100644
--- a/drivers/infiniband/hw/bnxt_re/ib_verbs.h
+++ b/drivers/infiniband/hw/bnxt_re/ib_verbs.h
@@ -96,6 +96,7 @@ struct bnxt_re_qp {
struct bnxt_re_cq *scq;
struct bnxt_re_cq *rcq;
struct dentry *dentry;
+ struct bnxt_re_dbr_obj *dbr_obj; /* doorbell region */
};
struct bnxt_re_cq {
@@ -167,7 +168,7 @@ struct bnxt_re_dbr_obj {
struct bnxt_re_dev *rdev;
struct bnxt_qplib_dpi dpi;
struct bnxt_re_user_mmap_entry *entry;
- atomic_t usecnt; /* QPs using this dbr */
+ struct kref usecnt; /* 1 (uobject) + n (QPs using this dbr) */
};
struct bnxt_re_flow {
@@ -308,4 +309,5 @@ void bnxt_re_unlock_cqs(struct bnxt_re_qp *qp, unsigned long flags);
struct bnxt_re_user_mmap_entry*
bnxt_re_mmap_entry_insert(struct bnxt_re_ucontext *uctx, u64 mem_offset,
enum bnxt_re_mmap_flag mmap_flag, u64 *offset);
+void bnxt_re_dbr_kref_release(struct kref *ref);
#endif /* __BNXT_RE_IB_VERBS_H__ */
diff --git a/drivers/infiniband/hw/bnxt_re/uapi.c b/drivers/infiniband/hw/bnxt_re/uapi.c
index 3eaee7101615..1b3116dd1fcf 100644
--- a/drivers/infiniband/hw/bnxt_re/uapi.c
+++ b/drivers/infiniband/hw/bnxt_re/uapi.c
@@ -369,6 +369,7 @@ static int UVERBS_HANDLER(BNXT_RE_METHOD_DBR_ALLOC)(struct uverbs_attr_bundle *a
}
obj->rdev = rdev;
+ kref_init(&obj->usecnt);
uobj->object = obj;
uverbs_finalize_uobj_create(attrs, BNXT_RE_ALLOC_DBR_HANDLE);
@@ -391,15 +392,32 @@ static int UVERBS_HANDLER(BNXT_RE_METHOD_DBR_ALLOC)(struct uverbs_attr_bundle *a
return ret;
}
+void bnxt_re_dbr_kref_release(struct kref *ref)
+{
+ struct bnxt_re_dbr_obj *obj =
+ container_of(ref, struct bnxt_re_dbr_obj, usecnt);
+
+ rdma_user_mmap_entry_remove(&obj->entry->rdma_entry);
+ bnxt_qplib_free_uc_dpi(&obj->rdev->qplib_res, &obj->dpi);
+ kfree(obj);
+}
+
static int bnxt_re_dbr_cleanup(struct ib_uobject *uobject,
enum rdma_remove_reason why,
struct uverbs_attr_bundle *attrs)
{
struct bnxt_re_dbr_obj *obj = uobject->object;
- struct bnxt_re_dev *rdev = obj->rdev;
- rdma_user_mmap_entry_remove(&obj->entry->rdma_entry);
- bnxt_qplib_free_uc_dpi(&rdev->qplib_res, &obj->dpi);
+ /* If it is being destroyed explicitly while QPs still hold a
+ * reference (> 1), reject it with EBUSY. If no QP references
+ * or implicit teardown (process exit, driver removal), drop
+ * the uobject reference unconditionally. The object gets freed
+ * (bnxt_re_dbr_kref_release) when the usecnt goes to zero.
+ */
+ if (why == RDMA_REMOVE_DESTROY && kref_read(&obj->usecnt) > 1)
+ return -EBUSY;
+
+ kref_put(&obj->usecnt, bnxt_re_dbr_kref_release);
return 0;
}
@@ -459,11 +477,26 @@ DECLARE_UVERBS_NAMED_METHOD(BNXT_RE_METHOD_GET_DEFAULT_DBR,
DECLARE_UVERBS_GLOBAL_METHODS(BNXT_RE_OBJECT_DEFAULT_DBR,
&UVERBS_METHOD(BNXT_RE_METHOD_GET_DEFAULT_DBR));
+ADD_UVERBS_ATTRIBUTES_SIMPLE(
+ bnxt_re_qp_create,
+ UVERBS_OBJECT_QP,
+ UVERBS_METHOD_QP_CREATE,
+ UVERBS_ATTR_IDR(BNXT_RE_CREATE_QP_ATTR_DBR_HANDLE,
+ BNXT_RE_OBJECT_DBR,
+ UVERBS_ACCESS_READ,
+ UA_OPTIONAL));
+
+const struct uapi_definition bnxt_re_create_qp_defs[] = {
+ UAPI_DEF_CHAIN_OBJ_TREE(UVERBS_OBJECT_QP, &bnxt_re_qp_create),
+ {},
+};
+
const struct uapi_definition bnxt_re_uapi_defs[] = {
UAPI_DEF_CHAIN_OBJ_TREE_NAMED(BNXT_RE_OBJECT_ALLOC_PAGE),
UAPI_DEF_CHAIN_OBJ_TREE_NAMED(BNXT_RE_OBJECT_NOTIFY_DRV),
UAPI_DEF_CHAIN_OBJ_TREE_NAMED(BNXT_RE_OBJECT_GET_TOGGLE_MEM),
UAPI_DEF_CHAIN_OBJ_TREE_NAMED(BNXT_RE_OBJECT_DBR),
UAPI_DEF_CHAIN_OBJ_TREE_NAMED(BNXT_RE_OBJECT_DEFAULT_DBR),
+ UAPI_DEF_CHAIN(bnxt_re_create_qp_defs),
{}
};
diff --git a/include/uapi/rdma/bnxt_re-abi.h b/include/uapi/rdma/bnxt_re-abi.h
index db8400f2ce3b..4da8cda337dc 100644
--- a/include/uapi/rdma/bnxt_re-abi.h
+++ b/include/uapi/rdma/bnxt_re-abi.h
@@ -138,6 +138,10 @@ struct bnxt_re_qp_req {
__u32 sq_npsn;
};
+enum bnxt_re_create_qp_attrs {
+ BNXT_RE_CREATE_QP_ATTR_DBR_HANDLE = UVERBS_ID_DRIVER_NS_WITH_UHW,
+};
+
struct bnxt_re_qp_resp {
__u32 qpid;
__u32 rsvd;
--
2.51.2.636.ga99f379adf
^ permalink raw reply related
* [PATCH rdma-next v5 5/7] RDMA/bnxt_re: Update hwq depth for app allocated QPs
From: Sriharsha Basavapatna @ 2026-05-14 16:23 UTC (permalink / raw)
To: leon, jgg
Cc: linux-rdma, andrew.gospodarek, selvin.xavier,
kalesh-anakkur.purayil, Sriharsha Basavapatna
In-Reply-To: <20260514162336.72644-1-sriharsha.basavapatna@broadcom.com>
The hwq depth shouldn't be computed using slots/round-up logic for
app allocated QPs; use the max_wqe value saved earlier instead.
Signed-off-by: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com>
Reviewed-by: Selvin Xavier <selvin.xavier@broadcom.com>
---
drivers/infiniband/hw/bnxt_re/ib_verbs.c | 23 ++++++++++++++++-------
1 file changed, 16 insertions(+), 7 deletions(-)
diff --git a/drivers/infiniband/hw/bnxt_re/ib_verbs.c b/drivers/infiniband/hw/bnxt_re/ib_verbs.c
index ae32f86b9e9b..9fd85d81bcea 100644
--- a/drivers/infiniband/hw/bnxt_re/ib_verbs.c
+++ b/drivers/infiniband/hw/bnxt_re/ib_verbs.c
@@ -1346,7 +1346,7 @@ static int bnxt_re_qp_alloc_init_xrrq(struct bnxt_re_qp *qp)
return rc;
}
-static int bnxt_re_setup_qp_hwqs(struct bnxt_re_qp *qp)
+static int bnxt_re_setup_qp_hwqs(struct bnxt_re_qp *qp, bool fixed_que_attr)
{
struct bnxt_qplib_res *res = &qp->rdev->qplib_res;
struct bnxt_qplib_qp *qplib_qp = &qp->qplib_qp;
@@ -1360,12 +1360,17 @@ static int bnxt_re_setup_qp_hwqs(struct bnxt_re_qp *qp)
hwq_attr.res = res;
hwq_attr.sginfo = &sq->sg_info;
hwq_attr.stride = bnxt_qplib_get_stride();
- hwq_attr.depth = bnxt_qplib_get_depth(sq, wqe_mode, true);
hwq_attr.aux_stride = qplib_qp->psn_sz;
- hwq_attr.aux_depth = (qplib_qp->psn_sz) ?
- bnxt_qplib_set_sq_size(sq, wqe_mode) : 0;
- if (qplib_qp->is_host_msn_tbl && qplib_qp->psn_sz)
+ if (!fixed_que_attr) {
+ hwq_attr.depth = bnxt_qplib_get_depth(sq, wqe_mode, true);
+ hwq_attr.aux_depth = (qplib_qp->psn_sz) ?
+ bnxt_qplib_set_sq_size(sq, wqe_mode) : 0;
+ if (qplib_qp->is_host_msn_tbl && qplib_qp->psn_sz)
+ hwq_attr.aux_depth = qplib_qp->msn_tbl_sz;
+ } else {
+ hwq_attr.depth = sq->max_wqe;
hwq_attr.aux_depth = qplib_qp->msn_tbl_sz;
+ }
hwq_attr.type = HWQ_TYPE_QUEUE;
rc = bnxt_qplib_alloc_init_hwq(&sq->hwq, &hwq_attr);
if (rc)
@@ -1376,6 +1381,9 @@ static int bnxt_re_setup_qp_hwqs(struct bnxt_re_qp *qp)
CMDQ_CREATE_QP_SQ_LVL_SFT);
sq->hwq.pg_sz_lvl = pg_sz_lvl;
+ if (qplib_qp->srq)
+ goto done;
+
hwq_attr.res = res;
hwq_attr.sginfo = &rq->sg_info;
hwq_attr.stride = bnxt_qplib_get_stride();
@@ -1392,6 +1400,7 @@ static int bnxt_re_setup_qp_hwqs(struct bnxt_re_qp *qp)
CMDQ_CREATE_QP_RQ_LVL_SFT);
rq->hwq.pg_sz_lvl = pg_sz_lvl;
+done:
if (qplib_qp->psn_sz) {
rc = bnxt_re_qp_alloc_init_xrrq(qp);
if (rc)
@@ -1460,7 +1469,7 @@ static struct bnxt_re_qp *bnxt_re_create_shadow_qp
qp->qplib_qp.rq_hdr_buf_size = BNXT_QPLIB_MAX_GRH_HDR_SIZE_IPV6;
qp->qplib_qp.dpi = &rdev->dpi_privileged;
- rc = bnxt_re_setup_qp_hwqs(qp);
+ rc = bnxt_re_setup_qp_hwqs(qp, false);
if (rc)
goto fail;
@@ -1774,7 +1783,7 @@ static int bnxt_re_init_qp_attr(struct bnxt_re_qp *qp, struct bnxt_re_pd *pd,
bnxt_re_qp_calculate_msn_psn_size(qp, fixed_que_attr, ureq);
- rc = bnxt_re_setup_qp_hwqs(qp);
+ rc = bnxt_re_setup_qp_hwqs(qp, fixed_que_attr);
if (rc)
goto free_umem;
--
2.51.2.636.ga99f379adf
^ permalink raw reply related
* [PATCH rdma-next v5 4/7] RDMA/bnxt_re: Update msn table size for app allocated QPs
From: Sriharsha Basavapatna @ 2026-05-14 16:23 UTC (permalink / raw)
To: leon, jgg
Cc: linux-rdma, andrew.gospodarek, selvin.xavier,
kalesh-anakkur.purayil, Sriharsha Basavapatna
In-Reply-To: <20260514162336.72644-1-sriharsha.basavapatna@broadcom.com>
For app allocated QPs, the driver shouldn't use slots/round-up logic
to compute the msn table size. The application handles this logic
itself: it computes 'sq_npsn' and passes it to the driver through a
new uapi parameter.
Signed-off-by: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com>
Reviewed-by: Selvin Xavier <selvin.xavier@broadcom.com>
---
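Note for reviewers: the effect on the PSN/MSN area size can be illustrated with a rough userspace sketch. It models only the variable-WQE user path, where the entry count starts from sq_slots. psn_area_bytes() and roundup_pow2() are hypothetical names, and the 8-byte entry size used in the example is a placeholder, not the size of any driver structure.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Userspace stand-in for roundup_pow_of_two(); n must be > 0. */
static uint64_t roundup_pow2(uint32_t n)
{
	uint64_t p = 1;

	assert(n > 0);
	while (p < n)
		p <<= 1;
	return p;
}

/*
 * Bytes to map for the PSN/MSN area. For app allocated QPs the entry
 * count comes straight from sq_npsn; otherwise it is derived from
 * sq_slots (which must be > 0 here) and rounded up to a power of two
 * when the host MSN table is in use.
 */
static uint64_t psn_area_bytes(uint32_t sq_slots, uint32_t sq_npsn,
			       uint32_t entry_sz, bool host_msn_tbl,
			       bool fixed_que_attr)
{
	uint64_t nume;

	if (fixed_que_attr) {
		nume = sq_npsn;
	} else {
		nume = sq_slots;
		if (host_msn_tbl)
			nume = roundup_pow2(sq_slots);
	}
	return nume * entry_sz;
}
```

With 100 slots and the placeholder entry size, the default path maps 128 * 8 = 1024 bytes when the host MSN table is used, whereas an application passing sq_npsn = 32 gets exactly 32 * 8 = 256 bytes.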
drivers/infiniband/hw/bnxt_re/ib_verbs.c | 61 +++++++++++++++---------
include/uapi/rdma/bnxt_re-abi.h | 1 +
2 files changed, 40 insertions(+), 22 deletions(-)
diff --git a/drivers/infiniband/hw/bnxt_re/ib_verbs.c b/drivers/infiniband/hw/bnxt_re/ib_verbs.c
index fd1ea053d563..ae32f86b9e9b 100644
--- a/drivers/infiniband/hw/bnxt_re/ib_verbs.c
+++ b/drivers/infiniband/hw/bnxt_re/ib_verbs.c
@@ -1159,29 +1159,39 @@ static int bnxt_re_setup_sginfo(struct bnxt_re_dev *rdev,
static int bnxt_re_get_psn_bytes(struct bnxt_re_dev *rdev,
struct bnxt_re_ucontext *cntx,
struct bnxt_qplib_qp *qplib_qp,
- struct bnxt_re_qp_req *ureq)
+ struct bnxt_re_qp_req *ureq,
+ bool fixed_que_attr)
{
int psn_sz, psn_nume;
- psn_sz = bnxt_qplib_is_chip_gen_p5_p7(rdev->chip_ctx) ?
- sizeof(struct sq_psn_search_ext) :
- sizeof(struct sq_psn_search);
- if (cntx && bnxt_re_is_var_size_supported(rdev, cntx)) {
- psn_nume = ureq->sq_slots;
+ if (rdev->dev_attr &&
+ _is_host_msn_table(rdev->dev_attr->dev_cap_flags2))
+ psn_sz = sizeof(struct sq_msn_search);
+ else
+ psn_sz = bnxt_qplib_is_chip_gen_p5_p7(rdev->chip_ctx) ?
+ sizeof(struct sq_psn_search_ext) :
+ sizeof(struct sq_psn_search);
+ if (!fixed_que_attr) {
+ if (cntx && bnxt_re_is_var_size_supported(rdev, cntx)) {
+ psn_nume = ureq->sq_slots;
+ } else {
+ psn_nume = (qplib_qp->wqe_mode == BNXT_QPLIB_WQE_MODE_STATIC) ?
+ qplib_qp->sq.max_wqe : ((qplib_qp->sq.max_wqe * qplib_qp->sq.wqe_size) /
+ sizeof(struct bnxt_qplib_sge));
+ }
+ if (_is_host_msn_table(rdev->qplib_res.dattr->dev_cap_flags2))
+ psn_nume = roundup_pow_of_two(psn_nume);
} else {
- psn_nume = (qplib_qp->wqe_mode == BNXT_QPLIB_WQE_MODE_STATIC) ?
- qplib_qp->sq.max_wqe : ((qplib_qp->sq.max_wqe * qplib_qp->sq.wqe_size) /
- sizeof(struct bnxt_qplib_sge));
+ psn_nume = ureq->sq_npsn;
}
- if (_is_host_msn_table(rdev->qplib_res.dattr->dev_cap_flags2))
- psn_nume = roundup_pow_of_two(psn_nume);
return psn_nume * psn_sz;
}
static int bnxt_re_init_user_qp(struct bnxt_re_dev *rdev, struct bnxt_re_pd *pd,
struct bnxt_re_qp *qp, struct bnxt_re_ucontext *cntx,
- struct bnxt_re_qp_req *ureq)
+ struct bnxt_re_qp_req *ureq,
+ bool fixed_que_attr)
{
struct bnxt_qplib_qp *qplib_qp;
struct ib_umem *umem;
@@ -1193,7 +1203,7 @@ static int bnxt_re_init_user_qp(struct bnxt_re_dev *rdev, struct bnxt_re_pd *pd,
bytes = (qplib_qp->sq.max_wqe * qplib_qp->sq.wqe_size);
/* Consider mapping PSN search memory only for RC QPs. */
if (qplib_qp->type == CMDQ_CREATE_QP_TYPE_RC)
- bytes += bnxt_re_get_psn_bytes(rdev, cntx, qplib_qp, ureq);
+ bytes += bnxt_re_get_psn_bytes(rdev, cntx, qplib_qp, ureq, fixed_que_attr);
bytes = PAGE_ALIGN(bytes);
umem = ib_umem_get(&rdev->ibdev, ureq->qpsva, bytes,
@@ -1647,7 +1657,9 @@ static int bnxt_re_init_qp_type(struct bnxt_re_dev *rdev,
return qptype;
}
-static void bnxt_re_qp_calculate_msn_psn_size(struct bnxt_re_qp *qp)
+static void bnxt_re_qp_calculate_msn_psn_size(struct bnxt_re_qp *qp,
+ bool fixed_que_attr,
+ struct bnxt_re_qp_req *req)
{
struct bnxt_qplib_qp *qplib_qp = &qp->qplib_qp;
struct bnxt_qplib_q *sq = &qplib_qp->sq;
@@ -1670,12 +1682,17 @@ static void bnxt_re_qp_calculate_msn_psn_size(struct bnxt_re_qp *qp)
/* Update msn tbl size */
if (qplib_qp->is_host_msn_tbl && qplib_qp->psn_sz) {
- if (wqe_mode == BNXT_QPLIB_WQE_MODE_STATIC)
- qplib_qp->msn_tbl_sz =
- roundup_pow_of_two(bnxt_qplib_set_sq_size(sq, wqe_mode));
- else
- qplib_qp->msn_tbl_sz =
- roundup_pow_of_two(bnxt_qplib_set_sq_size(sq, wqe_mode)) / 2;
+ if (!fixed_que_attr) {
+ if (wqe_mode == BNXT_QPLIB_WQE_MODE_STATIC)
+ qplib_qp->msn_tbl_sz =
+ roundup_pow_of_two(bnxt_qplib_set_sq_size(sq, wqe_mode));
+ else
+ qplib_qp->msn_tbl_sz =
+ roundup_pow_of_two(bnxt_qplib_set_sq_size(sq, wqe_mode))
+ / 2;
+ } else {
+ qplib_qp->msn_tbl_sz = req->sq_npsn;
+ }
qplib_qp->msn = 0;
}
}
@@ -1750,12 +1767,12 @@ static int bnxt_re_init_qp_attr(struct bnxt_re_qp *qp, struct bnxt_re_pd *pd,
bnxt_re_adjust_gsi_sq_attr(qp, init_attr, uctx);
if (uctx) { /* This will update DPI and qp_handle */
- rc = bnxt_re_init_user_qp(rdev, pd, qp, uctx, ureq);
+ rc = bnxt_re_init_user_qp(rdev, pd, qp, uctx, ureq, fixed_que_attr);
if (rc)
return rc;
}
- bnxt_re_qp_calculate_msn_psn_size(qp);
+ bnxt_re_qp_calculate_msn_psn_size(qp, fixed_que_attr, ureq);
rc = bnxt_re_setup_qp_hwqs(qp);
if (rc)
diff --git a/include/uapi/rdma/bnxt_re-abi.h b/include/uapi/rdma/bnxt_re-abi.h
index 40955eaba32e..db8400f2ce3b 100644
--- a/include/uapi/rdma/bnxt_re-abi.h
+++ b/include/uapi/rdma/bnxt_re-abi.h
@@ -135,6 +135,7 @@ struct bnxt_re_qp_req {
__aligned_u64 qp_handle;
__aligned_u64 comp_mask;
__u32 sq_slots;
+ __u32 sq_npsn;
};
struct bnxt_re_qp_resp {
--
2.51.2.636.ga99f379adf
^ permalink raw reply related
* [PATCH rdma-next v5 3/7] RDMA/bnxt_re: Update sq depth for app allocated QPs
From: Sriharsha Basavapatna @ 2026-05-14 16:23 UTC (permalink / raw)
To: leon, jgg
Cc: linux-rdma, andrew.gospodarek, selvin.xavier,
kalesh-anakkur.purayil, Sriharsha Basavapatna
In-Reply-To: <20260514162336.72644-1-sriharsha.basavapatna@broadcom.com>
For app allocated QPs, there's no need to reserve extra slots.
The application accounts for this while allocating the SQ.
Signed-off-by: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com>
Reviewed-by: Selvin Xavier <selvin.xavier@broadcom.com>
---
drivers/infiniband/hw/bnxt_re/ib_verbs.c | 24 +++++++++++++++---------
1 file changed, 15 insertions(+), 9 deletions(-)
diff --git a/drivers/infiniband/hw/bnxt_re/ib_verbs.c b/drivers/infiniband/hw/bnxt_re/ib_verbs.c
index c67179160654..fd1ea053d563 100644
--- a/drivers/infiniband/hw/bnxt_re/ib_verbs.c
+++ b/drivers/infiniband/hw/bnxt_re/ib_verbs.c
@@ -1541,7 +1541,8 @@ static void bnxt_re_adjust_gsi_rq_attr(struct bnxt_re_qp *qp)
static int bnxt_re_init_sq_attr(struct bnxt_re_qp *qp,
struct ib_qp_init_attr *init_attr,
struct bnxt_re_ucontext *uctx,
- struct bnxt_re_qp_req *ureq)
+ struct bnxt_re_qp_req *ureq,
+ bool fixed_que_attr)
{
struct bnxt_qplib_dev_attr *dev_attr;
struct bnxt_qplib_qp *qplqp;
@@ -1582,13 +1583,18 @@ static int bnxt_re_init_sq_attr(struct bnxt_re_qp *qp,
sq->max_sw_wqe = sq->max_wqe;
}
- sq->q_full_delta = diff + 1;
- /*
- * Reserving one slot for Phantom WQE. Application can
- * post one extra entry in this case. But allowing this to avoid
- * unexpected Queue full condition
- */
- qplqp->sq.q_full_delta -= 1;
+ if (!fixed_que_attr) {
+ sq->q_full_delta = diff + 1;
+ /*
+ * Reserving one slot for Phantom WQE. Application can
+ * post one extra entry in this case. But allowing this to avoid
+ * unexpected Queue full condition
+ */
+ qplqp->sq.q_full_delta -= 1;
+ } else {
+ sq->q_full_delta = 0;
+ }
+
qplqp->sq.sg_info.pgsize = PAGE_SIZE;
qplqp->sq.sg_info.pgshft = PAGE_SHIFT;
@@ -1737,7 +1743,7 @@ static int bnxt_re_init_qp_attr(struct bnxt_re_qp *qp, struct bnxt_re_pd *pd,
bnxt_re_adjust_gsi_rq_attr(qp);
/* Setup SQ */
- rc = bnxt_re_init_sq_attr(qp, init_attr, uctx, ureq);
+ rc = bnxt_re_init_sq_attr(qp, init_attr, uctx, ureq, fixed_que_attr);
if (rc)
return rc;
if (init_attr->qp_type == IB_QPT_GSI)
--
2.51.2.636.ga99f379adf
^ permalink raw reply related
* [PATCH rdma-next v5 2/7] RDMA/bnxt_re: Update rq depth for app allocated QPs
From: Sriharsha Basavapatna @ 2026-05-14 16:23 UTC (permalink / raw)
To: leon, jgg
Cc: linux-rdma, andrew.gospodarek, selvin.xavier,
kalesh-anakkur.purayil, Sriharsha Basavapatna
In-Reply-To: <20260514162336.72644-1-sriharsha.basavapatna@broadcom.com>
For app allocated QPs, there's no need to add extra slots or
to round up the slot count. Use 'max_recv_wr' count provided
by the application as is.
Signed-off-by: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com>
Reviewed-by: Selvin Xavier <selvin.xavier@broadcom.com>
---
drivers/infiniband/hw/bnxt_re/ib_verbs.c | 22 ++++++++++++++--------
1 file changed, 14 insertions(+), 8 deletions(-)
diff --git a/drivers/infiniband/hw/bnxt_re/ib_verbs.c b/drivers/infiniband/hw/bnxt_re/ib_verbs.c
index 561d491f12ff..c67179160654 100644
--- a/drivers/infiniband/hw/bnxt_re/ib_verbs.c
+++ b/drivers/infiniband/hw/bnxt_re/ib_verbs.c
@@ -1475,7 +1475,8 @@ static struct bnxt_re_qp *bnxt_re_create_shadow_qp
static int bnxt_re_init_rq_attr(struct bnxt_re_qp *qp,
struct ib_qp_init_attr *init_attr,
- struct bnxt_re_ucontext *uctx)
+ struct bnxt_re_ucontext *uctx,
+ bool fixed_que_attr)
{
struct bnxt_qplib_dev_attr *dev_attr;
struct bnxt_qplib_qp *qplqp;
@@ -1500,12 +1501,16 @@ static int bnxt_re_init_rq_attr(struct bnxt_re_qp *qp,
init_attr->cap.max_recv_sge = rq->max_sge;
rq->wqe_size = bnxt_re_setup_rwqe_size(qplqp, rq->max_sge,
dev_attr->max_qp_sges);
- /* Allocate 1 more than what's provided so posting max doesn't
- * mean empty.
- */
- rq->max_wqe = bnxt_re_init_depth(init_attr->cap.max_recv_wr + 1,
- dev_attr->max_qp_wqes + 1,
- uctx);
+ if (!fixed_que_attr) {
+ /* Allocate 1 more than what's provided so posting max doesn't
+ * mean empty.
+ */
+ rq->max_wqe = bnxt_re_init_depth(init_attr->cap.max_recv_wr + 1,
+ dev_attr->max_qp_wqes + 1,
+ uctx);
+ } else {
+ rq->max_wqe = init_attr->cap.max_recv_wr;
+ }
rq->max_sw_wqe = rq->max_wqe;
rq->q_full_delta = 0;
rq->sg_info.pgsize = PAGE_SIZE;
@@ -1676,6 +1681,7 @@ static int bnxt_re_init_qp_attr(struct bnxt_re_qp *qp, struct bnxt_re_pd *pd,
{
struct bnxt_qplib_dev_attr *dev_attr;
struct bnxt_qplib_qp *qplqp;
+ bool fixed_que_attr = false;
struct bnxt_re_dev *rdev;
struct bnxt_re_cq *cq;
int rc = 0, qptype;
@@ -1724,7 +1730,7 @@ static int bnxt_re_init_qp_attr(struct bnxt_re_qp *qp, struct bnxt_re_pd *pd,
}
/* Setup RQ/SRQ */
- rc = bnxt_re_init_rq_attr(qp, init_attr, uctx);
+ rc = bnxt_re_init_rq_attr(qp, init_attr, uctx, fixed_que_attr);
if (rc)
return rc;
if (init_attr->qp_type == IB_QPT_GSI)
--
2.51.2.636.ga99f379adf
^ permalink raw reply related