* [PATCH 00/11] mlx5 support for VFIO self test
@ 2026-05-01 0:08 Jason Gunthorpe
2026-05-01 0:08 ` [PATCH 01/11] net/mlx5: Add IFC structures for CQE and WQE Jason Gunthorpe
` (12 more replies)
0 siblings, 13 replies; 24+ messages in thread
From: Jason Gunthorpe @ 2026-05-01 0:08 UTC (permalink / raw)
To: Alex Williamson, David Matlack, kvm, Leon Romanovsky,
linux-kselftest, linux-rdma, Mark Bloch, netdev, Saeed Mahameed,
Shuah Khan, Tariq Toukan
Cc: patches
Add an mlx5 driver to the VFIO selftest. This is largely a remix of
the existing VFIO mlx5 driver in rdma-core. It uses an RDMA loopback
QP to issue RDMA WRITE operations, which effectively perform memory
copies using DMA. Since mlx5 has a stable programming ABI this
should work on devices from ConnectX-5 to current HW. The device FW
must support the QP loopback configuration.
Also support send_msi by arming completion events of the RDMA WRITE
to trigger MSI delivery.
mlx5 device startup is very complex and most of this code is just
booting the device, with a smaller amount for operating the QP.
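Conceptually the data path boils down to this (a sketch with
hypothetical helper names, not the driver's actual API):

  /* src_iova/dst_iova are mapped in the same IOMMU domain */
  post_rdma_write(qp, dst_iova, src_iova, len);
  ring_sq_doorbell(qp);
  poll_cq_until_completion(cq); /* completion means the copy is done */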
This entire series was coded by Claude Code in about 4 days. It
used about 4.5M output tokens, 30 individual sessions and 5600 lines
of AI-generated .md files. I spent an annoying amount of time
de-slopping and cleaning its work product to make it presentable.
However, previous VFIO drivers have taken on the order of 1-2
months to write, so getting one in a week is pretty remarkable.
For those interested, the flow I used was broadly a prompt sequence
sort of like:
- Hey Claude, go look at the falcon series, VFIO self test, the
mlx5 driver, rdma-core and some PDF documentation and make a
plan to put mlx5 under the selftest.
- Write an rdma-core application using the built-in VFIO provider
that can do the required memcpy operations that vfio selftests
wants.
(This resulted in a 1k loc C file that compiled and ran the
first time but had a few bugs related to device programming
that the AI resolved.)
- Replace the rdma-core components with open-coded versions to
create a fully stand-alone program that does the DMA memcpy.
- Review and audit the thing.
[Pause and de-slop it]
- Make it work on a PF too (this is surprisingly hard!).
[Move to a kernel tree and copy all the .md files and .c program
it made]
- Hey Claude, look at all this stuff and make a broad plan to
actually build a VFIO self test.
- Here is my 1 sentence advice on what each patch should look
like, make a detailed plan to make a patch for every one.
[Pause and polish the patch plans]
- Execute plan X then commit it [pause and de-slop each patch,
repeat].
[Review and final polish]
It is based on a tree with the falcon series applied.
Jason Gunthorpe (11):
net/mlx5: Add IFC structures for CQE and WQE
net/mlx5: Move HW constant groups from device.h/cq.h to mlx5_ifc.h
net/mlx5: Extract MLX5_SET/GET macros into mlx5_ifc_macros.h
net/mlx5: Add ONCE and MMIO accessor variants to mlx5_ifc_macros.h
selftests: Add additional kernel functions to tools/include/
selftests: Fix arm64 IO barriers to match kernel
vfio: selftests: Allow drivers to specify required region size
vfio: selftests: Add dev_dbg
vfio: selftests: Add mlx5 driver - HW init and command interface
vfio: selftests: Add mlx5 driver - data path and memcpy ops
vfio: selftests: mlx5 driver - add send_msi support
include/linux/mlx5/cq.h | 10 -
include/linux/mlx5/device.h | 231 +-
include/linux/mlx5/mlx5_ifc.h | 178 ++
include/linux/mlx5/mlx5_ifc_macros.h | 185 ++
tools/arch/arm64/include/asm/barrier.h | 18 +
tools/arch/x86/include/asm/barrier.h | 5 +
tools/include/asm-generic/io.h | 28 +
tools/include/asm/barrier.h | 8 +
tools/include/linux/stddef.h | 10 +
.../selftests/vfio/lib/drivers/mlx5/mlx5.c | 1920 +++++++++++++++++
.../selftests/vfio/lib/drivers/mlx5/mlx5_hw.h | 114 +
.../vfio/lib/drivers/mlx5/mlx5_ifc.h | 1 +
.../vfio/lib/drivers/mlx5/mlx5_ifc_fpga.h | 1 +
.../vfio/lib/drivers/mlx5/mlx5_ifc_macros.h | 1 +
.../lib/include/libvfio/vfio_pci_device.h | 6 +
.../lib/include/libvfio/vfio_pci_driver.h | 3 +
tools/testing/selftests/vfio/lib/libvfio.mk | 1 +
.../selftests/vfio/lib/vfio_pci_driver.c | 2 +
.../selftests/vfio/vfio_pci_driver_test.c | 3 +-
19 files changed, 2485 insertions(+), 240 deletions(-)
create mode 100644 include/linux/mlx5/mlx5_ifc_macros.h
create mode 100644 tools/include/linux/stddef.h
create mode 100644 tools/testing/selftests/vfio/lib/drivers/mlx5/mlx5.c
create mode 100644 tools/testing/selftests/vfio/lib/drivers/mlx5/mlx5_hw.h
create mode 120000 tools/testing/selftests/vfio/lib/drivers/mlx5/mlx5_ifc.h
create mode 120000 tools/testing/selftests/vfio/lib/drivers/mlx5/mlx5_ifc_fpga.h
create mode 120000 tools/testing/selftests/vfio/lib/drivers/mlx5/mlx5_ifc_macros.h
base-commit: 75b2e40c376951e2174f08c871389955253d3f5e
--
2.43.0
* [PATCH 01/11] net/mlx5: Add IFC structures for CQE and WQE
2026-05-01 0:08 [PATCH 00/11] mlx5 support for VFIO self test Jason Gunthorpe
@ 2026-05-01 0:08 ` Jason Gunthorpe
2026-05-01 0:08 ` [PATCH 02/11] net/mlx5: Move HW constant groups from device.h/cq.h to mlx5_ifc.h Jason Gunthorpe
` (11 subsequent siblings)
12 siblings, 0 replies; 24+ messages in thread
From: Jason Gunthorpe @ 2026-05-01 0:08 UTC (permalink / raw)
To: Alex Williamson, David Matlack, kvm, Leon Romanovsky,
linux-kselftest, linux-rdma, Mark Bloch, netdev, Saeed Mahameed,
Shuah Khan, Tariq Toukan
Cc: patches
Building the WQE and CQE using the IFC system is easier than using a
C layout struct. Add definitions for their layouts. Structs for:
- wqe_ctrl_seg_bits (16B): WQE control segment
- wqe_raddr_seg_bits (16B): RDMA remote address segment
- wqe_data_seg_bits (16B): data segment
- cqe64_bits (64B): 64-byte CQE with error syndrome union
The VFIO mlx5 selftest will use these.
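As a sketch of the intended use (pi and qpn are the caller's send
producer index and QP number; the accessors are the existing
MLX5_SET()/MLX5_ST_SZ_BYTES() family):

  u8 ctrl[MLX5_ST_SZ_BYTES(wqe_ctrl_seg)] = {};

  MLX5_SET(wqe_ctrl_seg, ctrl, opcode, MLX5_OPCODE_RDMA_WRITE);
  MLX5_SET(wqe_ctrl_seg, ctrl, wqe_index, pi & 0xffff);
  MLX5_SET(wqe_ctrl_seg, ctrl, qp_or_sq, qpn);
  MLX5_SET(wqe_ctrl_seg, ctrl, ds, 3); /* ctrl + raddr + data segments */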
Assisted-by: Claude:claude-opus-4.6
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
include/linux/mlx5/mlx5_ifc.h | 55 +++++++++++++++++++++++++++++++++++
1 file changed, 55 insertions(+)
diff --git a/include/linux/mlx5/mlx5_ifc.h b/include/linux/mlx5/mlx5_ifc.h
index 49f3ad4b1a7c54..80ae6aeaf535b0 100644
--- a/include/linux/mlx5/mlx5_ifc.h
+++ b/include/linux/mlx5/mlx5_ifc.h
@@ -5971,6 +5971,61 @@ struct mlx5_ifc_cqe_error_syndrome_bits {
u8 syndrome[0x8];
};
+struct mlx5_ifc_wqe_ctrl_seg_bits {
+ u8 opmod[0x8];
+ u8 wqe_index[0x10];
+ u8 opcode[0x8];
+
+ u8 qp_or_sq[0x18];
+ u8 reserved_at_38[0x2];
+ u8 ds[0x6];
+
+ u8 signature[0x8];
+ u8 reserved_at_48[0x10];
+ u8 fm[0x3];
+ u8 reserved_at_5b[0x1];
+ u8 ce[0x2];
+ u8 se[0x1];
+ u8 reserved_at_5f[0x1];
+
+ u8 imm[0x20];
+};
+
+struct mlx5_ifc_wqe_raddr_seg_bits {
+ u8 raddr[0x40];
+
+ u8 rkey[0x20];
+ u8 reserved_at_60[0x20];
+};
+
+struct mlx5_ifc_wqe_data_seg_bits {
+ u8 reserved_at_0[0x1];
+ u8 byte_count[0x1f];
+
+ u8 lkey[0x20];
+
+ u8 addr[0x40];
+};
+
+struct mlx5_ifc_cqe64_bits {
+ u8 reserved_at_0[0x1a0];
+
+ union {
+ u8 reserved_at_1a0[0x20];
+ struct mlx5_ifc_cqe_error_syndrome_bits error_syndrome;
+ };
+
+ u8 send_wqe_opcode[0x8];
+ u8 qpn_or_dctn_or_flow_tag[0x18];
+
+ u8 wqe_counter[0x10];
+ u8 signature[0x8];
+ u8 opcode[0x4];
+ u8 cqe_format[0x2];
+ u8 se[0x1];
+ u8 owner[0x1];
+};
+
struct mlx5_ifc_qp_context_extension_bits {
u8 reserved_at_0[0x60];
--
2.43.0
* [PATCH 02/11] net/mlx5: Move HW constant groups from device.h/cq.h to mlx5_ifc.h
2026-05-01 0:08 [PATCH 00/11] mlx5 support for VFIO self test Jason Gunthorpe
2026-05-01 0:08 ` [PATCH 01/11] net/mlx5: Add IFC structures for CQE and WQE Jason Gunthorpe
@ 2026-05-01 0:08 ` Jason Gunthorpe
2026-05-01 0:08 ` [PATCH 03/11] net/mlx5: Extract MLX5_SET/GET macros into mlx5_ifc_macros.h Jason Gunthorpe
` (10 subsequent siblings)
12 siblings, 0 replies; 24+ messages in thread
From: Jason Gunthorpe @ 2026-05-01 0:08 UTC (permalink / raw)
To: Alex Williamson, David Matlack, kvm, Leon Romanovsky,
linux-kselftest, linux-rdma, Mark Bloch, netdev, Saeed Mahameed,
Shuah Khan, Tariq Toukan
Cc: patches
Generally the IFC file should contain enums for the fields it
defines. Move the ones in device.h that are relevant to the VFIO
selftests over to the IFC file:
wqe_ctrl_seg_bits.opcode: all MLX5_OPCODE_* WQE opcodes
wqe_ctrl_seg_bits.ce: new MLX5_WQE_CE_* completion/event modes
cqe64_bits.opcode: all MLX5_CQE_* CQE opcode values
eqe_bits.event_type: enum mlx5_event (all event types)
cmd_out_bits.status: all MLX5_CMD_STAT_* status codes
query_hca_cap_in_bits.op_mod: enum mlx5_cap_mode
Tidy MLX5_PCI_CMD_XPORT, which is an alias of
MLX5_CMD_QUEUE_ENTRY_TYPE_PCIE_CMD_IF_TRANSPORT.
No functional change. All existing users of device.h and cq.h
continue to see these constants through the include chain.
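With the constants next to the structure they describe, a consumer
can write, for example (a sketch; cqe points at a 64-byte CQE):

  u8 op = MLX5_GET(cqe64, cqe, opcode);

  if (op == MLX5_CQE_REQ_ERR || op == MLX5_CQE_RESP_ERR)
          syndrome = MLX5_GET(cqe64, cqe, error_syndrome.syndrome);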
Assisted-by: Claude:claude-opus-4.6
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
include/linux/mlx5/cq.h | 10 ---
include/linux/mlx5/device.h | 114 +------------------------------
include/linux/mlx5/mlx5_ifc.h | 123 ++++++++++++++++++++++++++++++++++
3 files changed, 124 insertions(+), 123 deletions(-)
diff --git a/include/linux/mlx5/cq.h b/include/linux/mlx5/cq.h
index 9d47cdc727ad0d..a1c14479e462c2 100644
--- a/include/linux/mlx5/cq.h
+++ b/include/linux/mlx5/cq.h
@@ -81,16 +81,6 @@ enum {
enum {
MLX5_CQE_OWNER_MASK = 1,
- MLX5_CQE_REQ = 0,
- MLX5_CQE_RESP_WR_IMM = 1,
- MLX5_CQE_RESP_SEND = 2,
- MLX5_CQE_RESP_SEND_IMM = 3,
- MLX5_CQE_RESP_SEND_INV = 4,
- MLX5_CQE_RESIZE_CQ = 5,
- MLX5_CQE_SIG_ERR = 12,
- MLX5_CQE_REQ_ERR = 13,
- MLX5_CQE_RESP_ERR = 14,
- MLX5_CQE_INVALID = 15,
};
enum {
diff --git a/include/linux/mlx5/device.h b/include/linux/mlx5/device.h
index 07a25f26429213..c739a1f578dc44 100644
--- a/include/linux/mlx5/device.h
+++ b/include/linux/mlx5/device.h
@@ -172,7 +172,7 @@ enum mlx5_inline_modes {
enum {
MLX5_MAX_COMMANDS = 32,
MLX5_CMD_DATA_BLOCK_SIZE = 512,
- MLX5_PCI_CMD_XPORT = 7,
+ MLX5_PCI_CMD_XPORT = MLX5_CMD_QUEUE_ENTRY_TYPE_PCIE_CMD_IF_TRANSPORT,
MLX5_MKEY_BSF_OCTO_SIZE = 4,
MLX5_MAX_PSVS = 4,
};
@@ -308,63 +308,6 @@ enum {
MLX5_EVENT_QUEUE_TYPE_DCT = 6,
};
-/* mlx5 components can subscribe to any one of these events via
- * mlx5_eq_notifier_register API.
- */
-enum mlx5_event {
- /* Special value to subscribe to any event */
- MLX5_EVENT_TYPE_NOTIFY_ANY = 0x0,
- /* HW events enum start: comp events are not subscribable */
- MLX5_EVENT_TYPE_COMP = 0x0,
- /* HW Async events enum start: subscribable events */
- MLX5_EVENT_TYPE_PATH_MIG = 0x01,
- MLX5_EVENT_TYPE_COMM_EST = 0x02,
- MLX5_EVENT_TYPE_SQ_DRAINED = 0x03,
- MLX5_EVENT_TYPE_SRQ_LAST_WQE = 0x13,
- MLX5_EVENT_TYPE_SRQ_RQ_LIMIT = 0x14,
-
- MLX5_EVENT_TYPE_CQ_ERROR = 0x04,
- MLX5_EVENT_TYPE_WQ_CATAS_ERROR = 0x05,
- MLX5_EVENT_TYPE_PATH_MIG_FAILED = 0x07,
- MLX5_EVENT_TYPE_WQ_INVAL_REQ_ERROR = 0x10,
- MLX5_EVENT_TYPE_WQ_ACCESS_ERROR = 0x11,
- MLX5_EVENT_TYPE_SRQ_CATAS_ERROR = 0x12,
- MLX5_EVENT_TYPE_OBJECT_CHANGE = 0x27,
-
- MLX5_EVENT_TYPE_INTERNAL_ERROR = 0x08,
- MLX5_EVENT_TYPE_PORT_CHANGE = 0x09,
- MLX5_EVENT_TYPE_GPIO_EVENT = 0x15,
- MLX5_EVENT_TYPE_PORT_MODULE_EVENT = 0x16,
- MLX5_EVENT_TYPE_TEMP_WARN_EVENT = 0x17,
- MLX5_EVENT_TYPE_XRQ_ERROR = 0x18,
- MLX5_EVENT_TYPE_REMOTE_CONFIG = 0x19,
- MLX5_EVENT_TYPE_GENERAL_EVENT = 0x22,
- MLX5_EVENT_TYPE_MONITOR_COUNTER = 0x24,
- MLX5_EVENT_TYPE_PPS_EVENT = 0x25,
-
- MLX5_EVENT_TYPE_DB_BF_CONGESTION = 0x1a,
- MLX5_EVENT_TYPE_STALL_EVENT = 0x1b,
-
- MLX5_EVENT_TYPE_CMD = 0x0a,
- MLX5_EVENT_TYPE_PAGE_REQUEST = 0xb,
-
- MLX5_EVENT_TYPE_PAGE_FAULT = 0xc,
- MLX5_EVENT_TYPE_NIC_VPORT_CHANGE = 0xd,
-
- MLX5_EVENT_TYPE_ESW_FUNCTIONS_CHANGED = 0xe,
- MLX5_EVENT_TYPE_VHCA_STATE_CHANGE = 0xf,
-
- MLX5_EVENT_TYPE_DCT_DRAINED = 0x1c,
- MLX5_EVENT_TYPE_DCT_KEY_VIOLATION = 0x1d,
-
- MLX5_EVENT_TYPE_FPGA_ERROR = 0x20,
- MLX5_EVENT_TYPE_FPGA_QP_ERROR = 0x21,
-
- MLX5_EVENT_TYPE_DEVICE_TRACER = 0x26,
-
- MLX5_EVENT_TYPE_MAX = 0x100,
-};
-
enum mlx5_driver_event {
MLX5_DRIVER_EVENT_TYPE_TRAP = 0,
MLX5_DRIVER_EVENT_UPLINK_NETDEV,
@@ -420,22 +363,6 @@ enum {
};
enum {
- MLX5_OPCODE_NOP = 0x00,
- MLX5_OPCODE_SEND_INVAL = 0x01,
- MLX5_OPCODE_RDMA_WRITE = 0x08,
- MLX5_OPCODE_RDMA_WRITE_IMM = 0x09,
- MLX5_OPCODE_SEND = 0x0a,
- MLX5_OPCODE_SEND_IMM = 0x0b,
- MLX5_OPCODE_LSO = 0x0e,
- MLX5_OPCODE_RDMA_READ = 0x10,
- MLX5_OPCODE_ATOMIC_CS = 0x11,
- MLX5_OPCODE_ATOMIC_FA = 0x12,
- MLX5_OPCODE_ATOMIC_MASKED_CS = 0x14,
- MLX5_OPCODE_ATOMIC_MASKED_FA = 0x15,
- MLX5_OPCODE_BIND_MW = 0x18,
- MLX5_OPCODE_CONFIG_CMD = 0x1f,
- MLX5_OPCODE_ENHANCED_MPSW = 0x29,
-
MLX5_RECV_OPCODE_RDMA_WRITE_IMM = 0x00,
MLX5_RECV_OPCODE_SEND = 0x01,
MLX5_RECV_OPCODE_SEND_IMM = 0x02,
@@ -443,19 +370,6 @@ enum {
MLX5_CQE_OPCODE_ERROR = 0x1e,
MLX5_CQE_OPCODE_RESIZE = 0x16,
-
- MLX5_OPCODE_SET_PSV = 0x20,
- MLX5_OPCODE_GET_PSV = 0x21,
- MLX5_OPCODE_CHECK_PSV = 0x22,
- MLX5_OPCODE_DUMP = 0x23,
- MLX5_OPCODE_RGET_PSV = 0x26,
- MLX5_OPCODE_RCHECK_PSV = 0x27,
-
- MLX5_OPCODE_UMR = 0x25,
-
- MLX5_OPCODE_FLOW_TBL_ACCESS = 0x2c,
-
- MLX5_OPCODE_ACCESS_ASO = 0x2d,
};
enum {
@@ -1223,12 +1137,6 @@ enum mlx5_flex_parser_protos {
/* MLX5 DEV CAPs */
-/* TODO: EAT.ME */
-enum mlx5_cap_mode {
- HCA_CAP_OPMOD_GET_MAX = 0,
- HCA_CAP_OPMOD_GET_CUR = 1,
-};
-
/* Any new cap addition must update mlx5_hca_caps_alloc() to allocate
* capability memory.
*/
@@ -1506,26 +1414,6 @@ enum mlx5_qcam_feature_groups {
#define MLX5_CAP_PSP(mdev, cap)\
MLX5_GET(psp_cap, (mdev)->caps.hca[MLX5_CAP_PSP]->cur, cap)
-enum {
- MLX5_CMD_STAT_OK = 0x0,
- MLX5_CMD_STAT_INT_ERR = 0x1,
- MLX5_CMD_STAT_BAD_OP_ERR = 0x2,
- MLX5_CMD_STAT_BAD_PARAM_ERR = 0x3,
- MLX5_CMD_STAT_BAD_SYS_STATE_ERR = 0x4,
- MLX5_CMD_STAT_BAD_RES_ERR = 0x5,
- MLX5_CMD_STAT_RES_BUSY = 0x6,
- MLX5_CMD_STAT_NOT_READY = 0x7,
- MLX5_CMD_STAT_LIM_ERR = 0x8,
- MLX5_CMD_STAT_BAD_RES_STATE_ERR = 0x9,
- MLX5_CMD_STAT_IX_ERR = 0xa,
- MLX5_CMD_STAT_NO_RES_ERR = 0xf,
- MLX5_CMD_STAT_BAD_INP_LEN_ERR = 0x50,
- MLX5_CMD_STAT_BAD_OUTP_LEN_ERR = 0x51,
- MLX5_CMD_STAT_BAD_QP_STATE_ERR = 0x10,
- MLX5_CMD_STAT_BAD_PKT_ERR = 0x30,
- MLX5_CMD_STAT_BAD_SIZE_OUTS_CQES_ERR = 0x40,
-};
-
enum {
MLX5_IEEE_802_3_COUNTERS_GROUP = 0x0,
MLX5_RFC_2863_COUNTERS_GROUP = 0x1,
diff --git a/include/linux/mlx5/mlx5_ifc.h b/include/linux/mlx5/mlx5_ifc.h
index 80ae6aeaf535b0..9976bc80a41b33 100644
--- a/include/linux/mlx5/mlx5_ifc.h
+++ b/include/linux/mlx5/mlx5_ifc.h
@@ -5991,6 +5991,39 @@ struct mlx5_ifc_wqe_ctrl_seg_bits {
u8 imm[0x20];
};
+/* Values for wqe_ctrl_seg_bits.opcode */
+enum {
+ MLX5_OPCODE_NOP = 0x00,
+ MLX5_OPCODE_SEND_INVAL = 0x01,
+ MLX5_OPCODE_RDMA_WRITE = 0x08,
+ MLX5_OPCODE_RDMA_WRITE_IMM = 0x09,
+ MLX5_OPCODE_SEND = 0x0a,
+ MLX5_OPCODE_SEND_IMM = 0x0b,
+ MLX5_OPCODE_LSO = 0x0e,
+ MLX5_OPCODE_RDMA_READ = 0x10,
+ MLX5_OPCODE_ATOMIC_CS = 0x11,
+ MLX5_OPCODE_ATOMIC_FA = 0x12,
+ MLX5_OPCODE_ATOMIC_MASKED_CS = 0x14,
+ MLX5_OPCODE_ATOMIC_MASKED_FA = 0x15,
+ MLX5_OPCODE_BIND_MW = 0x18,
+ MLX5_OPCODE_CONFIG_CMD = 0x1f,
+ MLX5_OPCODE_SET_PSV = 0x20,
+ MLX5_OPCODE_GET_PSV = 0x21,
+ MLX5_OPCODE_CHECK_PSV = 0x22,
+ MLX5_OPCODE_DUMP = 0x23,
+ MLX5_OPCODE_UMR = 0x25,
+ MLX5_OPCODE_RGET_PSV = 0x26,
+ MLX5_OPCODE_RCHECK_PSV = 0x27,
+ MLX5_OPCODE_ENHANCED_MPSW = 0x29,
+ MLX5_OPCODE_FLOW_TBL_ACCESS = 0x2c,
+ MLX5_OPCODE_ACCESS_ASO = 0x2d,
+};
+
+/* Values for wqe_ctrl_seg_bits.ce */
+enum {
+ MLX5_WQE_CE_CQE_ALWAYS = 2,
+};
+
struct mlx5_ifc_wqe_raddr_seg_bits {
u8 raddr[0x40];
@@ -6026,6 +6059,20 @@ struct mlx5_ifc_cqe64_bits {
u8 owner[0x1];
};
+/* Values for cqe64_bits.opcode */
+enum {
+ MLX5_CQE_REQ = 0,
+ MLX5_CQE_RESP_WR_IMM = 1,
+ MLX5_CQE_RESP_SEND = 2,
+ MLX5_CQE_RESP_SEND_IMM = 3,
+ MLX5_CQE_RESP_SEND_INV = 4,
+ MLX5_CQE_RESIZE_CQ = 5,
+ MLX5_CQE_SIG_ERR = 12,
+ MLX5_CQE_REQ_ERR = 13,
+ MLX5_CQE_RESP_ERR = 14,
+ MLX5_CQE_INVALID = 15,
+};
+
struct mlx5_ifc_qp_context_extension_bits {
u8 reserved_at_0[0x60];
@@ -6522,6 +6569,12 @@ struct mlx5_ifc_query_hca_cap_in_bits {
u8 reserved_at_60[0x20];
};
+/* Values for query_hca_cap_in_bits.op_mod */
+enum mlx5_cap_mode {
+ HCA_CAP_OPMOD_GET_MAX = 0,
+ HCA_CAP_OPMOD_GET_CUR = 1,
+};
+
struct mlx5_ifc_other_hca_cap_bits {
u8 roce[0x1];
u8 reserved_at_1[0x27f];
@@ -11310,6 +11363,55 @@ struct mlx5_ifc_eqe_bits {
u8 owner[0x1];
};
+/* Values for eqe_bits.event_type */
+enum mlx5_event {
+ /* Special value to subscribe to any event */
+ MLX5_EVENT_TYPE_NOTIFY_ANY = 0x0,
+ /* HW events enum start: comp events are not subscribable */
+ MLX5_EVENT_TYPE_COMP = 0x0,
+ /* HW Async events enum start: subscribable events */
+ MLX5_EVENT_TYPE_PATH_MIG = 0x01,
+ MLX5_EVENT_TYPE_COMM_EST = 0x02,
+ MLX5_EVENT_TYPE_SQ_DRAINED = 0x03,
+ MLX5_EVENT_TYPE_SRQ_LAST_WQE = 0x13,
+ MLX5_EVENT_TYPE_SRQ_RQ_LIMIT = 0x14,
+
+ MLX5_EVENT_TYPE_CQ_ERROR = 0x04,
+ MLX5_EVENT_TYPE_WQ_CATAS_ERROR = 0x05,
+ MLX5_EVENT_TYPE_PATH_MIG_FAILED = 0x07,
+ MLX5_EVENT_TYPE_WQ_INVAL_REQ_ERROR = 0x10,
+ MLX5_EVENT_TYPE_WQ_ACCESS_ERROR = 0x11,
+ MLX5_EVENT_TYPE_SRQ_CATAS_ERROR = 0x12,
+ MLX5_EVENT_TYPE_OBJECT_CHANGE = 0x27,
+
+ MLX5_EVENT_TYPE_INTERNAL_ERROR = 0x08,
+ MLX5_EVENT_TYPE_PORT_CHANGE = 0x09,
+ MLX5_EVENT_TYPE_CMD = 0x0a,
+ MLX5_EVENT_TYPE_PAGE_REQUEST = 0x0b,
+ MLX5_EVENT_TYPE_PAGE_FAULT = 0x0c,
+ MLX5_EVENT_TYPE_NIC_VPORT_CHANGE = 0x0d,
+ MLX5_EVENT_TYPE_ESW_FUNCTIONS_CHANGED = 0x0e,
+ MLX5_EVENT_TYPE_VHCA_STATE_CHANGE = 0x0f,
+ MLX5_EVENT_TYPE_GPIO_EVENT = 0x15,
+ MLX5_EVENT_TYPE_PORT_MODULE_EVENT = 0x16,
+ MLX5_EVENT_TYPE_TEMP_WARN_EVENT = 0x17,
+ MLX5_EVENT_TYPE_XRQ_ERROR = 0x18,
+ MLX5_EVENT_TYPE_REMOTE_CONFIG = 0x19,
+ MLX5_EVENT_TYPE_DB_BF_CONGESTION = 0x1a,
+ MLX5_EVENT_TYPE_STALL_EVENT = 0x1b,
+ MLX5_EVENT_TYPE_DCT_DRAINED = 0x1c,
+ MLX5_EVENT_TYPE_DCT_KEY_VIOLATION = 0x1d,
+ MLX5_EVENT_TYPE_FPGA_ERROR = 0x20,
+ MLX5_EVENT_TYPE_FPGA_QP_ERROR = 0x21,
+ MLX5_EVENT_TYPE_GENERAL_EVENT = 0x22,
+ MLX5_EVENT_TYPE_MONITOR_COUNTER = 0x24,
+ MLX5_EVENT_TYPE_PPS_EVENT = 0x25,
+ MLX5_EVENT_TYPE_DEVICE_TRACER = 0x26,
+
+ MLX5_EVENT_TYPE_MAX = 0x100,
+};
+
+/* Values for cmd_queue_entry_bits.type */
enum {
MLX5_CMD_QUEUE_ENTRY_TYPE_PCIE_CMD_IF_TRANSPORT = 0x7,
};
@@ -11352,6 +11454,27 @@ struct mlx5_ifc_cmd_out_bits {
u8 command_output[0x20];
};
+/* Values for cmd_out_bits.status */
+enum {
+ MLX5_CMD_STAT_OK = 0x0,
+ MLX5_CMD_STAT_INT_ERR = 0x1,
+ MLX5_CMD_STAT_BAD_OP_ERR = 0x2,
+ MLX5_CMD_STAT_BAD_PARAM_ERR = 0x3,
+ MLX5_CMD_STAT_BAD_SYS_STATE_ERR = 0x4,
+ MLX5_CMD_STAT_BAD_RES_ERR = 0x5,
+ MLX5_CMD_STAT_RES_BUSY = 0x6,
+ MLX5_CMD_STAT_NOT_READY = 0x7,
+ MLX5_CMD_STAT_LIM_ERR = 0x8,
+ MLX5_CMD_STAT_BAD_RES_STATE_ERR = 0x9,
+ MLX5_CMD_STAT_IX_ERR = 0xa,
+ MLX5_CMD_STAT_NO_RES_ERR = 0xf,
+ MLX5_CMD_STAT_BAD_QP_STATE_ERR = 0x10,
+ MLX5_CMD_STAT_BAD_PKT_ERR = 0x30,
+ MLX5_CMD_STAT_BAD_SIZE_OUTS_CQES_ERR = 0x40,
+ MLX5_CMD_STAT_BAD_INP_LEN_ERR = 0x50,
+ MLX5_CMD_STAT_BAD_OUTP_LEN_ERR = 0x51,
+};
+
struct mlx5_ifc_cmd_in_bits {
u8 opcode[0x10];
u8 reserved_at_10[0x10];
--
2.43.0
* [PATCH 03/11] net/mlx5: Extract MLX5_SET/GET macros into mlx5_ifc_macros.h
2026-05-01 0:08 [PATCH 00/11] mlx5 support for VFIO self test Jason Gunthorpe
2026-05-01 0:08 ` [PATCH 01/11] net/mlx5: Add IFC structures for CQE and WQE Jason Gunthorpe
2026-05-01 0:08 ` [PATCH 02/11] net/mlx5: Move HW constant groups from device.h/cq.h to mlx5_ifc.h Jason Gunthorpe
@ 2026-05-01 0:08 ` Jason Gunthorpe
2026-05-01 0:08 ` [PATCH 04/11] net/mlx5: Add ONCE and MMIO accessor variants to mlx5_ifc_macros.h Jason Gunthorpe
` (9 subsequent siblings)
12 siblings, 0 replies; 24+ messages in thread
From: Jason Gunthorpe @ 2026-05-01 0:08 UTC (permalink / raw)
To: Alex Williamson, David Matlack, kvm, Leon Romanovsky,
linux-kselftest, linux-rdma, Mark Bloch, netdev, Saeed Mahameed,
Shuah Khan, Tariq Toukan
Cc: patches
Extract the entire MLX5_SET/GET macro family and their internal
helpers from device.h into a new lightweight header
(include/linux/mlx5/mlx5_ifc_macros.h). device.h cannot be
included by the VFIO selftest because it pulls in rdma/ib_verbs.h;
the macros themselves depend only on endian helpers, BUILD_BUG_ON,
and basic C types.
The moved macros include the internal helpers (__mlx5_nullp through
__mlx5_st_sz_bits), all size/address macros (MLX5_ST_SZ_BYTES,
MLX5_BYTE_OFF, MLX5_ADDR_OF, etc.), the 32-bit accessors (MLX5_SET,
MLX5_GET, MLX5_SET_TO_ONES, MLX5_ARRAY_SET, MLX5_GET_PR), the
64-bit accessors (MLX5_SET64, MLX5_GET64, MLX5_ARRAY_SET64,
MLX5_GET64_PR), the 16-bit accessors (MLX5_GET16, MLX5_SET16), and
the big-endian getters (MLX5_GET64_BE, MLX5_GET_BE).
device.h includes the new header so existing kernel code is
unchanged.
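A consumer that cannot pull in device.h now only needs the two IFC
headers, e.g. (a sketch, assuming u32/__be32, BUILD_BUG_ON() and the
cpu_to_be32() family are already in scope):

  #include <linux/mlx5/mlx5_ifc.h>
  #include <linux/mlx5/mlx5_ifc_macros.h>

  u32 in[MLX5_ST_SZ_DW(enable_hca_in)] = {};

  MLX5_SET(enable_hca_in, in, opcode, MLX5_CMD_OP_ENABLE_HCA);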
Assisted-by: Claude:claude-opus-4.6
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
include/linux/mlx5/device.h | 117 +----------------------
include/linux/mlx5/mlx5_ifc_macros.h | 133 +++++++++++++++++++++++++++
2 files changed, 134 insertions(+), 116 deletions(-)
create mode 100644 include/linux/mlx5/mlx5_ifc_macros.h
diff --git a/include/linux/mlx5/device.h b/include/linux/mlx5/device.h
index c739a1f578dc44..2de2640d830bd6 100644
--- a/include/linux/mlx5/device.h
+++ b/include/linux/mlx5/device.h
@@ -36,6 +36,7 @@
#include <linux/types.h>
#include <rdma/ib_verbs.h>
#include <linux/mlx5/mlx5_ifc.h>
+#include <linux/mlx5/mlx5_ifc_macros.h>
#include <linux/bitfield.h>
#if defined(__LITTLE_ENDIAN)
@@ -46,122 +47,6 @@
#error Host endianness not defined
#endif
-/* helper macros */
-#define __mlx5_nullp(typ) ((struct mlx5_ifc_##typ##_bits *)0)
-#define __mlx5_bit_sz(typ, fld) sizeof(__mlx5_nullp(typ)->fld)
-#define __mlx5_bit_off(typ, fld) (offsetof(struct mlx5_ifc_##typ##_bits, fld))
-#define __mlx5_16_off(typ, fld) (__mlx5_bit_off(typ, fld) / 16)
-#define __mlx5_dw_off(typ, fld) (__mlx5_bit_off(typ, fld) / 32)
-#define __mlx5_64_off(typ, fld) (__mlx5_bit_off(typ, fld) / 64)
-#define __mlx5_16_bit_off(typ, fld) (16 - __mlx5_bit_sz(typ, fld) - (__mlx5_bit_off(typ, fld) & 0xf))
-#define __mlx5_dw_bit_off(typ, fld) (32 - __mlx5_bit_sz(typ, fld) - (__mlx5_bit_off(typ, fld) & 0x1f))
-#define __mlx5_mask(typ, fld) ((u32)((1ull << __mlx5_bit_sz(typ, fld)) - 1))
-#define __mlx5_dw_mask(typ, fld) (__mlx5_mask(typ, fld) << __mlx5_dw_bit_off(typ, fld))
-#define __mlx5_mask16(typ, fld) ((u16)((1ull << __mlx5_bit_sz(typ, fld)) - 1))
-#define __mlx5_16_mask(typ, fld) (__mlx5_mask16(typ, fld) << __mlx5_16_bit_off(typ, fld))
-#define __mlx5_st_sz_bits(typ) sizeof(struct mlx5_ifc_##typ##_bits)
-
-#define MLX5_FLD_SZ_BYTES(typ, fld) (__mlx5_bit_sz(typ, fld) / 8)
-#define MLX5_ST_SZ_BYTES(typ) (sizeof(struct mlx5_ifc_##typ##_bits) / 8)
-#define MLX5_ST_SZ_DW(typ) (sizeof(struct mlx5_ifc_##typ##_bits) / 32)
-#define MLX5_ST_SZ_QW(typ) (sizeof(struct mlx5_ifc_##typ##_bits) / 64)
-#define MLX5_UN_SZ_BYTES(typ) (sizeof(union mlx5_ifc_##typ##_bits) / 8)
-#define MLX5_UN_SZ_DW(typ) (sizeof(union mlx5_ifc_##typ##_bits) / 32)
-#define MLX5_BYTE_OFF(typ, fld) (__mlx5_bit_off(typ, fld) / 8)
-#define MLX5_ADDR_OF(typ, p, fld) ((void *)((u8 *)(p) + MLX5_BYTE_OFF(typ, fld)))
-
-/* insert a value to a struct */
-#define MLX5_SET(typ, p, fld, v) do { \
- u32 _v = v; \
- BUILD_BUG_ON(__mlx5_st_sz_bits(typ) % 32); \
- *((__be32 *)(p) + __mlx5_dw_off(typ, fld)) = \
- cpu_to_be32((be32_to_cpu(*((__be32 *)(p) + __mlx5_dw_off(typ, fld))) & \
- (~__mlx5_dw_mask(typ, fld))) | (((_v) & __mlx5_mask(typ, fld)) \
- << __mlx5_dw_bit_off(typ, fld))); \
-} while (0)
-
-#define MLX5_ARRAY_SET(typ, p, fld, idx, v) do { \
- BUILD_BUG_ON(__mlx5_bit_off(typ, fld) % 32); \
- MLX5_SET(typ, p, fld[idx], v); \
-} while (0)
-
-#define MLX5_SET_TO_ONES(typ, p, fld) do { \
- BUILD_BUG_ON(__mlx5_st_sz_bits(typ) % 32); \
- *((__be32 *)(p) + __mlx5_dw_off(typ, fld)) = \
- cpu_to_be32((be32_to_cpu(*((__be32 *)(p) + __mlx5_dw_off(typ, fld))) & \
- (~__mlx5_dw_mask(typ, fld))) | ((__mlx5_mask(typ, fld)) \
- << __mlx5_dw_bit_off(typ, fld))); \
-} while (0)
-
-#define MLX5_GET(typ, p, fld) ((be32_to_cpu(*((__be32 *)(p) +\
-__mlx5_dw_off(typ, fld))) >> __mlx5_dw_bit_off(typ, fld)) & \
-__mlx5_mask(typ, fld))
-
-#define MLX5_GET_PR(typ, p, fld) ({ \
- u32 ___t = MLX5_GET(typ, p, fld); \
- pr_debug(#fld " = 0x%x\n", ___t); \
- ___t; \
-})
-
-#define __MLX5_SET64(typ, p, fld, v) do { \
- BUILD_BUG_ON(__mlx5_bit_sz(typ, fld) != 64); \
- *((__be64 *)(p) + __mlx5_64_off(typ, fld)) = cpu_to_be64(v); \
-} while (0)
-
-#define MLX5_SET64(typ, p, fld, v) do { \
- BUILD_BUG_ON(__mlx5_bit_off(typ, fld) % 64); \
- __MLX5_SET64(typ, p, fld, v); \
-} while (0)
-
-#define MLX5_ARRAY_SET64(typ, p, fld, idx, v) do { \
- BUILD_BUG_ON(__mlx5_bit_off(typ, fld) % 64); \
- __MLX5_SET64(typ, p, fld[idx], v); \
-} while (0)
-
-#define MLX5_GET64(typ, p, fld) be64_to_cpu(*((__be64 *)(p) + __mlx5_64_off(typ, fld)))
-
-#define MLX5_GET64_PR(typ, p, fld) ({ \
- u64 ___t = MLX5_GET64(typ, p, fld); \
- pr_debug(#fld " = 0x%llx\n", ___t); \
- ___t; \
-})
-
-#define MLX5_GET16(typ, p, fld) ((be16_to_cpu(*((__be16 *)(p) +\
-__mlx5_16_off(typ, fld))) >> __mlx5_16_bit_off(typ, fld)) & \
-__mlx5_mask16(typ, fld))
-
-#define MLX5_SET16(typ, p, fld, v) do { \
- u16 _v = v; \
- BUILD_BUG_ON(__mlx5_st_sz_bits(typ) % 16); \
- *((__be16 *)(p) + __mlx5_16_off(typ, fld)) = \
- cpu_to_be16((be16_to_cpu(*((__be16 *)(p) + __mlx5_16_off(typ, fld))) & \
- (~__mlx5_16_mask(typ, fld))) | (((_v) & __mlx5_mask16(typ, fld)) \
- << __mlx5_16_bit_off(typ, fld))); \
-} while (0)
-
-/* Big endian getters */
-#define MLX5_GET64_BE(typ, p, fld) (*((__be64 *)(p) +\
- __mlx5_64_off(typ, fld)))
-
-#define MLX5_GET_BE(type_t, typ, p, fld) ({ \
- type_t tmp; \
- switch (sizeof(tmp)) { \
- case sizeof(u8): \
- tmp = (__force type_t)MLX5_GET(typ, p, fld); \
- break; \
- case sizeof(u16): \
- tmp = (__force type_t)cpu_to_be16(MLX5_GET(typ, p, fld)); \
- break; \
- case sizeof(u32): \
- tmp = (__force type_t)cpu_to_be32(MLX5_GET(typ, p, fld)); \
- break; \
- case sizeof(u64): \
- tmp = (__force type_t)MLX5_GET64_BE(typ, p, fld); \
- break; \
- } \
- tmp; \
- })
-
enum mlx5_inline_modes {
MLX5_INLINE_MODE_NONE,
MLX5_INLINE_MODE_L2,
diff --git a/include/linux/mlx5/mlx5_ifc_macros.h b/include/linux/mlx5/mlx5_ifc_macros.h
new file mode 100644
index 00000000000000..d357acfd351de2
--- /dev/null
+++ b/include/linux/mlx5/mlx5_ifc_macros.h
@@ -0,0 +1,133 @@
+/* SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB */
+/*
+ * Copyright (c) 2013-2026, Mellanox Technologies. All rights reserved.
+ *
+ * Accessor macros for mlx5 IFC structures.
+ *
+ * Extracted from device.h so that code which cannot include device.h
+ * (e.g. selftests) can still use the MLX5_SET/GET family directly.
+ */
+
+#ifndef MLX5_IFC_MACROS_H
+#define MLX5_IFC_MACROS_H
+
+/* Internal helpers -- 32-bit */
+#define __mlx5_nullp(typ) ((struct mlx5_ifc_##typ##_bits *)0)
+#define __mlx5_bit_sz(typ, fld) sizeof(__mlx5_nullp(typ)->fld)
+#define __mlx5_bit_off(typ, fld) (offsetof(struct mlx5_ifc_##typ##_bits, fld))
+#define __mlx5_16_off(typ, fld) (__mlx5_bit_off(typ, fld) / 16)
+#define __mlx5_dw_off(typ, fld) (__mlx5_bit_off(typ, fld) / 32)
+#define __mlx5_64_off(typ, fld) (__mlx5_bit_off(typ, fld) / 64)
+#define __mlx5_16_bit_off(typ, fld) (16 - __mlx5_bit_sz(typ, fld) - (__mlx5_bit_off(typ, fld) & 0xf))
+#define __mlx5_dw_bit_off(typ, fld) (32 - __mlx5_bit_sz(typ, fld) - (__mlx5_bit_off(typ, fld) & 0x1f))
+#define __mlx5_mask(typ, fld) ((u32)((1ull << __mlx5_bit_sz(typ, fld)) - 1))
+#define __mlx5_dw_mask(typ, fld) (__mlx5_mask(typ, fld) << __mlx5_dw_bit_off(typ, fld))
+#define __mlx5_mask16(typ, fld) ((u16)((1ull << __mlx5_bit_sz(typ, fld)) - 1))
+#define __mlx5_16_mask(typ, fld) (__mlx5_mask16(typ, fld) << __mlx5_16_bit_off(typ, fld))
+#define __mlx5_st_sz_bits(typ) sizeof(struct mlx5_ifc_##typ##_bits)
+
+/* Size and address macros */
+#define MLX5_FLD_SZ_BYTES(typ, fld) (__mlx5_bit_sz(typ, fld) / 8)
+#define MLX5_ST_SZ_BYTES(typ) (sizeof(struct mlx5_ifc_##typ##_bits) / 8)
+#define MLX5_ST_SZ_DW(typ) (sizeof(struct mlx5_ifc_##typ##_bits) / 32)
+#define MLX5_ST_SZ_QW(typ) (sizeof(struct mlx5_ifc_##typ##_bits) / 64)
+#define MLX5_UN_SZ_BYTES(typ) (sizeof(union mlx5_ifc_##typ##_bits) / 8)
+#define MLX5_UN_SZ_DW(typ) (sizeof(union mlx5_ifc_##typ##_bits) / 32)
+#define MLX5_BYTE_OFF(typ, fld) (__mlx5_bit_off(typ, fld) / 8)
+#define MLX5_ADDR_OF(typ, p, fld) ((void *)((u8 *)(p) + MLX5_BYTE_OFF(typ, fld)))
+
+/* insert a value to a struct */
+#define MLX5_SET(typ, p, fld, v) do { \
+ u32 _v = v; \
+ BUILD_BUG_ON(__mlx5_st_sz_bits(typ) % 32); \
+ *((__be32 *)(p) + __mlx5_dw_off(typ, fld)) = \
+ cpu_to_be32((be32_to_cpu(*((__be32 *)(p) + __mlx5_dw_off(typ, fld))) & \
+ (~__mlx5_dw_mask(typ, fld))) | (((_v) & __mlx5_mask(typ, fld)) \
+ << __mlx5_dw_bit_off(typ, fld))); \
+} while (0)
+
+#define MLX5_ARRAY_SET(typ, p, fld, idx, v) do { \
+ BUILD_BUG_ON(__mlx5_bit_off(typ, fld) % 32); \
+ MLX5_SET(typ, p, fld[idx], v); \
+} while (0)
+
+#define MLX5_SET_TO_ONES(typ, p, fld) do { \
+ BUILD_BUG_ON(__mlx5_st_sz_bits(typ) % 32); \
+ *((__be32 *)(p) + __mlx5_dw_off(typ, fld)) = \
+ cpu_to_be32((be32_to_cpu(*((__be32 *)(p) + __mlx5_dw_off(typ, fld))) & \
+ (~__mlx5_dw_mask(typ, fld))) | ((__mlx5_mask(typ, fld)) \
+ << __mlx5_dw_bit_off(typ, fld))); \
+} while (0)
+
+#define MLX5_GET(typ, p, fld) ((be32_to_cpu(*((__be32 *)(p) +\
+__mlx5_dw_off(typ, fld))) >> __mlx5_dw_bit_off(typ, fld)) & \
+__mlx5_mask(typ, fld))
+
+#define MLX5_GET_PR(typ, p, fld) ({ \
+ u32 ___t = MLX5_GET(typ, p, fld); \
+ pr_debug(#fld " = 0x%x\n", ___t); \
+ ___t; \
+})
+
+/* 64-bit field accessors */
+#define __MLX5_SET64(typ, p, fld, v) do { \
+ BUILD_BUG_ON(__mlx5_bit_sz(typ, fld) != 64); \
+ *((__be64 *)(p) + __mlx5_64_off(typ, fld)) = cpu_to_be64(v); \
+} while (0)
+
+#define MLX5_SET64(typ, p, fld, v) do { \
+ BUILD_BUG_ON(__mlx5_bit_off(typ, fld) % 64); \
+ __MLX5_SET64(typ, p, fld, v); \
+} while (0)
+
+#define MLX5_ARRAY_SET64(typ, p, fld, idx, v) do { \
+ BUILD_BUG_ON(__mlx5_bit_off(typ, fld) % 64); \
+ __MLX5_SET64(typ, p, fld[idx], v); \
+} while (0)
+
+#define MLX5_GET64(typ, p, fld) be64_to_cpu(*((__be64 *)(p) + __mlx5_64_off(typ, fld)))
+
+#define MLX5_GET64_PR(typ, p, fld) ({ \
+ u64 ___t = MLX5_GET64(typ, p, fld); \
+ pr_debug(#fld " = 0x%llx\n", ___t); \
+ ___t; \
+})
+
+/* 16-bit field accessors */
+#define MLX5_GET16(typ, p, fld) ((be16_to_cpu(*((__be16 *)(p) +\
+__mlx5_16_off(typ, fld))) >> __mlx5_16_bit_off(typ, fld)) & \
+__mlx5_mask16(typ, fld))
+
+#define MLX5_SET16(typ, p, fld, v) do { \
+ u16 _v = v; \
+ BUILD_BUG_ON(__mlx5_st_sz_bits(typ) % 16); \
+ *((__be16 *)(p) + __mlx5_16_off(typ, fld)) = \
+ cpu_to_be16((be16_to_cpu(*((__be16 *)(p) + __mlx5_16_off(typ, fld))) & \
+ (~__mlx5_16_mask(typ, fld))) | (((_v) & __mlx5_mask16(typ, fld)) \
+ << __mlx5_16_bit_off(typ, fld))); \
+} while (0)
+
+/* Big endian getters */
+#define MLX5_GET64_BE(typ, p, fld) (*((__be64 *)(p) +\
+ __mlx5_64_off(typ, fld)))
+
+#define MLX5_GET_BE(type_t, typ, p, fld) ({ \
+ type_t tmp; \
+ switch (sizeof(tmp)) { \
+ case sizeof(u8): \
+ tmp = (__force type_t)MLX5_GET(typ, p, fld); \
+ break; \
+ case sizeof(u16): \
+ tmp = (__force type_t)cpu_to_be16(MLX5_GET(typ, p, fld)); \
+ break; \
+ case sizeof(u32): \
+ tmp = (__force type_t)cpu_to_be32(MLX5_GET(typ, p, fld)); \
+ break; \
+ case sizeof(u64): \
+ tmp = (__force type_t)MLX5_GET64_BE(typ, p, fld); \
+ break; \
+ } \
+ tmp; \
+ })
+
+#endif /* MLX5_IFC_MACROS_H */
--
2.43.0
* [PATCH 04/11] net/mlx5: Add ONCE and MMIO accessor variants to mlx5_ifc_macros.h
2026-05-01 0:08 [PATCH 00/11] mlx5 support for VFIO self test Jason Gunthorpe
` (2 preceding siblings ...)
2026-05-01 0:08 ` [PATCH 03/11] net/mlx5: Extract MLX5_SET/GET macros into mlx5_ifc_macros.h Jason Gunthorpe
@ 2026-05-01 0:08 ` Jason Gunthorpe
2026-05-01 0:08 ` [PATCH 05/11] selftests: Add additional kernel functions to tools/include/ Jason Gunthorpe
` (8 subsequent siblings)
12 siblings, 0 replies; 24+ messages in thread
From: Jason Gunthorpe @ 2026-05-01 0:08 UTC (permalink / raw)
To: Alex Williamson, David Matlack, kvm, Leon Romanovsky,
linux-kselftest, linux-rdma, Mark Bloch, netdev, Saeed Mahameed,
Shuah Khan, Tariq Toukan
Cc: patches
Add MLX5_GET_ONCE / MLX5_SET_ONCE, which wrap the access in
READ_ONCE()/WRITE_ONCE() for touching ownership bits that hardware
may read or write at any time. The kernel driver doesn't use the IFC
struct for these ownership bits, but the VFIO driver will and needs
these helpers.
Add MLX5_GET_MMIO / MLX5_SET_MMIO for accessing the init seg using
the IFC offsets. They embed an ioread32be()/iowrite32be() using the
IFC struct to generate the addressing.
Add MLX5_ARRAY_GET64(), the missing counterpart to
MLX5_ARRAY_SET64().
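The intended pattern for the ONCE variants is owner-bit polling,
along the lines of (a sketch; expected_owner is derived from the
consumer index):

  if (MLX5_GET_ONCE(cqe64, cqe, owner) != expected_owner)
          return -EAGAIN; /* CQE still owned by HW */
  dma_rmb(); /* order the owner check before reading the payload */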
Assisted-by: Claude:claude-opus-4.6
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
include/linux/mlx5/mlx5_ifc_macros.h | 52 ++++++++++++++++++++++++++++
1 file changed, 52 insertions(+)
diff --git a/include/linux/mlx5/mlx5_ifc_macros.h b/include/linux/mlx5/mlx5_ifc_macros.h
index d357acfd351de2..be963b9ad5b295 100644
--- a/include/linux/mlx5/mlx5_ifc_macros.h
+++ b/include/linux/mlx5/mlx5_ifc_macros.h
@@ -85,6 +85,13 @@ __mlx5_mask(typ, fld))
__MLX5_SET64(typ, p, fld[idx], v); \
} while (0)
+#define MLX5_ARRAY_GET64(typ, p, fld, idx) \
+ ({ \
+ BUILD_BUG_ON(__mlx5_bit_sz(typ, fld) % 64); \
+ be64_to_cpu( \
+ *((__be64 *)(p) + __mlx5_64_off(typ, fld) + (idx))); \
+ })
+
#define MLX5_GET64(typ, p, fld) be64_to_cpu(*((__be64 *)(p) + __mlx5_64_off(typ, fld)))
#define MLX5_GET64_PR(typ, p, fld) ({ \
@@ -130,4 +137,49 @@ __mlx5_mask16(typ, fld))
tmp; \
})
+/*
+ * Use READ_ONCE/WRITE_ONCE for a single field that hardware may read/write
+ * unpredictably, mostly owner bits. All other bits in the DW must be stable.
+ * Usually a dma_wmb() will be required before a write and a dma_rmb() after a
+ * read.
+ */
+#define MLX5_GET_ONCE(typ, p, fld) \
+ ((be32_to_cpu(READ_ONCE(*((__be32 *)(p) + __mlx5_dw_off(typ, fld)))) >> \
+ __mlx5_dw_bit_off(typ, fld)) & \
+ __mlx5_mask(typ, fld))
+
+#define MLX5_SET_ONCE(typ, p, fld, v) \
+ do { \
+ u32 _v = v; \
+ __be32 *_dw = (__be32 *)(p) + __mlx5_dw_off(typ, fld); \
+ BUILD_BUG_ON(__mlx5_st_sz_bits(typ) % 32); \
+ WRITE_ONCE(*_dw, \
+ cpu_to_be32((be32_to_cpu(READ_ONCE(*_dw)) & \
+ (~__mlx5_dw_mask(typ, fld))) | \
+ (((_v) & __mlx5_mask(typ, fld)) \
+ << __mlx5_dw_bit_off(typ, fld)))); \
+ } while (0)
+
+/* Access MMIO registers, usually the init segment, using IFC structs. */
+#define MLX5_GET_MMIO(typ, p, fld) \
+ ((ioread32be(((__be32 __iomem *)(p) + __mlx5_dw_off(typ, fld))) >> \
+ __mlx5_dw_bit_off(typ, fld)) & \
+ __mlx5_mask(typ, fld))
+
+/* The set is not relaxed so there is an integrated dma_wmb(). */
+#define MLX5_SET_MMIO(typ, p, fld, v) \
+ do { \
+ u32 _v = v; \
+ void __iomem *_dw = \
+ ((__be32 __iomem *)(p) + __mlx5_dw_off(typ, fld)); \
+ if (__mlx5_bit_sz(typ, fld) == 32) \
+ iowrite32be(_v, _dw); \
+ else \
+ iowrite32be((ioread32be(_dw) & \
+ (~__mlx5_dw_mask(typ, fld))) | \
+ ((_v & __mlx5_mask(typ, fld)) \
+ << __mlx5_dw_bit_off(typ, fld)), \
+ _dw); \
+ } while (0)
+
#endif /* MLX5_IFC_MACROS_H */
--
2.43.0
* [PATCH 05/11] selftests: Add additional kernel functions to tools/include/
2026-05-01 0:08 [PATCH 00/11] mlx5 support for VFIO self test Jason Gunthorpe
` (3 preceding siblings ...)
2026-05-01 0:08 ` [PATCH 04/11] net/mlx5: Add ONCE and MMIO accessor variants to mlx5_ifc_macros.h Jason Gunthorpe
@ 2026-05-01 0:08 ` Jason Gunthorpe
2026-05-04 21:48 ` David Matlack
2026-05-01 0:08 ` [PATCH 06/11] selftests: Fix arm64 IO barriers to match kernel Jason Gunthorpe
` (7 subsequent siblings)
12 siblings, 1 reply; 24+ messages in thread
From: Jason Gunthorpe @ 2026-05-01 0:08 UTC (permalink / raw)
To: Alex Williamson, David Matlack, kvm, Leon Romanovsky,
linux-kselftest, linux-rdma, Mark Bloch, netdev, Saeed Mahameed,
Shuah Khan, Tariq Toukan
Cc: patches
These are needed by the VFIO mlx5 selftest in the following patches;
the selftest includes some mlx5 headers and also needs a few more
MMIO-related features.
- DECLARE_FLEX_ARRAY in new tools/include/linux/stddef.h (wraps
existing __DECLARE_FLEX_ARRAY from uapi/linux/stddef.h)
- dma_wmb/dma_rmb barriers: x86 uses compiler barrier
(DMA-coherent), arm64 uses dmb oshst/oshld (outer-shareable for
device visibility), generic fallback uses wmb/rmb
- ioread32be/iowrite32be/iowrite64be in tools/include/asm-generic/io.h
for big-endian MMIO register access
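For example, the usual descriptor-publish-then-doorbell sequence
becomes expressible in the selftest (a sketch; uar_base and DB_OFFSET
are hypothetical):

  dma_wmb(); /* make the WQE visible before ringing the doorbell */
  iowrite32be(sq_pi, uar_base + DB_OFFSET);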
Assisted-by: Claude:claude-opus-4.6
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
tools/arch/arm64/include/asm/barrier.h | 4 ++++
tools/arch/x86/include/asm/barrier.h | 5 +++++
tools/include/asm-generic/io.h | 28 ++++++++++++++++++++++++++
tools/include/asm/barrier.h | 8 ++++++++
tools/include/linux/stddef.h | 10 +++++++++
5 files changed, 55 insertions(+)
create mode 100644 tools/include/linux/stddef.h
diff --git a/tools/arch/arm64/include/asm/barrier.h b/tools/arch/arm64/include/asm/barrier.h
index 3b9b41331c4f16..abdc64fc3c70f0 100644
--- a/tools/arch/arm64/include/asm/barrier.h
+++ b/tools/arch/arm64/include/asm/barrier.h
@@ -24,6 +24,10 @@
#define smp_wmb() asm volatile("dmb ishst" ::: "memory")
#define smp_rmb() asm volatile("dmb ishld" ::: "memory")
+/* DMA barriers use outer-shareable (osh) for device visibility */
+#define dma_rmb() asm volatile("dmb oshld" ::: "memory")
+#define dma_wmb() asm volatile("dmb oshst" ::: "memory")
+
#define smp_store_release(p, v) \
do { \
union { typeof(*p) __val; char __c[1]; } __u = \
diff --git a/tools/arch/x86/include/asm/barrier.h b/tools/arch/x86/include/asm/barrier.h
index 0adf295dd5b6aa..0b51431fa530ea 100644
--- a/tools/arch/x86/include/asm/barrier.h
+++ b/tools/arch/x86/include/asm/barrier.h
@@ -43,4 +43,9 @@ do { \
___p1; \
})
#endif /* defined(__x86_64__) */
+
+/* x86 is DMA-coherent so DMA barriers are just compiler barriers */
+#define dma_rmb() barrier()
+#define dma_wmb() barrier()
+
#endif /* _TOOLS_LINUX_ASM_X86_BARRIER_H */
diff --git a/tools/include/asm-generic/io.h b/tools/include/asm-generic/io.h
index e5a0b07ad452a6..0d89decdafb818 100644
--- a/tools/include/asm-generic/io.h
+++ b/tools/include/asm-generic/io.h
@@ -479,4 +479,32 @@ static inline void writesq(volatile void __iomem *addr, const void *buffer,
}
#endif
+/*
+ * ioread/iowrite for big-endian MMIO registers.
+ */
+
+#ifndef ioread32be
+#define ioread32be ioread32be
+static inline u32 ioread32be(const volatile void __iomem *addr)
+{
+ return bswap_32(readl(addr));
+}
+#endif
+
+#ifndef iowrite32be
+#define iowrite32be iowrite32be
+static inline void iowrite32be(u32 value, volatile void __iomem *addr)
+{
+ writel(bswap_32(value), addr);
+}
+#endif
+
+#ifndef iowrite64be
+#define iowrite64be iowrite64be
+static inline void iowrite64be(u64 value, volatile void __iomem *addr)
+{
+ writeq(bswap_64(value), addr);
+}
+#endif
+
#endif /* _TOOLS_ASM_GENERIC_IO_H */
diff --git a/tools/include/asm/barrier.h b/tools/include/asm/barrier.h
index 0c21678ac5e65f..e7e0c7de5a2ffe 100644
--- a/tools/include/asm/barrier.h
+++ b/tools/include/asm/barrier.h
@@ -47,6 +47,14 @@
# define smp_mb() mb()
#endif
+#ifndef dma_rmb
+# define dma_rmb() rmb()
+#endif
+
+#ifndef dma_wmb
+# define dma_wmb() wmb()
+#endif
+
#ifndef smp_store_release
# define smp_store_release(p, v) \
do { \
diff --git a/tools/include/linux/stddef.h b/tools/include/linux/stddef.h
new file mode 100644
index 00000000000000..99182ea4a1419b
--- /dev/null
+++ b/tools/include/linux/stddef.h
@@ -0,0 +1,10 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _TOOLS_LINUX_STDDEF_H
+#define _TOOLS_LINUX_STDDEF_H
+
+#include_next <linux/stddef.h>
+
+#define DECLARE_FLEX_ARRAY(TYPE, NAME) \
+ __DECLARE_FLEX_ARRAY(TYPE, NAME)
+
+#endif /* _TOOLS_LINUX_STDDEF_H */
--
2.43.0
* [PATCH 06/11] selftests: Fix arm64 IO barriers to match kernel
2026-05-01 0:08 [PATCH 00/11] mlx5 support for VFIO self test Jason Gunthorpe
` (4 preceding siblings ...)
2026-05-01 0:08 ` [PATCH 05/11] selftests: Add additional kernel functions to tools/include/ Jason Gunthorpe
@ 2026-05-01 0:08 ` Jason Gunthorpe
2026-05-01 0:08 ` [PATCH 07/11] vfio: selftests: Allow drivers to specify required region size Jason Gunthorpe
` (6 subsequent siblings)
12 siblings, 0 replies; 24+ messages in thread
From: Jason Gunthorpe @ 2026-05-01 0:08 UTC (permalink / raw)
To: Alex Williamson, David Matlack, kvm, Leon Romanovsky,
linux-kselftest, linux-rdma, Mark Bloch, netdev, Saeed Mahameed,
Shuah Khan, Tariq Toukan
Cc: patches
The tools/include readl/writel MMIO accessors on arm64 use
inner-shareable barriers (dmb ish) while the kernel uses
outer-shareable (dmb osh). Fix them to match.
Add __io_bw() and __io_ar() definitions matching the kernel's
arch/arm64/include/asm/io.h, including the dummy control dependency
in __io_ar() that orders MMIO reads against all subsequent
instructions.
Assisted-by: Claude:claude-opus-4.6
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
tools/arch/arm64/include/asm/barrier.h | 14 ++++++++++++++
1 file changed, 14 insertions(+)
diff --git a/tools/arch/arm64/include/asm/barrier.h b/tools/arch/arm64/include/asm/barrier.h
index abdc64fc3c70f0..3f7fcb2a27541e 100644
--- a/tools/arch/arm64/include/asm/barrier.h
+++ b/tools/arch/arm64/include/asm/barrier.h
@@ -28,6 +28,20 @@
#define dma_rmb() asm volatile("dmb oshld" ::: "memory")
#define dma_wmb() asm volatile("dmb oshst" ::: "memory")
+/* Match arch/arm64/include/asm/io.h: use osh barriers for device MMIO */
+#define __io_bw() dma_wmb()
+#define __io_ar(v) \
+({ \
+ unsigned long tmp; \
+ \
+ dma_rmb(); \
+ \
+ asm volatile("eor %0, %1, %1\n" \
+ "cbnz %0, ." \
+ : "=r" (tmp) : "r" ((unsigned long)(v)) \
+ : "memory"); \
+})
+
#define smp_store_release(p, v) \
do { \
union { typeof(*p) __val; char __c[1]; } __u = \
--
2.43.0
* [PATCH 07/11] vfio: selftests: Allow drivers to specify required region size
2026-05-01 0:08 [PATCH 00/11] mlx5 support for VFIO self test Jason Gunthorpe
` (5 preceding siblings ...)
2026-05-01 0:08 ` [PATCH 06/11] selftests: Fix arm64 IO barriers to match kernel Jason Gunthorpe
@ 2026-05-01 0:08 ` Jason Gunthorpe
2026-05-02 8:33 ` Manuel Ebner
2026-05-04 20:55 ` David Matlack
2026-05-01 0:08 ` [PATCH 08/11] vfio: selftests: Add dev_dbg Jason Gunthorpe
` (5 subsequent siblings)
12 siblings, 2 replies; 24+ messages in thread
From: Jason Gunthorpe @ 2026-05-01 0:08 UTC (permalink / raw)
To: Alex Williamson, David Matlack, kvm, Leon Romanovsky,
linux-kselftest, linux-rdma, Mark Bloch, netdev, Saeed Mahameed,
Shuah Khan, Tariq Toukan
Cc: patches
Add a region_size field to struct vfio_pci_driver_ops so drivers can
declare how much DMA-mapped region they need. The mlx5 driver will
need ~18MB for firmware pages. Existing drivers leave region_size as
0 and get the current default of SZ_2M.
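A driver opts in from its ops table, for example (a sketch; the
names and size are hypothetical):

  static const struct vfio_pci_driver_ops mydrv_ops = {
          .name = "mydrv",
          .region_size = SZ_32M, /* 0 keeps the SZ_2M default */
          .probe = mydrv_probe,
  };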
Assisted-by: Claude:claude-opus-4.6
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
.../selftests/vfio/lib/include/libvfio/vfio_pci_driver.h | 3 +++
tools/testing/selftests/vfio/vfio_pci_driver_test.c | 3 ++-
2 files changed, 5 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/vfio/lib/include/libvfio/vfio_pci_driver.h b/tools/testing/selftests/vfio/lib/include/libvfio/vfio_pci_driver.h
index e5ada209b1d102..fa172635632453 100644
--- a/tools/testing/selftests/vfio/lib/include/libvfio/vfio_pci_driver.h
+++ b/tools/testing/selftests/vfio/lib/include/libvfio/vfio_pci_driver.h
@@ -9,6 +9,9 @@ struct vfio_pci_device;
struct vfio_pci_driver_ops {
const char *name;
+ /* Minimum driver region size, 0 = default SZ_2M */
+ u64 region_size;
+
/**
* @probe() - Check if the driver supports the given device.
*
diff --git a/tools/testing/selftests/vfio/vfio_pci_driver_test.c b/tools/testing/selftests/vfio/vfio_pci_driver_test.c
index 761bf117d624f8..97eea34825f88b 100644
--- a/tools/testing/selftests/vfio/vfio_pci_driver_test.c
+++ b/tools/testing/selftests/vfio/vfio_pci_driver_test.c
@@ -87,7 +87,8 @@ FIXTURE_SETUP(vfio_pci_driver_test)
driver = &self->device->driver;
region_setup(self->iommu, self->iova_allocator, &self->memcpy_region, SZ_1G);
- region_setup(self->iommu, self->iova_allocator, &driver->region, SZ_2M);
+ region_setup(self->iommu, self->iova_allocator, &driver->region,
+ driver->ops->region_size ?: SZ_2M);
/* Any IOVA that doesn't overlap memcpy_region and driver->region. */
self->unmapped_iova = iova_allocator_alloc(self->iova_allocator, SZ_1G);
--
2.43.0
* [PATCH 08/11] vfio: selftests: Add dev_dbg
2026-05-01 0:08 [PATCH 00/11] mlx5 support for VFIO self test Jason Gunthorpe
` (6 preceding siblings ...)
2026-05-01 0:08 ` [PATCH 07/11] vfio: selftests: Allow drivers to specify required region size Jason Gunthorpe
@ 2026-05-01 0:08 ` Jason Gunthorpe
2026-05-04 21:15 ` David Matlack
2026-05-01 0:08 ` [PATCH 09/11] vfio: selftests: Add mlx5 driver - HW init and command interface Jason Gunthorpe
` (4 subsequent siblings)
12 siblings, 1 reply; 24+ messages in thread
From: Jason Gunthorpe @ 2026-05-01 0:08 UTC (permalink / raw)
To: Alex Williamson, David Matlack, kvm, Leon Romanovsky,
linux-kselftest, linux-rdma, Mark Bloch, netdev, Saeed Mahameed,
Shuah Khan, Tariq Toukan
Cc: patches
Add a dev_dbg() macro, enabled with a #define DEBUG at the top of
the file. This allows leaving behind debugging prints that are
useful in case future changes are required.
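Usage mirrors dev_info(), for example (a sketch; the format
arguments are illustrative):

  #define DEBUG /* must precede the libvfio.h include */
  #include <libvfio.h>

  dev_dbg(device, "EQ %u consumed event, ci %u\n", eqn, ci);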
Assisted-by: Claude:claude-opus-4.6
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
.../selftests/vfio/lib/include/libvfio/vfio_pci_device.h | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/tools/testing/selftests/vfio/lib/include/libvfio/vfio_pci_device.h b/tools/testing/selftests/vfio/lib/include/libvfio/vfio_pci_device.h
index bb4525abd01a22..2d587b988c09fa 100644
--- a/tools/testing/selftests/vfio/lib/include/libvfio/vfio_pci_device.h
+++ b/tools/testing/selftests/vfio/lib/include/libvfio/vfio_pci_device.h
@@ -38,6 +38,12 @@ struct vfio_pci_device {
#define dev_info(_dev, _fmt, ...) printf("%s: " _fmt, (_dev)->bdf, ##__VA_ARGS__)
#define dev_err(_dev, _fmt, ...) fprintf(stderr, "%s: " _fmt, (_dev)->bdf, ##__VA_ARGS__)
+#ifdef DEBUG
+#define dev_dbg dev_info
+#else
+#define dev_dbg(_dev, _fmt, ...) do { } while (0)
+#endif
+
struct vfio_pci_device *vfio_pci_device_init(const char *bdf, struct iommu *iommu);
void vfio_pci_device_cleanup(struct vfio_pci_device *device);
--
2.43.0
* [PATCH 09/11] vfio: selftests: Add mlx5 driver - HW init and command interface
2026-05-01 0:08 [PATCH 00/11] mlx5 support for VFIO self test Jason Gunthorpe
` (7 preceding siblings ...)
2026-05-01 0:08 ` [PATCH 08/11] vfio: selftests: Add dev_dbg Jason Gunthorpe
@ 2026-05-01 0:08 ` Jason Gunthorpe
2026-05-02 9:35 ` Manuel Ebner
2026-05-04 22:35 ` David Matlack
2026-05-01 0:08 ` [PATCH 10/11] vfio: selftests: Add mlx5 driver - data path and memcpy ops Jason Gunthorpe
` (3 subsequent siblings)
12 siblings, 2 replies; 24+ messages in thread
From: Jason Gunthorpe @ 2026-05-01 0:08 UTC (permalink / raw)
To: Alex Williamson, David Matlack, kvm, Leon Romanovsky,
linux-kselftest, linux-rdma, Mark Bloch, netdev, Saeed Mahameed,
Shuah Khan, Tariq Toukan
Cc: patches
Add an mlx5 ConnectX selftest driver that programs VFs and PFs
through the command interface. Create the driver skeleton with
probe(), the command interface, HCA boot sequence, and all resource
allocation up through ALLOC_PD + CREATE_MKEY.
The driver implements vfio_pci_driver_ops (probe/init/remove) and
registers with the VFIO selftest framework. It matches the same
mlx5 ConnectX device IDs that the kernel matches.
init() brings the HCA to a running state:
- Command interface setup (two-slot: regular + async pages)
- ENABLE_HCA -> SET_ISSI -> QUERY/MANAGE_PAGES -> SET_HCA_CAP ->
INIT_HCA
- EQ creation (CMD + PAGE_REQUEST events for PF support)
- ALLOC_PD, CREATE_MKEY (PA-mode, covers all IOVAs)
- Force-loopback capability check
remove() tears down in reverse order through TEARDOWN_HCA and
DISABLE_HCA, including async pages slot drain and FW page reclaim.
The driver region (~18MB) holds all HW-visible DMA buffers (command
queue, mailboxes, EQ, FW pages) overlaid as a single struct on
device->driver.region.vaddr.
Data path ops (memcpy_start/memcpy_wait) are left as stubs for the
next patch.
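Because the state struct is overlaid directly on the DMA region, the
DMA address of any HW-visible buffer falls out of the overlay (a
sketch mirroring the driver's to_mlx5st() helper):

  struct mlx5st_device *dev = device->driver.region.vaddr;
  u64 cq_iova = to_iova(device, &dev->cq_buf[0]);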
Assisted-by: Claude:claude-opus-4.6
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
.../selftests/vfio/lib/drivers/mlx5/mlx5.c | 1406 +++++++++++++++++
.../selftests/vfio/lib/drivers/mlx5/mlx5_hw.h | 108 ++
.../vfio/lib/drivers/mlx5/mlx5_ifc.h | 1 +
.../vfio/lib/drivers/mlx5/mlx5_ifc_fpga.h | 1 +
.../vfio/lib/drivers/mlx5/mlx5_ifc_macros.h | 1 +
tools/testing/selftests/vfio/lib/libvfio.mk | 1 +
.../selftests/vfio/lib/vfio_pci_driver.c | 2 +
7 files changed, 1520 insertions(+)
create mode 100644 tools/testing/selftests/vfio/lib/drivers/mlx5/mlx5.c
create mode 100644 tools/testing/selftests/vfio/lib/drivers/mlx5/mlx5_hw.h
create mode 120000 tools/testing/selftests/vfio/lib/drivers/mlx5/mlx5_ifc.h
create mode 120000 tools/testing/selftests/vfio/lib/drivers/mlx5/mlx5_ifc_fpga.h
create mode 120000 tools/testing/selftests/vfio/lib/drivers/mlx5/mlx5_ifc_macros.h
diff --git a/tools/testing/selftests/vfio/lib/drivers/mlx5/mlx5.c b/tools/testing/selftests/vfio/lib/drivers/mlx5/mlx5.c
new file mode 100644
index 00000000000000..0ab941bad7a66c
--- /dev/null
+++ b/tools/testing/selftests/vfio/lib/drivers/mlx5/mlx5.c
@@ -0,0 +1,1406 @@
+// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
+/*
+ * mlx5 VFIO selftest driver
+ *
+ * Programs mlx5 ConnectX VFs and PFs through the bare-metal command interface
+ * and RDMA Write self-loopback to perform DMA. Implements vfio_pci_driver_ops
+ * (probe/init/remove) and plugs into the VFIO selftest framework.
+ */
+#include <stdint.h>
+#include <stdbool.h>
+#include <string.h>
+#include <time.h>
+#include <sched.h>
+#include <unistd.h>
+#include <stdlib.h>
+
+#include <linux/errno.h>
+#include <linux/io.h>
+#include <linux/log2.h>
+#include <linux/pci_regs.h>
+
+#include <libvfio.h>
+
+#include "mlx5_hw.h"
+
+/*
+ * Driver state — overlaid on device->driver.region.vaddr.
+ *
+ * Contains both software-only state and HW-visible DMA buffers. HW buffers need
+ * strict IOVA alignment.
+ */
+struct mlx5st_device {
+ /* Back pointer */
+ struct vfio_pci_device *device;
+
+ /* BAR0 */
+ struct mlx5st_initial_seg __iomem *init_seg;
+ void __iomem *bar0;
+
+ /* Command interface */
+ struct mlx5st_cmd_queue_entry *cmd_lay;
+ struct mlx5st_cmd_queue_entry *pages_cmd_lay;
+ u8 cmd_log_stride;
+ unsigned int pages_slot;
+ u8 cmd_token;
+ bool cmd_sig_enabled;
+
+ /* PD */
+ u32 pdn;
+
+ /* Global PA-mode MKEY */
+ u32 global_lkey;
+ u32 global_rkey;
+ u32 mkey_index;
+
+ /* CQ */
+ u32 cqn;
+ u32 cq_ci;
+
+ /* UAR */
+ u32 uar_page;
+ void __iomem *uar_base;
+ unsigned int uar_bf_offset;
+
+ /* EQ */
+ u32 eqn;
+ u32 eq_cons_index;
+ bool have_eq;
+
+ /* Async pages slot state */
+ bool pages_slot_in_use;
+ bool pages_slot_is_reclaim;
+ unsigned int pages_reclaim_npages;
+ unsigned int pages_pending_give;
+ unsigned int pages_pending_reclaim;
+ u16 pages_pending_func_id;
+ bool pages_func_id_seen;
+
+ /* QP */
+ u32 qpn;
+ u32 sq_pi;
+ u32 sq_ci;
+
+ /* FW pages bitmap */
+ u64 fw_pages_bitmap[MAX_FW_PAGES / 64];
+ u32 fw_pages_given;
+ u16 fw_func_id;
+
+ /* Capabilities */
+ bool fl_supported;
+
+ /*
+ * HW-visible DMA buffers below — device reads/writes via DMA.
+ */
+ struct mlx5st_cmd_queue_entry cmd_queue
+ [MLX5_HW_PAGE_SIZE / sizeof(struct mlx5st_cmd_queue_entry)]
+ __aligned(MLX5_HW_PAGE_SIZE);
+ struct mlx5st_send_wqe sq_buf[SQ_WQE_CNT];
+ struct mlx5st_dbrec cq_dbrec;
+ struct mlx5st_dbrec qp_dbrec;
+ struct mlx5st_cqe64 cq_buf[CQ_CQE_CNT];
+
+ /* Slot 0 mailboxes (regular commands) */
+ struct mlx5st_mbox_entry cmd_in_mbox[CMD_MBOX_NENT];
+ struct mlx5st_mbox_entry cmd_out_mbox[CMD_MBOX_NENT];
+
+ /* Pages slot mailboxes (async MANAGE_PAGES) */
+ struct mlx5st_mbox_entry pages_in_mbox[CMD_MBOX_NENT];
+ struct mlx5st_mbox_entry pages_out_mbox[CMD_MBOX_NENT];
+
+ /* EQ does not support page_offset */
+ struct mlx5st_eqe eq_buf[EQ_NENT] __aligned(MLX5_HW_PAGE_SIZE);
+
+ u8 fw_pages[MAX_FW_PAGES][MLX5_HW_PAGE_SIZE]
+ __aligned(MLX5_HW_PAGE_SIZE);
+};
+
+/* Check against HW limits on IOVA alignment */
+static_assert(offsetof(struct mlx5st_device, cmd_in_mbox) %
+ CMD_MBOX_STRIDE == 0,
+ "cmd_in_mbox must be stride-aligned");
+static_assert(offsetof(struct mlx5st_device, pages_in_mbox) %
+ CMD_MBOX_STRIDE == 0,
+ "pages_in_mbox must be stride-aligned");
+static_assert(offsetof(struct mlx5st_device, cq_buf) % 64 == 0,
+ "cq_buf must be 64-byte aligned");
+static_assert(offsetof(struct mlx5st_device, sq_buf) % 64 == 0,
+ "sq_buf must be 64-byte aligned");
+static_assert(offsetof(struct mlx5st_device, cq_dbrec) % 64 == 0,
+ "cq_dbrec must be 64-byte aligned");
+static_assert(offsetof(struct mlx5st_device, qp_dbrec) % 64 == 0,
+ "qp_dbrec must be 64-byte aligned");
+static_assert(offsetof(struct mlx5st_device, eq_buf) %
+ MLX5_HW_PAGE_SIZE == 0,
+ "eq_buf must be page-aligned");
+static_assert(offsetof(struct mlx5st_device, fw_pages) %
+ MLX5_HW_PAGE_SIZE == 0,
+ "fw_pages must be page-aligned");
+
+static struct mlx5st_device *to_mlx5st(struct vfio_pci_device *device)
+{
+ return device->driver.region.vaddr;
+}
+
+/*
+ * Fill a PAS (Physical Address Segment) for a buffer in the driver region.
+ * Sets pas[0] to the page-aligned IOVA and returns the page_offset (the
+ * buffer's byte offset within that page, in units of 64 bytes).
+ */
+static unsigned int mlx5st_fill_pas(struct vfio_pci_device *device, void *buf,
+ __be64 *pas)
+{
+ u64 iova = to_iova(device, buf);
+
+ pas[0] = cpu_to_be64(iova & ~(u64)(MLX5_HW_PAGE_SIZE - 1));
+ return (iova & (MLX5_HW_PAGE_SIZE - 1)) / 64;
+}
+
+/*
+ * Probe — match mlx5 devices by PCI vendor/device ID.
+ */
+
+#define PCI_VENDOR_ID_MELLANOX 0x15b3
+static int mlx5st_probe(struct vfio_pci_device *device)
+{
+ static const u16 mlx5st_pci_ids[] = {
+ 0x1011, /* Connect-IB */
+ 0x1012, /* Connect-IB VF */
+ 0x1013, /* ConnectX-4 */
+ 0x1014, /* ConnectX-4 VF */
+ 0x1015, /* ConnectX-4LX */
+ 0x1016, /* ConnectX-4LX VF */
+ 0x1017, /* ConnectX-5 */
+ 0x1018, /* ConnectX-5 VF */
+ 0x1019, /* ConnectX-5 Ex */
+ 0x101a, /* ConnectX-5 Ex VF */
+ 0x101b, /* ConnectX-6 */
+ 0x101c, /* ConnectX-6 VF */
+ 0x101d, /* ConnectX-6 Dx */
+ 0x101e, /* ConnectX-6 Dx VF */
+ 0x101f, /* ConnectX-6 LX */
+ 0x1021, /* ConnectX-7 */
+ 0x1023, /* ConnectX-8 */
+ 0x1025, /* ConnectX-9 */
+ 0x1027, /* ConnectX-10 */
+ 0x2101, /* ConnectX-10 NVLink-C2C */
+ 0xa2d2, /* BlueField integrated ConnectX-5 */
+ 0xa2d3, /* BlueField integrated ConnectX-5 VF */
+ 0xa2d6, /* BlueField-2 integrated ConnectX-6 Dx */
+ 0xa2dc, /* BlueField-3 integrated ConnectX-7 */
+ 0xa2df, /* BlueField-4 integrated ConnectX-8 */
+ };
+ unsigned int i;
+ u16 did;
+
+ if (vfio_pci_config_readw(device, PCI_VENDOR_ID) !=
+ PCI_VENDOR_ID_MELLANOX)
+ return -ENODEV;
+
+ did = vfio_pci_config_readw(device, PCI_DEVICE_ID);
+ for (i = 0; i < ARRAY_SIZE(mlx5st_pci_ids); i++) {
+ if (mlx5st_pci_ids[i] == did)
+ return 0;
+ }
+
+ return -ENODEV;
+}
+
+/*
+ * Command interface
+ */
+
+static u8 xor8_buf(const void *buf, size_t offset, size_t len)
+{
+ const u8 *p = buf;
+ u8 sum = 0;
+ size_t i;
+
+ for (i = offset; i < offset + len; i++)
+ sum ^= p[i];
+ return sum;
+}
+
+#define CMD_IF_BOX_CTRL_OFF MLX5_BYTE_OFF(cmd_if_box, reserved_at_1000)
+#define CMD_IF_BOX_CTRL_SIG_OFF MLX5_BYTE_OFF(cmd_if_box, ctrl_signature)
+#define CMD_IF_BOX_SIG_OFF MLX5_BYTE_OFF(cmd_if_box, signature)
+
+static void mlx5st_cmd_calc_block_sig(struct mlx5st_cmd_if_box *blk)
+{
+ MLX5_SET(cmd_if_box, blk, ctrl_signature,
+ ~xor8_buf(blk, CMD_IF_BOX_CTRL_OFF,
+ CMD_IF_BOX_CTRL_SIG_OFF - CMD_IF_BOX_CTRL_OFF));
+ MLX5_SET(cmd_if_box, blk, signature,
+ ~xor8_buf(blk, 0, CMD_IF_BOX_SIG_OFF));
+}
+
+static int mlx5st_cmd_verify_block_sig(struct mlx5st_cmd_if_box *blk)
+{
+ if (xor8_buf(blk, CMD_IF_BOX_CTRL_OFF,
+ CMD_IF_BOX_SIG_OFF - CMD_IF_BOX_CTRL_OFF) != 0xff)
+ return -1;
+ if (xor8_buf(blk, 0, sizeof(struct mlx5st_cmd_if_box)) != 0xff)
+ return -1;
+ return 0;
+}
+
+static unsigned int mlx5st_cmd_setup_mbox_chain(struct vfio_pci_device *device,
+ struct mlx5st_mbox_entry *mbox,
+ unsigned int nblocks, u8 token)
+{
+ unsigned int i;
+
+ for (i = 0; i < nblocks; i++) {
+ struct mlx5st_cmd_if_box *blk = &mbox[i].block;
+ u64 next_iova;
+
+ memset(blk, 0, sizeof(struct mlx5st_cmd_if_box));
+ MLX5_SET(cmd_if_box, blk, block_number, i);
+ MLX5_SET(cmd_if_box, blk, token, token);
+ if (i < nblocks - 1) {
+ next_iova = to_iova(device, &mbox[i + 1]);
+ MLX5_SET(cmd_if_box, blk, next_pointer_63_32,
+ next_iova >> 32);
+ MLX5_SET(cmd_if_box, blk, next_pointer_31_10,
+ (u32)next_iova >> 10);
+ }
+ }
+ return nblocks;
+}
+
+static void mlx5st_cmd_copy_to_mbox(struct mlx5st_mbox_entry *mbox,
+ const void *data, unsigned int len)
+{
+ const u8 *src = data;
+ unsigned int i = 0;
+
+ while (len > 0) {
+ unsigned int chunk = len < MLX5_CMD_DATA_BLOCK_SIZE ?
+ len :
+ MLX5_CMD_DATA_BLOCK_SIZE;
+
+ memcpy(MLX5_ADDR_OF(cmd_if_box, &mbox[i].block, mailbox_data),
+ src, chunk);
+ src += chunk;
+ len -= chunk;
+ i++;
+ }
+}
+
+static void mlx5st_cmd_copy_from_mbox(void *data,
+ const struct mlx5st_mbox_entry *mbox,
+ unsigned int len)
+{
+ unsigned int i = 0;
+ u8 *dst = data;
+
+ while (len > 0) {
+ unsigned int chunk = len < MLX5_CMD_DATA_BLOCK_SIZE ?
+ len :
+ MLX5_CMD_DATA_BLOCK_SIZE;
+
+ memcpy(dst,
+ MLX5_ADDR_OF(cmd_if_box, &mbox[i].block, mailbox_data),
+ chunk);
+ dst += chunk;
+ len -= chunk;
+ i++;
+ }
+}
+
+/* Forward declaration — cmd_exec polls events during command wait */
+static void mlx5st_process_events(struct mlx5st_device *dev);
+
+static const char *mlx5st_cmd_name(u16 opcode)
+{
+ switch (opcode) {
+ case MLX5_CMD_OP_QUERY_HCA_CAP: return "QUERY_HCA_CAP";
+ case MLX5_CMD_OP_INIT_HCA: return "INIT_HCA";
+ case MLX5_CMD_OP_TEARDOWN_HCA: return "TEARDOWN_HCA";
+ case MLX5_CMD_OP_ENABLE_HCA: return "ENABLE_HCA";
+ case MLX5_CMD_OP_DISABLE_HCA: return "DISABLE_HCA";
+ case MLX5_CMD_OP_QUERY_PAGES: return "QUERY_PAGES";
+ case MLX5_CMD_OP_MANAGE_PAGES: return "MANAGE_PAGES";
+ case MLX5_CMD_OP_SET_HCA_CAP: return "SET_HCA_CAP";
+ case MLX5_CMD_OP_SET_ISSI: return "SET_ISSI";
+ case MLX5_CMD_OP_CREATE_MKEY: return "CREATE_MKEY";
+ case MLX5_CMD_OP_DESTROY_MKEY: return "DESTROY_MKEY";
+ case MLX5_CMD_OP_CREATE_EQ: return "CREATE_EQ";
+ case MLX5_CMD_OP_DESTROY_EQ: return "DESTROY_EQ";
+ case MLX5_CMD_OP_CREATE_CQ: return "CREATE_CQ";
+ case MLX5_CMD_OP_DESTROY_CQ: return "DESTROY_CQ";
+ case MLX5_CMD_OP_CREATE_QP: return "CREATE_QP";
+ case MLX5_CMD_OP_DESTROY_QP: return "DESTROY_QP";
+ case MLX5_CMD_OP_RST2INIT_QP: return "RST2INIT_QP";
+ case MLX5_CMD_OP_INIT2RTR_QP: return "INIT2RTR_QP";
+ case MLX5_CMD_OP_RTR2RTS_QP: return "RTR2RTS_QP";
+ case MLX5_CMD_OP_ALLOC_PD: return "ALLOC_PD";
+ case MLX5_CMD_OP_DEALLOC_PD: return "DEALLOC_PD";
+ case MLX5_CMD_OP_ALLOC_UAR: return "ALLOC_UAR";
+ case MLX5_CMD_OP_DEALLOC_UAR: return "DEALLOC_UAR";
+ default: return "UNKNOWN";
+ }
+}
+
+/*
+ * Post a command on a given slot: fill the cmd_queue_entry, set up mailbox
+ * chains, compute signatures, hand ownership to FW, and ring the doorbell.
+ * @doorbell is a bitmask of command slots to ring (bit N rings slot N).
+ */
+static void mlx5st_cmd_post(struct mlx5st_device *dev,
+ struct mlx5st_cmd_queue_entry *cmd,
+ struct mlx5st_mbox_entry *in_mbox,
+ struct mlx5st_mbox_entry *out_mbox,
+ void *in, unsigned int ilen, unsigned int olen,
+ u32 doorbell)
+{
+ struct vfio_pci_device *device = dev->device;
+ unsigned int in_remain, out_remain, in_nblk, out_nblk;
+ unsigned int i;
+ void *cin, *cout;
+ u8 token;
+
+ /* Rotating non-zero token ties cmd entry to its mailbox blocks */
+ token = ++dev->cmd_token;
+ if (!token)
+ token = ++dev->cmd_token;
+
+ in_remain = ilen > MLX5_CMD_INLINE_SZ ? ilen - MLX5_CMD_INLINE_SZ : 0;
+ out_remain = olen > MLX5_CMD_INLINE_SZ ? olen - MLX5_CMD_INLINE_SZ : 0;
+ in_nblk = (in_remain + MLX5_CMD_DATA_BLOCK_SIZE - 1) /
+ MLX5_CMD_DATA_BLOCK_SIZE;
+ out_nblk = (out_remain + MLX5_CMD_DATA_BLOCK_SIZE - 1) /
+ MLX5_CMD_DATA_BLOCK_SIZE;
+
+ /* Set up mailbox chains */
+ if (in_nblk > 0) {
+ mlx5st_cmd_setup_mbox_chain(device, in_mbox, in_nblk, token);
+ mlx5st_cmd_copy_to_mbox(in_mbox,
+ (u8 *)in + MLX5_CMD_INLINE_SZ,
+ in_remain);
+ }
+ if (out_nblk > 0)
+ mlx5st_cmd_setup_mbox_chain(device, out_mbox, out_nblk, token);
+
+ /* Copy inline input */
+ cin = MLX5_ADDR_OF(cmd_queue_entry, cmd, command_input_inline_data);
+ memset(cin, 0, MLX5_CMD_INLINE_SZ);
+ memcpy(cin, in, ilen < MLX5_CMD_INLINE_SZ ? ilen : MLX5_CMD_INLINE_SZ);
+ MLX5_SET(cmd_queue_entry, cmd, input_length, ilen);
+ MLX5_SET(cmd_queue_entry, cmd, token, token);
+
+ /* Zero inline output */
+ cout = MLX5_ADDR_OF(cmd_queue_entry, cmd, command_output_inline_data);
+ memset(cout, 0, MLX5_CMD_INLINE_SZ);
+ MLX5_SET(cmd_queue_entry, cmd, output_length, olen);
+
+ /*
+ * Compute signatures: mailbox blocks first, then cmd_queue_entry last.
+ * The sig must cover the final state including ownership=0x1, but
+ * we must not set ownership until after the sig is in place —
+ * XOR in the 0x1 without storing it to memory.
+ */
+ for (i = 0; i < in_nblk; i++)
+ mlx5st_cmd_calc_block_sig(&in_mbox[i].block);
+ for (i = 0; i < out_nblk; i++)
+ mlx5st_cmd_calc_block_sig(&out_mbox[i].block);
+ MLX5_SET(cmd_queue_entry, cmd, signature, 0);
+ MLX5_SET(cmd_queue_entry, cmd, signature,
+ ~(xor8_buf(cmd, 0, sizeof(struct mlx5st_cmd_queue_entry)) ^
+ 0x1));
+
+ /* Ensure all cmd data (including sig) is visible, then hand to FW */
+ dma_wmb();
+ MLX5_SET_ONCE(cmd_queue_entry, cmd, ownership, 1);
+
+ /* Ring doorbell */
+ MLX5_SET_MMIO(initial_seg, dev->init_seg, command_doorbell_vector,
+ doorbell);
+}
+
+static void mlx5st_cmd_exec(struct mlx5st_device *dev, void *in,
+ unsigned int ilen, void *out, unsigned int olen)
+{
+ struct mlx5st_cmd_queue_entry *cmd = dev->cmd_lay;
+ unsigned int out_remain, out_nblk;
+ struct timespec start, now;
+ unsigned int elapsed;
+ unsigned int i;
+ void *cout;
+
+ mlx5st_cmd_post(dev, cmd, dev->cmd_in_mbox, dev->cmd_out_mbox, in,
+ ilen, olen, 1);
+
+ out_remain = olen > MLX5_CMD_INLINE_SZ ? olen - MLX5_CMD_INLINE_SZ : 0;
+ out_nblk = (out_remain + MLX5_CMD_DATA_BLOCK_SIZE - 1) /
+ MLX5_CMD_DATA_BLOCK_SIZE;
+
+ /* Poll for completion — also process EQ events for PF page requests */
+ clock_gettime(CLOCK_MONOTONIC, &start);
+ for (;;) {
+ if (!MLX5_GET_ONCE(cmd_queue_entry, cmd, ownership))
+ break;
+ if (dev->have_eq)
+ mlx5st_process_events(dev);
+ sched_yield();
+ clock_gettime(CLOCK_MONOTONIC, &now);
+ elapsed = (now.tv_sec - start.tv_sec) * 1000 +
+ (now.tv_nsec - start.tv_nsec) / 1000000;
+ if (elapsed > MLX5_CMD_TIMEOUT_MS)
+ VFIO_FAIL("cmd timeout after %u ms", elapsed);
+ }
+ /* Ensure output data reads happen after ownership is seen clear */
+ dma_rmb();
+
+ /* Verify output signatures when FW has checksums enabled */
+ if (dev->cmd_sig_enabled) {
+ if (xor8_buf(cmd, 0,
+ sizeof(struct mlx5st_cmd_queue_entry)) != 0xff)
+ VFIO_FAIL("cmd output signature mismatch");
+ for (i = 0; i < out_nblk; i++) {
+ if (mlx5st_cmd_verify_block_sig(
+ &dev->cmd_out_mbox[i].block))
+ VFIO_FAIL("cmd output mailbox block %d signature mismatch",
+ i);
+ }
+ }
+
+ /* Copy output: inline first */
+ cout = MLX5_ADDR_OF(cmd_queue_entry, cmd, command_output_inline_data);
+ memcpy(out, cout, olen < MLX5_CMD_INLINE_SZ ? olen : MLX5_CMD_INLINE_SZ);
+
+ /* Copy remaining from output mailbox chain */
+ if (out_remain > 0)
+ mlx5st_cmd_copy_from_mbox((u8 *)out + MLX5_CMD_INLINE_SZ,
+ dev->cmd_out_mbox, out_remain);
+
+ /*
+ * Check command status. All command outputs share the same
+ * status/syndrome header, so the enable_hca_out layout is used
+ * generically for any command.
+ */
+ if (MLX5_GET(enable_hca_out, out, status) != MLX5_CMD_STAT_OK)
+ VFIO_FAIL("%s: status=0x%x syndrome=0x%x",
+ mlx5st_cmd_name(MLX5_GET(enable_hca_in, in, opcode)),
+ MLX5_GET(enable_hca_out, out, status),
+ MLX5_GET(enable_hca_out, out, syndrome));
+}
+
+static struct mlx5st_cmd_queue_entry *
+mlx5st_cmd_slot_init(struct mlx5st_device *dev, unsigned int slot,
+ struct mlx5st_mbox_entry *in_mbox,
+ struct mlx5st_mbox_entry *out_mbox)
+{
+ struct vfio_pci_device *device = dev->device;
+ struct mlx5st_cmd_queue_entry *cmd =
+ &dev->cmd_queue[(slot << dev->cmd_log_stride) /
+ sizeof(struct mlx5st_cmd_queue_entry)];
+ u64 iova;
+
+ MLX5_SET(cmd_queue_entry, cmd, type,
+ MLX5_CMD_QUEUE_ENTRY_TYPE_PCIE_CMD_IF_TRANSPORT);
+ iova = to_iova(device, in_mbox);
+ MLX5_SET(cmd_queue_entry, cmd, input_mailbox_pointer_63_32,
+ iova >> 32);
+ MLX5_SET(cmd_queue_entry, cmd, input_mailbox_pointer_31_9, iova >> 9);
+ iova = to_iova(device, out_mbox);
+ MLX5_SET(cmd_queue_entry, cmd, output_mailbox_pointer_63_32,
+ iova >> 32);
+ MLX5_SET(cmd_queue_entry, cmd, output_mailbox_pointer_31_9,
+ iova >> 9);
+ return cmd;
+}
+
+static void mlx5st_cmd_init(struct mlx5st_device *dev)
+{
+ struct mlx5st_initial_seg __iomem *seg = dev->init_seg;
+ struct vfio_pci_device *device = dev->device;
+ u16 cmdif_rev;
+ u8 log_sz;
+ u64 iova;
+
+ cmdif_rev = MLX5_GET_MMIO(initial_seg, seg, cmd_interface_rev);
+ VFIO_ASSERT_EQ(cmdif_rev, 5);
+
+ /* Read command queue geometry from BAR */
+ log_sz = MLX5_GET_MMIO(initial_seg, seg, log_cmdq_size);
+ dev->cmd_log_stride = MLX5_GET_MMIO(initial_seg, seg, log_cmdq_stride);
+ dev->pages_slot = (1 << log_sz) - 1;
+
+ VFIO_ASSERT_LE((unsigned int)(1 << log_sz), 32u);
+ VFIO_ASSERT_GE((unsigned int)(1 << dev->cmd_log_stride),
+ (unsigned int)sizeof(struct mlx5st_cmd_queue_entry));
+ VFIO_ASSERT_LE((unsigned int)((dev->pages_slot + 1) <<
+ dev->cmd_log_stride),
+ (unsigned int)sizeof(dev->cmd_queue));
+
+ /* Set up slot 0 — regular commands */
+ dev->cmd_lay = mlx5st_cmd_slot_init(dev, 0, dev->cmd_in_mbox,
+ dev->cmd_out_mbox);
+
+ /* Set up pages slot — async MANAGE_PAGES */
+ dev->pages_cmd_lay = mlx5st_cmd_slot_init(dev, dev->pages_slot,
+ dev->pages_in_mbox,
+ dev->pages_out_mbox);
+
+ /* Write command queue page address to BAR0 */
+ iova = to_iova(device, dev->cmd_queue);
+ MLX5_SET_MMIO(initial_seg, seg, cmdq_phy_addr_63_32, iova >> 32);
+ MLX5_SET_MMIO(initial_seg, seg, cmdq_phy_addr_31_12, iova >> 12);
+
+ dev_dbg(device,
+ "Command interface initialized (cmdif_rev=5, log_sz=%u, log_stride=%u, pages_slot=%u)\n",
+ log_sz, dev->cmd_log_stride, dev->pages_slot);
+}
+
+/*
+ * FW pages: bitmap allocator + MANAGE_PAGES
+ */
+
+static void mlx5st_fw_pages_alloc(struct mlx5st_device *dev,
+ unsigned int npages, u64 *iovas)
+{
+ struct vfio_pci_device *device = dev->device;
+ unsigned int found = 0;
+ unsigned int w, b;
+ u64 word;
+
+ for (w = 0; w < MAX_FW_PAGES / 64 && found < npages; w++) {
+ word = dev->fw_pages_bitmap[w];
+
+ for (b = 0; b < 64 && found < npages; b++) {
+ if (!(word & (1ULL << b))) {
+ unsigned int idx = w * 64 + b;
+
+ dev->fw_pages_bitmap[w] |= (1ULL << b);
+ iovas[found++] = to_iova(device,
+ dev->fw_pages[idx]);
+ }
+ }
+ }
+ VFIO_ASSERT_EQ(found, npages);
+ dev->fw_pages_given += npages;
+}
+
+static void mlx5st_fw_pages_free(struct mlx5st_device *dev,
+ unsigned int npages, const u64 *iovas)
+{
+ struct vfio_pci_device *device = dev->device;
+ unsigned int i, idx;
+ u64 off;
+
+ for (i = 0; i < npages; i++) {
+ off = iovas[i] - to_iova(device, dev->fw_pages);
+ idx = off / MLX5_HW_PAGE_SIZE;
+
+ VFIO_ASSERT_TRUE(idx < MAX_FW_PAGES);
+ dev->fw_pages_bitmap[idx / 64] &= ~(1ULL << (idx % 64));
+ }
+ dev->fw_pages_given -= npages;
+}
+
+static void *mlx5st_build_manage_pages_give(u16 func_id, unsigned int npages,
+ const u64 *iovas,
+ unsigned int *out_inlen)
+{
+ unsigned int inlen = MLX5_ST_SZ_BYTES(manage_pages_in) + npages * 8;
+ unsigned int i;
+ void *in;
+
+ in = calloc(1, inlen);
+ VFIO_ASSERT_NOT_NULL(in);
+
+ MLX5_SET(manage_pages_in, in, opcode, MLX5_CMD_OP_MANAGE_PAGES);
+ MLX5_SET(manage_pages_in, in, op_mod,
+ MLX5_MANAGE_PAGES_IN_OP_MOD_ALLOCATION_SUCCESS);
+ MLX5_SET(manage_pages_in, in, function_id, func_id);
+ MLX5_SET(manage_pages_in, in, input_num_entries, npages);
+
+ for (i = 0; i < npages; i++)
+ MLX5_ARRAY_SET64(manage_pages_in, in, pas, i, iovas[i]);
+
+ *out_inlen = inlen;
+ return in;
+}
+
+static void mlx5st_fw_pages_give_one(struct mlx5st_device *dev, u16 func_id,
+ unsigned int npages, u64 *iovas)
+{
+ u32 out[MLX5_ST_SZ_DW(manage_pages_out)] = {};
+ unsigned int inlen;
+ void *in;
+
+ in = mlx5st_build_manage_pages_give(func_id, npages, iovas, &inlen);
+ mlx5st_cmd_exec(dev, in, inlen, out, sizeof(out));
+ free(in);
+}
+
+static void mlx5st_fw_pages_give(struct mlx5st_device *dev, u16 func_id,
+ unsigned int npages)
+{
+ unsigned int remaining = npages;
+ u64 *iovas;
+
+ if (!npages)
+ return;
+
+ iovas = calloc(npages, sizeof(u64));
+ VFIO_ASSERT_NOT_NULL(iovas);
+
+ mlx5st_fw_pages_alloc(dev, npages, iovas);
+
+ /* Batch into chunks that fit in one mailbox */
+ for (unsigned int off = 0; remaining > 0;) {
+ unsigned int batch = remaining < MAX_FW_PAGES_PER_CMD ?
+ remaining :
+ MAX_FW_PAGES_PER_CMD;
+
+ mlx5st_fw_pages_give_one(dev, func_id, batch, iovas + off);
+ off += batch;
+ remaining -= batch;
+ }
+
+ dev_dbg(dev->device, "MANAGE_PAGES GIVE: %d pages to func_id=%u\n",
+ npages, func_id);
+ free(iovas);
+}
+
+static void mlx5st_fw_pages_satisfy(struct mlx5st_device *dev, int boot)
+{
+ u32 qo[MLX5_ST_SZ_DW(query_pages_out)] = {};
+ u32 qi[MLX5_ST_SZ_DW(query_pages_in)] = {};
+ u16 func_id;
+ int npages;
+
+ MLX5_SET(query_pages_in, qi, opcode, MLX5_CMD_OP_QUERY_PAGES);
+ MLX5_SET(query_pages_in, qi, op_mod, boot ? 0x01 : 0x02);
+ mlx5st_cmd_exec(dev, qi, sizeof(qi), qo, sizeof(qo));
+
+ npages = MLX5_GET(query_pages_out, qo, num_pages);
+ func_id = MLX5_GET(query_pages_out, qo, function_id);
+ dev_dbg(dev->device, "QUERY_PAGES (%s): %d pages (func_id=%u)\n",
+ boot ? "boot" : "init", npages, func_id);
+
+ if (npages > 0) {
+ dev->fw_func_id = func_id;
+ mlx5st_fw_pages_give(dev, func_id, npages);
+ }
+}
+
+/*
+ * Async MANAGE_PAGES on the pages command slot.
+ *
+ * On PFs, firmware sends PAGE_REQUEST events via the EQ during command
+ * execution. We must respond with MANAGE_PAGES on a second command slot
+ * before the first (regular) command can complete.
+ */
+
+static void mlx5st_pages_slot_post(struct mlx5st_device *dev, void *in,
+ unsigned int ilen, unsigned int olen)
+{
+ mlx5st_cmd_post(dev, dev->pages_cmd_lay, dev->pages_in_mbox,
+ dev->pages_out_mbox, in, ilen, olen,
+ 1 << dev->pages_slot);
+}
+
+static void mlx5st_pages_slot_give(struct mlx5st_device *dev, u16 func_id,
+ unsigned int npages)
+{
+ unsigned int inlen;
+ u64 *iovas;
+ void *in;
+
+ iovas = calloc(npages, sizeof(u64));
+ VFIO_ASSERT_NOT_NULL(iovas);
+
+ mlx5st_fw_pages_alloc(dev, npages, iovas);
+
+ in = mlx5st_build_manage_pages_give(func_id, npages, iovas, &inlen);
+ free(iovas);
+
+ mlx5st_pages_slot_post(dev, in, inlen,
+ MLX5_ST_SZ_BYTES(manage_pages_out));
+ dev->pages_slot_in_use = true;
+ dev->pages_slot_is_reclaim = false;
+ free(in);
+
+ dev_dbg(dev->device,
+ "PAGE_REQUEST: %d pages given async to func_id=%u\n",
+ npages, func_id);
+}
+
+static void mlx5st_pages_slot_reclaim(struct mlx5st_device *dev, u16 func_id,
+ unsigned int npages)
+{
+ unsigned int inlen = MLX5_ST_SZ_BYTES(manage_pages_in);
+ unsigned int outlen =
+ MLX5_ST_SZ_BYTES(manage_pages_out) + npages * 8;
+ void *in;
+
+ in = calloc(1, inlen);
+ VFIO_ASSERT_NOT_NULL(in);
+
+ MLX5_SET(manage_pages_in, in, opcode, MLX5_CMD_OP_MANAGE_PAGES);
+ MLX5_SET(manage_pages_in, in, op_mod,
+ MLX5_MANAGE_PAGES_IN_OP_MOD_HCA_RETURN_PAGES);
+ MLX5_SET(manage_pages_in, in, function_id, func_id);
+ MLX5_SET(manage_pages_in, in, input_num_entries, npages);
+
+ mlx5st_pages_slot_post(dev, in, inlen, outlen);
+ dev->pages_slot_in_use = true;
+ dev->pages_slot_is_reclaim = true;
+ dev->pages_reclaim_npages = npages;
+ free(in);
+
+ dev_dbg(dev->device,
+ "PAGE_REQUEST: reclaim %d pages async from func_id=%u\n",
+ npages, func_id);
+}
+
+static void mlx5st_pages_slot_kick(struct mlx5st_device *dev)
+{
+ unsigned int batch;
+
+ if (dev->pages_slot_in_use)
+ return;
+
+ if (dev->pages_pending_give) {
+ batch = dev->pages_pending_give < MAX_FW_PAGES_PER_CMD ?
+ dev->pages_pending_give :
+ MAX_FW_PAGES_PER_CMD;
+ dev->pages_pending_give -= batch;
+ mlx5st_pages_slot_give(dev, dev->pages_pending_func_id, batch);
+ } else if (dev->pages_pending_reclaim) {
+ batch = dev->pages_pending_reclaim < MAX_FW_PAGES_PER_CMD ?
+ dev->pages_pending_reclaim :
+ MAX_FW_PAGES_PER_CMD;
+ dev->pages_pending_reclaim -= batch;
+ mlx5st_pages_slot_reclaim(dev, dev->pages_pending_func_id,
+ batch);
+ }
+}
+
+static void mlx5st_fw_pages_give_async(struct mlx5st_device *dev,
+ u16 func_id, unsigned int npages)
+{
+ if (!npages)
+ return;
+
+ dev->pages_pending_give += npages;
+ dev->pages_pending_func_id = func_id;
+ mlx5st_pages_slot_kick(dev);
+}
+
+static void mlx5st_fw_pages_reclaim_async(struct mlx5st_device *dev,
+ u16 func_id, unsigned int npages)
+{
+ dev->pages_pending_reclaim += npages;
+ dev->pages_pending_func_id = func_id;
+ mlx5st_pages_slot_kick(dev);
+}
+
+static void mlx5st_pages_slot_complete(struct mlx5st_device *dev)
+{
+ struct mlx5st_cmd_queue_entry *cmd = dev->pages_cmd_lay;
+ void *cout;
+
+ dma_rmb();
+
+ cout = MLX5_ADDR_OF(cmd_queue_entry, cmd, command_output_inline_data);
+ if (MLX5_GET(enable_hca_out, cout, status) != MLX5_CMD_STAT_OK)
+ VFIO_FAIL("async MANAGE_PAGES failed: status=0x%x syndrome=0x%x",
+ MLX5_GET(enable_hca_out, cout, status),
+ MLX5_GET(enable_hca_out, cout, syndrome));
+
+ if (dev->pages_slot_is_reclaim) {
+ unsigned int outlen = MLX5_ST_SZ_BYTES(manage_pages_out) +
+ dev->pages_reclaim_npages * 8;
+ unsigned int num_claimed;
+ unsigned int i;
+ void *out;
+ u64 *iovas;
+
+ out = calloc(1, outlen);
+ iovas = calloc(dev->pages_reclaim_npages, sizeof(u64));
+ VFIO_ASSERT_NOT_NULL(out);
+ VFIO_ASSERT_NOT_NULL(iovas);
+
+ /* Copy inline output */
+ memcpy(out, cout, MLX5_CMD_INLINE_SZ);
+ if (outlen > MLX5_CMD_INLINE_SZ)
+ mlx5st_cmd_copy_from_mbox(
+ (u8 *)out + MLX5_CMD_INLINE_SZ,
+ dev->pages_out_mbox,
+ outlen - MLX5_CMD_INLINE_SZ);
+
+ num_claimed =
+ MLX5_GET(manage_pages_out, out, output_num_entries);
+ for (i = 0; i < num_claimed; i++)
+ iovas[i] = MLX5_ARRAY_GET64(manage_pages_out, out, pas,
+ i);
+
+ mlx5st_fw_pages_free(dev, num_claimed, iovas);
+ dev_dbg(dev->device, "PAGE_REQUEST: reclaimed %d pages\n",
+ num_claimed);
+
+ free(iovas);
+ free(out);
+ }
+
+ dev->pages_slot_in_use = false;
+ mlx5st_pages_slot_kick(dev);
+}
+
+/*
+ * UAR alloc/dealloc
+ */
+
+static void mlx5st_alloc_uar(struct mlx5st_device *dev)
+{
+ u32 out[MLX5_ST_SZ_DW(alloc_uar_out)] = {};
+ u32 in[MLX5_ST_SZ_DW(alloc_uar_in)] = {};
+
+ MLX5_SET(alloc_uar_in, in, opcode, MLX5_CMD_OP_ALLOC_UAR);
+ mlx5st_cmd_exec(dev, in, sizeof(in), out, sizeof(out));
+
+ dev->uar_page = MLX5_GET(alloc_uar_out, out, uar);
+ dev->uar_base = (u8 __iomem *)dev->bar0 +
+ dev->uar_page * MLX5_HW_PAGE_SIZE;
+ dev->uar_bf_offset = MLX5_BF_OFFSET;
+
+ dev_dbg(dev->device,
+ "Allocated UAR page_id=%u, doorbell offset=0x%x\n",
+ dev->uar_page,
+ dev->uar_page * MLX5_HW_PAGE_SIZE + MLX5_BF_OFFSET);
+}
+
+static void mlx5st_dealloc_uar(struct mlx5st_device *dev)
+{
+ u32 out[MLX5_ST_SZ_DW(dealloc_uar_out)] = {};
+ u32 in[MLX5_ST_SZ_DW(dealloc_uar_in)] = {};
+
+ MLX5_SET(dealloc_uar_in, in, opcode, MLX5_CMD_OP_DEALLOC_UAR);
+ MLX5_SET(dealloc_uar_in, in, uar, dev->uar_page);
+ mlx5st_cmd_exec(dev, in, sizeof(in), out, sizeof(out));
+}
+
+/*
+ * EQ infrastructure
+ */
+
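+/*
+ * EQE ownership: the owner bit HW writes flips on every pass over the
+ * ring, so with EQ_NENT entries the expected owner for consumer index
+ * ci is !!(ci & EQ_NENT): 0 on the first pass, 1 on the second, and
+ * so on. An EQE whose owner bit does not match has not been written
+ * by HW yet.
+ */
+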
+static struct mlx5st_eqe *mlx5st_eq_get_eqe(struct mlx5st_device *dev, u32 cc)
+{
+ u32 ci = dev->eq_cons_index + cc;
+ struct mlx5st_eqe *eqe = &dev->eq_buf[ci % EQ_NENT];
+ u8 owner = MLX5_GET_ONCE(eqe, eqe, owner);
+ u8 expected = !!(ci & EQ_NENT);
+
+ if (owner != expected)
+ return NULL;
+ dma_rmb();
+ return eqe;
+}
+
+static void mlx5st_eq_update_ci(struct mlx5st_device *dev, u32 cc, bool arm)
+{
+ u32 val;
+
+ dev->eq_cons_index += cc;
+ val = (dev->eq_cons_index & 0xffffff) | (dev->eqn << 24);
+ iowrite32be(val, (u8 __iomem *)dev->uar_base + MLX5_EQ_DOORBELL_OFFSET +
+ (arm ? 0 : 8));
+}
+
+static void mlx5st_create_eq(struct mlx5st_device *dev)
+{
+ struct vfio_pci_device *device = dev->device;
+ u64 in[MLX5_ST_SZ_QW(create_eq_in) + 1] = {};
+ u32 out[MLX5_ST_SZ_DW(create_eq_out)] = {};
+ struct mlx5_ifc_eqc_bits *eqc;
+ unsigned int i;
+ __be64 *pas;
+
+ /* Initialize EQE owner bits */
+ for (i = 0; i < EQ_NENT; i++) {
+ struct mlx5st_eqe *eqe = &dev->eq_buf[i];
+
+ MLX5_SET_ONCE(eqe, eqe, owner, 1);
+ }
+
+ MLX5_SET(create_eq_in, in, opcode, MLX5_CMD_OP_CREATE_EQ);
+
+ /* Subscribe to CMD completions and PAGE_REQUEST events */
+ MLX5_ARRAY_SET64(create_eq_in, in, event_bitmask, 0,
+ (1ULL << MLX5_EVENT_TYPE_CMD) |
+ (1ULL << MLX5_EVENT_TYPE_PAGE_REQUEST));
+
+ eqc = MLX5_ADDR_OF(create_eq_in, in, eq_context_entry);
+ MLX5_SET(eqc, eqc, log_eq_size, LOG_EQ_SIZE);
+ MLX5_SET(eqc, eqc, uar_page, dev->uar_page);
+ pas = MLX5_ADDR_OF(create_eq_in, in, pas);
+ VFIO_ASSERT_EQ(mlx5st_fill_pas(device, dev->eq_buf, pas), 0u);
+ MLX5_SET(eqc, eqc, log_page_size, 0);
+
+ mlx5st_cmd_exec(dev, in, sizeof(in), out, sizeof(out));
+
+ dev->eqn = MLX5_GET(create_eq_out, out, eq_number);
+ dev->eq_cons_index = 0;
+ mlx5st_eq_update_ci(dev, 0, 0);
+ dev->have_eq = true;
+
+ dev_dbg(device, "Created EQ: eqn=%u, %d entries (CMD+PAGE_REQUEST)\n",
+ dev->eqn, EQ_NENT);
+}
+
+static void mlx5st_destroy_eq(struct mlx5st_device *dev)
+{
+ u32 out[MLX5_ST_SZ_DW(destroy_eq_out)] = {};
+ u32 in[MLX5_ST_SZ_DW(destroy_eq_in)] = {};
+
+ MLX5_SET(destroy_eq_in, in, opcode, MLX5_CMD_OP_DESTROY_EQ);
+ MLX5_SET(destroy_eq_in, in, eq_number, dev->eqn);
+ mlx5st_cmd_exec(dev, in, sizeof(in), out, sizeof(out));
+}
+
+/*
+ * Drain all pending EQ events. Dispatches PAGE_REQUEST to the async pages
+ * slot and CMD completions to the pages slot completion handler.
+ */
+static void mlx5st_process_events(struct mlx5st_device *dev)
+{
+ struct mlx5st_eqe *eqe;
+ u32 cc = 0;
+
+ while ((eqe = mlx5st_eq_get_eqe(dev, cc))) {
+ u8 type = MLX5_GET(eqe, eqe, event_type);
+
+ switch (type) {
+ case MLX5_EVENT_TYPE_PAGE_REQUEST: {
+ void *evdata = MLX5_ADDR_OF(eqe, eqe, event_data);
+ u16 func_id = MLX5_GET(pages_req_event, evdata,
+ function_id);
+ s32 npages = (s32)MLX5_GET(pages_req_event, evdata,
+ num_pages);
+
+ /*
+ * The selftest doesn't use more than one func_id so a
+ * simple counter approach is possible.
+ */
+ if (dev->pages_func_id_seen)
+ VFIO_ASSERT_EQ(func_id,
+ dev->pages_pending_func_id);
+ dev->pages_func_id_seen = true;
+
+ if (npages > 0)
+ mlx5st_fw_pages_give_async(dev, func_id,
+ npages);
+ else if (npages < 0)
+ mlx5st_fw_pages_reclaim_async(dev, func_id,
+ -npages);
+ break;
+ }
+ case MLX5_EVENT_TYPE_CMD: {
+ void *evdata = MLX5_ADDR_OF(eqe, eqe, event_data);
+ u32 vector = MLX5_GET(cmd_inter_comp_event, evdata,
+ command_completion_vector);
+
+ if (vector & (1U << dev->pages_slot))
+ mlx5st_pages_slot_complete(dev);
+ break;
+ }
+ default:
+ break;
+ }
+ cc++;
+ }
+
+ if (cc)
+ mlx5st_eq_update_ci(dev, cc, 0);
+}
+
+/*
+ * HCA init / teardown
+ */
+
+#define FW_INIT_TIMEOUT_MS 120000
+#define FW_INIT_WAIT_MS 200
+
+static void mlx5st_wait_fw_init(struct mlx5st_device *dev)
+{
+ struct timespec start, now;
+ unsigned int elapsed;
+
+ clock_gettime(CLOCK_MONOTONIC, &start);
+ while (MLX5_GET_MMIO(initial_seg, dev->init_seg, initializing)) {
+ usleep(FW_INIT_WAIT_MS * 1000);
+ clock_gettime(CLOCK_MONOTONIC, &now);
+ elapsed = (now.tv_sec - start.tv_sec) * 1000 +
+ (now.tv_nsec - start.tv_nsec) / 1000000;
+ if (elapsed > FW_INIT_TIMEOUT_MS)
+ VFIO_FAIL("FW init timeout after %u ms", elapsed);
+ }
+}
+
+static void mlx5st_set_issi(struct mlx5st_device *dev)
+{
+ u32 out[MLX5_ST_SZ_DW(set_issi_out)] = {};
+ u32 in[MLX5_ST_SZ_DW(set_issi_in)] = {};
+
+ MLX5_SET(set_issi_in, in, opcode, MLX5_CMD_OP_SET_ISSI);
+ MLX5_SET(set_issi_in, in, current_issi, 1);
+ mlx5st_cmd_exec(dev, in, sizeof(in), out, sizeof(out));
+ dev_dbg(dev->device, "SET_ISSI: OK (issi=1)\n");
+}
+
+static void mlx5st_set_hca_caps(struct mlx5st_device *dev)
+{
+ u32 qout[MLX5_ST_SZ_DW(query_hca_cap_out)] = {};
+ u32 qin[MLX5_ST_SZ_DW(query_hca_cap_in)] = {};
+ u32 sout[MLX5_ST_SZ_DW(set_hca_cap_out)] = {};
+ u32 sin[MLX5_ST_SZ_DW(set_hca_cap_in)] = {};
+ struct mlx5_ifc_cmd_hca_cap_bits *set_hca_cap;
+ u32 max_checksum;
+
+ /* Query max caps to learn cmdif_checksum support */
+ MLX5_SET(query_hca_cap_in, qin, opcode, MLX5_CMD_OP_QUERY_HCA_CAP);
+ MLX5_SET(query_hca_cap_in, qin, op_mod,
+ MLX5_SET_HCA_CAP_OP_MOD_GENERAL_DEVICE);
+ mlx5st_cmd_exec(dev, qin, sizeof(qin), qout, sizeof(qout));
+
+ max_checksum = MLX5_GET(
+ cmd_hca_cap,
+ MLX5_ADDR_OF(query_hca_cap_out, qout, capability),
+ cmdif_checksum);
+
+ /* Query current caps as base for SET */
+ memset(qout, 0, sizeof(qout));
+ MLX5_SET(query_hca_cap_in, qin, op_mod,
+ MLX5_SET_HCA_CAP_OP_MOD_GENERAL_DEVICE |
+ HCA_CAP_OPMOD_GET_CUR);
+ mlx5st_cmd_exec(dev, qin, sizeof(qin), qout, sizeof(qout));
+
+ set_hca_cap = MLX5_ADDR_OF(set_hca_cap_in, sin, capability);
+ memcpy(set_hca_cap,
+ MLX5_ADDR_OF(query_hca_cap_out, qout, capability),
+ MLX5_ST_SZ_BYTES(cmd_hca_cap));
+
+ MLX5_SET(cmd_hca_cap, set_hca_cap, cmdif_checksum, max_checksum);
+ MLX5_SET(cmd_hca_cap, set_hca_cap, log_uar_page_sz, 0);
+
+ MLX5_SET(set_hca_cap_in, sin, opcode, MLX5_CMD_OP_SET_HCA_CAP);
+ MLX5_SET(set_hca_cap_in, sin, op_mod,
+ MLX5_SET_HCA_CAP_OP_MOD_GENERAL_DEVICE);
+
+ mlx5st_cmd_exec(dev, sin, sizeof(sin), sout, sizeof(sout));
+
+ dev->cmd_sig_enabled = max_checksum == 0x3;
+ dev_dbg(dev->device, "SET_HCA_CAP: OK (cmdif_checksum=%u)\n",
+ max_checksum);
+}
+
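+/*
+ * Boot order, matching the sequence below: wait for FW readiness,
+ * ENABLE_HCA, SET_ISSI, give boot pages, SET_HCA_CAP, give init
+ * pages, INIT_HCA, then UAR + EQ so PAGE_REQUEST events are serviced
+ * during all later commands.
+ */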
+static void mlx5st_hca_init(struct mlx5st_device *dev)
+{
+ u32 out[MLX5_ST_SZ_DW(enable_hca_out)] = {};
+ u32 in[MLX5_ST_SZ_DW(enable_hca_in)] = {};
+
+ mlx5st_wait_fw_init(dev);
+ dev_dbg(dev->device, "Firmware ready\n");
+
+ MLX5_SET(enable_hca_in, in, opcode, MLX5_CMD_OP_ENABLE_HCA);
+ mlx5st_cmd_exec(dev, in, sizeof(in), out, sizeof(out));
+ dev_dbg(dev->device, "ENABLE_HCA: OK\n");
+
+ mlx5st_set_issi(dev);
+ mlx5st_fw_pages_satisfy(dev, 1);
+
+ mlx5st_set_hca_caps(dev);
+ mlx5st_fw_pages_satisfy(dev, 0);
+
+ memset(in, 0, sizeof(in));
+ memset(out, 0, sizeof(out));
+ MLX5_SET(init_hca_in, in, opcode, MLX5_CMD_OP_INIT_HCA);
+ mlx5st_cmd_exec(dev, in, sizeof(in), out, sizeof(out));
+ dev_dbg(dev->device, "INIT_HCA: OK\n");
+
+ /*
+ * Create EQ immediately after INIT_HCA so PAGE_REQUEST events
+ * are captured during all subsequent commands.
+ */
+ mlx5st_alloc_uar(dev);
+ mlx5st_create_eq(dev);
+}
+
+static void mlx5st_disable_hca(struct mlx5st_device *dev)
+{
+ u32 out[MLX5_ST_SZ_DW(disable_hca_out)] = {};
+ u32 in[MLX5_ST_SZ_DW(disable_hca_in)] = {};
+
+ MLX5_SET(disable_hca_in, in, opcode, MLX5_CMD_OP_DISABLE_HCA);
+ mlx5st_cmd_exec(dev, in, sizeof(in), out, sizeof(out));
+}
+
+static void mlx5st_fw_pages_reclaim(struct mlx5st_device *dev, u16 func_id)
+{
+ unsigned int npages = dev->fw_pages_given;
+ unsigned int total_claimed = 0;
+
+ while (npages > 0) {
+ unsigned int batch = npages < MAX_FW_PAGES_PER_CMD ?
+ npages :
+ MAX_FW_PAGES_PER_CMD;
+ unsigned int outlen =
+ MLX5_ST_SZ_BYTES(manage_pages_out) + batch * 8;
+ unsigned int inlen = MLX5_ST_SZ_BYTES(manage_pages_in);
+ unsigned int num_claimed;
+ unsigned int i;
+ void *in, *out;
+ u64 *iovas;
+
+ in = calloc(1, inlen);
+ out = calloc(1, outlen);
+ iovas = calloc(batch, sizeof(u64));
+ VFIO_ASSERT_NOT_NULL(in);
+ VFIO_ASSERT_NOT_NULL(out);
+ VFIO_ASSERT_NOT_NULL(iovas);
+
+ MLX5_SET(manage_pages_in, in, opcode,
+ MLX5_CMD_OP_MANAGE_PAGES);
+ MLX5_SET(manage_pages_in, in, op_mod,
+ MLX5_MANAGE_PAGES_IN_OP_MOD_HCA_RETURN_PAGES);
+ MLX5_SET(manage_pages_in, in, function_id, func_id);
+ MLX5_SET(manage_pages_in, in, input_num_entries, batch);
+
+ mlx5st_cmd_exec(dev, in, inlen, out, outlen);
+
+ num_claimed =
+ MLX5_GET(manage_pages_out, out, output_num_entries);
+ for (i = 0; i < num_claimed; i++)
+ iovas[i] = MLX5_ARRAY_GET64(manage_pages_out, out, pas,
+ i);
+
+ mlx5st_fw_pages_free(dev, num_claimed, iovas);
+ total_claimed += num_claimed;
+ npages -= num_claimed;
+
+ free(iovas);
+ free(in);
+ free(out);
+
+ if (!num_claimed && !dev->fw_pages_given)
+ break;
+ if (!num_claimed)
+ VFIO_FAIL("MANAGE_PAGES RECLAIM: FW returned 0 but %d pages still given",
+ dev->fw_pages_given);
+ }
+
+ dev_dbg(dev->device,
+ "MANAGE_PAGES RECLAIM: %d pages (%d still given)\n",
+ total_claimed, dev->fw_pages_given);
+}
+
+static void mlx5st_hca_teardown(struct mlx5st_device *dev)
+{
+ u32 out[MLX5_ST_SZ_DW(teardown_hca_out)] = {};
+ u32 in[MLX5_ST_SZ_DW(teardown_hca_in)] = {};
+
+ /* Drain async pages slot, then stop EQ processing */
+ while (dev->pages_slot_in_use) {
+ if (!MLX5_GET_ONCE(cmd_queue_entry, dev->pages_cmd_lay,
+ ownership))
+ mlx5st_pages_slot_complete(dev);
+ else
+ sched_yield();
+ }
+ dev->have_eq = false;
+
+ if (dev->eqn) {
+ mlx5st_destroy_eq(dev);
+ dev->eqn = 0;
+ }
+ if (dev->uar_page) {
+ mlx5st_dealloc_uar(dev);
+ dev->uar_page = 0;
+ }
+
+ dev_dbg(dev->device, " hca_teardown: TEARDOWN_HCA\n");
+ MLX5_SET(teardown_hca_in, in, opcode, MLX5_CMD_OP_TEARDOWN_HCA);
+ MLX5_SET(teardown_hca_in, in, profile,
+ MLX5_TEARDOWN_HCA_IN_PROFILE_GRACEFUL_CLOSE);
+ mlx5st_cmd_exec(dev, in, sizeof(in), out, sizeof(out));
+
+ if (dev->fw_pages_given > 0) {
+ dev_dbg(dev->device, " hca_teardown: reclaim %d pages\n",
+ dev->fw_pages_given);
+ mlx5st_fw_pages_reclaim(dev, dev->fw_func_id);
+ }
+
+ dev_dbg(dev->device, " hca_teardown: DISABLE_HCA\n");
+ mlx5st_disable_hca(dev);
+}
+
+/*
+ * Query capabilities
+ */
+static void mlx5st_query_fl_caps(struct mlx5st_device *dev)
+{
+ u32 out[MLX5_ST_SZ_DW(query_hca_cap_out)] = {};
+ u32 in[MLX5_ST_SZ_DW(query_hca_cap_in)] = {};
+ bool fl_roce_en, fl_roce_dis;
+
+ /* Query RoCE capabilities */
+ MLX5_SET(query_hca_cap_in, in, opcode, MLX5_CMD_OP_QUERY_HCA_CAP);
+ MLX5_SET(query_hca_cap_in, in, op_mod,
+ MLX5_SET_HCA_CAP_OP_MOD_ROCE | HCA_CAP_OPMOD_GET_CUR);
+
+ mlx5st_cmd_exec(dev, in, sizeof(in), out, sizeof(out));
+
+ fl_roce_en = MLX5_GET(query_hca_cap_out, out,
+ capability.roce_cap.fl_rc_qp_when_roce_enabled);
+ fl_roce_dis = MLX5_GET(query_hca_cap_out, out,
+ capability.roce_cap.fl_rc_qp_when_roce_disabled);
+
+ /* Also check general caps */
+ memset(in, 0, sizeof(in));
+ memset(out, 0, sizeof(out));
+ MLX5_SET(query_hca_cap_in, in, opcode, MLX5_CMD_OP_QUERY_HCA_CAP);
+ MLX5_SET(query_hca_cap_in, in, op_mod,
+ MLX5_SET_HCA_CAP_OP_MOD_GENERAL_DEVICE |
+ HCA_CAP_OPMOD_GET_CUR);
+
+ mlx5st_cmd_exec(dev, in, sizeof(in), out, sizeof(out));
+
+ fl_roce_dis |=
+ MLX5_GET(query_hca_cap_out, out,
+ capability.cmd_hca_cap.fl_rc_qp_when_roce_disabled);
+
+ dev->fl_supported = fl_roce_en || fl_roce_dis;
+ dev_dbg(dev->device,
+ "HCA capabilities: fl_roce_enabled=%d fl_roce_disabled=%d\n",
+ fl_roce_en, fl_roce_dis);
+
+ VFIO_ASSERT_TRUE(dev->fl_supported,
+ "Force-loopback not supported on this device");
+}
+
+/*
+ * Resource allocation
+ */
+
+static void mlx5st_alloc_pd(struct mlx5st_device *dev)
+{
+ u32 out[MLX5_ST_SZ_DW(alloc_pd_out)] = {};
+ u32 in[MLX5_ST_SZ_DW(alloc_pd_in)] = {};
+
+ MLX5_SET(alloc_pd_in, in, opcode, MLX5_CMD_OP_ALLOC_PD);
+ mlx5st_cmd_exec(dev, in, sizeof(in), out, sizeof(out));
+
+ dev->pdn = MLX5_GET(alloc_pd_out, out, pd);
+ dev_dbg(dev->device, "Allocated PD pdn=%u\n", dev->pdn);
+}
+
+static void mlx5st_dealloc_pd(struct mlx5st_device *dev)
+{
+ u32 out[MLX5_ST_SZ_DW(dealloc_pd_out)] = {};
+ u32 in[MLX5_ST_SZ_DW(dealloc_pd_in)] = {};
+
+ MLX5_SET(dealloc_pd_in, in, opcode, MLX5_CMD_OP_DEALLOC_PD);
+ MLX5_SET(dealloc_pd_in, in, pd, dev->pdn);
+ mlx5st_cmd_exec(dev, in, sizeof(in), out, sizeof(out));
+}
+
+static void mlx5st_create_mkey(struct mlx5st_device *dev)
+{
+ u32 out[MLX5_ST_SZ_DW(create_mkey_out)] = {};
+ u32 in[MLX5_ST_SZ_DW(create_mkey_in)] = {};
+ struct mlx5_ifc_mkc_bits *mkc;
+
+ MLX5_SET(create_mkey_in, in, opcode, MLX5_CMD_OP_CREATE_MKEY);
+
+ mkc = MLX5_ADDR_OF(create_mkey_in, in, memory_key_mkey_entry);
+ MLX5_SET(mkc, mkc, access_mode_1_0, MLX5_MKC_ACCESS_MODE_PA);
+ MLX5_SET(mkc, mkc, length64, 1);
+ MLX5_SET(mkc, mkc, pd, dev->pdn);
+ MLX5_SET(mkc, mkc, qpn, 0xffffff);
+ MLX5_SET(mkc, mkc, lr, 1);
+ MLX5_SET(mkc, mkc, lw, 1);
+ MLX5_SET(mkc, mkc, rw, 1);
+ MLX5_SET(mkc, mkc, rr, 1);
+
+ mlx5st_cmd_exec(dev, in, sizeof(in), out, sizeof(out));
+
+ dev->mkey_index = MLX5_GET(create_mkey_out, out, mkey_index);
+ dev->global_lkey = mlx5st_idx_to_mkey(dev->mkey_index);
+ dev->global_rkey = dev->global_lkey;
+
+ dev_dbg(dev->device, "Created global PA-mode MKEY: lkey=0x%x\n",
+ dev->global_lkey);
+}
+
+static void mlx5st_destroy_mkey(struct mlx5st_device *dev)
+{
+ u32 out[MLX5_ST_SZ_DW(destroy_mkey_out)] = {};
+ u32 in[MLX5_ST_SZ_DW(destroy_mkey_in)] = {};
+
+ MLX5_SET(destroy_mkey_in, in, opcode, MLX5_CMD_OP_DESTROY_MKEY);
+ MLX5_SET(destroy_mkey_in, in, mkey_index, dev->mkey_index);
+ mlx5st_cmd_exec(dev, in, sizeof(in), out, sizeof(out));
+}
+
+/*
+ * Driver ops callbacks
+ */
+
+static void mlx5st_init(struct vfio_pci_device *device)
+{
+ struct mlx5st_device *dev = to_mlx5st(device);
+ iova_t iova_misalign =
+ device->driver.region.iova % __alignof__(struct mlx5st_device);
+
+ VFIO_ASSERT_GE(device->driver.region.size, sizeof(*dev));
+ VFIO_ASSERT_EQ(iova_misalign, 0);
+ memset(dev, 0, sizeof(*dev));
+
+ dev->device = device;
+ dev->bar0 = device->bars[0].vaddr;
+ dev->init_seg = dev->bar0;
+
+ vfio_pci_cmd_set(device, PCI_COMMAND_MASTER);
+
+ mlx5st_wait_fw_init(dev);
+
+ mlx5st_cmd_init(dev);
+ mlx5st_hca_init(dev);
+ mlx5st_query_fl_caps(dev);
+ mlx5st_alloc_pd(dev);
+ mlx5st_create_mkey(dev);
+
+ dev_dbg(device, "mlx5 driver initialized\n");
+}
+
+static void mlx5st_remove(struct vfio_pci_device *device)
+{
+ struct mlx5st_device *dev = to_mlx5st(device);
+
+ dev_dbg(device, "teardown: destroy_mkey\n");
+ if (dev->mkey_index) {
+ mlx5st_destroy_mkey(dev);
+ dev->mkey_index = 0;
+ }
+
+ dev_dbg(device, "teardown: dealloc_pd\n");
+ if (dev->pdn) {
+ mlx5st_dealloc_pd(dev);
+ dev->pdn = 0;
+ }
+
+ dev_dbg(device, "teardown: hca_teardown\n");
+ mlx5st_hca_teardown(dev);
+
+ vfio_pci_cmd_clear(device, PCI_COMMAND_MASTER);
+ dev_dbg(device, "Teardown complete\n");
+}
+
+struct vfio_pci_driver_ops mlx5st_ops = {
+ .name = "mlx5",
+ .region_size = roundup_pow_of_two(sizeof(struct mlx5st_device)),
+ .probe = mlx5st_probe,
+ .init = mlx5st_init,
+ .remove = mlx5st_remove,
+ .memcpy_start = NULL,
+ .memcpy_wait = NULL,
+ .send_msi = NULL,
+};
diff --git a/tools/testing/selftests/vfio/lib/drivers/mlx5/mlx5_hw.h b/tools/testing/selftests/vfio/lib/drivers/mlx5/mlx5_hw.h
new file mode 100644
index 00000000000000..a2506ec8a19523
--- /dev/null
+++ b/tools/testing/selftests/vfio/lib/drivers/mlx5/mlx5_hw.h
@@ -0,0 +1,108 @@
+/* SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB */
+/*
+ * mlx5 VFIO selftest driver - HW definitions
+ *
+ * Typed wrappers, constants, and helpers for programming mlx5 hardware
+ * via the VFIO selftest framework. Most HW constants and all MLX5_SET/GET
+ * macros come from the kernel headers (mlx5_ifc.h, mlx5_ifc_macros.h).
+ */
+#ifndef SELFTESTS_VFIO_MLX5_HW_H
+#define SELFTESTS_VFIO_MLX5_HW_H
+
+#include <linux/io.h>
+#include <linux/build_bug.h>
+#include <vdso/bits.h>
+
+#include "mlx5_ifc.h"
+#include "mlx5_ifc_macros.h"
+
+/*
+ * Typed HW object wrappers for driver region arrays.
+ *
+ * The IFC _bits structs have sizeof == num_bits (not bytes), so they cannot
+ * be used as array elements. These wrappers provide byte-sized types.
+ */
+#define MLX5ST_MAKE_DATA32(name) \
+ struct mlx5st_##name { \
+ u32 data[MLX5_ST_SZ_DW(name)]; \
+ }
+#define MLX5ST_MAKE_DATA64(name) \
+ struct mlx5st_##name { \
+ u64 data[MLX5_ST_SZ_QW(name)]; \
+ }
+
+MLX5ST_MAKE_DATA32(initial_seg);
+MLX5ST_MAKE_DATA64(cmd_queue_entry);
+MLX5ST_MAKE_DATA64(cmd_if_box);
+MLX5ST_MAKE_DATA64(wqe_ctrl_seg);
+MLX5ST_MAKE_DATA64(wqe_raddr_seg);
+MLX5ST_MAKE_DATA64(wqe_data_seg);
+MLX5ST_MAKE_DATA64(cqe64) __aligned(64);
+MLX5ST_MAKE_DATA64(eqe);
+
+/*
+ * Mailbox blocks: 512 data + 64 header = 576 bytes, but the
+ * next_pointer field stores bits [31:10], requiring 1024-byte alignment.
+ */
+#define CMD_MBOX_SIZE (2 * MLX5_HW_PAGE_SIZE)
+#define CMD_MBOX_STRIDE 1024
+#define CMD_MBOX_NENT (CMD_MBOX_SIZE / CMD_MBOX_STRIDE)
+/* Stride-aligned mailbox entry — block + padding to 1024 bytes */
+struct mlx5st_mbox_entry {
+ struct mlx5st_cmd_if_box block;
+} __aligned(CMD_MBOX_STRIDE);
+
+#define MLX5_CMD_INLINE_SZ \
+ MLX5_FLD_SZ_BYTES(cmd_queue_entry, command_input_inline_data)
+
+/* Command interface mailbox block (512 data + 64 header) */
+#define MLX5_CMD_DATA_BLOCK_SIZE MLX5_FLD_SZ_BYTES(cmd_if_box, mailbox_data)
+
+/* RDMA Write WQE — one basic block: ctrl + raddr + data + padding */
+struct mlx5st_send_wqe {
+ struct mlx5st_wqe_ctrl_seg ctrl;
+ struct mlx5st_wqe_raddr_seg raddr;
+ struct mlx5st_wqe_data_seg data;
+} __aligned(64);
+static_assert(sizeof(struct mlx5st_send_wqe) == 64,
+ "send WQE segments must fit in one BB");
+
+/* DS = number of 16-byte segments in the WQE (ctrl + raddr + data) */
+#define MLX5_RDMA_WRITE_DS 3
+
+/* Doorbell record — two __be32 in a 64-byte aligned pair */
+struct mlx5st_dbrec {
+ __be32 recv_counter;
+ __be32 send_counter;
+} __aligned(64);
+
+/* UAR BlueFlame buffer offsets within a UAR page */
+#define MLX5_BF_OFFSET 0x800
+#define MLX5_BF_SIZE 0x100
+
+/* EQ doorbell offset within UAR page */
+#define MLX5_EQ_DOORBELL_OFFSET 0x40
+
+#define MLX5_HW_PAGE_SIZE 4096
+
+/*
+ * Test parameters
+ */
+#define SQ_WQE_CNT 16
+#define LOG_SQ_SIZE 4
+#define CQ_CQE_CNT 16
+#define LOG_CQ_SIZE 4
+#define EQ_NENT 64
+#define LOG_EQ_SIZE 6
+
+#define MAX_FW_PAGES 8192
+#define MAX_FW_PAGES_PER_CMD 512
+
+#define MLX5_CMD_TIMEOUT_MS 5000
+
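+/*
+ * An mkey is the 24-bit mkey index in the upper bytes plus an 8-bit
+ * variant/key tag in the low byte; this driver always uses tag 0.
+ */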
+static inline u32 mlx5st_idx_to_mkey(u32 mkey_idx)
+{
+ return mkey_idx << 8;
+}
+
+#endif /* SELFTESTS_VFIO_MLX5_HW_H */
diff --git a/tools/testing/selftests/vfio/lib/drivers/mlx5/mlx5_ifc.h b/tools/testing/selftests/vfio/lib/drivers/mlx5/mlx5_ifc.h
new file mode 120000
index 00000000000000..7dcbb79e1af061
--- /dev/null
+++ b/tools/testing/selftests/vfio/lib/drivers/mlx5/mlx5_ifc.h
@@ -0,0 +1 @@
+../../../../../../../include/linux/mlx5/mlx5_ifc.h
\ No newline at end of file
diff --git a/tools/testing/selftests/vfio/lib/drivers/mlx5/mlx5_ifc_fpga.h b/tools/testing/selftests/vfio/lib/drivers/mlx5/mlx5_ifc_fpga.h
new file mode 120000
index 00000000000000..865d99e2aeecd3
--- /dev/null
+++ b/tools/testing/selftests/vfio/lib/drivers/mlx5/mlx5_ifc_fpga.h
@@ -0,0 +1 @@
+../../../../../../../include/linux/mlx5/mlx5_ifc_fpga.h
\ No newline at end of file
diff --git a/tools/testing/selftests/vfio/lib/drivers/mlx5/mlx5_ifc_macros.h b/tools/testing/selftests/vfio/lib/drivers/mlx5/mlx5_ifc_macros.h
new file mode 120000
index 00000000000000..97408c247f06ca
--- /dev/null
+++ b/tools/testing/selftests/vfio/lib/drivers/mlx5/mlx5_ifc_macros.h
@@ -0,0 +1 @@
+../../../../../../../include/linux/mlx5/mlx5_ifc_macros.h
\ No newline at end of file
diff --git a/tools/testing/selftests/vfio/lib/libvfio.mk b/tools/testing/selftests/vfio/lib/libvfio.mk
index d7017b0a076723..b24857ed1b016c 100644
--- a/tools/testing/selftests/vfio/lib/libvfio.mk
+++ b/tools/testing/selftests/vfio/lib/libvfio.mk
@@ -15,6 +15,7 @@ LIBVFIO_C += drivers/dsa/dsa.c
endif
LIBVFIO_C += drivers/nv_falcon/nv_falcon.c
+LIBVFIO_C += drivers/mlx5/mlx5.c
LIBVFIO_OUTPUT := $(OUTPUT)/libvfio
diff --git a/tools/testing/selftests/vfio/lib/vfio_pci_driver.c b/tools/testing/selftests/vfio/lib/vfio_pci_driver.c
index 153bf4a7a19f91..36bb81a18af4c9 100644
--- a/tools/testing/selftests/vfio/lib/vfio_pci_driver.c
+++ b/tools/testing/selftests/vfio/lib/vfio_pci_driver.c
@@ -8,6 +8,7 @@ extern struct vfio_pci_driver_ops ioat_ops;
#endif
extern struct vfio_pci_driver_ops nv_falcon_ops;
+extern struct vfio_pci_driver_ops mlx5st_ops;
static struct vfio_pci_driver_ops *driver_ops[] = {
#ifdef __x86_64__
@@ -15,6 +16,7 @@ static struct vfio_pci_driver_ops *driver_ops[] = {
&ioat_ops,
#endif
&nv_falcon_ops,
+ &mlx5st_ops,
};
void vfio_pci_driver_probe(struct vfio_pci_device *device)
--
2.43.0
^ permalink raw reply related [flat|nested] 24+ messages in thread
* [PATCH 10/11] vfio: selftests: Add mlx5 driver - data path and memcpy ops
2026-05-01 0:08 [PATCH 00/11] mlx5 support for VFIO self test Jason Gunthorpe
` (8 preceding siblings ...)
2026-05-01 0:08 ` [PATCH 09/11] vfio: selftests: Add mlx5 driver - HW init and command interface Jason Gunthorpe
@ 2026-05-01 0:08 ` Jason Gunthorpe
2026-05-04 22:41 ` David Matlack
2026-05-01 0:08 ` [PATCH 11/11] vfio: selftests: mlx5 driver - add send_msi support Jason Gunthorpe
` (2 subsequent siblings)
12 siblings, 1 reply; 24+ messages in thread
From: Jason Gunthorpe @ 2026-05-01 0:08 UTC (permalink / raw)
To: Alex Williamson, David Matlack, kvm, Leon Romanovsky,
linux-kselftest, linux-rdma, Mark Bloch, netdev, Saeed Mahameed,
Shuah Khan, Tariq Toukan
Cc: patches
Complete the mlx5 driver by adding CQ/QP creation, QP state
transitions, WQE posting, CQ polling, and the
memcpy_start/memcpy_wait callbacks. After this patch the driver is
functional for DMA tests.
The data path implements RDMA Write self-loopback via an RC QP with
force-loopback. WQEs are posted to a 16-entry send queue with an
NC doorbell, and completions are polled from a 16-entry CQ.
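For orientation, a memcpy boils down to the sequence below (a minimal
sketch; the vfio_pci_driver_memcpy_* wrapper names come from the
selftest framework and are assumptions here):
	/* posts `count` RDMA Write WQEs; only the last is signaled */
	vfio_pci_driver_memcpy_start(device, src, dst, size, count);
	/* polls the CQ until the lone completion CQE arrives */
	ret = vfio_pci_driver_memcpy_wait(device);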
Assisted-by: Claude:claude-opus-4.6
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
.../selftests/vfio/lib/drivers/mlx5/mlx5.c | 359 +++++++++++++++++-
1 file changed, 357 insertions(+), 2 deletions(-)
diff --git a/tools/testing/selftests/vfio/lib/drivers/mlx5/mlx5.c b/tools/testing/selftests/vfio/lib/drivers/mlx5/mlx5.c
index 0ab941bad7a66c..39c5414e2c743c 100644
--- a/tools/testing/selftests/vfio/lib/drivers/mlx5/mlx5.c
+++ b/tools/testing/selftests/vfio/lib/drivers/mlx5/mlx5.c
@@ -1340,6 +1340,354 @@ static void mlx5st_destroy_mkey(struct mlx5st_device *dev)
mlx5st_cmd_exec(dev, in, sizeof(in), out, sizeof(out));
}
+/*
+ * CQ create/destroy
+ */
+
+static void mlx5st_create_cq(struct mlx5st_device *dev)
+{
+ struct vfio_pci_device *device = dev->device;
+ u64 in[MLX5_ST_SZ_QW(create_cq_in) + 1] = {};
+ u32 out[MLX5_ST_SZ_DW(create_cq_out)] = {};
+ struct mlx5_ifc_cqc_bits *cqc;
+ unsigned int i;
+ __be64 *pas;
+
+ /* Initialize CQEs before CREATE_CQ: opcode=0xF, owner=1 */
+ for (i = 0; i < CQ_CQE_CNT; i++) {
+ struct mlx5st_cqe64 *cqe = &dev->cq_buf[i];
+
+ MLX5_SET(cqe64, cqe, opcode, 0xF);
+ MLX5_SET_ONCE(cqe64, cqe, owner, 1);
+ }
+
+ MLX5_SET(create_cq_in, in, opcode, MLX5_CMD_OP_CREATE_CQ);
+
+ cqc = MLX5_ADDR_OF(create_cq_in, in, cq_context);
+ MLX5_SET(cqc, cqc, log_cq_size, LOG_CQ_SIZE);
+ MLX5_SET(cqc, cqc, uar_page, dev->uar_page);
+ MLX5_SET(cqc, cqc, c_eqn_or_apu_element, dev->eqn);
+ MLX5_SET(cqc, cqc, cqe_sz, 0);
+ pas = MLX5_ADDR_OF(create_cq_in, in, pas);
+ MLX5_SET(cqc, cqc, page_offset,
+ mlx5st_fill_pas(device, dev->cq_buf, pas));
+ MLX5_SET(cqc, cqc, log_page_size, 0);
+ MLX5_SET64(cqc, cqc, dbr_addr, to_iova(device, &dev->cq_dbrec));
+
+ mlx5st_cmd_exec(dev, in, sizeof(in), out, sizeof(out));
+
+ dev->cqn = MLX5_GET(create_cq_out, out, cqn);
+ dev->cq_ci = 0;
+ dev_dbg(device, "Created CQ: cqn=%u, %d entries\n", dev->cqn,
+ CQ_CQE_CNT);
+}
+
+static void mlx5st_destroy_cq(struct mlx5st_device *dev)
+{
+ u32 out[MLX5_ST_SZ_DW(destroy_cq_out)] = {};
+ u32 in[MLX5_ST_SZ_DW(destroy_cq_in)] = {};
+
+ MLX5_SET(destroy_cq_in, in, opcode, MLX5_CMD_OP_DESTROY_CQ);
+ MLX5_SET(destroy_cq_in, in, cqn, dev->cqn);
+ mlx5st_cmd_exec(dev, in, sizeof(in), out, sizeof(out));
+}
+
+/*
+ * QP create/destroy
+ */
+
+static void mlx5st_create_qp(struct mlx5st_device *dev)
+{
+ struct vfio_pci_device *device = dev->device;
+ u64 in[MLX5_ST_SZ_QW(create_qp_in) + 1] = {};
+ u32 out[MLX5_ST_SZ_DW(create_qp_out)] = {};
+ struct mlx5_ifc_qpc_bits *qpc;
+ __be64 *pas;
+
+ MLX5_SET(create_qp_in, in, opcode, MLX5_CMD_OP_CREATE_QP);
+
+ qpc = MLX5_ADDR_OF(create_qp_in, in, qpc);
+ MLX5_SET(qpc, qpc, st, MLX5_QPC_ST_RC);
+ MLX5_SET(qpc, qpc, pm_state, MLX5_QPC_PM_STATE_MIGRATED);
+ MLX5_SET(qpc, qpc, pd, dev->pdn);
+ MLX5_SET(qpc, qpc, uar_page, dev->uar_page);
+ MLX5_SET(qpc, qpc, cqn_snd, dev->cqn);
+ MLX5_SET(qpc, qpc, cqn_rcv, dev->cqn);
+ MLX5_SET(qpc, qpc, log_sq_size, LOG_SQ_SIZE);
+ MLX5_SET(qpc, qpc, log_msg_max, 20);
+ MLX5_SET(qpc, qpc, rq_type, 0x3);
+ MLX5_SET(qpc, qpc, ts_format, 1);
+ pas = MLX5_ADDR_OF(create_qp_in, in, pas);
+ MLX5_SET(qpc, qpc, page_offset,
+ mlx5st_fill_pas(device, dev->sq_buf, pas));
+ MLX5_SET(qpc, qpc, log_page_size, 0);
+ MLX5_SET64(qpc, qpc, dbr_addr, to_iova(device, &dev->qp_dbrec));
+
+ mlx5st_cmd_exec(dev, in, sizeof(in), out, sizeof(out));
+
+ dev->qpn = MLX5_GET(create_qp_out, out, qpn);
+ dev->sq_pi = 0;
+ dev_dbg(device, "Created QP: qpn=%u, RC, sq=%d wqes\n", dev->qpn,
+ SQ_WQE_CNT);
+}
+
+static void mlx5st_destroy_qp(struct mlx5st_device *dev)
+{
+ u32 out[MLX5_ST_SZ_DW(destroy_qp_out)] = {};
+ u32 in[MLX5_ST_SZ_DW(destroy_qp_in)] = {};
+
+ MLX5_SET(destroy_qp_in, in, opcode, MLX5_CMD_OP_DESTROY_QP);
+ MLX5_SET(destroy_qp_in, in, qpn, dev->qpn);
+ mlx5st_cmd_exec(dev, in, sizeof(in), out, sizeof(out));
+}
+
+/*
+ * QP state transitions
+ */
+
+static void mlx5st_qp_rst2init(struct mlx5st_device *dev)
+{
+ u32 out[MLX5_ST_SZ_DW(rst2init_qp_out)] = {};
+ u32 in[MLX5_ST_SZ_DW(rst2init_qp_in)] = {};
+ struct mlx5_ifc_qpc_bits *qpc = MLX5_ADDR_OF(rst2init_qp_in, in, qpc);
+
+ MLX5_SET(rst2init_qp_in, in, opcode, MLX5_CMD_OP_RST2INIT_QP);
+ MLX5_SET(rst2init_qp_in, in, qpn, dev->qpn);
+
+ MLX5_SET(qpc, qpc, primary_address_path.vhca_port_num, 1);
+ MLX5_SET(qpc, qpc, pm_state, MLX5_QPC_PM_STATE_MIGRATED);
+ MLX5_SET(qpc, qpc, rre, 1);
+ MLX5_SET(qpc, qpc, rwe, 1);
+
+ mlx5st_cmd_exec(dev, in, sizeof(in), out, sizeof(out));
+ dev_dbg(dev->device, "QP RST->INIT\n");
+}
+
+static void mlx5st_qp_init2rtr(struct mlx5st_device *dev)
+{
+ u32 out[MLX5_ST_SZ_DW(init2rtr_qp_out)] = {};
+ u32 in[MLX5_ST_SZ_DW(init2rtr_qp_in)] = {};
+ struct mlx5_ifc_qpc_bits *qpc = MLX5_ADDR_OF(init2rtr_qp_in, in, qpc);
+
+ MLX5_SET(init2rtr_qp_in, in, opcode, MLX5_CMD_OP_INIT2RTR_QP);
+ MLX5_SET(init2rtr_qp_in, in, qpn, dev->qpn);
+
+ MLX5_SET(qpc, qpc, mtu, 3);
+ MLX5_SET(qpc, qpc, log_msg_max, 20);
+ MLX5_SET(qpc, qpc, remote_qpn, dev->qpn);
+ MLX5_SET(qpc, qpc, min_rnr_nak, 12);
+ MLX5_SET(qpc, qpc, primary_address_path.vhca_port_num, 1);
+ MLX5_SET(qpc, qpc, primary_address_path.fl, 1);
+
+ mlx5st_cmd_exec(dev, in, sizeof(in), out, sizeof(out));
+ dev_dbg(dev->device, "QP INIT->RTR (fl=1)\n");
+}
+
+static void mlx5st_qp_rtr2rts(struct mlx5st_device *dev)
+{
+ u32 out[MLX5_ST_SZ_DW(rtr2rts_qp_out)] = {};
+ u32 in[MLX5_ST_SZ_DW(rtr2rts_qp_in)] = {};
+ struct mlx5_ifc_qpc_bits *qpc = MLX5_ADDR_OF(rtr2rts_qp_in, in, qpc);
+
+ MLX5_SET(rtr2rts_qp_in, in, opcode, MLX5_CMD_OP_RTR2RTS_QP);
+ MLX5_SET(rtr2rts_qp_in, in, qpn, dev->qpn);
+
+ MLX5_SET(qpc, qpc, log_ack_req_freq, 0);
+ MLX5_SET(qpc, qpc, retry_count, 7);
+ MLX5_SET(qpc, qpc, rnr_retry, 7);
+ MLX5_SET(qpc, qpc, primary_address_path.ack_timeout, 14);
+
+ mlx5st_cmd_exec(dev, in, sizeof(in), out, sizeof(out));
+ dev_dbg(dev->device, "QP RTR->RTS\n");
+}
+
+/*
+ * Post RDMA Write WQE
+ */
+static void mlx5st_post_rdma_write(struct mlx5st_device *dev, u64 src_addr,
+ u32 src_lkey, u64 dst_addr, u32 dst_rkey,
+ u32 length, bool signaled)
+{
+ struct mlx5st_send_wqe *wqe;
+ unsigned int idx;
+
+ idx = dev->sq_pi % SQ_WQE_CNT;
+ wqe = &dev->sq_buf[idx];
+
+ memset(wqe, 0, sizeof(*wqe));
+ MLX5_SET(wqe_ctrl_seg, &wqe->ctrl, opcode, MLX5_OPCODE_RDMA_WRITE);
+ MLX5_SET(wqe_ctrl_seg, &wqe->ctrl, wqe_index, dev->sq_pi);
+ MLX5_SET(wqe_ctrl_seg, &wqe->ctrl, qp_or_sq, dev->qpn);
+ MLX5_SET(wqe_ctrl_seg, &wqe->ctrl, ds, MLX5_RDMA_WRITE_DS);
+ if (signaled)
+ MLX5_SET(wqe_ctrl_seg, &wqe->ctrl, ce, MLX5_WQE_CE_CQE_ALWAYS);
+
+ MLX5_SET64(wqe_raddr_seg, &wqe->raddr, raddr, dst_addr);
+ MLX5_SET(wqe_raddr_seg, &wqe->raddr, rkey, dst_rkey);
+
+ MLX5_SET(wqe_data_seg, &wqe->data, byte_count, length);
+ MLX5_SET(wqe_data_seg, &wqe->data, lkey, src_lkey);
+ MLX5_SET64(wqe_data_seg, &wqe->data, addr, src_addr);
+
+ dev->sq_pi++;
+
+ /* Ensure WQE is visible to device before doorbell record */
+ dma_wmb();
+
+ WRITE_ONCE(dev->qp_dbrec.send_counter,
+ cpu_to_be32(dev->sq_pi & 0xffff));
+
+ /*
+ * Ring doorbell: write first 8 bytes of ctrl to UAR BF register,
+ * iowrite has an internal dma_wmb() so the doorbell record will be
+ * visible.
+ */
+ iowrite64be(be64_to_cpu(*(__be64 *)wqe),
+ (u8 __iomem *)dev->uar_base + dev->uar_bf_offset);
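+ /*
+ * Alternate between the two BlueFlame buffers so consecutive
+ * doorbell writes land on different registers.
+ */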
+ dev->uar_bf_offset ^= MLX5_BF_SIZE;
+}
+
+/*
+ * Poll CQ
+ */
+static int mlx5st_poll_cq_batch(struct mlx5st_device *dev,
+ unsigned int max_cqe)
+{
+ unsigned int polled = 0;
+
+ while (polled < max_cqe) {
+ unsigned int idx = dev->cq_ci % CQ_CQE_CNT;
+ struct mlx5st_cqe64 *cqe = &dev->cq_buf[idx];
+ u8 owner, opcode;
+
+ owner = MLX5_GET_ONCE(cqe64, cqe, owner);
+ if (owner != ((dev->cq_ci >> LOG_CQ_SIZE) & 1))
+ break;
+
+ dma_rmb();
+
+ opcode = MLX5_GET(cqe64, cqe, opcode);
+
+ dev->cq_ci++;
+ WRITE_ONCE(dev->cq_dbrec.recv_counter,
+ cpu_to_be32(dev->cq_ci & 0xffffff));
+
+ if (opcode == MLX5_CQE_REQ) {
+ dev->sq_ci =
+ (u16)(MLX5_GET(cqe64, cqe, wqe_counter) + 1);
+ polled++;
+ continue;
+ }
+ if (opcode == MLX5_CQE_REQ_ERR ||
+ opcode == MLX5_CQE_RESP_ERR) {
+ dev_dbg(dev->device,
+ "CQE error: opcode=0x%x syndrome=0x%x vendor=0x%x\n",
+ opcode,
+ MLX5_GET(cqe64, cqe, error_syndrome.syndrome),
+ MLX5_GET(cqe64, cqe,
+ error_syndrome.vendor_error_syndrome));
+ return -1;
+ }
+ dev_err(dev->device, "CQE unexpected opcode=0x%x\n", opcode);
+ return -1;
+ }
+
+ return polled;
+}
+
+static int mlx5st_poll_cq(struct mlx5st_device *dev, unsigned int timeout_ms)
+{
+ struct timespec start, now;
+ unsigned int elapsed;
+ int ret;
+
+ clock_gettime(CLOCK_MONOTONIC, &start);
+ for (;;) {
+ ret = mlx5st_poll_cq_batch(dev, 1);
+ if (ret < 0)
+ return -1;
+ if (ret > 0)
+ return 0;
+
+ if (dev->have_eq)
+ mlx5st_process_events(dev);
+
+ clock_gettime(CLOCK_MONOTONIC, &now);
+ elapsed = (now.tv_sec - start.tv_sec) * 1000 +
+ (now.tv_nsec - start.tv_nsec) / 1000000;
+ if (elapsed > timeout_ms) {
+ dev_err(dev->device, "CQ poll timeout after %u ms\n",
+ timeout_ms);
+ return -1;
+ }
+ }
+}
+
+/*
+ * Data path setup/teardown helpers
+ */
+
+static void mlx5st_setup_datapath(struct mlx5st_device *dev)
+{
+ mlx5st_create_cq(dev);
+ mlx5st_create_qp(dev);
+ mlx5st_qp_rst2init(dev);
+ mlx5st_qp_init2rtr(dev);
+ mlx5st_qp_rtr2rts(dev);
+}
+
+static void mlx5st_teardown_datapath(struct mlx5st_device *dev)
+{
+ if (dev->qpn) {
+ mlx5st_destroy_qp(dev);
+ dev->qpn = 0;
+ }
+ if (dev->cqn) {
+ mlx5st_destroy_cq(dev);
+ dev->cqn = 0;
+ }
+ dev->sq_pi = 0;
+ dev->sq_ci = 0;
+ memset(&dev->qp_dbrec, 0, sizeof(dev->qp_dbrec));
+ memset(&dev->cq_dbrec, 0, sizeof(dev->cq_dbrec));
+}
+
+/*
+ * memcpy callbacks
+ */
+
+#define MLX5ST_MEMCPY_TIMEOUT_MS 60000
+
+static void mlx5st_memcpy_start(struct vfio_pci_device *device,
+ iova_t src, iova_t dst, u64 size, u64 count)
+{
+ struct mlx5st_device *dev = to_mlx5st(device);
+ u64 i;
+
+ for (i = 0; i < count; i++) {
+ bool signaled = (i == count - 1);
+
+ mlx5st_post_rdma_write(dev, src, dev->global_lkey, dst,
+ dev->global_rkey, size, signaled);
+ }
+}
+
+static int mlx5st_memcpy_wait(struct vfio_pci_device *device)
+{
+ struct mlx5st_device *dev = to_mlx5st(device);
+ int ret;
+
+ ret = mlx5st_poll_cq(dev, MLX5ST_MEMCPY_TIMEOUT_MS);
+ if (ret) {
+ /*
+ * CQE error puts the QP in error state. Rebuild the data path
+ * so subsequent operations can succeed.
+ */
+ mlx5st_teardown_datapath(dev);
+ mlx5st_setup_datapath(dev);
+ }
+ return ret;
+}
+
/*
* Driver ops callbacks
*/
@@ -1368,6 +1716,11 @@ static void mlx5st_init(struct vfio_pci_device *device)
mlx5st_alloc_pd(dev);
mlx5st_create_mkey(dev);
+ mlx5st_setup_datapath(dev);
+
+ device->driver.max_memcpy_size = 1 << 20;
+ device->driver.max_memcpy_count = SQ_WQE_CNT - 1;
+
dev_dbg(device, "mlx5 driver initialized\n");
}
@@ -1375,6 +1728,8 @@ static void mlx5st_remove(struct vfio_pci_device *device)
{
struct mlx5st_device *dev = to_mlx5st(device);
+ mlx5st_teardown_datapath(dev);
+
dev_dbg(device, "teardown: destroy_mkey\n");
if (dev->mkey_index) {
mlx5st_destroy_mkey(dev);
@@ -1400,7 +1755,7 @@ struct vfio_pci_driver_ops mlx5st_ops = {
.probe = mlx5st_probe,
.init = mlx5st_init,
.remove = mlx5st_remove,
- .memcpy_start = NULL,
- .memcpy_wait = NULL,
+ .memcpy_start = mlx5st_memcpy_start,
+ .memcpy_wait = mlx5st_memcpy_wait,
.send_msi = NULL,
};
--
2.43.0
^ permalink raw reply related [flat|nested] 24+ messages in thread
* [PATCH 11/11] vfio: selftests: mlx5 driver - add send_msi support
2026-05-01 0:08 [PATCH 00/11] mlx5 support for VFIO self test Jason Gunthorpe
` (9 preceding siblings ...)
2026-05-01 0:08 ` [PATCH 10/11] vfio: selftests: Add mlx5 driver - data path and memcpy ops Jason Gunthorpe
@ 2026-05-01 0:08 ` Jason Gunthorpe
2026-05-01 16:11 ` [PATCH 00/11] mlx5 support for VFIO self test David Matlack
2026-05-02 4:31 ` Alex Williamson
12 siblings, 0 replies; 24+ messages in thread
From: Jason Gunthorpe @ 2026-05-01 0:08 UTC (permalink / raw)
To: Alex Williamson, David Matlack, kvm, Leon Romanovsky,
linux-kselftest, linux-rdma, Mark Bloch, netdev, Saeed Mahameed,
Shuah Khan, Tariq Toukan
Cc: patches
Wire an MSI-X vector to a dedicated EQ so the mlx5 driver supports
send_msi().
Each EQ can be linked to an MSI-X vector, and the CQ can be set up
to deliver an event to the EQ. Thus, when everything is armed, an
RDMA WRITE posted to the QP generates a CQE, which generates an
EQE, which generates an MSI-X.
To keep things simple this just re-uses the existing QP and CQ, so
they generate single MSIs during memcpy.
send_msi() drains any accumulated MSI EQ events from prior memcpy
completions, posts a small signaled RDMA Write, then polls the CQ to
consume the resulting CQE (avoiding stale completions on subsequent
test cycles).
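Condensed from the patch below, the send_msi() flow is (a sketch,
error handling omitted):
	mlx5st_msi_eq_drain(dev);	/* retire stale EQEs, re-arm EQ */
	mlx5st_arm_cq(dev);		/* next CQE will emit an EQE */
	mlx5st_post_rdma_write(...);	/* signaled: CQE -> EQE -> MSI-X */
	mlx5st_poll_cq(dev, ...);	/* consume the CQE */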
Assisted-by: Claude:claude-opus-4.6
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
.../selftests/vfio/lib/drivers/mlx5/mlx5.c | 165 +++++++++++++++++-
.../selftests/vfio/lib/drivers/mlx5/mlx5_hw.h | 6 +
2 files changed, 168 insertions(+), 3 deletions(-)
diff --git a/tools/testing/selftests/vfio/lib/drivers/mlx5/mlx5.c b/tools/testing/selftests/vfio/lib/drivers/mlx5/mlx5.c
index 39c5414e2c743c..cf6c436a6df0de 100644
--- a/tools/testing/selftests/vfio/lib/drivers/mlx5/mlx5.c
+++ b/tools/testing/selftests/vfio/lib/drivers/mlx5/mlx5.c
@@ -56,17 +56,23 @@ struct mlx5st_device {
/* CQ */
u32 cqn;
u32 cq_ci;
+ u32 cq_arm_sn;
/* UAR */
u32 uar_page;
void __iomem *uar_base;
unsigned int uar_bf_offset;
- /* EQ */
+ /* EQ (cmd/pages events — polled, not interrupt-driven) */
u32 eqn;
u32 eq_cons_index;
bool have_eq;
+ /* MSI EQ (CQ completion events — fires MSI-X) */
+ u32 msi_eqn;
+ u32 msi_eq_cons_index;
+ bool have_msi_eq;
+
/* Async pages slot state */
bool pages_slot_in_use;
bool pages_slot_is_reclaim;
@@ -89,6 +95,10 @@ struct mlx5st_device {
/* Capabilities */
bool fl_supported;
+ /* Buffers used by send_msi() to trigger an interrupt */
+ u64 send_msi_src;
+ u64 send_msi_dst;
+
/*
* HW-visible DMA buffers below — device reads/writes via DMA.
*/
@@ -111,6 +121,9 @@ struct mlx5st_device {
/* EQ does not support page_offset */
struct mlx5st_eqe eq_buf[EQ_NENT] __aligned(MLX5_HW_PAGE_SIZE);
+ /* MSI EQ buffer — CQ completions generate EQEs here -> MSI-X */
+ struct mlx5st_eqe msi_eq_buf[MSI_EQ_NENT] __aligned(MLX5_HW_PAGE_SIZE);
+
u8 fw_pages[MAX_FW_PAGES][MLX5_HW_PAGE_SIZE]
__aligned(MLX5_HW_PAGE_SIZE);
};
@@ -133,6 +146,9 @@ static_assert(offsetof(struct mlx5st_device, qp_dbrec) % 64 == 0,
static_assert(offsetof(struct mlx5st_device, eq_buf) %
MLX5_HW_PAGE_SIZE == 0,
"eq_buf must be page-aligned");
+static_assert(offsetof(struct mlx5st_device, msi_eq_buf) %
+ MLX5_HW_PAGE_SIZE == 0,
+ "msi_eq_buf must be page-aligned");
static_assert(offsetof(struct mlx5st_device, fw_pages) %
MLX5_HW_PAGE_SIZE == 0,
"fw_pages must be page-aligned");
@@ -1012,6 +1028,85 @@ static void mlx5st_process_events(struct mlx5st_device *dev)
mlx5st_eq_update_ci(dev, cc, 0);
}
+/*
+ * MSI EQ — dedicated EQ for CQ completion events that fires MSI-X.
+ * Separate from the cmd/pages EQ so that only CQ completions (from
+ * send_msi or memcpy) trigger the interrupt vector.
+ */
+
+static void mlx5st_msi_eq_drain(struct mlx5st_device *dev)
+{
+ u32 cc = 0;
+ u32 val;
+
+ while (cc < MSI_EQ_NENT) {
+ u32 ci = dev->msi_eq_cons_index + cc;
+ struct mlx5st_eqe *eqe =
+ &dev->msi_eq_buf[ci % MSI_EQ_NENT];
+
+ if (MLX5_GET_ONCE(eqe, eqe, owner) != !!(ci & MSI_EQ_NENT))
+ break;
+ cc++;
+ }
+
+ /* Update consumer index and re-arm for next interrupt */
+ dev->msi_eq_cons_index += cc;
+ val = (dev->msi_eq_cons_index & 0xffffff) | (dev->msi_eqn << 24);
+ iowrite32be(val, (u8 __iomem *)dev->uar_base + MLX5_EQ_DOORBELL_OFFSET);
+}
+
+static void mlx5st_create_msi_eq(struct mlx5st_device *dev)
+{
+ struct vfio_pci_device *device = dev->device;
+ u64 in[MLX5_ST_SZ_QW(create_eq_in) + 1] = {};
+ u32 out[MLX5_ST_SZ_DW(create_eq_out)] = {};
+ struct mlx5_ifc_eqc_bits *eqc;
+ unsigned int i;
+ __be64 *pas;
+
+ /* Initialize EQE owner bits */
+ for (i = 0; i < MSI_EQ_NENT; i++) {
+ struct mlx5st_eqe *eqe = &dev->msi_eq_buf[i];
+
+ MLX5_SET_ONCE(eqe, eqe, owner, 1);
+ }
+
+ MLX5_SET(create_eq_in, in, opcode, MLX5_CMD_OP_CREATE_EQ);
+
+ /*
+ * No event_bitmask — completion events are routed to this EQ via
+ * the CQ's c_eqn field, not through CREATE_EQ subscription.
+ */
+ eqc = MLX5_ADDR_OF(create_eq_in, in, eq_context_entry);
+ MLX5_SET(eqc, eqc, log_eq_size, LOG_MSI_EQ_SIZE);
+ MLX5_SET(eqc, eqc, uar_page, dev->uar_page);
+ MLX5_SET(eqc, eqc, intr, MSI_VECTOR);
+ pas = MLX5_ADDR_OF(create_eq_in, in, pas);
+ VFIO_ASSERT_EQ(mlx5st_fill_pas(device, dev->msi_eq_buf, pas), 0u);
+ MLX5_SET(eqc, eqc, log_page_size, 0);
+
+ mlx5st_cmd_exec(dev, in, sizeof(in), out, sizeof(out));
+
+ dev->msi_eqn = MLX5_GET(create_eq_out, out, eq_number);
+ dev->msi_eq_cons_index = 0;
+ dev->have_msi_eq = true;
+ mlx5st_msi_eq_drain(dev);
+
+ dev_dbg(device,
+ "Created MSI EQ: eqn=%u, %d entries (COMP), vector=%d\n",
+ dev->msi_eqn, MSI_EQ_NENT, MSI_VECTOR);
+}
+
+static void mlx5st_destroy_msi_eq(struct mlx5st_device *dev)
+{
+ u32 out[MLX5_ST_SZ_DW(destroy_eq_out)] = {};
+ u32 in[MLX5_ST_SZ_DW(destroy_eq_in)] = {};
+
+ MLX5_SET(destroy_eq_in, in, opcode, MLX5_CMD_OP_DESTROY_EQ);
+ MLX5_SET(destroy_eq_in, in, eq_number, dev->msi_eqn);
+ mlx5st_cmd_exec(dev, in, sizeof(in), out, sizeof(out));
+}
+
/*
* HCA init / teardown
*/
@@ -1366,7 +1461,7 @@ static void mlx5st_create_cq(struct mlx5st_device *dev)
cqc = MLX5_ADDR_OF(create_cq_in, in, cq_context);
MLX5_SET(cqc, cqc, log_cq_size, LOG_CQ_SIZE);
MLX5_SET(cqc, cqc, uar_page, dev->uar_page);
- MLX5_SET(cqc, cqc, c_eqn_or_apu_element, dev->eqn);
+ MLX5_SET(cqc, cqc, c_eqn_or_apu_element, dev->msi_eqn);
MLX5_SET(cqc, cqc, cqe_sz, 0);
pas = MLX5_ADDR_OF(create_cq_in, in, pas);
MLX5_SET(cqc, cqc, page_offset, mlx5st_fill_pas(device, dev->cq_buf, pas));
@@ -1391,6 +1486,30 @@ static void mlx5st_destroy_cq(struct mlx5st_device *dev)
mlx5st_cmd_exec(dev, in, sizeof(in), out, sizeof(out));
}
+/*
+ * Arm CQ for event generation. The CQ event delivery state machine is
+ * single-shot: after generating one EQE the CQ enters "Fired" state and
+ * won't generate another until re-armed via ARM_NEXT. Both the CQ doorbell
+ * record and the UAR CQ doorbell register must be written.
+ */
+static void mlx5st_arm_cq(struct mlx5st_device *dev)
+{
+ u32 sn = dev->cq_arm_sn & 3;
+ u32 ci = dev->cq_ci & 0xffffff;
+ u64 doorbell;
+
+ /* Update CQ doorbell record arm word */
+ WRITE_ONCE(dev->cq_dbrec.send_counter,
+ cpu_to_be32(sn << 28 | ci));
+
+	/* Ring the CQ doorbell register; iowrite has an internal dma_wmb() */
+ doorbell = ((u64)(sn << 28 | ci) << 32) | dev->cqn;
+ iowrite64be(doorbell,
+ (u8 __iomem *)dev->uar_base + MLX5_CQ_DOORBELL_OFFSET);
+
+ dev->cq_arm_sn++;
+}
+
/*
* QP create/destroy
*/
@@ -1647,6 +1766,7 @@ static void mlx5st_teardown_datapath(struct mlx5st_device *dev)
}
dev->sq_pi = 0;
dev->sq_ci = 0;
+ dev->cq_arm_sn = 0;
memset(&dev->qp_dbrec, 0, sizeof(dev->qp_dbrec));
memset(&dev->cq_dbrec, 0, sizeof(dev->cq_dbrec));
}
@@ -1688,6 +1808,34 @@ static int mlx5st_memcpy_wait(struct vfio_pci_device *device)
return ret;
}
+/*
+ * send_msi callback — trigger CQE -> EQE -> MSI-X via a small RDMA Write.
+ *
+ * Both the CQ and MSI EQ use single-shot arming: the CQ must be armed so the
+ * CQE generates an EQE, and the MSI EQ must be armed so the EQE fires MSI-X.
+ */
+static void mlx5st_send_msi(struct vfio_pci_device *device)
+{
+ struct mlx5st_device *dev = to_mlx5st(device);
+
+ /* Drain accumulated MSI EQ events and re-arm for next interrupt */
+ mlx5st_msi_eq_drain(dev);
+
+ /* Arm CQ so the next CQE generates an EQE on the MSI EQ */
+ mlx5st_arm_cq(dev);
+
+ /* Post a signaled RDMA Write to trigger CQE -> EQE -> MSI-X */
+ mlx5st_post_rdma_write(dev,
+ to_iova(device, &dev->send_msi_src),
+ dev->global_lkey,
+ to_iova(device, &dev->send_msi_dst),
+ dev->global_rkey,
+ sizeof(dev->send_msi_src), true);
+
+ /* Consume the CQE to avoid stale completions */
+ VFIO_ASSERT_EQ(mlx5st_poll_cq(dev, MLX5ST_MEMCPY_TIMEOUT_MS), 0);
+}
+
/*
* Driver ops callbacks
*/
@@ -1716,8 +1864,13 @@ static void mlx5st_init(struct vfio_pci_device *device)
mlx5st_alloc_pd(dev);
mlx5st_create_mkey(dev);
+ /* MSI EQ must be created before CQ so CQ can reference its eqn */
+ mlx5st_create_msi_eq(dev);
mlx5st_setup_datapath(dev);
+ vfio_pci_msix_enable(device, MSI_VECTOR, 1);
+ device->driver.msi = MSI_VECTOR;
+
device->driver.max_memcpy_size = 1 << 20;
device->driver.max_memcpy_count = SQ_WQE_CNT - 1;
@@ -1728,8 +1881,14 @@ static void mlx5st_remove(struct vfio_pci_device *device)
{
struct mlx5st_device *dev = to_mlx5st(device);
+ vfio_pci_msix_disable(device);
mlx5st_teardown_datapath(dev);
+ if (dev->have_msi_eq) {
+ mlx5st_destroy_msi_eq(dev);
+ dev->have_msi_eq = false;
+ }
+
dev_dbg(device, "teardown: destroy_mkey\n");
if (dev->mkey_index) {
mlx5st_destroy_mkey(dev);
@@ -1757,5 +1916,5 @@ struct vfio_pci_driver_ops mlx5st_ops = {
.remove = mlx5st_remove,
.memcpy_start = mlx5st_memcpy_start,
.memcpy_wait = mlx5st_memcpy_wait,
- .send_msi = NULL,
+ .send_msi = mlx5st_send_msi,
};
diff --git a/tools/testing/selftests/vfio/lib/drivers/mlx5/mlx5_hw.h b/tools/testing/selftests/vfio/lib/drivers/mlx5/mlx5_hw.h
index a2506ec8a19523..2c451e411ec13f 100644
--- a/tools/testing/selftests/vfio/lib/drivers/mlx5/mlx5_hw.h
+++ b/tools/testing/selftests/vfio/lib/drivers/mlx5/mlx5_hw.h
@@ -80,6 +80,9 @@ struct mlx5st_dbrec {
#define MLX5_BF_OFFSET 0x800
#define MLX5_BF_SIZE 0x100
+/* CQ doorbell offset within UAR page */
+#define MLX5_CQ_DOORBELL_OFFSET 0x20
+
/* EQ doorbell offset within UAR page */
#define MLX5_EQ_DOORBELL_OFFSET 0x40
@@ -94,6 +97,9 @@ struct mlx5st_dbrec {
#define LOG_CQ_SIZE 4
#define EQ_NENT 64
#define LOG_EQ_SIZE 6
+#define MSI_EQ_NENT 16
+#define LOG_MSI_EQ_SIZE 4
+#define MSI_VECTOR 0
#define MAX_FW_PAGES 8192
#define MAX_FW_PAGES_PER_CMD 512
--
2.43.0
* Re: [PATCH 00/11] mlx5 support for VFIO self test
2026-05-01 0:08 [PATCH 00/11] mlx5 support for VFIO self test Jason Gunthorpe
` (10 preceding siblings ...)
2026-05-01 0:08 ` [PATCH 11/11] vfio: selftests: mlx5 driver - add send_msi support Jason Gunthorpe
@ 2026-05-01 16:11 ` David Matlack
2026-05-01 16:43 ` Jason Gunthorpe
2026-05-02 4:31 ` Alex Williamson
12 siblings, 1 reply; 24+ messages in thread
From: David Matlack @ 2026-05-01 16:11 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Alex Williamson, kvm, Leon Romanovsky, linux-kselftest,
linux-rdma, Mark Bloch, netdev, Saeed Mahameed, Shuah Khan,
Tariq Toukan, patches, Josh Hilke
On Thu, Apr 30, 2026 at 5:08 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> Add an mlx5 driver to VFIO self test. This is largely a remix of the
> existing VFIO mlx5 driver in rdma-core. It uses an RDMA loopback QP
> to issue RDMA WRITE operations which effectively perform memory
> copies using DMA. Since mlx5 has a stable programming ABI this
> should work on devices from CX5 to current HW. The device FW must
> support the QP loopback configuration.
> This entire series was coded by Claude Code in about 4 days.
Very exciting. Josh Hilke from Google is also working on using AI to
create a selftest driver for Intel IGB NICs so VFIO selftests can run
in QEMU [1]. So it's encouraging to see you were able to do it with
mlx5.
[1] https://www.qemu.org/docs/master/system/devices/igb.html
> For those interested, the flow I used was broadly a prompt sequence
> sort of like:
>
> - Hey Claude, go look at the falcon series, VFIO self test, the
> mlx5 driver, rdma-core and some PDF documentation and make a
> plan to put mlx5 under the selftest.
> - Write an rdma-core application using the built-in VFIO provider
> that can do the required memcpy operations that vfio selftests
> wants.
> (This resulted in a 1k loc C file that compiled and ran the
> first time but had a few bugs related to device programming
> that the AI resolved.)
> - Replace the rdma-core components with open-coded versions to
> create a fully stand-alone program that does the DMA memcpy.
> - Review and audit the thing.
> [Pause and de-slop it]
> - Make it work on a PF too (this is surprisingly hard!).
Can it work on CX VFs? We're interested in continuously performing
memory copies across a Live Update using a VF via selftests to
demonstrate SR-IOV preservation (when we eventually get there).
> [Move to a kernel tree and copy all the .md files and .c program
> it made]
> - Hey Claude, look at all this stuff and make a broad plan to
> actually build a VFIO self test.
> - Here is my 1 sentence advice on what each patch should look
> like, make a detailed plan to make a patch for every one.
> [Pause and polish the patch plans]
> - Execute plan X then commit it [pause and de-slop each patch,
> repeat].
> [Review and final polish]
>
> It is based on a tree with the falcon series applied.
Thanks for sending, I look forward to reviewing!
* Re: [PATCH 00/11] mlx5 support for VFIO self test
2026-05-01 16:11 ` [PATCH 00/11] mlx5 support for VFIO self test David Matlack
@ 2026-05-01 16:43 ` Jason Gunthorpe
2026-05-04 22:54 ` David Matlack
0 siblings, 1 reply; 24+ messages in thread
From: Jason Gunthorpe @ 2026-05-01 16:43 UTC (permalink / raw)
To: David Matlack
Cc: Alex Williamson, kvm, Leon Romanovsky, linux-kselftest,
linux-rdma, Mark Bloch, netdev, Saeed Mahameed, Shuah Khan,
Tariq Toukan, patches, Josh Hilke
On Fri, May 01, 2026 at 09:11:11AM -0700, David Matlack wrote:
> On Thu, Apr 30, 2026 at 5:08 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
> >
> > Add an mlx5 driver to VFIO self test. This is largely a remix of the
> > existing VFIO mlx5 driver in rdma-core. It uses an RDMA loopback QP
> > to issue RDMA WRITE operations which effectively perform memory
> > copies using DMA. Since mlx5 has a stable programming ABI this
> > should work on devices from CX5 to current HW. The device FW must
> > support the QP loopback configuration.
>
> > This entire series was coded by Claude Code in about 4 days.
>
> Very exciting. Josh Hilke from Google is also working on using AI to
> create a selftest driver for Intel IGB NICs so VFIO selftests can run
> in QEMU [1]. So it's encouraging to see you were able to do it with
> mlx5.
>
> [1] https://www.qemu.org/docs/master/system/devices/igb.html
Yes! I would feed DPDK in as well in this case. Combined with the
kernel driver it should be doable. It is much easier if you understand
how the NIC works, of course. This worked out largely because I
guided it through sufficiently small steps and knew where to find all
the quality reference material.
> > - Make it work on a PF too (this is surprisingly hard!).
>
> Can it work on CX VFs? We're interested in continuously performing
> memory copies across a Live Update using a VF via selftests to
> demonstrate SR-IOV preservation (when we eventually get there).
Yes, I started with VF because it is simpler. The PF support flow
requires a bunch more complicated stuff.
Jason
* Re: [PATCH 00/11] mlx5 support for VFIO self test
2026-05-01 0:08 [PATCH 00/11] mlx5 support for VFIO self test Jason Gunthorpe
` (11 preceding siblings ...)
2026-05-01 16:11 ` [PATCH 00/11] mlx5 support for VFIO self test David Matlack
@ 2026-05-02 4:31 ` Alex Williamson
2026-05-02 13:40 ` Jason Gunthorpe
12 siblings, 1 reply; 24+ messages in thread
From: Alex Williamson @ 2026-05-02 4:31 UTC (permalink / raw)
To: Jason Gunthorpe, David Matlack, kvm, Leon Romanovsky,
linux-kselftest, linux-rdma, Mark Bloch, netdev, Saeed Mahameed,
Shuah Khan, Tariq Toukan
Cc: patches
On Thu, Apr 30, 2026, at 6:08 PM, Jason Gunthorpe wrote:
> Add an mlx5 driver to VFIO self test. This is largely a remix of the
> existing VFIO mlx5 driver in rdma-core. It uses an RDMA loopback QP
> to issue RDMA WRITE operations which effectively perform memory
> copies using DMA. Since mlx5 has a stable programming ABI this
> should work on devices from CX5 to current HW. The device FW must
> support the QP loopback configuration.
Does the PCI ID table in the series need some pruning then? It includes CX4.
Thanks,
Alex
* Re: [PATCH 07/11] vfio: selftests: Allow drivers to specify required region size
2026-05-01 0:08 ` [PATCH 07/11] vfio: selftests: Allow drivers to specify required region size Jason Gunthorpe
@ 2026-05-02 8:33 ` Manuel Ebner
2026-05-04 20:55 ` David Matlack
1 sibling, 0 replies; 24+ messages in thread
From: Manuel Ebner @ 2026-05-02 8:33 UTC (permalink / raw)
To: Jason Gunthorpe, Alex Williamson, David Matlack, kvm,
Leon Romanovsky, linux-kselftest, linux-rdma, Mark Bloch, netdev,
Saeed Mahameed, Shuah Khan, Tariq Toukan
Cc: patches
Hi,
On Thu, 2026-04-30 at 21:08 -0300, Jason Gunthorpe wrote:
> Add a region_size field to struct vfio_pci_driver_ops so drivers can
> declare how much DMA-mapped region they need. The mlx5 driver will
> need ~18MB for firmware pages. Existing drivers leave region_size as
> 0 and get the current default of SZ_2M.
>
> Assisted-by: Claude:claude-opus-4.6
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> ---
> .../selftests/vfio/lib/include/libvfio/vfio_pci_driver.h | 3 +++
> tools/testing/selftests/vfio/vfio_pci_driver_test.c | 3 ++-
> 2 files changed, 5 insertions(+), 1 deletion(-)
>
> diff --git
> a/tools/testing/selftests/vfio/lib/include/libvfio/vfio_pci_driver.h
> b/tools/testing/selftests/vfio/lib/include/libvfio/vfio_pci_driver.h
> index e5ada209b1d102..fa172635632453 100644
> --- a/tools/testing/selftests/vfio/lib/include/libvfio/vfio_pci_driver.h
> +++ b/tools/testing/selftests/vfio/lib/include/libvfio/vfio_pci_driver.h
> @@ -9,6 +9,9 @@ struct vfio_pci_device;
> struct vfio_pci_driver_ops {
> const char *name;
>
> + /* Minimum driver region size, 0 = default SZ_2M */
> + u64 region_size;
I guess I do not understand this comment, but that's no surprise
because I'm new to the kernel.
Would one of my suggestions be better?
+ /* Minimum driver region size == 0 -> default = SZ_2M */
or
+ /* Minimum driver region size, default SZ_2M = 0 */
> [...]
Manuel
* Re: [PATCH 09/11] vfio: selftests: Add mlx5 driver - HW init and command interface
2026-05-01 0:08 ` [PATCH 09/11] vfio: selftests: Add mlx5 driver - HW init and command interface Jason Gunthorpe
@ 2026-05-02 9:35 ` Manuel Ebner
2026-05-04 22:35 ` David Matlack
1 sibling, 0 replies; 24+ messages in thread
From: Manuel Ebner @ 2026-05-02 9:35 UTC (permalink / raw)
To: Jason Gunthorpe, Alex Williamson, David Matlack, kvm,
Leon Romanovsky, linux-kselftest, linux-rdma, Mark Bloch, netdev,
Saeed Mahameed, Shuah Khan, Tariq Toukan
Cc: patches
Hi Jason,
I've gone through your patch, just some minor and some optional things.
On Thu, 2026-04-30 at 21:08 -0300, Jason Gunthorpe wrote:
> [...]
> diff --git a/tools/testing/selftests/vfio/lib/drivers/mlx5/mlx5.c
> b/tools/testing/selftests/vfio/lib/drivers/mlx5/mlx5.c
> new file mode 100644
> index 00000000000000..0ab941bad7a66c
> --- /dev/null
> +++ b/tools/testing/selftests/vfio/lib/drivers/mlx5/mlx5.c
> @@ -0,0 +1,1406 @@
> +// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
> +/*
> + * mlx5 VFIO selftest driver
> + *
> + * Programs mlx5 ConnectX VFs and PFs through the bare-metal command
> interface
> + * and RDMA Write self-loopback to perform DMA. Implements
Write -> write
otherwise it reads like a new sentence starts there
> + * (probe/init/remove) and plugs into the VFIO selftest framework.
> + */
> +#include <stdint.h>
> +#include <stdbool.h>
> +#include <string.h>
> +#include <time.h>
> +#include <sched.h>
> +#include <unistd.h>
> +#include <stdlib.h>
> +
> +#include <linux/errno.h>
> +#include <linux/io.h>
> +#include <linux/log2.h>
> +#include <linux/pci_regs.h>
Sort alphabetically:
+#include <sched.h>
+#include <stdbool.h>
+#include <stdint.h>
+#include <stdlib.h>
+#include <string.h>
+#include <time.h>
+#include <unistd.h>
+#include <linux/errno.h>
+#include <linux/io.h>
+#include <linux/log2.h>
+#include <linux/pci_regs.h>
> +
> +#include <libvfio.h>
> +
> +#include "mlx5_hw.h"
Why are these two lines grouped individually?
> +/* Forward declaration — cmd_exec polls events during command wait */
> +static void mlx5st_process_events(struct mlx5st_device *dev);
> +
> +static const char *mlx5st_cmd_name(u16 opcode)
> +{
> + switch (opcode) {
> + case MLX5_CMD_OP_QUERY_HCA_CAP: return "QUERY_HCA_CAP";
> + case MLX5_CMD_OP_INIT_HCA: return "INIT_HCA";
> + case MLX5_CMD_OP_TEARDOWN_HCA: return "TEARDOWN_HCA";
> + case MLX5_CMD_OP_ENABLE_HCA: return "ENABLE_HCA";
> + case MLX5_CMD_OP_DISABLE_HCA: return "DISABLE_HCA";
> + [...]
> + }
> +}
Maybe line them up like the following (depends on your taste):
+ switch (opcode) {
+ case MLX5_CMD_OP_QUERY_HCA_CAP: return "QUERY_HCA_CAP";
+ case MLX5_CMD_OP_INIT_HCA: return "INIT_HCA";
+ case MLX5_CMD_OP_TEARDOWN_HCA: return "TEARDOWN_HCA";
+ case MLX5_CMD_OP_ENABLE_HCA: return "ENABLE_HCA";
+ case MLX5_CMD_OP_DISABLE_HCA: return "DISABLE_HCA";
+ case MLX5_CMD_OP_QUERY_PAGES: return "QUERY_PAGES";
+ [...]
> +
> + /*
> + * Compute signatures: mailbox blocks first, then cmd_queue_entry
> last.
> + * The sig must cover the final state including ownership=0x1, but
> + * we must not set ownership until after the sig is in place —
> + * XOR in the 0x1 without storing it to memory.
> + */
- " last"
+ /*
+ * Compute signatures: mailbox blocks first, then cmd_queue_entry.
Thanks
Manuel
* Re: [PATCH 00/11] mlx5 support for VFIO self test
2026-05-02 4:31 ` Alex Williamson
@ 2026-05-02 13:40 ` Jason Gunthorpe
0 siblings, 0 replies; 24+ messages in thread
From: Jason Gunthorpe @ 2026-05-02 13:40 UTC (permalink / raw)
To: Alex Williamson
Cc: David Matlack, kvm, Leon Romanovsky, linux-kselftest, linux-rdma,
Mark Bloch, netdev, Saeed Mahameed, Shuah Khan, Tariq Toukan,
patches
On Fri, May 01, 2026 at 10:31:49PM -0600, Alex Williamson wrote:
> On Thu, Apr 30, 2026, at 6:08 PM, Jason Gunthorpe wrote:
> > Add an mlx5 driver to VFIO self test. This is largely a remix of the
> > existing VFIO mlx5 driver in rdma-core. It uses an RDMA loopback QP
> > to issue RDMA WRITE operations which effectively perform memory
> > copies using DMA. Since mlx5 has a stable programming ABI this
> > should work on devices from CX5 to current HW. The device FW must
> > support the QP loopback configuration.
>
> Does the PCI ID table in the series need some pruning then? It
> includes CX4.
Maybe. I don't actually know that it doesn't work on CX4 and earlier;
I was only able to test as far back as CX5. I'm thinking to leave it,
and if someone tries it and it doesn't work, remove it then.
Jason
* Re: [PATCH 07/11] vfio: selftests: Allow drivers to specify required region size
2026-05-01 0:08 ` [PATCH 07/11] vfio: selftests: Allow drivers to specify required region size Jason Gunthorpe
2026-05-02 8:33 ` Manuel Ebner
@ 2026-05-04 20:55 ` David Matlack
1 sibling, 0 replies; 24+ messages in thread
From: David Matlack @ 2026-05-04 20:55 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Alex Williamson, kvm, Leon Romanovsky, linux-kselftest,
linux-rdma, Mark Bloch, netdev, Saeed Mahameed, Shuah Khan,
Tariq Toukan, patches
On 2026-04-30 09:08 PM, Jason Gunthorpe wrote:
> Add a region_size field to struct vfio_pci_driver_ops so drivers can
> declare how much DMA-mapped region they need. The mlx5 driver will
> need ~18MB for firmware pages. Existing drivers leave region_size as
> 0 and get the current default of SZ_2M.
I would like to get rid of the magic SZ_2M to make it easier for other
tests to use the driver framework. Can you make this commit update all
the drivers to set region_size? They can all use the same approach:
struct vfio_pci_driver_ops foo_driver = {
...
.region_size = roundup_pow_of_two(sizeof(struct foo)),
...
};
* Re: [PATCH 08/11] vfio: selftests: Add dev_dbg
2026-05-01 0:08 ` [PATCH 08/11] vfio: selftests: Add dev_dbg Jason Gunthorpe
@ 2026-05-04 21:15 ` David Matlack
0 siblings, 0 replies; 24+ messages in thread
From: David Matlack @ 2026-05-04 21:15 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Alex Williamson, kvm, Leon Romanovsky, linux-kselftest,
linux-rdma, Mark Bloch, netdev, Saeed Mahameed, Shuah Khan,
Tariq Toukan, patches
On 2026-04-30 09:08 PM, Jason Gunthorpe wrote:
> Enable it with a #define DEBUG at the top of the file. Allows leaving
> behind debugging prints that are useful in case future changes are
> required.
>
> Assisted-by: Claude:claude-opus-4.6
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> ---
> .../selftests/vfio/lib/include/libvfio/vfio_pci_device.h | 6 ++++++
> 1 file changed, 6 insertions(+)
>
> diff --git a/tools/testing/selftests/vfio/lib/include/libvfio/vfio_pci_device.h b/tools/testing/selftests/vfio/lib/include/libvfio/vfio_pci_device.h
> index bb4525abd01a22..2d587b988c09fa 100644
> --- a/tools/testing/selftests/vfio/lib/include/libvfio/vfio_pci_device.h
> +++ b/tools/testing/selftests/vfio/lib/include/libvfio/vfio_pci_device.h
> @@ -38,6 +38,12 @@ struct vfio_pci_device {
> #define dev_info(_dev, _fmt, ...) printf("%s: " _fmt, (_dev)->bdf, ##__VA_ARGS__)
> #define dev_err(_dev, _fmt, ...) fprintf(stderr, "%s: " _fmt, (_dev)->bdf, ##__VA_ARGS__)
>
> +#ifdef DEBUG
> +#define dev_dbg dev_info
> +#else
> +#define dev_dbg(_dev, _fmt, ...) do { } while (0)
Can you add something to make sure the format strings are still
validated by the compiler even if DEBUG is not defined? (since it
will almost never be defined). e.g.
diff --git a/tools/testing/selftests/vfio/lib/include/libvfio/assert.h b/tools/testing/selftests/vfio/lib/include/libvfio/assert.h
index f4ebd122d9b6..406c430ef28d 100644
--- a/tools/testing/selftests/vfio/lib/include/libvfio/assert.h
+++ b/tools/testing/selftests/vfio/lib/include/libvfio/assert.h
@@ -51,4 +51,9 @@
VFIO_ASSERT_EQ(__ret, 0, "ioctl(%s, %s, %s) returned %d\n", #_fd, #_op, #_arg, __ret); \
} while (0)
+ __attribute__((__format__(__printf__, 1, 2)))
+static inline void check_format_string(const char *fmt, ...)
+{
+}
+
#endif /* SELFTESTS_VFIO_LIB_INCLUDE_LIBVFIO_ASSERT_H */
diff --git a/tools/testing/selftests/vfio/lib/include/libvfio/vfio_pci_device.h b/tools/testing/selftests/vfio/lib/include/libvfio/vfio_pci_device.h
index 2d587b988c09..3abfa6ff481c 100644
--- a/tools/testing/selftests/vfio/lib/include/libvfio/vfio_pci_device.h
+++ b/tools/testing/selftests/vfio/lib/include/libvfio/vfio_pci_device.h
@@ -39,9 +39,9 @@ struct vfio_pci_device {
#define dev_err(_dev, _fmt, ...) fprintf(stderr, "%s: " _fmt, (_dev)->bdf, ##__VA_ARGS__)
#ifdef DEBUG
-#define dev_dbg dev_info
+#define dev_dbg(_dev, _fmt, ...) dev_info(_dev, _fmt, ##__VA_ARGS__)
#else
-#define dev_dbg(_dev, _fmt, ...) do { } while (0)
+#define dev_dbg(_dev, _fmt, ...) check_format_string(_fmt, ##__VA_ARGS__)
#endif
struct vfio_pci_device *vfio_pci_device_init(const char *bdf, struct iommu *iommu);
* Re: [PATCH 05/11] selftests: Add additional kernel functions to tools/include/
2026-05-01 0:08 ` [PATCH 05/11] selftests: Add additional kernel functions to tools/include/ Jason Gunthorpe
@ 2026-05-04 21:48 ` David Matlack
0 siblings, 0 replies; 24+ messages in thread
From: David Matlack @ 2026-05-04 21:48 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Alex Williamson, kvm, Leon Romanovsky, linux-kselftest,
linux-rdma, Mark Bloch, netdev, Saeed Mahameed, Shuah Khan,
Tariq Toukan, patches
On 2026-04-30 09:08 PM, Jason Gunthorpe wrote:
> These are needed by the VFIO mlx5 selftest in the following patches,
> which includes some headers from mlx5 and also needs a few more
> MMIO-related features.
>
> - DECLARE_FLEX_ARRAY in new tools/include/linux/stddef.h (wraps
> existing __DECLARE_FLEX_ARRAY from uapi/linux/stddef.h)
Is this needed? I don't see it used anywhere.
$ git grep DECLARE_FLEX_ARRAY tools/testing/selftests/vfio
* Re: [PATCH 09/11] vfio: selftests: Add mlx5 driver - HW init and command interface
2026-05-01 0:08 ` [PATCH 09/11] vfio: selftests: Add mlx5 driver - HW init and command interface Jason Gunthorpe
2026-05-02 9:35 ` Manuel Ebner
@ 2026-05-04 22:35 ` David Matlack
1 sibling, 0 replies; 24+ messages in thread
From: David Matlack @ 2026-05-04 22:35 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Alex Williamson, kvm, Leon Romanovsky, linux-kselftest,
linux-rdma, Mark Bloch, netdev, Saeed Mahameed, Shuah Khan,
Tariq Toukan, patches
On 2026-04-30 09:08 PM, Jason Gunthorpe wrote:
> +/*
> + * Driver state — overlaid on device->driver.region.vaddr.
> + *
> + * Contains both software-only state and HW-visible DMA buffers. HW buffers need
> + * strict IOVA alignment.
> + */
> +struct mlx5st_device {
Can we do s/mlx5st/mlx5/ on the series?
I assume st is for selftests and that is already implied by this file
being under tools/testing/selftests.
> +/*
> + * Probe — match mlx5 devices by PCI vendor/device ID.
> + */
> +
> +#define PCI_VENDOR_ID_MELLANOX 0x15b3
nit: Use #include <linux/pci_ids.h>
> +static int mlx5st_probe(struct vfio_pci_device *device)
> +{
> + static const u16 mlx5st_pci_ids[] = {
> + 0x1011, /* Connect-IB */
> + 0x1012, /* Connect-IB VF */
> + 0x1013, /* ConnectX-4 */
> + 0x1014, /* ConnectX-4 VF */
> + 0x1015, /* ConnectX-4LX */
> + 0x1016, /* ConnectX-4LX VF */
> + 0x1017, /* ConnectX-5 */
> + 0x1018, /* ConnectX-5 VF */
> + 0x1019, /* ConnectX-5 Ex */
> + 0x101a, /* ConnectX-5 Ex VF */
> + 0x101b, /* ConnectX-6 */
> + 0x101c, /* ConnectX-6 VF */
> + 0x101d, /* ConnectX-6 Dx */
> + 0x101e, /* ConnectX-6 Dx VF */
> + 0x101f, /* ConnectX-6 LX */
> + 0x1021, /* ConnectX-7 */
> + 0x1023, /* ConnectX-8 */
> + 0x1025, /* ConnectX-9 */
> + 0x1027, /* ConnectX-10 */
> + 0x2101, /* ConnectX-10 NVLink-C2C */
> + 0xa2d2, /* BlueField integrated ConnectX-5 */
> + 0xa2d3, /* BlueField integrated ConnectX-5 VF */
> + 0xa2d6, /* BlueField-2 integrated ConnectX-6 Dx */
> + 0xa2dc, /* BlueField-3 integrated ConnectX-7 */
> + 0xa2df, /* BlueField-4 integrated ConnectX-8 */
> + };
> + unsigned int i;
> + u16 did;
> +
> + if (vfio_pci_config_readw(device, PCI_VENDOR_ID) !=
> + PCI_VENDOR_ID_MELLANOX)
> + return -ENODEV;
> +
> + did = vfio_pci_config_readw(device, PCI_DEVICE_ID);
> + for (i = 0; i < ARRAY_SIZE(mlx5st_pci_ids); i++) {
> + if (mlx5st_pci_ids[i] == did)
> + return 0;
> + }
> +
> + return -ENODEV;
> +}
* Re: [PATCH 10/11] vfio: selftests: Add mlx5 driver - data path and memcpy ops
2026-05-01 0:08 ` [PATCH 10/11] vfio: selftests: Add mlx5 driver - data path and memcpy ops Jason Gunthorpe
@ 2026-05-04 22:41 ` David Matlack
0 siblings, 0 replies; 24+ messages in thread
From: David Matlack @ 2026-05-04 22:41 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Alex Williamson, kvm, Leon Romanovsky, linux-kselftest,
linux-rdma, Mark Bloch, netdev, Saeed Mahameed, Shuah Khan,
Tariq Toukan, patches
On 2026-04-30 09:08 PM, Jason Gunthorpe wrote:
> @@ -1368,6 +1716,11 @@ static void mlx5st_init(struct vfio_pci_device *device)
> mlx5st_alloc_pd(dev);
> mlx5st_create_mkey(dev);
>
> + mlx5st_setup_datapath(dev);
> +
> + device->driver.max_memcpy_size = 1 << 20;
> + device->driver.max_memcpy_count = SQ_WQE_CNT - 1;
What are these limits a function of? e.g. Is the 1MB size a hardware
limit? Can we change SQ_WQE_CNT in the future to increase max count?
I'm interested in this because for Live Update testing we've found it
valuable to keep the device busy for several minutes so that it can do
DMA continuously throughout the Live Update.
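As a sketch of what I mean (illustrative only; the helper below is
hypothetical, and the op names and signatures are assumed from the
driver framework in this series):

	/* Keep the device DMAing by cycling start/wait in a loop. */
	static void memcpy_busy_loop(struct vfio_pci_device *device,
				     iova_t src, iova_t dst, u64 size,
				     u64 iters)
	{
		u64 i;

		for (i = 0; i < iters; i++) {
			device->driver.ops->memcpy_start(device, src, dst,
							 size, 1);
			VFIO_ASSERT_EQ(device->driver.ops->memcpy_wait(device),
				       0);
		}
	}
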
> +
> dev_dbg(device, "mlx5 driver initialized\n");
> }
* Re: [PATCH 00/11] mlx5 support for VFIO self test
2026-05-01 16:43 ` Jason Gunthorpe
@ 2026-05-04 22:54 ` David Matlack
0 siblings, 0 replies; 24+ messages in thread
From: David Matlack @ 2026-05-04 22:54 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Alex Williamson, kvm, Leon Romanovsky, linux-kselftest,
linux-rdma, Mark Bloch, netdev, Saeed Mahameed, Shuah Khan,
Tariq Toukan, patches, Josh Hilke
On 2026-05-01 01:43 PM, Jason Gunthorpe wrote:
> On Fri, May 01, 2026 at 09:11:11AM -0700, David Matlack wrote:
> > On Thu, Apr 30, 2026 at 5:08 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
> > >
> > > Add an mlx5 driver to VFIO self test. This is largely a remix of the
> > > existing VFIO mlx5 driver in rdma-core. It uses an RDMA loopback QP
> > > to issue RDMA WRITE operations which effectively perform memory
> > > copies using DMA. Since mlx5 has a stable programming ABI this
> > > should work on devices from CX5 to current HW. The device FW must
> > > support the QP loopback configuration.
> >
> > > This entire series was coded by Claude Code in about 4 days.
> >
> > Very exciting. Josh Hilke from Google is also working on using AI to
> > create a selftest driver for Intel IGB NICs so VFIO selftests can run
> > in QEMU [1]. So it's encouraging to see you were able to do it with
> > mlx5.
> >
> > [1] https://www.qemu.org/docs/master/system/devices/igb.html
>
> Yes! I would feed DPDK in as well in this case. Combined with the
> kernel driver it should be doable. It is much easier if you understand
> how the NIC works, of course. This worked out largely because I
> guided it through sufficiently small steps and knew where to find all
> the quality reference material.
>
> > > - Make it work on a PF too (this is surprisingly hard!).
> >
> > Can it work on CX VFs? We're interested in continuously performing
> > memory copies across a Live Update using a VF via selftests to
> > demonstrate SR-IOV preservation (when we eventually get there).
>
> Yes, I started with VF because it is simpler.
Makes sense. I tested it out and was able to get vfio_pci_driver_test
passing with a CX7 VF.
> The PF support flow requires a bunch more complicated stuff.
Do you think it's worth supporting PFs? If anyone with a CX NIC can
enable SR-IOV and run selftests on a VF then we can keep the driver
somewhat simpler.
Thread overview: 24+ messages
2026-05-01 0:08 [PATCH 00/11] mlx5 support for VFIO self test Jason Gunthorpe
2026-05-01 0:08 ` [PATCH 01/11] net/mlx5: Add IFC structures for CQE and WQE Jason Gunthorpe
2026-05-01 0:08 ` [PATCH 02/11] net/mlx5: Move HW constant groups from device.h/cq.h to mlx5_ifc.h Jason Gunthorpe
2026-05-01 0:08 ` [PATCH 03/11] net/mlx5: Extract MLX5_SET/GET macros into mlx5_ifc_macros.h Jason Gunthorpe
2026-05-01 0:08 ` [PATCH 04/11] net/mlx5: Add ONCE and MMIO accessor variants to mlx5_ifc_macros.h Jason Gunthorpe
2026-05-01 0:08 ` [PATCH 05/11] selftests: Add additional kernel functions to tools/include/ Jason Gunthorpe
2026-05-04 21:48 ` David Matlack
2026-05-01 0:08 ` [PATCH 06/11] selftests: Fix arm64 IO barriers to match kernel Jason Gunthorpe
2026-05-01 0:08 ` [PATCH 07/11] vfio: selftests: Allow drivers to specify required region size Jason Gunthorpe
2026-05-02 8:33 ` Manuel Ebner
2026-05-04 20:55 ` David Matlack
2026-05-01 0:08 ` [PATCH 08/11] vfio: selftests: Add dev_dbg Jason Gunthorpe
2026-05-04 21:15 ` David Matlack
2026-05-01 0:08 ` [PATCH 09/11] vfio: selftests: Add mlx5 driver - HW init and command interface Jason Gunthorpe
2026-05-02 9:35 ` Manuel Ebner
2026-05-04 22:35 ` David Matlack
2026-05-01 0:08 ` [PATCH 10/11] vfio: selftests: Add mlx5 driver - data path and memcpy ops Jason Gunthorpe
2026-05-04 22:41 ` David Matlack
2026-05-01 0:08 ` [PATCH 11/11] vfio: selftests: mlx5 driver - add send_msi support Jason Gunthorpe
2026-05-01 16:11 ` [PATCH 00/11] mlx5 support for VFIO self test David Matlack
2026-05-01 16:43 ` Jason Gunthorpe
2026-05-04 22:54 ` David Matlack
2026-05-02 4:31 ` Alex Williamson
2026-05-02 13:40 ` Jason Gunthorpe