* Implement initial driver for virtio-RDMA device(kernel)
@ 2025-12-18 9:09 Xiong Weimin
2025-12-18 9:09 ` [PATCH 01/10] drivers/infiniband/hw/virtio: Initial driver for virtio RDMA devices Xiong Weimin
` (11 more replies)
0 siblings, 12 replies; 18+ messages in thread
From: Xiong Weimin @ 2025-12-18 9:09 UTC
To: Michael S. Tsirkin, David Hildenbrand, Jason Wang,
Stefano Garzarella, Thomas Monjalon, David Marchand,
Luca Boccassi, Kevin Traynor, Christian Ehrhardt, Xuan Zhuo,
Eugenio Pérez, Xueming Li, Maxime Coquelin, Chenbo Xia,
Bruce Richardson
Cc: kvm, virtualization, netdev
Hi all,
These testing instructions describe how to emulate a soft RoCE
device on a normal NIC (no RDMA hardware). We have finished a
vhost-user RDMA device demo, which supports RDMA features such as
CM and the UC/UD QP types.
The testing instructions for the demo are as follows:
1. Test Environment Configuration
Hardware Environment
Server: 1
CPU: HUAWEI Kunpeng 920 (96 cores)
Memory: 3T DDR4
NIC: TAP (paired virtio-net device for RDMA)
Software Environment
Host kernel: 6.4.0-10.1.0.20.oe2309.aarch64
Guest kernel: linux-6.16.8 (with the vrdma module)
QEMU: 9.0.2 (compiled with vhost-user-rdma virtual device support)
DPDK: 24.07.0-rc2
Dependencies:
rdma-core
rdma_rxe
libibverbs-dev
2. Test Procedures
a. Start DPDK with vhost-user-rdma first:
1). Configure Hugepages
echo 2048 | sudo tee /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
2). Start the application:
/DPDKDIR/build/examples/dpdk-vhost_user_rdma -l 1-4 -n 4 --vdev "net_tap0" -- --socket-file /tmp/vhost-rdma0
b. Boot the guest kernel with QEMU; command line:
...
-netdev tap,id=hostnet1,ifname=tap1,script=no,downscript=no
-device virtio-net-pci,netdev=hostnet1,id=net1,mac=52:54:00:14:72:30,bus=pci.3,addr=0x0.0,multifunction=on
-chardev socket,path=/tmp/vhost-rdma0,id=vurdma
-device vhost-user-rdma-pci,bus=pci.3,addr=0x0.1,page-per-vq=on,disable-legacy=on,chardev=vurdma
...
c. Guest Kernel Module Loading and Validation
# Load the vrdma kernel module
sudo modprobe vrdma
# Verify module loading
lsmod | grep vrdma
# Check kernel logs
dmesg | grep -i rdma
# Expected output:
[ 4.935473] vrdma_init_device: Initializing vRDMA device with max_cq=64, max_qp=64
[ 4.949888] [vrdma_init_device]: Successfully initialized, last qp_vq index=192
[ 4.949907] [vrdma_init_netdev]: Found paired net_device 'enp3s0f0' (on 0000:03:00.0)
[ 4.949924] Bound vRDMA device to net_device 'enp3s0f0'
[ 5.026032] vrdma virtio2: vrdma_alloc_pd: allocated PD 1
[ 5.028006] Successfully registered vRDMA device as 'vrdma0'
[ 5.028020] [vrdma_probe]: Successfully probed VirtIO RDMA device (index=2)
[ 5.028104] VirtIO RDMA driver initialized successfully
d. Inside the VM, the RDMA device nodes are generated under /dev/infiniband:
[root@localhost ~]# ll -h /dev/infiniband/
total 0
drwxr-xr-x. 2 root root 60 Dec 17 11:24 by-ibdev
drwxr-xr-x. 2 root root 60 Dec 17 11:24 by-path
crw-rw-rw-. 1 root root 10, 259 Dec 17 11:24 rdma_cm
crw-rw-rw-. 1 root root 231, 192 Dec 17 11:24 uverbs0
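Optionally, the device can be further verified from inside the guest
(assuming the rdma-core user-space utilities are installed; option
spellings may vary by rdma-core version):
ibv_devices
ibv_devinfo -d vrdma0
A UD ping-pong using the libibverbs example programs, e.g.
ibv_ud_pingpong -d vrdma0 -g 0, can also be used to exercise the UC/UD
path mentioned above.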
e. The following are to be done in a future version:
1). SRQ support
2). DPDK support for physical RDMA NICs, to handle the datapath between frontend and backend
3). Reset of VirtQueue
4). Increase size of VirtQueue for PCI transport
5). Performance Testing
f. Test Results
1). Functional Test Results:
Kernel module loading: PASS - module loaded without errors
DPDK startup: PASS - vhost-user-rdma backend initialized
QEMU VM launch: PASS - VM booted with the RDMA device
Network connectivity: PASS - host-VM communication established
RDMA device detection: PASS - virtual RDMA device recognized
g. Test Conclusion
1). Full functional compliance with specifications
2). Stable operation under extended stress conditions
Recommendations:
1). Optimize memory copy paths for higher throughput
2). Enhance error handling and recovery mechanisms
* [PATCH 01/10] drivers/infiniband/hw/virtio: Initial driver for virtio RDMA devices
2025-12-18 9:09 Implement initial driver for virtio-RDMA device(kernel) Xiong Weimin
@ 2025-12-18 9:09 ` Xiong Weimin
2025-12-21 9:11 ` Leon Romanovsky
2025-12-18 9:09 ` [PATCH 02/10] drivers/infiniband/hw/virtio: add vrdma_exec_verbs_cmd to construct verbs sgs using virtio Xiong Weimin
` (10 subsequent siblings)
11 siblings, 1 reply; 18+ messages in thread
From: Xiong Weimin @ 2025-12-18 9:09 UTC
To: Michael S. Tsirkin, David Hildenbrand, Jason Wang,
Stefano Garzarella, Thomas Monjalon, David Marchand,
Luca Boccassi, Kevin Traynor, Christian Ehrhardt, Xuan Zhuo,
Eugenio Pérez, Xueming Li, Maxime Coquelin, Chenbo Xia,
Bruce Richardson
Cc: kvm, virtualization, netdev, xiongweimin
From: xiongweimin <xiongweimin@kylinos.cn>
This commit introduces a new driver for RDMA over virtio, enabling
RDMA capabilities in virtualized environments. The driver consists
of the following main components:
1. Driver registration with the virtio subsystem and device discovery.
2. Device probe and remove handlers for managing the device lifecycle.
3. Initialization of the InfiniBand device attributes by reading the
virtio configuration space, including conversion from little-endian
to CPU byte order and capability mapping.
4. Setup of virtqueues for:
- Control commands (no callback)
- Completion queues (with callback for CQ events)
- Send and receive queues for queue pairs (no callbacks)
5. Integration with the network device layer for RoCE support.
6. Registration with the InfiniBand core subsystem.
7. Comprehensive error handling during initialization and a symmetric
teardown process.
Key features:
- Support for multiple virtqueues based on device capabilities (max_cq, max_qp)
- Fast doorbell optimization when notify_offset_multiplier equals PAGE_SIZE
- Safe resource management with rollback on failure
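For reference, the virtqueue index layout set up by vrdma_init_device()
can be sketched as follows (illustrative helpers, not part of the patch);
with max_cq=64 and max_qp=64 from the cover letter this gives 193 vqs,
matching the "last qp_vq index=192" line in the expected dmesg output:

/* vq 0: ctrl; vqs 1..max_cq: CQs; then a send/recv vq pair per QP */
static unsigned int vrdma_cq_vq_idx(unsigned int cq)
{
	return 1 + cq;
}

static unsigned int vrdma_sq_vq_idx(const struct vrdma_dev *dev,
				    unsigned int qp)
{
	return 1 + dev->attr.max_cq + 2 * qp; /* recv vq is this + 1 */
}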
Signed-off-by: Xiong Weimin <xiongweimin@kylinos.cn>
---
linux-6.16.8/drivers/infiniband/Kconfig | 1 +
linux-6.16.8/drivers/infiniband/hw/Makefile | 1 +
.../drivers/infiniband/hw/virtio/Kconfig | 6 +
.../drivers/infiniband/hw/virtio/Makefile | 5 +
.../drivers/infiniband/hw/virtio/vrdma.h | 82 ++++++
.../drivers/infiniband/hw/virtio/vrdma_dev.c | 272 ++++++++++++++++++
.../drivers/infiniband/hw/virtio/vrdma_dev.h | 16 ++
.../infiniband/hw/virtio/vrdma_dev_api.h | 116 ++++++++
.../drivers/infiniband/hw/virtio/vrdma_ib.c | 172 +++++++++++
.../drivers/infiniband/hw/virtio/vrdma_ib.h | 81 ++++++
.../drivers/infiniband/hw/virtio/vrdma_main.c | 159 ++++++++++
.../infiniband/hw/virtio/vrdma_netdev.c | 105 +++++++
.../infiniband/hw/virtio/vrdma_netdev.h | 14 +
.../infiniband/hw/virtio/vrdma_queue.c | 21 ++
.../infiniband/hw/virtio/vrdma_queue.h | 14 +
linux-6.16.8/include/rdma/vrdma_abi.h | 62 ++++
linux-6.16.8/include/uapi/linux/virtio_ids.h | 1 +
.../include/uapi/rdma/ib_user_ioctl_verbs.h | 1 +
18 files changed, 1129 insertions(+)
create mode 100644 linux-6.16.8/drivers/infiniband/hw/virtio/Kconfig
create mode 100644 linux-6.16.8/drivers/infiniband/hw/virtio/Makefile
create mode 100644 linux-6.16.8/drivers/infiniband/hw/virtio/vrdma.h
create mode 100644 linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_dev.c
create mode 100644 linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_dev.h
create mode 100644 linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_dev_api.h
create mode 100644 linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_ib.c
create mode 100644 linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_ib.h
create mode 100644 linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_main.c
create mode 100644 linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_netdev.c
create mode 100644 linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_netdev.h
create mode 100644 linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_queue.c
create mode 100644 linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_queue.h
create mode 100644 linux-6.16.8/include/rdma/vrdma_abi.h
diff --git a/linux-6.16.8/drivers/infiniband/Kconfig b/linux-6.16.8/drivers/infiniband/Kconfig
index a5827d11e..9ba5f5628 100644
--- a/linux-6.16.8/drivers/infiniband/Kconfig
+++ b/linux-6.16.8/drivers/infiniband/Kconfig
@@ -94,6 +94,7 @@ source "drivers/infiniband/hw/ocrdma/Kconfig"
source "drivers/infiniband/hw/qedr/Kconfig"
source "drivers/infiniband/hw/qib/Kconfig"
source "drivers/infiniband/hw/usnic/Kconfig"
+source "drivers/infiniband/hw/virtio/Kconfig"
source "drivers/infiniband/hw/vmw_pvrdma/Kconfig"
source "drivers/infiniband/sw/rdmavt/Kconfig"
endif # !UML
diff --git a/linux-6.16.8/drivers/infiniband/hw/Makefile b/linux-6.16.8/drivers/infiniband/hw/Makefile
index aba96ca9b..63253a066 100644
--- a/linux-6.16.8/drivers/infiniband/hw/Makefile
+++ b/linux-6.16.8/drivers/infiniband/hw/Makefile
@@ -14,4 +14,5 @@ obj-$(CONFIG_INFINIBAND_HFI1) += hfi1/
obj-$(CONFIG_INFINIBAND_HNS_HIP08) += hns/
obj-$(CONFIG_INFINIBAND_QEDR) += qedr/
obj-$(CONFIG_INFINIBAND_BNXT_RE) += bnxt_re/
+obj-$(CONFIG_INFINIBAND_VIRTIO_RDMA) += virtio/
obj-$(CONFIG_INFINIBAND_ERDMA) += erdma/
diff --git a/linux-6.16.8/drivers/infiniband/hw/virtio/Kconfig b/linux-6.16.8/drivers/infiniband/hw/virtio/Kconfig
new file mode 100644
index 000000000..a5624f98f
--- /dev/null
+++ b/linux-6.16.8/drivers/infiniband/hw/virtio/Kconfig
@@ -0,0 +1,6 @@
+config INFINIBAND_VIRTIO_RDMA
+ tristate "VirtIO Paravirtualized RDMA Driver"
+ depends on NETDEVICES && ETHERNET && PCI && INET && VIRTIO
+ help
+ This driver provides low-level support for the VirtIO
+ paravirtualized RDMA adapter.
\ No newline at end of file
diff --git a/linux-6.16.8/drivers/infiniband/hw/virtio/Makefile b/linux-6.16.8/drivers/infiniband/hw/virtio/Makefile
new file mode 100644
index 000000000..dbed6471e
--- /dev/null
+++ b/linux-6.16.8/drivers/infiniband/hw/virtio/Makefile
@@ -0,0 +1,5 @@
+obj-$(CONFIG_INFINIBAND_VIRTIO_RDMA) += vrdma.o
+
+vrdma-y := vrdma_main.o vrdma_dev.o vrdma_ib.o \
+ vrdma_netdev.o vrdma_queue.o
diff --git a/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma.h b/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma.h
new file mode 100644
index 000000000..bc72d9c5e
--- /dev/null
+++ b/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma.h
@@ -0,0 +1,82 @@
+/* SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause */
+
+/* Authors: Xiong Weimin <xiongweimin@kylinos.cn> */
+/* Copyright 2020 kylinos.cn. All Rights Reserved. */
+
+#ifndef __VIRTIO_RDMA_H__
+#define __VIRTIO_RDMA_H__
+
+#include <linux/virtio.h>
+#include <linux/virtio_config.h>
+#include <linux/virtio_ids.h>
+#include <linux/netdevice.h>
+#include <rdma/ib_verbs.h>
+#include <linux/spinlock.h>
+#include <linux/atomic.h>
+#include <linux/mutex.h>
+#include <linux/list.h>
+
+/**
+ * struct vrdma_dev - Virtual RDMA device structure
+ * @ib_dev: InfiniBand device (must be first for container_of)
+ * @attr: Cached device attributes
+ * @netdev: Associated network device (for offload, etc.)
+ * @vdev: Virtio device backing this RDMA device
+ * @ctrl_vq: Control virtqueue for configuration and management
+ * @ctrl_lock: Spinlock protecting control operations
+ * @cq_vqs: Array of CQ (Completion Queue) virtual queues
+ * @cqs: Pointer array to active completion queues
+ * @qp_vqs: Array of QP (Queue Pair) virtual queues
+ * @num_qp: Counter for active queue pairs
+ * @num_cq: Counter for active completion queues
+ * @num_ah: Counter for active address handles
+ * @pending_mmaps: List of pending memory mappings for mmap handling
+ * @pending_mmaps_lock: Lock protecting pending_mmaps list
+ * @port_mutex: Mutex for port state changes
+ * @port_cap_mask: Port capabilities bitmask
+ * @ib_active: Flag indicating whether IB port is active
+ * @fast_doorbell: Enable fast doorbell mechanism (if supported)
+ */
+struct vrdma_dev {
+ /* Must come first for proper container_of usage in IB layer */
+ struct ib_device ib_dev;
+
+ /* Device attributes cache */
+ struct ib_device_attr attr;
+
+ /* Optional associated net device (e.g., for IPoIB or offload) */
+ struct net_device *netdev;
+
+ /* Backend virtio device and control vq */
+ struct virtio_device *vdev;
+ struct virtqueue *ctrl_vq;
+
+ /* Lock for controlling access to ctrl_vq */
+ spinlock_t ctrl_lock;
+
+ /* Completion Queue (CQ) related */
+ struct vrdma_vq *cq_vqs; /* Array of CQ VQs */
+ struct vrdma_cq **cqs; /* Array of pointers to CQs */
+
+ /* Queue Pair (QP) related */
+ struct vrdma_vq *qp_vqs; /* Array of QP VQs */
+
+ /* Resource counters */
+ atomic_t num_qp;
+ atomic_t num_cq;
+ atomic_t num_ah;
+
+ /* Pending mmaps from userspace */
+ struct list_head pending_mmaps;
+ spinlock_t pending_mmaps_lock;
+
+ /* Port management */
+ struct mutex port_mutex;
+ u32 port_cap_mask;
+
+ /* Runtime state flags */
+ bool ib_active;
+ bool fast_doorbell;
+};
+
+#endif
\ No newline at end of file
diff --git a/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_dev.c b/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_dev.c
new file mode 100644
index 000000000..0a09b3bd4
--- /dev/null
+++ b/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_dev.c
@@ -0,0 +1,272 @@
+// SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause
+
+/* Authors: Xiong Weimin <xiongweimin@kylinos.cn> */
+/* Copyright 2020 kylinos.cn. All Rights Reserved. */
+
+#include <linux/virtio_config.h>
+
+#include "vrdma.h"
+#include "vrdma_dev_api.h"
+#include "vrdma_queue.h"
+
+/**
+ * init_device_attr - Initialize IB device attributes from virtio config space
+ * @rdev: Virtio RDMA device
+ *
+ * Reads the device configuration fields and populates the InfiniBand device
+ * attributes (&rdev->ib_dev.attrs). This function must be called during device
+ * probe after the virtqueue is ready but before registering the IB device.
+ */
+static void init_device_attr(struct vrdma_dev *rdev)
+{
+ struct ib_device_attr *attr = &rdev->attr;
+ struct vrdma_config cfg;
+
+ /* Zero out attribute structure */
+ memset(attr, 0, sizeof(*attr));
+
+ /* Read the device capabilities field by field from the config space */
+ virtio_cread(rdev->vdev, struct vrdma_config, phys_port_cnt, &cfg.phys_port_cnt);
+ virtio_cread(rdev->vdev, struct vrdma_config, sys_image_guid, &cfg.sys_image_guid);
+ virtio_cread(rdev->vdev, struct vrdma_config, vendor_id, &cfg.vendor_id);
+ virtio_cread(rdev->vdev, struct vrdma_config, vendor_part_id, &cfg.vendor_part_id);
+ virtio_cread(rdev->vdev, struct vrdma_config, hw_ver, &cfg.hw_ver);
+ virtio_cread(rdev->vdev, struct vrdma_config, max_mr_size, &cfg.max_mr_size);
+ virtio_cread(rdev->vdev, struct vrdma_config, page_size_cap, &cfg.page_size_cap);
+ virtio_cread(rdev->vdev, struct vrdma_config, max_qp, &cfg.max_qp);
+ virtio_cread(rdev->vdev, struct vrdma_config, max_qp_wr, &cfg.max_qp_wr);
+ virtio_cread(rdev->vdev, struct vrdma_config, device_cap_flags, &cfg.device_cap_flags);
+ virtio_cread(rdev->vdev, struct vrdma_config, max_send_sge, &cfg.max_send_sge);
+ virtio_cread(rdev->vdev, struct vrdma_config, max_recv_sge, &cfg.max_recv_sge);
+ virtio_cread(rdev->vdev, struct vrdma_config, max_sge_rd, &cfg.max_sge_rd);
+ virtio_cread(rdev->vdev, struct vrdma_config, max_cq, &cfg.max_cq);
+ virtio_cread(rdev->vdev, struct vrdma_config, max_cqe, &cfg.max_cqe);
+ virtio_cread(rdev->vdev, struct vrdma_config, max_mr, &cfg.max_mr);
+ virtio_cread(rdev->vdev, struct vrdma_config, max_pd, &cfg.max_pd);
+ virtio_cread(rdev->vdev, struct vrdma_config, max_qp_rd_atom, &cfg.max_qp_rd_atom);
+ virtio_cread(rdev->vdev, struct vrdma_config, max_res_rd_atom, &cfg.max_res_rd_atom);
+ virtio_cread(rdev->vdev, struct vrdma_config, max_qp_init_rd_atom, &cfg.max_qp_init_rd_atom);
+ virtio_cread(rdev->vdev, struct vrdma_config, atomic_cap, &cfg.atomic_cap);
+ virtio_cread(rdev->vdev, struct vrdma_config, max_mw, &cfg.max_mw);
+ virtio_cread(rdev->vdev, struct vrdma_config, max_mcast_grp, &cfg.max_mcast_grp);
+ virtio_cread(rdev->vdev, struct vrdma_config, max_mcast_qp_attach, &cfg.max_mcast_qp_attach);
+ virtio_cread(rdev->vdev, struct vrdma_config, max_total_mcast_qp_attach, &cfg.max_total_mcast_qp_attach);
+ virtio_cread(rdev->vdev, struct vrdma_config, max_ah, &cfg.max_ah);
+ virtio_cread(rdev->vdev, struct vrdma_config, max_fast_reg_page_list_len, &cfg.max_fast_reg_page_list_len);
+ virtio_cread(rdev->vdev, struct vrdma_config, max_pi_fast_reg_page_list_len, &cfg.max_pi_fast_reg_page_list_len);
+ virtio_cread(rdev->vdev, struct vrdma_config, max_pkeys, &cfg.max_pkeys);
+ virtio_cread(rdev->vdev, struct vrdma_config, local_ca_ack_delay, &cfg.local_ca_ack_delay);
+
+ /* Copy values into ib_device_attr with proper type conversion */
+ rdev->ib_dev.phys_port_cnt = le32_to_cpu(cfg.phys_port_cnt);
+
+ attr->sys_image_guid = le64_to_cpu(cfg.sys_image_guid);
+ attr->vendor_id = le32_to_cpu(cfg.vendor_id);
+ attr->vendor_part_id = le32_to_cpu(cfg.vendor_part_id);
+ attr->hw_ver = le32_to_cpu(cfg.hw_ver);
+ attr->max_mr_size = le64_to_cpu(cfg.max_mr_size);
+ attr->page_size_cap = le64_to_cpu(cfg.page_size_cap);
+ attr->max_qp = le32_to_cpu(cfg.max_qp);
+ attr->max_qp_wr = le32_to_cpu(cfg.max_qp_wr);
+ attr->device_cap_flags = le64_to_cpu(cfg.device_cap_flags);
+ attr->max_send_sge = le32_to_cpu(cfg.max_send_sge);
+ attr->max_recv_sge = le32_to_cpu(cfg.max_recv_sge);
+ attr->max_srq_sge = attr->max_send_sge; /* unless SRQ supported */
+ attr->max_sge_rd = le32_to_cpu(cfg.max_sge_rd);
+ attr->max_cq = le32_to_cpu(cfg.max_cq);
+ attr->max_cqe = le32_to_cpu(cfg.max_cqe);
+ attr->max_mr = le32_to_cpu(cfg.max_mr);
+ attr->max_pd = le32_to_cpu(cfg.max_pd);
+ attr->max_qp_rd_atom = le32_to_cpu(cfg.max_qp_rd_atom);
+ attr->max_res_rd_atom = le32_to_cpu(cfg.max_res_rd_atom);
+ attr->max_qp_init_rd_atom = le32_to_cpu(cfg.max_qp_init_rd_atom);
+ attr->atomic_cap = vrdma_atomic_cap_to_ib(le32_to_cpu(cfg.atomic_cap));
+ attr->max_mw = le32_to_cpu(cfg.max_mw);
+ attr->max_mcast_grp = le32_to_cpu(cfg.max_mcast_grp);
+ attr->max_mcast_qp_attach = le32_to_cpu(cfg.max_mcast_qp_attach);
+ attr->max_total_mcast_qp_attach = le32_to_cpu(cfg.max_total_mcast_qp_attach);
+ attr->max_ah = le32_to_cpu(cfg.max_ah);
+ attr->max_fast_reg_page_list_len = le32_to_cpu(cfg.max_fast_reg_page_list_len);
+ attr->max_pi_fast_reg_page_list_len = le32_to_cpu(cfg.max_pi_fast_reg_page_list_len);
+ attr->max_pkeys = le16_to_cpu(cfg.max_pkeys);
+ attr->local_ca_ack_delay = cfg.local_ca_ack_delay;
+}
+
+/**
+ * vrdma_init_device - Initialize virtqueues for a vRDMA device
+ * @dev: The vRDMA device to initialize
+ *
+ * Returns 0 on success, or negative errno on failure.
+ */
+int vrdma_init_device(struct vrdma_dev *dev)
+{
+ int rc;
+ struct virtqueue **vqs;
+ struct virtqueue_info *vqs_info;
+ unsigned int i, cur_vq;
+ unsigned int total_vqs;
+ uint32_t max_cq, max_qp;
+
+ /* Initialize device attributes */
+ init_device_attr(dev);
+ max_cq = dev->attr.max_cq;
+ max_qp = dev->attr.max_qp; /* SRQ not supported, so ignored */
+
+ /*
+ * Total virtqueues:
+ * 1 control queue (for verbs commands)
+ * max_cq completion queues (CQ)
+ * max_qp * 2 data queues (send & recv queue pairs per QP)
+ */
+ total_vqs = 1 + max_cq + 2 * max_qp;
+
+ /* Allocate storage in dev */
+ dev->cq_vqs = kcalloc(max_cq, sizeof(*dev->cq_vqs), GFP_KERNEL);
+ if (!dev->cq_vqs)
+ return -ENOMEM;
+
+ dev->cqs = kcalloc(max_cq, sizeof(*dev->cqs), GFP_KERNEL);
+ if (!dev->cqs) {
+ rc = -ENOMEM;
+ goto err_free_cq_vqs;
+ }
+
+ dev->qp_vqs = kcalloc(2 * max_qp, sizeof(*dev->qp_vqs), GFP_KERNEL);
+ if (!dev->qp_vqs) {
+ rc = -ENOMEM;
+ goto err_free_cqs;
+ }
+
+ /* Temporary arrays for virtio_find_vqs */
+ vqs_info = kcalloc(total_vqs, sizeof(*vqs_info), GFP_KERNEL);
+ vqs = kcalloc(total_vqs, sizeof(*vqs), GFP_KERNEL);
+ if (!vqs_info || !vqs) {
+ rc = -ENOMEM;
+ goto err_free_vqs;
+ }
+
+ /* Setup queue names and callbacks */
+ cur_vq = 0;
+
+ /* Control virtqueue (no callback) */
+ vqs_info[cur_vq].name = "vrdma-ctrl";
+ vqs_info[cur_vq].callback = NULL;
+ cur_vq++;
+
+ /* Completion Queue virtqueues */
+ for (i = 0; i < max_cq; i++) {
+ snprintf(dev->cq_vqs[i].name, sizeof(dev->cq_vqs[i].name),
+ "cq.%u", i);
+ vqs_info[cur_vq].name = dev->cq_vqs[i].name;
+ vqs_info[cur_vq].callback = vrdma_cq_ack;
+ cur_vq++;
+ }
+
+ /* Send/Receive Queue Pairs for each QP */
+ for (i = 0; i < max_qp; i++) {
+ snprintf(dev->qp_vqs[2 * i].name, sizeof(dev->qp_vqs[2 * i].name),
+ "sqp.%u", i);
+ snprintf(dev->qp_vqs[2 * i + 1].name, sizeof(dev->qp_vqs[2 * i + 1].name),
+ "rqp.%u", i);
+
+ vqs_info[cur_vq].name = dev->qp_vqs[2 * i].name;
+ vqs_info[cur_vq + 1].name = dev->qp_vqs[2 * i + 1].name;
+
+ vqs_info[cur_vq].callback = NULL; /* No TX callback */
+ vqs_info[cur_vq + 1].callback = NULL; /* No RX callback */
+
+ cur_vq += 2;
+ }
+
+ /* Now ask VirtIO layer to set up the virtqueues */
+ rc = virtio_find_vqs(dev->vdev, total_vqs, vqs, vqs_info, NULL);
+ if (rc) {
+ pr_err("Failed to find %u virtqueues: %d\n", total_vqs, rc);
+ goto err_free_vqs;
+ }
+
+ /* Assign found virtqueues to device structures */
+ cur_vq = 0;
+ dev->ctrl_vq = vqs[cur_vq++];
+
+ for (i = 0; i < max_cq; i++) {
+ dev->cq_vqs[i].vq = vqs[cur_vq++];
+ dev->cq_vqs[i].idx = i;
+ spin_lock_init(&dev->cq_vqs[i].lock);
+ }
+
+ for (i = 0; i < max_qp; i++) {
+ struct vrdma_vq *sq = &dev->qp_vqs[2 * i];
+ struct vrdma_vq *rq = &dev->qp_vqs[2 * i + 1];
+
+ sq->vq = vqs[cur_vq++];
+ rq->vq = vqs[cur_vq++];
+
+ sq->idx = i;
+ rq->idx = i;
+
+ spin_lock_init(&sq->lock);
+ spin_lock_init(&rq->lock);
+ }
+
+ /* Final setup */
+ mutex_init(&dev->port_mutex);
+ dev->ib_active = true;
+
+ /* Cleanup temporary arrays */
+ kfree(vqs_info);
+ kfree(vqs);
+
+ return 0;
+
+err_free_vqs:
+ kfree(vqs_info);
+ kfree(vqs);
+ kfree(dev->qp_vqs);
+ dev->qp_vqs = NULL;
+err_free_cqs:
+ kfree(dev->cqs);
+ dev->cqs = NULL;
+err_free_cq_vqs:
+ kfree(dev->cq_vqs);
+ dev->cq_vqs = NULL;
+
+ return rc;
+}
+
+void vrdma_finish_device(struct vrdma_dev *dev)
+{
+ if (!dev) {
+ pr_err("%s: invalid device pointer\n", __func__);
+ return;
+ }
+
+ if (!dev->vdev || !dev->vdev->config) {
+ pr_warn("%s: device or config is NULL, skipping teardown\n", __func__);
+ return;
+ }
+
+ /* Step 1: Mark device as inactive to prevent new operations */
+ dev->ib_active = false;
+
+ /* Step 2: Synchronize and stop any pending work (e.g., CQ processing) */
+ mutex_lock(&dev->port_mutex);
+ /* If there are workqueues or timers, flush them here */
+ // flush_work(&dev->cq_task); // example
+ // del_timer_sync(&dev->poll_timer); // example
+ mutex_unlock(&dev->port_mutex);
+
+ /* Step 3: Bring the device into reset state */
+ virtio_reset_device(dev->vdev);
+
+ /* Step 4: Delete all virtqueues (this also synchronizes with callbacks) */
+ dev->vdev->config->del_vqs(dev->vdev);
+
+ /* Step 5: Free dynamically allocated arrays */
+ kfree(dev->cq_vqs); /* Free CQ queue metadata */
+ dev->cq_vqs = NULL;
+
+ kfree(dev->cqs); /* Free CQ context array */
+ dev->cqs = NULL;
+
+ kfree(dev->qp_vqs); /* Free QP send/receive queue metadata */
+ dev->qp_vqs = NULL;
+}
\ No newline at end of file
diff --git a/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_dev.h b/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_dev.h
new file mode 100644
index 000000000..78e243faf
--- /dev/null
+++ b/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_dev.h
@@ -0,0 +1,16 @@
+/* SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause */
+
+/* Authors: Xiong Weimin <xiongweimin@kylinos.cn> */
+/* Copyright 2020 kylinos.cn. All Rights Reserved. */
+#ifndef __VRDMA_DEVICE_H__
+#define __VRDMA_DEVICE_H__
+
+#define VIRTIO_RDMA_BOARD_ID 1
+#define VIRTIO_RDMA_HW_NAME "virtio-rdma"
+#define VIRTIO_RDMA_HW_REV 1
+#define VIRTIO_RDMA_DRIVER_VER "1.0"
+
+int vrdma_init_device(struct vrdma_dev *dev);
+void vrdma_finish_device(struct vrdma_dev *dev);
+
+#endif
\ No newline at end of file
diff --git a/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_dev_api.h b/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_dev_api.h
new file mode 100644
index 000000000..403d5e820
--- /dev/null
+++ b/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_dev_api.h
@@ -0,0 +1,116 @@
+/* SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause */
+
+/* Authors: Xiong Weimin <xiongweimin@kylinos.cn> */
+/* Copyright 2020 kylinos.cn. All Rights Reserved. */
+
+#ifndef __VRDMA_DEV_API_H__
+#define __VRDMA_DEV_API_H__
+
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <rdma/ib_verbs.h>
+
+#include <rdma/vrdma_abi.h>
+
+/**
+ * struct vrdma_config - Virtio RDMA device configuration
+ *
+ * This structure is mapped from the virtio device's configuration space and
+ * describes the capabilities and attributes of the host RDMA device.
+ * All fields are in little-endian byte order (__le* types).
+ */
+struct vrdma_config {
+ __le32 phys_port_cnt; /* Number of physical ports */
+
+ __le64 sys_image_guid; /* System image GUID */
+ __le32 vendor_id; /* Vendor ID (PCI-style) */
+ __le32 vendor_part_id; /* Vendor part/device ID */
+ __le32 hw_ver; /* Hardware version */
+ __le64 max_mr_size; /* Maximum memory region size */
+ __le64 page_size_cap; /* Supported page sizes bitmask */
+ __le32 max_qp; /* Max number of queue pairs */
+ __le32 max_qp_wr; /* Max outstanding WRs per QP */
+ __le64 device_cap_flags; /* Device capability flags (&enum ib_device_cap_flags) */
+ __le32 max_send_sge; /* Max SGEs in a SEND WR */
+ __le32 max_recv_sge; /* Max SGEs in a RECV WR */
+ __le32 max_sge_rd; /* Max SGEs in an RDMA READ/ATOMIC WR */
+ __le32 max_cq; /* Max number of completion queues */
+ __le32 max_cqe; /* Max entries per CQ */
+ __le32 max_mr; /* Max number of memory regions */
+ __le32 max_pd; /* Max number of protection domains */
+ __le32 max_qp_rd_atom; /* Max RDMA read atoms per QP */
+ __le32 max_res_rd_atom; /* Total RDMA read atoms system-wide */
+ __le32 max_qp_init_rd_atom; /* Max init RD atoms per QP */
+ __le32 atomic_cap; /* Atomic operations support level */
+ __le32 max_mw; /* Max number of memory windows */
+ __le32 max_mcast_grp; /* Max multicast groups */
+ __le32 max_mcast_qp_attach; /* Max QPs that can attach to one mcast group */
+ __le32 max_total_mcast_qp_attach;/* Total mcast attachments allowed */
+ __le32 max_ah; /* Max address handles */
+ __le32 max_fast_reg_page_list_len;/* Max pages in a fast registration request */
+ __le32 max_pi_fast_reg_page_list_len;/* Max PI (protection info) pages */
+ __le16 max_pkeys; /* Max P_Key table entries */
+ __u8 local_ca_ack_delay; /* Local CA ACK delay (usec, encoded as log scale) */
+ __u8 reserved[5]; /* Pad to 8-byte alignment before variable area */
+
+ /*
+ * Future extension: place additional fields here before reserved_tail,
+ * or use a TLV (type-length-value) mechanism for extensibility.
+ */
+ __u8 reserved_tail[64]; /* Reserved for future use (must be zero) */
+};
+
+/**
+ * enum vrdma_verbs_cmd - Virtio RDMA verbs control commands
+ *
+ * These commands are sent from the guest driver to the host over a control virtqueue
+ * (cvq) to manage RDMA resources such as CQs, QPs, MRs, etc.
+ *
+ * @VIRTIO_RDMA_CMD_ILLEGAL: Invalid or uninitialized command (must be 0)
+ * @VIRTIO_RDMA_CMD_QUERY_PORT: Query port attributes (e.g., state, MTU, GID caps)
+ * @VIRTIO_RDMA_CMD_CREATE_CQ: Create a Completion Queue (CQ)
+ * @VIRTIO_RDMA_CMD_DESTROY_CQ: Destroy an existing CQ
+ * @VIRTIO_RDMA_CMD_CREATE_PD: Create a Protection Domain (PD)
+ * @VIRTIO_RDMA_CMD_DESTROY_PD: Destroy a PD
+ * @VIRTIO_RDMA_CMD_GET_DMA_MR: Get a DMA memory region (uncached, single-region MR)
+ * @VIRTIO_RDMA_CMD_CREATE_MR: Create a Memory Region (MR) with access flags
+ * @VIRTIO_RDMA_CMD_MAP_MR_SG: Map scatter-gather list into an MR (for fast registration)
+ * @VIRTIO_RDMA_CMD_REG_USER_MR: Register user-space memory with IOVA
+ * @VIRTIO_RDMA_CMD_DEREG_MR: Deregister and destroy an MR
+ * @VIRTIO_RDMA_CMD_CREATE_QP: Create a Queue Pair (QP)
+ * @VIRTIO_RDMA_CMD_MODIFY_QP: Modify QP state (e.g., RESET -> INIT -> RTR -> RTS)
+ * @VIRTIO_RDMA_CMD_QUERY_QP: Retrieve current QP attributes
+ * @VIRTIO_RDMA_CMD_DESTROY_QP: Destroy a QP
+ * @VIRTIO_RDMA_CMD_QUERY_PKEY: Fetch P_Key table entry at given index
+ * @VIRTIO_RDMA_CMD_ADD_GID: Add a GID (Global Identifier) to the port
+ * @VIRTIO_RDMA_CMD_DEL_GID: Remove a GID from the port
+ * @VIRTIO_RDMA_CMD_REQ_NOTIFY_CQ: Request interrupt on CQ event (equivalent to req_notify_cq())
+ *
+ * All commands are issued via the control virtqueue (cvq), and responses use
+ * the same command number with a success/failure status.
+ */
+enum vrdma_verbs_cmd {
+ VIRTIO_RDMA_CMD_ILLEGAL = 0,
+
+ VIRTIO_RDMA_CMD_QUERY_PORT,
+ VIRTIO_RDMA_CMD_CREATE_CQ,
+ VIRTIO_RDMA_CMD_DESTROY_CQ,
+ VIRTIO_RDMA_CMD_CREATE_PD,
+ VIRTIO_RDMA_CMD_DESTROY_PD,
+ VIRTIO_RDMA_CMD_GET_DMA_MR,
+ VIRTIO_RDMA_CMD_CREATE_MR,
+ VIRTIO_RDMA_CMD_MAP_MR_SG,
+ VIRTIO_RDMA_CMD_REG_USER_MR,
+ VIRTIO_RDMA_CMD_DEREG_MR,
+ VIRTIO_RDMA_CMD_CREATE_QP,
+ VIRTIO_RDMA_CMD_MODIFY_QP,
+ VIRTIO_RDMA_CMD_QUERY_QP,
+ VIRTIO_RDMA_CMD_DESTROY_QP,
+ VIRTIO_RDMA_CMD_QUERY_PKEY,
+ VIRTIO_RDMA_CMD_ADD_GID,
+ VIRTIO_RDMA_CMD_DEL_GID,
+ VIRTIO_RDMA_CMD_REQ_NOTIFY_CQ,
+};
+
+
+#endif
\ No newline at end of file
diff --git a/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_ib.c b/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_ib.c
new file mode 100644
index 000000000..379bd23d3
--- /dev/null
+++ b/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_ib.c
@@ -0,0 +1,172 @@
+// SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause
+
+/* Authors: Xiong Weimin <xiongweimin@kylinos.cn> */
+/* Copyright 2020 kylinos.cn. All Rights Reserved. */
+
+#include <linux/scatterlist.h>
+#include <linux/virtio.h>
+#include <linux/virtio_config.h>
+#include <linux/virtio_ring.h>
+#include <rdma/ib_mad.h>
+#include <rdma/uverbs_ioctl.h>
+#include <rdma/ib_umem.h>
+#include <rdma/ib_verbs.h>
+#include <rdma/ib_addr.h>
+
+#include "vrdma.h"
+#include "vrdma_dev.h"
+#include "vrdma_dev_api.h"
+#include "vrdma_ib.h"
+
+/**
+ * cmd_str - String representation of virtio RDMA control commands
+ *
+ * This array maps each &enum vrdma_verbs_cmd value to its human-readable
+ * string for logging and debugging purposes. It is indexed directly by command ID.
+ *
+ * Example usage:
+ * dev_dbg(dev, "Received ctrl cmd: %s\n", cmd_str[cmd]);
+ */
+static const char * const cmd_str[] = {
+ [VIRTIO_RDMA_CMD_ILLEGAL] = "ILLEGAL",
+ [VIRTIO_RDMA_CMD_QUERY_PORT] = "QUERY_PORT",
+ [VIRTIO_RDMA_CMD_CREATE_CQ] = "CREATE_CQ",
+ [VIRTIO_RDMA_CMD_DESTROY_CQ] = "DESTROY_CQ",
+ [VIRTIO_RDMA_CMD_CREATE_PD] = "CREATE_PD",
+ [VIRTIO_RDMA_CMD_DESTROY_PD] = "DESTROY_PD",
+ [VIRTIO_RDMA_CMD_GET_DMA_MR] = "GET_DMA_MR",
+ [VIRTIO_RDMA_CMD_CREATE_MR] = "CREATE_MR",
+ [VIRTIO_RDMA_CMD_MAP_MR_SG] = "MAP_MR_SG",
+ [VIRTIO_RDMA_CMD_REG_USER_MR] = "REG_USER_MR",
+ [VIRTIO_RDMA_CMD_DEREG_MR] = "DEREG_MR",
+ [VIRTIO_RDMA_CMD_CREATE_QP] = "CREATE_QP",
+ [VIRTIO_RDMA_CMD_MODIFY_QP] = "MODIFY_QP",
+ [VIRTIO_RDMA_CMD_QUERY_QP] = "QUERY_QP",
+ [VIRTIO_RDMA_CMD_DESTROY_QP] = "DESTROY_QP",
+ [VIRTIO_RDMA_CMD_QUERY_PKEY] = "QUERY_PKEY",
+ [VIRTIO_RDMA_CMD_ADD_GID] = "ADD_GID",
+ [VIRTIO_RDMA_CMD_DEL_GID] = "DEL_GID",
+ [VIRTIO_RDMA_CMD_REQ_NOTIFY_CQ] = "REQ_NOTIFY_CQ",
+};
+
+static const struct ib_device_ops virtio_rdma_dev_ops = {
+ .owner = THIS_MODULE,
+ .uverbs_abi_ver = VIRTIO_RDMA_ABI_VERSION,
+ .driver_id = RDMA_DRIVER_VIRTIO,
+};
+
+/**
+ * vrdma_register_ib_device - Register the vRDMA device with IB core
+ * @vrdev: The vRDMA device to register
+ *
+ * Initializes the ib_device structure and registers it with the InfiniBand
+ * core subsystem. Must be called after queues are initialized.
+ *
+ * Returns 0 on success, or negative errno.
+ */
+int vrdma_register_ib_device(struct vrdma_dev *vrdev)
+{
+ struct ib_device *ibdev;
+ int rc;
+
+ if (!vrdev) {
+ pr_err("Invalid vrdev pointer\n");
+ return -EINVAL;
+ }
+
+ ibdev = &vrdev->ib_dev;
+
+ /* --- Step 1: Initialize static device properties --- */
+
+ ibdev->dev.parent = &vrdev->vdev->dev; /* Point to virtio device */
+
+ ibdev->node_type = RDMA_NODE_IB_CA;
+ strncpy(ibdev->node_desc, "VirtIO RDMA", sizeof(ibdev->node_desc));
+
+ ibdev->phys_port_cnt = 1; /* Assume single port */
+ ibdev->num_comp_vectors = 1; /* One completion vector */
+
+ /* Set GUID: derive a locally administered identifier from device info */
+ memcpy(&ibdev->node_guid, &vrdev->vdev->id.device,
+        sizeof(vrdev->vdev->id.device));
+ *(u64 *)&ibdev->node_guid |= 0x020000ULL << 24; /* Make locally administered */
+
+ /* --- Step 2: Set user verbs command mask --- */
+
+ ibdev->uverbs_cmd_mask =
+ BIT_ULL(IB_USER_VERBS_CMD_GET_CONTEXT) |
+ BIT_ULL(IB_USER_VERBS_CMD_CREATE_COMP_CHANNEL) |
+ BIT_ULL(IB_USER_VERBS_CMD_QUERY_DEVICE) |
+ BIT_ULL(IB_USER_VERBS_CMD_QUERY_PORT) |
+ BIT_ULL(IB_USER_VERBS_CMD_ALLOC_PD) |
+ BIT_ULL(IB_USER_VERBS_CMD_DEALLOC_PD) |
+ BIT_ULL(IB_USER_VERBS_CMD_CREATE_QP) |
+ BIT_ULL(IB_USER_VERBS_CMD_MODIFY_QP) |
+ BIT_ULL(IB_USER_VERBS_CMD_QUERY_QP) |
+ BIT_ULL(IB_USER_VERBS_CMD_DESTROY_QP) |
+ BIT_ULL(IB_USER_VERBS_CMD_POST_SEND) |
+ BIT_ULL(IB_USER_VERBS_CMD_POST_RECV) |
+ BIT_ULL(IB_USER_VERBS_CMD_CREATE_CQ) |
+ BIT_ULL(IB_USER_VERBS_CMD_DESTROY_CQ) |
+ BIT_ULL(IB_USER_VERBS_CMD_POLL_CQ) |
+ BIT_ULL(IB_USER_VERBS_CMD_REQ_NOTIFY_CQ) |
+ BIT_ULL(IB_USER_VERBS_CMD_REG_MR) |
+ BIT_ULL(IB_USER_VERBS_CMD_DEREG_MR) |
+ BIT_ULL(IB_USER_VERBS_CMD_CREATE_AH) |
+ BIT_ULL(IB_USER_VERBS_CMD_MODIFY_AH) |
+ BIT_ULL(IB_USER_VERBS_CMD_QUERY_AH) |
+ BIT_ULL(IB_USER_VERBS_CMD_DESTROY_AH);
+
+ /* --- Step 3: Attach device operation vectors --- */
+ ib_set_device_ops(ibdev, &virtio_rdma_dev_ops);
+
+ /* --- Step 4: Bind to netdev (optional, for RoCE) --- */
+ if (vrdev->netdev) {
+ ib_device_set_netdev(ibdev, vrdev->netdev, 1); /* Port 1 */
+ pr_info("Bound vRDMA device to net_device '%s'\n", vrdev->netdev->name);
+ }
+
+ /* --- Step 5: Register with IB core --- */
+ rc = ib_register_device(ibdev, "vrdma%d", vrdev->vdev->dev.parent);
+ if (rc) {
+ pr_err("Failed to register vRDMA device with IB core: %d\n", rc);
+ return rc;
+ }
+
+ pr_info("Successfully registered vRDMA device as '%s'\n", dev_name(&ibdev->dev));
+ return 0;
+}
+
+/**
+ * vrdma_unregister_ib_device - Safely unregister IB device
+ * @vrdev: The vRDMA device to unregister
+ *
+ * This function unregisters the IB device from the core stack,
+ * ensuring that all client references are dropped before returning.
+ */
+void vrdma_unregister_ib_device(struct vrdma_dev *vrdev)
+{
+ if (!vrdev) {
+ pr_err("%s: invalid vrdev\n", __func__);
+ return;
+ }
+
+ if (!vrdev->ib_dev.dev.parent) {
+ pr_warn("%s: IB device not registered or already unregistered\n", __func__);
+ return;
+ }
+
+ /*
+ * Step 1: Stop device operation - disable VQ handling, doorbells, etc.
+ * Actual virtqueue teardown is done by the caller via
+ * vrdma_finish_device() after unregistration, so it is not repeated
+ * here (otherwise the vqs would be reset and deleted twice).
+ */
+
+ /*
+ * Step 2: Unregister from IB core.
+ * This will:
+ * - Send IB_EVENT_DEVICE_REMOVAL to all users
+ * - Block until all file descriptors (ucontext, etc.) are released
+ * - Wait for refcount to drop to zero
+ */
+ ib_unregister_device(&vrdev->ib_dev);
+}
\ No newline at end of file
diff --git a/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_ib.h b/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_ib.h
new file mode 100644
index 000000000..9a7a0a168
--- /dev/null
+++ b/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_ib.h
@@ -0,0 +1,81 @@
+/* SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause */
+
+/* Authors: Xiong Weimin <xiongweimin@kylinos.cn> */
+/* Copyright 2020 kylinos.cn. All Rights Reserved. */
+
+#ifndef __VRDMA_IB_H__
+#define __VRDMA_IB_H__
+
+#include <linux/types.h>
+
+#include <rdma/ib_verbs.h>
+#include <rdma/vrdma_abi.h>
+
+enum {
+ VIRTIO_RDMA_ATOMIC_NONE,
+ VIRTIO_RDMA_ATOMIC_HCA,
+ VIRTIO_RDMA_ATOMIC_GLOB
+};
+
+static inline enum ib_atomic_cap vrdma_atomic_cap_to_ib(u32 src)
+{
+ switch (src) {
+ case VIRTIO_RDMA_ATOMIC_NONE:
+ return IB_ATOMIC_NONE;
+ case VIRTIO_RDMA_ATOMIC_HCA:
+ return IB_ATOMIC_HCA;
+ case VIRTIO_RDMA_ATOMIC_GLOB:
+ return IB_ATOMIC_GLOB;
+ default:
+ pr_warn("Unknown atomic cap %u\n", src);
+ return IB_ATOMIC_NONE;
+ }
+}
+
+/**
+ * struct vrdma_vq - Wrapper around a virtqueue for RDMA use
+ * @vq: Pointer to the underlying virtqueue
+ * @lock: Spinlock to protect access to the virtqueue (especially ring updates)
+ * @name: Human-readable name (e.g., "send.0", "recv.1")
+ * @idx: Index of this queue within its type (e.g., queue pair ID)
+ *
+ * This structure wraps a virtqueue with additional metadata needed by the
+ * virtio-rdma driver, including synchronization and identification.
+ */
+struct vrdma_vq {
+ struct virtqueue *vq;
+ spinlock_t lock; /* Protects VQ operations */
+ char name[16]; /* Name for debugging */
+ int idx; /* Queue index */
+};
+
+/**
+ * struct vrdma_cq - Virtio RDMA completion queue
+ * @ibcq: Embedding IB core CQ object (for RDMA ABI)
+ * @cq_handle: Host-visible handle to identify this CQ in virtio messages
+ * @vq: Associated receive virtqueue used to get completions from host
+ * @entry: Mmap entry for user-space mapping of CQ ring
+ * @lock: Protects concurrent access to CQ ring and state
+ * @queue: Kernel virtual address of the CQ ring (array of CQEs)
+ * @queue_size: Total size of the CQ ring in bytes
+ * @dma_addr: DMA address of the CQ ring (for device access)
+ * @num_cqe: Number of CQE slots allocated in the ring
+ *
+ * The completion queue receives work completion notifications from the host.
+ * It is typically backed by a dedicated virtqueue that delivers CQEs.
+ */
+struct vrdma_cq {
+ struct ib_cq ibcq;
+ u32 cq_handle;
+ struct vrdma_vq *vq; /* Virtqueue where CQEs arrive */
+ struct rdma_user_mmap_entry *entry; /* For mmap support in userspace */
+ spinlock_t lock;
+ struct virtio_rdma_cqe *queue; /* CQE ring buffer */
+ size_t queue_size;
+ dma_addr_t dma_addr;
+ u32 num_cqe;
+};
+
+int vrdma_register_ib_device(struct vrdma_dev *vrdev);
+void vrdma_unregister_ib_device(struct vrdma_dev *vrdev);
+
+#endif
\ No newline at end of file
diff --git a/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_main.c b/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_main.c
new file mode 100644
index 000000000..ea2f15491
--- /dev/null
+++ b/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_main.c
@@ -0,0 +1,159 @@
+// SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause
+
+/* Authors: Xiong Weimin <xiongweimin@kylinos.cn> */
+/* Copyright 2020 kylinos.cn. All Rights Reserved. */
+
+#include <linux/err.h>
+#include <linux/scatterlist.h>
+#include <linux/spinlock.h>
+#include <linux/virtio.h>
+#include <linux/module.h>
+#include <linux/pci.h>
+#include <uapi/linux/virtio_ids.h>
+
+#include "vrdma.h"
+#include "vrdma_dev.h"
+#include "vrdma_ib.h"
+#include "vrdma_netdev.h"
+
+#include "../../../virtio/virtio_pci_common.h"
+
+/**
+ * vrdma_probe - Probe a virtio RDMA device
+ * @vdev: VirtIO device structure
+ *
+ * Called when a new virtio-rdma device is attached. Allocates the driver
+ * private structure, initializes device queues, and registers with IB core.
+ *
+ * Returns 0 on success, or negative errno on failure.
+ */
+static int vrdma_probe(struct virtio_device *vdev)
+{
+ struct vrdma_dev *vrdev;
+ int rc;
+
+ /* Step 1: Allocate IB device structure using ib_core's allocator */
+ vrdev = ib_alloc_device(vrdma_dev, ib_dev);
+ if (!vrdev) {
+ pr_err("Failed to allocate vRDMA device\n");
+ return -ENOMEM;
+ }
+
+ /* Initialize basic fields */
+ vrdev->vdev = vdev;
+ vdev->priv = vrdev;
+
+ spin_lock_init(&vrdev->ctrl_lock);
+ spin_lock_init(&vrdev->pending_mmaps_lock);
+ INIT_LIST_HEAD(&vrdev->pending_mmaps);
+
+ /* Step 2: Check doorbell mechanism support */
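+ /*
+ * With notify_offset_multiplier == PAGE_SIZE every virtqueue's notify
+ * address lands on its own page, so an individual queue's doorbell
+ * page can be mapped in isolation (the presumed point of the
+ * fast-doorbell optimization named in the commit message).
+ */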
+ if (to_vp_device(vdev)->mdev.notify_offset_multiplier != PAGE_SIZE) {
+ pr_warn("notify_offset_multiplier=%u != PAGE_SIZE, disabling fast doorbell\n",
+ to_vp_device(vdev)->mdev.notify_offset_multiplier);
+ vrdev->fast_doorbell = false;
+ } else {
+ vrdev->fast_doorbell = true;
+ }
+
+ /* Step 3: Initialize hardware interface (virtqueues) */
+ rc = vrdma_init_device(vrdev);
+ if (rc) {
+ pr_err("Failed to initialize vRDMA device queues\n");
+ goto err_dealloc_device;
+ }
+
+ rc = vrdma_init_netdev(vrdev);
+ if (rc) {
+ pr_err("Fail to connect to NetDev layer\n");
+ goto err_cleanup_device;
+ }
+
+ /* Step 4: Register with InfiniBand core layer */
+ rc = vrdma_register_ib_device(vrdev);
+ if (rc) {
+ pr_err("Failed to register with IB subsystem\n");
+ goto err_cleanup_netdev;
+ }
+
+ return 0;
+
+err_cleanup_netdev:
+ vrdma_finish_netdev(vrdev);
+
+err_cleanup_device:
+ vrdma_finish_device(vrdev); /* Safe cleanup of queues and reset */
+
+err_dealloc_device:
+ ib_dealloc_device(&vrdev->ib_dev); /* Frees vrdev itself */
+ vdev->priv = NULL;
+
+ return rc;
+}
+
+static void vrdma_remove(struct virtio_device *vdev)
+{
+ struct vrdma_dev *vrdev = vdev->priv;
+
+ if (!vrdev) {
+ dev_warn(&vdev->dev, "vrdma_remove: no private data!\n");
+ return;
+ }
+
+ /* Step 1: Prevent further access by clearing private pointer */
+ vdev->priv = NULL;
+
+ /* Step 2: Stop all virtqueues and disable interrupts */
+ virtio_reset_device(vdev);
+
+ /* Step 3: Unregister IB device - waits for all user contexts to close */
+ vrdma_unregister_ib_device(vrdev);
+
+ /* Step 4: Release paired net_device reference (if any) */
+ vrdma_finish_netdev(vrdev);
+
+ /* Step 5: Clean up internal device state (vqs, doorbells, rings, etc.) */
+ vrdma_finish_device(vrdev);
+
+ /* Step 6: Finally, free the ib_device structure itself */
+ ib_dealloc_device(&vrdev->ib_dev);
+}
+
+static struct virtio_device_id id_table[] = {
+ { VIRTIO_ID_RDMA, VIRTIO_DEV_ANY_ID },
+ { 0 },
+};
+
+static struct virtio_driver vrdma_driver = {
+ .driver.name = KBUILD_MODNAME,
+ .driver.owner = THIS_MODULE,
+ .id_table = id_table,
+ .probe = vrdma_probe,
+ .remove = vrdma_remove,
+};
+
+static int __init vrdma_init(void)
+{
+ int rc;
+
+ rc = register_virtio_driver(&vrdma_driver);
+ if (rc) {
+ pr_err("Failed to register VirtIO RDMA driver: error %d\n", rc);
+ return rc;
+ }
+
+ return 0;
+}
+
+static void __exit vrdma_finish(void)
+{
+ unregister_virtio_driver(&vrdma_driver);
+}
+
+module_init(vrdma_init);
+module_exit(vrdma_finish);
+
+MODULE_DEVICE_TABLE(virtio, id_table);
+MODULE_AUTHOR("Xiong Weimin <xiongweimin@kylinos.cn>");
+MODULE_DESCRIPTION("Virtio RDMA driver");
+MODULE_LICENSE("Dual BSD/GPL");
\ No newline at end of file
diff --git a/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_netdev.c b/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_netdev.c
new file mode 100644
index 000000000..e83902e6d
--- /dev/null
+++ b/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_netdev.c
@@ -0,0 +1,105 @@
+// SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause
+
+/* Authors: Xiong Weimin <xiongweimin@kylinos.cn> */
+/* Copyright 2020 kylinos.cn. All Rights Reserved. */
+
+#include <linux/netdevice.h>
+#include <linux/pci_ids.h>
+#include <linux/virtio_ids.h>
+
+#include "../../../virtio/virtio_pci_common.h"
+#include "vrdma_netdev.h"
+#include "vrdma.h"
+
+/**
+ * vrdma_init_netdev - Attempt to find paired virtio-net device on same PCI slot
+ * @vrdev: The vRDMA device
+ *
+ * WARNING: This is a non-standard hack for development/emulation environments.
+ * Do not use in production or upstream drivers.
+ *
+ * Returns 0 on success, or negative errno.
+ */
+int vrdma_init_netdev(struct vrdma_dev *vrdev)
+{
+ struct pci_dev *pdev_net;
+ struct virtio_pci_device *vp_dev;
+ struct virtio_pci_device *vnet_pdev;
+ void *priv;
+ struct net_device *netdev;
+
+ if (!vrdev || !vrdev->vdev) {
+ pr_err("%s: invalid vrdev or vdev\n", __func__);
+ return -EINVAL;
+ }
+
+ vp_dev = to_vp_device(vrdev->vdev);
+
+ /* Find the PCI device at function 0 of the same slot */
+ pdev_net = pci_get_slot(vp_dev->pci_dev->bus,
+ PCI_DEVFN(PCI_SLOT(vp_dev->pci_dev->devfn), 0));
+ if (!pdev_net) {
+ pr_err("Failed to find PCI device at fn=0 of slot %x\n",
+ PCI_SLOT(vp_dev->pci_dev->devfn));
+ return -ENODEV;
+ }
+
+ /* Optional: Validate it's a known virtio-net device */
+ if (pdev_net->vendor != PCI_VENDOR_ID_REDHAT_QUMRANET ||
+ pdev_net->device != 0x1041) {
+ pr_warn("PCI device %04x:%04x is not expected virtio-net (1041) device\n",
+ pdev_net->vendor, pdev_net->device);
+ pci_dev_put(pdev_net);
+ return -ENODEV;
+ }
+
+ /* Get the virtio_pci_device from drvdata */
+ vnet_pdev = pci_get_drvdata(pdev_net);
+ if (!vnet_pdev || !vnet_pdev->vdev.priv) {
+ pr_err("No driver data or priv for virtio-net device\n");
+ pci_dev_put(pdev_net);
+ return -ENODEV;
+ }
+
+ priv = vnet_pdev->vdev.priv;
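+ /*
+ * Inverse of netdev_priv(): virtio-net's private area is laid out
+ * NETDEV_ALIGN-aligned immediately after struct net_device, so
+ * stepping back from priv recovers the net_device. This depends on
+ * that layout and is part of the development-only hack noted above.
+ */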
+ vrdev->netdev = priv - ALIGN(sizeof(struct net_device), NETDEV_ALIGN);
+ netdev = vrdev->netdev;
+
+ if (!netdev || !netdev->netdev_ops) {
+ pr_err("Invalid net_device retrieved from virtio-net\n");
+ pci_dev_put(pdev_net);
+ return -ENODEV;
+ }
+
+ /* Hold reference so netdev won't disappear */
+ dev_hold(netdev);
+
+ pci_dev_put(pdev_net); /* Release reference from pci_get_slot */
+
+ return 0;
+}
+
+/**
+ * vrdma_finish_netdev - Release reference to paired net_device
+ * @vrdev: The vRDMA device
+ *
+ * This function releases the reference taken on a net_device during
+ * vrdma_init_netdev(). It should be called during device teardown.
+ */
+void vrdma_finish_netdev(struct vrdma_dev *vrdev)
+{
+ if (!vrdev) {
+ pr_err("%s: invalid vrdev pointer\n", __func__);
+ return;
+ }
+
+ if (vrdev->netdev) {
+ pr_info("[%s]: Releasing reference to net_device '%s'\n",
+ __func__, vrdev->netdev->name);
+
+ dev_put(vrdev->netdev);
+ vrdev->netdev = NULL;
+ } else {
+ pr_debug("%s: no netdev to release\n", __func__);
+ }
+}
\ No newline at end of file
diff --git a/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_netdev.h b/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_netdev.h
new file mode 100644
index 000000000..ce391b5bd
--- /dev/null
+++ b/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_netdev.h
@@ -0,0 +1,14 @@
+/* SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause */
+
+/* Authors: Xiong Weimin <xiongweimin@kylinos.cn> */
+/* Copyright 2020 kylinos.cn. All Rights Reserved. */
+
+#ifndef __VRDMA_NETDEV_H__
+#define __VRDMA_NETDEV_H__
+
+#include "vrdma.h"
+
+int vrdma_init_netdev(struct vrdma_dev *vrdev);
+void vrdma_finish_netdev(struct vrdma_dev *vrdev);
+
+#endif
diff --git a/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_queue.c b/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_queue.c
new file mode 100644
index 000000000..78779c243
--- /dev/null
+++ b/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_queue.c
@@ -0,0 +1,21 @@
+// SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause
+
+/* Authors: Xiong Weimin <xiongweimin@kylinos.cn> */
+/* Copyright 2020 kylinos.cn. All Rights Reserved. */
+#include <linux/virtio.h>
+
+#include "vrdma.h"
+#include "vrdma_queue.h"
+
+void vrdma_cq_ack(struct virtqueue *vq)
+{
+ struct vrdma_dev *rdev;
+ struct vrdma_cq *vcq;
+
+ rdev = vq->vdev->priv;
+ /* CQ vq indices start at 1; vq index 0 is the ctrl vq */
+ vcq = rdev->cqs[vq->index - 1];
+
+ if (vcq && vcq->ibcq.comp_handler)
+ vcq->ibcq.comp_handler(&vcq->ibcq, vcq->ibcq.cq_context);
+}
\ No newline at end of file
diff --git a/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_queue.h b/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_queue.h
new file mode 100644
index 000000000..64b896208
--- /dev/null
+++ b/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_queue.h
@@ -0,0 +1,14 @@
+/* SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause */
+
+/* Authors: Xiong Weimin <xiongweimin@kylinos.cn> */
+/* Copyright 2020 kylinos.cn. All Rights Reserved. */
+
+#ifndef __VRDMA_QUEUE_H__
+#define __VRDMA_QUEUE_H__
+
+#include "vrdma_ib.h"
+#include "vrdma_dev_api.h"
+
+void vrdma_cq_ack(struct virtqueue *vq);
+
+#endif
\ No newline at end of file
diff --git a/linux-6.16.8/include/rdma/vrdma_abi.h b/linux-6.16.8/include/rdma/vrdma_abi.h
new file mode 100644
index 000000000..62d4fda09
--- /dev/null
+++ b/linux-6.16.8/include/rdma/vrdma_abi.h
@@ -0,0 +1,62 @@
+/* SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause */
+
+/* Authors: Xiong Weimin <xiongweimin@kylinos.cn> */
+/* Copyright 2020 kylinos.cn. All Rights Reserved. */
+#ifndef __VIRTIO_RDMA_ABI_H__
+#define __VIRTIO_RDMA_ABI_H__
+
+#include <linux/types.h>
+
+#define VIRTIO_RDMA_ABI_VERSION 1
+
+/**
+ * struct virtio_rdma_cqe - Virtio RDMA completion queue entry (CQE)
+ * @wr_id: User-provided Work Request ID (passed back on completion)
+ * @status: Completion status (%IB_WC_SUCCESS or error code)
+ * @opcode: Operation type (e.g., %IB_WC_SEND, %IB_WC_RECV)
+ * @vendor_err: Vendor-specific error code (if any)
+ * @byte_len: Number of bytes transferred in this operation
+ * @ex: Union containing additional data based on operation:
+ * - @imm_data: Inbound immediate data (for sends with IMM)
+ * - @invalidate_rkey: RKEY invalidated in remote invalidation
+ * @qp_num: QP number that completed this work (lower 24 bits)
+ * @src_qp: Source QP number from sender (in RC/UC)
+ * @wc_flags: Additional flags (e.g., %IB_WC_WITH_IMM, %IB_WC_GRH, %IB_WC_COMPLETION_TIMESTAMP)
+ * @pkey_index: P_Key index used for this packet
+ * @slid: Source LID (Local Identifier) of the sender
+ * @sl: Service Level used in the packet
+ * @dlid_path_bits: Path bits of the destination LID (useful in FLIT routing)
+ * @port_num: Physical port number on which the packet was received
+ *
+ * This structure represents a single completion entry delivered to a CQ.
+ * It mirrors the fields of &struct ib_wc but is designed to be serialized
+ * over the virtio control channel or ring buffer.
+ *
+ * All fields are laid out for natural alignment; no explicit padding required.
+ */
+struct virtio_rdma_cqe {
+ __u64 wr_id; /* Work Request ID */
+ __u32 status; /* IB_WC_* status code */
+ __u32 opcode; /* IB_WC_* opcode */
+ __u32 vendor_err; /* Vendor-specific error */
+ __u32 byte_len; /* Bytes transferred */
+
+ union {
+ __u32 imm_data; /* Immediate data (if present) */
+ __u32 invalidate_rkey; /* RKEY invalidated */
+ } ex;
+
+ __u32 qp_num; /* Local QP number */
+ __u32 src_qp; /* Remote source QP */
+ __u32 wc_flags; /* IB_WC_* flags (e.g., WITH_IMM, GRH) */
+
+ /* Connection and routing metadata */
+ __u16 pkey_index; /* P_Key table index */
+ __u16 slid; /* Source LID */
+ __u8 sl; /* Service Level */
+ __u8 dlid_path_bits; /* DLID path bits (for subnet routing) */
+ __u8 port_num; /* Port where packet was received */
+ __u8 reserved[3]; /* Pad to maintain 8-byte alignment */
+};
+
+#endif
\ No newline at end of file
diff --git a/linux-6.16.8/include/uapi/linux/virtio_ids.h b/linux-6.16.8/include/uapi/linux/virtio_ids.h
index 7aa2eb766..ff2d0b01b 100644
--- a/linux-6.16.8/include/uapi/linux/virtio_ids.h
+++ b/linux-6.16.8/include/uapi/linux/virtio_ids.h
@@ -68,6 +68,7 @@
#define VIRTIO_ID_AUDIO_POLICY 39 /* virtio audio policy */
#define VIRTIO_ID_BT 40 /* virtio bluetooth */
#define VIRTIO_ID_GPIO 41 /* virtio gpio */
+#define VIRTIO_ID_RDMA 42 /* virtio rdma */
/*
* Virtio Transitional IDs
diff --git a/linux-6.16.8/include/uapi/rdma/ib_user_ioctl_verbs.h b/linux-6.16.8/include/uapi/rdma/ib_user_ioctl_verbs.h
index fe15bc7e9..181978aa9 100644
--- a/linux-6.16.8/include/uapi/rdma/ib_user_ioctl_verbs.h
+++ b/linux-6.16.8/include/uapi/rdma/ib_user_ioctl_verbs.h
@@ -255,6 +255,7 @@ enum rdma_driver_id {
RDMA_DRIVER_SIW,
RDMA_DRIVER_ERDMA,
RDMA_DRIVER_MANA,
+ RDMA_DRIVER_VIRTIO,
};
enum ib_uverbs_gid_type {
--
2.43.0
* [PATCH 02/10] drivers/infiniband/hw/virtio: add vrdma_exec_verbs_cmd to construct verbs sgs using virtio
2025-12-18 9:09 Implement initial driver for virtio-RDMA device(kernel) Xiong Weimin
2025-12-18 9:09 ` [PATCH 01/10] drivers/infiniband/hw/virtio: Initial driver for virtio RDMA devices Xiong Weimin
@ 2025-12-18 9:09 ` Xiong Weimin
2025-12-18 9:09 ` [PATCH 03/10] drivers/infiniband/hw/virtio: Implement core device and key resource management Xiong Weimin
` (9 subsequent siblings)
11 siblings, 0 replies; 18+ messages in thread
From: Xiong Weimin @ 2025-12-18 9:09 UTC
To: Michael S. Tsirkin, David Hildenbrand, Jason Wang,
Stefano Garzarella, Thomas Monjalon, David Marchand,
Luca Boccassi, Kevin Traynor, Christian Ehrhardt, Xuan Zhuo,
Eugenio Pérez, Xueming Li, Maxime Coquelin, Chenbo Xia,
Bruce Richardson
Cc: kvm, virtualization, netdev, xiongweimin
From: xiongweimin <xiongweimin@kylinos.cn>
The initial implementation of vrdma_exec_verbs_cmd used a busy-wait loop
with cpu_relax() for command completion, which wastes CPU cycles,
especially in process context. This commit introduces a more efficient
approach by:
1. Adding a wait queue (ctrl_waitq) and completion flag (ctrl_completed)
to the vrdma_dev structure
2. Using wait_event_timeout for sleeping instead of spinning in non-atomic
contexts
3. Maintaining the original busy-wait behavior for atomic contexts
4. Adding proper locking around the wait mechanism
5. Implementing wakeup in the IRQ handler
This change significantly reduces CPU usage when executing commands from
process context, while keeping the non-sleeping busy-wait path for atomic
contexts such as NAPI polling.
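A minimal sketch of the resulting wait logic (the ctrl_waitq and
ctrl_completed names follow this commit's description; the atomic_ctx
flag and the 1s timeout are assumptions, the actual driver code may
differ):

if (atomic_ctx) {
	/* NAPI and other non-sleeping contexts: keep the busy-wait */
	while (!READ_ONCE(dev->ctrl_completed))
		cpu_relax();
} else {
	/*
	 * Process context: sleep; the vq interrupt handler sets
	 * ctrl_completed and calls wake_up(&dev->ctrl_waitq).
	 */
	wait_event_timeout(dev->ctrl_waitq,
			   READ_ONCE(dev->ctrl_completed),
			   msecs_to_jiffies(1000));
}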
Signed-off-by: Xiong Weimin <xiongweimin@kylinos.cn>
---
.../drivers/infiniband/hw/virtio/vrdma.h | 3 +-
.../drivers/infiniband/hw/virtio/vrdma_dev.c | 1 +
.../infiniband/hw/virtio/vrdma_dev_api.h | 222 ++++++++++++++++++
.../drivers/infiniband/hw/virtio/vrdma_ib.c | 112 +++++++++
.../drivers/infiniband/hw/virtio/vrdma_ib.h | 2 +
5 files changed, 339 insertions(+), 1 deletion(-)
diff --git a/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma.h b/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma.h
index bc72d9c5e..a646794ef 100644
--- a/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma.h
+++ b/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma.h
@@ -12,9 +12,10 @@
#include <linux/netdevice.h>
#include <rdma/ib_verbs.h>
#include <linux/spinlock.h>
-#include <linux/atomic.h>
+#include <linux/average.h>
#include <linux/mutex.h>
#include <linux/list.h>
+#include <linux/types.h>
/**
* struct vrdma_dev - Virtual RDMA device structure
diff --git a/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_dev.c b/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_dev.c
index 0a09b3bd4..961529b58 100644
--- a/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_dev.c
+++ b/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_dev.c
@@ -6,6 +6,7 @@
#include <linux/virtio_config.h>
#include "vrdma.h"
+#include "vrdma_dev.h"
#include "vrdma_dev_api.h"
#include "vrdma_queue.h"
diff --git a/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_dev_api.h b/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_dev_api.h
index 403d5e820..3b1f7d2b6 100644
--- a/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_dev_api.h
+++ b/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_dev_api.h
@@ -8,6 +8,8 @@
#include <linux/kernel.h>
#include <linux/types.h>
+#include <linux/u64_stats_sync.h>
+#include <net/xdp.h>
#include <rdma/ib_verbs.h>
#include <rdma/vrdma_abi.h>
@@ -112,5 +114,225 @@ enum vrdma_verbs_cmd {
VIRTIO_RDMA_CMD_REQ_NOTIFY_CQ,
};
+#define VRDMA_CTRL_OK 0
+#define VRDMA_CTRL_ERR 1
+
+/**
+ * struct verbs_ctrl_buf - Control buffer command/status structure
+ * @cmd: Command code from driver to host (VRDMA_VQ_CTRL_*).
+ * @status: Status returned by host (0 = success, non-zero = error).
+ *
+ * Used in control virtqueue for configuration operations.
+ */
+struct verbs_ctrl_buf {
+ u8 cmd;
+ u8 status;
+} __packed;
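+
+/*
+ * Typical flow (illustrative): the driver fills @cmd, queues this buffer
+ * on the ctrl vq together with the command-specific request/response sgs,
+ * kicks the queue, and waits for the host to fill in @status.
+ */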
+
+/**
+ * struct vrdma_sq_stats - Statistics for a send queue (transmit path)
+ * @syncp: Synchronization point for 64-bit stats on 32-bit CPUs.
+ * @packets: Number of packets transmitted.
+ * @bytes: Total number of bytes transmitted.
+ * @xdp_tx: Number of XDP frames sent via XDP_TX action.
+ * @xdp_tx_drops: Dropped due to ring full or mapping failure.
+ * @kicks: Number of times the TX virtqueue was kicked.
+ * @tx_timeouts: Number of transmit timeouts detected.
+ */
+struct vrdma_sq_stats {
+ struct u64_stats_sync syncp;
+ u64 packets;
+ u64 bytes;
+ u64 xdp_tx;
+ u64 xdp_tx_drops;
+ u64 kicks;
+ u64 tx_timeouts;
+};
+
+/**
+ * struct vrdma_rq_stats - Statistics for a receive queue (receive path)
+ * @syncp: Synchronization point for 64-bit stats on 32-bit CPUs.
+ * @packets: Number of packets received.
+ * @bytes: Total number of bytes received.
+ * @drops: Packet drops due to no available buffers.
+ * @xdp_packets: Number of packets processed by XDP.
+ * @xdp_tx: Packets sent back via XDP_TX.
+ * @xdp_redirects: Packets redirected via XDP_REDIRECT.
+ * @xdp_drops: Packets dropped via XDP_DROP or mapping failure.
+ * @kicks: Number of times RQ was kicked after refill.
+ */
+struct vrdma_rq_stats {
+ struct u64_stats_sync syncp;
+ u64 packets;
+ u64 bytes;
+ u64 drops;
+ u64 xdp_packets;
+ u64 xdp_tx;
+ u64 xdp_redirects;
+ u64 xdp_drops;
+ u64 kicks;
+};
+
+/* EWMA: Exponentially Weighted Moving Average for RX packet length */
+DECLARE_EWMA(pkt_len, 0, 64) /* precision=0, weight_rcp=64 */
+
+/**
+ * struct vrdma_send_queue - Internal representation of a TX virtqueue
+ * @vq: The associated virtqueue for sending packets.
+ * @sg: Scatterlist used per transmission (header + linear data + frags).
+ * @name: Human-readable name (e.g., "output.0").
+ * @stats: Transmit statistics under NAPI protection.
+ * @napi: NAPI context for interrupt moderation and polling.
+ * @reset: True if SQ is undergoing reset/recovery.
+ *
+ * One per transmit queue pair. Runs in NAPI poll context during congestion.
+ */
+struct vrdma_send_queue {
+ struct virtqueue *vq;
+ struct scatterlist sg[MAX_SKB_FRAGS + 2];
+ char name[16];
+
+ struct vrdma_sq_stats stats;
+ struct napi_struct napi;
+
+ bool reset;
+} __aligned(64);
+
+/**
+ * struct vrdma_receive_queue - Internal representation of an RX virtqueue
+ * @vq: The associated virtqueue for receiving packets.
+ * @napi: NAPI instance for batched processing of incoming packets.
+ * @xdp_prog: Current XDP BPF program (protected by RCU).
+ * @stats: Receive-side statistics.
+ * @pages: Linked list of pages used as packet buffers (via page->private).
+ * @mrg_avg_pkt_len: EWMA of packet length for mergeable buffer sizing.
+ * @alloc_frag: Page fragment allocator for non-mergeable case.
+ * @sg: Scatterlist used during RX submission.
+ * @min_buf_len: Minimum buffer size when using mergeable RX.
+ * @name: Human-readable name (e.g., "input.0").
+ * @xdp_rxq: Metadata for XDP frame reception (used with xdp_do_flush()).
+ *
+ * Maintains state for receiving packets from the host.
+ */
+struct vrdma_receive_queue {
+ struct virtqueue *vq;
+ struct napi_struct napi;
+ struct bpf_prog __rcu *xdp_prog;
+ struct vrdma_rq_stats stats;
+
+ struct page *pages;
+ struct ewma_pkt_len mrg_avg_pkt_len;
+ struct page_frag alloc_frag;
+
+ struct scatterlist sg[MAX_SKB_FRAGS + 2];
+ unsigned int min_buf_len;
+ char name[16];
+
+ struct xdp_rxq_info xdp_rxq;
+} __aligned(64);
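+
+/*
+ * Both queue structures are 64-byte aligned (a typical cache-line size)
+ * so the hot state of adjacent queues does not share cache lines, which
+ * would cause false sharing between CPUs.
+ */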
+
+/**
+ * struct vrdma_info - Main device private data structure
+ * @vdev: Virtio device backing this interface.
+ * @cvq: Control virtqueue (optional, if feature bit set).
+ * @dev: Net device registered to kernel networking stack.
+ * @sq: Array of send queues (size = curr_queue_pairs).
+ * @rq: Array of receive queues (size = curr_queue_pairs).
+ * @status: Current device status (from config space).
+ * @max_queue_pairs: Maximum number of queue pairs supported by host.
+ * @curr_queue_pairs: Currently active queue pairs.
+ * @xdp_queue_pairs: Number of queue pairs dedicated to XDP processing.
+ * @xdp_enabled: Whether XDP is currently active.
+ * @big_packets: Host supports jumbo frames / large MTU.
+ * @big_packets_num_skbfrags: Max SG entries allocated for big packets.
+ * @mergeable_rx_bufs: Host can merge multiple RX buffers into one SKB.
+ * @has_rss: Host supports RSS (Receive Side Scaling).
+ * @has_rss_hash_report: Host provides hash value and type in RX header.
+ * @rss_key_size: Size of RSS key in bytes.
+ * @rss_indir_table_size: Size of indirection table.
+ * @rss_hash_types_supported: Bitmap of supported hash types (TCPV4, UDP6, etc).
+ * @rss_hash_types_saved: User-configured hash types enabled.
+ * @has_cvq: True if control virtqueue is present.
+ * @any_header_sg: Host allows splitting headers across SG elements.
+ * @hdr_len: Size of the transport header (virtio_net_hdr + optional metadata).
+ * @refill: Work item for delayed RX ring refill under memory pressure.
+ * @refill_enabled: Whether delayed refill mechanism is active.
+ * @refill_lock: Spinlock protecting access to refill_enabled.
+ * @config_work: Work item for handling config space changes (e.g., link up/down).
+ * @affinity_hint_set: Whether affinity hints are applied to VQ interrupts.
+ * @node: CPU hotplug notifier node for online events.
+ * @node_dead: CPU hotplug notifier node for dead events.
+ * @ctrl: Pre-allocated control buffer for synchronous CVQ commands.
+ * @duplex: Current duplex setting (from ethtool).
+ * @speed: Current link speed (from ethtool).
+ * @tx_usecs: Interrupt coalescing: TX timer in microseconds.
+ * @rx_usecs: Interrupt coalescing: RX timer in microseconds.
+ * @tx_max_packets: Interrupt coalescing: max packets before IRQ.
+ * @rx_max_packets: Interrupt coalescing: max packets before IRQ.
+ * @guest_offloads: Currently negotiated offload features.
+ * @guest_offloads_capable: Offload capabilities reported by host.
+ * @failover: Failover handle if STANDBY feature is enabled.
+ *
+ * This structure holds all per-device state for the vrdma driver.
+ */
+struct vrdma_info {
+ struct virtio_device *vdev;
+ struct virtqueue *cvq;
+ struct net_device *dev;
+ struct vrdma_send_queue *sq;
+ struct vrdma_receive_queue *rq;
+
+ unsigned int status;
+
+ u16 max_queue_pairs;
+ u16 curr_queue_pairs;
+ u16 xdp_queue_pairs;
+ bool xdp_enabled;
+
+ bool big_packets;
+ unsigned int big_packets_num_skbfrags;
+ bool mergeable_rx_bufs;
+
+ bool has_rss;
+ bool has_rss_hash_report;
+ u8 rss_key_size;
+ u16 rss_indir_table_size;
+ u32 rss_hash_types_supported;
+ u32 rss_hash_types_saved;
+
+ bool has_cvq;
+ bool any_header_sg;
+ u8 hdr_len;
+
+ struct delayed_work refill;
+ bool refill_enabled;
+ spinlock_t refill_lock;
+
+ struct work_struct config_work;
+ bool affinity_hint_set;
+
+ struct hlist_node node;
+ struct hlist_node node_dead;
+
+ void *ctrl; /* use flexible array later if needed */
+
+ /* Ethtool settings */
+ u8 duplex;
+ u32 speed;
+
+ /* Interrupt coalescing */
+ u32 tx_usecs;
+ u32 rx_usecs;
+ u32 tx_max_packets;
+ u32 rx_max_packets;
+
+ unsigned long guest_offloads;
+ unsigned long guest_offloads_capable;
+
+#ifdef CONFIG_NET_FAILOVER
+ struct failover *failover;
+#endif
+} __aligned(64);
+
#endif
\ No newline at end of file
diff --git a/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_ib.c b/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_ib.c
index 379bd23d3..825ec58bd 100644
--- a/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_ib.c
+++ b/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_ib.c
@@ -49,6 +49,118 @@ static const char * const cmd_str[] = {
[VIRTIO_RDMA_CMD_REQ_NOTIFY_CQ] = "REQ_NOTIFY_CQ",
};
+/**
+ * vrdma_exec_verbs_cmd - Execute a verbs command via control virtqueue
+ * @vrdev: VRDMA device
+ * @verbs_cmd: Command opcode (VRDMA_CMD_*)
+ * @verbs_in: Input data SG list (optional)
+ * @verbs_out: Output data SG list (optional)
+ *
+ * Context: Can be called from process or atomic context (e.g., NAPI, workqueue).
+ * Locking: Expects caller to handle serialization if needed.
+ * Return: 0 on success, negative errno on failure.
+ */
+static int vrdma_exec_verbs_cmd(struct vrdma_dev *vrdev, int verbs_cmd,
+ struct scatterlist *verbs_in,
+ struct scatterlist *verbs_out)
+{
+ struct vrdma_info *vrdma_info = netdev_priv(vrdev->netdev);
+ struct virtqueue *vq = vrdev->ctrl_vq;
+ struct verbs_ctrl_buf *ctrl_buf;
+ struct scatterlist hdr_sg, status_sg;
+ struct scatterlist *sgs[4];
+ unsigned int out_num = 1, in_num = 1;
+ unsigned int len;
+ int ret, timeout_loops = VRDMA_COMM_TIMEOUT;
+ unsigned long flags;
+
+ if (unlikely(!vq)) {
+ netdev_err(vrdma_info->dev, "Missing control virtqueue\n");
+ return -EINVAL;
+ }
+
+ ctrl_buf = kmalloc(sizeof(*ctrl_buf), GFP_ATOMIC);
+ if (!ctrl_buf)
+ return -ENOMEM;
+ ctrl_buf->cmd = verbs_cmd;
+ ctrl_buf->status = ~0U;
+
+ /* Prepare scatterlists for sending command and receiving status */
+ sg_init_one(&hdr_sg, &ctrl_buf->cmd, sizeof(ctrl_buf->cmd));
+ sgs[0] = &hdr_sg;
+
+ if (verbs_in) {
+ sgs[1] = verbs_in;
+ in_num++;
+ }
+
+ sg_init_one(&status_sg, &ctrl_buf->status, sizeof(ctrl_buf->status));
+ sgs[in_num] = &status_sg;
+
+ if (verbs_out) {
+ sgs[in_num + 1] = verbs_out;
+ out_num++;
+ }
+
+ spin_lock_irqsave(&vrdev->ctrl_lock, flags);
+
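+ /*
+ * Note on naming: in_num counts driver->device buffers (virtio
+ * "out" direction) and out_num counts device->driver buffers
+ * ("in" direction), matching the virtqueue_add_sgs() argument
+ * order below.
+ */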
+ ret = virtqueue_add_sgs(vq, sgs, in_num, out_num, vrdev, GFP_ATOMIC);
+ if (ret) {
+ netdev_err(vrdma_info->dev, "Failed to add cmd %d to CVQ: %d\n",
+ verbs_cmd, ret);
+ goto unlock;
+ }
+
+ if (unlikely(!virtqueue_kick(vq))) {
+ netdev_err(vrdma_info->dev, "Failed to kick CVQ for cmd %d\n", verbs_cmd);
+ ret = -EIO;
+ goto unlock;
+ }
+
+ /* Wait for response: loop with timeout to avoid infinite blocking */
+ ret = -ETIMEDOUT;
+ while (1) {
+ if (virtqueue_get_buf(vq, &len)) {
+ ret = 0;
+ break;
+ }
+ if (unlikely(virtqueue_is_broken(vq))) {
+ netdev_err(vrdma_info->dev, "CVQ is broken\n");
+ ret = -EIO;
+ break;
+ }
+ cpu_relax();
+ /*
+ * Prevent infinite wait. In non-atomic context, consider using schedule_timeout()
+ * for better CPU utilization.
+ */
+ if (!--timeout_loops) {
+ netdev_err(vrdma_info->dev, "Timeout waiting for cmd %d response\n",
+ verbs_cmd);
+ break;
+ }
+ }
+
+unlock:
+ spin_unlock_irqrestore(&vrdev->ctrl_lock, flags);
+
+ /* Log final result */
+ if (ret == 0 && ctrl_buf->status != VRDMA_CTRL_OK) {
+ netdev_err(vrdma_info->dev, "EXEC cmd %s failed: status=%d\n",
+ cmd_str[verbs_cmd], ctrl_buf->status);
+ ret = -EIO; /* Host returned an error status */
+ } else if (ret == 0) {
+ netdev_dbg(vrdma_info->dev, "EXEC cmd %s OK\n", cmd_str[verbs_cmd]);
+ } else {
+ netdev_err(vrdma_info->dev, "EXEC cmd %s failed: ret=%d\n",
+ cmd_str[verbs_cmd], ret);
+ }
+
+ kfree(ctrl_buf);
+ return ret;
+}
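+
+/*
+ * Typical caller sketch (mirrors vrdma_query_port() later in the series):
+ * wrap the command and response buffers in scatterlists, then execute:
+ *
+ *	sg_init_one(&in, cmd, sizeof(*cmd));
+ *	sg_init_one(&out, rsp, sizeof(*rsp));
+ *	ret = vrdma_exec_verbs_cmd(vdev, VIRTIO_RDMA_CMD_QUERY_PORT,
+ *				   &in, &out);
+ */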
+
static const struct ib_device_ops virtio_rdma_dev_ops = {
.owner = THIS_MODULE,
.uverbs_abi_ver = VIRTIO_RDMA_ABI_VERSION,
diff --git a/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_ib.h b/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_ib.h
index 9a7a0a168..bdba5a9de 100644
--- a/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_ib.h
+++ b/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_ib.h
@@ -11,6 +11,8 @@
#include <rdma/ib_verbs.h>
#include <rdma/vrdma_abi.h>
+#define VRDMA_COMM_TIMEOUT 1000000
+
enum {
VIRTIO_RDMA_ATOMIC_NONE,
VIRTIO_RDMA_ATOMIC_HCA,
--
2.43.0
* [PATCH 03/10] drivers/infiniband/hw/virtio: Implement core device and key resource management
2025-12-18 9:09 Implement initial driver for virtio-RDMA device(kernel) Xiong Weimin
2025-12-18 9:09 ` [PATCH 01/10] drivers/infiniband/hw/virtio: Initial driver for virtio RDMA devices Xiong Weimin
2025-12-18 9:09 ` [PATCH 02/10] drivers/infiniband/hw/virtio: add vrdma_exec_verbs_cmd to construct verbs sgs using virtio Xiong Weimin
@ 2025-12-18 9:09 ` Xiong Weimin
2025-12-18 9:09 ` [PATCH 04/10] drivers/infiniband/hw/virtio: Implement MR, GID, ucontext and AH resource management verbs Xiong Weimin
` (8 subsequent siblings)
11 siblings, 0 replies; 18+ messages in thread
From: Xiong Weimin @ 2025-12-18 9:09 UTC (permalink / raw)
To: Michael S . Tsirkin, David Hildenbrand, Jason Wang,
Stefano Garzarella, Thomas Monjalon, David Marchand,
Luca Boccassi, Kevin Traynor, Christian Ehrhardt, Xuan Zhuo,
Eugenio Pérez, Xueming Li, Maxime Coquelin, Chenbo Xia,
Bruce Richardson
Cc: kvm, virtualization, netdev, xiongweimin
From: xiongweimin <xiongweimin@kylinos.cn>
This commit consolidates the foundational implementation of the vhost-user
RDMA device driver, including:
1. Core Device Initialization:
- DPDK EAL setup with POSIX signal handling
- NUMA-aware resource allocation (packet pools, ring buffers)
- Backend netdev auto-detection (net_tap/net_vhost)
- Multi-device support with isolated RX/TX resources
- vHost-user protocol feature negotiation
2. RDMA Control Path:
- Device capability queries (VHOST_RDMA_CTRL_ROCE_QUERY_DEVICE)
- Port attribute reporting (VHOST_RDMA_CTRL_ROCE_QUERY_PORT)
- Scatterlist helpers for vmalloc/linear buffers
- Atomic memory handling for interrupt contexts
3. Resource Management:
- Protection Domains (PD) allocation/destruction
- Completion Queues (CQ) creation/destruction with:
* Kernel-mode pre-posted buffers
* Userspace mmap support for zero-copy polling
* DMA-coherent ring buffers
- Queue Pairs (QP) creation/destruction with:
* Dual-mode support (kernel/userspace)
* Dynamic WQE buffer sizing
* Doorbell register mapping
- Global bitmap-based object pools (PD/CQ/QP/AH/MR)
4. Userspace Integration:
- Detailed mmap structures for SQ/RQ rings
- Atomic counters for resource tracking
- Comprehensive error handling paths
- ABI-compliant uresponse structures
The implementation features:
- Device/port attribute reporting compliant with IB specifications
- Per-resource reference counting
- Graceful resource cleanup during destruction
- Support for both kernel and userspace memory models
Signed-off-by: Xiong Weimin <xiongweimin@kylinos.cn>
---
.../drivers/infiniband/hw/virtio/vrdma.h | 5 +
.../drivers/infiniband/hw/virtio/vrdma_abi.h | 279 ++++
.../infiniband/hw/virtio/vrdma_dev_api.h | 46 +
.../drivers/infiniband/hw/virtio/vrdma_ib.c | 1178 +++++++++++++++--
.../drivers/infiniband/hw/virtio/vrdma_ib.h | 106 +-
.../drivers/infiniband/hw/virtio/vrdma_main.c | 86 +-
.../drivers/infiniband/hw/virtio/vrdma_mmap.h | 88 ++
.../infiniband/hw/virtio/vrdma_netdev.c | 130 +-
.../infiniband/hw/virtio/vrdma_queue.c | 110 ++
.../infiniband/hw/virtio/vrdma_queue.h | 3 +-
linux-6.16.8/include/rdma/ib_verbs.h | 9 +
11 files changed, 1806 insertions(+), 234 deletions(-)
create mode 100644 linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_abi.h
create mode 100644 linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_mmap.h
diff --git a/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma.h b/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma.h
index a646794ef..99909446f 100644
--- a/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma.h
+++ b/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma.h
@@ -80,4 +80,9 @@ struct vrdma_dev {
bool fast_doorbell;
};
+static inline struct vrdma_dev *to_vdev(struct ib_device *ibdev)
+{
+ return container_of(ibdev, struct vrdma_dev, ib_dev);
+}
+
#endif
\ No newline at end of file
diff --git a/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_abi.h b/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_abi.h
new file mode 100644
index 000000000..7cdc4e488
--- /dev/null
+++ b/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_abi.h
@@ -0,0 +1,279 @@
+/* SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause */
+
+/* Authors: Xiong Weimin <xiongweimin@kylinos.cn> */
+/* Copyright 2020 kylinos.cn. All rights reserved. */
+#ifndef __VRDMA_ABI_H__
+#define __VRDMA_ABI_H__
+
+#include <linux/types.h>
+
+#define VRDMA_ABI_VERSION 1
+
+/**
+ * struct vrdma_cqe - Virtio-RDMA Completion Queue Entry (CQE)
+ *
+ * This structure represents a single completion entry in the Completion Queue (CQ).
+ * It is written by the kernel driver (or directly by the backend via shared memory)
+ * when a Work Request (WR) completes, and is read by userspace applications during
+ * polling or event handling.
+ *
+ * The layout matches the semantics of `struct ib_wc` but is exposed to userspace
+ * for zero-copy access. All fields use native byte order (little-endian assumed),
+ * as virtio is inherently little-endian.
+ *
+ * @wr_id: User-provided WR identifier, copied from the original send/receive request.
+ * Used to correlate completions with outstanding operations.
+ * @status: Completion status (e.g., IB_WC_SUCCESS, IB_WC_RETRY_EXCEEDED_ERR, etc.).
+ * See &enum ib_wc_status.
+ * @opcode: Operation type that completed (e.g., IB_WC_SEND, IB_WC_RECV, IB_WC_RDMA_WRITE).
+ * See &enum ib_wc_opcode.
+ * @vendor_err: Vendor-specific error code (if any). Typically 0 on success.
+ * @byte_len: Number of bytes transferred in this operation.
+ * @ex: Union containing additional data based on operation type:
+ * - @imm_data: Immediate data received in a SEND with immediate flag.
+ * - @invalidate_rkey: RKEY invalidated by an RDMA WRITE with invalidate.
+ * @qp_num: Source QP number (for incoming completions, this is the remote QP).
+ * @src_qp: Alias for @qp_num; kept for symmetry with IB core naming.
+ * @wc_flags: Bitmask of completion flags (e.g., IB_WC_GRH, IB_WC_WITH_IMM, IB_WC_COMPLETION_TIMESTAMP).
+ * See &enum ib_wc_flags.
+ * @pkey_index: Partition Key index used for this packet (local context).
+ * @slid: Source LID (16-bit), valid only for IB transport if GRH not present.
+ * @sl: Service Level (4 bits), extracted from the packet header.
+ * @dlid_path_bits: Encodes either DLID path bits (in RoCE/IB) or switch path information.
+ * @port_num: Physical port number on which the packet was received.
+ *
+ * Note:
+ * This structure is mapped into userspace via mmap() along with the CQ ring buffer.
+ * Applications poll this array for new completions without system calls.
+ *
+ * Memory Layout Example:
+ *
+ * struct vrdma_cqe cq_ring[N];
+ *
+ * while (polling) {
+ * struct vrdma_cqe *cqe = &cq_ring[head];
+ * if (cqe->status == VRDMA_CQ_STATUS_EMPTY)
+ * break; // no more completions
+ *
+ * process_completion(cqe);
+ * cqe->status = VRDMA_CQ_STATUS_PROCESSED; // optional acknowledgment
+ * head = (head + 1) % N;
+ * }
+ *
+ * Alignment: Must be aligned to 8-byte boundary. Size: typically 64 bytes.
+ */
+struct vrdma_cqe {
+ __u64 wr_id; /* [out] User-defined WR ID */
+ __u32 status; /* [out] Status of the completed WR */
+ __u32 opcode; /* [out] Type of operation completed */
+ __u32 vendor_err; /* [out] Vendor-specific error code */
+ __u32 byte_len; /* [out] Number of bytes transferred */
+
+ union {
+ __u32 imm_data; /* [out] Immediate data (if IBV_WC_WITH_IMM) */
+ __u32 invalidate_rkey; /* [out] RKEY invalidated (if IBV_WC_WITH_INVALIDATE) */
+ } ex;
+
+ __u32 qp_num; /* [out] Remote QP number (source QP) */
+ __u32 src_qp; /* [out] Alias of qp_num for clarity */
+ __u32 wc_flags; /* [out] Flags (e.g., IB_WC_GRH, IB_WC_WITH_IMM) */
+
+ __u16 pkey_index; /* [out] P_Key index used */
+ __u16 slid; /* [out] Source LID (16-bit) */
+ __u8 sl; /* [out] Service Level */
+ __u8 dlid_path_bits; /* [out] DLID path bits / switch routing info */
+ __u8 port_num; /* [out] Port number where packet was received */
+ __u8 reserved[3]; /* Pad to maintain 8-byte alignment */
+};
+
+/**
+ * struct vrdma_create_cq_uresp - Response to userspace on CQ creation with mmap support
+ * @offset: File offset to be used in mmap() for mapping the CQ and vring.
+ * Passed back from kernel via rdma_user_mmap_get_offset().
+ * Userspace does: mmap(0, size, PROT_READ, MAP_SHARED, fd, offset);
+ * @cq_size: Total size of the mapped region, including:
+ * - CQ event ring (array of struct vrdma_cqe)
+ * - Virtqueue structure (descriptor table, available, used rings)
+ * Must be page-aligned.
+ * @cq_phys_addr: Physical address of the CQ ring buffer (optional).
+ * May be used by userspace for debugging or memory inspection tools.
+ * @used_off: Offset of the "used" ring within the virtqueue's memory layout.
+ * Calculated as: used_ring_addr - desc_table_addr.
+ * Allows userspace to directly map and read completions without syscalls.
+ * @vq_size: Size of the entire virtqueue (including padding), page-aligned.
+ * Used by userspace to determine how much extra memory to map beyond CQ ring.
+ * @num_cqe: Number of CQ entries (completion queue depth) allocated.
+ * Useful for bounds checking in userspace.
+ * @num_cvqe: Number of completion virtqueue elements (i.e., size of vring).
+ * Corresponds to virtqueue_get_vring_size(vcq->vq->vq).
+ * Indicates how many completion events can be queued.
+ *
+ * This structure is passed from kernel to userspace via ib_copy_to_udata()
+ * during CQ creation when a user context is provided. It enables zero-copy,
+ * polling-based completion handling by allowing userspace to directly access:
+ * - The CQ event ring (for reading work completions)
+ * - The virtqueue used ring (to detect when device has posted new completions)
+ *
+ * Memory Layout After mmap():
+ *
+ * +---------------------+
+ * | CQ Event Ring | <- Mapped at base addr
+ * | (num_cqe entries) |
+ * +---------------------+
+ * | Virtqueue: |
+ * | - Desc Table |
+ * | - Available Ring |
+ * | - Used Ring | <- Accessed via (base + used_off)
+ * +---------------------+
+ *
+ * Example usage in userspace:
+ *
+ * void *addr = mmap(NULL, uresp.cq_size, PROT_READ, MAP_SHARED,
+ * ctx->cmd_fd, uresp.offset);
+ * struct vrdma_cqe *cqe_ring = addr;
+ * struct vring_used *used_ring = addr + uresp.used_off;
+ */
+struct vrdma_create_cq_uresp {
+ __u64 offset; /* mmap offset for userspace */
+ __u64 cq_size; /* total size to map (CQ + vring) */
+ __u64 cq_phys_addr; /* physical address of CQ ring (hint) */
+ __u64 used_off; /* offset to used ring inside vring */
+ __u32 vq_size; /* size of the virtqueue (aligned) */
+ __u32 num_cqe; /* number of CQ entries */
+ __u32 num_cvqe; /* number of completion VQ descriptors */
+};
+
+struct vrdma_alloc_pd_uresp {
+ __u32 pdn;
+};
+
+/**
+ * struct vrdma_create_qp_uresp - User response for QP creation in virtio-rdma
+ * @sq_mmap_offset: Offset to mmap the Send Queue (SQ) ring buffer
+ * @sq_mmap_size: Size of the SQ ring buffer available for mmap
+ * @sq_db_addr: Physical address (or token) for SQ doorbell register access
+ * @svq_used_idx_off: Offset within SQ mmap where used index is stored (polling support)
+ * @svq_ring_size: Number of entries in the backend's send virtqueue
+ * @num_sq_wqes: Maximum number of SQ WQEs this QP can post
+ * @sq_head_idx: Current head index in kernel's SQ ring (optional debug info)
+ *
+ * @rq_mmap_offset: Offset to mmap the Receive Queue (RQ) ring buffer
+ * @rq_mmap_size: Size of the RQ ring buffer available for mmap
+ * @rq_db_addr: Physical address (or token) for RQ doorbell register access
+ * @rvq_used_idx_off: Offset within RQ mmap where used index is stored
+ * @rvq_ring_size: Number of entries in the backend's receive virtqueue
+ * @num_rq_wqes: Maximum number of RQ WQEs this QP can post
+ * @rq_head_idx: Current head index in kernel's RQ ring
+ *
+ * @notifier_size: Size of notification area (e.g., CQ notifier, event counter)
+ * @qp_handle: Unique identifier for this QP (qpn)
+ *
+ * This structure is passed back to userspace via `ib_copy_to_udata()`
+ * during QP creation. It allows userspace to:
+ * - Map SQ/RQ rings into its address space
+ * - Access doorbells directly (if supported)
+ * - Poll for completion status via used index
+ */
+struct vrdma_create_qp_uresp {
+ __u64 sq_mmap_offset;
+ __u64 sq_mmap_size;
+ __u64 sq_db_addr;
+ __u64 svq_used_idx_off;
+ __u32 svq_ring_size;
+ __u32 num_sq_wqes;
+ __u32 num_svqe;
+ __u32 sq_head_idx;
+
+ __u64 rq_mmap_offset;
+ __u64 rq_mmap_size;
+ __u64 rq_db_addr;
+ __u64 rvq_used_idx_off;
+ __u32 rvq_ring_size;
+ __u32 num_rq_wqes;
+ __u32 num_rvqe;
+ __u32 rq_head_idx;
+
+ __u32 notifier_size;
+
+ __u32 qp_handle;
+};
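+
+/*
+ * Illustrative userspace flow (identifiers below are assumptions, not part
+ * of this ABI): after QP creation, map the SQ ring and poll the used index:
+ *
+ *	sq = mmap(NULL, uresp.sq_mmap_size, PROT_READ | PROT_WRITE,
+ *		  MAP_SHARED, cmd_fd, uresp.sq_mmap_offset);
+ *	used_idx = *(volatile __u16 *)(sq + uresp.svq_used_idx_off);
+ */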
+
+/**
+ * struct vrdma_av - Address Vector for Virtio-RDMA QP routing
+ *
+ * An Address Vector (AV) contains L2/L3 network path information used to
+ * route packets from a UD or RC QP to a remote destination. It is analogous
+ * to InfiniBand's AV structure in user verbs.
+ *
+ * All fields use fixed-width types for ABI stability across architectures.
+ */
+struct vrdma_av {
+ __u32 port:8; /* Physical port index (1-based) */
+ __u32 pdn:8; /* Protection domain number (PD handle) */
+ __u32 sl_tclass_flowlabel:16; /* SL/TClass/FlowLabel bits packed into 16 bits */
+
+ __u8 dgid[16]; /* Destination Global Identifier (GID), big-endian */
+
+ __u8 gid_index; /* Outbound GID table index (for source GID selection) */
+ __u8 stat_rate; /* Static rate control (enum ibv_rate) */
+ __u8 hop_limit; /* IPv6-style hop limit / TTL */
+ __u8 dmac[6]; /* Destination MAC address (for L2 forwarding) */
+
+ __u8 reserved[6]; /* Reserved for future use / alignment padding */
+};
+
+/**
+ * struct vrdma_cmd_post_send - User-space command to post a Send WQE
+ *
+ * This structure is passed from userspace via ioctl (e.g., WRITE on uverbs char dev)
+ * to request posting one or more work queue entries (WQEs) on the Send Queue (SQ).
+ * It mirrors the semantics of `ibv_post_send()` in libibverbs.
+ *
+ * All fields use fixed-size types for ABI stability across architectures.
+ */
+struct vrdma_cmd_post_send {
+ __u32 num_sge; /* Number of scatter-gather elements in this WQE */
+
+ __u32 send_flags; /* IBV_SEND_xxx flags (e.g., signaled, inline, fence) */
+ __u32 opcode; /* Operation code: RDMA_WRITE, SEND, ATOMIC, etc. */
+ __u64 wr_id; /* Work Request ID returned in CQE */
+
+ union {
+ __be32 imm_data; /* Immediate data for RC/UC QPs */
+ __u32 invalidate_rkey; /* rkey to invalidate (on SEND_WITH_INV) */
+ } ex;
+
+ union wr_data {
+ struct {
+ __u64 remote_addr; /* Target virtual address for RDMA op */
+ __u32 rkey; /* Remote key for memory access */
+ } rdma;
+
+ struct {
+ __u64 remote_addr; /* Address of atomic variable */
+ __u64 compare_add; /* Value to compare */
+ __u64 swap; /* Value to swap (or add) */
+ __u32 rkey; /* Remote memory key */
+ } atomic;
+
+ struct {
+ __u32 remote_qpn; /* Destination QP number */
+ __u32 remote_qkey; /* Q_Key for UD packet validation */
+ struct vrdma_av av; /* Address vector (L2/L3 info) */
+ } ud;
+
+ struct {
+ __u32 mrn; /* Memory Region Number (MR handle) */
+ __u32 key; /* Staging rkey for MR registration */
+ __u32 access; /* Access flags (IB_ACCESS_xxx) */
+ } reg;
+ } wr;
+};
+
+struct vrdma_sge {
+ __u64 addr;
+ __u32 length;
+ __u32 lkey;
+};
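+
+/*
+ * Example encoding of a one-SGE RDMA WRITE request as userspace would place
+ * it in the mapped SQ ring (values are illustrative):
+ *
+ *	struct vrdma_cmd_post_send wr = {
+ *		.num_sge = 1,
+ *		.opcode = IBV_WR_RDMA_WRITE,
+ *		.wr_id = cookie,
+ *		.wr.rdma = { .remote_addr = raddr, .rkey = rkey },
+ *	};
+ *
+ * The WQE is followed in the ring by num_sge struct vrdma_sge entries.
+ */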
+
+#endif
diff --git a/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_dev_api.h b/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_dev_api.h
index 3b1f7d2b6..d1db1bea4 100644
--- a/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_dev_api.h
+++ b/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_dev_api.h
@@ -114,6 +114,52 @@ enum vrdma_verbs_cmd {
VIRTIO_RDMA_CMD_REQ_NOTIFY_CQ,
};
+struct vrdma_cmd_query_port {
+ u32 port;
+};
+
+struct vrdma_cmd_create_cq {
+ u32 cqe;
+};
+
+struct vrdma_rsp_create_cq {
+ u32 cqn;
+};
+
+struct vrdma_cmd_destroy_cq {
+ u32 cqn;
+};
+
+struct vrdma_rsp_create_pd {
+ __u32 pdn;
+};
+
+struct vrdma_cmd_destroy_pd {
+ __u32 pdn;
+};
+
+struct vrdma_cmd_create_qp {
+ __u32 pdn;
+ __u8 qp_type;
+ __u8 sq_sig_type;
+ __u32 max_send_wr;
+ __u32 max_send_sge;
+ __u32 send_cqn;
+ __u32 max_recv_wr;
+ __u32 max_recv_sge;
+ __u32 recv_cqn;
+
+ __u32 max_inline_data;
+};
+
+struct vrdma_rsp_create_qp {
+ __u32 qpn;
+};
+
+struct vrdma_cmd_destroy_qp {
+ __u32 qpn;
+};
+
#define VRDMA_CTRL_OK 0
#define VRDMA_CTRL_ERR 1
diff --git a/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_ib.c b/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_ib.c
index 825ec58bd..f1f53314f 100644
--- a/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_ib.c
+++ b/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_ib.c
@@ -17,6 +17,9 @@
#include "vrdma_dev.h"
#include "vrdma_dev_api.h"
#include "vrdma_ib.h"
+#include "vrdma_abi.h"
+#include "vrdma_mmap.h"
+#include "vrdma_queue.h"
/**
* cmd_str - String representation of virtio RDMA control commands
@@ -61,110 +64,1043 @@ static const char * const cmd_str[] = {
* Return: 0 on success, negative errno on failure.
*/
static int vrdma_exec_verbs_cmd(struct vrdma_dev *vrdev, int verbs_cmd,
- struct scatterlist *verbs_in,
- struct scatterlist *verbs_out)
+ struct scatterlist *verbs_in,
+ struct scatterlist *verbs_out)
{
- struct vrdma_info *vrdma_info = netdev_priv(vrdev->netdev);
- struct virtqueue *vq = vrdev->ctrl_vq;
- struct verbs_ctrl_buf *ctrl_buf;
- struct scatterlist hdr_sg, status_sg;
- struct scatterlist *sgs[4];
- unsigned int out_num = 1, in_num = 1;
- unsigned int len;
- int ret, timeout_loops = VRDMA_COMM_TIMEOUT;
- unsigned long flags;
-
- if (unlikely(!vq)) {
- netdev_err(vrdma_info->dev, "Missing control virtqueue\n");
- return -EINVAL;
- }
-
- ctrl_buf = kmalloc(sizeof(*ctrl_buf), GFP_ATOMIC);
- if (!ctrl_buf)
- return -ENOMEM;
- ctrl_buf->cmd = verbs_cmd;
- ctrl_buf->status = ~0U;
-
- /* Prepare scatterlists for sending command and receiving status */
- sg_init_one(&hdr_sg, &ctrl_buf->cmd, sizeof(ctrl_buf->cmd));
- sgs[0] = &hdr_sg;
-
- if (verbs_in) {
- sgs[1] = verbs_in;
+ struct vrdma_info *vrdma_info = netdev_priv(vrdev->netdev);
+ struct virtqueue *vq = vrdev->ctrl_vq;
+ struct verbs_ctrl_buf *ctrl_buf;
+ struct scatterlist hdr_sg, status_sg;
+ struct scatterlist *sgs[4];
+ unsigned int out_num = 1, in_num = 1;
+ unsigned int len;
+ int ret, timeout_loops = VRDMA_COMM_TIMEOUT;
+ unsigned long flags;
+
+ if (unlikely(!vq)) {
+ netdev_err(vrdma_info->dev, "Missing control virtqueue\n");
+ return -EINVAL;
+ }
+
+ ctrl_buf = kmalloc(sizeof(*ctrl_buf), GFP_ATOMIC);
+ if (!ctrl_buf)
+ return -ENOMEM;
+ ctrl_buf->cmd = verbs_cmd;
+ ctrl_buf->status = VRDMA_CTRL_ERR;
+
+ /* Prepare scatterlists for sending command and receiving status */
+ sg_init_one(&hdr_sg, &ctrl_buf->cmd, sizeof(ctrl_buf->cmd));
+ sgs[0] = &hdr_sg;
+
+ if (verbs_in) {
+ sgs[1] = verbs_in;
in_num++;
- }
+ }
- sg_init_one(&status_sg, &ctrl_buf->status, sizeof(ctrl_buf->status));
- sgs[in_num] = &status_sg;
+ sg_init_one(&status_sg, &ctrl_buf->status, sizeof(ctrl_buf->status));
+ sgs[in_num] = &status_sg;
- if (verbs_out) {
- sgs[in_num + 1] = verbs_out;
+ if (verbs_out) {
+ sgs[in_num + 1] = verbs_out;
out_num++;
- }
-
- spin_lock_irqsave(&vrdev->ctrl_lock, flags);
-
- ret = virtqueue_add_sgs(vq, sgs, in_num, out_num, vrdev, GFP_ATOMIC);
- if (ret) {
- netdev_err(vrdma_info->dev, "Failed to add cmd %d to CVQ: %d\n",
- verbs_cmd, ret);
- goto unlock;
- }
-
- if (unlikely(!virtqueue_kick(vq))) {
- netdev_err(vrdma_info->dev, "Failed to kick CVQ for cmd %d\n", verbs_cmd);
- ret = -EIO;
- goto unlock;
- }
-
- /* Wait for response: loop with timeout to avoid infinite blocking */
- ret = -ETIMEDOUT;
- while (1) {
- if (virtqueue_get_buf(vq, &len)) {
- ret = 0;
- break;
- }
- if (unlikely(virtqueue_is_broken(vq))) {
- netdev_err(vrdma_info->dev, "CVQ is broken\n");
- ret = -EIO;
- break;
- }
- cpu_relax();
- /*
- * Prevent infinite wait. In non-atomic context, consider using schedule_timeout()
- * for better CPU utilization.
- */
- if (!--timeout_loops) {
- netdev_err(vrdma_info->dev, "Timeout waiting for cmd %d response\n",
- verbs_cmd);
- break;
- }
- }
+ }
+
+ spin_lock_irqsave(&vrdev->ctrl_lock, flags);
+
+ ret = virtqueue_add_sgs(vq, sgs, in_num, out_num, vrdev, GFP_ATOMIC);
+ if (ret) {
+ netdev_err(vrdma_info->dev, "Failed to add cmd %d to CVQ: %d\n",
+ verbs_cmd, ret);
+ goto unlock;
+ }
+
+ if (unlikely(!virtqueue_kick(vq))) {
+ netdev_err(vrdma_info->dev, "Failed to kick CVQ for cmd %d\n", verbs_cmd);
+ ret = -EIO;
+ goto unlock;
+ }
+
+ /* Wait for response: loop with timeout to avoid infinite blocking */
+ ret = -ETIMEDOUT;
+ while (1) {
+ if (virtqueue_get_buf(vq, &len)) {
+ ret = 0;
+ break;
+ }
+ if (unlikely(virtqueue_is_broken(vq))) {
+ netdev_err(vrdma_info->dev, "CVQ is broken\n");
+ ret = -EIO;
+ break;
+ }
+ cpu_relax();
+ /*
+ * Prevent infinite wait. In non-atomic context, consider using schedule_timeout()
+ * for better CPU utilization.
+ */
+ if (!--timeout_loops) {
+ netdev_err(vrdma_info->dev, "Timeout waiting for cmd %d response\n",
+ verbs_cmd);
+ break;
+ }
+ }
unlock:
- spin_unlock_irqrestore(&vrdev->ctrl_lock, flags);
-
- /* Log final result */
- if (ret == 0 && ctrl_buf->status != VRDMA_CTRL_OK) {
- netdev_err(vrdma_info->dev, "EXEC cmd %s failed: status=%d\n",
- cmd_str[verbs_cmd], ctrl_buf->status);
- ret = -EIO; /* Host returned an error status */
- } else if (ret == 0) {
- netdev_dbg(vrdma_info->dev, "EXEC cmd %s OK\n", cmd_str[verbs_cmd]);
- } else {
- netdev_err(vrdma_info->dev, "EXEC cmd %s failed: ret=%d\n",
- cmd_str[verbs_cmd], ret);
- }
-
- kfree(ctrl_buf);
- return ret;
+ spin_unlock_irqrestore(&vrdev->ctrl_lock, flags);
+
+ /* Log final result */
+ if (ret == 0 && ctrl_buf->status != VRDMA_CTRL_OK) {
+ netdev_err(vrdma_info->dev, "EXEC cmd %s failed: status=%d\n",
+ cmd_str[verbs_cmd], ctrl_buf->status);
+ ret = -EIO; /* Host returned an error status */
+ } else if (ret == 0) {
+ netdev_dbg(vrdma_info->dev, "EXEC cmd %s OK\n", cmd_str[verbs_cmd]);
+ } else {
+ netdev_err(vrdma_info->dev, "EXEC cmd %s failed: ret=%d\n",
+ cmd_str[verbs_cmd], ret);
+ }
+
+ kfree(ctrl_buf);
+ return ret;
+}
+
+static int vrdma_port_immutable(struct ib_device *ibdev, u32 port_num,
+ struct ib_port_immutable *immutable)
+{
+ struct ib_port_attr attr;
+ int ret;
+
+ ret = ib_query_port(ibdev, port_num, &attr);
+ if (ret)
+ return ret;
+
+ immutable->core_cap_flags = RDMA_CORE_PORT_VIRTIO;
+ immutable->pkey_tbl_len = attr.pkey_tbl_len;
+ immutable->gid_tbl_len = attr.gid_tbl_len;
+ immutable->max_mad_size = IB_MGMT_MAD_SIZE;
+
+ return 0;
+}
+
+static int vrdma_query_device(struct ib_device *ibdev,
+ struct ib_device_attr *ib_dev_attr,
+ struct ib_udata *udata)
+{
+ if (udata->inlen || udata->outlen)
+ return -EINVAL;
+
+ *ib_dev_attr = to_vdev(ibdev)->attr;
+ return 0;
}
-static const struct ib_device_ops virtio_rdma_dev_ops = {
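+/*
+ * vrdma_init_sg() builds a scatterlist for either a linear (kmalloc) or a
+ * vmalloc'ed buffer: vmalloc memory is not physically contiguous, so it is
+ * split into one scatterlist entry per page. The returned list is
+ * kmalloc()ed and is released by the caller with kfree() once the command
+ * completes (see vrdma_query_port() below).
+ */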
+static struct scatterlist *vrdma_init_sg(void *buf, unsigned long nbytes)
+{
+ struct scatterlist *need_sg;
+ int num_page = 0;
+ unsigned long offset;
+ void *ptr;
+
+ if (is_vmalloc_addr(buf)) {
+ int i;
+ unsigned long remaining = nbytes;
+
+ ptr = buf;
+ offset = offset_in_page(ptr);
+ num_page = 1;
+ if (offset + nbytes > PAGE_SIZE) {
+ num_page += (offset + nbytes - PAGE_SIZE + PAGE_SIZE - 1) / PAGE_SIZE;
+ }
+
+ need_sg = kmalloc_array(num_page, sizeof(*need_sg), GFP_ATOMIC);
+ if (!need_sg)
+ return NULL;
+
+ sg_init_table(need_sg, num_page);
+
+ for (i = 0; i < num_page; i++) {
+ struct page *page;
+ unsigned int len;
+ unsigned int off_in_page;
+
+ off_in_page = offset_in_page(ptr);
+ len = min((unsigned long)(PAGE_SIZE - off_in_page), remaining);
+
+ page = vmalloc_to_page(ptr);
+ if (!page) {
+ kfree(need_sg);
+ return NULL;
+ }
+
+ sg_set_page(&need_sg[i], page, len, off_in_page);
+
+ ptr += len;
+ remaining -= len;
+ }
+ } else {
+ need_sg = kmalloc(sizeof(*need_sg), GFP_ATOMIC);
+ if (!need_sg)
+ return NULL;
+
+ sg_init_one(need_sg, buf, nbytes);
+ }
+
+ return need_sg;
+}
+
+static int vrdma_query_port(struct ib_device *ibdev, u32 port,
+ struct ib_port_attr *props)
+{
+ struct vrdma_dev *vdev = to_vdev(ibdev);
+ struct vrdma_cmd_query_port *cmd;
+ struct vrdma_port_attr port_attr;
+ struct scatterlist in_sgl, *out_sgl;
+ int ret;
+
+ memset(&port_attr, 0, sizeof(port_attr));
+
+ cmd = kmalloc(sizeof(*cmd), GFP_ATOMIC);
+ if (!cmd)
+ return -ENOMEM;
+
+ out_sgl = vrdma_init_sg(&port_attr, sizeof(port_attr));
+ if (!out_sgl) {
+ kfree(cmd);
+ return -ENOMEM;
+ }
+
+ cmd->port = port;
+ sg_init_one(&in_sgl, cmd, sizeof(*cmd));
+
+ ret = vrdma_exec_verbs_cmd(vdev, VIRTIO_RDMA_CMD_QUERY_PORT, &in_sgl, out_sgl);
+ if (!ret) {
+ props->state = port_attr.state;
+ props->max_mtu = port_attr.max_mtu;
+ props->active_mtu = port_attr.active_mtu;
+ props->phys_mtu = port_attr.phys_mtu;
+ props->gid_tbl_len = port_attr.gid_tbl_len;
+ props->port_cap_flags = port_attr.port_cap_flags;
+ props->max_msg_sz = port_attr.max_msg_sz;
+ props->bad_pkey_cntr = port_attr.bad_pkey_cntr;
+ props->qkey_viol_cntr = port_attr.qkey_viol_cntr;
+ props->pkey_tbl_len = port_attr.pkey_tbl_len;
+ props->active_width = port_attr.active_width;
+ props->active_speed = port_attr.active_speed;
+ props->phys_state = port_attr.phys_state;
+
+ props->ip_gids = 1;
+ props->sm_lid = 0;
+ props->lid = 0;
+ props->lmc = 0;
+ props->max_vl_num = 1;
+ props->sm_sl = 0;
+ props->subnet_timeout = 0;
+ props->init_type_reply = 0;
+ props->port_cap_flags2 = 0;
+ }
+
+ kfree(out_sgl);
+ kfree(cmd);
+
+ return ret;
+}
+
+static struct net_device *vrdma_get_netdev(struct ib_device *ibdev,
+ u32 port_num)
+{
+ struct vrdma_dev *vrdev = to_vdev(ibdev);
+
+ /* Callers expect a held reference and release it with dev_put() */
+ dev_hold(vrdev->netdev);
+ return vrdev->netdev;
+}
+
+/**
+ * vrdma_create_cq - Create a Completion Queue (CQ) for virtio-rdma device
+ * @ibcq: Pointer to the InfiniBand CQ structure to be initialized
+ * @attr: Attributes for CQ initialization, including requested depth (cqe)
+ * @attr_bundle: Bundle containing user context and attributes (includes udata)
+ *
+ * This function creates a Completion Queue (CQ) in the virtio-rdma driver,
+ * which is used to report completion events from asynchronous operations such
+ * as sends, receives, and memory accesses.
+ *
+ * The function performs the following steps:
+ * 1. Enforces per-device CQ count limits.
+ * 2. Allocates a DMA-coherent ring buffer for storing completion entries.
+ * 3. Communicates with the backend via a virtqueue command to create the CQ
+ * and obtain a hardware handle (cqn).
+ * 4. Sets up zero-copy user-space access through mmap() if @udata is provided.
+ * 5. Initializes kernel-side state, including locking and event handling.
+ *
+ * If @attr_bundle->driver_udata (i.e., udata) is non-NULL:
+ * - A user-mappable region is created that includes:
+ * a) The CQ event ring (array of struct vrdma_cqe)
+ * b) The associated virtqueue's used ring (for polling completions)
+ * - Metadata required for mmap setup (offset, sizes, addresses) is returned
+ * to userspace via ib_copy_to_udata().
+ *
+ * If @udata is NULL (kernel-only CQ):
+ * - The driver pre-posts receive buffers on the CQ's dedicated virtqueue
+ * so that completion messages from the device can be received directly
+ * into the kernel-managed CQ ring.
+ * - No mmap support is enabled.
+ *
+ * Memory Layout Mapped to Userspace:
+ *
+ * +------------------------+ <-- mapped base address
+ * | CQ Event Ring |
+ * | (num_cqe x vrdma_cqe) |
+ * +------------------------+
+ * | Virtqueue Structure |
+ * | - Descriptor Table |
+ * | - Available Ring |
+ * | - Used Ring | <-- accessed at (base + used_off)
+ * +------------------------+
+ *
+ * Usage by Userspace:
+ * After receiving @offset and @cq_size, userspace calls:
+ * mmap(NULL, cq_size, PROT_READ, MAP_SHARED, fd, offset);
+ * Then polls the used ring to detect new completions without syscalls.
+ *
+ * Return:
+ * 0 on success, or negative error code (e.g., -ENOMEM, -EINVAL, -EIO).
+ */
+static int vrdma_create_cq(struct ib_cq *ibcq,
+ const struct ib_cq_init_attr *attr,
+ struct uverbs_attr_bundle *attr_bundle)
+{
+ struct scatterlist in, out;
+ struct vrdma_cq *vcq = to_vcq(ibcq);
+ struct vrdma_dev *vdev = to_vdev(ibcq->device);
+ struct vrdma_cmd_create_cq *cmd;
+ struct vrdma_rsp_create_cq *rsp;
+ struct scatterlist sg;
+ struct ib_udata *udata;
+ int entries = attr->cqe;
+ size_t total_size;
+ struct vrdma_user_mmap_entry *entry = NULL;
+ int ret;
+
+ if (!attr_bundle)
+ udata = NULL;
+ else
+ udata = &attr_bundle->driver_udata;
+
+ /* Enforce maximum number of CQs per device */
+ if (!atomic_add_unless(&vdev->num_cq, 1, vdev->ib_dev.attrs.max_cq)) {
+ dev_dbg(&vdev->vdev->dev, "max CQ limit reached: %u\n",
+ vdev->ib_dev.attrs.max_cq);
+ return -ENOMEM;
+ }
+
+ /* Allocate CQ ring buffer: array of vrdma_cqe entries */
+ total_size = PAGE_ALIGN(entries * sizeof(struct vrdma_cqe));
+ vcq->queue_size = total_size;
+ vcq->queue = dma_alloc_coherent(vdev->vdev->dev.parent, vcq->queue_size,
+ &vcq->dma_addr, GFP_KERNEL);
+ if (!vcq->queue) {
+ ret = -ENOMEM;
+ goto err_out;
+ }
+
+ /* Prepare command and response structures */
+ cmd = kmalloc(sizeof(*cmd), GFP_KERNEL);
+ if (!cmd) {
+ ret = -ENOMEM;
+ goto err_free_queue;
+ }
+
+ rsp = kmalloc(sizeof(*rsp), GFP_KERNEL);
+ if (!rsp) {
+ ret = -ENOMEM;
+ goto err_free_cmd;
+ }
+
+ /* Optional: allocate mmap entry if userspace mapping is requested */
+ if (udata) {
+ entry = kzalloc(sizeof(*entry), GFP_KERNEL);
+ if (!entry) {
+ ret = -ENOMEM;
+ goto err_free_rsp;
+ }
+ }
+
+ /* Fill command parameters */
+ cmd->cqe = entries;
+ sg_init_one(&in, cmd, sizeof(*cmd));
+ sg_init_one(&out, rsp, sizeof(*rsp));
+
+ /* Send command to backend device */
+ ret = vrdma_exec_verbs_cmd(vdev, VIRTIO_RDMA_CMD_CREATE_CQ, &in, &out);
+ if (ret) {
+ dev_err(&vdev->vdev->dev, "CREATE_CQ cmd failed: %d\n", ret);
+ goto err_free_entry;
+ }
+
+ /* Initialize CQ fields from response */
+ vcq->cq_handle = rsp->cqn;
+ vcq->ibcq.cqe = entries;
+ vcq->num_cqe = entries;
+ vcq->vq = &vdev->cq_vqs[rsp->cqn]; /* Assigned virtqueue for this CQ */
+ vdev->cqs[rsp->cqn] = vcq;
+
+ /* Userspace mapping setup */
+ if (udata) {
+ struct vrdma_create_cq_uresp uresp = {};
+ struct vrdma_ucontext *uctx = rdma_udata_to_drv_context(udata,
+ struct vrdma_ucontext, ibucontext);
+
+ entry->mmap_type = VRDMA_MMAP_CQ;
+ entry->vq = vcq->vq->vq;
+ entry->user_buf = vcq->queue;
+ entry->ubuf_size = vcq->queue_size;
+
+ /* Calculate used ring offset within descriptor table */
+ uresp.used_off = virtqueue_get_used_addr(vcq->vq->vq) -
+ virtqueue_get_desc_addr(vcq->vq->vq);
+
+ /* Align vring size to page boundary for mmap */
+ uresp.vq_size = PAGE_ALIGN(vring_size(virtqueue_get_vring_size(vcq->vq->vq),
+ SMP_CACHE_BYTES));
+ total_size += uresp.vq_size;
+
+ /* Insert mmap entry into user context */
+ ret = rdma_user_mmap_entry_insert(&uctx->ibucontext,
+ &entry->rdma_entry,
+ total_size);
+ if (ret) {
+ dev_err(&vdev->vdev->dev,
+ "Failed to insert mmap entry for CQ: %d\n", ret);
+ goto err_free_entry;
+ }
+
+ /* Populate response to userspace */
+ uresp.offset = rdma_user_mmap_get_offset(&entry->rdma_entry);
+ uresp.cq_phys_addr = virt_to_phys(vcq->queue);
+ uresp.num_cqe = entries;
+ uresp.num_cvqe = virtqueue_get_vring_size(vcq->vq->vq);
+ uresp.cq_size = total_size;
+
+ if (udata->outlen < sizeof(uresp)) {
+ ret = -EINVAL;
+ goto err_remove_mmap;
+ }
+
+ ret = ib_copy_to_udata(udata, &uresp, sizeof(uresp));
+ if (ret) {
+ dev_err(&vdev->vdev->dev,
+ "Failed to copy CQ creation response to userspace\n");
+ goto err_remove_mmap;
+ }
+
+ vcq->entry = &entry->rdma_entry;
+ } else {
+ int sg_num = min_t(int, entries, vcq->vq->vq->num_free);
+ /* Kernel-only CQ: pre-post receive buffers to catch events */
+ for (int i = 0; i < sg_num; i++) {
+ sg_init_one(&sg, vcq->queue + i, sizeof(struct vrdma_cqe));
+ ret = virtqueue_add_inbuf(vcq->vq->vq, &sg, 1,
+ vcq->queue + i, GFP_KERNEL);
+ if (ret) {
+ dev_err(&vdev->vdev->dev,
+ "Failed to add inbuf to CQ vq: %d\n", ret);
+ /* Best-effort cleanup; continue anyway */
+ }
+ }
+ virtqueue_kick(vcq->vq->vq);
+ }
+
+ /* Final initialization */
+ spin_lock_init(&vcq->lock);
+
+ /* Cleanup temporaries */
+ kfree(rsp);
+ kfree(cmd);
+ return 0;
+
+err_remove_mmap:
+ if (udata && entry)
+ rdma_user_mmap_entry_remove(&entry->rdma_entry);
+err_free_entry:
+ kfree(entry); /* kfree(NULL) is a no-op */
+err_free_rsp:
+ kfree(rsp);
+err_free_cmd:
+ kfree(cmd);
+err_free_queue:
+ dma_free_coherent(vdev->vdev->dev.parent, vcq->queue_size,
+ vcq->queue, vcq->dma_addr);
+err_out:
+ atomic_dec(&vdev->num_cq);
+ return ret;
+}
+
+/**
+ * vrdma_destroy_cq - Destroy a Completion Queue (CQ) in virtio-rdma driver
+ * @cq: Pointer to the IB CQ to destroy
+ * @udata: User data context (may be NULL for kernel clients)
+ *
+ * This function destroys a CQ by:
+ * 1. Disabling callbacks on the associated virtqueue
+ * 2. Sending VIRTIO_RDMA_CMD_DESTROY_CQ command to backend
+ * 3. Draining any pending buffers from the virtqueue (for kernel CQs)
+ * 4. Removing mmap entries (if created for userspace)
+ * 5. Freeing DMA-coherent memory used for CQ ring
+ * 6. Decrementing device-wide CQ counter
+ *
+ * The CQ must not be in use when this function is called.
+ *
+ * Return:
+ * Always returns 0 (success). Future versions may return error if
+ * the device fails to acknowledge destruction.
+ */
+static int vrdma_destroy_cq(struct ib_cq *cq, struct ib_udata *udata)
+{
+ struct vrdma_cq *vcq = to_vcq(cq);
+ struct vrdma_dev *vdev = to_vdev(cq->device);
+ struct scatterlist in_sgs;
+ struct vrdma_cmd_destroy_cq *cmd;
+ int rc;
+
+ /* Allocate command buffer */
+ cmd = kmalloc(sizeof(*cmd), GFP_KERNEL);
+ if (!cmd)
+ return -ENOMEM;
+
+ /* Prepare and send DESTROY_CQ command to backend */
+ cmd->cqn = vcq->cq_handle;
+ sg_init_one(&in_sgs, cmd, sizeof(*cmd));
+
+ /* Prevent further interrupts/callbacks during teardown */
+ virtqueue_disable_cb(vcq->vq->vq);
+
+ /* Send command synchronously; no response expected on success */
+ rc = vrdma_exec_verbs_cmd(vdev, VIRTIO_RDMA_CMD_DESTROY_CQ,
+ &in_sgs, NULL);
+ if (rc) {
+ dev_warn(&vdev->vdev->dev,
+ "Failed to destroy CQ %u: backend error %d\n",
+ vcq->cq_handle, rc);
+ /* Proceed anyway: continue cleanup even if device failed */
+ }
+
+ /*
+ * For kernel-only CQs: drain all unused receive buffers.
+ * Userspace manages its own vring via mmap/poll, so skip.
+ */
+ if (!udata) {
+ struct vrdma_cqe *cqe;
+ while ((cqe = virtqueue_detach_unused_buf(vcq->vq->vq)) != NULL) {
+ /* No action needed - just release buffer back */
+ }
+ }
+
+ /* Remove mmap entry if one was created for userspace access */
+ if (vcq->entry) {
+ rdma_user_mmap_entry_remove(vcq->entry);
+ vcq->entry = NULL; /* Safety: avoid double-remove */
+ }
+
+ /* Unregister CQ from device's CQ table */
+ WRITE_ONCE(vdev->cqs[vcq->cq_handle], NULL);
+
+ /* Free CQ event ring (DMA memory) */
+ dma_free_coherent(vdev->vdev->dev.parent, vcq->queue_size,
+ vcq->queue, vcq->dma_addr);
+
+ /* Decrement global CQ count */
+ atomic_dec(&vdev->num_cq);
+
+ /* Re-enable callback (though vq will likely be reused or freed later) */
+ virtqueue_enable_cb(vcq->vq->vq);
+
+ /* Clean up command structure */
+ kfree(cmd);
+
+ return 0;
+}
+
+static int vrdma_dealloc_pd(struct ib_pd *ibpd, struct ib_udata *udata);
+
+/**
+ * vrdma_alloc_pd - Allocate a Protection Domain (PD) via virtio-rdma backend
+ * @ibpd: Pointer to the IB PD structure
+ * @udata: User data for communication with userspace (may be NULL)
+ *
+ * This function:
+ * 1. Sends VIRTIO_RDMA_CMD_CREATE_PD to the backend
+ * 2. Receives a PD handle (pdn) from the device
+ * 3. Stores it in the vrdma_pd structure
+ * 4. Optionally returns an empty response to userspace (for ABI compatibility)
+ *
+ * Return:
+ * 0 on success, negative errno on failure.
+ */
+static int vrdma_alloc_pd(struct ib_pd *ibpd, struct ib_udata *udata)
+{
+ struct vrdma_pd *pd = to_vpd(ibpd);
+ struct ib_device *ibdev = ibpd->device;
+ struct vrdma_dev *vdev = to_vdev(ibdev);
+ struct vrdma_rsp_create_pd *rsp;
+ struct scatterlist out_sgs;
+ int ret;
+
+ /* Allocate response buffer */
+ rsp = kzalloc(sizeof(*rsp), GFP_KERNEL);
+ if (!rsp)
+ return -ENOMEM;
+
+ sg_init_one(&out_sgs, rsp, sizeof(*rsp));
+
+ /* Send command to backend */
+ ret = vrdma_exec_verbs_cmd(vdev, VIRTIO_RDMA_CMD_CREATE_PD,
+ NULL, &out_sgs);
+ if (ret) {
+ dev_err(&vdev->vdev->dev,
+ "Failed to create PD: cmd error %d\n", ret);
+ goto err_free;
+ }
+
+ /* Store returned PD handle */
+ pd->pd_handle = rsp->pdn;
+
+ /* If this is a userspace PD, return success indicator */
+ if (udata) {
+ struct vrdma_alloc_pd_uresp uresp = {};
+
+ if (ib_copy_to_udata(udata, &uresp, sizeof(uresp))) {
+ dev_warn(&vdev->vdev->dev,
+ "Failed to copy PD uresp to userspace\n");
+ /* Undo: destroy the PD on backend */
+ vrdma_dealloc_pd(ibpd, udata);
+ ret = -EFAULT;
+ goto err_free;
+ }
+ }
+
+ dev_info(&vdev->vdev->dev, "%s: allocated PD %u\n",
+ __func__, pd->pd_handle);
+
+err_free:
+ kfree(rsp);
+ return ret;
+}
+
+/**
+ * vrdma_dealloc_pd - Deallocate a Protection Domain (PD)
+ * @ibpd: Pointer to the IB PD to destroy
+ * @udata: User data context (ignored here; used for symmetry)
+ *
+ * This function sends VIRTIO_RDMA_CMD_DESTROY_PD to the backend
+ * to release the PD resource. No response is expected.
+ *
+ * Note: There is no local state (e.g., DMA mappings) tied to PD in this driver.
+ * All cleanup is handled by the backend.
+ *
+ * Return:
+ * Always returns 0 (success).
+ */
+static int vrdma_dealloc_pd(struct ib_pd *ibpd, struct ib_udata *udata)
+{
+ struct vrdma_pd *pd = to_vpd(ibpd);
+ struct ib_device *ibdev = ibpd->device;
+ struct vrdma_dev *vdev = to_vdev(ibdev);
+ struct vrdma_cmd_destroy_pd *cmd;
+ struct scatterlist in_sgs;
+ int ret;
+
+ cmd = kzalloc(sizeof(*cmd), GFP_KERNEL);
+ if (!cmd)
+ return -ENOMEM;
+
+ cmd->pdn = pd->pd_handle;
+ sg_init_one(&in_sgs, cmd, sizeof(*cmd));
+
+ ret = vrdma_exec_verbs_cmd(vdev, VIRTIO_RDMA_CMD_DESTROY_PD,
+ &in_sgs, NULL);
+ if (ret) {
+ dev_err(&vdev->vdev->dev,
+ "Failed to destroy PD %u: backend error %d\n",
+ pd->pd_handle, ret);
+ /* Proceed anyway - don't block cleanup */
+ }
+
+ dev_info(&vdev->vdev->dev, "%s: deallocated PD %u\n",
+ __func__, pd->pd_handle);
+
+ kfree(cmd);
+ return 0;
+}
+
+/**
+ * vrdma_init_mmap_entry - Initialize and insert a user mmap entry for QP buffer
+ * @vdev: Pointer to the vRDMA device
+ * @vq: Virtqueue associated with this memory region
+ * @entry_: Pointer to store allocated mmap entry (output)
+ * @buf_size: Size of the user data buffer (e.g., SQ/RQ ring space)
+ * @vctx: User context to which the mmap entry will be attached
+ * @size: Total size of the allocated mapping region (output)
+ * @used_off: Offset within the mapping where used ring starts (output)
+ * @vq_size: Aligned size of the virtqueue structure (output)
+ * @dma_addr: DMA address of the allocated coherent buffer (output)
+ *
+ * This function allocates a physically contiguous, DMA-coherent buffer for
+ * the Send/Receive Queue data (e.g., WQE payloads), maps the virtqueue's
+ * descriptor and used rings into userspace via an mmap entry, and inserts it
+ * into the user context. It supports fast doorbell mapping if enabled.
+ *
+ * The layout in userspace is:
+ * [0, buf_size_aligned) : Data buffer (SQ/RQ payload space)
+ * [buf_size_aligned, ...) : Virtqueue (desc + avail + used)
+ * [... + vq_size, ...] : Optional fast doorbell page
+ *
+ * Returns a pointer to the kernel virtual address of the data buffer,
+ * or NULL on failure.
+ */
+static void *vrdma_init_mmap_entry(struct vrdma_dev *vdev,
+ struct virtqueue *vq,
+ struct vrdma_user_mmap_entry **entry_,
+ int buf_size,
+ struct vrdma_ucontext *vctx,
+ __u64 *size,
+ __u64 *used_off,
+ __u32 *vq_size,
+ dma_addr_t *dma_addr)
+{
+ void *buf;
+ size_t total_size;
+ struct vrdma_user_mmap_entry *entry;
+ int rc;
+
+ /* Allocate aligned buffer for SQ/RQ payload area */
+ total_size = PAGE_ALIGN(buf_size);
+ buf = dma_alloc_coherent(vdev->vdev->dev.parent, total_size,
+ dma_addr, GFP_KERNEL);
+ if (!buf)
+ return NULL;
+
+ entry = kzalloc(sizeof(*entry), GFP_KERNEL);
+ if (!entry) {
+ dma_free_coherent(vdev->vdev->dev.parent, total_size,
+ buf, *dma_addr);
+ return NULL;
+ }
+
+ entry->mmap_type = VRDMA_MMAP_QP;
+ entry->vq = vq;
+ entry->user_buf = buf;
+ entry->ubuf_size = total_size; /* Already page-aligned */
+
+ /* Calculate offset from desc to used ring (for userspace polling) */
+ *used_off = virtqueue_get_used_addr(vq) - virtqueue_get_desc_addr(vq);
+
+ /* Align vring size to cache line boundary and round up to page size */
+ *vq_size = vring_size(virtqueue_get_vring_size(vq), SMP_CACHE_BYTES);
+ *vq_size = PAGE_ALIGN(*vq_size);
+ total_size += *vq_size;
+
+ /* Add extra page for fast doorbell if supported */
+ if (vdev->fast_doorbell)
+ total_size += PAGE_SIZE;
+
+ /* Insert into user mmap infrastructure */
+ rc = rdma_user_mmap_entry_insert(&vctx->ibucontext, &entry->rdma_entry,
+ total_size);
+ if (rc) {
+ dma_free_coherent(vdev->vdev->dev.parent, entry->ubuf_size,
+ buf, *dma_addr);
+ kfree(entry);
+ return NULL;
+ }
+
+ *size = total_size;
+ *entry_ = entry;
+ return buf;
+}
+
+/**
+ * vrdma_create_qp - Create a Virtio-RDMA Queue Pair (QP)
+ * @ibqp: Pointer to the IB QP structure (allocated by core)
+ * @attr: QP initialization attributes from userspace or kernel
+ * @udata: User data for mmap and doorbell mapping (NULL if kernel QP)
+ *
+ * This function creates a QP in the backend vRDMA device via a virtqueue
+ * command. It allocates resources including:
+ * - Send and Receive Queues (virtqueues)
+ * - Memory regions for WQE buffers (if user-space QP)
+ * - DMA-coherent rings and mmap entries
+ *
+ * On success, it returns 0 and fills @udata with offsets and sizes needed
+ * for userspace to mmap SQ/RQ rings and access CQ notification mechanisms.
+ *
+ * Context: Called in process context. May sleep.
+ * Return: 0 on success, negative errno on failure.
+ */
+static int vrdma_create_qp(struct ib_qp *ibqp,
+ struct ib_qp_init_attr *attr,
+ struct ib_udata *udata)
+{
+ struct scatterlist in_sgs, out_sgs;
+ struct vrdma_dev *vdev = to_vdev(ibqp->device);
+ struct vrdma_cmd_create_qp *cmd;
+ struct vrdma_rsp_create_qp *rsp;
+ struct vrdma_qp *vqp = to_vqp(ibqp);
+ int rc, vqn;
+ int ret = 0;
+
+ /* SRQ is not supported yet */
+ if (attr->srq) {
+ dev_err(&vdev->vdev->dev, "SRQ is not supported\n");
+ return -EOPNOTSUPP;
+ }
+
+ /* Enforce QP count limit */
+ if (!atomic_add_unless(&vdev->num_qp, 1, vdev->ib_dev.attrs.max_qp)) {
+ dev_dbg(&vdev->vdev->dev, "exceeded max_qp (%u)\n",
+ vdev->ib_dev.attrs.max_qp);
+ return -ENOMEM;
+ }
+
+ /* Validate QP attributes before sending to device */
+ if (vrdma_qp_check_init(vdev, attr)) {
+ dev_dbg(&vdev->vdev->dev, "invalid QP init attributes\n");
+ ret = -EINVAL;
+ goto err_alloc_cmd;
+ }
+
+ cmd = kzalloc(sizeof(*cmd), GFP_KERNEL);
+ if (!cmd) {
+ ret = -ENOMEM;
+ goto err_alloc_cmd;
+ }
+
+ rsp = kzalloc(sizeof(*rsp), GFP_KERNEL);
+ if (!rsp) {
+ ret = -ENOMEM;
+ goto err_alloc_rsp;
+ }
+
+ /* Prepare command for device */
+ cmd->pdn = to_vpd(ibqp->pd)->pd_handle;
+ cmd->qp_type = attr->qp_type;
+ cmd->sq_sig_type = attr->sq_sig_type;
+ cmd->max_send_wr = attr->cap.max_send_wr;
+ cmd->max_send_sge = attr->cap.max_send_sge;
+ cmd->send_cqn = to_vcq(attr->send_cq)->cq_handle;
+ cmd->max_recv_wr = attr->cap.max_recv_wr;
+ cmd->max_recv_sge = attr->cap.max_recv_sge;
+ cmd->recv_cqn = to_vcq(attr->recv_cq)->cq_handle;
+ cmd->max_inline_data = attr->cap.max_inline_data;
+
+ sg_init_one(&in_sgs, cmd, sizeof(*cmd));
+ sg_init_one(&out_sgs, rsp, sizeof(*rsp));
+
+ /* Execute CREATE_QP verb over control virtqueue */
+ rc = vrdma_exec_verbs_cmd(vdev, VIRTIO_RDMA_CMD_CREATE_QP,
+ &in_sgs, &out_sgs);
+ if (rc) {
+ dev_err(&vdev->vdev->dev, "CREATE_QP cmd failed: %d\n", rc);
+ ret = -EIO;
+ goto err_exec_cmd;
+ }
+
+ /* Initialize software QP state */
+ vqp->type = udata ? VIRTIO_RDMA_TYPE_USER : VIRTIO_RDMA_TYPE_KERNEL;
+ vqp->port = attr->port_num;
+ vqp->qp_handle = rsp->qpn;
+ ibqp->qp_num = rsp->qpn;
+
+ vqn = rsp->qpn;
+ vqp->sq = &vdev->qp_vqs[vqn * 2];
+ vqp->rq = &vdev->qp_vqs[vqn * 2 + 1];
+
+ /* If this is a user-space QP, set up mmap-able memory regions */
+ if (udata) {
+ struct vrdma_create_qp_uresp uresp = {};
+ struct vrdma_ucontext *uctx = rdma_udata_to_drv_context(
+ udata, struct vrdma_ucontext, ibucontext);
+ uint32_t per_wqe_size;
+
+ /* Allocate SQ buffer area */
+ per_wqe_size = sizeof(struct vrdma_cmd_post_send) +
+ sizeof(struct vrdma_sge) * attr->cap.max_send_sge;
+ vqp->usq_buf_size = PAGE_ALIGN(per_wqe_size * attr->cap.max_send_wr);
+
+ vqp->usq_buf = vrdma_init_mmap_entry(vdev, vqp->sq->vq,
+ &vqp->sq_entry,
+ vqp->usq_buf_size,
+ uctx,
+ &uresp.sq_mmap_size,
+ &uresp.svq_used_idx_off,
+ &uresp.svq_ring_size,
+ &vqp->usq_dma_addr);
+ if (!vqp->usq_buf) {
+ dev_err(&vdev->vdev->dev, "failed to init SQ mmap entry\n");
+ ret = -ENOMEM;
+ goto err_mmap_sq;
+ }
+
+ /* Allocate RQ buffer area */
+ per_wqe_size = sizeof(struct vrdma_cmd_post_send) +
+ sizeof(struct vrdma_sge) * attr->cap.max_recv_sge;
+ vqp->urq_buf_size = PAGE_ALIGN(per_wqe_size * attr->cap.max_recv_wr);
+
+ vqp->urq_buf = vrdma_init_mmap_entry(vdev, vqp->rq->vq,
+ &vqp->rq_entry,
+ vqp->urq_buf_size,
+ uctx,
+ &uresp.rq_mmap_size,
+ &uresp.rvq_used_idx_off,
+ &uresp.rvq_ring_size,
+ &vqp->urq_dma_addr);
+ if (!vqp->urq_buf) {
+ dev_err(&vdev->vdev->dev, "failed to init RQ mmap entry\n");
+ ret = -ENOMEM;
+ goto err_mmap_rq;
+ }
+
+ /* Fill response for userspace */
+ uresp.sq_mmap_offset = rdma_user_mmap_get_offset(&vqp->sq_entry->rdma_entry);
+ uresp.sq_db_addr = vqp->usq_dma_addr;
+ uresp.num_sq_wqes = attr->cap.max_send_wr;
+ uresp.num_svqe = virtqueue_get_vring_size(vqp->sq->vq);
+ uresp.sq_head_idx = vqp->sq->vq->index;
+
+ uresp.rq_mmap_offset = rdma_user_mmap_get_offset(&vqp->rq_entry->rdma_entry);
+ uresp.rq_db_addr = vqp->urq_dma_addr;
+ uresp.num_rq_wqes = attr->cap.max_recv_wr;
+ uresp.num_rvqe = virtqueue_get_vring_size(vqp->rq->vq);
+ uresp.rq_head_idx = vqp->rq->vq->index;
+
+ uresp.notifier_size = vdev->fast_doorbell ? PAGE_SIZE : 0;
+ uresp.qp_handle = vqp->qp_handle;
+
+ if (udata->outlen < sizeof(uresp)) {
+ dev_dbg(&vdev->vdev->dev, "user outlen too small: %zu < %zu\n",
+ udata->outlen, sizeof(uresp));
+ ret = -EINVAL;
+ goto err_copy_udata;
+ }
+
+ rc = ib_copy_to_udata(udata, &uresp, sizeof(uresp));
+ if (rc) {
+ dev_err(&vdev->vdev->dev, "failed to copy udata to userspace\n");
+ ret = rc;
+ goto err_copy_udata;
+ }
+ }
+
+ /* Cleanup and return success */
+ kfree(cmd);
+ kfree(rsp);
+ return 0;
+
+err_copy_udata:
+ dma_free_coherent(vdev->vdev->dev.parent, vqp->urq_buf_size,
+ vqp->urq_buf, vqp->urq_dma_addr);
+ rdma_user_mmap_entry_remove(&vqp->rq_entry->rdma_entry);
+
+err_mmap_rq:
+ dma_free_coherent(vdev->vdev->dev.parent, vqp->usq_buf_size,
+ vqp->usq_buf, vqp->usq_dma_addr);
+ rdma_user_mmap_entry_remove(&vqp->sq_entry->rdma_entry);
+
+err_mmap_sq:
+ /* vqp is embedded in the core-allocated ib_qp; nothing to free here */
+
+err_exec_cmd:
+ kfree(rsp);
+
+err_alloc_rsp:
+ kfree(cmd);
+
+err_alloc_cmd:
+ atomic_dec(&vdev->num_qp);
+ return ret;
+}
+
+/**
+ * vrdma_destroy_qp - Destroy a Virtio-RDMA Queue Pair (QP)
+ * @ibqp: Pointer to the IB QP to destroy
+ * @udata: User data context (may be NULL for kernel QPs)
+ *
+ * This function destroys a QP both in the host driver and on the backend
+ * vRDMA device. It performs the following steps:
+ * 1. Sends a VIRTIO_RDMA_CMD_DESTROY_QP command to the device.
+ * 2. Frees DMA-coherent memory used for user-space WQE buffers.
+ * 3. Removes mmap entries for SQ/RQ rings.
+ * 4. Decrements global QP count.
+ *
+ * Context: Called in process context. May sleep.
+ * Return:
+ * * 0 on success
+ * * Negative errno on failure (e.g., communication error with device)
+ */
+static int vrdma_destroy_qp(struct ib_qp *ibqp, struct ib_udata *udata)
+{
+ struct vrdma_dev *vdev = to_vdev(ibqp->device);
+ struct vrdma_qp *vqp = to_vqp(ibqp);
+ struct vrdma_cmd_destroy_qp *cmd;
+ struct scatterlist in;
+ int rc;
+
+ cmd = kzalloc(sizeof(*cmd), GFP_KERNEL);
+ if (!cmd)
+ return -ENOMEM;
+
+ cmd->qpn = vqp->qp_handle;
+ sg_init_one(&in, cmd, sizeof(*cmd));
+
+ /* Send DESTROY_QP command to the backend device */
+ rc = vrdma_exec_verbs_cmd(vdev, VIRTIO_RDMA_CMD_DESTROY_QP, &in, NULL);
+ if (rc) {
+ dev_err(&vdev->vdev->dev, "DESTROY_QP failed for qpn=%u: %d\n",
+ vqp->qp_handle, rc);
+ /*
+ * Even if the device command fails, we still proceed to free
+ * local resources because the QP is being destroyed from the
+ * software side regardless.
+ */
+ }
+
+ /* Clean up user-space mappings if this is a user QP */
+ if (udata) {
+ /* Free Send Queue buffer */
+ if (vqp->usq_buf) {
+ dma_free_coherent(vdev->vdev->dev.parent,
+ vqp->usq_buf_size,
+ vqp->usq_buf,
+ vqp->usq_dma_addr);
+ rdma_user_mmap_entry_remove(&vqp->sq_entry->rdma_entry);
+ }
+
+ /* Free Receive Queue buffer */
+ if (vqp->urq_buf) {
+ dma_free_coherent(vdev->vdev->dev.parent,
+ vqp->urq_buf_size,
+ vqp->urq_buf,
+ vqp->urq_dma_addr);
+ rdma_user_mmap_entry_remove(&vqp->rq_entry->rdma_entry);
+ }
+ }
+
+ /* Decrement global QP counter */
+ atomic_dec(&vdev->num_qp);
+
+ kfree(cmd);
+ return rc;
+}
+
+static const struct ib_device_ops vrdma_dev_ops = {
.owner = THIS_MODULE,
.uverbs_abi_ver = VIRTIO_RDMA_ABI_VERSION,
.driver_id = RDMA_DRIVER_VIRTIO,
+
+ .get_port_immutable = vrdma_port_immutable,
+ .query_device = vrdma_query_device,
+ .query_port = vrdma_query_port,
+ .get_netdev = vrdma_get_netdev,
+ .create_cq = vrdma_create_cq,
+ .destroy_cq = vrdma_destroy_cq,
+ .alloc_pd = vrdma_alloc_pd,
+ .dealloc_pd = vrdma_dealloc_pd,
+ .create_qp = vrdma_create_qp,
+ .destroy_qp = vrdma_destroy_qp,
};
/**
@@ -195,41 +1131,32 @@ int vrdma_register_ib_device(struct vrdma_dev *vrdev)
ibdev->node_type = RDMA_NODE_IB_CA;
strncpy(ibdev->node_desc, "VirtIO RDMA", sizeof(ibdev->node_desc));
- ibdev->phys_port_cnt = 1; /* Assume single port */
- ibdev->num_comp_vectors = 1; /* One completion vector */
+ ibdev->phys_port_cnt = 1; /* Assume single port */
+ ibdev->num_comp_vectors = 1; /* One completion vector */
- /* Set GUID: Use MAC-like identifier derived from device info (example) */
- memcpy(&ibdev->node_guid, vrdev->vdev->id.device, 6);
- *(u64 *)&ibdev->node_guid |= 0x020000 << 24; /* Make locally administered */
-
- /* --- Step 2: Set user verbs command mask --- */
-
- ibdev->uverbs_cmd_mask =
- BIT_ULL(IB_USER_VERBS_CMD_GET_CONTEXT) |
- BIT_ULL(IB_USER_VERBS_CMD_CREATE_COMP_CHANNEL) |
- BIT_ULL(IB_USER_VERBS_CMD_QUERY_DEVICE) |
- BIT_ULL(IB_USER_VERBS_CMD_QUERY_PORT) |
- BIT_ULL(IB_USER_VERBS_CMD_ALLOC_PD) |
- BIT_ULL(IB_USER_VERBS_CMD_DEALLOC_PD) |
- BIT_ULL(IB_USER_VERBS_CMD_CREATE_QP) |
- BIT_ULL(IB_USER_VERBS_CMD_MODIFY_QP) |
- BIT_ULL(IB_USER_VERBS_CMD_QUERY_QP) |
- BIT_ULL(IB_USER_VERBS_CMD_DESTROY_QP) |
- BIT_ULL(IB_USER_VERBS_CMD_POST_SEND) |
- BIT_ULL(IB_USER_VERBS_CMD_POST_RECV) |
- BIT_ULL(IB_USER_VERBS_CMD_CREATE_CQ) |
- BIT_ULL(IB_USER_VERBS_CMD_DESTROY_CQ) |
- BIT_ULL(IB_USER_VERBS_CMD_POLL_CQ) |
- BIT_ULL(IB_USER_VERBS_CMD_REQ_NOTIFY_CQ) |
- BIT_ULL(IB_USER_VERBS_CMD_REG_MR) |
- BIT_ULL(IB_USER_VERBS_CMD_DEREG_MR) |
- BIT_ULL(IB_USER_VERBS_CMD_CREATE_AH) |
- BIT_ULL(IB_USER_VERBS_CMD_MODIFY_AH) |
- BIT_ULL(IB_USER_VERBS_CMD_QUERY_AH) |
- BIT_ULL(IB_USER_VERBS_CMD_DESTROY_AH);
+ ibdev->uverbs_cmd_mask =
+ BIT_ULL(IB_USER_VERBS_CMD_GET_CONTEXT) |
+ BIT_ULL(IB_USER_VERBS_CMD_QUERY_DEVICE) |
+ BIT_ULL(IB_USER_VERBS_CMD_CREATE_COMP_CHANNEL) |
+ BIT_ULL(IB_USER_VERBS_CMD_QUERY_PORT) |
+ BIT_ULL(IB_USER_VERBS_CMD_ALLOC_PD) |
+ BIT_ULL(IB_USER_VERBS_CMD_DEALLOC_PD) |
+ BIT_ULL(IB_USER_VERBS_CMD_CREATE_QP) |
+ BIT_ULL(IB_USER_VERBS_CMD_MODIFY_QP) |
+ BIT_ULL(IB_USER_VERBS_CMD_QUERY_QP) |
+ BIT_ULL(IB_USER_VERBS_CMD_DESTROY_QP) |
+ BIT_ULL(IB_USER_VERBS_CMD_POST_SEND) |
+ BIT_ULL(IB_USER_VERBS_CMD_POST_RECV) |
+ BIT_ULL(IB_USER_VERBS_CMD_CREATE_CQ) |
+ BIT_ULL(IB_USER_VERBS_CMD_DESTROY_CQ) |
+ BIT_ULL(IB_USER_VERBS_CMD_POLL_CQ) |
+ BIT_ULL(IB_USER_VERBS_CMD_REQ_NOTIFY_CQ) |
+ BIT_ULL(IB_USER_VERBS_CMD_REG_MR) |
+ BIT_ULL(IB_USER_VERBS_CMD_DEREG_MR) |
+ BIT_ULL(IB_USER_VERBS_CMD_CREATE_AH) |
+ BIT_ULL(IB_USER_VERBS_CMD_MODIFY_AH) |
+ BIT_ULL(IB_USER_VERBS_CMD_QUERY_AH) |
+ BIT_ULL(IB_USER_VERBS_CMD_DESTROY_AH);
/* --- Step 3: Attach device operation vectors --- */
- ib_set_device_ops(ibdev, &virtio_rdma_dev_ops);
+ ib_set_device_ops(ibdev, &vrdma_dev_ops);
/* --- Step 4: Bind to netdev (optional, for RoCE) --- */
if (vrdev->netdev) {
@@ -244,6 +1171,9 @@ int vrdma_register_ib_device(struct vrdma_dev *vrdev)
return rc;
}
+ /* Set GUID: derive a placeholder identifier from the device name */
+ memcpy(&ibdev->node_guid, ibdev->name, 6);
+
pr_info("Successfully registered vRDMA device as '%s'\n", dev_name(&ibdev->dev));
return 0;
}
diff --git a/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_ib.h b/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_ib.h
index bdba5a9de..ba88599c8 100644
--- a/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_ib.h
+++ b/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_ib.h
@@ -13,6 +13,12 @@
#define VRDMA_COMM_TIMEOUT 1000000
+enum vrdma_type {
+ VIRTIO_RDMA_TYPE_USER,
+ VIRTIO_RDMA_TYPE_KERNEL
+};
+
enum {
VIRTIO_RDMA_ATOMIC_NONE,
VIRTIO_RDMA_ATOMIC_HCA,
@@ -71,12 +77,110 @@ struct vrdma_cq {
struct vrdma_vq *vq; /* Virtqueue where CQEs arrive */
struct rdma_user_mmap_entry *entry; /* For mmap support in userspace */
spinlock_t lock;
- struct virtio_rdma_cqe *queue; /* CQE ring buffer */
+ struct vrdma_cqe *queue; /* CQE ring buffer */
size_t queue_size;
dma_addr_t dma_addr;
u32 num_cqe;
};
+struct vrdma_pd {
+ struct ib_pd ibpd;
+ u32 pd_handle;
+};
+
+/**
+ * struct vrdma_port_attr - Virtual RDMA port attributes
+ * @state: Port physical state (e.g., IB_PORT_DOWN, IB_PORT_ACTIVE).
+ * @max_mtu: Maximum MTU supported by the port.
+ * @active_mtu: Currently active MTU.
+ * @phys_mtu: Physical layer MTU (typically same as active_mtu in virtual devices).
+ * @gid_tbl_len: Size of the GID table (number of supported GIDs).
+ * @port_cap_flags: Port capabilities (e.g., IB_PORT_CM_SUP, IB_PORT_IP_BASED_GIDS).
+ * @max_msg_sz: Maximum message size supported.
+ * @bad_pkey_cntr: P_Key violation counter (optional in virtual devices).
+ * @qkey_viol_cntr: QKey violation counter.
+ * @pkey_tbl_len: Number of entries in the P_Key table.
+ * @active_width: Current active width (e.g., IB_WIDTH_4X, IB_WIDTH_1X).
+ * @active_speed: Current active speed (e.g., IB_SPEED_QDR, IB_SPEED_EDR).
+ * @phys_state: Physical port state (vendor-specific, optional extension).
+ * @reserved: Reserved for future use or alignment padding; must be zeroed.
+ *
+ * This structure mirrors `struct ib_port_attr` from <rdma/ib_verbs.h> and is used
+ * to query port properties via the `vrdma_query_port()` operation.
+ */
+struct vrdma_port_attr {
+ enum ib_port_state state;
+ enum ib_mtu max_mtu;
+ enum ib_mtu active_mtu;
+ u32 phys_mtu;
+ int gid_tbl_len;
+ u32 port_cap_flags;
+ u32 max_msg_sz;
+ u32 bad_pkey_cntr;
+ u32 qkey_viol_cntr;
+ u16 pkey_tbl_len;
+ u16 active_speed;
+ u8 active_width;
+ u8 phys_state;
+ u32 reserved[32]; /* For future extensions */
+} __packed;
+
+struct vrdma_ucontext {
+ struct ib_ucontext ibucontext;
+ struct vrdma_dev *dev;
+};
+
+/**
+ * struct vrdma_qp - Virtual RDMA Queue Pair (QP) private data
+ *
+ * This structure holds all driver-private state for a QP in the virtio-rdma driver.
+ * It is allocated during ib_create_qp() and freed on ib_destroy_qp().
+ */
+struct vrdma_qp {
+ /* Public IB layer object must be first */
+ struct ib_qp ibqp;
+
+ /* QP type (IB_QPT_RC, IB_QPT_UD, etc.) */
+ u8 type;
+
+ /* Port number this QP is bound to (usually 1 for single-port devices) */
+ u8 port;
+
+ /* Handle used by backend to identify this QP */
+ u32 qp_handle;
+
+ /* Send Queue (SQ) resources */
+ struct vrdma_vq *sq; /* Virtqueue for SQ ops */
+ void *usq_buf; /* Kernel-mapped send queue ring */
+ size_t usq_buf_size; /* Size of SQ ring buffer */
+ dma_addr_t usq_dma_addr; /* DMA address for coherent mapping */
+
+ /* Receive Queue (RQ) resources */
+ struct vrdma_vq *rq; /* Virtqueue for RQ ops */
+ void *urq_buf; /* Kernel-mapped receive queue ring */
+ size_t urq_buf_size; /* Size of RQ ring buffer */
+ dma_addr_t urq_dma_addr; /* DMA address for coherent mapping */
+
+ /* User-space mmap entries for userspace QP access */
+ struct vrdma_user_mmap_entry *sq_entry; /* Mmap entry for SQ buffer */
+ struct vrdma_user_mmap_entry *rq_entry; /* Mmap entry for RQ buffer */
+};
+
+static inline struct vrdma_cq *to_vcq(struct ib_cq *ibcq)
+{
+ return container_of(ibcq, struct vrdma_cq, ibcq);
+}
+
+static inline struct vrdma_pd *to_vpd(struct ib_pd *ibpd)
+{
+ return container_of(ibpd, struct vrdma_pd, ibpd);
+}
+
+static inline struct vrdma_qp *to_vqp(struct ib_qp *ibqp)
+{
+ return container_of(ibqp, struct vrdma_qp, ibqp);
+}
+
int vrdma_register_ib_device(struct vrdma_dev *vrdev);
void vrdma_unregister_ib_device(struct vrdma_dev *vrdev);
diff --git a/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_main.c b/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_main.c
index ea2f15491..9113fa3a3 100644
--- a/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_main.c
+++ b/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_main.c
@@ -29,39 +29,39 @@
*/
static int vrdma_probe(struct virtio_device *vdev)
{
- struct vrdma_dev *vrdev;
- int rc;
-
- /* Step 1: Allocate IB device structure using ib_core's allocator */
- vrdev = ib_alloc_device(vrdma_dev, ib_dev);
- if (!vrdev) {
- pr_err("Failed to allocate vRDMA device\n");
- return -ENOMEM;
- }
-
- /* Initialize basic fields */
- vrdev->vdev = vdev;
- vdev->priv = vrdev;
-
- spin_lock_init(&vrdev->ctrl_lock);
- spin_lock_init(&vrdev->pending_mmaps_lock);
- INIT_LIST_HEAD(&vrdev->pending_mmaps);
-
- /* Step 2: Check doorbell mechanism support */
- if (to_vp_device(vdev)->mdev.notify_offset_multiplier != PAGE_SIZE) {
- pr_warn("notify_offset_multiplier=%u != PAGE_SIZE, disabling fast doorbell\n",
- to_vp_device(vdev)->mdev.notify_offset_multiplier);
- vrdev->fast_doorbell = false;
- } else {
- vrdev->fast_doorbell = true;
- }
-
- /* Step 3: Initialize hardware interface (virtqueues) */
- rc = vrdma_init_device(vrdev);
- if (rc) {
- pr_err("Failed to initialize vRDMA device queues\n");
- goto err_dealloc_device;
- }
+ struct vrdma_dev *vrdev;
+ int rc;
+
+ /* Step 1: Allocate IB device structure using ib_core's allocator */
+ vrdev = ib_alloc_device(vrdma_dev, ib_dev);
+ if (!vrdev) {
+ pr_err("Failed to allocate vRDMA device\n");
+ return -ENOMEM;
+ }
+
+ /* Initialize basic fields */
+ vrdev->vdev = vdev;
+ vdev->priv = vrdev;
+
+ spin_lock_init(&vrdev->ctrl_lock);
+ spin_lock_init(&vrdev->pending_mmaps_lock);
+ INIT_LIST_HEAD(&vrdev->pending_mmaps);
+
+ /* Step 2: Check doorbell mechanism support */
+ if (to_vp_device(vdev)->mdev.notify_offset_multiplier != PAGE_SIZE) {
+ pr_warn("notify_offset_multiplier=%u != PAGE_SIZE, disabling fast doorbell\n",
+ to_vp_device(vdev)->mdev.notify_offset_multiplier);
+ vrdev->fast_doorbell = false;
+ } else {
+ vrdev->fast_doorbell = true;
+ }
+
+ /* Step 3: Initialize hardware interface (virtqueues) */
+ rc = vrdma_init_device(vrdev);
+ if (rc) {
+ pr_err("Failed to initialize vRDMA device queues\n");
+ goto err_dealloc_device;
+ }
rc = vrdma_init_netdev(vrdev);
if (rc) {
@@ -69,26 +69,26 @@ static int vrdma_probe(struct virtio_device *vdev)
goto err_cleanup_device;
}
- /* Step 4: Register with InfiniBand core layer */
- rc = vrdma_register_ib_device(vrdev);
- if (rc) {
- pr_err("Failed to register with IB subsystem\n");
- goto err_cleanup_netdev;
- }
+ /* Step 4: Register with InfiniBand core layer */
+ rc = vrdma_register_ib_device(vrdev);
+ if (rc) {
+ pr_err("Failed to register with IB subsystem\n");
+ goto err_cleanup_netdev;
+ }
- return 0;
+ return 0;
err_cleanup_netdev:
vrdma_finish_netdev(vrdev);
err_cleanup_device:
- vrdma_finish_device(vrdev); /* Safe cleanup of queues and reset */
+ vrdma_finish_device(vrdev); /* Safe cleanup of queues and reset */
err_dealloc_device:
- ib_dealloc_device(&vrdev->ib_dev); /* Frees vrdev itself */
+ ib_dealloc_device(&vrdev->ib_dev); /* Frees vrdev itself */
vdev->priv = NULL;
- return rc;
+ return rc;
}
static void vrdma_remove(struct virtio_device *vdev)
diff --git a/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_mmap.h b/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_mmap.h
new file mode 100644
index 000000000..acad4626c
--- /dev/null
+++ b/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_mmap.h
@@ -0,0 +1,88 @@
+/* SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause */
+
+/* Authors: Xiong Weimin <xiongweimin@kylinos.cn> */
+/* Copyright 2020 kylinos.cn. All Rights Reserved. */
+#ifndef __VRDMA_MMAP_H__
+#define __VRDMA_MMAP_H__
+
+#include <linux/types.h>
+#include <linux/kref.h>
+
+/* Mmap type definitions for virtio-rdma */
+#define VRDMA_MMAP_CQ 1 /* Mapping for Completion Queue (CQ) */
+#define VRDMA_MMAP_QP 2 /* Mapping for Queue Pair (QP) */
+
+/**
+ * struct vrdma_user_mmap_entry - Private extension of RDMA user mmap entry
+ *
+ * This structure extends the generic 'struct rdma_user_mmap_entry' to carry
+ * driver-specific data for virtio-rdma. It is used when registering mmap
+ * regions that allow userspace to directly access hardware rings or buffers
+ * via memory mapping (e.g., CQ or QP context rings).
+ *
+ * @rdma_entry: The base RDMA core mmap entry; must be the first member
+ * to ensure proper container_of() resolution and compatibility
+ * with RDMA subsystem APIs.
+ * @mmap_type: Specifies the type of mapped resource (VRDMA_MMAP_CQ or
+ * VRDMA_MMAP_QP). Used in fault handling to determine behavior.
+ * @vq: Pointer to the associated virtqueue. This allows the driver to link
+ * the mmap region with a specific virtual queue for event processing
+ * or doorbell handling.
+ * @user_buf: Virtual address of the user-space buffer (optional). Can be used
+ * to map kernel-managed ring buffers into user space, allowing direct
+ * access without system calls.
+ * @ubuf_size: Size of the mapped user buffer in bytes. Used for bounds
+ * checking during mapping and fault operations.
+ */
+struct vrdma_user_mmap_entry {
+ struct rdma_user_mmap_entry rdma_entry;
+ u8 mmap_type; /* Type of mmap region (CQ/QP) */
+ struct virtqueue *vq; /* Associated virtqueue pointer */
+ void *user_buf; /* User buffer virtual address */
+ u64 ubuf_size; /* Size of the mapped user buffer */
+};
+
+static inline struct vrdma_user_mmap_entry *
+to_ventry(struct rdma_user_mmap_entry *rdma_entry)
+{
+ return container_of(rdma_entry, struct vrdma_user_mmap_entry,
+ rdma_entry);
+}
+
+/**
+ * vrdma_mmap - Handle userspace mmap request for virtio-rdma resources
+ * @context: User context from IB layer
+ * @vma: Virtual memory area to be mapped
+ *
+ * This callback is invoked when userspace calls mmap() on a special offset
+ * returned by an ioctl (e.g., during CQ or QP creation). It maps device-specific
+ * memory regions (like completion queues or queue pair rings) into user space
+ * for zero-copy access.
+ *
+ * The VMA's pgoff field contains the mmap offset registered via
+ * rdma_user_mmap_entry_insert(). This function looks up the corresponding
+ * mmap entry and sets up the appropriate vm_ops for page fault handling.
+ *
+ * Return:
+ * - 0 on success
+ * - Negative error code (e.g., -EINVAL, -EAGAIN) on failure
+ */
+int vrdma_mmap(struct ib_ucontext *context, struct vm_area_struct *vma);
+
+/**
+ * vrdma_mmap_free - Free mmap entry when userspace unmaps or closes
+ * @rdma_entry: The mmap entry being released
+ *
+ * This callback is registered with the RDMA core to free private mmap entries
+ * when the user process unmaps the region or exits. It is responsible for
+ * releasing any resources associated with the mapping (e.g., freeing metadata).
+ *
+ * The function should use 'to_ventry()' to retrieve the private structure,
+ * then kfree() it. Note: The actual mapped memory (e.g., ring buffer) may be
+ * freed separately depending on lifecycle management.
+ */
+void vrdma_mmap_free(struct rdma_user_mmap_entry *rdma_entry);
+
+#endif
\ No newline at end of file
diff --git a/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_netdev.c b/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_netdev.c
index e83902e6d..19dd9af18 100644
--- a/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_netdev.c
+++ b/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_netdev.c
@@ -16,67 +16,67 @@
* @vrdev: The vRDMA device
*
* WARNING: This is a non-standard hack for development/emulation environments.
- * Do not use in production or upstream drivers.
+ * Do not use in production or upstream drivers.
*
* Returns 0 on success, or negative errno.
*/
int vrdma_init_netdev(struct vrdma_dev *vrdev)
{
- struct pci_dev *pdev_net;
- struct virtio_pci_device *vp_dev;
- struct virtio_pci_device *vnet_pdev;
- void *priv;
- struct net_device *netdev;
-
- if (!vrdev || !vrdev->vdev) {
- pr_err("%s: invalid vrdev or vdev\n", __func__);
- return -EINVAL;
- }
-
- vp_dev = to_vp_device(vrdev->vdev);
-
- /* Find the PCI device at function 0 of the same slot */
- pdev_net = pci_get_slot(vp_dev->pci_dev->bus,
- PCI_DEVFN(PCI_SLOT(vp_dev->pci_dev->devfn), 0));
- if (!pdev_net) {
- pr_err("Failed to find PCI device at fn=0 of slot %x\n",
- PCI_SLOT(vp_dev->pci_dev->devfn));
- return -ENODEV;
- }
-
- /* Optional: Validate it's a known virtio-net device */
- if (pdev_net->vendor != PCI_VENDOR_ID_REDHAT_QUMRANET ||
- pdev_net->device != 0x1041) {
- pr_warn("PCI device %04x:%04x is not expected virtio-net (1041) device\n",
- pdev_net->vendor, pdev_net->device);
- pci_dev_put(pdev_net);
- return -ENODEV;
- }
-
- /* Get the virtio_pci_device from drvdata */
- vnet_pdev = pci_get_drvdata(pdev_net);
- if (!vnet_pdev || !vnet_pdev->vdev.priv) {
- pr_err("No driver data or priv for virtio-net device\n");
- pci_dev_put(pdev_net);
- return -ENODEV;
- }
-
- priv = vnet_pdev->vdev.priv;
+ struct pci_dev *pdev_net;
+ struct virtio_pci_device *vp_dev;
+ struct virtio_pci_device *vnet_pdev;
+ void *priv;
+ struct net_device *netdev;
+
+ if (!vrdev || !vrdev->vdev) {
+ pr_err("%s: invalid vrdev or vdev\n", __func__);
+ return -EINVAL;
+ }
+
+ vp_dev = to_vp_device(vrdev->vdev);
+
+ /* Find the PCI device at function 0 of the same slot */
+ pdev_net = pci_get_slot(vp_dev->pci_dev->bus,
+ PCI_DEVFN(PCI_SLOT(vp_dev->pci_dev->devfn), 0));
+ if (!pdev_net) {
+ pr_err("Failed to find PCI device at fn=0 of slot %x\n",
+ PCI_SLOT(vp_dev->pci_dev->devfn));
+ return -ENODEV;
+ }
+
+ /* Optional: Validate it's a known virtio-net device */
+ if (pdev_net->vendor != PCI_VENDOR_ID_REDHAT_QUMRANET ||
+ pdev_net->device != 0x1041) {
+ pr_warn("PCI device %04x:%04x is not expected virtio-net (1041) device\n",
+ pdev_net->vendor, pdev_net->device);
+ pci_dev_put(pdev_net);
+ return -ENODEV;
+ }
+
+ /* Get the virtio_pci_device from drvdata */
+ vnet_pdev = pci_get_drvdata(pdev_net);
+ if (!vnet_pdev || !vnet_pdev->vdev.priv) {
+ pr_err("No driver data or priv for virtio-net device\n");
+ pci_dev_put(pdev_net);
+ return -ENODEV;
+ }
+
+ priv = vnet_pdev->vdev.priv;
vrdev->netdev = priv - ALIGN(sizeof(struct net_device), NETDEV_ALIGN);
- netdev = vrdev->netdev;
+ netdev = vrdev->netdev;
- if (!netdev || !netdev->netdev_ops) {
- pr_err("Invalid net_device retrieved from virtio-net\n");
- pci_dev_put(pdev_net);
- return -ENODEV;
- }
+ if (!netdev || !netdev->netdev_ops) {
+ pr_err("Invalid net_device retrieved from virtio-net\n");
+ pci_dev_put(pdev_net);
+ return -ENODEV;
+ }
- /* Hold reference so netdev won't disappear */
- dev_hold(netdev);
+ /* Hold reference so netdev won't disappear */
+ dev_hold(netdev);
- pci_dev_put(pdev_net); /* Release reference from pci_get_slot */
+ pci_dev_put(pdev_net); /* Release reference from pci_get_slot */
- return 0;
+ return 0;
}
/**
@@ -88,18 +88,18 @@ int vrdma_init_netdev(struct vrdma_dev *vrdev)
*/
void vrdma_finish_netdev(struct vrdma_dev *vrdev)
{
- if (!vrdev) {
- pr_err("%s: invalid vrdev pointer\n", __func__);
- return;
- }
-
- if (vrdev->netdev) {
- pr_info("[%s]: Releasing reference to net_device '%s'\n",
- __func__, vrdev->netdev->name);
-
- dev_put(vrdev->netdev);
- vrdev->netdev = NULL;
- } else {
- pr_debug("%s: no netdev to release\n", __func__);
- }
+ if (!vrdev) {
+ pr_err("%s: invalid vrdev pointer\n", __func__);
+ return;
+ }
+
+ if (vrdev->netdev) {
+ pr_info("[%s]: Releasing reference to net_device '%s'\n",
+ __func__, vrdev->netdev->name);
+
+ dev_put(vrdev->netdev);
+ vrdev->netdev = NULL;
+ } else {
+ pr_debug("%s: no netdev to release\n", __func__);
+ }
}
\ No newline at end of file
diff --git a/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_queue.c b/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_queue.c
index 78779c243..57c635aaf 100644
--- a/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_queue.c
+++ b/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_queue.c
@@ -18,4 +18,114 @@ void vrdma_cq_ack(struct virtqueue *vq)
if (vcq && vcq->ibcq.comp_handler)
vcq->ibcq.comp_handler(&vcq->ibcq, vcq->ibcq.cq_context);
+}
+
+/**
+ * vrdma_qp_check_cap - Validate QP capacity limits against device attributes
+ * @vdev: Pointer to the virtio-rdma device
+ * @cap: User-requested QP capabilities
+ * @has_srq: Whether this QP is associated with a Shared Receive Queue (SRQ)
+ *
+ * Checks that the requested QP capacities (WQEs, SGEs) do not exceed
+ * device limits. Inline data limit is currently ignored.
+ *
+ * Return:
+ * 0 if all constraints are satisfied, -EINVAL otherwise.
+ */
+static int vrdma_qp_check_cap(struct vrdma_dev *vdev,
+ const struct ib_qp_cap *cap, bool has_srq)
+{
+ if (cap->max_send_wr > vdev->attr.max_qp_wr) {
+ dev_warn(&vdev->vdev->dev,
+ "invalid max_send_wr = %u > %u\n",
+ cap->max_send_wr, vdev->attr.max_qp_wr);
+ return -EINVAL;
+ }
+
+ if (cap->max_send_sge > vdev->attr.max_send_sge) {
+ dev_warn(&vdev->vdev->dev,
+ "invalid max_send_sge = %u > %u\n",
+ cap->max_send_sge, vdev->attr.max_send_sge);
+ return -EINVAL;
+ }
+
+ /* Only check receive queue parameters if no SRQ is used */
+ if (!has_srq) {
+ if (cap->max_recv_wr > vdev->attr.max_qp_wr) {
+ dev_warn(&vdev->vdev->dev,
+ "invalid max_recv_wr = %u > %u\n",
+ cap->max_recv_wr, vdev->attr.max_qp_wr);
+ return -EINVAL;
+ }
+
+ if (cap->max_recv_sge > vdev->attr.max_recv_sge) {
+ dev_warn(&vdev->vdev->dev,
+ "invalid max_recv_sge = %u > %u\n",
+ cap->max_recv_sge, vdev->attr.max_recv_sge);
+ return -EINVAL;
+ }
+ }
+
+ /* TODO: Add check for inline data: cap->max_inline_data <= dev->attr.max_inline_data */
+
+ return 0;
+}
+
+/**
+ * vrdma_qp_check_init - Validate QP initialization attributes
+ * @vdev: The virtual RDMA device
+ * @init: QP initialization attributes from user/kernel space
+ *
+ * Performs semantic validation of QP creation parameters including:
+ * - Supported QP types
+ * - Valid CQ bindings
+ * - Port number validity for special QP types
+ * - Capacity limits via vrdma_qp_check_cap()
+ *
+ * Return:
+ * 0 on success, negative errno code on failure.
+ */
+int vrdma_qp_check_init(struct vrdma_dev *vdev,
+ const struct ib_qp_init_attr *init)
+{
+ const struct ib_qp_cap *cap = &init->cap;
+ u8 port_num = init->port_num;
+ int ret;
+
+ /* Check supported QP types */
+ switch (init->qp_type) {
+ case IB_QPT_SMI:
+ case IB_QPT_GSI:
+ case IB_QPT_RC:
+ case IB_QPT_UC:
+ case IB_QPT_UD:
+ break; /* Supported */
+ default:
+ dev_dbg(&vdev->vdev->dev,
+ "QP type %d not supported\n", init->qp_type);
+ return -EOPNOTSUPP;
+ }
+
+ /* Both send and receive completion queues are required */
+ if (!init->send_cq || !init->recv_cq) {
+ dev_warn(&vdev->vdev->dev,
+ "missing send or recv completion queue\n");
+ return -EINVAL;
+ }
+
+ /* Validate QP capacity limits */
+ ret = vrdma_qp_check_cap(vdev, cap, !!init->srq);
+ if (ret)
+ return ret;
+
+ /* For SMI/GSI QPs, ensure port number is valid */
+ if (init->qp_type == IB_QPT_SMI || init->qp_type == IB_QPT_GSI) {
+ if (!rdma_is_port_valid(&vdev->ib_dev, port_num)) {
+ dev_warn(&vdev->vdev->dev,
+ "invalid port number %u\n", port_num);
+ return -EINVAL;
+ }
+ }
+
+ return 0;
}
\ No newline at end of file
diff --git a/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_queue.h b/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_queue.h
index 64b896208..a40c3762f 100644
--- a/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_queue.h
+++ b/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_queue.h
@@ -10,5 +10,6 @@
#include "vrdma_dev_api.h"
void vrdma_cq_ack(struct virtqueue *vq);
-
+int vrdma_qp_check_init(struct vrdma_dev *vdev,
+ const struct ib_qp_init_attr *init);
#endif
\ No newline at end of file
diff --git a/linux-6.16.8/include/rdma/ib_verbs.h b/linux-6.16.8/include/rdma/ib_verbs.h
index 6353da1c0..a129f1cc9 100644
--- a/linux-6.16.8/include/rdma/ib_verbs.h
+++ b/linux-6.16.8/include/rdma/ib_verbs.h
@@ -659,6 +659,7 @@ void rdma_free_hw_stats_struct(struct rdma_hw_stats *stats);
#define RDMA_CORE_CAP_PROT_ROCE_UDP_ENCAP 0x00800000
#define RDMA_CORE_CAP_PROT_RAW_PACKET 0x01000000
#define RDMA_CORE_CAP_PROT_USNIC 0x02000000
+#define RDMA_CORE_CAP_PROT_VIRTIO 0x04000000
#define RDMA_CORE_PORT_IB_GRH_REQUIRED (RDMA_CORE_CAP_IB_GRH_REQUIRED \
| RDMA_CORE_CAP_PROT_ROCE \
@@ -690,6 +691,14 @@ void rdma_free_hw_stats_struct(struct rdma_hw_stats *stats);
#define RDMA_CORE_PORT_USNIC (RDMA_CORE_CAP_PROT_USNIC)
+/* In most cases, RDMA_CORE_PORT_VIRTIO is the same as RDMA_CORE_PORT_IBA_ROCE_UDP_ENCAP */
+#define RDMA_CORE_PORT_VIRTIO \
+ (RDMA_CORE_CAP_PROT_VIRTIO \
+ | RDMA_CORE_CAP_IB_MAD \
+ | RDMA_CORE_CAP_IB_CM \
+ | RDMA_CORE_CAP_AF_IB \
+ | RDMA_CORE_CAP_ETH_AH)
+
struct ib_port_attr {
u64 subnet_prefix;
enum ib_port_state state;
--
2.43.0
^ permalink raw reply related [flat|nested] 18+ messages in thread
* [PATCH 04/10] drivers/infiniband/hw/virtio: Implement MR, GID, ucontext and AH resource management verbs
2025-12-18 9:09 Implement initial driver for virtio-RDMA device(kernel) Xiong Weimin
` (2 preceding siblings ...)
2025-12-18 9:09 ` [PATCH 03/10] drivers/infiniband/hw/virtio: Implement core device and key resource management Xiong Weimin
@ 2025-12-18 9:09 ` Xiong Weimin
2025-12-18 9:09 ` [PATCH 05/10] drivers/infiniband/hw/virtio: Implement memory mapping and MR scatter-gather support Xiong Weimin
` (7 subsequent siblings)
11 siblings, 0 replies; 18+ messages in thread
From: Xiong Weimin @ 2025-12-18 9:09 UTC (permalink / raw)
To: Michael S . Tsirkin, David Hildenbrand, Jason Wang,
Stefano Garzarella, Thomas Monjalon, David Marchand,
Luca Boccassi, Kevin Traynor, Christian Ehrhardt, Xuan Zhuo,
Eugenio Pérez, Xueming Li, Maxime Coquelin, Chenbo Xia,
Bruce Richardson
Cc: kvm, virtualization, netdev, xiongweimin
From: xiongweimin <xiongweimin@kylinos.cn>
This commit adds foundational resource management capabilities to the
virtio RDMA driver, enabling full RDMA operations:
1. Memory Region (MR) Management:
- DMA MR registration via GET_DMA_MR
- Two-level page table for large scatter-gather lists
- CREATE_MR/DEREG_MR backend command flow
- Atomic command execution with virtqueue
2. Global Identifier (GID) Management:
- ADD_GID/DEL_GID backend commands
- RoCE v1/v2 GID type support
- Port-based GID table operations
3. User Context (ucontext) Support:
- Allocation and deallocation hooks
- Device association for future PD/CQ/MR management
4. Address Handle (AH) Management:
- RoCE-specific AH creation/validation
- Unicast GRH enforcement
- Device-wide AH limit tracking
Key technical features:
- MRs support both DMA-direct and user-backed registrations
- Page-table optimized for large scatter-lists (see the sketch below)
- GID operations integrate with RDMA core notifications
- AHs store full address vectors for packet construction
- Resource limits enforced via atomic counters
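
For illustration, the two-level lookup reduces to plain index arithmetic.
A minimal stand-alone sketch, assuming 4 KiB pages and 512 eight-byte
entries per L2 table as in this series (the names are illustrative, not
part of the patch):

    #include <stdint.h>

    #define ENTRIES_PER_L2 512 /* 4096-byte table / 8-byte entry */

    /* Resolve the DMA address of data page 'n' of a registered MR.
     * l2_tables mirrors mr->pages_k: an L1 array of L2 tables, each
     * holding up to 512 page addresses. */
    static uint64_t mr_page_addr(uint64_t **l2_tables, unsigned int n)
    {
            unsigned int l1_idx = n / ENTRIES_PER_L2; /* which L2 table */
            unsigned int l2_off = n % ENTRIES_PER_L2; /* slot within it */

            return l2_tables[l1_idx][l2_off]; /* DMA address of page n */
    }

With this layout, a 1 GiB registration (262144 pages of 4 KiB) needs 512
L2 tables, all reachable from a single DMA-mapped L1 page.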
Signed-off-by: Xiong Weimin <xiongweimin@kylinos.cn>
---
.../infiniband/hw/virtio/vrdma_dev_api.h | 40 ++
.../drivers/infiniband/hw/virtio/vrdma_ib.c | 600 ++++++++++++++++++
.../drivers/infiniband/hw/virtio/vrdma_ib.h | 80 +++
3 files changed, 720 insertions(+)
diff --git a/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_dev_api.h b/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_dev_api.h
index d1db1bea4..da99f1f32 100644
--- a/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_dev_api.h
+++ b/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_dev_api.h
@@ -160,6 +160,46 @@ struct vrdma_cmd_destroy_qp {
__u32 qpn;
};
+struct vrdma_cmd_get_dma_mr {
+ __u32 pdn;
+ __u32 access_flags;
+};
+
+struct vrdma_rsp_get_dma_mr {
+ __u32 mrn;
+ __u32 lkey;
+ __u32 rkey;
+};
+
+struct vrdma_cmd_create_mr {
+ __u32 pdn;
+ __u32 access_flags;
+
+ __u32 max_num_sg;
+};
+
+struct vrdma_rsp_create_mr {
+ __u32 mrn;
+ __u32 lkey;
+ __u32 rkey;
+};
+
+struct vrdma_cmd_dereg_mr {
+ __u32 mrn;
+};
+
+struct vrdma_cmd_add_gid {
+ __u8 gid[16];
+ __u32 gid_type;
+ __u16 index;
+ __u32 port_num;
+};
+
+struct vrdma_cmd_del_gid {
+ __u16 index;
+ __u32 port_num;
+};
+
#define VRDMA_CTRL_OK 0
#define VRDMA_CTRL_ERR 1
diff --git a/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_ib.c b/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_ib.c
index f1f53314f..b4c16ddbb 100644
--- a/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_ib.c
+++ b/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_ib.c
@@ -1086,6 +1086,597 @@ static int vrdma_destroy_qp(struct ib_qp *ibqp, struct ib_udata *udata)
return rc;
}
+/**
+ * vrdma_get_dma_mr - Get a DMA memory region (uncached, direct-access MR)
+ * @pd: Protection Domain to associate this MR with
+ * @flags: Access permissions (IB_ACCESS_LOCAL_WRITE, IB_ACCESS_REMOTE_READ, etc.)
+ *
+ * This function creates a special type of Memory Region (MR) that refers to
+ * physically contiguous or scatter-gather DMA-capable memory, typically used
+ * for zero-copy or kernel-space registrations without user buffer backing.
+ *
+ * It issues the VIRTIO_RDMA_CMD_GET_DMA_MR command to the backend device,
+ * which returns:
+ * - An MR handle (mrn)
+ * - Local Key (lkey)
+ * - Remote Key (rkey)
+ *
+ * Unlike regular MRs created via ib_reg_mr(), this MR does not back any
+ * user-space virtual memory (i.e., no ib_umem). It is typically used for
+ * device-specific buffers, scratch memory, or control structures.
+ *
+ * Context: Called in process context. May sleep.
+ * Return:
+ * * &mr->ibmr on success
+ * * ERR_PTR(-ENOMEM) if memory allocation fails
+ * * ERR_PTR(-EIO) if device communication fails
+ */
+static struct ib_mr *vrdma_get_dma_mr(struct ib_pd *pd, int flags)
+{
+ struct vrdma_dev *vdev = to_vdev(pd->device);
+ struct vrdma_mr *mr;
+ struct vrdma_cmd_get_dma_mr *cmd;
+ struct vrdma_rsp_get_dma_mr *rsp;
+ struct scatterlist in, out;
+ int rc;
+
+ /* Allocate software MR structure */
+ mr = kzalloc(sizeof(*mr), GFP_KERNEL);
+ if (!mr)
+ return ERR_PTR(-ENOMEM);
+
+ cmd = kzalloc(sizeof(*cmd), GFP_KERNEL);
+ if (!cmd) {
+ rc = -ENOMEM;
+ goto err_free_mr;
+ }
+
+ rsp = kzalloc(sizeof(*rsp), GFP_KERNEL);
+ if (!rsp) {
+ rc = -ENOMEM;
+ goto err_free_cmd;
+ }
+
+ /* Prepare command parameters */
+ cmd->pdn = to_vpd(pd)->pd_handle;
+ cmd->access_flags = flags;
+
+ sg_init_one(&in, cmd, sizeof(*cmd));
+ sg_init_one(&out, rsp, sizeof(*rsp));
+
+ /* Send GET_DMA_MR command to device */
+ rc = vrdma_exec_verbs_cmd(vdev, VIRTIO_RDMA_CMD_GET_DMA_MR, &in, &out);
+ if (rc) {
+ dev_err(&vdev->vdev->dev,
+ "GET_DMA_MR command failed: %d\n", rc);
+ goto err_free_rsp;
+ }
+
+ /* Initialize MR fields from response */
+ mr->mr_handle = rsp->mrn;
+ mr->ibmr.lkey = rsp->lkey;
+ mr->ibmr.rkey = rsp->rkey;
+ mr->ibmr.pd = pd;
+ mr->ibmr.device = pd->device;
+	mr->ibmr.type = IB_MR_TYPE_DMA; /* DMA MR, no user memory backing */
+
+ /* No backing user memory */
+ mr->umem = NULL;
+ mr->iova = 0;
+ mr->size = 0;
+ mr->pages = NULL;
+ mr->pages_k = NULL;
+ mr->dma_pages = 0;
+ mr->npages = 0;
+ mr->max_pages = 0;
+
+ /* Cleanup command/response buffers */
+ kfree(cmd);
+ kfree(rsp);
+
+ return &mr->ibmr;
+
+err_free_rsp:
+ kfree(rsp);
+
+err_free_cmd:
+ kfree(cmd);
+
+err_free_mr:
+ kfree(mr);
+ return ERR_PTR(rc);
+}
+
+/**
+ * vrdma_init_page_tbl - Initialize a two-level page table for MR management
+ * @dev: vRDMA device pointer
+ * @npages: Maximum number of data pages this table can map
+ * @pages_dma: Output: L1 table with entries pointing to DMA addresses of L2 tables
+ * @dma_pages_p: Output: DMA address of the L1 table itself
+ *
+ * This function sets up a two-level page table structure used in Memory Region (MR)
+ * registration to support scatter-gather I/O. The layout is:
+ *
+ * L1 (Level 1): Single page, DMA-coherent, holds pointers to L2 tables.
+ * Will be passed to hardware via WQE or command.
+ *
+ * L2 (Level 2): Array of pages, each holding up to 512 x 8-byte DMA addresses
+ * (for 4KB page size). Each L2 table maps part of the S/G list.
+ *
+ * Example:
+ * npages = 1024 => needs 1024 / 512 = 2 L2 tables
+ *
+ * Return:
+ * Pointer to kernel virtual address of L1 table (pages_k), which stores
+ * virtual addresses of L2 tables for cleanup.
+ * On failure, returns NULL and cleans up all allocated memory.
+ */
+static uint64_t **vrdma_init_page_tbl(struct vrdma_dev *dev,
+ unsigned int npages,
+ uint64_t ***pages_dma,
+ dma_addr_t *dma_pages_p)
+{
+ unsigned int nl2 = DIV_ROUND_UP(npages, 512); /* one L2 table per 512 pages */
+ uint64_t **l1_table; /* L1: stores DMA addrs of L2s (device-readable) */
+ uint64_t **l1_table_k; /* L1: stores kernel vaddrs of L2s (for free) */
+ dma_addr_t l1_dma_addr;
+ dma_addr_t l2_dma_addr;
+ int i;
+
+ /* Allocate L1 table: must be DMA-coherent because device reads it */
+ l1_table = dma_alloc_coherent(dev->vdev->dev.parent, PAGE_SIZE, &l1_dma_addr, GFP_KERNEL);
+ if (!l1_table)
+ return NULL;
+
+ /* Allocate kernel-space array to track L2 virtual addresses */
+ l1_table_k = kzalloc(PAGE_SIZE, GFP_KERNEL);
+ if (!l1_table_k)
+ goto err_free_l1_table;
+
+ /* Allocate each L2 table (DMA-coherent, one per 512 entries) */
+ for (i = 0; i < nl2; i++) {
+ l1_table_k[i] = dma_alloc_coherent(dev->vdev->dev.parent, PAGE_SIZE, &l2_dma_addr, GFP_KERNEL);
+ if (!l1_table_k[i])
+ goto err_free_l2_tables;
+
+ l1_table[i] = (uint64_t *)l2_dma_addr; /* Device sees DMA address */
+ }
+
+ /* Output parameters */
+ *pages_dma = l1_table; /* Device-visible L1 (with DMA pointers) */
+ *dma_pages_p = l1_dma_addr; /* DMA address of L1 table */
+
+ return l1_table_k; /* Return kernel view for later cleanup */
+
+err_free_l2_tables:
+ /* Roll back any successfully allocated L2 tables */
+ while (--i >= 0) {
+ dma_free_coherent(dev->vdev->dev.parent, PAGE_SIZE, l1_table_k[i], (dma_addr_t)l1_table[i]);
+ }
+ kfree(l1_table_k);
+
+err_free_l1_table:
+ dma_free_coherent(dev->vdev->dev.parent, PAGE_SIZE, l1_table, l1_dma_addr);
+
+ return NULL;
+}
+
+/**
+ * vrdma_free_page_tbl - Free a two-level page table
+ * @dev: vRDMA device
+ * @pages_k: Return value from vrdma_init_page_tbl (kernel L2 pointers)
+ * @pages: L1 table with DMA addresses (output of pages_dma)
+ * @dma_pages: DMA address of L1 table
+ * @npages: Number of pages that were to be supported
+ *
+ * Frees both L1 and all L2 page tables allocated by vrdma_init_page_tbl.
+ */
+static void vrdma_free_page_tbl(struct vrdma_dev *dev,
+ uint64_t **pages_k,
+ uint64_t **pages,
+ dma_addr_t dma_pages,
+ unsigned int npages)
+{
+ unsigned int nl2 = DIV_ROUND_UP(npages, 512);
+ int i;
+
+ if (!pages_k || !pages)
+ return;
+
+ /* Free all L2 tables */
+ for (i = 0; i < nl2; i++) {
+ if (pages_k[i])
+ dma_free_coherent(dev->vdev->dev.parent, PAGE_SIZE, pages_k[i],
+ (dma_addr_t)pages[i]); /* pages[i] stores the L2 DMA address */
+ }
+
+ /* Free L1 tracking array */
+ kfree(pages_k);
+
+ /* Free L1 DMA table */
+ dma_free_coherent(dev->vdev->dev.parent, PAGE_SIZE, pages, dma_pages);
+}
+
+/**
+ * vrdma_alloc_mr - Allocate a multi-segment Memory Region (MR) with page tables
+ * @pd: Protection Domain to associate the MR with
+ * @mr_type: Type of MR (must be IB_MR_TYPE_MEM_REG)
+ * @max_num_sg: Maximum number of scatter/gather entries supported by this MR
+ *
+ * This function allocates a software MR structure and reserves a hardware MR
+ * context on the backend vRDMA device. It prepares a two-level page table
+ * (L1/L2) to support up to @max_num_sg pages, which will later be filled during
+ * memory registration (e.g., via ib_update_page()).
+ *
+ * The allocated MR is not yet backed by any actual memory - it serves as a
+ * container for future page population (used primarily by ib_get_dma_mr() path
+ * or special fast-register mechanisms).
+ *
+ * Command flow:
+ * - Sends VIRTIO_RDMA_CMD_CREATE_MR to device
+ * - Receives mr_handle, lkey, rkey from response
+ *
+ * Context: Called in process context. May sleep.
+ * Return:
+ * * &mr->ibmr on success
+ * * ERR_PTR(-EINVAL) if unsupported MR type
+ * * ERR_PTR(-ENOMEM) if memory allocation fails
+ * * ERR_PTR(-EIO) if device command fails
+ */
+static struct ib_mr *vrdma_alloc_mr(struct ib_pd *pd,
+ enum ib_mr_type mr_type,
+ u32 max_num_sg)
+{
+ struct vrdma_dev *vdev = to_vdev(pd->device);
+ struct vrdma_mr *mr;
+ struct vrdma_cmd_create_mr *cmd;
+ struct vrdma_rsp_create_mr *rsp;
+ struct scatterlist in, out;
+ int rc;
+
+ /* Only support standard memory registration */
+ if (mr_type != IB_MR_TYPE_MEM_REG)
+ return ERR_PTR(-EINVAL);
+
+ /* Allocate software MR structure */
+ mr = kzalloc(sizeof(*mr), GFP_KERNEL);
+ if (!mr)
+ return ERR_PTR(-ENOMEM);
+
+ cmd = kzalloc(sizeof(*cmd), GFP_KERNEL);
+ if (!cmd) {
+ rc = -ENOMEM;
+ goto err_free_mr;
+ }
+
+ rsp = kzalloc(sizeof(*rsp), GFP_KERNEL);
+ if (!rsp) {
+ rc = -ENOMEM;
+ goto err_free_cmd;
+ }
+
+ /*
+ * Allocate two-level page table for S/G support.
+ * Each L2 table holds PAGE_SIZE / sizeof(u64) entries.
+ * L1 table points to multiple L2s.
+ */
+ mr->pages_k = vrdma_init_page_tbl(vdev, max_num_sg,
+ &mr->pages, &mr->dma_pages);
+ if (!mr->pages_k) {
+ dev_err(&vdev->vdev->dev,
+ "Failed to allocate page table for %u S/G entries\n",
+ max_num_sg);
+ rc = -ENOMEM;
+ goto err_free_rsp;
+ }
+
+ mr->max_pages = max_num_sg;
+ mr->npages = 0;
+ mr->umem = NULL; /* No user memory backing at this stage */
+ mr->iova = 0;
+ mr->size = 0;
+
+ /* Prepare command */
+ cmd->pdn = to_vpd(pd)->pd_handle;
+ cmd->max_num_sg = max_num_sg;
+
+ sg_init_one(&in, cmd, sizeof(*cmd));
+ sg_init_one(&out, rsp, sizeof(*rsp));
+
+ /* Send CREATE_MR command to backend device */
+ rc = vrdma_exec_verbs_cmd(vdev, VIRTIO_RDMA_CMD_CREATE_MR, &in, &out);
+ if (rc) {
+ dev_err(&vdev->vdev->dev, "CREATE_MR failed: %d\n", rc);
+ goto err_free_page_tbl;
+ }
+
+ /* Initialize MR metadata from response */
+ mr->mr_handle = rsp->mrn;
+ mr->ibmr.lkey = rsp->lkey;
+ mr->ibmr.rkey = rsp->rkey;
+ mr->ibmr.pd = pd;
+ mr->ibmr.device = &vdev->ib_dev;
+ mr->ibmr.type = IB_MR_TYPE_MEM_REG;
+
+ /* Clean up command/response buffers */
+ kfree(cmd);
+ kfree(rsp);
+
+ return &mr->ibmr;
+
+err_free_page_tbl:
+ vrdma_free_page_tbl(vdev, mr->pages_k, mr->pages, mr->dma_pages,
+ max_num_sg);
+err_free_rsp:
+ kfree(rsp);
+err_free_cmd:
+ kfree(cmd);
+err_free_mr:
+ kfree(mr);
+ return ERR_PTR(rc);
+}
+
+/**
+ * vrdma_dereg_mr - Deregister and destroy a Memory Region (MR)
+ * @ibmr: The IB memory region to deregister
+ * @udata: User data (optional, for user-space MRs)
+ *
+ * This function unregisters a previously allocated MR from the vRDMA device.
+ * It performs the following steps:
+ * 1. Sends VIRTIO_RDMA_CMD_DEREG_MR command to the backend device
+ * 2. Frees software page tables (L1/L2) used for scatter-gather mapping
+ * 3. Releases user memory (if any) via ib_umem_release()
+ * 4. Frees local metadata (struct vrdma_mr)
+ *
+ * Context: Can be called in process context. May sleep.
+ * Return:
+ * * 0 on success
+ * * -EIO if device communication fails
+ * * Other negative errno codes on allocation failure (rare during dereg)
+ */
+static int vrdma_dereg_mr(struct ib_mr *ibmr, struct ib_udata *udata)
+{
+ struct vrdma_dev *vdev = to_vdev(ibmr->device);
+ struct vrdma_mr *mr = to_vmr(ibmr);
+ struct vrdma_cmd_dereg_mr *cmd;
+ struct scatterlist in;
+ int rc;
+
+ /* Allocate command buffer */
+ cmd = kzalloc(sizeof(*cmd), GFP_KERNEL);
+ if (!cmd)
+ return -ENOMEM;
+
+ /* Prepare command */
+ cmd->mrn = mr->mr_handle;
+ sg_init_one(&in, cmd, sizeof(*cmd));
+
+ /* Notify hardware to release MR context */
+ rc = vrdma_exec_verbs_cmd(vdev, VIRTIO_RDMA_CMD_DEREG_MR, &in, NULL);
+ if (rc) {
+ dev_err(&vdev->vdev->dev,
+ "VIRTIO_RDMA_CMD_DEREG_MR failed for mrn=0x%x, err=%d\n",
+ mr->mr_handle, rc);
+ rc = -EIO;
+ goto out_free_cmd;
+ }
+
+ /* Free two-level page table used for S/G entries */
+ vrdma_free_page_tbl(vdev, mr->pages_k, mr->pages, mr->dma_pages, mr->max_pages);
+
+ /* Release user memory if present */
+ if (mr->umem)
+ ib_umem_release(mr->umem);
+
+ /* Success */
+ kfree(cmd);
+ return 0;
+
+out_free_cmd:
+ kfree(cmd);
+ return rc;
+}
+
+/**
+ * vrdma_add_gid - Add a GID (Global Identifier) entry to the hardware
+ * @attr: GID attribute containing port, index, GID value, and GID type
+ * @context: Pointer to store driver-specific context (unused in vRDMA)
+ *
+ * This callback is invoked by the RDMA core when a GID table entry is added,
+ * either manually via sysfs or automatically during IPv6 address assignment.
+ *
+ * The function sends VIRTIO_RDMA_CMD_ADD_GID to the backend device to register
+ * the GID at the specified index and port. This allows the device to use this
+ * GID for RoCE traffic (e.g., as source in GRH).
+ *
+ * Note: The @context parameter is unused in vRDMA drivers since no additional
+ * per-GID software state is maintained.
+ *
+ * Context: Can sleep (called in process context).
+ * Return:
+ * * 0 on success
+ * * -ENOMEM if kmalloc fails
+ * * -EIO if device command fails
+ */
+static int vrdma_add_gid(const struct ib_gid_attr *attr, void **context)
+{
+ struct vrdma_dev *vdev = to_vdev(attr->device);
+ struct vrdma_cmd_add_gid *cmd;
+ struct scatterlist in;
+ int rc;
+
+ cmd = kzalloc(sizeof(*cmd), GFP_KERNEL);
+ if (!cmd)
+ return -ENOMEM;
+
+ /* Fill command parameters */
+ memcpy(cmd->gid, attr->gid.raw, sizeof(cmd->gid));
+ cmd->index = attr->index;
+ cmd->port_num = attr->port_num;
+ cmd->gid_type = attr->gid_type; /* e.g., IB_GID_TYPE_ROCE or IB_GID_TYPE_ROCE_UDP_ENCAP */
+
+ sg_init_one(&in, cmd, sizeof(*cmd));
+
+ /* Send command to backend */
+ rc = vrdma_exec_verbs_cmd(vdev, VIRTIO_RDMA_CMD_ADD_GID, &in, NULL);
+ if (rc)
+ dev_err(&vdev->vdev->dev,
+ "ADD_GID failed: port=%u index=%u type=%d, err=%d\n",
+ attr->port_num, attr->index, attr->gid_type, rc);
+
+ kfree(cmd);
+ return rc ? -EIO : 0;
+}
+
+/**
+ * vrdma_del_gid - Remove a GID entry from the hardware
+ * @attr: GID attribute specifying which GID to delete (by index/port)
+ * @context: Driver-specific context (passed from add_gid; unused here)
+ *
+ * This callback is called when a GID is removed from the GID table.
+ * It notifies the backend device to invalidate the GID mapping at the given index.
+ *
+ * The @context pointer is ignored because vRDMA does not maintain per-GID software state.
+ *
+ * Context: Can sleep (process context).
+ * Return:
+ * * 0 on success
+ * * -ENOMEM if allocation fails
+ * * -EIO if device command fails
+ */
+static int vrdma_del_gid(const struct ib_gid_attr *attr, void **context)
+{
+ struct vrdma_dev *vdev = to_vdev(attr->device);
+ struct vrdma_cmd_del_gid *cmd;
+ struct scatterlist in;
+ int rc;
+
+ cmd = kzalloc(sizeof(*cmd), GFP_KERNEL);
+ if (!cmd)
+ return -ENOMEM;
+
+ /* Only index and port are needed to identify the GID */
+ cmd->index = attr->index;
+ cmd->port_num = attr->port_num;
+
+ sg_init_one(&in, cmd, sizeof(*cmd));
+
+ /* Send command to backend */
+ rc = vrdma_exec_verbs_cmd(vdev, VIRTIO_RDMA_CMD_DEL_GID, &in, NULL);
+ if (rc)
+ dev_err(&vdev->vdev->dev,
+ "DEL_GID failed: port=%u index=%u, err=%d\n",
+ attr->port_num, attr->index, rc);
+
+ kfree(cmd);
+ return rc ? -EIO : 0;
+}
+
+static int vrdma_alloc_ucontext(struct ib_ucontext *uctx, struct ib_udata *udata)
+{
+ struct vrdma_ucontext *vuc = to_vucontext(uctx);
+
+ vuc->dev = to_vdev(uctx->device);
+
+ return 0;
+}
+
+static void vrdma_dealloc_ucontext(struct ib_ucontext *ibcontext)
+{
+}
+
+/**
+ * vrdma_create_ah - Create an Address Handle (AH) for RoCE communication
+ * @ibah: IB address handle to initialize
+ * @init_attr: AH initialization attributes
+ * @udata: User data (unused in vRDMA)
+ *
+ * This function creates a software-only Address Handle (AH), which represents
+ * a remote destination for UD or RC QP sends. Since this is a virtualized driver,
+ * no hardware command is sent; instead, the AH context is stored locally in
+ * struct vrdma_ah for later use during packet construction.
+ *
+ * The AH must:
+ * - Be RoCE type
+ * - Contain GRH (Global Routing Header)
+ * - Not be multicast (currently unsupported)
+ *
+ * Also enforces device limit on maximum number of active AHs via atomic counter.
+ *
+ * Context: Can sleep (called in process context).
+ * Return:
+ * * 0 on success
+ * * -EINVAL if attributes are invalid
+ * * -ENOMEM if AH limit exceeded
+ */
+static int vrdma_create_ah(struct ib_ah *ibah,
+ struct rdma_ah_init_attr *init_attr,
+ struct ib_udata *udata)
+{
+ struct vrdma_dev *vdev = to_vdev(ibah->device);
+ struct vrdma_ah *ah = to_vah(ibah);
+ const struct ib_global_route *grh;
+ u32 port_num = rdma_ah_get_port_num(init_attr->ah_attr);
+
+ /* Must have GRH enabled */
+ if (!(rdma_ah_get_ah_flags(init_attr->ah_attr) & IB_AH_GRH))
+ return -EINVAL;
+
+ grh = rdma_ah_read_grh(init_attr->ah_attr);
+
+ /* Only RoCE-type AHs with a unicast destination GID are supported */
+ if (init_attr->ah_attr->type != RDMA_AH_ATTR_TYPE_ROCE)
+ return -EINVAL;
+
+ if (rdma_is_multicast_addr((struct in6_addr *)grh->dgid.raw)) {
+ dev_dbg(&vdev->vdev->dev, "Multicast GID not supported in AH\n");
+ return -EINVAL;
+ }
+
+ /* Enforce max_ah limit using atomic increment with barrier */
+ if (!atomic_add_unless(&vdev->num_ah, 1, vdev->ib_dev.attrs.max_ah)) {
+ dev_dbg(&vdev->vdev->dev, "Exceeded max number of AHs (%u)\n",
+ vdev->ib_dev.attrs.max_ah);
+ return -ENOMEM;
+ }
+
+ /* Initialize AV (Address Vector) with relevant fields */
+ ah->av.port = port_num;
+ ah->av.pdn = to_vpd(ibah->pd)->pd_handle; /* Protection Domain Number */
+ ah->av.gid_index = grh->sgid_index; /* Source GID table index */
+ ah->av.hop_limit = grh->hop_limit;
+ ah->av.sl_tclass_flowlabel = (u32)(grh->traffic_class << 20) |
+ (grh->flow_label & 0xfffff); /* 8-bit traffic class above a 20-bit flow label */
+
+ memcpy(ah->av.dgid, grh->dgid.raw, sizeof(ah->av.dgid)); /* 128-bit Dest GID */
+ memcpy(ah->av.dmac, init_attr->ah_attr->roce.dmac, ETH_ALEN); /* Next-hop MAC */
+
+ return 0;
+}
+
+/**
+ * vrdma_destroy_ah - Destroy an Address Handle
+ * @ibah: The IB address handle to destroy
+ * @flags: Destroy flags (e.g., for deferred cleanup; unused here)
+ *
+ * This callback releases the software state associated with an AH.
+ * It decrements the per-device AH counter to allow new AH creation.
+ *
+ * No hardware interaction is needed since AHs are purely software constructs
+ * in this virtio-rdma implementation.
+ *
+ * Context: Does not sleep; only an atomic counter is updated, so it is safe in atomic context.
+ * Return: Always returns 0 (success).
+ */
+static int vrdma_destroy_ah(struct ib_ah *ibah, u32 flags)
+{
+ struct vrdma_dev *vdev = to_vdev(ibah->device);
+
+ atomic_dec(&vdev->num_ah);
+
+ return 0;
+}
+
static const struct ib_device_ops vrdma_dev_ops = {
.owner = THIS_MODULE,
.uverbs_abi_ver = VIRTIO_RDMA_ABI_VERSION,
@@ -1101,6 +1692,15 @@ static const struct ib_device_ops vrdma_dev_ops = {
.dealloc_pd = vrdma_dealloc_pd,
.create_qp = vrdma_create_qp,
.destroy_qp = vrdma_destroy_qp,
+ .get_dma_mr = vrdma_get_dma_mr,
+ .alloc_mr = vrdma_alloc_mr,
+ .dereg_mr = vrdma_dereg_mr,
+ .add_gid = vrdma_add_gid,
+ .del_gid = vrdma_del_gid,
+ .alloc_ucontext = vrdma_alloc_ucontext,
+ .dealloc_ucontext = vrdma_dealloc_ucontext,
+ .create_ah = vrdma_create_ah,
+ .destroy_ah = vrdma_destroy_ah,
};
/**
diff --git a/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_ib.h b/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_ib.h
index ba88599c8..6759c4349 100644
--- a/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_ib.h
+++ b/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_ib.h
@@ -11,6 +11,8 @@
#include <rdma/ib_verbs.h>
#include <rdma/vrdma_abi.h>
+#include "vrdma_abi.h"
+
#define VRDMA_COMM_TIMEOUT 1000000
enum vrdma_type {
@@ -130,6 +132,11 @@ struct vrdma_ucontext {
struct vrdma_dev *dev;
};
+struct vrdma_ah {
+ struct ib_ah ibah;
+ struct vrdma_av av;
+};
+
/**
* struct vrdma_qp - Virtual RDMA Queue Pair (QP) private data
*
@@ -166,6 +173,64 @@ struct vrdma_qp {
struct vrdma_user_mmap_entry *rq_entry; /* Mmap entry for RQ buffer */
};
+/**
+ * struct vrdma_mr - Software state of a Virtio-RDMA Memory Region (MR)
+ * @ibmr: InfiniBand core MR object (contains rkey, lkey, etc.)
+ * @umem: User memory descriptor from ib_umem_get(), holds
+ * page list and reference to user VMA
+ * @mr_handle: Handle returned by backend device for this MR
+ * @iova: I/O virtual address (start of the mapped region)
+ * @size: Total size of the memory region in bytes
+ * @pages: Level 1 (L1) page table - array holding the DMA addresses of
+ * the level 2 (L2) page tables, as seen by the device.
+ * This is used by the host driver to manage scatter-gather layout.
+ * @pages_k: Array of kernel virtual addresses of L2 page tables.
+ * Used to free memory correctly during cleanup.
+ * @dma_pages: DMA address of the L1 page table (first-level table),
+ * to be passed to the device or written in command WQE.
+ * @npages: Number of valid pages in the memory region
+ * @max_pages: Maximum number of pages that can be held in current
+ * page table allocation (based on initial mapping size)
+ *
+ * This structure represents a registered memory region in the vRDMA driver.
+ * It supports large memory registrations using a two-level page table design:
+ *
+ * L1 Page Table (contiguous DMA-mapped):
+ * Contains pointers to multiple L2 tables (each L2 = one page).
+ *
+ * L2 Page Tables:
+ * Each stores up to N DMA addresses (physical page addresses).
+ *
+ * The layout allows efficient hardware access while keeping kernel allocations
+ * manageable for very large mappings (e.g., tens of GB).
+ *
+ * Example layout for 4K pages and 512 entries per L2 table:
+ *
+ * L1 (dma_pages) -> [L2_0] -> [DMA_ADDR_A, ..., DMA_ADDR_Z]
+ * [L2_1] -> [DMA_ADDR_X, ..., DMA_ADDR_Y]
+ * ...
+ *
+ * Used during:
+ * - ib_reg_mr()
+ * - SEND/WRITE/READ operations with remote access
+ * - MR invalidation and cleanup in vrdma_dereg_mr()
+ */
+struct vrdma_mr {
+ struct ib_mr ibmr;
+ struct ib_umem *umem;
+
+ u32 mr_handle;
+ u64 iova;
+ u64 size;
+
+ u64 **pages; /* L1: array of L2 table DMA address pointers */
+ u64 **pages_k; /* L1: array of L2 table kernel virtual addresses */
+ dma_addr_t dma_pages; /* DMA address of the L1 table itself */
+
+ u32 npages;
+ u32 max_pages;
+};
+
static inline struct vrdma_cq *to_vcq(struct ib_cq *ibcq)
{
return container_of(ibcq, struct vrdma_cq, ibcq);
@@ -181,6 +246,21 @@ static inline struct vrdma_qp *to_vqp(struct ib_qp *ibqp)
return container_of(ibqp, struct vrdma_qp, ibqp);
}
+static inline struct vrdma_mr *to_vmr(struct ib_mr *ibmr)
+{
+ return container_of(ibmr, struct vrdma_mr, ibmr);
+}
+
+static inline struct vrdma_ucontext *to_vucontext(struct ib_ucontext *ibucontext)
+{
+ return container_of(ibucontext, struct vrdma_ucontext, ibucontext);
+}
+
+static inline struct vrdma_ah *to_vah(struct ib_ah *ibah)
+{
+ return container_of(ibah, struct vrdma_ah, ibah);
+}
+
int vrdma_register_ib_device(struct vrdma_dev *vrdev);
void vrdma_unregister_ib_device(struct vrdma_dev *vrdev);
--
2.43.0
^ permalink raw reply related [flat|nested] 18+ messages in thread
* [PATCH 05/10] drivers/infiniband/hw/virtio: Implement memory mapping and MR scatter-gather support
2025-12-18 9:09 Implement initial driver for virtio-RDMA device(kernel) Xiong Weimin
` (3 preceding siblings ...)
2025-12-18 9:09 ` [PATCH 04/10] drivers/infiniband/hw/virtio: Implement MR, GID, ucontext and AH resource management verbs Xiong Weimin
@ 2025-12-18 9:09 ` Xiong Weimin
2025-12-18 9:09 ` [PATCH 06/10] drivers/infiniband/hw/virtio: Implement port management and QP modification verbs Xiong Weimin
` (6 subsequent siblings)
11 siblings, 0 replies; 18+ messages in thread
From: Xiong Weimin @ 2025-12-18 9:09 UTC (permalink / raw)
To: Michael S . Tsirkin, David Hildenbrand, Jason Wang,
Stefano Garzarella, Thomas Monjalon, David Marchand,
Luca Boccassi, Kevin Traynor, Christian Ehrhardt, Xuan Zhuo,
Eugenio Pérez, Xueming Li, Maxime Coquelin, Chenbo Xia,
Bruce Richardson
Cc: kvm, virtualization, netdev, xiongweimin
From: xiongweimin <xiongweimin@kylinos.cn>
This commit adds memory management capabilities to the virtio RDMA driver:
1. Port link layer identification
- Reports Ethernet as the link layer (vrdma_port_link_layer)
2. Memory Region scatter-gather mapping
- Implements two-level page table for efficient large MR handling (vrdma_set_page)
- Adds SG list to MR mapping with device notification (vrdma_map_mr_sg)
3. User-space memory mapping
- Supports mmap() for CQ/QP resources (vrdma_mmap)
- Handles vring descriptors, user buffers, and fast doorbells
- Implements mmap entry cleanup (vrdma_mmap_free)
Key features:
- Efficient 2-level page table for MRs (512 entries per level)
- Virtio command for backend MR mapping notification
- Unified mmap handling for CQ/QP with size validation
- Support for fast doorbell mapping optimization
- Comprehensive error handling in all code paths
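
As an illustration of the mmap path, a userspace provider would map the
SQ ring through the uverbs file descriptor using the offset and size
returned in the create-QP response. A minimal sketch, assuming the
struct vrdma_create_qp_uresp fields from this series (the helper itself
is hypothetical, not part of the patch):

    #include <stdint.h>
    #include <stdio.h>
    #include <sys/mman.h>

    static void *map_sq_ring(int uverbs_fd, uint64_t sq_mmap_offset,
                             size_t sq_mmap_size)
    {
            /* sq_mmap_offset comes from rdma_user_mmap_get_offset() on
             * the kernel side; vrdma_mmap() checks size and alignment. */
            void *sq = mmap(NULL, sq_mmap_size, PROT_READ | PROT_WRITE,
                            MAP_SHARED, uverbs_fd, sq_mmap_offset);

            if (sq == MAP_FAILED) {
                    perror("mmap SQ ring");
                    return NULL;
            }
            return sq;
    }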
Signed-off-by: Xiong Weimin <xiongweimin@kylinos.cn>
---
.../infiniband/hw/virtio/vrdma_dev_api.h | 13 +
.../drivers/infiniband/hw/virtio/vrdma_ib.c | 250 ++++++++++++++++++
2 files changed, 263 insertions(+)
diff --git a/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_dev_api.h b/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_dev_api.h
index da99f1f32..84dc05a96 100644
--- a/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_dev_api.h
+++ b/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_dev_api.h
@@ -200,6 +200,19 @@ struct vrdma_cmd_del_gid {
__u32 port_num;
};
+struct vrdma_cmd_map_mr_sg {
+ __u32 mrn;
+ __u32 npages;
+ __u64 start;
+ __u64 length;
+
+ __u64 pages;
+};
+
+struct vrdma_rsp_map_mr_sg {
+ __u32 npages;
+};
+
#define VRDMA_CTRL_OK 0
#define VRDMA_CTRL_ERR 1
diff --git a/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_ib.c b/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_ib.c
index b4c16ddbb..738935e3d 100644
--- a/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_ib.c
+++ b/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_ib.c
@@ -12,6 +12,7 @@
#include <rdma/ib_umem.h>
#include <rdma/ib_verbs.h>
#include <rdma/ib_addr.h>
+#include <linux/mm_types.h>
#include "vrdma.h"
#include "vrdma_dev.h"
@@ -21,6 +22,8 @@
#include "vrdma_mmap.h"
#include "vrdma_queue.h"
+#define VIRTIO_RDMA_PAGE_PER_TBL 512 /* 8-byte entries per L2 table */
+
/**
* cmd_str - String representation of virtio RDMA control commands
*
@@ -1677,6 +1680,248 @@ static int vrdma_destroy_ah(struct ib_ah *ibah, u32 flags)
return 0;
}
+static void vrdma_get_fw_ver_str(struct ib_device *device, char *str)
+{
+ snprintf(str, IB_FW_VERSION_NAME_MAX, "%d.%d.%d", 1, 0, 0);
+}
+
+static enum rdma_link_layer vrdma_port_link_layer(struct ib_device *ibdev,
+ u32 port)
+{
+ return IB_LINK_LAYER_ETHERNET;
+}
+
+/**
+ * vrdma_set_page - Callback to collect physical pages from scatterlist
+ * @ibmr: Memory region being mapped
+ * @addr: Physical address of the current page
+ *
+ * This function is called by ib_sg_to_pages() for each page in the SG list.
+ * It stores the physical address into a two-level page table:
+ * - Level 1: Array of pointers to L2 tables (512 entries each)
+ * - Level 2: Each holds up to 512 page addresses
+ *
+ * The layout allows efficient DMA mapping of large MRs without allocating one huge array.
+ *
+ * Context: Called from ib_sg_to_pages(); may sleep if GFP_KERNEL used internally.
+ * Return:
+ * * 0 on success
+ * * -ENOMEM if number of pages exceeds pre-allocated limit
+ */
+static int vrdma_set_page(struct ib_mr *ibmr, u64 addr)
+{
+ struct vrdma_mr *mr = to_vmr(ibmr);
+
+ if (mr->npages >= mr->max_pages) {
+ pr_debug("vRDMA: too many pages for MR (max=%u)\n", mr->max_pages);
+ return -ENOMEM;
+ }
+
+ /* Two-level indexing: [L1 index][L2 offset] */
+ mr->pages_k[mr->npages / VIRTIO_RDMA_PAGE_PER_TBL][mr->npages % VIRTIO_RDMA_PAGE_PER_TBL] = addr;
+ mr->npages++;
+ return 0;
+}
+
+/**
+ * vrdma_map_mr_sg - Map scatter-gather list into MR's page table and notify device
+ * @ibmr: The memory region to map
+ * @sg: Scatterlist describing user/kernel memory chunks
+ * @sg_nents: Number of entries in sg
+ * @sg_offset: Optional offset within the first sg element (passed through to ib_sg_to_pages())
+ *
+ * This function:
+ * 1. Walks the SG list via ib_sg_to_pages()
+ * 2. Populates software page table using vrdma_set_page()
+ * 3. Sends VIRTIO_RDMA_CMD_MAP_MR_SG to inform backend about IOVA range and page list
+ *
+ * Note: The actual DMA mapping was already done during ib_umem_get() or get_dma_mr().
+ * This only sets up hardware-visible metadata.
+ *
+ * Context: Can sleep (called in process context).
+ * Return:
+ * * Number of successfully mapped sg entries (>0)
+ * * Negative errno on failure
+ */
+static int vrdma_map_mr_sg(struct ib_mr *ibmr, struct scatterlist *sg,
+ int sg_nents, unsigned int *sg_offset)
+{
+ struct vrdma_dev *vdev = to_vdev(ibmr->device);
+ struct vrdma_mr *mr = to_vmr(ibmr);
+ struct vrdma_cmd_map_mr_sg *cmd;
+ struct vrdma_rsp_map_mr_sg *rsp;
+ struct scatterlist in, out;
+ int mapped, rc;
+
+ cmd = kzalloc(sizeof(*cmd), GFP_KERNEL);
+ if (!cmd)
+ return -ENOMEM;
+
+ rsp = kzalloc(sizeof(*rsp), GFP_KERNEL);
+ if (!rsp) {
+ kfree(cmd);
+ return -ENOMEM;
+ }
+
+ /* Reset page counter before traversal */
+ mr->npages = 0;
+
+ /* Use RDMA core helper to walk SG and call vrdma_set_page() per page */
+ mapped = ib_sg_to_pages(ibmr, sg, sg_nents, sg_offset, vrdma_set_page);
+ if (mapped < 0) {
+ dev_err(&vdev->vdev->dev, "Failed to map SG to pages: %d\n", mapped);
+ kfree(rsp);
+ kfree(cmd);
+ return mapped;
+ }
+
+ /* Prepare command for device notification */
+ cmd->mrn = mr->mr_handle;
+ cmd->start = ibmr->iova;
+ cmd->length = ibmr->length;
+ cmd->npages = mr->npages;
+ cmd->pages = mr->dma_pages; /* Pre-DMA-mapped array of page addrs */
+
+ sg_init_one(&in, cmd, sizeof(*cmd));
+ sg_init_one(&out, rsp, sizeof(*rsp));
+
+ /* Notify the backend about the new mapping */
+ rc = vrdma_exec_verbs_cmd(vdev, VIRTIO_RDMA_CMD_MAP_MR_SG, &in, &out);
+ if (rc) {
+ dev_err(&vdev->vdev->dev,
+ "VIRTIO_RDMA_CMD_MAP_MR_SG failed for mrn=0x%x, err=%d\n",
+ mr->mr_handle, rc);
+ rc = -EIO;
+ goto out_free;
+ }
+
+ /* Success: return number of processed sg entries */
+ kfree(rsp);
+ kfree(cmd);
+ return mapped;
+
+out_free:
+ kfree(rsp);
+ kfree(cmd);
+ return rc;
+}
+
+/**
+ * vrdma_mmap - Map device memory (vring, ubuf, doorbell) into user space
+ * @ctx: User's RDMA context
+ * @vma: VMA describing the mapping request
+ *
+ * Maps memory regions associated with QP/CQ virtqueues into user space.
+ * Supports three components:
+ * - vring descriptors (shared ring buffer)
+ * - user buffer (optional data exchange area)
+ * - fast doorbell page (if enabled)
+ *
+ * Uses PFN-based remapping for normal memory and I/O remapping for doorbells.
+ *
+ * Context: Called during mmap() in process context.
+ * Return:
+ * * 0 on success
+ * * -EINVAL for invalid parameters or layout mismatch
+ * * -EAGAIN/-EFAULT if remap fails
+ */
+int vrdma_mmap(struct ib_ucontext *ctx, struct vm_area_struct *vma)
+{
+ struct vrdma_ucontext *uctx = to_vucontext(ctx);
+ size_t requested_size = vma->vm_end - vma->vm_start;
+ struct rdma_user_mmap_entry *rdma_entry;
+ struct vrdma_user_mmap_entry *entry;
+ int rc;
+
+ /* Must be page-aligned */
+ if (vma->vm_start & (PAGE_SIZE - 1)) {
+ pr_warn("vRDMA: mmap start not page aligned: %#lx\n", vma->vm_start);
+ return -EINVAL;
+ }
+
+ /* Look up the registered mmap entry */
+ rdma_entry = rdma_user_mmap_entry_get(&uctx->ibucontext, vma);
+ if (!rdma_entry) {
+ pr_err("vRDMA: mmap lookup failed: pgoff=%lu size=%zu\n",
+ vma->vm_pgoff, requested_size);
+ return -EINVAL;
+ }
+ entry = to_ventry(rdma_entry);
+
+ switch (entry->mmap_type) {
+ case VRDMA_MMAP_CQ:
+ case VRDMA_MMAP_QP:
+ {
+ unsigned long vq_size = PAGE_ALIGN(vring_size(virtqueue_get_vring_size(entry->vq),
+ SMP_CACHE_BYTES));
+ unsigned long total_size = vq_size + entry->ubuf_size;
+
+ if (uctx->dev->fast_doorbell && entry->mmap_type == VRDMA_MMAP_QP)
+ total_size += PAGE_SIZE;
+
+ if (requested_size != total_size) {
+ WARN(1, "mmap size mismatch: got=%zu, expected=%lu\n",
+ requested_size, total_size);
+ rc = -EINVAL;
+ goto out_put;
+ }
+
+ /* Map vring descriptor table */
+ rc = remap_pfn_range(vma, vma->vm_start,
+ page_to_pfn(virt_to_page(virtqueue_get_vring(entry->vq)->desc)),
+ vq_size, vma->vm_page_prot);
+ if (rc) {
+ pr_warn("vRDMA: remap vring failed: %d\n", rc);
+ goto out_put;
+ }
+
+ /* Map user buffer (shared data region) */
+ rc = remap_pfn_range(vma, vma->vm_start + vq_size,
+ page_to_pfn(virt_to_page(entry->user_buf)),
+ entry->ubuf_size, vma->vm_page_prot);
+ if (rc) {
+ pr_warn("vRDMA: remap ubuf failed: %d\n", rc);
+ goto out_put;
+ }
+
+ /* Optionally map fast doorbell register (QP only) */
+ if (uctx->dev->fast_doorbell && entry->mmap_type == VRDMA_MMAP_QP) {
+ unsigned long db_addr = vma->vm_start + vq_size + entry->ubuf_size;
+ struct virtqueue *vq = entry->vq;
+
+ rc = io_remap_pfn_range(vma, db_addr,
+ vmalloc_to_pfn(vq->priv),
+ PAGE_SIZE, vma->vm_page_prot);
+ if (rc) {
+ pr_warn("vRDMA: remap doorbell failed: %d\n", rc);
+ goto out_put;
+ }
+ }
+
+ break;
+ }
+ default:
+ pr_err("vRDMA: invalid mmap type %d\n", entry->mmap_type);
+ rc = -EINVAL;
+ goto out_put;
+ }
+
+ /* Success */
+ rdma_user_mmap_entry_put(rdma_entry);
+ return 0;
+
+out_put:
+ rdma_user_mmap_entry_put(rdma_entry);
+ return rc;
+}
+
+void vrdma_mmap_free(struct rdma_user_mmap_entry *rdma_entry)
+{
+ struct vrdma_user_mmap_entry *entry = to_ventry(rdma_entry);
+
+ kfree(entry);
+}
+
static const struct ib_device_ops vrdma_dev_ops = {
.owner = THIS_MODULE,
.uverbs_abi_ver = VIRTIO_RDMA_ABI_VERSION,
@@ -1701,6 +1946,11 @@ static const struct ib_device_ops vrdma_dev_ops = {
.dealloc_ucontext = vrdma_dealloc_ucontext,
.create_ah = vrdma_create_ah,
.destroy_ah = vrdma_destroy_ah,
+ .get_dev_fw_str = vrdma_get_fw_ver_str,
+ .get_link_layer = vrdma_port_link_layer,
+ .map_mr_sg = vrdma_map_mr_sg,
+ .mmap = vrdma_mmap,
+ .mmap_free = vrdma_mmap_free,
};
/**
--
2.43.0
* [PATCH 06/10] drivers/infiniband/hw/virtio: Implement port management and QP modification verbs
2025-12-18 9:09 Implement initial driver for virtio-RDMA device(kernel) Xiong Weimin
` (4 preceding siblings ...)
2025-12-18 9:09 ` [PATCH 05/10] drivers/infiniband/hw/virtio: Implement memory mapping and MR scatter-gather support Xiong Weimin
@ 2025-12-18 9:09 ` Xiong Weimin
2025-12-18 9:09 ` [PATCH 07/10] drivers/infiniband/hw/virtio: Implement Completion Queue (CQ) polling support Xiong Weimin
` (5 subsequent siblings)
11 siblings, 0 replies; 18+ messages in thread
From: Xiong Weimin @ 2025-12-18 9:09 UTC (permalink / raw)
To: Michael S . Tsirkin, David Hildenbrand, Jason Wang,
Stefano Garzarella, Thomas Monjalon, David Marchand,
Luca Boccassi, Kevin Traynor, Christian Ehrhardt, Xuan Zhuo,
Eugenio Pérez, Xueming Li, Maxime Coquelin, Chenbo Xia,
Bruce Richardson
Cc: kvm, virtualization, netdev, xiongweimin
From: xiongweimin <xiongweimin@kylinos.cn>
This commit adds essential RDMA verb implementations for the virtio RDMA driver:
1. Port modification support (vrdma_modify_port):
- Adds IB_PORT_SHUTDOWN flag handling for port deactivation
- Maintains port capability mask state
- Enforces strict attribute mask validation
- Provides proper locking with port_mutex
2. Queue Pair modification support (vrdma_modify_qp):
- Implements full QP attribute translation to virtio commands
- Handles all standard IB_QP_* attribute masks (21 bits)
- Uses efficient two-buffer scheme for device communication
- Includes comprehensive error handling
Key features:
- Minimal port modification support focused on shutdown capability
- Complete QP state transition handling
- Attribute-by-attribute translation covering every vrdma_qp_attr field
- Safe memory management with guaranteed cleanup
- Verbose error logging for debugging
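To make the mask-driven translation concrete, here is a minimal sketch of the caller side: standard ib_modify_qp() usage driving an RC QP to INIT. This is illustrative usage, not code from the series; `qp` is assumed to have been created earlier.

    struct ib_qp_attr attr = {};
    int ret;

    attr.qp_state        = IB_QPS_INIT;
    attr.port_num        = 1;
    attr.pkey_index      = 0;
    attr.qp_access_flags = IB_ACCESS_REMOTE_READ |
                           IB_ACCESS_REMOTE_WRITE;

    /* Only the masked attributes are translated and sent to the host */
    ret = ib_modify_qp(qp, &attr,
                       IB_QP_STATE | IB_QP_PORT |
                       IB_QP_PKEY_INDEX | IB_QP_ACCESS_FLAGS);

Each bit in the mask selects one of the conditional copies in vrdma_modify_qp() below; unmasked attributes are left zeroed in the command.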
Signed-off-by: Xiong Weimin <xiongweimin@kylinos.cn>
---
.../infiniband/hw/virtio/vrdma_dev_api.h | 12 +
.../drivers/infiniband/hw/virtio/vrdma_ib.c | 223 +++++++++++++++++-
.../drivers/infiniband/hw/virtio/vrdma_ib.h | 54 +++++
3 files changed, 288 insertions(+), 1 deletion(-)
diff --git a/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_dev_api.h b/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_dev_api.h
index 84dc05a96..d0ce02601 100644
--- a/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_dev_api.h
+++ b/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_dev_api.h
@@ -14,6 +14,8 @@
#include <rdma/vrdma_abi.h>
+#include "vrdma_ib.h"
+
/**
* struct vrdma_config - Virtio RDMA device configuration
*
@@ -213,6 +215,16 @@ struct vrdma_rsp_map_mr_sg {
__u32 npages;
};
+struct vrdma_cmd_modify_qp {
+ __u32 qpn;
+ __u32 attr_mask;
+ struct vrdma_qp_attr attrs;
+};
+
+struct vrdma_rsp_modify_qp {
+ __u32 qpn;
+};
+
#define VRDMA_CTRL_OK 0
#define VRDMA_CTRL_ERR 1
diff --git a/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_ib.c b/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_ib.c
index 738935e3d..2d9a612f3 100644
--- a/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_ib.c
+++ b/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_ib.c
@@ -55,6 +55,37 @@ static const char * const cmd_str[] = {
[VIRTIO_RDMA_CMD_REQ_NOTIFY_CQ] = "REQ_NOTIFY_CQ",
};
+static void ib_qp_cap_to_vrdma(struct vrdma_qp_cap *dst, const struct ib_qp_cap *src)
+{
+ dst->max_send_wr = src->max_send_wr;
+ dst->max_recv_wr = src->max_recv_wr;
+ dst->max_send_sge = src->max_send_sge;
+ dst->max_recv_sge = src->max_recv_sge;
+ dst->max_inline_data = src->max_inline_data;
+}
+
+static void ib_global_route_to_vrdma(struct vrdma_global_route *dst,
+ const struct ib_global_route *src)
+{
+ dst->dgid = src->dgid;
+ dst->flow_label = src->flow_label;
+ dst->sgid_index = src->sgid_index;
+ dst->hop_limit = src->hop_limit;
+ dst->traffic_class = src->traffic_class;
+}
+
+static void rdma_ah_attr_to_vrdma(struct vrdma_ah_attr *dst,
+ const struct rdma_ah_attr *src)
+{
+ ib_global_route_to_vrdma(&dst->grh, rdma_ah_read_grh(src));
+ dst->sl = rdma_ah_get_sl(src);
+ dst->static_rate = rdma_ah_get_static_rate(src);
+ dst->port_num = rdma_ah_get_port_num(src);
+ dst->ah_flags = rdma_ah_get_ah_flags(src);
+ memcpy(&dst->roce, &src->roce, sizeof(struct roce_ah_attr));
+}
+
/**
* vrdma_exec_verbs_cmd - Execute a verbs command via control virtqueue
* @vrdev: VRDMA device
@@ -1922,6 +1953,194 @@ void vrdma_mmap_free(struct rdma_user_mmap_entry *rdma_entry)
kfree(entry);
}
+/**
+ * vrdma_modify_port - Modify port attributes (limited support)
+ * @ibdev: Verbs device
+ * @port: Port number (1-indexed)
+ * @mask: Bitmask of attributes to modify
+ * @props: New port properties
+ *
+ * Currently only supports IB_PORT_SHUTDOWN flag.
+ * Other flags are rejected with -EOPNOTSUPP.
+ *
+ * Context: Can sleep (holds mutex).
+ * Return:
+ * * 0 on success
+ * * -EOPNOTSUPP if unsupported mask bits set
+ * * error code from ib_query_port() on failure
+ */
+static int vrdma_modify_port(struct ib_device *ibdev, u32 port, int mask,
+ struct ib_port_modify *props)
+{
+ struct vrdma_dev *vdev = to_vdev(ibdev);
+ struct ib_port_attr attr;
+ int ret;
+
+ /* Only allow IB_PORT_SHUTDOWN; reject all others */
+ if (mask & ~IB_PORT_SHUTDOWN) {
+ pr_warn("vRDMA: unsupported port modify mask %#x\n", mask);
+ return -EOPNOTSUPP;
+ }
+
+ mutex_lock(&vdev->port_mutex);
+
+ /* Query current port state (required by spec before modify in some cases) */
+ ret = ib_query_port(ibdev, port, &attr);
+ if (ret) {
+ pr_err("vRDMA: failed to query port %u: %d\n", port, ret);
+ goto out_unlock;
+ }
+
+ /* Apply capability mask changes */
+ vdev->port_cap_mask |= props->set_port_cap_mask;
+ vdev->port_cap_mask &= ~props->clr_port_cap_mask;
+
+ /* Handle shutdown request */
+ if (mask & IB_PORT_SHUTDOWN) {
+ vdev->ib_active = false;
+ pr_info("vRDMA: port %u marked as inactive\n", port);
+ }
+
+ ret = 0; /* Success */
+
+out_unlock:
+ mutex_unlock(&vdev->port_mutex);
+ return ret;
+}
+
+/**
+ * vrdma_modify_qp - Modify QP attributes via backend
+ * @ibqp: Queue pair to modify
+ * @attr: New QP attributes
+ * @attr_mask: Which fields in @attr are valid
+ * @udata: User data (unused here)
+ *
+ * Sends a VIRTIO_RDMA_CMD_MODIFY_QP command to the host backend
+ * to update the QP's state and parameters.
+ *
+ * Context: Process context (may sleep due to memory allocation).
+ * Return:
+ * * 0 on success
+ * * -ENOMEM if command buffer allocation fails
+ * * -EIO or other negative errno on communication failure
+ */
+static int vrdma_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr,
+ int attr_mask, struct ib_udata *udata)
+{
+ struct vrdma_dev *vdev = to_vdev(ibqp->device);
+ struct vrdma_cmd_modify_qp *cmd;
+ struct vrdma_rsp_modify_qp *rsp;
+ struct scatterlist in, out;
+ int rc;
+
+ /* Allocate command and response buffers */
+ cmd = kzalloc(sizeof(*cmd), GFP_KERNEL);
+ if (!cmd)
+ return -ENOMEM;
+
+ rsp = kzalloc(sizeof(*rsp), GFP_KERNEL);
+ if (!rsp) {
+ kfree(cmd);
+ return -ENOMEM;
+ }
+
+ /* Fill command header */
+ cmd->qpn = to_vqp(ibqp)->qp_handle;
+ cmd->attr_mask = attr_mask & ((1U << 21) - 1); /* Limit to 21 bits */
+
+ /* Conditionally copy fields based on attr_mask */
+ if (attr_mask & IB_QP_STATE)
+ cmd->attrs.qp_state = attr->qp_state;
+
+ if (attr_mask & IB_QP_CUR_STATE)
+ cmd->attrs.cur_qp_state = attr->cur_qp_state;
+
+ if (attr_mask & IB_QP_EN_SQD_ASYNC_NOTIFY)
+ cmd->attrs.en_sqd_async_notify = attr->en_sqd_async_notify;
+
+ if (attr_mask & IB_QP_ACCESS_FLAGS)
+ cmd->attrs.qp_access_flags = attr->qp_access_flags;
+
+ if (attr_mask & IB_QP_PKEY_INDEX)
+ cmd->attrs.pkey_index = attr->pkey_index;
+
+ if (attr_mask & IB_QP_PORT)
+ cmd->attrs.port_num = attr->port_num;
+
+ if (attr_mask & IB_QP_QKEY)
+ cmd->attrs.qkey = attr->qkey;
+
+ if (attr_mask & IB_QP_AV)
+ rdma_ah_attr_to_vrdma(&cmd->attrs.ah_attr, &attr->ah_attr);
+
+ if (attr_mask & IB_QP_ALT_PATH)
+ rdma_ah_attr_to_vrdma(&cmd->attrs.alt_ah_attr, &attr->alt_ah_attr);
+
+ if (attr_mask & IB_QP_PATH_MTU)
+ cmd->attrs.path_mtu = attr->path_mtu;
+
+ if (attr_mask & IB_QP_TIMEOUT)
+ cmd->attrs.timeout = attr->timeout;
+
+ if (attr_mask & IB_QP_RETRY_CNT)
+ cmd->attrs.retry_cnt = attr->retry_cnt;
+
+ if (attr_mask & IB_QP_RNR_RETRY)
+ cmd->attrs.rnr_retry = attr->rnr_retry;
+
+ if (attr_mask & IB_QP_MIN_RNR_TIMER)
+ cmd->attrs.min_rnr_timer = attr->min_rnr_timer;
+
+ if (attr_mask & IB_QP_MAX_QP_RD_ATOMIC)
+ cmd->attrs.max_rd_atomic = attr->max_rd_atomic;
+
+ if (attr_mask & IB_QP_MAX_DEST_RD_ATOMIC)
+ cmd->attrs.max_dest_rd_atomic = attr->max_dest_rd_atomic;
+
+ if (attr_mask & IB_QP_PATH_MIG_STATE)
+ cmd->attrs.path_mig_state = attr->path_mig_state;
+
+ if (attr_mask & IB_QP_CAP)
+ ib_qp_cap_to_vrdma(&cmd->attrs.cap, &attr->cap);
+
+ if (attr_mask & IB_QP_DEST_QPN)
+ cmd->attrs.dest_qp_num = attr->dest_qp_num;
+
+ if (attr_mask & IB_QP_RQ_PSN)
+ cmd->attrs.rq_psn = attr->rq_psn;
+
+ if (attr_mask & IB_QP_SQ_PSN)
+ cmd->attrs.sq_psn = attr->sq_psn;
+
+ if (attr_mask & IB_QP_ALT_PATH) {
+ cmd->attrs.alt_pkey_index = attr->alt_pkey_index;
+ cmd->attrs.alt_port_num = attr->alt_port_num;
+ cmd->attrs.alt_timeout = attr->alt_timeout;
+ }
+
+ if (attr_mask & IB_QP_RATE_LIMIT)
+ cmd->attrs.rate_limit = attr->rate_limit;
+
+ /* Prepare scatterlists for virtqueue I/O */
+ sg_init_one(&in, cmd, sizeof(*cmd));
+ sg_init_one(&out, rsp, sizeof(*rsp));
+
+ /* Send command to backend */
+ rc = vrdma_exec_verbs_cmd(vdev, VIRTIO_RDMA_CMD_MODIFY_QP, &in, &out);
+ if (rc) {
+ dev_err(&vdev->vdev->dev,
+ "VIRTIO_RDMA_CMD_MODIFY_QP failed: qpn=0x%x, err=%d\n",
+ cmd->qpn, rc);
+ goto out_free;
+ }
+
+ /*
+ * Optionally update local QP state from the response here,
+ * e.g. to_vqp(ibqp)->state = rsp->new_state;
+ */
+
+out_free:
+ kfree(rsp);
+ kfree(cmd);
+ return rc;
+}
+
static const struct ib_device_ops vrdma_dev_ops = {
.owner = THIS_MODULE,
.uverbs_abi_ver = VIRTIO_RDMA_ABI_VERSION,
@@ -1950,7 +2169,9 @@ static const struct ib_device_ops vrdma_dev_ops = {
.get_link_layer = vrdma_port_link_layer,
.map_mr_sg = vrdma_map_mr_sg,
.mmap = vrdma_mmap,
- .mmap_free = vrdma_mmap_free,
+ .mmap_free = vrdma_mmap_free,
+ .modify_port = vrdma_modify_port,
+ .modify_qp = vrdma_modify_qp,
};
/**
diff --git a/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_ib.h b/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_ib.h
index 6759c4349..eaff37c3c 100644
--- a/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_ib.h
+++ b/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_ib.h
@@ -173,6 +173,60 @@ struct vrdma_qp {
struct vrdma_user_mmap_entry *rq_entry; /* Mmap entry for RQ buffer */
};
+struct vrdma_global_route {
+ union ib_gid dgid;
+ uint32_t flow_label;
+ uint8_t sgid_index;
+ uint8_t hop_limit;
+ uint8_t traffic_class;
+};
+
+struct vrdma_ah_attr {
+ struct vrdma_global_route grh;
+ uint8_t sl;
+ uint8_t static_rate;
+ uint8_t port_num;
+ uint8_t ah_flags;
+ struct roce_ah_attr roce;
+};
+
+struct vrdma_qp_cap {
+ uint32_t max_send_wr;
+ uint32_t max_recv_wr;
+ uint32_t max_send_sge;
+ uint32_t max_recv_sge;
+ uint32_t max_inline_data;
+};
+
+struct vrdma_qp_attr {
+ enum ib_qp_state qp_state;
+ enum ib_qp_state cur_qp_state;
+ enum ib_mtu path_mtu;
+ enum ib_mig_state path_mig_state;
+ uint32_t qkey;
+ uint32_t rq_psn;
+ uint32_t sq_psn;
+ uint32_t dest_qp_num;
+ uint32_t qp_access_flags;
+ uint16_t pkey_index;
+ uint16_t alt_pkey_index;
+ uint8_t en_sqd_async_notify;
+ uint8_t sq_draining;
+ uint8_t max_rd_atomic;
+ uint8_t max_dest_rd_atomic;
+ uint8_t min_rnr_timer;
+ uint8_t port_num;
+ uint8_t timeout;
+ uint8_t retry_cnt;
+ uint8_t rnr_retry;
+ uint8_t alt_port_num;
+ uint8_t alt_timeout;
+ uint32_t rate_limit;
+ struct vrdma_qp_cap cap;
+ struct vrdma_ah_attr ah_attr;
+ struct vrdma_ah_attr alt_ah_attr;
+};
+
/**
* struct vrdma_mr - Software state of a Virtio-RDMA Memory Region (MR)
* @ibmr: InfiniBand core MR object (contains rkey, lkey, etc.)
--
2.43.0
* [PATCH 07/10] drivers/infiniband/hw/virtio: Implement Completion Queue (CQ) polling support
2025-12-18 9:09 Implement initial driver for virtio-RDMA device(kernel) Xiong Weimin
` (5 preceding siblings ...)
2025-12-18 9:09 ` [PATCH 06/10] drivers/infiniband/hw/virtio: Implement port management and QP modification verbs Xiong Weimin
@ 2025-12-18 9:09 ` Xiong Weimin
2025-12-18 9:09 ` [PATCH 08/10] drivers/infiniband/hw/virtio: Implement send/receive verb support Xiong Weimin
` (4 subsequent siblings)
11 siblings, 0 replies; 18+ messages in thread
From: Xiong Weimin @ 2025-12-18 9:09 UTC (permalink / raw)
To: Michael S . Tsirkin, David Hildenbrand, Jason Wang,
Stefano Garzarella, Thomas Monjalon, David Marchand,
Luca Boccassi, Kevin Traynor, Christian Ehrhardt, Xuan Zhuo,
Eugenio Pérez, Xueming Li, Maxime Coquelin, Chenbo Xia,
Bruce Richardson
Cc: kvm, virtualization, netdev, xiongweimin
From: xiongweimin <xiongweimin@kylinos.cn>
This commit adds CQ polling support to the virtio RDMA driver:
1. Completion queue processing:
- Retrieves CQEs from virtqueue and converts to ib_wc
- Implements buffer recycling to avoid memory leaks
- Handles all standard WC fields including imm_data and flags
2. Key optimizations:
- IRQ-safe locking for virtqueue operations
- Batch processing with virtqueue_kick optimization
- Atomic buffer re-addition to avoid allocation overhead
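For reference, a typical kernel consumer drains such a CQ in batches through the standard API; a minimal sketch, where `cq` is assumed to exist and handle_wc() is a hypothetical callback:

    struct ib_wc wc[16];
    int i, n;

    /* Repeatedly fetch batches until the CQ is empty */
    while ((n = ib_poll_cq(cq, ARRAY_SIZE(wc), wc)) > 0) {
        for (i = 0; i < n; i++) {
            if (wc[i].status != IB_WC_SUCCESS)
                pr_warn("wr %llu failed with status %d\n",
                        wc[i].wr_id, wc[i].status);
            else
                handle_wc(&wc[i]); /* hypothetical consumer */
        }
    }

Each ib_poll_cq() call lands in vrdma_poll_cq() below, which dequeues CQEs from the virtqueue and recycles their buffers in place.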
Signed-off-by: Xiong Weimin <xiongweimin@kylinos.cn>
---
.../drivers/infiniband/hw/virtio/vrdma_ib.c | 77 ++++++++++++++++++-
1 file changed, 76 insertions(+), 1 deletion(-)
diff --git a/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_ib.c b/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_ib.c
index 2d9a612f3..705d18b55 100644
--- a/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_ib.c
+++ b/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_ib.c
@@ -2141,6 +2141,80 @@ static int vrdma_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr,
return rc;
}
+/**
+ * vrdma_poll_cq - Poll a completion queue for work completions
+ * @ibcq: Completion queue to poll
+ * @num_entries: Maximum number of entries to return
+ * @wc: User-provided array of work completions
+ *
+ * Retrieves completed CQEs from the virtqueue and fills ib_wc structures.
+ * Each consumed CQE buffer is returned to the backend via inbuf.
+ *
+ * Context: Any context; runs under vcq->lock with interrupts disabled, so it must not sleep (buffer refill uses GFP_ATOMIC).
+ * Return:
+ * * Number of completed WCs filled (>= 0)
+ * * Does not return negative values (per IB spec)
+ */
+static int vrdma_poll_cq(struct ib_cq *ibcq, int num_entries, struct ib_wc *wc)
+{
+ struct vrdma_cq *vcq = to_vcq(ibcq);
+ struct vrdma_cqe *cqe;
+ unsigned long flags;
+ unsigned int len;
+ int i = 0;
+ struct scatterlist sg;
+
+ spin_lock_irqsave(&vcq->lock, flags);
+
+ while (i < num_entries) {
+ /* Dequeue one CQE from used ring */
+ cqe = virtqueue_get_buf(vcq->vq->vq, &len);
+ if (!cqe) {
+ break; /* No more completions available */
+ }
+
+ /* Copy CQE fields into ib_wc */
+ wc[i].wr_id = cqe->wr_id;
+ wc[i].status = cqe->status;
+ wc[i].opcode = cqe->opcode;
+ wc[i].vendor_err = cqe->vendor_err;
+ wc[i].byte_len = cqe->byte_len;
+
+ /* TODO: Set wc[i].qp - requires storing QP pointer at send time */
+ // wc[i].qp = container_of(...);
+
+ wc[i].ex.imm_data = cqe->ex.imm_data;
+ wc[i].src_qp = cqe->src_qp;
+ wc[i].slid = cqe->slid;
+ wc[i].wc_flags = cqe->wc_flags;
+ wc[i].pkey_index = cqe->pkey_index;
+ wc[i].sl = cqe->sl;
+ wc[i].dlid_path_bits = cqe->dlid_path_bits;
+ wc[i].port_num = cqe->port_num;
+
+ /* Re-add the CQE buffer to the available list for reuse */
+ sg_init_one(&sg, cqe, sizeof(*cqe));
+ if (virtqueue_add_inbuf(vcq->vq->vq, &sg, 1, cqe, GFP_ATOMIC) != 0) {
+ dev_warn(&vcq->vq->vq->vdev->dev,
+ "Failed to re-add CQE buffer to vq %p\n", vcq->vq->vq);
+ /* The CQE buffer is leaked in this case; warn rather than crash */
+ }
+
+ i++;
+ }
+
+ /*
+ * Kick the virtqueue if needed so host can see returned buffers.
+ * This ensures backend knows which CQE slots are free.
+ */
+ if (i > 0)
+ virtqueue_kick(vcq->vq->vq);
+
+ spin_unlock_irqrestore(&vcq->lock, flags);
+
+ return i; /* Return number of polled completions */
+}
+
static const struct ib_device_ops vrdma_dev_ops = {
.owner = THIS_MODULE,
.uverbs_abi_ver = VIRTIO_RDMA_ABI_VERSION,
@@ -2171,7 +2245,8 @@ static const struct ib_device_ops vrdma_dev_ops = {
.mmap = vrdma_mmap,
.mmap_free = vrdma_mmap_free,
.modify_port = vrdma_modify_port,
- .modify_qp = vrdma_modify_qp,
+ .modify_qp = vrdma_modify_qp,
+ .poll_cq = vrdma_poll_cq,
};
/**
--
2.43.0
* [PATCH 08/10] drivers/infiniband/hw/virtio: Implement send/receive verb support
2025-12-18 9:09 Implement initial driver for virtio-RDMA device(kernel) Xiong Weimin
` (6 preceding siblings ...)
2025-12-18 9:09 ` [PATCH 07/10] drivers/infiniband/hw/virtio: Implement Completion Queue (CQ) polling support Xiong Weimin
@ 2025-12-18 9:09 ` Xiong Weimin
2025-12-18 9:09 ` [PATCH 09/10] drivers/infiniband/hw/virtio: Implement P_key, QP query and user MR resource management verbs Xiong Weimin
` (3 subsequent siblings)
11 siblings, 0 replies; 18+ messages in thread
From: Xiong Weimin @ 2025-12-18 9:09 UTC (permalink / raw)
To: Michael S . Tsirkin, David Hildenbrand, Jason Wang,
Stefano Garzarella, Thomas Monjalon, David Marchand,
Luca Boccassi, Kevin Traynor, Christian Ehrhardt, Xuan Zhuo,
Eugenio Pérez, Xueming Li, Maxime Coquelin, Chenbo Xia,
Bruce Richardson
Cc: kvm, virtualization, netdev, xiongweimin
From: xiongweimin <xiongweimin@kylinos.cn>
This commit adds core RDMA verb implementations for the virtio RDMA driver:
1. Post Receive Support:
- Full handling of recv_wr chains with SGE conversion
- SMI QP rejection and user-space QP fast path
- Atomic buffer allocation with GFP_ATOMIC
2. Post Send Support:
- Comprehensive opcode support including RDMA/Atomic/UD
- Inline data handling via contiguous copy
- Detailed error handling with bad_wr tracking
- Memory registration support integration
Key features:
- Support for ten common IB_WR opcodes (SEND, RDMA, atomic, REG_MR and invalidate variants)
- Specialized handling for UD/RC/GSI QP types
- Kernel-space WR processing with virtio command conversion
- Virtqueue batching optimizations
- Strict concurrency control with QP locks
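As a usage reference, the kernel-side path added here is exercised through the standard verbs API; a minimal RDMA WRITE sketch, where `qp`, `mr`, `dma_addr`, `len`, `remote_addr` and `remote_rkey` are assumed to have been set up beforehand (illustrative only):

    struct ib_rdma_wr wr = {};
    struct ib_sge sge;
    const struct ib_send_wr *bad_wr;
    int ret;

    sge.addr   = dma_addr;   /* assumed DMA-mapped local buffer */
    sge.length = len;
    sge.lkey   = mr->lkey;

    wr.wr.opcode     = IB_WR_RDMA_WRITE;
    wr.wr.send_flags = IB_SEND_SIGNALED;
    wr.wr.sg_list    = &sge;
    wr.wr.num_sge    = 1;
    wr.remote_addr   = remote_addr;   /* peer-advertised address */
    wr.rkey          = remote_rkey;   /* peer-advertised rkey */

    ret = ib_post_send(qp, &wr.wr, &bad_wr);

vrdma_post_send() below converts each such WR into a vrdma_cmd_post_send and queues it on the send virtqueue.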
Signed-off-by: Xiong Weimin <xiongweimin@kylinos.cn>
---
.../drivers/infiniband/hw/virtio/vrdma_abi.h | 99 ++++--
.../drivers/infiniband/hw/virtio/vrdma_ib.c | 310 +++++++++++++++++-
2 files changed, 372 insertions(+), 37 deletions(-)
diff --git a/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_abi.h b/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_abi.h
index 7cdc4e488..0a9404057 100644
--- a/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_abi.h
+++ b/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_abi.h
@@ -222,6 +222,19 @@ struct vrdma_av {
__u8 reserved[6]; /* Reserved for future use / alignment padding */
};
+struct vrdma_sge {
+ __u64 addr;
+ __u32 length;
+ __u32 lkey;
+};
+
+struct vrdma_cmd_post_recv {
+ __u32 qpn;
+ __u32 num_sge;
+ __u64 wr_id;
+ /* SGEs follow the header in the same buffer */
+ struct vrdma_sge sge_list[];
+};
+
/**
* struct vrdma_cmd_post_send - User-space command to post a Send WQE
*
@@ -232,48 +245,62 @@ struct vrdma_av {
* All fields use fixed-size types for ABI stability across architectures.
*/
struct vrdma_cmd_post_send {
- __u32 num_sge; /* Number of scatter-gather elements in this WQE */
-
- __u32 send_flags; /* IBV_SEND_xxx flags (e.g., signaled, inline, fence) */
- __u32 opcode; /* Operation code: RDMA_WRITE, SEND, ATOMIC, etc. */
- __u64 wr_id; /* Work Request ID returned in CQE */
-
union {
- __be32 imm_data; /* Immediate data for RC/UC QPs */
- __u32 invalidate_rkey; /* rkey to invalidate (on SEND_WITH_INV) */
- } ex;
-
- union wr_data {
+ /* Length of sg_list */
+ __le32 num_sge;
+ /* Length of inline data */
+ __le16 inline_len;
+ };
+#define VIRTIO_IB_SEND_FENCE (1 << 0)
+#define VIRTIO_IB_SEND_SIGNALED (1 << 1)
+#define VIRTIO_IB_SEND_SOLICITED (1 << 2)
+#define VIRTIO_IB_SEND_INLINE (1 << 3)
+ /* Flags of the WR properties (VIRTIO_IB_SEND_*) */
+ __u8 send_flags;
+ /* WR opcode, enum virtio_ib_wr_opcode */
+ __u32 opcode;
+ /* User defined WR ID */
+ __le64 wr_id;
+ union {
+ __le32 imm_data;
+ __u32 invalidate_rkey;
+ } ex;
+ union {
struct {
- __u64 remote_addr; /* Target virtual address for RDMA op */
- __u32 rkey; /* Remote key for memory access */
+ /* Start address of remote memory buffer */
+ __le64 remote_addr;
+ /* Key of the remote MR */
+ __le32 rkey;
} rdma;
-
- struct {
- __u64 remote_addr; /* Address of atomic variable */
- __u64 compare_add; /* Value to compare */
- __u64 swap; /* Value to swap (or add) */
- __u32 rkey; /* Remote memory key */
- } atomic;
-
+ struct {
+ __u64 remote_addr;
+ __u64 compare_add;
+ __u64 swap;
+ __u32 rkey;
+ } atomic;
struct {
- __u32 remote_qpn; /* Destination QP number */
- __u32 remote_qkey; /* Q_Key for UD packet validation */
- struct vrdma_av av; /* Address vector (L2/L3 info) */
+ /* Index of the destination QP */
+ __le32 remote_qpn;
+ /* Q_Key of the destination QP */
+ __le32 remote_qkey;
+ struct vrdma_av av;
} ud;
-
- struct {
- __u32 mrn; /* Memory Region Number (MR handle) */
- __u32 key; /* Staging rkey for MR registration */
- __u32 access; /* Access flags (IB_ACCESS_xxx) */
- } reg;
+ struct {
+ __u32 mrn;
+ __u32 key;
+ int access;
+ } reg;
+ /* Reserved for future */
+ __le64 reserved[4];
} wr;
-};
-
-struct vrdma_sge {
- __u64 addr;
- __u32 length;
- __u32 lkey;
+ /* Inline data, when VIRTIO_IB_SEND_INLINE is set, is carried in
+ * the sg_list area immediately following this header */
+ /* Reserved for future */
+ __le32 reserved2[3];
+ /* Scatter/gather list */
+ struct vrdma_sge sg_list[];
};
#endif
diff --git a/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_ib.c b/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_ib.c
index 705d18b55..f9b129774 100644
--- a/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_ib.c
+++ b/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_ib.c
@@ -2215,6 +2215,312 @@ static int vrdma_poll_cq(struct ib_cq *ibcq, int num_entries, struct ib_wc *wc)
return i; /* Return number of polled completions */
}
+/**
+ * vrdma_post_recv - Post a list of receive work requests
+ * @ibqp: Queue pair
+ * @wr: List of receive work requests
+ * @bad_wr: Out parameter pointing to first failed WR on error
+ *
+ * Submits receive buffers to the backend via virtqueue.
+ * Each WR is serialized into a command structure and passed to the host.
+ *
+ * Context: Process context (may be called from atomic context in rare cases).
+ * Return:
+ * * 0 on success
+ * * negative errno on failure (e.g., -ENOMEM, -EOPNOTSUPP)
+ */
+static int vrdma_post_recv(struct ib_qp *ibqp,
+ const struct ib_recv_wr *wr,
+ const struct ib_recv_wr **bad_wr)
+{
+ struct vrdma_qp *vqp = to_vqp(ibqp);
+ struct vrdma_cmd_post_recv *cmd;
+ unsigned int sgl_size;
+ int rc = 0;
+ struct scatterlist hdr;
+ struct scatterlist *sgs[1];
+ unsigned long flags;
+
+ /* SMI QPs are not supported */
+ if (ibqp->qp_type == IB_QPT_SMI) {
+ *bad_wr = wr;
+ return -EOPNOTSUPP;
+ }
+
+ /*
+ * For user-space QPs, we assume recv posting is handled differently
+ * (e.g., through mmap'ed rings). Skip kernel-side posting.
+ */
+ if (vqp->type == VIRTIO_RDMA_TYPE_USER)
+ goto kick_and_return;
+
+ /* Serialize access to RQ */
+ spin_lock_irqsave(&vqp->rq->lock, flags);
+
+ while (wr) {
+ /* Validate required fields */
+ if (unlikely(!wr->num_sge)) {
+ rc = -EINVAL;
+ goto out_bad_wr;
+ }
+
+ /* Calculate size of SGE array to copy */
+ sgl_size = sizeof(struct vrdma_sge) * wr->num_sge;
+ cmd = kzalloc(sizeof(*cmd) + sgl_size, GFP_ATOMIC);
+ if (!cmd) {
+ rc = -ENOMEM;
+ goto out_bad_wr;
+ }
+
+
+ /* Fill command */
+ cmd->qpn = vqp->qp_handle;
+ cmd->wr_id = (ibqp->qp_type == IB_QPT_GSI) ? 0 : wr->wr_id;
+ cmd->num_sge = wr->num_sge;
+
+ /* Copy SGEs from user WR into command buffer */
+ memcpy(cmd->sge_list, wr->sg_list, sgl_size);
+
+ /* Prepare scatterlist for virtqueue */
+ sg_init_one(&hdr, cmd, sizeof(*cmd) + sgl_size);
+ sgs[0] = &hdr;
+
+ /* Add to virtqueue */
+ rc = virtqueue_add_sgs(vqp->rq->vq, sgs, 1, 0, cmd, GFP_ATOMIC);
+ if (rc) {
+ kfree(cmd);
+ goto out_bad_wr;
+ }
+
+ wr = wr->next;
+ }
+
+ spin_unlock_irqrestore(&vqp->rq->lock, flags);
+
+kick_and_return:
+ virtqueue_kick(vqp->rq->vq);
+ return 0;
+
+out_bad_wr:
+ *bad_wr = wr;
+ spin_unlock_irqrestore(&vqp->rq->lock, flags);
+ virtqueue_kick(vqp->rq->vq); /* Still kick so backend knows partial update */
+ return rc;
+}
+
+/**
+ * copy_inline_data_to_wqe - Copy inline data from SGEs into WQE buffer
+ * @wqe: Pointer to the vrdma_cmd_post_send command structure
+ * @ibwr: IB send work request containing SGEs with inline data
+ *
+ * Copies all data referenced by SGEs into a contiguous area immediately
+ * following the WQE header, typically used when IB_SEND_INLINE is set.
+ *
+ * Assumes:
+ * - Memory at sge->addr is accessible (kernel virtual address)
+ * - Total size <= device max_inline_data
+ * - wqe has enough tailroom for all data
+ *
+ * Context: Called under the SQ spinlock; performs no allocation and must not sleep.
+ */
+static void vrdma_copy_inline_data_to_wqe(struct vrdma_cmd_post_send *wqe,
+ const struct ib_send_wr *ibwr)
+{
+ const struct ib_sge *sge;
+ char *dst = (char *)wqe + sizeof(*wqe); /* Start after header */
+ int i;
+
+ for (i = 0; i < ibwr->num_sge; i++) {
+ sge = &ibwr->sg_list[i];
+
+ /* Skip zero-length segments */
+ if (sge->length == 0)
+ continue;
+
+ /*
+ * WARNING: sge->addr is a user-space or kernel virtual address.
+ * Using (void *)(uintptr_t)sge->addr assumes it's directly dereferenceable.
+ * This is only valid if:
+ * - The QP is KERNEL type AND
+ * - The memory was registered and we trust its mapping
+ */
+
+ memcpy(dst, (void *)(uintptr_t)sge->addr, sge->length);
+ dst += sge->length;
+ }
+}
+
+/**
+ * vrdma_post_send - Post a list of send work requests to the SQ
+ * @ibqp: Queue pair
+ * @wr: List of work requests
+ * @bad_wr: Out parameter pointing to failing WR on error
+ *
+ * Converts each ib_send_wr into a vrdma_cmd_post_send and submits it
+ * via the send virtqueue. Supports both kernel and user QPs.
+ *
+ * Context: Process context (may hold spinlock, so no sleep in atomic section)
+ * Return:
+ * * 0 on success
+ * * negative errno on failure (e.g., -EINVAL, -ENOMEM)
+ * * @bad_wr set to first failed WR
+ */
+static int vrdma_post_send(struct ib_qp *ibqp, const struct ib_send_wr *wr,
+ const struct ib_send_wr **bad_wr)
+{
+ struct vrdma_qp *vqp = to_vqp(ibqp);
+ struct vrdma_cmd_post_send *cmd;
+ unsigned int sgl_size;
+ int rc = 0;
+ struct scatterlist hdr;
+ struct scatterlist *sgs[1];
+
+ /* Fast path for user-space QP: defer to userspace */
+ if (vqp->type == VIRTIO_RDMA_TYPE_USER) {
+ virtqueue_kick(vqp->sq->vq);
+ return 0;
+ }
+
+ spin_lock(&vqp->sq->lock);
+
+ while (wr) {
+ *bad_wr = wr; /* In case of error */
+
+ /* Validate opcode support in kernel QP */
+ switch (wr->opcode) {
+ case IB_WR_SEND:
+ case IB_WR_SEND_WITH_IMM:
+ case IB_WR_SEND_WITH_INV:
+ case IB_WR_RDMA_WRITE:
+ case IB_WR_RDMA_WRITE_WITH_IMM:
+ case IB_WR_RDMA_READ:
+ case IB_WR_ATOMIC_CMP_AND_SWP:
+ case IB_WR_ATOMIC_FETCH_AND_ADD:
+ case IB_WR_LOCAL_INV:
+ case IB_WR_REG_MR:
+ break;
+ default:
+ pr_warn("vRDMA: unsupported opcode %d for kernel QP\n",
+ wr->opcode);
+ rc = -EINVAL;
+ goto out_unlock;
+ }
+
+ /* Allocate command buffer including space for SGEs */
+ sgl_size = wr->num_sge * sizeof(struct vrdma_sge);
+ cmd = kzalloc(sizeof(*cmd) + sgl_size, GFP_ATOMIC);
+ if (!cmd) {
+ rc = -ENOMEM;
+ goto out_unlock;
+ }
+
+ /* Fill common fields */
+ cmd->wr_id = wr->wr_id;
+ cmd->num_sge = wr->num_sge;
+ cmd->send_flags = wr->send_flags;
+ cmd->opcode = wr->opcode;
+
+ /* Immediate data and invalidation key */
+ if (wr->send_flags & IB_SEND_INLINE) {
+ /* TODO: validate total inline length against max_inline_data
+ * and the allocated tailroom (sgl_size) to prevent overflow */
+ vrdma_copy_inline_data_to_wqe(cmd, wr);
+ } else {
+ memcpy(cmd->sg_list, wr->sg_list, sgl_size);
+ }
+
+ /* Handle immediate data (SEND_WITH_IMM, WRITE_WITH_IMM) */
+ if (wr->opcode == IB_WR_SEND_WITH_IMM ||
+ wr->opcode == IB_WR_RDMA_WRITE_WITH_IMM)
+ cmd->ex.imm_data = wr->ex.imm_data;
+
+ /* Handle invalidate key (SEND_WITH_INV, LOCAL_INV) */
+ if (wr->opcode == IB_WR_SEND_WITH_INV ||
+ wr->opcode == IB_WR_LOCAL_INV)
+ cmd->ex.invalidate_rkey = wr->ex.invalidate_rkey;
+
+ /* RDMA and Atomic specific fields */
+ switch (ibqp->qp_type) {
+ case IB_QPT_RC:
+ switch (wr->opcode) {
+ case IB_WR_RDMA_READ:
+ case IB_WR_RDMA_WRITE:
+ case IB_WR_RDMA_WRITE_WITH_IMM:
+ cmd->wr.rdma.remote_addr = rdma_wr(wr)->remote_addr;
+ cmd->wr.rdma.rkey = rdma_wr(wr)->rkey;
+ break;
+
+ case IB_WR_ATOMIC_CMP_AND_SWP:
+ case IB_WR_ATOMIC_FETCH_AND_ADD:
+ cmd->wr.atomic.remote_addr = atomic_wr(wr)->remote_addr;
+ cmd->wr.atomic.rkey = atomic_wr(wr)->rkey;
+ cmd->wr.atomic.compare_add = atomic_wr(wr)->compare_add;
+ if (wr->opcode == IB_WR_ATOMIC_CMP_AND_SWP)
+ cmd->wr.atomic.swap = atomic_wr(wr)->swap;
+ break;
+
+ case IB_WR_REG_MR: {
+ const struct ib_reg_wr *reg = reg_wr(wr);
+ struct vrdma_mr *vmr = to_vmr(reg->mr);
+ cmd->wr.reg.mrn = vmr->mr_handle;
+ cmd->wr.reg.key = reg->key;
+ cmd->wr.reg.access = reg->access;
+ break;
+ }
+ default:
+ break;
+ }
+ break;
+
+ case IB_QPT_UD:
+ case IB_QPT_GSI: {
+ if (!ud_wr(wr)->ah) {
+ pr_warn("vRDMA: invalid address handle in UD WR\n");
+ kfree(cmd);
+ rc = -EINVAL;
+ goto out_unlock;
+ }
+ cmd->wr.ud.remote_qpn = ud_wr(wr)->remote_qpn;
+ cmd->wr.ud.remote_qkey = ud_wr(wr)->remote_qkey;
+ cmd->wr.ud.av = to_vah(ud_wr(wr)->ah)->av;
+ break;
+ }
+
+ default:
+ pr_err("vRDMA: unsupported QP type %d\n", ibqp->qp_type);
+ kfree(cmd);
+ rc = -EINVAL;
+ goto out_unlock;
+ }
+
+ /* Prepare scatterlist for virtqueue */
+ sg_init_one(&hdr, cmd, sizeof(*cmd) + sgl_size);
+ sgs[0] = &hdr;
+
+ rc = virtqueue_add_sgs(vqp->sq->vq, sgs, 1, 0, cmd, GFP_ATOMIC);
+ if (rc) {
+ dev_err(&vqp->sq->vq->vdev->dev,
+ "vRDMA: failed to add send WR to vq: %d\n", rc);
+ kfree(cmd);
+ goto out_unlock;
+ }
+
+ /* Advance to next WR */
+ wr = wr->next;
+ }
+
+out_unlock:
+ spin_unlock(&vqp->sq->lock);
+
+ /* Kick unconditionally so any WRs queued before a failure are still processed */
+ virtqueue_kick(vqp->sq->vq);
+ return rc;
+}
+
static const struct ib_device_ops vrdma_dev_ops = {
.owner = THIS_MODULE,
.uverbs_abi_ver = VIRTIO_RDMA_ABI_VERSION,
@@ -2246,7 +2552,9 @@ static const struct ib_device_ops vrdma_dev_ops = {
.mmap_free = vrdma_mmap_free,
.modify_port = vrdma_modify_port,
.modify_qp = vrdma_modify_qp,
- .poll_cq = vrdma_poll_cq,
+ .poll_cq = vrdma_poll_cq,
+ .post_recv = vrdma_post_recv,
+ .post_send = vrdma_post_send,
};
/**
--
2.43.0
* [PATCH 09/10] drivers/infiniband/hw/virtio: Implement P_key, QP query and user MR resource management verbs
2025-12-18 9:09 Implement initial driver for virtio-RDMA device(kernel) Xiong Weimin
` (7 preceding siblings ...)
2025-12-18 9:09 ` [PATCH 08/10] drivers/infiniband/hw/virtio: Implement send/receive verb support Xiong Weimin
@ 2025-12-18 9:09 ` Xiong Weimin
2025-12-18 9:09 ` [PATCH 10/10] drivers/infiniband/hw/virtio: Add completion queue notification support Xiong Weimin
` (2 subsequent siblings)
11 siblings, 0 replies; 18+ messages in thread
From: Xiong Weimin @ 2025-12-18 9:09 UTC (permalink / raw)
To: Michael S . Tsirkin, David Hildenbrand, Jason Wang,
Stefano Garzarella, Thomas Monjalon, David Marchand,
Luca Boccassi, Kevin Traynor, Christian Ehrhardt, Xuan Zhuo,
Eugenio Pérez, Xueming Li, Maxime Coquelin, Chenbo Xia,
Bruce Richardson
Cc: kvm, virtualization, netdev, xiongweimin
From: xiongweimin <xiongweimin@kylinos.cn>
This commit adds support for essential RDMA resource management verbs:
1. P_Key Table Query:
- Implements IB_QUERY_PKEY verb for partition key retrieval
- Handles endianness conversion for cross-platform compatibility
- Provides complete error handling for device communication failures
2. QP Attribute Query:
- Full QP state retrieval including capabilities and AH attributes
- Byte order handling for all struct fields
- Init attribute preservation for consistency checks
- Detailed error logging for debugging
3. User Memory Registration:
- Memory pinning via ib_umem_get() with access flag enforcement
- DMA-safe page table construction and bulk transfer to device
- Multi-architecture DMA address handling
- Strict memory boundary validation
- Resource cleanup guarantees on all error paths
Key enhancements:
- Unified virtqueue command infrastructure
- Cross-architecture endianness handling
- Atomic page table transfer for registered memory regions
- Protection domain integration for memory access control
- Error injection points for robust resource recovery
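The registration path lands in vrdma_reg_user_mr() whenever user space registers memory through libibverbs; a minimal sketch of the triggering call, with `pd`, `buf` and `len` assumed set up (illustrative only):

    #include <infiniband/verbs.h>

    struct ibv_mr *mr;

    mr = ibv_reg_mr(pd, buf, len,
                    IBV_ACCESS_LOCAL_WRITE |
                    IBV_ACCESS_REMOTE_READ |
                    IBV_ACCESS_REMOTE_WRITE);
    if (!mr)
        perror("ibv_reg_mr");

The driver pins the pages with ib_umem_get(), builds a flat DMA-visible page array, and hands its bus address to the backend in VIRTIO_RDMA_CMD_REG_USER_MR.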
Signed-off-by: Xiong Weimin <xiongweimin@kylinos.cn>
---
.../infiniband/hw/virtio/vrdma_dev_api.h | 35 ++
.../drivers/infiniband/hw/virtio/vrdma_ib.c | 333 +++++++++++++++++-
2 files changed, 367 insertions(+), 1 deletion(-)
diff --git a/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_dev_api.h b/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_dev_api.h
index d0ce02601..86b5ecade 100644
--- a/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_dev_api.h
+++ b/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_dev_api.h
@@ -225,6 +225,41 @@ struct vrdma_rsp_modify_qp {
__u32 qpn;
};
+struct vrdma_cmd_query_pkey {
+ __u32 port;
+ __u16 index;
+};
+
+struct vrdma_rsp_query_pkey {
+ __u16 pkey;
+};
+
+struct vrdma_cmd_query_qp {
+ __u32 qpn;
+ __u32 attr_mask;
+};
+
+struct vrdma_rsp_query_qp {
+ struct vrdma_qp_attr attr;
+};
+
+struct vrdma_cmd_reg_user_mr {
+ __u32 pdn;
+ __u32 access_flags;
+ __u64 start;
+ __u64 length;
+ __u64 virt_addr;
+
+ __u64 pages;
+ __u32 npages;
+};
+
+struct vrdma_rsp_reg_user_mr {
+ __u32 mrn;
+ __u32 lkey;
+ __u32 rkey;
+};
+
#define VRDMA_CTRL_OK 0
#define VRDMA_CTRL_ERR 1
diff --git a/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_ib.c b/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_ib.c
index f9b129774..b1429e072 100644
--- a/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_ib.c
+++ b/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_ib.c
@@ -23,6 +23,7 @@
#include "vrdma_queue.h"
#define VIRTIO_RDMA_PAGE_PER_TBL 512
+#define VRDMA_MAX_PAGES (VIRTIO_RDMA_PAGE_PER_TBL * VIRTIO_RDMA_PAGE_PER_TBL)
/**
* cmd_str - String representation of virtio RDMA control commands
@@ -86,6 +87,36 @@ static void rdma_ah_attr_to_vrdma(struct vrdma_ah_attr *dst,
memcpy(&dst->roce, &src->roce, sizeof(struct roce_ah_attr));
}
+static void vrdma_to_ib_global_route(struct ib_global_route *dst,
+ const struct vrdma_global_route *src)
+{
+ dst->dgid = src->dgid;
+ dst->flow_label = src->flow_label;
+ dst->sgid_index = src->sgid_index;
+ dst->hop_limit = src->hop_limit;
+ dst->traffic_class = src->traffic_class;
+}
+
+static void vrdma_to_ib_qp_cap(struct ib_qp_cap *dst, const struct vrdma_qp_cap *src)
+{
+ dst->max_send_wr = src->max_send_wr;
+ dst->max_recv_wr = src->max_recv_wr;
+ dst->max_send_sge = src->max_send_sge;
+ dst->max_recv_sge = src->max_recv_sge;
+ dst->max_inline_data = src->max_inline_data;
+}
+
+static void vrdma_to_rdma_ah_attr(struct rdma_ah_attr *dst,
+ const struct vrdma_ah_attr *src)
+{
+ vrdma_to_ib_global_route(rdma_ah_retrieve_grh(dst), &src->grh);
+ rdma_ah_set_sl(dst, src->sl);
+ rdma_ah_set_static_rate(dst, src->static_rate);
+ rdma_ah_set_port_num(dst, src->port_num);
+ rdma_ah_set_ah_flags(dst, src->ah_flags);
+ memcpy(&dst->roce, &src->roce, sizeof(struct roce_ah_attr));
+}
+
/**
* vrdma_exec_verbs_cmd - Execute a verbs command via control virtqueue
* @vrdev: VRDMA device
@@ -2521,6 +2552,303 @@ static int vrdma_post_send(struct ib_qp *ibqp, const struct ib_send_wr *wr,
return rc;
}
+/**
+ * vrdma_query_pkey - Query Partition Key (P_Key) at given index
+ * @ibdev: Verbs device (vRDMA virtual device)
+ * @port: Port number (1-indexed)
+ * @index: P_Key table index
+ * @pkey: Output buffer to store the P_Key value
+ *
+ * Queries the P_Key from the backend via virtqueue command.
+ * Only meaningful for IB-style ports (not RoCE).
+ *
+ * Context: Process context (may sleep). Can be called from user IOCTL path.
+ * Return:
+ * * 0 on success
+ * * -ENOMEM if command allocation fails
+ * * -EIO or other negative errno on communication failure
+ */
+static int vrdma_query_pkey(struct ib_device *ibdev, u32 port, u16 index, u16 *pkey)
+{
+ struct vrdma_dev *vdev = to_vdev(ibdev);
+ struct vrdma_cmd_query_pkey *cmd;
+ struct vrdma_rsp_query_pkey *rsp;
+ struct scatterlist in, out;
+ int rc;
+
+ /* Allocate command and response buffers */
+ cmd = kzalloc(sizeof(*cmd), GFP_KERNEL);
+ if (!cmd)
+ return -ENOMEM;
+
+ rsp = kzalloc(sizeof(*rsp), GFP_KERNEL);
+ if (!rsp) {
+ kfree(cmd);
+ return -ENOMEM;
+ }
+
+ /* Fill input parameters */
+ cmd->port = cpu_to_le32(port);
+ cmd->index = cpu_to_le16(index);
+
+ /* Prepare scatterlists for virtqueue I/O */
+ sg_init_one(&in, cmd, sizeof(*cmd));
+ sg_init_one(&out, rsp, sizeof(*rsp));
+
+ /* Execute command */
+ rc = vrdma_exec_verbs_cmd(vdev, VIRTIO_RDMA_CMD_QUERY_PKEY, &in, &out);
+ if (rc) {
+ dev_err(&vdev->vdev->dev,
+ "VIRTIO_RDMA_CMD_QUERY_PKEY failed: port=%u idx=%u err=%d\n",
+ port, index, rc);
+ goto out_free;
+ }
+
+ /* Copy result to user */
+ *pkey = le16_to_cpu(rsp->pkey);
+
+out_free:
+ kfree(rsp);
+ kfree(cmd);
+ return rc;
+}
+
+/**
+ * vrdma_query_qp - Query QP attributes from the backend
+ * @ibqp: Queue pair to query
+ * @attr: Output structure for QP attributes
+ * @attr_mask: Which fields are requested (ignored by some backends)
+ * @init_attr: Output structure for init-time attributes
+ *
+ * Queries the QP state and configuration via a control virtqueue command.
+ * This is a synchronous operation.
+ *
+ * Context: Process context (can sleep)
+ * Return:
+ * * 0 on success
+ * * -ENOMEM if allocation fails
+ * * -EIO or other negative errno on communication failure
+ */
+static int vrdma_query_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr,
+ int attr_mask, struct ib_qp_init_attr *init_attr)
+{
+ struct vrdma_qp *vqp = to_vqp(ibqp);
+ struct vrdma_dev *vdev = to_vdev(ibqp->device);
+ struct vrdma_cmd_query_qp *cmd;
+ struct vrdma_rsp_query_qp *rsp;
+ struct scatterlist in, out;
+ int rc;
+
+ /* Allocate command and response buffers */
+ cmd = kzalloc(sizeof(*cmd), GFP_KERNEL);
+ if (!cmd)
+ return -ENOMEM;
+
+ rsp = kzalloc(sizeof(*rsp), GFP_KERNEL);
+ if (!rsp) {
+ kfree(cmd);
+ return -ENOMEM;
+ }
+
+ /* Fill input parameters */
+ cmd->qpn = cpu_to_le32(vqp->qp_handle);
+ cmd->attr_mask = cpu_to_le32(attr_mask); /* lets the backend skip unrequested attributes */
+
+ sg_init_one(&in, cmd, sizeof(*cmd));
+ sg_init_one(&out, rsp, sizeof(*rsp));
+
+ /* Execute command over control virtqueue */
+ rc = vrdma_exec_verbs_cmd(vdev, VIRTIO_RDMA_CMD_QUERY_QP, &in, &out);
+ if (rc) {
+ dev_err(&vdev->vdev->dev,
+ "VIRTIO_RDMA_CMD_QUERY_QP failed: qpn=0x%x err=%d\n",
+ vqp->qp_handle, rc);
+ goto out_free;
+ }
+
+ /* Only copy results on success */
+ attr->qp_state = rsp->attr.qp_state;
+ attr->cur_qp_state = rsp->attr.cur_qp_state;
+ attr->path_mtu = rsp->attr.path_mtu;
+ attr->path_mig_state = rsp->attr.path_mig_state;
+ attr->qkey = le32_to_cpu(rsp->attr.qkey);
+ attr->rq_psn = le32_to_cpu(rsp->attr.rq_psn);
+ attr->sq_psn = le32_to_cpu(rsp->attr.sq_psn);
+ attr->dest_qp_num = le32_to_cpu(rsp->attr.dest_qp_num);
+ attr->qp_access_flags = le32_to_cpu(rsp->attr.qp_access_flags);
+ attr->pkey_index = le16_to_cpu(rsp->attr.pkey_index);
+ attr->alt_pkey_index = le16_to_cpu(rsp->attr.alt_pkey_index);
+ attr->en_sqd_async_notify = rsp->attr.en_sqd_async_notify;
+ attr->sq_draining = rsp->attr.sq_draining;
+ attr->max_rd_atomic = rsp->attr.max_rd_atomic;
+ attr->max_dest_rd_atomic = rsp->attr.max_dest_rd_atomic;
+ attr->min_rnr_timer = rsp->attr.min_rnr_timer;
+ attr->port_num = rsp->attr.port_num;
+ attr->timeout = rsp->attr.timeout;
+ attr->retry_cnt = rsp->attr.retry_cnt;
+ attr->rnr_retry = rsp->attr.rnr_retry;
+ attr->alt_port_num = rsp->attr.alt_port_num;
+ attr->alt_timeout = rsp->attr.alt_timeout;
+ attr->rate_limit = le32_to_cpu(rsp->attr.rate_limit);
+
+ /* Copy capabilities */
+ vrdma_to_ib_qp_cap(&attr->cap, &rsp->attr.cap);
+
+ /* Convert AH attributes (contains GRH + DIP) */
+ vrdma_to_rdma_ah_attr(&attr->ah_attr, &rsp->attr.ah_attr);
+ vrdma_to_rdma_ah_attr(&attr->alt_ah_attr, &rsp->attr.alt_ah_attr);
+
+ /* Fill init attributes (mostly static) */
+ init_attr->event_handler = vqp->ibqp.event_handler;
+ init_attr->qp_context = vqp->ibqp.qp_context;
+ init_attr->send_cq = vqp->ibqp.send_cq;
+ init_attr->recv_cq = vqp->ibqp.recv_cq;
+ init_attr->srq = vqp->ibqp.srq;
+ init_attr->xrcd = NULL; /* Not supported in vRDMA */
+ init_attr->cap = attr->cap;
+ init_attr->sq_sig_type = IB_SIGNAL_REQ_WR; /* Or driver default */
+ init_attr->qp_type = vqp->ibqp.qp_type;
+ init_attr->create_flags = 0;
+ init_attr->port_num = vqp->port;
+
+out_free:
+ kfree(rsp);
+ kfree(cmd);
+ return rc;
+}
+
+/**
+ * vrdma_reg_user_mr - Register a user memory region
+ * @pd: Protection domain
+ * @start: User virtual address of memory to register
+ * @length: Length of memory region
+ * @virt_addr: Optional virtual address for rkey access (often same as start)
+ * @access_flags: Access permissions (IB_ACCESS_xxx)
+ * @udata: User data (optional, unused here)
+ *
+ * Locks down user pages, builds page table, and registers MR with backend.
+ * Returns pointer to ib_mr or ERR_PTR on failure.
+ *
+ * Context: Process context (may sleep during ib_umem_get)
+ * Return:
+ * * Pointer to &mr->ibmr on success
+ * * ERR_PTR(-errno) on failure
+ */
+static struct ib_mr *vrdma_reg_user_mr(struct ib_pd *pd, u64 start,
+ u64 length, u64 virt_addr,
+ int access_flags,
+ struct ib_udata *udata)
+{
+ struct vrdma_dev *dev = to_vdev(pd->device);
+ struct vrdma_cmd_reg_user_mr *cmd;
+ struct vrdma_rsp_reg_user_mr *rsp;
+ struct vrdma_mr *mr;
+ struct ib_umem *umem;
+ struct sg_dma_page_iter sg_iter;
+ struct scatterlist in, out;
+ int rc = 0;
+ unsigned int npages;
+ dma_addr_t *pages_flat = NULL;
+
+ /* Step 1: Pin user memory pages */
+ umem = ib_umem_get(pd->device, start, length, access_flags);
+ if (IS_ERR(umem)) {
+ dev_err(&dev->vdev->dev, "Failed to pin user memory: va=0x%llx len=%llu\n",
+ start, length);
+ return ERR_CAST(umem);
+ }
+
+ npages = ib_umem_num_pages(umem);
+ if (npages == 0 || npages > VRDMA_MAX_PAGES) {
+ dev_err(&dev->vdev->dev, "Invalid number of pages: %u\n", npages);
+ rc = -EINVAL;
+ goto err_umem;
+ }
+
+ /* Allocate command/response structures (GFP_KERNEL ok in process context) */
+ cmd = kzalloc(sizeof(*cmd), GFP_KERNEL);
+ rsp = kzalloc(sizeof(*rsp), GFP_KERNEL);
+ mr = kzalloc(sizeof(*mr), GFP_KERNEL);
+ if (!cmd || !rsp || !mr) {
+ rc = -ENOMEM;
+ goto err_alloc;
+ }
+
+ /* Initialize MR structure */
+ mr->umem = umem;
+ mr->size = length;
+ mr->iova = virt_addr;
+ mr->max_pages = npages;
+
+ /* Allocate contiguous DMA-mapped array for page addresses */
+ pages_flat = dma_alloc_coherent(&dev->vdev->dev,
+ npages * sizeof(dma_addr_t),
+ &mr->dma_pages, GFP_KERNEL);
+ if (!pages_flat) {
+ dev_err(&dev->vdev->dev, "Failed to allocate DMA memory for page table\n");
+ rc = -ENOMEM;
+ goto err_alloc;
+ }
+ /*
+ * pages_k (the two-level table used by the alloc_mr/map_mr_sg path)
+ * intentionally stays NULL here; user MRs use the flat DMA array
+ * only. Storing &pages_flat would leave a dangling stack pointer.
+ */
+
+ /* Fill page table from ib_umem scatterlist */
+ mr->npages = 0;
+ for_each_sg_dma_page(umem->sgt_append.sgt.sgl, &sg_iter, umem->sgt_append.sgt.nents, 0) {
+ dma_addr_t addr = sg_page_iter_dma_address(&sg_iter);
+ pages_flat[mr->npages++] = addr;
+ }
+
+ /* Sanity check: should match ib_umem_num_pages() */
+ WARN_ON(mr->npages != npages);
+
+ /* Prepare command */
+ cmd->pdn = cpu_to_le32(to_vpd(pd)->pd_handle);
+ cmd->start = cpu_to_le64(start);
+ cmd->length = cpu_to_le64(length);
+ cmd->virt_addr = cpu_to_le64(virt_addr);
+ cmd->access_flags = cpu_to_le32(access_flags);
+ cmd->pages = cpu_to_le64(mr->dma_pages); /* DMA address of page array */
+ cmd->npages = cpu_to_le32(npages);
+
+ sg_init_one(&in, cmd, sizeof(*cmd));
+ sg_init_one(&out, rsp, sizeof(*rsp));
+
+ /* Send command to backend */
+ rc = vrdma_exec_verbs_cmd(dev, VIRTIO_RDMA_CMD_REG_USER_MR, &in, &out);
+ if (rc) {
+ dev_err(&dev->vdev->dev, "Backend failed to register MR: %d\n", rc);
+ goto err_cmd;
+ }
+
+ /* Copy results from response */
+ mr->mr_handle = le32_to_cpu(rsp->mrn);
+ mr->ibmr.lkey = le32_to_cpu(rsp->lkey);
+ mr->ibmr.rkey = le32_to_cpu(rsp->rkey);
+
+ /* Cleanup temporary allocations */
+ kfree(cmd);
+ kfree(rsp);
+
+ /* Link MR to PD if needed, initialize other fields */
+ mr->ibmr.pd = pd;
+ mr->ibmr.device = pd->device;
+ mr->ibmr.type = IB_MR_TYPE_MEM_REG;
+ mr->ibmr.length = length;
+
+ return &mr->ibmr;
+
+err_cmd:
+ dma_free_coherent(&dev->vdev->dev, npages * sizeof(dma_addr_t),
+ pages_flat, mr->dma_pages);
+err_alloc:
+ kfree(mr);
+ kfree(rsp);
+ kfree(cmd);
+err_umem:
+ ib_umem_release(umem);
+ return ERR_PTR(rc);
+}
+
static const struct ib_device_ops vrdma_dev_ops = {
.owner = THIS_MODULE,
.uverbs_abi_ver = VIRTIO_RDMA_ABI_VERSION,
@@ -2554,7 +2882,10 @@ static const struct ib_device_ops vrdma_dev_ops = {
.modify_qp = vrdma_modify_qp,
.poll_cq = vrdma_poll_cq,
.post_recv = vrdma_post_recv,
- .post_send = vrdma_post_send,
+ .post_send = vrdma_post_send,
+ .query_pkey = vrdma_query_pkey,
+ .query_qp = vrdma_query_qp,
+ .reg_user_mr = vrdma_reg_user_mr,
};
/**
--
2.43.0
* [PATCH 10/10] drivers/infiniband/hw/virtio: Add completion queue notification support
2025-12-18 9:09 Implement initial driver for virtio-RDMA device(kernel) Xiong Weimin
` (8 preceding siblings ...)
2025-12-18 9:09 ` [PATCH 09/10] drivers/infiniband/hw/virtio: Implement P_key, QP query and user MR resource management verbs Xiong Weimin
@ 2025-12-18 9:09 ` Xiong Weimin
2025-12-18 16:30 ` Implement initial driver for virtio-RDMA device(kernel) Leon Romanovsky
2025-12-23 1:16 ` Jason Wang
11 siblings, 0 replies; 18+ messages in thread
From: Xiong Weimin @ 2025-12-18 9:09 UTC (permalink / raw)
To: Michael S . Tsirkin, David Hildenbrand, Jason Wang,
Stefano Garzarella, Thomas Monjalon, David Marchand,
Luca Boccassi, Kevin Traynor, Christian Ehrhardt, Xuan Zhuo,
Eugenio Pérez, Xueming Li, Maxime Coquelin, Chenbo Xia,
Bruce Richardson
Cc: kvm, virtualization, netdev, xiongweimin
From: xiongweimin <xiongweimin@kylinos.cn>
This commit implements CQ (Completion Queue) notification functionality
for virtio RDMA devices:
1. Notification types:
- Solicited completion notifications (IB_CQ_SOLICITED)
- Next completion notifications (IB_CQ_NEXT_COMP)
- Error handling for unsupported flags
2. Backend communication:
- VIRTIO_RDMA_CMD_REQ_NOTIFY_CQ command implementation
- Command/response buffer management
- Error handling for virtqueue operations
3. Resource management:
- Dynamic memory allocation for command/response structs
- Guaranteed cleanup on error paths
- Rate-limited error logging
4. Feature limitations:
- REPORT_MISSED_EVENTS currently returns -EOPNOTSUPP
(to be implemented in future work)
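For reference, the usual consumer pattern pairs this verb with a re-poll to close the window between arming and a completion that arrived just before; a minimal sketch against the standard kernel API, with `cq` assumed to exist and handle_wc() a hypothetical callback:

    struct ib_wc wc;
    int ret;

    /* Arm for the next (any) completion ... */
    ret = ib_req_notify_cq(cq, IB_CQ_NEXT_COMP);
    if (ret)
        return ret;

    /* ... then poll once more: a CQE may have landed before arming */
    while (ib_poll_cq(cq, 1, &wc) > 0)
        handle_wc(&wc); /* hypothetical consumer */

Once IB_CQ_REPORT_MISSED_EVENTS is supported, the re-poll can be replaced by checking the verb's positive return value.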
Signed-off-by: Xiong Weimin <xiongweimin@kylinos.cn>
---
.../drivers/infiniband/hw/virtio/vrdma_abi.h | 6 +
.../infiniband/hw/virtio/vrdma_dev_api.h | 9 ++
.../drivers/infiniband/hw/virtio/vrdma_ib.c | 121 +++++++++++++++++-
3 files changed, 135 insertions(+), 1 deletion(-)
diff --git a/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_abi.h b/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_abi.h
index 0a9404057..ff4b2505f 100644
--- a/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_abi.h
+++ b/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_abi.h
@@ -9,6 +9,12 @@
#define VRDMA_ABI_VERSION 1
+enum {
+ VIRTIO_RDMA_NOTIFY_NOT = (0),
+ VIRTIO_RDMA_NOTIFY_SOLICITED = (1 << 0),
+ VIRTIO_RDMA_NOTIFY_NEXT_COMPLETION = (1 << 1)
+};
+
/**
* struct vrdma_cqe - Virtio-RDMA Completion Queue Entry (CQE)
*
diff --git a/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_dev_api.h b/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_dev_api.h
index 86b5ecade..d9a65531e 100644
--- a/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_dev_api.h
+++ b/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_dev_api.h
@@ -243,6 +243,15 @@ struct vrdma_rsp_query_qp {
struct vrdma_qp_attr attr;
};
+struct vrdma_cmd_req_notify {
+ __u32 cqn;
+ __u32 flags;
+};
+
+struct vrdma_rsp_req_notify {
+ __u32 missed;
+};
+
struct vrdma_cmd_reg_user_mr {
__u32 pdn;
__u32 access_flags;
diff --git a/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_ib.c b/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_ib.c
index b1429e072..6f97c6bdc 100644
--- a/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_ib.c
+++ b/linux-6.16.8/drivers/infiniband/hw/virtio/vrdma_ib.c
@@ -2849,6 +2849,116 @@ static struct ib_mr *vrdma_reg_user_mr(struct ib_pd *pd, u64 start,
return ERR_PTR(rc);
}
+/**
+ * vrdma_req_notify_cq - Request notification for CQ events
+ * @ibcq: Completion queue
+ * @flags: Notification flags (e.g., solicited only, report missed)
+ *
+ * Requests that the next completion (or next solicited completion) trigger
+ * an interrupt/event. Also supports checking if events were missed.
+ *
+ * Context: Process context (may sleep). Called from user or kernel path.
+ * Return:
+ * * 0 on success
+ * * -EOPNOTSUPP if unsupported flag (e.g., REPORT_MISSED_EVENTS)
+ * * -ENOMEM if command allocation fails
+ * * -EIO if communication with backend fails
+ */
+static int vrdma_req_notify_cq(struct ib_cq *ibcq, enum ib_cq_notify_flags flags)
+{
+ struct vrdma_cq *vcq = to_vcq(ibcq);
+ struct vrdma_dev *vdev = to_vdev(ibcq->device);
+ struct vrdma_cmd_req_notify *cmd;
+ struct vrdma_rsp_req_notify *rsp;
+ struct scatterlist in, out;
+ int rc = 0;
+
+ /* Handle solicited-only or any-completion notification */
+ if (flags & IB_CQ_SOLICITED_MASK) {
+ cmd = kzalloc(sizeof(*cmd), GFP_KERNEL);
+ if (!cmd)
+ return -ENOMEM;
+
+ rsp = kzalloc(sizeof(*rsp), GFP_KERNEL);
+ if (!rsp) {
+ kfree(cmd);
+ return -ENOMEM;
+ }
+
+ cmd->cqn = cpu_to_le32(vcq->cq_handle);
+
+ if ((flags & IB_CQ_SOLICITED_MASK) == IB_CQ_SOLICITED)
+ cmd->flags = cpu_to_le32(VIRTIO_RDMA_NOTIFY_SOLICITED);
+ else
+ cmd->flags = cpu_to_le32(VIRTIO_RDMA_NOTIFY_NEXT_COMPLETION);
+
+ sg_init_one(&in, cmd, sizeof(*cmd));
+ sg_init_one(&out, rsp, sizeof(*rsp));
+
+ rc = vrdma_exec_verbs_cmd(vdev, VIRTIO_RDMA_CMD_REQ_NOTIFY_CQ,
+ &in, &out);
+
+ if (rc) {
+ dev_err(&vdev->vdev->dev,
+ "VIRTIO_RDMA_CMD_REQ_NOTIFY_CQ failed: cqn=%u, rc=%d\n",
+ vcq->cq_handle, rc);
+ rc = -EIO;
+ }
+
+ kfree(rsp);
+ kfree(cmd);
+
+ if (rc)
+ return rc;
+ }
+
+ /*
+ * Check for missed events: this requires querying backend state
+ * Currently not supported in most virtio-rdma implementations.
+ */
+ if (flags & IB_CQ_REPORT_MISSED_EVENTS) {
+ /*
+ * Ideally we'd query the host whether an event has occurred
+ * since last notify, but this is often unimplemented.
+ */
+ return -EOPNOTSUPP;
+ }
+
+ return 0;
+}
+
+static ssize_t hca_type_show(struct device *device,
+ struct device_attribute *attr, char *buf)
+{
+ return sprintf(buf, "VIRTIO-RDMA-%s\n", VIRTIO_RDMA_DRIVER_VER);
+}
+static DEVICE_ATTR_RO(hca_type);
+
+static ssize_t hw_rev_show(struct device *device,
+ struct device_attribute *attr, char *buf)
+{
+ return sysfs_emit(buf, "%d\n", VIRTIO_RDMA_HW_REV);
+}
+static DEVICE_ATTR_RO(hw_rev);
+
+static ssize_t board_id_show(struct device *device,
+ struct device_attribute *attr, char *buf)
+{
+ return sysfs_emit(buf, "%d\n", VIRTIO_RDMA_BOARD_ID);
+}
+static DEVICE_ATTR_RO(board_id);
+
+static struct attribute *vrdma_class_attributes[] = {
+ &dev_attr_hw_rev.attr,
+ &dev_attr_hca_type.attr,
+ &dev_attr_board_id.attr,
+ NULL,
+};
+
+static const struct attribute_group vrdma_attr_group = {
+ .attrs = vrdma_class_attributes,
+};
+
static const struct ib_device_ops vrdma_dev_ops = {
.owner = THIS_MODULE,
.uverbs_abi_ver = VIRTIO_RDMA_ABI_VERSION,
@@ -2885,7 +2995,16 @@ static const struct ib_device_ops vrdma_dev_ops = {
.post_send = vrdma_post_send,
.query_pkey = vrdma_query_pkey,
.query_qp = vrdma_query_qp,
- .reg_user_mr = vrdma_reg_user_mr,
+ .reg_user_mr = vrdma_reg_user_mr,
+ .req_notify_cq = vrdma_req_notify_cq,
+
+ .device_group = &vrdma_attr_group,
+
+ INIT_RDMA_OBJ_SIZE(ib_ah, vrdma_ah, ibah),
+ INIT_RDMA_OBJ_SIZE(ib_cq, vrdma_cq, ibcq),
+ INIT_RDMA_OBJ_SIZE(ib_pd, vrdma_pd, ibpd),
+ INIT_RDMA_OBJ_SIZE(ib_qp, vrdma_qp, ibqp),
+ INIT_RDMA_OBJ_SIZE(ib_ucontext, vrdma_ucontext, ibucontext),
};
/**
--
2.43.0
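For readers who want to exercise the new verb from a guest, a minimal
libibverbs sequence is sketched below. It is illustrative only: it
assumes the vrdma device is the first device returned by
ibv_get_device_list(), and it skips PD/QP setup, so the event wait will
only return once completions are generated on the CQ by other work.
#include <stdio.h>
#include <infiniband/verbs.h>
int main(void)
{
	struct ibv_device **list = ibv_get_device_list(NULL);
	struct ibv_context *ctx;
	struct ibv_comp_channel *ch;
	struct ibv_cq *cq, *ev_cq;
	void *ev_ctx;
	if (!list || !list[0])
		return 1;
	ctx = ibv_open_device(list[0]);
	if (!ctx)
		return 1;
	ch = ibv_create_comp_channel(ctx);
	cq = ch ? ibv_create_cq(ctx, 16, NULL, ch, 0) : NULL;
	if (!cq)
		return 1;
	/* Arm the CQ: solicited_only=0 requests an event on the next
	 * completion, which the guest kernel turns into a
	 * vrdma_req_notify_cq() call with IB_CQ_NEXT_COMP; passing 1
	 * maps to IB_CQ_SOLICITED instead. */
	if (ibv_req_notify_cq(cq, 0)) {
		fprintf(stderr, "ibv_req_notify_cq failed\n");
		return 1;
	}
	/* Blocks until the backend raises a completion event. */
	if (!ibv_get_cq_event(ch, &ev_cq, &ev_ctx))
		ibv_ack_cq_events(ev_cq, 1);
	ibv_destroy_cq(cq);
	ibv_destroy_comp_channel(ch);
	ibv_close_device(ctx);
	ibv_free_device_list(list);
	return 0;
}
The sysfs attributes added above can be checked from the guest with,
e.g., cat /sys/class/infiniband/vrdma0/hw_rev (assuming the device
registers as vrdma0).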
* Re: Implement initial driver for virtio-RDMA device(kernel)
2025-12-18 9:09 Implement initial driver for virtio-RDMA device(kernel) Xiong Weimin
` (9 preceding siblings ...)
2025-12-18 9:09 ` [PATCH 10/10] drivers/infiniband/hw/virtio: Add completion queue notification support Xiong Weimin
@ 2025-12-18 16:30 ` Leon Romanovsky
2025-12-19 2:27 ` 熊伟民
[not found] ` <6ef11502.4847.19b34677a76.Coremail.15927021679@163.com>
2025-12-23 1:16 ` Jason Wang
11 siblings, 2 replies; 18+ messages in thread
From: Leon Romanovsky @ 2025-12-18 16:30 UTC (permalink / raw)
To: Xiong Weimin
Cc: Michael S . Tsirkin, David Hildenbrand, Jason Wang,
Stefano Garzarella, Thomas Monjalon, David Marchand,
Luca Boccassi, Kevin Traynor, Christian Ehrhardt, Xuan Zhuo,
Eugenio Pérez, Xueming Li, Maxime Coquelin, Chenbo Xia,
Bruce Richardson, kvm, virtualization, netdev, RDMA mailing list
On Thu, Dec 18, 2025 at 05:09:40PM +0800, Xiong Weimin wrote:
> Hi all,
>
> This testing instructions aims to introduce an emulating a soft ROCE
> device with normal NIC(no RDMA), we have finished a vhost-user RDMA
> device demo, which can work with RDMA features such as CM, QP type of
> UC/UD and so on.
Same question as on your QEMU patches.
https://lore.kernel.org/all/20251218162028.GG400630@unreal/
And as a bare minimum, you should run get_maintainers.pl script on your
patches and add the right people and ML to the CC/TO fields.
Thanks
* Re: Re: Implement initial driver for virtio-RDMA device(kernel)
2025-12-18 16:30 ` Implement initial driver for virtio-RDMA device(kernel) Leon Romanovsky
@ 2025-12-19 2:27 ` 熊伟民
[not found] ` <6ef11502.4847.19b34677a76.Coremail.15927021679@163.com>
1 sibling, 0 replies; 18+ messages in thread
From: 熊伟民 @ 2025-12-19 2:27 UTC (permalink / raw)
To: Leon Romanovsky, Michael S . Tsirkin, David Hildenbrand,
Jason Wang, Stefano Garzarella, Thomas Monjalon, David Marchand,
Luca Boccassi, Kevin Traynor, Christian Ehrhardt, Xuan Zhuo,
Eugenio Pérez, Xueming Li, Maxime Coquelin, Chenbo Xia,
Bruce Richardson
Cc: kvm, virtualization, netdev, RDMA mailing list
At 2025-12-19 00:30:08, "Leon Romanovsky" <leon@kernel.org> wrote:
>On Thu, Dec 18, 2025 at 05:09:40PM +0800, Xiong Weimin wrote:
>> Hi all,
>>
>> This testing instructions aims to introduce an emulating a soft ROCE
>> device with normal NIC(no RDMA), we have finished a vhost-user RDMA
>> device demo, which can work with RDMA features such as CM, QP type of
>> UC/UD and so on.
>
>Same question as on your QEMU patches.
>https://lore.kernel.org/all/20251218162028.GG400630@unreal/
>
>And as a bare minimum, you should run get_maintainers.pl script on your
>patches and add the right people and ML to the CC/TO fields.
>
>Thanks
Since this feature involves coordinated changes across QEMU, DPDK, and the kernel,
I have submitted patches for all three components to every maintainer. This is to
ensure that senior developers can review the complete architecture and code.
Thanks.
* Re: Re: Implement initial driver for virtio-RDMA device(kernel)
[not found] ` <6ef11502.4847.19b34677a76.Coremail.15927021679@163.com>
@ 2025-12-21 8:46 ` Leon Romanovsky
0 siblings, 0 replies; 18+ messages in thread
From: Leon Romanovsky @ 2025-12-21 8:46 UTC (permalink / raw)
To: 熊伟民
Cc: Michael S . Tsirkin, David Hildenbrand, Jason Wang,
Stefano Garzarella, Thomas Monjalon, David Marchand,
Luca Boccassi, Kevin Traynor, Christian Ehrhardt, Xuan Zhuo,
Eugenio Pérez, Xueming Li, Maxime Coquelin, Chenbo Xia,
Bruce Richardson, kvm, virtualization, netdev, RDMA mailing list
On Fri, Dec 19, 2025 at 10:19:15AM +0800, 熊伟民 wrote:
>
> At 2025-12-19 00:30:08, "Leon Romanovsky" <leon@kernel.org> wrote:
> >On Thu, Dec 18, 2025 at 05:09:40PM +0800, Xiong Weimin wrote:
> >> Hi all,
> >>
> >> This testing instructions aims to introduce an emulating a soft ROCE
> >> device with normal NIC(no RDMA), we have finished a vhost-user RDMA
> >> device demo, which can work with RDMA features such as CM, QP type of
> >> UC/UD and so on.
> >
> >Same question as on your QEMU patches.
> >https://lore.kernel.org/all/20251218162028.GG400630@unreal/
> >
> >And as a bare minimum, you should run get_maintainers.pl script on your
> >patches and add the right people and ML to the CC/TO fields.
> >
>
> >Thanks
>
>
> Since this feature involves coordinated changes across QEMU, DPDK, and the kernel,
> I have submitted patches for all three components to every maintainer. This is to
> ensure that senior developers can review the complete architecture and code.
Please run get_maintainers.pl on your kernel patches and check if you
really added "every maintainer".
Hint: you didn't even add the right mailing list here.
Thanks
>
>
> Thanks.
* Re: [PATCH 01/10] drivers/infiniband/hw/virtio: Initial driver for virtio RDMA devices
2025-12-18 9:09 ` [PATCH 01/10] drivers/infiniband/hw/virtio: Initial driver for virtio RDMA devices Xiong Weimin
@ 2025-12-21 9:11 ` Leon Romanovsky
0 siblings, 0 replies; 18+ messages in thread
From: Leon Romanovsky @ 2025-12-21 9:11 UTC (permalink / raw)
To: Xiong Weimin
Cc: Michael S . Tsirkin, David Hildenbrand, Jason Wang,
Stefano Garzarella, Thomas Monjalon, David Marchand,
Luca Boccassi, Kevin Traynor, Christian Ehrhardt, Xuan Zhuo,
Eugenio Pérez, Xueming Li, Maxime Coquelin, Chenbo Xia,
Bruce Richardson, kvm, virtualization, netdev, xiongweimin
On Thu, Dec 18, 2025 at 05:09:41PM +0800, Xiong Weimin wrote:
> From: xiongweimin <xiongweimin@kylinos.cn>
>
> This commit introduces a new driver for RDMA over virtio, enabling
> RDMA capabilities in virtualized environments. The driver consists
> of the following main components:
>
> 1. Driver registration with the virtio subsystem and device discovery.
> 2. Device probe and remove handlers for managing the device lifecycle.
> 3. Initialization of the InfiniBand device attributes by reading the
> virtio configuration space, including conversion from little-endian
> to CPU byte order and capability mapping.
> 4. Setup of virtqueues for:
> - Control commands (no callback)
> - Completion queues (with callback for CQ events)
> - Send and receive queues for queue pairs (no callbacks)
> 5. Integration with the network device layer for RoCE support.
> 6. Registration with the InfiniBand core subsystem.
> 7. Comprehensive error handling during initialization and a symmetric
> teardown process.
>
> Key features:
> - Support for multiple virtqueues based on device capabilities (max_cq, max_qp)
> - Fast doorbell optimization when notify_offset_multiplier equals PAGE_SIZE
> - Safe resource management with rollback on failure
>
> Signed-off-by: Xiong Weimin <xiongweimin@kylinos.cn>
<...>
> +/**
> + * vrdma_init_netdev - Attempt to find paired virtio-net device on same PCI slot
> + * @vrdev: The vRDMA device
> + *
> + * WARNING: This is a non-standard hack for development/emulation environments.
> + * Do not use in production or upstream drivers.
I'm impressed by how much AI has advanced in code generation. Please
recheck everything that was generated.
> + *
> + * Returns 0 on success, or negative errno.
> + */
> +int vrdma_init_netdev(struct vrdma_dev *vrdev)
> +{
> + struct pci_dev *pdev_net;
> + struct virtio_pci_device *vp_dev;
> + struct virtio_pci_device *vnet_pdev;
> + void *priv;
> + struct net_device *netdev;
> +
> + if (!vrdev || !vrdev->vdev) {
> + pr_err("%s: invalid vrdev or vdev\n", __func__);
> + return -EINVAL;
> + }
> +
> + vp_dev = to_vp_device(vrdev->vdev);
> +
> + /* Find the PCI device at function 0 of the same slot */
> + pdev_net = pci_get_slot(vp_dev->pci_dev->bus,
> + PCI_DEVFN(PCI_SLOT(vp_dev->pci_dev->devfn), 0));
> + if (!pdev_net) {
> + pr_err("Failed to find PCI device at fn=0 of slot %x\n",
> + PCI_SLOT(vp_dev->pci_dev->devfn));
> + return -ENODEV;
> + }
> +
> + /* Optional: Validate it's a known virtio-net device */
> + if (pdev_net->vendor != PCI_VENDOR_ID_REDHAT_QUMRANET ||
> + pdev_net->device != 0x1041) {
> + pr_warn("PCI device %04x:%04x is not expected virtio-net (1041) device\n",
> + pdev_net->vendor, pdev_net->device);
> + pci_dev_put(pdev_net);
> + return -ENODEV;
> + }
> +
> + /* Get the virtio_pci_device from drvdata */
> + vnet_pdev = pci_get_drvdata(pdev_net);
> + if (!vnet_pdev || !vnet_pdev->vdev.priv) {
> + pr_err("No driver data or priv for virtio-net device\n");
> + pci_dev_put(pdev_net);
> + return -ENODEV;
> + }
> +
> + priv = vnet_pdev->vdev.priv;
> + vrdev->netdev = priv - ALIGN(sizeof(struct net_device), NETDEV_ALIGN);
> + netdev = vrdev->netdev;
> +
> + if (!netdev || !netdev->netdev_ops) {
> + pr_err("Invalid net_device retrieved from virtio-net\n");
> + pci_dev_put(pdev_net);
> + return -ENODEV;
> + }
> +
> + /* Hold reference so netdev won't disappear */
> + dev_hold(netdev);
> +
> + pci_dev_put(pdev_net); /* Release reference from pci_get_slot */
> +
> + return 0;
> +}
AI was right here. It is an awful hack.
Thanks
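For comparison, the established soft-RoCE drivers (rxe, siw) do not
walk PCI slots at all; they let the IB core bind the ib_device to a
net_device. A rough sketch of that pattern is below, assuming vrdma's
embedded ib_device field is named ib_dev (the field name is not
visible in the quoted patch):
/* Sketch only: conventional RoCE netdev binding through the IB core.
 * The core then maintains the GID table and tracks netdev events for
 * port 1, with no PCI slot/function assumptions. */
static int vrdma_bind_netdev(struct vrdma_dev *vrdev,
			     struct net_device *ndev)
{
	return ib_device_set_netdev(&vrdev->ib_dev, ndev, 1);
}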
* Re: Implement initial driver for virtio-RDMA device(kernel)
2025-12-18 9:09 Implement initial driver for virtio-RDMA device(kernel) Xiong Weimin
` (10 preceding siblings ...)
2025-12-18 16:30 ` Implement initial driver for virtio-RDMA device(kernel) Leon Romanovsky
@ 2025-12-23 1:16 ` Jason Wang
2025-12-24 9:31 ` 熊伟民
11 siblings, 1 reply; 18+ messages in thread
From: Jason Wang @ 2025-12-23 1:16 UTC (permalink / raw)
To: Xiong Weimin
Cc: Michael S . Tsirkin, David Hildenbrand, Stefano Garzarella,
Thomas Monjalon, David Marchand, Luca Boccassi, Kevin Traynor,
Christian Ehrhardt, Xuan Zhuo, Eugenio Pérez, Xueming Li,
Maxime Coquelin, Chenbo Xia, Bruce Richardson, kvm,
virtualization, netdev
On Thu, Dec 18, 2025 at 5:11 PM Xiong Weimin <15927021679@163.com> wrote:
>
> Hi all,
>
> This testing instructions aims to introduce an emulating a soft ROCE
> device with normal NIC(no RDMA), we have finished a vhost-user RDMA
> device demo, which can work with RDMA features such as CM, QP type of
> UC/UD and so on.
>
I think we need
1) to know the difference between this and [1]
2) the spec patch
Thanks
[1] https://yhbt.net/lore/virtio-dev/CACycT3sShxOR41Kk1znxC7Mpw73N0LAP66cC3-iqeS_jp8trvw@mail.gmail.com/T/#m0602ee71de0fe389671cbd81242b5f3ceeab0101
* Re: Re: Implement initial driver for virtio-RDMA device(kernel)
2025-12-23 1:16 ` Jason Wang
@ 2025-12-24 9:31 ` 熊伟民
2025-12-25 2:13 ` Jason Wang
0 siblings, 1 reply; 18+ messages in thread
From: 熊伟民 @ 2025-12-24 9:31 UTC (permalink / raw)
To: Jason Wang
Cc: Michael S . Tsirkin, David Hildenbrand, Stefano Garzarella,
Thomas Monjalon, David Marchand, Luca Boccassi, Kevin Traynor,
Christian Ehrhardt, Xuan Zhuo, Eugenio Pérez, Xueming Li,
Maxime Coquelin, Chenbo Xia, Bruce Richardson, kvm,
virtualization, netdev
At 2025-12-23 09:16:40, "Jason Wang" <jasowang@redhat.com> wrote:
>On Thu, Dec 18, 2025 at 5:11 PM Xiong Weimin <15927021679@163.com> wrote:
>>
>> Hi all,
>>
>> This testing instructions aims to introduce an emulating a soft ROCE
>> device with normal NIC(no RDMA), we have finished a vhost-user RDMA
>> device demo, which can work with RDMA features such as CM, QP type of
>> UC/UD and so on.
>>
>
>I think we need
>
>1) to know the difference between this and [1]
>2) the spec patch
>
>Thanks
>
>[1] https://yhbt.net/lore/virtio-dev/CACycT3sShxOR41Kk1znxC7Mpw73N0LAP66cC3-iqeS_jp8trvw@mail.gmail.com/T/#m0602ee71de0fe389671cbd81242b5f3ceeab0101
Sorry, I can't access this webpage link. Is there another way to view it?
* Re: Re: Implement initial driver for virtio-RDMA device(kernel)
2025-12-24 9:31 ` 熊伟民
@ 2025-12-25 2:13 ` Jason Wang
0 siblings, 0 replies; 18+ messages in thread
From: Jason Wang @ 2025-12-25 2:13 UTC (permalink / raw)
To: 熊伟民
Cc: Michael S . Tsirkin, David Hildenbrand, Stefano Garzarella,
Thomas Monjalon, David Marchand, Luca Boccassi, Kevin Traynor,
Christian Ehrhardt, Xuan Zhuo, Eugenio Pérez, Xueming Li,
Maxime Coquelin, Chenbo Xia, Bruce Richardson, kvm,
virtualization, netdev, Yongji Xie
On Wed, Dec 24, 2025 at 5:32 PM 熊伟民 <15927021679@163.com> wrote:
> At 2025-12-23 09:16:40, "Jason Wang" <jasowang@redhat.com> wrote:
> >On Thu, Dec 18, 2025 at 5:11 PM Xiong Weimin <15927021679@163.com> wrote:
> >>
> >> Hi all,
> >>
> >> This testing instructions aims to introduce an emulating a soft ROCE
> >> device with normal NIC(no RDMA), we have finished a vhost-user RDMA
> >> device demo, which can work with RDMA features such as CM, QP type of
> >> UC/UD and so on.
> >>
> >
> >I think we need
> >
> >1) to know the difference between this and [1]
> >2) the spec patch
> >
> >Thanks
> >
>
> >[1] https://yhbt.net/lore/virtio-dev/CACycT3sShxOR41Kk1znxC7Mpw73N0LAP66cC3-iqeS_jp8trvw@mail.gmail.com/T/#m0602ee71de0fe389671cbd81242b5f3ceeab0101
>
>
> Sorry, I can't access this webpage link. Is there another way to view it?
How about this?
https://lore.kernel.org/virtio-comment/20220511095900.343-1-xieyongji@bytedance.com/
Thanks
Thread overview: 18+ messages
2025-12-18 9:09 Implement initial driver for virtio-RDMA device(kernel) Xiong Weimin
2025-12-18 9:09 ` [PATCH 01/10] drivers/infiniband/hw/virtio: Initial driver for virtio RDMA devices Xiong Weimin
2025-12-21 9:11 ` Leon Romanovsky
2025-12-18 9:09 ` [PATCH 02/10] drivers/infiniband/hw/virtio: add vrdma_exec_verbs_cmd to construct verbs sgs using virtio Xiong Weimin
2025-12-18 9:09 ` [PATCH 03/10] drivers/infiniband/hw/virtio: Implement core device and key resource management Xiong Weimin
2025-12-18 9:09 ` [PATCH 04/10] drivers/infiniband/hw/virtio: Implement MR, GID, ucontext and AH resource management verbs Xiong Weimin
2025-12-18 9:09 ` [PATCH 05/10] drivers/infiniband/hw/virtio: Implement memory mapping and MR scatter-gather support Xiong Weimin
2025-12-18 9:09 ` [PATCH 06/10] drivers/infiniband/hw/virtio: Implement port management and QP modification verbs Xiong Weimin
2025-12-18 9:09 ` [PATCH 07/10] drivers/infiniband/hw/virtio: Implement Completion Queue (CQ) polling support Xiong Weimin
2025-12-18 9:09 ` [PATCH 08/10] drivers/infiniband/hw/virtio: Implement send/receive verb support Xiong Weimin
2025-12-18 9:09 ` [PATCH 09/10] drivers/infiniband/hw/virtio: Implement P_key, QP query and user MR resource management verbs Xiong Weimin
2025-12-18 9:09 ` [PATCH 10/10] drivers/infiniband/hw/virtio: Add completion queue notification support Xiong Weimin
2025-12-18 16:30 ` Implement initial driver for virtio-RDMA device(kernel) Leon Romanovsky
2025-12-19 2:27 ` 熊伟民
[not found] ` <6ef11502.4847.19b34677a76.Coremail.15927021679@163.com>
2025-12-21 8:46 ` Leon Romanovsky
2025-12-23 1:16 ` Jason Wang
2025-12-24 9:31 ` 熊伟民
2025-12-25 2:13 ` Jason Wang