public inbox for linux-raid@vger.kernel.org
* [PATCH 0/2] Enable PCI P2PDMA support for RAID0 and NVMe Multipath
@ 2026-03-23 23:44 Chaitanya Kulkarni
  2026-03-23 23:44 ` [PATCH 1/2] md: Add PCI_P2PDMA support for MD RAID volumes Chaitanya Kulkarni
  2026-03-23 23:44 ` [PATCH 2/2] nvme-multipath: enable PCI P2PDMA for multipath devices Chaitanya Kulkarni
  0 siblings, 2 replies; 9+ messages in thread
From: Chaitanya Kulkarni @ 2026-03-23 23:44 UTC (permalink / raw)
  To: song, yukuai, linan122, kbusch, axboe, hch, sagi
  Cc: linux-raid, linux-nvme, Chaitanya Kulkarni

Hi,

This patch series extends PCI peer-to-peer DMA (P2PDMA) support to enable
direct data transfers between PCIe devices through RAID and NVMe multipath
block layers.

Background
==========

The current Linux kernel P2PDMA infrastructure supports direct peer-to-peer
transfers, but this capability is not propagated through stacked block
drivers such as MD RAID and NVMe multipath.

Patch Overview
==============

Patch 1/2: MD RAID0 PCI_P2PDMA support
---------------------------------------
Enables PCI_P2PDMA for MD RAID volumes by propagating the BLK_FEAT_PCI_P2PDMA
feature to the RAID device when all underlying member devices support P2PDMA.
This follows the same pattern as the NOWAIT flag handling in the MD layer.

Without this patch, even if all underlying NVMe devices in a RAID0 array
support P2PDMA, the RAID device itself does not advertise this capability,
preventing direct device-to-device transfers through the RAID layer.
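Conceptually, the NOWAIT-style propagation described above looks roughly like
the sketch below. This is illustrative pseudocode in kernel C style, not the
actual patch; the queue_limits variable name `lim` and the exact placement in
the MD setup path are assumptions.

/*
 * Illustrative sketch only -- not the actual patch. Assumes a
 * queue_limits `lim` being assembled for the MD gendisk, mirroring
 * how the NOWAIT feature is derived from the member devices.
 */
struct md_rdev *rdev;
bool p2pdma = true;

/* P2PDMA can only be advertised if every member device supports it. */
rdev_for_each(rdev, mddev)
	if (!blk_queue_pci_p2pdma(bdev_get_queue(rdev->bdev)))
		p2pdma = false;

if (p2pdma)
	lim.features |= BLK_FEAT_PCI_P2PDMA;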

Patch 2/2: NVMe Multipath PCI_P2PDMA support
--------------------------------------------
Adds PCI_P2PDMA support for NVMe multipath devices by setting
BLK_FEAT_PCI_P2PDMA in the queue limits during nvme_mpath_alloc_disk()
when the controller supports P2PDMA operations.

NVMe multipath provides high availability by creating a single block device
(/dev/nvmeXn1) that aggregates multiple paths to the same namespace. This
patch ensures P2PDMA capability is exposed through the multipath device,
enabling peer-to-peer DMA operations on multipath configurations.
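The change amounts to something like the following sketch (illustrative only,
not the actual patch; whether the controller capability check is the
NVME_F_PCI_P2PDMA ops flag or some other attribute is an assumption here):

/*
 * Illustrative sketch only -- not the actual patch. Inside
 * nvme_mpath_alloc_disk(), after the stacking limits are set up,
 * advertise P2PDMA on the multipath queue when the controller
 * supports it (the specific flag check is an assumption).
 */
if (ctrl->ops->flags & NVME_F_PCI_P2PDMA)
	lim.features |= BLK_FEAT_PCI_P2PDMA;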

Summary
=======
All four test scenarios demonstrate that P2PDMA capabilities are correctly
propagated through both the MD RAID0 layer (patch 1/2) and NVMe multipath
layer (patch 2/2). Direct peer-to-peer transfers complete successfully with
full data integrity verification, confirming that:

1. RAID0 devices properly inherit P2PDMA capability from member devices
2. NVMe multipath devices correctly expose P2PDMA support
3. P2P memory buffers can be used for transfers involving both types
4. Data integrity is maintained across all transfer combinations

I've also included the blktests log at the end.

Repo:-

git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux.git

Branch HEAD:-

commit e7cbe110ab3c38b07e5bed91808b7f6a2c328ad6 (origin/for-next)
Merge: 4107f06b64ed 67807fbaf127
Author: Jens Axboe <axboe@kernel.dk>
Date:   Mon Mar 23 07:58:45 2026 -0600

    Merge branch 'for-7.1/block' into for-next

    * for-7.1/block:
      block: fix bio_alloc_bioset slowpath GFP handling

-ck

Chaitanya Kulkarni (2):
  md: Add PCI_P2PDMA support for MD RAID volumes
  nvme-multipath: enable PCI P2PDMA for multipath devices

 drivers/md/md.c               | 7 ++++++-
 drivers/nvme/host/multipath.c | 3 +++
 2 files changed, 9 insertions(+), 1 deletion(-)


* P2P Tests (before PCI_P2PDMA patches) :-

lab@vm70:~/p2pmem-test$ lsblk

NAME    MAJ:MIN RM  SIZE RO TYPE  MOUNTPOINTS
vda     253:0    0   50G  0 disk
├─vda1  253:1    0   49G  0 part  /
├─vda14 253:14   0    4M  0 part
├─vda15 253:15   0  106M  0 part
└─vda16 259:0    0  913M  0 part
nvme1n1 259:2    0   10G  0 disk
nvme2n1 259:3    0   10G  0 disk
└─md127   9:127  0   20G  0 raid0
nvme4n1 259:4    0   10G  0 disk
nvme3n1 259:5    0   10G  0 disk
└─md127   9:127  0   20G  0 raid0


lab@vm70:~/p2pmem-test$ sudo ./p2pmem-test /dev/nvme2n1 /dev/nvme4n1
/sys/bus/pci/devices/0000:0c:00.0/p2pmem/allocate -c 10 -s 1M --check
Running p2pmem-test: reading /dev/nvme2n1 (10.74GB): writing /dev/nvme4n1 (10.74GB): p2pmem buffer /sys/bus/pci/devices/0000:0c:00.0/p2pmem/allocate.
chunk size = 1048576 : number of chunks =  10: total = 10.49MB : thread(s) = 1 : overlap = OFF.
skip-read = OFF : skip-write =  OFF : duration = INF sec.
buffer = 0x7f6ad2125000 (p2pmem): mmap = 1.049MB
PAGE_SIZE = 4096B
checking data with seed = 1773335040
MATCH on data check, 0x3c4f165f = 0x3c4f165f.
Transfer:
 10.49MB in 191.4 ms    54.80MB/s


lab@vm70:~/p2pmem-test$ sudo ./p2pmem-test /dev/nvme3n1 /dev/nvme4n1
                             /sys/bus/pci/devices/0000:0c:00.0/p2pmem/allocate -c 10 -s 1M --check
Running p2pmem-test: reading /dev/nvme3n1 (10.74GB): writing /dev/nvme4n1 (10.74GB): p2pmem buffer /sys/bus/pci/devices/0000:0c:00.0/p2pmem/allocate.
chunk size = 1048576 : number of chunks =  10: total = 10.49MB : thread(s) = 1 : overlap = OFF.
skip-read = OFF : skip-write =  OFF : duration = INF sec.
buffer = 0x7f9a2f478000 (p2pmem): mmap = 1.049MB
PAGE_SIZE = 4096B
checking data with seed = 1773335062
MATCH on data check, 0x6db73ca7 = 0x6db73ca7.
Transfer:
 10.49MB in 489.3 ms    21.43MB/s


lab@vm70:~/p2pmem-test$ sudo ./p2pmem-test /dev/nvme1n1 /dev/nvme3n1
                              /sys/bus/pci/devices/0000:0c:00.0/p2pmem/allocate -c 10 -s 1M --check

Running p2pmem-test: reading /dev/nvme1n1 (10.74GB): writing /dev/nvme3n1 (10.74GB): p2pmem buffer /sys/bus/pci/devices/0000:0c:00.0/p2pmem/allocate.
chunk size = 1048576 : number of chunks =  10: total = 10.49MB : thread(s) = 1 : overlap = OFF.
skip-read = OFF : skip-write =  OFF : duration = INF sec.
buffer = 0x7fdc0cd93000 (p2pmem): mmap = 1.049MB
PAGE_SIZE = 4096B
checking data with seed = 1773334971

* pread: Remote I/O error *

lab@vm70:~/p2pmem-test$ sudo ./p2pmem-test /dev/md127 /dev/nvme4n1
			      /sys/bus/pci/devices/0000:0c:00.0/p2pmem/allocate -c 10 -s 1M --check
Running p2pmem-test: reading /dev/md127 (21.46GB): writing /dev/nvme4n1 (10.74GB): p2pmem buffer /sys/bus/pci/devices/0000:0c:00.0/p2pmem/allocate.
chunk size = 1048576 : number of chunks =  10: total = 10.49MB : thread(s) = 1 : overlap = OFF.
skip-read = OFF : skip-write =  OFF : duration = INF sec.
buffer = 0x7fdddb745000 (p2pmem): mmap = 1.049MB
PAGE_SIZE = 4096B
checking data with seed = 1773334985

* pread: Remote I/O error *

# 7.0 with PCI_P2PDMA patches


lab@vm70:~$ uname -r
7.0.0-rc2-p2pdma
lab@vm70:~$ lsblk
NAME    MAJ:MIN RM  SIZE RO TYPE  MOUNTPOINTS
vda     253:0    0   50G  0 disk
├─vda1  253:1    0   49G  0 part  /
├─vda14 253:14   0    4M  0 part
├─vda15 253:15   0  106M  0 part
└─vda16 259:0    0  913M  0 part
nvme0n1 259:2    0   10G  0 disk
nvme2n1 259:3    0   10G  0 disk
└─md0     9:0    0   20G  0 raid0
nvme3n1 259:4    0   10G  0 disk
└─md0     9:0    0   20G  0 raid0
nvme4n1 259:5    0   10G  0 disk

lab@vm70:~/p2pmem-test$ sudo ./p2pmem-test /dev/nvme0n1 /dev/nvme3n1
                              /sys/bus/pci/devices/0000:0c:00.0/p2pmem/allocate -c 10 -s 1M --check
Running p2pmem-test: reading /dev/nvme0n1 (10.74GB): writing /dev/nvme3n1 (10.74GB): p2pmem buffer /sys/bus/pci/devices/0000:0c:00.0/p2pmem/allocate.
chunk size = 1048576 : number of chunks =  10: total = 10.49MB : thread(s) = 1 : overlap = OFF.
skip-read = OFF : skip-write =  OFF : duration = INF sec.
buffer = 0x7f2ec7006000 (p2pmem): mmap = 1.049MB
PAGE_SIZE = 4096B
checking data with seed = 1773337740
MATCH on data check, 0x1a4324f = 0x1a4324f.
Transfer:
 10.49MB in 318.0 ms    32.98MB/s

lab@vm70:~/p2pmem-test$ sudo ./p2pmem-test /dev/nvme2n1 /dev/nvme3n1
			      /sys/bus/pci/devices/0000:0c:00.0/p2pmem/allocate -c 10 -s 1M --check
Running p2pmem-test: reading /dev/nvme2n1 (10.74GB): writing /dev/nvme3n1 (10.74GB): p2pmem buffer /sys/bus/pci/devices/0000:0c:00.0/p2pmem/allocate.
chunk size = 1048576 : number of chunks =  10: total = 10.49MB : thread(s) = 1 : overlap = OFF.
skip-read = OFF : skip-write =  OFF : duration = INF sec.
buffer = 0x7f194de7e000 (p2pmem): mmap = 1.049MB
PAGE_SIZE = 4096B
checking data with seed = 1773337752
MATCH on data check, 0x57f470e7 = 0x57f470e7.
Transfer:
 10.49MB in 179.1 ms    58.55MB/s

lab@vm70:~/p2pmem-test$ sudo ./p2pmem-test /dev/md0 /dev/nvme3n1
			      /sys/bus/pci/devices/0000:0c:00.0/p2pmem/allocate -c 10 -s 1M --check
Running p2pmem-test: reading /dev/md0 (21.46GB): writing /dev/nvme3n1 (10.74GB): p2pmem buffer /sys/bus/pci/devices/0000:0c:00.0/p2pmem/allocate.
chunk size = 1048576 : number of chunks =  10: total = 10.49MB : thread(s) = 1 : overlap = OFF.
skip-read = OFF : skip-write =  OFF : duration = INF sec.
buffer = 0x7f9972431000 (p2pmem): mmap = 1.049MB
PAGE_SIZE = 4096B
checking data with seed = 1773337761
MATCH on data check, 0x286fdcdb = 0x286fdcdb.
Transfer:
 10.49MB in 243.9 ms    42.99MB/s

lab@vm70:~/p2pmem-test$ sudo ./p2pmem-test /dev/md0 /dev/nvme4n1
                              /sys/bus/pci/devices/0000:0c:00.0/p2pmem/allocate -c 10 -s 1M --check
Running p2pmem-test: reading /dev/md0 (21.46GB): writing /dev/nvme4n1 (10.74GB): p2pmem buffer /sys/bus/pci/devices/0000:0c:00.0/p2pmem/allocate.
chunk size = 1048576 : number of chunks =  10: total = 10.49MB : thread(s) = 1 : overlap = OFF.
skip-read = OFF : skip-write =  OFF : duration = INF sec.
buffer = 0x7f73a098e000 (p2pmem): mmap = 1.049MB
PAGE_SIZE = 4096B
checking data with seed = 1773337770
MATCH on data check, 0x5518d20f = 0x5518d20f.
Transfer:
 10.49MB in 251.7 ms    41.66MB/s


* blktest with these patches :-
29ec80128cdc (HEAD -> for-next) nvme-multipath: enable PCI P2PDMA for multipath devices
b4086566d2f7 md: Add PCI_P2PDMA support for MD RAID volumes
p2pdma-nvme-mpath-md (for-next) # 


blktests (master) # ./test-nvme.sh 
++ for t in loop tcp
++ echo '################NVMET_TRTYPES=loop############'
################NVMET_TRTYPES=loop############
++ NVME_IMG_SIZE=1G
++ NVME_NUM_ITER=1
++ NVMET_TRTYPES=loop
++ ./check nvme
nvme/002 (tr=loop) (create many subsystems and test discovery) [passed]
    runtime  35.359s  ...  34.600s
nvme/003 (tr=loop) (test if we're sending keep-alives to a discovery controller) [passed]
    runtime  10.232s  ...  10.214s
nvme/004 (tr=loop) (test nvme and nvmet UUID NS descriptors) [passed]
    runtime  0.690s  ...  0.643s
nvme/005 (tr=loop) (reset local loopback target)             [passed]
    runtime  1.041s  ...  0.995s
nvme/006 (tr=loop bd=device) (create an NVMeOF target)       [passed]
    runtime  0.093s  ...  0.089s
nvme/006 (tr=loop bd=file) (create an NVMeOF target)         [passed]
    runtime  0.085s  ...  0.083s
nvme/008 (tr=loop bd=device) (create an NVMeOF host)         [passed]
    runtime  0.777s  ...  0.648s
nvme/008 (tr=loop bd=file) (create an NVMeOF host)           [passed]
    runtime  0.728s  ...  0.644s
nvme/010 (tr=loop bd=device) (run data verification fio job) [passed]
    runtime  9.037s  ...  8.917s
nvme/010 (tr=loop bd=file) (run data verification fio job)   [passed]
    runtime  46.257s  ...  43.209s
nvme/012 (tr=loop bd=device) (run mkfs and data verification fio) [passed]
    runtime  47.039s  ...  46.860s
nvme/012 (tr=loop bd=file) (run mkfs and data verification fio) [passed]
    runtime  40.123s  ...  40.060s
nvme/014 (tr=loop bd=device) (flush a command from host)     [passed]
    runtime  8.451s  ...  8.331s
nvme/014 (tr=loop bd=file) (flush a command from host)       [passed]
    runtime  7.805s  ...  7.822s
nvme/016 (tr=loop) (create/delete many NVMeOF block device-backed ns and test discovery) [passed]
    runtime  0.132s  ...  0.139s
nvme/017 (tr=loop) (create/delete many file-ns and test discovery) [passed]
    runtime  0.146s  ...  0.146s
nvme/018 (tr=loop) (unit test NVMe-oF out of range access on a file backend) [passed]
    runtime  0.657s  ...  0.631s
nvme/019 (tr=loop bd=device) (test NVMe DSM Discard command) [passed]
    runtime  0.642s  ...  0.654s
nvme/019 (tr=loop bd=file) (test NVMe DSM Discard command)   [passed]
    runtime  0.648s  ...  0.626s
nvme/021 (tr=loop bd=device) (test NVMe list command)        [passed]
    runtime  0.679s  ...  0.631s
nvme/021 (tr=loop bd=file) (test NVMe list command)          [passed]
    runtime  0.635s  ...  0.646s
nvme/022 (tr=loop bd=device) (test NVMe reset command)       [passed]
    runtime  1.007s  ...  0.993s
nvme/022 (tr=loop bd=file) (test NVMe reset command)         [passed]
    runtime  0.999s  ...  1.005s
nvme/023 (tr=loop bd=device) (test NVMe smart-log command)   [passed]
    runtime  0.650s  ...  0.635s
nvme/023 (tr=loop bd=file) (test NVMe smart-log command)     [passed]
    runtime  0.662s  ...  0.646s
nvme/025 (tr=loop bd=device) (test NVMe effects-log)         [passed]
    runtime  0.665s  ...  0.646s
nvme/025 (tr=loop bd=file) (test NVMe effects-log)           [passed]
    runtime  0.671s  ...  0.639s
nvme/026 (tr=loop bd=device) (test NVMe ns-descs)            [passed]
    runtime  0.716s  ...  0.631s
nvme/026 (tr=loop bd=file) (test NVMe ns-descs)              [passed]
    runtime  0.617s  ...  0.613s
nvme/027 (tr=loop bd=device) (test NVMe ns-rescan command)   [passed]
    runtime  0.656s  ...  0.646s
nvme/027 (tr=loop bd=file) (test NVMe ns-rescan command)     [passed]
    runtime  0.644s  ...  0.659s
nvme/028 (tr=loop bd=device) (test NVMe list-subsys)         [passed]
    runtime  0.636s  ...  0.623s
nvme/028 (tr=loop bd=file) (test NVMe list-subsys)           [passed]
    runtime  0.635s  ...  0.621s
nvme/029 (tr=loop) (test userspace IO via nvme-cli read/write interface) [passed]
    runtime  0.954s  ...  0.946s
nvme/030 (tr=loop) (ensure the discovery generation counter is updated appropriately) [passed]
    runtime  0.428s  ...  0.421s
nvme/031 (tr=loop) (test deletion of NVMeOF controllers immediately after setup) [passed]
    runtime  5.984s  ...  5.933s
nvme/038 (tr=loop) (test deletion of NVMeOF subsystem without enabling) [passed]
    runtime  0.034s  ...  0.033s
nvme/040 (tr=loop) (test nvme fabrics controller reset/disconnect operation during I/O) [passed]
    runtime  7.068s  ...  7.012s
nvme/041 (tr=loop) (Create authenticated connections)        [passed]
    runtime  0.705s  ...  0.949s
nvme/042 (tr=loop) (Test dhchap key types for authenticated connections) [passed]
    runtime  4.007s  ...  4.295s
nvme/043 (tr=loop) (Test hash and DH group variations for authenticated connections) [passed]
    runtime  5.191s  ...  6.744s
nvme/044 (tr=loop) (Test bi-directional authentication)      [passed]
    runtime  1.307s  ...  2.293s
nvme/045 (tr=loop) (Test re-authentication)                  [passed]
    runtime  1.639s  ...  1.624s
nvme/047 (tr=loop) (test different queue types for fabric transports) [not run]
    nvme_trtype=loop is not supported in this test
nvme/048 (tr=loop) (Test queue count changes on reconnect)   [not run]
    nvme_trtype=loop is not supported in this test
nvme/051 (tr=loop) (test nvmet concurrent ns enable/disable) [passed]
    runtime  1.395s  ...  2.673s
nvme/052 (tr=loop) (Test file-ns creation/deletion under one subsystem) [passed]
    runtime  6.288s  ...  6.832s
nvme/054 (tr=loop) (Test the NVMe reservation feature)       [passed]
    runtime  0.806s  ...  1.287s
nvme/055 (tr=loop) (Test nvme write to a loop target ns just after ns is disabled) [not run]
    kernel option DEBUG_ATOMIC_SLEEP has not been enabled
nvme/056 (tr=loop) (enable zero copy offload and run rw traffic) [not run]
    Remote target required but NVME_TARGET_CONTROL is not set
    nvme_trtype=loop is not supported in this test
    kernel option ULP_DDP has not been enabled
    module nvme_tcp does not have parameter ddp_offload
    KERNELSRC not set
    Kernel sources do not have tools/net/ynl/cli.py
    NVME_IFACE not set
nvme/057 (tr=loop) (test nvme fabrics controller ANA failover during I/O) [passed]
    runtime  27.321s  ...  27.487s
nvme/058 (tr=loop) (test rapid namespace remapping)          [passed]
    runtime  4.709s  ...  4.932s
nvme/060 (tr=loop) (test nvme fabrics target reset)          [not run]
    nvme_trtype=loop is not supported in this test
nvme/061 (tr=loop) (test fabric target teardown and setup during I/O) [not run]
    nvme_trtype=loop is not supported in this test
nvme/062 (tr=loop) (Create TLS-encrypted connections)        [not run]
    nvme_trtype=loop is not supported in this test
nvme/063 (tr=loop) (Create authenticated TCP connections with secure concatenation) [not run]
    nvme_trtype=loop is not supported in this test
nvme/065 (test unmap write zeroes sysfs interface with nvmet devices) [passed]
    runtime  2.349s  ...  2.495s
++ for t in loop tcp
++ echo '################NVMET_TRTYPES=tcp############'
################NVMET_TRTYPES=tcp############
++ NVME_IMG_SIZE=1G
++ NVME_NUM_ITER=1
++ NVMET_TRTYPES=tcp
++ ./check nvme
nvme/002 (tr=tcp) (create many subsystems and test discovery) [not run]
    nvme_trtype=tcp is not supported in this test
nvme/003 (tr=tcp) (test if we're sending keep-alives to a discovery controller) [passed]
    runtime  10.244s  ...  10.249s
nvme/004 (tr=tcp) (test nvme and nvmet UUID NS descriptors)  [passed]
    runtime  0.379s  ...  0.380s
nvme/005 (tr=tcp) (reset local loopback target)              [passed]
    runtime  0.451s  ...  0.437s
nvme/006 (tr=tcp bd=device) (create an NVMeOF target)        [passed]
    runtime  0.097s  ...  0.101s
nvme/006 (tr=tcp bd=file) (create an NVMeOF target)          [passed]
    runtime  0.093s  ...  0.097s
nvme/008 (tr=tcp bd=device) (create an NVMeOF host)          [passed]
    runtime  0.389s  ...  0.369s
nvme/008 (tr=tcp bd=file) (create an NVMeOF host)            [passed]
    runtime  0.377s  ...  0.379s
nvme/010 (tr=tcp bd=device) (run data verification fio job)  [passed]
    runtime  77.447s  ...  76.703s
nvme/010 (tr=tcp bd=file) (run data verification fio job)    [passed]
    runtime  120.190s  ...  121.711s
nvme/012 (tr=tcp bd=device) (run mkfs and data verification fio) [passed]
    runtime  82.928s  ...  82.706s
nvme/012 (tr=tcp bd=file) (run mkfs and data verification fio) [passed]
    runtime    ...  118.897s
nvme/014 (tr=tcp bd=device) (flush a command from host)      [passed]
    runtime  7.825s  ...  8.880s
nvme/014 (tr=tcp bd=file) (flush a command from host)        [passed]
    runtime  7.862s  ...  8.690s
nvme/016 (tr=tcp) (create/delete many NVMeOF block device-backed ns and test discovery) [not run]
    nvme_trtype=tcp is not supported in this test
nvme/017 (tr=tcp) (create/delete many file-ns and test discovery) [not run]
    nvme_trtype=tcp is not supported in this test
nvme/018 (tr=tcp) (unit test NVMe-oF out of range access on a file backend) [passed]
    runtime  0.355s  ...  0.357s
nvme/019 (tr=tcp bd=device) (test NVMe DSM Discard command)  [passed]
    runtime  0.364s  ...  0.369s
nvme/019 (tr=tcp bd=file) (test NVMe DSM Discard command)    [passed]
    runtime  0.352s  ...  0.345s
nvme/021 (tr=tcp bd=device) (test NVMe list command)         [passed]
    runtime  0.366s  ...  0.384s
nvme/021 (tr=tcp bd=file) (test NVMe list command)           [passed]
    runtime  0.375s  ...  0.377s
nvme/022 (tr=tcp bd=device) (test NVMe reset command)        [passed]
    runtime  0.464s  ...  0.465s
nvme/022 (tr=tcp bd=file) (test NVMe reset command)          [passed]
    runtime  0.452s  ...  0.460s
nvme/023 (tr=tcp bd=device) (test NVMe smart-log command)    [passed]
    runtime  0.377s  ...  0.375s
nvme/023 (tr=tcp bd=file) (test NVMe smart-log command)      [passed]
    runtime  0.352s  ...  0.355s
nvme/025 (tr=tcp bd=device) (test NVMe effects-log)          [passed]
    runtime  0.382s  ...  0.376s
nvme/025 (tr=tcp bd=file) (test NVMe effects-log)            [passed]
    runtime  0.365s  ...  0.378s
nvme/026 (tr=tcp bd=device) (test NVMe ns-descs)             [passed]
    runtime  0.361s  ...  0.361s
nvme/026 (tr=tcp bd=file) (test NVMe ns-descs)               [passed]
    runtime  0.362s  ...  0.354s
nvme/027 (tr=tcp bd=device) (test NVMe ns-rescan command)    [passed]
    runtime  0.385s  ...  0.390s
nvme/027 (tr=tcp bd=file) (test NVMe ns-rescan command)      [passed]
    runtime  0.390s  ...  0.398s
nvme/028 (tr=tcp bd=device) (test NVMe list-subsys)          [passed]
    runtime  0.353s  ...  0.368s
nvme/028 (tr=tcp bd=file) (test NVMe list-subsys)            [passed]
    runtime  0.357s  ...  0.346s
nvme/029 (tr=tcp) (test userspace IO via nvme-cli read/write interface) [passed]
    runtime  0.695s  ...  0.729s
nvme/030 (tr=tcp) (ensure the discovery generation counter is updated appropriately) [passed]
    runtime  0.400s  ...  0.401s
nvme/031 (tr=tcp) (test deletion of NVMeOF controllers immediately after setup) [passed]
    runtime  3.016s  ...  3.074s
nvme/038 (tr=tcp) (test deletion of NVMeOF subsystem without enabling) [passed]
    runtime  0.042s  ...  0.042s
nvme/040 (tr=tcp) (test nvme fabrics controller reset/disconnect operation during I/O) [passed]
    runtime  6.452s  ...  6.454s
nvme/041 (tr=tcp) (Create authenticated connections)         [passed]
    runtime  0.417s  ...  0.426s
nvme/042 (tr=tcp) (Test dhchap key types for authenticated connections) [passed]
    runtime  1.805s  ...  1.820s
nvme/043 (tr=tcp) (Test hash and DH group variations for authenticated connections) [passed]
    runtime  2.411s  ...  2.432s
nvme/044 (tr=tcp) (Test bi-directional authentication)       [passed]
    runtime  0.739s  ...  0.741s
nvme/045 (tr=tcp) (Test re-authentication)                   [passed]
    runtime  1.338s  ...  1.326s
nvme/047 (tr=tcp) (test different queue types for fabric transports) [passed]
    runtime  1.864s  ...  1.856s
nvme/048 (tr=tcp) (Test queue count changes on reconnect)    [passed]
    runtime  5.535s  ...  5.553s
nvme/051 (tr=tcp) (test nvmet concurrent ns enable/disable)  [passed]
    runtime  1.344s  ...  1.432s
nvme/052 (tr=tcp) (Test file-ns creation/deletion under one subsystem) [not run]
    nvme_trtype=tcp is not supported in this test
nvme/054 (tr=tcp) (Test the NVMe reservation feature)        [passed]
    runtime  0.484s  ...  0.508s
nvme/055 (tr=tcp) (Test nvme write to a loop target ns just after ns is disabled) [not run]
    nvme_trtype=tcp is not supported in this test
    kernel option DEBUG_ATOMIC_SLEEP has not been enabled
nvme/056 (tr=tcp) (enable zero copy offload and run rw traffic) [not run]
    Remote target required but NVME_TARGET_CONTROL is not set
    kernel option ULP_DDP has not been enabled
    module nvme_tcp does not have parameter ddp_offload
    KERNELSRC not set
    Kernel sources do not have tools/net/ynl/cli.py
    NVME_IFACE not set
nvme/057 (tr=tcp) (test nvme fabrics controller ANA failover during I/O) [passed]
    runtime  25.951s  ...  25.962s
nvme/058 (tr=tcp) (test rapid namespace remapping)           [passed]
    runtime  2.935s  ...  3.025s
nvme/060 (tr=tcp) (test nvme fabrics target reset)           [passed]
    runtime  19.400s  ...  19.329s
nvme/061 (tr=tcp) (test fabric target teardown and setup during I/O) [passed]
    runtime  8.566s  ...  8.596s
nvme/062 (tr=tcp) (Create TLS-encrypted connections)         [failed]
    runtime  5.221s  ...  5.232s
    --- tests/nvme/062.out	2026-01-28 12:04:48.888356244 -0800
    +++ /mnt/sda/blktests/results/nodev_tr_tcp/nvme/062.out.bad	2026-03-23 16:32:31.377957706 -0700
    @@ -2,9 +2,13 @@
     Test unencrypted connection w/ tls not required
     disconnected 1 controller(s)
     Test encrypted connection w/ tls not required
    -disconnected 1 controller(s)
    +FAIL: nvme connect return error code
    +WARNING: connection is not encrypted
    +disconnected 0 controller(s)
    ...
    (Run 'diff -u tests/nvme/062.out /mnt/sda/blktests/results/nodev_tr_tcp/nvme/062.out.bad' to see the entire diff)
nvme/063 (tr=tcp) (Create authenticated TCP connections with secure concatenation) [passed]
    runtime  1.907s  ...  1.969s
nvme/065 (test unmap write zeroes sysfs interface with nvmet devices) [passed]
    runtime  2.495s  ...  2.314s
++ ./manage-rdma-nvme.sh --cleanup
====== RDMA NVMe Cleanup ======

[INFO] Disconnecting NVMe RDMA controllers...
[INFO] No NVMe RDMA controllers to disconnect
[INFO] Removing RDMA links...
[INFO] No RDMA links to remove
[INFO] Unloading NVMe RDMA modules...
[INFO] NVMe RDMA modules unloaded successfully
[INFO] Unloading soft-RDMA modules...
[INFO] Soft-RDMA modules unloaded successfully

[INFO] Verifying cleanup...
[INFO] Verification passed
[INFO] RDMA cleanup completed successfully


====== RDMA Network Configuration Status ======

Loaded Modules:
  None

RDMA Links:
  None

Network Interfaces (RDMA-capable):
  None

blktests Configuration:
  Not configured (run --setup first)

NVMe RDMA Controllers:
  None

=================================================
++ ./manage-rdma-nvme.sh --setup
====== RDMA NVMe Setup ======
RDMA Type: siw
Interface: auto-detect

[INFO] Checking prerequisites...
[INFO] Prerequisites check passed
[INFO] Loading RDMA module: siw
[INFO] Module siw loaded successfully
[INFO] Creating RDMA links...
[INFO] Creating RDMA link: ens5_siw
[INFO] Created RDMA link: ens5_siw -> ens5
++ ./manage-rdma-nvme.sh --status
====== RDMA Configuration Status ======

====== RDMA Network Configuration Status ======

Loaded Modules:
  siw                      217088

RDMA Links:
  link ens5_siw/1 state ACTIVE physical_state LINK_UP netdev ens5 

Network Interfaces (RDMA-capable):
  Interface: ens5
    IPv4: 192.168.0.46
    IPv6: fe80::5054:98ff:fe76:5440%ens5

blktests Configuration:
  Transport Address: 192.168.0.46:4420
  Transport Type: rdma
  Command: NVMET_TRTYPES=rdma ./check nvme/

NVMe RDMA Controllers:
  None

=================================================
++ echo '################NVMET_TRTYPES=rdma############'
################NVMET_TRTYPES=rdma############
++ NVME_IMG_SIZE=1G
++ NVME_NUM_ITER=1
++ nvme_trtype=rdma
++ ./check nvme
nvme/002 (tr=rdma) (create many subsystems and test discovery) [not run]
    nvme_trtype=rdma is not supported in this test
nvme/003 (tr=rdma) (test if we're sending keep-alives to a discovery controller) [passed]
    runtime  10.327s  ...  10.331s
nvme/004 (tr=rdma) (test nvme and nvmet UUID NS descriptors) [passed]
    runtime  0.689s  ...  0.689s
nvme/005 (tr=rdma) (reset local loopback target)             [passed]
    runtime  0.979s  ...  1.006s
nvme/006 (tr=rdma bd=device) (create an NVMeOF target)       [passed]
    runtime  0.147s  ...  0.143s
nvme/006 (tr=rdma bd=file) (create an NVMeOF target)         [passed]
    runtime  0.136s  ...  0.137s
nvme/008 (tr=rdma bd=device) (create an NVMeOF host)         [passed]
    runtime  0.691s  ...  0.702s
nvme/008 (tr=rdma bd=file) (create an NVMeOF host)           [passed]
    runtime  0.671s  ...  0.698s
nvme/010 (tr=rdma bd=device) (run data verification fio job) [passed]
    runtime  29.999s  ...  34.266s
nvme/010 (tr=rdma bd=file) (run data verification fio job)   [passed]
    runtime  59.734s  ...  62.762s
nvme/012 (tr=rdma bd=device) (run mkfs and data verification fio) [passed]
    runtime  34.220s  ...  41.940s
nvme/012 (tr=rdma bd=file) (run mkfs and data verification fio) [passed]
    runtime  58.929s  ...  71.674s
nvme/014 (tr=rdma bd=device) (flush a command from host)     [passed]
    runtime  7.554s  ...  8.508s
nvme/014 (tr=rdma bd=file) (flush a command from host)       [passed]
    runtime  7.404s  ...  8.398s
nvme/016 (tr=rdma) (create/delete many NVMeOF block device-backed ns and test discovery) [not run]
    nvme_trtype=rdma is not supported in this test
nvme/017 (tr=rdma) (create/delete many file-ns and test discovery) [not run]
    nvme_trtype=rdma is not supported in this test
nvme/018 (tr=rdma) (unit test NVMe-oF out of range access on a file backend) [passed]
    runtime  0.671s  ...  0.671s
nvme/019 (tr=rdma bd=device) (test NVMe DSM Discard command) [passed]
    runtime  0.679s  ...  0.687s
nvme/019 (tr=rdma bd=file) (test NVMe DSM Discard command)   [passed]
    runtime  0.672s  ...  0.671s
nvme/021 (tr=rdma bd=device) (test NVMe list command)        [passed]
    runtime  0.647s  ...  0.686s
nvme/021 (tr=rdma bd=file) (test NVMe list command)          [passed]
    runtime  0.662s  ...  0.704s
nvme/022 (tr=rdma bd=device) (test NVMe reset command)       [passed]
    runtime  1.009s  ...  1.029s
nvme/022 (tr=rdma bd=file) (test NVMe reset command)         [passed]
    runtime  0.980s  ...  1.028s
nvme/023 (tr=rdma bd=device) (test NVMe smart-log command)   [passed]
    runtime  0.658s  ...  0.671s
nvme/023 (tr=rdma bd=file) (test NVMe smart-log command)     [passed]
    runtime  0.649s  ...  0.665s
nvme/025 (tr=rdma bd=device) (test NVMe effects-log)         [passed]
    runtime  0.683s  ...  0.698s
nvme/025 (tr=rdma bd=file) (test NVMe effects-log)           [passed]
    runtime  0.658s  ...  0.694s
nvme/026 (tr=rdma bd=device) (test NVMe ns-descs)            [passed]
    runtime  0.645s  ...  0.680s
nvme/026 (tr=rdma bd=file) (test NVMe ns-descs)              [passed]
    runtime  0.662s  ...  0.679s
nvme/027 (tr=rdma bd=device) (test NVMe ns-rescan command)   [passed]
    runtime  0.702s  ...  0.708s
nvme/027 (tr=rdma bd=file) (test NVMe ns-rescan command)     [passed]
    runtime  0.699s  ...  0.713s
nvme/028 (tr=rdma bd=device) (test NVMe list-subsys)         [passed]
    runtime  0.639s  ...  0.670s
nvme/028 (tr=rdma bd=file) (test NVMe list-subsys)           [passed]
    runtime  0.651s  ...  0.658s
nvme/029 (tr=rdma) (test userspace IO via nvme-cli read/write interface) [passed]
    runtime  1.048s  ...  1.045s
nvme/030 (tr=rdma) (ensure the discovery generation counter is updated appropriately) [passed]
    runtime  0.508s  ...  0.552s
nvme/031 (tr=rdma) (test deletion of NVMeOF controllers immediately after setup) [passed]
    runtime  5.746s  ...  5.820s
nvme/038 (tr=rdma) (test deletion of NVMeOF subsystem without enabling) [passed]
    runtime  0.086s  ...  0.086s
nvme/040 (tr=rdma) (test nvme fabrics controller reset/disconnect operation during I/O) [passed]
    runtime  7.008s  ...  7.010s
nvme/041 (tr=rdma) (Create authenticated connections)        [passed]
    runtime  0.732s  ...  0.783s
nvme/042 (tr=rdma) (Test dhchap key types for authenticated connections) [passed]
    runtime  3.669s  ...  3.780s
nvme/043 (tr=rdma) (Test hash and DH group variations for authenticated connections) [passed]
    runtime  4.683s  ...  4.689s
nvme/044 (tr=rdma) (Test bi-directional authentication)      [passed]
    runtime  1.342s  ...  1.306s
nvme/045 (tr=rdma) (Test re-authentication)                  [passed]
    runtime  1.809s  ...  1.825s
nvme/047 (tr=rdma) (test different queue types for fabric transports) [passed]
    runtime  2.675s  ...  2.638s
nvme/048 (tr=rdma) (Test queue count changes on reconnect)   [passed]
    runtime  6.818s  ...  5.799s
nvme/051 (tr=rdma) (test nvmet concurrent ns enable/disable) [passed]
    runtime  1.389s  ...  1.465s
nvme/052 (tr=rdma) (Test file-ns creation/deletion under one subsystem) [not run]
    nvme_trtype=rdma is not supported in this test
nvme/054 (tr=rdma) (Test the NVMe reservation feature)       [passed]
    runtime  0.802s  ...  0.804s
nvme/055 (tr=rdma) (Test nvme write to a loop target ns just after ns is disabled) [not run]
    nvme_trtype=rdma is not supported in this test
    kernel option DEBUG_ATOMIC_SLEEP has not been enabled
nvme/056 (tr=rdma) (enable zero copy offload and run rw traffic) [not run]
    Remote target required but NVME_TARGET_CONTROL is not set
    nvme_trtype=rdma is not supported in this test
    kernel option ULP_DDP has not been enabled
    module nvme_tcp does not have parameter ddp_offload
    KERNELSRC not set
    Kernel sources do not have tools/net/ynl/cli.py
    NVME_IFACE not set
nvme/057 (tr=rdma) (test nvme fabrics controller ANA failover during I/O) [passed]
    runtime  26.945s  ...  26.949s
nvme/058 (tr=rdma) (test rapid namespace remapping)          [passed]
    runtime  4.293s  ...  4.336s
nvme/060 (tr=rdma) (test nvme fabrics target reset)          [passed]
    runtime  20.730s  ...  20.756s
nvme/061 (tr=rdma) (test fabric target teardown and setup during I/O) [passed]
    runtime  15.686s  ...  15.514s
nvme/062 (tr=rdma) (Create TLS-encrypted connections)        [not run]
    nvme_trtype=rdma is not supported in this test
nvme/063 (tr=rdma) (Create authenticated TCP connections with secure concatenation) [not run]
    nvme_trtype=rdma is not supported in this test
nvme/065 (test unmap write zeroes sysfs interface with nvmet devices) [passed]
    runtime  2.314s  ...  2.336s
++ ./manage-rdma-nvme.sh --cleanup
====== RDMA NVMe Cleanup ======

[INFO] Disconnecting NVMe RDMA controllers...
[INFO] No NVMe RDMA controllers to disconnect
[INFO] Removing RDMA links...
[INFO] No RDMA links to remove
[INFO] Unloading NVMe RDMA modules...
[INFO] NVMe RDMA modules unloaded successfully
[INFO] Unloading soft-RDMA modules...
[INFO] Soft-RDMA modules unloaded successfully

[INFO] Verifying cleanup...
[INFO] Verification passed
[INFO] RDMA cleanup completed successfully


====== RDMA Network Configuration Status ======

Loaded Modules:
  None

RDMA Links:
  None

Network Interfaces (RDMA-capable):
  None

blktests Configuration:
  Not configured (run --setup first)

NVMe RDMA Controllers:
  None

==========================



-- 
2.39.5


^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH 1/2] md: Add PCI_P2PDMA support for MD RAID volumes
  2026-03-23 23:44 [PATCH 0/2] Enable PCI P2PDMA support for RAID0 and NVMe Multipath Chaitanya Kulkarni
@ 2026-03-23 23:44 ` Chaitanya Kulkarni
  2026-03-24  6:48   ` Christoph Hellwig
  2026-03-23 23:44 ` [PATCH 2/2] nvme-multipath: enable PCI P2PDMA for multipath devices Chaitanya Kulkarni
  1 sibling, 1 reply; 9+ messages in thread
From: Chaitanya Kulkarni @ 2026-03-23 23:44 UTC (permalink / raw)
  To: song, yukuai, linan122, kbusch, axboe, hch, sagi
  Cc: linux-raid, linux-nvme, Chaitanya Kulkarni, Kiran Kumar Modukuri

MD RAID does not propagate BLK_FEAT_PCI_P2PDMA from member devices to
the RAID device, preventing peer-to-peer DMA through the RAID layer even
when all underlying devices support it.

Enable BLK_FEAT_PCI_P2PDMA by default in md_init_stacking_limits() and
clear it in mddev_stack_rdev_limits() during array init and 
mddev_stack_new_rdev() during hot-add if any member device lacks support.

Tested with RAID arrays containing multiple NVMe devices with P2PDMA
support, confirming that peer-to-peer transfers work correctly through
the RAID layer.

Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Kiran Kumar Modukuri <kmodukuri@nvidia.com>
---
 drivers/md/md.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 521d9b34cd9e..a151ea86d844 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -6176,6 +6176,8 @@ int mddev_stack_rdev_limits(struct mddev *mddev, struct queue_limits *lim,
 		if ((flags & MDDEV_STACK_INTEGRITY) &&
 		    !queue_limits_stack_integrity_bdev(lim, rdev->bdev))
 			return -EINVAL;
+		if (!blk_queue_pci_p2pdma(rdev->bdev->bd_disk->queue))
+			lim->features &= ~BLK_FEAT_PCI_P2PDMA;
 	}
 
 	/*
@@ -6231,6 +6233,8 @@ int mddev_stack_new_rdev(struct mddev *mddev, struct md_rdev *rdev)
 	lim = queue_limits_start_update(mddev->gendisk->queue);
 	queue_limits_stack_bdev(&lim, rdev->bdev, rdev->data_offset,
 				mddev->gendisk->disk_name);
+	if (!blk_queue_pci_p2pdma(rdev->bdev->bd_disk->queue))
+		lim.features &= ~BLK_FEAT_PCI_P2PDMA;
 
 	if (!queue_limits_stack_integrity_bdev(&lim, rdev->bdev)) {
 		pr_err("%s: incompatible integrity profile for %pg\n",
@@ -6272,7 +6276,8 @@ void md_init_stacking_limits(struct queue_limits *lim)
 {
 	blk_set_stacking_limits(lim);
 	lim->features = BLK_FEAT_WRITE_CACHE | BLK_FEAT_FUA |
-			BLK_FEAT_IO_STAT | BLK_FEAT_NOWAIT;
+			BLK_FEAT_IO_STAT | BLK_FEAT_NOWAIT |
+			BLK_FEAT_PCI_P2PDMA;
 }
 EXPORT_SYMBOL_GPL(md_init_stacking_limits);
 
-- 
2.39.5


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH 2/2] nvme-multipath: enable PCI P2PDMA for multipath devices
  2026-03-23 23:44 [PATCH 0/2] Enable PCI P2PDMA support for RAID0 and NVMe Multipath Chaitanya Kulkarni
  2026-03-23 23:44 ` [PATCH 1/2] md: Add PCI_P2PDMA support for MD RAID volumes Chaitanya Kulkarni
@ 2026-03-23 23:44 ` Chaitanya Kulkarni
  2026-03-24  6:49   ` Christoph Hellwig
  1 sibling, 1 reply; 9+ messages in thread
From: Chaitanya Kulkarni @ 2026-03-23 23:44 UTC (permalink / raw)
  To: song, yukuai, linan122, kbusch, axboe, hch, sagi
  Cc: linux-raid, linux-nvme, Chaitanya Kulkarni, Kiran Kumar Modukuri

NVMe multipath does not expose BLK_FEAT_PCI_P2PDMA on the head disk
even when the underlying controller supports it.

Set BLK_FEAT_PCI_P2PDMA in nvme_mpath_alloc_disk() when the controller
advertises P2PDMA support via ctrl->ops->supports_pci_p2pdma.

Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Kiran Kumar Modukuri <kmodukuri@nvidia.com>
---
 drivers/nvme/host/multipath.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index ba00f0b72b85..c49fca43ef19 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -737,6 +737,9 @@ int nvme_mpath_alloc_disk(struct nvme_ctrl *ctrl, struct nvme_ns_head *head)
 		BLK_FEAT_POLL | BLK_FEAT_ATOMIC_WRITES;
 	if (head->ids.csi == NVME_CSI_ZNS)
 		lim.features |= BLK_FEAT_ZONED;
+	if (ctrl->ops && ctrl->ops->supports_pci_p2pdma &&
+	    ctrl->ops->supports_pci_p2pdma(ctrl))
+		lim.features |= BLK_FEAT_PCI_P2PDMA;
 
 	head->disk = blk_alloc_disk(&lim, ctrl->numa_node);
 	if (IS_ERR(head->disk))
-- 
2.39.5


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH 1/2] md: Add PCI_P2PDMA support for MD RAID volumes
  2026-03-23 23:44 ` [PATCH 1/2] md: Add PCI_P2PDMA support for MD RAID volumes Chaitanya Kulkarni
@ 2026-03-24  6:48   ` Christoph Hellwig
       [not found]     ` <DM4PR12MB8473B3907CAF51AEBAE573E5D548A@DM4PR12MB8473.namprd12.prod.outlook.com>
  0 siblings, 1 reply; 9+ messages in thread
From: Christoph Hellwig @ 2026-03-24  6:48 UTC (permalink / raw)
  To: Chaitanya Kulkarni
  Cc: song, yukuai, linan122, kbusch, axboe, hch, sagi, linux-raid,
	linux-nvme, Kiran Kumar Modukuri

On Mon, Mar 23, 2026 at 04:44:15PM -0700, Chaitanya Kulkarni wrote:
> MD RAID does not propagate BLK_FEAT_PCI_P2PDMA from member devices to
> the RAID device, preventing peer-to-peer DMA through the RAID layer even
> when all underlying devices support it.
> 
> Enable BLK_FEAT_PCI_P2PDMA by default in md_init_stacking_limits() and
> clear it in mddev_stack_rdev_limits() during array init and 
> mddev_stack_new_rdev() during hot-add if any member device lacks support.
> 
> Tested with RAID arrays containing multiple NVMe devices with P2PDMA
> support, confirming that peer-to-peer transfers work correctly through
> the RAID layer.

Which personalities did you test this with?  Parity RAID needs to
copy from the data payload, which is not going to work very well
with P2P mappings.


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 2/2] nvme-multipath: enable PCI P2PDMA for multipath devices
  2026-03-23 23:44 ` [PATCH 2/2] nvme-multipath: enable PCI P2PDMA for multipath devices Chaitanya Kulkarni
@ 2026-03-24  6:49   ` Christoph Hellwig
  2026-03-24 14:40     ` Keith Busch
  0 siblings, 1 reply; 9+ messages in thread
From: Christoph Hellwig @ 2026-03-24  6:49 UTC (permalink / raw)
  To: Chaitanya Kulkarni
  Cc: song, yukuai, linan122, kbusch, axboe, hch, sagi, linux-raid,
	linux-nvme, Kiran Kumar Modukuri

On Mon, Mar 23, 2026 at 04:44:16PM -0700, Chaitanya Kulkarni wrote:
> NVMe multipath does not expose BLK_FEAT_PCI_P2PDMA on the head disk
> even when the underlying controller supports it.
> 
> Set BLK_FEAT_PCI_P2PDMA in nvme_mpath_alloc_disk() when the controller
> advertises P2PDMA support via ctrl->ops->supports_pci_p2pdma.
> 
> Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
> Signed-off-by: Kiran Kumar Modukuri <kmodukuri@nvidia.com>

This signoff chain is wrong - you are both author and patch submitter,
so other signoffs should not appear.

Otherwise looks good:

> diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
> index ba00f0b72b85..c49fca43ef19 100644
> --- a/drivers/nvme/host/multipath.c
> +++ b/drivers/nvme/host/multipath.c
> @@ -737,6 +737,9 @@ int nvme_mpath_alloc_disk(struct nvme_ctrl *ctrl, struct nvme_ns_head *head)
>  		BLK_FEAT_POLL | BLK_FEAT_ATOMIC_WRITES;
>  	if (head->ids.csi == NVME_CSI_ZNS)
>  		lim.features |= BLK_FEAT_ZONED;
> +	if (ctrl->ops && ctrl->ops->supports_pci_p2pdma &&
> +	    ctrl->ops->supports_pci_p2pdma(ctrl))
> +		lim.features |= BLK_FEAT_PCI_P2PDMA;

This assumes all controllers support P2P, but we allow matching
over different transports.  So you'll need to do the same scheme
as for MD RAID that checks that every member supports P2P.


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 2/2] nvme-multipath: enable PCI P2PDMA for multipath devices
  2026-03-24  6:49   ` Christoph Hellwig
@ 2026-03-24 14:40     ` Keith Busch
  2026-03-25  3:50       ` Chaitanya Kulkarni
  0 siblings, 1 reply; 9+ messages in thread
From: Keith Busch @ 2026-03-24 14:40 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Chaitanya Kulkarni, song, yukuai, linan122, axboe, sagi,
	linux-raid, linux-nvme, Kiran Kumar Modukuri

On Tue, Mar 24, 2026 at 07:49:37AM +0100, Christoph Hellwig wrote:
> On Mon, Mar 23, 2026 at 04:44:16PM -0700, Chaitanya Kulkarni wrote:
> > diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
> > index ba00f0b72b85..c49fca43ef19 100644
> > --- a/drivers/nvme/host/multipath.c
> > +++ b/drivers/nvme/host/multipath.c
> > @@ -737,6 +737,9 @@ int nvme_mpath_alloc_disk(struct nvme_ctrl *ctrl, struct nvme_ns_head *head)
> >  		BLK_FEAT_POLL | BLK_FEAT_ATOMIC_WRITES;
> >  	if (head->ids.csi == NVME_CSI_ZNS)
> >  		lim.features |= BLK_FEAT_ZONED;
> > +	if (ctrl->ops && ctrl->ops->supports_pci_p2pdma &&
> > +	    ctrl->ops->supports_pci_p2pdma(ctrl))
> > +		lim.features |= BLK_FEAT_PCI_P2PDMA;
> 
> This assumes all controllers support P2P, but we allow matching
> over different transports.  So you'll need to do the same scheme
> as for MD RAID that checks that every member supports P2P.

If that is a possible setup, then you could add a path that is non-P2P
capable sometime after the MD volume was setup with P2P supported, so
that case might need special handling to notify the stacking device of
the new limits.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 1/2] md: Add PCI_P2PDMA support for MD RAID volumes
       [not found]     ` <DM4PR12MB8473B3907CAF51AEBAE573E5D548A@DM4PR12MB8473.namprd12.prod.outlook.com>
@ 2026-03-24 21:29       ` Keith Busch
       [not found]         ` <DM4PR12MB84736C00E876FD160464EB35D548A@DM4PR12MB8473.namprd12.prod.outlook.com>
  0 siblings, 1 reply; 9+ messages in thread
From: Keith Busch @ 2026-03-24 21:29 UTC (permalink / raw)
  To: Kiran Modukuri
  Cc: Christoph Hellwig, Chaitanya Kulkarni, song@kernel.org,
	yukuai@fnnas.com, linan122@huawei.com, axboe@kernel.dk,
	sagi@grimberg.me, linux-raid@vger.kernel.org,
	linux-nvme@lists.infradead.org

On Tue, Mar 24, 2026 at 09:13:55PM +0000, Kiran Modukuri wrote:
> Hi Christoph,
> 
> We tested with RAID0, RAID1, and RAID10 only. You're right that parity
> RAID personalities need CPU access to data pages for XOR/parity
> computation, which won't work with P2P mappings.
> 
> 
> We'll send a v2 that moves BLK_FEAT_PCI_P2PDMA out of
> md_init_stacking_limits() and instead has raid0, raid1 and raid10
> opt in individually during their queue limits setup. raid4/5/6 will
> not set the flag.

I think the parity could work with P2P memory if the calculation is
offloaded to a dma_async_tx. It doesn't look like we necessarily know if
any particular xor is going to get offloaded, though.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 1/2] md: Add PCI_P2PDMA support for MD RAID volumes
       [not found]         ` <DM4PR12MB84736C00E876FD160464EB35D548A@DM4PR12MB8473.namprd12.prod.outlook.com>
@ 2026-03-24 22:08           ` Keith Busch
  0 siblings, 0 replies; 9+ messages in thread
From: Keith Busch @ 2026-03-24 22:08 UTC (permalink / raw)
  To: Kiran Modukuri
  Cc: Christoph Hellwig, Chaitanya Kulkarni, song@kernel.org,
	yukuai@fnnas.com, linan122@huawei.com, axboe@kernel.dk,
	sagi@grimberg.me, linux-raid@vger.kernel.org,
	linux-nvme@lists.infradead.org

On Tue, Mar 24, 2026 at 09:43:15PM +0000, Kiran Modukuri wrote:
> Hi Keith,
> 
> So do you suggest we leave the current patch as-is without restricting
> P2PDMA support for RAID4/5, or restrict the support to RAID0, RAID1,
> and RAID10 only?

I think you currently have to remove it. If you want to do parity
against P2P memory (which sounds like a nice feature to me), then
you'd have to introduce a prep patch to detect that an xor dma offload
exists for the raid volume to use, and do something to ensure it will
always get offloaded for P2P memory instead of falling back to the
CPU-driven synchronous implementation.

Unrelated suggestion, you need to change your email client settings to
plain text in order for the mailing list to accept the message.



^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 2/2] nvme-multipath: enable PCI P2PDMA for multipath devices
  2026-03-24 14:40     ` Keith Busch
@ 2026-03-25  3:50       ` Chaitanya Kulkarni
  0 siblings, 0 replies; 9+ messages in thread
From: Chaitanya Kulkarni @ 2026-03-25  3:50 UTC (permalink / raw)
  To: Keith Busch, Christoph Hellwig
  Cc: Chaitanya Kulkarni, song@kernel.org, yukuai@fnnas.com,
	linan122@huawei.com, axboe@kernel.dk, sagi@grimberg.me,
	linux-raid@vger.kernel.org, linux-nvme@lists.infradead.org,
	Kiran Modukuri

On 3/24/26 07:40, Keith Busch wrote:
> On Tue, Mar 24, 2026 at 07:49:37AM +0100, Christoph Hellwig wrote:
>> On Mon, Mar 23, 2026 at 04:44:16PM -0700, Chaitanya Kulkarni wrote:
>>> diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
>>> index ba00f0b72b85..c49fca43ef19 100644
>>> --- a/drivers/nvme/host/multipath.c
>>> +++ b/drivers/nvme/host/multipath.c
>>> @@ -737,6 +737,9 @@ int nvme_mpath_alloc_disk(struct nvme_ctrl *ctrl, struct nvme_ns_head *head)
>>>   		BLK_FEAT_POLL | BLK_FEAT_ATOMIC_WRITES;
>>>   	if (head->ids.csi == NVME_CSI_ZNS)
>>>   		lim.features |= BLK_FEAT_ZONED;
>>> +	if (ctrl->ops && ctrl->ops->supports_pci_p2pdma &&
>>> +	    ctrl->ops->supports_pci_p2pdma(ctrl))
>>> +		lim.features |= BLK_FEAT_PCI_P2PDMA;
>> This assumes all controllers support P2P, but we allow matching
>> over different transports.  So you'll need to do the same scheme
>> as for MD RAID that checks that every member supports P2P.
> If that is a possible setup, then you could add a path that is non-P2P
> capable sometime after the MD volume was setup with P2P supported, so
> that case might need special handling to notify the stacking device of
> the new limits.

Thanks for the review; working on v2 and test scripts.

-ck



^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2026-03-25  3:50 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-03-23 23:44 [PATCH 0/2] Enable PCI P2PDMA support for RAID0 and NVMe Multipath Chaitanya Kulkarni
2026-03-23 23:44 ` [PATCH 1/2] md: Add PCI_P2PDMA support for MD RAID volumes Chaitanya Kulkarni
2026-03-24  6:48   ` Christoph Hellwig
     [not found]     ` <DM4PR12MB8473B3907CAF51AEBAE573E5D548A@DM4PR12MB8473.namprd12.prod.outlook.com>
2026-03-24 21:29       ` Keith Busch
     [not found]         ` <DM4PR12MB84736C00E876FD160464EB35D548A@DM4PR12MB8473.namprd12.prod.outlook.com>
2026-03-24 22:08           ` Keith Busch
2026-03-23 23:44 ` [PATCH 2/2] nvme-multipath: enable PCI P2PDMA for multipath devices Chaitanya Kulkarni
2026-03-24  6:49   ` Christoph Hellwig
2026-03-24 14:40     ` Keith Busch
2026-03-25  3:50       ` Chaitanya Kulkarni

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox