* [PATCH 0/2] support dmabuf
@ 2026-01-27 17:44 Cliff Burdick
2026-01-27 17:44 ` [PATCH 1/2] eal: " Cliff Burdick
` (3 more replies)
0 siblings, 4 replies; 27+ messages in thread
From: Cliff Burdick @ 2026-01-27 17:44 UTC (permalink / raw)
To: dev; +Cc: anatoly.burakov
Add support for the kernel dmabuf feature and integrate it in the mlx5 driver.
This feature is needed to support GPUDirect on newer kernels.
Cliff Burdick (2):
eal: support dmabuf
common/mlx5: support dmabuf
.mailmap | 1 +
drivers/common/mlx5/linux/meson.build | 2 +
drivers/common/mlx5/linux/mlx5_common_verbs.c | 48 ++++-
drivers/common/mlx5/linux/mlx5_glue.c | 19 ++
drivers/common/mlx5/linux/mlx5_glue.h | 3 +
drivers/common/mlx5/mlx5_common.c | 28 ++-
drivers/common/mlx5/mlx5_common_mr.c | 108 ++++++++++-
drivers/common/mlx5/mlx5_common_mr.h | 17 +-
drivers/common/mlx5/windows/mlx5_common_os.c | 8 +-
drivers/crypto/mlx5/mlx5_crypto.h | 1 +
drivers/crypto/mlx5/mlx5_crypto_gcm.c | 3 +-
lib/eal/common/eal_common_memory.c | 168 ++++++++++++++++++
lib/eal/common/eal_memalloc.h | 21 +++
lib/eal/common/malloc_heap.c | 27 +++
lib/eal/common/malloc_heap.h | 5 +
lib/eal/include/rte_memory.h | 125 +++++++++++++
16 files changed, 576 insertions(+), 8 deletions(-)
--
2.52.0
^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH 1/2] eal: support dmabuf
  2026-01-27 17:44 [PATCH 0/2] support dmabuf Cliff Burdick
@ 2026-01-27 17:44 ` Cliff Burdick
  2026-01-29  1:48   ` Stephen Hemminger
  2026-01-29  1:51   ` Stephen Hemminger
  2026-01-27 17:44 ` [PATCH 2/2] common/mlx5: " Cliff Burdick
  ` (2 subsequent siblings)
  3 siblings, 2 replies; 27+ messages in thread
From: Cliff Burdick @ 2026-01-27 17:44 UTC (permalink / raw)
  To: dev; +Cc: anatoly.burakov, Thomas Monjalon

dmabuf is a modern Linux kernel feature that allows DMA transfers between
two drivers. Common examples of its use are streaming video devices and
NIC-to-GPU transfers. Before dmabuf, users had to load proprietary drivers
to expose the DMA mappings; with dmabuf, proprietary drivers are no longer
required.

A new API function, rte_extmem_register_dmabuf, is introduced to create the
mapping from a dmabuf file descriptor. dmabuf uses a file descriptor and an
offset that have been pre-opened with the kernel; the kernel uses the file
descriptor to map to a VA pointer.

To avoid ABI changes, a static struct is used inside eal_common_memory.c,
and lookups are done on this struct rather than on the rte_memseg_list.
Ideally we would add both the dmabuf file descriptor and offset to
rte_memseg_list, but it's not clear whether we can reuse existing fields
when using the dmabuf API. We could rename the "external" flag to a more
generic "properties" field where "external" is the lowest bit, then use the
second bit to indicate the presence of dmabuf. When the dmabuf bit is set,
we could reuse the base_va address field for the dmabuf offset and the
socket_id for the file descriptor. Which option is preferred?
Signed-off-by: Cliff Burdick <cburdick@nvidia.com>
---
 .mailmap                           |   1 +
 lib/eal/common/eal_common_memory.c | 168 +++++++++++++++++++++++++++++
 lib/eal/common/eal_memalloc.h      |  21 ++++
 lib/eal/common/malloc_heap.c       |  27 +++++
 lib/eal/common/malloc_heap.h       |   5 +
 lib/eal/include/rte_memory.h       | 125 +++++++++++++++++++++
 6 files changed, 347 insertions(+)

diff --git a/.mailmap b/.mailmap
index 2f089326ff..4c2b2f921d 100644
--- a/.mailmap
+++ b/.mailmap
@@ -291,6 +291,7 @@ Cian Ferriter <cian.ferriter@intel.com>
 Ciara Loftus <ciara.loftus@intel.com>
 Ciara Power <ciara.power@intel.com>
 Claire Murphy <claire.k.murphy@intel.com>
+Cliff Burdick <cburdick@nvidia.com>
 Clemens Famulla-Conrad <cfamullaconrad@suse.com>
 Cody Doucette <doucette@bu.edu>
 Congwen Zhang <zhang.congwen@zte.com.cn>
diff --git a/lib/eal/common/eal_common_memory.c b/lib/eal/common/eal_common_memory.c
index c62edf5e55..304ed18396 100644
--- a/lib/eal/common/eal_common_memory.c
+++ b/lib/eal/common/eal_common_memory.c
@@ -45,6 +45,18 @@ static void *next_baseaddr;
 static uint64_t system_page_sz;
 
+/* Internal storage for dmabuf info, indexed by memseg list index.
+ * This keeps dmabuf metadata out of the public rte_memseg_list structure
+ * to preserve ABI compatibility.
+ */
+static struct {
+    int fd;          /**< dmabuf fd, -1 if not dmabuf backed */
+    uint64_t offset; /**< offset within dmabuf */
+} dmabuf_info[RTE_MAX_MEMSEG_LISTS] = {
+    [0 ... RTE_MAX_MEMSEG_LISTS - 1] = { .fd = -1, .offset = 0 }
+};
+
 #define MAX_MMAP_WITH_DEFINED_ADDR_TRIES 5
 void *
 eal_get_virtual_area(void *requested_addr, size_t *size,
@@ -930,6 +942,109 @@ rte_memseg_get_fd_offset(const struct rte_memseg *ms, size_t *offset)
     return ret;
 }
 
+/* Internal dmabuf info functions */
+int
+eal_memseg_list_set_dmabuf_info(int list_idx, int fd, uint64_t offset)
+{
+    if (list_idx < 0 || list_idx >= RTE_MAX_MEMSEG_LISTS)
+        return -EINVAL;
+
+    dmabuf_info[list_idx].fd = fd;
+    dmabuf_info[list_idx].offset = offset;
+    return 0;
+}
+
+int
+eal_memseg_list_get_dmabuf_fd(int list_idx)
+{
+    if (list_idx < 0 || list_idx >= RTE_MAX_MEMSEG_LISTS)
+        return -EINVAL;
+
+    return dmabuf_info[list_idx].fd;
+}
+
+int
+eal_memseg_list_get_dmabuf_offset(int list_idx, uint64_t *offset)
+{
+    if (list_idx < 0 || list_idx >= RTE_MAX_MEMSEG_LISTS || offset == NULL)
+        return -EINVAL;
+
+    *offset = dmabuf_info[list_idx].offset;
+    return 0;
+}
+
+/* Public dmabuf info API functions */
+RTE_EXPORT_SYMBOL(rte_memseg_list_get_dmabuf_fd_thread_unsafe)
+int
+rte_memseg_list_get_dmabuf_fd_thread_unsafe(const struct rte_memseg_list *msl)
+{
+    struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config;
+    int msl_idx;
+
+    if (msl == NULL) {
+        rte_errno = EINVAL;
+        return -1;
+    }
+
+    msl_idx = msl - mcfg->memsegs;
+    if (msl_idx < 0 || msl_idx >= RTE_MAX_MEMSEG_LISTS) {
+        rte_errno = EINVAL;
+        return -1;
+    }
+
+    return dmabuf_info[msl_idx].fd;
+}
+
+RTE_EXPORT_SYMBOL(rte_memseg_list_get_dmabuf_fd)
+int
+rte_memseg_list_get_dmabuf_fd(const struct rte_memseg_list *msl)
+{
+    int ret;
+
+    rte_mcfg_mem_read_lock();
+    ret = rte_memseg_list_get_dmabuf_fd_thread_unsafe(msl);
+    rte_mcfg_mem_read_unlock();
+
+    return ret;
+}
+
+RTE_EXPORT_SYMBOL(rte_memseg_list_get_dmabuf_offset_thread_unsafe)
+int
+rte_memseg_list_get_dmabuf_offset_thread_unsafe(const struct rte_memseg_list *msl,
+        uint64_t *offset)
+{
+    struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config;
+    int msl_idx;
+
+    if (msl == NULL || offset == NULL) {
+        rte_errno = EINVAL;
+        return -1;
+    }
+
+    msl_idx = msl - mcfg->memsegs;
+    if (msl_idx < 0 || msl_idx >= RTE_MAX_MEMSEG_LISTS) {
+        rte_errno = EINVAL;
+        return -1;
+    }
+
+    *offset = dmabuf_info[msl_idx].offset;
+    return 0;
+}
+
+RTE_EXPORT_SYMBOL(rte_memseg_list_get_dmabuf_offset)
+int
+rte_memseg_list_get_dmabuf_offset(const struct rte_memseg_list *msl,
+        uint64_t *offset)
+{
+    int ret;
+
+    rte_mcfg_mem_read_lock();
+    ret = rte_memseg_list_get_dmabuf_offset_thread_unsafe(msl, offset);
+    rte_mcfg_mem_read_unlock();
+
+    return ret;
+}
+
 RTE_EXPORT_SYMBOL(rte_extmem_register)
 int
 rte_extmem_register(void *va_addr, size_t len, rte_iova_t iova_addrs[],
@@ -980,6 +1095,59 @@ rte_extmem_register(void *va_addr, size_t len, rte_iova_t iova_addrs[],
     return ret;
 }
 
+RTE_EXPORT_SYMBOL(rte_extmem_register_dmabuf)
+int
+rte_extmem_register_dmabuf(void *va_addr, size_t len,
+        int dmabuf_fd, uint64_t dmabuf_offset,
+        rte_iova_t iova_addrs[], unsigned int n_pages, size_t page_sz)
+{
+    struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config;
+    unsigned int socket_id, n;
+    int ret = 0;
+
+    if (va_addr == NULL || page_sz == 0 || len == 0 ||
+            !rte_is_power_of_2(page_sz) ||
+            RTE_ALIGN(len, page_sz) != len ||
+            ((len / page_sz) != n_pages && iova_addrs != NULL) ||
+            !rte_is_aligned(va_addr, page_sz) ||
+            dmabuf_fd < 0) {
+        rte_errno = EINVAL;
+        return -1;
+    }
+    rte_mcfg_mem_write_lock();
+
+    /* make sure the segment doesn't already exist */
+    if (malloc_heap_find_external_seg(va_addr, len) != NULL) {
+        rte_errno = EEXIST;
+        ret = -1;
+        goto unlock;
+    }
+
+    /* get next available socket ID */
+    socket_id = mcfg->next_socket_id;
+    if (socket_id > INT32_MAX) {
+        EAL_LOG(ERR, "Cannot assign new socket ID's");
+        rte_errno = ENOSPC;
+        ret = -1;
+        goto unlock;
+    }
+
+    /* we can create a new memseg with dma-buf info */
+    n = len / page_sz;
+    if (malloc_heap_create_external_seg_dmabuf(va_addr, iova_addrs, n,
+            page_sz, "extmem_dmabuf", socket_id,
+            dmabuf_fd, dmabuf_offset) == NULL) {
+        ret = -1;
+        goto unlock;
+    }
+
+    /* memseg list successfully created - increment next socket ID */
+    mcfg->next_socket_id++;
+unlock:
+    rte_mcfg_mem_write_unlock();
+    return ret;
+}
+
 RTE_EXPORT_SYMBOL(rte_extmem_unregister)
 int
 rte_extmem_unregister(void *va_addr, size_t len)
diff --git a/lib/eal/common/eal_memalloc.h b/lib/eal/common/eal_memalloc.h
index 0c267066d9..bb2cfa0717 100644
--- a/lib/eal/common/eal_memalloc.h
+++ b/lib/eal/common/eal_memalloc.h
@@ -90,6 +90,27 @@ eal_memalloc_set_seg_list_fd(int list_idx, int fd);
 int
 eal_memalloc_get_seg_fd_offset(int list_idx, int seg_idx, size_t *offset);
 
+/*
+ * Set dmabuf info for a memseg list.
+ * Returns 0 on success, -errno on failure.
+ */
+int
+eal_memseg_list_set_dmabuf_info(int list_idx, int fd, uint64_t offset);
+
+/*
+ * Get dmabuf fd for a memseg list.
+ * Returns fd (>= 0) on success, -1 if not dmabuf backed, -errno on error.
+ */
+int
+eal_memseg_list_get_dmabuf_fd(int list_idx);
+
+/*
+ * Get dmabuf offset for a memseg list.
+ * Returns 0 on success, -errno on failure.
+ */
+int
+eal_memseg_list_get_dmabuf_offset(int list_idx, uint64_t *offset);
+
 int
 eal_memalloc_init(void)
     __rte_requires_shared_capability(rte_mcfg_mem_get_lock());
diff --git a/lib/eal/common/malloc_heap.c b/lib/eal/common/malloc_heap.c
index 39240c261c..fd0376d13b 100644
--- a/lib/eal/common/malloc_heap.c
+++ b/lib/eal/common/malloc_heap.c
@@ -1232,6 +1232,33 @@ malloc_heap_create_external_seg(void *va_addr, rte_iova_t iova_addrs[],
     msl->version = 0;
     msl->external = 1;
 
+    /* initialize dmabuf info to "not dmabuf backed" */
+    eal_memseg_list_set_dmabuf_info(i, -1, 0);
+
+    return msl;
+}
+
+struct rte_memseg_list *
+malloc_heap_create_external_seg_dmabuf(void *va_addr, rte_iova_t iova_addrs[],
+        unsigned int n_pages, size_t page_sz, const char *seg_name,
+        unsigned int socket_id, int dmabuf_fd, uint64_t dmabuf_offset)
+{
+    struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config;
+    struct rte_memseg_list *msl;
+    int msl_idx;
+
+    /* Create the base external segment */
+    msl = malloc_heap_create_external_seg(va_addr, iova_addrs, n_pages,
+            page_sz, seg_name, socket_id);
+    if (msl == NULL)
+        return NULL;
+
+    /* Get memseg list index */
+    msl_idx = msl - mcfg->memsegs;
+
+    /* Set dma-buf info in the internal side-table */
+    eal_memseg_list_set_dmabuf_info(msl_idx, dmabuf_fd, dmabuf_offset);
+
     return msl;
 }
diff --git a/lib/eal/common/malloc_heap.h b/lib/eal/common/malloc_heap.h
index dfc56d4ae3..87525d1a68 100644
--- a/lib/eal/common/malloc_heap.h
+++ b/lib/eal/common/malloc_heap.h
@@ -51,6 +51,11 @@ malloc_heap_create_external_seg(void *va_addr, rte_iova_t iova_addrs[],
         unsigned int n_pages, size_t page_sz, const char *seg_name,
         unsigned int socket_id);
 
+struct rte_memseg_list *
+malloc_heap_create_external_seg_dmabuf(void *va_addr, rte_iova_t iova_addrs[],
+        unsigned int n_pages, size_t page_sz, const char *seg_name,
+        unsigned int socket_id, int dmabuf_fd, uint64_t dmabuf_offset);
+
 struct rte_memseg_list *
 malloc_heap_find_external_seg(void *va_addr, size_t len);
diff --git a/lib/eal/include/rte_memory.h b/lib/eal/include/rte_memory.h
index b6e97ad695..d1c2fc8aa5 100644
--- a/lib/eal/include/rte_memory.h
+++ b/lib/eal/include/rte_memory.h
@@ -405,6 +405,82 @@ int
 rte_memseg_get_fd_offset_thread_unsafe(const struct rte_memseg *ms,
         size_t *offset);
 
+/**
+ * Get dma-buf file descriptor associated with a memseg list.
+ *
+ * @note This function read-locks the memory hotplug subsystem, and thus cannot
+ *   be used within memory-related callback functions.
+ *
+ * @param msl
+ *   A pointer to memseg list for which to get dma-buf fd.
+ *
+ * @return
+ *   Valid dma-buf file descriptor (>= 0) in case of success.
+ *   -1 if not dma-buf backed or in case of error, with ``rte_errno`` set to:
+ *     - EINVAL - ``msl`` pointer was NULL or did not point to a valid memseg list
+ */
+int
+rte_memseg_list_get_dmabuf_fd(const struct rte_memseg_list *msl);
+
+/**
+ * Get dma-buf file descriptor associated with a memseg list.
+ *
+ * @note This function does not perform any locking, and is only safe to call
+ *   from within memory-related callback functions.
+ *
+ * @param msl
+ *   A pointer to memseg list for which to get dma-buf fd.
+ *
+ * @return
+ *   Valid dma-buf file descriptor (>= 0) in case of success.
+ *   -1 if not dma-buf backed or in case of error, with ``rte_errno`` set to:
+ *     - EINVAL - ``msl`` pointer was NULL or did not point to a valid memseg list
+ */
+int
+rte_memseg_list_get_dmabuf_fd_thread_unsafe(const struct rte_memseg_list *msl);
+
+/**
+ * Get dma-buf offset associated with a memseg list.
+ *
+ * @note This function read-locks the memory hotplug subsystem, and thus cannot
+ *   be used within memory-related callback functions.
+ *
+ * @param msl
+ *   A pointer to memseg list for which to get dma-buf offset.
+ * @param offset
+ *   A pointer to offset value where the result will be stored.
+ *
+ * @return
+ *   0 on success.
+ *   -1 in case of error, with ``rte_errno`` set to:
+ *     - EINVAL - ``msl`` pointer was NULL or did not point to a valid memseg list
+ *     - EINVAL - ``offset`` pointer was NULL
+ */
+int
+rte_memseg_list_get_dmabuf_offset(const struct rte_memseg_list *msl,
+        uint64_t *offset);
+
+/**
+ * Get dma-buf offset associated with a memseg list.
+ *
+ * @note This function does not perform any locking, and is only safe to call
+ *   from within memory-related callback functions.
+ *
+ * @param msl
+ *   A pointer to memseg list for which to get dma-buf offset.
+ * @param offset
+ *   A pointer to offset value where the result will be stored.
+ *
+ * @return
+ *   0 on success.
+ *   -1 in case of error, with ``rte_errno`` set to:
+ *     - EINVAL - ``msl`` pointer was NULL or did not point to a valid memseg list
+ *     - EINVAL - ``offset`` pointer was NULL
+ */
+int
+rte_memseg_list_get_dmabuf_offset_thread_unsafe(const struct rte_memseg_list *msl,
+        uint64_t *offset);
+
 /**
  * Register external memory chunk with DPDK.
  *
@@ -443,6 +519,55 @@ int
 rte_extmem_register(void *va_addr, size_t len, rte_iova_t iova_addrs[],
         unsigned int n_pages, size_t page_sz);
 
+/**
+ * Register external memory chunk backed by a dma-buf with DPDK.
+ *
+ * This is similar to rte_extmem_register() but additionally stores dma-buf
+ * file descriptor information, allowing drivers to use dma-buf based
+ * memory registration (e.g., ibv_reg_dmabuf_mr for RDMA devices).
+ *
+ * @note Using this API is mutually exclusive with ``rte_malloc`` family of
+ *   API's.
+ *
+ * @note This API will not perform any DMA mapping. It is expected that user
+ *   will do that themselves via rte_dev_dma_map().
+ *
+ * @note Before accessing this memory in other processes, it needs to be
+ *   attached in each of those processes by calling ``rte_extmem_attach`` in
+ *   each other process.
+ *
+ * @param va_addr
+ *   Start of virtual area to register (mmap'd address of the dma-buf).
+ *   Must be aligned by ``page_sz``.
+ * @param len
+ *   Length of virtual area to register. Must be aligned by ``page_sz``.
+ *   This is independent of dmabuf_offset.
+ * @param dmabuf_fd
+ *   File descriptor of the dma-buf.
+ * @param dmabuf_offset
+ *   Offset within the dma-buf where the registered region starts.
+ * @param iova_addrs
+ *   Array of page IOVA addresses corresponding to each page in this memory
+ *   area. Can be NULL, in which case page IOVA addresses will be set to
+ *   RTE_BAD_IOVA.
+ * @param n_pages
+ *   Number of elements in the iova_addrs array. Ignored if ``iova_addrs``
+ *   is NULL.
+ * @param page_sz
+ *   Page size of the underlying memory
+ *
+ * @return
+ *   - 0 on success
+ *   - -1 in case of error, with rte_errno set to one of the following:
+ *     EINVAL - one of the parameters was invalid
+ *     EEXIST - memory chunk is already registered
+ *     ENOSPC - no more space in internal config to store a new memory chunk
+ */
+int
+rte_extmem_register_dmabuf(void *va_addr, size_t len,
+        int dmabuf_fd, uint64_t dmabuf_offset,
+        rte_iova_t iova_addrs[], unsigned int n_pages, size_t page_sz);
+
 /**
  * Unregister external memory chunk with DPDK.
  *
-- 
2.52.0

^ permalink raw reply related	[flat|nested] 27+ messages in thread
* Re: [PATCH 1/2] eal: support dmabuf
  2026-01-27 17:44 ` [PATCH 1/2] eal: " Cliff Burdick
@ 2026-01-29  1:48   ` Stephen Hemminger
  2026-01-29  1:51   ` Stephen Hemminger
  1 sibling, 0 replies; 27+ messages in thread
From: Stephen Hemminger @ 2026-01-29 1:48 UTC (permalink / raw)
  To: Cliff Burdick; +Cc: dev, anatoly.burakov, Thomas Monjalon

On Tue, 27 Jan 2026 17:44:08 +0000
Cliff Burdick <cburdick@nvidia.com> wrote:

> +    int fd;          /**< dmabuf fd, -1 if not dmabuf backed */
> +    uint64_t offset; /**< offset within dmabuf */
> +} dmabuf_info[RTE_MAX_MEMSEG_LISTS] = {
> +    [0 ... RTE_MAX_MEMSEG_LISTS - 1] = { .fd = -1, .offset = 0 }
> +};
> +

Range initializers are a GCC extension not available in MSVC.

^ permalink raw reply	[flat|nested] 27+ messages in thread
* Re: [PATCH 1/2] eal: support dmabuf
  2026-01-27 17:44 ` [PATCH 1/2] eal: " Cliff Burdick
  2026-01-29  1:48   ` Stephen Hemminger
@ 2026-01-29  1:51   ` Stephen Hemminger
  1 sibling, 0 replies; 27+ messages in thread
From: Stephen Hemminger @ 2026-01-29 1:51 UTC (permalink / raw)
  To: Cliff Burdick; +Cc: dev, anatoly.burakov, Thomas Monjalon

On Tue, 27 Jan 2026 17:44:08 +0000
Cliff Burdick <cburdick@nvidia.com> wrote:

> +/**
> + * Get dma-buf file descriptor associated with a memseg list.
> + *
> + * @note This function does not perform any locking, and is only safe to call
> + *   from within memory-related callback functions.

Maybe a warning instead of a note.

> + *
> + * @param msl
> + *   A pointer to memseg list for which to get dma-buf fd.
> + *
> + * @return
> + *   Valid dma-buf file descriptor (>= 0) in case of success.
> + *   -1 if not dma-buf backed or in case of error, with ``rte_errno`` set to:
> + *     - EINVAL - ``msl`` pointer was NULL or did not point to a valid memseg list
> + */
> +int
> +rte_memseg_list_get_dmabuf_fd_thread_unsafe(const struct rte_memseg_list *msl);

ENAMETOOLONG

At some point, you need to come up with a better naming convention.

^ permalink raw reply	[flat|nested] 27+ messages in thread
* [PATCH 2/2] common/mlx5: support dmabuf
  2026-01-27 17:44 [PATCH 0/2] support dmabuf Cliff Burdick
  2026-01-27 17:44 ` [PATCH 1/2] eal: " Cliff Burdick
@ 2026-01-27 17:44 ` Cliff Burdick
  2026-01-27 19:21   ` [REVIEW] " Stephen Hemminger
  2026-01-29  1:51   ` [PATCH 2/2] " Stephen Hemminger
  2026-01-28  0:04 ` [PATCH 0/2] " Stephen Hemminger
  2026-02-03 22:26 ` [PATCH v2 " Cliff Burdick
  3 siblings, 2 replies; 27+ messages in thread
From: Cliff Burdick @ 2026-01-27 17:44 UTC (permalink / raw)
  To: dev
  Cc: anatoly.burakov, Dariusz Sosnowski, Viacheslav Ovsiienko,
	Bing Zhao, Ori Kam, Suanming Mou, Matan Azrad

dmabuf is a modern Linux kernel feature that allows DMA transfers between
two drivers. Common examples of its use are streaming video devices and
NIC-to-GPU transfers. Before dmabuf, users had to load proprietary drivers
to expose the DMA mappings; with dmabuf, proprietary drivers are no longer
required.

Signed-off-by: Cliff Burdick <cburdick@nvidia.com>
---
 drivers/common/mlx5/linux/meson.build         |   2 +
 drivers/common/mlx5/linux/mlx5_common_verbs.c |  48 +++++++-
 drivers/common/mlx5/linux/mlx5_glue.c         |  19 +++
 drivers/common/mlx5/linux/mlx5_glue.h         |   3 +
 drivers/common/mlx5/mlx5_common.c             |  28 ++++-
 drivers/common/mlx5/mlx5_common_mr.c          | 108 +++++++++++++++++-
 drivers/common/mlx5/mlx5_common_mr.h          |  17 ++-
 drivers/common/mlx5/windows/mlx5_common_os.c  |   8 +-
 drivers/crypto/mlx5/mlx5_crypto.h             |   1 +
 drivers/crypto/mlx5/mlx5_crypto_gcm.c         |   3 +-
 10 files changed, 229 insertions(+), 8 deletions(-)

diff --git a/drivers/common/mlx5/linux/meson.build b/drivers/common/mlx5/linux/meson.build
index 3767e7a69b..8e83104165 100644
--- a/drivers/common/mlx5/linux/meson.build
+++ b/drivers/common/mlx5/linux/meson.build
@@ -203,6 +203,8 @@ has_sym_args = [
         'mlx5dv_dr_domain_allow_duplicate_rules' ],
     [ 'HAVE_MLX5_IBV_REG_MR_IOVA', 'infiniband/verbs.h',
         'ibv_reg_mr_iova' ],
+    [ 'HAVE_IBV_REG_DMABUF_MR', 'infiniband/verbs.h',
+        'ibv_reg_dmabuf_mr' ],
     [ 'HAVE_MLX5_IBV_IMPORT_CTX_PD_AND_MR', 'infiniband/verbs.h',
         'ibv_import_device' ],
     [ 'HAVE_MLX5DV_DR_ACTION_CREATE_DEST_ROOT_TABLE', 'infiniband/mlx5dv.h',
diff --git a/drivers/common/mlx5/linux/mlx5_common_verbs.c b/drivers/common/mlx5/linux/mlx5_common_verbs.c
index 98260df470..f6d18fd5df 100644
--- a/drivers/common/mlx5/linux/mlx5_common_verbs.c
+++ b/drivers/common/mlx5/linux/mlx5_common_verbs.c
@@ -129,6 +129,47 @@ mlx5_common_verbs_reg_mr(void *pd, void *addr, size_t length,
     return 0;
 }
 
+/**
+ * Register mr for dma-buf backed memory. Given protection domain pointer,
+ * dma-buf fd, offset and length, register the memory region.
+ *
+ * @param[in] pd
+ *   Pointer to protection domain context.
+ * @param[in] offset
+ *   Offset within the dma-buf.
+ * @param[in] length
+ *   Length of the memory to register.
+ * @param[in] fd
+ *   File descriptor of the dma-buf.
+ * @param[out] pmd_mr
+ *   pmd_mr struct set with lkey, address, length and pointer to mr object
+ *
+ * @return
+ *   0 on successful registration, -1 otherwise
+ */
+RTE_EXPORT_INTERNAL_SYMBOL(mlx5_common_verbs_reg_dmabuf_mr)
+int
+mlx5_common_verbs_reg_dmabuf_mr(void *pd, uint64_t offset, size_t length,
+        uint64_t iova, int fd,
+        struct mlx5_pmd_mr *pmd_mr)
+{
+    struct ibv_mr *ibv_mr;
+
+    ibv_mr = mlx5_glue->reg_dmabuf_mr(pd, offset, length, iova, fd,
+            IBV_ACCESS_LOCAL_WRITE |
+            (haswell_broadwell_cpu ? 0 :
+             IBV_ACCESS_RELAXED_ORDERING));
+    if (!ibv_mr)
+        return -1;
+
+    *pmd_mr = (struct mlx5_pmd_mr){
+        .lkey = ibv_mr->lkey,
+        .addr = ibv_mr->addr,
+        .len = ibv_mr->length,
+        .obj = (void *)ibv_mr,
+    };
+    return 0;
+}
+
 /**
  * Deregister mr. Given the mlx5 pmd MR - deregister the MR
  *
@@ -151,13 +192,18 @@ mlx5_common_verbs_dereg_mr(struct mlx5_pmd_mr *pmd_mr)
  *
  * @param[out] reg_mr_cb
  *   Pointer to reg_mr func
+ * @param[out] reg_dmabuf_mr_cb
+ *   Pointer to reg_dmabuf_mr func
  * @param[out] dereg_mr_cb
  *   Pointer to dereg_mr func
  */
 RTE_EXPORT_INTERNAL_SYMBOL(mlx5_os_set_reg_mr_cb)
 void
-mlx5_os_set_reg_mr_cb(mlx5_reg_mr_t *reg_mr_cb, mlx5_dereg_mr_t *dereg_mr_cb)
+mlx5_os_set_reg_mr_cb(mlx5_reg_mr_t *reg_mr_cb,
+        mlx5_reg_dmabuf_mr_t *reg_dmabuf_mr_cb,
+        mlx5_dereg_mr_t *dereg_mr_cb)
 {
     *reg_mr_cb = mlx5_common_verbs_reg_mr;
+    *reg_dmabuf_mr_cb = mlx5_common_verbs_reg_dmabuf_mr;
     *dereg_mr_cb = mlx5_common_verbs_dereg_mr;
 }
diff --git a/drivers/common/mlx5/linux/mlx5_glue.c b/drivers/common/mlx5/linux/mlx5_glue.c
index a91eaa429d..6fac7f2bcd 100644
--- a/drivers/common/mlx5/linux/mlx5_glue.c
+++ b/drivers/common/mlx5/linux/mlx5_glue.c
@@ -291,6 +291,24 @@ mlx5_glue_reg_mr_iova(struct ibv_pd *pd, void *addr, size_t length,
 #endif
 }
 
+static struct ibv_mr *
+mlx5_glue_reg_dmabuf_mr(struct ibv_pd *pd, uint64_t offset, size_t length,
+        uint64_t iova, int fd, int access)
+{
+#ifdef HAVE_IBV_REG_DMABUF_MR
+    return ibv_reg_dmabuf_mr(pd, offset, length, iova, fd, access);
+#else
+    (void)pd;
+    (void)offset;
+    (void)length;
+    (void)iova;
+    (void)fd;
+    (void)access;
+    errno = ENOTSUP;
+    return NULL;
+#endif
+}
+
 static struct ibv_mr *
 mlx5_glue_alloc_null_mr(struct ibv_pd *pd)
 {
@@ -1619,6 +1637,7 @@ const struct mlx5_glue *mlx5_glue = &(const struct mlx5_glue) {
     .modify_qp = mlx5_glue_modify_qp,
     .reg_mr = mlx5_glue_reg_mr,
     .reg_mr_iova = mlx5_glue_reg_mr_iova,
+    .reg_dmabuf_mr = mlx5_glue_reg_dmabuf_mr,
     .alloc_null_mr = mlx5_glue_alloc_null_mr,
     .dereg_mr = mlx5_glue_dereg_mr,
     .create_counter_set = mlx5_glue_create_counter_set,
diff --git a/drivers/common/mlx5/linux/mlx5_glue.h b/drivers/common/mlx5/linux/mlx5_glue.h
index 81d6b0aaf9..66216d1194 100644
--- a/drivers/common/mlx5/linux/mlx5_glue.h
+++ b/drivers/common/mlx5/linux/mlx5_glue.h
@@ -219,6 +219,9 @@ struct mlx5_glue {
     struct ibv_mr *(*reg_mr_iova)(struct ibv_pd *pd, void *addr,
                                   size_t length, uint64_t iova,
                                   int access);
+    struct ibv_mr *(*reg_dmabuf_mr)(struct ibv_pd *pd, uint64_t offset,
+                                    size_t length, uint64_t iova,
+                                    int fd, int access);
     struct ibv_mr *(*alloc_null_mr)(struct ibv_pd *pd);
     int (*dereg_mr)(struct ibv_mr *mr);
     struct ibv_counter_set *(*create_counter_set)
diff --git a/drivers/common/mlx5/mlx5_common.c b/drivers/common/mlx5/mlx5_common.c
index 84a93e7dbd..0ec59b0122 100644
--- a/drivers/common/mlx5/mlx5_common.c
+++ b/drivers/common/mlx5/mlx5_common.c
@@ -13,6 +13,7 @@
 #include <rte_class.h>
 #include <rte_malloc.h>
 #include <rte_eal_paging.h>
+#include <rte_memory.h>
 
 #include "mlx5_common.h"
 #include "mlx5_common_os.h"
@@ -1125,6 +1126,7 @@ mlx5_common_dev_dma_map(struct rte_device *rte_dev, void *addr,
     struct mlx5_common_device *dev;
     struct mlx5_mr_btree *bt;
     struct mlx5_mr *mr;
+    struct rte_memseg_list *msl;
 
     dev = to_mlx5_device(rte_dev);
     if (!dev) {
@@ -1134,8 +1136,30 @@ mlx5_common_dev_dma_map(struct rte_device *rte_dev, void *addr,
         rte_errno = ENODEV;
         return -1;
     }
-    mr = mlx5_create_mr_ext(dev->pd, (uintptr_t)addr, len,
-            SOCKET_ID_ANY, dev->mr_scache.reg_mr_cb);
+    /* Check if this is dma-buf backed external memory */
+    msl = rte_mem_virt2memseg_list(addr);
+    if (msl != NULL && msl->external) {
+        int dmabuf_fd = rte_memseg_list_get_dmabuf_fd_thread_unsafe(msl);
+        if (dmabuf_fd >= 0) {
+            uint64_t dmabuf_off;
+            /* Get base offset from memseg list */
+            rte_memseg_list_get_dmabuf_offset_thread_unsafe(msl, &dmabuf_off);
+            /* Calculate offset within dmabuf for this specific address */
+            dmabuf_off += ((uintptr_t)addr - (uintptr_t)msl->base_va);
+            /* Use dma-buf MR registration */
+            mr = mlx5_create_mr_ext_dmabuf(dev->pd, (uintptr_t)addr, len,
+                    SOCKET_ID_ANY, dmabuf_fd, dmabuf_off,
+                    dev->mr_scache.reg_dmabuf_mr_cb);
+        } else {
+            /* Use regular MR registration */
+            mr = mlx5_create_mr_ext(dev->pd, (uintptr_t)addr, len,
+                    SOCKET_ID_ANY, dev->mr_scache.reg_mr_cb);
+        }
+    } else {
+        /* Use regular MR registration */
+        mr = mlx5_create_mr_ext(dev->pd, (uintptr_t)addr, len,
+                SOCKET_ID_ANY, dev->mr_scache.reg_mr_cb);
+    }
     if (!mr) {
         DRV_LOG(WARNING, "Device %s unable to DMA map", rte_dev->name);
         rte_errno = EINVAL;
diff --git a/drivers/common/mlx5/mlx5_common_mr.c b/drivers/common/mlx5/mlx5_common_mr.c
index 8ed988dec9..18b8a6eaa5 100644
--- a/drivers/common/mlx5/mlx5_common_mr.c
+++ b/drivers/common/mlx5/mlx5_common_mr.c
@@ -8,6 +8,7 @@
 #include <rte_eal_memconfig.h>
 #include <rte_eal_paging.h>
 #include <rte_errno.h>
+#include <rte_memory.h>
 #include <rte_mempool.h>
 #include <rte_malloc.h>
 #include <rte_rwlock.h>
@@ -1141,6 +1142,7 @@ mlx5_mr_create_cache(struct mlx5_mr_share_cache *share_cache, int socket)
 {
     /* Set the reg_mr and dereg_mr callback functions */
     mlx5_os_set_reg_mr_cb(&share_cache->reg_mr_cb,
+                          &share_cache->reg_dmabuf_mr_cb,
                           &share_cache->dereg_mr_cb);
     rte_rwlock_init(&share_cache->rwlock);
     rte_rwlock_init(&share_cache->mprwlock);
@@ -1221,6 +1223,74 @@ mlx5_create_mr_ext(void *pd, uintptr_t addr, size_t len, int socket_id,
     return mr;
 }
 
+/**
+ * Creates a memory region for dma-buf backed external memory.
+ *
+ * @param pd
+ *   Pointer to pd of a device (net, regex, vdpa,...).
+ * @param addr
+ *   Starting virtual address of memory (mmap'd address).
+ * @param len
+ *   Length of memory segment being mapped.
+ * @param socket_id
+ *   Socket to allocate heap memory for the control structures.
+ * @param dmabuf_fd
+ *   File descriptor of the dma-buf.
+ * @param dmabuf_offset
+ *   Offset within the dma-buf.
+ * @param reg_dmabuf_mr_cb
+ *   Callback function for dma-buf MR registration.
+ *
+ * @return
+ *   Pointer to MR structure on success, NULL otherwise.
+ */
+struct mlx5_mr *
+mlx5_create_mr_ext_dmabuf(void *pd, uintptr_t addr, size_t len, int socket_id,
+        int dmabuf_fd, uint64_t dmabuf_offset,
+        mlx5_reg_dmabuf_mr_t reg_dmabuf_mr_cb)
+{
+    struct mlx5_mr *mr = NULL;
+
+    if (reg_dmabuf_mr_cb == NULL) {
+        DRV_LOG(WARNING, "dma-buf MR registration not supported");
+        rte_errno = ENOTSUP;
+        return NULL;
+    }
+    mr = mlx5_malloc(MLX5_MEM_RTE | MLX5_MEM_ZERO,
+            RTE_ALIGN_CEIL(sizeof(*mr), RTE_CACHE_LINE_SIZE),
+            RTE_CACHE_LINE_SIZE, socket_id);
+    if (mr == NULL)
+        return NULL;
+    if (reg_dmabuf_mr_cb(pd, dmabuf_offset, len, addr, dmabuf_fd,
+            &mr->pmd_mr) < 0) {
+        DRV_LOG(WARNING,
+            "Fail to create dma-buf MR for address (%p) fd=%d",
+            (void *)addr, dmabuf_fd);
+        mlx5_free(mr);
+        return NULL;
+    }
+    mr->msl = NULL; /* Mark it is external memory. */
+    mr->ms_bmp = NULL;
+    mr->ms_n = 1;
+    mr->ms_bmp_n = 1;
+    /*
+     * For dma-buf MR, the returned addr may be NULL since there's no VA
+     * in the registration. Store the user-provided addr for cache lookup.
+     */
+    if (mr->pmd_mr.addr == NULL)
+        mr->pmd_mr.addr = (void *)addr;
+    if (mr->pmd_mr.len == 0)
+        mr->pmd_mr.len = len;
+    DRV_LOG(DEBUG,
+        "MR CREATED (%p) for dma-buf external memory %p (fd=%d):\n"
+        "  [0x%" PRIxPTR ", 0x%" PRIxPTR "),"
+        " lkey=0x%x base_idx=%u ms_n=%u, ms_bmp_n=%u",
+        (void *)mr, (void *)addr, dmabuf_fd,
+        addr, addr + len, rte_cpu_to_be_32(mr->pmd_mr.lkey),
+        mr->ms_base_idx, mr->ms_n, mr->ms_bmp_n);
+    return mr;
+}
+
 /**
  * Callback for memory free event. Iterate freed memsegs and check whether it
  * belongs to an existing MR. If found, clear the bit from bitmap of MR. As a
@@ -1747,9 +1817,43 @@ mlx5_mr_mempool_register_primary(struct mlx5_mr_share_cache *share_cache,
         struct mlx5_mempool_mr *mr = &new_mpr->mrs[i];
         const struct mlx5_range *range = &ranges[i];
         size_t len = range->end - range->start;
+        struct rte_memseg_list *msl;
+        int reg_result;
+
+        /* Check if this is dma-buf backed external memory */
+        msl = rte_mem_virt2memseg_list((void *)range->start);
+        if (msl != NULL && msl->external &&
+                share_cache->reg_dmabuf_mr_cb != NULL) {
+            int dmabuf_fd = rte_memseg_list_get_dmabuf_fd_thread_unsafe(msl);
+            if (dmabuf_fd >= 0) {
+                uint64_t dmabuf_off;
+                /* Get base offset from memseg list */
+                rte_memseg_list_get_dmabuf_offset_thread_unsafe(msl, &dmabuf_off);
+                /* Calculate offset within dmabuf for this specific range */
+                dmabuf_off += (range->start - (uintptr_t)msl->base_va);
+                /* Use dma-buf MR registration */
+                reg_result = share_cache->reg_dmabuf_mr_cb(pd,
+                        dmabuf_off, len, range->start, dmabuf_fd,
+                        &mr->pmd_mr);
+                if (reg_result == 0) {
+                    /* For dma-buf MR, set addr if not set by driver */
+                    if (mr->pmd_mr.addr == NULL)
+                        mr->pmd_mr.addr = (void *)range->start;
+                    if (mr->pmd_mr.len == 0)
+                        mr->pmd_mr.len = len;
+                }
+            } else {
+                /* Use regular MR registration */
+                reg_result = share_cache->reg_mr_cb(pd,
+                        (void *)range->start, len, &mr->pmd_mr);
+            }
+        } else {
+            /* Use regular MR registration */
+            reg_result = share_cache->reg_mr_cb(pd,
+                    (void *)range->start, len, &mr->pmd_mr);
+        }
 
-        if (share_cache->reg_mr_cb(pd, (void *)range->start, len,
-                &mr->pmd_mr) < 0) {
+        if (reg_result < 0) {
             DRV_LOG(ERR,
                 "Failed to create an MR in PD %p for address range "
                 "[0x%" PRIxPTR ", 0x%" PRIxPTR "] (%zu bytes) for mempool %s",
diff --git a/drivers/common/mlx5/mlx5_common_mr.h b/drivers/common/mlx5/mlx5_common_mr.h
index cf7c685e9b..3b967b1323 100644
--- a/drivers/common/mlx5/mlx5_common_mr.h
+++ b/drivers/common/mlx5/mlx5_common_mr.h
@@ -35,6 +35,9 @@ struct mlx5_pmd_mr {
  */
 typedef int (*mlx5_reg_mr_t)(void *pd, void *addr, size_t length,
                              struct mlx5_pmd_mr *pmd_mr);
+typedef int (*mlx5_reg_dmabuf_mr_t)(void *pd, uint64_t offset, size_t length,
+                                    uint64_t iova, int fd,
+                                    struct mlx5_pmd_mr *pmd_mr);
 typedef void (*mlx5_dereg_mr_t)(struct mlx5_pmd_mr *pmd_mr);
 
 /* Memory Region object. */
@@ -87,6 +90,7 @@ struct __rte_packed_begin mlx5_mr_share_cache {
     struct mlx5_mr_list mr_free_list; /* Freed MR list. */
     struct mlx5_mempool_reg_list mempool_reg_list; /* Mempool database. */
     mlx5_reg_mr_t reg_mr_cb; /* Callback to reg_mr func */
+    mlx5_reg_dmabuf_mr_t reg_dmabuf_mr_cb; /* Callback to reg_dmabuf_mr func */
     mlx5_dereg_mr_t dereg_mr_cb; /* Callback to dereg_mr func */
 } __rte_packed_end;
 
@@ -233,6 +237,10 @@ mlx5_mr_lookup_list(struct mlx5_mr_share_cache *share_cache,
 struct mlx5_mr *
 mlx5_create_mr_ext(void *pd, uintptr_t addr, size_t len, int socket_id,
         mlx5_reg_mr_t reg_mr_cb);
+struct mlx5_mr *
+mlx5_create_mr_ext_dmabuf(void *pd, uintptr_t addr, size_t len, int socket_id,
+        int dmabuf_fd, uint64_t dmabuf_offset,
+        mlx5_reg_dmabuf_mr_t reg_dmabuf_mr_cb);
 void
 mlx5_mr_free(struct mlx5_mr *mr, mlx5_dereg_mr_t dereg_mr_cb);
 __rte_internal
 uint32_t
@@ -251,12 +259,19 @@ int
 mlx5_common_verbs_reg_mr(void *pd, void *addr, size_t length,
         struct mlx5_pmd_mr *pmd_mr);
 __rte_internal
+int
+mlx5_common_verbs_reg_dmabuf_mr(void *pd, uint64_t offset, size_t length,
+        uint64_t iova, int fd,
+        struct mlx5_pmd_mr *pmd_mr);
+__rte_internal
 void
 mlx5_common_verbs_dereg_mr(struct mlx5_pmd_mr *pmd_mr);
 
 __rte_internal
 void
-mlx5_os_set_reg_mr_cb(mlx5_reg_mr_t *reg_mr_cb, mlx5_dereg_mr_t *dereg_mr_cb);
+mlx5_os_set_reg_mr_cb(mlx5_reg_mr_t *reg_mr_cb,
+        mlx5_reg_dmabuf_mr_t *reg_dmabuf_mr_cb,
+        mlx5_dereg_mr_t *dereg_mr_cb);
 
 __rte_internal
 int
diff --git a/drivers/common/mlx5/windows/mlx5_common_os.c b/drivers/common/mlx5/windows/mlx5_common_os.c
index 7fac361460..5e284742ab 100644
--- a/drivers/common/mlx5/windows/mlx5_common_os.c
+++ b/drivers/common/mlx5/windows/mlx5_common_os.c
@@ -17,6 +17,7 @@
 #include "mlx5_common.h"
 #include "mlx5_common_os.h"
 #include "mlx5_malloc.h"
+#include "mlx5_common_mr.h"
 
 /**
  * Initialization routine for run-time dependency on external lib.
@@ -442,15 +443,20 @@ mlx5_os_dereg_mr(struct mlx5_pmd_mr *pmd_mr)
  *
  * @param[out] reg_mr_cb
  *   Pointer to reg_mr func
+ * @param[out] reg_dmabuf_mr_cb
+ *   Pointer to reg_dmabuf_mr func (NULL on Windows - not supported)
  * @param[out] dereg_mr_cb
  *   Pointer to dereg_mr func
  *
 */
 RTE_EXPORT_INTERNAL_SYMBOL(mlx5_os_set_reg_mr_cb)
 void
-mlx5_os_set_reg_mr_cb(mlx5_reg_mr_t *reg_mr_cb, mlx5_dereg_mr_t *dereg_mr_cb)
+mlx5_os_set_reg_mr_cb(mlx5_reg_mr_t *reg_mr_cb,
+        mlx5_reg_dmabuf_mr_t *reg_dmabuf_mr_cb,
+        mlx5_dereg_mr_t *dereg_mr_cb)
 {
     *reg_mr_cb = mlx5_os_reg_mr;
+    *reg_dmabuf_mr_cb = NULL; /* dma-buf not supported on Windows */
     *dereg_mr_cb = mlx5_os_dereg_mr;
 }
diff --git a/drivers/crypto/mlx5/mlx5_crypto.h b/drivers/crypto/mlx5/mlx5_crypto.h
index f9f127e9e6..b2712c9a8d 100644
--- a/drivers/crypto/mlx5/mlx5_crypto.h
+++ b/drivers/crypto/mlx5/mlx5_crypto.h
@@ -41,6 +41,7 @@ struct mlx5_crypto_priv {
     struct mlx5_common_device *cdev; /* Backend mlx5 device. */
     struct rte_cryptodev *crypto_dev;
     mlx5_reg_mr_t reg_mr_cb; /* Callback to reg_mr func */
+    mlx5_reg_dmabuf_mr_t reg_dmabuf_mr_cb; /* Callback to reg_dmabuf_mr func */
     mlx5_dereg_mr_t dereg_mr_cb; /* Callback to dereg_mr func */
     struct mlx5_uar uar; /* User Access Region. */
     uint32_t max_segs_num; /* Maximum supported data segs. */
diff --git a/drivers/crypto/mlx5/mlx5_crypto_gcm.c b/drivers/crypto/mlx5/mlx5_crypto_gcm.c
index 89f32c7722..380689cfeb 100644
--- a/drivers/crypto/mlx5/mlx5_crypto_gcm.c
+++ b/drivers/crypto/mlx5/mlx5_crypto_gcm.c
@@ -1186,7 +1186,8 @@ mlx5_crypto_gcm_init(struct mlx5_crypto_priv *priv)
 
     /* Override AES-GCM specified ops. */
     dev_ops->sym_session_configure = mlx5_crypto_sym_gcm_session_configure;
-    mlx5_os_set_reg_mr_cb(&priv->reg_mr_cb, &priv->dereg_mr_cb);
+    mlx5_os_set_reg_mr_cb(&priv->reg_mr_cb, &priv->reg_dmabuf_mr_cb,
+                          &priv->dereg_mr_cb);
     dev_ops->queue_pair_setup = mlx5_crypto_gcm_qp_setup;
     dev_ops->queue_pair_release = mlx5_crypto_gcm_qp_release;
     if (mlx5_crypto_is_ipsec_opt(priv)) {
-- 
2.52.0

^ permalink raw reply related	[flat|nested] 27+ messages in thread
* [REVIEW] common/mlx5: support dmabuf 2026-01-27 17:44 ` [PATCH 2/2] common/mlx5: " Cliff Burdick @ 2026-01-27 19:21 ` Stephen Hemminger 2026-01-28 14:30 ` David Marchand 2026-02-03 17:34 ` Cliff Burdick 2026-01-29 1:51 ` [PATCH 2/2] " Stephen Hemminger 1 sibling, 2 replies; 27+ messages in thread From: Stephen Hemminger @ 2026-01-27 19:21 UTC (permalink / raw) To: dev; +Cc: Stephen Hemminger AI-generated review of bundle-1701-dmabuf.mbox Reviewed using Claude (claude-opus-4-5-20251101) This is an automated review. Please verify all suggestions. --- # DPDK Patch Review: dmabuf Support ## Summary This patch series adds dmabuf (DMA buffer) support to DPDK EAL and the MLX5 driver, enabling DMA transfers between drivers without proprietary kernel modules. --- ## Patch 1/2: eal: support dmabuf ### Commit Message Issues **Warning: Subject line format** - Subject "eal: support dmabuf" is acceptable but could be more descriptive - Consider: "eal: add dmabuf external memory registration support" **Warning: Body contains questions to reviewers** The commit message contains design questions that should be resolved before submission: ``` Which option is preferred? ``` Remove these questions and state the chosen design approach clearly. **Info: Body line length** Some lines in the body exceed 75 characters but are within acceptable range. ### Code Issues **Error: Double blank line** ```c } dmabuf_info[RTE_MAX_MEMSEG_LISTS] = { [0 ... RTE_MAX_MEMSEG_LISTS - 1] = { .fd = -1, .offset = 0 } }; #define MAX_MMAP_WITH_DEFINED_ADDR_TRIES 5 ``` Remove the extra blank line after the struct initialization. **Warning: Inconsistent indentation in struct** ```c static struct { int fd; /**< dmabuf fd, -1 if not dmabuf backed */ uint64_t offset; /**< offset within dmabuf */ } dmabuf_info[RTE_MAX_MEMSEG_LISTS] = { ``` The struct members are double-indented with tabs. 
Should use single tab for consistency: ```c static struct { int fd; /**< dmabuf fd, -1 if not dmabuf backed */ uint64_t offset; /**< offset within dmabuf */ } dmabuf_info[RTE_MAX_MEMSEG_LISTS] = { ``` **Error: New public APIs missing `__rte_experimental`** All new public API functions in `rte_memory.h` must be marked as experimental: - `rte_memseg_list_get_dmabuf_fd()` - `rte_memseg_list_get_dmabuf_fd_thread_unsafe()` - `rte_memseg_list_get_dmabuf_offset()` - `rte_memseg_list_get_dmabuf_offset_thread_unsafe()` - `rte_extmem_register_dmabuf()` Add `__rte_experimental` on the line before each function declaration in the header: ```c __rte_experimental int rte_memseg_list_get_dmabuf_fd(const struct rte_memseg_list *msl); ``` **Warning: Missing release notes** New API functions require documentation in `doc/guides/rel_notes/release_XX_YY.rst`. **Warning: Missing version.map updates** New exported symbols need to be added to `lib/eal/version.map` under the `EXPERIMENTAL` section. **Warning: Missing testpmd hooks and functional tests** New APIs should have tests in `app/test/` and hooks in `app/testpmd` per guidelines. **Info: Doxygen style** The Doxygen comments use `dma-buf` inconsistently (sometimes `dmabuf`, sometimes `dma-buf`). Consider standardizing to one form throughout. **Warning: Variable `n` unused context** ```c n = len / page_sz; if (malloc_heap_create_external_seg_dmabuf(va_addr, iova_addrs, n, ``` The variable `n` is computed but the result from `len / page_sz` could be used directly since it's only used once. --- ## Patch 2/2: common/mlx5: support dmabuf ### Commit Message Issues **Info: Subject is acceptable** "common/mlx5: support dmabuf" follows the format guidelines. 
### Code Issues **Warning: Long lines exceed 100 characters** Several lines in `mlx5_common.c` and `mlx5_common_mr.c` exceed the 100-character limit: ```c mr = mlx5_create_mr_ext_dmabuf(dev->pd, (uintptr_t)addr, len, SOCKET_ID_ANY, dmabuf_fd, dmabuf_off, dev->mr_scache.reg_dmabuf_mr_cb); ``` Line 1149: `mr = mlx5_create_mr_ext_dmabuf(dev->pd, (uintptr_t)addr, len,` - OK But the continuation could be better formatted. In `mlx5_common_mr.c`: ```c int dmabuf_fd = rte_memseg_list_get_dmabuf_fd_thread_unsafe(msl); ``` This line likely exceeds 100 characters. Break it up: ```c int dmabuf_fd; dmabuf_fd = rte_memseg_list_get_dmabuf_fd_thread_unsafe(msl); ``` **Warning: Missing `__rte_internal` attribute** The function `mlx5_create_mr_ext_dmabuf()` is declared in `mlx5_common_mr.h` but lacks `__rte_internal`: ```c struct mlx5_mr * mlx5_create_mr_ext_dmabuf(void *pd, uintptr_t addr, size_t len, int socket_id, int dmabuf_fd, uint64_t dmabuf_offset, mlx5_reg_dmabuf_mr_t reg_dmabuf_mr_cb); ``` If this is internal to the driver, add `__rte_internal` on the line before. **Error: Missing RTE_EXPORT_INTERNAL_SYMBOL for mlx5_create_mr_ext_dmabuf** The function `mlx5_create_mr_ext_dmabuf()` in `mlx5_common_mr.c` is missing the export macro if it's meant to be used by other mlx5 components. **Warning: Brace style inconsistency** ```c if (dmabuf_fd >= 0) { uint64_t dmabuf_off; ``` This is fine, but ensure consistency throughout. **Info: Code duplication** The dmabuf detection and registration logic is duplicated between `mlx5_common_dev_dma_map()` and `mlx5_mr_mempool_register_primary()`. Consider extracting to a helper function. 
**Warning: Potential NULL dereference** In `mlx5_common.c`: ```c msl = rte_mem_virt2memseg_list(addr); if (msl != NULL && msl->external) { int dmabuf_fd = rte_memseg_list_get_dmabuf_fd_thread_unsafe(msl); ``` The `rte_memseg_list_get_dmabuf_fd_thread_unsafe()` call happens without holding the memory lock, but the comment in the header says it's only safe within memory-related callbacks. This may be a race condition if memory configuration changes. **Warning: Missing error check** ```c rte_memseg_list_get_dmabuf_offset_thread_unsafe(msl, &dmabuf_off); ``` The return value is not checked. If the function fails, `dmabuf_off` may be uninitialized. **Info: Windows stub** ```c *reg_dmabuf_mr_cb = NULL; /* dma-buf not supported on Windows */ ``` Good practice to document platform limitations. --- ## General Issues **Warning: Missing documentation** - No updates to programmer's guide documentation explaining dmabuf usage - No updates to release notes **Warning: ABI considerations** The patch mentions avoiding ABI changes by using a side-table for dmabuf info. This design decision should be documented more clearly, and the commit message questions should be resolved. **Info: Linux-only feature** The dmabuf feature is Linux-specific. Consider adding explicit documentation about platform support. --- ## Checklist Summary | Item | Status | |------|--------| | SPDX license | ✓ OK (existing files) | | Signed-off-by | ✓ Present | | Subject line length | ✓ OK | | Subject format | ✓ OK | | Body line wrap | ⚠ Some lines long | | `__rte_experimental` on new APIs | ✗ Missing | | Release notes | ✗ Missing | | version.map updates | ✗ Missing | | Tests | ✗ Missing | | Documentation | ✗ Missing | | Code style | ⚠ Minor issues | ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [REVIEW] common/mlx5: support dmabuf 2026-01-27 19:21 ` [REVIEW] " Stephen Hemminger @ 2026-01-28 14:30 ` David Marchand 2026-01-28 17:10 ` Stephen Hemminger 2026-01-28 17:43 ` Stephen Hemminger 2026-02-03 17:34 ` Cliff Burdick 1 sibling, 2 replies; 27+ messages in thread From: David Marchand @ 2026-01-28 14:30 UTC (permalink / raw) To: Stephen Hemminger; +Cc: dev Hello Stephen, On Tue, 27 Jan 2026 at 20:22, Stephen Hemminger <stephen@networkplumber.org> wrote: > > AI-generated review of bundle-1701-dmabuf.mbox > Reviewed using Claude (claude-opus-4-5-20251101) > > This is an automated review. Please verify all suggestions. > > --- > > # DPDK Patch Review: dmabuf Support > > ## Summary > This patch series adds dmabuf (DMA buffer) support to DPDK EAL and the MLX5 driver, enabling DMA transfers between drivers without proprietary kernel modules. > [snip] > **Warning: Missing version.map updates** > New exported symbols need to be added to `lib/eal/version.map` under the `EXPERIMENTAL` section. I noticed similar comments on other series. There is no version.map update needed anymore, since v25.07. -- David Marchand ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [REVIEW] common/mlx5: support dmabuf 2026-01-28 14:30 ` David Marchand @ 2026-01-28 17:10 ` Stephen Hemminger 2026-01-28 17:43 ` Stephen Hemminger 1 sibling, 0 replies; 27+ messages in thread From: Stephen Hemminger @ 2026-01-28 17:10 UTC (permalink / raw) To: David Marchand; +Cc: dev On Wed, 28 Jan 2026 15:30:17 +0100 David Marchand <david.marchand@redhat.com> wrote: > Hello Stephen, > > On Tue, 27 Jan 2026 at 20:22, Stephen Hemminger > <stephen@networkplumber.org> wrote: > > > > AI-generated review of bundle-1701-dmabuf.mbox > > Reviewed using Claude (claude-opus-4-5-20251101) > > > > This is an automated review. Please verify all suggestions. > > > > --- > > > > # DPDK Patch Review: dmabuf Support > > > > ## Summary > > This patch series adds dmabuf (DMA buffer) support to DPDK EAL and the MLX5 driver, enabling DMA transfers between drivers without proprietary kernel modules. > > > [snip] > > > **Warning: Missing version.map updates** > > New exported symbols need to be added to `lib/eal/version.map` under the `EXPERIMENTAL` section. > > I noticed similar comments on other series. > There is no version.map update needed anymore, since v25.07. > > I know, there is nothing in the AGENTS file about it, not sure where that neuron is coming from. ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [REVIEW] common/mlx5: support dmabuf 2026-01-28 14:30 ` David Marchand 2026-01-28 17:10 ` Stephen Hemminger @ 2026-01-28 17:43 ` Stephen Hemminger 1 sibling, 0 replies; 27+ messages in thread From: Stephen Hemminger @ 2026-01-28 17:43 UTC (permalink / raw) To: David Marchand; +Cc: dev On Wed, 28 Jan 2026 15:30:17 +0100 David Marchand <david.marchand@redhat.com> wrote: > Hello Stephen, > > On Tue, 27 Jan 2026 at 20:22, Stephen Hemminger > <stephen@networkplumber.org> wrote: > > > > AI-generated review of bundle-1701-dmabuf.mbox > > Reviewed using Claude (claude-opus-4-5-20251101) > > > > This is an automated review. Please verify all suggestions. > > > > --- > > > > # DPDK Patch Review: dmabuf Support > > > > ## Summary > > This patch series adds dmabuf (DMA buffer) support to DPDK EAL and the MLX5 driver, enabling DMA transfers between drivers without proprietary kernel modules. > > > [snip] > > > **Warning: Missing version.map updates** > > New exported symbols need to be added to `lib/eal/version.map` under the `EXPERIMENTAL` section. > > I noticed similar comments on other series. > There is no version.map update needed anymore, since v25.07. > > I got AI to fix itself :-) Now I understand the issue. The AGENTS.md says "New external functions must be exported properly" but doesn't explain the current mechanism. The AI is filling in with outdated knowledge about version.map files. DPDK has moved to automatic symbol map generation using export macros. ^ permalink raw reply [flat|nested] 27+ messages in thread
* RE: [REVIEW] common/mlx5: support dmabuf 2026-01-27 19:21 ` [REVIEW] " Stephen Hemminger 2026-01-28 14:30 ` David Marchand @ 2026-02-03 17:34 ` Cliff Burdick 1 sibling, 0 replies; 27+ messages in thread From: Cliff Burdick @ 2026-02-03 17:34 UTC (permalink / raw) To: Stephen Hemminger, dev@dpdk.org > External email: Use caution opening links or attachments > > > AI-generated review of bundle-1701-dmabuf.mbox Reviewed using Claude (claude-opus-4-5-20251101) > > This is an automated review. Please verify all suggestions. > > --- > > # DPDK Patch Review: dmabuf Support > > ## Summary > This patch series adds dmabuf (DMA buffer) support to DPDK EAL and the MLX5 driver, enabling DMA transfers between drivers without proprietary kernel modules. > > --- > > ## Patch 1/2: eal: support dmabuf > > ### Commit Message Issues > > **Warning: Subject line format** > - Subject "eal: support dmabuf" is acceptable but could be more descriptive > - Consider: "eal: add dmabuf external memory registration support" > > **Warning: Body contains questions to reviewers** The commit message contains design questions that should be resolved before submission: > ``` > Which option is preferred? > ``` > Remove these questions and state the chosen design approach clearly. > > **Info: Body line length** > Some lines in the body exceed 75 characters but are within acceptable range. > > ### Code Issues > > **Error: Double blank line** > ```c > } dmabuf_info[RTE_MAX_MEMSEG_LISTS] = { > [0 ... RTE_MAX_MEMSEG_LISTS - 1] = { .fd = -1, .offset = 0 } }; > Fixed > > #define MAX_MMAP_WITH_DEFINED_ADDR_TRIES 5 ``` > Remove the extra blank line after the struct initialization. > > **Warning: Inconsistent indentation in struct** > ```c static struct { > int fd; /**< dmabuf fd, -1 if not dmabuf backed */ > uint64_t offset; /**< offset within dmabuf */ > } dmabuf_info[RTE_MAX_MEMSEG_LISTS] = { ``` > The struct members are double-indented with tabs. 
Should use single tab for consistency: > ```c > static struct { > int fd; /**< dmabuf fd, -1 if not dmabuf backed */ > uint64_t offset; /**< offset within dmabuf */ } dmabuf_info[RTE_MAX_MEMSEG_LISTS] = { ``` > Removed as part of refactoring away from the init struct syntax. > > **Error: New public APIs missing `__rte_experimental`** All new public API functions in `rte_memory.h` must be marked as experimental: > - `rte_memseg_list_get_dmabuf_fd()` > - `rte_memseg_list_get_dmabuf_fd_thread_unsafe()` > - `rte_memseg_list_get_dmabuf_offset()` > - `rte_memseg_list_get_dmabuf_offset_thread_unsafe()` > - `rte_extmem_register_dmabuf()` > > Add `__rte_experimental` on the line before each function declaration in the header: > ```c > __rte_experimental > int > rte_memseg_list_get_dmabuf_fd(const struct rte_memseg_list *msl); ``` > Done > > **Warning: Missing release notes** > New API functions require documentation in `doc/guides/rel_notes/release_XX_YY.rst`. Done > > **Warning: Missing version.map updates** > New exported symbols need to be added to `lib/eal/version.map` under the `EXPERIMENTAL` section. > > **Warning: Missing testpmd hooks and functional tests** > New APIs should have tests in `app/test/` and hooks in `app/testpmd` per guidelines. > > **Info: Doxygen style** > The Doxygen comments use `dma-buf` inconsistently (sometimes `dmabuf`, sometimes `dma-buf`). Consider standardizing to one form throughout. > > **Warning: Variable `n` unused context** > ```c > n = len / page_sz; > if (malloc_heap_create_external_seg_dmabuf(va_addr, iova_addrs, n, ``` > The variable `n` is computed but the result from `len / page_sz` could be used directly since it's only used once. Fixed as part of refactoring. > > --- > > ## Patch 2/2: common/mlx5: support dmabuf > > ### Commit Message Issues > > **Info: Subject is acceptable** > "common/mlx5: support dmabuf" follows the format guidelines. 
> > ### Code Issues > > **Warning: Long lines exceed 100 characters** > Several lines in `mlx5_common.c` and `mlx5_common_mr.c` exceed the 100-character limit: > > ```c > mr = mlx5_create_mr_ext_dmabuf(dev->pd, (uintptr_t)addr, len, > SOCKET_ID_ANY, dmabuf_fd, dmabuf_off, > dev->mr_scache.reg_dmabuf_mr_cb); ``` > Line 1149: `mr = mlx5_create_mr_ext_dmabuf(dev->pd, (uintptr_t)addr, len,` - OK > But the continuation could be better formatted. > > In `mlx5_common_mr.c`: > ```c > int dmabuf_fd = rte_memseg_list_get_dmabuf_fd_thread_unsafe(msl); > ``` > This line likely exceeds 100 characters. Break it up: > ```c > int dmabuf_fd; > dmabuf_fd = rte_memseg_list_get_dmabuf_fd_thread_unsafe(msl); > ``` > > **Warning: Missing `__rte_internal` attribute** > The function `mlx5_create_mr_ext_dmabuf()` is declared in `mlx5_common_mr.h` but lacks `__rte_internal`: > ```c > struct mlx5_mr * > mlx5_create_mr_ext_dmabuf(void *pd, uintptr_t addr, size_t len, int socket_id, > int dmabuf_fd, uint64_t dmabuf_offset, > mlx5_reg_dmabuf_mr_t reg_dmabuf_mr_cb); ``` > If this is internal to the driver, add `__rte_internal` on the line before. > > **Error: Missing RTE_EXPORT_INTERNAL_SYMBOL for mlx5_create_mr_ext_dmabuf** > The function `mlx5_create_mr_ext_dmabuf()` in `mlx5_common_mr.c` is missing the export macro if it's meant to be used by other mlx5 components. Fixed > > **Warning: Brace style inconsistency** > ```c > if (dmabuf_fd >= 0) { > uint64_t dmabuf_off; ``` > This is fine, but ensure consistency throughout. > > **Info: Code duplication** > The dmabuf detection and registration logic is duplicated between `mlx5_common_dev_dma_map()` and `mlx5_mr_mempool_register_primary()`. Consider extracting to a helper function. 
> > **Warning: Potential NULL dereference** > In `mlx5_common.c`: > ```c > msl = rte_mem_virt2memseg_list(addr); > if (msl != NULL && msl->external) { > int dmabuf_fd = rte_memseg_list_get_dmabuf_fd_thread_unsafe(msl); > ``` > The `rte_memseg_list_get_dmabuf_fd_thread_unsafe()` call happens without holding the memory lock, but the comment in the header says it's only safe within memory-related callbacks. This may be a race condition if memory configuration changes. Switched to safe version of the functions > > **Warning: Missing error check** > ```c > rte_memseg_list_get_dmabuf_offset_thread_unsafe(msl, &dmabuf_off); ``` > The return value is not checked. If the function fails, `dmabuf_off` may be uninitialized. Checked errors > > **Info: Windows stub** > ```c > *reg_dmabuf_mr_cb = NULL; /* dma-buf not supported on Windows */ ``` > Good practice to document platform limitations. > > --- > > ## General Issues > > **Warning: Missing documentation** > - No updates to programmer's guide documentation explaining dmabuf usage > - No updates to release notes > > **Warning: ABI considerations** > The patch mentions avoiding ABI changes by using a side-table for dmabuf info. This design decision should be documented more clearly, and the commit message questions should be resolved. > > **Info: Linux-only feature** > The dmabuf feature is Linux-specific. Consider adding explicit documentation about platform support. > > --- > > ## Checklist Summary > > | Item | Status | > |------|--------| > | SPDX license | ✓ OK (existing files) | Signed-off-by | ✓ Present | > | Subject line length | ✓ OK | Subject format | ✓ OK | Body line wrap | > | ⚠ Some lines long | `__rte_experimental` on new APIs | ✗ Missing | > | Release notes | ✗ Missing | version.map updates | ✗ Missing | Tests | > | ✗ Missing | Documentation | ✗ Missing | Code style | ⚠ Minor issues | ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 2/2] common/mlx5: support dmabuf 2026-01-27 17:44 ` [PATCH 2/2] common/mlx5: " Cliff Burdick 2026-01-27 19:21 ` [REVIEW] " Stephen Hemminger @ 2026-01-29 1:51 ` Stephen Hemminger 1 sibling, 0 replies; 27+ messages in thread From: Stephen Hemminger @ 2026-01-29 1:51 UTC (permalink / raw) To: Cliff Burdick Cc: dev, anatoly.burakov, Dariusz Sosnowski, Viacheslav Ovsiienko, Bing Zhao, Ori Kam, Suanming Mou, Matan Azrad On Tue, 27 Jan 2026 17:44:09 +0000 Cliff Burdick <cburdick@nvidia.com> wrote: > +static struct ibv_mr * > +mlx5_glue_reg_dmabuf_mr(struct ibv_pd *pd, uint64_t offset, size_t length, > + uint64_t iova, int fd, int access) > +{ > +#ifdef HAVE_IBV_REG_DMABUF_MR > + return ibv_reg_dmabuf_mr(pd, offset, length, iova, fd, access); > +#else > + (void)pd; > + (void)offset; > + (void)length; > + (void)iova; > + (void)fd; > + (void)access; > + errno = ENOTSUP; > + return NULL; > +#endif > +} I would prefer the callback hook did not exist (was NULL) if you don't have your #ifdef. The (void) change looks messy and better handled by caller. ^ permalink raw reply [flat|nested] 27+ messages in thread
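Stephen's suggestion above, leaving the callback pointer NULL when the feature is compiled out rather than installing a stub full of `(void)` casts, can be sketched as follows. All names here (`HAVE_DMABUF_SUPPORT`, `reg_dmabuf_cb_t`, `try_register`) are hypothetical stand-ins, not the mlx5 glue API:

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical callback type playing the role of mlx5_reg_dmabuf_mr_t. */
typedef int (*reg_dmabuf_cb_t)(int fd, size_t len);

#ifdef HAVE_DMABUF_SUPPORT
static int reg_dmabuf_impl(int fd, size_t len)
{
	/* Real registration (e.g. ibv_reg_dmabuf_mr) would go here. */
	(void)fd;
	(void)len;
	return 0;
}
#endif

/* Install the callback only when the feature is compiled in; with the
 * feature absent, the pointer stays NULL instead of pointing at a stub. */
static void set_reg_cb(reg_dmabuf_cb_t *cb)
{
#ifdef HAVE_DMABUF_SUPPORT
	*cb = reg_dmabuf_impl;
#else
	*cb = NULL;
#endif
}

static int try_register(int fd, size_t len)
{
	reg_dmabuf_cb_t cb;

	set_reg_cb(&cb);
	if (cb == NULL) {
		/* Caller decides the fallback, e.g. regular MR registration. */
		errno = ENOTSUP;
		return -1;
	}
	return cb(fd, len);
}
```

This moves the "not supported" decision to the caller via a NULL check, so the compiled-out path never needs a dummy function body.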
* Re: [PATCH 0/2] support dmabuf 2026-01-27 17:44 [PATCH 0/2] support dmabuf Cliff Burdick 2026-01-27 17:44 ` [PATCH 1/2] eal: " Cliff Burdick 2026-01-27 17:44 ` [PATCH 2/2] common/mlx5: " Cliff Burdick @ 2026-01-28 0:04 ` Stephen Hemminger 2026-02-03 17:18 ` Cliff Burdick 2026-02-03 22:26 ` [PATCH v2 " Cliff Burdick 3 siblings, 1 reply; 27+ messages in thread From: Stephen Hemminger @ 2026-01-28 0:04 UTC (permalink / raw) To: Cliff Burdick; +Cc: dev, anatoly.burakov On Tue, 27 Jan 2026 17:44:07 +0000 Cliff Burdick <cburdick@nvidia.com> wrote: > Add support for kernel dmabuf feature and integrate it in the mlx5 driver. > This feature is needed to support GPUDirect on newer kernels. > > Cliff Burdick (2): > eal: support dmabuf > common/mlx5: support dmabuf > > .mailmap | 1 + > drivers/common/mlx5/linux/meson.build | 2 + > drivers/common/mlx5/linux/mlx5_common_verbs.c | 48 ++++- > drivers/common/mlx5/linux/mlx5_glue.c | 19 ++ > drivers/common/mlx5/linux/mlx5_glue.h | 3 + > drivers/common/mlx5/mlx5_common.c | 28 ++- > drivers/common/mlx5/mlx5_common_mr.c | 108 ++++++++++- > drivers/common/mlx5/mlx5_common_mr.h | 17 +- > drivers/common/mlx5/windows/mlx5_common_os.c | 8 +- > drivers/crypto/mlx5/mlx5_crypto.h | 1 + > drivers/crypto/mlx5/mlx5_crypto_gcm.c | 3 +- > lib/eal/common/eal_common_memory.c | 168 ++++++++++++++++++ > lib/eal/common/eal_memalloc.h | 21 +++ > lib/eal/common/malloc_heap.c | 27 +++ > lib/eal/common/malloc_heap.h | 5 + > lib/eal/include/rte_memory.h | 125 +++++++++++++ > 16 files changed, 576 insertions(+), 8 deletions(-) > Build fails (on MSVC) fix and resubmit. "cl" "-Ilib\librte_eal.a.p" "-Ilib" "-I..\lib" "-Ilib\eal\common" "-I..\lib\eal\common" "-I." "-I.." 
"-Iconfig" "-I..\config" "-Ilib\eal\include" "-I..\lib\eal\include" "-Ilib\eal\windows\include" "-I..\lib\eal\windows\include" "-Ilib\eal\x86\include" "-I..\lib\eal\x86\include" "-Ilib\eal" "-I..\lib\eal" "-Ilib\argparse" "-I..\lib\argparse" "-Ilib\log" "-I..\lib\log" "-Ilib\kvargs" "-I..\lib\kvargs" "/MD" "/nologo" "/showIncludes" "/utf-8" "/W3" "/WX" "/std:c11" "/O2" "/Gw" "/wd4244" "/wd4267" "/wd4146" "/experimental:c11atomics" "/d1experimental:typeof" "/experimental:statementExpressions" "/FI" "rte_config.h" "-D_GNU_SOURCE" "-D_WIN32_WINNT=0x0A00" "-DALLOW_EXPERIMENTAL_API" "-DALLOW_INTERNAL_API" "-DABI_VERSION=\"26.1\"" "-DRTE_LOG_DEFAULT_LOGTYPE=lib.eal" "/Fdlib\librte_eal.a.p\eal_common_eal_common_memory.c.pdb" /Folib/librte_eal.a.p/eal_common_eal_common_memory.c.obj "/c" ../lib/eal/common/eal_common_memory.c ../lib/eal/common/eal_common_memory.c(56): error C2143: syntax error: missing ']' before '...' ../lib/eal/common/eal_common_memory.c(56): error C2059: syntax error: '...' ../lib/eal/common/eal_common_memory.c(57): error C2059: syntax error: '}' ^ permalink raw reply [flat|nested] 27+ messages in thread
* RE: [PATCH 0/2] support dmabuf 2026-01-28 0:04 ` [PATCH 0/2] " Stephen Hemminger @ 2026-02-03 17:18 ` Cliff Burdick 0 siblings, 0 replies; 27+ messages in thread From: Cliff Burdick @ 2026-02-03 17:18 UTC (permalink / raw) To: Stephen Hemminger; +Cc: dev@dpdk.org, anatoly.burakov@intel.com > -----Original Message----- > From: Stephen Hemminger <stephen@networkplumber.org> > Sent: Tuesday, January 27, 2026 4:04 PM > To: Cliff Burdick <cburdick@nvidia.com> > Cc: dev@dpdk.org; anatoly.burakov@intel.com > Subject: Re: [PATCH 0/2] support dmabuf > > External email: Use caution opening links or attachments > > > On Tue, 27 Jan 2026 17:44:07 +0000 > Cliff Burdick <cburdick@nvidia.com> wrote: > > > Add support for kernel dmabuf feature and integrate it in the mlx5 driver. > > This feature is needed to support GPUDirect on newer kernels. > > > > Cliff Burdick (2): > > eal: support dmabuf > > common/mlx5: support dmabuf > > > > .mailmap | 1 + > > drivers/common/mlx5/linux/meson.build | 2 + > > drivers/common/mlx5/linux/mlx5_common_verbs.c | 48 ++++- > > drivers/common/mlx5/linux/mlx5_glue.c | 19 ++ > > drivers/common/mlx5/linux/mlx5_glue.h | 3 + > > drivers/common/mlx5/mlx5_common.c | 28 ++- > > drivers/common/mlx5/mlx5_common_mr.c | 108 ++++++++++- > > drivers/common/mlx5/mlx5_common_mr.h | 17 +- > > drivers/common/mlx5/windows/mlx5_common_os.c | 8 +- > > drivers/crypto/mlx5/mlx5_crypto.h | 1 + > > drivers/crypto/mlx5/mlx5_crypto_gcm.c | 3 +- > > lib/eal/common/eal_common_memory.c | 168 ++++++++++++++++++ > > lib/eal/common/eal_memalloc.h | 21 +++ > > lib/eal/common/malloc_heap.c | 27 +++ > > lib/eal/common/malloc_heap.h | 5 + > > lib/eal/include/rte_memory.h | 125 +++++++++++++ > > 16 files changed, 576 insertions(+), 8 deletions(-) > > > > Build fails (on MSVC) fix and resubmit. > > "cl" "-Ilib\librte_eal.a.p" "-Ilib" "-I..\lib" "-Ilib\eal\common" "-I..\lib\eal\common" "-I." "-I.." 
"-Iconfig" "-I..\config" "-Ilib\eal\include" "-I..\lib\eal\include" "-Ilib\eal\windows\include" "-I..\lib\eal\windows\include" "-Ilib\eal\x86\include" "-I..\lib\eal\x86\include" "-Ilib\eal" "-I..\lib\eal" "-Ilib\argparse" "-> I..\lib\argparse" "-Ilib\log" "-I..\lib\log" "-Ilib\kvargs" "-I..\lib\kvargs" "/MD" "/nologo" "/showIncludes" "/utf-8" "/W3" "/WX" "/std:c11" "/O2" "/Gw" "/wd4244" "/wd4267" "/wd4146" "/experimental:c11atomics" "/d1experimental:typeof" "/experimental:statementExpressions" "/FI" "rte_config.h" "-> D_GNU_SOURCE" "-D_WIN32_WINNT=0x0A00" "-DALLOW_EXPERIMENTAL_API" "-DALLOW_INTERNAL_API" "-DABI_VERSION=\"26.1\"" "-DRTE_LOG_DEFAULT_LOGTYPE=lib.eal" "/Fdlib\librte_eal.a.p\eal_common_eal_common_memory.c.pdb" /Folib/librte_eal.a.p/eal_common_eal_common_memory.c.obj > "/c" ../lib/eal/common/eal_common_memory.c >../lib/eal/common/eal_common_memory.c(56): error C2143: syntax error: missing ']' before '...' >../lib/eal/common/eal_common_memory.c(56): error C2059: syntax error: '...' >../lib/eal/common/eal_common_memory.c(57): error C2059: syntax error: '}' Fixed by moving to an init function ^ permalink raw reply [flat|nested] 27+ messages in thread
* [PATCH v2 0/2] support dmabuf 2026-01-27 17:44 [PATCH 0/2] support dmabuf Cliff Burdick ` (2 preceding siblings ...) 2026-01-28 0:04 ` [PATCH 0/2] " Stephen Hemminger @ 2026-02-03 22:26 ` Cliff Burdick 2026-02-03 22:26 ` [PATCH v2 1/2] eal: " Cliff Burdick ` (2 more replies) 3 siblings, 3 replies; 27+ messages in thread From: Cliff Burdick @ 2026-02-03 22:26 UTC (permalink / raw) To: dev; +Cc: anatoly.burakov Fixed since v1: * Fixed issue with MSVC compilation * Fixed style issues from code review Add support for kernel dmabuf feature and integrate it in the mlx5 driver. This feature is needed to support GPUDirect on newer kernels. Cliff Burdick (2): eal: support dmabuf common/mlx5: support dmabuf .mailmap | 1 + doc/guides/rel_notes/release_26_03.rst | 6 + drivers/common/mlx5/linux/meson.build | 2 + drivers/common/mlx5/linux/mlx5_common_verbs.c | 48 ++++- drivers/common/mlx5/linux/mlx5_glue.c | 19 ++ drivers/common/mlx5/linux/mlx5_glue.h | 3 + drivers/common/mlx5/mlx5_common.c | 42 ++++- drivers/common/mlx5/mlx5_common_mr.c | 113 +++++++++++- drivers/common/mlx5/mlx5_common_mr.h | 17 +- drivers/common/mlx5/windows/mlx5_common_os.c | 8 +- drivers/crypto/mlx5/mlx5_crypto.h | 1 + drivers/crypto/mlx5/mlx5_crypto_gcm.c | 3 +- lib/eal/common/eal_common_memory.c | 165 +++++++++++++++++- lib/eal/common/eal_memalloc.h | 21 +++ lib/eal/common/malloc_heap.c | 27 +++ lib/eal/common/malloc_heap.h | 5 + lib/eal/include/rte_memory.h | 145 +++++++++++++++ 17 files changed, 612 insertions(+), 14 deletions(-) -- 2.52.0 ^ permalink raw reply [flat|nested] 27+ messages in thread
* [PATCH v2 1/2] eal: support dmabuf 2026-02-03 22:26 ` [PATCH v2 " Cliff Burdick @ 2026-02-03 22:26 ` Cliff Burdick 2026-02-03 22:26 ` [PATCH v2 2/2] common/mlx5: " Cliff Burdick 2026-02-03 23:02 ` [PATCH v3 0/2] " Cliff Burdick 2 siblings, 0 replies; 27+ messages in thread From: Cliff Burdick @ 2026-02-03 22:26 UTC (permalink / raw) To: dev; +Cc: anatoly.burakov, Thomas Monjalon dmabuf is a modern Linux kernel feature to allow DMA transfers between two drivers. Common examples of usage are streaming video devices and NIC to GPU transfers. Prior to dmabuf users had to load proprietary drivers to expose the DMA mappings. With dmabuf the proprietary drivers are no longer required. A new api function rte_extmem_register_dmabuf is introduced to create the mapping from a dmabuf file descriptor. dmabuf uses a file descriptor and an offset that has been pre-opened with the kernel. The kernel uses the file descriptor to map to a VA pointer. To avoid ABI changes, a static struct is used inside of eal_common_memory.c, and lookups are done on this struct rather than from the rte_memseg_list. Ideally we would like to add both the dmabuf file descriptor and offset to rte_memseg_list, but it's not clear if we can reuse existing fields when using the dmabuf API. We could rename the external flag to a more generic "properties" flag where "external" is the lowest bit, then we can use the second bit to indicate the presence of dmabuf. In the presence of the flag for dmabuf we could reuse the base_va address field for the dmabuf offset, and the socket_id for the file descriptor. 
Signed-off-by: Cliff Burdick <cburdick@nvidia.com> --- .mailmap | 1 + doc/guides/rel_notes/release_26_03.rst | 6 + lib/eal/common/eal_common_memory.c | 165 ++++++++++++++++++++++++- lib/eal/common/eal_memalloc.h | 21 ++++ lib/eal/common/malloc_heap.c | 27 ++++ lib/eal/common/malloc_heap.h | 5 + lib/eal/include/rte_memory.h | 145 ++++++++++++++++++++++ 7 files changed, 364 insertions(+), 6 deletions(-) diff --git a/.mailmap b/.mailmap index 2f089326ff..4c2b2f921d 100644 --- a/.mailmap +++ b/.mailmap @@ -291,6 +291,7 @@ Cian Ferriter <cian.ferriter@intel.com> Ciara Loftus <ciara.loftus@intel.com> Ciara Power <ciara.power@intel.com> Claire Murphy <claire.k.murphy@intel.com> +Cliff Burdick <cburdick@nvidia.com> Clemens Famulla-Conrad <cfamullaconrad@suse.com> Cody Doucette <doucette@bu.edu> Congwen Zhang <zhang.congwen@zte.com.cn> diff --git a/doc/guides/rel_notes/release_26_03.rst b/doc/guides/rel_notes/release_26_03.rst index 15dabee7a1..56457d0382 100644 --- a/doc/guides/rel_notes/release_26_03.rst +++ b/doc/guides/rel_notes/release_26_03.rst @@ -55,6 +55,12 @@ New Features Also, make sure to start the actual text at the margin. ======================================================= +* **Added dma-buf-backed external memory support.** + + Added EAL support for registering dma-buf-backed external memory with + ``rte_extmem_register_dmabuf``, and enabled mlx5 common code to consume + dma-buf mappings for device access. + Removed Items ------------- diff --git a/lib/eal/common/eal_common_memory.c b/lib/eal/common/eal_common_memory.c index c62edf5e55..34ebbdc202 100644 --- a/lib/eal/common/eal_common_memory.c +++ b/lib/eal/common/eal_common_memory.c @@ -45,6 +45,15 @@ static void *next_baseaddr; static uint64_t system_page_sz; +/* Internal storage for dma-buf info, indexed by memseg list index. + * This keeps dma-buf metadata out of the public rte_memseg_list structure + * to preserve ABI compatibility. 
+ */ +static struct { + int fd; /**< dma-buf fd, -1 if not dma-buf backed */ + uint64_t offset; /**< offset within dma-buf */ +} dmabuf_info[RTE_MAX_MEMSEG_LISTS]; + #define MAX_MMAP_WITH_DEFINED_ADDR_TRIES 5 void * eal_get_virtual_area(void *requested_addr, size_t *size, @@ -232,6 +241,10 @@ eal_memseg_list_init(struct rte_memseg_list *msl, uint64_t page_sz, { char name[RTE_FBARRAY_NAME_LEN]; + /* Initialize dma-buf info to "not dma-buf backed" */ + dmabuf_info[type_msl_idx].fd = -1; + dmabuf_info[type_msl_idx].offset = 0; + snprintf(name, sizeof(name), MEMSEG_LIST_FMT, page_sz >> 10, socket_id, type_msl_idx); @@ -930,10 +943,113 @@ rte_memseg_get_fd_offset(const struct rte_memseg *ms, size_t *offset) return ret; } -RTE_EXPORT_SYMBOL(rte_extmem_register) +/* Internal dma-buf info functions */ int -rte_extmem_register(void *va_addr, size_t len, rte_iova_t iova_addrs[], - unsigned int n_pages, size_t page_sz) +eal_memseg_list_set_dmabuf_info(int list_idx, int fd, uint64_t offset) +{ + if (list_idx < 0 || list_idx >= RTE_MAX_MEMSEG_LISTS) + return -EINVAL; + + dmabuf_info[list_idx].fd = fd; + dmabuf_info[list_idx].offset = offset; + return 0; +} + +int +eal_memseg_list_get_dmabuf_fd(int list_idx) +{ + if (list_idx < 0 || list_idx >= RTE_MAX_MEMSEG_LISTS) + return -EINVAL; + + return dmabuf_info[list_idx].fd; +} + +int +eal_memseg_list_get_dmabuf_offset(int list_idx, uint64_t *offset) +{ + if (list_idx < 0 || list_idx >= RTE_MAX_MEMSEG_LISTS || offset == NULL) + return -EINVAL; + + *offset = dmabuf_info[list_idx].offset; + return 0; +} + +/* Public dma-buf info API functions */ +RTE_EXPORT_SYMBOL(rte_memseg_list_get_dmabuf_fd_unsafe) +int +rte_memseg_list_get_dmabuf_fd_unsafe(const struct rte_memseg_list *msl) +{ + struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config; + int msl_idx; + + if (msl == NULL) { + rte_errno = EINVAL; + return -1; + } + + msl_idx = msl - mcfg->memsegs; + if (msl_idx < 0 || msl_idx >= RTE_MAX_MEMSEG_LISTS) { + rte_errno = 
EINVAL; + return -1; + } + + return dmabuf_info[msl_idx].fd; +} + +RTE_EXPORT_SYMBOL(rte_memseg_list_get_dmabuf_fd) +int +rte_memseg_list_get_dmabuf_fd(const struct rte_memseg_list *msl) +{ + int ret; + + rte_mcfg_mem_read_lock(); + ret = rte_memseg_list_get_dmabuf_fd_unsafe(msl); + rte_mcfg_mem_read_unlock(); + + return ret; +} + +RTE_EXPORT_SYMBOL(rte_memseg_list_get_dmabuf_offset_unsafe) +int +rte_memseg_list_get_dmabuf_offset_unsafe(const struct rte_memseg_list *msl, + uint64_t *offset) +{ + struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config; + int msl_idx; + + if (msl == NULL || offset == NULL) { + rte_errno = EINVAL; + return -1; + } + + msl_idx = msl - mcfg->memsegs; + if (msl_idx < 0 || msl_idx >= RTE_MAX_MEMSEG_LISTS) { + rte_errno = EINVAL; + return -1; + } + + *offset = dmabuf_info[msl_idx].offset; + return 0; +} + +RTE_EXPORT_SYMBOL(rte_memseg_list_get_dmabuf_offset) +int +rte_memseg_list_get_dmabuf_offset(const struct rte_memseg_list *msl, + uint64_t *offset) +{ + int ret; + + rte_mcfg_mem_read_lock(); + ret = rte_memseg_list_get_dmabuf_offset_unsafe(msl, offset); + rte_mcfg_mem_read_unlock(); + + return ret; +} + +static int +extmem_register(void *va_addr, size_t len, + int dmabuf_fd, uint64_t dmabuf_offset, + rte_iova_t iova_addrs[], unsigned int n_pages, size_t page_sz) { struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config; unsigned int socket_id, n; @@ -967,10 +1083,19 @@ rte_extmem_register(void *va_addr, size_t len, rte_iova_t iova_addrs[], /* we can create a new memseg */ n = len / page_sz; - if (malloc_heap_create_external_seg(va_addr, iova_addrs, n, + if (dmabuf_fd < 0) { + if (malloc_heap_create_external_seg(va_addr, iova_addrs, n, page_sz, "extmem", socket_id) == NULL) { - ret = -1; - goto unlock; + ret = -1; + goto unlock; + } + } else { + if (malloc_heap_create_external_seg_dmabuf(va_addr, iova_addrs, n, + page_sz, "extmem_dmabuf", socket_id, + dmabuf_fd, dmabuf_offset) == NULL) { + ret = -1; + goto 
unlock; + } } /* memseg list successfully created - increment next socket ID */ @@ -980,6 +1105,34 @@ rte_extmem_register(void *va_addr, size_t len, rte_iova_t iova_addrs[], return ret; } +RTE_EXPORT_SYMBOL(rte_extmem_register) +int +rte_extmem_register(void *va_addr, size_t len, rte_iova_t iova_addrs[], + unsigned int n_pages, size_t page_sz) +{ + return extmem_register(va_addr, len, -1, 0, iova_addrs, n_pages, page_sz); +} + +RTE_EXPORT_SYMBOL(rte_extmem_register_dmabuf) +int +rte_extmem_register_dmabuf(void *va_addr, size_t len, + int dmabuf_fd, uint64_t dmabuf_offset, + rte_iova_t iova_addrs[], unsigned int n_pages, size_t page_sz) +{ + if (dmabuf_fd < 0) { + rte_errno = EINVAL; + return -1; + } + + return extmem_register(va_addr, + len, + dmabuf_fd, + dmabuf_offset, + iova_addrs, + n_pages, + page_sz); +} + RTE_EXPORT_SYMBOL(rte_extmem_unregister) int rte_extmem_unregister(void *va_addr, size_t len) diff --git a/lib/eal/common/eal_memalloc.h b/lib/eal/common/eal_memalloc.h index 0c267066d9..e7e807ddcb 100644 --- a/lib/eal/common/eal_memalloc.h +++ b/lib/eal/common/eal_memalloc.h @@ -90,6 +90,27 @@ eal_memalloc_set_seg_list_fd(int list_idx, int fd); int eal_memalloc_get_seg_fd_offset(int list_idx, int seg_idx, size_t *offset); +/* + * Set dma-buf info for a memseg list. + * Returns 0 on success, -errno on failure. + */ +int +eal_memseg_list_set_dmabuf_info(int list_idx, int fd, uint64_t offset); + +/* + * Get dma-buf fd for a memseg list. + * Returns fd (>= 0) on success, -1 if not dma-buf backed, -errno on error. + */ +int +eal_memseg_list_get_dmabuf_fd(int list_idx); + +/* + * Get dma-buf offset for a memseg list. + * Returns 0 on success, -errno on failure. 
+ */ +int +eal_memseg_list_get_dmabuf_offset(int list_idx, uint64_t *offset); + int eal_memalloc_init(void) __rte_requires_shared_capability(rte_mcfg_mem_get_lock()); diff --git a/lib/eal/common/malloc_heap.c b/lib/eal/common/malloc_heap.c index 39240c261c..bf986fe654 100644 --- a/lib/eal/common/malloc_heap.c +++ b/lib/eal/common/malloc_heap.c @@ -1232,6 +1232,33 @@ malloc_heap_create_external_seg(void *va_addr, rte_iova_t iova_addrs[], msl->version = 0; msl->external = 1; + /* initialize dma-buf info to "not dma-buf backed" */ + eal_memseg_list_set_dmabuf_info(i, -1, 0); + + return msl; +} + +struct rte_memseg_list * +malloc_heap_create_external_seg_dmabuf(void *va_addr, rte_iova_t iova_addrs[], + unsigned int n_pages, size_t page_sz, const char *seg_name, + unsigned int socket_id, int dmabuf_fd, uint64_t dmabuf_offset) +{ + struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config; + struct rte_memseg_list *msl; + int msl_idx; + + /* Create the base external segment */ + msl = malloc_heap_create_external_seg(va_addr, iova_addrs, n_pages, + page_sz, seg_name, socket_id); + if (msl == NULL) + return NULL; + + /* Get memseg list index */ + msl_idx = msl - mcfg->memsegs; + + /* Set dma-buf info in the internal side-table */ + eal_memseg_list_set_dmabuf_info(msl_idx, dmabuf_fd, dmabuf_offset); + return msl; } diff --git a/lib/eal/common/malloc_heap.h b/lib/eal/common/malloc_heap.h index dfc56d4ae3..87525d1a68 100644 --- a/lib/eal/common/malloc_heap.h +++ b/lib/eal/common/malloc_heap.h @@ -51,6 +51,11 @@ malloc_heap_create_external_seg(void *va_addr, rte_iova_t iova_addrs[], unsigned int n_pages, size_t page_sz, const char *seg_name, unsigned int socket_id); +struct rte_memseg_list * +malloc_heap_create_external_seg_dmabuf(void *va_addr, rte_iova_t iova_addrs[], + unsigned int n_pages, size_t page_sz, const char *seg_name, + unsigned int socket_id, int dmabuf_fd, uint64_t dmabuf_offset); + struct rte_memseg_list * malloc_heap_find_external_seg(void 
*va_addr, size_t len); diff --git a/lib/eal/include/rte_memory.h b/lib/eal/include/rte_memory.h index b6e97ad695..4e92897dd9 100644 --- a/lib/eal/include/rte_memory.h +++ b/lib/eal/include/rte_memory.h @@ -405,6 +405,98 @@ int rte_memseg_get_fd_offset_thread_unsafe(const struct rte_memseg *ms, size_t *offset); +/** + * @warning + * @b EXPERIMENTAL: this API may change without prior notice. + * + * Get dma-buf file descriptor associated with a memseg list. + * + * @note This function read-locks the memory hotplug subsystem, and thus cannot + * be used within memory-related callback functions. + * + * @param msl + * A pointer to memseg list for which to get dma-buf fd. + * + * @return + * Valid dma-buf file descriptor (>= 0) in case of success. + * -1 if not dma-buf backed or in case of error, with ``rte_errno`` set to: + * - EINVAL - ``msl`` pointer was NULL or did not point to a valid memseg list + */ +__rte_experimental +int +rte_memseg_list_get_dmabuf_fd(const struct rte_memseg_list *msl); + +/** + * @warning + * @b EXPERIMENTAL: this API may change without prior notice. + * + * Get dma-buf file descriptor associated with a memseg list. + * + * @note This function does not perform any locking, and is only safe to call + * from within memory-related callback functions. + * + * @param msl + * A pointer to memseg list for which to get dma-buf fd. + * + * @return + * Valid dma-buf file descriptor (>= 0) in case of success. + * -1 if not dma-buf backed or in case of error, with ``rte_errno`` set to: + * - EINVAL - ``msl`` pointer was NULL or did not point to a valid memseg list + */ +__rte_experimental +int +rte_memseg_list_get_dmabuf_fd_unsafe(const struct rte_memseg_list *msl); + +/** + * @warning + * @b EXPERIMENTAL: this API may change without prior notice. + * + * Get dma-buf offset associated with a memseg list. + * + * @note This function read-locks the memory hotplug subsystem, and thus cannot + * be used within memory-related callback functions. 
+ * + * @param msl + * A pointer to memseg list for which to get dma-buf offset. + * @param offset + * A pointer to offset value where the result will be stored. + * + * @return + * 0 on success. + * -1 in case of error, with ``rte_errno`` set to: + * - EINVAL - ``msl`` pointer was NULL or did not point to a valid memseg list + * - EINVAL - ``offset`` pointer was NULL + */ +__rte_experimental +int +rte_memseg_list_get_dmabuf_offset(const struct rte_memseg_list *msl, + uint64_t *offset); + +/** + * @warning + * @b EXPERIMENTAL: this API may change without prior notice. + * + * Get dma-buf offset associated with a memseg list. + * + * @note This function does not perform any locking, and is only safe to call + * from within memory-related callback functions. + * + * @param msl + * A pointer to memseg list for which to get dma-buf offset. + * @param offset + * A pointer to offset value where the result will be stored. + * + * @return + * 0 on success. + * -1 in case of error, with ``rte_errno`` set to: + * - EINVAL - ``msl`` pointer was NULL or did not point to a valid memseg list + * - EINVAL - ``offset`` pointer was NULL + */ +__rte_experimental +int +rte_memseg_list_get_dmabuf_offset_unsafe(const struct rte_memseg_list *msl, + uint64_t *offset); + /** * Register external memory chunk with DPDK. * @@ -443,6 +535,59 @@ int rte_extmem_register(void *va_addr, size_t len, rte_iova_t iova_addrs[], unsigned int n_pages, size_t page_sz); +/** + * @warning + * @b EXPERIMENTAL: this API may change without prior notice. + * + * Register external memory chunk backed by a dma-buf file descriptor and offset. + * + * This is similar to rte_extmem_register() but additionally stores dma-buf + * file descriptor information, allowing drivers to use dma-buf based + * memory registration (e.g., ibv_reg_dmabuf_mr for RDMA devices). + * + * @note Using this API is mutually exclusive with ``rte_malloc`` family of + * API's. + * + * @note This API will not perform any DMA mapping. 
It is expected that user + * will do that themselves via rte_dev_dma_map(). + * + * @note Before accessing this memory in other processes, it needs to be + * attached in each of those processes by calling ``rte_extmem_attach`` in + * each other process. + * + * @param va_addr + * Start of virtual area to register (mmap'd address of the dma-buf). + * Must be aligned by ``page_sz``. + * @param len + * Length of virtual area to register. Must be aligned by ``page_sz``. + * This is independent of dma-buf offset. + * @param dmabuf_fd + * File descriptor of the dma-buf. + * @param dmabuf_offset + * Offset within the dma-buf where the registered region starts. + * @param iova_addrs + * Array of page IOVA addresses corresponding to each page in this memory + * area. Can be NULL, in which case page IOVA addresses will be set to + * RTE_BAD_IOVA. + * @param n_pages + * Number of elements in the iova_addrs array. Ignored if ``iova_addrs`` + * is NULL. + * @param page_sz + * Page size of the underlying memory + * + * @return + * - 0 on success + * - -1 in case of error, with rte_errno set to one of the following: + * EINVAL - one of the parameters was invalid + * EEXIST - memory chunk is already registered + * ENOSPC - no more space in internal config to store a new memory chunk + */ + __rte_experimental +int +rte_extmem_register_dmabuf(void *va_addr, size_t len, + int dmabuf_fd, uint64_t dmabuf_offset, + rte_iova_t iova_addrs[], unsigned int n_pages, size_t page_sz); + /** * Unregister external memory chunk with DPDK. * -- 2.52.0 ^ permalink raw reply related [flat|nested] 27+ messages in thread
* [PATCH v2 2/2] common/mlx5: support dmabuf 2026-02-03 22:26 ` [PATCH v2 " Cliff Burdick 2026-02-03 22:26 ` [PATCH v2 1/2] eal: " Cliff Burdick @ 2026-02-03 22:26 ` Cliff Burdick 2026-02-03 23:02 ` [PATCH v3 0/2] " Cliff Burdick 2 siblings, 0 replies; 27+ messages in thread From: Cliff Burdick @ 2026-02-03 22:26 UTC (permalink / raw) To: dev Cc: anatoly.burakov, Thomas Monjalon, Dariusz Sosnowski, Viacheslav Ovsiienko, Bing Zhao, Ori Kam, Suanming Mou, Matan Azrad dmabuf is a modern Linux kernel feature that allows DMA transfers between two drivers. Common examples of usage are streaming video devices and NIC-to-GPU transfers. Prior to dmabuf, users had to load proprietary drivers to expose the DMA mappings. With dmabuf, the proprietary drivers are no longer required. Signed-off-by: Cliff Burdick <cburdick@nvidia.com> --- .mailmap | 2 +- drivers/common/mlx5/linux/meson.build | 2 + drivers/common/mlx5/linux/mlx5_common_verbs.c | 48 +++++++- drivers/common/mlx5/linux/mlx5_glue.c | 19 +++ drivers/common/mlx5/linux/mlx5_glue.h | 3 + drivers/common/mlx5/mlx5_common.c | 42 ++++++- drivers/common/mlx5/mlx5_common_mr.c | 113 +++++++++++++++++- drivers/common/mlx5/mlx5_common_mr.h | 17 ++- drivers/common/mlx5/windows/mlx5_common_os.c | 8 +- drivers/crypto/mlx5/mlx5_crypto.h | 1 + drivers/crypto/mlx5/mlx5_crypto_gcm.c | 3 +- 11 files changed, 249 insertions(+), 9 deletions(-) diff --git a/.mailmap b/.mailmap index 4c2b2f921d..0a8a67098f 100644 --- a/.mailmap +++ b/.mailmap @@ -291,8 +291,8 @@ Cian Ferriter <cian.ferriter@intel.com> Ciara Loftus <ciara.loftus@intel.com> Ciara Power <ciara.power@intel.com> Claire Murphy <claire.k.murphy@intel.com> -Cliff Burdick <cburdick@nvidia.com> Clemens Famulla-Conrad <cfamullaconrad@suse.com> +Cliff Burdick <cburdick@nvidia.com> Cody Doucette <doucette@bu.edu> Congwen Zhang <zhang.congwen@zte.com.cn> Conor Fogarty <conor.fogarty@intel.com> diff --git a/drivers/common/mlx5/linux/meson.build b/drivers/common/mlx5/linux/meson.build index 
3767e7a69b..8e83104165 100644 --- a/drivers/common/mlx5/linux/meson.build +++ b/drivers/common/mlx5/linux/meson.build @@ -203,6 +203,8 @@ has_sym_args = [ 'mlx5dv_dr_domain_allow_duplicate_rules' ], [ 'HAVE_MLX5_IBV_REG_MR_IOVA', 'infiniband/verbs.h', 'ibv_reg_mr_iova' ], + [ 'HAVE_IBV_REG_DMABUF_MR', 'infiniband/verbs.h', + 'ibv_reg_dmabuf_mr' ], [ 'HAVE_MLX5_IBV_IMPORT_CTX_PD_AND_MR', 'infiniband/verbs.h', 'ibv_import_device' ], [ 'HAVE_MLX5DV_DR_ACTION_CREATE_DEST_ROOT_TABLE', 'infiniband/mlx5dv.h', diff --git a/drivers/common/mlx5/linux/mlx5_common_verbs.c b/drivers/common/mlx5/linux/mlx5_common_verbs.c index 98260df470..f6d18fd5df 100644 --- a/drivers/common/mlx5/linux/mlx5_common_verbs.c +++ b/drivers/common/mlx5/linux/mlx5_common_verbs.c @@ -129,6 +129,47 @@ mlx5_common_verbs_reg_mr(void *pd, void *addr, size_t length, return 0; } +/** + * Register mr for dma-buf backed memory. Given protection domain pointer, + * dma-buf fd, offset and length, register the memory region. + * + * @param[in] pd + * Pointer to protection domain context. + * @param[in] offset + * Offset within the dma-buf. + * @param[in] length + * Length of the memory to register. + * @param[in] fd + * File descriptor of the dma-buf. + * @param[out] pmd_mr + * pmd_mr struct set with lkey, address, length and pointer to mr object + * + * @return + * 0 on successful registration, -1 otherwise + */ +RTE_EXPORT_INTERNAL_SYMBOL(mlx5_common_verbs_reg_dmabuf_mr) +int +mlx5_common_verbs_reg_dmabuf_mr(void *pd, uint64_t offset, size_t length, + uint64_t iova, int fd, + struct mlx5_pmd_mr *pmd_mr) +{ + struct ibv_mr *ibv_mr; + ibv_mr = mlx5_glue->reg_dmabuf_mr(pd, offset, length, iova, fd, + IBV_ACCESS_LOCAL_WRITE | + (haswell_broadwell_cpu ? 0 : + IBV_ACCESS_RELAXED_ORDERING)); + if (!ibv_mr) + return -1; + + *pmd_mr = (struct mlx5_pmd_mr){ + .lkey = ibv_mr->lkey, + .addr = ibv_mr->addr, + .len = ibv_mr->length, + .obj = (void *)ibv_mr, + }; + return 0; +} + /** * Deregister mr. 
Given the mlx5 pmd MR - deregister the MR * @@ -151,13 +192,18 @@ mlx5_common_verbs_dereg_mr(struct mlx5_pmd_mr *pmd_mr) * * @param[out] reg_mr_cb * Pointer to reg_mr func + * @param[out] reg_dmabuf_mr_cb + * Pointer to reg_dmabuf_mr func * @param[out] dereg_mr_cb * Pointer to dereg_mr func */ RTE_EXPORT_INTERNAL_SYMBOL(mlx5_os_set_reg_mr_cb) void -mlx5_os_set_reg_mr_cb(mlx5_reg_mr_t *reg_mr_cb, mlx5_dereg_mr_t *dereg_mr_cb) +mlx5_os_set_reg_mr_cb(mlx5_reg_mr_t *reg_mr_cb, + mlx5_reg_dmabuf_mr_t *reg_dmabuf_mr_cb, + mlx5_dereg_mr_t *dereg_mr_cb) { *reg_mr_cb = mlx5_common_verbs_reg_mr; + *reg_dmabuf_mr_cb = mlx5_common_verbs_reg_dmabuf_mr; *dereg_mr_cb = mlx5_common_verbs_dereg_mr; } diff --git a/drivers/common/mlx5/linux/mlx5_glue.c b/drivers/common/mlx5/linux/mlx5_glue.c index a91eaa429d..6fac7f2bcd 100644 --- a/drivers/common/mlx5/linux/mlx5_glue.c +++ b/drivers/common/mlx5/linux/mlx5_glue.c @@ -291,6 +291,24 @@ mlx5_glue_reg_mr_iova(struct ibv_pd *pd, void *addr, size_t length, #endif } +static struct ibv_mr * +mlx5_glue_reg_dmabuf_mr(struct ibv_pd *pd, uint64_t offset, size_t length, + uint64_t iova, int fd, int access) +{ +#ifdef HAVE_IBV_REG_DMABUF_MR + return ibv_reg_dmabuf_mr(pd, offset, length, iova, fd, access); +#else + (void)pd; + (void)offset; + (void)length; + (void)iova; + (void)fd; + (void)access; + errno = ENOTSUP; + return NULL; +#endif +} + static struct ibv_mr * mlx5_glue_alloc_null_mr(struct ibv_pd *pd) { @@ -1619,6 +1637,7 @@ const struct mlx5_glue *mlx5_glue = &(const struct mlx5_glue) { .modify_qp = mlx5_glue_modify_qp, .reg_mr = mlx5_glue_reg_mr, .reg_mr_iova = mlx5_glue_reg_mr_iova, + .reg_dmabuf_mr = mlx5_glue_reg_dmabuf_mr, .alloc_null_mr = mlx5_glue_alloc_null_mr, .dereg_mr = mlx5_glue_dereg_mr, .create_counter_set = mlx5_glue_create_counter_set, diff --git a/drivers/common/mlx5/linux/mlx5_glue.h b/drivers/common/mlx5/linux/mlx5_glue.h index 81d6b0aaf9..66216d1194 100644 --- a/drivers/common/mlx5/linux/mlx5_glue.h +++ 
b/drivers/common/mlx5/linux/mlx5_glue.h @@ -219,6 +219,9 @@ struct mlx5_glue { struct ibv_mr *(*reg_mr_iova)(struct ibv_pd *pd, void *addr, size_t length, uint64_t iova, int access); + struct ibv_mr *(*reg_dmabuf_mr)(struct ibv_pd *pd, uint64_t offset, + size_t length, uint64_t iova, + int fd, int access); struct ibv_mr *(*alloc_null_mr)(struct ibv_pd *pd); int (*dereg_mr)(struct ibv_mr *mr); struct ibv_counter_set *(*create_counter_set) diff --git a/drivers/common/mlx5/mlx5_common.c b/drivers/common/mlx5/mlx5_common.c index 84a93e7dbd..82cf17ca78 100644 --- a/drivers/common/mlx5/mlx5_common.c +++ b/drivers/common/mlx5/mlx5_common.c @@ -13,6 +13,7 @@ #include <rte_class.h> #include <rte_malloc.h> #include <rte_eal_paging.h> +#include <rte_memory.h> #include "mlx5_common.h" #include "mlx5_common_os.h" @@ -1125,6 +1126,7 @@ mlx5_common_dev_dma_map(struct rte_device *rte_dev, void *addr, struct mlx5_common_device *dev; struct mlx5_mr_btree *bt; struct mlx5_mr *mr; + struct rte_memseg_list *msl; dev = to_mlx5_device(rte_dev); if (!dev) { @@ -1134,8 +1136,44 @@ mlx5_common_dev_dma_map(struct rte_device *rte_dev, void *addr, rte_errno = ENODEV; return -1; } - mr = mlx5_create_mr_ext(dev->pd, (uintptr_t)addr, len, - SOCKET_ID_ANY, dev->mr_scache.reg_mr_cb); + /* Check if this is dma-buf backed external memory */ + msl = rte_mem_virt2memseg_list(addr); + if (msl != NULL && msl->external) { + int dmabuf_fd = rte_memseg_list_get_dmabuf_fd(msl); + if (dmabuf_fd >= 0) { + uint64_t dmabuf_off; + /* Get base offset from memseg list */ + int ret = rte_memseg_list_get_dmabuf_offset( + msl, &dmabuf_off); + if (ret < 0) { + DRV_LOG(ERR, + "Failed to get dma-buf offset for memseg list %p", + (void *)msl); + return -1; + } + /* Calculate offset within dma-buf address */ + dmabuf_off += ((uintptr_t)addr - (uintptr_t)msl->base_va); + /* Use dma-buf MR registration */ + mr = mlx5_create_mr_ext_dmabuf(dev->pd, + (uintptr_t)addr, + len, + SOCKET_ID_ANY, + dmabuf_fd, + dmabuf_off, + 
dev->mr_scache.reg_dmabuf_mr_cb); + } else { + /* Use regular MR registration */ + mr = mlx5_create_mr_ext(dev->pd, + (uintptr_t)addr, + len, + SOCKET_ID_ANY, + dev->mr_scache.reg_mr_cb); + } + } else { + /* Use regular MR registration */ + mr = mlx5_create_mr_ext(dev->pd, (uintptr_t)addr, len, + SOCKET_ID_ANY, dev->mr_scache.reg_mr_cb); + } if (!mr) { DRV_LOG(WARNING, "Device %s unable to DMA map", rte_dev->name); rte_errno = EINVAL; diff --git a/drivers/common/mlx5/mlx5_common_mr.c b/drivers/common/mlx5/mlx5_common_mr.c index 8ed988dec9..8f31eaefe8 100644 --- a/drivers/common/mlx5/mlx5_common_mr.c +++ b/drivers/common/mlx5/mlx5_common_mr.c @@ -8,6 +8,7 @@ #include <rte_eal_memconfig.h> #include <rte_eal_paging.h> #include <rte_errno.h> +#include <rte_memory.h> #include <rte_mempool.h> #include <rte_malloc.h> #include <rte_rwlock.h> @@ -1141,6 +1142,7 @@ mlx5_mr_create_cache(struct mlx5_mr_share_cache *share_cache, int socket) { /* Set the reg_mr and dereg_mr callback functions */ mlx5_os_set_reg_mr_cb(&share_cache->reg_mr_cb, + &share_cache->reg_dmabuf_mr_cb, &share_cache->dereg_mr_cb); rte_rwlock_init(&share_cache->rwlock); rte_rwlock_init(&share_cache->mprwlock); @@ -1221,6 +1223,74 @@ mlx5_create_mr_ext(void *pd, uintptr_t addr, size_t len, int socket_id, return mr; } +/** + * Creates a memory region for dma-buf backed external memory. + * + * @param pd + * Pointer to pd of a device (net, regex, vdpa,...). + * @param addr + * Starting virtual address of memory (mmap'd address). + * @param len + * Length of memory segment being mapped. + * @param socket_id + * Socket to allocate heap memory for the control structures. + * @param dmabuf_fd + * File descriptor of the dma-buf. + * @param dmabuf_offset + * Offset within the dma-buf. + * @param reg_dmabuf_mr_cb + * Callback function for dma-buf MR registration. + * + * @return + * Pointer to MR structure on success, NULL otherwise. 
+ */ +struct mlx5_mr * +mlx5_create_mr_ext_dmabuf(void *pd, uintptr_t addr, size_t len, int socket_id, + int dmabuf_fd, uint64_t dmabuf_offset, + mlx5_reg_dmabuf_mr_t reg_dmabuf_mr_cb) +{ + struct mlx5_mr *mr = NULL; + + if (reg_dmabuf_mr_cb == NULL) { + DRV_LOG(WARNING, "dma-buf MR registration not supported"); + rte_errno = ENOTSUP; + return NULL; + } + mr = mlx5_malloc(MLX5_MEM_RTE | MLX5_MEM_ZERO, + RTE_ALIGN_CEIL(sizeof(*mr), RTE_CACHE_LINE_SIZE), + RTE_CACHE_LINE_SIZE, socket_id); + if (mr == NULL) + return NULL; + if (reg_dmabuf_mr_cb(pd, dmabuf_offset, len, addr, dmabuf_fd, + &mr->pmd_mr) < 0) { + DRV_LOG(WARNING, + "Fail to create dma-buf MR for address (%p) fd=%d", + (void *)addr, dmabuf_fd); + mlx5_free(mr); + return NULL; + } + mr->msl = NULL; /* Mark it is external memory. */ + mr->ms_bmp = NULL; + mr->ms_n = 1; + mr->ms_bmp_n = 1; + /* + * For dma-buf MR, the returned addr may be NULL since there's no VA + * in the registration. Store the user-provided addr for cache lookup. + */ + if (mr->pmd_mr.addr == NULL) + mr->pmd_mr.addr = (void *)addr; + if (mr->pmd_mr.len == 0) + mr->pmd_mr.len = len; + DRV_LOG(DEBUG, + "MR CREATED (%p) for dma-buf external memory %p (fd=%d):\n" + " [0x%" PRIxPTR ", 0x%" PRIxPTR ")," + " lkey=0x%x base_idx=%u ms_n=%u, ms_bmp_n=%u", + (void *)mr, (void *)addr, dmabuf_fd, + addr, addr + len, rte_cpu_to_be_32(mr->pmd_mr.lkey), + mr->ms_base_idx, mr->ms_n, mr->ms_bmp_n); + return mr; +} + /** * Callback for memory free event. Iterate freed memsegs and check whether it * belongs to an existing MR. If found, clear the bit from bitmap of MR. 
As a @@ -1747,9 +1817,48 @@ mlx5_mr_mempool_register_primary(struct mlx5_mr_share_cache *share_cache, struct mlx5_mempool_mr *mr = &new_mpr->mrs[i]; const struct mlx5_range *range = &ranges[i]; size_t len = range->end - range->start; + struct rte_memseg_list *msl; + int reg_result; + + /* Check if this is dma-buf backed external memory */ + msl = rte_mem_virt2memseg_list((void *)range->start); + if (msl != NULL && msl->external && + share_cache->reg_dmabuf_mr_cb != NULL) { + int dmabuf_fd = rte_memseg_list_get_dmabuf_fd(msl); + if (dmabuf_fd >= 0) { + uint64_t dmabuf_off; + /* Get base offset from memseg list */ + ret = rte_memseg_list_get_dmabuf_offset(msl, &dmabuf_off); + if (ret < 0) { + DRV_LOG(ERR, "Failed to get dma-buf offset for memseg list %p", + (void *)msl); + goto exit; + } + /* Calculate offset within dma-buf for this specific range */ + dmabuf_off += (range->start - (uintptr_t)msl->base_va); + /* Use dma-buf MR registration */ + reg_result = share_cache->reg_dmabuf_mr_cb(pd, + dmabuf_off, len, range->start, dmabuf_fd, + &mr->pmd_mr); + if (reg_result == 0) { + /* For dma-buf MR, set addr if not set by driver */ + if (mr->pmd_mr.addr == NULL) + mr->pmd_mr.addr = (void *)range->start; + if (mr->pmd_mr.len == 0) + mr->pmd_mr.len = len; + } + } else { + /* Use regular MR registration */ + reg_result = share_cache->reg_mr_cb(pd, + (void *)range->start, len, &mr->pmd_mr); + } + } else { + /* Use regular MR registration */ + reg_result = share_cache->reg_mr_cb(pd, + (void *)range->start, len, &mr->pmd_mr); + } - if (share_cache->reg_mr_cb(pd, (void *)range->start, len, - &mr->pmd_mr) < 0) { + if (reg_result < 0) { DRV_LOG(ERR, "Failed to create an MR in PD %p for address range " "[0x%" PRIxPTR ", 0x%" PRIxPTR "] (%zu bytes) for mempool %s", diff --git a/drivers/common/mlx5/mlx5_common_mr.h b/drivers/common/mlx5/mlx5_common_mr.h index cf7c685e9b..3b967b1323 100644 --- a/drivers/common/mlx5/mlx5_common_mr.h +++ b/drivers/common/mlx5/mlx5_common_mr.h @@ -35,6 
+35,9 @@ struct mlx5_pmd_mr { */ typedef int (*mlx5_reg_mr_t)(void *pd, void *addr, size_t length, struct mlx5_pmd_mr *pmd_mr); +typedef int (*mlx5_reg_dmabuf_mr_t)(void *pd, uint64_t offset, size_t length, + uint64_t iova, int fd, + struct mlx5_pmd_mr *pmd_mr); typedef void (*mlx5_dereg_mr_t)(struct mlx5_pmd_mr *pmd_mr); /* Memory Region object. */ @@ -87,6 +90,7 @@ struct __rte_packed_begin mlx5_mr_share_cache { struct mlx5_mr_list mr_free_list; /* Freed MR list. */ struct mlx5_mempool_reg_list mempool_reg_list; /* Mempool database. */ mlx5_reg_mr_t reg_mr_cb; /* Callback to reg_mr func */ + mlx5_reg_dmabuf_mr_t reg_dmabuf_mr_cb; /* Callback to reg_dmabuf_mr func */ mlx5_dereg_mr_t dereg_mr_cb; /* Callback to dereg_mr func */ } __rte_packed_end; @@ -233,6 +237,10 @@ mlx5_mr_lookup_list(struct mlx5_mr_share_cache *share_cache, struct mlx5_mr * mlx5_create_mr_ext(void *pd, uintptr_t addr, size_t len, int socket_id, mlx5_reg_mr_t reg_mr_cb); +struct mlx5_mr * +mlx5_create_mr_ext_dmabuf(void *pd, uintptr_t addr, size_t len, int socket_id, + int dmabuf_fd, uint64_t dmabuf_offset, + mlx5_reg_dmabuf_mr_t reg_dmabuf_mr_cb); void mlx5_mr_free(struct mlx5_mr *mr, mlx5_dereg_mr_t dereg_mr_cb); __rte_internal uint32_t @@ -251,12 +259,19 @@ int mlx5_common_verbs_reg_mr(void *pd, void *addr, size_t length, struct mlx5_pmd_mr *pmd_mr); __rte_internal +int +mlx5_common_verbs_reg_dmabuf_mr(void *pd, uint64_t offset, size_t length, + uint64_t iova, int fd, + struct mlx5_pmd_mr *pmd_mr); +__rte_internal void mlx5_common_verbs_dereg_mr(struct mlx5_pmd_mr *pmd_mr); __rte_internal void -mlx5_os_set_reg_mr_cb(mlx5_reg_mr_t *reg_mr_cb, mlx5_dereg_mr_t *dereg_mr_cb); +mlx5_os_set_reg_mr_cb(mlx5_reg_mr_t *reg_mr_cb, + mlx5_reg_dmabuf_mr_t *reg_dmabuf_mr_cb, + mlx5_dereg_mr_t *dereg_mr_cb); __rte_internal int diff --git a/drivers/common/mlx5/windows/mlx5_common_os.c b/drivers/common/mlx5/windows/mlx5_common_os.c index 7fac361460..5e284742ab 100644 --- 
a/drivers/common/mlx5/windows/mlx5_common_os.c +++ b/drivers/common/mlx5/windows/mlx5_common_os.c @@ -17,6 +17,7 @@ #include "mlx5_common.h" #include "mlx5_common_os.h" #include "mlx5_malloc.h" +#include "mlx5_common_mr.h" /** * Initialization routine for run-time dependency on external lib. @@ -442,15 +443,20 @@ mlx5_os_dereg_mr(struct mlx5_pmd_mr *pmd_mr) * * @param[out] reg_mr_cb * Pointer to reg_mr func + * @param[out] reg_dmabuf_mr_cb + * Pointer to reg_dmabuf_mr func (NULL on Windows - not supported) * @param[out] dereg_mr_cb * Pointer to dereg_mr func * */ RTE_EXPORT_INTERNAL_SYMBOL(mlx5_os_set_reg_mr_cb) void -mlx5_os_set_reg_mr_cb(mlx5_reg_mr_t *reg_mr_cb, mlx5_dereg_mr_t *dereg_mr_cb) +mlx5_os_set_reg_mr_cb(mlx5_reg_mr_t *reg_mr_cb, + mlx5_reg_dmabuf_mr_t *reg_dmabuf_mr_cb, + mlx5_dereg_mr_t *dereg_mr_cb) { *reg_mr_cb = mlx5_os_reg_mr; + *reg_dmabuf_mr_cb = NULL; /* dma-buf not supported on Windows */ *dereg_mr_cb = mlx5_os_dereg_mr; } diff --git a/drivers/crypto/mlx5/mlx5_crypto.h b/drivers/crypto/mlx5/mlx5_crypto.h index f9f127e9e6..b2712c9a8d 100644 --- a/drivers/crypto/mlx5/mlx5_crypto.h +++ b/drivers/crypto/mlx5/mlx5_crypto.h @@ -41,6 +41,7 @@ struct mlx5_crypto_priv { struct mlx5_common_device *cdev; /* Backend mlx5 device. */ struct rte_cryptodev *crypto_dev; mlx5_reg_mr_t reg_mr_cb; /* Callback to reg_mr func */ + mlx5_reg_dmabuf_mr_t reg_dmabuf_mr_cb; /* Callback to reg_dmabuf_mr func */ mlx5_dereg_mr_t dereg_mr_cb; /* Callback to dereg_mr func */ struct mlx5_uar uar; /* User Access Region. */ uint32_t max_segs_num; /* Maximum supported data segs. */ diff --git a/drivers/crypto/mlx5/mlx5_crypto_gcm.c b/drivers/crypto/mlx5/mlx5_crypto_gcm.c index 89f32c7722..380689cfeb 100644 --- a/drivers/crypto/mlx5/mlx5_crypto_gcm.c +++ b/drivers/crypto/mlx5/mlx5_crypto_gcm.c @@ -1186,7 +1186,8 @@ mlx5_crypto_gcm_init(struct mlx5_crypto_priv *priv) /* Override AES-GCM specified ops. 
*/ dev_ops->sym_session_configure = mlx5_crypto_sym_gcm_session_configure; - mlx5_os_set_reg_mr_cb(&priv->reg_mr_cb, &priv->dereg_mr_cb); + mlx5_os_set_reg_mr_cb(&priv->reg_mr_cb, &priv->reg_dmabuf_mr_cb, + &priv->dereg_mr_cb); dev_ops->queue_pair_setup = mlx5_crypto_gcm_qp_setup; dev_ops->queue_pair_release = mlx5_crypto_gcm_qp_release; if (mlx5_crypto_is_ipsec_opt(priv)) { -- 2.52.0 ^ permalink raw reply related [flat|nested] 27+ messages in thread
* [PATCH v3 0/2] support dmabuf 2026-02-03 22:26 ` [PATCH v2 " Cliff Burdick 2026-02-03 22:26 ` [PATCH v2 1/2] eal: " Cliff Burdick 2026-02-03 22:26 ` [PATCH v2 2/2] common/mlx5: " Cliff Burdick @ 2026-02-03 23:02 ` Cliff Burdick 2026-02-03 23:02 ` [PATCH v3 1/2] eal: " Cliff Burdick ` (2 more replies) 2 siblings, 3 replies; 27+ messages in thread From: Cliff Burdick @ 2026-02-03 23:02 UTC (permalink / raw) To: dev; +Cc: anatoly.burakov Fixes since v2: * Fixed missing EXPERIMENTAL macro on new symbols * Fixed style issue Add support for kernel dmabuf feature and integrate it in the mlx5 driver. This feature is needed to support GPUDirect on newer kernels. Cliff Burdick (2): eal: support dmabuf common/mlx5: support dmabuf .mailmap | 1 + doc/guides/rel_notes/release_26_03.rst | 6 + drivers/common/mlx5/linux/meson.build | 2 + drivers/common/mlx5/linux/mlx5_common_verbs.c | 48 ++++- drivers/common/mlx5/linux/mlx5_glue.c | 19 ++ drivers/common/mlx5/linux/mlx5_glue.h | 3 + drivers/common/mlx5/mlx5_common.c | 42 ++++- drivers/common/mlx5/mlx5_common_mr.c | 113 +++++++++++- drivers/common/mlx5/mlx5_common_mr.h | 17 +- drivers/common/mlx5/windows/mlx5_common_os.c | 8 +- drivers/crypto/mlx5/mlx5_crypto.h | 1 + drivers/crypto/mlx5/mlx5_crypto_gcm.c | 3 +- lib/eal/common/eal_common_memory.c | 165 +++++++++++++++++- lib/eal/common/eal_memalloc.h | 21 +++ lib/eal/common/malloc_heap.c | 27 +++ lib/eal/common/malloc_heap.h | 5 + lib/eal/include/rte_memory.h | 145 +++++++++++++++ 17 files changed, 612 insertions(+), 14 deletions(-) -- 2.52.0 ^ permalink raw reply [flat|nested] 27+ messages in thread
* [PATCH v3 1/2] eal: support dmabuf
  2026-02-03 23:02 ` [PATCH v3 0/2] " Cliff Burdick
@ 2026-02-03 23:02 ` Cliff Burdick
  2026-02-03 23:02 ` [PATCH v3 2/2] common/mlx5: " Cliff Burdick
  2026-02-04 15:50 ` [PATCH v4 0/2] " Cliff Burdick
  2 siblings, 0 replies; 27+ messages in thread
From: Cliff Burdick @ 2026-02-03 23:02 UTC (permalink / raw)
To: dev; +Cc: anatoly.burakov, Thomas Monjalon

dmabuf is a modern Linux kernel feature that allows DMA transfers between
two drivers. Common examples of its usage are streaming video devices and
NIC-to-GPU transfers. Prior to dmabuf, users had to load proprietary
drivers to expose the DMA mappings. With dmabuf, the proprietary drivers
are no longer required.

A new API function, rte_extmem_register_dmabuf, is introduced to create
the mapping from a dmabuf file descriptor. dmabuf uses a file descriptor,
pre-opened with the kernel, together with an offset; the kernel uses the
file descriptor to map to a VA pointer.

To avoid ABI changes, a static struct is used inside eal_common_memory.c,
and lookups are done on this struct rather than on the rte_memseg_list.
Ideally we would like to add both the dmabuf file descriptor and offset to
rte_memseg_list, but it is not clear whether we can reuse existing fields
when using the dmabuf API. We could rename the "external" flag to a more
generic "properties" flag where "external" is the lowest bit, then use the
second bit to indicate the presence of dmabuf. When the dmabuf flag is
set, we could reuse the base_va address field for the dmabuf offset and
the socket_id for the file descriptor.
Signed-off-by: Cliff Burdick <cburdick@nvidia.com> --- .mailmap | 1 + doc/guides/rel_notes/release_26_03.rst | 6 + lib/eal/common/eal_common_memory.c | 165 ++++++++++++++++++++++++- lib/eal/common/eal_memalloc.h | 21 ++++ lib/eal/common/malloc_heap.c | 27 ++++ lib/eal/common/malloc_heap.h | 5 + lib/eal/include/rte_memory.h | 145 ++++++++++++++++++++++ 7 files changed, 364 insertions(+), 6 deletions(-) diff --git a/.mailmap b/.mailmap index 2f089326ff..4c2b2f921d 100644 --- a/.mailmap +++ b/.mailmap @@ -291,6 +291,7 @@ Cian Ferriter <cian.ferriter@intel.com> Ciara Loftus <ciara.loftus@intel.com> Ciara Power <ciara.power@intel.com> Claire Murphy <claire.k.murphy@intel.com> +Cliff Burdick <cburdick@nvidia.com> Clemens Famulla-Conrad <cfamullaconrad@suse.com> Cody Doucette <doucette@bu.edu> Congwen Zhang <zhang.congwen@zte.com.cn> diff --git a/doc/guides/rel_notes/release_26_03.rst b/doc/guides/rel_notes/release_26_03.rst index 15dabee7a1..56457d0382 100644 --- a/doc/guides/rel_notes/release_26_03.rst +++ b/doc/guides/rel_notes/release_26_03.rst @@ -55,6 +55,12 @@ New Features Also, make sure to start the actual text at the margin. ======================================================= +* **Added dma-buf-backed external memory support.** + + Added EAL support for registering dma-buf-backed external memory with + ``rte_extmem_register_dmabuf``, and enabled mlx5 common code to consume + dma-buf mappings for device access. + Removed Items ------------- diff --git a/lib/eal/common/eal_common_memory.c b/lib/eal/common/eal_common_memory.c index c62edf5e55..7415479fff 100644 --- a/lib/eal/common/eal_common_memory.c +++ b/lib/eal/common/eal_common_memory.c @@ -45,6 +45,15 @@ static void *next_baseaddr; static uint64_t system_page_sz; +/* Internal storage for dma-buf info, indexed by memseg list index. + * This keeps dma-buf metadata out of the public rte_memseg_list structure + * to preserve ABI compatibility. 
+ */ +static struct { + int fd; /**< dma-buf fd, -1 if not dma-buf backed */ + uint64_t offset; /**< offset within dma-buf */ +} dmabuf_info[RTE_MAX_MEMSEG_LISTS]; + #define MAX_MMAP_WITH_DEFINED_ADDR_TRIES 5 void * eal_get_virtual_area(void *requested_addr, size_t *size, @@ -232,6 +241,10 @@ eal_memseg_list_init(struct rte_memseg_list *msl, uint64_t page_sz, { char name[RTE_FBARRAY_NAME_LEN]; + /* Initialize dma-buf info to "not dma-buf backed" */ + dmabuf_info[type_msl_idx].fd = -1; + dmabuf_info[type_msl_idx].offset = 0; + snprintf(name, sizeof(name), MEMSEG_LIST_FMT, page_sz >> 10, socket_id, type_msl_idx); @@ -930,10 +943,113 @@ rte_memseg_get_fd_offset(const struct rte_memseg *ms, size_t *offset) return ret; } -RTE_EXPORT_SYMBOL(rte_extmem_register) +/* Internal dma-buf info functions */ int -rte_extmem_register(void *va_addr, size_t len, rte_iova_t iova_addrs[], - unsigned int n_pages, size_t page_sz) +eal_memseg_list_set_dmabuf_info(int list_idx, int fd, uint64_t offset) +{ + if (list_idx < 0 || list_idx >= RTE_MAX_MEMSEG_LISTS) + return -EINVAL; + + dmabuf_info[list_idx].fd = fd; + dmabuf_info[list_idx].offset = offset; + return 0; +} + +int +eal_memseg_list_get_dmabuf_fd(int list_idx) +{ + if (list_idx < 0 || list_idx >= RTE_MAX_MEMSEG_LISTS) + return -EINVAL; + + return dmabuf_info[list_idx].fd; +} + +int +eal_memseg_list_get_dmabuf_offset(int list_idx, uint64_t *offset) +{ + if (list_idx < 0 || list_idx >= RTE_MAX_MEMSEG_LISTS || offset == NULL) + return -EINVAL; + + *offset = dmabuf_info[list_idx].offset; + return 0; +} + +/* Public dma-buf info API functions */ +RTE_EXPORT_EXPERIMENTAL_SYMBOL(rte_memseg_list_get_dmabuf_fd_unsafe) +int +rte_memseg_list_get_dmabuf_fd_unsafe(const struct rte_memseg_list *msl) +{ + struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config; + int msl_idx; + + if (msl == NULL) { + rte_errno = EINVAL; + return -1; + } + + msl_idx = msl - mcfg->memsegs; + if (msl_idx < 0 || msl_idx >= RTE_MAX_MEMSEG_LISTS) { + 
rte_errno = EINVAL; + return -1; + } + + return dmabuf_info[msl_idx].fd; +} + +RTE_EXPORT_EXPERIMENTAL_SYMBOL(rte_memseg_list_get_dmabuf_fd) +int +rte_memseg_list_get_dmabuf_fd(const struct rte_memseg_list *msl) +{ + int ret; + + rte_mcfg_mem_read_lock(); + ret = rte_memseg_list_get_dmabuf_fd_unsafe(msl); + rte_mcfg_mem_read_unlock(); + + return ret; +} + +RTE_EXPORT_EXPERIMENTAL_SYMBOL(rte_memseg_list_get_dmabuf_offset_unsafe) +int +rte_memseg_list_get_dmabuf_offset_unsafe(const struct rte_memseg_list *msl, + uint64_t *offset) +{ + struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config; + int msl_idx; + + if (msl == NULL || offset == NULL) { + rte_errno = EINVAL; + return -1; + } + + msl_idx = msl - mcfg->memsegs; + if (msl_idx < 0 || msl_idx >= RTE_MAX_MEMSEG_LISTS) { + rte_errno = EINVAL; + return -1; + } + + *offset = dmabuf_info[msl_idx].offset; + return 0; +} + +RTE_EXPORT_EXPERIMENTAL_SYMBOL(rte_memseg_list_get_dmabuf_offset) +int +rte_memseg_list_get_dmabuf_offset(const struct rte_memseg_list *msl, + uint64_t *offset) +{ + int ret; + + rte_mcfg_mem_read_lock(); + ret = rte_memseg_list_get_dmabuf_offset_unsafe(msl, offset); + rte_mcfg_mem_read_unlock(); + + return ret; +} + +static int +extmem_register(void *va_addr, size_t len, + int dmabuf_fd, uint64_t dmabuf_offset, + rte_iova_t iova_addrs[], unsigned int n_pages, size_t page_sz) { struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config; unsigned int socket_id, n; @@ -967,10 +1083,19 @@ rte_extmem_register(void *va_addr, size_t len, rte_iova_t iova_addrs[], /* we can create a new memseg */ n = len / page_sz; - if (malloc_heap_create_external_seg(va_addr, iova_addrs, n, + if (dmabuf_fd < 0) { + if (malloc_heap_create_external_seg(va_addr, iova_addrs, n, page_sz, "extmem", socket_id) == NULL) { - ret = -1; - goto unlock; + ret = -1; + goto unlock; + } + } else { + if (malloc_heap_create_external_seg_dmabuf(va_addr, iova_addrs, n, + page_sz, "extmem_dmabuf", socket_id, + 
dmabuf_fd, dmabuf_offset) == NULL) { + ret = -1; + goto unlock; + } } /* memseg list successfully created - increment next socket ID */ @@ -980,6 +1105,34 @@ rte_extmem_register(void *va_addr, size_t len, rte_iova_t iova_addrs[], return ret; } +RTE_EXPORT_SYMBOL(rte_extmem_register) +int +rte_extmem_register(void *va_addr, size_t len, rte_iova_t iova_addrs[], + unsigned int n_pages, size_t page_sz) +{ + return rte_extmem_register_dmabuf(va_addr, len, -1, 0, iova_addrs, n_pages, page_sz); +} + +RTE_EXPORT_EXPERIMENTAL_SYMBOL(rte_extmem_register_dmabuf) +int +rte_extmem_register_dmabuf(void *va_addr, size_t len, + int dmabuf_fd, uint64_t dmabuf_offset, + rte_iova_t iova_addrs[], unsigned int n_pages, size_t page_sz) +{ + if (dmabuf_fd < 0) { + rte_errno = EINVAL; + return -1; + } + + return extmem_register(va_addr, + len, + dmabuf_fd, + dmabuf_offset, + iova_addrs, + n_pages, + page_sz); +} + RTE_EXPORT_SYMBOL(rte_extmem_unregister) int rte_extmem_unregister(void *va_addr, size_t len) diff --git a/lib/eal/common/eal_memalloc.h b/lib/eal/common/eal_memalloc.h index 0c267066d9..e7e807ddcb 100644 --- a/lib/eal/common/eal_memalloc.h +++ b/lib/eal/common/eal_memalloc.h @@ -90,6 +90,27 @@ eal_memalloc_set_seg_list_fd(int list_idx, int fd); int eal_memalloc_get_seg_fd_offset(int list_idx, int seg_idx, size_t *offset); +/* + * Set dma-buf info for a memseg list. + * Returns 0 on success, -errno on failure. + */ +int +eal_memseg_list_set_dmabuf_info(int list_idx, int fd, uint64_t offset); + +/* + * Get dma-buf fd for a memseg list. + * Returns fd (>= 0) on success, -1 if not dma-buf backed, -errno on error. + */ +int +eal_memseg_list_get_dmabuf_fd(int list_idx); + +/* + * Get dma-buf offset for a memseg list. + * Returns 0 on success, -errno on failure. 
+ */ +int +eal_memseg_list_get_dmabuf_offset(int list_idx, uint64_t *offset); + int eal_memalloc_init(void) __rte_requires_shared_capability(rte_mcfg_mem_get_lock()); diff --git a/lib/eal/common/malloc_heap.c b/lib/eal/common/malloc_heap.c index 39240c261c..bf986fe654 100644 --- a/lib/eal/common/malloc_heap.c +++ b/lib/eal/common/malloc_heap.c @@ -1232,6 +1232,33 @@ malloc_heap_create_external_seg(void *va_addr, rte_iova_t iova_addrs[], msl->version = 0; msl->external = 1; + /* initialize dma-buf info to "not dma-buf backed" */ + eal_memseg_list_set_dmabuf_info(i, -1, 0); + + return msl; +} + +struct rte_memseg_list * +malloc_heap_create_external_seg_dmabuf(void *va_addr, rte_iova_t iova_addrs[], + unsigned int n_pages, size_t page_sz, const char *seg_name, + unsigned int socket_id, int dmabuf_fd, uint64_t dmabuf_offset) +{ + struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config; + struct rte_memseg_list *msl; + int msl_idx; + + /* Create the base external segment */ + msl = malloc_heap_create_external_seg(va_addr, iova_addrs, n_pages, + page_sz, seg_name, socket_id); + if (msl == NULL) + return NULL; + + /* Get memseg list index */ + msl_idx = msl - mcfg->memsegs; + + /* Set dma-buf info in the internal side-table */ + eal_memseg_list_set_dmabuf_info(msl_idx, dmabuf_fd, dmabuf_offset); + return msl; } diff --git a/lib/eal/common/malloc_heap.h b/lib/eal/common/malloc_heap.h index dfc56d4ae3..87525d1a68 100644 --- a/lib/eal/common/malloc_heap.h +++ b/lib/eal/common/malloc_heap.h @@ -51,6 +51,11 @@ malloc_heap_create_external_seg(void *va_addr, rte_iova_t iova_addrs[], unsigned int n_pages, size_t page_sz, const char *seg_name, unsigned int socket_id); +struct rte_memseg_list * +malloc_heap_create_external_seg_dmabuf(void *va_addr, rte_iova_t iova_addrs[], + unsigned int n_pages, size_t page_sz, const char *seg_name, + unsigned int socket_id, int dmabuf_fd, uint64_t dmabuf_offset); + struct rte_memseg_list * malloc_heap_find_external_seg(void 
*va_addr, size_t len); diff --git a/lib/eal/include/rte_memory.h b/lib/eal/include/rte_memory.h index b6e97ad695..fffeb8fcf5 100644 --- a/lib/eal/include/rte_memory.h +++ b/lib/eal/include/rte_memory.h @@ -405,6 +405,98 @@ int rte_memseg_get_fd_offset_thread_unsafe(const struct rte_memseg *ms, size_t *offset); +/** + * @warning + * @b EXPERIMENTAL: this API may change without prior notice. + * + * Get dma-buf file descriptor associated with a memseg list. + * + * @note This function read-locks the memory hotplug subsystem, and thus cannot + * be used within memory-related callback functions. + * + * @param msl + * A pointer to memseg list for which to get dma-buf fd. + * + * @return + * Valid dma-buf file descriptor (>= 0) in case of success. + * -1 if not dma-buf backed or in case of error, with ``rte_errno`` set to: + * - EINVAL - ``msl`` pointer was NULL or did not point to a valid memseg list + */ +__rte_experimental +int +rte_memseg_list_get_dmabuf_fd(const struct rte_memseg_list *msl); + +/** + * @warning + * @b EXPERIMENTAL: this API may change without prior notice. + * + * Get dma-buf file descriptor associated with a memseg list. + * + * @note This function does not perform any locking, and is only safe to call + * from within memory-related callback functions. + * + * @param msl + * A pointer to memseg list for which to get dma-buf fd. + * + * @return + * Valid dma-buf file descriptor (>= 0) in case of success. + * -1 if not dma-buf backed or in case of error, with ``rte_errno`` set to: + * - EINVAL - ``msl`` pointer was NULL or did not point to a valid memseg list + */ +__rte_experimental +int +rte_memseg_list_get_dmabuf_fd_unsafe(const struct rte_memseg_list *msl); + +/** + * @warning + * @b EXPERIMENTAL: this API may change without prior notice. + * + * Get dma-buf offset associated with a memseg list. + * + * @note This function read-locks the memory hotplug subsystem, and thus cannot + * be used within memory-related callback functions. 
+ * + * @param msl + * A pointer to memseg list for which to get dma-buf offset. + * @param offset + * A pointer to offset value where the result will be stored. + * + * @return + * 0 on success. + * -1 in case of error, with ``rte_errno`` set to: + * - EINVAL - ``msl`` pointer was NULL or did not point to a valid memseg list + * - EINVAL - ``offset`` pointer was NULL + */ +__rte_experimental +int +rte_memseg_list_get_dmabuf_offset(const struct rte_memseg_list *msl, + uint64_t *offset); + +/** + * @warning + * @b EXPERIMENTAL: this API may change without prior notice. + * + * Get dma-buf offset associated with a memseg list. + * + * @note This function does not perform any locking, and is only safe to call + * from within memory-related callback functions. + * + * @param msl + * A pointer to memseg list for which to get dma-buf offset. + * @param offset + * A pointer to offset value where the result will be stored. + * + * @return + * 0 on success. + * -1 in case of error, with ``rte_errno`` set to: + * - EINVAL - ``msl`` pointer was NULL or did not point to a valid memseg list + * - EINVAL - ``offset`` pointer was NULL + */ +__rte_experimental +int +rte_memseg_list_get_dmabuf_offset_unsafe(const struct rte_memseg_list *msl, + uint64_t *offset); + /** * Register external memory chunk with DPDK. * @@ -443,6 +535,59 @@ int rte_extmem_register(void *va_addr, size_t len, rte_iova_t iova_addrs[], unsigned int n_pages, size_t page_sz); +/** + * @warning + * @b EXPERIMENTAL: this API may change without prior notice. + * + * Register external memory chunk backed by a dma-buf file descriptor and offset. + * + * This is similar to rte_extmem_register() but additionally stores dma-buf + * file descriptor information, allowing drivers to use dma-buf based + * memory registration (e.g., ibv_reg_dmabuf_mr for RDMA devices). + * + * @note Using this API is mutually exclusive with ``rte_malloc`` family of + * API's. + * + * @note This API will not perform any DMA mapping. 
It is expected that user + * will do that themselves via rte_dev_dma_map(). + * + * @note Before accessing this memory in other processes, it needs to be + * attached in each of those processes by calling ``rte_extmem_attach`` in + * each other process. + * + * @param va_addr + * Start of virtual area to register (mmap'd address of the dma-buf). + * Must be aligned by ``page_sz``. + * @param len + * Length of virtual area to register. Must be aligned by ``page_sz``. + * This is independent of dma-buf offset. + * @param dmabuf_fd + * File descriptor of the dma-buf. + * @param dmabuf_offset + * Offset within the dma-buf where the registered region starts. + * @param iova_addrs + * Array of page IOVA addresses corresponding to each page in this memory + * area. Can be NULL, in which case page IOVA addresses will be set to + * RTE_BAD_IOVA. + * @param n_pages + * Number of elements in the iova_addrs array. Ignored if ``iova_addrs`` + * is NULL. + * @param page_sz + * Page size of the underlying memory + * + * @return + * - 0 on success + * - -1 in case of error, with rte_errno set to one of the following: + * EINVAL - one of the parameters was invalid + * EEXIST - memory chunk is already registered + * ENOSPC - no more space in internal config to store a new memory chunk + */ +__rte_experimental +int +rte_extmem_register_dmabuf(void *va_addr, size_t len, + int dmabuf_fd, uint64_t dmabuf_offset, + rte_iova_t iova_addrs[], unsigned int n_pages, size_t page_sz); + /** * Unregister external memory chunk with DPDK. * -- 2.52.0 ^ permalink raw reply related [flat|nested] 27+ messages in thread
* [PATCH v3 2/2] common/mlx5: support dmabuf
  2026-02-03 23:02 ` [PATCH v3 0/2] " Cliff Burdick
  2026-02-03 23:02 ` [PATCH v3 1/2] eal: " Cliff Burdick
@ 2026-02-03 23:02 ` Cliff Burdick
  2026-02-04 15:50 ` [PATCH v4 0/2] " Cliff Burdick
  2 siblings, 0 replies; 27+ messages in thread
From: Cliff Burdick @ 2026-02-03 23:02 UTC (permalink / raw)
To: dev
Cc: anatoly.burakov, Thomas Monjalon, Dariusz Sosnowski,
	Viacheslav Ovsiienko, Bing Zhao, Ori Kam, Suanming Mou, Matan Azrad

dmabuf is a modern Linux kernel feature that allows DMA transfers between
two drivers. Common examples of its usage are streaming video devices and
NIC-to-GPU transfers. Prior to dmabuf, users had to load proprietary
drivers to expose the DMA mappings. With dmabuf, the proprietary drivers
are no longer required.

Signed-off-by: Cliff Burdick <cburdick@nvidia.com>
---
 .mailmap                                      |   2 +-
 drivers/common/mlx5/linux/meson.build         |   2 +
 drivers/common/mlx5/linux/mlx5_common_verbs.c |  48 +++++++-
 drivers/common/mlx5/linux/mlx5_glue.c         |  19 +++
 drivers/common/mlx5/linux/mlx5_glue.h         |   3 +
 drivers/common/mlx5/mlx5_common.c             |  42 ++++++-
 drivers/common/mlx5/mlx5_common_mr.c          | 113 +++++++++++++++++-
 drivers/common/mlx5/mlx5_common_mr.h          |  17 ++-
 drivers/common/mlx5/windows/mlx5_common_os.c  |   8 +-
 drivers/crypto/mlx5/mlx5_crypto.h             |   1 +
 drivers/crypto/mlx5/mlx5_crypto_gcm.c         |   3 +-
 11 files changed, 249 insertions(+), 9 deletions(-)

diff --git a/.mailmap b/.mailmap
index 4c2b2f921d..0a8a67098f 100644
--- a/.mailmap
+++ b/.mailmap
@@ -291,8 +291,8 @@ Cian Ferriter <cian.ferriter@intel.com>
 Ciara Loftus <ciara.loftus@intel.com>
 Ciara Power <ciara.power@intel.com>
 Claire Murphy <claire.k.murphy@intel.com>
-Cliff Burdick <cburdick@nvidia.com>
 Clemens Famulla-Conrad <cfamullaconrad@suse.com>
+Cliff Burdick <cburdick@nvidia.com>
 Cody Doucette <doucette@bu.edu>
 Congwen Zhang <zhang.congwen@zte.com.cn>
 Conor Fogarty <conor.fogarty@intel.com>
diff --git a/drivers/common/mlx5/linux/meson.build b/drivers/common/mlx5/linux/meson.build
index
3767e7a69b..8e83104165 100644 --- a/drivers/common/mlx5/linux/meson.build +++ b/drivers/common/mlx5/linux/meson.build @@ -203,6 +203,8 @@ has_sym_args = [ 'mlx5dv_dr_domain_allow_duplicate_rules' ], [ 'HAVE_MLX5_IBV_REG_MR_IOVA', 'infiniband/verbs.h', 'ibv_reg_mr_iova' ], + [ 'HAVE_IBV_REG_DMABUF_MR', 'infiniband/verbs.h', + 'ibv_reg_dmabuf_mr' ], [ 'HAVE_MLX5_IBV_IMPORT_CTX_PD_AND_MR', 'infiniband/verbs.h', 'ibv_import_device' ], [ 'HAVE_MLX5DV_DR_ACTION_CREATE_DEST_ROOT_TABLE', 'infiniband/mlx5dv.h', diff --git a/drivers/common/mlx5/linux/mlx5_common_verbs.c b/drivers/common/mlx5/linux/mlx5_common_verbs.c index 98260df470..f6d18fd5df 100644 --- a/drivers/common/mlx5/linux/mlx5_common_verbs.c +++ b/drivers/common/mlx5/linux/mlx5_common_verbs.c @@ -129,6 +129,47 @@ mlx5_common_verbs_reg_mr(void *pd, void *addr, size_t length, return 0; } +/** + * Register mr for dma-buf backed memory. Given protection domain pointer, + * dma-buf fd, offset and length, register the memory region. + * + * @param[in] pd + * Pointer to protection domain context. + * @param[in] offset + * Offset within the dma-buf. + * @param[in] length + * Length of the memory to register. + * @param[in] fd + * File descriptor of the dma-buf. + * @param[out] pmd_mr + * pmd_mr struct set with lkey, address, length and pointer to mr object + * + * @return + * 0 on successful registration, -1 otherwise + */ +RTE_EXPORT_INTERNAL_SYMBOL(mlx5_common_verbs_reg_dmabuf_mr) +int +mlx5_common_verbs_reg_dmabuf_mr(void *pd, uint64_t offset, size_t length, + uint64_t iova, int fd, + struct mlx5_pmd_mr *pmd_mr) +{ + struct ibv_mr *ibv_mr; + ibv_mr = mlx5_glue->reg_dmabuf_mr(pd, offset, length, iova, fd, + IBV_ACCESS_LOCAL_WRITE | + (haswell_broadwell_cpu ? 0 : + IBV_ACCESS_RELAXED_ORDERING)); + if (!ibv_mr) + return -1; + + *pmd_mr = (struct mlx5_pmd_mr){ + .lkey = ibv_mr->lkey, + .addr = ibv_mr->addr, + .len = ibv_mr->length, + .obj = (void *)ibv_mr, + }; + return 0; +} + /** * Deregister mr. 
Given the mlx5 pmd MR - deregister the MR * @@ -151,13 +192,18 @@ mlx5_common_verbs_dereg_mr(struct mlx5_pmd_mr *pmd_mr) * * @param[out] reg_mr_cb * Pointer to reg_mr func + * @param[out] reg_dmabuf_mr_cb + * Pointer to reg_dmabuf_mr func * @param[out] dereg_mr_cb * Pointer to dereg_mr func */ RTE_EXPORT_INTERNAL_SYMBOL(mlx5_os_set_reg_mr_cb) void -mlx5_os_set_reg_mr_cb(mlx5_reg_mr_t *reg_mr_cb, mlx5_dereg_mr_t *dereg_mr_cb) +mlx5_os_set_reg_mr_cb(mlx5_reg_mr_t *reg_mr_cb, + mlx5_reg_dmabuf_mr_t *reg_dmabuf_mr_cb, + mlx5_dereg_mr_t *dereg_mr_cb) { *reg_mr_cb = mlx5_common_verbs_reg_mr; + *reg_dmabuf_mr_cb = mlx5_common_verbs_reg_dmabuf_mr; *dereg_mr_cb = mlx5_common_verbs_dereg_mr; } diff --git a/drivers/common/mlx5/linux/mlx5_glue.c b/drivers/common/mlx5/linux/mlx5_glue.c index a91eaa429d..6fac7f2bcd 100644 --- a/drivers/common/mlx5/linux/mlx5_glue.c +++ b/drivers/common/mlx5/linux/mlx5_glue.c @@ -291,6 +291,24 @@ mlx5_glue_reg_mr_iova(struct ibv_pd *pd, void *addr, size_t length, #endif } +static struct ibv_mr * +mlx5_glue_reg_dmabuf_mr(struct ibv_pd *pd, uint64_t offset, size_t length, + uint64_t iova, int fd, int access) +{ +#ifdef HAVE_IBV_REG_DMABUF_MR + return ibv_reg_dmabuf_mr(pd, offset, length, iova, fd, access); +#else + (void)pd; + (void)offset; + (void)length; + (void)iova; + (void)fd; + (void)access; + errno = ENOTSUP; + return NULL; +#endif +} + static struct ibv_mr * mlx5_glue_alloc_null_mr(struct ibv_pd *pd) { @@ -1619,6 +1637,7 @@ const struct mlx5_glue *mlx5_glue = &(const struct mlx5_glue) { .modify_qp = mlx5_glue_modify_qp, .reg_mr = mlx5_glue_reg_mr, .reg_mr_iova = mlx5_glue_reg_mr_iova, + .reg_dmabuf_mr = mlx5_glue_reg_dmabuf_mr, .alloc_null_mr = mlx5_glue_alloc_null_mr, .dereg_mr = mlx5_glue_dereg_mr, .create_counter_set = mlx5_glue_create_counter_set, diff --git a/drivers/common/mlx5/linux/mlx5_glue.h b/drivers/common/mlx5/linux/mlx5_glue.h index 81d6b0aaf9..66216d1194 100644 --- a/drivers/common/mlx5/linux/mlx5_glue.h +++ 
b/drivers/common/mlx5/linux/mlx5_glue.h @@ -219,6 +219,9 @@ struct mlx5_glue { struct ibv_mr *(*reg_mr_iova)(struct ibv_pd *pd, void *addr, size_t length, uint64_t iova, int access); + struct ibv_mr *(*reg_dmabuf_mr)(struct ibv_pd *pd, uint64_t offset, + size_t length, uint64_t iova, + int fd, int access); struct ibv_mr *(*alloc_null_mr)(struct ibv_pd *pd); int (*dereg_mr)(struct ibv_mr *mr); struct ibv_counter_set *(*create_counter_set) diff --git a/drivers/common/mlx5/mlx5_common.c b/drivers/common/mlx5/mlx5_common.c index 84a93e7dbd..82cf17ca78 100644 --- a/drivers/common/mlx5/mlx5_common.c +++ b/drivers/common/mlx5/mlx5_common.c @@ -13,6 +13,7 @@ #include <rte_class.h> #include <rte_malloc.h> #include <rte_eal_paging.h> +#include <rte_memory.h> #include "mlx5_common.h" #include "mlx5_common_os.h" @@ -1125,6 +1126,7 @@ mlx5_common_dev_dma_map(struct rte_device *rte_dev, void *addr, struct mlx5_common_device *dev; struct mlx5_mr_btree *bt; struct mlx5_mr *mr; + struct rte_memseg_list *msl; dev = to_mlx5_device(rte_dev); if (!dev) { @@ -1134,8 +1136,44 @@ mlx5_common_dev_dma_map(struct rte_device *rte_dev, void *addr, rte_errno = ENODEV; return -1; } - mr = mlx5_create_mr_ext(dev->pd, (uintptr_t)addr, len, - SOCKET_ID_ANY, dev->mr_scache.reg_mr_cb); + /* Check if this is dma-buf backed external memory */ + msl = rte_mem_virt2memseg_list(addr); + if (msl != NULL && msl->external) { + int dmabuf_fd = rte_memseg_list_get_dmabuf_fd(msl); + if (dmabuf_fd >= 0) { + uint64_t dmabuf_off; + /* Get base offset from memseg list */ + int ret = rte_memseg_list_get_dmabuf_offset( + msl, &dmabuf_off); + if (ret < 0) { + DRV_LOG(ERR, + "Failed to get dma-buf offset for memseg list %p", + (void *)msl); + return -1; + } + /* Calculate offset within dma-buf address */ + dmabuf_off += ((uintptr_t)addr - (uintptr_t)msl->base_va); + /* Use dma-buf MR registration */ + mr = mlx5_create_mr_ext_dmabuf(dev->pd, + (uintptr_t)addr, + len, + SOCKET_ID_ANY, + dmabuf_fd, + dmabuf_off, + 
dev->mr_scache.reg_dmabuf_mr_cb); + } else { + /* Use regular MR registration */ + mr = mlx5_create_mr_ext(dev->pd, + (uintptr_t)addr, + len, + SOCKET_ID_ANY, + dev->mr_scache.reg_mr_cb); + } + } else { + /* Use regular MR registration */ + mr = mlx5_create_mr_ext(dev->pd, (uintptr_t)addr, len, + SOCKET_ID_ANY, dev->mr_scache.reg_mr_cb); + } if (!mr) { DRV_LOG(WARNING, "Device %s unable to DMA map", rte_dev->name); rte_errno = EINVAL; diff --git a/drivers/common/mlx5/mlx5_common_mr.c b/drivers/common/mlx5/mlx5_common_mr.c index 8ed988dec9..8f31eaefe8 100644 --- a/drivers/common/mlx5/mlx5_common_mr.c +++ b/drivers/common/mlx5/mlx5_common_mr.c @@ -8,6 +8,7 @@ #include <rte_eal_memconfig.h> #include <rte_eal_paging.h> #include <rte_errno.h> +#include <rte_memory.h> #include <rte_mempool.h> #include <rte_malloc.h> #include <rte_rwlock.h> @@ -1141,6 +1142,7 @@ mlx5_mr_create_cache(struct mlx5_mr_share_cache *share_cache, int socket) { /* Set the reg_mr and dereg_mr callback functions */ mlx5_os_set_reg_mr_cb(&share_cache->reg_mr_cb, + &share_cache->reg_dmabuf_mr_cb, &share_cache->dereg_mr_cb); rte_rwlock_init(&share_cache->rwlock); rte_rwlock_init(&share_cache->mprwlock); @@ -1221,6 +1223,74 @@ mlx5_create_mr_ext(void *pd, uintptr_t addr, size_t len, int socket_id, return mr; } +/** + * Creates a memory region for dma-buf backed external memory. + * + * @param pd + * Pointer to pd of a device (net, regex, vdpa,...). + * @param addr + * Starting virtual address of memory (mmap'd address). + * @param len + * Length of memory segment being mapped. + * @param socket_id + * Socket to allocate heap memory for the control structures. + * @param dmabuf_fd + * File descriptor of the dma-buf. + * @param dmabuf_offset + * Offset within the dma-buf. + * @param reg_dmabuf_mr_cb + * Callback function for dma-buf MR registration. + * + * @return + * Pointer to MR structure on success, NULL otherwise. 
+ */ +struct mlx5_mr * +mlx5_create_mr_ext_dmabuf(void *pd, uintptr_t addr, size_t len, int socket_id, + int dmabuf_fd, uint64_t dmabuf_offset, + mlx5_reg_dmabuf_mr_t reg_dmabuf_mr_cb) +{ + struct mlx5_mr *mr = NULL; + + if (reg_dmabuf_mr_cb == NULL) { + DRV_LOG(WARNING, "dma-buf MR registration not supported"); + rte_errno = ENOTSUP; + return NULL; + } + mr = mlx5_malloc(MLX5_MEM_RTE | MLX5_MEM_ZERO, + RTE_ALIGN_CEIL(sizeof(*mr), RTE_CACHE_LINE_SIZE), + RTE_CACHE_LINE_SIZE, socket_id); + if (mr == NULL) + return NULL; + if (reg_dmabuf_mr_cb(pd, dmabuf_offset, len, addr, dmabuf_fd, + &mr->pmd_mr) < 0) { + DRV_LOG(WARNING, + "Fail to create dma-buf MR for address (%p) fd=%d", + (void *)addr, dmabuf_fd); + mlx5_free(mr); + return NULL; + } + mr->msl = NULL; /* Mark it is external memory. */ + mr->ms_bmp = NULL; + mr->ms_n = 1; + mr->ms_bmp_n = 1; + /* + * For dma-buf MR, the returned addr may be NULL since there's no VA + * in the registration. Store the user-provided addr for cache lookup. + */ + if (mr->pmd_mr.addr == NULL) + mr->pmd_mr.addr = (void *)addr; + if (mr->pmd_mr.len == 0) + mr->pmd_mr.len = len; + DRV_LOG(DEBUG, + "MR CREATED (%p) for dma-buf external memory %p (fd=%d):\n" + " [0x%" PRIxPTR ", 0x%" PRIxPTR ")," + " lkey=0x%x base_idx=%u ms_n=%u, ms_bmp_n=%u", + (void *)mr, (void *)addr, dmabuf_fd, + addr, addr + len, rte_cpu_to_be_32(mr->pmd_mr.lkey), + mr->ms_base_idx, mr->ms_n, mr->ms_bmp_n); + return mr; +} + /** * Callback for memory free event. Iterate freed memsegs and check whether it * belongs to an existing MR. If found, clear the bit from bitmap of MR. 
As a @@ -1747,9 +1817,48 @@ mlx5_mr_mempool_register_primary(struct mlx5_mr_share_cache *share_cache, struct mlx5_mempool_mr *mr = &new_mpr->mrs[i]; const struct mlx5_range *range = &ranges[i]; size_t len = range->end - range->start; + struct rte_memseg_list *msl; + int reg_result; + + /* Check if this is dma-buf backed external memory */ + msl = rte_mem_virt2memseg_list((void *)range->start); + if (msl != NULL && msl->external && + share_cache->reg_dmabuf_mr_cb != NULL) { + int dmabuf_fd = rte_memseg_list_get_dmabuf_fd(msl); + if (dmabuf_fd >= 0) { + uint64_t dmabuf_off; + /* Get base offset from memseg list */ + ret = rte_memseg_list_get_dmabuf_offset(msl, &dmabuf_off); + if (ret < 0) { + DRV_LOG(ERR, "Failed to get dma-buf offset for memseg list %p", + (void *)msl); + goto exit; + } + /* Calculate offset within dma-buf for this specific range */ + dmabuf_off += (range->start - (uintptr_t)msl->base_va); + /* Use dma-buf MR registration */ + reg_result = share_cache->reg_dmabuf_mr_cb(pd, + dmabuf_off, len, range->start, dmabuf_fd, + &mr->pmd_mr); + if (reg_result == 0) { + /* For dma-buf MR, set addr if not set by driver */ + if (mr->pmd_mr.addr == NULL) + mr->pmd_mr.addr = (void *)range->start; + if (mr->pmd_mr.len == 0) + mr->pmd_mr.len = len; + } + } else { + /* Use regular MR registration */ + reg_result = share_cache->reg_mr_cb(pd, + (void *)range->start, len, &mr->pmd_mr); + } + } else { + /* Use regular MR registration */ + reg_result = share_cache->reg_mr_cb(pd, + (void *)range->start, len, &mr->pmd_mr); + } - if (share_cache->reg_mr_cb(pd, (void *)range->start, len, - &mr->pmd_mr) < 0) { + if (reg_result < 0) { DRV_LOG(ERR, "Failed to create an MR in PD %p for address range " "[0x%" PRIxPTR ", 0x%" PRIxPTR "] (%zu bytes) for mempool %s", diff --git a/drivers/common/mlx5/mlx5_common_mr.h b/drivers/common/mlx5/mlx5_common_mr.h index cf7c685e9b..3b967b1323 100644 --- a/drivers/common/mlx5/mlx5_common_mr.h +++ b/drivers/common/mlx5/mlx5_common_mr.h @@ -35,6 
+35,9 @@ struct mlx5_pmd_mr { */ typedef int (*mlx5_reg_mr_t)(void *pd, void *addr, size_t length, struct mlx5_pmd_mr *pmd_mr); +typedef int (*mlx5_reg_dmabuf_mr_t)(void *pd, uint64_t offset, size_t length, + uint64_t iova, int fd, + struct mlx5_pmd_mr *pmd_mr); typedef void (*mlx5_dereg_mr_t)(struct mlx5_pmd_mr *pmd_mr); /* Memory Region object. */ @@ -87,6 +90,7 @@ struct __rte_packed_begin mlx5_mr_share_cache { struct mlx5_mr_list mr_free_list; /* Freed MR list. */ struct mlx5_mempool_reg_list mempool_reg_list; /* Mempool database. */ mlx5_reg_mr_t reg_mr_cb; /* Callback to reg_mr func */ + mlx5_reg_dmabuf_mr_t reg_dmabuf_mr_cb; /* Callback to reg_dmabuf_mr func */ mlx5_dereg_mr_t dereg_mr_cb; /* Callback to dereg_mr func */ } __rte_packed_end; @@ -233,6 +237,10 @@ mlx5_mr_lookup_list(struct mlx5_mr_share_cache *share_cache, struct mlx5_mr * mlx5_create_mr_ext(void *pd, uintptr_t addr, size_t len, int socket_id, mlx5_reg_mr_t reg_mr_cb); +struct mlx5_mr * +mlx5_create_mr_ext_dmabuf(void *pd, uintptr_t addr, size_t len, int socket_id, + int dmabuf_fd, uint64_t dmabuf_offset, + mlx5_reg_dmabuf_mr_t reg_dmabuf_mr_cb); void mlx5_mr_free(struct mlx5_mr *mr, mlx5_dereg_mr_t dereg_mr_cb); __rte_internal uint32_t @@ -251,12 +259,19 @@ int mlx5_common_verbs_reg_mr(void *pd, void *addr, size_t length, struct mlx5_pmd_mr *pmd_mr); __rte_internal +int +mlx5_common_verbs_reg_dmabuf_mr(void *pd, uint64_t offset, size_t length, + uint64_t iova, int fd, + struct mlx5_pmd_mr *pmd_mr); +__rte_internal void mlx5_common_verbs_dereg_mr(struct mlx5_pmd_mr *pmd_mr); __rte_internal void -mlx5_os_set_reg_mr_cb(mlx5_reg_mr_t *reg_mr_cb, mlx5_dereg_mr_t *dereg_mr_cb); +mlx5_os_set_reg_mr_cb(mlx5_reg_mr_t *reg_mr_cb, + mlx5_reg_dmabuf_mr_t *reg_dmabuf_mr_cb, + mlx5_dereg_mr_t *dereg_mr_cb); __rte_internal int diff --git a/drivers/common/mlx5/windows/mlx5_common_os.c b/drivers/common/mlx5/windows/mlx5_common_os.c index 7fac361460..5e284742ab 100644 --- 
a/drivers/common/mlx5/windows/mlx5_common_os.c +++ b/drivers/common/mlx5/windows/mlx5_common_os.c @@ -17,6 +17,7 @@ #include "mlx5_common.h" #include "mlx5_common_os.h" #include "mlx5_malloc.h" +#include "mlx5_common_mr.h" /** * Initialization routine for run-time dependency on external lib. @@ -442,15 +443,20 @@ mlx5_os_dereg_mr(struct mlx5_pmd_mr *pmd_mr) * * @param[out] reg_mr_cb * Pointer to reg_mr func + * @param[out] reg_dmabuf_mr_cb + * Pointer to reg_dmabuf_mr func (NULL on Windows - not supported) * @param[out] dereg_mr_cb * Pointer to dereg_mr func * */ RTE_EXPORT_INTERNAL_SYMBOL(mlx5_os_set_reg_mr_cb) void -mlx5_os_set_reg_mr_cb(mlx5_reg_mr_t *reg_mr_cb, mlx5_dereg_mr_t *dereg_mr_cb) +mlx5_os_set_reg_mr_cb(mlx5_reg_mr_t *reg_mr_cb, + mlx5_reg_dmabuf_mr_t *reg_dmabuf_mr_cb, + mlx5_dereg_mr_t *dereg_mr_cb) { *reg_mr_cb = mlx5_os_reg_mr; + *reg_dmabuf_mr_cb = NULL; /* dma-buf not supported on Windows */ *dereg_mr_cb = mlx5_os_dereg_mr; } diff --git a/drivers/crypto/mlx5/mlx5_crypto.h b/drivers/crypto/mlx5/mlx5_crypto.h index f9f127e9e6..b2712c9a8d 100644 --- a/drivers/crypto/mlx5/mlx5_crypto.h +++ b/drivers/crypto/mlx5/mlx5_crypto.h @@ -41,6 +41,7 @@ struct mlx5_crypto_priv { struct mlx5_common_device *cdev; /* Backend mlx5 device. */ struct rte_cryptodev *crypto_dev; mlx5_reg_mr_t reg_mr_cb; /* Callback to reg_mr func */ + mlx5_reg_dmabuf_mr_t reg_dmabuf_mr_cb; /* Callback to reg_dmabuf_mr func */ mlx5_dereg_mr_t dereg_mr_cb; /* Callback to dereg_mr func */ struct mlx5_uar uar; /* User Access Region. */ uint32_t max_segs_num; /* Maximum supported data segs. */ diff --git a/drivers/crypto/mlx5/mlx5_crypto_gcm.c b/drivers/crypto/mlx5/mlx5_crypto_gcm.c index 89f32c7722..380689cfeb 100644 --- a/drivers/crypto/mlx5/mlx5_crypto_gcm.c +++ b/drivers/crypto/mlx5/mlx5_crypto_gcm.c @@ -1186,7 +1186,8 @@ mlx5_crypto_gcm_init(struct mlx5_crypto_priv *priv) /* Override AES-GCM specified ops. 
*/ dev_ops->sym_session_configure = mlx5_crypto_sym_gcm_session_configure; - mlx5_os_set_reg_mr_cb(&priv->reg_mr_cb, &priv->dereg_mr_cb); + mlx5_os_set_reg_mr_cb(&priv->reg_mr_cb, &priv->reg_dmabuf_mr_cb, + &priv->dereg_mr_cb); dev_ops->queue_pair_setup = mlx5_crypto_gcm_qp_setup; dev_ops->queue_pair_release = mlx5_crypto_gcm_qp_release; if (mlx5_crypto_is_ipsec_opt(priv)) { -- 2.52.0 ^ permalink raw reply related [flat|nested] 27+ messages in thread
* [PATCH v4 0/2] support dmabuf 2026-02-03 23:02 ` [PATCH v3 0/2] " Cliff Burdick 2026-02-03 23:02 ` [PATCH v3 1/2] eal: " Cliff Burdick 2026-02-03 23:02 ` [PATCH v3 2/2] common/mlx5: " Cliff Burdick @ 2026-02-04 15:50 ` Cliff Burdick 2026-02-04 15:50 ` [PATCH v4 1/2] eal: " Cliff Burdick ` (3 more replies) 2 siblings, 4 replies; 27+ messages in thread From: Cliff Burdick @ 2026-02-04 15:50 UTC (permalink / raw) To: dev; +Cc: anatoly.burakov Fixes since v3: * Fixed version in RTE_EXPORT_EXPERIMENTAL_SYMBOL Add support for kernel dmabuf feature and integrate it in the mlx5 driver. This feature is needed to support GPUDirect on newer kernels. I apologize for all the patches. Still trying to learn how to submit these. Cliff Burdick (2): eal: support dmabuf common/mlx5: support dmabuf .mailmap | 1 + doc/guides/rel_notes/release_26_03.rst | 6 + drivers/common/mlx5/linux/meson.build | 2 + drivers/common/mlx5/linux/mlx5_common_verbs.c | 48 ++++- drivers/common/mlx5/linux/mlx5_glue.c | 19 ++ drivers/common/mlx5/linux/mlx5_glue.h | 3 + drivers/common/mlx5/mlx5_common.c | 42 ++++- drivers/common/mlx5/mlx5_common_mr.c | 113 +++++++++++- drivers/common/mlx5/mlx5_common_mr.h | 17 +- drivers/common/mlx5/windows/mlx5_common_os.c | 8 +- drivers/crypto/mlx5/mlx5_crypto.h | 1 + drivers/crypto/mlx5/mlx5_crypto_gcm.c | 3 +- lib/eal/common/eal_common_memory.c | 165 +++++++++++++++++- lib/eal/common/eal_memalloc.h | 21 +++ lib/eal/common/malloc_heap.c | 27 +++ lib/eal/common/malloc_heap.h | 5 + lib/eal/include/rte_memory.h | 145 +++++++++++++++ 17 files changed, 612 insertions(+), 14 deletions(-) -- 2.52.0 ^ permalink raw reply [flat|nested] 27+ messages in thread
* [PATCH v4 1/2] eal: support dmabuf 2026-02-04 15:50 ` [PATCH v4 0/2] " Cliff Burdick @ 2026-02-04 15:50 ` Cliff Burdick 2026-02-12 13:57 ` Burakov, Anatoly 2026-02-04 15:50 ` [PATCH v4 2/2] common/mlx5: " Cliff Burdick ` (2 subsequent siblings) 3 siblings, 1 reply; 27+ messages in thread From: Cliff Burdick @ 2026-02-04 15:50 UTC (permalink / raw) To: dev; +Cc: anatoly.burakov, Thomas Monjalon dmabuf is a modern Linux kernel feature that allows DMA transfers between two drivers. Common examples of usage are streaming video devices and NIC to GPU transfers. Prior to dmabuf, users had to load proprietary drivers to expose the DMA mappings. With dmabuf, the proprietary drivers are no longer required. A new API function rte_extmem_register_dmabuf is introduced to create the mapping from a dmabuf file descriptor. dmabuf uses a file descriptor that has been pre-opened with the kernel, along with an offset. The kernel uses the file descriptor to map to a VA pointer. To avoid ABI changes, a static struct is used inside eal_common_memory.c, and lookups are done on this struct rather than on the rte_memseg_list. Ideally we would like to add both the dmabuf file descriptor and offset to rte_memseg_list, but it's not clear if we can reuse existing fields when using the dmabuf API. We could rename the external flag to a more generic "properties" flag where "external" is the lowest bit, and then use the second bit to indicate the presence of dmabuf. When the dmabuf flag is set, we could reuse the base_va address field for the dmabuf offset, and the socket_id for the file descriptor. 
Signed-off-by: Cliff Burdick <cburdick@nvidia.com> --- .mailmap | 1 + doc/guides/rel_notes/release_26_03.rst | 6 + lib/eal/common/eal_common_memory.c | 165 ++++++++++++++++++++++++- lib/eal/common/eal_memalloc.h | 21 ++++ lib/eal/common/malloc_heap.c | 27 ++++ lib/eal/common/malloc_heap.h | 5 + lib/eal/include/rte_memory.h | 145 ++++++++++++++++++++++ 7 files changed, 364 insertions(+), 6 deletions(-) diff --git a/.mailmap b/.mailmap index 2f089326ff..4c2b2f921d 100644 --- a/.mailmap +++ b/.mailmap @@ -291,6 +291,7 @@ Cian Ferriter <cian.ferriter@intel.com> Ciara Loftus <ciara.loftus@intel.com> Ciara Power <ciara.power@intel.com> Claire Murphy <claire.k.murphy@intel.com> +Cliff Burdick <cburdick@nvidia.com> Clemens Famulla-Conrad <cfamullaconrad@suse.com> Cody Doucette <doucette@bu.edu> Congwen Zhang <zhang.congwen@zte.com.cn> diff --git a/doc/guides/rel_notes/release_26_03.rst b/doc/guides/rel_notes/release_26_03.rst index 15dabee7a1..56457d0382 100644 --- a/doc/guides/rel_notes/release_26_03.rst +++ b/doc/guides/rel_notes/release_26_03.rst @@ -55,6 +55,12 @@ New Features Also, make sure to start the actual text at the margin. ======================================================= +* **Added dma-buf-backed external memory support.** + + Added EAL support for registering dma-buf-backed external memory with + ``rte_extmem_register_dmabuf``, and enabled mlx5 common code to consume + dma-buf mappings for device access. + Removed Items ------------- diff --git a/lib/eal/common/eal_common_memory.c b/lib/eal/common/eal_common_memory.c index c62edf5e55..4b8b1c8b59 100644 --- a/lib/eal/common/eal_common_memory.c +++ b/lib/eal/common/eal_common_memory.c @@ -45,6 +45,15 @@ static void *next_baseaddr; static uint64_t system_page_sz; +/* Internal storage for dma-buf info, indexed by memseg list index. + * This keeps dma-buf metadata out of the public rte_memseg_list structure + * to preserve ABI compatibility. 
+ */ +static struct { + int fd; /**< dma-buf fd, -1 if not dma-buf backed */ + uint64_t offset; /**< offset within dma-buf */ +} dmabuf_info[RTE_MAX_MEMSEG_LISTS]; + #define MAX_MMAP_WITH_DEFINED_ADDR_TRIES 5 void * eal_get_virtual_area(void *requested_addr, size_t *size, @@ -232,6 +241,10 @@ eal_memseg_list_init(struct rte_memseg_list *msl, uint64_t page_sz, { char name[RTE_FBARRAY_NAME_LEN]; + /* Initialize dma-buf info to "not dma-buf backed" */ + dmabuf_info[type_msl_idx].fd = -1; + dmabuf_info[type_msl_idx].offset = 0; + snprintf(name, sizeof(name), MEMSEG_LIST_FMT, page_sz >> 10, socket_id, type_msl_idx); @@ -930,10 +943,113 @@ rte_memseg_get_fd_offset(const struct rte_memseg *ms, size_t *offset) return ret; } -RTE_EXPORT_SYMBOL(rte_extmem_register) +/* Internal dma-buf info functions */ int -rte_extmem_register(void *va_addr, size_t len, rte_iova_t iova_addrs[], - unsigned int n_pages, size_t page_sz) +eal_memseg_list_set_dmabuf_info(int list_idx, int fd, uint64_t offset) +{ + if (list_idx < 0 || list_idx >= RTE_MAX_MEMSEG_LISTS) + return -EINVAL; + + dmabuf_info[list_idx].fd = fd; + dmabuf_info[list_idx].offset = offset; + return 0; +} + +int +eal_memseg_list_get_dmabuf_fd(int list_idx) +{ + if (list_idx < 0 || list_idx >= RTE_MAX_MEMSEG_LISTS) + return -EINVAL; + + return dmabuf_info[list_idx].fd; +} + +int +eal_memseg_list_get_dmabuf_offset(int list_idx, uint64_t *offset) +{ + if (list_idx < 0 || list_idx >= RTE_MAX_MEMSEG_LISTS || offset == NULL) + return -EINVAL; + + *offset = dmabuf_info[list_idx].offset; + return 0; +} + +/* Public dma-buf info API functions */ +RTE_EXPORT_EXPERIMENTAL_SYMBOL(rte_memseg_list_get_dmabuf_fd_unsafe, 26.03) +int +rte_memseg_list_get_dmabuf_fd_unsafe(const struct rte_memseg_list *msl) +{ + struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config; + int msl_idx; + + if (msl == NULL) { + rte_errno = EINVAL; + return -1; + } + + msl_idx = msl - mcfg->memsegs; + if (msl_idx < 0 || msl_idx >= RTE_MAX_MEMSEG_LISTS) 
{ + rte_errno = EINVAL; + return -1; + } + + return dmabuf_info[msl_idx].fd; +} + +RTE_EXPORT_EXPERIMENTAL_SYMBOL(rte_memseg_list_get_dmabuf_fd, 26.03) +int +rte_memseg_list_get_dmabuf_fd(const struct rte_memseg_list *msl) +{ + int ret; + + rte_mcfg_mem_read_lock(); + ret = rte_memseg_list_get_dmabuf_fd_unsafe(msl); + rte_mcfg_mem_read_unlock(); + + return ret; +} + +RTE_EXPORT_EXPERIMENTAL_SYMBOL(rte_memseg_list_get_dmabuf_offset_unsafe, 26.03) +int +rte_memseg_list_get_dmabuf_offset_unsafe(const struct rte_memseg_list *msl, + uint64_t *offset) +{ + struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config; + int msl_idx; + + if (msl == NULL || offset == NULL) { + rte_errno = EINVAL; + return -1; + } + + msl_idx = msl - mcfg->memsegs; + if (msl_idx < 0 || msl_idx >= RTE_MAX_MEMSEG_LISTS) { + rte_errno = EINVAL; + return -1; + } + + *offset = dmabuf_info[msl_idx].offset; + return 0; +} + +RTE_EXPORT_EXPERIMENTAL_SYMBOL(rte_memseg_list_get_dmabuf_offset, 26.03) +int +rte_memseg_list_get_dmabuf_offset(const struct rte_memseg_list *msl, + uint64_t *offset) +{ + int ret; + + rte_mcfg_mem_read_lock(); + ret = rte_memseg_list_get_dmabuf_offset_unsafe(msl, offset); + rte_mcfg_mem_read_unlock(); + + return ret; +} + +static int +extmem_register(void *va_addr, size_t len, + int dmabuf_fd, uint64_t dmabuf_offset, + rte_iova_t iova_addrs[], unsigned int n_pages, size_t page_sz) { struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config; unsigned int socket_id, n; @@ -967,10 +1083,19 @@ rte_extmem_register(void *va_addr, size_t len, rte_iova_t iova_addrs[], /* we can create a new memseg */ n = len / page_sz; - if (malloc_heap_create_external_seg(va_addr, iova_addrs, n, + if (dmabuf_fd < 0) { + if (malloc_heap_create_external_seg(va_addr, iova_addrs, n, page_sz, "extmem", socket_id) == NULL) { - ret = -1; - goto unlock; + ret = -1; + goto unlock; + } + } else { + if (malloc_heap_create_external_seg_dmabuf(va_addr, iova_addrs, n, + page_sz, 
"extmem_dmabuf", socket_id, + dmabuf_fd, dmabuf_offset) == NULL) { + ret = -1; + goto unlock; + } } /* memseg list successfully created - increment next socket ID */ @@ -980,6 +1105,34 @@ rte_extmem_register(void *va_addr, size_t len, rte_iova_t iova_addrs[], return ret; } +RTE_EXPORT_SYMBOL(rte_extmem_register) +int +rte_extmem_register(void *va_addr, size_t len, rte_iova_t iova_addrs[], + unsigned int n_pages, size_t page_sz) +{ + return rte_extmem_register_dmabuf(va_addr, len, -1, 0, iova_addrs, n_pages, page_sz); +} + +RTE_EXPORT_EXPERIMENTAL_SYMBOL(rte_extmem_register_dmabuf, 26.03) +int +rte_extmem_register_dmabuf(void *va_addr, size_t len, + int dmabuf_fd, uint64_t dmabuf_offset, + rte_iova_t iova_addrs[], unsigned int n_pages, size_t page_sz) +{ + if (dmabuf_fd < 0) { + rte_errno = EINVAL; + return -1; + } + + return extmem_register(va_addr, + len, + dmabuf_fd, + dmabuf_offset, + iova_addrs, + n_pages, + page_sz); +} + RTE_EXPORT_SYMBOL(rte_extmem_unregister) int rte_extmem_unregister(void *va_addr, size_t len) diff --git a/lib/eal/common/eal_memalloc.h b/lib/eal/common/eal_memalloc.h index 0c267066d9..e7e807ddcb 100644 --- a/lib/eal/common/eal_memalloc.h +++ b/lib/eal/common/eal_memalloc.h @@ -90,6 +90,27 @@ eal_memalloc_set_seg_list_fd(int list_idx, int fd); int eal_memalloc_get_seg_fd_offset(int list_idx, int seg_idx, size_t *offset); +/* + * Set dma-buf info for a memseg list. + * Returns 0 on success, -errno on failure. + */ +int +eal_memseg_list_set_dmabuf_info(int list_idx, int fd, uint64_t offset); + +/* + * Get dma-buf fd for a memseg list. + * Returns fd (>= 0) on success, -1 if not dma-buf backed, -errno on error. + */ +int +eal_memseg_list_get_dmabuf_fd(int list_idx); + +/* + * Get dma-buf offset for a memseg list. + * Returns 0 on success, -errno on failure. 
+ */ +int +eal_memseg_list_get_dmabuf_offset(int list_idx, uint64_t *offset); + int eal_memalloc_init(void) __rte_requires_shared_capability(rte_mcfg_mem_get_lock()); diff --git a/lib/eal/common/malloc_heap.c b/lib/eal/common/malloc_heap.c index 39240c261c..bf986fe654 100644 --- a/lib/eal/common/malloc_heap.c +++ b/lib/eal/common/malloc_heap.c @@ -1232,6 +1232,33 @@ malloc_heap_create_external_seg(void *va_addr, rte_iova_t iova_addrs[], msl->version = 0; msl->external = 1; + /* initialize dma-buf info to "not dma-buf backed" */ + eal_memseg_list_set_dmabuf_info(i, -1, 0); + + return msl; +} + +struct rte_memseg_list * +malloc_heap_create_external_seg_dmabuf(void *va_addr, rte_iova_t iova_addrs[], + unsigned int n_pages, size_t page_sz, const char *seg_name, + unsigned int socket_id, int dmabuf_fd, uint64_t dmabuf_offset) +{ + struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config; + struct rte_memseg_list *msl; + int msl_idx; + + /* Create the base external segment */ + msl = malloc_heap_create_external_seg(va_addr, iova_addrs, n_pages, + page_sz, seg_name, socket_id); + if (msl == NULL) + return NULL; + + /* Get memseg list index */ + msl_idx = msl - mcfg->memsegs; + + /* Set dma-buf info in the internal side-table */ + eal_memseg_list_set_dmabuf_info(msl_idx, dmabuf_fd, dmabuf_offset); + return msl; } diff --git a/lib/eal/common/malloc_heap.h b/lib/eal/common/malloc_heap.h index dfc56d4ae3..87525d1a68 100644 --- a/lib/eal/common/malloc_heap.h +++ b/lib/eal/common/malloc_heap.h @@ -51,6 +51,11 @@ malloc_heap_create_external_seg(void *va_addr, rte_iova_t iova_addrs[], unsigned int n_pages, size_t page_sz, const char *seg_name, unsigned int socket_id); +struct rte_memseg_list * +malloc_heap_create_external_seg_dmabuf(void *va_addr, rte_iova_t iova_addrs[], + unsigned int n_pages, size_t page_sz, const char *seg_name, + unsigned int socket_id, int dmabuf_fd, uint64_t dmabuf_offset); + struct rte_memseg_list * malloc_heap_find_external_seg(void 
*va_addr, size_t len); diff --git a/lib/eal/include/rte_memory.h b/lib/eal/include/rte_memory.h index b6e97ad695..fffeb8fcf5 100644 --- a/lib/eal/include/rte_memory.h +++ b/lib/eal/include/rte_memory.h @@ -405,6 +405,98 @@ int rte_memseg_get_fd_offset_thread_unsafe(const struct rte_memseg *ms, size_t *offset); +/** + * @warning + * @b EXPERIMENTAL: this API may change without prior notice. + * + * Get dma-buf file descriptor associated with a memseg list. + * + * @note This function read-locks the memory hotplug subsystem, and thus cannot + * be used within memory-related callback functions. + * + * @param msl + * A pointer to memseg list for which to get dma-buf fd. + * + * @return + * Valid dma-buf file descriptor (>= 0) in case of success. + * -1 if not dma-buf backed or in case of error, with ``rte_errno`` set to: + * - EINVAL - ``msl`` pointer was NULL or did not point to a valid memseg list + */ +__rte_experimental +int +rte_memseg_list_get_dmabuf_fd(const struct rte_memseg_list *msl); + +/** + * @warning + * @b EXPERIMENTAL: this API may change without prior notice. + * + * Get dma-buf file descriptor associated with a memseg list. + * + * @note This function does not perform any locking, and is only safe to call + * from within memory-related callback functions. + * + * @param msl + * A pointer to memseg list for which to get dma-buf fd. + * + * @return + * Valid dma-buf file descriptor (>= 0) in case of success. + * -1 if not dma-buf backed or in case of error, with ``rte_errno`` set to: + * - EINVAL - ``msl`` pointer was NULL or did not point to a valid memseg list + */ +__rte_experimental +int +rte_memseg_list_get_dmabuf_fd_unsafe(const struct rte_memseg_list *msl); + +/** + * @warning + * @b EXPERIMENTAL: this API may change without prior notice. + * + * Get dma-buf offset associated with a memseg list. + * + * @note This function read-locks the memory hotplug subsystem, and thus cannot + * be used within memory-related callback functions. 
+ * + * @param msl + * A pointer to memseg list for which to get dma-buf offset. + * @param offset + * A pointer to offset value where the result will be stored. + * + * @return + * 0 on success. + * -1 in case of error, with ``rte_errno`` set to: + * - EINVAL - ``msl`` pointer was NULL or did not point to a valid memseg list + * - EINVAL - ``offset`` pointer was NULL + */ +__rte_experimental +int +rte_memseg_list_get_dmabuf_offset(const struct rte_memseg_list *msl, + uint64_t *offset); + +/** + * @warning + * @b EXPERIMENTAL: this API may change without prior notice. + * + * Get dma-buf offset associated with a memseg list. + * + * @note This function does not perform any locking, and is only safe to call + * from within memory-related callback functions. + * + * @param msl + * A pointer to memseg list for which to get dma-buf offset. + * @param offset + * A pointer to offset value where the result will be stored. + * + * @return + * 0 on success. + * -1 in case of error, with ``rte_errno`` set to: + * - EINVAL - ``msl`` pointer was NULL or did not point to a valid memseg list + * - EINVAL - ``offset`` pointer was NULL + */ +__rte_experimental +int +rte_memseg_list_get_dmabuf_offset_unsafe(const struct rte_memseg_list *msl, + uint64_t *offset); + /** * Register external memory chunk with DPDK. * @@ -443,6 +535,59 @@ int rte_extmem_register(void *va_addr, size_t len, rte_iova_t iova_addrs[], unsigned int n_pages, size_t page_sz); +/** + * @warning + * @b EXPERIMENTAL: this API may change without prior notice. + * + * Register external memory chunk backed by a dma-buf file descriptor and offset. + * + * This is similar to rte_extmem_register() but additionally stores dma-buf + * file descriptor information, allowing drivers to use dma-buf based + * memory registration (e.g., ibv_reg_dmabuf_mr for RDMA devices). + * + * @note Using this API is mutually exclusive with ``rte_malloc`` family of + * API's. + * + * @note This API will not perform any DMA mapping. 
It is expected that user + * will do that themselves via rte_dev_dma_map(). + * + * @note Before accessing this memory in other processes, it needs to be + * attached in each of those processes by calling ``rte_extmem_attach`` in + * each other process. + * + * @param va_addr + * Start of virtual area to register (mmap'd address of the dma-buf). + * Must be aligned by ``page_sz``. + * @param len + * Length of virtual area to register. Must be aligned by ``page_sz``. + * This is independent of dma-buf offset. + * @param dmabuf_fd + * File descriptor of the dma-buf. + * @param dmabuf_offset + * Offset within the dma-buf where the registered region starts. + * @param iova_addrs + * Array of page IOVA addresses corresponding to each page in this memory + * area. Can be NULL, in which case page IOVA addresses will be set to + * RTE_BAD_IOVA. + * @param n_pages + * Number of elements in the iova_addrs array. Ignored if ``iova_addrs`` + * is NULL. + * @param page_sz + * Page size of the underlying memory + * + * @return + * - 0 on success + * - -1 in case of error, with rte_errno set to one of the following: + * EINVAL - one of the parameters was invalid + * EEXIST - memory chunk is already registered + * ENOSPC - no more space in internal config to store a new memory chunk + */ +__rte_experimental +int +rte_extmem_register_dmabuf(void *va_addr, size_t len, + int dmabuf_fd, uint64_t dmabuf_offset, + rte_iova_t iova_addrs[], unsigned int n_pages, size_t page_sz); + /** * Unregister external memory chunk with DPDK. * -- 2.52.0 ^ permalink raw reply related [flat|nested] 27+ messages in thread
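The EAL-side mechanism in the patch above — a process-local side-table of (fd, offset) pairs indexed by memseg list, plus the base_va-relative offset computation consumed on the driver side — can be sketched in isolation. This is a self-contained illustration, not DPDK code: MAX_MEMSEG_LISTS and the function names here stand in for the real RTE_MAX_MEMSEG_LISTS and eal_memseg_list_* symbols.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_MEMSEG_LISTS 128 /* stands in for RTE_MAX_MEMSEG_LISTS */

/* Per-list dma-buf metadata, kept outside any public struct so the
 * public ABI is unchanged. */
static struct {
	int fd;          /* dma-buf fd, -1 if the list is not dma-buf backed */
	uint64_t offset; /* base offset of the list within the dma-buf */
} dmabuf_info[MAX_MEMSEG_LISTS];

static void dmabuf_info_init(void)
{
	for (int i = 0; i < MAX_MEMSEG_LISTS; i++) {
		dmabuf_info[i].fd = -1;
		dmabuf_info[i].offset = 0;
	}
}

static int set_dmabuf_info(int list_idx, int fd, uint64_t offset)
{
	if (list_idx < 0 || list_idx >= MAX_MEMSEG_LISTS)
		return -1;
	dmabuf_info[list_idx].fd = fd;
	dmabuf_info[list_idx].offset = offset;
	return 0;
}

static int get_dmabuf_fd(int list_idx)
{
	if (list_idx < 0 || list_idx >= MAX_MEMSEG_LISTS)
		return -1;
	return dmabuf_info[list_idx].fd;
}

/* Offset of an arbitrary VA inside the dma-buf: the list's base offset
 * plus the VA's distance from the list's base_va. This mirrors the
 * "dmabuf_off += addr - base_va" computation done in the mlx5 driver. */
static uint64_t va_to_dmabuf_off(int list_idx, uintptr_t va, uintptr_t base_va)
{
	return dmabuf_info[list_idx].offset + (uint64_t)(va - base_va);
}
```

Note the sketch also makes the reviewer's later point visible: the table is an ordinary static array, so each process gets its own copy with no cross-process synchronization.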
* Re: [PATCH v4 1/2] eal: support dmabuf 2026-02-04 15:50 ` [PATCH v4 1/2] eal: " Cliff Burdick @ 2026-02-12 13:57 ` Burakov, Anatoly 0 siblings, 0 replies; 27+ messages in thread From: Burakov, Anatoly @ 2026-02-12 13:57 UTC (permalink / raw) To: Cliff Burdick, dev; +Cc: Thomas Monjalon On 2/4/2026 4:50 PM, Cliff Burdick wrote: > dmabuf is a modern Linux kernel feature to allow DMA transfers between > two drivers. Common examples of usage are streaming video devices and > NIC to GPU transfers. Prior to dmabuf users had to load proprietary > drivers to expose the DMA mappings. With dmabuf the proprietary drivers > are no longer required. > > A new api function rte_extmem_register_dmabuf is introduced to create > the mapping from a dmabuf file descriptor. dmabuf uses a file descriptor > and an offset that has been pre-opened with the kernel. The kernel uses > the file descriptor to map to a VA pointer. To avoid ABI changes, a > static struct is used inside of eal_common_memory.c, and lookups are > done on this struct rather than from the rte_memseg_list. > > Ideally we would like to add both the dmabuf file descriptor and offset > to rte_memseg_list, but it's not clear if we can reuse existing fields > when using the dmabuf API. > > We could rename the external flag to a more generic "properties" flag > where "external" is the lowest bit, then we can use the second bit to > indicate the presence of dmabuf. In the presence of the flag for > dmabuf we could reuse the base_va address field for the dmabuf offset, > and the socket_id for the file descriptor. > > Signed-off-by: Cliff Burdick <cburdick@nvidia.com> > --- Hi, A few random thoughts about the patchset. For one, this API is obviously Linux-only. This in itself is not a problem (we do have VFIO API...) but I would really like to avoid that if possible. 
For another, I don't see any support for secondary processes - the dmabuf array is process-local, and calling register() from a secondary process would presumably either fail or create a duplicate segment, depending on exactly what you pass into the register call. If this scenario isn't supported, it should at least be explicitly disallowed and documented as such. My biggest concern is that this is creating another type of external memory segment and thus segregating the API, but isn't doing it in a way that is generic. I can see a valid use case for this, but what we're essentially doing here is storing some metadata together with the segment. So, perhaps, this is what we should do? That would seem like the cleanest solution to me, and it would extend the usefulness of the API to other use cases where there may be a requirement to store some metadata/fd/whatever with the segment. You could then build another API on top of this (a library?) that would handle things like secondary process synchronization with IPC, so that you have all fd's valid in all processes. Thoughts? -- Thanks, Anatoly ^ permalink raw reply [flat|nested] 27+ messages in thread
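One way to read the reviewer's generic-metadata suggestion is a small key/value store attached to each memseg list, with the dma-buf fd and offset becoming just two entries rather than a special-cased segment type. The following is a hypothetical sketch only — none of these names exist in DPDK, and the fixed-size table is purely for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAX_MEMSEG_LISTS 128 /* stands in for RTE_MAX_MEMSEG_LISTS */
#define META_SLOTS 4         /* arbitrary per-list metadata capacity */
#define META_KEY_LEN 32

struct seg_meta {
	char key[META_KEY_LEN];
	uint64_t value;
	int used;
};

static struct seg_meta meta_tbl[MAX_MEMSEG_LISTS][META_SLOTS];

/* Attach (or update in place) a named value on a memseg list. */
static int seg_meta_set(int list_idx, const char *key, uint64_t value)
{
	if (list_idx < 0 || list_idx >= MAX_MEMSEG_LISTS)
		return -1;
	for (int i = 0; i < META_SLOTS; i++) {
		struct seg_meta *m = &meta_tbl[list_idx][i];
		if (m->used && strcmp(m->key, key) != 0)
			continue; /* occupied by a different key */
		strncpy(m->key, key, META_KEY_LEN - 1);
		m->key[META_KEY_LEN - 1] = '\0';
		m->value = value;
		m->used = 1;
		return 0;
	}
	return -1; /* no free slot */
}

/* Look up a named value; returns 0 and fills *value on success. */
static int seg_meta_get(int list_idx, const char *key, uint64_t *value)
{
	if (list_idx < 0 || list_idx >= MAX_MEMSEG_LISTS || value == NULL)
		return -1;
	for (int i = 0; i < META_SLOTS; i++) {
		const struct seg_meta *m = &meta_tbl[list_idx][i];
		if (m->used && strcmp(m->key, key) == 0) {
			*value = m->value;
			return 0;
		}
	}
	return -1;
}
```

With an API of this shape, the dma-buf case reduces to two set calls (e.g. keys "dmabuf_fd" and "dmabuf_offset"), and other use cases needing per-segment fds or metadata would get the same hook without another segment type.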
* [PATCH v4 2/2] common/mlx5: support dmabuf 2026-02-04 15:50 ` [PATCH v4 0/2] " Cliff Burdick 2026-02-04 15:50 ` [PATCH v4 1/2] eal: " Cliff Burdick @ 2026-02-04 15:50 ` Cliff Burdick 2026-02-05 18:48 ` [PATCH v4 0/2] " Stephen Hemminger 2026-03-31 3:15 ` Stephen Hemminger 3 siblings, 0 replies; 27+ messages in thread From: Cliff Burdick @ 2026-02-04 15:50 UTC (permalink / raw) To: dev Cc: anatoly.burakov, Thomas Monjalon, Dariusz Sosnowski, Viacheslav Ovsiienko, Bing Zhao, Ori Kam, Suanming Mou, Matan Azrad dmabuf is a modern Linux kernel feature that allows DMA transfers between two drivers. Common examples of usage are streaming video devices and NIC to GPU transfers. Prior to dmabuf, users had to load proprietary drivers to expose the DMA mappings. With dmabuf, the proprietary drivers are no longer required. Signed-off-by: Cliff Burdick <cburdick@nvidia.com> --- .mailmap | 2 +- drivers/common/mlx5/linux/meson.build | 2 + drivers/common/mlx5/linux/mlx5_common_verbs.c | 48 +++++++- drivers/common/mlx5/linux/mlx5_glue.c | 19 +++ drivers/common/mlx5/linux/mlx5_glue.h | 3 + drivers/common/mlx5/mlx5_common.c | 42 ++++++- drivers/common/mlx5/mlx5_common_mr.c | 113 +++++++++++++++++- drivers/common/mlx5/mlx5_common_mr.h | 17 ++- drivers/common/mlx5/windows/mlx5_common_os.c | 8 +- drivers/crypto/mlx5/mlx5_crypto.h | 1 + drivers/crypto/mlx5/mlx5_crypto_gcm.c | 3 +- 11 files changed, 249 insertions(+), 9 deletions(-) diff --git a/.mailmap b/.mailmap index 4c2b2f921d..0a8a67098f 100644 --- a/.mailmap +++ b/.mailmap @@ -291,8 +291,8 @@ Cian Ferriter <cian.ferriter@intel.com> Ciara Loftus <ciara.loftus@intel.com> Ciara Power <ciara.power@intel.com> Claire Murphy <claire.k.murphy@intel.com> -Cliff Burdick <cburdick@nvidia.com> Clemens Famulla-Conrad <cfamullaconrad@suse.com> +Cliff Burdick <cburdick@nvidia.com> Cody Doucette <doucette@bu.edu> Congwen Zhang <zhang.congwen@zte.com.cn> Conor Fogarty <conor.fogarty@intel.com> diff --git a/drivers/common/mlx5/linux/meson.build 
b/drivers/common/mlx5/linux/meson.build index 3767e7a69b..8e83104165 100644 --- a/drivers/common/mlx5/linux/meson.build +++ b/drivers/common/mlx5/linux/meson.build @@ -203,6 +203,8 @@ has_sym_args = [ 'mlx5dv_dr_domain_allow_duplicate_rules' ], [ 'HAVE_MLX5_IBV_REG_MR_IOVA', 'infiniband/verbs.h', 'ibv_reg_mr_iova' ], + [ 'HAVE_IBV_REG_DMABUF_MR', 'infiniband/verbs.h', + 'ibv_reg_dmabuf_mr' ], [ 'HAVE_MLX5_IBV_IMPORT_CTX_PD_AND_MR', 'infiniband/verbs.h', 'ibv_import_device' ], [ 'HAVE_MLX5DV_DR_ACTION_CREATE_DEST_ROOT_TABLE', 'infiniband/mlx5dv.h', diff --git a/drivers/common/mlx5/linux/mlx5_common_verbs.c b/drivers/common/mlx5/linux/mlx5_common_verbs.c index 98260df470..f6d18fd5df 100644 --- a/drivers/common/mlx5/linux/mlx5_common_verbs.c +++ b/drivers/common/mlx5/linux/mlx5_common_verbs.c @@ -129,6 +129,47 @@ mlx5_common_verbs_reg_mr(void *pd, void *addr, size_t length, return 0; } +/** + * Register mr for dma-buf backed memory. Given protection domain pointer, + * dma-buf fd, offset and length, register the memory region. + * + * @param[in] pd + * Pointer to protection domain context. + * @param[in] offset + * Offset within the dma-buf. + * @param[in] length + * Length of the memory to register. + * @param[in] fd + * File descriptor of the dma-buf. + * @param[out] pmd_mr + * pmd_mr struct set with lkey, address, length and pointer to mr object + * + * @return + * 0 on successful registration, -1 otherwise + */ +RTE_EXPORT_INTERNAL_SYMBOL(mlx5_common_verbs_reg_dmabuf_mr) +int +mlx5_common_verbs_reg_dmabuf_mr(void *pd, uint64_t offset, size_t length, + uint64_t iova, int fd, + struct mlx5_pmd_mr *pmd_mr) +{ + struct ibv_mr *ibv_mr; + ibv_mr = mlx5_glue->reg_dmabuf_mr(pd, offset, length, iova, fd, + IBV_ACCESS_LOCAL_WRITE | + (haswell_broadwell_cpu ? 
0 : + IBV_ACCESS_RELAXED_ORDERING)); + if (!ibv_mr) + return -1; + + *pmd_mr = (struct mlx5_pmd_mr){ + .lkey = ibv_mr->lkey, + .addr = ibv_mr->addr, + .len = ibv_mr->length, + .obj = (void *)ibv_mr, + }; + return 0; +} + /** * Deregister mr. Given the mlx5 pmd MR - deregister the MR * @@ -151,13 +192,18 @@ mlx5_common_verbs_dereg_mr(struct mlx5_pmd_mr *pmd_mr) * * @param[out] reg_mr_cb * Pointer to reg_mr func + * @param[out] reg_dmabuf_mr_cb + * Pointer to reg_dmabuf_mr func * @param[out] dereg_mr_cb * Pointer to dereg_mr func */ RTE_EXPORT_INTERNAL_SYMBOL(mlx5_os_set_reg_mr_cb) void -mlx5_os_set_reg_mr_cb(mlx5_reg_mr_t *reg_mr_cb, mlx5_dereg_mr_t *dereg_mr_cb) +mlx5_os_set_reg_mr_cb(mlx5_reg_mr_t *reg_mr_cb, + mlx5_reg_dmabuf_mr_t *reg_dmabuf_mr_cb, + mlx5_dereg_mr_t *dereg_mr_cb) { *reg_mr_cb = mlx5_common_verbs_reg_mr; + *reg_dmabuf_mr_cb = mlx5_common_verbs_reg_dmabuf_mr; *dereg_mr_cb = mlx5_common_verbs_dereg_mr; } diff --git a/drivers/common/mlx5/linux/mlx5_glue.c b/drivers/common/mlx5/linux/mlx5_glue.c index a91eaa429d..6fac7f2bcd 100644 --- a/drivers/common/mlx5/linux/mlx5_glue.c +++ b/drivers/common/mlx5/linux/mlx5_glue.c @@ -291,6 +291,24 @@ mlx5_glue_reg_mr_iova(struct ibv_pd *pd, void *addr, size_t length, #endif } +static struct ibv_mr * +mlx5_glue_reg_dmabuf_mr(struct ibv_pd *pd, uint64_t offset, size_t length, + uint64_t iova, int fd, int access) +{ +#ifdef HAVE_IBV_REG_DMABUF_MR + return ibv_reg_dmabuf_mr(pd, offset, length, iova, fd, access); +#else + (void)pd; + (void)offset; + (void)length; + (void)iova; + (void)fd; + (void)access; + errno = ENOTSUP; + return NULL; +#endif +} + static struct ibv_mr * mlx5_glue_alloc_null_mr(struct ibv_pd *pd) { @@ -1619,6 +1637,7 @@ const struct mlx5_glue *mlx5_glue = &(const struct mlx5_glue) { .modify_qp = mlx5_glue_modify_qp, .reg_mr = mlx5_glue_reg_mr, .reg_mr_iova = mlx5_glue_reg_mr_iova, + .reg_dmabuf_mr = mlx5_glue_reg_dmabuf_mr, .alloc_null_mr = mlx5_glue_alloc_null_mr, .dereg_mr = mlx5_glue_dereg_mr, 
 	.create_counter_set = mlx5_glue_create_counter_set,
diff --git a/drivers/common/mlx5/linux/mlx5_glue.h b/drivers/common/mlx5/linux/mlx5_glue.h
index 81d6b0aaf9..66216d1194 100644
--- a/drivers/common/mlx5/linux/mlx5_glue.h
+++ b/drivers/common/mlx5/linux/mlx5_glue.h
@@ -219,6 +219,9 @@ struct mlx5_glue {
 	struct ibv_mr *(*reg_mr_iova)(struct ibv_pd *pd, void *addr,
 				      size_t length, uint64_t iova,
 				      int access);
+	struct ibv_mr *(*reg_dmabuf_mr)(struct ibv_pd *pd, uint64_t offset,
+					size_t length, uint64_t iova,
+					int fd, int access);
 	struct ibv_mr *(*alloc_null_mr)(struct ibv_pd *pd);
 	int (*dereg_mr)(struct ibv_mr *mr);
 	struct ibv_counter_set *(*create_counter_set)
diff --git a/drivers/common/mlx5/mlx5_common.c b/drivers/common/mlx5/mlx5_common.c
index 84a93e7dbd..82cf17ca78 100644
--- a/drivers/common/mlx5/mlx5_common.c
+++ b/drivers/common/mlx5/mlx5_common.c
@@ -13,6 +13,7 @@
 #include <rte_class.h>
 #include <rte_malloc.h>
 #include <rte_eal_paging.h>
+#include <rte_memory.h>

 #include "mlx5_common.h"
 #include "mlx5_common_os.h"
@@ -1125,6 +1126,7 @@ mlx5_common_dev_dma_map(struct rte_device *rte_dev, void *addr,
 	struct mlx5_common_device *dev;
 	struct mlx5_mr_btree *bt;
 	struct mlx5_mr *mr;
+	struct rte_memseg_list *msl;

 	dev = to_mlx5_device(rte_dev);
 	if (!dev) {
@@ -1134,8 +1136,44 @@
 		rte_errno = ENODEV;
 		return -1;
 	}
-	mr = mlx5_create_mr_ext(dev->pd, (uintptr_t)addr, len,
-				SOCKET_ID_ANY, dev->mr_scache.reg_mr_cb);
+	/* Check if this is dma-buf backed external memory */
+	msl = rte_mem_virt2memseg_list(addr);
+	if (msl != NULL && msl->external) {
+		int dmabuf_fd = rte_memseg_list_get_dmabuf_fd(msl);
+		if (dmabuf_fd >= 0) {
+			uint64_t dmabuf_off;
+			/* Get base offset from memseg list */
+			int ret = rte_memseg_list_get_dmabuf_offset(
+					msl, &dmabuf_off);
+			if (ret < 0) {
+				DRV_LOG(ERR,
+					"Failed to get dma-buf offset for memseg list %p",
+					(void *)msl);
+				return -1;
+			}
+			/* Calculate offset within dma-buf address */
+			dmabuf_off += ((uintptr_t)addr - (uintptr_t)msl->base_va);
+			/* Use dma-buf MR registration */
+			mr = mlx5_create_mr_ext_dmabuf(dev->pd,
+						       (uintptr_t)addr,
+						       len,
+						       SOCKET_ID_ANY,
+						       dmabuf_fd,
+						       dmabuf_off,
+						       dev->mr_scache.reg_dmabuf_mr_cb);
+		} else {
+			/* Use regular MR registration */
+			mr = mlx5_create_mr_ext(dev->pd,
+						(uintptr_t)addr,
+						len,
+						SOCKET_ID_ANY,
+						dev->mr_scache.reg_mr_cb);
+		}
+	} else {
+		/* Use regular MR registration */
+		mr = mlx5_create_mr_ext(dev->pd, (uintptr_t)addr, len,
+					SOCKET_ID_ANY, dev->mr_scache.reg_mr_cb);
+	}
 	if (!mr) {
 		DRV_LOG(WARNING, "Device %s unable to DMA map", rte_dev->name);
 		rte_errno = EINVAL;
diff --git a/drivers/common/mlx5/mlx5_common_mr.c b/drivers/common/mlx5/mlx5_common_mr.c
index 8ed988dec9..8f31eaefe8 100644
--- a/drivers/common/mlx5/mlx5_common_mr.c
+++ b/drivers/common/mlx5/mlx5_common_mr.c
@@ -8,6 +8,7 @@
 #include <rte_eal_memconfig.h>
 #include <rte_eal_paging.h>
 #include <rte_errno.h>
+#include <rte_memory.h>
 #include <rte_mempool.h>
 #include <rte_malloc.h>
 #include <rte_rwlock.h>
@@ -1141,6 +1142,7 @@ mlx5_mr_create_cache(struct mlx5_mr_share_cache *share_cache, int socket)
 {
 	/* Set the reg_mr and dereg_mr callback functions */
 	mlx5_os_set_reg_mr_cb(&share_cache->reg_mr_cb,
+			      &share_cache->reg_dmabuf_mr_cb,
 			      &share_cache->dereg_mr_cb);
 	rte_rwlock_init(&share_cache->rwlock);
 	rte_rwlock_init(&share_cache->mprwlock);
@@ -1221,6 +1223,74 @@ mlx5_create_mr_ext(void *pd, uintptr_t addr, size_t len, int socket_id,
 	return mr;
 }

+/**
+ * Creates a memory region for dma-buf backed external memory.
+ *
+ * @param pd
+ *   Pointer to pd of a device (net, regex, vdpa,...).
+ * @param addr
+ *   Starting virtual address of memory (mmap'd address).
+ * @param len
+ *   Length of memory segment being mapped.
+ * @param socket_id
+ *   Socket to allocate heap memory for the control structures.
+ * @param dmabuf_fd
+ *   File descriptor of the dma-buf.
+ * @param dmabuf_offset
+ *   Offset within the dma-buf.
+ * @param reg_dmabuf_mr_cb
+ *   Callback function for dma-buf MR registration.
+ *
+ * @return
+ *   Pointer to MR structure on success, NULL otherwise.
+ */
+struct mlx5_mr *
+mlx5_create_mr_ext_dmabuf(void *pd, uintptr_t addr, size_t len, int socket_id,
+			  int dmabuf_fd, uint64_t dmabuf_offset,
+			  mlx5_reg_dmabuf_mr_t reg_dmabuf_mr_cb)
+{
+	struct mlx5_mr *mr = NULL;
+
+	if (reg_dmabuf_mr_cb == NULL) {
+		DRV_LOG(WARNING, "dma-buf MR registration not supported");
+		rte_errno = ENOTSUP;
+		return NULL;
+	}
+	mr = mlx5_malloc(MLX5_MEM_RTE | MLX5_MEM_ZERO,
+			 RTE_ALIGN_CEIL(sizeof(*mr), RTE_CACHE_LINE_SIZE),
+			 RTE_CACHE_LINE_SIZE, socket_id);
+	if (mr == NULL)
+		return NULL;
+	if (reg_dmabuf_mr_cb(pd, dmabuf_offset, len, addr, dmabuf_fd,
+			     &mr->pmd_mr) < 0) {
+		DRV_LOG(WARNING,
+			"Fail to create dma-buf MR for address (%p) fd=%d",
+			(void *)addr, dmabuf_fd);
+		mlx5_free(mr);
+		return NULL;
+	}
+	mr->msl = NULL; /* Mark it is external memory. */
+	mr->ms_bmp = NULL;
+	mr->ms_n = 1;
+	mr->ms_bmp_n = 1;
+	/*
+	 * For dma-buf MR, the returned addr may be NULL since there's no VA
+	 * in the registration. Store the user-provided addr for cache lookup.
+	 */
+	if (mr->pmd_mr.addr == NULL)
+		mr->pmd_mr.addr = (void *)addr;
+	if (mr->pmd_mr.len == 0)
+		mr->pmd_mr.len = len;
+	DRV_LOG(DEBUG,
+		"MR CREATED (%p) for dma-buf external memory %p (fd=%d):\n"
+		"  [0x%" PRIxPTR ", 0x%" PRIxPTR "),"
+		" lkey=0x%x base_idx=%u ms_n=%u, ms_bmp_n=%u",
+		(void *)mr, (void *)addr, dmabuf_fd,
+		addr, addr + len, rte_cpu_to_be_32(mr->pmd_mr.lkey),
+		mr->ms_base_idx, mr->ms_n, mr->ms_bmp_n);
+	return mr;
+}
+
 /**
  * Callback for memory free event. Iterate freed memsegs and check whether it
  * belongs to an existing MR. If found, clear the bit from bitmap of MR. As a
@@ -1747,9 +1817,48 @@ mlx5_mr_mempool_register_primary(struct mlx5_mr_share_cache *share_cache,
 		struct mlx5_mempool_mr *mr = &new_mpr->mrs[i];
 		const struct mlx5_range *range = &ranges[i];
 		size_t len = range->end - range->start;
+		struct rte_memseg_list *msl;
+		int reg_result;
+
+		/* Check if this is dma-buf backed external memory */
+		msl = rte_mem_virt2memseg_list((void *)range->start);
+		if (msl != NULL && msl->external &&
+		    share_cache->reg_dmabuf_mr_cb != NULL) {
+			int dmabuf_fd = rte_memseg_list_get_dmabuf_fd(msl);
+			if (dmabuf_fd >= 0) {
+				uint64_t dmabuf_off;
+				/* Get base offset from memseg list */
+				ret = rte_memseg_list_get_dmabuf_offset(msl, &dmabuf_off);
+				if (ret < 0) {
+					DRV_LOG(ERR, "Failed to get dma-buf offset for memseg list %p",
+						(void *)msl);
+					goto exit;
+				}
+				/* Calculate offset within dma-buf for this specific range */
+				dmabuf_off += (range->start - (uintptr_t)msl->base_va);
+				/* Use dma-buf MR registration */
+				reg_result = share_cache->reg_dmabuf_mr_cb(pd,
+					dmabuf_off, len, range->start, dmabuf_fd,
+					&mr->pmd_mr);
+				if (reg_result == 0) {
+					/* For dma-buf MR, set addr if not set by driver */
+					if (mr->pmd_mr.addr == NULL)
+						mr->pmd_mr.addr = (void *)range->start;
+					if (mr->pmd_mr.len == 0)
+						mr->pmd_mr.len = len;
+				}
+			} else {
+				/* Use regular MR registration */
+				reg_result = share_cache->reg_mr_cb(pd,
+					(void *)range->start, len, &mr->pmd_mr);
+			}
+		} else {
+			/* Use regular MR registration */
+			reg_result = share_cache->reg_mr_cb(pd,
+				(void *)range->start, len, &mr->pmd_mr);
+		}

-		if (share_cache->reg_mr_cb(pd, (void *)range->start, len,
-					   &mr->pmd_mr) < 0) {
+		if (reg_result < 0) {
 			DRV_LOG(ERR,
 				"Failed to create an MR in PD %p for address range "
 				"[0x%" PRIxPTR ", 0x%" PRIxPTR "] (%zu bytes) for mempool %s",
diff --git a/drivers/common/mlx5/mlx5_common_mr.h b/drivers/common/mlx5/mlx5_common_mr.h
index cf7c685e9b..3b967b1323 100644
--- a/drivers/common/mlx5/mlx5_common_mr.h
+++ b/drivers/common/mlx5/mlx5_common_mr.h
@@ -35,6 +35,9 @@ struct mlx5_pmd_mr {
  */
 typedef int (*mlx5_reg_mr_t)(void *pd, void *addr, size_t length,
 			     struct mlx5_pmd_mr *pmd_mr);
+typedef int (*mlx5_reg_dmabuf_mr_t)(void *pd, uint64_t offset, size_t length,
+				    uint64_t iova, int fd,
+				    struct mlx5_pmd_mr *pmd_mr);
 typedef void (*mlx5_dereg_mr_t)(struct mlx5_pmd_mr *pmd_mr);

 /* Memory Region object. */
@@ -87,6 +90,7 @@ struct __rte_packed_begin mlx5_mr_share_cache {
 	struct mlx5_mr_list mr_free_list; /* Freed MR list. */
 	struct mlx5_mempool_reg_list mempool_reg_list; /* Mempool database. */
 	mlx5_reg_mr_t reg_mr_cb; /* Callback to reg_mr func */
+	mlx5_reg_dmabuf_mr_t reg_dmabuf_mr_cb; /* Callback to reg_dmabuf_mr func */
 	mlx5_dereg_mr_t dereg_mr_cb; /* Callback to dereg_mr func */
 } __rte_packed_end;

@@ -233,6 +237,10 @@ mlx5_mr_lookup_list(struct mlx5_mr_share_cache *share_cache,
 struct mlx5_mr *
 mlx5_create_mr_ext(void *pd, uintptr_t addr, size_t len, int socket_id,
 		   mlx5_reg_mr_t reg_mr_cb);
+struct mlx5_mr *
+mlx5_create_mr_ext_dmabuf(void *pd, uintptr_t addr, size_t len, int socket_id,
+			  int dmabuf_fd, uint64_t dmabuf_offset,
+			  mlx5_reg_dmabuf_mr_t reg_dmabuf_mr_cb);
 void
 mlx5_mr_free(struct mlx5_mr *mr, mlx5_dereg_mr_t dereg_mr_cb);
 __rte_internal
 uint32_t
@@ -251,12 +259,19 @@ int
 mlx5_common_verbs_reg_mr(void *pd, void *addr, size_t length,
 			 struct mlx5_pmd_mr *pmd_mr);
 __rte_internal
+int
+mlx5_common_verbs_reg_dmabuf_mr(void *pd, uint64_t offset, size_t length,
+				uint64_t iova, int fd,
+				struct mlx5_pmd_mr *pmd_mr);
+__rte_internal
 void
 mlx5_common_verbs_dereg_mr(struct mlx5_pmd_mr *pmd_mr);

 __rte_internal
 void
-mlx5_os_set_reg_mr_cb(mlx5_reg_mr_t *reg_mr_cb, mlx5_dereg_mr_t *dereg_mr_cb);
+mlx5_os_set_reg_mr_cb(mlx5_reg_mr_t *reg_mr_cb,
+		      mlx5_reg_dmabuf_mr_t *reg_dmabuf_mr_cb,
+		      mlx5_dereg_mr_t *dereg_mr_cb);

 __rte_internal
 int
diff --git a/drivers/common/mlx5/windows/mlx5_common_os.c b/drivers/common/mlx5/windows/mlx5_common_os.c
index 7fac361460..5e284742ab 100644
--- a/drivers/common/mlx5/windows/mlx5_common_os.c
+++ b/drivers/common/mlx5/windows/mlx5_common_os.c
@@ -17,6 +17,7 @@
 #include "mlx5_common.h"
 #include "mlx5_common_os.h"
 #include "mlx5_malloc.h"
+#include "mlx5_common_mr.h"

 /**
  * Initialization routine for run-time dependency on external lib.
@@ -442,15 +443,20 @@ mlx5_os_dereg_mr(struct mlx5_pmd_mr *pmd_mr)
  *
  * @param[out] reg_mr_cb
  *   Pointer to reg_mr func
+ * @param[out] reg_dmabuf_mr_cb
+ *   Pointer to reg_dmabuf_mr func (NULL on Windows - not supported)
  * @param[out] dereg_mr_cb
  *   Pointer to dereg_mr func
  *
  */
 RTE_EXPORT_INTERNAL_SYMBOL(mlx5_os_set_reg_mr_cb)
 void
-mlx5_os_set_reg_mr_cb(mlx5_reg_mr_t *reg_mr_cb, mlx5_dereg_mr_t *dereg_mr_cb)
+mlx5_os_set_reg_mr_cb(mlx5_reg_mr_t *reg_mr_cb,
+		      mlx5_reg_dmabuf_mr_t *reg_dmabuf_mr_cb,
+		      mlx5_dereg_mr_t *dereg_mr_cb)
 {
 	*reg_mr_cb = mlx5_os_reg_mr;
+	*reg_dmabuf_mr_cb = NULL; /* dma-buf not supported on Windows */
 	*dereg_mr_cb = mlx5_os_dereg_mr;
 }

diff --git a/drivers/crypto/mlx5/mlx5_crypto.h b/drivers/crypto/mlx5/mlx5_crypto.h
index f9f127e9e6..b2712c9a8d 100644
--- a/drivers/crypto/mlx5/mlx5_crypto.h
+++ b/drivers/crypto/mlx5/mlx5_crypto.h
@@ -41,6 +41,7 @@ struct mlx5_crypto_priv {
 	struct mlx5_common_device *cdev; /* Backend mlx5 device. */
 	struct rte_cryptodev *crypto_dev;
 	mlx5_reg_mr_t reg_mr_cb; /* Callback to reg_mr func */
+	mlx5_reg_dmabuf_mr_t reg_dmabuf_mr_cb; /* Callback to reg_dmabuf_mr func */
 	mlx5_dereg_mr_t dereg_mr_cb; /* Callback to dereg_mr func */
 	struct mlx5_uar uar; /* User Access Region. */
 	uint32_t max_segs_num; /* Maximum supported data segs. */
diff --git a/drivers/crypto/mlx5/mlx5_crypto_gcm.c b/drivers/crypto/mlx5/mlx5_crypto_gcm.c
index 89f32c7722..380689cfeb 100644
--- a/drivers/crypto/mlx5/mlx5_crypto_gcm.c
+++ b/drivers/crypto/mlx5/mlx5_crypto_gcm.c
@@ -1186,7 +1186,8 @@ mlx5_crypto_gcm_init(struct mlx5_crypto_priv *priv)
 	/* Override AES-GCM specified ops. */
 	dev_ops->sym_session_configure = mlx5_crypto_sym_gcm_session_configure;
-	mlx5_os_set_reg_mr_cb(&priv->reg_mr_cb, &priv->dereg_mr_cb);
+	mlx5_os_set_reg_mr_cb(&priv->reg_mr_cb, &priv->reg_dmabuf_mr_cb,
+			      &priv->dereg_mr_cb);
 	dev_ops->queue_pair_setup = mlx5_crypto_gcm_qp_setup;
 	dev_ops->queue_pair_release = mlx5_crypto_gcm_qp_release;
 	if (mlx5_crypto_is_ipsec_opt(priv)) {
--
2.52.0

^ permalink raw reply related	[flat|nested] 27+ messages in thread
* Re: [PATCH v4 0/2] support dmabuf
  2026-02-04 15:50 ` [PATCH v4 0/2] " Cliff Burdick
  2026-02-04 15:50 ` [PATCH v4 1/2] eal: " Cliff Burdick
  2026-02-04 15:50 ` [PATCH v4 2/2] common/mlx5: " Cliff Burdick
@ 2026-02-05 18:48 ` Stephen Hemminger
  2026-02-05 20:25   ` Cliff Burdick
  2026-03-31  3:15 ` Stephen Hemminger
  3 siblings, 1 reply; 27+ messages in thread
From: Stephen Hemminger @ 2026-02-05 18:48 UTC (permalink / raw)
  To: Cliff Burdick; +Cc: dev, anatoly.burakov

On Wed, 4 Feb 2026 15:50:07 +0000
Cliff Burdick <cburdick@nvidia.com> wrote:

> Fixes since v3:
> * Fixed version in RTE_EXPORT_EXPERIMENTAL_SYMBOL
>
> Add support for kernel dmabuf feature and integrate it in the mlx5 driver.
> This feature is needed to support GPUDirect on newer kernels.
>
> I apologize for all the patches. Still trying to learn how to submit these.
>
> Cliff Burdick (2):
>   eal: support dmabuf
>   common/mlx5: support dmabuf
>
>  .mailmap | 1 +
>  doc/guides/rel_notes/release_26_03.rst | 6 +
>  drivers/common/mlx5/linux/meson.build | 2 +
>  drivers/common/mlx5/linux/mlx5_common_verbs.c | 48 ++++-
>  drivers/common/mlx5/linux/mlx5_glue.c | 19 ++
>  drivers/common/mlx5/linux/mlx5_glue.h | 3 +
>  drivers/common/mlx5/mlx5_common.c | 42 ++++-
>  drivers/common/mlx5/mlx5_common_mr.c | 113 +++++++++++-
>  drivers/common/mlx5/mlx5_common_mr.h | 17 +-
>  drivers/common/mlx5/windows/mlx5_common_os.c | 8 +-
>  drivers/crypto/mlx5/mlx5_crypto.h | 1 +
>  drivers/crypto/mlx5/mlx5_crypto_gcm.c | 3 +-
>  lib/eal/common/eal_common_memory.c | 165 +++++++++++++++++-
>  lib/eal/common/eal_memalloc.h | 21 +++
>  lib/eal/common/malloc_heap.c | 27 +++
>  lib/eal/common/malloc_heap.h | 5 +
>  lib/eal/include/rte_memory.h | 145 +++++++++++++++
>  17 files changed, 612 insertions(+), 14 deletions(-)

Any new library like this needs standalone tests so that it gets covered
in CI etc.

^ permalink raw reply	[flat|nested] 27+ messages in thread
* RE: [PATCH v4 0/2] support dmabuf
  2026-02-05 18:48 ` [PATCH v4 0/2] " Stephen Hemminger
@ 2026-02-05 20:25   ` Cliff Burdick
  2026-02-05 22:50     ` Stephen Hemminger
  0 siblings, 1 reply; 27+ messages in thread
From: Cliff Burdick @ 2026-02-05 20:25 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: dev@dpdk.org, anatoly.burakov@intel.com

> -----Original Message-----
> From: Stephen Hemminger <stephen@networkplumber.org>
> Sent: Thursday, February 5, 2026 10:49 AM
> To: Cliff Burdick <cburdick@nvidia.com>
> Cc: dev@dpdk.org; anatoly.burakov@intel.com
> Subject: Re: [PATCH v4 0/2] support dmabuf
>
> External email: Use caution opening links or attachments
>
> On Wed, 4 Feb 2026 15:50:07 +0000
> Cliff Burdick <cburdick@nvidia.com> wrote:
>
> > Fixes since v3:
> > * Fixed version in RTE_EXPORT_EXPERIMENTAL_SYMBOL
> >
> > Add support for kernel dmabuf feature and integrate it in the mlx5 driver.
> > This feature is needed to support GPUDirect on newer kernels.
> >
> > I apologize for all the patches. Still trying to learn how to submit these.
> >
> > Cliff Burdick (2):
> >   eal: support dmabuf
> >   common/mlx5: support dmabuf
> >
> >  .mailmap | 1 +
> >  doc/guides/rel_notes/release_26_03.rst | 6 +
> >  drivers/common/mlx5/linux/meson.build | 2 +
> >  drivers/common/mlx5/linux/mlx5_common_verbs.c | 48 ++++-
> >  drivers/common/mlx5/linux/mlx5_glue.c | 19 ++
> >  drivers/common/mlx5/linux/mlx5_glue.h | 3 +
> >  drivers/common/mlx5/mlx5_common.c | 42 ++++-
> >  drivers/common/mlx5/mlx5_common_mr.c | 113 +++++++++++-
> >  drivers/common/mlx5/mlx5_common_mr.h | 17 +-
> >  drivers/common/mlx5/windows/mlx5_common_os.c | 8 +-
> >  drivers/crypto/mlx5/mlx5_crypto.h | 1 +
> >  drivers/crypto/mlx5/mlx5_crypto_gcm.c | 3 +-
> >  lib/eal/common/eal_common_memory.c | 165 +++++++++++++++++-
> >  lib/eal/common/eal_memalloc.h | 21 +++
> >  lib/eal/common/malloc_heap.c | 27 +++
> >  lib/eal/common/malloc_heap.h | 5 +
> >  lib/eal/include/rte_memory.h | 145 +++++++++++++++
> >  17 files changed, 612 insertions(+), 14 deletions(-)
> >
> Any new library like this needs standalone tests so that it gets covered in CI etc.

I did not see any existing GPU tests, and this would require a GPU to test
along with a 5.15+ Linux kernel. Do those systems exist in the test
infrastructure?

^ permalink raw reply	[flat|nested] 27+ messages in thread
* Re: [PATCH v4 0/2] support dmabuf
  2026-02-05 20:25 ` Cliff Burdick
@ 2026-02-05 22:50   ` Stephen Hemminger
  0 siblings, 0 replies; 27+ messages in thread
From: Stephen Hemminger @ 2026-02-05 22:50 UTC (permalink / raw)
  To: Cliff Burdick; +Cc: dev@dpdk.org, anatoly.burakov@intel.com

On Thu, 5 Feb 2026 20:25:01 +0000
Cliff Burdick <cburdick@nvidia.com> wrote:

> > Any new library like this needs standalone tests so that it gets covered in CI etc.
>
> I did not see any existing GPU tests, and this would require a GPU to test along with a 5.15+ Linux kernel. Do those systems exist in the test infrastructure?

That was a mistake in allowing them in.
The problem was GPU requires CUDA, but that could have been handled.

^ permalink raw reply	[flat|nested] 27+ messages in thread
* Re: [PATCH v4 0/2] support dmabuf
  2026-02-04 15:50 ` [PATCH v4 0/2] " Cliff Burdick
                     ` (2 preceding siblings ...)
  2026-02-05 18:48 ` [PATCH v4 0/2] " Stephen Hemminger
@ 2026-03-31  3:15 ` Stephen Hemminger
  3 siblings, 0 replies; 27+ messages in thread
From: Stephen Hemminger @ 2026-03-31  3:15 UTC (permalink / raw)
  To: Cliff Burdick; +Cc: dev, anatoly.burakov

On Wed, 4 Feb 2026 15:50:07 +0000
Cliff Burdick <cburdick@nvidia.com> wrote:

> Fixes since v3:
> * Fixed version in RTE_EXPORT_EXPERIMENTAL_SYMBOL
>
> Add support for kernel dmabuf feature and integrate it in the mlx5 driver.
> This feature is needed to support GPUDirect on newer kernels.
>
> I apologize for all the patches. Still trying to learn how to submit these.
>
> Cliff Burdick (2):
>   eal: support dmabuf
>   common/mlx5: support dmabuf
>
>  .mailmap | 1 +
>  doc/guides/rel_notes/release_26_03.rst | 6 +
>  drivers/common/mlx5/linux/meson.build | 2 +
>  drivers/common/mlx5/linux/mlx5_common_verbs.c | 48 ++++-
>  drivers/common/mlx5/linux/mlx5_glue.c | 19 ++
>  drivers/common/mlx5/linux/mlx5_glue.h | 3 +
>  drivers/common/mlx5/mlx5_common.c | 42 ++++-
>  drivers/common/mlx5/mlx5_common_mr.c | 113 +++++++++++-
>  drivers/common/mlx5/mlx5_common_mr.h | 17 +-
>  drivers/common/mlx5/windows/mlx5_common_os.c | 8 +-
>  drivers/crypto/mlx5/mlx5_crypto.h | 1 +
>  drivers/crypto/mlx5/mlx5_crypto_gcm.c | 3 +-
>  lib/eal/common/eal_common_memory.c | 165 +++++++++++++++++-
>  lib/eal/common/eal_memalloc.h | 21 +++
>  lib/eal/common/malloc_heap.c | 27 +++
>  lib/eal/common/malloc_heap.h | 5 +
>  lib/eal/include/rte_memory.h | 145 +++++++++++++++
>  17 files changed, 612 insertions(+), 14 deletions(-)

I don't think anyone looked at the details here.
If they had, they would have seen the same things the AI did.

Review:
[PATCH v4 1/2] eal: support dmabuf
[PATCH v4 2/2] common/mlx5: support dmabuf

Good approach using a side-table to avoid ABI changes to rte_memseg_list.
The dmabuf support for mlx5 is well-structured with proper compile-time
gating via HAVE_IBV_REG_DMABUF_MR.

Patch 1/2 - eal: support dmabuf

Error: rte_extmem_register is broken by rte_extmem_register_dmabuf.

rte_extmem_register now calls rte_extmem_register_dmabuf with fd=-1.
But rte_extmem_register_dmabuf rejects dmabuf_fd < 0:

    int
    rte_extmem_register_dmabuf(void *va_addr, size_t len, int dmabuf_fd,
                               uint64_t dmabuf_offset, rte_iova_t iova_addrs[],
                               unsigned int n_pages, size_t page_sz)
    {
        if (dmabuf_fd < 0) {
            rte_errno = EINVAL;
            return -1;
        }
        ...

So every call to rte_extmem_register will now fail with EINVAL.
rte_extmem_register should call extmem_register directly (with fd=-1),
not rte_extmem_register_dmabuf:

    int
    rte_extmem_register(void *va_addr, size_t len, rte_iova_t iova_addrs[],
                        unsigned int n_pages, size_t page_sz)
    {
        return extmem_register(va_addr, len, -1, 0, iova_addrs, n_pages,
                               page_sz);
    }

Error: Input validation from rte_extmem_register is lost.

The original rte_extmem_register validated va_addr, page_sz, len,
alignment, and n_pages before doing anything. The new extmem_register
function has no input validation -- it only has the locking and memseg
creation logic. The parameter checks that were in the original function
(va_addr == NULL, page_sz == 0, len == 0, power-of-2, alignment) need to
be in extmem_register or duplicated in both rte_extmem_register and
rte_extmem_register_dmabuf.

Error: eal_memseg_list_init uses type_msl_idx to index dmabuf_info but
this is not the same as the memseg list index in mcfg->memsegs.

eal_memseg_list_init initializes dmabuf_info[type_msl_idx], but
type_msl_idx is a per-type index used for naming, not the global memseg
list position. When malloc_heap_create_external_seg_dmabuf later sets
dmabuf_info[msl_idx] where msl_idx = msl - mcfg->memsegs, these are
different index spaces. The init in eal_memseg_list_init may write to
wrong slots, and external segments may use indices that were never
initialized by eal_memseg_list_init.
The static dmabuf_info array is zero-initialized at program start (fd=0,
offset=0), but fd=0 is a valid file descriptor (stdin), so fd should be
initialized to -1 for all entries, or the eal_memseg_list_init
initialization should be removed and initialization done only in
malloc_heap_create_external_seg and
malloc_heap_create_external_seg_dmabuf.

Warning: malloc_heap_create_external_seg now redundantly initializes
dmabuf_info.

malloc_heap_create_external_seg calls
eal_memseg_list_set_dmabuf_info(i, -1, 0) at the end, but
malloc_heap_create_external_seg_dmabuf also calls
malloc_heap_create_external_seg (which sets fd=-1) and then immediately
overwrites with the actual fd. This works but is confusing -- the
redundant initialization in the base function was added for non-dmabuf
paths but makes the dmabuf path do a useless set-then-overwrite.

Warning: dmabuf_info is process-local static storage.

The side-table approach means dmabuf metadata is not shared with
secondary processes via shared memory. The commit message mentions
rte_extmem_attach for cross-process use, but secondary processes will
not have the dmabuf_info populated. If a secondary process calls
rte_memseg_list_get_dmabuf_fd, it will always get -1. This limitation
should be documented in the Doxygen for rte_extmem_register_dmabuf.

Warning: dmabuf fd lifetime / ownership is not documented.

The API does not specify whether DPDK takes ownership of the dmabuf_fd
(will it close it?) or whether the caller must keep it open. Since
ibv_reg_dmabuf_mr likely holds a reference through the kernel, the fd
can probably be closed after registration, but this should be explicitly
documented in the Doxygen. Also, rte_extmem_unregister does not clean up
the dmabuf_info entry (reset fd to -1), so stale metadata remains after
unregistration.

Warning: the internal functions in eal_memalloc.h have nearly identical
names to the public API functions in rte_memory.h but different
signatures.
eal_memalloc.h declares:

    int eal_memseg_list_get_dmabuf_fd(int list_idx);
    int eal_memseg_list_get_dmabuf_offset(int list_idx, uint64_t *offset);

rte_memory.h declares:

    int rte_memseg_list_get_dmabuf_fd(const struct rte_memseg_list *msl);
    int rte_memseg_list_get_dmabuf_offset(const struct rte_memseg_list *msl, ...);

These are distinct functions, documented with the same descriptions
("Get dma-buf fd for a memseg list.", "Get dma-buf offset for a memseg
list.") but with names prefixed eal_ vs rte_. This is fine functionally,
but the internal and public function signatures should use the same
error convention. The internal eal_memseg_list_get_dmabuf_fd returns
-EINVAL on error, while the public rte_memseg_list_get_dmabuf_fd_unsafe
returns -1 with rte_errno. The public function does not call the
internal function at all -- it accesses dmabuf_info directly. The
internal functions in eal_memalloc.h appear unused.

Patch 2/2 - common/mlx5: support dmabuf

Warning: mlx5_os_set_reg_mr_cb signature change is an internal ABI break.

The function signature changes from 2 to 3 parameters. This is an
internal symbol so it's not a public ABI break, but all callers must be
updated atomically. The Windows and Linux implementations are both
updated, and the crypto caller is updated, so this appears complete.
Verify no other callers exist outside this patch.

Warning: duplicate dmabuf detection logic in mlx5_common.c and
mlx5_common_mr.c.

The pattern of checking msl->external, getting dmabuf_fd, getting
dmabuf_offset, and calculating the adjusted offset is repeated nearly
identically in mlx5_common_dev_dma_map and
mlx5_mr_mempool_register_primary. This should be factored into a helper
function to avoid the code duplication and ensure consistent behavior.

Info: mlx5_common_verbs_reg_dmabuf_mr passes addr as iova.

The call is:

    reg_dmabuf_mr_cb(pd, dmabuf_offset, len, addr, dmabuf_fd, ...)

where addr is the user-space virtual address.
For ibv_reg_dmabuf_mr, the iova parameter is the "IO virtual address the
device will use to access the region." Using the user-space VA as iova
means the MR maps virtual addresses 1:1. This is the common pattern for
DPDK but worth noting that it won't work if the device expects different
IO addresses.

Reviewed-by: Stephen Hemminger <stephen@networkplumber.org>

^ permalink raw reply	[flat|nested] 27+ messages in thread
end of thread, other threads:[~2026-03-31  3:16 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz  follow: Atom feed
-- links below jump to the message on this page --
2026-01-27 17:44 [PATCH 0/2] support dmabuf Cliff Burdick
2026-01-27 17:44 ` [PATCH 1/2] eal: " Cliff Burdick
2026-01-29  1:48   ` Stephen Hemminger
2026-01-29  1:51   ` Stephen Hemminger
2026-01-27 17:44 ` [PATCH 2/2] common/mlx5: " Cliff Burdick
2026-01-27 19:21   ` [REVIEW] " Stephen Hemminger
2026-01-28 14:30     ` David Marchand
2026-01-28 17:10       ` Stephen Hemminger
2026-01-28 17:43         ` Stephen Hemminger
2026-02-03 17:34     ` Cliff Burdick
2026-01-29  1:51   ` [PATCH 2/2] " Stephen Hemminger
2026-01-28  0:04 ` [PATCH 0/2] " Stephen Hemminger
2026-02-03 17:18   ` Cliff Burdick
2026-02-03 22:26 ` [PATCH v2 " Cliff Burdick
2026-02-03 22:26   ` [PATCH v2 1/2] eal: " Cliff Burdick
2026-02-03 22:26   ` [PATCH v2 2/2] common/mlx5: " Cliff Burdick
2026-02-03 23:02 ` [PATCH v3 0/2] " Cliff Burdick
2026-02-03 23:02   ` [PATCH v3 1/2] eal: " Cliff Burdick
2026-02-03 23:02   ` [PATCH v3 2/2] common/mlx5: " Cliff Burdick
2026-02-04 15:50 ` [PATCH v4 0/2] " Cliff Burdick
2026-02-04 15:50   ` [PATCH v4 1/2] eal: " Cliff Burdick
2026-02-12 13:57     ` Burakov, Anatoly
2026-02-04 15:50   ` [PATCH v4 2/2] common/mlx5: " Cliff Burdick
2026-02-05 18:48   ` [PATCH v4 0/2] " Stephen Hemminger
2026-02-05 20:25     ` Cliff Burdick
2026-02-05 22:50       ` Stephen Hemminger
2026-03-31  3:15   ` Stephen Hemminger