* [PATCH for-next v2 0/5] Introduce Completion Counters
@ 2026-04-16 21:23 Michael Margolin
2026-04-16 21:23 ` [PATCH for-next v2 1/5] RDMA/core: Add Completion Counters support Michael Margolin
` (4 more replies)
0 siblings, 5 replies; 14+ messages in thread
From: Michael Margolin @ 2026-04-16 21:23 UTC (permalink / raw)
To: jgg, leon, linux-rdma; +Cc: sleybo, matua, gal.pressman
Add core infrastructure for Completion Counters, a lightweight
alternative to polling a CQ for tracking operation completions. The
related rdma-core interface proposal is linked in [1].
Define the UVERBS_OBJECT_COMP_CNTR ioctl object with create, destroy,
set, inc and read methods for both success and error counters. Add a
QP attach method on the QP object to associate a completion counter
with a queue pair.
Completion Counters can be backed by user-provided VA or dmabuf or by
internal device/driver memory. Common command infrastructure allows any
of the implementations to support the various device capabilities.
Add EFA Completion Counters support as first implementer.
[1] https://github.com/linux-rdma/rdma-core/pull/1701
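As a rough illustration of the intended set/inc/read semantics, here is a
small self-contained userspace model. The names and the wraparound behavior
are assumptions for illustration only (a device could equally saturate at
its reported max value); this is not the rdma-core API from [1].

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model: one success and one error entry per counter,
 * each with a device-reported maximum value (cf. the
 * RESP_COUNT_MAX_VALUE / RESP_ERR_COUNT_MAX_VALUE attributes). */
enum cntr_entry { ENTRY_COMP, ENTRY_ERR };

struct comp_cntr_model {
	uint64_t val[2];
	uint64_t max[2];	/* assumed < UINT64_MAX so max + 1 is safe */
};

/* Mirrors UVERBS_METHOD_COMP_CNTR_SET: store a value, wrapping at max. */
static void cntr_set(struct comp_cntr_model *c, enum cntr_entry e, uint64_t v)
{
	c->val[e] = v % (c->max[e] + 1);
}

/* Mirrors UVERBS_METHOD_COMP_CNTR_INC: add an amount, wrapping at max. */
static void cntr_inc(struct comp_cntr_model *c, enum cntr_entry e, uint64_t amount)
{
	c->val[e] = (c->val[e] + amount) % (c->max[e] + 1);
}

/* Mirrors UVERBS_METHOD_COMP_CNTR_READ: return the current value. */
static uint64_t cntr_read(const struct comp_cntr_model *c, enum cntr_entry e)
{
	return c->val[e];
}
```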
---
Changes in v2:
- Unified set, inc and read flows for success and error completion
  counters
- Added comp_cntr usage count
- Minor cleanups
- Link to v1: https://lore.kernel.org/all/20260407115424.13359-1-mrgolin@amazon.com/
Michael Margolin (5):
RDMA/core: Add Completion Counters support
RDMA/core: Prevent destroying in-use completion counters
RDMA/core: Add Completion Counters to resource tracking
RDMA/efa: Update device interface
RDMA/efa: Add Completion Counters support
drivers/infiniband/core/Makefile | 1 +
drivers/infiniband/core/device.c | 7 +
drivers/infiniband/core/nldev.c | 1 +
drivers/infiniband/core/rdma_core.h | 1 +
drivers/infiniband/core/restrack.c | 2 +
drivers/infiniband/core/uverbs_cmd.c | 1 +
.../core/uverbs_std_types_comp_cntr.c | 299 ++++++++++++++++++
drivers/infiniband/core/uverbs_std_types_qp.c | 65 +++-
drivers/infiniband/core/uverbs_uapi.c | 1 +
drivers/infiniband/core/verbs.c | 1 +
drivers/infiniband/hw/efa/efa.h | 13 +
.../infiniband/hw/efa/efa_admin_cmds_defs.h | 185 ++++++++++-
drivers/infiniband/hw/efa/efa_com_cmd.c | 106 +++++++
drivers/infiniband/hw/efa/efa_com_cmd.h | 36 +++
drivers/infiniband/hw/efa/efa_io_defs.h | 62 +++-
drivers/infiniband/hw/efa/efa_main.c | 6 +
drivers/infiniband/hw/efa/efa_verbs.c | 171 ++++++++++
include/rdma/ib_verbs.h | 41 +++
include/rdma/restrack.h | 4 +
include/uapi/rdma/efa-abi.h | 1 +
include/uapi/rdma/ib_user_ioctl_cmds.h | 50 +++
include/uapi/rdma/ib_user_ioctl_verbs.h | 14 +
include/uapi/rdma/ib_user_verbs.h | 2 +-
23 files changed, 1063 insertions(+), 7 deletions(-)
create mode 100644 drivers/infiniband/core/uverbs_std_types_comp_cntr.c
--
2.47.3
^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH for-next v2 1/5] RDMA/core: Add Completion Counters support
  2026-04-16 21:23 [PATCH for-next v2 0/5] Introduce Completion Counters Michael Margolin
@ 2026-04-16 21:23 ` Michael Margolin
  2026-04-30  0:50   ` Doug Ledford
  2026-04-16 21:23 ` [PATCH for-next v2 2/5] RDMA/core: Prevent destroying in-use completion counters Michael Margolin
  ` (3 subsequent siblings)
  4 siblings, 1 reply; 14+ messages in thread
From: Michael Margolin @ 2026-04-16 21:23 UTC (permalink / raw)
  To: jgg, leon, linux-rdma; +Cc: sleybo, matua, gal.pressman, Yonatan Nachum

Add core infrastructure for Completion Counters, a light-weight
alternative to polling CQ for tracking operation completions.

Define the UVERBS_OBJECT_COMP_CNTR ioctl object with create, destroy,
set, inc and read methods for both success and error counters. Add a
QP attach method on the QP object to associate a completion counter
with a queue pair.

The create handler constructs umem from user-provided VA or dmabuf for
each counter, following the CQ buffer pattern. Set, inc and read
handlers pass through to driver callbacks. The QP attach handler
validates the operation mask flags and delegates to the driver.

Add ib_comp_cntr struct, ib_comp_cntr_attach_attr, device ops, and
DECLARE_RDMA_OBJ_SIZE for driver object allocation.

Only userspace Completion Counters are supported at this stage.
Reviewed-by: Yonatan Nachum <ynachum@amazon.com> Signed-off-by: Michael Margolin <mrgolin@amazon.com> --- drivers/infiniband/core/Makefile | 1 + drivers/infiniband/core/device.c | 7 + drivers/infiniband/core/rdma_core.h | 1 + drivers/infiniband/core/uverbs_cmd.c | 1 + .../core/uverbs_std_types_comp_cntr.c | 290 ++++++++++++++++++ drivers/infiniband/core/uverbs_std_types_qp.c | 45 ++- drivers/infiniband/core/uverbs_uapi.c | 1 + include/rdma/ib_verbs.h | 37 +++ include/uapi/rdma/ib_user_ioctl_cmds.h | 50 +++ include/uapi/rdma/ib_user_ioctl_verbs.h | 14 + include/uapi/rdma/ib_user_verbs.h | 2 +- 11 files changed, 447 insertions(+), 2 deletions(-) create mode 100644 drivers/infiniband/core/uverbs_std_types_comp_cntr.c diff --git a/drivers/infiniband/core/Makefile b/drivers/infiniband/core/Makefile index dce798d8cfe6..4767339608a1 100644 --- a/drivers/infiniband/core/Makefile +++ b/drivers/infiniband/core/Makefile @@ -35,6 +35,7 @@ ib_umad-y := user_mad.o ib_uverbs-y := uverbs_main.o uverbs_cmd.o uverbs_marshall.o \ rdma_core.o uverbs_std_types.o uverbs_ioctl.o \ uverbs_std_types_cq.o \ + uverbs_std_types_comp_cntr.o \ uverbs_std_types_dmabuf.o \ uverbs_std_types_dmah.o \ uverbs_std_types_flow_action.o uverbs_std_types_dm.o \ diff --git a/drivers/infiniband/core/device.c b/drivers/infiniband/core/device.c index 4c174f7f1070..60c41fc1aa4d 100644 --- a/drivers/infiniband/core/device.c +++ b/drivers/infiniband/core/device.c @@ -2733,6 +2733,7 @@ void ib_set_device_ops(struct ib_device *dev, const struct ib_device_ops *ops) SET_DEVICE_OP(dev_ops, create_ah); SET_DEVICE_OP(dev_ops, create_counters); SET_DEVICE_OP(dev_ops, create_cq); + SET_DEVICE_OP(dev_ops, create_comp_cntr); SET_DEVICE_OP(dev_ops, create_user_cq); SET_DEVICE_OP(dev_ops, create_flow); SET_DEVICE_OP(dev_ops, create_qp); @@ -2753,6 +2754,7 @@ void ib_set_device_ops(struct ib_device *dev, const struct ib_device_ops *ops) SET_DEVICE_OP(dev_ops, destroy_ah); SET_DEVICE_OP(dev_ops, destroy_counters); 
SET_DEVICE_OP(dev_ops, destroy_cq); + SET_DEVICE_OP(dev_ops, destroy_comp_cntr); SET_DEVICE_OP(dev_ops, destroy_flow); SET_DEVICE_OP(dev_ops, destroy_flow_action); SET_DEVICE_OP(dev_ops, destroy_qp); @@ -2804,6 +2806,8 @@ void ib_set_device_ops(struct ib_device *dev, const struct ib_device_ops *ops) SET_DEVICE_OP(dev_ops, modify_hw_stat); SET_DEVICE_OP(dev_ops, modify_port); SET_DEVICE_OP(dev_ops, modify_qp); + SET_DEVICE_OP(dev_ops, inc_comp_cntr); + SET_DEVICE_OP(dev_ops, qp_attach_comp_cntr); SET_DEVICE_OP(dev_ops, modify_srq); SET_DEVICE_OP(dev_ops, modify_wq); SET_DEVICE_OP(dev_ops, peek_cq); @@ -2827,12 +2831,14 @@ void ib_set_device_ops(struct ib_device *dev, const struct ib_device_ops *ops) SET_DEVICE_OP(dev_ops, query_ucontext); SET_DEVICE_OP(dev_ops, rdma_netdev_get_params); SET_DEVICE_OP(dev_ops, read_counters); + SET_DEVICE_OP(dev_ops, read_comp_cntr); SET_DEVICE_OP(dev_ops, reg_dm_mr); SET_DEVICE_OP(dev_ops, reg_user_mr); SET_DEVICE_OP(dev_ops, reg_user_mr_dmabuf); SET_DEVICE_OP(dev_ops, req_notify_cq); SET_DEVICE_OP(dev_ops, rereg_user_mr); SET_DEVICE_OP(dev_ops, resize_user_cq); + SET_DEVICE_OP(dev_ops, set_comp_cntr); SET_DEVICE_OP(dev_ops, set_vf_guid); SET_DEVICE_OP(dev_ops, set_vf_link_state); SET_DEVICE_OP(dev_ops, ufile_hw_cleanup); @@ -2841,6 +2847,7 @@ void ib_set_device_ops(struct ib_device *dev, const struct ib_device_ops *ops) SET_OBJ_SIZE(dev_ops, ib_ah); SET_OBJ_SIZE(dev_ops, ib_counters); SET_OBJ_SIZE(dev_ops, ib_cq); + SET_OBJ_SIZE(dev_ops, ib_comp_cntr); SET_OBJ_SIZE(dev_ops, ib_dmah); SET_OBJ_SIZE(dev_ops, ib_mw); SET_OBJ_SIZE(dev_ops, ib_pd); diff --git a/drivers/infiniband/core/rdma_core.h b/drivers/infiniband/core/rdma_core.h index 269b393799ab..2569550e4c6d 100644 --- a/drivers/infiniband/core/rdma_core.h +++ b/drivers/infiniband/core/rdma_core.h @@ -156,6 +156,7 @@ uverbs_api_ioctl_handler_fn uverbs_get_handler_fn(struct ib_udata *udata); extern const struct uapi_definition uverbs_def_obj_async_fd[]; extern const struct 
uapi_definition uverbs_def_obj_counters[]; +extern const struct uapi_definition uverbs_def_obj_comp_cntr[]; extern const struct uapi_definition uverbs_def_obj_cq[]; extern const struct uapi_definition uverbs_def_obj_device[]; extern const struct uapi_definition uverbs_def_obj_dm[]; diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c index a768436ba468..4bc493b3b624 100644 --- a/drivers/infiniband/core/uverbs_cmd.c +++ b/drivers/infiniband/core/uverbs_cmd.c @@ -3673,6 +3673,7 @@ static int ib_uverbs_ex_query_device(struct uverbs_attr_bundle *attrs) resp.cq_moderation_caps.max_cq_moderation_period = attr.cq_caps.max_cq_moderation_period; resp.max_dm_size = attr.max_dm_size; + resp.max_comp_cntr = attr.max_comp_cntr; resp.response_length = uverbs_response_length(attrs, sizeof(resp)); return uverbs_response(attrs, &resp, sizeof(resp)); diff --git a/drivers/infiniband/core/uverbs_std_types_comp_cntr.c b/drivers/infiniband/core/uverbs_std_types_comp_cntr.c new file mode 100644 index 000000000000..7651a565bb9f --- /dev/null +++ b/drivers/infiniband/core/uverbs_std_types_comp_cntr.c @@ -0,0 +1,290 @@ +// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB +/* + * Copyright Amazon.com, Inc. or its affiliates. All rights reserved. 
+ */ + +#include <rdma/uverbs_std_types.h> +#include <rdma/ib_umem.h> +#include <rdma/ib_umem_dmabuf.h> +#include "rdma_core.h" +#include "uverbs.h" + +static int uverbs_free_comp_cntr(struct ib_uobject *uobject, enum rdma_remove_reason why, + struct uverbs_attr_bundle *attrs) +{ + struct ib_comp_cntr *cc = uobject->object; + int ret; + + ret = cc->device->ops.destroy_comp_cntr(cc); + if (ret) + return ret; + + ib_umem_release(cc->comp_umem); + ib_umem_release(cc->err_umem); + kfree(cc); + return 0; +} + +static int comp_cntr_get_umem(struct ib_device *ib_dev, struct uverbs_attr_bundle *attrs, + int va_attr, int fd_attr, int offset_attr, struct ib_umem **umem_out) +{ + struct ib_umem_dmabuf *umem_dmabuf; + u64 buffer_offset; + u64 buffer_va; + int buffer_fd; + int ret; + + *umem_out = NULL; + + if (uverbs_attr_is_valid(attrs, va_attr)) { + if (uverbs_attr_is_valid(attrs, fd_attr) || + uverbs_attr_is_valid(attrs, offset_attr)) + return -EINVAL; + + ret = uverbs_copy_from(&buffer_va, attrs, va_attr); + if (ret) + return ret; + + *umem_out = ib_umem_get(ib_dev, buffer_va, sizeof(u64), IB_ACCESS_LOCAL_WRITE); + if (IS_ERR(*umem_out)) { + ret = PTR_ERR(*umem_out); + *umem_out = NULL; + return ret; + } + } else if (uverbs_attr_is_valid(attrs, fd_attr)) { + if (uverbs_attr_is_valid(attrs, va_attr)) + return -EINVAL; + + ret = uverbs_get_raw_fd(&buffer_fd, attrs, fd_attr); + if (ret) + return ret; + + ret = uverbs_copy_from(&buffer_offset, attrs, offset_attr); + if (ret) + return ret; + + umem_dmabuf = ib_umem_dmabuf_get_pinned(ib_dev, buffer_offset, sizeof(u64), + buffer_fd, IB_ACCESS_LOCAL_WRITE); + if (IS_ERR(umem_dmabuf)) + return PTR_ERR(umem_dmabuf); + + *umem_out = &umem_dmabuf->umem; + } + + return 0; +} + +static int UVERBS_HANDLER(UVERBS_METHOD_COMP_CNTR_CREATE)(struct uverbs_attr_bundle *attrs) +{ + struct ib_uobject *uobj = uverbs_attr_get_uobject(attrs, + UVERBS_ATTR_CREATE_COMP_CNTR_HANDLE); + struct ib_device *ib_dev = attrs->context->device; + struct 
ib_comp_cntr *cc; + int ret; + + if (!ib_dev->ops.create_comp_cntr || + !ib_dev->ops.destroy_comp_cntr || + !ib_dev->ops.qp_attach_comp_cntr) + return -EOPNOTSUPP; + + cc = rdma_zalloc_drv_obj(ib_dev, ib_comp_cntr); + if (!cc) + return -ENOMEM; + + cc->device = ib_dev; + cc->uobject = uobj; + + ret = comp_cntr_get_umem(ib_dev, attrs, + UVERBS_ATTR_CREATE_COMP_CNTR_BUFFER_VA, + UVERBS_ATTR_CREATE_COMP_CNTR_BUFFER_FD, + UVERBS_ATTR_CREATE_COMP_CNTR_BUFFER_OFFSET, + &cc->comp_umem); + if (ret) + goto err_free; + + ret = comp_cntr_get_umem(ib_dev, attrs, + UVERBS_ATTR_CREATE_COMP_CNTR_ERR_BUFFER_VA, + UVERBS_ATTR_CREATE_COMP_CNTR_ERR_BUFFER_FD, + UVERBS_ATTR_CREATE_COMP_CNTR_ERR_BUFFER_OFFSET, + &cc->err_umem); + if (ret) + goto err_comp_umem; + + ret = ib_dev->ops.create_comp_cntr(cc, attrs); + if (ret) + goto err_err_umem; + + uobj->object = cc; + uverbs_finalize_uobj_create(attrs, UVERBS_ATTR_CREATE_COMP_CNTR_HANDLE); + + ret = uverbs_copy_to(attrs, UVERBS_ATTR_CREATE_COMP_CNTR_RESP_COUNT_MAX_VALUE, + &cc->comp_count_max_value, sizeof(cc->comp_count_max_value)); + if (ret) + return ret; + + ret = uverbs_copy_to(attrs, UVERBS_ATTR_CREATE_COMP_CNTR_RESP_ERR_COUNT_MAX_VALUE, + &cc->err_count_max_value, sizeof(cc->err_count_max_value)); + return ret; + +err_err_umem: + ib_umem_release(cc->err_umem); +err_comp_umem: + ib_umem_release(cc->comp_umem); +err_free: + kfree(cc); + return ret; +} + +static int UVERBS_HANDLER(UVERBS_METHOD_COMP_CNTR_SET)(struct uverbs_attr_bundle *attrs) +{ + struct ib_comp_cntr *cc = uverbs_attr_get_obj(attrs, UVERBS_ATTR_SET_COMP_CNTR_HANDLE); + enum ib_comp_cntr_entry entry; + u64 value; + int ret; + + if (!cc->device->ops.set_comp_cntr) + return -EOPNOTSUPP; + + ret = uverbs_get_const(&entry, attrs, UVERBS_ATTR_SET_COMP_CNTR_ENTRY); + if (ret) + return ret; + + ret = uverbs_copy_from(&value, attrs, UVERBS_ATTR_SET_COMP_CNTR_VALUE); + if (ret) + return ret; + + return cc->device->ops.set_comp_cntr(cc, entry, value); +} + +static int 
UVERBS_HANDLER(UVERBS_METHOD_COMP_CNTR_INC)(struct uverbs_attr_bundle *attrs) +{ + struct ib_comp_cntr *cc = uverbs_attr_get_obj(attrs, UVERBS_ATTR_INC_COMP_CNTR_HANDLE); + enum ib_comp_cntr_entry entry; + u64 amount; + int ret; + + if (!cc->device->ops.inc_comp_cntr) + return -EOPNOTSUPP; + + ret = uverbs_get_const(&entry, attrs, UVERBS_ATTR_INC_COMP_CNTR_ENTRY); + if (ret) + return ret; + + ret = uverbs_copy_from(&amount, attrs, UVERBS_ATTR_INC_COMP_CNTR_VALUE); + if (ret) + return ret; + + return cc->device->ops.inc_comp_cntr(cc, entry, amount); +} + +static int UVERBS_HANDLER(UVERBS_METHOD_COMP_CNTR_READ)(struct uverbs_attr_bundle *attrs) +{ + struct ib_comp_cntr *cc = uverbs_attr_get_obj(attrs, UVERBS_ATTR_READ_COMP_CNTR_HANDLE); + enum ib_comp_cntr_entry entry; + u64 value; + int ret; + + if (!cc->device->ops.read_comp_cntr) + return -EOPNOTSUPP; + + ret = uverbs_get_const(&entry, attrs, UVERBS_ATTR_READ_COMP_CNTR_ENTRY); + if (ret) + return ret; + + ret = cc->device->ops.read_comp_cntr(cc, entry, &value); + if (ret) + return ret; + + return uverbs_copy_to(attrs, UVERBS_ATTR_READ_COMP_CNTR_RESP_VALUE, &value, sizeof(value)); +} + +DECLARE_UVERBS_NAMED_METHOD( + UVERBS_METHOD_COMP_CNTR_CREATE, + UVERBS_ATTR_IDR(UVERBS_ATTR_CREATE_COMP_CNTR_HANDLE, + UVERBS_OBJECT_COMP_CNTR, + UVERBS_ACCESS_NEW, + UA_MANDATORY), + UVERBS_ATTR_PTR_IN(UVERBS_ATTR_CREATE_COMP_CNTR_BUFFER_VA, + UVERBS_ATTR_TYPE(u64), + UA_OPTIONAL), + UVERBS_ATTR_RAW_FD(UVERBS_ATTR_CREATE_COMP_CNTR_BUFFER_FD, + UA_OPTIONAL), + UVERBS_ATTR_PTR_IN(UVERBS_ATTR_CREATE_COMP_CNTR_BUFFER_OFFSET, + UVERBS_ATTR_TYPE(u64), + UA_OPTIONAL), + UVERBS_ATTR_PTR_IN(UVERBS_ATTR_CREATE_COMP_CNTR_ERR_BUFFER_VA, + UVERBS_ATTR_TYPE(u64), + UA_OPTIONAL), + UVERBS_ATTR_RAW_FD(UVERBS_ATTR_CREATE_COMP_CNTR_ERR_BUFFER_FD, + UA_OPTIONAL), + UVERBS_ATTR_PTR_IN(UVERBS_ATTR_CREATE_COMP_CNTR_ERR_BUFFER_OFFSET, + UVERBS_ATTR_TYPE(u64), + UA_OPTIONAL), + UVERBS_ATTR_PTR_OUT(UVERBS_ATTR_CREATE_COMP_CNTR_RESP_COUNT_MAX_VALUE, + 
UVERBS_ATTR_TYPE(u64), + UA_MANDATORY), + UVERBS_ATTR_PTR_OUT(UVERBS_ATTR_CREATE_COMP_CNTR_RESP_ERR_COUNT_MAX_VALUE, + UVERBS_ATTR_TYPE(u64), + UA_MANDATORY), + UVERBS_ATTR_UHW()); + +DECLARE_UVERBS_NAMED_METHOD_DESTROY( + UVERBS_METHOD_COMP_CNTR_DESTROY, + UVERBS_ATTR_IDR(UVERBS_ATTR_DESTROY_COMP_CNTR_HANDLE, + UVERBS_OBJECT_COMP_CNTR, + UVERBS_ACCESS_DESTROY, + UA_MANDATORY)); + +DECLARE_UVERBS_NAMED_METHOD( + UVERBS_METHOD_COMP_CNTR_SET, + UVERBS_ATTR_IDR(UVERBS_ATTR_SET_COMP_CNTR_HANDLE, + UVERBS_OBJECT_COMP_CNTR, + UVERBS_ACCESS_WRITE, + UA_MANDATORY), + UVERBS_ATTR_CONST_IN(UVERBS_ATTR_SET_COMP_CNTR_ENTRY, + enum ib_uverbs_comp_cntr_entry, + UA_MANDATORY), + UVERBS_ATTR_PTR_IN(UVERBS_ATTR_SET_COMP_CNTR_VALUE, + UVERBS_ATTR_TYPE(u64), + UA_MANDATORY)); + +DECLARE_UVERBS_NAMED_METHOD( + UVERBS_METHOD_COMP_CNTR_INC, + UVERBS_ATTR_IDR(UVERBS_ATTR_INC_COMP_CNTR_HANDLE, + UVERBS_OBJECT_COMP_CNTR, + UVERBS_ACCESS_WRITE, + UA_MANDATORY), + UVERBS_ATTR_CONST_IN(UVERBS_ATTR_INC_COMP_CNTR_ENTRY, + enum ib_uverbs_comp_cntr_entry, + UA_MANDATORY), + UVERBS_ATTR_PTR_IN(UVERBS_ATTR_INC_COMP_CNTR_VALUE, + UVERBS_ATTR_TYPE(u64), + UA_MANDATORY)); + +DECLARE_UVERBS_NAMED_METHOD( + UVERBS_METHOD_COMP_CNTR_READ, + UVERBS_ATTR_IDR(UVERBS_ATTR_READ_COMP_CNTR_HANDLE, + UVERBS_OBJECT_COMP_CNTR, + UVERBS_ACCESS_READ, + UA_MANDATORY), + UVERBS_ATTR_CONST_IN(UVERBS_ATTR_READ_COMP_CNTR_ENTRY, + enum ib_uverbs_comp_cntr_entry, + UA_MANDATORY), + UVERBS_ATTR_PTR_OUT(UVERBS_ATTR_READ_COMP_CNTR_RESP_VALUE, + UVERBS_ATTR_TYPE(u64), + UA_MANDATORY)); + +DECLARE_UVERBS_NAMED_OBJECT( + UVERBS_OBJECT_COMP_CNTR, + UVERBS_TYPE_ALLOC_IDR(uverbs_free_comp_cntr), + &UVERBS_METHOD(UVERBS_METHOD_COMP_CNTR_CREATE), + &UVERBS_METHOD(UVERBS_METHOD_COMP_CNTR_DESTROY), + &UVERBS_METHOD(UVERBS_METHOD_COMP_CNTR_SET), + &UVERBS_METHOD(UVERBS_METHOD_COMP_CNTR_INC), + &UVERBS_METHOD(UVERBS_METHOD_COMP_CNTR_READ)); + +const struct uapi_definition uverbs_def_obj_comp_cntr[] = { + 
UAPI_DEF_CHAIN_OBJ_TREE_NAMED(UVERBS_OBJECT_COMP_CNTR, + UAPI_DEF_OBJ_NEEDS_FN(destroy_comp_cntr)), + {} +}; diff --git a/drivers/infiniband/core/uverbs_std_types_qp.c b/drivers/infiniband/core/uverbs_std_types_qp.c index be0730e8509e..2c607b02d9d5 100644 --- a/drivers/infiniband/core/uverbs_std_types_qp.c +++ b/drivers/infiniband/core/uverbs_std_types_qp.c @@ -367,11 +367,54 @@ DECLARE_UVERBS_NAMED_METHOD( UVERBS_ATTR_TYPE(struct ib_uverbs_destroy_qp_resp), UA_MANDATORY)); +static int UVERBS_HANDLER(UVERBS_METHOD_QP_ATTACH_COMP_CNTR)( + struct uverbs_attr_bundle *attrs) +{ + struct ib_uobject *qp_uobj = uverbs_attr_get_uobject( + attrs, UVERBS_ATTR_QP_ATTACH_COMP_CNTR_QP_HANDLE); + struct ib_comp_cntr *cc = uverbs_attr_get_obj( + attrs, UVERBS_ATTR_QP_ATTACH_COMP_CNTR_HANDLE); + struct ib_comp_cntr_attach_attr attr = {}; + struct ib_qp *qp = qp_uobj->object; + int ret; + + if (!cc->device->ops.qp_attach_comp_cntr) + return -EOPNOTSUPP; + + ret = uverbs_get_flags32(&attr.op_mask, attrs, + UVERBS_ATTR_QP_ATTACH_COMP_CNTR_OP_MASK, + IB_UVERBS_COMP_CNTR_ATTACH_OP_SEND | + IB_UVERBS_COMP_CNTR_ATTACH_OP_RECV | + IB_UVERBS_COMP_CNTR_ATTACH_OP_RDMA_READ | + IB_UVERBS_COMP_CNTR_ATTACH_OP_REMOTE_RDMA_READ | + IB_UVERBS_COMP_CNTR_ATTACH_OP_RDMA_WRITE | + IB_UVERBS_COMP_CNTR_ATTACH_OP_REMOTE_RDMA_WRITE); + if (ret) + return ret; + + return qp->device->ops.qp_attach_comp_cntr(qp, cc, &attr); +} + +DECLARE_UVERBS_NAMED_METHOD( + UVERBS_METHOD_QP_ATTACH_COMP_CNTR, + UVERBS_ATTR_IDR(UVERBS_ATTR_QP_ATTACH_COMP_CNTR_QP_HANDLE, + UVERBS_OBJECT_QP, + UVERBS_ACCESS_WRITE, + UA_MANDATORY), + UVERBS_ATTR_IDR(UVERBS_ATTR_QP_ATTACH_COMP_CNTR_HANDLE, + UVERBS_OBJECT_COMP_CNTR, + UVERBS_ACCESS_READ, + UA_MANDATORY), + UVERBS_ATTR_FLAGS_IN(UVERBS_ATTR_QP_ATTACH_COMP_CNTR_OP_MASK, + enum ib_uverbs_comp_cntr_attach_op, + UA_OPTIONAL)); + DECLARE_UVERBS_NAMED_OBJECT( UVERBS_OBJECT_QP, UVERBS_TYPE_ALLOC_IDR_SZ(sizeof(struct ib_uqp_object), uverbs_free_qp), 
&UVERBS_METHOD(UVERBS_METHOD_QP_CREATE), - &UVERBS_METHOD(UVERBS_METHOD_QP_DESTROY)); + &UVERBS_METHOD(UVERBS_METHOD_QP_DESTROY), + &UVERBS_METHOD(UVERBS_METHOD_QP_ATTACH_COMP_CNTR)); const struct uapi_definition uverbs_def_obj_qp[] = { UAPI_DEF_CHAIN_OBJ_TREE_NAMED(UVERBS_OBJECT_QP, diff --git a/drivers/infiniband/core/uverbs_uapi.c b/drivers/infiniband/core/uverbs_uapi.c index 31b248295854..a3f42a50a14f 100644 --- a/drivers/infiniband/core/uverbs_uapi.c +++ b/drivers/infiniband/core/uverbs_uapi.c @@ -628,6 +628,7 @@ void uverbs_destroy_api(struct uverbs_api *uapi) static const struct uapi_definition uverbs_core_api[] = { UAPI_DEF_CHAIN(uverbs_def_obj_async_fd), UAPI_DEF_CHAIN(uverbs_def_obj_counters), + UAPI_DEF_CHAIN(uverbs_def_obj_comp_cntr), UAPI_DEF_CHAIN(uverbs_def_obj_cq), UAPI_DEF_CHAIN(uverbs_def_obj_device), UAPI_DEF_CHAIN(uverbs_def_obj_dm), diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h index 9dd76f489a0b..b0db80447bf0 100644 --- a/include/rdma/ib_verbs.h +++ b/include/rdma/ib_verbs.h @@ -453,6 +453,7 @@ struct ib_device_attr { u64 max_dm_size; /* Max entries for sgl for optimized performance per READ */ u32 max_sgl_rd; + u32 max_comp_cntr; }; enum ib_mtu { @@ -1746,6 +1747,33 @@ struct ib_cq { struct rdma_restrack_entry res; }; +struct ib_comp_cntr { + struct ib_device *device; + struct ib_uobject *uobject; + struct ib_umem *comp_umem; + struct ib_umem *err_umem; + u64 comp_count_max_value; + u64 err_count_max_value; +}; + +enum ib_comp_cntr_entry { + IB_COMP_CNTR_ENTRY_COMP = IB_UVERBS_COMP_CNTR_ENTRY_COMP, + IB_COMP_CNTR_ENTRY_ERR = IB_UVERBS_COMP_CNTR_ENTRY_ERR, +}; + +enum ib_comp_cntr_attach_op { + IB_COMP_CNTR_ATTACH_OP_SEND = IB_UVERBS_COMP_CNTR_ATTACH_OP_SEND, + IB_COMP_CNTR_ATTACH_OP_RECV = IB_UVERBS_COMP_CNTR_ATTACH_OP_RECV, + IB_COMP_CNTR_ATTACH_OP_RDMA_READ = IB_UVERBS_COMP_CNTR_ATTACH_OP_RDMA_READ, + IB_COMP_CNTR_ATTACH_OP_REMOTE_RDMA_READ = IB_UVERBS_COMP_CNTR_ATTACH_OP_REMOTE_RDMA_READ, + 
IB_COMP_CNTR_ATTACH_OP_RDMA_WRITE = IB_UVERBS_COMP_CNTR_ATTACH_OP_RDMA_WRITE, + IB_COMP_CNTR_ATTACH_OP_REMOTE_RDMA_WRITE = IB_UVERBS_COMP_CNTR_ATTACH_OP_REMOTE_RDMA_WRITE, +}; + +struct ib_comp_cntr_attach_attr { + u32 op_mask; +}; + struct ib_srq { struct ib_device *device; struct ib_pd *pd; @@ -2624,6 +2652,8 @@ struct ib_device_ops { struct ib_udata *udata); int (*modify_qp)(struct ib_qp *qp, struct ib_qp_attr *qp_attr, int qp_attr_mask, struct ib_udata *udata); + int (*qp_attach_comp_cntr)(struct ib_qp *qp, struct ib_comp_cntr *cc, + struct ib_comp_cntr_attach_attr *attr); int (*query_qp)(struct ib_qp *qp, struct ib_qp_attr *qp_attr, int qp_attr_mask, struct ib_qp_init_attr *qp_init_attr); int (*destroy_qp)(struct ib_qp *qp, struct ib_udata *udata); @@ -2645,6 +2675,12 @@ struct ib_device_ops { * post_destroy_cq - Free all kernel resources */ void (*post_destroy_cq)(struct ib_cq *cq); + int (*create_comp_cntr)(struct ib_comp_cntr *cc, + struct uverbs_attr_bundle *attrs); + int (*destroy_comp_cntr)(struct ib_comp_cntr *cc); + int (*set_comp_cntr)(struct ib_comp_cntr *cc, enum ib_comp_cntr_entry entry, u64 value); + int (*inc_comp_cntr)(struct ib_comp_cntr *cc, enum ib_comp_cntr_entry entry, u64 amount); + int (*read_comp_cntr)(struct ib_comp_cntr *cc, enum ib_comp_cntr_entry entry, u64 *value); struct ib_mr *(*get_dma_mr)(struct ib_pd *pd, int mr_access_flags); struct ib_mr *(*reg_user_mr)(struct ib_pd *pd, u64 start, u64 length, u64 virt_addr, int mr_access_flags, @@ -2878,6 +2914,7 @@ struct ib_device_ops { DECLARE_RDMA_OBJ_SIZE(ib_ah); DECLARE_RDMA_OBJ_SIZE(ib_counters); DECLARE_RDMA_OBJ_SIZE(ib_cq); + DECLARE_RDMA_OBJ_SIZE(ib_comp_cntr); DECLARE_RDMA_OBJ_SIZE(ib_dmah); DECLARE_RDMA_OBJ_SIZE(ib_mw); DECLARE_RDMA_OBJ_SIZE(ib_pd); diff --git a/include/uapi/rdma/ib_user_ioctl_cmds.h b/include/uapi/rdma/ib_user_ioctl_cmds.h index 72041c1b0ea5..6ff6a2afdc60 100644 --- a/include/uapi/rdma/ib_user_ioctl_cmds.h +++ b/include/uapi/rdma/ib_user_ioctl_cmds.h @@ -57,6 
+57,7 @@ enum uverbs_default_objects { UVERBS_OBJECT_ASYNC_EVENT, UVERBS_OBJECT_DMAH, UVERBS_OBJECT_DMABUF, + UVERBS_OBJECT_COMP_CNTR, }; enum { @@ -168,6 +169,7 @@ enum uverbs_attrs_destroy_qp_cmd_attr_ids { enum uverbs_methods_qp { UVERBS_METHOD_QP_CREATE, UVERBS_METHOD_QP_DESTROY, + UVERBS_METHOD_QP_ATTACH_COMP_CNTR, }; enum uverbs_attrs_create_srq_cmd_attr_ids { @@ -434,4 +436,52 @@ enum uverbs_attrs_query_gid_entry_cmd_attr_ids { UVERBS_ATTR_QUERY_GID_ENTRY_RESP_ENTRY, }; +enum uverbs_methods_comp_cntr { + UVERBS_METHOD_COMP_CNTR_CREATE, + UVERBS_METHOD_COMP_CNTR_DESTROY, + UVERBS_METHOD_COMP_CNTR_SET, + UVERBS_METHOD_COMP_CNTR_INC, + UVERBS_METHOD_COMP_CNTR_READ, +}; + +enum uverbs_attrs_create_comp_cntr_cmd_attr_ids { + UVERBS_ATTR_CREATE_COMP_CNTR_HANDLE, + UVERBS_ATTR_CREATE_COMP_CNTR_BUFFER_VA, + UVERBS_ATTR_CREATE_COMP_CNTR_BUFFER_FD, + UVERBS_ATTR_CREATE_COMP_CNTR_BUFFER_OFFSET, + UVERBS_ATTR_CREATE_COMP_CNTR_ERR_BUFFER_VA, + UVERBS_ATTR_CREATE_COMP_CNTR_ERR_BUFFER_FD, + UVERBS_ATTR_CREATE_COMP_CNTR_ERR_BUFFER_OFFSET, + UVERBS_ATTR_CREATE_COMP_CNTR_RESP_COUNT_MAX_VALUE, + UVERBS_ATTR_CREATE_COMP_CNTR_RESP_ERR_COUNT_MAX_VALUE, +}; + +enum uverbs_attrs_destroy_comp_cntr_cmd_attr_ids { + UVERBS_ATTR_DESTROY_COMP_CNTR_HANDLE, +}; + +enum uverbs_attrs_set_comp_cntr_cmd_attr_ids { + UVERBS_ATTR_SET_COMP_CNTR_HANDLE, + UVERBS_ATTR_SET_COMP_CNTR_ENTRY, + UVERBS_ATTR_SET_COMP_CNTR_VALUE, +}; + +enum uverbs_attrs_inc_comp_cntr_cmd_attr_ids { + UVERBS_ATTR_INC_COMP_CNTR_HANDLE, + UVERBS_ATTR_INC_COMP_CNTR_ENTRY, + UVERBS_ATTR_INC_COMP_CNTR_VALUE, +}; + +enum uverbs_attrs_read_comp_cntr_cmd_attr_ids { + UVERBS_ATTR_READ_COMP_CNTR_HANDLE, + UVERBS_ATTR_READ_COMP_CNTR_ENTRY, + UVERBS_ATTR_READ_COMP_CNTR_RESP_VALUE, +}; + +enum uverbs_attrs_qp_attach_comp_cntr_cmd_attr_ids { + UVERBS_ATTR_QP_ATTACH_COMP_CNTR_QP_HANDLE, + UVERBS_ATTR_QP_ATTACH_COMP_CNTR_HANDLE, + UVERBS_ATTR_QP_ATTACH_COMP_CNTR_OP_MASK, +}; + #endif diff --git a/include/uapi/rdma/ib_user_ioctl_verbs.h 
b/include/uapi/rdma/ib_user_ioctl_verbs.h index 90c5cd8e7753..f38236b056a7 100644 --- a/include/uapi/rdma/ib_user_ioctl_verbs.h +++ b/include/uapi/rdma/ib_user_ioctl_verbs.h @@ -273,4 +273,18 @@ struct ib_uverbs_gid_entry { __u32 netdev_ifindex; /* It is 0 if there is no netdev associated with it */ }; +enum ib_uverbs_comp_cntr_entry { + IB_UVERBS_COMP_CNTR_ENTRY_COMP, + IB_UVERBS_COMP_CNTR_ENTRY_ERR, +}; + +enum ib_uverbs_comp_cntr_attach_op { + IB_UVERBS_COMP_CNTR_ATTACH_OP_SEND = 1 << 0, + IB_UVERBS_COMP_CNTR_ATTACH_OP_RECV = 1 << 1, + IB_UVERBS_COMP_CNTR_ATTACH_OP_RDMA_READ = 1 << 2, + IB_UVERBS_COMP_CNTR_ATTACH_OP_REMOTE_RDMA_READ = 1 << 3, + IB_UVERBS_COMP_CNTR_ATTACH_OP_RDMA_WRITE = 1 << 4, + IB_UVERBS_COMP_CNTR_ATTACH_OP_REMOTE_RDMA_WRITE = 1 << 5, +}; + #endif diff --git a/include/uapi/rdma/ib_user_verbs.h b/include/uapi/rdma/ib_user_verbs.h index 3b7bd99813e9..45d142f4a7f8 100644 --- a/include/uapi/rdma/ib_user_verbs.h +++ b/include/uapi/rdma/ib_user_verbs.h @@ -299,7 +299,7 @@ struct ib_uverbs_ex_query_device_resp { struct ib_uverbs_cq_moderation_caps cq_moderation_caps; __aligned_u64 max_dm_size; __u32 xrc_odp_caps; - __u32 reserved; + __u32 max_comp_cntr; }; struct ib_uverbs_query_port { -- 2.47.3 ^ permalink raw reply related [flat|nested] 14+ messages in thread
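The QP attach handler above validates the operation mask with
uverbs_get_flags32(), which rejects any bit outside the allowed set. A
standalone sketch of that check (macro names abbreviated from the uapi
enum ib_uverbs_comp_cntr_attach_op in this patch; the helper name is
illustrative):

```c
#include <errno.h>
#include <stdint.h>

/* Flag values matching enum ib_uverbs_comp_cntr_attach_op. */
#define ATTACH_OP_SEND			(1u << 0)
#define ATTACH_OP_RECV			(1u << 1)
#define ATTACH_OP_RDMA_READ		(1u << 2)
#define ATTACH_OP_REMOTE_RDMA_READ	(1u << 3)
#define ATTACH_OP_RDMA_WRITE		(1u << 4)
#define ATTACH_OP_REMOTE_RDMA_WRITE	(1u << 5)

#define ATTACH_OP_ALL (ATTACH_OP_SEND | ATTACH_OP_RECV | \
		       ATTACH_OP_RDMA_READ | ATTACH_OP_REMOTE_RDMA_READ | \
		       ATTACH_OP_RDMA_WRITE | ATTACH_OP_REMOTE_RDMA_WRITE)

/* Like uverbs_get_flags32(): accept a mask only if every set bit is
 * in the allowed set, otherwise fail with -EINVAL. */
static int validate_op_mask(uint32_t mask)
{
	if (mask & ~ATTACH_OP_ALL)
		return -EINVAL;
	return 0;
}
```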
* Re: [PATCH for-next v2 1/5] RDMA/core: Add Completion Counters support
  2026-04-16 21:23 ` [PATCH for-next v2 1/5] RDMA/core: Add Completion Counters support Michael Margolin
@ 2026-04-30  0:50   ` Doug Ledford
  2026-04-30  1:49     ` Jason Gunthorpe
  2026-04-30 12:18     ` Michael Margolin
  0 siblings, 2 replies; 14+ messages in thread
From: Doug Ledford @ 2026-04-30 0:50 UTC (permalink / raw)
  To: Michael Margolin, jgg, leon, linux-rdma
  Cc: sleybo, matua, gal.pressman, Yonatan Nachum

[-- Attachment #1.1: Type: text/plain, Size: 28969 bytes --]

On 4/16/26 4:23 PM, Michael Margolin wrote:
> Add core infrastructure for Completion Counters, a light-weight
> alternative to polling CQ for tracking operation completions.
>
> Define the UVERBS_OBJECT_COMP_CNTR ioctl object with create, destroy,
> set, inc and read methods for both success and error counters. Add a
> QP attach method on the QP object to associate a completion counter
> with a queue pair.
>
> The create handler constructs umem from user-provided VA or dmabuf for
> each counter, following the CQ buffer pattern.

Description here doesn't match implementation. The umem or dmabuf is
optional, while this reads that they are the only two options. If
neither is passed in, then the counter is on the hardware and the read
operation is used to get the value (as per the code anyway).

Which raises a different scenario our hardware enables. We can pass in
a umem on create, but that doesn't mean the counter exists in umem, it
exists on the device and it is copied to umem. If you copy it on every
counter update, that kills PCI-e usage, so we have an option to use a
trigger to only update on a periodic basis (but then user space authors
start polling on the umem location and killing CPU cycles, so this
option is not preferred), or there is a wait option where you can set
the target and then in your app use a wait call to wait for the count
to be reached (we've found this is about the only performant way to
implement these counters).
Also, we don't really attach counters to QPs. That isn't usually what
we care about counting. Given that our EPs are not connected, counters
on it are usually only useful for recv operations where you can get
aggregate data for a given EP. For send, it is often that we really
want counters on a per-flow basis knowing that we have many flows that
go through that one EP (soon to be QP). So, for us, we create a
counter, then during our send operations, if we want a specific
transfer to be included in a specific counter, it's flagged in the
command we send to the hardware for that send operation. That implies
that a proper place to hang a list of counters is probably off of an
AH instead of a QP for us.

I think we can extend this API to suit our needs, relax some of the
current restrictions/assumptions, and be good. But, as this is a user
visible API, if it's taken as-is, I would suggest that the rdma-core
portion be marked as experimental until we've made the changes needed
for our hardware in order to avoid user API churn. These changes could
be summed up as:

1) Make qp attachment optional

2) Extend create verb to differentiate between on-card counter with
umem target and in-umem counter

3) Extend create verb to pass in optional trigger or wait capability
to perform limited umem updates based upon passed in option

4) Modify read operation so that it can either return the value
directly or just trigger an async update of a buffer backed counter
(especially useful if the umem counter is on a GPU, is set for a
triggered update, and you just want to force an immediate async
update)

Doug
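The triggered-shadow scheme described above can be sketched as a toy
model: the authoritative count stays on the device, and the umem copy
is only written when an update is triggered, so completions do not cost
a PCIe write each. All names here are illustrative of this description,
not of any shipping API:

```c
#include <stdbool.h>
#include <stdint.h>

/* Model of a device-resident counter with a host-visible umem shadow. */
struct shadowed_cntr {
	uint64_t dev_val;	/* authoritative, on-device count */
	uint64_t umem_val;	/* host shadow, updated only on trigger */
};

/* A completion bumps only the on-device value; no host write occurs. */
static void dev_complete(struct shadowed_cntr *c, uint64_t n)
{
	c->dev_val += n;
}

/* A trigger copies the device value out: one DMA write per trigger. */
static void trigger_update(struct shadowed_cntr *c)
{
	c->umem_val = c->dev_val;
}

/* Wait-style usage: the condition a blocking wait call would test. */
static bool target_reached(const struct shadowed_cntr *c, uint64_t target)
{
	return c->dev_val >= target;
}
```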
> > Reviewed-by: Yonatan Nachum <ynachum@amazon.com> > Signed-off-by: Michael Margolin <mrgolin@amazon.com> > --- > drivers/infiniband/core/Makefile | 1 + > drivers/infiniband/core/device.c | 7 + > drivers/infiniband/core/rdma_core.h | 1 + > drivers/infiniband/core/uverbs_cmd.c | 1 + > .../core/uverbs_std_types_comp_cntr.c | 290 ++++++++++++++++++ > drivers/infiniband/core/uverbs_std_types_qp.c | 45 ++- > drivers/infiniband/core/uverbs_uapi.c | 1 + > include/rdma/ib_verbs.h | 37 +++ > include/uapi/rdma/ib_user_ioctl_cmds.h | 50 +++ > include/uapi/rdma/ib_user_ioctl_verbs.h | 14 + > include/uapi/rdma/ib_user_verbs.h | 2 +- > 11 files changed, 447 insertions(+), 2 deletions(-) > create mode 100644 drivers/infiniband/core/uverbs_std_types_comp_cntr.c > > diff --git a/drivers/infiniband/core/Makefile b/drivers/infiniband/core/Makefile > index dce798d8cfe6..4767339608a1 100644 > --- a/drivers/infiniband/core/Makefile > +++ b/drivers/infiniband/core/Makefile > @@ -35,6 +35,7 @@ ib_umad-y := user_mad.o > ib_uverbs-y := uverbs_main.o uverbs_cmd.o uverbs_marshall.o \ > rdma_core.o uverbs_std_types.o uverbs_ioctl.o \ > uverbs_std_types_cq.o \ > + uverbs_std_types_comp_cntr.o \ > uverbs_std_types_dmabuf.o \ > uverbs_std_types_dmah.o \ > uverbs_std_types_flow_action.o uverbs_std_types_dm.o \ > diff --git a/drivers/infiniband/core/device.c b/drivers/infiniband/core/device.c > index 4c174f7f1070..60c41fc1aa4d 100644 > --- a/drivers/infiniband/core/device.c > +++ b/drivers/infiniband/core/device.c > @@ -2733,6 +2733,7 @@ void ib_set_device_ops(struct ib_device *dev, const struct ib_device_ops *ops) > SET_DEVICE_OP(dev_ops, create_ah); > SET_DEVICE_OP(dev_ops, create_counters); > SET_DEVICE_OP(dev_ops, create_cq); > + SET_DEVICE_OP(dev_ops, create_comp_cntr); > SET_DEVICE_OP(dev_ops, create_user_cq); > SET_DEVICE_OP(dev_ops, create_flow); > SET_DEVICE_OP(dev_ops, create_qp); > @@ -2753,6 +2754,7 @@ void ib_set_device_ops(struct ib_device *dev, const struct ib_device_ops *ops) > 
SET_DEVICE_OP(dev_ops, destroy_ah); > SET_DEVICE_OP(dev_ops, destroy_counters); > SET_DEVICE_OP(dev_ops, destroy_cq); > + SET_DEVICE_OP(dev_ops, destroy_comp_cntr); > SET_DEVICE_OP(dev_ops, destroy_flow); > SET_DEVICE_OP(dev_ops, destroy_flow_action); > SET_DEVICE_OP(dev_ops, destroy_qp); > @@ -2804,6 +2806,8 @@ void ib_set_device_ops(struct ib_device *dev, const struct ib_device_ops *ops) > SET_DEVICE_OP(dev_ops, modify_hw_stat); > SET_DEVICE_OP(dev_ops, modify_port); > SET_DEVICE_OP(dev_ops, modify_qp); > + SET_DEVICE_OP(dev_ops, inc_comp_cntr); > + SET_DEVICE_OP(dev_ops, qp_attach_comp_cntr); > SET_DEVICE_OP(dev_ops, modify_srq); > SET_DEVICE_OP(dev_ops, modify_wq); > SET_DEVICE_OP(dev_ops, peek_cq); > @@ -2827,12 +2831,14 @@ void ib_set_device_ops(struct ib_device *dev, const struct ib_device_ops *ops) > SET_DEVICE_OP(dev_ops, query_ucontext); > SET_DEVICE_OP(dev_ops, rdma_netdev_get_params); > SET_DEVICE_OP(dev_ops, read_counters); > + SET_DEVICE_OP(dev_ops, read_comp_cntr); > SET_DEVICE_OP(dev_ops, reg_dm_mr); > SET_DEVICE_OP(dev_ops, reg_user_mr); > SET_DEVICE_OP(dev_ops, reg_user_mr_dmabuf); > SET_DEVICE_OP(dev_ops, req_notify_cq); > SET_DEVICE_OP(dev_ops, rereg_user_mr); > SET_DEVICE_OP(dev_ops, resize_user_cq); > + SET_DEVICE_OP(dev_ops, set_comp_cntr); > SET_DEVICE_OP(dev_ops, set_vf_guid); > SET_DEVICE_OP(dev_ops, set_vf_link_state); > SET_DEVICE_OP(dev_ops, ufile_hw_cleanup); > @@ -2841,6 +2847,7 @@ void ib_set_device_ops(struct ib_device *dev, const struct ib_device_ops *ops) > SET_OBJ_SIZE(dev_ops, ib_ah); > SET_OBJ_SIZE(dev_ops, ib_counters); > SET_OBJ_SIZE(dev_ops, ib_cq); > + SET_OBJ_SIZE(dev_ops, ib_comp_cntr); > SET_OBJ_SIZE(dev_ops, ib_dmah); > SET_OBJ_SIZE(dev_ops, ib_mw); > SET_OBJ_SIZE(dev_ops, ib_pd); > diff --git a/drivers/infiniband/core/rdma_core.h b/drivers/infiniband/core/rdma_core.h > index 269b393799ab..2569550e4c6d 100644 > --- a/drivers/infiniband/core/rdma_core.h > +++ b/drivers/infiniband/core/rdma_core.h > @@ -156,6 +156,7 @@ 
uverbs_api_ioctl_handler_fn uverbs_get_handler_fn(struct ib_udata *udata); > > extern const struct uapi_definition uverbs_def_obj_async_fd[]; > extern const struct uapi_definition uverbs_def_obj_counters[]; > +extern const struct uapi_definition uverbs_def_obj_comp_cntr[]; > extern const struct uapi_definition uverbs_def_obj_cq[]; > extern const struct uapi_definition uverbs_def_obj_device[]; > extern const struct uapi_definition uverbs_def_obj_dm[]; > diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c > index a768436ba468..4bc493b3b624 100644 > --- a/drivers/infiniband/core/uverbs_cmd.c > +++ b/drivers/infiniband/core/uverbs_cmd.c > @@ -3673,6 +3673,7 @@ static int ib_uverbs_ex_query_device(struct uverbs_attr_bundle *attrs) > resp.cq_moderation_caps.max_cq_moderation_period = > attr.cq_caps.max_cq_moderation_period; > resp.max_dm_size = attr.max_dm_size; > + resp.max_comp_cntr = attr.max_comp_cntr; > resp.response_length = uverbs_response_length(attrs, sizeof(resp)); > > return uverbs_response(attrs, &resp, sizeof(resp)); > diff --git a/drivers/infiniband/core/uverbs_std_types_comp_cntr.c b/drivers/infiniband/core/uverbs_std_types_comp_cntr.c > new file mode 100644 > index 000000000000..7651a565bb9f > --- /dev/null > +++ b/drivers/infiniband/core/uverbs_std_types_comp_cntr.c > @@ -0,0 +1,290 @@ > +// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB > +/* > + * Copyright Amazon.com, Inc. or its affiliates. All rights reserved. 
> + */ > + > +#include <rdma/uverbs_std_types.h> > +#include <rdma/ib_umem.h> > +#include <rdma/ib_umem_dmabuf.h> > +#include "rdma_core.h" > +#include "uverbs.h" > + > +static int uverbs_free_comp_cntr(struct ib_uobject *uobject, enum rdma_remove_reason why, > + struct uverbs_attr_bundle *attrs) > +{ > + struct ib_comp_cntr *cc = uobject->object; > + int ret; > + > + ret = cc->device->ops.destroy_comp_cntr(cc); > + if (ret) > + return ret; > + > + ib_umem_release(cc->comp_umem); > + ib_umem_release(cc->err_umem); > + kfree(cc); > + return 0; > +} > + > +static int comp_cntr_get_umem(struct ib_device *ib_dev, struct uverbs_attr_bundle *attrs, > + int va_attr, int fd_attr, int offset_attr, struct ib_umem **umem_out) > +{ > + struct ib_umem_dmabuf *umem_dmabuf; > + u64 buffer_offset; > + u64 buffer_va; > + int buffer_fd; > + int ret; > + > + *umem_out = NULL; > + > + if (uverbs_attr_is_valid(attrs, va_attr)) { > + if (uverbs_attr_is_valid(attrs, fd_attr) || > + uverbs_attr_is_valid(attrs, offset_attr)) > + return -EINVAL; > + > + ret = uverbs_copy_from(&buffer_va, attrs, va_attr); > + if (ret) > + return ret; > + > + *umem_out = ib_umem_get(ib_dev, buffer_va, sizeof(u64), IB_ACCESS_LOCAL_WRITE); > + if (IS_ERR(*umem_out)) { > + ret = PTR_ERR(*umem_out); > + *umem_out = NULL; > + return ret; > + } > + } else if (uverbs_attr_is_valid(attrs, fd_attr)) { > + if (uverbs_attr_is_valid(attrs, va_attr)) > + return -EINVAL; > + > + ret = uverbs_get_raw_fd(&buffer_fd, attrs, fd_attr); > + if (ret) > + return ret; > + > + ret = uverbs_copy_from(&buffer_offset, attrs, offset_attr); > + if (ret) > + return ret; > + > + umem_dmabuf = ib_umem_dmabuf_get_pinned(ib_dev, buffer_offset, sizeof(u64), > + buffer_fd, IB_ACCESS_LOCAL_WRITE); > + if (IS_ERR(umem_dmabuf)) > + return PTR_ERR(umem_dmabuf); > + > + *umem_out = &umem_dmabuf->umem; > + } > + > + return 0; > +} > + > +static int UVERBS_HANDLER(UVERBS_METHOD_COMP_CNTR_CREATE)(struct uverbs_attr_bundle *attrs) > +{ > + struct 
ib_uobject *uobj = uverbs_attr_get_uobject(attrs, > + UVERBS_ATTR_CREATE_COMP_CNTR_HANDLE); > + struct ib_device *ib_dev = attrs->context->device; > + struct ib_comp_cntr *cc; > + int ret; > + > + if (!ib_dev->ops.create_comp_cntr || > + !ib_dev->ops.destroy_comp_cntr || > + !ib_dev->ops.qp_attach_comp_cntr) > + return -EOPNOTSUPP; > + > + cc = rdma_zalloc_drv_obj(ib_dev, ib_comp_cntr); > + if (!cc) > + return -ENOMEM; > + > + cc->device = ib_dev; > + cc->uobject = uobj; > + > + ret = comp_cntr_get_umem(ib_dev, attrs, > + UVERBS_ATTR_CREATE_COMP_CNTR_BUFFER_VA, > + UVERBS_ATTR_CREATE_COMP_CNTR_BUFFER_FD, > + UVERBS_ATTR_CREATE_COMP_CNTR_BUFFER_OFFSET, > + &cc->comp_umem); > + if (ret) > + goto err_free; > + > + ret = comp_cntr_get_umem(ib_dev, attrs, > + UVERBS_ATTR_CREATE_COMP_CNTR_ERR_BUFFER_VA, > + UVERBS_ATTR_CREATE_COMP_CNTR_ERR_BUFFER_FD, > + UVERBS_ATTR_CREATE_COMP_CNTR_ERR_BUFFER_OFFSET, > + &cc->err_umem); > + if (ret) > + goto err_comp_umem; > + > + ret = ib_dev->ops.create_comp_cntr(cc, attrs); > + if (ret) > + goto err_err_umem; > + > + uobj->object = cc; > + uverbs_finalize_uobj_create(attrs, UVERBS_ATTR_CREATE_COMP_CNTR_HANDLE); > + > + ret = uverbs_copy_to(attrs, UVERBS_ATTR_CREATE_COMP_CNTR_RESP_COUNT_MAX_VALUE, > + &cc->comp_count_max_value, sizeof(cc->comp_count_max_value)); > + if (ret) > + return ret; > + > + ret = uverbs_copy_to(attrs, UVERBS_ATTR_CREATE_COMP_CNTR_RESP_ERR_COUNT_MAX_VALUE, > + &cc->err_count_max_value, sizeof(cc->err_count_max_value)); > + return ret; > + > +err_err_umem: > + ib_umem_release(cc->err_umem); > +err_comp_umem: > + ib_umem_release(cc->comp_umem); > +err_free: > + kfree(cc); > + return ret; > +} > + > +static int UVERBS_HANDLER(UVERBS_METHOD_COMP_CNTR_SET)(struct uverbs_attr_bundle *attrs) > +{ > + struct ib_comp_cntr *cc = uverbs_attr_get_obj(attrs, UVERBS_ATTR_SET_COMP_CNTR_HANDLE); > + enum ib_comp_cntr_entry entry; > + u64 value; > + int ret; > + > + if (!cc->device->ops.set_comp_cntr) > + return -EOPNOTSUPP; > 
+ > + ret = uverbs_get_const(&entry, attrs, UVERBS_ATTR_SET_COMP_CNTR_ENTRY); > + if (ret) > + return ret; > + > + ret = uverbs_copy_from(&value, attrs, UVERBS_ATTR_SET_COMP_CNTR_VALUE); > + if (ret) > + return ret; > + > + return cc->device->ops.set_comp_cntr(cc, entry, value); > +} > + > +static int UVERBS_HANDLER(UVERBS_METHOD_COMP_CNTR_INC)(struct uverbs_attr_bundle *attrs) > +{ > + struct ib_comp_cntr *cc = uverbs_attr_get_obj(attrs, UVERBS_ATTR_INC_COMP_CNTR_HANDLE); > + enum ib_comp_cntr_entry entry; > + u64 amount; > + int ret; > + > + if (!cc->device->ops.inc_comp_cntr) > + return -EOPNOTSUPP; > + > + ret = uverbs_get_const(&entry, attrs, UVERBS_ATTR_INC_COMP_CNTR_ENTRY); > + if (ret) > + return ret; > + > + ret = uverbs_copy_from(&amount, attrs, UVERBS_ATTR_INC_COMP_CNTR_VALUE); > + if (ret) > + return ret; > + > + return cc->device->ops.inc_comp_cntr(cc, entry, amount); > +} > + > +static int UVERBS_HANDLER(UVERBS_METHOD_COMP_CNTR_READ)(struct uverbs_attr_bundle *attrs) > +{ > + struct ib_comp_cntr *cc = uverbs_attr_get_obj(attrs, UVERBS_ATTR_READ_COMP_CNTR_HANDLE); > + enum ib_comp_cntr_entry entry; > + u64 value; > + int ret; > + > + if (!cc->device->ops.read_comp_cntr) > + return -EOPNOTSUPP; > + > + ret = uverbs_get_const(&entry, attrs, UVERBS_ATTR_READ_COMP_CNTR_ENTRY); > + if (ret) > + return ret; > + > + ret = cc->device->ops.read_comp_cntr(cc, entry, &value); > + if (ret) > + return ret; > + > + return uverbs_copy_to(attrs, UVERBS_ATTR_READ_COMP_CNTR_RESP_VALUE, &value, sizeof(value)); > +} > + > +DECLARE_UVERBS_NAMED_METHOD( > + UVERBS_METHOD_COMP_CNTR_CREATE, > + UVERBS_ATTR_IDR(UVERBS_ATTR_CREATE_COMP_CNTR_HANDLE, > + UVERBS_OBJECT_COMP_CNTR, > + UVERBS_ACCESS_NEW, > + UA_MANDATORY), > + UVERBS_ATTR_PTR_IN(UVERBS_ATTR_CREATE_COMP_CNTR_BUFFER_VA, > + UVERBS_ATTR_TYPE(u64), > + UA_OPTIONAL), > + UVERBS_ATTR_RAW_FD(UVERBS_ATTR_CREATE_COMP_CNTR_BUFFER_FD, > + UA_OPTIONAL), > + UVERBS_ATTR_PTR_IN(UVERBS_ATTR_CREATE_COMP_CNTR_BUFFER_OFFSET, > + 
UVERBS_ATTR_TYPE(u64), > + UA_OPTIONAL), > + UVERBS_ATTR_PTR_IN(UVERBS_ATTR_CREATE_COMP_CNTR_ERR_BUFFER_VA, > + UVERBS_ATTR_TYPE(u64), > + UA_OPTIONAL), > + UVERBS_ATTR_RAW_FD(UVERBS_ATTR_CREATE_COMP_CNTR_ERR_BUFFER_FD, > + UA_OPTIONAL), > + UVERBS_ATTR_PTR_IN(UVERBS_ATTR_CREATE_COMP_CNTR_ERR_BUFFER_OFFSET, > + UVERBS_ATTR_TYPE(u64), > + UA_OPTIONAL), > + UVERBS_ATTR_PTR_OUT(UVERBS_ATTR_CREATE_COMP_CNTR_RESP_COUNT_MAX_VALUE, > + UVERBS_ATTR_TYPE(u64), > + UA_MANDATORY), > + UVERBS_ATTR_PTR_OUT(UVERBS_ATTR_CREATE_COMP_CNTR_RESP_ERR_COUNT_MAX_VALUE, > + UVERBS_ATTR_TYPE(u64), > + UA_MANDATORY), > + UVERBS_ATTR_UHW()); > + > +DECLARE_UVERBS_NAMED_METHOD_DESTROY( > + UVERBS_METHOD_COMP_CNTR_DESTROY, > + UVERBS_ATTR_IDR(UVERBS_ATTR_DESTROY_COMP_CNTR_HANDLE, > + UVERBS_OBJECT_COMP_CNTR, > + UVERBS_ACCESS_DESTROY, > + UA_MANDATORY)); > + > +DECLARE_UVERBS_NAMED_METHOD( > + UVERBS_METHOD_COMP_CNTR_SET, > + UVERBS_ATTR_IDR(UVERBS_ATTR_SET_COMP_CNTR_HANDLE, > + UVERBS_OBJECT_COMP_CNTR, > + UVERBS_ACCESS_WRITE, > + UA_MANDATORY), > + UVERBS_ATTR_CONST_IN(UVERBS_ATTR_SET_COMP_CNTR_ENTRY, > + enum ib_uverbs_comp_cntr_entry, > + UA_MANDATORY), > + UVERBS_ATTR_PTR_IN(UVERBS_ATTR_SET_COMP_CNTR_VALUE, > + UVERBS_ATTR_TYPE(u64), > + UA_MANDATORY)); > + > +DECLARE_UVERBS_NAMED_METHOD( > + UVERBS_METHOD_COMP_CNTR_INC, > + UVERBS_ATTR_IDR(UVERBS_ATTR_INC_COMP_CNTR_HANDLE, > + UVERBS_OBJECT_COMP_CNTR, > + UVERBS_ACCESS_WRITE, > + UA_MANDATORY), > + UVERBS_ATTR_CONST_IN(UVERBS_ATTR_INC_COMP_CNTR_ENTRY, > + enum ib_uverbs_comp_cntr_entry, > + UA_MANDATORY), > + UVERBS_ATTR_PTR_IN(UVERBS_ATTR_INC_COMP_CNTR_VALUE, > + UVERBS_ATTR_TYPE(u64), > + UA_MANDATORY)); > + > +DECLARE_UVERBS_NAMED_METHOD( > + UVERBS_METHOD_COMP_CNTR_READ, > + UVERBS_ATTR_IDR(UVERBS_ATTR_READ_COMP_CNTR_HANDLE, > + UVERBS_OBJECT_COMP_CNTR, > + UVERBS_ACCESS_READ, > + UA_MANDATORY), > + UVERBS_ATTR_CONST_IN(UVERBS_ATTR_READ_COMP_CNTR_ENTRY, > + enum ib_uverbs_comp_cntr_entry, > + UA_MANDATORY), > + 
UVERBS_ATTR_PTR_OUT(UVERBS_ATTR_READ_COMP_CNTR_RESP_VALUE, > + UVERBS_ATTR_TYPE(u64), > + UA_MANDATORY)); > + > +DECLARE_UVERBS_NAMED_OBJECT( > + UVERBS_OBJECT_COMP_CNTR, > + UVERBS_TYPE_ALLOC_IDR(uverbs_free_comp_cntr), > + &UVERBS_METHOD(UVERBS_METHOD_COMP_CNTR_CREATE), > + &UVERBS_METHOD(UVERBS_METHOD_COMP_CNTR_DESTROY), > + &UVERBS_METHOD(UVERBS_METHOD_COMP_CNTR_SET), > + &UVERBS_METHOD(UVERBS_METHOD_COMP_CNTR_INC), > + &UVERBS_METHOD(UVERBS_METHOD_COMP_CNTR_READ)); > + > +const struct uapi_definition uverbs_def_obj_comp_cntr[] = { > + UAPI_DEF_CHAIN_OBJ_TREE_NAMED(UVERBS_OBJECT_COMP_CNTR, > + UAPI_DEF_OBJ_NEEDS_FN(destroy_comp_cntr)), > + {} > +}; > diff --git a/drivers/infiniband/core/uverbs_std_types_qp.c b/drivers/infiniband/core/uverbs_std_types_qp.c > index be0730e8509e..2c607b02d9d5 100644 > --- a/drivers/infiniband/core/uverbs_std_types_qp.c > +++ b/drivers/infiniband/core/uverbs_std_types_qp.c > @@ -367,11 +367,54 @@ DECLARE_UVERBS_NAMED_METHOD( > UVERBS_ATTR_TYPE(struct ib_uverbs_destroy_qp_resp), > UA_MANDATORY)); > > +static int UVERBS_HANDLER(UVERBS_METHOD_QP_ATTACH_COMP_CNTR)( > + struct uverbs_attr_bundle *attrs) > +{ > + struct ib_uobject *qp_uobj = uverbs_attr_get_uobject( > + attrs, UVERBS_ATTR_QP_ATTACH_COMP_CNTR_QP_HANDLE); > + struct ib_comp_cntr *cc = uverbs_attr_get_obj( > + attrs, UVERBS_ATTR_QP_ATTACH_COMP_CNTR_HANDLE); > + struct ib_comp_cntr_attach_attr attr = {}; > + struct ib_qp *qp = qp_uobj->object; > + int ret; > + > + if (!cc->device->ops.qp_attach_comp_cntr) > + return -EOPNOTSUPP; > + > + ret = uverbs_get_flags32(&attr.op_mask, attrs, > + UVERBS_ATTR_QP_ATTACH_COMP_CNTR_OP_MASK, > + IB_UVERBS_COMP_CNTR_ATTACH_OP_SEND | > + IB_UVERBS_COMP_CNTR_ATTACH_OP_RECV | > + IB_UVERBS_COMP_CNTR_ATTACH_OP_RDMA_READ | > + IB_UVERBS_COMP_CNTR_ATTACH_OP_REMOTE_RDMA_READ | > + IB_UVERBS_COMP_CNTR_ATTACH_OP_RDMA_WRITE | > + IB_UVERBS_COMP_CNTR_ATTACH_OP_REMOTE_RDMA_WRITE); > + if (ret) > + return ret; > + > + return 
qp->device->ops.qp_attach_comp_cntr(qp, cc, &attr); > +} > + > +DECLARE_UVERBS_NAMED_METHOD( > + UVERBS_METHOD_QP_ATTACH_COMP_CNTR, > + UVERBS_ATTR_IDR(UVERBS_ATTR_QP_ATTACH_COMP_CNTR_QP_HANDLE, > + UVERBS_OBJECT_QP, > + UVERBS_ACCESS_WRITE, > + UA_MANDATORY), > + UVERBS_ATTR_IDR(UVERBS_ATTR_QP_ATTACH_COMP_CNTR_HANDLE, > + UVERBS_OBJECT_COMP_CNTR, > + UVERBS_ACCESS_READ, > + UA_MANDATORY), > + UVERBS_ATTR_FLAGS_IN(UVERBS_ATTR_QP_ATTACH_COMP_CNTR_OP_MASK, > + enum ib_uverbs_comp_cntr_attach_op, > + UA_OPTIONAL)); > + > DECLARE_UVERBS_NAMED_OBJECT( > UVERBS_OBJECT_QP, > UVERBS_TYPE_ALLOC_IDR_SZ(sizeof(struct ib_uqp_object), uverbs_free_qp), > &UVERBS_METHOD(UVERBS_METHOD_QP_CREATE), > - &UVERBS_METHOD(UVERBS_METHOD_QP_DESTROY)); > + &UVERBS_METHOD(UVERBS_METHOD_QP_DESTROY), > + &UVERBS_METHOD(UVERBS_METHOD_QP_ATTACH_COMP_CNTR)); > > const struct uapi_definition uverbs_def_obj_qp[] = { > UAPI_DEF_CHAIN_OBJ_TREE_NAMED(UVERBS_OBJECT_QP, > diff --git a/drivers/infiniband/core/uverbs_uapi.c b/drivers/infiniband/core/uverbs_uapi.c > index 31b248295854..a3f42a50a14f 100644 > --- a/drivers/infiniband/core/uverbs_uapi.c > +++ b/drivers/infiniband/core/uverbs_uapi.c > @@ -628,6 +628,7 @@ void uverbs_destroy_api(struct uverbs_api *uapi) > static const struct uapi_definition uverbs_core_api[] = { > UAPI_DEF_CHAIN(uverbs_def_obj_async_fd), > UAPI_DEF_CHAIN(uverbs_def_obj_counters), > + UAPI_DEF_CHAIN(uverbs_def_obj_comp_cntr), > UAPI_DEF_CHAIN(uverbs_def_obj_cq), > UAPI_DEF_CHAIN(uverbs_def_obj_device), > UAPI_DEF_CHAIN(uverbs_def_obj_dm), > diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h > index 9dd76f489a0b..b0db80447bf0 100644 > --- a/include/rdma/ib_verbs.h > +++ b/include/rdma/ib_verbs.h > @@ -453,6 +453,7 @@ struct ib_device_attr { > u64 max_dm_size; > /* Max entries for sgl for optimized performance per READ */ > u32 max_sgl_rd; > + u32 max_comp_cntr; > }; > > enum ib_mtu { > @@ -1746,6 +1747,33 @@ struct ib_cq { > struct rdma_restrack_entry res; > }; > > 
+struct ib_comp_cntr { > + struct ib_device *device; > + struct ib_uobject *uobject; > + struct ib_umem *comp_umem; > + struct ib_umem *err_umem; > + u64 comp_count_max_value; > + u64 err_count_max_value; > +}; > + > +enum ib_comp_cntr_entry { > + IB_COMP_CNTR_ENTRY_COMP = IB_UVERBS_COMP_CNTR_ENTRY_COMP, > + IB_COMP_CNTR_ENTRY_ERR = IB_UVERBS_COMP_CNTR_ENTRY_ERR, > +}; > + > +enum ib_comp_cntr_attach_op { > + IB_COMP_CNTR_ATTACH_OP_SEND = IB_UVERBS_COMP_CNTR_ATTACH_OP_SEND, > + IB_COMP_CNTR_ATTACH_OP_RECV = IB_UVERBS_COMP_CNTR_ATTACH_OP_RECV, > + IB_COMP_CNTR_ATTACH_OP_RDMA_READ = IB_UVERBS_COMP_CNTR_ATTACH_OP_RDMA_READ, > + IB_COMP_CNTR_ATTACH_OP_REMOTE_RDMA_READ = IB_UVERBS_COMP_CNTR_ATTACH_OP_REMOTE_RDMA_READ, > + IB_COMP_CNTR_ATTACH_OP_RDMA_WRITE = IB_UVERBS_COMP_CNTR_ATTACH_OP_RDMA_WRITE, > + IB_COMP_CNTR_ATTACH_OP_REMOTE_RDMA_WRITE = IB_UVERBS_COMP_CNTR_ATTACH_OP_REMOTE_RDMA_WRITE, > +}; > + > +struct ib_comp_cntr_attach_attr { > + u32 op_mask; > +}; > + > struct ib_srq { > struct ib_device *device; > struct ib_pd *pd; > @@ -2624,6 +2652,8 @@ struct ib_device_ops { > struct ib_udata *udata); > int (*modify_qp)(struct ib_qp *qp, struct ib_qp_attr *qp_attr, > int qp_attr_mask, struct ib_udata *udata); > + int (*qp_attach_comp_cntr)(struct ib_qp *qp, struct ib_comp_cntr *cc, > + struct ib_comp_cntr_attach_attr *attr); > int (*query_qp)(struct ib_qp *qp, struct ib_qp_attr *qp_attr, > int qp_attr_mask, struct ib_qp_init_attr *qp_init_attr); > int (*destroy_qp)(struct ib_qp *qp, struct ib_udata *udata); > @@ -2645,6 +2675,12 @@ struct ib_device_ops { > * post_destroy_cq - Free all kernel resources > */ > void (*post_destroy_cq)(struct ib_cq *cq); > + int (*create_comp_cntr)(struct ib_comp_cntr *cc, > + struct uverbs_attr_bundle *attrs); > + int (*destroy_comp_cntr)(struct ib_comp_cntr *cc); > + int (*set_comp_cntr)(struct ib_comp_cntr *cc, enum ib_comp_cntr_entry entry, u64 value); > + int (*inc_comp_cntr)(struct ib_comp_cntr *cc, enum ib_comp_cntr_entry entry, u64 
amount); > + int (*read_comp_cntr)(struct ib_comp_cntr *cc, enum ib_comp_cntr_entry entry, u64 *value); > struct ib_mr *(*get_dma_mr)(struct ib_pd *pd, int mr_access_flags); > struct ib_mr *(*reg_user_mr)(struct ib_pd *pd, u64 start, u64 length, > u64 virt_addr, int mr_access_flags, > @@ -2878,6 +2914,7 @@ struct ib_device_ops { > DECLARE_RDMA_OBJ_SIZE(ib_ah); > DECLARE_RDMA_OBJ_SIZE(ib_counters); > DECLARE_RDMA_OBJ_SIZE(ib_cq); > + DECLARE_RDMA_OBJ_SIZE(ib_comp_cntr); > DECLARE_RDMA_OBJ_SIZE(ib_dmah); > DECLARE_RDMA_OBJ_SIZE(ib_mw); > DECLARE_RDMA_OBJ_SIZE(ib_pd); > diff --git a/include/uapi/rdma/ib_user_ioctl_cmds.h b/include/uapi/rdma/ib_user_ioctl_cmds.h > index 72041c1b0ea5..6ff6a2afdc60 100644 > --- a/include/uapi/rdma/ib_user_ioctl_cmds.h > +++ b/include/uapi/rdma/ib_user_ioctl_cmds.h > @@ -57,6 +57,7 @@ enum uverbs_default_objects { > UVERBS_OBJECT_ASYNC_EVENT, > UVERBS_OBJECT_DMAH, > UVERBS_OBJECT_DMABUF, > + UVERBS_OBJECT_COMP_CNTR, > }; > > enum { > @@ -168,6 +169,7 @@ enum uverbs_attrs_destroy_qp_cmd_attr_ids { > enum uverbs_methods_qp { > UVERBS_METHOD_QP_CREATE, > UVERBS_METHOD_QP_DESTROY, > + UVERBS_METHOD_QP_ATTACH_COMP_CNTR, > }; > > enum uverbs_attrs_create_srq_cmd_attr_ids { > @@ -434,4 +436,52 @@ enum uverbs_attrs_query_gid_entry_cmd_attr_ids { > UVERBS_ATTR_QUERY_GID_ENTRY_RESP_ENTRY, > }; > > +enum uverbs_methods_comp_cntr { > + UVERBS_METHOD_COMP_CNTR_CREATE, > + UVERBS_METHOD_COMP_CNTR_DESTROY, > + UVERBS_METHOD_COMP_CNTR_SET, > + UVERBS_METHOD_COMP_CNTR_INC, > + UVERBS_METHOD_COMP_CNTR_READ, > +}; > + > +enum uverbs_attrs_create_comp_cntr_cmd_attr_ids { > + UVERBS_ATTR_CREATE_COMP_CNTR_HANDLE, > + UVERBS_ATTR_CREATE_COMP_CNTR_BUFFER_VA, > + UVERBS_ATTR_CREATE_COMP_CNTR_BUFFER_FD, > + UVERBS_ATTR_CREATE_COMP_CNTR_BUFFER_OFFSET, > + UVERBS_ATTR_CREATE_COMP_CNTR_ERR_BUFFER_VA, > + UVERBS_ATTR_CREATE_COMP_CNTR_ERR_BUFFER_FD, > + UVERBS_ATTR_CREATE_COMP_CNTR_ERR_BUFFER_OFFSET, > + UVERBS_ATTR_CREATE_COMP_CNTR_RESP_COUNT_MAX_VALUE, > + 
UVERBS_ATTR_CREATE_COMP_CNTR_RESP_ERR_COUNT_MAX_VALUE, > +}; > + > +enum uverbs_attrs_destroy_comp_cntr_cmd_attr_ids { > + UVERBS_ATTR_DESTROY_COMP_CNTR_HANDLE, > +}; > + > +enum uverbs_attrs_set_comp_cntr_cmd_attr_ids { > + UVERBS_ATTR_SET_COMP_CNTR_HANDLE, > + UVERBS_ATTR_SET_COMP_CNTR_ENTRY, > + UVERBS_ATTR_SET_COMP_CNTR_VALUE, > +}; > + > +enum uverbs_attrs_inc_comp_cntr_cmd_attr_ids { > + UVERBS_ATTR_INC_COMP_CNTR_HANDLE, > + UVERBS_ATTR_INC_COMP_CNTR_ENTRY, > + UVERBS_ATTR_INC_COMP_CNTR_VALUE, > +}; > + > +enum uverbs_attrs_read_comp_cntr_cmd_attr_ids { > + UVERBS_ATTR_READ_COMP_CNTR_HANDLE, > + UVERBS_ATTR_READ_COMP_CNTR_ENTRY, > + UVERBS_ATTR_READ_COMP_CNTR_RESP_VALUE, > +}; > + > +enum uverbs_attrs_qp_attach_comp_cntr_cmd_attr_ids { > + UVERBS_ATTR_QP_ATTACH_COMP_CNTR_QP_HANDLE, > + UVERBS_ATTR_QP_ATTACH_COMP_CNTR_HANDLE, > + UVERBS_ATTR_QP_ATTACH_COMP_CNTR_OP_MASK, > +}; > + > #endif > diff --git a/include/uapi/rdma/ib_user_ioctl_verbs.h b/include/uapi/rdma/ib_user_ioctl_verbs.h > index 90c5cd8e7753..f38236b056a7 100644 > --- a/include/uapi/rdma/ib_user_ioctl_verbs.h > +++ b/include/uapi/rdma/ib_user_ioctl_verbs.h > @@ -273,4 +273,18 @@ struct ib_uverbs_gid_entry { > __u32 netdev_ifindex; /* It is 0 if there is no netdev associated with it */ > }; > > +enum ib_uverbs_comp_cntr_entry { > + IB_UVERBS_COMP_CNTR_ENTRY_COMP, > + IB_UVERBS_COMP_CNTR_ENTRY_ERR, > +}; > + > +enum ib_uverbs_comp_cntr_attach_op { > + IB_UVERBS_COMP_CNTR_ATTACH_OP_SEND = 1 << 0, > + IB_UVERBS_COMP_CNTR_ATTACH_OP_RECV = 1 << 1, > + IB_UVERBS_COMP_CNTR_ATTACH_OP_RDMA_READ = 1 << 2, > + IB_UVERBS_COMP_CNTR_ATTACH_OP_REMOTE_RDMA_READ = 1 << 3, > + IB_UVERBS_COMP_CNTR_ATTACH_OP_RDMA_WRITE = 1 << 4, > + IB_UVERBS_COMP_CNTR_ATTACH_OP_REMOTE_RDMA_WRITE = 1 << 5, > +}; > + > #endif > diff --git a/include/uapi/rdma/ib_user_verbs.h b/include/uapi/rdma/ib_user_verbs.h > index 3b7bd99813e9..45d142f4a7f8 100644 > --- a/include/uapi/rdma/ib_user_verbs.h > +++ b/include/uapi/rdma/ib_user_verbs.h > 
@@ -299,7 +299,7 @@ struct ib_uverbs_ex_query_device_resp {
>  	struct ib_uverbs_cq_moderation_caps cq_moderation_caps;
>  	__aligned_u64 max_dm_size;
>  	__u32 xrc_odp_caps;
> -	__u32 reserved;
> +	__u32 max_comp_cntr;
>  };
>
>  struct ib_uverbs_query_port {

--
Doug Ledford <doug.ledford@hpe.com>
GPG KeyID: B826A3330E572FDD
Key fingerprint = AE6B 1BDA 122B 23B4 265B 1274 B826 A333 0E57 2FDD
* Re: [PATCH for-next v2 1/5] RDMA/core: Add Completion Counters support
  2026-04-30  0:50 ` Doug Ledford
@ 2026-04-30  1:49 ` Jason Gunthorpe
  2026-04-30 15:38 ` Doug Ledford
  2026-04-30 12:18 ` Michael Margolin
  1 sibling, 1 reply; 14+ messages in thread
From: Jason Gunthorpe @ 2026-04-30  1:49 UTC (permalink / raw)
To: Doug Ledford
Cc: Michael Margolin, leon, linux-rdma, sleybo, matua, gal.pressman,
	Yonatan Nachum

On Wed, Apr 29, 2026 at 06:50:54PM -0600, Doug Ledford wrote:
> 1) Make qp attachment optional
> 2) Extend create verb to differentiate between on-card counter with umem
>    target and in-umem counter
> 3) Extend create verb to pass in optional trigger or wait capability to
>    perform limited umem updates based upon passed in option
> 4) Modify read operation so that it can either return the value directly or
>    just trigger an async update of a buffer backed counter (especially useful
>    if the umem counter is on a GPU, is set for a triggered update, and you just
>    want to force an immediate async update)

After all that is it still a "completion" counter? It seems like it is
counting something else than a shortcut to polling a CQ?

Jason
* Re: [PATCH for-next v2 1/5] RDMA/core: Add Completion Counters support
  2026-04-30  1:49 ` Jason Gunthorpe
@ 2026-04-30 15:38 ` Doug Ledford
  0 siblings, 0 replies; 14+ messages in thread
From: Doug Ledford @ 2026-04-30 15:38 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Michael Margolin, leon, linux-rdma, sleybo, matua, gal.pressman,
	Yonatan Nachum

On 4/29/26 8:49 PM, Jason Gunthorpe wrote:
> On Wed, Apr 29, 2026 at 06:50:54PM -0600, Doug Ledford wrote:
>> 1) Make qp attachment optional
>> 2) Extend create verb to differentiate between on-card counter with umem
>>    target and in-umem counter
>> 3) Extend create verb to pass in optional trigger or wait capability to
>>    perform limited umem updates based upon passed in option
>> 4) Modify read operation so that it can either return the value directly or
>>    just trigger an async update of a buffer backed counter (especially useful
>>    if the umem counter is on a GPU, is set for a triggered update, and you just
>>    want to force an immediate async update)
>
> After all that is it still a "completion" counter? It seems like it is
> counting something else than a shortcut to polling a CQ?
>
> Jason

Depends on your definition of "completion counter". If, by completion, you
define it narrowly as only queue pair ops completed, then no. But, given
that our hardware and the future UET hardware will both be sending to many,
many destinations via a single send Q on a single QP, just counting the QP
completions is almost useless for us. We need completions counted to a
specific other fabric end point, and aggregate counters won't help us with
that. So, from our perspective, yes, it's still a completion counter, it's
just tied to a different element for practical purposes.
* Re: [PATCH for-next v2 1/5] RDMA/core: Add Completion Counters support
  2026-04-30  0:50 ` Doug Ledford
  2026-04-30  1:49 ` Jason Gunthorpe
@ 2026-04-30 12:18 ` Michael Margolin
  2026-04-30 19:09 ` Doug Ledford
  1 sibling, 1 reply; 14+ messages in thread
From: Michael Margolin @ 2026-04-30 12:18 UTC (permalink / raw)
To: Doug Ledford
Cc: jgg, leon, linux-rdma, sleybo, matua, gal.pressman, Yonatan Nachum

On Wed, Apr 29, 2026 at 06:50:54PM -0600, Doug Ledford wrote:
> On 4/16/26 4:23 PM, Michael Margolin wrote:
> >Add core infrastructure for Completion Counters, a light-weight
> >alternative to polling CQ for tracking operation completions.
> >
> >Define the UVERBS_OBJECT_COMP_CNTR ioctl object with create, destroy,
> >set, inc and read methods for both success and error counters. Add a
> >QP attach method on the QP object to associate a completion counter
> >with a queue pair.
> >
> >The create handler constructs umem from user-provided VA or dmabuf for
> >each counter, following the CQ buffer pattern.
>
> Description here doesn't match implementation. The umem or dmabuf
> is optional, while this reads that they are the only two options.
> If neither is passed in, then the counter is on the hardware and the
> read operation is used to get the value (as per the code anyway).

Thanks, I'll make that path more clear in the commit message.

> Which raises a different scenario our hardware enables. We can pass
> in a umem on create, but that doesn't mean the counter exists in
> umem, it exists on the device and it is copied to umem. If you copy
> it on every counter update, that kills PCI-e usage, so we have an

Why would it load PCIe more than writing CQEs into a CQ?
> option to use a trigger to only update on a periodic basis (but then
> user space authors start polling on the umem location and killing
> CPU cycles, so this option is not preferred), or there is a wait
> option where you can set the target and then in your app use a wait
> call to wait for the count to be reached (we've found this is about
> the only performant way to implement these counters).
>
> Also, we don't really attach counters to QPs. That isn't usually
> what we care about counting. Given that our EPs are not connected,
> counters on it are usually only useful for recv operations where you
> can get aggregate data for a given EP. For send, it is often that
> we really want counters on a per-flow basis knowing that we have
> many flows that go through that one EP (soon to be QP). So, for us,
> we create a counter, then during our send operations, if we want a
> specific transfer to be included in a specific counter, it's flagged
> in the command we send to the hardware for that send operation.
> That implies that a proper place to hang a list of counters is
> probably off of an AH instead of a QP for us.
>
> I think we can extend this API to suit our needs, relax some of the
> current restrictions/assumptions, and be good. But, as this is a
> user visible API, if it's taken as-is, I would suggest that the
> rdma-core portion be marked as experimental until we've made the
> changes needed for our hardware in order to avoid user API churn.
>
> These changes could be summed up as:
>
> 1) Make qp attachment optional

The attachment is already a separate call that can be avoided.

> 2) Extend create verb to differentiate between on-card counter with
> umem target and in-umem counter

Can you elaborate on the extension you have in mind? This seems to me like
a purely driver/device-level implementation detail. EFA, for instance, has
device-level counters that are synced into the provided memory on each
update. Others may choose a different sync strategy.
> 3) Extend create verb to pass in optional trigger or wait capability
> to perform limited umem updates based upon passed in option

I think this can be a vendor-specific extension rather than a common
interface. Providers that want to support this mode can easily add their
own "update frequency" attribute in the create ioctl, or introduce a
"sync" verb that will do what's needed for the subsequent read to return
an up-to-date value.

> 4) Modify read operation so that it can either return the value
> directly or just trigger an async update of a buffer backed counter
> (especially useful if the umem counter is on a GPU, is set for a
> triggered update, and you just want to force an immediate async
> update)

See my suggestion above. I think what you describe here should be a
separate command.

Michael
> >
> >Reviewed-by: Yonatan Nachum <ynachum@amazon.com>
> >Signed-off-by: Michael Margolin <mrgolin@amazon.com>
> >---
> > drivers/infiniband/core/Makefile              |   1 +
> > drivers/infiniband/core/device.c              |   7 +
> > drivers/infiniband/core/rdma_core.h           |   1 +
> > drivers/infiniband/core/uverbs_cmd.c          |   1 +
> > .../core/uverbs_std_types_comp_cntr.c         | 290 ++++++++++++++++++
> > drivers/infiniband/core/uverbs_std_types_qp.c |  45 ++-
> > drivers/infiniband/core/uverbs_uapi.c         |   1 +
> > include/rdma/ib_verbs.h                       |  37 +++
> > include/uapi/rdma/ib_user_ioctl_cmds.h        |  50 +++
> > include/uapi/rdma/ib_user_ioctl_verbs.h       |  14 +
> > include/uapi/rdma/ib_user_verbs.h             |   2 +-
> > 11 files changed, 447 insertions(+), 2 deletions(-)
> > create mode 100644 drivers/infiniband/core/uverbs_std_types_comp_cntr.c
> >
> >diff --git a/drivers/infiniband/core/Makefile b/drivers/infiniband/core/Makefile
> >index dce798d8cfe6..4767339608a1 100644
> >--- a/drivers/infiniband/core/Makefile
> >+++ b/drivers/infiniband/core/Makefile
> >@@ -35,6 +35,7 @@ ib_umad-y := user_mad.o
> > ib_uverbs-y := uverbs_main.o uverbs_cmd.o uverbs_marshall.o \
> > rdma_core.o uverbs_std_types.o uverbs_ioctl.o \
> > uverbs_std_types_cq.o \
> >+ uverbs_std_types_comp_cntr.o \
> > uverbs_std_types_dmabuf.o \
> > uverbs_std_types_dmah.o \
> > uverbs_std_types_flow_action.o uverbs_std_types_dm.o \
> >diff --git a/drivers/infiniband/core/device.c b/drivers/infiniband/core/device.c
> >index 4c174f7f1070..60c41fc1aa4d 100644
> >--- a/drivers/infiniband/core/device.c
> >+++ b/drivers/infiniband/core/device.c
> >@@ -2733,6 +2733,7 @@ void ib_set_device_ops(struct ib_device *dev, const struct ib_device_ops *ops)
> > SET_DEVICE_OP(dev_ops, create_ah);
> > SET_DEVICE_OP(dev_ops, create_counters);
> > SET_DEVICE_OP(dev_ops, create_cq);
> >+ SET_DEVICE_OP(dev_ops, create_comp_cntr);
> > SET_DEVICE_OP(dev_ops, create_user_cq);
> > SET_DEVICE_OP(dev_ops, create_flow);
> > SET_DEVICE_OP(dev_ops, create_qp);
> >@@ -2753,6 +2754,7 @@ void ib_set_device_ops(struct ib_device *dev, const struct ib_device_ops *ops)
> > SET_DEVICE_OP(dev_ops, destroy_ah);
> > SET_DEVICE_OP(dev_ops, destroy_counters);
> > SET_DEVICE_OP(dev_ops, destroy_cq);
> >+ SET_DEVICE_OP(dev_ops, destroy_comp_cntr);
> > SET_DEVICE_OP(dev_ops, destroy_flow);
> > SET_DEVICE_OP(dev_ops, destroy_flow_action);
> > SET_DEVICE_OP(dev_ops, destroy_qp);
> >@@ -2804,6 +2806,8 @@ void ib_set_device_ops(struct ib_device *dev, const struct ib_device_ops *ops)
> > SET_DEVICE_OP(dev_ops, modify_hw_stat);
> > SET_DEVICE_OP(dev_ops, modify_port);
> > SET_DEVICE_OP(dev_ops, modify_qp);
> >+ SET_DEVICE_OP(dev_ops, inc_comp_cntr);
> >+ SET_DEVICE_OP(dev_ops, qp_attach_comp_cntr);
> > SET_DEVICE_OP(dev_ops, modify_srq);
> > SET_DEVICE_OP(dev_ops, modify_wq);
> > SET_DEVICE_OP(dev_ops, peek_cq);
> >@@ -2827,12 +2831,14 @@ void ib_set_device_ops(struct ib_device *dev, const struct ib_device_ops *ops)
> > SET_DEVICE_OP(dev_ops, query_ucontext);
> > SET_DEVICE_OP(dev_ops, rdma_netdev_get_params);
> > SET_DEVICE_OP(dev_ops, read_counters);
> >+ SET_DEVICE_OP(dev_ops, read_comp_cntr);
> > SET_DEVICE_OP(dev_ops, reg_dm_mr);
> > SET_DEVICE_OP(dev_ops, reg_user_mr);
> > SET_DEVICE_OP(dev_ops, reg_user_mr_dmabuf);
> > SET_DEVICE_OP(dev_ops, req_notify_cq);
> > SET_DEVICE_OP(dev_ops, rereg_user_mr);
> > SET_DEVICE_OP(dev_ops, resize_user_cq);
> >+ SET_DEVICE_OP(dev_ops, set_comp_cntr);
> > SET_DEVICE_OP(dev_ops, set_vf_guid);
> > SET_DEVICE_OP(dev_ops, set_vf_link_state);
> > SET_DEVICE_OP(dev_ops, ufile_hw_cleanup);
> >@@ -2841,6 +2847,7 @@ void ib_set_device_ops(struct ib_device *dev, const struct ib_device_ops *ops)
> > SET_OBJ_SIZE(dev_ops, ib_ah);
> > SET_OBJ_SIZE(dev_ops, ib_counters);
> > SET_OBJ_SIZE(dev_ops, ib_cq);
> >+ SET_OBJ_SIZE(dev_ops, ib_comp_cntr);
> > SET_OBJ_SIZE(dev_ops, ib_dmah);
> > SET_OBJ_SIZE(dev_ops, ib_mw);
> > SET_OBJ_SIZE(dev_ops, ib_pd);
> >diff --git a/drivers/infiniband/core/rdma_core.h b/drivers/infiniband/core/rdma_core.h
> >index 269b393799ab..2569550e4c6d 100644
> >--- a/drivers/infiniband/core/rdma_core.h
> >+++ b/drivers/infiniband/core/rdma_core.h
> >@@ -156,6 +156,7 @@ uverbs_api_ioctl_handler_fn uverbs_get_handler_fn(struct ib_udata *udata);
> > extern const struct uapi_definition uverbs_def_obj_async_fd[];
> > extern const struct uapi_definition uverbs_def_obj_counters[];
> >+extern const struct uapi_definition uverbs_def_obj_comp_cntr[];
> > extern const struct uapi_definition uverbs_def_obj_cq[];
> > extern const struct uapi_definition uverbs_def_obj_device[];
> > extern const struct uapi_definition uverbs_def_obj_dm[];
> >diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c
> >index a768436ba468..4bc493b3b624 100644
> >--- a/drivers/infiniband/core/uverbs_cmd.c
> >+++ b/drivers/infiniband/core/uverbs_cmd.c
> >@@ -3673,6 +3673,7 @@ static int ib_uverbs_ex_query_device(struct uverbs_attr_bundle *attrs)
> > resp.cq_moderation_caps.max_cq_moderation_period =
> > attr.cq_caps.max_cq_moderation_period;
> > resp.max_dm_size = attr.max_dm_size;
> >+ resp.max_comp_cntr = attr.max_comp_cntr;
> > resp.response_length = uverbs_response_length(attrs, sizeof(resp));
> > return uverbs_response(attrs, &resp, sizeof(resp));
> >diff --git a/drivers/infiniband/core/uverbs_std_types_comp_cntr.c b/drivers/infiniband/core/uverbs_std_types_comp_cntr.c
> >new file mode 100644
> >index 000000000000..7651a565bb9f
> >--- /dev/null
> >+++ b/drivers/infiniband/core/uverbs_std_types_comp_cntr.c
> >@@ -0,0 +1,290 @@
> >+// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
> >+/*
> >+ * Copyright Amazon.com, Inc. or its affiliates. All rights reserved.
> >+ */
> >+
> >+#include <rdma/uverbs_std_types.h>
> >+#include <rdma/ib_umem.h>
> >+#include <rdma/ib_umem_dmabuf.h>
> >+#include "rdma_core.h"
> >+#include "uverbs.h"
> >+
> >+static int uverbs_free_comp_cntr(struct ib_uobject *uobject, enum rdma_remove_reason why,
> >+ struct uverbs_attr_bundle *attrs)
> >+{
> >+ struct ib_comp_cntr *cc = uobject->object;
> >+ int ret;
> >+
> >+ ret = cc->device->ops.destroy_comp_cntr(cc);
> >+ if (ret)
> >+ return ret;
> >+
> >+ ib_umem_release(cc->comp_umem);
> >+ ib_umem_release(cc->err_umem);
> >+ kfree(cc);
> >+ return 0;
> >+}
> >+
> >+static int comp_cntr_get_umem(struct ib_device *ib_dev, struct uverbs_attr_bundle *attrs,
> >+ int va_attr, int fd_attr, int offset_attr, struct ib_umem **umem_out)
> >+{
> >+ struct ib_umem_dmabuf *umem_dmabuf;
> >+ u64 buffer_offset;
> >+ u64 buffer_va;
> >+ int buffer_fd;
> >+ int ret;
> >+
> >+ *umem_out = NULL;
> >+
> >+ if (uverbs_attr_is_valid(attrs, va_attr)) {
> >+ if (uverbs_attr_is_valid(attrs, fd_attr) ||
> >+ uverbs_attr_is_valid(attrs, offset_attr))
> >+ return -EINVAL;
> >+
> >+ ret = uverbs_copy_from(&buffer_va, attrs, va_attr);
> >+ if (ret)
> >+ return ret;
> >+
> >+ *umem_out = ib_umem_get(ib_dev, buffer_va, sizeof(u64), IB_ACCESS_LOCAL_WRITE);
> >+ if (IS_ERR(*umem_out)) {
> >+ ret = PTR_ERR(*umem_out);
> >+ *umem_out = NULL;
> >+ return ret;
> >+ }
> >+ } else if (uverbs_attr_is_valid(attrs, fd_attr)) {
> >+ if (uverbs_attr_is_valid(attrs, va_attr))
> >+ return -EINVAL;
> >+
> >+ ret = uverbs_get_raw_fd(&buffer_fd, attrs, fd_attr);
> >+ if (ret)
> >+ return ret;
> >+
> >+ ret = uverbs_copy_from(&buffer_offset, attrs, offset_attr);
> >+ if (ret)
> >+ return ret;
> >+
> >+ umem_dmabuf = ib_umem_dmabuf_get_pinned(ib_dev, buffer_offset, sizeof(u64),
> >+ buffer_fd, IB_ACCESS_LOCAL_WRITE);
> >+ if (IS_ERR(umem_dmabuf))
> >+ return PTR_ERR(umem_dmabuf);
> >+
> >+ *umem_out = &umem_dmabuf->umem;
> >+ }
> >+
> >+ return 0;
> >+}
> >+
> >+static int UVERBS_HANDLER(UVERBS_METHOD_COMP_CNTR_CREATE)(struct uverbs_attr_bundle *attrs)
> >+{
> >+ struct ib_uobject *uobj = uverbs_attr_get_uobject(attrs,
> >+ UVERBS_ATTR_CREATE_COMP_CNTR_HANDLE);
> >+ struct ib_device *ib_dev = attrs->context->device;
> >+ struct ib_comp_cntr *cc;
> >+ int ret;
> >+
> >+ if (!ib_dev->ops.create_comp_cntr ||
> >+ !ib_dev->ops.destroy_comp_cntr ||
> >+ !ib_dev->ops.qp_attach_comp_cntr)
> >+ return -EOPNOTSUPP;
> >+
> >+ cc = rdma_zalloc_drv_obj(ib_dev, ib_comp_cntr);
> >+ if (!cc)
> >+ return -ENOMEM;
> >+
> >+ cc->device = ib_dev;
> >+ cc->uobject = uobj;
> >+
> >+ ret = comp_cntr_get_umem(ib_dev, attrs,
> >+ UVERBS_ATTR_CREATE_COMP_CNTR_BUFFER_VA,
> >+ UVERBS_ATTR_CREATE_COMP_CNTR_BUFFER_FD,
> >+ UVERBS_ATTR_CREATE_COMP_CNTR_BUFFER_OFFSET,
> >+ &cc->comp_umem);
> >+ if (ret)
> >+ goto err_free;
> >+
> >+ ret = comp_cntr_get_umem(ib_dev, attrs,
> >+ UVERBS_ATTR_CREATE_COMP_CNTR_ERR_BUFFER_VA,
> >+ UVERBS_ATTR_CREATE_COMP_CNTR_ERR_BUFFER_FD,
> >+ UVERBS_ATTR_CREATE_COMP_CNTR_ERR_BUFFER_OFFSET,
> >+ &cc->err_umem);
> >+ if (ret)
> >+ goto err_comp_umem;
> >+
> >+ ret = ib_dev->ops.create_comp_cntr(cc, attrs);
> >+ if (ret)
> >+ goto err_err_umem;
> >+
> >+ uobj->object = cc;
> >+ uverbs_finalize_uobj_create(attrs, UVERBS_ATTR_CREATE_COMP_CNTR_HANDLE);
> >+
> >+ ret = uverbs_copy_to(attrs, UVERBS_ATTR_CREATE_COMP_CNTR_RESP_COUNT_MAX_VALUE,
> >+ &cc->comp_count_max_value, sizeof(cc->comp_count_max_value));
> >+ if (ret)
> >+ return ret;
> >+
> >+ ret = uverbs_copy_to(attrs, UVERBS_ATTR_CREATE_COMP_CNTR_RESP_ERR_COUNT_MAX_VALUE,
> >+ &cc->err_count_max_value, sizeof(cc->err_count_max_value));
> >+ return ret;
> >+
> >+err_err_umem:
> >+ ib_umem_release(cc->err_umem);
> >+err_comp_umem:
> >+ ib_umem_release(cc->comp_umem);
> >+err_free:
> >+ kfree(cc);
> >+ return ret;
> >+}
> >+
> >+static int UVERBS_HANDLER(UVERBS_METHOD_COMP_CNTR_SET)(struct uverbs_attr_bundle *attrs)
> >+{
> >+ struct ib_comp_cntr *cc = uverbs_attr_get_obj(attrs, UVERBS_ATTR_SET_COMP_CNTR_HANDLE);
> >+ enum ib_comp_cntr_entry entry;
> >+ u64 value;
> >+ int ret;
> >+
> >+ if (!cc->device->ops.set_comp_cntr)
> >+ return -EOPNOTSUPP;
> >+
> >+ ret = uverbs_get_const(&entry, attrs, UVERBS_ATTR_SET_COMP_CNTR_ENTRY);
> >+ if (ret)
> >+ return ret;
> >+
> >+ ret = uverbs_copy_from(&value, attrs, UVERBS_ATTR_SET_COMP_CNTR_VALUE);
> >+ if (ret)
> >+ return ret;
> >+
> >+ return cc->device->ops.set_comp_cntr(cc, entry, value);
> >+}
> >+
> >+static int UVERBS_HANDLER(UVERBS_METHOD_COMP_CNTR_INC)(struct uverbs_attr_bundle *attrs)
> >+{
> >+ struct ib_comp_cntr *cc = uverbs_attr_get_obj(attrs, UVERBS_ATTR_INC_COMP_CNTR_HANDLE);
> >+ enum ib_comp_cntr_entry entry;
> >+ u64 amount;
> >+ int ret;
> >+
> >+ if (!cc->device->ops.inc_comp_cntr)
> >+ return -EOPNOTSUPP;
> >+
> >+ ret = uverbs_get_const(&entry, attrs, UVERBS_ATTR_INC_COMP_CNTR_ENTRY);
> >+ if (ret)
> >+ return ret;
> >+
> >+ ret = uverbs_copy_from(&amount, attrs, UVERBS_ATTR_INC_COMP_CNTR_VALUE);
> >+ if (ret)
> >+ return ret;
> >+
> >+ return cc->device->ops.inc_comp_cntr(cc, entry, amount);
> >+}
> >+
> >+static int UVERBS_HANDLER(UVERBS_METHOD_COMP_CNTR_READ)(struct uverbs_attr_bundle *attrs)
> >+{
> >+ struct ib_comp_cntr *cc = uverbs_attr_get_obj(attrs, UVERBS_ATTR_READ_COMP_CNTR_HANDLE);
> >+ enum ib_comp_cntr_entry entry;
> >+ u64 value;
> >+ int ret;
> >+
> >+ if (!cc->device->ops.read_comp_cntr)
> >+ return -EOPNOTSUPP;
> >+
> >+ ret = uverbs_get_const(&entry, attrs, UVERBS_ATTR_READ_COMP_CNTR_ENTRY);
> >+ if (ret)
> >+ return ret;
> >+
> >+ ret = cc->device->ops.read_comp_cntr(cc, entry, &value);
> >+ if (ret)
> >+ return ret;
> >+
> >+ return uverbs_copy_to(attrs, UVERBS_ATTR_READ_COMP_CNTR_RESP_VALUE, &value, sizeof(value));
> >+}
> >+
> >+DECLARE_UVERBS_NAMED_METHOD(
> >+ UVERBS_METHOD_COMP_CNTR_CREATE,
> >+ UVERBS_ATTR_IDR(UVERBS_ATTR_CREATE_COMP_CNTR_HANDLE,
> >+ UVERBS_OBJECT_COMP_CNTR,
> >+ UVERBS_ACCESS_NEW,
> >+ UA_MANDATORY),
> >+ UVERBS_ATTR_PTR_IN(UVERBS_ATTR_CREATE_COMP_CNTR_BUFFER_VA,
> >+ UVERBS_ATTR_TYPE(u64),
> >+ UA_OPTIONAL),
> >+ UVERBS_ATTR_RAW_FD(UVERBS_ATTR_CREATE_COMP_CNTR_BUFFER_FD,
> >+ UA_OPTIONAL),
> >+ UVERBS_ATTR_PTR_IN(UVERBS_ATTR_CREATE_COMP_CNTR_BUFFER_OFFSET,
> >+ UVERBS_ATTR_TYPE(u64),
> >+ UA_OPTIONAL),
> >+ UVERBS_ATTR_PTR_IN(UVERBS_ATTR_CREATE_COMP_CNTR_ERR_BUFFER_VA,
> >+ UVERBS_ATTR_TYPE(u64),
> >+ UA_OPTIONAL),
> >+ UVERBS_ATTR_RAW_FD(UVERBS_ATTR_CREATE_COMP_CNTR_ERR_BUFFER_FD,
> >+ UA_OPTIONAL),
> >+ UVERBS_ATTR_PTR_IN(UVERBS_ATTR_CREATE_COMP_CNTR_ERR_BUFFER_OFFSET,
> >+ UVERBS_ATTR_TYPE(u64),
> >+ UA_OPTIONAL),
> >+ UVERBS_ATTR_PTR_OUT(UVERBS_ATTR_CREATE_COMP_CNTR_RESP_COUNT_MAX_VALUE,
> >+ UVERBS_ATTR_TYPE(u64),
> >+ UA_MANDATORY),
> >+ UVERBS_ATTR_PTR_OUT(UVERBS_ATTR_CREATE_COMP_CNTR_RESP_ERR_COUNT_MAX_VALUE,
> >+ UVERBS_ATTR_TYPE(u64),
> >+ UA_MANDATORY),
> >+ UVERBS_ATTR_UHW());
> >+
> >+DECLARE_UVERBS_NAMED_METHOD_DESTROY(
> >+ UVERBS_METHOD_COMP_CNTR_DESTROY,
> >+ UVERBS_ATTR_IDR(UVERBS_ATTR_DESTROY_COMP_CNTR_HANDLE,
> >+ UVERBS_OBJECT_COMP_CNTR,
> >+ UVERBS_ACCESS_DESTROY,
> >+ UA_MANDATORY));
> >+
> >+DECLARE_UVERBS_NAMED_METHOD(
> >+ UVERBS_METHOD_COMP_CNTR_SET,
> >+ UVERBS_ATTR_IDR(UVERBS_ATTR_SET_COMP_CNTR_HANDLE,
> >+ UVERBS_OBJECT_COMP_CNTR,
> >+ UVERBS_ACCESS_WRITE,
> >+ UA_MANDATORY),
> >+ UVERBS_ATTR_CONST_IN(UVERBS_ATTR_SET_COMP_CNTR_ENTRY,
> >+ enum ib_uverbs_comp_cntr_entry,
> >+ UA_MANDATORY),
> >+ UVERBS_ATTR_PTR_IN(UVERBS_ATTR_SET_COMP_CNTR_VALUE,
> >+ UVERBS_ATTR_TYPE(u64),
> >+ UA_MANDATORY));
> >+
> >+DECLARE_UVERBS_NAMED_METHOD(
> >+ UVERBS_METHOD_COMP_CNTR_INC,
> >+ UVERBS_ATTR_IDR(UVERBS_ATTR_INC_COMP_CNTR_HANDLE,
> >+ UVERBS_OBJECT_COMP_CNTR,
> >+ UVERBS_ACCESS_WRITE,
> >+ UA_MANDATORY),
> >+ UVERBS_ATTR_CONST_IN(UVERBS_ATTR_INC_COMP_CNTR_ENTRY,
> >+ enum ib_uverbs_comp_cntr_entry,
> >+ UA_MANDATORY),
> >+ UVERBS_ATTR_PTR_IN(UVERBS_ATTR_INC_COMP_CNTR_VALUE,
> >+ UVERBS_ATTR_TYPE(u64),
> >+ UA_MANDATORY));
> >+
> >+DECLARE_UVERBS_NAMED_METHOD(
> >+ UVERBS_METHOD_COMP_CNTR_READ,
> >+ UVERBS_ATTR_IDR(UVERBS_ATTR_READ_COMP_CNTR_HANDLE,
> >+ UVERBS_OBJECT_COMP_CNTR,
> >+ UVERBS_ACCESS_READ,
> >+ UA_MANDATORY),
> >+ UVERBS_ATTR_CONST_IN(UVERBS_ATTR_READ_COMP_CNTR_ENTRY,
> >+ enum ib_uverbs_comp_cntr_entry,
> >+ UA_MANDATORY),
> >+ UVERBS_ATTR_PTR_OUT(UVERBS_ATTR_READ_COMP_CNTR_RESP_VALUE,
> >+ UVERBS_ATTR_TYPE(u64),
> >+ UA_MANDATORY));
> >+
> >+DECLARE_UVERBS_NAMED_OBJECT(
> >+ UVERBS_OBJECT_COMP_CNTR,
> >+ UVERBS_TYPE_ALLOC_IDR(uverbs_free_comp_cntr),
> >+ &UVERBS_METHOD(UVERBS_METHOD_COMP_CNTR_CREATE),
> >+ &UVERBS_METHOD(UVERBS_METHOD_COMP_CNTR_DESTROY),
> >+ &UVERBS_METHOD(UVERBS_METHOD_COMP_CNTR_SET),
> >+ &UVERBS_METHOD(UVERBS_METHOD_COMP_CNTR_INC),
> >+ &UVERBS_METHOD(UVERBS_METHOD_COMP_CNTR_READ));
> >+
> >+const struct uapi_definition uverbs_def_obj_comp_cntr[] = {
> >+ UAPI_DEF_CHAIN_OBJ_TREE_NAMED(UVERBS_OBJECT_COMP_CNTR,
> >+ UAPI_DEF_OBJ_NEEDS_FN(destroy_comp_cntr)),
> >+ {}
> >+};
> >diff --git a/drivers/infiniband/core/uverbs_std_types_qp.c b/drivers/infiniband/core/uverbs_std_types_qp.c
> >index be0730e8509e..2c607b02d9d5 100644
> >--- a/drivers/infiniband/core/uverbs_std_types_qp.c
> >+++ b/drivers/infiniband/core/uverbs_std_types_qp.c
> >@@ -367,11 +367,54 @@ DECLARE_UVERBS_NAMED_METHOD(
> > UVERBS_ATTR_TYPE(struct ib_uverbs_destroy_qp_resp),
> > UA_MANDATORY));
> >+static int UVERBS_HANDLER(UVERBS_METHOD_QP_ATTACH_COMP_CNTR)(
> >+ struct uverbs_attr_bundle *attrs)
> >+{
> >+ struct ib_uobject *qp_uobj = uverbs_attr_get_uobject(
> >+ attrs, UVERBS_ATTR_QP_ATTACH_COMP_CNTR_QP_HANDLE);
> >+ struct ib_comp_cntr *cc = uverbs_attr_get_obj(
> >+ attrs, UVERBS_ATTR_QP_ATTACH_COMP_CNTR_HANDLE);
> >+ struct ib_comp_cntr_attach_attr attr = {};
> >+ struct ib_qp *qp = qp_uobj->object;
> >+ int ret;
> >+
> >+ if (!cc->device->ops.qp_attach_comp_cntr)
> >+ return -EOPNOTSUPP;
> >+
> >+ ret = uverbs_get_flags32(&attr.op_mask, attrs,
> >+ UVERBS_ATTR_QP_ATTACH_COMP_CNTR_OP_MASK,
> >+ IB_UVERBS_COMP_CNTR_ATTACH_OP_SEND |
> >+ IB_UVERBS_COMP_CNTR_ATTACH_OP_RECV |
> >+ IB_UVERBS_COMP_CNTR_ATTACH_OP_RDMA_READ |
> >+ IB_UVERBS_COMP_CNTR_ATTACH_OP_REMOTE_RDMA_READ |
> >+ IB_UVERBS_COMP_CNTR_ATTACH_OP_RDMA_WRITE |
> >+ IB_UVERBS_COMP_CNTR_ATTACH_OP_REMOTE_RDMA_WRITE);
> >+ if (ret)
> >+ return ret;
> >+
> >+ return qp->device->ops.qp_attach_comp_cntr(qp, cc, &attr);
> >+}
> >+
> >+DECLARE_UVERBS_NAMED_METHOD(
> >+ UVERBS_METHOD_QP_ATTACH_COMP_CNTR,
> >+ UVERBS_ATTR_IDR(UVERBS_ATTR_QP_ATTACH_COMP_CNTR_QP_HANDLE,
> >+ UVERBS_OBJECT_QP,
> >+ UVERBS_ACCESS_WRITE,
> >+ UA_MANDATORY),
> >+ UVERBS_ATTR_IDR(UVERBS_ATTR_QP_ATTACH_COMP_CNTR_HANDLE,
> >+ UVERBS_OBJECT_COMP_CNTR,
> >+ UVERBS_ACCESS_READ,
> >+ UA_MANDATORY),
> >+ UVERBS_ATTR_FLAGS_IN(UVERBS_ATTR_QP_ATTACH_COMP_CNTR_OP_MASK,
> >+ enum ib_uverbs_comp_cntr_attach_op,
> >+ UA_OPTIONAL));
> >+
> > DECLARE_UVERBS_NAMED_OBJECT(
> > UVERBS_OBJECT_QP,
> > UVERBS_TYPE_ALLOC_IDR_SZ(sizeof(struct ib_uqp_object), uverbs_free_qp),
> > &UVERBS_METHOD(UVERBS_METHOD_QP_CREATE),
> >- &UVERBS_METHOD(UVERBS_METHOD_QP_DESTROY));
> >+ &UVERBS_METHOD(UVERBS_METHOD_QP_DESTROY),
> >+ &UVERBS_METHOD(UVERBS_METHOD_QP_ATTACH_COMP_CNTR));
> > const struct uapi_definition uverbs_def_obj_qp[] = {
> > UAPI_DEF_CHAIN_OBJ_TREE_NAMED(UVERBS_OBJECT_QP,
> >diff --git a/drivers/infiniband/core/uverbs_uapi.c b/drivers/infiniband/core/uverbs_uapi.c
> >index 31b248295854..a3f42a50a14f 100644
> >--- a/drivers/infiniband/core/uverbs_uapi.c
> >+++ b/drivers/infiniband/core/uverbs_uapi.c
> >@@ -628,6 +628,7 @@ void uverbs_destroy_api(struct uverbs_api *uapi)
> > static const struct uapi_definition uverbs_core_api[] = {
> > UAPI_DEF_CHAIN(uverbs_def_obj_async_fd),
> > UAPI_DEF_CHAIN(uverbs_def_obj_counters),
> >+ UAPI_DEF_CHAIN(uverbs_def_obj_comp_cntr),
> > UAPI_DEF_CHAIN(uverbs_def_obj_cq),
> > UAPI_DEF_CHAIN(uverbs_def_obj_device),
> > UAPI_DEF_CHAIN(uverbs_def_obj_dm),
> >diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
> >index 9dd76f489a0b..b0db80447bf0 100644
> >--- a/include/rdma/ib_verbs.h
> >+++ b/include/rdma/ib_verbs.h
> >@@ -453,6 +453,7 @@ struct ib_device_attr {
> > u64 max_dm_size;
> > /* Max entries for sgl for optimized performance per READ */
> > u32 max_sgl_rd;
> >+ u32 max_comp_cntr;
> > };
> > enum ib_mtu {
> >@@ -1746,6 +1747,33 @@ struct ib_cq {
> > struct rdma_restrack_entry res;
> > };
> >+struct ib_comp_cntr {
> >+ struct ib_device *device;
> >+ struct ib_uobject *uobject;
> >+ struct ib_umem *comp_umem;
> >+ struct ib_umem *err_umem;
> >+ u64 comp_count_max_value;
> >+ u64 err_count_max_value;
> >+};
> >+
> >+enum ib_comp_cntr_entry {
> >+ IB_COMP_CNTR_ENTRY_COMP = IB_UVERBS_COMP_CNTR_ENTRY_COMP,
> >+ IB_COMP_CNTR_ENTRY_ERR = IB_UVERBS_COMP_CNTR_ENTRY_ERR,
> >+};
> >+
> >+enum ib_comp_cntr_attach_op {
> >+ IB_COMP_CNTR_ATTACH_OP_SEND = IB_UVERBS_COMP_CNTR_ATTACH_OP_SEND,
> >+ IB_COMP_CNTR_ATTACH_OP_RECV = IB_UVERBS_COMP_CNTR_ATTACH_OP_RECV,
> >+ IB_COMP_CNTR_ATTACH_OP_RDMA_READ = IB_UVERBS_COMP_CNTR_ATTACH_OP_RDMA_READ,
> >+ IB_COMP_CNTR_ATTACH_OP_REMOTE_RDMA_READ = IB_UVERBS_COMP_CNTR_ATTACH_OP_REMOTE_RDMA_READ,
> >+ IB_COMP_CNTR_ATTACH_OP_RDMA_WRITE = IB_UVERBS_COMP_CNTR_ATTACH_OP_RDMA_WRITE,
> >+ IB_COMP_CNTR_ATTACH_OP_REMOTE_RDMA_WRITE = IB_UVERBS_COMP_CNTR_ATTACH_OP_REMOTE_RDMA_WRITE,
> >+};
> >+
> >+struct ib_comp_cntr_attach_attr {
> >+ u32 op_mask;
> >+};
> >+
> > struct ib_srq {
> > struct ib_device *device;
> > struct ib_pd *pd;
> >@@ -2624,6 +2652,8 @@ struct ib_device_ops {
> > struct ib_udata *udata);
> > int (*modify_qp)(struct ib_qp *qp, struct ib_qp_attr *qp_attr,
> > int qp_attr_mask, struct ib_udata *udata);
> >+ int (*qp_attach_comp_cntr)(struct ib_qp *qp, struct ib_comp_cntr *cc,
> >+ struct ib_comp_cntr_attach_attr *attr);
> > int (*query_qp)(struct ib_qp *qp, struct ib_qp_attr *qp_attr,
> > int qp_attr_mask, struct ib_qp_init_attr *qp_init_attr);
> > int (*destroy_qp)(struct ib_qp *qp, struct ib_udata *udata);
> >@@ -2645,6 +2675,12 @@ struct ib_device_ops {
> > * post_destroy_cq - Free all kernel resources
> > */
> > void (*post_destroy_cq)(struct ib_cq *cq);
> >+ int (*create_comp_cntr)(struct ib_comp_cntr *cc,
> >+ struct uverbs_attr_bundle *attrs);
> >+ int (*destroy_comp_cntr)(struct ib_comp_cntr *cc);
> >+ int (*set_comp_cntr)(struct ib_comp_cntr *cc, enum ib_comp_cntr_entry entry, u64 value);
> >+ int (*inc_comp_cntr)(struct ib_comp_cntr *cc, enum ib_comp_cntr_entry entry, u64 amount);
> >+ int (*read_comp_cntr)(struct ib_comp_cntr *cc, enum ib_comp_cntr_entry entry, u64 *value);
> > struct ib_mr *(*get_dma_mr)(struct ib_pd *pd, int mr_access_flags);
> > struct ib_mr *(*reg_user_mr)(struct ib_pd *pd, u64 start, u64 length,
> > u64 virt_addr, int mr_access_flags,
> >@@ -2878,6 +2914,7 @@ struct ib_device_ops {
> > DECLARE_RDMA_OBJ_SIZE(ib_ah);
> > DECLARE_RDMA_OBJ_SIZE(ib_counters);
> > DECLARE_RDMA_OBJ_SIZE(ib_cq);
> >+ DECLARE_RDMA_OBJ_SIZE(ib_comp_cntr);
> > DECLARE_RDMA_OBJ_SIZE(ib_dmah);
> > DECLARE_RDMA_OBJ_SIZE(ib_mw);
> > DECLARE_RDMA_OBJ_SIZE(ib_pd);
> >diff --git a/include/uapi/rdma/ib_user_ioctl_cmds.h b/include/uapi/rdma/ib_user_ioctl_cmds.h
> >index 72041c1b0ea5..6ff6a2afdc60 100644
> >--- a/include/uapi/rdma/ib_user_ioctl_cmds.h
> >+++ b/include/uapi/rdma/ib_user_ioctl_cmds.h
> >@@ -57,6 +57,7 @@ enum uverbs_default_objects {
> > UVERBS_OBJECT_ASYNC_EVENT,
> > UVERBS_OBJECT_DMAH,
> > UVERBS_OBJECT_DMABUF,
> >+ UVERBS_OBJECT_COMP_CNTR,
> > };
> > enum {
> >@@ -168,6 +169,7 @@ enum uverbs_attrs_destroy_qp_cmd_attr_ids {
> > enum uverbs_methods_qp {
> > UVERBS_METHOD_QP_CREATE,
> > UVERBS_METHOD_QP_DESTROY,
> >+ UVERBS_METHOD_QP_ATTACH_COMP_CNTR,
> > };
> > enum uverbs_attrs_create_srq_cmd_attr_ids {
> >@@ -434,4 +436,52 @@ enum uverbs_attrs_query_gid_entry_cmd_attr_ids {
> > UVERBS_ATTR_QUERY_GID_ENTRY_RESP_ENTRY,
> > };
> >+enum uverbs_methods_comp_cntr {
> >+ UVERBS_METHOD_COMP_CNTR_CREATE,
> >+ UVERBS_METHOD_COMP_CNTR_DESTROY,
> >+ UVERBS_METHOD_COMP_CNTR_SET,
> >+ UVERBS_METHOD_COMP_CNTR_INC,
> >+ UVERBS_METHOD_COMP_CNTR_READ,
> >+};
> >+
> >+enum uverbs_attrs_create_comp_cntr_cmd_attr_ids {
> >+ UVERBS_ATTR_CREATE_COMP_CNTR_HANDLE,
> >+ UVERBS_ATTR_CREATE_COMP_CNTR_BUFFER_VA,
> >+ UVERBS_ATTR_CREATE_COMP_CNTR_BUFFER_FD,
> >+ UVERBS_ATTR_CREATE_COMP_CNTR_BUFFER_OFFSET,
> >+ UVERBS_ATTR_CREATE_COMP_CNTR_ERR_BUFFER_VA,
> >+ UVERBS_ATTR_CREATE_COMP_CNTR_ERR_BUFFER_FD,
> >+ UVERBS_ATTR_CREATE_COMP_CNTR_ERR_BUFFER_OFFSET,
> >+ UVERBS_ATTR_CREATE_COMP_CNTR_RESP_COUNT_MAX_VALUE,
> >+ UVERBS_ATTR_CREATE_COMP_CNTR_RESP_ERR_COUNT_MAX_VALUE,
> >+};
> >+
> >+enum uverbs_attrs_destroy_comp_cntr_cmd_attr_ids {
> >+ UVERBS_ATTR_DESTROY_COMP_CNTR_HANDLE,
> >+};
> >+
> >+enum uverbs_attrs_set_comp_cntr_cmd_attr_ids {
> >+ UVERBS_ATTR_SET_COMP_CNTR_HANDLE,
> >+ UVERBS_ATTR_SET_COMP_CNTR_ENTRY,
> >+ UVERBS_ATTR_SET_COMP_CNTR_VALUE,
> >+};
> >+
> >+enum uverbs_attrs_inc_comp_cntr_cmd_attr_ids {
> >+ UVERBS_ATTR_INC_COMP_CNTR_HANDLE,
> >+ UVERBS_ATTR_INC_COMP_CNTR_ENTRY,
> >+ UVERBS_ATTR_INC_COMP_CNTR_VALUE,
> >+};
> >+
> >+enum uverbs_attrs_read_comp_cntr_cmd_attr_ids {
> >+ UVERBS_ATTR_READ_COMP_CNTR_HANDLE,
> >+ UVERBS_ATTR_READ_COMP_CNTR_ENTRY,
> >+ UVERBS_ATTR_READ_COMP_CNTR_RESP_VALUE,
> >+};
> >+
> >+enum uverbs_attrs_qp_attach_comp_cntr_cmd_attr_ids {
> >+ UVERBS_ATTR_QP_ATTACH_COMP_CNTR_QP_HANDLE,
> >+ UVERBS_ATTR_QP_ATTACH_COMP_CNTR_HANDLE,
> >+ UVERBS_ATTR_QP_ATTACH_COMP_CNTR_OP_MASK,
> >+};
> >+
> > #endif
> >diff --git a/include/uapi/rdma/ib_user_ioctl_verbs.h b/include/uapi/rdma/ib_user_ioctl_verbs.h
> >index 90c5cd8e7753..f38236b056a7 100644
> >--- a/include/uapi/rdma/ib_user_ioctl_verbs.h
> >+++ b/include/uapi/rdma/ib_user_ioctl_verbs.h
> >@@ -273,4 +273,18 @@ struct ib_uverbs_gid_entry {
> > __u32 netdev_ifindex; /* It is 0 if there is no netdev associated with it */
> > };
> >+enum ib_uverbs_comp_cntr_entry {
> >+ IB_UVERBS_COMP_CNTR_ENTRY_COMP,
> >+ IB_UVERBS_COMP_CNTR_ENTRY_ERR,
> >+};
> >+
> >+enum ib_uverbs_comp_cntr_attach_op {
> >+ IB_UVERBS_COMP_CNTR_ATTACH_OP_SEND = 1 << 0,
> >+ IB_UVERBS_COMP_CNTR_ATTACH_OP_RECV = 1 << 1,
> >+ IB_UVERBS_COMP_CNTR_ATTACH_OP_RDMA_READ = 1 << 2,
> >+ IB_UVERBS_COMP_CNTR_ATTACH_OP_REMOTE_RDMA_READ = 1 << 3,
> >+ IB_UVERBS_COMP_CNTR_ATTACH_OP_RDMA_WRITE = 1 << 4,
> >+ IB_UVERBS_COMP_CNTR_ATTACH_OP_REMOTE_RDMA_WRITE = 1 << 5,
> >+};
> >+
> > #endif
> >diff --git a/include/uapi/rdma/ib_user_verbs.h b/include/uapi/rdma/ib_user_verbs.h
> >index 3b7bd99813e9..45d142f4a7f8 100644
> >--- a/include/uapi/rdma/ib_user_verbs.h
> >+++ b/include/uapi/rdma/ib_user_verbs.h
> >@@ -299,7 +299,7 @@ struct ib_uverbs_ex_query_device_resp {
> > struct ib_uverbs_cq_moderation_caps cq_moderation_caps;
> > __aligned_u64 max_dm_size;
> > __u32 xrc_odp_caps;
> >- __u32 reserved;
> >+ __u32 max_comp_cntr;
> > };
> > struct ib_uverbs_query_port {
>
> --
> Doug Ledford <doug.ledford@hpe.com>
> GPG KeyID: B826A3330E572FDD
> Key fingerprint = AE6B 1BDA 122B 23B4 265B 1274 B826 A333 0E57 2FDD

^ permalink raw reply	[flat|nested] 14+ messages in thread
* Re: [PATCH for-next v2 1/5] RDMA/core: Add Completion Counters support
  2026-04-30 12:18   ` Michael Margolin
@ 2026-04-30 19:09     ` Doug Ledford
  2026-04-30 22:33       ` Sean Hefty
  0 siblings, 1 reply; 14+ messages in thread
From: Doug Ledford @ 2026-04-30 19:09 UTC (permalink / raw)
To: Michael Margolin
Cc: jgg, leon, linux-rdma, sleybo, matua, gal.pressman, Yonatan Nachum

[-- Attachment #1.1: Type: text/plain, Size: 8076 bytes --]

On 4/30/26 7:18 AM, Michael Margolin wrote:
> On Wed, Apr 29, 2026 at 06:50:54PM -0600, Doug Ledford wrote:
>> On 4/16/26 4:23 PM, Michael Margolin wrote:
>>> Add core infrastructure for Completion Counters, a light-weight
>>> alternative to polling CQ for tracking operation completions.
>>>
>>> Define the UVERBS_OBJECT_COMP_CNTR ioctl object with create, destroy,
>>> set, inc and read methods for both success and error counters. Add a
>>> QP attach method on the QP object to associate a completion counter
>>> with a queue pair.
>>>
>>> The create handler constructs umem from user-provided VA or dmabuf for
>>> each counter, following the CQ buffer pattern.
>>
>> Description here doesn't match implementation. The umem or dmabuf
>> is optional, while this reads that they are the only two options.
>> If neither is passed in, then the counter is on the hardware and the
>> read operation is used to get the value (as per the code anyway).
>
> Thanks, I'll make that path more clear in the commit message.
>>
>> Which raises a different scenario our hardware enables. We can pass
>> in a umem on create, but that doesn't mean the counter exists in
>> umem, it exists on the device and it is copied to umem. If you copy
>> it on every counter update, that kills PCI-e usage, so we have an
>
> Why would it load PCIe more than writing CQEs into a CQ?

You don't necessarily signal every single completion, but updating
umem with every single counter update has that effect. It's just
unnecessary PCI-e bandwidth consumed.
And it happens to be in addition to any other CQE updates, etc. From
our experience, it adds up when you have lots of counters. Writing
every update out for 1,000s of counters doesn't scale well. So while
the API works well for you as it is, no doubt if we show up in a few
months with our hardware wanting to do something similar, we will be
told "you should use the official interface for this", so we need
this interface to be acceptable to our use patterns also.

>> option to use a trigger to only update on a periodic basis (but then
>> user space authors start polling on the umem location and killing
>> CPU cycles, so this option is not preferred), or there is a wait
>> option where you can set the target and then in your app use a wait
>> call to wait for the count to be reached (we've found this is about
>> the only performant way to implement these counters).
>>
>> Also, we don't really attach counters to QPs. That isn't usually
>> what we care about counting. Given that our EPs are not connected,
>> counters on it are usually only useful for recv operations where you
>> can get aggregate data for a given EP. For send, it is often that
>> we really want counters on a per-flow basis knowing that we have
>> many flows that go through that one EP (soon to be QP). So, for us,
>> we create a counter, then during our send operations, if we want a
>> specific transfer to be included in a specific counter, it's flagged
>> in the command we send to the hardware for that send operation.
>> That implies that a proper place to hang a list of counters is
>> probably off of an AH instead of a QP for us.
>>
>> I think we can extend this API to suit our needs, relax some of the
>> current restrictions/assumptions, and be good. But, as this is a
>> user visible API, if it's taken as-is, I would suggest that the
>> rdma-core portion be marked as experimental until we've made the
>> changes needed for our hardware in order to avoid user API churn.
>>
>> These changes could be summed up as:
>>
>> 1) Make qp attachment optional
>
> The attachment is already a separate call that can be avoided.

Yes, but the code that tells the user whether or not it is supported
will refuse to recognize the feature without the attach_qp function
pointer being registered. So, while it may be optional to attach the
qp, the routine to attach it to a qp is not optional. I was referring
to that. I mean, we could make a stub, but it would likely just
return -EOPNOTSUPP or somesuch. And, the API as it stands, doesn't
have any way to consume a counter without attaching it to a QP. In
our case, we will be needing to add an extension to the already
extended AH that we will need for UET, and in that we will attach the
counter handle to specific data transfer commands, and that's how our
counters will get updated. For this patchset to truly make attaching
the counter to a qp optional, you would need to add another way to
consume it. So you say it's optional in this patchset, but it's
really not as far as I can tell.

>> 2) Extend create verb to differentiate between on-card counter with
>> umem target and in-umem counter
>
> Can you elaborate on the extension you have on your mind? This seems
> to me as a totally driver-device level implementation detail. EFA for
> instance has device level counters that are being synced into the
> provided memory on each update. Others may choose a different sync
> strategy.

Allow me to backtrack and rephrase somewhat. My original comment was
brought out by the man pages that say, in a nutshell, the counter
resides in umem but you can't modify it directly, you must use the
accessor functions to modify it. This implies to readers that the
actual counter is in umem. I think that's inaccurate. As you say, you
sync the counter to umem. The counter resides on the device. This is
really why they have to use the accessors, otherwise any direct
updates would just get overwritten later.
So allow me to rephrase as: Remove the wording in the man pages that
the counter is in umem (as it really isn't), then add some optional
create flags for sync method and frequency. With the counters really
being on the device and synced to umem, optional sync triggers make
sense. Doing it on every update could be the default, but I can also
see doing a sync on: timer interval, count interval, specific trigger
event of another sort, etc. Or, conversely, just be prepared for us
to bring the optional sync triggers as an extension to the base API
later.

>> 3) Extend create verb to pass in optional trigger or wait capability
>> to perform limited umem updates based upon passed in option
>
> I think this can be vendor specific extension rather than a common
> interface. Providers that want to support this mode can easily add
> their own "update frequency" attribute in create ioctl or introduce
> a "sync" verb that will do what's needed for the sequential read to
> return an up-to-date value.

It's actually a useful API (from experience), so I would prefer it
didn't have to be vendor specific, aka we don't want people having to
custom code to our hardware for something generally useful. I
wouldn't say it should be required from all vendors, but I'm
reluctant to make it vendor specific instead of just an optional
extension to the base API. I am, however, perfectly happy to provide
the extension to the base API as opposed to requiring it be part of
the initial patchset.

>> 4) Modify read operation so that it can either return the value
>> directly or just trigger an async update of a buffer backed counter
>> (especially useful if the umem counter is on a GPU, is set for a
>> triggered update, and you just want to force an immediate async
>> update)
>
> See my suggestion above. I think what you describe here should be a
> separate command.

We can add a flush command.
Same deal as read, only instead of returning the value to the caller,
it flushes it to whatever the destination is supposed to be. Or, the
semantics of read could be modified such that a read also triggers a
flush if there is a known umem or gpu mem target for the counter.
Either way works, although I think I might prefer the flush variant.

--
Doug Ledford <doug.ledford@hpe.com>
    GPG KeyID: B826A3330E572FDD
    Key fingerprint = AE6B 1BDA 122B 23B4 265B 1274 B826 A333 0E57 2FDD

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 840 bytes --]
* RE: [PATCH for-next v2 1/5] RDMA/core: Add Completion Counters support
  2026-04-30 19:09     ` Doug Ledford
@ 2026-04-30 22:33       ` Sean Hefty
  2026-05-04 12:51         ` Michael Margolin
  0 siblings, 1 reply; 14+ messages in thread
From: Sean Hefty @ 2026-04-30 22:33 UTC (permalink / raw)
To: Doug Ledford, Michael Margolin
Cc: Jason Gunthorpe, leon@kernel.org, linux-rdma@vger.kernel.org,
	sleybo@amazon.com, matua@amazon.com, gal.pressman@linux.dev,
	Yonatan Nachum

> >>> Add core infrastructure for Completion Counters, a light-weight
> >>> alternative to polling CQ for tracking operation completions.
> >>>
> >>> Define the UVERBS_OBJECT_COMP_CNTR ioctl object with create,
> >>> destroy, set, inc and read methods for both success and error
> >>> counters. Add a QP attach method on the QP object to associate a
> >>> completion counter with a queue pair.
> >>>
> >>> The create handler constructs umem from user-provided VA or dmabuf
> >>> for each counter, following the CQ buffer pattern.
> >>
> >> Description here doesn't match implementation. The umem or dmabuf is
> >> optional, while this reads that they are the only two options.
> >> If neither is passed in, then the counter is on the hardware and the
> >> read operation is used to get the value (as per the code anyway).
> >
> > Thanks, I'll make that path more clear in the commit message.
> >>
> >> Which raises a different scenario our hardware enables. We can pass
> >> in a umem on create, but that doesn't mean the counter exists in
> >> umem, it exists on the device and it is copied to umem. If you copy
> >> it on every counter update, that kills PCI-e usage, so we have an
> >
> > Why would it load PCIe more than writing CQEs into a CQ?
>
> You don't necessarily signal every single completion, but updating umem with
> every single counter update has that effect. It's just unnecessary PCI-e
> bandwidth consumed. And it happens to be in addition to any other CQE
> updates, etc.
From our experience, it adds up when you have lots of counters. > Writing every update out for 1,000s of counters doesn't scale well. So while > the API works well for you as it is, no doubt if we show up in a few months > with our hardware wanting to do something similar, we will be told "you > should use the official interface for this", so we need this interface to be > acceptable to our use patterns also. > > >> option to use a trigger to only update on a periodic basis (but then > >> user space authors start polling on the umem location and killing CPU > >> cycles, so this option is not preferred), or there is a wait option > >> where you can set the target and then in your app use a wait call to > >> wait for the count to be reached (we've found this is about the only > >> performant way to implement these counters). > >> > >> Also, we don't really attach counters to QPs. That isn't usually > >> what we care about counting. Given that our EPs are not connected, > >> counters on it are usually only useful for recv operations where you > >> can get aggregate data for a given EP. For send, it is often that we > >> really want counters on a per-flow basis knowing that we have many > >> flows that go through that one EP (soon to be QP). So, for us, we > >> create a counter, then during our send operations, if we want a > >> specific transfer to be included in a specific counter, it's flagged > >> in the command we send to the hardware for that send operation. > >> That implies that a proper place to hang a list of counters is > >> probably off of an AH instead of a QP for us. > >> > >> I think we can extend this API to suit our needs, relax some of the > >> current restrictions/assumptions, and be good. But, as this is a > >> user visible API, if it's taken as-is, I would suggest that the > >> rdma-core portion be marked as experimental until we've made the > >> changes needed for our hardware in order to avoid user API churn. 
> >> > >> These changes could be summed up as: > >> > >> 1) Make qp attachment optional > > > > The attachment is already a separate call that can be avoided. > > Yes. but the code that tells the user whether or not it is supported will refuse > to recognize the feature without the attach_qp function pointer being > registered. So, while it may be optional to attach the qp, the routine to attach > it to a qp is not optional. I was referring to that. I mean, we could make a > stub, but it would likely just return -EOPNOTSUPP or somesuch. And, the API > as it stands, doesn't have any way to consume a counter without attaching it > to a QP. In our case, we will be needing to add an extension to the already > extended AH that we will need for UET and in that we will attach the counter > handle to specific data transfer commands and that's how our counters will > get updated. For this patchset to truly make attaching the counter to a qp > optional, you would need to add another way to consume it. So you say it's > optional in this patchset, but it's really not as far as I can tell. > > >> 2) Extend create verb to differentiate between on-card counter with > >> umem target and in-umem counter > > > > Can you elaborate on the extension you have on your mind? This seems > > to me as a totally driver-device level implementation detail. EFA for > > instance has device level counters that are being synced into the > > provided memory on each update. Others may choose a different sync > > strategy. > > Allow me to backtrack and rephrase somewhat. My original comment was > brought out by the man pages that say, in a nutshell, the counter resides in > umem but you can't modify it directly, you must use the accessor functions to > modify it. This implies to readers that the actual counter is in umem. I think > that's inaccurate. As you say, you sync the counter to umem. The counter > resides on the device. 
This is really why they have to use the accessors, > otherwise any direct updates would just get overwritten later. > > So allow me to rephrase as: Remove the wording in the man pages that the > counter is in umem (as it really isn't), then add some optional create flags for > sync method and frequency. With the counters really being on the device and > synced to umem, optional sync triggers makes sense. > Doing it on every update could be the default, but I can also see doing a sync > on: timer interval, count interval, specific trigger event of another sort, etc. Or, > conversely, just be prepared for us to bring the optional sync triggers as an > extension to the base API later. > > >> 3) Extend create verb to pass in optional trigger or wait capability > >> to perform limited umem updates based upon passed in option > > > > I think this can be vendor specific extension rather than a common > > interface. Providers that want to support this mode can easily add > > their own "update frequency" attribute in create ioctl or introduce a > > "sync" verb that will do what's needed for the sequential read to > > return an up-to-date value. > > It's actually a useful API (from experience), so I would prefer it didn't have to > be vendor specific, aka we don't want people having to custom code to our > hardware for something generally useful. I wouldn't say it should be required > from all vendors, but I'm reluctant to make it vendor specific instead of just an > optional extension to the base API. > I am, however, perfectly happy to provide the extension to the base API as > opposed to requiring it be part of the initial patchset. > > >> 4) Modify read operation so that it can either return the value > >> directly or just trigger an async update of a buffer backed counter > >> (especially useful if the umem counter is on a GPU, is set for a > >> triggered update, and you just want to force an immediate async > >> update) > > > > See my suggestion above. 
I think what you describe here should be a
> > separate command.
>
> We can add a flush command. Same deal as read, only instead of returning
> the value to the caller, it flushes it to whatever the destination is supposed to
> be. Or, the semantics of read could be modified such that a read also triggers a
> flush if there is a known umem or gpu mem target for the counter. Either way,
> although I think I might prefer the flush variant.

I agree with most of what Doug said. To be more specific, this is my current thought:

Define counter group: 1 to N counters of size X (e.g. u32, u64)

1. A counter group is associated with an MR on creation.
2. A flush operation writes 1 or more values from a counter group to the MR.
   Provider flushes the entire group or selectively flushes 1 value
   Depends on implementation and higher-level SW semantic
   Flush may be a no-op
3. Future: flush takes parameters to control when the write is required
   Take-away is that these are flush parameters, not counter attributes

I expect flush to be handled by the userspace verbs provider, so it may not need a kernel ABI at this time or be standardized.

A libibverbs API aligned with libfabric would look like this:

Define completion counter: a success + error counter pair

ibv_create_compcntr(ctx, attr, &cntr)
ibv_read_cntr(cntr, &val)
ibv_read_err_cntr(cntr, &val)

To support different HW, I was suggesting the kernel use a different construct, a counter group (previously called a counter array). There's only 1 MR per counter group. If required, it is the provider's responsibility to allocate multiple groups and piece them together (Jason's suggestion).

The read_cntr() API suggests that the provider owns the MR for the counter group. Allowing direct user access to the MR implies the user knows how to interpret the value(s) being written, so I don't think a user-provided MR makes sense.

- Sean
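Sean's counter-group semantics can be modeled in a few lines of plain C. This is only an illustrative sketch — `cntr_group`, `cntr_group_hw_inc()` and the plain buffer standing in for the MR are invented names, not proposed uAPI — but it captures the key property under discussion: increments stay device-side, and only an explicit flush consumes bus bandwidth by publishing values to the MR.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical model of a "counter group": N device-side counters whose
 * values become user visible only after an explicit flush into a target
 * buffer (standing in for the MR). Names are illustrative stand-ins. */
struct cntr_group {
	unsigned int ncntrs;
	uint64_t *hw_vals;  /* device-side state, not user visible */
	uint64_t *mr_vals;  /* user-visible buffer, updated on flush */
};

static struct cntr_group *cntr_group_create(unsigned int ncntrs,
					    uint64_t *mr_vals)
{
	struct cntr_group *grp = calloc(1, sizeof(*grp));

	grp->ncntrs = ncntrs;
	grp->hw_vals = calloc(ncntrs, sizeof(uint64_t));
	grp->mr_vals = mr_vals; /* group bound to one MR at create time */
	return grp;
}

/* Device-side increment: does NOT touch the MR, so no per-event PCIe write. */
static void cntr_group_hw_inc(struct cntr_group *grp, unsigned int idx)
{
	grp->hw_vals[idx]++;
}

/* Flush: publish current values to the MR. For HW that already keeps the
 * counter in host memory this may be a no-op. */
static void cntr_group_flush(struct cntr_group *grp)
{
	memcpy(grp->mr_vals, grp->hw_vals, grp->ncntrs * sizeof(uint64_t));
}
```

A provider could also flush selectively (one value instead of the whole group), which is why the flush granularity is left to the implementation in the proposal above.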
* Re: [PATCH for-next v2 1/5] RDMA/core: Add Completion Counters support 2026-04-30 22:33 ` Sean Hefty @ 2026-05-04 12:51 ` Michael Margolin 2026-05-04 15:37 ` Sean Hefty 0 siblings, 1 reply; 14+ messages in thread From: Michael Margolin @ 2026-05-04 12:51 UTC (permalink / raw) To: Sean Hefty Cc: Doug Ledford, Jason Gunthorpe, leon@kernel.org, linux-rdma@vger.kernel.org, sleybo@amazon.com, matua@amazon.com, gal.pressman@linux.dev, Yonatan Nachum On Thu, Apr 30, 2026 at 10:33:55PM +0000, Sean Hefty wrote: > > >>> Add core infrastructure for Completion Counters, a light-weight > > >>> alternative to polling CQ for tracking operation completions. > > >>> > > >>> Define the UVERBS_OBJECT_COMP_CNTR ioctl object with create, > > >>> destroy, set, inc and read methods for both success and error > > >>> counters. Add a QP attach method on the QP object to associate a > > >>> completion counter with a queue pair. > > >>> > > >>> The create handler constructs umem from user-provided VA or dmabuf > > >>> for each counter, following the CQ buffer pattern. > > >> > > >> Description here doesn't match implementation. The umem or dmabuf is > > >> optional, while this reads that they are the only two options. > > >> If neither is passed in, then the counter is on the hardware and the > > >> read operation is used to get the value (as per the code anyway). > > > > > > Thanks, I'll make that path more clear in the commit message. > > >> > > >> Which raises a different scenario our hardware enables. We can pass > > >> in a umem on create, but that doesn't mean the counter exists in > > >> umem, it exists on the device and it is copied to umem. If you copy > > >> it on every counter update, that kills PCI-e usage, so we have an > > > > > > Why would it load PCIe more than writing CQEs into a CQ? > > > > You don't necessarily signal every single completion, but updating umem with > > every single counter update has that effect. It's just unnecessary PCI-e > > bandwidth consumed. 
And it happens to be in addition to any other CQE > > updates, etc. From our experience, it adds up when you have lots of counters. > > Writing every update out for 1,000s of counters doesn't scale well. So while > > the API works well for you as it is, no doubt if we show up in a few months > > with our hardware wanting to do something similar, we will be told "you > > should use the official interface for this", so we need this interface to be > > acceptable to our use patterns also. > > > > >> option to use a trigger to only update on a periodic basis (but then > > >> user space authors start polling on the umem location and killing CPU > > >> cycles, so this option is not preferred), or there is a wait option > > >> where you can set the target and then in your app use a wait call to > > >> wait for the count to be reached (we've found this is about the only > > >> performant way to implement these counters). > > >> > > >> Also, we don't really attach counters to QPs. That isn't usually > > >> what we care about counting. Given that our EPs are not connected, > > >> counters on it are usually only useful for recv operations where you > > >> can get aggregate data for a given EP. For send, it is often that we > > >> really want counters on a per-flow basis knowing that we have many > > >> flows that go through that one EP (soon to be QP). So, for us, we > > >> create a counter, then during our send operations, if we want a > > >> specific transfer to be included in a specific counter, it's flagged > > >> in the command we send to the hardware for that send operation. > > >> That implies that a proper place to hang a list of counters is > > >> probably off of an AH instead of a QP for us. > > >> > > >> I think we can extend this API to suit our needs, relax some of the > > >> current restrictions/assumptions, and be good. 
But, as this is a > > >> user visible API, if it's taken as-is, I would suggest that the > > >> rdma-core portion be marked as experimental until we've made the > > >> changes needed for our hardware in order to avoid user API churn. > > >> > > >> These changes could be summed up as: > > >> > > >> 1) Make qp attachment optional > > > > > > The attachment is already a separate call that can be avoided. > > > > Yes. but the code that tells the user whether or not it is supported will refuse > > to recognize the feature without the attach_qp function pointer being > > registered. So, while it may be optional to attach the qp, the routine to attach > > it to a qp is not optional. I was referring to that. I mean, we could make a > > stub, but it would likely just return -EOPNOTSUPP or somesuch. And, the API > > as it stands, doesn't have any way to consume a counter without attaching it > > to a QP. In our case, we will be needing to add an extension to the already > > extended AH that we will need for UET and in that we will attach the counter > > handle to specific data transfer commands and that's how our counters will > > get updated. For this patchset to truly make attaching the counter to a qp > > optional, you would need to add another way to consume it. So you say it's > > optional in this patchset, but it's really not as far as I can tell. > > > > >> 2) Extend create verb to differentiate between on-card counter with > > >> umem target and in-umem counter > > > > > > Can you elaborate on the extension you have on your mind? This seems > > > to me as a totally driver-device level implementation detail. EFA for > > > instance has device level counters that are being synced into the > > > provided memory on each update. Others may choose a different sync > > > strategy. > > > > Allow me to backtrack and rephrase somewhat. 
My original comment was > > brought out by the man pages that say, in a nutshell, the counter resides in > > umem but you can't modify it directly, you must use the accessor functions to > > modify it. This implies to readers that the actual counter is in umem. I think > > that's inaccurate. As you say, you sync the counter to umem. The counter > > resides on the device. This is really why they have to use the accessors, > > otherwise any direct updates would just get overwritten later. > > > > So allow me to rephrase as: Remove the wording in the man pages that the > > counter is in umem (as it really isn't), then add some optional create flags for > > sync method and frequency. With the counters really being on the device and > > synced to umem, optional sync triggers makes sense. > > Doing it on every update could be the default, but I can also see doing a sync > > on: timer interval, count interval, specific trigger event of another sort, etc. Or, > > conversely, just be prepared for us to bring the optional sync triggers as an > > extension to the base API later. > > > > >> 3) Extend create verb to pass in optional trigger or wait capability > > >> to perform limited umem updates based upon passed in option > > > > > > I think this can be vendor specific extension rather than a common > > > interface. Providers that want to support this mode can easily add > > > their own "update frequency" attribute in create ioctl or introduce a > > > "sync" verb that will do what's needed for the sequential read to > > > return an up-to-date value. > > > > It's actually a useful API (from experience), so I would prefer it didn't have to > > be vendor specific, aka we don't want people having to custom code to our > > hardware for something generally useful. I wouldn't say it should be required > > from all vendors, but I'm reluctant to make it vendor specific instead of just an > > optional extension to the base API. 
> > I am, however, perfectly happy to provide the extension to the base API as > > opposed to requiring it be part of the initial patchset. > > > > >> 4) Modify read operation so that it can either return the value > > >> directly or just trigger an async update of a buffer backed counter > > >> (especially useful if the umem counter is on a GPU, is set for a > > >> triggered update, and you just want to force an immediate async > > >> update) > > > > > > See my suggestion above. I think what you describe here should be a > > > separate command. > > > > We can add a flush command. Same deal as read, only instead of returning > > the value to the caller, it flushes it to whatever the destination is supposed to > > be. Or, the semantics of read could be modified such that a read also triggers a > > flush if there is a known umem or gpu mem target for the counter. Either way, > > although I think I might prefer the flush variant. > > I agree with most of what Doug said. To be more specific, this is my current thought: > > Define counter group: 1 to N counters to size X (e.g. u32, u64) > > 1. A counter group is associated with a MR on creation. > 2. A flush operation writes 1 or more values from a counter group to the MR. > Provider flushes the entire group or selectively flushes 1 value > Depends on implementation and higher-level SW semantic > Flush may be a no-op > 3. Future: flush takes parameters to control when the write is required > Take-away is that these are flush parameters, not counter attributes > > I expect flush to be handled by the userspace verbs provider, so it may not need a kernel ABI at this time or be standardized. 
> A libibverbs API aligned with libfabric would look like this:
>
> Define completion counter: a success + error counter pair
>
> ibv_create_compcntr(ctx, attr, &cntr)
> ibv_read_cntr(cntr, &val)
> ibv_read_err_cntr(cntr, &val)
>
> To support different HW, I was suggesting the kernel use a different construct, a
> counter group (previously called a counter array). There's only 1 MR per counter
> group. If required, it is the provider's responsibility to allocate multiple groups
> and piece them together (Jason's suggestion).
>
> The read_cntr() API suggests that the provider owns the MR for the counter
> group. Allowing direct user access to the MR implies the user knows how to
> interpret the value(s) being written, so I don't think a user provided MR makes
> sense.
>
> - Sean

I'll try to answer you both here.

I feel like a lot of the confusion comes from the option to pass user-provided memory for completion counter usage. Although this option didn't force any specific device implementation or dictate how/when count values are written to that memory, I've removed this support from the common libibverbs interface. Additionally, following the discussion in [1], I'm going to move buffer attributes and umem ownership to drivers in a way that can later be converted to use core helpers once we have them.

Similarly, counter flush and update frequency isn't supported by all HW vendors (including EFA), and I didn't plan to add it at this stage. That said, I do want to make sure we are not closing the door on those features and that the interfaces can be extended to support them.

Here's how I see possible future extensions:

At the Completion Counter level, an optional flush command can be added and can translate to a nop when not required for a given HW. As Sean suggested, it can take additional params or flags to allow more fine-grained control over the operation.
If for performance reasons one would like to "place" multiple Completion Counters together and flush their values with a single operation, we can introduce the following interface:

ibv_create_comp_cntr_group()
ibv_flush_comp_cntr_group()

And extend ibv_create_comp_cntr() with an optional comp_cntr_group param.

As I see it, a single Completion Counter is always a pair of success and error counts.

I can't see anything in the code that is blocking the possibility of supporting attach to an AH in the future.

Michael

[1] https://lore.kernel.org/all/jpobfdsuuj4wmrqkxzpjmfjxgz6vn2m6a6wy666yfapv6hzytj@6g5qrelixuwe/
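The group extension Michael sketches could look roughly like the following in userspace terms. Every name here (`comp_cntr_group`, `create_comp_cntr()`, `flush_comp_cntr_group()`) is a hypothetical stand-in for the proposed ibv_* verbs, with in-memory stubs in place of hardware; the point is only the shape of the API — an optional group parameter at create time, and one flush publishing every member's success/error pair.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical shape of the proposed group extension. Not a real
 * rdma-core API: all names and semantics are illustrative stubs. */
#define GROUP_MAX 8

struct comp_cntr {
	uint64_t hw_comp, hw_err;   /* device-side state */
	uint64_t pub_comp, pub_err; /* user-visible, updated on flush */
};

struct comp_cntr_group {
	struct comp_cntr *members[GROUP_MAX];
	unsigned int nmembers;
};

/* Optional group param, mirroring "extend ibv_create_comp_cntr() with an
 * optional comp_cntr_group param"; NULL creates a standalone counter. */
static struct comp_cntr *create_comp_cntr(struct comp_cntr_group *grp)
{
	struct comp_cntr *cc = calloc(1, sizeof(*cc));

	if (grp && grp->nmembers < GROUP_MAX)
		grp->members[grp->nmembers++] = cc;
	return cc;
}

/* One operation publishes every member's success/error pair. */
static void flush_comp_cntr_group(struct comp_cntr_group *grp)
{
	for (unsigned int i = 0; i < grp->nmembers; i++) {
		grp->members[i]->pub_comp = grp->members[i]->hw_comp;
		grp->members[i]->pub_err = grp->members[i]->hw_err;
	}
}
```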
* RE: [PATCH for-next v2 1/5] RDMA/core: Add Completion Counters support 2026-05-04 12:51 ` Michael Margolin @ 2026-05-04 15:37 ` Sean Hefty 0 siblings, 0 replies; 14+ messages in thread From: Sean Hefty @ 2026-05-04 15:37 UTC (permalink / raw) To: Michael Margolin Cc: Doug Ledford, Jason Gunthorpe, leon@kernel.org, linux-rdma@vger.kernel.org, sleybo@amazon.com, matua@amazon.com, gal.pressman@linux.dev, Yonatan Nachum > > Define counter group: 1 to N counters to size X (e.g. u32, u64) > > > > 1. A counter group is associated with a MR on creation. > > 2. A flush operation writes 1 or more values from a counter group to the MR. > > Provider flushes the entire group or selectively flushes 1 value > > Depends on implementation and higher-level SW semantic > > Flush may be a no-op > > 3. Future: flush takes parameters to control when the write is required > > Take-away is that these are flush parameters, not counter > > attributes > > > > I expect flush to be handled by the userspace verbs provider, so it may not need > a kernel ABI at this time or be standardized. > > > > > > A libibverbs API aligned with libfabric would look like this: > > > > Define completion counter: a success + error counter pair > > > > ibv_create_compcntr(ctx, attr, &cntr) > > ibv_read_cntr(cntr, &val) > > ibv_read_err_cntr(cntr, &val) > > > > To support different HW, I was suggesting the kernel use a different construct, a > counter group (previously called a counter array). There's only 1 MR per counter > group. If required, it is the provider's responsibility to allocate multiple groups > and piece them together (Jason's suggestion). > > > > The read_cntr() API suggests that the provider owns the MR for the counter > group. Allowing direct user access to the MR implies the user knows how to > interpret the value(s) being written, so I don't think a user provided MR makes > sense. > > > > - Sean > > I'll try to answer you both here. 
> > I feel like a lot of the confusion comes from the option to pass user-provided > memory for completion counter usage. Although this option didn't force any > specific device implementation or dictate how/when count values are written to > that memory, I've removed this support from the common libibverbs interface. > Additionally, following the discussion in [1], I'm going to move buffer attributes > and umem ownership to drivers in a way that can later be converted to use core > helpers once we have them. > > Similarly, counter flush and update frequency isn't supported by all HW vendors > (including EFA), and I didn't plan to add it at this stage. That said, I do want to > make sure we are not closing the door on those features and that the interfaces > can be extended to support them. > > Here's how I see possible future extensions: > > At the Completion Counter level, an optional flush command can be added and > can translate to a nop when not required for a given HW. > As Sean suggested, it can take additional params or flags to allow more fine- > grained control over the operation. > > If for performance reasons one would like to "place" multiple Completion > Counters together and flush their values with a single operation, we can introduce > the following interface: > > ibv_create_comp_cntr_group() > ibv_flush_comp_cntr_group() > > And extend ibv_create_comp_cntr() with an optional comp_cntr_group param. > > As I see it, a single Completion Counter is always a pair of success and error > counts. This is the same as I was envisioning. A completion counter is always presented to the libibverbs user as a pair of counters - one for success, another for errors. The difference is that the interface to the kernel is a more generic counter group. It doesn't know anything about success or error counters. It just needs a size. E.g. the EFA verbs provider would create 2 groups, each group with only 1 counter. 
EFA selects which is the success counter, which is the error, and userspace stitches them together to present as a completion counter.

The CXI verbs provider creates 1 group with 2 counters. One counter tracks success, the other errors.

The placing of multiple individual counters together doesn't target performance. It targets that NICs may implement this differently. Some NICs stitch 2 independent counters together (which is what it appears EFA is doing), some have a single construct (CXI).

I don't think rdma-core needs to care about "success" or "error". IMO, it's conceivable that a NIC might need to merge multiple success counters together because it counts sends and RDMA traffic separately.

- Sean
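The stitching Sean describes can be illustrated with a small C model. `cntr_group`, `comp_cntr_view` and the `stitch_*` helpers are invented names — real providers would map device state rather than plain structs — but the sketch shows how two different hardware layouts (two 1-counter groups vs. one 2-counter group) present identically to the libibverbs user as a success/error pair.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative model (not code from the series): the kernel construct is
 * a generic counter group of some size, and the userspace provider
 * stitches groups into the success/error pair that users see. */
struct cntr_group {
	uint64_t vals[2];
	unsigned int size;
};

/* What the libibverbs user sees: pointers into provider-owned groups. */
struct comp_cntr_view {
	uint64_t *success;
	uint64_t *error;
};

/* "EFA-style": two groups, one counter each; provider decides which
 * group holds the success counter and which the error counter. */
static struct comp_cntr_view stitch_two_groups(struct cntr_group *a,
					       struct cntr_group *b)
{
	return (struct comp_cntr_view){ .success = &a->vals[0],
					.error = &b->vals[0] };
}

/* "CXI-style": one group with two counters. */
static struct comp_cntr_view stitch_one_group(struct cntr_group *g)
{
	return (struct comp_cntr_view){ .success = &g->vals[0],
					.error = &g->vals[1] };
}
```

Either way, the kernel side only needs a size, not any notion of "success" or "error".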
* [PATCH for-next v2 2/5] RDMA/core: Prevent destroying in-use completion counters 2026-04-16 21:23 [PATCH for-next v2 0/5] Introduce Completion Counters Michael Margolin 2026-04-16 21:23 ` [PATCH for-next v2 1/5] RDMA/core: Add Completion Counters support Michael Margolin @ 2026-04-16 21:23 ` Michael Margolin 2026-04-16 21:23 ` [PATCH for-next v2 3/5] RDMA/core: Add Completion Counters to resource tracking Michael Margolin ` (2 subsequent siblings) 4 siblings, 0 replies; 14+ messages in thread From: Michael Margolin @ 2026-04-16 21:23 UTC (permalink / raw) To: jgg, leon, linux-rdma; +Cc: sleybo, matua, gal.pressman, Yonatan Nachum Reject comp_cntr destroy while it is attached to any QP. Track attachments using an xarray in ib_qp keyed by the attach op_mask. Use op bitmask to reject overlapping attaches early. Reviewed-by: Yonatan Nachum <ynachum@amazon.com> Signed-off-by: Michael Margolin <mrgolin@amazon.com> --- .../core/uverbs_std_types_comp_cntr.c | 3 +++ drivers/infiniband/core/uverbs_std_types_qp.c | 22 ++++++++++++++++++- drivers/infiniband/core/verbs.c | 1 + include/rdma/ib_verbs.h | 3 +++ 4 files changed, 28 insertions(+), 1 deletion(-) diff --git a/drivers/infiniband/core/uverbs_std_types_comp_cntr.c b/drivers/infiniband/core/uverbs_std_types_comp_cntr.c index 7651a565bb9f..6fd9f485692d 100644 --- a/drivers/infiniband/core/uverbs_std_types_comp_cntr.c +++ b/drivers/infiniband/core/uverbs_std_types_comp_cntr.c @@ -15,6 +15,9 @@ static int uverbs_free_comp_cntr(struct ib_uobject *uobject, enum rdma_remove_re struct ib_comp_cntr *cc = uobject->object; int ret; + if (atomic_read(&cc->usecnt)) + return -EBUSY; + ret = cc->device->ops.destroy_comp_cntr(cc); if (ret) return ret; diff --git a/drivers/infiniband/core/uverbs_std_types_qp.c b/drivers/infiniband/core/uverbs_std_types_qp.c index 2c607b02d9d5..d4e214c56de9 100644 --- a/drivers/infiniband/core/uverbs_std_types_qp.c +++ b/drivers/infiniband/core/uverbs_std_types_qp.c @@ -15,6 +15,8 @@ static int 
uverbs_free_qp(struct ib_uobject *uobject, struct ib_qp *qp = uobject->object; struct ib_uqp_object *uqp = container_of(uobject, struct ib_uqp_object, uevent.uobject); + struct ib_comp_cntr *cc; + unsigned long index; int ret; /* @@ -35,6 +37,10 @@ static int uverbs_free_qp(struct ib_uobject *uobject, if (ret) return ret; + xa_for_each(&qp->comp_cntrs, index, cc) + atomic_dec(&cc->usecnt); + xa_destroy(&qp->comp_cntrs); + if (uqp->uxrcd) atomic_dec(&uqp->uxrcd->refcnt); @@ -392,7 +398,21 @@ static int UVERBS_HANDLER(UVERBS_METHOD_QP_ATTACH_COMP_CNTR)( if (ret) return ret; - return qp->device->ops.qp_attach_comp_cntr(qp, cc, &attr); + if (attr.op_mask & qp->comp_cntr_op_mask) + return -EBUSY; + + ret = qp->device->ops.qp_attach_comp_cntr(qp, cc, &attr); + if (ret) + return ret; + + ret = xa_err(xa_store(&qp->comp_cntrs, attr.op_mask, cc, GFP_KERNEL)); + if (ret) + return ret; + + atomic_inc(&cc->usecnt); + qp->comp_cntr_op_mask |= attr.op_mask; + + return 0; } DECLARE_UVERBS_NAMED_METHOD( diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c index bac87de9cc67..df9a1bb9ece4 100644 --- a/drivers/infiniband/core/verbs.c +++ b/drivers/infiniband/core/verbs.c @@ -1293,6 +1293,7 @@ static struct ib_qp *create_qp(struct ib_device *dev, struct ib_pd *pd, qp->qp_context = attr->qp_context; spin_lock_init(&qp->mr_lock); + xa_init(&qp->comp_cntrs); INIT_LIST_HEAD(&qp->rdma_mrs); INIT_LIST_HEAD(&qp->sig_mrs); init_completion(&qp->srq_completion); diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h index b0db80447bf0..02f2e4dfd1c1 100644 --- a/include/rdma/ib_verbs.h +++ b/include/rdma/ib_verbs.h @@ -1754,6 +1754,7 @@ struct ib_comp_cntr { struct ib_umem *err_umem; u64 comp_count_max_value; u64 err_count_max_value; + atomic_t usecnt; }; enum ib_comp_cntr_entry { @@ -1944,6 +1945,8 @@ struct ib_qp { struct completion srq_completion; struct ib_xrcd *xrcd; /* XRC TGT QPs only */ struct list_head xrcd_list; + struct xarray comp_cntrs; /* op_mask 
-> comp_cntr */ + u32 comp_cntr_op_mask; /* count times opened, mcast attaches, flow attaches */ atomic_t usecnt; -- 2.47.3 ^ permalink raw reply related [flat|nested] 14+ messages in thread
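The bookkeeping this patch adds can be condensed into a small userspace model (plain ints instead of an xarray and atomics; the `model_*` names are invented): an attach whose op bitmask overlaps an existing attach is rejected, each successful attach pins the counter, and destroying the counter fails with -EBUSY until the QP releases it.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Simplified model of the in-use tracking added by this patch. */
struct model_cc {
	int usecnt;
};

struct model_qp {
	uint32_t comp_cntr_op_mask; /* union of attached op masks */
};

static int qp_attach_comp_cntr(struct model_qp *qp, struct model_cc *cc,
			       uint32_t op_mask)
{
	if (op_mask & qp->comp_cntr_op_mask)
		return -EBUSY; /* some op already counted by another cntr */
	qp->comp_cntr_op_mask |= op_mask;
	cc->usecnt++; /* pin the counter while attached */
	return 0;
}

static int destroy_comp_cntr(struct model_cc *cc)
{
	if (cc->usecnt)
		return -EBUSY; /* still attached to a QP */
	return 0;
}

/* uverbs_free_qp drops each attached counter's reference. */
static void destroy_qp(struct model_qp *qp, struct model_cc *cc)
{
	cc->usecnt--;
	qp->comp_cntr_op_mask = 0;
}
```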
* [PATCH for-next v2 3/5] RDMA/core: Add Completion Counters to resource tracking 2026-04-16 21:23 [PATCH for-next v2 0/5] Introduce Completion Counters Michael Margolin 2026-04-16 21:23 ` [PATCH for-next v2 1/5] RDMA/core: Add Completion Counters support Michael Margolin 2026-04-16 21:23 ` [PATCH for-next v2 2/5] RDMA/core: Prevent destroying in-use completion counters Michael Margolin @ 2026-04-16 21:23 ` Michael Margolin 2026-04-16 21:23 ` [PATCH for-next v2 4/5] RDMA/efa: Update device interface Michael Margolin 2026-04-28 22:36 ` [PATCH for-next v2 0/5] Introduce Completion Counters Doug Ledford 4 siblings, 0 replies; 14+ messages in thread From: Michael Margolin @ 2026-04-16 21:23 UTC (permalink / raw) To: jgg, leon, linux-rdma; +Cc: sleybo, matua, gal.pressman, Yonatan Nachum Track completion counter objects in the resource tracking database so they are visible through the rdma netlink interface. The rdma tool displays the comp_cntr count in the resource summary. Add RDMA_RESTRACK_COMP_CNTR type, embed rdma_restrack_entry in ib_comp_cntr, and add the res_to_dev mapping. Register the resource on create and remove it on destroy. 
Reviewed-by: Yonatan Nachum <ynachum@amazon.com> Signed-off-by: Michael Margolin <mrgolin@amazon.com> --- drivers/infiniband/core/nldev.c | 1 + drivers/infiniband/core/restrack.c | 2 ++ drivers/infiniband/core/uverbs_std_types_comp_cntr.c | 6 ++++++ include/rdma/ib_verbs.h | 1 + include/rdma/restrack.h | 4 ++++ 5 files changed, 14 insertions(+) diff --git a/drivers/infiniband/core/nldev.c b/drivers/infiniband/core/nldev.c index 96c745d5bac4..155954fef3e2 100644 --- a/drivers/infiniband/core/nldev.c +++ b/drivers/infiniband/core/nldev.c @@ -446,6 +446,7 @@ static int fill_res_info(struct sk_buff *msg, struct ib_device *device, [RDMA_RESTRACK_MR] = "mr", [RDMA_RESTRACK_CTX] = "ctx", [RDMA_RESTRACK_SRQ] = "srq", + [RDMA_RESTRACK_COMP_CNTR] = "comp_cntr", }; struct nlattr *table_attr; diff --git a/drivers/infiniband/core/restrack.c b/drivers/infiniband/core/restrack.c index ac3688952cab..d152cc5f042b 100644 --- a/drivers/infiniband/core/restrack.c +++ b/drivers/infiniband/core/restrack.c @@ -102,6 +102,8 @@ static struct ib_device *res_to_dev(struct rdma_restrack_entry *res) return container_of(res, struct ib_srq, res)->device; case RDMA_RESTRACK_DMAH: return container_of(res, struct ib_dmah, res)->device; + case RDMA_RESTRACK_COMP_CNTR: + return container_of(res, struct ib_comp_cntr, res)->device; default: WARN_ONCE(true, "Wrong resource tracking type %u\n", res->type); return NULL; diff --git a/drivers/infiniband/core/uverbs_std_types_comp_cntr.c b/drivers/infiniband/core/uverbs_std_types_comp_cntr.c index 6fd9f485692d..49b96c2413fb 100644 --- a/drivers/infiniband/core/uverbs_std_types_comp_cntr.c +++ b/drivers/infiniband/core/uverbs_std_types_comp_cntr.c @@ -8,6 +8,7 @@ #include <rdma/ib_umem_dmabuf.h> #include "rdma_core.h" #include "uverbs.h" +#include "restrack.h" static int uverbs_free_comp_cntr(struct ib_uobject *uobject, enum rdma_remove_reason why, struct uverbs_attr_bundle *attrs) @@ -22,6 +23,7 @@ static int uverbs_free_comp_cntr(struct ib_uobject *uobject, 
enum rdma_remove_re if (ret) return ret; + rdma_restrack_del(&cc->res); ib_umem_release(cc->comp_umem); ib_umem_release(cc->err_umem); kfree(cc); @@ -117,7 +119,11 @@ static int UVERBS_HANDLER(UVERBS_METHOD_COMP_CNTR_CREATE)(struct uverbs_attr_bun if (ret) goto err_err_umem; + rdma_restrack_new(&cc->res, RDMA_RESTRACK_COMP_CNTR); + rdma_restrack_set_name(&cc->res, NULL); + uobj->object = cc; + rdma_restrack_add(&cc->res); uverbs_finalize_uobj_create(attrs, UVERBS_ATTR_CREATE_COMP_CNTR_HANDLE); ret = uverbs_copy_to(attrs, UVERBS_ATTR_CREATE_COMP_CNTR_RESP_COUNT_MAX_VALUE, diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h index 02f2e4dfd1c1..9628aaa2f0c0 100644 --- a/include/rdma/ib_verbs.h +++ b/include/rdma/ib_verbs.h @@ -1755,6 +1755,7 @@ struct ib_comp_cntr { u64 comp_count_max_value; u64 err_count_max_value; atomic_t usecnt; + struct rdma_restrack_entry res; }; enum ib_comp_cntr_entry { diff --git a/include/rdma/restrack.h b/include/rdma/restrack.h index 451f99e3717d..4ab72bc6d8c7 100644 --- a/include/rdma/restrack.h +++ b/include/rdma/restrack.h @@ -60,6 +60,10 @@ enum rdma_restrack_type { * @RDMA_RESTRACK_DMAH: DMA handle */ RDMA_RESTRACK_DMAH, + /** + * @RDMA_RESTRACK_COMP_CNTR: Completion Counter + */ + RDMA_RESTRACK_COMP_CNTR, /** * @RDMA_RESTRACK_MAX: Last entry, used for array dclarations */ -- 2.47.3 ^ permalink raw reply related [flat|nested] 14+ messages in thread
* [PATCH for-next v2 4/5] RDMA/efa: Update device interface
  2026-04-16 21:23 [PATCH for-next v2 0/5] Introduce Completion Counters Michael Margolin
                   ` (2 preceding siblings ...)
  2026-04-16 21:23 ` [PATCH for-next v2 3/5] RDMA/core: Add Completion Counters to resource tracking Michael Margolin
@ 2026-04-16 21:23 ` Michael Margolin
  2026-04-28 22:36 ` [PATCH for-next v2 0/5] Introduce Completion Counters Doug Ledford
  4 siblings, 0 replies; 14+ messages in thread
From: Michael Margolin @ 2026-04-16 21:23 UTC (permalink / raw)
  To: jgg, leon, linux-rdma
  Cc: sleybo, matua, gal.pressman, Daniel Kinsbursky, Yonatan Nachum

Align device interface definitions.

Reviewed-by: Daniel Kinsbursky <dkinsb@amazon.com>
Reviewed-by: Yonatan Nachum <ynachum@amazon.com>
Signed-off-by: Michael Margolin <mrgolin@amazon.com>
---
 .../infiniband/hw/efa/efa_admin_cmds_defs.h   | 185 +++++++++++++++++-
 drivers/infiniband/hw/efa/efa_io_defs.h       |  62 +++++-
 2 files changed, 242 insertions(+), 5 deletions(-)

diff --git a/drivers/infiniband/hw/efa/efa_admin_cmds_defs.h b/drivers/infiniband/hw/efa/efa_admin_cmds_defs.h
index ad34ea5da6b0..2d75edabeefa 100644
--- a/drivers/infiniband/hw/efa/efa_admin_cmds_defs.h
+++ b/drivers/infiniband/hw/efa/efa_admin_cmds_defs.h
@@ -31,7 +31,12 @@ enum efa_admin_aq_opcode {
 	EFA_ADMIN_CREATE_EQ = 18,
 	EFA_ADMIN_DESTROY_EQ = 19,
 	EFA_ADMIN_ALLOC_MR = 20,
-	EFA_ADMIN_MAX_OPCODE = 20,
+	EFA_ADMIN_SERVICE = 21,
+	EFA_ADMIN_CREATE_COUNTER = 25,
+	EFA_ADMIN_DESTROY_COUNTER = 26,
+	EFA_ADMIN_ATTACH_COUNTER = 27,
+	EFA_ADMIN_MODIFY_COUNTER = 28,
+	EFA_ADMIN_MAX_OPCODE = 28,
 };

 enum efa_admin_aq_feature_id {
@@ -725,7 +730,9 @@ struct efa_admin_feature_device_attr_desc {
	 *    on TX queues
	 * 4 : unsolicited_write_recv - If set, unsolicited
	 *    write with imm. receive is supported
-	 * 31:5 : reserved - MBZ
+	 * 5 : event_counters - If set, event counters are
+	 *    supported
+	 * 31:6 : reserved - MBZ
	 */
	u32 device_caps;
@@ -814,6 +821,34 @@ struct efa_admin_feature_queue_attr_desc_1 {
 struct efa_admin_feature_queue_attr_desc_2 {
 	/* Maximum size of data that can be sent inline in a Send WQE */
 	u16 inline_buf_size_ex;
+
+	/* MBZ */
+	u8 reserved[6];
+
+	/*
+	 * Supported counter QP events
+	 * 0 : send_comp
+	 * 1 : send_comp_err
+	 * 2 : recv_comp
+	 * 3 : recv_comp_err
+	 * 4 : read_comp
+	 * 5 : read_comp_err
+	 * 6 : write_comp
+	 * 7 : write_comp_err
+	 * 8 : remote_read_comp
+	 * 9 : remote_write_comp
+	 * 31:10 : reserved - MBZ
+	 */
+	u32 supported_counter_qp_events;
+
+	/* Maximum number of counters */
+	u32 max_event_counters;
+
+	/*
+	 * Maximum counter value, counter wraps around to 0 after reaching
+	 * this value
+	 */
+	u64 event_counter_max_val;
 };

 struct efa_admin_event_queue_attr_desc {
@@ -1092,6 +1127,127 @@ struct efa_admin_host_info {
 	u32 flags;
 };

+struct efa_admin_service_cmd {
+	struct efa_admin_aq_common_desc aq_common_descriptor;
+
+	u8 buffer[60];
+};
+
+struct efa_admin_service_resp {
+	struct efa_admin_acq_common_desc acq_common_desc;
+
+	u8 buffer[56];
+};
+
+/* Create Counter command */
+struct efa_admin_create_counter_cmd {
+	struct efa_admin_aq_common_desc aq_common_descriptor;
+
+	/* UAR number */
+	u16 uar;
+
+	/* MBZ */
+	u16 reserved;
+
+	/* Counter physical address */
+	u64 paddr;
+};
+
+struct efa_admin_create_counter_resp {
+	struct efa_admin_acq_common_desc acq_common_desc;
+
+	/* Counter handle */
+	u32 cntr_handle;
+
+	/* MBZ */
+	u32 reserved;
+};
+
+struct efa_admin_destroy_counter_cmd {
+	struct efa_admin_aq_common_desc aq_common_descriptor;
+
+	/* Counter handle */
+	u32 cntr_handle;
+};
+
+struct efa_admin_destroy_counter_resp {
+	struct efa_admin_acq_common_desc acq_common_desc;
+};
+
+enum efa_admin_counter_attach_type {
+	EFA_ADMIN_COUNTER_ATTACH_QP_EVENTS = 0,
+};
+
+struct efa_admin_counter_attach_qp_events {
+	/* QP handle */
+	u32 qp_handle;
+
+	/*
+	 * Bitmask of counter QP events
+	 * 0 : send_comp
+	 * 1 : send_comp_err
+	 * 2 : recv_comp
+	 * 3 : recv_comp_err
+	 * 4 : read_comp
+	 * 5 : read_comp_err
+	 * 6 : write_comp
+	 * 7 : write_comp_err
+	 * 8 : remote_read_comp
+	 * 9 : remote_write_comp
+	 * 31:10 : reserved - MBZ
+	 */
+	u32 events;
+};
+
+struct efa_admin_attach_counter_cmd {
+	struct efa_admin_aq_common_desc aq_common_descriptor;
+
+	/* Counter handle */
+	u32 cntr_handle;
+
+	/* efa_admin_counter_attach_type */
+	u8 attach_type;
+
+	/* MBZ */
+	u8 reserved[3];
+
+	union {
+		struct efa_admin_counter_attach_qp_events qp_events;
+	} u;
+};
+
+struct efa_admin_attach_counter_resp {
+	struct efa_admin_acq_common_desc acq_common_desc;
+};
+
+/* Counter modify operations */
+enum efa_admin_counter_modify_ops {
+	/* Set counter value */
+	EFA_ADMIN_COUNTER_MODIFY_SET = 0,
+	/* Add to counter value */
+	EFA_ADMIN_COUNTER_MODIFY_ADD = 1,
+};
+
+struct efa_admin_modify_counter_cmd {
+	struct efa_admin_aq_common_desc aq_common_descriptor;
+
+	/* Counter handle */
+	u32 cntr_handle;
+
+	/* Counter operation type (efa_admin_counter_modify_ops) */
+	u8 operation;
+
+	/* MBZ */
+	u8 reserved[7];
+
+	/* Value for SET or ADD */
+	u64 value;
+};
+
+struct efa_admin_modify_counter_resp {
+	struct efa_admin_acq_common_desc acq_common_desc;
+};
+
 /* create_qp_cmd */
 #define EFA_ADMIN_CREATE_QP_CMD_SQ_VIRT_MASK BIT(0)
 #define EFA_ADMIN_CREATE_QP_CMD_RQ_VIRT_MASK BIT(1)
@@ -1132,6 +1288,19 @@ struct efa_admin_host_info {
 #define EFA_ADMIN_FEATURE_DEVICE_ATTR_DESC_DATA_POLLING_128_MASK BIT(2)
 #define EFA_ADMIN_FEATURE_DEVICE_ATTR_DESC_RDMA_WRITE_MASK BIT(3)
 #define EFA_ADMIN_FEATURE_DEVICE_ATTR_DESC_UNSOLICITED_WRITE_RECV_MASK BIT(4)
+#define EFA_ADMIN_FEATURE_DEVICE_ATTR_DESC_EVENT_COUNTERS_MASK BIT(5)
+
+/* feature_queue_attr_desc_2 */
+#define EFA_ADMIN_FEATURE_QUEUE_ATTR_DESC_2_SEND_COMP_MASK BIT(0)
+#define EFA_ADMIN_FEATURE_QUEUE_ATTR_DESC_2_SEND_COMP_ERR_MASK BIT(1)
+#define EFA_ADMIN_FEATURE_QUEUE_ATTR_DESC_2_RECV_COMP_MASK BIT(2)
+#define EFA_ADMIN_FEATURE_QUEUE_ATTR_DESC_2_RECV_COMP_ERR_MASK BIT(3)
+#define EFA_ADMIN_FEATURE_QUEUE_ATTR_DESC_2_READ_COMP_MASK BIT(4)
+#define EFA_ADMIN_FEATURE_QUEUE_ATTR_DESC_2_READ_COMP_ERR_MASK BIT(5)
+#define EFA_ADMIN_FEATURE_QUEUE_ATTR_DESC_2_WRITE_COMP_MASK BIT(6)
+#define EFA_ADMIN_FEATURE_QUEUE_ATTR_DESC_2_WRITE_COMP_ERR_MASK BIT(7)
+#define EFA_ADMIN_FEATURE_QUEUE_ATTR_DESC_2_REMOTE_READ_COMP_MASK BIT(8)
+#define EFA_ADMIN_FEATURE_QUEUE_ATTR_DESC_2_REMOTE_WRITE_COMP_MASK BIT(9)

 /* create_eq_cmd */
 #define EFA_ADMIN_CREATE_EQ_CMD_ENTRY_SIZE_WORDS_MASK GENMASK(4, 0)
@@ -1150,4 +1319,16 @@ struct efa_admin_host_info {
 #define EFA_ADMIN_HOST_INFO_INTREE_MASK BIT(0)
 #define EFA_ADMIN_HOST_INFO_GDR_MASK BIT(1)

+/* counter_attach_qp_events */
+#define EFA_ADMIN_COUNTER_ATTACH_QP_EVENTS_SEND_COMP_MASK BIT(0)
+#define EFA_ADMIN_COUNTER_ATTACH_QP_EVENTS_SEND_COMP_ERR_MASK BIT(1)
+#define EFA_ADMIN_COUNTER_ATTACH_QP_EVENTS_RECV_COMP_MASK BIT(2)
+#define EFA_ADMIN_COUNTER_ATTACH_QP_EVENTS_RECV_COMP_ERR_MASK BIT(3)
+#define EFA_ADMIN_COUNTER_ATTACH_QP_EVENTS_READ_COMP_MASK BIT(4)
+#define EFA_ADMIN_COUNTER_ATTACH_QP_EVENTS_READ_COMP_ERR_MASK BIT(5)
+#define EFA_ADMIN_COUNTER_ATTACH_QP_EVENTS_WRITE_COMP_MASK BIT(6)
+#define EFA_ADMIN_COUNTER_ATTACH_QP_EVENTS_WRITE_COMP_ERR_MASK BIT(7)
+#define EFA_ADMIN_COUNTER_ATTACH_QP_EVENTS_REMOTE_READ_COMP_MASK BIT(8)
+#define EFA_ADMIN_COUNTER_ATTACH_QP_EVENTS_REMOTE_WRITE_COMP_MASK BIT(9)
+
 #endif /* _EFA_ADMIN_CMDS_H_ */
diff --git a/drivers/infiniband/hw/efa/efa_io_defs.h b/drivers/infiniband/hw/efa/efa_io_defs.h
index a4c9fd33da38..874698e19647 100644
--- a/drivers/infiniband/hw/efa/efa_io_defs.h
+++ b/drivers/infiniband/hw/efa/efa_io_defs.h
@@ -9,6 +9,7 @@
 #define EFA_IO_TX_DESC_NUM_BUFS 2
 #define EFA_IO_TX_DESC_NUM_RDMA_BUFS 1
 #define EFA_IO_TX_DESC_INLINE_MAX_SIZE 32
+#define EFA_IO_TX_DESC_INLINE_MAX_SIZE_128 80
 #define EFA_IO_TX_DESC_IMM_DATA_SIZE 4
 #define EFA_IO_TX_DESC_INLINE_PBL_SIZE 1
@@ -65,6 +66,8 @@ enum efa_io_comp_status {
 	EFA_IO_COMP_STATUS_REMOTE_ERROR_UNKNOWN_PEER = 14,
 	/* Unreachable remote - never received a response */
 	EFA_IO_COMP_STATUS_LOCAL_ERROR_UNREACH_REMOTE = 15,
+	/* Remote feature mismatch */
+	EFA_IO_COMP_STATUS_REMOTE_ERROR_FEATURE_MISMATCH = 18,
 };

 enum efa_io_frwr_pbl_mode {
@@ -72,6 +75,13 @@ enum efa_io_frwr_pbl_mode {
 	EFA_IO_FRWR_DIRECT_PBL = 1,
 };

+enum efa_io_processing_hint {
+	/* Default value */
+	EFA_IO_PROCESSING_HINT_NONE = 0,
+	/* Optimize for throughput */
+	EFA_IO_PROCESSING_HINT_BURST_PPS_SENSITIVE = 1,
+};
+
 struct efa_io_tx_meta_desc {
 	/* Verbs-generated Request ID */
 	u16 req_id;
@@ -121,7 +131,14 @@ struct efa_io_tx_meta_desc {
 	u16 ah;

-	u16 reserved;
+	/*
+	 * control flags
+	 * 1:0 : processing_hint - enum efa_io_processing_hint
+	 * 7:2 : reserved - MBZ
+	 */
+	u8 ctrl3;
+
+	u8 reserved;

 	/* Queue key */
 	u32 qkey;
@@ -172,6 +189,19 @@ struct efa_io_rdma_req {
 	struct efa_io_tx_buf_desc local_mem[1];
 };

+struct efa_io_rdma_req_128 {
+	/* Remote memory address */
+	struct efa_io_remote_mem_addr remote_mem;
+
+	union {
+		/* Local memory address */
+		struct efa_io_tx_buf_desc local_mem[1];
+
+		/* inline data for RDMA */
+		u8 inline_data[80];
+	};
+};
+
 struct efa_io_fast_mr_reg_req {
 	/* Updated local key of the MR after lkey/rkey increment */
 	u32 lkey;
@@ -230,8 +260,8 @@ struct efa_io_fast_mr_inv_req {
 };

 /*
- * Tx WQE, composed of tx meta descriptors followed by either tx buffer
- * descriptors or inline data
+ * 64-byte Tx WQE, composed of tx meta descriptors followed by either tx
+ * buffer descriptors or inline data
  */
 struct efa_io_tx_wqe {
 	/* TX meta */
@@ -254,6 +284,31 @@ struct efa_io_tx_wqe {
 	} data;
 };

+/*
+ * 128-byte Tx WQE, composed of tx meta descriptors followed by either tx
+ * buffer descriptors or inline data
+ */
+struct efa_io_tx_wqe_128 {
+	/* TX meta */
+	struct efa_io_tx_meta_desc meta;
+
+	union {
+		/* Send buffer descriptors */
+		struct efa_io_tx_buf_desc sgl[2];
+
+		u8 inline_data[80];
+
+		/* RDMA local and remote memory addresses */
+		struct efa_io_rdma_req_128 rdma_req;
+
+		/* Fast registration */
+		struct efa_io_fast_mr_reg_req reg_mr_req;
+
+		/* Fast invalidation */
+		struct efa_io_fast_mr_inv_req inv_mr_req;
+	} data;
+};
+
 /*
  * Rx buffer descriptor; RX WQE is composed of one or more RX buffer
  * descriptors.
@@ -365,6 +420,7 @@ struct efa_io_rx_cdesc_ex {
 #define EFA_IO_TX_META_DESC_FIRST_MASK BIT(2)
 #define EFA_IO_TX_META_DESC_LAST_MASK BIT(3)
 #define EFA_IO_TX_META_DESC_COMP_REQ_MASK BIT(4)
+#define EFA_IO_TX_META_DESC_PROCESSING_HINT_MASK GENMASK(1, 0)

 /* tx_buf_desc */
 #define EFA_IO_TX_BUF_DESC_LKEY_MASK GENMASK(23, 0)
-- 
2.47.3

^ permalink raw reply related	[flat|nested] 14+ messages in thread
* Re: [PATCH for-next v2 0/5] Introduce Completion Counters
  2026-04-16 21:23 [PATCH for-next v2 0/5] Introduce Completion Counters Michael Margolin
                   ` (3 preceding siblings ...)
  2026-04-16 21:23 ` [PATCH for-next v2 4/5] RDMA/efa: Update device interface Michael Margolin
@ 2026-04-28 22:36 ` Doug Ledford
  4 siblings, 0 replies; 14+ messages in thread
From: Doug Ledford @ 2026-04-28 22:36 UTC (permalink / raw)
  To: Michael Margolin, jgg, leon, linux-rdma; +Cc: sleybo, matua, gal.pressman

[-- Attachment #1.1: Type: text/plain, Size: 3535 bytes --]

On 4/16/26 4:23 PM, Michael Margolin wrote:
> Add core infrastructure for Completion Counters, a light-weight
> alternative to polling CQ for tracking operation completions. The
> related rdma-core interface proposal is linked in [1].
>
> Define the UVERBS_OBJECT_COMP_CNTR ioctl object with create, destroy,
> set, inc and read methods for both success and error counters. Add a
> QP attach method on the QP object to associate a completion counter
> with a queue pair.
>
> Completion Counters can be backed by user-provided VA or dmabuf or by
> internal device/driver memory. Common command infrastructure allows any
> of the implementations to support the various device capabilities.
>
> Add EFA Completion Counters support as first implementer.
>
> [1] https://github.com/linux-rdma/rdma-core/pull/1701
>
> ---
> Changes in v2:
> - United set, inc and read flows for successful and error completions
>   counters
> - Added comp_cntr usage count
> - Minor cleanups
> - Link to v1: https://lore.kernel.org/all/20260407115424.13359-1-mrgolin@amazon.com/
>
> Michael Margolin (5):
>   RDMA/core: Add Completion Counters support
>   RDMA/core: Prevent destroying in-use completion counters
>   RDMA/core: Add Completion Counters to resource tracking
>   RDMA/efa: Update device interface
>   RDMA/efa: Add Completion Counters support
>
>  drivers/infiniband/core/Makefile              |   1 +
>  drivers/infiniband/core/device.c              |   7 +
>  drivers/infiniband/core/nldev.c               |   1 +
>  drivers/infiniband/core/rdma_core.h           |   1 +
>  drivers/infiniband/core/restrack.c            |   2 +
>  drivers/infiniband/core/uverbs_cmd.c          |   1 +
>  .../core/uverbs_std_types_comp_cntr.c         | 299 ++++++++++++++++++
>  drivers/infiniband/core/uverbs_std_types_qp.c |  65 +++-
>  drivers/infiniband/core/uverbs_uapi.c         |   1 +
>  drivers/infiniband/core/verbs.c               |   1 +
>  drivers/infiniband/hw/efa/efa.h               |  13 +
>  .../infiniband/hw/efa/efa_admin_cmds_defs.h   | 185 ++++++++++-
>  drivers/infiniband/hw/efa/efa_com_cmd.c       | 106 +++++++
>  drivers/infiniband/hw/efa/efa_com_cmd.h       |  36 +++
>  drivers/infiniband/hw/efa/efa_io_defs.h       |  62 +++-
>  drivers/infiniband/hw/efa/efa_main.c          |   6 +
>  drivers/infiniband/hw/efa/efa_verbs.c         | 171 ++++++++++
>  include/rdma/ib_verbs.h                       |  41 +++
>  include/rdma/restrack.h                       |   4 +
>  include/uapi/rdma/efa-abi.h                   |   1 +
>  include/uapi/rdma/ib_user_ioctl_cmds.h        |  50 +++
>  include/uapi/rdma/ib_user_ioctl_verbs.h       |  14 +
>  include/uapi/rdma/ib_user_verbs.h             |   2 +-
>  23 files changed, 1063 insertions(+), 7 deletions(-)
>  create mode 100644 drivers/infiniband/core/uverbs_std_types_comp_cntr.c
>

Apologies, my last email I selected the wrong cover letter. I knew that
v2 was out and I had intended to comment on it (and I had already
checked and didn't see this in the 7.1 window merge request).

So, just to make sure my comment is in the right thread for tracking
tools to process, we have hardware counters and if this isn't already
merged, then I'll prioritize making sure this API will work reasonably
for our hardware too.

-- 
Doug Ledford <doug.ledford@hpe.com>
    GPG KeyID: B826A3330E572FDD
    Key fingerprint = AE6B 1BDA 122B 23B4 265B 1274 B826 A333 0E57 2FDD

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 840 bytes --]

^ permalink raw reply	[flat|nested] 14+ messages in thread
end of thread, other threads:[~2026-05-04 15:37 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
2026-04-16 21:23 [PATCH for-next v2 0/5] Introduce Completion Counters Michael Margolin
2026-04-16 21:23 ` [PATCH for-next v2 1/5] RDMA/core: Add Completion Counters support Michael Margolin
2026-04-30  0:50   ` Doug Ledford
2026-04-30  1:49     ` Jason Gunthorpe
2026-04-30 15:38       ` Doug Ledford
2026-04-30 12:18     ` Michael Margolin
2026-04-30 19:09       ` Doug Ledford
2026-04-30 22:33         ` Sean Hefty
2026-05-04 12:51           ` Michael Margolin
2026-05-04 15:37             ` Sean Hefty
2026-04-16 21:23 ` [PATCH for-next v2 2/5] RDMA/core: Prevent destroying in-use completion counters Michael Margolin
2026-04-16 21:23 ` [PATCH for-next v2 3/5] RDMA/core: Add Completion Counters to resource tracking Michael Margolin
2026-04-16 21:23 ` [PATCH for-next v2 4/5] RDMA/efa: Update device interface Michael Margolin
2026-04-28 22:36 ` [PATCH for-next v2 0/5] Introduce Completion Counters Doug Ledford