* [PATCH rdma-next v2 0/3] cgroup/rdma: add MR memory size resource tracking
From: Tao Cui @ 2026-05-29 9:07 UTC
To: tj, hannes, mkoutny, leon, jgg; +Cc: linux-rdma, cgroups, Tao Cui
From: Tao Cui <cuitao@kylinos.cn>

Currently the RDMA cgroup only tracks two aggregate counters:
hca_handle and hca_object. The real scarce resource in multi-tenant
deployments is pinned memory: how much physical memory gets registered
through MRs. The existing hca_object counter is too coarse to capture
this.

This series adds a single new resource type:

  - mr_mem - Cumulative MR memory size in bytes

The per-object-type counters (qp, mr) from RFC v1 have been removed
per review feedback [1]: modern NICs pool objects from the same memory
pool so the distinction between QP count and MR count is not
meaningful for resource limiting. hca_object remains sufficient for
coarse object accounting.

After this series, an administrator can set limits like:

  echo "mlx5_0 mr_mem=1073741824" > rdma.max

Design
~~~~~~

mr_mem is not page-level ownership tracking; it is object-based
accounting tied to the MR lifetime:

  - charged at MR registration time
  - uncharged at MR destruction time
  - the charge is pinned to the cgroup that created the MR for the
    entire lifetime of the MR object

This model intentionally defines accounting semantics around MR
object lifetime rather than page ownership:

1. fork(): fork() does not duplicate MR objects. Even though the
   child inherits the uverbs fd and can access the parent's ucontext,
   the MR remains a single kernel object. The charge is tied to the
   MR object, not to the number of processes that can reach it, so
   no splitting or re-accounting is needed.

2. Cgroup migration: mr_mem follows the same semantics as the existing
   hca_object: charge at creation time against the invoking task's
   cgroup, uncharge at destruction time. The RDMA cgroup does not
   implement can_attach/attach callbacks today, so charges do not
   migrate with the task. This is a known limitation that applies
   equally to hca_handle and hca_object. mr_mem does not introduce
   any new complication here.

3. Overlap with memory cgroup: mr_mem does not count process memory
   usage; it represents a per-device DMA registration budget: the
   amount of memory this cgroup may register through a given HCA.
   This is a different dimension from what memory cgroup tracks. An
   administrator might set mr_mem limits differently per device, which
   memory cgroup cannot express.

   In particular, mr_mem tracks the registered memory range associated
   with the MR rather than exact dynamically pinned pages (e.g. for
   ODP MRs). This is a stable, policy-oriented approximation of
   registration footprint, not an attempt at precise physical page
   accounting.

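Illustration only (not part of the series): the object-lifetime model
described above can be sketched in plain userspace C. The names
toy_cgroup, toy_mr, toy_mr_register() and toy_mr_destroy() are
hypothetical and exist only for this example; the real accounting lives
in kernel/cgroup/rdma.c and the uverbs code. The point shown is that the
charge is recorded on the MR object and released against the creating
cgroup, whatever happens to the task afterwards.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one cgroup's mr_mem pool on one device. */
struct toy_cgroup {
	int64_t mr_mem_max;	/* limit in bytes */
	int64_t mr_mem_usage;	/* bytes currently charged */
};

/* Hypothetical MR object: remembers who was charged and how much. */
struct toy_mr {
	struct toy_cgroup *owner;	/* cgroup charged at registration */
	int64_t charged_bytes;		/* amount to release at destruction */
};

/* Charge at registration time; returns 0 on success, -1 if over limit. */
static int toy_mr_register(struct toy_mr *mr, struct toy_cgroup *creator,
			   int64_t length)
{
	if (length < 0 || length > creator->mr_mem_max - creator->mr_mem_usage)
		return -1;
	creator->mr_mem_usage += length;
	mr->owner = creator;
	mr->charged_bytes = length;
	return 0;
}

/* Uncharge at destruction time, always against the creating cgroup. */
static void toy_mr_destroy(struct toy_mr *mr)
{
	assert(mr->owner && mr->owner->mr_mem_usage >= mr->charged_bytes);
	mr->owner->mr_mem_usage -= mr->charged_bytes;
	mr->owner = NULL;
	mr->charged_bytes = 0;
}
```

Because the MR keeps its own owner pointer and byte count, a fork() or a
later cgroup move of the task needs no extra bookkeeping in this model.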
Tao Cui (3):
cgroup/rdma: extend charge/uncharge API with s64 amount parameter
cgroup/rdma: add MR memory size resource tracking
cgroup/rdma: update cgroup resource list for MR_MEM
Documentation/admin-guide/cgroup-v2.rst | 21 ++--
drivers/infiniband/core/cgroup.c | 10 +-
drivers/infiniband/core/core_priv.h | 12 +-
drivers/infiniband/core/rdma_core.c | 20 +++-
drivers/infiniband/core/uverbs_cmd.c | 61 +++++++++-
drivers/infiniband/core/uverbs_std_types_mr.c | 37 ++++++
include/linux/cgroup_rdma.h | 8 +-
include/rdma/ib_verbs.h | 1 +
kernel/cgroup/rdma.c | 108 +++++++++++++-----
9 files changed, 219 insertions(+), 59 deletions(-)
---
Changes from RFC v1:
- Removed RDMACG_RESOURCE_QP and RDMACG_RESOURCE_MR per-type
counters following review feedback from Jason Gunthorpe [1].
- Retained only RDMACG_RESOURCE_MR_MEM as the sole new resource.
- Added detailed semantic notes to the commit messages addressing
fork(), cgroup migration, and overlap with memory cgroup [2].
- Renamed patches to reflect the narrower scope.
[1] https://lore.kernel.org/all/20260525134314.GI7702@ziepe.ca/
[2] https://lore.kernel.org/all/20260528075537.2170697-1-cuitao@kylinos.cn/
--
2.43.0

* [PATCH rdma-next v2 1/3] cgroup/rdma: extend charge/uncharge API with s64 amount parameter
From: Tao Cui @ 2026-05-29 9:07 UTC
To: tj, hannes, mkoutny, leon, jgg; +Cc: linux-rdma, cgroups, Tao Cui
From: Tao Cui <cuitao@kylinos.cn>

Change struct rdmacg_resource fields (max, usage, peak) and all
charge/uncharge function signatures from int to s64 to prepare for
byte-sized resource tracking such as MR memory.

Replace match_int with a match_s64 helper that uses kstrtoll so the
user-space limit tokens accept 64-bit values.

All existing callers pass amount=1 (count-based), so the change is
transparent for existing count-based resources.

The rpool->usage_sum counter continues to track the number of active
charge operations (not the sum of charged amounts); this is correct
because it governs rpool lifetime - a pool is releasable only when all
charges, regardless of amount, have been released.

Signed-off-by: Tao Cui <cuitao@kylinos.cn>
---
 drivers/infiniband/core/cgroup.c     | 10 ++--
 drivers/infiniband/core/core_priv.h  | 12 ++--
 drivers/infiniband/core/rdma_core.c  |  8 +--
 drivers/infiniband/core/uverbs_cmd.c |  4 +-
 include/linux/cgroup_rdma.h          |  7 ++-
 kernel/cgroup/rdma.c                 | 87 ++++++++++++++++++----------
 6 files changed, 83 insertions(+), 45 deletions(-)

diff --git a/drivers/infiniband/core/cgroup.c b/drivers/infiniband/core/cgroup.c
index 1f037fe01450..81e24de72392 100644
--- a/drivers/infiniband/core/cgroup.c
+++ b/drivers/infiniband/core/cgroup.c
@@ -36,18 +36,20 @@ void ib_device_unregister_rdmacg(struct ib_device *device)
 
 int ib_rdmacg_try_charge(struct ib_rdmacg_object *cg_obj,
 			 struct ib_device *device,
-			 enum rdmacg_resource_type resource_index)
+			 enum rdmacg_resource_type resource_index,
+			 s64 amount)
 {
 	return rdmacg_try_charge(&cg_obj->cg, &device->cg_device,
-				 resource_index);
+				 resource_index, amount);
 }
 EXPORT_SYMBOL(ib_rdmacg_try_charge);
 
 void ib_rdmacg_uncharge(struct ib_rdmacg_object *cg_obj,
 			struct ib_device *device,
-			enum rdmacg_resource_type resource_index)
+			enum rdmacg_resource_type resource_index,
+			s64 amount)
 {
 	rdmacg_uncharge(cg_obj->cg, &device->cg_device,
-			resource_index);
+			resource_index, amount);
 }
 EXPORT_SYMBOL(ib_rdmacg_uncharge);
diff --git a/drivers/infiniband/core/core_priv.h b/drivers/infiniband/core/core_priv.h
index a2c36666e6fc..345356d1e504 100644
--- a/drivers/infiniband/core/core_priv.h
+++ b/drivers/infiniband/core/core_priv.h
@@ -159,11 +159,13 @@ void ib_device_unregister_rdmacg(struct ib_device *device);
 
 int ib_rdmacg_try_charge(struct ib_rdmacg_object *cg_obj,
 			 struct ib_device *device,
-			 enum rdmacg_resource_type resource_index);
+			 enum rdmacg_resource_type resource_index,
+			 s64 amount);
 
 void ib_rdmacg_uncharge(struct ib_rdmacg_object *cg_obj,
 			struct ib_device *device,
-			enum rdmacg_resource_type resource_index);
+			enum rdmacg_resource_type resource_index,
+			s64 amount);
 #else
 static inline void ib_device_register_rdmacg(struct ib_device *device)
 {
@@ -175,14 +177,16 @@ static inline void ib_device_unregister_rdmacg(struct ib_device *device)
 
 static inline int ib_rdmacg_try_charge(struct ib_rdmacg_object *cg_obj,
 				       struct ib_device *device,
-				       enum rdmacg_resource_type resource_index)
+				       enum rdmacg_resource_type resource_index,
+				       s64 amount)
 {
 	return 0;
 }
 
 static inline void ib_rdmacg_uncharge(struct ib_rdmacg_object *cg_obj,
 				      struct ib_device *device,
-				      enum rdmacg_resource_type resource_index)
+				      enum rdmacg_resource_type resource_index,
+				      s64 amount)
 {
 }
 #endif
diff --git a/drivers/infiniband/core/rdma_core.c b/drivers/infiniband/core/rdma_core.c
index 5018ec837056..3268285b5478 100644
--- a/drivers/infiniband/core/rdma_core.c
+++ b/drivers/infiniband/core/rdma_core.c
@@ -437,7 +437,7 @@ alloc_begin_idr_uobject(const struct uverbs_api_object *obj,
 		goto uobj_put;
 
 	ret = ib_rdmacg_try_charge(&uobj->cg_obj, uobj->context->device,
-				   RDMACG_RESOURCE_HCA_OBJECT);
+				   RDMACG_RESOURCE_HCA_OBJECT, 1);
 	if (ret)
 		goto remove;
 
@@ -526,7 +526,7 @@ struct ib_uobject *rdma_alloc_begin_uobject(const struct uverbs_api_object *obj,
 static void alloc_abort_idr_uobject(struct ib_uobject *uobj)
 {
 	ib_rdmacg_uncharge(&uobj->cg_obj, uobj->context->device,
-			   RDMACG_RESOURCE_HCA_OBJECT);
+			   RDMACG_RESOURCE_HCA_OBJECT, 1);
 
 	xa_erase(&uobj->ufile->idr, uobj->id);
 }
@@ -547,7 +547,7 @@ static int __must_check destroy_hw_idr_uobject(struct ib_uobject *uobj,
 		return 0;
 
 	ib_rdmacg_uncharge(&uobj->cg_obj, uobj->context->device,
-			   RDMACG_RESOURCE_HCA_OBJECT);
+			   RDMACG_RESOURCE_HCA_OBJECT, 1);
 
 	return 0;
 }
@@ -878,7 +878,7 @@ static void ufile_destroy_ucontext(struct ib_uverbs_file *ufile,
 	}
 
 	ib_rdmacg_uncharge(&ucontext->cg_obj, ib_dev,
-			   RDMACG_RESOURCE_HCA_HANDLE);
+			   RDMACG_RESOURCE_HCA_HANDLE, 1);
 
 	rdma_restrack_del(&ucontext->res);
 
diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c
index 91a62d2ade4d..9540ac180711 100644
--- a/drivers/infiniband/core/uverbs_cmd.c
+++ b/drivers/infiniband/core/uverbs_cmd.c
@@ -234,7 +234,7 @@ int ib_init_ucontext(struct uverbs_attr_bundle *attrs)
 	}
 
 	ret = ib_rdmacg_try_charge(&ucontext->cg_obj, ucontext->device,
-				   RDMACG_RESOURCE_HCA_HANDLE);
+				   RDMACG_RESOURCE_HCA_HANDLE, 1);
 	if (ret)
 		goto err;
 
@@ -273,7 +273,7 @@ int ib_init_ucontext(struct uverbs_attr_bundle *attrs)
 
 err_uncharge:
 	ib_rdmacg_uncharge(&ucontext->cg_obj, ucontext->device,
-			   RDMACG_RESOURCE_HCA_HANDLE);
+			   RDMACG_RESOURCE_HCA_HANDLE, 1);
 err:
 	mutex_unlock(&file->ucontext_lock);
 	up_read(&file->hw_destroy_rwsem);
diff --git a/include/linux/cgroup_rdma.h b/include/linux/cgroup_rdma.h
index 404e746552ca..7146cefa95a6 100644
--- a/include/linux/cgroup_rdma.h
+++ b/include/linux/cgroup_rdma.h
@@ -7,6 +7,7 @@
 #define _CGROUP_RDMA_H
 
 #include <linux/cgroup.h>
+#include <linux/types.h>
 
 enum rdmacg_resource_type {
 	RDMACG_RESOURCE_HCA_HANDLE,
@@ -46,9 +47,11 @@ void rdmacg_unregister_device(struct rdmacg_device *device);
 /* APIs for RDMA/IB stack to charge/uncharge pool specific resources */
 int rdmacg_try_charge(struct rdma_cgroup **rdmacg,
 		      struct rdmacg_device *device,
-		      enum rdmacg_resource_type index);
+		      enum rdmacg_resource_type index,
+		      s64 amount);
 void rdmacg_uncharge(struct rdma_cgroup *cg,
 		     struct rdmacg_device *device,
-		     enum rdmacg_resource_type index);
+		     enum rdmacg_resource_type index,
+		     s64 amount);
 #endif	/* CONFIG_CGROUP_RDMA */
 #endif	/* _CGROUP_RDMA_H */
diff --git a/kernel/cgroup/rdma.c b/kernel/cgroup/rdma.c
index 5e82a03b3270..519f7f537223 100644
--- a/kernel/cgroup/rdma.c
+++ b/kernel/cgroup/rdma.c
@@ -59,9 +59,9 @@ static char const *rdmacg_resource_names[] = {
 
 /* resource tracker for each resource of rdma cgroup */
 struct rdmacg_resource {
-	int max;
-	int usage;
-	int peak;
+	s64 max;
+	s64 usage;
+	s64 peak;
 };
 
 /*
@@ -105,13 +105,13 @@ static inline struct rdma_cgroup *get_current_rdmacg(void)
 }
 
 static void set_resource_limit(struct rdmacg_resource_pool *rpool,
-			       int index, int new_max)
+			       int index, s64 new_max)
 {
-	if (new_max == S32_MAX) {
-		if (rpool->resources[index].max != S32_MAX)
+	if (new_max == S64_MAX) {
+		if (rpool->resources[index].max != S64_MAX)
 			rpool->num_max_cnt++;
 	} else {
-		if (rpool->resources[index].max == S32_MAX)
+		if (rpool->resources[index].max == S64_MAX)
 			rpool->num_max_cnt--;
 	}
 	rpool->resources[index].max = new_max;
@@ -122,7 +122,7 @@ static void set_all_resource_max_limit(struct rdmacg_resource_pool *rpool)
 	int i;
 
 	for (i = 0; i < RDMACG_RESOURCE_MAX; i++)
-		set_resource_limit(rpool, i, S32_MAX);
+		set_resource_limit(rpool, i, S64_MAX);
 }
 
 static void free_cg_rpool_locked(struct rdmacg_resource_pool *rpool)
@@ -206,7 +206,8 @@ get_cg_rpool_locked(struct rdma_cgroup *cg, struct rdmacg_device *device)
 static void
 uncharge_cg_locked(struct rdma_cgroup *cg,
 		   struct rdmacg_device *device,
-		   enum rdmacg_resource_type index)
+		   enum rdmacg_resource_type index,
+		   s64 amount)
 {
 	struct rdmacg_resource_pool *rpool;
 
@@ -222,7 +223,7 @@ uncharge_cg_locked(struct rdma_cgroup *cg,
 		return;
 	}
 
-	rpool->resources[index].usage--;
+	rpool->resources[index].usage -= amount;
 
 	/*
 	 * A negative count (or overflow) is invalid,
@@ -307,14 +308,15 @@ static void rdmacg_event_locked(struct rdma_cgroup *cg,
 static void rdmacg_uncharge_hierarchy(struct rdma_cgroup *cg,
 				     struct rdmacg_device *device,
 				     struct rdma_cgroup *stop_cg,
-				     enum rdmacg_resource_type index)
+				     enum rdmacg_resource_type index,
+				     s64 amount)
 {
 	struct rdma_cgroup *p;
 
 	mutex_lock(&rdmacg_mutex);
 
 	for (p = cg; p != stop_cg; p = parent_rdmacg(p))
-		uncharge_cg_locked(p, device, index);
+		uncharge_cg_locked(p, device, index, amount);
 
 	mutex_unlock(&rdmacg_mutex);
 
@@ -329,12 +331,13 @@ static void rdmacg_uncharge_hierarchy(struct rdma_cgroup *cg,
  */
 void rdmacg_uncharge(struct rdma_cgroup *cg,
 		     struct rdmacg_device *device,
-		     enum rdmacg_resource_type index)
+		     enum rdmacg_resource_type index,
+		     s64 amount)
 {
 	if (index >= RDMACG_RESOURCE_MAX)
 		return;
 
-	rdmacg_uncharge_hierarchy(cg, device, NULL, index);
+	rdmacg_uncharge_hierarchy(cg, device, NULL, index, amount);
 }
 EXPORT_SYMBOL(rdmacg_uncharge);
 
@@ -343,6 +346,7 @@ EXPORT_SYMBOL(rdmacg_uncharge);
  * @rdmacg: pointer to rdma cgroup which will own this resource
  * @device: pointer to rdmacg device
  * @index: index of the resource to charge in cgroup (resource pool)
+ * @amount: amount to charge
  *
  * This function follows charging resource in hierarchical way.
  * It will fail if the charge would cause the new value to exceed the
@@ -361,7 +365,8 @@ EXPORT_SYMBOL(rdmacg_uncharge);
  */
 int rdmacg_try_charge(struct rdma_cgroup **rdmacg,
 		      struct rdmacg_device *device,
-		      enum rdmacg_resource_type index)
+		      enum rdmacg_resource_type index,
+		      s64 amount)
 {
 	struct rdma_cgroup *cg, *p;
 	struct rdmacg_resource_pool *rpool;
@@ -371,6 +376,9 @@ int rdmacg_try_charge(struct rdma_cgroup **rdmacg,
 	if (index >= RDMACG_RESOURCE_MAX)
 		return -EINVAL;
 
+	if (amount <= 0)
+		return -EINVAL;
+
 	/*
 	 * hold on to css, as cgroup can be removed but resource
 	 * accounting happens on css.
@@ -384,8 +392,9 @@ int rdmacg_try_charge(struct rdma_cgroup **rdmacg,
 			ret = PTR_ERR(rpool);
 			goto err;
 		} else {
-			new = (s64)rpool->resources[index].usage + 1;
-			if (new > rpool->resources[index].max) {
+			new = rpool->resources[index].usage + amount;
+			if (new < rpool->resources[index].usage ||
+			    new > rpool->resources[index].max) {
 				ret = -EAGAIN;
 				goto err;
 			} else {
@@ -409,7 +418,7 @@ int rdmacg_try_charge(struct rdma_cgroup **rdmacg,
 	if (ret == -EAGAIN)
 		rdmacg_event_locked(cg, p, device, index);
 	mutex_unlock(&rdmacg_mutex);
-	rdmacg_uncharge_hierarchy(cg, device, p, index);
+	rdmacg_uncharge_hierarchy(cg, device, p, index, amount);
 	return ret;
 }
 EXPORT_SYMBOL(rdmacg_try_charge);
@@ -477,6 +486,25 @@ static struct rdmacg_device *rdmacg_get_device_locked(const char *name)
 	return NULL;
 }
 
+static int match_s64(substring_t *s, s64 *result)
+{
+	char *buf;
+	int ret;
+	s64 val;
+
+	buf = kmemdup_nul(s->from, s->to - s->from, GFP_KERNEL);
+	if (!buf)
+		return -ENOMEM;
+	ret = kstrtoll(buf, 0, &val);
+	kfree(buf);
+	if (ret)
+		return ret;
+	if (val < 0)
+		return -EINVAL;
+	*result = val;
+	return 0;
+}
+
 static ssize_t rdmacg_resource_set_max(struct kernfs_open_file *of,
 				       char *buf, size_t nbytes, loff_t off)
 {
@@ -486,7 +514,7 @@ static ssize_t rdmacg_resource_set_max(struct kernfs_open_file *of,
 	struct rdmacg_device *device;
 	char *options = strstrip(buf);
 	char *p;
-	int *new_limits;
+	s64 *new_limits;
 	unsigned long enables = 0;
 	int i = 0, ret = 0;
 
@@ -497,7 +525,7 @@ static ssize_t rdmacg_resource_set_max(struct kernfs_open_file *of,
 		goto err;
 	}
 
-	new_limits = kzalloc_objs(int, RDMACG_RESOURCE_MAX);
+	new_limits = kcalloc(RDMACG_RESOURCE_MAX, sizeof(s64), GFP_KERNEL);
 	if (!new_limits) {
 		ret = -ENOMEM;
 		goto err;
@@ -506,7 +534,8 @@ static ssize_t rdmacg_resource_set_max(struct kernfs_open_file *of,
 	/* parse resource limit tokens */
 	while ((p = strsep(&options, " \t\n"))) {
 		substring_t args[MAX_OPT_ARGS];
-		int tok, intval;
+		int tok;
+		s64 intval;
 
 		if (!*p)
 			continue;
@@ -514,7 +543,7 @@ static ssize_t rdmacg_resource_set_max(struct kernfs_open_file *of,
 		tok = match_token(p, rdmacg_limit_tokens, args);
 		switch (tok) {
 		case RDMACG_HCA_HANDLE_VAL:
-			if (match_int(&args[0], &intval) || intval < 0) {
+			if (match_s64(&args[0], &intval)) {
 				ret = -EINVAL;
 				goto parse_err;
 			}
@@ -522,11 +551,11 @@ static ssize_t rdmacg_resource_set_max(struct kernfs_open_file *of,
 			enables |= BIT(RDMACG_RESOURCE_HCA_HANDLE);
 			break;
 		case RDMACG_HCA_HANDLE_MAX:
-			new_limits[RDMACG_RESOURCE_HCA_HANDLE] = S32_MAX;
+			new_limits[RDMACG_RESOURCE_HCA_HANDLE] = S64_MAX;
 			enables |= BIT(RDMACG_RESOURCE_HCA_HANDLE);
 			break;
 		case RDMACG_HCA_OBJECT_VAL:
-			if (match_int(&args[0], &intval) || intval < 0) {
+			if (match_s64(&args[0], &intval)) {
 				ret = -EINVAL;
 				goto parse_err;
 			}
@@ -534,7 +563,7 @@ static ssize_t rdmacg_resource_set_max(struct kernfs_open_file *of,
 			enables |= BIT(RDMACG_RESOURCE_HCA_OBJECT);
 			break;
 		case RDMACG_HCA_OBJECT_MAX:
-			new_limits[RDMACG_RESOURCE_HCA_OBJECT] = S32_MAX;
+			new_limits[RDMACG_RESOURCE_HCA_OBJECT] = S64_MAX;
 			enables |= BIT(RDMACG_RESOURCE_HCA_OBJECT);
 			break;
 		default:
@@ -588,7 +617,7 @@ static void print_rpool_values(struct seq_file *sf,
 {
 	enum rdmacg_file_type sf_type;
 	int i;
-	u32 value;
+	s64 value;
 
 	sf_type = seq_cft(sf)->private;
 
@@ -599,7 +628,7 @@ static void print_rpool_values(struct seq_file *sf,
 			if (rpool)
 				value = rpool->resources[i].max;
 			else
-				value = S32_MAX;
+				value = S64_MAX;
 		} else if (sf_type == RDMACG_RESOURCE_TYPE_PEAK) {
 			value = rpool ? rpool->resources[i].peak : 0;
 		} else {
@@ -609,10 +638,10 @@ static void print_rpool_values(struct seq_file *sf,
 				value = 0;
 		}
 
-		if (value == S32_MAX)
+		if (value == S64_MAX)
 			seq_puts(sf, RDMACG_MAX_STR);
 		else
-			seq_printf(sf, "%d", value);
+			seq_printf(sf, "%lld", value);
 		seq_putc(sf, ' ');
 	}
 }
-- 
2.43.0
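
Illustration only (not part of the patch): the amount-based charge check
in patch 1/3 can be modelled in userspace as below. toy_resource,
toy_try_charge() and toy_uncharge() are hypothetical names. One
deliberate difference from the patch is labelled in the code: instead of
computing usage + amount and testing for wraparound afterwards, this
sketch compares amount against the remaining headroom, because signed
overflow is undefined behaviour in standard C outside the kernel's build
flags.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical per-resource counters, modelled on the s64 max/usage fields. */
struct toy_resource {
	int64_t max;	/* limit; INT64_MAX plays the role of "max" */
	int64_t usage;	/* currently charged amount */
};

/* Charge `amount` units; mirrors the -EINVAL / -EAGAIN split in the patch. */
static int toy_try_charge(struct toy_resource *res, int64_t amount)
{
	if (amount <= 0)
		return -EINVAL;
	/*
	 * Assumption of this sketch: max >= 0 and usage >= 0, so
	 * max - usage cannot overflow. The test is equivalent to
	 * "usage + amount would wrap or exceed max".
	 */
	if (amount > res->max - res->usage)
		return -EAGAIN;
	res->usage += amount;
	return 0;
}

/* Release `amount` units previously charged with toy_try_charge(). */
static void toy_uncharge(struct toy_resource *res, int64_t amount)
{
	assert(amount > 0 && res->usage >= amount);
	res->usage -= amount;
}
```

With amount fixed at 1 this degenerates to the old count-based
behaviour, which is why existing callers are unaffected.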

* [PATCH rdma-next v2 2/3] cgroup/rdma: add MR memory size resource tracking
From: Tao Cui @ 2026-05-29 9:07 UTC
To: tj, hannes, mkoutny, leon, jgg; +Cc: linux-rdma, cgroups, Tao Cui
From: Tao Cui <cuitao@kylinos.cn>

Add RDMACG_RESOURCE_MR_MEM so that the cumulative memory size of
registered Memory Regions can be tracked and limited independently
from the aggregate hca_object counter.

Unlike count-based resources (hca_handle, hca_object) which are
charged in the generic IDR allocation path, MR_MEM is byte-based and
must be charged after the MR length is known. Charge in the uverbs MR
registration handlers (ioctl and legacy), and uncharge in the generic
destroy paths (alloc_abort_idr_uobject, destroy_hw_idr_uobject). Store
the charged byte count in uobj->rdmacg_mr_mem_bytes so that the
destroy path knows how much to uncharge.

Semantic notes
~~~~~~~~~~~~~~

mr_mem is not page-level ownership tracking - it is object-based
accounting tied to the MR lifetime:

  - charged at MR registration time
  - uncharged at MR destruction time
  - the charge lives with the MR's creating cgroup for the entire
    lifetime of the MR object

This model intentionally defines accounting semantics around MR
object lifetime rather than page ownership:

1. fork(): fork() does not duplicate MR objects. Even though the
   child inherits the uverbs fd and can access the parent's ucontext,
   the MR remains a single kernel object. The charge is tied to the
   MR object, not to the number of processes that can reach it, so
   no splitting or re-accounting is needed.

2. Cgroup migration: mr_mem follows the same semantics as the existing
   hca_object - charge at creation time against the invoking task's
   cgroup, uncharge at destruction time. The RDMA cgroup does not
   implement can_attach/attach callbacks today, so charges do not
   migrate with the task. This is a known limitation that applies
   equally to hca_handle and hca_object. mr_mem does not introduce
   any new complication here.

3. Overlap with memory cgroup: mr_mem does not count process memory
   usage - it represents a per-device DMA registration budget: how
   much memory can this cgroup register through a given HCA. This is
   a different dimension from what memory cgroup tracks. An
   administrator might set mr_mem limits differently per device, which
   memory cgroup cannot express.

   In particular, mr_mem tracks the registered memory range associated
   with the MR rather than exact dynamically pinned pages (e.g. for
   ODP MRs). This is a stable, policy-oriented approximation of
   registration footprint - not an attempt at precise physical page
   accounting.

Guard against u64-to-s64 overflow by rejecting MR lengths that exceed
S64_MAX at each registration site.

Handle MR reregistration (IB_USER_VERBS_CMD_REREG_MR with
IB_MR_REREG_TRANS) by computing the delta between old and new lengths
and charging or uncharging the difference. When the driver creates a
new HW object (new_mr != NULL), the full new length is charged to the
new uobj and the old uobj's mr_mem is released through the existing
rdma_assign_uobject -> destroy_hw_idr_uobject -> rdmacg_uncharge_uobj
path.

Enable MR memory limits:

  echo "mlx5_0 mr_mem=1073741824" > rdma.max

Signed-off-by: Tao Cui <cuitao@kylinos.cn>
---
 drivers/infiniband/core/rdma_core.c           | 14 ++++-
 drivers/infiniband/core/uverbs_cmd.c          | 57 +++++++++++++++++++
 drivers/infiniband/core/uverbs_std_types_mr.c | 37 ++++++++++++
 include/linux/cgroup_rdma.h                   |  1 +
 include/rdma/ib_verbs.h                       |  1 +
 kernel/cgroup/rdma.c                          | 21 ++++++-
 6 files changed, 126 insertions(+), 5 deletions(-)

diff --git a/drivers/infiniband/core/rdma_core.c b/drivers/infiniband/core/rdma_core.c
index 3268285b5478..a540cef6bb67 100644
--- a/drivers/infiniband/core/rdma_core.c
+++ b/drivers/infiniband/core/rdma_core.c
@@ -523,10 +523,19 @@ struct ib_uobject *rdma_alloc_begin_uobject(const struct uverbs_api_object *obj,
 	return ret;
 }
 
-static void alloc_abort_idr_uobject(struct ib_uobject *uobj)
+static void rdmacg_uncharge_uobj(struct ib_uobject *uobj)
 {
 	ib_rdmacg_uncharge(&uobj->cg_obj, uobj->context->device,
 			   RDMACG_RESOURCE_HCA_OBJECT, 1);
+	if (uobj->rdmacg_mr_mem_bytes)
+		ib_rdmacg_uncharge(&uobj->cg_obj, uobj->context->device,
+				   RDMACG_RESOURCE_MR_MEM,
+				   uobj->rdmacg_mr_mem_bytes);
+}
+
+static void alloc_abort_idr_uobject(struct ib_uobject *uobj)
+{
+	rdmacg_uncharge_uobj(uobj);
 
 	xa_erase(&uobj->ufile->idr, uobj->id);
 }
@@ -546,8 +555,7 @@ static int __must_check destroy_hw_idr_uobject(struct ib_uobject *uobj,
 	if (why == RDMA_REMOVE_ABORT)
 		return 0;
 
-	ib_rdmacg_uncharge(&uobj->cg_obj, uobj->context->device,
-			   RDMACG_RESOURCE_HCA_OBJECT, 1);
+	rdmacg_uncharge_uobj(uobj);
 
 	return 0;
 }
diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c
index 9540ac180711..901de117c808 100644
--- a/drivers/infiniband/core/uverbs_cmd.c
+++ b/drivers/infiniband/core/uverbs_cmd.c
@@ -752,6 +752,17 @@ static int ib_uverbs_reg_mr(struct uverbs_attr_bundle *attrs)
 	uobj->object = mr;
 
 	uobj_put_obj_read(pd);
+
+	if (cmd.length > S64_MAX)
+		goto err_free;
+	if (cmd.length) {
+		ret = ib_rdmacg_try_charge(&uobj->cg_obj, uobj->context->device,
+					   RDMACG_RESOURCE_MR_MEM, cmd.length);
+		if (ret)
+			goto err_dereg;
+		uobj->rdmacg_mr_mem_bytes = cmd.length;
+	}
+
 	uobj_finalize_uobj_create(uobj, attrs);
 
 	resp.lkey = mr->lkey;
@@ -759,6 +770,8 @@ static int ib_uverbs_reg_mr(struct uverbs_attr_bundle *attrs)
 	resp.mr_handle = uobj->id;
 	return uverbs_response(attrs, &resp, sizeof(resp));
 
+err_dereg:
+	ib_dereg_mr_user(mr, &attrs->driver_udata);
 err_put:
 	uobj_put_obj_read(pd);
 err_free:
@@ -854,6 +867,20 @@ static int ib_uverbs_rereg_mr(struct uverbs_attr_bundle *attrs)
 		rdma_restrack_set_name(&new_mr->res, NULL);
 		rdma_restrack_add(&new_mr->res);
 
+		if ((cmd.flags & IB_MR_REREG_TRANS) && cmd.length) {
+			if (cmd.length > S64_MAX) {
+				ret = -EINVAL;
+				goto err_rereg_new_mr;
+			}
+			ret = ib_rdmacg_try_charge(&new_uobj->cg_obj,
+						   new_uobj->context->device,
+						   RDMACG_RESOURCE_MR_MEM,
+						   cmd.length);
+			if (ret)
+				goto err_rereg_new_mr;
+			new_uobj->rdmacg_mr_mem_bytes = cmd.length;
+		}
+
 		/*
 		 * The new uobj for the new HW object is put into the same spot
 		 * in the IDR and the old uobj & HW object is deleted.
@@ -871,6 +898,31 @@ static int ib_uverbs_rereg_mr(struct uverbs_attr_bundle *attrs)
 			atomic_inc(&new_pd->usecnt);
 		}
 		if (cmd.flags & IB_MR_REREG_TRANS) {
+			s64 delta;
+
+			if (cmd.length > S64_MAX) {
+				ret = -EINVAL;
+				goto put_new_uobj;
+			}
+			delta = (s64)cmd.length -
+				(s64)uobj->rdmacg_mr_mem_bytes;
+
+			if (delta > 0) {
+				ret = ib_rdmacg_try_charge(
+					&uobj->cg_obj,
+					uobj->context->device,
+					RDMACG_RESOURCE_MR_MEM,
+					delta);
+				if (ret)
+					goto put_new_uobj;
+			} else if (delta < 0) {
+				ib_rdmacg_uncharge(
+					&uobj->cg_obj,
+					uobj->context->device,
+					RDMACG_RESOURCE_MR_MEM,
+					-delta);
+			}
+			uobj->rdmacg_mr_mem_bytes = cmd.length;
 			mr->iova = cmd.hca_va;
 			mr->length = cmd.length;
 		}
@@ -887,6 +939,11 @@ static int ib_uverbs_rereg_mr(struct uverbs_attr_bundle *attrs)
 put_new_uobj:
 	if (new_uobj)
 		uobj_alloc_abort(new_uobj, attrs);
+err_rereg_new_mr:
+	if (new_uobj) {
+		rdma_alloc_abort_uobject(new_uobj, attrs, true);
+		new_uobj = NULL;
+	}
 put_uobj_pd:
 	if (cmd.flags & IB_MR_REREG_PD)
 		uobj_put_obj_read(new_pd);
diff --git a/drivers/infiniband/core/uverbs_std_types_mr.c b/drivers/infiniband/core/uverbs_std_types_mr.c
index 570b9656801d..3989ff2d282b 100644
--- a/drivers/infiniband/core/uverbs_std_types_mr.c
+++ b/drivers/infiniband/core/uverbs_std_types_mr.c
@@ -32,6 +32,7 @@
  */
 
 #include "rdma_core.h"
+#include "core_priv.h"
 #include "uverbs.h"
 #include <rdma/uverbs_std_types.h>
 #include "restrack.h"
@@ -140,6 +141,18 @@ static int UVERBS_HANDLER(UVERBS_METHOD_DM_MR_REG)(
 	rdma_restrack_set_name(&mr->res, NULL);
 	rdma_restrack_add(&mr->res);
 	uobj->object = mr;
+	if (attr.length > S64_MAX)
+		return -EINVAL;
+
+	if (attr.length) {
+		ret = ib_rdmacg_try_charge(&uobj->cg_obj, uobj->context->device,
+					   RDMACG_RESOURCE_MR_MEM, attr.length);
+		if (ret) {
+			ib_dereg_mr_user(mr, &attrs->driver_udata);
+			return ret;
+		}
+		uobj->rdmacg_mr_mem_bytes = attr.length;
+	}
 
 	uverbs_finalize_uobj_create(attrs, UVERBS_ATTR_REG_DM_MR_HANDLE);
 
@@ -254,6 +267,18 @@ static int UVERBS_HANDLER(UVERBS_METHOD_REG_DMABUF_MR)(
 	rdma_restrack_add(&mr->res);
 	uobj->object = mr;
 
+	if (length > S64_MAX)
+		return -EINVAL;
+	if (length) {
+		ret = ib_rdmacg_try_charge(&uobj->cg_obj, uobj->context->device,
+					   RDMACG_RESOURCE_MR_MEM, length);
+		if (ret) {
+			ib_dereg_mr_user(mr, &attrs->driver_udata);
+			return ret;
+		}
+		uobj->rdmacg_mr_mem_bytes = length;
+	}
+
 	uverbs_finalize_uobj_create(attrs, UVERBS_ATTR_REG_DMABUF_MR_HANDLE);
 
 	ret = uverbs_copy_to(attrs, UVERBS_ATTR_REG_DMABUF_MR_RESP_LKEY,
@@ -383,6 +408,18 @@ static int UVERBS_HANDLER(UVERBS_METHOD_REG_MR)(
 	rdma_restrack_add(&mr->res);
 	uobj->object = mr;
 
+	if (length > S64_MAX)
+		return -EINVAL;
+	if (length) {
+		ret = ib_rdmacg_try_charge(&uobj->cg_obj, uobj->context->device,
+					   RDMACG_RESOURCE_MR_MEM, length);
+		if (ret) {
+			ib_dereg_mr_user(mr, &attrs->driver_udata);
+			return ret;
+		}
+		uobj->rdmacg_mr_mem_bytes = length;
+	}
+
 	uverbs_finalize_uobj_create(attrs, UVERBS_ATTR_REG_MR_HANDLE);
 
 	ret = uverbs_copy_to(attrs, UVERBS_ATTR_REG_MR_RESP_LKEY,
diff --git a/include/linux/cgroup_rdma.h b/include/linux/cgroup_rdma.h
index 7146cefa95a6..2c8fb1ebb1a9 100644
--- a/include/linux/cgroup_rdma.h
+++ b/include/linux/cgroup_rdma.h
@@ -12,6 +12,7 @@
 enum rdmacg_resource_type {
 	RDMACG_RESOURCE_HCA_HANDLE,
 	RDMACG_RESOURCE_HCA_OBJECT,
+	RDMACG_RESOURCE_MR_MEM,
 	RDMACG_RESOURCE_MAX,
 };
 
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 9dd76f489a0b..c7dcd5d085fb 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -1569,6 +1569,7 @@ struct ib_uobject {
 	void			*object;	/* containing object */
 	struct list_head	list;		/* link to context's list */
 	struct ib_rdmacg_object	cg_obj;		/* rdmacg object */
+	s64			rdmacg_mr_mem_bytes; /* charged MR memory size */
 	int			id;		/* index into kernel idr */
 	struct kref		ref;
 	atomic_t		usecnt;		/* protects exclusive access */
diff --git a/kernel/cgroup/rdma.c b/kernel/cgroup/rdma.c
index 519f7f537223..ebfc5721c098 100644
--- a/kernel/cgroup/rdma.c
+++ b/kernel/cgroup/rdma.c
@@ -23,14 +23,18 @@ enum rdmacg_limit_tokens {
 	RDMACG_HCA_HANDLE_MAX,
 	RDMACG_HCA_OBJECT_VAL,
 	RDMACG_HCA_OBJECT_MAX,
+	RDMACG_MR_MEM_VAL,
+	RDMACG_MR_MEM_MAX,
 	NR_RDMACG_LIMIT_TOKENS,
 };
 
 static const match_table_t rdmacg_limit_tokens = {
-	{ RDMACG_HCA_HANDLE_VAL, "hca_handle=%d" },
+	{ RDMACG_HCA_HANDLE_VAL,	"hca_handle=%d" },
 	{ RDMACG_HCA_HANDLE_MAX, "hca_handle=max" },
-	{ RDMACG_HCA_OBJECT_VAL, "hca_object=%d" },
+	{ RDMACG_HCA_OBJECT_VAL,	"hca_object=%d" },
 	{ RDMACG_HCA_OBJECT_MAX, "hca_object=max" },
+	{ RDMACG_MR_MEM_VAL,		"mr_mem=%d" },
+	{ RDMACG_MR_MEM_MAX,		"mr_mem=max" },
 	{ NR_RDMACG_LIMIT_TOKENS, NULL },
 };
 
@@ -55,6 +59,7 @@ enum rdmacg_file_type {
 static char const *rdmacg_resource_names[] = {
 	[RDMACG_RESOURCE_HCA_HANDLE]	= "hca_handle",
 	[RDMACG_RESOURCE_HCA_OBJECT]	= "hca_object",
+	[RDMACG_RESOURCE_MR_MEM]	= "mr_mem",
 };
 
 /* resource tracker for each resource of rdma cgroup */
@@ -566,6 +571,18 @@ static ssize_t rdmacg_resource_set_max(struct kernfs_open_file *of,
 			new_limits[RDMACG_RESOURCE_HCA_OBJECT] = S64_MAX;
 			enables |= BIT(RDMACG_RESOURCE_HCA_OBJECT);
 			break;
+		case RDMACG_MR_MEM_VAL:
+			if (match_s64(&args[0], &intval)) {
+				ret = -EINVAL;
+				goto parse_err;
+			}
+			new_limits[RDMACG_RESOURCE_MR_MEM] = intval;
+			enables |= BIT(RDMACG_RESOURCE_MR_MEM);
+			break;
+		case RDMACG_MR_MEM_MAX:
+			new_limits[RDMACG_RESOURCE_MR_MEM] = S64_MAX;
+			enables |= BIT(RDMACG_RESOURCE_MR_MEM);
+			break;
 		default:
 			ret = -EINVAL;
 			goto parse_err;
-- 
2.43.0
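
Illustration only (not part of the patch): the in-place reregistration
case described in the commit message, where only the difference between
the old and new length is charged or released, can be sketched in
userspace as follows. toy_pool and toy_rereg_adjust() are hypothetical
names, and the headroom comparison stands in for the kernel's
ib_rdmacg_try_charge() call.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical mr_mem pool of one cgroup on one device. */
struct toy_pool {
	int64_t max;	/* limit in bytes */
	int64_t usage;	/* bytes currently charged */
};

/*
 * Move the charge held by one MR from *charged to new_len bytes.
 * On failure nothing changes; on success *charged is updated.
 */
static int toy_rereg_adjust(struct toy_pool *pool, int64_t *charged,
			    uint64_t new_len)
{
	int64_t delta;

	/* Reject lengths that do not fit the signed 64-bit counter. */
	if (new_len > (uint64_t)INT64_MAX)
		return -EINVAL;

	/* Both operands are in [0, INT64_MAX], so this cannot overflow. */
	delta = (int64_t)new_len - *charged;

	if (delta > 0) {
		/* Growing: charge only the additional bytes. */
		if (delta > pool->max - pool->usage)
			return -EAGAIN;
		pool->usage += delta;
	} else if (delta < 0) {
		/* Shrinking: release the bytes no longer registered. */
		assert(pool->usage >= -delta);
		pool->usage += delta;
	}
	*charged = (int64_t)new_len;
	return 0;
}
```

A failed grow leaves both the pool and the per-MR byte count untouched,
so a later destroy still releases exactly what was charged.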
* [PATCH rdma-next v2 3/3] cgroup/rdma: update cgroup resource list for MR_MEM
2026-05-29 9:07 [PATCH rdma-next v2 0/3] cgroup/rdma: add MR memory size resource tracking Tao Cui
2026-05-29 9:07 ` [PATCH rdma-next v2 1/3] cgroup/rdma: extend charge/uncharge API with s64 amount parameter Tao Cui
2026-05-29 9:07 ` [PATCH rdma-next v2 2/3] cgroup/rdma: add MR memory size resource tracking Tao Cui
@ 2026-05-29 9:07 ` Tao Cui
2026-05-29 16:18 ` kernel test robot
2026-05-29 12:46 ` [PATCH rdma-next v2 0/3] cgroup/rdma: add MR memory size resource tracking Michal Koutný
2026-05-29 21:14 ` yanjun.zhu
4 siblings, 1 reply; 9+ messages in thread
From: Tao Cui @ 2026-05-29 9:07 UTC (permalink / raw)
To: tj, hannes, mkoutny, leon, jgg; +Cc: linux-rdma, cgroups, Tao Cui
From: Tao Cui <cuitao@kylinos.cn>

The RDMA cgroup now supports MR memory size tracking via the new
mr_mem resource. Update the cgroup-v2 documentation to describe the
new resource and revise the usage examples accordingly.

The mr_mem resource tracks the cumulative size of memory registered
through Memory Regions per device per cgroup, providing a DMA
registration budget that is orthogonal to the existing hca_object
counter.

Signed-off-by: Tao Cui <cuitao@kylinos.cn>
---
 Documentation/admin-guide/cgroup-v2.rst | 21 +++++++++++----------
 1 file changed, 11 insertions(+), 10 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 993446ab66d0..08d80e6f79ec 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -2766,15 +2766,16 @@ RDMA Interface Files

 	The following nested keys are defined.

-	  ========== =============================
+	  ========== ================================================
 	  hca_handle Maximum number of HCA Handles
 	  hca_object Maximum number of HCA Objects
-	  ========== =============================
+	  mr_mem     Maximum cumulative MR memory size in bytes
+	  ========== ================================================

 	An example for mlx4 and ocrdma device follows::

-	  mlx4_0 hca_handle=2 hca_object=2000
-	  ocrdma1 hca_handle=3 hca_object=max
+	  mlx4_0 hca_handle=2 hca_object=2000 mr_mem=1073741824
+	  ocrdma1 hca_handle=3 hca_object=max mr_mem=max

   rdma.current
 	A read-only file that describes current resource usage.
@@ -2782,8 +2783,8 @@ RDMA Interface Files

 	An example for mlx4 and ocrdma device follows::

-	  mlx4_0 hca_handle=1 hca_object=20
-	  ocrdma1 hca_handle=1 hca_object=23
+	  mlx4_0 hca_handle=1 hca_object=20 mr_mem=1048576
+	  ocrdma1 hca_handle=1 hca_object=23 mr_mem=0

   rdma.peak
 	A read-only nested-keyed file that exists for all the cgroups
@@ -2792,8 +2793,8 @@ RDMA Interface Files

 	An example for mlx4 and ocrdma device follows::

-	  mlx4_0 hca_handle=1 hca_object=20
-	  ocrdma1 hca_handle=0 hca_object=23
+	  mlx4_0 hca_handle=1 hca_object=20 mr_mem=1048576
+	  ocrdma1 hca_handle=0 hca_object=23 mr_mem=0

   rdma.events
 	A read-only nested-keyed file which exists on non-root
@@ -2815,7 +2816,7 @@ RDMA Interface Files

 	An example for mlx4 device follows::

-	  mlx4_0 hca_handle.max=5 hca_handle.alloc_fail=3 hca_object.max=0 hca_object.alloc_fail=0
+	  mlx4_0 hca_handle.max=5 hca_handle.alloc_fail=3 hca_object.max=0 hca_object.alloc_fail=0 mr_mem.max=0 mr_mem.alloc_fail=0

   rdma.events.local
 	Similar to rdma.events but the fields in the file are local
@@ -2836,7 +2837,7 @@ RDMA Interface Files

 	An example for mlx4 device follows::

-	  mlx4_0 hca_handle.max=5 hca_handle.alloc_fail=0 hca_object.max=0 hca_object.alloc_fail=0
+	  mlx4_0 hca_handle.max=5 hca_handle.alloc_fail=0 hca_object.max=0 hca_object.alloc_fail=0 mr_mem.max=0 mr_mem.alloc_fail=0

 DMEM
 ----
-- 
2.43.0
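The nested-keyed line format documented in the patch above (`<device> <key>=<value> ...`) is straightforward to consume from a monitoring tool. The following is a rough sketch of reading one value from a single `rdma.current`-style line; `rdma_line_get` is a hypothetical helper name, and the sketch assumes numeric values only (a `max` value, as can appear in `rdma.max`, is reported as a failure).

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Look up "<key>=<number>" on one nested-keyed line such as
 *   "mlx4_0 hca_handle=1 hca_object=20 mr_mem=1048576"
 * Returns 0 and fills *out on success; returns -1 if the line is for a
 * different device, the key is absent, or the value is not a decimal number.
 */
static int rdma_line_get(const char *line, const char *dev,
			 const char *key, int64_t *out)
{
	size_t dlen = strlen(dev);
	size_t klen = strlen(key);
	const char *p = line;

	/* The first field must be exactly the device name. */
	if (strncmp(p, dev, dlen) != 0 || p[dlen] != ' ')
		return -1;
	p += dlen;

	while (*p) {
		size_t n;

		p += strspn(p, " \t\n");	/* skip separators */
		n = strcspn(p, " \t\n");	/* length of this token */

		/* Need "key=" followed by at least one value character. */
		if (n > klen + 1 && strncmp(p, key, klen) == 0 &&
		    p[klen] == '=') {
			char *end;
			long long v;

			errno = 0;
			v = strtoll(p + klen + 1, &end, 10);
			/* The number must span the whole rest of the token. */
			if (errno != 0 || end != p + n)
				return -1;
			*out = (int64_t)v;
			return 0;
		}
		p += n;
	}
	return -1;
}
```

A caller would typically read the file line by line and try each line against the device of interest; matching the key together with the following `=` keeps `hca_handle` from matching a longer key that merely shares its prefix.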
* Re: [PATCH rdma-next v2 3/3] cgroup/rdma: update cgroup resource list for MR_MEM
2026-05-29 9:07 ` [PATCH rdma-next v2 3/3] cgroup/rdma: update cgroup resource list for MR_MEM Tao Cui
@ 2026-05-29 16:18 ` kernel test robot
0 siblings, 0 replies; 9+ messages in thread
From: kernel test robot @ 2026-05-29 16:18 UTC (permalink / raw)
To: Tao Cui, tj, hannes, mkoutny, leon, jgg
Cc: oe-kbuild-all, linux-rdma, cgroups, Tao Cui

Hi Tao,

kernel test robot noticed the following build warnings:

[auto build test WARNING on tj-cgroup/for-next]
[also build test WARNING on next-20260528]
[cannot apply to linus/master v7.1-rc5]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url: https://github.com/intel-lab-lkp/linux/commits/Tao-Cui/cgroup-rdma-extend-charge-uncharge-API-with-s64-amount-parameter/20260529-171623
base: https://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git for-next
patch link: https://lore.kernel.org/r/20260529090733.2242822-4-cui.tao%40linux.dev
patch subject: [PATCH rdma-next v2 3/3] cgroup/rdma: update cgroup resource list for MR_MEM
config: i386-allnoconfig-bpf (https://download.01.org/0day-ci/archive/20260529/202605291816.15AyhoZE-lkp@intel.com/config)
compiler: gcc-14 (Debian 14.2.0-19) 14.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260529/202605291816.15AyhoZE-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202605291816.15AyhoZE-lkp@intel.com/

All warnings (new ones prefixed by >>):

>> Warning: kernel/cgroup/rdma.c:210 function parameter 'amount' not described in 'uncharge_cg_locked'
>> Warning: kernel/cgroup/rdma.c:312 function parameter 'amount' not described in 'rdmacg_uncharge_hierarchy'
>> Warning: kernel/cgroup/rdma.c:335 function parameter 'amount' not described in 'rdmacg_uncharge'
>> Warning: kernel/cgroup/rdma.c:210 function parameter 'amount' not described in 'uncharge_cg_locked'
>> Warning: kernel/cgroup/rdma.c:312 function parameter 'amount' not described in 'rdmacg_uncharge_hierarchy'
>> Warning: kernel/cgroup/rdma.c:335 function parameter 'amount' not described in 'rdmacg_uncharge'

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
* Re: [PATCH rdma-next v2 0/3] cgroup/rdma: add MR memory size resource tracking
2026-05-29 9:07 [PATCH rdma-next v2 0/3] cgroup/rdma: add MR memory size resource tracking Tao Cui
` (2 preceding siblings ...)
2026-05-29 9:07 ` [PATCH rdma-next v2 3/3] cgroup/rdma: update cgroup resource list for MR_MEM Tao Cui
@ 2026-05-29 12:46 ` Michal Koutný
2026-06-01 5:37 ` Tao Cui
2026-05-29 21:14 ` yanjun.zhu
4 siblings, 1 reply; 9+ messages in thread
From: Michal Koutný @ 2026-05-29 12:46 UTC (permalink / raw)
To: Tao Cui; +Cc: tj, hannes, leon, jgg, linux-rdma, cgroups, Tao Cui

Hi.

On Fri, May 29, 2026 at 05:07:30PM +0800, Tao Cui <cui.tao@linux.dev> wrote:
> The real scarce resource in multi-tenant
> deployments is pinned memory: how much physical memory gets registered
> through MRs.
> ...
> 3. Overlap with memory cgroup: mr_mem does not count process memory
> usage; it represents a per-device DMA registration budget: the
> amount of memory this cgroup may register through a given HCA.
> This is a different dimension from what memory cgroup tracks. An
> administrator might set mr_mem limits differently per device, which
> memory cgroup cannot express.
>
> In particular, mr_mem tracks the registered memory range associated
> with the MR rather than exact dynamically pinned pages (e.g. for
> ODP MRs). This is a stable, policy-oriented approximation of
> registration footprint, not an attempt at precise physical page
> accounting.

IIUC the pinned memory is regular RAM, i.e. it could be controlled
with memcg as needed. Or is there "physical" limit of what can be
assigned to a single device?

BTW, have a look at [1], it'd be good to converge to similar approach
(the current proposal allows distinguishing whether charging should
include or exempt memcg counting).

Also it seems, that the dmem controller could be a one-stop solution
for all DMA charges. Please tell me if there are any distinguishing
factors between RDMA devices' memory and these dmem memory regions.

Thanks,
Michal

[1] https://lore.kernel.org/r/20260519-cgroup-dmem-memcg-double-charge-v2-0-db4d1407062b@redhat.com/
* Re: [PATCH rdma-next v2 0/3] cgroup/rdma: add MR memory size resource tracking
2026-05-29 12:46 ` [PATCH rdma-next v2 0/3] cgroup/rdma: add MR memory size resource tracking Michal Koutný
@ 2026-06-01 5:37 ` Tao Cui
0 siblings, 0 replies; 9+ messages in thread
From: Tao Cui @ 2026-06-01 5:37 UTC (permalink / raw)
To: Michal Koutný
Cc: cui.tao, tj, hannes, leon, jgg, linux-rdma, cgroups, Tao Cui

Hi Michal,

Thanks for the review and for the reference.

> IIUC the pinned memory is regular RAM, i.e. it could be controlled
> with memcg as needed. Or is there "physical" limit of what can be
> assigned to a single device?

You are right that the pages associated with an MR are regular system
RAM. However, MR registration does not allocate new pages; it registers
existing pages that are already charged to the allocating process's
memcg.

For that reason, mr_mem is intended to represent a different resource
dimension: not "how much memory does this cgroup own", but "how much
memory may this cgroup register through a given HCA".

In other words:

* memcg limits memory ownership/consumption
* mr_mem limits RDMA registration footprint

An administrator may reasonably wish to set different registration
budgets per device (for example, 1G through mlx5_0 and 4G through
mlx5_1) for the same cgroup. memcg has no notion of device-scoped
limits; it only tracks aggregate memory consumption.

This distinction is important because memory ownership and DMA
registration are not necessarily constrained by the same policy. A
tenant may remain within its memcg limit while still consuming a large
portion of a particular HCA's registration capacity.

The existing RDMA controller already provides a per-device resource
control framework, and mr_mem extends that model to cover memory
registration footprint.

> Or is there "physical" limit of what can be assigned to a single device?

Yes. Real HCAs have finite resources associated with memory
registration, such as MTT/MPT capacity and related DMA translation
resources. In practice, administrators often need to prevent one tenant
from consuming a disproportionate share of a particular HCA's
registration capacity, even when sufficient system memory remains
available.

It is also worth noting that mr_mem is intentionally not an attempt to
account exact pinned pages. The accounting model is tied to MR object
lifetime and tracks registration footprint rather than dynamic physical
page state. For example, ODP MRs may have only a subset of their pages
pinned at any given time, yet still consume registration resources on
the HCA.

This is why the proposal focuses on a stable, policy-oriented
registration budget rather than precise memory ownership accounting.

> BTW, have a look at [1], it'd be good to converge to similar approach
> (the current proposal allows distinguishing whether charging should
> include or exempt memcg counting).

I've read the related dma-buf accounting work. My understanding is that
those proposals focus on allocations that create new memory on behalf
of a device, which is naturally accounted through memcg.

RDMA MR registration is different because no new memory is allocated.
The MR object is an in-kernel registration of existing memory that has
already been accounted elsewhere. The resource being limited is
therefore the registration itself rather than the underlying memory
pages.

> Also it seems, that the dmem controller could be a one-stop solution
> for all DMA charges. Please tell me if there are any distinguishing
> factors between RDMA devices' memory and these dmem memory regions.

One distinction is that the current dmem work appears to focus on
memory resources allocated on behalf of a device, whereas mr_mem is
intended to limit host memory registered for DMA through RDMA MRs.

RDMA NICs typically do not have large device-local memory pools;
instead they provide DMA access to host RAM through memory
registration. As a result, the resource being controlled here is not
device memory consumption itself, but the registration footprint
associated with a particular HCA.

Another difference is the accounting model itself. The proposed mr_mem
accounting is tied to MR object lifetime and tracks registration
footprint rather than precise physical page usage.

My understanding is that dmem is currently integrated with the DRM/TTM
subsystem for device-local memory accounting, and there is no existing
RDMA integration today. I have not investigated what would be required
to extend that model to RDMA registration accounting.

That said, I agree that convergence would be desirable if a generic
framework can naturally express per-device DMA registration budgets. My
goal here is not necessarily to require RDMA-specific accounting, but
to address a practical resource-control problem within the existing
RDMA cgroup framework.

Thanks,
Tao
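The accounting model described in this reply (a per-device byte budget, charged when an MR is registered and released when the MR is destroyed) can be sketched in a few lines. This is an illustrative userspace model under stated assumptions, not the kernel's rdmacg implementation: `struct mr_budget`, `mr_try_charge` and `mr_uncharge` are hypothetical names, the walk up a parent chain models hierarchical charging, and returning `-EAGAIN` on an exceeded limit is an assumption of the sketch.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-cgroup, per-device registration budget. */
struct mr_budget {
	struct mr_budget *parent;	/* NULL at the top of the hierarchy */
	int64_t max;			/* limit in bytes */
	int64_t usage;			/* bytes currently charged */
};

/*
 * Charge len bytes against cg and every ancestor at MR registration time.
 * Either every level is charged or none is.
 */
static int mr_try_charge(struct mr_budget *cg, int64_t len)
{
	struct mr_budget *p, *q;

	if (len < 0)
		return -EINVAL;

	for (p = cg; p; p = p->parent) {
		/* Overflow-safe form of "usage + len > max". */
		if (len > p->max - p->usage)
			goto rollback;
		p->usage += len;
	}
	return 0;

rollback:
	/* Undo the levels charged before the one that refused. */
	for (q = cg; q != p; q = q->parent)
		q->usage -= len;
	return -EAGAIN;
}

/* Release the same amount from cg and every ancestor at MR destruction. */
static void mr_uncharge(struct mr_budget *cg, int64_t len)
{
	for (; cg; cg = cg->parent)
		cg->usage -= len;
}
```

Because the charge is a property of the MR object, the caller is expected to remember the charged length and budget on the object and pass the same values to `mr_uncharge()`; nothing in this model depends on which pages are pinned at any moment, which is the point made above about ODP MRs.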
* Re: [PATCH rdma-next v2 0/3] cgroup/rdma: add MR memory size resource tracking
2026-05-29 9:07 [PATCH rdma-next v2 0/3] cgroup/rdma: add MR memory size resource tracking Tao Cui
` (3 preceding siblings ...)
2026-05-29 12:46 ` [PATCH rdma-next v2 0/3] cgroup/rdma: add MR memory size resource tracking Michal Koutný
@ 2026-05-29 21:14 ` yanjun.zhu
2026-06-01 6:08 ` Tao Cui
4 siblings, 1 reply; 9+ messages in thread
From: yanjun.zhu @ 2026-05-29 21:14 UTC (permalink / raw)
To: Tao Cui, tj, hannes, mkoutny, leon, jgg; +Cc: linux-rdma, cgroups, Tao Cui

On 5/29/26 2:07 AM, Tao Cui wrote:
> From: Tao Cui <cuitao@kylinos.cn>
>
> Currently the RDMA cgroup only tracks two aggregate counters:
> hca_handle and hca_object. The real scarce resource in multi-tenant
> deployments is pinned memory: how much physical memory gets registered
> through MRs. The existing hca_object counter is too coarse to capture
> this.
>
> This series adds a single new resource type:
>
> - mr_mem - Cumulative MR memory size in bytes
>
> The per-object-type counters (qp, mr) from RFC v1 have been removed
> per review feedback [1]: modern NICs pool objects from the same memory
> pool so the distinction between QP count and MR count is not
> meaningful for resource limiting. hca_object remains sufficient for
> coarse object accounting.
>
> After this series, an administrator can set limits like:
>
> echo "mlx5_0 mr_mem=1073741824" > rdma.max
>

Hi,

Thanks for the patchset! Introducing `mr_mem` to track and limit pinned
memory size is a very practical enhancement for multi-tenant
deployments.

I have a question regarding how this new resource type interacts with
Fast Registration (FRWR / FRMR), which is widely used in production
environments (e.g., NVMe-oF, iSER) to achieve high performance.

As we know, FRWR decouples the MR object allocation (`ib_alloc_mr`)
from the actual memory page mapping (`ib_map_mr_sg`). The creation of
FRWR Memory Regions is often managed via a pre-allocated page pool.

Could you clarify how `mr_mem` accounts for FRWR in the following
scenarios?

1. Accounting Granularity: Does `mr_mem` charge the maximum capacity of
   the FRWR object at its allocation time (`ib_alloc_mr`), or does it
   dynamically track the actual mapped bytes during the fast-reg data
   path? If it's the former, it represents a "static maximum budget"
   per pool, which seems more practical for performance.

2. Kernel-space vs Userspace: FRWR pools are frequently allocated by
   kernel-space drivers (like NVMe-oF target/host). If these kernel
   threads are not bound to a specific user cgroup, will their FRWR
   allocations end up in the root cgroup, potentially bypassing the
   per-tenant limits?

Don't you think it would be beneficial to explicitly document or
consider the FRWR pattern in the design section, given its prevalence
in real-world storage and networking workloads?

Thanks,
Zhu Yanjun

> Design
> ~~~~~~
>
> mr_mem is not page-level ownership tracking; it is object-based
> accounting tied to the MR lifetime:
>
> - charged at MR registration time
> - uncharged at MR destruction time
> - the charge is pinned to the cgroup that created the MR for the
> entire lifetime of the MR object
>
> This model intentionally defines accounting semantics around MR
> object lifetime rather than page ownership:
>
> 1. fork(): fork() does not duplicate MR objects. Even though the
> child inherits the uverbs fd and can access the parent's ucontext,
> the MR remains a single kernel object. The charge is tied to the
> MR object, not to the number of processes that can reach it, so
> no splitting or re-accounting is needed.
>
> 2. Cgroup migration: mr_mem follows the same semantics as the existing
> hca_object; charge at creation time against the invoking task's
> cgroup, uncharge at destruction time. The RDMA cgroup does not
> implement can_attach/attach callbacks today, so charges do not
> migrate with the task. This is a known limitation that applies
> equally to hca_handle and hca_object. mr_mem does not introduce
> any new complication here.
>
> 3. Overlap with memory cgroup: mr_mem does not count process memory
> usage; it represents a per-device DMA registration budget: the
> amount of memory this cgroup may register through a given HCA.
> This is a different dimension from what memory cgroup tracks. An
> administrator might set mr_mem limits differently per device, which
> memory cgroup cannot express.
>
> In particular, mr_mem tracks the registered memory range associated
> with the MR rather than exact dynamically pinned pages (e.g. for
> ODP MRs). This is a stable, policy-oriented approximation of
> registration footprint, not an attempt at precise physical page
> accounting.
>
> Tao Cui (3):
> cgroup/rdma: extend charge/uncharge API with s64 amount parameter
> cgroup/rdma: add MR memory size resource tracking
> cgroup/rdma: update cgroup resource list for MR_MEM
>
> Documentation/admin-guide/cgroup-v2.rst | 21 ++--
> drivers/infiniband/core/cgroup.c | 10 +-
> drivers/infiniband/core/core_priv.h | 12 +-
> drivers/infiniband/core/rdma_core.c | 20 +++-
> drivers/infiniband/core/uverbs_cmd.c | 61 +++++++++-
> drivers/infiniband/core/uverbs_std_types_mr.c | 37 ++++++
> include/linux/cgroup_rdma.h | 8 +-
> include/rdma/ib_verbs.h | 1 +
> kernel/cgroup/rdma.c | 108 +++++++++++++-----
> 9 files changed, 219 insertions(+), 59 deletions(-)
>
> ---
> Changes from RFC v1:
>
> - Removed RDMACG_RESOURCE_QP and RDMACG_RESOURCE_MR per-type
> counters following review feedback from Jason Gunthorpe [1].
> - Retained only RDMACG_RESOURCE_MR_MEM as the sole new resource.
> - Added detailed semantic notes to the commit messages addressing
> fork(), cgroup migration, and overlap with memory cgroup [2].
> - Renamed patches to reflect the narrower scope.
>
> [1] https://lore.kernel.org/all/20260525134314.GI7702@ziepe.ca/
> [2] https://lore.kernel.org/all/20260528075537.2170697-1-cuitao@kylinos.cn/
* Re: [PATCH rdma-next v2 0/3] cgroup/rdma: add MR memory size resource tracking
2026-05-29 21:14 ` yanjun.zhu
@ 2026-06-01 6:08 ` Tao Cui
0 siblings, 0 replies; 9+ messages in thread
From: Tao Cui @ 2026-06-01 6:08 UTC (permalink / raw)
To: yanjun.zhu, tj, hannes, mkoutny, leon, jgg
Cc: cui.tao, linux-rdma, cgroups, Tao Cui

Hi Yanjun,

Thanks for the thoughtful questions. FRWR is indeed a widely used
pattern, and the interaction with mr_mem deserves clarification.

> 1. Accounting Granularity: Does mr_mem charge the maximum capacity of
> the FRWR object at its allocation time (ib_alloc_mr), or does it
> dynamically track the actual mapped bytes during the fast-reg data
> path?

In the current proposal, mr_mem is only charged for userspace MR
registrations that go through the uverbs layer (REG_MR, DM_MR,
DMABUF_MR, and the legacy ioctl path). These are the paths where a
concrete byte length is known at registration time.

FRWR MRs allocated via ib_alloc_mr() are not charged for mr_mem. The
actual registration footprint associated with an FRWR MR is not known
at allocation time: ib_alloc_mr() only specifies the maximum
scatter-gather capacity of the MR, while the mapped byte range may
change dynamically across successive ib_map_mr_sg() operations.

Supporting FRWR accounting would therefore require a separate
accounting model, since the registration footprint is established
dynamically rather than by a fixed length parameter supplied at MR
creation. This is outside the scope of the current proposal.

> 2. Kernel-space vs Userspace: FRWR pools are frequently allocated by
> kernel-space drivers (like NVMe-oF target/host). If these kernel
> threads are not bound to a specific user cgroup, will their FRWR
> allocations end up in the root cgroup, potentially bypassing the
> per-tenant limits?

The RDMA cgroup's resource control is primarily designed for userspace
consumers. Kernel-space consumers (NVMe-oF target, SRP initiator, rtrs,
iSER, etc.) allocate resources through kernel APIs (ib_alloc_mr,
ib_create_qp, etc.). These resources do not currently participate in
RDMA cgroup accounting and therefore are not subject to per-cgroup
limits. Kernel-space FRWR pools are typically managed by the
administrator rather than subject to per-tenant limits.

This behavior is consistent with the current RDMA cgroup model, which
tracks resources associated with userspace RDMA objects.

If accounting were extended to kernel-allocated FRWR MRs, ownership
semantics would become an open question: simply charging against the
current task or the root cgroup may not accurately represent the tenant
that ultimately benefits from the resource.

> Don't you think it would be beneficial to explicitly document or
> consider the FRWR pattern in the design section, given its prevalence
> in real-world storage and networking workloads?

Agreed. I will add a note to the cover letter and commit messages
clarifying that mr_mem currently covers only userspace MR registrations
with a known length, and that kernel-space FRWR pools are out of scope
for this initial proposal. The semantic distinction between userspace
registration-length accounting and kernel-space FRWR resource
management is worth documenting explicitly.

Thanks,
Tao