[PATCH rdma-next v2 0/3] cgroup/rdma: add MR memory size resource tracking

* [PATCH rdma-next v2 0/3] cgroup/rdma: add MR memory size resource tracking
@ 2026-05-29  9:07 Tao Cui
  2026-05-29  9:07 ` [PATCH rdma-next v2 1/3] cgroup/rdma: extend charge/uncharge API with s64 amount parameter Tao Cui
                   ` (4 more replies)
  0 siblings, 5 replies; 9+ messages in thread
From: Tao Cui @ 2026-05-29  9:07 UTC
  To: tj, hannes, mkoutny, leon, jgg; +Cc: linux-rdma, cgroups, Tao Cui

From: Tao Cui <cuitao@kylinos.cn>

Currently the RDMA cgroup tracks only two aggregate counters:
hca_handle and hca_object.  In multi-tenant deployments the truly
scarce resource is pinned memory, that is, how much physical memory is
registered through MRs.  The existing hca_object counter is too coarse
to capture this.

This series adds a single new resource type:

  - mr_mem  - Cumulative MR memory size in bytes

The per-object-type counters (qp, mr) from RFC v1 have been removed
per review feedback [1]: modern NICs allocate these objects from a
shared memory pool, so the distinction between QP count and MR count
is not meaningful for resource limiting.  hca_object remains
sufficient for coarse object accounting.

After this series, an administrator can set limits like:

    echo "mlx5_0 mr_mem=1073741824" > rdma.max
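
For reference, 1073741824 bytes is 1 GiB (1 << 30).  A minimal
userspace sketch of how a management tool might build such a line is
shown here; the helper name format_mr_mem_limit() is hypothetical and
is not part of this series:

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper (not part of this series): format one rdma.max
 * line that limits mr_mem on device `dev` to `gib` GiB.
 * Returns the snprintf() result, i.e. the length the full line needs.
 */
static int format_mr_mem_limit(char *buf, size_t len,
			       const char *dev, int64_t gib)
{
	int64_t bytes = gib * (INT64_C(1) << 30);	/* GiB -> bytes */

	return snprintf(buf, len, "%s mr_mem=%" PRId64, dev, bytes);
}
```

The resulting string would then be written to the cgroup's rdma.max
file, exactly as the echo command above does.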

Design
~~~~~~

mr_mem is not page-level ownership tracking; it is object-based
accounting tied to the MR lifetime:

  - charged at MR registration time
  - uncharged at MR destruction time
  - the charge is pinned to the cgroup that created the MR for the
    entire lifetime of the MR object

This model intentionally defines accounting semantics around MR
object lifetime rather than page ownership:

1. fork(): fork() does not duplicate MR objects.  Even though the
   child inherits the uverbs fd and can access the parent's ucontext,
   the MR remains a single kernel object.  The charge is tied to the
   MR object, not to the number of processes that can reach it, so
   no splitting or re-accounting is needed.

2. Cgroup migration: mr_mem follows the same semantics as the existing
   hca_object: charge at creation time against the invoking task's
   cgroup, uncharge at destruction time.  The RDMA cgroup does not
   implement can_attach/attach callbacks today, so charges do not
   migrate with the task.  This is a known limitation that applies
   equally to hca_handle and hca_object.  mr_mem does not introduce
   any new complication here.

3. Overlap with memory cgroup: mr_mem does not count process memory
   usage; it represents a per-device DMA registration budget: the
   amount of memory this cgroup may register through a given HCA.
   This is a different dimension from what memory cgroup tracks.  An
   administrator might set mr_mem limits differently per device, which
   memory cgroup cannot express.

   In particular, mr_mem tracks the registered memory range associated
   with the MR rather than exact dynamically pinned pages (e.g. for
   ODP MRs).  This is a stable, policy-oriented approximation of
   registration footprint, not an attempt at precise physical page
   accounting.

Tao Cui (3):
  cgroup/rdma: extend charge/uncharge API with s64 amount parameter
  cgroup/rdma: add MR memory size resource tracking
  cgroup/rdma: update cgroup resource list for MR_MEM

 Documentation/admin-guide/cgroup-v2.rst       |  21 ++--
 drivers/infiniband/core/cgroup.c              |  10 +-
 drivers/infiniband/core/core_priv.h           |  12 +-
 drivers/infiniband/core/rdma_core.c           |  20 +++-
 drivers/infiniband/core/uverbs_cmd.c          |  61 +++++++++-
 drivers/infiniband/core/uverbs_std_types_mr.c |  37 ++++++
 include/linux/cgroup_rdma.h                   |   8 +-
 include/rdma/ib_verbs.h                       |   1 +
 kernel/cgroup/rdma.c                          | 108 +++++++++++++-----
 9 files changed, 219 insertions(+), 59 deletions(-)

---
Changes from RFC v1:

  - Removed RDMACG_RESOURCE_QP and RDMACG_RESOURCE_MR per-type
    counters following review feedback from Jason Gunthorpe [1].
  - Retained only RDMACG_RESOURCE_MR_MEM as the sole new resource.
  - Added detailed semantic notes to the commit messages addressing
    fork(), cgroup migration, and overlap with memory cgroup [2].
  - Renamed patches to reflect the narrower scope.

[1] https://lore.kernel.org/all/20260525134314.GI7702@ziepe.ca/
[2] https://lore.kernel.org/all/20260528075537.2170697-1-cuitao@kylinos.cn/
-- 
2.43.0

* [PATCH rdma-next v2 1/3] cgroup/rdma: extend charge/uncharge API with s64 amount parameter
  2026-05-29  9:07 [PATCH rdma-next v2 0/3] cgroup/rdma: add MR memory size resource tracking Tao Cui
@ 2026-05-29  9:07 ` Tao Cui
  2026-05-29  9:07 ` [PATCH rdma-next v2 2/3] cgroup/rdma: add MR memory size resource tracking Tao Cui
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 9+ messages in thread
From: Tao Cui @ 2026-05-29  9:07 UTC
  To: tj, hannes, mkoutny, leon, jgg; +Cc: linux-rdma, cgroups, Tao Cui

From: Tao Cui <cuitao@kylinos.cn>

Change struct rdmacg_resource fields (max, usage, peak) and all
charge/uncharge function signatures from int to s64 to prepare for
byte-sized resource tracking such as MR memory.

Replace match_int with a match_s64 helper that uses kstrtoll so the
user-space limit tokens accept 64-bit values.  All existing callers
pass amount=1 (count-based), so the change is transparent for
existing count-based resources.
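
As an illustration only, the parsing rule described above (accept a
non-negative 64-bit value, reject everything else) can be sketched in
userspace with strtoll().  parse_limit() is an illustrative name; the
kernel helper in this patch uses kstrtoll() and differs in details
such as trailing-newline handling:

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Userspace analogue of the match_s64() idea; an assumption-laden
 * sketch, not the kernel code.  Returns 0 on success, -ERANGE if the
 * value does not fit in 64 bits, -EINVAL if it is malformed or negative.
 */
static int parse_limit(const char *s, int64_t *result)
{
	char *end;
	long long val;

	errno = 0;
	val = strtoll(s, &end, 0);	/* base 0: decimal, hex or octal */
	if (errno == ERANGE)
		return -ERANGE;
	if (end == s || *end != '\0')
		return -EINVAL;		/* empty input or trailing junk */
	if (val < 0)
		return -EINVAL;		/* limits are never negative */
	*result = val;
	return 0;
}
```

A value such as 4294967296 (4 GiB) does not fit in a 32-bit int, which
is why an int-based parser is insufficient for byte-sized limits.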

The rpool->usage_sum counter continues to track the number of active
charge operations (not the sum of charged amounts); this is correct
because it governs rpool lifetime - a pool is releasable only when
all charges, regardless of amount, have been released.
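
The difference between the two counters can be sketched with a toy
model.  The struct and function names below are illustrative and are
not the kernel's; the model also ignores any other conditions the
kernel applies before freeing a pool:

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy model of one resource in a pool; not the kernel's structures. */
struct toy_pool {
	int64_t usage;		/* sum of charged amounts */
	int64_t max;		/* configured limit */
	uint64_t usage_sum;	/* number of outstanding charges */
};

/* Charge `amount`; fail with no side effects if the limit would be hit. */
static bool toy_charge(struct toy_pool *p, int64_t amount)
{
	/* Written as a subtraction so the check itself cannot overflow. */
	if (amount <= 0 || amount > p->max - p->usage)
		return false;
	p->usage += amount;
	p->usage_sum++;		/* one more active charge, whatever its size */
	return true;
}

static void toy_uncharge(struct toy_pool *p, int64_t amount)
{
	p->usage -= amount;
	p->usage_sum--;
}

/* The pool can go away only once every charge has been released. */
static bool toy_pool_releasable(const struct toy_pool *p)
{
	return p->usage_sum == 0;
}
```

In this model a 4096-byte charge and a 1-byte charge each count as one
entry in usage_sum, so releasing only the larger one does not make the
pool releasable.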

Signed-off-by: Tao Cui <cuitao@kylinos.cn>
---
 drivers/infiniband/core/cgroup.c     | 10 ++--
 drivers/infiniband/core/core_priv.h  | 12 ++--
 drivers/infiniband/core/rdma_core.c  |  8 +--
 drivers/infiniband/core/uverbs_cmd.c |  4 +-
 include/linux/cgroup_rdma.h          |  7 ++-
 kernel/cgroup/rdma.c                 | 87 ++++++++++++++++++----------
 6 files changed, 83 insertions(+), 45 deletions(-)

diff --git a/drivers/infiniband/core/cgroup.c b/drivers/infiniband/core/cgroup.c
index 1f037fe01450..81e24de72392 100644
--- a/drivers/infiniband/core/cgroup.c
+++ b/drivers/infiniband/core/cgroup.c
@@ -36,18 +36,20 @@ void ib_device_unregister_rdmacg(struct ib_device *device)
 
 int ib_rdmacg_try_charge(struct ib_rdmacg_object *cg_obj,
 			 struct ib_device *device,
-			 enum rdmacg_resource_type resource_index)
+			 enum rdmacg_resource_type resource_index,
+			 s64 amount)
 {
 	return rdmacg_try_charge(&cg_obj->cg, &device->cg_device,
-				 resource_index);
+				 resource_index, amount);
 }
 EXPORT_SYMBOL(ib_rdmacg_try_charge);
 
 void ib_rdmacg_uncharge(struct ib_rdmacg_object *cg_obj,
 			struct ib_device *device,
-			enum rdmacg_resource_type resource_index)
+			enum rdmacg_resource_type resource_index,
+			s64 amount)
 {
 	rdmacg_uncharge(cg_obj->cg, &device->cg_device,
-			resource_index);
+			resource_index, amount);
 }
 EXPORT_SYMBOL(ib_rdmacg_uncharge);
diff --git a/drivers/infiniband/core/core_priv.h b/drivers/infiniband/core/core_priv.h
index a2c36666e6fc..345356d1e504 100644
--- a/drivers/infiniband/core/core_priv.h
+++ b/drivers/infiniband/core/core_priv.h
@@ -159,11 +159,13 @@ void ib_device_unregister_rdmacg(struct ib_device *device);
 
 int ib_rdmacg_try_charge(struct ib_rdmacg_object *cg_obj,
 			 struct ib_device *device,
-			 enum rdmacg_resource_type resource_index);
+			 enum rdmacg_resource_type resource_index,
+			 s64 amount);
 
 void ib_rdmacg_uncharge(struct ib_rdmacg_object *cg_obj,
 			struct ib_device *device,
-			enum rdmacg_resource_type resource_index);
+			enum rdmacg_resource_type resource_index,
+			s64 amount);
 #else
 static inline void ib_device_register_rdmacg(struct ib_device *device)
 {
@@ -175,14 +177,16 @@ static inline void ib_device_unregister_rdmacg(struct ib_device *device)
 
 static inline int ib_rdmacg_try_charge(struct ib_rdmacg_object *cg_obj,
 				       struct ib_device *device,
-				       enum rdmacg_resource_type resource_index)
+				       enum rdmacg_resource_type resource_index,
+				       s64 amount)
 {
 	return 0;
 }
 
 static inline void ib_rdmacg_uncharge(struct ib_rdmacg_object *cg_obj,
 				      struct ib_device *device,
-				      enum rdmacg_resource_type resource_index)
+				      enum rdmacg_resource_type resource_index,
+				      s64 amount)
 {
 }
 #endif
diff --git a/drivers/infiniband/core/rdma_core.c b/drivers/infiniband/core/rdma_core.c
index 5018ec837056..3268285b5478 100644
--- a/drivers/infiniband/core/rdma_core.c
+++ b/drivers/infiniband/core/rdma_core.c
@@ -437,7 +437,7 @@ alloc_begin_idr_uobject(const struct uverbs_api_object *obj,
 		goto uobj_put;
 
 	ret = ib_rdmacg_try_charge(&uobj->cg_obj, uobj->context->device,
-				   RDMACG_RESOURCE_HCA_OBJECT);
+				   RDMACG_RESOURCE_HCA_OBJECT, 1);
 	if (ret)
 		goto remove;
 
@@ -526,7 +526,7 @@ struct ib_uobject *rdma_alloc_begin_uobject(const struct uverbs_api_object *obj,
 static void alloc_abort_idr_uobject(struct ib_uobject *uobj)
 {
 	ib_rdmacg_uncharge(&uobj->cg_obj, uobj->context->device,
-			   RDMACG_RESOURCE_HCA_OBJECT);
+			   RDMACG_RESOURCE_HCA_OBJECT, 1);
 
 	xa_erase(&uobj->ufile->idr, uobj->id);
 }
@@ -547,7 +547,7 @@ static int __must_check destroy_hw_idr_uobject(struct ib_uobject *uobj,
 		return 0;
 
 	ib_rdmacg_uncharge(&uobj->cg_obj, uobj->context->device,
-			   RDMACG_RESOURCE_HCA_OBJECT);
+			   RDMACG_RESOURCE_HCA_OBJECT, 1);
 
 	return 0;
 }
@@ -878,7 +878,7 @@ static void ufile_destroy_ucontext(struct ib_uverbs_file *ufile,
 	}
 
 	ib_rdmacg_uncharge(&ucontext->cg_obj, ib_dev,
-			   RDMACG_RESOURCE_HCA_HANDLE);
+			   RDMACG_RESOURCE_HCA_HANDLE, 1);
 
 	rdma_restrack_del(&ucontext->res);
 
diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c
index 91a62d2ade4d..9540ac180711 100644
--- a/drivers/infiniband/core/uverbs_cmd.c
+++ b/drivers/infiniband/core/uverbs_cmd.c
@@ -234,7 +234,7 @@ int ib_init_ucontext(struct uverbs_attr_bundle *attrs)
 	}
 
 	ret = ib_rdmacg_try_charge(&ucontext->cg_obj, ucontext->device,
-				   RDMACG_RESOURCE_HCA_HANDLE);
+				   RDMACG_RESOURCE_HCA_HANDLE, 1);
 	if (ret)
 		goto err;
 
@@ -273,7 +273,7 @@ int ib_init_ucontext(struct uverbs_attr_bundle *attrs)
 
 err_uncharge:
 	ib_rdmacg_uncharge(&ucontext->cg_obj, ucontext->device,
-			   RDMACG_RESOURCE_HCA_HANDLE);
+			   RDMACG_RESOURCE_HCA_HANDLE, 1);
 err:
 	mutex_unlock(&file->ucontext_lock);
 	up_read(&file->hw_destroy_rwsem);
diff --git a/include/linux/cgroup_rdma.h b/include/linux/cgroup_rdma.h
index 404e746552ca..7146cefa95a6 100644
--- a/include/linux/cgroup_rdma.h
+++ b/include/linux/cgroup_rdma.h
@@ -7,6 +7,7 @@
 #define _CGROUP_RDMA_H
 
 #include <linux/cgroup.h>
+#include <linux/types.h>
 
 enum rdmacg_resource_type {
 	RDMACG_RESOURCE_HCA_HANDLE,
@@ -46,9 +47,11 @@ void rdmacg_unregister_device(struct rdmacg_device *device);
 /* APIs for RDMA/IB stack to charge/uncharge pool specific resources */
 int rdmacg_try_charge(struct rdma_cgroup **rdmacg,
 		      struct rdmacg_device *device,
-		      enum rdmacg_resource_type index);
+		      enum rdmacg_resource_type index,
+		      s64 amount);
 void rdmacg_uncharge(struct rdma_cgroup *cg,
 		     struct rdmacg_device *device,
-		     enum rdmacg_resource_type index);
+		     enum rdmacg_resource_type index,
+		     s64 amount);
 #endif	/* CONFIG_CGROUP_RDMA */
 #endif	/* _CGROUP_RDMA_H */
diff --git a/kernel/cgroup/rdma.c b/kernel/cgroup/rdma.c
index 5e82a03b3270..519f7f537223 100644
--- a/kernel/cgroup/rdma.c
+++ b/kernel/cgroup/rdma.c
@@ -59,9 +59,9 @@ static char const *rdmacg_resource_names[] = {
 
 /* resource tracker for each resource of rdma cgroup */
 struct rdmacg_resource {
-	int max;
-	int usage;
-	int peak;
+	s64 max;
+	s64 usage;
+	s64 peak;
 };
 
 /*
@@ -105,13 +105,13 @@ static inline struct rdma_cgroup *get_current_rdmacg(void)
 }
 
 static void set_resource_limit(struct rdmacg_resource_pool *rpool,
-			       int index, int new_max)
+			       int index, s64 new_max)
 {
-	if (new_max == S32_MAX) {
-		if (rpool->resources[index].max != S32_MAX)
+	if (new_max == S64_MAX) {
+		if (rpool->resources[index].max != S64_MAX)
 			rpool->num_max_cnt++;
 	} else {
-		if (rpool->resources[index].max == S32_MAX)
+		if (rpool->resources[index].max == S64_MAX)
 			rpool->num_max_cnt--;
 	}
 	rpool->resources[index].max = new_max;
@@ -122,7 +122,7 @@ static void set_all_resource_max_limit(struct rdmacg_resource_pool *rpool)
 	int i;
 
 	for (i = 0; i < RDMACG_RESOURCE_MAX; i++)
-		set_resource_limit(rpool, i, S32_MAX);
+		set_resource_limit(rpool, i, S64_MAX);
 }
 
 static void free_cg_rpool_locked(struct rdmacg_resource_pool *rpool)
@@ -206,7 +206,8 @@ get_cg_rpool_locked(struct rdma_cgroup *cg, struct rdmacg_device *device)
 static void
 uncharge_cg_locked(struct rdma_cgroup *cg,
 		   struct rdmacg_device *device,
-		   enum rdmacg_resource_type index)
+		   enum rdmacg_resource_type index,
+		   s64 amount)
 {
 	struct rdmacg_resource_pool *rpool;
 
@@ -222,7 +223,7 @@ uncharge_cg_locked(struct rdma_cgroup *cg,
 		return;
 	}
 
-	rpool->resources[index].usage--;
+	rpool->resources[index].usage -= amount;
 
 	/*
 	 * A negative count (or overflow) is invalid,
@@ -307,14 +308,15 @@ static void rdmacg_event_locked(struct rdma_cgroup *cg,
 static void rdmacg_uncharge_hierarchy(struct rdma_cgroup *cg,
 				     struct rdmacg_device *device,
 				     struct rdma_cgroup *stop_cg,
-				     enum rdmacg_resource_type index)
+				     enum rdmacg_resource_type index,
+				     s64 amount)
 {
 	struct rdma_cgroup *p;
 
 	mutex_lock(&rdmacg_mutex);
 
 	for (p = cg; p != stop_cg; p = parent_rdmacg(p))
-		uncharge_cg_locked(p, device, index);
+		uncharge_cg_locked(p, device, index, amount);
 
 	mutex_unlock(&rdmacg_mutex);
 
@@ -329,12 +331,13 @@ static void rdmacg_uncharge_hierarchy(struct rdma_cgroup *cg,
  */
 void rdmacg_uncharge(struct rdma_cgroup *cg,
 		     struct rdmacg_device *device,
-		     enum rdmacg_resource_type index)
+		     enum rdmacg_resource_type index,
+		     s64 amount)
 {
 	if (index >= RDMACG_RESOURCE_MAX)
 		return;
 
-	rdmacg_uncharge_hierarchy(cg, device, NULL, index);
+	rdmacg_uncharge_hierarchy(cg, device, NULL, index, amount);
 }
 EXPORT_SYMBOL(rdmacg_uncharge);
 
@@ -343,6 +346,7 @@ EXPORT_SYMBOL(rdmacg_uncharge);
  * @rdmacg: pointer to rdma cgroup which will own this resource
  * @device: pointer to rdmacg device
  * @index: index of the resource to charge in cgroup (resource pool)
+ * @amount: amount to charge
  *
  * This function follows charging resource in hierarchical way.
  * It will fail if the charge would cause the new value to exceed the
@@ -361,7 +365,8 @@ EXPORT_SYMBOL(rdmacg_uncharge);
  */
 int rdmacg_try_charge(struct rdma_cgroup **rdmacg,
 		      struct rdmacg_device *device,
-		      enum rdmacg_resource_type index)
+		      enum rdmacg_resource_type index,
+		      s64 amount)
 {
 	struct rdma_cgroup *cg, *p;
 	struct rdmacg_resource_pool *rpool;
@@ -371,6 +376,9 @@ int rdmacg_try_charge(struct rdma_cgroup **rdmacg,
 	if (index >= RDMACG_RESOURCE_MAX)
 		return -EINVAL;
 
+	if (amount <= 0)
+		return -EINVAL;
+
 	/*
 	 * hold on to css, as cgroup can be removed but resource
 	 * accounting happens on css.
@@ -384,8 +392,9 @@ int rdmacg_try_charge(struct rdma_cgroup **rdmacg,
 			ret = PTR_ERR(rpool);
 			goto err;
 		} else {
-			new = (s64)rpool->resources[index].usage + 1;
-			if (new > rpool->resources[index].max) {
+			new = rpool->resources[index].usage + amount;
+			if (new < rpool->resources[index].usage ||
+			    new > rpool->resources[index].max) {
 				ret = -EAGAIN;
 				goto err;
 			} else {
@@ -409,7 +418,7 @@ int rdmacg_try_charge(struct rdma_cgroup **rdmacg,
 	if (ret == -EAGAIN)
 		rdmacg_event_locked(cg, p, device, index);
 	mutex_unlock(&rdmacg_mutex);
-	rdmacg_uncharge_hierarchy(cg, device, p, index);
+	rdmacg_uncharge_hierarchy(cg, device, p, index, amount);
 	return ret;
 }
 EXPORT_SYMBOL(rdmacg_try_charge);
@@ -477,6 +486,25 @@ static struct rdmacg_device *rdmacg_get_device_locked(const char *name)
 	return NULL;
 }
 
+static int match_s64(substring_t *s, s64 *result)
+{
+	char *buf;
+	int ret;
+	s64 val;
+
+	buf = kmemdup_nul(s->from, s->to - s->from, GFP_KERNEL);
+	if (!buf)
+		return -ENOMEM;
+	ret = kstrtoll(buf, 0, &val);
+	kfree(buf);
+	if (ret)
+		return ret;
+	if (val < 0)
+		return -EINVAL;
+	*result = val;
+	return 0;
+}
+
 static ssize_t rdmacg_resource_set_max(struct kernfs_open_file *of,
 				       char *buf, size_t nbytes, loff_t off)
 {
@@ -486,7 +514,7 @@ static ssize_t rdmacg_resource_set_max(struct kernfs_open_file *of,
 	struct rdmacg_device *device;
 	char *options = strstrip(buf);
 	char *p;
-	int *new_limits;
+	s64 *new_limits;
 	unsigned long enables = 0;
 	int i = 0, ret = 0;
 
@@ -497,7 +525,7 @@ static ssize_t rdmacg_resource_set_max(struct kernfs_open_file *of,
 		goto err;
 	}
 
-	new_limits = kzalloc_objs(int, RDMACG_RESOURCE_MAX);
+	new_limits = kcalloc(RDMACG_RESOURCE_MAX, sizeof(s64), GFP_KERNEL);
 	if (!new_limits) {
 		ret = -ENOMEM;
 		goto err;
@@ -506,7 +534,8 @@ static ssize_t rdmacg_resource_set_max(struct kernfs_open_file *of,
 	/* parse resource limit tokens */
 	while ((p = strsep(&options, " \t\n"))) {
 		substring_t args[MAX_OPT_ARGS];
-		int tok, intval;
+		int tok;
+		s64 intval;
 
 		if (!*p)
 			continue;
@@ -514,7 +543,7 @@ static ssize_t rdmacg_resource_set_max(struct kernfs_open_file *of,
 		tok = match_token(p, rdmacg_limit_tokens, args);
 		switch (tok) {
 		case RDMACG_HCA_HANDLE_VAL:
-			if (match_int(&args[0], &intval) || intval < 0) {
+			if (match_s64(&args[0], &intval)) {
 				ret = -EINVAL;
 				goto parse_err;
 			}
@@ -522,11 +551,11 @@ static ssize_t rdmacg_resource_set_max(struct kernfs_open_file *of,
 			enables |= BIT(RDMACG_RESOURCE_HCA_HANDLE);
 			break;
 		case RDMACG_HCA_HANDLE_MAX:
-			new_limits[RDMACG_RESOURCE_HCA_HANDLE] = S32_MAX;
+			new_limits[RDMACG_RESOURCE_HCA_HANDLE] = S64_MAX;
 			enables |= BIT(RDMACG_RESOURCE_HCA_HANDLE);
 			break;
 		case RDMACG_HCA_OBJECT_VAL:
-			if (match_int(&args[0], &intval) || intval < 0) {
+			if (match_s64(&args[0], &intval)) {
 				ret = -EINVAL;
 				goto parse_err;
 			}
@@ -534,7 +563,7 @@ static ssize_t rdmacg_resource_set_max(struct kernfs_open_file *of,
 			enables |= BIT(RDMACG_RESOURCE_HCA_OBJECT);
 			break;
 		case RDMACG_HCA_OBJECT_MAX:
-			new_limits[RDMACG_RESOURCE_HCA_OBJECT] = S32_MAX;
+			new_limits[RDMACG_RESOURCE_HCA_OBJECT] = S64_MAX;
 			enables |= BIT(RDMACG_RESOURCE_HCA_OBJECT);
 			break;
 		default:
@@ -588,7 +617,7 @@ static void print_rpool_values(struct seq_file *sf,
 {
 	enum rdmacg_file_type sf_type;
 	int i;
-	u32 value;
+	s64 value;
 
 	sf_type = seq_cft(sf)->private;
 
@@ -599,7 +628,7 @@ static void print_rpool_values(struct seq_file *sf,
 			if (rpool)
 				value = rpool->resources[i].max;
 			else
-				value = S32_MAX;
+				value = S64_MAX;
 		} else if (sf_type == RDMACG_RESOURCE_TYPE_PEAK) {
 			value = rpool ? rpool->resources[i].peak : 0;
 		} else {
@@ -609,10 +638,10 @@ static void print_rpool_values(struct seq_file *sf,
 				value = 0;
 		}
 
-		if (value == S32_MAX)
+		if (value == S64_MAX)
 			seq_puts(sf, RDMACG_MAX_STR);
 		else
-			seq_printf(sf, "%d", value);
+			seq_printf(sf, "%lld", value);
 		seq_putc(sf, ' ');
 	}
 }
-- 
2.43.0


* [PATCH rdma-next v2 2/3] cgroup/rdma: add MR memory size resource tracking
  2026-05-29  9:07 [PATCH rdma-next v2 0/3] cgroup/rdma: add MR memory size resource tracking Tao Cui
  2026-05-29  9:07 ` [PATCH rdma-next v2 1/3] cgroup/rdma: extend charge/uncharge API with s64 amount parameter Tao Cui
@ 2026-05-29  9:07 ` Tao Cui
  2026-05-29  9:07 ` [PATCH rdma-next v2 3/3] cgroup/rdma: update cgroup resource list for MR_MEM Tao Cui
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 9+ messages in thread
From: Tao Cui @ 2026-05-29  9:07 UTC
  To: tj, hannes, mkoutny, leon, jgg; +Cc: linux-rdma, cgroups, Tao Cui

From: Tao Cui <cuitao@kylinos.cn>

Add RDMACG_RESOURCE_MR_MEM so that the cumulative memory size of
registered Memory Regions can be tracked and limited independently
from the aggregate hca_object counter.

Unlike count-based resources (hca_handle, hca_object) which are
charged in the generic IDR allocation path, MR_MEM is byte-based
and must be charged after the MR length is known.  Charge in the
uverbs MR registration handlers (ioctl and legacy), and uncharge
in the generic destroy paths (alloc_abort_idr_uobject,
destroy_hw_idr_uobject).

Store the charged byte count in uobj->rdmacg_mr_mem_bytes so that
the destroy path knows how much to uncharge.
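
The record-then-uncharge pattern can be sketched as follows.  This is
a simplified userspace model under stated assumptions: toy_cgroup and
toy_mr_obj are stand-ins invented for illustration, and charged_bytes
plays the role of rdmacg_mr_mem_bytes:

```c
#include <stdint.h>

/* Toy stand-ins; the real types are the rdmacg pool and struct ib_uobject. */
struct toy_cgroup {
	int64_t mr_mem_usage;
	int64_t mr_mem_max;
};

struct toy_mr_obj {
	struct toy_cgroup *cg;	/* cgroup charged at creation time */
	int64_t charged_bytes;	/* amount to give back on destroy */
};

/* Register: charge the creating cgroup and remember how much was charged. */
static int toy_mr_register(struct toy_mr_obj *obj, struct toy_cgroup *cg,
			   int64_t length)
{
	if (length < 0 || length > cg->mr_mem_max - cg->mr_mem_usage)
		return -1;	/* over budget: nothing is charged */
	cg->mr_mem_usage += length;
	obj->cg = cg;
	obj->charged_bytes = length;
	return 0;
}

/* Destroy: uncharge exactly what was recorded, from the recorded cgroup. */
static void toy_mr_destroy(struct toy_mr_obj *obj)
{
	if (obj->charged_bytes)
		obj->cg->mr_mem_usage -= obj->charged_bytes;
	obj->charged_bytes = 0;
}
```

Because the object itself remembers both the cgroup and the amount,
the destroy path needs no information from the task that happens to
trigger it.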

Semantic notes
~~~~~~~~~~~~~~

mr_mem is not page-level ownership tracking - it is object-based
accounting tied to the MR lifetime:

  - charged at MR registration time
  - uncharged at MR destruction time
  - the charge lives with the MR's creating cgroup for the entire
    lifetime of the MR object

This model intentionally defines accounting semantics around MR
object lifetime rather than page ownership:

1. fork(): fork() does not duplicate MR objects.  Even though the
   child inherits the uverbs fd and can access the parent's ucontext,
   the MR remains a single kernel object.  The charge is tied to the
   MR object, not to the number of processes that can reach it, so
   no splitting or re-accounting is needed.

2. Cgroup migration: mr_mem follows the same semantics as the existing
   hca_object - charge at creation time against the invoking task's
   cgroup, uncharge at destruction time.  The RDMA cgroup does not
   implement can_attach/attach callbacks today, so charges do not
   migrate with the task.  This is a known limitation that applies
   equally to hca_handle and hca_object.  mr_mem does not introduce
   any new complication here.

3. Overlap with memory cgroup: mr_mem does not count process memory
   usage; it represents a per-device DMA registration budget: how
   much memory this cgroup can register through a given HCA.  This is
   a different dimension from what memory cgroup tracks.  An
   administrator might set mr_mem limits differently per device, which
   memory cgroup cannot express.

   In particular, mr_mem tracks the registered memory range associated
   with the MR rather than exact dynamically pinned pages (e.g. for
   ODP MRs).  This is a stable, policy-oriented approximation of
   registration footprint - not an attempt at precise physical page
   accounting.

Guard against u64-to-s64 overflow by rejecting MR lengths that
exceed S64_MAX at each registration site.
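
A minimal sketch of such a guard, written as a standalone userspace
function; mr_length_to_amount() is a hypothetical name and the patch
open-codes the comparison at each site instead:

```c
#include <errno.h>
#include <stdint.h>

/*
 * Convert a user-supplied unsigned 64-bit MR length into the signed
 * 64-bit amount used for charging.  Lengths above INT64_MAX would turn
 * negative when cast, so they are rejected up front.
 */
static int mr_length_to_amount(uint64_t length, int64_t *amount)
{
	if (length > (uint64_t)INT64_MAX)
		return -EINVAL;
	*amount = (int64_t)length;
	return 0;
}
```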

Handle MR reregistration (IB_USER_VERBS_CMD_REREG_MR with
IB_MR_REREG_TRANS) in two cases.  When the MR is modified in place,
compute the delta between the old and new lengths and charge or
uncharge the difference.  When the driver creates a new HW object
(new_mr != NULL), charge the full new length to the new uobj; the old
uobj's mr_mem is released through the existing rdma_assign_uobject
-> destroy_hw_idr_uobject -> rdmacg_uncharge_uobj path.
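
The in-place delta adjustment can be sketched in isolation.  The names
toy_budget and toy_rereg_adjust() are illustrative only; the sketch
assumes both lengths have already been checked to lie in
[0, INT64_MAX], so the subtraction cannot overflow:

```c
#include <stdint.h>

/* Toy per-cgroup budget; not the kernel's resource pool. */
struct toy_budget {
	int64_t usage;
	int64_t max;
};

/*
 * Move an existing charge from *charged bytes to new_len bytes.
 * Growing charges only the difference and may fail; shrinking always
 * succeeds.  On failure neither the budget nor *charged is modified.
 */
static int toy_rereg_adjust(struct toy_budget *b, int64_t *charged,
			    int64_t new_len)
{
	int64_t delta = new_len - *charged;

	if (delta > 0) {
		if (delta > b->max - b->usage)
			return -1;	/* growth would exceed the limit */
		b->usage += delta;
	} else if (delta < 0) {
		b->usage += delta;	/* negative delta releases budget */
	}
	*charged = new_len;
	return 0;
}
```

Charging only the difference avoids a transient state in which both
the old and the new length are counted against the limit.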

Enable MR memory limits:

  echo "mlx5_0 mr_mem=1073741824" > rdma.max

Signed-off-by: Tao Cui <cuitao@kylinos.cn>
---
 drivers/infiniband/core/rdma_core.c           | 14 ++++-
 drivers/infiniband/core/uverbs_cmd.c          | 57 +++++++++++++++++++
 drivers/infiniband/core/uverbs_std_types_mr.c | 37 ++++++++++++
 include/linux/cgroup_rdma.h                   |  1 +
 include/rdma/ib_verbs.h                       |  1 +
 kernel/cgroup/rdma.c                          | 21 ++++++-
 6 files changed, 126 insertions(+), 5 deletions(-)

diff --git a/drivers/infiniband/core/rdma_core.c b/drivers/infiniband/core/rdma_core.c
index 3268285b5478..a540cef6bb67 100644
--- a/drivers/infiniband/core/rdma_core.c
+++ b/drivers/infiniband/core/rdma_core.c
@@ -523,10 +523,19 @@ struct ib_uobject *rdma_alloc_begin_uobject(const struct uverbs_api_object *obj,
 	return ret;
 }
 
-static void alloc_abort_idr_uobject(struct ib_uobject *uobj)
+static void rdmacg_uncharge_uobj(struct ib_uobject *uobj)
 {
 	ib_rdmacg_uncharge(&uobj->cg_obj, uobj->context->device,
 			   RDMACG_RESOURCE_HCA_OBJECT, 1);
+	if (uobj->rdmacg_mr_mem_bytes)
+		ib_rdmacg_uncharge(&uobj->cg_obj, uobj->context->device,
+				   RDMACG_RESOURCE_MR_MEM,
+				   uobj->rdmacg_mr_mem_bytes);
+}
+
+static void alloc_abort_idr_uobject(struct ib_uobject *uobj)
+{
+	rdmacg_uncharge_uobj(uobj);
 
 	xa_erase(&uobj->ufile->idr, uobj->id);
 }
@@ -546,8 +555,7 @@ static int __must_check destroy_hw_idr_uobject(struct ib_uobject *uobj,
 	if (why == RDMA_REMOVE_ABORT)
 		return 0;
 
-	ib_rdmacg_uncharge(&uobj->cg_obj, uobj->context->device,
-			   RDMACG_RESOURCE_HCA_OBJECT, 1);
+	rdmacg_uncharge_uobj(uobj);
 
 	return 0;
 }
diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c
index 9540ac180711..901de117c808 100644
--- a/drivers/infiniband/core/uverbs_cmd.c
+++ b/drivers/infiniband/core/uverbs_cmd.c
@@ -752,6 +752,17 @@ static int ib_uverbs_reg_mr(struct uverbs_attr_bundle *attrs)
 
 	uobj->object = mr;
 	uobj_put_obj_read(pd);
+
+	if (cmd.length > S64_MAX)
+		goto err_free;
+	if (cmd.length) {
+		ret = ib_rdmacg_try_charge(&uobj->cg_obj, uobj->context->device,
+					   RDMACG_RESOURCE_MR_MEM, cmd.length);
+		if (ret)
+			goto err_dereg;
+		uobj->rdmacg_mr_mem_bytes = cmd.length;
+	}
+
 	uobj_finalize_uobj_create(uobj, attrs);
 
 	resp.lkey = mr->lkey;
@@ -759,6 +770,8 @@ static int ib_uverbs_reg_mr(struct uverbs_attr_bundle *attrs)
 	resp.mr_handle = uobj->id;
 	return uverbs_response(attrs, &resp, sizeof(resp));
 
+err_dereg:
+	ib_dereg_mr_user(mr, &attrs->driver_udata);
 err_put:
 	uobj_put_obj_read(pd);
 err_free:
@@ -854,6 +867,20 @@ static int ib_uverbs_rereg_mr(struct uverbs_attr_bundle *attrs)
 		rdma_restrack_set_name(&new_mr->res, NULL);
 		rdma_restrack_add(&new_mr->res);
 
+		if ((cmd.flags & IB_MR_REREG_TRANS) && cmd.length) {
+			if (cmd.length > S64_MAX) {
+				ret = -EINVAL;
+				goto err_rereg_new_mr;
+			}
+			ret = ib_rdmacg_try_charge(&new_uobj->cg_obj,
+						   new_uobj->context->device,
+						   RDMACG_RESOURCE_MR_MEM,
+						   cmd.length);
+			if (ret)
+				goto err_rereg_new_mr;
+			new_uobj->rdmacg_mr_mem_bytes = cmd.length;
+		}
+
 		/*
 		 * The new uobj for the new HW object is put into the same spot
 		 * in the IDR and the old uobj & HW object is deleted.
@@ -871,6 +898,31 @@ static int ib_uverbs_rereg_mr(struct uverbs_attr_bundle *attrs)
 			atomic_inc(&new_pd->usecnt);
 		}
 		if (cmd.flags & IB_MR_REREG_TRANS) {
+			s64 delta;
+
+			if (cmd.length > S64_MAX) {
+				ret = -EINVAL;
+				goto put_new_uobj;
+			}
+			delta = (s64)cmd.length -
+				(s64)uobj->rdmacg_mr_mem_bytes;
+
+			if (delta > 0) {
+				ret = ib_rdmacg_try_charge(
+					&uobj->cg_obj,
+					uobj->context->device,
+					RDMACG_RESOURCE_MR_MEM,
+					delta);
+				if (ret)
+					goto put_new_uobj;
+			} else if (delta < 0) {
+				ib_rdmacg_uncharge(
+					&uobj->cg_obj,
+					uobj->context->device,
+					RDMACG_RESOURCE_MR_MEM,
+					-delta);
+			}
+			uobj->rdmacg_mr_mem_bytes = cmd.length;
 			mr->iova = cmd.hca_va;
 			mr->length = cmd.length;
 		}
@@ -887,6 +939,11 @@ static int ib_uverbs_rereg_mr(struct uverbs_attr_bundle *attrs)
 put_new_uobj:
 	if (new_uobj)
 		uobj_alloc_abort(new_uobj, attrs);
+err_rereg_new_mr:
+	if (new_uobj) {
+		rdma_alloc_abort_uobject(new_uobj, attrs, true);
+		new_uobj = NULL;
+	}
 put_uobj_pd:
 	if (cmd.flags & IB_MR_REREG_PD)
 		uobj_put_obj_read(new_pd);
diff --git a/drivers/infiniband/core/uverbs_std_types_mr.c b/drivers/infiniband/core/uverbs_std_types_mr.c
index 570b9656801d..3989ff2d282b 100644
--- a/drivers/infiniband/core/uverbs_std_types_mr.c
+++ b/drivers/infiniband/core/uverbs_std_types_mr.c
@@ -32,6 +32,7 @@
  */
 
 #include "rdma_core.h"
+#include "core_priv.h"
 #include "uverbs.h"
 #include <rdma/uverbs_std_types.h>
 #include "restrack.h"
@@ -140,6 +141,18 @@ static int UVERBS_HANDLER(UVERBS_METHOD_DM_MR_REG)(
 	rdma_restrack_set_name(&mr->res, NULL);
 	rdma_restrack_add(&mr->res);
 	uobj->object = mr;
+	if (attr.length > S64_MAX)
+		return -EINVAL;
+
+	if (attr.length) {
+		ret = ib_rdmacg_try_charge(&uobj->cg_obj, uobj->context->device,
+					   RDMACG_RESOURCE_MR_MEM, attr.length);
+		if (ret) {
+			ib_dereg_mr_user(mr, &attrs->driver_udata);
+			return ret;
+		}
+		uobj->rdmacg_mr_mem_bytes = attr.length;
+	}
 
 	uverbs_finalize_uobj_create(attrs, UVERBS_ATTR_REG_DM_MR_HANDLE);
 
@@ -254,6 +267,18 @@ static int UVERBS_HANDLER(UVERBS_METHOD_REG_DMABUF_MR)(
 	rdma_restrack_add(&mr->res);
 	uobj->object = mr;
 
+	if (length > S64_MAX)
+		return -EINVAL;
+	if (length) {
+		ret = ib_rdmacg_try_charge(&uobj->cg_obj, uobj->context->device,
+					   RDMACG_RESOURCE_MR_MEM, length);
+		if (ret) {
+			ib_dereg_mr_user(mr, &attrs->driver_udata);
+			return ret;
+		}
+		uobj->rdmacg_mr_mem_bytes = length;
+	}
+
 	uverbs_finalize_uobj_create(attrs, UVERBS_ATTR_REG_DMABUF_MR_HANDLE);
 
 	ret = uverbs_copy_to(attrs, UVERBS_ATTR_REG_DMABUF_MR_RESP_LKEY,
@@ -383,6 +408,18 @@ static int UVERBS_HANDLER(UVERBS_METHOD_REG_MR)(
 	rdma_restrack_add(&mr->res);
 	uobj->object = mr;
 
+	if (length > S64_MAX)
+		return -EINVAL;
+	if (length) {
+		ret = ib_rdmacg_try_charge(&uobj->cg_obj, uobj->context->device,
+					   RDMACG_RESOURCE_MR_MEM, length);
+		if (ret) {
+			ib_dereg_mr_user(mr, &attrs->driver_udata);
+			return ret;
+		}
+		uobj->rdmacg_mr_mem_bytes = length;
+	}
+
 	uverbs_finalize_uobj_create(attrs, UVERBS_ATTR_REG_MR_HANDLE);
 
 	ret = uverbs_copy_to(attrs, UVERBS_ATTR_REG_MR_RESP_LKEY,
diff --git a/include/linux/cgroup_rdma.h b/include/linux/cgroup_rdma.h
index 7146cefa95a6..2c8fb1ebb1a9 100644
--- a/include/linux/cgroup_rdma.h
+++ b/include/linux/cgroup_rdma.h
@@ -12,6 +12,7 @@
 enum rdmacg_resource_type {
 	RDMACG_RESOURCE_HCA_HANDLE,
 	RDMACG_RESOURCE_HCA_OBJECT,
+	RDMACG_RESOURCE_MR_MEM,
 	RDMACG_RESOURCE_MAX,
 };
 
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 9dd76f489a0b..c7dcd5d085fb 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -1569,6 +1569,7 @@ struct ib_uobject {
 	void		       *object;		/* containing object */
 	struct list_head	list;		/* link to context's list */
 	struct ib_rdmacg_object	cg_obj;		/* rdmacg object */
+	s64			rdmacg_mr_mem_bytes; /* charged MR memory size */
 	int			id;		/* index into kernel idr */
 	struct kref		ref;
 	atomic_t		usecnt;		/* protects exclusive access */
diff --git a/kernel/cgroup/rdma.c b/kernel/cgroup/rdma.c
index 519f7f537223..ebfc5721c098 100644
--- a/kernel/cgroup/rdma.c
+++ b/kernel/cgroup/rdma.c
@@ -23,14 +23,18 @@ enum rdmacg_limit_tokens {
 	RDMACG_HCA_HANDLE_MAX,
 	RDMACG_HCA_OBJECT_VAL,
 	RDMACG_HCA_OBJECT_MAX,
+	RDMACG_MR_MEM_VAL,
+	RDMACG_MR_MEM_MAX,
 	NR_RDMACG_LIMIT_TOKENS,
 };
 
 static const match_table_t rdmacg_limit_tokens = {
-	{ RDMACG_HCA_HANDLE_VAL,	"hca_handle=%d"	},
+	{ RDMACG_HCA_HANDLE_VAL,	"hca_handle=%d"		},
 	{ RDMACG_HCA_HANDLE_MAX,	"hca_handle=max"	},
-	{ RDMACG_HCA_OBJECT_VAL,	"hca_object=%d"	},
+	{ RDMACG_HCA_OBJECT_VAL,	"hca_object=%d"		},
 	{ RDMACG_HCA_OBJECT_MAX,	"hca_object=max"	},
+	{ RDMACG_MR_MEM_VAL,		"mr_mem=%d"		},
+	{ RDMACG_MR_MEM_MAX,		"mr_mem=max"		},
 	{ NR_RDMACG_LIMIT_TOKENS,	NULL			},
 };
 
@@ -55,6 +59,7 @@ enum rdmacg_file_type {
 static char const *rdmacg_resource_names[] = {
 	[RDMACG_RESOURCE_HCA_HANDLE]	= "hca_handle",
 	[RDMACG_RESOURCE_HCA_OBJECT]	= "hca_object",
+	[RDMACG_RESOURCE_MR_MEM]	= "mr_mem",
 };
 
 /* resource tracker for each resource of rdma cgroup */
@@ -566,6 +571,18 @@ static ssize_t rdmacg_resource_set_max(struct kernfs_open_file *of,
 			new_limits[RDMACG_RESOURCE_HCA_OBJECT] = S64_MAX;
 			enables |= BIT(RDMACG_RESOURCE_HCA_OBJECT);
 			break;
+		case RDMACG_MR_MEM_VAL:
+			if (match_s64(&args[0], &intval)) {
+				ret = -EINVAL;
+				goto parse_err;
+			}
+			new_limits[RDMACG_RESOURCE_MR_MEM] = intval;
+			enables |= BIT(RDMACG_RESOURCE_MR_MEM);
+			break;
+		case RDMACG_MR_MEM_MAX:
+			new_limits[RDMACG_RESOURCE_MR_MEM] = S64_MAX;
+			enables |= BIT(RDMACG_RESOURCE_MR_MEM);
+			break;
 		default:
 			ret = -EINVAL;
 			goto parse_err;
-- 
2.43.0


* [PATCH rdma-next v2 3/3] cgroup/rdma: update cgroup resource list for MR_MEM
  2026-05-29  9:07 [PATCH rdma-next v2 0/3] cgroup/rdma: add MR memory size resource tracking Tao Cui
  2026-05-29  9:07 ` [PATCH rdma-next v2 1/3] cgroup/rdma: extend charge/uncharge API with s64 amount parameter Tao Cui
  2026-05-29  9:07 ` [PATCH rdma-next v2 2/3] cgroup/rdma: add MR memory size resource tracking Tao Cui
@ 2026-05-29  9:07 ` Tao Cui
  2026-05-29 16:18   ` kernel test robot
  2026-05-29 12:46 ` [PATCH rdma-next v2 0/3] cgroup/rdma: add MR memory size resource tracking Michal Koutný
  2026-05-29 21:14 ` yanjun.zhu
  4 siblings, 1 reply; 9+ messages in thread
From: Tao Cui @ 2026-05-29  9:07 UTC (permalink / raw)
  To: tj, hannes, mkoutny, leon, jgg; +Cc: linux-rdma, cgroups, Tao Cui

From: Tao Cui <cuitao@kylinos.cn>

The RDMA cgroup now supports MR memory size tracking via the new
mr_mem resource.  Update the cgroup-v2 documentation to describe
the new resource and revise the usage examples accordingly.

The mr_mem resource tracks the cumulative size of memory registered
through Memory Regions per device per cgroup, providing a DMA
registration budget that is orthogonal to the existing hca_object
counter.

Signed-off-by: Tao Cui <cuitao@kylinos.cn>
---
 Documentation/admin-guide/cgroup-v2.rst | 21 +++++++++++----------
 1 file changed, 11 insertions(+), 10 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 993446ab66d0..08d80e6f79ec 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -2766,15 +2766,16 @@ RDMA Interface Files
 
 	The following nested keys are defined.
 
-	  ==========	=============================
+	  ==========	================================================
 	  hca_handle	Maximum number of HCA Handles
 	  hca_object 	Maximum number of HCA Objects
-	  ==========	=============================
+	  mr_mem	Maximum cumulative MR memory size in bytes
+	  ==========	================================================
 
 	An example for mlx4 and ocrdma device follows::
 
-	  mlx4_0 hca_handle=2 hca_object=2000
-	  ocrdma1 hca_handle=3 hca_object=max
+	  mlx4_0 hca_handle=2 hca_object=2000 mr_mem=1073741824
+	  ocrdma1 hca_handle=3 hca_object=max mr_mem=max
 
   rdma.current
 	A read-only file that describes current resource usage.
@@ -2782,8 +2783,8 @@ RDMA Interface Files
 
 	An example for mlx4 and ocrdma device follows::
 
-	  mlx4_0 hca_handle=1 hca_object=20
-	  ocrdma1 hca_handle=1 hca_object=23
+	  mlx4_0 hca_handle=1 hca_object=20 mr_mem=1048576
+	  ocrdma1 hca_handle=1 hca_object=23 mr_mem=0
 
   rdma.peak
 	A read-only nested-keyed file that exists for all the cgroups
@@ -2792,8 +2793,8 @@ RDMA Interface Files
 
 	An example for mlx4 and ocrdma device follows::
 
-	  mlx4_0 hca_handle=1 hca_object=20
-	  ocrdma1 hca_handle=0 hca_object=23
+	  mlx4_0 hca_handle=1 hca_object=20 mr_mem=1048576
+	  ocrdma1 hca_handle=0 hca_object=23 mr_mem=0
 
   rdma.events
 	A read-only nested-keyed file which exists on non-root
@@ -2815,7 +2816,7 @@ RDMA Interface Files
 
 	An example for mlx4 device follows::
 
-	  mlx4_0 hca_handle.max=5 hca_handle.alloc_fail=3 hca_object.max=0 hca_object.alloc_fail=0
+	  mlx4_0 hca_handle.max=5 hca_handle.alloc_fail=3 hca_object.max=0 hca_object.alloc_fail=0 mr_mem.max=0 mr_mem.alloc_fail=0
 
   rdma.events.local
 	Similar to rdma.events but the fields in the file are local
@@ -2836,7 +2837,7 @@ RDMA Interface Files
 
 	An example for mlx4 device follows::
 
-	  mlx4_0 hca_handle.max=5 hca_handle.alloc_fail=0 hca_object.max=0 hca_object.alloc_fail=0
+	  mlx4_0 hca_handle.max=5 hca_handle.alloc_fail=0 hca_object.max=0 hca_object.alloc_fail=0 mr_mem.max=0 mr_mem.alloc_fail=0
 
 DMEM
 ----
-- 
2.43.0


* Re: [PATCH rdma-next v2 0/3] cgroup/rdma: add MR memory size resource tracking
  2026-05-29  9:07 [PATCH rdma-next v2 0/3] cgroup/rdma: add MR memory size resource tracking Tao Cui
                   ` (2 preceding siblings ...)
  2026-05-29  9:07 ` [PATCH rdma-next v2 3/3] cgroup/rdma: update cgroup resource list for MR_MEM Tao Cui
@ 2026-05-29 12:46 ` Michal Koutný
  2026-06-01  5:37   ` Tao Cui
  2026-05-29 21:14 ` yanjun.zhu
  4 siblings, 1 reply; 9+ messages in thread
From: Michal Koutný @ 2026-05-29 12:46 UTC (permalink / raw)
  To: Tao Cui; +Cc: tj, hannes, leon, jgg, linux-rdma, cgroups, Tao Cui

Hi.

On Fri, May 29, 2026 at 05:07:30PM +0800, Tao Cui <cui.tao@linux.dev> wrote:
> The real scarce resource in multi-tenant
> deployments is pinned memory: how much physical memory gets registered
> through MRs.
> ...
> 3. Overlap with memory cgroup: mr_mem does not count process memory
>    usage; it represents a per-device DMA registration budget: the
>    amount of memory this cgroup may register through a given HCA.
>    This is a different dimension from what memory cgroup tracks.  An
>    administrator might set mr_mem limits differently per device, which
>    memory cgroup cannot express.
> 
>    In particular, mr_mem tracks the registered memory range associated
>    with the MR rather than exact dynamically pinned pages (e.g. for
>    ODP MRs).  This is a stable, policy-oriented approximation of
>    registration footprint, not an attempt at precise physical page
>    accounting.

IIUC the pinned memory is regular RAM, i.e. it could be controlled with
memcg as needed. Or is there a "physical" limit of what can be assigned to
a single device?

BTW, have a look at [1]; it'd be good to converge on a similar approach
(the current proposal allows distinguishing whether charging should
include or exempt memcg counting). It also seems that the dmem
controller could be a one-stop solution for all DMA charges. Please tell
me if there are any distinguishing factors between RDMA devices' memory
and these dmem memory regions.

Thanks,
Michal


[1] https://lore.kernel.org/r/20260519-cgroup-dmem-memcg-double-charge-v2-0-db4d1407062b@redhat.com/

* Re: [PATCH rdma-next v2 3/3] cgroup/rdma: update cgroup resource list for MR_MEM
  2026-05-29  9:07 ` [PATCH rdma-next v2 3/3] cgroup/rdma: update cgroup resource list for MR_MEM Tao Cui
@ 2026-05-29 16:18   ` kernel test robot
  0 siblings, 0 replies; 9+ messages in thread
From: kernel test robot @ 2026-05-29 16:18 UTC (permalink / raw)
  To: Tao Cui, tj, hannes, mkoutny, leon, jgg
  Cc: oe-kbuild-all, linux-rdma, cgroups, Tao Cui

Hi Tao,

kernel test robot noticed the following build warnings:

[auto build test WARNING on tj-cgroup/for-next]
[also build test WARNING on next-20260528]
[cannot apply to linus/master v7.1-rc5]
[If your patch is applied to the wrong git tree, kindly drop us a note.
When submitting a patch, we suggest using '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Tao-Cui/cgroup-rdma-extend-charge-uncharge-API-with-s64-amount-parameter/20260529-171623
base:   https://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git for-next
patch link:    https://lore.kernel.org/r/20260529090733.2242822-4-cui.tao%40linux.dev
patch subject: [PATCH rdma-next v2 3/3] cgroup/rdma: update cgroup resource list for MR_MEM
config: i386-allnoconfig-bpf (https://download.01.org/0day-ci/archive/20260529/202605291816.15AyhoZE-lkp@intel.com/config)
compiler: gcc-14 (Debian 14.2.0-19) 14.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260529/202605291816.15AyhoZE-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add the following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202605291816.15AyhoZE-lkp@intel.com/

All warnings (new ones prefixed by >>):

>> Warning: kernel/cgroup/rdma.c:210 function parameter 'amount' not described in 'uncharge_cg_locked'
>> Warning: kernel/cgroup/rdma.c:312 function parameter 'amount' not described in 'rdmacg_uncharge_hierarchy'
>> Warning: kernel/cgroup/rdma.c:335 function parameter 'amount' not described in 'rdmacg_uncharge'
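
Warnings of this kind usually mean that a kernel-doc comment lacks an
"@name:" line for one of the function's parameters.  The real signatures
of uncharge_cg_locked() and the other two functions are not shown in this
message, so the sketch below uses a hypothetical demo_uncharge() and
struct demo_counter purely to illustrate the comment shape:

```c
#include <stdint.h>

/* Hypothetical stand-in for a per-cgroup counter; not the kernel's struct. */
struct demo_counter {
	int64_t usage;
};

/**
 * demo_uncharge - release a previously charged amount from a counter
 * @counter: counter to uncharge
 * @amount: amount to release, in the unit of the resource (e.g. bytes)
 *
 * Each parameter gets its own "@name:" line.  A parameter without one
 * is what produces a "function parameter ... not described" warning.
 */
static void demo_uncharge(struct demo_counter *counter, int64_t amount)
{
	counter->usage -= amount;
}
```

In the series itself the fix would presumably be a one-line "@amount:"
description added to each of the three existing comments.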

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

* Re: [PATCH rdma-next v2 0/3] cgroup/rdma: add MR memory size resource tracking
  2026-05-29  9:07 [PATCH rdma-next v2 0/3] cgroup/rdma: add MR memory size resource tracking Tao Cui
                   ` (3 preceding siblings ...)
  2026-05-29 12:46 ` [PATCH rdma-next v2 0/3] cgroup/rdma: add MR memory size resource tracking Michal Koutný
@ 2026-05-29 21:14 ` yanjun.zhu
  2026-06-01  6:08   ` Tao Cui
  4 siblings, 1 reply; 9+ messages in thread
From: yanjun.zhu @ 2026-05-29 21:14 UTC (permalink / raw)
  To: Tao Cui, tj, hannes, mkoutny, leon, jgg; +Cc: linux-rdma, cgroups, Tao Cui

On 5/29/26 2:07 AM, Tao Cui wrote:
> From: Tao Cui <cuitao@kylinos.cn>
> 
> Currently the RDMA cgroup only tracks two aggregate counters:
> hca_handle and hca_object.  The real scarce resource in multi-tenant
> deployments is pinned memory: how much physical memory gets registered
> through MRs.  The existing hca_object counter is too coarse to capture
> this.
> 
> This series adds a single new resource type:
> 
>    - mr_mem  - Cumulative MR memory size in bytes
> 
> The per-object-type counters (qp, mr) from RFC v1 have been removed
> per review feedback [1]: modern NICs pool objects from the same memory
> pool so the distinction between QP count and MR count is not
> meaningful for resource limiting.  hca_object remains sufficient for
> coarse object accounting.
> 
> After this series, an administrator can set limits like:
> 
>      echo "mlx5_0 mr_mem=1073741824" > rdma.max
> 

Hi,

Thanks for the patchset! Introducing `mr_mem` to track and limit pinned
memory size is a very practical enhancement for multi-tenant deployments.

I have a question regarding how this new resource type interacts with
Fast Registration (FRWR / FRMR), which is widely used in production
environments (e.g., NVMe-oF, iSER) to achieve high performance.

As we know, FRWR decouples the MR object allocation (`ib_alloc_mr`) from
the actual memory page mapping (`ib_map_mr_sg`). The creation of FRWR
Memory Regions is often managed via a pre-allocated page pool.

Could you clarify how `mr_mem` accounts for FRWR in the following scenarios?

1. Accounting Granularity: Does `mr_mem` charge the maximum capacity of
    the FRWR object at its allocation time (`ib_alloc_mr`), or does it
    dynamically track the actual mapped bytes during the fast-reg data
    path? If it's the former, it represents a "static maximum budget" per
    pool, which seems more practical for performance.

2. Kernel-space vs Userspace: FRWR pools are frequently allocated by
    kernel-space drivers (like NVMe-oF target/host). If these kernel
    threads are not bound to a specific user cgroup, will their FRWR
    allocations end up in the root cgroup, potentially bypassing the
    per-tenant limits?

Don't you think it would be beneficial to explicitly document or
consider the FRWR pattern in the design section, given its prevalence in
real-world storage and networking workloads?

Thanks,
Zhu Yanjun

> Design
> ~~~~~~
> 
> mr_mem is not page-level ownership tracking; it is object-based
> accounting tied to the MR lifetime:
> 
>    - charged at MR registration time
>    - uncharged at MR destruction time
>    - the charge is pinned to the cgroup that created the MR for the
>      entire lifetime of the MR object
> 
> This model intentionally defines accounting semantics around MR
> object lifetime rather than page ownership:
> 
> 1. fork(): fork() does not duplicate MR objects.  Even though the
>     child inherits the uverbs fd and can access the parent's ucontext,
>     the MR remains a single kernel object.  The charge is tied to the
>     MR object, not to the number of processes that can reach it, so
>     no splitting or re-accounting is needed.
> 
> 2. Cgroup migration: mr_mem follows the same semantics as the existing
>     hca_object; charge at creation time against the invoking task's
>     cgroup, uncharge at destruction time.  The RDMA cgroup does not
>     implement can_attach/attach callbacks today, so charges do not
>     migrate with the task.  This is a known limitation that applies
>     equally to hca_handle and hca_object.  mr_mem does not introduce
>     any new complication here.
> 
> 3. Overlap with memory cgroup: mr_mem does not count process memory
>     usage; it represents a per-device DMA registration budget: the
>     amount of memory this cgroup may register through a given HCA.
>     This is a different dimension from what memory cgroup tracks.  An
>     administrator might set mr_mem limits differently per device, which
>     memory cgroup cannot express.
> 
>     In particular, mr_mem tracks the registered memory range associated
>     with the MR rather than exact dynamically pinned pages (e.g. for
>     ODP MRs).  This is a stable, policy-oriented approximation of
>     registration footprint, not an attempt at precise physical page
>     accounting.
> 
> Tao Cui (3):
>    cgroup/rdma: extend charge/uncharge API with s64 amount parameter
>    cgroup/rdma: add MR memory size resource tracking
>    cgroup/rdma: update cgroup resource list for MR_MEM
> 
>   Documentation/admin-guide/cgroup-v2.rst       |  21 ++--
>   drivers/infiniband/core/cgroup.c              |  10 +-
>   drivers/infiniband/core/core_priv.h           |  12 +-
>   drivers/infiniband/core/rdma_core.c           |  20 +++-
>   drivers/infiniband/core/uverbs_cmd.c          |  61 +++++++++-
>   drivers/infiniband/core/uverbs_std_types_mr.c |  37 ++++++
>   include/linux/cgroup_rdma.h                   |   8 +-
>   include/rdma/ib_verbs.h                       |   1 +
>   kernel/cgroup/rdma.c                          | 108 +++++++++++++-----
>   9 files changed, 219 insertions(+), 59 deletions(-)
> 
> ---
> Changes from RFC v1:
> 
>    - Removed RDMACG_RESOURCE_QP and RDMACG_RESOURCE_MR per-type
>      counters following review feedback from Jason Gunthorpe [1].
>    - Retained only RDMACG_RESOURCE_MR_MEM as the sole new resource.
>    - Added detailed semantic notes to the commit messages addressing
>      fork(), cgroup migration, and overlap with memory cgroup [2].
>    - Renamed patches to reflect the narrower scope.
> 
> [1] https://lore.kernel.org/all/20260525134314.GI7702@ziepe.ca/
> [2] https://lore.kernel.org/all/20260528075537.2170697-1-cuitao@kylinos.cn/


* Re: [PATCH rdma-next v2 0/3] cgroup/rdma: add MR memory size resource tracking
  2026-05-29 12:46 ` [PATCH rdma-next v2 0/3] cgroup/rdma: add MR memory size resource tracking Michal Koutný
@ 2026-06-01  5:37   ` Tao Cui
  0 siblings, 0 replies; 9+ messages in thread
From: Tao Cui @ 2026-06-01  5:37 UTC (permalink / raw)
  To: Michal Koutný
  Cc: cui.tao, tj, hannes, leon, jgg, linux-rdma, cgroups, Tao Cui

Hi Michal,

Thanks for the review and for the reference.

> IIUC the pinned memory is regular RAM, i.e. it could be controlled
> with memcg as needed. Or is there a "physical" limit of what can be
> assigned to a single device?

You are right that the pages associated with an MR are regular system
RAM. However, MR registration does not allocate new pages; it registers
existing pages that are already charged to the allocating process's
memcg.

For that reason, mr_mem is intended to represent a different resource
dimension: not "how much memory does this cgroup own", but "how much
memory may this cgroup register through a given HCA". In other words:

  * memcg limits memory ownership/consumption
  * mr_mem limits RDMA registration footprint

An administrator may reasonably wish to set different registration
budgets per device (for example, 1G through mlx5_0 and 4G through
mlx5_1) for the same cgroup. memcg has no notion of device-scoped
limits; it only tracks aggregate memory consumption.

This distinction is important because memory ownership and DMA
registration are not necessarily constrained by the same policy. A
tenant may remain within its memcg limit while still consuming a large
portion of a particular HCA's registration capacity. The existing RDMA
controller already provides a per-device resource control framework,
and mr_mem extends that model to cover memory registration footprint.
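
As a rough illustration of the per-device budget idea, here is a
userspace sketch (not kernel code).  All demo_* names are hypothetical;
the only thing taken from the discussion above is that one cgroup can
hold a separate byte limit for each HCA:

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define DEMO_MAX_DEVICES 4

/* Hypothetical per-device budget held by one cgroup; not kernel code. */
struct demo_dev_budget {
	const char *name;	/* device name, e.g. "mlx5_0" */
	int64_t max;		/* registration limit in bytes */
	int64_t usage;		/* bytes currently charged */
};

struct demo_cgroup {
	struct demo_dev_budget dev[DEMO_MAX_DEVICES];
	int ndev;
};

static struct demo_dev_budget *demo_find(struct demo_cgroup *cg,
					 const char *name)
{
	for (int i = 0; i < cg->ndev; i++)
		if (strcmp(cg->dev[i].name, name) == 0)
			return &cg->dev[i];
	return NULL;
}

/* Charge bytes against one device's budget; no side effects on failure. */
static bool demo_try_charge(struct demo_cgroup *cg, const char *name,
			    int64_t bytes)
{
	struct demo_dev_budget *b = demo_find(cg, name);

	if (!b || bytes < 0 || bytes > b->max - b->usage)
		return false;
	b->usage += bytes;
	return true;
}
```

With a 1G budget on mlx5_0 and a 4G budget on mlx5_1, a charge that
exhausts mlx5_0 leaves the mlx5_1 budget untouched, which is the kind of
device-scoped policy an aggregate memory limit cannot express.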

> Or is there a "physical" limit of what can be assigned to a single device?

Yes. Real HCAs have finite resources associated with memory
registration, such as MTT/MPT capacity and related DMA translation
resources. In practice, administrators often need to prevent one tenant
from consuming a disproportionate share of a particular HCA's
registration capacity, even when sufficient system memory remains
available.

It is also worth noting that mr_mem is intentionally not an attempt to
account for exact pinned pages. The accounting model is tied to MR object
lifetime and tracks registration footprint rather than dynamic physical
page state. For example, ODP MRs may have only a subset of their pages
pinned at any given time, yet still consume registration resources on
the HCA. This is why the proposal focuses on a stable,
policy-oriented registration budget rather than precise memory
ownership accounting.

> BTW, have a look at [1]; it'd be good to converge on a similar approach
> (the current proposal allows distinguishing whether charging should
> include or exempt memcg counting).

I've read the related dma-buf accounting work.

My understanding is that those proposals focus on allocations that
create new memory on behalf of a device, which is naturally accounted
through memcg.

RDMA MR registration is different because no new memory is allocated.
The MR object is an in-kernel registration of existing memory that has
already been accounted elsewhere. The resource being limited is
therefore the registration itself rather than the underlying memory
pages.

> It also seems that the dmem controller could be a one-stop solution
> for all DMA charges. Please tell me if there are any distinguishing
> factors between RDMA devices' memory and these dmem memory regions.

One distinction is that the current dmem work appears to focus on
memory resources allocated on behalf of a device, whereas mr_mem is
intended to limit host memory registered for DMA through RDMA MRs.
RDMA NICs typically do not have large device-local memory pools;
instead they provide DMA access to host RAM through memory
registration. As a result, the resource being controlled here is not
device memory consumption itself, but the registration footprint
associated with a particular HCA.

Another difference is the accounting model itself. The proposed mr_mem
accounting is tied to MR object lifetime and tracks registration
footprint rather than precise physical page usage.

My understanding is that dmem is currently integrated with the DRM/TTM
subsystem for device-local memory accounting, and there is no existing
RDMA integration today. I have not investigated what would be required
to extend that model to RDMA registration accounting.

That said, I agree that convergence would be desirable if a generic
framework can naturally express per-device DMA registration budgets.
My goal here is not necessarily to require RDMA-specific accounting,
but to address a practical resource-control problem within the existing
RDMA cgroup framework.

Thanks,
Tao

* Re: [PATCH rdma-next v2 0/3] cgroup/rdma: add MR memory size resource tracking
  2026-05-29 21:14 ` yanjun.zhu
@ 2026-06-01  6:08   ` Tao Cui
  0 siblings, 0 replies; 9+ messages in thread
From: Tao Cui @ 2026-06-01  6:08 UTC (permalink / raw)
  To: yanjun.zhu, tj, hannes, mkoutny, leon, jgg
  Cc: cui.tao, linux-rdma, cgroups, Tao Cui

Hi Yanjun,

Thanks for the thoughtful questions.  FRWR is indeed a widely used
pattern, and the interaction with mr_mem deserves clarification.

> 1. Accounting Granularity: Does mr_mem charge the maximum capacity of
>    the FRWR object at its allocation time (ib_alloc_mr), or does it
>    dynamically track the actual mapped bytes during the fast-reg data
>    path?

In the current proposal, mr_mem is only charged for userspace MR
registrations that go through the uverbs layer (REG_MR, DM_MR,
DMABUF_MR, and the legacy ioctl path).  These are the paths where a
concrete byte length is known at registration time.

FRWR MRs allocated via ib_alloc_mr() are not charged for mr_mem.  The
actual registration footprint associated with an FRWR MR is not known
at allocation time: ib_alloc_mr() only specifies the maximum
scatter-gather capacity of the MR, while the mapped byte range may
change dynamically across successive ib_map_mr_sg() operations.

Supporting FRWR accounting would therefore require a separate
accounting model, since the registration footprint is established
dynamically rather than by a fixed length parameter supplied at MR
creation.  This is outside the scope of the current proposal.
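
The difference between the two paths can be sketched in userspace C as
follows.  This is an illustration under stated assumptions, not the
code from the series: every demo_* name is hypothetical, and the -1
return value is a placeholder rather than the kernel's actual error
code.  What it takes from the discussion is that a length-based
registration is charged up front, the charged size is remembered on the
object, and an FRWR-style allocation charges nothing:

```c
#include <stdint.h>

/* Hypothetical stand-ins; the real structures live in the kernel. */
struct demo_cg {
	int64_t mr_mem_max;	/* limit in bytes */
	int64_t mr_mem_usage;	/* bytes currently charged */
};

struct demo_mr {
	struct demo_cg *cg;	/* cgroup charged at creation */
	int64_t charged_bytes;	/* amount to give back at destruction */
};

/* Length-based registration: the size is known, so charge it up front. */
static int demo_reg_mr(struct demo_mr *mr, struct demo_cg *cg, int64_t length)
{
	if (length < 0 || length > cg->mr_mem_max - cg->mr_mem_usage)
		return -1;	/* over budget: registration is refused */
	cg->mr_mem_usage += length;
	mr->cg = cg;
	mr->charged_bytes = length;
	return 0;
}

/* FRWR-style allocation: only a max SG count is known, nothing is charged. */
static void demo_alloc_frwr_mr(struct demo_mr *mr, struct demo_cg *cg,
			       unsigned int max_sg)
{
	(void)max_sg;
	mr->cg = cg;
	mr->charged_bytes = 0;
}

/* Destruction returns exactly what was recorded on the object. */
static void demo_dereg_mr(struct demo_mr *mr)
{
	mr->cg->mr_mem_usage -= mr->charged_bytes;
	mr->charged_bytes = 0;
}
```

Recording the charged size on the object keeps the uncharge symmetric
with the charge even if the limit is changed while the MR is alive.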

> 2. Kernel-space vs Userspace: FRWR pools are frequently allocated by
>    kernel-space drivers (like NVMe-oF target/host). If these kernel
>    threads are not bound to a specific user cgroup, will their FRWR
>    allocations end up in the root cgroup, potentially bypassing the
>    per-tenant limits?

The RDMA cgroup's resource control is primarily designed for userspace
consumers.  Kernel-space consumers (NVMe-oF target, SRP initiator,
rtrs, iSER, etc.) allocate resources through kernel APIs
(ib_alloc_mr, ib_create_qp, etc.).  These resources do not currently
participate in RDMA cgroup accounting and therefore are not subject to
per-cgroup limits.

Kernel-space FRWR pools are typically managed by the administrator
rather than being subject to per-tenant limits.

This behavior is consistent with the current RDMA cgroup model, which
tracks resources associated with userspace RDMA objects.  If accounting
were extended to kernel-allocated FRWR MRs, ownership semantics would
become an open question: simply charging against the current task or
the root cgroup may not accurately represent the tenant that ultimately
benefits from the resource.

> Don't you think it would be beneficial to explicitly document or
> consider the FRWR pattern in the design section, given its prevalence
> in real-world storage and networking workloads?

Agreed.  I will add a note to the cover letter and commit messages
clarifying that mr_mem currently covers only userspace MR registrations
with a known length, and that kernel-space FRWR pools are out of scope
for this initial proposal.  The semantic distinction between
userspace registration-length accounting and kernel-space FRWR
resource management is worth documenting explicitly.

Thanks,
Tao
