* [PATCH v3 1/4] cgroup/rdma: add rdma.peak for per-device peak usage tracking
2026-05-14 6:50 [PATCH v3 0/4] cgroup/rdma: add rdma.peak and rdma.events[.local] Tao Cui
@ 2026-05-14 6:50 ` Tao Cui
2026-05-14 6:50 ` [PATCH v3 2/4] cgroup/rdma: add rdma.events to track resource limit exhaustion Tao Cui
` (5 subsequent siblings)
6 siblings, 0 replies; 9+ messages in thread
From: Tao Cui @ 2026-05-14 6:50 UTC (permalink / raw)
To: tj, hannes, mkoutny, cgroups; +Cc: Tao Cui
rdma.peak tracks the high watermark of resource usage per device,
giving a better baseline on which to set rdma.max. Polling
rdma.current isn't feasible since it would miss short-lived spikes.
This interface is analogous to memory.peak.
Signed-off-by: Tao Cui <cuitao@kylinos.cn>
---
kernel/cgroup/rdma.c | 40 ++++++++++++++++++++++++++++++++++++++++
1 file changed, 40 insertions(+)
diff --git a/kernel/cgroup/rdma.c b/kernel/cgroup/rdma.c
index 3df7c38ce481..4e3bf0bade18 100644
--- a/kernel/cgroup/rdma.c
+++ b/kernel/cgroup/rdma.c
@@ -44,6 +44,7 @@ static LIST_HEAD(rdmacg_devices);
enum rdmacg_file_type {
RDMACG_RESOURCE_TYPE_MAX,
RDMACG_RESOURCE_TYPE_STAT,
+ RDMACG_RESOURCE_TYPE_PEAK,
};
/*
@@ -60,6 +61,7 @@ static char const *rdmacg_resource_names[] = {
struct rdmacg_resource {
int max;
int usage;
+ int peak;
};
/*
@@ -204,6 +206,17 @@ uncharge_cg_locked(struct rdma_cgroup *cg,
rpool->usage_sum--;
if (rpool->usage_sum == 0 &&
rpool->num_max_cnt == RDMACG_RESOURCE_MAX) {
+ int i;
+
+ /*
+ * Keep the rpool alive if any peak value is non-zero,
+ * so that rdma.peak persists as a historical high-
+ * watermark even after all resources are freed.
+ */
+ for (i = 0; i < RDMACG_RESOURCE_MAX; i++) {
+ if (rpool->resources[i].peak)
+ return;
+ }
/*
* No user of the rpool and all entries are set to max, so
* safe to delete this rpool.
@@ -310,6 +323,12 @@ int rdmacg_try_charge(struct rdma_cgroup **rdmacg,
}
}
}
+ /* Update peak only after all charges succeed */
+ for (p = cg; p; p = parent_rdmacg(p)) {
+ rpool = find_cg_rpool_locked(p, device);
+ if (rpool && rpool->resources[index].usage > rpool->resources[index].peak)
+ rpool->resources[index].peak = rpool->resources[index].usage;
+ }
mutex_unlock(&rdmacg_mutex);
*rdmacg = cg;
@@ -472,6 +491,12 @@ static ssize_t rdmacg_resource_set_max(struct kernfs_open_file *of,
if (rpool->usage_sum == 0 &&
rpool->num_max_cnt == RDMACG_RESOURCE_MAX) {
+ int i;
+
+ for (i = 0; i < RDMACG_RESOURCE_MAX; i++) {
+ if (rpool->resources[i].peak)
+ goto dev_err;
+ }
/*
* No user of the rpool and all entries are set to max, so
* safe to delete this rpool.
@@ -506,6 +531,8 @@ static void print_rpool_values(struct seq_file *sf,
value = rpool->resources[i].max;
else
value = S32_MAX;
+ } else if (sf_type == RDMACG_RESOURCE_TYPE_PEAK) {
+ value = rpool ? rpool->resources[i].peak : 0;
} else {
if (rpool)
value = rpool->resources[i].usage;
@@ -556,6 +583,12 @@ static struct cftype rdmacg_files[] = {
.private = RDMACG_RESOURCE_TYPE_STAT,
.flags = CFTYPE_NOT_ON_ROOT,
},
+ {
+ .name = "peak",
+ .seq_show = rdmacg_resource_read,
+ .private = RDMACG_RESOURCE_TYPE_PEAK,
+ .flags = CFTYPE_NOT_ON_ROOT,
+ },
{ } /* terminate */
};
@@ -575,6 +608,13 @@ rdmacg_css_alloc(struct cgroup_subsys_state *parent)
static void rdmacg_css_free(struct cgroup_subsys_state *css)
{
struct rdma_cgroup *cg = css_rdmacg(css);
+ struct rdmacg_resource_pool *rpool, *tmp;
+
+ /* Clean up rpools kept alive by non-zero peak values */
+ mutex_lock(&rdmacg_mutex);
+ list_for_each_entry_safe(rpool, tmp, &cg->rpools, cg_node)
+ free_cg_rpool_locked(rpool);
+ mutex_unlock(&rdmacg_mutex);
kfree(cg);
}
--
2.43.0
^ permalink raw reply related [flat|nested] 9+ messages in thread
* [PATCH v3 2/4] cgroup/rdma: add rdma.events to track resource limit exhaustion
2026-05-14 6:50 [PATCH v3 0/4] cgroup/rdma: add rdma.peak and rdma.events[.local] Tao Cui
2026-05-14 6:50 ` [PATCH v3 1/4] cgroup/rdma: add rdma.peak for per-device peak usage tracking Tao Cui
@ 2026-05-14 6:50 ` Tao Cui
2026-05-14 6:50 ` [PATCH v3 3/4] cgroup/rdma: add rdma.events.local for per-cgroup allocation failure attribution Tao Cui
` (4 subsequent siblings)
6 siblings, 0 replies; 9+ messages in thread
From: Tao Cui @ 2026-05-14 6:50 UTC (permalink / raw)
To: tj, hannes, mkoutny, cgroups; +Cc: Tao Cui
Add per-device hierarchical event counters to track when RDMA resource
limits are exceeded. The rdma.events file reports max event counts
propagated upward from the cgroup whose limit was hit to all ancestors.
This mirrors the design of pids.events, where events are attributed to
the cgroup that imposed the limit, not necessarily the cgroup where the
allocation was attempted. Userspace can monitor this file via
poll/epoll for real-time notification of resource exhaustion.
Signed-off-by: Tao Cui <cuitao@kylinos.cn>
---
include/linux/cgroup_rdma.h | 3 ++
kernel/cgroup/rdma.c | 72 +++++++++++++++++++++++++++++++++++--
2 files changed, 73 insertions(+), 2 deletions(-)
diff --git a/include/linux/cgroup_rdma.h b/include/linux/cgroup_rdma.h
index 80edae03c313..ac691fe7d3f5 100644
--- a/include/linux/cgroup_rdma.h
+++ b/include/linux/cgroup_rdma.h
@@ -24,6 +24,9 @@ struct rdma_cgroup {
* that belongs to this cgroup.
*/
struct list_head rpools;
+
+ /* Handle for rdma.events */
+ struct cgroup_file events_file;
};
struct rdmacg_device {
diff --git a/kernel/cgroup/rdma.c b/kernel/cgroup/rdma.c
index 4e3bf0bade18..927bbf1eb949 100644
--- a/kernel/cgroup/rdma.c
+++ b/kernel/cgroup/rdma.c
@@ -81,6 +81,9 @@ struct rdmacg_resource_pool {
u64 usage_sum;
/* total number counts which are set to max */
int num_max_cnt;
+
+ /* per-resource hierarchical max event counters */
+ u64 events_max[RDMACG_RESOURCE_MAX];
};
static struct rdma_cgroup *css_rdmacg(struct cgroup_subsys_state *css)
@@ -214,7 +217,8 @@ uncharge_cg_locked(struct rdma_cgroup *cg,
* watermark even after all resources are freed.
*/
for (i = 0; i < RDMACG_RESOURCE_MAX; i++) {
- if (rpool->resources[i].peak)
+ if (rpool->resources[i].peak ||
+ READ_ONCE(rpool->events_max[i]))
return;
}
/*
@@ -225,6 +229,34 @@ uncharge_cg_locked(struct rdma_cgroup *cg,
}
}
+/**
+ * rdmacg_event_locked - fire hierarchical max event when resource limit is hit
+ * @over_cg: cgroup whose limit was exceeded
+ * @device: rdma device
+ * @index: resource type index
+ *
+ * Must be called under rdmacg_mutex. Propagates max event counts
+ * from @over_cg (including itself) upward to all ancestors with
+ * an rpool and notifies userspace.
+ */
+static void rdmacg_event_locked(struct rdma_cgroup *over_cg,
+ struct rdmacg_device *device,
+ enum rdmacg_resource_type index)
+{
+ struct rdmacg_resource_pool *rpool;
+ struct rdma_cgroup *p;
+
+ lockdep_assert_held(&rdmacg_mutex);
+
+ for (p = over_cg; parent_rdmacg(p); p = parent_rdmacg(p)) {
+ rpool = get_cg_rpool_locked(p, device);
+ if (!IS_ERR(rpool)) {
+ rpool->events_max[index]++;
+ cgroup_file_notify(&p->events_file);
+ }
+ }
+}
+
/**
* rdmacg_uncharge_hierarchy - hierarchically uncharge rdma resource count
* @cg: pointer to cg to uncharge and all parents in hierarchy
@@ -335,6 +367,8 @@ int rdmacg_try_charge(struct rdma_cgroup **rdmacg,
return 0;
err:
+ if (ret == -EAGAIN)
+ rdmacg_event_locked(p, device, index);
mutex_unlock(&rdmacg_mutex);
rdmacg_uncharge_hierarchy(cg, device, p, index);
return ret;
@@ -494,7 +528,8 @@ static ssize_t rdmacg_resource_set_max(struct kernfs_open_file *of,
int i;
for (i = 0; i < RDMACG_RESOURCE_MAX; i++) {
- if (rpool->resources[i].peak)
+ if (rpool->resources[i].peak ||
+ READ_ONCE(rpool->events_max[i]))
goto dev_err;
}
/*
@@ -569,6 +604,33 @@ static int rdmacg_resource_read(struct seq_file *sf, void *v)
return 0;
}
+static int rdmacg_events_show(struct seq_file *sf, void *v)
+{
+ struct rdma_cgroup *cg = css_rdmacg(seq_css(sf));
+ struct rdmacg_resource_pool *rpool;
+ struct rdmacg_device *device;
+ int i;
+
+ mutex_lock(&rdmacg_mutex);
+
+ list_for_each_entry(device, &rdmacg_devices, dev_node) {
+ rpool = find_cg_rpool_locked(cg, device);
+
+ seq_printf(sf, "%s ", device->name);
+ for (i = 0; i < RDMACG_RESOURCE_MAX; i++) {
+ seq_printf(sf, "%s.max=%llu",
+ rdmacg_resource_names[i],
+ rpool ? READ_ONCE(rpool->events_max[i]) : 0ULL);
+ if (i < RDMACG_RESOURCE_MAX - 1)
+ seq_putc(sf, ' ');
+ }
+ seq_putc(sf, '\n');
+ }
+
+ mutex_unlock(&rdmacg_mutex);
+ return 0;
+}
+
static struct cftype rdmacg_files[] = {
{
.name = "max",
@@ -589,6 +651,12 @@ static struct cftype rdmacg_files[] = {
.private = RDMACG_RESOURCE_TYPE_PEAK,
.flags = CFTYPE_NOT_ON_ROOT,
},
+ {
+ .name = "events",
+ .seq_show = rdmacg_events_show,
+ .file_offset = offsetof(struct rdma_cgroup, events_file),
+ .flags = CFTYPE_NOT_ON_ROOT,
+ },
{ } /* terminate */
};
--
2.43.0
^ permalink raw reply related [flat|nested] 9+ messages in thread
* [PATCH v3 3/4] cgroup/rdma: add rdma.events.local for per-cgroup allocation failure attribution
2026-05-14 6:50 [PATCH v3 0/4] cgroup/rdma: add rdma.peak and rdma.events[.local] Tao Cui
2026-05-14 6:50 ` [PATCH v3 1/4] cgroup/rdma: add rdma.peak for per-device peak usage tracking Tao Cui
2026-05-14 6:50 ` [PATCH v3 2/4] cgroup/rdma: add rdma.events to track resource limit exhaustion Tao Cui
@ 2026-05-14 6:50 ` Tao Cui
2026-05-14 6:50 ` [PATCH v3 4/4] cgroup/rdma: document rdma.peak, rdma.events and rdma.events.local Tao Cui
` (3 subsequent siblings)
6 siblings, 0 replies; 9+ messages in thread
From: Tao Cui @ 2026-05-14 6:50 UTC (permalink / raw)
To: tj, hannes, mkoutny, cgroups; +Cc: Tao Cui
Add per-cgroup local event counters to track RDMA resource limit
exhaustion from the perspective of individual cgroups. The
rdma.events.local file reports two per-resource counters:
- max: number of times this cgroup's limit was the one that blocked
an allocation in the subtree
- alloc_fail: number of allocation attempts originating from this
cgroup that failed due to an ancestor's limit
This mirrors the design of pids.events.local, where events are
attributed to the cgroup that imposed the limit, not necessarily the
cgroup where the allocation was attempted.
Also extend rdma.events with a hierarchical alloc_fail counter that
tracks allocation failures propagating upward from the requesting
cgroup, complementing the existing max counter, so that rdma.events
and rdma.events.local share the same output format.
Signed-off-by: Tao Cui <cuitao@kylinos.cn>
---
include/linux/cgroup_rdma.h | 3 +-
kernel/cgroup/rdma.c | 143 +++++++++++++++++++++++++++---------
2 files changed, 109 insertions(+), 37 deletions(-)
diff --git a/include/linux/cgroup_rdma.h b/include/linux/cgroup_rdma.h
index ac691fe7d3f5..404e746552ca 100644
--- a/include/linux/cgroup_rdma.h
+++ b/include/linux/cgroup_rdma.h
@@ -25,8 +25,9 @@ struct rdma_cgroup {
*/
struct list_head rpools;
- /* Handle for rdma.events */
+ /* Handles for rdma.events[.local] */
struct cgroup_file events_file;
+ struct cgroup_file events_local_file;
};
struct rdmacg_device {
diff --git a/kernel/cgroup/rdma.c b/kernel/cgroup/rdma.c
index 927bbf1eb949..7c238a9d64d4 100644
--- a/kernel/cgroup/rdma.c
+++ b/kernel/cgroup/rdma.c
@@ -82,8 +82,11 @@ struct rdmacg_resource_pool {
/* total number counts which are set to max */
int num_max_cnt;
- /* per-resource hierarchical max event counters */
+ /* per-resource event counters */
u64 events_max[RDMACG_RESOURCE_MAX];
+ u64 events_alloc_fail[RDMACG_RESOURCE_MAX];
+ u64 events_local_max[RDMACG_RESOURCE_MAX];
+ u64 events_local_alloc_fail[RDMACG_RESOURCE_MAX];
};
static struct rdma_cgroup *css_rdmacg(struct cgroup_subsys_state *css)
@@ -131,6 +134,26 @@ static void free_cg_rpool_locked(struct rdmacg_resource_pool *rpool)
kfree(rpool);
}
+static bool rpool_has_persistent_state(struct rdmacg_resource_pool *rpool)
+{
+ int i;
+
+ /*
+ * Keep the rpool alive if any peak value is non-zero,
+ * so that rdma.peak persists as a historical high-
+ * watermark even after all resources are freed.
+ */
+ for (i = 0; i < RDMACG_RESOURCE_MAX; i++) {
+ if (rpool->resources[i].peak ||
+ READ_ONCE(rpool->events_max[i]) ||
+ READ_ONCE(rpool->events_local_max[i]) ||
+ READ_ONCE(rpool->events_alloc_fail[i]) ||
+ READ_ONCE(rpool->events_local_alloc_fail[i]))
+ return true;
+ }
+ return false;
+}
+
static struct rdmacg_resource_pool *
find_cg_rpool_locked(struct rdma_cgroup *cg,
struct rdmacg_device *device)
@@ -209,37 +232,30 @@ uncharge_cg_locked(struct rdma_cgroup *cg,
rpool->usage_sum--;
if (rpool->usage_sum == 0 &&
rpool->num_max_cnt == RDMACG_RESOURCE_MAX) {
- int i;
-
- /*
- * Keep the rpool alive if any peak value is non-zero,
- * so that rdma.peak persists as a historical high-
- * watermark even after all resources are freed.
- */
- for (i = 0; i < RDMACG_RESOURCE_MAX; i++) {
- if (rpool->resources[i].peak ||
- READ_ONCE(rpool->events_max[i]))
- return;
+ if (!rpool_has_persistent_state(rpool)) {
+ /*
+ * No user of the rpool and all entries are set to max, so
+ * safe to delete this rpool.
+ */
+ free_cg_rpool_locked(rpool);
}
- /*
- * No user of the rpool and all entries are set to max, so
- * safe to delete this rpool.
- */
- free_cg_rpool_locked(rpool);
}
}
/**
- * rdmacg_event_locked - fire hierarchical max event when resource limit is hit
+ * rdmacg_event_locked - fire event when resource allocation exceeds limit
+ * @cg: requesting cgroup
* @over_cg: cgroup whose limit was exceeded
* @device: rdma device
* @index: resource type index
*
- * Must be called under rdmacg_mutex. Propagates max event counts
- * from @over_cg (including itself) upward to all ancestors with
- * an rpool and notifies userspace.
+ * Must be called under rdmacg_mutex. Updates event counters in the
+ * resource pools of @cg and @over_cg, propagates hierarchical max
+ * events from @over_cg (including itself) upward, and notifies
+ * userspace via cgroup_file_notify().
*/
-static void rdmacg_event_locked(struct rdma_cgroup *over_cg,
+static void rdmacg_event_locked(struct rdma_cgroup *cg,
+ struct rdma_cgroup *over_cg,
struct rdmacg_device *device,
enum rdmacg_resource_type index)
{
@@ -248,6 +264,21 @@ static void rdmacg_event_locked(struct rdma_cgroup *over_cg,
lockdep_assert_held(&rdmacg_mutex);
+ /* Increment local alloc_fail in requesting cgroup */
+ rpool = find_cg_rpool_locked(cg, device);
+ if (rpool) {
+ rpool->events_local_alloc_fail[index]++;
+ cgroup_file_notify(&cg->events_local_file);
+ }
+
+ /* Increment local max in the over-limit cgroup */
+ rpool = find_cg_rpool_locked(over_cg, device);
+ if (rpool) {
+ rpool->events_local_max[index]++;
+ cgroup_file_notify(&over_cg->events_local_file);
+ }
+
+ /* Propagate hierarchical max events upward */
for (p = over_cg; parent_rdmacg(p); p = parent_rdmacg(p)) {
rpool = get_cg_rpool_locked(p, device);
if (!IS_ERR(rpool)) {
@@ -255,6 +286,14 @@ static void rdmacg_event_locked(struct rdma_cgroup *over_cg,
cgroup_file_notify(&p->events_file);
}
}
+ /* Propagate hierarchical alloc_fail from requesting cgroup upward */
+ for (p = cg; parent_rdmacg(p); p = parent_rdmacg(p)) {
+ rpool = get_cg_rpool_locked(p, device);
+ if (!IS_ERR(rpool)) {
+ rpool->events_alloc_fail[index]++;
+ cgroup_file_notify(&p->events_file);
+ }
+ }
}
/**
@@ -368,7 +407,7 @@ int rdmacg_try_charge(struct rdma_cgroup **rdmacg,
err:
if (ret == -EAGAIN)
- rdmacg_event_locked(p, device, index);
+ rdmacg_event_locked(cg, p, device, index);
mutex_unlock(&rdmacg_mutex);
rdmacg_uncharge_hierarchy(cg, device, p, index);
return ret;
@@ -525,18 +564,13 @@ static ssize_t rdmacg_resource_set_max(struct kernfs_open_file *of,
if (rpool->usage_sum == 0 &&
rpool->num_max_cnt == RDMACG_RESOURCE_MAX) {
- int i;
-
- for (i = 0; i < RDMACG_RESOURCE_MAX; i++) {
- if (rpool->resources[i].peak ||
- READ_ONCE(rpool->events_max[i]))
- goto dev_err;
+ if (!rpool_has_persistent_state(rpool)) {
+ /*
+ * No user of the rpool and all entries are set to max, so
+ * safe to delete this rpool.
+ */
+ free_cg_rpool_locked(rpool);
}
- /*
- * No user of the rpool and all entries are set to max, so
- * safe to delete this rpool.
- */
- free_cg_rpool_locked(rpool);
}
dev_err:
@@ -618,9 +652,40 @@ static int rdmacg_events_show(struct seq_file *sf, void *v)
seq_printf(sf, "%s ", device->name);
for (i = 0; i < RDMACG_RESOURCE_MAX; i++) {
- seq_printf(sf, "%s.max=%llu",
+ seq_printf(sf, "%s.max=%llu %s.alloc_fail=%llu",
+ rdmacg_resource_names[i],
+ rpool ? READ_ONCE(rpool->events_max[i]) : 0ULL,
+ rdmacg_resource_names[i],
+ rpool ? READ_ONCE(rpool->events_alloc_fail[i]) : 0ULL);
+ if (i < RDMACG_RESOURCE_MAX - 1)
+ seq_putc(sf, ' ');
+ }
+ seq_putc(sf, '\n');
+ }
+
+ mutex_unlock(&rdmacg_mutex);
+ return 0;
+}
+
+static int rdmacg_events_local_show(struct seq_file *sf, void *v)
+{
+ struct rdma_cgroup *cg = css_rdmacg(seq_css(sf));
+ struct rdmacg_resource_pool *rpool;
+ struct rdmacg_device *device;
+ int i;
+
+ mutex_lock(&rdmacg_mutex);
+
+ list_for_each_entry(device, &rdmacg_devices, dev_node) {
+ rpool = find_cg_rpool_locked(cg, device);
+
+ seq_printf(sf, "%s ", device->name);
+ for (i = 0; i < RDMACG_RESOURCE_MAX; i++) {
+ seq_printf(sf, "%s.max=%llu %s.alloc_fail=%llu",
+ rdmacg_resource_names[i],
+ rpool ? READ_ONCE(rpool->events_local_max[i]) : 0ULL,
rdmacg_resource_names[i],
- rpool ? READ_ONCE(rpool->events_max[i]) : 0ULL);
+ rpool ? READ_ONCE(rpool->events_local_alloc_fail[i]) : 0ULL);
if (i < RDMACG_RESOURCE_MAX - 1)
seq_putc(sf, ' ');
}
@@ -657,6 +722,12 @@ static struct cftype rdmacg_files[] = {
.file_offset = offsetof(struct rdma_cgroup, events_file),
.flags = CFTYPE_NOT_ON_ROOT,
},
+ {
+ .name = "events.local",
+ .seq_show = rdmacg_events_local_show,
+ .file_offset = offsetof(struct rdma_cgroup, events_local_file),
+ .flags = CFTYPE_NOT_ON_ROOT,
+ },
{ } /* terminate */
};
--
2.43.0
^ permalink raw reply related [flat|nested] 9+ messages in thread
* [PATCH v3 4/4] cgroup/rdma: document rdma.peak, rdma.events and rdma.events.local
2026-05-14 6:50 [PATCH v3 0/4] cgroup/rdma: add rdma.peak and rdma.events[.local] Tao Cui
` (2 preceding siblings ...)
2026-05-14 6:50 ` [PATCH v3 3/4] cgroup/rdma: add rdma.events.local for per-cgroup allocation failure attribution Tao Cui
@ 2026-05-14 6:50 ` Tao Cui
2026-05-14 21:26 ` [PATCH v3 0/4] cgroup/rdma: add rdma.peak and rdma.events[.local] Tejun Heo
` (2 subsequent siblings)
6 siblings, 0 replies; 9+ messages in thread
From: Tao Cui @ 2026-05-14 6:50 UTC (permalink / raw)
To: tj, hannes, mkoutny, cgroups; +Cc: Tao Cui
Add interface file documentation for the new rdma cgroup files to
Documentation/admin-guide/cgroup-v2.rst.
Signed-off-by: Tao Cui <cuitao@kylinos.cn>
---
Documentation/admin-guide/cgroup-v2.rst | 53 +++++++++++++++++++++++++
1 file changed, 53 insertions(+)
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 6efd0095ed99..993446ab66d0 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -2785,6 +2785,59 @@ RDMA Interface Files
mlx4_0 hca_handle=1 hca_object=20
ocrdma1 hca_handle=1 hca_object=23
+ rdma.peak
+ A read-only nested-keyed file that exists for all the cgroups
+ except root. It shows the historical high watermark of
+ resource usage per device since the cgroup was created.
+
+ An example for mlx4 and ocrdma device follows::
+
+ mlx4_0 hca_handle=1 hca_object=20
+ ocrdma1 hca_handle=1 hca_object=23
+
+ rdma.events
+ A read-only nested-keyed file which exists on non-root
+ cgroups. The following nested keys are defined.
+
+ max
+ The number of times a process in this cgroup or its
+ descendants attempted an RDMA resource allocation that
+ was rejected because a rdma.max limit in the subtree
+ was reached. This is a hierarchical counter: the event
+ is propagated upward to all ancestor cgroups. A value
+ change in this file generates a file modified event.
+
+ alloc_fail
+ The number of RDMA resource allocation attempts that
+ originated in this cgroup or its descendants and failed
+ due to a rdma.max limit being reached. This is a
+ hierarchical counter propagated upward.
+
+ An example for mlx4 device follows::
+
+ mlx4_0 hca_handle.max=5 hca_handle.alloc_fail=3 hca_object.max=0 hca_object.alloc_fail=0
+
+ rdma.events.local
+ Similar to rdma.events but the fields in the file are local
+ to the cgroup i.e. not hierarchical. The file modified event
+ generated on this file reflects only the local events.
+
+ The following nested keys are defined.
+
+ max
+ The number of times a process in this cgroup or its
+ descendants attempted an RDMA resource allocation that
+ was rejected because this cgroup's own rdma.max limit
+ was reached.
+ alloc_fail
+ The number of RDMA resource allocation attempts
+ originating from this cgroup that failed due to this
+ cgroup's or an ancestor's rdma.max limit.
+
+ An example for mlx4 device follows::
+
+ mlx4_0 hca_handle.max=5 hca_handle.alloc_fail=0 hca_object.max=0 hca_object.alloc_fail=0
+
DMEM
----
--
2.43.0
^ permalink raw reply related [flat|nested] 9+ messages in thread
* Re: [PATCH v3 0/4] cgroup/rdma: add rdma.peak and rdma.events[.local]
2026-05-14 6:50 [PATCH v3 0/4] cgroup/rdma: add rdma.peak and rdma.events[.local] Tao Cui
` (3 preceding siblings ...)
2026-05-14 6:50 ` [PATCH v3 4/4] cgroup/rdma: document rdma.peak, rdma.events and rdma.events.local Tao Cui
@ 2026-05-14 21:26 ` Tejun Heo
2026-05-15 0:48 ` Tao Cui
2026-05-14 21:27 ` Tejun Heo
2026-05-14 21:27 ` Tejun Heo
6 siblings, 1 reply; 9+ messages in thread
From: Tejun Heo @ 2026-05-14 21:26 UTC (permalink / raw)
To: Tao Cui; +Cc: hannes, mkoutny, cgroups
Hello,
> Tao Cui (4):
> cgroup/rdma: add rdma.peak for per-device peak usage tracking
> cgroup/rdma: add rdma.events to track resource limit exhaustion
> cgroup/rdma: add rdma.events.local for per-cgroup allocation failure
> attribution
> cgroup/rdma: document rdma.peak, rdma.events and rdma.events.local
Applied 1-4 to cgroup/for-7.2.
One follow-up: the new event counters use READ_ONCE() on reads but plain
++ on writes, and all accesses are under rdmacg_mutex. Please send a
follow-up patch dropping the READ_ONCE()s.
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 9+ messages in thread* Re: [PATCH v3 0/4] cgroup/rdma: add rdma.peak and rdma.events[.local]
2026-05-14 6:50 [PATCH v3 0/4] cgroup/rdma: add rdma.peak and rdma.events[.local] Tao Cui
` (4 preceding siblings ...)
2026-05-14 21:26 ` [PATCH v3 0/4] cgroup/rdma: add rdma.peak and rdma.events[.local] Tejun Heo
@ 2026-05-14 21:27 ` Tejun Heo
2026-05-14 21:27 ` Tejun Heo
6 siblings, 0 replies; 9+ messages in thread
From: Tejun Heo @ 2026-05-14 21:27 UTC (permalink / raw)
To: Tao Cui; +Cc: hannes, mkoutny, cgroups
Hello,
> Tao Cui (4):
> cgroup/rdma: add rdma.peak for per-device peak usage tracking
> cgroup/rdma: add rdma.events to track resource limit exhaustion
> cgroup/rdma: add rdma.events.local for per-cgroup allocation failure
> attribution
> cgroup/rdma: document rdma.peak, rdma.events and rdma.events.local
Applied 1-4 to cgroup/for-7.2.
One follow-up: the new event counters use READ_ONCE() on reads but plain
++ on writes, and all accesses are under rdmacg_mutex. Please send a
follow-up patch dropping the READ_ONCE()s.
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 9+ messages in thread* Re: [PATCH v3 0/4] cgroup/rdma: add rdma.peak and rdma.events[.local]
2026-05-14 6:50 [PATCH v3 0/4] cgroup/rdma: add rdma.peak and rdma.events[.local] Tao Cui
` (5 preceding siblings ...)
2026-05-14 21:27 ` Tejun Heo
@ 2026-05-14 21:27 ` Tejun Heo
6 siblings, 0 replies; 9+ messages in thread
From: Tejun Heo @ 2026-05-14 21:27 UTC (permalink / raw)
To: Tao Cui; +Cc: hannes, mkoutny, cgroups
Hello,
> Tao Cui (4):
> cgroup/rdma: add rdma.peak for per-device peak usage tracking
> cgroup/rdma: add rdma.events to track resource limit exhaustion
> cgroup/rdma: add rdma.events.local for per-cgroup allocation failure
> attribution
> cgroup/rdma: document rdma.peak, rdma.events and rdma.events.local
Applied 1-4 to cgroup/for-7.2.
One follow-up: the new event counters use READ_ONCE() on reads but plain
++ on writes, and all accesses are under rdmacg_mutex. Please send a
follow-up patch dropping the READ_ONCE()s.
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 9+ messages in thread