* [PATCH] cgroup/rstat: change cgroup_base_stat to atomic
@ 2025-06-17 10:26 Bertrand Wlodarczyk
  2025-06-17 12:38 ` kernel test robot
                   ` (4 more replies)
  0 siblings, 5 replies; 9+ messages in thread
From: Bertrand Wlodarczyk @ 2025-06-17 10:26 UTC (permalink / raw)
  To: tj, hannes, mkoutny, cgroups, linux-kernel; +Cc: Bertrand Wlodarczyk

The kernel currently faces scalability issues when multiple userspace
programs attempt to read cgroup statistics concurrently.

The primary bottleneck is the css_cgroup_lock in cgroup_rstat_flush,
which prevents access and updates to the statistics
of the css from multiple CPUs in parallel.

Given that rstat operates on a per-CPU basis and only aggregates
statistics in the parent cgroup, there is no compelling reason
why these statistics cannot be atomic.
By eliminating the lock, each CPU can traverse its rstat hierarchy
independently, without blocking. Synchronization is achieved during
parent propagation through atomic operations.
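
As a rough illustration of the intended pattern (a minimal userspace
sketch with hypothetical names, not the kernel implementation; it assumes
that flushing a given CPU slot stays serialized, which the per-CPU rstat
lock still provides in this patch):

#include <stdatomic.h>
#include <stdint.h>

#define NR_CPUS 4

struct node {
	_Atomic uint64_t sum;			/* aggregated total          */
	_Atomic uint64_t percpu[NR_CPUS];	/* hot-path per-CPU updates  */
	uint64_t last[NR_CPUS];			/* value seen at last flush  */
};

/* Hot path: touches only this CPU's slot, no shared lock. */
static void account(struct node *n, int cpu, uint64_t delta)
{
	atomic_fetch_add_explicit(&n->percpu[cpu], delta, memory_order_relaxed);
}

/* Flush: propagate this CPU's delta upward with a single atomic add. */
static void flush_cpu(struct node *n, int cpu)
{
	uint64_t cur = atomic_load_explicit(&n->percpu[cpu], memory_order_relaxed);

	atomic_fetch_add_explicit(&n->sum, cur - n->last[cpu], memory_order_relaxed);
	n->last[cpu] = cur;
}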

This change significantly enhances performance in scenarios
where multiple CPUs access CPU rstat within a single cgroup hierarchy,
yielding a performance improvement of around 50 times compared
to the mainline version.
Notably, performance for memory and I/O rstats remains unchanged,
as these are managed in separate submodules.

Additionally, this patch addresses a race condition detectable
in the current mainline by KCSAN in __cgroup_account_cputime,
which occurs when attempting to read a single hierarchy
from multiple CPUs.

Signed-off-by: Bertrand Wlodarczyk <bertrand.wlodarczyk@intel.com>
---
Benchmark code: https://gist.github.com/bwlodarcz/c955b36b5667f0167dffcff23953d1da

Tested on Intel(R) Xeon(R) Platinum 8468V, 2s 48c 2tpc, 377GiB RAM, Fedora 41:
+--------+-------+
|Mainline|Patched|
+--------+-------+
|369.95s |6.52s  |
+--------+-------+

KCSAN report after a benchmark run on mainline in a QEMU VM:
[   70.614601] ==================================================================
[   70.615350] BUG: KCSAN: data-race in __cgroup_account_cputime / cgroup_base_stat_flush
[   70.616229]
[   70.616462] write to 0xffd1ffffffd82360 of 8 bytes by interrupt on cpu 7:
[   70.617155]  __cgroup_account_cputime+0x4f/0x90
[   70.617674]  update_curr+0x1c1/0x260
[   70.618081]  task_tick_fair+0x46/0x610
[   70.618538]  sched_tick+0xb2/0x280
[   70.618921]  update_process_times+0x9c/0xe0
[   70.619409]  tick_nohz_handler+0x101/0x230
[   70.619856]  __hrtimer_run_queues+0x21f/0x460
[   70.620364]  hrtimer_interrupt+0x1cb/0x3b0
[   70.620814]  __sysvec_apic_timer_interrupt+0x64/0x180
[   70.621368]  sysvec_apic_timer_interrupt+0x71/0x90
[   70.621867]  asm_sysvec_apic_timer_interrupt+0x1a/0x20
[   70.622438]  _raw_spin_unlock_irq+0x15/0x30
[   70.622883]  css_rstat_flush+0x13b/0x280
[   70.623305]  cgroup_base_stat_cputime_show+0xa4/0x340
[   70.623819]  cpu_stat_show+0x19/0x1c0
[   70.624246]  cgroup_seqfile_show+0xc6/0x180
[   70.624686]  kernfs_seq_show+0x9b/0xb0
[   70.625096]  seq_read_iter+0x195/0x800
[   70.625542]  kernfs_fop_read_iter+0x70/0x80
[   70.625985]  vfs_read+0x473/0x5a0
[   70.626388]  ksys_read+0xaa/0x130
[   70.626756]  __x64_sys_read+0x41/0x50
[   70.627156]  x64_sys_call+0x19e1/0x1c10
[   70.627649]  do_syscall_64+0x84/0x210
[   70.628052]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[   70.628598]
[   70.628811] read to 0xffd1ffffffd82360 of 8 bytes by task 347 on cpu 3:
[   70.629480]  cgroup_base_stat_flush+0xa7/0x9e0
[   70.629943]  css_rstat_flush+0x161/0x280
[   70.630396]  cgroup_base_stat_cputime_show+0xa4/0x340
[   70.630912]  cpu_stat_show+0x19/0x1c0
[   70.631343]  cgroup_seqfile_show+0xc6/0x180
[   70.631784]  kernfs_seq_show+0x9b/0xb0
[   70.632221]  seq_read_iter+0x195/0x800
[   70.632625]  kernfs_fop_read_iter+0x70/0x80
[   70.633070]  vfs_read+0x473/0x5a0
[   70.633472]  ksys_read+0xaa/0x130
[   70.633840]  __x64_sys_read+0x41/0x50
[   70.634242]  x64_sys_call+0x19e1/0x1c10
[   70.634655]  do_syscall_64+0x84/0x210
[   70.635056]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[   70.635602]
[   70.635816] value changed: 0x0000000ac6ba8b3f -> 0x0000000ac6c9f1ed
[   70.636461]
[   70.636674] Reported by Kernel Concurrency Sanitizer on:
[   70.637233] CPU: 3 UID: 0 PID: 347 Comm: benchmark Not tainted 6.15.0+ #37 PREEMPT(lazy)
[   70.638042] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-3.fc41 04/01/2014
[   70.638942] ==================================================================
---
 include/linux/cgroup-defs.h   |  30 ++---
 include/linux/sched/types.h   |   6 +
 include/trace/events/cgroup.h |  26 ----
 kernel/cgroup/rstat.c         | 228 +++++++++++++---------------------
 4 files changed, 104 insertions(+), 186 deletions(-)

diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index e61687d5e496..58c6086e49e0 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -233,16 +233,6 @@ struct cgroup_subsys_state {
 	 * Protected by cgroup_mutex.
 	 */
 	int nr_descendants;
-
-	/*
-	 * A singly-linked list of css structures to be rstat flushed.
-	 * This is a scratch field to be used exclusively by
-	 * css_rstat_flush().
-	 *
-	 * Protected by rstat_base_lock when css is cgroup::self.
-	 * Protected by css->ss->rstat_ss_lock otherwise.
-	 */
-	struct cgroup_subsys_state *rstat_flush_next;
 };
 
 /*
@@ -343,12 +333,12 @@ struct css_set {
 };
 
 struct cgroup_base_stat {
-	struct task_cputime cputime;
+	struct atomic_task_cputime cputime;
 
 #ifdef CONFIG_SCHED_CORE
-	u64 forceidle_sum;
+	atomic64_t forceidle_sum;
 #endif
-	u64 ntime;
+	atomic64_t ntime;
 };
 
 /*
@@ -384,6 +374,14 @@ struct css_rstat_cpu {
 	 */
 	struct cgroup_subsys_state *updated_children;
 	struct cgroup_subsys_state *updated_next;	/* NULL if not on the list */
+	/*
+	 * A singly-linked list of css structures to be rstat flushed.
+	 * This is a scratch field to be used exclusively by
+	 * css_rstat_flush().
+	 *
+	 * Protected by per-cpu css->ss->rstat_ss_cpu_lock.
+	 */
+	struct cgroup_subsys_state *rstat_flush_next;
 };
 
 /*
@@ -391,11 +389,6 @@ struct css_rstat_cpu {
  * top of it - bsync, bstat and last_bstat.
  */
 struct cgroup_rstat_base_cpu {
-	/*
-	 * ->bsync protects ->bstat.  These are the only fields which get
-	 * updated in the hot path.
-	 */
-	struct u64_stats_sync bsync;
 	struct cgroup_base_stat bstat;
 
 	/*
@@ -820,7 +813,6 @@ struct cgroup_subsys {
 	 */
 	unsigned int depends_on;
 
-	spinlock_t rstat_ss_lock;
 	raw_spinlock_t __percpu *rstat_ss_cpu_lock;
 };
 
diff --git a/include/linux/sched/types.h b/include/linux/sched/types.h
index 969aaf5ef9d6..b2bc644192b1 100644
--- a/include/linux/sched/types.h
+++ b/include/linux/sched/types.h
@@ -20,4 +20,10 @@ struct task_cputime {
 	unsigned long long		sum_exec_runtime;
 };
 
+struct atomic_task_cputime {
+	atomic64_t stime;
+	atomic64_t utime;
+	atomic_long_t sum_exec_runtime;
+};
+
 #endif /* _LINUX_SCHED_TYPES_H */
diff --git a/include/trace/events/cgroup.h b/include/trace/events/cgroup.h
index 7d332387be6c..baa96f86b040 100644
--- a/include/trace/events/cgroup.h
+++ b/include/trace/events/cgroup.h
@@ -231,32 +231,6 @@ DECLARE_EVENT_CLASS(cgroup_rstat,
 		  __entry->cpu, __entry->contended)
 );
 
-/*
- * Related to locks:
- * global rstat_base_lock for base stats
- * cgroup_subsys::rstat_ss_lock for subsystem stats
- */
-DEFINE_EVENT(cgroup_rstat, cgroup_rstat_lock_contended,
-
-	TP_PROTO(struct cgroup *cgrp, int cpu, bool contended),
-
-	TP_ARGS(cgrp, cpu, contended)
-);
-
-DEFINE_EVENT(cgroup_rstat, cgroup_rstat_locked,
-
-	TP_PROTO(struct cgroup *cgrp, int cpu, bool contended),
-
-	TP_ARGS(cgrp, cpu, contended)
-);
-
-DEFINE_EVENT(cgroup_rstat, cgroup_rstat_unlock,
-
-	TP_PROTO(struct cgroup *cgrp, int cpu, bool contended),
-
-	TP_ARGS(cgrp, cpu, contended)
-);
-
 /*
  * Related to per CPU locks:
  * global rstat_base_cpu_lock for base stats
diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index cbeaa499a96a..36af2b883440 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -9,7 +9,6 @@
 
 #include <trace/events/cgroup.h>
 
-static DEFINE_SPINLOCK(rstat_base_lock);
 static DEFINE_PER_CPU(raw_spinlock_t, rstat_base_cpu_lock);
 
 static void cgroup_base_stat_flush(struct cgroup *cgrp, int cpu);
@@ -37,14 +36,6 @@ static struct cgroup_rstat_base_cpu *cgroup_rstat_base_cpu(
 	return per_cpu_ptr(cgrp->rstat_base_cpu, cpu);
 }
 
-static spinlock_t *ss_rstat_lock(struct cgroup_subsys *ss)
-{
-	if (ss)
-		return &ss->rstat_ss_lock;
-
-	return &rstat_base_lock;
-}
-
 static raw_spinlock_t *ss_rstat_cpu_lock(struct cgroup_subsys *ss, int cpu)
 {
 	if (ss) {
@@ -209,15 +200,7 @@ static struct cgroup_subsys_state *css_rstat_push_children(
 	struct cgroup_subsys_state *parent, *grandchild;
 	struct css_rstat_cpu *crstatc;
 
-	child->rstat_flush_next = NULL;
-
-	/*
-	 * The subsystem rstat lock must be held for the whole duration from
-	 * here as the rstat_flush_next list is being constructed to when
-	 * it is consumed later in css_rstat_flush().
-	 */
-	lockdep_assert_held(ss_rstat_lock(head->ss));
-
+	css_rstat_cpu(child, cpu)->rstat_flush_next = NULL;
 	/*
 	 * Notation: -> updated_next pointer
 	 *	     => rstat_flush_next pointer
@@ -237,19 +220,19 @@ static struct cgroup_subsys_state *css_rstat_push_children(
 next_level:
 	while (cnext) {
 		child = cnext;
-		cnext = child->rstat_flush_next;
+		cnext = css_rstat_cpu(child, cpu)->rstat_flush_next;
 		parent = child->parent;
 
 		/* updated_next is parent cgroup terminated if !NULL */
 		while (child != parent) {
-			child->rstat_flush_next = head;
+			css_rstat_cpu(child, cpu)->rstat_flush_next = head;
 			head = child;
 			crstatc = css_rstat_cpu(child, cpu);
 			grandchild = crstatc->updated_children;
 			if (grandchild != child) {
 				/* Push the grand child to the next level */
 				crstatc->updated_children = child;
-				grandchild->rstat_flush_next = ghead;
+				css_rstat_cpu(grandchild, cpu)->rstat_flush_next = ghead;
 				ghead = grandchild;
 			}
 			child = crstatc->updated_next;
@@ -288,9 +271,6 @@ static struct cgroup_subsys_state *css_rstat_updated_list(
 {
 	struct css_rstat_cpu *rstatc = css_rstat_cpu(root, cpu);
 	struct cgroup_subsys_state *head = NULL, *parent, *child;
-	unsigned long flags;
-
-	flags = _css_rstat_cpu_lock(root, cpu, false);
 
 	/* Return NULL if this subtree is not on-list */
 	if (!rstatc->updated_next)
@@ -321,13 +301,12 @@ static struct cgroup_subsys_state *css_rstat_updated_list(
 
 	/* Push @root to the list first before pushing the children */
 	head = root;
-	root->rstat_flush_next = NULL;
+	css_rstat_cpu(root, cpu)->rstat_flush_next = NULL;
 	child = rstatc->updated_children;
 	rstatc->updated_children = root;
 	if (child != root)
 		head = css_rstat_push_children(head, child, cpu);
 unlock_ret:
-	_css_rstat_cpu_unlock(root, cpu, flags, false);
 	return head;
 }
 
@@ -353,44 +332,6 @@ __weak noinline void bpf_rstat_flush(struct cgroup *cgrp,
 
 __bpf_hook_end();
 
-/*
- * Helper functions for locking.
- *
- * This makes it easier to diagnose locking issues and contention in
- * production environments.  The parameter @cpu_in_loop indicate lock
- * was released and re-taken when collection data from the CPUs. The
- * value -1 is used when obtaining the main lock else this is the CPU
- * number processed last.
- */
-static inline void __css_rstat_lock(struct cgroup_subsys_state *css,
-		int cpu_in_loop)
-	__acquires(ss_rstat_lock(css->ss))
-{
-	struct cgroup *cgrp = css->cgroup;
-	spinlock_t *lock;
-	bool contended;
-
-	lock = ss_rstat_lock(css->ss);
-	contended = !spin_trylock_irq(lock);
-	if (contended) {
-		trace_cgroup_rstat_lock_contended(cgrp, cpu_in_loop, contended);
-		spin_lock_irq(lock);
-	}
-	trace_cgroup_rstat_locked(cgrp, cpu_in_loop, contended);
-}
-
-static inline void __css_rstat_unlock(struct cgroup_subsys_state *css,
-				      int cpu_in_loop)
-	__releases(ss_rstat_lock(css->ss))
-{
-	struct cgroup *cgrp = css->cgroup;
-	spinlock_t *lock;
-
-	lock = ss_rstat_lock(css->ss);
-	trace_cgroup_rstat_unlock(cgrp, cpu_in_loop, false);
-	spin_unlock_irq(lock);
-}
-
 /**
  * css_rstat_flush - flush stats in @css's rstat subtree
  * @css: target cgroup subsystem state
@@ -419,11 +360,11 @@ __bpf_kfunc void css_rstat_flush(struct cgroup_subsys_state *css)
 	might_sleep();
 	for_each_possible_cpu(cpu) {
 		struct cgroup_subsys_state *pos;
+		unsigned long flags;
 
-		/* Reacquire for each CPU to avoid disabling IRQs too long */
-		__css_rstat_lock(css, cpu);
+		flags = _css_rstat_cpu_lock(css, cpu, false);
 		pos = css_rstat_updated_list(css, cpu);
-		for (; pos; pos = pos->rstat_flush_next) {
+		for (; pos; pos = css_rstat_cpu(pos, cpu)->rstat_flush_next) {
 			if (is_self) {
 				cgroup_base_stat_flush(pos->cgroup, cpu);
 				bpf_rstat_flush(pos->cgroup,
@@ -431,7 +372,7 @@ __bpf_kfunc void css_rstat_flush(struct cgroup_subsys_state *css)
 			} else
 				pos->ss->css_rstat_flush(pos, cpu);
 		}
-		__css_rstat_unlock(css, cpu);
+		_css_rstat_cpu_unlock(css, cpu, flags, false);
 		if (!cond_resched())
 			cpu_relax();
 	}
@@ -474,7 +415,6 @@ int css_rstat_init(struct cgroup_subsys_state *css)
 			struct cgroup_rstat_base_cpu *rstatbc;
 
 			rstatbc = cgroup_rstat_base_cpu(cgrp, cpu);
-			u64_stats_init(&rstatbc->bsync);
 		}
 	}
 
@@ -532,7 +472,6 @@ int __init ss_rstat_init(struct cgroup_subsys *ss)
 			return -ENOMEM;
 	}
 
-	spin_lock_init(ss_rstat_lock(ss));
 	for_each_possible_cpu(cpu)
 		raw_spin_lock_init(ss_rstat_cpu_lock(ss, cpu));
 
@@ -544,46 +483,58 @@ int __init ss_rstat_init(struct cgroup_subsys *ss)
  * rstat.
  */
 static void cgroup_base_stat_add(struct cgroup_base_stat *dst_bstat,
-				 struct cgroup_base_stat *src_bstat)
+				struct cgroup_base_stat *src_bstat)
 {
-	dst_bstat->cputime.utime += src_bstat->cputime.utime;
-	dst_bstat->cputime.stime += src_bstat->cputime.stime;
-	dst_bstat->cputime.sum_exec_runtime += src_bstat->cputime.sum_exec_runtime;
+	atomic64_add(atomic64_read(&src_bstat->cputime.utime), &dst_bstat->cputime.utime);
+	atomic64_add(atomic64_read(&src_bstat->cputime.stime), &dst_bstat->cputime.stime);
+	atomic_long_add(atomic_long_read(&src_bstat->cputime.sum_exec_runtime),
+			&dst_bstat->cputime.sum_exec_runtime);
 #ifdef CONFIG_SCHED_CORE
-	dst_bstat->forceidle_sum += src_bstat->forceidle_sum;
+	atomic64_add(atomic64_read(&src_bstat->forceidle_sum), &dst_bstat->forceidle_sum);
 #endif
-	dst_bstat->ntime += src_bstat->ntime;
+	atomic64_add(atomic64_read(&src_bstat->ntime), &dst_bstat->ntime);
 }
 
 static void cgroup_base_stat_sub(struct cgroup_base_stat *dst_bstat,
-				 struct cgroup_base_stat *src_bstat)
+				struct cgroup_base_stat *src_bstat)
 {
-	dst_bstat->cputime.utime -= src_bstat->cputime.utime;
-	dst_bstat->cputime.stime -= src_bstat->cputime.stime;
-	dst_bstat->cputime.sum_exec_runtime -= src_bstat->cputime.sum_exec_runtime;
+	atomic64_sub(atomic64_read(&src_bstat->cputime.utime), &dst_bstat->cputime.utime);
+	atomic64_sub(atomic64_read(&src_bstat->cputime.stime), &dst_bstat->cputime.stime);
+	atomic_long_sub(atomic_long_read(&src_bstat->cputime.sum_exec_runtime),
+			&dst_bstat->cputime.sum_exec_runtime);
 #ifdef CONFIG_SCHED_CORE
-	dst_bstat->forceidle_sum -= src_bstat->forceidle_sum;
+	atomic64_sub(atomic64_read(&src_bstat->forceidle_sum), &dst_bstat->forceidle_sum);
 #endif
-	dst_bstat->ntime -= src_bstat->ntime;
+	atomic64_sub(atomic64_read(&src_bstat->ntime), &dst_bstat->ntime);
 }
 
+static void cgroup_base_stat_copy(struct cgroup_base_stat *dst_bstat,
+				struct cgroup_base_stat *src_bstat)
+{
+		atomic64_set(&dst_bstat->cputime.stime, atomic64_read(&src_bstat->cputime.stime));
+		atomic64_set(&dst_bstat->cputime.utime, atomic64_read(&src_bstat->cputime.utime));
+		atomic_long_set(&dst_bstat->cputime.sum_exec_runtime,
+			atomic_long_read(&src_bstat->cputime.sum_exec_runtime));
+#ifdef CONFIG_SCHED_CORE
+		atomic64_set(&dst_bstat->forceidle_sum, atomic64_read(&src_bstat->forceidle_sum));
+#endif
+		atomic64_set(&dst_bstat->ntime, atomic64_read(&src_bstat->ntime));
+}
+
+
 static void cgroup_base_stat_flush(struct cgroup *cgrp, int cpu)
 {
 	struct cgroup_rstat_base_cpu *rstatbc = cgroup_rstat_base_cpu(cgrp, cpu);
 	struct cgroup *parent = cgroup_parent(cgrp);
 	struct cgroup_rstat_base_cpu *prstatbc;
 	struct cgroup_base_stat delta;
-	unsigned seq;
 
 	/* Root-level stats are sourced from system-wide CPU stats */
 	if (!parent)
 		return;
 
 	/* fetch the current per-cpu values */
-	do {
-		seq = __u64_stats_fetch_begin(&rstatbc->bsync);
-		delta = rstatbc->bstat;
-	} while (__u64_stats_fetch_retry(&rstatbc->bsync, seq));
+	cgroup_base_stat_copy(&delta, &rstatbc->bstat);
 
 	/* propagate per-cpu delta to cgroup and per-cpu global statistics */
 	cgroup_base_stat_sub(&delta, &rstatbc->last_bstat);
@@ -593,12 +544,12 @@ static void cgroup_base_stat_flush(struct cgroup *cgrp, int cpu)
 
 	/* propagate cgroup and per-cpu global delta to parent (unless that's root) */
 	if (cgroup_parent(parent)) {
-		delta = cgrp->bstat;
+		cgroup_base_stat_copy(&delta, &cgrp->bstat);
 		cgroup_base_stat_sub(&delta, &cgrp->last_bstat);
 		cgroup_base_stat_add(&parent->bstat, &delta);
 		cgroup_base_stat_add(&cgrp->last_bstat, &delta);
 
-		delta = rstatbc->subtree_bstat;
+		cgroup_base_stat_copy(&delta, &rstatbc->subtree_bstat);
 		prstatbc = cgroup_rstat_base_cpu(parent, cpu);
 		cgroup_base_stat_sub(&delta, &rstatbc->last_subtree_bstat);
 		cgroup_base_stat_add(&prstatbc->subtree_bstat, &delta);
@@ -606,65 +557,44 @@ static void cgroup_base_stat_flush(struct cgroup *cgrp, int cpu)
 	}
 }
 
-static struct cgroup_rstat_base_cpu *
-cgroup_base_stat_cputime_account_begin(struct cgroup *cgrp, unsigned long *flags)
+void __cgroup_account_cputime(struct cgroup *cgrp, u64 delta_exec)
 {
 	struct cgroup_rstat_base_cpu *rstatbc;
 
 	rstatbc = get_cpu_ptr(cgrp->rstat_base_cpu);
-	*flags = u64_stats_update_begin_irqsave(&rstatbc->bsync);
-	return rstatbc;
-}
-
-static void cgroup_base_stat_cputime_account_end(struct cgroup *cgrp,
-						 struct cgroup_rstat_base_cpu *rstatbc,
-						 unsigned long flags)
-{
-	u64_stats_update_end_irqrestore(&rstatbc->bsync, flags);
+	atomic_long_add(delta_exec, &rstatbc->bstat.cputime.sum_exec_runtime);
 	css_rstat_updated(&cgrp->self, smp_processor_id());
 	put_cpu_ptr(rstatbc);
 }
 
-void __cgroup_account_cputime(struct cgroup *cgrp, u64 delta_exec)
-{
-	struct cgroup_rstat_base_cpu *rstatbc;
-	unsigned long flags;
-
-	rstatbc = cgroup_base_stat_cputime_account_begin(cgrp, &flags);
-	rstatbc->bstat.cputime.sum_exec_runtime += delta_exec;
-	cgroup_base_stat_cputime_account_end(cgrp, rstatbc, flags);
-}
-
 void __cgroup_account_cputime_field(struct cgroup *cgrp,
 				    enum cpu_usage_stat index, u64 delta_exec)
 {
 	struct cgroup_rstat_base_cpu *rstatbc;
-	unsigned long flags;
 
-	rstatbc = cgroup_base_stat_cputime_account_begin(cgrp, &flags);
+	rstatbc = get_cpu_ptr(cgrp->rstat_base_cpu);
 
 	switch (index) {
 	case CPUTIME_NICE:
-		rstatbc->bstat.ntime += delta_exec;
+		atomic64_add(delta_exec, &rstatbc->bstat.ntime);
 		fallthrough;
 	case CPUTIME_USER:
-		rstatbc->bstat.cputime.utime += delta_exec;
+		atomic64_add(delta_exec, &rstatbc->bstat.cputime.utime);
 		break;
 	case CPUTIME_SYSTEM:
 	case CPUTIME_IRQ:
 	case CPUTIME_SOFTIRQ:
-		rstatbc->bstat.cputime.stime += delta_exec;
+		atomic64_add(delta_exec, &rstatbc->bstat.cputime.stime);
 		break;
 #ifdef CONFIG_SCHED_CORE
 	case CPUTIME_FORCEIDLE:
-		rstatbc->bstat.forceidle_sum += delta_exec;
+		atomic64_add(delta_exec, &rstatbc->bstat.forceidle_sum);
 		break;
 #endif
 	default:
 		break;
 	}
-
-	cgroup_base_stat_cputime_account_end(cgrp, rstatbc, flags);
+	put_cpu_ptr(rstatbc);
 }
 
 /*
@@ -675,7 +605,7 @@ void __cgroup_account_cputime_field(struct cgroup *cgrp,
  */
 static void root_cgroup_cputime(struct cgroup_base_stat *bstat)
 {
-	struct task_cputime *cputime = &bstat->cputime;
+	struct atomic_task_cputime *cputime = &bstat->cputime;
 	int i;
 
 	memset(bstat, 0, sizeof(*bstat));
@@ -689,20 +619,19 @@ static void root_cgroup_cputime(struct cgroup_base_stat *bstat)
 
 		user += cpustat[CPUTIME_USER];
 		user += cpustat[CPUTIME_NICE];
-		cputime->utime += user;
+		atomic64_add(user, &cputime->utime);
 
 		sys += cpustat[CPUTIME_SYSTEM];
 		sys += cpustat[CPUTIME_IRQ];
 		sys += cpustat[CPUTIME_SOFTIRQ];
-		cputime->stime += sys;
+		atomic64_add(sys, &cputime->stime);
 
-		cputime->sum_exec_runtime += user;
-		cputime->sum_exec_runtime += sys;
+		atomic64_add(sys + user, &cputime->sum_exec_runtime);
 
 #ifdef CONFIG_SCHED_CORE
-		bstat->forceidle_sum += cpustat[CPUTIME_FORCEIDLE];
+		atomic64_add(cpustat[CPUTIME_FORCEIDLE], &bstat->forceidle_sum);
 #endif
-		bstat->ntime += cpustat[CPUTIME_NICE];
+		atomic64_add(cpustat[CPUTIME_NICE], &bstat->ntime);
 	}
 }
 
@@ -710,7 +639,7 @@ static void root_cgroup_cputime(struct cgroup_base_stat *bstat)
 static void cgroup_force_idle_show(struct seq_file *seq, struct cgroup_base_stat *bstat)
 {
 #ifdef CONFIG_SCHED_CORE
-	u64 forceidle_time = bstat->forceidle_sum;
+	s64 forceidle_time = atomic64_read(&bstat->forceidle_sum);
 
 	do_div(forceidle_time, NSEC_PER_USEC);
 	seq_printf(seq, "core_sched.force_idle_usec %llu\n", forceidle_time);
@@ -721,31 +650,48 @@ void cgroup_base_stat_cputime_show(struct seq_file *seq)
 {
 	struct cgroup *cgrp = seq_css(seq)->cgroup;
 	struct cgroup_base_stat bstat;
+	unsigned long long sum_exec_runtime;
+	s64 utime, stime, ntime;
 
 	if (cgroup_parent(cgrp)) {
 		css_rstat_flush(&cgrp->self);
-		__css_rstat_lock(&cgrp->self, -1);
-		bstat = cgrp->bstat;
-		cputime_adjust(&cgrp->bstat.cputime, &cgrp->prev_cputime,
-			       &bstat.cputime.utime, &bstat.cputime.stime);
-		__css_rstat_unlock(&cgrp->self, -1);
-	} else {
+		cgroup_base_stat_copy(&bstat, &cgrp->bstat);
+		struct task_cputime cputime = {
+			.stime = atomic64_read(&bstat.cputime.stime),
+			.utime = atomic64_read(&bstat.cputime.utime),
+			.sum_exec_runtime = atomic_long_read(
+				&bstat.cputime.sum_exec_runtime)
+		};
+		cputime_adjust(&cputime, &cgrp->prev_cputime,
+			       &cputime.utime, &cputime.stime);
+		atomic64_set(&bstat.cputime.utime, cputime.utime);
+		atomic64_set(&bstat.cputime.stime, cputime.stime);
+	} else
 		root_cgroup_cputime(&bstat);
-	}
+	sum_exec_runtime = atomic_long_read(&bstat.cputime.sum_exec_runtime);
+
+	do_div(sum_exec_runtime, NSEC_PER_USEC);
+
+	utime = atomic64_read(&bstat.cputime.utime);
+
+	do_div(utime, NSEC_PER_USEC);
+
+	stime =	atomic64_read(&bstat.cputime.stime);
+
+	do_div(stime, NSEC_PER_USEC);
+
+	ntime = atomic64_read(&bstat.ntime);
 
-	do_div(bstat.cputime.sum_exec_runtime, NSEC_PER_USEC);
-	do_div(bstat.cputime.utime, NSEC_PER_USEC);
-	do_div(bstat.cputime.stime, NSEC_PER_USEC);
-	do_div(bstat.ntime, NSEC_PER_USEC);
+	do_div(ntime, NSEC_PER_USEC);
 
 	seq_printf(seq, "usage_usec %llu\n"
 			"user_usec %llu\n"
 			"system_usec %llu\n"
 			"nice_usec %llu\n",
-			bstat.cputime.sum_exec_runtime,
-			bstat.cputime.utime,
-			bstat.cputime.stime,
-			bstat.ntime);
+			sum_exec_runtime,
+			utime,
+			stime,
+			ntime);
 
 	cgroup_force_idle_show(seq, &bstat);
 }
-- 
2.49.0


* Re: [PATCH] cgroup/rstat: change cgroup_base_stat to atomic
  2025-06-17 10:26 [PATCH] cgroup/rstat: change cgroup_base_stat to atomic Bertrand Wlodarczyk
@ 2025-06-17 12:38 ` kernel test robot
  2025-06-17 13:16 ` Michal Koutný
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 9+ messages in thread
From: kernel test robot @ 2025-06-17 12:38 UTC (permalink / raw)
  To: Bertrand Wlodarczyk, tj, hannes, mkoutny, cgroups, linux-kernel
  Cc: oe-kbuild-all, Bertrand Wlodarczyk

Hi Bertrand,

kernel test robot noticed the following build warnings:

[auto build test WARNING on tj-cgroup/for-next]
[also build test WARNING on linus/master v6.16-rc2 next-20250617]
[cannot apply to trace/for-next]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Bertrand-Wlodarczyk/cgroup-rstat-change-cgroup_base_stat-to-atomic/20250617-183242
base:   https://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git for-next
patch link:    https://lore.kernel.org/r/20250617102644.752201-2-bertrand.wlodarczyk%40intel.com
patch subject: [PATCH] cgroup/rstat: change cgroup_base_stat to atomic
config: arc-randconfig-001-20250617 (https://download.01.org/0day-ci/archive/20250617/202506172020.kFkGyrEE-lkp@intel.com/config)
compiler: arc-linux-gcc (GCC) 11.5.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250617/202506172020.kFkGyrEE-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202506172020.kFkGyrEE-lkp@intel.com/

All warnings (new ones prefixed by >>):

   kernel/cgroup/rstat.c: In function 'css_rstat_init':
>> kernel/cgroup/rstat.c:415:55: warning: variable 'rstatbc' set but not used [-Wunused-but-set-variable]
     415 |                         struct cgroup_rstat_base_cpu *rstatbc;
         |                                                       ^~~~~~~
   kernel/cgroup/rstat.c: In function 'root_cgroup_cputime':
   kernel/cgroup/rstat.c:629:42: error: passing argument 2 of 'atomic64_add' from incompatible pointer type [-Werror=incompatible-pointer-types]
     629 |                 atomic64_add(sys + user, &cputime->sum_exec_runtime);
         |                                          ^~~~~~~~~~~~~~~~~~~~~~~~~~
         |                                          |
         |                                          atomic_long_t * {aka atomic_t *}
   In file included from include/linux/atomic.h:82,
                    from include/asm-generic/bitops/lock.h:5,
                    from arch/arc/include/asm/bitops.h:188,
                    from include/linux/bitops.h:67,
                    from include/linux/thread_info.h:27,
                    from include/linux/sched.h:14,
                    from include/linux/cgroup.h:12,
                    from kernel/cgroup/cgroup-internal.h:5,
                    from kernel/cgroup/rstat.c:2:
   include/linux/atomic/atomic-instrumented.h:1680:33: note: expected 'atomic64_t *' but argument is of type 'atomic_long_t *' {aka 'atomic_t *'}
    1680 | atomic64_add(s64 i, atomic64_t *v)
         |                     ~~~~~~~~~~~~^
   In file included from ./arch/arc/include/generated/asm/div64.h:1,
                    from include/linux/math.h:6,
                    from include/linux/kernel.h:27,
                    from include/linux/cpumask.h:11,
                    from include/linux/smp.h:13,
                    from include/linux/lockdep.h:14,
                    from include/linux/spinlock.h:63,
                    from include/linux/sched.h:2209,
                    from include/linux/cgroup.h:12,
                    from kernel/cgroup/cgroup-internal.h:5,
                    from kernel/cgroup/rstat.c:2:
   kernel/cgroup/rstat.c: In function 'cgroup_base_stat_cputime_show':
   include/asm-generic/div64.h:183:35: warning: comparison of distinct pointer types lacks a cast
     183 |         (void)(((typeof((n)) *)0) == ((uint64_t *)0));  \
         |                                   ^~
   kernel/cgroup/rstat.c:677:9: note: in expansion of macro 'do_div'
     677 |         do_div(utime, NSEC_PER_USEC);
         |         ^~~~~~
   include/asm-generic/div64.h:183:35: warning: comparison of distinct pointer types lacks a cast
     183 |         (void)(((typeof((n)) *)0) == ((uint64_t *)0));  \
         |                                   ^~
   kernel/cgroup/rstat.c:681:9: note: in expansion of macro 'do_div'
     681 |         do_div(stime, NSEC_PER_USEC);
         |         ^~~~~~
   include/asm-generic/div64.h:183:35: warning: comparison of distinct pointer types lacks a cast
     183 |         (void)(((typeof((n)) *)0) == ((uint64_t *)0));  \
         |                                   ^~
   kernel/cgroup/rstat.c:685:9: note: in expansion of macro 'do_div'
     685 |         do_div(ntime, NSEC_PER_USEC);
         |         ^~~~~~
   cc1: some warnings being treated as errors


vim +/rstatbc +415 kernel/cgroup/rstat.c

a17556f8d9798e Tejun Heo 2018-04-26  380  
a97915559f5c5f JP Kobryn 2025-04-03  381  int css_rstat_init(struct cgroup_subsys_state *css)
a17556f8d9798e Tejun Heo 2018-04-26  382  {
a97915559f5c5f JP Kobryn 2025-04-03  383  	struct cgroup *cgrp = css->cgroup;
a17556f8d9798e Tejun Heo 2018-04-26  384  	int cpu;
5da3bfa029d680 JP Kobryn 2025-05-14  385  	bool is_self = css_is_self(css);
a17556f8d9798e Tejun Heo 2018-04-26  386  
5da3bfa029d680 JP Kobryn 2025-05-14  387  	if (is_self) {
5da3bfa029d680 JP Kobryn 2025-05-14  388  		/* the root cgrp has rstat_base_cpu preallocated */
5da3bfa029d680 JP Kobryn 2025-05-14  389  		if (!cgrp->rstat_base_cpu) {
5da3bfa029d680 JP Kobryn 2025-05-14  390  			cgrp->rstat_base_cpu = alloc_percpu(struct cgroup_rstat_base_cpu);
5da3bfa029d680 JP Kobryn 2025-05-14  391  			if (!cgrp->rstat_base_cpu)
a17556f8d9798e Tejun Heo 2018-04-26  392  				return -ENOMEM;
a17556f8d9798e Tejun Heo 2018-04-26  393  		}
5da3bfa029d680 JP Kobryn 2025-05-14  394  	} else if (css->ss->css_rstat_flush == NULL)
5da3bfa029d680 JP Kobryn 2025-05-14  395  		return 0;
5da3bfa029d680 JP Kobryn 2025-05-14  396  
5da3bfa029d680 JP Kobryn 2025-05-14  397  	/* the root cgrp's self css has rstat_cpu preallocated */
5da3bfa029d680 JP Kobryn 2025-05-14  398  	if (!css->rstat_cpu) {
5da3bfa029d680 JP Kobryn 2025-05-14  399  		css->rstat_cpu = alloc_percpu(struct css_rstat_cpu);
5da3bfa029d680 JP Kobryn 2025-05-14  400  		if (!css->rstat_cpu) {
5da3bfa029d680 JP Kobryn 2025-05-14  401  			if (is_self)
5da3bfa029d680 JP Kobryn 2025-05-14  402  				free_percpu(cgrp->rstat_base_cpu);
a17556f8d9798e Tejun Heo 2018-04-26  403  
f6e9a26e2d488c JP Kobryn 2025-04-03  404  			return -ENOMEM;
f6e9a26e2d488c JP Kobryn 2025-04-03  405  		}
f6e9a26e2d488c JP Kobryn 2025-04-03  406  	}
f6e9a26e2d488c JP Kobryn 2025-04-03  407  
a17556f8d9798e Tejun Heo 2018-04-26  408  	/* ->updated_children list is self terminated */
a17556f8d9798e Tejun Heo 2018-04-26  409  	for_each_possible_cpu(cpu) {
5da3bfa029d680 JP Kobryn 2025-05-14  410  		struct css_rstat_cpu *rstatc = css_rstat_cpu(css, cpu);
5da3bfa029d680 JP Kobryn 2025-05-14  411  
5da3bfa029d680 JP Kobryn 2025-05-14  412  		rstatc->updated_children = css;
a17556f8d9798e Tejun Heo 2018-04-26  413  
5da3bfa029d680 JP Kobryn 2025-05-14  414  		if (is_self) {
5da3bfa029d680 JP Kobryn 2025-05-14 @415  			struct cgroup_rstat_base_cpu *rstatbc;
5da3bfa029d680 JP Kobryn 2025-05-14  416  
5da3bfa029d680 JP Kobryn 2025-05-14  417  			rstatbc = cgroup_rstat_base_cpu(cgrp, cpu);
a17556f8d9798e Tejun Heo 2018-04-26  418  		}
5da3bfa029d680 JP Kobryn 2025-05-14  419  	}
a17556f8d9798e Tejun Heo 2018-04-26  420  
a17556f8d9798e Tejun Heo 2018-04-26  421  	return 0;
a17556f8d9798e Tejun Heo 2018-04-26  422  }
a17556f8d9798e Tejun Heo 2018-04-26  423  

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

* Re: [PATCH] cgroup/rstat: change cgroup_base_stat to atomic
  2025-06-17 10:26 [PATCH] cgroup/rstat: change cgroup_base_stat to atomic Bertrand Wlodarczyk
  2025-06-17 12:38 ` kernel test robot
@ 2025-06-17 13:16 ` Michal Koutný
  2025-06-18 14:31   ` Wlodarczyk, Bertrand
  2025-06-17 13:32 ` kernel test robot
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 9+ messages in thread
From: Michal Koutný @ 2025-06-17 13:16 UTC (permalink / raw)
  To: Bertrand Wlodarczyk; +Cc: tj, hannes, cgroups, linux-kernel, JP Kobryn

Hello.

This is interesting.

On Tue, Jun 17, 2025 at 12:26:45PM +0200, Bertrand Wlodarczyk <bertrand.wlodarczyk@intel.com> wrote:
> The kernel currently faces scalability issues when multiple userspace
> programs attempt to read cgroup statistics concurrently.

Does "currently" mean that you didn't observe this before per-subsys
split?

> The primary bottleneck is the css_cgroup_lock in cgroup_rstat_flush,
> which prevents access and updates to the statistics
> of the css from multiple CPUs in parallel.

I think this description needs some refresh on top of the current
mainline (at least after the commit 748922dcfabdd ("cgroup: use
subsystem-specific rstat locks to avoid contention") to be clear which
lock (and locking functions) is apparently contentious.

> Given that rstat operates on a per-CPU basis and only aggregates
> statistics in the parent cgroup, there is no compelling reason
> why these statistics cannot be atomic.
> By eliminating the lock, each CPU can traverse its rstat hierarchy
> independently, without blocking. Synchronization is achieved during
> parent propagation through atomic operations.
> 
> This change significantly enhances performance in scenarios
> where multiple CPUs access CPU rstat within a single cgroup hierarchy,
> yielding a performance improvement of around 50 times compared
> to the mainline version.

Given the recent changes and to be on the same page, it'd be better if
you referred to particular commits/tags when benchmarking so that it's
unambiguous what is compared with what.

> Notably, performance for memory and I/O rstats remains unchanged,
> as these are managed in separate submodules.
 

> Additionally, this patch addresses a race condition detectable
> in the current mainline by KCSAN in __cgroup_account_cputime,
> which occurs when attempting to read a single hierarchy
> from multiple CPUs.

Could you please extract this fix and send it separately?

Thanks,
Michal


* Re: [PATCH] cgroup/rstat: change cgroup_base_stat to atomic
  2025-06-17 10:26 [PATCH] cgroup/rstat: change cgroup_base_stat to atomic Bertrand Wlodarczyk
  2025-06-17 12:38 ` kernel test robot
  2025-06-17 13:16 ` Michal Koutný
@ 2025-06-17 13:32 ` kernel test robot
  2025-06-17 21:05 ` JP Kobryn
  2025-06-18 17:21 ` Shakeel Butt
  4 siblings, 0 replies; 9+ messages in thread
From: kernel test robot @ 2025-06-17 13:32 UTC (permalink / raw)
  To: Bertrand Wlodarczyk, tj, hannes, mkoutny, cgroups, linux-kernel
  Cc: oe-kbuild-all, Bertrand Wlodarczyk

Hi Bertrand,

kernel test robot noticed the following build errors:

[auto build test ERROR on tj-cgroup/for-next]
[also build test ERROR on linus/master v6.16-rc2 next-20250617]
[cannot apply to trace/for-next]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Bertrand-Wlodarczyk/cgroup-rstat-change-cgroup_base_stat-to-atomic/20250617-183242
base:   https://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git for-next
patch link:    https://lore.kernel.org/r/20250617102644.752201-2-bertrand.wlodarczyk%40intel.com
patch subject: [PATCH] cgroup/rstat: change cgroup_base_stat to atomic
config: arc-randconfig-001-20250617 (https://download.01.org/0day-ci/archive/20250617/202506172158.OEJ5qjT6-lkp@intel.com/config)
compiler: arc-linux-gcc (GCC) 11.5.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250617/202506172158.OEJ5qjT6-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202506172158.OEJ5qjT6-lkp@intel.com/

All errors (new ones prefixed by >>):

   kernel/cgroup/rstat.c: In function 'css_rstat_init':
   kernel/cgroup/rstat.c:415:55: warning: variable 'rstatbc' set but not used [-Wunused-but-set-variable]
     415 |                         struct cgroup_rstat_base_cpu *rstatbc;
         |                                                       ^~~~~~~
   kernel/cgroup/rstat.c: In function 'root_cgroup_cputime':
>> kernel/cgroup/rstat.c:629:42: error: passing argument 2 of 'atomic64_add' from incompatible pointer type [-Werror=incompatible-pointer-types]
     629 |                 atomic64_add(sys + user, &cputime->sum_exec_runtime);
         |                                          ^~~~~~~~~~~~~~~~~~~~~~~~~~
         |                                          |
         |                                          atomic_long_t * {aka atomic_t *}
   In file included from include/linux/atomic.h:82,
                    from include/asm-generic/bitops/lock.h:5,
                    from arch/arc/include/asm/bitops.h:188,
                    from include/linux/bitops.h:67,
                    from include/linux/thread_info.h:27,
                    from include/linux/sched.h:14,
                    from include/linux/cgroup.h:12,
                    from kernel/cgroup/cgroup-internal.h:5,
                    from kernel/cgroup/rstat.c:2:
   include/linux/atomic/atomic-instrumented.h:1680:33: note: expected 'atomic64_t *' but argument is of type 'atomic_long_t *' {aka 'atomic_t *'}
    1680 | atomic64_add(s64 i, atomic64_t *v)
         |                     ~~~~~~~~~~~~^
   In file included from ./arch/arc/include/generated/asm/div64.h:1,
                    from include/linux/math.h:6,
                    from include/linux/kernel.h:27,
                    from include/linux/cpumask.h:11,
                    from include/linux/smp.h:13,
                    from include/linux/lockdep.h:14,
                    from include/linux/spinlock.h:63,
                    from include/linux/sched.h:2209,
                    from include/linux/cgroup.h:12,
                    from kernel/cgroup/cgroup-internal.h:5,
                    from kernel/cgroup/rstat.c:2:
   kernel/cgroup/rstat.c: In function 'cgroup_base_stat_cputime_show':
   include/asm-generic/div64.h:183:35: warning: comparison of distinct pointer types lacks a cast
     183 |         (void)(((typeof((n)) *)0) == ((uint64_t *)0));  \
         |                                   ^~
   kernel/cgroup/rstat.c:677:9: note: in expansion of macro 'do_div'
     677 |         do_div(utime, NSEC_PER_USEC);
         |         ^~~~~~
   include/asm-generic/div64.h:183:35: warning: comparison of distinct pointer types lacks a cast
     183 |         (void)(((typeof((n)) *)0) == ((uint64_t *)0));  \
         |                                   ^~
   kernel/cgroup/rstat.c:681:9: note: in expansion of macro 'do_div'
     681 |         do_div(stime, NSEC_PER_USEC);
         |         ^~~~~~
   include/asm-generic/div64.h:183:35: warning: comparison of distinct pointer types lacks a cast
     183 |         (void)(((typeof((n)) *)0) == ((uint64_t *)0));  \
         |                                   ^~
   kernel/cgroup/rstat.c:685:9: note: in expansion of macro 'do_div'
     685 |         do_div(ntime, NSEC_PER_USEC);
         |         ^~~~~~
   cc1: some warnings being treated as errors


vim +/atomic64_add +629 kernel/cgroup/rstat.c

   599	
   600	/*
   601	 * compute the cputime for the root cgroup by getting the per cpu data
   602	 * at a global level, then categorizing the fields in a manner consistent
   603	 * with how it is done by __cgroup_account_cputime_field for each bit of
   604	 * cpu time attributed to a cgroup.
   605	 */
   606	static void root_cgroup_cputime(struct cgroup_base_stat *bstat)
   607	{
   608		struct atomic_task_cputime *cputime = &bstat->cputime;
   609		int i;
   610	
   611		memset(bstat, 0, sizeof(*bstat));
   612		for_each_possible_cpu(i) {
   613			struct kernel_cpustat kcpustat;
   614			u64 *cpustat = kcpustat.cpustat;
   615			u64 user = 0;
   616			u64 sys = 0;
   617	
   618			kcpustat_cpu_fetch(&kcpustat, i);
   619	
   620			user += cpustat[CPUTIME_USER];
   621			user += cpustat[CPUTIME_NICE];
   622			atomic64_add(user, &cputime->utime);
   623	
   624			sys += cpustat[CPUTIME_SYSTEM];
   625			sys += cpustat[CPUTIME_IRQ];
   626			sys += cpustat[CPUTIME_SOFTIRQ];
   627			atomic64_add(sys, &cputime->stime);
   628	
 > 629			atomic64_add(sys + user, &cputime->sum_exec_runtime);
   630	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

* Re: [PATCH] cgroup/rstat: change cgroup_base_stat to atomic
  2025-06-17 10:26 [PATCH] cgroup/rstat: change cgroup_base_stat to atomic Bertrand Wlodarczyk
                   ` (2 preceding siblings ...)
  2025-06-17 13:32 ` kernel test robot
@ 2025-06-17 21:05 ` JP Kobryn
  2025-06-18 16:05   ` Wlodarczyk, Bertrand
  2025-06-18 17:21 ` Shakeel Butt
  4 siblings, 1 reply; 9+ messages in thread
From: JP Kobryn @ 2025-06-17 21:05 UTC (permalink / raw)
  To: Bertrand Wlodarczyk, tj, hannes, mkoutny, cgroups, linux-kernel

On 6/17/25 3:26 AM, Bertrand Wlodarczyk wrote:
> The kernel currently faces scalability issues when multiple userspace
> programs attempt to read cgroup statistics concurrently.
> 
> The primary bottleneck is the css_cgroup_lock in cgroup_rstat_flush,
> which prevents access and updates to the statistics
> of the css from multiple CPUs in parallel.
> 
> Given that rstat operates on a per-CPU basis and only aggregates
> statistics in the parent cgroup, there is no compelling reason
> why these statistics cannot be atomic.

Have you considered the "tearing" that will occur when writes and reads
are happening in parallel?
The existing state is more of a snapshot approach. Changing the fields
involved to atomic and lockless reading/writing can result in
inconsistent values, i.e. fieldA might be more current than fieldB.
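
(For illustration only, a tiny userspace sketch rather than kernel code:
two counters updated back to back with relaxed atomics; a reader that
loads them without any lock can observe the first add but not the second,
so the pair is not a consistent snapshot even though each individual load
is atomic.)

#include <stdatomic.h>
#include <stdint.h>

struct pair {
	_Atomic uint64_t utime;
	_Atomic uint64_t stime;
};

static void writer_tick(struct pair *p, uint64_t delta)
{
	/* Two independent atomic adds: nothing ties the fields together. */
	atomic_fetch_add_explicit(&p->utime, delta, memory_order_relaxed);
	atomic_fetch_add_explicit(&p->stime, delta, memory_order_relaxed);
}

static void reader_snapshot(struct pair *p, uint64_t *u, uint64_t *s)
{
	/* May run between the two adds above and see utime ahead of stime. */
	*u = atomic_load_explicit(&p->utime, memory_order_relaxed);
	*s = atomic_load_explicit(&p->stime, memory_order_relaxed);
}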

> By eliminating the lock, each CPU can traverse its rstat hierarchy
> independently, without blocking. Synchronization is achieved during
> parent propagation through atomic operations.

Even if the tearing scenario mentioned above is acceptable, removing
the lock will break synchronization of flushing non-base stat
subsystems.

> 
> This change significantly enhances performance in scenarios
> where multiple CPUs access CPU rstat within a single cgroup hierarchy,
> yielding a performance improvement of around 50 times compared
> to the mainline version.
> Notably, performance for memory and I/O rstats remains unchanged,
> as these are managed in separate submodules.
> 
> Additionally, this patch addresses a race condition detectable
> in the current mainline by KCSAN in __cgroup_account_cputime,
> which occurs when attempting to read a single hierarchy
> from multiple CPUs.
> 
> Signed-off-by: Bertrand Wlodarczyk <bertrand.wlodarczyk@intel.com>
> ---
> Benchmark code: https://gist.github.com/bwlodarcz/c955b36b5667f0167dffcff23953d1da
> 
> Tested on Intel(R) Xeon(R) Platinum 8468V, 2s 48c 2tpc, 377GiB RAM, Fedora 41:
> +--------+-------+
> |Mainline|Patched|
> +--------+-------+
> |369.95s |6.52s  |
> +--------+-------+
> 
[..]
> @@ -820,7 +813,6 @@ struct cgroup_subsys {
>   	 */
>   	unsigned int depends_on;
>   
> -	spinlock_t rstat_ss_lock;

This lock is not used with base stats. The base stats are not a formal
subsystem.

[..]
> diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
> index cbeaa499a96a..36af2b883440 100644
> --- a/kernel/cgroup/rstat.c
> +++ b/kernel/cgroup/rstat.c
> @@ -9,7 +9,6 @@
>   
>   #include <trace/events/cgroup.h>
>   
> -static DEFINE_SPINLOCK(rstat_base_lock);
>   static DEFINE_PER_CPU(raw_spinlock_t, rstat_base_cpu_lock);
>   
>   static void cgroup_base_stat_flush(struct cgroup *cgrp, int cpu);
> @@ -37,14 +36,6 @@ static struct cgroup_rstat_base_cpu *cgroup_rstat_base_cpu(
>   	return per_cpu_ptr(cgrp->rstat_base_cpu, cpu);
>   }
>   
> -static spinlock_t *ss_rstat_lock(struct cgroup_subsys *ss)
> -{
> -	if (ss)
> -		return &ss->rstat_ss_lock;

This was needed for non-base stat subsystems like memory and io.

> -
> -	return &rstat_base_lock;
> -}

* RE: [PATCH] cgroup/rstat: change cgroup_base_stat to atomic
  2025-06-17 13:16 ` Michal Koutný
@ 2025-06-18 14:31   ` Wlodarczyk, Bertrand
  0 siblings, 0 replies; 9+ messages in thread
From: Wlodarczyk, Bertrand @ 2025-06-18 14:31 UTC (permalink / raw)
  To: Michal Koutný
  Cc: tj@kernel.org, hannes@cmpxchg.org, cgroups@vger.kernel.org,
	linux-kernel@vger.kernel.org, JP Kobryn

Thank you for your time and answer. 

> The kernel currently faces scalability issues when multiple userspace 
> programs attempt to read cgroup statistics concurrently.

> Does "currently" mean that you didn't observe this before per-subsys split?

"Currently" means the current mainline, e04c78d86a9699d1 (Linux 6.16-rc2). It was observed by our customer around winter this year and it is still present.
I believe it has been present for as long as the lock has existed.

> The primary bottleneck is the css_cgroup_lock in cgroup_rstat_flush, 
> which prevents access and updates to the statistics of the css from 
> multiple CPUs in parallel.

> I think this description needs some refresh on top of the current mainline (at least after the commit 748922dcfabdd ("cgroup: use subsystem-specific rstat locks to avoid contention") to be clear which lock (and locking functions) is apparently contentious.

The main culprit is the css_cgroup_lock taken in cgroup_rstat_flush. It locks the css even though the main algorithm operates mostly on per-CPU data.
Only propagation to the parent needs synchronization, and only if the data isn't atomic.
The benchmark results were gathered with commit 748922dcfabdd applied, on top of commit e04c78d86a9699d1 (Linux 6.16-rc2).

> Notably, performance for memory and I/O rstats remains unchanged, as 
> these are managed in separate submodules.
 

> Additionally, this patch addresses a race condition detectable in the 
> current mainline by KCSAN in __cgroup_account_cputime, which occurs 
> when attempting to read a single hierarchy from multiple CPUs.

> Could you please extract this fix and send it separately?

Unfortunately, I don't have a separate fix. My primary objective was to resolve the performance bottleneck during rstat access for a customer. I found the race condition by accident after running my benchmark (provided in the gist) with KCSAN enabled; I didn't investigate the race on its own.
 
Thanks,
Bertrand

* RE: [PATCH] cgroup/rstat: change cgroup_base_stat to atomic
  2025-06-17 21:05 ` JP Kobryn
@ 2025-06-18 16:05   ` Wlodarczyk, Bertrand
  2025-06-18 17:17     ` Shakeel Butt
  0 siblings, 1 reply; 9+ messages in thread
From: Wlodarczyk, Bertrand @ 2025-06-18 16:05 UTC (permalink / raw)
  To: JP Kobryn
  Cc: tj@kernel.org, hannes@cmpxchg.org, mkoutny@suse.com,
	cgroups@vger.kernel.org, linux-kernel@vger.kernel.org

Thank you for your time and review. 

> The kernel currently faces scalability issues when multiple userspace 
> programs attempt to read cgroup statistics concurrently.
> 
> The primary bottleneck is the css_cgroup_lock in cgroup_rstat_flush, 
> which prevents access and updates to the statistics of the css from 
> multiple CPUs in parallel.
> 
> Given that rstat operates on a per-CPU basis and only aggregates 
> statistics in the parent cgroup, there is no compelling reason why 
> these statistics cannot be atomic.

> Have you considered the "tearing" that will occur when writes and reads are happening in parallel?
> The existing state is more of a snapshot approach. Changing the fields involved to atomic and lockless reading/writing can result in inconsistent values, i.e. fieldA might be more current than fieldB.

Yes, I considered it. In our scenario, "tearing" means that some fields carry newer data than the last "snapshot" and the output no longer adds up to
sum_exec_runtime. The "snapshot" approach already implies that these stats don't need to be perfectly precise, because we accept outdated state.
I don't consider the tearing issue very serious here.
Additionally, I'm sure tearing can happen, but I didn't observe it during the benchmark runs.
That suggests it is either a rare occurrence or that I didn't trigger the right conditions to expose it.

When multiple CPUs try to access the rstat data, the slowdown caused by css_base_lock is so massive that the atomic change is preferable.
It's better to get an occasionally torn result than to spin CPUs on the lock and disable IRQs for such a long time.
 

> By eliminating the lock, each CPU can traverse its rstat hierarchy 
> independently, without blocking. Synchronization is achieved during 
> parent propagation through atomic operations.

> Even if the tearing scenario mentioned above is acceptable, removing the lock will break synchronization of flushing non-base stat subsystems.

> -static spinlock_t *ss_rstat_lock(struct cgroup_subsys *ss) -{
> -	if (ss)
> -		return &ss->rstat_ss_lock;

> This was needed for non-base stat subsystems like memory and io.

From what I could find in the code, the flush callback taking css and cpu arguments is implemented in only two subsystems: memory and io.
Both memory and io have their own locks for this.
I tested the benchmark (provided in the gist) against memory.stat and io.stat hierarchies.
In both cases KCSAN reports no issues, and performance is unchanged in comparison to commit
e04c78d86a9699d1 (Linux 6.16-rc2).
For CPU stats the performance is much better after the patch.

I can't find a scenario where the lack of these locks triggers any issues.
Maybe you can help me with that?

Thanks,
Bertrand


* Re: [PATCH] cgroup/rstat: change cgroup_base_stat to atomic
  2025-06-18 16:05   ` Wlodarczyk, Bertrand
@ 2025-06-18 17:17     ` Shakeel Butt
  0 siblings, 0 replies; 9+ messages in thread
From: Shakeel Butt @ 2025-06-18 17:17 UTC (permalink / raw)
  To: Wlodarczyk, Bertrand
  Cc: JP Kobryn, tj@kernel.org, hannes@cmpxchg.org, mkoutny@suse.com,
	cgroups@vger.kernel.org, linux-kernel@vger.kernel.org

On Wed, Jun 18, 2025 at 04:05:45PM +0000, Wlodarczyk, Bertrand wrote:
> Thank you for your time and review. 
> 
> > The kernel currently faces scalability issues when multiple userspace 
> > programs attempt to read cgroup statistics concurrently.
> > 
> > The primary bottleneck is the css_cgroup_lock in cgroup_rstat_flush, 
> > which prevents access and updates to the statistics of the css from 
> > multiple CPUs in parallel.
> > 
> > Given that rstat operates on a per-CPU basis and only aggregates 
> > statistics in the parent cgroup, there is no compelling reason why 
> > these statistics cannot be atomic.
> 
> > Have you considered the "tearing" that will occur when writes and reads are happening in parallel?
> > The existing state is more of a snapshot approach. Changing the fields involved to atomic and lockless reading/writing can result in inconsistent values, i.e. fieldA might be more current than fieldB.
> 
> Yes, I considered it. In our scenario, "tearing" means that we have the newer data then the last "snapshot" and the output doesn't sum up in
> sum_exec_runtime. The "snapshot" suggests that these stats not need to be super precise because we're accepting outdated state.
> I'm not considering "tearing" issue as very bad here. 
> Additionally, I'm sure that the "tearing" can happen but I didn't observe them during the benchmark runs.
> That suggests that's a rare occurrence or I didn't trigger the right condition to expose the issue.
> 
> When multiple cpus tries to access the rstat the slowdown caused by css_base_lock is so massive that atomic change is preferable.
>  It's better to get even "teared" result than spinlock cpus and disable irq for such long time. 
>  
> 
> > By eliminating the lock, each CPU can traverse its rstat hierarchy 
> > independently, without blocking. Synchronization is achieved during 
> > parent propagation through atomic operations.
> 
> > Even if the tearing scenario mentioned above is acceptable, removing the lock will break synchronization of flushing non-base stat subsystems.
> 
> > -static spinlock_t *ss_rstat_lock(struct cgroup_subsys *ss) -{
> > -	if (ss)
> > -		return &ss->rstat_ss_lock;
> 
> > This was needed for non-base stat subsystems like memory and io.
> 
> From what I could find in code the flush with css and cpu arguments is implemented only in two subsystems: memory and io.
> Both memory and io have its own locks for them.
> I tested the benchmark (provided in gist) with memory.stat and io.stat hierarchies.
> In both cases the KCSAN doesn't have any issues, and performance is unchanged in comparison to the commit
> e04c78d86a9699d1 (Linux 6.16-rc2).
> For cpu stats the performance is much better after patch. 
> 
> I can't find a scenario when lack of this locks triggering some kind of issues.
> Maybe you can help me with that?

Please look at *_local and *_pending fields in
mem_cgroup_css_rstat_flush().

* Re: [PATCH] cgroup/rstat: change cgroup_base_stat to atomic
  2025-06-17 10:26 [PATCH] cgroup/rstat: change cgroup_base_stat to atomic Bertrand Wlodarczyk
                   ` (3 preceding siblings ...)
  2025-06-17 21:05 ` JP Kobryn
@ 2025-06-18 17:21 ` Shakeel Butt
  4 siblings, 0 replies; 9+ messages in thread
From: Shakeel Butt @ 2025-06-18 17:21 UTC (permalink / raw)
  To: Bertrand Wlodarczyk; +Cc: tj, hannes, mkoutny, cgroups, linux-kernel

On Tue, Jun 17, 2025 at 12:26:45PM +0200, Bertrand Wlodarczyk wrote:
> The kernel currently faces scalability issues when multiple userspace
> programs attempt to read cgroup statistics concurrently.
> 
> The primary bottleneck is the css_cgroup_lock in cgroup_rstat_flush,
> which prevents access and updates to the statistics
> of the css from multiple CPUs in parallel.
> 
> Given that rstat operates on a per-CPU basis and only aggregates
> statistics in the parent cgroup, there is no compelling reason
> why these statistics cannot be atomic.
> By eliminating the lock, each CPU can traverse its rstat hierarchy
> independently, without blocking. Synchronization is achieved during
> parent propagation through atomic operations.
> 
> This change significantly enhances performance in scenarios
> where multiple CPUs access CPU rstat within a single cgroup hierarchy,
> yielding a performance improvement of around 50 times compared
> to the mainline version.
> Notably, performance for memory and I/O rstats remains unchanged,
> as these are managed in separate submodules.
> 
> Additionally, this patch addresses a race condition detectable
> in the current mainline by KCSAN in __cgroup_account_cputime,
> which occurs when attempting to read a single hierarchy
> from multiple CPUs.
> 
> Signed-off-by: Bertrand Wlodarczyk <bertrand.wlodarczyk@intel.com>

Can you please rebase your change over for-6.17 branch of the cgroup
tree and resend?
