* [PATCH v2] cgroup/rstat: change cgroup_base_stat to atomic
@ 2025-06-24 14:45 Bertrand Wlodarczyk
2025-06-26 19:15 ` Shakeel Butt
0 siblings, 1 reply; 11+ messages in thread
From: Bertrand Wlodarczyk @ 2025-06-24 14:45 UTC (permalink / raw)
To: tj, hannes, mkoutny, cgroups, linux-kernel
Cc: shakeel.butt, inwardvessel, Bertrand Wlodarczyk
The kernel faces scalability issues when multiple userspace
programs attempt to read cgroup statistics concurrently.
The primary bottleneck is the css_cgroup_lock in cgroup_rstat_flush,
which prevents access and updates to the statistics
of the css from multiple CPUs in parallel.
Given that rstat operates on a per-CPU basis and only aggregates
statistics in the parent cgroup, there is no compelling reason
why these statistics cannot be atomic.
By eliminating the lock during CPU statistics access,
each CPU can traverse its rstat hierarchy independently, without blocking.
Synchronization is achieved during parent propagation through
atomic operations.
This change significantly enhances performance on top of commit
8dcb0ed834a3ec03 ("memcg: cgroup: call css_rstat_updated irrespective of in_nmi()")
in scenarios where multiple CPUs access CPU rstat within a
single cgroup hierarchy, yielding a performance improvement of around 40 times.
Notably, performance for memory and I/O rstats remains unchanged,
as the lock remains in place for these usages.
Additionally, this patch addresses a race condition detectable
in the current mainline by KCSAN in __cgroup_account_cputime,
which occurs when attempting to read a single hierarchy
from multiple CPUs.
Signed-off-by: Bertrand Wlodarczyk <bertrand.wlodarczyk@intel.com>
---
Changes from v1:
* reverted removal of rstat_ss_lock/rstat_base_lock
* ported the patch onto the for-6.17 branch
* surrounded the pos->ss->css_rstat_flush(pos, cpu) call with
  css_rstat_lock to address concerns about a potential
  race condition when calling the io or memory rstat flush
* fixed issues reported by lkp
Benchmark code: https://gist.github.com/bwlodarcz/c955b36b5667f0167dffcff23953d1da
On branch for-6.17 (commit 8dcb0ed834a3ec03) the __cgroup_account_cputime
KCSAN error is still present:
[ 734.729731] ==================================================================
[ 734.731942] BUG: KCSAN: data-race in css_rstat_flush / css_rstat_updated
[ 734.733587]
[ 734.734108] write to 0xffd1fffffee52090 of 8 bytes by task 1144 on cpu 57:
[ 734.735678] css_rstat_flush+0x1b0/0xed0
[ 734.736657] cgroup_base_stat_cputime_show+0x96/0x2f0
[ 734.737971] cpu_stat_show+0x14/0x1a0
[ 734.738945] cgroup_seqfile_show+0xb0/0x150
[ 734.739947] kernfs_seq_show+0x93/0xb0
[ 734.740838] seq_read_iter+0x190/0x7d0
[ 734.741913] kernfs_fop_read_iter+0x23b/0x290
[ 734.742949] vfs_read+0x46b/0x5a0
[ 734.743750] ksys_read+0xa5/0x130
[ 734.744625] __x64_sys_read+0x3c/0x50
[ 734.745626] x64_sys_call+0x19e1/0x1c10
[ 734.746657] do_syscall_64+0xa2/0x200
[ 734.747679] entry_SYSCALL_64_after_hwframe+0x77/0x7f
[ 734.749021]
[ 734.749441] read to 0xffd1fffffee52090 of 8 bytes by interrupt on cpu 41:
[ 734.750996] css_rstat_updated+0x8f/0x1a0
[ 734.751943] __cgroup_account_cputime+0x5d/0x90
[ 734.752985] update_curr+0x1bd/0x260
[ 734.753837] task_tick_fair+0x3b/0x130
[ 734.754749] sched_tick+0xa1/0x220
[ 734.755705] update_process_times+0x97/0xd0
[ 734.756739] tick_nohz_handler+0xfc/0x220
[ 734.757845] __hrtimer_run_queues+0x2a3/0x4b0
[ 734.758954] hrtimer_interrupt+0x1c6/0x3a0
[ 734.759974] __sysvec_apic_timer_interrupt+0x62/0x180
[ 734.761195] sysvec_apic_timer_interrupt+0x6b/0x80
[ 734.762372] asm_sysvec_apic_timer_interrupt+0x1a/0x20
[ 734.763663] _raw_spin_unlock_irq+0x18/0x30
[ 734.764737] css_rstat_flush+0x5cd/0xed0
[ 734.765776] cgroup_base_stat_cputime_show+0x96/0x2f0
[ 734.766964] cpu_stat_show+0x14/0x1a0
[ 734.767827] cgroup_seqfile_show+0xb0/0x150
[ 734.768942] kernfs_seq_show+0x93/0xb0
[ 734.769842] seq_read_iter+0x190/0x7d0
[ 734.770801] kernfs_fop_read_iter+0x23b/0x290
[ 734.771972] vfs_read+0x46b/0x5a0
[ 734.772777] ksys_read+0xa5/0x130
[ 734.773664] __x64_sys_read+0x3c/0x50
[ 734.774714] x64_sys_call+0x19e1/0x1c10
[ 734.775727] do_syscall_64+0xa2/0x200
[ 734.776682] entry_SYSCALL_64_after_hwframe+0x77/0x7f
[ 734.777952]
[ 734.778393] value changed: 0x0000000000000000 -> 0xffd1fffffee52090
[ 734.779942]
[ 734.780375] Reported by Kernel Concurrency Sanitizer on:
[ 734.781662] CPU: 41 UID: 0 PID: 1128 Comm: benchmark Not tainted 6.15.0-g633e6bad3124 #7 PREEMPT(voluntary)
[ 734.784115] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-3.fc41 04/01/2014
[ 734.786279] ==================================================================
[ 738.533150] ==================================================================
[ 738.534820] BUG: KCSAN: data-race in __cgroup_account_cputime / css_rstat_flush
[ 738.536444]
[ 738.536891] write to 0xffd1fffffe512010 of 8 bytes by interrupt on cpu 4:
[ 738.538311] __cgroup_account_cputime+0x4a/0x90
[ 738.539311] update_curr+0x1bd/0x260
[ 738.540145] task_tick_fair+0x3b/0x130
[ 738.540961] sched_tick+0xa1/0x220
[ 738.541709] update_process_times+0x97/0xd0
[ 738.542642] tick_nohz_handler+0xfc/0x220
[ 738.543559] __hrtimer_run_queues+0x2a3/0x4b0
[ 738.544562] hrtimer_interrupt+0x1c6/0x3a0
[ 738.545480] __sysvec_apic_timer_interrupt+0x62/0x180
[ 738.546586] sysvec_apic_timer_interrupt+0x6b/0x80
[ 738.547660] asm_sysvec_apic_timer_interrupt+0x1a/0x20
[ 738.548782] _raw_spin_unlock_irq+0x18/0x30
[ 738.549782] css_rstat_flush+0x5cd/0xed0
[ 738.550662] cgroup_base_stat_cputime_show+0x96/0x2f0
[ 738.551757] cpu_stat_show+0x14/0x1a0
[ 738.552619] cgroup_seqfile_show+0xb0/0x150
[ 738.553552] kernfs_seq_show+0x93/0xb0
[ 738.554407] seq_read_iter+0x190/0x7d0
[ 738.555268] kernfs_fop_read_iter+0x23b/0x290
[ 738.556247] vfs_read+0x46b/0x5a0
[ 738.556983] ksys_read+0xa5/0x130
[ 738.557708] __x64_sys_read+0x3c/0x50
[ 738.558550] x64_sys_call+0x19e1/0x1c10
[ 738.559418] do_syscall_64+0xa2/0x200
[ 738.560258] entry_SYSCALL_64_after_hwframe+0x77/0x7f
[ 738.561356]
[ 738.561760] read to 0xffd1fffffe512010 of 8 bytes by task 1073 on cpu 85:
[ 738.563221] css_rstat_flush+0x717/0xed0
[ 738.564106] cgroup_base_stat_cputime_show+0x96/0x2f0
[ 738.565202] cpu_stat_show+0x14/0x1a0
[ 738.571702] cgroup_seqfile_show+0xb0/0x150
[ 738.572661] kernfs_seq_show+0x93/0xb0
[ 738.573512] seq_read_iter+0x190/0x7d0
[ 738.574395] kernfs_fop_read_iter+0x23b/0x290
[ 738.575373] vfs_read+0x46b/0x5a0
[ 738.576161] ksys_read+0xa5/0x130
[ 738.576926] __x64_sys_read+0x3c/0x50
[ 738.577718] x64_sys_call+0x19e1/0x1c10
[ 738.578611] do_syscall_64+0xa2/0x200
[ 738.579454] entry_SYSCALL_64_after_hwframe+0x77/0x7f
[ 738.580547]
[ 738.580944] value changed: 0x0000003d5fa31114 -> 0x0000003d5fb77378
[ 738.582262]
[ 738.582657] Reported by Kernel Concurrency Sanitizer on:
[ 738.583791] CPU: 85 UID: 0 PID: 1073 Comm: benchmark Not tainted 6.15.0-g633e6bad3124 #7 PREEMPT(voluntary)
[ 738.585919] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-3.fc41 04/01/2014
[ 738.587725] ==================================================================
Benchmark results on QEMU:
+--------------+---------+---------+
|   mean (s)   |8dcb0ed8 | patched |
+--------------+---------+---------+
|cpu, KCSAN on |borked*  |4.52     |
+--------------+---------+---------+
|cpu, KCSAN off|87.32    |2.21     |
+--------------+---------+---------+
|memory        |2.17     |2.23     |
+--------------+---------+---------+
* the run was spammed by KCSAN errors, so the result is inconclusive
ext4 raw image with debian:
qemu-system-x86_64 -enable-kvm -cpu host -smp 102 -m 16G -kernel linux-cgroup/arch/x86/boot/bzImage -drive file=rootfs.ext4,if=virtio,format=raw -append "rootwait root=/dev/vda console=tty1 console=ttyS0 nokaslr cgroup_enable=memory cgroup_memory=1" -net nic,model=virtio -net user -nographic
During the runs, KCSAN detected a bug which seems unrelated to this change:
[ 105.806835] ==================================================================
[ 105.810667] BUG: KCSAN: data-race in _find_next_bit+0x37/0xb0
[ 105.812128]
[ 105.812531] race at unknown origin, with read to 0xffffffff8444b080 of 8 bytes by interrupt on cpu 0:
[ 105.814616] _find_next_bit+0x37/0xb0
[ 105.815495] _nohz_idle_balance.isra.0+0x10e/0x360
[ 105.816599] handle_softirqs+0xcd/0x280
[ 105.817500] irq_exit_rcu+0x89/0xb0
[ 105.818330] sysvec_call_function_single+0x6b/0x80
[ 105.819400] asm_sysvec_call_function_single+0x1a/0x20
[ 105.820561] pv_native_safe_halt+0xf/0x20
[ 105.821475] default_idle+0x9/0x10
[ 105.822287] default_idle_call+0x2b/0x100
[ 105.823214] do_idle+0x1ca/0x230
[ 105.823964] cpu_startup_entry+0x24/0x30
[ 105.824826] rest_init+0x101/0x110
[ 105.825600] start_kernel+0x95e/0x960
[ 105.826500] x86_64_start_reservations+0x24/0x30
[ 105.827540] x86_64_start_kernel+0xc5/0xd0
[ 105.828465] common_startup_64+0x13e/0x148
[ 105.829408]
[ 105.829838] value changed: 0xf99dad734a1de4fd -> 0xf99fad734a1de4fd
[ 105.831191]
[ 105.831586] Reported by Kernel Concurrency Sanitizer on:
[ 105.832775] CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Not tainted 6.15.0-gd5b739bb486e #10 PREEMPT(voluntary)
[ 105.834791] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-3.fc41 04/01/2014
[ 105.836624] ==================================================================
---
include/linux/cgroup-defs.h | 29 ++---
include/linux/sched/types.h | 6 ++
kernel/cgroup/rstat.c | 205 ++++++++++++++++++------------------
3 files changed, 119 insertions(+), 121 deletions(-)
diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 6b93a64115fe..c5d883be422c 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -233,16 +233,6 @@ struct cgroup_subsys_state {
* Protected by cgroup_mutex.
*/
int nr_descendants;
-
- /*
- * A singly-linked list of css structures to be rstat flushed.
- * This is a scratch field to be used exclusively by
- * css_rstat_flush().
- *
- * Protected by rstat_base_lock when css is cgroup::self.
- * Protected by css->ss->rstat_ss_lock otherwise.
- */
- struct cgroup_subsys_state *rstat_flush_next;
};
/*
@@ -343,12 +333,12 @@ struct css_set {
};
struct cgroup_base_stat {
- struct task_cputime cputime;
+ struct atomic_task_cputime cputime;
#ifdef CONFIG_SCHED_CORE
- u64 forceidle_sum;
+ atomic64_t forceidle_sum;
#endif
- u64 ntime;
+ atomic64_t ntime;
};
/*
@@ -378,6 +368,14 @@ struct css_rstat_cpu {
*/
struct cgroup_subsys_state *updated_children;
struct cgroup_subsys_state *updated_next; /* NULL if not on the list */
+ /*
+ * A singly-linked list of css structures to be rstat flushed.
+ * This is a scratch field to be used exclusively by
+ * css_rstat_flush().
+ *
+ * Protected by per-cpu css->ss->rstat_ss_cpu_lock.
+ */
+ struct cgroup_subsys_state *rstat_flush_next;
struct llist_node lnode; /* lockless list for update */
struct cgroup_subsys_state *owner; /* back pointer */
@@ -388,11 +386,6 @@ struct css_rstat_cpu {
* top of it - bsync, bstat and last_bstat.
*/
struct cgroup_rstat_base_cpu {
- /*
- * ->bsync protects ->bstat. These are the only fields which get
- * updated in the hot path.
- */
- struct u64_stats_sync bsync;
struct cgroup_base_stat bstat;
/*
diff --git a/include/linux/sched/types.h b/include/linux/sched/types.h
index 969aaf5ef9d6..dab61f7099f6 100644
--- a/include/linux/sched/types.h
+++ b/include/linux/sched/types.h
@@ -20,4 +20,10 @@ struct task_cputime {
unsigned long long sum_exec_runtime;
};
+struct atomic_task_cputime {
+ atomic64_t stime;
+ atomic64_t utime;
+ atomic_long_t sum_exec_runtime;
+};
+
#endif /* _LINUX_SCHED_TYPES_H */
diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index c8a48cf83878..14c12da11930 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -175,14 +175,7 @@ static struct cgroup_subsys_state *css_rstat_push_children(
struct cgroup_subsys_state *parent, *grandchild;
struct css_rstat_cpu *crstatc;
- child->rstat_flush_next = NULL;
-
- /*
- * The subsystem rstat lock must be held for the whole duration from
- * here as the rstat_flush_next list is being constructed to when
- * it is consumed later in css_rstat_flush().
- */
- lockdep_assert_held(ss_rstat_lock(head->ss));
+ css_rstat_cpu(child, cpu)->rstat_flush_next = NULL;
/*
* Notation: -> updated_next pointer
@@ -203,19 +196,19 @@ static struct cgroup_subsys_state *css_rstat_push_children(
next_level:
while (cnext) {
child = cnext;
- cnext = child->rstat_flush_next;
+ cnext = css_rstat_cpu(child, cpu)->rstat_flush_next;
parent = child->parent;
/* updated_next is parent cgroup terminated if !NULL */
while (child != parent) {
- child->rstat_flush_next = head;
+ css_rstat_cpu(child, cpu)->rstat_flush_next = head;
head = child;
crstatc = css_rstat_cpu(child, cpu);
grandchild = crstatc->updated_children;
if (grandchild != child) {
/* Push the grand child to the next level */
crstatc->updated_children = child;
- grandchild->rstat_flush_next = ghead;
+ css_rstat_cpu(grandchild, cpu)->rstat_flush_next = ghead;
ghead = grandchild;
}
child = crstatc->updated_next;
@@ -286,7 +279,7 @@ static struct cgroup_subsys_state *css_rstat_updated_list(
/* Push @root to the list first before pushing the children */
head = root;
- root->rstat_flush_next = NULL;
+ css_rstat_cpu(root, cpu)->rstat_flush_next = NULL;
child = rstatc->updated_children;
rstatc->updated_children = root;
if (child != root)
@@ -383,19 +376,18 @@ __bpf_kfunc void css_rstat_flush(struct cgroup_subsys_state *css)
might_sleep();
for_each_possible_cpu(cpu) {
struct cgroup_subsys_state *pos;
-
- /* Reacquire for each CPU to avoid disabling IRQs too long */
- __css_rstat_lock(css, cpu);
pos = css_rstat_updated_list(css, cpu);
- for (; pos; pos = pos->rstat_flush_next) {
+ for (; pos; pos = css_rstat_cpu(pos, cpu)->rstat_flush_next) {
if (is_self) {
cgroup_base_stat_flush(pos->cgroup, cpu);
bpf_rstat_flush(pos->cgroup,
cgroup_parent(pos->cgroup), cpu);
- } else
+ } else {
+ __css_rstat_lock(css, cpu);
pos->ss->css_rstat_flush(pos, cpu);
+ __css_rstat_unlock(css, cpu);
+ }
}
- __css_rstat_unlock(css, cpu);
if (!cond_resched())
cpu_relax();
}
@@ -434,13 +426,6 @@ int css_rstat_init(struct cgroup_subsys_state *css)
rstatc->owner = rstatc->updated_children = css;
init_llist_node(&rstatc->lnode);
-
- if (is_self) {
- struct cgroup_rstat_base_cpu *rstatbc;
-
- rstatbc = cgroup_rstat_base_cpu(cgrp, cpu);
- u64_stats_init(&rstatbc->bsync);
- }
}
return 0;
@@ -507,25 +492,48 @@ int __init ss_rstat_init(struct cgroup_subsys *ss)
static void cgroup_base_stat_add(struct cgroup_base_stat *dst_bstat,
struct cgroup_base_stat *src_bstat)
{
- dst_bstat->cputime.utime += src_bstat->cputime.utime;
- dst_bstat->cputime.stime += src_bstat->cputime.stime;
- dst_bstat->cputime.sum_exec_runtime += src_bstat->cputime.sum_exec_runtime;
+ atomic64_add(atomic64_read(&src_bstat->cputime.utime),
+ &dst_bstat->cputime.utime);
+ atomic64_add(atomic64_read(&src_bstat->cputime.stime),
+ &dst_bstat->cputime.stime);
+ atomic_long_add(atomic_long_read(&src_bstat->cputime.sum_exec_runtime),
+ &dst_bstat->cputime.sum_exec_runtime);
#ifdef CONFIG_SCHED_CORE
- dst_bstat->forceidle_sum += src_bstat->forceidle_sum;
+ atomic64_add(atomic64_read(&src_bstat->forceidle_sum),
+ &dst_bstat->forceidle_sum);
#endif
- dst_bstat->ntime += src_bstat->ntime;
}
static void cgroup_base_stat_sub(struct cgroup_base_stat *dst_bstat,
struct cgroup_base_stat *src_bstat)
{
- dst_bstat->cputime.utime -= src_bstat->cputime.utime;
- dst_bstat->cputime.stime -= src_bstat->cputime.stime;
- dst_bstat->cputime.sum_exec_runtime -= src_bstat->cputime.sum_exec_runtime;
+ atomic64_sub(atomic64_read(&src_bstat->cputime.utime),
+ &dst_bstat->cputime.utime);
+ atomic64_sub(atomic64_read(&src_bstat->cputime.stime),
+ &dst_bstat->cputime.stime);
+ atomic_long_sub(atomic_long_read(&src_bstat->cputime.sum_exec_runtime),
+ &dst_bstat->cputime.sum_exec_runtime);
#ifdef CONFIG_SCHED_CORE
- dst_bstat->forceidle_sum -= src_bstat->forceidle_sum;
+ atomic64_sub(atomic64_read(&src_bstat->forceidle_sum),
+ &dst_bstat->forceidle_sum);
#endif
- dst_bstat->ntime -= src_bstat->ntime;
+ atomic64_sub(atomic64_read(&src_bstat->ntime), &dst_bstat->ntime);
+}
+
+static void cgroup_base_stat_copy(struct cgroup_base_stat *dst_bstat,
+ struct cgroup_base_stat *src_bstat)
+{
+ atomic64_set(&dst_bstat->cputime.stime,
+ atomic64_read(&src_bstat->cputime.stime));
+ atomic64_set(&dst_bstat->cputime.utime,
+ atomic64_read(&src_bstat->cputime.utime));
+ atomic_long_set(&dst_bstat->cputime.sum_exec_runtime,
+ atomic_long_read(&src_bstat->cputime.sum_exec_runtime));
+#ifdef CONFIG_SCHED_CORE
+ atomic64_set(&dst_bstat->forceidle_sum,
+ atomic64_read(&src_bstat->forceidle_sum));
+#endif
+ atomic64_set(&dst_bstat->ntime, atomic64_read(&src_bstat->ntime));
}
static void cgroup_base_stat_flush(struct cgroup *cgrp, int cpu)
@@ -534,17 +542,12 @@ static void cgroup_base_stat_flush(struct cgroup *cgrp, int cpu)
struct cgroup *parent = cgroup_parent(cgrp);
struct cgroup_rstat_base_cpu *prstatbc;
struct cgroup_base_stat delta;
- unsigned seq;
/* Root-level stats are sourced from system-wide CPU stats */
if (!parent)
return;
- /* fetch the current per-cpu values */
- do {
- seq = __u64_stats_fetch_begin(&rstatbc->bsync);
- delta = rstatbc->bstat;
- } while (__u64_stats_fetch_retry(&rstatbc->bsync, seq));
+ cgroup_base_stat_copy(&delta, &rstatbc->bstat);
/* propagate per-cpu delta to cgroup and per-cpu global statistics */
cgroup_base_stat_sub(&delta, &rstatbc->last_bstat);
@@ -554,12 +557,12 @@ static void cgroup_base_stat_flush(struct cgroup *cgrp, int cpu)
/* propagate cgroup and per-cpu global delta to parent (unless that's root) */
if (cgroup_parent(parent)) {
- delta = cgrp->bstat;
+ cgroup_base_stat_copy(&delta, &cgrp->bstat);
cgroup_base_stat_sub(&delta, &cgrp->last_bstat);
cgroup_base_stat_add(&parent->bstat, &delta);
cgroup_base_stat_add(&cgrp->last_bstat, &delta);
- delta = rstatbc->subtree_bstat;
+ cgroup_base_stat_copy(&delta, &rstatbc->subtree_bstat);
prstatbc = cgroup_rstat_base_cpu(parent, cpu);
cgroup_base_stat_sub(&delta, &rstatbc->last_subtree_bstat);
cgroup_base_stat_add(&prstatbc->subtree_bstat, &delta);
@@ -567,65 +570,43 @@ static void cgroup_base_stat_flush(struct cgroup *cgrp, int cpu)
}
}
-static struct cgroup_rstat_base_cpu *
-cgroup_base_stat_cputime_account_begin(struct cgroup *cgrp, unsigned long *flags)
-{
- struct cgroup_rstat_base_cpu *rstatbc;
-
- rstatbc = get_cpu_ptr(cgrp->rstat_base_cpu);
- *flags = u64_stats_update_begin_irqsave(&rstatbc->bsync);
- return rstatbc;
-}
-
-static void cgroup_base_stat_cputime_account_end(struct cgroup *cgrp,
- struct cgroup_rstat_base_cpu *rstatbc,
- unsigned long flags)
-{
- u64_stats_update_end_irqrestore(&rstatbc->bsync, flags);
- css_rstat_updated(&cgrp->self, smp_processor_id());
- put_cpu_ptr(rstatbc);
-}
-
void __cgroup_account_cputime(struct cgroup *cgrp, u64 delta_exec)
{
- struct cgroup_rstat_base_cpu *rstatbc;
- unsigned long flags;
-
- rstatbc = cgroup_base_stat_cputime_account_begin(cgrp, &flags);
- rstatbc->bstat.cputime.sum_exec_runtime += delta_exec;
- cgroup_base_stat_cputime_account_end(cgrp, rstatbc, flags);
+ struct cgroup_rstat_base_cpu *rstatbc =
+ get_cpu_ptr(cgrp->rstat_base_cpu);
+ atomic_long_add(delta_exec, &rstatbc->bstat.cputime.sum_exec_runtime);
+ put_cpu_ptr(rstatbc);
}
void __cgroup_account_cputime_field(struct cgroup *cgrp,
enum cpu_usage_stat index, u64 delta_exec)
{
struct cgroup_rstat_base_cpu *rstatbc;
- unsigned long flags;
- rstatbc = cgroup_base_stat_cputime_account_begin(cgrp, &flags);
+ rstatbc = get_cpu_ptr(cgrp->rstat_base_cpu);
switch (index) {
case CPUTIME_NICE:
- rstatbc->bstat.ntime += delta_exec;
+ atomic64_add(delta_exec, &rstatbc->bstat.ntime);
fallthrough;
case CPUTIME_USER:
- rstatbc->bstat.cputime.utime += delta_exec;
+ atomic64_add(delta_exec, &rstatbc->bstat.cputime.utime);
break;
case CPUTIME_SYSTEM:
case CPUTIME_IRQ:
case CPUTIME_SOFTIRQ:
- rstatbc->bstat.cputime.stime += delta_exec;
+ atomic64_add(delta_exec, &rstatbc->bstat.cputime.stime);
break;
#ifdef CONFIG_SCHED_CORE
case CPUTIME_FORCEIDLE:
- rstatbc->bstat.forceidle_sum += delta_exec;
+ atomic64_add(delta_exec, &rstatbc->bstat.forceidle_sum);
break;
#endif
default:
break;
}
- cgroup_base_stat_cputime_account_end(cgrp, rstatbc, flags);
+ put_cpu_ptr(rstatbc);
}
/*
@@ -636,7 +617,7 @@ void __cgroup_account_cputime_field(struct cgroup *cgrp,
*/
static void root_cgroup_cputime(struct cgroup_base_stat *bstat)
{
- struct task_cputime *cputime = &bstat->cputime;
+ struct atomic_task_cputime *cputime = &bstat->cputime;
int i;
memset(bstat, 0, sizeof(*bstat));
@@ -650,20 +631,19 @@ static void root_cgroup_cputime(struct cgroup_base_stat *bstat)
user += cpustat[CPUTIME_USER];
user += cpustat[CPUTIME_NICE];
- cputime->utime += user;
+ atomic64_add(user, &cputime->utime);
sys += cpustat[CPUTIME_SYSTEM];
sys += cpustat[CPUTIME_IRQ];
sys += cpustat[CPUTIME_SOFTIRQ];
- cputime->stime += sys;
+ atomic64_add(sys, &cputime->stime);
- cputime->sum_exec_runtime += user;
- cputime->sum_exec_runtime += sys;
+ atomic_long_add(sys + user, &cputime->sum_exec_runtime);
#ifdef CONFIG_SCHED_CORE
- bstat->forceidle_sum += cpustat[CPUTIME_FORCEIDLE];
+ atomic64_add(cpustat[CPUTIME_FORCEIDLE], &bstat->forceidle_sum);
#endif
- bstat->ntime += cpustat[CPUTIME_NICE];
+ atomic64_add(cpustat[CPUTIME_NICE], &bstat->ntime);
}
}
@@ -671,7 +651,7 @@ static void root_cgroup_cputime(struct cgroup_base_stat *bstat)
static void cgroup_force_idle_show(struct seq_file *seq, struct cgroup_base_stat *bstat)
{
#ifdef CONFIG_SCHED_CORE
- u64 forceidle_time = bstat->forceidle_sum;
+ u64 forceidle_time = atomic64_read(&bstat->forceidle_sum);
do_div(forceidle_time, NSEC_PER_USEC);
seq_printf(seq, "core_sched.force_idle_usec %llu\n", forceidle_time);
@@ -682,31 +662,50 @@ void cgroup_base_stat_cputime_show(struct seq_file *seq)
{
struct cgroup *cgrp = seq_css(seq)->cgroup;
struct cgroup_base_stat bstat;
+ unsigned long long sum_exec_runtime;
+ u64 utime, stime, ntime;
if (cgroup_parent(cgrp)) {
css_rstat_flush(&cgrp->self);
- __css_rstat_lock(&cgrp->self, -1);
- bstat = cgrp->bstat;
- cputime_adjust(&cgrp->bstat.cputime, &cgrp->prev_cputime,
- &bstat.cputime.utime, &bstat.cputime.stime);
- __css_rstat_unlock(&cgrp->self, -1);
- } else {
+ cgroup_base_stat_copy(&bstat, &cgrp->bstat);
+ struct task_cputime cputime = {
+ .stime = atomic64_read(&bstat.cputime.stime),
+ .utime = atomic64_read(&bstat.cputime.utime),
+ .sum_exec_runtime = atomic_long_read(
+ &bstat.cputime.sum_exec_runtime)
+ };
+ cputime_adjust(&cputime, &cgrp->prev_cputime, &cputime.utime,
+ &cputime.stime);
+ atomic64_set(&bstat.cputime.utime, cputime.utime);
+ atomic64_set(&bstat.cputime.stime, cputime.stime);
+ } else
root_cgroup_cputime(&bstat);
- }
- do_div(bstat.cputime.sum_exec_runtime, NSEC_PER_USEC);
- do_div(bstat.cputime.utime, NSEC_PER_USEC);
- do_div(bstat.cputime.stime, NSEC_PER_USEC);
- do_div(bstat.ntime, NSEC_PER_USEC);
-
- seq_printf(seq, "usage_usec %llu\n"
- "user_usec %llu\n"
- "system_usec %llu\n"
- "nice_usec %llu\n",
- bstat.cputime.sum_exec_runtime,
- bstat.cputime.utime,
- bstat.cputime.stime,
- bstat.ntime);
+ sum_exec_runtime = atomic_long_read(&bstat.cputime.sum_exec_runtime);
+
+ do_div(sum_exec_runtime, NSEC_PER_USEC);
+
+ utime = atomic64_read(&bstat.cputime.utime);
+
+ do_div(utime, NSEC_PER_USEC);
+
+ stime = atomic64_read(&bstat.cputime.stime);
+
+ do_div(stime, NSEC_PER_USEC);
+
+ ntime = atomic64_read(&bstat.ntime);
+
+ do_div(ntime, NSEC_PER_USEC);
+
+ seq_printf(seq,
+ "usage_usec %llu\n"
+ "user_usec %llu\n"
+ "system_usec %llu\n"
+ "nice_usec %llu\n",
+ sum_exec_runtime,
+ utime,
+ stime,
+ ntime);
cgroup_force_idle_show(seq, &bstat);
}
--
2.49.0
^ permalink raw reply related [flat|nested] 11+ messages in thread
* Re: [PATCH v2] cgroup/rstat: change cgroup_base_stat to atomic
2025-06-24 14:45 [PATCH v2] cgroup/rstat: change cgroup_base_stat to atomic Bertrand Wlodarczyk
@ 2025-06-26 19:15 ` Shakeel Butt
2025-06-27 13:15 ` Wlodarczyk, Bertrand
0 siblings, 1 reply; 11+ messages in thread
From: Shakeel Butt @ 2025-06-26 19:15 UTC (permalink / raw)
To: Bertrand Wlodarczyk
Cc: tj, hannes, mkoutny, cgroups, linux-kernel, inwardvessel
On Tue, Jun 24, 2025 at 04:45:58PM +0200, Bertrand Wlodarczyk wrote:
> The kernel faces scalability issues when multiple userspace
> programs attempt to read cgroup statistics concurrently.
>
> The primary bottleneck is the css_cgroup_lock in cgroup_rstat_flush,
> which prevents access and updates to the statistics
> of the css from multiple CPUs in parallel.
>
> Given that rstat operates on a per-CPU basis and only aggregates
> statistics in the parent cgroup, there is no compelling reason
> why these statistics cannot be atomic.
> By eliminating the lock during CPU statistics access,
> each CPU can traverse its rstat hierarchy independently, without blocking.
> Synchronization is achieved during parent propagation through
> atomic operations.
>
> This change significantly enhances performance on commit
> 8dcb0ed834a3ec03 ("memcg: cgroup: call css_rstat_updated irrespective of in_nmi()")
> in scenarios where multiple CPUs access CPU rstat within a
> single cgroup hierarchy, yielding a performance improvement of around 40 times.
> Notably, performance for memory and I/O rstats remains unchanged,
> as the lock remains in place for these usages.
>
> Additionally, this patch addresses a race condition detectable
> in the current mainline by KCSAN in __cgroup_account_cputime,
> which occurs when attempting to read a single hierarchy
> from multiple CPUs.
>
> Signed-off-by: Bertrand Wlodarczyk <bertrand.wlodarczyk@intel.com>
This patch breaks memory controller as explained in the comments on the
previous version. Also the response to the tearing issue explained by JP
is not satisfying.
Please run scripts/faddr2line on css_rstat_flush+0x1b0/0xed0 and
css_rstat_updated+0x8f/0x1a0 to see which field is causing the race.
^ permalink raw reply [flat|nested] 11+ messages in thread
* RE: [PATCH v2] cgroup/rstat: change cgroup_base_stat to atomic
2025-06-26 19:15 ` Shakeel Butt
@ 2025-06-27 13:15 ` Wlodarczyk, Bertrand
2025-06-27 16:50 ` tj
` (2 more replies)
0 siblings, 3 replies; 11+ messages in thread
From: Wlodarczyk, Bertrand @ 2025-06-27 13:15 UTC (permalink / raw)
To: Shakeel Butt
Cc: tj@kernel.org, hannes@cmpxchg.org, mkoutny@suse.com,
cgroups@vger.kernel.org, linux-kernel@vger.kernel.org,
inwardvessel@gmail.com
> The kernel faces scalability issues when multiple userspace programs
> attempt to read cgroup statistics concurrently.
>
> The primary bottleneck is the css_cgroup_lock in cgroup_rstat_flush,
> which prevents access and updates to the statistics of the css from
> multiple CPUs in parallel.
>
> Given that rstat operates on a per-CPU basis and only aggregates
> statistics in the parent cgroup, there is no compelling reason why
> these statistics cannot be atomic.
> By eliminating the lock during CPU statistics access, each CPU can
> traverse its rstat hierarchy independently, without blocking.
> Synchronization is achieved during parent propagation through atomic
> operations.
>
> This change significantly enhances performance on commit
> 8dcb0ed834a3ec03 ("memcg: cgroup: call css_rstat_updated irrespective
> of in_nmi()") in scenarios where multiple CPUs accessCPU rstat within
> a single cgroup hierarchy, yielding a performance improvement of around 40 times.
> Notably, performance for memory and I/O rstats remains unchanged, as
> the lock remains in place for these usages.
>
> Additionally, this patch addresses a race condition detectable in the
> current mainline by KCSAN in __cgroup_account_cputime, which occurs
> when attempting to read a single hierarchy from multiple CPUs.
>
> Signed-off-by: Bertrand Wlodarczyk <bertrand.wlodarczyk@intel.com>
> This patch breaks memory controller as explained in the comments on the previous version.
Ekhm... no? I addressed the issue: v2 has the lock back, surrounding the calls into the dependent submodules.
The behavior is the same as before patching.
In the long term, in my opinion, the atomics should also happen in the dependent submodules to eliminate the locks
completely.
> Also the response to the tearing issue explained by JP is not satisfying.
In other words, the claim is: "it's better to stall other CPUs in a spinlock plus disable IRQs every time in order to
serve an outdated snapshot instead of giving the user the freshest statistics much, much faster".
In terms of statistics, the freshest data served quickly to the user is, in my opinion, the better behavior.
I wouldn't be addressing this issue if there were no customers affected by rstat latency in multi-container,
multi-CPU scenarios.
> Please run scripts/faddr2line on css_rstat_flush+0x1b0/0xed0 and
> css_rstat_updated+0x8f/0x1a0 to see which field is causing the race.
There is more than one race in the current for-next-6.17. In the faddr2line output the first address writes, the second reads.
The benchmark provided in the gist exposes the issue.
[ 30.547317] BUG: KCSAN: data-race in css_rstat_flush / css_rstat_updated
[ 30.549011]
[ 30.549483] write to 0xffd1ffffff686a30 of 8 bytes by task 1014 on cpu 82:
[ 30.551124] css_rstat_flush+0x1b0/0xed0
[ 30.552260] cgroup_base_stat_cputime_show+0x96/0x2f0
[ 30.553582] cpu_stat_show+0x14/0x1a0
[ 30.555477] cgroup_seqfile_show+0xb0/0x150
[ 30.557060] kernfs_seq_show+0x93/0xb0
[ 30.558241] seq_read_iter+0x190/0x7d0
[ 30.559278] kernfs_fop_read_iter+0x23b/0x290
[ 30.560416] vfs_read+0x46b/0x5a0
[ 30.561336] ksys_read+0xa5/0x130
[ 30.562190] __x64_sys_read+0x3c/0x50
[ 30.563179] x64_sys_call+0x19e1/0x1c10
[ 30.564215] do_syscall_64+0xa2/0x200
[ 30.565214] entry_SYSCALL_64_after_hwframe+0x77/0x7f
[ 30.566456]
[ 30.566892] read to 0xffd1ffffff686a30 of 8 bytes by interrupt on cpu 74:
[ 30.568472] css_rstat_updated+0x8f/0x1a0
[ 30.569499] __cgroup_account_cputime+0x5d/0x90
[ 30.570640] update_curr+0x1bd/0x260
[ 30.571559] task_tick_fair+0x3b/0x130
[ 30.572545] sched_tick+0xa1/0x220
[ 30.573510] update_process_times+0x97/0xd0
[ 30.574576] tick_nohz_handler+0xfc/0x220
[ 30.575650] __hrtimer_run_queues+0x2a3/0x4b0
[ 30.576703] hrtimer_interrupt+0x1c6/0x3a0
[ 30.577761] __sysvec_apic_timer_interrupt+0x62/0x180
[ 30.578982] sysvec_apic_timer_interrupt+0x6b/0x80
[ 30.580161] asm_sysvec_apic_timer_interrupt+0x1a/0x20
[ 30.581397] _raw_spin_unlock_irq+0x18/0x30
[ 30.582505] css_rstat_flush+0x5cd/0xed0
[ 30.583611] cgroup_base_stat_cputime_show+0x96/0x2f0
[ 30.584934] cpu_stat_show+0x14/0x1a0
[ 30.585814] cgroup_seqfile_show+0xb0/0x150
[ 30.586915] kernfs_seq_show+0x93/0xb0
[ 30.587876] seq_read_iter+0x190/0x7d0
[ 30.588797] kernfs_fop_read_iter+0x23b/0x290
[ 30.589904] vfs_read+0x46b/0x5a0
[ 30.590723] ksys_read+0xa5/0x130
[ 30.591659] __x64_sys_read+0x3c/0x50
[ 30.592612] x64_sys_call+0x19e1/0x1c10
[ 30.593593] do_syscall_64+0xa2/0x200
[ 30.594523] entry_SYSCALL_64_after_hwframe+0x77/0x7f
[ 30.595756]
[ 30.596305] value changed: 0x0000000000000000 -> 0xffd1ffffff686a30
[ 30.597787]
[ 30.598286] Reported by Kernel Concurrency Sanitizer on:
[ 30.599583] CPU: 74 UID: 0 PID: 1006 Comm: benchmark Not tainted 6.15.0-g633e6bad3124 #12 PREEMPT(voluntary)
[ 30.601968] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-3.fc41 04/01/2014
./scripts/faddr2line vmlinux css_rstat_flush+0x1b0/0xed0 css_rstat_updated+0x8f/0x1a0
css_rstat_flush+0x1b0/0xed0:
init_llist_node at include/linux/llist.h:86
(inlined by) llist_del_first_init at include/linux/llist.h:308
(inlined by) css_process_update_tree at kernel/cgroup/rstat.c:148
(inlined by) css_rstat_updated_list at kernel/cgroup/rstat.c:258
(inlined by) css_rstat_flush at kernel/cgroup/rstat.c:389
css_rstat_updated+0x8f/0x1a0:
css_rstat_updated at kernel/cgroup/rstat.c:90 (discriminator 1)
---
[ 140.063127] BUG: KCSAN: data-race in __cgroup_account_cputime / css_rstat_flush
[ 140.064809]
[ 140.065290] write to 0xffd1ffffff711f50 of 8 bytes by interrupt on cpu 76:
[ 140.067221] __cgroup_account_cputime+0x4a/0x90
[ 140.068346] update_curr+0x1bd/0x260
[ 140.069278] task_tick_fair+0x3b/0x130
[ 140.070226] sched_tick+0xa1/0x220
[ 140.071080] update_process_times+0x97/0xd0
[ 140.072091] tick_nohz_handler+0xfc/0x220
[ 140.073048] __hrtimer_run_queues+0x2a3/0x4b0
[ 140.074105] hrtimer_interrupt+0x1c6/0x3a0
[ 140.075081] __sysvec_apic_timer_interrupt+0x62/0x180
[ 140.076262] sysvec_apic_timer_interrupt+0x6b/0x80
[ 140.077423] asm_sysvec_apic_timer_interrupt+0x1a/0x20
[ 140.078625] _raw_spin_unlock_irq+0x18/0x30
[ 140.079579] css_rstat_flush+0x5cd/0xed0
[ 140.080501] cgroup_base_stat_cputime_show+0x96/0x2f0
[ 140.081638] cpu_stat_show+0x14/0x1a0
[ 140.082534] cgroup_seqfile_show+0xb0/0x150
[ 140.083534] kernfs_seq_show+0x93/0xb0
[ 140.084457] seq_read_iter+0x190/0x7d0
[ 140.085373] kernfs_fop_read_iter+0x23b/0x290
[ 140.086416] vfs_read+0x46b/0x5a0
[ 140.087263] ksys_read+0xa5/0x130
[ 140.088088] __x64_sys_read+0x3c/0x50
[ 140.088921] x64_sys_call+0x19e1/0x1c10
[ 140.089814] do_syscall_64+0xa2/0x200
[ 140.090698] entry_SYSCALL_64_after_hwframe+0x77/0x7f
[ 140.091932]
[ 140.092357] read to 0xffd1ffffff711f50 of 8 bytes by task 1172 on cpu 16:
[ 140.093877] css_rstat_flush+0x717/0xed0
[ 140.094791] cgroup_base_stat_cputime_show+0x96/0x2f0
[ 140.095989] cpu_stat_show+0x14/0x1a0
[ 140.096866] cgroup_seqfile_show+0xb0/0x150
[ 140.097817] kernfs_seq_show+0x93/0xb0
[ 140.098694] seq_read_iter+0x190/0x7d0
[ 140.099625] kernfs_fop_read_iter+0x23b/0x290
[ 140.100674] vfs_read+0x46b/0x5a0
[ 140.101529] ksys_read+0xa5/0x130
[ 140.102382] __x64_sys_read+0x3c/0x50
[ 140.103290] x64_sys_call+0x19e1/0x1c10
[ 140.104252] do_syscall_64+0xa2/0x200
[ 140.105157] entry_SYSCALL_64_after_hwframe+0x77/0x7f
[ 140.106343]
[ 140.106750] value changed: 0x000000032a8e1130 -> 0x000000032ab08ca6
[ 140.108251]
[ 140.108670] Reported by Kernel Concurrency Sanitizer on:
[ 140.109910] CPU: 16 UID: 0 PID: 1172 Comm: benchmark Not tainted 6.15.0-g633e6bad3124 #12 PREEMPT(voluntary)
[ 140.112075] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-3.fc41 04/01/2014
./scripts/faddr2line vmlinux __cgroup_account_cputime+0x4a/0x90 css_rstat_flush+0x717/0xed0
__cgroup_account_cputime+0x4a/0x90:
__cgroup_account_cputime at kernel/cgroup/rstat.c:595
css_rstat_flush+0x717/0xed0:
cgroup_base_stat_flush at kernel/cgroup/rstat.c:546
(inlined by) css_rstat_flush at kernel/cgroup/rstat.c:392
---
[ 156.387539] BUG: KCSAN: data-race in __cgroup_account_cputime_field / css_rstat_flush
[ 156.389371]
[ 156.389784] write to 0xffd1fffffe7d1f40 of 8 bytes by interrupt on cpu 15:
[ 156.391394] __cgroup_account_cputime_field+0x9d/0xe0
[ 156.392539] account_system_index_time+0x84/0x90
[ 156.393585] update_process_times+0x25/0xd0
[ 156.394544] tick_nohz_handler+0xfc/0x220
[ 156.395517] __hrtimer_run_queues+0x2a3/0x4b0
[ 156.396544] hrtimer_interrupt+0x1c6/0x3a0
[ 156.397515] __sysvec_apic_timer_interrupt+0x62/0x180
[ 156.398660] sysvec_apic_timer_interrupt+0x6b/0x80
[ 156.399769] asm_sysvec_apic_timer_interrupt+0x1a/0x20
[ 156.400937] _raw_spin_unlock_irq+0x18/0x30
[ 156.401902] css_rstat_flush+0x5cd/0xed0
[ 156.402774] cgroup_base_stat_cputime_show+0x96/0x2f0
[ 156.403940] cpu_stat_show+0x14/0x1a0
[ 156.404763] cgroup_seqfile_show+0xb0/0x150
[ 156.405724] kernfs_seq_show+0x93/0xb0
[ 156.406643] seq_read_iter+0x190/0x7d0
[ 156.407522] kernfs_fop_read_iter+0x23b/0x290
[ 156.408549] vfs_read+0x46b/0x5a0
[ 156.409386] ksys_read+0xa5/0x130
[ 156.410176] __x64_sys_read+0x3c/0x50
[ 156.410973] x64_sys_call+0x19e1/0x1c10
[ 156.411862] do_syscall_64+0xa2/0x200
[ 156.412673] entry_SYSCALL_64_after_hwframe+0x77/0x7f
[ 156.413814]
[ 156.414249] read to 0xffd1fffffe7d1f40 of 8 bytes by task 1140 on cpu 85:
[ 156.415718] css_rstat_flush+0x6fe/0xed0
[ 156.416669] cgroup_base_stat_cputime_show+0x96/0x2f0
[ 156.417855] cpu_stat_show+0x14/0x1a0
[ 156.418684] cgroup_seqfile_show+0xb0/0x150
[ 156.419637] kernfs_seq_show+0x93/0xb0
[ 156.420519] seq_read_iter+0x190/0x7d0
[ 156.421395] kernfs_fop_read_iter+0x23b/0x290
[ 156.422413] vfs_read+0x46b/0x5a0
[ 156.423228] ksys_read+0xa5/0x130
[ 156.423974] __x64_sys_read+0x3c/0x50
[ 156.424773] x64_sys_call+0x19e1/0x1c10
[ 156.425704] do_syscall_64+0xa2/0x200
[ 156.426600] entry_SYSCALL_64_after_hwframe+0x77/0x7f
[ 156.427723]
[ 156.428217] value changed: 0x00000004be5ffd29 -> 0x00000004be6f3f69
[ 156.429575]
[ 156.430024] Reported by Kernel Concurrency Sanitizer on:
[ 156.431227] CPU: 85 UID: 0 PID: 1140 Comm: benchmark Not tainted 6.15.0-g633e6bad3124 #12 PREEMPT(voluntary)
[ 156.433406] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-3.fc41 04/01/2014
./scripts/faddr2line vmlinux __cgroup_account_cputime_field+0x9d/0xe0 css_rstat_flush+0x6fe/0xed0
__cgroup_account_cputime_field+0x9d/0xe0:
__cgroup_account_cputime_field at kernel/cgroup/rstat.c:617
css_rstat_flush+0x6fe/0xed0:
cgroup_base_stat_flush at kernel/cgroup/rstat.c:546
(inlined by) css_rstat_flush at kernel/cgroup/rstat.c:392
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH v2] cgroup/rstat: change cgroup_base_stat to atomic
2025-06-27 13:15 ` Wlodarczyk, Bertrand
@ 2025-06-27 16:50 ` tj
2025-06-30 14:25 ` Wlodarczyk, Bertrand
2025-06-27 16:55 ` JP Kobryn
2025-06-27 17:17 ` Shakeel Butt
2 siblings, 1 reply; 11+ messages in thread
From: tj @ 2025-06-27 16:50 UTC (permalink / raw)
To: Wlodarczyk, Bertrand
Cc: Shakeel Butt, hannes@cmpxchg.org, mkoutny@suse.com,
cgroups@vger.kernel.org, linux-kernel@vger.kernel.org,
inwardvessel@gmail.com
Hello,
On Fri, Jun 27, 2025 at 01:15:31PM +0000, Wlodarczyk, Bertrand wrote:
...
> > Also the response to the tearing issue explained by JP is not satisfying.
>
> In other words, the claim is: "it's better to stall other cpus in spinlock plus disable IRQ every time in order to
> serve outdated snapshot instead of providing user to the freshest statistics much, much faster".
> In term of statistics, freshest data served fast to the user is, in my opinion, better behavior.
This is a false choice, I think. e.g. We can easily use seqlock to remove
strict synchronization only from user side, right?
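To sketch the idea (illustrative only; the struct and helper names below are
made up for the example, not a concrete proposal): keep a seqcount next to the
aggregated base stats so cpu.stat readers can take a consistent snapshot
without holding the rstat lock, while flushers keep serializing among
themselves as they do today.

/* illustrative sketch, with the existing plain-u64 cgroup_base_stat */
struct bstat_snapshot {
	seqcount_t		seq;
	struct cgroup_base_stat	bstat;
};

/* flusher/writer side, still called under the existing rstat lock */
static void bstat_publish(struct bstat_snapshot *s,
			  const struct cgroup_base_stat *src)
{
	write_seqcount_begin(&s->seq);
	s->bstat = *src;
	write_seqcount_end(&s->seq);
}

/* cpu.stat reader side, no rstat lock needed */
static void bstat_read(struct bstat_snapshot *s,
		       struct cgroup_base_stat *dst)
{
	unsigned int start;

	do {
		start = read_seqcount_begin(&s->seq);
		*dst = s->bstat;
	} while (read_seqcount_retry(&s->seq, start));
}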
> I wouldn't be addressing this issue if there were no customers affected by rstat latency in multi-container
> multi-cpu scenarios.
Out of curiosity, can you explain the case that you observed in more detail?
What was the customer doing?
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 11+ messages in thread
* RE: [PATCH v2] cgroup/rstat: change cgroup_base_stat to atomic
2025-06-27 16:50 ` tj
@ 2025-06-30 14:25 ` Wlodarczyk, Bertrand
2025-06-30 15:48 ` tj
0 siblings, 1 reply; 11+ messages in thread
From: Wlodarczyk, Bertrand @ 2025-06-30 14:25 UTC (permalink / raw)
To: tj@kernel.org
Cc: Shakeel Butt, hannes@cmpxchg.org, mkoutny@suse.com,
cgroups@vger.kernel.org, linux-kernel@vger.kernel.org,
inwardvessel@gmail.com
> > Also the response to the tearing issue explained by JP is not satisfying.
>
> In other words, the claim is: "it's better to stall other cpus in
> spinlock plus disable IRQ every time in order to serve outdated snapshot instead of providing user to the freshest statistics much, much faster".
> In term of statistics, freshest data served fast to the user is, in my opinion, better behavior.
> This is a false choice, I think. e.g. We can easily use seqlock to remove strict synchronization only from user side, right?
Yes, that's a second possibility to solve the problem.
I chose the atomics approach because, in my opinion, incremental statistics are a somewhat natural use case for them.
> I wouldn't be addressing this issue if there were no customers
> affected by rstat latency in multi-container multi-cpu scenarios.
> Out of curiosity, can you explain the case that you observed in more detail?
> What were the customer doing?
Single hierarchy, hundreds of containers on one server, multiple independent owners.
Some of them want to have current stats available in their web GUI.
They are hammering the stats for their cgroups.
The server experiences inefficiencies; perf shows a visible percentage of CPU cycles spent in cgroup_rstat_flush.
I prepared a benchmark which can serve as an example of the issue faced by the customer:
https://gist.github.com/bwlodarcz/21bbc24813bced8e6ffc9e5ca3150fcc
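Roughly, the stressor boils down to something like the sketch below (illustrative only, not the actual gist):
a few test cgroups, one reader thread per cgroup, each re-reading its cpu.stat in a tight loop so that
concurrent flushes fight over the rstat lock. Paths and counts are assumptions.

#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

#define NR_CGROUPS	4
#define ITERS		100000

/* one reader per cgroup, each hammering its own cpu.stat */
static void *hammer(void *arg)
{
	const char *path = arg;
	char buf[512];

	for (int i = 0; i < ITERS; i++) {
		int fd = open(path, O_RDONLY);

		if (fd < 0)
			break;
		read(fd, buf, sizeof(buf));
		close(fd);
	}
	return NULL;
}

int main(void)
{
	static char paths[NR_CGROUPS][64];
	pthread_t tid[NR_CGROUPS];

	for (int i = 0; i < NR_CGROUPS; i++) {
		/* assumed pre-created cgroups: /sys/fs/cgroup/test0..test3 */
		snprintf(paths[i], sizeof(paths[i]),
			 "/sys/fs/cgroup/test%d/cpu.stat", i);
		pthread_create(&tid[i], NULL, hammer, paths[i]);
	}
	for (int i = 0; i < NR_CGROUPS; i++)
		pthread_join(tid[i], NULL);
	return 0;
}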
qemu vm:
+--------------+---------+---------+
|   mean (s)   |8dcb0ed8 | patched |
+--------------+---------+---------+
|cpu, KCSAN on |16.13*   |3.75     |
+--------------+---------+---------+
|cpu, KCSAN off|4.45     |0.81     |
+--------------+---------+---------+
*race condition still present
It's not hammering the lock as much as the previous stressor, so the results are better for the for-6.17 branch.
The customer operates at a much bigger scale than the 4 cgroups in the benchmark.
There are workarounds implemented, so it's not that hot now (for them).
Anyway, I think it's worth trying to improve the scalability situation,
especially since, as far as I can see, there are no downsides.
There are also reports about similar problems in memory rstats, but I haven't looked at them yet.
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH v2] cgroup/rstat: change cgroup_base_stat to atomic
2025-06-30 14:25 ` Wlodarczyk, Bertrand
@ 2025-06-30 15:48 ` tj
2025-07-04 13:13 ` Wlodarczyk, Bertrand
0 siblings, 1 reply; 11+ messages in thread
From: tj @ 2025-06-30 15:48 UTC (permalink / raw)
To: Wlodarczyk, Bertrand
Cc: Shakeel Butt, hannes@cmpxchg.org, mkoutny@suse.com,
cgroups@vger.kernel.org, linux-kernel@vger.kernel.org,
inwardvessel@gmail.com
Hello,
On Mon, Jun 30, 2025 at 02:25:27PM +0000, Wlodarczyk, Bertrand wrote:
> > > Also the response to the tearing issue explained by JP is not satisfying.
> >
> > In other words, the claim is: "it's better to stall other cpus in
> > spinlock plus disable IRQ every time in order to serve outdated snapshot instead of providing user to the freshest statistics much, much faster".
> > In term of statistics, freshest data served fast to the user is, in my opinion, better behavior.
>
> > This is a false choice, I think. e.g. We can easily use seqlock to remove strict synchronization only from user side, right?
>
> Yes, that's second possibility to solve a problem.
> I choose atomics approach because, in my opinion, incremental statistics are somewhat natural use case for them.
They're good for individual counters but I'm not sure they're a natural fit
for a group of stats. A series of atomic ops can be significantly more
expensive than locked updates and it also comes with problems like split
updates as discussed in this thread. I think most of the resistance comes
from the use of atomics. Can you please try a different approach?
> > I wouldn't be addressing this issue if there were no customers
> > affected by rstat latency in multi-container multi-cpu scenarios.
>
> > Out of curiosity, can you explain the case that you observed in more detail?
> > What were the customer doing?
>
> Single hierarchy, hundreds of the containers on one server, multiple independent owners.
> Some of them wants to have current stats available in their webgui.
> They are hammering the stats for their cgroups.
> Server experience inefficiencies, perf shows visible percentage of cpu cycles spent in cgroup_rstat_flush.
>
> I prepared benchmark which can be example of the issue faced by the customer:
> https://gist.github.com/bwlodarcz/21bbc24813bced8e6ffc9e5ca3150fcc
>
> qemu vm:
> +--------------+---------+---------+
> |   mean (s)   |8dcb0ed8 | patched |
> +--------------+---------+---------+
> |cpu, KCSAN on |16.13*   |3.75     |
> +--------------+---------+---------+
> |cpu, KCSAN off|4.45     |0.81     |
> +--------------+---------+---------+
> *race condition still present
>
> It's not hammering the lock so much as previous stressor, so the results are better for for-6.17 branch.
> The customer has much bigger scale than 4 cgroups in benchmark.
> There are workarounds implemented so it's not that hot now (for them).
> Anyway, I think it's worth to try improving the scalability situation,
> especially that as far as I see it, there are no downsides.
>
> There also reports about similar problems in memory rstats but I didn't look on them yet.
Yeah, I saw the benchmark but I was more curious what actual use case would
lead to behaviors like that because you'd have to hammer on those stats
really hard for this to be a problem. In most use cases that I'm aware of,
the polling frequencies of these stats are >= 1sec. I guess the users in
your use case were banging on them way harder, at least previously.
I don't think switching to atomics is a good idea, but improving the read
scalability would definitely be nice.
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 11+ messages in thread
* RE: [PATCH v2] cgroup/rstat: change cgroup_base_stat to atomic
2025-06-30 15:48 ` tj
@ 2025-07-04 13:13 ` Wlodarczyk, Bertrand
2025-07-04 17:57 ` tj
0 siblings, 1 reply; 11+ messages in thread
From: Wlodarczyk, Bertrand @ 2025-07-04 13:13 UTC (permalink / raw)
To: tj@kernel.org
Cc: Shakeel Butt, hannes@cmpxchg.org, mkoutny@suse.com,
cgroups@vger.kernel.org, linux-kernel@vger.kernel.org,
inwardvessel@gmail.com
> > > Also the response to the tearing issue explained by JP is not satisfying.
> >
> > In other words, the claim is: "it's better to stall other cpus in
> > spinlock plus disable IRQ every time in order to serve outdated snapshot instead of providing user to the freshest statistics much, much faster".
> > In term of statistics, freshest data served fast to the user is, in my opinion, better behavior.
>
> > This is a false choice, I think. e.g. We can easily use seqlock to remove strict synchronization only from user side, right?
>
> Yes, that's second possibility to solve a problem.
> I choose atomics approach because, in my opinion, incremental statistics are somewhat natural use case for them.
>They're good for individual counters but I'm not sure they're natural fit for a group of stats.
It depends on what we consider a group of stats.
In the case of rstat CPU stats, we have statistics printed together but counted independently.
The only exception is sum_exec_runtime, whose purpose in rstats is unknown to me.
From what I can observe and read in the code (in the rstat context; I know about its usage in sched), it's just the sum of stime and utime.
> A series of atomic ops can be significantly more expensive than locked updates and it also comes with problems like split updates as discussed in this thread. I think most of resistance is from the use of atomics.
I just can't see what the issue with split updates is.
To illustrate, let's perform an example thought experiment.
Let's assume we have three stats A, B, C initialized to 0 and execution starting at time 0.
We have two CPUs, one wanting to perform a read operation and the second a write operation at the same time.
The cost in time is 16 units per update operation, 2 per read.
Let's start the read with a 16-unit delay and the write without any.
Lock scenario:
Init: A = 0, B = 0 , C = 0
cpu 0 - write {
lock
A = 1
B = 1
C = 1
unlock
}
48 units elapsed
16 units delayed
Cpu 1 - read {
spins 32 units
lock
read(A) // 1
read(B) // 1
read(C) // 1
unlock
}
Result is A = 1, B = 1, C = 1 after 54 units of time elapsed.
Atomic scenario:
Init: A = 0, B = 0, C = 0
cpu 0 - write {
A = 1 // 16 units elapsed
B = 1 // 32 units elapsed
C = 1 // 48 units elapsed
}
16 units delayed
Cpu 1 - read {
read(A) // A = 1 // 18 units elapsed
read(B) // B = 0 // 20 units elapsed
read(C) // C = 0 // 22 units elapsed
}
Result is A = 1, B = 0, C = 0 after 22 units of time elapsed.
48 units delayed
Cpu 1 - read {
read(A) // A = 1 // 50 units elapsed
read(B) // B = 1 // 52 units elapsed
read(C) // C = 1 // 54 units elapsed
}
Result is A = 1, B = 1, C = 1 after 54 units of time elapsed.
The difference is that with atomics you have access to the fresher state of statistic A at 18 units of time.
After 54 units both solutions have the same result.
What's the issue here? Why is the user seeing A = 1, B = 0, C = 0 at 22 units (instead of spinning) a bad thing in the rstat scenario?
Moreover, I consider the lock solution inferior due to its complexity.
For example, currently the ss_rstat_lock function can return one of two locks:
the global rstat_base_lock, or the lock from the cgroup_subsys. The lock for the cgroup_subsys is initialized in ss_rstat_init.
When? Why do we sometimes acquire the global lock during flushes and sometimes, in the happier scenario, the cgroup_subsys lock?
How do we generalize a performance solution here? The context of states and scopes to remember when crafting a solution is larger in comparison to atomics.
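For reference, the helper in question is roughly the following (paraphrasing the current kernel/cgroup/rstat.c, details may differ):
base stats (css == cgroup::self, ss == NULL) take the global rstat_base_lock, while controllers that implement css_rstat_flush
take their per-subsystem lock.

static spinlock_t *ss_rstat_lock(struct cgroup_subsys *ss)
{
	if (ss)
		return &ss->rstat_ss_lock;

	return &rstat_base_lock;
}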
Even now, the mainline implementation still has race conditions. I know these are being worked on right now, but it's a sign of the solution's complexity, which atomics essentially clear away.
The patch removes more code than it adds and has much better resistance to stress than the lock.
The code handling the 32-bit __u64_stats_fetch is gone because it's not needed.
It makes everything simpler to reason about, the scopes are easy to grasp, it opens opportunities for further improvements in submodules, and it clears all races detected by KCSAN.
> Can you please try a different approach?
In the last few days I've investigated this and had some success, but nowhere near the improvements yielded by the use of atomics.
For the reasons I mentioned above, the locks approach is much more complex to optimize.
> > I wouldn't be addressing this issue if there were no customers
> > affected by rstat latency in multi-container multi-cpu scenarios.
>
> > Out of curiosity, can you explain the case that you observed in more detail?
> > What were the customer doing?
>
> Single hierarchy, hundreds of the containers on one server, multiple independent owners.
> Some of them wants to have current stats available in their webgui.
> They are hammering the stats for their cgroups.
> Server experience inefficiencies, perf shows visible percentage of cpu cycles spent in cgroup_rstat_flush.
>
> I prepared benchmark which can be example of the issue faced by the customer:
> https://gist.github.com/bwlodarcz/21bbc24813bced8e6ffc9e5ca3150fcc
>
> qemu vm:
> +--------------+---------+---------+
> |   mean (s)   |8dcb0ed8 | patched |
> +--------------+---------+---------+
> |cpu, KCSAN on |16.13*   |3.75     |
> +--------------+---------+---------+
> |cpu, KCSAN off|4.45     |0.81     |
> +--------------+---------+---------+
> *race condition still present
>
> It's not hammering the lock so much as previous stressor, so the results are better for for-6.17 branch.
> The customer has much bigger scale than 4 cgroups in benchmark.
> There are workarounds implemented so it's not that hot now (for them).
> Anyway, I think it's worth to try improving the scalability situation,
> especially that as far as I see it, there are no downsides.
>
> There also reports about similar problems in memory rstats but I didn't look on them yet.
> Yeah, I saw the benchmark but I was more curious what actual use case would lead to behaviors like that because you'd have to hammer on those stats really hard for this to be a problem. In most use cases that I'm aware of, the polling frequencies of these stats are >= 1sec. I guess the users in your use case were banging on them way harder, at least previously.
From what I know, the https://github.com/google/cadvisor instances deployed on the client machine hammered these stats.
Sharing servers between independent teams or orgs in big corps is frequent. Every interested party deployed its own, or similar, instance.
We can say just don't do that and be fine, but it will be happening anyway. It's better to just make rstats more robust.
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH v2] cgroup/rstat: change cgroup_base_stat to atomic
2025-07-04 13:13 ` Wlodarczyk, Bertrand
@ 2025-07-04 17:57 ` tj
2025-07-21 11:48 ` Wlodarczyk, Bertrand
0 siblings, 1 reply; 11+ messages in thread
From: tj @ 2025-07-04 17:57 UTC (permalink / raw)
To: Wlodarczyk, Bertrand
Cc: Shakeel Butt, hannes@cmpxchg.org, mkoutny@suse.com,
cgroups@vger.kernel.org, linux-kernel@vger.kernel.org,
inwardvessel@gmail.com
Hello,
On Fri, Jul 04, 2025 at 01:13:56PM +0000, Wlodarczyk, Bertrand wrote:
...
> After 54 units both solutions have the same result.
> What's the issue here? Why user seeing A = 1, B = 0, C = 0 in 22 unit (instead of spin) is a bad thing in rstat scenario?
Because some stats are related to each other - e.g. in blkcg, BPS and IOPS.
Here, it would be overlapping cputime stats if we ever add [soft]irq time breakdowns,
and that can lead to non-sensical calculations (divide by zero, underflow,
and so on) in their users, just rare enough not to be debugged easily but
frequent enough to be a headache in larger / longer deployments. And,
because we can usually do better.
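To make that concrete, an illustrative snippet (not code from the patch; irq_sum is a hypothetical field):
a consumer that derives one value from two counters read as a group can underflow if the group is updated
between the individual atomic reads.

/* illustrative only: stime already includes irq/softirq time */
static u64 pure_system_time(struct cgroup_base_stat *bstat)
{
	u64 stime = atomic64_read(&bstat->cputime.stime);
	u64 irq = atomic64_read(&bstat->irq_sum);	/* hypothetical field */

	/* underflows if irq was bumped between the two reads */
	return stime - irq;
}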
> > Can you please try a different approach?
>
> In last few days I've investigated this, have some success but nowhere
> near to the improvements yield by atomics use. For the reasons I mentioned
> above, locks approach is much more complex to optimize.
So, I'm not converting these stats to atomics. It's just not a good long
term direction. Please find a better solution. I'm pretty sure there are
multiple.
>> Yeah, I saw the benchmark but I was more curious what actual use case
>> would lead to behaviors like that because you'd have to hammer on those
>> stats really hard for this to be a problem. In most use cases that I'm
>> aware of, the polling frequencies of these stats are >= 1sec. I guess the
>> users in your use case were banging on them way harder, at least
>> previously.
>
> From what I know, the https://github.com/google/cadvisor instances
> deployed on the client machine hammered these stats. Sharing servers
> between independent teams or orgs in big corps is frequent. Every
> interested party deployed its own, or similar, instance. We can say just
> don't do that and be fine, but it will be happening anyway. It's better to
> just make rstats more robust.
I do think this is a valid use case. I just want to get some sense on the
numbers involved. Do you happen to know what frequency cAdvisor was polling
the stats at and how many instances were running? The numbers don't have to
be accurate. I just want to know the ballpark numbers.
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 11+ messages in thread
* RE: [PATCH v2] cgroup/rstat: change cgroup_base_stat to atomic
2025-07-04 17:57 ` tj
@ 2025-07-21 11:48 ` Wlodarczyk, Bertrand
0 siblings, 0 replies; 11+ messages in thread
From: Wlodarczyk, Bertrand @ 2025-07-21 11:48 UTC (permalink / raw)
To: tj@kernel.org
Cc: Shakeel Butt, hannes@cmpxchg.org, mkoutny@suse.com,
cgroups@vger.kernel.org, linux-kernel@vger.kernel.org,
inwardvessel@gmail.com
>>> Yeah, I saw the benchmark but I was more curious what actual use case
>>> would lead to behaviors like that because you'd have to hammer on
>>> those stats really hard for this to be a problem. In most use cases
>>> that I'm aware of, the polling frequencies of these stats are >=
>>> 1sec. I guess the users in your use case were banging on them way
>>> harder, at least previously.
>>
>> From what I know, the https://github.com/google/cadvisor instances
>> deployed on the client machine hammered these stats. Sharing servers
>> between independent teams or orgs in big corps is frequent. Every
>> interested party deployed its own, or similar, instance. We can say
>> just don't do that and be fine, but it will be happening anyway. It's
>> better to just make rstats more robust.
> I do think this is a valid use case. I just want to get some sense on the numbers involved. Do you happen to know what frequency cAdvisor was polling the stats at and how many instances were running? The numbers don't have to be accurate. I just want to know the ballpark numbers.
I'm quoting a colleague here:
"the frequency to call cadvisor, when every 1 ms call rstat_flush for each container(100 total), the contention is high. interval larger than 5ms we'll see less contention in my experiments."
Experiment,CPU,Core #,Workload,Container #,FS,Cgroup flush interval,Spin lock,Flush time spent for 1000 iterations (ms),Flush latency (avg) per iteration
1,GNR-AP,128,Proxy Load v1_lite,100,EXT4,1ms,,Min Time = 1751.91 ms / Max Time = 2232.26 ms / Avg Time = 1919.72 ms,919.72/1000
2,GNR-AP,128,Proxy Load v1_lite,100,EXT4,1.5ms,,Min Time = 1987.79 ms / Max Time = 2014.94 ms / Avg Time = 2001.14 ms,501.14/1000
3,GNR-AP,128,Proxy Load v1_lite,100,EXT4,1.7ms,,Min Time = 2025.42 ms / Max Time = 2044.56 ms / Avg Time = 2036.16 ms,336.16/1000
4,GNR-AP,128,Proxy Load v1_lite,100,EXT4,2ms,,Min Time = 2113.47 ms / Max Time = 2120.04 ms / Avg Time = 2116.33 ms,116.33/1000
5,GNR-AP,128,Proxy Load v1_lite,100,EXT4,5ms,,Min Time = 5160.85 ms / Max Time = 5170.68 ms / Avg Time = 5165.59 ms,165.59 / 1000 = 0.1656 ms
6,GNR-AP,128,Proxy Load v1_lite,100,EXT4,10ms,,Min Time = 10164.12 ms / Max Time = 10174.46 ms / Avg Time = 10168.97 ms,168.97 / 1000 = 0.169 ms
7,GNR-AP,128,Proxy Load v1_lite,100,EXT4,100ms,,Min Time = 100165.41 ms / Max Time = 100182.40 ms / Avg Time = 100174.89 ms,174.89 / 1000 = 0.1749 ms
It seems that the client had cAdvisor set to a 1 ms poll interval. I don't have any information on whether that was intentional or not (it seems too frequent).
Thanks,
Bertrand
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH v2] cgroup/rstat: change cgroup_base_stat to atomic
2025-06-27 13:15 ` Wlodarczyk, Bertrand
2025-06-27 16:50 ` tj
@ 2025-06-27 16:55 ` JP Kobryn
2025-06-27 17:17 ` Shakeel Butt
2 siblings, 0 replies; 11+ messages in thread
From: JP Kobryn @ 2025-06-27 16:55 UTC (permalink / raw)
To: Wlodarczyk, Bertrand, Shakeel Butt
Cc: tj@kernel.org, hannes@cmpxchg.org, mkoutny@suse.com,
cgroups@vger.kernel.org, linux-kernel@vger.kernel.org
On 6/27/25 6:15 AM, Wlodarczyk, Bertrand wrote:
>> The kernel faces scalability issues when multiple userspace programs
>> attempt to read cgroup statistics concurrently.
>>
>> The primary bottleneck is the css_cgroup_lock in cgroup_rstat_flush,
>> which prevents access and updates to the statistics of the css from
>> multiple CPUs in parallel.
>>
>> Given that rstat operates on a per-CPU basis and only aggregates
>> statistics in the parent cgroup, there is no compelling reason why
>> these statistics cannot be atomic.
>> By eliminating the lock during CPU statistics access, each CPU can
>> traverse its rstat hierarchy independently, without blocking.
>> Synchronization is achieved during parent propagation through atomic
>> operations.
>>
>> This change significantly enhances performance on commit
>> 8dcb0ed834a3ec03 ("memcg: cgroup: call css_rstat_updated irrespective
>> of in_nmi()") in scenarios where multiple CPUs accessCPU rstat within
>> a single cgroup hierarchy, yielding a performance improvement of around 40 times.
>> Notably, performance for memory and I/O rstats remains unchanged, as
>> the lock remains in place for these usages.
>>
>> Additionally, this patch addresses a race condition detectable in the
>> current mainline by KCSAN in __cgroup_account_cputime, which occurs
>> when attempting to read a single hierarchy from multiple CPUs.
>>
>> Signed-off-by: Bertrand Wlodarczyk <bertrand.wlodarczyk@intel.com>
>
>> This patch breaks memory controller as explained in the comments on the previous version.
>
> Ekhm... no? I addressed the issue and v2 has lock back and surrounding the call to dependent submodules?
> The behavior is the same as before patching.
>
> In the long term, in my opinion, the atomics should happen also in dependent submodules to eliminate locks
> completely.
>
> > Also the response to the tearing issue explained by JP is not satisfying.
>
> In other words, the claim is: "it's better to stall other cpus in spinlock plus disable IRQ every time in order to
> serve outdated snapshot instead of providing user to the freshest statistics much, much faster".
But they're not really "outdated" are they? Regardless of the wait, once
the lock is acquired they will get the latest snapshot.
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH v2] cgroup/rstat: change cgroup_base_stat to atomic
2025-06-27 13:15 ` Wlodarczyk, Bertrand
2025-06-27 16:50 ` tj
2025-06-27 16:55 ` JP Kobryn
@ 2025-06-27 17:17 ` Shakeel Butt
2 siblings, 0 replies; 11+ messages in thread
From: Shakeel Butt @ 2025-06-27 17:17 UTC (permalink / raw)
To: Wlodarczyk, Bertrand
Cc: tj@kernel.org, hannes@cmpxchg.org, mkoutny@suse.com,
cgroups@vger.kernel.org, linux-kernel@vger.kernel.org,
inwardvessel@gmail.com
On Fri, Jun 27, 2025 at 01:15:31PM +0000, Wlodarczyk, Bertrand wrote:
> > The kernel faces scalability issues when multiple userspace programs
> > attempt to read cgroup statistics concurrently.
> >
> > The primary bottleneck is the css_cgroup_lock in cgroup_rstat_flush,
> > which prevents access and updates to the statistics of the css from
> > multiple CPUs in parallel.
> >
> > Given that rstat operates on a per-CPU basis and only aggregates
> > statistics in the parent cgroup, there is no compelling reason why
> > these statistics cannot be atomic.
> > By eliminating the lock during CPU statistics access, each CPU can
> > traverse its rstat hierarchy independently, without blocking.
> > Synchronization is achieved during parent propagation through atomic
> > operations.
> >
> > This change significantly enhances performance on commit
> > 8dcb0ed834a3ec03 ("memcg: cgroup: call css_rstat_updated irrespective
> > of in_nmi()") in scenarios where multiple CPUs accessCPU rstat within
> > a single cgroup hierarchy, yielding a performance improvement of around 40 times.
> > Notably, performance for memory and I/O rstats remains unchanged, as
> > the lock remains in place for these usages.
> >
> > Additionally, this patch addresses a race condition detectable in the
> > current mainline by KCSAN in __cgroup_account_cputime, which occurs
> > when attempting to read a single hierarchy from multiple CPUs.
> >
> > Signed-off-by: Bertrand Wlodarczyk <bertrand.wlodarczyk@intel.com>
>
> > This patch breaks memory controller as explained in the comments on the previous version.
>
> Ekhm... no? I addressed the issue and v2 has lock back and surrounding the call to dependent submodules?
> The behavior is the same as before patching.
>
Oh you have moved the rstat lock just around pos->ss->css_rstat_flush().
Have you checked if __css_process_update_tree() is safe from concurrent
flushers for a given cpu and css?
^ permalink raw reply [flat|nested] 11+ messages in thread
end of thread, other threads: [~2025-07-21 11:48 UTC | newest]
Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-06-24 14:45 [PATCH v2] cgroup/rstat: change cgroup_base_stat to atomic Bertrand Wlodarczyk
2025-06-26 19:15 ` Shakeel Butt
2025-06-27 13:15 ` Wlodarczyk, Bertrand
2025-06-27 16:50 ` tj
2025-06-30 14:25 ` Wlodarczyk, Bertrand
2025-06-30 15:48 ` tj
2025-07-04 13:13 ` Wlodarczyk, Bertrand
2025-07-04 17:57 ` tj
2025-07-21 11:48 ` Wlodarczyk, Bertrand
2025-06-27 16:55 ` JP Kobryn
2025-06-27 17:17 ` Shakeel Butt