* [PATCH v2 0/4] sched: Run task_mm_cid_work in batches to lower latency
@ 2025-07-16 16:06 Gabriele Monaco
  2025-07-16 16:06 ` [PATCH v2 1/4] sched: Add prev_sum_exec_runtime support for RT, DL and SCX classes Gabriele Monaco
                   ` (3 more replies)
  0 siblings, 4 replies; 12+ messages in thread
From: Gabriele Monaco @ 2025-07-16 16:06 UTC (permalink / raw)
  To: linux-kernel, Mathieu Desnoyers, Peter Zijlstra, Ingo Molnar
  Cc: Gabriele Monaco

This v2 of [1] continues the work of [2] with a simpler approach.
The task_mm_cid_work currently runs as a task_work on return to
userspace and causes a non-negligible scheduling latency, mostly due to
its iterations over all cores.

Split the work into several batches: each call to task_mm_cid_work no
longer covers all CPUs but only a configurable number of them, and the
next run picks up where the previous one left off.
The mechanism that avoids running too frequently (100ms) is enforced
only once all CPUs have been scanned, that is, when the scan starts
again from CPU 0.
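As a rough illustration of the batching logic, here is a minimal
userspace model (not the kernel code; all names and values below are
made up for the example):

#include <stdbool.h>
#include <stdio.h>

#define NR_CPUS_MODEL	128	/* stand-in for the number of possible CPUs */
#define SCAN_BATCH	8	/* stand-in for CONFIG_RSEQ_CID_SCAN_BATCH */

static unsigned int scan_batch;		/* models mm->mm_cid_scan_batch */

static bool scan_period_elapsed(void)	/* stand-in for the 100ms check */
{
	return true;
}

static void one_scan_call(void)
{
	unsigned int from = scan_batch * SCAN_BATCH;

	if (from >= NR_CPUS_MODEL) {	/* last batch already done: wrap */
		from = 0;
		scan_batch = 0;
	}
	/* The rate limit applies only when a new full scan would start. */
	if (from == 0 && !scan_period_elapsed())
		return;
	for (unsigned int cpu = from; cpu < from + SCAN_BATCH && cpu < NR_CPUS_MODEL; cpu++)
		printf("scan cpu %u\n", cpu);
	scan_batch++;
}

int main(void)
{
	/* A few calls cover all CPUs in batches and then wrap around. */
	for (int i = 0; i < 20; i++)
		one_scan_call();
	return 0;
}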

Also improve the predictability of the scan for short-running tasks by
triggering it from rseq_handle_notify_resume, which runs on every task
switch (similar behaviour to [2]). The workaround on the tick for
long-running tasks introduced in [2] is ported here as well.

Patch 1 adds support for prev_sum_exec_runtime to the RT, deadline and
sched_ext classes, as already supported by fair. This is required to
avoid calling rseq_preempt on the tick if the runtime since the task
was last scheduled is below a threshold.

Patch 2 calls task_mm_cid_work directly instead of relying on a
task_work, which is necessary to avoid rseq_handle_notify_resume being
called twice while enqueuing a task_work.

Patch 3 splits the work into batches.

Patch 4 adds a selftest to validate the functionality of
task_mm_cid_work (i.e. that the mm_cids are compacted).

Changes since V1 [1]:
* Use cpu_possible_mask in scan.
* Make sure batches have the same number of CPUs also if mask is sparse.
* Run the scan from rseq_handle_notify_resume as in [2], calling it directly.
* Schedule the work and mm_cid update on tick for long running tasks.
* Fix condition for need_scan only on first batch.
* Change RSEQ_CID_SCAN_BATCH default to be a power of 2.
* Rebase selftest on [2].
* Increase the selftest timeout on large systems.

To: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
To: Peter Zijlstra <peterz@infradead.org>
To: Ingo Molnar <mingo@redhat.org>

[1] - https://lore.kernel.org/lkml/20250217112317.258716-1-gmonaco@redhat.com
[2] - https://lore.kernel.org/lkml/20250707144824.117014-1-gmonaco@redhat.com

Gabriele Monaco (4):
  sched: Add prev_sum_exec_runtime support for RT, DL and SCX classes
  rseq: Run the mm_cid_compaction from rseq_handle_notify_resume()
  sched: Compact RSEQ concurrency IDs in batches
  selftests/rseq: Add test for mm_cid compaction

 include/linux/mm.h                            |   2 +
 include/linux/mm_types.h                      |  26 +++
 include/linux/sched.h                         |   2 +-
 init/Kconfig                                  |  12 ++
 kernel/rseq.c                                 |   2 +
 kernel/sched/core.c                           |  92 ++++++--
 kernel/sched/deadline.c                       |   1 +
 kernel/sched/ext.c                            |   1 +
 kernel/sched/rt.c                             |   1 +
 kernel/sched/sched.h                          |   2 +
 tools/testing/selftests/rseq/.gitignore       |   1 +
 tools/testing/selftests/rseq/Makefile         |   2 +-
 .../selftests/rseq/mm_cid_compaction_test.c   | 204 ++++++++++++++++++
 13 files changed, 323 insertions(+), 25 deletions(-)
 create mode 100644 tools/testing/selftests/rseq/mm_cid_compaction_test.c


base-commit: 155a3c003e555a7300d156a5252c004c392ec6b0
-- 
2.50.1



* [PATCH v2 1/4] sched: Add prev_sum_exec_runtime support for RT, DL and SCX classes
  2025-07-16 16:06 [PATCH v2 0/4] sched: Run task_mm_cid_work in batches to lower latency Gabriele Monaco
@ 2025-07-16 16:06 ` Gabriele Monaco
  2025-07-16 16:06 ` [PATCH v2 2/4] rseq: Run the mm_cid_compaction from rseq_handle_notify_resume() Gabriele Monaco
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 12+ messages in thread
From: Gabriele Monaco @ 2025-07-16 16:06 UTC (permalink / raw)
  To: linux-kernel, Ingo Molnar, Peter Zijlstra, sched-ext
  Cc: Gabriele Monaco, Mathieu Desnoyers, Ingo Molnar

The fair scheduling class relies on prev_sum_exec_runtime to compute the
duration of the task's runtime since it was last scheduled. This value
is currently not required by the other scheduling classes, but it can
be useful to identify long-running tasks and take action accordingly
(e.g. during a scheduler tick).

Add support for prev_sum_exec_runtime to the RT, deadline and sched_ext
classes by simply assigning the sum_exec_runtime at each set_next_task.
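
For context, patch 2 later uses this value to compute how long a task
has been running since it was last picked. A minimal userspace model of
that computation (stand-in struct and values, not the kernel types):

#include <stdint.h>
#include <stdio.h>

/* Stand-in for the two sched_entity fields used here. */
struct se_model {
	uint64_t sum_exec_runtime;	/* total runtime accrued, in ns */
	uint64_t prev_sum_exec_runtime;	/* snapshot taken at set_next_task, in ns */
};

#define UNPREEMPTED_THRESHOLD_NS	(100ULL * 1000000)	/* 100ms, as in patch 2 */

int main(void)
{
	struct se_model se = {
		.sum_exec_runtime	= 250ULL * 1000000,
		.prev_sum_exec_runtime	= 100ULL * 1000000,
	};
	/* Runtime since the task was last scheduled in. */
	uint64_t rtime = se.sum_exec_runtime - se.prev_sum_exec_runtime;

	if (rtime >= UNPREEMPTED_THRESHOLD_NS)
		printf("task ran unpreempted for %llu ns, act on the tick\n",
		       (unsigned long long)rtime);
	return 0;
}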

Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
---
 kernel/sched/deadline.c | 1 +
 kernel/sched/ext.c      | 1 +
 kernel/sched/rt.c       | 1 +
 3 files changed, 3 insertions(+)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 89019a1408264..65ecd86bae37d 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -2389,6 +2389,7 @@ static void set_next_task_dl(struct rq *rq, struct task_struct *p, bool first)
 	p->se.exec_start = rq_clock_task(rq);
 	if (on_dl_rq(&p->dl))
 		update_stats_wait_end_dl(dl_rq, dl_se);
+	p->se.prev_sum_exec_runtime = p->se.sum_exec_runtime;
 
 	/* You can't push away the running task */
 	dequeue_pushable_dl_task(rq, p);
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index b498d867ba210..a4ac4386b9795 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -3255,6 +3255,7 @@ static void set_next_task_scx(struct rq *rq, struct task_struct *p, bool first)
 	}
 
 	p->se.exec_start = rq_clock_task(rq);
+	p->se.prev_sum_exec_runtime = p->se.sum_exec_runtime;
 
 	/* see dequeue_task_scx() on why we skip when !QUEUED */
 	if (SCX_HAS_OP(sch, running) && (p->scx.flags & SCX_TASK_QUEUED))
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index e40422c370335..2c70ff2042ee9 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1693,6 +1693,7 @@ static inline void set_next_task_rt(struct rq *rq, struct task_struct *p, bool f
 	p->se.exec_start = rq_clock_task(rq);
 	if (on_rt_rq(&p->rt))
 		update_stats_wait_end_rt(rt_rq, rt_se);
+	p->se.prev_sum_exec_runtime = p->se.sum_exec_runtime;
 
 	/* The running task is never eligible for pushing */
 	dequeue_pushable_task(rq, p);
-- 
2.50.1



* [PATCH v2 2/4] rseq: Run the mm_cid_compaction from rseq_handle_notify_resume()
  2025-07-16 16:06 [PATCH v2 0/4] sched: Run task_mm_cid_work in batches to lower latency Gabriele Monaco
  2025-07-16 16:06 ` [PATCH v2 1/4] sched: Add prev_sum_exec_runtime support for RT, DL and SCX classes Gabriele Monaco
@ 2025-07-16 16:06 ` Gabriele Monaco
  2025-08-26 18:01   ` Mathieu Desnoyers
  2025-07-16 16:06 ` [PATCH v2 3/4] sched: Compact RSEQ concurrency IDs in batches Gabriele Monaco
  2025-07-16 16:06 ` [PATCH v2 4/4] selftests/rseq: Add test for mm_cid compaction Gabriele Monaco
  3 siblings, 1 reply; 12+ messages in thread
From: Gabriele Monaco @ 2025-07-16 16:06 UTC (permalink / raw)
  To: linux-kernel, Andrew Morton, David Hildenbrand, Ingo Molnar,
	Peter Zijlstra, Mathieu Desnoyers, Paul E. McKenney, linux-mm
  Cc: Gabriele Monaco, Ingo Molnar

Currently the mm_cid compaction is triggered by the scheduler tick and
runs in a task_work. This behaviour is unpredictable for periodic tasks
with a short runtime, which may rarely be running when the tick fires.

Run the mm_cid compaction from the rseq_handle_notify_resume() call,
which runs from resume_user_mode_work. Since this is the same context
where the task_work would run, skip that step and call the compaction
function directly.
The compaction function still exits prematurely in case the scan is not
required, that is, when the pseudo-period of 100ms has not elapsed.

Keep a tick handler for long-running tasks that are never preempted
(i.e. that never call rseq_handle_notify_resume); it triggers a
compaction and an mm_cid update only in that case.
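
The tick-path decision can be summarised with the following userspace
model (illustrative only; the helpers are stand-ins for the checks done
in the task_tick_mm_cid() hunk below):

#include <stdbool.h>
#include <stdio.h>

/* Stand-ins for the actual checks done in task_tick_mm_cid() below. */
static bool compaction_pending(void)	{ return false; }	/* TIF_NOTIFY_RESUME already set */
static bool ran_long_unpreempted(void)	{ return true; }	/* rtime >= RSEQ_UNPREEMPTED_THRESHOLD */
static bool scan_is_due(void)		{ return true; }	/* mm_cid_needs_scan() */
static bool cid_reset_is_old(void)	{ return false; }	/* last reset older than MM_CID_SCAN_DELAY */

int main(void)
{
	if (compaction_pending() || !ran_long_unpreempted())
		return 0;	/* nothing to do on this tick */
	if (scan_is_due())
		puts("A: set TIF_NOTIFY_RESUME to trigger the compaction");
	else if (cid_reset_is_old())
		puts("B: refresh the task's mm_cid and rseq_preempt() if it changed");
	return 0;
}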

Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
---
 include/linux/mm.h       |  2 ++
 include/linux/mm_types.h | 11 ++++++++
 include/linux/sched.h    |  2 +-
 kernel/rseq.c            |  2 ++
 kernel/sched/core.c      | 55 +++++++++++++++++++++++++---------------
 kernel/sched/sched.h     |  2 ++
 6 files changed, 53 insertions(+), 21 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index fa538feaa8d95..cc8c1c9ae26c1 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2294,6 +2294,7 @@ void sched_mm_cid_before_execve(struct task_struct *t);
 void sched_mm_cid_after_execve(struct task_struct *t);
 void sched_mm_cid_fork(struct task_struct *t);
 void sched_mm_cid_exit_signals(struct task_struct *t);
+void task_mm_cid_work(struct task_struct *t);
 static inline int task_mm_cid(struct task_struct *t)
 {
 	return t->mm_cid;
@@ -2303,6 +2304,7 @@ static inline void sched_mm_cid_before_execve(struct task_struct *t) { }
 static inline void sched_mm_cid_after_execve(struct task_struct *t) { }
 static inline void sched_mm_cid_fork(struct task_struct *t) { }
 static inline void sched_mm_cid_exit_signals(struct task_struct *t) { }
+static inline void task_mm_cid_work(struct task_struct *t) { }
 static inline int task_mm_cid(struct task_struct *t)
 {
 	/*
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index d6b91e8a66d6d..e6d6e468e64b4 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -1420,6 +1420,13 @@ static inline void mm_set_cpus_allowed(struct mm_struct *mm, const struct cpumas
 	WRITE_ONCE(mm->nr_cpus_allowed, cpumask_weight(mm_allowed));
 	raw_spin_unlock(&mm->cpus_allowed_lock);
 }
+
+static inline bool mm_cid_needs_scan(struct mm_struct *mm)
+{
+	if (!mm)
+		return false;
+	return time_after(jiffies, READ_ONCE(mm->mm_cid_next_scan));
+}
 #else /* CONFIG_SCHED_MM_CID */
 static inline void mm_init_cid(struct mm_struct *mm, struct task_struct *p) { }
 static inline int mm_alloc_cid(struct mm_struct *mm, struct task_struct *p) { return 0; }
@@ -1430,6 +1437,10 @@ static inline unsigned int mm_cid_size(void)
 	return 0;
 }
 static inline void mm_set_cpus_allowed(struct mm_struct *mm, const struct cpumask *cpumask) { }
+static inline bool mm_cid_needs_scan(struct mm_struct *mm)
+{
+	return false;
+}
 #endif /* CONFIG_SCHED_MM_CID */
 
 struct mmu_gather;
diff --git a/include/linux/sched.h b/include/linux/sched.h
index aa9c5be7a6325..a75f61cea2271 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1428,7 +1428,7 @@ struct task_struct {
 	int				last_mm_cid;	/* Most recent cid in mm */
 	int				migrate_from_cpu;
 	int				mm_cid_active;	/* Whether cid bitmap is active */
-	struct callback_head		cid_work;
+	unsigned long			last_cid_reset;	/* Time of last reset in jiffies */
 #endif
 
 	struct tlbflush_unmap_batch	tlb_ubc;
diff --git a/kernel/rseq.c b/kernel/rseq.c
index b7a1ec327e811..100f81e330dc6 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -441,6 +441,8 @@ void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs)
 	}
 	if (unlikely(rseq_update_cpu_node_id(t)))
 		goto error;
+	/* The mm_cid compaction returns prematurely if scan is not needed. */
+	task_mm_cid_work(t);
 	return;
 
 error:
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 81c6df746df17..27b856a1cb0a9 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -10589,22 +10589,13 @@ static void sched_mm_cid_remote_clear_weight(struct mm_struct *mm, int cpu,
 	sched_mm_cid_remote_clear(mm, pcpu_cid, cpu);
 }
 
-static void task_mm_cid_work(struct callback_head *work)
+void task_mm_cid_work(struct task_struct *t)
 {
 	unsigned long now = jiffies, old_scan, next_scan;
-	struct task_struct *t = current;
 	struct cpumask *cidmask;
-	struct mm_struct *mm;
 	int weight, cpu;
+	struct mm_struct *mm = t->mm;
 
-	WARN_ON_ONCE(t != container_of(work, struct task_struct, cid_work));
-
-	work->next = work;	/* Prevent double-add */
-	if (t->flags & PF_EXITING)
-		return;
-	mm = t->mm;
-	if (!mm)
-		return;
 	old_scan = READ_ONCE(mm->mm_cid_next_scan);
 	next_scan = now + msecs_to_jiffies(MM_CID_SCAN_DELAY);
 	if (!old_scan) {
@@ -10643,23 +10634,47 @@ void init_sched_mm_cid(struct task_struct *t)
 		if (mm_users == 1)
 			mm->mm_cid_next_scan = jiffies + msecs_to_jiffies(MM_CID_SCAN_DELAY);
 	}
-	t->cid_work.next = &t->cid_work;	/* Protect against double add */
-	init_task_work(&t->cid_work, task_mm_cid_work);
 }
 
 void task_tick_mm_cid(struct rq *rq, struct task_struct *curr)
 {
-	struct callback_head *work = &curr->cid_work;
-	unsigned long now = jiffies;
+	u64 rtime = curr->se.sum_exec_runtime - curr->se.prev_sum_exec_runtime;
 
+	/*
+	 * If a task is running unpreempted for a long time, it won't get its
+	 * mm_cid compacted and won't update its mm_cid value after a
+	 * compaction occurs.
+	 * For such a task, this function does two things:
+	 * A) trigger the mm_cid recompaction,
+	 * B) trigger an update of the task's rseq->mm_cid field at some point
+	 * after recompaction, so it can get a mm_cid value closer to 0.
+	 * A change in the mm_cid triggers an rseq_preempt.
+	 *
+	 * B occurs once after the compaction work completes, neither A nor B
+	 * run as long as the compaction work is pending, the task is exiting
+	 * or is not a userspace task.
+	 */
 	if (!curr->mm || (curr->flags & (PF_EXITING | PF_KTHREAD)) ||
-	    work->next != work)
+	    test_tsk_thread_flag(curr, TIF_NOTIFY_RESUME))
 		return;
-	if (time_before(now, READ_ONCE(curr->mm->mm_cid_next_scan)))
+	if (rtime < RSEQ_UNPREEMPTED_THRESHOLD)
 		return;
-
-	/* No page allocation under rq lock */
-	task_work_add(curr, work, TWA_RESUME);
+	if (mm_cid_needs_scan(curr->mm)) {
+		/* Trigger mm_cid recompaction */
+		rseq_set_notify_resume(curr);
+	} else if (time_after(jiffies, curr->last_cid_reset +
+			      msecs_to_jiffies(MM_CID_SCAN_DELAY))) {
+		/* Update mm_cid field */
+		int old_cid = curr->mm_cid;
+
+		if (!curr->mm_cid_active)
+			return;
+		mm_cid_snapshot_time(rq, curr->mm);
+		mm_cid_put_lazy(curr);
+		curr->last_mm_cid = curr->mm_cid = mm_cid_get(rq, curr, curr->mm);
+		if (old_cid != curr->mm_cid)
+			rseq_preempt(curr);
+	}
 }
 
 void sched_mm_cid_exit_signals(struct task_struct *t)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 475bb5998295e..90a5b58188232 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3606,6 +3606,7 @@ extern const char *preempt_modes[];
 
 #define SCHED_MM_CID_PERIOD_NS	(100ULL * 1000000)	/* 100ms */
 #define MM_CID_SCAN_DELAY	100			/* 100ms */
+#define RSEQ_UNPREEMPTED_THRESHOLD	SCHED_MM_CID_PERIOD_NS
 
 extern raw_spinlock_t cid_lock;
 extern int use_cid_lock;
@@ -3809,6 +3810,7 @@ static inline int mm_cid_get(struct rq *rq, struct task_struct *t,
 	int cid;
 
 	lockdep_assert_rq_held(rq);
+	t->last_cid_reset = jiffies;
 	cpumask = mm_cidmask(mm);
 	cid = __this_cpu_read(pcpu_cid->cid);
 	if (mm_cid_is_valid(cid)) {
-- 
2.50.1



* [PATCH v2 3/4] sched: Compact RSEQ concurrency IDs in batches
  2025-07-16 16:06 [PATCH v2 0/4] sched: Run task_mm_cid_work in batches to lower latency Gabriele Monaco
  2025-07-16 16:06 ` [PATCH v2 1/4] sched: Add prev_sum_exec_runtime support for RT, DL and SCX classes Gabriele Monaco
  2025-07-16 16:06 ` [PATCH v2 2/4] rseq: Run the mm_cid_compaction from rseq_handle_notify_resume() Gabriele Monaco
@ 2025-07-16 16:06 ` Gabriele Monaco
  2025-08-05 12:42   ` Gabriele Monaco
  2025-08-26 18:10   ` Mathieu Desnoyers
  2025-07-16 16:06 ` [PATCH v2 4/4] selftests/rseq: Add test for mm_cid compaction Gabriele Monaco
  3 siblings, 2 replies; 12+ messages in thread
From: Gabriele Monaco @ 2025-07-16 16:06 UTC (permalink / raw)
  To: linux-kernel, Andrew Morton, David Hildenbrand, Ingo Molnar,
	Peter Zijlstra, Mathieu Desnoyers, linux-mm
  Cc: Gabriele Monaco, Ingo Molnar

Currently, task_mm_cid_work() is called from resume_user_mode_work().
This can delay the execution of the corresponding thread for the entire
duration of the function, negatively affecting the response time of
real-time tasks.
In practice, we observe task_mm_cid_work increasing the latency by
30-35us on a 128-core system; this order of magnitude is significant
under PREEMPT_RT.

Run task_mm_cid_work in batches of up to CONFIG_RSEQ_CID_SCAN_BATCH
CPUs, which reduces the duration of the delay for each scan.

task_mm_cid_work contains a mechanism to avoid running more frequently
than every 100ms. Keep this pseudo-periodicity only for complete scans:
each call to task_mm_cid_work returns prematurely if the period has not
elapsed and no scan is ongoing (i.e. the next batch to scan is the
first one).
This way full scans are not excessively delayed, while each run, and
the latency it introduces, stays short.
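
A compact userspace model of the resulting "is a scan due" check
(illustrative only, mirroring the mm_cid_needs_scan() change below):

#include <stdbool.h>
#include <stdio.h>

#define SCAN_BATCH	8	/* stand-in for CONFIG_RSEQ_CID_SCAN_BATCH */

static unsigned int nr_possible_cpus = 128;
static unsigned int scan_batch;		/* models mm->mm_cid_scan_batch */
static bool period_elapsed;		/* models time_after(jiffies, mm_cid_next_scan) */

static bool cid_needs_scan(void)
{
	/* A scan in progress keeps going regardless of the 100ms period. */
	if (scan_batch && scan_batch * SCAN_BATCH < nr_possible_cpus)
		return true;
	/* Otherwise a new full scan starts only once the period elapsed. */
	return period_elapsed;
}

int main(void)
{
	scan_batch = 3;		/* mid-scan: always due */
	printf("mid-scan: %d\n", cid_needs_scan());
	scan_batch = 0;		/* new full scan: gated by the period */
	printf("new scan, period not elapsed: %d\n", cid_needs_scan());
	return 0;
}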

Fixes: 223baf9d17f2 ("sched: Fix performance regression introduced by mm_cid")
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
---
 include/linux/mm_types.h | 15 +++++++++++++++
 init/Kconfig             | 12 ++++++++++++
 kernel/sched/core.c      | 37 ++++++++++++++++++++++++++++++++++---
 3 files changed, 61 insertions(+), 3 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index e6d6e468e64b4..a822966a584f3 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -995,6 +995,13 @@ struct mm_struct {
 		 * When the next mm_cid scan is due (in jiffies).
 		 */
 		unsigned long mm_cid_next_scan;
+		/*
+		 * @mm_cid_scan_batch: Counter for batch used in the next scan.
+		 *
+		 * Scan in batches of CONFIG_RSEQ_CID_SCAN_BATCH. This field
+		 * increments at each scan and reset when all batches are done.
+		 */
+		unsigned int mm_cid_scan_batch;
 		/**
 		 * @nr_cpus_allowed: Number of CPUs allowed for mm.
 		 *
@@ -1385,6 +1392,7 @@ static inline void mm_init_cid(struct mm_struct *mm, struct task_struct *p)
 	raw_spin_lock_init(&mm->cpus_allowed_lock);
 	cpumask_copy(mm_cpus_allowed(mm), &p->cpus_mask);
 	cpumask_clear(mm_cidmask(mm));
+	mm->mm_cid_scan_batch = 0;
 }
 
 static inline int mm_alloc_cid_noprof(struct mm_struct *mm, struct task_struct *p)
@@ -1423,8 +1431,15 @@ static inline void mm_set_cpus_allowed(struct mm_struct *mm, const struct cpumas
 
 static inline bool mm_cid_needs_scan(struct mm_struct *mm)
 {
+	unsigned int next_batch;
+
 	if (!mm)
 		return false;
+	next_batch = READ_ONCE(mm->mm_cid_scan_batch);
+	/* Always needs scan unless it's the first batch. */
+	if (CONFIG_RSEQ_CID_SCAN_BATCH * next_batch < num_possible_cpus() &&
+	    next_batch)
+		return true;
 	return time_after(jiffies, READ_ONCE(mm->mm_cid_next_scan));
 }
 #else /* CONFIG_SCHED_MM_CID */
diff --git a/init/Kconfig b/init/Kconfig
index 666783eb50abd..98d7f078cd6df 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1860,6 +1860,18 @@ config DEBUG_RSEQ
 
 	  If unsure, say N.
 
+config RSEQ_CID_SCAN_BATCH
+	int "Number of CPUs to scan at every mm_cid compaction attempt"
+	range 1 NR_CPUS
+	default 8
+	depends on SCHED_MM_CID
+	help
+	  CPUs are scanned pseudo-periodically to compact the CID of each task,
+	  this operation can take a longer amount of time on systems with many
+	  CPUs, resulting in higher scheduling latency for the current task.
+	  A higher value means the CID is compacted faster, but results in
+	  higher scheduling latency.
+
 config CACHESTAT_SYSCALL
 	bool "Enable cachestat() system call" if EXPERT
 	default y
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 27b856a1cb0a9..eae4c8faf980b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -10591,11 +10591,26 @@ static void sched_mm_cid_remote_clear_weight(struct mm_struct *mm, int cpu,
 
 void task_mm_cid_work(struct task_struct *t)
 {
+	int weight, cpu, from_cpu, this_batch, next_batch, idx;
 	unsigned long now = jiffies, old_scan, next_scan;
 	struct cpumask *cidmask;
-	int weight, cpu;
 	struct mm_struct *mm = t->mm;
 
+	/*
+	 * This function is called from __rseq_handle_notify_resume, which
+	 * makes sure t is a user thread and is not exiting.
+	 */
+	this_batch = READ_ONCE(mm->mm_cid_scan_batch);
+	next_batch = this_batch + 1;
+	from_cpu = cpumask_nth(this_batch * CONFIG_RSEQ_CID_SCAN_BATCH,
+			       cpu_possible_mask);
+	if (from_cpu >= nr_cpu_ids) {
+		from_cpu = 0;
+		next_batch = 1;
+	}
+	/* Delay scan only if we are done with all cpus. */
+	if (from_cpu != 0)
+		goto cid_compact;
 	old_scan = READ_ONCE(mm->mm_cid_next_scan);
 	next_scan = now + msecs_to_jiffies(MM_CID_SCAN_DELAY);
 	if (!old_scan) {
@@ -10611,17 +10626,33 @@ void task_mm_cid_work(struct task_struct *t)
 		return;
 	if (!try_cmpxchg(&mm->mm_cid_next_scan, &old_scan, next_scan))
 		return;
+
+cid_compact:
+	if (!try_cmpxchg(&mm->mm_cid_scan_batch, &this_batch, next_batch))
+		return;
 	cidmask = mm_cidmask(mm);
 	/* Clear cids that were not recently used. */
-	for_each_possible_cpu(cpu)
+	idx = 0;
+	cpu = from_cpu;
+	for_each_cpu_from(cpu, cpu_possible_mask) {
+		if (idx == CONFIG_RSEQ_CID_SCAN_BATCH)
+			break;
 		sched_mm_cid_remote_clear_old(mm, cpu);
+		++idx;
+	}
 	weight = cpumask_weight(cidmask);
 	/*
 	 * Clear cids that are greater or equal to the cidmask weight to
 	 * recompact it.
 	 */
-	for_each_possible_cpu(cpu)
+	idx = 0;
+	cpu = from_cpu;
+	for_each_cpu_from(cpu, cpu_possible_mask) {
+		if (idx == CONFIG_RSEQ_CID_SCAN_BATCH)
+			break;
 		sched_mm_cid_remote_clear_weight(mm, cpu, weight);
+		++idx;
+	}
 }
 
 void init_sched_mm_cid(struct task_struct *t)
-- 
2.50.1



* [PATCH v2 4/4] selftests/rseq: Add test for mm_cid compaction
  2025-07-16 16:06 [PATCH v2 0/4] sched: Run task_mm_cid_work in batches to lower latency Gabriele Monaco
                   ` (2 preceding siblings ...)
  2025-07-16 16:06 ` [PATCH v2 3/4] sched: Compact RSEQ concurrency IDs in batches Gabriele Monaco
@ 2025-07-16 16:06 ` Gabriele Monaco
  3 siblings, 0 replies; 12+ messages in thread
From: Gabriele Monaco @ 2025-07-16 16:06 UTC (permalink / raw)
  To: linux-kernel, Mathieu Desnoyers, Peter Zijlstra, Paul E. McKenney,
	Shuah Khan, linux-kselftest
  Cc: Gabriele Monaco, Shuah Khan, Ingo Molnar

A task in the kernel (task_mm_cid_work) runs somewhat periodically to
compact the mm_cid for each process. Add a test to validate that it
runs correctly and in a timely manner.

The test spawns one thread pinned to each CPU; each thread, including
the main one, then runs in short bursts for some time. During this
period, the mm_cids should span all values between 0 and nproc.

At the end of this phase, a thread with a high enough mm_cid
(>= nproc/2) is selected to be the new leader; all other threads
terminate.

After some time, the only remaining thread should see an mm_cid of 0;
if that doesn't happen, the compaction mechanism didn't work and the
test fails.
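
The leader's final check is essentially the following loop (simplified
excerpt from the test added below):

	for (i = 0; i < MM_CID_COMPACT_TIMEOUT; i++) {
		if (rseq_current_mm_cid() == 0)
			exit(EXIT_SUCCESS);	/* compaction happened, test passes */
		usleep(RUNNER_PERIOD);		/* wait and check again */
	}
	exit(EXIT_FAILURE);			/* mm_cid never converged to 0 */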

The test cannot fail if only one core is available, in which case
nothing can be tested since the only available mm_cid is 0.

Acked-by: Shuah Khan <skhan@linuxfoundation.org>
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
---
 tools/testing/selftests/rseq/.gitignore       |   1 +
 tools/testing/selftests/rseq/Makefile         |   2 +-
 .../selftests/rseq/mm_cid_compaction_test.c   | 204 ++++++++++++++++++
 3 files changed, 206 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/rseq/mm_cid_compaction_test.c

diff --git a/tools/testing/selftests/rseq/.gitignore b/tools/testing/selftests/rseq/.gitignore
index 0fda241fa62b0..b3920c59bf401 100644
--- a/tools/testing/selftests/rseq/.gitignore
+++ b/tools/testing/selftests/rseq/.gitignore
@@ -3,6 +3,7 @@ basic_percpu_ops_test
 basic_percpu_ops_mm_cid_test
 basic_test
 basic_rseq_op_test
+mm_cid_compaction_test
 param_test
 param_test_benchmark
 param_test_compare_twice
diff --git a/tools/testing/selftests/rseq/Makefile b/tools/testing/selftests/rseq/Makefile
index 0d0a5fae59547..bc4d940f66d40 100644
--- a/tools/testing/selftests/rseq/Makefile
+++ b/tools/testing/selftests/rseq/Makefile
@@ -17,7 +17,7 @@ OVERRIDE_TARGETS = 1
 TEST_GEN_PROGS = basic_test basic_percpu_ops_test basic_percpu_ops_mm_cid_test param_test \
 		param_test_benchmark param_test_compare_twice param_test_mm_cid \
 		param_test_mm_cid_benchmark param_test_mm_cid_compare_twice \
-		syscall_errors_test
+		syscall_errors_test mm_cid_compaction_test
 
 TEST_GEN_PROGS_EXTENDED = librseq.so
 
diff --git a/tools/testing/selftests/rseq/mm_cid_compaction_test.c b/tools/testing/selftests/rseq/mm_cid_compaction_test.c
new file mode 100644
index 0000000000000..d13623625f5a9
--- /dev/null
+++ b/tools/testing/selftests/rseq/mm_cid_compaction_test.c
@@ -0,0 +1,204 @@
+// SPDX-License-Identifier: LGPL-2.1
+#define _GNU_SOURCE
+#include <assert.h>
+#include <pthread.h>
+#include <sched.h>
+#include <stdint.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <stddef.h>
+
+#include "../kselftest.h"
+#include "rseq.h"
+
+#define VERBOSE 0
+#define printf_verbose(fmt, ...)                    \
+	do {                                        \
+		if (VERBOSE)                        \
+			printf(fmt, ##__VA_ARGS__); \
+	} while (0)
+
+/* 50 ms */
+#define RUNNER_PERIOD 50000
+/*
+ * Number of runs before we terminate or get the token.
+ * The number is slowly increasing with the number of CPUs as the compaction
+ * process can take longer on larger systems. This is an arbitrary value.
+ */
+#define THREAD_RUNS (3 + args->num_cpus/8)
+
+/*
+ * Number of times we check that the mm_cid were compacted.
+ * Checks are repeated every RUNNER_PERIOD.
+ */
+#define MM_CID_COMPACT_TIMEOUT 10
+
+struct thread_args {
+	int cpu;
+	int num_cpus;
+	pthread_mutex_t *token;
+	pthread_barrier_t *barrier;
+	pthread_t *tinfo;
+	struct thread_args *args_head;
+};
+
+static void __noreturn *thread_runner(void *arg)
+{
+	struct thread_args *args = arg;
+	int i, ret, curr_mm_cid;
+	cpu_set_t cpumask;
+
+	CPU_ZERO(&cpumask);
+	CPU_SET(args->cpu, &cpumask);
+	ret = pthread_setaffinity_np(pthread_self(), sizeof(cpumask), &cpumask);
+	if (ret) {
+		errno = ret;
+		perror("Error: failed to set affinity");
+		abort();
+	}
+	pthread_barrier_wait(args->barrier);
+
+	for (i = 0; i < THREAD_RUNS; i++)
+		usleep(RUNNER_PERIOD);
+	curr_mm_cid = rseq_current_mm_cid();
+	/*
+	 * We select one thread with high enough mm_cid to be the new leader.
+	 * All other threads (including the main thread) will terminate.
+	 * After some time, the mm_cid of the only remaining thread should
+	 * converge to 0, if not, the test fails.
+	 */
+	if (curr_mm_cid >= args->num_cpus / 2 &&
+	    !pthread_mutex_trylock(args->token)) {
+		printf_verbose(
+			"cpu%d has mm_cid=%d and will be the new leader.\n",
+			sched_getcpu(), curr_mm_cid);
+		for (i = 0; i < args->num_cpus; i++) {
+			if (args->tinfo[i] == pthread_self())
+				continue;
+			ret = pthread_join(args->tinfo[i], NULL);
+			if (ret) {
+				errno = ret;
+				perror("Error: failed to join thread");
+				abort();
+			}
+		}
+		pthread_barrier_destroy(args->barrier);
+		free(args->tinfo);
+		free(args->token);
+		free(args->barrier);
+		free(args->args_head);
+
+		for (i = 0; i < MM_CID_COMPACT_TIMEOUT; i++) {
+			curr_mm_cid = rseq_current_mm_cid();
+			printf_verbose("run %d: mm_cid=%d on cpu%d.\n", i,
+				       curr_mm_cid, sched_getcpu());
+			if (curr_mm_cid == 0)
+				exit(EXIT_SUCCESS);
+			usleep(RUNNER_PERIOD);
+		}
+		exit(EXIT_FAILURE);
+	}
+	printf_verbose("cpu%d has mm_cid=%d and is going to terminate.\n",
+		       sched_getcpu(), curr_mm_cid);
+	pthread_exit(NULL);
+}
+
+int test_mm_cid_compaction(void)
+{
+	cpu_set_t affinity;
+	int i, j, ret = 0, num_threads;
+	pthread_t *tinfo;
+	pthread_mutex_t *token;
+	pthread_barrier_t *barrier;
+	struct thread_args *args;
+
+	sched_getaffinity(0, sizeof(affinity), &affinity);
+	num_threads = CPU_COUNT(&affinity);
+	tinfo = calloc(num_threads, sizeof(*tinfo));
+	if (!tinfo) {
+		perror("Error: failed to allocate tinfo");
+		return -1;
+	}
+	args = calloc(num_threads, sizeof(*args));
+	if (!args) {
+		perror("Error: failed to allocate args");
+		ret = -1;
+		goto out_free_tinfo;
+	}
+	token = malloc(sizeof(*token));
+	if (!token) {
+		perror("Error: failed to allocate token");
+		ret = -1;
+		goto out_free_args;
+	}
+	barrier = malloc(sizeof(*barrier));
+	if (!barrier) {
+		perror("Error: failed to allocate barrier");
+		ret = -1;
+		goto out_free_token;
+	}
+	if (num_threads == 1) {
+		fprintf(stderr, "Cannot test on a single cpu. "
+				"Skipping mm_cid_compaction test.\n");
+		/* only skipping the test, this is not a failure */
+		goto out_free_barrier;
+	}
+	pthread_mutex_init(token, NULL);
+	ret = pthread_barrier_init(barrier, NULL, num_threads);
+	if (ret) {
+		errno = ret;
+		perror("Error: failed to initialise barrier");
+		goto out_free_barrier;
+	}
+	for (i = 0, j = 0; i < CPU_SETSIZE && j < num_threads; i++) {
+		if (!CPU_ISSET(i, &affinity))
+			continue;
+		args[j].num_cpus = num_threads;
+		args[j].tinfo = tinfo;
+		args[j].token = token;
+		args[j].barrier = barrier;
+		args[j].cpu = i;
+		args[j].args_head = args;
+		if (!j) {
+			/* The first thread is the main one */
+			tinfo[0] = pthread_self();
+			++j;
+			continue;
+		}
+		ret = pthread_create(&tinfo[j], NULL, thread_runner, &args[j]);
+		if (ret) {
+			errno = ret;
+			perror("Error: failed to create thread");
+			abort();
+		}
+		++j;
+	}
+	printf_verbose("Started %d threads.\n", num_threads);
+
+	/* Also main thread will terminate if it is not selected as leader */
+	thread_runner(&args[0]);
+
+	/* only reached in case of errors */
+out_free_barrier:
+	free(barrier);
+out_free_token:
+	free(token);
+out_free_args:
+	free(args);
+out_free_tinfo:
+	free(tinfo);
+
+	return ret;
+}
+
+int main(int argc, char **argv)
+{
+	if (!rseq_mm_cid_available()) {
+		fprintf(stderr, "Error: rseq_mm_cid unavailable\n");
+		return -1;
+	}
+	if (test_mm_cid_compaction())
+		return -1;
+	return 0;
+}
-- 
2.50.1



* Re: [PATCH v2 3/4] sched: Compact RSEQ concurrency IDs in batches
  2025-07-16 16:06 ` [PATCH v2 3/4] sched: Compact RSEQ concurrency IDs in batches Gabriele Monaco
@ 2025-08-05 12:42   ` Gabriele Monaco
  2025-08-06 16:57     ` Mathieu Desnoyers
  2025-08-26 18:10   ` Mathieu Desnoyers
  1 sibling, 1 reply; 12+ messages in thread
From: Gabriele Monaco @ 2025-08-05 12:42 UTC (permalink / raw)
  To: linux-kernel, Mathieu Desnoyers
  Cc: Andrew Morton, David Hildenbrand, Ingo Molnar, Peter Zijlstra

On Wed, 2025-07-16 at 18:06 +0200, Gabriele Monaco wrote:
> Currently, task_mm_cid_work() is called from resume_user_mode_work().
> This can delay the execution of the corresponding thread for the
> entire duration of the function, negatively affecting the response in
> case of real time tasks.
> In practice, we observe task_mm_cid_work increasing the latency of
> 30-35us on a 128 cores system, this order of magnitude is meaningful
> under PREEMPT_RT.
> 
> Run the task_mm_cid_work in batches of up to
> CONFIG_RSEQ_CID_SCAN_BATCH CPUs, this reduces the duration of the
> delay for each scan.
> 
> The task_mm_cid_work contains a mechanism to avoid running more
> frequently than every 100ms. Keep this pseudo-periodicity only on
> complete scans.
> This means each call to task_mm_cid_work returns prematurely if the
> period did not elapse and a scan is not ongoing (i.e. the next batch
> to scan is not the first).
> This way full scans are not excessively delayed while still keeping
> each run, and introduced latency, short.
> 

Mathieu, would you have some time to look at this implementation?

Thanks,
Gabriele

> Fixes: 223baf9d17f2 ("sched: Fix performance regression introduced by
> mm_cid")
> Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
> ---
>  include/linux/mm_types.h | 15 +++++++++++++++
>  init/Kconfig             | 12 ++++++++++++
>  kernel/sched/core.c      | 37 ++++++++++++++++++++++++++++++++++---
>  3 files changed, 61 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index e6d6e468e64b4..a822966a584f3 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -995,6 +995,13 @@ struct mm_struct {
>  		 * When the next mm_cid scan is due (in jiffies).
>  		 */
>  		unsigned long mm_cid_next_scan;
> +		/*
> +		 * @mm_cid_scan_batch: Counter for batch used in the
> next scan.
> +		 *
> +		 * Scan in batches of CONFIG_RSEQ_CID_SCAN_BATCH.
> This field
> +		 * increments at each scan and reset when all
> batches are done.
> +		 */
> +		unsigned int mm_cid_scan_batch;
>  		/**
>  		 * @nr_cpus_allowed: Number of CPUs allowed for mm.
>  		 *
> @@ -1385,6 +1392,7 @@ static inline void mm_init_cid(struct mm_struct
> *mm, struct task_struct *p)
>  	raw_spin_lock_init(&mm->cpus_allowed_lock);
>  	cpumask_copy(mm_cpus_allowed(mm), &p->cpus_mask);
>  	cpumask_clear(mm_cidmask(mm));
> +	mm->mm_cid_scan_batch = 0;
>  }
>  
>  static inline int mm_alloc_cid_noprof(struct mm_struct *mm, struct
> task_struct *p)
> @@ -1423,8 +1431,15 @@ static inline void mm_set_cpus_allowed(struct
> mm_struct *mm, const struct cpumas
>  
>  static inline bool mm_cid_needs_scan(struct mm_struct *mm)
>  {
> +	unsigned int next_batch;
> +
>  	if (!mm)
>  		return false;
> +	next_batch = READ_ONCE(mm->mm_cid_scan_batch);
> +	/* Always needs scan unless it's the first batch. */
> +	if (CONFIG_RSEQ_CID_SCAN_BATCH * next_batch <
> num_possible_cpus() &&
> +	    next_batch)
> +		return true;
>  	return time_after(jiffies, READ_ONCE(mm->mm_cid_next_scan));
>  }
>  #else /* CONFIG_SCHED_MM_CID */
> diff --git a/init/Kconfig b/init/Kconfig
> index 666783eb50abd..98d7f078cd6df 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -1860,6 +1860,18 @@ config DEBUG_RSEQ
>  
>  	  If unsure, say N.
>  
> +config RSEQ_CID_SCAN_BATCH
> +	int "Number of CPUs to scan at every mm_cid compaction
> attempt"
> +	range 1 NR_CPUS
> +	default 8
> +	depends on SCHED_MM_CID
> +	help
> +	  CPUs are scanned pseudo-periodically to compact the CID of
> each task,
> +	  this operation can take a longer amount of time on systems
> with many
> +	  CPUs, resulting in higher scheduling latency for the
> current task.
> +	  A higher value means the CID is compacted faster, but
> results in
> +	  higher scheduling latency.
> +
>  config CACHESTAT_SYSCALL
>  	bool "Enable cachestat() system call" if EXPERT
>  	default y
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 27b856a1cb0a9..eae4c8faf980b 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -10591,11 +10591,26 @@ static void
> sched_mm_cid_remote_clear_weight(struct mm_struct *mm, int cpu,
>  
>  void task_mm_cid_work(struct task_struct *t)
>  {
> +	int weight, cpu, from_cpu, this_batch, next_batch, idx;
>  	unsigned long now = jiffies, old_scan, next_scan;
>  	struct cpumask *cidmask;
> -	int weight, cpu;
>  	struct mm_struct *mm = t->mm;
>  
> +	/*
> +	 * This function is called from __rseq_handle_notify_resume,
> which
> +	 * makes sure t is a user thread and is not exiting.
> +	 */
> +	this_batch = READ_ONCE(mm->mm_cid_scan_batch);
> +	next_batch = this_batch + 1;
> +	from_cpu = cpumask_nth(this_batch *
> CONFIG_RSEQ_CID_SCAN_BATCH,
> +			       cpu_possible_mask);
> +	if (from_cpu >= nr_cpu_ids) {
> +		from_cpu = 0;
> +		next_batch = 1;
> +	}
> +	/* Delay scan only if we are done with all cpus. */
> +	if (from_cpu != 0)
> +		goto cid_compact;
>  	old_scan = READ_ONCE(mm->mm_cid_next_scan);
>  	next_scan = now + msecs_to_jiffies(MM_CID_SCAN_DELAY);
>  	if (!old_scan) {
> @@ -10611,17 +10626,33 @@ void task_mm_cid_work(struct task_struct
> *t)
>  		return;
>  	if (!try_cmpxchg(&mm->mm_cid_next_scan, &old_scan,
> next_scan))
>  		return;
> +
> +cid_compact:
> +	if (!try_cmpxchg(&mm->mm_cid_scan_batch, &this_batch,
> next_batch))
> +		return;
>  	cidmask = mm_cidmask(mm);
>  	/* Clear cids that were not recently used. */
> -	for_each_possible_cpu(cpu)
> +	idx = 0;
> +	cpu = from_cpu;
> +	for_each_cpu_from(cpu, cpu_possible_mask) {
> +		if (idx == CONFIG_RSEQ_CID_SCAN_BATCH)
> +			break;
>  		sched_mm_cid_remote_clear_old(mm, cpu);
> +		++idx;
> +	}
>  	weight = cpumask_weight(cidmask);
>  	/*
>  	 * Clear cids that are greater or equal to the cidmask
> weight to
>  	 * recompact it.
>  	 */
> -	for_each_possible_cpu(cpu)
> +	idx = 0;
> +	cpu = from_cpu;
> +	for_each_cpu_from(cpu, cpu_possible_mask) {
> +		if (idx == CONFIG_RSEQ_CID_SCAN_BATCH)
> +			break;
>  		sched_mm_cid_remote_clear_weight(mm, cpu, weight);
> +		++idx;
> +	}
>  }
>  
>  void init_sched_mm_cid(struct task_struct *t)



* Re: [PATCH v2 3/4] sched: Compact RSEQ concurrency IDs in batches
  2025-08-05 12:42   ` Gabriele Monaco
@ 2025-08-06 16:57     ` Mathieu Desnoyers
  2025-08-06 18:24       ` Gabriele Monaco
  0 siblings, 1 reply; 12+ messages in thread
From: Mathieu Desnoyers @ 2025-08-06 16:57 UTC (permalink / raw)
  To: Gabriele Monaco, linux-kernel
  Cc: Andrew Morton, David Hildenbrand, Ingo Molnar, Peter Zijlstra

On 2025-08-05 08:42, Gabriele Monaco wrote:
> On Wed, 2025-07-16 at 18:06 +0200, Gabriele Monaco wrote:
>> Currently, task_mm_cid_work() is called from resume_user_mode_work().
>> This can delay the execution of the corresponding thread for the
>> entire duration of the function, negatively affecting the response in
>> case of real time tasks.
>> In practice, we observe task_mm_cid_work increasing the latency of
>> 30-35us on a 128 cores system, this order of magnitude is meaningful
>> under PREEMPT_RT.
>>
>> Run the task_mm_cid_work in batches of up to
>> CONFIG_RSEQ_CID_SCAN_BATCH CPUs, this reduces the duration of the
>> delay for each scan.
>>
>> The task_mm_cid_work contains a mechanism to avoid running more
>> frequently than every 100ms. Keep this pseudo-periodicity only on
>> complete scans.
>> This means each call to task_mm_cid_work returns prematurely if the
>> period did not elapse and a scan is not ongoing (i.e. the next batch
>> to scan is not the first).
>> This way full scans are not excessively delayed while still keeping
>> each run, and introduced latency, short.
>>
> 
> Mathieu, would you have some time to look at this implementation?

Hi Gabriele,

Please note that I am currently on vacation. I'll be back shortly
before the end of August, but I'm afraid there are other tasks I
need to focus on before I can get back to this. I'm adding this
review to my todo list for September.

Thanks,

Mathieu

> 
> Thanks,
> Gabriele
> 
>> Fixes: 223baf9d17f2 ("sched: Fix performance regression introduced by
>> mm_cid")
>> Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
>> ---
>>   include/linux/mm_types.h | 15 +++++++++++++++
>>   init/Kconfig             | 12 ++++++++++++
>>   kernel/sched/core.c      | 37 ++++++++++++++++++++++++++++++++++---
>>   3 files changed, 61 insertions(+), 3 deletions(-)
>>
>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
>> index e6d6e468e64b4..a822966a584f3 100644
>> --- a/include/linux/mm_types.h
>> +++ b/include/linux/mm_types.h
>> @@ -995,6 +995,13 @@ struct mm_struct {
>>   		 * When the next mm_cid scan is due (in jiffies).
>>   		 */
>>   		unsigned long mm_cid_next_scan;
>> +		/*
>> +		 * @mm_cid_scan_batch: Counter for batch used in the
>> next scan.
>> +		 *
>> +		 * Scan in batches of CONFIG_RSEQ_CID_SCAN_BATCH.
>> This field
>> +		 * increments at each scan and reset when all
>> batches are done.
>> +		 */
>> +		unsigned int mm_cid_scan_batch;
>>   		/**
>>   		 * @nr_cpus_allowed: Number of CPUs allowed for mm.
>>   		 *
>> @@ -1385,6 +1392,7 @@ static inline void mm_init_cid(struct mm_struct
>> *mm, struct task_struct *p)
>>   	raw_spin_lock_init(&mm->cpus_allowed_lock);
>>   	cpumask_copy(mm_cpus_allowed(mm), &p->cpus_mask);
>>   	cpumask_clear(mm_cidmask(mm));
>> +	mm->mm_cid_scan_batch = 0;
>>   }
>>   
>>   static inline int mm_alloc_cid_noprof(struct mm_struct *mm, struct
>> task_struct *p)
>> @@ -1423,8 +1431,15 @@ static inline void mm_set_cpus_allowed(struct
>> mm_struct *mm, const struct cpumas
>>   
>>   static inline bool mm_cid_needs_scan(struct mm_struct *mm)
>>   {
>> +	unsigned int next_batch;
>> +
>>   	if (!mm)
>>   		return false;
>> +	next_batch = READ_ONCE(mm->mm_cid_scan_batch);
>> +	/* Always needs scan unless it's the first batch. */
>> +	if (CONFIG_RSEQ_CID_SCAN_BATCH * next_batch <
>> num_possible_cpus() &&
>> +	    next_batch)
>> +		return true;
>>   	return time_after(jiffies, READ_ONCE(mm->mm_cid_next_scan));
>>   }
>>   #else /* CONFIG_SCHED_MM_CID */
>> diff --git a/init/Kconfig b/init/Kconfig
>> index 666783eb50abd..98d7f078cd6df 100644
>> --- a/init/Kconfig
>> +++ b/init/Kconfig
>> @@ -1860,6 +1860,18 @@ config DEBUG_RSEQ
>>   
>>   	  If unsure, say N.
>>   
>> +config RSEQ_CID_SCAN_BATCH
>> +	int "Number of CPUs to scan at every mm_cid compaction
>> attempt"
>> +	range 1 NR_CPUS
>> +	default 8
>> +	depends on SCHED_MM_CID
>> +	help
>> +	  CPUs are scanned pseudo-periodically to compact the CID of
>> each task,
>> +	  this operation can take a longer amount of time on systems
>> with many
>> +	  CPUs, resulting in higher scheduling latency for the
>> current task.
>> +	  A higher value means the CID is compacted faster, but
>> results in
>> +	  higher scheduling latency.
>> +
>>   config CACHESTAT_SYSCALL
>>   	bool "Enable cachestat() system call" if EXPERT
>>   	default y
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 27b856a1cb0a9..eae4c8faf980b 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -10591,11 +10591,26 @@ static void
>> sched_mm_cid_remote_clear_weight(struct mm_struct *mm, int cpu,
>>   
>>   void task_mm_cid_work(struct task_struct *t)
>>   {
>> +	int weight, cpu, from_cpu, this_batch, next_batch, idx;
>>   	unsigned long now = jiffies, old_scan, next_scan;
>>   	struct cpumask *cidmask;
>> -	int weight, cpu;
>>   	struct mm_struct *mm = t->mm;
>>   
>> +	/*
>> +	 * This function is called from __rseq_handle_notify_resume,
>> which
>> +	 * makes sure t is a user thread and is not exiting.
>> +	 */
>> +	this_batch = READ_ONCE(mm->mm_cid_scan_batch);
>> +	next_batch = this_batch + 1;
>> +	from_cpu = cpumask_nth(this_batch *
>> CONFIG_RSEQ_CID_SCAN_BATCH,
>> +			       cpu_possible_mask);
>> +	if (from_cpu >= nr_cpu_ids) {
>> +		from_cpu = 0;
>> +		next_batch = 1;
>> +	}
>> +	/* Delay scan only if we are done with all cpus. */
>> +	if (from_cpu != 0)
>> +		goto cid_compact;
>>   	old_scan = READ_ONCE(mm->mm_cid_next_scan);
>>   	next_scan = now + msecs_to_jiffies(MM_CID_SCAN_DELAY);
>>   	if (!old_scan) {
>> @@ -10611,17 +10626,33 @@ void task_mm_cid_work(struct task_struct
>> *t)
>>   		return;
>>   	if (!try_cmpxchg(&mm->mm_cid_next_scan, &old_scan,
>> next_scan))
>>   		return;
>> +
>> +cid_compact:
>> +	if (!try_cmpxchg(&mm->mm_cid_scan_batch, &this_batch,
>> next_batch))
>> +		return;
>>   	cidmask = mm_cidmask(mm);
>>   	/* Clear cids that were not recently used. */
>> -	for_each_possible_cpu(cpu)
>> +	idx = 0;
>> +	cpu = from_cpu;
>> +	for_each_cpu_from(cpu, cpu_possible_mask) {
>> +		if (idx == CONFIG_RSEQ_CID_SCAN_BATCH)
>> +			break;
>>   		sched_mm_cid_remote_clear_old(mm, cpu);
>> +		++idx;
>> +	}
>>   	weight = cpumask_weight(cidmask);
>>   	/*
>>   	 * Clear cids that are greater or equal to the cidmask
>> weight to
>>   	 * recompact it.
>>   	 */
>> -	for_each_possible_cpu(cpu)
>> +	idx = 0;
>> +	cpu = from_cpu;
>> +	for_each_cpu_from(cpu, cpu_possible_mask) {
>> +		if (idx == CONFIG_RSEQ_CID_SCAN_BATCH)
>> +			break;
>>   		sched_mm_cid_remote_clear_weight(mm, cpu, weight);
>> +		++idx;
>> +	}
>>   }
>>   
>>   void init_sched_mm_cid(struct task_struct *t)
> 


-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com


* Re: [PATCH v2 3/4] sched: Compact RSEQ concurrency IDs in batches
  2025-08-06 16:57     ` Mathieu Desnoyers
@ 2025-08-06 18:24       ` Gabriele Monaco
  0 siblings, 0 replies; 12+ messages in thread
From: Gabriele Monaco @ 2025-08-06 18:24 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: linux-kernel, Andrew Morton, David Hildenbrand, Ingo Molnar,
	Peter Zijlstra

2025-08-06T16:57:51Z Mathieu Desnoyers <mathieu.desnoyers@efficios.com>:

> On 2025-08-05 08:42, Gabriele Monaco wrote:
>> On Wed, 2025-07-16 at 18:06 +0200, Gabriele Monaco wrote:
>>> Currently, task_mm_cid_work() is called from resume_user_mode_work().
>>> This can delay the execution of the corresponding thread for the
>>> entire duration of the function, negatively affecting the response in
>>> case of real time tasks.
>>> In practice, we observe task_mm_cid_work increasing the latency of
>>> 30-35us on a 128 cores system, this order of magnitude is meaningful
>>> under PREEMPT_RT.
>>>
>>> Run the task_mm_cid_work in batches of up to
>>> CONFIG_RSEQ_CID_SCAN_BATCH CPUs, this reduces the duration of the
>>> delay for each scan.
>>>
>>> The task_mm_cid_work contains a mechanism to avoid running more
>>> frequently than every 100ms. Keep this pseudo-periodicity only on
>>> complete scans.
>>> This means each call to task_mm_cid_work returns prematurely if the
>>> period did not elapse and a scan is not ongoing (i.e. the next batch
>>> to scan is not the first).
>>> This way full scans are not excessively delayed while still keeping
>>> each run, and introduced latency, short.
>>>
>> Mathieu, would you have some time to look at this implementation?
>
> Hi Gabriele,
>
> Please note that I am currently on vacation. I'll be back shortly
> before the end of August, but I'm afraid there are other tasks I
> need to focus on before I can get back to this. I'm adding this
> review to my todo list for September.
>

No problem, thanks for the update and enjoy your vacation!

Thanks,
Gabriele

> Thanks,
>
> Mathieu
>
>> Thanks,
>> Gabriele
>> Fixes: 223baf9d17f2 ("sched: Fix performance regression introduced by
>>> mm_cid")
>>> Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
>>> ---
>>>  include/linux/mm_types.h | 15 +++++++++++++++
>>>  init/Kconfig             | 12 ++++++++++++
>>>  kernel/sched/core.c      | 37 ++++++++++++++++++++++++++++++++++---
>>>  3 files changed, 61 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
>>> index e6d6e468e64b4..a822966a584f3 100644
>>> --- a/include/linux/mm_types.h
>>> +++ b/include/linux/mm_types.h
>>> @@ -995,6 +995,13 @@ struct mm_struct {
>>>         * When the next mm_cid scan is due (in jiffies).
>>>         */
>>>        unsigned long mm_cid_next_scan;
>>> +       /*
>>> +        * @mm_cid_scan_batch: Counter for batch used in the
>>> next scan.
>>> +        *
>>> +        * Scan in batches of CONFIG_RSEQ_CID_SCAN_BATCH.
>>> This field
>>> +        * increments at each scan and reset when all
>>> batches are done.
>>> +        */
>>> +       unsigned int mm_cid_scan_batch;
>>>        /**
>>>         * @nr_cpus_allowed: Number of CPUs allowed for mm.
>>>         *
>>> @@ -1385,6 +1392,7 @@ static inline void mm_init_cid(struct mm_struct
>>> *mm, struct task_struct *p)
>>>    raw_spin_lock_init(&mm->cpus_allowed_lock);
>>>    cpumask_copy(mm_cpus_allowed(mm), &p->cpus_mask);
>>>    cpumask_clear(mm_cidmask(mm));
>>> +   mm->mm_cid_scan_batch = 0;
>>>  }
>>>     static inline int mm_alloc_cid_noprof(struct mm_struct *mm, struct
>>> task_struct *p)
>>> @@ -1423,8 +1431,15 @@ static inline void mm_set_cpus_allowed(struct
>>> mm_struct *mm, const struct cpumas
>>>     static inline bool mm_cid_needs_scan(struct mm_struct *mm)
>>>  {
>>> +   unsigned int next_batch;
>>> +
>>>    if (!mm)
>>>        return false;
>>> +   next_batch = READ_ONCE(mm->mm_cid_scan_batch);
>>> +   /* Always needs scan unless it's the first batch. */
>>> +   if (CONFIG_RSEQ_CID_SCAN_BATCH * next_batch <
>>> num_possible_cpus() &&
>>> +       next_batch)
>>> +       return true;
>>>    return time_after(jiffies, READ_ONCE(mm->mm_cid_next_scan));
>>>  }
>>>  #else /* CONFIG_SCHED_MM_CID */
>>> diff --git a/init/Kconfig b/init/Kconfig
>>> index 666783eb50abd..98d7f078cd6df 100644
>>> --- a/init/Kconfig
>>> +++ b/init/Kconfig
>>> @@ -1860,6 +1860,18 @@ config DEBUG_RSEQ
>>>           If unsure, say N.
>>>   +config RSEQ_CID_SCAN_BATCH
>>> +   int "Number of CPUs to scan at every mm_cid compaction
>>> attempt"
>>> +   range 1 NR_CPUS
>>> +   default 8
>>> +   depends on SCHED_MM_CID
>>> +   help
>>> +     CPUs are scanned pseudo-periodically to compact the CID of
>>> each task,
>>> +     this operation can take a longer amount of time on systems
>>> with many
>>> +     CPUs, resulting in higher scheduling latency for the
>>> current task.
>>> +     A higher value means the CID is compacted faster, but
>>> results in
>>> +     higher scheduling latency.
>>> +
>>>  config CACHESTAT_SYSCALL
>>>    bool "Enable cachestat() system call" if EXPERT
>>>    default y
>>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>>> index 27b856a1cb0a9..eae4c8faf980b 100644
>>> --- a/kernel/sched/core.c
>>> +++ b/kernel/sched/core.c
>>> @@ -10591,11 +10591,26 @@ static void
>>> sched_mm_cid_remote_clear_weight(struct mm_struct *mm, int cpu,
>>>     void task_mm_cid_work(struct task_struct *t)
>>>  {
>>> +   int weight, cpu, from_cpu, this_batch, next_batch, idx;
>>>    unsigned long now = jiffies, old_scan, next_scan;
>>>    struct cpumask *cidmask;
>>> -   int weight, cpu;
>>>    struct mm_struct *mm = t->mm;
>>>   + /*
>>> +    * This function is called from __rseq_handle_notify_resume,
>>> which
>>> +    * makes sure t is a user thread and is not exiting.
>>> +    */
>>> +   this_batch = READ_ONCE(mm->mm_cid_scan_batch);
>>> +   next_batch = this_batch + 1;
>>> +   from_cpu = cpumask_nth(this_batch *
>>> CONFIG_RSEQ_CID_SCAN_BATCH,
>>> +                  cpu_possible_mask);
>>> +   if (from_cpu >= nr_cpu_ids) {
>>> +       from_cpu = 0;
>>> +       next_batch = 1;
>>> +   }
>>> +   /* Delay scan only if we are done with all cpus. */
>>> +   if (from_cpu != 0)
>>> +       goto cid_compact;
>>>    old_scan = READ_ONCE(mm->mm_cid_next_scan);
>>>    next_scan = now + msecs_to_jiffies(MM_CID_SCAN_DELAY);
>>>    if (!old_scan) {
>>> @@ -10611,17 +10626,33 @@ void task_mm_cid_work(struct task_struct
>>> *t)
>>>        return;
>>>    if (!try_cmpxchg(&mm->mm_cid_next_scan, &old_scan,
>>> next_scan))
>>>        return;
>>> +
>>> +cid_compact:
>>> +   if (!try_cmpxchg(&mm->mm_cid_scan_batch, &this_batch,
>>> next_batch))
>>> +       return;
>>>    cidmask = mm_cidmask(mm);
>>>    /* Clear cids that were not recently used. */
>>> -   for_each_possible_cpu(cpu)
>>> +   idx = 0;
>>> +   cpu = from_cpu;
>>> +   for_each_cpu_from(cpu, cpu_possible_mask) {
>>> +       if (idx == CONFIG_RSEQ_CID_SCAN_BATCH)
>>> +           break;
>>>        sched_mm_cid_remote_clear_old(mm, cpu);
>>> +       ++idx;
>>> +   }
>>>    weight = cpumask_weight(cidmask);
>>>    /*
>>>     * Clear cids that are greater or equal to the cidmask
>>> weight to
>>>     * recompact it.
>>>     */
>>> -   for_each_possible_cpu(cpu)
>>> +   idx = 0;
>>> +   cpu = from_cpu;
>>> +   for_each_cpu_from(cpu, cpu_possible_mask) {
>>> +       if (idx == CONFIG_RSEQ_CID_SCAN_BATCH)
>>> +           break;
>>>        sched_mm_cid_remote_clear_weight(mm, cpu, weight);
>>> +       ++idx;
>>> +   }
>>>  }
>>>     void init_sched_mm_cid(struct task_struct *t)
>>
>
>
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> https://www.efficios.com



* Re: [PATCH v2 2/4] rseq: Run the mm_cid_compaction from rseq_handle_notify_resume()
  2025-07-16 16:06 ` [PATCH v2 2/4] rseq: Run the mm_cid_compaction from rseq_handle_notify_resume() Gabriele Monaco
@ 2025-08-26 18:01   ` Mathieu Desnoyers
  2025-08-27  6:55     ` Gabriele Monaco
  0 siblings, 1 reply; 12+ messages in thread
From: Mathieu Desnoyers @ 2025-08-26 18:01 UTC (permalink / raw)
  To: Gabriele Monaco, linux-kernel, Andrew Morton, David Hildenbrand,
	Ingo Molnar, Peter Zijlstra, Paul E. McKenney, linux-mm,
	Thomas Gleixner
  Cc: Ingo Molnar

On 2025-07-16 12:06, Gabriele Monaco wrote:
> Currently the mm_cid_compaction is triggered by the scheduler tick and
> runs in a task_work, behaviour is more unpredictable with periodic tasks
> with short runtime, which may rarely run during a tick.
> 
> Run the mm_cid_compaction from the rseq_handle_notify_resume() call,
> which runs from resume_user_mode_work. Since the context is the same
> where the task_work would run, skip this step and call the compaction
> function directly.
> The compaction function still exits prematurely in case the scan is not
> required, that is when the pseudo-period of 100ms did not elapse.
> 
> Keep a tick handler used for long running tasks that are never preempted
> (i.e. that never call rseq_handle_notify_resume), which triggers a
> compaction and mm_cid update only in that case.

Your approach looks good, but please note that this will probably
need to be rebased on top of the rseq rework from Thomas Gleixner.

Latest version can be found here:

https://lore.kernel.org/lkml/20250823161326.635281786@linutronix.de/

Thanks,

Mathieu

> 
> Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
> ---
>   include/linux/mm.h       |  2 ++
>   include/linux/mm_types.h | 11 ++++++++
>   include/linux/sched.h    |  2 +-
>   kernel/rseq.c            |  2 ++
>   kernel/sched/core.c      | 55 +++++++++++++++++++++++++---------------
>   kernel/sched/sched.h     |  2 ++
>   6 files changed, 53 insertions(+), 21 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index fa538feaa8d95..cc8c1c9ae26c1 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2294,6 +2294,7 @@ void sched_mm_cid_before_execve(struct task_struct *t);
>   void sched_mm_cid_after_execve(struct task_struct *t);
>   void sched_mm_cid_fork(struct task_struct *t);
>   void sched_mm_cid_exit_signals(struct task_struct *t);
> +void task_mm_cid_work(struct task_struct *t);
>   static inline int task_mm_cid(struct task_struct *t)
>   {
>   	return t->mm_cid;
> @@ -2303,6 +2304,7 @@ static inline void sched_mm_cid_before_execve(struct task_struct *t) { }
>   static inline void sched_mm_cid_after_execve(struct task_struct *t) { }
>   static inline void sched_mm_cid_fork(struct task_struct *t) { }
>   static inline void sched_mm_cid_exit_signals(struct task_struct *t) { }
> +static inline void task_mm_cid_work(struct task_struct *t) { }
>   static inline int task_mm_cid(struct task_struct *t)
>   {
>   	/*
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index d6b91e8a66d6d..e6d6e468e64b4 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -1420,6 +1420,13 @@ static inline void mm_set_cpus_allowed(struct mm_struct *mm, const struct cpumas
>   	WRITE_ONCE(mm->nr_cpus_allowed, cpumask_weight(mm_allowed));
>   	raw_spin_unlock(&mm->cpus_allowed_lock);
>   }
> +
> +static inline bool mm_cid_needs_scan(struct mm_struct *mm)
> +{
> +	if (!mm)
> +		return false;
> +	return time_after(jiffies, READ_ONCE(mm->mm_cid_next_scan));
> +}
>   #else /* CONFIG_SCHED_MM_CID */
>   static inline void mm_init_cid(struct mm_struct *mm, struct task_struct *p) { }
>   static inline int mm_alloc_cid(struct mm_struct *mm, struct task_struct *p) { return 0; }
> @@ -1430,6 +1437,10 @@ static inline unsigned int mm_cid_size(void)
>   	return 0;
>   }
>   static inline void mm_set_cpus_allowed(struct mm_struct *mm, const struct cpumask *cpumask) { }
> +static inline bool mm_cid_needs_scan(struct mm_struct *mm)
> +{
> +	return false;
> +}
>   #endif /* CONFIG_SCHED_MM_CID */
>   
>   struct mmu_gather;
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index aa9c5be7a6325..a75f61cea2271 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1428,7 +1428,7 @@ struct task_struct {
>   	int				last_mm_cid;	/* Most recent cid in mm */
>   	int				migrate_from_cpu;
>   	int				mm_cid_active;	/* Whether cid bitmap is active */
> -	struct callback_head		cid_work;
> +	unsigned long			last_cid_reset;	/* Time of last reset in jiffies */
>   #endif
>   
>   	struct tlbflush_unmap_batch	tlb_ubc;
> diff --git a/kernel/rseq.c b/kernel/rseq.c
> index b7a1ec327e811..100f81e330dc6 100644
> --- a/kernel/rseq.c
> +++ b/kernel/rseq.c
> @@ -441,6 +441,8 @@ void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs)
>   	}
>   	if (unlikely(rseq_update_cpu_node_id(t)))
>   		goto error;
> +	/* The mm_cid compaction returns prematurely if scan is not needed. */
> +	task_mm_cid_work(t);
>   	return;
>   
>   error:
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 81c6df746df17..27b856a1cb0a9 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -10589,22 +10589,13 @@ static void sched_mm_cid_remote_clear_weight(struct mm_struct *mm, int cpu,
>   	sched_mm_cid_remote_clear(mm, pcpu_cid, cpu);
>   }
>   
> -static void task_mm_cid_work(struct callback_head *work)
> +void task_mm_cid_work(struct task_struct *t)
>   {
>   	unsigned long now = jiffies, old_scan, next_scan;
> -	struct task_struct *t = current;
>   	struct cpumask *cidmask;
> -	struct mm_struct *mm;
>   	int weight, cpu;
> +	struct mm_struct *mm = t->mm;
>   
> -	WARN_ON_ONCE(t != container_of(work, struct task_struct, cid_work));
> -
> -	work->next = work;	/* Prevent double-add */
> -	if (t->flags & PF_EXITING)
> -		return;
> -	mm = t->mm;
> -	if (!mm)
> -		return;
>   	old_scan = READ_ONCE(mm->mm_cid_next_scan);
>   	next_scan = now + msecs_to_jiffies(MM_CID_SCAN_DELAY);
>   	if (!old_scan) {
> @@ -10643,23 +10634,47 @@ void init_sched_mm_cid(struct task_struct *t)
>   		if (mm_users == 1)
>   			mm->mm_cid_next_scan = jiffies + msecs_to_jiffies(MM_CID_SCAN_DELAY);
>   	}
> -	t->cid_work.next = &t->cid_work;	/* Protect against double add */
> -	init_task_work(&t->cid_work, task_mm_cid_work);
>   }
>   
>   void task_tick_mm_cid(struct rq *rq, struct task_struct *curr)
>   {
> -	struct callback_head *work = &curr->cid_work;
> -	unsigned long now = jiffies;
> +	u64 rtime = curr->se.sum_exec_runtime - curr->se.prev_sum_exec_runtime;
>   
> +	/*
> +	 * If a task is running unpreempted for a long time, it won't get its
> +	 * mm_cid compacted and won't update its mm_cid value after a
> +	 * compaction occurs.
> +	 * For such a task, this function does two things:
> +	 * A) trigger the mm_cid recompaction,
> +	 * B) trigger an update of the task's rseq->mm_cid field at some point
> +	 * after recompaction, so it can get a mm_cid value closer to 0.
> +	 * A change in the mm_cid triggers an rseq_preempt.
> +	 *
> +	 * B occurs once after the compaction work completes, neither A nor B
> +	 * run as long as the compaction work is pending, the task is exiting
> +	 * or is not a userspace task.
> +	 */
>   	if (!curr->mm || (curr->flags & (PF_EXITING | PF_KTHREAD)) ||
> -	    work->next != work)
> +	    test_tsk_thread_flag(curr, TIF_NOTIFY_RESUME))
>   		return;
> -	if (time_before(now, READ_ONCE(curr->mm->mm_cid_next_scan)))
> +	if (rtime < RSEQ_UNPREEMPTED_THRESHOLD)
>   		return;
> -
> -	/* No page allocation under rq lock */
> -	task_work_add(curr, work, TWA_RESUME);
> +	if (mm_cid_needs_scan(curr->mm)) {
> +		/* Trigger mm_cid recompaction */
> +		rseq_set_notify_resume(curr);
> +	} else if (time_after(jiffies, curr->last_cid_reset +
> +			      msecs_to_jiffies(MM_CID_SCAN_DELAY))) {
> +		/* Update mm_cid field */
> +		int old_cid = curr->mm_cid;
> +
> +		if (!curr->mm_cid_active)
> +			return;
> +		mm_cid_snapshot_time(rq, curr->mm);
> +		mm_cid_put_lazy(curr);
> +		curr->last_mm_cid = curr->mm_cid = mm_cid_get(rq, curr, curr->mm);
> +		if (old_cid != curr->mm_cid)
> +			rseq_preempt(curr);
> +	}
>   }
>   
>   void sched_mm_cid_exit_signals(struct task_struct *t)
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 475bb5998295e..90a5b58188232 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -3606,6 +3606,7 @@ extern const char *preempt_modes[];
>   
>   #define SCHED_MM_CID_PERIOD_NS	(100ULL * 1000000)	/* 100ms */
>   #define MM_CID_SCAN_DELAY	100			/* 100ms */
> +#define RSEQ_UNPREEMPTED_THRESHOLD	SCHED_MM_CID_PERIOD_NS
>   
>   extern raw_spinlock_t cid_lock;
>   extern int use_cid_lock;
> @@ -3809,6 +3810,7 @@ static inline int mm_cid_get(struct rq *rq, struct task_struct *t,
>   	int cid;
>   
>   	lockdep_assert_rq_held(rq);
> +	t->last_cid_reset = jiffies;
>   	cpumask = mm_cidmask(mm);
>   	cid = __this_cpu_read(pcpu_cid->cid);
>   	if (mm_cid_is_valid(cid)) {


-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v2 3/4] sched: Compact RSEQ concurrency IDs in batches
  2025-07-16 16:06 ` [PATCH v2 3/4] sched: Compact RSEQ concurrency IDs in batches Gabriele Monaco
  2025-08-05 12:42   ` Gabriele Monaco
@ 2025-08-26 18:10   ` Mathieu Desnoyers
  2025-08-28  8:36     ` Gabriele Monaco
  1 sibling, 1 reply; 12+ messages in thread
From: Mathieu Desnoyers @ 2025-08-26 18:10 UTC (permalink / raw)
  To: Gabriele Monaco, linux-kernel, Andrew Morton, David Hildenbrand,
	Ingo Molnar, Peter Zijlstra, linux-mm, Thomas Gleixner
  Cc: Ingo Molnar

On 2025-07-16 12:06, Gabriele Monaco wrote:
> Currently, task_mm_cid_work() is called from resume_user_mode_work().
> This can delay the execution of the corresponding thread for the entire
> duration of the function, negatively affecting the response in case of
> real time tasks.
> In practice, we observe task_mm_cid_work increasing the latency by
> 30-35us on a 128-core system; this order of magnitude is meaningful
> under PREEMPT_RT.
> 
> Run the task_mm_cid_work in batches of up to CONFIG_RSEQ_CID_SCAN_BATCH
> CPUs; this reduces the duration of the delay for each scan.
> 
> The task_mm_cid_work contains a mechanism to avoid running more
> frequently than every 100ms. Keep this pseudo-periodicity only on
> complete scans.
> This means each call to task_mm_cid_work returns prematurely if the
> period did not elapse and a scan is not ongoing (i.e. the next batch to
> scan is not the first).
> This way full scans are not excessively delayed while still keeping each
> run, and the latency it introduces, short.

With your test hardware/workload as reference, do you have an idea of
how many CPUs would be needed to require more than 100ms to iterate on
all CPUs with the default scan batch size (8) ?

> 
> Fixes: 223baf9d17f2 ("sched: Fix performance regression introduced by mm_cid")
> Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
> ---
>   include/linux/mm_types.h | 15 +++++++++++++++
>   init/Kconfig             | 12 ++++++++++++
>   kernel/sched/core.c      | 37 ++++++++++++++++++++++++++++++++++---
>   3 files changed, 61 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index e6d6e468e64b4..a822966a584f3 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -995,6 +995,13 @@ struct mm_struct {
>   		 * When the next mm_cid scan is due (in jiffies).
>   		 */
>   		unsigned long mm_cid_next_scan;
> +		/*
> +		 * @mm_cid_scan_batch: Counter for batch used in the next scan.
> +		 *
> +		 * Scan in batches of CONFIG_RSEQ_CID_SCAN_BATCH. This field
> +		 * increments at each scan and reset when all batches are done.
> +		 */
> +		unsigned int mm_cid_scan_batch;
>   		/**
>   		 * @nr_cpus_allowed: Number of CPUs allowed for mm.
>   		 *
> @@ -1385,6 +1392,7 @@ static inline void mm_init_cid(struct mm_struct *mm, struct task_struct *p)
>   	raw_spin_lock_init(&mm->cpus_allowed_lock);
>   	cpumask_copy(mm_cpus_allowed(mm), &p->cpus_mask);
>   	cpumask_clear(mm_cidmask(mm));
> +	mm->mm_cid_scan_batch = 0;
>   }
>   
>   static inline int mm_alloc_cid_noprof(struct mm_struct *mm, struct task_struct *p)
> @@ -1423,8 +1431,15 @@ static inline void mm_set_cpus_allowed(struct mm_struct *mm, const struct cpumas
>   
>   static inline bool mm_cid_needs_scan(struct mm_struct *mm)
>   {
> +	unsigned int next_batch;
> +
>   	if (!mm)
>   		return false;
> +	next_batch = READ_ONCE(mm->mm_cid_scan_batch);
> +	/* Always needs scan unless it's the first batch. */
> +	if (CONFIG_RSEQ_CID_SCAN_BATCH * next_batch < num_possible_cpus() &&
> +	    next_batch)
> +		return true;
>   	return time_after(jiffies, READ_ONCE(mm->mm_cid_next_scan));
>   }
>   #else /* CONFIG_SCHED_MM_CID */
> diff --git a/init/Kconfig b/init/Kconfig
> index 666783eb50abd..98d7f078cd6df 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -1860,6 +1860,18 @@ config DEBUG_RSEQ
>   
>   	  If unsure, say N.
>   
> +config RSEQ_CID_SCAN_BATCH
> +	int "Number of CPUs to scan at every mm_cid compaction attempt"
> +	range 1 NR_CPUS
> +	default 8
> +	depends on SCHED_MM_CID
> +	help
> +	  CPUs are scanned pseudo-periodically to compact the CID of each task,
> +	  this operation can take a longer amount of time on systems with many
> +	  CPUs, resulting in higher scheduling latency for the current task.
> +	  A higher value means the CID is compacted faster, but results in
> +	  higher scheduling latency.
> +
>   config CACHESTAT_SYSCALL
>   	bool "Enable cachestat() system call" if EXPERT
>   	default y
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 27b856a1cb0a9..eae4c8faf980b 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -10591,11 +10591,26 @@ static void sched_mm_cid_remote_clear_weight(struct mm_struct *mm, int cpu,
>   
>   void task_mm_cid_work(struct task_struct *t)
>   {
> +	int weight, cpu, from_cpu, this_batch, next_batch, idx;
>   	unsigned long now = jiffies, old_scan, next_scan;
>   	struct cpumask *cidmask;
> -	int weight, cpu;
>   	struct mm_struct *mm = t->mm;
>   
> +	/*
> +	 * This function is called from __rseq_handle_notify_resume, which
> +	 * makes sure t is a user thread and is not exiting.
> +	 */
> +	this_batch = READ_ONCE(mm->mm_cid_scan_batch);
> +	next_batch = this_batch + 1;
> +	from_cpu = cpumask_nth(this_batch * CONFIG_RSEQ_CID_SCAN_BATCH,
> +			       cpu_possible_mask);
> +	if (from_cpu >= nr_cpu_ids) {
> +		from_cpu = 0;
> +		next_batch = 1;
> +	}
> +	/* Delay scan only if we are done with all cpus. */
> +	if (from_cpu != 0)
> +		goto cid_compact;
>   	old_scan = READ_ONCE(mm->mm_cid_next_scan);
>   	next_scan = now + msecs_to_jiffies(MM_CID_SCAN_DELAY);
>   	if (!old_scan) {
> @@ -10611,17 +10626,33 @@ void task_mm_cid_work(struct task_struct *t)
>   		return;
>   	if (!try_cmpxchg(&mm->mm_cid_next_scan, &old_scan, next_scan))
>   		return;
> +
> +cid_compact:
> +	if (!try_cmpxchg(&mm->mm_cid_scan_batch, &this_batch, next_batch))
> +		return;
>   	cidmask = mm_cidmask(mm);
>   	/* Clear cids that were not recently used. */
> -	for_each_possible_cpu(cpu)
> +	idx = 0;
> +	cpu = from_cpu;
> +	for_each_cpu_from(cpu, cpu_possible_mask) {
> +		if (idx == CONFIG_RSEQ_CID_SCAN_BATCH)

could do "if (idx++ == CONFIG_RSEQ_CID_SCAN_BATCH)"

> +			break;
>   		sched_mm_cid_remote_clear_old(mm, cpu);
> +		++idx;

and remove this ^

> +	}
>   	weight = cpumask_weight(cidmask);
>   	/*
>   	 * Clear cids that are greater or equal to the cidmask weight to
>   	 * recompact it.
>   	 */
> -	for_each_possible_cpu(cpu)
> +	idx = 0;
> +	cpu = from_cpu;
> +	for_each_cpu_from(cpu, cpu_possible_mask) {
> +		if (idx == CONFIG_RSEQ_CID_SCAN_BATCH)

Likewise.

> +			break;
>   		sched_mm_cid_remote_clear_weight(mm, cpu, weight);
> +		++idx;

Likewise.
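
i.e., with both changes folded in, each of the two loops would read
roughly like (second loop shown):

	idx = 0;
	cpu = from_cpu;
	for_each_cpu_from(cpu, cpu_possible_mask) {
		if (idx++ == CONFIG_RSEQ_CID_SCAN_BATCH)
			break;
		sched_mm_cid_remote_clear_weight(mm, cpu, weight);
	}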

Thanks,

Mathieu

> +	}
>   }
>   
>   void init_sched_mm_cid(struct task_struct *t)


-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v2 2/4] rseq: Run the mm_cid_compaction from rseq_handle_notify_resume()
  2025-08-26 18:01   ` Mathieu Desnoyers
@ 2025-08-27  6:55     ` Gabriele Monaco
  0 siblings, 0 replies; 12+ messages in thread
From: Gabriele Monaco @ 2025-08-27  6:55 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: linux-kernel, Andrew Morton, David Hildenbrand, Ingo Molnar,
	Peter Zijlstra, Paul E. McKenney, linux-mm, Thomas Gleixner

On Tue, 2025-08-26 at 14:01 -0400, Mathieu Desnoyers wrote:
> On 2025-07-16 12:06, Gabriele Monaco wrote:
> > Currently the mm_cid_compaction is triggered by the scheduler tick and
> > runs in a task_work; behaviour is more unpredictable with periodic
> > tasks with short runtime, which may rarely run during a tick.
> > 
> > Run the mm_cid_compaction from the rseq_handle_notify_resume() call,
> > which runs from resume_user_mode_work. Since the context is the same
> > where the task_work would run, skip this step and call the compaction
> > function directly.
> > The compaction function still exits prematurely in case the scan is not
> > required, that is when the pseudo-period of 100ms did not elapse.
> > 
> > Keep a tick handler used for long running tasks that are never preempted
> > (i.e. that never call rseq_handle_notify_resume), which triggers a
> > compaction and mm_cid update only in that case.
> 
> Your approach looks good, but please note that this will probably
> need to be rebased on top of the rseq rework from Thomas Gleixner.
> 
> Latest version can be found here:
> 
> https://lore.kernel.org/lkml/20250823161326.635281786@linutronix.de/
> 

Mmh that's quite a large one, thanks for sharing!
I'm going to have a look but it might make sense to wait until that's
included, I guess.

Thanks,
Gabriele


> Thanks,
> 
> Mathieu
> 
> > [...]


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v2 3/4] sched: Compact RSEQ concurrency IDs in batches
  2025-08-26 18:10   ` Mathieu Desnoyers
@ 2025-08-28  8:36     ` Gabriele Monaco
  0 siblings, 0 replies; 12+ messages in thread
From: Gabriele Monaco @ 2025-08-28  8:36 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Ingo Molnar, linux-kernel, Andrew Morton, David Hildenbrand,
	Ingo Molnar, Peter Zijlstra, linux-mm, Thomas Gleixner

On Tue, 2025-08-26 at 14:10 -0400, Mathieu Desnoyers wrote:
> On 2025-07-16 12:06, Gabriele Monaco wrote:
> > Currently, task_mm_cid_work() is called from
> > resume_user_mode_work().
> > This can delay the execution of the corresponding thread for the
> > entire duration of the function, negatively affecting the response
> > in case of real time tasks.
> > In practice, we observe task_mm_cid_work increasing the latency by
> > 30-35us on a 128-core system; this order of magnitude is
> > meaningful under PREEMPT_RT.
> > 
> > Run the task_mm_cid_work in batches of up to
> > CONFIG_RSEQ_CID_SCAN_BATCH CPUs; this reduces the duration of the
> > delay for each scan.
> > 
> > The task_mm_cid_work contains a mechanism to avoid running more
> > frequently than every 100ms. Keep this pseudo-periodicity only on
> > complete scans.
> > This means each call to task_mm_cid_work returns prematurely if the
> > period did not elapse and a scan is not ongoing (i.e. the next
> > batch to scan is not the first).
> > This way full scans are not excessively delayed while still keeping
> > each run, and the latency it introduces, short.
> 
> With your test hardware/workload as reference, do you have an idea of
> how many CPUs would be needed to require more than 100ms to iterate
> on all CPUs with the default scan batch size (8) ?

As you guessed, this is strongly dependent on the workload, where
workloads with fewer threads are more likely to take longer.
I used cyclictest (threads with 100us period) and hackbench (processes)
on a 128-CPU machine and measured the time to complete the scan (16
iterations) as well as the time between non-complete scans (not delayed
by 100ms):

cyclictest: delay 0-400 us, complete scan 1.5-2 ms
hackbench:  delay 5 us - 3 ms, complete scan 1.5-15 ms

So to answer your question, in the observed worst case for hackbench,
it would take more than 800 CPUs to reach the 100ms limit.
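
(Back-of-the-envelope, assuming the full-scan time scales linearly with
the number of possible CPUs: 15 ms for 128 CPUs is about 0.12 ms per
CPU, so crossing the 100 ms period takes roughly 100 / 0.12, i.e.
around 850 CPUs.)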

That said, the problematic latency was observed on a full scan (128
CPUs), so perhaps the default of 8 is a bit too conservative and could
easily be doubled.

Measurements showed these durations for each call to task_mm_cid_work:

batch size  8:  1-11 us (majority below 10)
batch size 16:  3-16 us (majority below 10)
batch size 32: 10-21 us (majority above 15)

20 us is considered a relevant latency on this machine, so 16 seems a
good tradeoff for a batch size to me.
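
If the default were doubled as suggested, the Kconfig change would be a
one-liner along these lines (illustrative only, not part of this series):

	--- a/init/Kconfig
	+++ b/init/Kconfig
	 config RSEQ_CID_SCAN_BATCH
	 	int "Number of CPUs to scan at every mm_cid compaction attempt"
	 	range 1 NR_CPUS
	-	default 8
	+	default 16
	 	depends on SCHED_MM_CID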


I'm going to include those numbers in the next iteration of the series.

...
> > +cid_compact:
> > +	if (!try_cmpxchg(&mm->mm_cid_scan_batch, &this_batch, next_batch))
> > +		return;
> >   	cidmask = mm_cidmask(mm);
> >   	/* Clear cids that were not recently used. */
> > -	for_each_possible_cpu(cpu)
> > +	idx = 0;
> > +	cpu = from_cpu;
> > +	for_each_cpu_from(cpu, cpu_possible_mask) {
> > +		if (idx == CONFIG_RSEQ_CID_SCAN_BATCH)
> 
> could do "if (idx++ == CONFIG_RSEQ_CID_SCAN_BATCH)"
> 
> > +			break;
> >   		sched_mm_cid_remote_clear_old(mm, cpu);
> > +		++idx;
> 
> and remove this ^
> 
> > +	}
> >   	weight = cpumask_weight(cidmask);
> >   	/*
> >   	 * Clear cids that are greater or equal to the cidmask weight to
> >   	 * recompact it.
> >   	 */
> > -	for_each_possible_cpu(cpu)
> > +	idx = 0;
> > +	cpu = from_cpu;
> > +	for_each_cpu_from(cpu, cpu_possible_mask) {
> > +		if (idx == CONFIG_RSEQ_CID_SCAN_BATCH)
> 
> Likewise.
> 
> > +			break;
> >   		sched_mm_cid_remote_clear_weight(mm, cpu, weight);
> > +		++idx;
> 
> Likewise.

Sure, will do.

Thanks,
Gabriele


^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2025-08-28  8:36 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-07-16 16:06 [PATCH v2 0/4] sched: Run task_mm_cid_work in batches to lower latency Gabriele Monaco
2025-07-16 16:06 ` [PATCH v2 1/4] sched: Add prev_sum_exec_runtime support for RT, DL and SCX classes Gabriele Monaco
2025-07-16 16:06 ` [PATCH v2 2/4] rseq: Run the mm_cid_compaction from rseq_handle_notify_resume() Gabriele Monaco
2025-08-26 18:01   ` Mathieu Desnoyers
2025-08-27  6:55     ` Gabriele Monaco
2025-07-16 16:06 ` [PATCH v2 3/4] sched: Compact RSEQ concurrency IDs in batches Gabriele Monaco
2025-08-05 12:42   ` Gabriele Monaco
2025-08-06 16:57     ` Mathieu Desnoyers
2025-08-06 18:24       ` Gabriele Monaco
2025-08-26 18:10   ` Mathieu Desnoyers
2025-08-28  8:36     ` Gabriele Monaco
2025-07-16 16:06 ` [PATCH v2 4/4] selftests/rseq: Add test for mm_cid compaction Gabriele Monaco

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).