linuxppc-dev.lists.ozlabs.org archive mirror
* Re: [patch V2 16/20] sched/mmcid: Provide new scheduler CID mechanism
       [not found] ` <20251022110556.399477196@linutronix.de>
@ 2025-10-27  5:11   ` Shrikanth Hegde
  2025-10-27  8:54     ` Thomas Gleixner
  0 siblings, 1 reply; 2+ messages in thread
From: Shrikanth Hegde @ 2025-10-27  5:11 UTC (permalink / raw)
  To: Thomas Gleixner, LKML, linuxppc-dev
  Cc: Peter Zijlstra, Gabriele Monaco, Mathieu Desnoyers,
	Michael Jeanson, Jens Axboe, Paul E. McKenney, Gautham R. Shenoy,
	Florian Weimer, Tim Chen, Yury Norov, Madhavan Srinivasan


Hi Thomas,

On 10/22/25 6:25 PM, Thomas Gleixner wrote:
> The MM CID management has two fundamental requirements:
> 
>    1) It has to guarantee that at no given point in time the same CID is
>       used by concurrent tasks in userspace.
> 
>    2) The CID space must not exceed the number of possible CPUs in a
>       system. While most allocators (glibc, tcmalloc, jemalloc) do not
>       care about that, there seems to be at least some LTTng library
>       depending on it.
> 
> The CID space compaction itself is not a functional correctness
> requirement, it is only a useful optimization mechanism to reduce the
> memory footprint in unused user space pools.
> 

Just wondering: if there is no user space request for CID, should this whole
mechanism be behind a static check to avoid any overhead?

(Trying to understand this change. Very interesting. Ignore if this is not applicable.)

> The optimal CID space is:
> 
>      min(nr_tasks, nr_cpus_allowed);
> 
> Where @nr_tasks is the number of actual user space threads associated with
> the mm and @nr_cpus_allowed is the superset of all task affinities. It is
> grow-only, as it would be insane to take a racy snapshot of all task
> affinities when the affinity of one task changes, just to redo it 2
> milliseconds later when the next task changes its affinity.
> 
> That means that as long as the number of tasks is less than or equal to the
> number of CPUs allowed, each task owns a CID. If the number of tasks
> exceeds the number of CPUs allowed it switches to per CPU mode, where the
> CPUs own the CIDs and the tasks borrow them as long as they are scheduled
> in.
> 
> For transition periods CIDs can go beyond the optimal space as long as they
> don't go beyond the number of possible CPUs.
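
A rough sketch of that ownership rule in C; the helper and its arguments
are hypothetical, not taken from the patch:

	/* Hypothetical sketch of the per task vs. per CPU ownership rule */
	static inline bool mm_cid_percpu_mode(unsigned int nr_tasks,
					      unsigned int nr_cpus_allowed)
	{
		/*
		 * The optimal CID space is min(nr_tasks, nr_cpus_allowed).
		 * While nr_tasks <= nr_cpus_allowed every task can own a
		 * CID; beyond that the CPUs own the CIDs and tasks borrow
		 * them while scheduled in.
		 */
		return nr_tasks > nr_cpus_allowed;
	}
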
> 
> The current upstream implementation tries to keep the CID with the task
> even in overcommit situations, which complicates task migration. It also
> has to do the CID space consolidation work from a task work in the exit
> to user space path. As that work is assigned to a random task related to
> an MM, this can inflict unwanted exit latencies.
> 
> Implement the context switch parts of a strict ownership mechanism to
> address this.
> 
> This removes most of the work from the task which schedules out. Only
> while transitioning from per CPU to per task ownership is it required to
> drop the CID when leaving the CPU to prevent CID space exhaustion. Other
> than that scheduling out is just a single check and branch.
> 
> The task which schedules in has to check whether:
> 
>      1) The ownership mode changed
>      2) The CID is within the optimal CID space
> 
> In stable situations this results in zero work. The only short disruption
> is when ownership mode changes or when the associated CID is not in the
> optimal CID space. The latter only happens when tasks exit and therefore
> the optimal CID space shrinks.
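
A hedged sketch of what those two schedule-in checks amount to; the task
fields, the mm_cid.mode field and the helper name are hypothetical, only
mm_cid.max_cids appears in the patch excerpt further down:

	/* Hypothetical sketch of the schedule-in fast path, not the patch */
	static inline bool mm_cid_sched_in_fast(struct task_struct *p,
						struct mm_struct *mm)
	{
		/* 1) Did the ownership mode change? */
		if (unlikely(p->mm_cid_mode != READ_ONCE(mm->mm_cid.mode)))
			return false;
		/* 2) Is the CID still within the optimal CID space? */
		if (unlikely(p->mm_cid_val >= READ_ONCE(mm->mm_cid.max_cids)))
			return false;
		/* Stable situation: zero work */
		return true;
	}
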
> 
> That mechanism is strictly optimized for the common case where no change
> happens. The only case where it actually causes a temporary one-time spike
> is on mode changes, when and only when a lot of tasks related to an MM
> schedule at exactly the same time and eventually have to compete for
> allocating a CID from the bitmap.
> 
> In the sysbench test case which triggered the spinlock contention in the
> initial CID code, __schedule() drops significantly in perf top on a 128
> Core (256 threads) machine when running sysbench with 255 threads, which
> fits into the task mode limit of 256 together with the parent thread:
> 
>    Upstream  rseq/perf branch  +CID rework
>    0.42%     0.37%             0.32%          [k] __schedule
> 
> Increasing the number of threads to 256, which puts the test process into
> per CPU mode, looks about the same.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---

[...]   
> +static inline unsigned int __mm_get_cid(struct mm_struct *mm, unsigned int max_cids)
> +{
> +	unsigned int cid = find_first_zero_bit(mm_cidmask(mm), max_cids);
> +
> +	if (cid >= max_cids)
> +		return MM_CID_UNSET;
> +	if (test_and_set_bit(cid, mm_cidmask(mm)))
> +		return MM_CID_UNSET;
> +	return cid;
> +}
> +
> +static inline unsigned int mm_get_cid(struct mm_struct *mm)
> +{
> +	unsigned int cid = __mm_get_cid(mm, READ_ONCE(mm->mm_cid.max_cids));
> +
> +	for (; cid == MM_CID_UNSET; cpu_relax())

This triggers a compile error on ppc64le.

In file included from ./include/vdso/processor.h:10,
                  from ./arch/powerpc/include/asm/processor.h:9,
                  from ./include/linux/sched.h:13,
                  from ./include/linux/sched/affinity.h:1,
                  from kernel/sched/sched.h:8,
                  from kernel/sched/rq-offsets.c:5:
kernel/sched/sched.h: In function ‘mm_get_cid’:
./arch/powerpc/include/asm/vdso/processor.h:26:9: error: expected expression before ‘asm’
    26 |         asm volatile(ASM_FTR_IFCLR(                                     \
       |         ^~~
kernel/sched/sched.h:3615:37: note: in expansion of macro ‘cpu_relax’
  3615 |         for (; cid == MM_CID_UNSET; cpu_relax())


+linuxppc-dev.


Moving it into a while loop seems to make the compiler happy.
Need to see why, but thought of informing you first.

  kernel/sched/sched.h | 7 +++++--
  1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 50bc0eb08a66..8a6ad4fc9534 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3612,8 +3612,10 @@ static inline unsigned int mm_get_cid(struct mm_struct *mm)
  {
  	unsigned int cid = __mm_get_cid(mm, READ_ONCE(mm->mm_cid.max_cids));
  
-	for (; cid == MM_CID_UNSET; cpu_relax())
-		cid = __mm_get_cid(mm, num_possible_cpus());
+	while (cid == MM_CID_UNSET) {
+		cid = __mm_get_cid(mm, num_possible_cpus());
+		cpu_relax();
+	}
  
  	return cid;
  }




> +		cid = __mm_get_cid(mm, num_possible_cpus());
> +
> +	return cid;
> +}
> +
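
The allocate-and-claim pattern in __mm_get_cid() above can be modeled with
plain C11 atomics; this standalone sketch (all names hypothetical, mask
limited to 64 CIDs) shows why the claim can fail even after a zero bit was
found, which is what forces the retry loop in mm_get_cid():

	#include <stdatomic.h>

	#define CID_UNSET	(~0U)

	static atomic_ulong cidmask;	/* stands in for mm_cidmask(mm) */

	static unsigned int model_get_cid(unsigned int max_cids)
	{
		unsigned long mask = atomic_load(&cidmask);
		unsigned int cid;

		/* find_first_zero_bit() equivalent on a snapshot */
		for (cid = 0; cid < max_cids && (mask & (1UL << cid)); cid++)
			;
		if (cid >= max_cids)
			return CID_UNSET;
		/*
		 * test_and_set_bit() equivalent: another thread can claim
		 * the same bit between the scan and this RMW, in which case
		 * the old value already has the bit set and the caller must
		 * retry.
		 */
		if (atomic_fetch_or(&cidmask, 1UL << cid) & (1UL << cid))
			return CID_UNSET;
		return cid;
	}
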



* Re: [patch V2 16/20] sched/mmcid: Provide new scheduler CID mechanism
  2025-10-27  5:11   ` [patch V2 16/20] sched/mmcid: Provide new scheduler CID mechanism Shrikanth Hegde
@ 2025-10-27  8:54     ` Thomas Gleixner
  0 siblings, 0 replies; 2+ messages in thread
From: Thomas Gleixner @ 2025-10-27  8:54 UTC (permalink / raw)
  To: Shrikanth Hegde, LKML, linuxppc-dev
  Cc: Peter Zijlstra, Gabriele Monaco, Mathieu Desnoyers,
	Michael Jeanson, Jens Axboe, Paul E. McKenney, Gautham R. Shenoy,
	Florian Weimer, Tim Chen, Yury Norov, Madhavan Srinivasan

On Mon, Oct 27 2025 at 10:41, Shrikanth Hegde wrote:
> On 10/22/25 6:25 PM, Thomas Gleixner wrote:
>> The MM CID management has two fundamental requirements:
>> 
>>    1) It has to guarantee that at no given point in time the same CID is
>>       used by concurrent tasks in userspace.
>> 
>>    2) The CID space must not exceed the number of possible CPUs in a
>>       system. While most allocators (glibc, tcmalloc, jemalloc) do not
>>       care about that, there seems to be at least some LTTng library
>>       depending on it.
>> 
>> The CID space compaction itself is not a functional correctness
>> requirement, it is only a useful optimization mechanism to reduce the
>> memory footprint in unused user space pools.
>> 
>
> Just wondering: if there is no user space request for CID, should this whole
> mechanism be behind a static check to avoid any overhead?

The problem is that CID has been introduced unconditionally with RSEQ
and there is no mechanism to opt-in. So we could go and change the ABI,
but as you know that's generally frowned upon.

I thought about adding a static key, but that'd be systemwide and would
probably need to be opt-out for the same reason.
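
A default-on (opt-out) key along those lines could look like the following
sketch; the key name and the wrapper are purely illustrative, while
DEFINE_STATIC_KEY_TRUE() and static_branch_likely() are the existing jump
label facilities:

	/* Illustrative only: systemwide opt-out, default enabled */
	DEFINE_STATIC_KEY_TRUE(mm_cid_active);

	static inline bool mm_cid_enabled(void)
	{
		/* Compiles down to a patched NOP/jump, near zero cost */
		return static_branch_likely(&mm_cid_active);
	}
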

>> +static inline unsigned int mm_get_cid(struct mm_struct *mm)
>> +{
>> +	unsigned int cid = __mm_get_cid(mm, READ_ONCE(mm->mm_cid.max_cids));
>> +
>> +	for (; cid == MM_CID_UNSET; cpu_relax())
>
> This triggers a compile error on ppc64le.
>
> In file included from ./include/vdso/processor.h:10,
>                   from ./arch/powerpc/include/asm/processor.h:9,
>                   from ./include/linux/sched.h:13,
>                   from ./include/linux/sched/affinity.h:1,
>                   from kernel/sched/sched.h:8,
>                   from kernel/sched/rq-offsets.c:5:
> kernel/sched/sched.h: In function ‘mm_get_cid’:
> ./arch/powerpc/include/asm/vdso/processor.h:26:9: error: expected expression before ‘asm’
>     26 |         asm volatile(ASM_FTR_IFCLR(                                     \
>        |         ^~~
> kernel/sched/sched.h:3615:37: note: in expansion of macro ‘cpu_relax’
>   3615 |         for (; cid == MM_CID_UNSET; cpu_relax())
>

Duh. Did not notice because x86 implements cpu_relax() as a static
inline while PPC has it as a plain macro define. Let me move it out of
the for() then.
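
The underlying C issue is easy to reproduce in userspace: a GNU asm
construct is a statement, and the third clause of a for() grammatically
requires an expression, so a cpu_relax() implemented as a macro expanding
to asm only works in statement position. A minimal illustration with a
hypothetical macro:

	/* Models the PPC case: expands to a statement, not an expression */
	#define relax()	asm volatile("" ::: "memory")

	static void spin_wait(volatile int *flag)
	{
		/*
		 * for (; !*flag; relax())
		 *	;
		 * fails with "expected expression before 'asm'", exactly
		 * as reported above, while statement position is fine:
		 */
		while (!*flag)
			relax();
	}
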

Thanks,

        tglx



