* [PATCH] sched/fair: do not scan twice in detach_tasks()
@ 2025-07-07  8:36 Huang Shijie
  2025-07-15 17:04 ` Valentin Schneider
  0 siblings, 1 reply; 6+ messages in thread
From: Huang Shijie @ 2025-07-07  8:36 UTC (permalink / raw)
  To: mingo, peterz, juri.lelli, vincent.guittot
  Cc: patches, cl, Shubhang, dietmar.eggemann, rostedt, bsegall,
	mgorman, vschneid, linux-kernel, Huang Shijie

When detach_tasks() scans the src_cpu's task list, the list may
shrink while the scan is in progress. For example, a list that
holds four tasks at the beginning may be down to two partway
through the scan in detach_tasks():
    Task list at beginning : "ABCD"
    Task list in scanning  : "CD"

    (A, B, C and D stand for different tasks.)

In this scenario, env->loop_max is still four, so detach_tasks()
may scan some tasks twice:
    the scanning order may be : "DCDC"

This patch introduces "first_back" to record the first task that is
put back on the task list. If we encounter a task equal to first_back,
we break out of the loop, avoiding a second scan of it.
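
For illustration, the relevant loop shape (simplified, details
elided) is roughly:

	while (!list_empty(tasks)) {
		...
		p = list_last_entry(tasks, struct task_struct, se.group_node);
		...
next:
		/* a task that cannot be detached is rotated to the head */
		list_move(&p->se.group_node, tasks);
	}

Once every remaining task has been rotated to the head,
list_last_entry() starts returning tasks that were already visited;
"first_back" detects exactly that point.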

Signed-off-by: Huang Shijie <shijie@os.amperecomputing.com>
---
 kernel/sched/fair.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7e2963efe800..0e9c8ae68cc2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9443,6 +9443,7 @@ static int detach_tasks(struct lb_env *env)
 {
 	struct list_head *tasks = &env->src_rq->cfs_tasks;
 	unsigned long util, load;
+	struct task_struct *first_back = NULL;
 	struct task_struct *p;
 	int detached = 0;
 
@@ -9481,6 +9482,8 @@ static int detach_tasks(struct lb_env *env)
 		}
 
 		p = list_last_entry(tasks, struct task_struct, se.group_node);
+		if (p == first_back)
+			break;
 
 		if (!can_migrate_task(p, env))
 			goto next;
@@ -9562,6 +9565,8 @@ static int detach_tasks(struct lb_env *env)
 			schedstat_inc(p->stats.nr_failed_migrations_hot);
 
 		list_move(&p->se.group_node, tasks);
+		if (!first_back)
+			first_back = p;
 	}
 
 	/*
-- 
2.40.1



* Re: [PATCH] sched/fair: do not scan twice in detach_tasks()
  2025-07-07  8:36 [PATCH] sched/fair: do not scan twice in detach_tasks() Huang Shijie
@ 2025-07-15 17:04 ` Valentin Schneider
  2025-07-16  2:13   ` Shijie Huang
  0 siblings, 1 reply; 6+ messages in thread
From: Valentin Schneider @ 2025-07-15 17:04 UTC (permalink / raw)
  To: Huang Shijie, mingo, peterz, juri.lelli, vincent.guittot
  Cc: patches, cl, Shubhang, dietmar.eggemann, rostedt, bsegall,
	mgorman, linux-kernel, Huang Shijie

On 07/07/25 16:36, Huang Shijie wrote:
> When detach_tasks() scans the src_cpu's task list, the list may
> shrink while the scan is in progress. For example, a list that
> holds four tasks at the beginning may be down to two partway
> through the scan in detach_tasks():
>     Task list at beginning : "ABCD"
>     Task list in scanning  : "CD"
>
>     (A, B, C and D stand for different tasks.)
>
> In this scenario, env->loop_max is still four, so detach_tasks()
> may scan some tasks twice:
>     the scanning order may be : "DCDC"
>


Huh, a quick hacky test suggests this isn't /too/ hard to trigger; I get
about one occurrence every two default hackbench runs (~200ms) on my dummy
QEMU setup.

Have you seen this happen on your workloads or did you find this while
staring at the code?

> This patch introduces "first_back" to record the first task that is
> put back on the task list. If we encounter a task equal to first_back,
> we break out of the loop, avoiding a second scan of it.
>

Potentially more than twice, no?

> Signed-off-by: Huang Shijie <shijie@os.amperecomputing.com>
> ---
>  kernel/sched/fair.c | 5 +++++
>  1 file changed, 5 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 7e2963efe800..0e9c8ae68cc2 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9443,6 +9443,7 @@ static int detach_tasks(struct lb_env *env)
>  {
>       struct list_head *tasks = &env->src_rq->cfs_tasks;
>       unsigned long util, load;
> +	struct task_struct *first_back = NULL;
>       struct task_struct *p;
>       int detached = 0;
>
> @@ -9481,6 +9482,8 @@ static int detach_tasks(struct lb_env *env)
>               }
>
>               p = list_last_entry(tasks, struct task_struct, se.group_node);

Small comment nit:

                /*
                 * We're back to an already visited task that couldn't be
                 * detached, we've seen all there is to see.
                 */

> +		if (p == first_back)
> +			break;
>
>               if (!can_migrate_task(p, env))
>                       goto next;
> @@ -9562,6 +9565,8 @@ static int detach_tasks(struct lb_env *env)
>                       schedstat_inc(p->stats.nr_failed_migrations_hot);
>
>               list_move(&p->se.group_node, tasks);
> +		if (!first_back)
> +			first_back = p;
>       }
>
>       /*
> --
> 2.40.1



* Re: [PATCH] sched/fair: do not scan twice in detach_tasks()
  2025-07-15 17:04 ` Valentin Schneider
@ 2025-07-16  2:13   ` Shijie Huang
  2025-07-16 15:08     ` Valentin Schneider
  0 siblings, 1 reply; 6+ messages in thread
From: Shijie Huang @ 2025-07-16  2:13 UTC (permalink / raw)
  To: Valentin Schneider, Huang Shijie, mingo, peterz, juri.lelli,
	vincent.guittot
  Cc: patches, cl, Shubhang, dietmar.eggemann, rostedt, bsegall,
	mgorman, linux-kernel


On 2025/7/16 1:04, Valentin Schneider wrote:
> On 07/07/25 16:36, Huang Shijie wrote:
>> When detach_tasks() scans the src_cpu's task list, the list may
>> shrink while the scan is in progress. For example, a list that
>> holds four tasks at the beginning may be down to two partway
>> through the scan in detach_tasks():
>>      Task list at beginning : "ABCD"
>>      Task list in scanning  : "CD"
>>
>>      (A, B, C and D stand for different tasks.)
>>
>> In this scenario, env->loop_max is still four, so detach_tasks()
>> may scan some tasks twice:
>>      the scanning order may be : "DCDC"
>>
>
> Huh, a quick hacky test suggests this isn't /too/ hard to trigger; I get
> about one occurrence every two default hackbench runs (~200ms) on my dummy
> QEMU setup.
>
> Have you seen this happen on your workloads or did you find this while
> staring at the code?

I found this issue in my SPECjbb2015 test. It is very easy to trigger.

I noticed it many times during the test.

I even found a case like this:

    At the beginning, env->loop_max was 12.

    By the time detach_tasks() scanned, the task list had shrunk to 5.


>> This patch introduces "first_back" to record the first task that is
>> put back on the task list. If we encounter a task equal to first_back,
>> we break out of the loop, avoiding a second scan of it.
>>
> Potentially more than twice, no?

Yes, it may be more than twice in theory.



>
>> Signed-off-by: Huang Shijie <shijie@os.amperecomputing.com>
>> ---
>>   kernel/sched/fair.c | 5 +++++
>>   1 file changed, 5 insertions(+)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 7e2963efe800..0e9c8ae68cc2 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -9443,6 +9443,7 @@ static int detach_tasks(struct lb_env *env)
>>   {
>>        struct list_head *tasks = &env->src_rq->cfs_tasks;
>>        unsigned long util, load;
>> +	struct task_struct *first_back = NULL;
>>        struct task_struct *p;
>>        int detached = 0;
>>
>> @@ -9481,6 +9482,8 @@ static int detach_tasks(struct lb_env *env)
>>                }
>>
>>                p = list_last_entry(tasks, struct task_struct, se.group_node);
> Small comment nit:
>
>                  /*
>                   * We're back to an already visited task that couldn't be
>                   * detached, we've seen all there is to see.
>                   */

Thanks, I will use it in the next version.


Thanks

Huang Shijie



* Re: [PATCH] sched/fair: do not scan twice in detach_tasks()
  2025-07-16  2:13   ` Shijie Huang
@ 2025-07-16 15:08     ` Valentin Schneider
  2025-07-17  2:56       ` Shijie Huang
  0 siblings, 1 reply; 6+ messages in thread
From: Valentin Schneider @ 2025-07-16 15:08 UTC (permalink / raw)
  To: Shijie Huang, Huang Shijie, mingo, peterz, juri.lelli,
	vincent.guittot
  Cc: patches, cl, Shubhang, dietmar.eggemann, rostedt, bsegall,
	mgorman, linux-kernel

On 16/07/25 10:13, Shijie Huang wrote:
> On 2025/7/16 1:04, Valentin Schneider wrote:
>> On 07/07/25 16:36, Huang Shijie wrote:
>>> When detach_tasks() scans the src_cpu's task list, the list may
>>> shrink while the scan is in progress. For example, a list that
>>> holds four tasks at the beginning may be down to two partway
>>> through the scan in detach_tasks():
>>>      Task list at beginning : "ABCD"
>>>      Task list in scanning  : "CD"
>>>
>>>      (A, B, C and D stand for different tasks.)
>>>
>>> In this scenario, env->loop_max is still four, so detach_tasks()
>>> may scan some tasks twice:
>>>      the scanning order may be : "DCDC"
>>>
>>
>> Huh, a quick hacky test suggests this isn't /too/ hard to trigger; I get
>> about one occurrence every two default hackbench runs (~200ms) on my dummy
>> QEMU setup.
>>
>> Have you seen this happen on your workloads or did you find this while
>> staring at the code?
>
> I found this issue in my SPECjbb2015 test. It is very easy to trigger.
>

That would be good to include in the changelog.

> I noticed it many times during the test.
>
> I even found a case like this:
>
>     At the beginning, env->loop_max was 12.
>
>     By the time detach_tasks() scanned, the task list had shrunk to 5.
>

That's set using rq->nr_running which includes more than just the CFS
tasks, and looking at the git history it looks like it's almost always been
the case.

Looks like env->loop_max really is only used for detach_tasks(), so perhaps
a "better" fix would be something like the below, so that we can't iterate
more than length(env->src_rq->cfs_tasks). That is, assuming

  rq->cfs.h_nr_queued == length(rq->cfs_tasks)

---
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b9b4bbbf0af6f..32ae24aa377ca 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -11687,7 +11687,7 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
 		 * still unbalanced. ld_moved simply stays zero, so it is
 		 * correctly treated as an imbalance.
 		 */
-		env.loop_max  = min(sysctl_sched_nr_migrate, busiest->nr_running);
+		env.loop_max  = min(sysctl_sched_nr_migrate, busiest->cfs.h_nr_queued);
 
 more_balance:
 		rq_lock_irqsave(busiest, &rf);



* Re: [PATCH] sched/fair: do not scan twice in detach_tasks()
  2025-07-16 15:08     ` Valentin Schneider
@ 2025-07-17  2:56       ` Shijie Huang
  2025-07-17  9:49         ` Valentin Schneider
  0 siblings, 1 reply; 6+ messages in thread
From: Shijie Huang @ 2025-07-17  2:56 UTC (permalink / raw)
  To: Valentin Schneider, Huang Shijie, mingo, peterz, juri.lelli,
	vincent.guittot
  Cc: patches, cl, Shubhang, dietmar.eggemann, rostedt, bsegall,
	mgorman, linux-kernel


On 2025/7/16 23:08, Valentin Schneider wrote:
> On 16/07/25 10:13, Shijie Huang wrote:
>> On 2025/7/16 1:04, Valentin Schneider wrote:
>>> On 07/07/25 16:36, Huang Shijie wrote:
>>>> When detach_tasks() scans the src_cpu's task list, the list may
>>>> shrink while the scan is in progress. For example, a list that
>>>> holds four tasks at the beginning may be down to two partway
>>>> through the scan in detach_tasks():
>>>>       Task list at beginning : "ABCD"
>>>>       Task list in scanning  : "CD"
>>>>
>>>>       (A, B, C and D stand for different tasks.)
>>>>
>>>> In this scenario, env->loop_max is still four, so detach_tasks()
>>>> may scan some tasks twice:
>>>>       the scanning order may be : "DCDC"
>>>>
>>> Huh, a quick hacky test suggests this isn't /too/ hard to trigger; I get
>>> about one occurrence every two default hackbench runs (~200ms) on my dummy
>>> QEMU setup.
>>>
>>> Have you seen this happen on your workloads or did you find this while
>>> staring at the code?
>> I found this issue in my SPECjbb2015 test. It is very easy to trigger.
>>
> That would be good to include in the changelog.
okay.
>> I noticed it many times during the test.
>>
>> I even found a case like this:
>>
>>     At the beginning, env->loop_max was 12.
>>
>>     By the time detach_tasks() scanned, the task list had shrunk to 5.
>>
> That's set using rq->nr_running which includes more than just the CFS
> tasks, and looking at the git history it looks like it's almost always been
> the case.
>
> Looks like env->loop_max really is only used for detach_tasks(), so perhaps
> a "better" fix would be something like the below, so that we can't iterate
> more than length(env->src_rq->cfs_tasks). That is, assuming
>
>    rq->cfs.h_nr_queued == length(rq->cfs_tasks)
>
> ---
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index b9b4bbbf0af6f..32ae24aa377ca 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -11687,7 +11687,7 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
>   		 * still unbalanced. ld_moved simply stays zero, so it is
>   		 * correctly treated as an imbalance.
>   		 */
> -		env.loop_max  = min(sysctl_sched_nr_migrate, busiest->nr_running);
> +		env.loop_max  = min(sysctl_sched_nr_migrate, busiest->cfs.h_nr_queued);

I tested this patch, but it did not work: I can still catch lots of
occurrences of this issue in the SPECjbb test.

IMHO, the root cause of this issue is that env.loop_max is set outside
the rq's lock.

Even if we set env.loop_max to busiest->cfs.h_nr_queued, the real task
list can still shrink elsewhere.

Btw, the real task list can also grow bigger than env.loop_max. For
example, if we set env.loop_max to 5, the list may have grown to 7 by
the time detach_tasks() runs.
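
Just to illustrate the idea (a sketch only, not a tested patch),
setting it under the lock would look something like:

more_balance:
	rq_lock_irqsave(busiest, &rf);
	update_rq_clock(busiest);
	/* re-clamp while holding the lock, when the list cannot change */
	env.loop_max = min(env.loop_max, busiest->cfs.h_nr_queued);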


Thanks

Huang Shijie

>   
>   more_balance:
>   		rq_lock_irqsave(busiest, &rf);
>


* Re: [PATCH] sched/fair: do not scan twice in detach_tasks()
  2025-07-17  2:56       ` Shijie Huang
@ 2025-07-17  9:49         ` Valentin Schneider
  0 siblings, 0 replies; 6+ messages in thread
From: Valentin Schneider @ 2025-07-17  9:49 UTC (permalink / raw)
  To: Shijie Huang, Huang Shijie, mingo, peterz, juri.lelli,
	vincent.guittot
  Cc: patches, cl, Shubhang, dietmar.eggemann, rostedt, bsegall,
	mgorman, linux-kernel

On 17/07/25 10:56, Shijie Huang wrote:
> On 2025/7/16 23:08, Valentin Schneider wrote:
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index b9b4bbbf0af6f..32ae24aa377ca 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -11687,7 +11687,7 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
>>   		 * still unbalanced. ld_moved simply stays zero, so it is
>>   		 * correctly treated as an imbalance.
>>   		 */
>> -		env.loop_max  = min(sysctl_sched_nr_migrate, busiest->nr_running);
>> +		env.loop_max  = min(sysctl_sched_nr_migrate, busiest->cfs.h_nr_queued);
>
> I tested this patch, but it did not work: I can still catch lots of
> occurrences of this issue in the SPECjbb test.
>
> IMHO, the root cause of this issue is that env.loop_max is set outside
> the rq's lock.
>
> Even if we set env.loop_max to busiest->cfs.h_nr_queued, the real task
> list can still shrink elsewhere.
>

Ah right, and updating the max in detach_tasks() itself isn't a complete
solution if we re-enter it due to LBF_NEED_BREAK. Nevermind then :-)
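
For reference, the re-entry path looks roughly like this (simplified),
and env.loop_max is not recomputed when we jump back:

	if (env.flags & LBF_NEED_BREAK) {
		env.flags &= ~LBF_NEED_BREAK;
		/* loop_max keeps its (possibly stale) value across this */
		goto more_balance;
	}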


