[PATCH] sched/core: Minor optimize ttwu

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* [PATCH] sched/core: Minor optimize ttwu_runnable()
@ 2022-11-02 10:23 Chengming Zhou
  2022-11-04 18:19 ` Valentin Schneider
  2022-11-08  9:20 ` Peter Zijlstra
  0 siblings, 2 replies; 11+ messages in thread
From: Chengming Zhou @ 2022-11-02 10:23 UTC (permalink / raw)
  To: mingo, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
	rostedt, bsegall, mgorman, bristot, vschneid
  Cc: linux-kernel, Chengming Zhou

ttwu_runnable() is used as a fast wakeup path when the wakee task
is between set_current_state() and schedule(), in which case all
we need to do is change p->state back to TASK_RUNNING. So we don't
need to update_rq_clock() and check_preempt_curr() in this case.

Some performance numbers using mmtests/perfpipe on Intel Xeon server:

                           linux-next                patched
Min       Time        8.67 (   0.00%)        8.66 (   0.13%)
1st-qrtle Time        8.83 (   0.00%)        8.72 (   1.19%)
2nd-qrtle Time        8.90 (   0.00%)        8.76 (   1.57%)
3rd-qrtle Time        8.98 (   0.00%)        8.82 (   1.82%)
Max-1     Time        8.67 (   0.00%)        8.66 (   0.13%)
Max-5     Time        8.67 (   0.00%)        8.66 (   0.13%)
Max-10    Time        8.79 (   0.00%)        8.69 (   1.09%)
Max-90    Time        9.01 (   0.00%)        8.84 (   1.94%)
Max-95    Time        9.02 (   0.00%)        8.85 (   1.86%)
Max-99    Time        9.02 (   0.00%)        8.88 (   1.56%)
Max       Time        9.59 (   0.00%)        8.89 (   7.29%)
Amean     Time        8.92 (   0.00%)        8.77 *   1.65%*

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
---
 kernel/sched/core.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 87c9cdf37a26..3785418de127 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3718,9 +3718,8 @@ static int ttwu_runnable(struct task_struct *p, int wake_flags)
 
 	rq = __task_rq_lock(p, &rf);
 	if (task_on_rq_queued(p)) {
-		/* check_preempt_curr() may use rq clock */
-		update_rq_clock(rq);
-		ttwu_do_wakeup(rq, p, wake_flags, &rf);
+		WRITE_ONCE(p->__state, TASK_RUNNING);
+		trace_sched_wakeup(p);
 		ret = 1;
 	}
 	__task_rq_unlock(rq, &rf);
-- 
2.37.2


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH] sched/core: Minor optimize ttwu_runnable()
  2022-11-02 10:23 [PATCH] sched/core: Minor optimize ttwu_runnable() Chengming Zhou
@ 2022-11-04 18:19 ` Valentin Schneider
  2022-11-05  1:34   ` Chengming Zhou
  2022-11-08  9:20 ` Peter Zijlstra
  1 sibling, 1 reply; 11+ messages in thread
From: Valentin Schneider @ 2022-11-04 18:19 UTC (permalink / raw)
  To: Chengming Zhou, mingo, peterz, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot
  Cc: linux-kernel, Chengming Zhou

On 02/11/22 18:23, Chengming Zhou wrote:
> ttwu_runnable() is used as a fast wakeup path when the wakee task
> is between set_current_state() and schedule(), in which case all
> we need to do is change p->state back to TASK_RUNNING. So we don't
> need to update_rq_clock() and check_preempt_curr() in this case.
>
> Some performance numbers using mmtests/perfpipe on Intel Xeon server:
>
>                            linux-next                patched
> Min       Time        8.67 (   0.00%)        8.66 (   0.13%)
> 1st-qrtle Time        8.83 (   0.00%)        8.72 (   1.19%)
> 2nd-qrtle Time        8.90 (   0.00%)        8.76 (   1.57%)
> 3rd-qrtle Time        8.98 (   0.00%)        8.82 (   1.82%)
> Max-1     Time        8.67 (   0.00%)        8.66 (   0.13%)
> Max-5     Time        8.67 (   0.00%)        8.66 (   0.13%)
> Max-10    Time        8.79 (   0.00%)        8.69 (   1.09%)
> Max-90    Time        9.01 (   0.00%)        8.84 (   1.94%)
> Max-95    Time        9.02 (   0.00%)        8.85 (   1.86%)
> Max-99    Time        9.02 (   0.00%)        8.88 (   1.56%)
> Max       Time        9.59 (   0.00%)        8.89 (   7.29%)
> Amean     Time        8.92 (   0.00%)        8.77 *   1.65%*
>
> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
> ---
>  kernel/sched/core.c | 5 ++---
>  1 file changed, 2 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 87c9cdf37a26..3785418de127 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3718,9 +3718,8 @@ static int ttwu_runnable(struct task_struct *p, int wake_flags)
>
>       rq = __task_rq_lock(p, &rf);
>       if (task_on_rq_queued(p)) {
> -		/* check_preempt_curr() may use rq clock */
> -		update_rq_clock(rq);
> -		ttwu_do_wakeup(rq, p, wake_flags, &rf);
> +		WRITE_ONCE(p->__state, TASK_RUNNING);
> +		trace_sched_wakeup(p);

This also loses the class->task_woken() call, AFAICT we could have
!p->on_cpu here (e.g. an IRQ hit before the task got to schedule()), but
then again if there is a need to push we should have done that at the IRQ
preempt via set_next_task_{rt, dl}()... So I'm starting to think this is
OK, but that needs elaborating in the changelog.


>               ret = 1;
>       }
>       __task_rq_unlock(rq, &rf);
> --
> 2.37.2


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] sched/core: Minor optimize ttwu_runnable()
  2022-11-04 18:19 ` Valentin Schneider
@ 2022-11-05  1:34   ` Chengming Zhou
  2022-11-07 12:56     ` Valentin Schneider
  0 siblings, 1 reply; 11+ messages in thread
From: Chengming Zhou @ 2022-11-05  1:34 UTC (permalink / raw)
  To: Valentin Schneider, mingo, peterz, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot
  Cc: linux-kernel

On 2022/11/5 02:19, Valentin Schneider wrote:
> On 02/11/22 18:23, Chengming Zhou wrote:
>> ttwu_runnable() is used as a fast wakeup path when the wakee task
>> is between set_current_state() and schedule(), in which case all
>> we need to do is change p->state back to TASK_RUNNING. So we don't
>> need to update_rq_clock() and check_preempt_curr() in this case.
>>
>> Some performance numbers using mmtests/perfpipe on Intel Xeon server:
>>
>>                            linux-next                patched
>> Min       Time        8.67 (   0.00%)        8.66 (   0.13%)
>> 1st-qrtle Time        8.83 (   0.00%)        8.72 (   1.19%)
>> 2nd-qrtle Time        8.90 (   0.00%)        8.76 (   1.57%)
>> 3rd-qrtle Time        8.98 (   0.00%)        8.82 (   1.82%)
>> Max-1     Time        8.67 (   0.00%)        8.66 (   0.13%)
>> Max-5     Time        8.67 (   0.00%)        8.66 (   0.13%)
>> Max-10    Time        8.79 (   0.00%)        8.69 (   1.09%)
>> Max-90    Time        9.01 (   0.00%)        8.84 (   1.94%)
>> Max-95    Time        9.02 (   0.00%)        8.85 (   1.86%)
>> Max-99    Time        9.02 (   0.00%)        8.88 (   1.56%)
>> Max       Time        9.59 (   0.00%)        8.89 (   7.29%)
>> Amean     Time        8.92 (   0.00%)        8.77 *   1.65%*
>>
>> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
>> ---
>>  kernel/sched/core.c | 5 ++---
>>  1 file changed, 2 insertions(+), 3 deletions(-)
>>
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 87c9cdf37a26..3785418de127 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -3718,9 +3718,8 @@ static int ttwu_runnable(struct task_struct *p, int wake_flags)
>>
>>       rq = __task_rq_lock(p, &rf);
>>       if (task_on_rq_queued(p)) {
>> -		/* check_preempt_curr() may use rq clock */
>> -		update_rq_clock(rq);
>> -		ttwu_do_wakeup(rq, p, wake_flags, &rf);
>> +		WRITE_ONCE(p->__state, TASK_RUNNING);
>> +		trace_sched_wakeup(p);
> 
> This also loses the class->task_woken() call, AFAICT we could have
> !p->on_cpu here (e.g. an IRQ hit before the task got to schedule()), but
> then again if there is a need to push we should have done that at the IRQ
> preempt via set_next_task_{rt, dl}()... So I'm starting to think this is
> OK, but that needs elaborating in the changelog.

Sorry, I don't get why we could have !p->on_cpu here?

I thought if we have task_on_rq_queued(p) here, it means p hasn't got to
__schedule() to deactivate_task(), so p should still be on_cpu?

Thanks.

> 
> 
>>               ret = 1;
>>       }
>>       __task_rq_unlock(rq, &rf);
>> --
>> 2.37.2
> 

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] sched/core: Minor optimize ttwu_runnable()
  2022-11-05  1:34   ` Chengming Zhou
@ 2022-11-07 12:56     ` Valentin Schneider
  2022-11-07 13:19       ` Chengming Zhou
  0 siblings, 1 reply; 11+ messages in thread
From: Valentin Schneider @ 2022-11-07 12:56 UTC (permalink / raw)
  To: Chengming Zhou, mingo, peterz, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot
  Cc: linux-kernel

On 05/11/22 09:34, Chengming Zhou wrote:
> On 2022/11/5 02:19, Valentin Schneider wrote:
>> On 02/11/22 18:23, Chengming Zhou wrote:
>>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>>> index 87c9cdf37a26..3785418de127 100644
>>> --- a/kernel/sched/core.c
>>> +++ b/kernel/sched/core.c
>>> @@ -3718,9 +3718,8 @@ static int ttwu_runnable(struct task_struct *p, int wake_flags)
>>>
>>>       rq = __task_rq_lock(p, &rf);
>>>       if (task_on_rq_queued(p)) {
>>> -		/* check_preempt_curr() may use rq clock */
>>> -		update_rq_clock(rq);
>>> -		ttwu_do_wakeup(rq, p, wake_flags, &rf);
>>> +		WRITE_ONCE(p->__state, TASK_RUNNING);
>>> +		trace_sched_wakeup(p);
>>
>> This also loses the class->task_woken() call, AFAICT we could have
>> !p->on_cpu here (e.g. an IRQ hit before the task got to schedule()), but
>> then again if there is a need to push we should have done that at the IRQ
>> preempt via set_next_task_{rt, dl}()... So I'm starting to think this is
>> OK, but that needs elaborating in the changelog.
>
> Sorry, I don't get why we could have !p->on_cpu here?
>
> I thought if we have task_on_rq_queued(p) here, it means p hasn't got to
> __schedule() to deactivate_task(), so p should still be on_cpu?
>
> Thanks.
>

So p is still on the rq for sure, but it may not be the currently running
task. Consider, on a CONFIG_PREEMPT kernel, task p0 in a wait loop:

0    for (;;) {
1       set_current_state(TASK_UNINTERRUPTIBLE);
2
3       if (CONDITION)
4          break;
5
6       schedule();
7    }
8    __set_current_state(TASK_RUNNING);

Now further consider p0 executes line 1, but then gets interrupted by an
IRQ, at the end of which preempt_schedule_irq() happens. We enter
__schedule(SM_PREEMPT), so p0 isn't dequeued due to its non-zero task
state, but we *can* end up switching the CPU's current task.

ISTR that's why Peter renamed that function ttwu_*runnable*() and not
ttwu_*running*().

>>
>>
>>>               ret = 1;
>>>       }
>>>       __task_rq_unlock(rq, &rf);
>>> --
>>> 2.37.2
>>


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] sched/core: Minor optimize ttwu_runnable()
  2022-11-07 12:56     ` Valentin Schneider
@ 2022-11-07 13:19       ` Chengming Zhou
  2022-11-07 15:54         ` Valentin Schneider
  0 siblings, 1 reply; 11+ messages in thread
From: Chengming Zhou @ 2022-11-07 13:19 UTC (permalink / raw)
  To: Valentin Schneider, mingo, peterz, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot
  Cc: linux-kernel

On 2022/11/7 20:56, Valentin Schneider wrote:
> On 05/11/22 09:34, Chengming Zhou wrote:
>> On 2022/11/5 02:19, Valentin Schneider wrote:
>>> On 02/11/22 18:23, Chengming Zhou wrote:
>>>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>>>> index 87c9cdf37a26..3785418de127 100644
>>>> --- a/kernel/sched/core.c
>>>> +++ b/kernel/sched/core.c
>>>> @@ -3718,9 +3718,8 @@ static int ttwu_runnable(struct task_struct *p, int wake_flags)
>>>>
>>>>       rq = __task_rq_lock(p, &rf);
>>>>       if (task_on_rq_queued(p)) {
>>>> -		/* check_preempt_curr() may use rq clock */
>>>> -		update_rq_clock(rq);
>>>> -		ttwu_do_wakeup(rq, p, wake_flags, &rf);
>>>> +		WRITE_ONCE(p->__state, TASK_RUNNING);
>>>> +		trace_sched_wakeup(p);
>>>
>>> This also loses the class->task_woken() call, AFAICT we could have
>>> !p->on_cpu here (e.g. an IRQ hit before the task got to schedule()), but
>>> then again if there is a need to push we should have done that at the IRQ
>>> preempt via set_next_task_{rt, dl}()... So I'm starting to think this is
>>> OK, but that needs elaborating in the changelog.
>>
>> Sorry, I don't get why we could have !p->on_cpu here?
>>
>> I thought if we have task_on_rq_queued(p) here, it means p hasn't got to
>> __schedule() to deactivate_task(), so p should still be on_cpu?
>>
>> Thanks.
>>
> 
> So p is still on the rq for sure, but it may not be the currently running
> task. Consider, on a CONFIG_PREEMPT kernel, task p0 in a wait loop:
> 
> 0    for (;;) {
> 1       set_current_state(TASK_UNINTERRUPTIBLE);
> 2
> 3       if (CONDITION)
> 4          break;
> 5
> 6       schedule();
> 7    }
> 8    __set_current_state(TASK_RUNNING);
> 
> Now further consider p0 executes line 1, but then gets interrupted by an
> IRQ, at the end of which preempt_schedule_irq() happens. We enter
> __schedule(SM_PREEMPT), so p0 isn't dequeued due to its non-zero task
> state, but we *can* end up switching the CPU's current task.

Thank you very much for this detailed explanation, I get it. Yes, this
process can be seen on CONFIG_PREEMPT kernel.

> 
> ISTR that's why Peter renamed that function ttwu_*runnable*() and not
> ttwu_*running*().

So this task p didn't really sleep, it's preempted, later scheduled in,
get to execute line 6 schedule(), but its state has been set to TASK_RUNNING,
it won't deactivate_task(p).

Looks like this patch is still reasonable? Should I put this detailed
explanation in the changelog and send v2?

Thanks!

> 
>>>
>>>
>>>>               ret = 1;
>>>>       }
>>>>       __task_rq_unlock(rq, &rf);
>>>> --
>>>> 2.37.2
>>>
> 

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] sched/core: Minor optimize ttwu_runnable()
  2022-11-07 13:19       ` Chengming Zhou
@ 2022-11-07 15:54         ` Valentin Schneider
  2022-11-08  7:38           ` Chengming Zhou
  2022-11-08  9:11           ` Peter Zijlstra
  0 siblings, 2 replies; 11+ messages in thread
From: Valentin Schneider @ 2022-11-07 15:54 UTC (permalink / raw)
  To: Chengming Zhou, mingo, peterz, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot
  Cc: linux-kernel

On 07/11/22 21:19, Chengming Zhou wrote:
> On 2022/11/7 20:56, Valentin Schneider wrote:
>>
>> So p is still on the rq for sure, but it may not be the currently running
>> task. Consider, on a CONFIG_PREEMPT kernel, task p0 in a wait loop:
>>
>> 0    for (;;) {
>> 1       set_current_state(TASK_UNINTERRUPTIBLE);
>> 2
>> 3       if (CONDITION)
>> 4          break;
>> 5
>> 6       schedule();
>> 7    }
>> 8    __set_current_state(TASK_RUNNING);
>>
>> Now further consider p0 executes line 1, but then gets interrupted by an
>> IRQ, at the end of which preempt_schedule_irq() happens. We enter
>> __schedule(SM_PREEMPT), so p0 isn't dequeued due to its non-zero task
>> state, but we *can* end up switching the CPU's current task.
>
> Thank you very much for this detailed explanation, I get it. Yes, this
> process can be seen on CONFIG_PREEMPT kernel.
>
>>
>> ISTR that's why Peter renamed that function ttwu_*runnable*() and not
>> ttwu_*running*().
>
> So this task p didn't really sleep, it's preempted, later scheduled in,
> get to execute line 6 schedule(), but its state has been set to TASK_RUNNING,
> it won't deactivate_task(p).
>

Right!

> Looks like this patch is still reasonable? Should I put this detailed
> explanation in the changelog and send v2?
>

So that's the part for the p->sched_class->task_woken() callback, which
only affects RT and DL (and only does something when !p->on_cpu). I *think*
it's fine to remove it from ttwu_runnable() as any push/pull should have
happened when other tasks were enqueued on the same CPU - with that said,
it wouldn't hurt to double check this :-)

As for the check_preempt_curr(), since per the above p can be preempted,
you could have scenarios right now with CFS tasks where
ttwu_runnable()->check_preempt_curr() causes NEED_RESCHED to be set.

e.g. p0 does

  set_current_state(TASK_UNINTERRUPTIBLE)

but then gets interrupted by the tick, a p1 gets selected to run instead
because of check_preempt_tick(), and then runs long enough to have
check_preempt_curr() decide to let p0 preempt p1.

That does require specific timing (lower tick frequency should make this
more likely) and probably task niceness distribution too, but isn't
impossible.

Maybe try reading p->on_cpu, and only do the quick task state update if
it's still the current task, otherwise do the preemption checks?

> Thanks!
>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] sched/core: Minor optimize ttwu_runnable()
  2022-11-07 15:54         ` Valentin Schneider
@ 2022-11-08  7:38           ` Chengming Zhou
  2022-11-08  9:11           ` Peter Zijlstra
  1 sibling, 0 replies; 11+ messages in thread
From: Chengming Zhou @ 2022-11-08  7:38 UTC (permalink / raw)
  To: Valentin Schneider, mingo, peterz, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot
  Cc: linux-kernel

On 2022/11/7 23:54, Valentin Schneider wrote:
> On 07/11/22 21:19, Chengming Zhou wrote:
>> On 2022/11/7 20:56, Valentin Schneider wrote:
>>>
>>> So p is still on the rq for sure, but it may not be the currently running
>>> task. Consider, on a CONFIG_PREEMPT kernel, task p0 in a wait loop:
>>>
>>> 0    for (;;) {
>>> 1       set_current_state(TASK_UNINTERRUPTIBLE);
>>> 2
>>> 3       if (CONDITION)
>>> 4          break;
>>> 5
>>> 6       schedule();
>>> 7    }
>>> 8    __set_current_state(TASK_RUNNING);
>>>
>>> Now further consider p0 executes line 1, but then gets interrupted by an
>>> IRQ, at the end of which preempt_schedule_irq() happens. We enter
>>> __schedule(SM_PREEMPT), so p0 isn't dequeued due to its non-zero task
>>> state, but we *can* end up switching the CPU's current task.
>>
>> Thank you very much for this detailed explanation, I get it. Yes, this
>> process can be seen on CONFIG_PREEMPT kernel.
>>
>>>
>>> ISTR that's why Peter renamed that function ttwu_*runnable*() and not
>>> ttwu_*running*().
>>
>> So this task p didn't really sleep, it's preempted, later scheduled in,
>> get to execute line 6 schedule(), but its state has been set to TASK_RUNNING,
>> it won't deactivate_task(p).
>>
> 
> Right!
> 
>> Looks like this patch is still reasonable? Should I put this detailed
>> explanation in the changelog and send v2?
>>
> 
> So that's the part for the p->sched_class->task_woken() callback, which
> only affects RT and DL (and only does something when !p->on_cpu). I *think*
> it's fine to remove it from ttwu_runnable() as any push/pull should have
> happened when other tasks were enqueued on the same CPU - with that said,
> it wouldn't hurt to double check this :-)

Yes, ttwu_runnable() should be fine to remove p->sched_class->task_woken().

> 
> 
> As for the check_preempt_curr(), since per the above p can be preempted,
> you could have scenarios right now with CFS tasks where
> ttwu_runnable()->check_preempt_curr() causes NEED_RESCHED to be set.
> 
> e.g. p0 does
> 
>   set_current_state(TASK_UNINTERRUPTIBLE)
> 
> but then gets interrupted by the tick, a p1 gets selected to run instead
> because of check_preempt_tick(), and then runs long enough to have
> check_preempt_curr() decide to let p0 preempt p1.
> 
> That does require specific timing (lower tick frequency should make this
> more likely) and probably task niceness distribution too, but isn't
> impossible.
> 
> Maybe try reading p->on_cpu, and only do the quick task state update if
> it's still the current task, otherwise do the preemption checks?

I think it's a good suggestion, so we still have the preemption check when
p0 is preempted by p1, letting p0 to be scheduled in more timely.

I will update and send v2 later.

> 
>> Thanks!
>>
> 

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] sched/core: Minor optimize ttwu_runnable()
  2022-11-07 15:54         ` Valentin Schneider
  2022-11-08  7:38           ` Chengming Zhou
@ 2022-11-08  9:11           ` Peter Zijlstra
  2022-11-08 10:08             ` Peter Zijlstra
  1 sibling, 1 reply; 11+ messages in thread
From: Peter Zijlstra @ 2022-11-08  9:11 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: Chengming Zhou, mingo, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot,
	linux-kernel

On Mon, Nov 07, 2022 at 03:54:38PM +0000, Valentin Schneider wrote:

> So that's the part for the p->sched_class->task_woken() callback, which
> only affects RT and DL (and only does something when !p->on_cpu). I *think*
> it's fine to remove it from ttwu_runnable() as any push/pull should have
> happened when other tasks were enqueued on the same CPU - with that said,
> it wouldn't hurt to double check this :-)
> 
> 
> As for the check_preempt_curr(), since per the above p can be preempted,
> you could have scenarios right now with CFS tasks where
> ttwu_runnable()->check_preempt_curr() causes NEED_RESCHED to be set.
> 
> e.g. p0 does
> 
>   set_current_state(TASK_UNINTERRUPTIBLE)
> 
> but then gets interrupted by the tick, a p1 gets selected to run instead
> because of check_preempt_tick(), and then runs long enough to have
> check_preempt_curr() decide to let p0 preempt p1.
> 
> That does require specific timing (lower tick frequency should make this
> more likely) and probably task niceness distribution too, but isn't
> impossible.
> 
> Maybe try reading p->on_cpu, and only do the quick task state update if
> it's still the current task, otherwise do the preemption checks?

I'm confused...

So all relevant parties take rq->lock:

 - __schedule()
 - scheduler_tick()
 - ttwu_runnable()

So if ttwu_runnable() sees on_rq and switches state back to RUNNING then
neither check_preempt_curr() nor task_woken() make any sense.

Specifically:

 - you can't very well preempt yourself (which is what
   check_preempt_curr() is trying to determine -- if the woken task
   should preempt the running task, they're both the same in this case);

 - the task did not actually wake up, because it was still on the
   runqueue to begin with. This path prevents a sleep, rather than
   issues a wakeup.

So yes, I think the patch as proposed is ok.


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] sched/core: Minor optimize ttwu_runnable()
  2022-11-08  9:11           ` Peter Zijlstra
@ 2022-11-08 10:08             ` Peter Zijlstra
  2022-11-08 11:17               ` Chengming Zhou
  0 siblings, 1 reply; 11+ messages in thread
From: Peter Zijlstra @ 2022-11-08 10:08 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: Chengming Zhou, mingo, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot,
	linux-kernel

On Tue, Nov 08, 2022 at 10:11:49AM +0100, Peter Zijlstra wrote:
> On Mon, Nov 07, 2022 at 03:54:38PM +0000, Valentin Schneider wrote:
> 
> > So that's the part for the p->sched_class->task_woken() callback, which
> > only affects RT and DL (and only does something when !p->on_cpu). I *think*
> > it's fine to remove it from ttwu_runnable() as any push/pull should have
> > happened when other tasks were enqueued on the same CPU - with that said,
> > it wouldn't hurt to double check this :-)
> > 
> > 
> > As for the check_preempt_curr(), since per the above p can be preempted,
> > you could have scenarios right now with CFS tasks where
> > ttwu_runnable()->check_preempt_curr() causes NEED_RESCHED to be set.
> > 
> > e.g. p0 does
> > 
> >   set_current_state(TASK_UNINTERRUPTIBLE)
> > 
> > but then gets interrupted by the tick, a p1 gets selected to run instead
> > because of check_preempt_tick(), and then runs long enough to have
> > check_preempt_curr() decide to let p0 preempt p1.
> > 
> > That does require specific timing (lower tick frequency should make this
> > more likely) and probably task niceness distribution too, but isn't
> > impossible.
> > 
> > Maybe try reading p->on_cpu, and only do the quick task state update if
> > it's still the current task, otherwise do the preemption checks?
> 
> I'm confused...

I am and Valentin has a point. It could indeed be preempted and in that
case check_preempt_curr() could indeed make it get back on.

In that case his suggestion might make sense; something along the lines
of so I suppose...

(And I think we can still do the reorg I proposed elsewhere, but I've not
yet tried.)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index cb2aa2b54c7a..6944d9473295 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3698,9 +3698,16 @@ static int ttwu_runnable(struct task_struct *p, int wake_flags)
 
 	rq = __task_rq_lock(p, &rf);
 	if (task_on_rq_queued(p)) {
-		/* check_preempt_curr() may use rq clock */
-		update_rq_clock(rq);
-		ttwu_do_wakeup(rq, p, wake_flags, &rf);
+		if (!p->on_cpu) {
+			/*
+			 * When on_rq && !on_cpu the task is preempted, see if
+			 * it should preempt whatever is current there now.
+			 */
+			update_rq_clock(rq);
+			check_preempt_curr(rq, p, wake_flags);
+		}
+		WRITE_ONCE(p->__state, TASK_RUNNING);
+		trace_sched_wakeup(p);
 		ret = 1;
 	}
 	__task_rq_unlock(rq, &rf);

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH] sched/core: Minor optimize ttwu_runnable()
  2022-11-08 10:08             ` Peter Zijlstra
@ 2022-11-08 11:17               ` Chengming Zhou
  0 siblings, 0 replies; 11+ messages in thread
From: Chengming Zhou @ 2022-11-08 11:17 UTC (permalink / raw)
  To: Peter Zijlstra, Valentin Schneider
  Cc: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
	bsegall, mgorman, bristot, linux-kernel

On 2022/11/8 18:08, Peter Zijlstra wrote:
> On Tue, Nov 08, 2022 at 10:11:49AM +0100, Peter Zijlstra wrote:
>> On Mon, Nov 07, 2022 at 03:54:38PM +0000, Valentin Schneider wrote:
>>
>>> So that's the part for the p->sched_class->task_woken() callback, which
>>> only affects RT and DL (and only does something when !p->on_cpu). I *think*
>>> it's fine to remove it from ttwu_runnable() as any push/pull should have
>>> happened when other tasks were enqueued on the same CPU - with that said,
>>> it wouldn't hurt to double check this :-)
>>>
>>>
>>> As for the check_preempt_curr(), since per the above p can be preempted,
>>> you could have scenarios right now with CFS tasks where
>>> ttwu_runnable()->check_preempt_curr() causes NEED_RESCHED to be set.
>>>
>>> e.g. p0 does
>>>
>>>   set_current_state(TASK_UNINTERRUPTIBLE)
>>>
>>> but then gets interrupted by the tick, a p1 gets selected to run instead
>>> because of check_preempt_tick(), and then runs long enough to have
>>> check_preempt_curr() decide to let p0 preempt p1.
>>>
>>> That does require specific timing (lower tick frequency should make this
>>> more likely) and probably task niceness distribution too, but isn't
>>> impossible.
>>>
>>> Maybe try reading p->on_cpu, and only do the quick task state update if
>>> it's still the current task, otherwise do the preemption checks?
>>
>> I'm confused...
> 
> I am and Valentin has a point. It could indeed be preempted and in that
> case check_preempt_curr() could indeed make it get back on.
> 
> In that case his suggestion might make sense; something along the lines
> of so I suppose...
> 
> (And I think we can still do the reorg I proposed elsewhere, but I've not
> yet tried.)

Ok, thanks for these suggestions.

I will try to do this, do some tests and send v2.

Thanks.

> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index cb2aa2b54c7a..6944d9473295 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3698,9 +3698,16 @@ static int ttwu_runnable(struct task_struct *p, int wake_flags)
>  
>  	rq = __task_rq_lock(p, &rf);
>  	if (task_on_rq_queued(p)) {
> -		/* check_preempt_curr() may use rq clock */
> -		update_rq_clock(rq);
> -		ttwu_do_wakeup(rq, p, wake_flags, &rf);
> +		if (!p->on_cpu) {
> +			/*
> +			 * When on_rq && !on_cpu the task is preempted, see if
> +			 * it should preempt whatever is current there now.
> +			 */
> +			update_rq_clock(rq);
> +			check_preempt_curr(rq, p, wake_flags);
> +		}
> +		WRITE_ONCE(p->__state, TASK_RUNNING);
> +		trace_sched_wakeup(p);
>  		ret = 1;
>  	}
>  	__task_rq_unlock(rq, &rf);

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] sched/core: Minor optimize ttwu_runnable()
  2022-11-02 10:23 [PATCH] sched/core: Minor optimize ttwu_runnable() Chengming Zhou
  2022-11-04 18:19 ` Valentin Schneider
@ 2022-11-08  9:20 ` Peter Zijlstra
  1 sibling, 0 replies; 11+ messages in thread
From: Peter Zijlstra @ 2022-11-08  9:20 UTC (permalink / raw)
  To: Chengming Zhou
  Cc: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
	bsegall, mgorman, bristot, vschneid, linux-kernel

On Wed, Nov 02, 2022 at 06:23:43PM +0800, Chengming Zhou wrote:
> ttwu_runnable() is used as a fast wakeup path when the wakee task
> is between set_current_state() and schedule(), in which case all
> we need to do is change p->state back to TASK_RUNNING. So we don't
> need to update_rq_clock() and check_preempt_curr() in this case.
> 
> Some performance numbers using mmtests/perfpipe on Intel Xeon server:
> 
>                            linux-next                patched
> Min       Time        8.67 (   0.00%)        8.66 (   0.13%)
> 1st-qrtle Time        8.83 (   0.00%)        8.72 (   1.19%)
> 2nd-qrtle Time        8.90 (   0.00%)        8.76 (   1.57%)
> 3rd-qrtle Time        8.98 (   0.00%)        8.82 (   1.82%)
> Max-1     Time        8.67 (   0.00%)        8.66 (   0.13%)
> Max-5     Time        8.67 (   0.00%)        8.66 (   0.13%)
> Max-10    Time        8.79 (   0.00%)        8.69 (   1.09%)
> Max-90    Time        9.01 (   0.00%)        8.84 (   1.94%)
> Max-95    Time        9.02 (   0.00%)        8.85 (   1.86%)
> Max-99    Time        9.02 (   0.00%)        8.88 (   1.56%)
> Max       Time        9.59 (   0.00%)        8.89 (   7.29%)
> Amean     Time        8.92 (   0.00%)        8.77 *   1.65%*
> 
> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
> ---
>  kernel/sched/core.c | 5 ++---
>  1 file changed, 2 insertions(+), 3 deletions(-)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 87c9cdf37a26..3785418de127 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3718,9 +3718,8 @@ static int ttwu_runnable(struct task_struct *p, int wake_flags)
>  
>  	rq = __task_rq_lock(p, &rf);
>  	if (task_on_rq_queued(p)) {
> -		/* check_preempt_curr() may use rq clock */
> -		update_rq_clock(rq);
> -		ttwu_do_wakeup(rq, p, wake_flags, &rf);
> +		WRITE_ONCE(p->__state, TASK_RUNNING);
> +		trace_sched_wakeup(p);
>  		ret = 1;
>  	}
>  	__task_rq_unlock(rq, &rf);

Yes, I think this is correct; however I would re-organize code a little.

How's this?


diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index cb2aa2b54c7a..43d9a1551a5d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3602,15 +3602,40 @@ ttwu_stat(struct task_struct *p, int cpu, int wake_flags)
 		__schedstat_inc(p->stats.nr_wakeups_sync);
 }
 
 /*
- * Mark the task runnable and perform wakeup-preemption.
+ * Mark the task runnable...
  */
-static void ttwu_do_wakeup(struct rq *rq, struct task_struct *p, int wake_flags,
-			   struct rq_flags *rf)
+static inline void ttwu_do_wakeup(struct task_struct *p)
 {
-	check_preempt_curr(rq, p, wake_flags);
 	WRITE_ONCE(p->__state, TASK_RUNNING);
 	trace_sched_wakeup(p);
+}
+
+static void
+ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags,
+		 struct rq_flags *rf)
+{
+	int en_flags = ENQUEUE_WAKEUP | ENQUEUE_NOCLOCK;
+
+	lockdep_assert_rq_held(rq);
+
+	if (p->sched_contributes_to_load)
+		rq->nr_uninterruptible--;
+
+#ifdef CONFIG_SMP
+	if (wake_flags & WF_MIGRATED)
+		en_flags |= ENQUEUE_MIGRATED;
+	else
+#endif
+	if (p->in_iowait) {
+		delayacct_blkio_end(p);
+		atomic_dec(&task_rq(p)->nr_iowait);
+	}
+
+	activate_task(rq, p, en_flags);
+	check_preempt_curr(rq, p, wake_flags);
+
+	ttwu_do_wakeup(p);
 
 #ifdef CONFIG_SMP
 	if (p->sched_class->task_woken) {
@@ -3640,31 +3666,6 @@ static void ttwu_do_wakeup(struct rq *rq, struct task_struct *p, int wake_flags,
 #endif
 }
 
-static void
-ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags,
-		 struct rq_flags *rf)
-{
-	int en_flags = ENQUEUE_WAKEUP | ENQUEUE_NOCLOCK;
-
-	lockdep_assert_rq_held(rq);
-
-	if (p->sched_contributes_to_load)
-		rq->nr_uninterruptible--;
-
-#ifdef CONFIG_SMP
-	if (wake_flags & WF_MIGRATED)
-		en_flags |= ENQUEUE_MIGRATED;
-	else
-#endif
-	if (p->in_iowait) {
-		delayacct_blkio_end(p);
-		atomic_dec(&task_rq(p)->nr_iowait);
-	}
-
-	activate_task(rq, p, en_flags);
-	ttwu_do_wakeup(rq, p, wake_flags, rf);
-}
-
 /*
  * Consider @p being inside a wait loop:
  *
@@ -3698,9 +3699,7 @@ static int ttwu_runnable(struct task_struct *p, int wake_flags)
 
 	rq = __task_rq_lock(p, &rf);
 	if (task_on_rq_queued(p)) {
-		/* check_preempt_curr() may use rq clock */
-		update_rq_clock(rq);
-		ttwu_do_wakeup(rq, p, wake_flags, &rf);
+		ttwu_do_wakeup(p);
 		ret = 1;
 	}
 	__task_rq_unlock(rq, &rf);
@@ -4062,8 +4061,7 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 			goto out;
 
 		trace_sched_waking(p);
-		WRITE_ONCE(p->__state, TASK_RUNNING);
-		trace_sched_wakeup(p);
+		ttwu_do_wakeup(p);
 		goto out;
 	}
 

^ permalink raw reply related	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2022-11-08 11:18 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2022-11-02 10:23 [PATCH] sched/core: Minor optimize ttwu_runnable() Chengming Zhou
2022-11-04 18:19 ` Valentin Schneider
2022-11-05  1:34   ` Chengming Zhou
2022-11-07 12:56     ` Valentin Schneider
2022-11-07 13:19       ` Chengming Zhou
2022-11-07 15:54         ` Valentin Schneider
2022-11-08  7:38           ` Chengming Zhou
2022-11-08  9:11           ` Peter Zijlstra
2022-11-08 10:08             ` Peter Zijlstra
2022-11-08 11:17               ` Chengming Zhou
2022-11-08  9:20 ` Peter Zijlstra

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox