Re: [PATCH v3 2/2] sched/fair: Choose the CPU where short task is running during wake up

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

From: Chen Yu <yu.c.chen@intel.com>
To: Tianchen Ding <dtcccc@linux.alibaba.com>
Cc: Juri Lelli <juri.lelli@redhat.com>,
	Rik van Riel <riel@surriel.com>, "Aaron Lu" <aaron.lu@intel.com>,
	Abel Wu <wuyun.abel@bytedance.com>,
	K Prateek Nayak <kprateek.nayak@amd.com>,
	Yicong Yang <yangyicong@hisilicon.com>,
	"Gautham R . Shenoy" <gautham.shenoy@amd.com>,
	Ingo Molnar <mingo@redhat.com>,
	"Dietmar Eggemann" <dietmar.eggemann@arm.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	Ben Segall <bsegall@google.com>,
	Daniel Bristot de Oliveira <bristot@redhat.com>,
	Valentin Schneider <vschneid@redhat.com>,
	Hillf Danton <hdanton@sina.com>,
	Honglei Wang <wanghonglei@didichuxing.com>,
	Len Brown <len.brown@intel.com>, Chen Yu <yu.chen.surf@gmail.com>,
	Joel Fernandes <joel@joelfernandes.org>,
	Josh Don <joshdon@google.com>, <linux-kernel@vger.kernel.org>,
	kernel test robot <yujie.liu@intel.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	Tim Chen <tim.c.chen@intel.com>,
	Mel Gorman <mgorman@techsingularity.net>
Subject: Re: [PATCH v3 2/2] sched/fair: Choose the CPU where short task is running during wake up
Date: Mon, 5 Dec 2022 16:40:12 +0800	[thread overview]
Message-ID: <Y42ubFckJTOgiYyG@chenyu5-mobl1> (raw)
In-Reply-To: <a71a227a-5f1b-8ebc-bfde-0babf267b7d9@linux.alibaba.com>

On 2022-12-05 at 10:36:25 +0800, Tianchen Ding wrote:
> Hi Chen.
> 
> On 2022/12/1 16:44, Chen Yu wrote:
> > [Problem Statement]
> > For a workload that is doing frequent context switches, the throughput
> > scales well until the number of instances reaches a peak point. After
> > that peak point, the throughput drops significantly if the number of
> > instances continues to increase.
> > 
> > The will-it-scale context_switch1 test case exposes the issue. The
> > test platform has 112 CPUs per LLC domain. The will-it-scale launches
> > 1, 8, 16 ... 112 instances respectively. Each instance is composed
> > of 2 tasks, and each pair of tasks would do ping-pong scheduling via
> > pipe_read() and pipe_write(). No task is bound to any CPU.
> > It is found that, once the number of instances is higher than
> > 56(112 tasks in total, every CPU has 1 task), the throughput
> > drops accordingly if the instance number continues to increase:
> > 
> >            ^
> > throughput|
> >            |                 X
> >            |               X   X X
> >            |             X         X X
> >            |           X               X
> >            |         X                   X
> >            |       X
> >            |     X
> >            |   X
> >            | X
> >            |
> >            +-----------------.------------------->
> >                              56
> >                                   number of instances
> > 
> > [Symptom analysis]
> > 
> > The performance downgrading was caused by a high system idle
> > percentage(around 20% ~ 30%). The CPUs waste a lot of time in
> > idle and do nothing. As a comparison, if set CPU affinity to
> > these workloads and stops them from migrating among CPUs,
> > the idle percentage drops to nearly 0%, and the throughput
> > increases by about 300%. This indicates room for optimization.
> > 
> > The reason for the high idle percentage is different before/after
> > commit f3dd3f674555 ("sched: Remove the limitation of WF_ON_CPU
> > on wakelist if wakee cpu is idle").
> > 
> > [Before the commit]
> > The bottleneck is the runqueue spinlock.
> > 
> > nr_instance          rq lock percentage
> > 1                    1.22%
> > 8                    1.17%
> > 16                   1.20%
> > 24                   1.22%
> > 32                   1.46%
> > 40                   1.61%
> > 48                   1.63%
> > 56                   1.65%
> > --------------------------
> > 64                   3.77%      |
> > 72                   5.90%      | increase
> > 80                   7.95%      |
> > 88                   9.98%      v
> > 96                   11.81%
> > 104                  13.54%
> > 112                  15.13%
> > 
> > And top 2 rq lock hot paths:
> > 
> > (path1):
> > raw_spin_rq_lock_nested.constprop.0;
> > try_to_wake_up;
> > default_wake_function;
> > autoremove_wake_function;
> > __wake_up_common;
> > __wake_up_common_lock;
> > __wake_up_sync_key;
> > pipe_write;
> > new_sync_write;
> > vfs_write;
> > ksys_write;
> > __x64_sys_write;
> > do_syscall_64;
> > entry_SYSCALL_64_after_hwframe;write
> > 
> > (path2):
> > raw_spin_rq_lock_nested.constprop.0;
> > __sched_text_start;
> > schedule_idle;
> > do_idle;
> > cpu_startup_entry;
> > start_secondary;
> > secondary_startup_64_no_verify
> > 
> > task A tries to wake up task B on CPU1, then task A grabs the
> > runqueue lock of CPU1. If CPU1 is about to quit idle, it needs
> > to grab its lock which has been taken by someone else. Then
> > CPU1 takes more time to quit which hurts the performance.
> > 
> > [After the commit]
> > The cause is the race condition between select_task_rq() and
> > the task enqueue.
> > 
> > Suppose there are nr_cpus pairs of ping-pong scheduling
> > tasks. For example, p0' and p0 are ping-pong scheduling,
> > so do p1' <=> p1, and p2'<=> p2. None of these tasks are
> > bound to any CPUs. The problem can be summarized as:
> > more than 1 wakers are stacked on 1 CPU, which slows down
> > waking up their wakees:
> > 
> > CPU0					CPU1				CPU2
> > 
> > p0'					p1' => idle			p2'
> > 
> > try_to_wake_up(p0)							try_to_wake_up(p2);
> > CPU1 = select_task_rq(p0);						CPU1 = select_task_rq(p2);
> > ttwu_queue(p0, CPU1);							ttwu_queue(p2, CPU1);
> >    __ttwu_queue_wakelist(p0, CPU1); =>	ttwu_list->p0
> > 					quiting cpuidle_idle_call()
> > 
> > 					ttwu_list->p2->p0	<=	  __ttwu_queue_wakelist(p2, CPU1);
> > 
> >      WRITE_ONCE(CPU1->ttwu_pending, 1);					    WRITE_ONCE(CPU1->ttwu_pending, 1);
> > 
> > p0' => idle
> > 					sched_ttwu_pending()
> > 					  enqueue_task(p2 and p0)
> > 
> > 					idle => p2
> > 
> > 					...
> > 					p2 time slice expires
> > 					...
> > 									!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
> > 								<===	!!! p2 delays the wake up of p0' !!!
> > 									!!! causes long idle on CPU0     !!!
> > 					p2 => p0			!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
> > 					p0 wakes up p0'
> > 
> > idle => p0'
> > 
> > 
> > 
> > Since there are many waker/wakee pairs in the system, the chain reaction
> > causes many CPUs to be victims. These idle CPUs wait for their waker to
> > be scheduled.
> > 
> > Actually Tiancheng has mentioned above issue here[2].
> > 
> 
> First I want to say that this issue (race condition between selecting idle
> cpu and enqueuing task) always exists before or after the commit
> f3dd3f674555. And I know this is a big issue so in [2] I only try to fix a
> small part of it. Of course I'm glad to see you trying solving this issue
> too :-)
> 
> There may be a bit wrong in your comment about the order.
> "WRITE_ONCE(CPU1->ttwu_pending, 1);" on CPU0 is earlier than CPU1 getting
> "ttwu_list->p0", right?
>
Yes, you are right, I'll refine the comment.

thanks,
Chenyu

next prev parent reply	other threads:[~2022-12-05  8:40 UTC|newest]

Thread overview: 21+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-12-01  8:43 [PATCH v3 0/2] sched/fair: Choose the CPU where short task is running during wake up Chen Yu
2022-12-01  8:44 ` [PATCH v3 1/2] sched/fair: Introduce short duration task check Chen Yu
2022-12-02  7:44   ` Honglei Wang
2022-12-03  7:49     ` Chen Yu
2022-12-03 15:35       ` Joel Fernandes
2022-12-05  8:38         ` Chen Yu
2022-12-05  9:59           ` Vincent Guittot
2022-12-05 14:38             ` Chen Yu
2022-12-07  2:23             ` Josh Don
2022-12-07 14:24               ` Chen Yu
2022-12-12 11:22                 ` Yicong Yang
2022-12-12 14:33                   ` Chen Yu
2022-12-12 18:17                     ` Josh Don
2022-12-13  5:46                       ` Chen Yu
2022-12-13 10:06                         ` Honglei Wang
2022-12-13 12:24                           ` Chen Yu
2022-12-01  8:44 ` [PATCH v3 2/2] sched/fair: Choose the CPU where short task is running during wake up Chen Yu
2022-12-05  2:36   ` Tianchen Ding
2022-12-05  8:40     ` Chen Yu [this message]
2022-12-06 13:02   ` Yicong Yang
2022-12-07  3:54     ` Chen Yu

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Y42ubFckJTOgiYyG@chenyu5-mobl1 \
    --to=yu.c.chen@intel.com \
    --cc=aaron.lu@intel.com \
    --cc=bristot@redhat.com \
    --cc=bsegall@google.com \
    --cc=dietmar.eggemann@arm.com \
    --cc=dtcccc@linux.alibaba.com \
    --cc=gautham.shenoy@amd.com \
    --cc=hdanton@sina.com \
    --cc=joel@joelfernandes.org \
    --cc=joshdon@google.com \
    --cc=juri.lelli@redhat.com \
    --cc=kprateek.nayak@amd.com \
    --cc=len.brown@intel.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mgorman@techsingularity.net \
    --cc=mingo@redhat.com \
    --cc=peterz@infradead.org \
    --cc=riel@surriel.com \
    --cc=rostedt@goodmis.org \
    --cc=tim.c.chen@intel.com \
    --cc=vincent.guittot@linaro.org \
    --cc=vschneid@redhat.com \
    --cc=wanghonglei@didichuxing.com \
    --cc=wuyun.abel@bytedance.com \
    --cc=yangyicong@hisilicon.com \
    --cc=yu.chen.surf@gmail.com \
    --cc=yujie.liu@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox