Re: [RFC] sched/fair: Use wake_q length as a hint for wake_wide

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Brendan Jackman <brendan.jackman@arm.com>
To: linux-kernel@vger.kernel.org
To: Peter Zijlstra <peterz@infradead.org>
Cc: Joel Fernandes <joelaf@google.com>,
	Andres Oportus <andresoportus@google.com>,
	Ingo Molnar <mingo@redhat.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Josef Bacik <josef@toxicpanda.com>,
	Mike Galbraith <efault@gmx.de>,
	Matt Fleming <matt@codeblueprint.co.uk>
Subject: Re: [RFC] sched/fair: Use wake_q length as a hint for wake_wide
Date: Fri, 29 Sep 2017 18:05:31 +0100	[thread overview]
Message-ID: <873775zb8k.fsf@arm.com> (raw)
In-Reply-To: <20170811094558.29536-1-brendan.jackman@arm.com>


There has been a bit of discussion on this RFC, but before I do any
more work I'd really like your input on the basic idea.

Does the approach of feeding outside info like wake_q length into the
scheduler's decisions seem basically acceptable?

Cheers,
Brendan


On Fri, Aug 11 2017 at 09:45, Brendan Jackman wrote:
> This patch adds a parameter to select_task_rq, sibling_count_hint
> allowing the caller, where it has this information, to inform the
> sched_class the number of tasks that are being woken up as part of
> the same event.
>
> The wake_q mechanism is one case where this information is available.
>
> select_task_rq_fair can then use the information to detect that it
> needs to widen the search space for task placement in order to avoid
> overloading the last-level cache domain's CPUs.
>
>                                * * *
>
> The reason I am investigating this change is the following use case
> on ARM big.LITTLE (asymmetrical CPU capacity): 1 task per CPU, which
> all repeatedly do X amount of work then
> pthread_barrier_wait (i.e. sleep until the last task finishes its X
> and hits the barrier). On big.LITTLE, the tasks which get a "big" CPU
> finish faster, and then those CPUs pull over the tasks that are still
> running:
>
>      v CPU v           ->time->
>
>                     -------------
>    0  (big)         11111  /333
>                     -------------
>    1  (big)         22222   /444|
>                     -------------
>    2  (LITTLE)      333333/
>                     -------------
>    3  (LITTLE)      444444/
>                     -------------
>
> Now when task 4 hits the barrier (at |) and wakes the others up,
> there are 4 tasks with prev_cpu=<big> and 0 tasks with
> prev_cpu=<little>. want_affine therefore means that we'll only look
> in CPUs 0 and 1 (sd_llc), so tasks will be unnecessarily coscheduled
> on the bigs until the next load balance, something like this:
>
>      v CPU v           ->time->
>
>                     ------------------------
>    0  (big)         11111  /333  31313\33333
>                     ------------------------
>    1  (big)         22222   /444|424\4444444
>                     ------------------------
>    2  (LITTLE)      333333/          \222222
>                     ------------------------
>    3  (LITTLE)      444444/            \1111
>                     ------------------------
>                                  ^^^
>                            underutilization
>
> So, I'm trying to get want_affine = 0 for these tasks.
>
> I don't _think_ any incarnation of the wakee_flips mechanism can help
> us here because which task is waker and which tasks are wakees
> generally changes with each iteration.
>
> However pthread_barrier_wait (or more accurately FUTEX_WAKE) has the
> nice property that we know exactly how many tasks are being woken, so
> we can cheat.
>
> It might be a disadvantage that we "widen" _every_ task that's woken in
> an event, while select_idle_sibling would work fine for the first
> sd_llc_size - 1 tasks.
>
> IIUC, if wake_affine() behaves correctly this trick wouldn't be
> necessary on SMP systems, so it might be best guarded by the presence
> of SD_ASYM_CPUCAPACITY?
>
>                                * * *
>
> Final note..
>
> In order to observe "perfect" behaviour for this use case, I also had
> to disable the TTWU_QUEUE sched feature. Suppose during the wakeup
> above we are working through the work queue and have placed tasks 3
> and 2, and are about to place task 1:
>
>      v CPU v           ->time->
>
>                     --------------
>    0  (big)         11111  /333  3
>                     --------------
>    1  (big)         22222   /444|4
>                     --------------
>    2  (LITTLE)      333333/      2
>                     --------------
>    3  (LITTLE)      444444/          <- Task 1 should go here
>                     --------------
>
> If TTWU_QUEUE is enabled, we will not yet have enqueued task
> 2 (having instead sent a reschedule IPI) or attached its load to CPU
> 2. So we are likely to also place task 1 on cpu 2. Disabling
> TTWU_QUEUE means that we enqueue task 2 before placing task 1,
> solving this issue. TTWU_QUEUE is there to minimise rq lock
> contention, and I guess that this contention is less of an issue on
> big.LITTLE systems since they have relatively few CPUs, which
> suggests the trade-off makes sense here.
>
> Signed-off-by: Brendan Jackman <brendan.jackman@arm.com>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Josef Bacik <josef@toxicpanda.com>
> Cc: Joel Fernandes <joelaf@google.com>
> Cc: Mike Galbraith <efault@gmx.de>
> Cc: Matt Fleming <matt@codeblueprint.co.uk>
> ---
>  include/linux/sched/wake_q.h |  2 ++
>  kernel/sched/core.c          | 34 +++++++++++++++++++++++-----------
>  kernel/sched/deadline.c      |  3 ++-
>  kernel/sched/fair.c          | 17 +++++++++++------
>  kernel/sched/idle_task.c     |  3 ++-
>  kernel/sched/rt.c            |  3 ++-
>  kernel/sched/sched.h         |  3 ++-
>  kernel/sched/stop_task.c     |  3 ++-
>  8 files changed, 46 insertions(+), 22 deletions(-)
>
> diff --git a/include/linux/sched/wake_q.h b/include/linux/sched/wake_q.h
> index d03d8a9047dc..607a888eb35b 100644
> --- a/include/linux/sched/wake_q.h
> +++ b/include/linux/sched/wake_q.h
> @@ -33,6 +33,7 @@
>  struct wake_q_head {
>  	struct wake_q_node *first;
>  	struct wake_q_node **lastp;
> +	int count;
>  };
>
>  #define WAKE_Q_TAIL ((struct wake_q_node *) 0x01)
> @@ -44,6 +45,7 @@ static inline void wake_q_init(struct wake_q_head *head)
>  {
>  	head->first = WAKE_Q_TAIL;
>  	head->lastp = &head->first;
> +	head->count = 0;
>  }
>
>  extern void wake_q_add(struct wake_q_head *head,
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 0869b20fba81..ddf9257b1467 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -438,6 +438,8 @@ void wake_q_add(struct wake_q_head *head, struct task_struct *task)
>  	if (cmpxchg(&node->next, NULL, WAKE_Q_TAIL))
>  		return;
>
> +	head->count++;
> +
>  	get_task_struct(task);
>
>  	/*
> @@ -447,6 +449,10 @@ void wake_q_add(struct wake_q_head *head, struct task_struct *task)
>  	head->lastp = &node->next;
>  }
>
> +static int
> +try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags,
> +	       int sibling_count_hint);
> +
>  void wake_up_q(struct wake_q_head *head)
>  {
>  	struct wake_q_node *node = head->first;
> @@ -461,10 +467,10 @@ void wake_up_q(struct wake_q_head *head)
>  		task->wake_q.next = NULL;
>
>  		/*
> -		 * wake_up_process() implies a wmb() to pair with the queueing
> +		 * try_to_wake_up() implies a wmb() to pair with the queueing
>  		 * in wake_q_add() so as not to miss wakeups.
>  		 */
> -		wake_up_process(task);
> +		try_to_wake_up(task, TASK_NORMAL, 0, head->count);
>  		put_task_struct(task);
>  	}
>  }
> @@ -1527,12 +1533,14 @@ static int select_fallback_rq(int cpu, struct task_struct *p)
>   * The caller (fork, wakeup) owns p->pi_lock, ->cpus_allowed is stable.
>   */
>  static inline
> -int select_task_rq(struct task_struct *p, int cpu, int sd_flags, int wake_flags)
> +int select_task_rq(struct task_struct *p, int cpu, int sd_flags, int wake_flags,
> +		   int sibling_count_hint)
>  {
>  	lockdep_assert_held(&p->pi_lock);
>
>  	if (p->nr_cpus_allowed > 1)
> -		cpu = p->sched_class->select_task_rq(p, cpu, sd_flags, wake_flags);
> +		cpu = p->sched_class->select_task_rq(p, cpu, sd_flags, wake_flags,
> +						     sibling_count_hint);
>  	else
>  		cpu = cpumask_any(&p->cpus_allowed);
>
> @@ -1944,6 +1952,8 @@ static void ttwu_queue(struct task_struct *p, int cpu, int wake_flags)
>   * @p: the thread to be awakened
>   * @state: the mask of task states that can be woken
>   * @wake_flags: wake modifier flags (WF_*)
> + * @sibling_count_hint: A hint at the number of threads that are being woken up
> + *                      in this event.
>   *
>   * If (@state & @p->state) @p->state = TASK_RUNNING.
>   *
> @@ -1956,7 +1966,8 @@ static void ttwu_queue(struct task_struct *p, int cpu, int wake_flags)
>   *	   %false otherwise.
>   */
>  static int
> -try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
> +try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags,
> +	       int sibling_count_hint)
>  {
>  	unsigned long flags;
>  	int cpu, success = 0;
> @@ -2042,7 +2053,8 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
>  		atomic_dec(&task_rq(p)->nr_iowait);
>  	}
>
> -	cpu = select_task_rq(p, p->wake_cpu, SD_BALANCE_WAKE, wake_flags);
> +	cpu = select_task_rq(p, p->wake_cpu, SD_BALANCE_WAKE, wake_flags,
> +			     sibling_count_hint);
>  	if (task_cpu(p) != cpu) {
>  		wake_flags |= WF_MIGRATED;
>  		set_task_cpu(p, cpu);
> @@ -2130,13 +2142,13 @@ static void try_to_wake_up_local(struct task_struct *p, struct rq_flags *rf)
>   */
>  int wake_up_process(struct task_struct *p)
>  {
> -	return try_to_wake_up(p, TASK_NORMAL, 0);
> +	return try_to_wake_up(p, TASK_NORMAL, 0, 1);
>  }
>  EXPORT_SYMBOL(wake_up_process);
>
>  int wake_up_state(struct task_struct *p, unsigned int state)
>  {
> -	return try_to_wake_up(p, state, 0);
> +	return try_to_wake_up(p, state, 0, 1);
>  }
>
>  /*
> @@ -2442,7 +2454,7 @@ void wake_up_new_task(struct task_struct *p)
>  	 * Use __set_task_cpu() to avoid calling sched_class::migrate_task_rq,
>  	 * as we're not fully set-up yet.
>  	 */
> -	__set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0));
> +	__set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0, 1));
>  #endif
>  	rq = __task_rq_lock(p, &rf);
>  	update_rq_clock(rq);
> @@ -2893,7 +2905,7 @@ void sched_exec(void)
>  	int dest_cpu;
>
>  	raw_spin_lock_irqsave(&p->pi_lock, flags);
> -	dest_cpu = p->sched_class->select_task_rq(p, task_cpu(p), SD_BALANCE_EXEC, 0);
> +	dest_cpu = p->sched_class->select_task_rq(p, task_cpu(p), SD_BALANCE_EXEC, 0, 1);
>  	if (dest_cpu == smp_processor_id())
>  		goto unlock;
>
> @@ -3582,7 +3594,7 @@ asmlinkage __visible void __sched preempt_schedule_irq(void)
>  int default_wake_function(wait_queue_entry_t *curr, unsigned mode, int wake_flags,
>  			  void *key)
>  {
> -	return try_to_wake_up(curr->private, mode, wake_flags);
> +	return try_to_wake_up(curr->private, mode, wake_flags, 1);
>  }
>  EXPORT_SYMBOL(default_wake_function);
>
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index 755bd3f1a1a9..69a9dd407267 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -1516,7 +1516,8 @@ static void yield_task_dl(struct rq *rq)
>  static int find_later_rq(struct task_struct *task);
>
>  static int
> -select_task_rq_dl(struct task_struct *p, int cpu, int sd_flag, int flags)
> +select_task_rq_dl(struct task_struct *p, int cpu, int sd_flag, int flags,
> +		  int sibling_count_hint)
>  {
>  	struct task_struct *curr;
>  	struct rq *rq;
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index c95880e216f6..0a9d706b62bf 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5332,15 +5332,18 @@ static void record_wakee(struct task_struct *p)
>   * whatever is irrelevant, spread criteria is apparent partner count exceeds
>   * socket size.
>   */
> -static int wake_wide(struct task_struct *p)
> +static int wake_wide(struct task_struct *p, int sibling_count_hint)
>  {
>  	unsigned int master = current->wakee_flips;
>  	unsigned int slave = p->wakee_flips;
> -	int factor = this_cpu_read(sd_llc_size);
> +	int llc_size = this_cpu_read(sd_llc_size);
> +
> +	if (sibling_count_hint >= llc_size)
> +		return 1;
>
>  	if (master < slave)
>  		swap(master, slave);
> -	if (slave < factor || master < slave * factor)
> +	if (slave < llc_size || master < slave * llc_size)
>  		return 0;
>  	return 1;
>  }
> @@ -5869,7 +5872,8 @@ static int wake_cap(struct task_struct *p, int cpu, int prev_cpu)
>   * preempt must be disabled.
>   */
>  static int
> -select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_flags)
> +select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_flags,
> +		    int sibling_count_hint)
>  {
>  	struct sched_domain *tmp, *affine_sd = NULL, *sd = NULL;
>  	int cpu = smp_processor_id();
> @@ -5879,8 +5883,9 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
>
>  	if (sd_flag & SD_BALANCE_WAKE) {
>  		record_wakee(p);
> -		want_affine = !wake_wide(p) && !wake_cap(p, cpu, prev_cpu)
> -			      && cpumask_test_cpu(cpu, &p->cpus_allowed);
> +		want_affine = !wake_wide(p, sibling_count_hint) &&
> +			      !wake_cap(p, cpu, prev_cpu) &&
> +			      cpumask_test_cpu(cpu, &p->cpus_allowed);
>  	}
>
>  	rcu_read_lock();
> diff --git a/kernel/sched/idle_task.c b/kernel/sched/idle_task.c
> index 0c00172db63e..3c343e055110 100644
> --- a/kernel/sched/idle_task.c
> +++ b/kernel/sched/idle_task.c
> @@ -9,7 +9,8 @@
>
>  #ifdef CONFIG_SMP
>  static int
> -select_task_rq_idle(struct task_struct *p, int cpu, int sd_flag, int flags)
> +select_task_rq_idle(struct task_struct *p, int cpu, int sd_flag, int flags,
> +		    int sibling_count_hint)
>  {
>  	return task_cpu(p); /* IDLE tasks as never migrated */
>  }
> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> index 45caf937ef90..b9937dccb8b3 100644
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -1387,7 +1387,8 @@ static void yield_task_rt(struct rq *rq)
>  static int find_lowest_rq(struct task_struct *task);
>
>  static int
> -select_task_rq_rt(struct task_struct *p, int cpu, int sd_flag, int flags)
> +select_task_rq_rt(struct task_struct *p, int cpu, int sd_flag, int flags,
> +		  int sibling_count_hint)
>  {
>  	struct task_struct *curr;
>  	struct rq *rq;
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index eeef1a3086d1..56ae525618e9 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1419,7 +1419,8 @@ struct sched_class {
>  	void (*put_prev_task) (struct rq *rq, struct task_struct *p);
>
>  #ifdef CONFIG_SMP
> -	int  (*select_task_rq)(struct task_struct *p, int task_cpu, int sd_flag, int flags);
> +	int  (*select_task_rq)(struct task_struct *p, int task_cpu, int sd_flag, int flags,
> +			       int subling_count_hint);
>  	void (*migrate_task_rq)(struct task_struct *p);
>
>  	void (*task_woken) (struct rq *this_rq, struct task_struct *task);
> diff --git a/kernel/sched/stop_task.c b/kernel/sched/stop_task.c
> index 9f69fb630853..d0ce4fbb18ef 100644
> --- a/kernel/sched/stop_task.c
> +++ b/kernel/sched/stop_task.c
> @@ -11,7 +11,8 @@
>
>  #ifdef CONFIG_SMP
>  static int
> -select_task_rq_stop(struct task_struct *p, int cpu, int sd_flag, int flags)
> +select_task_rq_stop(struct task_struct *p, int cpu, int sd_flag, int flags,
> +		    int sibling_count_hint)
>  {
>  	return task_cpu(p); /* stop tasks as never migrate */
>  }

next prev parent reply	other threads:[~2017-09-29 17:05 UTC|newest]

Thread overview: 14+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-08-11  9:45 [RFC] sched/fair: Use wake_q length as a hint for wake_wide Brendan Jackman
2017-09-19  5:15 ` Joel Fernandes
2017-09-19 10:05   ` Brendan Jackman
2017-09-20  4:39     ` Joel Fernandes
2017-09-20  5:06       ` Joel Fernandes
2017-09-20  9:33         ` Brendan Jackman
2017-09-20 20:23           ` Joel Fernandes
2017-09-20 21:17             ` Atish Patra
2017-09-21  5:50               ` Joel Fernandes
2017-09-21 16:51                 ` Atish Patra
2017-09-20  9:41   ` Brendan Jackman
2017-09-20 19:36 ` Atish Patra
2017-09-29 17:05 ` Brendan Jackman [this message]
2017-10-02 10:46   ` Brendan Jackman

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=873775zb8k.fsf@arm.com \
    --to=brendan.jackman@arm.com \
    --cc=andresoportus@google.com \
    --cc=efault@gmx.de \
    --cc=joelaf@google.com \
    --cc=josef@toxicpanda.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=matt@codeblueprint.co.uk \
    --cc=mingo@redhat.com \
    --cc=peterz@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.