[RFC] sched/fair: Use wake_q length as a hint for wake

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* [RFC] sched/fair: Use wake_q length as a hint for wake_wide
@ 2017-08-11  9:45 Brendan Jackman
  2017-09-19  5:15 ` Joel Fernandes
                   ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Brendan Jackman @ 2017-08-11  9:45 UTC (permalink / raw)
  To: linux-kernel
  Cc: Joel Fernandes, Andres Oportus, Ingo Molnar, Peter Zijlstra,
	Josef Bacik, Mike Galbraith, Matt Fleming

This patch adds a parameter to select_task_rq, sibling_count_hint
allowing the caller, where it has this information, to inform the
sched_class the number of tasks that are being woken up as part of
the same event.

The wake_q mechanism is one case where this information is available.

select_task_rq_fair can then use the information to detect that it
needs to widen the search space for task placement in order to avoid
overloading the last-level cache domain's CPUs.

                               * * *

The reason I am investigating this change is the following use case
on ARM big.LITTLE (asymmetrical CPU capacity): 1 task per CPU, which
all repeatedly do X amount of work then
pthread_barrier_wait (i.e. sleep until the last task finishes its X
and hits the barrier). On big.LITTLE, the tasks which get a "big" CPU
finish faster, and then those CPUs pull over the tasks that are still
running:

     v CPU v           ->time->

                    -------------
   0  (big)         11111  /333
                    -------------
   1  (big)         22222   /444|
                    -------------
   2  (LITTLE)      333333/
                    -------------
   3  (LITTLE)      444444/
                    -------------

Now when task 4 hits the barrier (at |) and wakes the others up,
there are 4 tasks with prev_cpu=<big> and 0 tasks with
prev_cpu=<little>. want_affine therefore means that we'll only look
in CPUs 0 and 1 (sd_llc), so tasks will be unnecessarily coscheduled
on the bigs until the next load balance, something like this:

     v CPU v           ->time->

                    ------------------------
   0  (big)         11111  /333  31313\33333
                    ------------------------
   1  (big)         22222   /444|424\4444444
                    ------------------------
   2  (LITTLE)      333333/          \222222
                    ------------------------
   3  (LITTLE)      444444/            \1111
                    ------------------------
                                 ^^^
                           underutilization

So, I'm trying to get want_affine = 0 for these tasks.

I don't _think_ any incarnation of the wakee_flips mechanism can help
us here because which task is waker and which tasks are wakees
generally changes with each iteration.

However pthread_barrier_wait (or more accurately FUTEX_WAKE) has the
nice property that we know exactly how many tasks are being woken, so
we can cheat.

It might be a disadvantage that we "widen" _every_ task that's woken in
an event, while select_idle_sibling would work fine for the first
sd_llc_size - 1 tasks.

IIUC, if wake_affine() behaves correctly this trick wouldn't be
necessary on SMP systems, so it might be best guarded by the presence
of SD_ASYM_CPUCAPACITY?

                               * * *

Final note..

In order to observe "perfect" behaviour for this use case, I also had
to disable the TTWU_QUEUE sched feature. Suppose during the wakeup
above we are working through the work queue and have placed tasks 3
and 2, and are about to place task 1:

     v CPU v           ->time->

                    --------------
   0  (big)         11111  /333  3
                    --------------
   1  (big)         22222   /444|4
                    --------------
   2  (LITTLE)      333333/      2
                    --------------
   3  (LITTLE)      444444/          <- Task 1 should go here
                    --------------

If TTWU_QUEUE is enabled, we will not yet have enqueued task
2 (having instead sent a reschedule IPI) or attached its load to CPU
2. So we are likely to also place task 1 on cpu 2. Disabling
TTWU_QUEUE means that we enqueue task 2 before placing task 1,
solving this issue. TTWU_QUEUE is there to minimise rq lock
contention, and I guess that this contention is less of an issue on
big.LITTLE systems since they have relatively few CPUs, which
suggests the trade-off makes sense here.

Signed-off-by: Brendan Jackman <brendan.jackman@arm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
---
 include/linux/sched/wake_q.h |  2 ++
 kernel/sched/core.c          | 34 +++++++++++++++++++++++-----------
 kernel/sched/deadline.c      |  3 ++-
 kernel/sched/fair.c          | 17 +++++++++++------
 kernel/sched/idle_task.c     |  3 ++-
 kernel/sched/rt.c            |  3 ++-
 kernel/sched/sched.h         |  3 ++-
 kernel/sched/stop_task.c     |  3 ++-
 8 files changed, 46 insertions(+), 22 deletions(-)

diff --git a/include/linux/sched/wake_q.h b/include/linux/sched/wake_q.h
index d03d8a9047dc..607a888eb35b 100644
--- a/include/linux/sched/wake_q.h
+++ b/include/linux/sched/wake_q.h
@@ -33,6 +33,7 @@
 struct wake_q_head {
 	struct wake_q_node *first;
 	struct wake_q_node **lastp;
+	int count;
 };
 
 #define WAKE_Q_TAIL ((struct wake_q_node *) 0x01)
@@ -44,6 +45,7 @@ static inline void wake_q_init(struct wake_q_head *head)
 {
 	head->first = WAKE_Q_TAIL;
 	head->lastp = &head->first;
+	head->count = 0;
 }
 
 extern void wake_q_add(struct wake_q_head *head,
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0869b20fba81..ddf9257b1467 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -438,6 +438,8 @@ void wake_q_add(struct wake_q_head *head, struct task_struct *task)
 	if (cmpxchg(&node->next, NULL, WAKE_Q_TAIL))
 		return;
 
+	head->count++;
+
 	get_task_struct(task);
 
 	/*
@@ -447,6 +449,10 @@ void wake_q_add(struct wake_q_head *head, struct task_struct *task)
 	head->lastp = &node->next;
 }
 
+static int
+try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags,
+	       int sibling_count_hint);
+
 void wake_up_q(struct wake_q_head *head)
 {
 	struct wake_q_node *node = head->first;
@@ -461,10 +467,10 @@ void wake_up_q(struct wake_q_head *head)
 		task->wake_q.next = NULL;
 
 		/*
-		 * wake_up_process() implies a wmb() to pair with the queueing
+		 * try_to_wake_up() implies a wmb() to pair with the queueing
 		 * in wake_q_add() so as not to miss wakeups.
 		 */
-		wake_up_process(task);
+		try_to_wake_up(task, TASK_NORMAL, 0, head->count);
 		put_task_struct(task);
 	}
 }
@@ -1527,12 +1533,14 @@ static int select_fallback_rq(int cpu, struct task_struct *p)
  * The caller (fork, wakeup) owns p->pi_lock, ->cpus_allowed is stable.
  */
 static inline
-int select_task_rq(struct task_struct *p, int cpu, int sd_flags, int wake_flags)
+int select_task_rq(struct task_struct *p, int cpu, int sd_flags, int wake_flags,
+		   int sibling_count_hint)
 {
 	lockdep_assert_held(&p->pi_lock);
 
 	if (p->nr_cpus_allowed > 1)
-		cpu = p->sched_class->select_task_rq(p, cpu, sd_flags, wake_flags);
+		cpu = p->sched_class->select_task_rq(p, cpu, sd_flags, wake_flags,
+						     sibling_count_hint);
 	else
 		cpu = cpumask_any(&p->cpus_allowed);
 
@@ -1944,6 +1952,8 @@ static void ttwu_queue(struct task_struct *p, int cpu, int wake_flags)
  * @p: the thread to be awakened
  * @state: the mask of task states that can be woken
  * @wake_flags: wake modifier flags (WF_*)
+ * @sibling_count_hint: A hint at the number of threads that are being woken up
+ *                      in this event.
  *
  * If (@state & @p->state) @p->state = TASK_RUNNING.
  *
@@ -1956,7 +1966,8 @@ static void ttwu_queue(struct task_struct *p, int cpu, int wake_flags)
  *	   %false otherwise.
  */
 static int
-try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
+try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags,
+	       int sibling_count_hint)
 {
 	unsigned long flags;
 	int cpu, success = 0;
@@ -2042,7 +2053,8 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 		atomic_dec(&task_rq(p)->nr_iowait);
 	}
 
-	cpu = select_task_rq(p, p->wake_cpu, SD_BALANCE_WAKE, wake_flags);
+	cpu = select_task_rq(p, p->wake_cpu, SD_BALANCE_WAKE, wake_flags,
+			     sibling_count_hint);
 	if (task_cpu(p) != cpu) {
 		wake_flags |= WF_MIGRATED;
 		set_task_cpu(p, cpu);
@@ -2130,13 +2142,13 @@ static void try_to_wake_up_local(struct task_struct *p, struct rq_flags *rf)
  */
 int wake_up_process(struct task_struct *p)
 {
-	return try_to_wake_up(p, TASK_NORMAL, 0);
+	return try_to_wake_up(p, TASK_NORMAL, 0, 1);
 }
 EXPORT_SYMBOL(wake_up_process);
 
 int wake_up_state(struct task_struct *p, unsigned int state)
 {
-	return try_to_wake_up(p, state, 0);
+	return try_to_wake_up(p, state, 0, 1);
 }
 
 /*
@@ -2442,7 +2454,7 @@ void wake_up_new_task(struct task_struct *p)
 	 * Use __set_task_cpu() to avoid calling sched_class::migrate_task_rq,
 	 * as we're not fully set-up yet.
 	 */
-	__set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0));
+	__set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0, 1));
 #endif
 	rq = __task_rq_lock(p, &rf);
 	update_rq_clock(rq);
@@ -2893,7 +2905,7 @@ void sched_exec(void)
 	int dest_cpu;
 
 	raw_spin_lock_irqsave(&p->pi_lock, flags);
-	dest_cpu = p->sched_class->select_task_rq(p, task_cpu(p), SD_BALANCE_EXEC, 0);
+	dest_cpu = p->sched_class->select_task_rq(p, task_cpu(p), SD_BALANCE_EXEC, 0, 1);
 	if (dest_cpu == smp_processor_id())
 		goto unlock;
 
@@ -3582,7 +3594,7 @@ asmlinkage __visible void __sched preempt_schedule_irq(void)
 int default_wake_function(wait_queue_entry_t *curr, unsigned mode, int wake_flags,
 			  void *key)
 {
-	return try_to_wake_up(curr->private, mode, wake_flags);
+	return try_to_wake_up(curr->private, mode, wake_flags, 1);
 }
 EXPORT_SYMBOL(default_wake_function);
 
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 755bd3f1a1a9..69a9dd407267 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1516,7 +1516,8 @@ static void yield_task_dl(struct rq *rq)
 static int find_later_rq(struct task_struct *task);
 
 static int
-select_task_rq_dl(struct task_struct *p, int cpu, int sd_flag, int flags)
+select_task_rq_dl(struct task_struct *p, int cpu, int sd_flag, int flags,
+		  int sibling_count_hint)
 {
 	struct task_struct *curr;
 	struct rq *rq;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c95880e216f6..0a9d706b62bf 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5332,15 +5332,18 @@ static void record_wakee(struct task_struct *p)
  * whatever is irrelevant, spread criteria is apparent partner count exceeds
  * socket size.
  */
-static int wake_wide(struct task_struct *p)
+static int wake_wide(struct task_struct *p, int sibling_count_hint)
 {
 	unsigned int master = current->wakee_flips;
 	unsigned int slave = p->wakee_flips;
-	int factor = this_cpu_read(sd_llc_size);
+	int llc_size = this_cpu_read(sd_llc_size);
+
+	if (sibling_count_hint >= llc_size)
+		return 1;
 
 	if (master < slave)
 		swap(master, slave);
-	if (slave < factor || master < slave * factor)
+	if (slave < llc_size || master < slave * llc_size)
 		return 0;
 	return 1;
 }
@@ -5869,7 +5872,8 @@ static int wake_cap(struct task_struct *p, int cpu, int prev_cpu)
  * preempt must be disabled.
  */
 static int
-select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_flags)
+select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_flags,
+		    int sibling_count_hint)
 {
 	struct sched_domain *tmp, *affine_sd = NULL, *sd = NULL;
 	int cpu = smp_processor_id();
@@ -5879,8 +5883,9 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
 
 	if (sd_flag & SD_BALANCE_WAKE) {
 		record_wakee(p);
-		want_affine = !wake_wide(p) && !wake_cap(p, cpu, prev_cpu)
-			      && cpumask_test_cpu(cpu, &p->cpus_allowed);
+		want_affine = !wake_wide(p, sibling_count_hint) &&
+			      !wake_cap(p, cpu, prev_cpu) &&
+			      cpumask_test_cpu(cpu, &p->cpus_allowed);
 	}
 
 	rcu_read_lock();
diff --git a/kernel/sched/idle_task.c b/kernel/sched/idle_task.c
index 0c00172db63e..3c343e055110 100644
--- a/kernel/sched/idle_task.c
+++ b/kernel/sched/idle_task.c
@@ -9,7 +9,8 @@
 
 #ifdef CONFIG_SMP
 static int
-select_task_rq_idle(struct task_struct *p, int cpu, int sd_flag, int flags)
+select_task_rq_idle(struct task_struct *p, int cpu, int sd_flag, int flags,
+		    int sibling_count_hint)
 {
 	return task_cpu(p); /* IDLE tasks as never migrated */
 }
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 45caf937ef90..b9937dccb8b3 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1387,7 +1387,8 @@ static void yield_task_rt(struct rq *rq)
 static int find_lowest_rq(struct task_struct *task);
 
 static int
-select_task_rq_rt(struct task_struct *p, int cpu, int sd_flag, int flags)
+select_task_rq_rt(struct task_struct *p, int cpu, int sd_flag, int flags,
+		  int sibling_count_hint)
 {
 	struct task_struct *curr;
 	struct rq *rq;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index eeef1a3086d1..56ae525618e9 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1419,7 +1419,8 @@ struct sched_class {
 	void (*put_prev_task) (struct rq *rq, struct task_struct *p);
 
 #ifdef CONFIG_SMP
-	int  (*select_task_rq)(struct task_struct *p, int task_cpu, int sd_flag, int flags);
+	int  (*select_task_rq)(struct task_struct *p, int task_cpu, int sd_flag, int flags,
+			       int subling_count_hint);
 	void (*migrate_task_rq)(struct task_struct *p);
 
 	void (*task_woken) (struct rq *this_rq, struct task_struct *task);
diff --git a/kernel/sched/stop_task.c b/kernel/sched/stop_task.c
index 9f69fb630853..d0ce4fbb18ef 100644
--- a/kernel/sched/stop_task.c
+++ b/kernel/sched/stop_task.c
@@ -11,7 +11,8 @@
 
 #ifdef CONFIG_SMP
 static int
-select_task_rq_stop(struct task_struct *p, int cpu, int sd_flag, int flags)
+select_task_rq_stop(struct task_struct *p, int cpu, int sd_flag, int flags,
+		    int sibling_count_hint)
 {
 	return task_cpu(p); /* stop tasks as never migrate */
 }
-- 
2.13.0

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [RFC] sched/fair: Use wake_q length as a hint for wake_wide
  2017-08-11  9:45 [RFC] sched/fair: Use wake_q length as a hint for wake_wide Brendan Jackman
@ 2017-09-19  5:15 ` Joel Fernandes
  2017-09-19 10:05   ` Brendan Jackman
  2017-09-20  9:41   ` Brendan Jackman
  2017-09-20 19:36 ` Atish Patra
  2017-09-29 17:05 ` Brendan Jackman
  2 siblings, 2 replies; 14+ messages in thread
From: Joel Fernandes @ 2017-09-19  5:15 UTC (permalink / raw)
  To: Brendan Jackman
  Cc: LKML, Andres Oportus, Ingo Molnar, Peter Zijlstra, Josef Bacik,
	Mike Galbraith, Matt Fleming, Dietmar Eggemann

Hi Brendan,

On Fri, Aug 11, 2017 at 2:45 AM, Brendan Jackman
<brendan.jackman@arm.com> wrote:
> This patch adds a parameter to select_task_rq, sibling_count_hint
> allowing the caller, where it has this information, to inform the
> sched_class the number of tasks that are being woken up as part of
> the same event.
>
> The wake_q mechanism is one case where this information is available.
>
> select_task_rq_fair can then use the information to detect that it
> needs to widen the search space for task placement in order to avoid
> overloading the last-level cache domain's CPUs.
>
>                                * * *
>
> The reason I am investigating this change is the following use case
> on ARM big.LITTLE (asymmetrical CPU capacity): 1 task per CPU, which
> all repeatedly do X amount of work then
> pthread_barrier_wait (i.e. sleep until the last task finishes its X
> and hits the barrier). On big.LITTLE, the tasks which get a "big" CPU
> finish faster, and then those CPUs pull over the tasks that are still
> running:
>
>      v CPU v           ->time->
>
>                     -------------
>    0  (big)         11111  /333
>                     -------------
>    1  (big)         22222   /444|
>                     -------------
>    2  (LITTLE)      333333/
>                     -------------
>    3  (LITTLE)      444444/
>                     -------------

This is because of the newly idle balancing right?

>
> Now when task 4 hits the barrier (at |) and wakes the others up,
> there are 4 tasks with prev_cpu=<big> and 0 tasks with
> prev_cpu=<little>. want_affine therefore means that we'll only look
> in CPUs 0 and 1 (sd_llc), so tasks will be unnecessarily coscheduled
> on the bigs until the next load balance, something like this:
>
>      v CPU v           ->time->
>
>                     ------------------------
>    0  (big)         11111  /333  31313\33333
>                     ------------------------
>    1  (big)         22222   /444|424\4444444
>                     ------------------------
>    2  (LITTLE)      333333/          \222222
>                     ------------------------
>    3  (LITTLE)      444444/            \1111
>                     ------------------------
>                                  ^^^
>                            underutilization

I wonder if this problem can be better solved by not having the first
newly-idle-balance pull happen based on some condition, only to be
corrected later by another balancing act?

>
> So, I'm trying to get want_affine = 0 for these tasks.
>
> I don't _think_ any incarnation of the wakee_flips mechanism can help
> us here because which task is waker and which tasks are wakees
> generally changes with each iteration.
>
> However pthread_barrier_wait (or more accurately FUTEX_WAKE) has the
> nice property that we know exactly how many tasks are being woken, so
> we can cheat.
>
> It might be a disadvantage that we "widen" _every_ task that's woken in
> an event, while select_idle_sibling would work fine for the first
> sd_llc_size - 1 tasks.
>
> IIUC, if wake_affine() behaves correctly this trick wouldn't be
> necessary on SMP systems, so it might be best guarded by the presence

Actually wake_affine doesn't check for balance if previous/next cpu
are within the same shared cache domain. The difference is some time
ago it would return true for shared cache but now it returns false as
of 4.14-rc1:
http://elixir.free-electrons.com/linux/v4.14-rc1/source/kernel/sched/fair.c#L5466

Since it would return false in the above wake up cases for task 1 and
2, it would then run select_idle_sibling on the previous CPU which is
also within the big cluster, so I don't think it will make a
difference in this case... Infact what it returns probably doesn't
matter.

> of SD_ASYM_CPUCAPACITY?
>
>                                * * *
>
> Final note..
>
> In order to observe "perfect" behaviour for this use case, I also had
> to disable the TTWU_QUEUE sched feature. Suppose during the wakeup
> above we are working through the work queue and have placed tasks 3
> and 2, and are about to place task 1:
>
>      v CPU v           ->time->
>
>                     --------------
>    0  (big)         11111  /333  3
>                     --------------
>    1  (big)         22222   /444|4
>                     --------------
>    2  (LITTLE)      333333/      2
>                     --------------
>    3  (LITTLE)      444444/          <- Task 1 should go here
>                     --------------
>
> If TTWU_QUEUE is enabled, we will not yet have enqueued task
> 2 (having instead sent a reschedule IPI) or attached its load to CPU
> 2. So we are likely to also place task 1 on cpu 2. Disabling
> TTWU_QUEUE means that we enqueue task 2 before placing task 1,
> solving this issue. TTWU_QUEUE is there to minimise rq lock
> contention, and I guess that this contention is less of an issue on
> big.LITTLE systems since they have relatively few CPUs, which
> suggests the trade-off makes sense here.

Yeah, OTOH probably once those IPIs are sent, the select path should I
feel be smart enough to factor those pending enqueues, not sure how
hard or meaningful it is to implement something like that..

>
> Signed-off-by: Brendan Jackman <brendan.jackman@arm.com>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Josef Bacik <josef@toxicpanda.com>
> Cc: Joel Fernandes <joelaf@google.com>
> Cc: Mike Galbraith <efault@gmx.de>
> Cc: Matt Fleming <matt@codeblueprint.co.uk>
> ---
>  include/linux/sched/wake_q.h |  2 ++
>  kernel/sched/core.c          | 34 +++++++++++++++++++++++-----------
>  kernel/sched/deadline.c      |  3 ++-
>  kernel/sched/fair.c          | 17 +++++++++++------
>  kernel/sched/idle_task.c     |  3 ++-
>  kernel/sched/rt.c            |  3 ++-
>  kernel/sched/sched.h         |  3 ++-
>  kernel/sched/stop_task.c     |  3 ++-
>  8 files changed, 46 insertions(+), 22 deletions(-)
>
> diff --git a/include/linux/sched/wake_q.h b/include/linux/sched/wake_q.h
> index d03d8a9047dc..607a888eb35b 100644
> --- a/include/linux/sched/wake_q.h
> +++ b/include/linux/sched/wake_q.h
> @@ -33,6 +33,7 @@
>  struct wake_q_head {
>         struct wake_q_node *first;
>         struct wake_q_node **lastp;
> +       int count;
>  };
>
>  #define WAKE_Q_TAIL ((struct wake_q_node *) 0x01)
> @@ -44,6 +45,7 @@ static inline void wake_q_init(struct wake_q_head *head)
>  {
>         head->first = WAKE_Q_TAIL;
>         head->lastp = &head->first;
> +       head->count = 0;
>  }
>
>  extern void wake_q_add(struct wake_q_head *head,
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 0869b20fba81..ddf9257b1467 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -438,6 +438,8 @@ void wake_q_add(struct wake_q_head *head, struct task_struct *task)
>         if (cmpxchg(&node->next, NULL, WAKE_Q_TAIL))
>                 return;
>
> +       head->count++;
> +
>         get_task_struct(task);
>
>         /*
> @@ -447,6 +449,10 @@ void wake_q_add(struct wake_q_head *head, struct task_struct *task)
>         head->lastp = &node->next;
>  }
>
> +static int
> +try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags,
> +              int sibling_count_hint);
> +
>  void wake_up_q(struct wake_q_head *head)
>  {
>         struct wake_q_node *node = head->first;
> @@ -461,10 +467,10 @@ void wake_up_q(struct wake_q_head *head)
>                 task->wake_q.next = NULL;
>
>                 /*
> -                * wake_up_process() implies a wmb() to pair with the queueing
> +                * try_to_wake_up() implies a wmb() to pair with the queueing
>                  * in wake_q_add() so as not to miss wakeups.
>                  */
> -               wake_up_process(task);
> +               try_to_wake_up(task, TASK_NORMAL, 0, head->count);
>                 put_task_struct(task);

Shouldn't you reset head->count after all the tasks have been woken up?

>         }
>  }
> @@ -1527,12 +1533,14 @@ static int select_fallback_rq(int cpu, struct task_struct *p)
>   * The caller (fork, wakeup) owns p->pi_lock, ->cpus_allowed is stable.
>   */
>  static inline
> -int select_task_rq(struct task_struct *p, int cpu, int sd_flags, int wake_flags)
> +int select_task_rq(struct task_struct *p, int cpu, int sd_flags, int wake_flags,
> +                  int sibling_count_hint)

This variable seems a bit long to me, how about just sibling_count?

>  {
>         lockdep_assert_held(&p->pi_lock);
>
>         if (p->nr_cpus_allowed > 1)
> -               cpu = p->sched_class->select_task_rq(p, cpu, sd_flags, wake_flags);
> +               cpu = p->sched_class->select_task_rq(p, cpu, sd_flags, wake_flags,
> +                                                    sibling_count_hint);

same.

<snip>

>
>  static int
> -select_task_rq_rt(struct task_struct *p, int cpu, int sd_flag, int flags)
> +select_task_rq_rt(struct task_struct *p, int cpu, int sd_flag, int flags,
> +                 int sibling_count_hint)
>  {
>         struct task_struct *curr;
>         struct rq *rq;
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index eeef1a3086d1..56ae525618e9 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1419,7 +1419,8 @@ struct sched_class {
>         void (*put_prev_task) (struct rq *rq, struct task_struct *p);
>
>  #ifdef CONFIG_SMP
> -       int  (*select_task_rq)(struct task_struct *p, int task_cpu, int sd_flag, int flags);
> +       int  (*select_task_rq)(struct task_struct *p, int task_cpu, int sd_flag, int flags,
> +                              int subling_count_hint);

s/subling/sibling/


thanks,

- Joel

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC] sched/fair: Use wake_q length as a hint for wake_wide
  2017-09-19  5:15 ` Joel Fernandes
@ 2017-09-19 10:05   ` Brendan Jackman
  2017-09-20  4:39     ` Joel Fernandes
  2017-09-20  9:41   ` Brendan Jackman
  1 sibling, 1 reply; 14+ messages in thread
From: Brendan Jackman @ 2017-09-19 10:05 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: LKML, Andres Oportus, Ingo Molnar, Peter Zijlstra, Josef Bacik,
	Mike Galbraith, Matt Fleming, Dietmar Eggemann, Morten Rasmussen

On Mon, Sep 18 2017 at 22:15, Joel Fernandes wrote:
> Hi Brendan,
Hi Joel,

Thanks for taking a look :)

> On Fri, Aug 11, 2017 at 2:45 AM, Brendan Jackman
> <brendan.jackman@arm.com> wrote:
>> This patch adds a parameter to select_task_rq, sibling_count_hint
>> allowing the caller, where it has this information, to inform the
>> sched_class the number of tasks that are being woken up as part of
>> the same event.
>>
>> The wake_q mechanism is one case where this information is available.
>>
>> select_task_rq_fair can then use the information to detect that it
>> needs to widen the search space for task placement in order to avoid
>> overloading the last-level cache domain's CPUs.
>>
>>                                * * *
>>
>> The reason I am investigating this change is the following use case
>> on ARM big.LITTLE (asymmetrical CPU capacity): 1 task per CPU, which
>> all repeatedly do X amount of work then
>> pthread_barrier_wait (i.e. sleep until the last task finishes its X
>> and hits the barrier). On big.LITTLE, the tasks which get a "big" CPU
>> finish faster, and then those CPUs pull over the tasks that are still
>> running:
>>
>>      v CPU v           ->time->
>>
>>                     -------------
>>    0  (big)         11111  /333
>>                     -------------
>>    1  (big)         22222   /444|
>>                     -------------
>>    2  (LITTLE)      333333/
>>                     -------------
>>    3  (LITTLE)      444444/
>>                     -------------
>
> This is because of the newly idle balancing right?

Probably not: before CPUs 0 and 1 will pull tasks their utilization
(tasks 1 and 2's blocked load) needs to decay enough that they have more
spare capacity than CPUs 2 and 3 (with imbalance_pct etc), so it's more
likely to be a nohz idle balance. We also need need_active_balance to be
satisfied (assuming 3 and 4 are running constantly) - in Android this is
hastened by a special case [1], in mainline it takes longer.

[1] https://android.googlesource.com/kernel/common/+/android-4.4/kernel/sched/fair.c#8865

>>
>> Now when task 4 hits the barrier (at |) and wakes the others up,
>> there are 4 tasks with prev_cpu=<big> and 0 tasks with
>> prev_cpu=<little>. want_affine therefore means that we'll only look
>> in CPUs 0 and 1 (sd_llc), so tasks will be unnecessarily coscheduled
>> on the bigs until the next load balance, something like this:
>>
>>      v CPU v           ->time->
>>
>>                     ------------------------
>>    0  (big)         11111  /333  31313\33333
>>                     ------------------------
>>    1  (big)         22222   /444|424\4444444
>>                     ------------------------
>>    2  (LITTLE)      333333/          \222222
>>                     ------------------------
>>    3  (LITTLE)      444444/            \1111
>>                     ------------------------
>>                                  ^^^
>>                            underutilization
>
> I wonder if this problem can be better solved by not having the first
> newly-idle-balance pull happen based on some condition, only to be
> corrected later by another balancing act?

Well the fact that CPUs 0 and 1 pull 3 and 4 is desirable - we want to
use the big CPUs as much as possible for these always-running
tasks. Moving load back to the LITTLE CPUs is not "correcting" a
previous mistake, it's responding to a change in conditions (i.e. we
have more runnable tasks than big CPUs now) and this patch is about
trying to do that response ASAP.

>>
>> So, I'm trying to get want_affine = 0 for these tasks.
>>
>> I don't _think_ any incarnation of the wakee_flips mechanism can help
>> us here because which task is waker and which tasks are wakees
>> generally changes with each iteration.
>>
>> However pthread_barrier_wait (or more accurately FUTEX_WAKE) has the
>> nice property that we know exactly how many tasks are being woken, so
>> we can cheat.
>>
>> It might be a disadvantage that we "widen" _every_ task that's woken in
>> an event, while select_idle_sibling would work fine for the first
>> sd_llc_size - 1 tasks.
>>
>> IIUC, if wake_affine() behaves correctly this trick wouldn't be
>> necessary on SMP systems, so it might be best guarded by the presence
>
> Actually wake_affine doesn't check for balance if previous/next cpu
> are within the same shared cache domain. The difference is some time
> ago it would return true for shared cache but now it returns false as
> of 4.14-rc1:
> http://elixir.free-electrons.com/linux/v4.14-rc1/source/kernel/sched/fair.c#L5466
>
> Since it would return false in the above wake up cases for task 1 and
> 2, it would then run select_idle_sibling on the previous CPU which is
> also within the big cluster, so I don't think it will make a
> difference in this case... Infact what it returns probably doesn't
> matter.

So my paragraph here was making a leap in reasoning, let me try to fill
the gap: On SMP these tasks never need to move around. If by some chance
they did get coscheduled, the first load balance would spread them out and
then every time they wake up from then on, prev_cpu is the sensible
choice. So it will look something like:

            v CPU v           ->time->

                            -------------
         {  0  (SAME)       11111111111  
  cache  {                  -------------
         {  1  (SAME)       222222222222|
                            -------------
         {  2  (SAME)       33333333333  
  cache  {                  -------------
         {  3  (SAME)       44444444444  
                            -------------        

So here, task 2 wakes up the other guys and when it's doing tasks 3 and
4, prev_cpu and smp_processor_id() don't share a cache, so IIUC its'
basically wake_affine's job to decide between prev_cpu and
smp_processor_id(). So "if wake_affine is behaving correctly" the
problem that this patch aims to solve (i.e. the fact that we overload
the waker's LLC domain because of bias towards prev_cpu) does not arise
on SMP.

>> of SD_ASYM_CPUCAPACITY?
>>
>>                                * * *
>>
>> Final note..
>>
>> In order to observe "perfect" behaviour for this use case, I also had
>> to disable the TTWU_QUEUE sched feature. Suppose during the wakeup
>> above we are working through the work queue and have placed tasks 3
>> and 2, and are about to place task 1:
>>
>>      v CPU v           ->time->
>>
>>                     --------------
>>    0  (big)         11111  /333  3
>>                     --------------
>>    1  (big)         22222   /444|4
>>                     --------------
>>    2  (LITTLE)      333333/      2
>>                     --------------
>>    3  (LITTLE)      444444/          <- Task 1 should go here
>>                     --------------
>>
>> If TTWU_QUEUE is enabled, we will not yet have enqueued task
>> 2 (having instead sent a reschedule IPI) or attached its load to CPU
>> 2. So we are likely to also place task 1 on cpu 2. Disabling
>> TTWU_QUEUE means that we enqueue task 2 before placing task 1,
>> solving this issue. TTWU_QUEUE is there to minimise rq lock
>> contention, and I guess that this contention is less of an issue on
>> big.LITTLE systems since they have relatively few CPUs, which
>> suggests the trade-off makes sense here.
>
> Yeah, OTOH probably once those IPIs are sent, the select path should I
> feel be smart enough to factor those pending enqueues, not sure how
> hard or meaningful it is to implement something like that..

Yeah that should be possible - in fact I've just noticed that
find_idlest_cpu already checks idle_cpu() which takes pending enqueues
(rq->wake_list) into account, and I think find_idlest_cpu should be used
in the case I'm talking about here.. I must have been doing something
wrong when testing.

Cheers,
Brendan

>>
>> Signed-off-by: Brendan Jackman <brendan.jackman@arm.com>
>> Cc: Ingo Molnar <mingo@redhat.com>
>> Cc: Peter Zijlstra <peterz@infradead.org>
>> Cc: Josef Bacik <josef@toxicpanda.com>
>> Cc: Joel Fernandes <joelaf@google.com>
>> Cc: Mike Galbraith <efault@gmx.de>
>> Cc: Matt Fleming <matt@codeblueprint.co.uk>
>> ---
>>  include/linux/sched/wake_q.h |  2 ++
>>  kernel/sched/core.c          | 34 +++++++++++++++++++++++-----------
>>  kernel/sched/deadline.c      |  3 ++-
>>  kernel/sched/fair.c          | 17 +++++++++++------
>>  kernel/sched/idle_task.c     |  3 ++-
>>  kernel/sched/rt.c            |  3 ++-
>>  kernel/sched/sched.h         |  3 ++-
>>  kernel/sched/stop_task.c     |  3 ++-
>>  8 files changed, 46 insertions(+), 22 deletions(-)
>>
>> diff --git a/include/linux/sched/wake_q.h b/include/linux/sched/wake_q.h
>> index d03d8a9047dc..607a888eb35b 100644
>> --- a/include/linux/sched/wake_q.h
>> +++ b/include/linux/sched/wake_q.h
>> @@ -33,6 +33,7 @@
>>  struct wake_q_head {
>>         struct wake_q_node *first;
>>         struct wake_q_node **lastp;
>> +       int count;
>>  };
>>
>>  #define WAKE_Q_TAIL ((struct wake_q_node *) 0x01)
>> @@ -44,6 +45,7 @@ static inline void wake_q_init(struct wake_q_head *head)
>>  {
>>         head->first = WAKE_Q_TAIL;
>>         head->lastp = &head->first;
>> +       head->count = 0;
>>  }
>>
>>  extern void wake_q_add(struct wake_q_head *head,
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 0869b20fba81..ddf9257b1467 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -438,6 +438,8 @@ void wake_q_add(struct wake_q_head *head, struct task_struct *task)
>>         if (cmpxchg(&node->next, NULL, WAKE_Q_TAIL))
>>                 return;
>>
>> +       head->count++;
>> +
>>         get_task_struct(task);
>>
>>         /*
>> @@ -447,6 +449,10 @@ void wake_q_add(struct wake_q_head *head, struct task_struct *task)
>>         head->lastp = &node->next;
>>  }
>>
>> +static int
>> +try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags,
>> +              int sibling_count_hint);
>> +
>>  void wake_up_q(struct wake_q_head *head)
>>  {
>>         struct wake_q_node *node = head->first;
>> @@ -461,10 +467,10 @@ void wake_up_q(struct wake_q_head *head)
>>                 task->wake_q.next = NULL;
>>
>>                 /*
>> -                * wake_up_process() implies a wmb() to pair with the queueing
>> +                * try_to_wake_up() implies a wmb() to pair with the queueing
>>                  * in wake_q_add() so as not to miss wakeups.
>>                  */
>> -               wake_up_process(task);
>> +               try_to_wake_up(task, TASK_NORMAL, 0, head->count);
>>                 put_task_struct(task);
>
> Shouldn't you reset head->count after all the tasks have been woken up?
>
>>         }
>>  }
>> @@ -1527,12 +1533,14 @@ static int select_fallback_rq(int cpu, struct task_struct *p)
>>   * The caller (fork, wakeup) owns p->pi_lock, ->cpus_allowed is stable.
>>   */
>>  static inline
>> -int select_task_rq(struct task_struct *p, int cpu, int sd_flags, int wake_flags)
>> +int select_task_rq(struct task_struct *p, int cpu, int sd_flags, int wake_flags,
>> +                  int sibling_count_hint)
>
> This variable seems a bit long to me, how about just sibling_count?
>
>>  {
>>         lockdep_assert_held(&p->pi_lock);
>>
>>         if (p->nr_cpus_allowed > 1)
>> -               cpu = p->sched_class->select_task_rq(p, cpu, sd_flags, wake_flags);
>> +               cpu = p->sched_class->select_task_rq(p, cpu, sd_flags, wake_flags,
>> +                                                    sibling_count_hint);
>
> same.
>
> <snip>
>
>>
>>  static int
>> -select_task_rq_rt(struct task_struct *p, int cpu, int sd_flag, int flags)
>> +select_task_rq_rt(struct task_struct *p, int cpu, int sd_flag, int flags,
>> +                 int sibling_count_hint)
>>  {
>>         struct task_struct *curr;
>>         struct rq *rq;
>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>> index eeef1a3086d1..56ae525618e9 100644
>> --- a/kernel/sched/sched.h
>> +++ b/kernel/sched/sched.h
>> @@ -1419,7 +1419,8 @@ struct sched_class {
>>         void (*put_prev_task) (struct rq *rq, struct task_struct *p);
>>
>>  #ifdef CONFIG_SMP
>> -       int  (*select_task_rq)(struct task_struct *p, int task_cpu, int sd_flag, int flags);
>> +       int  (*select_task_rq)(struct task_struct *p, int task_cpu, int sd_flag, int flags,
>> +                              int subling_count_hint);
>
> s/subling/sibling/
>
>
> thanks,
>
> - Joel

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC] sched/fair: Use wake_q length as a hint for wake_wide
  2017-09-19 10:05   ` Brendan Jackman
@ 2017-09-20  4:39     ` Joel Fernandes
  2017-09-20  5:06       ` Joel Fernandes
  0 siblings, 1 reply; 14+ messages in thread
From: Joel Fernandes @ 2017-09-20  4:39 UTC (permalink / raw)
  To: Brendan Jackman
  Cc: LKML, Andres Oportus, Ingo Molnar, Peter Zijlstra, Josef Bacik,
	Mike Galbraith, Matt Fleming, Dietmar Eggemann, Morten Rasmussen

On Tue, Sep 19, 2017 at 3:05 AM, Brendan Jackman
<brendan.jackman@arm.com> wrote:
> On Mon, Sep 18 2017 at 22:15, Joel Fernandes wrote:
>> Hi Brendan,
> Hi Joel,
>
> Thanks for taking a look :)
>
>> On Fri, Aug 11, 2017 at 2:45 AM, Brendan Jackman
>> <brendan.jackman@arm.com> wrote:
>>> This patch adds a parameter to select_task_rq, sibling_count_hint
>>> allowing the caller, where it has this information, to inform the
>>> sched_class the number of tasks that are being woken up as part of
>>> the same event.
>>>
>>> The wake_q mechanism is one case where this information is available.
>>>
>>> select_task_rq_fair can then use the information to detect that it
>>> needs to widen the search space for task placement in order to avoid
>>> overloading the last-level cache domain's CPUs.
>>>
>>>                                * * *
>>>
>>> The reason I am investigating this change is the following use case
>>> on ARM big.LITTLE (asymmetrical CPU capacity): 1 task per CPU, which
>>> all repeatedly do X amount of work then
>>> pthread_barrier_wait (i.e. sleep until the last task finishes its X
>>> and hits the barrier). On big.LITTLE, the tasks which get a "big" CPU
>>> finish faster, and then those CPUs pull over the tasks that are still
>>> running:
>>>
>>>      v CPU v           ->time->
>>>
>>>                     -------------
>>>    0  (big)         11111  /333
>>>                     -------------
>>>    1  (big)         22222   /444|
>>>                     -------------
>>>    2  (LITTLE)      333333/
>>>                     -------------
>>>    3  (LITTLE)      444444/
>>>                     -------------
>>
>> This is because of the newly idle balancing right?
>
> Probably not: before CPUs 0 and 1 will pull tasks their utilization
> (tasks 1 and 2's blocked load) needs to decay enough that they have more
> spare capacity than CPUs 2 and 3 (with imbalance_pct etc), so it's more
> likely to be a nohz idle balance. We also need need_active_balance to be
> satisfied (assuming 3 and 4 are running constantly) - in Android this is
> hastened by a special case [1], in mainline it takes longer.
>
> [1] https://android.googlesource.com/kernel/common/+/android-4.4/kernel/sched/fair.c#8865

Ok, I agree.

>
>>>
>>> Now when task 4 hits the barrier (at |) and wakes the others up,
>>> there are 4 tasks with prev_cpu=<big> and 0 tasks with
>>> prev_cpu=<little>. want_affine therefore means that we'll only look
>>> in CPUs 0 and 1 (sd_llc), so tasks will be unnecessarily coscheduled
>>> on the bigs until the next load balance, something like this:
>>>
>>>      v CPU v           ->time->
>>>
>>>                     ------------------------
>>>    0  (big)         11111  /333  31313\33333
>>>                     ------------------------
>>>    1  (big)         22222   /444|424\4444444
>>>                     ------------------------
>>>    2  (LITTLE)      333333/          \222222
>>>                     ------------------------
>>>    3  (LITTLE)      444444/            \1111
>>>                     ------------------------
>>>                                  ^^^
>>>                            underutilization
>>
>> I wonder if this problem can be better solved by not having the first
>> newly-idle-balance pull happen based on some condition, only to be
>> corrected later by another balancing act?
>
> Well the fact that CPUs 0 and 1 pull 3 and 4 is desirable - we want to
> use the big CPUs as much as possible for these always-running
> tasks. Moving load back to the LITTLE CPUs is not "correcting" a
> previous mistake, it's responding to a change in conditions (i.e. we
> have more runnable tasks than big CPUs now) and this patch is about
> trying to do that response ASAP.

Agreed.

>>>
>>> So, I'm trying to get want_affine = 0 for these tasks.
>>>
>>> I don't _think_ any incarnation of the wakee_flips mechanism can help
>>> us here because which task is waker and which tasks are wakees
>>> generally changes with each iteration.
>>>
>>> However pthread_barrier_wait (or more accurately FUTEX_WAKE) has the
>>> nice property that we know exactly how many tasks are being woken, so
>>> we can cheat.
>>>
>>> It might be a disadvantage that we "widen" _every_ task that's woken in
>>> an event, while select_idle_sibling would work fine for the first
>>> sd_llc_size - 1 tasks.
>>>
>>> IIUC, if wake_affine() behaves correctly this trick wouldn't be
>>> necessary on SMP systems, so it might be best guarded by the presence
>>
>> Actually wake_affine doesn't check for balance if previous/next cpu
>> are within the same shared cache domain. The difference is some time
>> ago it would return true for shared cache but now it returns false as
>> of 4.14-rc1:
>> http://elixir.free-electrons.com/linux/v4.14-rc1/source/kernel/sched/fair.c#L5466
>>
>> Since it would return false in the above wake up cases for task 1 and
>> 2, it would then run select_idle_sibling on the previous CPU which is
>> also within the big cluster, so I don't think it will make a
>> difference in this case... Infact what it returns probably doesn't
>> matter.
>
> So my paragraph here was making a leap in reasoning, let me try to fill
> the gap: On SMP these tasks never need to move around. If by some chance
> they did get coscheduled, the first load balance would spread them out and
> then every time they wake up from then on, prev_cpu is the sensible
> choice. So it will look something like:
>
>             v CPU v           ->time->
>
>                             -------------
>          {  0  (SAME)       11111111111
>   cache  {                  -------------
>          {  1  (SAME)       222222222222|
>                             -------------
>          {  2  (SAME)       33333333333
>   cache  {                  -------------
>          {  3  (SAME)       44444444444
>                             -------------
>
> So here, task 2 wakes up the other guys and when it's doing tasks 3 and
> 4, prev_cpu and smp_processor_id() don't share a cache, so IIUC its'
> basically wake_affine's job to decide between prev_cpu and
> smp_processor_id(). So "if wake_affine is behaving correctly" the
> problem that this patch aims to solve (i.e. the fact that we overload
> the waker's LLC domain because of bias towards prev_cpu) does not arise
> on SMP.
>

Yes SMP, but your patch is for solving a problem for non-SMP. So your
original statement about wake_affine solving any problem for SMP is
not relevant I feel :-P. I guess you can just kill this para from the
commit message to prevent confusion.

>>> of SD_ASYM_CPUCAPACITY?
>>>
>>>                                * * *
>>>
>>> Final note..
>>>
>>> In order to observe "perfect" behaviour for this use case, I also had
>>> to disable the TTWU_QUEUE sched feature. Suppose during the wakeup
>>> above we are working through the work queue and have placed tasks 3
>>> and 2, and are about to place task 1:
>>>
>>>      v CPU v           ->time->
>>>
>>>                     --------------
>>>    0  (big)         11111  /333  3
>>>                     --------------
>>>    1  (big)         22222   /444|4
>>>                     --------------
>>>    2  (LITTLE)      333333/      2
>>>                     --------------
>>>    3  (LITTLE)      444444/          <- Task 1 should go here
>>>                     --------------
>>>
>>> If TTWU_QUEUE is enabled, we will not yet have enqueued task
>>> 2 (having instead sent a reschedule IPI) or attached its load to CPU
>>> 2. So we are likely to also place task 1 on cpu 2. Disabling
>>> TTWU_QUEUE means that we enqueue task 2 before placing task 1,
>>> solving this issue. TTWU_QUEUE is there to minimise rq lock
>>> contention, and I guess that this contention is less of an issue on
>>> big.LITTLE systems since they have relatively few CPUs, which
>>> suggests the trade-off makes sense here.
>>
>> Yeah, OTOH probably once those IPIs are sent, the select path should I
>> feel be smart enough to factor those pending enqueues, not sure how
>> hard or meaningful it is to implement something like that..
>
> Yeah that should be possible - in fact I've just noticed that
> find_idlest_cpu already checks idle_cpu() which takes pending enqueues
> (rq->wake_list) into account, and I think find_idlest_cpu should be used
> in the case I'm talking about here.. I must have been doing something
> wrong when testing.

Oh ok!

thanks,

- Joel

[..]

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC] sched/fair: Use wake_q length as a hint for wake_wide
  2017-09-20  4:39     ` Joel Fernandes
@ 2017-09-20  5:06       ` Joel Fernandes
  2017-09-20  9:33         ` Brendan Jackman
  0 siblings, 1 reply; 14+ messages in thread
From: Joel Fernandes @ 2017-09-20  5:06 UTC (permalink / raw)
  To: Brendan Jackman
  Cc: LKML, Andres Oportus, Ingo Molnar, Peter Zijlstra, Josef Bacik,
	Mike Galbraith, Matt Fleming, Dietmar Eggemann, Morten Rasmussen

> On Tue, Sep 19, 2017 at 3:05 AM, Brendan Jackman
> <brendan.jackman@arm.com> wrote:
>> On Mon, Sep 18 2017 at 22:15, Joel Fernandes wrote:
[..]
>>>> IIUC, if wake_affine() behaves correctly this trick wouldn't be
>>>> necessary on SMP systems, so it might be best guarded by the presence
>>>
>>> Actually wake_affine doesn't check for balance if previous/next cpu
>>> are within the same shared cache domain. The difference is some time
>>> ago it would return true for shared cache but now it returns false as
>>> of 4.14-rc1:
>>> http://elixir.free-electrons.com/linux/v4.14-rc1/source/kernel/sched/fair.c#L5466
>>>
>>> Since it would return false in the above wake up cases for task 1 and
>>> 2, it would then run select_idle_sibling on the previous CPU which is
>>> also within the big cluster, so I don't think it will make a
>>> difference in this case... Infact what it returns probably doesn't
>>> matter.
>>
>> So my paragraph here was making a leap in reasoning, let me try to fill
>> the gap: On SMP these tasks never need to move around. If by some chance
>> they did get coscheduled, the first load balance would spread them out and
>> then every time they wake up from then on, prev_cpu is the sensible
>> choice. So it will look something like:
>>
>>             v CPU v           ->time->
>>
>>                             -------------
>>          {  0  (SAME)       11111111111
>>   cache  {                  -------------
>>          {  1  (SAME)       222222222222|
>>                             -------------
>>          {  2  (SAME)       33333333333
>>   cache  {                  -------------
>>          {  3  (SAME)       44444444444
>>                             -------------
>>
>> So here, task 2 wakes up the other guys and when it's doing tasks 3 and
>> 4, prev_cpu and smp_processor_id() don't share a cache, so IIUC its'
>> basically wake_affine's job to decide between prev_cpu and
>> smp_processor_id(). So "if wake_affine is behaving correctly" the
>> problem that this patch aims to solve (i.e. the fact that we overload
>> the waker's LLC domain because of bias towards prev_cpu) does not arise
>> on SMP.
>>
>
> Yes SMP, but your patch is for solving a problem for non-SMP. So your
> original statement about wake_affine solving any problem for SMP is
> not relevant I feel :-P. I guess you can just kill this para from the
> commit message to prevent confusion.

Ok I take that back, you were talking about guarding this feature by
the SD_ASYM_CPUCAPACITY flag.

I don't think that protection would be helpful because you can have
the same issue if the tasks do different amount of work on SMP. So in
that case some threads might still complete before the others and you
run into the same thing.

thanks,

- Joel

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC] sched/fair: Use wake_q length as a hint for wake_wide
  2017-09-20  5:06       ` Joel Fernandes
@ 2017-09-20  9:33         ` Brendan Jackman
  2017-09-20 20:23           ` Joel Fernandes
  0 siblings, 1 reply; 14+ messages in thread
From: Brendan Jackman @ 2017-09-20  9:33 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: LKML, Andres Oportus, Ingo Molnar, Peter Zijlstra, Josef Bacik,
	Mike Galbraith, Matt Fleming, Dietmar Eggemann, Morten Rasmussen


On Wed, Sep 20 2017 at 05:06, Joel Fernandes wrote:
>> On Tue, Sep 19, 2017 at 3:05 AM, Brendan Jackman
>> <brendan.jackman@arm.com> wrote:
>>> On Mon, Sep 18 2017 at 22:15, Joel Fernandes wrote:
> [..]
>>>>> IIUC, if wake_affine() behaves correctly this trick wouldn't be
>>>>> necessary on SMP systems, so it might be best guarded by the presence
>>>>
>>>> Actually wake_affine doesn't check for balance if previous/next cpu
>>>> are within the same shared cache domain. The difference is some time
>>>> ago it would return true for shared cache but now it returns false as
>>>> of 4.14-rc1:
>>>> http://elixir.free-electrons.com/linux/v4.14-rc1/source/kernel/sched/fair.c#L5466
>>>>
>>>> Since it would return false in the above wake up cases for task 1 and
>>>> 2, it would then run select_idle_sibling on the previous CPU which is
>>>> also within the big cluster, so I don't think it will make a
>>>> difference in this case... Infact what it returns probably doesn't
>>>> matter.
>>>
>>> So my paragraph here was making a leap in reasoning, let me try to fill
>>> the gap: On SMP these tasks never need to move around. If by some chance
>>> they did get coscheduled, the first load balance would spread them out and
>>> then every time they wake up from then on, prev_cpu is the sensible
>>> choice. So it will look something like:
>>>
>>>             v CPU v           ->time->
>>>
>>>                             -------------
>>>          {  0  (SAME)       11111111111
>>>   cache  {                  -------------
>>>          {  1  (SAME)       222222222222|
>>>                             -------------
>>>          {  2  (SAME)       33333333333
>>>   cache  {                  -------------
>>>          {  3  (SAME)       44444444444
>>>                             -------------
>>>
>>> So here, task 2 wakes up the other guys and when it's doing tasks 3 and
>>> 4, prev_cpu and smp_processor_id() don't share a cache, so IIUC its'
>>> basically wake_affine's job to decide between prev_cpu and
>>> smp_processor_id(). So "if wake_affine is behaving correctly" the
>>> problem that this patch aims to solve (i.e. the fact that we overload
>>> the waker's LLC domain because of bias towards prev_cpu) does not arise
>>> on SMP.
>>>
>>
>> Yes SMP, but your patch is for solving a problem for non-SMP. So your
>> original statement about wake_affine solving any problem for SMP is
>> not relevant I feel :-P. I guess you can just kill this para from the
>> commit message to prevent confusion.
>
> Ok I take that back, you were talking about guarding this feature by
> the SD_ASYM_CPUCAPACITY flag.
>
> I don't think that protection would be helpful because you can have
> the same issue if the tasks do different amount of work on SMP. So in
> that case some threads might still complete before the others and you
> run into the same thing.

Well assuming we're still talking about one task per CPU, if you have
tasks doing different amount of work there's still no reason to move the
longer-running threads around. The only reason that happens in my
example is because of the asym capacity.

There might be a case on SMP where this would have a benefit, but the
reason I suggested disabling it for SMP is because the cost of this
patch is that we use the slow path a whole load more for certain usage
patterns. On big.LITTLE etc. my thinking is that's the price we have to
pay to get task placement right; on SMP I don't think we'd want to pay
that price unless we see a good reason.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC] sched/fair: Use wake_q length as a hint for wake_wide
  2017-09-20  9:33         ` Brendan Jackman
@ 2017-09-20 20:23           ` Joel Fernandes
  2017-09-20 21:17             ` Atish Patra
  0 siblings, 1 reply; 14+ messages in thread
From: Joel Fernandes @ 2017-09-20 20:23 UTC (permalink / raw)
  To: Brendan Jackman
  Cc: LKML, Andres Oportus, Ingo Molnar, Peter Zijlstra, Josef Bacik,
	Mike Galbraith, Matt Fleming, Dietmar Eggemann, Morten Rasmussen

On Wed, Sep 20, 2017 at 2:33 AM, Brendan Jackman
<brendan.jackman@arm.com> wrote:
>
> On Wed, Sep 20 2017 at 05:06, Joel Fernandes wrote:
>>> On Tue, Sep 19, 2017 at 3:05 AM, Brendan Jackman
>>> <brendan.jackman@arm.com> wrote:
>>>> On Mon, Sep 18 2017 at 22:15, Joel Fernandes wrote:
>> [..]
>>>>>> IIUC, if wake_affine() behaves correctly this trick wouldn't be
>>>>>> necessary on SMP systems, so it might be best guarded by the presence
>>>>>
>>>>> Actually wake_affine doesn't check for balance if previous/next cpu
>>>>> are within the same shared cache domain. The difference is some time
>>>>> ago it would return true for shared cache but now it returns false as
>>>>> of 4.14-rc1:
>>>>> http://elixir.free-electrons.com/linux/v4.14-rc1/source/kernel/sched/fair.c#L5466
>>>>>
>>>>> Since it would return false in the above wake up cases for task 1 and
>>>>> 2, it would then run select_idle_sibling on the previous CPU which is
>>>>> also within the big cluster, so I don't think it will make a
>>>>> difference in this case... Infact what it returns probably doesn't
>>>>> matter.
>>>>
>>>> So my paragraph here was making a leap in reasoning, let me try to fill
>>>> the gap: On SMP these tasks never need to move around. If by some chance
>>>> they did get coscheduled, the first load balance would spread them out and
>>>> then every time they wake up from then on, prev_cpu is the sensible
>>>> choice. So it will look something like:
>>>>
>>>>             v CPU v           ->time->
>>>>
>>>>                             -------------
>>>>          {  0  (SAME)       11111111111
>>>>   cache  {                  -------------
>>>>          {  1  (SAME)       222222222222|
>>>>                             -------------
>>>>          {  2  (SAME)       33333333333
>>>>   cache  {                  -------------
>>>>          {  3  (SAME)       44444444444
>>>>                             -------------
>>>>
>>>> So here, task 2 wakes up the other guys and when it's doing tasks 3 and
>>>> 4, prev_cpu and smp_processor_id() don't share a cache, so IIUC its'
>>>> basically wake_affine's job to decide between prev_cpu and
>>>> smp_processor_id(). So "if wake_affine is behaving correctly" the
>>>> problem that this patch aims to solve (i.e. the fact that we overload
>>>> the waker's LLC domain because of bias towards prev_cpu) does not arise
>>>> on SMP.
>>>>
>>>
>>> Yes SMP, but your patch is for solving a problem for non-SMP. So your
>>> original statement about wake_affine solving any problem for SMP is
>>> not relevant I feel :-P. I guess you can just kill this para from the
>>> commit message to prevent confusion.
>>
>> Ok I take that back, you were talking about guarding this feature by
>> the SD_ASYM_CPUCAPACITY flag.
>>
>> I don't think that protection would be helpful because you can have
>> the same issue if the tasks do different amount of work on SMP. So in
>> that case some threads might still complete before the others and you
>> run into the same thing.
>
> Well assuming we're still talking about one task per CPU, if you have
> tasks doing different amount of work there's still no reason to move the
> longer-running threads around. The only reason that happens in my
> example is because of the asym capacity.

Yes but you can very well have RT pressure and things that temporarily
change the capacity equality. Also this is a simple benchmark and for
any reason you have more than 1 task running on those other CPUs and
then the idle CPUs run some of the tasks and you run into a similar
situation that might need your patch..

Also one more note, the SD_ASYM_CPUCAPACITY protection is still not
needed because SD_BALANCE_WAKE isn't turned on for
!SD_ASYM_CPUCAPACITY from what I learnt from discussions with Mike, I
believe its this piece of code in sd_init that actually enables it:

        if (sd->flags & SD_ASYM_CPUCAPACITY) {
                struct sched_domain *t = sd;

                for_each_lower_domain(t)
                        t->flags |= SD_BALANCE_WAKE;
        }


thanks,

- Joel

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC] sched/fair: Use wake_q length as a hint for wake_wide
  2017-09-20 20:23           ` Joel Fernandes
@ 2017-09-20 21:17             ` Atish Patra
  2017-09-21  5:50               ` Joel Fernandes
  0 siblings, 1 reply; 14+ messages in thread
From: Atish Patra @ 2017-09-20 21:17 UTC (permalink / raw)
  To: Joel Fernandes, Brendan Jackman
  Cc: LKML, Andres Oportus, Ingo Molnar, Peter Zijlstra, Josef Bacik,
	Mike Galbraith, Matt Fleming, Dietmar Eggemann, Morten Rasmussen

On 09/20/2017 03:23 PM, Joel Fernandes wrote:
> On Wed, Sep 20, 2017 at 2:33 AM, Brendan Jackman
> <brendan.jackman@arm.com> wrote:
>>
>> On Wed, Sep 20 2017 at 05:06, Joel Fernandes wrote:
>>>> On Tue, Sep 19, 2017 at 3:05 AM, Brendan Jackman
>>>> <brendan.jackman@arm.com> wrote:
>>>>> On Mon, Sep 18 2017 at 22:15, Joel Fernandes wrote:
>>> [..]
>>>>>>> IIUC, if wake_affine() behaves correctly this trick wouldn't be
>>>>>>> necessary on SMP systems, so it might be best guarded by the presence
>>>>>>
>>>>>> Actually wake_affine doesn't check for balance if previous/next cpu
>>>>>> are within the same shared cache domain. The difference is some time
>>>>>> ago it would return true for shared cache but now it returns false as
>>>>>> of 4.14-rc1:
>>>>>> http://elixir.free-electrons.com/linux/v4.14-rc1/source/kernel/sched/fair.c#L5466
>>>>>>
>>>>>> Since it would return false in the above wake up cases for task 1 and
>>>>>> 2, it would then run select_idle_sibling on the previous CPU which is
>>>>>> also within the big cluster, so I don't think it will make a
>>>>>> difference in this case... Infact what it returns probably doesn't
>>>>>> matter.
>>>>>
>>>>> So my paragraph here was making a leap in reasoning, let me try to fill
>>>>> the gap: On SMP these tasks never need to move around. If by some chance
>>>>> they did get coscheduled, the first load balance would spread them out and
>>>>> then every time they wake up from then on, prev_cpu is the sensible
>>>>> choice. So it will look something like:
>>>>>
>>>>>             v CPU v           ->time->
>>>>>
>>>>>                             -------------
>>>>>          {  0  (SAME)       11111111111
>>>>>   cache  {                  -------------
>>>>>          {  1  (SAME)       222222222222|
>>>>>                             -------------
>>>>>          {  2  (SAME)       33333333333
>>>>>   cache  {                  -------------
>>>>>          {  3  (SAME)       44444444444
>>>>>                             -------------
>>>>>
>>>>> So here, task 2 wakes up the other guys and when it's doing tasks 3 and
>>>>> 4, prev_cpu and smp_processor_id() don't share a cache, so IIUC its'
>>>>> basically wake_affine's job to decide between prev_cpu and
>>>>> smp_processor_id(). So "if wake_affine is behaving correctly" the
>>>>> problem that this patch aims to solve (i.e. the fact that we overload
>>>>> the waker's LLC domain because of bias towards prev_cpu) does not arise
>>>>> on SMP.
>>>>>
>>>>
>>>> Yes SMP, but your patch is for solving a problem for non-SMP. So your
>>>> original statement about wake_affine solving any problem for SMP is
>>>> not relevant I feel :-P. I guess you can just kill this para from the
>>>> commit message to prevent confusion.
>>>
>>> Ok I take that back, you were talking about guarding this feature by
>>> the SD_ASYM_CPUCAPACITY flag.
>>>
>>> I don't think that protection would be helpful because you can have
>>> the same issue if the tasks do different amount of work on SMP. So in
>>> that case some threads might still complete before the others and you
>>> run into the same thing.
>>
>> Well assuming we're still talking about one task per CPU, if you have
>> tasks doing different amount of work there's still no reason to move the
>> longer-running threads around. The only reason that happens in my
>> example is because of the asym capacity.
>
> Yes but you can very well have RT pressure and things that temporarily
> change the capacity equality. Also this is a simple benchmark and for
> any reason you have more than 1 task running on those other CPUs and
> then the idle CPUs run some of the tasks and you run into a similar
> situation that might need your patch..
>
The patch would be helpful only if it doesn't cross NUMA boundary. right ?

If NUMA comes into picture, not sure searching across NUMA may hurt more 
than help, especially in this case.
> Also one more note, the SD_ASYM_CPUCAPACITY protection is still not
> needed because SD_BALANCE_WAKE isn't turned on for
> !SD_ASYM_CPUCAPACITY from what I learnt from discussions with Mike, I
> believe its this piece of code in sd_init that actually enables it:
>
>         if (sd->flags & SD_ASYM_CPUCAPACITY) {
>                 struct sched_domain *t = sd;
>
>                 for_each_lower_domain(t)
>                         t->flags |= SD_BALANCE_WAKE;
>         }
>
>
But it will always try to search within LLC group if want_affine is set 
irrespective of the SD_BALANCE_WAKE flag.

if (affine_sd) {
		sd = NULL; /* Prefer wake_affine over balance flags */
		if (cpu == prev_cpu)
			goto pick_cpu;

		if (wake_affine(affine_sd, p, prev_cpu, sync))
			new_cpu = cpu;
	}

	if (!sd) {
  pick_cpu:
		if (sd_flag & SD_BALANCE_WAKE) /* XXX always ? */
			new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);

SD_ASYM_CPUCAPACITY protection may still be required if it is a 
non-issue for big SMP servers.

Regards,
Atish
> thanks,
>
> - Joel
>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC] sched/fair: Use wake_q length as a hint for wake_wide
  2017-09-20 21:17             ` Atish Patra
@ 2017-09-21  5:50               ` Joel Fernandes
  2017-09-21 16:51                 ` Atish Patra
  0 siblings, 1 reply; 14+ messages in thread
From: Joel Fernandes @ 2017-09-21  5:50 UTC (permalink / raw)
  To: atish.patra
  Cc: Brendan Jackman, LKML, Andres Oportus, Ingo Molnar,
	Peter Zijlstra, Josef Bacik, Mike Galbraith, Matt Fleming,
	Dietmar Eggemann, Morten Rasmussen

On Wed, Sep 20, 2017 at 2:17 PM, Atish Patra <atish.patra@oracle.com> wrote:
> On 09/20/2017 03:23 PM, Joel Fernandes wrote:
>>
>> On Wed, Sep 20, 2017 at 2:33 AM, Brendan Jackman
>> <brendan.jackman@arm.com> wrote:
>>>
>>>
>>> On Wed, Sep 20 2017 at 05:06, Joel Fernandes wrote:
>>>>>
>>>>> On Tue, Sep 19, 2017 at 3:05 AM, Brendan Jackman
>>>>> <brendan.jackman@arm.com> wrote:
>>>>>>
>>>>>> On Mon, Sep 18 2017 at 22:15, Joel Fernandes wrote:
>>>>
>>>> [..]
>>>>>>>>
>>>>>>>> IIUC, if wake_affine() behaves correctly this trick wouldn't be
>>>>>>>> necessary on SMP systems, so it might be best guarded by the
>>>>>>>> presence
>>>>>>>
>>>>>>>
>>>>>>> Actually wake_affine doesn't check for balance if previous/next cpu
>>>>>>> are within the same shared cache domain. The difference is some time
>>>>>>> ago it would return true for shared cache but now it returns false
as
>>>>>>> of 4.14-rc1:
>>>>>>>
>>>>>>>
http://elixir.free-electrons.com/linux/v4.14-rc1/source/kernel/sched/fair.c#L5466
>>>>>>>
>>>>>>> Since it would return false in the above wake up cases for task 1
and
>>>>>>> 2, it would then run select_idle_sibling on the previous CPU which
is
>>>>>>> also within the big cluster, so I don't think it will make a
>>>>>>> difference in this case... Infact what it returns probably doesn't
>>>>>>> matter.
>>>>>>
>>>>>>
>>>>>> So my paragraph here was making a leap in reasoning, let me try to
>>>>>> fill
>>>>>> the gap: On SMP these tasks never need to move around. If by some
>>>>>> chance
>>>>>> they did get coscheduled, the first load balance would spread them
out
>>>>>> and
>>>>>> then every time they wake up from then on, prev_cpu is the sensible
>>>>>> choice. So it will look something like:
>>>>>>
>>>>>>             v CPU v           ->time->
>>>>>>
>>>>>>                             -------------
>>>>>>          {  0  (SAME)       11111111111
>>>>>>   cache  {                  -------------
>>>>>>          {  1  (SAME)       222222222222|
>>>>>>                             -------------
>>>>>>          {  2  (SAME)       33333333333
>>>>>>   cache  {                  -------------
>>>>>>          {  3  (SAME)       44444444444
>>>>>>                             -------------
>>>>>>
>>>>>> So here, task 2 wakes up the other guys and when it's doing tasks 3
>>>>>> and
>>>>>> 4, prev_cpu and smp_processor_id() don't share a cache, so IIUC its'
>>>>>> basically wake_affine's job to decide between prev_cpu and
>>>>>> smp_processor_id(). So "if wake_affine is behaving correctly" the
>>>>>> problem that this patch aims to solve (i.e. the fact that we overload
>>>>>> the waker's LLC domain because of bias towards prev_cpu) does not
>>>>>> arise
>>>>>> on SMP.
>>>>>>
>>>>>
>>>>> Yes SMP, but your patch is for solving a problem for non-SMP. So your
>>>>> original statement about wake_affine solving any problem for SMP is
>>>>> not relevant I feel :-P. I guess you can just kill this para from the
>>>>> commit message to prevent confusion.
>>>>
>>>>
>>>> Ok I take that back, you were talking about guarding this feature by
>>>> the SD_ASYM_CPUCAPACITY flag.
>>>>
>>>> I don't think that protection would be helpful because you can have
>>>> the same issue if the tasks do different amount of work on SMP. So in
>>>> that case some threads might still complete before the others and you
>>>> run into the same thing.
>>>
>>>
>>> Well assuming we're still talking about one task per CPU, if you have
>>> tasks doing different amount of work there's still no reason to move the
>>> longer-running threads around. The only reason that happens in my
>>> example is because of the asym capacity.
>>
>>
>> Yes but you can very well have RT pressure and things that temporarily
>> change the capacity equality. Also this is a simple benchmark and for
>> any reason you have more than 1 task running on those other CPUs and
>> then the idle CPUs run some of the tasks and you run into a similar
>> situation that might need your patch..
>>
> The patch would be helpful only if it doesn't cross NUMA boundary. right ?
>
> If NUMA comes into picture, not sure searching across NUMA may hurt more
> than help, especially in this case.

I don't understand what you mean by "searching across NUMA", :-(, do you
mean the slow path?

As I said, if the SD_BALANCE_WAKE flag for the sched domain flag is not
set, then a full wide search isn't done anyway. You have this code that
sets the sd variable:

                if (tmp->flags & sd_flag)
                        sd = tmp;

Since sd = NULL if the sched domain (tmp->flags) isn't set, you will
always have select_idle_sibling running and not doing the full search
if I understand correctly.

Further adding the ASYM protection isn't sensible if capacities are
affected by RT and IRQ time etc anyway. Does that make sense?

I am glad I understand the code a bit better now after staring at it
for quite some time but I think some more staring is needed.

- Joel

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC] sched/fair: Use wake_q length as a hint for wake_wide
  2017-09-21  5:50               ` Joel Fernandes
@ 2017-09-21 16:51                 ` Atish Patra
  0 siblings, 0 replies; 14+ messages in thread
From: Atish Patra @ 2017-09-21 16:51 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Brendan Jackman, LKML, Andres Oportus, Ingo Molnar,
	Peter Zijlstra, Josef Bacik, Mike Galbraith, Matt Fleming,
	Dietmar Eggemann, Morten Rasmussen

On 09/21/2017 12:50 AM, Joel Fernandes wrote:
> On Wed, Sep 20, 2017 at 2:17 PM, Atish Patra <atish.patra@oracle.com> wrote:
>> On 09/20/2017 03:23 PM, Joel Fernandes wrote:
>>>
>>> On Wed, Sep 20, 2017 at 2:33 AM, Brendan Jackman
>>> <brendan.jackman@arm.com> wrote:
>>>>
>>>>
>>>> On Wed, Sep 20 2017 at 05:06, Joel Fernandes wrote:
>>>>>>
>>>>>> On Tue, Sep 19, 2017 at 3:05 AM, Brendan Jackman
>>>>>> <brendan.jackman@arm.com> wrote:
>>>>>>>
>>>>>>> On Mon, Sep 18 2017 at 22:15, Joel Fernandes wrote:
>>>>>
>>>>> [..]
>>>>>>>>>
>>>>>>>>> IIUC, if wake_affine() behaves correctly this trick wouldn't be
>>>>>>>>> necessary on SMP systems, so it might be best guarded by the
>>>>>>>>> presence
>>>>>>>>
>>>>>>>>
>>>>>>>> Actually wake_affine doesn't check for balance if previous/next cpu
>>>>>>>> are within the same shared cache domain. The difference is some time
>>>>>>>> ago it would return true for shared cache but now it returns false
> as
>>>>>>>> of 4.14-rc1:
>>>>>>>>
>>>>>>>>
> http://elixir.free-electrons.com/linux/v4.14-rc1/source/kernel/sched/fair.c#L5466
>>>>>>>>
>>>>>>>> Since it would return false in the above wake up cases for task 1
> and
>>>>>>>> 2, it would then run select_idle_sibling on the previous CPU which
> is
>>>>>>>> also within the big cluster, so I don't think it will make a
>>>>>>>> difference in this case... Infact what it returns probably doesn't
>>>>>>>> matter.
>>>>>>>
>>>>>>>
>>>>>>> So my paragraph here was making a leap in reasoning, let me try to
>>>>>>> fill
>>>>>>> the gap: On SMP these tasks never need to move around. If by some
>>>>>>> chance
>>>>>>> they did get coscheduled, the first load balance would spread them
> out
>>>>>>> and
>>>>>>> then every time they wake up from then on, prev_cpu is the sensible
>>>>>>> choice. So it will look something like:
>>>>>>>
>>>>>>>             v CPU v           ->time->
>>>>>>>
>>>>>>>                             -------------
>>>>>>>          {  0  (SAME)       11111111111
>>>>>>>   cache  {                  -------------
>>>>>>>          {  1  (SAME)       222222222222|
>>>>>>>                             -------------
>>>>>>>          {  2  (SAME)       33333333333
>>>>>>>   cache  {                  -------------
>>>>>>>          {  3  (SAME)       44444444444
>>>>>>>                             -------------
>>>>>>>
>>>>>>> So here, task 2 wakes up the other guys and when it's doing tasks 3
>>>>>>> and
>>>>>>> 4, prev_cpu and smp_processor_id() don't share a cache, so IIUC its'
>>>>>>> basically wake_affine's job to decide between prev_cpu and
>>>>>>> smp_processor_id(). So "if wake_affine is behaving correctly" the
>>>>>>> problem that this patch aims to solve (i.e. the fact that we overload
>>>>>>> the waker's LLC domain because of bias towards prev_cpu) does not
>>>>>>> arise
>>>>>>> on SMP.
>>>>>>>
>>>>>>
>>>>>> Yes SMP, but your patch is for solving a problem for non-SMP. So your
>>>>>> original statement about wake_affine solving any problem for SMP is
>>>>>> not relevant I feel :-P. I guess you can just kill this para from the
>>>>>> commit message to prevent confusion.
>>>>>
>>>>>
>>>>> Ok I take that back, you were talking about guarding this feature by
>>>>> the SD_ASYM_CPUCAPACITY flag.
>>>>>
>>>>> I don't think that protection would be helpful because you can have
>>>>> the same issue if the tasks do different amount of work on SMP. So in
>>>>> that case some threads might still complete before the others and you
>>>>> run into the same thing.
>>>>
>>>>
>>>> Well assuming we're still talking about one task per CPU, if you have
>>>> tasks doing different amount of work there's still no reason to move the
>>>> longer-running threads around. The only reason that happens in my
>>>> example is because of the asym capacity.
>>>
>>>
>>> Yes but you can very well have RT pressure and things that temporarily
>>> change the capacity equality. Also this is a simple benchmark and for
>>> any reason you have more than 1 task running on those other CPUs and
>>> then the idle CPUs run some of the tasks and you run into a similar
>>> situation that might need your patch..
>>>
>> The patch would be helpful only if it doesn't cross NUMA boundary. right ?
>>
>> If NUMA comes into picture, not sure searching across NUMA may hurt more
>> than help, especially in this case.
>
> I don't understand what you mean by "searching across NUMA", :-(, do you
> mean the slow path?
>
> As I said, if the SD_BALANCE_WAKE flag for the sched domain flag is not
> set, then a full wide search isn't done anyway. You have this code that
> sets the sd variable:
>
>                 if (tmp->flags & sd_flag)
>                         sd = tmp;
>
> Since sd = NULL if the sched domain (tmp->flags) isn't set, you will
> always have select_idle_sibling running and not doing the full search
> if I understand correctly.
>
Correct. I was under the impression that the above logic is going
to changed to fix the possible issue raised by Brendan for big SMP 
server as well. I guess that was not the case.


> Further adding the ASYM protection isn't sensible if capacities are
> affected by RT and IRQ time etc anyway. Does that make sense?
>
Yup. We won't need ASYM protection.

Regards,
Atish
> I am glad I understand the code a bit better now after staring at it
> for quite some time but I think some more staring is needed.
>
> - Joel
>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC] sched/fair: Use wake_q length as a hint for wake_wide
  2017-09-19  5:15 ` Joel Fernandes
  2017-09-19 10:05   ` Brendan Jackman
@ 2017-09-20  9:41   ` Brendan Jackman
  1 sibling, 0 replies; 14+ messages in thread
From: Brendan Jackman @ 2017-09-20  9:41 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: LKML, Andres Oportus, Ingo Molnar, Peter Zijlstra, Josef Bacik,
	Mike Galbraith, Matt Fleming, Dietmar Eggemann

Hi Joel,

Sorry I didn't see your comments on the code before, I think it's
orthoganal to the other thread about the overall design so I'll just
respond here.

On Tue, Sep 19 2017 at 05:15, Joel Fernandes wrote:
> Hi Brendan,
>
> On Fri, Aug 11, 2017 at 2:45 AM, Brendan Jackman

[snip]

>>
>> diff --git a/include/linux/sched/wake_q.h b/include/linux/sched/wake_q.h
>> index d03d8a9047dc..607a888eb35b 100644
>> --- a/include/linux/sched/wake_q.h
>> +++ b/include/linux/sched/wake_q.h
>> @@ -33,6 +33,7 @@
>>  struct wake_q_head {
>>         struct wake_q_node *first;
>>         struct wake_q_node **lastp;
>> +       int count;
>>  };
>>
>>  #define WAKE_Q_TAIL ((struct wake_q_node *) 0x01)
>> @@ -44,6 +45,7 @@ static inline void wake_q_init(struct wake_q_head *head)
>>  {
>>         head->first = WAKE_Q_TAIL;
>>         head->lastp = &head->first;
>> +       head->count = 0;
>>  }
>>
>>  extern void wake_q_add(struct wake_q_head *head,
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 0869b20fba81..ddf9257b1467 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -438,6 +438,8 @@ void wake_q_add(struct wake_q_head *head, struct task_struct *task)
>>         if (cmpxchg(&node->next, NULL, WAKE_Q_TAIL))
>>                 return;
>>
>> +       head->count++;
>> +
>>         get_task_struct(task);
>>
>>         /*
>> @@ -447,6 +449,10 @@ void wake_q_add(struct wake_q_head *head, struct task_struct *task)
>>         head->lastp = &node->next;
>>  }
>>
>> +static int
>> +try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags,
>> +              int sibling_count_hint);
>> +
>>  void wake_up_q(struct wake_q_head *head)
>>  {
>>         struct wake_q_node *node = head->first;
>> @@ -461,10 +467,10 @@ void wake_up_q(struct wake_q_head *head)
>>                 task->wake_q.next = NULL;
>>
>>                 /*
>> -                * wake_up_process() implies a wmb() to pair with the queueing
>> +                * try_to_wake_up() implies a wmb() to pair with the queueing
>>                  * in wake_q_add() so as not to miss wakeups.
>>                  */
>> -               wake_up_process(task);
>> +               try_to_wake_up(task, TASK_NORMAL, 0, head->count);
>>                 put_task_struct(task);
>
> Shouldn't you reset head->count after all the tasks have been woken up?

That's done in wake_q_init, which should be enough as you only use a
wake_q once per init [1]

[1] http://elixir.free-electrons.com/linux/latest/source/include/linux/sched/wake_q.h#L33

>
>>         }
>>  }
>> @@ -1527,12 +1533,14 @@ static int select_fallback_rq(int cpu, struct task_struct *p)
>>   * The caller (fork, wakeup) owns p->pi_lock, ->cpus_allowed is stable.
>>   */
>>  static inline
>> -int select_task_rq(struct task_struct *p, int cpu, int sd_flags, int wake_flags)
>> +int select_task_rq(struct task_struct *p, int cpu, int sd_flags, int wake_flags,
>> +                  int sibling_count_hint)
>
> This variable seems a bit long to me, how about just sibling_count?

Yeah, shortening sounds good. Coming back with fresh eyes I think
'sibling' is a bad description, I was thinking of siblings in the
waker/wakee graph of tasks actually but I don't think that's obvious at
all. This is just an RFC so if it ever makes it to PATCH I'll try and
think of something better.

>>  {
>>         lockdep_assert_held(&p->pi_lock);
>>
>>         if (p->nr_cpus_allowed > 1)
>> -               cpu = p->sched_class->select_task_rq(p, cpu, sd_flags, wake_flags);
>> +               cpu = p->sched_class->select_task_rq(p, cpu, sd_flags, wake_flags,
>> +                                                    sibling_count_hint);
>
> same.
>
> <snip>
>
>>
>>  static int
>> -select_task_rq_rt(struct task_struct *p, int cpu, int sd_flag, int flags)
>> +select_task_rq_rt(struct task_struct *p, int cpu, int sd_flag, int flags,
>> +                 int sibling_count_hint)
>>  {
>>         struct task_struct *curr;
>>         struct rq *rq;
>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>> index eeef1a3086d1..56ae525618e9 100644
>> --- a/kernel/sched/sched.h
>> +++ b/kernel/sched/sched.h
>> @@ -1419,7 +1419,8 @@ struct sched_class {
>>         void (*put_prev_task) (struct rq *rq, struct task_struct *p);
>>
>>  #ifdef CONFIG_SMP
>> -       int  (*select_task_rq)(struct task_struct *p, int task_cpu, int sd_flag, int flags);
>> +       int  (*select_task_rq)(struct task_struct *p, int task_cpu, int sd_flag, int flags,
>> +                              int subling_count_hint);
>
> s/subling/sibling/

Yup :|

Thanks,
Brendan

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC] sched/fair: Use wake_q length as a hint for wake_wide
  2017-08-11  9:45 [RFC] sched/fair: Use wake_q length as a hint for wake_wide Brendan Jackman
  2017-09-19  5:15 ` Joel Fernandes
@ 2017-09-20 19:36 ` Atish Patra
  2017-09-29 17:05 ` Brendan Jackman
  2 siblings, 0 replies; 14+ messages in thread
From: Atish Patra @ 2017-09-20 19:36 UTC (permalink / raw)
  To: Brendan Jackman, linux-kernel
  Cc: Joel Fernandes, Andres Oportus, Ingo Molnar, Peter Zijlstra,
	Josef Bacik, Mike Galbraith, Matt Fleming

On 08/11/2017 04:45 AM, Brendan Jackman wrote:
> This patch adds a parameter to select_task_rq, sibling_count_hint
> allowing the caller, where it has this information, to inform the
> sched_class the number of tasks that are being woken up as part of
> the same event.
>
> The wake_q mechanism is one case where this information is available.
>
> select_task_rq_fair can then use the information to detect that it
> needs to widen the search space for task placement in order to avoid
> overloading the last-level cache domain's CPUs.
>
>                                * * *
>
> The reason I am investigating this change is the following use case
> on ARM big.LITTLE (asymmetrical CPU capacity): 1 task per CPU, which
> all repeatedly do X amount of work then
> pthread_barrier_wait (i.e. sleep until the last task finishes its X
> and hits the barrier).

Are all these tasks are homogeneous i.e. does exactly equal amount of work?

  On big.LITTLE, the tasks which get a "big" CPU
> finish faster, and then those CPUs pull over the tasks that are still
> running:
>
>      v CPU v           ->time->
>
>                     -------------
>    0  (big)         11111  /333
>                     -------------
>    1  (big)         22222   /444|
>                     -------------
>    2  (LITTLE)      333333/
>                     -------------
>    3  (LITTLE)      444444/
>                     -------------
>
> Now when task 4 hits the barrier (at |) and wakes the others up,
> there are 4 tasks with prev_cpu=<big> and 0 tasks with
> prev_cpu=<little>. want_affine therefore means that we'll only look
> in CPUs 0 and 1 (sd_llc), so tasks will be unnecessarily coscheduled
> on the bigs until the next load balance, something like this:
>
>      v CPU v           ->time->
>
>                     ------------------------
>    0  (big)         11111  /333  31313\33333
>                     ------------------------
>    1  (big)         22222   /444|424\4444444
>                     ------------------------
>    2  (LITTLE)      333333/          \222222
>                     ------------------------
>    3  (LITTLE)      444444/            \1111
>                     ------------------------
>                                  ^^^
>                            underutilization
>
> So, I'm trying to get want_affine = 0 for these tasks.
>
> I don't _think_ any incarnation of the wakee_flips mechanism can help
> us here because which task is waker and which tasks are wakees
> generally changes with each iteration.
>
> However pthread_barrier_wait (or more accurately FUTEX_WAKE) has the
> nice property that we know exactly how many tasks are being woken, so
> we can cheat.
>
> It might be a disadvantage that we "widen" _every_ task that's woken in
> an event, while select_idle_sibling would work fine for the first
> sd_llc_size - 1 tasks.
>
> IIUC, if wake_affine() behaves correctly this trick wouldn't be
> necessary on SMP systems, so it might be best guarded by the presence
> of SD_ASYM_CPUCAPACITY?
>
>                                * * *
>
> Final note..
>
> In order to observe "perfect" behaviour for this use case, I also had
> to disable the TTWU_QUEUE sched feature. Suppose during the wakeup
> above we are working through the work queue and have placed tasks 3
> and 2, and are about to place task 1:
>
>      v CPU v           ->time->
>
>                     --------------
>    0  (big)         11111  /333  3
>                     --------------
>    1  (big)         22222   /444|4
>                     --------------
>    2  (LITTLE)      333333/      2
>                     --------------
>    3  (LITTLE)      444444/          <- Task 1 should go here
>                     --------------
>
> If TTWU_QUEUE is enabled, we will not yet have enqueued task
> 2 (having instead sent a reschedule IPI) or attached its load to CPU
> 2. So we are likely to also place task 1 on cpu 2. Disabling
> TTWU_QUEUE means that we enqueue task 2 before placing task 1,
> solving this issue. TTWU_QUEUE is there to minimise rq lock
> contention, and I guess that this contention is less of an issue on
> big.LITTLE systems since they have relatively few CPUs, which
> suggests the trade-off makes sense here.
>
> Signed-off-by: Brendan Jackman <brendan.jackman@arm.com>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Josef Bacik <josef@toxicpanda.com>
> Cc: Joel Fernandes <joelaf@google.com>
> Cc: Mike Galbraith <efault@gmx.de>
> Cc: Matt Fleming <matt@codeblueprint.co.uk>
> ---
>  include/linux/sched/wake_q.h |  2 ++
>  kernel/sched/core.c          | 34 +++++++++++++++++++++++-----------
>  kernel/sched/deadline.c      |  3 ++-
>  kernel/sched/fair.c          | 17 +++++++++++------
>  kernel/sched/idle_task.c     |  3 ++-
>  kernel/sched/rt.c            |  3 ++-
>  kernel/sched/sched.h         |  3 ++-
>  kernel/sched/stop_task.c     |  3 ++-
>  8 files changed, 46 insertions(+), 22 deletions(-)
>
> diff --git a/include/linux/sched/wake_q.h b/include/linux/sched/wake_q.h
> index d03d8a9047dc..607a888eb35b 100644
> --- a/include/linux/sched/wake_q.h
> +++ b/include/linux/sched/wake_q.h
> @@ -33,6 +33,7 @@
>  struct wake_q_head {
>  	struct wake_q_node *first;
>  	struct wake_q_node **lastp;
> +	int count;
>  };
>

Instead of passing around the head count, can we store the count in 
task_struct ? The patch would be lot less invasive for a single use-case.

>  #define WAKE_Q_TAIL ((struct wake_q_node *) 0x01)
> @@ -44,6 +45,7 @@ static inline void wake_q_init(struct wake_q_head *head)
>  {
>  	head->first = WAKE_Q_TAIL;
>  	head->lastp = &head->first;
> +	head->count = 0;
>  }
>
>  extern void wake_q_add(struct wake_q_head *head,
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 0869b20fba81..ddf9257b1467 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -438,6 +438,8 @@ void wake_q_add(struct wake_q_head *head, struct task_struct *task)
>  	if (cmpxchg(&node->next, NULL, WAKE_Q_TAIL))
>  		return;
>
> +	head->count++;
> +
>  	get_task_struct(task);
>
>  	/*
> @@ -447,6 +449,10 @@ void wake_q_add(struct wake_q_head *head, struct task_struct *task)
>  	head->lastp = &node->next;
>  }
>
> +static int
> +try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags,
> +	       int sibling_count_hint);
> +
>  void wake_up_q(struct wake_q_head *head)
>  {
>  	struct wake_q_node *node = head->first;
> @@ -461,10 +467,10 @@ void wake_up_q(struct wake_q_head *head)
>  		task->wake_q.next = NULL;
>
>  		/*
> -		 * wake_up_process() implies a wmb() to pair with the queueing
> +		 * try_to_wake_up() implies a wmb() to pair with the queueing
>  		 * in wake_q_add() so as not to miss wakeups.
>  		 */
> -		wake_up_process(task);
> +		try_to_wake_up(task, TASK_NORMAL, 0, head->count);
>  		put_task_struct(task);
>  	}
>  }
> @@ -1527,12 +1533,14 @@ static int select_fallback_rq(int cpu, struct task_struct *p)
>   * The caller (fork, wakeup) owns p->pi_lock, ->cpus_allowed is stable.
>   */
>  static inline
> -int select_task_rq(struct task_struct *p, int cpu, int sd_flags, int wake_flags)
> +int select_task_rq(struct task_struct *p, int cpu, int sd_flags, int wake_flags,
> +		   int sibling_count_hint)
>  {
>  	lockdep_assert_held(&p->pi_lock);
>
>  	if (p->nr_cpus_allowed > 1)
> -		cpu = p->sched_class->select_task_rq(p, cpu, sd_flags, wake_flags);
> +		cpu = p->sched_class->select_task_rq(p, cpu, sd_flags, wake_flags,
> +						     sibling_count_hint);
>  	else
>  		cpu = cpumask_any(&p->cpus_allowed);
>
> @@ -1944,6 +1952,8 @@ static void ttwu_queue(struct task_struct *p, int cpu, int wake_flags)
>   * @p: the thread to be awakened
>   * @state: the mask of task states that can be woken
>   * @wake_flags: wake modifier flags (WF_*)
> + * @sibling_count_hint: A hint at the number of threads that are being woken up
> + *                      in this event.

I also had the same thought as Joel about the naming.
Probably wakee_count ?

>   *
>   * If (@state & @p->state) @p->state = TASK_RUNNING.
>   *
> @@ -1956,7 +1966,8 @@ static void ttwu_queue(struct task_struct *p, int cpu, int wake_flags)
>   *	   %false otherwise.
>   */
>  static int
> -try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
> +try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags,
> +	       int sibling_count_hint)
>  {
>  	unsigned long flags;
>  	int cpu, success = 0;
> @@ -2042,7 +2053,8 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
>  		atomic_dec(&task_rq(p)->nr_iowait);
>  	}
>
> -	cpu = select_task_rq(p, p->wake_cpu, SD_BALANCE_WAKE, wake_flags);
> +	cpu = select_task_rq(p, p->wake_cpu, SD_BALANCE_WAKE, wake_flags,
> +			     sibling_count_hint);
>  	if (task_cpu(p) != cpu) {
>  		wake_flags |= WF_MIGRATED;
>  		set_task_cpu(p, cpu);
> @@ -2130,13 +2142,13 @@ static void try_to_wake_up_local(struct task_struct *p, struct rq_flags *rf)
>   */
>  int wake_up_process(struct task_struct *p)
>  {
> -	return try_to_wake_up(p, TASK_NORMAL, 0);
> +	return try_to_wake_up(p, TASK_NORMAL, 0, 1);
>  }
>  EXPORT_SYMBOL(wake_up_process);
>
>  int wake_up_state(struct task_struct *p, unsigned int state)
>  {
> -	return try_to_wake_up(p, state, 0);
> +	return try_to_wake_up(p, state, 0, 1);
>  }
>
>  /*
> @@ -2442,7 +2454,7 @@ void wake_up_new_task(struct task_struct *p)
>  	 * Use __set_task_cpu() to avoid calling sched_class::migrate_task_rq,
>  	 * as we're not fully set-up yet.
>  	 */
> -	__set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0));
> +	__set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0, 1));
>  #endif
>  	rq = __task_rq_lock(p, &rf);
>  	update_rq_clock(rq);
> @@ -2893,7 +2905,7 @@ void sched_exec(void)
>  	int dest_cpu;
>
>  	raw_spin_lock_irqsave(&p->pi_lock, flags);
> -	dest_cpu = p->sched_class->select_task_rq(p, task_cpu(p), SD_BALANCE_EXEC, 0);
> +	dest_cpu = p->sched_class->select_task_rq(p, task_cpu(p), SD_BALANCE_EXEC, 0, 1);
>  	if (dest_cpu == smp_processor_id())
>  		goto unlock;
>
> @@ -3582,7 +3594,7 @@ asmlinkage __visible void __sched preempt_schedule_irq(void)
>  int default_wake_function(wait_queue_entry_t *curr, unsigned mode, int wake_flags,
>  			  void *key)
>  {
> -	return try_to_wake_up(curr->private, mode, wake_flags);
> +	return try_to_wake_up(curr->private, mode, wake_flags, 1);
>  }
>  EXPORT_SYMBOL(default_wake_function);
>
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index 755bd3f1a1a9..69a9dd407267 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -1516,7 +1516,8 @@ static void yield_task_dl(struct rq *rq)
>  static int find_later_rq(struct task_struct *task);
>
>  static int
> -select_task_rq_dl(struct task_struct *p, int cpu, int sd_flag, int flags)
> +select_task_rq_dl(struct task_struct *p, int cpu, int sd_flag, int flags,
> +		  int sibling_count_hint)
>  {
>  	struct task_struct *curr;
>  	struct rq *rq;
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index c95880e216f6..0a9d706b62bf 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5332,15 +5332,18 @@ static void record_wakee(struct task_struct *p)
>   * whatever is irrelevant, spread criteria is apparent partner count exceeds
>   * socket size.
>   */
> -static int wake_wide(struct task_struct *p)
> +static int wake_wide(struct task_struct *p, int sibling_count_hint)
>  {
>  	unsigned int master = current->wakee_flips;
>  	unsigned int slave = p->wakee_flips;
> -	int factor = this_cpu_read(sd_llc_size);
> +	int llc_size = this_cpu_read(sd_llc_size);
> +
> +	if (sibling_count_hint >= llc_size)
> +		return 1;
>
Since we are talking about 1 task per cpu, should it be 
sibling_count_hint > llc_size ?


>  	if (master < slave)
>  		swap(master, slave);
> -	if (slave < factor || master < slave * factor)
> +	if (slave < llc_size || master < slave * llc_size)
>  		return 0;
>  	return 1;
>  }
> @@ -5869,7 +5872,8 @@ static int wake_cap(struct task_struct *p, int cpu, int prev_cpu)
>   * preempt must be disabled.
>   */
>  static int
> -select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_flags)
> +select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_flags,
> +		    int sibling_count_hint)
>  {
>  	struct sched_domain *tmp, *affine_sd = NULL, *sd = NULL;
>  	int cpu = smp_processor_id();
> @@ -5879,8 +5883,9 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
>
>  	if (sd_flag & SD_BALANCE_WAKE) {
>  		record_wakee(p);
> -		want_affine = !wake_wide(p) && !wake_cap(p, cpu, prev_cpu)
> -			      && cpumask_test_cpu(cpu, &p->cpus_allowed);
> +		want_affine = !wake_wide(p, sibling_count_hint) &&
> +			      !wake_cap(p, cpu, prev_cpu) &&
> +			      cpumask_test_cpu(cpu, &p->cpus_allowed);
>  	}
>
>  	rcu_read_lock();
> diff --git a/kernel/sched/idle_task.c b/kernel/sched/idle_task.c
> index 0c00172db63e..3c343e055110 100644
> --- a/kernel/sched/idle_task.c
> +++ b/kernel/sched/idle_task.c
> @@ -9,7 +9,8 @@
>
>  #ifdef CONFIG_SMP
>  static int
> -select_task_rq_idle(struct task_struct *p, int cpu, int sd_flag, int flags)
> +select_task_rq_idle(struct task_struct *p, int cpu, int sd_flag, int flags,
> +		    int sibling_count_hint)
>  {
>  	return task_cpu(p); /* IDLE tasks as never migrated */
>  }
> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> index 45caf937ef90..b9937dccb8b3 100644
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -1387,7 +1387,8 @@ static void yield_task_rt(struct rq *rq)
>  static int find_lowest_rq(struct task_struct *task);
>
>  static int
> -select_task_rq_rt(struct task_struct *p, int cpu, int sd_flag, int flags)
> +select_task_rq_rt(struct task_struct *p, int cpu, int sd_flag, int flags,
> +		  int sibling_count_hint)
>  {
>  	struct task_struct *curr;
>  	struct rq *rq;
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index eeef1a3086d1..56ae525618e9 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1419,7 +1419,8 @@ struct sched_class {
>  	void (*put_prev_task) (struct rq *rq, struct task_struct *p);
>
>  #ifdef CONFIG_SMP
> -	int  (*select_task_rq)(struct task_struct *p, int task_cpu, int sd_flag, int flags);
> +	int  (*select_task_rq)(struct task_struct *p, int task_cpu, int sd_flag, int flags,
> +			       int subling_count_hint);
>  	void (*migrate_task_rq)(struct task_struct *p);
>
>  	void (*task_woken) (struct rq *this_rq, struct task_struct *task);
> diff --git a/kernel/sched/stop_task.c b/kernel/sched/stop_task.c
> index 9f69fb630853..d0ce4fbb18ef 100644
> --- a/kernel/sched/stop_task.c
> +++ b/kernel/sched/stop_task.c
> @@ -11,7 +11,8 @@
>
>  #ifdef CONFIG_SMP
>  static int
> -select_task_rq_stop(struct task_struct *p, int cpu, int sd_flag, int flags)
> +select_task_rq_stop(struct task_struct *p, int cpu, int sd_flag, int flags,
> +		    int sibling_count_hint)
>  {
>  	return task_cpu(p); /* stop tasks as never migrate */
>  }
>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC] sched/fair: Use wake_q length as a hint for wake_wide
  2017-08-11  9:45 [RFC] sched/fair: Use wake_q length as a hint for wake_wide Brendan Jackman
  2017-09-19  5:15 ` Joel Fernandes
  2017-09-20 19:36 ` Atish Patra
@ 2017-09-29 17:05 ` Brendan Jackman
  2017-10-02 10:46   ` Brendan Jackman
  2 siblings, 1 reply; 14+ messages in thread
From: Brendan Jackman @ 2017-09-29 17:05 UTC (permalink / raw)
  To: linux-kernel, Peter Zijlstra
  Cc: Joel Fernandes, Andres Oportus, Ingo Molnar, Peter Zijlstra,
	Josef Bacik, Mike Galbraith, Matt Fleming


There has been a bit of discussion on this RFC, but before I do any
more work I'd really like your input on the basic idea.

Does the approach of feeding outside info like wake_q length into the
scheduler's decisions seem basically acceptable?

Cheers,
Brendan


On Fri, Aug 11 2017 at 09:45, Brendan Jackman wrote:
> This patch adds a parameter to select_task_rq, sibling_count_hint
> allowing the caller, where it has this information, to inform the
> sched_class the number of tasks that are being woken up as part of
> the same event.
>
> The wake_q mechanism is one case where this information is available.
>
> select_task_rq_fair can then use the information to detect that it
> needs to widen the search space for task placement in order to avoid
> overloading the last-level cache domain's CPUs.
>
>                                * * *
>
> The reason I am investigating this change is the following use case
> on ARM big.LITTLE (asymmetrical CPU capacity): 1 task per CPU, which
> all repeatedly do X amount of work then
> pthread_barrier_wait (i.e. sleep until the last task finishes its X
> and hits the barrier). On big.LITTLE, the tasks which get a "big" CPU
> finish faster, and then those CPUs pull over the tasks that are still
> running:
>
>      v CPU v           ->time->
>
>                     -------------
>    0  (big)         11111  /333
>                     -------------
>    1  (big)         22222   /444|
>                     -------------
>    2  (LITTLE)      333333/
>                     -------------
>    3  (LITTLE)      444444/
>                     -------------
>
> Now when task 4 hits the barrier (at |) and wakes the others up,
> there are 4 tasks with prev_cpu=<big> and 0 tasks with
> prev_cpu=<little>. want_affine therefore means that we'll only look
> in CPUs 0 and 1 (sd_llc), so tasks will be unnecessarily coscheduled
> on the bigs until the next load balance, something like this:
>
>      v CPU v           ->time->
>
>                     ------------------------
>    0  (big)         11111  /333  31313\33333
>                     ------------------------
>    1  (big)         22222   /444|424\4444444
>                     ------------------------
>    2  (LITTLE)      333333/          \222222
>                     ------------------------
>    3  (LITTLE)      444444/            \1111
>                     ------------------------
>                                  ^^^
>                            underutilization
>
> So, I'm trying to get want_affine = 0 for these tasks.
>
> I don't _think_ any incarnation of the wakee_flips mechanism can help
> us here because which task is waker and which tasks are wakees
> generally changes with each iteration.
>
> However pthread_barrier_wait (or more accurately FUTEX_WAKE) has the
> nice property that we know exactly how many tasks are being woken, so
> we can cheat.
>
> It might be a disadvantage that we "widen" _every_ task that's woken in
> an event, while select_idle_sibling would work fine for the first
> sd_llc_size - 1 tasks.
>
> IIUC, if wake_affine() behaves correctly this trick wouldn't be
> necessary on SMP systems, so it might be best guarded by the presence
> of SD_ASYM_CPUCAPACITY?
>
>                                * * *
>
> Final note..
>
> In order to observe "perfect" behaviour for this use case, I also had
> to disable the TTWU_QUEUE sched feature. Suppose during the wakeup
> above we are working through the work queue and have placed tasks 3
> and 2, and are about to place task 1:
>
>      v CPU v           ->time->
>
>                     --------------
>    0  (big)         11111  /333  3
>                     --------------
>    1  (big)         22222   /444|4
>                     --------------
>    2  (LITTLE)      333333/      2
>                     --------------
>    3  (LITTLE)      444444/          <- Task 1 should go here
>                     --------------
>
> If TTWU_QUEUE is enabled, we will not yet have enqueued task
> 2 (having instead sent a reschedule IPI) or attached its load to CPU
> 2. So we are likely to also place task 1 on cpu 2. Disabling
> TTWU_QUEUE means that we enqueue task 2 before placing task 1,
> solving this issue. TTWU_QUEUE is there to minimise rq lock
> contention, and I guess that this contention is less of an issue on
> big.LITTLE systems since they have relatively few CPUs, which
> suggests the trade-off makes sense here.
>
> Signed-off-by: Brendan Jackman <brendan.jackman@arm.com>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Josef Bacik <josef@toxicpanda.com>
> Cc: Joel Fernandes <joelaf@google.com>
> Cc: Mike Galbraith <efault@gmx.de>
> Cc: Matt Fleming <matt@codeblueprint.co.uk>
> ---
>  include/linux/sched/wake_q.h |  2 ++
>  kernel/sched/core.c          | 34 +++++++++++++++++++++++-----------
>  kernel/sched/deadline.c      |  3 ++-
>  kernel/sched/fair.c          | 17 +++++++++++------
>  kernel/sched/idle_task.c     |  3 ++-
>  kernel/sched/rt.c            |  3 ++-
>  kernel/sched/sched.h         |  3 ++-
>  kernel/sched/stop_task.c     |  3 ++-
>  8 files changed, 46 insertions(+), 22 deletions(-)
>
> diff --git a/include/linux/sched/wake_q.h b/include/linux/sched/wake_q.h
> index d03d8a9047dc..607a888eb35b 100644
> --- a/include/linux/sched/wake_q.h
> +++ b/include/linux/sched/wake_q.h
> @@ -33,6 +33,7 @@
>  struct wake_q_head {
>  	struct wake_q_node *first;
>  	struct wake_q_node **lastp;
> +	int count;
>  };
>
>  #define WAKE_Q_TAIL ((struct wake_q_node *) 0x01)
> @@ -44,6 +45,7 @@ static inline void wake_q_init(struct wake_q_head *head)
>  {
>  	head->first = WAKE_Q_TAIL;
>  	head->lastp = &head->first;
> +	head->count = 0;
>  }
>
>  extern void wake_q_add(struct wake_q_head *head,
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 0869b20fba81..ddf9257b1467 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -438,6 +438,8 @@ void wake_q_add(struct wake_q_head *head, struct task_struct *task)
>  	if (cmpxchg(&node->next, NULL, WAKE_Q_TAIL))
>  		return;
>
> +	head->count++;
> +
>  	get_task_struct(task);
>
>  	/*
> @@ -447,6 +449,10 @@ void wake_q_add(struct wake_q_head *head, struct task_struct *task)
>  	head->lastp = &node->next;
>  }
>
> +static int
> +try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags,
> +	       int sibling_count_hint);
> +
>  void wake_up_q(struct wake_q_head *head)
>  {
>  	struct wake_q_node *node = head->first;
> @@ -461,10 +467,10 @@ void wake_up_q(struct wake_q_head *head)
>  		task->wake_q.next = NULL;
>
>  		/*
> -		 * wake_up_process() implies a wmb() to pair with the queueing
> +		 * try_to_wake_up() implies a wmb() to pair with the queueing
>  		 * in wake_q_add() so as not to miss wakeups.
>  		 */
> -		wake_up_process(task);
> +		try_to_wake_up(task, TASK_NORMAL, 0, head->count);
>  		put_task_struct(task);
>  	}
>  }
> @@ -1527,12 +1533,14 @@ static int select_fallback_rq(int cpu, struct task_struct *p)
>   * The caller (fork, wakeup) owns p->pi_lock, ->cpus_allowed is stable.
>   */
>  static inline
> -int select_task_rq(struct task_struct *p, int cpu, int sd_flags, int wake_flags)
> +int select_task_rq(struct task_struct *p, int cpu, int sd_flags, int wake_flags,
> +		   int sibling_count_hint)
>  {
>  	lockdep_assert_held(&p->pi_lock);
>
>  	if (p->nr_cpus_allowed > 1)
> -		cpu = p->sched_class->select_task_rq(p, cpu, sd_flags, wake_flags);
> +		cpu = p->sched_class->select_task_rq(p, cpu, sd_flags, wake_flags,
> +						     sibling_count_hint);
>  	else
>  		cpu = cpumask_any(&p->cpus_allowed);
>
> @@ -1944,6 +1952,8 @@ static void ttwu_queue(struct task_struct *p, int cpu, int wake_flags)
>   * @p: the thread to be awakened
>   * @state: the mask of task states that can be woken
>   * @wake_flags: wake modifier flags (WF_*)
> + * @sibling_count_hint: A hint at the number of threads that are being woken up
> + *                      in this event.
>   *
>   * If (@state & @p->state) @p->state = TASK_RUNNING.
>   *
> @@ -1956,7 +1966,8 @@ static void ttwu_queue(struct task_struct *p, int cpu, int wake_flags)
>   *	   %false otherwise.
>   */
>  static int
> -try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
> +try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags,
> +	       int sibling_count_hint)
>  {
>  	unsigned long flags;
>  	int cpu, success = 0;
> @@ -2042,7 +2053,8 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
>  		atomic_dec(&task_rq(p)->nr_iowait);
>  	}
>
> -	cpu = select_task_rq(p, p->wake_cpu, SD_BALANCE_WAKE, wake_flags);
> +	cpu = select_task_rq(p, p->wake_cpu, SD_BALANCE_WAKE, wake_flags,
> +			     sibling_count_hint);
>  	if (task_cpu(p) != cpu) {
>  		wake_flags |= WF_MIGRATED;
>  		set_task_cpu(p, cpu);
> @@ -2130,13 +2142,13 @@ static void try_to_wake_up_local(struct task_struct *p, struct rq_flags *rf)
>   */
>  int wake_up_process(struct task_struct *p)
>  {
> -	return try_to_wake_up(p, TASK_NORMAL, 0);
> +	return try_to_wake_up(p, TASK_NORMAL, 0, 1);
>  }
>  EXPORT_SYMBOL(wake_up_process);
>
>  int wake_up_state(struct task_struct *p, unsigned int state)
>  {
> -	return try_to_wake_up(p, state, 0);
> +	return try_to_wake_up(p, state, 0, 1);
>  }
>
>  /*
> @@ -2442,7 +2454,7 @@ void wake_up_new_task(struct task_struct *p)
>  	 * Use __set_task_cpu() to avoid calling sched_class::migrate_task_rq,
>  	 * as we're not fully set-up yet.
>  	 */
> -	__set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0));
> +	__set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0, 1));
>  #endif
>  	rq = __task_rq_lock(p, &rf);
>  	update_rq_clock(rq);
> @@ -2893,7 +2905,7 @@ void sched_exec(void)
>  	int dest_cpu;
>
>  	raw_spin_lock_irqsave(&p->pi_lock, flags);
> -	dest_cpu = p->sched_class->select_task_rq(p, task_cpu(p), SD_BALANCE_EXEC, 0);
> +	dest_cpu = p->sched_class->select_task_rq(p, task_cpu(p), SD_BALANCE_EXEC, 0, 1);
>  	if (dest_cpu == smp_processor_id())
>  		goto unlock;
>
> @@ -3582,7 +3594,7 @@ asmlinkage __visible void __sched preempt_schedule_irq(void)
>  int default_wake_function(wait_queue_entry_t *curr, unsigned mode, int wake_flags,
>  			  void *key)
>  {
> -	return try_to_wake_up(curr->private, mode, wake_flags);
> +	return try_to_wake_up(curr->private, mode, wake_flags, 1);
>  }
>  EXPORT_SYMBOL(default_wake_function);
>
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index 755bd3f1a1a9..69a9dd407267 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -1516,7 +1516,8 @@ static void yield_task_dl(struct rq *rq)
>  static int find_later_rq(struct task_struct *task);
>
>  static int
> -select_task_rq_dl(struct task_struct *p, int cpu, int sd_flag, int flags)
> +select_task_rq_dl(struct task_struct *p, int cpu, int sd_flag, int flags,
> +		  int sibling_count_hint)
>  {
>  	struct task_struct *curr;
>  	struct rq *rq;
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index c95880e216f6..0a9d706b62bf 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5332,15 +5332,18 @@ static void record_wakee(struct task_struct *p)
>   * whatever is irrelevant, spread criteria is apparent partner count exceeds
>   * socket size.
>   */
> -static int wake_wide(struct task_struct *p)
> +static int wake_wide(struct task_struct *p, int sibling_count_hint)
>  {
>  	unsigned int master = current->wakee_flips;
>  	unsigned int slave = p->wakee_flips;
> -	int factor = this_cpu_read(sd_llc_size);
> +	int llc_size = this_cpu_read(sd_llc_size);
> +
> +	if (sibling_count_hint >= llc_size)
> +		return 1;
>
>  	if (master < slave)
>  		swap(master, slave);
> -	if (slave < factor || master < slave * factor)
> +	if (slave < llc_size || master < slave * llc_size)
>  		return 0;
>  	return 1;
>  }
> @@ -5869,7 +5872,8 @@ static int wake_cap(struct task_struct *p, int cpu, int prev_cpu)
>   * preempt must be disabled.
>   */
>  static int
> -select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_flags)
> +select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_flags,
> +		    int sibling_count_hint)
>  {
>  	struct sched_domain *tmp, *affine_sd = NULL, *sd = NULL;
>  	int cpu = smp_processor_id();
> @@ -5879,8 +5883,9 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
>
>  	if (sd_flag & SD_BALANCE_WAKE) {
>  		record_wakee(p);
> -		want_affine = !wake_wide(p) && !wake_cap(p, cpu, prev_cpu)
> -			      && cpumask_test_cpu(cpu, &p->cpus_allowed);
> +		want_affine = !wake_wide(p, sibling_count_hint) &&
> +			      !wake_cap(p, cpu, prev_cpu) &&
> +			      cpumask_test_cpu(cpu, &p->cpus_allowed);
>  	}
>
>  	rcu_read_lock();
> diff --git a/kernel/sched/idle_task.c b/kernel/sched/idle_task.c
> index 0c00172db63e..3c343e055110 100644
> --- a/kernel/sched/idle_task.c
> +++ b/kernel/sched/idle_task.c
> @@ -9,7 +9,8 @@
>
>  #ifdef CONFIG_SMP
>  static int
> -select_task_rq_idle(struct task_struct *p, int cpu, int sd_flag, int flags)
> +select_task_rq_idle(struct task_struct *p, int cpu, int sd_flag, int flags,
> +		    int sibling_count_hint)
>  {
>  	return task_cpu(p); /* IDLE tasks as never migrated */
>  }
> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> index 45caf937ef90..b9937dccb8b3 100644
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -1387,7 +1387,8 @@ static void yield_task_rt(struct rq *rq)
>  static int find_lowest_rq(struct task_struct *task);
>
>  static int
> -select_task_rq_rt(struct task_struct *p, int cpu, int sd_flag, int flags)
> +select_task_rq_rt(struct task_struct *p, int cpu, int sd_flag, int flags,
> +		  int sibling_count_hint)
>  {
>  	struct task_struct *curr;
>  	struct rq *rq;
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index eeef1a3086d1..56ae525618e9 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1419,7 +1419,8 @@ struct sched_class {
>  	void (*put_prev_task) (struct rq *rq, struct task_struct *p);
>
>  #ifdef CONFIG_SMP
> -	int  (*select_task_rq)(struct task_struct *p, int task_cpu, int sd_flag, int flags);
> +	int  (*select_task_rq)(struct task_struct *p, int task_cpu, int sd_flag, int flags,
> +			       int subling_count_hint);
>  	void (*migrate_task_rq)(struct task_struct *p);
>
>  	void (*task_woken) (struct rq *this_rq, struct task_struct *task);
> diff --git a/kernel/sched/stop_task.c b/kernel/sched/stop_task.c
> index 9f69fb630853..d0ce4fbb18ef 100644
> --- a/kernel/sched/stop_task.c
> +++ b/kernel/sched/stop_task.c
> @@ -11,7 +11,8 @@
>
>  #ifdef CONFIG_SMP
>  static int
> -select_task_rq_stop(struct task_struct *p, int cpu, int sd_flag, int flags)
> +select_task_rq_stop(struct task_struct *p, int cpu, int sd_flag, int flags,
> +		    int sibling_count_hint)
>  {
>  	return task_cpu(p); /* stop tasks as never migrate */
>  }

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC] sched/fair: Use wake_q length as a hint for wake_wide
  2017-09-29 17:05 ` Brendan Jackman
@ 2017-10-02 10:46   ` Brendan Jackman
  0 siblings, 0 replies; 14+ messages in thread
From: Brendan Jackman @ 2017-10-02 10:46 UTC (permalink / raw)
  To: linux-kernel, Peter Zijlstra
  Cc: Joel Fernandes, Andres Oportus, Ingo Molnar, Peter Zijlstra,
	Josef Bacik, Mike Galbraith, Matt Fleming

Hi PeterZ,

I just got this in my inbox and noticed I didn't adress it to anyone. I
meant to address it to you.

On Fri, Sep 29 2017 at 17:05, Brendan Jackman wrote:
> There has been a bit of discussion on this RFC, but before I do any
> more work I'd really like your input on the basic idea.
>
> Does the approach of feeding outside info like wake_q length into the
> scheduler's decisions seem basically acceptable?
>
> Cheers,
> Brendan
>
>
> On Fri, Aug 11 2017 at 09:45, Brendan Jackman wrote:
>> This patch adds a parameter to select_task_rq, sibling_count_hint
>> allowing the caller, where it has this information, to inform the
>> sched_class the number of tasks that are being woken up as part of
>> the same event.
>>
>> The wake_q mechanism is one case where this information is available.
>>
>> select_task_rq_fair can then use the information to detect that it
>> needs to widen the search space for task placement in order to avoid
>> overloading the last-level cache domain's CPUs.
>>
>>                                * * *
>>
>> The reason I am investigating this change is the following use case
>> on ARM big.LITTLE (asymmetrical CPU capacity): 1 task per CPU, which
>> all repeatedly do X amount of work then
>> pthread_barrier_wait (i.e. sleep until the last task finishes its X
>> and hits the barrier). On big.LITTLE, the tasks which get a "big" CPU
>> finish faster, and then those CPUs pull over the tasks that are still
>> running:
>>
>>      v CPU v           ->time->
>>
>>                     -------------
>>    0  (big)         11111  /333
>>                     -------------
>>    1  (big)         22222   /444|
>>                     -------------
>>    2  (LITTLE)      333333/
>>                     -------------
>>    3  (LITTLE)      444444/
>>                     -------------
>>
>> Now when task 4 hits the barrier (at |) and wakes the others up,
>> there are 4 tasks with prev_cpu=<big> and 0 tasks with
>> prev_cpu=<little>. want_affine therefore means that we'll only look
>> in CPUs 0 and 1 (sd_llc), so tasks will be unnecessarily coscheduled
>> on the bigs until the next load balance, something like this:
>>
>>      v CPU v           ->time->
>>
>>                     ------------------------
>>    0  (big)         11111  /333  31313\33333
>>                     ------------------------
>>    1  (big)         22222   /444|424\4444444
>>                     ------------------------
>>    2  (LITTLE)      333333/          \222222
>>                     ------------------------
>>    3  (LITTLE)      444444/            \1111
>>                     ------------------------
>>                                  ^^^
>>                            underutilization
>>
>> So, I'm trying to get want_affine = 0 for these tasks.
>>
>> I don't _think_ any incarnation of the wakee_flips mechanism can help
>> us here because which task is waker and which tasks are wakees
>> generally changes with each iteration.
>>
>> However pthread_barrier_wait (or more accurately FUTEX_WAKE) has the
>> nice property that we know exactly how many tasks are being woken, so
>> we can cheat.
>>
>> It might be a disadvantage that we "widen" _every_ task that's woken in
>> an event, while select_idle_sibling would work fine for the first
>> sd_llc_size - 1 tasks.
>>
>> IIUC, if wake_affine() behaves correctly this trick wouldn't be
>> necessary on SMP systems, so it might be best guarded by the presence
>> of SD_ASYM_CPUCAPACITY?
>>
>>                                * * *
>>
>> Final note..
>>
>> In order to observe "perfect" behaviour for this use case, I also had
>> to disable the TTWU_QUEUE sched feature. Suppose during the wakeup
>> above we are working through the work queue and have placed tasks 3
>> and 2, and are about to place task 1:
>>
>>      v CPU v           ->time->
>>
>>                     --------------
>>    0  (big)         11111  /333  3
>>                     --------------
>>    1  (big)         22222   /444|4
>>                     --------------
>>    2  (LITTLE)      333333/      2
>>                     --------------
>>    3  (LITTLE)      444444/          <- Task 1 should go here
>>                     --------------
>>
>> If TTWU_QUEUE is enabled, we will not yet have enqueued task
>> 2 (having instead sent a reschedule IPI) or attached its load to CPU
>> 2. So we are likely to also place task 1 on cpu 2. Disabling
>> TTWU_QUEUE means that we enqueue task 2 before placing task 1,
>> solving this issue. TTWU_QUEUE is there to minimise rq lock
>> contention, and I guess that this contention is less of an issue on
>> big.LITTLE systems since they have relatively few CPUs, which
>> suggests the trade-off makes sense here.
>>
>> Signed-off-by: Brendan Jackman <brendan.jackman@arm.com>
>> Cc: Ingo Molnar <mingo@redhat.com>
>> Cc: Peter Zijlstra <peterz@infradead.org>
>> Cc: Josef Bacik <josef@toxicpanda.com>
>> Cc: Joel Fernandes <joelaf@google.com>
>> Cc: Mike Galbraith <efault@gmx.de>
>> Cc: Matt Fleming <matt@codeblueprint.co.uk>
>> ---
>>  include/linux/sched/wake_q.h |  2 ++
>>  kernel/sched/core.c          | 34 +++++++++++++++++++++++-----------
>>  kernel/sched/deadline.c      |  3 ++-
>>  kernel/sched/fair.c          | 17 +++++++++++------
>>  kernel/sched/idle_task.c     |  3 ++-
>>  kernel/sched/rt.c            |  3 ++-
>>  kernel/sched/sched.h         |  3 ++-
>>  kernel/sched/stop_task.c     |  3 ++-
>>  8 files changed, 46 insertions(+), 22 deletions(-)
>>
>> diff --git a/include/linux/sched/wake_q.h b/include/linux/sched/wake_q.h
>> index d03d8a9047dc..607a888eb35b 100644
>> --- a/include/linux/sched/wake_q.h
>> +++ b/include/linux/sched/wake_q.h
>> @@ -33,6 +33,7 @@
>>  struct wake_q_head {
>>  	struct wake_q_node *first;
>>  	struct wake_q_node **lastp;
>> +	int count;
>>  };
>>
>>  #define WAKE_Q_TAIL ((struct wake_q_node *) 0x01)
>> @@ -44,6 +45,7 @@ static inline void wake_q_init(struct wake_q_head *head)
>>  {
>>  	head->first = WAKE_Q_TAIL;
>>  	head->lastp = &head->first;
>> +	head->count = 0;
>>  }
>>
>>  extern void wake_q_add(struct wake_q_head *head,
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 0869b20fba81..ddf9257b1467 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -438,6 +438,8 @@ void wake_q_add(struct wake_q_head *head, struct task_struct *task)
>>  	if (cmpxchg(&node->next, NULL, WAKE_Q_TAIL))
>>  		return;
>>
>> +	head->count++;
>> +
>>  	get_task_struct(task);
>>
>>  	/*
>> @@ -447,6 +449,10 @@ void wake_q_add(struct wake_q_head *head, struct task_struct *task)
>>  	head->lastp = &node->next;
>>  }
>>
>> +static int
>> +try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags,
>> +	       int sibling_count_hint);
>> +
>>  void wake_up_q(struct wake_q_head *head)
>>  {
>>  	struct wake_q_node *node = head->first;
>> @@ -461,10 +467,10 @@ void wake_up_q(struct wake_q_head *head)
>>  		task->wake_q.next = NULL;
>>
>>  		/*
>> -		 * wake_up_process() implies a wmb() to pair with the queueing
>> +		 * try_to_wake_up() implies a wmb() to pair with the queueing
>>  		 * in wake_q_add() so as not to miss wakeups.
>>  		 */
>> -		wake_up_process(task);
>> +		try_to_wake_up(task, TASK_NORMAL, 0, head->count);
>>  		put_task_struct(task);
>>  	}
>>  }
>> @@ -1527,12 +1533,14 @@ static int select_fallback_rq(int cpu, struct task_struct *p)
>>   * The caller (fork, wakeup) owns p->pi_lock, ->cpus_allowed is stable.
>>   */
>>  static inline
>> -int select_task_rq(struct task_struct *p, int cpu, int sd_flags, int wake_flags)
>> +int select_task_rq(struct task_struct *p, int cpu, int sd_flags, int wake_flags,
>> +		   int sibling_count_hint)
>>  {
>>  	lockdep_assert_held(&p->pi_lock);
>>
>>  	if (p->nr_cpus_allowed > 1)
>> -		cpu = p->sched_class->select_task_rq(p, cpu, sd_flags, wake_flags);
>> +		cpu = p->sched_class->select_task_rq(p, cpu, sd_flags, wake_flags,
>> +						     sibling_count_hint);
>>  	else
>>  		cpu = cpumask_any(&p->cpus_allowed);
>>
>> @@ -1944,6 +1952,8 @@ static void ttwu_queue(struct task_struct *p, int cpu, int wake_flags)
>>   * @p: the thread to be awakened
>>   * @state: the mask of task states that can be woken
>>   * @wake_flags: wake modifier flags (WF_*)
>> + * @sibling_count_hint: A hint at the number of threads that are being woken up
>> + *                      in this event.
>>   *
>>   * If (@state & @p->state) @p->state = TASK_RUNNING.
>>   *
>> @@ -1956,7 +1966,8 @@ static void ttwu_queue(struct task_struct *p, int cpu, int wake_flags)
>>   *	   %false otherwise.
>>   */
>>  static int
>> -try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
>> +try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags,
>> +	       int sibling_count_hint)
>>  {
>>  	unsigned long flags;
>>  	int cpu, success = 0;
>> @@ -2042,7 +2053,8 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
>>  		atomic_dec(&task_rq(p)->nr_iowait);
>>  	}
>>
>> -	cpu = select_task_rq(p, p->wake_cpu, SD_BALANCE_WAKE, wake_flags);
>> +	cpu = select_task_rq(p, p->wake_cpu, SD_BALANCE_WAKE, wake_flags,
>> +			     sibling_count_hint);
>>  	if (task_cpu(p) != cpu) {
>>  		wake_flags |= WF_MIGRATED;
>>  		set_task_cpu(p, cpu);
>> @@ -2130,13 +2142,13 @@ static void try_to_wake_up_local(struct task_struct *p, struct rq_flags *rf)
>>   */
>>  int wake_up_process(struct task_struct *p)
>>  {
>> -	return try_to_wake_up(p, TASK_NORMAL, 0);
>> +	return try_to_wake_up(p, TASK_NORMAL, 0, 1);
>>  }
>>  EXPORT_SYMBOL(wake_up_process);
>>
>>  int wake_up_state(struct task_struct *p, unsigned int state)
>>  {
>> -	return try_to_wake_up(p, state, 0);
>> +	return try_to_wake_up(p, state, 0, 1);
>>  }
>>
>>  /*
>> @@ -2442,7 +2454,7 @@ void wake_up_new_task(struct task_struct *p)
>>  	 * Use __set_task_cpu() to avoid calling sched_class::migrate_task_rq,
>>  	 * as we're not fully set-up yet.
>>  	 */
>> -	__set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0));
>> +	__set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0, 1));
>>  #endif
>>  	rq = __task_rq_lock(p, &rf);
>>  	update_rq_clock(rq);
>> @@ -2893,7 +2905,7 @@ void sched_exec(void)
>>  	int dest_cpu;
>>
>>  	raw_spin_lock_irqsave(&p->pi_lock, flags);
>> -	dest_cpu = p->sched_class->select_task_rq(p, task_cpu(p), SD_BALANCE_EXEC, 0);
>> +	dest_cpu = p->sched_class->select_task_rq(p, task_cpu(p), SD_BALANCE_EXEC, 0, 1);
>>  	if (dest_cpu == smp_processor_id())
>>  		goto unlock;
>>
>> @@ -3582,7 +3594,7 @@ asmlinkage __visible void __sched preempt_schedule_irq(void)
>>  int default_wake_function(wait_queue_entry_t *curr, unsigned mode, int wake_flags,
>>  			  void *key)
>>  {
>> -	return try_to_wake_up(curr->private, mode, wake_flags);
>> +	return try_to_wake_up(curr->private, mode, wake_flags, 1);
>>  }
>>  EXPORT_SYMBOL(default_wake_function);
>>
>> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
>> index 755bd3f1a1a9..69a9dd407267 100644
>> --- a/kernel/sched/deadline.c
>> +++ b/kernel/sched/deadline.c
>> @@ -1516,7 +1516,8 @@ static void yield_task_dl(struct rq *rq)
>>  static int find_later_rq(struct task_struct *task);
>>
>>  static int
>> -select_task_rq_dl(struct task_struct *p, int cpu, int sd_flag, int flags)
>> +select_task_rq_dl(struct task_struct *p, int cpu, int sd_flag, int flags,
>> +		  int sibling_count_hint)
>>  {
>>  	struct task_struct *curr;
>>  	struct rq *rq;
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index c95880e216f6..0a9d706b62bf 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -5332,15 +5332,18 @@ static void record_wakee(struct task_struct *p)
>>   * whatever is irrelevant, spread criteria is apparent partner count exceeds
>>   * socket size.
>>   */
>> -static int wake_wide(struct task_struct *p)
>> +static int wake_wide(struct task_struct *p, int sibling_count_hint)
>>  {
>>  	unsigned int master = current->wakee_flips;
>>  	unsigned int slave = p->wakee_flips;
>> -	int factor = this_cpu_read(sd_llc_size);
>> +	int llc_size = this_cpu_read(sd_llc_size);
>> +
>> +	if (sibling_count_hint >= llc_size)
>> +		return 1;
>>
>>  	if (master < slave)
>>  		swap(master, slave);
>> -	if (slave < factor || master < slave * factor)
>> +	if (slave < llc_size || master < slave * llc_size)
>>  		return 0;
>>  	return 1;
>>  }
>> @@ -5869,7 +5872,8 @@ static int wake_cap(struct task_struct *p, int cpu, int prev_cpu)
>>   * preempt must be disabled.
>>   */
>>  static int
>> -select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_flags)
>> +select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_flags,
>> +		    int sibling_count_hint)
>>  {
>>  	struct sched_domain *tmp, *affine_sd = NULL, *sd = NULL;
>>  	int cpu = smp_processor_id();
>> @@ -5879,8 +5883,9 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
>>
>>  	if (sd_flag & SD_BALANCE_WAKE) {
>>  		record_wakee(p);
>> -		want_affine = !wake_wide(p) && !wake_cap(p, cpu, prev_cpu)
>> -			      && cpumask_test_cpu(cpu, &p->cpus_allowed);
>> +		want_affine = !wake_wide(p, sibling_count_hint) &&
>> +			      !wake_cap(p, cpu, prev_cpu) &&
>> +			      cpumask_test_cpu(cpu, &p->cpus_allowed);
>>  	}
>>
>>  	rcu_read_lock();
>> diff --git a/kernel/sched/idle_task.c b/kernel/sched/idle_task.c
>> index 0c00172db63e..3c343e055110 100644
>> --- a/kernel/sched/idle_task.c
>> +++ b/kernel/sched/idle_task.c
>> @@ -9,7 +9,8 @@
>>
>>  #ifdef CONFIG_SMP
>>  static int
>> -select_task_rq_idle(struct task_struct *p, int cpu, int sd_flag, int flags)
>> +select_task_rq_idle(struct task_struct *p, int cpu, int sd_flag, int flags,
>> +		    int sibling_count_hint)
>>  {
>>  	return task_cpu(p); /* IDLE tasks as never migrated */
>>  }
>> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
>> index 45caf937ef90..b9937dccb8b3 100644
>> --- a/kernel/sched/rt.c
>> +++ b/kernel/sched/rt.c
>> @@ -1387,7 +1387,8 @@ static void yield_task_rt(struct rq *rq)
>>  static int find_lowest_rq(struct task_struct *task);
>>
>>  static int
>> -select_task_rq_rt(struct task_struct *p, int cpu, int sd_flag, int flags)
>> +select_task_rq_rt(struct task_struct *p, int cpu, int sd_flag, int flags,
>> +		  int sibling_count_hint)
>>  {
>>  	struct task_struct *curr;
>>  	struct rq *rq;
>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>> index eeef1a3086d1..56ae525618e9 100644
>> --- a/kernel/sched/sched.h
>> +++ b/kernel/sched/sched.h
>> @@ -1419,7 +1419,8 @@ struct sched_class {
>>  	void (*put_prev_task) (struct rq *rq, struct task_struct *p);
>>
>>  #ifdef CONFIG_SMP
>> -	int  (*select_task_rq)(struct task_struct *p, int task_cpu, int sd_flag, int flags);
>> +	int  (*select_task_rq)(struct task_struct *p, int task_cpu, int sd_flag, int flags,
>> +			       int subling_count_hint);
>>  	void (*migrate_task_rq)(struct task_struct *p);
>>
>>  	void (*task_woken) (struct rq *this_rq, struct task_struct *task);
>> diff --git a/kernel/sched/stop_task.c b/kernel/sched/stop_task.c
>> index 9f69fb630853..d0ce4fbb18ef 100644
>> --- a/kernel/sched/stop_task.c
>> +++ b/kernel/sched/stop_task.c
>> @@ -11,7 +11,8 @@
>>
>>  #ifdef CONFIG_SMP
>>  static int
>> -select_task_rq_stop(struct task_struct *p, int cpu, int sd_flag, int flags)
>> +select_task_rq_stop(struct task_struct *p, int cpu, int sd_flag, int flags,
>> +		    int sibling_count_hint)
>>  {
>>  	return task_cpu(p); /* stop tasks as never migrate */
>>  }

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2017-10-02 10:47 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2017-08-11  9:45 [RFC] sched/fair: Use wake_q length as a hint for wake_wide Brendan Jackman
2017-09-19  5:15 ` Joel Fernandes
2017-09-19 10:05   ` Brendan Jackman
2017-09-20  4:39     ` Joel Fernandes
2017-09-20  5:06       ` Joel Fernandes
2017-09-20  9:33         ` Brendan Jackman
2017-09-20 20:23           ` Joel Fernandes
2017-09-20 21:17             ` Atish Patra
2017-09-21  5:50               ` Joel Fernandes
2017-09-21 16:51                 ` Atish Patra
2017-09-20  9:41   ` Brendan Jackman
2017-09-20 19:36 ` Atish Patra
2017-09-29 17:05 ` Brendan Jackman
2017-10-02 10:46   ` Brendan Jackman

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox