Re: [RFC PATCH 0/2] sched: simplify the select_task_rq_fair()

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Michael Wang <wangyun@linux.vnet.ibm.com>
To: Mike Galbraith <bitbucket@online.de>
Cc: linux-kernel@vger.kernel.org, mingo@redhat.com,
	peterz@infradead.org, mingo@kernel.org, a.p.zijlstra@chello.nl
Subject: Re: [RFC PATCH 0/2] sched: simplify the select_task_rq_fair()
Date: Tue, 22 Jan 2013 11:43:24 +0800	[thread overview]
Message-ID: <50FE0ADC.6060701@linux.vnet.ibm.com> (raw)
In-Reply-To: <1358761496.4994.118.camel@marge.simpson.net>

On 01/21/2013 05:44 PM, Mike Galbraith wrote:
> On Mon, 2013-01-21 at 17:22 +0800, Michael Wang wrote: 
>> On 01/21/2013 05:09 PM, Mike Galbraith wrote:
>>> On Mon, 2013-01-21 at 15:45 +0800, Michael Wang wrote: 
>>>> On 01/21/2013 03:09 PM, Mike Galbraith wrote:
>>>>> On Mon, 2013-01-21 at 07:42 +0100, Mike Galbraith wrote: 
>>>>>> On Mon, 2013-01-21 at 13:07 +0800, Michael Wang wrote:
>>>>>
>>>>>>> May be we could try change this back to the old way later, after the aim
>>>>>>> 7 test on my server.
>>>>>>
>>>>>> Yeah, something funny is going on.
>>>>>
>>>>> Never entering balance path kills the collapse.  Asking wake_affine()
>>>>> wrt the pull as before, but allowing us to continue should no idle cpu
>>>>> be found, still collapsed.  So the source of funny behavior is indeed in
>>>>> balance_path.
>>>>
>>>> Below patch based on the patch set could help to avoid enter balance path
>>>> if affine_sd could be found, just like the old logical, would you like to
>>>> take a try and see whether it could help fix the collapse?
>>>
>>> No, it does not.
>>
>> Hmm...what have changed now compared to the old logical?
> 
> What I did earlier to confirm the collapse originates in balance_path is
> below.  I just retested to confirm.
> 
> Tasks    jobs/min  jti  jobs/min/task      real       cpu
>     1      435.34  100       435.3448     13.92      3.76   Mon Jan 21 10:24:00 2013
>     1      440.09  100       440.0871     13.77      3.76   Mon Jan 21 10:24:22 2013
>     1      440.41  100       440.4070     13.76      3.75   Mon Jan 21 10:24:45 2013
>     5     2467.43   99       493.4853     12.28     10.71   Mon Jan 21 10:24:59 2013
>     5     2445.52   99       489.1041     12.39     10.98   Mon Jan 21 10:25:14 2013
>     5     2475.49   99       495.0980     12.24     10.59   Mon Jan 21 10:25:27 2013
>    10     4963.14   99       496.3145     12.21     20.64   Mon Jan 21 10:25:41 2013
>    10     4959.08   99       495.9083     12.22     21.26   Mon Jan 21 10:25:54 2013
>    10     5415.55   99       541.5550     11.19     11.54   Mon Jan 21 10:26:06 2013
>    20     9934.43   96       496.7213     12.20     33.52   Mon Jan 21 10:26:18 2013
>    20     9950.74   98       497.5369     12.18     36.52   Mon Jan 21 10:26:31 2013
>    20     9893.88   96       494.6939     12.25     34.39   Mon Jan 21 10:26:43 2013
>    40    18937.50   98       473.4375     12.80     84.74   Mon Jan 21 10:26:56 2013
>    40    18996.87   98       474.9216     12.76     88.64   Mon Jan 21 10:27:09 2013
>    40    19146.92   98       478.6730     12.66     89.98   Mon Jan 21 10:27:22 2013
>    80    37610.55   98       470.1319     12.89    112.01   Mon Jan 21 10:27:35 2013
>    80    37321.02   98       466.5127     12.99    114.21   Mon Jan 21 10:27:48 2013
>    80    37610.55   98       470.1319     12.89    111.77   Mon Jan 21 10:28:01 2013
>   160    69109.05   98       431.9316     14.03    156.81   Mon Jan 21 10:28:15 2013
>   160    69505.38   98       434.4086     13.95    155.33   Mon Jan 21 10:28:29 2013
>   160    69207.71   98       432.5482     14.01    155.79   Mon Jan 21 10:28:43 2013
>   320   108033.43   98       337.6045     17.95    314.01   Mon Jan 21 10:29:01 2013
>   320   108577.83   98       339.3057     17.86    311.79   Mon Jan 21 10:29:19 2013
>   320   108395.75   98       338.7367     17.89    312.55   Mon Jan 21 10:29:37 2013
>   640   151440.84   98       236.6263     25.61    620.37   Mon Jan 21 10:30:03 2013
>   640   151440.84   97       236.6263     25.61    621.23   Mon Jan 21 10:30:29 2013
>   640   151145.75   98       236.1652     25.66    622.35   Mon Jan 21 10:30:55 2013
>  1280   190117.65   98       148.5294     40.80   1228.40   Mon Jan 21 10:31:36 2013
>  1280   189977.96   98       148.4203     40.83   1229.91   Mon Jan 21 10:32:17 2013
>  1280   189560.12   98       148.0938     40.92   1231.71   Mon Jan 21 10:32:58 2013
>  2560   217857.04   98        85.1004     71.21   2441.61   Mon Jan 21 10:34:09 2013
>  2560   217338.19   98        84.8977     71.38   2448.76   Mon Jan 21 10:35:21 2013
>  2560   217795.87   97        85.0765     71.23   2443.12   Mon Jan 21 10:36:32 2013
> 
> That was with your change backed out, and the q/d below applied.

So that change will help to solve the issue? good to know :)

But it will invoke wake_affine() with out any delay, the benefit
of the patch set will be reduced a lot...

I think this change help to solve the issue because it avoid jump
into balance path when wakeup for any cases, I think we can do
some change like below to achieve this and meanwhile gain benefit
from delay wake_affine().

Since the issue could not been reproduced on my side, I don't know
whether the patch benefit or not, so if you are willing to send out
a formal patch, I would be glad to include it in my patch set ;-)

And another patch below below is a debug one, which will print out
all the sbm info, so we could check whether it was initialized
correctly, just use command "dmesg | grep WYT" to show the map.

Regards,
Michael Wang

---
 kernel/sched/fair.c |   42 +++++++++++++++++++++++++-----------------
 1 files changed, 25 insertions(+), 17 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2aa26c1..4361333 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3250,7 +3250,7 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
 }

 /*
- * Try and locate an idle CPU in the sched_domain.
+ * Try and locate an idle CPU in the sched_domain, return -1 if failed.
  */
 static int select_idle_sibling(struct task_struct *p, int target)
 {
@@ -3292,13 +3292,13 @@ static int select_idle_sibling(struct task_struct *p, int target)

                        target = cpumask_first_and(sched_group_cpus(sg),
                                        tsk_cpus_allowed(p));
-                       goto done;
+                       return target;
 next:
                        sg = sg->next;
                } while (sg != sd->groups);
        }
-done:
-       return target;
+
+       return -1;
 }

 /*
@@ -3342,40 +3342,48 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
                 * may has already been cached on prev_cpu, and usually
                 * they require low latency.
                 *
-                * So firstly try to locate an idle cpu shared the cache
+                * Therefor, balance path in such case will cause damage
+                * and bring benefit synchronously, wakeup on prev_cpu
+                * may better than wakeup on a new lower load cpu for the
+                * cached memory, and we never know.
+                *
+                * So the principle is, try to find an idle cpu as close to
+                * prev_cpu as possible, if failed, just take prev_cpu.
+                *
+                * Firstly try to locate an idle cpu shared the cache
                 * with prev_cpu, it has the chance to break the load
                 * balance, fortunately, select_idle_sibling() will search
                 * from top to bottom, which help to reduce the chance in
                 * some cases.
                 */
                new_cpu = select_idle_sibling(p, prev_cpu);
-               if (idle_cpu(new_cpu))
+               if (new_cpu != -1)
                        goto unlock;

                /*
                 * No idle cpu could be found in the topology of prev_cpu,
-                * before jump into the slow balance_path, try search again
-                * in the topology of current cpu if it is the affine of
-                * prev_cpu.
+                * before take the prev_cpu, try search again in the
+                * topology of current cpu if it is the affine of prev_cpu.
                 */
-               if (cpu == prev_cpu ||
-                               !sbm->affine_map[prev_cpu] ||
+               if (cpu == prev_cpu || !sbm->affine_map[prev_cpu] ||
                                !cpumask_test_cpu(cpu, tsk_cpus_allowed(p)))
-                       goto balance_path;
+                       goto take_prev;

                new_cpu = select_idle_sibling(p, cpu);
-               if (!idle_cpu(new_cpu))
-                       goto balance_path;
-
                /*
                 * Invoke wake_affine() finally since it is no doubt a
                 * performance killer.
                 */
-               if (wake_affine(sbm->affine_map[prev_cpu], p, sync))
+               if ((new_cpu != -1) &&
+                       wake_affine(sbm->affine_map[prev_cpu], p, sync))
                        goto unlock;
+
+take_prev:
+               new_cpu = prev_cpu;
+               goto unlock;
        }

-balance_path:
+       /* Balance path. */
        new_cpu = (sd_flag & SD_BALANCE_WAKE) ? prev_cpu : cpu;
        sd = sbm->sd[type][sbm->top_level[type]];

-- 
1.7.4.1

DEBUG PATCH:

---
 kernel/sched/core.c |   30 ++++++++++++++++++++++++++++++
 1 files changed, 30 insertions(+), 0 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0c63303..f251f29 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5578,6 +5578,35 @@ static void update_top_cache_domain(int cpu)
 static int sbm_max_level;
 DEFINE_PER_CPU_SHARED_ALIGNED(struct sched_balance_map, sbm_array);

+static void debug_sched_balance_map(int cpu)
+{
+       int i, type, level = 0;
+       struct sched_balance_map *sbm = &per_cpu(sbm_array, cpu);
+
+       printk("WYT: sbm of cpu %d\n", cpu);
+
+       for (type = 0; type < SBM_MAX_TYPE; type++) {
+               if (type == SBM_EXEC_TYPE)
+                       printk("WYT: \t exec map\n");
+               else if (type == SBM_FORK_TYPE)
+                       printk("WYT: \t fork map\n");
+               else if (type == SBM_WAKE_TYPE)
+                       printk("WYT: \t wake map\n");
+
+               for (level = 0; level < sbm_max_level; level++) {
+                       if (sbm->sd[type][level])
+                               printk("WYT: \t\t sd %x, idx %d, level %d, weight %d\n", sbm->sd[type][level], level, sbm->sd[type][level]->level, sbm->sd[type][level]->span_weight);
+               }
+       }
+
+       printk("WYT: \t affine map\n");
+
+       for_each_possible_cpu(i) {
+               if (sbm->affine_map[i])
+                       printk("WYT: \t\t affine with cpu %x in sd %x, weight %d\n", i, sbm->affine_map[i], sbm->affine_map[i]->span_weight);
+       }
+}
+
 static void build_sched_balance_map(int cpu)
 {
        struct sched_balance_map *sbm = &per_cpu(sbm_array, cpu);
@@ -5688,6 +5717,7 @@ cpu_attach_domain(struct sched_domain *sd, struct root_domain *rd, int cpu)
         * destroy_sched_domains() already do the work.
         */
        build_sched_balance_map(cpu);
+       debug_sched_balance_map(cpu);
        rcu_assign_pointer(rq->sbm, sbm);
 }

-- 
1.7.4.1

> 
> ---
>  kernel/sched/fair.c |   27 ++++++---------------------
>  1 file changed, 6 insertions(+), 21 deletions(-)
> 
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3337,6 +3337,8 @@ select_task_rq_fair(struct task_struct *
>  		goto unlock;
> 
>  	if (sd_flag & SD_BALANCE_WAKE) {
> +		new_cpu = prev_cpu;
> +
>  		/*
>  		 * Tasks to be waked is special, memory it relied on
>  		 * may has already been cached on prev_cpu, and usually
> @@ -3348,33 +3350,16 @@ select_task_rq_fair(struct task_struct *
>  		 * from top to bottom, which help to reduce the chance in
>  		 * some cases.
>  		 */
> -		new_cpu = select_idle_sibling(p, prev_cpu);
> +		new_cpu = select_idle_sibling(p, new_cpu);
>  		if (idle_cpu(new_cpu))
>  			goto unlock;
> 
> -		/*
> -		 * No idle cpu could be found in the topology of prev_cpu,
> -		 * before jump into the slow balance_path, try search again
> -		 * in the topology of current cpu if it is the affine of
> -		 * prev_cpu.
> -		 */
> -		if (!sbm->affine_map[prev_cpu] ||
> -				!cpumask_test_cpu(cpu, tsk_cpus_allowed(p)))
> -			goto balance_path;
> -
> -		new_cpu = select_idle_sibling(p, cpu);
> -		if (!idle_cpu(new_cpu))
> -			goto balance_path;
> +		if (wake_affine(sbm->affine_map[cpu], p, sync))
> +			new_cpu = select_idle_sibling(p, cpu);
> 
> -		/*
> -		 * Invoke wake_affine() finally since it is no doubt a
> -		 * performance killer.
> -		 */
> -		if (wake_affine(sbm->affine_map[prev_cpu], p, sync))
> -			goto unlock;
> +		goto unlock;
>  	}
> 
> -balance_path:
>  	new_cpu = (sd_flag & SD_BALANCE_WAKE) ? prev_cpu : cpu;
>  	sd = sbm->sd[type][sbm->top_level[type]];
> 
> 
> 
> 
>

next prev parent reply	other threads:[~2013-01-22  3:43 UTC|newest]

Thread overview: 57+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <1356588535-23251-1-git-send-email-wangyun@linux.vnet.ibm.com>
2013-01-09  9:28 ` [RFC PATCH 0/2] sched: simplify the select_task_rq_fair() Michael Wang
2013-01-12  8:01   ` Mike Galbraith
2013-01-12 10:19     ` Mike Galbraith
2013-01-14  9:21       ` Mike Galbraith
2013-01-15  3:10         ` Michael Wang
2013-01-15  4:52           ` Mike Galbraith
2013-01-15  8:26             ` Michael Wang
2013-01-17  5:55         ` Michael Wang
2013-01-20  4:09           ` Mike Galbraith
2013-01-21  2:50             ` Michael Wang
2013-01-21  4:38               ` Mike Galbraith
2013-01-21  5:07                 ` Michael Wang
2013-01-21  6:42                   ` Mike Galbraith
2013-01-21  7:09                     ` Mike Galbraith
2013-01-21  7:45                       ` Michael Wang
2013-01-21  9:09                         ` Mike Galbraith
2013-01-21  9:22                           ` Michael Wang
2013-01-21  9:44                             ` Mike Galbraith
2013-01-21 10:30                               ` Mike Galbraith
2013-01-22  3:43                               ` Michael Wang [this message]
2013-01-22  8:03                                 ` Mike Galbraith
2013-01-22  8:56                                   ` Michael Wang
2013-01-22 11:34                                     ` Mike Galbraith
2013-01-23  3:01                                       ` Michael Wang
2013-01-23  5:02                                         ` Mike Galbraith
2013-01-22 14:41                                     ` Mike Galbraith
2013-01-23  2:44                                       ` Michael Wang
2013-01-23  4:31                                         ` Mike Galbraith
2013-01-23  5:09                                           ` Michael Wang
2013-01-23  6:28                                             ` Mike Galbraith
2013-01-23  7:10                                               ` Michael Wang
2013-01-23  8:20                                                 ` Mike Galbraith
2013-01-23  8:30                                                   ` Michael Wang
2013-01-23  8:49                                                     ` Mike Galbraith
2013-01-23  9:00                                                       ` Michael Wang
2013-01-23  9:18                                                         ` Mike Galbraith
2013-01-23  9:26                                                           ` Michael Wang
2013-01-23  9:37                                                             ` Mike Galbraith
2013-01-23  9:32                                                           ` Mike Galbraith
2013-01-24  6:01                                                             ` Michael Wang
2013-01-24  6:51                                                               ` Mike Galbraith
2013-01-24  7:15                                                                 ` Michael Wang
2013-01-24  7:47                                                                   ` Mike Galbraith
2013-01-24  8:14                                                                     ` Michael Wang
2013-01-24  9:07                                                                       ` Mike Galbraith
2013-01-24  9:26                                                                         ` Michael Wang
2013-01-24 10:34                                                                           ` Mike Galbraith
2013-01-25  2:14                                                                             ` Michael Wang
2013-01-24  7:00                                                               ` Michael Wang
2013-01-21  7:34                     ` Michael Wang
2013-01-21  8:26                       ` Mike Galbraith
2013-01-21  8:46                         ` Michael Wang
2013-01-21  9:11                           ` Mike Galbraith
2013-01-15  2:46     ` Michael Wang
2013-01-11  8:15 Michael Wang
2013-01-11 10:13 ` Nikunj A Dadhania
2013-01-15  2:20   ` Michael Wang

find likely ancestor, descendant, or conflicting patches for this message:
( dfblob:2aa26c1 dfblob:4361333 dfblob:0c63303 dfblob:f251f29 )
 OR (
bs:"Re: [RFC PATCH 0/2] sched: simplify the select_task_rq_fair()" )
	(help)

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=50FE0ADC.6060701@linux.vnet.ibm.com \
    --to=wangyun@linux.vnet.ibm.com \
    --cc=a.p.zijlstra@chello.nl \
    --cc=bitbucket@online.de \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mingo@kernel.org \
    --cc=mingo@redhat.com \
    --cc=peterz@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.