From: Michael Wang <wangyun@linux.vnet.ibm.com>
To: Mike Galbraith <bitbucket@online.de>
Cc: linux-kernel@vger.kernel.org, mingo@redhat.com,
peterz@infradead.org, mingo@kernel.org, a.p.zijlstra@chello.nl
Subject: Re: [RFC PATCH 0/2] sched: simplify the select_task_rq_fair()
Date: Tue, 22 Jan 2013 11:43:24 +0800 [thread overview]
Message-ID: <50FE0ADC.6060701@linux.vnet.ibm.com> (raw)
In-Reply-To: <1358761496.4994.118.camel@marge.simpson.net>
On 01/21/2013 05:44 PM, Mike Galbraith wrote:
> On Mon, 2013-01-21 at 17:22 +0800, Michael Wang wrote:
>> On 01/21/2013 05:09 PM, Mike Galbraith wrote:
>>> On Mon, 2013-01-21 at 15:45 +0800, Michael Wang wrote:
>>>> On 01/21/2013 03:09 PM, Mike Galbraith wrote:
>>>>> On Mon, 2013-01-21 at 07:42 +0100, Mike Galbraith wrote:
>>>>>> On Mon, 2013-01-21 at 13:07 +0800, Michael Wang wrote:
>>>>>
>>>>>>> May be we could try change this back to the old way later, after the aim
>>>>>>> 7 test on my server.
>>>>>>
>>>>>> Yeah, something funny is going on.
>>>>>
>>>>> Never entering balance path kills the collapse. Asking wake_affine()
>>>>> wrt the pull as before, but allowing us to continue should no idle cpu
>>>>> be found, still collapsed. So the source of funny behavior is indeed in
>>>>> balance_path.
>>>>
>>>> Below patch based on the patch set could help to avoid enter balance path
>>>> if affine_sd could be found, just like the old logical, would you like to
>>>> take a try and see whether it could help fix the collapse?
>>>
>>> No, it does not.
>>
>> Hmm...what have changed now compared to the old logical?
>
> What I did earlier to confirm the collapse originates in balance_path is
> below. I just retested to confirm.
>
> Tasks jobs/min jti jobs/min/task real cpu
> 1 435.34 100 435.3448 13.92 3.76 Mon Jan 21 10:24:00 2013
> 1 440.09 100 440.0871 13.77 3.76 Mon Jan 21 10:24:22 2013
> 1 440.41 100 440.4070 13.76 3.75 Mon Jan 21 10:24:45 2013
> 5 2467.43 99 493.4853 12.28 10.71 Mon Jan 21 10:24:59 2013
> 5 2445.52 99 489.1041 12.39 10.98 Mon Jan 21 10:25:14 2013
> 5 2475.49 99 495.0980 12.24 10.59 Mon Jan 21 10:25:27 2013
> 10 4963.14 99 496.3145 12.21 20.64 Mon Jan 21 10:25:41 2013
> 10 4959.08 99 495.9083 12.22 21.26 Mon Jan 21 10:25:54 2013
> 10 5415.55 99 541.5550 11.19 11.54 Mon Jan 21 10:26:06 2013
> 20 9934.43 96 496.7213 12.20 33.52 Mon Jan 21 10:26:18 2013
> 20 9950.74 98 497.5369 12.18 36.52 Mon Jan 21 10:26:31 2013
> 20 9893.88 96 494.6939 12.25 34.39 Mon Jan 21 10:26:43 2013
> 40 18937.50 98 473.4375 12.80 84.74 Mon Jan 21 10:26:56 2013
> 40 18996.87 98 474.9216 12.76 88.64 Mon Jan 21 10:27:09 2013
> 40 19146.92 98 478.6730 12.66 89.98 Mon Jan 21 10:27:22 2013
> 80 37610.55 98 470.1319 12.89 112.01 Mon Jan 21 10:27:35 2013
> 80 37321.02 98 466.5127 12.99 114.21 Mon Jan 21 10:27:48 2013
> 80 37610.55 98 470.1319 12.89 111.77 Mon Jan 21 10:28:01 2013
> 160 69109.05 98 431.9316 14.03 156.81 Mon Jan 21 10:28:15 2013
> 160 69505.38 98 434.4086 13.95 155.33 Mon Jan 21 10:28:29 2013
> 160 69207.71 98 432.5482 14.01 155.79 Mon Jan 21 10:28:43 2013
> 320 108033.43 98 337.6045 17.95 314.01 Mon Jan 21 10:29:01 2013
> 320 108577.83 98 339.3057 17.86 311.79 Mon Jan 21 10:29:19 2013
> 320 108395.75 98 338.7367 17.89 312.55 Mon Jan 21 10:29:37 2013
> 640 151440.84 98 236.6263 25.61 620.37 Mon Jan 21 10:30:03 2013
> 640 151440.84 97 236.6263 25.61 621.23 Mon Jan 21 10:30:29 2013
> 640 151145.75 98 236.1652 25.66 622.35 Mon Jan 21 10:30:55 2013
> 1280 190117.65 98 148.5294 40.80 1228.40 Mon Jan 21 10:31:36 2013
> 1280 189977.96 98 148.4203 40.83 1229.91 Mon Jan 21 10:32:17 2013
> 1280 189560.12 98 148.0938 40.92 1231.71 Mon Jan 21 10:32:58 2013
> 2560 217857.04 98 85.1004 71.21 2441.61 Mon Jan 21 10:34:09 2013
> 2560 217338.19 98 84.8977 71.38 2448.76 Mon Jan 21 10:35:21 2013
> 2560 217795.87 97 85.0765 71.23 2443.12 Mon Jan 21 10:36:32 2013
>
> That was with your change backed out, and the q/d below applied.
So that change will help to solve the issue? good to know :)
But it will invoke wake_affine() with out any delay, the benefit
of the patch set will be reduced a lot...
I think this change help to solve the issue because it avoid jump
into balance path when wakeup for any cases, I think we can do
some change like below to achieve this and meanwhile gain benefit
from delay wake_affine().
Since the issue could not been reproduced on my side, I don't know
whether the patch benefit or not, so if you are willing to send out
a formal patch, I would be glad to include it in my patch set ;-)
And another patch below below is a debug one, which will print out
all the sbm info, so we could check whether it was initialized
correctly, just use command "dmesg | grep WYT" to show the map.
Regards,
Michael Wang
---
kernel/sched/fair.c | 42 +++++++++++++++++++++++++-----------------
1 files changed, 25 insertions(+), 17 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2aa26c1..4361333 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3250,7 +3250,7 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
}
/*
- * Try and locate an idle CPU in the sched_domain.
+ * Try and locate an idle CPU in the sched_domain, return -1 if failed.
*/
static int select_idle_sibling(struct task_struct *p, int target)
{
@@ -3292,13 +3292,13 @@ static int select_idle_sibling(struct task_struct *p, int target)
target = cpumask_first_and(sched_group_cpus(sg),
tsk_cpus_allowed(p));
- goto done;
+ return target;
next:
sg = sg->next;
} while (sg != sd->groups);
}
-done:
- return target;
+
+ return -1;
}
/*
@@ -3342,40 +3342,48 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
* may has already been cached on prev_cpu, and usually
* they require low latency.
*
- * So firstly try to locate an idle cpu shared the cache
+ * Therefor, balance path in such case will cause damage
+ * and bring benefit synchronously, wakeup on prev_cpu
+ * may better than wakeup on a new lower load cpu for the
+ * cached memory, and we never know.
+ *
+ * So the principle is, try to find an idle cpu as close to
+ * prev_cpu as possible, if failed, just take prev_cpu.
+ *
+ * Firstly try to locate an idle cpu shared the cache
* with prev_cpu, it has the chance to break the load
* balance, fortunately, select_idle_sibling() will search
* from top to bottom, which help to reduce the chance in
* some cases.
*/
new_cpu = select_idle_sibling(p, prev_cpu);
- if (idle_cpu(new_cpu))
+ if (new_cpu != -1)
goto unlock;
/*
* No idle cpu could be found in the topology of prev_cpu,
- * before jump into the slow balance_path, try search again
- * in the topology of current cpu if it is the affine of
- * prev_cpu.
+ * before take the prev_cpu, try search again in the
+ * topology of current cpu if it is the affine of prev_cpu.
*/
- if (cpu == prev_cpu ||
- !sbm->affine_map[prev_cpu] ||
+ if (cpu == prev_cpu || !sbm->affine_map[prev_cpu] ||
!cpumask_test_cpu(cpu, tsk_cpus_allowed(p)))
- goto balance_path;
+ goto take_prev;
new_cpu = select_idle_sibling(p, cpu);
- if (!idle_cpu(new_cpu))
- goto balance_path;
-
/*
* Invoke wake_affine() finally since it is no doubt a
* performance killer.
*/
- if (wake_affine(sbm->affine_map[prev_cpu], p, sync))
+ if ((new_cpu != -1) &&
+ wake_affine(sbm->affine_map[prev_cpu], p, sync))
goto unlock;
+
+take_prev:
+ new_cpu = prev_cpu;
+ goto unlock;
}
-balance_path:
+ /* Balance path. */
new_cpu = (sd_flag & SD_BALANCE_WAKE) ? prev_cpu : cpu;
sd = sbm->sd[type][sbm->top_level[type]];
--
1.7.4.1
DEBUG PATCH:
---
kernel/sched/core.c | 30 ++++++++++++++++++++++++++++++
1 files changed, 30 insertions(+), 0 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0c63303..f251f29 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5578,6 +5578,35 @@ static void update_top_cache_domain(int cpu)
static int sbm_max_level;
DEFINE_PER_CPU_SHARED_ALIGNED(struct sched_balance_map, sbm_array);
+static void debug_sched_balance_map(int cpu)
+{
+ int i, type, level = 0;
+ struct sched_balance_map *sbm = &per_cpu(sbm_array, cpu);
+
+ printk("WYT: sbm of cpu %d\n", cpu);
+
+ for (type = 0; type < SBM_MAX_TYPE; type++) {
+ if (type == SBM_EXEC_TYPE)
+ printk("WYT: \t exec map\n");
+ else if (type == SBM_FORK_TYPE)
+ printk("WYT: \t fork map\n");
+ else if (type == SBM_WAKE_TYPE)
+ printk("WYT: \t wake map\n");
+
+ for (level = 0; level < sbm_max_level; level++) {
+ if (sbm->sd[type][level])
+ printk("WYT: \t\t sd %x, idx %d, level %d, weight %d\n", sbm->sd[type][level], level, sbm->sd[type][level]->level, sbm->sd[type][level]->span_weight);
+ }
+ }
+
+ printk("WYT: \t affine map\n");
+
+ for_each_possible_cpu(i) {
+ if (sbm->affine_map[i])
+ printk("WYT: \t\t affine with cpu %x in sd %x, weight %d\n", i, sbm->affine_map[i], sbm->affine_map[i]->span_weight);
+ }
+}
+
static void build_sched_balance_map(int cpu)
{
struct sched_balance_map *sbm = &per_cpu(sbm_array, cpu);
@@ -5688,6 +5717,7 @@ cpu_attach_domain(struct sched_domain *sd, struct root_domain *rd, int cpu)
* destroy_sched_domains() already do the work.
*/
build_sched_balance_map(cpu);
+ debug_sched_balance_map(cpu);
rcu_assign_pointer(rq->sbm, sbm);
}
--
1.7.4.1
>
> ---
> kernel/sched/fair.c | 27 ++++++---------------------
> 1 file changed, 6 insertions(+), 21 deletions(-)
>
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3337,6 +3337,8 @@ select_task_rq_fair(struct task_struct *
> goto unlock;
>
> if (sd_flag & SD_BALANCE_WAKE) {
> + new_cpu = prev_cpu;
> +
> /*
> * Tasks to be waked is special, memory it relied on
> * may has already been cached on prev_cpu, and usually
> @@ -3348,33 +3350,16 @@ select_task_rq_fair(struct task_struct *
> * from top to bottom, which help to reduce the chance in
> * some cases.
> */
> - new_cpu = select_idle_sibling(p, prev_cpu);
> + new_cpu = select_idle_sibling(p, new_cpu);
> if (idle_cpu(new_cpu))
> goto unlock;
>
> - /*
> - * No idle cpu could be found in the topology of prev_cpu,
> - * before jump into the slow balance_path, try search again
> - * in the topology of current cpu if it is the affine of
> - * prev_cpu.
> - */
> - if (!sbm->affine_map[prev_cpu] ||
> - !cpumask_test_cpu(cpu, tsk_cpus_allowed(p)))
> - goto balance_path;
> -
> - new_cpu = select_idle_sibling(p, cpu);
> - if (!idle_cpu(new_cpu))
> - goto balance_path;
> + if (wake_affine(sbm->affine_map[cpu], p, sync))
> + new_cpu = select_idle_sibling(p, cpu);
>
> - /*
> - * Invoke wake_affine() finally since it is no doubt a
> - * performance killer.
> - */
> - if (wake_affine(sbm->affine_map[prev_cpu], p, sync))
> - goto unlock;
> + goto unlock;
> }
>
> -balance_path:
> new_cpu = (sd_flag & SD_BALANCE_WAKE) ? prev_cpu : cpu;
> sd = sbm->sd[type][sbm->top_level[type]];
>
>
>
>
>
next prev parent reply other threads:[~2013-01-22 3:43 UTC|newest]
Thread overview: 57+ messages / expand[flat|nested] mbox.gz Atom feed top
[not found] <1356588535-23251-1-git-send-email-wangyun@linux.vnet.ibm.com>
2013-01-09 9:28 ` [RFC PATCH 0/2] sched: simplify the select_task_rq_fair() Michael Wang
2013-01-12 8:01 ` Mike Galbraith
2013-01-12 10:19 ` Mike Galbraith
2013-01-14 9:21 ` Mike Galbraith
2013-01-15 3:10 ` Michael Wang
2013-01-15 4:52 ` Mike Galbraith
2013-01-15 8:26 ` Michael Wang
2013-01-17 5:55 ` Michael Wang
2013-01-20 4:09 ` Mike Galbraith
2013-01-21 2:50 ` Michael Wang
2013-01-21 4:38 ` Mike Galbraith
2013-01-21 5:07 ` Michael Wang
2013-01-21 6:42 ` Mike Galbraith
2013-01-21 7:09 ` Mike Galbraith
2013-01-21 7:45 ` Michael Wang
2013-01-21 9:09 ` Mike Galbraith
2013-01-21 9:22 ` Michael Wang
2013-01-21 9:44 ` Mike Galbraith
2013-01-21 10:30 ` Mike Galbraith
2013-01-22 3:43 ` Michael Wang [this message]
2013-01-22 8:03 ` Mike Galbraith
2013-01-22 8:56 ` Michael Wang
2013-01-22 11:34 ` Mike Galbraith
2013-01-23 3:01 ` Michael Wang
2013-01-23 5:02 ` Mike Galbraith
2013-01-22 14:41 ` Mike Galbraith
2013-01-23 2:44 ` Michael Wang
2013-01-23 4:31 ` Mike Galbraith
2013-01-23 5:09 ` Michael Wang
2013-01-23 6:28 ` Mike Galbraith
2013-01-23 7:10 ` Michael Wang
2013-01-23 8:20 ` Mike Galbraith
2013-01-23 8:30 ` Michael Wang
2013-01-23 8:49 ` Mike Galbraith
2013-01-23 9:00 ` Michael Wang
2013-01-23 9:18 ` Mike Galbraith
2013-01-23 9:26 ` Michael Wang
2013-01-23 9:37 ` Mike Galbraith
2013-01-23 9:32 ` Mike Galbraith
2013-01-24 6:01 ` Michael Wang
2013-01-24 6:51 ` Mike Galbraith
2013-01-24 7:15 ` Michael Wang
2013-01-24 7:47 ` Mike Galbraith
2013-01-24 8:14 ` Michael Wang
2013-01-24 9:07 ` Mike Galbraith
2013-01-24 9:26 ` Michael Wang
2013-01-24 10:34 ` Mike Galbraith
2013-01-25 2:14 ` Michael Wang
2013-01-24 7:00 ` Michael Wang
2013-01-21 7:34 ` Michael Wang
2013-01-21 8:26 ` Mike Galbraith
2013-01-21 8:46 ` Michael Wang
2013-01-21 9:11 ` Mike Galbraith
2013-01-15 2:46 ` Michael Wang
2013-01-11 8:15 Michael Wang
2013-01-11 10:13 ` Nikunj A Dadhania
2013-01-15 2:20 ` Michael Wang
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=50FE0ADC.6060701@linux.vnet.ibm.com \
--to=wangyun@linux.vnet.ibm.com \
--cc=a.p.zijlstra@chello.nl \
--cc=bitbucket@online.de \
--cc=linux-kernel@vger.kernel.org \
--cc=mingo@kernel.org \
--cc=mingo@redhat.com \
--cc=peterz@infradead.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.