From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1752903AbcFTMYd (ORCPT ); Mon, 20 Jun 2016 08:24:33 -0400
Received: from mx1.redhat.com ([209.132.183.28]:47684 "EHLO mx1.redhat.com"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1753151AbcFTMYb (ORCPT ); Mon, 20 Jun 2016 08:24:31 -0400
From: Jiri Olsa
To: Ingo Molnar, Peter Zijlstra
Cc: lkml, James Hartsock, Rik van Riel, Srivatsa Vaddagiri, Kirill Tkhai
Subject: [PATCH 3/4] sched/fair: Add REBALANCE_AFFINITY rebalancing code
Date: Mon, 20 Jun 2016 14:15:13 +0200
Message-Id: <1466424914-8981-4-git-send-email-jolsa@kernel.org>
In-Reply-To: <1466424914-8981-1-git-send-email-jolsa@kernel.org>
References: <1466424914-8981-1-git-send-email-jolsa@kernel.org>
X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.5.16
	(mx1.redhat.com [10.5.110.26]); Mon, 20 Jun 2016 12:15:27 +0000 (UTC)
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

Add a rebalance_affinity() function that places tasks based on
their cpus_allowed mask, using the following logic.

Current load balancing places tasks on runqueues based on their
weight to achieve balance within sched domains.

Sched domains are defined at the start and can't be changed at
runtime. If the user defines workload affinity settings that do
not line up with the sched domains, the workload can end up
unbalanced within its affinity group. For example:

Say we have the following sched domains:
  domain 0: (pairs)
  domain 1: 0-5,12-17 (group1)  6-11,18-23 (group2)
  domain 2: 0-23 level NUMA

The user runs a workload with an affinity setup that takes one
CPU from group1 (0) and the rest from group2:
    0,6,7,8,9,10,11,18,19,20,21,22

The user will see idle CPUs within the affinity group, because
the load balancer balances tasks based on load within group1
and group2, thus placing an equal load of tasks on CPU 0 and on
the rest of the CPUs.

The rebalance_affinity() function detects the above setup and
tries to place tasks with cpus_allowed set on idle CPUs within
their allowed mask, if there are any.

Once such a task is rebalanced, the load balancer is not allowed
to touch it (balance it) unless it's reattached to a runqueue.

This functionality is in place only if the REBALANCE_AFFINITY
feature is enabled.

Signed-off-by: Jiri Olsa
---
 kernel/sched/fair.c | 104 +++++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 99 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 78c4127f2f3a..736e525e189c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6100,16 +6100,22 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
 	return 0;
 }
 
+static void __detach_task(struct task_struct *p,
+			  struct rq *src_rq, int dst_cpu)
+{
+	lockdep_assert_held(&src_rq->lock);
+
+	p->on_rq = TASK_ON_RQ_MIGRATING;
+	deactivate_task(src_rq, p, 0);
+	set_task_cpu(p, dst_cpu);
+}
+
 /*
  * detach_task() -- detach the task for the migration specified in env
  */
 static void detach_task(struct task_struct *p, struct lb_env *env)
 {
-	lockdep_assert_held(&env->src_rq->lock);
-
-	p->on_rq = TASK_ON_RQ_MIGRATING;
-	deactivate_task(env->src_rq, p, 0);
-	set_task_cpu(p, env->dst_cpu);
+	__detach_task(p, env->src_rq, env->dst_cpu);
 }
 
 /*
@@ -7833,6 +7839,91 @@ void sched_idle_exit(int cpu)
 	}
 }
 
+static bool has_affinity_set(struct task_struct *p, cpumask_var_t mask)
+{
+	if (!cpumask_and(mask, tsk_cpus_allowed(p), cpu_active_mask))
+		return false;
+
+	cpumask_xor(mask, mask, cpu_active_mask);
+	return !cpumask_empty(mask);
+}
+
+static void rebalance_affinity(struct rq *rq)
+{
+	struct task_struct *p;
+	unsigned long flags;
+	cpumask_var_t mask;
+	bool mask_alloc = false;
+
+	/*
+	 * No need to bother if:
+	 *  - there's only 1 task on the queue
+	 *  - there's no idle cpu at the moment.
+	 */
+	if (rq->nr_running <= 1)
+		return;
+
+	if (!atomic_read(&balance.nr_cpus))
+		return;
+
+	raw_spin_lock_irqsave(&rq->lock, flags);
+
+	list_for_each_entry(p, &rq->cfs_tasks, se.group_node) {
+		struct rq *dst_rq;
+		int cpu;
+
+		/*
+		 * Force affinity balance only if:
+		 *  - task is not current one
+		 *  - task is not already balanced (p->se.dont_balance is not set)
+		 *  - task has cpus_allowed set
+		 *  - we have idle cpu ready within task's cpus_allowed
+		 */
+		if (task_running(rq, p))
+			continue;
+
+		if (p->se.dont_balance)
+			continue;
+
+		if (!mask_alloc) {
+			int ret = zalloc_cpumask_var(&mask, GFP_ATOMIC);
+
+			if (WARN_ON_ONCE(!ret))
+				break;
+			mask_alloc = true;
+		}
+
+		if (!has_affinity_set(p, mask))
+			continue;
+
+		if (!cpumask_and(mask, tsk_cpus_allowed(p), balance.idle_cpus_mask))
+			continue;
+
+		cpu = cpumask_any_but(mask, task_cpu(p));
+		if (cpu >= nr_cpu_ids)
+			continue;
+
+		__detach_task(p, rq, cpu);
+		raw_spin_unlock(&rq->lock);
+
+		dst_rq = cpu_rq(cpu);
+
+		raw_spin_lock(&dst_rq->lock);
+		attach_task(dst_rq, p);
+		p->se.dont_balance = true;
+		raw_spin_unlock(&dst_rq->lock);
+
+		local_irq_restore(flags);
+		free_cpumask_var(mask);
+		return;
+	}
+
+	raw_spin_unlock_irqrestore(&rq->lock, flags);
+
+	if (mask_alloc)
+		free_cpumask_var(mask);
+}
+
 #ifdef CONFIG_NO_HZ_COMMON
 /*
  * idle load balancing details
@@ -8077,6 +8168,9 @@ out:
 			nohz.next_balance = rq->next_balance;
 #endif
 	}
+
+	if (sched_feat(REBALANCE_AFFINITY))
+		rebalance_affinity(rq);
 }
 
 #ifdef CONFIG_NO_HZ_COMMON
-- 
2.4.11
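
For readers who want to experiment with the mask logic outside the
kernel, the checks done by has_affinity_set() and the idle CPU
selection in rebalance_affinity() can be approximated in user space.
This is only a rough sketch under stated assumptions, not the kernel
code: cpu_mask_t, restricts_affinity() and pick_idle_cpu() are
hypothetical names, a plain 32-bit integer stands in for struct
cpumask, and the lowest-numbered candidate is picked, whereas
cpumask_any_but() is free to return any matching CPU.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a cpumask: bit n represents CPU n (0..31). */
typedef uint32_t cpu_mask_t;

/*
 * Mirrors the has_affinity_set() logic: true when the task can run on
 * at least one active CPU but not on all of them.
 */
bool restricts_affinity(cpu_mask_t allowed, cpu_mask_t active)
{
	cpu_mask_t usable = allowed & active;

	if (!usable)
		return false;

	/* Any bit left after the XOR is an active CPU the task cannot use. */
	return (usable ^ active) != 0;
}

/*
 * Pick an idle CPU inside the allowed mask, skipping the CPU the task
 * currently sits on. Returns -1 when there is no candidate.
 */
int pick_idle_cpu(cpu_mask_t allowed, cpu_mask_t idle, int cur_cpu)
{
	cpu_mask_t candidates = allowed & idle;
	int cpu;

	if (cur_cpu >= 0 && cur_cpu < 32)
		candidates &= ~((cpu_mask_t)1 << cur_cpu);

	for (cpu = 0; cpu < 32; cpu++) {
		if (candidates & ((cpu_mask_t)1 << cpu))
			return cpu;
	}

	return -1;
}
```

With the affinity list from the changelog (0,6-11,18-22, i.e. mask
0x7C0FC1) on a machine with 24 active CPUs (0xFFFFFF),
restricts_affinity() returns true. If CPUs 9 and 20 are idle and the
task sits on CPU 0, pick_idle_cpu() returns 9.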