Subject: Re: [PATCH V3 1/2] sched/fair: Fix load_balance() affinity redo path
From: Jeffrey Hugo <jhugo@codeaurora.org>
To: Dietmar Eggemann, "Christ, Austin", Ingo Molnar, Peter Zijlstra,
 linux-kernel@vger.kernel.org
Cc: Tyler Baicar, Timur Tabi
Date: Tue, 23 May 2017 16:05:10 -0600

On 5/23/2017 5:45 AM, Dietmar Eggemann wrote:
> Hey Austin,
>
> On 22/05/17 20:57, Christ, Austin wrote:
>> Hey Dietmar,
>>
>> On 5/22/2017 3:48 AM, Dietmar Eggemann wrote:
>>> On 19/05/17 14:31, Dietmar Eggemann wrote:
>>>> On 18/05/17 20:36, Jeffrey Hugo wrote:
>>>>
>>>> [...]
>>>>
>>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>>> index d711093..a5d41b1 100644
>>>>> --- a/kernel/sched/fair.c
>>>>> +++ b/kernel/sched/fair.c
>>>>> @@ -8220,7 +8220,24 @@ static int load_balance(int this_cpu, struct rq *this_rq,
>>>>>  		/* All tasks on this runqueue were pinned by CPU affinity */
>>>>>  		if (unlikely(env.flags & LBF_ALL_PINNED)) {
>>>>>  			cpumask_clear_cpu(cpu_of(busiest), cpus);
>>>>> -			if (!cpumask_empty(cpus)) {
>>>>> +			/*
>>>>> +			 * dst_cpu is not a valid busiest cpu in the following
>>>>> +			 * check since load cannot be pulled from dst_cpu to be
>>>>> +			 * put on dst_cpu.
>>>>> +			 */
>>>>> +			cpumask_clear_cpu(env.dst_cpu, cpus);
>>>>> +			/*
>>>>> +			 * Go back to "redo" iff the load-balance cpumask
>>>>> +			 * contains other potential busiest cpus for the
>>>>> +			 * current sched domain.
>>>>> +			 */
>>>>> +			if (cpumask_intersects(cpus, sched_domain_span(env.sd))) {
>>>>> +				/*
>>>>> +				 * Now that the check has passed, reenable
>>>>> +				 * dst_cpu so that load can be calculated on
>>>>> +				 * it in the redo path.
>>>>> +				 */
>>>>> +				cpumask_set_cpu(env.dst_cpu, cpus);
>>>>
>>>> IMHO, this will work nicely and it's way easier.
>>>
>>> This was too quick ... if we still have other potential dst cpus
>>> available and cpu_of(busiest) is the last src cpu, then this will
>>> fail.
>>>
>>> It does work on an sd with 'group_weight == 1', e.g. your MC sd
>>> ('sd->child == NULL').
>>>
>>> But IMHO 'group_imbalance' propagation has to work on higher sd
>>> levels as well.
>>
>> Can you clarify the fail case you are seeing? We are only aware of
>> dst_cpu being changed under [1], where a dst_cpu will try to move
>> work to one of its sched_group siblings.
>>
>> I'm also not entirely sure I understand what you mean about the flag
>> being propagated to higher sd levels.
>
> The propagation of 'imbalance' information should not only happen
> between the lowest sd (sd->child == NULL) and its parent (MC->DIE in
> your example) but between all {sd, sd->parent} pairs.
>
> Imagine your machine had another sd on top of DIE.
>
> I recreated the issue I pointed out on my hikey board (2*4) (w/o this
> extra sd on top of DIE) and hotplugged out cpus 2,3,6,7, so I have a
> system with the following DIE sched_groups (sg):
>
>   sg1(0,1) and sg2(4,5) <- the DIE level sg's contain more than 1
>   logical cpu.
>
> As a workload I run 4 25% tasks affine to [0,1]. These tasks are
> 'SOURCE' PINNED for a DIE lb sg2<-sg1.
>
> With:
>
>   if (unlikely(env.flags & LBF_ALL_PINNED)) {
>           cpumask_clear_cpu(cpu_of(busiest), cpus);
>           if (!cpumask_empty(cpus)) {
>                   ...
>
>                   printk("goto redo: sd=%s dst_cpu=%d src_cpu=%d cpus=%*pbl dst_grpmask=%*pbl\n",
>                          sd->name, env.dst_cpu, cpu_of(busiest),
>                          cpumask_pr_args(cpus),
>                          cpumask_pr_args(env.dst_grpmask));
>
>                   goto redo;
>           }
>
> While running the workload I sometimes get:
>
>   ...
>   goto redo: sd=DIE dst_cpu=4 src_cpu=1 cpus=0,4-5 dst_grpmask=4-5
>   goto redo: sd=DIE dst_cpu=4 src_cpu=0 cpus=4-5 dst_grpmask=4-5
>   ...
>
> So even though the 'redo' handling has tried both possible src_cpus,
> we would still enter another 'redo' pass, even if you remove dst_cpu=4
> temporarily, because of cpu=5.
>
> You could replace:
>
>   cpumask_clear_cpu(env.dst_cpu, cpus)
>   cpumask_set_cpu(env.dst_cpu, cpus)
>
> with
>
>   cpumask_andnot(cpus, cpus, env.dst_grpmask)
>   cpumask_or(cpus, cpus, env.dst_grpmask)
>
> but then env.dst_grpmask can't be set to NULL for CPU_NEWLY_IDLE, and
> you're almost at the snippet I sent out for v1:
> https://marc.info/?l=linux-kernel&m=149486020010389&w=2
>
> [...]

I see what you mean. You are right, our current proposal is flawed in
this regard. We'll take a look for a bit and see if we come up with any
other ideas.
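For reference, if I'm reading your cpumask_andnot()/cpumask_or()
suggestion correctly, the LBF_ALL_PINNED block would end up looking
roughly like the sketch below. This is untested, and it assumes
env.dst_grpmask stays valid (i.e. is not set to NULL for CPU_NEWLY_IDLE),
per your note above:

	/* All tasks on this runqueue were pinned by CPU affinity */
	if (unlikely(env.flags & LBF_ALL_PINNED)) {
		cpumask_clear_cpu(cpu_of(busiest), cpus);
		/*
		 * Drop the whole destination group, not just dst_cpu:
		 * a sibling of dst_cpu left in 'cpus' (e.g. cpu=5 in
		 * the dst_cpu=4 trace above) would otherwise satisfy
		 * the intersection check and trigger a futile redo.
		 */
		cpumask_andnot(cpus, cpus, env.dst_grpmask);
		if (cpumask_intersects(cpus, sched_domain_span(env.sd))) {
			/* Restore the dst group so load can be pulled to it. */
			cpumask_or(cpus, cpus, env.dst_grpmask);
			env.loop = 0;
			env.loop_break = sched_nr_migrate_break;
			goto redo;
		}
		goto out_all_pinned;
	}

The cost, as you point out, is that CPU_NEWLY_IDLE balancing would also
need a valid dst_grpmask, which lands us back at your v1 snippet.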
-- 
Jeffrey Hugo
Qualcomm Datacenter Technologies as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project.