From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-10.3 required=3.0 tests=BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_PATCH,MAILING_LIST_MULTI,NICE_REPLY_A, SPF_HELO_NONE,SPF_PASS,USER_AGENT_SANE_1 autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 90D66C433ED for ; Tue, 11 May 2021 17:25:41 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 67AFB613C5 for ; Tue, 11 May 2021 17:25:41 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S231874AbhEKR0q (ORCPT ); Tue, 11 May 2021 13:26:46 -0400 Received: from mga17.intel.com ([192.55.52.151]:26995 "EHLO mga17.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S231230AbhEKR0p (ORCPT ); Tue, 11 May 2021 13:26:45 -0400 IronPort-SDR: FSAbzaBxW+7acldlcrJKcDokK18P3YxCA6jFbcw0fvpavcBr+XgLcLbBk+9HT1ZrGNZDG9KIsd DWAJ7k+RbNOQ== X-IronPort-AV: E=McAfee;i="6200,9189,9981"; a="179764063" X-IronPort-AV: E=Sophos;i="5.82,291,1613462400"; d="scan'208";a="179764063" Received: from orsmga008.jf.intel.com ([10.7.209.65]) by fmsmga107.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 11 May 2021 10:25:38 -0700 IronPort-SDR: obO83tK9sdWGcdECl4zUVLZwfdJRstBVAXSkwf0uX1c8H/kJHUPIZ1TM82dxRngv3drdIIr6yl CbbnvVaDbtSg== X-IronPort-AV: E=Sophos;i="5.82,291,1613462400"; d="scan'208";a="436766874" Received: from schen9-mobl.amr.corp.intel.com ([10.251.18.81]) by orsmga008-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 11 May 2021 10:25:37 -0700 Subject: Re: [PATCH] sched/fair: Rate limit calls to update_blocked_averages() for NOHZ To: Vincent Guittot Cc: Joel Fernandes , linux-kernel , Paul McKenney , Frederic Weisbecker , Dietmar Eggeman , Qais Yousef , Ben Segall , Daniel Bristot de Oliveira , Ingo Molnar , Juri Lelli , Mel Gorman , Peter Zijlstra , Steven Rostedt , "Uladzislau Rezki (Sony)" , Neeraj upadhyay , Aubrey Li References: <20210122154600.1722680-1-joel@joelfernandes.org> <20210129172727.GA30719@vingu-book> <274d8ae5-8f4d-7662-0e04-2fbc92b416fc@linux.intel.com> <20210324134437.GA17675@vingu-book> <4aa674d9-db49-83d5-356f-a20f9e2a7935@linux.intel.com> <2d2294ce-f1d1-f827-754b-4541c1b43be8@linux.intel.com> From: Tim Chen Message-ID: <577b0aae-0111-97aa-0c99-c2a2fcfb5e2e@linux.intel.com> Date: Tue, 11 May 2021 10:25:37 -0700 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Thunderbird/68.6.0 MIME-Version: 1.0 In-Reply-To: Content-Type: text/plain; charset=utf-8 Content-Language: en-US Content-Transfer-Encoding: 7bit Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 5/11/21 8:25 AM, Vincent Guittot wrote: > Hi Tim, > > Sometimes, we want to set this_rq->next_balance backward compared to > its current value. When a CPU is busy, the balance interval is > multiplied by busy_factor which is set to 16 by default. On SMT2 sched > domain level, it means that the interval will be 32ms when busy > instead of 2ms. But if a busy load balance happens just before > becoming idle, the this_rq->next_balance will be set 32ms later > whereas it should go down to 2ms as the CPU is now idle. And this > becomes even worse when you have 128 CPUs at die sched_domain level > because the idle CPU will have to wait 2048ms instead of the correct > 128ms interval. > >> >> out: >> /* Move the next balance forward */ >> - if (time_after(this_rq->next_balance, next_balance)) >> + if (time_after(next_balance, this_rq->next_balance)) > > The current comparison is correct but next_balance should not be in the past. I understand then the intention is after the update, this_rq->next_balance should have a minimum value of jiffies+1. So we will need out: /* Move the next balance forward */ + this_rq->next_balance = max(jiffies+1, this_rq->next_balance); if (time_after(this_rq->next_balance, next_balance)) this_rq->next_balance = next_balance; as next_balance computed will be at least jiffies+1 after your fix to update_next_balance(). We still need to take care of the case when this_rq->next_balance <= jiffies. So combining with your suggestion on the fix to update_next_balance(), the fix will be diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 1d75af1ecfb4..2dc471c1511c 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -9901,7 +9901,7 @@ update_next_balance(struct sched_domain *sd, unsigned long *next_balance) /* used by idle balance, so cpu_busy = 0 */ interval = get_sd_balance_interval(sd, 0); - next = sd->last_balance + interval; + next = max(jiffies+1, sd->last_balance + interval); if (time_after(*next_balance, next)) *next_balance = next; @@ -10681,6 +10681,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf) out: /* Move the next balance forward */ + this_rq->next_balance = max(jiffies+1, this_rq->next_balance); if (time_after(this_rq->next_balance, next_balance)) this_rq->next_balance = next_balance; > > update_next_balance() is only used in newidle_balance() so we could > modify it to have: > > next = max(jiffies+1, next = sd->last_balance + interval) Is the extra assignment "next = sd->last_balance + interval" needed? This seems more straight forward: next = max(jiffies+1, sd->last_balance + interval) I will try to get the benchmark folks to do another run with this change. Hopefully I'll get some bandwidth from them soon. Thanks. Tim