From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from mgamail.intel.com (mgamail.intel.com [192.198.163.11]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 63BE412D205 for ; Mon, 3 Jun 2024 16:16:08 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=192.198.163.11 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1717431370; cv=none; b=H1sMPVhUfnuG3lVEA8KtrJ5bx/wBWegB5tR5LsR3r/xJkKn0q8sM+YWEQdcCGCmV6nNWV6wTF2Xu4gn8Vo2VCN6Od2bC0GBPLDSqJEqIQpKvtwWIoowrw+W1m6HAjgvB+FMYDFTtUrByyMg7Bo+iqFtBh4blwJXtCV6ZU1fQOS4= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1717431370; c=relaxed/simple; bh=NjJX0xKM5DPMuTwBiyi6UFsD0n9pOuHRMR2Y1oanS1o=; h=Message-ID:Subject:From:To:Cc:Date:In-Reply-To:References: Content-Type:MIME-Version; b=tnFDbZlDEw626ycvQrDqVHysM4S5el+GzVP6QE+yKdRhfhx22yljuWTk1QXbACw9GIUD5k8mZjCTYEB8yFl1U4xrAPbNeS7Maznu+Lk+hk6U655oO4/+WGGUcGv+c0T9r6fGxf87TREGtqjLLmyNl92NyuP28Zo1cAg/2GuFRbQ= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.intel.com; spf=none smtp.mailfrom=linux.intel.com; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b=dGzVvi4g; arc=none smtp.client-ip=192.198.163.11 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.intel.com Authentication-Results: smtp.subspace.kernel.org; spf=none smtp.mailfrom=linux.intel.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b="dGzVvi4g" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1717431369; x=1748967369; h=message-id:subject:from:to:cc:date:in-reply-to: references:content-transfer-encoding:mime-version; bh=NjJX0xKM5DPMuTwBiyi6UFsD0n9pOuHRMR2Y1oanS1o=; b=dGzVvi4g7iJlhrepokwZkR2Q1/kfLzFPCfMY8YyamiX7hCh8Lv0rZ2Tp QEFrjCm4mEQhYQJozbkffzKdkR3sw5iqlL4AnmRIm7O2Prlv0YGdJ51Rs AwlSgBw+on0Y+1QTSuKwLS4+u5flX5t/JL8ZdhuYTn/YeUpMpiZgQw16i 89Iug7HIMgWDPAJv+KyqVYUU5b2iRH1FdFi75ZmSzR2bLGhUi/RLr8++j j2Ei4BP+D/t2Av6oAH7NlEAKvIcAjVOup0v3y768gkRhac8EpmqzEDtR1 JW0gQmtFkDlOQFGQequMcScf92u3j3AYbVSufKpv9pmJ7g3DFoxtC8B6o A==; X-CSE-ConnectionGUID: 1EFdWXwURzOtLG+6oPJxkQ== X-CSE-MsgGUID: NKESAfI9S9uXCXGMpl/Fqg== X-IronPort-AV: E=McAfee;i="6600,9927,11092"; a="24556309" X-IronPort-AV: E=Sophos;i="6.08,212,1712646000"; d="scan'208";a="24556309" Received: from orviesa001.jf.intel.com ([10.64.159.141]) by fmvoesa105.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 03 Jun 2024 09:13:48 -0700 X-CSE-ConnectionGUID: mLGO8cpdSeWD3nsCWZrRKA== X-CSE-MsgGUID: pLeFbCtMTFqXRiZOtIBYYw== X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="6.08,212,1712646000"; d="scan'208";a="74416911" Received: from tjancza1-mobl.amr.corp.intel.com (HELO [10.209.38.23]) ([10.209.38.23]) by smtpauth.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 03 Jun 2024 09:13:47 -0700 Message-ID: Subject: Re: [PATCH] sched/balance: Skip unnecessary updates to idle load balancer's flags From: Tim Chen To: Chen Yu Cc: Peter Zijlstra , Ingo Molnar , Vincent Guittot , linux-kernel@vger.kernel.org, Vinicius Gomes Date: Mon, 03 Jun 2024 09:13:47 -0700 In-Reply-To: References: <20240531205452.65781-1-tim.c.chen@linux.intel.com> Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable User-Agent: Evolution 3.44.4 (3.44.4-3.fc36) Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 On Mon, 2024-06-03 at 00:40 +0800, Chen Yu wrote: > >=20 > > With instrumentation, we found that 81% of the updates do not result in > > any change in the ilb_cpu's flags. That is, multiple cpus are asking > > the ilb_cpu to do the same things over and over again, before the ilb_c= pu > > has a chance to run NOHZ load balance. > >=20 > > Skip updates to ilb_cpu's flags if no new work needs to be done. > > Such updates do not change ilb_cpu's NOHZ flags. This requires an extr= a > > atomic read but it is less expensive than frequent unnecessary atomic > > updates that generate cache bounces. >=20 > A race condition is that many CPUs choose the same ilb_cpu and ask it to = trigger > the nohz idle balance. This is because find_new_ilb() always finds the fi= rst > nohz idle CPU. I wonder if we could change the > for_each_cpu_and(ilb_cpu, nohz.idle_cpus_mask, hk_mask) > into > for_each_cpu_wrap(ilb_cpu, cpumask_and(nohz.idle_cpus_mask, hk_mask), th= is_cpu+1)=20 > so different ilb_cpu might be found by different CPUs. > Then the extra atomic read could brings less cache bounces. >=20 Your proposal improves scaling. However, that could result in many idle CPUs getting kicked. I assume that current approach of delegating to a common idle CPU will disturb fewer CPUs and let them stay in deeper idle states, and get the power benefits from NOHZ scheme. > >=20 > > We saw that on the OLTP workload, cpu cycles from trigger_load_balance(= ) > > (or sched_balance_trigger()) got reduced from 0.7% to 0.2%. > >=20 > > Signed-off-by: Tim Chen > > --- > > kernel/sched/fair.c | 7 +++++++ > > 1 file changed, 7 insertions(+) > >=20 > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > > index 8a5b1ae0aa55..9ab6dff6d8ac 100644 > > --- a/kernel/sched/fair.c > > +++ b/kernel/sched/fair.c > > @@ -11891,6 +11891,13 @@ static void kick_ilb(unsigned int flags) > > if (ilb_cpu < 0) > > return; > > =20 > > + /* > > + * Don't bother if no new NOHZ balance work items for ilb_cpu, > > + * i.e. all bits in flags are already set in ilb_cpu. > > + */ > > + if ((atomic_read(nohz_flags(ilb_cpu)) & flags) =3D=3D flags) >=20 > Maybe also mention in the comment that when above statement is true, the > current ilb_cpu's flags is not 0 and in NOHZ_KICK_MASK, so return directl= y > here is safe(anyway just 2 cents) Not sure I follow your comments about return being safe. Let me explain in details. We will return directly if and only if the bits set in flags are also set in nohz_flags(ilb_cpu). =20 The comment's intention is to say that if the above statement is true, then the later operation of=20 atomic_fetch_or(flags, nohz_flags(ilb_cpu)) will be useless and not result in any change to nohz_flags(ilb_cpu), since = all the set bits in flags are already set in nohz_flags(ilb_cpu). So will it be clearer if I say /* * Don't bother if no new NOHZ balance work items for ilb_cpu, * i.e. all bits in flags are already set in ilb_cpu. * Later OR of flags to nohz_flags(ilb_cpu) * will not change nohz_flags(ilb_cpu). */ Thanks. Tim > Reviewed-by: Chen Yu >=20 > thanks, > Chenyu >=20 > > + return; > > + > > /* > > * Access to rq::nohz_csd is serialized by NOHZ_KICK_MASK; he who set= s > > * the first flag owns it; cleared by nohz_csd_func(). > > --=20 > > 2.32.0 > >=20