Re: [PATCH v4 7/7] sched/fair: de-entropy for SIS filter

All of lore.kernel.org
 help / color / mirror / Atom feed

* Re: [PATCH v4 7/7] sched/fair: de-entropy for SIS filter
@ 2022-06-21  7:27 kernel test robot
  0 siblings, 0 replies; 7+ messages in thread
From: kernel test robot @ 2022-06-21  7:27 UTC (permalink / raw)
  To: kbuild

[-- Attachment #1: Type: text/plain, Size: 20768 bytes --]

:::::: 
:::::: Manual check reason: "low confidence static check warning: kernel/sched/fair.c:9330:6: warning: Value stored to 'icpu' during its initialization is never read [clang-analyzer-deadcode.DeadStores]"
:::::: 

CC: llvm(a)lists.linux.dev
CC: kbuild-all(a)lists.01.org
BCC: lkp(a)intel.com
In-Reply-To: <20220619120451.95251-8-wuyun.abel@bytedance.com>
References: <20220619120451.95251-8-wuyun.abel@bytedance.com>
TO: Abel Wu <wuyun.abel@bytedance.com>
TO: Peter Zijlstra <peterz@infradead.org>
TO: Mel Gorman <mgorman@suse.de>
TO: Vincent Guittot <vincent.guittot@linaro.org>
CC: Josh Don <joshdon@google.com>
CC: Chen Yu <yu.c.chen@intel.com>
CC: Tim Chen <tim.c.chen@linux.intel.com>
CC: K Prateek Nayak <kprateek.nayak@amd.com>
CC: "Gautham R . Shenoy" <gautham.shenoy@amd.com>
CC: linux-kernel(a)vger.kernel.org
CC: Abel Wu <wuyun.abel@bytedance.com>

Hi Abel,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on tip/sched/core]
[also build test WARNING on linus/master v5.19-rc2 next-20220617]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/intel-lab-lkp/linux/commits/Abel-Wu/sched-fair-improve-scan-efficiency-of-SIS/20220619-200743
base:   https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git f3dd3f674555bd9455c5ae7fafce0696bd9931b3
:::::: branch date: 2 days ago
:::::: commit date: 2 days ago
config: s390-randconfig-c005-20220619 (https://download.01.org/0day-ci/archive/20220621/202206211517.bchsifSF-lkp(a)intel.com/config)
compiler: clang version 15.0.0 (https://github.com/llvm/llvm-project af6d2a0b6825e71965f3e2701a63c239fa0ad70f)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # install s390 cross compiling tool for clang build
        # apt-get install binutils-s390x-linux-gnu
        # https://github.com/intel-lab-lkp/linux/commit/32fe13cd7aa184ed349d698ebf6f420fa426dd73
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Abel-Wu/sched-fair-improve-scan-efficiency-of-SIS/20220619-200743
        git checkout 32fe13cd7aa184ed349d698ebf6f420fa426dd73
        # save the config file
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross ARCH=s390 clang-analyzer 

If you fix the issue, kindly add following tag where applicable
Reported-by: kernel test robot <lkp@intel.com>


clang-analyzer warnings: (new ones prefixed by >>)
           ^~~~~~~~~~~~~~~~~~~~~~
   kernel/sched/fair.c:9670:2: note: Calling 'update_sd_lb_stats'
           update_sd_lb_stats(env, &sds);
           ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   kernel/sched/fair.c:9422:7: note: Assuming 'local_group' is 0
                   if (local_group) {
                       ^~~~~~~~~~~
   kernel/sched/fair.c:9422:3: note: Taking false branch
                   if (local_group) {
                   ^
   kernel/sched/fair.c:9431:3: note: Calling 'update_sg_lb_stats'
                   update_sg_lb_stats(env, sds, sg, sgs, &sg_status);
                   ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   kernel/sched/fair.c:8766:16: note: Assuming 'group' is not equal to field 'local'
           local_group = group == sds->local;
                         ^~~~~~~~~~~~~~~~~~~
   kernel/sched/fair.c:8769:2: note: Assuming 'i' is >= 'nr_cpu_ids'
           for_each_cpu_and(i, sched_group_span(group), env->cpus) {
           ^
   include/linux/cpumask.h:326:3: note: expanded from macro 'for_each_cpu_and'
                   (cpu) < nr_cpu_ids;)
                   ^~~~~~~~~~~~~~~~~~
   kernel/sched/fair.c:8769:2: note: Loop condition is false. Execution continues on line 8828
           for_each_cpu_and(i, sched_group_span(group), env->cpus) {
           ^
   include/linux/cpumask.h:324:2: note: expanded from macro 'for_each_cpu_and'
           for ((cpu) = -1;                                                \
           ^
   kernel/sched/fair.c:8833:7: note: 'local_group' is 0
           if (!local_group && env->sd->flags & SD_ASYM_PACKING &&
                ^~~~~~~~~~~
   kernel/sched/fair.c:8833:6: note: Left side of '&&' is true
           if (!local_group && env->sd->flags & SD_ASYM_PACKING &&
               ^
   kernel/sched/fair.c:8833:22: note: Assuming the condition is true
           if (!local_group && env->sd->flags & SD_ASYM_PACKING &&
                               ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   kernel/sched/fair.c:8833:6: note: Left side of '&&' is true
           if (!local_group && env->sd->flags & SD_ASYM_PACKING &&
               ^
   kernel/sched/fair.c:8834:6: note: Assuming field 'idle' is not equal to CPU_NOT_IDLE
               env->idle != CPU_NOT_IDLE && sgs->sum_h_nr_running &&
               ^~~~~~~~~~~~~~~~~~~~~~~~~
   kernel/sched/fair.c:8833:6: note: Left side of '&&' is true
           if (!local_group && env->sd->flags & SD_ASYM_PACKING &&
               ^
   kernel/sched/fair.c:8834:35: note: Assuming field 'sum_h_nr_running' is not equal to 0
               env->idle != CPU_NOT_IDLE && sgs->sum_h_nr_running &&
                                            ^~~~~~~~~~~~~~~~~~~~~
   kernel/sched/fair.c:8833:6: note: Left side of '&&' is true
           if (!local_group && env->sd->flags & SD_ASYM_PACKING &&
               ^
   kernel/sched/fair.c:8835:6: note: Calling 'sched_asym'
               sched_asym(env, sds, sgs, group)) {
               ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   kernel/sched/fair.c:8727:7: note: Access to field 'flags' results in a dereference of a null pointer (loaded from field 'local')
           if ((sds->local->flags & SD_SHARE_CPUCAPACITY) ||
                ^    ~~~~~
   kernel/sched/fair.c:8828:24: warning: Access to field 'sgc' results in a dereference of a null pointer (loaded from variable 'group') [clang-analyzer-core.NullDereference]
           sgs->group_capacity = group->sgc->capacity;
                                 ^
   kernel/sched/fair.c:9670:2: note: Calling 'update_sd_lb_stats'
           update_sd_lb_stats(env, &sds);
           ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   kernel/sched/fair.c:9412:2: note: 'sg' initialized here
           struct sched_group *sg = env->sd->groups;
           ^~~~~~~~~~~~~~~~~~~~~~
   kernel/sched/fair.c:9422:7: note: Assuming 'local_group' is 0
                   if (local_group) {
                       ^~~~~~~~~~~
   kernel/sched/fair.c:9422:3: note: Taking false branch
                   if (local_group) {
                   ^
   kernel/sched/fair.c:9431:32: note: Passing 'sg' via 3rd parameter 'group'
                   update_sg_lb_stats(env, sds, sg, sgs, &sg_status);
                                                ^~
   kernel/sched/fair.c:9431:3: note: Calling 'update_sg_lb_stats'
                   update_sg_lb_stats(env, sds, sg, sgs, &sg_status);
                   ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   kernel/sched/fair.c:8766:16: note: Assuming 'group' is equal to field 'local'
           local_group = group == sds->local;
                         ^~~~~~~~~~~~~~~~~~~
   kernel/sched/fair.c:8766:16: note: Assuming pointer value is null
           local_group = group == sds->local;
                         ^~~~~~~~~~~~~~~~~~~
   kernel/sched/fair.c:8769:2: note: Assuming 'i' is >= 'nr_cpu_ids'
           for_each_cpu_and(i, sched_group_span(group), env->cpus) {
           ^
   include/linux/cpumask.h:326:3: note: expanded from macro 'for_each_cpu_and'
                   (cpu) < nr_cpu_ids;)
                   ^~~~~~~~~~~~~~~~~~
   kernel/sched/fair.c:8769:2: note: Loop condition is false. Execution continues on line 8828
           for_each_cpu_and(i, sched_group_span(group), env->cpus) {
           ^
   include/linux/cpumask.h:324:2: note: expanded from macro 'for_each_cpu_and'
           for ((cpu) = -1;                                                \
           ^
   kernel/sched/fair.c:8828:24: note: Access to field 'sgc' results in a dereference of a null pointer (loaded from variable 'group')
           sgs->group_capacity = group->sgc->capacity;
                                 ^~~~~
>> kernel/sched/fair.c:9330:6: warning: Value stored to 'icpu' during its initialization is never read [clang-analyzer-deadcode.DeadStores]
           int icpu = sds->idle_cpu, this = env->dst_cpu;
               ^~~~   ~~~~~~~~~~~~~
   kernel/sched/fair.c:9330:6: note: Value stored to 'icpu' during its initialization is never read
           int icpu = sds->idle_cpu, this = env->dst_cpu;
               ^~~~   ~~~~~~~~~~~~~
>> kernel/sched/fair.c:9352:4: warning: Value stored to 'icpu' is never read [clang-analyzer-deadcode.DeadStores]
                           icpu = this;
                           ^      ~~~~
   kernel/sched/fair.c:9352:4: note: Value stored to 'icpu' is never read
                           icpu = this;
                           ^      ~~~~
   kernel/sched/fair.c:10040:25: warning: Dereference of null pointer [clang-analyzer-core.NullDereference]
           struct cpumask *cpus = this_cpu_cpumask_var_ptr(load_balance_mask);
                                  ^
   include/linux/cpumask.h:761:37: note: expanded from macro 'this_cpu_cpumask_var_ptr'
   #define this_cpu_cpumask_var_ptr(x) this_cpu_ptr(x)
                                       ^
   include/linux/percpu-defs.h:252:27: note: expanded from macro 'this_cpu_ptr'
   #define this_cpu_ptr(ptr) raw_cpu_ptr(ptr)
                             ^
   include/linux/percpu-defs.h:242:2: note: expanded from macro 'raw_cpu_ptr'
           arch_raw_cpu_ptr(ptr);                                          \
           ^
   include/asm-generic/percpu.h:44:31: note: expanded from macro 'arch_raw_cpu_ptr'
   #define arch_raw_cpu_ptr(ptr) SHIFT_PERCPU_PTR(ptr, __my_cpu_offset)
                                 ^
   include/linux/percpu-defs.h:231:2: note: expanded from macro 'SHIFT_PERCPU_PTR'
           RELOC_HIDE((typeof(*(__p)) __kernel __force *)(__p), (__offset))
           ^
   include/linux/compiler.h:170:28: note: expanded from macro 'RELOC_HIDE'
       (typeof(ptr)) (__ptr + (off)); })
                              ^
   kernel/sched/fair.c:10462:13: note: Assuming 'idle' is equal to CPU_IDLE
           int busy = idle != CPU_IDLE && !sched_idle_cpu(cpu);
                      ^~~~~~~~~~~~~~~~
   kernel/sched/fair.c:10462:30: note: Left side of '&&' is false
           int busy = idle != CPU_IDLE && !sched_idle_cpu(cpu);
                                       ^
   kernel/sched/fair.c:10472:2: note: Left side of '||' is false
           for_each_domain(cpu, sd) {
           ^
   kernel/sched/sched.h:1716:14: note: expanded from macro 'for_each_domain'
           for (__sd = rcu_dereference_check_sched_domain(cpu_rq(cpu)->sd); \
                       ^
   kernel/sched/sched.h:1705:2: note: expanded from macro 'rcu_dereference_check_sched_domain'
           rcu_dereference_check((p), \
           ^
   include/linux/rcupdate.h:532:2: note: expanded from macro 'rcu_dereference_check'
           __rcu_dereference_check((p), __UNIQUE_ID(rcu), \
           ^
   note: (skipping 1 expansions in backtrace; use -fmacro-backtrace-limit=0 to see all)
   include/asm-generic/rwonce.h:49:2: note: expanded from macro 'READ_ONCE'
           compiletime_assert_rwonce_type(x);                              \
           ^
   include/asm-generic/rwonce.h:36:21: note: expanded from macro 'compiletime_assert_rwonce_type'
           compiletime_assert(__native_word(t) || sizeof(t) == sizeof(long long),  \
                              ^
   include/linux/compiler_types.h:319:3: note: expanded from macro '__native_word'
           (sizeof(t) == sizeof(char) || sizeof(t) == sizeof(short) || \
            ^
   kernel/sched/fair.c:10472:2: note: Left side of '||' is false
           for_each_domain(cpu, sd) {
           ^
   kernel/sched/sched.h:1716:14: note: expanded from macro 'for_each_domain'
           for (__sd = rcu_dereference_check_sched_domain(cpu_rq(cpu)->sd); \
                       ^
   kernel/sched/sched.h:1705:2: note: expanded from macro 'rcu_dereference_check_sched_domain'
           rcu_dereference_check((p), \
           ^
   include/linux/rcupdate.h:532:2: note: expanded from macro 'rcu_dereference_check'
           __rcu_dereference_check((p), __UNIQUE_ID(rcu), \
           ^
   note: (skipping 1 expansions in backtrace; use -fmacro-backtrace-limit=0 to see all)
   include/asm-generic/rwonce.h:49:2: note: expanded from macro 'READ_ONCE'
           compiletime_assert_rwonce_type(x);                              \
           ^
   include/asm-generic/rwonce.h:36:21: note: expanded from macro 'compiletime_assert_rwonce_type'
           compiletime_assert(__native_word(t) || sizeof(t) == sizeof(long long),  \
                              ^
   include/linux/compiler_types.h:319:3: note: expanded from macro '__native_word'
           (sizeof(t) == sizeof(char) || sizeof(t) == sizeof(short) || \
            ^
   kernel/sched/fair.c:10472:2: note: Left side of '||' is false
           for_each_domain(cpu, sd) {
           ^
   kernel/sched/sched.h:1716:14: note: expanded from macro 'for_each_domain'
           for (__sd = rcu_dereference_check_sched_domain(cpu_rq(cpu)->sd); \
                       ^
   kernel/sched/sched.h:1705:2: note: expanded from macro 'rcu_dereference_check_sched_domain'
           rcu_dereference_check((p), \
           ^
   include/linux/rcupdate.h:532:2: note: expanded from macro 'rcu_dereference_check'
           __rcu_dereference_check((p), __UNIQUE_ID(rcu), \
           ^
   note: (skipping 1 expansions in backtrace; use -fmacro-backtrace-limit=0 to see all)
   include/asm-generic/rwonce.h:49:2: note: expanded from macro 'READ_ONCE'
           compiletime_assert_rwonce_type(x);                              \
           ^
   include/asm-generic/rwonce.h:36:21: note: expanded from macro 'compiletime_assert_rwonce_type'
           compiletime_assert(__native_word(t) || sizeof(t) == sizeof(long long),  \
                              ^
   include/linux/compiler_types.h:319:3: note: expanded from macro '__native_word'
           (sizeof(t) == sizeof(char) || sizeof(t) == sizeof(short) || \
            ^
   kernel/sched/fair.c:10472:2: note: Left side of '||' is true
           for_each_domain(cpu, sd) {

vim +/icpu +9330 kernel/sched/fair.c

57abff067a0848 Vincent Guittot 2019-10-18  9325  
ba6ee6cee1ed2f Abel Wu         2022-06-19  9326  static void sd_update_state(struct lb_env *env, struct sd_lb_stats *sds)
ba6ee6cee1ed2f Abel Wu         2022-06-19  9327  {
fcc108377a7cf7 Abel Wu         2022-06-19  9328  	struct sched_domain_shared *sd_smt_shared = env->sd->shared;
fcc108377a7cf7 Abel Wu         2022-06-19  9329  	enum sd_state new = sds->sd_state;
32fe13cd7aa184 Abel Wu         2022-06-19 @9330  	int icpu = sds->idle_cpu, this = env->dst_cpu;
fcc108377a7cf7 Abel Wu         2022-06-19  9331  
fcc108377a7cf7 Abel Wu         2022-06-19  9332  	/*
fcc108377a7cf7 Abel Wu         2022-06-19  9333  	 * Parallel updating can hardly contribute accuracy to
fcc108377a7cf7 Abel Wu         2022-06-19  9334  	 * the filter, besides it can be one of the burdens on
fcc108377a7cf7 Abel Wu         2022-06-19  9335  	 * cache traffic.
fcc108377a7cf7 Abel Wu         2022-06-19  9336  	 */
fcc108377a7cf7 Abel Wu         2022-06-19  9337  	if (cmpxchg(&sd_smt_shared->updating, 0, 1))
fcc108377a7cf7 Abel Wu         2022-06-19  9338  		return;
fcc108377a7cf7 Abel Wu         2022-06-19  9339  
32fe13cd7aa184 Abel Wu         2022-06-19  9340  	/*
32fe13cd7aa184 Abel Wu         2022-06-19  9341  	 * The dst_cpu is likely to be fed with tasks soon.
32fe13cd7aa184 Abel Wu         2022-06-19  9342  	 * If it is the only unoccupied cpu in this domain,
32fe13cd7aa184 Abel Wu         2022-06-19  9343  	 * we still handle it the same way as as_has_icpus
32fe13cd7aa184 Abel Wu         2022-06-19  9344  	 * but turn the SMT into the unstable state, rather
32fe13cd7aa184 Abel Wu         2022-06-19  9345  	 * than waiting to the end of load balancing since
32fe13cd7aa184 Abel Wu         2022-06-19  9346  	 * it's also important that update the filter as
32fe13cd7aa184 Abel Wu         2022-06-19  9347  	 * early as possible to keep it fresh.
32fe13cd7aa184 Abel Wu         2022-06-19  9348  	 */
32fe13cd7aa184 Abel Wu         2022-06-19  9349  	if (new == sd_is_busy) {
32fe13cd7aa184 Abel Wu         2022-06-19  9350  		if (idle_cpu(this) || sched_idle_cpu(this)) {
32fe13cd7aa184 Abel Wu         2022-06-19  9351  			new = sd_may_idle;
32fe13cd7aa184 Abel Wu         2022-06-19 @9352  			icpu = this;
32fe13cd7aa184 Abel Wu         2022-06-19  9353  		}
32fe13cd7aa184 Abel Wu         2022-06-19  9354  	}
32fe13cd7aa184 Abel Wu         2022-06-19  9355  
fcc108377a7cf7 Abel Wu         2022-06-19  9356  	/*
fcc108377a7cf7 Abel Wu         2022-06-19  9357  	 * There is at least one unoccupied cpu available, so
fcc108377a7cf7 Abel Wu         2022-06-19  9358  	 * propagate it to the filter to avoid false negative
fcc108377a7cf7 Abel Wu         2022-06-19  9359  	 * issue which could result in lost tracking of some
fcc108377a7cf7 Abel Wu         2022-06-19  9360  	 * idle cpus thus throughupt downgraded.
fcc108377a7cf7 Abel Wu         2022-06-19  9361  	 */
fcc108377a7cf7 Abel Wu         2022-06-19  9362  	if (new != sd_is_busy) {
32fe13cd7aa184 Abel Wu         2022-06-19  9363  		/*
32fe13cd7aa184 Abel Wu         2022-06-19  9364  		 * The sd_may_idle state is taken into
32fe13cd7aa184 Abel Wu         2022-06-19  9365  		 * consideration as well because from
32fe13cd7aa184 Abel Wu         2022-06-19  9366  		 * here we couldn't actually know task
32fe13cd7aa184 Abel Wu         2022-06-19  9367  		 * migrations would happen or not.
32fe13cd7aa184 Abel Wu         2022-06-19  9368  		 */
fcc108377a7cf7 Abel Wu         2022-06-19  9369  		if (!test_idle_cpus(this))
fcc108377a7cf7 Abel Wu         2022-06-19  9370  			set_idle_cpus(this, true);
fcc108377a7cf7 Abel Wu         2022-06-19  9371  	} else {
fcc108377a7cf7 Abel Wu         2022-06-19  9372  		/*
fcc108377a7cf7 Abel Wu         2022-06-19  9373  		 * Nothing changes so nothing to update or
fcc108377a7cf7 Abel Wu         2022-06-19  9374  		 * propagate.
fcc108377a7cf7 Abel Wu         2022-06-19  9375  		 */
fcc108377a7cf7 Abel Wu         2022-06-19  9376  		if (sd_smt_shared->state == sd_is_busy)
fcc108377a7cf7 Abel Wu         2022-06-19  9377  			goto out;
32fe13cd7aa184 Abel Wu         2022-06-19  9378  
32fe13cd7aa184 Abel Wu         2022-06-19  9379  		/*
32fe13cd7aa184 Abel Wu         2022-06-19  9380  		 * Allow false positive to exist for some time
32fe13cd7aa184 Abel Wu         2022-06-19  9381  		 * to make a tradeoff of accuracy of the filter
32fe13cd7aa184 Abel Wu         2022-06-19  9382  		 * for relieving cache traffic.
32fe13cd7aa184 Abel Wu         2022-06-19  9383  		 */
32fe13cd7aa184 Abel Wu         2022-06-19  9384  		if (sd_smt_shared->state == sd_has_icpus) {
32fe13cd7aa184 Abel Wu         2022-06-19  9385  			new = sd_may_idle;
32fe13cd7aa184 Abel Wu         2022-06-19  9386  			goto update;
32fe13cd7aa184 Abel Wu         2022-06-19  9387  		}
32fe13cd7aa184 Abel Wu         2022-06-19  9388  
32fe13cd7aa184 Abel Wu         2022-06-19  9389  		/*
32fe13cd7aa184 Abel Wu         2022-06-19  9390  		 * If the false positive issue has already been
32fe13cd7aa184 Abel Wu         2022-06-19  9391  		 * there for a while, a correction of the filter
32fe13cd7aa184 Abel Wu         2022-06-19  9392  		 * is needed.
32fe13cd7aa184 Abel Wu         2022-06-19  9393  		 */
fcc108377a7cf7 Abel Wu         2022-06-19  9394  	}
fcc108377a7cf7 Abel Wu         2022-06-19  9395  
fcc108377a7cf7 Abel Wu         2022-06-19  9396  	sd_update_icpus(this, sds->idle_cpu);
32fe13cd7aa184 Abel Wu         2022-06-19  9397  update:
fcc108377a7cf7 Abel Wu         2022-06-19  9398  	sd_smt_shared->state = new;
fcc108377a7cf7 Abel Wu         2022-06-19  9399  out:
fcc108377a7cf7 Abel Wu         2022-06-19  9400  	xchg(&sd_smt_shared->updating, 0);
ba6ee6cee1ed2f Abel Wu         2022-06-19  9401  }
ba6ee6cee1ed2f Abel Wu         2022-06-19  9402  

-- 
0-DAY CI Kernel Test Service
https://01.org/lkp

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [PATCH v4 0/7] sched/fair: improve scan efficiency of SIS
@ 2022-06-19 12:04 Abel Wu
  2022-06-19 12:04 ` [PATCH v4 7/7] sched/fair: de-entropy for SIS filter Abel Wu
  0 siblings, 1 reply; 7+ messages in thread
From: Abel Wu @ 2022-06-19 12:04 UTC (permalink / raw)
  To: Peter Zijlstra, Mel Gorman, Vincent Guittot
  Cc: Josh Don, Chen Yu, Tim Chen, K Prateek Nayak, Gautham R . Shenoy,
	linux-kernel, Abel Wu

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset=y, Size: 18092 bytes --]

The wakeup fastpath (select_idle_sibling or SIS) plays an important role
in maximizing the usage of cpu resources and can greatly affect overall
performance of the system.

The SIS tries to find an idle cpu inside that LLC to place the woken-up
task. The cache hot cpus will be checked first, then other cpus of that
LLC (domain scan) if the hot ones are not idle.

The domain scan works well under light workload by simply traversing
the cpus of the LLC due to lots of idle cpus can be available. But this
doesn’t scale well once the LLC gets bigger and the load increases, so
SIS_PROP was born to limit the scan cost. For now SIS_PROP just limits
the number of cpus to be scanned, but the way of how it scans is not
changed.

This patchset introduces the SIS filter to help improving scan efficiency
when scan depth is limited. The filter only contains the unoccupied cpus,
and is updated during SMT level load balancing. It is expected that the
more overloaded the system is, the less cpus will be scanned.

SIS_PROP
=======

The filter should be put under SIS_PROP which is not now, because this
sched-feature is under rework or been replaced (by SIS_UTIL [1]). Will
be adjusted once that work is ready.

[1] https://lore.kernel.org/all/20220612163428.849378-1-yu.c.chen@intel.com/

Entrance: Load balance vs. Tick
===============================

Now we update the filter during load balancing at SMT level, while it was
done at the tick (patch v1-v3). There are several reasons:

  - The periodic load balancing has the admission control that only the
    first idle cpu (if any) or the balance cpu can trigger balance, so
    the complexity is reduced by O(n) in the domain scope. While on the
    other hand, the complexity of per-cpu updating at the tick is not
    reduced, and can cause heavy traffic on the LLC cache.

  - In order to keep the filter relatively fresh, the cpu idle path needs
    to be taken into consideration as well. Since load_balance is the
    common path for the periodic and newly-idle balance, we can simplify
    and concentrate on load_balance itself. While the tick way both tick
    and idle path need to be modified.

  - The calculation of SMT state can re-use the intermediate results of
    the load balance, so it’s cheap without sacrifice.

But the tick way also has its advantages in the idle path that the cpus
set to filter are guaranteed to be idle compared to load_balance in which
the filter is updated before task migrations (so the idle dst_cpu can be
fed with tasks). This can be fixed by false positive correction at later
discussion.

Filter: Busy vs. Overloaded
===========================

In preious patch versions, rather than the busy or unoccupied cpus, only
the overloaded ones will be set into the filter because of its stableness.
For example, the state of the unoccupied cpus (non-idle tasks == 0) or the
busy cpus (== 1) can be easily changed by a short running workload. And
besides, wakup won’t pollute the filter since add tasks to the already
overloaded cpus can’t change their states.

But it’s different in this patchset in which the load balance cpu will be
responsible for updating its core state. The state will be set to has_icpus
if unoccupied cpus exist no matter what the other cpu states are. So the
overloaded state of a core is also unstable. Given this, the semantics of
busy is much preferred than overloaded due to its accuracy.

The only problem comes to cache, again. To avoid frequent update thanks
to the variance, the false positive bits in the filter are allowed to exist
for a while, which will be talked about later.

Incremental Propagation
=======================

By design, the unoccupied cpus will be set (or propagate) to the filter.
But it’s heavy to update the whole SMT state every time. What we actually
do is only to propagate one unoccupied cpu of this core to the filter at
a time, so called incremental propagation. It’s not just a tradeoff, it
also helps spreading load to different cores.

The cpu to be propagated needs to be selected carefully. If the load
balance cpu is chosen, the filter might be polluted because that cpu
could be fed with tasks soon. But it’s not wise either to wait to the
end of load balancing as we need to do it as early as possible to keep
the filter fresh. The solution is false positive correction too.

Accuracy: False Positive Correction
===================================

The filter is supposed to contain unoccupied cpus. There are two main
cases that will downgrade the filter:

  - False negtive: there are unoccupied cpus not represented in the
    filter. This is the case we should try to eliminate because these
    cpus will be out of reach in the wakeup leading to in-sufficient
    use of cpu resources. So in practical the unoccupied cpus will be
    aggressively propagated to the filter.

  - False positive: some of the cpus in the filter are not unoccupied
    any more. We allow these cpus to stay in the filter for some time
    to relieve cache pressure. It's a tradeoff.

A correction is needed when the false positive case is continuously
detected. See patch 7 for detailed info.

SMT 2/4/8
=========

The incremental propagation is relatively friendly to SMT2, but not for
the others. As the core grows bigger, the more false negtive cpus could
exist. It would be appreciated if anyone can help testing on SMT 4/8
machines.

Benchmark
=========

Tests are done in an Intel Xeon(R) Platinum 8260 CPU@2.40GHz machine
with 2 NUMA nodes each of which has 24 cores with SMT2 enabled, so 96
CPUs in total.

All of the benchmarks are done inside a normal cpu cgroup in a clean
environment with cpu turbo disabled.

Based on tip sched/core f3dd3f674555 (v5.19-rc2).

Results
=======

hackbench-process-pipes
                             vanilla		     filter
Amean     1        0.2357 (   0.00%)      0.2540 (  -7.78%)
Amean     4        0.6403 (   0.00%)      0.6363 (   0.62%)
Amean     7        0.7583 (   0.00%)      0.7367 (   2.86%)
Amean     12       1.0703 (   0.00%)      1.0520 (   1.71%)
Amean     21       2.0363 (   0.00%)      1.9610 *   3.70%*
Amean     30       3.2487 (   0.00%)      2.9083 *  10.48%*
Amean     48       6.3620 (   0.00%)      4.8543 *  23.70%*
Amean     79       8.3653 (   0.00%)      7.1077 *  15.03%*
Amean     110      9.8370 (   0.00%)      8.5740 *  12.84%*
Amean     141     11.4667 (   0.00%)     10.8750 *   5.16%*
Amean     172     13.4433 (   0.00%)     12.6443 *   5.94%*
Amean     203     15.8970 (   0.00%)     14.9143 *   6.18%*
Amean     234     17.9643 (   0.00%)     16.9123 *   5.86%*
Amean     265     20.3910 (   0.00%)     19.2040 *   5.82%*
Amean     296     22.5253 (   0.00%)     21.2547 *   5.64%*

hackbench-process-sockets
                             vanilla		     filter
Amean     1        0.4177 (   0.00%)      0.4133 (   1.04%)
Amean     4        1.4397 (   0.00%)      1.4240 *   1.09%*
Amean     7        2.4720 (   0.00%)      2.4310 *   1.66%*
Amean     12       4.1407 (   0.00%)      4.0683 *   1.75%*
Amean     21       7.0550 (   0.00%)      6.8830 *   2.44%*
Amean     30       9.9633 (   0.00%)      9.7750 *   1.89%*
Amean     48      15.9837 (   0.00%)     15.5313 *   2.83%*
Amean     79      26.7740 (   0.00%)     26.2703 *   1.88%*
Amean     110     37.2913 (   0.00%)     36.5433 *   2.01%*
Amean     141     47.8937 (   0.00%)     46.5300 *   2.85%*
Amean     172     58.0273 (   0.00%)     56.4530 *   2.71%*
Amean     203     68.2530 (   0.00%)     66.3320 *   2.81%*
Amean     234     78.8987 (   0.00%)     76.8497 *   2.60%*
Amean     265     89.1520 (   0.00%)     86.8213 *   2.61%*
Amean     296     99.6920 (   0.00%)     96.9707 *   2.73%*

hackbench-thread-pipes
                             vanilla		     filter
Amean     1        0.2647 (   0.00%)      0.2633 (   0.50%)
Amean     4        0.6290 (   0.00%)      0.6607 (  -5.03%)
Amean     7        0.7850 (   0.00%)      0.7870 (  -0.25%)
Amean     12       1.3347 (   0.00%)      1.2577 (   5.77%)
Amean     21       3.1233 (   0.00%)      2.4613 *  21.20%*
Amean     30       5.7120 (   0.00%)      3.6847 *  35.49%*
Amean     48       8.1947 (   0.00%)      6.2670 *  23.52%*
Amean     79       9.1750 (   0.00%)      8.0640 *  12.11%*
Amean     110     10.6300 (   0.00%)      9.5583 *  10.08%*
Amean     141     12.7490 (   0.00%)     12.0167 *   5.74%*
Amean     172     15.1567 (   0.00%)     14.1570 *   6.60%*
Amean     203     17.5160 (   0.00%)     16.7883 (   4.15%)
Amean     234     19.8710 (   0.00%)     19.5370 (   1.68%)
Amean     265     23.2700 (   0.00%)     21.4017 *   8.03%*
Amean     296     25.4093 (   0.00%)     23.9943 *   5.57%*

hackbench-thread-sockets
                             vanilla		     filter
Amean     1        0.4467 (   0.00%)      0.4347 (   2.69%)
Amean     4        1.4757 (   0.00%)      1.4533 *   1.51%*
Amean     7        2.5320 (   0.00%)      2.4993 *   1.29%*
Amean     12       4.2617 (   0.00%)      4.1780 *   1.96%*
Amean     21       7.2397 (   0.00%)      7.0660 *   2.40%*
Amean     30      10.2200 (   0.00%)      9.9810 *   2.34%*
Amean     48      16.2623 (   0.00%)     16.0483 *   1.32%*
Amean     79      27.4307 (   0.00%)     26.8410 *   2.15%*
Amean     110     37.8993 (   0.00%)     37.3223 *   1.52%*
Amean     141     48.3890 (   0.00%)     47.4823 *   1.87%*
Amean     172     58.9887 (   0.00%)     57.7753 *   2.06%*
Amean     203     69.5853 (   0.00%)     68.0563 *   2.20%*
Amean     234     80.0743 (   0.00%)     78.4857 *   1.98%*
Amean     265     90.5473 (   0.00%)     89.3363 *   1.34%*
Amean     296    101.3857 (   0.00%)     99.7717 *   1.59%*

schbench
				   vanilla		   filter
Lat 50.0th-qrtle-1         6.00 (   0.00%)        6.00 (   0.00%)
Lat 75.0th-qrtle-1         8.00 (   0.00%)        8.00 (   0.00%)
Lat 90.0th-qrtle-1         9.00 (   0.00%)        8.00 (  11.11%)
Lat 95.0th-qrtle-1         9.00 (   0.00%)        8.00 (  11.11%)
Lat 99.0th-qrtle-1        10.00 (   0.00%)        9.00 (  10.00%)
Lat 99.5th-qrtle-1        11.00 (   0.00%)        9.00 (  18.18%)
Lat 99.9th-qrtle-1        11.00 (   0.00%)        9.00 (  18.18%)
Lat 50.0th-qrtle-2         6.00 (   0.00%)        7.00 ( -16.67%)
Lat 75.0th-qrtle-2         7.00 (   0.00%)        8.00 ( -14.29%)
Lat 90.0th-qrtle-2         8.00 (   0.00%)        9.00 ( -12.50%)
Lat 95.0th-qrtle-2         8.00 (   0.00%)       10.00 ( -25.00%)
Lat 99.0th-qrtle-2         9.00 (   0.00%)       11.00 ( -22.22%)
Lat 99.5th-qrtle-2         9.00 (   0.00%)       11.00 ( -22.22%)
Lat 99.9th-qrtle-2         9.00 (   0.00%)       11.00 ( -22.22%)
Lat 50.0th-qrtle-4         9.00 (   0.00%)        8.00 (  11.11%)
Lat 75.0th-qrtle-4        10.00 (   0.00%)       10.00 (   0.00%)
Lat 90.0th-qrtle-4        11.00 (   0.00%)       11.00 (   0.00%)
Lat 95.0th-qrtle-4        12.00 (   0.00%)       11.00 (   8.33%)
Lat 99.0th-qrtle-4        13.00 (   0.00%)       13.00 (   0.00%)
Lat 99.5th-qrtle-4        13.00 (   0.00%)       16.00 ( -23.08%)
Lat 99.9th-qrtle-4        13.00 (   0.00%)       19.00 ( -46.15%)
Lat 50.0th-qrtle-8        11.00 (   0.00%)       11.00 (   0.00%)
Lat 75.0th-qrtle-8        14.00 (   0.00%)       14.00 (   0.00%)
Lat 90.0th-qrtle-8        16.00 (   0.00%)       16.00 (   0.00%)
Lat 95.0th-qrtle-8        17.00 (   0.00%)       17.00 (   0.00%)
Lat 99.0th-qrtle-8        22.00 (   0.00%)       19.00 (  13.64%)
Lat 99.5th-qrtle-8        28.00 (   0.00%)       23.00 (  17.86%)
Lat 99.9th-qrtle-8        31.00 (   0.00%)       42.00 ( -35.48%)
Lat 50.0th-qrtle-16       17.00 (   0.00%)       17.00 (   0.00%)
Lat 75.0th-qrtle-16       23.00 (   0.00%)       23.00 (   0.00%)
Lat 90.0th-qrtle-16       26.00 (   0.00%)       27.00 (  -3.85%)
Lat 95.0th-qrtle-16       28.00 (   0.00%)       29.00 (  -3.57%)
Lat 99.0th-qrtle-16       32.00 (   0.00%)       33.00 (  -3.12%)
Lat 99.5th-qrtle-16       37.00 (   0.00%)       35.00 (   5.41%)
Lat 99.9th-qrtle-16       54.00 (   0.00%)       46.00 (  14.81%)
Lat 50.0th-qrtle-32       30.00 (   0.00%)       29.00 (   3.33%)
Lat 75.0th-qrtle-32       43.00 (   0.00%)       42.00 (   2.33%)
Lat 90.0th-qrtle-32       51.00 (   0.00%)       49.00 (   3.92%)
Lat 95.0th-qrtle-32       54.00 (   0.00%)       51.00 (   5.56%)
Lat 99.0th-qrtle-32       61.00 (   0.00%)       57.00 (   6.56%)
Lat 99.5th-qrtle-32       64.00 (   0.00%)       60.00 (   6.25%)
Lat 99.9th-qrtle-32       72.00 (   0.00%)       82.00 ( -13.89%)
Lat 50.0th-qrtle-47       44.00 (   0.00%)       45.00 (  -2.27%)
Lat 75.0th-qrtle-47       64.00 (   0.00%)       65.00 (  -1.56%)
Lat 90.0th-qrtle-47       75.00 (   0.00%)       77.00 (  -2.67%)
Lat 95.0th-qrtle-47       81.00 (   0.00%)       82.00 (  -1.23%)
Lat 99.0th-qrtle-47       92.00 (   0.00%)       98.00 (  -6.52%)
Lat 99.5th-qrtle-47      101.00 (   0.00%)      114.00 ( -12.87%)
Lat 99.9th-qrtle-47      271.00 (   0.00%)      167.00 (  38.38%)

netperf-udp
				    vanilla		    filter
Hmean     send-64         199.12 (   0.00%)      201.32 (   1.11%)
Hmean     send-128        396.22 (   0.00%)      397.01 (   0.20%)
Hmean     send-256        777.80 (   0.00%)      783.96 (   0.79%)
Hmean     send-1024      2972.62 (   0.00%)     3011.87 *   1.32%*
Hmean     send-2048      5600.64 (   0.00%)     5730.50 *   2.32%*
Hmean     send-3312      8757.45 (   0.00%)     8703.62 (  -0.61%)
Hmean     send-4096     10578.90 (   0.00%)    10590.93 (   0.11%)
Hmean     send-8192     17051.22 (   0.00%)    17189.62 *   0.81%*
Hmean     send-16384    27915.16 (   0.00%)    27816.01 (  -0.36%)
Hmean     recv-64         199.12 (   0.00%)      201.32 (   1.11%)
Hmean     recv-128        396.22 (   0.00%)      397.01 (   0.20%)
Hmean     recv-256        777.80 (   0.00%)      783.96 (   0.79%)
Hmean     recv-1024      2972.62 (   0.00%)     3011.87 *   1.32%*
Hmean     recv-2048      5600.64 (   0.00%)     5730.49 *   2.32%*
Hmean     recv-3312      8757.45 (   0.00%)     8703.61 (  -0.61%)
Hmean     recv-4096     10578.90 (   0.00%)    10590.93 (   0.11%)
Hmean     recv-8192     17051.21 (   0.00%)    17189.57 *   0.81%*
Hmean     recv-16384    27915.08 (   0.00%)    27815.86 (  -0.36%)

netperf-tcp
			       vanilla		       filter
Hmean     64         811.07 (   0.00%)      835.46 *   3.01%*
Hmean     128       1614.86 (   0.00%)     1652.27 *   2.32%*
Hmean     256       3131.16 (   0.00%)     3119.01 (  -0.39%)
Hmean     1024     10286.12 (   0.00%)    10333.64 (   0.46%)
Hmean     2048     16231.88 (   0.00%)    17141.88 *   5.61%*
Hmean     3312     20705.91 (   0.00%)    21703.49 *   4.82%*
Hmean     4096     22650.75 (   0.00%)    23904.09 *   5.53%*
Hmean     8192     27984.06 (   0.00%)    29170.57 *   4.24%*
Hmean     16384    32816.85 (   0.00%)    33351.41 *   1.63%*

tbench4 Throughput
			     vanilla		     filter
Hmean     1        300.07 (   0.00%)      302.52 *   0.82%*
Hmean     2        617.72 (   0.00%)      598.45 *  -3.12%*
Hmean     4       1213.99 (   0.00%)     1206.36 *  -0.63%*
Hmean     8       2373.78 (   0.00%)     2372.28 *  -0.06%*
Hmean     16      4777.82 (   0.00%)     4711.44 *  -1.39%*
Hmean     32      7182.50 (   0.00%)     7718.15 *   7.46%*
Hmean     64      8611.44 (   0.00%)     9409.29 *   9.27%*
Hmean     128    18102.63 (   0.00%)    20650.23 *  14.07%*
Hmean     256    18029.28 (   0.00%)    20611.03 *  14.32%*
Hmean     384    17986.44 (   0.00%)    19361.29 *   7.64%*

Conclusion
==========

There seems like not much obvious regressions except the hackbench pipe
tests with minor groups like 1/4, and some of the benchmarks showed great
improvement.

As preious said, the domain scan works well under light workload by simply
traversing the cpus of the LLC due to lots of idle cpus can be available.
So in this case, the cost of maintaining the filter can hardly contribute
to the scan efficiency, but rather adds to the scheduling overhead slowing
down the system.

One reasonable way to fix it is to disable the filter when the system is
not loaded. Three possible solutions come to my mind:

  a. Disable the filter when the number of unoccupied cpus in the LLC
     is more than a threshold. This is straight but not easy to define a
     proper threshold for different topologies or workloads. So I gave up
     on this quickly.

  b. Enable the filter when average idle time reduces. This is what the
     SIS_PROP relies on. I did a test on enabling the filter when nr=4,
     but the result is not convincing..

  c. Enable the filter when scan efficiency drops to a threshold. Like a,
     the threshold is hard to define.

I am still working on this and open to discussion and suggestions.
Comments and tests are appreciated!

v3 -> v4:
  - update filter in load_balance rather than in the tick
  - now the filter contains unoccupied cpus rather than overloaded ones
  - added mechanisms to deal with the false positive cases

v2 -> v3:
  - removed sched-idle balance feature and focus on SIS
  - take non-CFS tasks into consideration
  - several fixes/improvement suggested by Josh Don

v1 -> v2:
  - several optimizations on sched-idle balancing
  - ignore asym topos in can_migrate_task
  - add more benchmarks including SIS efficiency
  - re-organize patch as suggested by Mel

v3: https://lore.kernel.org/lkml/53fcde27-7dd2-5150-633b-4e2acc37bb6f@bytedance.com/
v2: https://lore.kernel.org/lkml/20220409135104.3733193-1-wuyun.abel@bytedance.com/
v1: https://lore.kernel.org/lkml/20220217154403.6497-5-wuyun.abel@bytedance.com/

Abel Wu (7):
  sched/fair: default to false in test_idle_cores
  sched/fair: remove redundant check in select_idle_smt
  sched/fair: avoid double search on same cpu
  sched/fair: remove useless check in select_idle_core
  sched/fair: skip SIS domain search if fully busy
  sched/fair: skip busy cores in SIS search
  sched/fair: de-entropy for SIS filter

 include/linux/sched/topology.h |  62 ++++++++-
 kernel/sched/fair.c            | 233 +++++++++++++++++++++++++++++----
 kernel/sched/topology.c        |  12 +-
 3 files changed, 277 insertions(+), 30 deletions(-)

-- 
2.31.1


^ permalink raw reply	[flat|nested] 7+ messages in thread

* [PATCH v4 7/7] sched/fair: de-entropy for SIS filter
  2022-06-19 12:04 [PATCH v4 0/7] sched/fair: improve scan efficiency of SIS Abel Wu
@ 2022-06-19 12:04 ` Abel Wu
  2022-06-21 18:23   ` Chen Yu
                     ` (2 more replies)
  0 siblings, 3 replies; 7+ messages in thread
From: Abel Wu @ 2022-06-19 12:04 UTC (permalink / raw)
  To: Peter Zijlstra, Mel Gorman, Vincent Guittot
  Cc: Josh Don, Chen Yu, Tim Chen, K Prateek Nayak, Gautham R . Shenoy,
	linux-kernel, Abel Wu

Now when updating core state, there are two main problems that could
pollute the SIS filter:

  - The updating is before task migration, so if dst_cpu is
    selected to be propagated which might be fed with tasks
    soon, the efforts we paid is no more than setting a busy
    cpu into the SIS filter. While on the other hand it is
    important that we update as early as possible to keep the
    filter fresh, so it's not wise to delay the update to the
    end of load balancing.

  - False negative propagation hurts performance since some
    idle cpus could be out of reach. So in general we will
    aggressively propagate idle cpus but allow false positive
    continue to exist for a while, which may lead to filter
    being fully polluted.

Pains can be relieved by a force correction when false positive
continuously detected.

Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
---
 include/linux/sched/topology.h |  7 +++++
 kernel/sched/fair.c            | 51 ++++++++++++++++++++++++++++++++--
 2 files changed, 55 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index b93edf587d84..e3552ce192a9 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -91,6 +91,12 @@ struct sched_group;
  *	search, and is also used as a fallback state of the other
  *	states.
  *
+ * - sd_may_idle
+ *	This state implies the unstableness of the SIS filter, and
+ *	some bits of it may out of date. This state is only used in
+ *	SMT domains as an intermediate state between sd_has_icpus
+ *	and sd_is_busy.
+ *
  * - sd_is_busy
  *	This state indicates there are no unoccupied cpus in this
  *	domain. So for LLC domains, it gives the hint on whether
@@ -111,6 +117,7 @@ struct sched_group;
 enum sd_state {
 	sd_has_icores,
 	sd_has_icpus,
+	sd_may_idle,
 	sd_is_busy
 };
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d55fdcedf2c0..9713d183d35e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8768,6 +8768,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 
 	for_each_cpu_and(i, sched_group_span(group), env->cpus) {
 		struct rq *rq = cpu_rq(i);
+		bool update = update_core && (env->dst_cpu != i);
 
 		sgs->group_load += cpu_load(rq);
 		sgs->group_util += cpu_util_cfs(i);
@@ -8777,7 +8778,11 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 		nr_running = rq->nr_running;
 		sgs->sum_nr_running += nr_running;
 
-		if (update_core)
+		/*
+		 * The dst_cpu is not preferred since it might
+		 * be fed with tasks soon.
+		 */
+		if (update)
 			sd_classify(sds, rq, i);
 
 		if (nr_running > 1)
@@ -8801,7 +8806,8 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 			 * and fed with tasks, so we'd better choose
 			 * a candidate in an opposite way.
 			 */
-			sds->idle_cpu = i;
+			if (update)
+				sds->idle_cpu = i;
 			sgs->idle_cpus++;
 
 			/* Idle cpu can't have misfit task */
@@ -9321,7 +9327,7 @@ static void sd_update_state(struct lb_env *env, struct sd_lb_stats *sds)
 {
 	struct sched_domain_shared *sd_smt_shared = env->sd->shared;
 	enum sd_state new = sds->sd_state;
-	int this = env->dst_cpu;
+	int icpu = sds->idle_cpu, this = env->dst_cpu;
 
 	/*
 	 * Parallel updating can hardly contribute accuracy to
@@ -9331,6 +9337,22 @@ static void sd_update_state(struct lb_env *env, struct sd_lb_stats *sds)
 	if (cmpxchg(&sd_smt_shared->updating, 0, 1))
 		return;
 
+	/*
+	 * The dst_cpu is likely to be fed with tasks soon.
+	 * If it is the only unoccupied cpu in this domain,
+	 * we still handle it the same way as as_has_icpus
+	 * but turn the SMT into the unstable state, rather
+	 * than waiting to the end of load balancing since
+	 * it's also important that update the filter as
+	 * early as possible to keep it fresh.
+	 */
+	if (new == sd_is_busy) {
+		if (idle_cpu(this) || sched_idle_cpu(this)) {
+			new = sd_may_idle;
+			icpu = this;
+		}
+	}
+
 	/*
 	 * There is at least one unoccupied cpu available, so
 	 * propagate it to the filter to avoid false negative
@@ -9338,6 +9360,12 @@ static void sd_update_state(struct lb_env *env, struct sd_lb_stats *sds)
 	 * idle cpus thus throughupt downgraded.
 	 */
 	if (new != sd_is_busy) {
+		/*
+		 * The sd_may_idle state is taken into
+		 * consideration as well because from
+		 * here we couldn't actually know task
+		 * migrations would happen or not.
+		 */
 		if (!test_idle_cpus(this))
 			set_idle_cpus(this, true);
 	} else {
@@ -9347,9 +9375,26 @@ static void sd_update_state(struct lb_env *env, struct sd_lb_stats *sds)
 		 */
 		if (sd_smt_shared->state == sd_is_busy)
 			goto out;
+
+		/*
+		 * Allow false positive to exist for some time
+		 * to make a tradeoff of accuracy of the filter
+		 * for relieving cache traffic.
+		 */
+		if (sd_smt_shared->state == sd_has_icpus) {
+			new = sd_may_idle;
+			goto update;
+		}
+
+		/*
+		 * If the false positive issue has already been
+		 * there for a while, a correction of the filter
+		 * is needed.
+		 */
 	}
 
 	sd_update_icpus(this, sds->idle_cpu);
+update:
 	sd_smt_shared->state = new;
 out:
 	xchg(&sd_smt_shared->updating, 0);
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH v4 7/7] sched/fair: de-entropy for SIS filter
  2022-06-19 12:04 ` [PATCH v4 7/7] sched/fair: de-entropy for SIS filter Abel Wu
@ 2022-06-21 18:23   ` Chen Yu
  2022-06-22  4:01     ` Abel Wu
  2022-06-30  7:46   ` Abel Wu
  2022-07-20 17:08   ` Gautham R. Shenoy
  2 siblings, 1 reply; 7+ messages in thread
From: Chen Yu @ 2022-06-21 18:23 UTC (permalink / raw)
  To: Abel Wu
  Cc: Peter Zijlstra, Mel Gorman, Vincent Guittot, Josh Don, Tim Chen,
	K Prateek Nayak, Gautham R . Shenoy, linux-kernel

On Sun, Jun 19, 2022 at 08:04:51PM +0800, Abel Wu wrote:
> Now when updating core state, there are two main problems that could
> pollute the SIS filter:
> 
>   - The updating is before task migration, so if dst_cpu is
>     selected to be propagated which might be fed with tasks
>     soon, the efforts we paid is no more than setting a busy
>     cpu into the SIS filter. While on the other hand it is
>     important that we update as early as possible to keep the
>     filter fresh, so it's not wise to delay the update to the
>     end of load balancing.
> 
>   - False negative propagation hurts performance since some
>     idle cpus could be out of reach. So in general we will
>     aggressively propagate idle cpus but allow false positive
>     continue to exist for a while, which may lead to filter
>     being fully polluted.
> 
> Pains can be relieved by a force correction when false positive
> continuously detected.
> 
> Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
> ---
>  include/linux/sched/topology.h |  7 +++++
>  kernel/sched/fair.c            | 51 ++++++++++++++++++++++++++++++++--
>  2 files changed, 55 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> index b93edf587d84..e3552ce192a9 100644
> --- a/include/linux/sched/topology.h
> +++ b/include/linux/sched/topology.h
> @@ -91,6 +91,12 @@ struct sched_group;
>   *	search, and is also used as a fallback state of the other
>   *	states.
>   *
> + * - sd_may_idle
> + *	This state implies the unstableness of the SIS filter, and
> + *	some bits of it may out of date. This state is only used in
> + *	SMT domains as an intermediate state between sd_has_icpus
> + *	and sd_is_busy.
> + *
>   * - sd_is_busy
>   *	This state indicates there are no unoccupied cpus in this
>   *	domain. So for LLC domains, it gives the hint on whether
> @@ -111,6 +117,7 @@ struct sched_group;
>  enum sd_state {
>  	sd_has_icores,
>  	sd_has_icpus,
> +	sd_may_idle,
>  	sd_is_busy
>  };
>  
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index d55fdcedf2c0..9713d183d35e 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8768,6 +8768,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
>  
>  	for_each_cpu_and(i, sched_group_span(group), env->cpus) {
>  		struct rq *rq = cpu_rq(i);
> +		bool update = update_core && (env->dst_cpu != i);
>  
>  		sgs->group_load += cpu_load(rq);
>  		sgs->group_util += cpu_util_cfs(i);
> @@ -8777,7 +8778,11 @@ static inline void update_sg_lb_stats(struct lb_env *env,
>  		nr_running = rq->nr_running;
>  		sgs->sum_nr_running += nr_running;
>  
> -		if (update_core)
> +		/*
> +		 * The dst_cpu is not preferred since it might
> +		 * be fed with tasks soon.
> +		 */
> +		if (update)
maybe if (update_core && (env->dst_cpu != i)) so that the comment would
be near the code logic and meanwhile without introducing a update variable?
>  			sd_classify(sds, rq, i);
>  
>  		if (nr_running > 1)
> @@ -8801,7 +8806,8 @@ static inline void update_sg_lb_stats(struct lb_env *env,
>  			 * and fed with tasks, so we'd better choose
>  			 * a candidate in an opposite way.
>  			 */
> -			sds->idle_cpu = i;
> +			if (update)
> +				sds->idle_cpu = i;
>  			sgs->idle_cpus++;
>  
>  			/* Idle cpu can't have misfit task */
> @@ -9321,7 +9327,7 @@ static void sd_update_state(struct lb_env *env, struct sd_lb_stats *sds)
>  {
>  	struct sched_domain_shared *sd_smt_shared = env->sd->shared;
>  	enum sd_state new = sds->sd_state;
> -	int this = env->dst_cpu;
> +	int icpu = sds->idle_cpu, this = env->dst_cpu;
>  
>  	/*
>  	 * Parallel updating can hardly contribute accuracy to
> @@ -9331,6 +9337,22 @@ static void sd_update_state(struct lb_env *env, struct sd_lb_stats *sds)
>  	if (cmpxchg(&sd_smt_shared->updating, 0, 1))
>  		return;
>  
> +	/*
> +	 * The dst_cpu is likely to be fed with tasks soon.
> +	 * If it is the only unoccupied cpu in this domain,
> +	 * we still handle it the same way as as_has_icpus
> +	 * but turn the SMT into the unstable state, rather
> +	 * than waiting to the end of load balancing since
> +	 * it's also important that update the filter as
> +	 * early as possible to keep it fresh.
> +	 */
> +	if (new == sd_is_busy) {
> +		if (idle_cpu(this) || sched_idle_cpu(this)) {
available_idle_cpu()?

thanks,
Chenyu
> +			new = sd_may_idle;
> +			icpu = this;
> +		}
> +	}
> +
>  	/*
>  	 * There is at least one unoccupied cpu available, so
>  	 * propagate it to the filter to avoid false negative
> @@ -9338,6 +9360,12 @@ static void sd_update_state(struct lb_env *env, struct sd_lb_stats *sds)
>  	 * idle cpus thus throughupt downgraded.
>  	 */
>  	if (new != sd_is_busy) {
> +		/*
> +		 * The sd_may_idle state is taken into
> +		 * consideration as well because from
> +		 * here we couldn't actually know task
> +		 * migrations would happen or not.
> +		 */
>  		if (!test_idle_cpus(this))
>  			set_idle_cpus(this, true);
>  	} else {
> @@ -9347,9 +9375,26 @@ static void sd_update_state(struct lb_env *env, struct sd_lb_stats *sds)
>  		 */
>  		if (sd_smt_shared->state == sd_is_busy)
>  			goto out;
> +
> +		/*
> +		 * Allow false positive to exist for some time
> +		 * to make a tradeoff of accuracy of the filter
> +		 * for relieving cache traffic.
> +		 */
> +		if (sd_smt_shared->state == sd_has_icpus) {
> +			new = sd_may_idle;
> +			goto update;
> +		}
> +
> +		/*
> +		 * If the false positive issue has already been
> +		 * there for a while, a correction of the filter
> +		 * is needed.
> +		 */
>  	}
>  
>  	sd_update_icpus(this, sds->idle_cpu);
> +update:
>  	sd_smt_shared->state = new;
>  out:
>  	xchg(&sd_smt_shared->updating, 0);
> -- 
> 2.31.1
> 

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v4 7/7] sched/fair: de-entropy for SIS filter
  2022-06-21 18:23   ` Chen Yu
@ 2022-06-22  4:01     ` Abel Wu
  0 siblings, 0 replies; 7+ messages in thread
From: Abel Wu @ 2022-06-22  4:01 UTC (permalink / raw)
  To: Chen Yu
  Cc: Peter Zijlstra, Mel Gorman, Vincent Guittot, Josh Don, Tim Chen,
	K Prateek Nayak, Gautham R . Shenoy, linux-kernel


On 6/22/22 2:23 AM, Chen Yu Wrote:
> On Sun, Jun 19, 2022 at 08:04:51PM +0800, Abel Wu wrote:
>> ...
>> @@ -8777,7 +8778,11 @@ static inline void update_sg_lb_stats(struct lb_env *env,
>>   		nr_running = rq->nr_running;
>>   		sgs->sum_nr_running += nr_running;
>>   
>> -		if (update_core)
>> +		/*
>> +		 * The dst_cpu is not preferred since it might
>> +		 * be fed with tasks soon.
>> +		 */
>> +		if (update)
> maybe if (update_core && (env->dst_cpu != i)) so that the comment would
> be near the code logic and meanwhile without introducing a update variable?

Makes sense.

>> ...
>> @@ -9331,6 +9337,22 @@ static void sd_update_state(struct lb_env *env, struct sd_lb_stats *sds)
>>   	if (cmpxchg(&sd_smt_shared->updating, 0, 1))
>>   		return;
>>   
>> +	/*
>> +	 * The dst_cpu is likely to be fed with tasks soon.
>> +	 * If it is the only unoccupied cpu in this domain,
>> +	 * we still handle it the same way as as_has_icpus
>> +	 * but turn the SMT into the unstable state, rather
>> +	 * than waiting to the end of load balancing since
>> +	 * it's also important that update the filter as
>> +	 * early as possible to keep it fresh.
>> +	 */
>> +	if (new == sd_is_busy) {
>> +		if (idle_cpu(this) || sched_idle_cpu(this)) {
> available_idle_cpu()?
> 

It is used for choosing an idle cpu that will be immediately used,
so generally inside the wakeup path. But here we just want to know
the idle state of the cpus (and later inside wakeup path these cpus
will still be re-checked to see if they are preempted).

Thanks & BR,
Abel

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v4 7/7] sched/fair: de-entropy for SIS filter
  2022-06-19 12:04 ` [PATCH v4 7/7] sched/fair: de-entropy for SIS filter Abel Wu
  2022-06-21 18:23   ` Chen Yu
@ 2022-06-30  7:46   ` Abel Wu
  2022-07-20 17:08   ` Gautham R. Shenoy
  2 siblings, 0 replies; 7+ messages in thread
From: Abel Wu @ 2022-06-30  7:46 UTC (permalink / raw)
  To: Peter Zijlstra, Mel Gorman, Vincent Guittot
  Cc: Josh Don, Chen Yu, Tim Chen, K Prateek Nayak, Gautham R . Shenoy,
	linux-kernel


On 6/19/22 8:04 PM, Abel Wu Wrote:
> Now when updating core state, there are two main problems that could
> pollute the SIS filter:
> 
>    - The updating is before task migration, so if dst_cpu is
>      selected to be propagated which might be fed with tasks
>      soon, the efforts we paid is no more than setting a busy
>      cpu into the SIS filter. While on the other hand it is
>      important that we update as early as possible to keep the
>      filter fresh, so it's not wise to delay the update to the
>      end of load balancing.
> 
>    - False negative propagation hurts performance since some
>      idle cpus could be out of reach. So in general we will
>      aggressively propagate idle cpus but allow false positive
>      continue to exist for a while, which may lead to filter
>      being fully polluted.
> 
> Pains can be relieved by a force correction when false positive
> continuously detected.
> 
> Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
> ---
>   include/linux/sched/topology.h |  7 +++++
>   kernel/sched/fair.c            | 51 ++++++++++++++++++++++++++++++++--
>   2 files changed, 55 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> index b93edf587d84..e3552ce192a9 100644
> --- a/include/linux/sched/topology.h
> +++ b/include/linux/sched/topology.h
> @@ -91,6 +91,12 @@ struct sched_group;
>    *	search, and is also used as a fallback state of the other
>    *	states.
>    *
> + * - sd_may_idle
> + *	This state implies the unstableness of the SIS filter, and
> + *	some bits of it may out of date. This state is only used in
> + *	SMT domains as an intermediate state between sd_has_icpus
> + *	and sd_is_busy.
> + *
>    * - sd_is_busy
>    *	This state indicates there are no unoccupied cpus in this
>    *	domain. So for LLC domains, it gives the hint on whether
> @@ -111,6 +117,7 @@ struct sched_group;
>   enum sd_state {
>   	sd_has_icores,
>   	sd_has_icpus,
> +	sd_may_idle,
>   	sd_is_busy
>   };
>   
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index d55fdcedf2c0..9713d183d35e 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8768,6 +8768,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
>   
>   	for_each_cpu_and(i, sched_group_span(group), env->cpus) {
>   		struct rq *rq = cpu_rq(i);
> +		bool update = update_core && (env->dst_cpu != i);
>   
>   		sgs->group_load += cpu_load(rq);
>   		sgs->group_util += cpu_util_cfs(i);
> @@ -8777,7 +8778,11 @@ static inline void update_sg_lb_stats(struct lb_env *env,
>   		nr_running = rq->nr_running;
>   		sgs->sum_nr_running += nr_running;
>   
> -		if (update_core)
> +		/*
> +		 * The dst_cpu is not preferred since it might
> +		 * be fed with tasks soon.
> +		 */
> +		if (update)
>   			sd_classify(sds, rq, i);
>   
>   		if (nr_running > 1)
> @@ -8801,7 +8806,8 @@ static inline void update_sg_lb_stats(struct lb_env *env,
>   			 * and fed with tasks, so we'd better choose
>   			 * a candidate in an opposite way.
>   			 */
> -			sds->idle_cpu = i;
> +			if (update)
> +				sds->idle_cpu = i;
>   			sgs->idle_cpus++;
>   
>   			/* Idle cpu can't have misfit task */
> @@ -9321,7 +9327,7 @@ static void sd_update_state(struct lb_env *env, struct sd_lb_stats *sds)
>   {
>   	struct sched_domain_shared *sd_smt_shared = env->sd->shared;
>   	enum sd_state new = sds->sd_state;
> -	int this = env->dst_cpu;
> +	int icpu = sds->idle_cpu, this = env->dst_cpu;
>   
>   	/*
>   	 * Parallel updating can hardly contribute accuracy to
> @@ -9331,6 +9337,22 @@ static void sd_update_state(struct lb_env *env, struct sd_lb_stats *sds)
>   	if (cmpxchg(&sd_smt_shared->updating, 0, 1))
>   		return;
>   
> +	/*
> +	 * The dst_cpu is likely to be fed with tasks soon.
> +	 * If it is the only unoccupied cpu in this domain,
> +	 * we still handle it the same way as as_has_icpus
> +	 * but turn the SMT into the unstable state, rather
> +	 * than waiting to the end of load balancing since
> +	 * it's also important that update the filter as
> +	 * early as possible to keep it fresh.
> +	 */
> +	if (new == sd_is_busy) {
> +		if (idle_cpu(this) || sched_idle_cpu(this)) {
> +			new = sd_may_idle;
> +			icpu = this;
> +		}
> +	}
> +
>   	/*
>   	 * There is at least one unoccupied cpu available, so
>   	 * propagate it to the filter to avoid false negative
> @@ -9338,6 +9360,12 @@ static void sd_update_state(struct lb_env *env, struct sd_lb_stats *sds)
>   	 * idle cpus thus throughupt downgraded.
>   	 */
>   	if (new != sd_is_busy) {
> +		/*
> +		 * The sd_may_idle state is taken into
> +		 * consideration as well because from
> +		 * here we couldn't actually know task
> +		 * migrations would happen or not.
> +		 */
>   		if (!test_idle_cpus(this))
>   			set_idle_cpus(this, true);
>   	} else {
> @@ -9347,9 +9375,26 @@ static void sd_update_state(struct lb_env *env, struct sd_lb_stats *sds)
>   		 */
>   		if (sd_smt_shared->state == sd_is_busy)
>   			goto out;
> +
> +		/*
> +		 * Allow false positive to exist for some time
> +		 * to make a tradeoff of accuracy of the filter
> +		 * for relieving cache traffic.
> +		 */
> +		if (sd_smt_shared->state == sd_has_icpus) {
> +			new = sd_may_idle;
> +			goto update;
> +		}
> +
> +		/*
> +		 * If the false positive issue has already been
> +		 * there for a while, a correction of the filter
> +		 * is needed.
> +		 */
>   	}
>   
>   	sd_update_icpus(this, sds->idle_cpu);

The @icpu should be used here rather than 'sds->idle_cpu'..
Will be fixed in next version.

> +update:
>   	sd_smt_shared->state = new;
>   out:
>   	xchg(&sd_smt_shared->updating, 0);

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v4 7/7] sched/fair: de-entropy for SIS filter
  2022-06-19 12:04 ` [PATCH v4 7/7] sched/fair: de-entropy for SIS filter Abel Wu
  2022-06-21 18:23   ` Chen Yu
  2022-06-30  7:46   ` Abel Wu
@ 2022-07-20 17:08   ` Gautham R. Shenoy
  2022-08-15  9:49     ` Abel Wu
  2 siblings, 1 reply; 7+ messages in thread
From: Gautham R. Shenoy @ 2022-07-20 17:08 UTC (permalink / raw)
  To: Abel Wu
  Cc: Peter Zijlstra, Mel Gorman, Vincent Guittot, Josh Don, Chen Yu,
	Tim Chen, K Prateek Nayak, linux-kernel

Hello Abel,

On Sun, Jun 19, 2022 at 08:04:51PM +0800, Abel Wu wrote:
> Now when updating core state, there are two main problems that could
> pollute the SIS filter:
> 
>   - The updating is before task migration, so if dst_cpu is
>     selected to be propagated which might be fed with tasks
>     soon, the efforts we paid is no more than setting a busy
>     cpu into the SIS filter. While on the other hand it is
>     important that we update as early as possible to keep the
>     filter fresh, so it's not wise to delay the update to the
>     end of load balancing.
> 
>   - False negative propagation hurts performance since some
>     idle cpus could be out of reach. So in general we will
>     aggressively propagate idle cpus but allow false positive
>     continue to exist for a while, which may lead to filter
>     being fully polluted.

Ok, so the false positive case is being addressed in this patch.

> 
> Pains can be relieved by a force correction when false positive
> continuously detected.
> 
[..snip..]

> @@ -111,6 +117,7 @@ struct sched_group;
>  enum sd_state {
>  	sd_has_icores,
>  	sd_has_icpus,
> +	sd_may_idle,
>  	sd_is_busy
>  };
>  
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index d55fdcedf2c0..9713d183d35e 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c

[...snip..]

> @@ -9321,7 +9327,7 @@ static void sd_update_state(struct lb_env *env, struct sd_lb_stats *sds)
>  {
>  	struct sched_domain_shared *sd_smt_shared = env->sd->shared;
>  	enum sd_state new = sds->sd_state;
> -	int this = env->dst_cpu;
> +	int icpu = sds->idle_cpu, this = env->dst_cpu;
>  
>  	/*
>  	 * Parallel updating can hardly contribute accuracy to
> @@ -9331,6 +9337,22 @@ static void sd_update_state(struct lb_env *env, struct sd_lb_stats *sds)
>  	if (cmpxchg(&sd_smt_shared->updating, 0, 1))
>  		return;
>  
> +	/*
> +	 * The dst_cpu is likely to be fed with tasks soon.
> +	 * If it is the only unoccupied cpu in this domain,
> +	 * we still handle it the same way as as_has_icpus
                                              ^^^^^^^^^^^^^
Nit:                                          sd_has_icpus

> +	 * but turn the SMT into the unstable state, rather
> +	 * than waiting to the end of load balancing since
> +	 * it's also important that update the filter as
> +	 * early as possible to keep it fresh.
> +	 */
> +	if (new == sd_is_busy) {
> +		if (idle_cpu(this) || sched_idle_cpu(this)) {
> +			new = sd_may_idle;
> +			icpu = this;
> +		}
> +	}
> +
>  	/*
>  	 * There is at least one unoccupied cpu available, so
>  	 * propagate it to the filter to avoid false negative
> @@ -9338,6 +9360,12 @@ static void sd_update_state(struct lb_env *env, struct sd_lb_stats *sds)
>  	 * idle cpus thus throughupt downgraded.
>  	 */
>  	if (new != sd_is_busy) {
> +		/*
> +		 * The sd_may_idle state is taken into
> +		 * consideration as well because from
> +		 * here we couldn't actually know task
> +		 * migrations would happen or not.
> +		 */
>  		if (!test_idle_cpus(this))
>  			set_idle_cpus(this, true);
>  	} else {
> @@ -9347,9 +9375,26 @@ static void sd_update_state(struct lb_env *env, struct sd_lb_stats *sds)
>  		 */
>  		if (sd_smt_shared->state == sd_is_busy)
>  			goto out;
> +
> +		/*
> +		 * Allow false positive to exist for some time
> +		 * to make a tradeoff of accuracy of the filter
> +		 * for relieving cache traffic.
> +		 */

I can understand allowing the false positive to exist when there are
no other idle CPUs in this SMT domain other than this CPU, which is
handled by the case where new != sd_is_busy in the current
load-balance round and will be handled by the "else" clause in the
subsequent round if env->dst_cpu ends up becoming busy.


However, when we know that new == sd_is_busy and the previous state of
this SMT domain was sd_has_icpus, should we not immediately clear this
core's cpumask from the LLCs icpus mask? What is the need for the
intermediate sd_may_idle state transition between sd_has_icpus and
sd_is_busy in this case ?



> +		if (sd_smt_shared->state == sd_has_icpus) {
> +			new = sd_may_idle;
> +			goto update;
> +		}
> +
> +		/*
> +		 * If the false positive issue has already been
> +		 * there for a while, a correction of the filter
> +		 * is needed.
> +		 */
>  	}
>  
>  	sd_update_icpus(this, sds->idle_cpu);
                              ^^^^^^^^^^^^^^

I think you meant to use icpu here ?  sds->idle_cpu == -1 in the case
when new == sd_may_idle. Which will end up clearing this core's
cpumask from this LLC's icpus mask. This defeats the
"allow-false-positive" heuristic.




> +update:
>  	sd_smt_shared->state = new;
>  out:
>  	xchg(&sd_smt_shared->updating, 0);
> -- 
> 2.31.1
>

--
Thanks and Regards
gautham.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v4 7/7] sched/fair: de-entropy for SIS filter
  2022-07-20 17:08   ` Gautham R. Shenoy
@ 2022-08-15  9:49     ` Abel Wu
  0 siblings, 0 replies; 7+ messages in thread
From: Abel Wu @ 2022-08-15  9:49 UTC (permalink / raw)
  To: Gautham R. Shenoy
  Cc: Peter Zijlstra, Mel Gorman, Vincent Guittot, Josh Don, Chen Yu,
	Tim Chen, K Prateek Nayak, linux-kernel

On 7/21/22 1:08 AM, Gautham R. Shenoy Wrote:
> Hello Abel,
> 
> On Sun, Jun 19, 2022 at 08:04:51PM +0800, Abel Wu wrote:
>> Now when updating core state, there are two main problems that could
>> pollute the SIS filter:
>>
>>    - The updating is before task migration, so if dst_cpu is
>>      selected to be propagated which might be fed with tasks
>>      soon, the efforts we paid is no more than setting a busy
>>      cpu into the SIS filter. While on the other hand it is
>>      important that we update as early as possible to keep the
>>      filter fresh, so it's not wise to delay the update to the
>>      end of load balancing.
>>
>>    - False negative propagation hurts performance since some
>>      idle cpus could be out of reach. So in general we will
>>      aggressively propagate idle cpus but allow false positive
>>      continue to exist for a while, which may lead to filter
>>      being fully polluted.
> 
> Ok, so the false positive case is being addressed in this patch.
> 
>>
>> Pains can be relieved by a force correction when false positive
>> continuously detected.
>>
> [..snip..]
> 
>> @@ -111,6 +117,7 @@ struct sched_group;
>>   enum sd_state {
>>   	sd_has_icores,
>>   	sd_has_icpus,
>> +	sd_may_idle,
>>   	sd_is_busy
>>   };
>>   
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index d55fdcedf2c0..9713d183d35e 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
> 
> [...snip..]
> 
>> @@ -9321,7 +9327,7 @@ static void sd_update_state(struct lb_env *env, struct sd_lb_stats *sds)
>>   {
>>   	struct sched_domain_shared *sd_smt_shared = env->sd->shared;
>>   	enum sd_state new = sds->sd_state;
>> -	int this = env->dst_cpu;
>> +	int icpu = sds->idle_cpu, this = env->dst_cpu;
>>   
>>   	/*
>>   	 * Parallel updating can hardly contribute accuracy to
>> @@ -9331,6 +9337,22 @@ static void sd_update_state(struct lb_env *env, struct sd_lb_stats *sds)
>>   	if (cmpxchg(&sd_smt_shared->updating, 0, 1))
>>   		return;
>>   
>> +	/*
>> +	 * The dst_cpu is likely to be fed with tasks soon.
>> +	 * If it is the only unoccupied cpu in this domain,
>> +	 * we still handle it the same way as as_has_icpus
>                                                ^^^^^^^^^^^^^
> Nit:                                          sd_has_icpus

Will fix.

> 
>> +	 * but turn the SMT into the unstable state, rather
>> +	 * than waiting to the end of load balancing since
>> +	 * it's also important that update the filter as
>> +	 * early as possible to keep it fresh.
>> +	 */
>> +	if (new == sd_is_busy) {
>> +		if (idle_cpu(this) || sched_idle_cpu(this)) {
>> +			new = sd_may_idle;
>> +			icpu = this;
>> +		}
>> +	}
>> +
>>   	/*
>>   	 * There is at least one unoccupied cpu available, so
>>   	 * propagate it to the filter to avoid false negative
>> @@ -9338,6 +9360,12 @@ static void sd_update_state(struct lb_env *env, struct sd_lb_stats *sds)
>>   	 * idle cpus thus throughupt downgraded.
>>   	 */
>>   	if (new != sd_is_busy) {
>> +		/*
>> +		 * The sd_may_idle state is taken into
>> +		 * consideration as well because from
>> +		 * here we couldn't actually know task
>> +		 * migrations would happen or not.
>> +		 */
>>   		if (!test_idle_cpus(this))
>>   			set_idle_cpus(this, true);
>>   	} else {
>> @@ -9347,9 +9375,26 @@ static void sd_update_state(struct lb_env *env, struct sd_lb_stats *sds)
>>   		 */
>>   		if (sd_smt_shared->state == sd_is_busy)
>>   			goto out;
>> +
>> +		/*
>> +		 * Allow false positive to exist for some time
>> +		 * to make a tradeoff of accuracy of the filter
>> +		 * for relieving cache traffic.
>> +		 */
> 
> I can understand allowing the false positive to exist when there are
> no other idle CPUs in this SMT domain other than this CPU, which is
> handled by the case where new != sd_is_busy in the current
> load-balance round and will be handled by the "else" clause in the
> subsequent round if env->dst_cpu ends up becoming busy.
> 

Yes.

> 
> However, when we know that new == sd_is_busy and the previous state of
> this SMT domain was sd_has_icpus, should we not immediately clear this
> core's cpumask from the LLCs icpus mask? What is the need for the
> intermediate sd_may_idle state transition between sd_has_icpus and
> sd_is_busy in this case ?
> 

The thought was to make addition more aggressive than deletion to the
filter to try to avoid real idle cpus being out of reach. Take short
running tasks for example, the cpus in this newly-busy SMT domain can
become idle multiple times during next balancing period, which won't
be selected if update to sd_is_busy intermediately. While for the non-
short running workloads, the only downside is sacrificing the filter's
accuracy for a short while.

IOW, temporarily containing false positive cpus in the filter is more
acceptable than missing some real idle ones which would cause throughput
reduced. Besides, in this way the cache traffic can be relieved more
or less.

> 
> 
>> +		if (sd_smt_shared->state == sd_has_icpus) {
>> +			new = sd_may_idle;
>> +			goto update;
>> +		}
>> +
>> +		/*
>> +		 * If the false positive issue has already been
>> +		 * there for a while, a correction of the filter
>> +		 * is needed.
>> +		 */
>>   	}
>>   
>>   	sd_update_icpus(this, sds->idle_cpu);
>                                ^^^^^^^^^^^^^^
> 
> I think you meant to use icpu here ?  sds->idle_cpu == -1 in the case
> when new == sd_may_idle. Which will end up clearing this core's
> cpumask from this LLC's icpus mask. This defeats the
> "allow-false-positive" heuristic.
> 

Nice catch, will fix.

Thanks & Best Regards,
Abel

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2022-08-15  9:49 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2022-06-21  7:27 [PATCH v4 7/7] sched/fair: de-entropy for SIS filter kernel test robot
  -- strict thread matches above, loose matches on Subject: below --
2022-06-19 12:04 [PATCH v4 0/7] sched/fair: improve scan efficiency of SIS Abel Wu
2022-06-19 12:04 ` [PATCH v4 7/7] sched/fair: de-entropy for SIS filter Abel Wu
2022-06-21 18:23   ` Chen Yu
2022-06-22  4:01     ` Abel Wu
2022-06-30  7:46   ` Abel Wu
2022-07-20 17:08   ` Gautham R. Shenoy
2022-08-15  9:49     ` Abel Wu

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.