* [RFC PATCH V1 0/2] sched/numa: Disjoint set vma scan improvements
@ 2023-05-03 2:05 Raghavendra K T
2023-05-03 2:05 ` [RFC PATCH V1 1/2] sched/numa: Introduce per vma scan counter Raghavendra K T
2023-05-03 2:05 ` [RFC PATCH V1 2/2] sched/numa: Introduce per vma numa_scan_seq Raghavendra K T
0 siblings, 2 replies; 6+ messages in thread
From: Raghavendra K T @ 2023-05-03 2:05 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: Ingo Molnar, Peter Zijlstra, Mel Gorman, Andrew Morton,
David Hildenbrand, rppt, Juri Lelli, Vincent Guittot,
Bharata B Rao, Raghavendra K T
With the NUMA scan enhancements [1], only the threads that had previously
accessed a vma are allowed to scan it.
While this has significantly reduced system time overhead, there are corner
cases that genuinely need some relaxation, e.g. the concern raised by
PeterZ that unfairness amongst threads belonging to disjoint sets of VMAs
can potentially amplify the side effects of vma regions belonging to some of
the tasks being left unscanned.
Currently that is handled by allowing first two scans at mm level
(mm->numa_scan_seq) unconditionally.
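To make the existing gate concrete, here is a minimal userspace sketch (not
kernel code). It assumes a single 64-bit bitmap in place of the two
access_pids[] windows, and it assumes that the kernel's hash_32() is a
multiply by the 32-bit golden-ratio constant followed by a shift. The names
pid_to_bit(), mark_accessed() and may_scan() are hypothetical and exist only
for this illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Fold a PID into a bit index in [0, 64): multiplicative hash, keep top 6 bits.
 * Assumption: this mirrors hash_32(pid, ilog2(BITS_PER_LONG)) on a 64-bit build. */
static unsigned int pid_to_bit(uint32_t pid)
{
	return (uint32_t)(pid * 0x61C88647U) >> (32 - 6);
}

/* Record that `pid` touched the VMA (stand-in for the fault-path bookkeeping). */
static void mark_accessed(uint64_t *access_pids, uint32_t pid)
{
	*access_pids |= (uint64_t)1 << pid_to_bit(pid);
}

/* May `pid` scan this VMA? Always during the first two mm-wide scan passes,
 * afterwards only if its hashed bit is set in the access bitmap. */
static bool may_scan(uint64_t access_pids, uint32_t pid, unsigned int mm_scan_seq)
{
	if (mm_scan_seq < 2)
		return true;
	return (access_pids >> pid_to_bit(pid)) & 1;
}
```

Because the PID is hashed into only 64 bits, two tasks can collide on the same
bit, so the check is a cheap filter rather than an exact record of who
accessed the VMA.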
One of the tests that exercises a similar side effect is numa01_THREAD_ALLOC,
where allocation is done by the main thread and the memory is divided into
24MB chunks that are continuously bzeroed.
(This is run by default by the LKP tests, while numa01 is run by default in
mmtests and operates on the full 3GB region from each thread.)
So to address this issue, the proposal here is:
1) Have a per vma scan counter that gets incremented for every successful scan
(which potentially scans 256MB or sysctl_scan_size)
2) Do an unconditional scan for the first few times (to be precise, half of the
window that would normally be calculated for scanning)
3) Reset the counter when the whole mm has been scanned (this needs remembering
mm->numa_scan_seq at vma level)
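The three steps above can be sketched as a small userspace model. This is an
illustration under stated assumptions, not the patch itself: struct vma_model,
vma_should_scan() and vma_note_scan() are hypothetical names, and the
accessed_by_task flag stands in for the access_pids bitmap test.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the per-VMA NUMA balancing state. */
struct vma_model {
	unsigned int scan_counter;  /* scans of this VMA in the current mm pass */
	unsigned int vma_scan_seq;  /* mm pass number last seen by this VMA */
	bool accessed_by_task;      /* stand-in for the access_pids bitmap test */
};

/* Decide whether the current task may scan this VMA. */
static bool vma_should_scan(struct vma_model *vma, unsigned int mm_scan_seq,
			    unsigned int windows)
{
	/* Step 3: a new mm-wide pass has started, so restart the per-VMA budget. */
	if (vma->vma_scan_seq != mm_scan_seq) {
		vma->vma_scan_seq = mm_scan_seq;
		vma->scan_counter = 0;
	}

	/* Step 2: scan unconditionally while the budget lasts (or early in life). */
	if (vma->scan_counter < windows || mm_scan_seq < 2)
		return true;

	/* Otherwise only tasks that touched the VMA may scan it. */
	return vma->accessed_by_task;
}

/* Step 1: count every scan that was actually performed. */
static void vma_note_scan(struct vma_model *vma)
{
	vma->scan_counter++;
}
```

With windows = 4, a task that never touched the VMA gets four scans per mm
pass and is then refused until mm_scan_seq advances.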
With this patch I am seeing good improvement in the numa01_THREAD_ALLOC case,
but please note that [1] brought a drastic decrease in system time when
benchmarks run, and this patch adds back some of that system time.
Your comments/ideas are welcome.
Result:
SUT: Milan w/ 2 numa nodes 256 cpus
Manual run of numa01_THREAD_ALLOC
Base 11-apr-next
                        w/numascan      w/o numascan    numascan+patch
real                    1m33.579s       1m2.042s        1m11.738s
user                    280m46.032s     213m38.647s     231m40.226s
sys                     0m18.061s       6m54.963s       4m43.174s
numa_hit                5813057         6166060         6146064
numa_local              5812546         6165471         6145573
numa_other              511             589             491
numa_pte_updates        0               2098276         1248398
numa_hint_faults        10              1768382         982034
numa_hint_faults_local  10              981824          625424
numa_pages_migrated     0               786558          356604
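To compare the real/user/sys rows numerically, a small helper along these
lines can be used. It is only a sketch: to_seconds() and pct_change() are
hypothetical names, and the parser assumes the "<minutes>m<seconds>s" format
shown in the table above.

```c
#include <stdio.h>

/* Convert a time(1)-style string such as "1m33.579s" to seconds.
 * Returns -1.0 if the string does not match the expected format. */
static double to_seconds(const char *s)
{
	unsigned int min;
	double sec;

	if (sscanf(s, "%um%lfs", &min, &sec) != 2)
		return -1.0;
	return min * 60.0 + sec;
}

/* Relative change from `before` to `after` in percent (negative = lower). */
static double pct_change(double before, double after)
{
	return (after - before) / before * 100.0;
}
```

Applied to the table, elapsed time goes from about 93.6s (w/numascan) to about
71.7s (numascan+patch), roughly a 23% reduction, while system time rises from
about 18s to about 283s.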
Below is the mmtest kernbench and autonuma performance
kernbench
===========
Base 11-apr-next
                                    w/numascan              w/o numascan            numascan+patch
Amean user-256                      23873.01 ( 0.00%)       23688.21 * 0.77%*       23948.47 * -0.32%*
Amean syst-256                      4990.73 ( 0.00%)        5113.32 * -2.46%*       4800.86 * 3.80%*
Amean elsp-256                      150.67 ( 0.00%)         150.52 * 0.10%*         150.63 * 0.03%*
Duration User                       71628.53                71074.04                71855.31
Duration System                     14985.61                15354.33                14416.72
Duration Elapsed                    472.69                  473.24                  473.72
Ops NUMA alloc hit                  1739476674.00           1739443601.00           1739591558.00
Ops NUMA alloc local                1739534231.00           1739519795.00           1739647666.00
Ops NUMA base-page range updates    485073.00               673766.00               733129.00
Ops NUMA PTE updates                485073.00               673766.00               733129.00
Ops NUMA hint faults                107776.00               181920.00               186250.00
Ops NUMA hint local faults %        1789.00                 6165.00                 10889.00
Ops NUMA hint local percent         1.66                    3.39                    5.85
Ops NUMA pages migrated             105987.00               175755.00               175356.00
Ops AutoNUMA cost                   544.29                  917.66                  939.71
autonumabench
===============
                                    w/numascan              w/o numascan            numascan+patch
Amean syst-NUMA01                   33.10 ( 0.00%)          571.68 *-1627.21%*      219.51 *-563.21%*
Amean syst-NUMA01_THREADLOCAL       0.23 ( 0.00%)           0.22 * 4.38%*           0.22 * 5.00%*
Amean syst-NUMA02                   0.81 ( 0.00%)           0.75 * 7.76%*           0.76 * 6.00%*
Amean syst-NUMA02_SMT               0.68 ( 0.00%)           0.73 * -7.79%*          0.65 * 3.58%*
Amean elsp-NUMA01                   299.71 ( 0.00%)         333.24 * -11.19%*       329.60 * -9.97%*
Amean elsp-NUMA01_THREADLOCAL       1.06 ( 0.00%)           1.06 * 0.00%*           1.06 * -0.68%*
Amean elsp-NUMA02                   3.29 ( 0.00%)           3.23 * 1.95%*           3.18 * 3.51%*
Amean elsp-NUMA02_SMT               3.75 ( 0.00%)           3.38 * 9.86%*           3.79 * -0.95%*
Duration User                       321693.29               437210.09               376657.80
Duration System                     244.25                  4014.23                 1548.57
Duration Elapsed                    2165.83                 2395.53                 2373.46
Ops NUMA alloc hit                  49608099.00             62272320.00             55815229.00
Ops NUMA alloc local                49585747.00             62236996.00             55812601.00
Ops NUMA base-page range updates    1571.00                 202868357.00            96006221.00
Ops NUMA PTE updates                1571.00                 202868357.00            96006221.00
Ops NUMA hint faults                1203.00                 204902318.00            97246909.00
Ops NUMA hint local faults %        981.00                  187233695.00            81136933.00
Ops NUMA hint local percent         81.55                   91.38                   83.43
Ops NUMA pages migrated             222.00                  10011134.00             6060787.00
Ops AutoNUMA cost                   6.03                    1026121.88              487021.74
Notes: Implementations considered/tried
1) Limit the disjoint set vma scan to 4 (hardcoded) = 1GB per whole mm scan
2) Current PID reset window = 4 * sysctl_scan_delay is changed to
8 * sysctl_scan_delay (to ensure some random overlapping over time in scanning)
links:
[1] https://lore.kernel.org/lkml/cover.1677672277.git.raghavendra.kt@amd.com/T/#t
Raghavendra K T (2):
sched/numa: Introduce per vma scan counter
sched/numa: Introduce per vma numa_scan_seq
include/linux/mm_types.h | 2 ++
kernel/sched/fair.c | 44 +++++++++++++++++++++++++++++++++++++---
2 files changed, 43 insertions(+), 3 deletions(-)
--
2.34.1
^ permalink raw reply [flat|nested] 6+ messages in thread
* [RFC PATCH V1 1/2] sched/numa: Introduce per vma scan counter
2023-05-03 2:05 [RFC PATCH V1 0/2] sched/numa: Disjoint set vma scan improvements Raghavendra K T
@ 2023-05-03 2:05 ` Raghavendra K T
2023-05-03 17:23 ` kernel test robot
2023-05-03 17:42 ` Raghavendra K T
2023-05-03 2:05 ` [RFC PATCH V1 2/2] sched/numa: Introduce per vma numa_scan_seq Raghavendra K T
1 sibling, 2 replies; 6+ messages in thread
From: Raghavendra K T @ 2023-05-03 2:05 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: Ingo Molnar, Peter Zijlstra, Mel Gorman, Andrew Morton,
David Hildenbrand, rppt, Juri Lelli, Vincent Guittot,
Bharata B Rao, Raghavendra K T
With the recent NUMA scan enhancements, only the tasks that had
previously accessed a vma are allowed to scan it.
While this has significantly reduced system time overhead, there are
corner cases that genuinely need some relaxation, e.g. the concern
raised by PeterZ that unfairness amongst threads belonging to
disjoint sets of VMAs can potentially amplify the side effects of vma
regions belonging to some of the tasks being left unscanned.
To address this, allow scanning for the first few times with a per vma
counter.
Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com>
---
include/linux/mm_types.h | 1 +
kernel/sched/fair.c | 30 +++++++++++++++++++++++++++---
2 files changed, 28 insertions(+), 3 deletions(-)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 3fc9e680f174..f66e6b4e0620 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -479,6 +479,7 @@ struct vma_numab_state {
unsigned long next_scan;
unsigned long next_pid_reset;
unsigned long access_pids[2];
+ unsigned int scan_counter;
};
/*
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a29ca11bead2..3c50dc3893eb 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2928,19 +2928,38 @@ static void reset_ptenuma_scan(struct task_struct *p)
p->mm->numa_scan_offset = 0;
}
+/* Scan 1GB or 4 * scan_size */
+#define VMA_DISJOINT_SET_ACCESS_THRESH 4U
+
static bool vma_is_accessed(struct vm_area_struct *vma)
{
unsigned long pids;
+ unsigned int windows;
+ unsigned int scan_size = READ_ONCE(sysctl_numa_balancing_scan_size);
+
+ if (scan_size < MAX_SCAN_WINDOW)
+ windows = MAX_SCAN_WINDOW / scan_size;
+
+ /* Allow only half of the windows for disjoint set cases */
+ windows /= 2;
+
+ windows = max(VMA_DISJOINT_SET_ACCESS_THRESH, windows);
+
/*
- * Allow unconditional access first two times, so that all the (pages)
- * of VMAs get prot_none fault introduced irrespective of accesses.
+ * Make sure to allow scanning of disjoint vma set for the first
+ * few times.
+ * OR At mm level allow unconditional access first two times, so that
+ * all the (pages) of VMAs get prot_none fault introduced irrespective
+ * of accesses.
* This is also done to avoid any side effect of task scanning
* amplifying the unfairness of disjoint set of VMAs' access.
*/
- if (READ_ONCE(current->mm->numa_scan_seq) < 2)
+ if (READ_ONCE(vma->numab_state->scan_counter) < windows ||
+ READ_ONCE(current->mm->numa_scan_seq) < 2)
return true;
pids = vma->numab_state->access_pids[0] | vma->numab_state->access_pids[1];
+
return test_bit(hash_32(current->pid, ilog2(BITS_PER_LONG)), &pids);
}
@@ -3058,6 +3077,8 @@ static void task_numa_work(struct callback_head *work)
/* Reset happens after 4 times scan delay of scan start */
vma->numab_state->next_pid_reset = vma->numab_state->next_scan +
msecs_to_jiffies(VMA_PID_RESET_PERIOD);
+
+ WRITE_ONCE(vma->numab_state->scan_counter, 0);
}
/*
@@ -3084,6 +3105,9 @@ static void task_numa_work(struct callback_head *work)
vma->numab_state->access_pids[1] = 0;
}
+ WRITE_ONCE(vma->numab_state->scan_counter,
+ READ_ONCE(vma->numab_state->scan_counter) + 1);
+
do {
start = max(start, vma->vm_start);
end = ALIGN(start + (pages << PAGE_SHIFT), HPAGE_SIZE);
--
2.34.1
^ permalink raw reply related [flat|nested] 6+ messages in thread
* Re: [RFC PATCH V1 1/2] sched/numa: Introduce per vma scan counter
2023-05-03 2:05 ` [RFC PATCH V1 1/2] sched/numa: Introduce per vma scan counter Raghavendra K T
@ 2023-05-03 17:23 ` kernel test robot
2023-05-03 17:37 ` Raghavendra K T
2023-05-03 17:42 ` Raghavendra K T
1 sibling, 1 reply; 6+ messages in thread
From: kernel test robot @ 2023-05-03 17:23 UTC (permalink / raw)
To: Raghavendra K T; +Cc: llvm, oe-kbuild-all
Hi Raghavendra,
[This is a private test report for your RFC patch.]
kernel test robot noticed the following build warnings:
[auto build test WARNING on akpm-mm/mm-everything]
[also build test WARNING on linus/master next-20230428]
[cannot apply to tip/sched/core tip/master tip/auto-latest v6.3]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Raghavendra-K-T/sched-numa-Introduce-per-vma-scan-counter/20230503-100717
base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link: https://lore.kernel.org/r/abd037023141f25f79c6bbbb801c8405e4c449a1.1683033105.git.raghavendra.kt%40amd.com
patch subject: [RFC PATCH V1 1/2] sched/numa: Introduce per vma scan counter
config: x86_64-randconfig-a016-20230501 (https://download.01.org/0day-ci/archive/20230504/202305040103.BRgS5gsL-lkp@intel.com/config)
compiler: clang version 14.0.6 (https://github.com/llvm/llvm-project f28c006a5895fc0e329fe15fead81e37457cb1d1)
reproduce (this is a W=1 build):
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# https://github.com/intel-lab-lkp/linux/commit/2e4296f7fd895ddd9ab683b8bd350f3aa654d91d
git remote add linux-review https://github.com/intel-lab-lkp/linux
git fetch --no-tags linux-review Raghavendra-K-T/sched-numa-Introduce-per-vma-scan-counter/20230503-100717
git checkout 2e4296f7fd895ddd9ab683b8bd350f3aa654d91d
# save the config file
mkdir build_dir && cp config build_dir/.config
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=x86_64 olddefconfig
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=x86_64 SHELL=/bin/bash kernel/sched/
If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <lkp@intel.com>
| Link: https://lore.kernel.org/oe-kbuild-all/202305040103.BRgS5gsL-lkp@intel.com/
All warnings (new ones prefixed by >>):
>> kernel/sched/fair.c:2940:6: warning: variable 'windows' is used uninitialized whenever 'if' condition is false [-Wsometimes-uninitialized]
if (scan_size < MAX_SCAN_WINDOW)
^~~~~~~~~~~~~~~~~~~~~~~~~~~
kernel/sched/fair.c:2944:2: note: uninitialized use occurs here
windows /= 2;
^~~~~~~
kernel/sched/fair.c:2940:2: note: remove the 'if' if its condition is always true
if (scan_size < MAX_SCAN_WINDOW)
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
kernel/sched/fair.c:2937:22: note: initialize the variable 'windows' to silence this warning
unsigned int windows;
^
= 0
kernel/sched/fair.c:6191:6: warning: no previous prototype for function 'init_cfs_bandwidth' [-Wmissing-prototypes]
void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b) {}
^
kernel/sched/fair.c:6191:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b) {}
^
static
kernel/sched/fair.c:12003:6: warning: no previous prototype for function 'task_vruntime_update' [-Wmissing-prototypes]
void task_vruntime_update(struct rq *rq, struct task_struct *p, bool in_fi)
^
kernel/sched/fair.c:12003:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
void task_vruntime_update(struct rq *rq, struct task_struct *p, bool in_fi)
^
static
kernel/sched/fair.c:12613:6: warning: no previous prototype for function 'free_fair_sched_group' [-Wmissing-prototypes]
void free_fair_sched_group(struct task_group *tg) { }
^
kernel/sched/fair.c:12613:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
void free_fair_sched_group(struct task_group *tg) { }
^
static
kernel/sched/fair.c:12615:5: warning: no previous prototype for function 'alloc_fair_sched_group' [-Wmissing-prototypes]
int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
^
kernel/sched/fair.c:12615:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
^
static
kernel/sched/fair.c:12620:6: warning: no previous prototype for function 'online_fair_sched_group' [-Wmissing-prototypes]
void online_fair_sched_group(struct task_group *tg) { }
^
kernel/sched/fair.c:12620:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
void online_fair_sched_group(struct task_group *tg) { }
^
static
kernel/sched/fair.c:12622:6: warning: no previous prototype for function 'unregister_fair_sched_group' [-Wmissing-prototypes]
void unregister_fair_sched_group(struct task_group *tg) { }
^
kernel/sched/fair.c:12622:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
void unregister_fair_sched_group(struct task_group *tg) { }
^
static
kernel/sched/fair.c:535:20: warning: unused function 'list_del_leaf_cfs_rq' [-Wunused-function]
static inline void list_del_leaf_cfs_rq(struct cfs_rq *cfs_rq)
^
kernel/sched/fair.c:556:19: warning: unused function 'tg_is_idle' [-Wunused-function]
static inline int tg_is_idle(struct task_group *tg)
^
kernel/sched/fair.c:3606:20: warning: unused function 'load_avg_is_decayed' [-Wunused-function]
static inline bool load_avg_is_decayed(struct sched_avg *sa)
^
kernel/sched/fair.c:6164:20: warning: unused function 'cfs_bandwidth_used' [-Wunused-function]
static inline bool cfs_bandwidth_used(void)
^
kernel/sched/fair.c:6172:20: warning: unused function 'sync_throttle' [-Wunused-function]
static inline void sync_throttle(struct task_group *tg, int cpu) {}
^
kernel/sched/fair.c:6197:37: warning: unused function 'tg_cfs_bandwidth' [-Wunused-function]
static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
^
kernel/sched/fair.c:6201:20: warning: unused function 'destroy_cfs_bandwidth' [-Wunused-function]
static inline void destroy_cfs_bandwidth(struct cfs_bandwidth *cfs_b) {}
^
kernel/sched/fair.c:9235:19: warning: unused function 'check_misfit_status' [-Wunused-function]
static inline int check_misfit_status(struct rq *rq, struct sched_domain *sd)
^
15 warnings generated.
vim +2940 kernel/sched/fair.c
2933
2934 static bool vma_is_accessed(struct vm_area_struct *vma)
2935 {
2936 unsigned long pids;
2937 unsigned int windows;
2938 unsigned int scan_size = READ_ONCE(sysctl_numa_balancing_scan_size);
2939
> 2940 if (scan_size < MAX_SCAN_WINDOW)
2941 windows = MAX_SCAN_WINDOW / scan_size;
2942
2943 /* Allow only half of the windows for disjoint set cases */
2944 windows /= 2;
2945
2946 windows = max(VMA_DISJOINT_SET_ACCESS_THRESH, windows);
2947
2948 /*
2949 * Make sure to allow scanning of disjoint vma set for the first
2950 * few times.
2951 * OR At mm level allow unconditional access first two times, so that
2952 * all the (pages) of VMAs get prot_none fault introduced irrespective
2953 * of accesses.
2954 * This is also done to avoid any side effect of task scanning
2955 * amplifying the unfairness of disjoint set of VMAs' access.
2956 */
2957 if (READ_ONCE(vma->numab_state->scan_counter) < windows ||
2958 READ_ONCE(current->mm->numa_scan_seq) < 2)
2959 return true;
2960
2961 pids = vma->numab_state->access_pids[0] | vma->numab_state->access_pids[1];
2962
2963 return test_bit(hash_32(current->pid, ilog2(BITS_PER_LONG)), &pids);
2964 }
2965
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [RFC PATCH V1 1/2] sched/numa: Introduce per vma scan counter
2023-05-03 17:23 ` kernel test robot
@ 2023-05-03 17:37 ` Raghavendra K T
0 siblings, 0 replies; 6+ messages in thread
From: Raghavendra K T @ 2023-05-03 17:37 UTC (permalink / raw)
To: kernel test robot; +Cc: llvm, oe-kbuild-all
On 5/3/2023 10:53 PM, kernel test robot wrote:
> Hi Raghavendra,
>
> [This is a private test report for your RFC patch.]
> kernel test robot noticed the following build warnings:
>
> [auto build test WARNING on akpm-mm/mm-everything]
> [also build test WARNING on linus/master next-20230428]
> [cannot apply to tip/sched/core tip/master tip/auto-latest v6.3]
> [If your patch is applied to the wrong git tree, kindly drop us a note.
> And when submitting patch, we suggest to use '--base' as documented in
> https://git-scm.com/docs/git-format-patch#_base_tree_information]
>
Sorry for not stating the base explicitly (though it is hidden in the cover
letter that the base was linux-next from April 2023). It was next-20230411.
> url: https://github.com/intel-lab-lkp/linux/commits/Raghavendra-K-T/sched-numa-Introduce-per-vma-scan-counter/20230503-100717
> base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
> patch link: https://lore.kernel.org/r/abd037023141f25f79c6bbbb801c8405e4c449a1.1683033105.git.raghavendra.kt%40amd.com
> patch subject: [RFC PATCH V1 1/2] sched/numa: Introduce per vma scan counter
> config: x86_64-randconfig-a016-20230501 (https://download.01.org/0day-ci/archive/20230504/202305040103.BRgS5gsL-lkp@intel.com/config)
> compiler: clang version 14.0.6 (https://github.com/llvm/llvm-project f28c006a5895fc0e329fe15fead81e37457cb1d1)
> reproduce (this is a W=1 build):
> wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
> chmod +x ~/bin/make.cross
> # https://github.com/intel-lab-lkp/linux/commit/2e4296f7fd895ddd9ab683b8bd350f3aa654d91d
> git remote add linux-review https://github.com/intel-lab-lkp/linux
> git fetch --no-tags linux-review Raghavendra-K-T/sched-numa-Introduce-per-vma-scan-counter/20230503-100717
> git checkout 2e4296f7fd895ddd9ab683b8bd350f3aa654d91d
> # save the config file
> mkdir build_dir && cp config build_dir/.config
> COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=x86_64 olddefconfig
> COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=x86_64 SHELL=/bin/bash kernel/sched/
>
> If you fix the issue, kindly add following tag where applicable
> | Reported-by: kernel test robot <lkp@intel.com>
> | Link: https://lore.kernel.org/oe-kbuild-all/202305040103.BRgS5gsL-lkp@intel.com/
>
> All warnings (new ones prefixed by >>):
>
>>> kernel/sched/fair.c:2940:6: warning: variable 'windows' is used uninitialized whenever 'if' condition is false [-Wsometimes-uninitialized]
> if (scan_size < MAX_SCAN_WINDOW)
> ^~~~~~~~~~~~~~~~~~~~~~~~~~~
> kernel/sched/fair.c:2944:2: note: uninitialized use occurs here
> windows /= 2;
> ^~~~~~~
> kernel/sched/fair.c:2940:2: note: remove the 'if' if its condition is always true
> if (scan_size < MAX_SCAN_WINDOW)
> ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> kernel/sched/fair.c:2937:22: note: initialize the variable 'windows' to silence this warning
> unsigned int windows;
> ^
> = 0
> kernel/sched/fair.c:6191:6: warning: no previous prototype for function 'init_cfs_bandwidth' [-Wmissing-prototypes]
> void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b) {}
> ^
> kernel/sched/fair.c:6191:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
> void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b) {}
> ^
> static
> kernel/sched/fair.c:12003:6: warning: no previous prototype for function 'task_vruntime_update' [-Wmissing-prototypes]
> void task_vruntime_update(struct rq *rq, struct task_struct *p, bool in_fi)
> ^
> kernel/sched/fair.c:12003:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
> void task_vruntime_update(struct rq *rq, struct task_struct *p, bool in_fi)
> ^
> static
> kernel/sched/fair.c:12613:6: warning: no previous prototype for function 'free_fair_sched_group' [-Wmissing-prototypes]
> void free_fair_sched_group(struct task_group *tg) { }
> ^
> kernel/sched/fair.c:12613:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
> void free_fair_sched_group(struct task_group *tg) { }
> ^
> static
> kernel/sched/fair.c:12615:5: warning: no previous prototype for function 'alloc_fair_sched_group' [-Wmissing-prototypes]
> int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
> ^
> kernel/sched/fair.c:12615:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
> int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
> ^
> static
> kernel/sched/fair.c:12620:6: warning: no previous prototype for function 'online_fair_sched_group' [-Wmissing-prototypes]
> void online_fair_sched_group(struct task_group *tg) { }
> ^
> kernel/sched/fair.c:12620:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
> void online_fair_sched_group(struct task_group *tg) { }
> ^
> static
> kernel/sched/fair.c:12622:6: warning: no previous prototype for function 'unregister_fair_sched_group' [-Wmissing-prototypes]
> void unregister_fair_sched_group(struct task_group *tg) { }
> ^
> kernel/sched/fair.c:12622:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
> void unregister_fair_sched_group(struct task_group *tg) { }
> ^
> static
> kernel/sched/fair.c:535:20: warning: unused function 'list_del_leaf_cfs_rq' [-Wunused-function]
> static inline void list_del_leaf_cfs_rq(struct cfs_rq *cfs_rq)
> ^
> kernel/sched/fair.c:556:19: warning: unused function 'tg_is_idle' [-Wunused-function]
> static inline int tg_is_idle(struct task_group *tg)
> ^
> kernel/sched/fair.c:3606:20: warning: unused function 'load_avg_is_decayed' [-Wunused-function]
> static inline bool load_avg_is_decayed(struct sched_avg *sa)
> ^
> kernel/sched/fair.c:6164:20: warning: unused function 'cfs_bandwidth_used' [-Wunused-function]
> static inline bool cfs_bandwidth_used(void)
> ^
> kernel/sched/fair.c:6172:20: warning: unused function 'sync_throttle' [-Wunused-function]
> static inline void sync_throttle(struct task_group *tg, int cpu) {}
> ^
> kernel/sched/fair.c:6197:37: warning: unused function 'tg_cfs_bandwidth' [-Wunused-function]
> static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
> ^
> kernel/sched/fair.c:6201:20: warning: unused function 'destroy_cfs_bandwidth' [-Wunused-function]
> static inline void destroy_cfs_bandwidth(struct cfs_bandwidth *cfs_b) {}
> ^
> kernel/sched/fair.c:9235:19: warning: unused function 'check_misfit_status' [-Wunused-function]
> static inline int check_misfit_status(struct rq *rq, struct sched_domain *sd)
> ^
> 15 warnings generated.
>
>
> vim +2940 kernel/sched/fair.c
>
> 2933
> 2934 static bool vma_is_accessed(struct vm_area_struct *vma)
> 2935 {
> 2936 unsigned long pids;
> 2937 unsigned int windows;
Correct, I missed the windows = 0 initialization while splitting the patch.
> 2938 unsigned int scan_size = READ_ONCE(sysctl_numa_balancing_scan_size);
> 2939
>> 2940 if (scan_size < MAX_SCAN_WINDOW)
> 2941 windows = MAX_SCAN_WINDOW / scan_size;
> 2942
> 2943 /* Allow only half of the windows for disjoint set cases */
> 2944 windows /= 2;
> 2945
> 2946 windows = max(VMA_DISJOINT_SET_ACCESS_THRESH, windows);
> 2947
> 2948 /*
> 2949 * Make sure to allow scanning of disjoint vma set for the first
> 2950 * few times.
> 2951 * OR At mm level allow unconditional access first two times, so that
> 2952 * all the (pages) of VMAs get prot_none fault introduced irrespective
> 2953 * of accesses.
> 2954 * This is also done to avoid any side effect of task scanning
> 2955 * amplifying the unfairness of disjoint set of VMAs' access.
> 2956 */
> 2957 if (READ_ONCE(vma->numab_state->scan_counter) < windows ||
> 2958 READ_ONCE(current->mm->numa_scan_seq) < 2)
> 2959 return true;
> 2960
> 2961 pids = vma->numab_state->access_pids[0] | vma->numab_state->access_pids[1];
> 2962
> 2963 return test_bit(hash_32(current->pid, ilog2(BITS_PER_LONG)), &pids);
> 2964 }
> 2965
>
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [RFC PATCH V1 1/2] sched/numa: Introduce per vma scan counter
2023-05-03 2:05 ` [RFC PATCH V1 1/2] sched/numa: Introduce per vma scan counter Raghavendra K T
2023-05-03 17:23 ` kernel test robot
@ 2023-05-03 17:42 ` Raghavendra K T
1 sibling, 0 replies; 6+ messages in thread
From: Raghavendra K T @ 2023-05-03 17:42 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: Ingo Molnar, Peter Zijlstra, Mel Gorman, Andrew Morton,
David Hildenbrand, rppt, Juri Lelli, Vincent Guittot,
Bharata B Rao
On 5/3/2023 7:35 AM, Raghavendra K T wrote:
> With the recent numa scan enhancements, only the tasks which had
> previously accessed vma are allowed to scan.
>
> While this has significantly reduced system time overhead, there are
> corner cases that genuinely need some relaxation, e.g. the concern
> raised by PeterZ that unfairness amongst threads belonging to
> disjoint sets of VMAs can potentially amplify the side effects of vma
> regions belonging to some of the tasks being left unscanned.
>
> To address this, allow scanning for first few times with a per vma
> counter.
>
> Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com>
> ---
Some clarification:
the base was linux-next-20230411 (because I have some issues with
linux-next-20230425 onwards and the linux master branch, which I am
digging into).
> include/linux/mm_types.h | 1 +
> kernel/sched/fair.c | 30 +++++++++++++++++++++++++++---
> 2 files changed, 28 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 3fc9e680f174..f66e6b4e0620 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -479,6 +479,7 @@ struct vma_numab_state {
> unsigned long next_scan;
> unsigned long next_pid_reset;
> unsigned long access_pids[2];
> + unsigned int scan_counter;
> };
>
> /*
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index a29ca11bead2..3c50dc3893eb 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2928,19 +2928,38 @@ static void reset_ptenuma_scan(struct task_struct *p)
> p->mm->numa_scan_offset = 0;
> }
>
> +/* Scan 1GB or 4 * scan_size */
> +#define VMA_DISJOINT_SET_ACCESS_THRESH 4U
> +
> static bool vma_is_accessed(struct vm_area_struct *vma)
> {
> unsigned long pids;
> + unsigned int windows;
I missed windows = 0 while splitting the patch;
this will be corrected in the next posting.
/me remembered only after the kernel test robot noticed.
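For reference, the window computation with the missing initialization could
look like the userspace sketch below. It is not the next posting of the patch:
disjoint_scan_windows() is a hypothetical helper, MAX_SCAN_WINDOW is assumed
to be 2560 (MB) as in kernel/sched/fair.c, and the scan_size_mb > 0 guard is
an addition of this sketch only.

```c
/* Assumed to match kernel/sched/fair.c; value in MB. */
#define MAX_SCAN_WINDOW 2560U
/* From the patch: scan 1GB or 4 * scan_size. */
#define VMA_DISJOINT_SET_ACCESS_THRESH 4U

/* Number of unconditional scans allowed per VMA for a given scan size (MB). */
static unsigned int disjoint_scan_windows(unsigned int scan_size_mb)
{
	unsigned int windows = 0;	/* the initialization that was missed */

	/* The > 0 guard avoids a division by zero in this sketch. */
	if (scan_size_mb > 0 && scan_size_mb < MAX_SCAN_WINDOW)
		windows = MAX_SCAN_WINDOW / scan_size_mb;

	/* Allow only half of the windows for disjoint set cases. */
	windows /= 2;

	/* Never go below the hardcoded threshold. */
	return windows > VMA_DISJOINT_SET_ACCESS_THRESH ?
	       windows : VMA_DISJOINT_SET_ACCESS_THRESH;
}
```

With a 256MB scan size this gives 2560 / 256 = 10 windows, halved to 5. Any
scan size at or above MAX_SCAN_WINDOW leaves windows at 0, so the result falls
back to the threshold of 4.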
[...]
^ permalink raw reply [flat|nested] 6+ messages in thread
* [RFC PATCH V1 2/2] sched/numa: Introduce per vma numa_scan_seq
2023-05-03 2:05 [RFC PATCH V1 0/2] sched/numa: Disjoint set vma scan improvements Raghavendra K T
2023-05-03 2:05 ` [RFC PATCH V1 1/2] sched/numa: Introduce per vma scan counter Raghavendra K T
@ 2023-05-03 2:05 ` Raghavendra K T
1 sibling, 0 replies; 6+ messages in thread
From: Raghavendra K T @ 2023-05-03 2:05 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: Ingo Molnar, Peter Zijlstra, Mel Gorman, Andrew Morton,
David Hildenbrand, rppt, Juri Lelli, Vincent Guittot,
Bharata B Rao, Raghavendra K T
A per vma scan counter was introduced to aid disjoint set vma
scanning in corner cases. But that counter needs to be reset regularly.
The reset is done after a full round of mm scanning, using a per vma
scan sequence number (vma_scan_seq) that follows mm->numa_scan_seq.
Result: With this patch series we recover mmtests'
numa01_THREAD_ALLOC performance as below
Base 11-apr-next
                        w/numascan      w/o numascan    numascan+patch
real                    1m33.579s       1m2.042s        1m11.738s
user                    280m46.032s     213m38.647s     231m40.226s
sys                     0m18.061s       6m54.963s       4m43.174s
In summary: it adds back some of the system overhead of disjoint set
vma scanning, but we are still at a huge advantage w.r.t. the base kernel.
Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com>
---
include/linux/mm_types.h | 1 +
kernel/sched/fair.c | 18 ++++++++++++++++--
2 files changed, 17 insertions(+), 2 deletions(-)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index f66e6b4e0620..9c0fc83118da 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -479,6 +479,7 @@ struct vma_numab_state {
unsigned long next_scan;
unsigned long next_pid_reset;
unsigned long access_pids[2];
+ unsigned int vma_scan_seq;
unsigned int scan_counter;
};
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3c50dc3893eb..dc011a2a31ac 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2935,6 +2935,7 @@ static bool vma_is_accessed(struct vm_area_struct *vma)
{
unsigned long pids;
unsigned int windows;
+ unsigned int mm_seq, vma_seq;
unsigned int scan_size = READ_ONCE(sysctl_numa_balancing_scan_size);
if (scan_size < MAX_SCAN_WINDOW)
@@ -2945,6 +2946,18 @@ static bool vma_is_accessed(struct vm_area_struct *vma)
windows = max(VMA_DISJOINT_SET_ACCESS_THRESH, windows);
+ mm_seq = READ_ONCE(current->mm->numa_scan_seq);
+ vma_seq = READ_ONCE(vma->numab_state->vma_scan_seq);
+
+ if (vma_seq != mm_seq) {
+ /*
+ * One more round of whole mm scan was done. Reset the vma scan_counter
+ * and sync per vma numa_scan_seq.
+ */
+ WRITE_ONCE(vma->numab_state->vma_scan_seq,
+ READ_ONCE(current->mm->numa_scan_seq));
+ WRITE_ONCE(vma->numab_state->scan_counter, 0);
+ }
/*
* Make sure to allow scanning of disjoint vma set for the first
* few times.
@@ -2954,8 +2967,7 @@ static bool vma_is_accessed(struct vm_area_struct *vma)
* This is also done to avoid any side effect of task scanning
* amplifying the unfairness of disjoint set of VMAs' access.
*/
- if (READ_ONCE(vma->numab_state->scan_counter) < windows ||
- READ_ONCE(current->mm->numa_scan_seq) < 2)
+ if (READ_ONCE(vma->numab_state->scan_counter) < windows || mm_seq < 2)
return true;
pids = vma->numab_state->access_pids[0] | vma->numab_state->access_pids[1];
@@ -3078,6 +3090,8 @@ static void task_numa_work(struct callback_head *work)
vma->numab_state->next_pid_reset = vma->numab_state->next_scan +
msecs_to_jiffies(VMA_PID_RESET_PERIOD);
+ WRITE_ONCE(vma->numab_state->vma_scan_seq,
+ READ_ONCE(current->mm->numa_scan_seq));
WRITE_ONCE(vma->numab_state->scan_counter, 0);
}
--
2.34.1
^ permalink raw reply related [flat|nested] 6+ messages in thread