* [v4 0/3] Reduce TLB flushes under some specific conditions @ 2023-11-09 4:59 Byungchul Park 2023-11-09 4:59 ` [v4 1/3] mm/rmap: Recognize read-only TLB entries during batched TLB flush Byungchul Park ` (4 more replies) 0 siblings, 5 replies; 17+ messages in thread From: Byungchul Park @ 2023-11-09 4:59 UTC (permalink / raw) To: linux-kernel, linux-mm Cc: kernel_team, akpm, ying.huang, namit, xhao, mgorman, hughd, willy, david, peterz, luto, tglx, mingo, bp, dave.hansen Hi everyone, While I'm working with CXL memory, I have been facing migration overhead esp. TLB shootdown on promotion or demotion between different tiers. Yeah.. most TLB shootdowns on migration through hinting fault can be avoided thanks to Huang Ying's work, commit 4d4b6d66db ("mm,unmap: avoid flushing TLB in batch if PTE is inaccessible"). However, it's only for ones using hinting fault. I thought it'd be much better if we have a general mechanism to reduce # of TLB flushes and TLB misses, that we can apply to any type of migration. I tried it only for tiering migration for now tho. I'm suggesting a mechanism to reduce TLB flushes by keeping source and destination of folios participated in the migrations until all TLB flushes required are done, only if those folios are not mapped with write permission PTE entries at all. I worked Based on v6.6-rc5. Can you believe it? I saw the number of TLB full flush reduced about 80% and iTLB miss reduced about 50%, and the time wise performance always shows at least 1% stable improvement with the workload I tested with, XSBench. However, I believe that it would help more with other ones or any real ones. It'd be appreciated to let me know if I'm missing something. Byungchul --- Changes from v3: 1. Don't use the kconfig, CONFIG_MIGRC, and remove sysctl knob, migrc_enable. (feedbacked by Nadav) 2. Remove the optimization skipping CPUs that have already performed TLB flushes needed by any reason when performing TLB flushes by migrc because I can't tell the performance difference between w/ the optimization and w/o that. (feedbacked by Nadav) 3. Minimize arch-specific code. While at it, move all the migrc declarations and inline functions from include/linux/mm.h to mm/internal.h (feedbacked by Dave Hansen, Nadav) 4. Separate a part making migrc paused when the system is in high memory pressure to another patch. (feedbacked by Nadav) 5. Rename: a. arch_tlbbatch_clean() to arch_tlbbatch_clear(), b. tlb_ubc_nowr to tlb_ubc_ro, c. migrc_try_flush_free_folios() to migrc_flush_free_folios(), d. migrc_stop to migrc_pause. (feedbacked by Nadav) 6. Use ->lru list_head instead of introducing a new llist_head. (feedbacked by Nadav) 7. Use non-atomic operations of page-flag when it's safe. (feedbacked by Nadav) 8. Use stack instead of keeping a pointer of 'struct migrc_req' in struct task, which is for manipulating it locally. (feedbacked by Nadav) 9. Replace a lot of simple functions to inline functions placed in a header, mm/internal.h. (feedbacked by Nadav) 10. Add additional sufficient comments. (feedbacked by Nadav) 11. Remove a lot of wrapper functions. (feedbacked by Nadav) Changes from RFC v2: 1. Remove additional occupation in struct page. To do that, unioned with lru field for migrc's list and added a page flag. I know page flag is a thing that we don't like to add but no choice because migrc should distinguish folios under migrc's control from others. Instead, I force migrc to be used only on 64 bit system to mitigate you guys from getting angry. 2. 
Remove meaningless internal object allocator that I introduced to minimize impact onto the system. However, a ton of tests showed there was no difference. 3. Stop migrc from working when the system is in high memory pressure like about to perform direct reclaim. At the condition where the swap mechanism is heavily used, I found the system suffered from regression without this control. 4. Exclude folios that pte_dirty() == true from migrc's interest so that migrc can work simpler. 5. Combine several patches that work tightly coupled to one. 6. Add sufficient comments for better review. 7. Manage migrc's request in per-node manner (from globally). 8. Add TLB miss improvement in commit message. 9. Test with more CPUs(4 -> 16) to see bigger improvement. Changes from RFC: 1. Fix a bug triggered when a destination folio at the previous migration becomes a source folio at the next migration, before the folio gets handled properly so that the folio can play with another migration. There was inconsistency in the folio's state. Fixed it. 2. Split the patch set into more pieces so that the folks can review better. (Feedbacked by Nadav Amit) 3. Fix a wrong usage of barrier e.g. smp_mb__after_atomic(). (Feedbacked by Nadav Amit) 4. Tried to add sufficient comments to explain the patch set better. (Feedbacked by Nadav Amit) Byungchul Park (3): mm/rmap: Recognize read-only TLB entries during batched TLB flush mm: Defer TLB flush by keeping both src and dst folios at migration mm: Pause migrc mechanism at high memory pressure arch/x86/include/asm/tlbflush.h | 3 + arch/x86/mm/tlb.c | 11 ++ include/linux/mm_types.h | 21 +++ include/linux/mmzone.h | 9 ++ include/linux/page-flags.h | 4 + include/linux/sched.h | 7 + include/trace/events/mmflags.h | 3 +- mm/internal.h | 78 ++++++++++ mm/memory.c | 11 ++ mm/migrate.c | 266 ++++++++++++++++++++++++++++++++ mm/page_alloc.c | 30 +++- mm/rmap.c | 35 ++++- 12 files changed, 475 insertions(+), 3 deletions(-) -- 2.17.1 ^ permalink raw reply [flat|nested] 17+ messages in thread
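To make the safety argument in the cover letter concrete: while the flush is deferred no PTE maps the source folio with write permission, so a CPU that still translates through a stale TLB entry can only read, and what it reads from the old copy is byte-identical to the new copy. Below is a minimal user-space sketch of that invariant; the buffers and names are hypothetical stand-ins, not anything in the patch set.

#include <assert.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
	char src[4096], dst[4096];

	memset(src, 0x5a, sizeof(src));		/* content of the source folio */
	memcpy(dst, src, sizeof(dst));		/* migration copies src to dst */

	const char *fresh = dst;	/* CPU whose TLB was already flushed */
	const char *stale = src;	/* CPU still holding a stale entry   */

	/* No writable mapping exists, so neither copy can change. */
	assert(memcmp(fresh, stale, sizeof(src)) == 0);
	printf("reads through stale and fresh translations agree\n");
	return 0;
}

The patches below enforce the other half of the invariant: the moment a write is about to happen (do_wp_page()), the deferred flush is performed first.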
* [v4 1/3] mm/rmap: Recognize read-only TLB entries during batched TLB flush 2023-11-09 4:59 [v4 0/3] Reduce TLB flushes under some specific conditions Byungchul Park @ 2023-11-09 4:59 ` Byungchul Park 2023-11-09 20:26 ` kernel test robot 2023-11-09 4:59 ` [v4 2/3] mm: Defer TLB flush by keeping both src and dst folios at migration Byungchul Park ` (3 subsequent siblings) 4 siblings, 1 reply; 17+ messages in thread From: Byungchul Park @ 2023-11-09 4:59 UTC (permalink / raw) To: linux-kernel, linux-mm Cc: kernel_team, akpm, ying.huang, namit, xhao, mgorman, hughd, willy, david, peterz, luto, tglx, mingo, bp, dave.hansen Functionally, no change. This is a preparation for migrc mechanism that requires to recognize read-only TLB entries and makes use of them to batch more aggressively. Signed-off-by: Byungchul Park <byungchul@sk.com> --- arch/x86/include/asm/tlbflush.h | 3 +++ arch/x86/mm/tlb.c | 11 +++++++++++ include/linux/sched.h | 1 + mm/internal.h | 4 ++++ mm/rmap.c | 30 +++++++++++++++++++++++++++++- 5 files changed, 48 insertions(+), 1 deletion(-) diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h index 25726893c6f4..5c618a8821de 100644 --- a/arch/x86/include/asm/tlbflush.h +++ b/arch/x86/include/asm/tlbflush.h @@ -292,6 +292,9 @@ static inline void arch_flush_tlb_batched_pending(struct mm_struct *mm) } extern void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch); +extern void arch_tlbbatch_clear(struct arch_tlbflush_unmap_batch *batch); +extern void arch_tlbbatch_fold(struct arch_tlbflush_unmap_batch *bdst, + struct arch_tlbflush_unmap_batch *bsrc); static inline bool pte_flags_need_flush(unsigned long oldflags, unsigned long newflags, diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c index 453ea95b667d..d3c89a3d91eb 100644 --- a/arch/x86/mm/tlb.c +++ b/arch/x86/mm/tlb.c @@ -1274,6 +1274,17 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch) put_cpu(); } +void arch_tlbbatch_clear(struct arch_tlbflush_unmap_batch *batch) +{ + cpumask_clear(&batch->cpumask); +} + +void arch_tlbbatch_fold(struct arch_tlbflush_unmap_batch *bdst, + struct arch_tlbflush_unmap_batch *bsrc) +{ + cpumask_or(&bdst->cpumask, &bdst->cpumask, &bsrc->cpumask); +} + /* * Blindly accessing user memory from NMI context can be dangerous * if we're in the middle of switching the current user task or diff --git a/include/linux/sched.h b/include/linux/sched.h index 77f01ac385f7..8a31527d9ed8 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1324,6 +1324,7 @@ struct task_struct { #endif struct tlbflush_unmap_batch tlb_ubc; + struct tlbflush_unmap_batch tlb_ubc_ro; /* Cache last used pipe for splice(): */ struct pipe_inode_info *splice_pipe; diff --git a/mm/internal.h b/mm/internal.h index 30cf724ddbce..9764b240e259 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -861,6 +861,7 @@ extern struct workqueue_struct *mm_percpu_wq; void try_to_unmap_flush(void); void try_to_unmap_flush_dirty(void); void flush_tlb_batched_pending(struct mm_struct *mm); +void fold_ubc_ro(void); #else static inline void try_to_unmap_flush(void) { @@ -871,6 +872,9 @@ static inline void try_to_unmap_flush_dirty(void) static inline void flush_tlb_batched_pending(struct mm_struct *mm) { } +static inline void fold_ubc_ro(void) +{ +} #endif /* CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH */ extern const struct trace_print_flags pageflag_names[]; diff --git a/mm/rmap.c b/mm/rmap.c index 9f795b93cf40..c787ae94b4c6 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -605,6 +605,28 @@ struct 
anon_vma *folio_lock_anon_vma_read(struct folio *folio, } #ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH + +void fold_ubc_ro(void) +{ + struct tlbflush_unmap_batch *tlb_ubc = &current->tlb_ubc; + struct tlbflush_unmap_batch *tlb_ubc_ro = &current->tlb_ubc_ro; + + if (!tlb_ubc_ro->flush_required) + return; + + /* + * Fold tlb_ubc_ro's data to tlb_ubc. + */ + arch_tlbbatch_fold(&tlb_ubc->arch, &tlb_ubc_ro->arch); + tlb_ubc->flush_required = true; + + /* + * Reset tlb_ubc_ro's data. + */ + arch_tlbbatch_clear(&tlb_ubc_ro->arch); + tlb_ubc_ro->flush_required = false; +} + /* * Flush TLB entries for recently unmapped pages from remote CPUs. It is * important if a PTE was dirty when it was unmapped that it's flushed @@ -615,6 +637,7 @@ void try_to_unmap_flush(void) { struct tlbflush_unmap_batch *tlb_ubc = &current->tlb_ubc; + fold_ubc_ro(); if (!tlb_ubc->flush_required) return; @@ -645,13 +668,18 @@ void try_to_unmap_flush_dirty(void) static void set_tlb_ubc_flush_pending(struct mm_struct *mm, pte_t pteval, unsigned long uaddr) { - struct tlbflush_unmap_batch *tlb_ubc = &current->tlb_ubc; + struct tlbflush_unmap_batch *tlb_ubc; int batch; bool writable = pte_dirty(pteval); if (!pte_accessible(mm, pteval)) return; + if (pte_write(pteval) || writable) + tlb_ubc = &current->tlb_ubc; + else + tlb_ubc = &current->tlb_ubc_ro; + arch_tlbbatch_add_pending(&tlb_ubc->arch, mm, uaddr); tlb_ubc->flush_required = true; -- 2.17.1 ^ permalink raw reply related [flat|nested] 17+ messages in thread
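For readers who want the shape of this change without the diff context: the patch keeps two unmap batches per task, routes read-only unmaps into the second one, and folds it back into the main batch right before the single flush. A user-space sketch of that flow, with a 64-bit mask standing in for struct cpumask (hypothetical model, not the kernel code):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct batch {
	uint64_t cpumask;	/* CPUs that may cache stale entries */
	bool flush_required;
};

static struct batch ubc, ubc_ro;

static void add_pending(struct batch *b, int cpu)
{
	b->cpumask |= 1ULL << cpu;
	b->flush_required = true;
}

static void fold_ro(void)
{
	if (!ubc_ro.flush_required)
		return;
	ubc.cpumask |= ubc_ro.cpumask;	/* arch_tlbbatch_fold()  */
	ubc.flush_required = true;
	ubc_ro.cpumask = 0;		/* arch_tlbbatch_clear() */
	ubc_ro.flush_required = false;
}

int main(void)
{
	add_pending(&ubc, 0);		/* writable mapping unmapped on CPU0      */
	add_pending(&ubc_ro, 3);	/* read-only mappings unmapped on CPU3/5  */
	add_pending(&ubc_ro, 5);

	fold_ro();			/* try_to_unmap_flush() folds first ...   */
	printf("flush once for CPUs 0x%llx\n",
	       (unsigned long long)ubc.cpumask);	/* ... then flushes 0x29  */
	return 0;
}

Keeping the read-only entries apart costs nothing on its own; the point is that the next patch can hand them to a deferred-flush path instead of folding them back.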
* Re: [v4 1/3] mm/rmap: Recognize read-only TLB entries during batched TLB flush 2023-11-09 4:59 ` [v4 1/3] mm/rmap: Recognize read-only TLB entries during batched TLB flush Byungchul Park @ 2023-11-09 20:26 ` kernel test robot 0 siblings, 0 replies; 17+ messages in thread From: kernel test robot @ 2023-11-09 20:26 UTC (permalink / raw) To: Byungchul Park, linux-kernel, linux-mm Cc: oe-kbuild-all, kernel_team, akpm, ying.huang, namit, xhao, mgorman, hughd, willy, david, peterz, luto, tglx, mingo, bp, dave.hansen Hi Byungchul, kernel test robot noticed the following build errors: [auto build test ERROR on tip/sched/core] [also build test ERROR on tip/x86/core tip/x86/mm linus/master v6.6 next-20231109] [cannot apply to akpm-mm/mm-everything] [If your patch is applied to the wrong git tree, kindly drop us a note. And when submitting patch, we suggest to use '--base' as documented in https://git-scm.com/docs/git-format-patch#_base_tree_information] url: https://github.com/intel-lab-lkp/linux/commits/Byungchul-Park/mm-rmap-Recognize-read-only-TLB-entries-during-batched-TLB-flush/20231109-163706 base: tip/sched/core patch link: https://lore.kernel.org/r/20231109045908.54996-2-byungchul%40sk.com patch subject: [v4 1/3] mm/rmap: Recognize read-only TLB entries during batched TLB flush config: arm64-randconfig-002-20231109 (https://download.01.org/0day-ci/archive/20231110/202311100429.nc4jJoNu-lkp@intel.com/config) compiler: aarch64-linux-gcc (GCC) 13.2.0 reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20231110/202311100429.nc4jJoNu-lkp@intel.com/reproduce) If you fix the issue in a separate patch/commit (i.e. not just a new version of the same patch/commit), kindly add following tags | Reported-by: kernel test robot <lkp@intel.com> | Closes: https://lore.kernel.org/oe-kbuild-all/202311100429.nc4jJoNu-lkp@intel.com/ All errors (new ones prefixed by >>): mm/rmap.c: In function 'fold_ubc_ro': >> mm/rmap.c:620:9: error: implicit declaration of function 'arch_tlbbatch_fold'; did you mean 'arch_tlbbatch_flush'? [-Werror=implicit-function-declaration] 620 | arch_tlbbatch_fold(&tlb_ubc->arch, &tlb_ubc_ro->arch); | ^~~~~~~~~~~~~~~~~~ | arch_tlbbatch_flush >> mm/rmap.c:626:9: error: implicit declaration of function 'arch_tlbbatch_clear'; did you mean 'arch_tlbbatch_flush'? [-Werror=implicit-function-declaration] 626 | arch_tlbbatch_clear(&tlb_ubc_ro->arch); | ^~~~~~~~~~~~~~~~~~~ | arch_tlbbatch_flush cc1: some warnings being treated as errors vim +620 mm/rmap.c 608 609 void fold_ubc_ro(void) 610 { 611 struct tlbflush_unmap_batch *tlb_ubc = ¤t->tlb_ubc; 612 struct tlbflush_unmap_batch *tlb_ubc_ro = ¤t->tlb_ubc_ro; 613 614 if (!tlb_ubc_ro->flush_required) 615 return; 616 617 /* 618 * Fold tlb_ubc_ro's data to tlb_ubc. 619 */ > 620 arch_tlbbatch_fold(&tlb_ubc->arch, &tlb_ubc_ro->arch); 621 tlb_ubc->flush_required = true; 622 623 /* 624 * Reset tlb_ubc_ro's data. 625 */ > 626 arch_tlbbatch_clear(&tlb_ubc_ro->arch); 627 tlb_ubc_ro->flush_required = false; 628 } 629 -- 0-DAY CI Kernel Test Service https://github.com/intel/lkp-tests/wiki ^ permalink raw reply [flat|nested] 17+ messages in thread
* [v4 2/3] mm: Defer TLB flush by keeping both src and dst folios at migration 2023-11-09 4:59 [v4 0/3] Reduce TLB flushes under some specific conditions Byungchul Park 2023-11-09 4:59 ` [v4 1/3] mm/rmap: Recognize read-only TLB entries during batched TLB flush Byungchul Park @ 2023-11-09 4:59 ` Byungchul Park 2023-11-09 14:36 ` Matthew Wilcox ` (2 more replies) 2023-11-09 4:59 ` [v4 3/3] mm: Pause migrc mechanism at high memory pressure Byungchul Park ` (2 subsequent siblings) 4 siblings, 3 replies; 17+ messages in thread From: Byungchul Park @ 2023-11-09 4:59 UTC (permalink / raw) To: linux-kernel, linux-mm Cc: kernel_team, akpm, ying.huang, namit, xhao, mgorman, hughd, willy, david, peterz, luto, tglx, mingo, bp, dave.hansen Implementation of MIGRC mechanism that stands for 'Migration Read Copy'. We always face the migration overhead at either promotion or demotion, while working with tiered memory e.g. CXL memory and found out TLB shootdown is a quite big one that is needed to get rid of if possible. Fortunately, TLB flush can be defered if both source and destination of folios during migration are kept until all TLB flushes required will have been done, of course, only if the target PTE entries have read-only permission, more precisely speaking, don't have write permission. Otherwise, no doubt the folio might get messed up. To achieve that: 1. For the folios that map only to non-writable TLB entries, prevent TLB flush at migration by keeping both source and destination folios, which will be handled later at a better time. 2. When any non-writable TLB entry changes to writable e.g. through fault handler, give up migrc mechanism so as to perform TLB flush required right away. The measurement result: Architecture - x86_64 QEMU - kvm enabled, host cpu Numa - 2 nodes (16 CPUs 1GB, no CPUs 8GB) Linux Kernel - v6.6-rc5, numa balancing tiering on, demotion enabled Benchmark - XSBench -p 50000000 (-p option makes the runtime longer) run 'perf stat' using events: 1) itlb.itlb_flush 2) tlb_flush.dtlb_thread 3) tlb_flush.stlb_any 4) dTLB-load-misses 5) dTLB-store-misses 6) iTLB-load-misses run 'cat /proc/vmstat' and pick: 1) numa_pages_migrated 2) pgmigrate_success 3) nr_tlb_remote_flush 4) nr_tlb_remote_flush_received 5) nr_tlb_local_flush_all 6) nr_tlb_local_flush_one BEFORE - mainline v6.6-rc5 ------------------------------------------ $ perf stat -a \ -e itlb.itlb_flush \ -e tlb_flush.dtlb_thread \ -e tlb_flush.stlb_any \ -e dTLB-load-misses \ -e dTLB-store-misses \ -e iTLB-load-misses \ ./XSBench -p 50000000 Performance counter stats for 'system wide': 20953405 itlb.itlb_flush 114886593 tlb_flush.dtlb_thread 88267015 tlb_flush.stlb_any 115304095543 dTLB-load-misses 163904743 dTLB-store-misses 608486259 iTLB-load-misses 556.787113849 seconds time elapsed $ cat /proc/vmstat ... numa_pages_migrated 3378748 pgmigrate_success 7720310 nr_tlb_remote_flush 751464 nr_tlb_remote_flush_received 10742115 nr_tlb_local_flush_all 21899 nr_tlb_local_flush_one 740157 ... AFTER - mainline v6.6-rc5 + migrc ------------------------------------------ $ perf stat -a \ -e itlb.itlb_flush \ -e tlb_flush.dtlb_thread \ -e tlb_flush.stlb_any \ -e dTLB-load-misses \ -e dTLB-store-misses \ -e iTLB-load-misses \ ./XSBench -p 50000000 Performance counter stats for 'system wide': 4353555 itlb.itlb_flush 72482780 tlb_flush.dtlb_thread 68226458 tlb_flush.stlb_any 114331610808 dTLB-load-misses 116084771 dTLB-store-misses 377180518 iTLB-load-misses 552.667718220 seconds time elapsed $ cat /proc/vmstat ... 
numa_pages_migrated 3339325 pgmigrate_success 7642363 nr_tlb_remote_flush 192913 nr_tlb_remote_flush_received 2327426 nr_tlb_local_flush_all 25759 nr_tlb_local_flush_one 740454 ... Signed-off-by: Byungchul Park <byungchul@sk.com> --- include/linux/mm_types.h | 21 ++++ include/linux/mmzone.h | 9 ++ include/linux/page-flags.h | 4 + include/linux/sched.h | 6 + include/trace/events/mmflags.h | 3 +- mm/internal.h | 57 +++++++++ mm/memory.c | 11 ++ mm/migrate.c | 215 +++++++++++++++++++++++++++++++++ mm/page_alloc.c | 17 ++- mm/rmap.c | 11 +- 10 files changed, 349 insertions(+), 5 deletions(-) diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 36c5b43999e6..202de9b09bd6 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -1372,4 +1372,25 @@ enum { /* See also internal only FOLL flags in mm/internal.h */ }; +struct migrc_req { + /* + * folios pending for TLB flush + */ + struct list_head folios; + + /* + * for hanging to the associated numa node + */ + struct llist_node llnode; + + /* + * architecture specific data for batched TLB flush + */ + struct arch_tlbflush_unmap_batch arch; + + /* + * associated numa node + */ + int nid; +}; #endif /* _LINUX_MM_TYPES_H */ diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 4106fbc5b4b3..b79ac8053c6a 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -980,6 +980,11 @@ struct zone { /* Zone statistics */ atomic_long_t vm_stat[NR_VM_ZONE_STAT_ITEMS]; atomic_long_t vm_numa_event[NR_VM_NUMA_EVENT_ITEMS]; + + /* + * the number of folios pending for TLB flush in the zone + */ + atomic_t migrc_pending_nr; } ____cacheline_internodealigned_in_smp; enum pgdat_flags { @@ -1398,6 +1403,10 @@ typedef struct pglist_data { #ifdef CONFIG_MEMORY_FAILURE struct memory_failure_stats mf_stats; #endif + /* + * migrc requests including folios pending for TLB flush + */ + struct llist_head migrc_reqs; } pg_data_t; #define node_present_pages(nid) (NODE_DATA(nid)->node_present_pages) diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h index 5c02720c53a5..ec7c178bfb49 100644 --- a/include/linux/page-flags.h +++ b/include/linux/page-flags.h @@ -136,6 +136,7 @@ enum pageflags { PG_arch_2, PG_arch_3, #endif + PG_migrc, /* Page is under migrc's control */ __NR_PAGEFLAGS, PG_readahead = PG_reclaim, @@ -589,6 +590,9 @@ TESTCLEARFLAG(Young, young, PF_ANY) PAGEFLAG(Idle, idle, PF_ANY) #endif +TESTCLEARFLAG(Migrc, migrc, PF_ANY) +__PAGEFLAG(Migrc, migrc, PF_ANY) + /* * PageReported() is used to track reported free pages within the Buddy * allocator. 
We can use the non-atomic version of the test and set diff --git a/include/linux/sched.h b/include/linux/sched.h index 8a31527d9ed8..a2c16d21e365 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1326,6 +1326,12 @@ struct task_struct { struct tlbflush_unmap_batch tlb_ubc; struct tlbflush_unmap_batch tlb_ubc_ro; + /* + * if all the mappings of a folio during unmap are RO so that + * migrc can work on it + */ + bool can_migrc; + /* Cache last used pipe for splice(): */ struct pipe_inode_info *splice_pipe; diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h index 1478b9dd05fa..dafe302444d9 100644 --- a/include/trace/events/mmflags.h +++ b/include/trace/events/mmflags.h @@ -118,7 +118,8 @@ DEF_PAGEFLAG_NAME(mappedtodisk), \ DEF_PAGEFLAG_NAME(reclaim), \ DEF_PAGEFLAG_NAME(swapbacked), \ - DEF_PAGEFLAG_NAME(unevictable) \ + DEF_PAGEFLAG_NAME(unevictable), \ + DEF_PAGEFLAG_NAME(migrc) \ IF_HAVE_PG_MLOCK(mlocked) \ IF_HAVE_PG_UNCACHED(uncached) \ IF_HAVE_PG_HWPOISON(hwpoison) \ diff --git a/mm/internal.h b/mm/internal.h index 9764b240e259..a2b6f0321729 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -1158,4 +1158,61 @@ struct vma_prepare { struct vm_area_struct *remove; struct vm_area_struct *remove2; }; + +/* + * Initialize the page when allocated from buddy allocator. + */ +static inline void migrc_init_page(struct page *p) +{ + __ClearPageMigrc(p); +} + +/* + * Check if the folio is pending for TLB flush and then clear the flag. + */ +static inline bool migrc_unpend_if_pending(struct folio *f) +{ + return folio_test_clear_migrc(f); +} + +/* + * Reset the indicator indicating there are no writable mappings at the + * beginning of every rmap traverse for unmap. Migrc can work only when + * all the mappings are RO. + */ +static inline void can_migrc_init(void) +{ + current->can_migrc = true; +} + +/* + * Mark the folio is not applicable to migrc once it found a writble or + * dirty pte during rmap traverse for unmap. + */ +static inline void can_migrc_fail(void) +{ + current->can_migrc = false; +} + +/* + * Check if all the mappings are RO and RO mappings even exist. + */ +static inline bool can_migrc_test(void) +{ + return current->can_migrc && current->tlb_ubc_ro.flush_required; +} + +/* + * Return the number of folios pending TLB flush that have yet to get + * freed in the zone. + */ +static inline int migrc_pending_nr_in_zone(struct zone *z) +{ + return atomic_read(&z->migrc_pending_nr); +} + +/* + * Perform TLB flush needed and free the folios in the node. + */ +bool migrc_flush_free_folios(nodemask_t *nodes); #endif /* __MM_INTERNAL_H */ diff --git a/mm/memory.c b/mm/memory.c index 6c264d2f969c..5287ea1639cc 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -3359,6 +3359,17 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf) if (vmf->page) folio = page_folio(vmf->page); + /* + * This folio has its read copy to prevent inconsistency while + * deferring TLB flushes. However, the problem might arise if + * it's going to become writable. + * + * To prevent it, give up the deferring TLB flushes and perform + * TLB flush right away. + */ + if (folio && migrc_unpend_if_pending(folio)) + migrc_flush_free_folios(NULL); + /* * Shared mapping: we are guaranteed to have VM_WRITE and * FAULT_FLAG_WRITE set at this point. diff --git a/mm/migrate.c b/mm/migrate.c index 2053b54556ca..9ab7794b0390 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -57,6 +57,162 @@ #include "internal.h" +/* + * Marks the folios as pending for TLB flush. 
+ */ +static void migrc_mark_pending(struct folio *fsrc, struct folio *fdst) +{ + folio_get(fsrc); + __folio_set_migrc(fsrc); + __folio_set_migrc(fdst); +} + +static bool migrc_under_processing(struct folio *fsrc, struct folio *fdst) +{ + /* + * case1. folio_test_migrc(fsrc) && !folio_test_migrc(fdst): + * + * fsrc was already under migrc's control even before the + * current migration. Migrc doesn't work with it this time. + * + * case2. !folio_test_migrc(fsrc) && !folio_test_migrc(fdst): + * + * This is the normal case that is not migrc's interest. + * + * case3. folio_test_migrc(fsrc) && folio_test_migrc(fdst): + * + * Only the case that migrc works on. + */ + return folio_test_migrc(fsrc) && folio_test_migrc(fdst); +} + +static void migrc_undo_folios(struct folio *fsrc, struct folio *fdst) +{ + /* + * TLB flushes needed are already done at this moment so the + * flag doesn't have to be kept. + */ + __folio_clear_migrc(fsrc); + __folio_clear_migrc(fdst); + folio_put(fsrc); +} + +static void migrc_expand_req(struct folio *fsrc, struct folio *fdst, + struct migrc_req *req) +{ + if (req->nid == -1) + req->nid = folio_nid(fsrc); + + /* + * All the nids in a req should be the same. + */ + WARN_ON(req->nid != folio_nid(fsrc)); + + list_add(&fsrc->lru, &req->folios); + atomic_inc(&folio_zone(fsrc)->migrc_pending_nr); +} + +/* + * Prepares for gathering folios pending for TLB flushes, try to + * allocate objects needed, initialize them and make them ready. + */ +static struct migrc_req *migrc_req_start(void) +{ + struct migrc_req *req; + + req = kmalloc(sizeof(struct migrc_req), GFP_KERNEL); + if (!req) + return NULL; + + arch_tlbbatch_clear(&req->arch); + INIT_LIST_HEAD(&req->folios); + req->nid = -1; + + return req; +} + +/* + * Hang the request with the collected folios to the corresponding node. + */ +static void migrc_req_end(struct migrc_req *req) +{ + if (!req) + return; + + if (list_empty(&req->folios)) { + kfree(req); + return; + } + + llist_add(&req->llnode, &NODE_DATA(req->nid)->migrc_reqs); +} + +/* + * Gather folios and architecture specific data to handle. + */ +static void migrc_gather(struct list_head *folios, + struct arch_tlbflush_unmap_batch *arch, + struct llist_head *reqs) +{ + struct llist_node *nodes; + struct migrc_req *req; + struct migrc_req *req2; + + nodes = llist_del_all(reqs); + if (!nodes) + return; + + llist_for_each_entry_safe(req, req2, nodes, llnode) { + arch_tlbbatch_fold(arch, &req->arch); + list_splice(&req->folios, folios); + kfree(req); + } +} + +bool migrc_flush_free_folios(nodemask_t *nodes) +{ + struct folio *f, *f2; + int nid; + struct arch_tlbflush_unmap_batch arch; + LIST_HEAD(folios); + + if (!nodes) + nodes = &node_possible_map; + arch_tlbbatch_clear(&arch); + + for_each_node_mask(nid, *nodes) + migrc_gather(&folios, &arch, &NODE_DATA(nid)->migrc_reqs); + + if (list_empty(&folios)) + return false; + + arch_tlbbatch_flush(&arch); + list_for_each_entry_safe(f, f2, &folios, lru) { + atomic_dec(&folio_zone(f)->migrc_pending_nr); + folio_put(f); + } + return true; +} + +static void fold_ubc_ro_to_migrc(struct migrc_req *req) +{ + struct tlbflush_unmap_batch *tlb_ubc_ro = ¤t->tlb_ubc_ro; + + if (!tlb_ubc_ro->flush_required) + return; + + /* + * Fold tlb_ubc_ro's data to the request. + */ + arch_tlbbatch_fold(&req->arch, &tlb_ubc_ro->arch); + + /* + * Reset tlb_ubc_ro's data. 
+ */ + arch_tlbbatch_clear(&tlb_ubc_ro->arch); + tlb_ubc_ro->flush_required = false; +} + bool isolate_movable_page(struct page *page, isolate_mode_t mode) { struct folio *folio = folio_get_nontail_page(page); @@ -379,6 +535,7 @@ static int folio_expected_refs(struct address_space *mapping, struct folio *folio) { int refs = 1; + if (!mapping) return refs; @@ -406,6 +563,12 @@ int folio_migrate_mapping(struct address_space *mapping, int expected_count = folio_expected_refs(mapping, folio) + extra_count; long nr = folio_nr_pages(folio); + /* + * Migrc mechanism increased the reference count. + */ + if (migrc_under_processing(folio, newfolio)) + expected_count++; + if (!mapping) { /* Anonymous page without mapping */ if (folio_ref_count(folio) != expected_count) @@ -1620,16 +1783,25 @@ static int migrate_pages_batch(struct list_head *from, LIST_HEAD(unmap_folios); LIST_HEAD(dst_folios); bool nosplit = (reason == MR_NUMA_MISPLACED); + struct migrc_req *mreq = NULL; VM_WARN_ON_ONCE(mode != MIGRATE_ASYNC && !list_empty(from) && !list_is_singular(from)); + /* + * Apply migrc only to numa migration for now. + */ + if (reason == MR_DEMOTION || reason == MR_NUMA_MISPLACED) + mreq = migrc_req_start(); + for (pass = 0; pass < nr_pass && retry; pass++) { retry = 0; thp_retry = 0; nr_retry_pages = 0; list_for_each_entry_safe(folio, folio2, from, lru) { + bool can_migrc; + is_thp = folio_test_large(folio) && folio_test_pmd_mappable(folio); nr_pages = folio_nr_pages(folio); @@ -1657,9 +1829,21 @@ static int migrate_pages_batch(struct list_head *from, continue; } + can_migrc_init(); rc = migrate_folio_unmap(get_new_folio, put_new_folio, private, folio, &dst, mode, reason, ret_folios); + /* + * can_migrc is true only if: + * + * 1. struct migrc_req has been allocated && + * 2. There's no writable mapping at all && + * 3. There's read-only mapping found && + * 4. Not under migrc's control already + */ + can_migrc = mreq && can_migrc_test() && + !folio_test_migrc(folio); + /* * The rules are: * Success: folio will be freed @@ -1720,6 +1904,19 @@ static int migrate_pages_batch(struct list_head *from, case MIGRATEPAGE_UNMAP: list_move_tail(&folio->lru, &unmap_folios); list_add_tail(&dst->lru, &dst_folios); + + if (can_migrc) { + /* + * To use ->lru exclusively, just + * mark the page flag for now. + * + * The folio will be queued to + * the current migrc request on + * move success below. + */ + migrc_mark_pending(folio, dst); + fold_ubc_ro_to_migrc(mreq); + } break; default: /* @@ -1733,6 +1930,11 @@ static int migrate_pages_batch(struct list_head *from, stats->nr_failed_pages += nr_pages; break; } + /* + * Done with the current folio. Fold the ro + * batch data gathered, to the normal batch. + */ + fold_ubc_ro(); } } nr_failed += retry; @@ -1774,6 +1976,14 @@ static int migrate_pages_batch(struct list_head *from, case MIGRATEPAGE_SUCCESS: stats->nr_succeeded += nr_pages; stats->nr_thp_succeeded += is_thp; + + /* + * Now that it's safe to use ->lru, + * queue the folio to the current migrc + * request. + */ + if (migrc_under_processing(folio, dst)) + migrc_expand_req(folio, dst, mreq); break; default: nr_failed++; @@ -1791,6 +2001,8 @@ static int migrate_pages_batch(struct list_head *from, rc = rc_saved ? 
: nr_failed; out: + migrc_req_end(mreq); + /* Cleanup remaining folios */ dst = list_first_entry(&dst_folios, struct folio, lru); dst2 = list_next_entry(dst, lru); @@ -1798,6 +2010,9 @@ static int migrate_pages_batch(struct list_head *from, int page_was_mapped = 0; struct anon_vma *anon_vma = NULL; + if (migrc_under_processing(folio, dst)) + migrc_undo_folios(folio, dst); + __migrate_folio_extract(dst, &page_was_mapped, &anon_vma); migrate_folio_undo_src(folio, page_was_mapped, anon_vma, true, ret_folios); diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 95546f376302..914e93ab598e 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1535,6 +1535,9 @@ inline void post_alloc_hook(struct page *page, unsigned int order, set_page_owner(page, order, gfp_flags); page_table_check_alloc(page, order); + + for (i = 0; i != 1 << order; ++i) + migrc_init_page(page + i); } static void prep_new_page(struct page *page, unsigned int order, gfp_t gfp_flags, @@ -2839,6 +2842,8 @@ bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark, long min = mark; int o; + free_pages += migrc_pending_nr_in_zone(z); + /* free_pages may go negative - that's OK */ free_pages -= __zone_watermark_unusable_free(z, order, alloc_flags); @@ -2933,7 +2938,7 @@ static inline bool zone_watermark_fast(struct zone *z, unsigned int order, long usable_free; long reserved; - usable_free = free_pages; + usable_free = free_pages + migrc_pending_nr_in_zone(z); reserved = __zone_watermark_unusable_free(z, 0, alloc_flags); /* reserved may over estimate high-atomic reserves. */ @@ -3121,6 +3126,16 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags, gfp_mask)) { int ret; + /* + * Free the pending folios so that the remaining + * code can use the updated vmstats and check + * zone_watermark_fast() again. + */ + migrc_flush_free_folios(ac->nodemask); + if (zone_watermark_fast(zone, order, mark, + ac->highest_zoneidx, alloc_flags, gfp_mask)) + goto try_this_zone; + if (has_unaccepted_memory()) { if (try_to_accept_memory(zone, order)) goto try_this_zone; diff --git a/mm/rmap.c b/mm/rmap.c index c787ae94b4c6..8786c14e08c9 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -605,7 +605,6 @@ struct anon_vma *folio_lock_anon_vma_read(struct folio *folio, } #ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH - void fold_ubc_ro(void) { struct tlbflush_unmap_batch *tlb_ubc = ¤t->tlb_ubc; @@ -675,9 +674,15 @@ static void set_tlb_ubc_flush_pending(struct mm_struct *mm, pte_t pteval, if (!pte_accessible(mm, pteval)) return; - if (pte_write(pteval) || writable) + if (pte_write(pteval) || writable) { tlb_ubc = ¤t->tlb_ubc; - else + + /* + * migrc cannot work with the folio once it found a + * writable or dirty mapping of it. + */ + can_migrc_fail(); + } else tlb_ubc = ¤t->tlb_ubc_ro; arch_tlbbatch_add_pending(&tlb_ubc->arch, mm, uaddr); -- 2.17.1 ^ permalink raw reply related [flat|nested] 17+ messages in thread
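To summarize the moving parts above outside of the migration loop: a request is started per batch, folios whose mappings were all read-only are parked on it together with the folded CPU mask, and a later event (a write fault or allocator pressure) performs one combined flush and only then frees the kept source folios. A rough user-space model, with plain integers standing in for folios, nodes and cpumasks (hypothetical, not the kernel code):

#include <stdint.h>
#include <stdio.h>

#define MAX_PENDING 16

struct req {
	uint64_t cpumask;		/* folded arch batch            */
	int pending[MAX_PENDING];	/* ids of kept source folios    */
	int nr;
};

static struct req node_req;		/* per-node in the real code    */

static void migrc_queue(int folio_id, uint64_t cpus)
{
	node_req.pending[node_req.nr++] = folio_id;	/* folio_get() kept it      */
	node_req.cpumask |= cpus;			/* fold_ubc_ro_to_migrc()   */
}

static void migrc_flush(void)
{
	if (!node_req.nr)
		return;
	printf("one TLB flush for CPUs 0x%llx\n",
	       (unsigned long long)node_req.cpumask);	/* arch_tlbbatch_flush()    */
	for (int i = 0; i < node_req.nr; i++)
		printf("free kept folio %d\n", node_req.pending[i]);	/* folio_put() */
	node_req.nr = 0;
	node_req.cpumask = 0;
}

int main(void)
{
	/* three read-only folios migrated without any TLB flush */
	migrc_queue(101, 0x3);
	migrc_queue(102, 0x5);
	migrc_queue(103, 0x9);

	/* later: a write fault or memory pressure triggers the real flush */
	migrc_flush();
	return 0;
}

The real code additionally counts the parked folios per zone so the watermark checks can treat them as memory that is one flush away from being free (migrc_pending_nr_in_zone()).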
* Re: [v4 2/3] mm: Defer TLB flush by keeping both src and dst folios at migration 2023-11-09 4:59 ` [v4 2/3] mm: Defer TLB flush by keeping both src and dst folios at migration Byungchul Park @ 2023-11-09 14:36 ` Matthew Wilcox 2023-11-10 1:29 ` Byungchul Park 2024-01-15 7:55 ` Byungchul Park 2023-11-09 17:09 ` kernel test robot 2023-11-09 19:07 ` kernel test robot 2 siblings, 2 replies; 17+ messages in thread From: Matthew Wilcox @ 2023-11-09 14:36 UTC (permalink / raw) To: Byungchul Park Cc: linux-kernel, linux-mm, kernel_team, akpm, ying.huang, namit, xhao, mgorman, hughd, david, peterz, luto, tglx, mingo, bp, dave.hansen On Thu, Nov 09, 2023 at 01:59:07PM +0900, Byungchul Park wrote: > +++ b/include/linux/page-flags.h > @@ -136,6 +136,7 @@ enum pageflags { > PG_arch_2, > PG_arch_3, > #endif > + PG_migrc, /* Page is under migrc's control */ > __NR_PAGEFLAGS, Yeah; no. We're out of page flags. And CXL is insufficiently compelling to add more. If you use CXL, you don't care about performance, by definition. > @@ -589,6 +590,9 @@ TESTCLEARFLAG(Young, young, PF_ANY) > PAGEFLAG(Idle, idle, PF_ANY) > #endif > > +TESTCLEARFLAG(Migrc, migrc, PF_ANY) > +__PAGEFLAG(Migrc, migrc, PF_ANY) Why PF_ANY? > +/* > + * Initialize the page when allocated from buddy allocator. > + */ > +static inline void migrc_init_page(struct page *p) > +{ > + __ClearPageMigrc(p); > +} This flag should already be clear ... ? > +/* > + * Check if the folio is pending for TLB flush and then clear the flag. > + */ > +static inline bool migrc_unpend_if_pending(struct folio *f) > +{ > + return folio_test_clear_migrc(f); > +} If you named the flag better, you wouldn't need this wrapper. > +static void migrc_mark_pending(struct folio *fsrc, struct folio *fdst) > +{ > + folio_get(fsrc); > + __folio_set_migrc(fsrc); > + __folio_set_migrc(fdst); > +} This is almost certainly unsafe. By using the non-atomic bit ops, you stand the risk of losing a simultaneous update to any other bit in this word. Like, say, someone trying to lock the folio? > +++ b/mm/page_alloc.c > @@ -1535,6 +1535,9 @@ inline void post_alloc_hook(struct page *page, unsigned int order, > > set_page_owner(page, order, gfp_flags); > page_table_check_alloc(page, order); > + > + for (i = 0; i != 1 << order; ++i) > + migrc_init_page(page + i); No. ^ permalink raw reply [flat|nested] 17+ messages in thread
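The lost-update concern is easy to reproduce outside the kernel: a plain read-modify-write of one flag bit can overwrite a concurrent change to a different bit of the same word, which is exactly what the __-prefixed non-atomic page-flag helpers permit. The sketch below spells the interleaving out step by step (user-space illustration only, with arbitrary bit numbers):

#include <assert.h>
#include <stdio.h>

int main(void)
{
	unsigned long flags = 0;

	/* CPU A: a non-atomic flag set is a load/modify/store sequence */
	unsigned long tmp = flags;	/* A loads the word (all clear)         */
	flags |= 1UL << 3;		/* CPU B atomically sets another bit    */
	tmp |= 1UL << 7;		/* A sets its own bit in the stale copy */
	flags = tmp;			/* A stores: B's bit is wiped out       */

	assert(!(flags & (1UL << 3)));	/* the concurrent update was lost */
	printf("flags = %#lx -- the bit set in between disappeared\n", flags);
	return 0;
}

This is why bits that other contexts may touch concurrently (the folio lock, writeback, etc.) have to be set and cleared with the atomic variants, which is what the reply below agrees to do.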
* Re: [v4 2/3] mm: Defer TLB flush by keeping both src and dst folios at migration 2023-11-09 14:36 ` Matthew Wilcox @ 2023-11-10 1:29 ` Byungchul Park 2024-01-15 7:55 ` Byungchul Park 1 sibling, 0 replies; 17+ messages in thread From: Byungchul Park @ 2023-11-10 1:29 UTC (permalink / raw) To: Matthew Wilcox Cc: linux-kernel, linux-mm, kernel_team, akpm, ying.huang, namit, xhao, mgorman, hughd, david, peterz, luto, tglx, mingo, bp, dave.hansen On Thu, Nov 09, 2023 at 02:36:01PM +0000, Matthew Wilcox wrote: > On Thu, Nov 09, 2023 at 01:59:07PM +0900, Byungchul Park wrote: > > +++ b/include/linux/page-flags.h > > @@ -136,6 +136,7 @@ enum pageflags { > > PG_arch_2, > > PG_arch_3, > > #endif > > + PG_migrc, /* Page is under migrc's control */ > > __NR_PAGEFLAGS, > > Yeah; no. We're out of page flags. And CXL is insufficiently I should've forced migrc to work only for 64bit arch. I missed it while I removed kconifg for it. However, lemme try to avoid the additonal page flag anyway if possible. > compelling to add more. If you use CXL, you don't care about > performance, by definition. > > > @@ -589,6 +590,9 @@ TESTCLEARFLAG(Young, young, PF_ANY) > > PAGEFLAG(Idle, idle, PF_ANY) > > #endif > > > > +TESTCLEARFLAG(Migrc, migrc, PF_ANY) > > +__PAGEFLAG(Migrc, migrc, PF_ANY) > > Why PF_ANY? PF_HEAD looks more fit on the purpose. I will change it to PF_HEAD. > > +/* > > + * Initialize the page when allocated from buddy allocator. > > + */ > > +static inline void migrc_init_page(struct page *p) > > +{ > > + __ClearPageMigrc(p); > > +} > > This flag should already be clear ... ? That should be initialized either on allocation or on free. > > +/* > > + * Check if the folio is pending for TLB flush and then clear the flag. > > + */ > > +static inline bool migrc_unpend_if_pending(struct folio *f) > > +{ > > + return folio_test_clear_migrc(f); > > +} > > If you named the flag better, you wouldn't need this wrapper. I will. > > +static void migrc_mark_pending(struct folio *fsrc, struct folio *fdst) > > +{ > > + folio_get(fsrc); > > + __folio_set_migrc(fsrc); > > + __folio_set_migrc(fdst); > > +} > > This is almost certainly unsafe. By using the non-atomic bit ops, you > stand the risk of losing a simultaneous update to any other bit in this > word. Like, say, someone trying to lock the folio? fdst is not exposed yet so safe to use non-atomic in here IMHO. While.. fsrc's PG_locked is owned by the migration context and the folio has been successfully unmapped, so I thought it'd be safe but yeah I'm not convinced if fsrc is safe here for real. I will change it to atomic. > > +++ b/mm/page_alloc.c > > @@ -1535,6 +1535,9 @@ inline void post_alloc_hook(struct page *page, unsigned int order, > > > > set_page_owner(page, order, gfp_flags); > > page_table_check_alloc(page, order); > > + > > + for (i = 0; i != 1 << order; ++i) > > + migrc_init_page(page + i); > > No. I will change it. Appreciate your feedback. Byungchul ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [v4 2/3] mm: Defer TLB flush by keeping both src and dst folios at migration 2023-11-09 14:36 ` Matthew Wilcox 2023-11-10 1:29 ` Byungchul Park @ 2024-01-15 7:55 ` Byungchul Park 1 sibling, 0 replies; 17+ messages in thread From: Byungchul Park @ 2024-01-15 7:55 UTC (permalink / raw) To: Matthew Wilcox Cc: linux-kernel, linux-mm, kernel_team, akpm, ying.huang, namit, xhao, mgorman, hughd, david, peterz, luto, tglx, mingo, bp, dave.hansen On Thu, Nov 09, 2023 at 02:36:01PM +0000, Matthew Wilcox wrote: > On Thu, Nov 09, 2023 at 01:59:07PM +0900, Byungchul Park wrote: > > +++ b/include/linux/page-flags.h > > @@ -136,6 +136,7 @@ enum pageflags { > > PG_arch_2, > > PG_arch_3, > > #endif > > + PG_migrc, /* Page is under migrc's control */ > > __NR_PAGEFLAGS, > > Yeah; no. We're out of page flags. And CXL is insufficiently I won't use an additional page flag any more. Thanks. Byungchul ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [v4 2/3] mm: Defer TLB flush by keeping both src and dst folios at migration 2023-11-09 4:59 ` [v4 2/3] mm: Defer TLB flush by keeping both src and dst folios at migration Byungchul Park 2023-11-09 14:36 ` Matthew Wilcox @ 2023-11-09 17:09 ` kernel test robot 2023-11-09 19:07 ` kernel test robot 2 siblings, 0 replies; 17+ messages in thread From: kernel test robot @ 2023-11-09 17:09 UTC (permalink / raw) To: Byungchul Park, linux-kernel, linux-mm Cc: oe-kbuild-all, kernel_team, akpm, ying.huang, namit, xhao, mgorman, hughd, willy, david, peterz, luto, tglx, mingo, bp, dave.hansen Hi Byungchul, kernel test robot noticed the following build errors: [auto build test ERROR on tip/sched/core] [also build test ERROR on tip/x86/core tip/x86/mm v6.6] [cannot apply to akpm-mm/mm-everything linus/master next-20231109] [If your patch is applied to the wrong git tree, kindly drop us a note. And when submitting patch, we suggest to use '--base' as documented in https://git-scm.com/docs/git-format-patch#_base_tree_information] url: https://github.com/intel-lab-lkp/linux/commits/Byungchul-Park/mm-rmap-Recognize-read-only-TLB-entries-during-batched-TLB-flush/20231109-163706 base: tip/sched/core patch link: https://lore.kernel.org/r/20231109045908.54996-3-byungchul%40sk.com patch subject: [v4 2/3] mm: Defer TLB flush by keeping both src and dst folios at migration config: m68k-allyesconfig (https://download.01.org/0day-ci/archive/20231109/202311092356.XzY1aBHX-lkp@intel.com/config) compiler: m68k-linux-gcc (GCC) 13.2.0 reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20231109/202311092356.XzY1aBHX-lkp@intel.com/reproduce) If you fix the issue in a separate patch/commit (i.e. not just a new version of the same patch/commit), kindly add following tags | Reported-by: kernel test robot <lkp@intel.com> | Closes: https://lore.kernel.org/oe-kbuild-all/202311092356.XzY1aBHX-lkp@intel.com/ All errors (new ones prefixed by >>): In file included from include/linux/mmzone.h:22, from include/linux/topology.h:33, from include/linux/irq.h:19, from include/asm-generic/hardirq.h:17, from ./arch/m68k/include/generated/asm/hardirq.h:1, from include/linux/hardirq.h:11, from include/linux/interrupt.h:11, from include/linux/kernel_stat.h:9, from arch/m68k/kernel/asm-offsets.c:16: >> include/linux/mm_types.h:1416:42: error: field 'arch' has incomplete type 1416 | struct arch_tlbflush_unmap_batch arch; | ^~~~ make[3]: *** [scripts/Makefile.build:116: arch/m68k/kernel/asm-offsets.s] Error 1 make[3]: Target 'prepare' not remade because of errors. make[2]: *** [Makefile:1202: prepare0] Error 2 make[2]: Target 'prepare' not remade because of errors. make[1]: *** [Makefile:234: __sub-make] Error 2 make[1]: Target 'prepare' not remade because of errors. make: *** [Makefile:234: __sub-make] Error 2 make: Target 'prepare' not remade because of errors. vim +/arch +1416 include/linux/mm_types.h 1401 1402 struct migrc_req { 1403 /* 1404 * folios pending for TLB flush 1405 */ 1406 struct list_head folios; 1407 1408 /* 1409 * for hanging to the associated numa node 1410 */ 1411 struct llist_node llnode; 1412 1413 /* 1414 * architecture specific data for batched TLB flush 1415 */ > 1416 struct arch_tlbflush_unmap_batch arch; -- 0-DAY CI Kernel Test Service https://github.com/intel/lkp-tests/wiki ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [v4 2/3] mm: Defer TLB flush by keeping both src and dst folios at migration 2023-11-09 4:59 ` [v4 2/3] mm: Defer TLB flush by keeping both src and dst folios at migration Byungchul Park 2023-11-09 14:36 ` Matthew Wilcox 2023-11-09 17:09 ` kernel test robot @ 2023-11-09 19:07 ` kernel test robot 2 siblings, 0 replies; 17+ messages in thread From: kernel test robot @ 2023-11-09 19:07 UTC (permalink / raw) To: Byungchul Park, linux-kernel, linux-mm Cc: llvm, oe-kbuild-all, kernel_team, akpm, ying.huang, namit, xhao, mgorman, hughd, willy, david, peterz, luto, tglx, mingo, bp, dave.hansen Hi Byungchul, kernel test robot noticed the following build errors: [auto build test ERROR on tip/sched/core] [also build test ERROR on tip/x86/core tip/x86/mm v6.6] [cannot apply to akpm-mm/mm-everything linus/master next-20231109] [If your patch is applied to the wrong git tree, kindly drop us a note. And when submitting patch, we suggest to use '--base' as documented in https://git-scm.com/docs/git-format-patch#_base_tree_information] url: https://github.com/intel-lab-lkp/linux/commits/Byungchul-Park/mm-rmap-Recognize-read-only-TLB-entries-during-batched-TLB-flush/20231109-163706 base: tip/sched/core patch link: https://lore.kernel.org/r/20231109045908.54996-3-byungchul%40sk.com patch subject: [v4 2/3] mm: Defer TLB flush by keeping both src and dst folios at migration config: um-allnoconfig (https://download.01.org/0day-ci/archive/20231110/202311100211.UAqu6dj7-lkp@intel.com/config) compiler: clang version 17.0.0 (https://github.com/llvm/llvm-project.git 4a5ac14ee968ff0ad5d2cc1ffa0299048db4c88a) reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20231110/202311100211.UAqu6dj7-lkp@intel.com/reproduce) If you fix the issue in a separate patch/commit (i.e. not just a new version of the same patch/commit), kindly add following tags | Reported-by: kernel test robot <lkp@intel.com> | Closes: https://lore.kernel.org/oe-kbuild-all/202311100211.UAqu6dj7-lkp@intel.com/ All errors (new ones prefixed by >>): In file included from arch/um/kernel/asm-offsets.c:1: In file included from arch/x86/um/shared/sysdep/kernel-offsets.h:5: In file included from include/linux/crypto.h:17: In file included from include/linux/slab.h:16: In file included from include/linux/gfp.h:7: In file included from include/linux/mmzone.h:22: >> include/linux/mm_types.h:1416:35: error: field has incomplete type 'struct arch_tlbflush_unmap_batch' 1416 | struct arch_tlbflush_unmap_batch arch; | ^ include/linux/mm_types.h:1416:9: note: forward declaration of 'struct arch_tlbflush_unmap_batch' 1416 | struct arch_tlbflush_unmap_batch arch; | ^ In file included from arch/um/kernel/asm-offsets.c:1: arch/x86/um/shared/sysdep/kernel-offsets.h:9:6: warning: no previous prototype for function 'foo' [-Wmissing-prototypes] 9 | void foo(void) | ^ arch/x86/um/shared/sysdep/kernel-offsets.h:9:1: note: declare 'static' if the function is not intended to be used outside of this translation unit 9 | void foo(void) | ^ | static 1 warning and 1 error generated. make[3]: *** [scripts/Makefile.build:116: arch/um/kernel/asm-offsets.s] Error 1 make[3]: Target 'prepare' not remade because of errors. make[2]: *** [Makefile:1202: prepare0] Error 2 make[2]: Target 'prepare' not remade because of errors. make[1]: *** [Makefile:234: __sub-make] Error 2 make[1]: Target 'prepare' not remade because of errors. make: *** [Makefile:234: __sub-make] Error 2 make: Target 'prepare' not remade because of errors. 
vim +1416 include/linux/mm_types.h 1401 1402 struct migrc_req { 1403 /* 1404 * folios pending for TLB flush 1405 */ 1406 struct list_head folios; 1407 1408 /* 1409 * for hanging to the associated numa node 1410 */ 1411 struct llist_node llnode; 1412 1413 /* 1414 * architecture specific data for batched TLB flush 1415 */ > 1416 struct arch_tlbflush_unmap_batch arch; -- 0-DAY CI Kernel Test Service https://github.com/intel/lkp-tests/wiki ^ permalink raw reply [flat|nested] 17+ messages in thread
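Both incomplete-type reports (m68k, um) come from the same place: struct arch_tlbflush_unmap_batch is only defined by architectures that support batched unmap TLB flushing, while the new struct migrc_req in the generic include/linux/mm_types.h embeds it unconditionally; the earlier arm64 report is the sibling problem of arch_tlbbatch_fold()/arch_tlbbatch_clear() being added only to x86. One possible direction, shown purely as an untested sketch, is to compile the migrc pieces only under CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH (similar to how struct tlbflush_unmap_batch guards its own arch member) and to give each architecture that selects the option its own fold/clear helpers:

/* untested sketch -- every user of ->arch would need the same guard */
#ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
struct migrc_req {
	struct list_head folios;
	struct llist_node llnode;
	struct arch_tlbflush_unmap_batch arch;
	int nid;
};
#endif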
* [v4 3/3] mm: Pause migrc mechanism at high memory pressure 2023-11-09 4:59 [v4 0/3] Reduce TLB flushes under some specific conditions Byungchul Park 2023-11-09 4:59 ` [v4 1/3] mm/rmap: Recognize read-only TLB entries during batched TLB flush Byungchul Park 2023-11-09 4:59 ` [v4 2/3] mm: Defer TLB flush by keeping both src and dst folios at migration Byungchul Park @ 2023-11-09 4:59 ` Byungchul Park 2023-11-09 5:20 ` [v4 0/3] Reduce TLB flushes under some specific conditions Huang, Ying 2023-11-09 14:26 ` Dave Hansen 4 siblings, 0 replies; 17+ messages in thread From: Byungchul Park @ 2023-11-09 4:59 UTC (permalink / raw) To: linux-kernel, linux-mm Cc: kernel_team, akpm, ying.huang, namit, xhao, mgorman, hughd, willy, david, peterz, luto, tglx, mingo, bp, dave.hansen Regression was observed when the system is in high memory pressure with swap on, where migrc keeps expanding its pending queue and the page allocator keeps flushing the queue and freeing folios at the same time, which is meaningless. So temporarily prevented migrc from expanding its pending queue on that condition. Signed-off-by: Byungchul Park <byungchul@sk.com> --- mm/internal.h | 17 ++++++++++++++++ mm/migrate.c | 53 ++++++++++++++++++++++++++++++++++++++++++++++++- mm/page_alloc.c | 13 ++++++++++++ 3 files changed, 82 insertions(+), 1 deletion(-) diff --git a/mm/internal.h b/mm/internal.h index a2b6f0321729..971f2dded4a6 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -1159,6 +1159,8 @@ struct vma_prepare { struct vm_area_struct *remove2; }; +extern atomic_t migrc_pause_cnt; + /* * Initialize the page when allocated from buddy allocator. */ @@ -1202,6 +1204,21 @@ static inline bool can_migrc_test(void) return current->can_migrc && current->tlb_ubc_ro.flush_required; } +static inline void migrc_pause(void) +{ + atomic_inc(&migrc_pause_cnt); +} + +static inline void migrc_resume(void) +{ + atomic_dec(&migrc_pause_cnt); +} + +static inline bool migrc_paused(void) +{ + return !!atomic_read(&migrc_pause_cnt); +} + /* * Return the number of folios pending TLB flush that have yet to get * freed in the zone. diff --git a/mm/migrate.c b/mm/migrate.c index 9ab7794b0390..bde4f49d0144 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -100,6 +100,16 @@ static void migrc_undo_folios(struct folio *fsrc, struct folio *fdst) static void migrc_expand_req(struct folio *fsrc, struct folio *fdst, struct migrc_req *req) { + /* + * If migrc has been paused in the middle of unmap because of + * high memory pressure, then the folios that have already been + * marked as pending should get back. + */ + if (!req) { + migrc_undo_folios(fsrc, fdst); + return; + } + if (req->nid == -1) req->nid = folio_nid(fsrc); @@ -147,6 +157,12 @@ static void migrc_req_end(struct migrc_req *req) llist_add(&req->llnode, &NODE_DATA(req->nid)->migrc_reqs); } +/* + * Increase on entry of handling high memory pressure e.g. direct + * reclaim, decrease on the exit. See __alloc_pages_slowpath(). + */ +atomic_t migrc_pause_cnt = ATOMIC_INIT(0); + /* * Gather folios and architecture specific data to handle. */ @@ -213,6 +229,31 @@ static void fold_ubc_ro_to_migrc(struct migrc_req *req) tlb_ubc_ro->flush_required = false; } +static void fold_migrc_to_ubc(struct migrc_req *req) +{ + struct tlbflush_unmap_batch *tlb_ubc = ¤t->tlb_ubc; + + if (!req) + return; + + /* + * Fold the req's data to tlb_ubc. + */ + arch_tlbbatch_fold(&tlb_ubc->arch, &req->arch); + + /* + * Reset the req's data. + */ + arch_tlbbatch_clear(&req->arch); + + /* + * req->arch might be empty. 
However, conservatively set + * ->flush_required to true so that try_to_unmap_flush() can + * check it anyway. + */ + tlb_ubc->flush_required = true; +} + bool isolate_movable_page(struct page *page, isolate_mode_t mode) { struct folio *folio = folio_get_nontail_page(page); @@ -1791,7 +1832,7 @@ static int migrate_pages_batch(struct list_head *from, /* * Apply migrc only to numa migration for now. */ - if (reason == MR_DEMOTION || reason == MR_NUMA_MISPLACED) + if (!migrc_paused() && (reason == MR_DEMOTION || reason == MR_NUMA_MISPLACED)) mreq = migrc_req_start(); for (pass = 0; pass < nr_pass && retry; pass++) { @@ -1829,6 +1870,16 @@ static int migrate_pages_batch(struct list_head *from, continue; } + /* + * In case that the system is in high memory + * pressure, give up migrc mechanism this turn. + */ + if (unlikely(mreq && migrc_paused())) { + fold_migrc_to_ubc(mreq); + migrc_req_end(mreq); + mreq = NULL; + } + can_migrc_init(); rc = migrate_folio_unmap(get_new_folio, put_new_folio, private, folio, &dst, mode, reason, diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 914e93ab598e..c920ad48f741 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -3926,6 +3926,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, unsigned int cpuset_mems_cookie; unsigned int zonelist_iter_cookie; int reserve_flags; + bool migrc_paused = false; restart: compaction_retries = 0; @@ -4057,6 +4058,16 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, if (page) goto got_pg; + /* + * The system is in very high memory pressure. Pause migrc from + * expanding its pending queue temporarily. + */ + if (!migrc_paused) { + migrc_pause(); + migrc_paused = true; + migrc_flush_free_folios(NULL); + } + /* Caller is not willing to reclaim, we can't balance anything */ if (!can_direct_reclaim) goto nopage; @@ -4184,6 +4195,8 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, warn_alloc(gfp_mask, ac->nodemask, "page allocation failure: order:%u", order); got_pg: + if (migrc_paused) + migrc_resume(); return page; } -- 2.17.1 ^ permalink raw reply related [flat|nested] 17+ messages in thread
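The pause gate above is a counter rather than a boolean because several tasks can be in the allocator slow path at the same time; deferral must stay disabled until the last of them has resumed. A small user-space model of that pattern (hypothetical, standard C atomics in place of the kernel's atomic_t):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

static atomic_int migrc_pause_cnt;

static void migrc_pause(void)  { atomic_fetch_add(&migrc_pause_cnt, 1); }
static void migrc_resume(void) { atomic_fetch_sub(&migrc_pause_cnt, 1); }
static bool migrc_paused(void) { return atomic_load(&migrc_pause_cnt) != 0; }

int main(void)
{
	printf("paused? %d\n", migrc_paused());	/* 0: deferral allowed  */

	migrc_pause();		/* task A enters direct reclaim */
	migrc_pause();		/* task B enters direct reclaim */
	migrc_resume();		/* task A leaves                */
	printf("paused? %d\n", migrc_paused());	/* 1: B is still inside */

	migrc_resume();		/* task B leaves                */
	printf("paused? %d\n", migrc_paused());	/* 0: deferral allowed  */
	return 0;
}

That matches the patch: migrc_pause() on entry to the __alloc_pages_slowpath() pressure path, migrc_resume() on exit, and migrate_pages_batch() checking migrc_paused() before starting or continuing a request.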
* Re: [v4 0/3] Reduce TLB flushes under some specific conditions 2023-11-09 4:59 [v4 0/3] Reduce TLB flushes under some specific conditions Byungchul Park ` (2 preceding siblings ...) 2023-11-09 4:59 ` [v4 3/3] mm: Pause migrc mechanism at high memory pressure Byungchul Park @ 2023-11-09 5:20 ` Huang, Ying 2023-11-10 1:32 ` Byungchul Park 2023-11-15 2:57 ` Byungchul Park 2023-11-09 14:26 ` Dave Hansen 4 siblings, 2 replies; 17+ messages in thread From: Huang, Ying @ 2023-11-09 5:20 UTC (permalink / raw) To: Byungchul Park Cc: linux-kernel, linux-mm, kernel_team, akpm, namit, xhao, mgorman, hughd, willy, david, peterz, luto, tglx, mingo, bp, dave.hansen Byungchul Park <byungchul@sk.com> writes: > Hi everyone, > > While I'm working with CXL memory, I have been facing migration overhead > esp. TLB shootdown on promotion or demotion between different tiers. > Yeah.. most TLB shootdowns on migration through hinting fault can be > avoided thanks to Huang Ying's work, commit 4d4b6d66db ("mm,unmap: avoid > flushing TLB in batch if PTE is inaccessible"). > > However, it's only for ones using hinting fault. I thought it'd be much > better if we have a general mechanism to reduce # of TLB flushes and > TLB misses, that we can apply to any type of migration. I tried it only > for tiering migration for now tho. > > I'm suggesting a mechanism to reduce TLB flushes by keeping source and > destination of folios participated in the migrations until all TLB > flushes required are done, only if those folios are not mapped with > write permission PTE entries at all. I worked Based on v6.6-rc5. > > Can you believe it? I saw the number of TLB full flush reduced about > 80% and iTLB miss reduced about 50%, and the time wise performance > always shows at least 1% stable improvement with the workload I tested > with, XSBench. However, I believe that it would help more with other > ones or any real ones. It'd be appreciated to let me know if I'm missing > something. Can you help to test the effect of commit 7e12beb8ca2a ("migrate_pages: batch flushing TLB") for your test case? To test it, you can revert it and compare the performance before and after the reverting. And, how do you trigger migration when testing XSBench? Use a tiered memory system, and migrate pages between DRAM and CXL memory back and forth? If so, how many pages will you migrate for each migration? -- Best Regards, Huang, Ying > > Byungchul > > --- > > Changes from v3: > > 1. Don't use the kconfig, CONFIG_MIGRC, and remove sysctl knob, > migrc_enable. (feedbacked by Nadav) > 2. Remove the optimization skipping CPUs that have already > performed TLB flushes needed by any reason when performing > TLB flushes by migrc because I can't tell the performance > difference between w/ the optimization and w/o that. > (feedbacked by Nadav) > 3. Minimize arch-specific code. While at it, move all the migrc > declarations and inline functions from include/linux/mm.h to > mm/internal.h (feedbacked by Dave Hansen, Nadav) > 4. Separate a part making migrc paused when the system is in > high memory pressure to another patch. (feedbacked by Nadav) > 5. Rename: > a. arch_tlbbatch_clean() to arch_tlbbatch_clear(), > b. tlb_ubc_nowr to tlb_ubc_ro, > c. migrc_try_flush_free_folios() to migrc_flush_free_folios(), > d. migrc_stop to migrc_pause. > (feedbacked by Nadav) > 6. Use ->lru list_head instead of introducing a new llist_head. > (feedbacked by Nadav) > 7. Use non-atomic operations of page-flag when it's safe. > (feedbacked by Nadav) > 8. 
Use stack instead of keeping a pointer of 'struct migrc_req' > in struct task, which is for manipulating it locally. > (feedbacked by Nadav) > 9. Replace a lot of simple functions to inline functions placed > in a header, mm/internal.h. (feedbacked by Nadav) > 10. Add additional sufficient comments. (feedbacked by Nadav) > 11. Remove a lot of wrapper functions. (feedbacked by Nadav) > > Changes from RFC v2: > > 1. Remove additional occupation in struct page. To do that, > unioned with lru field for migrc's list and added a page > flag. I know page flag is a thing that we don't like to add > but no choice because migrc should distinguish folios under > migrc's control from others. Instead, I force migrc to be > used only on 64 bit system to mitigate you guys from getting > angry. > 2. Remove meaningless internal object allocator that I > introduced to minimize impact onto the system. However, a ton > of tests showed there was no difference. > 3. Stop migrc from working when the system is in high memory > pressure like about to perform direct reclaim. At the > condition where the swap mechanism is heavily used, I found > the system suffered from regression without this control. > 4. Exclude folios that pte_dirty() == true from migrc's interest > so that migrc can work simpler. > 5. Combine several patches that work tightly coupled to one. > 6. Add sufficient comments for better review. > 7. Manage migrc's request in per-node manner (from globally). > 8. Add TLB miss improvement in commit message. > 9. Test with more CPUs(4 -> 16) to see bigger improvement. > > Changes from RFC: > > 1. Fix a bug triggered when a destination folio at the previous > migration becomes a source folio at the next migration, > before the folio gets handled properly so that the folio can > play with another migration. There was inconsistency in the > folio's state. Fixed it. > 2. Split the patch set into more pieces so that the folks can > review better. (Feedbacked by Nadav Amit) > 3. Fix a wrong usage of barrier e.g. smp_mb__after_atomic(). > (Feedbacked by Nadav Amit) > 4. Tried to add sufficient comments to explain the patch set > better. (Feedbacked by Nadav Amit) > > Byungchul Park (3): > mm/rmap: Recognize read-only TLB entries during batched TLB flush > mm: Defer TLB flush by keeping both src and dst folios at migration > mm: Pause migrc mechanism at high memory pressure > > arch/x86/include/asm/tlbflush.h | 3 + > arch/x86/mm/tlb.c | 11 ++ > include/linux/mm_types.h | 21 +++ > include/linux/mmzone.h | 9 ++ > include/linux/page-flags.h | 4 + > include/linux/sched.h | 7 + > include/trace/events/mmflags.h | 3 +- > mm/internal.h | 78 ++++++++++ > mm/memory.c | 11 ++ > mm/migrate.c | 266 ++++++++++++++++++++++++++++++++ > mm/page_alloc.c | 30 +++- > mm/rmap.c | 35 ++++- > 12 files changed, 475 insertions(+), 3 deletions(-) ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [v4 0/3] Reduce TLB flushes under some specific conditions 2023-11-09 5:20 ` [v4 0/3] Reduce TLB flushes under some specific conditions Huang, Ying @ 2023-11-10 1:32 ` Byungchul Park 2023-11-15 2:57 ` Byungchul Park 1 sibling, 0 replies; 17+ messages in thread From: Byungchul Park @ 2023-11-10 1:32 UTC (permalink / raw) To: Huang, Ying Cc: linux-kernel, linux-mm, kernel_team, akpm, namit, xhao, mgorman, hughd, willy, david, peterz, luto, tglx, mingo, bp, dave.hansen On Thu, Nov 09, 2023 at 01:20:29PM +0800, Huang, Ying wrote: > Byungchul Park <byungchul@sk.com> writes: > > > Hi everyone, > > > > While I'm working with CXL memory, I have been facing migration overhead > > esp. TLB shootdown on promotion or demotion between different tiers. > > Yeah.. most TLB shootdowns on migration through hinting fault can be > > avoided thanks to Huang Ying's work, commit 4d4b6d66db ("mm,unmap: avoid > > flushing TLB in batch if PTE is inaccessible"). > > > > However, it's only for ones using hinting fault. I thought it'd be much > > better if we have a general mechanism to reduce # of TLB flushes and > > TLB misses, that we can apply to any type of migration. I tried it only > > for tiering migration for now tho. > > > > I'm suggesting a mechanism to reduce TLB flushes by keeping source and > > destination of folios participated in the migrations until all TLB > > flushes required are done, only if those folios are not mapped with > > write permission PTE entries at all. I worked Based on v6.6-rc5. > > > > Can you believe it? I saw the number of TLB full flush reduced about > > 80% and iTLB miss reduced about 50%, and the time wise performance > > always shows at least 1% stable improvement with the workload I tested > > with, XSBench. However, I believe that it would help more with other > > ones or any real ones. It'd be appreciated to let me know if I'm missing > > something. > > Can you help to test the effect of commit 7e12beb8ca2a ("migrate_pages: > batch flushing TLB") for your test case? To test it, you can revert it > and compare the performance before and after the reverting. I will. > And, how do you trigger migration when testing XSBench? Use a tiered > memory system, and migrate pages between DRAM and CXL memory back and > forth? If so, how many pages will you migrate for each migration? Honestly I've been focusing on the migration # and TLB #. I will get back to you. Byungchul > -- > Best Regards, > Huang, Ying > > > > > Byungchul > > > > --- > > > > Changes from v3: > > > > 1. Don't use the kconfig, CONFIG_MIGRC, and remove sysctl knob, > > migrc_enable. (feedbacked by Nadav) > > 2. Remove the optimization skipping CPUs that have already > > performed TLB flushes needed by any reason when performing > > TLB flushes by migrc because I can't tell the performance > > difference between w/ the optimization and w/o that. > > (feedbacked by Nadav) > > 3. Minimize arch-specific code. While at it, move all the migrc > > declarations and inline functions from include/linux/mm.h to > > mm/internal.h (feedbacked by Dave Hansen, Nadav) > > 4. Separate a part making migrc paused when the system is in > > high memory pressure to another patch. (feedbacked by Nadav) > > 5. Rename: > > a. arch_tlbbatch_clean() to arch_tlbbatch_clear(), > > b. tlb_ubc_nowr to tlb_ubc_ro, > > c. migrc_try_flush_free_folios() to migrc_flush_free_folios(), > > d. migrc_stop to migrc_pause. > > (feedbacked by Nadav) > > 6. Use ->lru list_head instead of introducing a new llist_head. > > (feedbacked by Nadav) > > 7. 
Use non-atomic operations of page-flag when it's safe. > > (feedbacked by Nadav) > > 8. Use stack instead of keeping a pointer of 'struct migrc_req' > > in struct task, which is for manipulating it locally. > > (feedbacked by Nadav) > > 9. Replace a lot of simple functions to inline functions placed > > in a header, mm/internal.h. (feedbacked by Nadav) > > 10. Add additional sufficient comments. (feedbacked by Nadav) > > 11. Remove a lot of wrapper functions. (feedbacked by Nadav) > > > > Changes from RFC v2: > > > > 1. Remove additional occupation in struct page. To do that, > > unioned with lru field for migrc's list and added a page > > flag. I know page flag is a thing that we don't like to add > > but no choice because migrc should distinguish folios under > > migrc's control from others. Instead, I force migrc to be > > used only on 64 bit system to mitigate you guys from getting > > angry. > > 2. Remove meaningless internal object allocator that I > > introduced to minimize impact onto the system. However, a ton > > of tests showed there was no difference. > > 3. Stop migrc from working when the system is in high memory > > pressure like about to perform direct reclaim. At the > > condition where the swap mechanism is heavily used, I found > > the system suffered from regression without this control. > > 4. Exclude folios that pte_dirty() == true from migrc's interest > > so that migrc can work simpler. > > 5. Combine several patches that work tightly coupled to one. > > 6. Add sufficient comments for better review. > > 7. Manage migrc's request in per-node manner (from globally). > > 8. Add TLB miss improvement in commit message. > > 9. Test with more CPUs(4 -> 16) to see bigger improvement. > > > > Changes from RFC: > > > > 1. Fix a bug triggered when a destination folio at the previous > > migration becomes a source folio at the next migration, > > before the folio gets handled properly so that the folio can > > play with another migration. There was inconsistency in the > > folio's state. Fixed it. > > 2. Split the patch set into more pieces so that the folks can > > review better. (Feedbacked by Nadav Amit) > > 3. Fix a wrong usage of barrier e.g. smp_mb__after_atomic(). > > (Feedbacked by Nadav Amit) > > 4. Tried to add sufficient comments to explain the patch set > > better. (Feedbacked by Nadav Amit) > > > > Byungchul Park (3): > > mm/rmap: Recognize read-only TLB entries during batched TLB flush > > mm: Defer TLB flush by keeping both src and dst folios at migration > > mm: Pause migrc mechanism at high memory pressure > > > > arch/x86/include/asm/tlbflush.h | 3 + > > arch/x86/mm/tlb.c | 11 ++ > > include/linux/mm_types.h | 21 +++ > > include/linux/mmzone.h | 9 ++ > > include/linux/page-flags.h | 4 + > > include/linux/sched.h | 7 + > > include/trace/events/mmflags.h | 3 +- > > mm/internal.h | 78 ++++++++++ > > mm/memory.c | 11 ++ > > mm/migrate.c | 266 ++++++++++++++++++++++++++++++++ > > mm/page_alloc.c | 30 +++- > > mm/rmap.c | 35 ++++- > > 12 files changed, 475 insertions(+), 3 deletions(-) ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [v4 0/3] Reduce TLB flushes under some specific conditions 2023-11-09 5:20 ` [v4 0/3] Reduce TLB flushes under some specific conditions Huang, Ying 2023-11-10 1:32 ` Byungchul Park @ 2023-11-15 2:57 ` Byungchul Park 1 sibling, 0 replies; 17+ messages in thread From: Byungchul Park @ 2023-11-15 2:57 UTC (permalink / raw) To: Huang, Ying Cc: linux-kernel, linux-mm, kernel_team, akpm, namit, xhao, mgorman, hughd, willy, david, peterz, luto, tglx, mingo, bp, dave.hansen On Thu, Nov 09, 2023 at 01:20:29PM +0800, Huang, Ying wrote: > Byungchul Park <byungchul@sk.com> writes: > > > Hi everyone, > > > > While I'm working with CXL memory, I have been facing migration overhead > > esp. TLB shootdown on promotion or demotion between different tiers. > > Yeah.. most TLB shootdowns on migration through hinting fault can be > > avoided thanks to Huang Ying's work, commit 4d4b6d66db ("mm,unmap: avoid > > flushing TLB in batch if PTE is inaccessible"). > > > > However, it's only for ones using hinting fault. I thought it'd be much > > better if we have a general mechanism to reduce # of TLB flushes and > > TLB misses, that we can apply to any type of migration. I tried it only > > for tiering migration for now tho. > > > > I'm suggesting a mechanism to reduce TLB flushes by keeping source and > > destination of folios participated in the migrations until all TLB > > flushes required are done, only if those folios are not mapped with > > write permission PTE entries at all. I worked Based on v6.6-rc5. > > > > Can you believe it? I saw the number of TLB full flush reduced about > > 80% and iTLB miss reduced about 50%, and the time wise performance > > always shows at least 1% stable improvement with the workload I tested > > with, XSBench. However, I believe that it would help more with other > > ones or any real ones. It'd be appreciated to let me know if I'm missing > > something. > > Can you help to test the effect of commit 7e12beb8ca2a ("migrate_pages: > batch flushing TLB") for your test case? To test it, you can revert it > and compare the performance before and after the reverting. > > And, how do you trigger migration when testing XSBench? Use a tiered > memory system, and migrate pages between DRAM and CXL memory back and > forth? If so, how many pages will you migrate for each migration It was not an actual CXL memory but a cpuless remote numa node's DRAM recognized as a slow tier (node_is_toptier() == false) by the kernel. It's been okay to me because I've been focusing on TLB # and migration # while working with numa tiering mechanism and, I think, the time wise performance will be followed, big or little depending on the system configuration. So it migrates pages between the two DRAMs back and forth - promotion by hinting fault and demotion by page reclaim. I tested what you asked me with another slower system to make TLB miss overhead stand out. Unfortunately I got even worse result with vanilla v6.6-rc5 than v6.6-rc5 with 7e12beb8ca2a reverted, while the 'v6.6-rc5 + migrc' definitely shows far better result. Thoughts? 
Byungchul

---

Architecture - x86_64
QEMU - kvm enabled, host cpu
Numa - 2 nodes (16 CPUs 1GB, no CPUs 8GB)
Kernel - v6.6-rc5, NUMA_BALANCING_MEMORY_TIERING, demotion enabled
Benchmark - XSBench -p 50000000 (-p option makes the runtime longer)

CASE1 - mainline v6.6-rc5 + 7e12beb8ca2a reverted
-------------------------------------------------
$ perf stat -a \
    -e itlb.itlb_flush \
    -e tlb_flush.dtlb_thread \
    -e tlb_flush.stlb_any \
    -e dTLB-load-misses \
    -e dTLB-store-misses \
    -e iTLB-load-misses \
    ./XSBench -p 50000000

Performance counter stats for 'system wide':

    190247118       itlb.itlb_flush
    716182438       tlb_flush.dtlb_thread
    327051673       tlb_flush.stlb_any
    119542331968    dTLB-load-misses
    724072795       dTLB-store-misses
    3054343419      iTLB-load-misses

    1172.580552728 seconds time elapsed

$ cat /proc/vmstat
...
numa_pages_migrated 5968431
pgmigrate_success 12484773
nr_tlb_remote_flush 6614459
nr_tlb_remote_flush_received 96022799
nr_tlb_local_flush_all 50869
nr_tlb_local_flush_one 785597
...

CASE2 - mainline v6.6-rc5 (vanilla)
-------------------------------------------------
$ perf stat -a \
    -e itlb.itlb_flush \
    -e tlb_flush.dtlb_thread \
    -e tlb_flush.stlb_any \
    -e dTLB-load-misses \
    -e dTLB-store-misses \
    -e iTLB-load-misses \
    ./XSBench -p 50000000

Performance counter stats for 'system wide':

    55139061        itlb.itlb_flush
    286725687       tlb_flush.dtlb_thread
    199687660       tlb_flush.stlb_any
    119497951269    dTLB-load-misses
    358434759       dTLB-store-misses
    1867135967      iTLB-load-misses

    1181.311084373 seconds time elapsed

$ cat /proc/vmstat
...
numa_pages_migrated 8190027
pgmigrate_success 17098994
nr_tlb_remote_flush 1955114
nr_tlb_remote_flush_received 29028093
nr_tlb_local_flush_all 140921
nr_tlb_local_flush_one 740767
...

CASE3 - mainline v6.6-rc5 + migrc
-------------------------------------------------
$ perf stat -a \
    -e itlb.itlb_flush \
    -e tlb_flush.dtlb_thread \
    -e tlb_flush.stlb_any \
    -e dTLB-load-misses \
    -e dTLB-store-misses \
    -e iTLB-load-misses \
    ./XSBench -p 50000000

Performance counter stats for 'system wide':

    6337091         itlb.itlb_flush
    157229778       tlb_flush.dtlb_thread
    148240163       tlb_flush.stlb_any
    117701381319    dTLB-load-misses
    231212468       dTLB-store-misses
    973083466       iTLB-load-misses

    1105.756705157 seconds time elapsed

$ cat /proc/vmstat
...
numa_pages_migrated 8791934
pgmigrate_success 18276174
nr_tlb_remote_flush 311146
nr_tlb_remote_flush_received 4387708
nr_tlb_local_flush_all 143883
nr_tlb_local_flush_one 740953
...

^ permalink raw reply [flat|nested] 17+ messages in thread
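For reference, the topology described above might be reproduced with something along these lines; the QEMU command line and guest knobs below are assumptions inferred from the setup description (two nodes, 16 CPUs + 1GB vs a CPU-less 8GB node, tiering and demotion enabled), not the actual scripts used in the test:

    # host: node 1 has memory but no CPUs, so the guest treats it as the slow tier
    # (node_is_toptier() == false)
    qemu-system-x86_64 -enable-kvm -cpu host -smp 16 -m 9G \
        -object memory-backend-ram,id=m0,size=1G \
        -numa node,nodeid=0,cpus=0-15,memdev=m0 \
        -object memory-backend-ram,id=m1,size=8G \
        -numa node,nodeid=1,memdev=m1 \
        ...

    # guest: enable memory-tiering NUMA balancing and demotion
    echo 2 > /proc/sys/kernel/numa_balancing        # NUMA_BALANCING_MEMORY_TIERING
    echo 1 > /sys/kernel/mm/numa/demotion_enabled

    # watch promotion (hinting faults) and demotion (reclaim) while XSBench runs
    grep -E 'numa_pages_migrated|pgpromote|pgdemote' /proc/vmstat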
* Re: [v4 0/3] Reduce TLB flushes under some specific conditions 2023-11-09 4:59 [v4 0/3] Reduce TLB flushes under some specific conditions Byungchul Park ` (3 preceding siblings ...) 2023-11-09 5:20 ` [v4 0/3] Reduce TLB flushes under some specific conditions Huang, Ying @ 2023-11-09 14:26 ` Dave Hansen 2023-11-10 1:08 ` Byungchul Park ` (2 more replies) 4 siblings, 3 replies; 17+ messages in thread From: Dave Hansen @ 2023-11-09 14:26 UTC (permalink / raw) To: Byungchul Park, linux-kernel, linux-mm Cc: kernel_team, akpm, ying.huang, namit, xhao, mgorman, hughd, willy, david, peterz, luto, tglx, mingo, bp, dave.hansen On 11/8/23 20:59, Byungchul Park wrote: > Can you believe it? I saw the number of TLB full flush reduced about > 80% and iTLB miss reduced about 50%, and the time wise performance > always shows at least 1% stable improvement with the workload I tested > with, XSBench. However, I believe that it would help more with other > ones or any real ones. It'd be appreciated to let me know if I'm missing > something. I see that you've moved a substantial amount of code out of arch/x86. That's great. But there doesn't appear to be any improvement in the justification or performance data. The page flag is also here, which is horribly frowned upon. It's an absolute no-go with this level of justification. I'd really suggest not sending any more of these out until those issues are rectified. I know I definitely won't be reviewing them in this state. ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [v4 0/3] Reduce TLB flushes under some specific conditions 2023-11-09 14:26 ` Dave Hansen @ 2023-11-10 1:08 ` Byungchul Park 2023-11-15 6:43 ` Byungchul Park 2024-01-15 7:58 ` Byungchul Park 2 siblings, 0 replies; 17+ messages in thread From: Byungchul Park @ 2023-11-10 1:08 UTC (permalink / raw) To: Dave Hansen Cc: linux-kernel, linux-mm, kernel_team, akpm, ying.huang, namit, xhao, mgorman, hughd, willy, david, peterz, luto, tglx, mingo, bp, dave.hansen On Thu, Nov 09, 2023 at 06:26:08AM -0800, Dave Hansen wrote: > On 11/8/23 20:59, Byungchul Park wrote: > > Can you believe it? I saw the number of TLB full flush reduced about > > 80% and iTLB miss reduced about 50%, and the time wise performance > > always shows at least 1% stable improvement with the workload I tested > > with, XSBench. However, I believe that it would help more with other > > ones or any real ones. It'd be appreciated to let me know if I'm missing > > something. > > I see that you've moved a substantial amount of code out of arch/x86. > That's great. > > But there doesn't appear to be any improvement in the justification or > performance data. The page flag is also here, which is horribly frowned > upon. It's an absolute no-go with this level of justification. > > I'd really suggest not sending any more of these out until those issues > are rectified. I know I definitely won't be reviewing them in this state. Makes sense. Let me think about it more and improve it. Byungchul ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [v4 0/3] Reduce TLB flushes under some specific conditions 2023-11-09 14:26 ` Dave Hansen 2023-11-10 1:08 ` Byungchul Park @ 2023-11-15 6:43 ` Byungchul Park 2024-01-15 7:58 ` Byungchul Park 2 siblings, 0 replies; 17+ messages in thread From: Byungchul Park @ 2023-11-15 6:43 UTC (permalink / raw) To: Dave Hansen Cc: linux-kernel, linux-mm, kernel_team, akpm, ying.huang, namit, xhao, mgorman, hughd, willy, david, peterz, luto, tglx, mingo, bp, dave.hansen On Thu, Nov 09, 2023 at 06:26:08AM -0800, Dave Hansen wrote: > On 11/8/23 20:59, Byungchul Park wrote: > > Can you believe it? I saw the number of TLB full flush reduced about > > 80% and iTLB miss reduced about 50%, and the time wise performance > > always shows at least 1% stable improvement with the workload I tested > > with, XSBench. However, I believe that it would help more with other > > ones or any real ones. It'd be appreciated to let me know if I'm missing > > something. > > I see that you've moved a substantial amount of code out of arch/x86. > That's great. > > But there doesn't appear to be any improvement in the justification or > performance data. The page flag is also here, which is horribly frowned > upon. It's an absolute no-go with this level of justification. > > I'd really suggest not sending any more of these out until those issues > are rectified. I know I definitely won't be reviewing them in this state. As I expected, I got a far better result when I tested migrc on a system with slower DRAM to make the TLB miss overhead stand out. 1. XSBench execution time was reduced by about 7%. 2. iTLB flush # was stably reduced by about 90% while running XSBench. 3. iTLB miss # was stably reduced by about 50% while running XSBench. https://lore.kernel.org/lkml/20231115025755.GA29979@system.software.com/ Of course, I can reimplement migrc to replace PG_migrc with something else, like a hash table, but IMHO it's worth having the page flag if it gives such good performance. Let me know if not, and I'll change the implementation. I'd like to note that migrc no doubt significantly reduces TLB misses; the impact depends on the TLB miss overhead, which varies with the system configuration. Byungchul ^ permalink raw reply [flat|nested] 17+ messages in thread
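For what it's worth, the roughly 90% / 50% / 7% figures above are consistent with the counters posted earlier in this thread, assuming the comparison is CASE3 ("v6.6-rc5 + migrc") against CASE2 (vanilla v6.6-rc5):

    echo 'scale=3; (1 - 6337091/55139061) * 100' | bc      # itlb.itlb_flush:  ~88.6% fewer  -> "about 90%"
    echo 'scale=3; (1 - 973083466/1867135967) * 100' | bc  # iTLB-load-misses: ~47.9% fewer  -> "about 50%"
    echo 'scale=3; (1 - 1105.757/1181.311) * 100' | bc     # elapsed time:     ~6.4% shorter -> roughly the "about 7%"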
* Re: [v4 0/3] Reduce TLB flushes under some specific conditions 2023-11-09 14:26 ` Dave Hansen 2023-11-10 1:08 ` Byungchul Park 2023-11-15 6:43 ` Byungchul Park @ 2024-01-15 7:58 ` Byungchul Park 2 siblings, 0 replies; 17+ messages in thread From: Byungchul Park @ 2024-01-15 7:58 UTC (permalink / raw) To: Dave Hansen Cc: linux-kernel, linux-mm, kernel_team, akpm, ying.huang, namit, xhao, mgorman, hughd, willy, david, peterz, luto, tglx, mingo, bp, dave.hansen On Thu, Nov 09, 2023 at 06:26:08AM -0800, Dave Hansen wrote: > On 11/8/23 20:59, Byungchul Park wrote: > > Can you believe it? I saw the number of TLB full flush reduced about > > 80% and iTLB miss reduced about 50%, and the time wise performance > > always shows at least 1% stable improvement with the workload I tested > > with, XSBench. However, I believe that it would help more with other > > ones or any real ones. It'd be appreciated to let me know if I'm missing > > something. > > I see that you've moved a substantial amount of code out of arch/x86. > That's great. > > But there doesn't appear to be any improvement in the justification or > performance data. The page flag is also here, which is horribly frowned > upon. It's an absolute no-go with this level of justification. Okay. I won't use an additional page flag anymore from migrc v5. Thanks. Byungchul ^ permalink raw reply [flat|nested] 17+ messages in thread
Thread overview: 17+ messages (end of thread; newest: 2024-01-15 7:58 UTC)

2023-11-09  4:59 [v4 0/3] Reduce TLB flushes under some specific conditions Byungchul Park
2023-11-09  4:59 ` [v4 1/3] mm/rmap: Recognize read-only TLB entries during batched TLB flush Byungchul Park
2023-11-09 20:26 ` kernel test robot
2023-11-09  4:59 ` [v4 2/3] mm: Defer TLB flush by keeping both src and dst folios at migration Byungchul Park
2023-11-09 14:36 ` Matthew Wilcox
2023-11-10  1:29 ` Byungchul Park
2024-01-15  7:55 ` Byungchul Park
2023-11-09 17:09 ` kernel test robot
2023-11-09 19:07 ` kernel test robot
2023-11-09  4:59 ` [v4 3/3] mm: Pause migrc mechanism at high memory pressure Byungchul Park
2023-11-09  5:20 ` [v4 0/3] Reduce TLB flushes under some specific conditions Huang, Ying
2023-11-10  1:32 ` Byungchul Park
2023-11-15  2:57 ` Byungchul Park
2023-11-09 14:26 ` Dave Hansen
2023-11-10  1:08 ` Byungchul Park
2023-11-15  6:43 ` Byungchul Park
2024-01-15  7:58 ` Byungchul Park