* [patch]x86: clearing access bit don't flush tlb @ 2014-03-26 22:30 Shaohua Li 2014-03-26 23:55 ` Rik van Riel 2014-04-02 13:01 ` Mel Gorman 0 siblings, 2 replies; 15+ messages in thread
From: Shaohua Li @ 2014-03-26 22:30 UTC (permalink / raw)
To: linux-mm; +Cc: akpm, hughd, riel, mel

I posted this patch a year ago or so, but it got lost. I'm reposting it here to see whether we can make progress this time.

We use the access bit to age a page at page reclaim time. On x86 we can skip the TLB flush when clearing a pte's access bit. The side effect is that if the pte is still in the TLB while its access bit is unset in the page table, the CPU will not set the access bit when it touches the page again. Page reclaim will then mistake this hot page for an old one and reclaim it wrongly, but this does not corrupt data.

According to the Intel manual, the TLB has fewer than 1k entries, which covers less than 4M of memory. In today's systems, several gigabytes of memory are normal, so after page reclaim clears a pte's access bit and before the CPU touches the page again, it is quite unlikely that the page's pte is still in the TLB. Context switches flush the TLB as well. The chance that skipping the TLB flush impacts page reclaim should be very rare.

Originally (in the 2.5 kernel, maybe) we did not flush the TLB after clearing the access bit; Hugh added the flush to fix some ARM and sparc issues. Since I only change this for x86, there should be no risk.

And in some workloads, TLB flush overhead is very heavy. In my simple multithreaded app with a lot of swap to several PCIe SSDs, removing the TLB flush gives about a 20%~30% swapout speedup.
Signed-off-by: Shaohua Li <shli@fusionio.com>
---
 arch/x86/mm/pgtable.c | 13 ++++++-------
 1 file changed, 6 insertions(+), 7 deletions(-)

Index: linux/arch/x86/mm/pgtable.c
===================================================================
--- linux.orig/arch/x86/mm/pgtable.c	2014-03-27 05:22:08.572100549 +0800
+++ linux/arch/x86/mm/pgtable.c	2014-03-27 05:46:12.456131121 +0800
@@ -399,13 +399,12 @@ int pmdp_test_and_clear_young(struct vm_
 int ptep_clear_flush_young(struct vm_area_struct *vma,
 			   unsigned long address, pte_t *ptep)
 {
-	int young;
-
-	young = ptep_test_and_clear_young(vma, address, ptep);
-	if (young)
-		flush_tlb_page(vma, address);
-
-	return young;
+	/*
+	 * In X86, clearing access bit without TLB flush doesn't cause data
+	 * corruption. Doing this could cause wrong page aging and so hot pages
+	 * are reclaimed, but the chance should be very rare.
+	 */
+	return ptep_test_and_clear_young(vma, address, ptep);
 }

 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org
* Re: [patch]x86: clearing access bit don't flush tlb 2014-03-26 22:30 [patch]x86: clearing access bit don't flush tlb Shaohua Li @ 2014-03-26 23:55 ` Rik van Riel 2014-03-27 17:12 ` Shaohua Li 2014-04-02 13:01 ` Mel Gorman 1 sibling, 1 reply; 15+ messages in thread From: Rik van Riel @ 2014-03-26 23:55 UTC (permalink / raw) To: Shaohua Li, linux-mm; +Cc: akpm, hughd, mel On 03/26/2014 06:30 PM, Shaohua Li wrote: > > I posted this patch a year ago or so, but it gets lost. Repost it here to check > if we can make progress this time. I believe we can make progress. However, I also believe the code could be enhanced to address a concern that Hugh raised last time this was proposed... > And according to intel manual, tlb has less than 1k entries, which covers < 4M > memory. In today's system, several giga byte memory is normal. After page > reclaim clears pte access bit and before cpu access the page again, it's quite > unlikely this page's pte is still in TLB. And context swich will flush tlb too. > The chance skiping tlb flush to impact page reclaim should be very rare. Context switch to a kernel thread does not result in a TLB flush, due to the lazy TLB code. While I agree with you that clearing the TLB right at the moment the accessed bit is cleared in a PTE is not necessary, I believe it would be good to clear the TLB on affected CPUs relatively soon, maybe at the next time schedule is called? > --- linux.orig/arch/x86/mm/pgtable.c 2014-03-27 05:22:08.572100549 +0800 > +++ linux/arch/x86/mm/pgtable.c 2014-03-27 05:46:12.456131121 +0800 > @@ -399,13 +399,12 @@ int pmdp_test_and_clear_young(struct vm_ > int ptep_clear_flush_young(struct vm_area_struct *vma, > unsigned long address, pte_t *ptep) > { > - int young; > - > - young = ptep_test_and_clear_young(vma, address, ptep); > - if (young) > - flush_tlb_page(vma, address); > - > - return young; > + /* > + * In X86, clearing access bit without TLB flush doesn't cause data > + * corruption. 
Doing this could cause wrong page aging and so hot pages
> + * are reclaimed, but the chance should be very rare.
> + */
> + return ptep_test_and_clear_young(vma, address, ptep);
> }

At this point, we could use vma->vm_mm->cpu_vm_mask_var to set (or clear) some bit in the per-cpu data of each CPU that has active/valid tlb state for the mm in question.

I could see using cpu_tlbstate.state for this, or maybe another variable in cpu_tlbstate, so switch_mm will load both items with the same cache line.

At schedule time, the function switch_mm() can examine that variable (it already touches that data, anyway), and flush the TLB even if prev==next.

I suspect that would be both low-overhead enough to get you the performance gains you want, and address the concern that we do want to flush the TLB at some point.

Does that sound reasonable?
* Re: [patch]x86: clearing access bit don't flush tlb 2014-03-26 23:55 ` Rik van Riel @ 2014-03-27 17:12 ` Shaohua Li 2014-03-27 18:41 ` Rik van Riel 0 siblings, 1 reply; 15+ messages in thread From: Shaohua Li @ 2014-03-27 17:12 UTC (permalink / raw) To: Rik van Riel; +Cc: linux-mm, akpm, hughd, mel On Wed, Mar 26, 2014 at 07:55:51PM -0400, Rik van Riel wrote: > On 03/26/2014 06:30 PM, Shaohua Li wrote: > > > >I posted this patch a year ago or so, but it gets lost. Repost it here to check > >if we can make progress this time. > > I believe we can make progress. However, I also > believe the code could be enhanced to address a > concern that Hugh raised last time this was > proposed... > > >And according to intel manual, tlb has less than 1k entries, which covers < 4M > >memory. In today's system, several giga byte memory is normal. After page > >reclaim clears pte access bit and before cpu access the page again, it's quite > >unlikely this page's pte is still in TLB. And context swich will flush tlb too. > >The chance skiping tlb flush to impact page reclaim should be very rare. > > Context switch to a kernel thread does not result in a > TLB flush, due to the lazy TLB code. > > While I agree with you that clearing the TLB right at > the moment the accessed bit is cleared in a PTE is > not necessary, I believe it would be good to clear > the TLB on affected CPUs relatively soon, maybe at the > next time schedule is called? 
> >--- linux.orig/arch/x86/mm/pgtable.c	2014-03-27 05:22:08.572100549 +0800
> >+++ linux/arch/x86/mm/pgtable.c	2014-03-27 05:46:12.456131121 +0800
> >@@ -399,13 +399,12 @@ int pmdp_test_and_clear_young(struct vm_
> > int ptep_clear_flush_young(struct vm_area_struct *vma,
> > 			   unsigned long address, pte_t *ptep)
> > {
> >-	int young;
> >-
> >-	young = ptep_test_and_clear_young(vma, address, ptep);
> >-	if (young)
> >-		flush_tlb_page(vma, address);
> >-
> >-	return young;
> >+	/*
> >+	 * In X86, clearing access bit without TLB flush doesn't cause data
> >+	 * corruption. Doing this could cause wrong page aging and so hot pages
> >+	 * are reclaimed, but the chance should be very rare.
> >+	 */
> >+	return ptep_test_and_clear_young(vma, address, ptep);
> > }
>
> At this point, we could use vma->vm_mm->cpu_vm_mask_var to
> set (or clear) some bit in the per-cpu data of each CPU that
> has active/valid tlb state for the mm in question.
>
> I could see using cpu_tlbstate.state for this, or maybe
> another variable in cpu_tlbstate, so switch_mm will load
> both items with the same cache line.
>
> At schedule time, the function switch_mm() can examine that
> variable (it already touches that data, anyway), and flush
> the TLB even if prev==next.
>
> I suspect that would be both low overhead enough to get you
> the performance gains you want, and address the concern that
> we do want to flush the TLB at some point.
>
> Does that sound reasonable?

So it looks like what you suggested is to force a TLB flush for an mm whose access bits were cleared, in two corner cases:
1. lazy tlb flush
2. context switch between threads of one process

Am I missing anything? I'm wondering if we should care about these corner cases. On the other hand, a thread might run a long time without scheduling; if the corner cases are an issue, the long-running thread is a more severe one. My point is that a context switch does provide a safeguard, but we don't depend on it. The whole theory behind this patch is that a page whose access bit has been cleared is unlikely to be accessed again while its pte entry is still in the TLB.

Thanks,
Shaohua
* Re: [patch]x86: clearing access bit don't flush tlb 2014-03-27 17:12 ` Shaohua Li @ 2014-03-27 18:41 ` Rik van Riel 2014-03-28 19:02 ` Shaohua Li 0 siblings, 1 reply; 15+ messages in thread From: Rik van Riel @ 2014-03-27 18:41 UTC (permalink / raw) To: Shaohua Li; +Cc: linux-mm, akpm, hughd, mel [-- Attachment #1: Type: text/plain, Size: 3850 bytes --] On 03/27/2014 01:12 PM, Shaohua Li wrote: > On Wed, Mar 26, 2014 at 07:55:51PM -0400, Rik van Riel wrote: >> On 03/26/2014 06:30 PM, Shaohua Li wrote: >>> >>> I posted this patch a year ago or so, but it gets lost. Repost it here to check >>> if we can make progress this time. >> >> I believe we can make progress. However, I also >> believe the code could be enhanced to address a >> concern that Hugh raised last time this was >> proposed... >> >>> And according to intel manual, tlb has less than 1k entries, which covers < 4M >>> memory. In today's system, several giga byte memory is normal. After page >>> reclaim clears pte access bit and before cpu access the page again, it's quite >>> unlikely this page's pte is still in TLB. And context swich will flush tlb too. >>> The chance skiping tlb flush to impact page reclaim should be very rare. >> >> Context switch to a kernel thread does not result in a >> TLB flush, due to the lazy TLB code. >> >> While I agree with you that clearing the TLB right at >> the moment the accessed bit is cleared in a PTE is >> not necessary, I believe it would be good to clear >> the TLB on affected CPUs relatively soon, maybe at the >> next time schedule is called? 
>> >>> --- linux.orig/arch/x86/mm/pgtable.c 2014-03-27 05:22:08.572100549 +0800 >>> +++ linux/arch/x86/mm/pgtable.c 2014-03-27 05:46:12.456131121 +0800 >>> @@ -399,13 +399,12 @@ int pmdp_test_and_clear_young(struct vm_ >>> int ptep_clear_flush_young(struct vm_area_struct *vma, >>> unsigned long address, pte_t *ptep) >>> { >>> - int young; >>> - >>> - young = ptep_test_and_clear_young(vma, address, ptep); >>> - if (young) >>> - flush_tlb_page(vma, address); >>> - >>> - return young; >>> + /* >>> + * In X86, clearing access bit without TLB flush doesn't cause data >>> + * corruption. Doing this could cause wrong page aging and so hot pages >>> + * are reclaimed, but the chance should be very rare. >>> + */ >>> + return ptep_test_and_clear_young(vma, address, ptep); >>> } >> >> >> At this point, we could use vma->vm_mm->cpu_vm_mask_var to >> set (or clear) some bit in the per-cpu data of each CPU that >> has active/valid tlb state for the mm in question. >> >> I could see using cpu_tlbstate.state for this, or maybe >> another variable in cpu_tlbstate, so switch_mm will load >> both items with the same cache line. >> >> At schedule time, the function switch_mm() can examine that >> variable (it already touches that data, anyway), and flush >> the TLB even if prev==next. >> >> I suspect that would be both low overhead enough to get you >> the performance gains you want, and address the concern that >> we do want to flush the TLB at some point. >> >> Does that sound reasonable? > > So looks what you suggested is to force tlb flush for a mm with access bit > cleared in two corner cases: > 1. lazy tlb flush > 2. context switch between threads from one process > > Am I missing anything? I'm wonering if we should care about these corner cases. I believe the corner case is relatively rare, but I also suspect that your patch could fail pretty badly in some of those cases, and the fix is easy... > On the other hand, a thread might run long time without schedule. 
If the corner
> cases are an issue, the long run thread is a severer issue. My point is context
> switch does provide a safeguard, but we don't depend on it. The whole theory at
> the back of this patch is page which has access bit cleared is unlikely
> accessed again when its pte entry is still in tlb cache.

On the contrary, a TLB with a good cache policy should retain the most actively used entries, in favor of less actively used ones.

That means the pages we care most about keeping are the ones also most in danger of not having the accessed bit flushed to memory.

Does the attached (untested) patch look reasonable?

[-- Attachment #2: flush_young_lazy_tlb.patch --]
[-- Type: text/x-patch, Size: 3356 bytes --]

Signed-off-by: Rik van Riel <riel@redhat.com>
---
 arch/x86/include/asm/mmu_context.h |  5 ++++-
 arch/x86/include/asm/tlbflush.h    | 12 ++++++++++++
 arch/x86/mm/pgtable.c              |  9 ++++++---
 3 files changed, 22 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index be12c53..665d98b 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -39,6 +39,7 @@ static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
 #ifdef CONFIG_SMP
 	this_cpu_write(cpu_tlbstate.state, TLBSTATE_OK);
 	this_cpu_write(cpu_tlbstate.active_mm, next);
+	this_cpu_write(cpu_tlbstate.force_flush, false);
 #endif
 	cpumask_set_cpu(cpu, mm_cpumask(next));
@@ -57,7 +58,8 @@ static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
 		this_cpu_write(cpu_tlbstate.state, TLBSTATE_OK);
 		BUG_ON(this_cpu_read(cpu_tlbstate.active_mm) != next);

-		if (!cpumask_test_cpu(cpu, mm_cpumask(next))) {
+		if (!cpumask_test_cpu(cpu, mm_cpumask(next)) ||
+		    this_cpu_read(cpu_tlbstate.force_flush)) {
 			/*
 			 * On established mms, the mm_cpumask is only changed
 			 * from irq context, from ptep_clear_flush() while in
@@ -70,6 +72,7 @@ static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
 			 * tlb flush IPI delivery. We must reload CR3
 			 * to make sure to use no freed page tables.
 			 */
+			this_cpu_write(cpu_tlbstate.force_flush, false);
 			load_cr3(next->pgd);
 			load_LDT_nolock(&next->context);
 		}
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 04905bf..f2cda2c 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -151,6 +151,10 @@ static inline void reset_lazy_tlbstate(void)
 {
 }

+static inline void tlb_set_force_flush(int cpu)
+{
+}
+
 static inline void flush_tlb_kernel_range(unsigned long start,
 					  unsigned long end)
 {
@@ -187,6 +191,7 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
 struct tlb_state {
 	struct mm_struct *active_mm;
 	int state;
+	bool force_flush;
 };
 DECLARE_PER_CPU_SHARED_ALIGNED(struct tlb_state, cpu_tlbstate);

@@ -196,6 +201,13 @@ static inline void reset_lazy_tlbstate(void)
 	this_cpu_write(cpu_tlbstate.active_mm, &init_mm);
 }

+static inline void tlb_set_force_flush(int cpu)
+{
+	struct tlb_state *percputlb = &per_cpu(cpu_tlbstate, cpu);
+
+	if (percputlb->force_flush == false)
+		percputlb->force_flush = true;
+}
+
 #endif /* SMP */

 #ifndef CONFIG_PARAVIRT
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index c96314a..dcd26e9 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -4,6 +4,7 @@
 #include <asm/pgtable.h>
 #include <asm/tlb.h>
 #include <asm/fixmap.h>
+#include <asm/tlbflush.h>

 #define PGALLOC_GFP GFP_KERNEL | __GFP_NOTRACK | __GFP_REPEAT | __GFP_ZERO

@@ -399,11 +400,13 @@ int pmdp_test_and_clear_young(struct vm_area_struct *vma,
 int ptep_clear_flush_young(struct vm_area_struct *vma,
 			   unsigned long address, pte_t *ptep)
 {
-	int young;
+	int young, cpu;

 	young = ptep_test_and_clear_young(vma, address, ptep);
-	if (young)
-		flush_tlb_page(vma, address);
+	if (young) {
+		for_each_cpu(cpu, vma->vm_mm->cpu_vm_mask_var)
+			tlb_set_force_flush(cpu);
+	}

 	return young;
 }
* Re: [patch]x86: clearing access bit don't flush tlb 2014-03-27 18:41 ` Rik van Riel @ 2014-03-28 19:02 ` Shaohua Li 2014-03-30 12:58 ` Rik van Riel 0 siblings, 1 reply; 15+ messages in thread From: Shaohua Li @ 2014-03-28 19:02 UTC (permalink / raw) To: Rik van Riel; +Cc: linux-mm, akpm, hughd, mel On Thu, Mar 27, 2014 at 02:41:59PM -0400, Rik van Riel wrote: > On 03/27/2014 01:12 PM, Shaohua Li wrote: > >On Wed, Mar 26, 2014 at 07:55:51PM -0400, Rik van Riel wrote: > >>On 03/26/2014 06:30 PM, Shaohua Li wrote: > >>> > >>>I posted this patch a year ago or so, but it gets lost. Repost it here to check > >>>if we can make progress this time. > >> > >>I believe we can make progress. However, I also > >>believe the code could be enhanced to address a > >>concern that Hugh raised last time this was > >>proposed... > >> > >>>And according to intel manual, tlb has less than 1k entries, which covers < 4M > >>>memory. In today's system, several giga byte memory is normal. After page > >>>reclaim clears pte access bit and before cpu access the page again, it's quite > >>>unlikely this page's pte is still in TLB. And context swich will flush tlb too. > >>>The chance skiping tlb flush to impact page reclaim should be very rare. > >> > >>Context switch to a kernel thread does not result in a > >>TLB flush, due to the lazy TLB code. > >> > >>While I agree with you that clearing the TLB right at > >>the moment the accessed bit is cleared in a PTE is > >>not necessary, I believe it would be good to clear > >>the TLB on affected CPUs relatively soon, maybe at the > >>next time schedule is called? 
> >> > >>>--- linux.orig/arch/x86/mm/pgtable.c 2014-03-27 05:22:08.572100549 +0800 > >>>+++ linux/arch/x86/mm/pgtable.c 2014-03-27 05:46:12.456131121 +0800 > >>>@@ -399,13 +399,12 @@ int pmdp_test_and_clear_young(struct vm_ > >>> int ptep_clear_flush_young(struct vm_area_struct *vma, > >>> unsigned long address, pte_t *ptep) > >>> { > >>>- int young; > >>>- > >>>- young = ptep_test_and_clear_young(vma, address, ptep); > >>>- if (young) > >>>- flush_tlb_page(vma, address); > >>>- > >>>- return young; > >>>+ /* > >>>+ * In X86, clearing access bit without TLB flush doesn't cause data > >>>+ * corruption. Doing this could cause wrong page aging and so hot pages > >>>+ * are reclaimed, but the chance should be very rare. > >>>+ */ > >>>+ return ptep_test_and_clear_young(vma, address, ptep); > >>> } > >> > >> > >>At this point, we could use vma->vm_mm->cpu_vm_mask_var to > >>set (or clear) some bit in the per-cpu data of each CPU that > >>has active/valid tlb state for the mm in question. > >> > >>I could see using cpu_tlbstate.state for this, or maybe > >>another variable in cpu_tlbstate, so switch_mm will load > >>both items with the same cache line. > >> > >>At schedule time, the function switch_mm() can examine that > >>variable (it already touches that data, anyway), and flush > >>the TLB even if prev==next. > >> > >>I suspect that would be both low overhead enough to get you > >>the performance gains you want, and address the concern that > >>we do want to flush the TLB at some point. > >> > >>Does that sound reasonable? > > > >So looks what you suggested is to force tlb flush for a mm with access bit > >cleared in two corner cases: > >1. lazy tlb flush > >2. context switch between threads from one process > > > >Am I missing anything? I'm wonering if we should care about these corner cases. > > I believe the corner case is relatively rare, but I also > suspect that your patch could fail pretty badly in some > of those cases, and the fix is easy... 
> >On the other hand, a thread might run long time without schedule. If the corner
> >cases are an issue, the long run thread is a severer issue. My point is context
> >switch does provide a safeguard, but we don't depend on it. The whole theory at
> >the back of this patch is page which has access bit cleared is unlikely
> >accessed again when its pte entry is still in tlb cache.
>
> On the contrary, a TLB with a good cache policy should
> retain the most actively used entries, in favor of
> less actively used ones.
>
> That means the pages we care most about keeping, are
> the ones also most at danger of not having the accessed
> bit flushed to memory.
>
> Does the attached (untested) patch look reasonable?

It obviously works, and testing shows there is no extra tradeoff compared to just skipping the TLB flush, so I have no objection if you insist on a safeguard like this. Should we also avoid entering lazy TLB mode (in context_switch) when force_flush is set? You mentioned that, but I didn't see it in the patch. Should I push this, or will you do it?

Thanks,
Shaohua
* Re: [patch]x86: clearing access bit don't flush tlb 2014-03-28 19:02 ` Shaohua Li @ 2014-03-30 12:58 ` Rik van Riel 2014-03-31 2:16 ` Shaohua Li 0 siblings, 1 reply; 15+ messages in thread From: Rik van Riel @ 2014-03-30 12:58 UTC (permalink / raw) To: Shaohua Li; +Cc: linux-mm, akpm, hughd, mel On 03/28/2014 03:02 PM, Shaohua Li wrote: > On Thu, Mar 27, 2014 at 02:41:59PM -0400, Rik van Riel wrote: >> On 03/27/2014 01:12 PM, Shaohua Li wrote: >>> On Wed, Mar 26, 2014 at 07:55:51PM -0400, Rik van Riel wrote: >>>> On 03/26/2014 06:30 PM, Shaohua Li wrote: >>>>> >>>>> I posted this patch a year ago or so, but it gets lost. Repost it here to check >>>>> if we can make progress this time. >>>> >>>> I believe we can make progress. However, I also >>>> believe the code could be enhanced to address a >>>> concern that Hugh raised last time this was >>>> proposed... >>>> >>>>> And according to intel manual, tlb has less than 1k entries, which covers < 4M >>>>> memory. In today's system, several giga byte memory is normal. After page >>>>> reclaim clears pte access bit and before cpu access the page again, it's quite >>>>> unlikely this page's pte is still in TLB. And context swich will flush tlb too. >>>>> The chance skiping tlb flush to impact page reclaim should be very rare. >>>> >>>> Context switch to a kernel thread does not result in a >>>> TLB flush, due to the lazy TLB code. >>>> >>>> While I agree with you that clearing the TLB right at >>>> the moment the accessed bit is cleared in a PTE is >>>> not necessary, I believe it would be good to clear >>>> the TLB on affected CPUs relatively soon, maybe at the >>>> next time schedule is called? 
>>>> >>>>> --- linux.orig/arch/x86/mm/pgtable.c 2014-03-27 05:22:08.572100549 +0800 >>>>> +++ linux/arch/x86/mm/pgtable.c 2014-03-27 05:46:12.456131121 +0800 >>>>> @@ -399,13 +399,12 @@ int pmdp_test_and_clear_young(struct vm_ >>>>> int ptep_clear_flush_young(struct vm_area_struct *vma, >>>>> unsigned long address, pte_t *ptep) >>>>> { >>>>> - int young; >>>>> - >>>>> - young = ptep_test_and_clear_young(vma, address, ptep); >>>>> - if (young) >>>>> - flush_tlb_page(vma, address); >>>>> - >>>>> - return young; >>>>> + /* >>>>> + * In X86, clearing access bit without TLB flush doesn't cause data >>>>> + * corruption. Doing this could cause wrong page aging and so hot pages >>>>> + * are reclaimed, but the chance should be very rare. >>>>> + */ >>>>> + return ptep_test_and_clear_young(vma, address, ptep); >>>>> } >>>> >>>> >>>> At this point, we could use vma->vm_mm->cpu_vm_mask_var to >>>> set (or clear) some bit in the per-cpu data of each CPU that >>>> has active/valid tlb state for the mm in question. >>>> >>>> I could see using cpu_tlbstate.state for this, or maybe >>>> another variable in cpu_tlbstate, so switch_mm will load >>>> both items with the same cache line. >>>> >>>> At schedule time, the function switch_mm() can examine that >>>> variable (it already touches that data, anyway), and flush >>>> the TLB even if prev==next. >>>> >>>> I suspect that would be both low overhead enough to get you >>>> the performance gains you want, and address the concern that >>>> we do want to flush the TLB at some point. >>>> >>>> Does that sound reasonable? >>> >>> So looks what you suggested is to force tlb flush for a mm with access bit >>> cleared in two corner cases: >>> 1. lazy tlb flush >>> 2. context switch between threads from one process >>> >>> Am I missing anything? I'm wonering if we should care about these corner cases. 
>> I believe the corner case is relatively rare, but I also
>> suspect that your patch could fail pretty badly in some
>> of those cases, and the fix is easy...
>>
>>> On the other hand, a thread might run long time without schedule. If the corner
>>> cases are an issue, the long run thread is a severer issue. My point is context
>>> switch does provide a safeguard, but we don't depend on it. The whole theory at
>>> the back of this patch is page which has access bit cleared is unlikely
>>> accessed again when its pte entry is still in tlb cache.
>>
>> On the contrary, a TLB with a good cache policy should
>> retain the most actively used entries, in favor of
>> less actively used ones.
>>
>> That means the pages we care most about keeping, are
>> the ones also most at danger of not having the accessed
>> bit flushed to memory.
>>
>> Does the attached (untested) patch look reasonable?
>
> It works obviously. Test shows tehre is no extra tradeoff too compared to just
> skip tlb flush. So I have no objection to this if you insist a safeguard like
> this. Should we force no entering lazy tlb too (in context_switch) if
> force_flush is set, because you are talking about it but I didn't see it in the
> patch? Should I push this or will you do it?

Thank you for testing the patch.

I think we should be fine not adding any code to the lazy-TLB entry path, because a kernel thread in lazy TLB mode should not be accessing user space memory, except for the vhost-net thread, which will then wake up userspace and cause the flush.

Since performance numbers were identical, can I use your performance numbers when submitting my patch? :)

--
All rights reversed
* Re: [patch]x86: clearing access bit don't flush tlb 2014-03-30 12:58 ` Rik van Riel @ 2014-03-31 2:16 ` Shaohua Li 0 siblings, 0 replies; 15+ messages in thread From: Shaohua Li @ 2014-03-31 2:16 UTC (permalink / raw) To: Rik van Riel; +Cc: linux-mm, akpm, hughd, mel On Sun, Mar 30, 2014 at 08:58:46AM -0400, Rik van Riel wrote: > On 03/28/2014 03:02 PM, Shaohua Li wrote: > > On Thu, Mar 27, 2014 at 02:41:59PM -0400, Rik van Riel wrote: > >> On 03/27/2014 01:12 PM, Shaohua Li wrote: > >>> On Wed, Mar 26, 2014 at 07:55:51PM -0400, Rik van Riel wrote: > >>>> On 03/26/2014 06:30 PM, Shaohua Li wrote: > >>>>> > >>>>> I posted this patch a year ago or so, but it gets lost. Repost it here to check > >>>>> if we can make progress this time. > >>>> > >>>> I believe we can make progress. However, I also > >>>> believe the code could be enhanced to address a > >>>> concern that Hugh raised last time this was > >>>> proposed... > >>>> > >>>>> And according to intel manual, tlb has less than 1k entries, which covers < 4M > >>>>> memory. In today's system, several giga byte memory is normal. After page > >>>>> reclaim clears pte access bit and before cpu access the page again, it's quite > >>>>> unlikely this page's pte is still in TLB. And context swich will flush tlb too. > >>>>> The chance skiping tlb flush to impact page reclaim should be very rare. > >>>> > >>>> Context switch to a kernel thread does not result in a > >>>> TLB flush, due to the lazy TLB code. > >>>> > >>>> While I agree with you that clearing the TLB right at > >>>> the moment the accessed bit is cleared in a PTE is > >>>> not necessary, I believe it would be good to clear > >>>> the TLB on affected CPUs relatively soon, maybe at the > >>>> next time schedule is called? 
> >>>> > >>>>> --- linux.orig/arch/x86/mm/pgtable.c 2014-03-27 05:22:08.572100549 +0800 > >>>>> +++ linux/arch/x86/mm/pgtable.c 2014-03-27 05:46:12.456131121 +0800 > >>>>> @@ -399,13 +399,12 @@ int pmdp_test_and_clear_young(struct vm_ > >>>>> int ptep_clear_flush_young(struct vm_area_struct *vma, > >>>>> unsigned long address, pte_t *ptep) > >>>>> { > >>>>> - int young; > >>>>> - > >>>>> - young = ptep_test_and_clear_young(vma, address, ptep); > >>>>> - if (young) > >>>>> - flush_tlb_page(vma, address); > >>>>> - > >>>>> - return young; > >>>>> + /* > >>>>> + * In X86, clearing access bit without TLB flush doesn't cause data > >>>>> + * corruption. Doing this could cause wrong page aging and so hot pages > >>>>> + * are reclaimed, but the chance should be very rare. > >>>>> + */ > >>>>> + return ptep_test_and_clear_young(vma, address, ptep); > >>>>> } > >>>> > >>>> > >>>> At this point, we could use vma->vm_mm->cpu_vm_mask_var to > >>>> set (or clear) some bit in the per-cpu data of each CPU that > >>>> has active/valid tlb state for the mm in question. > >>>> > >>>> I could see using cpu_tlbstate.state for this, or maybe > >>>> another variable in cpu_tlbstate, so switch_mm will load > >>>> both items with the same cache line. > >>>> > >>>> At schedule time, the function switch_mm() can examine that > >>>> variable (it already touches that data, anyway), and flush > >>>> the TLB even if prev==next. > >>>> > >>>> I suspect that would be both low overhead enough to get you > >>>> the performance gains you want, and address the concern that > >>>> we do want to flush the TLB at some point. > >>>> > >>>> Does that sound reasonable? > >>> > >>> So looks what you suggested is to force tlb flush for a mm with access bit > >>> cleared in two corner cases: > >>> 1. lazy tlb flush > >>> 2. context switch between threads from one process > >>> > >>> Am I missing anything? I'm wonering if we should care about these corner cases. 
> >> > >> I believe the corner case is relatively rare, but I also > >> suspect that your patch could fail pretty badly in some > >> of those cases, and the fix is easy... > >> > >>> On the other hand, a thread might run long time without schedule. If the corner > >>> cases are an issue, the long run thread is a severer issue. My point is context > >>> switch does provide a safeguard, but we don't depend on it. The whole theory at > >>> the back of this patch is page which has access bit cleared is unlikely > >>> accessed again when its pte entry is still in tlb cache. > >> > >> On the contrary, a TLB with a good cache policy should > >> retain the most actively used entries, in favor of > >> less actively used ones. > >> > >> That means the pages we care most about keeping, are > >> the ones also most at danger of not having the accessed > >> bit flushed to memory. > >> > >> Does the attached (untested) patch look reasonable? > > > > It works obviously. Test shows tehre is no extra tradeoff too compared to just > > skip tlb flush. So I have no objection to this if you insist a safeguard like > > this. Should we force no entering lazy tlb too (in context_switch) if > > force_flush is set, because you are talking about it but I didn't see it in the > > patch? Should I push this or will you do it? > > Thank you for testing the patch. > > I think we should be fine not adding any code to the lazy tlb > entering code, because a kernel thread in lazy tlb mode should > not be accessing user space memory, except for the vhost-net > thread, which will then wake up userspace, and cause the flush. > > Since performance numbers were identical, can I use your > performance numbers when submitting my patch? :) Sure, I tested 10 threads sequential access memory to 2 pcie ssd. without the patch, the swapout speed is around 1.5G/s; with the patch, the speed is around 1.85G/s. As I said before, the patch can provide about 20% ~ 30% swapout speedup for ssd swap. 
Thanks, Shaohua
* Re: [patch]x86: clearing access bit don't flush tlb 2014-03-26 22:30 [patch]x86: clearing access bit don't flush tlb Shaohua Li 2014-03-26 23:55 ` Rik van Riel @ 2014-04-02 13:01 ` Mel Gorman 2014-04-02 15:42 ` Hugh Dickins 1 sibling, 1 reply; 15+ messages in thread From: Mel Gorman @ 2014-04-02 13:01 UTC (permalink / raw) To: Shaohua Li; +Cc: linux-mm, akpm, hughd, riel, mel On Thu, Mar 27, 2014 at 06:30:34AM +0800, Shaohua Li wrote: > > I posted this patch a year ago or so, but it gets lost. Repost it here to check > if we can make progress this time. > > We use access bit to age a page at page reclaim. When clearing pte access bit, > we could skip tlb flush in X86. The side effect is if the pte is in tlb and pte > access bit is unset in page table, when cpu access the page again, cpu will not > set page table pte's access bit. Next time page reclaim will think this hot > page is old and reclaim it wrongly, but this doesn't corrupt data. > > And according to intel manual, tlb has less than 1k entries, which covers < 4M > memory. In today's system, several giga byte memory is normal. After page > reclaim clears pte access bit and before cpu access the page again, it's quite > unlikely this page's pte is still in TLB. And context swich will flush tlb too. > The chance skiping tlb flush to impact page reclaim should be very rare. > > Originally (in 2.5 kernel maybe), we didn't do tlb flush after clear access bit. > Hugh added it to fix some ARM and sparc issues. Since I only change this for > x86, there should be no risk. > > And in some workloads, TLB flush overhead is very heavy. In my simple > multithread app with a lot of swap to several pcie SSD, removing the tlb flush > gives about 20% ~ 30% swapout speedup. > > Signed-off-by: Shaohua Li <shli@fusionio.com> I'm aware of the discussion on the more complex version and the outcome of that. 
While I think the corner case is real, it's also very unlikely, and as this is an x86-only change it is at least safe from corruption; Acked-by: Mel Gorman <mgorman@suse.de> Shaohua, you almost certainly should resend this to Andrew with the acks you collected so that he does not have to dig into the history trying to figure out what the exact story is. Thanks. -- Mel Gorman SUSE Labs
* Re: [patch]x86: clearing access bit don't flush tlb 2014-04-02 13:01 ` Mel Gorman @ 2014-04-02 15:42 ` Hugh Dickins 0 siblings, 0 replies; 15+ messages in thread From: Hugh Dickins @ 2014-04-02 15:42 UTC (permalink / raw) To: Shaohua Li; +Cc: Mel Gorman, linux-mm, akpm, hughd, riel, mel On Wed, 2 Apr 2014, Mel Gorman wrote: > On Thu, Mar 27, 2014 at 06:30:34AM +0800, Shaohua Li wrote: > > > > I posted this patch a year ago or so, but it gets lost. Repost it here to check > > if we can make progress this time. > > > > We use access bit to age a page at page reclaim. When clearing pte access bit, > > we could skip tlb flush in X86. The side effect is if the pte is in tlb and pte > > access bit is unset in page table, when cpu access the page again, cpu will not > > set page table pte's access bit. Next time page reclaim will think this hot > > page is old and reclaim it wrongly, but this doesn't corrupt data. > > > > And according to intel manual, tlb has less than 1k entries, which covers < 4M > > memory. In today's system, several giga byte memory is normal. After page > > reclaim clears pte access bit and before cpu access the page again, it's quite > > unlikely this page's pte is still in TLB. And context swich will flush tlb too. > > The chance skiping tlb flush to impact page reclaim should be very rare. > > > > Originally (in 2.5 kernel maybe), we didn't do tlb flush after clear access bit. > > Hugh added it to fix some ARM and sparc issues. Since I only change this for > > x86, there should be no risk. > > > > And in some workloads, TLB flush overhead is very heavy. In my simple > > multithread app with a lot of swap to several pcie SSD, removing the tlb flush > > gives about 20% ~ 30% swapout speedup. > > > > Signed-off-by: Shaohua Li <shli@fusionio.com> > > I'm aware of the discussion on the more complex version and the outcome > of that. 
While I think the corner case is real, I think it's also very > unlikely and as this is an x86-only thing which will be safe from > corruption at least; > > Acked-by: Mel Gorman <mgorman@suse.de> > > Shaohua, you almost certainly should resend this to Andrew with the > ack's you collected so that he does not have to dig into the history > trying to figure out what the exact story is. And you can add my Acked-by: Hugh Dickins <hughd@google.com> to your collection too: you and I discussed this at LSF/MM, and nowadays I agree that the corner case that originally worried me (highly-accessed page not getting its accessed bit updated and then temporarily unmapped) is too unlikely a case to refuse the optimization: it might happen occasionally, but I doubt anybody will notice. Hugh
* [patch]x86: clearing access bit don't flush tlb @ 2014-04-03 0:42 Shaohua Li 2014-04-03 11:35 ` [patch] x86: " Ingo Molnar 0 siblings, 1 reply; 15+ messages in thread From: Shaohua Li @ 2014-04-03 0:42 UTC (permalink / raw) To: linux-kernel; +Cc: linux-mm, akpm, mingo, riel, hughd, mgorman, torvalds

Add a few acks and resend this patch.

We use the access bit to age a page at page reclaim. When clearing the pte access bit, we can skip the tlb flush on x86. The side effect is that if the pte is in the tlb and the pte access bit is unset in the page table, then when the cpu accesses the page again, the cpu will not set the page table pte's access bit. The next time page reclaim runs, it will think this hot page is old and reclaim it wrongly, but this doesn't corrupt data.

And according to the Intel manual, the tlb has less than 1k entries, which covers < 4M of memory. In today's systems, several gigabytes of memory is normal. After page reclaim clears a pte access bit and before the cpu accesses the page again, it's quite unlikely that this page's pte is still in the TLB. And a context switch will flush the tlb too. The chance of the skipped tlb flush impacting page reclaim should be very rare.

Originally (in the 2.5 kernel maybe), we didn't do a tlb flush after clearing the access bit. Hugh added it to fix some ARM and sparc issues. Since I only change this for x86, there should be no risk.

And in some workloads, TLB flush overhead is very heavy. In my simple multithreaded app with a lot of swap to several pcie SSDs, removing the tlb flush gives about a 20% ~ 30% swapout speedup.
Signed-off-by: Shaohua Li <shli@fusionio.com> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Mel Gorman <mgorman@suse.de> Acked-by: Hugh Dickins <hughd@google.com> --- arch/x86/mm/pgtable.c | 13 ++++++------- 1 file changed, 6 insertions(+), 7 deletions(-) Index: linux/arch/x86/mm/pgtable.c =================================================================== --- linux.orig/arch/x86/mm/pgtable.c 2014-03-27 05:22:08.572100549 +0800 +++ linux/arch/x86/mm/pgtable.c 2014-03-27 05:46:12.456131121 +0800 @@ -399,13 +399,12 @@ int pmdp_test_and_clear_young(struct vm_ int ptep_clear_flush_young(struct vm_area_struct *vma, unsigned long address, pte_t *ptep) { - int young; - - young = ptep_test_and_clear_young(vma, address, ptep); - if (young) - flush_tlb_page(vma, address); - - return young; + /* + * In X86, clearing access bit without TLB flush doesn't cause data + * corruption. Doing this could cause wrong page aging and so hot pages + * are reclaimed, but the chance should be very rare. + */ + return ptep_test_and_clear_young(vma, address, ptep); } #ifdef CONFIG_TRANSPARENT_HUGEPAGE
* Re: [patch] x86: clearing access bit don't flush tlb 2014-04-03 0:42 Shaohua Li @ 2014-04-03 11:35 ` Ingo Molnar 2014-04-03 13:45 ` Shaohua Li 2014-04-08 7:58 ` Shaohua Li 0 siblings, 2 replies; 15+ messages in thread From: Ingo Molnar @ 2014-04-03 11:35 UTC (permalink / raw) To: Shaohua Li Cc: linux-kernel, linux-mm, akpm, riel, hughd, mgorman, torvalds, Peter Zijlstra, Thomas Gleixner * Shaohua Li <shli@kernel.org> wrote: > Add a few acks and resend this patch. > > We use access bit to age a page at page reclaim. When clearing pte access bit, > we could skip tlb flush in X86. The side effect is if the pte is in tlb and pte > access bit is unset in page table, when cpu access the page again, cpu will not > set page table pte's access bit. Next time page reclaim will think this hot > page is yong and reclaim it wrongly, but this doesn't corrupt data. > > And according to intel manual, tlb has less than 1k entries, which covers < 4M > memory. In today's system, several giga byte memory is normal. After page > reclaim clears pte access bit and before cpu access the page again, it's quite > unlikely this page's pte is still in TLB. And context swich will flush tlb too. > The chance skiping tlb flush to impact page reclaim should be very rare. > > Originally (in 2.5 kernel maybe), we didn't do tlb flush after clear access bit. > Hugh added it to fix some ARM and sparc issues. Since I only change this for > x86, there should be no risk. > > And in some workloads, TLB flush overhead is very heavy. In my simple > multithread app with a lot of swap to several pcie SSD, removing the tlb flush > gives about 20% ~ 30% swapout speedup. 
> > Signed-off-by: Shaohua Li <shli@fusionio.com> > Acked-by: Rik van Riel <riel@redhat.com> > Acked-by: Mel Gorman <mgorman@suse.de> > Acked-by: Hugh Dickins <hughd@google.com> > --- > arch/x86/mm/pgtable.c | 13 ++++++------- > 1 file changed, 6 insertions(+), 7 deletions(-) > > Index: linux/arch/x86/mm/pgtable.c > =================================================================== > --- linux.orig/arch/x86/mm/pgtable.c 2014-03-27 05:22:08.572100549 +0800 > +++ linux/arch/x86/mm/pgtable.c 2014-03-27 05:46:12.456131121 +0800 > @@ -399,13 +399,12 @@ int pmdp_test_and_clear_young(struct vm_ > int ptep_clear_flush_young(struct vm_area_struct *vma, > unsigned long address, pte_t *ptep) > { > - int young; > - > - young = ptep_test_and_clear_young(vma, address, ptep); > - if (young) > - flush_tlb_page(vma, address); > - > - return young; > + /* > + * In X86, clearing access bit without TLB flush doesn't cause data > + * corruption. Doing this could cause wrong page aging and so hot pages > + * are reclaimed, but the chance should be very rare. So, beyond the spelling mistakes, I guess this explanation should also be a bit more explanatory - how about something like: /* * On x86 CPUs, clearing the accessed bit without a TLB flush * doesn't cause data corruption. [ It could cause incorrect * page aging and the (mistaken) reclaim of hot pages, but the * chance of that should be relatively low. ] * * So as a performance optimization don't flush the TLB when * clearing the accessed bit, it will eventually be flushed by * a context switch or a VM operation anyway. [ In the rare * event of it not getting flushed for a long time the delay * shouldn't really matter because there's no real memory * pressure for swapout to react to. ] */ Agreed? Thanks, Ingo -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
* Re: [patch] x86: clearing access bit don't flush tlb 2014-04-03 11:35 ` [patch] x86: " Ingo Molnar @ 2014-04-03 13:45 ` Shaohua Li 2014-04-04 15:01 ` Johannes Weiner 2014-04-08 7:58 ` Shaohua Li 1 sibling, 1 reply; 15+ messages in thread From: Shaohua Li @ 2014-04-03 13:45 UTC (permalink / raw) To: Ingo Molnar Cc: linux-kernel, linux-mm, akpm, riel, hughd, mgorman, torvalds, Peter Zijlstra, Thomas Gleixner On Thu, Apr 03, 2014 at 01:35:37PM +0200, Ingo Molnar wrote: > > * Shaohua Li <shli@kernel.org> wrote: > > > Add a few acks and resend this patch. > > > > We use access bit to age a page at page reclaim. When clearing pte access bit, > > we could skip tlb flush in X86. The side effect is if the pte is in tlb and pte > > access bit is unset in page table, when cpu access the page again, cpu will not > > set page table pte's access bit. Next time page reclaim will think this hot > > page is yong and reclaim it wrongly, but this doesn't corrupt data. > > > > And according to intel manual, tlb has less than 1k entries, which covers < 4M > > memory. In today's system, several giga byte memory is normal. After page > > reclaim clears pte access bit and before cpu access the page again, it's quite > > unlikely this page's pte is still in TLB. And context swich will flush tlb too. > > The chance skiping tlb flush to impact page reclaim should be very rare. > > > > Originally (in 2.5 kernel maybe), we didn't do tlb flush after clear access bit. > > Hugh added it to fix some ARM and sparc issues. Since I only change this for > > x86, there should be no risk. > > > > And in some workloads, TLB flush overhead is very heavy. In my simple > > multithread app with a lot of swap to several pcie SSD, removing the tlb flush > > gives about 20% ~ 30% swapout speedup. 
> > > > Signed-off-by: Shaohua Li <shli@fusionio.com> > > Acked-by: Rik van Riel <riel@redhat.com> > > Acked-by: Mel Gorman <mgorman@suse.de> > > Acked-by: Hugh Dickins <hughd@google.com> > > --- > > arch/x86/mm/pgtable.c | 13 ++++++------- > > 1 file changed, 6 insertions(+), 7 deletions(-) > > > > Index: linux/arch/x86/mm/pgtable.c > > =================================================================== > > --- linux.orig/arch/x86/mm/pgtable.c 2014-03-27 05:22:08.572100549 +0800 > > +++ linux/arch/x86/mm/pgtable.c 2014-03-27 05:46:12.456131121 +0800 > > @@ -399,13 +399,12 @@ int pmdp_test_and_clear_young(struct vm_ > > int ptep_clear_flush_young(struct vm_area_struct *vma, > > unsigned long address, pte_t *ptep) > > { > > - int young; > > - > > - young = ptep_test_and_clear_young(vma, address, ptep); > > - if (young) > > - flush_tlb_page(vma, address); > > - > > - return young; > > + /* > > + * In X86, clearing access bit without TLB flush doesn't cause data > > + * corruption. Doing this could cause wrong page aging and so hot pages > > + * are reclaimed, but the chance should be very rare. > > So, beyond the spelling mistakes, I guess this explanation should also > be a bit more explanatory - how about something like: > > /* > * On x86 CPUs, clearing the accessed bit without a TLB flush > * doesn't cause data corruption. [ It could cause incorrect > * page aging and the (mistaken) reclaim of hot pages, but the > * chance of that should be relatively low. ] > * > * So as a performance optimization don't flush the TLB when > * clearing the accessed bit, it will eventually be flushed by > * a context switch or a VM operation anyway. [ In the rare > * event of it not getting flushed for a long time the delay > * shouldn't really matter because there's no real memory > * pressure for swapout to react to. ] > */ > > Agreed? Sure, that's better, thanks! -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. 
* Re: [patch] x86: clearing access bit don't flush tlb 2014-04-03 13:45 ` Shaohua Li @ 2014-04-04 15:01 ` Johannes Weiner 0 siblings, 0 replies; 15+ messages in thread From: Johannes Weiner @ 2014-04-04 15:01 UTC (permalink / raw) To: Shaohua Li Cc: Ingo Molnar, linux-kernel, linux-mm, akpm, riel, hughd, mgorman, torvalds, Peter Zijlstra, Thomas Gleixner On Thu, Apr 03, 2014 at 09:45:42PM +0800, Shaohua Li wrote: > On Thu, Apr 03, 2014 at 01:35:37PM +0200, Ingo Molnar wrote: > > > > * Shaohua Li <shli@kernel.org> wrote: > > > > > Add a few acks and resend this patch. > > > > > > We use access bit to age a page at page reclaim. When clearing pte access bit, > > > we could skip tlb flush in X86. The side effect is if the pte is in tlb and pte > > > access bit is unset in page table, when cpu access the page again, cpu will not > > > set page table pte's access bit. Next time page reclaim will think this hot > > > page is yong and reclaim it wrongly, but this doesn't corrupt data. > > > > > > And according to intel manual, tlb has less than 1k entries, which covers < 4M > > > memory. In today's system, several giga byte memory is normal. After page > > > reclaim clears pte access bit and before cpu access the page again, it's quite > > > unlikely this page's pte is still in TLB. And context swich will flush tlb too. > > > The chance skiping tlb flush to impact page reclaim should be very rare. > > > > > > Originally (in 2.5 kernel maybe), we didn't do tlb flush after clear access bit. > > > Hugh added it to fix some ARM and sparc issues. Since I only change this for > > > x86, there should be no risk. > > > > > > And in some workloads, TLB flush overhead is very heavy. In my simple > > > multithread app with a lot of swap to several pcie SSD, removing the tlb flush > > > gives about 20% ~ 30% swapout speedup. 
> > > > > > Signed-off-by: Shaohua Li <shli@fusionio.com> > > > Acked-by: Rik van Riel <riel@redhat.com> > > > Acked-by: Mel Gorman <mgorman@suse.de> > > > Acked-by: Hugh Dickins <hughd@google.com> > > > --- > > > arch/x86/mm/pgtable.c | 13 ++++++------- > > > 1 file changed, 6 insertions(+), 7 deletions(-) > > > > > > Index: linux/arch/x86/mm/pgtable.c > > > =================================================================== > > > --- linux.orig/arch/x86/mm/pgtable.c 2014-03-27 05:22:08.572100549 +0800 > > > +++ linux/arch/x86/mm/pgtable.c 2014-03-27 05:46:12.456131121 +0800 > > > @@ -399,13 +399,12 @@ int pmdp_test_and_clear_young(struct vm_ > > > int ptep_clear_flush_young(struct vm_area_struct *vma, > > > unsigned long address, pte_t *ptep) > > > { > > > - int young; > > > - > > > - young = ptep_test_and_clear_young(vma, address, ptep); > > > - if (young) > > > - flush_tlb_page(vma, address); > > > - > > > - return young; > > > + /* > > > + * In X86, clearing access bit without TLB flush doesn't cause data > > > + * corruption. Doing this could cause wrong page aging and so hot pages > > > + * are reclaimed, but the chance should be very rare. > > > > So, beyond the spelling mistakes, I guess this explanation should also > > be a bit more explanatory - how about something like: > > > > /* > > * On x86 CPUs, clearing the accessed bit without a TLB flush > > * doesn't cause data corruption. [ It could cause incorrect > > * page aging and the (mistaken) reclaim of hot pages, but the > > * chance of that should be relatively low. ] > > * > > * So as a performance optimization don't flush the TLB when > > * clearing the accessed bit, it will eventually be flushed by > > * a context switch or a VM operation anyway. [ In the rare > > * event of it not getting flushed for a long time the delay > > * shouldn't really matter because there's no real memory > > * pressure for swapout to react to. ] > > */ > > > > Agreed? > > Sure, that's better, thanks! 
With Ingo's updated comment: Acked-by: Johannes Weiner <hannes@cmpxchg.org>
* Re: [patch] x86: clearing access bit don't flush tlb 2014-04-03 11:35 ` [patch] x86: " Ingo Molnar 2014-04-03 13:45 ` Shaohua Li @ 2014-04-08 7:58 ` Shaohua Li 2014-04-14 11:36 ` Ingo Molnar 1 sibling, 1 reply; 15+ messages in thread From: Shaohua Li @ 2014-04-08 7:58 UTC (permalink / raw) To: Ingo Molnar Cc: linux-kernel, linux-mm, akpm, riel, hughd, mgorman, torvalds, Peter Zijlstra, Thomas Gleixner On Thu, Apr 03, 2014 at 01:35:37PM +0200, Ingo Molnar wrote: > > * Shaohua Li <shli@kernel.org> wrote: > > > Add a few acks and resend this patch. > > > > We use access bit to age a page at page reclaim. When clearing pte access bit, > > we could skip tlb flush in X86. The side effect is if the pte is in tlb and pte > > access bit is unset in page table, when cpu access the page again, cpu will not > > set page table pte's access bit. Next time page reclaim will think this hot > > page is yong and reclaim it wrongly, but this doesn't corrupt data. > > > > And according to intel manual, tlb has less than 1k entries, which covers < 4M > > memory. In today's system, several giga byte memory is normal. After page > > reclaim clears pte access bit and before cpu access the page again, it's quite > > unlikely this page's pte is still in TLB. And context swich will flush tlb too. > > The chance skiping tlb flush to impact page reclaim should be very rare. > > > > Originally (in 2.5 kernel maybe), we didn't do tlb flush after clear access bit. > > Hugh added it to fix some ARM and sparc issues. Since I only change this for > > x86, there should be no risk. > > > > And in some workloads, TLB flush overhead is very heavy. In my simple > > multithread app with a lot of swap to several pcie SSD, removing the tlb flush > > gives about 20% ~ 30% swapout speedup. 
> > > > Signed-off-by: Shaohua Li <shli@fusionio.com> > > Acked-by: Rik van Riel <riel@redhat.com> > > Acked-by: Mel Gorman <mgorman@suse.de> > > Acked-by: Hugh Dickins <hughd@google.com> > > --- > > arch/x86/mm/pgtable.c | 13 ++++++------- > > 1 file changed, 6 insertions(+), 7 deletions(-) > > > > Index: linux/arch/x86/mm/pgtable.c > > =================================================================== > > --- linux.orig/arch/x86/mm/pgtable.c 2014-03-27 05:22:08.572100549 +0800 > > +++ linux/arch/x86/mm/pgtable.c 2014-03-27 05:46:12.456131121 +0800 > > @@ -399,13 +399,12 @@ int pmdp_test_and_clear_young(struct vm_ > > int ptep_clear_flush_young(struct vm_area_struct *vma, > > unsigned long address, pte_t *ptep) > > { > > - int young; > > - > > - young = ptep_test_and_clear_young(vma, address, ptep); > > - if (young) > > - flush_tlb_page(vma, address); > > - > > - return young; > > + /* > > + * In X86, clearing access bit without TLB flush doesn't cause data > > + * corruption. Doing this could cause wrong page aging and so hot pages > > + * are reclaimed, but the chance should be very rare. > > So, beyond the spelling mistakes, I guess this explanation should also > be a bit more explanatory - how about something like: > > /* > * On x86 CPUs, clearing the accessed bit without a TLB flush > * doesn't cause data corruption. [ It could cause incorrect > * page aging and the (mistaken) reclaim of hot pages, but the > * chance of that should be relatively low. ] > * > * So as a performance optimization don't flush the TLB when > * clearing the accessed bit, it will eventually be flushed by > * a context switch or a VM operation anyway. [ In the rare > * event of it not getting flushed for a long time the delay > * shouldn't really matter because there's no real memory > * pressure for swapout to react to. ] > */ > > Agreed? Changed the comments and added ACK of Johannes, so you can pick up directly. 
Subject: x86: clearing access bit don't flush tlb

We use the access bit to age a page at page reclaim. When clearing the pte access bit, we can skip the tlb flush on x86. The side effect is that if the pte is in the tlb and the pte access bit is unset in the page table, then when the cpu accesses the page again, the cpu will not set the page table pte's access bit. The next time page reclaim runs, it will think this hot page is old and reclaim it wrongly, but this doesn't corrupt data.

And according to the Intel manual, the tlb has less than 1k entries, which covers < 4M of memory. In today's systems, several gigabytes of memory is normal. After page reclaim clears a pte access bit and before the cpu accesses the page again, it's quite unlikely that this page's pte is still in the TLB. And a context switch will flush the tlb too. The chance of the skipped tlb flush impacting page reclaim should be very rare.

Originally (in the 2.5 kernel maybe), we didn't do a tlb flush after clearing the access bit. Hugh added it to fix some ARM and sparc issues. Since I only change this for x86, there should be no risk.

And in some workloads, TLB flush overhead is very heavy. In my simple multithreaded app with a lot of swap to several pcie SSDs, removing the tlb flush gives about a 20% ~ 30% swapout speedup.

Updated comments by Ingo.
Signed-off-by: Shaohua Li <shli@fusionio.com> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Mel Gorman <mgorman@suse.de> Acked-by: Hugh Dickins <hughd@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> --- arch/x86/mm/pgtable.c | 21 ++++++++++++++------- 1 file changed, 14 insertions(+), 7 deletions(-) Index: linux/arch/x86/mm/pgtable.c =================================================================== --- linux.orig/arch/x86/mm/pgtable.c 2014-04-07 08:36:02.843221074 +0800 +++ linux/arch/x86/mm/pgtable.c 2014-04-07 08:37:26.438170140 +0800 @@ -399,13 +399,20 @@ int pmdp_test_and_clear_young(struct vm_ int ptep_clear_flush_young(struct vm_area_struct *vma, unsigned long address, pte_t *ptep) { - int young; - - young = ptep_test_and_clear_young(vma, address, ptep); - if (young) - flush_tlb_page(vma, address); - - return young; + /* + * On x86 CPUs, clearing the accessed bit without a TLB flush + * doesn't cause data corruption. [ It could cause incorrect + * page aging and the (mistaken) reclaim of hot pages, but the + * chance of that should be relatively low. ] + * + * So as a performance optimization don't flush the TLB when + * clearing the accessed bit, it will eventually be flushed by + * a context switch or a VM operation anyway. [ In the rare + * event of it not getting flushed for a long time the delay + * shouldn't really matter because there's no real memory + * pressure for swapout to react to. ] + */ + return ptep_test_and_clear_young(vma, address, ptep); } #ifdef CONFIG_TRANSPARENT_HUGEPAGE -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [patch] x86: clearing access bit don't flush tlb 2014-04-08 7:58 ` Shaohua Li @ 2014-04-14 11:36 ` Ingo Molnar 0 siblings, 0 replies; 15+ messages in thread From: Ingo Molnar @ 2014-04-14 11:36 UTC (permalink / raw) To: Shaohua Li Cc: linux-kernel, linux-mm, akpm, riel, hughd, mgorman, torvalds, Peter Zijlstra, Thomas Gleixner * Shaohua Li <shli@kernel.org> wrote: > On Thu, Apr 03, 2014 at 01:35:37PM +0200, Ingo Molnar wrote: > > > > * Shaohua Li <shli@kernel.org> wrote: > > > > > Add a few acks and resend this patch. > > > > > > We use access bit to age a page at page reclaim. When clearing pte access bit, > > > we could skip tlb flush in X86. The side effect is if the pte is in tlb and pte > > > access bit is unset in page table, when cpu access the page again, cpu will not > > > set page table pte's access bit. Next time page reclaim will think this hot > > > page is yong and reclaim it wrongly, but this doesn't corrupt data. > > > > > > And according to intel manual, tlb has less than 1k entries, which covers < 4M > > > memory. In today's system, several giga byte memory is normal. After page > > > reclaim clears pte access bit and before cpu access the page again, it's quite > > > unlikely this page's pte is still in TLB. And context swich will flush tlb too. > > > The chance skiping tlb flush to impact page reclaim should be very rare. > > > > > > Originally (in 2.5 kernel maybe), we didn't do tlb flush after clear access bit. > > > Hugh added it to fix some ARM and sparc issues. Since I only change this for > > > x86, there should be no risk. > > > > > > And in some workloads, TLB flush overhead is very heavy. In my simple > > > multithread app with a lot of swap to several pcie SSD, removing the tlb flush > > > gives about 20% ~ 30% swapout speedup. 
> > > > > > Signed-off-by: Shaohua Li <shli@fusionio.com> > > > Acked-by: Rik van Riel <riel@redhat.com> > > > Acked-by: Mel Gorman <mgorman@suse.de> > > > Acked-by: Hugh Dickins <hughd@google.com> > > > --- > > > arch/x86/mm/pgtable.c | 13 ++++++------- > > > 1 file changed, 6 insertions(+), 7 deletions(-) > > > > > > Index: linux/arch/x86/mm/pgtable.c > > > =================================================================== > > > --- linux.orig/arch/x86/mm/pgtable.c 2014-03-27 05:22:08.572100549 +0800 > > > +++ linux/arch/x86/mm/pgtable.c 2014-03-27 05:46:12.456131121 +0800 > > > @@ -399,13 +399,12 @@ int pmdp_test_and_clear_young(struct vm_ > > > int ptep_clear_flush_young(struct vm_area_struct *vma, > > > unsigned long address, pte_t *ptep) > > > { > > > - int young; > > > - > > > - young = ptep_test_and_clear_young(vma, address, ptep); > > > - if (young) > > > - flush_tlb_page(vma, address); > > > - > > > - return young; > > > + /* > > > + * In X86, clearing access bit without TLB flush doesn't cause data > > > + * corruption. Doing this could cause wrong page aging and so hot pages > > > + * are reclaimed, but the chance should be very rare. > > > > So, beyond the spelling mistakes, I guess this explanation should also > > be a bit more explanatory - how about something like: > > > > /* > > * On x86 CPUs, clearing the accessed bit without a TLB flush > > * doesn't cause data corruption. [ It could cause incorrect > > * page aging and the (mistaken) reclaim of hot pages, but the > > * chance of that should be relatively low. ] > > * > > * So as a performance optimization don't flush the TLB when > > * clearing the accessed bit, it will eventually be flushed by > > * a context switch or a VM operation anyway. [ In the rare > > * event of it not getting flushed for a long time the delay > > * shouldn't really matter because there's no real memory > > * pressure for swapout to react to. ] > > */ > > > > Agreed? 
> > Changed the comments and added ACK of Johannes, so you can pick up directly. > > Subject: x86: clearing access bit don't flush tlb > > We use access bit to age a page at page reclaim. When clearing pte access bit, > we could skip tlb flush in X86. The side effect is if the pte is in tlb and pte > access bit is unset in page table, when cpu access the page again, cpu will not > set page table pte's access bit. Next time page reclaim will think this hot > page is yong and reclaim it wrongly, but this doesn't corrupt data. > > And according to intel manual, tlb has less than 1k entries, which covers < 4M > memory. In today's system, several giga byte memory is normal. After page > reclaim clears pte access bit and before cpu access the page again, it's quite > unlikely this page's pte is still in TLB. And context swich will flush tlb too. > The chance skiping tlb flush to impact page reclaim should be very rare. > > Originally (in 2.5 kernel maybe), we didn't do tlb flush after clear access bit. > Hugh added it to fix some ARM and sparc issues. Since I only change this for > x86, there should be no risk. > > And in some workloads, TLB flush overhead is very heavy. In my simple > multithread app with a lot of swap to several pcie SSD, removing the tlb flush > gives about 20% ~ 30% swapout speedup. > > Update comments by Ingo. I fixed this changelog as well. 
> Signed-off-by: Shaohua Li <shli@fusionio.com>
> Acked-by: Rik van Riel <riel@redhat.com>
> Acked-by: Mel Gorman <mgorman@suse.de>
> Acked-by: Hugh Dickins <hughd@google.com>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  arch/x86/mm/pgtable.c | 21 ++++++++++++++-------
>  1 file changed, 14 insertions(+), 7 deletions(-)
>
> Index: linux/arch/x86/mm/pgtable.c
> ===================================================================
> --- linux.orig/arch/x86/mm/pgtable.c	2014-04-07 08:36:02.843221074 +0800
> +++ linux/arch/x86/mm/pgtable.c	2014-04-07 08:37:26.438170140 +0800
> @@ -399,13 +399,20 @@ int pmdp_test_and_clear_young(struct vm_
>  int ptep_clear_flush_young(struct vm_area_struct *vma,
>  			   unsigned long address, pte_t *ptep)
>  {
> -	int young;
> -
> -	young = ptep_test_and_clear_young(vma, address, ptep);
> -	if (young)
> -		flush_tlb_page(vma, address);
> -
> -	return young;
> +	/*
> +	 * On x86 CPUs, clearing the accessed bit without a TLB flush
> +	 * doesn't cause data corruption. [ It could cause incorrect
> +	 * page aging and the (mistaken) reclaim of hot pages, but the
> +	 * chance of that should be relatively low. ]
> +	 *
> +	 * So as a performance optimization don't flush the TLB when
> +	 * clearing the accessed bit, it will eventually be flushed by
> +	 * a context switch or a VM operation anyway. [ In the rare
> +	 * event of it not getting flushed for a long time the delay
> +	 * shouldn't really matter because there's no real memory
> +	 * pressure for swapout to react to. ]
> +	 */

There's whitespace damage here - I fixed that up as well.

Please use scripts/checkpatch.pl before submitting patches, to make sure
there are no fixable problems in it.

Thanks,

	Ingo
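The comment's claim — skipping the flush can only mis-age a page, never corrupt data — can be sketched with a small user-space model. This is hypothetical illustration code with invented names, not kernel code; it models one PTE and one cached TLB translation:

```c
#include <stdbool.h>
#include <stdint.h>

#define PTE_ACCESSED 0x20u		/* x86 _PAGE_ACCESSED is bit 5 */

struct pte { uint64_t val; };
struct tlb { bool cached; };		/* one cached translation */

/*
 * A CPU touch of the page: with a cached translation the page walk is
 * skipped, so the in-memory accessed bit is NOT re-set -- the access
 * goes unrecorded. The data access itself still works either way.
 */
static void cpu_touch(struct pte *pte, struct tlb *tlb)
{
	if (!tlb->cached) {
		pte->val |= PTE_ACCESSED;	/* hardware sets A on the walk */
		tlb->cached = true;
	}
}

/*
 * Reclaim-side aging, loosely mirroring ptep_clear_flush_young():
 * clear the accessed bit and, only if 'flush' is set, invalidate the
 * cached translation so the next touch re-walks the page table.
 */
static bool clear_young(struct pte *pte, struct tlb *tlb, bool flush)
{
	bool young = pte->val & PTE_ACCESSED;

	pte->val &= ~(uint64_t)PTE_ACCESSED;
	if (young && flush)
		tlb->cached = false;
	return young;
}
```

Without the flush, a touch that hits the stale TLB entry leaves the accessed bit clear, so the next aging pass sees the hot page as cold. Once the entry is evicted (a context switch, TLB capacity pressure), aging recovers — and since the translation itself was never wrong, no data is corrupted.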
end of thread, other threads:[~2014-04-14 11:36 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-03-26 22:30 [patch]x86: clearing access bit don't flush tlb Shaohua Li
2014-03-26 23:55 ` Rik van Riel
2014-03-27 17:12   ` Shaohua Li
2014-03-27 18:41     ` Rik van Riel
2014-03-28 19:02       ` Shaohua Li
2014-03-30 12:58         ` Rik van Riel
2014-03-31  2:16           ` Shaohua Li
2014-04-02 13:01 ` Mel Gorman
2014-04-02 15:42   ` Hugh Dickins
-- strict thread matches above, loose matches on Subject: below --
2014-04-03  0:42 Shaohua Li
2014-04-03 11:35 ` [patch] x86: clearing access bit don't flush tlb Ingo Molnar
2014-04-03 13:45   ` Shaohua Li
2014-04-04 15:01     ` Johannes Weiner
2014-04-08  7:58       ` Shaohua Li
2014-04-14 11:36         ` Ingo Molnar