* [PATCH] x86,mm: delay TLB flush after clearing accessed bit
From: Rik van Riel @ 2014-03-31 15:34 UTC
  To: linux-kernel; +Cc: linux-mm, shli, akpm, mingo, hughd, mgorman

Doing an immediate TLB flush after clearing the accesed bit
in page tables results in a lot of extra TLB flushes when there
is memory pressure. This used to not be a problem, when swap
was done to spinning disks, but with SSDs it is starting to
become an issue.

However, since clearing the accessed bit does not lead to any
consistency issues, there is no reason to flush the TLB
immediately. The TLB flush can be deferred until some
later point in time.

The lazy TLB flush code already has a data structure that
is used at context switch time to determine whether or not
the TLB needs to be flushed. The accessed bit clearing code
can piggyback on top of that same data structure, allowing
the context switch code to check whether a TLB flush needs
to be forced when switching between the same mm, without
incurring an additional cache miss.

In Shaohua's multi-threaded test with a lot of swap to several
PCIe SSDs, this patch results in about 20-30% swapout speedup,
increasing swapout speed from 1.5GB/s to 1.85GB/s.

Tested-by: Shaohua Li <shli@kernel.org>
Signed-off-by: Rik van Riel <riel@redhat.com>
---
 arch/x86/include/asm/mmu_context.h |  5 ++++-
 arch/x86/include/asm/tlbflush.h    | 12 ++++++++++++
 arch/x86/mm/pgtable.c              |  9 ++++++---
 3 files changed, 22 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index be12c53..665d98b 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -39,6 +39,7 @@ static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
 #ifdef CONFIG_SMP
 		this_cpu_write(cpu_tlbstate.state, TLBSTATE_OK);
 		this_cpu_write(cpu_tlbstate.active_mm, next);
+		this_cpu_write(cpu_tlbstate.force_flush, false);
 #endif
 		cpumask_set_cpu(cpu, mm_cpumask(next));
 
@@ -57,7 +58,8 @@ static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
 		this_cpu_write(cpu_tlbstate.state, TLBSTATE_OK);
 		BUG_ON(this_cpu_read(cpu_tlbstate.active_mm) != next);
 
-		if (!cpumask_test_cpu(cpu, mm_cpumask(next))) {
+		if (!cpumask_test_cpu(cpu, mm_cpumask(next)) ||
+				this_cpu_read(cpu_tlbstate.force_flush)) {
 			/*
 			 * On established mms, the mm_cpumask is only changed
 			 * from irq context, from ptep_clear_flush() while in
@@ -70,6 +72,7 @@ static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
 			 * tlb flush IPI delivery. We must reload CR3
 			 * to make sure to use no freed page tables.
 			 */
+			this_cpu_write(cpu_tlbstate.force_flush, false);
 			load_cr3(next->pgd);
 			load_LDT_nolock(&next->context);
 		}
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 04905bf..f2cda2c 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -151,6 +151,10 @@ static inline void reset_lazy_tlbstate(void)
 {
 }
 
+static inline void tlb_set_force_flush(int cpu)
+{
+}
+
 static inline void flush_tlb_kernel_range(unsigned long start,
 					  unsigned long end)
 {
@@ -187,6 +191,7 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
 struct tlb_state {
 	struct mm_struct *active_mm;
 	int state;
+	bool force_flush;
 };
 DECLARE_PER_CPU_SHARED_ALIGNED(struct tlb_state, cpu_tlbstate);
 
@@ -196,6 +201,13 @@ static inline void reset_lazy_tlbstate(void)
 	this_cpu_write(cpu_tlbstate.active_mm, &init_mm);
 }
 
+static inline void tlb_set_force_flush(int cpu)
+{
+	struct tlb_state *percputlb= &per_cpu(cpu_tlbstate, cpu);
+	if (percputlb->force_flush == false)
+		percputlb->force_flush = true;
+}
+
 #endif	/* SMP */
 
 #ifndef CONFIG_PARAVIRT
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index c96314a..dcd26e9 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -4,6 +4,7 @@
 #include <asm/pgtable.h>
 #include <asm/tlb.h>
 #include <asm/fixmap.h>
+#include <asm/tlbflush.h>
 
 #define PGALLOC_GFP GFP_KERNEL | __GFP_NOTRACK | __GFP_REPEAT | __GFP_ZERO
 
@@ -399,11 +400,13 @@ int pmdp_test_and_clear_young(struct vm_area_struct *vma,
 int ptep_clear_flush_young(struct vm_area_struct *vma,
 			   unsigned long address, pte_t *ptep)
 {
-	int young;
+	int young, cpu;
 
 	young = ptep_test_and_clear_young(vma, address, ptep);
-	if (young)
-		flush_tlb_page(vma, address);
+	if (young) {
+		for_each_cpu(cpu, vma->vm_mm->cpu_vm_mask_var)
+			tlb_set_force_flush(cpu);
+	}
 
 	return young;
 }


* Re: [PATCH] x86,mm: delay TLB flush after clearing accessed bit
From: Ingo Molnar @ 2014-04-01 10:53 UTC
  To: Rik van Riel
  Cc: linux-kernel, linux-mm, shli, akpm, hughd, mgorman,
	Linus Torvalds, Peter Zijlstra, Thomas Gleixner, H. Peter Anvin


The speedup looks good to me!

I have one major concern (see the last item), plus a few minor nits:

* Rik van Riel <riel@redhat.com> wrote:

> Doing an immediate TLB flush after clearing the accesed bit

s/accesed/accessed

> in page tables results in a lot of extra TLB flushes when there
> is memory pressure. This used to not be a problem, when swap
> was done to spinning disks, but with SSDs it is starting to
> become an issue.

s/This used to not be a problem/This did not use to be a problem

> However, since clearing the accessed bit does not lead to any
> consistency issues, there is no reason to flush the TLB
> immediately. The TLB flush can be deferred until some
> later point in time.
> 
> The lazy TLB flush code already has a data structure that
> is used at context switch time to determine whether or not
> the TLB needs to be flushed. The accessed bit clearing code
> can piggyback on top of that same data structure, allowing
> the context switch code to check whether a TLB flush needs
> to be forced when switching between the same mm, without
> incurring an additional cache miss.
> 
> In Shaohua's multi-threaded test with a lot of swap to several
> PCIe SSDs, this patch results in about 20-30% swapout speedup,
> increasing swapout speed from 1.5GB/s to 1.85GB/s.
> 
> Tested-by: Shaohua Li <shli@kernel.org>
> Signed-off-by: Rik van Riel <riel@redhat.com>
> ---
>  arch/x86/include/asm/mmu_context.h |  5 ++++-
>  arch/x86/include/asm/tlbflush.h    | 12 ++++++++++++
>  arch/x86/mm/pgtable.c              |  9 ++++++---
>  3 files changed, 22 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
> index be12c53..665d98b 100644
> --- a/arch/x86/include/asm/mmu_context.h
> +++ b/arch/x86/include/asm/mmu_context.h
> @@ -39,6 +39,7 @@ static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
>  #ifdef CONFIG_SMP
>  		this_cpu_write(cpu_tlbstate.state, TLBSTATE_OK);
>  		this_cpu_write(cpu_tlbstate.active_mm, next);
> +		this_cpu_write(cpu_tlbstate.force_flush, false);
>  #endif
>  		cpumask_set_cpu(cpu, mm_cpumask(next));
>  
> @@ -57,7 +58,8 @@ static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
>  		this_cpu_write(cpu_tlbstate.state, TLBSTATE_OK);
>  		BUG_ON(this_cpu_read(cpu_tlbstate.active_mm) != next);
>  
> -		if (!cpumask_test_cpu(cpu, mm_cpumask(next))) {
> +		if (!cpumask_test_cpu(cpu, mm_cpumask(next)) ||
> +				this_cpu_read(cpu_tlbstate.force_flush)) {

Should this be unlikely() perhaps?

>  			/*
>  			 * On established mms, the mm_cpumask is only changed
>  			 * from irq context, from ptep_clear_flush() while in
> @@ -70,6 +72,7 @@ static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
>  			 * tlb flush IPI delivery. We must reload CR3
>  			 * to make sure to use no freed page tables.
>  			 */
> +			this_cpu_write(cpu_tlbstate.force_flush, false);
>  			load_cr3(next->pgd);
>  			load_LDT_nolock(&next->context);
>  		}
> diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
> index 04905bf..f2cda2c 100644
> --- a/arch/x86/include/asm/tlbflush.h
> +++ b/arch/x86/include/asm/tlbflush.h
> @@ -151,6 +151,10 @@ static inline void reset_lazy_tlbstate(void)
>  {
>  }
>  
> +static inline void tlb_set_force_flush(int cpu)
> +{
> +}
> +
>  static inline void flush_tlb_kernel_range(unsigned long start,
>  					  unsigned long end)
>  {
> @@ -187,6 +191,7 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
>  struct tlb_state {
>  	struct mm_struct *active_mm;
>  	int state;
> +	bool force_flush;
>  };
>  DECLARE_PER_CPU_SHARED_ALIGNED(struct tlb_state, cpu_tlbstate);
>  
> @@ -196,6 +201,13 @@ static inline void reset_lazy_tlbstate(void)
>  	this_cpu_write(cpu_tlbstate.active_mm, &init_mm);
>  }
>  
> +static inline void tlb_set_force_flush(int cpu)
> +{
> +	struct tlb_state *percputlb= &per_cpu(cpu_tlbstate, cpu);

s/b= /b = /

> +	if (percputlb->force_flush == false)
> +		percputlb->force_flush = true;
> +}
> +
>  #endif	/* SMP */
>  
>  #ifndef CONFIG_PARAVIRT
> diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
> index c96314a..dcd26e9 100644
> --- a/arch/x86/mm/pgtable.c
> +++ b/arch/x86/mm/pgtable.c
> @@ -4,6 +4,7 @@
>  #include <asm/pgtable.h>
>  #include <asm/tlb.h>
>  #include <asm/fixmap.h>
> +#include <asm/tlbflush.h>
>  
>  #define PGALLOC_GFP GFP_KERNEL | __GFP_NOTRACK | __GFP_REPEAT | __GFP_ZERO
>  
> @@ -399,11 +400,13 @@ int pmdp_test_and_clear_young(struct vm_area_struct *vma,
>  int ptep_clear_flush_young(struct vm_area_struct *vma,
>  			   unsigned long address, pte_t *ptep)
>  {
> -	int young;
> +	int young, cpu;
>  
>  	young = ptep_test_and_clear_young(vma, address, ptep);
> -	if (young)
> -		flush_tlb_page(vma, address);
> +	if (young) {
> +		for_each_cpu(cpu, vma->vm_mm->cpu_vm_mask_var)
> +			tlb_set_force_flush(cpu);

Hm, just to play the devil's advocate - what happens when we have a va 
that is used on a few dozen, a few hundred or a few thousand CPUs? 
Will the savings be dwarfed by the O(nr_cpus_used) loop overhead?

Especially as this is touching cachelines on other CPUs and likely 
creating the worst kind of cache misses. That can really kill
performance.

Thanks,

	Ingo


* Re: [PATCH] x86,mm: delay TLB flush after clearing accessed bit
From: Rik van Riel @ 2014-04-01 12:55 UTC
  To: Ingo Molnar
  Cc: linux-kernel, linux-mm, shli, akpm, hughd, mgorman,
	Linus Torvalds, Peter Zijlstra, Thomas Gleixner, H. Peter Anvin

On 04/01/2014 06:53 AM, Ingo Molnar wrote:
> 
> The speedup looks good to me!
> 
> I have one major concern (see the last item), plus a few minor nits:

I will address all the minor issues. Let me explain the major one :)

>> @@ -196,6 +201,13 @@ static inline void reset_lazy_tlbstate(void)
>>  	this_cpu_write(cpu_tlbstate.active_mm, &init_mm);
>>  }
>>  
>> +static inline void tlb_set_force_flush(int cpu)
>> +{
>> +	struct tlb_state *percputlb= &per_cpu(cpu_tlbstate, cpu);
> 
> s/b= /b = /
> 
>> +	if (percputlb->force_flush == false)
>> +		percputlb->force_flush = true;
>> +}
>> +
>>  #endif	/* SMP */

This code does a test before the set, so each cache line will only be
grabbed exclusively once if there is heavy pageout scanning activity.
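
In other words, it is the classic test-before-set pattern. A
standalone sketch of the idea (all names and the padding are invented
for illustration; this is not the kernel code itself):

#include <stdbool.h>

/* One flag per CPU, padded so each flag sits in its own cache line. */
struct flag_state {
	bool force_flush;
	char pad[63];
};

static void set_force_flush(struct flag_state *fs)
{
	/*
	 * Test before set: if the flag is already set, no store is
	 * done, so the line can stay in shared state in the owning
	 * CPU's cache; only the first request per pageout run pulls
	 * the line exclusive and dirties it.
	 */
	if (!fs->force_flush)
		fs->force_flush = true;
}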

>> @@ -399,11 +400,13 @@ int pmdp_test_and_clear_young(struct vm_area_struct *vma,
>>  int ptep_clear_flush_young(struct vm_area_struct *vma,
>>  			   unsigned long address, pte_t *ptep)
>>  {
>> -	int young;
>> +	int young, cpu;
>>  
>>  	young = ptep_test_and_clear_young(vma, address, ptep);
>> -	if (young)
>> -		flush_tlb_page(vma, address);
>> +	if (young) {
>> +		for_each_cpu(cpu, vma->vm_mm->cpu_vm_mask_var)
>> +			tlb_set_force_flush(cpu);
> 
> Hm, just to play the devil's advocate - what happens when we have a va 
> that is used on a few dozen, a few hundred or a few thousand CPUs? 
> Will the savings be dwarfed by the O(nr_cpus_used) loop
> 
> Especially as this is touching cachelines on other CPUs and likely 
> creating the worst kind of cache misses. That can really kill
> performance.

flush_tlb_page does the same O(nr_cpus_used) loop, but it sends an
IPI to each CPU every time, instead of dirtying a cache line once
per pageout run (or until the next context switch).

Does that address your concern?

-- 
All rights reversed


* Re: [PATCH] x86,mm: delay TLB flush after clearing accessed bit
From: Ingo Molnar @ 2014-04-01 13:20 UTC
  To: Rik van Riel
  Cc: linux-kernel, linux-mm, shli, akpm, hughd, mgorman,
	Linus Torvalds, Peter Zijlstra, Thomas Gleixner, H. Peter Anvin


* Rik van Riel <riel@redhat.com> wrote:

> >>  int ptep_clear_flush_young(struct vm_area_struct *vma,
> >>  			   unsigned long address, pte_t *ptep)
> >>  {
> >> -	int young;
> >> +	int young, cpu;
> >>  
> >>  	young = ptep_test_and_clear_young(vma, address, ptep);
> >> -	if (young)
> >> -		flush_tlb_page(vma, address);
> >> +	if (young) {
> >> +		for_each_cpu(cpu, vma->vm_mm->cpu_vm_mask_var)
> >> +			tlb_set_force_flush(cpu);
> > 
> > Hm, just to play the devil's advocate - what happens when we have 
> > a va that is used on a few dozen, a few hundred or a few thousand 
> > CPUs? Will the savings be dwarfed by the O(nr_cpus_used) loop 
> > overhead?
> > 
> > Especially as this is touching cachelines on other CPUs and likely 
> > creating the worst kind of cache misses. That can really kill 
> > performance.
> 
> flush_tlb_page does the same O(nr_cpus_used) loop, but it sends an 
> IPI to each CPU every time, instead of dirtying a cache line once 
> per pageout run (or until the next context switch).
> 
> Does that address your concern?

That depends on the platform - which could implement flush_tlb_page() 
as a broadcast IPI - but yes, it was bad before as well; now it has 
become more visible and I noticed it :)

Wouldn't it be more scalable to use a generation count as a timestamp, 
and set that in the mm? An mm that last flushed before that timestamp 
would need to flush, or so. That gets rid of the mask logic and the loop, 
AFAICS.
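
Something like this very rough sketch - every name in it is invented
for illustration and locking/atomicity is ignored, so it's not meant
as an actual patch:

/* Per-mm: bumped by the pageout code instead of looping over CPUs. */
struct mm_gen {
	unsigned long flush_gen;
};

/* Per-CPU: the mm generation this CPU last flushed at. */
struct cpu_gen {
	unsigned long seen_gen;
};

/* Eviction side: O(1), touches only the mm's own cache line. */
static void request_deferred_flush(struct mm_gen *mm)
{
	mm->flush_gen++;
}

/* Context switch side: flush (reload CR3) if this CPU is behind. */
static int needs_flush(struct cpu_gen *cpu, struct mm_gen *mm)
{
	if (cpu->seen_gen != mm->flush_gen) {
		cpu->seen_gen = mm->flush_gen;
		return 1;
	}
	return 0;
}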

Thanks,

	Ingo


* Re: [PATCH] x86,mm: delay TLB flush after clearing accessed bit
From: Rik van Riel @ 2014-04-01 13:26 UTC
  To: Ingo Molnar
  Cc: linux-kernel, linux-mm, shli, akpm, hughd, mgorman,
	Linus Torvalds, Peter Zijlstra, Thomas Gleixner, H. Peter Anvin

On 04/01/2014 09:20 AM, Ingo Molnar wrote:
> 
> * Rik van Riel <riel@redhat.com> wrote:
> 
>>>>  int ptep_clear_flush_young(struct vm_area_struct *vma,
>>>>  			   unsigned long address, pte_t *ptep)
>>>>  {
>>>> -	int young;
>>>> +	int young, cpu;
>>>>  
>>>>  	young = ptep_test_and_clear_young(vma, address, ptep);
>>>> -	if (young)
>>>> -		flush_tlb_page(vma, address);
>>>> +	if (young) {
>>>> +		for_each_cpu(cpu, vma->vm_mm->cpu_vm_mask_var)
>>>> +			tlb_set_force_flush(cpu);
>>>
>>> Hm, just to play the devil's advocate - what happens when we have 
>>> a va that is used on a few dozen, a few hundred or a few thousand 
>>> CPUs? Will the savings be dwarfed by the O(nr_cpus_used) loop 
>>> overhead?
>>>
>>> Especially as this is touching cachelines on other CPUs and likely 
>>> creating the worst kind of cache misses. That can really kill 
>>> performance.
>>
>> flush_tlb_page does the same O(nr_cpus_used) loop, but it sends an 
>> IPI to each CPU every time, instead of dirtying a cache line once 
>> per pageout run (or until the next context switch).
>>
>> Does that address your concern?
> 
> That depends on the platform - which could implement flush_tlb_page() 
> as a broadcast IPI - but yes, it was bad before as well; now it has 
> become more visible and I noticed it :)
> 
> Wouldn't it be more scalable to use a generation count as a timestamp, 
> and set that in the mm? An mm that last flushed before that timestamp 
> would need to flush, or so. That gets rid of the mask logic and the loop, 
> AFAICS.

More scalable in the page eviction code, sure.

However, that would cause the context switch code to load an
additional cache line, so I am not convinced that is a good
tradeoff...

-- 
All rights reversed


* Re: [PATCH] x86,mm: delay TLB flush after clearing accessed bit
From: Linus Torvalds @ 2014-04-01 15:13 UTC
  To: Rik van Riel
  Cc: Linux Kernel Mailing List, linux-mm, shli, Andrew Morton,
	Ingo Molnar, Hugh Dickins, Mel Gorman

On Mon, Mar 31, 2014 at 8:34 AM, Rik van Riel <riel@redhat.com> wrote:
>
> However, since clearing the accessed bit does not lead to any
> consistency issues, there is no reason to flush the TLB
> immediately. The TLB flush can be deferred until some
> later point in time.

Ugh. I absolutely detest this patch.

If we're going to leave the TLB dirty, then dammit, leave it dirty.
Don't play some half-way games.

Here's the patch you should just try:

int ptep_clear_flush_young(struct vm_area_struct *vma,
			   unsigned long address, pte_t *ptep)
{
	return ptep_test_and_clear_young(vma, address, ptep);
}

instead of complicating things.

Rationale: if the working set is so big that we start paging things
out, we sure as hell don't need to worry about TLB flushing. It will
flush itself.

And conversely - if it doesn't flush itself, and something stays
marked as "accessed" in the TLB for a long time even though we've
cleared it in the page tables, we don't care, because clearly there
isn't enough memory pressure for the accessed bit to matter.

                  Linus


* Re: [PATCH] x86,mm: delay TLB flush after clearing accessed bit
From: Rik van Riel @ 2014-04-01 16:11 UTC
  To: Linus Torvalds
  Cc: Linux Kernel Mailing List, linux-mm, shli, Andrew Morton,
	Ingo Molnar, Hugh Dickins, Mel Gorman

On 04/01/2014 11:13 AM, Linus Torvalds wrote:
> On Mon, Mar 31, 2014 at 8:34 AM, Rik van Riel <riel@redhat.com> wrote:
>>
>> However, since clearing the accessed bit does not lead to any
>> consistency issues, there is no reason to flush the TLB
>> immediately. The TLB flush can be deferred until some
>> later point in time.
> 
> Ugh. I absolutely detest this patch.
> 
> If we're going to leave the TLB dirty, then dammit, leave it dirty.
> Don't play some half-way games.
> 
> Here's the patch you should just try:
> 
> int ptep_clear_flush_young(struct vm_area_struct *vma,
> 			   unsigned long address, pte_t *ptep)
> {
> 	return ptep_test_and_clear_young(vma, address, ptep);
> }
> 
> instead of complicating things.
> 
> Rationale: if the working set is so big that we start paging things
> out, we sure as hell don't need to worry about TLB flushing. It will
> flush itself.
> 
> And conversely - if it doesn't flush itself, and something stays
> marked as "accessed" in the TLB for a long time even though we've
> cleared it in the page tables, we don't care, because clearly there
> isn't enough memory pressure for the accessed bit to matter.

That was my initial feeling too, when this kind of patch first
came up, a few years ago.

However, the more I think about it, the less I am convinced it
is actually true.

Memory pressure is not necessarily caused by the same process
whose accessed bit we just cleared. Memory pressure may not
even be caused by any process's virtual memory at all, but it
could be caused by the page cache.

With 2MB pages, a reasonably sized process could fit in the
TLB quite easily. Having its accessed bits not make it to the
page table while its pages are on the inactive list could
cause it to get paged out, due to memory pressure from another,
larger process.

I have no particular preference for this implementation, and am
willing to implement any other idea for batching the TLB shootdowns
that are due to pageout scanning.

-- 
All rights reversed


* Re: [PATCH] x86,mm: delay TLB flush after clearing accessed bit
From: Linus Torvalds @ 2014-04-01 16:21 UTC
  To: Rik van Riel
  Cc: Linux Kernel Mailing List, linux-mm, shli, Andrew Morton,
	Ingo Molnar, Hugh Dickins, Mel Gorman

On Tue, Apr 1, 2014 at 9:11 AM, Rik van Riel <riel@redhat.com> wrote:
>
> Memory pressure is not necessarily caused by the same process
> whose accessed bit we just cleared. Memory pressure may not
> even be caused by any process's virtual memory at all, but it
> could be caused by the page cache.

If we have that much memory pressure on the page cache without having
any memory pressure on the actual VM space, then the swap-out activity
will never be an issue anyway.

IOW, I think all these scenarios are made-up. I'd much rather go for a
simpler implementation, and make things more complex only in the
presence of numbers. Of which we have none.

              Linus


* Re: [PATCH] x86,mm: delay TLB flush after clearing accessed bit
From: Rik van Riel @ 2014-04-01 18:31 UTC
  To: Linus Torvalds
  Cc: Linux Kernel Mailing List, linux-mm, shli, Andrew Morton,
	Ingo Molnar, Hugh Dickins, Mel Gorman

On 04/01/2014 12:21 PM, Linus Torvalds wrote:
> On Tue, Apr 1, 2014 at 9:11 AM, Rik van Riel <riel@redhat.com> wrote:
>>
>> Memory pressure is not necessarily caused by the same process
>> whose accessed bit we just cleared. Memory pressure may not
>> even be caused by any process's virtual memory at all, but it
>> could be caused by the page cache.
> 
> If we have that much memory pressure on the page cache without having
> any memory pressure on the actual VM space, then the swap-out activity
> will never be an issue anyway.
> 
> IOW, I think all these scenarios are made-up. I'd much rather go for a
> simpler implementation, and make things more complex only in the
> presence of numbers. Of which we have none.

We've been bitten by the lack of a properly tracked accessed
bit before, but admittedly that was with the KVM code and EPT.

I'll add my Acked-by: to Shaohua's original patch then, and
will keep my eyes open for any problems that may or may not
materialize...

Shaohua?

-- 
All rights reversed


* Re: [PATCH] x86,mm: delay TLB flush after clearing accessed bit
From: Shaohua Li @ 2014-04-02  6:06 UTC
  To: Rik van Riel
  Cc: Linus Torvalds, Linux Kernel Mailing List, linux-mm,
	Andrew Morton, Ingo Molnar, Hugh Dickins, Mel Gorman

On Tue, Apr 01, 2014 at 02:31:31PM -0400, Rik van Riel wrote:
> On 04/01/2014 12:21 PM, Linus Torvalds wrote:
> > On Tue, Apr 1, 2014 at 9:11 AM, Rik van Riel <riel@redhat.com> wrote:
> >>
> >> Memory pressure is not necessarily caused by the same process
> >> whose accessed bit we just cleared. Memory pressure may not
> >> even be caused by any process's virtual memory at all, but it
> >> could be caused by the page cache.
> > 
> > If we have that much memory pressure on the page cache without having
> > any memory pressure on the actual VM space, then the swap-out activity
> > will never be an issue anyway.
> > 
> > IOW, I think all these scenarios are made-up. I'd much rather go for a
> > simpler implementation, and make things more complex only in the
> > presence of numbers. Of which we have none.
> 
> We've been bitten by the lack of a properly tracked accessed
> bit before, but admittedly that was with the KVM code and EPT.
> 
> I'll add my Acked-by: to Shaohua's original patch then, and
> will keep my eyes open for any problems that may or may not
> materialize...
> 
> Shaohua?

I'd agree to go with the simple implementation at the current stage and
check whether problems really show up.

Andrew,
can you please pick up my original patch "x86: clearing access bit don't
flush tlb" (with Rik's Ack)? Or I can resend it if you prefer.

Thanks,
Shaohua


* Re: [PATCH] x86,mm: delay TLB flush after clearing accessed bit
From: Ingo Molnar @ 2014-04-02  7:46 UTC
  To: Shaohua Li
  Cc: Rik van Riel, Linus Torvalds, Linux Kernel Mailing List, linux-mm,
	Andrew Morton, Hugh Dickins, Mel Gorman


* Shaohua Li <shli@kernel.org> wrote:

> On Tue, Apr 01, 2014 at 02:31:31PM -0400, Rik van Riel wrote:
> > On 04/01/2014 12:21 PM, Linus Torvalds wrote:
> > > On Tue, Apr 1, 2014 at 9:11 AM, Rik van Riel <riel@redhat.com> wrote:
> > >>
> > >> Memory pressure is not necessarily caused by the same process
> > >> whose accessed bit we just cleared. Memory pressure may not
> > >> even be caused by any process's virtual memory at all, but it
> > >> could be caused by the page cache.
> > > 
> > > If we have that much memory pressure on the page cache without having
> > > any memory pressure on the actual VM space, then the swap-out activity
> > > will never be an issue anyway.
> > > 
> > > IOW, I think all these scenarios are made-up. I'd much rather go for a
> > > simpler implementation, and make things more complex only in the
> > > presence of numbers. Of which we have none.
> > 
> > We've been bitten by the lack of a properly tracked accessed
> > bit before, but admittedly that was with the KVM code and EPT.
> > 
> > I'll add my Acked-by: to Shaohua's original patch then, and
> > will keep my eyes open for any problems that may or may not
> > materialize...
> > 
> > Shaohua?
> 
> I'd agree to go with the simple implementation at the current stage and
> check whether problems really show up.
> 
> Andrew,
> can you please pick up my original patch "x86: clearing access bit don't
> flush tlb" (with Rik's Ack)? Or I can resend it if you prefer.

Please resend it so I can pick it up for this cycle; that approach 
obviously looks good.

Thanks,

	Ingo


Thread overview: 11+ messages
2014-03-31 15:34 [PATCH] x86,mm: delay TLB flush after clearing accessed bit Rik van Riel
2014-04-01 10:53 ` Ingo Molnar
2014-04-01 12:55   ` Rik van Riel
2014-04-01 13:20     ` Ingo Molnar
2014-04-01 13:26       ` Rik van Riel
2014-04-01 15:13 ` Linus Torvalds
2014-04-01 16:11   ` Rik van Riel
2014-04-01 16:21     ` Linus Torvalds
2014-04-01 18:31       ` Rik van Riel
2014-04-02  6:06         ` Shaohua Li
2014-04-02  7:46           ` Ingo Molnar
