* [PATCH] thp, mm: remove comments on serialization of THP split vs. gup_fast
From: Kirill A. Shutemov @ 2016-02-24 15:59 UTC
To: Andrew Morton
Cc: Gerald Schaefer, Andrea Arcangeli, linux-mm, Kirill A. Shutemov
Previously, __split_huge_page_splitting() required serialization against
gup_fast to make sure nobody can obtain a new reference to the page after
__split_huge_page_splitting() returns. This was a way to stabilize page
references before starting to distribute them from the head page to the
tail pages.

With the new refcounting, we don't care about this. Splitting a PMD is
now decoupled from splitting the underlying compound page. It's okay to
get new pins after split_huge_pmd(). To stabilize page references during
split_huge_page(), we rely on setting up migration entries once all
PMDs are split into page tables.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
mm/gup.c | 11 +++--------
mm/huge_memory.c | 7 +++----
2 files changed, 6 insertions(+), 12 deletions(-)
diff --git a/mm/gup.c b/mm/gup.c
index 7bf19ffa2199..2f528fce3a62 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1087,8 +1087,7 @@ struct page *get_dump_page(unsigned long addr)
*
* get_user_pages_fast attempts to pin user pages by walking the page
* tables directly and avoids taking locks. Thus the walker needs to be
- * protected from page table pages being freed from under it, and should
- * block any THP splits.
+ * protected from page table pages being freed from under it.
*
* One way to achieve this is to have the walker disable interrupts, and
* rely on IPIs from the TLB flushing code blocking before the page table
@@ -1097,9 +1096,8 @@ struct page *get_dump_page(unsigned long addr)
*
* Another way to achieve this is to batch up page table containing pages
* belonging to more than one mm_user, then rcu_sched a callback to free those
- * pages. Disabling interrupts will allow the fast_gup walker to both block
- * the rcu_sched callback, and an IPI that we broadcast for splitting THPs
- * (which is a relatively rare event). The code below adopts this strategy.
+ * pages. Disabling interrupts will allow the fast_gup walker to block
+ * the rcu_sched callback. The code below adopts this strategy.
*
* Before activating this code, please be aware that the following assumptions
* are currently made:
@@ -1391,9 +1389,6 @@ int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
* With interrupts disabled, we block page table pages from being
* freed from under us. See mmu_gather_tlb in asm-generic/tlb.h
* for more details.
- *
- * We do not adopt an rcu_read_lock(.) here as we also want to
- * block IPIs that come from THPs splitting.
*/
local_irq_save(flags);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index e10a4fee88d2..846fe173e04b 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2930,10 +2930,9 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
* for the same virtual address to be loaded simultaneously. So instead
* of doing "pmd_populate(); flush_pmd_tlb_range();" we first mark the
* current pmd notpresent (atomically because here the pmd_trans_huge
- * and pmd_trans_splitting must remain set at all times on the pmd
- * until the split is complete for this pmd), then we flush the SMP TLB
- * and finally we write the non-huge version of the pmd entry with
- * pmd_populate.
+ * must remain set at all times on the pmd until the split_huge_pmd()
+ * is complete, then we flush the SMP TLB and finally we write the
+ * non-huge version of the pmd entry with pmd_populate.
*/
pmdp_invalidate(vma, haddr, pmd);
pmd_populate(mm, pmd, pgtable);
--
2.7.0
* Re: [PATCH] thp, mm: remove comments on serialization of THP split vs. gup_fast
From: Gerald Schaefer @ 2016-02-24 17:50 UTC
To: Kirill A. Shutemov
Cc: Andrew Morton, Andrea Arcangeli, linux-mm, Martin Schwidefsky
On Wed, 24 Feb 2016 18:59:21 +0300
"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> wrote:
> Previously, __split_huge_page_splitting() required serialization against
> gup_fast to make sure nobody can obtain a new reference to the page after
> __split_huge_page_splitting() returns. This was a way to stabilize page
> references before starting to distribute them from the head page to the
> tail pages.
>
> With the new refcounting, we don't care about this. Splitting a PMD is
> now decoupled from splitting the underlying compound page. It's okay to
> get new pins after split_huge_pmd(). To stabilize page references during
> split_huge_page(), we rely on setting up migration entries once all
> PMDs are split into page tables.
>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---
> mm/gup.c | 11 +++--------
> mm/huge_memory.c | 7 +++----
> 2 files changed, 6 insertions(+), 12 deletions(-)
>
> diff --git a/mm/gup.c b/mm/gup.c
> index 7bf19ffa2199..2f528fce3a62 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -1087,8 +1087,7 @@ struct page *get_dump_page(unsigned long addr)
> *
> * get_user_pages_fast attempts to pin user pages by walking the page
> * tables directly and avoids taking locks. Thus the walker needs to be
> - * protected from page table pages being freed from under it, and should
> - * block any THP splits.
> + * protected from page table pages being freed from under it.
> *
> * One way to achieve this is to have the walker disable interrupts, and
> * rely on IPIs from the TLB flushing code blocking before the page table
> @@ -1097,9 +1096,8 @@ struct page *get_dump_page(unsigned long addr)
> *
> * Another way to achieve this is to batch up page table containing pages
> * belonging to more than one mm_user, then rcu_sched a callback to free those
> - * pages. Disabling interrupts will allow the fast_gup walker to both block
> - * the rcu_sched callback, and an IPI that we broadcast for splitting THPs
> - * (which is a relatively rare event). The code below adopts this strategy.
> + * pages. Disabling interrupts will allow the fast_gup walker to block
> + * the rcu_sched callback. The code below adopts this strategy.
> *
> * Before activating this code, please be aware that the following assumptions
> * are currently made:
> @@ -1391,9 +1389,6 @@ int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
> * With interrupts disabled, we block page table pages from being
> * freed from under us. See mmu_gather_tlb in asm-generic/tlb.h
> * for more details.
> - *
> - * We do not adopt an rcu_read_lock(.) here as we also want to
> - * block IPIs that come from THPs splitting.
> */
Hmm, now that the IPI from THP splitting is not needed anymore, this
comment would suggest that we could use rcu_read_lock(_sched) for
fast_gup, instead of keeping the (probably more expensive) IRQ enable/
disable. That should be enough to synchronize against the
call_rcu_sched() from the batched tlb_table_flush, right?
So, if the comment is correct, removing it would also remove the hint
that we could use RCU instead of IRQ enable/disable.
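
For illustration, a minimal sketch of the two schemes being compared
(hypothetical fragment; walk_page_tables() is a made-up placeholder,
not the real mm/gup.c walker):

        unsigned long flags;
        int nr;

        /* Today: IRQs off block the TLB-flush IPIs and, with
         * HAVE_RCU_TABLE_FREE, also hold off the rcu_sched callback
         * that frees page-table pages. */
        local_irq_save(flags);
        nr = walk_page_tables(start, end, pages);
        local_irq_restore(flags);

        /* Suggested alternative: an RCU-sched read-side section
         * equally holds off call_rcu_sched() callbacks, without
         * toggling the IRQ flag. */
        rcu_read_lock_sched();
        nr = walk_page_tables(start, end, pages);
        rcu_read_unlock_sched();
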
>
> local_irq_save(flags);
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index e10a4fee88d2..846fe173e04b 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2930,10 +2930,9 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
> * for the same virtual address to be loaded simultaneously. So instead
> * of doing "pmd_populate(); flush_pmd_tlb_range();" we first mark the
> * current pmd notpresent (atomically because here the pmd_trans_huge
> - * and pmd_trans_splitting must remain set at all times on the pmd
> - * until the split is complete for this pmd), then we flush the SMP TLB
> - * and finally we write the non-huge version of the pmd entry with
> - * pmd_populate.
> + * must remain set at all times on the pmd until the split_huge_pmd()
> + * is complete, then we flush the SMP TLB and finally we write the
> + * non-huge version of the pmd entry with pmd_populate.
> */
> pmdp_invalidate(vma, haddr, pmd);
> pmd_populate(mm, pmd, pgtable);
* Re: [PATCH] thp, mm: remove comments on serialization of THP split vs. gup_fast
From: Kirill A. Shutemov @ 2016-02-25 15:07 UTC
To: Gerald Schaefer, Steve Capper, Dann Frazier, Catalin Marinas,
Hugh Dickins
Cc: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli, linux-mm,
Martin Schwidefsky
On Wed, Feb 24, 2016 at 06:50:25PM +0100, Gerald Schaefer wrote:
> On Wed, 24 Feb 2016 18:59:21 +0300
> "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> wrote:
>
> > Previously, __split_huge_page_splitting() required serialization against
> > gup_fast to make sure nobody can obtain a new reference to the page after
> > __split_huge_page_splitting() returns. This was a way to stabilize page
> > references before starting to distribute them from the head page to the
> > tail pages.
> >
> > With the new refcounting, we don't care about this. Splitting a PMD is
> > now decoupled from splitting the underlying compound page. It's okay to
> > get new pins after split_huge_pmd(). To stabilize page references during
> > split_huge_page(), we rely on setting up migration entries once all
> > PMDs are split into page tables.
> >
> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > ---
> > mm/gup.c | 11 +++--------
> > mm/huge_memory.c | 7 +++----
> > 2 files changed, 6 insertions(+), 12 deletions(-)
> >
> > diff --git a/mm/gup.c b/mm/gup.c
> > index 7bf19ffa2199..2f528fce3a62 100644
> > --- a/mm/gup.c
> > +++ b/mm/gup.c
> > @@ -1087,8 +1087,7 @@ struct page *get_dump_page(unsigned long addr)
> > *
> > * get_user_pages_fast attempts to pin user pages by walking the page
> > * tables directly and avoids taking locks. Thus the walker needs to be
> > - * protected from page table pages being freed from under it, and should
> > - * block any THP splits.
> > + * protected from page table pages being freed from under it.
> > *
> > * One way to achieve this is to have the walker disable interrupts, and
> > * rely on IPIs from the TLB flushing code blocking before the page table
> > @@ -1097,9 +1096,8 @@ struct page *get_dump_page(unsigned long addr)
> > *
> > * Another way to achieve this is to batch up page table containing pages
> > * belonging to more than one mm_user, then rcu_sched a callback to free those
> > - * pages. Disabling interrupts will allow the fast_gup walker to both block
> > - * the rcu_sched callback, and an IPI that we broadcast for splitting THPs
> > - * (which is a relatively rare event). The code below adopts this strategy.
> > + * pages. Disabling interrupts will allow the fast_gup walker to block
> > + * the rcu_sched callback. The code below adopts this strategy.
> > *
> > * Before activating this code, please be aware that the following assumptions
> > * are currently made:
> > @@ -1391,9 +1389,6 @@ int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
> > * With interrupts disabled, we block page table pages from being
> > * freed from under us. See mmu_gather_tlb in asm-generic/tlb.h
> > * for more details.
> > - *
> > - * We do not adopt an rcu_read_lock(.) here as we also want to
> > - * block IPIs that come from THPs splitting.
> > */
>
> Hmm, now that the IPI from THP splitting is not needed anymore, this
> comment would suggest that we could use rcu_read_lock(_sched) for
> fast_gup, instead of keeping the (probably more expensive) IRQ enable/
> disable. That should be enough to synchronize against the
> call_rcu_sched() from the batched tlb_table_flush, right?
Possibly. I'm not hugely aware of all the details here.

+ People from the patch that introduced the comment.

Can anybody comment on this?
--
Kirill A. Shutemov
* Re: [PATCH] thp, mm: remove comments on serialization of THP split vs. gup_fast
From: Hugh Dickins @ 2016-02-26 6:50 UTC
To: Kirill A. Shutemov
Cc: Peter Zijlstra, Gerald Schaefer, Steve Capper, Dann Frazier,
Catalin Marinas, Hugh Dickins, Kirill A. Shutemov, Andrew Morton,
Andrea Arcangeli, linux-mm, Martin Schwidefsky
On Thu, 25 Feb 2016, Kirill A. Shutemov wrote:
> On Wed, Feb 24, 2016 at 06:50:25PM +0100, Gerald Schaefer wrote:
> > On Wed, 24 Feb 2016 18:59:21 +0300
> > "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> wrote:
> >
> > > Previously, __split_huge_page_splitting() required serialization against
> > > gup_fast to make sure nobody can obtain a new reference to the page after
> > > __split_huge_page_splitting() returns. This was a way to stabilize page
> > > references before starting to distribute them from the head page to the
> > > tail pages.
> > >
> > > With the new refcounting, we don't care about this. Splitting a PMD is
> > > now decoupled from splitting the underlying compound page. It's okay to
> > > get new pins after split_huge_pmd(). To stabilize page references during
> > > split_huge_page(), we rely on setting up migration entries once all
> > > PMDs are split into page tables.
> > >
> > > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > > ---
> > > mm/gup.c | 11 +++--------
> > > mm/huge_memory.c | 7 +++----
> > > 2 files changed, 6 insertions(+), 12 deletions(-)
> > >
> > > diff --git a/mm/gup.c b/mm/gup.c
> > > index 7bf19ffa2199..2f528fce3a62 100644
> > > --- a/mm/gup.c
> > > +++ b/mm/gup.c
> > > @@ -1087,8 +1087,7 @@ struct page *get_dump_page(unsigned long addr)
> > > *
> > > * get_user_pages_fast attempts to pin user pages by walking the page
> > > * tables directly and avoids taking locks. Thus the walker needs to be
> > > - * protected from page table pages being freed from under it, and should
> > > - * block any THP splits.
> > > + * protected from page table pages being freed from under it.
> > > *
> > > * One way to achieve this is to have the walker disable interrupts, and
> > > * rely on IPIs from the TLB flushing code blocking before the page table
> > > @@ -1097,9 +1096,8 @@ struct page *get_dump_page(unsigned long addr)
> > > *
> > > * Another way to achieve this is to batch up page table containing pages
> > > * belonging to more than one mm_user, then rcu_sched a callback to free those
> > > - * pages. Disabling interrupts will allow the fast_gup walker to both block
> > > - * the rcu_sched callback, and an IPI that we broadcast for splitting THPs
> > > - * (which is a relatively rare event). The code below adopts this strategy.
> > > + * pages. Disabling interrupts will allow the fast_gup walker to block
> > > + * the rcu_sched callback. The code below adopts this strategy.
> > > *
> > > * Before activating this code, please be aware that the following assumptions
> > > * are currently made:
> > > @@ -1391,9 +1389,6 @@ int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
> > > * With interrupts disabled, we block page table pages from being
> > > * freed from under us. See mmu_gather_tlb in asm-generic/tlb.h
> > > * for more details.
> > > - *
> > > - * We do not adopt an rcu_read_lock(.) here as we also want to
> > > - * block IPIs that come from THPs splitting.
> > > */
> >
> > Hmm, now that the IPI from THP splitting is not needed anymore, this
> > comment would suggest that we could use rcu_read_lock(_sched) for
> > fast_gup, instead of keeping the (probably more expensive) IRQ enable/
> > disable. That should be enough to synchronize against the
> > call_rcu_sched() from the batched tlb_table_flush, right?
>
> Possibly. I'm not hugely aware of all the details here.
>
> + People from the patch that introduced the comment.
>
> Can anybody comment on this?
Peter and Andrea will have a much firmer grasp of this than I do.
I think:
It's a useful suggestion from Gerald, and your THP rework may have
brought us closer to being able to rely on RCU locking rather than
IRQ disablement there; but you were right just to delete the comment:
there are other reasons why fast GUP still depends on IRQs disabled.

For example, see the fallback tlb_remove_table_one() in mm/memory.c:
that one uses smp_call_function(), sending an IPI to all CPUs concerned,
without waiting for an RCU grace period at all.
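
For reference, that fallback, roughly as it stands in mm/memory.c
(quoted from memory, so details may differ):

static void tlb_remove_table_smp_sync(void *arg)
{
        /* Simply deliver the interrupt */
}

static void tlb_remove_table_one(void *table)
{
        /*
         * Without a batch there is no RCU grace period to hide
         * behind: broadcast an IPI and wait for it - which any
         * IRQs-off walker blocks - then free the table.
         */
        smp_call_function(tlb_remove_table_smp_sync, NULL, 1);
        __tlb_remove_table(table);
}
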
Hugh
* Re: [PATCH] thp, mm: remove comments on serialization of THP split vs. gup_fast
From: Peter Zijlstra @ 2016-02-26 11:06 UTC
To: Hugh Dickins
Cc: Kirill A. Shutemov, Gerald Schaefer, Steve Capper, Dann Frazier,
Catalin Marinas, Kirill A. Shutemov, Andrew Morton,
Andrea Arcangeli, linux-mm, Martin Schwidefsky
On Thu, Feb 25, 2016 at 10:50:14PM -0800, Hugh Dickins wrote:
> For example, see the fallback tlb_remove_table_one() in mm/memory.c:
> that one uses smp_call_function(), sending an IPI to all CPUs concerned,
> without waiting for an RCU grace period at all.
The better comment is with mmu_table_batch.
It's been too long for me to fully remember, nor have I really paid much
attention to this code in the past few years, so any memory I have of it
might be totally wrong.

But relying on rcu_read_lock_sched() and friends would mean replacing
that smp_call_function() with synchronize_sched().

A real quick look at the current code seems to suggest that _might_ just
work, but note that it will be slower; RT and HPC people will like you
for it, though.

So it depends on how hard we hit that special, totally out-of-memory
case, and whether we care about some performance if we do.
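
A sketch of that replacement, assuming the fast_gup walkers become
rcu_read_lock_sched() sections (untested):

static void tlb_remove_table_one(void *table)
{
        /* Wait for all RCU-sched read-side sections - i.e. all
         * preemption-disabled regions, including the walkers -
         * to finish, instead of broadcasting an IPI. */
        synchronize_sched();
        __tlb_remove_table(table);
}
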
* Re: [PATCH] thp, mm: remove comments on serialization of THP split vs. gup_fast
From: Martin Schwidefsky @ 2016-02-26 11:41 UTC
To: Peter Zijlstra
Cc: Hugh Dickins, Kirill A. Shutemov, Gerald Schaefer, Steve Capper,
Dann Frazier, Catalin Marinas, Kirill A. Shutemov, Andrew Morton,
Andrea Arcangeli, linux-mm
On Fri, 26 Feb 2016 12:06:50 +0100
Peter Zijlstra <peterz@infradead.org> wrote:
> On Thu, Feb 25, 2016 at 10:50:14PM -0800, Hugh Dickins wrote:
>
> > For example, see the fallback tlb_remove_table_one() in mm/memory.c:
> > that one uses smp_call_function(), sending an IPI to all CPUs concerned,
> > without waiting for an RCU grace period at all.
>
> The better comment is with mmu_table_batch.
>
> It's been too long for me to fully remember, nor have I really paid much
> attention to this code in the past few years, so any memory I have of it
> might be totally wrong.
>
> But relying on rcu_read_lock_sched() and friends would mean replacing
> that smp_call_function() with synchronize_sched().
That makes sense. I just tried that, together with a big fat printk, to see
if we hit that out-of-memory condition in the page-table freeing code.
The system is swapping like mad, but no message so far.
> A real quick look at the current code seems to suggest that _might_ just
> work, but note that it will be slower; RT and HPC people will like you
> for it, though.
>
> So it depends on how hard we hit that special, totally out-of-memory
> case, and whether we care about some performance if we do.
If the system is out of memory badly enough for the page allocation to fail,
an additional synchronize_sched() call probably won't hurt too much. Most
of the time we'll be waiting for I/O anyway.
--
blue skies,
Martin.
"Reality continues to ruin my life." - Calvin.
* Re: [PATCH] thp, mm: remove comments on serialization of THP split vs. gup_fast
From: Hugh Dickins @ 2016-02-29 2:38 UTC
To: Martin Schwidefsky
Cc: Peter Zijlstra, Hugh Dickins, Kirill A. Shutemov, Gerald Schaefer,
Steve Capper, Dann Frazier, Catalin Marinas, Kirill A. Shutemov,
Andrew Morton, Andrea Arcangeli, linux-mm
On Fri, 26 Feb 2016, Martin Schwidefsky wrote:
> On Fri, 26 Feb 2016 12:06:50 +0100
> Peter Zijlstra <peterz@infradead.org> wrote:
> > On Thu, Feb 25, 2016 at 10:50:14PM -0800, Hugh Dickins wrote:
> >
> > > For example, see the fallback tlb_remove_table_one() in mm/memory.c:
> > > that one uses smp_call_function(), sending an IPI to all CPUs concerned,
> > > without waiting for an RCU grace period at all.
> >
> > The better comment is with mmu_table_batch.
> >
> > It's been too long for me to fully remember, nor have I really paid much
> > attention to this code in the past few years, so any memory I have of it
> > might be totally wrong.
> >
> > But relying on rcu_read_lock_sched() and friends would mean replacing
> > that smp_call_function() with synchronize_sched().
>
> That makes sense. I just tried that, together with a big fat printk, to see
> if we hit that out-of-memory condition in the page-table freeing code.
> The system is swapping like mad, but no message so far.
>
> > A real quick look at the current code seems to suggest that _might_ just
> > work, but note that it will be slower; RT and HPC people will like you
> > for it, though.
Thanks for looking, Peter.
Just to be clear, "you" will not be me: it looks simple enough,
but it needs a lot more careful testing than I have time to give it.
And here we are just talking about the HAVE_GENERIC_RCU_GUP,
HAVE_RCU_TABLE_FREE architectures (arm, arm64, powerpc, perhaps sparc).
Whether x86 would like to be converted over to the generic RCU GUP,
and give up on its IRQ disabling, is another matter. Not an argument
I'll get into. Could be a Kconfig choice, I suppose, but that wouldn't
help the distros.
And, for OOM reasons, I do dislike adding any further delay to
freeing pages from exit (or if there's to be an OOM reaper, from zap).
> >
> > So it depends on how hard we hit that special, totally out-of-memory
> > case, and whether we care about some performance if we do.
>
> If the system is out of memory badly enough for the page allocation to fail,
> an additional synchronize_sched() call probably won't hurt too much. Most
> of the time we'll be waiting for I/O anyway.
I agree, that case should be too uncommon, and already too slow, to
worry about any further slowdown there - unless it provides an expedited
way out of OOM in some cases; that might be a consideration.
Hugh
* Re: [PATCH] thp, mm: remove comments on serialization of THP split vs. gup_fast
From: Andrea Arcangeli @ 2016-03-10 16:10 UTC
To: Hugh Dickins
Cc: Kirill A. Shutemov, Peter Zijlstra, Gerald Schaefer, Steve Capper,
Dann Frazier, Catalin Marinas, Kirill A. Shutemov, Andrew Morton,
linux-mm, Martin Schwidefsky
On Thu, Feb 25, 2016 at 10:50:14PM -0800, Hugh Dickins wrote:
> It's a useful suggestion from Gerald, and your THP rework may have
> brought us closer to being able to rely on RCU locking rather than
> IRQ disablement there; but you were right just to delete the comment:
> there are other reasons why fast GUP still depends on IRQs disabled.
>
> For example, see the fallback tlb_remove_table_one() in mm/memory.c:
> that one uses smp_call_function(), sending an IPI to all CPUs concerned,
> without waiting for an RCU grace period at all.
I fully agree: the refcounting change just drops THP splitting from
the equation, but everything else remains. It's not like x86 uses
RCU for gup_fast when CONFIG_TRANSPARENT_HUGEPAGE=n.

The main issue, as Peter also pointed out, is whether it can be faster
to wait for an RCU grace period than to send an IPI only to the CPUs
that have an active_mm matching the one the page belongs to, and I'm not
exactly sure that avoiding the cost of disabling IRQs in gup_fast is
going to pay off. It's not just swap: a large munmap should be able to
free up page tables, or the page tables would get a footprint out of
proportion to the RSS of the process. In turn, munmap would have to
either block synchronously for a long time before returning to userland,
or return to userland while the pagetable memory is still not freed -
and userland may legitimately mmap and munmap again in a loop, with
unclear side effects with regard to false-positive OOM.
Then there's another issue with synchronize_sched():
__get_user_pages_fast has to be safe to run from IRQ context (note the
local_irq_save instead of local_irq_disable) and KVM leverages it. KVM
just requires it to be atomic so it can run from inside a
preempt-disabled section (i.e. inside a spinlock). I'm fairly certain
the IRQ-safe guarantee could be dropped without pain and
rcu_read_lock_sched() would be enough, but the documentation of the
IRQ-safe guarantees provided by __get_user_pages_fast should also be
altered if we were to use synchronize_sched(), and that's a symbol
exported to GPL modules too.
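
To make the constraint concrete, a hedged sketch of the kind of
atomic-context caller KVM is (modeled loosely on its page pinning;
pin_one_page_atomic() is a made-up name):

/* May be called under a spinlock, so it must not sleep. This works
 * because __get_user_pages_fast() only does local_irq_save(), and it
 * would keep working if that were relaxed to rcu_read_lock_sched(). */
static int pin_one_page_atomic(unsigned long addr, struct page **page)
{
        return __get_user_pages_fast(addr, 1, 1, page) == 1 ? 0 : -EFAULT;
}
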
Overall my main concern in switching x86 to RCU gup-fast is the
performance of synchronize_sched in large munmap pagetable teardown.
Thanks,
Andrea
* Re: [PATCH] thp, mm: remove comments on serialization of THP split vs. gup_fast
From: Peter Zijlstra @ 2016-03-10 16:34 UTC
To: Andrea Arcangeli
Cc: Hugh Dickins, Kirill A. Shutemov, Gerald Schaefer, Steve Capper,
Dann Frazier, Catalin Marinas, Kirill A. Shutemov, Andrew Morton,
linux-mm, Martin Schwidefsky
On Thu, Mar 10, 2016 at 05:10:35PM +0100, Andrea Arcangeli wrote:
> On Thu, Feb 25, 2016 at 10:50:14PM -0800, Hugh Dickins wrote:
> > It's a useful suggestion from Gerald, and your THP rework may have
> > brought us closer to being able to rely on RCU locking rather than
> > IRQ disablement there; but you were right just to delete the comment:
> > there are other reasons why fast GUP still depends on IRQs disabled.
> >
> > For example, see the fallback tlb_remove_table_one() in mm/memory.c:
> > that one uses smp_call_function(), sending an IPI to all CPUs concerned,
> > without waiting for an RCU grace period at all.
>
> I fully agree: the refcounting change just drops THP splitting from
> the equation, but everything else remains. It's not like x86 uses
> RCU for gup_fast when CONFIG_TRANSPARENT_HUGEPAGE=n.
>
> The main issue, as Peter also pointed out, is whether it can be faster
> to wait for an RCU grace period than to send an IPI only to the CPUs
> that have an active_mm matching the one the page belongs to
Typically RCU (sched) grace periods take a relative 'forever' compared
to sending IPIs. That is, synchronize_sched() is typically slower.
But, on the upside, not sending IPIs will not perturb those other
CPUs, which is something HPC/RT people like.
> and I'm not exactly sure that avoiding the cost of disabling IRQs in
> gup_fast is going to pay off.
It entirely depends on the workload, of course, but you can do a lot of
gup_fast calls compared to munmap()s. So making gup_fast faster seems
like a potential win. Also, is anybody really interested in munmap()
performance?
> It's not just swap: a large munmap should be able to free up page
> tables, or the page tables would get a footprint out of proportion to
> the RSS of the process. In turn, munmap would have to either block
> synchronously for a long time before returning to userland, or return
> to userland while the pagetable memory is still not freed - and
> userland may legitimately mmap and munmap again in a loop, with unclear
> side effects with regard to false-positive OOM.
I'm not seeing that; the only point where this matters at all is if the
batch alloc fails. Otherwise the RCU_TABLE_FREE stuff uses
call_rcu_sched() and what you write above is already true.
Now, RCU already has an oom_notifier to push work harder if we approach
that.
> Then there's another issue with synchronize_sched():
> __get_user_pages_fast has to be safe to run from IRQ context (note the
> local_irq_save instead of local_irq_disable) and KVM leverages it.
This is unchanged. synchronize_sched() serializes against anything that
disables preemption, and having IRQs disabled is very much included in that.
So there should be no problem running this from IRQ context.
> KVM just requires it to be atomic so it can run from inside a
> preempt-disabled section (i.e. inside a spinlock). I'm fairly certain
> the IRQ-safe guarantee could be dropped without pain and
> rcu_read_lock_sched() would be enough, but the documentation of the
> IRQ-safe guarantees provided by __get_user_pages_fast should also be
> altered if we were to use synchronize_sched(), and that's a symbol
> exported to GPL modules too.
No changes needed.
> Overall my main concern in switching x86 to RCU gup-fast is the
> performance of synchronize_sched in large munmap pagetable teardown.
Normally, as already established by Martin, you should not actually ever
encounter the sync_sched() call. Only under severe memory pressure, when
the batch alloc in tlb_remove_table() fails, is this ever an issue.
And at the point where such allocations fail, performance typically
isn't a concern anymore.
* Re: [PATCH] thp, mm: remove comments on serialization of THP split vs. gup_fast
From: Peter Zijlstra @ 2016-03-10 16:40 UTC
To: Andrea Arcangeli
Cc: Hugh Dickins, Kirill A. Shutemov, Gerald Schaefer, Steve Capper,
Dann Frazier, Catalin Marinas, Kirill A. Shutemov, Andrew Morton,
linux-mm, Martin Schwidefsky
On Thu, Mar 10, 2016 at 05:34:39PM +0100, Peter Zijlstra wrote:
>
> > Then there's another issue with synchronize_sched():
> > __get_user_pages_fast has to be safe to run from IRQ context (note the
> > local_irq_save instead of local_irq_disable) and KVM leverages it.
>
> This is unchanged. synchronize_sched() serializes against anything that
> disables preemption, and having IRQs disabled is very much included in that.
>
> So there should be no problem running this from IRQ context.
Think of it this way: synchronize_sched() waits for every CPU to have
called schedule() at least once. If you're inside an IRQ handler, you
cannot call schedule(); therefore the RCU (sched) QS cannot progress and
any dereferences you make must stay valid.
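
In sketch form (hypothetical fragment; only the pairing matters):

        /* Reader: IRQs (and thus preemption) off makes this an
         * RCU-sched read-side critical section. */
        local_irq_save(flags);
        /* ... walk and dereference page-table pages ... */
        local_irq_restore(flags);       /* a QS can pass only after this */

        /* Updater: returns only after every CPU has passed through a
         * quiescent state, so every walker above has finished. */
        synchronize_sched();
        __tlb_remove_table(table);      /* now safe to free */
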
> > Overall my main concern in switching x86 to RCU gup-fast is the
> > performance of synchronize_sched in large munmap pagetable teardown.
>
> Normally, as already established by Martin, you should not actually ever
> > encounter the sync_sched() call. Only under severe memory pressure, when
> > the batch alloc in tlb_remove_table() fails, is this ever an issue.
>
> And at the point where such allocations fail, performance typically
> isn't a concern anymore.
Note, I'm not advocating switching x86 over (although it might be an
interesting experiment); I just wanted to clarify some points I perceived
you were not entirely clear on.
* Re: [PATCH] thp, mm: remove comments on serialization of THP split vs. gup_fast
From: Andrea Arcangeli @ 2016-03-10 17:04 UTC
To: Peter Zijlstra
Cc: Hugh Dickins, Kirill A. Shutemov, Gerald Schaefer, Steve Capper,
Dann Frazier, Catalin Marinas, Kirill A. Shutemov, Andrew Morton,
linux-mm, Martin Schwidefsky
On Thu, Mar 10, 2016 at 05:34:39PM +0100, Peter Zijlstra wrote:
> I'm not seeing that; the only point where this matters at all is if the
> batch alloc fails. Otherwise the RCU_TABLE_FREE stuff uses
> call_rcu_sched() and what you write above is already true.
Not on x86, only if HAVE_RCU_TABLE_FREE is selected, which is not the
case for x86:
arch/arm/Kconfig: select HAVE_RCU_TABLE_FREE if (SMP && ARM_LPAE)
arch/arm64/Kconfig: select HAVE_RCU_TABLE_FREE
arch/powerpc/Kconfig: select HAVE_RCU_TABLE_FREE if SMP
arch/sparc/Kconfig: select HAVE_RCU_TABLE_FREE if SMP
> Normally, as already established by Martin, you should not actually ever
> encounter the sync_sched() call. Only under severe memory pressure, when
> the batch alloc in tlb_remove_table() fails, is this ever an issue.
What I mean is that large mmap/munmap loops may want to be
benchmarked to see if they end up stalling in the synchronize_sched
case through the memory-pressure handler, in turn not restricting the
synchronize_sched to a real memory-pressure situation but ending up
under memory pressure just because of it. For example, the normal load
running on ARM isn't as diverse as the one on x86, where RCU_TABLE_FREE
has never been exercised at large yet. I doubt anything like that
would ever materialize under light load; a light munmap load certainly
would not be affected.
I doubt it'd be OK if munmap ended up stalling in synchronize_sched. In
fact I'd feel safer if the SRCU context could be added to the mm (but
that costs memory in the mm unless we're lucky with the slab hw
alignment); then I think synchronize_srcu may actually be preferable
to a full synchronize_sched that affects the entire system with
thousands of CPUs. A per-CPU inc wouldn't be a big deal and it would at
least avoid stalling the whole system if a stall eventually has to
happen (unless every CPU is actually running gup_fast, but that's OK in
such a case).
About the IRQ safety of synchronize_sched: I was mistaken, I was thinking
of softirqs, which can block - never mind, sorry. Of course
local_irq_disable or preempt_disable are both valid read-side protection,
as schedule() can't run.
* Re: [PATCH] thp, mm: remove comments on serialization of THP split vs. gup_fast
From: Andrea Arcangeli @ 2016-03-10 17:22 UTC
To: Peter Zijlstra
Cc: Hugh Dickins, Kirill A. Shutemov, Gerald Schaefer, Steve Capper,
Dann Frazier, Catalin Marinas, Kirill A. Shutemov, Andrew Morton,
linux-mm, Martin Schwidefsky
On Thu, Mar 10, 2016 at 06:04:06PM +0100, Andrea Arcangeli wrote:
> > that costs memory in the mm unless we're lucky with the slab hw
> > alignment); then I think synchronize_srcu may actually be preferable
> > to a full synchronize_sched that affects the entire system with
> > thousands of CPUs. A per-CPU inc wouldn't be a big deal and it would at
> > least avoid stalling the whole system if a stall eventually has to
> > happen (unless every CPU is actually running gup_fast, but that's OK in
> > such a case).
Thinking more about this, it'd be OK if the pgtable-freeing SRCU
context were global; no need to mess with the mm. A __percpu inside the
mm wouldn't fly anyway. With SRCU we'd wait only for those CPUs that are
effectively inside gup_fast, most of the time none or a few.

The main worry about synchronize_sched for x86 is that it doesn't
scale as the CPU count increases, and there can be thousands of
those. SRCU has a much smaller issue, as checking those per-CPU
variables is almost instantaneous even if there are thousands of CPUs,
and while local_irq_disable may hurt in gup_fast, srcu_read_lock is
unlikely to be measurable. __gup_fast would also still be OK to call
within IRQs. If SRCU causes problems for preempt-RT, you could
use synchronize_sched there and the model would remain the same.
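
A minimal sketch of that model (entirely hypothetical, not existing
kernel code; walk_page_tables() is a placeholder):

        DEFINE_STATIC_SRCU(gup_fast_srcu);

        int idx, nr;

        /* gup_fast side: a cheap per-CPU increment instead of
         * local_irq_save(); still usable from atomic context. */
        idx = srcu_read_lock(&gup_fast_srcu);
        nr = walk_page_tables(start, end, pages);
        srcu_read_unlock(&gup_fast_srcu, idx);

        /* Teardown side: waits only for CPUs currently inside
         * gup_fast, not for a system-wide grace period. */
        synchronize_srcu(&gup_fast_srcu);
        __tlb_remove_table(table);
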
* Re: [PATCH] thp, mm: remove comments on serialization of THP split vs. gup_fast
From: Peter Zijlstra @ 2016-03-11 9:22 UTC
To: Andrea Arcangeli
Cc: Hugh Dickins, Kirill A. Shutemov, Gerald Schaefer, Steve Capper,
Dann Frazier, Catalin Marinas, Kirill A. Shutemov, Andrew Morton,
linux-mm, Martin Schwidefsky
On Thu, Mar 10, 2016 at 06:22:49PM +0100, Andrea Arcangeli wrote:
> On Thu, Mar 10, 2016 at 06:04:06PM +0100, Andrea Arcangeli wrote:
> > that costs memory in the mm unless we're lucky with the slab hw
> > alignment); then I think synchronize_srcu may actually be preferable
> > to a full synchronize_sched that affects the entire system with
> > thousands of CPUs. A per-CPU inc wouldn't be a big deal and it would at
> > least avoid stalling the whole system if a stall eventually has to
> > happen (unless every CPU is actually running gup_fast, but that's OK in
> > such a case).
>
> Thinking more about this, it'd be OK if the pgtable-freeing SRCU
> context were global; no need to mess with the mm. A __percpu inside the
> mm wouldn't fly anyway. With SRCU we'd wait only for those CPUs that are
> effectively inside gup_fast, most of the time none or a few.
You've not looked at srcu, right? That's not going to make it faster,
not without some serious surgery.