linux-arm-kernel.lists.infradead.org archive mirror
  • * Re: [PATCH v4 2/5] mm: LARGE_ANON_FOLIO for improved performance
           [not found] ` <20230726095146.2826796-3-ryan.roberts@arm.com>
           [not found]   ` <CAOUHufackQzy+yXOzaej+G6DNYK-k9GAUHAK6Vq79BFHr7KwAQ@mail.gmail.com>
    @ 2023-08-01  6:18   ` Yu Zhao
      2023-08-02  9:33     ` Ryan Roberts
      2023-08-03 12:43   ` Ryan Roberts
      2023-08-07  5:24   ` Yu Zhao
      3 siblings, 1 reply; 46+ messages in thread
    From: Yu Zhao @ 2023-08-01  6:18 UTC (permalink / raw)
      To: Ryan Roberts
      Cc: Andrew Morton, Matthew Wilcox, Yin Fengwei, David Hildenbrand,
    	Catalin Marinas, Will Deacon, Anshuman Khandual, Yang Shi,
    	Huang, Ying, Zi Yan, Luis Chamberlain, Itaru Kitayama, linux-mm,
    	linux-kernel, linux-arm-kernel
    
    On Wed, Jul 26, 2023 at 3:52 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
    >
    > Introduce LARGE_ANON_FOLIO feature, which allows anonymous memory to be
    > allocated in large folios of a determined order. All pages of the large
    > folio are pte-mapped during the same page fault, significantly reducing
    > the number of page faults. The number of per-page operations (e.g. ref
    > counting, rmap management, lru list management) is also significantly
    > reduced since those ops now become per-folio.
    >
    > The new behaviour is hidden behind the new LARGE_ANON_FOLIO Kconfig,
    > which defaults to disabled for now; the long-term aim is for this to
    > default to enabled, but there are some risks around internal
    > fragmentation that need to be better understood first.
    >
    > When enabled, the folio order is determined as such: For a vma, process
    > or system that has explicitly disabled THP, we continue to allocate
    > order-0. THP is most likely disabled to avoid any possible internal
    > fragmentation so we honour that request.
    >
    > Otherwise, the return value of arch_wants_pte_order() is used. For vmas
    > that have not explicitly opted-in to use transparent hugepages (e.g.
    > where thp=madvise and the vma does not have MADV_HUGEPAGE), then
    > arch_wants_pte_order() is limited to 64K (or PAGE_SIZE, whichever is
    > bigger). This allows for a performance boost without requiring any
    > explicit opt-in from the workload while limiting internal
    > fragmentation.
    >
    > If the preferred order can't be used (e.g. because the folio would
    > breach the bounds of the vma, or because ptes in the region are already
    > mapped) then we fall back to a suitable lower order; first
    > PAGE_ALLOC_COSTLY_ORDER, then order-0.
    >
    > arch_wants_pte_order() can be overridden by the architecture if desired.
    > Some architectures (e.g. arm64) can coalesce TLB entries if a contiguous
    > set of ptes map physically contiguous, naturally aligned memory, so this
    > mechanism allows the architecture to optimize as required.
    >
    > Here we add the default implementation of arch_wants_pte_order(), used
    > when the architecture does not define it, which returns -1, implying
    > that the HW has no preference. In this case, mm will choose its own
    > default order.
    >
    > Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
    > ---
    >  include/linux/pgtable.h |  13 ++++
    >  mm/Kconfig              |  10 +++
    >  mm/memory.c             | 166 ++++++++++++++++++++++++++++++++++++----
    >  3 files changed, 172 insertions(+), 17 deletions(-)
    >
    > diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
    > index 5063b482e34f..2a1d83775837 100644
    > --- a/include/linux/pgtable.h
    > +++ b/include/linux/pgtable.h
    > @@ -313,6 +313,19 @@ static inline bool arch_has_hw_pte_young(void)
    >  }
    >  #endif
    >
    > +#ifndef arch_wants_pte_order
    > +/*
    > + * Returns preferred folio order for pte-mapped memory. Must be in range [0,
    > + * PMD_SHIFT-PAGE_SHIFT) and must not be order-1 since THP requires large folios
    > + * to be at least order-2. Negative value implies that the HW has no preference
    > + * and mm will choose its own default order.
    > + */
    > +static inline int arch_wants_pte_order(void)
    > +{
    > +       return -1;
    > +}
    > +#endif
    > +
    >  #ifndef __HAVE_ARCH_PTEP_GET_AND_CLEAR
    >  static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
    >                                        unsigned long address,
    > diff --git a/mm/Kconfig b/mm/Kconfig
    > index 09130434e30d..fa61ea160447 100644
    > --- a/mm/Kconfig
    > +++ b/mm/Kconfig
    > @@ -1238,4 +1238,14 @@ config LOCK_MM_AND_FIND_VMA
    >
    >  source "mm/damon/Kconfig"
    >
    > +config LARGE_ANON_FOLIO
    > +       bool "Allocate large folios for anonymous memory"
    > +       depends on TRANSPARENT_HUGEPAGE
    > +       default n
    > +       help
    > +         Use large (bigger than order-0) folios to back anonymous memory where
    > +         possible, even for pte-mapped memory. This reduces the number of page
    > +         faults, as well as other per-page overheads to improve performance for
    > +         many workloads.
    > +
    >  endmenu
    > diff --git a/mm/memory.c b/mm/memory.c
    > index 01f39e8144ef..64c3f242c49a 100644
    > --- a/mm/memory.c
    > +++ b/mm/memory.c
    > @@ -4050,6 +4050,127 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
    >         return ret;
    >  }
    >
    > +static bool vmf_pte_range_changed(struct vm_fault *vmf, int nr_pages)
    > +{
    > +       int i;
    > +
    > +       if (nr_pages == 1)
    > +               return vmf_pte_changed(vmf);
    > +
    > +       for (i = 0; i < nr_pages; i++) {
    > +               if (!pte_none(ptep_get_lockless(vmf->pte + i)))
    > +                       return true;
    > +       }
    > +
    > +       return false;
    > +}
    > +
    > +#ifdef CONFIG_LARGE_ANON_FOLIO
    > +#define ANON_FOLIO_MAX_ORDER_UNHINTED \
    > +               (ilog2(max_t(unsigned long, SZ_64K, PAGE_SIZE)) - PAGE_SHIFT)
    > +
    > +static int anon_folio_order(struct vm_area_struct *vma)
    > +{
    > +       int order;
    > +
    > +       /*
    > +        * If THP is explicitly disabled for either the vma, the process or the
    > +        * system, then this is very likely intended to limit internal
    > +        * fragmentation; in this case, don't attempt to allocate a large
    > +        * anonymous folio.
    > +        *
    > +        * Else, if the vma is eligible for thp, allocate a large folio of the
    > +        * size preferred by the arch. Or if the arch requested a very small
    > +        * size or didn't request a size, then use PAGE_ALLOC_COSTLY_ORDER,
    > +        * which still meets the arch's requirements but means we still take
    > +        * advantage of SW optimizations (e.g. fewer page faults).
    > +        *
    > +        * Finally if thp is enabled but the vma isn't eligible, take the
    > +        * arch-preferred size and limit it to ANON_FOLIO_MAX_ORDER_UNHINTED.
    > +        * This ensures workloads that have not explicitly opted-in take benefit
    > +        * while capping the potential for internal fragmentation.
    > +        */
    > +
    > +       if ((vma->vm_flags & VM_NOHUGEPAGE) ||
    > +           test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags) ||
    > +           !hugepage_flags_enabled())
    > +               order = 0;
    > +       else {
    > +               order = max(arch_wants_pte_order(), PAGE_ALLOC_COSTLY_ORDER);
    > +
    > +               if (!hugepage_vma_check(vma, vma->vm_flags, false, true, true))
    > +                       order = min(order, ANON_FOLIO_MAX_ORDER_UNHINTED);
    > +       }
    > +
    > +       return order;
    > +}
    > +
    > +static int alloc_anon_folio(struct vm_fault *vmf, struct folio **folio)
    > +{
    > +       int i;
    > +       gfp_t gfp;
    > +       pte_t *pte;
    > +       unsigned long addr;
    > +       struct vm_area_struct *vma = vmf->vma;
    > +       int prefer = anon_folio_order(vma);
    > +       int orders[] = {
    > +               prefer,
    > +               prefer > PAGE_ALLOC_COSTLY_ORDER ? PAGE_ALLOC_COSTLY_ORDER : 0,
    > +               0,
    > +       };
    > +
    > +       *folio = NULL;
    > +
    > +       if (vmf_orig_pte_uffd_wp(vmf))
    > +               goto fallback;
    
    I think we need to s/vmf_orig_pte_uffd_wp/userfaultfd_armed/ here;
    otherwise UFFD would miss VM_UFFD_MISSING/MINOR.
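
The difference between the two checks can be sketched in a small userspace model. This is only an illustration, not kernel code: the MODEL_* bit values, struct model_fault and both model_* functions are hypothetical stand-ins. The assumption is that the check in the patch looks only for a uffd-wp marker on the faulting pte, while userfaultfd_armed() reports any userfaultfd registration on the vma.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bit values for this model; the kernel's real flag values differ. */
#define MODEL_UFFD_MISSING 0x1UL
#define MODEL_UFFD_WP      0x2UL
#define MODEL_UFFD_MINOR   0x4UL
#define MODEL_UFFD_ANY     (MODEL_UFFD_MISSING | MODEL_UFFD_WP | MODEL_UFFD_MINOR)

/* Hypothetical stand-in for the parts of vm_fault/vma the two checks look at. */
struct model_fault {
	unsigned long vm_flags;  /* userfaultfd modes registered on the vma */
	bool orig_pte_uffd_wp;   /* faulting pte carries a uffd-wp marker */
};

/* Models the check in the patch: true only for a uffd-wp marker on the pte. */
bool model_orig_pte_uffd_wp(const struct model_fault *f)
{
	assert(f);
	return f->orig_pte_uffd_wp;
}

/* Models the suggested check: true for any userfaultfd registration. */
bool model_userfaultfd_armed(const struct model_fault *f)
{
	assert(f);
	return (f->vm_flags & MODEL_UFFD_ANY) != 0;
}
```

In this model a vma registered only for missing faults passes the first check unnoticed but is caught by the second, which is the gap described above.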
    
    _______________________________________________
    linux-arm-kernel mailing list
    linux-arm-kernel@lists.infradead.org
    http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
    
  • * Re: [PATCH v4 2/5] mm: LARGE_ANON_FOLIO for improved performance
           [not found] ` <20230726095146.2826796-3-ryan.roberts@arm.com>
           [not found]   ` <CAOUHufackQzy+yXOzaej+G6DNYK-k9GAUHAK6Vq79BFHr7KwAQ@mail.gmail.com>
      2023-08-01  6:18   ` Yu Zhao
    @ 2023-08-03 12:43   ` Ryan Roberts
      2023-08-03 14:21     ` Kirill A. Shutemov
      2023-08-03 23:50     ` Yu Zhao
      2023-08-07  5:24   ` Yu Zhao
      3 siblings, 2 replies; 46+ messages in thread
    From: Ryan Roberts @ 2023-08-03 12:43 UTC (permalink / raw)
      To: Andrew Morton, Matthew Wilcox, Yin Fengwei, David Hildenbrand,
    	Yu Zhao, Catalin Marinas, Will Deacon, Anshuman Khandual,
    	Yang Shi, Huang, Ying, Zi Yan, Luis Chamberlain, Itaru Kitayama,
    	Kirill A. Shutemov
      Cc: linux-mm, linux-kernel, linux-arm-kernel
    
    + Kirill
    
    On 26/07/2023 10:51, Ryan Roberts wrote:
    > Introduce LARGE_ANON_FOLIO feature, which allows anonymous memory to be
    > allocated in large folios of a determined order. All pages of the large
    > folio are pte-mapped during the same page fault, significantly reducing
    > the number of page faults. The number of per-page operations (e.g. ref
    > counting, rmap management, lru list management) is also significantly
    > reduced since those ops now become per-folio.
    > 
    > The new behaviour is hidden behind the new LARGE_ANON_FOLIO Kconfig,
    > which defaults to disabled for now; the long-term aim is for this to
    > default to enabled, but there are some risks around internal
    > fragmentation that need to be better understood first.
    > 
    > When enabled, the folio order is determined as such: For a vma, process
    > or system that has explicitly disabled THP, we continue to allocate
    > order-0. THP is most likely disabled to avoid any possible internal
    > fragmentation so we honour that request.
    > 
    > Otherwise, the return value of arch_wants_pte_order() is used. For vmas
    > that have not explicitly opted-in to use transparent hugepages (e.g.
    > where thp=madvise and the vma does not have MADV_HUGEPAGE), then
    > arch_wants_pte_order() is limited to 64K (or PAGE_SIZE, whichever is
    > bigger). This allows for a performance boost without requiring any
    > explicit opt-in from the workload while limiting internal
    > fragmentation.
    > 
    > If the preferred order can't be used (e.g. because the folio would
    > breach the bounds of the vma, or because ptes in the region are already
    > mapped) then we fall back to a suitable lower order; first
    > PAGE_ALLOC_COSTLY_ORDER, then order-0.
    > 
    
    ...
    
    > +#define ANON_FOLIO_MAX_ORDER_UNHINTED \
    > +		(ilog2(max_t(unsigned long, SZ_64K, PAGE_SIZE)) - PAGE_SHIFT)
    > +
    > +static int anon_folio_order(struct vm_area_struct *vma)
    > +{
    > +	int order;
    > +
    > +	/*
    > +	 * If THP is explicitly disabled for either the vma, the process or the
    > +	 * system, then this is very likely intended to limit internal
    > +	 * fragmentation; in this case, don't attempt to allocate a large
    > +	 * anonymous folio.
    > +	 *
    > +	 * Else, if the vma is eligible for thp, allocate a large folio of the
    > +	 * size preferred by the arch. Or if the arch requested a very small
    > +	 * size or didn't request a size, then use PAGE_ALLOC_COSTLY_ORDER,
    > +	 * which still meets the arch's requirements but means we still take
    > +	 * advantage of SW optimizations (e.g. fewer page faults).
    > +	 *
    > +	 * Finally if thp is enabled but the vma isn't eligible, take the
    > +	 * arch-preferred size and limit it to ANON_FOLIO_MAX_ORDER_UNHINTED.
    > +	 * This ensures workloads that have not explicitly opted-in take benefit
    > +	 * while capping the potential for internal fragmentation.
    > +	 */
    > +
    > +	if ((vma->vm_flags & VM_NOHUGEPAGE) ||
    > +	    test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags) ||
    > +	    !hugepage_flags_enabled())
    > +		order = 0;
    > +	else {
    > +		order = max(arch_wants_pte_order(), PAGE_ALLOC_COSTLY_ORDER);
    > +
    > +		if (!hugepage_vma_check(vma, vma->vm_flags, false, true, true))
    > +			order = min(order, ANON_FOLIO_MAX_ORDER_UNHINTED);
    > +	}
    > +
    > +	return order;
    > +}
    
    
    Hi All,
    
    I'm writing up the conclusions that we arrived at during discussion in the THP
    meeting yesterday, regarding linkage with existing THP ABIs. It would be great if
    I can get an explicit "agree" or "disagree" + rationale from at least David, Yu and
    Kirill.
    
    In summary, I think we are converging on the approach that is already coded, but
    I'd like confirmation.
    
    
    
    The THP situation today
    -----------------------
    
     - At system level: THP can be set to "never", "madvise" or "always"
     - At process level: THP can be "never" or "defer to system setting"
     - At VMA level: no-hint, MADV_HUGEPAGE, MADV_NOHUGEPAGE
    
    That gives us this table to describe how a page fault is handled, according to
    process state (columns) and vma flags (rows):
    
                    | never     | madvise   | always
    ----------------|-----------|-----------|-----------
    no hint         | S         | S         | THP>S
    MADV_HUGEPAGE   | S         | THP>S     | THP>S
    MADV_NOHUGEPAGE | S         | S         | S
    
    Legend:
    S	allocate single page (PTE-mapped)
    LAF	allocate large anon folio (PTE-mapped)
    THP	allocate THP-sized folio (PMD-mapped)
    >	fallback (usually because vma size/alignment insufficient for folio)
    
    
    
    Principles for Large Anon Folios (LAF)
    --------------------------------------
    
    David tells us there are use cases today (e.g. qemu live migration) which use
    MADV_NOHUGEPAGE to mean "don't fill any PTEs that are not explicitly faulted"
    and these use cases will break (i.e. functionally incorrect) if this request is
    not honoured.
    
    So LAF must at least honour MADV_NOHUGEPAGE to prevent breaking existing use
    cases. And once we do this, then I think the least confusing thing is for it to
    also honour the "never" system/process state; so if any of the system, process or
    vma has explicitly opted out of THP, then LAF should also be bypassed.
    
    Similarly, any case that would previously cause the allocation of PMD-sized THP
    must continue to be honoured, else we risk performance regression.
    
    That leaves the "madvise/no-hint" case, and all THP fallback paths due to the
    VMA not being correctly aligned or sized to hold a PMD-sized mapping. In these
    cases, we will attempt to use LAF first, and fall back to a single page if the vma
    size/alignment doesn't permit it.
    
                    | never     | madvise   | always
    ----------------|-----------|-----------|-----------
    no hint         | S         | LAF>S     | THP>LAF>S
    MADV_HUGEPAGE   | S         | THP>LAF>S | THP>LAF>S
    MADV_NOHUGEPAGE | S         | S         | S
    
    I think this (perhaps conservative) approach will be the least surprising to
    users, and it is the policy that is already implemented in this patch.
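
The table above can be written as a small decision function. This is a userspace sketch only: the enums and laf_policy() are hypothetical names for illustration and do not exist in the kernel, and the process-level "never" setting is assumed to be folded into the column value, as in the table.

```c
#include <assert.h>

/* Column of the table: effective system/process THP setting. */
enum model_thp_state { THP_NEVER, THP_MADVISE, THP_ALWAYS };

/* Row of the table: per-vma hint. */
enum model_vma_hint { HINT_NONE, HINT_HUGEPAGE, HINT_NOHUGEPAGE };

/* Cell of the table: which fallback chain the fault handler follows. */
enum model_policy { POLICY_S, POLICY_LAF_S, POLICY_THP_LAF_S };

enum model_policy laf_policy(enum model_thp_state state, enum model_vma_hint hint)
{
	assert(state >= THP_NEVER && state <= THP_ALWAYS);
	assert(hint >= HINT_NONE && hint <= HINT_NOHUGEPAGE);

	/* Any explicit opt-out (system, process or vma) bypasses both THP and LAF. */
	if (state == THP_NEVER || hint == HINT_NOHUGEPAGE)
		return POLICY_S;

	/* Cases that get PMD-sized THP today keep it, with LAF as the new fallback. */
	if (state == THP_ALWAYS || hint == HINT_HUGEPAGE)
		return POLICY_THP_LAF_S;

	/* Remaining cell: madvise with no hint. */
	return POLICY_LAF_S;
}
```

The opt-out test comes first so that MADV_NOHUGEPAGE wins even when the system setting is "always", matching the bottom row of the table.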
    
    
    
    Downsides of this policy
    ------------------------
    
    As Yu and Yin have pointed out, there are some workloads which do not perform
    well with THP, due to large fault latency or memory wastage, etc., but which
    _may_ still benefit from LAF. By taking the conservative approach, we exclude
    these workloads from benefiting automatically.
    
    But given they have explicitly opted out of THP, it doesn't seem unreasonable
    that those workloads should be explicitly modified to opt in to LAF. The
    question is what should a control for this look like? And do we need to
    implement the control for an MVP implementation of LAF? For the latter question,
    I would suggest this can come later - it's a tool to further optimize, but its
    absence does not regress today's performance.
    
    What should a control look like?
    
    One suggestion was to expose a "maximum order" tunable, which would limit the
    size of THP that could be allocated. Setting it to 1M would cause traditional
    THP to be bypassed (assuming for now PMD-sized THP is 2M) but would permit LAF.
    But Kirill suggested that this type of control might turn out to be restrictive
    in the long run.
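
To make the arithmetic of the "maximum order" idea concrete, here is a userspace sketch. No such tunable exists in this patch; model_order_permitted() and its max_bytes parameter are hypothetical, and the example assumes 4K base pages (page_shift of 12) with a 2M PMD.

```c
#include <assert.h>
#include <stdbool.h>

/* True if a folio of the given order fits under the hypothetical size cap. */
bool model_order_permitted(int order, int page_shift, unsigned long long max_bytes)
{
	assert(order >= 0 && page_shift >= 0 && order + page_shift < 63);

	/* Folio size in bytes is PAGE_SIZE << order, i.e. 1 << (page_shift + order). */
	return (1ULL << (order + page_shift)) <= max_bytes;
}
```

With a 1M cap and 4K pages, order-9 (2M, the PMD size) is rejected while order-8 (1M) and order-4 (64K) are permitted, so PMD-sized THP would be bypassed but LAF would still be allowed.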
    
    Another suggestion was to provide a more abstracted hint to the kernel, which
    the kernel could then derive a policy from, and that policy would be easier to
    change over time.
    
    
    
    Large Anon Folio Size
    ---------------------
    
    Once we have decided to use LAF (vs THP vs S), we need to decide how big the
    folio should be. If/when we get a control as described above, that will
    obviously place an upper bound on the size. HW may also have a preferred size
    due to tricks it can do in the TLB (arch_wants_pte_order() in this patch) but
    you may still want to allocate a bigger folio than the HW wants (since bigger
    folios will reduce page faults) or you may want to allocate a smaller folio than
    the HW wants (due to concerns about latency or memory wastage).
    
    I've had a stab at addressing this in the patch too, using the same decision as
    for THP (ignoring the vma size/alignment requirement) to decide if we use the HW
    preferred order or if we cap it (currently set at 64K).
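
The order selection in anon_folio_order() can be followed with a userspace model. This is a sketch of the quoted logic, not the kernel function: the model_* names and the boolean parameters are hypothetical, page_shift stands in for PAGE_SHIFT, and PAGE_ALLOC_COSTLY_ORDER is taken to be 3 as in mainline.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_PAGE_ALLOC_COSTLY_ORDER 3  /* assumed to match the kernel's value */
#define MODEL_SHIFT_64K 16               /* ilog2(SZ_64K) */

/* Models ANON_FOLIO_MAX_ORDER_UNHINTED: ilog2(max(64K, PAGE_SIZE)) - PAGE_SHIFT. */
int model_max_order_unhinted(int page_shift)
{
	int size_shift;

	assert(page_shift >= 12 && page_shift <= 30);
	size_shift = page_shift > MODEL_SHIFT_64K ? page_shift : MODEL_SHIFT_64K;
	return size_shift - page_shift;
}

/*
 * Models anon_folio_order(). arch_order is what arch_wants_pte_order() would
 * return (-1 for no preference). thp_disabled covers the vma, process and
 * system opt-outs; thp_eligible stands in for hugepage_vma_check().
 */
int model_anon_folio_order(int arch_order, int page_shift,
			   bool thp_disabled, bool thp_eligible)
{
	int order, cap;

	if (thp_disabled)
		return 0;

	/* Never go below PAGE_ALLOC_COSTLY_ORDER, even with no arch preference. */
	order = arch_order > MODEL_PAGE_ALLOC_COSTLY_ORDER ?
		arch_order : MODEL_PAGE_ALLOC_COSTLY_ORDER;

	/* Without an explicit opt-in, cap the folio at 64K (or one page). */
	if (!thp_eligible) {
		cap = model_max_order_unhinted(page_shift);
		if (order > cap)
			order = cap;
	}

	return order;
}
```

With 4K pages the cap is order-4 (64K), so an arch preference of order-6 is honoured for eligible vmas and reduced to order-4 otherwise. With 64K pages the cap is order-0, so unhinted vmas get single pages in this model.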
    
    Thoughts, comments?
    
    Thanks,
    Ryan
    
    
    
    
    
    
  • * Re: [PATCH v4 2/5] mm: LARGE_ANON_FOLIO for improved performance
           [not found] ` <20230726095146.2826796-3-ryan.roberts@arm.com>
                         ` (2 preceding siblings ...)
      2023-08-03 12:43   ` Ryan Roberts
    @ 2023-08-07  5:24   ` Yu Zhao
      2023-08-07 19:07     ` Ryan Roberts
      3 siblings, 1 reply; 46+ messages in thread
    From: Yu Zhao @ 2023-08-07  5:24 UTC (permalink / raw)
      To: Ryan Roberts
      Cc: Andrew Morton, Matthew Wilcox, Yin Fengwei, David Hildenbrand,
    	Catalin Marinas, Will Deacon, Anshuman Khandual, Yang Shi,
    	Huang, Ying, Zi Yan, Luis Chamberlain, Itaru Kitayama, linux-mm,
    	linux-kernel, linux-arm-kernel
    
    On Wed, Jul 26, 2023 at 3:52 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
    >
    > Introduce LARGE_ANON_FOLIO feature, which allows anonymous memory to be
    > allocated in large folios of a determined order. All pages of the large
    > folio are pte-mapped during the same page fault, significantly reducing
    > the number of page faults. The number of per-page operations (e.g. ref
    > counting, rmap management, lru list management) is also significantly
    > reduced since those ops now become per-folio.
    >
    > The new behaviour is hidden behind the new LARGE_ANON_FOLIO Kconfig,
    > which defaults to disabled for now; the long-term aim is for this to
    > default to enabled, but there are some risks around internal
    > fragmentation that need to be better understood first.
    >
    > When enabled, the folio order is determined as such: For a vma, process
    > or system that has explicitly disabled THP, we continue to allocate
    > order-0. THP is most likely disabled to avoid any possible internal
    > fragmentation so we honour that request.
    >
    > Otherwise, the return value of arch_wants_pte_order() is used. For vmas
    > that have not explicitly opted-in to use transparent hugepages (e.g.
    > where thp=madvise and the vma does not have MADV_HUGEPAGE), then
    > arch_wants_pte_order() is limited to 64K (or PAGE_SIZE, whichever is
    > bigger). This allows for a performance boost without requiring any
    > explicit opt-in from the workload while limiting internal
    > fragmentation.
    >
    > If the preferred order can't be used (e.g. because the folio would
    > breach the bounds of the vma, or because ptes in the region are already
    > mapped) then we fall back to a suitable lower order; first
    > PAGE_ALLOC_COSTLY_ORDER, then order-0.
    >
    > arch_wants_pte_order() can be overridden by the architecture if desired.
    > Some architectures (e.g. arm64) can coalesce TLB entries if a contiguous
    > set of ptes map physically contiguous, naturally aligned memory, so this
    > mechanism allows the architecture to optimize as required.
    >
    > Here we add the default implementation of arch_wants_pte_order(), used
    > when the architecture does not define it, which returns -1, implying
    > that the HW has no preference. In this case, mm will choose its own
    > default order.
    >
    > Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
    > ---
    >  include/linux/pgtable.h |  13 ++++
    >  mm/Kconfig              |  10 +++
    >  mm/memory.c             | 166 ++++++++++++++++++++++++++++++++++++----
    >  3 files changed, 172 insertions(+), 17 deletions(-)
    >
    > diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
    > index 5063b482e34f..2a1d83775837 100644
    > --- a/include/linux/pgtable.h
    > +++ b/include/linux/pgtable.h
    > @@ -313,6 +313,19 @@ static inline bool arch_has_hw_pte_young(void)
    >  }
    >  #endif
    >
    > +#ifndef arch_wants_pte_order
    > +/*
    > + * Returns preferred folio order for pte-mapped memory. Must be in range [0,
    > + * PMD_SHIFT-PAGE_SHIFT) and must not be order-1 since THP requires large folios
    > + * to be at least order-2. Negative value implies that the HW has no preference
    > + * and mm will choose its own default order.
    > + */
    > +static inline int arch_wants_pte_order(void)
    > +{
    > +       return -1;
    > +}
    > +#endif
    > +
    >  #ifndef __HAVE_ARCH_PTEP_GET_AND_CLEAR
    >  static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
    >                                        unsigned long address,
    > diff --git a/mm/Kconfig b/mm/Kconfig
    > index 09130434e30d..fa61ea160447 100644
    > --- a/mm/Kconfig
    > +++ b/mm/Kconfig
    > @@ -1238,4 +1238,14 @@ config LOCK_MM_AND_FIND_VMA
    >
    >  source "mm/damon/Kconfig"
    >
    > +config LARGE_ANON_FOLIO
    > +       bool "Allocate large folios for anonymous memory"
    > +       depends on TRANSPARENT_HUGEPAGE
    > +       default n
    > +       help
    > +         Use large (bigger than order-0) folios to back anonymous memory where
    > +         possible, even for pte-mapped memory. This reduces the number of page
    > +         faults, as well as other per-page overheads to improve performance for
    > +         many workloads.
    > +
    >  endmenu
    > diff --git a/mm/memory.c b/mm/memory.c
    > index 01f39e8144ef..64c3f242c49a 100644
    > --- a/mm/memory.c
    > +++ b/mm/memory.c
    > @@ -4050,6 +4050,127 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
    >         return ret;
    >  }
    >
    > +static bool vmf_pte_range_changed(struct vm_fault *vmf, int nr_pages)
    > +{
    > +       int i;
    > +
    > +       if (nr_pages == 1)
    > +               return vmf_pte_changed(vmf);
    > +
    > +       for (i = 0; i < nr_pages; i++) {
    > +               if (!pte_none(ptep_get_lockless(vmf->pte + i)))
    > +                       return true;
    > +       }
    > +
    > +       return false;
    > +}
    > +
    > +#ifdef CONFIG_LARGE_ANON_FOLIO
    > +#define ANON_FOLIO_MAX_ORDER_UNHINTED \
    > +               (ilog2(max_t(unsigned long, SZ_64K, PAGE_SIZE)) - PAGE_SHIFT)
    > +
    > +static int anon_folio_order(struct vm_area_struct *vma)
    > +{
    > +       int order;
    > +
    > +       /*
    > +        * If THP is explicitly disabled for either the vma, the process or the
    > +        * system, then this is very likely intended to limit internal
    > +        * fragmentation; in this case, don't attempt to allocate a large
    > +        * anonymous folio.
    > +        *
    > +        * Else, if the vma is eligible for thp, allocate a large folio of the
    > +        * size preferred by the arch. Or if the arch requested a very small
    > +        * size or didn't request a size, then use PAGE_ALLOC_COSTLY_ORDER,
    > +        * which still meets the arch's requirements but means we still take
    > +        * advantage of SW optimizations (e.g. fewer page faults).
    > +        *
    > +        * Finally if thp is enabled but the vma isn't eligible, take the
    > +        * arch-preferred size and limit it to ANON_FOLIO_MAX_ORDER_UNHINTED.
    > +        * This ensures workloads that have not explicitly opted-in take benefit
    > +        * while capping the potential for internal fragmentation.
    > +        */
    > +
    > +       if ((vma->vm_flags & VM_NOHUGEPAGE) ||
    > +           test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags) ||
    > +           !hugepage_flags_enabled())
    > +               order = 0;
    > +       else {
    > +               order = max(arch_wants_pte_order(), PAGE_ALLOC_COSTLY_ORDER);
    > +
    > +               if (!hugepage_vma_check(vma, vma->vm_flags, false, true, true))
    > +                       order = min(order, ANON_FOLIO_MAX_ORDER_UNHINTED);
    > +       }
    > +
    > +       return order;
    > +}
    > +
    > +static int alloc_anon_folio(struct vm_fault *vmf, struct folio **folio)
    > +{
    > +       int i;
    > +       gfp_t gfp;
    > +       pte_t *pte;
    > +       unsigned long addr;
    > +       struct vm_area_struct *vma = vmf->vma;
    > +       int prefer = anon_folio_order(vma);
    > +       int orders[] = {
    > +               prefer,
    > +               prefer > PAGE_ALLOC_COSTLY_ORDER ? PAGE_ALLOC_COSTLY_ORDER : 0,
    > +               0,
    > +       };
    > +
    > +       *folio = NULL;
    > +
    > +       if (vmf_orig_pte_uffd_wp(vmf))
    > +               goto fallback;
    
    Per the discussion, we need to check hugepage_vma_check() for
    correctness of VM live migration. I'd just check it here and fall back
    to order 0 if that helper returns false.
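
One way to picture the suggestion is as a change to how the fallback order list is built. The following userspace sketch is hypothetical: model_fallback_orders() and its vma_check_ok parameter (standing in for the result of hugepage_vma_check()) are illustration only, and PAGE_ALLOC_COSTLY_ORDER is assumed to be 3.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_PAGE_ALLOC_COSTLY_ORDER 3  /* assumed to match the kernel's value */

/*
 * Fills orders[] with the sequence of orders to try, highest first, in the
 * same shape as the orders[] array in alloc_anon_folio(). If the vma check
 * fails, every entry is order-0.
 */
void model_fallback_orders(int prefer, bool vma_check_ok, int orders[3])
{
	assert(prefer >= 0 && orders);

	if (!vma_check_ok)
		prefer = 0;

	orders[0] = prefer;
	orders[1] = prefer > MODEL_PAGE_ALLOC_COSTLY_ORDER ?
		    MODEL_PAGE_ALLOC_COSTLY_ORDER : 0;
	orders[2] = 0;
}
```

For a preferred order of 4 the list is {4, 3, 0} when the check passes and {0, 0, 0} when it fails, so a failed check never results in a large folio.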
    

  • end of thread, other threads:[~2023-08-09 16:09 UTC | newest]
    
    Thread overview: 46+ messages
         [not found] <20230726095146.2826796-1-ryan.roberts@arm.com>
         [not found] ` <20230726095146.2826796-3-ryan.roberts@arm.com>
         [not found]   ` <CAOUHufackQzy+yXOzaej+G6DNYK-k9GAUHAK6Vq79BFHr7KwAQ@mail.gmail.com>
    2023-07-27  4:31     ` [PATCH v4 2/5] mm: LARGE_ANON_FOLIO for improved performance Yu Zhao
    2023-07-28 10:13       ` Ryan Roberts
    2023-08-01  6:36         ` Yu Zhao
    2023-08-01 23:30           ` Yin Fengwei
    2023-08-02  8:02           ` Ryan Roberts
    2023-08-02  9:04             ` Ryan Roberts
    2023-08-02 13:51             ` Yin, Fengwei
    2023-08-03  8:05         ` Yin Fengwei
    2023-08-03  8:21           ` Ryan Roberts
    2023-08-03  8:37             ` Yin Fengwei
    2023-08-03  9:32               ` Ryan Roberts
    2023-08-03  9:58                 ` Yin Fengwei
    2023-08-03 10:27                   ` Ryan Roberts
    2023-08-03 10:54                     ` Yin Fengwei
    2023-08-04  0:28           ` Yu Zhao
    2023-08-01  6:18   ` Yu Zhao
    2023-08-02  9:33     ` Ryan Roberts
    2023-08-02 21:05       ` Yu Zhao
    2023-08-03 10:24         ` Ryan Roberts
    2023-08-03 12:43   ` Ryan Roberts
    2023-08-03 14:21     ` Kirill A. Shutemov
    2023-08-04  0:19       ` Yu Zhao
    2023-08-04  2:16         ` Zi Yan
    2023-08-04  3:35           ` Yu Zhao
    2023-08-04  9:06         ` Ryan Roberts
    2023-08-04 18:53           ` Yu Zhao
    2023-08-07 19:00             ` Ryan Roberts
    2023-08-03 23:50     ` Yu Zhao
    2023-08-04  8:27       ` Ryan Roberts
    2023-08-04 20:23         ` David Hildenbrand
    2023-08-04 21:00           ` Yu Zhao
    2023-08-04 21:13             ` David Hildenbrand
    2023-08-04 21:26               ` Yu Zhao
    2023-08-04 21:30                 ` David Hildenbrand
    2023-08-04 21:58                   ` Zi Yan
    2023-08-05  2:50                     ` Yin, Fengwei
    2023-08-07 17:45                       ` Ryan Roberts
    2023-08-07 18:10                         ` Zi Yan
    2023-08-08  9:58                           ` Ryan Roberts
    2023-08-07  5:24   ` Yu Zhao
    2023-08-07 19:07     ` Ryan Roberts
    2023-08-07 23:21       ` Yu Zhao
    2023-08-08  9:37         ` Ryan Roberts
    2023-08-08 17:57           ` Yu Zhao
    2023-08-08 18:12             ` Yu Zhao
    2023-08-09 16:08               ` Ryan Roberts
    
