linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
To: Mina Almasry <almasrymina@google.com>
Cc: James Houghton <jthoughton@google.com>,
	Mike Kravetz <mike.kravetz@oracle.com>,
	Muchun Song <songmuchun@bytedance.com>,
	Peter Xu <peterx@redhat.com>,
	David Hildenbrand <david@redhat.com>,
	David Rientjes <rientjes@google.com>,
	Axel Rasmussen <axelrasmussen@google.com>,
	Jue Wang <juew@google.com>,
	Manish Mishra <manish.mishra@nutanix.com>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [RFC PATCH 00/26] hugetlb: Introduce HugeTLB high-granularity mapping
Date: Tue, 28 Jun 2022 18:56:25 +0100	[thread overview]
Message-ID: <YrtAyUSbtCLwCFxC@work-vm> (raw)
In-Reply-To: <CAHS8izNSsEW88Q=ozcC2rbnmvcX3zOL-qkFTPgn=M6S1R5t=Yw@mail.gmail.com>

* Mina Almasry (almasrymina@google.com) wrote:
> On Mon, Jun 27, 2022 at 9:27 AM James Houghton <jthoughton@google.com> wrote:
> >
> > On Fri, Jun 24, 2022 at 11:41 AM Mina Almasry <almasrymina@google.com> wrote:
> > >
> > > On Fri, Jun 24, 2022 at 10:37 AM James Houghton <jthoughton@google.com> wrote:
> > > >
> > > > [trimmed...]
> > > > ---- Userspace API ----
> > > >
> > > > This patch series introduces a single way to take advantage of
> > > > high-granularity mapping: via UFFDIO_CONTINUE. UFFDIO_CONTINUE allows
> > > > userspace to resolve MINOR page faults on shared VMAs.
> > > >
> > > > To collapse a HugeTLB address range that has been mapped with several
> > > > UFFDIO_CONTINUE operations, userspace can issue MADV_COLLAPSE. We expect
> > > > userspace to know when all pages (that they care about) have been fetched.
> > > >
> > >
> > > Thanks James! Cover letter looks good. A few questions:
> > >
> > > Why not have the kernel collapse the hugepage once all the 4K pages
> > > have been fetched automatically? It would remove the need for a new
> > > userspace API, and AFACT there aren't really any cases where it is
> > > beneficial to have a hugepage sharded into 4K mappings when those
> > > mappings can be collapsed.
> >
> > The reason that we don't automatically collapse mappings is because it
> > would take additional complexity, and it is less flexible. Consider
> > the case of 1G pages on x86: currently, userspace can collapse the
> > whole page when it's all ready, but they can also choose to collapse a
> > 2M piece of it. On architectures with more supported hugepage sizes
> > (e.g., arm64), userspace has even more possibilities for when to
> > collapse. This likely further complicates a potential
> > automatic-collapse solution. Userspace may also want to collapse the
> > mapping for an entire hugepage without completely mapping the hugepage
> > first (this would also be possible by issuing UFFDIO_CONTINUE on all
> > the holes, though).
> >
> 
> To be honest I'm don't think I'm a fan of this. I don't think this
> saves complexity, but rather pushes it to the userspace. I.e. the
> userspace now must track which regions are faulted in and which are
> not to call MADV_COLLAPSE at the right time. Also, if the userspace
> gets it wrong it may accidentally not call MADV_COLLAPSE (and not get
> any hugepages) or call MADV_COLLAPSE too early and have to deal with a
> storm of maybe hundreds of minor faults at once which may take too
> long to resolve and may impact guest stability, yes?

I think it depends on whether the userspace is already holding bitmaps
and data structures to let it know when the right time to call collapse
is; if it already has to do all that book keeping for it's own postcopy
or whatever process, then getting userspace to call it is easy.
(I don't know the answer to whether it does have!)

Dave

> For these reasons I think automatic collapsing is something that will
> eventually be implemented by us or someone else, and at that point
> MADV_COLLAPSE for hugetlb memory will become obsolete; i.e. this patch
> is adding a userspace API that will probably need to be maintained for
> perpetuity but actually is likely going to be going obsolete "soon".
> For this reason I had hoped that automatic collapsing would come with
> V1.
> 
> I wonder if we can have a very simple first try at automatic
> collapsing for V1? I.e., can we support collapsing to the hstate size
> and only that? So 4K pages can only be either collapsed to 2MB or 1G
> on x86 depending on the hstate size. I think this may be not too
> difficult to implement: we can have a counter similar to mapcount that
> tracks how many of the subpages are mapped (subpage_mapcount). Once
> all the subpages are mapped (the counter reaches a certain value),
> trigger collapsing similar to hstate size MADV_COLLAPSE.
> 
> I gather that no one else reviewing this has raised this issue thus
> far so it might not be a big deal and I will continue to review the
> RFC, but I had hoped for automatic collapsing myself for the reasons
> above.
> 
> > >
> > > > ---- HugeTLB Changes ----
> > > >
> > > > - Mapcount
> > > > The way mapcount is handled is different from the way that it was handled
> > > > before. If the PUD for a hugepage is not none, a hugepage's mapcount will
> > > > be increased. This scheme means that, for hugepages that aren't mapped at
> > > > high granularity, their mapcounts will remain the same as what they would
> > > > have been pre-HGM.
> > > >
> > >
> > > Sorry, I didn't quite follow this. It says mapcount is handled
> > > differently, but the same if the page is not mapped at high
> > > granularity. Can you elaborate on how the mapcount handling will be
> > > different when the page is mapped at high granularity?
> >
> > I guess I didn't phrase this very well. For the sake of simplicity,
> > consider 1G pages on x86, typically mapped with leaf-level PUDs.
> > Previously, there were two possibilities for how a hugepage was
> > mapped, either it was (1) completely mapped (PUD is present and a
> > leaf), or (2) it wasn't mapped (PUD is none). Now we have a third
> > case, where the PUD is not none but also not a leaf (this usually
> > means that the page is partially mapped). We handle this case as if
> > the whole page was mapped. That is, if we partially map a hugepage
> > that was previously unmapped (making the PUD point to PMDs), we
> > increment its mapcount, and if we completely unmap a partially mapped
> > hugepage (making the PUD none), we decrement its mapcount. If we
> > collapse a non-leaf PUD to a leaf PUD, we don't change mapcount.
> >
> > It is possible for a PUD to be present and not a leaf (mapcount has
> > been incremented) but for the page to still be unmapped: if the PMDs
> > (or PTEs) underneath are all none. This case is atypical, and as of
> > this RFC (without bestowing MADV_DONTNEED with HGM flexibility), I
> > think it would be very difficult to get this to happen.
> >
> 
> Thank you for the detailed explanation. Please add it to the cover letter.
> 
> I wonder the case "PUD present but all the PMD are none": is that a
> bug? I don't understand the usefulness of that. Not a comment on this
> patch but rather a curiosity.
> 
> > >
> > > > - Page table walking and manipulation
> > > > A new function, hugetlb_walk_to, handles walking HugeTLB page tables for
> > > > high-granularity mappings. Eventually, it's possible to merge
> > > > hugetlb_walk_to with huge_pte_offset and huge_pte_alloc.
> > > >
> > > > We keep track of HugeTLB page table entries with a new struct, hugetlb_pte.
> > > > This is because we generally need to know the "size" of a PTE (previously
> > > > always just huge_page_size(hstate)).
> > > >
> > > > For every page table manipulation function that has a huge version (e.g.
> > > > huge_ptep_get and ptep_get), there is a wrapper for it (e.g.
> > > > hugetlb_ptep_get).  The correct version is used depending on if a HugeTLB
> > > > PTE really is "huge".
> > > >
> > > > - Synchronization
> > > > For existing bits of HugeTLB, synchronization is unchanged. For splitting
> > > > and collapsing HugeTLB PTEs, we require that the i_mmap_rw_sem is held for
> > > > writing, and for doing high-granularity page table walks, we require it to
> > > > be held for reading.
> > > >
> > > > ---- Limitations & Future Changes ----
> > > >
> > > > This patch series only implements high-granularity mapping for VM_SHARED
> > > > VMAs.  I intend to implement enough HGM to support 4K unmapping for memory
> > > > failure recovery for both shared and private mappings.
> > > >
> > > > The memory failure use case poses its own challenges that can be
> > > > addressed, but I will do so in a separate RFC.
> > > >
> > > > Performance has not been heavily scrutinized with this patch series. There
> > > > are places where lock contention can significantly reduce performance. This
> > > > will be addressed later.
> > > >
> > > > The patch series, as it stands right now, is compatible with the VMEMMAP
> > > > page struct optimization[3], as we do not need to modify data contained
> > > > in the subpage page structs.
> > > >
> > > > Other omissions:
> > > >  - Compatibility with userfaultfd write-protect (will be included in v1).
> > > >  - Support for mremap() (will be included in v1). This looks a lot like
> > > >    the support we have for fork().
> > > >  - Documentation changes (will be included in v1).
> > > >  - Completely ignores PMD sharing and hugepage migration (will be included
> > > >    in v1).
> > > >  - Implementations for architectures that don't use GENERAL_HUGETLB other
> > > >    than arm64.
> > > >
> > > > ---- Patch Breakdown ----
> > > >
> > > > Patch 1     - Preliminary changes
> > > > Patch 2-10  - HugeTLB HGM core changes
> > > > Patch 11-13 - HugeTLB HGM page table walking functionality
> > > > Patch 14-19 - HugeTLB HGM compatibility with other bits
> > > > Patch 20-23 - Userfaultfd and collapse changes
> > > > Patch 24-26 - arm64 support and selftests
> > > >
> > > > [1] This used to be called HugeTLB double mapping, a bad and confusing
> > > >     name. "High-granularity mapping" is not a great name either. I am open
> > > >     to better names.
> > >
> > > I would drop 1 extra word and do "granular mapping", as in the mapping
> > > is more granular than what it normally is (2MB/1G, etc).
> >
> > Noted. :)
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



  reply	other threads:[~2022-06-28 17:56 UTC|newest]

Thread overview: 125+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-06-24 17:36 [RFC PATCH 00/26] hugetlb: Introduce HugeTLB high-granularity mapping James Houghton
2022-06-24 17:36 ` [RFC PATCH 01/26] hugetlb: make hstate accessor functions const James Houghton
2022-06-24 18:43   ` Mina Almasry
     [not found]   ` <e55f90f5-ba14-5d6e-8f8f-abf731b9095e@nutanix.com>
2022-06-27 12:09     ` manish.mishra
2022-06-28 17:08       ` James Houghton
2022-06-29  6:18   ` Muchun Song
2022-06-24 17:36 ` [RFC PATCH 02/26] hugetlb: sort hstates in hugetlb_init_hstates James Houghton
2022-06-24 18:51   ` Mina Almasry
2022-06-27 12:08   ` manish.mishra
2022-06-28 15:35     ` James Houghton
2022-06-27 18:42   ` Mike Kravetz
2022-06-28 15:40     ` James Houghton
2022-06-29  6:39       ` Muchun Song
2022-06-29 21:06         ` Mike Kravetz
2022-06-29 21:13           ` James Houghton
2022-06-24 17:36 ` [RFC PATCH 03/26] hugetlb: add make_huge_pte_with_shift James Houghton
2022-06-24 19:01   ` Mina Almasry
2022-06-27 12:13   ` manish.mishra
2022-06-24 17:36 ` [RFC PATCH 04/26] hugetlb: make huge_pte_lockptr take an explicit shift argument James Houghton
2022-06-27 12:26   ` manish.mishra
2022-06-27 20:51   ` Mike Kravetz
2022-06-28 15:29     ` James Houghton
2022-06-29  6:09     ` Muchun Song
2022-06-29 21:03       ` Mike Kravetz
2022-06-29 21:39         ` James Houghton
2022-06-29 22:24           ` Mike Kravetz
2022-06-30  9:35             ` Muchun Song
2022-06-30 16:23               ` James Houghton
2022-06-30 17:40                 ` Mike Kravetz
2022-07-01  3:32                 ` Muchun Song
2022-06-24 17:36 ` [RFC PATCH 05/26] hugetlb: add CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING James Houghton
2022-06-27 12:28   ` manish.mishra
2022-06-28 20:03     ` Mina Almasry
2022-06-24 17:36 ` [RFC PATCH 06/26] mm: make free_p?d_range functions public James Houghton
2022-06-27 12:31   ` manish.mishra
2022-06-28 20:35   ` Mike Kravetz
2022-07-12 20:52     ` James Houghton
2022-06-24 17:36 ` [RFC PATCH 07/26] hugetlb: add hugetlb_pte to track HugeTLB page table entries James Houghton
2022-06-27 12:47   ` manish.mishra
2022-06-29 16:28     ` James Houghton
2022-06-28 20:25   ` Mina Almasry
2022-06-29 16:42     ` James Houghton
2022-06-28 20:44   ` Mike Kravetz
2022-06-29 16:24     ` James Houghton
2022-07-11 23:32   ` Mike Kravetz
2022-07-12  9:42     ` Dr. David Alan Gilbert
2022-07-12 17:51       ` Mike Kravetz
2022-07-15 16:35       ` Peter Xu
2022-07-15 21:52         ` Axel Rasmussen
2022-07-15 23:03           ` Peter Xu
2022-09-08 17:38   ` Peter Xu
2022-09-08 17:54     ` James Houghton
2022-06-24 17:36 ` [RFC PATCH 08/26] hugetlb: add hugetlb_free_range to free PT structures James Houghton
2022-06-27 12:52   ` manish.mishra
2022-06-28 20:27   ` Mina Almasry
2022-06-24 17:36 ` [RFC PATCH 09/26] hugetlb: add hugetlb_hgm_enabled James Houghton
2022-06-27 12:55   ` manish.mishra
2022-06-28 20:33   ` Mina Almasry
2022-09-08 18:07   ` Peter Xu
2022-09-08 18:13     ` James Houghton
2022-06-24 17:36 ` [RFC PATCH 10/26] hugetlb: add for_each_hgm_shift James Houghton
2022-06-27 13:01   ` manish.mishra
2022-06-28 21:58   ` Mina Almasry
2022-07-07 21:39     ` Mike Kravetz
2022-07-08 15:52     ` James Houghton
2022-07-09 21:55       ` Mina Almasry
2022-06-24 17:36 ` [RFC PATCH 11/26] hugetlb: add hugetlb_walk_to to do PT walks James Houghton
2022-06-27 13:07   ` manish.mishra
2022-07-07 23:03     ` Mike Kravetz
2022-09-08 18:20   ` Peter Xu
2022-06-24 17:36 ` [RFC PATCH 12/26] hugetlb: add HugeTLB splitting functionality James Houghton
2022-06-27 13:50   ` manish.mishra
2022-06-29 16:10     ` James Houghton
2022-06-29 14:33   ` manish.mishra
2022-06-29 16:20     ` James Houghton
2022-06-24 17:36 ` [RFC PATCH 13/26] hugetlb: add huge_pte_alloc_high_granularity James Houghton
2022-06-29 14:11   ` manish.mishra
2022-06-24 17:36 ` [RFC PATCH 14/26] hugetlb: add HGM support for hugetlb_fault and hugetlb_no_page James Houghton
2022-06-29 14:40   ` manish.mishra
2022-06-29 15:56     ` James Houghton
2022-06-24 17:36 ` [RFC PATCH 15/26] hugetlb: make unmapping compatible with high-granularity mappings James Houghton
2022-07-19 10:19   ` manish.mishra
2022-07-19 15:58     ` James Houghton
2022-06-24 17:36 ` [RFC PATCH 16/26] hugetlb: make hugetlb_change_protection compatible with HGM James Houghton
2022-06-24 17:36 ` [RFC PATCH 17/26] hugetlb: update follow_hugetlb_page to support HGM James Houghton
2022-07-19 10:48   ` manish.mishra
2022-07-19 16:19     ` James Houghton
2022-06-24 17:36 ` [RFC PATCH 18/26] hugetlb: use struct hugetlb_pte for walk_hugetlb_range James Houghton
2022-06-24 17:36 ` [RFC PATCH 19/26] hugetlb: add HGM support for copy_hugetlb_page_range James Houghton
2022-07-11 23:41   ` Mike Kravetz
2022-07-12 17:19     ` James Houghton
2022-07-12 18:06       ` Mike Kravetz
2022-07-15 21:39         ` Axel Rasmussen
2022-06-24 17:36 ` [RFC PATCH 20/26] hugetlb: add support for high-granularity UFFDIO_CONTINUE James Houghton
2022-07-15 16:21   ` Peter Xu
2022-07-15 16:58     ` James Houghton
2022-07-15 17:20       ` Peter Xu
2022-07-20 20:58         ` James Houghton
2022-07-21 19:09           ` Peter Xu
2022-07-21 19:44             ` James Houghton
2022-07-21 19:53               ` Peter Xu
2022-06-24 17:36 ` [RFC PATCH 21/26] hugetlb: add hugetlb_collapse James Houghton
2022-06-24 17:36 ` [RFC PATCH 22/26] madvise: add uapi for HugeTLB HGM collapse: MADV_COLLAPSE James Houghton
2022-06-24 17:36 ` [RFC PATCH 23/26] userfaultfd: add UFFD_FEATURE_MINOR_HUGETLBFS_HGM James Houghton
2022-06-24 17:36 ` [RFC PATCH 24/26] arm64/hugetlb: add support for high-granularity mappings James Houghton
2022-06-24 17:36 ` [RFC PATCH 25/26] selftests: add HugeTLB HGM to userfaultfd selftest James Houghton
2022-06-24 17:36 ` [RFC PATCH 26/26] selftests: add HugeTLB HGM to KVM demand paging selftest James Houghton
2022-06-24 18:29 ` [RFC PATCH 00/26] hugetlb: Introduce HugeTLB high-granularity mapping Matthew Wilcox
2022-06-27 16:36   ` James Houghton
2022-06-27 17:56     ` Dr. David Alan Gilbert
2022-06-27 20:31       ` James Houghton
2022-06-28  0:04         ` Nadav Amit
2022-06-30 19:21           ` Peter Xu
2022-07-01  5:54             ` Nadav Amit
2022-06-28  8:20         ` Dr. David Alan Gilbert
2022-06-30 16:09           ` Peter Xu
2022-06-24 18:41 ` Mina Almasry
2022-06-27 16:27   ` James Houghton
2022-06-28 14:17     ` Muchun Song
2022-06-28 17:26     ` Mina Almasry
2022-06-28 17:56       ` Dr. David Alan Gilbert [this message]
2022-06-29 18:31         ` James Houghton
2022-06-29 20:39       ` Axel Rasmussen
2022-06-24 18:47 ` Matthew Wilcox
2022-06-27 16:48   ` James Houghton

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=YrtAyUSbtCLwCFxC@work-vm \
    --to=dgilbert@redhat.com \
    --cc=almasrymina@google.com \
    --cc=axelrasmussen@google.com \
    --cc=david@redhat.com \
    --cc=jthoughton@google.com \
    --cc=juew@google.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=manish.mishra@nutanix.com \
    --cc=mike.kravetz@oracle.com \
    --cc=peterx@redhat.com \
    --cc=rientjes@google.com \
    --cc=songmuchun@bytedance.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).