Re: [PATCH 21/46] hugetlb: use struct hugetlb_pte for walk_hugetlb_range

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: Peter Xu <peterx@redhat.com>
To: James Houghton <jthoughton@google.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>,
	David Hildenbrand <david@redhat.com>,
	Muchun Song <songmuchun@bytedance.com>,
	David Rientjes <rientjes@google.com>,
	Axel Rasmussen <axelrasmussen@google.com>,
	Mina Almasry <almasrymina@google.com>,
	Zach O'Keefe <zokeefe@google.com>,
	Manish Mishra <manish.mishra@nutanix.com>,
	Naoya Horiguchi <naoya.horiguchi@nec.com>,
	"Dr . David Alan Gilbert" <dgilbert@redhat.com>,
	"Matthew Wilcox (Oracle)" <willy@infradead.org>,
	Vlastimil Babka <vbabka@suse.cz>,
	Baolin Wang <baolin.wang@linux.alibaba.com>,
	Miaohe Lin <linmiaohe@huawei.com>, Yang Shi <shy828301@gmail.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH 21/46] hugetlb: use struct hugetlb_pte for walk_hugetlb_range
Date: Mon, 30 Jan 2023 16:14:34 -0500	[thread overview]
Message-ID: <Y9gzOqwKcu7p/PEw@x1n> (raw)
In-Reply-To: <CADrL8HX3sf6OO3PXS1g6b2dKf8b5phQ7oyNR0dVT=sAOdTmmqw@mail.gmail.com>

On Mon, Jan 30, 2023 at 10:38:41AM -0800, James Houghton wrote:
> On Mon, Jan 30, 2023 at 9:29 AM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Fri, Jan 27, 2023 at 01:02:02PM -0800, James Houghton wrote:
> > > On Thu, Jan 26, 2023 at 12:31 PM Peter Xu <peterx@redhat.com> wrote:
> > > >
> > > > James,
> > > >
> > > > On Thu, Jan 26, 2023 at 08:58:51AM -0800, James Houghton wrote:
> > > > > It turns out that the THP-like scheme significantly slows down
> > > > > MADV_COLLAPSE: decrementing the mapcounts for the 4K subpages becomes
> > > > > the vast majority of the time spent in MADV_COLLAPSE when collapsing
> > > > > 1G mappings. It is doing 262k atomic decrements, so this makes sense.
> > > > >
> > > > > This is only really a problem because this is done between
> > > > > mmu_notifier_invalidate_range_start() and
> > > > > mmu_notifier_invalidate_range_end(), so KVM won't allow vCPUs to
> > > > > access any of the 1G page while we're doing this (and it can take like
> > > > > ~1 second for each 1G, at least on the x86 server I was testing on).
> > > >
> > > > Did you try to measure the time, or it's a quick observation from perf?
> > >
> > > I put some ktime_get()s in.
> > >
> > > >
> > > > IIRC I used to measure some atomic ops, it is not as drastic as I thought.
> > > > But maybe it depends on many things.
> > > >
> > > > I'm curious how the 1sec is provisioned between the procedures.  E.g., I
> > > > would expect mmu_notifier_invalidate_range_start() to also take some time
> > > > too as it should walk the smally mapped EPT pgtables.
> > >
> > > Somehow this doesn't take all that long (only like 10-30ms when
> > > collapsing from 4K -> 1G) compared to hugetlb_collapse().
> >
> > Did you populate as much the EPT pgtable when measuring this?
> >
> > IIUC this number should be pretty much relevant to how many pages are
> > shadowed to the kvm pgtables.  If the EPT table is mostly empty it should
> > be super fast, but OTOH it can be much slower if when it's populated,
> > because tdp mmu should need to handle the pgtable leaves one by one.
> >
> > E.g. it should be fully populated if you have a program busy dirtying most
> > of the guest pages during test migration.
> 
> That's what I was doing. I was running a workload in the guest that
> just writes 8 bytes to a page and jumps ahead a few pages on all
> vCPUs, touching most of its memory.
> 
> But there is more to understand; I'll collect more results. I'm not
> sure why the EPT can be unmapped/collapsed so quickly.

Maybe something smart done by the hypervisor?

> 
> >
> > Write op should be the worst here case since it'll require the atomic op
> > being applied; see kvm_tdp_mmu_write_spte().
> >
> > >
> > > >
> > > > Since we'll still keep the intermediate levels around - from application
> > > > POV, one other thing to remedy this is further shrink the size of COLLAPSE
> > > > so potentially for a very large page we can start with building 2M layers.
> > > > But then collapse will need to be run at least two rounds.
> > >
> > > That's exactly what I thought to do. :) I realized, too, that this is
> > > actually how userspace *should* collapse things to avoid holding up
> > > vCPUs too long. I think this is a good reason to keep intermediate
> > > page sizes.
> > >
> > > When collapsing 4K -> 1G, the mapcount scheme doesn't actually make a
> > > huge difference: the THP-like scheme is about 30% slower overall.
> > >
> > > When collapsing 4K -> 2M -> 1G, the mapcount scheme makes a HUGE
> > > difference. For the THP-like scheme, collapsing 4K -> 2M requires
> > > decrementing and then re-incrementing subpage->_mapcount, and then
> > > from 2M -> 1G, we have to decrement all 262k subpages->_mapcount. For
> > > the head-only scheme, for each 2M in the 4K -> 2M collapse, we
> > > decrement the compound_mapcount 512 times (once per PTE), then
> > > increment it once. And then for 2M -> 1G, for each 1G, we decrement
> > > mapcount again by 512 (once per PMD), incrementing it once.
> >
> > Did you have quantified numbers (with your ktime treak) to compare these?
> > If we want to go the other route, I think these will be materials to
> > justify any other approach on mapcount handling.
> 
> Ok, I can do that. GIve me a couple days to collect more results and
> organize them in a helpful way.
> 
> (If it's helpful at all, here are some results I collected last week:
> [2]. Please ignore it if it's not helpful.)

It's helpful already at least to me, thanks.  Yes the change is drastic.

> 
> >
> > >
> > > The mapcount decrements are about on par with how long it takes to do
> > > other things, like updating page tables. The main problem is, with the
> > > THP-like scheme (implemented like this [1]), there isn't a way to
> > > avoid the 262k decrements when collapsing 1G. So if we want
> > > MADV_COLLAPSE to be fast and we want a THP-like page_mapcount() API,
> > > then I think something more clever needs to be implemented.
> > >
> > > [1]: https://github.com/48ca/linux/blob/hgmv2-jan24/mm/hugetlb.c#L127-L178
> >
> > I believe the whole goal of HGM is trying to face the same challenge if
> > we'll allow 1G THP exist and being able to split for anon.
> >
> > I don't remember whether we discussed below, maybe we did?  Anyway...
> >
> > Another way to not use thp mapcount, nor break smaps and similar calls to
> > page_mapcount() on small page, is to only increase the hpage mapcount only
> > when hstate pXd (in case of 1G it's PUD) entry being populated (no matter
> > as leaf or a non-leaf), and the mapcount can be decreased when the pXd
> > entry is removed (for leaf, it's the same as for now; for HGM, it's when
> > freeing pgtable of the PUD entry).
> 
> Right, and this is doable. Also it seems like this is pretty close to
> the direction Matthew Wilcox wants to go with THPs.

I may not be familiar with it, do you mean this one?

https://lore.kernel.org/all/Y9Afwds%2FJl39UjEp@casper.infradead.org/

For hugetlb I think it should be easier to maintain rather than any-sized
folios, because there's the pgtable non-leaf entry to track rmap
information and the folio size being static to hpage size.

It'll be different to folios where it can be random sized pages chunk, so
it needs to be managed by batching the ptes when install/zap.

> 
> Something I noticed though, from the implementation of
> folio_referenced()/folio_referenced_one(), is that folio_mapcount()
> ought to report the total number of PTEs that are pointing on the page
> (or the number of times page_vma_mapped_walk returns true). FWIW,
> folio_referenced() is never called for hugetlb folios.

FWIU folio_mapcount is the thing it needs for now to do the rmap walks -
it'll walk every leaf page being mapped, big or small, so IIUC that number
should match with what it expects to see later, more or less.

But I agree the mapcount/referenced value itself is debatable to me, just
like what you raised in the other thread on page migration.  Meanwhile, I
am not certain whether the mapcount is accurate either because AFAICT the
mapcount can be modified if e.g. new page mapping established as long as
before taking the page lock later in folio_referenced().

It's just that I don't see any severe issue either due to any of above, as
long as that information is only used as a hint for next steps, e.g., to
swap which page out.

-- 
Peter Xu

next prev parent reply	other threads:[~2023-01-30 21:14 UTC|newest]

Thread overview: 126+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-01-05 10:17 [PATCH 00/46] Based on latest mm-unstable (85b44c25cd1e) James Houghton
2023-01-05 10:17 ` [PATCH 01/46] hugetlb: don't set PageUptodate for UFFDIO_CONTINUE James Houghton
2023-01-05 10:18 ` [PATCH 02/46] hugetlb: remove mk_huge_pte; it is unused James Houghton
2023-01-05 10:18 ` [PATCH 03/46] hugetlb: remove redundant pte_mkhuge in migration path James Houghton
2023-01-05 10:18 ` [PATCH 04/46] hugetlb: only adjust address ranges when VMAs want PMD sharing James Houghton
2023-01-05 10:18 ` [PATCH 05/46] hugetlb: add CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING James Houghton
2023-01-05 10:18 ` [PATCH 06/46] mm: add VM_HUGETLB_HGM VMA flag James Houghton
2023-01-05 10:18 ` [PATCH 07/46] hugetlb: rename __vma_shareable_flags_pmd to __vma_has_hugetlb_vma_lock James Houghton
2023-01-05 10:18 ` [PATCH 08/46] hugetlb: add HugeTLB HGM enablement helpers James Houghton
2023-01-05 10:18 ` [PATCH 09/46] mm: add MADV_SPLIT to enable HugeTLB HGM James Houghton
2023-01-05 15:05   ` kernel test robot
2023-01-05 15:29   ` David Hildenbrand
2023-01-10  0:01     ` Zach O'Keefe
2023-01-05 10:18 ` [PATCH 10/46] hugetlb: make huge_pte_lockptr take an explicit shift argument James Houghton
2023-01-05 10:18 ` [PATCH 11/46] hugetlb: add hugetlb_pte to track HugeTLB page table entries James Houghton
2023-01-05 16:06   ` kernel test robot
2023-01-05 10:18 ` [PATCH 12/46] hugetlb: add hugetlb_alloc_pmd and hugetlb_alloc_pte James Houghton
2023-01-05 10:18 ` [PATCH 13/46] hugetlb: add hugetlb_hgm_walk and hugetlb_walk_step James Houghton
2023-01-05 16:57   ` kernel test robot
2023-01-05 18:58   ` kernel test robot
2023-01-11 21:51   ` Peter Xu
2023-01-12 13:38     ` James Houghton
2023-01-05 10:18 ` [PATCH 14/46] hugetlb: add make_huge_pte_with_shift James Houghton
2023-01-05 10:18 ` [PATCH 15/46] hugetlb: make default arch_make_huge_pte understand small mappings James Houghton
2023-01-05 10:18 ` [PATCH 16/46] hugetlbfs: do a full walk to check if vma maps a page James Houghton
2023-01-05 10:18 ` [PATCH 17/46] hugetlb: make unmapping compatible with high-granularity mappings James Houghton
2023-01-05 10:18 ` [PATCH 18/46] hugetlb: add HGM support for hugetlb_change_protection James Houghton
2023-01-05 10:18 ` [PATCH 19/46] hugetlb: add HGM support for follow_hugetlb_page James Houghton
2023-01-05 22:26   ` Peter Xu
2023-01-12 18:02   ` Peter Xu
2023-01-12 18:06     ` James Houghton
2023-01-05 10:18 ` [PATCH 20/46] hugetlb: add HGM support for hugetlb_follow_page_mask James Houghton
2023-01-05 10:18 ` [PATCH 21/46] hugetlb: use struct hugetlb_pte for walk_hugetlb_range James Houghton
2023-01-05 22:42   ` Peter Xu
2023-01-11 22:58   ` Peter Xu
2023-01-12 14:06     ` James Houghton
2023-01-12 15:29       ` Peter Xu
2023-01-12 16:45         ` James Houghton
2023-01-12 16:55           ` James Houghton
2023-01-12 20:27           ` Peter Xu
2023-01-12 21:17             ` James Houghton
2023-01-12 21:33               ` Peter Xu
2023-01-16 10:17                 ` David Hildenbrand
2023-01-17 23:11                   ` James Houghton
2023-01-18  9:43                     ` David Hildenbrand
2023-01-18 15:35                       ` Peter Xu
2023-01-18 16:39                         ` James Houghton
2023-01-18 18:21                           ` David Hildenbrand
2023-01-18 19:28                           ` Mike Kravetz
2023-01-19 16:57                             ` James Houghton
2023-01-19 17:31                               ` Mike Kravetz
2023-01-19 19:42                                 ` James Houghton
2023-01-19 20:53                                   ` Peter Xu
2023-01-19 22:45                                     ` James Houghton
2023-01-19 22:00                                   ` Mike Kravetz
2023-01-19 22:23                                     ` Peter Xu
2023-01-19 22:35                                       ` James Houghton
2023-01-19 23:07                                         ` Peter Xu
2023-01-19 23:26                                           ` James Houghton
2023-01-20 17:23                                             ` Peter Xu
2023-01-19 23:44                                           ` Mike Kravetz
2023-01-23 15:19                                             ` Peter Xu
2023-01-23 17:49                                               ` Mike Kravetz
2023-01-26 16:58                                   ` James Houghton
2023-01-26 20:30                                     ` Peter Xu
2023-01-27 21:02                                       ` James Houghton
2023-01-30 17:29                                         ` Peter Xu
2023-01-30 18:38                                           ` James Houghton
2023-01-30 21:14                                             ` Peter Xu [this message]
2023-02-01  0:24                                               ` James Houghton
2023-02-01  1:24                                                 ` Peter Xu
2023-02-01 15:45                                                   ` James Houghton
2023-02-01 15:56                                                     ` David Hildenbrand
2023-02-01 17:58                                                       ` James Houghton
2023-02-01 18:01                                                         ` David Hildenbrand
2023-02-01 16:22                                                     ` Peter Xu
2023-02-01 21:32                                                       ` James Houghton
2023-02-01 21:51                                                         ` Peter Xu
2023-02-02  0:24                                                           ` James Houghton
2023-02-07 16:30                                                             ` James Houghton
2023-02-07 22:46                                                               ` James Houghton
2023-02-07 23:13                                                                 ` Peter Xu
2023-02-08  0:26                                                                   ` James Houghton
2023-02-08 16:16                                                                     ` Peter Xu
2023-02-09 16:43                                                                       ` James Houghton
2023-02-09 19:10                                                                         ` Peter Xu
2023-02-09 19:49                                                                           ` James Houghton
2023-02-09 20:22                                                                             ` Peter Xu
2023-01-18 17:08                         ` David Hildenbrand
2023-01-05 10:18 ` [PATCH 22/46] mm: rmap: provide pte_order in page_vma_mapped_walk James Houghton
2023-01-05 10:18 ` [PATCH 23/46] mm: rmap: make page_vma_mapped_walk callers use pte_order James Houghton
2023-01-05 10:18 ` [PATCH 24/46] rmap: update hugetlb lock comment for HGM James Houghton
2023-01-05 10:18 ` [PATCH 25/46] hugetlb: update page_vma_mapped to do high-granularity walks James Houghton
2023-01-05 10:18 ` [PATCH 26/46] hugetlb: add HGM support for copy_hugetlb_page_range James Houghton
2023-01-05 10:18 ` [PATCH 27/46] hugetlb: add HGM support for move_hugetlb_page_tables James Houghton
2023-01-05 10:18 ` [PATCH 28/46] hugetlb: add HGM support for hugetlb_fault and hugetlb_no_page James Houghton
2023-01-05 10:18 ` [PATCH 29/46] rmap: in try_to_{migrate,unmap}_one, check head page for page flags James Houghton
2023-01-05 10:18 ` [PATCH 30/46] hugetlb: add high-granularity migration support James Houghton
2023-01-05 10:18 ` [PATCH 31/46] hugetlb: sort hstates in hugetlb_init_hstates James Houghton
2023-01-05 10:18 ` [PATCH 32/46] hugetlb: add for_each_hgm_shift James Houghton
2023-01-05 10:18 ` [PATCH 33/46] hugetlb: userfaultfd: add support for high-granularity UFFDIO_CONTINUE James Houghton
2023-01-05 10:18 ` [PATCH 34/46] hugetlb: userfaultfd: when using MADV_SPLIT, round addresses to PAGE_SIZE James Houghton
2023-01-06 15:13   ` Peter Xu
2023-01-10 14:50     ` James Houghton
2023-01-05 10:18 ` [PATCH 35/46] hugetlb: add MADV_COLLAPSE for hugetlb James Houghton
2023-01-10 20:04   ` James Houghton
2023-01-17 21:06   ` Peter Xu
2023-01-17 21:38     ` James Houghton
2023-01-17 21:54       ` Peter Xu
2023-01-19 22:37   ` Peter Xu
2023-01-19 23:06     ` James Houghton
2023-01-05 10:18 ` [PATCH 36/46] hugetlb: remove huge_pte_lock and huge_pte_lockptr James Houghton
2023-01-05 10:18 ` [PATCH 37/46] hugetlb: replace make_huge_pte with make_huge_pte_with_shift James Houghton
2023-01-05 10:18 ` [PATCH 38/46] mm: smaps: add stats for HugeTLB mapping size James Houghton
2023-01-05 10:18 ` [PATCH 39/46] hugetlb: x86: enable high-granularity mapping James Houghton
2023-01-12 20:07   ` James Houghton
2023-01-05 10:18 ` [PATCH 40/46] docs: hugetlb: update hugetlb and userfaultfd admin-guides with HGM info James Houghton
2023-01-05 10:18 ` [PATCH 41/46] docs: proc: include information about HugeTLB HGM James Houghton
2023-01-05 10:18 ` [PATCH 42/46] selftests/vm: add HugeTLB HGM to userfaultfd selftest James Houghton
2023-01-05 10:18 ` [PATCH 43/46] selftests/kvm: add HugeTLB HGM to KVM demand paging selftest James Houghton
2023-01-05 10:18 ` [PATCH 44/46] selftests/vm: add anon and shared hugetlb to migration test James Houghton
2023-01-05 10:18 ` [PATCH 45/46] selftests/vm: add hugetlb HGM test to migration selftest James Houghton
2023-01-05 10:18 ` [PATCH 46/46] selftests/vm: add HGM UFFDIO_CONTINUE and hwpoison tests James Houghton
2023-01-05 10:47 ` [PATCH 00/46] Based on latest mm-unstable (85b44c25cd1e) David Hildenbrand
2023-01-09 19:53   ` Mike Kravetz
2023-01-10 15:47     ` David Hildenbrand

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Y9gzOqwKcu7p/PEw@x1n \
    --to=peterx@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=almasrymina@google.com \
    --cc=axelrasmussen@google.com \
    --cc=baolin.wang@linux.alibaba.com \
    --cc=david@redhat.com \
    --cc=dgilbert@redhat.com \
    --cc=jthoughton@google.com \
    --cc=linmiaohe@huawei.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=manish.mishra@nutanix.com \
    --cc=mike.kravetz@oracle.com \
    --cc=naoya.horiguchi@nec.com \
    --cc=rientjes@google.com \
    --cc=shy828301@gmail.com \
    --cc=songmuchun@bytedance.com \
    --cc=vbabka@suse.cz \
    --cc=willy@infradead.org \
    --cc=zokeefe@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).