From: Peter Xu <peterx@redhat.com>
To: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Jason Gunthorpe <jgg@nvidia.com>,
    "linux-mm@kvack.org" <linux-mm@kvack.org>,
    "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
    "linuxppc-dev@lists.ozlabs.org" <linuxppc-dev@lists.ozlabs.org>,
    Michael Ellerman <mpe@ellerman.id.au>,
    Matthew Wilcox <willy@infradead.org>,
    Rik van Riel <riel@surriel.com>,
    Lorenzo Stoakes <lstoakes@gmail.com>,
    Axel Rasmussen <axelrasmussen@google.com>,
    Yang Shi <shy828301@gmail.com>,
    John Hubbard <jhubbard@nvidia.com>,
    "linux-arm-kernel@lists.infradead.org"
    <linux-arm-kernel@lists.infradead.org>,
    "Kirill A . Shutemov" <kirill@shutemov.name>,
    Andrew Jones <andrew.jones@linux.dev>,
    Vlastimil Babka <vbabka@suse.cz>, Mike Rapoport <rppt@kernel.org>,
    Andrew Morton <akpm@linux-foundation.org>,
    Muchun Song <muchun.song@linux.dev>,
    Christoph Hellwig <hch@infradead.org>,
    "linux-riscv@lists.infradead.org"
    <linux-riscv@lists.infradead.org>,
    James Houghton <jthoughton@google.com>,
    David Hildenbrand <david@redhat.com>,
    Andrea Arcangeli <aarcange@redhat.com>,
    "Aneesh Kumar K . V" <aneesh.kumar@kernel.org>,
    Mike Kravetz <mike.kravetz@oracle.com>
Subject: Re: [PATCH v3 00/12] mm/gup: Unify hugetlb, part 2
Date: Wed, 10 Apr 2024 15:58:47 -0400
Message-ID: <Zhbvd9WZzWl3IA8Y@x1n>
In-Reply-To: <7b75c4db-9dbe-4ff1-b649-06a9218ae0aa@csgroup.eu>
On Wed, Apr 10, 2024 at 04:30:41PM +0000, Christophe Leroy wrote:
>
>
> Le 10/04/2024 à 17:28, Peter Xu a écrit :
> > On Tue, Apr 09, 2024 at 08:43:55PM -0300, Jason Gunthorpe wrote:
> >> On Fri, Apr 05, 2024 at 05:42:44PM -0400, Peter Xu wrote:
> >>> In short, hugetlb mappings shouldn't be special comparing to other huge pXd
> >>> and large folio (cont-pXd) mappings for most of the walkers in my mind, if
> >>> not all. I need to look at all the walkers and there can be some tricky
> >>> ones, but I believe that applies in general. It's actually similar to what
> >>> I did with slow gup here.
> >>
> >> I think that is the big question, I also haven't done the research to
> >> know the answer.
> >>
> >> At this point focusing on moving what is reasonable to the pXX_* API
> >> makes sense to me. Then reviewing what remains and making some
> >> decision.
> >>
> >>> Like this series, for cont-pXd we'll need multiple walks comparing to
> >>> before (when with hugetlb_entry()), but for that part I'll provide some
> >>> performance tests too, and we also have a fallback plan, which is to detect
> > >>> cont-pXd existence, which will also work for large folios.
> >>
> >> I think we can optimize this pretty easy.
> >>
> >>>> I think if you do the easy places for pXX conversion you will have a
> >>>> good idea about what is needed for the hard places.
> >>>
> >>> Here IMHO we don't need to understand "what is the size of this hugetlb
> >>> vma"
> >>
> >> Yeh, I never really understood why hugetlb was linked to the VMA.. The
> >> page table is self describing, obviously.
> >
> > Attaching to vma still makes sense to me, where we should definitely avoid
> > a mixture of hugetlb and !hugetlb pages in a single vma - hugetlb pages are
> > allocated, managed, ... totally differently.
> >
> > And since hugetlb is designed as file-based (which also makes sense to me,
> > at least for now), it's also natural that it's vma-attached.
> >
> >>
> >>> or "which level of pgtable does this hugetlb vma pages locate",
> >>
> >> Ditto
> >>
> >>> because we may not need that, e.g., when we only want to collect some smaps
> >>> statistics. "whether it's hugetlb" may matter, though. E.g. in the mm
> >>> walker we see a huge pmd, it can be a thp, it can be a hugetlb (when
> >>> hugetlb_entry removed), we may need extra check later to put things into
> >>> the right bucket, but for the walker itself it doesn't necessarily need
> >>> hugetlb_entry().
> >>
> >> Right, places may still need to know it is part of a huge VMA because we
> >> have special stuff linked to that.
> >>
> >>>> But then again we come back to power and its big list of page sizes
> >>>> and variety :( Looks like some there have huge sizes at the pgd level
> >>>> at least.
> >>>
> >>> Yeah this is something I want to be super clear, because I may miss
> >>> something: we don't have real pgd pages, right? Powerpc doesn't even
> >>> define p4d_leaf(), AFAICT.
> >>
> >> AFAICT it is because it hides it all in hugepd.
> >
> > IMHO one thing we can benefit from such hugepd rework is, if we can squash
> > all the hugepds like what Christophe does, then we push it one more layer
> > down, and we have a good chance all things should just work.
> >
> > So again my Power brain is close to zero, but now I'm referring to what
> > Christophe shared in the other thread:
> >
> > https://github.com/linuxppc/wiki/wiki/Huge-pages
> >
> > Together with:
> >
> > https://lore.kernel.org/r/288f26f487648d21fd9590e40b390934eaa5d24a.1711377230.git.christophe.leroy@csgroup.eu
> >
> > Where it has:
> >
> > --- a/arch/powerpc/platforms/Kconfig.cputype
> > +++ b/arch/powerpc/platforms/Kconfig.cputype
> > @@ -98,6 +98,7 @@ config PPC_BOOK3S_64
> > select ARCH_ENABLE_HUGEPAGE_MIGRATION if HUGETLB_PAGE && MIGRATION
> > select ARCH_ENABLE_SPLIT_PMD_PTLOCK
> > select ARCH_ENABLE_THP_MIGRATION if TRANSPARENT_HUGEPAGE
> > + select ARCH_HAS_HUGEPD if HUGETLB_PAGE
> > select ARCH_SUPPORTS_HUGETLBFS
> > select ARCH_SUPPORTS_NUMA_BALANCING
> > select HAVE_MOVE_PMD
> > @@ -290,6 +291,7 @@ config PPC_BOOK3S
> > config PPC_E500
> > select FSL_EMB_PERFMON
> > bool
> > + select ARCH_HAS_HUGEPD if HUGETLB_PAGE
> > select ARCH_SUPPORTS_HUGETLBFS if PHYS_64BIT || PPC64
> > select PPC_SMP_MUXED_IPI
> > select PPC_DOORBELL
> >
> > So I think it means we have three PowerPC systems that supports hugepd
> > right now (besides the 8xx which Christophe is trying to drop support
> > there), besides 8xx we still have book3s_64 and E500.
> >
> > Let's check one by one:
> >
> > - book3s_64
> >
> > - hash
> >
> > - 64K: p4d is not used, largest pgsize pgd 16G @pud level. It
> > means after squashing it'll be a bunch of cont-pmd, all good.
> >
> > - 4K: p4d also not used, largest pgsize pgd 128G, after squashed
> > it'll be cont-pud. all good.
> >
> > - radix
> >
> > - 64K: largest 1G @pud, then cont-pmd after squashed. all good.
> >
> > - 4K: largest 1G @pud, then cont-pmd, all good.
> >
> > - e500 & 8xx
> >
> > - both of them use 2-level pgtables (pgd + pte), after squashed hugepd
> > @pgd level they become cont-pte. all good.
>
> e500 has two modes: 32 bits and 64 bits.
>
> For 32 bits:
>
> 8xx is the only one handling it through HW-assisted pagetable walk hence
> requiring a 2-level whatever the pagesize is.

Hmm, I think I finally get it.

I think the confusion came from the level-2 table that is always shown in
Figure 8-5 of the manual:

  https://www.nxp.com/docs/en/reference-manual/MPC860UM.pdf

So I suppose you meant that for 8M, the PowerPC 8xx hardware is aware of
such an 8M pgtable (from the level-1 entry, where bits 28-29 are set to
011b), so it never reads anything starting from "Level-2 Descriptor 1"; it
only reads the single entry "Level-2 Descriptor 0". So fundamentally the
hugepd format must look like that for 8xx?

But then perhaps it's still compatible with cont-pte, because the remaining
entries (pte index 1+) will simply be ignored by the hardware?
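
To double check my reading of the level-1 entry, here is a small userspace
sketch (not kernel code) of how I understand the page-size field. All names
are made up for illustration. It assumes the manual's big-endian bit
numbering (bit 0 is the MSB of the 32-bit word), so bits 28-29 sit at mask
0x0c and bit 31 is the valid bit, and it assumes the encoding is 01b for
512K and 11b for 8M (which is how I read the "011b" above). Please correct
me if any of that is wrong.

```c
#include <assert.h>
#include <stdint.h>

/* All names below are hypothetical and exist only for this sketch. */
#define L1_VALID    0x00000001u   /* bit 31: descriptor valid (assumed) */
#define L1_PS_MASK  0x0000000cu   /* bits 28-29: page size (assumed) */
#define L1_PS_SHIFT 2

enum l1_ps { L1_PS_SMALL, L1_PS_512K, L1_PS_8M, L1_PS_RESERVED };

/* Decode the page-size field of a valid level-1 descriptor. */
static enum l1_ps l1_page_size(uint32_t l1_desc)
{
	/* The size field of an invalid entry carries no meaning. */
	assert(l1_desc & L1_VALID);

	switch ((l1_desc & L1_PS_MASK) >> L1_PS_SHIFT) {
	case 0x0: return L1_PS_SMALL;     /* 4K or 16K */
	case 0x1: return L1_PS_512K;
	case 0x3: return L1_PS_8M;
	default:  return L1_PS_RESERVED;  /* 10b */
	}
}
```

If the above holds, a walker could tell from the level-1 entry alone that
only the first level-2 descriptor matters for an 8M mapping.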
>
> On e500 it is all software so pages 2M and larger should be cont-PGD (by
> the way I'm a bit puzzled that on arches that have only 2 levels, ie PGD
> and PTE, the PGD entries are populated by a function called PMD_populate()).

Yeah.. I am also wondering whether pgd_populate() could work there too
(perhaps with some trivial changes, or maybe none are even needed), since
when the p4d/pud/pmd levels are missing, Linux should just do something like
an enforced cast from pgd_t * to pmd_t * in this case.

I think they're already not treated as pgd today, as __find_linux_pte()
already skips pgd unconditionally:

	pgdp = pgdir + pgd_index(ea);
	p4dp = p4d_offset(pgdp, ea);
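
The folding I have in mind can be sketched in plain userspace C. This is
only an illustration under the assumption of a 2-level layout with 4M per
top-level entry; every sk_* name is made up and none of this is the actual
powerpc code. Each missing level is a struct wrapping the level above it,
and each *_offset() helper is just a cast, so populating the "pmd" really
writes the top-level entry:

```c
#include <assert.h>
#include <stdint.h>

/* Assumed geometry for this sketch: 32-bit addresses, 4M per entry. */
#define SK_PGDIR_SHIFT  22
#define SK_PTRS_PER_PGD 1024

typedef struct { uint32_t val; } sk_pgd_t;
typedef struct { sk_pgd_t pgd; } sk_p4d_t;   /* p4d folded into pgd */
typedef struct { sk_p4d_t p4d; } sk_pud_t;   /* pud folded into p4d */
typedef struct { sk_pud_t pud; } sk_pmd_t;   /* pmd folded into pud */

static unsigned int sk_pgd_index(uint32_t ea)
{
	return (ea >> SK_PGDIR_SHIFT) & (SK_PTRS_PER_PGD - 1);
}

/* A folded level has no table of its own: "descending" is a cast. */
static sk_p4d_t *sk_p4d_offset(sk_pgd_t *pgdp, uint32_t ea)
{
	(void)ea;
	return (sk_p4d_t *)pgdp;
}

static sk_pud_t *sk_pud_offset(sk_p4d_t *p4dp, uint32_t ea)
{
	(void)ea;
	return (sk_pud_t *)p4dp;
}

static sk_pmd_t *sk_pmd_offset(sk_pud_t *pudp, uint32_t ea)
{
	(void)ea;
	return (sk_pmd_t *)pudp;
}

/* Walk pgd -> p4d -> pud -> pmd; all four pointers share one address. */
static sk_pmd_t *sk_walk_to_pmd(sk_pgd_t *pgdir, uint32_t ea)
{
	assert(pgdir);

	sk_pgd_t *pgdp = pgdir + sk_pgd_index(ea);
	sk_p4d_t *p4dp = sk_p4d_offset(pgdp, ea);
	sk_pud_t *pudp = sk_pud_offset(p4dp, ea);

	return sk_pmd_offset(pudp, ea);
}

/* Writing through the pmd pointer stores into the top-level entry. */
static void sk_pmd_populate(sk_pmd_t *pmdp, uint32_t val)
{
	pmdp->pud.p4d.pgd.val = val;
}
```

With such a layout, sk_walk_to_pmd(pgdir, ea) returns the same address as
&pgdir[sk_pgd_index(ea)], which would explain why a pmd_populate()-style
helper ends up filling what is physically a PGD entry.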
>
> Current situation for 8xx is illustrated here:
> https://github.com/linuxppc/wiki/wiki/Huge-pages#8xx
>
> I also tried to better illustrate e500/32 here:
> https://github.com/linuxppc/wiki/wiki/Huge-pages#e500
>
> For 64 bits:
> We have PTE/PMD/PUD/PGD, no P4D
>
> See arch/powerpc/include/asm/nohash/64/pgtable-4k.h

We don't have anything above pud in this category, right? That's what I
read from your wiki (and thanks for providing that in the first place; it
helps me a lot to understand how it works on PowerPC).

I want to make sure I can move on without caring about p4d/pgd leaves, like
what we do right now, even after we manage to remove hugepd for good. Since
p4d is always missing in this case, it's then about whether
"pud|pmd|pte_leaf()" can also cover the pgd ones when that day comes, iiuc.

Thanks,
>
>
> >
> > I think the trick here is there'll be no pgd leaves after hugepd squashing
> > to lower levels, then since PowerPC seems to never have p4d, then all
> > things fall into pud or lower. We seem to be all good there?
> >
> >>
> >> If the goal is to purge hugepd then some of the options might turn out
> >> to convert hugepd into huge p4d/pgd, as I understand it. It would be
> >> nice to have certainty on this at least.
> >
> > Right. I hope the pmd/pud plan I proposed above can already work too with
> > such ambitious goal too. But review very welcomed from either you or
> > Christophe.
> >
> > PS: I think I'll also have a closer look at Christophe's series this week
> > or next.
> >
> >>
> >> We have effectively three APIs to parse a single page table and
> >> currently none of the APIs can return 100% of the data for power.
> >
> > Thanks,
> >
--
Peter Xu