From: Catalin Marinas <catalin.marinas@arm.com>
To: Yu Zhao <yuzhao@google.com>
Cc: Nanyong Sun <sunnanyong@huawei.com>,
will@kernel.org, mike.kravetz@oracle.com, muchun.song@linux.dev,
akpm@linux-foundation.org, anshuman.khandual@arm.com,
willy@infradead.org, wangkefeng.wang@huawei.com,
linux-arm-kernel@lists.infradead.org,
linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [PATCH v3 0/3] A Solution to Re-enable hugetlb vmemmap optimize
Date: Thu, 11 Jul 2024 12:39:52 +0100
Message-ID: <Zo_EiIm4ylNqO2ZR@arm.com>
In-Reply-To: <CAOUHufb3CHLCo54fZcPSG+mrXD-kRsa0Foi8=vL5=q+YHpQ+Rg@mail.gmail.com>
On Thu, Jul 11, 2024 at 02:31:25AM -0600, Yu Zhao wrote:
> On Wed, Jul 10, 2024 at 5:07 PM Yu Zhao <yuzhao@google.com> wrote:
> > On Wed, Jul 10, 2024 at 4:29 PM Catalin Marinas <catalin.marinas@arm.com> wrote:
> > > The Arm ARM states that we need a BBM if we change the output address
> > > and: the old or new mappings are RW *or* the content of the page
> > > changes. Ignoring the latter (page content), we can turn the PTEs RO
> > > first without changing the pfn followed by changing the pfn while they
> > > are RO. Once that's done, we make entry 0 RW and, of course, with
> > > additional TLBIs between all these steps.
> >
> > Aha! This is easy to do -- I just made the RO guaranteed, as I
> > mentioned earlier.
> >
> > Just to make sure I fully understand the workflow:
> >
> > 1. Split a RW PMD into 512 RO PTEs, pointing to the same 2MB `struct page` area.
I don't think we can turn all of them RO here since some of those 512
PTEs are not related to the hugetlb page. So you'd need to keep them RW
but preserve the pfn so that there's no actual translation change. I
think that's covered by FEAT_BBM level 2. Basically this step should be
only about breaking up a PMD block entry into a table entry.
> > 2. TLBI once, after pmd_populate_kernel()
> > 3. Remap PTE 1-7 to the 4KB `struct page` area of PTE 0, for every 8
> > PTEs, while they remain RO.
You may need some intermediate step to turn these PTEs read-only since
step 1 should leave them RW. Also, if we want to free an order-3 page
here, it might be better to allocate an order 0 even for PTE entry 0 (I
had the impression that's what the core code does, I haven't checked).
> > 4. TLBI once, after set_pte_at() on PTE 1-7.
> > 5. Change PTE 0 from RO to RW, pointing to the same 4KB `struct page` area.
> > 6. TLBI once, after set_pte_at() on PTE 0.
> >
> > No BBM required, regardless of FEAT_BBM level 2.
>
> I just studied D8.16.1 from the reference manual, and it seems to me:
> 1. We still need either FEAT_BBM or BBM to split PMD.
Yes.
> 2. We still need BBM when we change PTE 1-7, because even if they
> remain RO, the content of the `struct page` page at the new location
> does not match that at the old location.
Yes, in theory, the data at the new pfn should be the same. We could try
to get clarification from the architects on what could go wrong but I
suspect it's that some atomicity is not guaranteed if you read the data (the
CPU getting confused whether to read from the old or the new page).
Otherwise, since after all these steps PTEs 1-7 point to the same data
as PTE 0, before step 3 we could copy the data in page 0 over to the
other 7 pages while entries 1-7 are still RO. The remapping afterwards
would be fully compliant.
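As a rough illustration of that ordering, here is a userspace toy model in C. It is only a sketch under stated assumptions: `struct toy_pte`, `toy_mem`, `toy_tlbi()` and the other `toy_*` names are hypothetical stand-ins, not kernel types or helpers, and the copy is assumed to go through some other writable alias of the pages. The model only checks the rule as stated earlier in this thread (a pfn may change without BBM only while the entry is RO and the old and new pages hold the same data).

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define TOY_NR_PTES   8     /* PTEs covering one 2MB hugetlb page's vmemmap */
#define TOY_PAGE_SIZE 4096

/* Toy stand-ins, not the kernel's pte_t or TLB maintenance. */
struct toy_pte {
	int  pfn;       /* index into toy_mem[] */
	bool writable;
};

static unsigned char toy_mem[TOY_NR_PTES][TOY_PAGE_SIZE];
static int toy_tlbi_count;

static void toy_tlbi(void)
{
	toy_tlbi_count++;
}

/*
 * Model of the rule discussed in the thread: changing the output address
 * without BBM is acceptable only if the entry is RO and the old and new
 * pages hold identical data.
 */
static bool toy_change_is_safe(const struct toy_pte *pte, int new_pfn)
{
	return !pte->writable &&
	       memcmp(toy_mem[pte->pfn], toy_mem[new_pfn], TOY_PAGE_SIZE) == 0;
}

/* Returns 0 on success, -1 if a pfn change would have required BBM. */
static int toy_remap_tail_ptes(struct toy_pte pte[TOY_NR_PTES])
{
	int i;

	/* Turn all eight entries RO without touching the pfn. */
	for (i = 0; i < TOY_NR_PTES; i++)
		pte[i].writable = false;
	toy_tlbi();

	/* Copy page 0 over pages 1-7 (assumed done through another alias). */
	for (i = 1; i < TOY_NR_PTES; i++)
		memcpy(toy_mem[pte[i].pfn], toy_mem[pte[0].pfn], TOY_PAGE_SIZE);

	/* Point entries 1-7 at page 0 while they are still RO. */
	for (i = 1; i < TOY_NR_PTES; i++) {
		if (!toy_change_is_safe(&pte[i], pte[0].pfn))
			return -1;
		pte[i].pfn = pte[0].pfn;
	}
	toy_tlbi();

	/* Entry 0 goes back to RW so that refcounts can be taken. */
	pte[0].writable = true;
	toy_tlbi();

	return 0;
}
```

Starting from eight RW entries that each point at their own page, `toy_remap_tail_ptes()` returns 0 and leaves every entry pointing at page 0, with only entry 0 writable. Skipping the copy makes the safety check fail, which corresponds to the case where BBM would still be needed.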
> > > Can we leave entry 0 RO? This would save an additional TLBI.
> >
> > Unfortunately we can't. Otherwise we wouldn't be able to, e.g., grab a
> > refcnt on any hugeTLB pages.
OK, fair enough.
> > > Now, I wonder if all this is worth it. What are the scenarios where the
> > > 8 PTEs will be accessed? The vmemmap range corresponding to a 2MB
> > > hugetlb page for example is pretty well defined - 8 x 4K pages, aligned.
>
> One of the fundamental assumptions in core MM is that anyone can
> read or try to grab (write) a refcnt from any `struct page`. Those
> speculative PFN walkers include memory compaction, etc.
But how does this work if PTEs 1-7 are RO? Do those walkers detect it's
a tail page and skip it? Actually, if they all point to the same vmemmap
page, how can one distinguish a tail page via PTE 1 from the head page
via PTE 0?
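For reference, the "8 x 4K pages, aligned" figure quoted above can be checked with a little arithmetic. The sketch below assumes 4KB base pages and a 64-byte `struct page`; neither value is stated in this thread, and all `TOY_*` and `toy_*` names are hypothetical.

```c
#include <assert.h>

#define TOY_BASE_PAGE_SIZE   4096UL        /* assumed 4KB base pages */
#define TOY_HUGE_PAGE_SIZE   (2UL << 20)   /* 2MB hugetlb page */
#define TOY_STRUCT_PAGE_SIZE 64UL          /* assumed sizeof(struct page) */

/* Number of 4KB vmemmap pages holding the struct pages of one huge page. */
static unsigned long toy_vmemmap_pages_per_hugepage(void)
{
	unsigned long nr_struct_pages = TOY_HUGE_PAGE_SIZE / TOY_BASE_PAGE_SIZE;

	return nr_struct_pages * TOY_STRUCT_PAGE_SIZE / TOY_BASE_PAGE_SIZE;
}

/* Byte offset of a pfn's struct page from the start of the vmemmap. */
static unsigned long toy_vmemmap_offset(unsigned long pfn)
{
	return pfn * TOY_STRUCT_PAGE_SIZE;
}
```

With these assumptions a 2MB page has 512 struct pages, which occupy 512 * 64 = 32768 bytes, i.e. 8 pages of 4KB. A 2MB-aligned pfn is a multiple of 512, so its vmemmap offset is a multiple of 32KB, which is where the alignment comes from.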
BTW, I'll be on holiday from tomorrow for two weeks and won't be able to
follow up on this thread (and likely to forget all the discussion by the
time I get back ;)).
--
Catalin