From: David Hildenbrand <david@redhat.com>
To: Dev Jain <dev.jain@arm.com>,
akpm@linux-foundation.org, willy@infradead.org,
kirill.shutemov@linux.intel.com
Cc: ryan.roberts@arm.com, anshuman.khandual@arm.com,
catalin.marinas@arm.com, cl@gentwo.org, vbabka@suse.cz,
mhocko@suse.com, apopple@nvidia.com, dave.hansen@linux.intel.com,
will@kernel.org, baohua@kernel.org, jack@suse.cz,
srivatsa@csail.mit.edu, haowenchao22@gmail.com, hughd@google.com,
aneesh.kumar@kernel.org, yang@os.amperecomputing.com,
peterx@redhat.com, ioworker0@gmail.com,
wangkefeng.wang@huawei.com, ziy@nvidia.com, jglisse@google.com,
surenb@google.com, vishal.moola@gmail.com, zokeefe@google.com,
zhengqi.arch@bytedance.com, jhubbard@nvidia.com,
21cnbao@gmail.com, linux-mm@kvack.org,
linux-kernel@vger.kernel.org
Subject: Re: [RFC PATCH 09/12] khugepaged: Introduce vma_collapse_anon_folio()
Date: Thu, 2 Jan 2025 12:33:05 +0100 [thread overview]
Message-ID: <e4e02b6b-2bc4-449e-87f9-43f4be269626@redhat.com> (raw)
In-Reply-To: <8d752d25-b9b2-4bf9-9a81-254aeb3ab0f6@arm.com>
>>>
>>> When having to back-off (restore original PTEs), or for copying,
>>> you'll likely need access to the original PTEs, which were already
>>> cleared. So likely you need a temporary copy of the original PTEs
>>> somehow.
>>>
>>> That's why temporarily clearing the PMD und mmap write lock is easier
>>> to implement, at the cost of requiring the mmap lock in write mode
>>> like PMD collapse.
>
> Why do I need to clear the PMD if I am taking the mmap_write_lock() and
> operating only on the PTE?
One approach I proposed to Nico (and I think he has a prototype) is:
a) Take all locks like we do today (mmap in write, vma in write, rmap in
write)
After this step, no "ordinary" page table walkers can run anymore
b) Clear the PMD entry and flush the TLB like we do today
After this step, neither the CPU can read/write folios nor GUP-fast can
run. The PTE table is completely isolated.
c) Now we can work on the (temporarily cleared) PTE table as we please:
isolate folios, lock them, ... without clearing the PTE entries, just
like we do today.
d) Allocate the new folios (we don't have to hold any spinlocks), copy +
replace the affected PTE entries in the isolated PTE table. Similar to
what we do today, except that we don't clear PTEs but instead clear+reset.
e) Unlock+un-isolate + unref the collapsed folios like we do today.
f) Re-map the PTE-table, like we do today when collapse would have failed.
Of course, after taking all locks we have to re-verify that there is
something to collapse (e.g., in d) we also have to check for unexpected
folio references). The backup path is easy: remap the PTE table as no
PTE entries were touched just yet.
Observe that many things are "like we do today".
As soon as we go to read locks + PTE locks, it all gets more complicated
to get it right. Not that it cannot be done, but the above is IMHO a lot
simpler to get right.
--
Cheers,
David / dhildenb
next prev parent reply other threads:[~2025-01-02 11:33 UTC|newest]
Thread overview: 74+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-12-16 16:50 [RFC PATCH 00/12] khugepaged: Asynchronous mTHP collapse Dev Jain
2024-12-16 16:50 ` [RFC PATCH 01/12] khugepaged: Rename hpage_collapse_scan_pmd() -> ptes() Dev Jain
2024-12-17 4:18 ` Matthew Wilcox
2024-12-17 5:52 ` Dev Jain
2024-12-17 6:43 ` Ryan Roberts
2024-12-17 18:11 ` Zi Yan
2024-12-17 19:12 ` Ryan Roberts
2024-12-16 16:50 ` [RFC PATCH 02/12] khugepaged: Generalize alloc_charge_folio() Dev Jain
2024-12-17 2:51 ` Baolin Wang
2024-12-17 6:08 ` Dev Jain
2024-12-17 4:17 ` Matthew Wilcox
2024-12-17 7:09 ` Ryan Roberts
2024-12-17 13:00 ` Zi Yan
2024-12-20 17:41 ` Christoph Lameter (Ampere)
2024-12-20 17:45 ` Ryan Roberts
2024-12-20 18:47 ` Christoph Lameter (Ampere)
2025-01-02 11:21 ` Ryan Roberts
2024-12-17 6:53 ` Ryan Roberts
2024-12-17 9:06 ` Dev Jain
2024-12-16 16:50 ` [RFC PATCH 03/12] khugepaged: Generalize hugepage_vma_revalidate() Dev Jain
2024-12-17 4:21 ` Matthew Wilcox
2024-12-17 16:58 ` Ryan Roberts
2024-12-16 16:50 ` [RFC PATCH 04/12] khugepaged: Generalize __collapse_huge_page_swapin() Dev Jain
2024-12-17 4:24 ` Matthew Wilcox
2024-12-16 16:50 ` [RFC PATCH 05/12] khugepaged: Generalize __collapse_huge_page_isolate() Dev Jain
2024-12-17 4:32 ` Matthew Wilcox
2024-12-17 6:41 ` Dev Jain
2024-12-17 17:14 ` Ryan Roberts
2024-12-17 17:09 ` Ryan Roberts
2024-12-16 16:50 ` [RFC PATCH 06/12] khugepaged: Generalize __collapse_huge_page_copy_failed() Dev Jain
2024-12-17 17:22 ` Ryan Roberts
2024-12-18 8:49 ` Dev Jain
2024-12-16 16:51 ` [RFC PATCH 07/12] khugepaged: Scan PTEs order-wise Dev Jain
2024-12-17 18:15 ` Ryan Roberts
2024-12-18 9:24 ` Dev Jain
2025-01-06 10:04 ` Usama Arif
2025-01-07 7:17 ` Dev Jain
2024-12-16 16:51 ` [RFC PATCH 08/12] khugepaged: Abstract PMD-THP collapse Dev Jain
2024-12-17 19:24 ` Ryan Roberts
2024-12-18 9:26 ` Dev Jain
2024-12-16 16:51 ` [RFC PATCH 09/12] khugepaged: Introduce vma_collapse_anon_folio() Dev Jain
2024-12-16 17:06 ` David Hildenbrand
2024-12-16 19:08 ` Yang Shi
2024-12-17 10:07 ` Dev Jain
2024-12-17 10:32 ` David Hildenbrand
2024-12-18 8:35 ` Dev Jain
2025-01-02 10:08 ` Dev Jain
2025-01-02 11:33 ` David Hildenbrand [this message]
2025-01-03 8:17 ` Dev Jain
2025-01-02 11:22 ` David Hildenbrand
2024-12-18 15:59 ` Dev Jain
2025-01-06 10:17 ` Usama Arif
2025-01-07 8:12 ` Dev Jain
2024-12-16 16:51 ` [RFC PATCH 10/12] khugepaged: Skip PTE range if a larger mTHP is already mapped Dev Jain
2024-12-18 7:36 ` Ryan Roberts
2024-12-18 9:34 ` Dev Jain
2024-12-19 3:40 ` John Hubbard
2024-12-19 3:51 ` Zi Yan
2024-12-19 7:59 ` Dev Jain
2024-12-19 8:07 ` Dev Jain
2024-12-20 11:57 ` Ryan Roberts
2024-12-16 16:51 ` [RFC PATCH 11/12] khugepaged: Enable sysfs to control order of collapse Dev Jain
2024-12-16 16:51 ` [RFC PATCH 12/12] selftests/mm: khugepaged: Enlighten for mTHP collapse Dev Jain
2024-12-18 9:03 ` Ryan Roberts
2024-12-18 9:50 ` Dev Jain
2024-12-20 11:05 ` Ryan Roberts
2024-12-30 7:09 ` Dev Jain
2024-12-30 16:36 ` Zi Yan
2025-01-02 11:43 ` Ryan Roberts
2025-01-03 10:10 ` Dev Jain
2025-01-03 10:11 ` Dev Jain
2024-12-16 17:31 ` [RFC PATCH 00/12] khugepaged: Asynchronous " Dev Jain
2025-01-02 21:58 ` Nico Pache
2025-01-03 7:04 ` Dev Jain
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=e4e02b6b-2bc4-449e-87f9-43f4be269626@redhat.com \
--to=david@redhat.com \
--cc=21cnbao@gmail.com \
--cc=akpm@linux-foundation.org \
--cc=aneesh.kumar@kernel.org \
--cc=anshuman.khandual@arm.com \
--cc=apopple@nvidia.com \
--cc=baohua@kernel.org \
--cc=catalin.marinas@arm.com \
--cc=cl@gentwo.org \
--cc=dave.hansen@linux.intel.com \
--cc=dev.jain@arm.com \
--cc=haowenchao22@gmail.com \
--cc=hughd@google.com \
--cc=ioworker0@gmail.com \
--cc=jack@suse.cz \
--cc=jglisse@google.com \
--cc=jhubbard@nvidia.com \
--cc=kirill.shutemov@linux.intel.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=mhocko@suse.com \
--cc=peterx@redhat.com \
--cc=ryan.roberts@arm.com \
--cc=srivatsa@csail.mit.edu \
--cc=surenb@google.com \
--cc=vbabka@suse.cz \
--cc=vishal.moola@gmail.com \
--cc=wangkefeng.wang@huawei.com \
--cc=will@kernel.org \
--cc=willy@infradead.org \
--cc=yang@os.amperecomputing.com \
--cc=zhengqi.arch@bytedance.com \
--cc=ziy@nvidia.com \
--cc=zokeefe@google.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).