linux-doc.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: David Hildenbrand <david@redhat.com>
To: Nico Pache <npache@redhat.com>, Jann Horn <jannh@google.com>
Cc: linux-mm@kvack.org, linux-doc@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-trace-kernel@vger.kernel.org,
	akpm@linux-foundation.org, corbet@lwn.net, rostedt@goodmis.org,
	mhiramat@kernel.org, mathieu.desnoyers@efficios.com,
	baohua@kernel.org, baolin.wang@linux.alibaba.com,
	ryan.roberts@arm.com, willy@infradead.org, peterx@redhat.com,
	ziy@nvidia.com, wangkefeng.wang@huawei.com,
	usamaarif642@gmail.com, sunnanyong@huawei.com,
	vishal.moola@gmail.com, thomas.hellstrom@linux.intel.com,
	yang@os.amperecomputing.com, kirill.shutemov@linux.intel.com,
	aarcange@redhat.com, raquini@redhat.com, dev.jain@arm.com,
	anshuman.khandual@arm.com, catalin.marinas@arm.com,
	tiwai@suse.de, will@kernel.org, dave.hansen@linux.intel.com,
	jack@suse.cz, cl@gentwo.org, jglisse@google.com,
	surenb@google.com, zokeefe@google.com, hannes@cmpxchg.org,
	rientjes@google.com, mhocko@suse.com, rdunlap@infradead.org,
	lorenzo.stoakes@oracle.com, Liam.Howlett@oracle.com
Subject: Re: [PATCH v5 07/12] khugepaged: add mTHP support
Date: Fri, 2 May 2025 17:42:25 +0200	[thread overview]
Message-ID: <cc8e253f-5dd1-488d-b5a4-a9e0c0466ed3@redhat.com> (raw)
In-Reply-To: <CAA1CXcBn3eZpcdzsG5DRNBa+rFBg1dQjssC=SYyXW5_7fDmwXg@mail.gmail.com>

On 02.05.25 17:30, Nico Pache wrote:
> On Fri, May 2, 2025 at 9:27 AM Jann Horn <jannh@google.com> wrote:
>>
>> On Fri, May 2, 2025 at 5:19 PM David Hildenbrand <david@redhat.com> wrote:
>>>
>>> On 02.05.25 14:50, Jann Horn wrote:
>>>> On Fri, May 2, 2025 at 8:29 AM David Hildenbrand <david@redhat.com> wrote:
>>>>> On 02.05.25 00:29, Nico Pache wrote:
>>>>>> On Wed, Apr 30, 2025 at 2:53 PM Jann Horn <jannh@google.com> wrote:
>>>>>>>
>>>>>>> On Mon, Apr 28, 2025 at 8:12 PM Nico Pache <npache@redhat.com> wrote:
>>>>>>>> Introduce the ability for khugepaged to collapse to different mTHP sizes.
>>>>>>>> While scanning PMD ranges for potential collapse candidates, keep track
>>>>>>>> of pages in KHUGEPAGED_MIN_MTHP_ORDER chunks via a bitmap. Each bit
>>>>>>>> represents a utilized region of order KHUGEPAGED_MIN_MTHP_ORDER ptes. If
>>>>>>>> mTHPs are enabled we remove the restriction of max_ptes_none during the
>>>>>>>> scan phase so we dont bailout early and miss potential mTHP candidates.
>>>>>>>>
>>>>>>>> After the scan is complete we will perform binary recursion on the
>>>>>>>> bitmap to determine which mTHP size would be most efficient to collapse
>>>>>>>> to. max_ptes_none will be scaled by the attempted collapse order to
>>>>>>>> determine how full a THP must be to be eligible.
>>>>>>>>
>>>>>>>> If a mTHP collapse is attempted, but contains swapped out, or shared
>>>>>>>> pages, we dont perform the collapse.
>>>>>>> [...]
>>>>>>>> @@ -1208,11 +1211,12 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>>>>>>>>            vma_start_write(vma);
>>>>>>>>            anon_vma_lock_write(vma->anon_vma);
>>>>>>>>
>>>>>>>> -       mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address,
>>>>>>>> -                               address + HPAGE_PMD_SIZE);
>>>>>>>> +       mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, _address,
>>>>>>>> +                               _address + (PAGE_SIZE << order));
>>>>>>>>            mmu_notifier_invalidate_range_start(&range);
>>>>>>>>
>>>>>>>>            pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
>>>>>>>> +
>>>>>>>>            /*
>>>>>>>>             * This removes any huge TLB entry from the CPU so we won't allow
>>>>>>>>             * huge and small TLB entries for the same virtual address to
>>>>>>>
>>>>>>> It's not visible in this diff, but we're about to do a
>>>>>>> pmdp_collapse_flush() here. pmdp_collapse_flush() tears down the
>>>>>>> entire page table, meaning it tears down 2MiB of address space; and it
>>>>>>> assumes that the entire page table exclusively corresponds to the
>>>>>>> current VMA.
>>>>>>>
>>>>>>> I think you'll need to ensure that the pmdp_collapse_flush() only
>>>>>>> happens for full-size THP, and that mTHP only tears down individual
>>>>>>> PTEs in the relevant range. (That code might get a bit messy, since
>>>>>>> the existing THP code tears down PTEs in a detached page table, while
>>>>>>> mTHP would have to do it in a still-attached page table.)
>>>>>> Hi Jann!
>>>>>>
>>>>>> I was under the impression that this is needed to prevent GUP-fast
>>>>>> races (and potentially others).
>>>>
>>>> Why would you need to touch the PMD entry to prevent GUP-fast races for mTHP?
>>>>
>>>>>> As you state here, conceptually the PMD case is, detach the PMD, do
>>>>>> the collapse, then reinstall the PMD (similarly to how the system
>>>>>> recovers from a failed PMD collapse). I tried to keep the current
>>>>>> locking behavior as it seemed the easiest way to get it right (and not
>>>>>> break anything). So I keep the PMD detaching and reinstalling for the
>>>>>> mTHP case too. As Hugh points out I am releasing the anon lock too
>>>>>> early. I will comment further on his response.
>>>>
>>>> As I see it, you're not "keeping" the current locking behavior; you're
>>>> making a big implicit locking change by reusing a codepath designed
>>>> for PMD THP for mTHP, where the page table may not be exclusively
>>>> owned by one VMA.
>>>
>>> That is not the intention. The intention in this series (at least as we
>>> discussed) was to not do it across VMAs; that is considered the next
>>> logical step (which will be especially relevant on arm64 IMHO).
>>
>> Ah, so for now this is supposed to only work for PTEs which are in a
>> PMD which is fully covered by the VMA? So if I make a 16KiB VMA and
>> then try to collapse its contents to an order-2 mTHP page, that should
>> just not work?
> Correct! As I started in reply to Hugh, the locking conditions explode
> if we drop that requirement.

Right. Adding to that, one could evaluate how much we would gain by 
trying to lock, say, up to $MAGIC_NUMBER related VMAs.

Of course, if no other VMA spans the PMD, and the VMA only covers it 
partially, it is likely still fine as long as we hold the mmap lock in 
write mode.

But probably, looking into a different locking scheme would be 
beneficial at this point.

-- 
Cheers,

David / dhildenb


  reply	other threads:[~2025-05-02 15:42 UTC|newest]

Thread overview: 41+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-04-28 18:12 [PATCH v5 00/12] khugepaged: mTHP support Nico Pache
2025-04-28 18:12 ` [PATCH v5 01/12] khugepaged: rename hpage_collapse_* to khugepaged_* Nico Pache
2025-04-29 13:41   ` Zi Yan
2025-04-30  7:47   ` Baolin Wang
2025-04-28 18:12 ` [PATCH v5 02/12] introduce khugepaged_collapse_single_pmd to unify khugepaged and madvise_collapse Nico Pache
2025-04-30  7:56   ` Baolin Wang
2025-04-28 18:12 ` [PATCH v5 03/12] khugepaged: generalize hugepage_vma_revalidate for mTHP support Nico Pache
2025-04-28 18:12 ` [PATCH v5 04/12] khugepaged: generalize alloc_charge_folio() Nico Pache
2025-04-28 18:12 ` [PATCH v5 05/12] khugepaged: generalize __collapse_huge_page_* for mTHP support Nico Pache
2025-04-28 18:12 ` [PATCH v5 06/12] khugepaged: introduce khugepaged_scan_bitmap " Nico Pache
2025-04-30 10:08   ` Baolin Wang
2025-04-30 18:56     ` Nico Pache
2025-05-01 23:03       ` Nico Pache
2025-05-02  1:23         ` Baolin Wang
2025-06-06 16:37       ` Dev Jain
2025-06-07 12:48         ` Nico Pache
2025-04-28 18:12 ` [PATCH v5 07/12] khugepaged: add " Nico Pache
2025-04-30 20:51   ` Jann Horn
2025-05-01 22:29     ` Nico Pache
2025-05-02  6:29       ` David Hildenbrand
2025-05-02 12:50         ` Jann Horn
2025-05-02 15:18           ` David Hildenbrand
2025-05-02 15:24             ` Lorenzo Stoakes
2025-05-02 15:39               ` David Hildenbrand
2025-05-02 15:26             ` Jann Horn
2025-05-02 15:30               ` Nico Pache
2025-05-02 15:42                 ` David Hildenbrand [this message]
2025-05-01 12:58   ` Hugh Dickins
2025-05-01 16:15     ` Andrew Morton
2025-05-01 22:30     ` Nico Pache
2025-04-28 18:12 ` [PATCH v5 08/12] khugepaged: skip collapsing mTHP to smaller orders Nico Pache
2025-04-30 10:09   ` Baolin Wang
2025-04-28 18:12 ` [PATCH v5 09/12] khugepaged: avoid unnecessary mTHP collapse attempts Nico Pache
2025-04-30 10:12   ` Baolin Wang
2025-04-30 18:43     ` Nico Pache
2025-04-28 18:12 ` [PATCH v5 10/12] khugepaged: improve tracepoints for mTHP orders Nico Pache
2025-04-28 18:12 ` [PATCH v5 11/12] khugepaged: add per-order mTHP khugepaged stats Nico Pache
2025-04-28 18:12 ` [PATCH v5 12/12] Documentation: mm: update the admin guide for mTHP collapse Nico Pache
2025-04-29  5:10   ` Bagas Sanjaya
2025-04-29 16:38   ` Christoph Lameter (Ampere)
2025-04-29 17:15     ` David Hildenbrand

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=cc8e253f-5dd1-488d-b5a4-a9e0c0466ed3@redhat.com \
    --to=david@redhat.com \
    --cc=Liam.Howlett@oracle.com \
    --cc=aarcange@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=anshuman.khandual@arm.com \
    --cc=baohua@kernel.org \
    --cc=baolin.wang@linux.alibaba.com \
    --cc=catalin.marinas@arm.com \
    --cc=cl@gentwo.org \
    --cc=corbet@lwn.net \
    --cc=dave.hansen@linux.intel.com \
    --cc=dev.jain@arm.com \
    --cc=hannes@cmpxchg.org \
    --cc=jack@suse.cz \
    --cc=jannh@google.com \
    --cc=jglisse@google.com \
    --cc=kirill.shutemov@linux.intel.com \
    --cc=linux-doc@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=linux-trace-kernel@vger.kernel.org \
    --cc=lorenzo.stoakes@oracle.com \
    --cc=mathieu.desnoyers@efficios.com \
    --cc=mhiramat@kernel.org \
    --cc=mhocko@suse.com \
    --cc=npache@redhat.com \
    --cc=peterx@redhat.com \
    --cc=raquini@redhat.com \
    --cc=rdunlap@infradead.org \
    --cc=rientjes@google.com \
    --cc=rostedt@goodmis.org \
    --cc=ryan.roberts@arm.com \
    --cc=sunnanyong@huawei.com \
    --cc=surenb@google.com \
    --cc=thomas.hellstrom@linux.intel.com \
    --cc=tiwai@suse.de \
    --cc=usamaarif642@gmail.com \
    --cc=vishal.moola@gmail.com \
    --cc=wangkefeng.wang@huawei.com \
    --cc=will@kernel.org \
    --cc=willy@infradead.org \
    --cc=yang@os.amperecomputing.com \
    --cc=ziy@nvidia.com \
    --cc=zokeefe@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).