linux-kernel.vger.kernel.org archive mirror
From: David Hildenbrand <david@redhat.com>
To: Dev Jain <dev.jain@arm.com>,
	akpm@linux-foundation.org, willy@infradead.org,
	kirill.shutemov@linux.intel.com
Cc: ryan.roberts@arm.com, anshuman.khandual@arm.com,
	catalin.marinas@arm.com, cl@gentwo.org, vbabka@suse.cz,
	mhocko@suse.com, apopple@nvidia.com, dave.hansen@linux.intel.com,
	will@kernel.org, baohua@kernel.org, jack@suse.cz,
	srivatsa@csail.mit.edu, haowenchao22@gmail.com, hughd@google.com,
	aneesh.kumar@kernel.org, yang@os.amperecomputing.com,
	peterx@redhat.com, ioworker0@gmail.com,
	wangkefeng.wang@huawei.com, ziy@nvidia.com, jglisse@google.com,
	surenb@google.com, vishal.moola@gmail.com, zokeefe@google.com,
	zhengqi.arch@bytedance.com, jhubbard@nvidia.com,
	21cnbao@gmail.com, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org
Subject: Re: [RFC PATCH 09/12] khugepaged: Introduce vma_collapse_anon_folio()
Date: Tue, 17 Dec 2024 11:32:58 +0100	[thread overview]
Message-ID: <0368f4f2-cb0f-4633-a86d-5c3f75839b4e@redhat.com> (raw)
In-Reply-To: <28013908-65d8-462e-b975-cd0f63d226b1@arm.com>

On 17.12.24 11:07, Dev Jain wrote:
> 
> On 16/12/24 10:36 pm, David Hildenbrand wrote:
>> On 16.12.24 17:51, Dev Jain wrote:
>>> In contrast to PMD-collapse, we do not need to operate on two levels
>>> of pagetable simultaneously. Therefore, downgrade the mmap lock from
>>> write to read mode. Still take the anon_vma lock in exclusive mode so
>>> as to not waste time in the rmap path, which is anyway going to fail
>>> since the PTEs are going to be changed. Under the PTL, copy page
>>> contents, clear the PTEs, remove folio pins, and (try to) unmap the
>>> old folios. Set the PTEs to the new folio using the set_ptes() API.
>>>
>>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>>> ---
>>> Note: I have been trying hard to get rid of the locks in here: we
>>> are still taking the PTL around the page copying; dropping the PTL
>>> and taking it after the copying could lead to a deadlock, for
>>> example:
>>>
>>>   khugepaged                 madvise(MADV_COLD)
>>>   folio_lock()               lock(ptl)
>>>   lock(ptl)                  folio_lock()
>>>
>>> We can create a locked folio list, altogether drop both the locks,
>>> take the PTL, do everything which __collapse_huge_page_isolate()
>>> does *except* the isolation, and again try locking folios, but that
>>> would reduce the efficiency of khugepaged and almost looks like a
>>> forced solution :)
>>> Please note the following discussion if anyone is interested:
>>> https://lore.kernel.org/all/66bb7496-a445-4ad7-8e56-4f2863465c54@arm.com/
>>>
>>> (Apologies for not CCing the mailing list from the start)
>>>
>>>    mm/khugepaged.c | 108 ++++++++++++++++++++++++++++++++++++++----------
>>>    1 file changed, 87 insertions(+), 21 deletions(-)
>>>
>>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>>> index 88beebef773e..8040b130e677 100644
>>> --- a/mm/khugepaged.c
>>> +++ b/mm/khugepaged.c
>>> @@ -714,24 +714,28 @@ static void __collapse_huge_page_copy_succeeded(pte_t *pte,
>>>                            struct vm_area_struct *vma,
>>>                            unsigned long address,
>>>                            spinlock_t *ptl,
>>> -                        struct list_head *compound_pagelist)
>>> +                        struct list_head *compound_pagelist, int order)
>>>    {
>>>        struct folio *src, *tmp;
>>>        pte_t *_pte;
>>>        pte_t pteval;
>>>
>>> -    for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
>>> +    for (_pte = pte; _pte < pte + (1UL << order);
>>>             _pte++, address += PAGE_SIZE) {
>>>            pteval = ptep_get(_pte);
>>>            if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
>>>                add_mm_counter(vma->vm_mm, MM_ANONPAGES, 1);
>>>                if (is_zero_pfn(pte_pfn(pteval))) {
>>> -                /*
>>> -                 * ptl mostly unnecessary.
>>> -                 */
>>> -                spin_lock(ptl);
>>> -                ptep_clear(vma->vm_mm, address, _pte);
>>> -                spin_unlock(ptl);
>>> +                if (order == HPAGE_PMD_ORDER) {
>>> +                    /*
>>> +                     * ptl mostly unnecessary.
>>> +                     */
>>> +                    spin_lock(ptl);
>>> +                    ptep_clear(vma->vm_mm, address, _pte);
>>> +                    spin_unlock(ptl);
>>> +                } else {
>>> +                    ptep_clear(vma->vm_mm, address, _pte);
>>> +                }
>>>                    ksm_might_unmap_zero_page(vma->vm_mm, pteval);
>>>                }
>>>            } else {
>>> @@ -740,15 +744,20 @@ static void __collapse_huge_page_copy_succeeded(pte_t *pte,
>>>                src = page_folio(src_page);
>>>                if (!folio_test_large(src))
>>>                    release_pte_folio(src);
>>> -            /*
>>> -             * ptl mostly unnecessary, but preempt has to
>>> -             * be disabled to update the per-cpu stats
>>> -             * inside folio_remove_rmap_pte().
>>> -             */
>>> -            spin_lock(ptl);
>>> -            ptep_clear(vma->vm_mm, address, _pte);
>>> -            folio_remove_rmap_pte(src, src_page, vma);
>>> -            spin_unlock(ptl);
>>> +            if (order == HPAGE_PMD_ORDER) {
>>> +                /*
>>> +                 * ptl mostly unnecessary, but preempt has to
>>> +                 * be disabled to update the per-cpu stats
>>> +                 * inside folio_remove_rmap_pte().
>>> +                 */
>>> +                spin_lock(ptl);
>>> +                ptep_clear(vma->vm_mm, address, _pte);
>>> +                folio_remove_rmap_pte(src, src_page, vma);
>>> +                spin_unlock(ptl);
>>> +            } else {
>>> +                ptep_clear(vma->vm_mm, address, _pte);
>>> +                folio_remove_rmap_pte(src, src_page, vma);
>>> +            }
>>
>> As I've talked to Nico about this code recently ... :)
>>
>> Are you clearing the PTE after the copy succeeded? If so, where is the
>> TLB flush?
>>
>> How do you sync against concurrent write access + GUP-fast?
>>
>>
>> The sequence really must be: (1) clear PTE/PMD + flush TLB (2) check
>> if there are unexpected page references (e.g., GUP) if so back off (3)
>> copy page content (4) set updated PTE/PMD.
> 
> Thanks... we need to ensure GUP-fast does not write while we are
> copying contents, so (2) will ensure that GUP-fast sees the cleared
> PTE and backs off.

Yes, and of course also that the CPU cannot concurrently modify the 
page content while/after you copy it, but before you unmap+flush.

>>
>> To Nico, I suggested doing it simple initially, and still clear the
>> high-level PMD entry + flush under mmap write lock, then re-map the
>> PTE table after modifying the page table. It's not as efficient, but
>> "harder to get wrong".
>>
>> Maybe that's already happening, but I stumbled over this clearing
>> logic in __collapse_huge_page_copy_succeeded(), so I'm curious.
> 
> No, I am not even touching the PMD. I guess the sequence you described
> should work? I just need to reverse the copying and PTE clearing order
> to implement this sequence.

That would work, but you really have to hold the PTL for the whole 
period: from when you temporarily clear PTEs + flush the TLB, while 
you copy, until you re-insert the updated ones.

When having to back off (restore the original PTEs), or for copying, 
you'll need access to the original PTEs, which were already cleared. 
So you likely need to keep a temporary copy of the original PTEs 
somehow.

That's why temporarily clearing the PMD under the mmap write lock is 
easier to implement, at the cost of requiring the mmap lock in write 
mode, like PMD collapse.

-- 
Cheers,

David / dhildenb


