From: Qi Zheng <zhengqi.arch@bytedance.com>
To: David Hildenbrand <david@redhat.com>,
	akpm@linux-foundation.org, kirill.shutemov@linux.intel.com,
	jgg@nvidia.com, tglx@linutronix.de, willy@infradead.org
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, muchun.song@linux.dev
Subject: Re: [RFC PATCH 0/7] Try to free empty and zero user PTE page table pages
Date: Mon, 29 Aug 2022 22:00:47 +0800	[thread overview]
Message-ID: <68f43b57-32b6-1844-a0a6-d22fb0e089aa@bytedance.com> (raw)
In-Reply-To: <cf2dee71-0b01-1df2-f97e-12c27ed6d630@redhat.com>



On 2022/8/29 18:09, David Hildenbrand wrote:
> On 25.08.22 12:10, Qi Zheng wrote:
>> Hi,
>>
>> Before this, in order to free empty user PTE page table pages, I posted the
>> following patch sets of two solutions:
>>   - atomic refcount version:
>> 	https://lore.kernel.org/lkml/20211110105428.32458-1-zhengqi.arch@bytedance.com/
>>   - percpu refcount version:
>> 	https://lore.kernel.org/lkml/20220429133552.33768-1-zhengqi.arch@bytedance.com/
>>
>> Both patch sets have the following behavior:
>> a. Protect page table walkers by hooking pte_offset_map{_lock}() and
>>     pte_unmap{_unlock}()
>> b. Automatically reclaim PTE page table pages outside of the memory reclaim path
>>
>> For behavior a, there may be the following disadvantages mentioned by
>> David Hildenbrand:
>>   - It introduces a lot of complexity. It's not something easy to get in and most
>>     probably not easy to get out again
>>   - It is hard to extend to other architectures. For example, for the
>>     contiguous PTEs of arm64, the pointer to the PTE entry is obtained directly
>>     through pte_offset_kernel() instead of pte_offset_map{_lock}()
>>   - It has been found that pte_unmap() is missing in some places that only
>>     execute on 64-bit systems, which is a disaster for pte_refcount
>>
>> For behavior b, it may not be necessary to actively reclaim PTE pages, especially
>> when memory pressure is not high, and deferring to the reclaim path may be a
>> better choice.
>>
>> In addition, the above two solutions only handle empty PTE pages (a PTE page
>> where all entries are empty) and do not deal with the zero PTE page (a PTE
>> page where all entries map the shared zero page) mentioned by
>> David Hildenbrand:
>> 	"Especially the shared zeropage is nasty, because there are
>> 	 sane use cases that can trigger it. Assume you have a VM
>> 	 (e.g., QEMU) that inflated the balloon to return free memory
>> 	 to the hypervisor.
>>
>> 	 Simply migrating that VM will populate the shared zeropage to
>> 	 all inflated pages, because migration code ends up reading all
>> 	 VM memory. Similarly, the guest can just read that memory as
>> 	 well, for example, when the guest issues kdump itself."
>>
>> The purpose of this RFC patch is to continue the discussion and fix the above
>> issues. The following is the solution to be discussed.
> 
> Thanks for providing an alternative! It's certainly easier to digest :)

Hi David,

Nice to see your reply.

> 
>>
>> In order to quickly identify the above two types of PTE pages, we still
>> introduce a pte_refcount for each PTE page. We pack the mapped and zero PTE
>> entry counters into the pte_refcount of the PTE page. The bit layout is as
>> follows:
>>
>>   - bits 0-9:   mapped PTE entry count
>>   - bits 10-19: zero PTE entry count
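
For illustration only, the two counters could be packed into a single word
roughly as below; the shift/mask values and helper names are made up for this
sketch and are not taken from the series:

    #define PTE_MAPPED_SHIFT        0
    #define PTE_ZERO_SHIFT          10
    #define PTE_COUNT_MASK          0x3ff   /* 10 bits, enough for PTRS_PER_PTE == 512 */

    static inline unsigned int pte_mapped_count(unsigned long pte_refcount)
    {
            return (pte_refcount >> PTE_MAPPED_SHIFT) & PTE_COUNT_MASK;
    }

    static inline unsigned int pte_zero_count(unsigned long pte_refcount)
    {
            return (pte_refcount >> PTE_ZERO_SHIFT) & PTE_COUNT_MASK;
    }
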
> 
> I guess we could factor the zero PTE change out, to have an even simpler
> first version. The issue is that some features (userfaultfd) don't
> expect page faults when something was already mapped previously.

OK, we can deal with the empty PTE page case first.

> 
> PTE markers as introduced by Peter might require a thought -- we don't
> have anything mapped but do have additional information that we have to
> maintain.

I see that a PTE marker entry is a non-present entry, not an empty entry
(pte_none()), so that case is already handled; it is also what
[RFC PATCH 1/7] does.
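
In other words, a marker is a swap-style entry that is !pte_none(), so a check
along these lines (illustrative, not from the patch; the helper name is made
up) already keeps the counter elevated:

    static inline bool pte_slot_in_use(pte_t pte)
    {
            /* PTE markers (and swap/migration entries) are !pte_none() but
             * !pte_present(): they keep the mapped counter elevated, so a
             * PTE page holding only markers is never treated as empty. */
            return !pte_none(pte);
    }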

> 
>>
>> In this way, when the mapped PTE entry count is 0, we know that the current PTE
>> page is an empty PTE page, and when the zero PTE entry count is PTRS_PER_PTE, we
>> know that the current PTE page is a zero PTE page.
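
To make the two checks concrete, continuing the made-up helpers from the
sketch above:

    /* Illustrative only: how the two special cases could be detected. */
    static inline bool pte_page_is_empty(unsigned long pte_refcount)
    {
            return pte_mapped_count(pte_refcount) == 0;
    }

    static inline bool pte_page_is_zero(unsigned long pte_refcount)
    {
            return pte_zero_count(pte_refcount) == PTRS_PER_PTE;
    }
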
>>
>> We only update the pte_refcount when setting and clearing a PTE entry, and
>> since both operations are protected by the pte lock, pte_refcount can be a
>> non-atomic variable with little performance overhead.
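
[RFC PATCH 5/7] adds track_pte_{set,clear}() helpers for this; purely as a
guess at their shape (the pte_refcount field placement and all details below
are assumptions of this sketch, not the actual patch):

    /* Caller holds the pte lock of this PTE page, so a plain (non-atomic)
     * read-modify-write of the packed counters is sufficient. */
    static inline void track_pte_set(struct page *pte_page, pte_t pte)
    {
            pte_page->pte_refcount += 1UL << PTE_MAPPED_SHIFT;
            if (pte_present(pte) && is_zero_pfn(pte_pfn(pte)))
                    pte_page->pte_refcount += 1UL << PTE_ZERO_SHIFT;
    }

    static inline void track_pte_clear(struct page *pte_page, pte_t pte)
    {
            pte_page->pte_refcount -= 1UL << PTE_MAPPED_SHIFT;
            if (pte_present(pte) && is_zero_pfn(pte_pfn(pte)))
                    pte_page->pte_refcount -= 1UL << PTE_ZERO_SHIFT;
    }
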
>>
>> For page table walkers, we exclude them by holding the write lock of
>> mmap_lock when doing pmd_clear() (in the newly added path that reclaims PTE
>> pages).
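
As a very rough sketch of that reclaim step, reusing the hypothetical helpers
above (the function below is invented for illustration, assumes an
x86-64-style pgtable_t, and ignores the rmap issue raised next):

    /* Caller holds mmap_write_lock(mm); heavily simplified sketch. */
    static void try_to_free_user_pte(struct mm_struct *mm, pmd_t *pmd)
    {
            spinlock_t *ptl;
            pgtable_t page;

            ptl = pmd_lock(mm, pmd);
            if (pmd_none(*pmd) || pmd_leaf(*pmd)) {
                    spin_unlock(ptl);
                    return;
            }
            page = pmd_pgtable(*pmd);
            if (!pte_page_is_empty(page->pte_refcount)) {
                    spin_unlock(ptl);
                    return;
            }
            pmd_clear(pmd);
            spin_unlock(ptl);

            flush_tlb_mm(mm);       /* could be narrowed to the 2M range */
            mm_dec_nr_ptes(mm);
            pte_free(mm, page);
    }
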
> 
> I recall when I played with that idea that the mmap_lock is not
> sufficient to rip out a page table. IIRC, we also have to hold the rmap
> lock(s), to prevent RMAP walkers from still using the page table.

Oh, I forgot about that. We should also hold the rmap lock(s), as
move_normal_pmd() does.
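
For reference, the rmap-side locking that move_normal_pmd() relies on looks
roughly like take_rmap_locks() in mm/mremap.c; something along these lines
(sketch only) would have to be taken before detaching the PTE page:

    /* Block rmap walkers of this VMA before ripping out its page table,
     * in the same spirit as take_rmap_locks() in mm/mremap.c. */
    static void lock_vma_rmap(struct vm_area_struct *vma)
    {
            if (vma->vm_file)
                    i_mmap_lock_write(vma->vm_file->f_mapping);
            if (vma->anon_vma)
                    anon_vma_lock_write(vma->anon_vma);
    }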

> 
> Especially if multiple VMAs intersect a page table, things might get
> tricky, because multiple rmap locks could be involved.

Maybe we can iterate over the VMA list and only process the 2M-aligned part
of each VMA?
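
Something like the following, illustrative only (try_to_free_user_pte() is
the hypothetical helper sketched earlier):

    /* Only look at the PMD-aligned portion of each VMA, so every candidate
     * PTE page is fully covered by this VMA and only its rmap locks are
     * needed. */
    unsigned long addr;
    unsigned long start = ALIGN(vma->vm_start, PMD_SIZE);
    unsigned long end = ALIGN_DOWN(vma->vm_end, PMD_SIZE);

    for (addr = start; addr < end; addr += PMD_SIZE)
            try_to_free_user_pte(vma->vm_mm, pmd_off(vma->vm_mm, addr));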

> 
> We might want/need another mechanism to synchronize against page table
> walkers.

This is a tricky problem; it amounts to narrowing the protection scope of
mmap_lock. Do you have any preliminary ideas?

Thanks,
Qi

> 

-- 
Thanks,
Qi


Thread overview: 10+ messages
2022-08-25 10:10 [RFC PATCH 0/7] Try to free empty and zero user PTE page table pages Qi Zheng
2022-08-25 10:10 ` [RFC PATCH 1/7] mm: use ptep_clear() in non-present cases Qi Zheng
2022-08-25 10:10 ` [RFC PATCH 2/7] mm: introduce CONFIG_FREE_USER_PTE Qi Zheng
2022-08-25 10:10 ` [RFC PATCH 3/7] mm: add pte_to_page() helper Qi Zheng
2022-08-25 10:10 ` [RFC PATCH 4/7] mm: introduce pte_refcount for user PTE page table page Qi Zheng
2022-08-25 10:10 ` [RFC PATCH 5/7] pte_ref: add track_pte_{set, clear}() helper Qi Zheng
2022-08-25 10:10 ` [RFC PATCH 6/7] x86/mm: add x86_64 support for pte_ref Qi Zheng
2022-08-25 10:10 ` [RFC PATCH 7/7] mm: add proc interface to free user PTE page table pages Qi Zheng
2022-08-29 10:09 ` [RFC PATCH 0/7] Try to free empty and zero " David Hildenbrand
2022-08-29 14:00   ` Qi Zheng [this message]
