linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
To: David Hildenbrand <david@redhat.com>
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	xen-devel@lists.xenproject.org, linux-fsdevel@vger.kernel.org,
	nvdimm@lists.linux.dev, Andrew Morton <akpm@linux-foundation.org>,
	Juergen Gross <jgross@suse.com>,
	Stefano Stabellini <sstabellini@kernel.org>,
	Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>,
	Dan Williams <dan.j.williams@intel.com>,
	Matthew Wilcox <willy@infradead.org>, Jan Kara <jack@suse.cz>,
	Alexander Viro <viro@zeniv.linux.org.uk>,
	Christian Brauner <brauner@kernel.org>,
	"Liam R. Howlett" <Liam.Howlett@oracle.com>,
	Vlastimil Babka <vbabka@suse.cz>, Mike Rapoport <rppt@kernel.org>,
	Suren Baghdasaryan <surenb@google.com>,
	Michal Hocko <mhocko@suse.com>, Zi Yan <ziy@nvidia.com>,
	Baolin Wang <baolin.wang@linux.alibaba.com>,
	Nico Pache <npache@redhat.com>,
	Ryan Roberts <ryan.roberts@arm.com>, Dev Jain <dev.jain@arm.com>,
	Barry Song <baohua@kernel.org>, Jann Horn <jannh@google.com>,
	Pedro Falcato <pfalcato@suse.de>, Hugh Dickins <hughd@google.com>,
	Oscar Salvador <osalvador@suse.de>,
	Lance Yang <lance.yang@linux.dev>
Subject: Re: [PATCH v2 5/9] mm/huge_memory: mark PMD mappings of the huge zero folio special
Date: Thu, 17 Jul 2025 19:29:56 +0100	[thread overview]
Message-ID: <46c9a90c-46b8-4136-9890-b9b2b97ee1bb@lucifer.local> (raw)
In-Reply-To: <20250717115212.1825089-6-david@redhat.com>

On Thu, Jul 17, 2025 at 01:52:08PM +0200, David Hildenbrand wrote:
> The huge zero folio is refcounted (+mapcounted -- is that a word?)
> differently than "normal" folios, similarly (but different) to the ordinary
> shared zeropage.

Yeah, I sort of wonder if we shouldn't just _not_ do any of that with zero
pages?

But for some reason the huge zero page wants to exist or not exist based on
usage for one. Which is stupid to me.

>
> For this reason, we special-case these pages in
> vm_normal_page*/vm_normal_folio*, and only allow selected callers to
> still use them (e.g., GUP can still take a reference on them).
>
> vm_normal_page_pmd() already filters out the huge zero folio. However,
> so far we are not marking it as special like we do with the ordinary
> shared zeropage. Let's mark it as special, so we can further refactor
> vm_normal_page_pmd() and vm_normal_page().
>
> While at it, update the doc regarding the shared zero folios.

Hmm I wonder how this will interact with the static PMD series at [0]?

I wonder if more use of that might result in some weirdness with refcounting
etc.?

Also, that series was (though I reviewed against it) moving stuff that
references the huge zero folio out of there, but also generally allows
access and mapping of this folio via largest_zero_folio() so not only via
insert_pmd().

So we're going to end up with mappings of this that are not marked special
that are potentially going to have refcount/mapcount manipulation that
contradict what you're doing here perhaps?

[0]: https://lore.kernel.org/all/20250707142319.319642-1-kernel@pankajraghav.com/

>
> Reviewed-by: Oscar Salvador <osalvador@suse.de>
> Signed-off-by: David Hildenbrand <david@redhat.com>

I looked thorugh places that use vm_normal_page_pm() (other than decl of
function):

fs/proc/task_mmu.c - seems to handle NULL page correctly + still undertsands zero page
mm/pagewalk.c - correctly handles NULL page + huge zero page
mm/huge_memory.c - can_change_pmd_writable() correctly returns false.

And all seems to work wtih this change.

Overall, other than concerns above + nits below LGTM, we should treat all
the zero folios the same in this regard, so:

Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

> ---
>  mm/huge_memory.c |  5 ++++-
>  mm/memory.c      | 14 +++++++++-----
>  2 files changed, 13 insertions(+), 6 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index db08c37b87077..3f9a27812a590 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1320,6 +1320,7 @@ static void set_huge_zero_folio(pgtable_t pgtable, struct mm_struct *mm,
>  {
>  	pmd_t entry;
>  	entry = folio_mk_pmd(zero_folio, vma->vm_page_prot);
> +	entry = pmd_mkspecial(entry);
>  	pgtable_trans_huge_deposit(mm, pmd, pgtable);
>  	set_pmd_at(mm, haddr, pmd, entry);
>  	mm_inc_nr_ptes(mm);
> @@ -1429,7 +1430,9 @@ static vm_fault_t insert_pmd(struct vm_area_struct *vma, unsigned long addr,
>  	if (fop.is_folio) {
>  		entry = folio_mk_pmd(fop.folio, vma->vm_page_prot);
>
> -		if (!is_huge_zero_folio(fop.folio)) {
> +		if (is_huge_zero_folio(fop.folio)) {
> +			entry = pmd_mkspecial(entry);
> +		} else {
>  			folio_get(fop.folio);
>  			folio_add_file_rmap_pmd(fop.folio, &fop.folio->page, vma);
>  			add_mm_counter(mm, mm_counter_file(fop.folio), HPAGE_PMD_NR);
> diff --git a/mm/memory.c b/mm/memory.c
> index 92fd18a5d8d1f..173eb6267e0ac 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -537,7 +537,13 @@ static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
>   *
>   * "Special" mappings do not wish to be associated with a "struct page" (either
>   * it doesn't exist, or it exists but they don't want to touch it). In this
> - * case, NULL is returned here. "Normal" mappings do have a struct page.
> + * case, NULL is returned here. "Normal" mappings do have a struct page and
> + * are ordinarily refcounted.
> + *
> + * Page mappings of the shared zero folios are always considered "special", as
> + * they are not ordinarily refcounted. However, selected page table walkers
> + * (such as GUP) can still identify these mappings and work with the
> + * underlying "struct page".

I feel like we need more detail or something more explicit about what 'not
ordinary' refcounting constitutes. This is a bit vague.

>   *
>   * There are 2 broad cases. Firstly, an architecture may define a pte_special()
>   * pte bit, in which case this function is trivial. Secondly, an architecture
> @@ -567,9 +573,8 @@ static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
>   *
>   * VM_MIXEDMAP mappings can likewise contain memory with or without "struct
>   * page" backing, however the difference is that _all_ pages with a struct
> - * page (that is, those where pfn_valid is true) are refcounted and considered
> - * normal pages by the VM. The only exception are zeropages, which are
> - * *never* refcounted.
> + * page (that is, those where pfn_valid is true, except the shared zero
> + * folios) are refcounted and considered normal pages by the VM.
>   *
>   * The disadvantage is that pages are refcounted (which can be slower and
>   * simply not an option for some PFNMAP users). The advantage is that we
> @@ -649,7 +654,6 @@ struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,

You know I"m semi-ashamed to admit I didn't even know this function
exists. But yikes that we have a separate function like this just for PMDs.

>  {
>  	unsigned long pfn = pmd_pfn(pmd);
>
> -	/* Currently it's only used for huge pfnmaps */
>  	if (unlikely(pmd_special(pmd)))
>  		return NULL;
>
> --
> 2.50.1
>

  reply	other threads:[~2025-07-17 18:30 UTC|newest]

Thread overview: 48+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-07-17 11:52 [PATCH v2 0/9] mm: vm_normal_page*() improvements David Hildenbrand
2025-07-17 11:52 ` [PATCH v2 1/9] mm/huge_memory: move more common code into insert_pmd() David Hildenbrand
2025-07-17 15:34   ` Lorenzo Stoakes
2025-07-25  2:47   ` Wei Yang
2025-07-17 11:52 ` [PATCH v2 2/9] mm/huge_memory: move more common code into insert_pud() David Hildenbrand
2025-07-17 15:42   ` Lorenzo Stoakes
2025-07-25  2:56   ` Wei Yang
2025-07-17 11:52 ` [PATCH v2 3/9] mm/huge_memory: support huge zero folio in vmf_insert_folio_pmd() David Hildenbrand
2025-07-17 15:47   ` Lorenzo Stoakes
2025-07-25  8:07   ` Wei Yang
2025-07-17 11:52 ` [PATCH v2 4/9] fs/dax: use vmf_insert_folio_pmd() to insert the huge zero folio David Hildenbrand
2025-07-17 18:09   ` Lorenzo Stoakes
2025-07-17 11:52 ` [PATCH v2 5/9] mm/huge_memory: mark PMD mappings of the huge zero folio special David Hildenbrand
2025-07-17 18:29   ` Lorenzo Stoakes [this message]
2025-07-17 20:31     ` David Hildenbrand
2025-07-18 10:41       ` Lorenzo Stoakes
2025-07-18 10:54         ` David Hildenbrand
2025-07-18 13:06           ` Lorenzo Stoakes
2025-07-28  8:49   ` Wei Yang
2025-07-17 11:52 ` [PATCH v2 6/9] mm/memory: convert print_bad_pte() to print_bad_page_map() David Hildenbrand
2025-07-17 19:17   ` Lorenzo Stoakes
2025-07-17 20:03     ` David Hildenbrand
2025-07-18 10:15       ` Lorenzo Stoakes
2025-07-18 11:04         ` David Hildenbrand
2025-07-18 12:55           ` Lorenzo Stoakes
2025-07-17 22:06   ` Demi Marie Obenour
2025-07-18  7:44     ` David Hildenbrand
2025-07-18  7:59       ` Demi Marie Obenour
2025-07-18  8:26         ` David Hildenbrand
2025-07-17 11:52 ` [PATCH v2 7/9] mm/memory: factor out common code from vm_normal_page_*() David Hildenbrand
2025-07-17 19:51   ` Lorenzo Stoakes
2025-07-17 19:55     ` Lorenzo Stoakes
2025-07-17 20:03       ` David Hildenbrand
2025-07-18 12:43         ` Lorenzo Stoakes
2025-07-30 12:54           ` David Hildenbrand
2025-07-30 13:24             ` Lorenzo Stoakes
2025-07-17 20:12     ` David Hildenbrand
2025-07-18 12:35       ` Lorenzo Stoakes
2025-07-17 11:52 ` [PATCH v2 8/9] mm: introduce and use vm_normal_page_pud() David Hildenbrand
2025-07-17 20:03   ` Lorenzo Stoakes
2025-07-17 20:14     ` David Hildenbrand
2025-07-18 10:47       ` Lorenzo Stoakes
2025-07-18 11:06         ` David Hildenbrand
2025-07-18 12:44           ` Lorenzo Stoakes
2025-07-29  7:52   ` Wei Yang
2025-07-17 11:52 ` [PATCH v2 9/9] mm: rename vm_ops->find_special_page() to vm_ops->find_normal_page() David Hildenbrand
2025-07-17 20:07   ` Lorenzo Stoakes
2025-07-29  7:53   ` Wei Yang

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=46c9a90c-46b8-4136-9890-b9b2b97ee1bb@lucifer.local \
    --to=lorenzo.stoakes@oracle.com \
    --cc=Liam.Howlett@oracle.com \
    --cc=akpm@linux-foundation.org \
    --cc=baohua@kernel.org \
    --cc=baolin.wang@linux.alibaba.com \
    --cc=brauner@kernel.org \
    --cc=dan.j.williams@intel.com \
    --cc=david@redhat.com \
    --cc=dev.jain@arm.com \
    --cc=hughd@google.com \
    --cc=jack@suse.cz \
    --cc=jannh@google.com \
    --cc=jgross@suse.com \
    --cc=lance.yang@linux.dev \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mhocko@suse.com \
    --cc=npache@redhat.com \
    --cc=nvdimm@lists.linux.dev \
    --cc=oleksandr_tyshchenko@epam.com \
    --cc=osalvador@suse.de \
    --cc=pfalcato@suse.de \
    --cc=rppt@kernel.org \
    --cc=ryan.roberts@arm.com \
    --cc=sstabellini@kernel.org \
    --cc=surenb@google.com \
    --cc=vbabka@suse.cz \
    --cc=viro@zeniv.linux.org.uk \
    --cc=willy@infradead.org \
    --cc=xen-devel@lists.xenproject.org \
    --cc=ziy@nvidia.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).