LinuxPPC-Dev Archive on lore.kernel.org
 help / color / mirror / Atom feed
From: Dev Jain <dev.jain@arm.com>
To: Yin Tirui <yintirui@huawei.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Matthew Wilcox <willy@infradead.org>,
	David Hildenbrand <david@kernel.org>,
	Lorenzo Stoakes <ljs@kernel.org>, Juergen Gross <jgross@suse.com>,
	Jonathan Cameron <jic23@kernel.org>,
	Will Deacon <will@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>,
	Peter Xu <peterx@redhat.com>,
	Luiz Capitulino <luizcap@redhat.com>,
	Thomas Gleixner <tglx@kernel.org>, Ingo Molnar <mingo@redhat.com>,
	Borislav Petkov <bp@alien8.de>,
	Dave Hansen <dave.hansen@linux.intel.com>,
	"H . Peter Anvin" <hpa@zytor.com>,
	Andy Lutomirski <luto@kernel.org>,
	Peter Zijlstra <peterz@infradead.org>,
	Madhavan Srinivasan <maddy@linux.ibm.com>,
	Michael Ellerman <mpe@ellerman.id.au>,
	Nicholas Piggin <npiggin@gmail.com>,
	Christophe Leroy <chleroy@kernel.org>,
	"Liam R . Howlett" <liam@infradead.org>, Zi Yan <ziy@nvidia.com>,
	Baolin Wang <baolin.wang@linux.alibaba.com>,
	Nico Pache <npache@redhat.com>,
	Ryan Roberts <ryan.roberts@arm.com>,
	Barry Song <baohua@kernel.org>, Lance Yang <lance.yang@linux.dev>,
	Vlastimil Babka <vbabka@kernel.org>,
	Mike Rapoport <rppt@kernel.org>,
	Suren Baghdasaryan <surenb@google.com>,
	Michal Hocko <mhocko@suse.com>,
	Anshuman Khandual <anshuman.khandual@arm.com>,
	Rohan McLure <rmclure@linux.ibm.com>,
	Kevin Brodsky <kevin.brodsky@arm.com>,
	Alistair Popple <apopple@nvidia.com>,
	Andrew Donnellan <andrew+kernel@donnellan.id.au>,
	Pasha Tatashin <pasha.tatashin@soleen.com>,
	Baoquan He <bhe@redhat.com>, Thomas Huth <thuth@redhat.com>,
	Coiby Xu <coxu@redhat.com>, Dan Williams <djbw@kernel.org>,
	Yu-cheng Yu <yu-cheng.yu@intel.com>,
	Lu Baolu <baolu.lu@linux.intel.com>,
	Conor Dooley <conor.dooley@microchip.com>,
	Rik van Riel <riel@surriel.com>,
	wangkefeng.wang@huawei.com, chenjun102@huawei.com,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org, x86@kernel.org,
	linux-arm-kernel@lists.infradead.org,
	linuxppc-dev@lists.ozlabs.org, linux-pm@vger.kernel.org
Subject: Re: [PATCH mm-unstable RFC v4 4/7] mm/huge_memory: refactor copy_huge_pmd()
Date: Wed, 27 May 2026 17:54:05 +0530	[thread overview]
Message-ID: <38ee9d6e-4331-48b0-82e4-2e6ae0aee705@arm.com> (raw)
In-Reply-To: <20260526145003.88445-5-yintirui@huawei.com>



On 26/05/26 8:20 pm, Yin Tirui wrote:
> Classify the source PMD via pmd_present() and vm_normal_folio_pmd(),
> matching the way the PTE path uses pte_present() and vm_normal_page().
> This moves the present-PMD decision from VMA identity checks to the
> actual PMD/folio state.
> 
> Drop the defensive "if (!pmd_trans_huge(pmd)) goto out_unlock" branch:
> with mmap_write_lock held during fork, it should not occur.

Split this from this patch? This is a functional change so will be better
to review it separately.

> 
> Extract the present-PMD side of copy_huge_pmd() into
> copy_present_huge_pmd(). The helper owns the child pgtable passed by the
> caller: it either deposits the pgtable when installing a copied PMD, or
> frees it on paths that do not install one.
> 
> The child pgtable is now allocated once up front and freed on every skip
> path. This makes file/shmem and PFNMAP/special skip paths take the PMD
> locks and free the preallocated pgtable before returning. These are not
> expected to be hot paths, and the PFNMAP case is reused by the follow-up
> PMD PFNMAP copy support.
> 
> Signed-off-by: Yin Tirui <yintirui@huawei.com>
> ---

Since the series is still marked RFC, I think it would be better to send
the refactor patches separately?

>  mm/huge_memory.c | 175 +++++++++++++++++++++++++----------------------
>  1 file changed, 95 insertions(+), 80 deletions(-)
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 9832ee910d5e..3964258ff91d 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1879,6 +1879,82 @@ bool touch_pmd(struct vm_area_struct *vma, unsigned long addr,
>  	return false;
>  }
>  
> +static int copy_present_huge_pmd(
> +		struct mm_struct *dst_mm, struct mm_struct *src_mm,
> +		pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
> +		struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
> +		pmd_t pmd, pgtable_t pgtable, bool *need_split)
> +{
> +	struct folio *src_folio;
> +	bool wrprotect = true;
> +
> +	src_folio = vm_normal_folio_pmd(src_vma, addr, pmd);
> +	if (!src_folio) {
> +		/*
> +		 * When page table lock is held, the huge zero pmd should not be
> +		 * under splitting since we don't split the page itself, only pmd to
> +		 * a page table.
> +		 */
> +		if (is_huge_zero_pmd(pmd)) {
> +			/*
> +			 * mm_get_huge_zero_folio() will never allocate a new
> +			 * folio here, since we already have a zero page to
> +			 * copy. It just takes a reference.
> +			 */
> +			mm_get_huge_zero_folio(dst_mm);
> +			goto set_pmd;
> +		}
> +
> +		/*
> +		 * Making sure it's not a CoW VMA with writable
> +		 * mapping, otherwise it means either the anon page wrongly
> +		 * applied special bit, or we made the PRIVATE mapping be
> +		 * able to wrongly write to the backend MMIO.
> +		 */
> +		VM_WARN_ON_ONCE(is_cow_mapping(src_vma->vm_flags) && pmd_write(pmd));
> +		pte_free(dst_mm, pgtable);
> +		pgtable = NULL;
> +		wrprotect = false;
> +		goto set_pmd;
> +	}
> +
> +	/* File THPs are copied lazily by refaulting. */
> +	if (!folio_test_anon(src_folio)) {
> +		pte_free(dst_mm, pgtable);
> +		return 0;
> +	}


You removed !vma_is_anonymous to condense it into !folio_test_anon.
For private-file mappings that is not true, but okay since PMD mapping
is not supported.

> +
> +	folio_get(src_folio);
> +	if (unlikely(folio_try_dup_anon_rmap_pmd(src_folio,
> +							&src_folio->page,
> +							dst_vma, src_vma))) {
> +		/* Page maybe pinned: split and retry the fault on PTEs. */
> +		folio_put(src_folio);
> +		pte_free(dst_mm, pgtable);
> +		*need_split = true;
> +		return -EAGAIN;
> +	}
> +	add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
> +
> +set_pmd:
> +	if (pgtable) {
> +		mm_inc_nr_ptes(dst_mm);
> +		pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
> +	}
> +
> +	if (wrprotect) {
> +		pmdp_set_wrprotect(src_mm, addr, src_pmd);
> +		if (!userfaultfd_wp(dst_vma))
> +			pmd = pmd_clear_uffd_wp(pmd);
> +		pmd = pmd_wrprotect(pmd);
> +	}
> +
> +	pmd = pmd_mkold(pmd);
> +	set_pmd_at(dst_mm, addr, dst_pmd, pmd);
> +
> +	return 0;
> +}
> +
>  static void copy_huge_non_present_pmd(
>  		struct mm_struct *dst_mm, struct mm_struct *src_mm,
>  		pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
> @@ -1940,104 +2016,43 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>  		  struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
>  {
>  	spinlock_t *dst_ptl, *src_ptl;
> -	struct page *src_page;
> -	struct folio *src_folio;
> -	pmd_t pmd;
>  	pgtable_t pgtable = NULL;
> -	int ret = -ENOMEM;
> -
> -	pmd = pmdp_get_lockless(src_pmd);
> -	if (unlikely(pmd_present(pmd) && pmd_special(pmd) &&
> -		     !is_huge_zero_pmd(pmd))) {
> -		dst_ptl = pmd_lock(dst_mm, dst_pmd);
> -		src_ptl = pmd_lockptr(src_mm, src_pmd);
> -		spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
> -		/*
> -		 * No need to recheck the pmd, it can't change with write
> -		 * mmap lock held here.
> -		 *
> -		 * Meanwhile, making sure it's not a CoW VMA with writable
> -		 * mapping, otherwise it means either the anon page wrongly
> -		 * applied special bit, or we made the PRIVATE mapping be
> -		 * able to wrongly write to the backend MMIO.
> -		 */
> -		VM_WARN_ON_ONCE(is_cow_mapping(src_vma->vm_flags) && pmd_write(pmd));
> -		goto set_pmd;
> -	}
> -
> -	/* Skip if can be re-fill on fault */
> -	if (!vma_is_anonymous(dst_vma))
> -		return 0;
> +	bool need_split = false;
> +	int ret = 0;
> +	pmd_t pmd;
>  
>  	pgtable = pte_alloc_one(dst_mm);
>  	if (unlikely(!pgtable))
> -		goto out;
> +		return -ENOMEM;
>  
>  	dst_ptl = pmd_lock(dst_mm, dst_pmd);
>  	src_ptl = pmd_lockptr(src_mm, src_pmd);
>  	spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
>  
> -	ret = -EAGAIN;
>  	pmd = *src_pmd;
>  
> -	if (unlikely(thp_migration_supported() &&
> -		     pmd_is_valid_softleaf(pmd))) {
> -		copy_huge_non_present_pmd(dst_mm, src_mm, dst_pmd, src_pmd, addr,
> +	if (likely(pmd_present(pmd))) {
> +		ret = copy_present_huge_pmd(dst_mm, src_mm, dst_pmd, src_pmd, addr,
> +					  dst_vma, src_vma, pmd, pgtable, &need_split);
> +	} else if (unlikely(thp_migration_supported() && pmd_is_valid_softleaf(pmd))) {
> +		if (unlikely(!vma_is_anonymous(dst_vma)))
> +			pte_free(dst_mm, pgtable);
> +		else
> +			copy_huge_non_present_pmd(dst_mm, src_mm, dst_pmd, src_pmd, addr,
>  					  dst_vma, src_vma, pmd, pgtable);
> -		ret = 0;
> -		goto out_unlock;
> -	}
> -
> -	if (unlikely(!pmd_trans_huge(pmd))) {
> +	} else {
> +		VM_WARN_ONCE(1, "unexpected non-present PMD %llx\n",
> +				(unsigned long long)pmd_val(pmd));
>  		pte_free(dst_mm, pgtable);
> -		goto out_unlock;
> -	}
> -	/*
> -	 * When page table lock is held, the huge zero pmd should not be
> -	 * under splitting since we don't split the page itself, only pmd to
> -	 * a page table.
> -	 */
> -	if (is_huge_zero_pmd(pmd)) {
> -		/*
> -		 * mm_get_huge_zero_folio() will never allocate a new
> -		 * folio here, since we already have a zero page to
> -		 * copy. It just takes a reference.
> -		 */
> -		mm_get_huge_zero_folio(dst_mm);
> -		goto out_zero_page;
> +		ret = -EAGAIN;
>  	}
>  
> -	src_page = pmd_page(pmd);
> -	VM_BUG_ON_PAGE(!PageHead(src_page), src_page);
> -	src_folio = page_folio(src_page);
> +	spin_unlock(src_ptl);
> +	spin_unlock(dst_ptl);
>  
> -	folio_get(src_folio);
> -	if (unlikely(folio_try_dup_anon_rmap_pmd(src_folio, src_page, dst_vma, src_vma))) {
> -		/* Page maybe pinned: split and retry the fault on PTEs. */
> -		folio_put(src_folio);
> -		pte_free(dst_mm, pgtable);
> -		spin_unlock(src_ptl);
> -		spin_unlock(dst_ptl);
> +	if (unlikely(need_split))
>  		__split_huge_pmd(src_vma, src_pmd, addr, false);
> -		return -EAGAIN;
> -	}
> -	add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
> -out_zero_page:
> -	mm_inc_nr_ptes(dst_mm);
> -	pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
> -	pmdp_set_wrprotect(src_mm, addr, src_pmd);
> -	if (!userfaultfd_wp(dst_vma))
> -		pmd = pmd_clear_uffd_wp(pmd);
> -	pmd = pmd_wrprotect(pmd);
> -set_pmd:
> -	pmd = pmd_mkold(pmd);
> -	set_pmd_at(dst_mm, addr, dst_pmd, pmd);
>  
> -	ret = 0;
> -out_unlock:
> -	spin_unlock(src_ptl);
> -	spin_unlock(dst_ptl);
> -out:
>  	return ret;
>  }
>  



  reply	other threads:[~2026-05-27 19:08 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-05-26 14:49 [PATCH mm-unstable RFC v4 0/7] mm: add huge pfnmap support for remap_pfn_range() Yin Tirui
2026-05-26 14:49 ` [PATCH mm-unstable RFC v4 1/7] x86/mm: use PTE-level pgprot for huge PFN helpers Yin Tirui
2026-05-26 14:49 ` [PATCH mm-unstable RFC v4 2/7] arm64/mm: " Yin Tirui
2026-05-26 14:49 ` [PATCH mm-unstable RFC v4 3/7] powerpc/mm: " Yin Tirui
2026-05-26 14:50 ` [PATCH mm-unstable RFC v4 4/7] mm/huge_memory: refactor copy_huge_pmd() Yin Tirui
2026-05-27 12:24   ` Dev Jain [this message]
2026-05-26 14:50 ` [PATCH mm-unstable RFC v4 5/7] mm/huge_memory: refactor __split_huge_pmd_locked() Yin Tirui
2026-05-26 14:50 ` [PATCH mm-unstable RFC v4 6/7] mm/huge_memory: make move_huge_pmd() use has_deposited_pgtable() Yin Tirui
2026-05-26 14:50 ` [PATCH mm-unstable RFC v4 7/7] mm: add PMD-level PFNMAP support for remap_pfn_range() Yin Tirui
2026-05-26 15:33 ` [PATCH mm-unstable RFC v4 0/7] mm: add huge pfnmap " Lorenzo Stoakes
2026-05-27  2:57   ` Yin Tirui

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=38ee9d6e-4331-48b0-82e4-2e6ae0aee705@arm.com \
    --to=dev.jain@arm.com \
    --cc=akpm@linux-foundation.org \
    --cc=andrew+kernel@donnellan.id.au \
    --cc=anshuman.khandual@arm.com \
    --cc=apopple@nvidia.com \
    --cc=baohua@kernel.org \
    --cc=baolin.wang@linux.alibaba.com \
    --cc=baolu.lu@linux.intel.com \
    --cc=bhe@redhat.com \
    --cc=bp@alien8.de \
    --cc=catalin.marinas@arm.com \
    --cc=chenjun102@huawei.com \
    --cc=chleroy@kernel.org \
    --cc=conor.dooley@microchip.com \
    --cc=coxu@redhat.com \
    --cc=dave.hansen@linux.intel.com \
    --cc=david@kernel.org \
    --cc=djbw@kernel.org \
    --cc=hpa@zytor.com \
    --cc=jgross@suse.com \
    --cc=jic23@kernel.org \
    --cc=kevin.brodsky@arm.com \
    --cc=lance.yang@linux.dev \
    --cc=liam@infradead.org \
    --cc=linux-arm-kernel@lists.infradead.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=linux-pm@vger.kernel.org \
    --cc=linuxppc-dev@lists.ozlabs.org \
    --cc=ljs@kernel.org \
    --cc=luizcap@redhat.com \
    --cc=luto@kernel.org \
    --cc=maddy@linux.ibm.com \
    --cc=mhocko@suse.com \
    --cc=mingo@redhat.com \
    --cc=mpe@ellerman.id.au \
    --cc=npache@redhat.com \
    --cc=npiggin@gmail.com \
    --cc=pasha.tatashin@soleen.com \
    --cc=peterx@redhat.com \
    --cc=peterz@infradead.org \
    --cc=riel@surriel.com \
    --cc=rmclure@linux.ibm.com \
    --cc=rppt@kernel.org \
    --cc=ryan.roberts@arm.com \
    --cc=surenb@google.com \
    --cc=tglx@kernel.org \
    --cc=thuth@redhat.com \
    --cc=vbabka@kernel.org \
    --cc=wangkefeng.wang@huawei.com \
    --cc=will@kernel.org \
    --cc=willy@infradead.org \
    --cc=x86@kernel.org \
    --cc=yintirui@huawei.com \
    --cc=yu-cheng.yu@intel.com \
    --cc=ziy@nvidia.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox