From mboxrd@z Thu Jan 1 00:00:00 1970
From: Dev Jain
To: akpm@linux-foundation.org, david@kernel.org, ljs@kernel.org,
	chrisl@kernel.org, kasong@tencent.com, hughd@google.com,
	liam@infradead.org
Cc: Dev Jain, riel@surriel.com, vbabka@kernel.org, harry@kernel.org,
	jannh@google.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	rppt@kernel.org, surenb@google.com, mhocko@suse.com,
	qi.zheng@linux.dev, shakeel.butt@linux.dev, baohua@kernel.org,
	axelrasmussen@google.com, yuanchu@google.com, weixugc@google.com,
	shikemeng@huaweicloud.com, nphamcs@gmail.com, bhe@redhat.com,
	youngjun.park@lge.com, baolin.wang@linux.alibaba.com,
	pfalcato@suse.de, ryan.roberts@arm.com, anshuman.khandual@arm.com
Subject: [PATCH v4 02/12] mm/rmap: Add try_to_unmap_hugetlb_one
Date: Tue, 26 May 2026 12:06:25 +0530
Message-Id: <20260526063635.61721-3-dev.jain@arm.com>
X-Mailer: git-send-email 2.34.1
In-Reply-To: <20260526063635.61721-1-dev.jain@arm.com>
References: <20260526063635.61721-1-dev.jain@arm.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit

Simplify try_to_unmap_one by separating out the hugetlb parts into
try_to_unmap_hugetlb_one.

To understand the correctness of the refactoring, the following points
are noted:

1. try_to_unmap() is called for hugetlb folios only when they are
   hwpoisoned.
2. A hugetlb VMA cannot be mlocked.
3. The pvmw API sets pvmw.pte to the base of the hugetlb folio (pvmw.pmd
   is NULL).
4. We won't ever process a softleaf entry that encodes a hugetlb folio;
   hugetlb folios are never swapped out, migration entries will be
   skipped (PVMW_MIGRATION not passed) and device-exclusive does not
   work for hugetlb.
5. uffd-wp bit is lost when converting pvmw.pte to hwpoison softleaf
   (therefore no need to call pte_install_uffd_wp_if_needed after
   clearing pvmw.pte).
6. TTU_HWPOISON is always present; for it to not be present, either the
   folio has to be in swapcache, or mapping_can_writeback() has to be
   true (see unmap_poisoned_folio), neither of which holds for hugetlb
   folios.
7. hugetlb uses separate counters from normal rss counters, therefore
   update_hiwater_rss() need not be called.

While at it:

 - Change VM_BUG_* to VM_WARN_*.
 - Do not declare variables which are only used once.
 - Assert that the subpage derived by the pvmw walk is exactly the head
   page. This is because try_to_unmap() does not remember the specific
   subpage which was hwpoisoned, and, since we cannot munmap/mremap
   across a hugetlb folio partially, the first pte mapping the hugetlb
   folio (in case of a contpte or contpmd mapped folio) cannot ever
   point to an intermediate page.

Signed-off-by: Dev Jain
---
 mm/rmap.c | 203 ++++++++++++++++++++++++++++++++----------------------
 1 file changed, 121 insertions(+), 82 deletions(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index 430c91c8fe2ae..06ab1158d4cd1 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1978,6 +1978,121 @@ static inline unsigned int folio_unmap_pte_batch(struct folio *folio,
 				     FPB_RESPECT_WRITE | FPB_RESPECT_SOFT_DIRTY);
 }
 
+static bool __try_to_unmap_hugetlb_one(struct folio *folio,
+		struct vm_area_struct *vma, struct page_vma_mapped_walk *pvmw,
+		struct mmu_notifier_range *range, enum ttu_flags flags)
+{
+	unsigned long hsz = huge_page_size(hstate_vma(vma));
+	unsigned long address = pvmw->address;
+	struct mm_struct *mm = vma->vm_mm;
+	struct page *subpage;
+	bool ret = true;
+	pte_t pteval;
+
+	if (!page_vma_mapped_walk(pvmw))
+		return true;
+
+	pteval = ptep_get(pvmw->pte);
+	VM_WARN_ON(!pte_present(pteval));
+	subpage = folio_page(folio, pte_pfn(pteval) - folio_pfn(folio));
+	VM_WARN_ON(folio_page(folio, 0) != subpage);
+
+	/*
+	 * huge_pmd_unshare may unmap an entire PMD page. There is no way of
+	 * knowing exactly which PMDs may be cached for this mm, so we must
+	 * flush them all. start/end were already adjusted above to cover this
+	 * range.
+	 */
+	flush_cache_range(vma, range->start, range->end);
+
+	/*
+	 * To call huge_pmd_unshare, i_mmap_rwsem must be held in write mode.
+	 * Caller needs to explicitly do this outside rmap routines.
+	 *
+	 * We also must hold hugetlb vma_lock in write mode. Lock order dictates
+	 * acquiring vma_lock BEFORE i_mmap_rwsem. We can only try lock here and
+	 * fail if unsuccessful.
+	 */
+	if (!folio_test_anon(folio)) {
+		struct mmu_gather tlb;
+
+		VM_WARN_ON(!(flags & TTU_RMAP_LOCKED));
+		if (!hugetlb_vma_trylock_write(vma)) {
+			ret = false;
+			goto walk_done;
+		}
+
+		tlb_gather_mmu_vma(&tlb, vma);
+		if (huge_pmd_unshare(&tlb, vma, address, pvmw->pte)) {
+			hugetlb_vma_unlock_write(vma);
+			huge_pmd_unshare_flush(&tlb, vma);
+			tlb_finish_mmu(&tlb);
+			/*
+			 * The PMD table was unmapped, consequently unmapping
+			 * the folio.
+			 */
+			goto walk_done;
+		}
+		hugetlb_vma_unlock_write(vma);
+		tlb_finish_mmu(&tlb);
+	}
+	pteval = huge_ptep_clear_flush(vma, address, pvmw->pte);
+	if (pte_dirty(pteval))
+		folio_mark_dirty(folio);
+
+	VM_WARN_ON(!(flags & TTU_HWPOISON));
+	pteval = swp_entry_to_pte(make_hwpoison_entry(subpage));
+	hugetlb_count_sub(folio_nr_pages(folio), mm);
+	set_huge_pte_at(mm, address, pvmw->pte, pteval, hsz);
+	hugetlb_remove_rmap(folio);
+	folio_put_refs(folio, 1);
+
+walk_done:
+	page_vma_mapped_walk_done(pvmw);
+	return ret;
+}
+
+static bool try_to_unmap_hugetlb_one(struct folio *folio,
+		struct vm_area_struct *vma, unsigned long address, void *arg)
+{
+	DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0);
+	struct mmu_notifier_range range;
+	enum ttu_flags flags = (enum ttu_flags)(long)arg;
+	bool ret;
+
+	/*
+	 * The try_to_unmap() is only passed a hugetlb folio in the case
+	 * where the hugetlb folio contains a poisoned page.
+	 */
+	VM_WARN_ON_FOLIO(!folio_contain_hwpoisoned_page(folio), folio);
+
+	/*
+	 * When racing against e.g. zap_pte_range() on another cpu,
+	 * in between its ptep_get_and_clear_full() and folio_remove_rmap_*(),
+	 * try_to_unmap() may return before folio_mapped() has become false,
+	 * if page table locking is skipped: use TTU_SYNC to wait for that.
+	 */
+	if (flags & TTU_SYNC)
+		pvmw.flags = PVMW_SYNC;
+
+	/*
+	 * For hugetlb, it could be much worse than THP if we need pud
+	 * invalidation in the case of pmd sharing.
+	 *
+	 * Note that the folio can not be freed in this function as call of
+	 * try_to_unmap() must hold a reference on the folio.
+	 */
+	range.end = vma_address_end(&pvmw);
+	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma->vm_mm,
+				address, range.end);
+	adjust_range_if_pmd_sharing_possible(vma, &range.start, &range.end);
+	mmu_notifier_invalidate_range_start(&range);
+	ret = __try_to_unmap_hugetlb_one(folio, vma, &pvmw, &range,
+					 flags);
+	mmu_notifier_invalidate_range_end(&range);
+	return ret;
+}
+
 /*
  * @arg: enum ttu_flags will be passed to this argument
  */
@@ -1993,7 +2108,6 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 	enum ttu_flags flags = (enum ttu_flags)(long)arg;
 	unsigned long nr_pages = 1, end_addr;
 	unsigned long pfn;
-	unsigned long hsz = 0;
 	int ptes = 0;
 
 	/*
@@ -2007,8 +2121,6 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 
 	/*
 	 * For THP, we have to assume the worse case ie pmd for invalidation.
-	 * For hugetlb, it could be much worse if we need to do pud
-	 * invalidation in the case of pmd sharing.
 	 *
 	 * Note that the folio can not be freed in this function as call of
 	 * try_to_unmap() must hold a reference on the folio.
@@ -2016,17 +2128,6 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 	range.end = vma_address_end(&pvmw);
 	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma->vm_mm,
 				address, range.end);
-	if (folio_test_hugetlb(folio)) {
-		/*
-		 * If sharing is possible, start and end will be adjusted
-		 * accordingly.
-		 */
-		adjust_range_if_pmd_sharing_possible(vma, &range.start,
-						     &range.end);
-
-		/* We need the huge page size for set_huge_pte_at() */
-		hsz = huge_page_size(hstate_vma(vma));
-	}
 	mmu_notifier_invalidate_range_start(&range);
 
 	while (page_vma_mapped_walk(&pvmw)) {
@@ -2104,7 +2205,6 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 			const softleaf_t entry = softleaf_from_pte(pteval);
 
 			pfn = softleaf_to_pfn(entry);
-			VM_WARN_ON_FOLIO(folio_test_hugetlb(folio), folio);
 		}
 
 		subpage = folio_page(folio, pfn - folio_pfn(folio));
@@ -2112,59 +2212,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 		anon_exclusive = folio_test_anon(folio) &&
 				 PageAnonExclusive(subpage);
 
-		if (folio_test_hugetlb(folio)) {
-			bool anon = folio_test_anon(folio);
-
-			/*
-			 * The try_to_unmap() is only passed a hugetlb folio
-			 * in the case where the hugetlb folio contains a
-			 * poisoned page.
-			 */
-			VM_WARN_ON_FOLIO(!folio_contain_hwpoisoned_page(folio), folio);
-			/*
-			 * huge_pmd_unshare may unmap an entire PMD page.
-			 * There is no way of knowing exactly which PMDs may
-			 * be cached for this mm, so we must flush them all.
-			 * start/end were already adjusted above to cover this
-			 * range.
-			 */
-			flush_cache_range(vma, range.start, range.end);
-
-			/*
-			 * To call huge_pmd_unshare, i_mmap_rwsem must be
-			 * held in write mode. Caller needs to explicitly
-			 * do this outside rmap routines.
-			 *
-			 * We also must hold hugetlb vma_lock in write mode.
-			 * Lock order dictates acquiring vma_lock BEFORE
-			 * i_mmap_rwsem. We can only try lock here and fail
-			 * if unsuccessful.
-			 */
-			if (!anon) {
-				struct mmu_gather tlb;
-
-				VM_BUG_ON(!(flags & TTU_RMAP_LOCKED));
-				if (!hugetlb_vma_trylock_write(vma))
-					goto walk_abort;
-
-				tlb_gather_mmu_vma(&tlb, vma);
-				if (huge_pmd_unshare(&tlb, vma, address, pvmw.pte)) {
-					hugetlb_vma_unlock_write(vma);
-					huge_pmd_unshare_flush(&tlb, vma);
-					tlb_finish_mmu(&tlb);
-					/*
-					 * The PMD table was unmapped,
-					 * consequently unmapping the folio.
-					 */
-					goto walk_done;
-				}
-				hugetlb_vma_unlock_write(vma);
-				tlb_finish_mmu(&tlb);
-			}
-			pteval = huge_ptep_clear_flush(vma, address, pvmw.pte);
-			if (pte_dirty(pteval))
-				folio_mark_dirty(folio);
-		} else if (likely(pte_present(pteval))) {
+		if (likely(pte_present(pteval))) {
 			nr_pages = folio_unmap_pte_batch(folio, &pvmw, flags, pteval);
 			end_addr = address + nr_pages * PAGE_SIZE;
 			flush_cache_range(vma, address, end_addr);
@@ -2201,14 +2249,8 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 
 		if (folio_contain_hwpoisoned_page(folio) && (flags & TTU_HWPOISON)) {
 			pteval = swp_entry_to_pte(make_hwpoison_entry(subpage));
-			if (folio_test_hugetlb(folio)) {
-				hugetlb_count_sub(folio_nr_pages(folio), mm);
-				set_huge_pte_at(mm, address, pvmw.pte, pteval,
-						hsz);
-			} else {
-				dec_mm_counter(mm, mm_counter(folio));
-				set_pte_at(mm, address, pvmw.pte, pteval);
-			}
+			dec_mm_counter(mm, mm_counter(folio));
+			set_pte_at(mm, address, pvmw.pte, pteval);
 		} else if (likely(pte_present(pteval)) && pte_unused(pteval) &&
 			   !userfaultfd_armed(vma)) {
 			/*
@@ -2341,11 +2383,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 			add_mm_counter(mm, mm_counter_file(folio), -nr_pages);
 		}
 discard:
-		if (unlikely(folio_test_hugetlb(folio))) {
-			hugetlb_remove_rmap(folio);
-		} else {
-			folio_remove_rmap_ptes(folio, subpage, nr_pages, vma);
-		}
+		folio_remove_rmap_ptes(folio, subpage, nr_pages, vma);
 		if (vma->vm_flags & VM_LOCKED)
 			mlock_drain_local();
 		folio_put_refs(folio, nr_pages);
@@ -2393,7 +2431,8 @@ static int folio_not_mapped(struct folio *folio)
 void try_to_unmap(struct folio *folio, enum ttu_flags flags)
 {
 	struct rmap_walk_control rwc = {
-		.rmap_one = try_to_unmap_one,
+		.rmap_one = folio_test_hugetlb(folio) ?
+			try_to_unmap_hugetlb_one : try_to_unmap_one,
 		.arg = (void *)flags,
 		.done = folio_not_mapped,
 		.anon_lock = folio_lock_anon_vma_read,
-- 
2.34.1