From: David Hildenbrand <david@redhat.com>
To: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org, David Hildenbrand <david@redhat.com>,
Andrew Morton <akpm@linux-foundation.org>,
Linus Torvalds <torvalds@linux-foundation.org>,
Ryan Roberts <ryan.roberts@arm.com>,
Matthew Wilcox <willy@infradead.org>,
Hugh Dickins <hughd@google.com>,
Yin Fengwei <fengwei.yin@intel.com>,
Yang Shi <shy828301@gmail.com>, Ying Huang <ying.huang@intel.com>,
Zi Yan <ziy@nvidia.com>, Peter Zijlstra <peterz@infradead.org>,
Ingo Molnar <mingo@redhat.com>, Will Deacon <will@kernel.org>,
Waiman Long <longman@redhat.com>,
"Paul E. McKenney" <paulmck@kernel.org>
Subject: [PATCH WIP v1 03/20] mm: convert folio_estimated_sharers() to folio_mapped_shared() and improve it
Date: Fri, 24 Nov 2023 14:26:08 +0100
Message-ID: <20231124132626.235350-4-david@redhat.com>
In-Reply-To: <20231124132626.235350-1-david@redhat.com>
Callers of folio_estimated_sharers() only care about "mapped shared vs.
mapped exclusively". Let's rename the function and improve our detection
for partially-mappable folios (i.e., PTE-mapped THPs).

For now we can only implement, based on our guess, "certainly mapped
shared vs. maybe mapped exclusively". Ideally, we'd have something like
"maybe mapped shared vs. certainly mapped exclusive" -- or even better
"certainly mapped shared vs. certainly mapped exclusively" instead. But
these semantics are currently impossible using our guess-based heuristic
we apply for partially-mappable folios.

Naming the function "folio_certainly_mapped_shared" could be possible,
but let's just keep it simple and call it "folio_mapped_shared" and
document the fuzziness that applies for now.

As we can now read the total mapcount of large folios very efficiently,
use that to improve our implementation, falling back to making a guess only
in case the folio is not "obviously mapped shared".

Signed-off-by: David Hildenbrand <david@redhat.com>
---
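Note (illustration only, not part of the patch): the decision order of
folio_mapped_shared() can be modelled in userspace to see where the result
is exact and where it is a guess. This is a minimal sketch under stated
assumptions: struct folio_model, model_mapped_shared() and
model_check_examples() are hypothetical names, and the struct fields merely
stand in for the kernel accessors named in the comments.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical stand-in for the counters the patch reads. Like the
 * kernel's _mapcount, first_page_mapcount is stored off by one
 * (-1 == unmapped, 0 == mapped once); the other fields are plain counts.
 */
struct folio_model {
	bool large;              /* folio_test_large() */
	bool hugetlb;            /* folio_test_hugetlb() */
	int first_page_mapcount; /* atomic_read(&folio->page._mapcount) */
	int entire_mapcount;     /* folio_entire_mapcount() */
	int total_mapcount;      /* folio_total_mapcount() */
	int nr_pages;            /* folio_nr_pages() */
};

/* Same order of checks as folio_mapped_shared() in this patch. */
static bool model_mapped_shared(const struct folio_model *f)
{
	if (!f->large)
		return f->first_page_mapcount != 0;
	if (f->total_mapcount == 1)
		return false;
	if (f->entire_mapcount || f->hugetlb)
		return f->total_mapcount != 1;
	/* More PTE mappings than pages: some other entity must be involved. */
	if (f->total_mapcount > f->nr_pages)
		return true;
	/* Guess based on the first page only. */
	return f->first_page_mapcount > 0;
}

static void model_check_examples(void)
{
	/* Small folio mapped once vs. twice: exact result. */
	struct folio_model small_once = { .first_page_mapcount = 0,
					  .total_mapcount = 1, .nr_pages = 1 };
	struct folio_model small_twice = { .first_page_mapcount = 1,
					   .total_mapcount = 2, .nr_pages = 1 };
	/* PMD-mapped THP with two entire mappings: exact result. */
	struct folio_model pmd_two = { .large = true, .first_page_mapcount = -1,
				       .entire_mapcount = 2, .total_mapcount = 2,
				       .nr_pages = 512 };
	/* 16-page THP: one MM maps all pages, another maps eight of them. */
	struct folio_model pte_over = { .large = true, .first_page_mapcount = 1,
					.total_mapcount = 24, .nr_pages = 16 };
	/* 16-page THP: two MMs each map a disjoint half (fuzzy case). */
	struct folio_model pte_split = { .large = true, .first_page_mapcount = 0,
					 .total_mapcount = 16, .nr_pages = 16 };

	assert(!model_mapped_shared(&small_once));
	assert(model_mapped_shared(&small_twice));
	assert(model_mapped_shared(&pmd_two));
	assert(model_mapped_shared(&pte_over));
	/* Really shared, yet only "maybe mapped exclusively" is reported. */
	assert(!model_mapped_shared(&pte_split));
}
```

The pte_split case is the fuzziness the commit message describes: false
only ever means "maybe mapped exclusively", while true is always reliable.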
include/linux/mm.h | 68 +++++++++++++++++++++++++++++++++++++++-------
mm/huge_memory.c | 2 +-
mm/madvise.c | 6 ++--
mm/memory.c | 2 +-
mm/mempolicy.c | 14 ++++------
mm/migrate.c | 2 +-
6 files changed, 70 insertions(+), 24 deletions(-)
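Note (illustration only, not part of the patch): for small folios, the
call-site conversions in the files listed above can be sanity-checked with a
tiny userspace comparison of the old and new predicates. This is a sketch
with hypothetical helper names; "raw" stands for the off-by-one value stored
in page->_mapcount, and the old helper is assumed to reduce to raw + 1 for a
small folio.

```c
#include <assert.h>
#include <stdbool.h>

/* Old helper on a small folio: page_mapcount() of the only page. */
static int old_estimated_sharers(int raw)
{
	return raw + 1;
}

/* New helper, small-folio fast path from this patch. */
static bool new_mapped_shared(int raw)
{
	return raw != 0;
}

/*
 * Returns true if the converted call sites behave like the old ones for
 * the given raw _mapcount value.
 */
static bool conversion_matches(int raw)
{
	int old = old_estimated_sharers(raw);
	bool shared = new_mapped_shared(raw);

	/* "!= 1" became folio_mapped_shared(). */
	if ((old != 1) != shared)
		return false;
	/* "== 1" became !folio_mapped_shared(). */
	if ((old == 1) != !shared)
		return false;
	/* "> 1" became folio_mapped_shared(); compared for mapped folios only. */
	if (raw >= 0 && (old > 1) != shared)
		return false;
	return true;
}
```

The "> 1" conversion in do_numa_page() is only compared for raw >= 0: for an
unmapped small folio (raw == -1) the old check is false while the new helper
returns true. That should not matter there, assuming the folio is always
mapped by the faulting PTE at that point.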
diff --git a/include/linux/mm.h b/include/linux/mm.h
index fe91aaefa3db..17dac913f367 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2114,21 +2114,69 @@ static inline size_t folio_size(struct folio *folio)
}
/**
- * folio_estimated_sharers - Estimate the number of sharers of a folio.
+ * folio_mapped_shared - Report if a folio is certainly mapped by
+ * multiple entities in their page tables
* @folio: The folio.
*
- * folio_estimated_sharers() aims to serve as a function to efficiently
- * estimate the number of processes sharing a folio. This is done by
- * looking at the precise mapcount of the first subpage in the folio, and
- * assuming the other subpages are the same. This may not be true for large
- * folios. If you want exact mapcounts for exact calculations, look at
- * page_mapcount() or folio_mapcount().
+ * This function checks if a folio is certainly *currently* mapped by
+ * multiple entities in their page table ("mapped shared") or if the folio
+ * may be mapped exclusively by a single entity ("mapped exclusively").
*
- * Return: The estimated number of processes sharing a folio.
+ * Usually, we consider a single entity to be a single MM. However, some
+ * folios (KSM, pagecache) can be mapped multiple times into the same MM.
+ *
+ * For KSM folios, each individual page table mapping is considered a
+ * separate entity. So if a KSM folio is mapped multiple times into the
+ * same process, it is considered "mapped shared".
+ *
+ * For pagecache folios that are entirely mapped multiple times into the
+ * same MM (i.e., multiple VMAs in the same MM cover the same
+ * file range), we traditionally (and for simplicity) consider them
+ * "mapped shared". For partially-mapped folios (e.g., PTE-mapped THP), we
+ * might detect them either as "mapped shared" or "mapped exclusively" --
+ * whatever is simpler.
+ *
+ * For small folios and entirely mapped large folios (e.g., hugetlb,
+ * PMD-mapped PMD-sized THP), the result will be exactly correct.
+ *
+ * For all other (partially-mappable) folios, such as PTE-mapped THP, the
+ * return value is partially fuzzy: true is not fuzzy, because it means
+ * "certainly mapped shared", but false means "maybe mapped exclusively".
+ *
+ * Note that this function only considers *current* page table mappings
+ * tracked via rmap -- that properly adjusts the folio mapcount(s) -- and
+ * does not consider:
+ * (1) any way the folio might get mapped in the (near) future (e.g.,
+ * swapcache, pagecache, temporary unmapping for migration).
+ * (2) any way a folio might be mapped besides using the rmap (PFN mappings).
+ * (3) any form of page table sharing.
+ *
+ * Return: Whether the folio is certainly mapped by multiple entities.
*/
-static inline int folio_estimated_sharers(struct folio *folio)
+static inline bool folio_mapped_shared(struct folio *folio)
{
- return page_mapcount(folio_page(folio, 0));
+ unsigned int total_mapcount;
+
+ if (likely(!folio_test_large(folio)))
+ return atomic_read(&folio->page._mapcount) != 0;
+ total_mapcount = folio_total_mapcount(folio);
+
+ /* A single mapping implies "mapped exclusively". */
+ if (total_mapcount == 1)
+ return false;
+
+ /* If there is an entire mapping, it must be the only mapping. */
+ if (folio_entire_mapcount(folio) || unlikely(folio_test_hugetlb(folio)))
+ return total_mapcount != 1;
+ /*
+ * Partially-mappable folios are tricky ... but some are "obviously
+ * mapped shared": if we have more (PTE) mappings than we have pages
+ * in the folio, some other entity is certainly involved.
+ */
+ if (total_mapcount > folio_nr_pages(folio))
+ return true;
+ /* ... guess based on the mapcount of the first page of the folio. */
+ return atomic_read(&folio->page._mapcount) > 0;
}
#ifndef HAVE_ARCH_MAKE_PAGE_ACCESSIBLE
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index f31f02472396..874eeeb90e0b 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1638,7 +1638,7 @@ bool madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
* If other processes are mapping this folio, we couldn't discard
* the folio unless they all do MADV_FREE so let's skip the folio.
*/
- if (folio_estimated_sharers(folio) != 1)
+ if (folio_mapped_shared(folio))
goto out;
if (!folio_trylock(folio))
diff --git a/mm/madvise.c b/mm/madvise.c
index cf4d694280e9..1a82867c8c2e 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -365,7 +365,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
folio = pfn_folio(pmd_pfn(orig_pmd));
/* Do not interfere with other mappings of this folio */
- if (folio_estimated_sharers(folio) != 1)
+ if (folio_mapped_shared(folio))
goto huge_unlock;
if (pageout_anon_only_filter && !folio_test_anon(folio))
@@ -441,7 +441,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
if (folio_test_large(folio)) {
int err;
- if (folio_estimated_sharers(folio) != 1)
+ if (folio_mapped_shared(folio))
break;
if (pageout_anon_only_filter && !folio_test_anon(folio))
break;
@@ -665,7 +665,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
if (folio_test_large(folio)) {
int err;
- if (folio_estimated_sharers(folio) != 1)
+ if (folio_mapped_shared(folio))
break;
if (!folio_trylock(folio))
break;
diff --git a/mm/memory.c b/mm/memory.c
index 1f18ed4a5497..6bcfa763a146 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4848,7 +4848,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
* Flag if the folio is shared between multiple address spaces. This
* is later used when determining whether to group tasks together
*/
- if (folio_estimated_sharers(folio) > 1 && (vma->vm_flags & VM_SHARED))
+ if (folio_mapped_shared(folio) && (vma->vm_flags & VM_SHARED))
flags |= TNF_SHARED;
nid = folio_nid(folio);
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 10a590ee1c89..0492113497cc 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -605,12 +605,11 @@ static int queue_folios_hugetlb(pte_t *pte, unsigned long hmask,
* Unless MPOL_MF_MOVE_ALL, we try to avoid migrating a shared folio.
* Choosing not to migrate a shared folio is not counted as a failure.
*
- * To check if the folio is shared, ideally we want to make sure
- * every page is mapped to the same process. Doing that is very
- * expensive, so check the estimated sharers of the folio instead.
+ * See folio_mapped_shared() on possible imprecision when we cannot
+ * easily detect if a folio is shared.
*/
if ((flags & MPOL_MF_MOVE_ALL) ||
- (folio_estimated_sharers(folio) == 1 && !hugetlb_pmd_shared(pte)))
+ (!folio_mapped_shared(folio) && !hugetlb_pmd_shared(pte)))
if (!isolate_hugetlb(folio, qp->pagelist))
qp->nr_failed++;
unlock:
@@ -988,11 +987,10 @@ static bool migrate_folio_add(struct folio *folio, struct list_head *foliolist,
* Unless MPOL_MF_MOVE_ALL, we try to avoid migrating a shared folio.
* Choosing not to migrate a shared folio is not counted as a failure.
*
- * To check if the folio is shared, ideally we want to make sure
- * every page is mapped to the same process. Doing that is very
- * expensive, so check the estimated sharers of the folio instead.
+ * See folio_mapped_shared() on possible imprecision when we cannot
+ * easily detect if a folio is shared.
*/
- if ((flags & MPOL_MF_MOVE_ALL) || folio_estimated_sharers(folio) == 1) {
+ if ((flags & MPOL_MF_MOVE_ALL) || !folio_mapped_shared(folio)) {
if (folio_isolate_lru(folio)) {
list_add_tail(&folio->lru, foliolist);
node_stat_mod_folio(folio,
diff --git a/mm/migrate.c b/mm/migrate.c
index 35a88334bb3c..fda41bc09903 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2559,7 +2559,7 @@ int migrate_misplaced_folio(struct folio *folio, struct vm_area_struct *vma,
* every page is mapped to the same process. Doing that is very
* expensive, so check the estimated mapcount of the folio instead.
*/
- if (folio_estimated_sharers(folio) != 1 && folio_is_file_lru(folio) &&
+ if (folio_mapped_shared(folio) && folio_is_file_lru(folio) &&
(vma->vm_flags & VM_EXEC))
goto out;
--
2.41.0