From: David Hildenbrand <david@redhat.com>
To: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org, David Hildenbrand <david@redhat.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Ryan Roberts <ryan.roberts@arm.com>,
	Matthew Wilcox <willy@infradead.org>,
	Hugh Dickins <hughd@google.com>,
	Yin Fengwei <fengwei.yin@intel.com>,
	Yang Shi <shy828301@gmail.com>, Ying Huang <ying.huang@intel.com>,
	Zi Yan <ziy@nvidia.com>, Peter Zijlstra <peterz@infradead.org>,
	Ingo Molnar <mingo@redhat.com>, Will Deacon <will@kernel.org>,
	Waiman Long <longman@redhat.com>,
	"Paul E. McKenney" <paulmck@kernel.org>
Subject: [PATCH WIP v1 10/20] mm/memory: COW reuse support for PTE-mapped THP with rmap IDs
Date: Fri, 24 Nov 2023 14:26:15 +0100	[thread overview]
Message-ID: <20231124132626.235350-11-david@redhat.com> (raw)
In-Reply-To: <20231124132626.235350-1-david@redhat.com>

For now, we only end up reusing small folios and PMD-mapped large folios
(i.e., THP) after fork(); PTE-mapped THPs are never reused, except when
only a single page of the folio remains mapped. Instead, we end up copying
each subpage even though the THP might be exclusive to the MM.

The logic we're using for small folios and PMD-mapped THPs is the
following: Is the only reference to the folio from a single page table
mapping? Then:
  (a) There are no other references to the folio from other MMs
      (e.g., page table mappings, GUP)
  (b) There are no other references to the folio from page migration/
      swapout/swapcache that might temporarily unmap the folio.

Consequently, the folio is exclusive to that process and can be reused.
In that case, we end up with folio_ref_count(folio) == 1 and an implied
folio_mapcount(folio) == 1, while holding the page table lock and the
folio lock to protect against possible races.
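
In code, that existing check boils down to something like the following
condensed sketch (hypothetical helper name; the real
wp_can_reuse_anon_folio() additionally deals with the LRU cache and the
swapcache before trusting the refcount):

	/* Sketch only -- not the exact mainline code. */
	static bool anon_folio_exclusive_sketch(struct folio *folio)
	{
		/* Called with the page table lock held. */
		if (!folio_trylock(folio))
			return false;
		/*
		 * With the folio lock and the (single) page table lock
		 * held, one reference implies one mapping: no other MM,
		 * no GUP pin, no migration/swapout/swapcache reference.
		 */
		if (folio_ref_count(folio) != 1) {
			folio_unlock(folio);
			return false;
		}
		folio_unlock(folio);
		return true;
	}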

For a PTE-mapped THP, however, we have not one but multiple references
from page tables, and such a THP can additionally be mapped into multiple
page tables of the same MM (e.g., via mremap()).

Reusing the logic that we use for small folios and PMD-mapped THPs means
that, when reusing a PTE-mapped THP, we want to make sure that:
  (1) All folio references are from page table mappings.
  (2) All page table mappings belong to the same MM.
  (3) We didn't race with (un)mapping of the folio's pages via other page
      tables, such that the observed mapcount and refcount are stable.

For (1), we can check
	folio_ref_count(folio) == folio_mapcount(folio)
For (2) and (3), we can use our new rmap ID infrastructure.
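
In outline, the new large-folio path (condensed from the diff below;
debug-only checks and comments omitted) combines these three conditions
under one seqcount read section:

	start = raw_read_atomic_seqcount(&folio->_rmap_atomic_seqcount);
	if (start & ATOMIC_SEQCOUNT_WRITERS_MASK)
		return false;				/* concurrent writer */
	mapcount = folio_mapcount(folio);
	if (folio_ref_count(folio) != mapcount)
		return false;				/* (1) */
	if (!__folio_has_large_matching_rmap_val(folio, mapcount,
						 vma->vm_mm))
		return false;				/* (2) */
	if (raw_read_atomic_seqcount_retry(&folio->_rmap_atomic_seqcount,
					   start))
		return false;				/* (3) */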

We won't bother with the swapcache and the LRU cache for now. Add some
sanity checks under CONFIG_DEBUG_VM to identify any obvious problems early.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/memory.c | 89 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 89 insertions(+)

diff --git a/mm/memory.c b/mm/memory.c
index 5048d58d6174..fb533995ff68 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3360,6 +3360,95 @@ static vm_fault_t wp_page_shared(struct vm_fault *vmf, struct folio *folio)
 static bool wp_can_reuse_anon_folio(struct folio *folio,
 				    struct vm_area_struct *vma)
 {
+#ifdef CONFIG_RMAP_ID
+	if (folio_test_large(folio)) {
+		bool retried = false;
+		unsigned long start;
+		int mapcount, i;
+
+		/*
+		 * The assumption for anonymous folios is that each page can
+		 * only get mapped once into a MM.  This also holds for
+		 * small folios -- except when KSM is involved. KSM currently
+		 * does not apply to large folios.
+		 *
+		 * Further, each taken mapcount must be paired with exactly one
+		 * taken reference, whereby references must be incremented
+		 * before the mapcount when mapping a page, and references must
+		 * be decremented after the mapcount when unmapping a page.
+		 *
+		 * So if all references to a folio are from mappings, and all
+		 * mappings are due to our (MM) page tables, and there was no
+		 * concurrent (un)mapping, this folio is certainly exclusive.
+		 *
+		 * We currently don't optimize for:
+		 * (a) folio is mapped into multiple page tables in this
+		 *     MM (e.g., mremap) and other page tables are
+		 *     concurrently (un)mapping the folio.
+		 * (b) the folio is in the swapcache. Likely the other PTEs
+		 *     are still swap entries and folio_free_swap() would fail.
+		 * (c) the folio is in the LRU cache.
+		 */
+retry:
+		start = raw_read_atomic_seqcount(&folio->_rmap_atomic_seqcount);
+		if (start & ATOMIC_SEQCOUNT_WRITERS_MASK)
+			return false;
+		mapcount = folio_mapcount(folio);
+
+		/* Is this folio possibly exclusive ... */
+		if (mapcount > folio_nr_pages(folio) || folio_entire_mapcount(folio))
+			return false;
+
+		/* ... and are all references from mappings ... */
+		if (folio_ref_count(folio) != mapcount)
+			return false;
+
+		/* ... and do all mappings belong to us ... */
+		if (!__folio_has_large_matching_rmap_val(folio, mapcount, vma->vm_mm))
+			return false;
+
+		/* ... and was there no concurrent (un)mapping ? */
+		if (raw_read_atomic_seqcount_retry(&folio->_rmap_atomic_seqcount,
+						   start))
+			return false;
+
+		/* Safety checks we might want to drop in the future. */
+		if (IS_ENABLED(CONFIG_DEBUG_VM)) {
+			unsigned int mapcount;
+
+			if (WARN_ON_ONCE(folio_test_ksm(folio)))
+				return false;
+			/*
+			 * We might have raced against swapout code adding
+			 * the folio to the swapcache (which, by itself, is not
+			 * problematic). Let's simply check again if we would
+			 * properly detect the additional reference now and
+			 * properly fail.
+			 */
+			if (unlikely(folio_test_swapcache(folio))) {
+				if (WARN_ON_ONCE(retried))
+					return false;
+				retried = true;
+				goto retry;
+			}
+			for (i = 0; i < folio_nr_pages(folio); i++) {
+				mapcount = page_mapcount(folio_page(folio, i));
+				if (WARN_ON_ONCE(mapcount > 1))
+					return false;
+			}
+		}
+
+		/*
+		 * This folio is exclusive to us. Do we need the page lock?
+		 * Likely not, and a trylock would be unfortunate if this
+		 * folio is mapped into multiple page tables and we get
+		 * concurrent page faults. If there would be references from
+		 * page migration/swapout/swapcache, we would have detected
+		 * an additional reference and never ended up here.
+		 */
+		return true;
+	}
+#endif /* CONFIG_RMAP_ID */
 	/*
 	 * We have to verify under folio lock: these early checks are
 	 * just an optimization to avoid locking the folio and freeing
-- 
2.41.0



