From mboxrd@z Thu Jan 1 00:00:00 1970
From: Kairui Song
To: linux-mm@kvack.org
Cc: Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li, David Hildenbrand,
	Yosry Ahmed, "Huang, Ying", Nhat Pham, Johannes Weiner, Baolin Wang,
	Baoquan He, Barry Song, Kalesh Singh, Kemeng Shi, Tim Chen, Ryan Roberts,
	linux-kernel@vger.kernel.org, Kairui Song
Subject: [PATCH 12/28] mm, swap: never bypass the swap cache for SWP_SYNCHRONOUS_IO
Date: Thu, 15 May 2025 04:17:12 +0800
Message-ID: <20250514201729.48420-13-ryncsn@gmail.com>
X-Mailer: git-send-email 2.49.0
In-Reply-To: <20250514201729.48420-1-ryncsn@gmail.com>
References: <20250514201729.48420-1-ryncsn@gmail.com>
Reply-To: Kairui Song
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
From: Kairui Song

Now that the overhead of the swap cache is trivial to none, bypassing the
swap cache is no longer a valid optimization.

This commit is more than a code simplification; it changes the swap-in
behaviour in several ways.

We used to rely on `SWP_SYNCHRONOUS_IO && __swap_count(entry) == 1` as the
indicator to bypass both the swap cache and readahead. For many workloads,
bypassing readahead is the part that actually helps: SWP_SYNCHRONOUS_IO
devices have extremely low latency, so readahead brings no benefit. The
`SWP_SYNCHRONOUS_IO && __swap_count(entry) == 1` condition was never a good
indicator in the first place: readahead obviously has nothing to do with
the swap count. It was a workaround for a limitation of the current
implementation, where readahead bypassing is strictly coupled with swap
cache bypassing, and an entry with swap count > 1 cannot bypass the swap
cache without causing redundant IO or wasted CPU time.

So the first change in this commit is that readahead is now always
disabled for SWP_SYNCHRONOUS_IO devices. This is a good thing: these
devices (ZRAM, RAMDISK) have extremely low latency and are not hurt by
queued IO, so readahead isn't helpful for them.

The second change is that mTHP swap-in is now enabled for all faults on
SWP_SYNCHRONOUS_IO devices. Previously, mTHP swap-in was also coupled with
swap cache bypassing, but clearly it makes little sense for an mTHP's swap
count to affect its swap-in behaviour.

To catch potential issues with mTHP swap-in, especially around page
exclusiveness, more debug sanity checks and comments are added. Even so,
the code is still simpler, with reduced LOC.
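To make the first behavioural change concrete, here is a small,
self-contained C sketch of the swap-in path selection before and after this
patch. It only models the decision logic; the type and function names in it
are hypothetical stand-ins rather than kernel APIs, and the authoritative
logic is the do_swap_page() change in the diff below.

/*
 * Minimal userspace model of the path selection, not kernel code.
 * All names here (struct swap_dev, old_path, new_path, ...) are
 * hypothetical stand-ins for illustration only.
 */
#include <stdbool.h>
#include <stdio.h>

struct swap_dev {
	bool synchronous_io;	/* e.g. ZRAM, RAMDISK */
};

enum swapin_path {
	SWAPIN_CACHED_NO_READAHEAD,	/* use swap cache, skip readahead */
	SWAPIN_CACHED_READAHEAD,	/* use swap cache, do readahead */
	SWAPIN_BYPASS_CACHE,		/* old code only: skip the swap cache */
};

/* Old behaviour: cache bypass and readahead bypass coupled on swap count. */
static enum swapin_path old_path(const struct swap_dev *si, int swap_count)
{
	if (si->synchronous_io && swap_count == 1)
		return SWAPIN_BYPASS_CACHE;
	return SWAPIN_CACHED_READAHEAD;
}

/* New behaviour: always use the swap cache; skip readahead based purely on
 * the device type, regardless of the swap count. */
static enum swapin_path new_path(const struct swap_dev *si, int swap_count)
{
	(void)swap_count;	/* no longer part of the decision */
	return si->synchronous_io ? SWAPIN_CACHED_NO_READAHEAD :
				    SWAPIN_CACHED_READAHEAD;
}

int main(void)
{
	struct swap_dev zram = { .synchronous_io = true };

	/* A shared entry (swap count == 2) on ZRAM: the old code fell back to
	 * readahead; the new code still skips readahead but keeps the cache. */
	printf("shared entry on ZRAM: old=%d new=%d\n",
	       old_path(&zram, 2), new_path(&zram, 2));
	return 0;
}

Because every swap-in now goes through the swap cache, the
swapcache_prepare() pinning workaround and the swapcache_wq wait queue used
by the old cache-bypass path become unnecessary, which is why the diff
below removes them.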
For a real mTHP workload, this may cause more serious thrashing. That is not
a problem introduced by this commit but a generic mTHP issue. For a 4K
workload, this commit boosts performance:

Signed-off-by: Kairui Song
---
 mm/memory.c | 267 +++++++++++++++++++++++-----------------------
 1 file changed, 116 insertions(+), 151 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 1b6e192de6ec..0b41d15c6d7a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -87,6 +87,7 @@
 #include <asm/tlbflush.h>
 
 #include "pgalloc-track.h"
+#include "swap_table.h"
 #include "internal.h"
 #include "swap.h"
 
@@ -4477,7 +4478,33 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
 }
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
-static DECLARE_WAIT_QUEUE_HEAD(swapcache_wq);
+/* Check if a folio should be exclusive, with sanity tests */
+static bool check_swap_exclusive(struct folio *folio, swp_entry_t entry,
+				 pte_t *ptep, unsigned int fault_nr)
+{
+	pgoff_t offset = swp_offset(entry);
+	struct page *page = folio_file_page(folio, offset);
+
+	if (!pte_swp_exclusive(ptep_get(ptep)))
+		return false;
+
+	/* For exclusive swapin, it must not be mapped */
+	if (fault_nr == 1)
+		VM_WARN_ON_ONCE_PAGE(atomic_read(&page->_mapcount) != -1, page);
+	else
+		VM_WARN_ON_ONCE_FOLIO(folio_mapped(folio), folio);
+	/*
+	 * Check if swap count is consistent with exclusiveness. The folio
+	 * and PTL lock keeps the swap count stable.
+	 */
+	if (IS_ENABLED(CONFIG_VM_DEBUG)) {
+		for (int i = 0; i < fault_nr; i++) {
+			VM_WARN_ON_FOLIO(__swap_count(entry) != 1, folio);
+			entry.val++;
+		}
+	}
+	return true;
+}
 
 /*
  * We enter with non-exclusive mmap_lock (to exclude vma changes,
@@ -4490,17 +4517,14 @@ static DECLARE_WAIT_QUEUE_HEAD(swapcache_wq);
 vm_fault_t do_swap_page(struct vm_fault *vmf)
 {
 	struct vm_area_struct *vma = vmf->vma;
-	struct folio *swapcache, *folio = NULL;
-	DECLARE_WAITQUEUE(wait, current);
+	struct folio *swapcache = NULL, *folio;
 	struct page *page;
 	struct swap_info_struct *si = NULL;
 	rmap_t rmap_flags = RMAP_NONE;
-	bool need_clear_cache = false;
 	bool exclusive = false;
 	swp_entry_t entry;
 	pte_t pte;
 	vm_fault_t ret = 0;
-	void *shadow = NULL;
 	int nr_pages;
 	unsigned long page_idx;
 	unsigned long address;
@@ -4571,56 +4595,18 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	folio = swap_cache_get_folio(entry);
 	swapcache = folio;
 	if (!folio) {
-		if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
-		    __swap_count(entry) == 1) {
-			/* skip swapcache */
+		if (data_race(si->flags & SWP_SYNCHRONOUS_IO)) {
 			folio = alloc_swap_folio(vmf);
 			if (folio) {
-				__folio_set_locked(folio);
-				__folio_set_swapbacked(folio);
-
-				nr_pages = folio_nr_pages(folio);
-				if (folio_test_large(folio))
-					entry.val = ALIGN_DOWN(entry.val, nr_pages);
-				/*
-				 * Prevent parallel swapin from proceeding with
-				 * the cache flag. Otherwise, another thread
-				 * may finish swapin first, free the entry, and
-				 * swapout reusing the same entry. It's
-				 * undetectable as pte_same() returns true due
-				 * to entry reuse.
-				 */
-				if (swapcache_prepare(entry, nr_pages)) {
-					/*
-					 * Relax a bit to prevent rapid
-					 * repeated page faults.
-					 */
-					add_wait_queue(&swapcache_wq, &wait);
-					schedule_timeout_uninterruptible(1);
-					remove_wait_queue(&swapcache_wq, &wait);
-					goto out_page;
-				}
-				need_clear_cache = true;
-
-				memcg1_swapin(entry, nr_pages);
-
-				shadow = swap_cache_get_shadow(entry);
-				if (shadow)
-					workingset_refault(folio, shadow);
-
-				folio_add_lru(folio);
-
-				/* To provide entry to swap_read_folio() */
-				folio->swap = entry;
-				swap_read_folio(folio, NULL);
-				folio->private = NULL;
+				swapcache = swapin_entry(entry, folio);
+				if (swapcache != folio)
+					folio_put(folio);
 			}
 		} else {
-			folio = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
-						 vmf);
-			swapcache = folio;
+			swapcache = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE, vmf);
 		}
+		folio = swapcache;
 
 		if (!folio) {
 			/*
 			 * Back out if somebody else faulted in this pte
@@ -4644,57 +4630,56 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	if (ret & VM_FAULT_RETRY)
 		goto out_release;
 
+	/*
+	 * Make sure folio_free_swap() or swapoff did not release the
+	 * swapcache from under us. The page pin, and pte_same test
+	 * below, are not enough to exclude that. Even if it is still
+	 * swapcache, we need to check that the page's swap has not
+	 * changed.
+	 */
+	if (!folio_swap_contains(folio, entry))
+		goto out_page;
 	page = folio_file_page(folio, swp_offset(entry));
-	if (swapcache) {
-		/*
-		 * Make sure folio_free_swap() or swapoff did not release the
-		 * swapcache from under us. The page pin, and pte_same test
-		 * below, are not enough to exclude that. Even if it is still
-		 * swapcache, we need to check that the page's swap has not
-		 * changed.
-		 */
-		if (!folio_swap_contains(folio, entry))
-			goto out_page;
-		if (PageHWPoison(page)) {
-			/*
-			 * hwpoisoned dirty swapcache pages are kept for killing
-			 * owner processes (which may be unknown at hwpoison time)
-			 */
-			ret = VM_FAULT_HWPOISON;
-			goto out_page;
-		}
-
-		swap_update_readahead(folio, vma, vmf->address);
+	/*
+	 * hwpoisoned dirty swapcache pages are kept for killing
+	 * owner processes (which may be unknown at hwpoison time)
+	 */
+	if (PageHWPoison(page)) {
+		ret = VM_FAULT_HWPOISON;
+		goto out_page;
+	}
 
-		/*
-		 * KSM sometimes has to copy on read faults, for example, if
-		 * page->index of !PageKSM() pages would be nonlinear inside the
-		 * anon VMA -- PageKSM() is lost on actual swapout.
-		 */
-		folio = ksm_might_need_to_copy(folio, vma, vmf->address);
-		if (unlikely(!folio)) {
-			ret = VM_FAULT_OOM;
-			folio = swapcache;
-			goto out_page;
-		} else if (unlikely(folio == ERR_PTR(-EHWPOISON))) {
-			ret = VM_FAULT_HWPOISON;
-			folio = swapcache;
-			goto out_page;
-		} else if (folio != swapcache)
-			page = folio_page(folio, 0);
+	swap_update_readahead(folio, vma, vmf->address);
 
-		/*
-		 * If we want to map a page that's in the swapcache writable, we
-		 * have to detect via the refcount if we're really the exclusive
-		 * owner. Try removing the extra reference from the local LRU
-		 * caches if required.
-		 */
-		if ((vmf->flags & FAULT_FLAG_WRITE) && folio == swapcache &&
-		    !folio_test_ksm(folio) && !folio_test_lru(folio))
-			lru_add_drain();
+	/*
+	 * KSM sometimes has to copy on read faults, for example, if
+	 * page->index of !PageKSM() pages would be nonlinear inside the
+	 * anon VMA -- PageKSM() is lost on actual swapout.
+	 */
+	folio = ksm_might_need_to_copy(folio, vma, vmf->address);
+	if (unlikely(!folio)) {
+		ret = VM_FAULT_OOM;
+		folio = swapcache;
+		goto out_page;
+	} else if (unlikely(folio == ERR_PTR(-EHWPOISON))) {
+		ret = VM_FAULT_HWPOISON;
+		folio = swapcache;
+		goto out_page;
+	} else if (folio != swapcache) {
+		page = folio_file_page(folio, swp_offset(entry));
 	}
+	/*
+	 * If we want to map a page that's in the swapcache writable, we
+	 * have to detect via the refcount if we're really the exclusive
+	 * owner. Try removing the extra reference from the local LRU
+	 * caches if required.
+	 */
+	if ((vmf->flags & FAULT_FLAG_WRITE) && folio == swapcache &&
+	    !folio_test_ksm(folio) && !folio_test_lru(folio))
+		lru_add_drain();
+
 	folio_throttle_swaprate(folio, GFP_KERNEL);
 
 	/*
@@ -4710,44 +4695,41 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 			goto out_nomap;
 	}
 
-	/* allocated large folios for SWP_SYNCHRONOUS_IO */
-	if (folio_test_large(folio) && !folio_test_swapcache(folio)) {
-		unsigned long nr = folio_nr_pages(folio);
-		unsigned long folio_start = ALIGN_DOWN(vmf->address, nr * PAGE_SIZE);
-		unsigned long idx = (vmf->address - folio_start) / PAGE_SIZE;
-		pte_t *folio_ptep = vmf->pte - idx;
-		pte_t folio_pte = ptep_get(folio_ptep);
-
-		if (!pte_same(folio_pte, pte_move_swp_offset(vmf->orig_pte, -idx)) ||
-		    swap_pte_batch(folio_ptep, nr, folio_pte) != nr)
-			goto out_nomap;
-
-		page_idx = idx;
-		address = folio_start;
-		ptep = folio_ptep;
-		goto check_folio;
-	}
-
 	nr_pages = 1;
 	page_idx = 0;
 	address = vmf->address;
 	ptep = vmf->pte;
-	if (folio_test_large(folio) && folio_test_swapcache(folio)) {
+	if (folio_test_large(folio)) {
 		unsigned long nr = folio_nr_pages(folio);
 		unsigned long idx = folio_page_idx(folio, page);
-		unsigned long folio_address = address - idx * PAGE_SIZE;
+		unsigned long folio_address = vmf->address - idx * PAGE_SIZE;
 		pte_t *folio_ptep = vmf->pte - idx;
 
-		if (!can_swapin_thp(vmf, folio_ptep, folio_address, nr))
+		if (can_swapin_thp(vmf, folio_ptep, folio_address, nr)) {
+			page_idx = idx;
+			address = folio_address;
+			ptep = folio_ptep;
+			nr_pages = nr;
+			entry = folio->swap;
+			page = &folio->page;
 			goto check_folio;
-
-		page_idx = idx;
-		address = folio_address;
-		ptep = folio_ptep;
-		nr_pages = nr;
-		entry = folio->swap;
-		page = &folio->page;
+		}
+		/*
+		 * If it's a fresh large folio in the swap cache but the
+		 * page table supporting it is gone, drop it and fallback
+		 * to order 0 swap in again.
+		 *
+		 * The folio must be clean, nothing should have touched
+		 * it, shmem removes the folio from swap cache upon
+		 * swapin, and anon flag won't be gone once set.
+		 * TODO: We might want to split or partially map it.
+		 */
+		if (!folio_test_anon(folio)) {
+			WARN_ON_ONCE(folio_test_dirty(folio));
+			delete_from_swap_cache(folio);
+			goto out_nomap;
+		}
 	}
 
 check_folio:
@@ -4767,7 +4749,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	 * the swap entry concurrently) for certainly exclusive pages.
 	 */
 	if (!folio_test_ksm(folio)) {
-		exclusive = pte_swp_exclusive(vmf->orig_pte);
+		exclusive = check_swap_exclusive(folio, entry, ptep, nr_pages);
 		if (folio != swapcache) {
 			/*
 			 * We have a fresh page that is not exposed to the
@@ -4805,15 +4787,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	 */
 	arch_swap_restore(folio_swap(entry, folio), folio);
 
-	/*
-	 * Remove the swap entry and conditionally try to free up the swapcache.
-	 * We're already holding a reference on the page but haven't mapped it
-	 * yet.
-	 */
-	swap_free_nr(entry, nr_pages);
-	if (should_try_to_free_swap(folio, vma, vmf->flags))
-		folio_free_swap(folio);
-
 	add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
 	add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages);
 	pte = mk_pte(page, vma->vm_page_prot);
@@ -4849,14 +4822,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 		folio_add_new_anon_rmap(folio, vma, address, RMAP_EXCLUSIVE);
 		folio_add_lru_vma(folio, vma);
 	} else if (!folio_test_anon(folio)) {
-		/*
-		 * We currently only expect small !anon folios which are either
-		 * fully exclusive or fully shared, or new allocated large
-		 * folios which are fully exclusive. If we ever get large
-		 * folios within swapcache here, we have to be careful.
-		 */
-		VM_WARN_ON_ONCE(folio_test_large(folio) && folio_test_swapcache(folio));
-		VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio);
+		VM_WARN_ON_ONCE_FOLIO(folio_nr_pages(folio) != nr_pages, folio);
+		VM_WARN_ON_ONCE_FOLIO(folio_mapped(folio), folio);
 		folio_add_new_anon_rmap(folio, vma, address, rmap_flags);
 	} else {
 		folio_add_anon_rmap_ptes(folio, page, nr_pages, vma, address,
@@ -4869,7 +4836,16 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	arch_do_swap_page_nr(vma->vm_mm, vma, address,
 			pte, pte, nr_pages);
 
+	/*
+	 * Remove the swap entry and conditionally try to free up the
+	 * swapcache then unlock the folio. Do this after the PTEs are
+	 * set, so raced faults will see updated PTEs.
+	 */
+	swap_free_nr(entry, nr_pages);
+	if (should_try_to_free_swap(folio, vma, vmf->flags))
+		folio_free_swap(folio);
 	folio_unlock(folio);
+
 	if (folio != swapcache && swapcache) {
 		/*
 		 * Hold the lock to avoid the swap entry to be reused
@@ -4896,12 +4872,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	if (vmf->pte)
 		pte_unmap_unlock(vmf->pte, vmf->ptl);
 out:
-	/* Clear the swap cache pin for direct swapin after PTL unlock */
-	if (need_clear_cache) {
-		swapcache_clear(si, entry, nr_pages);
-		if (waitqueue_active(&swapcache_wq))
-			wake_up(&swapcache_wq);
-	}
 	if (si)
 		put_swap_device(si);
 	return ret;
@@ -4916,11 +4886,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 		folio_unlock(swapcache);
 		folio_put(swapcache);
 	}
-	if (need_clear_cache) {
-		swapcache_clear(si, entry, nr_pages);
-		if (waitqueue_active(&swapcache_wq))
-			wake_up(&swapcache_wq);
-	}
 	if (si)
 		put_swap_device(si);
 	return ret;
-- 
2.49.0