From mboxrd@z Thu Jan 1 00:00:00 1970
From: Kairui Song
To: linux-mm@kvack.org
Cc: Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li, David Hildenbrand,
	Yosry Ahmed, "Huang, Ying", Nhat Pham, Johannes Weiner, Baolin Wang,
	Baoquan He, Barry Song, Kalesh Singh, Kemeng Shi, Tim Chen, Ryan Roberts,
	linux-kernel@vger.kernel.org, Kairui Song
Subject: [PATCH 12/28] mm, swap: never bypass the swap cache for SWP_SYNCHRONOUS_IO
Date: Thu, 15 May 2025 04:17:12 +0800
Message-ID: <20250514201729.48420-13-ryncsn@gmail.com>
X-Mailer: git-send-email 2.49.0
In-Reply-To: <20250514201729.48420-1-ryncsn@gmail.com>
References: <20250514201729.48420-1-ryncsn@gmail.com>
Reply-To: Kairui Song
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
From: Kairui Song

Now that the overhead of the swap cache is trivial to none, bypassing the
swap cache is no longer a valid optimization.

This commit is more than a code simplification; it changes the swap-in
behaviour in several ways.

We used to rely on `SWP_SYNCHRONOUS_IO && __swap_count(entry) == 1` as the
indicator to bypass both the swap cache and readahead. For many workloads,
bypassing readahead is the part that actually helps: SWP_SYNCHRONOUS_IO
devices have extremely low latency, so readahead brings no benefit. The
`SWP_SYNCHRONOUS_IO && __swap_count(entry) == 1` condition was never a good
indicator in the first place: readahead obviously has nothing to do with
the swap count. It was a workaround for a limitation of the current
implementation, where readahead bypassing is strictly coupled with swap
cache bypassing, and an entry with swap count > 1 cannot bypass the swap
cache without causing redundant IO or wasted CPU time.

So the first change in this commit is that readahead is now always
disabled for SWP_SYNCHRONOUS_IO devices. This is a good thing: these
devices (ZRAM, RAMDISK) have extremely low latency and are not hurt by
queued IO, so readahead isn't helpful for them.

The second change is that mTHP swap-in is now enabled for all faults on
SWP_SYNCHRONOUS_IO devices. Previously, mTHP swap-in was also coupled with
swap cache bypassing, but clearly it makes little sense for an mTHP's swap
count to affect its swap-in behaviour.

To catch potential issues with mTHP swap-in, especially around page
exclusiveness, more debug sanity checks and comments are added. Even so,
the code is still simpler, with reduced LOC.
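To make the first behavioural change concrete, here is a small,
self-contained C sketch of the swap-in path selection before and after this
patch. It only models the decision logic; the type and function names in it
are hypothetical stand-ins rather than kernel APIs, and the authoritative
logic is the do_swap_page() change in the diff below.

/*
 * Minimal userspace model of the path selection, not kernel code.
 * All names here (struct swap_dev, old_path, new_path, ...) are
 * hypothetical stand-ins for illustration only.
 */
#include <stdbool.h>
#include <stdio.h>

struct swap_dev {
	bool synchronous_io;	/* e.g. ZRAM, RAMDISK */
};

enum swapin_path {
	SWAPIN_CACHED_NO_READAHEAD,	/* use swap cache, skip readahead */
	SWAPIN_CACHED_READAHEAD,	/* use swap cache, do readahead */
	SWAPIN_BYPASS_CACHE,		/* old code only: skip the swap cache */
};

/* Old behaviour: cache bypass and readahead bypass coupled on swap count. */
static enum swapin_path old_path(const struct swap_dev *si, int swap_count)
{
	if (si->synchronous_io && swap_count == 1)
		return SWAPIN_BYPASS_CACHE;
	return SWAPIN_CACHED_READAHEAD;
}

/* New behaviour: always use the swap cache; skip readahead based purely on
 * the device type, regardless of the swap count. */
static enum swapin_path new_path(const struct swap_dev *si, int swap_count)
{
	(void)swap_count;	/* no longer part of the decision */
	return si->synchronous_io ? SWAPIN_CACHED_NO_READAHEAD :
				    SWAPIN_CACHED_READAHEAD;
}

int main(void)
{
	struct swap_dev zram = { .synchronous_io = true };

	/* A shared entry (swap count == 2) on ZRAM: the old code fell back to
	 * readahead; the new code still skips readahead but keeps the cache. */
	printf("shared entry on ZRAM: old=%d new=%d\n",
	       old_path(&zram, 2), new_path(&zram, 2));
	return 0;
}

Because every swap-in now goes through the swap cache, the
swapcache_prepare() pinning workaround and the swapcache_wq wait queue used
by the old cache-bypass path become unnecessary, which is why the diff
below removes them.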
For a real mTHP workload, this may cause more serious thrashing. That is not
a problem introduced by this commit but a generic mTHP issue. For a 4K
workload, this commit boosts performance:

Signed-off-by: Kairui Song
---
 mm/memory.c | 267 +++++++++++++++++++++++-----------------------
 1 file changed, 116 insertions(+), 151 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 1b6e192de6ec..0b41d15c6d7a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -87,6 +87,7 @@
 #include <asm/tlbflush.h>
 
 #include "pgalloc-track.h"
+#include "swap_table.h"
 #include "internal.h"
 #include "swap.h"
 
@@ -4477,7 +4478,33 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
 }
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
-static DECLARE_WAIT_QUEUE_HEAD(swapcache_wq);
+/* Check if a folio should be exclusive, with sanity tests */
+static bool check_swap_exclusive(struct folio *folio, swp_entry_t entry,
+				 pte_t *ptep, unsigned int fault_nr)
+{
+	pgoff_t offset = swp_offset(entry);
+	struct page *page = folio_file_page(folio, offset);
+
+	if (!pte_swp_exclusive(ptep_get(ptep)))
+		return false;
+
+	/* For exclusive swapin, it must not be mapped */
+	if (fault_nr == 1)
+		VM_WARN_ON_ONCE_PAGE(atomic_read(&page->_mapcount) != -1, page);
+	else
+		VM_WARN_ON_ONCE_FOLIO(folio_mapped(folio), folio);
+	/*
+	 * Check if swap count is consistent with exclusiveness. The folio
+	 * and PTL lock keeps the swap count stable.
+	 */
+	if (IS_ENABLED(CONFIG_VM_DEBUG)) {
+		for (int i = 0; i < fault_nr; i++) {
+			VM_WARN_ON_FOLIO(__swap_count(entry) != 1, folio);
+			entry.val++;
+		}
+	}
+	return true;
+}
 
 /*
  * We enter with non-exclusive mmap_lock (to exclude vma changes,
@@ -4490,17 +4517,14 @@ static DECLARE_WAIT_QUEUE_HEAD(swapcache_wq);
 vm_fault_t do_swap_page(struct vm_fault *vmf)
 {
 	struct vm_area_struct *vma = vmf->vma;
-	struct folio *swapcache, *folio = NULL;
-	DECLARE_WAITQUEUE(wait, current);
+	struct folio *swapcache = NULL, *folio;
 	struct page *page;
 	struct swap_info_struct *si = NULL;
 	rmap_t rmap_flags = RMAP_NONE;
-	bool need_clear_cache = false;
 	bool exclusive = false;
 	swp_entry_t entry;
 	pte_t pte;
 	vm_fault_t ret = 0;
-	void *shadow = NULL;
 	int nr_pages;
 	unsigned long page_idx;
 	unsigned long address;
@@ -4571,56 +4595,18 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	folio = swap_cache_get_folio(entry);
 	swapcache = folio;
 	if (!folio) {
-		if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
-		    __swap_count(entry) == 1) {
-			/* skip swapcache */
+		if (data_race(si->flags & SWP_SYNCHRONOUS_IO)) {
 			folio = alloc_swap_folio(vmf);
 			if (folio) {
-				__folio_set_locked(folio);
-				__folio_set_swapbacked(folio);
-
-				nr_pages = folio_nr_pages(folio);
-				if (folio_test_large(folio))
-					entry.val = ALIGN_DOWN(entry.val, nr_pages);
-				/*
-				 * Prevent parallel swapin from proceeding with
-				 * the cache flag. Otherwise, another thread
-				 * may finish swapin first, free the entry, and
-				 * swapout reusing the same entry. It's
-				 * undetectable as pte_same() returns true due
-				 * to entry reuse.
-				 */
-				if (swapcache_prepare(entry, nr_pages)) {
-					/*
-					 * Relax a bit to prevent rapid
-					 * repeated page faults.
-					 */
-					add_wait_queue(&swapcache_wq, &wait);
-					schedule_timeout_uninterruptible(1);
-					remove_wait_queue(&swapcache_wq, &wait);
-					goto out_page;
-				}
-				need_clear_cache = true;
-
-				memcg1_swapin(entry, nr_pages);
-
-				shadow = swap_cache_get_shadow(entry);
-				if (shadow)
-					workingset_refault(folio, shadow);
-
-				folio_add_lru(folio);
-
-				/* To provide entry to swap_read_folio() */
-				folio->swap = entry;
-				swap_read_folio(folio, NULL);
-				folio->private = NULL;
+				swapcache = swapin_entry(entry, folio);
+				if (swapcache != folio)
+					folio_put(folio);
 			}
 		} else {
-			folio = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
-						 vmf);
-			swapcache = folio;
+			swapcache = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE, vmf);
 		}
+		folio = swapcache;
 
 		if (!folio) {
 			/*
 			 * Back out if somebody else faulted in this pte
@@ -4644,57 +4630,56 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	if (ret & VM_FAULT_RETRY)
 		goto out_release;
 
+	/*
+	 * Make sure folio_free_swap() or swapoff did not release the
+	 * swapcache from under us. The page pin, and pte_same test
+	 * below, are not enough to exclude that. Even if it is still
+	 * swapcache, we need to check that the page's swap has not
+	 * changed.
+	 */
+	if (!folio_swap_contains(folio, entry))
+		goto out_page;
 	page = folio_file_page(folio, swp_offset(entry));
-	if (swapcache) {
-		/*
-		 * Make sure folio_free_swap() or swapoff did not release the
-		 * swapcache from under us. The page pin, and pte_same test
-		 * below, are not enough to exclude that. Even if it is still
-		 * swapcache, we need to check that the page's swap has not
-		 * changed.
-		 */
-		if (!folio_swap_contains(folio, entry))
-			goto out_page;
-		if (PageHWPoison(page)) {
-			/*
-			 * hwpoisoned dirty swapcache pages are kept for killing
-			 * owner processes (which may be unknown at hwpoison time)
-			 */
-			ret = VM_FAULT_HWPOISON;
-			goto out_page;
-		}
-
-		swap_update_readahead(folio, vma, vmf->address);
+	/*
+	 * hwpoisoned dirty swapcache pages are kept for killing
+	 * owner processes (which may be unknown at hwpoison time)
+	 */
+	if (PageHWPoison(page)) {
+		ret = VM_FAULT_HWPOISON;
+		goto out_page;
+	}
 
-		/*
-		 * KSM sometimes has to copy on read faults, for example, if
-		 * page->index of !PageKSM() pages would be nonlinear inside the
-		 * anon VMA -- PageKSM() is lost on actual swapout.
-		 */
-		folio = ksm_might_need_to_copy(folio, vma, vmf->address);
-		if (unlikely(!folio)) {
-			ret = VM_FAULT_OOM;
-			folio = swapcache;
-			goto out_page;
-		} else if (unlikely(folio == ERR_PTR(-EHWPOISON))) {
-			ret = VM_FAULT_HWPOISON;
-			folio = swapcache;
-			goto out_page;
-		} else if (folio != swapcache)
-			page = folio_page(folio, 0);
+	swap_update_readahead(folio, vma, vmf->address);
 
-		/*
-		 * If we want to map a page that's in the swapcache writable, we
-		 * have to detect via the refcount if we're really the exclusive
-		 * owner. Try removing the extra reference from the local LRU
-		 * caches if required.
-		 */
-		if ((vmf->flags & FAULT_FLAG_WRITE) && folio == swapcache &&
-		    !folio_test_ksm(folio) && !folio_test_lru(folio))
-			lru_add_drain();
+	/*
+	 * KSM sometimes has to copy on read faults, for example, if
+	 * page->index of !PageKSM() pages would be nonlinear inside the
+	 * anon VMA -- PageKSM() is lost on actual swapout.
+	 */
+	folio = ksm_might_need_to_copy(folio, vma, vmf->address);
+	if (unlikely(!folio)) {
+		ret = VM_FAULT_OOM;
+		folio = swapcache;
+		goto out_page;
+	} else if (unlikely(folio == ERR_PTR(-EHWPOISON))) {
+		ret = VM_FAULT_HWPOISON;
+		folio = swapcache;
+		goto out_page;
+	} else if (folio != swapcache) {
+		page = folio_file_page(folio, swp_offset(entry));
 	}
+	/*
+	 * If we want to map a page that's in the swapcache writable, we
+	 * have to detect via the refcount if we're really the exclusive
+	 * owner. Try removing the extra reference from the local LRU
+	 * caches if required.
+	 */
+	if ((vmf->flags & FAULT_FLAG_WRITE) && folio == swapcache &&
+	    !folio_test_ksm(folio) && !folio_test_lru(folio))
+		lru_add_drain();
+
 	folio_throttle_swaprate(folio, GFP_KERNEL);
 
 	/*
@@ -4710,44 +4695,41 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 			goto out_nomap;
 	}
 
-	/* allocated large folios for SWP_SYNCHRONOUS_IO */
-	if (folio_test_large(folio) && !folio_test_swapcache(folio)) {
-		unsigned long nr = folio_nr_pages(folio);
-		unsigned long folio_start = ALIGN_DOWN(vmf->address, nr * PAGE_SIZE);
-		unsigned long idx = (vmf->address - folio_start) / PAGE_SIZE;
-		pte_t *folio_ptep = vmf->pte - idx;
-		pte_t folio_pte = ptep_get(folio_ptep);
-
-		if (!pte_same(folio_pte, pte_move_swp_offset(vmf->orig_pte, -idx)) ||
-		    swap_pte_batch(folio_ptep, nr, folio_pte) != nr)
-			goto out_nomap;
-
-		page_idx = idx;
-		address = folio_start;
-		ptep = folio_ptep;
-		goto check_folio;
-	}
-
 	nr_pages = 1;
 	page_idx = 0;
 	address = vmf->address;
 	ptep = vmf->pte;
-	if (folio_test_large(folio) && folio_test_swapcache(folio)) {
+	if (folio_test_large(folio)) {
 		unsigned long nr = folio_nr_pages(folio);
 		unsigned long idx = folio_page_idx(folio, page);
-		unsigned long folio_address = address - idx * PAGE_SIZE;
+		unsigned long folio_address = vmf->address - idx * PAGE_SIZE;
 		pte_t *folio_ptep = vmf->pte - idx;
 
-		if (!can_swapin_thp(vmf, folio_ptep, folio_address, nr))
+		if (can_swapin_thp(vmf, folio_ptep, folio_address, nr)) {
+			page_idx = idx;
+			address = folio_address;
+			ptep = folio_ptep;
+			nr_pages = nr;
+			entry = folio->swap;
+			page = &folio->page;
 			goto check_folio;
-
-		page_idx = idx;
-		address = folio_address;
-		ptep = folio_ptep;
-		nr_pages = nr;
-		entry = folio->swap;
-		page = &folio->page;
+		}
+		/*
+		 * If it's a fresh large folio in the swap cache but the
+		 * page table supporting it is gone, drop it and fallback
+		 * to order 0 swap in again.
+		 *
+		 * The folio must be clean, nothing should have touched
+		 * it, shmem removes the folio from swap cache upon
+		 * swapin, and anon flag won't be gone once set.
+		 * TODO: We might want to split or partially map it.
+		 */
+		if (!folio_test_anon(folio)) {
+			WARN_ON_ONCE(folio_test_dirty(folio));
+			delete_from_swap_cache(folio);
+			goto out_nomap;
+		}
 	}
 
 check_folio:
@@ -4767,7 +4749,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	 * the swap entry concurrently) for certainly exclusive pages.
 	 */
 	if (!folio_test_ksm(folio)) {
-		exclusive = pte_swp_exclusive(vmf->orig_pte);
+		exclusive = check_swap_exclusive(folio, entry, ptep, nr_pages);
 		if (folio != swapcache) {
 			/*
 			 * We have a fresh page that is not exposed to the
@@ -4805,15 +4787,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	 */
 	arch_swap_restore(folio_swap(entry, folio), folio);
 
-	/*
-	 * Remove the swap entry and conditionally try to free up the swapcache.
-	 * We're already holding a reference on the page but haven't mapped it
-	 * yet.
-	 */
-	swap_free_nr(entry, nr_pages);
-	if (should_try_to_free_swap(folio, vma, vmf->flags))
-		folio_free_swap(folio);
-
 	add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
 	add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages);
 	pte = mk_pte(page, vma->vm_page_prot);
@@ -4849,14 +4822,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 		folio_add_new_anon_rmap(folio, vma, address, RMAP_EXCLUSIVE);
 		folio_add_lru_vma(folio, vma);
 	} else if (!folio_test_anon(folio)) {
-		/*
-		 * We currently only expect small !anon folios which are either
-		 * fully exclusive or fully shared, or new allocated large
-		 * folios which are fully exclusive. If we ever get large
-		 * folios within swapcache here, we have to be careful.
-		 */
-		VM_WARN_ON_ONCE(folio_test_large(folio) && folio_test_swapcache(folio));
-		VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio);
+		VM_WARN_ON_ONCE_FOLIO(folio_nr_pages(folio) != nr_pages, folio);
+		VM_WARN_ON_ONCE_FOLIO(folio_mapped(folio), folio);
 		folio_add_new_anon_rmap(folio, vma, address, rmap_flags);
 	} else {
 		folio_add_anon_rmap_ptes(folio, page, nr_pages, vma, address,
@@ -4869,7 +4836,16 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	arch_do_swap_page_nr(vma->vm_mm, vma, address,
 			pte, pte, nr_pages);
 
+	/*
+	 * Remove the swap entry and conditionally try to free up the
+	 * swapcache then unlock the folio. Do this after the PTEs are
+	 * set, so raced faults will see updated PTEs.
+	 */
+	swap_free_nr(entry, nr_pages);
+	if (should_try_to_free_swap(folio, vma, vmf->flags))
+		folio_free_swap(folio);
 	folio_unlock(folio);
+
 	if (folio != swapcache && swapcache) {
 		/*
 		 * Hold the lock to avoid the swap entry to be reused
@@ -4896,12 +4872,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	if (vmf->pte)
 		pte_unmap_unlock(vmf->pte, vmf->ptl);
 out:
-	/* Clear the swap cache pin for direct swapin after PTL unlock */
-	if (need_clear_cache) {
-		swapcache_clear(si, entry, nr_pages);
-		if (waitqueue_active(&swapcache_wq))
-			wake_up(&swapcache_wq);
-	}
 	if (si)
 		put_swap_device(si);
 	return ret;
@@ -4916,11 +4886,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 		folio_unlock(swapcache);
 		folio_put(swapcache);
 	}
-	if (need_clear_cache) {
-		swapcache_clear(si, entry, nr_pages);
-		if (waitqueue_active(&swapcache_wq))
-			wake_up(&swapcache_wq);
-	}
 	if (si)
 		put_swap_device(si);
 	return ret;
-- 
2.49.0