From: Kairui Song <ryncsn@gmail.com>
To: linux-mm@kvack.org
Cc: Andrew Morton, Hugh Dickins, Baolin Wang, Matthew Wilcox, Kemeng Shi,
	Chris Li, Nhat Pham, Baoquan He, Barry Song, linux-kernel@vger.kernel.org,
	Kairui Song, stable@vger.kernel.org
Subject: [PATCH v3 1/7] mm/shmem, swap: improve cached mTHP handling and fix potential hung
Date: Fri, 27 Jun 2025 14:20:14 +0800
Message-ID: <20250627062020.534-2-ryncsn@gmail.com>
X-Mailer: git-send-email 2.50.0
In-Reply-To: <20250627062020.534-1-ryncsn@gmail.com>
References: <20250627062020.534-1-ryncsn@gmail.com>
Reply-To: Kairui Song <ryncsn@gmail.com>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit

From: Kairui Song <ryncsn@gmail.com>

The current swap-in code assumes that, when a swap entry in the shmem
mapping is order 0, its cached folios (if present) must be order 0 too,
which turns out not to always be correct.

The problem is that shmem_split_large_entry() is called before verifying
that the folio will eventually be swapped in. One possible race is:

CPU1                                  CPU2
shmem_swapin_folio
/* swap in of order > 0 swap entry S1 */
  folio = swap_cache_get_folio
  /* folio = NULL */
  order = xa_get_order
  /* order > 0 */
  folio = shmem_swap_alloc_folio
  /* mTHP alloc failure, folio = NULL */
  <... Interrupted ...>
                                      shmem_swapin_folio
                                      /* S1 is swapped in */
                                      shmem_writeout
                                      /* S1 is swapped out, folio cached */
shmem_split_large_entry(..., S1)
/* S1 is split, but the folio covering it has order > 0 now */

Now any following swap-in of S1 will hang: `xa_get_order` returns 0,
while the folio lookup keeps returning a folio with order > 0, so the
`xa_get_order(&mapping->i_pages, index) != folio_order(folio)` check
never passes, and swap-in keeps returning -EEXIST. This check is also
fragile.

So fix this up by allowing a larger folio to be seen in the swap cache,
and instead checking that the whole shmem mapping range covered by the
swap-in holds the expected swap values when the folio is inserted,
dropping the redundant tree walks before the insertion.

This actually improves performance, as it avoids two redundant XArray
tree walks in the hot path. The only side effect is that, in the failure
path, shmem may redundantly reallocate a few folios, causing temporary,
slight memory pressure.
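Purely as an illustration, the idea of the new range check can be
modelled in standalone userspace C (this is not kernel code: the flat
slot array and the helper names below are made up; it only mirrors the
logic of walking the covered slots, stepping by each stored entry's
order, and requiring the walk to cover the full range):

#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical model: each slot of a shmem mapping holds a swap value
 * and the order of the entry stored there. The kernel's XArray tracks
 * this implicitly; a flat array is used here only for clarity.
 */
struct slot {
	unsigned long val;   /* swap entry value stored in this slot */
	unsigned int order;  /* order of the stored entry */
};

/*
 * Return 0 if the 'nr' slots starting at swap value 'swap' hold the
 * expected consecutive swap entries, -1 otherwise. This mirrors the
 * intent of the xas_for_each_conflict() loop plus the final
 * 'iter.val - nr != swap.val' check, not the exact implementation.
 */
static int range_covers_swap(const struct slot *slots, unsigned long nr,
			     unsigned long swap)
{
	unsigned long iter = swap;
	unsigned long i = 0;

	while (i < nr) {
		if (slots[i].val != iter)
			return -1;              /* hole or foreign entry */
		iter += 1UL << slots[i].order;  /* step by entry size */
		i += 1UL << slots[i].order;
	}
	return (iter - nr == swap) ? 0 : -1;    /* whole range covered? */
}

int main(void)
{
	/* A 4-page range backed by one order-2 entry starting at value 100. */
	struct slot good[4] = { {100, 2}, {100, 2}, {100, 2}, {100, 2} };
	/* Same range, but one slot was replaced by an unrelated entry. */
	struct slot bad[4]  = { {100, 0}, {101, 0}, {999, 0}, {103, 0} };

	assert(range_covers_swap(good, 4, 100) == 0);
	assert(range_covers_swap(bad, 4, 100) != 0);
	puts("range checks behave as expected");
	return 0;
}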
It may seem that keeping the order and value check before inserting the
folio would help reduce lock contention, but that is not true. The swap
cache layer ensures that a raced swap-in will either see the swap cache
folio or fail the swap-in (SWAP_HAS_CACHE is set even when the swap
cache is bypassed), so holding the folio lock and checking the folio
flag is already enough to avoid lock contention. The chance that a folio
passes the swap entry value check while the shmem mapping slot has
changed should be very low.

Fixes: 809bc86517cc ("mm: shmem: support large folio swap out")
Signed-off-by: Kairui Song <ryncsn@gmail.com>
Reviewed-by: Kemeng Shi
Cc: <stable@vger.kernel.org>
---
 mm/shmem.c | 30 +++++++++++++++++++++---------
 1 file changed, 21 insertions(+), 9 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index 334b7b4a61a0..e3c9a1365ff4 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -884,7 +884,9 @@ static int shmem_add_to_page_cache(struct folio *folio,
 				   pgoff_t index, void *expected, gfp_t gfp)
 {
 	XA_STATE_ORDER(xas, &mapping->i_pages, index, folio_order(folio));
-	long nr = folio_nr_pages(folio);
+	unsigned long nr = folio_nr_pages(folio);
+	swp_entry_t iter, swap;
+	void *entry;
 
 	VM_BUG_ON_FOLIO(index != round_down(index, nr), folio);
 	VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
@@ -896,14 +898,24 @@ static int shmem_add_to_page_cache(struct folio *folio,
 
 	gfp &= GFP_RECLAIM_MASK;
 	folio_throttle_swaprate(folio, gfp);
+	swap = iter = radix_to_swp_entry(expected);
 
 	do {
 		xas_lock_irq(&xas);
-		if (expected != xas_find_conflict(&xas)) {
-			xas_set_err(&xas, -EEXIST);
-			goto unlock;
+		xas_for_each_conflict(&xas, entry) {
+			/*
+			 * The range must either be empty, or filled with
+			 * expected swap entries. Shmem swap entries are never
+			 * partially freed without split of both entry and
+			 * folio, so there shouldn't be any holes.
+			 */
+			if (!expected || entry != swp_to_radix_entry(iter)) {
+				xas_set_err(&xas, -EEXIST);
+				goto unlock;
+			}
+			iter.val += 1 << xas_get_order(&xas);
 		}
-		if (expected && xas_find_conflict(&xas)) {
+		if (expected && iter.val - nr != swap.val) {
 			xas_set_err(&xas, -EEXIST);
 			goto unlock;
 		}
@@ -2323,7 +2335,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
 			error = -ENOMEM;
 			goto failed;
 		}
-	} else if (order != folio_order(folio)) {
+	} else if (order > folio_order(folio)) {
 		/*
 		 * Swap readahead may swap in order 0 folios into swapcache
 		 * asynchronously, while the shmem mapping can still stores
@@ -2348,15 +2360,15 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
 
 			swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
 		}
+	} else if (order < folio_order(folio)) {
+		swap.val = round_down(swap.val, 1 << folio_order(folio));
 	}
 
 alloced:
 	/* We have to do this with folio locked to prevent races */
 	folio_lock(folio);
 	if ((!skip_swapcache && !folio_test_swapcache(folio)) ||
-	    folio->swap.val != swap.val ||
-	    !shmem_confirm_swap(mapping, index, swap) ||
-	    xa_get_order(&mapping->i_pages, index) != folio_order(folio)) {
+	    folio->swap.val != swap.val) {
 		error = -EEXIST;
 		goto unlock;
 	}
-- 
2.50.0
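As a side note on the arithmetic used by the patch, here is a small
standalone C sketch (illustrative values only, not kernel code; the
round_down() stand-in assumes a power-of-two alignment) of how the
expected swap value is realigned when the cached folio and the mapping
entry have different orders: the new `order < folio_order(folio)` branch
rounds the value down to the folio boundary, while the existing split
path adds the in-entry offset back.

#include <assert.h>
#include <stdio.h>

/* Simplified stand-in for the kernel's round_down(); only valid for
 * power-of-two alignments, which is all this example needs. */
#define round_down(x, y) ((x) & ~((unsigned long)(y) - 1))

int main(void)
{
	unsigned long folio_order = 2;   /* cached folio covers 4 entries */
	unsigned long entry_val   = 103; /* order-0 entry hit by this index */

	/* order < folio_order(folio): align the expected value down to the
	 * folio boundary, as the new 'else if' branch does. */
	unsigned long base = round_down(entry_val, 1UL << folio_order);
	assert(base == 100);

	/* order > folio_order(folio): after a split, the entry for 'index'
	 * is recalculated by adding the offset inside the old large entry,
	 * mirroring the existing swp_offset(swap) + offset logic. */
	unsigned long index = 7, split_order = 2;
	unsigned long offset = index - round_down(index, 1UL << split_order);
	assert(offset == 3);

	printf("base=%lu offset=%lu\n", base, offset);
	return 0;
}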