From: Kairui Song <ryncsn@gmail.com>
To: linux-mm@kvack.org
Cc: Andrew Morton, Hugh Dickins, Baolin Wang, Matthew Wilcox, Kemeng Shi,
	Chris Li, Nhat Pham, Baoquan He, Barry Song, linux-kernel@vger.kernel.org,
	Kairui Song, stable@vger.kernel.org
Subject: [PATCH v2 1/4] mm/shmem, swap: improve cached mTHP handling and fix potential hang
Date: Fri, 20 Jun 2025 01:55:35 +0800
Message-ID: <20250619175538.15799-2-ryncsn@gmail.com>
X-Mailer: git-send-email 2.50.0
In-Reply-To: <20250619175538.15799-1-ryncsn@gmail.com>
References: <20250619175538.15799-1-ryncsn@gmail.com>
Reply-To: Kairui Song <ryncsn@gmail.com>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit

From: Kairui Song <ryncsn@gmail.com>

The current swap-in code assumes that, when a swap entry in the shmem
mapping is order 0, its cached folios (if present) must be order 0 too,
which turns out not to always be true.

The problem is that shmem_split_large_entry is called before verifying
that the folio will eventually be swapped in. One possible race is:

CPU1                               CPU2
shmem_swapin_folio
/* swap in of order > 0 swap entry S1 */
  folio = swap_cache_get_folio
  /* folio = NULL */
  order = xa_get_order
  /* order > 0 */
  folio = shmem_swap_alloc_folio
  /* mTHP alloc failure, folio = NULL */
  <... Interrupted ...>
                                   shmem_swapin_folio
                                   /* S1 is swapped in */
                                   shmem_writeout
                                   /* S1 is swapped out, folio cached */
  shmem_split_large_entry(..., S1)
  /* S1 is split, but the folio covering
     it has order > 0 now */

Now any following swap-in of S1 will hang: `xa_get_order` returns 0,
while the folio lookup returns a folio with order > 0, so the
`xa_get_order(&mapping->i_pages, index) != folio_order(folio)` check is
always true, causing swap-in to return -EEXIST and retry forever. The
check is also fragile in general.

Fix this up by allowing a larger folio to be seen in the swap cache,
and by checking that the whole shmem mapping range covered by the
swap-in holds the expected swap entries when the folio is inserted.
Also drop the redundant tree walks before the insertion.

This actually improves performance, as it avoids two redundant XArray
tree walks in the hot path. The only side effect is that, in the
failure path, shmem may redundantly reallocate a few folios, causing
temporary slight memory pressure.
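For illustration only (not part of the patch): the following is a
minimal, self-contained userspace sketch of the range check described
above, using hypothetical names (struct slot, range_is_expected). Every
conflicting entry must carry the expected swap value for its offset,
and after the walk the iterator must have advanced exactly nr slots
past the starting value, mirroring the new `iter.val - nr != swap.val`
check in the diff below.

#include <stdbool.h>
#include <stdio.h>

/*
 * Hypothetical stand-in for one mapping slot: the swap value stored
 * there plus the order of the entry (an order-N entry covers 1 << N
 * slots). Illustration only; not the kernel's data structure.
 */
struct slot {
	unsigned long val;
	unsigned int order;
};

/*
 * Mirror of the idea behind the new check: every conflicting entry in
 * the range must match the expected swap value for its offset, and
 * after the walk the iterator must have advanced exactly nr slots
 * past the starting value (iter - nr == swap).
 */
static bool range_is_expected(const struct slot *slots, unsigned int nslots,
			      unsigned long swap, unsigned long nr)
{
	unsigned long iter = swap;
	unsigned int i;

	for (i = 0; i < nslots; i++) {
		if (slots[i].val != iter)
			return false;	/* hole or unrelated entry */
		iter += 1UL << slots[i].order;
	}
	return iter - nr == swap;	/* full range covered, nothing extra */
}

int main(void)
{
	/* Two order-1 entries covering 4 slots starting at value 100. */
	struct slot ok[]  = { { 100, 1 }, { 102, 1 } };
	/* Same range, but the second entry no longer matches. */
	struct slot bad[] = { { 100, 1 }, { 200, 1 } };

	printf("ok:  %d\n", range_is_expected(ok, 2, 100, 4));	/* 1 */
	printf("bad: %d\n", range_is_expected(bad, 2, 100, 4));	/* 0 */
	return 0;
}

In the kernel the same walk is done with xas_for_each_conflict() under
the XArray lock; the sketch only demonstrates the coverage arithmetic.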
It may seem that the order and swap value checks before the insertion
help reduce lock contention, but that is not the case. The swap cache
layer ensures that a raced swap-in will either see a swap cache folio
or fail the swap-in (the SWAP_HAS_CACHE bit is set even when the swap
cache is bypassed), so holding the folio lock and checking the folio
flags is already good enough to avoid lock contention. The chance that
a folio passes the swap entry value check while the shmem mapping slot
has changed should be very low.

Cc: stable@vger.kernel.org
Fixes: 809bc86517cc ("mm: shmem: support large folio swap out")
Signed-off-by: Kairui Song <ryncsn@gmail.com>
Reviewed-by: Kemeng Shi
---
 mm/shmem.c | 30 +++++++++++++++++++++---------
 1 file changed, 21 insertions(+), 9 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index eda35be2a8d9..4e7ef343a29b 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -884,7 +884,9 @@ static int shmem_add_to_page_cache(struct folio *folio,
 				   pgoff_t index, void *expected, gfp_t gfp)
 {
 	XA_STATE_ORDER(xas, &mapping->i_pages, index, folio_order(folio));
-	long nr = folio_nr_pages(folio);
+	unsigned long nr = folio_nr_pages(folio);
+	swp_entry_t iter, swap;
+	void *entry;
 
 	VM_BUG_ON_FOLIO(index != round_down(index, nr), folio);
 	VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
@@ -896,14 +898,24 @@ static int shmem_add_to_page_cache(struct folio *folio,
 
 	gfp &= GFP_RECLAIM_MASK;
 	folio_throttle_swaprate(folio, gfp);
+	swap = iter = radix_to_swp_entry(expected);
 
 	do {
 		xas_lock_irq(&xas);
-		if (expected != xas_find_conflict(&xas)) {
-			xas_set_err(&xas, -EEXIST);
-			goto unlock;
+		xas_for_each_conflict(&xas, entry) {
+			/*
+			 * The range must either be empty, or filled with
+			 * expected swap entries. Shmem swap entries are never
+			 * partially freed without split of both entry and
+			 * folio, so there shouldn't be any holes.
+			 */
+			if (!expected || entry != swp_to_radix_entry(iter)) {
+				xas_set_err(&xas, -EEXIST);
+				goto unlock;
+			}
+			iter.val += 1 << xas_get_order(&xas);
 		}
-		if (expected && xas_find_conflict(&xas)) {
+		if (expected && iter.val - nr != swap.val) {
 			xas_set_err(&xas, -EEXIST);
 			goto unlock;
 		}
@@ -2323,7 +2335,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
 			error = -ENOMEM;
 			goto failed;
 		}
-	} else if (order != folio_order(folio)) {
+	} else if (order > folio_order(folio)) {
 		/*
 		 * Swap readahead may swap in order 0 folios into swapcache
 		 * asynchronously, while the shmem mapping can still stores
@@ -2348,15 +2360,15 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
 			swap = swp_entry(swp_type(swap),
 					 swp_offset(swap) + offset);
 		}
+	} else if (order < folio_order(folio)) {
+		swap.val = round_down(swp_type(swap), folio_order(folio));
 	}
 
 alloced:
 	/* We have to do this with folio locked to prevent races */
 	folio_lock(folio);
 	if ((!skip_swapcache && !folio_test_swapcache(folio)) ||
-	    folio->swap.val != swap.val ||
-	    !shmem_confirm_swap(mapping, index, swap) ||
-	    xa_get_order(&mapping->i_pages, index) != folio_order(folio)) {
+	    folio->swap.val != swap.val) {
 		error = -EEXIST;
 		goto unlock;
 	}
-- 
2.50.0