Date: Tue, 1 Jul 2025 09:57:05 +0800
MIME-Version: 1.0
User-Agent: Mozilla Thunderbird
Subject: Re: [PATCH v3 4/7]
 mm/shmem, swap: clean up swap entry splitting
To: Kairui Song
Cc: linux-mm@kvack.org, Andrew Morton, Hugh Dickins, Matthew Wilcox,
 Kemeng Shi, Chris Li, Nhat Pham, Baoquan He, Barry Song,
 linux-kernel@vger.kernel.org
References: <20250627062020.534-1-ryncsn@gmail.com>
 <20250627062020.534-5-ryncsn@gmail.com>
 <3d317ce9-2304-4361-adda-32abcc06e3dd@linux.alibaba.com>
 <1102fb2b-3e2e-4ae2-8609-cff6a4b0d821@linux.alibaba.com>
From: Baolin Wang
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit

On 2025/7/1 02:19, Kairui Song wrote:
> On Mon, Jun 30, 2025 at 7:59 PM Baolin Wang wrote:
>>
>> On 2025/6/30 18:06, Kairui Song wrote:
>>> On Mon, Jun 30, 2025 at 5:53 PM Baolin Wang wrote:
>>>> On 2025/6/30 17:16, Kairui Song wrote:
>>>>> On Mon, Jun 30, 2025 at 2:34 PM Baolin Wang wrote:
>>>>>> On 2025/6/27 14:20, Kairui Song wrote:
>>>>>>> From: Kairui Song
>>>>>>>
>>>>>>> Instead of keeping different paths of splitting the entry and
>>>>>>> recalculating the swap entry and index, do it in one place.
>>>>>>>
>>>>>>> Whenever swapin brought in a folio smaller than the entry, split the
>>>>>>> entry. And always recalculate the entry and index, in case it might
>>>>>>> read in a folio that's larger than the entry order. This removes
>>>>>>> duplicated code and function calls, and makes the code more robust.
>>>>>>>
>>>>>>> Signed-off-by: Kairui Song
>>>>>>> ---
>>>>>>>  mm/shmem.c | 103 +++++++++++++++++++++--------------------------------
>>>>>>>  1 file changed, 41 insertions(+), 62 deletions(-)
>>>>>>>
>>>>>>> diff --git a/mm/shmem.c b/mm/shmem.c
>>>>>>> index f85a985167c5..5be9c905396e 100644
>>>>>>> --- a/mm/shmem.c
>>>>>>> +++ b/mm/shmem.c
>>>>>>> @@ -2178,8 +2178,12 @@ static void shmem_set_folio_swapin_error(struct inode *inode, pgoff_t index,
>>>>>>>  	swap_free_nr(swap, nr_pages);
>>>>>>>  }
>>>>>>>
>>>>>>> -static int shmem_split_large_entry(struct inode *inode, pgoff_t index,
>>>>>>> -				   swp_entry_t swap, gfp_t gfp)
>>>>>>> +/*
>>>>>>> + * Split an existing large swap entry. @index should point to one sub mapping
>>>>>>> + * slot within the entry @swap, this sub slot will be split into order 0.
>>>>>>> + */
>>>>>>> +static int shmem_split_swap_entry(struct inode *inode, pgoff_t index,
>>>>>>> +				  swp_entry_t swap, gfp_t gfp)
>>>>>>>  {
>>>>>>>  	struct address_space *mapping = inode->i_mapping;
>>>>>>>  	XA_STATE_ORDER(xas, &mapping->i_pages, index, 0);
>>>>>>> @@ -2250,7 +2254,7 @@ static int shmem_split_large_entry(struct inode *inode, pgoff_t index,
>>>>>>>  	if (xas_error(&xas))
>>>>>>>  		return xas_error(&xas);
>>>>>>>
>>>>>>> -	return entry_order;
>>>>>>> +	return 0;
>>>>>>>  }
>>>>>>>
>>>>>>>  /*
>>>>>>> @@ -2267,11 +2271,11 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
>>>>>>>  	struct address_space *mapping = inode->i_mapping;
>>>>>>>  	struct mm_struct *fault_mm = vma ? vma->vm_mm : NULL;
>>>>>>>  	struct shmem_inode_info *info = SHMEM_I(inode);
>>>>>>> +	int error, nr_pages, order, swap_order;
>>>>>>>  	struct swap_info_struct *si;
>>>>>>>  	struct folio *folio = NULL;
>>>>>>>  	bool skip_swapcache = false;
>>>>>>>  	swp_entry_t swap;
>>>>>>> -	int error, nr_pages, order, split_order;
>>>>>>>
>>>>>>>  	VM_BUG_ON(!*foliop || !xa_is_value(*foliop));
>>>>>>>  	swap = radix_to_swp_entry(*foliop);
>>>>>>> @@ -2321,70 +2325,43 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
>>>>>>>  			goto failed;
>>>>>>>  		}
>>>>>>>
>>>>>>> -		/*
>>>>>>> -		 * Now swap device can only swap in order 0 folio, then we
>>>>>>> -		 * should split the large swap entry stored in the pagecache
>>>>>>> -		 * if necessary.
>>>>>>> -		 */
>>>>>>> -		split_order = shmem_split_large_entry(inode, index, swap, gfp);
>>>>>>> -		if (split_order < 0) {
>>>>>>> -			error = split_order;
>>>>>>> -			goto failed;
>>>>>>> -		}
>>>>>>> -
>>>>>>> -		/*
>>>>>>> -		 * If the large swap entry has already been split, it is
>>>>>>> -		 * necessary to recalculate the new swap entry based on
>>>>>>> -		 * the old order alignment.
>>>>>>> -		 */
>>>>>>> -		if (split_order > 0) {
>>>>>>> -			pgoff_t offset = index - round_down(index, 1 << split_order);
>>>>>>> -
>>>>>>> -			swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
>>>>>>> -		}
>>>>>>> -
>>>>>>>  		/* Here we actually start the io */
>>>>>>>  		folio = shmem_swapin_cluster(swap, gfp, info, index);
>>>>>>>  		if (!folio) {
>>>>>>>  			error = -ENOMEM;
>>>>>>>  			goto failed;
>>>>>>>  		}
>>>>>>> -	} else if (order > folio_order(folio)) {
>>>>>>> -		/*
>>>>>>> -		 * Swap readahead may swap in order 0 folios into swapcache
>>>>>>> -		 * asynchronously, while the shmem mapping can still stores
>>>>>>> -		 * large swap entries. In such cases, we should split the
>>>>>>> -		 * large swap entry to prevent possible data corruption.
>>>>>>> -		 */
>>>>>>> -		split_order = shmem_split_large_entry(inode, index, swap, gfp);
>>>>>>> -		if (split_order < 0) {
>>>>>>> -			folio_put(folio);
>>>>>>> -			folio = NULL;
>>>>>>> -			error = split_order;
>>>>>>> -			goto failed;
>>>>>>> -		}
>>>>>>> -
>>>>>>> -		/*
>>>>>>> -		 * If the large swap entry has already been split, it is
>>>>>>> -		 * necessary to recalculate the new swap entry based on
>>>>>>> -		 * the old order alignment.
>>>>>>> -		 */
>>>>>>> -		if (split_order > 0) {
>>>>>>> -			pgoff_t offset = index - round_down(index, 1 << split_order);
>>>>>>> -
>>>>>>> -			swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
>>>>>>> -		}
>>>>>>> -	} else if (order < folio_order(folio)) {
>>>>>>> -		swap.val = round_down(swap.val, 1 << folio_order(folio));
>>>>>>>  	}
>>>>>>>
>>>>>>>  alloced:
>>>>>>> +	/*
>>>>>>> +	 * We need to split an existing large entry if swapin brought in a
>>>>>>> +	 * smaller folio due to various of reasons.
>>>>>>> +	 *
>>>>>>> +	 * And worth noting there is a special case: if there is a smaller
>>>>>>> +	 * cached folio that covers @swap, but not @index (it only covers
>>>>>>> +	 * first few sub entries of the large entry, but @index points to
>>>>>>> +	 * later parts), the swap cache lookup will still see this folio,
>>>>>>> +	 * And we need to split the large entry here. Later checks will fail,
>>>>>>> +	 * as it can't satisfy the swap requirement, and we will retry
>>>>>>> +	 * the swapin from beginning.
>>>>>>> +	 */
>>>>>>> +	swap_order = folio_order(folio);
>>>>>>
>>>>>> Nit: 'swap_order' is confusing; can you just use folio_order() or a
>>>>>> better name?
>>>>>
>>>>> Good idea.
>>>>>
>>>>>>
>>>>>>> +	if (order > swap_order) {
>>>>>>> +		error = shmem_split_swap_entry(inode, index, swap, gfp);
>>>>>>> +		if (error)
>>>>>>> +			goto failed_nolock;
>>>>>>> +	}
>>>>>>> +
>>>>>>> +	index = round_down(index, 1 << swap_order);
>>>>>>> +	swap.val = round_down(swap.val, 1 << swap_order);
>>>>>>
>>>>>> The round_down() of index and swap value here may cause
>>>>>> shmem_add_to_page_cache() to fail to insert a new folio, because the
>>>>>> swap value stored at that index in the shmem mapping does not match,
>>>>>> leading to another swapin page fault for correction.
>>>>>>
>>>>>> For example, shmem stores a large swap entry of order 4 in the range of
>>>>>> index 0-15. When a swapin fault occurs at index = 3, with swap.val =
>>>>>> 0x4000, if a split happens and this round_down() logic is applied, then
>>>>>> index = 3, swap.val = 0x4000. However, the actual swap.val stored in the
>>>>>> shmem mapping should be 0x4003. This would cause another swapin fault.
>>>>>
>>>>> Oops, I missed a swap value fixup in the !SWP_SYNCHRONOUS_IO path
>>>>> above, it should re-calculate the swap value there. It's fixed in the
>>>>> final patch but left unhandled here. I'll update this part.
>>>>>
>>>>>>
>>>>>> I still prefer my original alignment method; do you find that it will
>>>>>> cause any issues?
>>>>>>
>>>>>> "
>>>>>> if (split_order > 0) {
>>>>>> 	pgoff_t offset = index - round_down(index, 1 << split_order);
>>>>>>
>>>>>> 	swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
>>>>>> }
>>>>>> "
>>>>>
>>>>> It only fits the cached swapin and uncached swapin, not the cache hit
>>>>> case. Cache hits may see a larger folio so the split didn't happen, but
>>>>> the round_down is still needed.
>>>>
>>>> IMO, this only fits the large swap entry split case.
>>>>
>>>>> And there is another racy case: before this patch, the split may
>>>>> happen first, but shmem_swapin_cluster brought in a large folio due to
>>>>> a race in the swap cache layer.
>>>>
>>>> shmem_swapin_cluster() can only allocate order 0 folios, right?
>>>
>>> It can only allocate order 0 folios, but it can hit a large folio: e.g.
>>> a parallel swapin/swapout happened, and the folio stays in the swap
>>> cache while we are handling a swapin here.
>>
>> Yes, I know this. This is exactly the issue that patch 1 wants to fix.
>
> Hmm, patch 1 prevents the hang but doesn't prevent things like a
> duplicated fault:
>
> Starts with:
> swap entry val = 0x400, order = 4, covering index 0 - 15, faulting index 3:
>
> Before this patch:
> CPU0                                      CPU1
> shmem_swapin_folio (see order = 4)
>                                           shmem_swapin_folio (see order = 4)
> /* fallback to order 0 due to */
> /* mem pressure / temporary pin / etc */
> shmem_split_large_entry
> /* split to order 0 */
> /* interrupted */
>                                           /* swapin done with order = 4 folio */
>                                           /* swapout again, leave the large folio
>                                              in cache temporarily */
> folio = swap_cluster_readahead(0x403)
> /* Gets folio order = 4, folio->swap = 0x400
>    since swap_cluster_readahead uses swap cache */
> folio->swap.val != swap.val
> /* ! Above test failed ! */
> /* It shouldn't fail, the round down is needed */

OK. Thanks for the explanation. Yes, this might cause a refault.

> This patch moved the split after the swapin so it should be OK now,

Yes, I agree with 'moving the split after the swapin'.

> but still the split_order could be unstable, see below:
>>>>> And I'm not sure if split_order is always reliable here, for example
>>>>> concurrent split may return an inaccurate value here.
>>>>
>>>> We've held the xas lock to ensure the split is reliable; even though
>>>> concurrent splits may occur, only one split can get the large
>>>> 'split_order', the other will return 0 (since it will see the large swap
>>>> entry has already been split).
>>>
>>> Yes, it may return 0, so we can get a large folio here, but get
>>> `split_order = 0`?
>>
>> If a split happens, that means 'order' > folio_order(), right? How
>> can you get a large folio in this context?
>>
>>> And if concurrent swapout/swapin happened, the `split_order` could
>>> be a different value?
>>
>> What do you mean by a different value? The large swap entry can only be
>> split once, so the 'split_order' must be 0 or the original large order.
>
> Since d53c78fffe7ad, shmem_split_large_entry doesn't split every slot
> into order 0 IIUC, so things get complex if two CPUs are faulting on
> different indexes landing into two different splitting zones:
>
> Before this patch:
> swap entry val = 0x400, order = 9, covering index 0 - 511, faulting index 3:
>
> CPU0                                      CPU1
> shmem_swapin_folio (index = 3)
>                                           shmem_swapin_folio (index = 510)
> /* Gets swap = 0x400 */                   /* Gets swap = 0x400 */
> /* fallback to order 0 */                 /* fallback to order 0 */
> split_order = shmem_split_large_entry
> /* get split_order = 512 */
> /* offset = 3 */
>                                           split_order = shmem_split_large_entry
>                                           /* get split_order = 0, but no split */
>                                           /* map order is 8, offset = 0 */
>                                           /* wrong offset */
>                                           shmem_swapin_cluster(0x400)
>                                           /* It should swapin 0x5fe */

Not true. IIUC, CPU1 will fail to split due to 'swp_to_radix_entry(swap) != old'
in shmem_split_large_entry(), and will retry again to fix this race.
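To spell out the re-validation I am referring to, the split path roughly
does the following under the xa_lock (just a simplified sketch of the
shmem_split_large_entry() logic, with types and setup omitted; the error
value here is only illustrative, not the exact upstream code):

	/* Sketch: revalidate the stored entry before actually splitting it */
	xas_lock_irq(&xas);
	old = xas_load(&xas);
	/* the slot must still hold the same large swap entry we looked up */
	if (old != swp_to_radix_entry(swap)) {
		/* someone else already split or replaced it: bail out */
		xas_unlock_irq(&xas);
		return -EEXIST;	/* illustrative; the fault is simply retried */
	}
	/* ... only now is the large entry split ... */
	xas_unlock_irq(&xas);

So the CPU that loses the race never proceeds with a stale offset; it just
goes through the fault path again with a fresh lookup.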
> After this patch (with the append fix which was left in the latest patch
> by mistake) it won't swap in the wrong entry now, but
> shmem_split_large_entry may still return an outdated order.

Like I said above, I don't think we can get an outdated order, right?

> That's two previous races I can come up with. These no longer exist
> after this patch; it's not a bug though, just redundant IO as far as I
> can see, because other checks will fall back. It looks a bit fragile
> though. But shmem_split_large_entry may still return an invalid order,
> just much less likely.
>
> I think the idea here is: because the `order =
> shmem_confirm_swap(mapping, index, swap)` ensures order is stable and
> corresponds to the entry value at one point, it is better to keep using
> that value (and so this patch does the offset calculation using the
> `order` retrieved there before issuing the swapin).

I don't think the 'order' obtained through shmem_confirm_swap() is stable,
because shmem_confirm_swap() is only protected by RCU. However, I think the
'split_order' obtained from shmem_split_large_entry() under the xas lock is
stable.

> And after the swapin has brought a folio in, simply round down using
> the folio's order, which should ensure the folio can be added
> successfully in any case, as long as the folio->swap and index fit the
> shmem mapping fine.
>
>>>> Based on your current patch, would the following modifications be clearer?
>>>>
>>>> diff --git a/mm/shmem.c b/mm/shmem.c
>>>> index 5be9c905396e..91c071fb7b67 100644
>>>> --- a/mm/shmem.c
>>>> +++ b/mm/shmem.c
>>>> @@ -2254,7 +2254,7 @@ static int shmem_split_swap_entry(struct inode *inode, pgoff_t index,
>>>>  	if (xas_error(&xas))
>>>>  		return xas_error(&xas);
>>>>
>>>> -	return 0;
>>>> +	return split_order;
>>>>  }
>>>>
>>>>  /*
>>>> @@ -2351,10 +2351,23 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
>>>>  		error = shmem_split_swap_entry(inode, index, swap, gfp);
>>>>  		if (error)
>>>>  			goto failed_nolock;
>>>> -	}
>>>>
>>>> -	index = round_down(index, 1 << swap_order);
>>>> -	swap.val = round_down(swap.val, 1 << swap_order);
>>>> +		/*
>>>> +		 * If the large swap entry has already been split, it is
>>>> +		 * necessary to recalculate the new swap entry based on
>>>> +		 * the old order alignment.
>>>> +		 */
>>>> +		if (split_order > 0) {
>
> The split_order could still be an outdated value, e.g. we may even get
> split_order = 0 with a large folio loaded here.

Ditto. I don't see how split_order can be an outdated value.

>>>> +			pgoff_t offset = index - round_down(index, 1 << split_order);
>>>> +
>>>> +			swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
>>>> +		}
>>>> +	} else if (order < folio_order(folio)) {
>>>> +		/*
>>>> +		 * TODO: explain the possible race...
>>>> +		 */
>>>> +		swap.val = round_down(swap.val, 1 << folio_order(folio));
>>>> +	}
>>
>> Sorry, your changes did not convince me. I still prefer my changes,
>> listing out the possible races that might require changes to the swap
>> value, as that would be clearer and more readable. Additionally, this is a
>> cleanup patch, so I hope there are no implicit functional changes.
>
> I was thinking that the fewer branches we have the better; I won't insist
> on this though.

But this is under the premise of keeping the code logic clear and simple.
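Just to make the arithmetic of my preferred fixup explicit, using the same
order-4 example from earlier in this thread (illustration only, same numbers
as above):

	/*
	 * Large entry at swap.val = 0x4000, order 4 (covers index 0-15),
	 * fault at index = 3, order-0 folio swapped in, split_order = 4:
	 */
	pgoff_t offset = index - round_down(index, 1 << split_order);
					/* 3 - round_down(3, 16) = 3 */
	swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
					/* 0x4000 + 3 = 0x4003 */

which matches the sub-entry stored at index 3 after the split, so
shmem_add_to_page_cache() can insert the folio without triggering another
fault.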
>>>>  		/* We have to do this with folio locked to prevent races */
>>>>  		folio_lock(folio);
>>>> @@ -2382,7 +2395,8 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
>>>>  		goto failed;
>>>>  	}
>>>>
>>>> -	error = shmem_add_to_page_cache(folio, mapping, index,
>>>> +	error = shmem_add_to_page_cache(folio, mapping,
>>>> +					round_down(index, nr_pages),
>>>>  					swp_to_radix_entry(swap), gfp);
>>>>  	if (error)
>>>>  		goto failed;
>>>>
>>>>> So I wanted to simplify it: by round_down(folio_order(folio)) we
>>>>> simply get the index and swap that will be covered by this specific
>>>>> folio; if the covered range still has the corresponding swap entries
>>>>> (checked and ensured by shmem_add_to_page_cache, which holds the
>>>>> xa_lock), then the folio will be inserted back safely and
>>>>> successfully.
>>>>
>>>
>>> I think adding the missing swap value fixup in the !SYNC_IO path
>>> should be good enough?
>>>
>>> diff --git a/mm/shmem.c b/mm/shmem.c
>>> index 5be9c905396e..2620e4d1b56a 100644
>>> --- a/mm/shmem.c
>>> +++ b/mm/shmem.c
>>> @@ -2276,6 +2276,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
>>>  	struct folio *folio = NULL;
>>>  	bool skip_swapcache = false;
>>>  	swp_entry_t swap;
>>> +	pgoff_t offset;
>>>
>>>  	VM_BUG_ON(!*foliop || !xa_is_value(*foliop));
>>>  	swap = radix_to_swp_entry(*foliop);
>>> @@ -2325,7 +2326,9 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
>>>  			goto failed;
>>>  		}
>>>
>>> -		/* Here we actually start the io */
>>> +		/* Cached swapin currently only supports order 0 swapin */
>>> +		/* It may hit a large folio but that's OK and handled below */
>>> +		offset = index - round_down(index, 1 << order);
>>> +		swap.val = swap.val + offset;
>
> BTW for this appended patch, it should be using a different variable so
> swap won't be changed, or it will cause a redundant fault again. My bad.
>
>>>  		folio = shmem_swapin_cluster(swap, gfp, info, index);
>>>  		if (!folio) {
>>>  			error = -ENOMEM;
>>
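To make sure I understand the fix you have in mind: something roughly like
the below, keeping 'swap' itself untouched and only using a temporary entry
for the cached swapin? (Just a sketch; the variable name is mine, purely for
illustration.)

	/* Sketch: compute the sub-entry for this index without clobbering 'swap' */
	pgoff_t offset = index - round_down(index, 1 << order);
	swp_entry_t swapin_entry = swp_entry(swp_type(swap),
					     swp_offset(swap) + offset);

	folio = shmem_swapin_cluster(swapin_entry, gfp, info, index);

That way the later checks and the round_down still operate on the original
'swap' value and won't refault.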