From mboxrd@z Thu Jan 1 00:00:00 1970
Subject: Re: [PATCH 4/4] mm/shmem, swap: avoid false positive swap cache lookup
To: Kairui Song , linux-mm@kvack.org
Cc: Andrew Morton , Hugh Dickins , Baolin Wang , Matthew Wilcox , Chris Li ,
 Nhat Pham , Baoquan He , Barry Song , linux-kernel@vger.kernel.org
References: <20250617183503.10527-1-ryncsn@gmail.com>
 <20250617183503.10527-5-ryncsn@gmail.com>
From: Kemeng Shi
Message-ID: 
Date: Thu, 19 Jun 2025 09:28:27 +0800
In-Reply-To: <20250617183503.10527-5-ryncsn@gmail.com>

On 6/18/2025 2:35 AM, Kairui Song wrote:
> From: Kairui Song
> 
> If the shmem read request's index points into the middle of a large
> swap entry, the shmem swapin path does the swap cache lookup using the
> large swap entry's starting value (the first sub swap entry of this
> large entry). This leads to a false positive lookup result if only the
> first few swap entries are cached, but the requested swap entry pointed
> to by index is uncached.
> 
> Currently shmem will split the large entry and then retry the swapin
> from the beginning, which wastes CPU and is fragile. Handle this
> correctly.
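(Side note for other readers, just restating the core of the fix as I
read it -- a rough sketch, not the exact patch code; the names are taken
from the diff below:

	/* locate the exact sub entry covered by @index before the lookup */
	pgoff_t offset = index - round_down(index, 1 << order);
	swp_entry_t swap = swp_entry(swp_type(swap_entry),
				     swp_offset(swap_entry) + offset);
	folio = swap_cache_get_folio(swap, NULL, 0);

so a large entry that is only partially cached can no longer produce a
false hit for an uncached @index. This is equivalent to the
"swap.val = swap_entry.val + offset" arithmetic the patch uses.)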
> 
> Also add some sanity checks to help understand the code and ensure
> things won't go wrong.
> 
> Signed-off-by: Kairui Song
> ---
>  mm/shmem.c | 61 ++++++++++++++++++++++++++----------------------------
>  1 file changed, 29 insertions(+), 32 deletions(-)
> 
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 46dea2fa1b43..0bc30dafad90 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -1977,12 +1977,12 @@ static struct folio *shmem_alloc_and_add_folio(struct vm_fault *vmf,
>  
>  static struct folio *shmem_swapin_direct(struct inode *inode,
>  		struct vm_area_struct *vma, pgoff_t index,
> -		swp_entry_t entry, int *order, gfp_t gfp)
> +		swp_entry_t swap_entry, swp_entry_t swap,
> +		int *order, gfp_t gfp)
>  {
>  	struct shmem_inode_info *info = SHMEM_I(inode);
>  	int nr_pages = 1 << *order;
>  	struct folio *new;
> -	pgoff_t offset;
>  	void *shadow;
>  
>  	/*
> @@ -2003,13 +2003,11 @@ static struct folio *shmem_swapin_direct(struct inode *inode,
>  	 */
>  	if ((vma && userfaultfd_armed(vma)) ||
>  	    !zswap_never_enabled() ||
> -	    non_swapcache_batch(entry, nr_pages) != nr_pages) {
> -		offset = index - round_down(index, nr_pages);
> -		entry = swp_entry(swp_type(entry),
> -				  swp_offset(entry) + offset);
> +	    non_swapcache_batch(swap_entry, nr_pages) != nr_pages) {
>  		*order = 0;
>  		nr_pages = 1;
>  	} else {
> +		swap.val = swap_entry.val;
>  		gfp_t huge_gfp = vma_thp_gfp_mask(vma);
>  
>  		gfp = limit_gfp_mask(huge_gfp, gfp);
> @@ -2021,7 +2019,7 @@ static struct folio *shmem_swapin_direct(struct inode *inode,
>  		return ERR_PTR(-ENOMEM);
>  
>  	if (mem_cgroup_swapin_charge_folio(new, vma ? vma->vm_mm : NULL,
> -					   gfp, entry)) {
> +					   gfp, swap)) {
>  		folio_put(new);
>  		return ERR_PTR(-ENOMEM);
>  	}
> @@ -2036,17 +2034,17 @@ static struct folio *shmem_swapin_direct(struct inode *inode,
>  	 * In this case, shmem_add_to_page_cache() will help identify the
>  	 * concurrent swapin and return -EEXIST.
>  	 */
> -	if (swapcache_prepare(entry, nr_pages)) {
> +	if (swapcache_prepare(swap, nr_pages)) {
>  		folio_put(new);
>  		return ERR_PTR(-EEXIST);
>  	}
>  
>  	__folio_set_locked(new);
>  	__folio_set_swapbacked(new);
> -	new->swap = entry;
> +	new->swap = swap;
>  
> -	memcg1_swapin(entry, nr_pages);
> -	shadow = get_shadow_from_swap_cache(entry);
> +	memcg1_swapin(swap, nr_pages);
> +	shadow = get_shadow_from_swap_cache(swap);
>  	if (shadow)
>  		workingset_refault(new, shadow);
>  	folio_add_lru(new);
> @@ -2278,20 +2276,21 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
>  	struct mm_struct *fault_mm = vma ? vma->vm_mm : NULL;
>  	struct shmem_inode_info *info = SHMEM_I(inode);
>  	int error, nr_pages, order, swap_order;
> +	swp_entry_t swap, swap_entry;
>  	struct swap_info_struct *si;
>  	struct folio *folio = NULL;
>  	bool skip_swapcache = false;
> -	swp_entry_t swap;
> +	pgoff_t offset;
>  
>  	VM_BUG_ON(!*foliop || !xa_is_value(*foliop));
> -	swap = radix_to_swp_entry(*foliop);
> +	swap_entry = radix_to_swp_entry(*foliop);
>  	*foliop = NULL;
>  
> -	if (is_poisoned_swp_entry(swap))
> +	if (is_poisoned_swp_entry(swap_entry))
>  		return -EIO;
>  
> -	si = get_swap_device(swap);
> -	order = shmem_swap_check_entry(mapping, index, swap);
> +	si = get_swap_device(swap_entry);
> +	order = shmem_swap_check_entry(mapping, index, swap_entry);
>  	if (unlikely(!si)) {
>  		if (order < 0)
>  			return -EEXIST;
> @@ -2303,7 +2302,9 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
>  		return -EEXIST;
>  	}
>  
> -	/* Look it up and read it in.. */
> +	/* @index may points to the middle of a large entry, get the real swap value first */
> +	offset = index - round_down(index, 1 << order);
> +	swap.val = swap_entry.val + offset;
>  	folio = swap_cache_get_folio(swap, NULL, 0);
>  	if (!folio) {
>  		/* Or update major stats only when swapin succeeds?? */
> @@ -2315,7 +2316,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
>  		/* Try direct mTHP swapin bypassing swap cache and readahead */
>  		if (data_race(si->flags & SWP_SYNCHRONOUS_IO)) {
>  			swap_order = order;
> -			folio = shmem_swapin_direct(inode, vma, index,
> +			folio = shmem_swapin_direct(inode, vma, index, swap_entry,
>  						    swap, &swap_order, gfp);
>  			if (!IS_ERR(folio)) {
>  				skip_swapcache = true;
> @@ -2338,28 +2339,25 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
>  		}
>  	}
>  alloced:
> +	swap_order = folio_order(folio);
> +	nr_pages = folio_nr_pages(folio);
> +
> +	/* The swap-in should cover both @swap and @index */
> +	swap.val = round_down(swap.val, nr_pages);
> +	VM_WARN_ON_ONCE(swap.val > swap_entry.val + offset);
> +	VM_WARN_ON_ONCE(swap.val + nr_pages <= swap_entry.val + offset);
> +
>  	/*
>  	 * We need to split an existing large entry if swapin brought in a
>  	 * smaller folio due to various of reasons.
> -	 *
> -	 * And worth noting there is a special case: if there is a smaller
> -	 * cached folio that covers @swap, but not @index (it only covers
> -	 * first few sub entries of the large entry, but @index points to
> -	 * later parts), the swap cache lookup will still see this folio,
> -	 * And we need to split the large entry here. Later checks will fail,
> -	 * as it can't satisfy the swap requirement, and we will retry
> -	 * the swapin from beginning.
>  	 */
> -	swap_order = folio_order(folio);
> +	index = round_down(index, nr_pages);
>  	if (order > swap_order) {
> -		error = shmem_split_swap_entry(inode, index, swap, gfp);
> +		error = shmem_split_swap_entry(inode, index, swap_entry, gfp);
>  		if (error)
>  			goto failed_nolock;
>  	}
>  
> -	index = round_down(index, 1 << swap_order);
> -	swap.val = round_down(swap.val, 1 << swap_order);
> -
>  	/* We have to do this with folio locked to prevent races */
>  	folio_lock(folio);
>  	if ((!skip_swapcache && !folio_test_swapcache(folio)) ||
> @@ -2372,7 +2370,6 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
>  		goto failed;
>  	}
>  	folio_wait_writeback(folio);
> -	nr_pages = folio_nr_pages(folio);
>  
>  	/*
>  	 * Some architectures may have to restore extra metadata to the
> 

The patch looks good to me, just a small suggestion. I think the names
"swap" and "swap_entry" are not descriptive enough; something like
"index_entry" and "align_entry" would be cleaner. Besides, since we
already pass "swap" and "order", we can derive "swap_entry" from them
easily, which would make the code easier to follow (a rough sketch of
what I mean follows after the tag). Not a big deal anyway, so:

Reviewed-by: Kemeng Shi
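
Something like the following is what I had in mind -- untested and only
meant to illustrate the suggestion above. It keeps the current parameter
names, and it relies on large shmem swap entries having naturally
aligned swap offsets, which the round_down() on swap.val in
shmem_swapin_folio() already assumes:

	static struct folio *shmem_swapin_direct(struct inode *inode,
			struct vm_area_struct *vma, pgoff_t index,
			swp_entry_t swap, int *order, gfp_t gfp)
	{
		/*
		 * Recover the aligned large entry from @swap and @order
		 * instead of passing it in as a separate parameter.
		 */
		swp_entry_t swap_entry = swp_entry(swp_type(swap),
				round_down(swp_offset(swap), 1 << *order));

		/* ... rest of the function unchanged ... */
	}

Then the extra swp_entry_t parameter can be dropped from the caller.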