From mboxrd@z Thu Jan 1 00:00:00 1970
Subject: Re: [PATCH 3/4] mm/shmem, swap: improve mthp swapin process
To: Kairui Song , linux-mm@kvack.org
Cc: Andrew Morton , Hugh Dickins , Baolin Wang , Matthew Wilcox , Chris Li ,
 Nhat Pham , Baoquan He , Barry Song , linux-kernel@vger.kernel.org
References: <20250617183503.10527-1-ryncsn@gmail.com>
 <20250617183503.10527-4-ryncsn@gmail.com>
From: Kemeng Shi <shikemeng@huaweicloud.com>
Message-ID: <06bf5a2f-6687-dc24-cdb2-408faf413dd4@huaweicloud.com>
Date: Wed, 18 Jun 2025 14:27:09 +0800
User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64; rv:60.0) Gecko/20100101
 Thunderbird/60.9.1
MIME-Version: 1.0
In-Reply-To: <20250617183503.10527-4-ryncsn@gmail.com>
Content-Type: text/plain; charset=gbk
Content-Transfer-Encoding: 7bit

on 6/18/2025 2:35 AM, Kairui Song wrote:
> From: Kairui Song
>
> Tidy up the mTHP swapin workflow. There should be no feature change, but
> consolidates the mTHP related check to one place so they are now all
> wrapped by CONFIG_TRANSPARENT_HUGEPAGE, and will be trimmed off by
> compiler if not needed.
>
> Signed-off-by: Kairui Song
> ---
>  mm/shmem.c | 175 ++++++++++++++++++++++++-----------------------------
>  1 file changed, 78 insertions(+), 97 deletions(-)
>
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 0ad49e57f736..46dea2fa1b43 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c

...
> @@ -2283,110 +2306,66 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
>  	/* Look it up and read it in.. */
>  	folio = swap_cache_get_folio(swap, NULL, 0);
>  	if (!folio) {
> -		int nr_pages = 1 << order;
> -		bool fallback_order0 = false;
> -
>  		/* Or update major stats only when swapin succeeds?? */
>  		if (fault_type) {
>  			*fault_type |= VM_FAULT_MAJOR;
>  			count_vm_event(PGMAJFAULT);
>  			count_memcg_event_mm(fault_mm, PGMAJFAULT);
>  		}
> -
> -		/*
> -		 * If uffd is active for the vma, we need per-page fault
> -		 * fidelity to maintain the uffd semantics, then fallback
> -		 * to swapin order-0 folio, as well as for zswap case.
> -		 * Any existing sub folio in the swap cache also blocks
> -		 * mTHP swapin.
> -		 */
> -		if (order > 0 && ((vma && unlikely(userfaultfd_armed(vma))) ||
> -		    !zswap_never_enabled() ||
> -		    non_swapcache_batch(swap, nr_pages) != nr_pages))
> -			fallback_order0 = true;
> -
> -		/* Skip swapcache for synchronous device. */
> -		if (!fallback_order0 && data_race(si->flags & SWP_SYNCHRONOUS_IO)) {
> -			folio = shmem_swap_alloc_folio(inode, vma, index, swap, order, gfp);
> +		/* Try direct mTHP swapin bypassing swap cache and readahead */
> +		if (data_race(si->flags & SWP_SYNCHRONOUS_IO)) {
> +			swap_order = order;
> +			folio = shmem_swapin_direct(inode, vma, index,
> +						    swap, &swap_order, gfp);
>  			if (!IS_ERR(folio)) {
>  				skip_swapcache = true;
>  				goto alloced;
>  			}
> -
> -			/*
> -			 * Fallback to swapin order-0 folio unless the swap entry
> -			 * already exists.
> -			 */
> +			/* Fallback if order > 0 swapin failed with -ENOMEM */
>  			error = PTR_ERR(folio);
>  			folio = NULL;
> -			if (error == -EEXIST)
> +			if (error != -ENOMEM || !swap_order)
>  				goto failed;
>  		}
> -
>  		/*
> -		 * Now swap device can only swap in order 0 folio, then we
> -		 * should split the large swap entry stored in the pagecache
> -		 * if necessary.
> +		 * Try order 0 swapin using swap cache and readahead, it still
> +		 * may return order > 0 folio due to raced swap cache.
>  		 */
> -		split_order = shmem_split_large_entry(inode, index, swap, gfp);
> -		if (split_order < 0) {
> -			error = split_order;
> -			goto failed;
> -		}
> -
> -		/*
> -		 * If the large swap entry has already been split, it is
> -		 * necessary to recalculate the new swap entry based on
> -		 * the old order alignment.
> -		 */
> -		if (split_order > 0) {
> -			pgoff_t offset = index - round_down(index, 1 << split_order);
> -
> -			swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
> -		}
> -

For fallback to order 0, we always called shmem_swapin_cluster() before,
but we will call shmem_swap_alloc_folio() now. It seems fine to me; just
pointing this out for others to recheck.

> -		/* Here we actually start the io */
>  		folio = shmem_swapin_cluster(swap, gfp, info, index);
>  		if (!folio) {
>  			error = -ENOMEM;
>  			goto failed;
>  		}
> -	} else if (order > folio_order(folio)) {
> -		/*
> -		 * Swap readahead may swap in order 0 folios into swapcache
> -		 * asynchronously, while the shmem mapping can still stores
> -		 * large swap entries. In such cases, we should split the
> -		 * large swap entry to prevent possible data corruption.
> -		 */
> -		split_order = shmem_split_large_entry(inode, index, swap, gfp);
> -		if (split_order < 0) {
> -			folio_put(folio);
> -			folio = NULL;
> -			error = split_order;
> -			goto failed;
> -		}
> -
> -		/*
> -		 * If the large swap entry has already been split, it is
> -		 * necessary to recalculate the new swap entry based on
> -		 * the old order alignment.
> -		 */
> -		if (split_order > 0) {
> -			pgoff_t offset = index - round_down(index, 1 << split_order);
> -
> -			swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
> -		}
> -	} else if (order < folio_order(folio)) {
> -		swap.val = round_down(swp_type(swap), folio_order(folio));
>  	}
> -
>  alloced:
> +	/*
> +	 * We need to split an existing large entry if swapin brought in a
> +	 * smaller folio due to various of reasons.
> +	 *
> +	 * And worth noting there is a special case: if there is a smaller
> +	 * cached folio that covers @swap, but not @index (it only covers
> +	 * first few sub entries of the large entry, but @index points to
> +	 * later parts), the swap cache lookup will still see this folio,
> +	 * And we need to split the large entry here. Later checks will fail,
> +	 * as it can't satisfy the swap requirement, and we will retry
> +	 * the swapin from beginning.
> +	 */
> +	swap_order = folio_order(folio);
> +	if (order > swap_order) {
> +		error = shmem_split_swap_entry(inode, index, swap, gfp);
> +		if (error)
> +			goto failed_nolock;
> +	}
> +
> +	index = round_down(index, 1 << swap_order);
> +	swap.val = round_down(swap.val, 1 << swap_order);
> +

If the swap entry order is reduced concurrently while index and swap.val stay
unchanged, shmem_split_swap_entry() will split the reduced large swap entry
successfully, but index and swap.val will then be incorrect because swap_order
is wrong. We can catch the unexpected order and swap entry in
shmem_add_to_page_cache() and retry the swapin, but this introduces extra
cost. If shmem_split_swap_entry() returned the order of the entry that was
actually split, and we updated index and swap.val with the returned order, we
could avoid the extra cost for the mentioned racy case.
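
Something roughly like the sketch below is what I have in mind. It is untested
and only meant to illustrate the idea, assuming shmem_split_swap_entry() is
changed to return the order of the entry it actually found and split (or a
negative error):

	swap_order = folio_order(folio);
	if (order > swap_order) {
		/* Hypothetical: returns the order of the entry that was split */
		split_order = shmem_split_swap_entry(inode, index, swap, gfp);
		if (split_order < 0) {
			error = split_order;
			goto failed_nolock;
		}
		/*
		 * The entry may have been reduced by a concurrent split, so
		 * align to the order that is really left in the mapping
		 * rather than to the folio order alone.
		 */
		swap_order = min(swap_order, split_order);
	}

	index = round_down(index, 1 << swap_order);
	swap.val = round_down(swap.val, 1 << swap_order);

With something like this, a concurrently reduced entry would be realigned here
directly, instead of relying on shmem_add_to_page_cache() to notice the
mismatch and retry the whole swapin.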