From: Kairui Song <ryncsn@gmail.com>
Date: Wed, 18 Jun 2025 14:50:05 +0800
Subject: Re: [PATCH 3/4] mm/shmem, swap: improve mthp swapin process
To: Kemeng Shi
Cc: linux-mm@kvack.org, Andrew Morton, Hugh Dickins, Baolin Wang,
    Matthew Wilcox, Chris Li, Nhat Pham, Baoquan He, Barry Song,
    linux-kernel@vger.kernel.org
In-Reply-To: <06bf5a2f-6687-dc24-cdb2-408faf413dd4@huaweicloud.com>
References: <20250617183503.10527-1-ryncsn@gmail.com>
            <20250617183503.10527-4-ryncsn@gmail.com>
            <06bf5a2f-6687-dc24-cdb2-408faf413dd4@huaweicloud.com>
On Wed, Jun 18, 2025 at 2:27 PM Kemeng Shi wrote:
> on 6/18/2025 2:35 AM, Kairui Song wrote:
> > From: Kairui Song
> >
> > Tidy up the mTHP swapin workflow. There should be no feature change, but
> > consolidates the mTHP related check to one place so they are now all
> > wrapped by CONFIG_TRANSPARENT_HUGEPAGE, and will be trimmed off by
> > compiler if not needed.
> >
> > Signed-off-by: Kairui Song
> > ---
> >  mm/shmem.c | 175 ++++++++++++++++++++++++-----------------------------
> >  1 file changed, 78 insertions(+), 97 deletions(-)
> >
> > diff --git a/mm/shmem.c b/mm/shmem.c
> > index 0ad49e57f736..46dea2fa1b43 100644
> > --- a/mm/shmem.c
> > +++ b/mm/shmem.c
> > ...
> >
> > @@ -2283,110 +2306,66 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
> >          /* Look it up and read it in.. */
> >          folio = swap_cache_get_folio(swap, NULL, 0);
> >          if (!folio) {
> > -                int nr_pages = 1 << order;
> > -                bool fallback_order0 = false;
> > -
> >                  /* Or update major stats only when swapin succeeds?? */
> >                  if (fault_type) {
> >                          *fault_type |= VM_FAULT_MAJOR;
> >                          count_vm_event(PGMAJFAULT);
> >                          count_memcg_event_mm(fault_mm, PGMAJFAULT);
> >                  }
> > -
> > -                /*
> > -                 * If uffd is active for the vma, we need per-page fault
> > -                 * fidelity to maintain the uffd semantics, then fallback
> > -                 * to swapin order-0 folio, as well as for zswap case.
> > -                 * Any existing sub folio in the swap cache also blocks
> > -                 * mTHP swapin.
> > -                 */
> > -                if (order > 0 && ((vma && unlikely(userfaultfd_armed(vma))) ||
> > -                                  !zswap_never_enabled() ||
> > -                                  non_swapcache_batch(swap, nr_pages) != nr_pages))
> > -                        fallback_order0 = true;
> > -
> > -                /* Skip swapcache for synchronous device. */
> > -                if (!fallback_order0 && data_race(si->flags & SWP_SYNCHRONOUS_IO)) {
> > -                        folio = shmem_swap_alloc_folio(inode, vma, index, swap, order, gfp);
> > +                /* Try direct mTHP swapin bypassing swap cache and readahead */
> > +                if (data_race(si->flags & SWP_SYNCHRONOUS_IO)) {
> > +                        swap_order = order;
> > +                        folio = shmem_swapin_direct(inode, vma, index,
> > +                                                    swap, &swap_order, gfp);
> >                          if (!IS_ERR(folio)) {
> >                                  skip_swapcache = true;
> >                                  goto alloced;
> >                          }
> > -
> > -                        /*
> > -                         * Fallback to swapin order-0 folio unless the swap entry
> > -                         * already exists.
> > -                         */
> > +                        /* Fallback if order > 0 swapin failed with -ENOMEM */
> >                          error = PTR_ERR(folio);
> >                          folio = NULL;
> > -                        if (error == -EEXIST)
> > +                        if (error != -ENOMEM || !swap_order)
> >                                  goto failed;
> >                  }
> > -
> >                  /*
> > -                 * Now swap device can only swap in order 0 folio, then we
> > -                 * should split the large swap entry stored in the pagecache
> > -                 * if necessary.
> > +                 * Try order 0 swapin using swap cache and readahead, it still
> > +                 * may return order > 0 folio due to raced swap cache.
> >                   */
> > -                split_order = shmem_split_large_entry(inode, index, swap, gfp);
> > -                if (split_order < 0) {
> > -                        error = split_order;
> > -                        goto failed;
> > -                }
> > -
> > -                /*
> > -                 * If the large swap entry has already been split, it is
> > -                 * necessary to recalculate the new swap entry based on
> > -                 * the old order alignment.
> > -                 */
> > -                if (split_order > 0) {
> > -                        pgoff_t offset = index - round_down(index, 1 << split_order);
> > -
> > -                        swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
> > -                }
> > -
> For the fallback order-0 path, we always called shmem_swapin_cluster() before,
> but we will call shmem_swap_alloc_folio() now. It seems fine to me; just
> pointing this out for others to recheck.

It's an improvement; I forgot to mention that in the commit message.
Readahead is a burden for SWP_SYNCHRONOUS_IO devices, so calling
shmem_swap_alloc_folio is better. I'll update the commit message.

> > -                /* Here we actually start the io */
> >                  folio = shmem_swapin_cluster(swap, gfp, info, index);
> >                  if (!folio) {
> >                          error = -ENOMEM;
> >                          goto failed;
> >                  }
> > -        } else if (order > folio_order(folio)) {
> > -                /*
> > -                 * Swap readahead may swap in order 0 folios into swapcache
> > -                 * asynchronously, while the shmem mapping can still stores
> > -                 * large swap entries. In such cases, we should split the
> > -                 * large swap entry to prevent possible data corruption.
> > -                 */
> > -                split_order = shmem_split_large_entry(inode, index, swap, gfp);
> > -                if (split_order < 0) {
> > -                        folio_put(folio);
> > -                        folio = NULL;
> > -                        error = split_order;
> > -                        goto failed;
> > -                }
> > -
> > -                /*
> > -                 * If the large swap entry has already been split, it is
> > -                 * necessary to recalculate the new swap entry based on
> > -                 * the old order alignment.
> > -                 */
> > -                if (split_order > 0) {
> > -                        pgoff_t offset = index - round_down(index, 1 << split_order);
> > -
> > -                        swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
> > -                }
> > -        } else if (order < folio_order(folio)) {
> > -                swap.val = round_down(swp_type(swap), folio_order(folio));
> >          }
> > -
> >  alloced:
> > +        /*
> > +         * We need to split an existing large entry if swapin brought in a
> > +         * smaller folio due to various of reasons.
> > +         *
> > +         * And worth noting there is a special case: if there is a smaller
> > +         * cached folio that covers @swap, but not @index (it only covers
> > +         * first few sub entries of the large entry, but @index points to
> > +         * later parts), the swap cache lookup will still see this folio,
> > +         * And we need to split the large entry here. Later checks will fail,
> > +         * as it can't satisfy the swap requirement, and we will retry
> > +         * the swapin from beginning.
> > +         */
> > +        swap_order = folio_order(folio);
> > +        if (order > swap_order) {
> > +                error = shmem_split_swap_entry(inode, index, swap, gfp);
> > +                if (error)
> > +                        goto failed_nolock;
> > +        }
> > +
> > +        index = round_down(index, 1 << swap_order);
> > +        swap.val = round_down(swap.val, 1 << swap_order);
> > +
> If the swap entry order is reduced but index and swap.val are kept unchanged,
> shmem_split_swap_entry() will split the reduced large swap entry successfully,
> but index and swap.val will be incorrect because of the wrong swap_order. We
> can catch the unexpected order and swap entry in shmem_add_to_page_cache()
> and will retry the swapin, but this will introduce extra cost.
>
> If we return the order of the entry that was split in shmem_split_swap_entry()
> and update index and swap.val with the returned order, we can avoid the extra
> cost for the mentioned racy case.

The swap_order here is simply the folio's order, so doing this round_down
gets the swap entry and index that will be covered by this folio (the
later folio->swap.val != swap.val check ensures the values are valid
here).

Remember that the previous patch mentioned we may see the shmem mapping's
entry already split while still finding a large folio here. With the
current design, shmem_add_to_page_cache can still succeed as it should,
but if we round_down with the order returned by shmem_split_swap_entry,
it will fail.

So I think it is better to keep it this way; besides, the next patch
makes use of this for sanity checks and for reducing false positives of
swap cache lookups.
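To make the alignment above concrete, here is a minimal user-space sketch
of the round_down step, not kernel code: the round_down macro mirrors the
kernel helper, and the values (page index 19, an order-2 folio, swap value
0x1013) are hypothetical, chosen only to illustrate how index and swap.val
are aligned to the folio that swapin actually returned.

/* Minimal user-space sketch with hypothetical values; not kernel code. */
#include <stdio.h>

/* mirrors the kernel's round_down() for power-of-two sizes */
#define round_down(x, y) ((x) & ~((unsigned long)(y) - 1))

int main(void)
{
        unsigned long index = 19;         /* faulting offset in the mapping */
        unsigned long swap_val = 0x1013;  /* swap entry value stored at that offset */
        unsigned int swap_order = 2;      /* folio_order() of the swapped-in folio */
        unsigned long nr = 1UL << swap_order;

        /* Align both values to the folio boundary, as done after the
         * split: the folio covers [base_index, base_index + nr). */
        unsigned long base_index = round_down(index, nr);
        unsigned long base_swap  = round_down(swap_val, nr);

        printf("index %lu -> %lu, swap.val %#lx -> %#lx\n",
               index, base_index, swap_val, base_swap);

        /*
         * If the folio found in the swap cache does not actually start at
         * base_swap (e.g. a smaller folio covering only the head of a large
         * entry), the later folio->swap.val != swap.val check fails and the
         * swapin is retried from the beginning, as described above.
         */
        return 0;
}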