* Re: [PATCH v3 01/12] mm, swap: simplify swap cache allocation helper [not found] ` <20260421-swap-table-p4-v3-1-2f23759a76bc@tencent.com> @ 2026-05-06 13:51 ` Chris Li 2026-05-11 8:57 ` Kairui Song 0 siblings, 1 reply; 26+ messages in thread From: Chris Li @ 2026-05-06 13:51 UTC (permalink / raw) To: kasong Cc: linux-mm, Andrew Morton, David Hildenbrand, Zi Yan, Baolin Wang, Barry Song, Hugh Dickins, Kemeng Shi, Nhat Pham, Baoquan He, Johannes Weiner, Youngjun Park, Chengming Zhou, Roman Gushchin, Shakeel Butt, Muchun Song, Qi Zheng, linux-kernel, cgroups, Yosry Ahmed, Lorenzo Stoakes, Dev Jain, Lance Yang, Michal Hocko, Michal Hocko, Suren Baghdasaryan, Axel Rasmussen On Tue, Apr 21, 2026 at 8:16 AM Kairui Song via B4 Relay <devnull+kasong.tencent.com@kernel.org> wrote: > > From: Kairui Song <kasong@tencent.com> > > Instead of trying to return the existing folio if the entry is already > cached, simply return an error code if the allocation fails and drop the Nitpick: Spell out which function changes the return type here. It is __swap_cache_prepare_and_add() > output argument. And introduce proper wrappers that handle the Nitpick: Spell out the helper function. It is swap_cache_read_folio(). > allocation failure in different ways. > > For async swapin and readahead, the caller only wants to ensure that a > swap-in read is issued when the allocation succeeded. And for zswap swap > out, the caller will abort if the allocation failed because the entry is > gone or cached already. Should you add "no functional change expected"? > > Signed-off-by: Kairui Song <kasong@tencent.com> Very nice cleanups. I like it. Here are some nitpicks; feel free to ignore them. Acked-by: Chris Li <chrisl@kernel.org> > --- > mm/swap.h | 3 +- > mm/swap_state.c | 180 +++++++++++++++++++++++++++++--------------------------- > mm/zswap.c | 23 +++----- > 3 files changed, 103 insertions(+), 103 deletions(-) > > diff --git a/mm/swap.h b/mm/swap.h > index a77016f2423b..ad8b17a93758 100644 > --- a/mm/swap.h > +++ b/mm/swap.h > @@ -281,8 +281,7 @@ struct folio *swap_cache_get_folio(swp_entry_t entry); > void *swap_cache_get_shadow(swp_entry_t entry); > void swap_cache_del_folio(struct folio *folio); > struct folio *swap_cache_alloc_folio(swp_entry_t entry, gfp_t gfp_flags, > - struct mempolicy *mpol, pgoff_t ilx, > - bool *alloced); > + struct mempolicy *mpol, pgoff_t ilx); > /* Below helpers require the caller to lock and pass in the swap cluster. */ > void __swap_cache_add_folio(struct swap_cluster_info *ci, > struct folio *folio, swp_entry_t entry); > diff --git a/mm/swap_state.c b/mm/swap_state.c > index 1415a5c54a43..204a9499d50c 100644 > --- a/mm/swap_state.c > +++ b/mm/swap_state.c > @@ -459,54 +459,38 @@ void swap_update_readahead(struct folio *folio, struct vm_area_struct *vma, > * All swap slots covered by the folio must have a non-zero swap count. > * > * Context: Caller must protect the swap device with reference count or locks. > - * Return: Returns the folio being added on success. Returns the existing folio > - * if @entry is already cached. Returns NULL if raced with swapin or swapoff. > + * Return: 0 if success, error code if failed.
> */ > -static struct folio *__swap_cache_prepare_and_add(swp_entry_t entry, > - struct folio *folio, > - gfp_t gfp, bool charged) > +static int __swap_cache_prepare_and_add(swp_entry_t entry, > + struct folio *folio, > + gfp_t gfp, bool charged) > { > - struct folio *swapcache = NULL; > void *shadow; > int ret; > > __folio_set_locked(folio); > __folio_set_swapbacked(folio); > > - if (!charged && mem_cgroup_swapin_charge_folio(folio, NULL, gfp, entry)) > + if (!charged && mem_cgroup_swapin_charge_folio(folio, NULL, gfp, entry)) { > + ret = -ENOMEM; > goto failed; > - > - for (;;) { > - ret = swap_cache_add_folio(folio, entry, &shadow); > - if (!ret) > - break; > - > - /* > - * Large order allocation needs special handling on > - * race: if a smaller folio exists in cache, swapin needs > - * to fallback to order 0, and doing a swap cache lookup > - * might return a folio that is irrelevant to the faulting > - * entry because @entry is aligned down. Just return NULL. > - */ > - if (ret != -EEXIST || folio_test_large(folio)) > - goto failed; > - > - swapcache = swap_cache_get_folio(entry); > - if (swapcache) > - goto failed; > } > > + ret = swap_cache_add_folio(folio, entry, &shadow); > + if (ret) > + goto failed; > + > memcg1_swapin(entry, folio_nr_pages(folio)); > if (shadow) > workingset_refault(folio, shadow); > > /* Caller will initiate read into locked folio */ > folio_add_lru(folio); > - return folio; > + return 0; > > failed: > folio_unlock(folio); > - return swapcache; > + return ret; > } > > /** > @@ -515,7 +499,6 @@ static struct folio *__swap_cache_prepare_and_add(swp_entry_t entry, > * @gfp_mask: memory allocation flags > * @mpol: NUMA memory allocation policy to be applied > * @ilx: NUMA interleave index, for use only when MPOL_INTERLEAVE > - * @new_page_allocated: sets true if allocation happened, false otherwise > * > * Allocate a folio in the swap cache for one swap slot, typically before > * doing IO (e.g. swap in or zswap writeback). The swap slot indicated by > @@ -523,18 +506,40 @@ static struct folio *__swap_cache_prepare_and_add(swp_entry_t entry, > * Currently only supports order 0. > * > * Context: Caller must protect the swap device with reference count or locks. > - * Return: Returns the existing folio if @entry is cached already. Returns > - * NULL if failed due to -ENOMEM or @entry have a swap count < 1. > + * Return: Returns the folio if allocation succeeded and folio is added to > + * swap cache. Returns error code if allocation failed due to race. > */ > struct folio *swap_cache_alloc_folio(swp_entry_t entry, gfp_t gfp_mask, > - struct mempolicy *mpol, pgoff_t ilx, > - bool *new_page_allocated) > + struct mempolicy *mpol, pgoff_t ilx) > +{ > + int ret; Nitpick: Suggest renaming it to "err" to make it obvious that it is an int type for the error code. Because this function previously returned a folio pointer, I have to remind myself that it is an int type not a folio. > + struct folio *folio; > + > + /* Allocate a new folio to be added into the swap cache. */ > + folio = folio_alloc_mpol(gfp_mask, 0, mpol, ilx, numa_node_id()); > + if (!folio) > + return ERR_PTR(-ENOMEM); > + > + /* > + * Try to add the new folio to the swap cache. It returns > + * -EEXIST if the entry is already cached. 
> + */ > + ret = __swap_cache_prepare_and_add(entry, folio, gfp_mask, false); > + if (ret) { > + folio_put(folio); > + return ERR_PTR(ret); > + } > + > + return folio; > +} > + > +static struct folio *swap_cache_read_folio(swp_entry_t entry, gfp_t gfp, > + struct mempolicy *mpol, pgoff_t ilx, > + struct swap_iocb **plug, bool readahead) > { > struct swap_info_struct *si = __swap_entry_to_info(entry); > struct folio *folio; > - struct folio *result = NULL; > > - *new_page_allocated = false; > /* Check the swap cache again for readahead path. */ > folio = swap_cache_get_folio(entry); > if (folio) > @@ -544,17 +549,24 @@ struct folio *swap_cache_alloc_folio(swp_entry_t entry, gfp_t gfp_mask, > if (!swap_entry_swapped(si, entry)) > return NULL; > > - /* Allocate a new folio to be added into the swap cache. */ > - folio = folio_alloc_mpol(gfp_mask, 0, mpol, ilx, numa_node_id()); > - if (!folio) > + do { > + folio = swap_cache_get_folio(entry); > + if (folio) > + return folio; > + > + folio = swap_cache_alloc_folio(entry, gfp, mpol, ilx); > + } while (IS_ERR(folio) && PTR_ERR(folio) == -EEXIST); Nitpick: IS_ERR() only checks that the pointer is in the error code range. If the pointer is -EEXIST, it will always be in the error code range. I think the "IS_ERR(folio)" test can be dropped. Chris ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v3 01/12] mm, swap: simplify swap cache allocation helper 2026-05-06 13:51 ` [PATCH v3 01/12] mm, swap: simplify swap cache allocation helper Chris Li @ 2026-05-11 8:57 ` Kairui Song 0 siblings, 0 replies; 26+ messages in thread From: Kairui Song @ 2026-05-11 8:57 UTC (permalink / raw) To: Chris Li Cc: linux-mm, Andrew Morton, David Hildenbrand, Zi Yan, Baolin Wang, Barry Song, Hugh Dickins, Kemeng Shi, Nhat Pham, Baoquan He, Johannes Weiner, Youngjun Park, Chengming Zhou, Roman Gushchin, Shakeel Butt, Muchun Song, Qi Zheng, linux-kernel, cgroups, Yosry Ahmed, Lorenzo Stoakes, Dev Jain, Lance Yang, Michal Hocko, Michal Hocko, Suren Baghdasaryan, Axel Rasmussen On Wed, May 6, 2026 at 9:51 PM Chris Li <chrisl@kernel.org> wrote: > > On Tue, Apr 21, 2026 at 8:16 AM Kairui Song via B4 Relay > <devnull+kasong.tencent.com@kernel.org> wrote: > > > > From: Kairui Song <kasong@tencent.com> > > > > Instead of trying to return the existing folio if the entry is already > > cached, simply return an error code if the allocation fails and drop the > > Nitpick: Spell out which function changes the return type here. It is > __swap_cache_prepare_and_add() Good idea. > > > output argument. And introduce proper wrappers that handle the > > Nitpick: Spell out the helper function. It is swap_cache_read_folio(). > > allocation failure in different ways. > > > > > For async swapin and readahead, the caller only wants to ensure that a > > swap-in read is issued when the allocation succeeded. And for zswap swap > > out, the caller will abort if the allocation failed because the entry is > > gone or cached already. > > Should you add "no functional change expected"? Yes indeed, there is no functional change. > > > > > Signed-off-by: Kairui Song <kasong@tencent.com> > > Very nice cleanups. I like it. Here are some nitpicks; feel free to > ignore them. > > Acked-by: Chris Li <chrisl@kernel.org> Thanks. > > Nitpick: IS_ERR() only checks that the pointer is in the error code > range. If the pointer is -EEXIST, it will always be in the error code > range. I think the "IS_ERR(folio)" test can be dropped. Agreed. Actually, I didn't add IS_ERR in V1, then Sashiko complained that it should be added. I just checked the API documentation again and existing patterns; there is indeed no rule to prohibit the direct check. Let me drop it; I also like it better that way, and maybe just ignore Sashiko next time. ^ permalink raw reply [flat|nested] 26+ messages in thread
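For readers following the IS_ERR() exchange above: the standalone C program below is a userspace model of the kernel's err.h helpers, reimplemented here purely for illustration (MAX_ERRNO and the casts follow the kernel convention, but this is a sketch, not kernel code). It demonstrates why the IS_ERR() test is redundant in front of a PTR_ERR() == -EEXIST comparison: an ERR_PTR(-EEXIST) value always lies in the error range, and a valid pointer can never compare equal to -EEXIST.

#include <stdio.h>
#include <errno.h>

#define MAX_ERRNO 4095	/* same bound the kernel uses for error pointers */

static inline void *ERR_PTR(long error) { return (void *)error; }
static inline long PTR_ERR(const void *ptr) { return (long)ptr; }
static inline int IS_ERR(const void *ptr)
{
	return (unsigned long)ptr >= (unsigned long)-MAX_ERRNO;
}

int main(void)
{
	int obj;
	void *err = ERR_PTR(-EEXIST);	/* "pointer" in the error range */
	void *ok = &obj;		/* ordinary valid pointer */

	/* Prints "1 1": -EEXIST is always inside the error range. */
	printf("%d %d\n", IS_ERR(err), PTR_ERR(err) == -EEXIST);
	/*
	 * Prints "0 0": a real pointer never compares equal to -EEXIST,
	 * so "IS_ERR(p) && PTR_ERR(p) == -EEXIST" reduces to the
	 * equality test alone, which is the point made above.
	 */
	printf("%d %d\n", IS_ERR(ok), PTR_ERR(ok) == -EEXIST);
	return 0;
}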
[parent not found: <20260421-swap-table-p4-v3-2-2f23759a76bc@tencent.com>]
* Re: [PATCH v3 02/12] mm, swap: move common swap cache operations into standalone helpers [not found] ` <20260421-swap-table-p4-v3-2-2f23759a76bc@tencent.com> @ 2026-05-06 14:42 ` Chris Li 2026-05-12 14:48 ` Kairui Song 0 siblings, 1 reply; 26+ messages in thread From: Chris Li @ 2026-05-06 14:42 UTC (permalink / raw) To: kasong Cc: linux-mm, Andrew Morton, David Hildenbrand, Zi Yan, Baolin Wang, Barry Song, Hugh Dickins, Kemeng Shi, Nhat Pham, Baoquan He, Johannes Weiner, Youngjun Park, Chengming Zhou, Roman Gushchin, Shakeel Butt, Muchun Song, Qi Zheng, linux-kernel, cgroups, Yosry Ahmed, Lorenzo Stoakes, Dev Jain, Lance Yang, Michal Hocko, Michal Hocko, Suren Baghdasaryan, Axel Rasmussen On Tue, Apr 21, 2026 at 8:16 AM Kairui Song via B4 Relay <devnull+kasong.tencent.com@kernel.org> wrote: > > From: Kairui Song <kasong@tencent.com> > > Move a few swap cache checking, adding, and deletion operations > into standalone helpers to be used later. And while at it, add > proper kernel doc. > > No feature or behavior change. > > Signed-off-by: Kairui Song <kasong@tencent.com> Acked-by: Chris Li <chrisl@kernel.org> > --- > mm/swap_state.c | 141 ++++++++++++++++++++++++++++++++++++++------------------ > 1 file changed, 95 insertions(+), 46 deletions(-) > > diff --git a/mm/swap_state.c b/mm/swap_state.c > index 204a9499d50c..3da285a891b2 100644 > --- a/mm/swap_state.c > +++ b/mm/swap_state.c > @@ -137,8 +137,42 @@ void *swap_cache_get_shadow(swp_entry_t entry) > return NULL; > } > > -void __swap_cache_add_folio(struct swap_cluster_info *ci, > - struct folio *folio, swp_entry_t entry) > +/** > + * __swap_cache_add_check - Check if a range is suitable for adding a folio. > + * @ci: The locked swap cluster. > + * @ci_off: Range start offset. > + * @nr: Number of slots to check. > + * @shadow: Returns the shadow value if one exists in the range. > + * > + * Check if all slots covered by given range have a swap count >= 1. > + * Retrieves the shadow if there is one. > + * > + * Context: Caller must lock the cluster. > + */ > +static int __swap_cache_add_check(struct swap_cluster_info *ci, > + unsigned int ci_off, unsigned int nr, > + void **shadow) > +{ > + unsigned int ci_end = ci_off + nr; > + unsigned long old_tb; > + Nitpick: Can add lockdep_assert_held(&ci->lock); Can check ci_end < SWAPFILE_CLUSTER and bail out on error. > + if (unlikely(!ci->table)) > + return -ENOENT; > + do { > + old_tb = __swap_table_get(ci, ci_off); > + if (unlikely(swp_tb_is_folio(old_tb))) > + return -EEXIST; > + if (unlikely(!__swp_tb_get_count(old_tb))) > + return -ENOENT; > + if (swp_tb_is_shadow(old_tb)) > + *shadow = swp_tb_to_shadow(old_tb); Nitpick: You can create a local variable for the shadow and assign it at the end. Because it is a pointer, the compiler can't optimize the store away. Chris ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v3 02/12] mm, swap: move common swap cache operations into standalone helpers 2026-05-06 14:42 ` [PATCH v3 02/12] mm, swap: move common swap cache operations into standalone helpers Chris Li @ 2026-05-12 14:48 ` Kairui Song 0 siblings, 0 replies; 26+ messages in thread From: Kairui Song @ 2026-05-12 14:48 UTC (permalink / raw) To: Chris Li Cc: linux-mm, Andrew Morton, David Hildenbrand, Zi Yan, Baolin Wang, Barry Song, Hugh Dickins, Kemeng Shi, Nhat Pham, Baoquan He, Johannes Weiner, Youngjun Park, Chengming Zhou, Roman Gushchin, Shakeel Butt, Muchun Song, Qi Zheng, linux-kernel, cgroups, Yosry Ahmed, Lorenzo Stoakes, Dev Jain, Lance Yang, Michal Hocko, Michal Hocko, Suren Baghdasaryan, Axel Rasmussen On Wed, May 6, 2026 at 10:46 PM Chris Li <chrisl@kernel.org> wrote: > > On Tue, Apr 21, 2026 at 8:16 AM Kairui Song via B4 Relay > <devnull+kasong.tencent.com@kernel.org> wrote: > > > > From: Kairui Song <kasong@tencent.com> > > > > Move a few swap cache checking, adding, and deletion operations > > into standalone helpers to be used later. And while at it, add > > proper kernel doc. > > > > No feature or behavior change. > > > > Signed-off-by: Kairui Song <kasong@tencent.com> > > Acked-by: Chris Li <chrisl@kernel.org> > > > > --- > > mm/swap_state.c | 141 ++++++++++++++++++++++++++++++++++++++------------------ > > 1 file changed, 95 insertions(+), 46 deletions(-) > > > > diff --git a/mm/swap_state.c b/mm/swap_state.c > > index 204a9499d50c..3da285a891b2 100644 > > --- a/mm/swap_state.c > > +++ b/mm/swap_state.c > > @@ -137,8 +137,42 @@ void *swap_cache_get_shadow(swp_entry_t entry) > > return NULL; > > } > > > > -void __swap_cache_add_folio(struct swap_cluster_info *ci, > > - struct folio *folio, swp_entry_t entry) > > +/** > > + * __swap_cache_add_check - Check if a range is suitable for adding a folio. > > + * @ci: The locked swap cluster. > > + * @ci_off: Range start offset. > > + * @nr: Number of slots to check. > > + * @shadow: Returns the shadow value if one exists in the range. > > + * > > + * Check if all slots covered by given range have a swap count >= 1. > > + * Retrieves the shadow if there is one. > > + * > > + * Context: Caller must lock the cluster. > > + */ > > +static int __swap_cache_add_check(struct swap_cluster_info *ci, > > + unsigned int ci_off, unsigned int nr, > > + void **shadow) > > +{ > > + unsigned int ci_end = ci_off + nr; > > + unsigned long old_tb; > > + > > Nitpick: Can add lockdep_assert_held(&ci->lock); > > Can check ci_end < SWAPFILE_CLUSTER and bail out on error. Ack. > > > + if (unlikely(!ci->table)) > > + return -ENOENT; > > + do { > > + old_tb = __swap_table_get(ci, ci_off); > > + if (unlikely(swp_tb_is_folio(old_tb))) > > + return -EEXIST; > > + if (unlikely(!__swp_tb_get_count(old_tb))) > > + return -ENOENT; > > + if (swp_tb_is_shadow(old_tb)) > > + *shadow = swp_tb_to_shadow(old_tb); > > Nitpick: You can create a local variable for the shadow and assign it > at the end. Because it is a pointer, the compiler can't optimize the > store away. This part will be reworked very soon but using a local variable here is good for an intermediate commit. > Chris Thanks. ^ permalink raw reply [flat|nested] 26+ messages in thread
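To make the two nitpicks above concrete, here is one way __swap_cache_add_check() could look with both applied, plus a bounds check on the range (here allowing ci_end == SWAPFILE_CLUSTER, i.e. a range ending exactly at the cluster boundary). The swap-table accessors and the loop tail are assumed from the series as quoted above, so treat this as an illustrative sketch rather than the committed code:

static int __swap_cache_add_check(struct swap_cluster_info *ci,
				  unsigned int ci_off, unsigned int nr,
				  void **shadowp)
{
	unsigned int ci_end = ci_off + nr;
	unsigned long old_tb;
	void *shadow = NULL;

	/* Catch callers that forgot to lock the cluster. */
	lockdep_assert_held(&ci->lock);
	/* Bail out if the range would run past the cluster. */
	if (WARN_ON_ONCE(ci_end > SWAPFILE_CLUSTER))
		return -EINVAL;
	if (unlikely(!ci->table))
		return -ENOENT;
	do {
		old_tb = __swap_table_get(ci, ci_off);
		if (unlikely(swp_tb_is_folio(old_tb)))
			return -EEXIST;
		if (unlikely(!__swp_tb_get_count(old_tb)))
			return -ENOENT;
		if (swp_tb_is_shadow(old_tb))
			shadow = swp_tb_to_shadow(old_tb);
	} while (++ci_off < ci_end);

	/* One store through the output pointer, at the end. */
	*shadowp = shadow;
	return 0;
}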
[parent not found: <20260421-swap-table-p4-v3-3-2f23759a76bc@tencent.com>]
* Re: [PATCH v3 03/12] mm/huge_memory: move THP gfp limit helper into header [not found] ` <20260421-swap-table-p4-v3-3-2f23759a76bc@tencent.com> @ 2026-05-06 14:46 ` Chris Li [not found] ` <D631DCC9-85F0-4E68-88A0-AD5DE328818E@nvidia.com> 1 sibling, 0 replies; 26+ messages in thread From: Chris Li @ 2026-05-06 14:46 UTC (permalink / raw) To: kasong Cc: linux-mm, Andrew Morton, David Hildenbrand, Zi Yan, Baolin Wang, Barry Song, Hugh Dickins, Kemeng Shi, Nhat Pham, Baoquan He, Johannes Weiner, Youngjun Park, Chengming Zhou, Roman Gushchin, Shakeel Butt, Muchun Song, Qi Zheng, linux-kernel, cgroups, Yosry Ahmed, Lorenzo Stoakes, Dev Jain, Lance Yang, Michal Hocko, Michal Hocko, Suren Baghdasaryan, Axel Rasmussen On Tue, Apr 21, 2026 at 8:16 AM Kairui Song via B4 Relay <devnull+kasong.tencent.com@kernel.org> wrote: > > From: Kairui Song <kasong@tencent.com> > > Shmem has some special requirements for THP GFP and has to limit it in > certain zones or provide a more lenient fallback. > > We'll use this helper for generic swap THP allocation, which needs to > support shmem. For a typical GFP_HIGHUSER_MOVABLE swap-in, this helper > is basically a no-op. But it's necessary for certain shmem users, mostly > drivers. > > No feature change. > > Signed-off-by: Kairui Song <kasong@tencent.com> Acked-by: Chris Li <chrisl@kernel.org> Chris ^ permalink raw reply [flat|nested] 26+ messages in thread
[parent not found: <D631DCC9-85F0-4E68-88A0-AD5DE328818E@nvidia.com>]
[parent not found: <CAMgjq7BDmGWaVWBL+52_c=jgs293bgB+Qe-MafKE7dWZRsmx9A@mail.gmail.com>]
[parent not found: <125AABD0-02D5-4656-9F55-4B5BFBD5BD3D@nvidia.com>]
* Re: [PATCH v3 03/12] mm/huge_memory: move THP gfp limit helper into header [not found] ` <125AABD0-02D5-4656-9F55-4B5BFBD5BD3D@nvidia.com> @ 2026-05-12 9:02 ` Baolin Wang 0 siblings, 0 replies; 26+ messages in thread From: Baolin Wang @ 2026-05-12 9:02 UTC (permalink / raw) To: Zi Yan, Kairui Song Cc: linux-mm, Andrew Morton, David Hildenbrand, Barry Song, Hugh Dickins, Chris Li, Kemeng Shi, Nhat Pham, Baoquan He, Johannes Weiner, Youngjun Park, Chengming Zhou, Roman Gushchin, Shakeel Butt, Muchun Song, Qi Zheng, linux-kernel, cgroups, Yosry Ahmed, Lorenzo Stoakes, Dev Jain, Lance Yang, Michal Hocko, Michal Hocko, Suren Baghdasaryan, Axel Rasmussen On 4/22/26 1:23 AM, Zi Yan wrote: > On 21 Apr 2026, at 13:21, Kairui Song wrote: > >> On Tue, Apr 21, 2026 at 9:14 PM Zi Yan <ziy@nvidia.com> wrote: >>> >>> On 21 Apr 2026, at 2:16, Kairui Song via B4 Relay wrote: >>> >>>> From: Kairui Song <kasong@tencent.com> >>>> >>>> Shmem has some special requirements for THP GFP and has to limit it in >>>> certain zones or provide a more lenient fallback. >>>> >>>> We'll use this helper for generic swap THP allocation, which needs to >>>> support shmem. For a typical GFP_HIGHUSER_MOVABLE swap-in, this helper >>>> is basically a no-op. But it's necessary for certain shmem users, mostly >>>> drivers. >>>> >>>> No feature change. >>>> >>>> Signed-off-by: Kairui Song <kasong@tencent.com> >>>> --- >>>> include/linux/huge_mm.h | 30 ++++++++++++++++++++++++++++++ >>>> mm/shmem.c | 30 +++--------------------------- >>>> 2 files changed, 33 insertions(+), 27 deletions(-) >>>> >>>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h >>>> index 2949e5acff35..ffe5a120eee4 100644 >>>> --- a/include/linux/huge_mm.h >>>> +++ b/include/linux/huge_mm.h >>>> @@ -237,6 +237,31 @@ static inline bool thp_vma_suitable_order(struct vm_area_struct *vma, >>>> return true; >>>> } >>>> >>>> +/* >>>> + * Make sure huge_gfp is always more limited than limit_gfp. >>>> + * Some shmem users want THP allocation to be done less aggressively >>>> + * and only in certain zone. >>>> + */ >>>> +static inline gfp_t thp_limit_gfp_mask(gfp_t huge_gfp, gfp_t limit_gfp) >>> >>> Would it be better to rename it to thp_swap_limit_gfp_mask() or something >>> more descriptive? I am just worried about misuses in the future due to >>> the generic thp prefix. >> >> Good idea, I wasn't sure if this might be helpful for any other user, >> but for now naming it more descriptively does help to avoid misuse. >> >> How about thp_shmem_limit_gfp_mask? Ordinary swap is fine with thp >> gfp; only shmem is a bit special. >> > > Sounds good to me. Thanks. Sounds good to me too. Feel free to add: Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> ^ permalink raw reply [flat|nested] 26+ messages in thread
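For context on what the moved helper computes: the long-standing shmem logic combines the two masks roughly as sketched below, shown here under the thp_shmem_limit_gfp_mask() name agreed above (the exact body in the series may differ). The result can only be less aggressive than the order-0 limit mask: it keeps the limiting mask's zones, inherits its "don't try hard" flags, and allows IO/FS/reclaim only when both masks do.

static inline gfp_t thp_shmem_limit_gfp_mask(gfp_t huge_gfp, gfp_t limit_gfp)
{
	gfp_t allowflags = __GFP_IO | __GFP_FS | __GFP_RECLAIM;
	gfp_t denyflags = __GFP_NOWARN | __GFP_NORETRY;
	gfp_t zoneflags = limit_gfp & GFP_ZONEMASK;
	gfp_t result = huge_gfp & ~(allowflags | GFP_ZONEMASK);

	/* Allocate only from the zones the limiting mask permits. */
	result |= zoneflags;
	/* Inherit any "don't try hard" flags from the limiting mask. */
	result |= (limit_gfp & denyflags);
	/* Allow IO/FS/reclaim only if both masks allow them. */
	result |= (huge_gfp & limit_gfp) & allowflags;
	return result;
}

With both masks set to GFP_HIGHUSER_MOVABLE the result is unchanged, which is why the commit message calls the helper basically a no-op for a typical swap-in.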
[parent not found: <20260421-swap-table-p4-v3-5-2f23759a76bc@tencent.com>]
* Re: [PATCH v3 05/12] mm, swap: unify large folio allocation [not found] ` <20260421-swap-table-p4-v3-5-2f23759a76bc@tencent.com> @ 2026-05-06 20:48 ` Chris Li 2026-05-11 12:57 ` David Hildenbrand (Arm) 2026-05-12 10:10 ` Baolin Wang 2 siblings, 0 replies; 26+ messages in thread From: Chris Li @ 2026-05-06 20:48 UTC (permalink / raw) To: kasong Cc: linux-mm, Andrew Morton, David Hildenbrand, Zi Yan, Baolin Wang, Barry Song, Hugh Dickins, Kemeng Shi, Nhat Pham, Baoquan He, Johannes Weiner, Youngjun Park, Chengming Zhou, Roman Gushchin, Shakeel Butt, Muchun Song, Qi Zheng, linux-kernel, cgroups, Yosry Ahmed, Lorenzo Stoakes, Dev Jain, Lance Yang, Michal Hocko, Michal Hocko, Suren Baghdasaryan, Axel Rasmussen On Tue, Apr 21, 2026 at 8:16 AM Kairui Song via B4 Relay <devnull+kasong.tencent.com@kernel.org> wrote: > > From: Kairui Song <kasong@tencent.com> > > Now that direct large order allocation is supported in the swap cache, > both anon and shmem can use it instead of implementing their own methods. > This unifies the fallback and swap cache check, which also reduces the > TOCTOU race window of swap cache state: previously, high order swapin > required checking swap cache states first, then allocating and falling > back separately. Now all these steps happen in the same compact loop. > > Order fallback and statistics are also unified, callers just need to > check and pass the acceptable order bitmask. > > There is basically no behavior change. This only makes things more > unified and prepares for later commits. Cgroup and zero map checks can > also be moved into the compact loop, further reducing race windows and > redundancy > > Signed-off-by: Kairui Song <kasong@tencent.com> > --- > mm/memory.c | 77 ++++++------------------------ > mm/shmem.c | 94 +++++++++--------------------------- > mm/swap.h | 30 ++---------- > mm/swap_state.c | 145 ++++++++++---------------------------------------------- Thanks for unifying the different code paths. I really like those diff stats. The execution flow for swap in is easier to read now. Good job. 
Acked-by: Chris Li <chrisl@kernel.org> Chris > mm/swapfile.c | 3 +- > 5 files changed, 67 insertions(+), 282 deletions(-) > > diff --git a/mm/memory.c b/mm/memory.c > index ea6568571131..404734a5bcff 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -4593,26 +4593,6 @@ static vm_fault_t handle_pte_marker(struct vm_fault *vmf) > return VM_FAULT_SIGBUS; > } > > -static struct folio *__alloc_swap_folio(struct vm_fault *vmf) > -{ > - struct vm_area_struct *vma = vmf->vma; > - struct folio *folio; > - softleaf_t entry; > - > - folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vma, vmf->address); > - if (!folio) > - return NULL; > - > - entry = softleaf_from_pte(vmf->orig_pte); > - if (mem_cgroup_swapin_charge_folio(folio, vma->vm_mm, > - GFP_KERNEL, entry)) { > - folio_put(folio); > - return NULL; > - } > - > - return folio; > -} > - > #ifdef CONFIG_TRANSPARENT_HUGEPAGE > /* > * Check if the PTEs within a range are contiguous swap entries > @@ -4642,8 +4622,6 @@ static bool can_swapin_thp(struct vm_fault *vmf, pte_t *ptep, int nr_pages) > */ > if (unlikely(swap_zeromap_batch(entry, nr_pages, NULL) != nr_pages)) > return false; > - if (unlikely(non_swapcache_batch(entry, nr_pages) != nr_pages)) > - return false; > > return true; > } > @@ -4671,16 +4649,14 @@ static inline unsigned long thp_swap_suitable_orders(pgoff_t swp_offset, > return orders; > } > > -static struct folio *alloc_swap_folio(struct vm_fault *vmf) > +static unsigned long thp_swapin_suitable_orders(struct vm_fault *vmf) > { > struct vm_area_struct *vma = vmf->vma; > unsigned long orders; > - struct folio *folio; > unsigned long addr; > softleaf_t entry; > spinlock_t *ptl; > pte_t *pte; > - gfp_t gfp; > int order; > > /* > @@ -4688,7 +4664,7 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf) > * maintain the uffd semantics. > */ > if (unlikely(userfaultfd_armed(vma))) > - goto fallback; > + return 0; > > /* > * A large swapped out folio could be partially or fully in zswap. We > @@ -4696,7 +4672,7 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf) > * folio. > */ > if (!zswap_never_enabled()) > - goto fallback; > + return 0; > > entry = softleaf_from_pte(vmf->orig_pte); > /* > @@ -4710,12 +4686,12 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf) > vmf->address, orders); > > if (!orders) > - goto fallback; > + return 0; > > pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd, > vmf->address & PMD_MASK, &ptl); > if (unlikely(!pte)) > - goto fallback; > + return 0; > > /* > * For do_swap_page, find the highest order where the aligned range is > @@ -4731,29 +4707,12 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf) > > pte_unmap_unlock(pte, ptl); > > - /* Try allocating the highest of the remaining orders. 
*/ > - gfp = vma_thp_gfp_mask(vma); > - while (orders) { > - addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order); > - folio = vma_alloc_folio(gfp, order, vma, addr); > - if (folio) { > - if (!mem_cgroup_swapin_charge_folio(folio, vma->vm_mm, > - gfp, entry)) > - return folio; > - count_mthp_stat(order, MTHP_STAT_SWPIN_FALLBACK_CHARGE); > - folio_put(folio); > - } > - count_mthp_stat(order, MTHP_STAT_SWPIN_FALLBACK); > - order = next_order(&orders, order); > - } > - > -fallback: > - return __alloc_swap_folio(vmf); > + return orders; > } > #else /* !CONFIG_TRANSPARENT_HUGEPAGE */ > -static struct folio *alloc_swap_folio(struct vm_fault *vmf) > +static unsigned long thp_swapin_suitable_orders(struct vm_fault *vmf) > { > - return __alloc_swap_folio(vmf); > + return 0; > } > #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ > > @@ -4859,21 +4818,13 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) > if (folio) > swap_update_readahead(folio, vma, vmf->address); > if (!folio) { > - if (data_race(si->flags & SWP_SYNCHRONOUS_IO)) { > - folio = alloc_swap_folio(vmf); > - if (folio) { > - /* > - * folio is charged, so swapin can only fail due > - * to raced swapin and return NULL. > - */ > - swapcache = swapin_folio(entry, folio); > - if (swapcache != folio) > - folio_put(folio); > - folio = swapcache; > - } > - } else { > + /* Swapin bypasses readahead for SWP_SYNCHRONOUS_IO devices */ > + if (data_race(si->flags & SWP_SYNCHRONOUS_IO)) > + folio = swapin_entry(entry, GFP_HIGHUSER_MOVABLE, > + thp_swapin_suitable_orders(vmf), > + vmf, NULL, 0); > + else > folio = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE, vmf); > - } > > if (!folio) { > /* > diff --git a/mm/shmem.c b/mm/shmem.c > index 5916acf594a8..17e3da11bb1d 100644 > --- a/mm/shmem.c > +++ b/mm/shmem.c > @@ -159,7 +159,7 @@ static unsigned long shmem_default_max_inodes(void) > > static int shmem_swapin_folio(struct inode *inode, pgoff_t index, > struct folio **foliop, enum sgp_type sgp, gfp_t gfp, > - struct vm_area_struct *vma, vm_fault_t *fault_type); > + struct vm_fault *vmf, vm_fault_t *fault_type); > > static inline struct shmem_sb_info *SHMEM_SB(struct super_block *sb) > { > @@ -2017,68 +2017,24 @@ static struct folio *shmem_alloc_and_add_folio(struct vm_fault *vmf, > } > > static struct folio *shmem_swap_alloc_folio(struct inode *inode, > - struct vm_area_struct *vma, pgoff_t index, > + struct vm_fault *vmf, pgoff_t index, > swp_entry_t entry, int order, gfp_t gfp) > { > + pgoff_t ilx; > + struct folio *folio; > + struct mempolicy *mpol; > + unsigned long orders = BIT(order); > struct shmem_inode_info *info = SHMEM_I(inode); > - struct folio *new, *swapcache; > - int nr_pages = 1 << order; > - gfp_t alloc_gfp = gfp; > - > - if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) { > - if (WARN_ON_ONCE(order)) > - return ERR_PTR(-EINVAL); > - } else if (order) { > - /* > - * If uffd is active for the vma, we need per-page fault > - * fidelity to maintain the uffd semantics, then fallback > - * to swapin order-0 folio, as well as for zswap case. > - * Any existing sub folio in the swap cache also blocks > - * mTHP swapin. 
> - */ > - if ((vma && unlikely(userfaultfd_armed(vma))) || > - !zswap_never_enabled() || > - non_swapcache_batch(entry, nr_pages) != nr_pages) > - goto fallback; > > - alloc_gfp = thp_limit_gfp_mask(vma_thp_gfp_mask(vma), gfp); > - } > -retry: > - new = shmem_alloc_folio(alloc_gfp, order, info, index); > - if (!new) { > - new = ERR_PTR(-ENOMEM); > - goto fallback; > - } > + if ((vmf && unlikely(userfaultfd_armed(vmf->vma))) || > + !zswap_never_enabled()) > + orders = 0; > > - if (mem_cgroup_swapin_charge_folio(new, vma ? vma->vm_mm : NULL, > - alloc_gfp, entry)) { > - folio_put(new); > - new = ERR_PTR(-ENOMEM); > - goto fallback; > - } > + mpol = shmem_get_pgoff_policy(info, index, order, &ilx); > + folio = swapin_entry(entry, gfp, orders, vmf, mpol, ilx); > + mpol_cond_put(mpol); > > - swapcache = swapin_folio(entry, new); > - if (swapcache != new) { > - folio_put(new); > - if (!swapcache) { > - /* > - * The new folio is charged already, swapin can > - * only fail due to another raced swapin. > - */ > - new = ERR_PTR(-EEXIST); > - goto fallback; > - } > - } > - return swapcache; > -fallback: > - /* Order 0 swapin failed, nothing to fallback to, abort */ > - if (!order) > - return new; > - entry.val += index - round_down(index, nr_pages); > - alloc_gfp = gfp; > - nr_pages = 1; > - order = 0; > - goto retry; > + return folio; > } > > /* > @@ -2265,11 +2221,12 @@ static int shmem_split_large_entry(struct inode *inode, pgoff_t index, > */ > static int shmem_swapin_folio(struct inode *inode, pgoff_t index, > struct folio **foliop, enum sgp_type sgp, > - gfp_t gfp, struct vm_area_struct *vma, > + gfp_t gfp, struct vm_fault *vmf, > vm_fault_t *fault_type) > { > struct address_space *mapping = inode->i_mapping; > - struct mm_struct *fault_mm = vma ? vma->vm_mm : NULL; > + struct vm_area_struct *vma = vmf ? vmf->vma : NULL; > + struct mm_struct *fault_mm = vmf ? 
vmf->vma->vm_mm : NULL; > struct shmem_inode_info *info = SHMEM_I(inode); > swp_entry_t swap; > softleaf_t index_entry; > @@ -2310,20 +2267,15 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index, > if (!folio) { > if (data_race(si->flags & SWP_SYNCHRONOUS_IO)) { > /* Direct swapin skipping swap cache & readahead */ > - folio = shmem_swap_alloc_folio(inode, vma, index, > - index_entry, order, gfp); > - if (IS_ERR(folio)) { > - error = PTR_ERR(folio); > - folio = NULL; > - goto failed; > - } > + folio = shmem_swap_alloc_folio(inode, vmf, index, > + swap, order, gfp); > } else { > /* Cached swapin only supports order 0 folio */ > folio = shmem_swapin_cluster(swap, gfp, info, index); > - if (!folio) { > - error = -ENOMEM; > - goto failed; > - } > + } > + if (!folio) { > + error = -ENOMEM; > + goto failed; > } > if (fault_type) { > *fault_type |= VM_FAULT_MAJOR; > @@ -2471,7 +2423,7 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index, > > if (xa_is_value(folio)) { > error = shmem_swapin_folio(inode, index, &folio, > - sgp, gfp, vma, fault_type); > + sgp, gfp, vmf, fault_type); > if (error == -EEXIST) > goto repeat; > > diff --git a/mm/swap.h b/mm/swap.h > index 6774af10a943..80c2f1bf7a57 100644 > --- a/mm/swap.h > +++ b/mm/swap.h > @@ -300,7 +300,8 @@ struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t flag, > struct mempolicy *mpol, pgoff_t ilx); > struct folio *swapin_readahead(swp_entry_t entry, gfp_t flag, > struct vm_fault *vmf); > -struct folio *swapin_folio(swp_entry_t entry, struct folio *folio); > +struct folio *swapin_entry(swp_entry_t entry, gfp_t flag, unsigned long orders, > + struct vm_fault *vmf, struct mempolicy *mpol, pgoff_t ilx); > void swap_update_readahead(struct folio *folio, struct vm_area_struct *vma, > unsigned long addr); > > @@ -334,24 +335,6 @@ static inline int swap_zeromap_batch(swp_entry_t entry, int max_nr, > return find_next_bit(sis->zeromap, end, start) - start; > } > > -static inline int non_swapcache_batch(swp_entry_t entry, int max_nr) > -{ > - int i; > - > - /* > - * While allocating a large folio and doing mTHP swapin, we need to > - * ensure all entries are not cached, otherwise, the mTHP folio will > - * be in conflict with the folio in swap cache. > - */ > - for (i = 0; i < max_nr; i++) { > - if (swap_cache_has_folio(entry)) > - return i; > - entry.val++; > - } > - > - return i; > -} > - > #else /* CONFIG_SWAP */ > struct swap_iocb; > static inline struct swap_cluster_info *swap_cluster_lock( > @@ -433,7 +416,9 @@ static inline struct folio *swapin_readahead(swp_entry_t swp, gfp_t gfp_mask, > return NULL; > } > > -static inline struct folio *swapin_folio(swp_entry_t entry, struct folio *folio) > +static inline struct folio *swapin_entry( > + swp_entry_t entry, gfp_t flag, unsigned long orders, > + struct vm_fault *vmf, struct mempolicy *mpol, pgoff_t ilx) > { > return NULL; > } > @@ -493,10 +478,5 @@ static inline int swap_zeromap_batch(swp_entry_t entry, int max_nr, > { > return 0; > } > - > -static inline int non_swapcache_batch(swp_entry_t entry, int max_nr) > -{ > - return 0; > -} > #endif /* CONFIG_SWAP */ > #endif /* _MM_SWAP_H */ > diff --git a/mm/swap_state.c b/mm/swap_state.c > index f5c77f348bbd..6ebd062bcece 100644 > --- a/mm/swap_state.c > +++ b/mm/swap_state.c > @@ -235,45 +235,6 @@ void __swap_cache_add_folio(struct swap_cluster_info *ci, > lruvec_stat_mod_folio(folio, NR_SWAPCACHE, nr_pages); > } > > -/** > - * swap_cache_add_folio - Add a folio into the swap cache. 
> - * @folio: The folio to be added. > - * @entry: The swap entry corresponding to the folio. > - * @shadowp: If a shadow is found, return the shadow. > - * > - * Add a folio into the swap cache. Will return error if any slot is no > - * longer a valid swapped out slot or already occupied by another folio. > - * > - * Context: Caller must ensure @entry is valid and protect the swap device > - * with reference count or locks. > - */ > -static int swap_cache_add_folio(struct folio *folio, swp_entry_t entry, > - void **shadowp) > -{ > - int err; > - void *shadow = NULL; > - unsigned int ci_off; > - struct swap_info_struct *si; > - struct swap_cluster_info *ci; > - unsigned long nr_pages = folio_nr_pages(folio); > - > - si = __swap_entry_to_info(entry); > - ci = swap_cluster_lock(si, swp_offset(entry)); > - ci_off = swp_cluster_offset(entry); > - err = __swap_cache_add_check(ci, entry, nr_pages, &shadow); > - if (err) { > - swap_cluster_unlock(ci); > - return err; > - } > - > - __swap_cache_add_folio(ci, folio, entry); > - swap_cluster_unlock(ci); > - if (shadowp) > - *shadowp = shadow; > - > - return 0; > -} > - > static void __swap_cache_do_del_folio(struct swap_cluster_info *ci, > struct folio *folio, > swp_entry_t entry, void *shadow) > @@ -644,51 +605,6 @@ void swap_update_readahead(struct folio *folio, struct vm_area_struct *vma, > } > } > > -/** > - * __swap_cache_prepare_and_add - Prepare the folio and add it to swap cache. > - * @entry: swap entry to be bound to the folio. > - * @folio: folio to be added. > - * @gfp: memory allocation flags for charge, can be 0 if @charged if true. > - * @charged: if the folio is already charged. > - * > - * Update the swap_map and add folio as swap cache, typically before swapin. > - * All swap slots covered by the folio must have a non-zero swap count. > - * > - * Context: Caller must protect the swap device with reference count or locks. > - * Return: 0 if success, error code if failed. > - */ > -static int __swap_cache_prepare_and_add(swp_entry_t entry, > - struct folio *folio, > - gfp_t gfp, bool charged) > -{ > - void *shadow; > - int ret; > - > - __folio_set_locked(folio); > - __folio_set_swapbacked(folio); > - > - if (!charged && mem_cgroup_swapin_charge_folio(folio, NULL, gfp, entry)) { > - ret = -ENOMEM; > - goto failed; > - } > - > - ret = swap_cache_add_folio(folio, entry, &shadow); > - if (ret) > - goto failed; > - > - memcg1_swapin(entry, folio_nr_pages(folio)); > - if (shadow) > - workingset_refault(folio, shadow); > - > - /* Caller will initiate read into locked folio */ > - folio_add_lru(folio); > - return 0; > - > -failed: > - folio_unlock(folio); > - return ret; > -} > - > static struct folio *swap_cache_read_folio(swp_entry_t entry, gfp_t gfp, > struct mempolicy *mpol, pgoff_t ilx, > struct swap_iocb **plug, bool readahead) > @@ -704,7 +620,6 @@ static struct folio *swap_cache_read_folio(swp_entry_t entry, gfp_t gfp, > folio = swap_cache_get_folio(entry); > if (folio) > return folio; > - > folio = swap_cache_alloc_folio(entry, gfp, 0, NULL, mpol, ilx); > } while (IS_ERR(folio) && PTR_ERR(folio) == -EEXIST); > > @@ -721,49 +636,37 @@ static struct folio *swap_cache_read_folio(swp_entry_t entry, gfp_t gfp, > } > > /** > - * swapin_folio - swap-in one or multiple entries skipping readahead. > - * @entry: starting swap entry to swap in > - * @folio: a new allocated and charged folio > + * swapin_entry - swap-in one or multiple entries skipping readahead. 
> + * @entry: swap entry indicating the target slot > + * @gfp: memory allocation flags > + * @orders: allocation orders > + * @vmf: fault information > + * @mpol: NUMA memory allocation policy to be applied > + * @ilx: NUMA interleave index, for use only when MPOL_INTERLEAVE > * > - * Reads @entry into @folio, @folio will be added to the swap cache. > - * If @folio is a large folio, the @entry will be rounded down to align > - * with the folio size. > + * This allocates a folio suitable for given @orders, or returns the > + * existing folio in the swap cache for @entry. This initiates the IO, too, > + * if needed. @entry is rounded down if @orders allow large allocation. > * > - * Return: returns pointer to @folio on success. If folio is a large folio > - * and this raced with another swapin, NULL will be returned to allow fallback > - * to order 0. Else, if another folio was already added to the swap cache, > - * return that swap cache folio instead. > + * Context: Caller must ensure @entry is valid and pin the swap device with refcount. > + * Return: Returns the folio on success, NULL if failed. > */ > -struct folio *swapin_folio(swp_entry_t entry, struct folio *folio) > +struct folio *swapin_entry(swp_entry_t entry, gfp_t gfp, unsigned long orders, > + struct vm_fault *vmf, struct mempolicy *mpol, pgoff_t ilx) > { > - int ret; > - struct folio *swapcache; > - pgoff_t offset = swp_offset(entry); > - unsigned long nr_pages = folio_nr_pages(folio); > - > - entry = swp_entry(swp_type(entry), round_down(offset, nr_pages)); > - for (;;) { > - ret = __swap_cache_prepare_and_add(entry, folio, 0, true); > - if (!ret) { > - swap_read_folio(folio, NULL); > - break; > - } > + struct folio *folio; > > - /* > - * Large order allocation needs special handling on > - * race: if a smaller folio exists in cache, swapin needs > - * to fall back to order 0, and doing a swap cache lookup > - * might return a folio that is irrelevant to the faulting > - * entry because @entry is aligned down. Just return NULL. > - */ > - if (ret != -EEXIST || nr_pages > 1) > - return NULL; > + do { > + folio = swap_cache_get_folio(entry); > + if (folio) > + return folio; > + folio = swap_cache_alloc_folio(entry, gfp, orders, vmf, mpol, ilx); > + } while (IS_ERR(folio) && PTR_ERR(folio) == -EEXIST); > > - swapcache = swap_cache_get_folio(entry); > - if (swapcache) > - return swapcache; > - } > + if (IS_ERR(folio)) > + return NULL; > > + swap_read_folio(folio, NULL); > return folio; > } > > diff --git a/mm/swapfile.c b/mm/swapfile.c > index c7e173b93e11..2e384d1c78c3 100644 > --- a/mm/swapfile.c > +++ b/mm/swapfile.c > @@ -1826,8 +1826,7 @@ void folio_put_swap(struct folio *folio, struct page *subpage) > * do_swap_page() > * ... swapoff+swapon > * swap_cache_alloc_folio() > - * swap_cache_add_folio() > - * // check swap_map > + * // check swap_map > * // verify PTE not changed > * > * In __swap_duplicate(), the swap_map need to be checked before > > -- > 2.53.0 > > ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v3 05/12] mm, swap: unify large folio allocation [not found] ` <20260421-swap-table-p4-v3-5-2f23759a76bc@tencent.com> 2026-05-06 20:48 ` [PATCH v3 05/12] mm, swap: unify large folio allocation Chris Li @ 2026-05-11 12:57 ` David Hildenbrand (Arm) 2026-05-11 14:37 ` Kairui Song 2026-05-12 10:10 ` Baolin Wang 2 siblings, 1 reply; 26+ messages in thread From: David Hildenbrand (Arm) @ 2026-05-11 12:57 UTC (permalink / raw) To: kasong, linux-mm Cc: Andrew Morton, Zi Yan, Baolin Wang, Barry Song, Hugh Dickins, Chris Li, Kemeng Shi, Nhat Pham, Baoquan He, Johannes Weiner, Youngjun Park, Chengming Zhou, Roman Gushchin, Shakeel Butt, Muchun Song, Qi Zheng, linux-kernel, cgroups, Yosry Ahmed, Lorenzo Stoakes, Dev Jain, Lance Yang, Michal Hocko, Michal Hocko, Suren Baghdasaryan, Axel Rasmussen On 4/21/26 08:16, Kairui Song via B4 Relay wrote: > From: Kairui Song <kasong@tencent.com> > > Now that direct large order allocation is supported in the swap cache, > both anon and shmem can use it instead of implementing their own methods. > This unifies the fallback and swap cache check, which also reduces the > TOCTOU race window of swap cache state: previously, high order swapin > required checking swap cache states first, then allocating and falling > back separately. Now all these steps happen in the same compact loop. > > Order fallback and statistics are also unified, callers just need to > check and pass the acceptable order bitmask. > > There is basically no behavior change. This only makes things more > unified and prepares for later commits. Cgroup and zero map checks can > also be moved into the compact loop, further reducing race windows and > redundancy > You should spell out the rename from swapin_folio() to swapin_entry() [and why it is done]. swapin_readahead() vs. swapin_entry() looks a bit odd, given that both consume an entry. > Signed-off-by: Kairui Song <kasong@tencent.com> > --- > mm/memory.c | 77 ++++++------------------ > mm/shmem.c | 94 +++++++++--------------------------- > mm/swap.h | 30 ++---------- > mm/swap_state.c | 145 ++++++++++---------------------------------- > mm/swapfile.c | 3 +- > 5 files changed, 67 insertions(+), 282 deletions(-) > > diff --git a/mm/memory.c b/mm/memory.c > index ea6568571131..404734a5bcff 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -4593,26 +4593,6 @@ static vm_fault_t handle_pte_marker(struct vm_fault *vmf) > return VM_FAULT_SIGBUS; > } > > -static struct folio *__alloc_swap_folio(struct vm_fault *vmf) > -{ > - struct vm_area_struct *vma = vmf->vma; > - struct folio *folio; > - softleaf_t entry; > - > - folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vma, vmf->address); > - if (!folio) > - return NULL; > - > - entry = softleaf_from_pte(vmf->orig_pte); > - if (mem_cgroup_swapin_charge_folio(folio, vma->vm_mm, > - GFP_KERNEL, entry)) { > - folio_put(folio); > - return NULL; > - } > - > - return folio; > -} > - > #ifdef CONFIG_TRANSPARENT_HUGEPAGE > /* > * Check if the PTEs within a range are contiguous swap entries > @@ -4642,8 +4622,6 @@ static bool can_swapin_thp(struct vm_fault *vmf, pte_t *ptep, int nr_pages) > */ > if (unlikely(swap_zeromap_batch(entry, nr_pages, NULL) != nr_pages)) > return false; > - if (unlikely(non_swapcache_batch(entry, nr_pages) != nr_pages)) > - return false; > This should also be pointed out in the patch description. 
(and why it is ok) > return true; > } > @@ -4671,16 +4649,14 @@ static inline unsigned long thp_swap_suitable_orders(pgoff_t swp_offset, > return orders; > } > > -static struct folio *alloc_swap_folio(struct vm_fault *vmf) > +static unsigned long thp_swapin_suitable_orders(struct vm_fault *vmf) > { > struct vm_area_struct *vma = vmf->vma; > unsigned long orders; > - struct folio *folio; > unsigned long addr; > softleaf_t entry; > spinlock_t *ptl; > pte_t *pte; > - gfp_t gfp; > int order; > > /* > @@ -4688,7 +4664,7 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf) > * maintain the uffd semantics. > */ > if (unlikely(userfaultfd_armed(vma))) > - goto fallback; > + return 0; > > /* > * A large swapped out folio could be partially or fully in zswap. We > @@ -4696,7 +4672,7 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf) > * folio. > */ > if (!zswap_never_enabled()) > - goto fallback; > + return 0; > > entry = softleaf_from_pte(vmf->orig_pte); > /* > @@ -4710,12 +4686,12 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf) > vmf->address, orders); > > if (!orders) > - goto fallback; > + return 0; > > pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd, > vmf->address & PMD_MASK, &ptl); > if (unlikely(!pte)) > - goto fallback; > + return 0; > > /* > * For do_swap_page, find the highest order where the aligned range is > @@ -4731,29 +4707,12 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf) > > pte_unmap_unlock(pte, ptl); > > - /* Try allocating the highest of the remaining orders. */ > - gfp = vma_thp_gfp_mask(vma); > - while (orders) { > - addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order); > - folio = vma_alloc_folio(gfp, order, vma, addr); > - if (folio) { > - if (!mem_cgroup_swapin_charge_folio(folio, vma->vm_mm, > - gfp, entry)) > - return folio; > - count_mthp_stat(order, MTHP_STAT_SWPIN_FALLBACK_CHARGE); > - folio_put(folio); > - } > - count_mthp_stat(order, MTHP_STAT_SWPIN_FALLBACK); > - order = next_order(&orders, order); > - } > - > -fallback: > - return __alloc_swap_folio(vmf); > + return orders; > } > #else /* !CONFIG_TRANSPARENT_HUGEPAGE */ > -static struct folio *alloc_swap_folio(struct vm_fault *vmf) > +static unsigned long thp_swapin_suitable_orders(struct vm_fault *vmf) > { > - return __alloc_swap_folio(vmf); > + return 0; > } > #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ > > @@ -4859,21 +4818,13 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) > if (folio) > swap_update_readahead(folio, vma, vmf->address); > if (!folio) { > - if (data_race(si->flags & SWP_SYNCHRONOUS_IO)) { > - folio = alloc_swap_folio(vmf); > - if (folio) { > - /* > - * folio is charged, so swapin can only fail due > - * to raced swapin and return NULL. > - */ > - swapcache = swapin_folio(entry, folio); > - if (swapcache != folio) > - folio_put(folio); > - folio = swapcache; > - } > - } else { > + /* Swapin bypasses readahead for SWP_SYNCHRONOUS_IO devices */ > + if (data_race(si->flags & SWP_SYNCHRONOUS_IO)) > + folio = swapin_entry(entry, GFP_HIGHUSER_MOVABLE, > + thp_swapin_suitable_orders(vmf), > + vmf, NULL, 0); > + else > folio = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE, vmf); > - } > > if (!folio) { > /* Nothing else jumped at me in memory.c -- Cheers, David ^ permalink raw reply [flat|nested] 26+ messages in thread
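A side note on the order-bitmask interface the commit message above describes: a mask like BIT(4) | BIT(2) | BIT(0) means "try order 4, then 2, then 0", and the fallback walk can be driven by the highest_order()/next_order() helpers from huge_mm.h, roughly as in the sketch below (illustrative only; the exact loop inside swap_cache_alloc_folio() may differ):

	struct folio *folio = NULL;
	int order = highest_order(orders);	/* highest set bit */

	while (orders) {
		folio = folio_alloc_mpol(gfp, order, mpol, ilx, numa_node_id());
		if (folio)
			break;
		/* Clear the failed order and fall back to the next highest. */
		order = next_order(&orders, order);
	}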
* Re: [PATCH v3 05/12] mm, swap: unify large folio allocation 2026-05-11 12:57 ` David Hildenbrand (Arm) @ 2026-05-11 14:37 ` Kairui Song 2026-05-11 15:15 ` David Hildenbrand (Arm) 0 siblings, 1 reply; 26+ messages in thread From: Kairui Song @ 2026-05-11 14:37 UTC (permalink / raw) To: David Hildenbrand (Arm) Cc: linux-mm, Andrew Morton, Zi Yan, Baolin Wang, Barry Song, Hugh Dickins, Chris Li, Kemeng Shi, Nhat Pham, Baoquan He, Johannes Weiner, Youngjun Park, Chengming Zhou, Roman Gushchin, Shakeel Butt, Muchun Song, Qi Zheng, linux-kernel, cgroups, Yosry Ahmed, Lorenzo Stoakes, Dev Jain, Lance Yang, Michal Hocko, Michal Hocko, Suren Baghdasaryan, Axel Rasmussen On Mon, May 11, 2026 at 8:58 PM David Hildenbrand (Arm) <david@kernel.org> wrote: > > On 4/21/26 08:16, Kairui Song via B4 Relay wrote: > > From: Kairui Song <kasong@tencent.com> > > > > Now that direct large order allocation is supported in the swap cache, > > both anon and shmem can use it instead of implementing their own methods. > > This unifies the fallback and swap cache check, which also reduces the > > TOCTOU race window of swap cache state: previously, high order swapin > > required checking swap cache states first, then allocating and falling > > back separately. Now all these steps happen in the same compact loop. > > > > Order fallback and statistics are also unified, callers just need to > > check and pass the acceptable order bitmask. > > > > There is basically no behavior change. This only makes things more > > unified and prepares for later commits. Cgroup and zero map checks can > > also be moved into the compact loop, further reducing race windows and > > redundancy > > > > You should spell out the rename from swapin_folio() to swapin_entry() [and why > it is done]. > > swapin_readahead() vs. swapin_entry() looks a bit odd, given that both consume > an entry. Yes, the current status is a bit odd, about two years ago I also wanted to name it `swapin_direct()`. https://lore.kernel.org/linux-mm/20240326185032.72159-3-ryncsn@gmail.com/ But actually ZRAM or shmem would also benefit from supporting unified readahead like this: https://lore.kernel.org/linux-mm/20240102175338.62012-6-ryncsn@gmail.com/ So calling it `swapin_entry` seems more future-proof. At some point in the future we might remove `swapin_readahead`. All swapin operations could have a unified or at least a per-device readahead policy like the one in the link above, instead of the current policy where the caller must decide whether to perform readahead. But any suggestion on naming is welcome :) > > #ifdef CONFIG_TRANSPARENT_HUGEPAGE > > /* > > * Check if the PTEs within a range are contiguous swap entries > > @@ -4642,8 +4622,6 @@ static bool can_swapin_thp(struct vm_fault *vmf, pte_t *ptep, int nr_pages) > > */ > > if (unlikely(swap_zeromap_batch(entry, nr_pages, NULL) != nr_pages)) > > return false; > > - if (unlikely(non_swapcache_batch(entry, nr_pages) != nr_pages)) > > - return false; > > > > This should also be pointed out in the patch description. (and why it is ok) Right, the check is now resolved by the swap cache layer, so the caller no longer needs to check it. I'll describe that in the commit message. Thanks! ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v3 05/12] mm, swap: unify large folio allocation 2026-05-11 14:37 ` Kairui Song @ 2026-05-11 15:15 ` David Hildenbrand (Arm) 2026-05-11 16:44 ` Kairui Song 0 siblings, 1 reply; 26+ messages in thread From: David Hildenbrand (Arm) @ 2026-05-11 15:15 UTC (permalink / raw) To: Kairui Song Cc: linux-mm, Andrew Morton, Zi Yan, Baolin Wang, Barry Song, Hugh Dickins, Chris Li, Kemeng Shi, Nhat Pham, Baoquan He, Johannes Weiner, Youngjun Park, Chengming Zhou, Roman Gushchin, Shakeel Butt, Muchun Song, Qi Zheng, linux-kernel, cgroups, Yosry Ahmed, Lorenzo Stoakes, Dev Jain, Lance Yang, Michal Hocko, Michal Hocko, Suren Baghdasaryan, Axel Rasmussen On 5/11/26 16:37, Kairui Song wrote: > On Mon, May 11, 2026 at 8:58 PM David Hildenbrand (Arm) > <david@kernel.org> wrote: >> >> On 4/21/26 08:16, Kairui Song via B4 Relay wrote: >>> From: Kairui Song <kasong@tencent.com> >>> >>> Now that direct large order allocation is supported in the swap cache, >>> both anon and shmem can use it instead of implementing their own methods. >>> This unifies the fallback and swap cache check, which also reduces the >>> TOCTOU race window of swap cache state: previously, high order swapin >>> required checking swap cache states first, then allocating and falling >>> back separately. Now all these steps happen in the same compact loop. >>> >>> Order fallback and statistics are also unified, callers just need to >>> check and pass the acceptable order bitmask. >>> >>> There is basically no behavior change. This only makes things more >>> unified and prepares for later commits. Cgroup and zero map checks can >>> also be moved into the compact loop, further reducing race windows and >>> redundancy >>> >> >> You should spell out the rename from swapin_folio() to swapin_entry() [and why >> it is done]. >> >> swapin_readahead() vs. swapin_entry() looks a bit odd, fiven that both consume >> an entry. > > Yes, the current status is a bit odd, about two years ago I also > wanted to name it `swapin_direct()`. > https://lore.kernel.org/linux-mm/20240326185032.72159-3-ryncsn@gmail.com/ > > But actually ZRAM or shmem would also benefit from supporting unified > readahead like this: > https://lore.kernel.org/linux-mm/20240102175338.62012-6-ryncsn@gmail.com/ > > So calling it `swapin_entry` seems more future-proof. At some point in > the future we might remove `swapin_readahead`. All swapin operations > could have a unified or at least a per-device readahead policy like > the one in the link above, instead of the current policy where the > caller must decide whether to perform readahead. > > But any suggestion on naming is welcome :) The other proposal https://lore.kernel.org/all/tencent_CD11FE9B4A0B362E95E776C5F679598FAA07@qq.com/ calls it swapin_synchronous_folio Maybe just swapin_sync_io()/swapin_sync() or sth like that? -- Cheers, David ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v3 05/12] mm, swap: unify large folio allocation 2026-05-11 15:15 ` David Hildenbrand (Arm) @ 2026-05-11 16:44 ` Kairui Song 2026-05-12 6:07 ` David Hildenbrand (Arm) 0 siblings, 1 reply; 26+ messages in thread From: Kairui Song @ 2026-05-11 16:44 UTC (permalink / raw) To: David Hildenbrand (Arm) Cc: linux-mm, Andrew Morton, Zi Yan, Baolin Wang, Barry Song, Hugh Dickins, Chris Li, Kemeng Shi, Nhat Pham, Baoquan He, Johannes Weiner, Youngjun Park, Chengming Zhou, Roman Gushchin, Shakeel Butt, Muchun Song, Qi Zheng, linux-kernel, cgroups, Yosry Ahmed, Lorenzo Stoakes, Dev Jain, Lance Yang, Michal Hocko, Michal Hocko, Suren Baghdasaryan, Axel Rasmussen On Mon, May 11, 2026 at 11:15 PM David Hildenbrand (Arm) <david@kernel.org> wrote: > > On 5/11/26 16:37, Kairui Song wrote: > > > > Yes, the current status is a bit odd, about two years ago I also > > wanted to name it `swapin_direct()`. > > https://lore.kernel.org/linux-mm/20240326185032.72159-3-ryncsn@gmail.com/ > > > > But actually ZRAM or shmem would also benefit from supporting unified > > readahead like this: > > https://lore.kernel.org/linux-mm/20240102175338.62012-6-ryncsn@gmail.com/ > > > > So calling it `swapin_entry` seems more future-proof. At some point in > > the future we might remove `swapin_readahead`. All swapin operations > > could have a unified or at least a per-device readahead policy like > > the one in the link above, instead of the current policy where the > > caller must decide whether to perform readahead. > > > > But any suggestion on naming is welcome :) > > The other proposal > > https://lore.kernel.org/all/tencent_CD11FE9B4A0B362E95E776C5F679598FAA07@qq.com/ > > calls it > > swapin_synchronous_folio > > Maybe just swapin_sync_io()/swapin_sync() or sth like that? Good idea, I can keep the swapin_sync name at this point. Sync io flag still may remain for a longer time. ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v3 05/12] mm, swap: unify large folio allocation 2026-05-11 16:44 ` Kairui Song @ 2026-05-12 6:07 ` David Hildenbrand (Arm) 0 siblings, 0 replies; 26+ messages in thread From: David Hildenbrand (Arm) @ 2026-05-12 6:07 UTC (permalink / raw) To: Kairui Song Cc: linux-mm, Andrew Morton, Zi Yan, Baolin Wang, Barry Song, Hugh Dickins, Chris Li, Kemeng Shi, Nhat Pham, Baoquan He, Johannes Weiner, Youngjun Park, Chengming Zhou, Roman Gushchin, Shakeel Butt, Muchun Song, Qi Zheng, linux-kernel, cgroups, Yosry Ahmed, Lorenzo Stoakes, Dev Jain, Lance Yang, Michal Hocko, Michal Hocko, Suren Baghdasaryan, Axel Rasmussen On 5/11/26 18:44, Kairui Song wrote: > On Mon, May 11, 2026 at 11:15 PM David Hildenbrand (Arm) > <david@kernel.org> wrote: >> >> On 5/11/26 16:37, Kairui Song wrote: >>> >>> Yes, the current status is a bit odd, about two years ago I also >>> wanted to name it `swapin_direct()`. >>> https://lore.kernel.org/linux-mm/20240326185032.72159-3-ryncsn@gmail.com/ >>> >>> But actually ZRAM or shmem would also benefit from supporting unified >>> readahead like this: >>> https://lore.kernel.org/linux-mm/20240102175338.62012-6-ryncsn@gmail.com/ >>> >>> So calling it `swapin_entry` seems more future-proof. At some point in >>> the future we might remove `swapin_readahead`. All swapin operations >>> could have a unified or at least a per-device readahead policy like >>> the one in the link above, instead of the current policy where the >>> caller must decide whether to perform readahead. >>> >>> But any suggestion on naming is welcome :) >> >> The other proposal >> >> https://lore.kernel.org/all/tencent_CD11FE9B4A0B362E95E776C5F679598FAA07@qq.com/ >> >> calls it >> >> swapin_synchronous_folio >> >> Maybe just swapin_sync_io()/swapin_sync() or sth like that? > > Good idea, I can keep the swapin_sync name at this point. Sync io flag > still may remain for a longer time. BTW, I was also wondering whether the whole sync vs. readahead part could simply be handled in a function called "swapin". moving that completely out of memory.c :) Probably something for another cleanup. -- Cheers, David ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v3 05/12] mm, swap: unify large folio allocation [not found] ` <20260421-swap-table-p4-v3-5-2f23759a76bc@tencent.com> 2026-05-06 20:48 ` [PATCH v3 05/12] mm, swap: unify large folio allocation Chris Li 2026-05-11 12:57 ` David Hildenbrand (Arm) @ 2026-05-12 10:10 ` Baolin Wang 2 siblings, 0 replies; 26+ messages in thread From: Baolin Wang @ 2026-05-12 10:10 UTC (permalink / raw) To: kasong, linux-mm Cc: Andrew Morton, David Hildenbrand, Zi Yan, Barry Song, Hugh Dickins, Chris Li, Kemeng Shi, Nhat Pham, Baoquan He, Johannes Weiner, Youngjun Park, Chengming Zhou, Roman Gushchin, Shakeel Butt, Muchun Song, Qi Zheng, linux-kernel, cgroups, Yosry Ahmed, Lorenzo Stoakes, Dev Jain, Lance Yang, Michal Hocko, Michal Hocko, Suren Baghdasaryan, Axel Rasmussen On 4/21/26 2:16 PM, Kairui Song via B4 Relay wrote: > From: Kairui Song <kasong@tencent.com> > > Now that direct large order allocation is supported in the swap cache, > both anon and shmem can use it instead of implementing their own methods. > This unifies the fallback and swap cache check, which also reduces the > TOCTOU race window of swap cache state: previously, high order swapin > required checking swap cache states first, then allocating and falling > back separately. Now all these steps happen in the same compact loop. > > Order fallback and statistics are also unified, callers just need to > check and pass the acceptable order bitmask. > > There is basically no behavior change. This only makes things more > unified and prepares for later commits. Cgroup and zero map checks can > also be moved into the compact loop, further reducing race windows and > redundancy > > Signed-off-by: Kairui Song <kasong@tencent.com> > --- > mm/memory.c | 77 ++++++------------------------ > mm/shmem.c | 94 +++++++++--------------------------- > mm/swap.h | 30 ++---------- > mm/swap_state.c | 145 ++++++++++---------------------------------------------- > mm/swapfile.c | 3 +- > 5 files changed, 67 insertions(+), 282 deletions(-) > > diff --git a/mm/memory.c b/mm/memory.c > index ea6568571131..404734a5bcff 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -4593,26 +4593,6 @@ static vm_fault_t handle_pte_marker(struct vm_fault *vmf) > return VM_FAULT_SIGBUS; > } > > -static struct folio *__alloc_swap_folio(struct vm_fault *vmf) > -{ > - struct vm_area_struct *vma = vmf->vma; > - struct folio *folio; > - softleaf_t entry; > - > - folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vma, vmf->address); > - if (!folio) > - return NULL; > - > - entry = softleaf_from_pte(vmf->orig_pte); > - if (mem_cgroup_swapin_charge_folio(folio, vma->vm_mm, > - GFP_KERNEL, entry)) { > - folio_put(folio); > - return NULL; > - } > - > - return folio; > -} > - > #ifdef CONFIG_TRANSPARENT_HUGEPAGE > /* > * Check if the PTEs within a range are contiguous swap entries > @@ -4642,8 +4622,6 @@ static bool can_swapin_thp(struct vm_fault *vmf, pte_t *ptep, int nr_pages) > */ > if (unlikely(swap_zeromap_batch(entry, nr_pages, NULL) != nr_pages)) > return false; > - if (unlikely(non_swapcache_batch(entry, nr_pages) != nr_pages)) > - return false; > > return true; > } > @@ -4671,16 +4649,14 @@ static inline unsigned long thp_swap_suitable_orders(pgoff_t swp_offset, > return orders; > } > > -static struct folio *alloc_swap_folio(struct vm_fault *vmf) > +static unsigned long thp_swapin_suitable_orders(struct vm_fault *vmf) > { > struct vm_area_struct *vma = vmf->vma; > unsigned long orders; > - struct folio *folio; > unsigned long addr; > softleaf_t entry; > spinlock_t *ptl; > pte_t 
*pte; > - gfp_t gfp; > int order; > > /* > @@ -4688,7 +4664,7 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf) > * maintain the uffd semantics. > */ > if (unlikely(userfaultfd_armed(vma))) > - goto fallback; > + return 0; > > /* > * A large swapped out folio could be partially or fully in zswap. We > @@ -4696,7 +4672,7 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf) > * folio. > */ > if (!zswap_never_enabled()) > - goto fallback; > + return 0; > > entry = softleaf_from_pte(vmf->orig_pte); > /* > @@ -4710,12 +4686,12 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf) > vmf->address, orders); > > if (!orders) > - goto fallback; > + return 0; > > pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd, > vmf->address & PMD_MASK, &ptl); > if (unlikely(!pte)) > - goto fallback; > + return 0; > > /* > * For do_swap_page, find the highest order where the aligned range is > @@ -4731,29 +4707,12 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf) > > pte_unmap_unlock(pte, ptl); > > - /* Try allocating the highest of the remaining orders. */ > - gfp = vma_thp_gfp_mask(vma); > - while (orders) { > - addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order); > - folio = vma_alloc_folio(gfp, order, vma, addr); > - if (folio) { > - if (!mem_cgroup_swapin_charge_folio(folio, vma->vm_mm, > - gfp, entry)) > - return folio; > - count_mthp_stat(order, MTHP_STAT_SWPIN_FALLBACK_CHARGE); > - folio_put(folio); > - } > - count_mthp_stat(order, MTHP_STAT_SWPIN_FALLBACK); > - order = next_order(&orders, order); > - } > - > -fallback: > - return __alloc_swap_folio(vmf); > + return orders; > } > #else /* !CONFIG_TRANSPARENT_HUGEPAGE */ > -static struct folio *alloc_swap_folio(struct vm_fault *vmf) > +static unsigned long thp_swapin_suitable_orders(struct vm_fault *vmf) > { > - return __alloc_swap_folio(vmf); > + return 0; > } > #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ > > @@ -4859,21 +4818,13 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) > if (folio) > swap_update_readahead(folio, vma, vmf->address); > if (!folio) { > - if (data_race(si->flags & SWP_SYNCHRONOUS_IO)) { > - folio = alloc_swap_folio(vmf); > - if (folio) { > - /* > - * folio is charged, so swapin can only fail due > - * to raced swapin and return NULL. 
> - */ > - swapcache = swapin_folio(entry, folio); > - if (swapcache != folio) > - folio_put(folio); > - folio = swapcache; > - } > - } else { > + /* Swapin bypasses readahead for SWP_SYNCHRONOUS_IO devices */ > + if (data_race(si->flags & SWP_SYNCHRONOUS_IO)) > + folio = swapin_entry(entry, GFP_HIGHUSER_MOVABLE, > + thp_swapin_suitable_orders(vmf), > + vmf, NULL, 0); > + else > folio = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE, vmf); > - } > > if (!folio) { > /* > diff --git a/mm/shmem.c b/mm/shmem.c > index 5916acf594a8..17e3da11bb1d 100644 > --- a/mm/shmem.c > +++ b/mm/shmem.c > @@ -159,7 +159,7 @@ static unsigned long shmem_default_max_inodes(void) > > static int shmem_swapin_folio(struct inode *inode, pgoff_t index, > struct folio **foliop, enum sgp_type sgp, gfp_t gfp, > - struct vm_area_struct *vma, vm_fault_t *fault_type); > + struct vm_fault *vmf, vm_fault_t *fault_type); > > static inline struct shmem_sb_info *SHMEM_SB(struct super_block *sb) > { > @@ -2017,68 +2017,24 @@ static struct folio *shmem_alloc_and_add_folio(struct vm_fault *vmf, > } > > static struct folio *shmem_swap_alloc_folio(struct inode *inode, > - struct vm_area_struct *vma, pgoff_t index, > + struct vm_fault *vmf, pgoff_t index, > swp_entry_t entry, int order, gfp_t gfp) > { > + pgoff_t ilx; > + struct folio *folio; > + struct mempolicy *mpol; > + unsigned long orders = BIT(order); > struct shmem_inode_info *info = SHMEM_I(inode); > - struct folio *new, *swapcache; > - int nr_pages = 1 << order; > - gfp_t alloc_gfp = gfp; > - > - if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) { > - if (WARN_ON_ONCE(order)) > - return ERR_PTR(-EINVAL); > - } else if (order) { > - /* > - * If uffd is active for the vma, we need per-page fault > - * fidelity to maintain the uffd semantics, then fallback > - * to swapin order-0 folio, as well as for zswap case. > - * Any existing sub folio in the swap cache also blocks > - * mTHP swapin. > - */ > - if ((vma && unlikely(userfaultfd_armed(vma))) || > - !zswap_never_enabled() || > - non_swapcache_batch(entry, nr_pages) != nr_pages) > - goto fallback; > > - alloc_gfp = thp_limit_gfp_mask(vma_thp_gfp_mask(vma), gfp); > - } > -retry: > - new = shmem_alloc_folio(alloc_gfp, order, info, index); > - if (!new) { > - new = ERR_PTR(-ENOMEM); > - goto fallback; > - } > + if ((vmf && unlikely(userfaultfd_armed(vmf->vma))) || > + !zswap_never_enabled()) > + orders = 0; > > - if (mem_cgroup_swapin_charge_folio(new, vma ? vma->vm_mm : NULL, > - alloc_gfp, entry)) { > - folio_put(new); > - new = ERR_PTR(-ENOMEM); > - goto fallback; > - } > + mpol = shmem_get_pgoff_policy(info, index, order, &ilx); > + folio = swapin_entry(entry, gfp, orders, vmf, mpol, ilx); > + mpol_cond_put(mpol); > > - swapcache = swapin_folio(entry, new); > - if (swapcache != new) { > - folio_put(new); > - if (!swapcache) { > - /* > - * The new folio is charged already, swapin can > - * only fail due to another raced swapin. > - */ > - new = ERR_PTR(-EEXIST); > - goto fallback; > - } > - } > - return swapcache; > -fallback: > - /* Order 0 swapin failed, nothing to fallback to, abort */ > - if (!order) > - return new; > - entry.val += index - round_down(index, nr_pages); > - alloc_gfp = gfp; > - nr_pages = 1; > - order = 0; > - goto retry; > + return folio; > } IIUC, in the __swap_cache_alloc() implementation in patch 4, when shmem swapin falls back to order 0, it doesn't adjust the swap entry value like here. Because the original swap entry may not correspond to the swap entry for the order 0 index. 
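I'd expect the fallback to still need something like the adjustment being removed above, e.g.:

	/* Redirect the entry to the faulting slot before retrying order 0 */
	entry.val += index - round_down(index, nr_pages);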
Of course, I haven't tested this yet, just pointing it out for you to double check. ^ permalink raw reply [flat|nested] 26+ messages in thread
[parent not found: <20260421-swap-table-p4-v3-6-2f23759a76bc@tencent.com>]
* Re: [PATCH v3 06/12] mm/memcg, swap: tidy up cgroup v1 memsw swap helpers [not found] ` <20260421-swap-table-p4-v3-6-2f23759a76bc@tencent.com> @ 2026-05-06 20:57 ` Chris Li 0 siblings, 0 replies; 26+ messages in thread From: Chris Li @ 2026-05-06 20:57 UTC (permalink / raw) To: kasong Cc: linux-mm, Andrew Morton, David Hildenbrand, Zi Yan, Baolin Wang, Barry Song, Hugh Dickins, Kemeng Shi, Nhat Pham, Baoquan He, Johannes Weiner, Youngjun Park, Chengming Zhou, Roman Gushchin, Shakeel Butt, Muchun Song, Qi Zheng, linux-kernel, cgroups, Yosry Ahmed, Lorenzo Stoakes, Dev Jain, Lance Yang, Michal Hocko, Michal Hocko, Suren Baghdasaryan, Axel Rasmussen On Tue, Apr 21, 2026 at 8:16 AM Kairui Song via B4 Relay <devnull+kasong.tencent.com@kernel.org> wrote: > > From: Kairui Song <kasong@tencent.com> > > The cgroup v1 swap helpers always operate on swap cache folios whose > swap entry is stable: the folio is locked and in the swap cache. There > is no need to pass the swap entry or page count as separate parameters > when they can be derived from the folio itself. > > Simplify the redundant parameters and add sanity checks to document > the required preconditions. > > Also rename memcg1_swapout to __memcg1_swapout to indicate it requires > special calling context: the folio must be isolated and dying, and the > call must be made with interrupts disabled. > > No functional change. > > Signed-off-by: Kairui Song <kasong@tencent.com> Acked-by: Chris Li <chrisl@kernel.org> Chris > --- > include/linux/memcontrol.h | 8 ++++---- > include/linux/swap.h | 10 ++++------ > mm/huge_memory.c | 2 +- > mm/memcontrol-v1.c | 33 ++++++++++++++++++++------------- > mm/memcontrol.c | 9 ++++----- > mm/swap_state.c | 4 ++-- > mm/swapfile.c | 2 +- > mm/vmscan.c | 2 +- > 8 files changed, 37 insertions(+), 33 deletions(-) > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h > index dc3fa687759b..7d08128de1fd 100644 > --- a/include/linux/memcontrol.h > +++ b/include/linux/memcontrol.h > @@ -1899,8 +1899,8 @@ static inline void mem_cgroup_exit_user_fault(void) > current->in_user_fault = 0; > } > > -void memcg1_swapout(struct folio *folio, swp_entry_t entry); > -void memcg1_swapin(swp_entry_t entry, unsigned int nr_pages); > +void __memcg1_swapout(struct folio *folio); > +void memcg1_swapin(struct folio *folio); > > #else /* CONFIG_MEMCG_V1 */ > static inline > @@ -1929,11 +1929,11 @@ static inline void mem_cgroup_exit_user_fault(void) > { > } > > -static inline void memcg1_swapout(struct folio *folio, swp_entry_t entry) > +static inline void __memcg1_swapout(struct folio *folio) > { > } > > -static inline void memcg1_swapin(swp_entry_t entry, unsigned int nr_pages) > +static inline void memcg1_swapin(struct folio *folio) > { > } > > diff --git a/include/linux/swap.h b/include/linux/swap.h > index 1930f81e6be4..f2949f5844a6 100644 > --- a/include/linux/swap.h > +++ b/include/linux/swap.h > @@ -574,13 +574,12 @@ static inline void folio_throttle_swaprate(struct folio *folio, gfp_t gfp) > #endif > > #if defined(CONFIG_MEMCG) && defined(CONFIG_SWAP) > -int __mem_cgroup_try_charge_swap(struct folio *folio, swp_entry_t entry); > -static inline int mem_cgroup_try_charge_swap(struct folio *folio, > - swp_entry_t entry) > +int __mem_cgroup_try_charge_swap(struct folio *folio); > +static inline int mem_cgroup_try_charge_swap(struct folio *folio) > { > if (mem_cgroup_disabled()) > return 0; > - return __mem_cgroup_try_charge_swap(folio, entry); > + return __mem_cgroup_try_charge_swap(folio); > } > > extern void 
__mem_cgroup_uncharge_swap(swp_entry_t entry, unsigned int nr_pages); > @@ -594,8 +593,7 @@ static inline void mem_cgroup_uncharge_swap(swp_entry_t entry, unsigned int nr_p > extern long mem_cgroup_get_nr_swap_pages(struct mem_cgroup *memcg); > extern bool mem_cgroup_swap_full(struct folio *folio); > #else > -static inline int mem_cgroup_try_charge_swap(struct folio *folio, > - swp_entry_t entry) > +static inline int mem_cgroup_try_charge_swap(struct folio *folio) > { > return 0; > } > diff --git a/mm/huge_memory.c b/mm/huge_memory.c > index 970e077019b7..9630e283cf25 100644 > --- a/mm/huge_memory.c > +++ b/mm/huge_memory.c > @@ -4431,7 +4431,7 @@ void deferred_split_folio(struct folio *folio, bool partially_mapped) > > /* > * Exclude swapcache: originally to avoid a corrupt deferred split > - * queue. Nowadays that is fully prevented by memcg1_swapout(); > + * queue. Nowadays that is fully prevented by __memcg1_swapout(); > * but if page reclaim is already handling the same folio, it is > * unnecessary to handle it again in the shrinker, so excluding > * swapcache here may still be a useful optimization. > diff --git a/mm/memcontrol-v1.c b/mm/memcontrol-v1.c > index 433bba9dfe71..36c507d81dc5 100644 > --- a/mm/memcontrol-v1.c > +++ b/mm/memcontrol-v1.c > @@ -604,18 +604,23 @@ void memcg1_commit_charge(struct folio *folio, struct mem_cgroup *memcg) > } > > /** > - * memcg1_swapout - transfer a memsw charge to swap > + * __memcg1_swapout - transfer a memsw charge to swap > * @folio: folio whose memsw charge to transfer > - * @entry: swap entry to move the charge to > * > - * Transfer the memsw charge of @folio to @entry. > + * Transfer the memsw charge of @folio to the swap entry stored in > + * folio->swap. > + * > + * Context: folio must be isolated, unmapped, locked and is just about > + * to be freed, and caller must disable IRQs. > */ > -void memcg1_swapout(struct folio *folio, swp_entry_t entry) > +void __memcg1_swapout(struct folio *folio) > { > struct mem_cgroup *memcg, *swap_memcg; > struct obj_cgroup *objcg; > unsigned int nr_entries; > > + VM_WARN_ON_ONCE_FOLIO(!folio_test_swapcache(folio), folio); > + VM_WARN_ON_ONCE_FOLIO(!folio_test_locked(folio), folio); > VM_BUG_ON_FOLIO(folio_test_lru(folio), folio); > VM_BUG_ON_FOLIO(folio_ref_count(folio), folio); > > @@ -641,7 +646,7 @@ void memcg1_swapout(struct folio *folio, swp_entry_t entry) > swap_memcg = mem_cgroup_private_id_get_online(memcg, nr_entries); > mod_memcg_state(swap_memcg, MEMCG_SWAP, nr_entries); > > - swap_cgroup_record(folio, mem_cgroup_private_id(swap_memcg), entry); > + swap_cgroup_record(folio, mem_cgroup_private_id(swap_memcg), folio->swap); > > folio_unqueue_deferred_split(folio); > folio->memcg_data = 0; > @@ -671,18 +676,20 @@ void memcg1_swapout(struct folio *folio, swp_entry_t entry) > obj_cgroup_put(objcg); > } > > -/* > +/** > * memcg1_swapin - uncharge swap slot > - * @entry: the first swap entry for which the pages are charged > - * @nr_pages: number of pages which will be uncharged > + * @folio: folio being swapped in > * > - * Call this function after successfully adding the charged page to swapcache. > + * Call this function after successfully adding the charged > + * folio to swapcache. > * > - * Note: This function assumes the page for which swap slot is being uncharged > - * is order 0 page. > + * Context: The folio has to be in swap cache and locked. 
> */ > -void memcg1_swapin(swp_entry_t entry, unsigned int nr_pages) > +void memcg1_swapin(struct folio *folio) > { > + VM_WARN_ON_ONCE_FOLIO(!folio_test_swapcache(folio), folio); > + VM_WARN_ON_ONCE_FOLIO(!folio_test_locked(folio), folio); > + > /* > * Cgroup1's unified memory+swap counter has been charged with the > * new swapcache page, finish the transfer by uncharging the swap > @@ -701,7 +708,7 @@ void memcg1_swapin(swp_entry_t entry, unsigned int nr_pages) > * let's not wait for it. The page already received a > * memory+swap charge, drop the swap entry duplicate. > */ > - mem_cgroup_uncharge_swap(entry, nr_pages); > + mem_cgroup_uncharge_swap(folio->swap, folio_nr_pages(folio)); > } > } > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index c3d98ab41f1f..c7df30ca5aa7 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -5456,13 +5456,12 @@ int __init mem_cgroup_init(void) > /** > * __mem_cgroup_try_charge_swap - try charging swap space for a folio > * @folio: folio being added to swap > - * @entry: swap entry to charge > * > - * Try to charge @folio's memcg for the swap space at @entry. > + * Try to charge @folio's memcg for the swap space at folio->swap. > * > * Returns 0 on success, -ENOMEM on failure. > */ > -int __mem_cgroup_try_charge_swap(struct folio *folio, swp_entry_t entry) > +int __mem_cgroup_try_charge_swap(struct folio *folio) > { > unsigned int nr_pages = folio_nr_pages(folio); > struct page_counter *counter; > @@ -5479,7 +5478,7 @@ int __mem_cgroup_try_charge_swap(struct folio *folio, swp_entry_t entry) > > rcu_read_lock(); > memcg = obj_cgroup_memcg(objcg); > - if (!entry.val) { > + if (!folio_test_swapcache(folio)) { > memcg_memory_event(memcg, MEMCG_SWAP_FAIL); > rcu_read_unlock(); > return 0; > @@ -5498,7 +5497,7 @@ int __mem_cgroup_try_charge_swap(struct folio *folio, swp_entry_t entry) > } > mod_memcg_state(memcg, MEMCG_SWAP, nr_pages); > > - swap_cgroup_record(folio, mem_cgroup_private_id(memcg), entry); > + swap_cgroup_record(folio, mem_cgroup_private_id(memcg), folio->swap); > > return 0; > } > diff --git a/mm/swap_state.c b/mm/swap_state.c > index 6ebd062bcece..12b290d43e45 100644 > --- a/mm/swap_state.c > +++ b/mm/swap_state.c > @@ -451,8 +451,8 @@ static struct folio *__swap_cache_alloc(struct swap_cluster_info *ci, > return ERR_PTR(-ENOMEM); > } > > - /* For memsw accounting, swap is uncharged when folio is added to swap cache */ > - memcg1_swapin(entry, 1 << order); > + /* memsw uncharges swap when folio is added to swap cache */ > + memcg1_swapin(folio); > if (shadow) > workingset_refault(folio, shadow); > > diff --git a/mm/swapfile.c b/mm/swapfile.c > index 2e384d1c78c3..e1ad77a69e54 100644 > --- a/mm/swapfile.c > +++ b/mm/swapfile.c > @@ -1730,7 +1730,7 @@ int folio_alloc_swap(struct folio *folio) > } > > /* Need to call this even if allocation failed, for MEMCG_SWAP_FAIL. 
*/ > - if (unlikely(mem_cgroup_try_charge_swap(folio, folio->swap))) > + if (unlikely(mem_cgroup_try_charge_swap(folio))) > swap_cache_del_folio(folio); > > if (unlikely(!folio_test_swapcache(folio))) > diff --git a/mm/vmscan.c b/mm/vmscan.c > index bd1b1aa12581..63d06930d8e3 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -739,7 +739,7 @@ static int __remove_mapping(struct address_space *mapping, struct folio *folio, > > if (reclaimed && !mapping_exiting(mapping)) > shadow = workingset_eviction(folio, target_memcg); > - memcg1_swapout(folio, swap); > + __memcg1_swapout(folio); > __swap_cache_del_folio(ci, folio, swap, shadow); > swap_cluster_unlock_irq(ci); > } else { > > -- > 2.53.0 > > ^ permalink raw reply [flat|nested] 26+ messages in thread
[parent not found: <20260421-swap-table-p4-v3-7-2f23759a76bc@tencent.com>]
* Re: [PATCH v3 07/12] mm, swap: support flexible batch freeing of slots in different memcgs [not found] ` <20260421-swap-table-p4-v3-7-2f23759a76bc@tencent.com> @ 2026-05-08 4:01 ` Chris Li 0 siblings, 0 replies; 26+ messages in thread From: Chris Li @ 2026-05-08 4:01 UTC (permalink / raw) To: kasong Cc: linux-mm, Andrew Morton, David Hildenbrand, Zi Yan, Baolin Wang, Barry Song, Hugh Dickins, Kemeng Shi, Nhat Pham, Baoquan He, Johannes Weiner, Youngjun Park, Chengming Zhou, Roman Gushchin, Shakeel Butt, Muchun Song, Qi Zheng, linux-kernel, cgroups, Yosry Ahmed, Lorenzo Stoakes, Dev Jain, Lance Yang, Michal Hocko, Michal Hocko, Suren Baghdasaryan, Axel Rasmussen On Tue, Apr 21, 2026 at 7:16 AM Kairui Song via B4 Relay <devnull+kasong.tencent.com@kernel.org> wrote: > > From: Kairui Song <kasong@tencent.com> > > Instead of requiring the caller to ensure all slots are in the same > memcg, make the function handle different memcgs at once. > > This is both a micro optimization and required for removing the memcg > lookup in the page table layer, so it can be unified at the swap layer. > > We are not removing the memcg lookup in the page table in this commit. > It has to be done after the memcg lookup is deferred to the swap layer. > > Signed-off-by: Kairui Song <kasong@tencent.com> Overall, it looks good. Some nitpicks follow. Acked-by: Chris Li <chrisl@kernel.org> > --- > mm/swapfile.c | 33 +++++++++++++++++++++++++++++---- > 1 file changed, 29 insertions(+), 4 deletions(-) > > diff --git a/mm/swapfile.c b/mm/swapfile.c > index e1ad77a69e54..8d3d22c463f3 100644 > --- a/mm/swapfile.c > +++ b/mm/swapfile.c > @@ -1872,21 +1872,46 @@ void __swap_cluster_free_entries(struct swap_info_struct *si, > unsigned int ci_start, unsigned int nr_pages) > { > unsigned long old_tb; > + unsigned int type = si->type; > + unsigned short id = 0, id_cur; Nitpick: I'm tempted to rename a few variables to improve my understanding. Feel free to keep it as it is. id -> batch_id > unsigned int ci_off = ci_start, ci_end = ci_start + nr_pages; > - unsigned long offset = cluster_offset(si, ci) + ci_start; > + unsigned long offset = cluster_offset(si, ci); Nitpick: offset -> ci_offset. This is the base offset of the ci, which stays fixed in the loop. > + unsigned int ci_batch = ci_off; Nitpick: ci_batch -> batch_off, this one goes with batch_id. > + swp_entry_t entry; > > VM_WARN_ON(ci->count < nr_pages); > > ci->count -= nr_pages; > do { > old_tb = __swap_table_get(ci, ci_off); > - /* Release the last ref, or after swap cache is dropped */ > + /* > + * Freeing is done after release of the last swap count > + * ref, or after swap cache is dropped > + */ > VM_WARN_ON(!swp_tb_is_shadow(old_tb) || __swp_tb_get_count(old_tb) > 1); > __swap_table_set(ci, ci_off, null_to_swp_tb()); > + > + /* > + * Uncharge swap slots by memcg in batches. Consecutive > + * slots with the same cgroup id are uncharged together. > + */ > + entry = swp_entry(type, offset + ci_off); Nitpick: This line confused me a bit. Two offsets are mentioned here: "offset + ci_off". One would assume that ci_off is the fixed offset of the ci, and offset is the incremental one. It is the other way around. > + id_cur = lookup_swap_cgroup_id(entry); > + if (id != id_cur) { > + if (id) > + mem_cgroup_uncharge_swap(swp_entry(type, offset + ci_batch), > + ci_off - ci_batch); With the above renames, this becomes "... swp_entry(type, ci_offset + batch_off), ...": the cluster base plus the batch offset combine into the swap entry, and "ci_off - batch_off"
is the running length from the beginning of the batch. > + id = id_cur; > + ci_batch = ci_off; > + } > } while (++ci_off < ci_end); > > - mem_cgroup_uncharge_swap(swp_entry(si->type, offset), nr_pages); > - swap_range_free(si, offset, nr_pages); > + if (id) { This becomes `if (batch_id)`, meaning that if we still have a pending batch, we flush it.
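Putting the renames together, the uncharge part would read something like this (same logic, untested):

	id_cur = lookup_swap_cgroup_id(swp_entry(type, ci_offset + ci_off));
	if (batch_id != id_cur) {
		if (batch_id)
			mem_cgroup_uncharge_swap(swp_entry(type, ci_offset + batch_off),
						 ci_off - batch_off);
		batch_id = id_cur;
		batch_off = ci_off;
	}

and after the loop:

	if (batch_id)
		mem_cgroup_uncharge_swap(swp_entry(type, ci_offset + batch_off),
					 ci_off - batch_off);

Chris ^ permalink raw reply [flat|nested] 26+ messages in thread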
[parent not found: <20260421-swap-table-p4-v3-8-2f23759a76bc@tencent.com>]
* Re: [PATCH v3 08/12] mm, swap: delay and unify memcg lookup and charging for swapin [not found] ` <20260421-swap-table-p4-v3-8-2f23759a76bc@tencent.com> @ 2026-05-08 4:46 ` Chris Li 0 siblings, 0 replies; 26+ messages in thread From: Chris Li @ 2026-05-08 4:46 UTC (permalink / raw) To: kasong Cc: linux-mm, Andrew Morton, David Hildenbrand, Zi Yan, Baolin Wang, Barry Song, Hugh Dickins, Kemeng Shi, Nhat Pham, Baoquan He, Johannes Weiner, Youngjun Park, Chengming Zhou, Roman Gushchin, Shakeel Butt, Muchun Song, Qi Zheng, linux-kernel, cgroups, Yosry Ahmed, Lorenzo Stoakes, Dev Jain, Lance Yang, Michal Hocko, Michal Hocko, Suren Baghdasaryan, Axel Rasmussen On Tue, Apr 21, 2026 at 2:16 AM Kairui Song via B4 Relay <devnull+kasong.tencent.com@kernel.org> wrote: > > From: Kairui Song <kasong@tencent.com> > > Instead of checking the cgroup private ID during page table walk in > swap_pte_batch(), move the memcg lookup into __swap_cache_add_check() > under the cluster lock. > > The first pre-alloc check is speculative and skips the memcg check since > the post-alloc stable check ensures all slots covered by the folio > belong to the same memcg. It is very rare for contiguous and aligned > entries across a contiguous region of a page table of the same process > or shmem mapping to belong to different memcgs. > > This also prepares for recording the memcg info in the cluster's table. > Also make the order check and fallback more compact. > > There should be no user-observable behavior change. > > Signed-off-by: Kairui Song <kasong@tencent.com> Acked-by: Chris Li <chrisl@kernel.org> > --- > include/linux/memcontrol.h | 6 +++--- > mm/internal.h | 10 +--------- > mm/memcontrol.c | 10 ++++------ > mm/swap_state.c | 28 +++++++++++++++++++--------- > 4 files changed, 27 insertions(+), 27 deletions(-) > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h > index 7d08128de1fd..a013f37f24aa 100644 > --- a/include/linux/memcontrol.h > +++ b/include/linux/memcontrol.h > @@ -646,8 +646,8 @@ static inline int mem_cgroup_charge(struct folio *folio, struct mm_struct *mm, > > int mem_cgroup_charge_hugetlb(struct folio* folio, gfp_t gfp); > > -int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm, > - gfp_t gfp, swp_entry_t entry); > +int mem_cgroup_swapin_charge_folio(struct folio *folio, unsigned short id, > + struct mm_struct *mm, gfp_t gfp); > > void __mem_cgroup_uncharge(struct folio *folio); > > @@ -1137,7 +1137,7 @@ static inline int mem_cgroup_charge_hugetlb(struct folio* folio, gfp_t gfp) > } > > static inline int mem_cgroup_swapin_charge_folio(struct folio *folio, > - struct mm_struct *mm, gfp_t gfp, swp_entry_t entry) > + unsigned short id, struct mm_struct *mm, gfp_t gfp) > { > return 0; > } > diff --git a/mm/internal.h b/mm/internal.h > index 5a2ddcf68e0b..9d2fec696bd6 100644 > --- a/mm/internal.h > +++ b/mm/internal.h > @@ -451,24 +451,16 @@ static inline int swap_pte_batch(pte_t *start_ptep, int max_nr, pte_t pte) > { > pte_t expected_pte = pte_next_swp_offset(pte); > const pte_t *end_ptep = start_ptep + max_nr; > - const softleaf_t entry = softleaf_from_pte(pte); > pte_t *ptep = start_ptep + 1; > - unsigned short cgroup_id; > > VM_WARN_ON(max_nr < 1); > - VM_WARN_ON(!softleaf_is_swap(entry)); > + VM_WARN_ON(!softleaf_is_swap(softleaf_from_pte(pte))); > > - cgroup_id = lookup_swap_cgroup_id(entry); > while (ptep < end_ptep) { > - softleaf_t entry; > - > pte = ptep_get(ptep); > > if (!pte_same(pte, expected_pte)) > break; > - entry = softleaf_from_pte(pte); > 
- if (lookup_swap_cgroup_id(entry) != cgroup_id) > - break; > expected_pte = pte_next_swp_offset(expected_pte); > ptep++; > } > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index c7df30ca5aa7..641706fa47bf 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -5062,27 +5062,25 @@ int mem_cgroup_charge_hugetlb(struct folio *folio, gfp_t gfp) > > /** > * mem_cgroup_swapin_charge_folio - Charge a newly allocated folio for swapin. > - * @folio: folio to charge. > + * @folio: the folio to charge > + * @id: memory cgroup id > * @mm: mm context of the victim > * @gfp: reclaim mode > - * @entry: swap entry for which the folio is allocated > * > * This function charges a folio allocated for swapin. Please call this before > * adding the folio to the swapcache. > * > * Returns 0 on success. Otherwise, an error code is returned. > */ > -int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm, > - gfp_t gfp, swp_entry_t entry) > +int mem_cgroup_swapin_charge_folio(struct folio *folio, unsigned short id, > + struct mm_struct *mm, gfp_t gfp) > { > struct mem_cgroup *memcg; > - unsigned short id; > int ret; > > if (mem_cgroup_disabled()) > return 0; > > - id = lookup_swap_cgroup_id(entry); > rcu_read_lock(); > memcg = mem_cgroup_from_private_id(id); > if (!memcg || !css_tryget_online(&memcg->css)) > diff --git a/mm/swap_state.c b/mm/swap_state.c > index 12b290d43e45..86d517a33a55 100644 > --- a/mm/swap_state.c > +++ b/mm/swap_state.c > @@ -142,16 +142,20 @@ void *swap_cache_get_shadow(swp_entry_t entry) > * @ci: The locked swap cluster > * @targ_entry: The target swap entry to check, will be rounded down by @nr > * @nr: Number of slots to check, must be a power of 2 > - * @shadowp: Returns the shadow value if one exists in the range. > + * @shadowp: Returns the shadow value if one exists in the range > + * @memcg_id: Returns the memory cgroup id, NULL to ignore cgroup check > * > * Check if all slots covered by given range have a swap count >= 1. > - * Retrieves the shadow if there is one. > + * Retrieves the shadow if there is one. If @memcg_id is not NULL, also > + * checks if all slots belong to the same cgroup and return the cgroup > + * private id. > * > * Context: Caller must lock the cluster. > */ > static int __swap_cache_add_check(struct swap_cluster_info *ci, > swp_entry_t targ_entry, > - unsigned long nr, void **shadowp) > + unsigned long nr, void **shadowp, > + unsigned short *memcg_id) > { > unsigned int ci_off, ci_end; > unsigned long old_tb; > @@ -169,19 +173,24 @@ static int __swap_cache_add_check(struct swap_cluster_info *ci, > return -EEXIST; > if (!__swp_tb_get_count(old_tb)) > return -ENOENT; > - if (swp_tb_is_shadow(old_tb) && shadowp) > + if (shadowp && swp_tb_is_shadow(old_tb)) > *shadowp = swp_tb_to_shadow(old_tb); > + if (memcg_id) > + *memcg_id = lookup_swap_cgroup_id(targ_entry); Nitpick: Consider also using a local variable to store the memcg_id value here. > > if (nr == 1) > return 0; > > + targ_entry.val = round_down(targ_entry.val, nr); > ci_off = round_down(ci_off, nr); > ci_end = ci_off + nr; > do { > old_tb = __swap_table_get(ci, ci_off); > if (unlikely(swp_tb_is_folio(old_tb) || > - !__swp_tb_get_count(old_tb))) > + !__swp_tb_get_count(old_tb) || > + (memcg_id && *memcg_id != lookup_swap_cgroup_id(targ_entry)))) Nitpick: You can then use the local variable here to avoid a memory fetch. Micro optimizations.
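Something like this (untested sketch, behavior unchanged):

	unsigned short id = 0;
	...
	if (memcg_id)
		id = *memcg_id = lookup_swap_cgroup_id(targ_entry);
	...
	do {
		old_tb = __swap_table_get(ci, ci_off);
		if (unlikely(swp_tb_is_folio(old_tb) ||
			     !__swp_tb_get_count(old_tb) ||
			     (memcg_id && id != lookup_swap_cgroup_id(targ_entry))))
			return -EBUSY;
		targ_entry.val++;
	} while (++ci_off < ci_end);

Chris ^ permalink raw reply [flat|nested] 26+ messages in thread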
[parent not found: <20260421-swap-table-p4-v3-9-2f23759a76bc@tencent.com>]
* Re: [PATCH v3 09/12] mm, swap: consolidate cluster allocation helpers [not found] ` <20260421-swap-table-p4-v3-9-2f23759a76bc@tencent.com> @ 2026-05-08 5:02 ` Chris Li 0 siblings, 0 replies; 26+ messages in thread From: Chris Li @ 2026-05-08 5:02 UTC (permalink / raw) To: kasong Cc: linux-mm, Andrew Morton, David Hildenbrand, Zi Yan, Baolin Wang, Barry Song, Hugh Dickins, Kemeng Shi, Nhat Pham, Baoquan He, Johannes Weiner, Youngjun Park, Chengming Zhou, Roman Gushchin, Shakeel Butt, Muchun Song, Qi Zheng, linux-kernel, cgroups, Yosry Ahmed, Lorenzo Stoakes, Dev Jain, Lance Yang, Michal Hocko, Michal Hocko, Suren Baghdasaryan, Axel Rasmussen On Tue, Apr 21, 2026 at 2:16 AM Kairui Song via B4 Relay <devnull+kasong.tencent.com@kernel.org> wrote: > > From: Kairui Song <kasong@tencent.com> > > Swap cluster table management is spread across several narrow > helpers. As a result, the allocation and fallback sequences are > open-coded in multiple places. > > A few more per-cluster tables will be added soon, so avoid > duplicating these sequences per table type. Fold the existing > pairs into cluster-oriented helpers, and rename for consistency. > > No functional change, only a few sanity checks are slightly adjusted. > > Signed-off-by: Kairui Song <kasong@tencent.com> Acked-by: Chris Li <chrisl@kernel.org> Chris > --- > mm/swapfile.c | 110 ++++++++++++++++++++++++++-------------------------------- > 1 file changed, 49 insertions(+), 61 deletions(-) > > diff --git a/mm/swapfile.c b/mm/swapfile.c > index 8d3d22c463f3..2d16aa89a4fd 100644 > --- a/mm/swapfile.c > +++ b/mm/swapfile.c > @@ -411,20 +411,7 @@ static inline unsigned int cluster_offset(struct swap_info_struct *si, > return cluster_index(si, ci) * SWAPFILE_CLUSTER; > } > > -static struct swap_table *swap_table_alloc(gfp_t gfp) > -{ > - struct folio *folio; > - > - if (!SWP_TABLE_USE_PAGE) > - return kmem_cache_zalloc(swap_table_cachep, gfp); > - > - folio = folio_alloc(gfp | __GFP_ZERO, 0); > - if (folio) > - return folio_address(folio); > - return NULL; > -} > - > -static void swap_table_free_folio_rcu_cb(struct rcu_head *head) > +static void swap_cluster_free_table_folio_rcu_cb(struct rcu_head *head) > { > struct folio *folio; > > @@ -432,15 +419,46 @@ static void swap_table_free_folio_rcu_cb(struct rcu_head *head) > folio_put(folio); > } > > -static void swap_table_free(struct swap_table *table) > +static void swap_cluster_free_table(struct swap_cluster_info *ci) > { > + struct swap_table *table; > + > + table = (struct swap_table *)rcu_dereference_protected(ci->table, true); > + if (!table) > + return; > + > + rcu_assign_pointer(ci->table, NULL); > if (!SWP_TABLE_USE_PAGE) { > kmem_cache_free(swap_table_cachep, table); > return; > } > > call_rcu(&(folio_page(virt_to_folio(table), 0)->rcu_head), > - swap_table_free_folio_rcu_cb); > + swap_cluster_free_table_folio_rcu_cb); > +} > + > +static int swap_cluster_alloc_table(struct swap_cluster_info *ci, gfp_t gfp) > +{ > + struct swap_table *table = NULL; > + struct folio *folio; > + > + /* The cluster must be empty and not on any list during allocation. 
*/ > + VM_WARN_ON_ONCE(ci->flags || !cluster_is_empty(ci)); > + if (rcu_access_pointer(ci->table)) > + return 0; > + > + if (SWP_TABLE_USE_PAGE) { > + folio = folio_alloc(gfp | __GFP_ZERO, 0); > + if (folio) > + table = folio_address(folio); > + } else { > + table = kmem_cache_zalloc(swap_table_cachep, gfp); > + } > + if (!table) > + return -ENOMEM; > + > + rcu_assign_pointer(ci->table, table); > + return 0; > } > > /* > @@ -471,27 +489,15 @@ static void swap_cluster_assert_empty(struct swap_cluster_info *ci, > WARN_ON_ONCE(nr == SWAPFILE_CLUSTER && ci->extend_table); > } > > -static void swap_cluster_free_table(struct swap_cluster_info *ci) > -{ > - struct swap_table *table; > - > - /* Only empty cluster's table is allow to be freed */ > - lockdep_assert_held(&ci->lock); > - table = (void *)rcu_dereference_protected(ci->table, true); > - rcu_assign_pointer(ci->table, NULL); > - > - swap_table_free(table); > -} > - > /* > * Allocate swap table for one cluster. Attempt an atomic allocation first, > * then fallback to sleeping allocation. > */ > static struct swap_cluster_info * > -swap_cluster_alloc_table(struct swap_info_struct *si, > +swap_cluster_populate(struct swap_info_struct *si, > struct swap_cluster_info *ci) > { > - struct swap_table *table; > + int ret; > > /* > * Only cluster isolation from the allocator does table allocation. > @@ -502,14 +508,9 @@ swap_cluster_alloc_table(struct swap_info_struct *si, > lockdep_assert_held(&si->global_cluster_lock); > lockdep_assert_held(&ci->lock); > > - /* The cluster must be free and was just isolated from the free list. */ > - VM_WARN_ON_ONCE(ci->flags || !cluster_is_empty(ci)); > - > - table = swap_table_alloc(__GFP_HIGH | __GFP_NOMEMALLOC | __GFP_NOWARN); > - if (table) { > - rcu_assign_pointer(ci->table, table); > + if (!swap_cluster_alloc_table(ci, __GFP_HIGH | __GFP_NOMEMALLOC | > + __GFP_NOWARN)) > return ci; > - } > > /* > * Try a sleep allocation. Each isolated free cluster may cause > @@ -521,7 +522,8 @@ swap_cluster_alloc_table(struct swap_info_struct *si, > spin_unlock(&si->global_cluster_lock); > local_unlock(&percpu_swap_cluster.lock); > > - table = swap_table_alloc(__GFP_HIGH | __GFP_NOMEMALLOC | GFP_KERNEL); > + ret = swap_cluster_alloc_table(ci, __GFP_HIGH | __GFP_NOMEMALLOC | > + GFP_KERNEL); > > /* > * Back to atomic context. We might have migrated to a new CPU with a > @@ -536,20 +538,11 @@ swap_cluster_alloc_table(struct swap_info_struct *si, > spin_lock(&si->global_cluster_lock); > spin_lock(&ci->lock); > > - /* Nothing except this helper should touch a dangling empty cluster. */ > - if (WARN_ON_ONCE(cluster_table_is_alloced(ci))) { > - if (table) > - swap_table_free(table); > - return ci; > - } > - > - if (!table) { > + if (ret) { > move_cluster(si, ci, &si->free_clusters, CLUSTER_FLAG_FREE); > spin_unlock(&ci->lock); > return NULL; > } > - > - rcu_assign_pointer(ci->table, table); > return ci; > } > > @@ -621,12 +614,11 @@ static struct swap_cluster_info *isolate_lock_cluster( > } > spin_unlock(&si->lock); > > - if (found && !cluster_table_is_alloced(found)) { > - /* Only an empty free cluster's swap table can be freed. */ > - VM_WARN_ON_ONCE(flags != CLUSTER_FLAG_FREE); > + /* Cluster's table is freed when and only when it's on the free list. 
*/ > + if (found && flags == CLUSTER_FLAG_FREE) { > VM_WARN_ON_ONCE(list != &si->free_clusters); > - VM_WARN_ON_ONCE(!cluster_is_empty(found)); > - return swap_cluster_alloc_table(si, found); > + VM_WARN_ON_ONCE(cluster_table_is_alloced(found)); > + return swap_cluster_populate(si, found); > } > > return found; > @@ -769,7 +761,6 @@ static int swap_cluster_setup_bad_slot(struct swap_info_struct *si, > unsigned int ci_off = offset % SWAPFILE_CLUSTER; > unsigned long idx = offset / SWAPFILE_CLUSTER; > struct swap_cluster_info *ci; > - struct swap_table *table; > int ret = 0; > > /* si->max may got shrunk by swap swap_activate() */ > @@ -790,12 +781,9 @@ static int swap_cluster_setup_bad_slot(struct swap_info_struct *si, > } > > ci = cluster_info + idx; > - if (!ci->table) { > - table = swap_table_alloc(GFP_KERNEL); > - if (!table) > - return -ENOMEM; > - rcu_assign_pointer(ci->table, table); > - } > + /* Need to allocate swap table first for initial bad slot marking. */ > + if (!ci->count && swap_cluster_alloc_table(ci, GFP_KERNEL)) > + return -ENOMEM; > spin_lock(&ci->lock); > /* Check for duplicated bad swap slots. */ > if (__swap_table_xchg(ci, ci_off, SWP_TB_BAD) != SWP_TB_NULL) { > @@ -2992,7 +2980,7 @@ static void free_swap_cluster_info(struct swap_cluster_info *cluster_info, > ci = cluster_info + i; > /* Cluster with bad marks count will have a remaining table */ > spin_lock(&ci->lock); > - if (rcu_dereference_protected(ci->table, true)) { > + if (cluster_table_is_alloced(ci)) { > swap_cluster_assert_empty(ci, 0, SWAPFILE_CLUSTER, true); > swap_cluster_free_table(ci); > } > > -- > 2.53.0 > > ^ permalink raw reply [flat|nested] 26+ messages in thread
[parent not found: <20260421-swap-table-p4-v3-10-2f23759a76bc@tencent.com>]
* Re: [PATCH v3 10/12] mm/memcg, swap: store cgroup id in cluster table directly [not found] ` <20260421-swap-table-p4-v3-10-2f23759a76bc@tencent.com> @ 2026-05-08 22:46 ` Chris Li 0 siblings, 0 replies; 26+ messages in thread From: Chris Li @ 2026-05-08 22:46 UTC (permalink / raw) To: kasong Cc: linux-mm, Andrew Morton, David Hildenbrand, Zi Yan, Baolin Wang, Barry Song, Hugh Dickins, Kemeng Shi, Nhat Pham, Baoquan He, Johannes Weiner, Youngjun Park, Chengming Zhou, Roman Gushchin, Shakeel Butt, Muchun Song, Qi Zheng, linux-kernel, cgroups, Yosry Ahmed, Lorenzo Stoakes, Dev Jain, Lance Yang, Michal Hocko, Michal Hocko, Suren Baghdasaryan, Axel Rasmussen On Tue, Apr 21, 2026 at 2:16 AM Kairui Song via B4 Relay <devnull+kasong.tencent.com@kernel.org> wrote: > > From: Kairui Song <kasong@tencent.com> > > Drop the usage of the swap_cgroup_ctrl, and use the dynamic cluster > table instead. Nice! It takes so many steps to finally drop the statically allocated swap cgroup ctrl array. Thank you for making it happen. > > The per-cluster memcg table is 1024 / 512 bytes on most archs, and does > not need RCU protection: the cgroup data is only read and written under > the cluster lock. That keeps things simple, lets the allocation use > plain kmalloc with immediate kfree (no deferred free), and keeps > fragmentation acceptable. > > Signed-off-by: Kairui Song <kasong@tencent.com> Overall looks good, with some nitpicks and questions below. Acked-by: Chris Li <chrisl@kernel.org> > --- > include/linux/memcontrol.h | 6 ++++-- > include/linux/swap.h | 8 +++---- > mm/memcontrol-v1.c | 42 +++++++++++++++++++++++------------- > mm/memcontrol.c | 14 +++++++----- > mm/swap.h | 4 ++++ > mm/swap_state.c | 6 ++---- > mm/swap_table.h | 54 ++++++++++++++++++++++++++++++++++++++++++++++ > mm/swapfile.c | 35 +++++++++++++++++++----------- > mm/vmscan.c | 2 +- > 9 files changed, 128 insertions(+), 43 deletions(-) > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h > index a013f37f24aa..bf1a6e131eca 100644 > --- a/include/linux/memcontrol.h > +++ b/include/linux/memcontrol.h > @@ -29,6 +29,7 @@ struct obj_cgroup; > struct page; > struct mm_struct; > struct kmem_cache; > +struct swap_cluster_info; > > /* Cgroup-specific page state, on top of universal node page state */ > enum memcg_stat_item { > @@ -1899,7 +1900,7 @@ static inline void mem_cgroup_exit_user_fault(void) > current->in_user_fault = 0; > } > > -void __memcg1_swapout(struct folio *folio); > +void __memcg1_swapout(struct folio *folio, struct swap_cluster_info *ci); > void memcg1_swapin(struct folio *folio); > > #else /* CONFIG_MEMCG_V1 */ > @@ -1929,7 +1930,8 @@ static inline void mem_cgroup_exit_user_fault(void) > { > } > > -static inline void __memcg1_swapout(struct folio *folio) > +static inline void __memcg1_swapout(struct folio *folio, > + struct swap_cluster_info *ci) > { > } > > diff --git a/include/linux/swap.h b/include/linux/swap.h > index f2949f5844a6..57af4647d432 100644 > --- a/include/linux/swap.h > +++ b/include/linux/swap.h > @@ -582,12 +582,12 @@ static inline int mem_cgroup_try_charge_swap(struct folio *folio) > return __mem_cgroup_try_charge_swap(folio); > } > > -extern void __mem_cgroup_uncharge_swap(swp_entry_t entry, unsigned int nr_pages); > +extern void __mem_cgroup_uncharge_swap(unsigned short id, unsigned int nr_pages); > +static inline void mem_cgroup_uncharge_swap(unsigned short id, unsigned int nr_pages) > { > if
(mem_cgroup_disabled()) > return; > - __mem_cgroup_uncharge_swap(entry, nr_pages); > + __mem_cgroup_uncharge_swap(id, nr_pages); > } > > extern long mem_cgroup_get_nr_swap_pages(struct mem_cgroup *memcg); > @@ -598,7 +598,7 @@ static inline int mem_cgroup_try_charge_swap(struct folio *folio) > return 0; > } > > -static inline void mem_cgroup_uncharge_swap(swp_entry_t entry, > +static inline void mem_cgroup_uncharge_swap(unsigned short id, > unsigned int nr_pages) > { > } > diff --git a/mm/memcontrol-v1.c b/mm/memcontrol-v1.c > index 36c507d81dc5..494e7b9adc60 100644 > --- a/mm/memcontrol-v1.c > +++ b/mm/memcontrol-v1.c > @@ -14,6 +14,7 @@ > > #include "internal.h" > #include "swap.h" > +#include "swap_table.h" > #include "memcontrol-v1.h" > > /* > @@ -606,14 +607,15 @@ void memcg1_commit_charge(struct folio *folio, struct mem_cgroup *memcg) > /** > * __memcg1_swapout - transfer a memsw charge to swap > * @folio: folio whose memsw charge to transfer > + * @ci: the locked swap cluster holding the swap entries > * > * Transfer the memsw charge of @folio to the swap entry stored in > * folio->swap. > * > - * Context: folio must be isolated, unmapped, locked and is just about > - * to be freed, and caller must disable IRQs. > + * Context: folio must be isolated, unmapped, locked and is just about to > + * be freed, and caller must disable IRQs and hold the swap cluster lock. > */ > -void __memcg1_swapout(struct folio *folio) > +void __memcg1_swapout(struct folio *folio, struct swap_cluster_info *ci) > { > struct mem_cgroup *memcg, *swap_memcg; > struct obj_cgroup *objcg; > @@ -646,7 +648,8 @@ void __memcg1_swapout(struct folio *folio) > swap_memcg = mem_cgroup_private_id_get_online(memcg, nr_entries); > mod_memcg_state(swap_memcg, MEMCG_SWAP, nr_entries); > > - swap_cgroup_record(folio, mem_cgroup_private_id(swap_memcg), folio->swap); > + __swap_cgroup_set(ci, swp_cluster_offset(folio->swap), nr_entries, > + mem_cgroup_private_id(swap_memcg)); > > folio_unqueue_deferred_split(folio); > folio->memcg_data = 0; > @@ -661,8 +664,7 @@ void __memcg1_swapout(struct folio *folio) > } > > /* > - * Interrupts should be disabled here because the caller holds the > - * i_pages lock which is taken with interrupts-off. It is > + * The caller must hold the swap cluster lock with IRQ off. It is > * important here to have the interrupts disabled because it is the > * only synchronisation we have for updating the per-CPU variables. > */ > @@ -677,7 +679,7 @@ void __memcg1_swapout(struct folio *folio) > } > > /** > - * memcg1_swapin - uncharge swap slot > + * memcg1_swapin - uncharge swap slot on swapin > * @folio: folio being swapped in > * > * Call this function after successfully adding the charged > @@ -687,6 +689,10 @@ void __memcg1_swapout(struct folio *folio) > */ > void memcg1_swapin(struct folio *folio) > { > + struct swap_cluster_info *ci; > + unsigned long nr_pages; > + unsigned short id; > + > VM_WARN_ON_ONCE_FOLIO(!folio_test_swapcache(folio), folio); > VM_WARN_ON_ONCE_FOLIO(!folio_test_locked(folio), folio); > > @@ -702,14 +708,20 @@ void memcg1_swapin(struct folio *folio) > * correspond 1:1 to page and swap slot lifetimes: we charge the > * page to memory here, and uncharge swap when the slot is freed. > */ > - if (do_memsw_account()) { > - /* > - * The swap entry might not get freed for a long time, > - * let's not wait for it. The page already received a > - * memory+swap charge, drop the swap entry duplicate. 
> - */ > - mem_cgroup_uncharge_swap(folio->swap, folio_nr_pages(folio)); > - } > + if (!do_memsw_account()) > + return; > + > + /* > + * The swap entry might not get freed for a long time, > + * let's not wait for it. The page already received a > + * memory+swap charge, drop the swap entry duplicate. > + */ > + nr_pages = folio_nr_pages(folio); > + ci = swap_cluster_get_and_lock(folio); > + id = __swap_cgroup_clear(ci, swp_cluster_offset(folio->swap), > + nr_pages); > + swap_cluster_unlock(ci); > + mem_cgroup_uncharge_swap(id, nr_pages); > } > > void memcg1_uncharge_batch(struct mem_cgroup *memcg, unsigned long pgpgout, > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index 641706fa47bf..193c8eb73be7 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -64,6 +64,8 @@ > #include <linux/sched/isolation.h> > #include <linux/kmemleak.h> > #include "internal.h" > +#include "swap.h" > +#include "swap_table.h" > #include <net/sock.h> > #include <net/ip.h> > #include "slab.h" > @@ -5462,6 +5464,7 @@ int __init mem_cgroup_init(void) > int __mem_cgroup_try_charge_swap(struct folio *folio) > { > unsigned int nr_pages = folio_nr_pages(folio); > + struct swap_cluster_info *ci; > struct page_counter *counter; > struct mem_cgroup *memcg; > struct obj_cgroup *objcg; > @@ -5495,22 +5498,23 @@ int __mem_cgroup_try_charge_swap(struct folio *folio) > } > mod_memcg_state(memcg, MEMCG_SWAP, nr_pages); > > - swap_cgroup_record(folio, mem_cgroup_private_id(memcg), folio->swap); > + ci = swap_cluster_get_and_lock(folio); > + __swap_cgroup_set(ci, swp_cluster_offset(folio->swap), nr_pages, > + mem_cgroup_private_id(memcg)); > + swap_cluster_unlock(ci); > > return 0; > } > > /** > * __mem_cgroup_uncharge_swap - uncharge swap space > - * @entry: swap entry to uncharge > + * @id: cgroup id to uncharge > * @nr_pages: the amount of swap space to uncharge > */ > -void __mem_cgroup_uncharge_swap(swp_entry_t entry, unsigned int nr_pages) > +void __mem_cgroup_uncharge_swap(unsigned short id, unsigned int nr_pages) > { > struct mem_cgroup *memcg; > - unsigned short id; > > - id = swap_cgroup_clear(entry, nr_pages); > rcu_read_lock(); > memcg = mem_cgroup_from_private_id(id); > if (memcg) { > diff --git a/mm/swap.h b/mm/swap.h > index 80c2f1bf7a57..e4ac7dbc1080 100644 > --- a/mm/swap.h > +++ b/mm/swap.h > @@ -5,6 +5,7 @@ > #include <linux/atomic.h> /* for atomic_long_t */ > struct mempolicy; > struct swap_iocb; > +struct swap_memcg_table; > > extern int page_cluster; > > @@ -38,6 +39,9 @@ struct swap_cluster_info { > u8 order; > atomic_long_t __rcu *table; /* Swap table entries, see mm/swap_table.h */ > unsigned int *extend_table; /* For large swap count, protected by ci->lock */ > +#ifdef CONFIG_MEMCG > + struct swap_memcg_table *memcg_table; /* Swap table entries' cgroup record */ > +#endif > struct list_head list; > }; > > diff --git a/mm/swap_state.c b/mm/swap_state.c > index 86d517a33a55..71a3f128fcf0 100644 > --- a/mm/swap_state.c > +++ b/mm/swap_state.c > @@ -176,21 +176,19 @@ static int __swap_cache_add_check(struct swap_cluster_info *ci, > if (shadowp && swp_tb_is_shadow(old_tb)) > *shadowp = swp_tb_to_shadow(old_tb); > if (memcg_id) > - *memcg_id = lookup_swap_cgroup_id(targ_entry); > + *memcg_id = __swap_cgroup_get(ci, ci_off); > > if (nr == 1) > return 0; > > - targ_entry.val = round_down(targ_entry.val, nr); > ci_off = round_down(ci_off, nr); > ci_end = ci_off + nr; > do { > old_tb = __swap_table_get(ci, ci_off); > if (unlikely(swp_tb_is_folio(old_tb) || > !__swp_tb_get_count(old_tb) || > - (memcg_id && 
*memcg_id != lookup_swap_cgroup_id(targ_entry)))) > + (memcg_id && *memcg_id != __swap_cgroup_get(ci, ci_off)))) > return -EBUSY; > - targ_entry.val++; > } while (++ci_off < ci_end); > > return 0; > diff --git a/mm/swap_table.h b/mm/swap_table.h > index 8415ffbe2b9c..b2b02ee161b1 100644 > --- a/mm/swap_table.h > +++ b/mm/swap_table.h > @@ -11,6 +11,11 @@ struct swap_table { > atomic_long_t entries[SWAPFILE_CLUSTER]; > }; > > +/* For storing memcg private id */ > +struct swap_memcg_table { > + unsigned short id[SWAPFILE_CLUSTER]; > +}; > + > #define SWP_TABLE_USE_PAGE (sizeof(struct swap_table) == PAGE_SIZE) > > /* > @@ -247,4 +252,53 @@ static inline unsigned long swap_table_get(struct swap_cluster_info *ci, > > return swp_tb; > } > + > +#ifdef CONFIG_MEMCG > +static inline void __swap_cgroup_set(struct swap_cluster_info *ci, > + unsigned int ci_off, unsigned long nr, unsigned short id) > +{ > + lockdep_assert_held(&ci->lock); > + VM_WARN_ON_ONCE(ci_off >= SWAPFILE_CLUSTER); > + do { > + ci->memcg_table->id[ci_off++] = id; Do you need to check that memcg_table is not NULL here? This helper is no longer private to one file, so another caller might invoke it when the cluster hasn't allocated the memcg_table yet. They shouldn't, but we might want some check here that complains. > + } while (--nr); > +} > + > +static inline unsigned short __swap_cgroup_get(struct swap_cluster_info *ci, > + unsigned int ci_off) > +{ > + lockdep_assert_held(&ci->lock); > + VM_WARN_ON_ONCE(ci_off >= SWAPFILE_CLUSTER); > + return ci->memcg_table->id[ci_off]; Here too. > +} > + > +static inline unsigned short __swap_cgroup_clear(struct swap_cluster_info *ci, > + unsigned int ci_off, > + unsigned long nr) > +{ > + unsigned short old = ci->memcg_table->id[ci_off]; Here as well.
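For all three helpers, maybe something like this at the top (an illustrative sketch, untested):

	/* Catch callers that sneak in before the cluster's memcg table is allocated */
	if (WARN_ON_ONCE(!ci->memcg_table))
		return 0;	/* just "return;" in the void setter */

Chris > + > + __swap_cgroup_set(ci, ci_off, nr, 0); > + return old; > +} > +#else > +static inline void __swap_cgroup_set(struct swap_cluster_info *ci, > + unsigned int ci_off, unsigned long nr, unsigned short id) > +{ > +} > + > +static inline unsigned short __swap_cgroup_get(struct swap_cluster_info *ci, > + unsigned int ci_off) > +{ > + return 0; > +} > + > +static inline unsigned short __swap_cgroup_clear(struct swap_cluster_info *ci, > + unsigned int ci_off, > + unsigned long nr) > +{ > + return 0; > +} > +#endif > + > #endif > diff --git a/mm/swapfile.c b/mm/swapfile.c > index 2d16aa89a4fd..edf4cb36728e 100644 > --- a/mm/swapfile.c > +++ b/mm/swapfile.c > @@ -423,7 +423,12 @@ static void swap_cluster_free_table(struct swap_cluster_info *ci) > { > struct swap_table *table; > > - table = (struct swap_table *)rcu_dereference_protected(ci->table, true); > +#ifdef CONFIG_MEMCG > + kfree(ci->memcg_table); > + ci->memcg_table = NULL; > +#endif > + > + table = (struct swap_table *)rcu_access_pointer(ci->table); > if (!table) > return; > > @@ -441,6 +446,7 @@ static int swap_cluster_alloc_table(struct swap_cluster_info *ci, gfp_t gfp) > { > struct swap_table *table = NULL; > struct folio *folio; > + int ret = 0; > > /* The cluster must be empty and not on any list during allocation.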
*/ > VM_WARN_ON_ONCE(ci->flags || !cluster_is_empty(ci)); > @@ -458,7 +464,17 @@ static int swap_cluster_alloc_table(struct swap_cluster_info *ci, gfp_t gfp) > return -ENOMEM; > > rcu_assign_pointer(ci->table, table); > - return 0; > + > +#ifdef CONFIG_MEMCG > + if (!ci->memcg_table) > + ci->memcg_table = kzalloc(sizeof(*ci->memcg_table), gfp); > + if (!ci->memcg_table) > + ret = -ENOMEM; > +#endif > + if (ret) > + swap_cluster_free_table(ci); > + > + return ret; > } > > /* > @@ -483,6 +499,7 @@ static void swap_cluster_assert_empty(struct swap_cluster_info *ci, > bad_slots++; > else > WARN_ON_ONCE(!swp_tb_is_null(swp_tb)); > + WARN_ON_ONCE(__swap_cgroup_get(ci, ci_off)); > } while (++ci_off < ci_end); > > WARN_ON_ONCE(bad_slots != (swapoff ? ci->count : 0)); > @@ -1860,12 +1877,10 @@ void __swap_cluster_free_entries(struct swap_info_struct *si, > unsigned int ci_start, unsigned int nr_pages) > { > unsigned long old_tb; > - unsigned int type = si->type; > unsigned short id = 0, id_cur; > unsigned int ci_off = ci_start, ci_end = ci_start + nr_pages; > unsigned long offset = cluster_offset(si, ci); > unsigned int ci_batch = ci_off; > - swp_entry_t entry; > > VM_WARN_ON(ci->count < nr_pages); > > @@ -1883,21 +1898,17 @@ void __swap_cluster_free_entries(struct swap_info_struct *si, > * Uncharge swap slots by memcg in batches. Consecutive > * slots with the same cgroup id are uncharged together. > */ > - entry = swp_entry(type, offset + ci_off); > - id_cur = lookup_swap_cgroup_id(entry); > + id_cur = __swap_cgroup_clear(ci, ci_off, 1); > if (id != id_cur) { > if (id) > - mem_cgroup_uncharge_swap(swp_entry(type, offset + ci_batch), > - ci_off - ci_batch); > + mem_cgroup_uncharge_swap(id, ci_off - ci_batch); > id = id_cur; > ci_batch = ci_off; > } > } while (++ci_off < ci_end); > > - if (id) { > - mem_cgroup_uncharge_swap(swp_entry(type, offset + ci_batch), > - ci_off - ci_batch); > - } > + if (id) > + mem_cgroup_uncharge_swap(id, ci_off - ci_batch); > > swap_range_free(si, offset + ci_start, nr_pages); > swap_cluster_assert_empty(ci, ci_start, nr_pages, false); > diff --git a/mm/vmscan.c b/mm/vmscan.c > index 63d06930d8e3..50d87ff58f86 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -739,7 +739,7 @@ static int __remove_mapping(struct address_space *mapping, struct folio *folio, > > if (reclaimed && !mapping_exiting(mapping)) > shadow = workingset_eviction(folio, target_memcg); > - __memcg1_swapout(folio); > + __memcg1_swapout(folio, ci); > __swap_cache_del_folio(ci, folio, swap, shadow); > swap_cluster_unlock_irq(ci); > } else { > > -- > 2.53.0 > > ^ permalink raw reply [flat|nested] 26+ messages in thread
[parent not found: <20260421-swap-table-p4-v3-11-2f23759a76bc@tencent.com>]
* Re: [PATCH v3 11/12] mm/memcg: remove no longer used swap cgroup array [not found] ` <20260421-swap-table-p4-v3-11-2f23759a76bc@tencent.com> @ 2026-05-08 22:47 ` Chris Li 0 siblings, 0 replies; 26+ messages in thread From: Chris Li @ 2026-05-08 22:47 UTC (permalink / raw) To: kasong Cc: linux-mm, Andrew Morton, David Hildenbrand, Zi Yan, Baolin Wang, Barry Song, Hugh Dickins, Kemeng Shi, Nhat Pham, Baoquan He, Johannes Weiner, Youngjun Park, Chengming Zhou, Roman Gushchin, Shakeel Butt, Muchun Song, Qi Zheng, linux-kernel, cgroups, Yosry Ahmed, Lorenzo Stoakes, Dev Jain, Lance Yang, Michal Hocko, Michal Hocko, Suren Baghdasaryan, Axel Rasmussen On Tue, Apr 21, 2026 at 8:17 AM Kairui Song via B4 Relay <devnull+kasong.tencent.com@kernel.org> wrote: > > From: Kairui Song <kasong@tencent.com> > > Now all swap cgroup records are stored in the swap cluster directly, > the static array is no longer needed. > > Signed-off-by: Kairui Song <kasong@tencent.com> > --- > MAINTAINERS | 1 - > include/linux/swap_cgroup.h | 47 ------------ > mm/Makefile | 3 - > mm/internal.h | 1 - > mm/memcontrol-v1.c | 1 - > mm/memcontrol.c | 1 - > mm/swap_cgroup.c | 172 -------------------------------------------- Nice patch stats. Acked-by: Chris Li <chrisl@kernel.org> Chris > mm/swapfile.c | 8 --- > 8 files changed, 234 deletions(-) > > diff --git a/MAINTAINERS b/MAINTAINERS > index 76d8291237be..217d98c89275 100644 > --- a/MAINTAINERS > +++ b/MAINTAINERS > @@ -6565,7 +6565,6 @@ F: mm/memcontrol.c > F: mm/memcontrol-v1.c > F: mm/memcontrol-v1.h > F: mm/page_counter.c > -F: mm/swap_cgroup.c > F: samples/cgroup/* > F: tools/testing/selftests/cgroup/memcg_protection.m > F: tools/testing/selftests/cgroup/test_hugetlb_memcg.c > diff --git a/include/linux/swap_cgroup.h b/include/linux/swap_cgroup.h > deleted file mode 100644 > index 91cdf12190a0..000000000000 > --- a/include/linux/swap_cgroup.h > +++ /dev/null > @@ -1,47 +0,0 @@ > -/* SPDX-License-Identifier: GPL-2.0 */ > -#ifndef __LINUX_SWAP_CGROUP_H > -#define __LINUX_SWAP_CGROUP_H > - > -#include <linux/swap.h> > - > -#if defined(CONFIG_MEMCG) && defined(CONFIG_SWAP) > - > -extern void swap_cgroup_record(struct folio *folio, unsigned short id, swp_entry_t ent); > -extern unsigned short swap_cgroup_clear(swp_entry_t ent, unsigned int nr_ents); > -extern unsigned short lookup_swap_cgroup_id(swp_entry_t ent); > -extern int swap_cgroup_swapon(int type, unsigned long max_pages); > -extern void swap_cgroup_swapoff(int type); > - > -#else > - > -static inline > -void swap_cgroup_record(struct folio *folio, unsigned short id, swp_entry_t ent) > -{ > -} > - > -static inline > -unsigned short swap_cgroup_clear(swp_entry_t ent, unsigned int nr_ents) > -{ > - return 0; > -} > - > -static inline > -unsigned short lookup_swap_cgroup_id(swp_entry_t ent) > -{ > - return 0; > -} > - > -static inline int > -swap_cgroup_swapon(int type, unsigned long max_pages) > -{ > - return 0; > -} > - > -static inline void swap_cgroup_swapoff(int type) > -{ > - return; > -} > - > -#endif > - > -#endif /* __LINUX_SWAP_CGROUP_H */ > diff --git a/mm/Makefile b/mm/Makefile > index 8ad2ab08244e..eff9f9e7e061 100644 > --- a/mm/Makefile > +++ b/mm/Makefile > @@ -103,9 +103,6 @@ obj-$(CONFIG_PAGE_COUNTER) += page_counter.o > obj-$(CONFIG_LIVEUPDATE_MEMFD) += memfd_luo.o > obj-$(CONFIG_MEMCG_V1) += memcontrol-v1.o > obj-$(CONFIG_MEMCG) += memcontrol.o vmpressure.o > -ifdef CONFIG_SWAP > -obj-$(CONFIG_MEMCG) += swap_cgroup.o > -endif > ifdef CONFIG_BPF_SYSCALL > obj-$(CONFIG_MEMCG) += bpf_memcontrol.o > endif 
> diff --git a/mm/internal.h b/mm/internal.h > index 9d2fec696bd6..7646ecb9d621 100644 > --- a/mm/internal.h > +++ b/mm/internal.h > @@ -17,7 +17,6 @@ > #include <linux/rmap.h> > #include <linux/swap.h> > #include <linux/leafops.h> > -#include <linux/swap_cgroup.h> > #include <linux/tracepoint-defs.h> > > /* Internal core VMA manipulation functions. */ > diff --git a/mm/memcontrol-v1.c b/mm/memcontrol-v1.c > index 494e7b9adc60..08be1a752c2e 100644 > --- a/mm/memcontrol-v1.c > +++ b/mm/memcontrol-v1.c > @@ -5,7 +5,6 @@ > #include <linux/mm_inline.h> > #include <linux/pagewalk.h> > #include <linux/backing-dev.h> > -#include <linux/swap_cgroup.h> > #include <linux/eventfd.h> > #include <linux/poll.h> > #include <linux/sort.h> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index 193c8eb73be7..12165fd32529 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -54,7 +54,6 @@ > #include <linux/vmpressure.h> > #include <linux/memremap.h> > #include <linux/mm_inline.h> > -#include <linux/swap_cgroup.h> > #include <linux/cpu.h> > #include <linux/oom.h> > #include <linux/lockdep.h> > diff --git a/mm/swap_cgroup.c b/mm/swap_cgroup.c > deleted file mode 100644 > index de779fed8c21..000000000000 > --- a/mm/swap_cgroup.c > +++ /dev/null > @@ -1,172 +0,0 @@ > -// SPDX-License-Identifier: GPL-2.0 > -#include <linux/swap_cgroup.h> > -#include <linux/vmalloc.h> > -#include <linux/mm.h> > - > -#include <linux/swapops.h> /* depends on mm.h include */ > - > -static DEFINE_MUTEX(swap_cgroup_mutex); > - > -/* Pack two cgroup id (short) of two entries in one swap_cgroup (atomic_t) */ > -#define ID_PER_SC (sizeof(struct swap_cgroup) / sizeof(unsigned short)) > -#define ID_SHIFT (BITS_PER_TYPE(unsigned short)) > -#define ID_MASK (BIT(ID_SHIFT) - 1) > -struct swap_cgroup { > - atomic_t ids; > -}; > - > -struct swap_cgroup_ctrl { > - struct swap_cgroup *map; > -}; > - > -static struct swap_cgroup_ctrl swap_cgroup_ctrl[MAX_SWAPFILES]; > - > -static unsigned short __swap_cgroup_id_lookup(struct swap_cgroup *map, > - pgoff_t offset) > -{ > - unsigned int shift = (offset % ID_PER_SC) * ID_SHIFT; > - unsigned int old_ids = atomic_read(&map[offset / ID_PER_SC].ids); > - > - BUILD_BUG_ON(!is_power_of_2(ID_PER_SC)); > - BUILD_BUG_ON(sizeof(struct swap_cgroup) != sizeof(atomic_t)); > - > - return (old_ids >> shift) & ID_MASK; > -} > - > -static unsigned short __swap_cgroup_id_xchg(struct swap_cgroup *map, > - pgoff_t offset, > - unsigned short new_id) > -{ > - unsigned short old_id; > - struct swap_cgroup *sc = &map[offset / ID_PER_SC]; > - unsigned int shift = (offset % ID_PER_SC) * ID_SHIFT; > - unsigned int new_ids, old_ids = atomic_read(&sc->ids); > - > - do { > - old_id = (old_ids >> shift) & ID_MASK; > - new_ids = (old_ids & ~(ID_MASK << shift)); > - new_ids |= ((unsigned int)new_id) << shift; > - } while (!atomic_try_cmpxchg(&sc->ids, &old_ids, new_ids)); > - > - return old_id; > -} > - > -/** > - * swap_cgroup_record - record mem_cgroup for a set of swap entries. 
> - * These entries must belong to one single folio, and that folio > - * must be being charged for swap space (swap out), and these > - * entries must not have been charged > - * > - * @folio: the folio that the swap entry belongs to > - * @id: mem_cgroup ID to be recorded > - * @ent: the first swap entry to be recorded > - */ > -void swap_cgroup_record(struct folio *folio, unsigned short id, > - swp_entry_t ent) > -{ > - unsigned int nr_ents = folio_nr_pages(folio); > - struct swap_cgroup *map; > - pgoff_t offset, end; > - unsigned short old; > - > - offset = swp_offset(ent); > - end = offset + nr_ents; > - map = swap_cgroup_ctrl[swp_type(ent)].map; > - > - do { > - old = __swap_cgroup_id_xchg(map, offset, id); > - VM_BUG_ON(old); > - } while (++offset != end); > -} > - > -/** > - * swap_cgroup_clear - clear mem_cgroup for a set of swap entries. > - * These entries must be being uncharged from swap. They either > - * belongs to one single folio in the swap cache (swap in for > - * cgroup v1), or no longer have any users (slot freeing). > - * > - * @ent: the first swap entry to be recorded into > - * @nr_ents: number of swap entries to be recorded > - * > - * Returns the existing old value. > - */ > -unsigned short swap_cgroup_clear(swp_entry_t ent, unsigned int nr_ents) > -{ > - pgoff_t offset, end; > - struct swap_cgroup *map; > - unsigned short old, iter = 0; > - > - offset = swp_offset(ent); > - end = offset + nr_ents; > - map = swap_cgroup_ctrl[swp_type(ent)].map; > - > - do { > - old = __swap_cgroup_id_xchg(map, offset, 0); > - if (!iter) > - iter = old; > - VM_BUG_ON(iter != old); > - } while (++offset != end); > - > - return old; > -} > - > -/** > - * lookup_swap_cgroup_id - lookup mem_cgroup id tied to swap entry > - * @ent: swap entry to be looked up. > - * > - * Returns ID of mem_cgroup at success. 0 at failure. 
(0 is invalid ID) > - */ > -unsigned short lookup_swap_cgroup_id(swp_entry_t ent) > -{ > - struct swap_cgroup_ctrl *ctrl; > - > - if (mem_cgroup_disabled()) > - return 0; > - > - ctrl = &swap_cgroup_ctrl[swp_type(ent)]; > - return __swap_cgroup_id_lookup(ctrl->map, swp_offset(ent)); > -} > - > -int swap_cgroup_swapon(int type, unsigned long max_pages) > -{ > - struct swap_cgroup *map; > - struct swap_cgroup_ctrl *ctrl; > - > - if (mem_cgroup_disabled()) > - return 0; > - > - BUILD_BUG_ON(sizeof(unsigned short) * ID_PER_SC != > - sizeof(struct swap_cgroup)); > - map = vzalloc(DIV_ROUND_UP(max_pages, ID_PER_SC) * > - sizeof(struct swap_cgroup)); > - if (!map) > - goto nomem; > - > - ctrl = &swap_cgroup_ctrl[type]; > - mutex_lock(&swap_cgroup_mutex); > - ctrl->map = map; > - mutex_unlock(&swap_cgroup_mutex); > - > - return 0; > -nomem: > - pr_info("couldn't allocate enough memory for swap_cgroup\n"); > - pr_info("swap_cgroup can be disabled by swapaccount=0 boot option\n"); > - return -ENOMEM; > -} > - > -void swap_cgroup_swapoff(int type) > -{ > - struct swap_cgroup *map; > - struct swap_cgroup_ctrl *ctrl; > - > - if (mem_cgroup_disabled()) > - return; > - > - mutex_lock(&swap_cgroup_mutex); > - ctrl = &swap_cgroup_ctrl[type]; > - map = ctrl->map; > - ctrl->map = NULL; > - mutex_unlock(&swap_cgroup_mutex); > - > - vfree(map); > -} > diff --git a/mm/swapfile.c b/mm/swapfile.c > index edf4cb36728e..2172920e68d1 100644 > --- a/mm/swapfile.c > +++ b/mm/swapfile.c > @@ -45,7 +45,6 @@ > > #include <asm/tlbflush.h> > #include <linux/leafops.h> > -#include <linux/swap_cgroup.h> > #include "swap_table.h" > #include "internal.h" > #include "swap.h" > @@ -3136,8 +3135,6 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile) > p->global_cluster = NULL; > kvfree(zeromap); > free_swap_cluster_info(cluster_info, maxpages); > - /* Destroy swap account information */ > - swap_cgroup_swapoff(p->type); > > inode = mapping->host; > > @@ -3668,10 +3665,6 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags) > if (error) > goto bad_swap_unlock_inode; > > - error = swap_cgroup_swapon(si->type, maxpages); > - if (error) > - goto bad_swap_unlock_inode; > - > /* > * Use kvmalloc_array instead of bitmap_zalloc as the allocation order might > * be above MAX_PAGE_ORDER incase of a large swap file. > @@ -3782,7 +3775,6 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags) > si->global_cluster = NULL; > inode = NULL; > destroy_swap_extents(si, swap_file); > - swap_cgroup_swapoff(si->type); > free_swap_cluster_info(si->cluster_info, si->max); > si->cluster_info = NULL; > kvfree(si->zeromap); > > -- > 2.53.0 > > > ^ permalink raw reply [flat|nested] 26+ messages in thread
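A rough back-of-the-envelope check of what this deletion saves (a sketch based on the cover letter's numbers, not new measurements): the dropped swap_cgroup map stored one unsigned short (2 bytes) per swap slot, vzalloc'ed for the whole device at swapon time. For a 1TB swap device with 4KB pages:

        (1 TiB / 4 KiB) slots * 2 B/slot = 268,435,456 * 2 B = 512 MiB

which matches the ~512MB saving reported in the cover letter. After this series the cgroup records live in the per-cluster tables, which are only allocated for clusters actually in use.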
* Re: [PATCH v3 12/12] mm, swap: merge zeromap into swap table [not found] ` <20260421-swap-table-p4-v3-12-2f23759a76bc@tencent.com> @ 2026-05-11 16:30 ` Chris Li 0 siblings, 0 replies; 26+ messages in thread From: Chris Li @ 2026-05-11 16:30 UTC (permalink / raw) To: kasong Cc: linux-mm, Andrew Morton, David Hildenbrand, Zi Yan, Baolin Wang, Barry Song, Hugh Dickins, Kemeng Shi, Nhat Pham, Baoquan He, Johannes Weiner, Youngjun Park, Chengming Zhou, Roman Gushchin, Shakeel Butt, Muchun Song, Qi Zheng, linux-kernel, cgroups, Yosry Ahmed, Lorenzo Stoakes, Dev Jain, Lance Yang, Michal Hocko, Michal Hocko, Suren Baghdasaryan, Axel Rasmussen On Tue, Apr 21, 2026 at 8:16 AM Kairui Song via B4 Relay <devnull+kasong.tencent.com@kernel.org> wrote: > > From: Kairui Song <kasong@tencent.com> > > By allocating one additional bit in the swap table entry's flags field > alongside the count, we can store the zeromap inline. > > For certain 32-bit archs, there might not be enough bits in the swap > table to contain both PFN and flags. Therefore, conditionally let each > cluster have a zeromap field at build time, and use that instead of the > swap table for these archs. A few macros were moved to different headers > for build time struct definition. It might be worthwhile to mention the user-visible impact. On 64-bit systems, the zeromap is stored in the swap table, avoiding the zeromap allocation entirely, which reduces allocated memory. That is the happy path. On certain 32-bit architectures, if a swapfile cluster is not fully used, it will still use less memory for the zeromap: an empty cluster does not allocate one, so we still save memory. In the worst case, when all clusters are fully populated, memory use is similar to the previous zeromap implementation. > > Signed-off-by: Kairui Song <kasong@tencent.com> Acked-by: Chris Li <chrisl@kernel.org> > --- > include/linux/swap.h | 1 - > mm/memory.c | 11 +---- > mm/page_io.c | 58 ++++++++++++++++++++++---- > mm/swap.h | 51 +++++++++-------------- > mm/swap_state.c | 14 ++++--- > mm/swap_table.h | 115 +++++++++++++++++++++++++++++++++++++-------------- > mm/swapfile.c | 45 +++++++++----------- > 7 files changed, 184 insertions(+), 111 deletions(-) > > diff --git a/include/linux/swap.h b/include/linux/swap.h > index 57af4647d432..8f0f68e245ba 100644 > --- a/include/linux/swap.h > +++ b/include/linux/swap.h > @@ -253,7 +253,6 @@ struct swap_info_struct { > struct plist_node list; /* entry in swap_active_head */ > signed char type; /* strange name for an index */ > unsigned int max; /* size of this swap device */ > - unsigned long *zeromap; /* kvmalloc'ed bitmap to track zero pages */ Nice. > struct swap_cluster_info *cluster_info; /* cluster info. Only for SSD */ > struct list_head free_clusters; /* free clusters list */ > struct list_head full_clusters; /* full clusters list */ > diff --git a/mm/memory.c b/mm/memory.c > index 404734a5bcff..a45905f8728f 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -4595,13 +4595,11 @@ static vm_fault_t handle_pte_marker(struct vm_fault *vmf) > > #ifdef CONFIG_TRANSPARENT_HUGEPAGE > /* > - * Check if the PTEs within a range are contiguous swap entries > - * and have consistent swapcache, zeromap. > + * Check if the PTEs within a range are contiguous swap entries.
> */ > static bool can_swapin_thp(struct vm_fault *vmf, pte_t *ptep, int nr_pages) > { > unsigned long addr; > - softleaf_t entry; > int idx; > pte_t pte; > > @@ -4611,18 +4609,13 @@ static bool can_swapin_thp(struct vm_fault *vmf, pte_t *ptep, int nr_pages) > > if (!pte_same(pte, pte_move_swp_offset(vmf->orig_pte, -idx))) > return false; > - entry = softleaf_from_pte(pte); > - if (swap_pte_batch(ptep, nr_pages, pte) != nr_pages) > - return false; > - > /* > * swap_read_folio() can't handle the case a large folio is hybridly > * from different backends. And they are likely corner cases. Similar > * things might be added once zswap support large folios. > */ > - if (unlikely(swap_zeromap_batch(entry, nr_pages, NULL) != nr_pages)) > + if (swap_pte_batch(ptep, nr_pages, pte) != nr_pages) > return false; > - > return true; > } > > diff --git a/mm/page_io.c b/mm/page_io.c > index 70cea9e24d2f..c2557e72c381 100644 > --- a/mm/page_io.c > +++ b/mm/page_io.c > @@ -26,6 +26,7 @@ > #include <linux/delayacct.h> > #include <linux/zswap.h> > #include "swap.h" > +#include "swap_table.h" > > static void __end_swap_bio_write(struct bio *bio) > { > @@ -204,15 +205,20 @@ static bool is_folio_zero_filled(struct folio *folio) > static void swap_zeromap_folio_set(struct folio *folio) > { > struct obj_cgroup *objcg = get_obj_cgroup_from_folio(folio); > - struct swap_info_struct *sis = __swap_entry_to_info(folio->swap); > int nr_pages = folio_nr_pages(folio); > + struct swap_cluster_info *ci; > swp_entry_t entry; > unsigned int i; > > + VM_WARN_ON_ONCE_FOLIO(!folio_test_swapcache(folio), folio); > + VM_WARN_ON_ONCE_FOLIO(!folio_test_locked(folio), folio); > + > + ci = swap_cluster_get_and_lock(folio); > for (i = 0; i < folio_nr_pages(folio); i++) { > entry = page_swap_entry(folio_page(folio, i)); > - set_bit(swp_offset(entry), sis->zeromap); > + __swap_table_set_zero(ci, swp_cluster_offset(entry)); > } > + swap_cluster_unlock(ci); > > count_vm_events(SWPOUT_ZERO, nr_pages); > if (objcg) { > @@ -223,14 +229,19 @@ static void swap_zeromap_folio_set(struct folio *folio) > > static void swap_zeromap_folio_clear(struct folio *folio) > { > - struct swap_info_struct *sis = __swap_entry_to_info(folio->swap); > + struct swap_cluster_info *ci; > swp_entry_t entry; > unsigned int i; > > + VM_WARN_ON_ONCE_FOLIO(!folio_test_swapcache(folio), folio); > + VM_WARN_ON_ONCE_FOLIO(!folio_test_locked(folio), folio); > + > + ci = swap_cluster_get_and_lock(folio); > for (i = 0; i < folio_nr_pages(folio); i++) { > entry = page_swap_entry(folio_page(folio, i)); > - clear_bit(swp_offset(entry), sis->zeromap); > + __swap_table_clear_zero(ci, swp_cluster_offset(entry)); > } > + swap_cluster_unlock(ci); > } > > /* > @@ -255,10 +266,9 @@ int swap_writeout(struct folio *folio, struct swap_iocb **swap_plug) > } > > /* > - * Use a bitmap (zeromap) to avoid doing IO for zero-filled pages. > - * The bits in zeromap are protected by the locked swapcache folio > - * and atomic updates are used to protect against read-modify-write > - * corruption due to other zero swap entries seeing concurrent updates. > + * Use the swap table zero mark to avoid doing IO for zero-filled > + * pages. The zero mark is protected by the cluster lock, which is > + * acquired internally by swap_zeromap_folio_set/clear. 
> */ > if (is_folio_zero_filled(folio)) { > swap_zeromap_folio_set(folio); > @@ -509,16 +519,48 @@ static void sio_read_complete(struct kiocb *iocb, long ret) > mempool_free(sio, sio_pool); > } > > +/* > + * Return the count of contiguous swap entries that share the same > + * zeromap status as the starting entry. If is_zerop is not NULL, > + * it will return the zeromap status of the starting entry. > + * > + * Context: Caller must ensure the cluster containing the entries > + * that will be checked won't be freed. > + */ > +static int swap_zeromap_batch(swp_entry_t entry, int max_nr, > + bool *is_zerop) > +{ > + bool is_zero; > + struct swap_cluster_info *ci = __swap_entry_to_cluster(entry); > + unsigned int ci_start = swp_cluster_offset(entry), ci_off, ci_end; > + > + ci_off = ci_start; > + ci_end = ci_off + max_nr; Should we check that ci_end is less than the cluster's end and complain if not? It seems using a for loop would be simpler; the loop index serves as a counter as well. Totally untested code:

        int i;

        rcu_read_lock();
        is_zero = __swap_table_test_zero(ci, ci_start);
        for (i = 1; i < max_nr; i++)
                if (is_zero != __swap_table_test_zero(ci, ci_start + i))
                        break;
        rcu_read_unlock();
        if (is_zerop)
                *is_zerop = is_zero;
        return i;

Chris > + break; > + rcu_read_lock(); > + is_zero = __swap_table_test_zero(ci, ci_off); > + if (is_zerop) > + *is_zerop = is_zero; > + while (++ci_off < ci_end) { > + if (is_zero != __swap_table_test_zero(ci, ci_off)) > + break; > + } > + rcu_read_unlock(); > + return ci_off - ci_start; > +} > + > static bool swap_read_folio_zeromap(struct folio *folio) > { > int nr_pages = folio_nr_pages(folio); > struct obj_cgroup *objcg; > bool is_zeromap; > > + VM_WARN_ON_ONCE_FOLIO(!folio_test_locked(folio), folio); > + > /* > * Swapping in a large folio that is partially in the zeromap is not > * currently handled. Return true without marking the folio uptodate so > * that an IO error is emitted (e.g. do_swap_page() will sigbus). > + * Folio lock stabilizes the cluster and map, so the check is safe.
> */ > if (WARN_ON_ONCE(swap_zeromap_batch(folio->swap, nr_pages, > &is_zeromap) != nr_pages)) > diff --git a/mm/swap.h b/mm/swap.h > index e4ac7dbc1080..025ff4f0b021 100644 > --- a/mm/swap.h > +++ b/mm/swap.h > @@ -3,12 +3,29 @@ > #define _MM_SWAP_H > > #include <linux/atomic.h> /* for atomic_long_t */ > +#include <linux/mm.h> /* for PAGE_SHIFT */ > struct mempolicy; > struct swap_iocb; > struct swap_memcg_table; > > extern int page_cluster; > > +#if defined(MAX_POSSIBLE_PHYSMEM_BITS) > +#define SWAP_CACHE_PFN_BITS (MAX_POSSIBLE_PHYSMEM_BITS - PAGE_SHIFT) > +#elif defined(MAX_PHYSMEM_BITS) > +#define SWAP_CACHE_PFN_BITS (MAX_PHYSMEM_BITS - PAGE_SHIFT) > +#else > +#define SWAP_CACHE_PFN_BITS (BITS_PER_LONG - PAGE_SHIFT) > +#endif > + > +/* Swap table marker, 0x1 means shadow, 0x2 means PFN (SWP_TB_PFN_MARK) */ > +#define SWAP_CACHE_PFN_MARK_BITS 2 > +/* At least 2 bits are needed to distinguish SWP_TB_COUNT_MAX, 1 and 0 */ > +#define SWAP_COUNT_MIN_BITS 2 > +/* If there are enough bits besides PFN and marker, store zero flag inline */ > +#define SWAP_TABLE_HAS_ZEROFLAG ((BITS_PER_LONG - SWAP_CACHE_PFN_MARK_BITS - \ > + SWAP_CACHE_PFN_BITS) > SWAP_COUNT_MIN_BITS) > + > #ifdef CONFIG_THP_SWAP > #define SWAPFILE_CLUSTER HPAGE_PMD_NR > #define swap_entry_order(order) (order) > @@ -41,6 +58,9 @@ struct swap_cluster_info { > unsigned int *extend_table; /* For large swap count, protected by ci->lock */ > #ifdef CONFIG_MEMCG > struct swap_memcg_table *memcg_table; /* Swap table entries' cgroup record */ > +#endif > +#if !SWAP_TABLE_HAS_ZEROFLAG > + unsigned long *zero_bitmap; > #endif > struct list_head list; > }; > @@ -314,31 +334,6 @@ static inline unsigned int folio_swap_flags(struct folio *folio) > return __swap_entry_to_info(folio->swap)->flags; > } > > -/* > - * Return the count of contiguous swap entries that share the same > - * zeromap status as the starting entry. If is_zeromap is not NULL, > - * it will return the zeromap status of the starting entry. 
> - */ > -static inline int swap_zeromap_batch(swp_entry_t entry, int max_nr, > - bool *is_zeromap) > -{ > - struct swap_info_struct *sis = __swap_entry_to_info(entry); > - unsigned long start = swp_offset(entry); > - unsigned long end = start + max_nr; > - bool first_bit; > - > - first_bit = test_bit(start, sis->zeromap); > - if (is_zeromap) > - *is_zeromap = first_bit; > - > - if (max_nr <= 1) > - return max_nr; > - if (first_bit) > - return find_next_zero_bit(sis->zeromap, end, start) - start; > - else > - return find_next_bit(sis->zeromap, end, start) - start; > -} > - > #else /* CONFIG_SWAP */ > struct swap_iocb; > static inline struct swap_cluster_info *swap_cluster_lock( > @@ -476,11 +471,5 @@ static inline unsigned int folio_swap_flags(struct folio *folio) > { > return 0; > } > - > -static inline int swap_zeromap_batch(swp_entry_t entry, int max_nr, > - bool *has_zeromap) > -{ > - return 0; > -} > #endif /* CONFIG_SWAP */ > #endif /* _MM_SWAP_H */ > diff --git a/mm/swap_state.c b/mm/swap_state.c > index 71a3f128fcf0..fa4ef9f4a1d3 100644 > --- a/mm/swap_state.c > +++ b/mm/swap_state.c > @@ -159,6 +159,7 @@ static int __swap_cache_add_check(struct swap_cluster_info *ci, > { > unsigned int ci_off, ci_end; > unsigned long old_tb; > + bool is_zero; > > /* > * If the target slot is not swapped out, return > @@ -181,12 +182,14 @@ static int __swap_cache_add_check(struct swap_cluster_info *ci, > if (nr == 1) > return 0; > > + is_zero = __swap_table_test_zero(ci, ci_off); > ci_off = round_down(ci_off, nr); > ci_end = ci_off + nr; > do { > old_tb = __swap_table_get(ci, ci_off); > if (unlikely(swp_tb_is_folio(old_tb) || > !__swp_tb_get_count(old_tb) || > + is_zero != __swap_table_test_zero(ci, ci_off) || > (memcg_id && *memcg_id != __swap_cgroup_get(ci, ci_off)))) > return -EBUSY; > } while (++ci_off < ci_end); > @@ -210,7 +213,7 @@ static void __swap_cache_do_add_folio(struct swap_cluster_info *ci, > do { > old_tb = __swap_table_get(ci, ci_off); > VM_WARN_ON_ONCE(swp_tb_is_folio(old_tb)); > - __swap_table_set(ci, ci_off, pfn_to_swp_tb(pfn, __swp_tb_get_count(old_tb))); > + __swap_table_set(ci, ci_off, pfn_to_swp_tb(pfn, __swp_tb_get_flags(old_tb))); > } while (++ci_off < ci_end); > > folio_ref_add(folio, nr_pages); > @@ -246,7 +249,6 @@ static void __swap_cache_do_del_folio(struct swap_cluster_info *ci, > struct folio *folio, > swp_entry_t entry, void *shadow) > { > - int count; > unsigned long old_tb; > struct swap_info_struct *si; > unsigned int ci_start, ci_off, ci_end; > @@ -266,13 +268,13 @@ static void __swap_cache_do_del_folio(struct swap_cluster_info *ci, > old_tb = __swap_table_get(ci, ci_off); > WARN_ON_ONCE(!swp_tb_is_folio(old_tb) || > swp_tb_to_folio(old_tb) != folio); > - count = __swp_tb_get_count(old_tb); > - if (count) > + if (__swp_tb_get_count(old_tb)) > folio_swapped = true; > else > need_free = true; > /* If shadow is NULL, we set an empty shadow. 
*/ > - __swap_table_set(ci, ci_off, shadow_to_swp_tb(shadow, count)); > + __swap_table_set(ci, ci_off, shadow_to_swp_tb(shadow, > + __swp_tb_get_flags(old_tb))); > } while (++ci_off < ci_end); > > folio->swap.val = 0; > @@ -366,7 +368,7 @@ void __swap_cache_replace_folio(struct swap_cluster_info *ci, > do { > old_tb = __swap_table_get(ci, ci_off); > WARN_ON_ONCE(!swp_tb_is_folio(old_tb) || swp_tb_to_folio(old_tb) != old); > - __swap_table_set(ci, ci_off, pfn_to_swp_tb(pfn, __swp_tb_get_count(old_tb))); > + __swap_table_set(ci, ci_off, pfn_to_swp_tb(pfn, __swp_tb_get_flags(old_tb))); > } while (++ci_off < ci_end); > > /* > diff --git a/mm/swap_table.h b/mm/swap_table.h > index b2b02ee161b1..6cf1575eb26e 100644 > --- a/mm/swap_table.h > +++ b/mm/swap_table.h > @@ -26,12 +26,14 @@ struct swap_memcg_table { > * Swap table entry type and bits layouts: > * > * NULL: |---------------- 0 ---------------| - Free slot > - * Shadow: | SWAP_COUNT |---- SHADOW_VAL ---|1| - Swapped out slot > - * PFN: | SWAP_COUNT |------ PFN -------|10| - Cached slot > + * Shadow: |SWAP_COUNT|Z|---- SHADOW_VAL ---|1| - Swapped out slot > + * PFN: |SWAP_COUNT|Z|------ PFN -------|10| - Cached slot > * Pointer: |----------- Pointer ----------|100| - (Unused) > * Bad: |------------- 1 -------------|1000| - Bad slot > * > - * SWAP_COUNT is `SWP_TB_COUNT_BITS` long, each entry is an atomic long. > + * COUNT is `SWP_TB_COUNT_BITS` long, Z is the `SWP_TB_ZERO_FLAG` bit, > + * and together they form the `SWP_TB_FLAGS_BITS` wide flags field. > + * Each entry is an atomic long. > * > * Usages: > * > @@ -54,14 +56,6 @@ struct swap_memcg_table { > * - Bad: Swap slot is reserved, protects swap header or holes on swap devices. > */ > > -#if defined(MAX_POSSIBLE_PHYSMEM_BITS) > -#define SWAP_CACHE_PFN_BITS (MAX_POSSIBLE_PHYSMEM_BITS - PAGE_SHIFT) > -#elif defined(MAX_PHYSMEM_BITS) > -#define SWAP_CACHE_PFN_BITS (MAX_PHYSMEM_BITS - PAGE_SHIFT) > -#else > -#define SWAP_CACHE_PFN_BITS (BITS_PER_LONG - PAGE_SHIFT) > -#endif > - > /* NULL Entry, all 0 */ > #define SWP_TB_NULL 0UL > > @@ -69,22 +63,26 @@ struct swap_memcg_table { > #define SWP_TB_SHADOW_MARK 0b1UL > > /* Cached: PFN */ > -#define SWP_TB_PFN_BITS (SWAP_CACHE_PFN_BITS + SWP_TB_PFN_MARK_BITS) > +#define SWP_TB_PFN_BITS (SWAP_CACHE_PFN_BITS + SWAP_CACHE_PFN_MARK_BITS) > #define SWP_TB_PFN_MARK 0b10UL > -#define SWP_TB_PFN_MARK_BITS 2 > -#define SWP_TB_PFN_MARK_MASK (BIT(SWP_TB_PFN_MARK_BITS) - 1) > +#define SWP_TB_PFN_MARK_MASK (BIT(SWAP_CACHE_PFN_MARK_BITS) - 1) > > -/* SWAP_COUNT part for PFN or shadow, the width can be shrunk or extended */ > -#define SWP_TB_COUNT_BITS min(4, BITS_PER_LONG - SWP_TB_PFN_BITS) > +/* Flags: For PFN or shadow, contains SWAP_COUNT, width changes */ > +#define SWP_TB_FLAGS_BITS min(5, BITS_PER_LONG - SWP_TB_PFN_BITS) > +#define SWP_TB_COUNT_BITS (SWP_TB_FLAGS_BITS - SWAP_TABLE_HAS_ZEROFLAG) > +#define SWP_TB_FLAGS_MASK (~((~0UL) >> SWP_TB_FLAGS_BITS)) > #define SWP_TB_COUNT_MASK (~((~0UL) >> SWP_TB_COUNT_BITS)) > +#define SWP_TB_FLAGS_SHIFT (BITS_PER_LONG - SWP_TB_FLAGS_BITS) > #define SWP_TB_COUNT_SHIFT (BITS_PER_LONG - SWP_TB_COUNT_BITS) > #define SWP_TB_COUNT_MAX ((1 << SWP_TB_COUNT_BITS) - 1) > +/* The first flag is zero bit (SWAP_TABLE_HAS_ZEROFLAG) */ > +#define SWP_TB_ZERO_FLAG BIT(BITS_PER_LONG - SWP_TB_FLAGS_BITS) > > /* Bad slot: ends with 0b1000 and rests of bits are all 1 */ > #define SWP_TB_BAD ((~0UL) << 3) > > /* Macro for shadow offset calculation */ > -#define SWAP_COUNT_SHIFT SWP_TB_COUNT_BITS > +#define SWAP_COUNT_SHIFT 
SWP_TB_FLAGS_BITS > > /* > * Helpers for casting one type of info into a swap table entry. > @@ -102,40 +100,47 @@ static inline unsigned long __count_to_swp_tb(unsigned char count) > * used (count > 0 && count < SWP_TB_COUNT_MAX), and > * overflow (count == SWP_TB_COUNT_MAX). > */ > - BUILD_BUG_ON(SWP_TB_COUNT_MAX < 2 || SWP_TB_COUNT_BITS < 2); > + BUILD_BUG_ON(SWP_TB_COUNT_BITS < SWAP_COUNT_MIN_BITS); > VM_WARN_ON(count > SWP_TB_COUNT_MAX); > return ((unsigned long)count) << SWP_TB_COUNT_SHIFT; > } > > -static inline unsigned long pfn_to_swp_tb(unsigned long pfn, unsigned int count) > +static inline unsigned long __flags_to_swp_tb(unsigned char flags) > +{ > + BUILD_BUG_ON(SWP_TB_FLAGS_BITS > BITS_PER_BYTE); > + VM_WARN_ON(flags >> SWP_TB_FLAGS_BITS); > + return ((unsigned long)flags) << SWP_TB_FLAGS_SHIFT; > +} > + > +static inline unsigned long pfn_to_swp_tb(unsigned long pfn, unsigned char flags) > { > unsigned long swp_tb; > > BUILD_BUG_ON(sizeof(unsigned long) != sizeof(void *)); > BUILD_BUG_ON(SWAP_CACHE_PFN_BITS > > - (BITS_PER_LONG - SWP_TB_PFN_MARK_BITS - SWP_TB_COUNT_BITS)); > + (BITS_PER_LONG - SWAP_CACHE_PFN_MARK_BITS - SWP_TB_FLAGS_BITS)); > > - swp_tb = (pfn << SWP_TB_PFN_MARK_BITS) | SWP_TB_PFN_MARK; > - VM_WARN_ON_ONCE(swp_tb & SWP_TB_COUNT_MASK); > + swp_tb = (pfn << SWAP_CACHE_PFN_MARK_BITS) | SWP_TB_PFN_MARK; > + VM_WARN_ON_ONCE(swp_tb & SWP_TB_FLAGS_MASK); > > - return swp_tb | __count_to_swp_tb(count); > + return swp_tb | __flags_to_swp_tb(flags); > } > > -static inline unsigned long folio_to_swp_tb(struct folio *folio, unsigned int count) > +static inline unsigned long folio_to_swp_tb(struct folio *folio, unsigned char flags) > { > - return pfn_to_swp_tb(folio_pfn(folio), count); > + return pfn_to_swp_tb(folio_pfn(folio), flags); > } > > -static inline unsigned long shadow_to_swp_tb(void *shadow, unsigned int count) > +static inline unsigned long shadow_to_swp_tb(void *shadow, unsigned char flags) > { > BUILD_BUG_ON((BITS_PER_XA_VALUE + 1) != > BITS_PER_BYTE * sizeof(unsigned long)); > BUILD_BUG_ON((unsigned long)xa_mk_value(0) != SWP_TB_SHADOW_MARK); > > VM_WARN_ON_ONCE(shadow && !xa_is_value(shadow)); > - VM_WARN_ON_ONCE(shadow && ((unsigned long)shadow & SWP_TB_COUNT_MASK)); > + VM_WARN_ON_ONCE(shadow && ((unsigned long)shadow & SWP_TB_FLAGS_MASK)); > > - return (unsigned long)shadow | __count_to_swp_tb(count) | SWP_TB_SHADOW_MARK; > + return (unsigned long)shadow | SWP_TB_SHADOW_MARK | __flags_to_swp_tb(flags); > } > > /* > @@ -173,14 +178,14 @@ static inline bool swp_tb_is_countable(unsigned long swp_tb) > static inline struct folio *swp_tb_to_folio(unsigned long swp_tb) > { > VM_WARN_ON(!swp_tb_is_folio(swp_tb)); > - return pfn_folio((swp_tb & ~SWP_TB_COUNT_MASK) >> SWP_TB_PFN_MARK_BITS); > + return pfn_folio((swp_tb & ~SWP_TB_FLAGS_MASK) >> SWAP_CACHE_PFN_MARK_BITS); > } > > static inline void *swp_tb_to_shadow(unsigned long swp_tb) > { > VM_WARN_ON(!swp_tb_is_shadow(swp_tb)); > /* No shift needed, xa_value is stored as it is in the lower bits. 
*/ > - return (void *)(swp_tb & ~SWP_TB_COUNT_MASK); > + return (void *)(swp_tb & ~SWP_TB_FLAGS_MASK); > } > > static inline unsigned char __swp_tb_get_count(unsigned long swp_tb) > @@ -189,6 +194,12 @@ static inline unsigned char __swp_tb_get_count(unsigned long swp_tb) > return ((swp_tb & SWP_TB_COUNT_MASK) >> SWP_TB_COUNT_SHIFT); > } > > +static inline unsigned char __swp_tb_get_flags(unsigned long swp_tb) > +{ > + VM_WARN_ON(!swp_tb_is_countable(swp_tb)); > + return ((swp_tb & SWP_TB_FLAGS_MASK) >> SWP_TB_FLAGS_SHIFT); > +} > + > static inline int swp_tb_get_count(unsigned long swp_tb) > { > if (swp_tb_is_countable(swp_tb)) > @@ -253,6 +264,50 @@ static inline unsigned long swap_table_get(struct swap_cluster_info *ci, > return swp_tb; > } > > +static inline void __swap_table_set_zero(struct swap_cluster_info *ci, > + unsigned int ci_off) > +{ > +#if SWAP_TABLE_HAS_ZEROFLAG > + unsigned long swp_tb = __swap_table_get(ci, ci_off); > + > + BUILD_BUG_ON(SWP_TB_ZERO_FLAG & ~SWP_TB_FLAGS_MASK); > + VM_WARN_ON(!swp_tb_is_countable(swp_tb)); > + swp_tb |= SWP_TB_ZERO_FLAG; > + __swap_table_set(ci, ci_off, swp_tb); > +#else > + __set_bit(ci_off, ci->zero_bitmap); > +#endif > +} > + > +static inline bool __swap_table_test_zero(struct swap_cluster_info *ci, > + unsigned int ci_off) > +{ > +#if SWAP_TABLE_HAS_ZEROFLAG > + unsigned long swp_tb = __swap_table_get(ci, ci_off); > + > + VM_WARN_ON(!swp_tb_is_countable(swp_tb)); > + return !!(swp_tb & SWP_TB_ZERO_FLAG); > +#else > + return test_bit(ci_off, ci->zero_bitmap); > +#endif > +} > + > +static inline void __swap_table_clear_zero(struct swap_cluster_info *ci, > + unsigned int ci_off) > +{ > + > +#if SWAP_TABLE_HAS_ZEROFLAG > + unsigned long swp_tb = __swap_table_get(ci, ci_off); > + > + VM_WARN_ON(!swp_tb_is_countable(swp_tb)); > + swp_tb &= ~SWP_TB_ZERO_FLAG; > + __swap_table_set(ci, ci_off, swp_tb); > +#else > + lockdep_assert_held(&ci->lock); > + __clear_bit(ci_off, ci->zero_bitmap); > +#endif > +} > + > #ifdef CONFIG_MEMCG > static inline void __swap_cgroup_set(struct swap_cluster_info *ci, > unsigned int ci_off, unsigned long nr, unsigned short id) > diff --git a/mm/swapfile.c b/mm/swapfile.c > index 2172920e68d1..287d5807b8f7 100644 > --- a/mm/swapfile.c > +++ b/mm/swapfile.c > @@ -427,6 +427,11 @@ static void swap_cluster_free_table(struct swap_cluster_info *ci) > ci->memcg_table = NULL; > #endif > > +#if !SWAP_TABLE_HAS_ZEROFLAG > + kfree(ci->zero_bitmap); > + ci->zero_bitmap = NULL; > +#endif > + > table = (struct swap_table *)rcu_access_pointer(ci->table); > if (!table) > return; > @@ -470,6 +475,13 @@ static int swap_cluster_alloc_table(struct swap_cluster_info *ci, gfp_t gfp) > if (!ci->memcg_table) > ret = -ENOMEM; > #endif > + > +#if !SWAP_TABLE_HAS_ZEROFLAG > + ci->zero_bitmap = bitmap_zalloc(SWAPFILE_CLUSTER, gfp); > + if (!ci->zero_bitmap) > + ret = -ENOMEM; > +#endif > + > if (ret) > swap_cluster_free_table(ci); > > @@ -926,8 +938,8 @@ static bool __swap_cluster_alloc_entries(struct swap_info_struct *si, > order = 0; > nr_pages = 1; > swap_cluster_assert_empty(ci, ci_off, 1, false); > - /* Sets a fake shadow as placeholder */ > - __swap_table_set(ci, ci_off, shadow_to_swp_tb(NULL, 1)); > + /* Fake shadow placeholder with no flag, hibernation does not use the zeromap */ > + __swap_table_set(ci, ci_off, __swp_tb_mk_count(shadow_to_swp_tb(NULL, 0), 1)); > } else { > /* Allocation without folio is only possible with hibernation */ > WARN_ON_ONCE(1); > @@ -1299,14 +1311,8 @@ static void swap_range_free(struct swap_info_struct *si, 
unsigned long offset, > void (*swap_slot_free_notify)(struct block_device *, unsigned long); > unsigned int i; > > - /* > - * Use atomic clear_bit operations only on zeromap instead of non-atomic > - * bitmap_clear to prevent adjacent bits corruption due to simultaneous writes. > - */ > - for (i = 0; i < nr_entries; i++) { > - clear_bit(offset + i, si->zeromap); > + for (i = 0; i < nr_entries; i++) > zswap_invalidate(swp_entry(si->type, offset + i)); > - } > > if (si->flags & SWP_BLKDEV) > swap_slot_free_notify = > @@ -1891,7 +1897,11 @@ void __swap_cluster_free_entries(struct swap_info_struct *si, > * ref, or after swap cache is dropped > */ > VM_WARN_ON(!swp_tb_is_shadow(old_tb) || __swp_tb_get_count(old_tb) > 1); > + > + /* Resetting the slot to NULL also clears the inline flags. */ > __swap_table_set(ci, ci_off, null_to_swp_tb()); > + if (!SWAP_TABLE_HAS_ZEROFLAG) > + __swap_table_clear_zero(ci, ci_off); > > /* > * Uncharge swap slots by memcg in batches. Consecutive > @@ -3024,7 +3034,6 @@ static void flush_percpu_swap_cluster(struct swap_info_struct *si) > SYSCALL_DEFINE1(swapoff, const char __user *, specialfile) > { > struct swap_info_struct *p = NULL; > - unsigned long *zeromap; > struct swap_cluster_info *cluster_info; > struct file *swap_file, *victim; > struct address_space *mapping; > @@ -3120,8 +3129,6 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile) > > swap_file = p->swap_file; > p->swap_file = NULL; > - zeromap = p->zeromap; > - p->zeromap = NULL; > maxpages = p->max; > cluster_info = p->cluster_info; > p->max = 0; > @@ -3133,7 +3140,6 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile) > mutex_unlock(&swapon_mutex); > kfree(p->global_cluster); > p->global_cluster = NULL; > - kvfree(zeromap); > free_swap_cluster_info(cluster_info, maxpages); > > inode = mapping->host; > @@ -3665,17 +3671,6 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags) > if (error) > goto bad_swap_unlock_inode; > > - /* > - * Use kvmalloc_array instead of bitmap_zalloc as the allocation order might > - * be above MAX_PAGE_ORDER incase of a large swap file. > - */ > - si->zeromap = kvmalloc_array(BITS_TO_LONGS(maxpages), sizeof(long), > - GFP_KERNEL | __GFP_ZERO); > - if (!si->zeromap) { > - error = -ENOMEM; > - goto bad_swap_unlock_inode; > - } > - > if (si->bdev && bdev_stable_writes(si->bdev)) > si->flags |= SWP_STABLE_WRITES; > > @@ -3777,8 +3772,6 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags) > destroy_swap_extents(si, swap_file); > free_swap_cluster_info(si->cluster_info, si->max); > si->cluster_info = NULL; > - kvfree(si->zeromap); > - si->zeromap = NULL; > /* > * Clear the SWP_USED flag after all resources are freed so > * alloc_swap_info can reuse this si safely. > > -- > 2.53.0 > > ^ permalink raw reply [flat|nested] 26+ messages in thread
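For readers wondering when SWAP_TABLE_HAS_ZEROFLAG can end up false, the macro arithmetic from the patch can be worked through by hand. The two configurations below are illustrative assumptions, not an exhaustive list:

        64-bit, 46-bit physical addressing, 4K pages:
                SWAP_CACHE_PFN_BITS = 46 - 12 = 34
                spare bits = 64 - 2 (PFN mark) - 34 = 28 > SWAP_COUNT_MIN_BITS (2)
                -> the zero flag fits inline: SWP_TB_FLAGS_BITS = min(5, 64 - 36) = 5,
                   giving a 4-bit count plus the 1-bit zero flag per entry.

        32-bit, 40-bit physical addressing (LPAE-style), 4K pages:
                SWAP_CACHE_PFN_BITS = 40 - 12 = 28
                spare bits = 32 - 2 - 28 = 2, which is not > 2
                -> no inline flag; each cluster instead carries a lazily
                   allocated zero_bitmap with one bit per slot, so empty
                   clusters cost nothing.

So 64-bit systems always take the inline path, and only the tightest 32-bit layouts fall back to the per-cluster bitmap.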
* Re: [PATCH v3 00/12] mm, swap: swap table phase IV: unify allocation and reduce static metadata [not found] <20260421-swap-table-p4-v3-0-2f23759a76bc@tencent.com> ` (10 preceding siblings ...) [not found] ` <20260421-swap-table-p4-v3-12-2f23759a76bc@tencent.com> @ 2026-05-11 16:34 ` Chris Li [not found] ` <CAMgjq7CJ8Are6m7X2UxUoJ=77c_oSpdG8-bzkmdRzwey2Cp1gQ@mail.gmail.com> [not found] ` <20260421-swap-table-p4-v3-4-2f23759a76bc@tencent.com> 13 siblings, 0 replies; 26+ messages in thread From: Chris Li @ 2026-05-11 16:34 UTC (permalink / raw) To: Andrew Morton Cc: kasong, linux-mm, David Hildenbrand, Zi Yan, Baolin Wang, Barry Song, Hugh Dickins, Kemeng Shi, Nhat Pham, Baoquan He, Johannes Weiner, Youngjun Park, Chengming Zhou, Roman Gushchin, Shakeel Butt, Muchun Song, Qi Zheng, linux-kernel, cgroups, Yosry Ahmed, Lorenzo Stoakes, Dev Jain, Lance Yang, Michal Hocko, Michal Hocko, Suren Baghdasaryan, Axel Rasmussen On Mon, Apr 20, 2026 at 11:16 PM Kairui Song via B4 Relay <devnull+kasong.tencent.com@kernel.org> wrote: > > This series unifies the allocation and charging of anon and shmem swap > in folios, provides better synchronization, consolidates the metadata > management, hence dropping the static array and map, and improves the > performance. The static metadata overhead is now close to zero, and > workload performance is slightly improved. > > For example, mounting a 1TB swap device saves about 512MB of memory: > > Before: > free -m > total used free shared buff/cache available > Mem: 1464 805 346 1 382 658 > Swap: 1048575 0 1048575 > > After: > free -m > total used free shared buff/cache available > Mem: 1464 277 899 1 356 1187 > Swap: 1048575 0 1048575 > > Memory usage is ~512M lower, and we now have a close to 0 static > overhead. It was about 2 bytes per slot before, now roughly 0.09375 > bytes per slot (48 bytes ci info per cluster, which is 512 slots). > > Performance test is also looking good, testing Redis in a 1.5G VM using > 5G ZRAM as swap: > > valkey-server --maxmemory 2560M > redis-benchmark -r 3000000 -n 3000000 -d 1024 -c 12 -P 32 -t get > > Before: 3289011.918750 RPS > After: 3312087.142241 RPS (0.99% better) > > Testing with build kernel under global pressure on a 48c96t system, > limiting the total memory to 8G, using 12G ZRAM, 24 test runs, > enabling THP: > > make -j96, using defconfig > > Before: user time 2904.59s system time 4773.99s > After: user time 2909.38s system time 4641.55s (2.77% better) > > Testing with usemem on a 32c machine using 48G brd ramdisk and 16G > RAM, 12 test run: > > usemem --init-time -O -y -x -n 48 1G > > Before: Throughput (Sum): 6482.58 MB/s Free Latency: 371371.67us > After: Throughput (Sum): 6539.28 MB/s Free Latency: 363059.88us > > Seems similar, or slightly better. > > This series also reduces memory thrashing, I no longer see any: > "Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF", it was > shown several times during stress testing before this series when under > great pressure: > > Before: grep -Ri VM_FAULT_OOM <test logs> | wc -l => 18 > After: grep -Ri VM_FAULT_OOM <test logs> | wc -l => 0 > > Signed-off-by: Kairui Song <kasong@tencent.com> Hi Andrew, I have given this swap table phase 4 series the first round of review. Overall, it looks good with some minor nitpicks. Can you add this to mm-unstable for more exposure?
Thanks Chris > --- > Changes in v3: > - This is based on mm-unstable, also applies to mm-new, and has no > conflict with YoungJun's tier series, and only trivial conflict with > Baoquan's swapops due to filename change. > - Fix zero map build issue on 32 bit archs [ YoungJun Park ] > - Cleanup memcg table allocation helpers [ YoungJun Park ] > - Fix WARN for non NUMA build: > https://lore.kernel.org/linux-mm/CAMgjq7ANih7u7SJB8uWcQHS8XRJySNRc3ti9V-SVey0nGE3gLQ@mail.gmail.com/ > - Improve commit messages. > - Re-test several tests, the conclusion is the same as v2. > - Link to v2: https://patch.msgid.link/20260417-swap-table-p4-v2-0-17f5d1015428@tencent.com > > Changes in v2: > - Drop the RFC prefix and also the RFC part. > - Now there is zero change to cgroup or refault tracking, RFC v1 changed > some cgroup behavior. To achieve that, v2 uses a standalone memcg_table > for each cluster. It can be dropped or better optimized later if we > have a better solution. The performance gain is partly cancelled > compared to RFC v1 since we now need an extra allocation for free cluster > isolation and peak memory usage is 2 bytes higher. But still looking > good. That table size is acceptable (1024 bytes), no RCU needed, and > fits for kmalloc. Even if we keep it as it is in the future, > it's still acceptable. > - Link to v1: https://lore.kernel.org/r/20260220-swap-table-p4-v1-0-104795d19815@tencent.com > > To: linux-mm@kvack.org > Cc: Andrew Morton <akpm@linux-foundation.org> > Cc: Chris Li <chrisl@kernel.org> > Cc: Kairui Song <kasong@tencent.com> > Cc: Kemeng Shi <shikemeng@huaweicloud.com> > Cc: Nhat Pham <nphamcs@gmail.com> > Cc: Baoquan He <bhe@redhat.com> > Cc: Barry Song <baohua@kernel.org> > Cc: Youngjun Park <youngjun.park@lge.com> > Cc: Johannes Weiner <hannes@cmpxchg.org> > Cc: Yosry Ahmed <yosry@kernel.org> > Cc: Chengming Zhou <chengming.zhou@linux.dev> > Cc: David Hildenbrand <david@kernel.org> > Cc: Lorenzo Stoakes <ljs@kernel.org> > Cc: Zi Yan <ziy@nvidia.com> > Cc: Baolin Wang <baolin.wang@linux.alibaba.com> > Cc: Dev Jain <dev.jain@arm.com> > Cc: Lance Yang <lance.yang@linux.dev> > Cc: Hugh Dickins <hughd@google.com> > Cc: Michal Hocko <mhocko@suse.com> > Cc: Michal Hocko <mhocko@kernel.org> > Cc: Roman Gushchin <roman.gushchin@linux.dev> > Cc: Shakeel Butt <shakeel.butt@linux.dev> > Cc: Muchun Song <muchun.song@linux.dev> > Cc: Suren Baghdasaryan <surenb@google.com> > Cc: Axel Rasmussen <axelrasmussen@google.com> > Cc: Qi Zheng <zhengqi.arch@bytedance.com> > Cc: linux-kernel@vger.kernel.org > Cc: cgroups@vger.kernel.org > > --- > Kairui Song (12): > mm, swap: simplify swap cache allocation helper > mm, swap: move common swap cache operations into standalone helpers > mm/huge_memory: move THP gfp limit helper into header > mm, swap: add support for stable large allocation in swap cache directly > mm, swap: unify large folio allocation > mm/memcg, swap: tidy up cgroup v1 memsw swap helpers > mm, swap: support flexible batch freeing of slots in different memcgs > mm, swap: delay and unify memcg lookup and charging for swapin > mm, swap: consolidate cluster allocation helpers > mm/memcg, swap: store cgroup id in cluster table directly > mm/memcg: remove no longer used swap cgroup array > mm, swap: merge zeromap into swap table > > MAINTAINERS | 1 - > include/linux/huge_mm.h | 30 +++ > include/linux/memcontrol.h | 16 +- > include/linux/swap.h | 19 +- > include/linux/swap_cgroup.h | 47 ---- > mm/Makefile | 3 - > mm/huge_memory.c | 2 +- > mm/internal.h | 11 +- > mm/memcontrol-v1.c | 66
+++--- > mm/memcontrol.c | 32 +-- > mm/memory.c | 88 ++------ > mm/page_io.c | 58 ++++- > mm/shmem.c | 122 +++-------- > mm/swap.h | 91 +++----- > mm/swap_cgroup.c | 172 --------------- > mm/swap_state.c | 516 +++++++++++++++++++++++++------------------- > mm/swap_table.h | 169 ++++++++++++--- > mm/swapfile.c | 212 +++++++++--------- > mm/vmscan.c | 2 +- > mm/zswap.c | 25 +-- > 20 files changed, 783 insertions(+), 899 deletions(-) > --- > base-commit: f1541b40cd422d7e22273be9b7e9edfc9ea4f0d7 > change-id: 20260111-swap-table-p4-98ee92baa7c4 > > Best regards, > -- > Kairui Song <kasong@tencent.com> > > ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v3 00/12] mm, swap: swap table phase IV: unify allocation and reduce static metadata [not found] ` <CAMgjq7CJ8Are6m7X2UxUoJ=77c_oSpdG8-bzkmdRzwey2Cp1gQ@mail.gmail.com> @ 2026-05-11 21:12 ` Andrew Morton 2026-05-12 5:10 ` Kairui Song 0 siblings, 1 reply; 26+ messages in thread From: Andrew Morton @ 2026-05-11 21:12 UTC (permalink / raw) To: Kairui Song Cc: kasong, linux-mm, David Hildenbrand, Zi Yan, Baolin Wang, Barry Song, Hugh Dickins, Chris Li, Kemeng Shi, Nhat Pham, Baoquan He, Johannes Weiner, Youngjun Park, Chengming Zhou, Roman Gushchin, Shakeel Butt, Muchun Song, Qi Zheng, linux-kernel, cgroups, Yosry Ahmed, Lorenzo Stoakes, Dev Jain, Lance Yang, Michal Hocko, Michal Hocko, Suren Baghdasaryan, Axel Rasmussen On Sat, 25 Apr 2026 02:11:47 +0800 Kairui Song <ryncsn@gmail.com> wrote: > > base-commit: f1541b40cd422d7e22273be9b7e9edfc9ea4f0d7 > > change-id: 20260111-swap-table-p4-98ee92baa7c4 > > > > Best regards, > > -- > > Kairui Song <kasong@tencent.com> > > > > > > I checked sashiko's review, it seems sashiko itself is bugged or > something is wrong. Most patches end up with: > Tool error: Review tool timed out (active time exceeded) > > The rest of the results are all false positives, maybe I can add a few > more comments in the code or commit so it can understand the code > better. > > And checking V2's review: > https://sashiko.dev/#/patchset/20260417-swap-table-p4-v2-0-17f5d1015428%40tencent.com > > Which are mostly false positives and I've fixed the two real but > trivial issues already. Things should be fine. Sashiko review of v3: https://sashiko.dev/#/patchset/20260421-swap-table-p4-v3-0-2f23759a76bc@tencent.com appears to be complete, so perhaps it went back and figured it out. It claims to have several "critical" and "high" things, so please recheck? From your replies in this thread, I believe that we'll be seeing a v4 series? ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v3 00/12] mm, swap: swap table phase IV: unify allocation and reduce static metadata 2026-05-11 21:12 ` Andrew Morton @ 2026-05-12 5:10 ` Kairui Song 0 siblings, 0 replies; 26+ messages in thread From: Kairui Song @ 2026-05-12 5:10 UTC (permalink / raw) To: Andrew Morton Cc: linux-mm, David Hildenbrand, Zi Yan, Baolin Wang, Barry Song, Hugh Dickins, Chris Li, Kemeng Shi, Nhat Pham, Baoquan He, Johannes Weiner, Youngjun Park, Chengming Zhou, Roman Gushchin, Shakeel Butt, Muchun Song, Qi Zheng, linux-kernel, cgroups, Yosry Ahmed, Lorenzo Stoakes, Dev Jain, Lance Yang, Michal Hocko, Michal Hocko, Suren Baghdasaryan, Axel Rasmussen On Tue, May 12, 2026 at 5:12 AM Andrew Morton <akpm@linux-foundation.org> wrote: > > On Sat, 25 Apr 2026 02:11:47 +0800 Kairui Song <ryncsn@gmail.com> wrote: > > I checked sashiko's review, it seems sashiko itself is bugged or > > something is wrong. Most patches end up with: > > Tool error: Review tool timed out (active time exceeded) > > > > The rest of the results are all false positives, maybe I can add a few > > more comments in the code or commit so it can understand the code > > better. > > > > And checking V2's review: > > https://sashiko.dev/#/patchset/20260417-swap-table-p4-v2-0-17f5d1015428%40tencent.com > > > > Which are mostly false positives and I've fixed the two real but > > trivial issues already. Things should be fine. > > Sashiko review of v3: > > https://sashiko.dev/#/patchset/20260421-swap-table-p4-v3-0-2f23759a76bc@tencent.com > > appears to be complete, so perhaps it went back and figured it out. > > It claims to have several "critical" and "high" things, so please > recheck? Right, thanks for the heads up! Just checked again; still, all reports are false positives. Some parts may be worth adding a WARN_ON or a comment to (one was also suggested by Chris), so both humans and AI will be less confused. For example, sashiko is very concerned about the round_down of swp_entry_t, or the alignment of a folio's swap entry, which is already a common pattern now and completely fine. We did plan to use a wrapper for that later to make it less confusing; it is not really a problem. Maybe it's better to also add a bit more info to the commit message. > From your replies in this thread, I believe that we'll be seeing a v4 > series? Sure, I'll send a v4, most changes are for cleanup and minor improvements. ^ permalink raw reply [flat|nested] 26+ messages in thread
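As background for the round_down discussion above: the pattern is safe because the swap allocator only places large folios on naturally aligned runs of slots, so rounding the faulting entry's value down by the folio's page count always lands on the folio's first slot. A minimal userspace illustration of that arithmetic (hypothetical standalone code; only the val field of swp_entry_t is modeled, and round_down is a simplified power-of-2-only stand-in for the kernel macro):

        #include <assert.h>
        #include <stdio.h>

        /* Stand-in for the kernel's swp_entry_t; only .val matters here. */
        typedef struct { unsigned long val; } swp_entry_t;

        /* Same result as the kernel's round_down() for power-of-2 sizes. */
        #define round_down(x, y) ((x) & ~((unsigned long)(y) - 1))

        int main(void)
        {
                /* An order-2 folio covers 4 slots and is allocated at a
                 * 4-aligned offset, say slots 8..11. */
                unsigned long nr_pages = 4;

                /* A fault can come in on any subpage, e.g. slot 10. */
                swp_entry_t faulting = { .val = 10 };

                /* Rounding down by the folio size recovers slot 8, the
                 * folio's first entry, which is why the pattern is safe. */
                swp_entry_t first = { .val = round_down(faulting.val, nr_pages) };

                assert(first.val == 8);
                printf("folio starts at slot %lu\n", first.val);
                return 0;
        }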
* Re: [PATCH v3 04/12] mm, swap: add support for stable large allocation in swap cache directly [not found] ` <20260421-swap-table-p4-v3-4-2f23759a76bc@tencent.com> @ 2026-05-06 20:27 ` Chris Li 2026-05-12 9:48 ` Baolin Wang 1 sibling, 0 replies; 26+ messages in thread From: Chris Li @ 2026-05-06 20:27 UTC (permalink / raw) To: kasong Cc: linux-mm, Andrew Morton, David Hildenbrand, Zi Yan, Baolin Wang, Barry Song, Hugh Dickins, Kemeng Shi, Nhat Pham, Baoquan He, Johannes Weiner, Youngjun Park, Chengming Zhou, Roman Gushchin, Shakeel Butt, Muchun Song, Qi Zheng, linux-kernel, cgroups, Yosry Ahmed, Lorenzo Stoakes, Dev Jain, Lance Yang, Michal Hocko, Michal Hocko, Suren Baghdasaryan, Axel Rasmussen On Tue, Apr 21, 2026 at 8:16 AM Kairui Song via B4 Relay <devnull+kasong.tencent.com@kernel.org> wrote: > > From: Kairui Song <kasong@tencent.com> > > To make it possible to allocate large folios directly in swap cache, > provide a new infrastructure helper to handle the swap cache status > check, allocation, and order fallback in the swap cache layer > > The new helper replaces the existing swap_cache_alloc_folio. Based on > this, all the separate swap folio allocation that is being done by anon > / shmem before is converted to use this helper directly, unifying folio > allocation for anon, shmem, and readahead. > > This slightly consolidates how allocation is synchronized, making it > more stable and less prone to errors. The slot-count and cache-conflict > check is now always performed with the cluster lock held before > allocation, and repeated under the same lock right before cache > insertion. This double check produces a stable result compared to the > previous anon and shmem mTHP allocation implementation, avoids the > false-negative conflict checks that the lockless path can return — large > allocations no longer have to be unwound because the range turned out to > be occupied — and aborts early for already-freed slots, which helps > ordinary swapin and especially readahead, with only a marginal increase > in cluster-lock contention (the lock is very lightly contended and stays > local in the first place). Hence, callers of swap_cache_alloc_folio() no > longer need to check the swap slot count or swap cache status > themselves. > > And now whoever first successfully allocates a folio in the swap cache > will be the one who charges it and performs the swap-in. The race window > of swapping is also reduced since the loop is much more compact. > > Signed-off-by: Kairui Song <kasong@tencent.com> Overall looks good. There seems to be a typo in the expression of orders below. > --- > mm/swap.h | 3 +- > mm/swap_state.c | 222 +++++++++++++++++++++++++++++++++++++++++--------------- > mm/zswap.c | 2 +- > 3 files changed, 165 insertions(+), 62 deletions(-) > > diff --git a/mm/swap.h b/mm/swap.h > index ad8b17a93758..6774af10a943 100644 > --- a/mm/swap.h > +++ b/mm/swap.h > @@ -280,7 +280,8 @@ bool swap_cache_has_folio(swp_entry_t entry); > struct folio *swap_cache_get_folio(swp_entry_t entry); > void *swap_cache_get_shadow(swp_entry_t entry); > void swap_cache_del_folio(struct folio *folio); > -struct folio *swap_cache_alloc_folio(swp_entry_t entry, gfp_t gfp_flags, > +struct folio *swap_cache_alloc_folio(swp_entry_t target_entry, gfp_t gfp_mask, > + unsigned long orders, struct vm_fault *vmf, > struct mempolicy *mpol, pgoff_t ilx); > /* Below helpers require the caller to lock and pass in the swap cluster.
*/ > void __swap_cache_add_folio(struct swap_cluster_info *ci, > diff --git a/mm/swap_state.c b/mm/swap_state.c > index 3da285a891b2..f5c77f348bbd 100644 > --- a/mm/swap_state.c > +++ b/mm/swap_state.c > @@ -139,10 +139,10 @@ void *swap_cache_get_shadow(swp_entry_t entry) > > /** > * __swap_cache_add_check - Check if a range is suitable for adding a folio. > - * @ci: The locked swap cluster. > - * @ci_off: Range start offset. > - * @nr: Number of slots to check. > - * @shadow: Returns the shadow value if one exists in the range. > + * @ci: The locked swap cluster > + * @targ_entry: The target swap entry to check, will be rounded down by @nr > + * @nr: Number of slots to check, must be a power of 2 > + * @shadowp: Returns the shadow value if one exists in the range. > * > * Check if all slots covered by given range have a swap count >= 1. > * Retrieves the shadow if there is one. > @@ -150,22 +150,38 @@ void *swap_cache_get_shadow(swp_entry_t entry) > * Context: Caller must lock the cluster. > */ > static int __swap_cache_add_check(struct swap_cluster_info *ci, > - unsigned int ci_off, unsigned int nr, > - void **shadow) > + swp_entry_t targ_entry, > + unsigned long nr, void **shadowp) > { > - unsigned int ci_end = ci_off + nr; > + unsigned int ci_off, ci_end; > unsigned long old_tb; > > + /* > + * If the target slot is not swapped out, return > + * -EEXIST or -ENOENT. If the batch is not suitable, could be a > + * race with concurrent free or cache add, return -EBUSY. > + */ > if (unlikely(!ci->table)) > return -ENOENT; > + ci_off = swp_cluster_offset(targ_entry); > + old_tb = __swap_table_get(ci, ci_off); > + if (swp_tb_is_folio(old_tb)) > + return -EEXIST; > + if (!__swp_tb_get_count(old_tb)) > + return -ENOENT; > + if (swp_tb_is_shadow(old_tb) && shadowp) > + *shadowp = swp_tb_to_shadow(old_tb); > + > + if (nr == 1) > + return 0; > + > + ci_off = round_down(ci_off, nr); > + ci_end = ci_off + nr; > do { > old_tb = __swap_table_get(ci, ci_off); > - if (unlikely(swp_tb_is_folio(old_tb))) > - return -EEXIST; > - if (unlikely(!__swp_tb_get_count(old_tb))) > - return -ENOENT; > - if (swp_tb_is_shadow(old_tb)) > - *shadow = swp_tb_to_shadow(old_tb); > + if (unlikely(swp_tb_is_folio(old_tb) || > + !__swp_tb_get_count(old_tb))) > + return -EBUSY; > } while (++ci_off < ci_end); > > return 0; > @@ -244,7 +260,7 @@ static int swap_cache_add_folio(struct folio *folio, swp_entry_t entry, > si = __swap_entry_to_info(entry); > ci = swap_cluster_lock(si, swp_offset(entry)); > ci_off = swp_cluster_offset(entry); > - err = __swap_cache_add_check(ci, ci_off, nr_pages, &shadow); > + err = __swap_cache_add_check(ci, entry, nr_pages, &shadow); > if (err) { > swap_cluster_unlock(ci); > return err; > @@ -399,6 +415,137 @@ void __swap_cache_replace_folio(struct swap_cluster_info *ci, > } > } > > +/* > + * Try to allocate a folio of given order in the swap cache. > + * > + * This helper resolves the potential races of swap allocation > + * and prepares a folio to be used for swap IO. 
May return following > + * value: > + * > + * -ENOMEM / -EBUSY: Order is too large or in conflict with sub slot, > + * caller should shrink the order and retry > + * -ENOENT / -EEXIST: Target swap entry is unavailable or cached, the caller > + * should abort or try to use the cached folio instead > + */ > +static struct folio *__swap_cache_alloc(struct swap_cluster_info *ci, > + swp_entry_t targ_entry, gfp_t gfp, > + unsigned int order, struct vm_fault *vmf, > + struct mempolicy *mpol, pgoff_t ilx) > +{ > + int err; > + swp_entry_t entry; > + struct folio *folio; > + void *shadow = NULL; > + unsigned long address, nr_pages = 1 << order; > + struct vm_area_struct *vma = vmf ? vmf->vma : NULL; > + > + entry.val = round_down(targ_entry.val, nr_pages); > + > + /* Check if the slot and range are available, skip allocation if not */ > + spin_lock(&ci->lock); > + err = __swap_cache_add_check(ci, targ_entry, nr_pages, NULL); > + spin_unlock(&ci->lock); > + if (unlikely(err)) > + return ERR_PTR(err); > + > + /* > + * Limit THP gfp. The limitation is a no-op for typical > + * GFP_HIGHUSER_MOVABLE but matters for shmem. > + */ > + if (order) > + gfp = thp_limit_gfp_mask(vma_thp_gfp_mask(vma), gfp); > + > + if (mpol || !vmf) { > + folio = folio_alloc_mpol(gfp, order, mpol, ilx, numa_node_id()); > + } else { > + address = round_down(vmf->address, PAGE_SIZE << order); > + folio = vma_alloc_folio(gfp, order, vmf->vma, address); > + } > + if (unlikely(!folio)) > + return ERR_PTR(-ENOMEM); > + > + /* Double check the range is still not in conflict */ > + spin_lock(&ci->lock); > + err = __swap_cache_add_check(ci, targ_entry, nr_pages, &shadow); > + if (unlikely(err)) { > + spin_unlock(&ci->lock); > + folio_put(folio); > + return ERR_PTR(err); > + } > + > + __folio_set_locked(folio); > + __folio_set_swapbacked(folio); > + __swap_cache_do_add_folio(ci, folio, entry); > + spin_unlock(&ci->lock); > + > + if (mem_cgroup_swapin_charge_folio(folio, vmf ? vmf->vma->vm_mm : NULL, > + gfp, entry)) { > + spin_lock(&ci->lock); > + __swap_cache_do_del_folio(ci, folio, entry, shadow); > + spin_unlock(&ci->lock); > + folio_unlock(folio); > + /* nr_pages refs from swap cache, 1 from allocation */ > + folio_put_refs(folio, nr_pages + 1); > + count_mthp_stat(order, MTHP_STAT_SWPIN_FALLBACK_CHARGE); > + return ERR_PTR(-ENOMEM); > + } > + > + /* For memsw accounting, swap is uncharged when folio is added to swap cache */ > + memcg1_swapin(entry, 1 << order); > + if (shadow) > + workingset_refault(folio, shadow); > + > + node_stat_mod_folio(folio, NR_FILE_PAGES, nr_pages); > + lruvec_stat_mod_folio(folio, NR_SWAPCACHE, nr_pages); > + > + /* Caller will initiate read into locked new_folio */ > + folio_add_lru(folio); > + return folio; > +} > + > +/** > + * swap_cache_alloc_folio - Allocate folio for swapped out slot in swap cache. > + * @targ_entry: swap entry indicating the target slot > + * @gfp: memory allocation flags > + * @orders: allocation orders > + * @vmf: fault information > + * @mpol: NUMA memory allocation policy to be applied > + * @ilx: NUMA interleave index, for use only when MPOL_INTERLEAVE > + * > + * Allocate a folio in the swap cache for one swap slot, typically before > + * doing IO (e.g. swap in or zswap writeback). The swap slot indicated by > + * @targ_entry must have a non-zero swap count (swapped out). > + * > + * Context: Caller must protect the swap device with reference count or locks. > + * Return: Returns the folio if allocation succeeded and folio is added to > + * swap cache. 
Returns error code if allocation failed due to race. > + */ > +struct folio *swap_cache_alloc_folio(swp_entry_t targ_entry, gfp_t gfp, > + unsigned long orders, struct vm_fault *vmf, > + struct mempolicy *mpol, pgoff_t ilx) > +{ > + int order, err; > + struct folio *ret; > + struct swap_cluster_info *ci; > + > + /* Always allow order 0 so swap won't fail under pressure. */ > + order = orders ? highest_order(orders |= BIT(0)) : 0; I can't understand this line. You seem to have put an order variable assignment in an expression which feels odd to me. I assume you mean "orders | BIT(0)". BTW, can you write this as: order = highest_order(orders | BIT(0)); Because when orders is zero, highest_order(BIT(0)) should be 0 as well. Chris > + ci = __swap_entry_to_cluster(targ_entry); > + for (;;) { > + ret = __swap_cache_alloc(ci, targ_entry, gfp, order, > + vmf, mpol, ilx); > + if (!IS_ERR(ret)) > + break; > + err = PTR_ERR(ret); > + if (!order || (err && err != -EBUSY && err != -ENOMEM)) > + break; > + count_mthp_stat(order, MTHP_STAT_SWPIN_FALLBACK); > + order = next_order(&orders, order); > + } > + > + return ret; > +} > + > /* > * If we are the only user, then try to free up the swap cache. > * > @@ -542,51 +689,10 @@ static int __swap_cache_prepare_and_add(swp_entry_t entry, > return ret; > } > > -/** > - * swap_cache_alloc_folio - Allocate folio for swapped out slot in swap cache. > - * @entry: the swapped out swap entry to be binded to the folio. > - * @gfp_mask: memory allocation flags > - * @mpol: NUMA memory allocation policy to be applied > - * @ilx: NUMA interleave index, for use only when MPOL_INTERLEAVE > - * > - * Allocate a folio in the swap cache for one swap slot, typically before > - * doing IO (e.g. swap in or zswap writeback). The swap slot indicated by > - * @entry must have a non-zero swap count (swapped out). > - * Currently only supports order 0. > - * > - * Context: Caller must protect the swap device with reference count or locks. > - * Return: Returns the folio if allocation succeeded and folio is added to > - * swap cache. Returns error code if allocation failed due to race. > - */ > -struct folio *swap_cache_alloc_folio(swp_entry_t entry, gfp_t gfp_mask, > - struct mempolicy *mpol, pgoff_t ilx) > -{ > - int ret; > - struct folio *folio; > - > - /* Allocate a new folio to be added into the swap cache. */ > - folio = folio_alloc_mpol(gfp_mask, 0, mpol, ilx, numa_node_id()); > - if (!folio) > - return ERR_PTR(-ENOMEM); > - > - /* > - * Try to add the new folio to the swap cache. It returns > - * -EEXIST if the entry is already cached. > - */ > - ret = __swap_cache_prepare_and_add(entry, folio, gfp_mask, false); > - if (ret) { > - folio_put(folio); > - return ERR_PTR(ret); > - } > - > - return folio; > -} > - > static struct folio *swap_cache_read_folio(swp_entry_t entry, gfp_t gfp, > struct mempolicy *mpol, pgoff_t ilx, > struct swap_iocb **plug, bool readahead) > { > - struct swap_info_struct *si = __swap_entry_to_info(entry); > struct folio *folio; > > /* Check the swap cache again for readahead path. */ > @@ -594,16 +700,12 @@ static struct folio *swap_cache_read_folio(swp_entry_t entry, gfp_t gfp, > if (folio) > return folio; > > - /* Skip allocation for unused and bad swap slot for readahead. 
> +        ci = __swap_entry_to_cluster(targ_entry);
> +        for (;;) {
> +                ret = __swap_cache_alloc(ci, targ_entry, gfp, order,
> +                                         vmf, mpol, ilx);
> +                if (!IS_ERR(ret))
> +                        break;
> +                err = PTR_ERR(ret);
> +                if (!order || (err && err != -EBUSY && err != -ENOMEM))
> +                        break;
> +                count_mthp_stat(order, MTHP_STAT_SWPIN_FALLBACK);
> +                order = next_order(&orders, order);
> +        }
> +
> +        return ret;
> +}
> +
>  /*
>   * If we are the only user, then try to free up the swap cache.
>   *
> @@ -542,51 +689,10 @@ static int __swap_cache_prepare_and_add(swp_entry_t entry,
>          return ret;
>  }
>
> -/**
> - * swap_cache_alloc_folio - Allocate folio for swapped out slot in swap cache.
> - * @entry: the swapped out swap entry to be binded to the folio.
> - * @gfp_mask: memory allocation flags
> - * @mpol: NUMA memory allocation policy to be applied
> - * @ilx: NUMA interleave index, for use only when MPOL_INTERLEAVE
> - *
> - * Allocate a folio in the swap cache for one swap slot, typically before
> - * doing IO (e.g. swap in or zswap writeback). The swap slot indicated by
> - * @entry must have a non-zero swap count (swapped out).
> - * Currently only supports order 0.
> - *
> - * Context: Caller must protect the swap device with reference count or locks.
> - * Return: Returns the folio if allocation succeeded and folio is added to
> - * swap cache. Returns error code if allocation failed due to race.
> - */
> -struct folio *swap_cache_alloc_folio(swp_entry_t entry, gfp_t gfp_mask,
> -                                     struct mempolicy *mpol, pgoff_t ilx)
> -{
> -        int ret;
> -        struct folio *folio;
> -
> -        /* Allocate a new folio to be added into the swap cache. */
> -        folio = folio_alloc_mpol(gfp_mask, 0, mpol, ilx, numa_node_id());
> -        if (!folio)
> -                return ERR_PTR(-ENOMEM);
> -
> -        /*
> -         * Try to add the new folio to the swap cache. It returns
> -         * -EEXIST if the entry is already cached.
> -         */
> -        ret = __swap_cache_prepare_and_add(entry, folio, gfp_mask, false);
> -        if (ret) {
> -                folio_put(folio);
> -                return ERR_PTR(ret);
> -        }
> -
> -        return folio;
> -}
> -
>  static struct folio *swap_cache_read_folio(swp_entry_t entry, gfp_t gfp,
>                                             struct mempolicy *mpol, pgoff_t ilx,
>                                             struct swap_iocb **plug, bool readahead)
>  {
> -        struct swap_info_struct *si = __swap_entry_to_info(entry);
>          struct folio *folio;
>
>          /* Check the swap cache again for readahead path. */
> @@ -594,16 +700,12 @@ static struct folio *swap_cache_read_folio(swp_entry_t entry, gfp_t gfp,
>          if (folio)
>                  return folio;
>
> -        /* Skip allocation for unused and bad swap slot for readahead. */
> -        if (!swap_entry_swapped(si, entry))
> -                return NULL;
> -
>          do {
>                  folio = swap_cache_get_folio(entry);
>                  if (folio)
>                          return folio;
>
> -                folio = swap_cache_alloc_folio(entry, gfp, mpol, ilx);
> +                folio = swap_cache_alloc_folio(entry, gfp, 0, NULL, mpol, ilx);
>          } while (IS_ERR(folio) && PTR_ERR(folio) == -EEXIST);
>
>          if (IS_ERR_OR_NULL(folio))
> diff --git a/mm/zswap.c b/mm/zswap.c
> index e27f6e96f003..4fcd95eb24cb 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -1000,7 +1000,7 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
>                  return -EEXIST;
>
>          mpol = get_task_policy(current);
> -        folio = swap_cache_alloc_folio(swpentry, GFP_KERNEL, mpol,
> +        folio = swap_cache_alloc_folio(swpentry, GFP_KERNEL, 0, NULL, mpol,
>                                         NO_INTERLEAVE_INDEX);
>          put_swap_device(si);
>
>
> --
> 2.53.0
>

^ permalink raw reply	[flat|nested] 26+ messages in thread
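
The __swap_cache_alloc() helper quoted above is an instance of a
check / allocate / re-check pattern: probe the slot cheaply under the
cluster lock, do the potentially sleeping allocation unlocked, then
re-validate under the lock before publishing. A self-contained userspace
sketch of just that pattern, with a pthread mutex standing in for the
cluster lock (all names here are illustrative, none of this is kernel API):

#include <errno.h>
#include <pthread.h>
#include <stdlib.h>

#define NSLOTS 64

static pthread_mutex_t table_lock = PTHREAD_MUTEX_INITIALIZER;
static void *table[NSLOTS];     /* NULL means the slot is free */

static void *slot_alloc(unsigned int key, size_t size)
{
        void *obj;

        /* Cheap pre-check: skip the allocation if the slot is taken. */
        pthread_mutex_lock(&table_lock);
        if (table[key]) {
                pthread_mutex_unlock(&table_lock);
                errno = EEXIST;
                return NULL;
        }
        pthread_mutex_unlock(&table_lock);

        /* The expensive part runs unlocked (like folio allocation). */
        obj = malloc(size);
        if (!obj) {
                errno = ENOMEM;
                return NULL;
        }

        /* Re-validate: another thread may have won the race meanwhile. */
        pthread_mutex_lock(&table_lock);
        if (table[key]) {
                pthread_mutex_unlock(&table_lock);
                free(obj);      /* lost the race: back out */
                errno = EEXIST;
                return NULL;
        }
        table[key] = obj;       /* publish while still holding the lock */
        pthread_mutex_unlock(&table_lock);
        return obj;
}

The second check is what makes the unlocked allocation safe: a racing
thread can take the slot between the two critical sections, and the loser
simply backs out, mirroring the folio_put() path in the patch.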
* Re: [PATCH v3 04/12] mm, swap: add support for stable large allocation in swap cache directly
  [not found] ` <20260421-swap-table-p4-v3-4-2f23759a76bc@tencent.com>
  2026-05-06 20:27   ` [PATCH v3 04/12] mm, swap: add support for stable large allocation in swap cache directly Chris Li
@ 2026-05-12  9:48   ` Baolin Wang
  2026-05-12  9:55     ` Kairui Song
  1 sibling, 1 reply; 26+ messages in thread
From: Baolin Wang @ 2026-05-12  9:48 UTC (permalink / raw)
  To: kasong, linux-mm
  Cc: Andrew Morton, David Hildenbrand, Zi Yan, Barry Song,
	Hugh Dickins, Chris Li, Kemeng Shi, Nhat Pham, Baoquan He,
	Johannes Weiner, Youngjun Park, Chengming Zhou, Roman Gushchin,
	Shakeel Butt, Muchun Song, Qi Zheng, linux-kernel, cgroups,
	Yosry Ahmed, Lorenzo Stoakes, Dev Jain, Lance Yang, Michal Hocko,
	Suren Baghdasaryan, Axel Rasmussen

On 4/21/26 2:16 PM, Kairui Song via B4 Relay wrote:
> From: Kairui Song <kasong@tencent.com>
>
> To make it possible to allocate large folios directly in swap cache,
> provide a new infrastructure helper to handle the swap cache status
> check, allocation, and order fallback in the swap cache layer.
>
> The new helper replaces the existing swap_cache_alloc_folio. Based on
> this, all the separate swap folio allocation that anon / shmem did
> before is converted to use this helper directly, unifying folio
> allocation for anon, shmem, and readahead.
>
> This slightly consolidates how allocation is synchronized, making it
> more stable and less prone to errors. The slot-count and cache-conflict
> check is now always performed with the cluster lock held before
> allocation, and repeated under the same lock right before cache
> insertion. This double check produces a stable result compared to the
> previous anon and shmem mTHP allocation implementation, avoids the
> false-negative conflict checks that the lockless path can return (large
> allocations no longer have to be unwound because the range turned out to
> be occupied), and aborts early for already-freed slots, which helps
> ordinary swapin and especially readahead, with only a marginal increase
> in cluster-lock contention (the lock is very lightly contended and stays
> local in the first place). Hence, callers of swap_cache_alloc_folio() no
> longer need to check the swap slot count or swap cache status
> themselves.
>
> And now whoever first successfully allocates a folio in the swap cache
> will be the one who charges it and performs the swap-in. The race window
> of swapping is also reduced since the loop is much more compact.
>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
>  mm/swap.h       |   3 +-
>  mm/swap_state.c | 222 +++++++++++++++++++++++++++++++++++++++++---------------------
>  mm/zswap.c      |   2 +-
>  3 files changed, 165 insertions(+), 62 deletions(-)
>
> diff --git a/mm/swap.h b/mm/swap.h
> index ad8b17a93758..6774af10a943 100644
> --- a/mm/swap.h
> +++ b/mm/swap.h
> @@ -280,7 +280,8 @@ bool swap_cache_has_folio(swp_entry_t entry);
>  struct folio *swap_cache_get_folio(swp_entry_t entry);
>  void *swap_cache_get_shadow(swp_entry_t entry);
>  void swap_cache_del_folio(struct folio *folio);
> -struct folio *swap_cache_alloc_folio(swp_entry_t entry, gfp_t gfp_flags,
> +struct folio *swap_cache_alloc_folio(swp_entry_t target_entry, gfp_t gfp_mask,
> +                                     unsigned long orders, struct vm_fault *vmf,
>                                       struct mempolicy *mpol, pgoff_t ilx);
>  /* Below helpers require the caller to lock and pass in the swap cluster. */
>  void __swap_cache_add_folio(struct swap_cluster_info *ci,
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index 3da285a891b2..f5c77f348bbd 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -139,10 +139,10 @@ void *swap_cache_get_shadow(swp_entry_t entry)
>
>  /**
>   * __swap_cache_add_check - Check if a range is suitable for adding a folio.
> - * @ci: The locked swap cluster.
> - * @ci_off: Range start offset.
> - * @nr: Number of slots to check.
> - * @shadow: Returns the shadow value if one exists in the range.
> + * @ci: The locked swap cluster
> + * @targ_entry: The target swap entry to check, will be rounded down by @nr
> + * @nr: Number of slots to check, must be a power of 2
> + * @shadowp: Returns the shadow value if one exists in the range.
>   *
>   * Check if all slots covered by given range have a swap count >= 1.
>   * Retrieves the shadow if there is one.
>   *
>   * Context: Caller must lock the cluster.
>   */
>  static int __swap_cache_add_check(struct swap_cluster_info *ci,
> -                                  unsigned int ci_off, unsigned int nr,
> -                                  void **shadow)
> +                                  swp_entry_t targ_entry,
> +                                  unsigned long nr, void **shadowp)
>  {
> -        unsigned int ci_end = ci_off + nr;
> +        unsigned int ci_off, ci_end;
>          unsigned long old_tb;
>
> +        /*
> +         * If the target slot is not swapped out, return
> +         * -EEXIST or -ENOENT. If the batch is not suitable, could be a
> +         * race with concurrent free or cache add, return -EBUSY.
> +         */
>          if (unlikely(!ci->table))
>                  return -ENOENT;
> +        ci_off = swp_cluster_offset(targ_entry);
> +        old_tb = __swap_table_get(ci, ci_off);
> +        if (swp_tb_is_folio(old_tb))
> +                return -EEXIST;
> +        if (!__swp_tb_get_count(old_tb))
> +                return -ENOENT;
> +        if (swp_tb_is_shadow(old_tb) && shadowp)
> +                *shadowp = swp_tb_to_shadow(old_tb);
> +
> +        if (nr == 1)
> +                return 0;
> +
> +        ci_off = round_down(ci_off, nr);
> +        ci_end = ci_off + nr;
>          do {
>                  old_tb = __swap_table_get(ci, ci_off);
> -                if (unlikely(swp_tb_is_folio(old_tb)))
> -                        return -EEXIST;
> -                if (unlikely(!__swp_tb_get_count(old_tb)))
> -                        return -ENOENT;
> -                if (swp_tb_is_shadow(old_tb))
> -                        *shadow = swp_tb_to_shadow(old_tb);
> +                if (unlikely(swp_tb_is_folio(old_tb) ||
> +                             !__swp_tb_get_count(old_tb)))
> +                        return -EBUSY;
>          } while (++ci_off < ci_end);
>
>          return 0;
> @@ -244,7 +260,7 @@ static int swap_cache_add_folio(struct folio *folio, swp_entry_t entry,
>          si = __swap_entry_to_info(entry);
>          ci = swap_cluster_lock(si, swp_offset(entry));
>          ci_off = swp_cluster_offset(entry);
> -        err = __swap_cache_add_check(ci, ci_off, nr_pages, &shadow);
> +        err = __swap_cache_add_check(ci, entry, nr_pages, &shadow);
>          if (err) {
>                  swap_cluster_unlock(ci);
>                  return err;
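
A compact userspace model of what the reworked check enforces may help
here: the target slot itself must be swapped out and uncached, and for a
batch every slot in the aligned range must be usable too. The swap-table
predicates are stubbed with plain fields; only the error-code contract
mirrors the patch, everything else is illustrative:

#include <errno.h>
#include <stdbool.h>

/* Illustrative stand-in for one swap table entry. */
struct tb_ent { bool is_folio; int count; };

static int add_check(const struct tb_ent *tb, unsigned int targ,
                     unsigned long nr)
{
        unsigned int off, end;

        if (tb[targ].is_folio)
                return -EEXIST;         /* cached: use that folio instead */
        if (!tb[targ].count)
                return -ENOENT;         /* slot was freed: abort */

        if (nr == 1)
                return 0;

        off = targ & ~(nr - 1);         /* round down to the batch start */
        end = off + nr;
        do {
                if (tb[off].is_folio || !tb[off].count)
                        return -EBUSY;  /* shrink the order and retry */
        } while (++off < end);
        return 0;
}

The distinct -EBUSY return is what lets swap_cache_alloc_folio() below
tell "retry with a smaller order" apart from "give up on this entry".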
> @@ -399,6 +415,137 @@ void __swap_cache_replace_folio(struct swap_cluster_info *ci,
>          }
>  }
>
> +/*
> + * Try to allocate a folio of given order in the swap cache.
> + *
> + * This helper resolves the potential races of swap allocation
> + * and prepares a folio to be used for swap IO. May return the following
> + * values:
> + *
> + * -ENOMEM / -EBUSY: Order is too large or in conflict with sub slot,
> + *                   caller should shrink the order and retry
> + * -ENOENT / -EEXIST: Target swap entry is unavailable or cached, the caller
> + *                    should abort or try to use the cached folio instead
> + */
> +static struct folio *__swap_cache_alloc(struct swap_cluster_info *ci,
> +                                        swp_entry_t targ_entry, gfp_t gfp,
> +                                        unsigned int order, struct vm_fault *vmf,
> +                                        struct mempolicy *mpol, pgoff_t ilx)
> +{
> +        int err;
> +        swp_entry_t entry;
> +        struct folio *folio;
> +        void *shadow = NULL;
> +        unsigned long address, nr_pages = 1 << order;
> +        struct vm_area_struct *vma = vmf ? vmf->vma : NULL;
> +
> +        entry.val = round_down(targ_entry.val, nr_pages);
> +
> +        /* Check if the slot and range are available, skip allocation if not */
> +        spin_lock(&ci->lock);
> +        err = __swap_cache_add_check(ci, targ_entry, nr_pages, NULL);
> +        spin_unlock(&ci->lock);
> +        if (unlikely(err))
> +                return ERR_PTR(err);
> +
> +        /*
> +         * Limit THP gfp. The limitation is a no-op for typical
> +         * GFP_HIGHUSER_MOVABLE but matters for shmem.
> +         */
> +        if (order)
> +                gfp = thp_limit_gfp_mask(vma_thp_gfp_mask(vma), gfp);
> +
> +        if (mpol || !vmf) {
> +                folio = folio_alloc_mpol(gfp, order, mpol, ilx, numa_node_id());
> +        } else {
> +                address = round_down(vmf->address, PAGE_SIZE << order);
> +                folio = vma_alloc_folio(gfp, order, vmf->vma, address);
> +        }
> +        if (unlikely(!folio))
> +                return ERR_PTR(-ENOMEM);
> +
> +        /* Double check the range is still not in conflict */
> +        spin_lock(&ci->lock);
> +        err = __swap_cache_add_check(ci, targ_entry, nr_pages, &shadow);
> +        if (unlikely(err)) {
> +                spin_unlock(&ci->lock);
> +                folio_put(folio);
> +                return ERR_PTR(err);
> +        }
> +
> +        __folio_set_locked(folio);
> +        __folio_set_swapbacked(folio);
> +        __swap_cache_do_add_folio(ci, folio, entry);
> +        spin_unlock(&ci->lock);
> +
> +        if (mem_cgroup_swapin_charge_folio(folio, vmf ? vmf->vma->vm_mm : NULL,
> +                                           gfp, entry)) {
> +                spin_lock(&ci->lock);
> +                __swap_cache_do_del_folio(ci, folio, entry, shadow);
> +                spin_unlock(&ci->lock);
> +                folio_unlock(folio);
> +                /* nr_pages refs from swap cache, 1 from allocation */
> +                folio_put_refs(folio, nr_pages + 1);
> +                count_mthp_stat(order, MTHP_STAT_SWPIN_FALLBACK_CHARGE);
> +                return ERR_PTR(-ENOMEM);
> +        }
> +
> +        /* For memsw accounting, swap is uncharged when folio is added to swap cache */
> +        memcg1_swapin(entry, 1 << order);
> +        if (shadow)
> +                workingset_refault(folio, shadow);
> +
> +        node_stat_mod_folio(folio, NR_FILE_PAGES, nr_pages);
> +        lruvec_stat_mod_folio(folio, NR_SWAPCACHE, nr_pages);
> +
> +        /* Caller will initiate read into locked new_folio */
> +        folio_add_lru(folio);
> +        return folio;
> +}
> +
> +/**
> + * swap_cache_alloc_folio - Allocate folio for swapped out slot in swap cache.
> + * @targ_entry: swap entry indicating the target slot
> + * @gfp: memory allocation flags
> + * @orders: allocation orders
> + * @vmf: fault information
> + * @mpol: NUMA memory allocation policy to be applied
> + * @ilx: NUMA interleave index, for use only when MPOL_INTERLEAVE
> + *
> + * Allocate a folio in the swap cache for one swap slot, typically before
> + * doing IO (e.g. swap in or zswap writeback). The swap slot indicated by
> + * @targ_entry must have a non-zero swap count (swapped out).
> + *
> + * Context: Caller must protect the swap device with reference count or locks.
> + * Return: Returns the folio if allocation succeeded and folio is added to
> + * swap cache. Returns error code if allocation failed due to race.
> + */
> +struct folio *swap_cache_alloc_folio(swp_entry_t targ_entry, gfp_t gfp,
> +                                     unsigned long orders, struct vm_fault *vmf,
> +                                     struct mempolicy *mpol, pgoff_t ilx)
> +{
> +        int order, err;
> +        struct folio *ret;
> +        struct swap_cluster_info *ci;
> +
> +        /* Always allow order 0 so swap won't fail under pressure. */
> +        order = orders ? highest_order(orders |= BIT(0)) : 0;

This seems a bit odd here. In THP/mTHP operations, it's usually the
callers' responsibility to determine the allowable orders. So I think we
should not implicitly set order 0 here. Instead, we should let callers
explicitly set it. What do you think?

diff --git a/mm/shmem.c b/mm/shmem.c
index f0da10054620..fb05daeab59a 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2023,7 +2023,8 @@ static struct folio *shmem_swap_alloc_folio(struct inode *inode,
        pgoff_t ilx;
        struct folio *folio;
        struct mempolicy *mpol;
-       unsigned long orders = BIT(order);
+       /* Always allow order 0 so swap won't fail under pressure. */
+       unsigned long orders = BIT(order) | BIT(0);
        struct shmem_inode_info *info = SHMEM_I(inode);

        if ((vmf && unlikely(userfaultfd_armed(vmf->vma))) ||

^ permalink raw reply related	[flat|nested] 26+ messages in thread
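
For clarity, this is how an orders bitmask built this way is walked by the
highest_order()/next_order() pair quoted above. A userspace sketch with
helper bodies assumed to match include/linux/huge_mm.h; the empty-mask
guard in next_order() is added only because the kernel version can rely on
BIT(0) remaining set:

#include <stdio.h>

#define BIT(n) (1UL << (n))

static unsigned int highest_order(unsigned long orders)
{
        return 8 * sizeof(unsigned long) - 1 - __builtin_clzl(orders);
}

/* Clear the order just tried, return the next highest candidate. */
static unsigned int next_order(unsigned long *orders, unsigned int prev)
{
        *orders &= ~BIT(prev);
        return *orders ? highest_order(*orders) : 0;
}

int main(void)
{
        /* Caller opts in to orders 4 and 2, plus the order-0 safety net. */
        unsigned long orders = BIT(4) | BIT(2) | BIT(0);
        unsigned int order = highest_order(orders);

        while (orders) {                /* prints: 4 2 0 */
                printf("%u ", order);
                if (!order)
                        break;
                order = next_order(&orders, order);
        }
        printf("\n");
        return 0;
}

With Baolin's suggestion the OR-ing of BIT(0) simply moves from the helper
into each caller, so the walk itself stays unchanged.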
* Re: [PATCH v3 04/12] mm, swap: add support for stable large allocation in swap cache directly
  2026-05-12  9:48     ` Baolin Wang
@ 2026-05-12  9:55       ` Kairui Song
  0 siblings, 0 replies; 26+ messages in thread
From: Kairui Song @ 2026-05-12  9:55 UTC (permalink / raw)
  To: Baolin Wang
  Cc: linux-mm, Andrew Morton, David Hildenbrand, Zi Yan, Barry Song,
	Hugh Dickins, Chris Li, Kemeng Shi, Nhat Pham, Baoquan He,
	Johannes Weiner, Youngjun Park, Chengming Zhou, Roman Gushchin,
	Shakeel Butt, Muchun Song, Qi Zheng, linux-kernel, cgroups,
	Yosry Ahmed, Lorenzo Stoakes, Dev Jain, Lance Yang, Michal Hocko,
	Suren Baghdasaryan, Axel Rasmussen

On Tue, May 12, 2026 at 5:49 PM Baolin Wang
<baolin.wang@linux.alibaba.com> wrote:
>
> On 4/21/26 2:16 PM, Kairui Song via B4 Relay wrote:
> > From: Kairui Song <kasong@tencent.com>
> >
> > +/**
> > + * swap_cache_alloc_folio - Allocate folio for swapped out slot in swap cache.
> > + * @targ_entry: swap entry indicating the target slot
> > + * @gfp: memory allocation flags
> > + * @orders: allocation orders
> > + * @vmf: fault information
> > + * @mpol: NUMA memory allocation policy to be applied
> > + * @ilx: NUMA interleave index, for use only when MPOL_INTERLEAVE
> > + *
> > + * Allocate a folio in the swap cache for one swap slot, typically before
> > + * doing IO (e.g. swap in or zswap writeback). The swap slot indicated by
> > + * @targ_entry must have a non-zero swap count (swapped out).
> > + *
> > + * Context: Caller must protect the swap device with reference count or locks.
> > + * Return: Returns the folio if allocation succeeded and folio is added to
> > + * swap cache. Returns error code if allocation failed due to race.
> > + */
> > +struct folio *swap_cache_alloc_folio(swp_entry_t targ_entry, gfp_t gfp,
> > +                                     unsigned long orders, struct vm_fault *vmf,
> > +                                     struct mempolicy *mpol, pgoff_t ilx)
> > +{
> > +        int order, err;
> > +        struct folio *ret;
> > +        struct swap_cluster_info *ci;
> > +
> > +        /* Always allow order 0 so swap won't fail under pressure. */
> > +        order = orders ? highest_order(orders |= BIT(0)) : 0;
>
> This seems a bit odd here. In THP/mTHP operations, it's usually the
> callers' responsibility to determine the allowable orders. So I think we
> should not implicitly set order 0 here. Instead, we should let callers
> explicitly set it. What do you think?

Totally agree. I hesitated between these two designs. And Usama also
needs this because some callers (PMD swapin) don't want the fallback.

I'll let the caller explicitly pass in the allowable order in v4.

Thanks for the review!

^ permalink raw reply	[flat|nested] 26+ messages in thread
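
A guess at the resulting v4-style call sites, sketched only from this
exchange and not from the actual series (HPAGE_PMD_ORDER is the existing
kernel constant; allowed_orders and the exact argument shapes are
illustrative):

/*
 * Hypothetical v4-style callers: swap_cache_alloc_folio() itself would
 * no longer OR in BIT(0), so each caller states its own fallback policy.
 */

/* mTHP swapin: try the large orders, but keep the order-0 safety net
 * so swapin cannot fail outright under memory pressure. */
folio = swap_cache_alloc_folio(entry, gfp, allowed_orders | BIT(0),
                               vmf, mpol, ilx);

/* PMD swapin: all-or-nothing, no fallback; the caller handles the
 * failure itself instead of silently mapping a smaller folio. */
folio = swap_cache_alloc_folio(entry, gfp, BIT(HPAGE_PMD_ORDER),
                               vmf, mpol, ilx);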
end of thread

Thread overview: 26+ messages
[not found] <20260421-swap-table-p4-v3-0-2f23759a76bc@tencent.com>
[not found] ` <20260421-swap-table-p4-v3-1-2f23759a76bc@tencent.com>
2026-05-06 13:51 ` [PATCH v3 01/12] mm, swap: simplify swap cache allocation helper Chris Li
2026-05-11 8:57 ` Kairui Song
[not found] ` <20260421-swap-table-p4-v3-2-2f23759a76bc@tencent.com>
2026-05-06 14:42 ` [PATCH v3 02/12] mm, swap: move common swap cache operations into standalone helpers Chris Li
2026-05-12 14:48 ` Kairui Song
[not found] ` <20260421-swap-table-p4-v3-3-2f23759a76bc@tencent.com>
2026-05-06 14:46 ` [PATCH v3 03/12] mm/huge_memory: move THP gfp limit helper into header Chris Li
[not found] ` <D631DCC9-85F0-4E68-88A0-AD5DE328818E@nvidia.com>
[not found] ` <CAMgjq7BDmGWaVWBL+52_c=jgs293bgB+Qe-MafKE7dWZRsmx9A@mail.gmail.com>
[not found] ` <125AABD0-02D5-4656-9F55-4B5BFBD5BD3D@nvidia.com>
2026-05-12 9:02 ` Baolin Wang
[not found] ` <20260421-swap-table-p4-v3-5-2f23759a76bc@tencent.com>
2026-05-06 20:48 ` [PATCH v3 05/12] mm, swap: unify large folio allocation Chris Li
2026-05-11 12:57 ` David Hildenbrand (Arm)
2026-05-11 14:37 ` Kairui Song
2026-05-11 15:15 ` David Hildenbrand (Arm)
2026-05-11 16:44 ` Kairui Song
2026-05-12 6:07 ` David Hildenbrand (Arm)
2026-05-12 10:10 ` Baolin Wang
[not found] ` <20260421-swap-table-p4-v3-6-2f23759a76bc@tencent.com>
2026-05-06 20:57 ` [PATCH v3 06/12] mm/memcg, swap: tidy up cgroup v1 memsw swap helpers Chris Li
[not found] ` <20260421-swap-table-p4-v3-7-2f23759a76bc@tencent.com>
2026-05-08 4:01 ` [PATCH v3 07/12] mm, swap: support flexible batch freeing of slots in different memcgs Chris Li
[not found] ` <20260421-swap-table-p4-v3-8-2f23759a76bc@tencent.com>
2026-05-08 4:46 ` [PATCH v3 08/12] mm, swap: delay and unify memcg lookup and charging for swapin Chris Li
[not found] ` <20260421-swap-table-p4-v3-9-2f23759a76bc@tencent.com>
2026-05-08 5:02 ` [PATCH v3 09/12] mm, swap: consolidate cluster allocation helpers Chris Li
[not found] ` <20260421-swap-table-p4-v3-10-2f23759a76bc@tencent.com>
2026-05-08 22:46 ` [PATCH v3 10/12] mm/memcg, swap: store cgroup id in cluster table directly Chris Li
[not found] ` <20260421-swap-table-p4-v3-11-2f23759a76bc@tencent.com>
2026-05-08 22:47 ` [PATCH v3 11/12] mm/memcg: remove no longer used swap cgroup array Chris Li
[not found] ` <20260421-swap-table-p4-v3-12-2f23759a76bc@tencent.com>
2026-05-11 16:30 ` [PATCH v3 12/12] mm, swap: merge zeromap into swap table Chris Li
2026-05-11 16:34 ` [PATCH v3 00/12] mm, swap: swap table phase IV: unify allocation and reduce static metadata Chris Li
[not found] ` <CAMgjq7CJ8Are6m7X2UxUoJ=77c_oSpdG8-bzkmdRzwey2Cp1gQ@mail.gmail.com>
2026-05-11 21:12 ` Andrew Morton
2026-05-12 5:10 ` Kairui Song
[not found] ` <20260421-swap-table-p4-v3-4-2f23759a76bc@tencent.com>
2026-05-06 20:27 ` [PATCH v3 04/12] mm, swap: add support for stable large allocation in swap cache directly Chris Li
2026-05-12 9:48 ` Baolin Wang
2026-05-12 9:55 ` Kairui Song