[PATCH v4 00/10] support large folio swap-out and swap-in for shmem

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* [PATCH v4 00/10] support large folio swap-out and swap-in for shmem
@ 2024-08-07  7:31 Baolin Wang
  2024-08-07  7:31 ` [PATCH v4 01/10] mm: vmscan: add validation before spliting shmem large folio Baolin Wang
                   ` (9 more replies)
  0 siblings, 10 replies; 21+ messages in thread
From: Baolin Wang @ 2024-08-07  7:31 UTC (permalink / raw)
  To: akpm, hughd
  Cc: willy, david, wangkefeng.wang, chrisl, ying.huang, 21cnbao,
	ryan.roberts, shy828301, ziy, ioworker0, da.gomez, p.raghav,
	baolin.wang, linux-mm, linux-kernel

Shmem will support large folio allocation [1] [2] to get a better performance,
however, the memory reclaim still splits the precious large folios when trying
to swap-out shmem, which may lead to the memory fragmentation issue and can not
take advantage of the large folio for shmeme.

Moreover, the swap code already supports for swapping out large folio without
split, and large folio swap-in[3] series is queued into mm-unstable branch.
Hence this patch set also supports the large folio swap-out and swap-in for
shmem.

Please help to review. Thanks.

Functional testing
==================
Machine environment: 32 Arm cores, 120G memory and 50G swap device.

1. Run xfstests suite to test tmpfs filesystem, and I did not catch any
regressions with this patch set.
FSTYP=tmpfs
export TEST_DIR=/mnt/tempfs_mnt
export TEST_DEV=/mnt/tempfs_mnt
export SCRATCH_MNT=/mnt/scratchdir
export SCRATCH_DEV=/mnt/scratchdir

2. Run all mm selftests in tools/testing/selftests/mm/, and no
regressions found.

3. I also wrote several shmem swap test cases, including shmem splitting,
shmem swapout, shmem swapin, swapoff during shmem swapout, shmem reclaim,
shmem swapin replacement, etc. I tested these cases under 4K and 64K
shmem folio sizes with a swap device, and shmem swap functionality works
well on my machine.

[1] https://lore.kernel.org/all/cover.1717495894.git.baolin.wang@linux.alibaba.com/
[2] https://lore.kernel.org/all/20240515055719.32577-1-da.gomez@samsung.com/
[3] https://lore.kernel.org/all/20240508224040.190469-6-21cnbao@gmail.com/T/
[4] https://lore.kernel.org/all/8db63194-77fd-e0b8-8601-2bbf04889a5b@google.com/

Changes from v3:
 - Rebase to the latest mm-unstable.
 - Simplify patch 2 based on Barry's patch:
 https://lkml.kernel.org/r/20240730071339.107447-2-21cnbao@gmail.com

Chagens from v2:
 - Add new patch to split large swap entry if swapin folio is order 0
 folio.
 - Update some commit message.

Changes from v1:
 - Remove useless 'order' variable in shmem_partial_swap_usage(), per Daniel.
 - Add a new patch to return number of pages beeing freed in shmem_free_swap(),
 per Daniel.
 - Drop 'orders' parameter for find_get_entries() and find_lock_entries().
 - Round down the index when adding the swapin folio into the pagecache,
 suggested by Hugh.
 - Fix the reference issue when removing folio from pagecache in patch 8.
 - Fix replacing old folio in swap cache in patch 7.

Changes from RFC:
 - Rebased to the latest mm-unstable.
 - Drop the counter name fixing patch, which was queued into mm-hotfixes-stable
 branch.

Baolin Wang (9):
  mm: vmscan: add validation before spliting shmem large folio
  mm: swap: extend swap_shmem_alloc() to support batch SWAP_MAP_SHMEM
    flag setting
  mm: shmem: extend shmem_partial_swap_usage() to support large folio
    swap
  mm: filemap: use xa_get_order() to get the swap entry order
  mm: shmem: use swap_free_nr() to free shmem swap entries
  mm: shmem: support large folio allocation for shmem_replace_folio()
  mm: shmem: drop folio reference count using 'nr_pages' in
    shmem_delete_from_page_cache()
  mm: shmem: split large entry if the swapin folio is not large
  mm: shmem: support large folio swap out

Daniel Gomez (1):
  mm: shmem: return number of pages beeing freed in shmem_free_swap

 drivers/gpu/drm/i915/gem/i915_gem_shmem.c |   1 +
 include/linux/swap.h                      |   4 +-
 include/linux/writeback.h                 |   1 +
 mm/filemap.c                              |   4 +
 mm/shmem.c                                | 208 +++++++++++++++++-----
 mm/swapfile.c                             |   4 +-
 mm/vmscan.c                               |  22 ++-
 7 files changed, 194 insertions(+), 50 deletions(-)

-- 
2.39.3


^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH v4 01/10] mm: vmscan: add validation before spliting shmem large folio
  2024-08-07  7:31 [PATCH v4 00/10] support large folio swap-out and swap-in for shmem Baolin Wang
@ 2024-08-07  7:31 ` Baolin Wang
  2024-08-07 15:53   ` David Hildenbrand
  2024-08-07  7:31 ` [PATCH v4 02/10] mm: swap: extend swap_shmem_alloc() to support batch SWAP_MAP_SHMEM flag setting Baolin Wang
                   ` (8 subsequent siblings)
  9 siblings, 1 reply; 21+ messages in thread
From: Baolin Wang @ 2024-08-07  7:31 UTC (permalink / raw)
  To: akpm, hughd
  Cc: willy, david, wangkefeng.wang, chrisl, ying.huang, 21cnbao,
	ryan.roberts, shy828301, ziy, ioworker0, da.gomez, p.raghav,
	baolin.wang, linux-mm, linux-kernel

Page reclaim will not scan anon LRU if no swap space, however MADV_PAGEOUT
can still split shmem large folios even without a swap device. Thus add
swap available space validation before spliting shmem large folio to
avoid redundant split.

Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
 mm/vmscan.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 31d13462571e..796f65781f4f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1259,6 +1259,14 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
 			}
 		} else if (folio_test_swapbacked(folio) &&
 			   folio_test_large(folio)) {
+
+			/*
+			 * Do not split shmem folio if no swap memory
+			 * available.
+			 */
+			if (!total_swap_pages)
+				goto activate_locked;
+
 			/* Split shmem folio */
 			if (split_folio_to_list(folio, folio_list))
 				goto keep_locked;
-- 
2.39.3


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH v4 01/10] mm: vmscan: add validation before spliting shmem large folio
  2024-08-07  7:31 ` [PATCH v4 01/10] mm: vmscan: add validation before spliting shmem large folio Baolin Wang
@ 2024-08-07 15:53   ` David Hildenbrand
  2024-08-08  2:36     ` Baolin Wang
  0 siblings, 1 reply; 21+ messages in thread
From: David Hildenbrand @ 2024-08-07 15:53 UTC (permalink / raw)
  To: Baolin Wang, akpm, hughd
  Cc: willy, wangkefeng.wang, chrisl, ying.huang, 21cnbao, ryan.roberts,
	shy828301, ziy, ioworker0, da.gomez, p.raghav, linux-mm,
	linux-kernel, Christian Brauner, Luis Chamberlain

On 07.08.24 09:31, Baolin Wang wrote:
> Page reclaim will not scan anon LRU if no swap space, however MADV_PAGEOUT
> can still split shmem large folios even without a swap device. Thus add
> swap available space validation before spliting shmem large folio to
> avoid redundant split.
> 
> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> ---
>   mm/vmscan.c | 8 ++++++++
>   1 file changed, 8 insertions(+)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 31d13462571e..796f65781f4f 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1259,6 +1259,14 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>   			}
>   		} else if (folio_test_swapbacked(folio) &&
>   			   folio_test_large(folio)) {
> +
> +			/*
> +			 * Do not split shmem folio if no swap memory
> +			 * available.
> +			 */
> +			if (!total_swap_pages)
> +				goto activate_locked;
> +
>   			/* Split shmem folio */
>   			if (split_folio_to_list(folio, folio_list))
>   				goto keep_locked;

Reminds me of

commit 9a976f0c847b67d22ed694556a3626ed92da0422
Author: Luis Chamberlain <mcgrof@kernel.org>
Date:   Thu Mar 9 15:05:43 2023 -0800

     shmem: skip page split if we're not reclaiming
     
     In theory when info->flags & VM_LOCKED we should not be getting
     shem_writepage() called so we should be verifying this with a
     WARN_ON_ONCE().  Since we should not be swapping then best to ensure we
     also don't do the folio split earlier too.  So just move the check early
     to avoid folio splits in case its a dubious call.
     
     We also have a similar early bail when !total_swap_pages so just move that
     earlier to avoid the possible folio split in the same situation.


But indeed, pageout() -> writepage() is called *after* the split in the vmscan path.

In that "noswap" context, I wonder if we also want to skip folios part of shmem
with disabled swapping?

But now I am wondering under which circumstances we end up calling
shmem_writepage() with a large folio. And I think the answer is the comment of
folio_test_large(): via drivers/gpu/drm/i915/gem/i915_gem_shmem.c.


... so if shmem_writepage() handles+checks that, could we do

diff --git a/mm/vmscan.c b/mm/vmscan.c
index a332cb80e928..7dfa3d6e8ba7 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1257,11 +1257,6 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
                                                 goto activate_locked_split;
                                 }
                         }
-               } else if (folio_test_swapbacked(folio) &&
-                          folio_test_large(folio)) {
-                       /* Split shmem folio */
-                       if (split_folio_to_list(folio, folio_list))
-                               goto keep_locked;
                 }
  
                 /*

instead?

-- 
Cheers,

David / dhildenb


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH v4 01/10] mm: vmscan: add validation before spliting shmem large folio
  2024-08-07 15:53   ` David Hildenbrand
@ 2024-08-08  2:36     ` Baolin Wang
  2024-08-08  8:51       ` David Hildenbrand
  2024-08-08 12:35       ` Matthew Wilcox
  0 siblings, 2 replies; 21+ messages in thread
From: Baolin Wang @ 2024-08-08  2:36 UTC (permalink / raw)
  To: David Hildenbrand, akpm, hughd
  Cc: willy, wangkefeng.wang, chrisl, ying.huang, 21cnbao, ryan.roberts,
	shy828301, ziy, ioworker0, da.gomez, p.raghav, linux-mm,
	linux-kernel, Christian Brauner, Luis Chamberlain



On 2024/8/7 23:53, David Hildenbrand wrote:
> On 07.08.24 09:31, Baolin Wang wrote:
>> Page reclaim will not scan anon LRU if no swap space, however 
>> MADV_PAGEOUT
>> can still split shmem large folios even without a swap device. Thus add
>> swap available space validation before spliting shmem large folio to
>> avoid redundant split.
>>
>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>> ---
>>   mm/vmscan.c | 8 ++++++++
>>   1 file changed, 8 insertions(+)
>>
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index 31d13462571e..796f65781f4f 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -1259,6 +1259,14 @@ static unsigned int shrink_folio_list(struct 
>> list_head *folio_list,
>>               }
>>           } else if (folio_test_swapbacked(folio) &&
>>                  folio_test_large(folio)) {
>> +
>> +            /*
>> +             * Do not split shmem folio if no swap memory
>> +             * available.
>> +             */
>> +            if (!total_swap_pages)
>> +                goto activate_locked;
>> +
>>               /* Split shmem folio */
>>               if (split_folio_to_list(folio, folio_list))
>>                   goto keep_locked;
> 
> Reminds me of
> 
> commit 9a976f0c847b67d22ed694556a3626ed92da0422
> Author: Luis Chamberlain <mcgrof@kernel.org>
> Date:   Thu Mar 9 15:05:43 2023 -0800
> 
>      shmem: skip page split if we're not reclaiming
>      In theory when info->flags & VM_LOCKED we should not be getting
>      shem_writepage() called so we should be verifying this with a
>      WARN_ON_ONCE().  Since we should not be swapping then best to 
> ensure we
>      also don't do the folio split earlier too.  So just move the check 
> early
>      to avoid folio splits in case its a dubious call.
>      We also have a similar early bail when !total_swap_pages so just 
> move that
>      earlier to avoid the possible folio split in the same situation.
> 
> 
> But indeed, pageout() -> writepage() is called *after* the split in the 
> vmscan path.
> 
> In that "noswap" context, I wonder if we also want to skip folios part 
> of shmem
> with disabled swapping?

Yes, I think so.

> 
> But now I am wondering under which circumstances we end up calling
> shmem_writepage() with a large folio. And I think the answer is the 
> comment of
> folio_test_large(): via drivers/gpu/drm/i915/gem/i915_gem_shmem.c.
> 
> 
> ... so if shmem_writepage() handles+checks that, could we do
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index a332cb80e928..7dfa3d6e8ba7 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1257,11 +1257,6 @@ static unsigned int shrink_folio_list(struct 
> list_head *folio_list,
>                                                  goto 
> activate_locked_split;
>                                  }
>                          }
> -               } else if (folio_test_swapbacked(folio) &&
> -                          folio_test_large(folio)) {
> -                       /* Split shmem folio */
> -                       if (split_folio_to_list(folio, folio_list))
> -                               goto keep_locked;
>                  }
> 
>                  /*
> 
> instead?

Seems reasonable to me. But we should pass the 'folio_list' to 
shmem_writepage() to list the subpages of the large folio. Let me try. 
Thanks.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v4 01/10] mm: vmscan: add validation before spliting shmem large folio
  2024-08-08  2:36     ` Baolin Wang
@ 2024-08-08  8:51       ` David Hildenbrand
  2024-08-08  9:34         ` Baolin Wang
  2024-08-08 12:35       ` Matthew Wilcox
  1 sibling, 1 reply; 21+ messages in thread
From: David Hildenbrand @ 2024-08-08  8:51 UTC (permalink / raw)
  To: Baolin Wang, akpm, hughd
  Cc: willy, wangkefeng.wang, chrisl, ying.huang, 21cnbao, ryan.roberts,
	shy828301, ziy, ioworker0, da.gomez, p.raghav, linux-mm,
	linux-kernel, Christian Brauner, Luis Chamberlain

On 08.08.24 04:36, Baolin Wang wrote:
> 
> 
> On 2024/8/7 23:53, David Hildenbrand wrote:
>> On 07.08.24 09:31, Baolin Wang wrote:
>>> Page reclaim will not scan anon LRU if no swap space, however
>>> MADV_PAGEOUT
>>> can still split shmem large folios even without a swap device. Thus add
>>> swap available space validation before spliting shmem large folio to
>>> avoid redundant split.
>>>
>>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>>> ---
>>>    mm/vmscan.c | 8 ++++++++
>>>    1 file changed, 8 insertions(+)
>>>
>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>> index 31d13462571e..796f65781f4f 100644
>>> --- a/mm/vmscan.c
>>> +++ b/mm/vmscan.c
>>> @@ -1259,6 +1259,14 @@ static unsigned int shrink_folio_list(struct
>>> list_head *folio_list,
>>>                }
>>>            } else if (folio_test_swapbacked(folio) &&
>>>                   folio_test_large(folio)) {
>>> +
>>> +            /*
>>> +             * Do not split shmem folio if no swap memory
>>> +             * available.
>>> +             */
>>> +            if (!total_swap_pages)
>>> +                goto activate_locked;
>>> +
>>>                /* Split shmem folio */
>>>                if (split_folio_to_list(folio, folio_list))
>>>                    goto keep_locked;
>>
>> Reminds me of
>>
>> commit 9a976f0c847b67d22ed694556a3626ed92da0422
>> Author: Luis Chamberlain <mcgrof@kernel.org>
>> Date:   Thu Mar 9 15:05:43 2023 -0800
>>
>>       shmem: skip page split if we're not reclaiming
>>       In theory when info->flags & VM_LOCKED we should not be getting
>>       shem_writepage() called so we should be verifying this with a
>>       WARN_ON_ONCE().  Since we should not be swapping then best to
>> ensure we
>>       also don't do the folio split earlier too.  So just move the check
>> early
>>       to avoid folio splits in case its a dubious call.
>>       We also have a similar early bail when !total_swap_pages so just
>> move that
>>       earlier to avoid the possible folio split in the same situation.
>>
>>
>> But indeed, pageout() -> writepage() is called *after* the split in the
>> vmscan path.
>>
>> In that "noswap" context, I wonder if we also want to skip folios part
>> of shmem
>> with disabled swapping?
> 
> Yes, I think so.
> 
>>
>> But now I am wondering under which circumstances we end up calling
>> shmem_writepage() with a large folio. And I think the answer is the
>> comment of
>> folio_test_large(): via drivers/gpu/drm/i915/gem/i915_gem_shmem.c.
>>
>>
>> ... so if shmem_writepage() handles+checks that, could we do
>>
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index a332cb80e928..7dfa3d6e8ba7 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -1257,11 +1257,6 @@ static unsigned int shrink_folio_list(struct
>> list_head *folio_list,
>>                                                   goto
>> activate_locked_split;
>>                                   }
>>                           }
>> -               } else if (folio_test_swapbacked(folio) &&
>> -                          folio_test_large(folio)) {
>> -                       /* Split shmem folio */
>> -                       if (split_folio_to_list(folio, folio_list))
>> -                               goto keep_locked;
>>                   }
>>
>>                   /*
>>
>> instead?
> 
> Seems reasonable to me. But we should pass the 'folio_list' to
> shmem_writepage() to list the subpages of the large folio. Let me try.

Ah, yes, good point. Alternatively, we'd have to split and try writing 
all subpages in there. I wonder what to do if we fail to write some, and 
if we could handle that transparently, without the folio_list.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v4 01/10] mm: vmscan: add validation before spliting shmem large folio
  2024-08-08  8:51       ` David Hildenbrand
@ 2024-08-08  9:34         ` Baolin Wang
  2024-08-08 10:48           ` Daniel Gomez
  0 siblings, 1 reply; 21+ messages in thread
From: Baolin Wang @ 2024-08-08  9:34 UTC (permalink / raw)
  To: David Hildenbrand, akpm, hughd
  Cc: willy, wangkefeng.wang, chrisl, ying.huang, 21cnbao, ryan.roberts,
	shy828301, ziy, ioworker0, da.gomez, p.raghav, linux-mm,
	linux-kernel, Christian Brauner, Luis Chamberlain



On 2024/8/8 16:51, David Hildenbrand wrote:
> On 08.08.24 04:36, Baolin Wang wrote:
>>
>>
>> On 2024/8/7 23:53, David Hildenbrand wrote:
>>> On 07.08.24 09:31, Baolin Wang wrote:
>>>> Page reclaim will not scan anon LRU if no swap space, however
>>>> MADV_PAGEOUT
>>>> can still split shmem large folios even without a swap device. Thus add
>>>> swap available space validation before spliting shmem large folio to
>>>> avoid redundant split.
>>>>
>>>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>>>> ---
>>>>    mm/vmscan.c | 8 ++++++++
>>>>    1 file changed, 8 insertions(+)
>>>>
>>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>>> index 31d13462571e..796f65781f4f 100644
>>>> --- a/mm/vmscan.c
>>>> +++ b/mm/vmscan.c
>>>> @@ -1259,6 +1259,14 @@ static unsigned int shrink_folio_list(struct
>>>> list_head *folio_list,
>>>>                }
>>>>            } else if (folio_test_swapbacked(folio) &&
>>>>                   folio_test_large(folio)) {
>>>> +
>>>> +            /*
>>>> +             * Do not split shmem folio if no swap memory
>>>> +             * available.
>>>> +             */
>>>> +            if (!total_swap_pages)
>>>> +                goto activate_locked;
>>>> +
>>>>                /* Split shmem folio */
>>>>                if (split_folio_to_list(folio, folio_list))
>>>>                    goto keep_locked;
>>>
>>> Reminds me of
>>>
>>> commit 9a976f0c847b67d22ed694556a3626ed92da0422
>>> Author: Luis Chamberlain <mcgrof@kernel.org>
>>> Date:   Thu Mar 9 15:05:43 2023 -0800
>>>
>>>       shmem: skip page split if we're not reclaiming
>>>       In theory when info->flags & VM_LOCKED we should not be getting
>>>       shem_writepage() called so we should be verifying this with a
>>>       WARN_ON_ONCE().  Since we should not be swapping then best to
>>> ensure we
>>>       also don't do the folio split earlier too.  So just move the check
>>> early
>>>       to avoid folio splits in case its a dubious call.
>>>       We also have a similar early bail when !total_swap_pages so just
>>> move that
>>>       earlier to avoid the possible folio split in the same situation.
>>>
>>>
>>> But indeed, pageout() -> writepage() is called *after* the split in the
>>> vmscan path.
>>>
>>> In that "noswap" context, I wonder if we also want to skip folios part
>>> of shmem
>>> with disabled swapping?
>>
>> Yes, I think so.
>>
>>>
>>> But now I am wondering under which circumstances we end up calling
>>> shmem_writepage() with a large folio. And I think the answer is the
>>> comment of
>>> folio_test_large(): via drivers/gpu/drm/i915/gem/i915_gem_shmem.c.
>>>
>>>
>>> ... so if shmem_writepage() handles+checks that, could we do
>>>
>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>> index a332cb80e928..7dfa3d6e8ba7 100644
>>> --- a/mm/vmscan.c
>>> +++ b/mm/vmscan.c
>>> @@ -1257,11 +1257,6 @@ static unsigned int shrink_folio_list(struct
>>> list_head *folio_list,
>>>                                                   goto
>>> activate_locked_split;
>>>                                   }
>>>                           }
>>> -               } else if (folio_test_swapbacked(folio) &&
>>> -                          folio_test_large(folio)) {
>>> -                       /* Split shmem folio */
>>> -                       if (split_folio_to_list(folio, folio_list))
>>> -                               goto keep_locked;
>>>                   }
>>>
>>>                   /*
>>>
>>> instead?
>>
>> Seems reasonable to me. But we should pass the 'folio_list' to
>> shmem_writepage() to list the subpages of the large folio. Let me try.
> 
> Ah, yes, good point. Alternatively, we'd have to split and try writing 
> all subpages in there. I wonder what to do if we fail to write some, and 
> if we could handle that transparently, without the folio_list.

After some investigation, I prefer to pass 'folio_list' to 
shmem_writepage() via 'struct writeback_control', which could simplify 
the logic a lot. Otherwise, we need to handle each subpage's 
writeback/reclaim/dirty state, as well as tracking each subpage's write 
result, which seems more complicated.

I made a quick change by passing 'folio_list', and it looks simple and 
works as expected.

diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 75196b0f894f..10100e22d5c6 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -80,6 +80,9 @@ struct writeback_control {
          */
         struct swap_iocb **swap_plug;

+       /* Target list for splitting a large folio */
+       struct list_head *list;
+
         /* internal fields used by the ->writepages implementation: */
         struct folio_batch fbatch;
         pgoff_t index;
diff --git a/mm/shmem.c b/mm/shmem.c
index 05525e9e7423..0a5a68f7d0a0 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1496,9 +1496,10 @@ static int shmem_writepage(struct page *page, 
struct writeback_control *wbc)
          * and its shmem_writeback() needs them to be split when swapping.
          */
         if (wbc->split_large_folio && folio_test_large(folio)) {
+try_split:
                 /* Ensure the subpages are still dirty */
                 folio_test_set_dirty(folio);
-               if (split_huge_page(page) < 0)
+               if (split_huge_page_to_list_to_order(page, wbc->list, 0))
                         goto redirty;
                 folio = page_folio(page);
                 folio_clear_dirty(folio);
@@ -1540,8 +1541,12 @@ static int shmem_writepage(struct page *page, 
struct writeback_control *wbc)
         }

         swap = folio_alloc_swap(folio);
-       if (!swap.val)
+       if (!swap.val) {
+               if (nr_pages > 1)
+                       goto try_split;
+
                 goto redirty;
+       }

         /*
          * Add inode to shmem_unuse()'s list of swapped-out inodes,
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 277571815789..cf982cf2454f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -628,7 +628,7 @@ typedef enum {
   * Calls ->writepage().
   */
  static pageout_t pageout(struct folio *folio, struct address_space 
*mapping,
-                        struct swap_iocb **plug)
+                        struct swap_iocb **plug, struct list_head 
*folio_list)
  {
         /*
          * If the folio is dirty, only perform writeback if that write
@@ -676,6 +676,11 @@ static pageout_t pageout(struct folio *folio, 
struct address_space *mapping,
                         .swap_plug = plug,
                 };

+               if (shmem_mapping(mapping)) {
+                       wbc.list = folio_list;
+                       wbc.split_large_folio = 
!IS_ENABLED(CONFIG_THP_SWAP);
+               }
+
                 folio_set_reclaim(folio);
                 res = mapping->a_ops->writepage(&folio->page, &wbc);
                 if (res < 0)
@@ -1259,23 +1264,6 @@ static unsigned int shrink_folio_list(struct 
list_head *folio_list,
                                                 goto activate_locked_split;
                                 }
                         }
-               } else if (folio_test_swapbacked(folio) &&
-                          folio_test_large(folio)) {
-
-                       /*
-                        * Do not split shmem folio if no swap memory
-                        * available.
-                        */
-                       if (!total_swap_pages)
-                               goto activate_locked;
-
-                       /*
-                        * Only split shmem folio when CONFIG_THP_SWAP
-                        * is not enabled.
-                        */
-                       if (!IS_ENABLED(CONFIG_THP_SWAP) &&
-                           split_folio_to_list(folio, folio_list))
-                               goto keep_locked;
                 }

                 /*
@@ -1377,18 +1365,21 @@ static unsigned int shrink_folio_list(struct 
list_head *folio_list,
                          * starts and then write it out here.
                          */
                         try_to_unmap_flush_dirty();
-try_pageout:
-                       switch (pageout(folio, mapping, &plug)) {
+                       switch (pageout(folio, mapping, &plug, 
folio_list)) {
                         case PAGE_KEEP:
                                 goto keep_locked;
                         case PAGE_ACTIVATE:
-                               if (shmem_mapping(mapping) && 
folio_test_large(folio) &&
-                                   !split_folio_to_list(folio, 
folio_list)) {
+                               /* Shmem can be split when writeback to 
swap */
+                               if ((nr_pages > 1) && 
!folio_test_large(folio)) {
+                                       sc->nr_scanned -= (nr_pages - 1);
                                         nr_pages = 1;
-                                       goto try_pageout;
                                 }
                                 goto activate_locked;
                         case PAGE_SUCCESS:
+                               if ((nr_pages > 1) && 
!folio_test_large(folio)) {
+                                       sc->nr_scanned -= (nr_pages - 1);
+                                       nr_pages = 1;
+                               }
                                 stat->nr_pageout += nr_pages;

                                 if (folio_test_writeback(folio)) {

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH v4 01/10] mm: vmscan: add validation before spliting shmem large folio
  2024-08-08  9:34         ` Baolin Wang
@ 2024-08-08 10:48           ` Daniel Gomez
  2024-08-08 10:57             ` Daniel Gomez
  0 siblings, 1 reply; 21+ messages in thread
From: Daniel Gomez @ 2024-08-08 10:48 UTC (permalink / raw)
  To: Baolin Wang
  Cc: David Hildenbrand, akpm@linux-foundation.org, hughd@google.com,
	willy@infradead.org, wangkefeng.wang@huawei.com,
	chrisl@kernel.org, ying.huang@intel.com, 21cnbao@gmail.com,
	ryan.roberts@arm.com, shy828301@gmail.com, ziy@nvidia.com,
	ioworker0@gmail.com, Pankaj Raghav, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, Christian Brauner, Luis Chamberlain

On Thu, Aug 08, 2024 at 05:34:50PM GMT, Baolin Wang wrote:
> 
> 
> On 2024/8/8 16:51, David Hildenbrand wrote:
> > On 08.08.24 04:36, Baolin Wang wrote:
> > > 
> > > 
> > > On 2024/8/7 23:53, David Hildenbrand wrote:
> > > > On 07.08.24 09:31, Baolin Wang wrote:
> > > > > Page reclaim will not scan anon LRU if no swap space, however
> > > > > MADV_PAGEOUT
> > > > > can still split shmem large folios even without a swap device. Thus add
> > > > > swap available space validation before spliting shmem large folio to
> > > > > avoid redundant split.
> > > > > 
> > > > > Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> > > > > ---
> > > > >    mm/vmscan.c | 8 ++++++++
> > > > >    1 file changed, 8 insertions(+)
> > > > > 
> > > > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > > > index 31d13462571e..796f65781f4f 100644
> > > > > --- a/mm/vmscan.c
> > > > > +++ b/mm/vmscan.c
> > > > > @@ -1259,6 +1259,14 @@ static unsigned int shrink_folio_list(struct
> > > > > list_head *folio_list,
> > > > >                }
> > > > >            } else if (folio_test_swapbacked(folio) &&
> > > > >                   folio_test_large(folio)) {
> > > > > +
> > > > > +            /*
> > > > > +             * Do not split shmem folio if no swap memory
> > > > > +             * available.
> > > > > +             */
> > > > > +            if (!total_swap_pages)
> > > > > +                goto activate_locked;
> > > > > +
> > > > >                /* Split shmem folio */
> > > > >                if (split_folio_to_list(folio, folio_list))
> > > > >                    goto keep_locked;
> > > > 
> > > > Reminds me of
> > > > 
> > > > commit 9a976f0c847b67d22ed694556a3626ed92da0422
> > > > Author: Luis Chamberlain <mcgrof@kernel.org>
> > > > Date:   Thu Mar 9 15:05:43 2023 -0800
> > > > 
> > > >       shmem: skip page split if we're not reclaiming
> > > >       In theory when info->flags & VM_LOCKED we should not be getting
> > > >       shem_writepage() called so we should be verifying this with a
> > > >       WARN_ON_ONCE().  Since we should not be swapping then best to
> > > > ensure we
> > > >       also don't do the folio split earlier too.  So just move the check
> > > > early
> > > >       to avoid folio splits in case its a dubious call.
> > > >       We also have a similar early bail when !total_swap_pages so just
> > > > move that
> > > >       earlier to avoid the possible folio split in the same situation.
> > > > 
> > > > 
> > > > But indeed, pageout() -> writepage() is called *after* the split in the
> > > > vmscan path.
> > > > 
> > > > In that "noswap" context, I wonder if we also want to skip folios part
> > > > of shmem
> > > > with disabled swapping?
> > > 
> > > Yes, I think so.
> > > 
> > > > 
> > > > But now I am wondering under which circumstances we end up calling
> > > > shmem_writepage() with a large folio. And I think the answer is the
> > > > comment of
> > > > folio_test_large(): via drivers/gpu/drm/i915/gem/i915_gem_shmem.c.
> > > > 
> > > > 
> > > > ... so if shmem_writepage() handles+checks that, could we do
> > > > 
> > > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > > index a332cb80e928..7dfa3d6e8ba7 100644
> > > > --- a/mm/vmscan.c
> > > > +++ b/mm/vmscan.c
> > > > @@ -1257,11 +1257,6 @@ static unsigned int shrink_folio_list(struct
> > > > list_head *folio_list,
> > > >                                                   goto
> > > > activate_locked_split;
> > > >                                   }
> > > >                           }
> > > > -               } else if (folio_test_swapbacked(folio) &&
> > > > -                          folio_test_large(folio)) {
> > > > -                       /* Split shmem folio */
> > > > -                       if (split_folio_to_list(folio, folio_list))
> > > > -                               goto keep_locked;
> > > >                   }
> > > > 
> > > >                   /*
> > > > 
> > > > instead?
> > > 
> > > Seems reasonable to me. But we should pass the 'folio_list' to
> > > shmem_writepage() to list the subpages of the large folio. Let me try.
> > 
> > Ah, yes, good point. Alternatively, we'd have to split and try writing
> > all subpages in there. I wonder what to do if we fail to write some, and
> > if we could handle that transparently, without the folio_list.
> 
> After some investigation, I prefer to pass 'folio_list' to shmem_writepage()
> via 'struct writeback_control', which could simplify the logic a lot.
> Otherwise, we need to handle each subpage's writeback/reclaim/dirty state,
> as well as tracking each subpage's write result, which seems more
> complicated.
> 
> I made a quick change by passing 'folio_list', and it looks simple and works
> as expected.
> 
> diff --git a/include/linux/writeback.h b/include/linux/writeback.h
> index 75196b0f894f..10100e22d5c6 100644
> --- a/include/linux/writeback.h
> +++ b/include/linux/writeback.h
> @@ -80,6 +80,9 @@ struct writeback_control {
>          */
>         struct swap_iocb **swap_plug;
> 
> +       /* Target list for splitting a large folio */
> +       struct list_head *list;
> +
>         /* internal fields used by the ->writepages implementation: */
>         struct folio_batch fbatch;
>         pgoff_t index;
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 05525e9e7423..0a5a68f7d0a0 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -1496,9 +1496,10 @@ static int shmem_writepage(struct page *page, struct
> writeback_control *wbc)
>          * and its shmem_writeback() needs them to be split when swapping.
>          */
>         if (wbc->split_large_folio && folio_test_large(folio)) {
> +try_split:
>                 /* Ensure the subpages are still dirty */
>                 folio_test_set_dirty(folio);
> -               if (split_huge_page(page) < 0)
> +               if (split_huge_page_to_list_to_order(page, wbc->list, 0))

We check for split_large_folio, but we still send the wbc->list for i915
writepage() case. Previously, we were sending a NULL list. Shouldn't we address
that case too?

>                         goto redirty;
>                 folio = page_folio(page);
>                 folio_clear_dirty(folio);
> @@ -1540,8 +1541,12 @@ static int shmem_writepage(struct page *page, struct
> writeback_control *wbc)
>         }
> 
>         swap = folio_alloc_swap(folio);
> -       if (!swap.val)
> +       if (!swap.val) {
> +               if (nr_pages > 1)
> +                       goto try_split;
> +
>                 goto redirty;
> +       }
> 
>         /*
>          * Add inode to shmem_unuse()'s list of swapped-out inodes,
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 277571815789..cf982cf2454f 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -628,7 +628,7 @@ typedef enum {
>   * Calls ->writepage().
>   */
>  static pageout_t pageout(struct folio *folio, struct address_space
> *mapping,
> -                        struct swap_iocb **plug)
> +                        struct swap_iocb **plug, struct list_head
> *folio_list)
>  {
>         /*
>          * If the folio is dirty, only perform writeback if that write
> @@ -676,6 +676,11 @@ static pageout_t pageout(struct folio *folio, struct
> address_space *mapping,
>                         .swap_plug = plug,
>                 };
> 
> +               if (shmem_mapping(mapping)) {
> +                       wbc.list = folio_list;
> +                       wbc.split_large_folio =
> !IS_ENABLED(CONFIG_THP_SWAP);
> +               }
> +
>                 folio_set_reclaim(folio);
>                 res = mapping->a_ops->writepage(&folio->page, &wbc);
>                 if (res < 0)
> @@ -1259,23 +1264,6 @@ static unsigned int shrink_folio_list(struct
> list_head *folio_list,
>                                                 goto activate_locked_split;
>                                 }
>                         }
> -               } else if (folio_test_swapbacked(folio) &&
> -                          folio_test_large(folio)) {
> -
> -                       /*
> -                        * Do not split shmem folio if no swap memory
> -                        * available.
> -                        */
> -                       if (!total_swap_pages)
> -                               goto activate_locked;
> -
> -                       /*
> -                        * Only split shmem folio when CONFIG_THP_SWAP
> -                        * is not enabled.
> -                        */
> -                       if (!IS_ENABLED(CONFIG_THP_SWAP) &&
> -                           split_folio_to_list(folio, folio_list))
> -                               goto keep_locked;
>                 }
> 
>                 /*
> @@ -1377,18 +1365,21 @@ static unsigned int shrink_folio_list(struct
> list_head *folio_list,
>                          * starts and then write it out here.
>                          */
>                         try_to_unmap_flush_dirty();
> -try_pageout:
> -                       switch (pageout(folio, mapping, &plug)) {
> +                       switch (pageout(folio, mapping, &plug, folio_list))
> {
>                         case PAGE_KEEP:
>                                 goto keep_locked;
>                         case PAGE_ACTIVATE:
> -                               if (shmem_mapping(mapping) &&
> folio_test_large(folio) &&
> -                                   !split_folio_to_list(folio, folio_list))
> {
> +                               /* Shmem can be split when writeback to swap
> */
> +                               if ((nr_pages > 1) &&
> !folio_test_large(folio)) {
> +                                       sc->nr_scanned -= (nr_pages - 1);
>                                         nr_pages = 1;
> -                                       goto try_pageout;
>                                 }
>                                 goto activate_locked;
>                         case PAGE_SUCCESS:
> +                               if ((nr_pages > 1) &&
> !folio_test_large(folio)) {
> +                                       sc->nr_scanned -= (nr_pages - 1);
> +                                       nr_pages = 1;
> +                               }
>                                 stat->nr_pageout += nr_pages;
> 
>                                 if (folio_test_writeback(folio)) {

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v4 01/10] mm: vmscan: add validation before spliting shmem large folio
  2024-08-08 10:48           ` Daniel Gomez
@ 2024-08-08 10:57             ` Daniel Gomez
  2024-08-08 11:51               ` Baolin Wang
  0 siblings, 1 reply; 21+ messages in thread
From: Daniel Gomez @ 2024-08-08 10:57 UTC (permalink / raw)
  To: Baolin Wang
  Cc: David Hildenbrand, akpm@linux-foundation.org, hughd@google.com,
	willy@infradead.org, wangkefeng.wang@huawei.com,
	chrisl@kernel.org, ying.huang@intel.com, 21cnbao@gmail.com,
	ryan.roberts@arm.com, shy828301@gmail.com, ziy@nvidia.com,
	ioworker0@gmail.com, Pankaj Raghav, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, Christian Brauner, Luis Chamberlain

On Thu, Aug 08, 2024 at 12:48:52PM GMT, Daniel Gomez wrote:
> On Thu, Aug 08, 2024 at 05:34:50PM GMT, Baolin Wang wrote:
> > 
> > 
> > On 2024/8/8 16:51, David Hildenbrand wrote:
> > > On 08.08.24 04:36, Baolin Wang wrote:
> > > > 
> > > > 
> > > > On 2024/8/7 23:53, David Hildenbrand wrote:
> > > > > On 07.08.24 09:31, Baolin Wang wrote:
> > > > > > Page reclaim will not scan anon LRU if no swap space, however
> > > > > > MADV_PAGEOUT
> > > > > > can still split shmem large folios even without a swap device. Thus add
> > > > > > swap available space validation before spliting shmem large folio to
> > > > > > avoid redundant split.
> > > > > > 
> > > > > > Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> > > > > > ---
> > > > > >    mm/vmscan.c | 8 ++++++++
> > > > > >    1 file changed, 8 insertions(+)
> > > > > > 
> > > > > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > > > > index 31d13462571e..796f65781f4f 100644
> > > > > > --- a/mm/vmscan.c
> > > > > > +++ b/mm/vmscan.c
> > > > > > @@ -1259,6 +1259,14 @@ static unsigned int shrink_folio_list(struct
> > > > > > list_head *folio_list,
> > > > > >                }
> > > > > >            } else if (folio_test_swapbacked(folio) &&
> > > > > >                   folio_test_large(folio)) {
> > > > > > +
> > > > > > +            /*
> > > > > > +             * Do not split shmem folio if no swap memory
> > > > > > +             * available.
> > > > > > +             */
> > > > > > +            if (!total_swap_pages)
> > > > > > +                goto activate_locked;
> > > > > > +
> > > > > >                /* Split shmem folio */
> > > > > >                if (split_folio_to_list(folio, folio_list))
> > > > > >                    goto keep_locked;
> > > > > 
> > > > > Reminds me of
> > > > > 
> > > > > commit 9a976f0c847b67d22ed694556a3626ed92da0422
> > > > > Author: Luis Chamberlain <mcgrof@kernel.org>
> > > > > Date:   Thu Mar 9 15:05:43 2023 -0800
> > > > > 
> > > > >       shmem: skip page split if we're not reclaiming
> > > > >       In theory when info->flags & VM_LOCKED we should not be getting
> > > > >       shem_writepage() called so we should be verifying this with a
> > > > >       WARN_ON_ONCE().  Since we should not be swapping then best to
> > > > > ensure we
> > > > >       also don't do the folio split earlier too.  So just move the check
> > > > > early
> > > > >       to avoid folio splits in case its a dubious call.
> > > > >       We also have a similar early bail when !total_swap_pages so just
> > > > > move that
> > > > >       earlier to avoid the possible folio split in the same situation.
> > > > > 
> > > > > 
> > > > > But indeed, pageout() -> writepage() is called *after* the split in the
> > > > > vmscan path.
> > > > > 
> > > > > In that "noswap" context, I wonder if we also want to skip folios part
> > > > > of shmem
> > > > > with disabled swapping?
> > > > 
> > > > Yes, I think so.
> > > > 
> > > > > 
> > > > > But now I am wondering under which circumstances we end up calling
> > > > > shmem_writepage() with a large folio. And I think the answer is the
> > > > > comment of
> > > > > folio_test_large(): via drivers/gpu/drm/i915/gem/i915_gem_shmem.c.
> > > > > 
> > > > > 
> > > > > ... so if shmem_writepage() handles+checks that, could we do
> > > > > 
> > > > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > > > index a332cb80e928..7dfa3d6e8ba7 100644
> > > > > --- a/mm/vmscan.c
> > > > > +++ b/mm/vmscan.c
> > > > > @@ -1257,11 +1257,6 @@ static unsigned int shrink_folio_list(struct
> > > > > list_head *folio_list,
> > > > >                                                   goto
> > > > > activate_locked_split;
> > > > >                                   }
> > > > >                           }
> > > > > -               } else if (folio_test_swapbacked(folio) &&
> > > > > -                          folio_test_large(folio)) {
> > > > > -                       /* Split shmem folio */
> > > > > -                       if (split_folio_to_list(folio, folio_list))
> > > > > -                               goto keep_locked;
> > > > >                   }
> > > > > 
> > > > >                   /*
> > > > > 
> > > > > instead?
> > > > 
> > > > Seems reasonable to me. But we should pass the 'folio_list' to
> > > > shmem_writepage() to list the subpages of the large folio. Let me try.
> > > 
> > > Ah, yes, good point. Alternatively, we'd have to split and try writing
> > > all subpages in there. I wonder what to do if we fail to write some, and
> > > if we could handle that transparently, without the folio_list.
> > 
> > After some investigation, I prefer to pass 'folio_list' to shmem_writepage()
> > via 'struct writeback_control', which could simplify the logic a lot.
> > Otherwise, we need to handle each subpage's writeback/reclaim/dirty state,
> > as well as tracking each subpage's write result, which seems more
> > complicated.
> > 
> > I made a quick change by passing 'folio_list', and it looks simple and works
> > as expected.
> > 
> > diff --git a/include/linux/writeback.h b/include/linux/writeback.h
> > index 75196b0f894f..10100e22d5c6 100644
> > --- a/include/linux/writeback.h
> > +++ b/include/linux/writeback.h
> > @@ -80,6 +80,9 @@ struct writeback_control {
> >          */
> >         struct swap_iocb **swap_plug;
> > 
> > +       /* Target list for splitting a large folio */
> > +       struct list_head *list;
> > +
> >         /* internal fields used by the ->writepages implementation: */
> >         struct folio_batch fbatch;
> >         pgoff_t index;
> > diff --git a/mm/shmem.c b/mm/shmem.c
> > index 05525e9e7423..0a5a68f7d0a0 100644
> > --- a/mm/shmem.c
> > +++ b/mm/shmem.c
> > @@ -1496,9 +1496,10 @@ static int shmem_writepage(struct page *page, struct
> > writeback_control *wbc)
> >          * and its shmem_writeback() needs them to be split when swapping.
> >          */
> >         if (wbc->split_large_folio && folio_test_large(folio)) {
> > +try_split:
> >                 /* Ensure the subpages are still dirty */
> >                 folio_test_set_dirty(folio);
> > -               if (split_huge_page(page) < 0)
> > +               if (split_huge_page_to_list_to_order(page, wbc->list, 0))
> 
> We check for split_large_folio, but we still send the wbc->list for i915
> writepage() case. Previously, we were sending a NULL list. Shouldn't we address
> that case too?

I guess I was missing wbc initialization snippet:

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_shmem.c b/drivers/gpu/drm/i915/gem/i915_gem_shmem.c
index fe69f2c8527d..174b95a9a988 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_shmem.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_shmem.c
@@ -308,6 +308,7 @@ void __shmem_writeback(size_t size, struct address_space *mapping)
                .range_start = 0,
                .range_end = LLONG_MAX,
                .for_reclaim = 1,
+               .list = NULL,
        };
        unsigned long i;

> >                         goto redirty;
> >                 folio = page_folio(page);
> >                 folio_clear_dirty(folio);
> > @@ -1540,8 +1541,12 @@ static int shmem_writepage(struct page *page, struct
> > writeback_control *wbc)
> >         }
> > 
> >         swap = folio_alloc_swap(folio);
> > -       if (!swap.val)
> > +       if (!swap.val) {
> > +               if (nr_pages > 1)
> > +                       goto try_split;
> > +
> >                 goto redirty;
> > +       }
> > 
> >         /*
> >          * Add inode to shmem_unuse()'s list of swapped-out inodes,
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 277571815789..cf982cf2454f 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -628,7 +628,7 @@ typedef enum {
> >   * Calls ->writepage().
> >   */
> >  static pageout_t pageout(struct folio *folio, struct address_space
> > *mapping,
> > -                        struct swap_iocb **plug)
> > +                        struct swap_iocb **plug, struct list_head
> > *folio_list)
> >  {
> >         /*
> >          * If the folio is dirty, only perform writeback if that write
> > @@ -676,6 +676,11 @@ static pageout_t pageout(struct folio *folio, struct
> > address_space *mapping,
> >                         .swap_plug = plug,
> >                 };
> > 
> > +               if (shmem_mapping(mapping)) {
> > +                       wbc.list = folio_list;
> > +                       wbc.split_large_folio =
> > !IS_ENABLED(CONFIG_THP_SWAP);
> > +               }
> > +
> >                 folio_set_reclaim(folio);
> >                 res = mapping->a_ops->writepage(&folio->page, &wbc);
> >                 if (res < 0)
> > @@ -1259,23 +1264,6 @@ static unsigned int shrink_folio_list(struct
> > list_head *folio_list,
> >                                                 goto activate_locked_split;
> >                                 }
> >                         }
> > -               } else if (folio_test_swapbacked(folio) &&
> > -                          folio_test_large(folio)) {
> > -
> > -                       /*
> > -                        * Do not split shmem folio if no swap memory
> > -                        * available.
> > -                        */
> > -                       if (!total_swap_pages)
> > -                               goto activate_locked;
> > -
> > -                       /*
> > -                        * Only split shmem folio when CONFIG_THP_SWAP
> > -                        * is not enabled.
> > -                        */
> > -                       if (!IS_ENABLED(CONFIG_THP_SWAP) &&
> > -                           split_folio_to_list(folio, folio_list))
> > -                               goto keep_locked;
> >                 }
> > 
> >                 /*
> > @@ -1377,18 +1365,21 @@ static unsigned int shrink_folio_list(struct
> > list_head *folio_list,
> >                          * starts and then write it out here.
> >                          */
> >                         try_to_unmap_flush_dirty();
> > -try_pageout:
> > -                       switch (pageout(folio, mapping, &plug)) {
> > +                       switch (pageout(folio, mapping, &plug, folio_list))
> > {
> >                         case PAGE_KEEP:
> >                                 goto keep_locked;
> >                         case PAGE_ACTIVATE:
> > -                               if (shmem_mapping(mapping) &&
> > folio_test_large(folio) &&
> > -                                   !split_folio_to_list(folio, folio_list))
> > {
> > +                               /* Shmem can be split when writeback to swap
> > */
> > +                               if ((nr_pages > 1) &&
> > !folio_test_large(folio)) {
> > +                                       sc->nr_scanned -= (nr_pages - 1);
> >                                         nr_pages = 1;
> > -                                       goto try_pageout;
> >                                 }
> >                                 goto activate_locked;
> >                         case PAGE_SUCCESS:
> > +                               if ((nr_pages > 1) &&
> > !folio_test_large(folio)) {
> > +                                       sc->nr_scanned -= (nr_pages - 1);
> > +                                       nr_pages = 1;
> > +                               }
> >                                 stat->nr_pageout += nr_pages;
> > 
> >                                 if (folio_test_writeback(folio)) {

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH v4 01/10] mm: vmscan: add validation before spliting shmem large folio
  2024-08-08 10:57             ` Daniel Gomez
@ 2024-08-08 11:51               ` Baolin Wang
  0 siblings, 0 replies; 21+ messages in thread
From: Baolin Wang @ 2024-08-08 11:51 UTC (permalink / raw)
  To: Daniel Gomez
  Cc: David Hildenbrand, akpm@linux-foundation.org, hughd@google.com,
	willy@infradead.org, wangkefeng.wang@huawei.com,
	chrisl@kernel.org, ying.huang@intel.com, 21cnbao@gmail.com,
	ryan.roberts@arm.com, shy828301@gmail.com, ziy@nvidia.com,
	ioworker0@gmail.com, Pankaj Raghav, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, Christian Brauner, Luis Chamberlain



On 2024/8/8 18:57, Daniel Gomez wrote:
> On Thu, Aug 08, 2024 at 12:48:52PM GMT, Daniel Gomez wrote:
>> On Thu, Aug 08, 2024 at 05:34:50PM GMT, Baolin Wang wrote:
>>>
>>>
>>> On 2024/8/8 16:51, David Hildenbrand wrote:
>>>> On 08.08.24 04:36, Baolin Wang wrote:
>>>>>
>>>>>
>>>>> On 2024/8/7 23:53, David Hildenbrand wrote:
>>>>>> On 07.08.24 09:31, Baolin Wang wrote:
>>>>>>> Page reclaim will not scan anon LRU if no swap space, however
>>>>>>> MADV_PAGEOUT
>>>>>>> can still split shmem large folios even without a swap device. Thus add
>>>>>>> swap available space validation before spliting shmem large folio to
>>>>>>> avoid redundant split.
>>>>>>>
>>>>>>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>>>>>>> ---
>>>>>>>     mm/vmscan.c | 8 ++++++++
>>>>>>>     1 file changed, 8 insertions(+)
>>>>>>>
>>>>>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>>>>>> index 31d13462571e..796f65781f4f 100644
>>>>>>> --- a/mm/vmscan.c
>>>>>>> +++ b/mm/vmscan.c
>>>>>>> @@ -1259,6 +1259,14 @@ static unsigned int shrink_folio_list(struct
>>>>>>> list_head *folio_list,
>>>>>>>                 }
>>>>>>>             } else if (folio_test_swapbacked(folio) &&
>>>>>>>                    folio_test_large(folio)) {
>>>>>>> +
>>>>>>> +            /*
>>>>>>> +             * Do not split shmem folio if no swap memory
>>>>>>> +             * available.
>>>>>>> +             */
>>>>>>> +            if (!total_swap_pages)
>>>>>>> +                goto activate_locked;
>>>>>>> +
>>>>>>>                 /* Split shmem folio */
>>>>>>>                 if (split_folio_to_list(folio, folio_list))
>>>>>>>                     goto keep_locked;
>>>>>>
>>>>>> Reminds me of
>>>>>>
>>>>>> commit 9a976f0c847b67d22ed694556a3626ed92da0422
>>>>>> Author: Luis Chamberlain <mcgrof@kernel.org>
>>>>>> Date:   Thu Mar 9 15:05:43 2023 -0800
>>>>>>
>>>>>>        shmem: skip page split if we're not reclaiming
>>>>>>        In theory when info->flags & VM_LOCKED we should not be getting
>>>>>>        shem_writepage() called so we should be verifying this with a
>>>>>>        WARN_ON_ONCE().  Since we should not be swapping then best to
>>>>>> ensure we
>>>>>>        also don't do the folio split earlier too.  So just move the check
>>>>>> early
>>>>>>        to avoid folio splits in case its a dubious call.
>>>>>>        We also have a similar early bail when !total_swap_pages so just
>>>>>> move that
>>>>>>        earlier to avoid the possible folio split in the same situation.
>>>>>>
>>>>>>
>>>>>> But indeed, pageout() -> writepage() is called *after* the split in the
>>>>>> vmscan path.
>>>>>>
>>>>>> In that "noswap" context, I wonder if we also want to skip folios part
>>>>>> of shmem
>>>>>> with disabled swapping?
>>>>>
>>>>> Yes, I think so.
>>>>>
>>>>>>
>>>>>> But now I am wondering under which circumstances we end up calling
>>>>>> shmem_writepage() with a large folio. And I think the answer is the
>>>>>> comment of
>>>>>> folio_test_large(): via drivers/gpu/drm/i915/gem/i915_gem_shmem.c.
>>>>>>
>>>>>>
>>>>>> ... so if shmem_writepage() handles+checks that, could we do
>>>>>>
>>>>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>>>>> index a332cb80e928..7dfa3d6e8ba7 100644
>>>>>> --- a/mm/vmscan.c
>>>>>> +++ b/mm/vmscan.c
>>>>>> @@ -1257,11 +1257,6 @@ static unsigned int shrink_folio_list(struct
>>>>>> list_head *folio_list,
>>>>>>                                                    goto
>>>>>> activate_locked_split;
>>>>>>                                    }
>>>>>>                            }
>>>>>> -               } else if (folio_test_swapbacked(folio) &&
>>>>>> -                          folio_test_large(folio)) {
>>>>>> -                       /* Split shmem folio */
>>>>>> -                       if (split_folio_to_list(folio, folio_list))
>>>>>> -                               goto keep_locked;
>>>>>>                    }
>>>>>>
>>>>>>                    /*
>>>>>>
>>>>>> instead?
>>>>>
>>>>> Seems reasonable to me. But we should pass the 'folio_list' to
>>>>> shmem_writepage() to list the subpages of the large folio. Let me try.
>>>>
>>>> Ah, yes, good point. Alternatively, we'd have to split and try writing
>>>> all subpages in there. I wonder what to do if we fail to write some, and
>>>> if we could handle that transparently, without the folio_list.
>>>
>>> After some investigation, I prefer to pass 'folio_list' to shmem_writepage()
>>> via 'struct writeback_control', which could simplify the logic a lot.
>>> Otherwise, we need to handle each subpage's writeback/reclaim/dirty state,
>>> as well as tracking each subpage's write result, which seems more
>>> complicated.
>>>
>>> I made a quick change by passing 'folio_list', and it looks simple and works
>>> as expected.
>>>
>>> diff --git a/include/linux/writeback.h b/include/linux/writeback.h
>>> index 75196b0f894f..10100e22d5c6 100644
>>> --- a/include/linux/writeback.h
>>> +++ b/include/linux/writeback.h
>>> @@ -80,6 +80,9 @@ struct writeback_control {
>>>           */
>>>          struct swap_iocb **swap_plug;
>>>
>>> +       /* Target list for splitting a large folio */
>>> +       struct list_head *list;
>>> +
>>>          /* internal fields used by the ->writepages implementation: */
>>>          struct folio_batch fbatch;
>>>          pgoff_t index;
>>> diff --git a/mm/shmem.c b/mm/shmem.c
>>> index 05525e9e7423..0a5a68f7d0a0 100644
>>> --- a/mm/shmem.c
>>> +++ b/mm/shmem.c
>>> @@ -1496,9 +1496,10 @@ static int shmem_writepage(struct page *page, struct
>>> writeback_control *wbc)
>>>           * and its shmem_writeback() needs them to be split when swapping.
>>>           */
>>>          if (wbc->split_large_folio && folio_test_large(folio)) {
>>> +try_split:
>>>                  /* Ensure the subpages are still dirty */
>>>                  folio_test_set_dirty(folio);
>>> -               if (split_huge_page(page) < 0)
>>> +               if (split_huge_page_to_list_to_order(page, wbc->list, 0))
>>
>> We check for split_large_folio, but we still send the wbc->list for i915
>> writepage() case. Previously, we were sending a NULL list. Shouldn't we address
>> that case too?
> 
> I guess I was missing wbc initialization snippet:
> 
> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_shmem.c b/drivers/gpu/drm/i915/gem/i915_gem_shmem.c
> index fe69f2c8527d..174b95a9a988 100644
> --- a/drivers/gpu/drm/i915/gem/i915_gem_shmem.c
> +++ b/drivers/gpu/drm/i915/gem/i915_gem_shmem.c
> @@ -308,6 +308,7 @@ void __shmem_writeback(size_t size, struct address_space *mapping)
>                  .range_start = 0,
>                  .range_end = LLONG_MAX,
>                  .for_reclaim = 1,
> +               .list = NULL,
>          };
>          unsigned long i;
> 

IMO, we don't need an explicit initialization, and 'list' will be 
initialized as NULL. Please see: 
https://gcc.gnu.org/onlinedocs/gcc/Designated-Inits.html

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v4 01/10] mm: vmscan: add validation before spliting shmem large folio
  2024-08-08  2:36     ` Baolin Wang
  2024-08-08  8:51       ` David Hildenbrand
@ 2024-08-08 12:35       ` Matthew Wilcox
  2024-08-09  3:21         ` Baolin Wang
  1 sibling, 1 reply; 21+ messages in thread
From: Matthew Wilcox @ 2024-08-08 12:35 UTC (permalink / raw)
  To: Baolin Wang
  Cc: David Hildenbrand, akpm, hughd, wangkefeng.wang, chrisl,
	ying.huang, 21cnbao, ryan.roberts, shy828301, ziy, ioworker0,
	da.gomez, p.raghav, linux-mm, linux-kernel, Christian Brauner,
	Luis Chamberlain

On Thu, Aug 08, 2024 at 10:36:23AM +0800, Baolin Wang wrote:
> On 2024/8/7 23:53, David Hildenbrand wrote:
> > But now I am wondering under which circumstances we end up calling
> > shmem_writepage() with a large folio. And I think the answer is the
> > comment of
> > folio_test_large(): via drivers/gpu/drm/i915/gem/i915_gem_shmem.c.
> > 
> > 
> > ... so if shmem_writepage() handles+checks that, could we do
> > 
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index a332cb80e928..7dfa3d6e8ba7 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -1257,11 +1257,6 @@ static unsigned int shrink_folio_list(struct
> > list_head *folio_list,
> >                                                  goto
> > activate_locked_split;
> >                                  }
> >                          }
> > -               } else if (folio_test_swapbacked(folio) &&
> > -                          folio_test_large(folio)) {
> > -                       /* Split shmem folio */
> > -                       if (split_folio_to_list(folio, folio_list))
> > -                               goto keep_locked;
> >                  }
> > 
> >                  /*
> > 
> > instead?
> 
> Seems reasonable to me. But we should pass the 'folio_list' to
> shmem_writepage() to list the subpages of the large folio. Let me try.
> Thanks.

We should be trying to remove shmem_writepage(), not make it more
complex.  We're making good progress removing instances of ->writepage;
just ceph, ecryptfs, f2fs, gfs2, hostfs, nilfs2, orangefs, vboxsf, shmem
& swap are left.  gfs2 patches are out for review.

As you can see from previous patches, the approach is to use
->writepages instead of ->writepage.  There should be no need to
handle a folio split list as splitting a folio leaves the folios in the
page cache and they'll naturally be found by subsequent iterations.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v4 01/10] mm: vmscan: add validation before spliting shmem large folio
  2024-08-08 12:35       ` Matthew Wilcox
@ 2024-08-09  3:21         ` Baolin Wang
  0 siblings, 0 replies; 21+ messages in thread
From: Baolin Wang @ 2024-08-09  3:21 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: David Hildenbrand, akpm, hughd, wangkefeng.wang, chrisl,
	ying.huang, 21cnbao, ryan.roberts, shy828301, ziy, ioworker0,
	da.gomez, p.raghav, linux-mm, linux-kernel, Christian Brauner,
	Luis Chamberlain



On 2024/8/8 20:35, Matthew Wilcox wrote:
> On Thu, Aug 08, 2024 at 10:36:23AM +0800, Baolin Wang wrote:
>> On 2024/8/7 23:53, David Hildenbrand wrote:
>>> But now I am wondering under which circumstances we end up calling
>>> shmem_writepage() with a large folio. And I think the answer is the
>>> comment of
>>> folio_test_large(): via drivers/gpu/drm/i915/gem/i915_gem_shmem.c.
>>>
>>>
>>> ... so if shmem_writepage() handles+checks that, could we do
>>>
>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>> index a332cb80e928..7dfa3d6e8ba7 100644
>>> --- a/mm/vmscan.c
>>> +++ b/mm/vmscan.c
>>> @@ -1257,11 +1257,6 @@ static unsigned int shrink_folio_list(struct
>>> list_head *folio_list,
>>>                                                   goto
>>> activate_locked_split;
>>>                                   }
>>>                           }
>>> -               } else if (folio_test_swapbacked(folio) &&
>>> -                          folio_test_large(folio)) {
>>> -                       /* Split shmem folio */
>>> -                       if (split_folio_to_list(folio, folio_list))
>>> -                               goto keep_locked;
>>>                   }
>>>
>>>                   /*
>>>
>>> instead?
>>
>> Seems reasonable to me. But we should pass the 'folio_list' to
>> shmem_writepage() to list the subpages of the large folio. Let me try.
>> Thanks.
> 
> We should be trying to remove shmem_writepage(), not make it more
> complex.  We're making good progress removing instances of ->writepage;
> just ceph, ecryptfs, f2fs, gfs2, hostfs, nilfs2, orangefs, vboxsf, shmem
> & swap are left.  gfs2 patches are out for review.

I am afraid shmem is a bit special. IIUC, ->writepages() is used to 
write back some dirty pages from the mapping by the writeback flusher 
thread, but shmem cannot be written back (mapping_can_writeback() will 
return false). Therefore, shmem can only be reclaimed through direct 
reclaim or kswapd if a swap device is set up (if no swap, shmem should 
always be kept in memory). So currently, we should still keep 
shmem_writepage() to reclaim shmem pages.

> As you can see from previous patches, the approach is to use
> ->writepages instead of ->writepage.  There should be no need to
> handle a folio split list as splitting a folio leaves the folios in the
> page cache and they'll naturally be found by subsequent iterations.

Again, shmem is special. If shmem folio is reclaimable (when a swap 
device is set up), we need to allocate contiguous swap entries for large 
folios. However, if there is significant fragmentation of swap entries 
(there is already a topic to talk about this issue), we will not able to 
allocate contiguous swap entries for large shmem folios. Therefore, in 
this case, it is necessary to split the large shmem folio in order to 
try to allocate a singe swap entry for reclaiming shmem.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH v4 02/10] mm: swap: extend swap_shmem_alloc() to support batch SWAP_MAP_SHMEM flag setting
  2024-08-07  7:31 [PATCH v4 00/10] support large folio swap-out and swap-in for shmem Baolin Wang
  2024-08-07  7:31 ` [PATCH v4 01/10] mm: vmscan: add validation before spliting shmem large folio Baolin Wang
@ 2024-08-07  7:31 ` Baolin Wang
  2024-08-07  8:02   ` Barry Song
  2024-08-07  7:31 ` [PATCH v4 03/10] mm: shmem: extend shmem_partial_swap_usage() to support large folio swap Baolin Wang
                   ` (7 subsequent siblings)
  9 siblings, 1 reply; 21+ messages in thread
From: Baolin Wang @ 2024-08-07  7:31 UTC (permalink / raw)
  To: akpm, hughd
  Cc: willy, david, wangkefeng.wang, chrisl, ying.huang, 21cnbao,
	ryan.roberts, shy828301, ziy, ioworker0, da.gomez, p.raghav,
	baolin.wang, linux-mm, linux-kernel

To support shmem large folio swap operations, add a new parameter to
swap_shmem_alloc() that allows batch SWAP_MAP_SHMEM flag setting for
shmem swap entries.

While we are at it, using folio_nr_pages() to get the number of pages
of the folio as a preparation.

Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
 include/linux/swap.h | 4 ++--
 mm/shmem.c           | 6 ++++--
 mm/swapfile.c        | 4 ++--
 3 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 1c8f844a9f0f..248db1dd7812 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -481,7 +481,7 @@ void put_swap_folio(struct folio *folio, swp_entry_t entry);
 extern swp_entry_t get_swap_page_of_type(int);
 extern int get_swap_pages(int n, swp_entry_t swp_entries[], int order);
 extern int add_swap_count_continuation(swp_entry_t, gfp_t);
-extern void swap_shmem_alloc(swp_entry_t);
+extern void swap_shmem_alloc(swp_entry_t, int);
 extern int swap_duplicate(swp_entry_t);
 extern int swapcache_prepare(swp_entry_t entry, int nr);
 extern void swap_free_nr(swp_entry_t entry, int nr_pages);
@@ -548,7 +548,7 @@ static inline int add_swap_count_continuation(swp_entry_t swp, gfp_t gfp_mask)
 	return 0;
 }
 
-static inline void swap_shmem_alloc(swp_entry_t swp)
+static inline void swap_shmem_alloc(swp_entry_t swp, int nr)
 {
 }
 
diff --git a/mm/shmem.c b/mm/shmem.c
index 4a5254bfd610..22cdc10f27ea 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1452,6 +1452,7 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
 	struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
 	swp_entry_t swap;
 	pgoff_t index;
+	int nr_pages;
 
 	/*
 	 * Our capabilities prevent regular writeback or sync from ever calling
@@ -1484,6 +1485,7 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
 	}
 
 	index = folio->index;
+	nr_pages = folio_nr_pages(folio);
 
 	/*
 	 * This is somewhat ridiculous, but without plumbing a SWAP_MAP_FALLOC
@@ -1536,8 +1538,8 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
 	if (add_to_swap_cache(folio, swap,
 			__GFP_HIGH | __GFP_NOMEMALLOC | __GFP_NOWARN,
 			NULL) == 0) {
-		shmem_recalc_inode(inode, 0, 1);
-		swap_shmem_alloc(swap);
+		shmem_recalc_inode(inode, 0, nr_pages);
+		swap_shmem_alloc(swap, nr_pages);
 		shmem_delete_from_page_cache(folio, swp_to_radix_entry(swap));
 
 		mutex_unlock(&shmem_swaplist_mutex);
diff --git a/mm/swapfile.c b/mm/swapfile.c
index ea023fc25d08..88d73880aada 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -3604,9 +3604,9 @@ static int __swap_duplicate(swp_entry_t entry, unsigned char usage, int nr)
  * Help swapoff by noting that swap entry belongs to shmem/tmpfs
  * (in which case its reference count is never incremented).
  */
-void swap_shmem_alloc(swp_entry_t entry)
+void swap_shmem_alloc(swp_entry_t entry, int nr)
 {
-	__swap_duplicate(entry, SWAP_MAP_SHMEM, 1);
+	__swap_duplicate(entry, SWAP_MAP_SHMEM, nr);
 }
 
 /*
-- 
2.39.3


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH v4 02/10] mm: swap: extend swap_shmem_alloc() to support batch SWAP_MAP_SHMEM flag setting
  2024-08-07  7:31 ` [PATCH v4 02/10] mm: swap: extend swap_shmem_alloc() to support batch SWAP_MAP_SHMEM flag setting Baolin Wang
@ 2024-08-07  8:02   ` Barry Song
  0 siblings, 0 replies; 21+ messages in thread
From: Barry Song @ 2024-08-07  8:02 UTC (permalink / raw)
  To: Baolin Wang
  Cc: akpm, hughd, willy, david, wangkefeng.wang, chrisl, ying.huang,
	ryan.roberts, shy828301, ziy, ioworker0, da.gomez, p.raghav,
	linux-mm, linux-kernel

On Wed, Aug 7, 2024 at 7:31 PM Baolin Wang
<baolin.wang@linux.alibaba.com> wrote:
>
> To support shmem large folio swap operations, add a new parameter to
> swap_shmem_alloc() that allows batch SWAP_MAP_SHMEM flag setting for
> shmem swap entries.
>
> While we are at it, using folio_nr_pages() to get the number of pages
> of the folio as a preparation.
>
> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>

Reviewed-by: Barry Song <baohua@kernel.org>

> ---
>  include/linux/swap.h | 4 ++--
>  mm/shmem.c           | 6 ++++--
>  mm/swapfile.c        | 4 ++--
>  3 files changed, 8 insertions(+), 6 deletions(-)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 1c8f844a9f0f..248db1dd7812 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -481,7 +481,7 @@ void put_swap_folio(struct folio *folio, swp_entry_t entry);
>  extern swp_entry_t get_swap_page_of_type(int);
>  extern int get_swap_pages(int n, swp_entry_t swp_entries[], int order);
>  extern int add_swap_count_continuation(swp_entry_t, gfp_t);
> -extern void swap_shmem_alloc(swp_entry_t);
> +extern void swap_shmem_alloc(swp_entry_t, int);
>  extern int swap_duplicate(swp_entry_t);
>  extern int swapcache_prepare(swp_entry_t entry, int nr);
>  extern void swap_free_nr(swp_entry_t entry, int nr_pages);
> @@ -548,7 +548,7 @@ static inline int add_swap_count_continuation(swp_entry_t swp, gfp_t gfp_mask)
>         return 0;
>  }
>
> -static inline void swap_shmem_alloc(swp_entry_t swp)
> +static inline void swap_shmem_alloc(swp_entry_t swp, int nr)
>  {
>  }
>
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 4a5254bfd610..22cdc10f27ea 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -1452,6 +1452,7 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
>         struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
>         swp_entry_t swap;
>         pgoff_t index;
> +       int nr_pages;
>
>         /*
>          * Our capabilities prevent regular writeback or sync from ever calling
> @@ -1484,6 +1485,7 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
>         }
>
>         index = folio->index;
> +       nr_pages = folio_nr_pages(folio);
>
>         /*
>          * This is somewhat ridiculous, but without plumbing a SWAP_MAP_FALLOC
> @@ -1536,8 +1538,8 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
>         if (add_to_swap_cache(folio, swap,
>                         __GFP_HIGH | __GFP_NOMEMALLOC | __GFP_NOWARN,
>                         NULL) == 0) {
> -               shmem_recalc_inode(inode, 0, 1);
> -               swap_shmem_alloc(swap);
> +               shmem_recalc_inode(inode, 0, nr_pages);
> +               swap_shmem_alloc(swap, nr_pages);
>                 shmem_delete_from_page_cache(folio, swp_to_radix_entry(swap));
>
>                 mutex_unlock(&shmem_swaplist_mutex);
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index ea023fc25d08..88d73880aada 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -3604,9 +3604,9 @@ static int __swap_duplicate(swp_entry_t entry, unsigned char usage, int nr)
>   * Help swapoff by noting that swap entry belongs to shmem/tmpfs
>   * (in which case its reference count is never incremented).
>   */
> -void swap_shmem_alloc(swp_entry_t entry)
> +void swap_shmem_alloc(swp_entry_t entry, int nr)
>  {
> -       __swap_duplicate(entry, SWAP_MAP_SHMEM, 1);
> +       __swap_duplicate(entry, SWAP_MAP_SHMEM, nr);
>  }
>
>  /*
> --
> 2.39.3
>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH v4 03/10] mm: shmem: extend shmem_partial_swap_usage() to support large folio swap
  2024-08-07  7:31 [PATCH v4 00/10] support large folio swap-out and swap-in for shmem Baolin Wang
  2024-08-07  7:31 ` [PATCH v4 01/10] mm: vmscan: add validation before spliting shmem large folio Baolin Wang
  2024-08-07  7:31 ` [PATCH v4 02/10] mm: swap: extend swap_shmem_alloc() to support batch SWAP_MAP_SHMEM flag setting Baolin Wang
@ 2024-08-07  7:31 ` Baolin Wang
  2024-08-07  7:31 ` [PATCH v4 04/10] mm: shmem: return number of pages beeing freed in shmem_free_swap Baolin Wang
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 21+ messages in thread
From: Baolin Wang @ 2024-08-07  7:31 UTC (permalink / raw)
  To: akpm, hughd
  Cc: willy, david, wangkefeng.wang, chrisl, ying.huang, 21cnbao,
	ryan.roberts, shy828301, ziy, ioworker0, da.gomez, p.raghav,
	baolin.wang, linux-mm, linux-kernel

To support shmem large folio swapout in the following patches, using
xa_get_order() to get the order of the swap entry to calculate the swap
usage of shmem.

Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
 mm/shmem.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index 22cdc10f27ea..02fb188d627f 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -890,7 +890,7 @@ unsigned long shmem_partial_swap_usage(struct address_space *mapping,
 		if (xas_retry(&xas, page))
 			continue;
 		if (xa_is_value(page))
-			swapped++;
+			swapped += 1 << xa_get_order(xas.xa, xas.xa_index);
 		if (xas.xa_index == max)
 			break;
 		if (need_resched()) {
-- 
2.39.3


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH v4 04/10] mm: shmem: return number of pages beeing freed in shmem_free_swap
  2024-08-07  7:31 [PATCH v4 00/10] support large folio swap-out and swap-in for shmem Baolin Wang
                   ` (2 preceding siblings ...)
  2024-08-07  7:31 ` [PATCH v4 03/10] mm: shmem: extend shmem_partial_swap_usage() to support large folio swap Baolin Wang
@ 2024-08-07  7:31 ` Baolin Wang
  2024-08-07  7:31 ` [PATCH v4 05/10] mm: filemap: use xa_get_order() to get the swap entry order Baolin Wang
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 21+ messages in thread
From: Baolin Wang @ 2024-08-07  7:31 UTC (permalink / raw)
  To: akpm, hughd
  Cc: willy, david, wangkefeng.wang, chrisl, ying.huang, 21cnbao,
	ryan.roberts, shy828301, ziy, ioworker0, da.gomez, p.raghav,
	baolin.wang, linux-mm, linux-kernel

From: Daniel Gomez <da.gomez@samsung.com>

Both shmem_free_swap callers expect the number of pages being freed. In
the large folios context, this needs to support larger values other than
0 (used as 1 page being freed) and -ENOENT (used as 0 pages being
freed). In preparation for large folios adoption, make shmem_free_swap
routine return the number of pages being freed. So, returning 0 in this
context, means 0 pages being freed.

While we are at it, changing to use free_swap_and_cache_nr() to free large
order swap entry by Baolin Wang.

Suggested-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Daniel Gomez <da.gomez@samsung.com>
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
 mm/shmem.c | 25 ++++++++++++++++---------
 1 file changed, 16 insertions(+), 9 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index 02fb188d627f..d0d54939da48 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -856,18 +856,22 @@ static void shmem_delete_from_page_cache(struct folio *folio, void *radswap)
 }
 
 /*
- * Remove swap entry from page cache, free the swap and its page cache.
+ * Remove swap entry from page cache, free the swap and its page cache. Returns
+ * the number of pages being freed. 0 means entry not found in XArray (0 pages
+ * being freed).
  */
-static int shmem_free_swap(struct address_space *mapping,
-			   pgoff_t index, void *radswap)
+static long shmem_free_swap(struct address_space *mapping,
+			    pgoff_t index, void *radswap)
 {
+	int order = xa_get_order(&mapping->i_pages, index);
 	void *old;
 
 	old = xa_cmpxchg_irq(&mapping->i_pages, index, radswap, NULL, 0);
 	if (old != radswap)
-		return -ENOENT;
-	free_swap_and_cache(radix_to_swp_entry(radswap));
-	return 0;
+		return 0;
+	free_swap_and_cache_nr(radix_to_swp_entry(radswap), 1 << order);
+
+	return 1 << order;
 }
 
 /*
@@ -1019,7 +1023,7 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
 			if (xa_is_value(folio)) {
 				if (unfalloc)
 					continue;
-				nr_swaps_freed += !shmem_free_swap(mapping,
+				nr_swaps_freed += shmem_free_swap(mapping,
 							indices[i], folio);
 				continue;
 			}
@@ -1086,14 +1090,17 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
 			folio = fbatch.folios[i];
 
 			if (xa_is_value(folio)) {
+				long swaps_freed;
+
 				if (unfalloc)
 					continue;
-				if (shmem_free_swap(mapping, indices[i], folio)) {
+				swaps_freed = shmem_free_swap(mapping, indices[i], folio);
+				if (!swaps_freed) {
 					/* Swap was replaced by page: retry */
 					index = indices[i];
 					break;
 				}
-				nr_swaps_freed++;
+				nr_swaps_freed += swaps_freed;
 				continue;
 			}
 
-- 
2.39.3


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH v4 05/10] mm: filemap: use xa_get_order() to get the swap entry order
  2024-08-07  7:31 [PATCH v4 00/10] support large folio swap-out and swap-in for shmem Baolin Wang
                   ` (3 preceding siblings ...)
  2024-08-07  7:31 ` [PATCH v4 04/10] mm: shmem: return number of pages beeing freed in shmem_free_swap Baolin Wang
@ 2024-08-07  7:31 ` Baolin Wang
  2024-08-07  7:31 ` [PATCH v4 06/10] mm: shmem: use swap_free_nr() to free shmem swap entries Baolin Wang
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 21+ messages in thread
From: Baolin Wang @ 2024-08-07  7:31 UTC (permalink / raw)
  To: akpm, hughd
  Cc: willy, david, wangkefeng.wang, chrisl, ying.huang, 21cnbao,
	ryan.roberts, shy828301, ziy, ioworker0, da.gomez, p.raghav,
	baolin.wang, linux-mm, linux-kernel

In the following patches, shmem will support the swap out of large folios,
which means the shmem mappings may contain large order swap entries, so
using xa_get_order() to get the folio order of the shmem swap entry to
update the '*start' correctly.

Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
 mm/filemap.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/mm/filemap.c b/mm/filemap.c
index 4130be74f6fd..4c312aab8b1f 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2056,6 +2056,8 @@ unsigned find_get_entries(struct address_space *mapping, pgoff_t *start,
 		folio = fbatch->folios[idx];
 		if (!xa_is_value(folio))
 			nr = folio_nr_pages(folio);
+		else
+			nr = 1 << xa_get_order(&mapping->i_pages, indices[idx]);
 		*start = indices[idx] + nr;
 	}
 	return folio_batch_count(fbatch);
@@ -2120,6 +2122,8 @@ unsigned find_lock_entries(struct address_space *mapping, pgoff_t *start,
 		folio = fbatch->folios[idx];
 		if (!xa_is_value(folio))
 			nr = folio_nr_pages(folio);
+		else
+			nr = 1 << xa_get_order(&mapping->i_pages, indices[idx]);
 		*start = indices[idx] + nr;
 	}
 	return folio_batch_count(fbatch);
-- 
2.39.3


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH v4 06/10] mm: shmem: use swap_free_nr() to free shmem swap entries
  2024-08-07  7:31 [PATCH v4 00/10] support large folio swap-out and swap-in for shmem Baolin Wang
                   ` (4 preceding siblings ...)
  2024-08-07  7:31 ` [PATCH v4 05/10] mm: filemap: use xa_get_order() to get the swap entry order Baolin Wang
@ 2024-08-07  7:31 ` Baolin Wang
  2024-08-07  7:31 ` [PATCH v4 07/10] mm: shmem: support large folio allocation for shmem_replace_folio() Baolin Wang
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 21+ messages in thread
From: Baolin Wang @ 2024-08-07  7:31 UTC (permalink / raw)
  To: akpm, hughd
  Cc: willy, david, wangkefeng.wang, chrisl, ying.huang, 21cnbao,
	ryan.roberts, shy828301, ziy, ioworker0, da.gomez, p.raghav,
	baolin.wang, linux-mm, linux-kernel

As a preparation for supporting shmem large folio swapout, use swap_free_nr()
to free some continuous swap entries of the shmem large folio when the
large folio was swapped in from the swap cache. In addition, the index
should also be round down to the number of pages when adding the swapin
folio into the pagecache.

Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
 mm/shmem.c | 16 ++++++++++------
 1 file changed, 10 insertions(+), 6 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index d0d54939da48..f6bab42180ea 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1961,6 +1961,7 @@ static void shmem_set_folio_swapin_error(struct inode *inode, pgoff_t index,
 	struct address_space *mapping = inode->i_mapping;
 	swp_entry_t swapin_error;
 	void *old;
+	int nr_pages;
 
 	swapin_error = make_poisoned_swp_entry();
 	old = xa_cmpxchg_irq(&mapping->i_pages, index,
@@ -1969,6 +1970,7 @@ static void shmem_set_folio_swapin_error(struct inode *inode, pgoff_t index,
 	if (old != swp_to_radix_entry(swap))
 		return;
 
+	nr_pages = folio_nr_pages(folio);
 	folio_wait_writeback(folio);
 	delete_from_swap_cache(folio);
 	/*
@@ -1976,8 +1978,8 @@ static void shmem_set_folio_swapin_error(struct inode *inode, pgoff_t index,
 	 * won't be 0 when inode is released and thus trigger WARN_ON(i_blocks)
 	 * in shmem_evict_inode().
 	 */
-	shmem_recalc_inode(inode, -1, -1);
-	swap_free(swap);
+	shmem_recalc_inode(inode, -nr_pages, -nr_pages);
+	swap_free_nr(swap, nr_pages);
 }
 
 /*
@@ -1996,7 +1998,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
 	struct swap_info_struct *si;
 	struct folio *folio = NULL;
 	swp_entry_t swap;
-	int error;
+	int error, nr_pages;
 
 	VM_BUG_ON(!*foliop || !xa_is_value(*foliop));
 	swap = radix_to_swp_entry(*foliop);
@@ -2043,6 +2045,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
 		goto failed;
 	}
 	folio_wait_writeback(folio);
+	nr_pages = folio_nr_pages(folio);
 
 	/*
 	 * Some architectures may have to restore extra metadata to the
@@ -2056,19 +2059,20 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
 			goto failed;
 	}
 
-	error = shmem_add_to_page_cache(folio, mapping, index,
+	error = shmem_add_to_page_cache(folio, mapping,
+					round_down(index, nr_pages),
 					swp_to_radix_entry(swap), gfp);
 	if (error)
 		goto failed;
 
-	shmem_recalc_inode(inode, 0, -1);
+	shmem_recalc_inode(inode, 0, -nr_pages);
 
 	if (sgp == SGP_WRITE)
 		folio_mark_accessed(folio);
 
 	delete_from_swap_cache(folio);
 	folio_mark_dirty(folio);
-	swap_free(swap);
+	swap_free_nr(swap, nr_pages);
 	put_swap_device(si);
 
 	*foliop = folio;
-- 
2.39.3


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH v4 07/10] mm: shmem: support large folio allocation for shmem_replace_folio()
  2024-08-07  7:31 [PATCH v4 00/10] support large folio swap-out and swap-in for shmem Baolin Wang
                   ` (5 preceding siblings ...)
  2024-08-07  7:31 ` [PATCH v4 06/10] mm: shmem: use swap_free_nr() to free shmem swap entries Baolin Wang
@ 2024-08-07  7:31 ` Baolin Wang
  2024-08-07  7:31 ` [PATCH v4 08/10] mm: shmem: drop folio reference count using 'nr_pages' in shmem_delete_from_page_cache() Baolin Wang
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 21+ messages in thread
From: Baolin Wang @ 2024-08-07  7:31 UTC (permalink / raw)
  To: akpm, hughd
  Cc: willy, david, wangkefeng.wang, chrisl, ying.huang, 21cnbao,
	ryan.roberts, shy828301, ziy, ioworker0, da.gomez, p.raghav,
	baolin.wang, linux-mm, linux-kernel

To support large folio swapin for shmem in the following patches, add
large folio allocation for the new replacement folio in shmem_replace_folio().
Moreover large folios occupy N consecutive entries in the swap cache
instead of using multi-index entries like the page cache, therefore
we should replace each consecutive entries in the swap cache instead
of using the shmem_replace_entry().

As well as updating statistics and folio reference count using the number
of pages in the folio.

Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
 mm/shmem.c | 54 +++++++++++++++++++++++++++++++-----------------------
 1 file changed, 31 insertions(+), 23 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index f6bab42180ea..d94f02ad7bd1 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1889,28 +1889,24 @@ static bool shmem_should_replace_folio(struct folio *folio, gfp_t gfp)
 static int shmem_replace_folio(struct folio **foliop, gfp_t gfp,
 				struct shmem_inode_info *info, pgoff_t index)
 {
-	struct folio *old, *new;
-	struct address_space *swap_mapping;
-	swp_entry_t entry;
-	pgoff_t swap_index;
-	int error;
-
-	old = *foliop;
-	entry = old->swap;
-	swap_index = swap_cache_index(entry);
-	swap_mapping = swap_address_space(entry);
+	struct folio *new, *old = *foliop;
+	swp_entry_t entry = old->swap;
+	struct address_space *swap_mapping = swap_address_space(entry);
+	pgoff_t swap_index = swap_cache_index(entry);
+	XA_STATE(xas, &swap_mapping->i_pages, swap_index);
+	int nr_pages = folio_nr_pages(old);
+	int error = 0, i;
 
 	/*
 	 * We have arrived here because our zones are constrained, so don't
 	 * limit chance of success by further cpuset and node constraints.
 	 */
 	gfp &= ~GFP_CONSTRAINT_MASK;
-	VM_BUG_ON_FOLIO(folio_test_large(old), old);
-	new = shmem_alloc_folio(gfp, 0, info, index);
+	new = shmem_alloc_folio(gfp, folio_order(old), info, index);
 	if (!new)
 		return -ENOMEM;
 
-	folio_get(new);
+	folio_ref_add(new, nr_pages);
 	folio_copy(new, old);
 	flush_dcache_folio(new);
 
@@ -1920,18 +1916,25 @@ static int shmem_replace_folio(struct folio **foliop, gfp_t gfp,
 	new->swap = entry;
 	folio_set_swapcache(new);
 
-	/*
-	 * Our caller will very soon move newpage out of swapcache, but it's
-	 * a nice clean interface for us to replace oldpage by newpage there.
-	 */
+	/* Swap cache still stores N entries instead of a high-order entry */
 	xa_lock_irq(&swap_mapping->i_pages);
-	error = shmem_replace_entry(swap_mapping, swap_index, old, new);
+	for (i = 0; i < nr_pages; i++) {
+		void *item = xas_load(&xas);
+
+		if (item != old) {
+			error = -ENOENT;
+			break;
+		}
+
+		xas_store(&xas, new);
+		xas_next(&xas);
+	}
 	if (!error) {
 		mem_cgroup_replace_folio(old, new);
-		__lruvec_stat_mod_folio(new, NR_FILE_PAGES, 1);
-		__lruvec_stat_mod_folio(new, NR_SHMEM, 1);
-		__lruvec_stat_mod_folio(old, NR_FILE_PAGES, -1);
-		__lruvec_stat_mod_folio(old, NR_SHMEM, -1);
+		__lruvec_stat_mod_folio(new, NR_FILE_PAGES, nr_pages);
+		__lruvec_stat_mod_folio(new, NR_SHMEM, nr_pages);
+		__lruvec_stat_mod_folio(old, NR_FILE_PAGES, -nr_pages);
+		__lruvec_stat_mod_folio(old, NR_SHMEM, -nr_pages);
 	}
 	xa_unlock_irq(&swap_mapping->i_pages);
 
@@ -1951,7 +1954,12 @@ static int shmem_replace_folio(struct folio **foliop, gfp_t gfp,
 	old->private = NULL;
 
 	folio_unlock(old);
-	folio_put_refs(old, 2);
+	/*
+	 * The old folio are removed from swap cache, drop the 'nr_pages'
+	 * reference, as well as one temporary reference getting from swap
+	 * cache.
+	 */
+	folio_put_refs(old, nr_pages + 1);
 	return error;
 }
 
-- 
2.39.3


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH v4 08/10] mm: shmem: drop folio reference count using 'nr_pages' in shmem_delete_from_page_cache()
  2024-08-07  7:31 [PATCH v4 00/10] support large folio swap-out and swap-in for shmem Baolin Wang
                   ` (6 preceding siblings ...)
  2024-08-07  7:31 ` [PATCH v4 07/10] mm: shmem: support large folio allocation for shmem_replace_folio() Baolin Wang
@ 2024-08-07  7:31 ` Baolin Wang
  2024-08-07  7:31 ` [PATCH v4 09/10] mm: shmem: split large entry if the swapin folio is not large Baolin Wang
  2024-08-07  7:31 ` [PATCH v4 10/10] mm: shmem: support large folio swap out Baolin Wang
  9 siblings, 0 replies; 21+ messages in thread
From: Baolin Wang @ 2024-08-07  7:31 UTC (permalink / raw)
  To: akpm, hughd
  Cc: willy, david, wangkefeng.wang, chrisl, ying.huang, 21cnbao,
	ryan.roberts, shy828301, ziy, ioworker0, da.gomez, p.raghav,
	baolin.wang, linux-mm, linux-kernel

To support large folio swapin/swapout for shmem in the following patches,
drop the folio's reference count by the number of pages contained in the
folio when a shmem folio is deleted from shmem pagecache after adding
into swap cache.

Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
 mm/shmem.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index d94f02ad7bd1..345e25425e37 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -851,7 +851,7 @@ static void shmem_delete_from_page_cache(struct folio *folio, void *radswap)
 	__lruvec_stat_mod_folio(folio, NR_FILE_PAGES, -nr);
 	__lruvec_stat_mod_folio(folio, NR_SHMEM, -nr);
 	xa_unlock_irq(&mapping->i_pages);
-	folio_put(folio);
+	folio_put_refs(folio, nr);
 	BUG_ON(error);
 }
 
-- 
2.39.3


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH v4 09/10] mm: shmem: split large entry if the swapin folio is not large
  2024-08-07  7:31 [PATCH v4 00/10] support large folio swap-out and swap-in for shmem Baolin Wang
                   ` (7 preceding siblings ...)
  2024-08-07  7:31 ` [PATCH v4 08/10] mm: shmem: drop folio reference count using 'nr_pages' in shmem_delete_from_page_cache() Baolin Wang
@ 2024-08-07  7:31 ` Baolin Wang
  2024-08-07  7:31 ` [PATCH v4 10/10] mm: shmem: support large folio swap out Baolin Wang
  9 siblings, 0 replies; 21+ messages in thread
From: Baolin Wang @ 2024-08-07  7:31 UTC (permalink / raw)
  To: akpm, hughd
  Cc: willy, david, wangkefeng.wang, chrisl, ying.huang, 21cnbao,
	ryan.roberts, shy828301, ziy, ioworker0, da.gomez, p.raghav,
	baolin.wang, linux-mm, linux-kernel

Now the swap device can only swap-in order 0 folio, even though a large
folio is swapped out. This requires us to split the large entry previously
saved in the shmem pagecache to support the swap in of small folios.

Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
 mm/shmem.c | 100 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 100 insertions(+)

diff --git a/mm/shmem.c b/mm/shmem.c
index 345e25425e37..996062dc196b 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1990,6 +1990,81 @@ static void shmem_set_folio_swapin_error(struct inode *inode, pgoff_t index,
 	swap_free_nr(swap, nr_pages);
 }
 
+static int shmem_split_large_entry(struct inode *inode, pgoff_t index,
+				   swp_entry_t swap, int new_order, gfp_t gfp)
+{
+	struct address_space *mapping = inode->i_mapping;
+	XA_STATE_ORDER(xas, &mapping->i_pages, index, new_order);
+	void *alloced_shadow = NULL;
+	int alloced_order = 0, i;
+
+	for (;;) {
+		int order = -1, split_order = 0;
+		void *old = NULL;
+
+		xas_lock_irq(&xas);
+		old = xas_load(&xas);
+		if (!xa_is_value(old) || swp_to_radix_entry(swap) != old) {
+			xas_set_err(&xas, -EEXIST);
+			goto unlock;
+		}
+
+		order = xas_get_order(&xas);
+
+		/* Swap entry may have changed before we re-acquire the lock */
+		if (alloced_order &&
+		    (old != alloced_shadow || order != alloced_order)) {
+			xas_destroy(&xas);
+			alloced_order = 0;
+		}
+
+		/* Try to split large swap entry in pagecache */
+		if (order > 0 && order > new_order) {
+			if (!alloced_order) {
+				split_order = order;
+				goto unlock;
+			}
+			xas_split(&xas, old, order);
+
+			/*
+			 * Re-set the swap entry after splitting, and the swap
+			 * offset of the original large entry must be continuous.
+			 */
+			for (i = 0; i < 1 << order; i += (1 << new_order)) {
+				pgoff_t aligned_index = round_down(index, 1 << order);
+				swp_entry_t tmp;
+
+				tmp = swp_entry(swp_type(swap), swp_offset(swap) + i);
+				__xa_store(&mapping->i_pages, aligned_index + i,
+					   swp_to_radix_entry(tmp), 0);
+			}
+		}
+
+unlock:
+		xas_unlock_irq(&xas);
+
+		/* split needed, alloc here and retry. */
+		if (split_order) {
+			xas_split_alloc(&xas, old, split_order, gfp);
+			if (xas_error(&xas))
+				goto error;
+			alloced_shadow = old;
+			alloced_order = split_order;
+			xas_reset(&xas);
+			continue;
+		}
+
+		if (!xas_nomem(&xas, gfp))
+			break;
+	}
+
+error:
+	if (xas_error(&xas))
+		return xas_error(&xas);
+
+	return alloced_order;
+}
+
 /*
  * Swap in the folio pointed to by *foliop.
  * Caller has to make sure that *foliop contains a valid swapped folio.
@@ -2026,12 +2101,37 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
 	/* Look it up and read it in.. */
 	folio = swap_cache_get_folio(swap, NULL, 0);
 	if (!folio) {
+		int split_order;
+
 		/* Or update major stats only when swapin succeeds?? */
 		if (fault_type) {
 			*fault_type |= VM_FAULT_MAJOR;
 			count_vm_event(PGMAJFAULT);
 			count_memcg_event_mm(fault_mm, PGMAJFAULT);
 		}
+
+		/*
+		 * Now swap device can only swap in order 0 folio, then we
+		 * should split the large swap entry stored in the pagecache
+		 * if necessary.
+		 */
+		split_order = shmem_split_large_entry(inode, index, swap, 0, gfp);
+		if (split_order < 0) {
+			error = split_order;
+			goto failed;
+		}
+
+		/*
+		 * If the large swap entry has already been split, it is
+		 * necessary to recalculate the new swap entry based on
+		 * the old order alignment.
+		 */
+		if (split_order > 0) {
+			pgoff_t offset = index - round_down(index, 1 << split_order);
+
+			swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
+		}
+
 		/* Here we actually start the io */
 		folio = shmem_swapin_cluster(swap, gfp, info, index);
 		if (!folio) {
-- 
2.39.3


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH v4 10/10] mm: shmem: support large folio swap out
  2024-08-07  7:31 [PATCH v4 00/10] support large folio swap-out and swap-in for shmem Baolin Wang
                   ` (8 preceding siblings ...)
  2024-08-07  7:31 ` [PATCH v4 09/10] mm: shmem: split large entry if the swapin folio is not large Baolin Wang
@ 2024-08-07  7:31 ` Baolin Wang
  9 siblings, 0 replies; 21+ messages in thread
From: Baolin Wang @ 2024-08-07  7:31 UTC (permalink / raw)
  To: akpm, hughd
  Cc: willy, david, wangkefeng.wang, chrisl, ying.huang, 21cnbao,
	ryan.roberts, shy828301, ziy, ioworker0, da.gomez, p.raghav,
	baolin.wang, linux-mm, linux-kernel

Shmem will support large folio allocation [1] [2] to get a better performance,
however, the memory reclaim still splits the precious large folios when trying
to swap out shmem, which may lead to the memory fragmentation issue and can not
take advantage of the large folio for shmeme.

Moreover, the swap code already supports for swapping out large folio without
split, hence this patch set supports the large folio swap out for shmem.

Note the i915_gem_shmem driver still need to be split when swapping, thus
add a new flag 'split_large_folio' for writeback_control to indicate spliting
the large folio.

[1] https://lore.kernel.org/all/cover.1717495894.git.baolin.wang@linux.alibaba.com/
[2] https://lore.kernel.org/all/20240515055719.32577-1-da.gomez@samsung.com/
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
 drivers/gpu/drm/i915/gem/i915_gem_shmem.c |  1 +
 include/linux/writeback.h                 |  1 +
 mm/shmem.c                                |  3 +--
 mm/vmscan.c                               | 14 ++++++++++++--
 4 files changed, 15 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_shmem.c b/drivers/gpu/drm/i915/gem/i915_gem_shmem.c
index c5e1c718a6d2..c66cb9c585e1 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_shmem.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_shmem.c
@@ -308,6 +308,7 @@ void __shmem_writeback(size_t size, struct address_space *mapping)
 		.range_start = 0,
 		.range_end = LLONG_MAX,
 		.for_reclaim = 1,
+		.split_large_folio = 1,
 	};
 	unsigned long i;
 
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 1a54676d843a..75196b0f894f 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -63,6 +63,7 @@ struct writeback_control {
 	unsigned range_cyclic:1;	/* range_start is cyclic */
 	unsigned for_sync:1;		/* sync(2) WB_SYNC_ALL writeback */
 	unsigned unpinned_netfs_wb:1;	/* Cleared I_PINNING_NETFS_WB */
+	unsigned split_large_folio:1;	/* Split large folio for shmem writeback */
 
 	/*
 	 * When writeback IOs are bounced through async layers, only the
diff --git a/mm/shmem.c b/mm/shmem.c
index 996062dc196b..68c9a31bc763 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -795,7 +795,6 @@ static int shmem_add_to_page_cache(struct folio *folio,
 	VM_BUG_ON_FOLIO(index != round_down(index, nr), folio);
 	VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
 	VM_BUG_ON_FOLIO(!folio_test_swapbacked(folio), folio);
-	VM_BUG_ON(expected && folio_test_large(folio));
 
 	folio_ref_add(folio, nr);
 	folio->mapping = mapping;
@@ -1482,7 +1481,7 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
 	 * "force", drivers/gpu/drm/i915/gem/i915_gem_shmem.c gets huge pages,
 	 * and its shmem_writeback() needs them to be split when swapping.
 	 */
-	if (folio_test_large(folio)) {
+	if (wbc->split_large_folio && folio_test_large(folio)) {
 		/* Ensure the subpages are still dirty */
 		folio_test_set_dirty(folio);
 		if (split_huge_page(page) < 0)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 796f65781f4f..21acd6c2fbab 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1267,8 +1267,12 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
 			if (!total_swap_pages)
 				goto activate_locked;
 
-			/* Split shmem folio */
-			if (split_folio_to_list(folio, folio_list))
+			/*
+			 * Only split shmem folio when CONFIG_THP_SWAP
+			 * is not enabled.
+			 */
+			if (!IS_ENABLED(CONFIG_THP_SWAP) &&
+			    split_folio_to_list(folio, folio_list))
 				goto keep_locked;
 		}
 
@@ -1370,10 +1374,16 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
 			 * starts and then write it out here.
 			 */
 			try_to_unmap_flush_dirty();
+try_pageout:
 			switch (pageout(folio, mapping, &plug)) {
 			case PAGE_KEEP:
 				goto keep_locked;
 			case PAGE_ACTIVATE:
+				if (shmem_mapping(mapping) && folio_test_large(folio) &&
+				    !split_folio_to_list(folio, folio_list)) {
+					nr_pages = 1;
+					goto try_pageout;
+				}
 				goto activate_locked;
 			case PAGE_SUCCESS:
 				stat->nr_pageout += nr_pages;
-- 
2.39.3


^ permalink raw reply related	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2024-08-09  3:21 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-08-07  7:31 [PATCH v4 00/10] support large folio swap-out and swap-in for shmem Baolin Wang
2024-08-07  7:31 ` [PATCH v4 01/10] mm: vmscan: add validation before spliting shmem large folio Baolin Wang
2024-08-07 15:53   ` David Hildenbrand
2024-08-08  2:36     ` Baolin Wang
2024-08-08  8:51       ` David Hildenbrand
2024-08-08  9:34         ` Baolin Wang
2024-08-08 10:48           ` Daniel Gomez
2024-08-08 10:57             ` Daniel Gomez
2024-08-08 11:51               ` Baolin Wang
2024-08-08 12:35       ` Matthew Wilcox
2024-08-09  3:21         ` Baolin Wang
2024-08-07  7:31 ` [PATCH v4 02/10] mm: swap: extend swap_shmem_alloc() to support batch SWAP_MAP_SHMEM flag setting Baolin Wang
2024-08-07  8:02   ` Barry Song
2024-08-07  7:31 ` [PATCH v4 03/10] mm: shmem: extend shmem_partial_swap_usage() to support large folio swap Baolin Wang
2024-08-07  7:31 ` [PATCH v4 04/10] mm: shmem: return number of pages beeing freed in shmem_free_swap Baolin Wang
2024-08-07  7:31 ` [PATCH v4 05/10] mm: filemap: use xa_get_order() to get the swap entry order Baolin Wang
2024-08-07  7:31 ` [PATCH v4 06/10] mm: shmem: use swap_free_nr() to free shmem swap entries Baolin Wang
2024-08-07  7:31 ` [PATCH v4 07/10] mm: shmem: support large folio allocation for shmem_replace_folio() Baolin Wang
2024-08-07  7:31 ` [PATCH v4 08/10] mm: shmem: drop folio reference count using 'nr_pages' in shmem_delete_from_page_cache() Baolin Wang
2024-08-07  7:31 ` [PATCH v4 09/10] mm: shmem: split large entry if the swapin folio is not large Baolin Wang
2024-08-07  7:31 ` [PATCH v4 10/10] mm: shmem: support large folio swap out Baolin Wang

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox