From: Usama Arif <usamaarif642@gmail.com>
To: Barry Song <21cnbao@gmail.com>
Cc: lsf-pc@lists.linux-foundation.org,
	Linux Memory Management List <linux-mm@kvack.org>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Yosry Ahmed <yosryahmed@google.com>,
	Shakeel Butt <shakeel.butt@linux.dev>,
	Yu Zhao <yuzhao@google.com>
Subject: Re: [LSF/MM/BPF TOPIC] Large folio (z)swapin
Date: Fri, 10 Jan 2025 10:40:25 +0000
Message-ID: <a6172c45-70ee-49ac-8ab4-d922dd19a661@gmail.com>
In-Reply-To: <CAGsJ_4zNE1yvrYdj46piw5no8WJ08xAJ+b4JRnqW0FeXyqn7rw@mail.gmail.com>



On 10/01/2025 10:30, Barry Song wrote:
> On Fri, Jan 10, 2025 at 11:26 PM Usama Arif <usamaarif642@gmail.com> wrote:
>>
>>
>>
>> On 10/01/2025 10:09, Barry Song wrote:
>>> Hi Usama,
>>>
>>> Please include me in the discussion. I'll try to attend, at least remotely.
>>>
>>> On Fri, Jan 10, 2025 at 9:06 AM Usama Arif <usamaarif642@gmail.com> wrote:
>>>>
>>>> I would like to propose a session to discuss the ongoing work
>>>> around large folio swapin, whether it's traditional swap, zswap
>>>> or zram.
>>>>
>>>> Large folios have well-known advantages that have been discussed
>>>> before: fewer page faults, batched PTE and rmap manipulation,
>>>> shorter LRU lists, and TLB coalescing (on arm64 and AMD).
>>>> However, swapping in large folios has its own drawbacks, such as
>>>> higher swap thrashing.
>>>> I had initially sent an RFC for zswapin of large folios in [1],
>>>> but it caused a regression in kernel build time due to swap
>>>> thrashing, which I am confident is happening with zram large
>>>> folio swapin as well (which is already merged in the kernel).
>>>>
>>>> Some of the points we could discuss in the session:
>>>>
>>>> - What is the right (preferably open source) benchmark to test
>>>> swapin of large folios? Kernel build time in a limited memory
>>>> cgroup shows a regression, while microbenchmarks show a massive
>>>> improvement; maybe there are benchmarks where TLB misses are a big
>>>> factor and would show an improvement.
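>>>>
>>>> One rough thrashing signal that could complement build time is the
>>>> anon refault count. A minimal sketch, assuming a kernel that
>>>> exposes workingset_refault_anon in /proc/vmstat:
>>>>
>>>> #!/bin/bash
>>>> # Count anonymous refaults caused by a workload; a large delta
>>>> # means pages are swapped back in soon after being evicted.
>>>> refaults() { awk '/^workingset_refault_anon/ {print $2}' /proc/vmstat; }
>>>> before=$(refaults)
>>>> "$@"               # the workload under test, passed as arguments
>>>> after=$(refaults)
>>>> echo "anon refaults: $((after - before))"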
>>>
>>> My understanding is that it largely depends on the workload. In interactive
>>> scenarios, such as on a phone, swap thrashing is not an issue because
>>> there is minimal to no thrashing for the app occupying the screen
>>> (foreground). In such cases, swap bandwidth becomes the most critical factor
>>> in improving app switching speed, especially when multiple applications
>>> are switching between background and foreground states.
>>>
>>>>
>>>> - We could have something like
>>>> /sys/kernel/mm/transparent_hugepage/hugepages-*kB/swapin_enabled
>>>> to enable/disable swapin, but it's going to be difficult to tune:
>>>> the optimum values might differ between workloads, and the knobs
>>>> are likely to be left at their defaults. Is there some dynamic way
>>>> to decide when to swap in large folios and when to fall back to
>>>> smaller folios? The swapin_readahead() swapcache path, which only
>>>> supports 4K folios at the moment, has a readahead window based on
>>>> hits. However, readahead is a folio flag and not a page flag, so
>>>> this method can't be used: once a large folio is swapped in, we
>>>> won't get a fault, and subsequent hits on other pages of the large
>>>> folio won't be recorded.
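>>>>
>>>> If such a knob were added, usage would presumably mirror the
>>>> existing per-size "enabled" files. A minimal sketch, assuming the
>>>> proposed (not yet existing) swapin_enabled file with always/never
>>>> semantics:
>>>>
>>>> #!/bin/bash
>>>> # Hypothetical per-size control: allow 16kB swapin, disable 64kB.
>>>> base=/sys/kernel/mm/transparent_hugepage
>>>> echo always > $base/hugepages-16kB/swapin_enabled  # proposed knob
>>>> echo never > $base/hugepages-64kB/swapin_enabled   # proposed knob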
>>>>
>>>> - For zswap and zram, doing larger block compression/decompression
>>>> might offset the regression from swap thrashing, but it brings its
>>>> own issues. For example, once a large folio has been swapped out,
>>>> it could fail to swap in as a large folio and fall back to 4K,
>>>> resulting in redundant decompressions.
>>>
>>> That's correct. My current workaround involves swapping four small
>>> folios, and zsmalloc will compress and decompress in chunks of four
>>> pages regardless of the actual size of the mTHP; the improvement in
>>> compression ratio and speed becomes less significant beyond four
>>> pages, even though there is still some increase.
>>>
>>> Our recent experiments on phones also show that enabling direct
>>> reclamation for do_swap_page() to allocate order-2 mTHP results in
>>> a 0% allocation failure rate; this probably removes the need to
>>> fall back to 4 small folios. (Note that our experiments include
>>> Yu's TAO, which Android GKI has already merged. However, since 2 is
>>> less than PAGE_ALLOC_COSTLY_ORDER, we might achieve similar results
>>> even without Yu's TAO, although I have not confirmed this.)
>>>
>>
>> Hi Barry,
>>
>> Thanks for the comments!
>>
>> I haven't seen any activity on TAO on the mailing list recently. Do
>> you know if there are any plans for it to be sent for upstream
>> review? I have cc'ed Yu Zhao as well.
>>
>>
>>>> Would this also mean that swapin of large folios from traditional
>>>> swap isn't something we should proceed with?
>>>>
>>>> - Should we even support large folio swapin? You often have high
>>>> swap activity when the system/cgroup is close to running out of
>>>> memory; at that point, maybe the best way forward is to just swap
>>>> in 4K pages and let khugepaged [2], [3] collapse them if the
>>>> surrounding pages are swapped in as well.
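>>>>
>>>> khugepaged already exposes a tunable that bounds how much swapin a
>>>> collapse may trigger. A sketch of capping that cost while testing
>>>> this approach (the max_ptes_swap file exists today; the value 16
>>>> is just illustrative):
>>>>
>>>> #!/bin/bash
>>>> # Allow a collapse to swap in at most 16 of the 512 PTEs it covers
>>>> # (the default is 64); lower values make collapses cheaper.
>>>> echo 16 > /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_swap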
>>>
>>> This approach might be suitable for non-interactive scenarios, such
>>> as building a kernel within a memory control group (memcg) or
>>> running other server applications. However, performing collapse in
>>> interactive and power-sensitive scenarios would be unnecessary and
>>> could waste power on memory migration and unmap/map operations.
>>>
>>> However, it is quite challenging to automatically determine the type
>>> of workload the system is running. I feel we still need a global
>>> control to decide whether to enable mTHP swap-in (not necessarily
>>> per size, but at least at a global level). That said, there is
>>> evident resistance to introducing additional controls to enable or
>>> disable mTHP features.
>>>
>>> By the way, Usama, have you ever tried switching between MGLRU and
>>> the traditional active/inactive LRU? My experience shows a
>>> significant difference in swap thrashing: the active/inactive LRU
>>> exhibits much less of it in my local kernel build tests.
>>>
>>
>> I never tried with MGLRU enabled, so I am probably seeing the lowest
>> amount of swap thrashing.
> 
> Are you sure, Usama, since mglru is enabled by default? I have to
> echo 0 to manually disable it.
> 

Yes, I don't have CONFIG_LRU_GEN set in my defconfig. I don't think it
is set by default either, at least on x86:

$ make defconfig
$ grep LRU_GEN .config
# CONFIG_LRU_GEN is not set
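
(On a kernel built with CONFIG_LRU_GEN=y, the runtime state can also
be checked, e.g.:

$ cat /sys/kernel/mm/lru_gen/enabled
0x0007

where a non-zero value means MGLRU is active; the 0x0007 shown here is
just an example.)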

Thanks,
Usama

>>
>> Thanks,
>> Usama
>>
>>> Here are my numbers from the latest mm-unstable:
>>>
>>> *********** default mglru:   ***********
>>>
>>> root@barry-desktop:/home/barry/develop/linux# ./build.sh
>>> *** Executing round 1 ***
>>> real 6m44.561s
>>> user 46m53.274s
>>> sys 3m48.585s
>>> pswpin: 1286081
>>> pswpout: 3147936
>>> 64kB-swpout: 0
>>> 32kB-swpout: 0
>>> 16kB-swpout: 714580
>>> 64kB-swpin: 0
>>> 32kB-swpin: 0
>>> 16kB-swpin: 286881
>>> pgpgin: 17199072
>>> pgpgout: 21493892
>>> swpout_zero: 229163
>>> swpin_zero: 84353
>>>
>>> ******** disable mglru ********
>>>
>>> root@barry-desktop:/home/barry/develop/linux# echo 0 >
>>> /sys/kernel/mm/lru_gen/enabled
>>>
>>> root@barry-desktop:/home/barry/develop/linux# ./build.sh
>>> *** Executing round 1 ***
>>> real 6m27.944s
>>> user 46m41.832s
>>> sys 3m30.635s
>>> pswpin: 474036
>>> pswpout: 1434853
>>> 64kB-swpout: 0
>>> 32kB-swpout: 0
>>> 16kB-swpout: 331755
>>> 64kB-swpin: 0
>>> 32kB-swpin: 0
>>> 16kB-swpin: 106333
>>> pgpgin: 11763720
>>> pgpgout: 14551524
>>> swpout_zero: 145050
>>> swpin_zero: 87981
>>>
>>> my build script:
>>>
>>> #!/bin/bash
>>> echo never > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled
>>> echo never > /sys/kernel/mm/transparent_hugepage/hugepages-32kB/enabled
>>> echo always > /sys/kernel/mm/transparent_hugepage/hugepages-16kB/enabled
>>> echo never > /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled
>>>
>>> vmstat_path="/proc/vmstat"
>>> thp_base_path="/sys/kernel/mm/transparent_hugepage"
>>>
>>> read_values() {
>>>     pswpin=$(grep "pswpin" $vmstat_path | awk '{print $2}')
>>>     pswpout=$(grep "pswpout" $vmstat_path | awk '{print $2}')
>>>     pgpgin=$(grep "pgpgin" $vmstat_path | awk '{print $2}')
>>>     pgpgout=$(grep "pgpgout" $vmstat_path | awk '{print $2}')
>>>     swpout_zero=$(grep "swpout_zero" $vmstat_path | awk '{print $2}')
>>>     swpin_zero=$(grep "swpin_zero" $vmstat_path | awk '{print $2}')
>>>     swpout_64k=$(cat $thp_base_path/hugepages-64kB/stats/swpout 2>/dev/null || echo 0)
>>>     swpout_32k=$(cat $thp_base_path/hugepages-32kB/stats/swpout 2>/dev/null || echo 0)
>>>     swpout_16k=$(cat $thp_base_path/hugepages-16kB/stats/swpout 2>/dev/null || echo 0)
>>>     swpin_64k=$(cat $thp_base_path/hugepages-64kB/stats/swpin 2>/dev/null || echo 0)
>>>     swpin_32k=$(cat $thp_base_path/hugepages-32kB/stats/swpin 2>/dev/null || echo 0)
>>>     swpin_16k=$(cat $thp_base_path/hugepages-16kB/stats/swpin 2>/dev/null || echo 0)
>>>     echo "$pswpin $pswpout $swpout_64k $swpout_32k $swpout_16k $swpin_64k $swpin_32k $swpin_16k $pgpgin $pgpgout $swpout_zero $swpin_zero"
>>> }
>>>
>>> for ((i=1; i<=1; i++))
>>> do
>>>   echo
>>>   echo "*** Executing round $i ***"
>>>   make ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu- clean 1>/dev/null 2>/dev/null
>>>   echo 3 > /proc/sys/vm/drop_caches
>>>
>>>   #kernel build
>>>   initial_values=($(read_values))
>>>   time systemd-run --scope -p MemoryMax=1G make ARCH=arm64 \
>>>         CROSS_COMPILE=aarch64-linux-gnu- vmlinux -j10 1>/dev/null 2>/dev/null
>>>   final_values=($(read_values))
>>>
>>>   echo "pswpin: $((final_values[0] - initial_values[0]))"
>>>   echo "pswpout: $((final_values[1] - initial_values[1]))"
>>>   echo "64kB-swpout: $((final_values[2] - initial_values[2]))"
>>>   echo "32kB-swpout: $((final_values[3] - initial_values[3]))"
>>>   echo "16kB-swpout: $((final_values[4] - initial_values[4]))"
>>>   echo "64kB-swpin: $((final_values[5] - initial_values[5]))"
>>>   echo "32kB-swpin: $((final_values[6] - initial_values[6]))"
>>>   echo "16kB-swpin: $((final_values[7] - initial_values[7]))"
>>>   echo "pgpgin: $((final_values[8] - initial_values[8]))"
>>>   echo "pgpgout: $((final_values[9] - initial_values[9]))"
>>>   echo "swpout_zero: $((final_values[10] - initial_values[10]))"
>>>   echo "swpin_zero: $((final_values[11] - initial_values[11]))"
>>>   sync
>>>   sleep 10
>>> done
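>>>
>>> (The script needs to run as root, e.g. "sudo ./build.sh", since it
>>> writes the THP sysfs knobs and drop_caches; the MemoryMax=1G scope
>>> is what forces the build to swap.)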
>>>
>>>>
>>>> [1] https://lore.kernel.org/all/20241018105026.2521366-1-usamaarif642@gmail.com/
>>>> [2] https://lore.kernel.org/all/20250108233128.14484-1-npache@redhat.com/
>>>> [3] https://lore.kernel.org/lkml/20241216165105.56185-1-dev.jain@arm.com/
>>>>
>>>> Thanks,
>>>> Usama
>>>
> 
> Thanks
> Barry


