From: Chuanhua Han <chuanhuahan@gmail.com>
To: Jan Kara <jack@suse.cz>
Cc: Chuanhua Han <hanchuanhua@oppo.com>, Chris Li <chrisl@kernel.org>,
	 linux-mm <linux-mm@kvack.org>,
	lsf-pc@lists.linux-foundation.org,  ryan.roberts@arm.com,
	21cnbao@gmail.com, david@redhat.com
Subject: Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Swap Abstraction "the pony"
Date: Thu, 14 Mar 2024 19:19:58 +0800
Message-ID: <CANzGp4Ks_uTj2h=G8cBBZLT+qMhWqbJC229xOTR_uHzrf4LpWw@mail.gmail.com>
In-Reply-To: <20240314082651.ckfpp2tyslq2hl2c@quack3>

Jan Kara <jack@suse.cz> wrote on Thursday, March 14, 2024 at 16:28:
>
> On Fri 08-03-24 10:02:20, Chuanhua Han wrote:
> >
> > On 2024/3/7 22:03, Jan Kara wrote:
> > > On Thu 07-03-24 15:56:57, Chuanhua Han via Lsf-pc wrote:
> > >> On 2024/3/1 17:24, Chris Li wrote:
> > >>> In last year's LSF/MM I talked about a VFS-like swap system. That is
> > >>> the pony that was chosen.
> > >>> However, I did not have much chance to go into details.
> > >>>
> > >>> This year, I would like to discuss what it would take to re-architect
> > >>> the whole swap back end from scratch.
> > >>>
> > >>> Let’s start from the requirements for the swap back end.
> > >>>
> > >>> 1) support the existing swap usage (not the implementation).
> > >>>
> > >>> Some other design goals:
> > >>>
> > >>> 2) low per swap entry memory usage.
> > >>>
> > >>> 3) low io latency.
> > >>>
> > >>> What are the functions the swap system needs to support?
> > >>>
> > >>> At the device level, the swap system needs to support a list of swap
> > >>> files with a priority order. Swap devices of the same priority are
> > >>> written to in round-robin fashion. The swap device types include zswap,
> > >>> zram, SSD, spinning hard disk, and a swap file in a file system.
> > >>>
> > >>> At the swap entry level, here is the list of existing swap entry usage:
> > >>>
> > >>> * Swap entry allocation and freeing. Each swap entry needs to be
> > >>> associated with a location in the swapfile's disk space (the offset of
> > >>> the swap entry).
> > >>> * Each swap entry needs to track the map count of the entry. (swap_map)
> > >>> * Each swap entry needs to be able to find the associated memory
> > >>> cgroup. (swap_cgroup_ctrl->map)
> > >>> * Swap cache. Look up a folio/shadow from a swap entry.
> > >>> * Swap page writes through a swapfile in a file system other than a
> > >>> block device. (swap_extent)
> > >>> * Shadow entry. (stored in the swap cache)
> > >>>
> > >>> Any new swap back end might have a different internal implementation,
> > >>> but it needs to support the above usage. For example, using an existing
> > >>> file system as the swap backend, mapping a file per VMA or per swap
> > >>> entry, would require additional data structures to track
> > >>> swap_cgroup_ctrl, on top of the size of the file inode. It would be
> > >>> challenging to meet design goals 2) and 3) using another file system
> > >>> as it is.
> > >>>
> > >>> I am considering grouping the different swap entry data into one single
> > >>> struct and dynamically allocating it, so there is no upfront allocation
> > >>> of swap_map.
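
A minimal sketch of that direction (the struct and helper names below are
hypothetical, not an existing kernel API; only the pieces of per-entry state
come from the list above):

/*
 * Hypothetical sketch: gather the per-entry state that currently lives
 * in separately allocated arrays (swap_map, swap_cgroup_ctrl->map, the
 * swap cache) into one object, allocated only while the entry is in use.
 */
struct swap_entry_info {
    unsigned char count;     /* map count, replaces the swap_map byte */
    unsigned short memcg_id; /* replaces swap_cgroup_ctrl->map */
    void *cache;             /* folio or shadow entry */
};

/* Allocated on demand rather than sized upfront for the whole device. */
static struct swap_entry_info *swap_entry_info_alloc(gfp_t gfp)
{
    return kzalloc(sizeof(struct swap_entry_info), gfp);
}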
> > >>>
> > >>> For swap entry allocation, the current kernel supports swapping out
> > >>> order-0 or PMD-order pages.
> > >>>
> > >>> There are some discussions and patches that add swap-out for the folio
> > >>> sizes in between (mTHP):
> > >>>
> > >>> https://lore.kernel.org/linux-mm/20231025144546.577640-1-ryan.roberts@arm.com/
> > >>>
> > >>> and swap in for mTHP:
> > >>>
> > >>> https://lore.kernel.org/all/20240229003753.134193-1-21cnbao@gmail.com/
> > >>>
> > >>> The introduction of swapping out pages of different orders will further
> > >>> complicate the swap entry fragmentation issue. The swap back end has
> > >>> no way to predict the life cycle of the swap entries. Repeatedly
> > >>> allocating and freeing swap entries of different sizes will fragment
> > >>> the swap entry array. If we can’t allocate contiguous swap entries for
> > >>> an mTHP, we will have to split the mTHP to a smaller size to perform
> > >>> the swap in and out.
> > >>>
> > >>> Current swap only supports 4K pages or PMD-size pages. Adding the
> > >>> other in-between sizes greatly increases the chance of fragmenting
> > >>> the swap entry space. When no more contiguous swap entries are
> > >>> available for an mTHP, the mTHP is forced to split into 4K pages. If
> > >>> we don’t solve the fragmentation issue, it will be a constant source
> > >>> of mTHP splits.
> > >>>
> > >>> Another limitation I would like to address is that swap_writepage can
> > >>> only write out IO in one contiguous chunk, and is not able to perform
> > >>> non-contiguous IO. When the swapfile is close to full, the unused
> > >>> entries are likely to be spread across different locations. It would
> > >>> be nice to be able to read and write a large folio using discontiguous
> > >>> disk IO locations.
> > >>>
> > >>> Some possible ideas for the fragmentation issue.
> > >>>
> > >>> a) A buddy allocator for swap entries, similar to the buddy allocator
> > >>> for memory. We can use a buddy allocator for swap entries to keep
> > >>> low-order swap entries from fragmenting too much of the high-order
> > >>> swap entry space. It should greatly reduce the fragmentation caused by
> > >>> allocating and freeing swap entries of different sizes. However, the
> > >>> buddy allocator has its own limits as well. Unlike system memory,
> > >>> which we can move and compact, there is no rmap for a swap entry, so
> > >>> it is much harder to move a swap entry to another disk location. So a
> > >>> buddy allocator for swap will help, but will not solve all the
> > >>> fragmentation issues.
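
For reference, a rough sketch of the shape such a buddy scheme could take,
loosely modeled on the page allocator's free_area; the names and the order
bound below are hypothetical, and buddy splitting/merging is omitted:

#define SWAP_MAX_ORDER 9    /* hypothetical: orders 0 through PMD order */

struct swap_free_area {
    struct list_head free_list; /* free runs of 1 << order swap slots */
    unsigned long nr_free;
};

/* One free list per supported order, analogous to zone->free_area[]. */
struct swap_buddy {
    struct swap_free_area area[SWAP_MAX_ORDER];
};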
> > >> I have an idea here😁
> > >>
> > >> Each swap device is divided into multiple chunks, and each chunk serves
> > >> allocations of a single order (the order of the folio being swapped
> > >> out; each chunk is used for only one order).
> > >> This can address the fragmentation problem; it is much simpler than a
> > >> buddy allocator, easier to implement, and can accommodate multiple
> > >> sizes, similar to a slab allocator for small objects.
> > >>
> > >> 1) Add structure members
> > >> In the swap_info_struct structure, we only need to add an offset array
> > >> holding the starting offset of the search for each order.
> > >> eg:
> > >>
> > >> #define MTHP_NR_ORDER 9
> > >>
> > >> struct swap_info_struct {
> > >>     ...
> > >>     long order_off[MTHP_NR_ORDER];
> > >>     ...
> > >> };
> > >>
> > >> Note: order_off[i] == -1 indicates that order i is not supported.
> > >>
> > >> 2) Initialize
> > >> Set the proportion of the swap device occupied by each order.
> > >> For the sake of simplicity, assume there are 8 kinds of orders.
> > >> Number of slots occupied by each order: chunk_size = 1/8 * maxpages
> > >> (maxpages is the maximum number of available slots in the current
> > >> swap device).
> > > Well, but then if you fill in space of a particular order and need to swap
> > > out a page of that order what do you do? Return ENOSPC prematurely?
> > If we swap out a subpage of a large folio (due to a split of the large
> > folio), we simply search for a free swap entry starting from order_off[0].
>
> I meant what are you going to do if you want to swapout 2MB huge page but
> you don't have any free swap entry of the appropriate order? History shows
> that these schemes where you partition available space into buckets of
> pages of different order tend to fragment rather quickly, so you need to
> also implement some defragmentation / compaction scheme and once you do
> that you are at the complexity of a standard filesystem block allocator.
> That is all I wanted to point at :)
OK, got it! It's true that my approach doesn't eliminate fragmentation, but
it can be mitigated to some extent, and the method itself doesn't currently
involve complex file system operations.
>
>                                                                 Honza
> --
> Jan Kara <jack@suse.com>
> SUSE Labs, CR
>
Thanks,
Chuanhua


