* [LSF/MM/BPF TOPIC] Swap Abstraction "the pony"
@ 2024-03-01  9:24 Chris Li
From: Chris Li @ 2024-03-01  9:24 UTC (permalink / raw)
  To: lsf-pc, linux-mm, ryan.roberts, David Hildenbrand, Barry Song,
	Chuanhua Han

At last year's LSF/MM I talked about a VFS-like swap system. That is
the pony that was chosen. However, I did not have much chance to go
into the details.

This year, I would like to discuss what it would take to re-architect
the whole swap back end from scratch.

Let’s start with the requirements for the swap back end.

1) Support the existing swap usage (though not necessarily the
existing implementation).

Some other design goals:

2) Low per-swap-entry memory usage.

3) Low IO latency.

What are the functions the swap system needs to support?

At the device level, the swap system needs to support a list of swap
files/devices with a priority order. Swap devices of the same priority
are written to round robin. The swap device types include zswap, zram,
SSD, spinning hard disk, and a swap file in a file system.
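
To illustrate, here is a minimal userspace-style C sketch (the names
and list handling are mine, not the kernel's) of picking a swap device
from a priority-sorted list while rotating same-priority devices round
robin:

#include <stddef.h>

struct swap_dev {
	int prio;                 /* higher value = preferred */
	long nr_free;             /* free slots left on this device */
	struct swap_dev *next;    /* list kept sorted by descending prio */
};

/*
 * Pick the next device to write to: take the first device with free
 * space, then rotate it behind its same-priority peers so that devices
 * of equal priority are written to round robin.
 */
static struct swap_dev *pick_swap_dev(struct swap_dev **head)
{
	struct swap_dev *d, *prev = NULL;

	for (d = *head; d; prev = d, d = d->next) {
		struct swap_dev *last = d;

		if (!d->nr_free)
			continue;

		while (last->next && last->next->prio == d->prio)
			last = last->next;
		if (last != d) {
			/* unlink d and re-insert it after its last peer */
			if (prev)
				prev->next = d->next;
			else
				*head = d->next;
			d->next = last->next;
			last->next = d;
		}
		return d;
	}
	return NULL;	/* every device is full */
}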

At the swap entry level, here is the list of existing swap entry usage:

* Swap entry allocation and freeing. Each swap entry needs to be
associated with a location in the swapfile's disk space (the offset of
the swap entry).
* Each swap entry needs to track the map count of the entry. (swap_map)
* Each swap entry needs to be able to find the associated memory
cgroup. (swap_cgroup_ctrl->map)
* Swap cache: look up the folio/shadow from a swap entry.
* Swap page writes through a swapfile sitting in a file system, rather
than on a raw block device. (swap_extent)
* Shadow entries. (stored in the swap cache)
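
As a toy model (not the kernel's actual definitions), the per-entry
state above can be pictured as parallel arrays indexed by swap offset,
all sized up front at swapon time whether or not the slots are used:

#include <stdint.h>

#define NR_SWAP_SLOTS	(1 << 20)	/* example device size, in 4K slots */

/* Parallel per-slot arrays, allocated up front for the whole device. */
static uint8_t	slot_map_count[NR_SWAP_SLOTS];	/* swap_map: map count       */
static uint16_t	slot_memcg_id[NR_SWAP_SLOTS];	/* swap_cgroup: owning memcg */
static void	*slot_cache[NR_SWAP_SLOTS];	/* swap cache: folio/shadow  */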

Any new swap back end might have a different internal implementation,
but it needs to support the above usage. For example, using an
existing file system as the swap back end, with a per-VMA or
per-swap-entry mapping to a file, would require an additional data
structure to track swap_cgroup_ctrl; combined with the size of the
file inode, it would be challenging to meet design goals 2) and 3)
using another file system as-is.

I am considering grouping the different per-swap-entry data into one
single struct and allocating it dynamically, so there is no upfront
allocation of swap_map.
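
A very rough sketch of that direction, with hypothetical names, could
look like the following; the point is one dynamic allocation per
in-use entry instead of upfront per-slot arrays:

#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical consolidated per-entry descriptor: allocated only when
 * a swap entry is actually in use, replacing the upfront swap_map,
 * swap_cgroup and swap cache bookkeeping for unused slots.
 */
struct swap_desc {
	uint64_t offset;	/* location in the swap device/file      */
	uint32_t map_count;	/* how many references hold this entry   */
	uint16_t memcg_id;	/* owning memory cgroup                  */
	void *cache;		/* swap cache folio, or shadow entry     */
};

static struct swap_desc *alloc_swap_desc(uint64_t offset)
{
	struct swap_desc *d = calloc(1, sizeof(*d));

	if (d)
		d->offset = offset;
	return d;
}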

For swap entry allocation: the current kernel supports swapping out
order-0 or PMD-order pages.

There are some discussions and patches that add swap-out for the folio
sizes in between (mTHP):

https://lore.kernel.org/linux-mm/20231025144546.577640-1-ryan.roberts@arm.com/

and swap-in for mTHP:

https://lore.kernel.org/all/20240229003753.134193-1-21cnbao@gmail.com/

The introduction of swapping pages of different orders will further
complicate the swap entry fragmentation issue. The swap back end has
no way to predict the life cycle of the swap entries. Repeatedly
allocating and freeing swap entries of different sizes will fragment
the swap entry array. If we can't allocate contiguous swap entries for
an mTHP, we will have to split the mTHP into a smaller size to perform
the swap out and swap in.

The current swap code only supports 4K pages or PMD-size pages. Adding
the other in-between sizes greatly increases the chance of fragmenting
the swap entry space. When there are no more contiguous swap entries
for an mTHP, the mTHP is forced to split into 4K pages. If we don't
solve the fragmentation issue, it will be a constant source of mTHP
splits.

Another limitation I would like to address is that swap_writepage can
only write out IO in one contiguous chunk; it is not able to perform
non-contiguous IO. When the swapfile is close to full, the unused
entries are likely to be spread across different locations. It would
be nice to be able to read and write a large folio using discontiguous
disk IO locations.
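
To make that concrete, here is an illustrative sketch (not an existing
kernel interface) of the kind of helper such a path would need:
coalesce the per-subpage slot list of a large folio into contiguous IO
segments and issue one IO per segment:

#include <stddef.h>
#include <stdint.h>

struct io_segment {
	uint64_t disk_off;	/* starting slot on the swap device */
	size_t folio_idx;	/* first subpage covered            */
	uint32_t nr_pages;	/* length of the contiguous run     */
};

/*
 * Turn the per-subpage slot list of a large folio into the minimal set
 * of contiguous IO segments; one bio (or equivalent) per segment.
 */
static size_t build_io_segments(const uint64_t *slots, size_t nr,
				struct io_segment *segs)
{
	size_t nseg = 0;

	for (size_t i = 0; i < nr; i++) {
		if (nseg && slots[i] == segs[nseg - 1].disk_off +
					segs[nseg - 1].nr_pages) {
			segs[nseg - 1].nr_pages++;	/* extend the run */
		} else {
			segs[nseg].disk_off = slots[i];
			segs[nseg].folio_idx = i;
			segs[nseg].nr_pages = 1;
			nseg++;
		}
	}
	return nseg;
}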

Some possible ideas for the fragmentation issue:

a) A buddy allocator for swap entries, similar to the buddy allocator
for memory. We can use a buddy allocator system for swap entries to
keep low-order swap entries from fragmenting too many of the
high-order swap entries. It should greatly reduce the fragmentation
caused by allocating and freeing swap entries of different sizes.
However, the buddy allocator has its own limits as well. Unlike system
memory, which we can move and compact, there is no rmap for swap
entries, so it is much harder to move a swap entry to another disk
location. The buddy allocator for swap will therefore help, but not
solve all of the fragmentation issues.
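
For illustration only, a bare-bones buddy-style allocator over the
swap offset space could look like the sketch below; it assumes
power-of-two orders and a simple free list per order, and ignores the
real locking and scanning constraints:

#include <stdbool.h>
#include <stdint.h>

#define MAX_ORDER	9		/* up to PMD order (512 slots)  */
#define NR_SLOTS	(1u << 16)	/* toy swap area: 64K 4K slots  */

/* One free list per order; a free block is named by its first slot. */
static uint32_t free_list[MAX_ORDER + 1][NR_SLOTS];
static uint32_t free_count[MAX_ORDER + 1];

static void push_free(int order, uint32_t slot)
{
	free_list[order][free_count[order]++] = slot;
}

static void swap_buddy_init(void)
{
	for (uint32_t s = 0; s < NR_SLOTS; s += 1u << MAX_ORDER)
		push_free(MAX_ORDER, s);
}

/*
 * Allocate a naturally aligned run of 2^order slots, splitting a
 * higher-order block when needed, the same way the page buddy
 * allocator does.  (Freeing and buddy merging are omitted here.)
 */
static bool swap_buddy_alloc(int order, uint32_t *slot)
{
	int o = order;

	while (o <= MAX_ORDER && !free_count[o])
		o++;
	if (o > MAX_ORDER)
		return false;	/* fragmented: no large enough run left */

	*slot = free_list[o][--free_count[o]];
	while (o > order) {	/* split, give the upper buddy back */
		o--;
		push_free(o, *slot + (1u << o));
	}
	return true;
}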

b) Large swap entries. Take a file as an example: a file on a file
system can be written to discontinuous disk locations, and the file
system is responsible for tracking how file offsets map to disk
locations. A large swap entry can have a similar indirection array
mapping out the disk locations of the different subpages within a
folio. This allows a large folio to be written out to discontiguous
swap entries in the swap file. The array will need to be stored
somewhere as part of the overhead. When allocating swap entries for
the folio, we can allocate a batch of smaller 4K swap entries into an
array, then use this array to read/write the large folio. There will
be a lot of plumbing work to get it to work.
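
A sketch of what such an indirection could look like (hypothetical
structures, just to show the shape of the plumbing): the folio-level
entry carries a per-subpage table of 4K slots allocated in a batch:

#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical "large swap entry": one object per swapped-out folio,
 * holding a per-subpage table of 4K slots that need not be contiguous
 * on the swap device.  The table itself is the extra overhead.
 */
struct large_swap_entry {
	uint32_t nr_pages;	/* subpages in the folio             */
	uint64_t *slots;	/* slots[i] = disk slot of subpage i */
};

static struct large_swap_entry *
alloc_large_entry(uint32_t nr_pages, int (*alloc_one_slot)(uint64_t *slot))
{
	struct large_swap_entry *e = malloc(sizeof(*e));

	if (!e)
		return NULL;
	e->nr_pages = nr_pages;
	e->slots = calloc(nr_pages, sizeof(*e->slots));
	if (!e->slots)
		goto err;

	/* Each subpage takes whatever free 4K slot the allocator hands
	 * out; the writeout path later issues one IO per contiguous run.
	 * (Unwinding of partially allocated slots is omitted.) */
	for (uint32_t i = 0; i < nr_pages; i++) {
		if (alloc_one_slot(&e->slots[i]))
			goto err;
	}
	return e;
err:
	free(e->slots);
	free(e);
	return NULL;
}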

Solutions a) and b) can work together as well: only use b) if we are
not able to allocate contiguous swap entries from a).

Chris

