* [LSF/MM/BPF TOPIC] Swap Abstraction "the pony"
@ 2024-03-01  9:24 Chris Li
From: Chris Li @ 2024-03-01  9:24 UTC (permalink / raw)
  To: lsf-pc, linux-mm, ryan.roberts, David Hildenbrand, Barry Song,
	Chuanhua Han

At last year's LSF/MM I talked about a VFS-like swap system. That is
the pony that was chosen. However, I did not have much chance to go
into the details.

This year, I would like to discuss what it would take to re-architect
the whole swap back end from scratch.

Let’s start with the requirements for the swap back end.

1) Support the existing swap usage (though not necessarily the
existing implementation).

Some other design goals:

2) Low per-swap-entry memory usage.

3) Low IO latency.

What are the functions the swap system needs to support?

At the device level, the swap system needs to support a list of swap
files/devices with a priority order. Swap devices of the same priority
are written to round robin. The swap device types include zswap, zram,
SSD, spinning hard disk, and a swap file in a file system.
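
To illustrate, here is a minimal userspace-style C sketch (the names
and list handling are mine, not the kernel's) of picking a swap device
from a priority-sorted list while rotating same-priority devices round
robin:

#include <stddef.h>

struct swap_dev {
	int prio;                 /* higher value = preferred */
	long nr_free;             /* free slots left on this device */
	struct swap_dev *next;    /* list kept sorted by descending prio */
};

/*
 * Pick the next device to write to: take the first device with free
 * space, then rotate it behind its same-priority peers so that devices
 * of equal priority are written to round robin.
 */
static struct swap_dev *pick_swap_dev(struct swap_dev **head)
{
	struct swap_dev *d, *prev = NULL;

	for (d = *head; d; prev = d, d = d->next) {
		struct swap_dev *last = d;

		if (!d->nr_free)
			continue;

		while (last->next && last->next->prio == d->prio)
			last = last->next;
		if (last != d) {
			/* unlink d and re-insert it after its last peer */
			if (prev)
				prev->next = d->next;
			else
				*head = d->next;
			d->next = last->next;
			last->next = d;
		}
		return d;
	}
	return NULL;	/* every device is full */
}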

At the swap entry level, here is the list of existing swap entry usage:

* Swap entry allocation and freeing. Each swap entry needs to be
associated with a location in the swapfile's disk space (the offset of
the swap entry).
* Each swap entry needs to track the map count of the entry. (swap_map)
* Each swap entry needs to be able to find the associated memory
cgroup. (swap_cgroup_ctrl->map)
* Swap cache: look up the folio/shadow from a swap entry.
* Swap page writes through a swapfile sitting in a file system, rather
than on a raw block device. (swap_extent)
* Shadow entries. (stored in the swap cache)
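
As a toy model (not the kernel's actual definitions), the per-entry
state above can be pictured as parallel arrays indexed by swap offset,
all sized up front at swapon time whether or not the slots are used:

#include <stdint.h>

#define NR_SWAP_SLOTS	(1 << 20)	/* example device size, in 4K slots */

/* Parallel per-slot arrays, allocated up front for the whole device. */
static uint8_t	slot_map_count[NR_SWAP_SLOTS];	/* swap_map: map count       */
static uint16_t	slot_memcg_id[NR_SWAP_SLOTS];	/* swap_cgroup: owning memcg */
static void	*slot_cache[NR_SWAP_SLOTS];	/* swap cache: folio/shadow  */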

Any new swap back end might have a different internal implementation,
but it needs to support the above usage. For example, using an
existing file system as the swap back end, with a per-VMA or
per-swap-entry mapping to a file, would require an additional data
structure to track swap_cgroup_ctrl; combined with the size of the
file inode, it would be challenging to meet design goals 2) and 3)
using another file system as-is.

I am considering grouping the different per-swap-entry data into one
single struct and allocating it dynamically, so there is no upfront
allocation of swap_map.
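
A very rough sketch of that direction, with hypothetical names, could
look like the following; the point is one dynamic allocation per
in-use entry instead of upfront per-slot arrays:

#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical consolidated per-entry descriptor: allocated only when
 * a swap entry is actually in use, replacing the upfront swap_map,
 * swap_cgroup and swap cache bookkeeping for unused slots.
 */
struct swap_desc {
	uint64_t offset;	/* location in the swap device/file      */
	uint32_t map_count;	/* how many references hold this entry   */
	uint16_t memcg_id;	/* owning memory cgroup                  */
	void *cache;		/* swap cache folio, or shadow entry     */
};

static struct swap_desc *alloc_swap_desc(uint64_t offset)
{
	struct swap_desc *d = calloc(1, sizeof(*d));

	if (d)
		d->offset = offset;
	return d;
}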

For swap entry allocation: the current kernel supports swapping out
order-0 or PMD-order pages.

There are some discussions and patches that add swap-out for the folio
sizes in between (mTHP):

https://lore.kernel.org/linux-mm/20231025144546.577640-1-ryan.roberts@arm.com/

and swap-in for mTHP:

https://lore.kernel.org/all/20240229003753.134193-1-21cnbao@gmail.com/

The introduction of swapping pages of different orders will further
complicate the swap entry fragmentation issue. The swap back end has
no way to predict the life cycle of the swap entries. Repeatedly
allocating and freeing swap entries of different sizes will fragment
the swap entry array. If we can't allocate contiguous swap entries for
an mTHP, we will have to split the mTHP into a smaller size to perform
the swap out and swap in.

The current swap code only supports 4K pages or PMD-size pages. Adding
the other in-between sizes greatly increases the chance of fragmenting
the swap entry space. When there are no more contiguous swap entries
for an mTHP, the mTHP is forced to split into 4K pages. If we don't
solve the fragmentation issue, it will be a constant source of mTHP
splits.

Another limitation I would like to address is that swap_writepage can
only write out IO in one contiguous chunk; it is not able to perform
non-contiguous IO. When the swapfile is close to full, the unused
entries are likely to be spread across different locations. It would
be nice to be able to read and write a large folio using discontiguous
disk IO locations.
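
To make that concrete, here is an illustrative sketch (not an existing
kernel interface) of the kind of helper such a path would need:
coalesce the per-subpage slot list of a large folio into contiguous IO
segments and issue one IO per segment:

#include <stddef.h>
#include <stdint.h>

struct io_segment {
	uint64_t disk_off;	/* starting slot on the swap device */
	size_t folio_idx;	/* first subpage covered            */
	uint32_t nr_pages;	/* length of the contiguous run     */
};

/*
 * Turn the per-subpage slot list of a large folio into the minimal set
 * of contiguous IO segments; one bio (or equivalent) per segment.
 */
static size_t build_io_segments(const uint64_t *slots, size_t nr,
				struct io_segment *segs)
{
	size_t nseg = 0;

	for (size_t i = 0; i < nr; i++) {
		if (nseg && slots[i] == segs[nseg - 1].disk_off +
					segs[nseg - 1].nr_pages) {
			segs[nseg - 1].nr_pages++;	/* extend the run */
		} else {
			segs[nseg].disk_off = slots[i];
			segs[nseg].folio_idx = i;
			segs[nseg].nr_pages = 1;
			nseg++;
		}
	}
	return nseg;
}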

Some possible ideas for the fragmentation issue:

a) A buddy allocator for swap entries, similar to the buddy allocator
for memory. We can use a buddy allocator system for swap entries to
keep low-order swap entries from fragmenting too many of the
high-order swap entries. It should greatly reduce the fragmentation
caused by allocating and freeing swap entries of different sizes.
However, the buddy allocator has its own limits as well. Unlike system
memory, which we can move and compact, there is no rmap for swap
entries, so it is much harder to move a swap entry to another disk
location. The buddy allocator for swap will therefore help, but not
solve all of the fragmentation issues.
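
For illustration only, a bare-bones buddy-style allocator over the
swap offset space could look like the sketch below; it assumes
power-of-two orders and a simple free list per order, and ignores the
real locking and scanning constraints:

#include <stdbool.h>
#include <stdint.h>

#define MAX_ORDER	9		/* up to PMD order (512 slots)  */
#define NR_SLOTS	(1u << 16)	/* toy swap area: 64K 4K slots  */

/* One free list per order; a free block is named by its first slot. */
static uint32_t free_list[MAX_ORDER + 1][NR_SLOTS];
static uint32_t free_count[MAX_ORDER + 1];

static void push_free(int order, uint32_t slot)
{
	free_list[order][free_count[order]++] = slot;
}

static void swap_buddy_init(void)
{
	for (uint32_t s = 0; s < NR_SLOTS; s += 1u << MAX_ORDER)
		push_free(MAX_ORDER, s);
}

/*
 * Allocate a naturally aligned run of 2^order slots, splitting a
 * higher-order block when needed, the same way the page buddy
 * allocator does.  (Freeing and buddy merging are omitted here.)
 */
static bool swap_buddy_alloc(int order, uint32_t *slot)
{
	int o = order;

	while (o <= MAX_ORDER && !free_count[o])
		o++;
	if (o > MAX_ORDER)
		return false;	/* fragmented: no large enough run left */

	*slot = free_list[o][--free_count[o]];
	while (o > order) {	/* split, give the upper buddy back */
		o--;
		push_free(o, *slot + (1u << o));
	}
	return true;
}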

b) Large swap entries. Take a file as an example: a file on a file
system can be written to discontinuous disk locations, and the file
system is responsible for tracking how file offsets map to disk
locations. A large swap entry can have a similar indirection array
mapping out the disk locations of the different subpages within a
folio. This allows a large folio to be written out to discontiguous
swap entries in the swap file. The array will need to be stored
somewhere as part of the overhead. When allocating swap entries for
the folio, we can allocate a batch of smaller 4K swap entries into an
array, then use this array to read/write the large folio. There will
be a lot of plumbing work to get it to work.
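
A sketch of what such an indirection could look like (hypothetical
structures, just to show the shape of the plumbing): the folio-level
entry carries a per-subpage table of 4K slots allocated in a batch:

#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical "large swap entry": one object per swapped-out folio,
 * holding a per-subpage table of 4K slots that need not be contiguous
 * on the swap device.  The table itself is the extra overhead.
 */
struct large_swap_entry {
	uint32_t nr_pages;	/* subpages in the folio             */
	uint64_t *slots;	/* slots[i] = disk slot of subpage i */
};

static struct large_swap_entry *
alloc_large_entry(uint32_t nr_pages, int (*alloc_one_slot)(uint64_t *slot))
{
	struct large_swap_entry *e = malloc(sizeof(*e));

	if (!e)
		return NULL;
	e->nr_pages = nr_pages;
	e->slots = calloc(nr_pages, sizeof(*e->slots));
	if (!e->slots)
		goto err;

	/* Each subpage takes whatever free 4K slot the allocator hands
	 * out; the writeout path later issues one IO per contiguous run.
	 * (Unwinding of partially allocated slots is omitted.) */
	for (uint32_t i = 0; i < nr_pages; i++) {
		if (alloc_one_slot(&e->slots[i]))
			goto err;
	}
	return e;
err:
	free(e->slots);
	free(e);
	return NULL;
}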

Solutions a) and b) can work together as well: only use b) if we are
not able to allocate contiguous swap entries from a).

Chris

