[LSF/MM/BPF TOPIC] Large folios, swap and fscache

netfs.lists.linux.dev archive mirror
 help / color / mirror / Atom feed

* [LSF/MM/BPF TOPIC] Large folios, swap and fscache
@ 2024-02-02  9:09 David Howells
  2024-02-02 14:29 ` Matthew Wilcox
                   ` (4 more replies)
  0 siblings, 5 replies; 13+ messages in thread
From: David Howells @ 2024-02-02  9:09 UTC (permalink / raw)
  To: lsf-pc; +Cc: dhowells, Matthew Wilcox, netfs, linux-fsdevel, linux-mm

Hi,

The topic came up in a recent discussion about how to deal with large folios
when it comes to swap as a swap device is normally considered a simple array
of PAGE_SIZE-sized elements that can be indexed by a single integer.

With the advent of large folios, however, we might need to change this in
order to be better able to swap out a compound page efficiently.  Swap
fragmentation raises its head, as does the need to potentially save multiple
indices per folio.  Does swap need to grow more filesystem features?

Further to this, we have at least two ways to cache data on disk/flash/etc. -
swap and fscache - and both want to set aside disk space for their operation.
Might it be possible to combine the two?

One thing I want to look at for fscache is the possibility of switching from a
file-per-object-based approach to a tagged cache more akin to the way OpenAFS
does things.  In OpenAFS, you have a whole bunch of small files, each
containing a single block (e.g. 256K) of data, and an index that maps a
particular {volume,file,version,block} to one of these files in the cache.

Now, I could also consider holding all the data blocks in a single file (or
blockdev) - and this might work for swap.  For fscache, I do, however, need to
have some sort of integrity across reboots that swap does not require.

David

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Large folios, swap and fscache
  2024-02-02  9:09 [LSF/MM/BPF TOPIC] Large folios, swap and fscache David Howells
@ 2024-02-02 14:29 ` Matthew Wilcox
  2024-02-22 19:02   ` Luis Chamberlain
  2024-02-29 19:31   ` Chris Li
  2024-02-02 15:57 ` David Howells
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 13+ messages in thread
From: Matthew Wilcox @ 2024-02-02 14:29 UTC (permalink / raw)
  To: David Howells; +Cc: lsf-pc, netfs, linux-fsdevel, linux-mm

On Fri, Feb 02, 2024 at 09:09:49AM +0000, David Howells wrote:
> The topic came up in a recent discussion about how to deal with large folios
> when it comes to swap as a swap device is normally considered a simple array
> of PAGE_SIZE-sized elements that can be indexed by a single integer.
> 
> With the advent of large folios, however, we might need to change this in
> order to be better able to swap out a compound page efficiently.  Swap
> fragmentation raises its head, as does the need to potentially save multiple
> indices per folio.  Does swap need to grow more filesystem features?

I didn't mention this during the meeting, but there are more reasons
to do something like this.  For example, even with large folios, it
doesn't make sense to drive writing to swap on a per-folio basis.  We
should be writing out large chunks of virtual address space in a single
write to the swap device, just like we do large chunks of files in
->writepages.

Another reason to do something different is that we're starting to see
block devices with bs>PS.  That means we'll _have_ to write out larger
chunks than a single page.  For reads, we can discard the extra data,
but it'd be better to swap back in the entire block rather than
individual pages.

So my modest proposal is that we completely rearchitect how we handle
swap.  Instead of putting swp entries in the page tables (and in shmem's
case in the page cache), we turn swap into an (object, offset) lookup
(just like a filesystem).  That means that each anon_vma becomes its
own swap object and each shmem inode becomes its own swap object.
The swap system can then borrow techniques from whichever filesystem
it likes to do (object, offset, length) -> n x (device, block) mappings.

> Further to this, we have at least two ways to cache data on disk/flash/etc. -
> swap and fscache - and both want to set aside disk space for their operation.
> Might it be possible to combine the two?
> 
> One thing I want to look at for fscache is the possibility of switching from a
> file-per-object-based approach to a tagged cache more akin to the way OpenAFS
> does things.  In OpenAFS, you have a whole bunch of small files, each
> containing a single block (e.g. 256K) of data, and an index that maps a
> particular {volume,file,version,block} to one of these files in the cache.

I think my proposal above works for you?  For each file you want to cache,
create a swap object, and then tell swap when you want to read/write to
the local swap object.  What you do need is to persist the objects over
a power cycle.  That shouldn't be too hard ... after all, filesystems
manage to do it.  All we need to do is figure out how to name the
lookup (I don't think we need to use strings to name the swap object,
but obviously we could).  Maybe it's just a stream of bytes.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Large folios, swap and fscache
  2024-02-02  9:09 [LSF/MM/BPF TOPIC] Large folios, swap and fscache David Howells
  2024-02-02 14:29 ` Matthew Wilcox
@ 2024-02-02 15:57 ` David Howells
  2024-02-02 19:22   ` Matthew Wilcox
  2024-02-03  5:13 ` Gao Xiang
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 13+ messages in thread
From: David Howells @ 2024-02-02 15:57 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: dhowells, lsf-pc, netfs, linux-fsdevel, linux-mm

Matthew Wilcox <willy@infradead.org> wrote:

> So my modest proposal is that we completely rearchitect how we handle
> swap.  Instead of putting swp entries in the page tables (and in shmem's
> case in the page cache), we turn swap into an (object, offset) lookup
> (just like a filesystem).  That means that each anon_vma becomes its
> own swap object and each shmem inode becomes its own swap object.
> The swap system can then borrow techniques from whichever filesystem
> it likes to do (object, offset, length) -> n x (device, block) mappings.

That's basically what I'm suggesting, I think, but offloading the mechanics
down to a filesystem.  That would be fine with me.  bcachefs is an {key,val}
store right?

> > Further to this, we have at least two ways to cache data on
> > disk/flash/etc. - swap and fscache - and both want to set aside disk space
> > for their operation.  Might it be possible to combine the two?
> > 
> > One thing I want to look at for fscache is the possibility of switching
> > from a file-per-object-based approach to a tagged cache more akin to the
> > way OpenAFS does things.  In OpenAFS, you have a whole bunch of small
> > files, each containing a single block (e.g. 256K) of data, and an index
> > that maps a particular {volume,file,version,block} to one of these files
> > in the cache.
> 
> I think my proposal above works for you?  For each file you want to cache,
> create a swap object, and then tell swap when you want to read/write to
> the local swap object.  What you do need is to persist the objects over
> a power cycle.  That shouldn't be too hard ... after all, filesystems
> manage to do it.

Sure - but there is an integrity constraint that doesn't exist with swap.

There is also an additional feature of fscache: unless the cache entry is
locked in the cache (e.g. we're doing diconnected operation), we can throw
away an object from fscache and recycle it if we need space.  In fact, this is
the way OpenAFS works: every write transaction done on a file/dir on the
server is done atomically and is given a monotonically increasing data version
number that is then used as part of the index key in the cache.  So old
versions of the data get recycled as the cache needs to make space.

Which also means that if swap needs more space, it can just kick stuff out of
fscache if it is not locked in.

> All we need to do is figure out how to name the lookup (I don't think we
> need to use strings to name the swap object, but obviously we could).  Maybe
> it's just a stream of bytes.

A binary blob would probably be better.

I would use a separate index to map higher level organisations, such as
cell+volume in afs or the server address + share name in cifs to an index
number that can be used in the cache.

Further, I could do with a way to invalidate all objects matching a particular
subkey.

David

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Large folios, swap and fscache
  2024-02-02 15:57 ` David Howells
@ 2024-02-02 19:22   ` Matthew Wilcox
  0 siblings, 0 replies; 13+ messages in thread
From: Matthew Wilcox @ 2024-02-02 19:22 UTC (permalink / raw)
  To: David Howells; +Cc: lsf-pc, netfs, linux-fsdevel, linux-mm

On Fri, Feb 02, 2024 at 03:57:44PM +0000, David Howells wrote:
> Matthew Wilcox <willy@infradead.org> wrote:
> 
> > So my modest proposal is that we completely rearchitect how we handle
> > swap.  Instead of putting swp entries in the page tables (and in shmem's
> > case in the page cache), we turn swap into an (object, offset) lookup
> > (just like a filesystem).  That means that each anon_vma becomes its
> > own swap object and each shmem inode becomes its own swap object.
> > The swap system can then borrow techniques from whichever filesystem
> > it likes to do (object, offset, length) -> n x (device, block) mappings.
> 
> That's basically what I'm suggesting, I think, but offloading the mechanics
> down to a filesystem.  That would be fine with me.  bcachefs is an {key,val}
> store right?

Hmm.  That's not a bad idea.  So instead of having a swapfile, we
could create a swap directory on an existing filesystem.  Or if we
want to partition the drive and have a swap partition we just
mkfs.favourite that and tell it that root is the swap directory.

I think this means we do away with the swap cache?  If the page has been
brought back in, we'd be able to find it in the anon_vma's page cache
rather than having to search the global swap cache.

> > I think my proposal above works for you?  For each file you want to cache,
> > create a swap object, and then tell swap when you want to read/write to
> > the local swap object.  What you do need is to persist the objects over
> > a power cycle.  That shouldn't be too hard ... after all, filesystems
> > manage to do it.
> 
> Sure - but there is an integrity constraint that doesn't exist with swap.
> 
> There is also an additional feature of fscache: unless the cache entry is
> locked in the cache (e.g. we're doing diconnected operation), we can throw
> away an object from fscache and recycle it if we need space.  In fact, this is
> the way OpenAFS works: every write transaction done on a file/dir on the
> server is done atomically and is given a monotonically increasing data version
> number that is then used as part of the index key in the cache.  So old
> versions of the data get recycled as the cache needs to make space.
> 
> Which also means that if swap needs more space, it can just kick stuff out of
> fscache if it is not locked in.

Ah, more requirements ;-)

> > All we need to do is figure out how to name the lookup (I don't think we
> > need to use strings to name the swap object, but obviously we could).  Maybe
> > it's just a stream of bytes.
> 
> A binary blob would probably be better.
> 
> I would use a separate index to map higher level organisations, such as
> cell+volume in afs or the server address + share name in cifs to an index
> number that can be used in the cache.
> 
> Further, I could do with a way to invalidate all objects matching a particular
> subkey.

That seems to map to a directory hierarchy?

So, named swap objects for fscache; anonymous ones for anon memory?

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Large folios, swap and fscache
  2024-02-02  9:09 [LSF/MM/BPF TOPIC] Large folios, swap and fscache David Howells
  2024-02-02 14:29 ` Matthew Wilcox
  2024-02-02 15:57 ` David Howells
@ 2024-02-03  5:13 ` Gao Xiang
  2024-02-04 23:45 ` Dave Chinner
  2024-02-22 22:45 ` Chris Li
  4 siblings, 0 replies; 13+ messages in thread
From: Gao Xiang @ 2024-02-03  5:13 UTC (permalink / raw)
  To: David Howells; +Cc: lsf-pc, Matthew Wilcox, netfs, linux-fsdevel, linux-mm

Hi David,

On Fri, Feb 02, 2024 at 09:09:49AM +0000, David Howells wrote:
> Hi,
> 
> The topic came up in a recent discussion about how to deal with large folios
> when it comes to swap as a swap device is normally considered a simple array
> of PAGE_SIZE-sized elements that can be indexed by a single integer.
> 
> With the advent of large folios, however, we might need to change this in
> order to be better able to swap out a compound page efficiently.  Swap
> fragmentation raises its head, as does the need to potentially save multiple
> indices per folio.  Does swap need to grow more filesystem features?
> 
> Further to this, we have at least two ways to cache data on disk/flash/etc. -
> swap and fscache - and both want to set aside disk space for their operation.
> Might it be possible to combine the two?
> 
> One thing I want to look at for fscache is the possibility of switching from a
> file-per-object-based approach to a tagged cache more akin to the way OpenAFS
> does things.  In OpenAFS, you have a whole bunch of small files, each
> containing a single block (e.g. 256K) of data, and an index that maps a
> particular {volume,file,version,block} to one of these files in the cache.
> 
> Now, I could also consider holding all the data blocks in a single file (or
> blockdev) - and this might work for swap.  For fscache, I do, however, need to
> have some sort of integrity across reboots that swap does not require.

If my understanding is correct, I think the old swapfile approach just
works with pinned local fs extents, which means it looks up extents in
advance and it doesn't expect these extents will be moved so the real
swap data I/O paths always work without fses involved.  I don't look
into the new SWP_FS_OPS/.swap_rw way and it seems only some network
fses use it but IMHO it might have some deadlock risk if swapout
triggers local fs block allocation.  But overall I think it's a good
idea to combine the two.

Just slight off the topic: Recently I had another rough thought. As
you said, even a single fscache block or called a single fscache chunk is like
256K or whatever.

Is it possible to implement an _optional_ partial cached data uptodate
like fscache chunk vs fsblock?  For example a bitmap can be attached to
each 256K or 1M chunk. That would be much helpful.

Thanks,
Gao Xiang

> 
> David
> 

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Large folios, swap and fscache
  2024-02-02  9:09 [LSF/MM/BPF TOPIC] Large folios, swap and fscache David Howells
                   ` (2 preceding siblings ...)
  2024-02-03  5:13 ` Gao Xiang
@ 2024-02-04 23:45 ` Dave Chinner
  2024-02-22 22:45 ` Chris Li
  4 siblings, 0 replies; 13+ messages in thread
From: Dave Chinner @ 2024-02-04 23:45 UTC (permalink / raw)
  To: David Howells; +Cc: lsf-pc, Matthew Wilcox, netfs, linux-fsdevel, linux-mm

On Fri, Feb 02, 2024 at 09:09:49AM +0000, David Howells wrote:
> Hi,
> 
> The topic came up in a recent discussion about how to deal with large folios
> when it comes to swap as a swap device is normally considered a simple array
> of PAGE_SIZE-sized elements that can be indexed by a single integer.
> 
> With the advent of large folios, however, we might need to change this in
> order to be better able to swap out a compound page efficiently.  Swap
> fragmentation raises its head, as does the need to potentially save multiple
> indices per folio.  Does swap need to grow more filesystem features?

The "file-based swap" infrastructure needs to be converted to use
filesystem direct IO methods. It should not cache the extent list
and do raw direct-to-device IO itself, it should just build an
iov that points to the pages and submit that to the filesystem
DIO read/write path to do the mapping and submission to disk.

If we tell the dio subsystem that it is IOCB_SWAP IO, then we can
do things like ignore unwritten bits in the extent mappings so
we don't have to do transactions to avoid unwritten conversion on
write or do timestamp updates on the inode...

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Large folios, swap and fscache
  2024-02-02 14:29 ` Matthew Wilcox
@ 2024-02-22 19:02   ` Luis Chamberlain
  2024-02-22 19:16     ` Yosry Ahmed
  2024-02-22 22:26     ` Chris Li
  2024-02-29 19:31   ` Chris Li
  1 sibling, 2 replies; 13+ messages in thread
From: Luis Chamberlain @ 2024-02-22 19:02 UTC (permalink / raw)
  To: Matthew Wilcox, Yosry Ahmed, Chris Li, Daniel Gomez,
	Pankaj Raghav, Hugh Dickins
  Cc: David Howells, lsf-pc, netfs, linux-fsdevel, linux-mm

On Fri, Feb 02, 2024 at 02:29:40PM +0000, Matthew Wilcox wrote:
> So my modest proposal is that we completely rearchitect how we handle
> swap.  Instead of putting swp entries in the page tables (and in shmem's
> case in the page cache), we turn swap into an (object, offset) lookup
> (just like a filesystem).  That means that each anon_vma becomes its
> own swap object and each shmem inode becomes its own swap object.
> The swap system can then borrow techniques from whichever filesystem
> it likes to do (object, offset, length) -> n x (device, block) mappings.

What happened to Yosry or Chris's last year's pony [0]? In order to try
to take a stab at this we started with adding large folios to tmpfs,
which Daniel Gomez has taken on, as its a simple filesystem and with
large folios can enable us to easily test large folio swap support too.
Daniel first tried fixing lseek issue with huge pages [1] and on top of
that he has patches (a new RFC not posted yet) which do add large folios
support to tmpfs. Hugh has noted the lskeek changes are incorrect and
suggested instead a fix for the failed tests in fstests. If we get
agreement on Hugh's approach then we have a step forward with tmpfs and
later we hope this will make it easier to test swap changes.

Its probably then a good time to ask, do we have a list of tests for
swap to ensure we don't break things if we add large folio support?
We can at least start with a good baseline of tests for that.

[0] https://lwn.net/Articles/932077/
[1] https://lkml.kernel.org/r/20240209142901.126894-1-da.gomez@samsung.com

  Luis

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Large folios, swap and fscache
  2024-02-22 19:02   ` Luis Chamberlain
@ 2024-02-22 19:16     ` Yosry Ahmed
  2024-02-22 22:26     ` Chris Li
  1 sibling, 0 replies; 13+ messages in thread
From: Yosry Ahmed @ 2024-02-22 19:16 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Matthew Wilcox, Chris Li, Daniel Gomez, Pankaj Raghav,
	Hugh Dickins, David Howells, Nhat Pham, lsf-pc, netfs,
	linux-fsdevel, linux-mm

On Thu, Feb 22, 2024 at 11:02:24AM -0800, Luis Chamberlain wrote:
> On Fri, Feb 02, 2024 at 02:29:40PM +0000, Matthew Wilcox wrote:
> > So my modest proposal is that we completely rearchitect how we handle
> > swap.  Instead of putting swp entries in the page tables (and in shmem's
> > case in the page cache), we turn swap into an (object, offset) lookup
> > (just like a filesystem).  That means that each anon_vma becomes its
> > own swap object and each shmem inode becomes its own swap object.
> > The swap system can then borrow techniques from whichever filesystem
> > it likes to do (object, offset, length) -> n x (device, block) mappings.
> 
> What happened to Yosry or Chris's last year's pony [0]? In order to try

For me, I unfortunately got occuppied with other projects and don't have
the bandwidth to work on it for now :/

I don't want to put anyone on the spot, but I think Nhat may have been
thinking about pursuing a version of this at some point.

> to take a stab at this we started with adding large folios to tmpfs,
> which Daniel Gomez has taken on, as its a simple filesystem and with
> large folios can enable us to easily test large folio swap support too.
> Daniel first tried fixing lseek issue with huge pages [1] and on top of
> that he has patches (a new RFC not posted yet) which do add large folios
> support to tmpfs. Hugh has noted the lskeek changes are incorrect and
> suggested instead a fix for the failed tests in fstests. If we get
> agreement on Hugh's approach then we have a step forward with tmpfs and
> later we hope this will make it easier to test swap changes.
> 
> Its probably then a good time to ask, do we have a list of tests for
> swap to ensure we don't break things if we add large folio support?
> We can at least start with a good baseline of tests for that.
> 
> [0] https://lwn.net/Articles/932077/
> [1] https://lkml.kernel.org/r/20240209142901.126894-1-da.gomez@samsung.com
> 
>   Luis

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Large folios, swap and fscache
  2024-02-22 19:02   ` Luis Chamberlain
  2024-02-22 19:16     ` Yosry Ahmed
@ 2024-02-22 22:26     ` Chris Li
  1 sibling, 0 replies; 13+ messages in thread
From: Chris Li @ 2024-02-22 22:26 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Matthew Wilcox, Yosry Ahmed, Daniel Gomez, Pankaj Raghav,
	Hugh Dickins, David Howells, lsf-pc, netfs, linux-fsdevel,
	linux-mm

On Thu, Feb 22, 2024 at 11:02 AM Luis Chamberlain <mcgrof@kernel.org> wrote:
>
> On Fri, Feb 02, 2024 at 02:29:40PM +0000, Matthew Wilcox wrote:
> > So my modest proposal is that we completely rearchitect how we handle
> > swap.  Instead of putting swp entries in the page tables (and in shmem's
> > case in the page cache), we turn swap into an (object, offset) lookup
> > (just like a filesystem).  That means that each anon_vma becomes its
> > own swap object and each shmem inode becomes its own swap object.
> > The swap system can then borrow techniques from whichever filesystem
> > it likes to do (object, offset, length) -> n x (device, block) mappings.
>
> What happened to Yosry or Chris's last year's pony [0]? In order to try
> to take a stab at this we started with adding large folios to tmpfs,
> which Daniel Gomez has taken on, as its a simple filesystem and with
> large folios can enable us to easily test large folio swap support too.
> Daniel first tried fixing lseek issue with huge pages [1] and on top of
> that he has patches (a new RFC not posted yet) which do add large folios
> support to tmpfs. Hugh has noted the lskeek changes are incorrect and
> suggested instead a fix for the failed tests in fstests. If we get
> agreement on Hugh's approach then we have a step forward with tmpfs and
> later we hope this will make it easier to test swap changes.

Ah, just notice this. I have some pending ideas on how to address
that, I might be the
one that brings up this topic in the discussion David was referring to.

Will reply in his email of this thread.

==== quote ======
On Fri, Feb 2, 2024 at 1:10 AM David Howells <dhowells@redhat.com> wrote:
>
> Hi,
>
> The topic came up in a recent discussion about how to deal with large folios
> when it comes to swap as a swap device is normally considered a simple array
> of PAGE_SIZE-sized elements that can be indexed by a single integer.
>
==== end quote ====

>
> Its probably then a good time to ask, do we have a list of tests for
> swap to ensure we don't break things if we add large folio support?
> We can at least start with a good baseline of tests for that.

That is a very good idea to start with the test. We need all the help
we can get on the testing side.
I know Huge has his own test setup for stressing the swap systems.
Yes, more tests are always better.

Chris

>
> [0] https://lwn.net/Articles/932077/
> [1] https://lkml.kernel.org/r/20240209142901.126894-1-da.gomez@samsung.com
>
>   Luis
>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Large folios, swap and fscache
  2024-02-02  9:09 [LSF/MM/BPF TOPIC] Large folios, swap and fscache David Howells
                   ` (3 preceding siblings ...)
  2024-02-04 23:45 ` Dave Chinner
@ 2024-02-22 22:45 ` Chris Li
  2024-02-23  3:00   ` Andreas Dilger
  4 siblings, 1 reply; 13+ messages in thread
From: Chris Li @ 2024-02-22 22:45 UTC (permalink / raw)
  To: David Howells; +Cc: lsf-pc, Matthew Wilcox, netfs, linux-fsdevel, linux-mm

Hi David,

On Fri, Feb 2, 2024 at 1:10 AM David Howells <dhowells@redhat.com> wrote:
>
> Hi,
>
> The topic came up in a recent discussion about how to deal with large folios
> when it comes to swap as a swap device is normally considered a simple array
> of PAGE_SIZE-sized elements that can be indexed by a single integer.

Sorry for being late for the party. I think I was the one that brought
this topic up in the online discussion with Will and You. Let me know
if you are referring to a different discussion.

>
> With the advent of large folios, however, we might need to change this in
> order to be better able to swap out a compound page efficiently.  Swap
> fragmentation raises its head, as does the need to potentially save multiple
> indices per folio.  Does swap need to grow more filesystem features?

Yes, with a large folio, it is harder to allocate continuous swap
entries where 4K swap entries are allocated and free all the time. The
fragmentation will likely make the swap file have very little
continuous swap entries.

We can change that assumption, allow large folio reading and writing
of discontinued blocks on the block device level. We will likely need
a file system like kind of the indirection layer to store the location
of those blocks. In other words, the folio needs to read/write a list
of io vectors, not just one block.

>
> Further to this, we have at least two ways to cache data on disk/flash/etc. -
> swap and fscache - and both want to set aside disk space for their operation.
> Might it be possible to combine the two?
>
> One thing I want to look at for fscache is the possibility of switching from a
> file-per-object-based approach to a tagged cache more akin to the way OpenAFS
> does things.  In OpenAFS, you have a whole bunch of small files, each
> containing a single block (e.g. 256K) of data, and an index that maps a
> particular {volume,file,version,block} to one of these files in the cache.
>
> Now, I could also consider holding all the data blocks in a single file (or
> blockdev) - and this might work for swap.  For fscache, I do, however, need to
> have some sort of integrity across reboots that swap does not require.

The main trade off is the memory usage for the meta data and latency
of reading and writing.
The file system has typically a different IO pattern than swap, e.g.
file reads can be batched and have good locality.
Where swap is a lot of random location read/write.

Current swap using array like swap entry, one of the pros of that is
just one IO required for one folio.
The performance gets worse when swap needs to read the metadata first
to locate the block, then read the block of data in.
Page fault latency will get longer. That is one of the trade-offs we
need to consider.

Chris

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Large folios, swap and fscache
  2024-02-22 22:45 ` Chris Li
@ 2024-02-23  3:00   ` Andreas Dilger
  2024-02-23  3:46     ` Chris Li
  0 siblings, 1 reply; 13+ messages in thread
From: Andreas Dilger @ 2024-02-23  3:00 UTC (permalink / raw)
  To: Chris Li
  Cc: David Howells, lsf-pc, Matthew Wilcox, netfs, linux-fsdevel,
	linux-mm

[-- Attachment #1: Type: text/plain, Size: 3398 bytes --]

On Feb 22, 2024, at 3:45 PM, Chris Li <chrisl@kernel.org> wrote:
> 
> Hi David,
> 
> On Fri, Feb 2, 2024 at 1:10 AM David Howells <dhowells@redhat.com> wrote:
>> 
>> Hi,
>> 
>> The topic came up in a recent discussion about how to deal with large folios
>> when it comes to swap as a swap device is normally considered a simple array
>> of PAGE_SIZE-sized elements that can be indexed by a single integer.
> 
> Sorry for being late for the party. I think I was the one that brought
> this topic up in the online discussion with Will and You. Let me know
> if you are referring to a different discussion.
> 
>> 
>> With the advent of large folios, however, we might need to change this in
>> order to be better able to swap out a compound page efficiently.  Swap
>> fragmentation raises its head, as does the need to potentially save multiple
>> indices per folio.  Does swap need to grow more filesystem features?
> 
> Yes, with a large folio, it is harder to allocate continuous swap
> entries where 4K swap entries are allocated and free all the time. The
> fragmentation will likely make the swap file have very little
> continuous swap entries.

One option would be to reuse the multi-block allocator (mballoc) from
ext4, which has quite efficient power-of-two buddy allocation.  That
would naturally aggregate contiguous pages as they are freed.  Since
the swap partition is not containing anything useful across a remount
there is no need to save allocation bitmaps persistently.

Cheers, Andreas

> We can change that assumption, allow large folio reading and writing
> of discontinued blocks on the block device level. We will likely need
> a file system like kind of the indirection layer to store the location
> of those blocks. In other words, the folio needs to read/write a list
> of io vectors, not just one block.
> 
>> 
>> Further to this, we have at least two ways to cache data on disk/flash/etc. -
>> swap and fscache - and both want to set aside disk space for their operation.
>> Might it be possible to combine the two?
>> 
>> One thing I want to look at for fscache is the possibility of switching from a
>> file-per-object-based approach to a tagged cache more akin to the way OpenAFS
>> does things.  In OpenAFS, you have a whole bunch of small files, each
>> containing a single block (e.g. 256K) of data, and an index that maps a
>> particular {volume,file,version,block} to one of these files in the cache.
>> 
>> Now, I could also consider holding all the data blocks in a single file (or
>> blockdev) - and this might work for swap.  For fscache, I do, however, need to
>> have some sort of integrity across reboots that swap does not require.
> 
> The main trade off is the memory usage for the meta data and latency
> of reading and writing.
> The file system has typically a different IO pattern than swap, e.g.
> file reads can be batched and have good locality.
> Where swap is a lot of random location read/write.
> 
> Current swap using array like swap entry, one of the pros of that is
> just one IO required for one folio.
> The performance gets worse when swap needs to read the metadata first
> to locate the block, then read the block of data in.
> Page fault latency will get longer. That is one of the trade-offs we
> need to consider.
> 
> Chris
> 


Cheers, Andreas






[-- Attachment #2: Message signed with OpenPGP --]
[-- Type: application/pgp-signature, Size: 873 bytes --]

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Large folios, swap and fscache
  2024-02-23  3:00   ` Andreas Dilger
@ 2024-02-23  3:46     ` Chris Li
  0 siblings, 0 replies; 13+ messages in thread
From: Chris Li @ 2024-02-23  3:46 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: David Howells, lsf-pc, Matthew Wilcox, netfs, linux-fsdevel,
	linux-mm

Hi Andreas,

On Thu, Feb 22, 2024 at 7:03 PM Andreas Dilger <adilger@dilger.ca> wrote:
>
> On Feb 22, 2024, at 3:45 PM, Chris Li <chrisl@kernel.org> wrote:
> >
> > Hi David,
> >
> > On Fri, Feb 2, 2024 at 1:10 AM David Howells <dhowells@redhat.com> wrote:
> >>
> >> Hi,
> >>
> >> The topic came up in a recent discussion about how to deal with large folios
> >> when it comes to swap as a swap device is normally considered a simple array
> >> of PAGE_SIZE-sized elements that can be indexed by a single integer.
> >
> > Sorry for being late for the party. I think I was the one that brought
> > this topic up in the online discussion with Will and You. Let me know
> > if you are referring to a different discussion.
> >
> >>
> >> With the advent of large folios, however, we might need to change this in
> >> order to be better able to swap out a compound page efficiently.  Swap
> >> fragmentation raises its head, as does the need to potentially save multiple
> >> indices per folio.  Does swap need to grow more filesystem features?
> >
> > Yes, with a large folio, it is harder to allocate continuous swap
> > entries where 4K swap entries are allocated and free all the time. The
> > fragmentation will likely make the swap file have very little
> > continuous swap entries.
>
> One option would be to reuse the multi-block allocator (mballoc) from
> ext4, which has quite efficient power-of-two buddy allocation.  That
> would naturally aggregate contiguous pages as they are freed.  Since
> the swap partition is not containing anything useful across a remount
> there is no need to save allocation bitmaps persistently.

That is a very interesting idea. I saw two ways to solve this problem,
buddy allocation system is one of them. The buddy allocation system
can keep the assumption that swap entries will be contiguous within
the same folio. The buddy system also has its own limits due to
external fragmentations. For one there is no easy way to relocate the
swap entry to other locations. We don't have the rmap for swap
entries. That makes the swap entries hard to compact. I do expect the
buddy allocator can help reduce the fragmentation greatly.

The other way is just to have an indirection for mapping a folio's
swap entry to discontiguous swap entries. It will break more
assumptions of the current code about contiguous swap entries.

If we can reuse the ext4 mballoc for swap entries, that would be
great. I will take a look at that and report back.

Thanks for the great suggestion.

Chris

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Large folios, swap and fscache
  2024-02-02 14:29 ` Matthew Wilcox
  2024-02-22 19:02   ` Luis Chamberlain
@ 2024-02-29 19:31   ` Chris Li
  1 sibling, 0 replies; 13+ messages in thread
From: Chris Li @ 2024-02-29 19:31 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: David Howells, lsf-pc, netfs, linux-fsdevel, linux-mm

Hi Matthew,

On Fri, Feb 2, 2024 at 6:29 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Fri, Feb 02, 2024 at 09:09:49AM +0000, David Howells wrote:
> > The topic came up in a recent discussion about how to deal with large folios
> > when it comes to swap as a swap device is normally considered a simple array
> > of PAGE_SIZE-sized elements that can be indexed by a single integer.
> >
> > With the advent of large folios, however, we might need to change this in
> > order to be better able to swap out a compound page efficiently.  Swap
> > fragmentation raises its head, as does the need to potentially save multiple
> > indices per folio.  Does swap need to grow more filesystem features?
>
> I didn't mention this during the meeting, but there are more reasons
> to do something like this.  For example, even with large folios, it
> doesn't make sense to drive writing to swap on a per-folio basis.  We
> should be writing out large chunks of virtual address space in a single
> write to the swap device, just like we do large chunks of files in
> ->writepages.

I have thought about your proposal after the THP meeting. One
observation is that the swap write and swap read has some asymmetries.
For swap read, you always know which vma you are reading into.
However, the swap write that is based on the LRU list,
(shrink_folio_list) does not have the vma information in hand.
Actually the same folio might map by two different processes. It would
need to do the rmap walk to find out the VMA. So organizing the swap
write around VMA mapping is not convenient for the LRU reclaim write
back case.

Chris


> Another reason to do something different is that we're starting to see
> block devices with bs>PS.  That means we'll _have_ to write out larger
> chunks than a single page.  For reads, we can discard the extra data,
> but it'd be better to swap back in the entire block rather than
> individual pages.
>
> So my modest proposal is that we completely rearchitect how we handle
> swap.  Instead of putting swp entries in the page tables (and in shmem's
> case in the page cache), we turn swap into an (object, offset) lookup
> (just like a filesystem).  That means that each anon_vma becomes its
> own swap object and each shmem inode becomes its own swap object.
> The swap system can then borrow techniques from whichever filesystem
> it likes to do (object, offset, length) -> n x (device, block) mappings.
>
> > Further to this, we have at least two ways to cache data on disk/flash/etc. -
> > swap and fscache - and both want to set aside disk space for their operation.
> > Might it be possible to combine the two?
> >
> > One thing I want to look at for fscache is the possibility of switching from a
> > file-per-object-based approach to a tagged cache more akin to the way OpenAFS
> > does things.  In OpenAFS, you have a whole bunch of small files, each
> > containing a single block (e.g. 256K) of data, and an index that maps a
> > particular {volume,file,version,block} to one of these files in the cache.
>
> I think my proposal above works for you?  For each file you want to cache,
> create a swap object, and then tell swap when you want to read/write to
> the local swap object.  What you do need is to persist the objects over
> a power cycle.  That shouldn't be too hard ... after all, filesystems
> manage to do it.  All we need to do is figure out how to name the
> lookup (I don't think we need to use strings to name the swap object,
> but obviously we could).  Maybe it's just a stream of bytes.
>

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2024-02-29 19:31 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-02-02  9:09 [LSF/MM/BPF TOPIC] Large folios, swap and fscache David Howells
2024-02-02 14:29 ` Matthew Wilcox
2024-02-22 19:02   ` Luis Chamberlain
2024-02-22 19:16     ` Yosry Ahmed
2024-02-22 22:26     ` Chris Li
2024-02-29 19:31   ` Chris Li
2024-02-02 15:57 ` David Howells
2024-02-02 19:22   ` Matthew Wilcox
2024-02-03  5:13 ` Gao Xiang
2024-02-04 23:45 ` Dave Chinner
2024-02-22 22:45 ` Chris Li
2024-02-23  3:00   ` Andreas Dilger
2024-02-23  3:46     ` Chris Li

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).