* Re: fallocate on XFS for swap
2018-03-10 0:58 ` Dave Chinner
@ 2018-03-10 1:17 ` Darrick J. Wong
2018-03-10 1:36 ` Dave Chinner
2018-03-10 9:38 ` Christoph Hellwig
2018-03-12 18:40 ` Besogonov, Aleksei
2 siblings, 1 reply; 11+ messages in thread
From: Darrick J. Wong @ 2018-03-10 1:17 UTC (permalink / raw)
To: Dave Chinner
Cc: Besogonov, Aleksei, linux-fsdevel@vger.kernel.org,
linux-mm@kvack.org, xfs
On Sat, Mar 10, 2018 at 11:58:50AM +1100, Dave Chinner wrote:
> On Fri, Mar 09, 2018 at 03:44:22PM -0800, Darrick J. Wong wrote:
> > [you really ought to cc the xfs list]
> >
> > On Fri, Mar 09, 2018 at 10:05:24PM +0000, Besogonov, Aleksei wrote:
> > > Hi!
> > >
> > > We’re working at Amazon on making XFS our default root filesystem for
> > > the upcoming Amazon Linux 2 (now in prod preview). One of the problems
> > > that we’ve encountered is inability to use fallocated files for swap
> > > on XFS. This is really important for us, since we’re shipping our
> > > current Amazon Linux with hibernation support .
> >
> > <shudder>
> >
> > > I’ve traced the problem to bmap(), used in generic_swapfile_activate
> > > call, which returns 0 for blocks inside holes created by fallocate and
> > > Dave Chinner confirmed it in a private email. I’m thinking about ways
> > > to fix it, so far I see the following possibilities:
> > >
> > > 1. Change bmap() to not return zeroes for blocks inside holes. But
> > > this is an ABI change and it likely will break some obscure userspace
> > > utility somewhere.
> >
> > bmap is a horrible interface, let's leave it to wither and eventually go
> > away.
> >
> > > 2. Change generic_swap_activate to use a more modern interface, by
> > > adding fiemap-like operation to address_space_operations with fallback
> > > on bmap().
> >
> > Probably the best idea, but see fs/iomap.c since we're basically leasing
> > a chunk of file space to the kernel. Leasing space to a user that wants
> > direct access is becoming rather common (rdma, map_sync, etc.)
>
> thing is, we don't want in-kernel users of fiemap. We've got other
> block mapping interfaces that can be used, such as iomap...
Well yes, I was clumsily trying to suggest reimplementing
generic_swap_activate with an iomap backend replacing/augmenting the old
get_blocks thing... :)
> > > 3. Add an XFS-specific implementation of swapfile_activate.
> >
> > Ugh no.
>
> What we want is an iomap-based re-implementation of
> generic_swap_activate(). One of the ways to plumb that in is to
> use ->swapfile_activate() like so:
Is this distinct from the ->swap_activate function pointer in
address_operations or a new one? I think it'd be best to have it be a
separate callback like you suggest:
> iomap_swapfile_activate()
> {
> return iomap_apply(... iomap_swapfile_add_extent, ...)
> }
>
> xfs_vm_swapfile_activate()
> {
> return iomap_swapfile_activate(xfs_iomap_ops);
> }
>
> .swapfile_activate = xfs_vm_swapfile_activate()
>
> And massage the swapfile_activate callout be friendly to fragmented
> files. i.e. change the nfs caller to run a
> "add_single_swap_extent()" caller rather than have to do it in the
> generic code on return....
But ugh, the names are confusing. ->swapfile_activate, ->swap_activate,
and generic_swapfile_activate. Not sure what's needed to clean up the
other filesystems to use a single mapping interface, though.
> IOWs, I think the choices we have are to either re-implement
> generic_swapfile_activate() and then be stuck with using get_block
> style interfaces forever in XFS, or we use the filesystem specific
> callout to implement more advanced generic support using the
> filesystem supplied get_block/iomap interfaces for block mapping
> like we do for everything else that the VM needs the filesystem to
> do....
Yes, that's what I was trying to nudge Mr. Besogonov towards, though not
as clearly as you've put it. Thanks. :)
--D
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: fallocate on XFS for swap
2018-03-10 1:17 ` Darrick J. Wong
@ 2018-03-10 1:36 ` Dave Chinner
2018-03-12 22:01 ` Besogonov, Aleksei
0 siblings, 1 reply; 11+ messages in thread
From: Dave Chinner @ 2018-03-10 1:36 UTC (permalink / raw)
To: Darrick J. Wong
Cc: Besogonov, Aleksei, linux-fsdevel@vger.kernel.org,
linux-mm@kvack.org, xfs
On Fri, Mar 09, 2018 at 05:17:07PM -0800, Darrick J. Wong wrote:
> On Sat, Mar 10, 2018 at 11:58:50AM +1100, Dave Chinner wrote:
> > On Fri, Mar 09, 2018 at 03:44:22PM -0800, Darrick J. Wong wrote:
> > > [you really ought to cc the xfs list]
> > >
> > > On Fri, Mar 09, 2018 at 10:05:24PM +0000, Besogonov, Aleksei wrote:
> > > > Hi!
> > > >
> > > > We’re working at Amazon on making XFS our default root filesystem for
> > > > the upcoming Amazon Linux 2 (now in prod preview). One of the problems
> > > > that we’ve encountered is inability to use fallocated files for swap
> > > > on XFS. This is really important for us, since we’re shipping our
> > > > current Amazon Linux with hibernation support .
> > >
> > > <shudder>
> > >
> > > > I’ve traced the problem to bmap(), used in generic_swapfile_activate
> > > > call, which returns 0 for blocks inside holes created by fallocate and
> > > > Dave Chinner confirmed it in a private email. I’m thinking about ways
> > > > to fix it, so far I see the following possibilities:
> > > >
> > > > 1. Change bmap() to not return zeroes for blocks inside holes. But
> > > > this is an ABI change and it likely will break some obscure userspace
> > > > utility somewhere.
> > >
> > > bmap is a horrible interface, let's leave it to wither and eventually go
> > > away.
> > >
> > > > 2. Change generic_swap_activate to use a more modern interface, by
> > > > adding fiemap-like operation to address_space_operations with fallback
> > > > on bmap().
> > >
> > > Probably the best idea, but see fs/iomap.c since we're basically leasing
> > > a chunk of file space to the kernel. Leasing space to a user that wants
> > > direct access is becoming rather common (rdma, map_sync, etc.)
> >
> > thing is, we don't want in-kernel users of fiemap. We've got other
> > block mapping interfaces that can be used, such as iomap...
>
> Well yes, I was clumsily trying to suggest reimplementing
> generic_swap_activate with an iomap backend replacing/augmenting the old
> get_blocks thing... :)
>
> > > > 3. Add an XFS-specific implementation of swapfile_activate.
> > >
> > > Ugh no.
> >
> > What we want is an iomap-based re-implementation of
> > generic_swap_activate(). One of the ways to plumb that in is to
> > use ->swapfile_activate() like so:
>
> Is this distinct from the ->swap_activate function pointer in
> address_operations or a new one? I think it'd be best to have it be a
> separate callback like you suggest:
No, we don't need to create a new one - the existing one is used by
a single caller and we can easily move all the functionality it
requires inside the NFS specific implementation - it's just mapping
the entire range as a single extent, but the callout is needed to
mark the sockets backing the file as in the memalloc path...
> > iomap_swapfile_activate()
> > {
> > return iomap_apply(... iomap_swapfile_add_extent, ...)
> > }
> >
> > xfs_vm_swapfile_activate()
> > {
> > return iomap_swapfile_activate(xfs_iomap_ops);
> > }
> >
> > .swapfile_activate = xfs_vm_swapfile_activate()
> >
> > And massage the swapfile_activate callout be friendly to fragmented
> > files. i.e. change the nfs caller to run a
> > "add_single_swap_extent()" caller rather than have to do it in the
> > generic code on return....
>
> But ugh, the names are confusing. ->swapfile_activate, ->swap_activate,
> and generic_swapfile_activate. Not sure what's needed to clean up the
> other filesystems to use a single mapping interface, though.
If they don't implement the callout, they use the
generic_swapfile_activate code that currently exists. Maybe with a
name change, but this way we don't have to touch them....
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: fallocate on XFS for swap
2018-03-10 1:36 ` Dave Chinner
@ 2018-03-12 22:01 ` Besogonov, Aleksei
2018-03-13 1:31 ` Dave Chinner
0 siblings, 1 reply; 11+ messages in thread
From: Besogonov, Aleksei @ 2018-03-12 22:01 UTC (permalink / raw)
To: Dave Chinner, Darrick J. Wong
Cc: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, xfs
[snip unrelated]
So I'm looking at the XFS code and it appears that the iomap is limited to 1024*PAGE_SIZE blocks at a time, which is too small for most of swap use-cases. I can of course just loop through the file in 4Mb increments and, just like the bmap() code does today. But this just doesn't look right and it's not atomic. And it looks like iomap in ext2 doesn't have this limitation.
The stated rationale for the XFS limit is:
>/*
> * We cap the maximum length we map here to MAX_WRITEBACK_PAGES pages
> * to keep the chunks of work done where somewhat symmetric with the
> * work writeback does. This is a completely arbitrary number pulled
> * out of thin air as a best guess for initial testing.
> *
> * Note that the values needs to be less than 32-bits wide until
> * the lower level functions are updated.
> */
So can it be lifted today?
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: fallocate on XFS for swap
2018-03-12 22:01 ` Besogonov, Aleksei
@ 2018-03-13 1:31 ` Dave Chinner
0 siblings, 0 replies; 11+ messages in thread
From: Dave Chinner @ 2018-03-13 1:31 UTC (permalink / raw)
To: Besogonov, Aleksei
Cc: Darrick J. Wong, linux-fsdevel@vger.kernel.org,
linux-mm@kvack.org, xfs
[Aleski, can you word wrap your email text at 72 columns? ]
On Mon, Mar 12, 2018 at 10:01:54PM +0000, Besogonov, Aleksei wrote:
> [snip unrelated]
>
> So I'm looking at the XFS code and it appears that the iomap is
> limited to 1024*PAGE_SIZE blocks at a time,
Take a closer look - that code is not used for reading file extents
and returning them to the caller.
> which is too small for
> most of swap use-cases. I can of course just loop through the file
> in 4Mb increments and, just like the bmap() code does today. But
> this just doesn't look right and it's not atomic. And it looks
> like iomap in ext2 doesn't have this limitation.
>
> The stated rationale for the XFS limit is:
> >/*
> > * We cap the maximum length we map here to MAX_WRITEBACK_PAGES pages
> > * to keep the chunks of work done where somewhat symmetric with the
> > * work writeback does. This is a completely arbitrary number pulled
> > * out of thin air as a best guess for initial testing.
> > *
> > * Note that the values needs to be less than 32-bits wide until
> > * the lower level functions are updated.
> > */
Yeah, that's in the IOMAP_WRITE path used for block allocation. swap
file mapping should not be asking for IOMAP_WRITE mappings that
trigger extent allocation, so you should never hit this case.
You should probably be using the IOMAP_REPORT path (i.e. basically
very similar code to iomap_fiemap/iomap_fiemap_apply and rejecting
any file that returns an iomap that is not IOMAP_MAPPED or
IOMAP_UNWRITTEN. Also, you want to reject any file that returns
IOMAP_F_SHARED in iomap->flags, too, because swapfiles can't do COW
to break extent sharing on writes.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: fallocate on XFS for swap
2018-03-10 0:58 ` Dave Chinner
2018-03-10 1:17 ` Darrick J. Wong
@ 2018-03-10 9:38 ` Christoph Hellwig
2018-03-12 21:46 ` Dave Chinner
2018-03-12 18:40 ` Besogonov, Aleksei
2 siblings, 1 reply; 11+ messages in thread
From: Christoph Hellwig @ 2018-03-10 9:38 UTC (permalink / raw)
To: Dave Chinner
Cc: Darrick J. Wong, Besogonov, Aleksei,
linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, xfs
On Sat, Mar 10, 2018 at 11:58:50AM +1100, Dave Chinner wrote:
> > Probably the best idea, but see fs/iomap.c since we're basically leasing
> > a chunk of file space to the kernel. Leasing space to a user that wants
> > direct access is becoming rather common (rdma, map_sync, etc.)
>
> thing is, we don't want in-kernel users of fiemap. We've got other
> block mapping interfaces that can be used, such as iomap...
Agreed. fiemap is in many ways just as bad as bmap - it is an
information at a given point in time interface. It is more detailed
than bmap and allows better error reporting, but it still is
fundamentally the wrong thing to use for any sort of the I/O path.
>
> > > 3. Add an XFS-specific implementation of swapfile_activate.
> >
> > Ugh no.
>
> What we want is an iomap-based re-implementation of
> generic_swap_activate(). One of the ways to plumb that in is to
> use ->swapfile_activate() like so:
Hmm. Fundamentally swap is the same problem as the pNFS block layout
or get_user_pages on DAX mappings - we want to get a 'lease' on the
current block mapping, and make sure it stays that way as the external
user (the swap code in this case) uses it. The twist for the swap code
is mostly that it never wants to break the least but instead disallow
any external operation, but that's not really such a big difference.
So maybe we want a layout based swap code instead of reinventing it,
with the slight twist to the layout break code to never try a lease
break and just return an error for the IS_SWAPFILE case.
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: fallocate on XFS for swap
2018-03-10 9:38 ` Christoph Hellwig
@ 2018-03-12 21:46 ` Dave Chinner
2018-03-13 7:14 ` Christoph Hellwig
0 siblings, 1 reply; 11+ messages in thread
From: Dave Chinner @ 2018-03-12 21:46 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Darrick J. Wong, Besogonov, Aleksei,
linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, xfs
On Sat, Mar 10, 2018 at 01:38:44AM -0800, Christoph Hellwig wrote:
> On Sat, Mar 10, 2018 at 11:58:50AM +1100, Dave Chinner wrote:
> > > > 3. Add an XFS-specific implementation of swapfile_activate.
> > >
> > > Ugh no.
> >
> > What we want is an iomap-based re-implementation of
> > generic_swap_activate(). One of the ways to plumb that in is to
> > use ->swapfile_activate() like so:
>
> Hmm. Fundamentally swap is the same problem as the pNFS block layout
> or get_user_pages on DAX mappings - we want to get a 'lease' on the
> current block mapping, and make sure it stays that way as the external
> user (the swap code in this case) uses it. The twist for the swap code
> is mostly that it never wants to break the least but instead disallow
> any external operation, but that's not really such a big difference.
True.
> So maybe we want a layout based swap code instead of reinventing it,
> with the slight twist to the layout break code to never try a lease
> break and just return an error for the IS_SWAPFILE case.
Hmmm - won't that change user visible behaviour on swapfiles? Not
that it would be a bad thing to reject read/write from root on swap
files, but it would make XFS different to everything else.
Speaking of which - we probably need to spend some time at LSFMM in
the fs track talking about the iomap infrastructure and long term
plans to migrate the major filesystems to it....
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: fallocate on XFS for swap
2018-03-12 21:46 ` Dave Chinner
@ 2018-03-13 7:14 ` Christoph Hellwig
0 siblings, 0 replies; 11+ messages in thread
From: Christoph Hellwig @ 2018-03-13 7:14 UTC (permalink / raw)
To: Dave Chinner
Cc: Christoph Hellwig, Darrick J. Wong, Besogonov, Aleksei,
linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, xfs
On Tue, Mar 13, 2018 at 08:46:26AM +1100, Dave Chinner wrote:
> > So maybe we want a layout based swap code instead of reinventing it,
> > with the slight twist to the layout break code to never try a lease
> > break and just return an error for the IS_SWAPFILE case.
>
> Hmmm - won't that change user visible behaviour on swapfiles? Not
> that it would be a bad thing to reject read/write from root on swap
> files, but it would make XFS different to everything else.
We already can't writew to active swap files, thank god:
root@testvm:~# dd if=/dev/zero of=swapfile bs=1M count=64
64+0 records in
64+0 records out
67108864 bytes (67 MB, 64 MiB) copied, 0.0458446 s, 1.5 GB/s
mkswap swapfile
mkswap: swapfile: insecure permissions 0644, 0600 suggested.
Setting up swapspace version 1, size = 64 MiB (67104768 bytes)
no label, UUID=bb42b883-f224-4627-8580-c1ba9f4569ab
root@testvm:~# swapon swapfile
swapon: /root/swapfile: insecure permissions 0644, 0600 suggested.
[ 54.165439] Adding 65532k swap on /root/swapfile. Priority:-2 extents:1 across:65532k
root@testvm:~# dd if=/dev/zero of=swapfile bs=1M count=64
dd: failed to open 'swapfile': Text file busy
>
> Speaking of which - we probably need to spend some time at LSFMM in
> the fs track talking about the iomap infrastructure and long term
> plans to migrate the major filesystems to it....
I won't be there, as I'll be busy working the local election ballot.
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: fallocate on XFS for swap
2018-03-10 0:58 ` Dave Chinner
2018-03-10 1:17 ` Darrick J. Wong
2018-03-10 9:38 ` Christoph Hellwig
@ 2018-03-12 18:40 ` Besogonov, Aleksei
2 siblings, 0 replies; 11+ messages in thread
From: Besogonov, Aleksei @ 2018-03-12 18:40 UTC (permalink / raw)
To: Dave Chinner, Darrick J. Wong
Cc: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, xfs
On 3/9/18, 16:58, "Dave Chinner" <david@fromorbit.com> wrote:
On Fri, Mar 09, 2018 at 03:44:22PM -0800, Darrick J. Wong wrote:
> [you really ought to cc the xfs list]
>
> On Fri, Mar 09, 2018 at 10:05:24PM +0000, Besogonov, Aleksei wrote:
> > Hi!
> >
> > We’re working at Amazon on making XFS our default root filesystem for
> > the upcoming Amazon Linux 2 (now in prod preview). One of the problems
> > that we’ve encountered is inability to use fallocated files for swap
> > on XFS. This is really important for us, since we’re shipping our
> > current Amazon Linux with hibernation support .
>
> <shudder>
>
> > I’ve traced the problem to bmap(), used in generic_swapfile_activate
> > call, which returns 0 for blocks inside holes created by fallocate and
> > Dave Chinner confirmed it in a private email. I’m thinking about ways
> > to fix it, so far I see the following possibilities:
> >
> > 1. Change bmap() to not return zeroes for blocks inside holes. But
> > this is an ABI change and it likely will break some obscure userspace
> > utility somewhere.
>
> bmap is a horrible interface, let's leave it to wither and eventually go
> away.
>
> > 2. Change generic_swap_activate to use a more modern interface, by
> > adding fiemap-like operation to address_space_operations with fallback
> > on bmap().
>
> Probably the best idea, but see fs/iomap.c since we're basically leasing
> a chunk of file space to the kernel. Leasing space to a user that wants
> direct access is becoming rather common (rdma, map_sync, etc.)
thing is, we don't want in-kernel users of fiemap. We've got other
block mapping interfaces that can be used, such as iomap...
> > 3. Add an XFS-specific implementation of swapfile_activate.
>
> Ugh no.
What we want is an iomap-based re-implementation of
generic_swap_activate(). One of the ways to plumb that in is to
use ->swapfile_activate() like so:
iomap_swapfile_activate()
{
return iomap_apply(... iomap_swapfile_add_extent, ...)
}
xfs_vm_swapfile_activate()
{
return iomap_swapfile_activate(xfs_iomap_ops);
}
.swapfile_activate = xfs_vm_swapfile_activate()
And massage the swapfile_activate callout be friendly to fragmented
files. i.e. change the nfs caller to run a
"add_single_swap_extent()" caller rather than have to do it in the
generic code on return....
This sounds reasonable, I'll try to implement it this week or so for XFS.
No guarantees about NFS, though.
^ permalink raw reply [flat|nested] 11+ messages in thread