Re: [RFC PATCH 0/14] xfs: Towards thin provisioning aware filesystems

* Re: [RFC PATCH 0/14] xfs: Towards thin provisioning aware filesystems
       [not found] <20171026083322.20428-1-david@fromorbit.com>
@ 2017-10-26 11:09 ` Amir Goldstein
  2017-10-26 12:35   ` Dave Chinner
  0 siblings, 1 reply; 3+ messages in thread
From: Amir Goldstein @ 2017-10-26 11:09 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, linux-fsdevel

On Thu, Oct 26, 2017 at 11:33 AM, Dave Chinner <david@fromorbit.com> wrote:
> This patchset is aimed at filesystems that are installed on sparse
> block devices, a.k.a thin provisioned devices. The aim of the
> patchset is to bring the space management aspect of the storage
> stack up into the filesystem rather than keeping it below the
> filesystem where users and the filesystem have no clue they are
> about to run out of space.
>
> The idea is that thin block devices will be massively
> over-provisioned giving the filesystem a large block device address
> space to manage, but the filesystem presents itself as a much
> smaller filesystem. That is, the space the filesystem presents to
> users is much lower than what the address space of the block device
> provides.
>
> This somewhat turns traditional thin provisioning on its head.
> Admins are used to lying through their teeth to users about how much
> space they have available, and then they hope to hell that users
> never try to store as much data as they've been "provisioned" with.
> As a result, the traditional failure case is the block device
> running out of space all of a sudden and the filesystem and
> users wondering WTF just went wrong with their system.
>
> Moving the space management up into the filesystem by itself doesn't
> solve this problem - the thin storage pools can still be
> over-committed - but it does allow a new way of managing the space.
> Essentially, growing or shrinking a thin filesystem is an
> operation that only takes a couple of milliseconds to do because
> it's just an accounting trick. It's far less complex than creating
> a new file, or even reading data from a file.
>
> Freeing unused space from the filesystem isn't done during a shrink
> operation. It is done through discard operations, either dynamically
> via the discard mount option or, preferably, by an fstrim
> invocation. This means freeing space in the thin pool is not in any
> way related to the management of the filesystem size and space
> enforcement even during a grow or shrink operation.
>
> What it means is that the filesystem controls the amount of active
> data the user can have in the thin pool. The thin pool usage may be
> more or less, depending on snapshots, deduplication,
> freed-but-not-discarded space, etc. And because of how low the
> overhead of changing the accounting is, users don't need to be given
> a filesystem with all the space they might need once in a blue moon.
> It is trivial to expand when needed, and shrink and release when the
> data is removed.
>
> Yes, the underlying thin device that the filesystem sits on gets
> provisioned at the "once in a blue moon" size that is requested,
> but until that space is needed the filesystem can run at low amounts
> of reported free space and so prevent the likelihood of sudden
> thin device pool depletion.
>
> Normally, running a filesystem for long periods of time at low
> amounts of free space is a bad thing. However, for a thin
> filesystem, a low amount of usable free space doesn't mean the
> filesystem is running near full. The filesystem still has the full
> block device address space to work with, so has oodles of contiguous
> free space hidden from the user. Hence it's not until the thin filesystem
> grows to be near "non-thin" and is near full that the traditional
> "running near ENOSPC" problems arise.
>
> How to stop that from ever happening? e.g. Someone needs 100GB of
> space now, but maybe much more than that in a year. So provision a
> 10TB thin block device and put a 100GB thin filesystem on it.
> Problems won't arise until it's been grown to 100x its original
> size.
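
[Editorial sketch] The 10TB/100GB example above can be modelled as pure
accounting. This is a toy model, not the patchset's code: the class and
field names (ThinFilesystem, address_space, usable, used) are assumptions
made up for illustration, and decimal units are used so the ratio comes
out at exactly 100x.

```python
import errno
from dataclasses import dataclass

GB = 10**9
TB = 10**12


@dataclass
class ThinFilesystem:
    """Toy model: large device address space, smaller user-visible size."""
    address_space: int  # bytes addressable on the thin block device
    usable: int         # bytes the filesystem presents to users
    used: int = 0       # bytes of active user data

    def free(self) -> int:
        # Free space is reported against the usable limit, not the device.
        return self.usable - self.used

    def headroom(self) -> int:
        # How many times the filesystem can grow before it is "non-thin".
        return self.address_space // self.usable

    def resize(self, new_usable: int) -> None:
        # Grow and shrink only move the accounting limit; no data moves.
        if new_usable < self.used:
            raise ValueError("cannot shrink below the space already in use")
        if new_usable > self.address_space:
            raise ValueError("cannot grow past the device address space")
        self.usable = new_usable

    def write(self, nbytes: int) -> None:
        # The filesystem, not the thin pool, is what says ENOSPC.
        if nbytes > self.free():
            raise OSError(errno.ENOSPC, "No space left on device")
        self.used += nbytes


fs = ThinFilesystem(address_space=10 * TB, usable=100 * GB)
fs.write(60 * GB)
fs.resize(200 * GB)  # a grow is just a new limit
```

Freeing blocks in the thin pool is deliberately absent from resize(); per
the description above that is the job of discard/fstrim, not of shrink.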
>
> Yeah, it all requires thinking about the way storage is provisioned
> and managed a little bit differently, but the key point to realise
> is that grow and shrink effectively become free operations on
> thin devices if the filesystem is aware that it's on a thin device.
>
> The patchset has several parts to it. It is built on a 4.14-rc5
> kernel with for-next and Darrick's scrub tree from a couple of days
> ago merged into it.
>
> The first part of the series is a growfs refactoring. This can
> probably stand alone, and the idea is to move the refactored
> infrastructure into libxfs so it can be shared with mkfs. This also
> cleans up a lot of the cruft in growfs and so makes it much easier
> to add the changes later in the series.
>
> The second part of the patchset moves the functionality of
> sb_dblocks into the struct xfs_mount. This provides the separation
> of address space checks and capacity related calculations that the
> thinspace mods require. This also fixes the problem of freshly made,
> empty filesystems reporting 2% of the space as used.
>
> The XFS_IOC_FSGEOMETRY ioctl needed to be bumped to a new version
> because the structure needed growing.
>
> Finally, there are the patches that provide thinspace support and the
> growfs mods needed to grow and shrink.
>
> I've smoke tested the non-thinspace code paths (running auto tests
> on a scrub enabled kernel+userspace right now) as I haven't updated
> the userspace code to exercise the thinp code paths yet. I know the
> concept works, but my userspace code has an older on-disk format
> from the prototype so it will take me a couple of days to update and
> work out how to get fstests to integrate it reliably. So this is
> mainly a heads-up RFC patchset....
>
> Comments, thoughts, flames all welcome....
>

This proposal is very interesting outside the scope of xfs, so I hope you
don't mind I've CC'ed fsdevel.

I am thinking about how a slightly similar approach could be used to online shrink
the physical size for filesystems that are not on thin provisioned devices:

- Set/get a geometry variable of "agsoftlimit" (better names are welcome)
  which is <= agcount.
- agsoftlimit < agcount means that free space of AG > agsoftlimit is zero,
  so total disk space usage will not show this space as available user space.
- inode and block allocators will avoid dipping into the high AG pool,
  except for metadata blocks needed for freeing high AG inodes/blocks.
- A variant of xfs_fsr (or e4defrag for that matter) could "migrate" inodes
  and/or blocks from high to low AGs.
- Migrating directories is quite different than migrating files, but doable.
- Finally, on XFS_IOC_FSGROWFSDATA, if shrinking filesystem size and
  high AG usage counters are zero, then physical size can be shrunk
  down to agsoftlimit instead of reducing usable_blocks.
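
[Editorial sketch] The steps listed above can be modelled roughly as
follows. This is only a toy illustration under stated assumptions: "high"
AGs are taken to be those with index >= agsoftlimit, the AG and Filesystem
classes and their method names are hypothetical, and the exception for
metadata blocks needed while freeing high-AG space is not modelled.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class AG:
    size: int      # blocks in this allocation group
    used: int = 0  # blocks currently allocated

    @property
    def free(self) -> int:
        return self.size - self.used


@dataclass
class Filesystem:
    ags: List[AG]
    agsoftlimit: int  # assumption: AGs with index >= agsoftlimit are "high"

    def reported_free(self) -> int:
        # Free space in high AGs is treated as zero for user accounting.
        return sum(ag.free for ag in self.ags[:self.agsoftlimit])

    def pick_ag(self, nblocks: int) -> Optional[int]:
        # Allocators only consider AGs below the soft limit.
        for agno, ag in enumerate(self.ags[:self.agsoftlimit]):
            if ag.free >= nblocks:
                return agno
        return None

    def can_shrink_physically(self) -> bool:
        # Physical shrink is only allowed once every high AG is empty.
        return all(ag.used == 0 for ag in self.ags[self.agsoftlimit:])

    def shrink_physically(self) -> None:
        if not self.can_shrink_physically():
            raise RuntimeError("high AGs still hold inodes or blocks")
        del self.ags[self.agsoftlimit:]


fs = Filesystem([AG(100, 90), AG(100, 10), AG(100, 5), AG(100)], agsoftlimit=2)
```

Here fs.can_shrink_physically() stays false until something (the xfs_fsr
variant in the proposal) has migrated the 5 blocks out of AG 2.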

With this, xfs can gain physical shrink support and ext4 can gain online
(and safe) shrink support.

Assuming that this idea is not shot down on sight, the only implication
I can think of w.r.t your current patches is leaving enough room in new APIs
to accommodate this prospective functionality.

You have already reserved 15 u64 in geometry V5 ioctl struct, so that's good.
You have not changed XFS_IOC_FSGROWFSDATA at all, so going forward
the ambiguity of physical shrink vs. virtual shrink could either be determined
by heuristics (shrink physical if usable == physical > agsoftlimit) or a new
ioctl would be introduced to disambiguate the intention.
I have a suggestion for a 3rd option, but I'll post it on the relevant patch.


Thanks,
Amir.

* Re: [RFC PATCH 0/14] xfs: Towards thin provisioning aware filesystems
  2017-10-26 11:09 ` [RFC PATCH 0/14] xfs: Towards thin provisioning aware filesystems Amir Goldstein
@ 2017-10-26 12:35   ` Dave Chinner
  2017-11-01 22:31     ` Darrick J. Wong
  0 siblings, 1 reply; 3+ messages in thread
From: Dave Chinner @ 2017-10-26 12:35 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: linux-xfs, linux-fsdevel

On Thu, Oct 26, 2017 at 02:09:26PM +0300, Amir Goldstein wrote:
> On Thu, Oct 26, 2017 at 11:33 AM, Dave Chinner <david@fromorbit.com> wrote:
> > This patchset is aimed at filesystems that are installed on sparse
> > block devices, a.k.a thin provisioned devices. The aim of the
> > patchset is to bring the space management aspect of the storage
> > stack up into the filesystem rather than keeping it below the
> > filesystem where users and the filesystem have no clue they are
> > about to run out of space.
.....
> > I've smoke tested the non-thinspace code paths (running auto tests
> > on a scrub enabled kernel+userspace right now) as I haven't updated
> > the userspace code to exercise the thinp code paths yet. I know the
> > concept works, but my userspace code has an older on-disk format
> > from the prototype so it will take me a couple of days to update and
> > work out how to get fstests to integrate it reliably. So this is
> > mainly a heads-up RFC patchset....
> >
> > Comments, thoughts, flames all welcome....
> >
> 
> This proposal is very interesting outside the scope of xfs, so I hope you
> don't mind I've CC'ed fsdevel.
> 
> I am thinking how a slightly similar approach could be used to online shrink
> the physical size for filesystems that are not on thin provisioned devices:
> 
> - Set/get a geometry variable of "agsoftlimit" (better names are welcome)
>   which is <= agcount.
> - agsoftlimit < agcount means that free space of AG > agsoftlimit is zero,
>   so total disk space usage will not show this space as available user space.
> - inode and block allocators will avoid dipping into the high AG pool,
>   except for metadata blocks needed for freeing high AG inodes/blocks.
> - A variant of xfs_fsr (or e4defrag for that matter) could "migrate" inodes
>   and/or blocks from high to low AGs.
> - Migrating directories is quite different than migrating files, but doable.
> - Finally, on XFS_IOC_FSGROWFSDATA, if shrinking filesystem size and
>   high AG usage counters are zero, then physical size can be shrunk
>   down to agsoftlimit instead of reducing usable_blocks.

Yup, you've just described all the craziness that a physical shrink
requires on XFS. Lots of new user APIs, new tools to move data
around, new code to transparently migrate directories and other
metadata (like xattrs), etc.

Also, the log is placed half way through the XFS filesystem, so
unless we add code to allocate and switch to a new journal (in a
crash safe and recoverable way!) we can't shrink by more than 50%.

Also, none of the growfs code touches existing AGs - they'll have to
be scanned to determine they really are empty before they get
removed from the filesystem, and then there are the other issues like
we can't shrink to less than 2 AGs, which puts a significant minimum
shrink size on filesystems (again there's that "shrink more than 50%
requires a lot more work" problem for filesystems < 4TB).
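
[Editorial sketch] The two limits above can be put into rough numbers.
This assumes, for illustration only, that a shrink removes whole AGs from
the end of the filesystem, that the internal log sits in AG number
agcount // 2, and that at least 2 AGs must remain; the function names are
hypothetical.

```python
from typing import Optional


def min_agcount_after_shrink(log_agno: int, min_ags: int = 2) -> int:
    # The AG holding the log cannot be removed without relocating the log,
    # and the filesystem cannot drop below min_ags allocation groups.
    return max(min_ags, log_agno + 1)


def max_shrink_fraction(agcount: int, log_agno: Optional[int] = None) -> float:
    # Largest fraction of the filesystem that could be cut off the end.
    if log_agno is None:
        log_agno = agcount // 2  # assumption: log placed half way through
    keep = min_agcount_after_shrink(log_agno)
    return (agcount - keep) / agcount


for agcount in (2, 4, 16):
    print(agcount, max_shrink_fraction(agcount))
```

Under these assumptions the fraction never reaches 50% for any AG count,
and a 2-AG filesystem cannot shrink by whole AGs at all.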

And to do it efficiently, we really need rmap support in filesystems
so the fs can tell us what files and metadata need to be moved,
rather than having to do brute force scans to work out what needs
moving. Especially as the brute force scans can't find all the
metadata that we might need to relocate before we've emptied the
space we need to stop using.
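
[Editorial sketch] The value of a reverse mapping for shrink is that the
question "who owns anything in this block range?" is answered from an
index keyed by block number instead of by walking every file. The
following is a minimal in-memory illustration, not the XFS rmap btree; the
ReverseMap class, its record layout and the owner strings are all
hypothetical.

```python
import bisect
from typing import List, Tuple

Record = Tuple[int, int, str]  # (start block, length, owner description)


class ReverseMap:
    def __init__(self, records: List[Record]):
        # Keep records sorted by start block so lookups can use bisect.
        self._records = sorted(records)
        self._starts = [start for start, _, _ in self._records]

    def owners_in_range(self, lo: int, hi: int) -> List[str]:
        """Owners of any extent overlapping the block range [lo, hi)."""
        # Records starting at or after hi cannot overlap the range.
        idx = bisect.bisect_left(self._starts, hi)
        return sorted({
            owner
            for start, length, owner in self._records[:idx]
            if start + length > lo
        })


rmap = ReverseMap([
    (0, 10, "inode 5"),
    (10, 5, "dir 7"),
    (100, 20, "inode 9"),
    (120, 8, "xattr of inode 9"),
])
# Everything that must be relocated before blocks 100..127 can be dropped:
to_move = rmap.owners_in_range(100, 128)
```

Note that the index returns metadata owners (the xattr block here) as well
as file data, which is the part a scan of file extent lists would miss.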

IOWs, it's a *lot* of work, and IMO there's more work in
verification and proving that everything is crash safe, recoverable
and restartable. We've known how much work it is for years - why do
you think it hasn't been implemented? See:

http://xfs.org/index.php/Shrinking_Support

And:

http://xfs.org/index.php/Unfinished_work#The_xfs_reno_tool

And specifically follow the reference to a discussion in 2007:

https://marc.info/?l=linux-xfs&m=119131697224361&w=2

> With this, xfs can gain physical shrink support and ext4 can gain online
> (and safe) shrink support.

Yes, I estimate it'll probably take about a man-year's worth of work
to get xfs shrink to production ready from all the pieces we have
sitting around today.

> Assuming that this idea is not shot down on sight, the only implication
> I can think of w.r.t your current patches is leaving enough room in new APIs
> to accommodate this prospective functionality.

I'm not introducing any new APIs. XFS_IOC_FSGROWFSDATA already
supports shrinking and resizing/moving the log, they just aren't
implemented.

> You have already reserved 15 u64 in geometry V5 ioctl struct, so that's good.
> You have not changed XFS_IOC_FSGROWFSDATA at all, so going forward
> the ambiguity of physical shrink vs. virtual shrink could either be determined
> by heuristics

No heuristics at all. Filesystems on thin devices will have a
feature bit in the superblock indicating they are thin filesystems.
If the "thinspace" bit is set, shrink is just an accounting
operation. If it's not set, then it needs to physically change the
geometry of the filesystem....
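
[Editorial sketch] The dispatch described above could look roughly like
this. It is only an illustration: the THINSPACE constant, the Superblock
fields and growfs_data() are hypothetical stand-ins, not the on-disk
format or the XFS_IOC_FSGROWFSDATA implementation.

```python
from dataclasses import dataclass

THINSPACE = 1 << 0  # hypothetical feature bit, value chosen arbitrarily


@dataclass
class Superblock:
    features: int       # feature bits
    dblocks: int        # physical size of the data device address space
    usable_blocks: int  # size presented to users
    used_blocks: int = 0


def growfs_data(sb: Superblock, new_blocks: int) -> str:
    """Resize the data section; returns which kind of resize was done."""
    if sb.features & THINSPACE:
        # Thin filesystem: only the accounting limit changes.
        if not sb.used_blocks <= new_blocks <= sb.dblocks:
            raise ValueError("new size outside [used, address space]")
        sb.usable_blocks = new_blocks
        return "accounting"
    # Non-thin filesystem: the geometry itself has to change.
    if new_blocks < sb.dblocks:
        raise NotImplementedError("physical shrink is not implemented")
    sb.dblocks = sb.usable_blocks = new_blocks
    return "physical"


thin = Superblock(features=THINSPACE, dblocks=1000, usable_blocks=100)
thick = Superblock(features=0, dblocks=1000, usable_blocks=1000)
```

The same call shrinks `thin` by accounting alone and refuses to shrink
`thick`, with no heuristic involved: the feature bit decides.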

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [RFC PATCH 0/14] xfs: Towards thin provisioning aware filesystems
  2017-10-26 12:35   ` Dave Chinner
@ 2017-11-01 22:31     ` Darrick J. Wong
  0 siblings, 0 replies; 3+ messages in thread
From: Darrick J. Wong @ 2017-11-01 22:31 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Amir Goldstein, linux-xfs, linux-fsdevel

On Thu, Oct 26, 2017 at 11:35:48PM +1100, Dave Chinner wrote:
> On Thu, Oct 26, 2017 at 02:09:26PM +0300, Amir Goldstein wrote:
> > On Thu, Oct 26, 2017 at 11:33 AM, Dave Chinner <david@fromorbit.com> wrote:
> > > This patchset is aimed at filesystems that are installed on sparse
> > > block devices, a.k.a thin provisioned devices. The aim of the
> > > patchset is to bring the space management aspect of the storage
> > > stack up into the filesystem rather than keeping it below the
> > > filesystem where users and the filesystem have no clue they are
> > > about to run out of space.
> .....
> > > I've smoke tested the non-thinspace code paths (running auto tests
> > > on a scrub enabled kernel+userspace right now) as I haven't updated
> > > the userspace code to exercise the thinp code paths yet. I know the
> > > concept works, but my userspace code has an older on-disk format
> > > from the prototype so it will take me a couple of days to update and
> > > work out how to get fstests to integrate it reliably. So this is
> > > mainly a heads-up RFC patchset....
> > >
> > > Comments, thoughts, flames all welcome....
> > >
> > 
> > This proposal is very interesting outside the scope of xfs, so I hope you
> > don't mind I've CC'ed fsdevel.
> > 
> > I am thinking how a slightly similar approach could be used to online shrink
> > the physical size for filesystems that are not on thin provisioned devices:
> > 
> > - Set/get a geometry variable of "agsoftlimit" (better names are welcome)
> >   which is <= agcount.
> > - agsoftlimit < agcount means that free space of AG > agsoftlimit is zero,
> >   so total disk space usage will not show this space as available user space.
> > - inode and block allocators will avoid dipping into the high AG pool,
> >   except for metadata blocks needed for freeing high AG inodes/blocks.
> > - A variant of xfs_fsr (or e4defrag for that matter) could "migrate" inodes
> >   and/or blocks from high to low AGs.
> > - Migrating directories is quite different than migrating files, but doable.
> > - Finally, on XFS_IOC_FSGROWFSDATA, if shrinking filesystem size and
> >   high AG usage counters are zero, then physical size can be shrunk
> >   down to agsoftlimit instead of reducing usable_blocks.
> 
> Yup, you've just described all the craziness that a physical shrink
> requires on XFS. Lots of new user APIs, new tools to move data
> around, new code to transparently migrate directories and other
> metadata (like xattrs), etc.
> 
> Also, the log is placed half way through the XFS filesystem, so
> unless we add code to allocate and switch to a new journal (in a
> crash safe and recoverable way!) we can't shrink by more than 50%.
> 
> Also, none of the growfs code touches existing AGs - they'll have to
> be scanned to determine they really are empty before they get
> removed from the filesystem, and then there are the other issues like
> we can't shrink to less than 2 AGs, which puts a significant minimum
> shrink size on filesystems (again there's that "shrink more than 50%
> requires a lot more work" problem for filesystems < 4TB).
> 
> And to do it efficiently, we really need rmap support in filesystems
> so the fs can tell us what files and metadata need to be moved,
> rather than having to do brute force scans to work out what needs
> moving. Especially as the brute force scans can't find all the
> metadata that we might need to relocate before we've emptied the
> space we need to stop using.
> 
> IOWs, it's a *lot* of work, and IMO there's more work in
> verification and proving that everything is crash safe, recoverable
> and restartable. We've known how much work it is for years - why do
> you think it hasn't been implemented? See:
> 
> http://xfs.org/index.php/Shrinking_Support
> 
> And:
> 
> http://xfs.org/index.php/Unfinished_work#The_xfs_reno_tool
> 
> And specifically follow the reference to a discussion in 2007:
> 
> https://marc.info/?l=linux-xfs&m=119131697224361&w=2
> 
> > With this, xfs can gain physical shrink support and ext4 can gain online
> > (and safe) shrink support.
> 
> Yes, I estimate it'll probably take about a man-year's worth of work
> to get xfs shrink to production ready from all the pieces we have
> sitting around today.

Ewww, physical shrink.  Maybe that becomes feasible after parent pointer
support lands, both from a "making the directory rewrite easier" and a
"do the reviewers have time for this?" perspective. :)

I've worked on bashing resize2fs into better shape for shrink support;
the things you have to do (even on ext4, which doesn't share extents) to
the fs are pretty awful.  Ideally you'd move whole extents (or just
defrag the file into the space that will be left) but once reflink comes
into play you /have/ to have a strategy for maintaining the sharedness
across the migration or else you run the risk of blowing up the space
usage.

That's a lot to review, even if the strategy is "bail out with ENOSPC
having potentially done a ton of work and/or fragmented the fs".
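
[Editorial sketch] The space blow-up from losing sharedness can be shown
with a small calculation. This is a toy model, not resize2fs or XFS code:
files are represented as lists of extent ids, and the function name and
data layout are assumptions made for illustration.

```python
from typing import Dict, List


def space_after_migration(files: Dict[str, List[int]],
                          extent_len: Dict[int, int],
                          preserve_sharing: bool) -> int:
    """Blocks needed at the destination after moving every file."""
    if preserve_sharing:
        # Each shared extent is moved once and remapped into all owners.
        unique = {ext for extents in files.values() for ext in extents}
        return sum(extent_len[ext] for ext in unique)
    # Naive per-file copy: every owner gets its own private copy.
    return sum(extent_len[ext]
               for extents in files.values()
               for ext in extents)


# Extent 1 (100 blocks) is shared by three files via reflink.
files = {"a": [1, 2], "b": [1, 3], "c": [1]}
extent_len = {1: 100, 2: 10, 3: 20}

shared = space_after_migration(files, extent_len, preserve_sharing=True)
naive = space_after_migration(files, extent_len, preserve_sharing=False)
```

With these made-up numbers the naive copy needs 330 blocks against 130
when sharing is preserved, which is how a shrink could run into ENOSPC
part way through.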

--D

> > Assuming that this idea is not shot down on sight, the only implication
> > I can think of w.r.t your current patches is leaving enough room in new APIs
> > to accommodate this prospective functionality.
> 
> I'm not introducing any new APIs. XFS_IOC_FSGROWFSDATA already
> supports shrinking and resizing/moving the log, they just aren't
> implemented.
> 
> > You have already reserved 15 u64 in geometry V5 ioctl struct, so that's good.
> > You have not changed XFS_IOC_FSGROWFSDATA at all, so going forward
> > the ambiguity of physical shrink vs. virtual shrink could either be determined
> > by heuristics
> 
> No heuristics at all. Filesystems on thin devices will have a
> feature bit in the superblock indicating they are thin filesystems.
> If the "thinspace" bit is set, shrink is just an accounting
> operation. If it's not set, then it needs to physically change the
> geometry of the filesystem....
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
