* Some questions about per-ag metadata space reservations...
@ 2017-09-06 10:30 Dave Chinner
2017-09-07 13:44 ` Brian Foster
0 siblings, 1 reply; 7+ messages in thread
From: Dave Chinner @ 2017-09-06 10:30 UTC (permalink / raw)
To: linux-xfs
Hi folks,
I've got a bit of a problem with the per-ag reservations we are
using at the moment. The existence of them is fine, but the
implementation is problematic for something I'm working on right
now.
I've been making a couple of mods to the filesystem to separate
physical space accounting from free space accounting to allow us to
optimise the filesystem for thinly provisioned devices. That is,
the filesystem is laid out as though it is the size of the
underlying device, but then free space is artificially limited. i.e.
we have a "physical size" of the filesystem and a "logical size"
that limits the amount of data and metadata that can actually be
stored in it.
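As a rough model of that split (all names here are invented for illustration, not the actual XFS code), the filesystem address space is laid out at the physical size, while a separate logical size caps what users can actually consume:

```python
# Minimal sketch of dual physical/logical space accounting for a thin
# filesystem. Hypothetical names; the real implementation lives in the
# XFS free space counters, not a Python class.

class ThinFs:
    def __init__(self, physical_blocks, logical_blocks):
        assert logical_blocks <= physical_blocks
        self.physical_blocks = physical_blocks  # layout size fixed at mkfs
        self.logical_blocks = logical_blocks    # user-visible capacity
        self.used_blocks = 0

    def free_blocks(self):
        # df reports against the logical size, not the physical one
        return self.logical_blocks - self.used_blocks

    def alloc(self, nblocks):
        # data/metadata allocations hit the logical limit first
        if nblocks > self.free_blocks():
            raise OSError("ENOSPC")
        self.used_blocks += nblocks

    def resize(self, new_logical_blocks):
        # grow/shrink is purely an accounting change; no data moves
        assert new_logical_blocks <= self.physical_blocks
        self.logical_blocks = new_logical_blocks
```

Under this model, growing or shrinking the filesystem within the physical layout only touches `logical_blocks`, which is what makes the fstrim-based shrink described below a pure accounting operation.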
When combined with a thinly provisioned device, this enables us to
shrink the XFS filesystem simply by running fstrim to punch all the
free space out of the underlying thin device and then adjusting the
free space down appropriately. Because the thin device abstracts the
physical location of the data in the block device away from the
address space presented to the filesystem, we don't need to move any
data or metadata to free up this space - it's just an accounting
change.
The problem arises with the per AG reservations in that they are
based on the physical size of the AG, which for a thin filesystem
will always be larger than the space available. e.g. we might
allocate a 32TB thin device to give 32x1TB AGs in the filesystem,
but we might only start by allocating 1TB of space to the
filesystem. e.g.:
# mkfs.xfs -f -m rmapbt=1,reflink=1 -d size=32t,thin=1t /dev/vdc
Default configuration sourced from package build definitions
meta-data=/dev/vdc isize=512 agcount=32, agsize=268435455 blks
= sectsz=512 attr=2, projid32bit=1
= crc=1 finobt=1, sparse=0, rmapbt=1, reflink=1
data = bsize=4096 blocks=268435456, imaxpct=5, thin=1
= sunit=0 swidth=0 blks
naming =version 2 bsize=4096 ascii-ci=0 ftype=1
log =internal log bsize=4096 blocks=521728, version=2
= sectsz=512 sunit=1 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
#
The issue shows up as soon as we mount it:
# mount /dev/vdc /mnt/scratch ; df -h /mnt/scratch/ ; sudo umount /mnt/scratch
Filesystem Size Used Avail Use% Mounted on
/dev/vdc 1023G 628G 395G 62% /mnt/scratch
#
Of that 1TB of space, we immediately remove 600+GB of free space for
finobt, rmapbt and reflink metadata reservations. This is based on
the physical size and number of AGs in the filesystem, so it always
gets removed from the free block count available to the user.
This is clearly seen when I grow the filesystem to 10x the size:
# xfs_growfs -D 2684354560 /mnt/scratch
....
data blocks changed from 268435456 to 2684354560
# df -h /mnt/scratch
Filesystem Size Used Avail Use% Mounted on
/dev/vdc 10T 628G 9.4T 7% /mnt/scratch
#
And it also shows up when shrinking back down a chunk, too:
# xfs_growfs -D 468435456 /mnt/scratch
.....
data blocks changed from 2684354560 to 468435456
# df -h /mnt/scratch
Filesystem Size Used Avail Use% Mounted on
/dev/vdc 1.8T 628G 1.2T 36% /mnt/scratch
#
(Oh, did I mention I have working code and that's how I came across
this problem? :P)
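A back-of-the-envelope illustration of why the "Used" figure stays pinned at ~628GB in all three df outputs above (the per-AG figures are the approximate ones discussed later in this thread, not the kernel's exact formulas): the reservation is computed from the physical AG count and size, so it never changes as the logical size grows or shrinks.

```python
# Approximate per-1TB-AG reservation figures quoted in this thread:
# ~1GB finobt + ~13GB rmapbt + ~6GB refcountbt. Illustrative only.
PER_AG_RESV_GB = 1 + 13 + 6
AGCOUNT = 32                      # 32 x 1TB AGs laid out at mkfs time

def reserved_gb():
    # depends only on the physical layout, never on the logical size
    return AGCOUNT * PER_AG_RESV_GB

def df_avail_gb(logical_gb):
    # the reservation always comes off the user-visible free space
    return logical_gb - reserved_gb()
```

With a 1TB logical size this leaves only ~400GB visible to the user, matching the first df output above to within the roughness of the figures.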
For a normal filesystem, there's no problem with doing this brute
force physical reservation, though it is slightly disconcerting to
see a new, empty 100TB filesystem say it's got 2TB used and only
98TB free...
The issue is that for a thin filesystem, this space reservation
comes out of the *logical* free space, not the physical free space.
With 1TB of thin space, we've got 31TB of /physical free space/ the
reservation can be taken out of without the user ever seeing it. The
question is this: how on earth do I do this?
I want the available space to match the "thin=size" value on the
mkfs command line, but I don't want metadata reservations to take
away from this space. Metadata allocations need to be accounted to
the available space, but the reservations should not be. So how
should I go about providing these reservations? Do we even need them
to be accounted against free space in this case where we control the
filesystem free blocks to be a /lot/ less than the physical space?
e.g. if I limit a thin filesystem to 95% of the underlying thin
device size, then we've always got a 5% space margin and so we don't
need to take the reservations out of the global free block counter
to ensure we always have physical space for the metadata. We still
take the per-ag reservations to ensure everything still works on the
physical side, we just don't pull the space from the free block
counter. I think this will work, but I'm not sure I've fully grokked
all the conditions the per-ag reservation is protecting against or
whether there's more accounting work needed deep in allocation code
to make it work correctly.
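The sizing rule proposed above can be sketched as follows (the function name and the 5% figure are illustrative, not anything in the XFS code): cap the logical size so the leftover physical space always covers the worst-case metadata reservation, making it unnecessary to charge the reservation to the global free block counter.

```python
# Sketch: pick the largest logical size that leaves both a fixed margin
# and room for the maximum per-AG metadata reservation in physical space.

def max_logical_blocks(physical_blocks, max_metadata_resv_blocks,
                       margin=0.05):
    # honour whichever constraint is tighter: the percentage margin or
    # the worst-case reservation
    cap = int(physical_blocks * (1 - margin))
    return min(cap, physical_blocks - max_metadata_resv_blocks)
```

When the reservation exceeds the 5% margin, the reservation dominates; otherwise the margin does.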
Thoughts, anyone?
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Some questions about per-ag metadata space reservations...
  2017-09-06 10:30 Some questions about per-ag metadata space reservations Dave Chinner
@ 2017-09-07 13:44 ` Brian Foster
  2017-09-07 23:11   ` Dave Chinner
  0 siblings, 1 reply; 7+ messages in thread
From: Brian Foster @ 2017-09-07 13:44 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, Sep 06, 2017 at 08:30:54PM +1000, Dave Chinner wrote:
> Hi folks,
>
> I've got a bit of a problem with the per-ag reservations we are
> using at the moment. The existence of them is fine, but the
> implementation is problematic for something I'm working on right
> now.
>
> I've been making a couple of mods to the filesystem to separate
> physical space accounting from free space accounting to allow us to
> optimise the filesystem for thinly provisioned devices.
> ....

Interesting...

> When combined with a thinly provisioned device, this enables us to
> shrink the XFS filesystem simply by running fstrim to punch all the
> free space out of the underlying thin device and then adjusting the
> free space down appropriately. Because the thin device abstracts the
> physical location of the data in the block device away from the
> address space presented to the filesystem, we don't need to move any
> data or metadata to free up this space - it's just an accounting
> change.

How are you dealing with block size vs. thin chunk allocation size
alignment? You could require they match, but if not it seems like there
could be a bit more involved than an accounting change.

> ....
>
> (Oh, did I mention I have working code and that's how I came across
> this problem? :P)
>
> For a normal filesystem, there's no problem with doing this brute
> force physical reservation, though it is slightly disconcerting to
> see a new, empty 100TB filesystem say it's got 2TB used and only
> 98TB free...

Ugh, I think the reservation requirement there is kind of insane. We
reserve 1GB out of a 1TB fs just for finobt (13GB for rmap and 6GB for
reflink), most of which will probably never be used.

I'm not a big fan of this approach. I think the patch was originally
added because there was some unknown workload that reproduced a finobt
block allocation failure and filesystem shutdown that couldn't be
reproduced independently, hence the overkill reservation. I'd much
prefer to see if we can come up with something that is more dynamic in
nature.

For example, the finobt cannot be larger than the inobt. If we mount a
1TB fs with one inode chunk allocated in the fs, there is clearly no
immediate risk for the finobt to grow beyond a single block until more
inodes are allocated. I'm wondering if we could come up with something
that grows and shrinks the reservation as needed based on the size delta
between the inobt/finobt and rather than guarantee we can always create
a maximum size finobt, guarantee that the finobt can always grow to the
size of the inobt. I suppose this might require some clever accounting
tricks on finobt block allocation/free and some estimation at mount time
of an already populated fs. I've also not really gone through the per-AG
reservation stuff since it was originally reviewed, so this is all
handwaving atm.

Anyways, I think it would be nice if we could improve these reservation
requirements first and foremost, though I'm not sure I understand
whether that would address your issue...

> The issue is that for a thin filesystem, this space reservation
> comes out of the *logical* free space, not the physical free space.
> With 1TB of thin space, we've got 31TB of /physical free space/ the
> reservation can be taken out of without the user ever seeing it. The
> question is this: how on earth do I do this?

Hmm, so is the issue that the reservations aren't accounted out of
whatever counters you're using to artificially limit block allocation?
I'm a little confused.. ISTM that if you have a 32TB fs and have
artificially limited the free block accounting to 1TB based on available
physical space, the reservation accounting needs to be accounted against
that same artificially limited pool. IOW, it looks like the perag res
code eventually calls xfs_mod_fdblocks() just the same as we would for a
delayed allocation. Can you elaborate a bit on how your updated
accounting works and how it breaks this model?

> I want the available space to match the "thin=size" value on the
> mkfs command line, but I don't want metadata reservations to take
> away from this space. Metadata allocations need to be accounted to
> the available space, but the reservations should not be. So how
> should I go about providing these reservations? Do we even need them
> to be accounted against free space in this case where we control the
> filesystem free blocks to be a /lot/ less than the physical space?

I don't understand how you'd guarantee availability of physical blocks
for metadata if you don't account metadata block reservations out of the
(physically) available free space. ISTM the only way around that is to
eliminate the requirement for a reservation in the first place (i.e.,
allocate physical blocks up front or something like that).

Brian

> ....
>
> Thoughts, anyone?
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 7+ messages in thread
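Brian's dynamic-reservation idea in the reply above can be modelled roughly like this (function and parameter names are invented for illustration; this is not the kernel's accounting code): rather than reserving a maximum-size finobt up front, only guarantee that the finobt can grow to the current size of the inobt, adjusting the reservation as inode chunks come and go.

```python
# Hand-wavy model of a usage-based finobt reservation: the finobt can
# never hold more records than the inobt, so the worst-case growth still
# needed is bounded by the size delta between the two trees.

def finobt_resv_blocks(inobt_blocks, finobt_blocks):
    # reservation shrinks toward zero as the finobt catches up to the
    # inobt, and grows again as new inode chunks enlarge the inobt
    return max(inobt_blocks - finobt_blocks, 0)
```

A freshly mounted filesystem with one inode chunk would then reserve almost nothing, instead of the fixed ~1GB per 1TB AG quoted above.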
* Re: Some questions about per-ag metadata space reservations...
  2017-09-07 13:44 ` Brian Foster
@ 2017-09-07 23:11   ` Dave Chinner
  2017-09-08 13:33     ` Brian Foster
  0 siblings, 1 reply; 7+ messages in thread
From: Dave Chinner @ 2017-09-07 23:11 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Thu, Sep 07, 2017 at 09:44:58AM -0400, Brian Foster wrote:
> On Wed, Sep 06, 2017 at 08:30:54PM +1000, Dave Chinner wrote:
> > When combined with a thinly provisioned device, this enables us to
> > shrink the XFS filesystem simply by running fstrim to punch all the
> > free space out of the underlying thin device and then adjusting the
> > free space down appropriately. ....
>
> How are you dealing with block size vs. thin chunk allocation size
> alignment? You could require they match, but if not it seems like there
> could be a bit more involved than an accounting change.

Not a filesystem problem. If there's less pool space than you let
the filesystem have, then the pool will ENOSPC before the filesystem
will. Regular fstrim (which you should be doing on thin filesystems
anyway) will keep them mostly aligned because XFS tends to pack
holes in AG space rather than continually growing the space they
use.

.....

> > For a normal filesystem, there's no problem with doing this brute
> > force physical reservation, though it is slightly disconcerting to
> > see a new, empty 100TB filesystem say it's got 2TB used and only
> > 98TB free...
>
> Ugh, I think the reservation requirement there is kind of insane. We
> reserve 1GB out of a 1TB fs just for finobt (13GB for rmap and 6GB for
> reflink), most of which will probably never be used.

Yeah, the reservations are large, but the rmap/reflink ones are
necessary. I don't think finobt should use this mechanism - it
should not require more than a few blocks for any given inode chunk
allocation, and they should stop pretty quickly if the finobt blocks
are having to work around ENOSPC conditions by dipping into the
reserve pool.

> I'm not a big fan of this approach. I think the patch was originally
> added because there was some unknown workload that reproduced a finobt
> block allocation failure and filesystem shutdown that couldn't be
> reproduced independently, hence the overkill reservation. I'd much
> prefer to see if we can come up with something that is more dynamic in
> nature.

Based on the commit message, I think the justification for finobt
reservations was weak and wasn't backed up by analysis as to why the
reserve block pool was drained (which should never occur in normal
ENOSPC conditions). The per-ag reserve also requires a walk of every
finobt at mount time, so there's also mount time regressions for
filesystems with sparsely populated inode trees.

> For example, the finobt cannot be larger than the inobt. ....
>
> Anyways, I think it would be nice if we could improve these reservation
> requirements first and foremost, though I'm not sure I understand
> whether that would address your issue...

No, it doesn't really. Unless they are brought down to the size of
the existing reserve pool, it's going to be an issue....

> > The issue is that for a thin filesystem, this space reservation
> > comes out of the *logical* free space, not the physical free space.
> > With 1TB of thin space, we've got 31TB of /physical free space/ the
> > reservation can be taken out of without the user ever seeing it. The
> > question is this: how on earth do I do this?
>
> Hmm, so is the issue that the reservations aren't accounted out of
> whatever counters you're using to artificially limit block allocation?

No, the issue is that they are being accounted out of the existing
freespace counters. They are a persistent reservation that will
always be present. However, rather than hiding this unusable space
from users, we simply pull it from free space.

> I'm a little confused.. ISTM that if you have a 32TB fs and have
> artificially limited the free block accounting to 1TB based on available
> physical space, the reservation accounting needs to be accounted against
> that same artificially limited pool. IOW, it looks like the perag res
> code eventually calls xfs_mod_fdblocks() just the same as we would for a
> delayed allocation. Can you elaborate a bit on how your updated
> accounting works and how it breaks this model?

Yes, that's what it does to ensure users get ENOSPC for data
allocation before we run out of metadata reservation space, even if
we don't need the metadata reservation space. Its size is
physically bound by the AG size so we can calculate it any time we
know what the AG size is.

> > I want the available space to match the "thin=size" value on the
> > mkfs command line, but I don't want metadata reservations to take
> > away from this space. ....
>
> I don't understand how you'd guarantee availability of physical blocks
> for metadata if you don't account metadata block reservations out of the
> (physically) available free space. ISTM the only way around that is to
> eliminate the requirement for a reservation in the first place (i.e.,
> allocate physical blocks up front or something like that).

I was working on the idea that thin filesystems have sufficient
spare physical space (e.g. logical size < (physical size - max
metadata reservation)) that even when maxed out there's sufficient
physical space remaining for all the metadata without needing to
reserve that space.

In theory, this /should/ work as the metadata blocks are already
reserved as used space at mount time and hence the actual allocation
of those blocks is only accounted against the reservation, not the
global freespace counter. Hence these metadata blocks aren't
counted as used space when they are allocated - they are always
accounted as used space whether used or not. Hence if I remove the
"accounted as used space" part of the reservation, but then ensure
that there is physically enough room for them to always succeed, we
end up with exactly the same metadata space guarantee. The only
difference is how it's accounted and provided....

I've written the patches to do this, but I haven't tested it other
than checking falloc triggers ENOSPC when it's supposed to. I'm just
finishing off the repair support so I can run it through xfstests.
That will be interesting. :P

FWIW, I think there is a good case for storing the metadata
reservation on disk in the AGF and removing it from user visible
global free space. We already account for free space, rmap and
refcount btree block usage in the AGF, so we already have the
mechanisms for tracking the necessary per-ag metadata usage outside
of the global free space counters. Hence there doesn't appear to me
to be any reason why we can't do the per-ag metadata
reservation/usage accounting in the AGF and get rid of the in-memory
reservation stuff.

If we do that, users will end up with exactly the same amount of
free space, but the metadata reservations are no longer accounted as
user visible used space. i.e. the users never need to see the
internal space reservations we need to make the filesystem work
reliably. This would work identically for normal filesystems and
thin filesystems without needing to play special games for thin
filesystems....

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 7+ messages in thread
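Dave's AGF-based idea above can be sketched as follows (the field names are invented stand-ins, not the actual on-disk AGF format): carry the per-AG metadata reservation as a persistent counter alongside the existing AGF usage counters, and exclude it from the free space reported to users rather than accounting it as used space.

```python
# Rough sketch of an on-disk per-AG metadata reservation held in the
# AGF. Hypothetical fields; the real AGF tracks free space and rmap/
# refcount btree usage, to which a reservation counter would be added.

class Agf:
    def __init__(self, free_blocks, meta_resv_blocks):
        self.freeblks = free_blocks          # free space tracked in the AGF
        self.meta_resv = meta_resv_blocks    # persistent metadata reservation

    def user_visible_free(self):
        # the internal reservation is simply invisible to users, rather
        # than being charged against the global counters as "used"
        return max(self.freeblks - self.meta_resv, 0)
```

Users would see the same amount of allocatable space either way; the difference is that df would no longer report hundreds of GB of "used" space on an empty filesystem.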
* Re: Some questions about per-ag metadata space reservations...
  2017-09-07 23:11 ` Dave Chinner
@ 2017-09-08 13:33   ` Brian Foster
  2017-09-09  0:25     ` Dave Chinner
  0 siblings, 1 reply; 7+ messages in thread
From: Brian Foster @ 2017-09-08 13:33 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, Christoph Hellwig

cc Christoph (re: finobt perag reservation)

On Fri, Sep 08, 2017 at 09:11:36AM +1000, Dave Chinner wrote:
> On Thu, Sep 07, 2017 at 09:44:58AM -0400, Brian Foster wrote:
> > How are you dealing with block size vs. thin chunk allocation size
> > alignment? You could require they match, but if not it seems like there
> > could be a bit more involved than an accounting change.
>
> Not a filesystem problem. If there's less pool space than you let
> the filesystem have, then the pool will ENOSPC before the filesystem
> will. Regular fstrim (which you should be doing on thin filesystems
> anyway) will keep them mostly aligned because XFS tends to pack
> holes in AG space rather than continually growing the space they
> use.

I don't see how tracking underlying physical/available pool space in the
filesystem is a filesystem problem but tracking the alignment/size of
those physical allocations is not. It seems to me that either they are
both fs problems or they aren't. This is just a question of accuracy.

I get that the filesystem may return ENOSPC before the pool shuts down
more often than not, but that is still workload dependent. If it's not
important, perhaps I'm just not following what the
objectives/requirements are for this feature.

...

> Based on the commit message, I think the justification for finobt
> reservations was weak and wasn't backed up by analysis as to why the
> reserve block pool was drained (which should never occur in normal
> ENOSPC conditions). The per-ag reserve also requires a walk of every
> finobt at mount time, so there's also mount time regressions for
> filesystems with sparsely populated inode trees.

I agree. I'd be fine with ripping this out in favor of a better
solution. The problem is since we don't have a detailed root cause of
the problem, it's not clear what the right fix is. I'm not sure where
this leaves the user that originally reproduced the problem. Does
bumping the reserve block pool work around the problem? Can we revisit
it to find a more specific root cause? Christoph?

...

> Yes, that's what it does to ensure users get ENOSPC for data
> allocation before we run out of metadata reservation space, even if
> we don't need the metadata reservation space. Its size is
> physically bound by the AG size so we can calculate it any time we
> know what the AG size is.

Right. So I got the impression that the problem was enforcement of the
reservation. Is that not the case? Rather, is the problem the
calculation of the reservation requirement due to the basis on AG size
(which is no longer valid due to the thin nature)? IOW, the reservations
restrict far too much space and cause the fs to return ENOSPC too early?

E.g., re-reading your original example.. you have a 32TB fs backed by
1TB of physical allocation to the volume. You mount the fs and see 1TB
"available" space, but ~600GB of that is already consumed by reservation
so you end up at ENOSPC after 300-400GB of real usage. Hm?

If that is the case, then it does seem that dynamic reservation based on
current usage could be a solution in-theory. I.e., basing the
reservation on usage effectively bases it against "real" space, whether
the underlying volume is thin or fully allocated. That seems do-able for
the finobt (if we don't end up removing this reservation entirely) as
noted above. If that would not help your use case, could you elaborate
on why using the finobt example? Of course, I've no idea if that's a
viable approach for the other reservations so it's still just a handwavy
idea.

...

> I was working on the idea that thin filesystems have sufficient
> spare physical space (e.g. logical size < (physical size - max
> metadata reservation)) that even when maxed out there's sufficient
> physical space remaining for all the metadata without needing to
> reserve that space.
>
> In theory, this /should/ work as the metadata blocks are already
> reserved as used space at mount time and hence the actual allocation
> of those blocks is only accounted against the reservation, not the
> global freespace counter. ....

I'd probably need to see patches to make sure I follow this correctly.
While I'm sure we can ultimately implement whatever accounting tricks we
want, I'm more curious how accuracy is maintained for anything based on
assumptions about how physical space is allocated in the underlying
volume.

> I've written the patches to do this, but I haven't tested it other
> than checking falloc triggers ENOSPC when it's supposed to. I'm just
> finishing off the repair support so I can run it through xfstests.
> That will be interesting. :P
>
> FWIW, I think there is a good case for storing the metadata
> reservation on disk in the AGF and removing it from user visible
> global free space. ....

Sounds interesting, that might very well be a cleaner implementation of
reservations. The current reservation tracking tends to confuse me more
often than not. ;)

> If we do that, users will end up with exactly the same amount of
> free space, but the metadata reservations are no longer accounted as
> user visible used space. i.e. the users never need to see the
> internal space reservations we need to make the filesystem work
> reliably. This would work identically for normal filesystems and
> thin filesystems without needing to play special games for thin
> filesystems....

Indeed, though this seems more like a usability enhancement. Couldn't we
accomplish this part by just subtracting the reservations from the total
free space up front (along with whatever accounting changes need to
happen to support that)?

Brian

^ permalink raw reply	[flat|nested] 7+ messages in thread
* Re: Some questions about per-ag metadata space reservations... 2017-09-08 13:33 ` Brian Foster @ 2017-09-09 0:25 ` Dave Chinner 2017-09-11 13:26 ` Brian Foster 0 siblings, 1 reply; 7+ messages in thread From: Dave Chinner @ 2017-09-09 0:25 UTC (permalink / raw) To: Brian Foster; +Cc: linux-xfs, Christoph Hellwig On Fri, Sep 08, 2017 at 09:33:54AM -0400, Brian Foster wrote: > cc Christoph (re: finobt perag reservation) > > On Fri, Sep 08, 2017 at 09:11:36AM +1000, Dave Chinner wrote: > > On Thu, Sep 07, 2017 at 09:44:58AM -0400, Brian Foster wrote: > > > On Wed, Sep 06, 2017 at 08:30:54PM +1000, Dave Chinner wrote: > ... > > > > When combined with a thinly provisioned device, this enables us to > > > > shrink the XFS filesystem simply by running fstrim to punch all the > > > > free space out of the underlying thin device and then adjusting the > > > > free space down appropriately. Because the thin device abstracts the > > > > physical location of the data in the block device away from the > > > > address space presented to the filesystem, we don't need to move any > > > > data or metadata to free up this space - it's just an accounting > > > > change. > > > > > > > > > > How are you dealing with block size vs. thin chunk allocation size > > > alignment? You could require they match, but if not it seems like there > > > could be a bit more involved than an accounting change. > > > > Not a filesystem problem. If there's less pool space than you let > > the filesystem have, then the pool will ENOSPC before the filesystem > > will. regular fstrim (which you should be doing on thin filesystems > > anyway) will keep them mostly aligned because XFS tends to pack > > holes in AG space rather than continually growing the space they > > use. > > > > I don't see how tracking underlying physical/available pool space in the > filesystem is a filesystem problem but tracking the alignment/size of > those physical allocations is not. 
It seems to me that either they are > both fs problems or they aren't. This is just a question of accuracy. The filesystem cannot do anything about the size/alignment of blocks in the thin device. It gets a hint through stripe alignment, but other than that we can only track the space the filesystem uses in the filesystem. In practice XFS tends to pack used space fairly well over time (unlike ext4) so I'm really not too concerned about this right now. If it becomes a problem, then we can analyse where the problem lies and work out how to mitigate it. But until we see such problems that can't be solved with "allow X% size margins" guidelines, I'm not going to worry about it. > I get that the filesystem may return ENOSPC before the pool shuts down > more often than not, but that is still workload dependent. If it's not > important, perhaps I'm just not following what the > objectives/requirements are for this feature. What I'm trying to do is move the first point of ENOSPC in a thin environment up into the filesystem. ie. you don't manage user space requirements by thin device sizing - you way, way over commit that with the devices and instead use the filesystem "thin size" to limit what the filesystem can draw from the pool. That way users know exactly how much space they have available and can plan appropriately, as opposed to the current case where the first warning they get of the underlying storage running out of space when they have heaps of free space is "things suddenly stop working". If you overcommit the filesystem thin sizes, then it's no different to overcommitting the thin pool with large devices - the device pool is going to ENOSPC first. If you don't leave some amount of margin in the thin fs sizing, then you're going to ENOSPC the device pool. If you don't leave margin for snapshots, you're going to ENOSPC the device pool.
IOWs, using the filesystem to control thin space allocation has exactly the same admin pitfalls as using dm-thinp to manage the pool space. The only difference is that when the sizes/margins are set properly then the fs layer ENOSPCs before the thin device pool ENOSPCs and so we remove that clusterfuck completely from the picture. > ... > > Yes, that's what it does to ensure users get ENOSPC for data > > allocation before we run out of metadata reservation space, even if > > we don't need the metadata reservation space. Its size is > > physically bound by the AG size so we can calculate it any time we > > know what the AG size is. > > > > Right. So I got the impression that the problem was enforcement of the > reservation. Is that not the case? Rather, is the problem the > calculation of the reservation requirement due to the basis on AG size > (which is no longer valid due to the thin nature)? No, the physical metadata reservation space is still required. It just should not be *accounted* to the logical free space.
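The physical/logical split Dave describes — a filesystem laid out for the full device, with allocation capped at a smaller "thin size" — can be modelled in a few lines. A minimal sketch with hypothetical names (not the actual, unpublished patches):

```c
#include <assert.h>
#include <stdint.h>

/* Toy model of a thin filesystem: laid out for phys_blocks, but
 * allocation is capped at thin_blocks. Names are invented. */
struct thin_fs {
    uint64_t phys_blocks;  /* layout size, e.g. the 32TB device */
    uint64_t thin_blocks;  /* logical limit, e.g. 1TB of pool space */
    uint64_t used_blocks;
};

/* Fail once the logical free space is exhausted, long before the
 * physical address space (or the thin pool) runs out. */
static int thin_alloc(struct thin_fs *fs, uint64_t len)
{
    if (fs->used_blocks + len > fs->thin_blocks)
        return -1;              /* stands in for -ENOSPC */
    fs->used_blocks += len;
    return 0;
}
```

Growing such a filesystem is then just raising thin_blocks; shrinking is fstrim plus lowering it, with no data or metadata movement.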
That seems do-able for > the finobt (if we don't end up removing this reservation entirely) as > noted above. The finobt case is different to rmap and reflink. finobt should only require a per-operation reservation to ensure there is space in the AG to create the finobt record and btree blocks. We do not need a permanent, maximum sized tree reservation for this - we just need to ensure all the required space is available in the one AG rather than globally available before we start the allocation operation. If we can do that, then the operation should (in theory) never fail with ENOSPC... As for rmap and refcountbt reservations, they have to have space to allow rmap and CoW operations to succeed when no user data is modified, and to allow metadata allocations to run without needing to update every transaction reservation to take into account all the rmapbt updates that are necessary. These can be many and span multiple AGs (think badly fragmented directory blocks) and so the worst case reservation is /huge/, which made upfront worst-case reservations for rmap/reflink DOA. So we avoided this entire problem by ensuring we always have space for the rmap/refcount metadata; using 1-2% of disk space permanently was considered a valid trade off for the simplicity of implementation. That's what the per-ag reservations implement and we even added on-disk metadata in the AGF to make this reservation process low overhead. This was all "it seems like the best compromise" design. We based it on the existing reserve pool behaviour because it was easy to do. Now that I'm trying to use these filesystems in anger, I'm tripping over the problems as a result of this choice to base the per ag metadata reservations on the reserve pool behaviour. > > I've written the patches to do this, but I haven't tested it other > > than checking falloc triggers ENOSPC when it's supposed to. I'm just > > finishing off the repair support so I can run it through xfstests. > > That will be interesting.
:P > > > > FWIW, I think there is a good case for storing the metadata > > reservation on disk in the AGF and removing it from user visible > > global free space. We already account for free space, rmap and > > refcount btree block usage in the AGF, so we already have the > > mechanisms for tracking the necessary per-ag metadata usage outside > > of the global free space counters. Hence there doesn't appear to me > > to be any reason why we can't do the per-ag metadata > > reservation/usage accounting in the AGF and get rid of the in-memory > > reservation stuff. > > > > Sounds interesting, that might very well be a cleaner implementation of > reservations. The current reservation tracking tends to confuse me more > often than not. ;) In hindsight, I think we should have baked the reservation space fully into the on-disk format rather than tried to make it dynamic and backwards compatible. i.e. make it completely hidden from the user and always there for filesystems with those features enabled. > > If we do that, users will end up with exactly the same amount of > > free space, but the metadata reservations are no longer accounted as > > user visible used space. i.e. the users never need to see the > > internal space reservations we need to make the filesystem work > > reliably. This would work identically for normal filesystems and > > thin filesystems without needing to play special games for thin > > filesystems.... > > > > Indeed, though this seems more like a usability enhancement. Couldn't we > accomplish this part by just subtracting the reservations from the total > free space up front (along with whatever accounting changes need to > happen to support that)? Yes, I had a crazy thought last night that I might be able to do some in-memory mods to sb_dblocks and sb_fdblocks at mount time to adjust how available space and reservations are accounted. I'll have a bit of a think and a play over the next few days and see what I come up with.
The testing I've been doing with thin filesystems backs this up - they are behaving sanely at ENOSPC without accounting for the metadata reservations in the user visible free space. I'm still using the metadata reservations to ensure operations have space in each AG to complete successfully, it's just not consuming user accounted free space.... Cheers, Dave. -- Dave Chinner david@fromorbit.com
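The "crazy thought" about in-memory mods to sb_dblocks and sb_fdblocks might look something like the following sketch. This is toy code with invented names — the real patches were unpublished at this point in the thread:

```c
#include <assert.h>
#include <stdint.h>

/* Toy model of adjusting the in-memory superblock counters at mount
 * so reservations never reach statfs(). Names are hypothetical. */
struct toy_mount {
    uint64_t m_dblocks;   /* in-memory copy of sb_dblocks */
    uint64_t m_fdblocks;  /* in-memory copy of sb_fdblocks */
};

static void toy_mount_hide_resv(struct toy_mount *mp, uint64_t resv)
{
    /* The on-disk superblock is untouched; only the values userspace
     * sees via statfs()/df shrink by the reserved blocks. Shrinking
     * both counters equally keeps an empty fs reporting zero used. */
    mp->m_dblocks -= resv;
    mp->m_fdblocks -= resv;
}
```

The design point is that the adjustment is purely an in-memory, mount-time transformation, so no on-disk format change is needed to hide the reservation from users.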
* Re: Some questions about per-ag metadata space reservations... 2017-09-09 0:25 ` Dave Chinner @ 2017-09-11 13:26 ` Brian Foster 2017-09-15 1:03 ` Dave Chinner 0 siblings, 1 reply; 7+ messages in thread From: Brian Foster @ 2017-09-11 13:26 UTC (permalink / raw) To: Dave Chinner; +Cc: linux-xfs, Christoph Hellwig On Sat, Sep 09, 2017 at 10:25:43AM +1000, Dave Chinner wrote: > On Fri, Sep 08, 2017 at 09:33:54AM -0400, Brian Foster wrote: ... > > The filesystem cannot do anything about the size/alignment of blocks > in the thin device. It gets a hint through stripe alignment, but > other than that we can only track the space the filesystem uses in > the filesystem. In practice XFS tends to pack used space fairly well > over time (unlike ext4) so I'm really not too concerned about this > right now. > > If it becomes a problem, then we can analyse it where the problem > lies and work out how to mitigate it. But until we see such problems > that can't be solved with "allow X% size margins" guidelines, I'm > not going to worry about it. > Ok, fair enough. > > I get that the filesystem may return ENOSPC before the pool shuts down > > more often than not, but that is still workload dependent. If it's not > > important, perhaps I'm just not following what the > > objectives/requirements are for this feature. > > What I'm trying to do is move the first point of ENOSPC in a thin > environment up into the filesystem. ie. you don't manage user space > requirements by thin device sizing - you way, way over commit that > with the devices and instead use the filesystem "thin size" to limit > what the filesystem can draw from the pool. > > That way users know exactly how much space they have available and > can plan appropriately, as opposed to the current case where the > first warning they get of the underlying storage running out of > space when they have heaps of free space is "things suddenly stop > working". > Ok, that's what I suspected. 
FWIW, this reminded me of the thin space reservation thing I was hacking on a year or two ago to accomplish a similar objective. That's where the whole size/alignment question came up. > If you overcommit the filesystem thin sizes, then it's no different > to overcommiting the thin pool with large devices - the device pool > is going to ENOSPC first. If you don't leave some amount of margin > in the thin fs sizing, then you're going to ENOSPC the device pool. > If you don't leave margin for snapshots, you're going to ENOSPC the > device pool. > > IOWs, using the filesystem to control thin space allocation has > exactly the same admin pitfalls as using dm-thinp to manage the pool > space. The only difference is that when the sizes/margins are set > properly then the fs layer ENOSPCs before the thin device pool > ENOSPCs and so we remove that clusterfuck completely from the > picture. > I still think some of that is non-deterministic, but I suppose if you have a worst case slop/margin delta between usable space in the fs and what is truly available from the underlying storage, it might not be a problem in practice. I still have some questions, but it's probably not worth reasoning about until code is available. > > ... > > > Yes, that's what it does to ensure users get ENOSPC for data > > > allocation before we run out of metadata reservation space, even if > > > we don't need the metadata reservation space. It's size is > > > physically bound by the AG size so we can calculate it any time we > > > know what the AG size is. > > > > > > > Right. So I got the impression that the problem was enforcement of the > > reservation. Is that not the case? Rather, is the problem the > > calculation of the reservation requirement due to the basis on AG size > > (which is no longer valid due to the thin nature)? > > No, the physical metadata reservation space is still required. It > just should not be *accounted* to the logical free space. 
> Ok, I think we're talking about the same things and just thinking about it differently. On the presumption that we (continue to) use a worst case reservation, it makes sense to account it against the physical free space in the AG rather than the (more limited) logical free space. My point was to explore whether we could adjust the actual reservation requirements to be dynamic such that it would (continue to) not matter that the reservations are accounted out of logical free space. Indeed, this hasn't been a problem in situations where we know the reservation is only 1-2% of truly available space. Thinking about it from another angle, the old thin reservation rfc I referenced above would probably ENOSPC on mount in the current scheme of things because there simply isn't that much space available to reserve out of the volume. It worked fine at the time because we only had the capped size global reserve pool. Hence, we'd have to either change how the reservations are made so they wouldn't reserve out of the volume (as you suggest) or somehow or another base them on the logical size of the volume. > > IOW, the reservations > > restrict far too much space and cause the fs to return ENOSPC too > > early? > > Yes, the initial problem is that the fixed reservations are > dynamically accounted as used space. > > > E.g., re-reading your original example.. you have a 32TB fs backed by > > 1TB of physical allocation to the volume. You mount the fs and see 1TB > > "available" space, but ~600GB if that is already consumed by > > reservation so you end up at ENOSPC after 300-400GB of real usage. Hm? > > Yup, that's a visible *symptom*. Another user visible symptom is df > on an empty filesystem reports hundreds of GB (TB even!) of used > space on a completely empty filesystem. > > > If that is the case, then it does seem that dynamic reservation based on > > current usage could be a solution in-theory. 
I.e., basing the > > reservation on usage effectively bases it against "real" space, whether > > the underlying volume is thin or fully allocated. That seems do-able for > > the finobt (if we don't end up removing this reservation entirely) as > > noted above. > > The finobt case is different to rmap and reflink. finobt should only > require a per-operation reservation to ensure there is space in the > AG to create the finobt record and btree blocks. We do not need a > permanent, maximum sized tree reservation for this - we just need to > ensure all the required space is available in the one AG rather than > globally available before we start the allocation operation. If we > can do that, then the operation should (in theory) never fail with > ENOSPC... > I'm not familiar with the workload that motivated the finobt perag reservation stuff, but I suspect it's something that pushes an fs (or AG) with a ton of inodes to near ENOSPC with a very small finobt, and then runs a bunch of operations that populate the finobt without freeing up enough space in the particular AG. I suppose that could be due to having zero sized files (which seems pointless in practice), sparsely freeing inodes such that inode chunks are never freed, using the ikeep mount option, and/or otherwise freeing a bunch of small files that only free up space in other AGs before the finobt allocation demand is made. The larger point is that we don't really know much of anything to try and at least reason about what the original problem could have been, but it seems plausible to create the ENOSPC condition if one tried hard enough. > As for rmap and refcountbt reservations, they have to have space to > allow rmap and CoW operations to succeed when no user data is > modified, and to allow metadata allocations to run without needing > to update every transaction reservation to take into account all the > rmapbt updates that are necessary. 
These can be many and span > multiple AGs (think badly fragmented directory blocks) and so the > worst case reservation is /huge/ and made upfront worst-case > reservations for rmap/reflink DOA. > > So we avoided this entire problem by ensuring we always have space for > the rmap/refcount metadata; using 1-2% of disk space permanently > was considered a valid trade off for the simplicity of > implementation. That's what the per-ag reservations implement and > we even added on-disk metadata in the AGF to make this reservation > process low overhead. > > This was all "it seems like the best compromise" design. We > based it on the existing reserve pool behaviour because it was easy > to do. Now that I'm trying to use these filesystems in anger, I'm > tripping over the problems as a result of this choice to base the > per ag metadata reservations on the reserve pool behaviour. > Got it. FWIW, what I was handwaving about sounds like more of a compromise between what we do now (worst case res, user visible) and what it sounds like you're working towards (worst case res, user invisible). By that I mean that I've been thinking about the problem more from the angle of whether we can avoid the worst case reservation. The reservation itself could still be made visible or not either way. Of course, it sounds like changing the reservation requirement for things like the rmapbt would be significantly more complicated than for the finobt, so "hiding" the reservation might be the next best tradeoff. Brian > > > I've written the patches to do this, but I haven't tested it other > > > than checking falloc triggers ENOSPC when it's supposed to. I'm just > > > finishing off the repair support so I can run it through xfstests. > > > That will be interesting. :P > > > > > > FWIW, I think there is a good case for storing the metadata > > > reservation on disk in the AGF and removing it from user visible > > > global free space. 
We already account for free space, rmap and > > > refcount btree block usage in the AGF, so we already have the > > > mechanisms for tracking the necessary per-ag metadata usage outside > > > of the global free space counters. Hence there doesn't appear to me > > > to be any reason why why we can't do the per-ag metadata > > > reservation/usage accounting in the AGF and get rid of the in-memory > > > reservation stuff. > > > > > > > Sounds interesting, that might very well be a cleaner implementation of > > reservations. The current reservation tracking tends to confuse me more > > often than not. ;) > > In hindsight, I think we should have baked the reservation space > fully into the on-disk format rather than tried to make it dynamic > and backwards compatible. i.e. make it completely hidden from the > user and always there for filesystems with those features enabled. > > > > If we do that, users will end up with exactly the same amount of > > > free space, but the metadata reservations are no longer accounted as > > > user visible used space. i.e. the users never need to see the > > > internal space reservations we need to make the filesystem work > > > reliably. This would work identically for normal filesystems and > > > thin filesystems without needing to play special games for thin > > > filesystems.... > > > > > > > Indeed, though this seems more like a usability enhancement. Couldn't we > > accomplish this part by just subtracting the reservations from the total > > free space up front (along with whatever accounting changes need to > > happen to support that)? > > Yes, I had a crazy thought last night that I might be able to do > some in-memory mods to sb_dblocks and sb_fdblocks at mount time to > to adjust how available space and reservations are accounted. I'll > have a bit of a think and a play over the next few days and see what > I come up with. 
> > The testing I've been doing with thin filesystems backs this up - > they are behaving sanely at ENOSPC without accounting for the > metadata reservations in the user visible free space. I'm still > using the metadata reservations to ensure operations have space in > each AG to complete successfully, it's just not consuming user > accounted free space.... > > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com
* Re: Some questions about per-ag metadata space reservations... 2017-09-11 13:26 ` Brian Foster @ 2017-09-15 1:03 ` Dave Chinner 0 siblings, 0 replies; 7+ messages in thread From: Dave Chinner @ 2017-09-15 1:03 UTC (permalink / raw) To: Brian Foster; +Cc: linux-xfs, Christoph Hellwig On Mon, Sep 11, 2017 at 09:26:08AM -0400, Brian Foster wrote: > On Sat, Sep 09, 2017 at 10:25:43AM +1000, Dave Chinner wrote: > > On Fri, Sep 08, 2017 at 09:33:54AM -0400, Brian Foster wrote: > > > If that is the case, then it does seem that dynamic reservation based on > > > current usage could be a solution in-theory. I.e., basing the > > > reservation on usage effectively bases it against "real" space, whether > > > the underlying volume is thin or fully allocated. That seems do-able for > > > the finobt (if we don't end up removing this reservation entirely) as > > > noted above. > > > > The finobt case is different to rmap and reflink. finobt should only > > require a per-operation reservation to ensure there is space in the > > AG to create the finobt record and btree blocks. We do not need a > > permanent, maximum sized tree reservation for this - we just need to > > ensure all the required space is available in the one AG rather than > > globally available before we start the allocation operation. If we > > can do that, then the operation should (in theory) never fail with > > ENOSPC... > > > > I'm not familiar with the workload that motivated the finobt perag > reservation stuff, but I suspect it's something that pushes an fs (or > AG) with a ton of inodes to near ENOSPC with a very small finobt, and > then runs a bunch of operations that populate the finobt without freeing > up enough space in the particular AG. That's a characteristic of a hardlink backup farm. And, in new-skool terms, that's what a reflink- or dedupe- based backup farm will look like, too. i.e. 
old backups get removed freeing up inodes, but no data gets freed so the only new free blocks are the directory blocks that are no longer in use... > I suppose that could be due to > having zero sized files (which seems pointless in practice), sparsely > freeing inodes such that inode chunks are never freed, using the ikeep > mount option, and/or otherwise freeing a bunch of small files that only > free up space in other AGs before the finobt allocation demand is made. Yup, all of those are potential issues.... > The larger point is that we don't really know much of anything to try > and at least reason about what the original problem could have been, but > it seems plausible to create the ENOSPC condition if one tried hard > enough. *nod*. i.e. if you're not freeing data, then unlinking dataless inodes may not succeed at ENOSPC. I think we can do better than what we currently do, though. e.g. we can simply dump them on the unlinked list and process them when there is free space to create the necessary finobt btree blocks to index them rather than as soon as the last VFS reference goes away (i.e. background inode freeing). > > As for rmap and refcountbt reservations, they have to have space to > > allow rmap and CoW operations to succeed when no user data is > > modified, and to allow metadata allocations to run without needing > > to update every transaction reservation to take into account all the > > rmapbt updates that are necessary. These can be many and span > > multiple AGs (think badly fragmented directory blocks) and so the > > worst case reservation is /huge/ and made upfront worst-case > > reservations for rmap/reflink DOA. > > > > So we avoided this entire problem by ensuring we always have space for > > the rmap/refcount metadata; using 1-2% of disk space permanently > > was considered a valid trade off for the simplicity of > > implementation. 
That's what the per-ag reservations implement and > > we even added on-disk metadata in the AGF to make this reservation > > process low overhead. > > > > This was all "it seems like the best compromise" design. We > > based it on the existing reserve pool behaviour because it was easy > > to do. Now that I'm trying to use these filesystems in anger, I'm > > tripping over the problems as a result of this choice to base the > > per ag metadata reservations on the reserve pool behaviour. > > > > Got it. FWIW, what I was handwaving about sounds like more of a > compromise between what we do now (worst case res, user visible) and > what it sounds like you're working towards (worst case res, user > invisible). By that I mean that I've been thinking about the problem > more from the angle of whether we can avoid the worst case reservation. > The reservation itself could still be made visible or not either way. Of > course, it sounds like changing the reservation requirement for things > like the rmapbt would be significantly more complicated than for the > finobt, so "hiding" the reservation might be the next best tradeoff. Yeah, and having done that I'm tripping over the next issue: it's possible for the log to be larger than the thin space, so I think I'm going to have to cut that out of visible used space, too.... Cheers, Dave. -- Dave Chinner david@fromorbit.com
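The closing point about the log follows the same arithmetic as the reservations: on a small thin size, a fixed-size log charged to visible space can dwarf (or even exceed) what the user thinks they have. A hypothetical sketch of the two ways of counting, with invented names:

```c
#include <assert.h>
#include <stdint.h>

/* Toy arithmetic for user-visible "used" space on a thin fs.
 * Values in blocks; none of this is the eventual XFS patch. */

/* If internal overheads stay visible, an empty fs already reports
 * log + reservation as used space - and on a small thin size the
 * log alone can swamp the visible accounting. */
static uint64_t df_used_visible(uint64_t data_used, uint64_t log_blocks,
                                uint64_t resv_blocks)
{
    return data_used + log_blocks + resv_blocks;
}

/* With the log and reservations cut out of the accounting, used
 * space reflects only what the user actually stored. */
static uint64_t df_used_hidden(uint64_t data_used)
{
    return data_used;
}
```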
end of thread, other threads:[~2017-09-15 1:04 UTC | newest]

Thread overview: 7+ messages
2017-09-06 10:30 Some questions about per-ag metadata space reservations Dave Chinner
2017-09-07 13:44 ` Brian Foster
2017-09-07 23:11 ` Dave Chinner
2017-09-08 13:33 ` Brian Foster
2017-09-09 0:25 ` Dave Chinner
2017-09-11 13:26 ` Brian Foster
2017-09-15 1:03 ` Dave Chinner