* [PATCHSET 0/4] xfs: random fixes for 6.2, part 3
From: Darrick J. Wong @ 2022-12-20 0:04 UTC
To: djwong; +Cc: linux-xfs
Hi all,
Here's a sprinkling of bug fixes to start the 6.2 stabilization process.
I would imagine that the most controversial ones are going to be the one
that detaches background memory reclaim from the inodegc threads, and
the last one, which corrects the calculation of max btree height from
the given amount of space.
If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.
This is an extraordinary way to destroy everything. Enjoy!
Comments and questions are, as always, welcome.
--D
kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=xfs-fixes-6.2
---
 fs/xfs/libxfs/xfs_btree.c | 6 +++++-
 fs/xfs/xfs_icache.c       | 5 +++++
 fs/xfs/xfs_iomap.c        | 2 +-
 fs/xfs/xfs_reflink.c      | 2 --
 4 files changed, 11 insertions(+), 4 deletions(-)
* [PATCH 1/4] xfs: don't assert if cmap covers imap after cycling lock
From: Darrick J. Wong @ 2022-12-20 0:05 UTC
To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

In xfs_reflink_fill_cow_hole, there's a debugging assertion that trips
if (after cycling the ILOCK to get a transaction) the requeried cow
mapping overlaps the start of the area being written. IOWs, it trips if
the hole in the cow fork that it's supposed to fill has been filled.

This is trivially possible since we cycled ILOCK_EXCL. If we trip the
assertion, then we know that cmap is a delalloc extent because @found is
false. Fortunately, the bmapi_write call below will convert the delalloc
extent to a real unwritten cow fork extent, so all we need to do here is
remove the assertion.

It turns out that generic/095 trips this pretty regularly with alwayscow
mode enabled.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_reflink.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index fe46bce8cae6..5535778a98f9 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -416,8 +416,6 @@ xfs_reflink_fill_cow_hole(
 		goto convert;
 	}
 
-	ASSERT(cmap->br_startoff > imap->br_startoff);
-
 	/* Allocate the entire reservation as unwritten blocks. */
 	nimaps = 1;
 	error = xfs_bmapi_write(tp, ip, imap->br_startoff, imap->br_blockcount,
* Re: [PATCH 1/4] xfs: don't assert if cmap covers imap after cycling lock
From: Dave Chinner @ 2022-12-20 4:49 UTC
To: Darrick J. Wong; +Cc: linux-xfs

On Mon, Dec 19, 2022 at 04:05:03PM -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
>
> In xfs_reflink_fill_cow_hole, there's a debugging assertion that trips
> if (after cycling the ILOCK to get a transaction) the requeried cow
> mapping overlaps the start of the area being written. IOWs, it trips if
> the hole in the cow fork that it's supposed to fill has been filled.
>
> This is trivially possible since we cycled ILOCK_EXCL. If we trip the
> assertion, then we know that cmap is a delalloc extent because @found is
> false. Fortunately, the bmapi_write call below will convert the
> delalloc extent to a real unwritten cow fork extent, so all we need to
> do here is remove the assertion.
>
> It turns out that generic/095 trips this pretty regularly with alwayscow
> mode enabled.
>
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
>  fs/xfs/xfs_reflink.c | 2 --
>  1 file changed, 2 deletions(-)

Looks fine.

Reviewed-by: Dave Chinner <dchinner@redhat.com>

-- 
Dave Chinner
david@fromorbit.com
* [PATCH 2/4] xfs: don't stall background reclaim on inactivation
From: Darrick J. Wong @ 2022-12-20 0:05 UTC
To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

The online fsck stress tests deadlocked a test VM the other night. The
deadlock happened because:

1. kswapd tried to prune the sb inode list, xfs found that it needed to
   inactivate an inode and that the queue was long enough that it should
   wait for the worker. It was holding shrinker_rwsem.

2. The inactivation worker allocated a transaction and then stalled
   trying to obtain the AGI buffer lock.

3. An online repair function called unregister_shrinker while in
   transaction context and holding the AGI lock. It also tried to grab
   shrinker_rwsem.

#3 shouldn't happen and was easily fixed, but seeing as we designed
background inodegc to avoid stalling reclaim, I feel that #1 shouldn't
be happening either. Fix xfs_inodegc_want_flush_work to avoid stalling
background reclaim on inode inactivation.

Fixes: ab23a7768739 ("xfs: per-cpu deferred inode inactivation queues")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_icache.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index f35e2cee5265..24eff2bd4062 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -2000,6 +2000,8 @@ xfs_inodegc_want_queue_work(
  *
  * Note: If the current thread is running a transaction, we don't ever want to
  * wait for other transactions because that could introduce a deadlock.
+ *
+ * Don't let kswapd background reclamation stall on inactivations.
  */
 static inline bool
 xfs_inodegc_want_flush_work(
@@ -2010,6 +2012,9 @@ xfs_inodegc_want_flush_work(
 	if (current->journal_info)
 		return false;
 
+	if (current_is_kswapd())
+		return false;
+
 	if (shrinker_hits > 0)
 		return true;
* Re: [PATCH 2/4] xfs: don't stall background reclaim on inactivation
From: Dave Chinner @ 2022-12-20 4:49 UTC
To: Darrick J. Wong; +Cc: linux-xfs

On Mon, Dec 19, 2022 at 04:05:08PM -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
>
> The online fsck stress tests deadlocked a test VM the other night. The
> deadlock happened because:
>
> 1. kswapd tried to prune the sb inode list, xfs found that it needed to
>    inactivate an inode and that the queue was long enough that it
>    should wait for the worker. It was holding shrinker_rwsem.
>
> 2. The inactivation worker allocated a transaction and then stalled
>    trying to obtain the AGI buffer lock.
>
> 3. An online repair function called unregister_shrinker while in
>    transaction context and holding the AGI lock. It also tried to grab
>    shrinker_rwsem.
>
> #3 shouldn't happen and was easily fixed, but seeing as we designed
> background inodegc to avoid stalling reclaim, I feel that #1 shouldn't
> be happening either. Fix xfs_inodegc_want_flush_work to avoid stalling
> background reclaim on inode inactivation.

Yup, I see what you are saying, but I specifically left GFP_KERNEL
reclaim to throttle on inodegc if necessary. kswapd runs in GFP_KERNEL
context - it's the only memory reclaim context where it is actually safe
to block in this manner in the filesystem. The question becomes one of
whether we allow kswapd to ignore inodegc queue limits.

Before the background inodegc, we used to do all the xfs_inactive() work
in ->destroy_inode, directly from the VFS inode cache shrinker context
run by GFP_KERNEL memory reclaim contexts such as kswapd(). IOWs,
causing kswapd to wait while we run inactivation in the VFS inode cache
shrinker context is something we've done for a long time.

kswapd did:

	vfs inode shrinker
	  destroy inode
	    xfs_inactive()
	    mark inode inactivated
	    update radix tree tags

And then the XFS inode shrinker walks INACTIVATED inodes and frees them.

Now the code does:

	vfs inode shrinker
	  NEED_INACTIVE
	  -> inodegc queue
	  if (throttle needed)
		flush_work(gc->work);
	  return

Then some time later in a different context inodegc runs and we do:

	xfs_inactive
	INACTIVATED

And then sometime later in a different context the XFS inode shrinker
walks INACTIVATED inodes and frees them.

Essentially, without throttling the bulk of the memory freeing work is
shifted from the immediate GFP_KERNEL reclaim context to some later
GFP_KERNEL reclaim context.

For small memory reclaim scans (i.e. low priority shrinker invocations)
this doesn't matter - we aren't at risk of OOM in these situations, and
these are the typical situations in which memory reclaim is needed.
Generally speaking the background work is fast enough that memory is
freed before shortages escalate.

However, when we have large inode caches and a significant memory demand
event, we get into the situation where reclaim might be scanning
hundreds of thousands of cached VFS inodes at a time. We kinda need to
stall kswapd in these cases - it may be the only reclaim context that
can free inodes - and we need to avoid kswapd dumping all the freeable
inodes on the node to a queue that isn't being processed, deciding it
has no more memory that can be freed because all the VFS inodes have
gone and there's nothing reclaimable on the XFS inode lists, and so
declaring OOM when we have hundreds of thousands of freeable inodes
sitting there waiting for inodegc to run.

I note that inodes queued for inodegc are not counted as reclaimable XFS
inodes, so for the purposes of shrinkers those inodes queued on inodegc
disappear from the shrinker accounting completely. Maybe we could
consider that a separate bug hidden by the inodegc throttling - maybe
that's the bug that makes throttling kswapd on inodegc necessary.

But this reinforces my point - I don't think we can just make kswapd
skip inodegc throttling without further investigation into the effect on
performance under memory pressure as well as OOM killer resistance. It's
likely a solvable issue, but given how complex and subtle the inode
shrinker interactions can be, I'd like to make sure we don't wake a
sleeping dragon here....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
* Re: [PATCH 2/4] xfs: don't stall background reclaim on inactivation
From: Darrick J. Wong @ 2022-12-20 16:28 UTC
To: Dave Chinner; +Cc: linux-xfs

On Tue, Dec 20, 2022 at 03:49:08PM +1100, Dave Chinner wrote:
> On Mon, Dec 19, 2022 at 04:05:08PM -0800, Darrick J. Wong wrote:
> > The online fsck stress tests deadlocked a test VM the other night.
> > [...]
> > #3 shouldn't happen and was easily fixed, but seeing as we designed
> > background inodegc to avoid stalling reclaim, I feel that #1
> > shouldn't be happening either. Fix xfs_inodegc_want_flush_work to
> > avoid stalling background reclaim on inode inactivation.
>
> Yup, I see what you are saying, but I specifically left GFP_KERNEL
> reclaim to throttle on inodegc if necessary. kswapd runs in GFP_KERNEL
> context - it's the only memory reclaim context where it is actually
> safe to block in this manner in the filesystem. The question becomes
> one of whether we allow kswapd to ignore inodegc queue limits.
>
> [...]
>
> However, when we have large inode caches and a significant memory
> demand event, we get into the situation where reclaim might be
> scanning hundreds of thousands of cached VFS inodes at a time.

...and the problem with removing the flush_workqueue() calls from
background reclaim is that it doesn't know that xfs will try to push
things out of memory in a different thread, because there's no way to
tell it that we're separately working on freeing XXX bytes?

> We kinda need to stall kswapd in these cases - it may be the only
> reclaim context that can free inodes - and we need to avoid kswapd
> dumping all the freeable inodes on the node to a queue that isn't
> being processed, deciding it has no more memory that can be freed
> because all the VFS inodes have gone and there's nothing reclaimable
> on the XFS inode lists, and so declaring OOM when we have hundreds of
> thousands of freeable inodes sitting there waiting for inodegc to run.

Ah, yep. XFS has no way to convey this to background reclaim, and
reclaim might not even be able to make much use of that information,
because anyone supplying it could be (a) wrong or (b) unable to free the
objects within whatever timeframe reclaim needs.

Thinking ahead, even if XFS could give the shrinker an ETA, reclaim
would still end up waiting. I suppose it might be nice for reclaim to
trigger multiple asynchronous freeings and wait on all of them, but that
just shifts the question to "how does xfs know how long it'll take?" and
"how many more cleverness beans will it take for support to figure out
the system has wedged itself in this weird state?" Right now we at least
get a nice compact stack trace when xfs stalls background reclaim.

> I note that inodes queued for inodegc are not counted as reclaimable
> XFS inodes, so for the purposes of shrinkers those inodes queued on
> inodegc disappear from the shrinker accounting completely. Maybe we
> could consider that a separate bug hidden by the inodegc throttling -
> maybe that's the bug that makes throttling kswapd on inodegc
> necessary.
>
> But this reinforces my point - I don't think we can just make kswapd
> skip inodegc throttling without further investigation into the effect
> on performance under memory pressure as well as OOM killer resistance.
> It's likely a solvable issue, but given how complex and subtle the
> inode shrinker interactions can be, I'd like to make sure we don't
> wake a sleeping dragon here....

Ah, ok. I'll drop this then.

--D

> Cheers,
>
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
* [PATCH 3/4] xfs: make xfs_iomap_page_ops static
From: Darrick J. Wong @ 2022-12-20 0:05 UTC
To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Shut up the sparse warnings about this variable that isn't referenced
anywhere else.

Fixes: cd89a0950c40 ("xfs: use iomap_valid method to detect stale cached iomaps")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_iomap.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 669c1bc5c3a7..fc1946f80a4a 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -83,7 +83,7 @@ xfs_iomap_valid(
 	return true;
 }
 
-const struct iomap_page_ops xfs_iomap_page_ops = {
+static const struct iomap_page_ops xfs_iomap_page_ops = {
 	.iomap_valid		= xfs_iomap_valid,
 };
* Re: [PATCH 3/4] xfs: make xfs_iomap_page_ops static
From: Dave Chinner @ 2022-12-20 4:49 UTC
To: Darrick J. Wong; +Cc: linux-xfs

On Mon, Dec 19, 2022 at 04:05:14PM -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
>
> Shut up the sparse warnings about this variable that isn't referenced
> anywhere else.
>
> Fixes: cd89a0950c40 ("xfs: use iomap_valid method to detect stale cached iomaps")
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
>  fs/xfs/xfs_iomap.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> -const struct iomap_page_ops xfs_iomap_page_ops = {
> +static const struct iomap_page_ops xfs_iomap_page_ops = {
> 	.iomap_valid		= xfs_iomap_valid,
> };

Oops, yeah.

Reviewed-by: Dave Chinner <dchinner@redhat.com>

-- 
Dave Chinner
david@fromorbit.com
* [PATCH 4/4] xfs: fix off-by-one error in xfs_btree_space_to_height
From: Darrick J. Wong @ 2022-12-20 0:05 UTC
To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Lately I've been stress-testing extreme-sized rmap btrees by using the
(new) xfs_db bmap_inflate command to clone bmbt mappings billions of
times and then using xfs_repair to build new rmap and refcount btrees.
This of course is /much/ faster than actually FICLONEing a file billions
of times.

Unfortunately, xfs_repair fails in xfs_btree_bload_compute_geometry with
EOVERFLOW, which indicates that xfs_mount.m_rmap_maxlevels is not
sufficiently large for the test scenario. For a 1TB filesystem (~67
million AG blocks, 4 AGs) the btheight command reports:

	$ xfs_db -c 'btheight -n 4400801200 -w min rmapbt' /dev/sda
	rmapbt: worst case per 4096-byte block: 84 records (leaf) / 45 keyptrs (node)
	level 0: 4400801200 records, 52390491 blocks
	level 1: 52390491 records, 1164234 blocks
	level 2: 1164234 records, 25872 blocks
	level 3: 25872 records, 575 blocks
	level 4: 575 records, 13 blocks
	level 5: 13 records, 1 block
	6 levels, 53581186 blocks total

The AG is sufficiently large to build this rmap btree. Unfortunately,
m_rmap_maxlevels is 5. Augmenting the loop in the space->height function
to report height, node blocks, and blocks remaining produces this:

	ht 1 node_blocks 45 blockleft 67108863
	ht 2 node_blocks 2025 blockleft 67108818
	ht 3 node_blocks 91125 blockleft 67106793
	ht 4 node_blocks 4100625 blockleft 67015668
	final height: 5

The goal of this function is to compute the maximum height btree that
can be stored in the given number of ondisk fsblocks. Starting with the
top level of the tree, each iteration through the loop adds the fanout
factor of the next level down until we run out of blocks. IOWs, maximum
height is achieved by using the smallest fanout factor that can apply to
that level.

However, the loop setup is not correct. Top level btree blocks are
allowed to contain fewer than minrecs items, so the computation is
incorrect because the first time through the loop it should be using a
fanout factor of 2. With this corrected, the above becomes:

	ht 1 node_blocks 2 blockleft 67108863
	ht 2 node_blocks 90 blockleft 67108861
	ht 3 node_blocks 4050 blockleft 67108771
	ht 4 node_blocks 182250 blockleft 67104721
	ht 5 node_blocks 8201250 blockleft 66922471
	final height: 6

Fixes: 9ec691205e7d ("xfs: compute the maximum height of the rmap btree when reflink enabled")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_btree.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
index 4c16c8c31fcb..8d11d3f5e529 100644
--- a/fs/xfs/libxfs/xfs_btree.c
+++ b/fs/xfs/libxfs/xfs_btree.c
@@ -4666,7 +4666,11 @@ xfs_btree_space_to_height(
 	const unsigned int	*limits,
 	unsigned long long	leaf_blocks)
 {
-	unsigned long long	node_blocks = limits[1];
+	/*
+	 * The root btree block can have a fanout between 2 and maxrecs because
+	 * the tree might not be big enough to fill it.
+	 */
+	unsigned long long	node_blocks = 2;
 	unsigned long long	blocks_left = leaf_blocks - 1;
 	unsigned int		height = 1;
* Re: [PATCH 4/4] xfs: fix off-by-one error in xfs_btree_space_to_height
From: Dave Chinner @ 2022-12-20 5:00 UTC
To: Darrick J. Wong; +Cc: linux-xfs

On Mon, Dec 19, 2022 at 04:05:19PM -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
>
> [...]
>
> The goal of this function is to compute the maximum height btree that
> can be stored in the given number of ondisk fsblocks. Starting with the
> top level of the tree, each iteration through the loop adds the fanout
> factor of the next level down until we run out of blocks. IOWs, maximum
> height is achieved by using the smallest fanout factor that can apply
> to that level.
>
> However, the loop setup is not correct. Top level btree blocks are
> allowed to contain fewer than minrecs items, so the computation is
  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Ah, that's the critical piece of information I was looking for. I
couldn't work out from the code change below what was wrong with
limits[1]. So....

> diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
> index 4c16c8c31fcb..8d11d3f5e529 100644
> --- a/fs/xfs/libxfs/xfs_btree.c
> +++ b/fs/xfs/libxfs/xfs_btree.c
> @@ -4666,7 +4666,11 @@ xfs_btree_space_to_height(
> 	const unsigned int	*limits,
> 	unsigned long long	leaf_blocks)
> {
> -	unsigned long long	node_blocks = limits[1];
> +	/*
> +	 * The root btree block can have a fanout between 2 and maxrecs because
> +	 * the tree might not be big enough to fill it.
> +	 */

Can you change this comment to say something like:

	/*
	 * The root btree block can have less than minrecs pointers
	 * in it because the tree might not be big enough to require
	 * that amount of fanout. Hence it has a minimum size of
	 * 2 pointers, not limits[1].
	 */

Otherwise it looks good.

Reviewed-by: Dave Chinner <dchinner@redhat.com>

> +	unsigned long long	node_blocks = 2;
> 	unsigned long long	blocks_left = leaf_blocks - 1;
> 	unsigned int		height = 1;

For future consideration, we don't use maxrecs in this calculation at
all - should we just pass minrecs into the function rather than an array
of limits?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
* Re: [PATCH 4/4] xfs: fix off-by-one error in xfs_btree_space_to_height
From: Darrick J. Wong @ 2022-12-20 16:10 UTC
To: Dave Chinner; +Cc: linux-xfs

On Tue, Dec 20, 2022 at 04:00:01PM +1100, Dave Chinner wrote:
> On Mon, Dec 19, 2022 at 04:05:19PM -0800, Darrick J. Wong wrote:
> > [...]
>
> Can you change this comment to say something like:
>
> 	/*
> 	 * The root btree block can have less than minrecs pointers
> 	 * in it because the tree might not be big enough to require
> 	 * that amount of fanout. Hence it has a minimum size of
> 	 * 2 pointers, not limits[1].
> 	 */

Done. Thanks for the reviews! :)

--D

> Otherwise it looks good.
>
> Reviewed-by: Dave Chinner <dchinner@redhat.com>
>
> Cheers,
>
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
* Re: [PATCH 4/4] xfs: fix off-by-one error in xfs_btree_space_to_height
  2022-12-20 16:10       ` Darrick J. Wong
@ 2022-12-20 16:20         ` Darrick J. Wong
  2022-12-20 20:39           ` Dave Chinner
  0 siblings, 1 reply; 13+ messages in thread
From: Darrick J. Wong @ 2022-12-20 16:20 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-xfs

On Tue, Dec 20, 2022 at 08:10:08AM -0800, Darrick J. Wong wrote:
> On Tue, Dec 20, 2022 at 04:00:01PM +1100, Dave Chinner wrote:
> > On Mon, Dec 19, 2022 at 04:05:19PM -0800, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <djwong@kernel.org>
> > >
> > > Lately I've been stress-testing extreme-sized rmap btrees by using the
> > > (new) xfs_db bmap_inflate command to clone bmbt mappings billions of
> > > times and then using xfs_repair to build new rmap and refcount btrees.
> > > This of course is /much/ faster than actually FICLONEing a file billions
> > > of times.
> > >
> > > Unfortunately, xfs_repair fails in xfs_btree_bload_compute_geometry with
> > > EOVERFLOW, which indicates that xfs_mount.m_rmap_maxlevels is not
> > > sufficiently large for the test scenario.  For a 1TB filesystem (~67
> > > million AG blocks, 4 AGs) the btheight command reports:
> > >
> > > $ xfs_db -c 'btheight -n 4400801200 -w min rmapbt' /dev/sda
> > > rmapbt: worst case per 4096-byte block: 84 records (leaf) / 45 keyptrs (node)
> > > level 0: 4400801200 records, 52390491 blocks
> > > level 1: 52390491 records, 1164234 blocks
> > > level 2: 1164234 records, 25872 blocks
> > > level 3: 25872 records, 575 blocks
> > > level 4: 575 records, 13 blocks
> > > level 5: 13 records, 1 block
> > > 6 levels, 53581186 blocks total
> > >
> > > The AG is sufficiently large to build this rmap btree.  Unfortunately,
> > > m_rmap_maxlevels is 5.  Augmenting the loop in the space->height
> > > function to report height, node blocks, and blocks remaining produces
> > > this:
> > >
> > > ht 1 node_blocks 45 blockleft 67108863
> > > ht 2 node_blocks 2025 blockleft 67108818
> > > ht 3 node_blocks 91125 blockleft 67106793
> > > ht 4 node_blocks 4100625 blockleft 67015668
> > > final height: 5
> > >
> > > The goal of this function is to compute the maximum height btree that
> > > can be stored in the given number of ondisk fsblocks.  Starting with the
> > > top level of the tree, each iteration through the loop adds the fanout
> > > factor of the next level down until we run out of blocks.  IOWs, maximum
> > > height is achieved by using the smallest fanout factor that can apply
> > > to that level.
> > >
> > > However, the loop setup is not correct.  Top level btree blocks are
> > > allowed to contain fewer than minrecs items, so the computation is
> >     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> >
> > Ah, that's the critical piece of information I was looking for.  I
> > couldn't work out from the code change below what was wrong with
> > limits[1].  So....
> >
> > > diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
> > > index 4c16c8c31fcb..8d11d3f5e529 100644
> > > --- a/fs/xfs/libxfs/xfs_btree.c
> > > +++ b/fs/xfs/libxfs/xfs_btree.c
> > > @@ -4666,7 +4666,11 @@ xfs_btree_space_to_height(
> > >  	const unsigned int	*limits,
> > >  	unsigned long long	leaf_blocks)
> > >  {
> > > -	unsigned long long	node_blocks = limits[1];
> > > +	/*
> > > +	 * The root btree block can have a fanout between 2 and maxrecs because
> > > +	 * the tree might not be big enough to fill it.
> > > +	 */
> >
> > Can you change this comment to say something like:
> >
> > 	/*
> > 	 * The root btree block can have less than minrecs pointers
> > 	 * in it because the tree might not be big enough to require
> > 	 * that amount of fanout. Hence it has a minimum size of
> > 	 * 2 pointers, not limits[1].
> > 	 */
>
> Done.  Thanks for the reviews! :)
>
> --D
>
> > Otherwise it looks good.
> >
> > Reviewed-by: Dave Chinner <dchinner@redhat.com>
> >
> > > +	unsigned long long	node_blocks = 2;
> > >  	unsigned long long	blocks_left = leaf_blocks - 1;
> > >  	unsigned int		height = 1;
> >
> > For future consideration, we don't use maxrecs in this calculation
> > at all - should we just pass minrecs into the function rather than
> > an array of limits?

I prefer to replace all those mxr/mnr array constructs with an explicit
structure:

struct xfs_btblock_geometry {
	uint16_t	leaf_maxrecs;
	uint16_t	leaf_minrecs;

	uint16_t	node_maxrecs;
	uint16_t	node_minrecs;
};

and get rid of all the mp->m_rmap_mxr[leaf != 0] stuff that slows me
down every time I have to read it.

--D

> > Cheers,
> >
> > Dave.
> > --
> > Dave Chinner
> > david@fromorbit.com
* Re: [PATCH 4/4] xfs: fix off-by-one error in xfs_btree_space_to_height
  2022-12-20 16:20         ` Darrick J. Wong
@ 2022-12-20 20:39           ` Dave Chinner
  0 siblings, 0 replies; 13+ messages in thread
From: Dave Chinner @ 2022-12-20 20:39 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: linux-xfs

On Tue, Dec 20, 2022 at 08:20:03AM -0800, Darrick J. Wong wrote:
> On Tue, Dec 20, 2022 at 08:10:08AM -0800, Darrick J. Wong wrote:
> > On Tue, Dec 20, 2022 at 04:00:01PM +1100, Dave Chinner wrote:
> > > For future consideration, we don't use maxrecs in this calculation
> > > at all - should we just pass minrecs into the function rather than
> > > an array of limits?
>
> I prefer to replace all those mxr/mnr array constructs with an explicit
> structure:
>
> struct xfs_btblock_geometry {
> 	uint16_t	leaf_maxrecs;
> 	uint16_t	leaf_minrecs;
>
> 	uint16_t	node_maxrecs;
> 	uint16_t	node_minrecs;
> };
>
> and get rid of all the mp->m_rmap_mxr[leaf != 0] stuff that slows me
> down every time I have to read it.

Yup, that sounds like a great idea - I have to work it out from first
principles every time, too.

-Dave.

--
Dave Chinner
david@fromorbit.com
end of thread, other threads:[~2022-12-20 20:39 UTC | newest]

Thread overview: 13+ messages
2022-12-20  0:04 [PATCHSET 0/4] xfs: random fixes for 6.2, part 3 Darrick J. Wong
2022-12-20  0:05 ` [PATCH 1/4] xfs: don't assert if cmap covers imap after cycling lock Darrick J. Wong
2022-12-20  4:49   ` Dave Chinner
2022-12-20  0:05 ` [PATCH 2/4] xfs: don't stall background reclaim on inactvation Darrick J. Wong
2022-12-20  4:49   ` Dave Chinner
2022-12-20 16:28     ` Darrick J. Wong
2022-12-20  0:05 ` [PATCH 3/4] xfs: make xfs_iomap_page_ops static Darrick J. Wong
2022-12-20  4:49   ` Dave Chinner
2022-12-20  0:05 ` [PATCH 4/4] xfs: fix off-by-one error in xfs_btree_space_to_height Darrick J. Wong
2022-12-20  5:00   ` Dave Chinner
2022-12-20 16:10     ` Darrick J. Wong
2022-12-20 16:20       ` Darrick J. Wong
2022-12-20 20:39         ` Dave Chinner