* Maximum height of rmapbt when reflink feature is enabled
@ 2020-11-30 9:05 Chandan Babu R
2020-11-30 19:26 ` Darrick J. Wong
2020-11-30 21:51 ` Dave Chinner
0 siblings, 2 replies; 8+ messages in thread
From: Chandan Babu R @ 2020-11-30 9:05 UTC (permalink / raw)
To: linux-xfs; +Cc: david, darrick.wong
The comment in xfs_rmapbt_compute_maxlevels() mentions that with
reflink enabled, XFS will run out of AG blocks before reaching maximum
levels of XFS_BTREE_MAXLEVELS (i.e. 9). This is easy to prove for 4k
block size case:
Considering theoretical limits, maximum height of rmapbt can be,
max btree height = Log_(min_recs)(total recs)
max_rmapbt_height = Log_45(2^64) = 12.
Detailed calculation:
nr-levels = 1; nr-leaf-blks = 2^64 / 84 = 2e17;
nr-levels = 2; nr-blks = 2e17 / 45 = 5e15;
nr-levels = 3; nr-blks = 5e15 / 45 = 1e14;
nr-levels = 4; nr-blks = 1e14 / 45 = 2e12;
nr-levels = 5; nr-blks = 2e12 / 45 = 5e10;
nr-levels = 6; nr-blks = 5e10 / 45 = 1e9;
nr-levels = 7; nr-blks = 1e9 / 45 = 3e7;
nr-levels = 8; nr-blks = 3e7 / 45 = 6e5;
nr-levels = 9; nr-blks = 6e5 / 45 = 1e4;
nr-levels = 10; nr-blks = 1e4 / 45 = 3e2;
nr-levels = 11; nr-blks = 3e2 / 45 = 6;
nr-levels = 12; nr-blks = 1;
Total number of blocks = 2e17
Here, 84 is the minimum number of leaf records and 45 is the minimum
number of node records in the rmapbt when using 4k block size. 2^64 is
the maximum possible rmapbt records
(i.e. max_rmap_entries_per_disk_block (2^32) * max_nr_agblocks
(2^32)).
i.e. theoretically rmapbt height can go upto 12.
But as the comment in xfs_rmapbt_compute_maxlevels() suggests, we will
run out of per-ag blocks trying to build an rmapbt of height
XFS_BTREE_MAXLEVELS (i.e. 9).
Since number of nodes grows as a geometric series,
nr_nodes (roughly) = (45^9 - 1) / (45 - 1) = 10e12
i.e. 10e12 blocks > max ag blocks (2^32 == 4e9)
However, with 1k block size we are not close to consuming all of 2^32
AG blocks as shown by the below calculations,
- rmapbt with maximum of 9 levels will have roughly (11^9 - 1) / (11 -
1) = 2e8 blocks.
- 11 is the minimum number of recs in a non-leaf node with 1k block size.
- Also, Total number of records (roughly) = (nr_leaves * 11) = 11^8 * 11
= 2e9 (this value will be used later).
- refcountbt
- Maximum number of records theoretically = maximum number of blocks
in an AG = 2^32
- Total (leaves and non-leaf nodes) blocks required to hold 2^32 records
Leaf min recs = 20; Node min recs = 60 (with 1k as the block size).
- Detailed calculation:
nr-levels = 1; nr-leaf-blks = 2^32 / 20 = 2e8;
nr-levels = 2; nr-blks = 2e8 / 60 = 4e6
nr-levels = 3; nr-blks = 4e6 / 60 = 6e4
nr-levels = 4; nr-blks = 6e4 / 60 = 1.0e3
nr-levels = 5; nr-blks = 1.0e3 / 60 = 2e1
nr-levels = 6; nr-blks = 1
- Total block count = 2e8
- Bmbt (assuming all the rmapbt records have the same inode as owner)
- Total (leaves and non-leaf nodes) blocks required to hold 2e9 records
Leaf min recs = 29; Node min recs = 29 (with 1k as the block size).
(2e9 is the maximum rmapbt records with rmapbt height 9 and 1k block size).
nr-levels = 1; nr-leaf-blks = 2e9 / 29 = 7e7
nr-levels = 2; nr-blks = 7e7 / 29 = 2e6
nr-levels = 3; nr-blks = 2e6 / 29 = 8e4
nr-levels = 4; nr-blks = 8e4 / 29 = 3e3
nr-levels = 5; nr-blks = 3e3 / 29 = 1e2
nr-levels = 6; nr-blks = 1e2 / 29 = 3
nr-levels = 7; nr-blks = 1
- Total block count = 7e7
Total blocks used across rmapbt, refcountbt and bmbt = 2e8 + 2e8 + 7e7 = 5e8.
Since 5e8 < 4e9(i.e. 2^32), we have not run out of blocks trying to
build a rmapbt with XFS_BTREE_MAXLEVELS (i.e 9) levels.
Please let me know if my understanding is incorrect.
I have come across a "log reservation" calculation issue when
increasing XFS_BTREE_MAXLEVELS to 10 which is in turn required for
extending data fork extent count to 48 bits. To proceed further, I
need to have a correct understanding of problem I have described w.r.t
1k filesystem block size.
--
chandan
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: Maximum height of rmapbt when reflink feature is enabled
2020-11-30 9:05 Maximum height of rmapbt when reflink feature is enabled Chandan Babu R
@ 2020-11-30 19:26 ` Darrick J. Wong
2020-11-30 22:03 ` Dave Chinner
2020-11-30 21:51 ` Dave Chinner
1 sibling, 1 reply; 8+ messages in thread
From: Darrick J. Wong @ 2020-11-30 19:26 UTC (permalink / raw)
To: Chandan Babu R; +Cc: linux-xfs, david
On Mon, Nov 30, 2020 at 02:35:21PM +0530, Chandan Babu R wrote:
> The comment in xfs_rmapbt_compute_maxlevels() mentions that with
> reflink enabled, XFS will run out of AG blocks before reaching maximum
> levels of XFS_BTREE_MAXLEVELS (i.e. 9). This is easy to prove for 4k
> block size case:
>
> Considering theoretical limits, maximum height of rmapbt can be,
> max btree height = Log_(min_recs)(total recs)
> max_rmapbt_height = Log_45(2^64) = 12.
>
> Detailed calculation:
> nr-levels = 1; nr-leaf-blks = 2^64 / 84 = 2e17;
> nr-levels = 2; nr-blks = 2e17 / 45 = 5e15;
> nr-levels = 3; nr-blks = 5e15 / 45 = 1e14;
> nr-levels = 4; nr-blks = 1e14 / 45 = 2e12;
> nr-levels = 5; nr-blks = 2e12 / 45 = 5e10;
> nr-levels = 6; nr-blks = 5e10 / 45 = 1e9;
> nr-levels = 7; nr-blks = 1e9 / 45 = 3e7;
> nr-levels = 8; nr-blks = 3e7 / 45 = 6e5;
> nr-levels = 9; nr-blks = 6e5 / 45 = 1e4;
> nr-levels = 10; nr-blks = 1e4 / 45 = 3e2;
> nr-levels = 11; nr-blks = 3e2 / 45 = 6;
> nr-levels = 12; nr-blks = 1;
>
> Total number of blocks = 2e17
>
> Here, 84 is the minimum number of leaf records and 45 is the minimum
> number of node records in the rmapbt when using 4k block size. 2^64 is
> the maximum possible rmapbt records
> (i.e. max_rmap_entries_per_disk_block (2^32) * max_nr_agblocks
> (2^32)).
>
> i.e. theoretically rmapbt height can go upto 12.
>
> But as the comment in xfs_rmapbt_compute_maxlevels() suggests, we will
> run out of per-ag blocks trying to build an rmapbt of height
> XFS_BTREE_MAXLEVELS (i.e. 9).
>
> Since number of nodes grows as a geometric series,
> nr_nodes (roughly) = (45^9 - 1) / (45 - 1) = 10e12
>
> i.e. 10e12 blocks > max ag blocks (2^32 == 4e9)
>
>
> However, with 1k block size we are not close to consuming all of 2^32
> AG blocks as shown by the below calculations,
>
> - rmapbt with maximum of 9 levels will have roughly (11^9 - 1) / (11 -
> 1) = 2e8 blocks.
> - 11 is the minimum number of recs in a non-leaf node with 1k block size.
> - Also, Total number of records (roughly) = (nr_leaves * 11) = 11^8 * 11
> = 2e9 (this value will be used later).
>
> - refcountbt
> - Maximum number of records theoretically = maximum number of blocks
> in an AG = 2^32
> - Total (leaves and non-leaf nodes) blocks required to hold 2^32 records
> Leaf min recs = 20; Node min recs = 60 (with 1k as the block size).
> - Detailed calculation:
> nr-levels = 1; nr-leaf-blks = 2^32 / 20 = 2e8;
> nr-levels = 2; nr-blks = 2e8 / 60 = 4e6
> nr-levels = 3; nr-blks = 4e6 / 60 = 6e4
> nr-levels = 4; nr-blks = 6e4 / 60 = 1.0e3
> nr-levels = 5; nr-blks = 1.0e3 / 60 = 2e1
> nr-levels = 6; nr-blks = 1
>
> - Total block count = 2e8
>
> - Bmbt (assuming all the rmapbt records have the same inode as owner)
> - Total (leaves and non-leaf nodes) blocks required to hold 2e9 records
> Leaf min recs = 29; Node min recs = 29 (with 1k as the block size).
> (2e9 is the maximum rmapbt records with rmapbt height 9 and 1k block size).
> nr-levels = 1; nr-leaf-blks = 2e9 / 29 = 7e7
> nr-levels = 2; nr-blks = 7e7 / 29 = 2e6
> nr-levels = 3; nr-blks = 2e6 / 29 = 8e4
> nr-levels = 4; nr-blks = 8e4 / 29 = 3e3
> nr-levels = 5; nr-blks = 3e3 / 29 = 1e2
> nr-levels = 6; nr-blks = 1e2 / 29 = 3
> nr-levels = 7; nr-blks = 1
>
> - Total block count = 7e7
>
> Total blocks used across rmapbt, refcountbt and bmbt = 2e8 + 2e8 + 7e7 = 5e8.
>
> Since 5e8 < 4e9(i.e. 2^32), we have not run out of blocks trying to
> build a rmapbt with XFS_BTREE_MAXLEVELS (i.e 9) levels.
>
> Please let me know if my understanding is incorrect.
I have no idea what is the real upper limit on the number of rmap
records on a reflink filesystem -- as the rmap btree gets bigger, the
maximum number of records that one could need to store in that btree
gets smaller because rmap btree blocks cannot be shared and all have the
same owner (_OWN_AG).
But your reasoning seems correct.
> I have come across a "log reservation" calculation issue when
> increasing XFS_BTREE_MAXLEVELS to 10 which is in turn required for
Hmm. That will increase the size of the btree cursor structure even
farther. It's already gotten pretty bad with the realtime rmap and
reflink patchsets since the realtime volume can have 2^63 blocks, which
implies a theoretical maximum rtrmapbt height of 21 levels and a maximum
rtrefcountbt height of 13 levels.
(These heights are absurd, since they imply a data device of 2^63
blocks...)
I suspect that we need to split MAXLEVELS into two values -- one for
per-AG btrees, and one for per-file btrees, and then refactor the btree
cursor so that the level data are a single VLA at the end. I started a
patchset to do all that[1], but it's incomplete.
[1] https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/commit/?h=btree-dynamic-depth&id=692f761838dd821cd8cc5b3d1c66d6b1ac8ec05b
> extending data fork extent count to 48 bits. To proceed further, I
> need to have a correct understanding of problem I have described w.r.t
> 1k filesystem block size.
<nod>
--D
>
> --
> chandan
>
>
>
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: Maximum height of rmapbt when reflink feature is enabled
2020-11-30 9:05 Maximum height of rmapbt when reflink feature is enabled Chandan Babu R
2020-11-30 19:26 ` Darrick J. Wong
@ 2020-11-30 21:51 ` Dave Chinner
2020-12-01 13:12 ` Chandan Babu R
1 sibling, 1 reply; 8+ messages in thread
From: Dave Chinner @ 2020-11-30 21:51 UTC (permalink / raw)
To: Chandan Babu R; +Cc: linux-xfs, darrick.wong
On Mon, Nov 30, 2020 at 02:35:21PM +0530, Chandan Babu R wrote:
> The comment in xfs_rmapbt_compute_maxlevels() mentions that with
> reflink enabled, XFS will run out of AG blocks before reaching maximum
> levels of XFS_BTREE_MAXLEVELS (i.e. 9). This is easy to prove for 4k
> block size case:
>
> Considering theoretical limits, maximum height of rmapbt can be,
> max btree height = Log_(min_recs)(total recs)
> max_rmapbt_height = Log_45(2^64) = 12.
I think the use of mirecs here is wrong. We can continue to fill
both leaves and nodes above minrecs once we get to maximal height.
When a leaf/node is full, it will attempt to shift records left and
right to sibling nodes before trying to split. Hence a split only
occurs when all three nodes are completely full.
Hence we'll end up with a much higher average population of leaves
and nodes than the minimum when we are at maximum height, especially
at the upper levels of the btree near the root.
i.e. we won't try to split the root ever until the root is at
maximum capacity, and we won't try to split the next level down
(before we get to a root split) until at least 3 adjacent nodes
are at max capacity, and so on. Hence at higher levels, the tree is
always going to tend towards max capacity nodes, not min capacity.
That changes these calculations quite a lot...
> Detailed calculation:
> nr-levels = 1; nr-leaf-blks = 2^64 / 84 = 2e17;
> nr-levels = 2; nr-blks = 2e17 / 45 = 5e15;
> nr-levels = 3; nr-blks = 5e15 / 45 = 1e14;
> nr-levels = 4; nr-blks = 1e14 / 45 = 2e12;
> nr-levels = 5; nr-blks = 2e12 / 45 = 5e10;
> nr-levels = 6; nr-blks = 5e10 / 45 = 1e9;
> nr-levels = 7; nr-blks = 1e9 / 45 = 3e7;
> nr-levels = 8; nr-blks = 3e7 / 45 = 6e5;
> nr-levels = 9; nr-blks = 6e5 / 45 = 1e4;
> nr-levels = 10; nr-blks = 1e4 / 45 = 3e2;
> nr-levels = 11; nr-blks = 3e2 / 45 = 6;
> nr-levels = 12; nr-blks = 1;
>
> Total number of blocks = 2e17
>
> Here, 84 is the minimum number of leaf records and 45 is the minimum
> number of node records in the rmapbt when using 4k block size. 2^64 is
> the maximum possible rmapbt records
> (i.e. max_rmap_entries_per_disk_block (2^32) * max_nr_agblocks
> (2^32)).
>
> i.e. theoretically rmapbt height can go upto 12.
Yes, but if the rmapbt contains 2^64 records, how many physical disk
blocks does it consume itself? Worst case that's 2^64 / 45, so
somewhere between 2^58 and 2^59 blocks of storage required for
indexing all those reocrds with a 4kB block size?
Yet we only have 2^28 4kB blocks in an AG, and the limit of a
rmapbt is actually what can physically fit in an AG, yes?
So, regardless of what the theoretical record capacity of a worse
case rmapbt record fill might be, it hits physical record storage
limits long before that. i.e. the limit on the btree height is
physical storage, not theoretical record indexing capability.
In which case, let us [incorrectly, see later] assume we have a
handful of maximally shared single data blocks and the rest of the
1TB AG is rmapbt. i.e. the maximum depth of the rmapbt is
determined by how much physical space it can consume, which is very
close to 2^32 blocks for 1kB filesystems.
Let's work from max capacity, so 2^32 * 21 is the max record holding
capacity of the per-ag rmapbt. Hence the worst case index height
requirement for the rmapbt is indexing ~2^37 records.
nr-levels = 1; nr-leaf-blks = 2^32 * 21 / 21 = 4e9
nr-levels = 2; nr-blks = 4e9 / 21 = 2e8;
nr-levels = 3; nr-blks = 2e8 / 21 = 9e6;
nr-levels = 4; nr-blks = 9e6/ 21 = 4e5;
nr-levels = 5; nr-blks = 4e5 / 21 = 2e4;
nr-levels = 6; nr-blks = 2e4 / 21 = 1e3;
nr-levels = 7; nr-blks = 1e3 / 21 = 50
nr-levels = 8; nr-blks = 50 / 21 = 3;
nr-levels = 9; nr-blks = 3 / 21 = 1;
nr-levels = 10; nr-blks = 1;
So we *just* tick over into a 10 level rmapbt here in this worse
case where the rmapbt takes *all* of the AG space. In comparison,
min capacity lower nodes, max capacity higher nodes:
nr-levels = 1; nr-leaf-blks = 2^32 * 11 / 11 = 4e9
nr-levels = 2; nr-blks = 4e9 / 11 = 4e8;
nr-levels = 3; nr-blks = 4e8 / 11 = 4e7;
nr-levels = 4; nr-blks = 4e7/ 11 = 3e6;
nr-levels = 5; nr-blks = 3e6 / 21 = 2e5;
nr-levels = 6; nr-blks = 2e5 / 21 = 7e3;
nr-levels = 7; nr-blks = 7e3 / 21 = 348;
nr-levels = 8; nr-blks = 348 / 21 = 16;
nr-levels = 9; nr-blks = 16 / 21 = 1;
nr-levels = 10; nr-blks = 1;
Same, it's a 10 level tree.
> But as the comment in xfs_rmapbt_compute_maxlevels() suggests, we will
> run out of per-ag blocks trying to build an rmapbt of height
> XFS_BTREE_MAXLEVELS (i.e. 9).
>
> Since number of nodes grows as a geometric series,
> nr_nodes (roughly) = (45^9 - 1) / (45 - 1) = 10e12
>
> i.e. 10e12 blocks > max ag blocks (2^32 == 4e9)
>
> However, with 1k block size we are not close to consuming all of 2^32
> AG blocks as shown by the below calculations,
>
> - rmapbt with maximum of 9 levels will have roughly (11^9 - 1) / (11 -
> 1) = 2e8 blocks.
2.35e8, which is a quarter of the max AG space, assuming minimum
capacity nodes. Assuming max capacity nodes, we're at
3.9e10 blocks, which is almost 40x the number of blocks in the AG...
> - 11 is the minimum number of recs in a non-leaf node with 1k block size.
> - Also, Total number of records (roughly) = (nr_leaves * 11) = 11^8 * 11
> = 2e9 (this value will be used later).
>
> - refcountbt
> - Maximum number of records theoretically = maximum number of blocks
> in an AG = 2^32
The logic here is flawed - you're assuming exclusive use of the AG
to hold *both* rmapbt records and shared data extents. Every block
used for an rmap record cannot have a refcount record pointing at
it because per-ag metadata cannot be shared.
> - Total (leaves and non-leaf nodes) blocks required to hold 2^32 records
> Leaf min recs = 20; Node min recs = 60 (with 1k as the block size).
> - Detailed calculation:
> nr-levels = 1; nr-leaf-blks = 2^32 / 20 = 2e8;
> nr-levels = 2; nr-blks = 2e8 / 60 = 4e6
> nr-levels = 3; nr-blks = 4e6 / 60 = 6e4
> nr-levels = 4; nr-blks = 6e4 / 60 = 1.0e3
> nr-levels = 5; nr-blks = 1.0e3 / 60 = 2e1
> nr-levels = 6; nr-blks = 1
>
> - Total block count = 2e8
So, if we assume that 20% of the AG space is refcount btree records
like this, then the rmapbt only consumes 80% of the AG, then the
rmap btree is much closer to the 9 level limit, especially if
we consider we'll tend towards max capacity nodes as we fill the
tree, not min capacity nodes.
But the reality is that we could have a single refcount record with
2^32 references, and that requires 2^32 rmapbt records. So if we are
talking about really high extent refcount situations, the worst case
is a handful of single blocks with billions of references. IOWs, the
refcountbt is a single block with ~2^5 entries in it, each which
track 2^32 references. Now we require 2^37 rmapbt records, and we've
overflowed the maximum physical storage capacity of the AG for
rmapbt blocks.
> - Bmbt (assuming all the rmapbt records have the same inode as owner)
> - Total (leaves and non-leaf nodes) blocks required to hold 2e9 records
> Leaf min recs = 29; Node min recs = 29 (with 1k as the block size).
> (2e9 is the maximum rmapbt records with rmapbt height 9 and 1k block size).
> nr-levels = 1; nr-leaf-blks = 2e9 / 29 = 7e7
> nr-levels = 2; nr-blks = 7e7 / 29 = 2e6
> nr-levels = 3; nr-blks = 2e6 / 29 = 8e4
> nr-levels = 4; nr-blks = 8e4 / 29 = 3e3
> nr-levels = 5; nr-blks = 3e3 / 29 = 1e2
> nr-levels = 6; nr-blks = 1e2 / 29 = 3
> nr-levels = 7; nr-blks = 1
>
> - Total block count = 7e7
>
> Total blocks used across rmapbt, refcountbt and bmbt = 2e8 + 2e8 + 7e7 = 5e8.
BMBT blocks are not bound to an AG the way refcount and rmapbt
records are. They can be anywhere in the filesystem, so they play no
part in calculations like these.
> Since 5e8 < 4e9(i.e. 2^32), we have not run out of blocks trying
> to build a rmapbt with XFS_BTREE_MAXLEVELS (i.e 9) levels.
>
> Please let me know if my understanding is incorrect.
The worst case theory is largely sound, but theory != reality.
The actual reality of someone with billions of refcount entries to a
a handful of single extents in a single AG is using a 1kB block size
is likely to be so rare that we shouldn't consider it a critical
design point.
That is, if the values cover the default 4kB block size just fine,
then we should not penalise every common configuration just to
handle an extreme case in an extremely rare corner case for a
non-default configuration.
> I have come across a "log reservation" calculation issue when
> increasing XFS_BTREE_MAXLEVELS to 10 which is in turn required for
> extending data fork extent count to 48 bits.
What is that issue?
The BMBT modifications should not care about XFS_BTREE_MAXLEVELS as
a limit, we use the calculated mp->m_bm_maxlevels[] variables for
BMBT height. These are bound by the maximum number of extents the
tree can index, not the depth of btrees allowed within an AG.
Changing the size of the BMBT tree should not affect the per-ag
btrees in any way, nor the log reservations required to manipulate
the per-ag btrees...
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: Maximum height of rmapbt when reflink feature is enabled
2020-11-30 19:26 ` Darrick J. Wong
@ 2020-11-30 22:03 ` Dave Chinner
2020-12-01 13:12 ` Chandan Babu R
0 siblings, 1 reply; 8+ messages in thread
From: Dave Chinner @ 2020-11-30 22:03 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: Chandan Babu R, linux-xfs
On Mon, Nov 30, 2020 at 11:26:05AM -0800, Darrick J. Wong wrote:
> On Mon, Nov 30, 2020 at 02:35:21PM +0530, Chandan Babu R wrote:
> > I have come across a "log reservation" calculation issue when
> > increasing XFS_BTREE_MAXLEVELS to 10 which is in turn required for
>
> Hmm. That will increase the size of the btree cursor structure even
> farther. It's already gotten pretty bad with the realtime rmap and
> reflink patchsets since the realtime volume can have 2^63 blocks, which
> implies a theoretical maximum rtrmapbt height of 21 levels and a maximum
> rtrefcountbt height of 13 levels.
The cursor is dynamically allocated, yes? So what we need to do is
drop the idea that the btree is a fixed size and base it's size on
the actual number of levels iwe calculated for that the btree it is
being allocated for, right?
> (These heights are absurd, since they imply a data device of 2^63
> blocks...)
>
> I suspect that we need to split MAXLEVELS into two values -- one for
> per-AG btrees, and one for per-file btrees,
We already do that. XFS_BTREE_MAXLEVELS is supposed to only be for
per-AG btrees. It is not used for BMBTs at all, they use
mp->m_bm_maxlevels[] which have max height calculations done at
mount time.
The problem is the cursor, because right now max mp->m_bm_maxlevels
fits within the XFS_BTREE_MAXLEVELS limit for the per-AG trees as
well, because everything is limited to less than 2^32 records...
> and then refactor the btree
> cursor so that the level data are a single VLA at the end. I started a
> patchset to do all that[1], but it's incomplete.
>
> [1] https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/commit/?h=btree-dynamic-depth&id=692f761838dd821cd8cc5b3d1c66d6b1ac8ec05b
Yeah, this, along with dynamic sizing of the rmapbt based
on the physical AG size when refcount is enabled...
And then we just don't have to care about the 1kB block size case at
all....
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: Maximum height of rmapbt when reflink feature is enabled
2020-11-30 21:51 ` Dave Chinner
@ 2020-12-01 13:12 ` Chandan Babu R
2020-12-03 21:55 ` Darrick J. Wong
0 siblings, 1 reply; 8+ messages in thread
From: Chandan Babu R @ 2020-12-01 13:12 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-xfs, darrick.wong
On Tue, 01 Dec 2020 08:51:14 +1100, Dave Chinner wrote:
> On Mon, Nov 30, 2020 at 02:35:21PM +0530, Chandan Babu R wrote:
> > The comment in xfs_rmapbt_compute_maxlevels() mentions that with
> > reflink enabled, XFS will run out of AG blocks before reaching maximum
> > levels of XFS_BTREE_MAXLEVELS (i.e. 9). This is easy to prove for 4k
> > block size case:
> >
> > Considering theoretical limits, maximum height of rmapbt can be,
> > max btree height = Log_(min_recs)(total recs)
> > max_rmapbt_height = Log_45(2^64) = 12.
>
> I think the use of mirecs here is wrong. We can continue to fill
> both leaves and nodes above minrecs once we get to maximal height.
> When a leaf/node is full, it will attempt to shift records left and
> right to sibling nodes before trying to split. Hence a split only
> occurs when all three nodes are completely full.
>
> Hence we'll end up with a much higher average population of leaves
> and nodes than the minimum when we are at maximum height, especially
> at the upper levels of the btree near the root.
>
> i.e. we won't try to split the root ever until the root is at
> maximum capacity, and we won't try to split the next level down
> (before we get to a root split) until at least 3 adjacent nodes
> are at max capacity, and so on. Hence at higher levels, the tree is
> always going to tend towards max capacity nodes, not min capacity.
> That changes these calculations quite a lot...
>
> > Detailed calculation:
> > nr-levels = 1; nr-leaf-blks = 2^64 / 84 = 2e17;
> > nr-levels = 2; nr-blks = 2e17 / 45 = 5e15;
> > nr-levels = 3; nr-blks = 5e15 / 45 = 1e14;
> > nr-levels = 4; nr-blks = 1e14 / 45 = 2e12;
> > nr-levels = 5; nr-blks = 2e12 / 45 = 5e10;
> > nr-levels = 6; nr-blks = 5e10 / 45 = 1e9;
> > nr-levels = 7; nr-blks = 1e9 / 45 = 3e7;
> > nr-levels = 8; nr-blks = 3e7 / 45 = 6e5;
> > nr-levels = 9; nr-blks = 6e5 / 45 = 1e4;
> > nr-levels = 10; nr-blks = 1e4 / 45 = 3e2;
> > nr-levels = 11; nr-blks = 3e2 / 45 = 6;
> > nr-levels = 12; nr-blks = 1;
> >
> > Total number of blocks = 2e17
> >
> > Here, 84 is the minimum number of leaf records and 45 is the minimum
> > number of node records in the rmapbt when using 4k block size. 2^64 is
> > the maximum possible rmapbt records
> > (i.e. max_rmap_entries_per_disk_block (2^32) * max_nr_agblocks
> > (2^32)).
> >
> > i.e. theoretically rmapbt height can go upto 12.
>
> Yes, but if the rmapbt contains 2^64 records, how many physical disk
> blocks does it consume itself? Worst case that's 2^64 / 45, so
> somewhere between 2^58 and 2^59 blocks of storage required for
> indexing all those reocrds with a 4kB block size?
>
> Yet we only have 2^28 4kB blocks in an AG, and the limit of a
> rmapbt is actually what can physically fit in an AG, yes?
>
> So, regardless of what the theoretical record capacity of a worse
> case rmapbt record fill might be, it hits physical record storage
> limits long before that. i.e. the limit on the btree height is
> physical storage, not theoretical record indexing capability.
>
> In which case, let us [incorrectly, see later] assume we have a
> handful of maximally shared single data blocks and the rest of the
> 1TB AG is rmapbt. i.e. the maximum depth of the rmapbt is
> determined by how much physical space it can consume, which is very
> close to 2^32 blocks for 1kB filesystems.
>
> Let's work from max capacity, so 2^32 * 21 is the max record holding
> capacity of the per-ag rmapbt. Hence the worst case index height
> requirement for the rmapbt is indexing ~2^37 records.
>
> nr-levels = 1; nr-leaf-blks = 2^32 * 21 / 21 = 4e9
> nr-levels = 2; nr-blks = 4e9 / 21 = 2e8;
> nr-levels = 3; nr-blks = 2e8 / 21 = 9e6;
> nr-levels = 4; nr-blks = 9e6/ 21 = 4e5;
> nr-levels = 5; nr-blks = 4e5 / 21 = 2e4;
> nr-levels = 6; nr-blks = 2e4 / 21 = 1e3;
> nr-levels = 7; nr-blks = 1e3 / 21 = 50
> nr-levels = 8; nr-blks = 50 / 21 = 3;
> nr-levels = 9; nr-blks = 3 / 21 = 1;
> nr-levels = 10; nr-blks = 1;
>
> So we *just* tick over into a 10 level rmapbt here in this worse
> case where the rmapbt takes *all* of the AG space. In comparison,
> min capacity lower nodes, max capacity higher nodes:
>
> nr-levels = 1; nr-leaf-blks = 2^32 * 11 / 11 = 4e9
> nr-levels = 2; nr-blks = 4e9 / 11 = 4e8;
> nr-levels = 3; nr-blks = 4e8 / 11 = 4e7;
> nr-levels = 4; nr-blks = 4e7/ 11 = 3e6;
> nr-levels = 5; nr-blks = 3e6 / 21 = 2e5;
> nr-levels = 6; nr-blks = 2e5 / 21 = 7e3;
> nr-levels = 7; nr-blks = 7e3 / 21 = 348;
> nr-levels = 8; nr-blks = 348 / 21 = 16;
> nr-levels = 9; nr-blks = 16 / 21 = 1;
> nr-levels = 10; nr-blks = 1;
>
> Same, it's a 10 level tree.
>
> > But as the comment in xfs_rmapbt_compute_maxlevels() suggests, we will
> > run out of per-ag blocks trying to build an rmapbt of height
> > XFS_BTREE_MAXLEVELS (i.e. 9).
> >
> > Since number of nodes grows as a geometric series,
> > nr_nodes (roughly) = (45^9 - 1) / (45 - 1) = 10e12
> >
> > i.e. 10e12 blocks > max ag blocks (2^32 == 4e9)
> >
> > However, with 1k block size we are not close to consuming all of 2^32
> > AG blocks as shown by the below calculations,
> >
> > - rmapbt with maximum of 9 levels will have roughly (11^9 - 1) / (11 -
> > 1) = 2e8 blocks.
>
> 2.35e8, which is a quarter of the max AG space, assuming minimum
> capacity nodes. Assuming max capacity nodes, we're at
> 3.9e10 blocks, which is almost 40x the number of blocks in the AG...
>
> > - 11 is the minimum number of recs in a non-leaf node with 1k block size.
> > - Also, Total number of records (roughly) = (nr_leaves * 11) = 11^8 * 11
> > = 2e9 (this value will be used later).
> >
> > - refcountbt
> > - Maximum number of records theoretically = maximum number of blocks
> > in an AG = 2^32
>
> The logic here is flawed - you're assuming exclusive use of the AG
> to hold *both* rmapbt records and shared data extents. Every block
> used for an rmap record cannot have a refcount record pointing at
> it because per-ag metadata cannot be shared.
>
> > - Total (leaves and non-leaf nodes) blocks required to hold 2^32 records
> > Leaf min recs = 20; Node min recs = 60 (with 1k as the block size).
> > - Detailed calculation:
> > nr-levels = 1; nr-leaf-blks = 2^32 / 20 = 2e8;
> > nr-levels = 2; nr-blks = 2e8 / 60 = 4e6
> > nr-levels = 3; nr-blks = 4e6 / 60 = 6e4
> > nr-levels = 4; nr-blks = 6e4 / 60 = 1.0e3
> > nr-levels = 5; nr-blks = 1.0e3 / 60 = 2e1
> > nr-levels = 6; nr-blks = 1
> >
> > - Total block count = 2e8
>
> So, if we assume that 20% of the AG space is refcount btree records
> like this, then the rmapbt only consumes 80% of the AG, then the
> rmap btree is much closer to the 9 level limit, especially if
> we consider we'll tend towards max capacity nodes as we fill the
> tree, not min capacity nodes.
>
> But the reality is that we could have a single refcount record with
> 2^32 references, and that requires 2^32 rmapbt records. So if we are
> talking about really high extent refcount situations, the worst case
> is a handful of single blocks with billions of references. IOWs, the
> refcountbt is a single block with ~2^5 entries in it, each which
> track 2^32 references. Now we require 2^37 rmapbt records, and we've
> overflowed the maximum physical storage capacity of the AG for
> rmapbt blocks.
Thanks for the detailed explanation.
>
> > - Bmbt (assuming all the rmapbt records have the same inode as owner)
> > - Total (leaves and non-leaf nodes) blocks required to hold 2e9 records
> > Leaf min recs = 29; Node min recs = 29 (with 1k as the block size).
> > (2e9 is the maximum rmapbt records with rmapbt height 9 and 1k block size).
> > nr-levels = 1; nr-leaf-blks = 2e9 / 29 = 7e7
> > nr-levels = 2; nr-blks = 7e7 / 29 = 2e6
> > nr-levels = 3; nr-blks = 2e6 / 29 = 8e4
> > nr-levels = 4; nr-blks = 8e4 / 29 = 3e3
> > nr-levels = 5; nr-blks = 3e3 / 29 = 1e2
> > nr-levels = 6; nr-blks = 1e2 / 29 = 3
> > nr-levels = 7; nr-blks = 1
> >
> > - Total block count = 7e7
> >
> > Total blocks used across rmapbt, refcountbt and bmbt = 2e8 + 2e8 + 7e7 = 5e8.
>
> BMBT blocks are not bound to an AG the way refcount and rmapbt
> records are. They can be anywhere in the filesystem, so they play no
> part in calculations like these.
>
> > Since 5e8 < 4e9(i.e. 2^32), we have not run out of blocks trying
> > to build a rmapbt with XFS_BTREE_MAXLEVELS (i.e 9) levels.
> >
> > Please let me know if my understanding is incorrect.
>
> The worst case theory is largely sound, but theory != reality.
>
> The actual reality of someone with billions of refcount entries to a
> a handful of single extents in a single AG is using a 1kB block size
> is likely to be so rare that we shouldn't consider it a critical
> design point.
>
> That is, if the values cover the default 4kB block size just fine,
> then we should not penalise every common configuration just to
> handle an extreme case in an extremely rare corner case for a
> non-default configuration.
Ok.
>
> > I have come across a "log reservation" calculation issue when
> > increasing XFS_BTREE_MAXLEVELS to 10 which is in turn required for
> > extending data fork extent count to 48 bits.
>
> What is that issue?
Sorry, I should have described it when sending the original mail.
Increasing the value of XFS_BTREE_MAXLEVELS to 10 caused the value of
mp->m_rmap_maxlevels to be set to 10 when reflink feature is enabled. This in
turn increased "min log size" calculations and hence a modified kernel (one
with XFS_BTREE_MAXLEVELS set to 10) will fail to mount a filesystem which was
created by mkfs.xfs which used 9 as the value for XFS_BTREE_MAXLEVELS for its
"min log size" calculations.
Of course, this problem is no longer valid because ...
>
> The BMBT modifications should not care about XFS_BTREE_MAXLEVELS as
> a limit, we use the calculated mp->m_bm_maxlevels[] variables for
> BMBT height. These are bound by the maximum number of extents the
> tree can index, not the depth of btrees allowed within an AG.
> Changing the size of the BMBT tree should not affect the per-ag
> btrees in any way, nor the log reservations required to manipulate
> the per-ag btrees...
Ah, I didn't realize that XFS_BTREE_MAXLEVELS was applicable only to per-AG
btrees. As mentioned in your response to Darrick, memory for "struct
xfs_btree_cur" should be allocated based on the maximum height of the btree
that the code is about to traverse.
Thanks once again for providing an insight into this.
--
chandan
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: Maximum height of rmapbt when reflink feature is enabled
2020-11-30 22:03 ` Dave Chinner
@ 2020-12-01 13:12 ` Chandan Babu R
2020-12-01 16:22 ` Darrick J. Wong
0 siblings, 1 reply; 8+ messages in thread
From: Chandan Babu R @ 2020-12-01 13:12 UTC (permalink / raw)
To: Dave Chinner; +Cc: Darrick J. Wong, linux-xfs
On Tue, 01 Dec 2020 09:03:47 +1100, Dave Chinner wrote:
> On Mon, Nov 30, 2020 at 11:26:05AM -0800, Darrick J. Wong wrote:
> > On Mon, Nov 30, 2020 at 02:35:21PM +0530, Chandan Babu R wrote:
> > > I have come across a "log reservation" calculation issue when
> > > increasing XFS_BTREE_MAXLEVELS to 10 which is in turn required for
> >
> > Hmm. That will increase the size of the btree cursor structure even
> > farther. It's already gotten pretty bad with the realtime rmap and
> > reflink patchsets since the realtime volume can have 2^63 blocks, which
> > implies a theoretical maximum rtrmapbt height of 21 levels and a maximum
> > rtrefcountbt height of 13 levels.
>
> The cursor is dynamically allocated, yes? So what we need to do is
> drop the idea that the btree is a fixed size and base it's size on
> the actual number of levels iwe calculated for that the btree it is
> being allocated for, right?
>
> > (These heights are absurd, since they imply a data device of 2^63
> > blocks...)
> >
> > I suspect that we need to split MAXLEVELS into two values -- one for
> > per-AG btrees, and one for per-file btrees,
>
> We already do that. XFS_BTREE_MAXLEVELS is supposed to only be for
> per-AG btrees. It is not used for BMBTs at all, they use
> mp->m_bm_maxlevels[] which have max height calculations done at
> mount time.
>
> The problem is the cursor, because right now max mp->m_bm_maxlevels
> fits within the XFS_BTREE_MAXLEVELS limit for the per-AG trees as
> well, because everything is limited to less than 2^32 records...
>
> > and then refactor the btree
> > cursor so that the level data are a single VLA at the end. I started a
> > patchset to do all that[1], but it's incomplete.
> >
> > [1] https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/commit/?h=btree-dynamic-depth&id=692f761838dd821cd8cc5b3d1c66d6b1ac8ec05b
>
Darrick, I will rebase my "Extend data fork extent count field" patches on top
your patch with required fixes applied. Please let me know if you have any
objection to it.
> Yeah, this, along with dynamic sizing of the rmapbt based
> on the physical AG size when refcount is enabled...
>
> And then we just don't have to care about the 1kB block size case at
> all....
>
--
chandan
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: Maximum height of rmapbt when reflink feature is enabled
2020-12-01 13:12 ` Chandan Babu R
@ 2020-12-01 16:22 ` Darrick J. Wong
0 siblings, 0 replies; 8+ messages in thread
From: Darrick J. Wong @ 2020-12-01 16:22 UTC (permalink / raw)
To: Chandan Babu R; +Cc: Dave Chinner, linux-xfs
On Tue, Dec 01, 2020 at 06:42:51PM +0530, Chandan Babu R wrote:
> On Tue, 01 Dec 2020 09:03:47 +1100, Dave Chinner wrote:
> > On Mon, Nov 30, 2020 at 11:26:05AM -0800, Darrick J. Wong wrote:
> > > On Mon, Nov 30, 2020 at 02:35:21PM +0530, Chandan Babu R wrote:
> > > > I have come across a "log reservation" calculation issue when
> > > > increasing XFS_BTREE_MAXLEVELS to 10 which is in turn required for
> > >
> > > Hmm. That will increase the size of the btree cursor structure even
> > > farther. It's already gotten pretty bad with the realtime rmap and
> > > reflink patchsets since the realtime volume can have 2^63 blocks, which
> > > implies a theoretical maximum rtrmapbt height of 21 levels and a maximum
> > > rtrefcountbt height of 13 levels.
> >
> > The cursor is dynamically allocated, yes? So what we need to do is
> > drop the idea that the btree is a fixed size and base it's size on
> > the actual number of levels iwe calculated for that the btree it is
> > being allocated for, right?
> >
> > > (These heights are absurd, since they imply a data device of 2^63
> > > blocks...)
> > >
> > > I suspect that we need to split MAXLEVELS into two values -- one for
> > > per-AG btrees, and one for per-file btrees,
> >
> > We already do that. XFS_BTREE_MAXLEVELS is supposed to only be for
> > per-AG btrees. It is not used for BMBTs at all, they use
> > mp->m_bm_maxlevels[] which have max height calculations done at
> > mount time.
> >
> > The problem is the cursor, because right now max mp->m_bm_maxlevels
> > fits within the XFS_BTREE_MAXLEVELS limit for the per-AG trees as
> > well, because everything is limited to less than 2^32 records...
> >
> > > and then refactor the btree
> > > cursor so that the level data are a single VLA at the end. I started a
> > > patchset to do all that[1], but it's incomplete.
> > >
> > > [1] https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/commit/?h=btree-dynamic-depth&id=692f761838dd821cd8cc5b3d1c66d6b1ac8ec05b
> >
>
> Darrick, I will rebase my "Extend data fork extent count field" patches on top
> your patch with required fixes applied. Please let me know if you have any
> objection to it.
You might want to wait a day or two, because I decided to get rid of
XFS_BTREE_MAXLEVELS entirely as part of making the cursors dynamically
sized.
--D
> > Yeah, this, along with dynamic sizing of the rmapbt based
> > on the physical AG size when refcount is enabled...
> >
> > And then we just don't have to care about the 1kB block size case at
> > all....
> >
>
> --
> chandan
>
>
>
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: Maximum height of rmapbt when reflink feature is enabled
2020-12-01 13:12 ` Chandan Babu R
@ 2020-12-03 21:55 ` Darrick J. Wong
0 siblings, 0 replies; 8+ messages in thread
From: Darrick J. Wong @ 2020-12-03 21:55 UTC (permalink / raw)
To: Chandan Babu R; +Cc: Dave Chinner, linux-xfs
On Tue, Dec 01, 2020 at 06:42:34PM +0530, Chandan Babu R wrote:
> On Tue, 01 Dec 2020 08:51:14 +1100, Dave Chinner wrote:
> > On Mon, Nov 30, 2020 at 02:35:21PM +0530, Chandan Babu R wrote:
> > > The comment in xfs_rmapbt_compute_maxlevels() mentions that with
> > > reflink enabled, XFS will run out of AG blocks before reaching maximum
> > > levels of XFS_BTREE_MAXLEVELS (i.e. 9). This is easy to prove for 4k
> > > block size case:
> > >
> > > Considering theoretical limits, maximum height of rmapbt can be,
> > > max btree height = Log_(min_recs)(total recs)
> > > max_rmapbt_height = Log_45(2^64) = 12.
> >
> > I think the use of mirecs here is wrong. We can continue to fill
> > both leaves and nodes above minrecs once we get to maximal height.
> > When a leaf/node is full, it will attempt to shift records left and
> > right to sibling nodes before trying to split. Hence a split only
> > occurs when all three nodes are completely full.
> >
> > Hence we'll end up with a much higher average population of leaves
> > and nodes than the minimum when we are at maximum height, especially
> > at the upper levels of the btree near the root.
> >
> > i.e. we won't try to split the root ever until the root is at
> > maximum capacity, and we won't try to split the next level down
> > (before we get to a root split) until at least 3 adjacent nodes
> > are at max capacity, and so on. Hence at higher levels, the tree is
> > always going to tend towards max capacity nodes, not min capacity.
> > That changes these calculations quite a lot...
> >
> > > Detailed calculation:
> > > nr-levels = 1; nr-leaf-blks = 2^64 / 84 = 2e17;
> > > nr-levels = 2; nr-blks = 2e17 / 45 = 5e15;
> > > nr-levels = 3; nr-blks = 5e15 / 45 = 1e14;
> > > nr-levels = 4; nr-blks = 1e14 / 45 = 2e12;
> > > nr-levels = 5; nr-blks = 2e12 / 45 = 5e10;
> > > nr-levels = 6; nr-blks = 5e10 / 45 = 1e9;
> > > nr-levels = 7; nr-blks = 1e9 / 45 = 3e7;
> > > nr-levels = 8; nr-blks = 3e7 / 45 = 6e5;
> > > nr-levels = 9; nr-blks = 6e5 / 45 = 1e4;
> > > nr-levels = 10; nr-blks = 1e4 / 45 = 3e2;
> > > nr-levels = 11; nr-blks = 3e2 / 45 = 6;
> > > nr-levels = 12; nr-blks = 1;
> > >
> > > Total number of blocks = 2e17
> > >
> > > Here, 84 is the minimum number of leaf records and 45 is the minimum
> > > number of node records in the rmapbt when using 4k block size. 2^64 is
> > > the maximum possible rmapbt records
> > > (i.e. max_rmap_entries_per_disk_block (2^32) * max_nr_agblocks
> > > (2^32)).
> > >
> > > i.e. theoretically rmapbt height can go upto 12.
> >
> > Yes, but if the rmapbt contains 2^64 records, how many physical disk
> > blocks does it consume itself? Worst case that's 2^64 / 45, so
> > somewhere between 2^58 and 2^59 blocks of storage required for
> > indexing all those reocrds with a 4kB block size?
> >
> > Yet we only have 2^28 4kB blocks in an AG, and the limit of a
> > rmapbt is actually what can physically fit in an AG, yes?
> >
> > So, regardless of what the theoretical record capacity of a worse
> > case rmapbt record fill might be, it hits physical record storage
> > limits long before that. i.e. the limit on the btree height is
> > physical storage, not theoretical record indexing capability.
> >
> > In which case, let us [incorrectly, see later] assume we have a
> > handful of maximally shared single data blocks and the rest of the
> > 1TB AG is rmapbt. i.e. the maximum depth of the rmapbt is
> > determined by how much physical space it can consume, which is very
> > close to 2^32 blocks for 1kB filesystems.
> >
> > Let's work from max capacity, so 2^32 * 21 is the max record holding
> > capacity of the per-ag rmapbt. Hence the worst case index height
> > requirement for the rmapbt is indexing ~2^37 records.
> >
> > nr-levels = 1; nr-leaf-blks = 2^32 * 21 / 21 = 4e9
> > nr-levels = 2; nr-blks = 4e9 / 21 = 2e8;
> > nr-levels = 3; nr-blks = 2e8 / 21 = 9e6;
> > nr-levels = 4; nr-blks = 9e6/ 21 = 4e5;
> > nr-levels = 5; nr-blks = 4e5 / 21 = 2e4;
> > nr-levels = 6; nr-blks = 2e4 / 21 = 1e3;
> > nr-levels = 7; nr-blks = 1e3 / 21 = 50
> > nr-levels = 8; nr-blks = 50 / 21 = 3;
> > nr-levels = 9; nr-blks = 3 / 21 = 1;
> > nr-levels = 10; nr-blks = 1;
> >
> > So we *just* tick over into a 10 level rmapbt here in this worse
> > case where the rmapbt takes *all* of the AG space. In comparison,
> > min capacity lower nodes, max capacity higher nodes:
> >
> > nr-levels = 1; nr-leaf-blks = 2^32 * 11 / 11 = 4e9
> > nr-levels = 2; nr-blks = 4e9 / 11 = 4e8;
> > nr-levels = 3; nr-blks = 4e8 / 11 = 4e7;
> > nr-levels = 4; nr-blks = 4e7/ 11 = 3e6;
> > nr-levels = 5; nr-blks = 3e6 / 21 = 2e5;
> > nr-levels = 6; nr-blks = 2e5 / 21 = 7e3;
> > nr-levels = 7; nr-blks = 7e3 / 21 = 348;
> > nr-levels = 8; nr-blks = 348 / 21 = 16;
> > nr-levels = 9; nr-blks = 16 / 21 = 1;
> > nr-levels = 10; nr-blks = 1;
> >
> > Same, it's a 10 level tree.
> >
> > > But as the comment in xfs_rmapbt_compute_maxlevels() suggests, we will
> > > run out of per-ag blocks trying to build an rmapbt of height
> > > XFS_BTREE_MAXLEVELS (i.e. 9).
> > >
> > > Since number of nodes grows as a geometric series,
> > > nr_nodes (roughly) = (45^9 - 1) / (45 - 1) = 10e12
> > >
> > > i.e. 10e12 blocks > max ag blocks (2^32 == 4e9)
> > >
> > > However, with 1k block size we are not close to consuming all of 2^32
> > > AG blocks as shown by the below calculations,
> > >
> > > - rmapbt with maximum of 9 levels will have roughly (11^9 - 1) / (11 -
> > > 1) = 2e8 blocks.
> >
> > 2.35e8, which is a quarter of the max AG space, assuming minimum
> > capacity nodes. Assuming max capacity nodes, we're at
> > 3.9e10 blocks, which is almost 40x the number of blocks in the AG...
> >
> > > - 11 is the minimum number of recs in a non-leaf node with 1k block size.
> > > - Also, Total number of records (roughly) = (nr_leaves * 11) = 11^8 * 11
> > > = 2e9 (this value will be used later).
> > >
> > > - refcountbt
> > > - Maximum number of records theoretically = maximum number of blocks
> > > in an AG = 2^32
> >
> > The logic here is flawed - you're assuming exclusive use of the AG
> > to hold *both* rmapbt records and shared data extents. Every block
> > used for an rmap record cannot have a refcount record pointing at
> > it because per-ag metadata cannot be shared.
> >
> > > - Total (leaves and non-leaf nodes) blocks required to hold 2^32 records
> > > Leaf min recs = 20; Node min recs = 60 (with 1k as the block size).
> > > - Detailed calculation:
> > > nr-levels = 1; nr-leaf-blks = 2^32 / 20 = 2e8;
> > > nr-levels = 2; nr-blks = 2e8 / 60 = 4e6
> > > nr-levels = 3; nr-blks = 4e6 / 60 = 6e4
> > > nr-levels = 4; nr-blks = 6e4 / 60 = 1.0e3
> > > nr-levels = 5; nr-blks = 1.0e3 / 60 = 2e1
> > > nr-levels = 6; nr-blks = 1
> > >
> > > - Total block count = 2e8
> >
> > So, if we assume that 20% of the AG space is refcount btree records
> > like this, then the rmapbt only consumes 80% of the AG, then the
> > rmap btree is much closer to the 9 level limit, especially if
> > we consider we'll tend towards max capacity nodes as we fill the
> > tree, not min capacity nodes.
> >
> > But the reality is that we could have a single refcount record with
> > 2^32 references, and that requires 2^32 rmapbt records. So if we are
> > talking about really high extent refcount situations, the worst case
> > is a handful of single blocks with billions of references. IOWs, the
> > refcountbt is a single block with ~2^5 entries in it, each which
> > track 2^32 references. Now we require 2^37 rmapbt records, and we've
> > overflowed the maximum physical storage capacity of the AG for
> > rmapbt blocks.
>
> Thanks for the detailed explanation.
>
> >
> > > - Bmbt (assuming all the rmapbt records have the same inode as owner)
> > > - Total (leaves and non-leaf nodes) blocks required to hold 2e9 records
> > > Leaf min recs = 29; Node min recs = 29 (with 1k as the block size).
> > > (2e9 is the maximum rmapbt records with rmapbt height 9 and 1k block size).
> > > nr-levels = 1; nr-leaf-blks = 2e9 / 29 = 7e7
> > > nr-levels = 2; nr-blks = 7e7 / 29 = 2e6
> > > nr-levels = 3; nr-blks = 2e6 / 29 = 8e4
> > > nr-levels = 4; nr-blks = 8e4 / 29 = 3e3
> > > nr-levels = 5; nr-blks = 3e3 / 29 = 1e2
> > > nr-levels = 6; nr-blks = 1e2 / 29 = 3
> > > nr-levels = 7; nr-blks = 1
> > >
> > > - Total block count = 7e7
> > >
> > > Total blocks used across rmapbt, refcountbt and bmbt = 2e8 + 2e8 + 7e7 = 5e8.
> >
> > BMBT blocks are not bound to an AG the way refcount and rmapbt
> > records are. They can be anywhere in the filesystem, so they play no
> > part in calculations like these.
> >
> > > Since 5e8 < 4e9(i.e. 2^32), we have not run out of blocks trying
> > > to build a rmapbt with XFS_BTREE_MAXLEVELS (i.e 9) levels.
> > >
> > > Please let me know if my understanding is incorrect.
> >
> > The worst case theory is largely sound, but theory != reality.
> >
> > The actual reality of someone with billions of refcount entries to a
> > a handful of single extents in a single AG is using a 1kB block size
> > is likely to be so rare that we shouldn't consider it a critical
> > design point.
> >
> > That is, if the values cover the default 4kB block size just fine,
> > then we should not penalise every common configuration just to
> > handle an extreme case in an extremely rare corner case for a
> > non-default configuration.
>
> Ok.
>
> >
> > > I have come across a "log reservation" calculation issue when
> > > increasing XFS_BTREE_MAXLEVELS to 10 which is in turn required for
> > > extending data fork extent count to 48 bits.
> >
> > What is that issue?
>
> Sorry, I should have described it when sending the original mail.
>
> Increasing the value of XFS_BTREE_MAXLEVELS to 10 caused the value of
> mp->m_rmap_maxlevels to be set to 10 when reflink feature is enabled. This in
> turn increased "min log size" calculations and hence a modified kernel (one
> with XFS_BTREE_MAXLEVELS set to 10) will fail to mount a filesystem which was
> created by mkfs.xfs which used 9 as the value for XFS_BTREE_MAXLEVELS for its
> "min log size" calculations.
>
> Of course, this problem is no longer valid because ...
I disagree -- m_rmap_maxlevels should be set to the maximal height of an
rmap btree. On a reflink filesystem we (theoretically) can create
enough rmappings to make the rmapbt consume more than every block in the
AG. (In practice the reflink remap code will cut you off long before we
run out of space so this is unlikely unless every block is individually
mapped into files and are shared, or every extent is shared hundreds of
millions of times.
Therefore, the (asymptotic upper bound) on m_rmap_maxlevels for a
reflink+rmap filesystem ought to be the maximal height assuming the
rmapbt consumes every block in the AG.
We're probably stuck with xfs_rmapadd_space_res returning a value no
smaller than 9, though, because it affects the minimum log size and thus
affects the ondisk format. :(
> >
> > The BMBT modifications should not care about XFS_BTREE_MAXLEVELS as
> > a limit, we use the calculated mp->m_bm_maxlevels[] variables for
> > BMBT height. These are bound by the maximum number of extents the
> > tree can index, not the depth of btrees allowed within an AG.
> > Changing the size of the BMBT tree should not affect the per-ag
> > btrees in any way, nor the log reservations required to manipulate
> > the per-ag btrees...
>
> Ah, I didn't realize that XFS_BTREE_MAXLEVELS was applicable only to per-AG
> btrees. As mentioned in your response to Darrick, memory for "struct
> xfs_btree_cur" should be allocated based on the maximum height of the btree
> that the code is about to traverse.
>
> Thanks once again for providing an insight into this.
Yeah, before my eye problems flared up again I was stabilising a
patchset that gets rid of MAXLEVELS entirely.
> --
> chandan
>
>
>
^ permalink raw reply [flat|nested] 8+ messages in thread
end of thread, other threads:[~2020-12-03 21:56 UTC | newest]
Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2020-11-30 9:05 Maximum height of rmapbt when reflink feature is enabled Chandan Babu R
2020-11-30 19:26 ` Darrick J. Wong
2020-11-30 22:03 ` Dave Chinner
2020-12-01 13:12 ` Chandan Babu R
2020-12-01 16:22 ` Darrick J. Wong
2020-11-30 21:51 ` Dave Chinner
2020-12-01 13:12 ` Chandan Babu R
2020-12-03 21:55 ` Darrick J. Wong
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox