From: Chandan Babu R <chandanrlinux@gmail.com>
To: Dave Chinner <david@fromorbit.com>
Cc: linux-xfs@vger.kernel.org, darrick.wong@oracle.com
Subject: Re: Maximum height of rmapbt when reflink feature is enabled
Date: Tue, 01 Dec 2020 18:42:34 +0530 [thread overview]
Message-ID: <7691390.z3D37GG2ZM@garuda> (raw)
In-Reply-To: <20201130215114.GH2842436@dread.disaster.area>
On Tue, 01 Dec 2020 08:51:14 +1100, Dave Chinner wrote:
> On Mon, Nov 30, 2020 at 02:35:21PM +0530, Chandan Babu R wrote:
> > The comment in xfs_rmapbt_compute_maxlevels() mentions that with
> > reflink enabled, XFS will run out of AG blocks before reaching maximum
> > levels of XFS_BTREE_MAXLEVELS (i.e. 9). This is easy to prove for 4k
> > block size case:
> >
> > Considering theoretical limits, maximum height of rmapbt can be,
> > max btree height = Log_(min_recs)(total recs)
> > max_rmapbt_height = Log_45(2^64) = 12.
>
> I think the use of mirecs here is wrong. We can continue to fill
> both leaves and nodes above minrecs once we get to maximal height.
> When a leaf/node is full, it will attempt to shift records left and
> right to sibling nodes before trying to split. Hence a split only
> occurs when all three nodes are completely full.
>
> Hence we'll end up with a much higher average population of leaves
> and nodes than the minimum when we are at maximum height, especially
> at the upper levels of the btree near the root.
>
> i.e. we won't try to split the root ever until the root is at
> maximum capacity, and we won't try to split the next level down
> (before we get to a root split) until at least 3 adjacent nodes
> are at max capacity, and so on. Hence at higher levels, the tree is
> always going to tend towards max capacity nodes, not min capacity.
> That changes these calculations quite a lot...
>
> > Detailed calculation:
> > nr-levels = 1; nr-leaf-blks = 2^64 / 84 = 2e17;
> > nr-levels = 2; nr-blks = 2e17 / 45 = 5e15;
> > nr-levels = 3; nr-blks = 5e15 / 45 = 1e14;
> > nr-levels = 4; nr-blks = 1e14 / 45 = 2e12;
> > nr-levels = 5; nr-blks = 2e12 / 45 = 5e10;
> > nr-levels = 6; nr-blks = 5e10 / 45 = 1e9;
> > nr-levels = 7; nr-blks = 1e9 / 45 = 3e7;
> > nr-levels = 8; nr-blks = 3e7 / 45 = 6e5;
> > nr-levels = 9; nr-blks = 6e5 / 45 = 1e4;
> > nr-levels = 10; nr-blks = 1e4 / 45 = 3e2;
> > nr-levels = 11; nr-blks = 3e2 / 45 = 6;
> > nr-levels = 12; nr-blks = 1;
> >
> > Total number of blocks = 2e17
> >
> > Here, 84 is the minimum number of leaf records and 45 is the minimum
> > number of node records in the rmapbt when using 4k block size. 2^64 is
> > the maximum possible rmapbt records
> > (i.e. max_rmap_entries_per_disk_block (2^32) * max_nr_agblocks
> > (2^32)).
> >
> > i.e. theoretically rmapbt height can go upto 12.
>
> Yes, but if the rmapbt contains 2^64 records, how many physical disk
> blocks does it consume itself? Worst case that's 2^64 / 45, so
> somewhere between 2^58 and 2^59 blocks of storage required for
> indexing all those reocrds with a 4kB block size?
>
> Yet we only have 2^28 4kB blocks in an AG, and the limit of a
> rmapbt is actually what can physically fit in an AG, yes?
>
> So, regardless of what the theoretical record capacity of a worse
> case rmapbt record fill might be, it hits physical record storage
> limits long before that. i.e. the limit on the btree height is
> physical storage, not theoretical record indexing capability.
>
> In which case, let us [incorrectly, see later] assume we have a
> handful of maximally shared single data blocks and the rest of the
> 1TB AG is rmapbt. i.e. the maximum depth of the rmapbt is
> determined by how much physical space it can consume, which is very
> close to 2^32 blocks for 1kB filesystems.
>
> Let's work from max capacity, so 2^32 * 21 is the max record holding
> capacity of the per-ag rmapbt. Hence the worst case index height
> requirement for the rmapbt is indexing ~2^37 records.
>
> nr-levels = 1; nr-leaf-blks = 2^32 * 21 / 21 = 4e9
> nr-levels = 2; nr-blks = 4e9 / 21 = 2e8;
> nr-levels = 3; nr-blks = 2e8 / 21 = 9e6;
> nr-levels = 4; nr-blks = 9e6/ 21 = 4e5;
> nr-levels = 5; nr-blks = 4e5 / 21 = 2e4;
> nr-levels = 6; nr-blks = 2e4 / 21 = 1e3;
> nr-levels = 7; nr-blks = 1e3 / 21 = 50
> nr-levels = 8; nr-blks = 50 / 21 = 3;
> nr-levels = 9; nr-blks = 3 / 21 = 1;
> nr-levels = 10; nr-blks = 1;
>
> So we *just* tick over into a 10 level rmapbt here in this worse
> case where the rmapbt takes *all* of the AG space. In comparison,
> min capacity lower nodes, max capacity higher nodes:
>
> nr-levels = 1; nr-leaf-blks = 2^32 * 11 / 11 = 4e9
> nr-levels = 2; nr-blks = 4e9 / 11 = 4e8;
> nr-levels = 3; nr-blks = 4e8 / 11 = 4e7;
> nr-levels = 4; nr-blks = 4e7/ 11 = 3e6;
> nr-levels = 5; nr-blks = 3e6 / 21 = 2e5;
> nr-levels = 6; nr-blks = 2e5 / 21 = 7e3;
> nr-levels = 7; nr-blks = 7e3 / 21 = 348;
> nr-levels = 8; nr-blks = 348 / 21 = 16;
> nr-levels = 9; nr-blks = 16 / 21 = 1;
> nr-levels = 10; nr-blks = 1;
>
> Same, it's a 10 level tree.
>
> > But as the comment in xfs_rmapbt_compute_maxlevels() suggests, we will
> > run out of per-ag blocks trying to build an rmapbt of height
> > XFS_BTREE_MAXLEVELS (i.e. 9).
> >
> > Since number of nodes grows as a geometric series,
> > nr_nodes (roughly) = (45^9 - 1) / (45 - 1) = 10e12
> >
> > i.e. 10e12 blocks > max ag blocks (2^32 == 4e9)
> >
> > However, with 1k block size we are not close to consuming all of 2^32
> > AG blocks as shown by the below calculations,
> >
> > - rmapbt with maximum of 9 levels will have roughly (11^9 - 1) / (11 -
> > 1) = 2e8 blocks.
>
> 2.35e8, which is a quarter of the max AG space, assuming minimum
> capacity nodes. Assuming max capacity nodes, we're at
> 3.9e10 blocks, which is almost 40x the number of blocks in the AG...
>
> > - 11 is the minimum number of recs in a non-leaf node with 1k block size.
> > - Also, Total number of records (roughly) = (nr_leaves * 11) = 11^8 * 11
> > = 2e9 (this value will be used later).
> >
> > - refcountbt
> > - Maximum number of records theoretically = maximum number of blocks
> > in an AG = 2^32
>
> The logic here is flawed - you're assuming exclusive use of the AG
> to hold *both* rmapbt records and shared data extents. Every block
> used for an rmap record cannot have a refcount record pointing at
> it because per-ag metadata cannot be shared.
>
> > - Total (leaves and non-leaf nodes) blocks required to hold 2^32 records
> > Leaf min recs = 20; Node min recs = 60 (with 1k as the block size).
> > - Detailed calculation:
> > nr-levels = 1; nr-leaf-blks = 2^32 / 20 = 2e8;
> > nr-levels = 2; nr-blks = 2e8 / 60 = 4e6
> > nr-levels = 3; nr-blks = 4e6 / 60 = 6e4
> > nr-levels = 4; nr-blks = 6e4 / 60 = 1.0e3
> > nr-levels = 5; nr-blks = 1.0e3 / 60 = 2e1
> > nr-levels = 6; nr-blks = 1
> >
> > - Total block count = 2e8
>
> So, if we assume that 20% of the AG space is refcount btree records
> like this, then the rmapbt only consumes 80% of the AG, then the
> rmap btree is much closer to the 9 level limit, especially if
> we consider we'll tend towards max capacity nodes as we fill the
> tree, not min capacity nodes.
>
> But the reality is that we could have a single refcount record with
> 2^32 references, and that requires 2^32 rmapbt records. So if we are
> talking about really high extent refcount situations, the worst case
> is a handful of single blocks with billions of references. IOWs, the
> refcountbt is a single block with ~2^5 entries in it, each which
> track 2^32 references. Now we require 2^37 rmapbt records, and we've
> overflowed the maximum physical storage capacity of the AG for
> rmapbt blocks.
Thanks for the detailed explanation.
>
> > - Bmbt (assuming all the rmapbt records have the same inode as owner)
> > - Total (leaves and non-leaf nodes) blocks required to hold 2e9 records
> > Leaf min recs = 29; Node min recs = 29 (with 1k as the block size).
> > (2e9 is the maximum rmapbt records with rmapbt height 9 and 1k block size).
> > nr-levels = 1; nr-leaf-blks = 2e9 / 29 = 7e7
> > nr-levels = 2; nr-blks = 7e7 / 29 = 2e6
> > nr-levels = 3; nr-blks = 2e6 / 29 = 8e4
> > nr-levels = 4; nr-blks = 8e4 / 29 = 3e3
> > nr-levels = 5; nr-blks = 3e3 / 29 = 1e2
> > nr-levels = 6; nr-blks = 1e2 / 29 = 3
> > nr-levels = 7; nr-blks = 1
> >
> > - Total block count = 7e7
> >
> > Total blocks used across rmapbt, refcountbt and bmbt = 2e8 + 2e8 + 7e7 = 5e8.
>
> BMBT blocks are not bound to an AG the way refcount and rmapbt
> records are. They can be anywhere in the filesystem, so they play no
> part in calculations like these.
>
> > Since 5e8 < 4e9(i.e. 2^32), we have not run out of blocks trying
> > to build a rmapbt with XFS_BTREE_MAXLEVELS (i.e 9) levels.
> >
> > Please let me know if my understanding is incorrect.
>
> The worst case theory is largely sound, but theory != reality.
>
> The actual reality of someone with billions of refcount entries to a
> a handful of single extents in a single AG is using a 1kB block size
> is likely to be so rare that we shouldn't consider it a critical
> design point.
>
> That is, if the values cover the default 4kB block size just fine,
> then we should not penalise every common configuration just to
> handle an extreme case in an extremely rare corner case for a
> non-default configuration.
Ok.
>
> > I have come across a "log reservation" calculation issue when
> > increasing XFS_BTREE_MAXLEVELS to 10 which is in turn required for
> > extending data fork extent count to 48 bits.
>
> What is that issue?
Sorry, I should have described it when sending the original mail.
Increasing the value of XFS_BTREE_MAXLEVELS to 10 caused the value of
mp->m_rmap_maxlevels to be set to 10 when reflink feature is enabled. This in
turn increased "min log size" calculations and hence a modified kernel (one
with XFS_BTREE_MAXLEVELS set to 10) will fail to mount a filesystem which was
created by mkfs.xfs which used 9 as the value for XFS_BTREE_MAXLEVELS for its
"min log size" calculations.
Of course, this problem is no longer valid because ...
>
> The BMBT modifications should not care about XFS_BTREE_MAXLEVELS as
> a limit, we use the calculated mp->m_bm_maxlevels[] variables for
> BMBT height. These are bound by the maximum number of extents the
> tree can index, not the depth of btrees allowed within an AG.
> Changing the size of the BMBT tree should not affect the per-ag
> btrees in any way, nor the log reservations required to manipulate
> the per-ag btrees...
Ah, I didn't realize that XFS_BTREE_MAXLEVELS was applicable only to per-AG
btrees. As mentioned in your response to Darrick, memory for "struct
xfs_btree_cur" should be allocated based on the maximum height of the btree
that the code is about to traverse.
Thanks once again for providing an insight into this.
--
chandan
next prev parent reply other threads:[~2020-12-01 13:13 UTC|newest]
Thread overview: 8+ messages / expand[flat|nested] mbox.gz Atom feed top
2020-11-30 9:05 Maximum height of rmapbt when reflink feature is enabled Chandan Babu R
2020-11-30 19:26 ` Darrick J. Wong
2020-11-30 22:03 ` Dave Chinner
2020-12-01 13:12 ` Chandan Babu R
2020-12-01 16:22 ` Darrick J. Wong
2020-11-30 21:51 ` Dave Chinner
2020-12-01 13:12 ` Chandan Babu R [this message]
2020-12-03 21:55 ` Darrick J. Wong
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=7691390.z3D37GG2ZM@garuda \
--to=chandanrlinux@gmail.com \
--cc=darrick.wong@oracle.com \
--cc=david@fromorbit.com \
--cc=linux-xfs@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox