From: Dave Chinner <david@fromorbit.com>
To: Eric Sandeen <sandeen@sandeen.net>
Cc: xfs-oss <xfs@oss.sgi.com>
Subject: Re: Limits on agi->agi_level (and other btree depths?)
Date: Thu, 13 Feb 2014 12:03:08 +1100 [thread overview]
Message-ID: <20140213010308.GH13997@dastard> (raw)
In-Reply-To: <52FC09AB.3030209@sandeen.net>
On Wed, Feb 12, 2014 at 05:54:19PM -0600, Eric Sandeen wrote:
> If agi->agi_level exceeds XFS_BTREE_MAXLEVELS (8), bad things
> happen. For example in xfs_inobt_init_cursor() we read it
> directly off disk into a btree cursor:
>
> xfs_inobt_init_cursor()
> cur->bc_nlevels = be32_to_cpu(agi->agi_level);
>
> and then when it's time to tear it down we'll index into bc_bufs[]
> buy whatever it said:
>
> xfs_btree_del_cursor()
> for (i = 0; i < cur->bc_nlevels; i++) {
> if (cur->bc_bufs[i])
> xfs_trans_brelse(cur->bc_tp, cur->bc_bufs[i]);
>
> but bc_bufs[] in the xfs_btree_cur is of fixed size:
>
> struct xfs_buf *bc_bufs[XFS_BTREE_MAXLEVELS]; /* buf ptr per level */
>
> where
>
> #define XFS_BTREE_MAXLEVELS 8 /* max of all btrees */
>
> (which means this limits any btree depth, not just agi, right...)
>
> ...
> So I ran across this on an intentionally corrupted image, but I
> don't know what stops us from going past XFS_BTREE_MAXLEVELS in
> normal operations (unless we just hit filesystem limits before
> then?)
Right, we hit filesystem limits before we get deeper than 8 levels.
For an AGI btree, ptr/key pairs in a node use 8 bytes, while records
use 16 bytes. Hence worst case is a 512 byte blocksize filesystem
where we get roughly 30 records/leaf and 60 ptr/key pairs per node.
So, the number of extents we can reference at different levels are:
level number
0 30
1 60 * 30
2 60^2 * 30
....
n 60^n * 30
In 1TB AG, worst case freespace is alternate single block freespace,
so that's 1TB/blocksize/2 = (2^31*2^9) / 2^9 / 2^1 = 2^30 extent
records for a 512 byte blocksize filesystem.
2^30 : 1,073,741,824 records
n = 5 : 23,328,000,000 records
So the maximum possible number of levels for an AGI btree is 6 (5
node levels + leaf level). The AGF btrees are the same (32 bit
key/ptrs, 16 byte records)
The AGF freespace trees are more dense - the records are only 8
bytes so there's 60/leaf. It still needs 6 levels though.
For the extent btree (bmbt) it can index 54 bits of file offset, so
worst case is single block fragments so 2^54 extents. Records are 16
bytes, key and pointers are 8 bytes each. Hence 30/30 are the
numbers for a 512 byte block size fs. At level n, the extents are
30^n * 30 = 30^(n+1). So, solving 2^54 <= 30^(n+1) gives n = 11.
So in theory we could overflow XFS_BTREE_MAXLEVELS here, but in
practice this worst case requires 30^8 extents in memory, and that
requires this much RAM:
656,100,000,000 * sizeof(struct xfs_bmbt_irec) bytes
= 656,100,000,000 * 32 bytes
~= 19TiB
And requires reading in from disk 512 bytes at a time. Nothing in
XFS^WLinux scales to indexing 19TiB of extent metadata with any
efficiency in this manner. And let's face it, if you have a 300TiB
file in single 512 byte block fragments, you've got bigger problems.
The least of which being that you should be using a larger block
size...
Back in reality, if we take a 4k block size, the bmbt tree has a
240/240 breakdown, which means that the equation is actually 2^54 <=
240^(n+1), and in that case n = 6, so we don't overflow
XFS_BTREE_MAXLEVELS at all for the normal mkfs cases.
> i.e. xfs_btree_new_root() does:
>
> /* Set the root in the holding structure increasing the level by 1. */
> cur->bc_ops->set_root(cur, &lptr, 1);
>
> and ->set_root / xfs_inobt_set_root() will happily increase
> agi_level; I don't see anything limiting it to
> XFS_BTREE_MAXLEVELS.
Physical limits of the AGI due to the 1TB size of the AG.
> I guess XFS_BTREE_MAXLEVELS is just an arbitrary in-memory limit,
> not a limit of the underlying disk structures, but as it stands,
> we should be sure that we don't exceed it, right?
If you really want to enforce XFS_BTREE_MAXLEVELS, add checks
into xfs_alloc_compute_maxlevels(), xfs_ialloc_compute_maxlevels()
and xfs_bmap_compute_maxlevels() to constrain the limits in the
struct xfs_mount and validate the on-disk values based on the
values in the struct xfs_mount.
> I was going to put that limit into xfs_agi_verify, but realized
> that I wasn't sure if we could actually exceed that depth in
> normal operations.
>
> (cue dchinner working out that 9 levels is 59 bazillion jillion
> items, and will never be hit?)
Yep, done that ;)
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
next prev parent reply other threads:[~2014-02-13 1:03 UTC|newest]
Thread overview: 5+ messages / expand[flat|nested] mbox.gz Atom feed top
2014-02-12 23:54 Limits on agi->agi_level (and other btree depths?) Eric Sandeen
2014-02-13 1:01 ` Shaun Gosse
2014-02-13 1:03 ` Dave Chinner [this message]
2014-02-13 1:16 ` Shaun Gosse
2014-02-13 1:53 ` Eric Sandeen
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20140213010308.GH13997@dastard \
--to=david@fromorbit.com \
--cc=sandeen@sandeen.net \
--cc=xfs@oss.sgi.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox