From: Dave Chinner <david@fromorbit.com>
To: Eric Sandeen <sandeen@sandeen.net>
Cc: xfs-oss <xfs@oss.sgi.com>
Subject: Re: Limits on agi->agi_level (and other btree depths?)
Date: Thu, 13 Feb 2014 12:03:08 +1100 [thread overview]
Message-ID: <20140213010308.GH13997@dastard> (raw)
In-Reply-To: <52FC09AB.3030209@sandeen.net>
On Wed, Feb 12, 2014 at 05:54:19PM -0600, Eric Sandeen wrote:
> If agi->agi_level exceeds XFS_BTREE_MAXLEVELS (8), bad things
> happen. For example in xfs_inobt_init_cursor() we read it
> directly off disk into a btree cursor:
>
> xfs_inobt_init_cursor()
> cur->bc_nlevels = be32_to_cpu(agi->agi_level);
>
> and then when it's time to tear it down we'll index into bc_bufs[]
> buy whatever it said:
>
> xfs_btree_del_cursor()
> for (i = 0; i < cur->bc_nlevels; i++) {
> if (cur->bc_bufs[i])
> xfs_trans_brelse(cur->bc_tp, cur->bc_bufs[i]);
>
> but bc_bufs[] in the xfs_btree_cur is of fixed size:
>
> struct xfs_buf *bc_bufs[XFS_BTREE_MAXLEVELS]; /* buf ptr per level */
>
> where
>
> #define XFS_BTREE_MAXLEVELS 8 /* max of all btrees */
>
> (which means this limits any btree depth, not just agi, right...)
>
> ...
> So I ran across this on an intentionally corrupted image, but I
> don't know what stops us from going past XFS_BTREE_MAXLEVELS in
> normal operations (unless we just hit filesystem limits before
> then?)
Right, we hit filesystem limits before we get deeper than 8 levels.
For an AGI btree, ptr/key pairs in a node use 8 bytes, while records
use 16 bytes. Hence worst case is a 512 byte blocksize filesystem
where we get roughly 30 records/leaf and 60 ptr/key pairs per node.
So, the number of extents we can reference at different levels are:
level number
0 30
1 60 * 30
2 60^2 * 30
....
n 60^n * 30
In 1TB AG, worst case freespace is alternate single block freespace,
so that's 1TB/blocksize/2 = (2^31*2^9) / 2^9 / 2^1 = 2^30 extent
records for a 512 byte blocksize filesystem.
2^30 : 1,073,741,824 records
n = 5 : 23,328,000,000 records
So the maximum possible number of levels for an AGI btree is 6 (5
node levels + leaf level). The AGF btrees are the same (32 bit
key/ptrs, 16 byte records)
The AGF freespace trees are more dense - the records are only 8
bytes so there's 60/leaf. It still needs 6 levels though.
For the extent btree (bmbt) it can index 54 bits of file offset, so
worst case is single block fragments so 2^54 extents. Records are 16
bytes, key and pointers are 8 bytes each. Hence 30/30 are the
numbers for a 512 byte block size fs. At level n, the extents are
30^n * 30 = 30^(n+1). So, solving 2^54 <= 30^(n+1) gives n = 11.
So in theory we could overflow XFS_BTREE_MAXLEVELS here, but in
practice this worst case requires 30^8 extents in memory, and that
requires this much RAM:
656,100,000,000 * sizeof(struct xfs_bmbt_irec) bytes
= 656,100,000,000 * 32 bytes
~= 19TiB
And requires reading in from disk 512 bytes at a time. Nothing in
XFS^WLinux scales to indexing 19TiB of extent metadata with any
efficiency in this manner. And let's face it, if you have a 300TiB
file in single 512 byte block fragments, you've got bigger problems.
The least of which being that you should be using a larger block
size...
Back in reality, if we take a 4k block size, the bmbt tree has a
240/240 breakdown, which means that the equation is actually 2^54 <=
240^(n+1), and in that case n = 6, so we don't overflow
XFS_BTREE_MAXLEVELS at all for the normal mkfs cases.
> i.e. xfs_btree_new_root() does:
>
> /* Set the root in the holding structure increasing the level by 1. */
> cur->bc_ops->set_root(cur, &lptr, 1);
>
> and ->set_root / xfs_inobt_set_root() will happily increase
> agi_level; I don't see anything limiting it to
> XFS_BTREE_MAXLEVELS.
Physical limits of the AGI due to the 1TB size of the AG.
> I guess XFS_BTREE_MAXLEVELS is just an arbitrary in-memory limit,
> not a limit of the underlying disk structures, but as it stands,
> we should be sure that we don't exceed it, right?
If you really want to enforce XFS_BTREE_MAXLEVELS, add checks
into xfs_alloc_compute_maxlevels(), xfs_ialloc_compute_maxlevels()
and xfs_bmap_compute_maxlevels() to constrain the limits in the
struct xfs_mount and validate the on-disk values based on the
values in the struct xfs_mount.
> I was going to put that limit into xfs_agi_verify, but realized
> that I wasn't sure if we could actually exceed that depth in
> normal operations.
>
> (cue dchinner working out that 9 levels is 59 bazillion jillion
> items, and will never be hit?)
Yep, done that ;)
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
next prev parent reply other threads:[~2014-02-13 1:03 UTC|newest]
Thread overview: 5+ messages / expand[flat|nested] mbox.gz Atom feed top
2014-02-12 23:54 Limits on agi->agi_level (and other btree depths?) Eric Sandeen
2014-02-13 1:01 ` Shaun Gosse
2014-02-13 1:03 ` Dave Chinner [this message]
2014-02-13 1:16 ` Shaun Gosse
2014-02-13 1:53 ` Eric Sandeen
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20140213010308.GH13997@dastard \
--to=david@fromorbit.com \
--cc=sandeen@sandeen.net \
--cc=xfs@oss.sgi.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.