From: Chandan Rajendra <chandan@linux.ibm.com>
To: Dave Chinner <david@fromorbit.com>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>,
Chandan Rajendra <chandanrlinux@gmail.com>,
linux-xfs@vger.kernel.org, bfoster@redhat.com
Subject: Re: [PATCH 2/2] xfs: Extend xattr extent counter to 32-bits
Date: Sat, 25 Apr 2020 17:37:39 +0530
Message-ID: <2457302.TnqmriUJk8@localhost.localdomain>
In-Reply-To: <20200422223041.GE27860@dread.disaster.area>
On Thursday, April 23, 2020 4:00 AM Dave Chinner wrote:
> On Wed, Apr 22, 2020 at 03:08:00PM +0530, Chandan Rajendra wrote:
> > On Monday, April 20, 2020 10:08 AM Chandan Rajendra wrote:
> > > On Tuesday, April 14, 2020 12:25 AM Darrick J. Wong wrote:
> > > > That said, it was very helpful to point out that the current MAXEXTNUM /
> > > > MAXAEXTNUM symbols stop short of using all 32 (or 16) bits.
> > > >
> > > > Can we use this new feature flag + inode flag to allow 4294967295
> > > > extents in either fork?
> > >
> > > Sure.
> > >
> > > I have already tested that having 4294967295 as the maximum data extent count
> > > does not cause any regressions.
> > >
> > > Also, Dave was of the opinion that data extent counter be increased to
> > > 64-bit. I think I should include that change along with this feature flag
> > > rather than adding a new one in the near future.
> > >
> > >
> >
> > Hello Dave & Darrick,
> >
> > Can you please look into the following design decision w.r.t using 32-bit and
> > 64-bit unsigned counters for xattr and data extents.
> >
> > Maximum extent counts.
> > |-----------------------+----------------------|
> > | Field width (in bits) | Max extents |
> > |-----------------------+----------------------|
> > | 32 | 4294967295 |
> > | 48 | 281474976710655 |
> > | 64 | 18446744073709551615 |
> > |-----------------------+----------------------|
>
> These huge numbers are impossible to compare visually. Once numbers
> go beyond 7-9 digits, you need to start condensing them in reports.
> Humans are, in general, unable to handle strings of digits longer
> than 7-9 digits at all well...
>
> Can you condense them by using scientific representation i.e. XEy,
> which gives:
>
> |-----------------------+-------------|
> | Field width (in bits) | Max extents |
> |-----------------------+-------------|
> | 32 | 4.3E09 |
> | 48 | 2.8E14 |
> | 64 | 1.8E19 |
> |-----------------------+-------------|
>
> It's much easier to compare differences visually because it's now
> only 4 digits, not 20. The other alternative is to use k,m,g,t,p,e
> suffixes to indicate magnitude (4.3g, 280t, 18e), but using
> exponentials makes the numbers easier to do calculations on
> directly...
>
Sorry about that. I will use scientific notation for representing large
numbers.
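For reference, a small C sketch of how the condensed values could be produced. The helper names max_count_for_width() and format_sci() are made up for illustration and are not kernel functions. Note that printf's %E prints the exponent with an explicit sign (e.g. 4.3E+09), which differs slightly from the XEy form used in the table above.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Largest value an unsigned counter of the given width (1..64 bits) can hold. */
static uint64_t max_count_for_width(unsigned int bits)
{
	assert(bits >= 1 && bits <= 64);
	return bits == 64 ? UINT64_MAX : (UINT64_C(1) << bits) - 1;
}

/*
 * Format a count with two significant digits in scientific notation.
 * The conversion to double loses precision for large counts, which is
 * acceptable here because only two digits are printed.
 */
static int format_sci(uint64_t count, char *buf, size_t len)
{
	return snprintf(buf, len, "%.1E", (double)count);
}
```

Calling format_sci(max_count_for_width(48), buf, sizeof(buf)) fills buf with the two-digit form of the 48-bit limit.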
> > |-------------------+-----|
> > | Minimum node recs | 125 |
> > | Minimum leaf recs | 125 |
> > |-------------------+-----|
>
Yes, your assumption of 4k block size is correct. I will include detailed
calculation steps in my future mails.
> Please show your working. I'm assuming this is 50% * 4kB /
> sizeof(bmbt_rec), so you are working out limits based on 4kB block
> size? Realistically, worst case behaviour will be with the minimum
> supported block size, which in this case will be 1kB....
>
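A minimal sketch of the working behind the 125 figure, assuming the minimum is half of the records that fit in a block after its header. The helper name bmbt_min_recs() is hypothetical, and the 72-byte header and 16-byte record sizes used in the note below are assumptions about a v5 filesystem, not numbers stated in this thread.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Minimum records per bmbt block, taken as half of the maximum that fits.
 * hdr_size and rec_size are plain inputs here, not kernel constants.
 */
static size_t bmbt_min_recs(size_t block_size, size_t hdr_size, size_t rec_size)
{
	size_t max_recs;

	assert(rec_size > 0 && block_size > hdr_size);
	max_recs = (block_size - hdr_size) / rec_size;
	return max_recs / 2;
}
```

With those assumed sizes, bmbt_min_recs(4096, 72, 16) gives 125, and bmbt_min_recs(1024, 72, 16) gives 29 for the 1kB worst case.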
> > Data bmbt tree height (MINDBTPTRS == 3)
> > |-------+------------------------+-------------------------|
> > | Level | Number of nodes/leaves | Total Nr recs |
> > | | | (nr nodes/leaves * 125) |
> > |-------+------------------------+-------------------------|
> > | 0 | 1 | 3 |
> > | 1 | 3 | 375 |
> > | 2 | 375 | 46875 |
> > | 3 | 46875 | 5859375 |
> > | 4 | 5859375 | 732421875 |
> > | 5 | 732421875 | 91552734375 |
> > | 6 | 91552734375 | 11444091796875 |
> > | 7 | 11444091796875 | 1430511474609375 |
> > | 8 | 1430511474609375 | 178813934326171875 |
> > | 9 | 178813934326171875 | 22351741790771484375 |
> > |-------+------------------------+-------------------------|
> >
> > For counting data extents, even though we theoretically have 64 bits at our
> > disposal, I think we should have (2 ** 48) - 1 as the maximum number of
> > extents. This gives 281474976710655 (i.e. ~281 trillion extents). With this,
> > bmbt tree's height grows by just two more levels (i.e. it grows from the
> > current maximum height of 5 to 7). Please let me know your opinion on this.
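The table above can be reproduced with a short loop. This is only a sketch under the same assumptions as the table (the root holds MINDBTPTRS pointers and every other block holds the minimum of 125 entries); bmbt_level_for_extents() is a hypothetical helper, not an existing XFS function.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Lowest level (root = 0) at which a tree whose root holds root_ptrs
 * pointers, and whose other blocks each hold min_recs entries, can
 * address at least nr_extents records.
 */
static unsigned int bmbt_level_for_extents(uint64_t root_ptrs,
					   uint64_t min_recs,
					   uint64_t nr_extents)
{
	uint64_t capacity = root_ptrs;
	unsigned int level = 0;

	assert(root_ptrs > 0 && min_recs > 1);
	while (capacity < nr_extents) {
		level++;
		/* The next level would exceed 64 bits, so it covers any count. */
		if (capacity > UINT64_MAX / min_recs)
			break;
		capacity *= min_recs;
	}
	return level;
}
```

bmbt_level_for_extents(3, 125, (1ULL << 48) - 1) returns 7. Assuming the current MAXEXTNUM of 2^31 - 1, the same call returns 5, which matches the heights quoted above.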
>
> We shouldn't make up arbitrary limits when we can calculate them exactly.
> i.e. 2^63 max file size, 1kB block size (2^10), means max fragments
> is 2^53 entries. On a 64kB block size (2^16), we have a max extent
> count of 2^47....
>
> i.e. 2^48 would be an acceptable limit for 1kB block size, but it is
> not correct for 64kB block size filesystems....
You are right about this. I will set the max data extent count to 2^47.
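The limit calculation above, written out as a sketch. It assumes the worst case is a file in which every block is its own extent; the helper names max_extent_count_bits() and max_extent_count() are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * log2 of the worst-case extent count: a file of 2^file_size_bits bytes
 * fragmented into single-block extents of 2^block_size_bits bytes each.
 */
static unsigned int max_extent_count_bits(unsigned int file_size_bits,
					  unsigned int block_size_bits)
{
	assert(block_size_bits <= file_size_bits);
	return file_size_bits - block_size_bits;
}

/* The same limit as an absolute count; the result must fit in 64 bits. */
static uint64_t max_extent_count(unsigned int file_size_bits,
				 unsigned int block_size_bits)
{
	unsigned int bits = max_extent_count_bits(file_size_bits, block_size_bits);

	assert(bits < 64);
	return UINT64_C(1) << bits;
}
```

max_extent_count_bits(63, 10) gives 53 for 1kB blocks and max_extent_count_bits(63, 16) gives 47 for 64kB blocks.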
>
> > Attr bmbt tree height (MINABTPTRS == 2)
> > |-------+------------------------+-------------------------|
> > | Level | Number of nodes/leaves | Total Nr recs |
> > | | | (nr nodes/leaves * 125) |
> > |-------+------------------------+-------------------------|
> > | 0 | 1 | 2 |
> > | 1 | 2 | 250 |
> > | 2 | 250 | 31250 |
> > | 3 | 31250 | 3906250 |
> > | 4 | 3906250 | 488281250 |
> > | 5 | 488281250 | 61035156250 |
> > |-------+------------------------+-------------------------|
> >
> > For xattr extents, (2 ** 32) - 1 = 4294967295 (~ 4 billion extents). So this
> > will cause the corresponding bmbt's maximum height to go from 3 to 5.
> > This probably won't cause any regression.
>
> We already have the XFS_DA_NODE_MAXDEPTH set to 5, so changing the
> attr fork extent count makes no difference to the attribute fork
> bmbt reservations. i.e. the bmbt reservations are defined by the
> dabtree structure limits, not the maximum extent count the fork can
> hold.
I think the dabtree structure limits are a consequence of the following ...

How many levels of dabtree would be needed to hold ~100 million xattrs?

- Name len = 16 bytes

    struct xfs_parent_name_rec {
            __be64 p_ino;
            __be32 p_gen;
            __be32 p_diroffset;
    };

  i.e. 64 + 32 + 32 = 128 bits = 16 bytes.

- Value len = file name length = assume ~40 bytes

- Formula for the number of node entries (column 3 in the table given
  below) at any level of the dabtree:

    nr_blocks * ((block size - sizeof(struct xfs_da3_node_hdr)) /
                 sizeof(struct xfs_da_node_entry))

  i.e. nr_blocks * ((block size - 64) / 8)

- Formula for the number of leaf entries (column 4 in the table given
  below):

    nr_blocks * ((block size - sizeof(xfs_attr_leaf_hdr_t)) /
                 (sizeof(xfs_attr_leaf_entry_t) + valuelen + namelen + nameval))

  i.e. nr_blocks * ((block size - 32) / (8 + 2 + 1 + 16 + 40))

  where 8 is the leaf entry, 2 and 1 are the valuelen and namelen fields,
  and 16 + 40 is the name plus the value.

Here I have assumed block size to be 4k.
|-------+------------------+--------------------------+--------------------------|
| Level | Number of blocks | Number of entries (node) | Number of entries (leaf) |
|-------+------------------+--------------------------+--------------------------|
| 0 | 1.0 | 5e2 | 6.1e1 |
| 1 | 5e2 | 2.5e5 | 3.0e4 |
| 2 | 2.5e5 | 1.3e8 | 1.5e7 |
| 3 | 1.3e8 | 6.6e10 | 7.9e9 |
|-------+------------------+--------------------------+--------------------------|
Level 2 holds only ~1.5e7 leaf entries, which is short of 1e8, while level 3
holds ~7.9e9. Hence we would need a tree of height 3.
Total number of blocks = 1 + 5e2 + 2.5e5 + 1.3e8 = ~1.3e8
... which is < 2^32 (4.3e9)
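The same estimate as a rough C sketch. It uses integer division, so the per-block counts (504 node entries, 60 leaf entries) are slightly lower than the rounded figures in the table. struct dabtree_geom and dabtree_level_for_attrs() are hypothetical names, and the byte sizes are the ones assumed in the formulas above, passed in as plain inputs.

```c
#include <assert.h>
#include <stdint.h>

/* Sizes in bytes; all values are inputs, not kernel constants. */
struct dabtree_geom {
	uint64_t block_size;
	uint64_t node_hdr;	/* node block header */
	uint64_t node_entry;	/* one node entry */
	uint64_t leaf_hdr;	/* leaf block header */
	uint64_t leaf_entry;	/* one leaf entry including name and value */
};

/* Lowest level (root = 0) whose blocks, used as leaves, hold nr_attrs entries. */
static unsigned int dabtree_level_for_attrs(const struct dabtree_geom *g,
					    uint64_t nr_attrs)
{
	uint64_t fanout = (g->block_size - g->node_hdr) / g->node_entry;
	uint64_t per_leaf = (g->block_size - g->leaf_hdr) / g->leaf_entry;
	uint64_t nr_blocks = 1;
	unsigned int level = 0;

	assert(fanout > 1 && per_leaf > 0);
	while (nr_blocks * per_leaf < nr_attrs) {
		level++;
		/* Stop before the capacity computation could overflow 64 bits. */
		if (nr_blocks > UINT64_MAX / (fanout * per_leaf))
			break;
		nr_blocks *= fanout;
	}
	return level;
}
```

With a geometry of { 4096, 64, 8, 32, 67 }, a count of 100000000 attributes yields level 3, in line with the table.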
>
> The data fork to 64 bits has no impact on the directory
> reservations, either, because the number of extents in the directory
> is bound by the directory segment size of 32GB. i.e. a directory can
> hold, at most, 32GB of dirent data, which means there's a hard limit
> on the number of dabtree entries somewhere in the order of a few
> hundred million. That's where XFS_DA_NODE_MAXDEPTH comes from - it's
> large enough to index a max sized directory, and the BMBT overhead
> is derived from that...
Ok. Thanks for explaining that.
>
> > Meanwhile, I will work on finding the impact of increasing the
> > height of these two trees on log reservation.
>
> It should not change it substantially - 2 blocks per bmbt
> reservation per transaction is what I'd expect from the numbers
> presented...
I haven't got to this task yet; I will respond soon. I spent the time
figuring out how directories are organized in XFS and arriving at the
above calculations for the xattr extent counter.
--
chandan