From: "Darrick J. Wong" <darrick.wong@oracle.com>
To: Dave Chinner <david@fromorbit.com>
Cc: Chandan Rajendra <chandanrlinux@gmail.com>,
linux-xfs@vger.kernel.org, chandan@linux.ibm.com,
bfoster@redhat.com
Subject: Re: [PATCH 2/2] xfs: Extend xattr extent counter to 32-bits
Date: Wed, 8 Apr 2020 08:45:12 -0700 [thread overview]
Message-ID: <20200408154512.GA6741@magnolia> (raw)
In-Reply-To: <20200406233002.GD21885@dread.disaster.area>
On Tue, Apr 07, 2020 at 09:30:02AM +1000, Dave Chinner wrote:
> On Mon, Apr 06, 2020 at 10:06:03AM -0700, Darrick J. Wong wrote:
> > On Sat, Apr 04, 2020 at 02:22:03PM +0530, Chandan Rajendra wrote:
> > > XFS has a per-inode xattr extent counter which is 16 bits wide. A workload
> > > which
> > > 1. Creates 1,000,000 255-byte sized xattrs,
> > > 2. Deletes 50% of these xattrs in an alternating manner,
> > > 3. Tries to create 400,000 new 255-byte sized xattrs
> > > causes the following message to be printed on the console,
> > >
> > > XFS (loop0): xfs_iflush_int: detected corrupt incore inode 131, total extents = -19916, nblocks = 102937, ptr ffff9ce33b098c00
> > > XFS (loop0): xfs_do_force_shutdown(0x8) called from line 3739 of file fs/xfs/xfs_inode.c. Return address = ffffffffa4a94173
> > >
> > > This indicates that we overflowed the 16-bits wide xattr extent counter.
> > >
> > > I have been informed that there are instances where a single file has
> > > > 100 million hardlinks. With parent pointers being stored in xattr,
> > > we will overflow the 16-bits wide xattr extent counter when large
> > > number of hardlinks are created.
> > >
> > > Hence this commit extends xattr extent counter to 32-bits. It also introduces
> > > an incompat flag to prevent older kernels from mounting newer filesystems with
> > > 32-bit wide xattr extent counter.
> > >
> > > Signed-off-by: Chandan Rajendra <chandanrlinux@gmail.com>
> > > ---
> > > fs/xfs/libxfs/xfs_format.h | 28 +++++++++++++++++++++-------
> > > fs/xfs/libxfs/xfs_inode_buf.c | 27 +++++++++++++++++++--------
> > > fs/xfs/libxfs/xfs_inode_fork.c | 3 ++-
> > > fs/xfs/libxfs/xfs_log_format.h | 5 +++--
> > > fs/xfs/libxfs/xfs_types.h | 4 ++--
> > > fs/xfs/scrub/inode.c | 7 ++++---
> > > fs/xfs/xfs_inode_item.c | 3 ++-
> > > fs/xfs/xfs_log_recover.c | 13 ++++++++++---
> > > 8 files changed, 63 insertions(+), 27 deletions(-)
> > >
> > > diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
> > > index 045556e78ee2c..0a4266b0d46e1 100644
> > > --- a/fs/xfs/libxfs/xfs_format.h
> > > +++ b/fs/xfs/libxfs/xfs_format.h
> > > @@ -465,10 +465,12 @@ xfs_sb_has_ro_compat_feature(
> > > #define XFS_SB_FEAT_INCOMPAT_FTYPE (1 << 0) /* filetype in dirent */
> > > #define XFS_SB_FEAT_INCOMPAT_SPINODES (1 << 1) /* sparse inode chunks */
> > > #define XFS_SB_FEAT_INCOMPAT_META_UUID (1 << 2) /* metadata UUID */
> > > +#define XFS_SB_FEAT_INCOMPAT_32BIT_AEXT_CNTR (1 << 3)
> >
> > If you're going to introduce an INCOMPAT feature, please also use the
> > opportunity to convert xattrs to something resembling the dir v3 format,
> > where we index free space within each block so that we can speed up attr
> > setting with 100 million attrs.
>
> Not necessary. Chandan has already spent a lot of time investigating
> that - I suggested doing the investigation probably a year ago when
> he was looking for stuff to do knowing that this could be a problem
> parent pointers hit.
Oh, I didn't realize that analysis work has already been done.
Chandan, could you please mention that somewhere in the cover letter?
It does mention that you tried creating 1M xattrs, but I guess it needed
to be more explicit about not uncovering any gigantic performance holes.
> Long story short - there's no degradation in
> performance in the dabtree out to tens of millions of records with
> different fixed size or random sized attributes, nor does various
> combinations of insert/lookup/remove/replace operations seem to
> impact the tree performance at scale. IOWs, we hit the 16 bit extent
> limits of the attribute trees without finding any degradation in
> performance.
Ok. I'll take "attr v3 upgrade" off my list of things to look out for.
> Hence we concluded that the dabtree structure does not require
> significant modification or optimisation to work well with typical
> parent pointer attribute demands...
>
> As for free space indexes....
>
> The issue with the directory structure that requires external free
> space is that the directory data is not part of the dabtree itself.
> The attribute fork stores all the attributes at the leaves of the
> dabtree, while the directory structure stores the directory data in
> external blocks and the dabtree only contains the name hash index
> that points to the external data.
>
> i.e. When we add an attribute to the dabtree, we split/merge leaves
> of the tree based on where the name hash index tells us it needs to
> be inserted/removed from. i.e. we make space available or collapse
> sparse leaves of the dabtree as a side effect of inserting or
> removing objects.
>
> The directory structure is very different. The dirents cannot change
> location as their logical offset into the dir data segment is used
> as the readdir/seekdir/telldir cookie. Therefore that location is
> not allowed to change for the life of the dirent and so we can't
> store them in the leaves of a dabtree indexed in hash order because
> the offset into the tree would change as other entries are inserted
> and removed. Hence when we remove dirents, we must leave holes in
> the data segment so the rest of the dirent data does not change
> logical offset.
>
> The directory name hash index - the dabtree bit - is in a separate
> segment (the 2nd one). Because it only stores pointers to dirents in
> the data segment, it doesn't need to leave holes - the dabtree just
> merge/splits as required as pointers to the dir data segment are
> added/removed - and has no free space tracking.
>
> Hence when we go to add a dirent, we need to find the best free
> space in the dir data segment to add that dirent. This requires a
> dir data segment free space index, and that is held in the 3rd dir
> segment. Once we've found the best free space via lookup in the
> free space index, we go modify the dir data block it points to, then
> update the dabtree to point the name hash at that new dirent.
>
> IOWs, the requirement for a free space map in the directory
> structure results from storing the dirent data externally to the
> dabtree. Attributes are stored directly in the leaves of the
> dabtree - except for remote attributes which can be anywhere in the
> BMBT address space - and hence do no need external free space
> tracking to determine where to best insert them...
<nod> Got it. I've suspected this property about the xattr structures
for a long time, so I'm glad to hear someone else echo that. :)
Dave: May I try to rework the above into something suitable for the
ondisk format documentation?
--D
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com
next prev parent reply other threads:[~2020-04-08 15:47 UTC|newest]
Thread overview: 37+ messages / expand[flat|nested] mbox.gz Atom feed top
2020-04-04 8:52 [PATCH 0/2] Extend xattr extent counter to 32-bits Chandan Rajendra
2020-04-04 8:52 ` [PATCH 1/2] xfs: Fix log reservation calculation for xattr insert operation Chandan Rajendra
2020-04-06 15:25 ` Brian Foster
2020-04-06 22:57 ` Dave Chinner
2020-04-07 5:11 ` Chandan Rajendra
2020-04-07 12:59 ` Brian Foster
2020-04-07 0:49 ` Dave Chinner
2020-04-08 8:47 ` Chandan Rajendra
2020-04-04 8:52 ` [PATCH 2/2] xfs: Extend xattr extent counter to 32-bits Chandan Rajendra
2020-04-06 16:45 ` Brian Foster
2020-04-08 12:40 ` Chandan Rajendra
2020-04-06 17:06 ` Darrick J. Wong
2020-04-06 23:30 ` Dave Chinner
2020-04-08 12:43 ` Chandan Rajendra
2020-04-08 15:38 ` Darrick J. Wong
2020-04-08 22:43 ` Dave Chinner
2020-04-08 15:45 ` Darrick J. Wong [this message]
2020-04-08 22:45 ` Dave Chinner
2020-04-08 12:42 ` Chandan Rajendra
2020-04-07 1:20 ` Dave Chinner
2020-04-08 12:45 ` Chandan Rajendra
2020-04-10 7:46 ` Chandan Rajendra
2020-04-12 6:34 ` Chandan Rajendra
2020-04-13 18:55 ` Darrick J. Wong
2020-04-20 4:38 ` Chandan Rajendra
2020-04-22 9:38 ` Chandan Rajendra
2020-04-22 22:30 ` Dave Chinner
2020-04-25 12:07 ` Chandan Rajendra
2020-04-26 22:08 ` Dave Chinner
2020-04-29 15:35 ` Chandan Rajendra
2020-05-01 7:08 ` Chandan Rajendra
2020-05-12 23:53 ` Darrick J. Wong
2020-05-13 12:19 ` Chandan Rajendra
2020-04-22 22:51 ` Darrick J. Wong
2020-04-27 7:42 ` Christoph Hellwig
2020-04-27 7:39 ` Christoph Hellwig
2020-04-30 2:29 ` Chandan Rajendra
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20200408154512.GA6741@magnolia \
--to=darrick.wong@oracle.com \
--cc=bfoster@redhat.com \
--cc=chandan@linux.ibm.com \
--cc=chandanrlinux@gmail.com \
--cc=david@fromorbit.com \
--cc=linux-xfs@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.