Re: [PATCH 2/2] xfs: Extend xattr extent counter to 32-bits

linux-xfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: Chandan Rajendra <chandan@linux.ibm.com>
To: Dave Chinner <david@fromorbit.com>
Cc: Chandan Rajendra <chandanrlinux@gmail.com>,
	linux-xfs@vger.kernel.org, darrick.wong@oracle.com,
	bfoster@redhat.com
Subject: Re: [PATCH 2/2] xfs: Extend xattr extent counter to 32-bits
Date: Sun, 12 Apr 2020 12:04:13 +0530	[thread overview]
Message-ID: <3077601.x7mL1aTcQV@localhost.localdomain> (raw)
In-Reply-To: <7594634.fbK8aHRXK3@localhost.localdomain>

On Friday, April 10, 2020 1:16 PM Chandan Rajendra wrote: 
> On Tuesday, April 7, 2020 6:50 AM Dave Chinner wrote: 
> > On Sat, Apr 04, 2020 at 02:22:03PM +0530, Chandan Rajendra wrote:
> > > XFS has a per-inode xattr extent counter which is 16 bits wide. A workload
> > > which
> > > 1. Creates 1,000,000 255-byte sized xattrs,
> > > 2. Deletes 50% of these xattrs in an alternating manner,
> > > 3. Tries to create 400,000 new 255-byte sized xattrs
> > > causes the following message to be printed on the console,
> > > 
> > > XFS (loop0): xfs_iflush_int: detected corrupt incore inode 131, total extents = -19916, nblocks = 102937, ptr ffff9ce33b098c00
> > > XFS (loop0): xfs_do_force_shutdown(0x8) called from line 3739 of file fs/xfs/xfs_inode.c. Return address = ffffffffa4a94173
> > > 
> > > This indicates that we overflowed the 16-bits wide xattr extent counter.
> > > 
> > > I have been informed that there are instances where a single file has
> > >  > 100 million hardlinks. With parent pointers being stored in xattr,
> > > we will overflow the 16-bits wide xattr extent counter when large
> > > number of hardlinks are created.
> > > 
> > > Hence this commit extends xattr extent counter to 32-bits. It also introduces
> > > an incompat flag to prevent older kernels from mounting newer filesystems with
> > > 32-bit wide xattr extent counter.
> > > 
> > > Signed-off-by: Chandan Rajendra <chandanrlinux@gmail.com>
> > > ---
> > >  fs/xfs/libxfs/xfs_format.h     | 28 +++++++++++++++++++++-------
> > >  fs/xfs/libxfs/xfs_inode_buf.c  | 27 +++++++++++++++++++--------
> > >  fs/xfs/libxfs/xfs_inode_fork.c |  3 ++-
> > >  fs/xfs/libxfs/xfs_log_format.h |  5 +++--
> > >  fs/xfs/libxfs/xfs_types.h      |  4 ++--
> > >  fs/xfs/scrub/inode.c           |  7 ++++---
> > >  fs/xfs/xfs_inode_item.c        |  3 ++-
> > >  fs/xfs/xfs_log_recover.c       | 13 ++++++++++---
> > >  8 files changed, 63 insertions(+), 27 deletions(-)
> > > 
> > > diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
> > > index 045556e78ee2c..0a4266b0d46e1 100644
> > > --- a/fs/xfs/libxfs/xfs_format.h
> > > +++ b/fs/xfs/libxfs/xfs_format.h
> > > @@ -465,10 +465,12 @@ xfs_sb_has_ro_compat_feature(
> > >  #define XFS_SB_FEAT_INCOMPAT_FTYPE	(1 << 0)	/* filetype in dirent */
> > >  #define XFS_SB_FEAT_INCOMPAT_SPINODES	(1 << 1)	/* sparse inode chunks */
> > >  #define XFS_SB_FEAT_INCOMPAT_META_UUID	(1 << 2)	/* metadata UUID */
> > > +#define XFS_SB_FEAT_INCOMPAT_32BIT_AEXT_CNTR (1 << 3)
> > >  #define XFS_SB_FEAT_INCOMPAT_ALL \
> > >  		(XFS_SB_FEAT_INCOMPAT_FTYPE|	\
> > >  		 XFS_SB_FEAT_INCOMPAT_SPINODES|	\
> > > -		 XFS_SB_FEAT_INCOMPAT_META_UUID)
> > > +		 XFS_SB_FEAT_INCOMPAT_META_UUID| \
> > > +		 XFS_SB_FEAT_INCOMPAT_32BIT_AEXT_CNTR)
> > >  
> > >  #define XFS_SB_FEAT_INCOMPAT_UNKNOWN	~XFS_SB_FEAT_INCOMPAT_ALL
> > >  static inline bool
> > > @@ -874,7 +876,7 @@ typedef struct xfs_dinode {
> > >  	__be64		di_nblocks;	/* # of direct & btree blocks used */
> > >  	__be32		di_extsize;	/* basic/minimum extent size for file */
> > >  	__be32		di_nextents;	/* number of extents in data fork */
> > > -	__be16		di_anextents;	/* number of extents in attribute fork*/
> > > +	__be16		di_anextents_lo;/* lower part of xattr extent count */
> > >  	__u8		di_forkoff;	/* attr fork offs, <<3 for 64b align */
> > >  	__s8		di_aformat;	/* format of attr fork's data */
> > >  	__be32		di_dmevmask;	/* DMIG event mask */
> > > @@ -891,7 +893,8 @@ typedef struct xfs_dinode {
> > >  	__be64		di_lsn;		/* flush sequence */
> > >  	__be64		di_flags2;	/* more random flags */
> > >  	__be32		di_cowextsize;	/* basic cow extent size for file */
> > > -	__u8		di_pad2[12];	/* more padding for future expansion */
> > > +	__be16		di_anextents_hi;/* higher part of xattr extent count */
> > > +	__u8		di_pad2[10];	/* more padding for future expansion */
> > 
> > Ok, I think you've limited what we can do here by using this "fill
> > holes" variable split. I've never liked doing this, and we've only
> > done it in the past when we haven't had space in the inode to create
> > a new 32 bit variable.
> > 
> > IOWs, this is a v5 format feature only, so we should just create a
> > new variable:
> > 
> > 	__be32		di_attr_nextents;
> > 
> > With that in place, we can now do what we did extending the v1 inode
> > link count (16 bits) to the v2 inode link count (32 bits).
> > 
> > That is, when the attribute count is going to overflow, we set a
> > inode flag on disk to indicate that it now has a 32 bit extent count
> > and uses that field in the inode, and we set a RO-compat feature
> > flag in the superblock to indicate that there are 32 bit attr fork
> > extent counts in use.
> > 
> > Old kernels can still read the filesystem, but see the extent count
> > as "max" (65535) but can't modify the attr fork and hence corrupt
> > the 32 bit count it knows nothing about.
> > 
> > If the kernel sees the RO feature bit set, it can set the inode flag
> > on inodes it is modifying and update both the old and new counters
> > appropriately when flushing the inode to disk (i.e. transparent
> > conversion).
> > 
> > In future, mkfs can then set the RO feature flag by default so all
> > new filesystems use the 32 bit counter.
> > 
> > >  	/* fields only written to during inode creation */
> > >  	xfs_timestamp_t	di_crtime;	/* time created */
> > > @@ -993,10 +996,21 @@ enum xfs_dinode_fmt {
> > >  	((w) == XFS_DATA_FORK ? \
> > >  		(dip)->di_format : \
> > >  		(dip)->di_aformat)
> > > -#define XFS_DFORK_NEXTENTS(dip,w) \
> > > -	((w) == XFS_DATA_FORK ? \
> > > -		be32_to_cpu((dip)->di_nextents) : \
> > > -		be16_to_cpu((dip)->di_anextents))
> > > +
> > > +static inline int32_t XFS_DFORK_NEXTENTS(struct xfs_sb *sbp,
> > 
> > If you are converting a macro to static inline, then all the caller
> > sites should be converted to lower case at the same time.
> > 
> > > +					struct xfs_dinode *dip, int whichfork)
> > > +{
> > > +	int32_t anextents;
> > 
> > Extent counts should be unsigned, as they are on disk.
> > 
> > > +
> > > +	if (whichfork == XFS_DATA_FORK)
> > > +		return be32_to_cpu((dip)->di_nextents);
> > > +
> > > +	anextents = be16_to_cpu((dip)->di_anextents_lo);
> > > +	if (xfs_sb_version_has_v3inode(sbp))
> > > +		anextents |= ((u32)(be16_to_cpu((dip)->di_anextents_hi)) << 16);
> > > +
> > > +	return anextents;
> > > +}
> > 
> > No feature bit to indicate that 32 bit attribute extent counts are
> > valid?
> > 
> > >  
> > >  /*
> > >   * For block and character special files the 32bit dev_t is stored at the
> > > diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c
> > > index 39c5a6e24915c..ced8195bd8c22 100644
> > > --- a/fs/xfs/libxfs/xfs_inode_buf.c
> > > +++ b/fs/xfs/libxfs/xfs_inode_buf.c
> > > @@ -232,7 +232,8 @@ xfs_inode_from_disk(
> > >  	to->di_nblocks = be64_to_cpu(from->di_nblocks);
> > >  	to->di_extsize = be32_to_cpu(from->di_extsize);
> > >  	to->di_nextents = be32_to_cpu(from->di_nextents);
> > > -	to->di_anextents = be16_to_cpu(from->di_anextents);
> > > +	to->di_anextents = XFS_DFORK_NEXTENTS(&ip->i_mount->m_sb, from,
> > > +				XFS_ATTR_FORK);
> > 
> > This should open code, but I'd prefer a compeltely separate
> > variable...
> > 
> > >  	to->di_forkoff = from->di_forkoff;
> > >  	to->di_aformat	= from->di_aformat;
> > >  	to->di_dmevmask	= be32_to_cpu(from->di_dmevmask);
> > > @@ -282,7 +283,7 @@ xfs_inode_to_disk(
> > >  	to->di_nblocks = cpu_to_be64(from->di_nblocks);
> > >  	to->di_extsize = cpu_to_be32(from->di_extsize);
> > >  	to->di_nextents = cpu_to_be32(from->di_nextents);
> > > -	to->di_anextents = cpu_to_be16(from->di_anextents);
> > > +	to->di_anextents_lo = cpu_to_be16((u32)(from->di_anextents) & 0xffff);
> > >  	to->di_forkoff = from->di_forkoff;
> > >  	to->di_aformat = from->di_aformat;
> > >  	to->di_dmevmask = cpu_to_be32(from->di_dmevmask);
> > > @@ -296,6 +297,8 @@ xfs_inode_to_disk(
> > >  		to->di_crtime.t_nsec = cpu_to_be32(from->di_crtime.tv_nsec);
> > >  		to->di_flags2 = cpu_to_be64(from->di_flags2);
> > >  		to->di_cowextsize = cpu_to_be32(from->di_cowextsize);
> > > +		to->di_anextents_hi
> > > +			= cpu_to_be16((u32)(from->di_anextents) >> 16);
> > 
> > Again, feature bit for on-disk format modifications needed...
> > 
> > >  		to->di_ino = cpu_to_be64(ip->i_ino);
> > >  		to->di_lsn = cpu_to_be64(lsn);
> > >  		memset(to->di_pad2, 0, sizeof(to->di_pad2));
> > > @@ -335,7 +338,7 @@ xfs_log_dinode_to_disk(
> > >  	to->di_nblocks = cpu_to_be64(from->di_nblocks);
> > >  	to->di_extsize = cpu_to_be32(from->di_extsize);
> > >  	to->di_nextents = cpu_to_be32(from->di_nextents);
> > > -	to->di_anextents = cpu_to_be16(from->di_anextents);
> > > +	to->di_anextents_lo = cpu_to_be16(from->di_anextents_lo);
> > >  	to->di_forkoff = from->di_forkoff;
> > >  	to->di_aformat = from->di_aformat;
> > >  	to->di_dmevmask = cpu_to_be32(from->di_dmevmask);
> > > @@ -349,6 +352,7 @@ xfs_log_dinode_to_disk(
> > >  		to->di_crtime.t_nsec = cpu_to_be32(from->di_crtime.t_nsec);
> > >  		to->di_flags2 = cpu_to_be64(from->di_flags2);
> > >  		to->di_cowextsize = cpu_to_be32(from->di_cowextsize);
> > > +		to->di_anextents_hi = cpu_to_be16(from->di_anextents_hi);
> > >  		to->di_ino = cpu_to_be64(from->di_ino);
> > >  		to->di_lsn = cpu_to_be64(from->di_lsn);
> > >  		memcpy(to->di_pad2, from->di_pad2, sizeof(to->di_pad2));
> > > @@ -365,7 +369,9 @@ xfs_dinode_verify_fork(
> > >  	struct xfs_mount	*mp,
> > >  	int			whichfork)
> > >  {
> > > -	uint32_t		di_nextents = XFS_DFORK_NEXTENTS(dip, whichfork);
> > > +	uint32_t		di_nextents;
> > > +
> > > +	di_nextents = XFS_DFORK_NEXTENTS(&mp->m_sb, dip, whichfork);
> > >  
> > >  	switch (XFS_DFORK_FORMAT(dip, whichfork)) {
> > >  	case XFS_DINODE_FMT_LOCAL:
> > > @@ -436,6 +442,9 @@ xfs_dinode_verify(
> > >  	uint16_t		flags;
> > >  	uint64_t		flags2;
> > >  	uint64_t		di_size;
> > > +	int32_t			nextents;
> > > +	int32_t			anextents;
> > > +	int64_t			nblocks;
> > 
> > Extent counts need to be converted to unsigned in memory - they are
> > unsigned on disk....
> 
> In the current code, we have,
> 
> #define MAXEXTNUM       ((xfs_extnum_t)0x7fffffff)      /* signed int */                                                                                                      
> #define MAXAEXTNUM      ((xfs_aextnum_t)0x7fff)         /* signed short */
> 
> i.e. the maximum allowed data extent counter and xattr extent counter are
> maximum possible values w.r.t signed int and signed short.
> 
> Can you please explain as to why signed maximum values were considered when
> the corresponding on-disk data types are unsigned?
> 
> 

Ok. So the reason I asked that question was because I was wondering if
changing the maximum number of extents for data and attr would cause a change
the height of the corresponding bmbt trees (which in-turn could change the log
reservation values). The following calculations prove otherwise,

- 5 levels deep data bmbt tree.
  |-------+------------------------+-------------------------------|
  | level | number of nodes/leaves | Total Nr recs                 |
  |-------+------------------------+-------------------------------|
  |     0 |                      1 | 3 (max root recs)             |
  |     1 |                      3 | 125 * 3 = 375                 |
  |     2 |                    375 | 125 * 375 = 46875             |
  |     3 |                  46875 | 125 * 46875 = 5859375         |
  |     4 |                5859375 | 125 * 5859375 = 732421875     |
  |     5 |              732421875 | 125 * 732421875 = 91552734375 |
  |-------+------------------------+-------------------------------|

- 3 levels deep attr bmbt tree.
  |-------+------------------------+-----------------------|
  | level | number of nodes/leaves | Total Nr recs         |
  |-------+------------------------+-----------------------|
  |     0 |                      1 | 2 (max root recs)     |
  |     1 |                      2 | 125 * 2 = 250         |
  |     2 |                    250 | 125 * 250 = 31250     |
  |     3 |                  31250 | 125 * 31250 = 3906250 |
  |-------+------------------------+-----------------------|

- Data type to number of records
  |-----------+-------------+-----------------|
  | data type | max extents | max leaf blocks |
  |-----------+-------------+-----------------|
  | int32     |  2147483647 |        17179870 |
  | uint32    |  4294967295 |        34359739 |
  | int16     |       32767 |             263 |
  | uint16    |       65535 |             525 |                                                                                                                  
  |-----------+-------------+-----------------|

So data bmbt will still have a height of 5 and attr bmbt will continue to have
a height of 3.

-- 
chandan

next prev parent reply	other threads:[~2020-04-12  6:31 UTC|newest]

Thread overview: 37+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-04-04  8:52 [PATCH 0/2] Extend xattr extent counter to 32-bits Chandan Rajendra
2020-04-04  8:52 ` [PATCH 1/2] xfs: Fix log reservation calculation for xattr insert operation Chandan Rajendra
2020-04-06 15:25   ` Brian Foster
2020-04-06 22:57     ` Dave Chinner
2020-04-07  5:11       ` Chandan Rajendra
2020-04-07 12:59       ` Brian Foster
2020-04-07  0:49   ` Dave Chinner
2020-04-08  8:47     ` Chandan Rajendra
2020-04-04  8:52 ` [PATCH 2/2] xfs: Extend xattr extent counter to 32-bits Chandan Rajendra
2020-04-06 16:45   ` Brian Foster
2020-04-08 12:40     ` Chandan Rajendra
2020-04-06 17:06   ` Darrick J. Wong
2020-04-06 23:30     ` Dave Chinner
2020-04-08 12:43       ` Chandan Rajendra
2020-04-08 15:38         ` Darrick J. Wong
2020-04-08 22:43         ` Dave Chinner
2020-04-08 15:45       ` Darrick J. Wong
2020-04-08 22:45         ` Dave Chinner
2020-04-08 12:42     ` Chandan Rajendra
2020-04-07  1:20   ` Dave Chinner
2020-04-08 12:45     ` Chandan Rajendra
2020-04-10  7:46     ` Chandan Rajendra
2020-04-12  6:34       ` Chandan Rajendra [this message]
2020-04-13 18:55         ` Darrick J. Wong
2020-04-20  4:38           ` Chandan Rajendra
2020-04-22  9:38             ` Chandan Rajendra
2020-04-22 22:30               ` Dave Chinner
2020-04-25 12:07                 ` Chandan Rajendra
2020-04-26 22:08                   ` Dave Chinner
2020-04-29 15:35                     ` Chandan Rajendra
2020-05-01  7:08                       ` Chandan Rajendra
2020-05-12 23:53                         ` Darrick J. Wong
2020-05-13 12:19                           ` Chandan Rajendra
2020-04-22 22:51               ` Darrick J. Wong
2020-04-27  7:42     ` Christoph Hellwig
2020-04-27  7:39   ` Christoph Hellwig
2020-04-30  2:29     ` Chandan Rajendra

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=3077601.x7mL1aTcQV@localhost.localdomain \
    --to=chandan@linux.ibm.com \
    --cc=bfoster@redhat.com \
    --cc=chandanrlinux@gmail.com \
    --cc=darrick.wong@oracle.com \
    --cc=david@fromorbit.com \
    --cc=linux-xfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).