public inbox for linux-xfs@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH] xfs: fix off-by-one in inode alloc block reservation calculation
@ 2020-08-20 17:07 Brian Foster
  2020-08-20 17:25 ` Darrick J. Wong
  0 siblings, 1 reply; 4+ messages in thread
From: Brian Foster @ 2020-08-20 17:07 UTC (permalink / raw)
  To: linux-xfs

The inode chunk allocation transaction reserves inobt_maxlevels-1
blocks to accommodate a full split of the inode btree. A full split
requires an allocation for every existing level and a new root
block, which means inobt_maxlevels is the worst case block
requirement for a transaction that inserts to the inobt. This can
lead to a transaction block reservation overrun when tmpfile
creation allocates an inode chunk and expands the inobt to its
maximum depth. This problem has been observed in conjunction with
overlayfs, which makes frequent use of tmpfiles internally.

The existing reservation code goes back as far as the Linux git repo
history (v2.6.12). It was likely never observed as a problem because
the traditional file/directory creation transactions also include
worst case block reservation for directory modifications, which most
likely is able to make up for a single block deficiency in the inode
allocation portion of the calculation. tmpfile support is relatively
more recent (v3.15), less heavily used, and only includes the inode
allocation block reservation as tmpfiles aren't linked into the
directory tree on creation.

Fix up the inode alloc block reservation macro and a couple of the
block allocator minleft parameters that enforce an allocation to
leave enough free blocks in the AG for a full inobt split.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/libxfs/xfs_ialloc.c      | 4 ++--
 fs/xfs/libxfs/xfs_trans_space.h | 2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
index f742a96a2fe1..a6b37db55169 100644
--- a/fs/xfs/libxfs/xfs_ialloc.c
+++ b/fs/xfs/libxfs/xfs_ialloc.c
@@ -688,7 +688,7 @@ xfs_ialloc_ag_alloc(
 		args.minalignslop = igeo->cluster_align - 1;
 
 		/* Allow space for the inode btree to split. */
-		args.minleft = igeo->inobt_maxlevels - 1;
+		args.minleft = igeo->inobt_maxlevels;
 		if ((error = xfs_alloc_vextent(&args)))
 			return error;
 
@@ -736,7 +736,7 @@ xfs_ialloc_ag_alloc(
 		/*
 		 * Allow space for the inode btree to split.
 		 */
-		args.minleft = igeo->inobt_maxlevels - 1;
+		args.minleft = igeo->inobt_maxlevels;
 		if ((error = xfs_alloc_vextent(&args)))
 			return error;
 	}
diff --git a/fs/xfs/libxfs/xfs_trans_space.h b/fs/xfs/libxfs/xfs_trans_space.h
index c6df01a2a158..7ad3659c5d2a 100644
--- a/fs/xfs/libxfs/xfs_trans_space.h
+++ b/fs/xfs/libxfs/xfs_trans_space.h
@@ -58,7 +58,7 @@
 #define	XFS_IALLOC_SPACE_RES(mp)	\
 	(M_IGEO(mp)->ialloc_blks + \
 	 ((xfs_sb_version_hasfinobt(&mp->m_sb) ? 2 : 1) * \
-	  (M_IGEO(mp)->inobt_maxlevels - 1)))
+	  M_IGEO(mp)->inobt_maxlevels))
 
 /*
  * Space reservation values for various transactions.
-- 
2.25.4


^ permalink raw reply related	[flat|nested] 4+ messages in thread

* Re: [PATCH] xfs: fix off-by-one in inode alloc block reservation calculation
  2020-08-20 17:07 [PATCH] xfs: fix off-by-one in inode alloc block reservation calculation Brian Foster
@ 2020-08-20 17:25 ` Darrick J. Wong
  2020-08-20 17:47   ` Brian Foster
  0 siblings, 1 reply; 4+ messages in thread
From: Darrick J. Wong @ 2020-08-20 17:25 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Thu, Aug 20, 2020 at 01:07:34PM -0400, Brian Foster wrote:
> The inode chunk allocation transaction reserves inobt_maxlevels-1
> blocks to accommodate a full split of the inode btree. A full split
> requires an allocation for every existing level and a new root
> block, which means inobt_maxlevels is the worst case block
> requirement for a transaction that inserts to the inobt. This can
> lead to a transaction block reservation overrun when tmpfile
> creation allocates an inode chunk and expands the inobt to its
> maximum depth. This problem has been observed in conjunction with
> overlayfs, which makes frequent use of tmpfiles internally.
> 
> The existing reservation code goes back as far as the Linux git repo
> history (v2.6.12). It was likely never observed as a problem because
> the traditional file/directory creation transactions also include
> worst case block reservation for directory modifications, which most
> likely is able to make up for a single block deficiency in the inode
> allocation portion of the calculation. tmpfile support is relatively
> more recent (v3.15), less heavily used, and only includes the inode
> allocation block reservation as tmpfiles aren't linked into the
> directory tree on creation.
> 
> Fix up the inode alloc block reservation macro and a couple of the
> block allocator minleft parameters that enforce an allocation to
> leave enough free blocks in the AG for a full inobt split.

Looks all fine to me, but... does a similar logic apply to the other
maxlevels uses in the kernel?

fs/xfs/libxfs/xfs_trans_resv.c:73:      blocks = num_ops * 2 * (2 * mp->m_ag_maxlevels - 1);
fs/xfs/libxfs/xfs_trans_resv.c:75:              blocks += max(num_ops * (2 * mp->m_rmap_maxlevels - 1),
fs/xfs/libxfs/xfs_trans_resv.c:78:              blocks += num_ops * (2 * mp->m_refc_maxlevels - 1);

Can we end up in the same kind of situation with those other trees
{bno,cnt,rmap,refc} where we have a maxlevels-1 tall tree and split each
level all the way to the top?

> Signed-off-by: Brian Foster <bfoster@redhat.com>

For this bit,
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D

> ---
>  fs/xfs/libxfs/xfs_ialloc.c      | 4 ++--
>  fs/xfs/libxfs/xfs_trans_space.h | 2 +-
>  2 files changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
> index f742a96a2fe1..a6b37db55169 100644
> --- a/fs/xfs/libxfs/xfs_ialloc.c
> +++ b/fs/xfs/libxfs/xfs_ialloc.c
> @@ -688,7 +688,7 @@ xfs_ialloc_ag_alloc(
>  		args.minalignslop = igeo->cluster_align - 1;
>  
>  		/* Allow space for the inode btree to split. */
> -		args.minleft = igeo->inobt_maxlevels - 1;
> +		args.minleft = igeo->inobt_maxlevels;
>  		if ((error = xfs_alloc_vextent(&args)))
>  			return error;
>  
> @@ -736,7 +736,7 @@ xfs_ialloc_ag_alloc(
>  		/*
>  		 * Allow space for the inode btree to split.
>  		 */
> -		args.minleft = igeo->inobt_maxlevels - 1;
> +		args.minleft = igeo->inobt_maxlevels;
>  		if ((error = xfs_alloc_vextent(&args)))
>  			return error;
>  	}
> diff --git a/fs/xfs/libxfs/xfs_trans_space.h b/fs/xfs/libxfs/xfs_trans_space.h
> index c6df01a2a158..7ad3659c5d2a 100644
> --- a/fs/xfs/libxfs/xfs_trans_space.h
> +++ b/fs/xfs/libxfs/xfs_trans_space.h
> @@ -58,7 +58,7 @@
>  #define	XFS_IALLOC_SPACE_RES(mp)	\
>  	(M_IGEO(mp)->ialloc_blks + \
>  	 ((xfs_sb_version_hasfinobt(&mp->m_sb) ? 2 : 1) * \
> -	  (M_IGEO(mp)->inobt_maxlevels - 1)))
> +	  M_IGEO(mp)->inobt_maxlevels))
>  
>  /*
>   * Space reservation values for various transactions.
> -- 
> 2.25.4
> 

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH] xfs: fix off-by-one in inode alloc block reservation calculation
  2020-08-20 17:25 ` Darrick J. Wong
@ 2020-08-20 17:47   ` Brian Foster
  2020-08-20 21:24     ` Dave Chinner
  0 siblings, 1 reply; 4+ messages in thread
From: Brian Foster @ 2020-08-20 17:47 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Thu, Aug 20, 2020 at 10:25:12AM -0700, Darrick J. Wong wrote:
> On Thu, Aug 20, 2020 at 01:07:34PM -0400, Brian Foster wrote:
> > The inode chunk allocation transaction reserves inobt_maxlevels-1
> > blocks to accommodate a full split of the inode btree. A full split
> > requires an allocation for every existing level and a new root
> > block, which means inobt_maxlevels is the worst case block
> > requirement for a transaction that inserts to the inobt. This can
> > lead to a transaction block reservation overrun when tmpfile
> > creation allocates an inode chunk and expands the inobt to its
> > maximum depth. This problem has been observed in conjunction with
> > overlayfs, which makes frequent use of tmpfiles internally.
> > 
> > The existing reservation code goes back as far as the Linux git repo
> > history (v2.6.12). It was likely never observed as a problem because
> > the traditional file/directory creation transactions also include
> > worst case block reservation for directory modifications, which most
> > likely is able to make up for a single block deficiency in the inode
> > allocation portion of the calculation. tmpfile support is relatively
> > more recent (v3.15), less heavily used, and only includes the inode
> > allocation block reservation as tmpfiles aren't linked into the
> > directory tree on creation.
> > 
> > Fix up the inode alloc block reservation macro and a couple of the
> > block allocator minleft parameters that enforce an allocation to
> > leave enough free blocks in the AG for a full inobt split.
> 
> Looks all fine to me, but... does a similar logic apply to the other
> maxlevels uses in the kernel?
> 
> fs/xfs/libxfs/xfs_trans_resv.c:73:      blocks = num_ops * 2 * (2 * mp->m_ag_maxlevels - 1);
> fs/xfs/libxfs/xfs_trans_resv.c:75:              blocks += max(num_ops * (2 * mp->m_rmap_maxlevels - 1),
> fs/xfs/libxfs/xfs_trans_resv.c:78:              blocks += num_ops * (2 * mp->m_refc_maxlevels - 1);
> 
> Can we end up in the same kind of situation with those other trees
> {bno,cnt,rmap,refc} where we have a maxlevels-1 tall tree and split each
> level all the way to the top?
> 

Hmm.. it seems so at first glance, but I'm not sure I follow the
calculations in that function. If we factor out the obvious
num_ops/num_trees components, the comment refers to the following
generic formula:

	((2 blocks/level * max depth) - 1)

I take it that since this is a log reservation calculation, the two
block/level multiplier is there because we have to move records between
two blocks for each level that splits. Is there a reason the -1 is
applied after that multiplier (as opposed to subtracting 1 from the max
depth first)? I'm wondering if that's intentional and it reflects that
the root level is only one block...

Brian

> > Signed-off-by: Brian Foster <bfoster@redhat.com>
> 
> For this bit,
> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
> 
> --D
> 
> > ---
> >  fs/xfs/libxfs/xfs_ialloc.c      | 4 ++--
> >  fs/xfs/libxfs/xfs_trans_space.h | 2 +-
> >  2 files changed, 3 insertions(+), 3 deletions(-)
> > 
> > diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
> > index f742a96a2fe1..a6b37db55169 100644
> > --- a/fs/xfs/libxfs/xfs_ialloc.c
> > +++ b/fs/xfs/libxfs/xfs_ialloc.c
> > @@ -688,7 +688,7 @@ xfs_ialloc_ag_alloc(
> >  		args.minalignslop = igeo->cluster_align - 1;
> >  
> >  		/* Allow space for the inode btree to split. */
> > -		args.minleft = igeo->inobt_maxlevels - 1;
> > +		args.minleft = igeo->inobt_maxlevels;
> >  		if ((error = xfs_alloc_vextent(&args)))
> >  			return error;
> >  
> > @@ -736,7 +736,7 @@ xfs_ialloc_ag_alloc(
> >  		/*
> >  		 * Allow space for the inode btree to split.
> >  		 */
> > -		args.minleft = igeo->inobt_maxlevels - 1;
> > +		args.minleft = igeo->inobt_maxlevels;
> >  		if ((error = xfs_alloc_vextent(&args)))
> >  			return error;
> >  	}
> > diff --git a/fs/xfs/libxfs/xfs_trans_space.h b/fs/xfs/libxfs/xfs_trans_space.h
> > index c6df01a2a158..7ad3659c5d2a 100644
> > --- a/fs/xfs/libxfs/xfs_trans_space.h
> > +++ b/fs/xfs/libxfs/xfs_trans_space.h
> > @@ -58,7 +58,7 @@
> >  #define	XFS_IALLOC_SPACE_RES(mp)	\
> >  	(M_IGEO(mp)->ialloc_blks + \
> >  	 ((xfs_sb_version_hasfinobt(&mp->m_sb) ? 2 : 1) * \
> > -	  (M_IGEO(mp)->inobt_maxlevels - 1)))
> > +	  M_IGEO(mp)->inobt_maxlevels))
> >  
> >  /*
> >   * Space reservation values for various transactions.
> > -- 
> > 2.25.4
> > 
> 


^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH] xfs: fix off-by-one in inode alloc block reservation calculation
  2020-08-20 17:47   ` Brian Foster
@ 2020-08-20 21:24     ` Dave Chinner
  0 siblings, 0 replies; 4+ messages in thread
From: Dave Chinner @ 2020-08-20 21:24 UTC (permalink / raw)
  To: Brian Foster; +Cc: Darrick J. Wong, linux-xfs

On Thu, Aug 20, 2020 at 01:47:08PM -0400, Brian Foster wrote:
> On Thu, Aug 20, 2020 at 10:25:12AM -0700, Darrick J. Wong wrote:
> > On Thu, Aug 20, 2020 at 01:07:34PM -0400, Brian Foster wrote:
> > > The inode chunk allocation transaction reserves inobt_maxlevels-1
> > > blocks to accommodate a full split of the inode btree. A full split
> > > requires an allocation for every existing level and a new root
> > > block, which means inobt_maxlevels is the worst case block
> > > requirement for a transaction that inserts to the inobt. This can
> > > lead to a transaction block reservation overrun when tmpfile
> > > creation allocates an inode chunk and expands the inobt to its
> > > maximum depth. This problem has been observed in conjunction with
> > > overlayfs, which makes frequent use of tmpfiles internally.
> > > 
> > > The existing reservation code goes back as far as the Linux git repo
> > > history (v2.6.12). It was likely never observed as a problem because
> > > the traditional file/directory creation transactions also include
> > > worst case block reservation for directory modifications, which most
> > > likely is able to make up for a single block deficiency in the inode
> > > allocation portion of the calculation. tmpfile support is relatively
> > > more recent (v3.15), less heavily used, and only includes the inode
> > > allocation block reservation as tmpfiles aren't linked into the
> > > directory tree on creation.
> > > 
> > > Fix up the inode alloc block reservation macro and a couple of the
> > > block allocator minleft parameters that enforce an allocation to
> > > leave enough free blocks in the AG for a full inobt split.
> > 
> > Looks all fine to me, but... does a similar logic apply to the other
> > maxlevels uses in the kernel?
> > 
> > fs/xfs/libxfs/xfs_trans_resv.c:73:      blocks = num_ops * 2 * (2 * mp->m_ag_maxlevels - 1);
> > fs/xfs/libxfs/xfs_trans_resv.c:75:              blocks += max(num_ops * (2 * mp->m_rmap_maxlevels - 1),
> > fs/xfs/libxfs/xfs_trans_resv.c:78:              blocks += num_ops * (2 * mp->m_refc_maxlevels - 1);
> > 
> > Can we end up in the same kind of situation with those other trees
> > {bno,cnt,rmap,refc} where we have a maxlevels-1 tall tree and split each
> > level all the way to the top?
> > 
> 
> Hmm.. it seems so at first glance, but I'm not sure I follow the
> calculations in that function. If we factor out the obvious
> num_ops/num_trees components, the comment refers to the following
> generic formula:
> 
> 	((2 blocks/level * max depth) - 1)
> 
> I take it that since this is a log reservation calculation, the two
> block/level multiplier is there because we have to move records between
> two blocks for each level that splits. Is there a reason the -1 is
> applied after that multiplier (as opposed to subtracting 1 from the max
> depth first)? I'm wondering if that's intentional and it reflects that
> the root level is only one block...

Intentional, I think, because that's how btree splits work. :) i.e.
split every level into 2 blocks, then add one for the new root. But
when the tree is already at max height, we can't split the root
block anymore so we are accounting for a split at every level except
the root block, which is a single block....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2020-08-20 21:24 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2020-08-20 17:07 [PATCH] xfs: fix off-by-one in inode alloc block reservation calculation Brian Foster
2020-08-20 17:25 ` Darrick J. Wong
2020-08-20 17:47   ` Brian Foster
2020-08-20 21:24     ` Dave Chinner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox