public inbox for linux-xfs@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/2] xfs: make cluster size tunnable for sparse allocation
@ 2024-12-16 13:05 Tianxiang Peng
  2024-12-16 13:05 ` [PATCH 1/2] xfs: calculate cluster_size_raw from sb when sparse alloc enabled Tianxiang Peng
                   ` (2 more replies)
  0 siblings, 3 replies; 4+ messages in thread
From: Tianxiang Peng @ 2024-12-16 13:05 UTC (permalink / raw)
  To: chandan.babu, djwong, p.raghav, mcgrof, brauner, dchinner
  Cc: Tianxiang Peng, linux-xfs, linux-fsdevel, linux-kernel,
	allexjlzheng, flyingpeng

This patch series makes inode cluster size a tunnable parameter in
mkfs.xfs when sparse allocation is enabled, and also makes xfs use
inode cluster size directly from the superblock read in rather than
recalculate itself and verify.

Under extreme fragmentation situations, even inode sparse allocation
may fail with current default inode cluster size i.e. 8192 bytes. Such
situations may come from the PUNCH_HOLE fallocation which is used by
some applications, for example MySQL innodb page compression. With xfs
of 4K blocksize, MySQL may write out 16K buffer with direct I/O(which
immediately triggers block allocation) then try to compress the 16K
buffer to <4K. If the compression succeeds, MySQL will punch out the
latter 12K, leave only the first 4K allocated:
	after write 16k buffer: OOOO
	after punch latter 12K: OXXX
where O means page with block allocated, X means page without.

Such feature saves disk space(the 12K freed by punching can be used
by others), but also makes the filesystem much more fragmented.
Considering xfs has no automatic defragmentation mechanism, in the
most extreme cases, there will be only 1-3 physically continuous
blocks finally avaliable.

For data block allocation, such fragmentation is not a problem, as
physical continuation is not always required. But inode chunk
allocation requires so. Even for sparse allocation, physical
continuation has also to be guaranteed in a way. Currently this
value is calculated from a scaled inode cluster size. In xfs, inodes
are manipulated(e.g. read in, logged, written back) in cluster, and
the size of that cluster is just the inode cluster size. Sparse
allocation unit currently is calculated from that:
	(inode size / MIN_INODE_SIZE) * inode cluster size
		-> sparse allocation aligmnet
			-> sparse allocation unit
For example, under default mkfs configuration(i.e. crc and sparse
allocation enabled, 4K blocksize), inode size is 512 bytes(2 times
of MIN_INODE_SIZE=256 bytes), then sparse allocation unit will be
2 * current inode cluster size(8192 bytes) = 16384 bytes, that is
4 blocks. As we mentioned above, under extreme fragmentation, the
filesystem may be full of 1-3 physically continuous blocks but can
never find one of 4, so even sparese allocation will also fail. If
we know application will easily create such fragmentation, then we
had better have a way to loose sparse allocation requirement manually.

This patch series achieves that by making the source of sparse
allocation unit, inode cluster size a tunnable parameter. When
sparse allocation enabled, make that size tunnable in mkfs. As xfs
itself currently recalculate and verify related value, change xfs
behavior to directly using the value provided by superblock read in.

Tianxiang Peng (2):
  xfs: calculate cluster_size_raw from sb when sparse alloc enabled
  mkfs: make cluster size tunnable when sparse alloc enabled

 fs/xfs/libxfs/xfs_ialloc.c | 35 ++++++++++++++++++++++-------------
 fs/xfs/xfs_mount.c         | 12 ++++++------
 mkfs/xfs_mkfs.c            | 34 +++++++++++++++++++++++++++++-----
 3 files changed, 57 insertions(+), 24 deletions(-)

-- 
2.43.5


^ permalink raw reply	[flat|nested] 4+ messages in thread

* [PATCH 1/2] xfs: calculate cluster_size_raw from sb when sparse alloc enabled
  2024-12-16 13:05 [PATCH 0/2] xfs: make cluster size tunnable for sparse allocation Tianxiang Peng
@ 2024-12-16 13:05 ` Tianxiang Peng
  2024-12-16 13:05 ` [PATCH 2/2] mkfs: make cluster size tunnable " Tianxiang Peng
  2024-12-16 21:19 ` [PATCH 0/2] xfs: make cluster size tunnable for sparse allocation Dave Chinner
  2 siblings, 0 replies; 4+ messages in thread
From: Tianxiang Peng @ 2024-12-16 13:05 UTC (permalink / raw)
  To: chandan.babu, djwong, p.raghav, mcgrof, brauner, dchinner
  Cc: Tianxiang Peng, linux-xfs, linux-fsdevel, linux-kernel,
	allexjlzheng, flyingpeng

When sparse inode allocation enabled, use sb_spino_align read from
sb to calculate inode_cluster_size_raw. As now
inode_cluster_size_raw is not a fixed value, remove the validation
in mount code, subtitute it with some value checks.

Signed-off-by: Tianxiang Peng <txpeng@tencent.com>
Reviewed-by: Jinliang Zheng <allexjlzheng@tencent.com>
Reviewed-by: Hao Peng <flyingpeng@tencent.com>
---
 fs/xfs/libxfs/xfs_ialloc.c | 35 ++++++++++++++++++++++-------------
 fs/xfs/xfs_mount.c         | 12 ++++++------
 2 files changed, 28 insertions(+), 19 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
index 271855227514..f276ccbe9d6f 100644
--- a/fs/xfs/libxfs/xfs_ialloc.c
+++ b/fs/xfs/libxfs/xfs_ialloc.c
@@ -2989,22 +2989,31 @@ xfs_ialloc_setup_geometry(
 	}
 
 	/*
-	 * Compute the desired size of an inode cluster buffer size, which
-	 * starts at 8K and (on v5 filesystems) scales up with larger inode
-	 * sizes.
+	 * If sparse inode is enabled, take its alignment as cluster size, as
+	 * mkfs may change cluster size to deal with extreme fragmentation
+	 * situations. For that case, sparse inode alignment will be set to
+	 * actual block count of cluster size.
 	 *
-	 * Preserve the desired inode cluster size because the sparse inodes
-	 * feature uses that desired size (not the actual size) to compute the
-	 * sparse inode alignment.  The mount code validates this value, so we
-	 * cannot change the behavior.
+	 * Otherwise, compute the desired size of an inode cluster buffer size,
+	 * which starts at 8K and (on v5 filesystems) scales up with larger inode
+	 * sizes.
 	 */
-	igeo->inode_cluster_size_raw = XFS_INODE_BIG_CLUSTER_SIZE;
-	if (xfs_has_v3inodes(mp)) {
-		int	new_size = igeo->inode_cluster_size_raw;
+	if (xfs_has_sparseinodes(mp)) {
+		igeo->inode_cluster_size_raw = mp->m_sb.sb_blocksize;
+		if (mp->m_sb.sb_spino_align)
+			igeo->inode_cluster_size_raw *= mp->m_sb.sb_spino_align;
+		xfs_info(mp,
+			"Calculate cluster_size_raw from sb: %u. sb_inoalignment: %u",
+			igeo->inode_cluster_size_raw, mp->m_sb.sb_inoalignmt);
+	} else {
+		igeo->inode_cluster_size_raw = XFS_INODE_BIG_CLUSTER_SIZE;
+		if (xfs_has_v3inodes(mp)) {
+			int	new_size = igeo->inode_cluster_size_raw;
 
-		new_size *= mp->m_sb.sb_inodesize / XFS_DINODE_MIN_SIZE;
-		if (mp->m_sb.sb_inoalignmt >= XFS_B_TO_FSBT(mp, new_size))
-			igeo->inode_cluster_size_raw = new_size;
+			new_size *= mp->m_sb.sb_inodesize / XFS_DINODE_MIN_SIZE;
+			if (mp->m_sb.sb_inoalignmt >= XFS_B_TO_FSBT(mp, new_size))
+				igeo->inode_cluster_size_raw = new_size;
+		}
 	}
 
 	/* Calculate inode cluster ratios. */
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index 1fdd79c5bfa0..47260d9c5033 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -761,12 +761,12 @@ xfs_mountfs(
 	 * but that is checked on sb read verification...
 	 */
 	if (xfs_has_sparseinodes(mp) &&
-	    mp->m_sb.sb_spino_align !=
-			XFS_B_TO_FSBT(mp, igeo->inode_cluster_size_raw)) {
-		xfs_warn(mp,
-	"Sparse inode block alignment (%u) must match cluster size (%llu).",
-			 mp->m_sb.sb_spino_align,
-			 XFS_B_TO_FSBT(mp, igeo->inode_cluster_size_raw));
+		(!is_power_of_2(igeo->inode_cluster_size_raw)
+		|| igeo->inode_cluster_size_raw < mp->m_sb.sb_inodesize
+		|| igeo->inode_cluster_size_raw >
+			XFS_INODES_PER_CHUNK * mp->m_sb.sb_inodesize)) {
+		xfs_warn(mp, "Invalid sparse inode cluster size(%u).",
+			igeo->inode_cluster_size_raw);
 		error = -EINVAL;
 		goto out_remove_uuid;
 	}
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 4+ messages in thread

* [PATCH 2/2] mkfs: make cluster size tunnable when sparse alloc enabled
  2024-12-16 13:05 [PATCH 0/2] xfs: make cluster size tunnable for sparse allocation Tianxiang Peng
  2024-12-16 13:05 ` [PATCH 1/2] xfs: calculate cluster_size_raw from sb when sparse alloc enabled Tianxiang Peng
@ 2024-12-16 13:05 ` Tianxiang Peng
  2024-12-16 21:19 ` [PATCH 0/2] xfs: make cluster size tunnable for sparse allocation Dave Chinner
  2 siblings, 0 replies; 4+ messages in thread
From: Tianxiang Peng @ 2024-12-16 13:05 UTC (permalink / raw)
  To: chandan.babu, djwong, p.raghav, mcgrof, brauner, dchinner
  Cc: Tianxiang Peng, linux-xfs, linux-fsdevel, linux-kernel,
	allexjlzheng, flyingpeng

Add clustersize parameter for -i(inode size) switch. When sparse
inode allocation is enabled, use clustersize from cmdline if it's
provided and fallback to XFS_INODE_BIG_CLUSTER_SIZE if not.

Signed-off-by: Tianxiang Peng <txpeng@tencent.com>
Reviewed-by: Jinliang Zheng <allexjlzheng@tencent.com>
Reviewed-by: Hao Peng <flyingpeng@tencent.com>
---
 mkfs/xfs_mkfs.c | 34 +++++++++++++++++++++++++++++-----
 1 file changed, 29 insertions(+), 5 deletions(-)

diff --git a/mkfs/xfs_mkfs.c b/mkfs/xfs_mkfs.c
index bbd0dbb6..b8a597d4 100644
--- a/mkfs/xfs_mkfs.c
+++ b/mkfs/xfs_mkfs.c
@@ -92,6 +92,7 @@ enum {
 	I_SPINODES,
 	I_NREXT64,
 	I_EXCHANGE,
+	I_CLUSTERSIZE,
 	I_MAX_OPTS,
 };
 
@@ -474,6 +475,7 @@ static struct opt_params iopts = {
 		[I_SPINODES] = "sparse",
 		[I_NREXT64] = "nrext64",
 		[I_EXCHANGE] = "exchange",
+		[I_CLUSTERSIZE] = "clustersize",
 		[I_MAX_OPTS] = NULL,
 	},
 	.subopt_params = {
@@ -535,6 +537,13 @@ static struct opt_params iopts = {
 		  .maxval = 1,
 		  .defaultval = 1,
 		},
+		{ .index = I_CLUSTERSIZE,
+		  .conflicts = { { NULL, LAST_CONFLICT } },
+		  .is_power_2 = true,
+		  .minval = XFS_DINODE_MIN_SIZE,
+		  .maxval = XFS_DINODE_MIN_SIZE << XFS_INODES_PER_CHUNK_LOG,
+		  .defaultval = XFS_INODE_BIG_CLUSTER_SIZE,
+		},
 	},
 };
 
@@ -956,6 +965,7 @@ struct cli_params {
 	int	inopblock;
 	int	imaxpct;
 	int	lsectorsize;
+	int	clustersize;
 	uuid_t	uuid;
 
 	/* feature flags that are set */
@@ -993,6 +1003,7 @@ struct mkfs_params {
 	int		inodesize;
 	int		inodelog;
 	int		inopblock;
+	int		clustersize;
 
 	uint64_t	dblocks;
 	uint64_t	logblocks;
@@ -1055,7 +1066,7 @@ usage( void )
 /* force overwrite */	[-f]\n\
 /* inode size */	[-i perblock=n|size=num,maxpct=n,attr=0|1|2,\n\
 			    projid32bit=0|1,sparse=0|1,nrext64=0|1,\n\
-			    exchange=0|1]\n\
+			    exchange=0|1,clustersize=num]\n\
 /* no discard */	[-K]\n\
 /* log subvol */	[-l agnum=n,internal,size=num,logdev=xxx,version=n\n\
 			    sunit=value|su=num,sectsize=num,lazy-count=0|1,\n\
@@ -1756,6 +1767,9 @@ inode_opts_parser(
 	case I_EXCHANGE:
 		cli->sb_feat.exchrange = getnum(value, opts, subopt);
 		break;
+	case I_CLUSTERSIZE:
+		cli->clustersize = getnum(value, opts, subopt);
+		break;
 	default:
 		return -EINVAL;
 	}
@@ -2594,6 +2608,17 @@ validate_inodesize(
 	}
 }
 
+static void
+validate_clustersize(
+	struct mkfs_params	*cfg,
+	struct cli_params	*cli)
+{
+	if (cli->sb_feat.spinodes && cli->clustersize)
+		cfg->clustersize = cli->clustersize;
+	else
+		cfg->clustersize = XFS_INODE_BIG_CLUSTER_SIZE;
+}
+
 static xfs_rfsblock_t
 calc_dev_size(
 	char			*size,
@@ -3517,12 +3542,10 @@ sb_set_features(
 		sbp->sb_versionnum |= XFS_SB_VERSION_4;
 
 	if (fp->inode_align) {
-		int     cluster_size = XFS_INODE_BIG_CLUSTER_SIZE;
-
 		sbp->sb_versionnum |= XFS_SB_VERSION_ALIGNBIT;
 		if (cfg->sb_feat.crcs_enabled)
-			cluster_size *= cfg->inodesize / XFS_DINODE_MIN_SIZE;
-		sbp->sb_inoalignmt = cluster_size >> cfg->blocklog;
+			cfg->clustersize *= cfg->inodesize / XFS_DINODE_MIN_SIZE;
+		sbp->sb_inoalignmt = cfg->clustersize >> cfg->blocklog;
 	} else
 		sbp->sb_inoalignmt = 0;
 
@@ -4634,6 +4657,7 @@ main(
 	 */
 	validate_dirblocksize(&cfg, &cli);
 	validate_inodesize(&cfg, &cli);
+	validate_clustersize(&cfg, &cli);
 
 	/*
 	 * if the device size was specified convert it to a block count
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 4+ messages in thread

* Re: [PATCH 0/2] xfs: make cluster size tunnable for sparse allocation
  2024-12-16 13:05 [PATCH 0/2] xfs: make cluster size tunnable for sparse allocation Tianxiang Peng
  2024-12-16 13:05 ` [PATCH 1/2] xfs: calculate cluster_size_raw from sb when sparse alloc enabled Tianxiang Peng
  2024-12-16 13:05 ` [PATCH 2/2] mkfs: make cluster size tunnable " Tianxiang Peng
@ 2024-12-16 21:19 ` Dave Chinner
  2 siblings, 0 replies; 4+ messages in thread
From: Dave Chinner @ 2024-12-16 21:19 UTC (permalink / raw)
  To: Tianxiang Peng
  Cc: chandan.babu, djwong, p.raghav, mcgrof, brauner, dchinner,
	Tianxiang Peng, linux-xfs, linux-fsdevel, linux-kernel,
	allexjlzheng, flyingpeng

On Mon, Dec 16, 2024 at 09:05:47PM +0800, Tianxiang Peng wrote:
> This patch series makes inode cluster size a tunnable parameter in
> mkfs.xfs when sparse allocation is enabled, and also makes xfs use
> inode cluster size directly from the superblock read in rather than
> recalculate itself and verify.
> 
> Under extreme fragmentation situations, even inode sparse allocation
> may fail with current default inode cluster size i.e. 8192 bytes. Such
> situations may come from the PUNCH_HOLE fallocation which is used by
> some applications, for example MySQL innodb page compression. With xfs
> of 4K blocksize, MySQL may write out 16K buffer with direct I/O(which
> immediately triggers block allocation) then try to compress the 16K
> buffer to <4K. If the compression succeeds, MySQL will punch out the
> latter 12K, leave only the first 4K allocated:
> 	after write 16k buffer: OOOO
> 	after punch latter 12K: OXXX
> where O means page with block allocated, X means page without.
> 
> Such feature saves disk space(the 12K freed by punching can be used
> by others), but also makes the filesystem much more fragmented.
> Considering xfs has no automatic defragmentation mechanism, in the
> most extreme cases, there will be only 1-3 physically continuous
> blocks finally avaliable.
> 
> For data block allocation, such fragmentation is not a problem, as
> physical continuation is not always required. But inode chunk
> allocation requires so. Even for sparse allocation, physical
> continuation has also to be guaranteed in a way. Currently this
> value is calculated from a scaled inode cluster size. In xfs, inodes
> are manipulated(e.g. read in, logged, written back) in cluster, and
> the size of that cluster is just the inode cluster size. Sparse
> allocation unit currently is calculated from that:
> 	(inode size / MIN_INODE_SIZE) * inode cluster size
> 		-> sparse allocation aligmnet
> 			-> sparse allocation unit
> For example, under default mkfs configuration(i.e. crc and sparse
> allocation enabled, 4K blocksize), inode size is 512 bytes(2 times
> of MIN_INODE_SIZE=256 bytes), then sparse allocation unit will be
> 2 * current inode cluster size(8192 bytes) = 16384 bytes, that is
> 4 blocks. As we mentioned above, under extreme fragmentation, the
> filesystem may be full of 1-3 physically continuous blocks but can
> never find one of 4, so even sparese allocation will also fail. If
> we know application will easily create such fragmentation, then we
> had better have a way to loose sparse allocation requirement manually.

Please go an study mkfs.xfs -i align=1 does, how it affects
sb->s_inoalignmnt, and how that then affects sparse inode cluster
size and alignment. i.e.  Sparse inode clusters must be correctly
aligned and they have a fixed minimum size, so we can't just
arbitrarily select a sparse cluster size like these patches enable a
user to do.

> This patch series achieves that by making the source of sparse
> allocation unit, inode cluster size a tunnable parameter.

Fundamentally, I think this is the wrong way to solve the
problem because it requires the system admin to know ahead of time
that this specific database configuration is going to cause
fragmentation and inode allocation issues.

Once the problem manifests, it is too late to run mkfs to change the
geometry for the fs, so we really need to change the runtime
allocation policy code to minimise the impact of the data
fragmentation as much as possible.

As to that policy change, it has been discussed here:

https://lore.kernel.org/linux-xfs/20241104014439.3786609-1-zhangshida@kylinos.cn/

and my preferred generic solution to the problem is to define the
high AG space as metadata preferred, thereby preventing data
allocation from occurring in it until all other AGs are full of
data.

I'm still waiting to hear back as to whether the inode32 algorithm
behaves as expected (which uses metadata preferred AGs to direct
data to fill high AGs first) before we move forward with a
allocation policy based fix for this workload issue. If you can
reproduce the issue on demand, then perhaps you could also run the
same experiement - build a 2TB filesystem with ~300 AGs mounted
with inode32 and demonstrate that the upper AGs are filled to near
full before data spills to the lower AGs.

If inode32 behaves as it should under the mysql workload, then it
seems like a relatively trival tweak to the AG setup at mount time
to always reserve some space in the high AG(s) for inode allocation
and hence largely mitigate this problem for everyone....

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2024-12-16 21:19 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-12-16 13:05 [PATCH 0/2] xfs: make cluster size tunnable for sparse allocation Tianxiang Peng
2024-12-16 13:05 ` [PATCH 1/2] xfs: calculate cluster_size_raw from sb when sparse alloc enabled Tianxiang Peng
2024-12-16 13:05 ` [PATCH 2/2] mkfs: make cluster size tunnable " Tianxiang Peng
2024-12-16 21:19 ` [PATCH 0/2] xfs: make cluster size tunnable for sparse allocation Dave Chinner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox