* 4K inode support on 4Kn device
From: Wang Yugui @ 2026-04-21 1:42 UTC (permalink / raw)
To: linux-xfs
Hi,
I want to support 4K inodes on 4Kn devices.

1. Change XFS_DINODE_MAX_SIZE/XFS_DINODE_MAX_LOG
from 2048/11 to 4096/12.

2. Do we need to keep the following restriction in the mkfs.xfs man page?
"the inode size cannot exceed one half of the filesystem block size."
Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2026/04/21
* Re: 4K inode support on 4Kn device
From: Carlos Maiolino @ 2026-04-21 5:48 UTC (permalink / raw)
To: Wang Yugui; +Cc: linux-xfs
On Tue, Apr 21, 2026 at 09:42:05AM +0800, Wang Yugui wrote:
> Hi,
>
> I want to support 4K inodes on 4Kn devices.
>
> 1. Change XFS_DINODE_MAX_SIZE/XFS_DINODE_MAX_LOG
> from 2048/11 to 4096/12.
>
> 2. Do we need to keep the following restriction in the mkfs.xfs man page?
> "the inode size cannot exceed one half of the filesystem block size."
Can you please explain in more detail what you are trying to achieve with
4KiB inodes?
>
> Best Regards
> Wang Yugui (wangyugui@e16-tech.com)
> 2026/04/21
* [PATCH] xfsprogs: 4K inode support
From: Wang Yugui @ 2026-04-21 23:01 UTC (permalink / raw)
To: linux-xfs; +Cc: Wang Yugui
---
include/xfs_multidisk.h | 2 +-
libxfs/xfs_format.h | 2 +-
libxfs/xfs_sb.c | 1 +
man/man8/mkfs.xfs.8.in | 3 +--
4 files changed, 4 insertions(+), 4 deletions(-)
diff --git a/include/xfs_multidisk.h b/include/xfs_multidisk.h
index ef4443b0bb0e..b77a915811f9 100644
--- a/include/xfs_multidisk.h
+++ b/include/xfs_multidisk.h
@@ -14,7 +14,7 @@
#define XFS_DFL_BLOCKSIZE_LOG 12 /* 4096 byte blocks */
#define XFS_DINODE_DFL_LOG 8 /* 256 byte inodes */
#define XFS_DINODE_DFL_CRC_LOG 9 /* 512 byte inodes for CRCs */
-#define XFS_MIN_INODE_PERBLOCK 2 /* min inodes per block */
+#define XFS_MIN_INODE_PERBLOCK 1 /* min inodes per block */
#define XFS_DFL_IMAXIMUM_PCT 25 /* max % of space for inodes */
#define XFS_MIN_REC_DIRSIZE 12 /* 4096 byte dirblocks (V2) */
#define XFS_MAX_INODE_SIG_BITS 32 /* most significant bits in an
diff --git a/libxfs/xfs_format.h b/libxfs/xfs_format.h
index 779dac59b1f3..84cce8d268e6 100644
--- a/libxfs/xfs_format.h
+++ b/libxfs/xfs_format.h
@@ -1079,7 +1079,7 @@ enum xfs_dinode_fmt {
* Inode minimum and maximum sizes.
*/
#define XFS_DINODE_MIN_LOG 8
-#define XFS_DINODE_MAX_LOG 11
+#define XFS_DINODE_MAX_LOG 12
#define XFS_DINODE_MIN_SIZE (1 << XFS_DINODE_MIN_LOG)
#define XFS_DINODE_MAX_SIZE (1 << XFS_DINODE_MAX_LOG)
diff --git a/libxfs/xfs_sb.c b/libxfs/xfs_sb.c
index dd14c3ab3b59..9c57d13e50bc 100644
--- a/libxfs/xfs_sb.c
+++ b/libxfs/xfs_sb.c
@@ -717,6 +717,7 @@ xfs_validate_sb_common(
case 512:
case 1024:
case 2048:
+ case 4096:
break;
default:
xfs_warn(mp, "inode size of %d bytes not supported",
diff --git a/man/man8/mkfs.xfs.8.in b/man/man8/mkfs.xfs.8.in
index fbafc5c79e02..b0a2222d370d 100644
--- a/man/man8/mkfs.xfs.8.in
+++ b/man/man8/mkfs.xfs.8.in
@@ -638,8 +638,7 @@ The minimum (and default)
is 256 bytes without crc, 512 bytes with crc enabled.
The maximum
.I value
-is 2048 (2 KiB) subject to the restriction that
-the inode size cannot exceed one half of the filesystem block size.
+is 4096 (4 KiB).
.IP
XFS uses 64-bit inode numbers internally; however, the number of
significant bits in an inode number
--
2.36.2
* [ RFC ] xfs: 4K inode support
From: Wang Yugui @ 2026-04-21 23:05 UTC (permalink / raw)
To: linux-xfs; +Cc: Wang Yugui
Use cases for 4K inodes:
- simpler logic for 4Kn devices, and less locking.
- better performance for directories with many files.
- maybe inline data support later.

TODO:
still crashes in xfs_trans_read_buf_map() when mounting a 4K-inode XFS
filesystem.
---
fs/xfs/libxfs/xfs_format.h | 2 +-
fs/xfs/libxfs/xfs_metafile.c | 2 +-
fs/xfs/libxfs/xfs_sb.c | 1 +
3 files changed, 3 insertions(+), 2 deletions(-)
diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
index 779dac59b1f3..84cce8d268e6 100644
--- a/fs/xfs/libxfs/xfs_format.h
+++ b/fs/xfs/libxfs/xfs_format.h
@@ -1079,7 +1079,7 @@ enum xfs_dinode_fmt {
* Inode minimum and maximum sizes.
*/
#define XFS_DINODE_MIN_LOG 8
-#define XFS_DINODE_MAX_LOG 11
+#define XFS_DINODE_MAX_LOG 12
#define XFS_DINODE_MIN_SIZE (1 << XFS_DINODE_MIN_LOG)
#define XFS_DINODE_MAX_SIZE (1 << XFS_DINODE_MAX_LOG)
diff --git a/fs/xfs/libxfs/xfs_metafile.c b/fs/xfs/libxfs/xfs_metafile.c
index cf239f862212..9db799576775 100644
--- a/fs/xfs/libxfs/xfs_metafile.c
+++ b/fs/xfs/libxfs/xfs_metafile.c
@@ -98,7 +98,7 @@ xfs_metafile_resv_can_cover(
* isn't critical unless there also isn't enough free space.
*/
return xfs_compare_freecounter(mp, XC_FREE_BLOCKS,
- rhs - mp->m_metafile_resv_avail, 2048) >= 0;
+ rhs - mp->m_metafile_resv_avail, 4096) >= 0;
}
/*
diff --git a/fs/xfs/libxfs/xfs_sb.c b/fs/xfs/libxfs/xfs_sb.c
index 38d16fe1f6d8..39424a7c74df 100644
--- a/fs/xfs/libxfs/xfs_sb.c
+++ b/fs/xfs/libxfs/xfs_sb.c
@@ -734,6 +734,7 @@ xfs_validate_sb_common(
case 512:
case 1024:
case 2048:
+ case 4096:
break;
default:
xfs_warn(mp, "inode size of %d bytes not supported",
--
2.36.2
* Re: [ RFC ] xfs: 4K inode support
From: Dave Chinner @ 2026-04-22 21:41 UTC (permalink / raw)
To: Wang Yugui; +Cc: linux-xfs
On Wed, Apr 22, 2026 at 07:05:15AM +0800, Wang Yugui wrote:
> Use cases for 4K inodes:
> - simpler logic for 4Kn devices, and less locking.
Nope, neither of these are true.
There is no change in logic when inode sizes change, and there is no
change in locking as inode size changes.
This is because inodes are allocated in chunks of 64, and they are
read and written in clusters of 32 inodes. Hence all that changing
the size of the inode does is change the size of the inode cluster
buffer.
And therein lies the problem: 32 x 4kB inodes is 128kB. Looking at
xfs_types.h:
/*
* Minimum and maximum blocksize and sectorsize.
* The blocksize upper limit is pretty much arbitrary.
* The sectorsize upper limit is due to sizeof(sb_sectsize).
* CRC enabled filesystems use 512 byte inodes, meaning 512 byte block sizes
* cannot be used.
*/
#define XFS_MIN_BLOCKSIZE_LOG 9 /* i.e. 512 bytes */
#define XFS_MAX_BLOCKSIZE_LOG 16 /* i.e. 65536 bytes */
#define XFS_MIN_BLOCKSIZE (1 << XFS_MIN_BLOCKSIZE_LOG)
#define XFS_MAX_BLOCKSIZE (1 << XFS_MAX_BLOCKSIZE_LOG)
Yup, XFS defines a maximum block size of 64kB, and inode cluster
buffers are already at this maximum size for 2kB inodes.
> - better performance for directories with many files.
No, it won't make any difference to large directory performance
because they are in block/leaf/node form and all the directory
information is held in extents external to the inode. The size of
the directory inode really does not influence the performance of the
directory once it transitions out of inline format.
In fact, larger inode sizes result in lower performance for
directory ops, because the metadata footprint has increased in
size and so every inode cluster IO now has higher latency and
consumes more IO bandwidth. i.e. the -inode operations- that are
done during directory modifications are slower...
Then there's the larger memory footprint of the buffer cache due to
cached inode cluster buffers - in most cases that's all wasted space
because inode metadata is typically just an inode core (176 bytes),
a couple of extent records (16 bytes each) and maybe a couple of
xattrs (e.g. selinux). So a typical inode will only contain maybe
300 bytes of metadata, yet now they take up 4kB of RAM -each- when
resident in the buffer cache...
> - maybe inline data support later.
That's a whole different problem - it doesn't require inode sizes to
be expanded to implement.
> TODO:
> still crashes in xfs_trans_read_buf_map() when mounting a 4K-inode XFS
> filesystem.
Good luck with that - there's several issues with on-disk format
constants that need to be sorted out before IO will work. e.g.
you'll hit this error through _xfs_trans_bjoin():
xfs_err(mp,
"buffer item dirty bitmap (%u uints) too small to reflect %u bytes!",
map_size,
BBTOB(bp->b_maps[i].bm_len));
and it will shut down with a corruption error. That's indicating
that the on-disk journal format for buffer logging does not support
the buffer size being read. i.e. there's a problem with the inode
cluster size....
IOWs, there are -lots- of complex and critical subsystems that
increasing the inode size will break and that will need to be fixed.
Changing a fundamental on-disk format constant isn't a simple thing
to do; an AI will not be able to tell you all the things you need to
change and test without already knowing where all the architectural
problems are to begin with....
Without an actual solid reason for making fundamental on-disk format
changes and a commitment of significant time and testing resources,
changes of this scope are unlikely to be made...
-Dave.
--
Dave Chinner
dgc@kernel.org
* Re: [ RFC ] xfs: 4K inode support
From: Wang Yugui @ 2026-04-22 23:02 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-xfs
Hi,
> On Wed, Apr 22, 2026 at 07:05:15AM +0800, Wang Yugui wrote:
> > Use cases for 4K inodes:
> > - simpler logic for 4Kn devices, and less locking.
>
> Nope, neither of these are true.
>
> There is no change in logic when inode sizes change, and there is no
> change in locking as inode size changes.
.....
Thanks a lot for this info.

The basic logic is that, for a 4Kn device, the minimum I/O size is already 4K,
and 4Kn devices (SSD and RAID) have become more common now.

On a 4Kn device, can we do I/O on a single 4K inode without touching any other
inode? So maybe better performance for high-speed SSDs such as PCIe Gen5/Gen6?
Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2026/04/23
* Re: [ RFC ] xfs: 4K inode support
2026-04-22 23:02 ` Wang Yugui
@ 2026-04-23 2:09 ` Eric Sandeen
2026-04-23 2:18 ` Dave Chinner
1 sibling, 0 replies; 8+ messages in thread
From: Eric Sandeen @ 2026-04-23 2:09 UTC (permalink / raw)
To: Wang Yugui, Dave Chinner; +Cc: linux-xfs
On 4/22/26 6:02 PM, Wang Yugui wrote:
> Thanks a lot for this info.
>
> The basic logic is that, for a 4Kn device, the minimum I/O size is already 4K,
> and 4Kn devices (SSD and RAID) have become more common now.
> On a 4Kn device, can we do I/O on a single 4K inode without touching any other
> inode? So maybe better performance for high-speed SSDs such as PCIe Gen5/Gen6?
We already do efficient reads of inodes, even on 4Kn devices. As Dave
explained, "inodes are allocated in chunks of 64, and they are read and written
in clusters of 32 inodes." Even with the smallest 512-byte inodes, we do inode IO
in efficiently sized batches. Reading larger clusters of larger inodes filled with
mostly-zeros will not improve performance in any way. It will only waste IO,
memory, and disk space.
-Eric
> Best Regards
> Wang Yugui (wangyugui@e16-tech.com)
> 2026/04/23
* Re: [ RFC ] xfs: 4K inode support
From: Dave Chinner @ 2026-04-23 2:18 UTC (permalink / raw)
To: Wang Yugui; +Cc: linux-xfs
On Thu, Apr 23, 2026 at 07:02:27AM +0800, Wang Yugui wrote:
> Hi,
>
> > On Wed, Apr 22, 2026 at 07:05:15AM +0800, Wang Yugui wrote:
> > > Use cases for 4K inodes:
> > > - simpler logic for 4Kn devices, and less locking.
> >
> > Nope, neither of these are true.
> >
> > There is no change in logic when inode sizes change, and there is no
> > change in locking as inode size changes.
> >
> > This is because inodes are allocated in chunks of 64, and they are
> > read and written in clusters of 32 inodes. Hence all that changing
> > the size of the inode does is change the size of the inode cluster
> > buffer.
.....
> On a 4Kn device, can we do I/O on a single 4K inode without touching any other
> inode? So maybe better performance for high-speed SSDs such as PCIe Gen5/Gen6?
Yes, you can do lots of 4kB IOs, but you can move more data in/out
of memory by doing 8kB IOs, yes?
In reality, on-disk inodes are not independent. They are allocated
and freed in contiguous chunks of 64 inodes, and the inode cluster
buffer is used for bulk initialisation, logging unlinked list
changes, etc.
Application operations on inodes often occur in batches, and XFS's
inode allocation algorithms usually provide physical locality of
inodes for a given workload. Hence for typical data set access
patterns, inode clustering usually results in a reduction of inode
IO due to increases in inode cluster buffer cache hit ratios.
If you want to test whether 8kB inode cluster buffers
result in higher performance than using 32 inodes per buffer,
then you can do that with some tweaks to the sb->sb_inoalignmt
value set by mkfs.xfs. See the xfs_ialloc_setup_geometry() function
for details on how to modify that setting during mkfs to influence
the cluster buffer size the kernel will configure.
If you create a filesystem with 2kB inodes and an 8kB cluster buffer
size, you are going to see a different performance profile compared
to using a 64kB inode cluster buffer. It will very much depend on
the workload and cache hit patterns as to whether that is a
performance win or a performance degradation.
The typical situation is that smaller cluster buffers reduce cache
hits and so increase both metadata IOPS (read and write) and
per-metadata operation CPU overhead due to needing to manage more
buffers (e.g. inode chunk allocation now has to allocate,
initialise, log and write back 16 buffers instead of 2).
Workloads that benefit from smaller buffers tend to have large
working sets of inodes (i.e. don't fit in cache) and low
physical locality in their inode access patterns (i.e. random file
access patterns). There aren't a lot of workloads with those
characteristics, especially when modern servers have hundreds of GBs
to TBs of RAM in them.
So before you start asking us to review code changes, first show us
that we can meaningfully improve application performance by reducing
inode cluster sizes and increasing the number of inode metadata IOPS
needed for any given inode intensive workload....
-Dave.
--
Dave Chinner
dgc@kernel.org