From: Wang Yugui <wangyugui@e16-tech.com>
To: Dave Chinner <dgc@kernel.org>
Cc: linux-xfs@vger.kernel.org
Subject: Re: [ RFC ] xfs: 4K inode support
Date: Thu, 23 Apr 2026 07:02:27 +0800 [thread overview]
Message-ID: <20260423070227.B2C6.409509F4@e16-tech.com> (raw)
In-Reply-To: <aelAeiyiAyFiJUgQ@dread>
Hi,
> On Wed, Apr 22, 2026 at 07:05:15AM +0800, Wang Yugui wrote:
> > use case for 4K inode
> > - simpler logic for 4Kn device, and less lock.
>
> Nope, neither of these are true.
>
> There is no change in logic when inode sizes change, and there is no
> change in locking as inode size changes.
>
> This is because inodes are allocated in chunks of 64, and they are
> read and written in clusters of 32 inodes. Hence all that changing
> the size of the inode does is change the size of the inode cluster
> buffer.
>
> And therein lies the problem: 32 x 4kB inodes is 128kB. Looking at
> xfs_types.h:
>
> /*
> * Minimum and maximum blocksize and sectorsize.
> * The blocksize upper limit is pretty much arbitrary.
> * The sectorsize upper limit is due to sizeof(sb_sectsize).
> * CRC enable filesystems use 512 byte inodes, meaning 512 byte block sizes
> * cannot be used.
> */
> #define XFS_MIN_BLOCKSIZE_LOG 9 /* i.e. 512 bytes */
> #define XFS_MAX_BLOCKSIZE_LOG 16 /* i.e. 65536 bytes */
> #define XFS_MIN_BLOCKSIZE (1 << XFS_MIN_BLOCKSIZE_LOG)
> #define XFS_MAX_BLOCKSIZE (1 << XFS_MAX_BLOCKSIZE_LOG)
>
> Yup, XFS defines a maximum block size of 64kB, and inode cluster
> buffers are already at this maximum size for 2kB inodes.
>
> > - better performance for directory with many files.
>
> No, it won't make any difference to large directory performance
> because they are in block/leaf/node form and all the directory
> information is held in extents external to the inode. The size of
> the directory inode really does not influence the performance of the
> directory once it transitions out of inline format.
>
> In fact, larger inode sizes result in lower performance for
> directory ops, because the metadata footprint has increased in
> size and so every inode cluster IO now has higher latency and
> consumes more IO bandwidth. i.e. the -inode operations- that are
> done during directory modifications are slower...
>
> Then there's the larger memory footprint of the buffer cache due to
> cached inode cluster buffers - in most cases that's all wasted space
> because inode metadata is typically just an inode core (176 bytes),
> a couple of extent records (16 bytes each) and maybe a couple of
> xattrs (e.g. selinux). So a typical inode will only contain maybe
> 300 bytes of metadata, yet now they take up 4kB of RAM -each- when
> resident in the buffer cache...
>
> > - maybe inline data support later.
>
> That's a whole different problem - it doesn't require inode sizes to
> be expanded to implement.
>
> > TODO:
> > still crash in xfs_trans_read_buf_map() when mount a 4K inode xfs now.
>
> Good luck with that - there's several issues with on-disk format
> constants that need to be sorted out before IO will work. e.g.
> you'll hit this error through _xfs_trans_bjoin():
>
> xfs_err(mp,
> "buffer item dirty bitmap (%u uints) too small to reflect %u bytes!",
> map_size,
> BBTOB(bp->b_maps[i].bm_len));
>
> and it will shut down with a corruption error. That's indicating
> that the on-disk journal format for buffer logging does not support
> the buffer size being read. i.e. there's a problem with the inode
> cluster size....
>
> IOWs, there are -lots- of complex and critical subsystems that
> increasing the inode size will break and that need to be fixed.
> Changing a fundamental on-disk format constant isn't a simple thing
> to do; an AI will not be able to tell you all the things you need to
> change and test without already knowing where all the architectural
> problems are to begin with....
>
> Without an actual solid reason for making fundamental on-disk format
> changes and a commitment of significant time and testing resources,
> changes of this scope are unlikely to be made...
>
> -Dave.
> --
> Dave Chinner
> dgc@kernel.org
Thanks a lot for this info.
The basic logic is that, for a 4Kn device, the minimum I/O size is already 4K,
and 4Kn devices (SSD and RAID) have become more common now.
On a 4Kn device, could we do I/O on a single 4K inode without interaction with
other inodes? So maybe better performance for high-speed SSDs such as PCIe Gen5/Gen6?
Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2026/04/23
Thread overview: 8+ messages
2026-04-21 1:42 4K inode support on 4Kn device Wang Yugui
2026-04-21 5:48 ` Carlos Maiolino
2026-04-21 23:01 ` [PATCH] xfsprogs: 4K inode support Wang Yugui
2026-04-21 23:05 ` [ RFC ] xfs: " Wang Yugui
2026-04-22 21:41 ` Dave Chinner
2026-04-22 23:02 ` Wang Yugui [this message]
2026-04-23 2:09 ` Eric Sandeen
2026-04-23 2:18 ` Dave Chinner