public inbox for linux-xfs@vger.kernel.org
From: Dave Chinner <dgc@kernel.org>
To: Wang Yugui <wangyugui@e16-tech.com>
Cc: linux-xfs@vger.kernel.org
Subject: Re: [ RFC ] xfs: 4K inode support
Date: Thu, 23 Apr 2026 07:41:14 +1000	[thread overview]
Message-ID: <aelAeiyiAyFiJUgQ@dread> (raw)
In-Reply-To: <20260421230515.2234-1-wangyugui@e16-tech.com>

On Wed, Apr 22, 2026 at 07:05:15AM +0800, Wang Yugui wrote:
> use case for 4K inode
> - simpler logic for 4Kn device, and less lock.

Nope, neither of these is true.

There is no change in logic when inode sizes change, and there is no
change in locking as inode size changes.

This is because inodes are allocated in chunks of 64, and they are
read and written in clusters of 32 inodes. Hence all that changing
the size of the inode does is change the size of the inode cluster
buffer.

And therein lies the problem: 32 x 4kB inodes is 128kB. Looking at
xfs_types.h:

/*
 * Minimum and maximum blocksize and sectorsize.
 * The blocksize upper limit is pretty much arbitrary.
 * The sectorsize upper limit is due to sizeof(sb_sectsize).
 * CRC enabled filesystems use 512 byte inodes, meaning 512 byte block sizes
 * cannot be used.
 */
#define XFS_MIN_BLOCKSIZE_LOG   9       /* i.e. 512 bytes */
#define XFS_MAX_BLOCKSIZE_LOG   16      /* i.e. 65536 bytes */
#define XFS_MIN_BLOCKSIZE       (1 << XFS_MIN_BLOCKSIZE_LOG)
#define XFS_MAX_BLOCKSIZE       (1 << XFS_MAX_BLOCKSIZE_LOG)

Yup, XFS defines a maximum block size of 64kB, and inode cluster
buffers are already at this maximum size for 2kB inodes.

> - better performance for directory with many files.

No, it won't make any difference to large directory performance
because they are in block/leaf/node form and all the directory
information is held in extents external to the inode. The size of
the directory inode really does not influence the performance of the
directory once it transitions out of inline format.

In fact, larger inode sizes result in lower performance for
directory ops, because the metadata footprint has increased in
size and so every inode cluster IO now has higher latency and
consumes more IO bandwidth. i.e. the -inode operations- that are
done during directory modifications are slower...

Then there's the larger memory footprint of the buffer cache due to
cached inode cluster buffers - in most cases that's all wasted space
because inode metadata is typically just an inode core (176 bytes),
a couple of extent records (16 bytes each) and maybe a couple of
xattrs (e.g. selinux). So a typical inode will only contain maybe
300 bytes of metadata, yet now they take up 4kB of RAM -each- when
resident in the buffer cache...

> - maybe inline data support later.

That's a whole different problem - it doesn't require inode sizes to
be expanded to implement.

> TODO:
> still crash in xfs_trans_read_buf_map() when mount a 4K inode xfs now.

Good luck with that - there's several issues with on-disk format
constants that need to be sorted out before IO will work. e.g.
you'll hit this error through _xfs_trans_bjoin():

                        xfs_err(mp,
        "buffer item dirty bitmap (%u uints) too small to reflect %u bytes!",
                                        map_size,
                                        BBTOB(bp->b_maps[i].bm_len));

and it will shut down with a corruption error. That's indicating
that the on-disk journal format for buffer logging does not support
the buffer size being read. i.e. there's a problem with the inode
cluster size....

IOWs, there are -lots- of complex and critical subsystems that
increasing the inode size will break and that will need to be fixed.
Changing a fundamental on-disk format constant isn't a simple thing
to do; an AI will not be able to tell you all the things you need to
change and test without already knowing where all the architectural
problems are to begin with....

Without an actual solid reason for making fundamental on-disk format
changes and a commitment of significant time and testing resources,
changes of this scope are unlikely to be made...

-Dave.
-- 
Dave Chinner
dgc@kernel.org


Thread overview: 8+ messages
2026-04-21  1:42 4K inode support on 4Kn device Wang Yugui
2026-04-21  5:48 ` Carlos Maiolino
2026-04-21 23:01 ` [PATCH] xfsprogs: 4K inode support Wang Yugui
2026-04-21 23:05 ` [ RFC ] xfs: " Wang Yugui
2026-04-22 21:41   ` Dave Chinner [this message]
2026-04-22 23:02     ` Wang Yugui
2026-04-23  2:09       ` Eric Sandeen
2026-04-23  2:18       ` Dave Chinner
