From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Thu, 23 Apr 2026 07:02:27 +0800
From: 
Wang Yugui <wangyugui@e16-tech.com>
To: Dave Chinner
Subject: Re: [ RFC ] xfs: 4K inode support
Cc: linux-xfs@vger.kernel.org
References: <20260421230515.2234-1-wangyugui@e16-tech.com>
Message-Id: <20260423070227.B2C6.409509F4@e16-tech.com>
X-Mailing-List: linux-xfs@vger.kernel.org

Hi,

> On Wed, Apr 22, 2026 at 07:05:15AM +0800, Wang Yugui wrote:
> > use case for 4K inode
> > - simpler logic for 4Kn device, and less locking.
>
> Nope, neither of these are true.
>
> There is no change in logic when inode sizes change, and there is no
> change in locking as inode size changes.
>
> This is because inodes are allocated in chunks of 64, and they are
> read and written in clusters of 32 inodes. Hence all that changing
> the size of the inode does is change the size of the inode cluster
> buffer.
>
> And therein lies the problem: 32 x 4kB inodes is 128kB. Looking at
> xfs_types.h:
>
> /*
>  * Minimum and maximum blocksize and sectorsize.
>  * The blocksize upper limit is pretty much arbitrary.
>  * The sectorsize upper limit is due to sizeof(sb_sectsize).
>  * CRC enabled filesystems use 512 byte inodes, meaning 512 byte block sizes
>  * cannot be used.
>  */
> #define XFS_MIN_BLOCKSIZE_LOG	9	/* i.e. 512 bytes */
> #define XFS_MAX_BLOCKSIZE_LOG	16	/* i.e. 65536 bytes */
> #define XFS_MIN_BLOCKSIZE	(1 << XFS_MIN_BLOCKSIZE_LOG)
> #define XFS_MAX_BLOCKSIZE	(1 << XFS_MAX_BLOCKSIZE_LOG)
>
> Yup, XFS defines a maximum block size of 64kB, and inode cluster
> buffers are already at this maximum size for 2kB inodes.
>
> > - better performance for directory with many files.
>
> No, it won't make any difference to large directory performance,
> because such directories are in block/leaf/node form and all the
> directory information is held in extents external to the inode.
> The size of the directory inode really does not influence the
> performance of the directory once it transitions out of inline
> format.
>
> In fact, larger inode sizes result in lower performance for
> directory ops, because the metadata footprint has increased in
> size and so every inode cluster IO now has higher latency and
> consumes more IO bandwidth. i.e. the -inode operations- that are
> done during directory modifications are slower...
>
> Then there's the larger memory footprint of the buffer cache due to
> cached inode cluster buffers - in most cases that's all wasted space,
> because inode metadata is typically just an inode core (176 bytes),
> a couple of extent records (16 bytes each) and maybe a couple of
> xattrs (e.g. selinux). So a typical inode will only contain maybe
> 300 bytes of metadata, yet now they take up 4kB of RAM -each- when
> resident in the buffer cache...
>
> > - maybe inline data support later.
>
> That's a whole different problem - it doesn't require inode sizes to
> be expanded to implement.
>
> > TODO:
> > still crash in xfs_trans_read_buf_map() when mounting a 4K inode xfs now.
>
> Good luck with that - there's several issues with on-disk format
> constants that need to be sorted out before IO will work. e.g.
> you'll hit this error through _xfs_trans_bjoin():
>
> 	xfs_err(mp,
> "buffer item dirty bitmap (%u uints) too small to reflect %u bytes!",
> 		map_size,
> 		BBTOB(bp->b_maps[i].bm_len));
>
> and it will shut down with a corruption error. That's indicating
> that the on-disk journal format for buffer logging does not support
> the buffer size being read. i.e. there's a problem with the inode
> cluster size....
>
> IOWs, there are -lots- of complex and critical subsystems that
> increasing the inode size will break and need to be fixed.
> Changing a fundamental on-disk format constant isn't a simple thing
> to do; an AI will not be able to tell you all the things you need to
> change and test without already knowing where all the architectural
> problems are to begin with....
>
> Without an actual solid reason for making fundamental on-disk format
> changes and a commitment of significant time and testing resources,
> changes of this scope are unlikely to be made...
>
> -Dave.
> -- 
> Dave Chinner
> dgc@kernel.org

Thanks a lot for this info.

My basic reasoning was that, on a 4Kn device, the minimum I/O size is
already 4K, and 4Kn devices (SSD and RAID) have become more common now.

On a 4Kn device, could we do I/O on one single 4K inode without
touching other inodes? So maybe better performance on high-speed SSDs
such as PCIe Gen5/Gen6?

Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2026/04/23