From: Eric Sandeen <sandeen@sandeen.net>
To: Dave Chinner <david@fromorbit.com>, xfs@oss.sgi.com
Subject: Re: [PATCH 5/5] xfs: increase inode cluster size for v5 filesystems
Date: Fri, 08 Nov 2013 12:21:20 -0600 [thread overview]
Message-ID: <527D2BA0.7000504@sandeen.net> (raw)
In-Reply-To: <1383280040-21979-6-git-send-email-david@fromorbit.com>
On 10/31/13, 11:27 PM, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> v5 filesystems use 512 byte inodes as a minimum, so read inodes in
I think you meant :
"..., so today we read inodes in clusters that are..."
^^^^^^^^
- changes the meaning quite a bit. :)
> clusters that are effectively half the size of a v4 filesystem with
> 256 byte inodes. For v5 fielsystems, scale the inode cluster size
> with the size of the inode so that we keep a constant 32 inodes per
> cluster ratio for all inode IO.
>
> This only works if mkfs.xfs sets the inode alignment appropriately
> for larger inode clusters, so this functionality is made conditional
> on mkfs doing the right thing. xfs_repair needs to know about
> the inode alignment changes, too.
>
> Wall time:
> create bulkstat find+stat ls -R unlink
> v4 237s 161s 173s 201s 299s
> v5 235s 163s 205s 31s 356s
> patched 234s 160s 182s 29s 317s
>
> System time:
> create bulkstat find+stat ls -R unlink
> v4 2601s 2490s 1653s 1656s 2960s
> v5 2637s 2497s 1681s 20s 3216s
> patched 2613s 2451s 1658s 20s 3007s
>
> So, wall time same or down across the board, system time same or
> down across the board, and cache hit rates all improve except for
> the ls -R case which is a pure cold cache directory read workload
> on v5 filesystems...
>
> So, this patch removes most of the performance and CPU usage
> differential between v4 and v5 filesystems on traversal related
> workloads.
We already have some issues w/ larger inode clusters & freespace
fragmentation resulting in the inability to create new clusters,
right?
How might this impact that problem? I understand that it may be
a tradeoff.
> Note: while this patch is currently for v5 filesystems only, there
> is no reason it can't be ported back to v4 filesystems. This hasn't
> been done here because bringing the code back to v4 requires
> forwards and backwards kernel compatibility testing. i.e. to
> deterine if older kernels(*) do the right thing with larger inode
> alignments but still only using 8k inode cluster sizes. None of this
> testing and validation on v4 filesystems has been done, so for the
> moment larger inode clusters is limited to v5 superblocks.
Thanks for that explanation.
> (*) a current default config v4 filesystem should mount just fine on
> 2.6.23 (when lazy-count support was introduced), and so if we change
> the alignment emitted by mkfs without a feature bit then we have to
> make sure it works properly on all kernels since 2.6.23. And if we
> allow it to be changed when the lazy-count bit is not set, then it's
> all kernels since v2 logs were introduced that need to be tested for
> compatibility...
>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
> fs/xfs/xfs_mount.c | 15 +++++++++++++++
> fs/xfs/xfs_mount.h | 2 +-
> fs/xfs/xfs_trans_resv.c | 3 +--
> 3 files changed, 17 insertions(+), 3 deletions(-)
>
> diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
> index da88f16..02df7b4 100644
> --- a/fs/xfs/xfs_mount.c
> +++ b/fs/xfs/xfs_mount.c
> @@ -41,6 +41,7 @@
> #include "xfs_fsops.h"
> #include "xfs_trace.h"
> #include "xfs_icache.h"
> +#include "xfs_dinode.h"
>
>
> #ifdef HAVE_PERCPU_SB
> @@ -718,8 +719,22 @@ xfs_mountfs(
> * Set the inode cluster size.
> * This may still be overridden by the file system
> * block size if it is larger than the chosen cluster size.
> + *
> + * For v5 filesystems, scale the cluster size with the inode size to
> + * keep a constant ratio of inode per cluster buffer, but only if mkfs
> + * has set the inode alignment value appropriately for larger cluster
> + * sizes.
> */
> mp->m_inode_cluster_size = XFS_INODE_BIG_CLUSTER_SIZE;
Just thinking out loud here: So this is runtime only; nothing on disk sets
the cluster size explicitly (granted, it never did).
So moving back and forth across newer/older kernels will create clusters
of different sizes on the same filesystem, right?
(In the very distant past, this same change could have happened if
the amount of memory in a box changed (!) - see commit
425f9ddd534573f58df8e7b633a534fcfc16d44d; prior to that we set
m_inode_cluster_size on the fly as well).
But sb_inoalignmt is a mkfs-set, on-disk feature. So we might start with
i.e. this, where A1 are 8k alignment points, and 512 byte inodes, in clusters
of size 8k / 16 inodes:
A1 A1 A1 A1
[ 16 inodes ][ 16 inodes ] [ 16 inodes ]
and in this case we couldn't bump up m_inode_cluster_size, lest we
allocate a larger cluster on the smaller alignment & overlap:
A1 A1 A1 A1
[ 16 inodes ][ 16 inodes ] [ 16 inodes ] <--- existing
[ 32 inodes ] <--- new
But if we had a larger (say, 16k) inode alignment, like: we could
safely do:
A1 A2 A2
[ 16 inodes ] [ 16 inodes ]
[ 32 inodes ] <--- new
and be ok.
So the only other thing I wonder about is when we are handling
pre-existing, smaller-than m_inode_cluster_size clusters.
i.e. xfs_ifree_cluster() figures out the number of blocks &
number of inodes in a cluster, based on the (now not
constant) m_inode_cluster_size.
What stops us from going off the end of a smaller cluster?
</handwave>
I'm not up to speed on inode IO TBH, so will dig into it a little more.
Aside from all those vague questions, this seems sane. ;)
> + if (xfs_sb_version_hascrc(&mp->m_sb)) {
> + int new_size = mp->m_inode_cluster_size;
> +
> + new_size *= mp->m_sb.sb_inodesize / XFS_DINODE_MIN_SIZE;
> + if (mp->m_sb.sb_inoalignmt >= XFS_B_TO_FSBT(mp, new_size))
> + mp->m_inode_cluster_size = new_size;
> + xfs_info(mp, "Using inode cluster size of %d bytes",
> + mp->m_inode_cluster_size);
> + }
>
> /*
> * Set inode alignment fields
next 2 hunks are just tidying, right.
Thanks,
-Eric
> diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
> index 1d8101a..a466c5e 100644
> --- a/fs/xfs/xfs_mount.h
> +++ b/fs/xfs/xfs_mount.h
> @@ -112,7 +112,7 @@ typedef struct xfs_mount {
> __uint8_t m_blkbb_log; /* blocklog - BBSHIFT */
> __uint8_t m_agno_log; /* log #ag's */
> __uint8_t m_agino_log; /* #bits for agino in inum */
> - __uint16_t m_inode_cluster_size;/* min inode buf size */
> + uint m_inode_cluster_size;/* min inode buf size */
> uint m_blockmask; /* sb_blocksize-1 */
> uint m_blockwsize; /* sb_blocksize in words */
> uint m_blockwmask; /* blockwsize-1 */
> diff --git a/fs/xfs/xfs_trans_resv.c b/fs/xfs/xfs_trans_resv.c
> index d53d9f0..2fd59c0 100644
> --- a/fs/xfs/xfs_trans_resv.c
> +++ b/fs/xfs/xfs_trans_resv.c
> @@ -385,8 +385,7 @@ xfs_calc_ifree_reservation(
> xfs_calc_inode_res(mp, 1) +
> xfs_calc_buf_res(2, mp->m_sb.sb_sectsize) +
> xfs_calc_buf_res(1, XFS_FSB_TO_B(mp, 1)) +
> - MAX((__uint16_t)XFS_FSB_TO_B(mp, 1),
> - XFS_INODE_CLUSTER_SIZE(mp)) +
> + max_t(uint, XFS_FSB_TO_B(mp, 1), XFS_INODE_CLUSTER_SIZE(mp)) +
> xfs_calc_buf_res(1, 0) +
> xfs_calc_buf_res(2 + XFS_IALLOC_BLOCKS(mp) +
> mp->m_in_maxlevels, 0) +
>
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
next prev parent reply other threads:[~2013-11-08 18:21 UTC|newest]
Thread overview: 30+ messages / expand[flat|nested] mbox.gz Atom feed top
2013-11-01 4:27 [PATCH 0/5] xfs: more patches for 3.13 Dave Chinner
2013-11-01 4:27 ` [PATCH 1/5] xfs: xfs_remove deadlocks due to inverted AGF vs AGI lock ordering Dave Chinner
2013-11-01 4:27 ` [PATCH 2/5] xfs: open code inc_inode_iversion when logging an inode Dave Chinner
2013-11-05 16:41 ` Christoph Hellwig
2013-11-18 21:54 ` Eric Sandeen
2013-11-18 22:28 ` Ben Myers
2013-11-18 22:45 ` Eric Sandeen
2013-11-01 4:27 ` [PATCH 3/5] xfs: trace AIL manipulations Dave Chinner
2013-11-05 16:41 ` Christoph Hellwig
2013-11-01 4:27 ` [PATCH 4/5] xfs: add tracepoints to AGF/AGI read operations Dave Chinner
2013-11-05 16:42 ` Christoph Hellwig
2013-11-01 4:27 ` [PATCH 5/5] xfs: increase inode cluster size for v5 filesystems Dave Chinner
2013-11-05 16:43 ` Christoph Hellwig
2013-11-05 19:56 ` Dave Chinner
2013-11-06 21:31 ` Ben Myers
2013-11-07 0:32 ` Dave Chinner
2013-11-12 17:33 ` Christoph Hellwig
2013-11-08 18:21 ` Eric Sandeen [this message]
2013-11-11 22:45 ` Dave Chinner
2013-11-12 0:24 ` Eric Sandeen
2013-11-14 18:51 ` Eric Sandeen
2013-11-06 23:01 ` [PATCH 0/5] xfs: more patches for 3.13 Ben Myers
2013-11-07 1:57 ` Dave Chinner
2013-11-13 1:16 ` Eric Sandeen
2013-11-14 1:16 ` Dave Chinner
2013-11-15 17:19 ` Eric Sandeen
2013-11-15 17:55 ` Eric Sandeen
2013-11-17 19:48 ` Dave Chinner
2013-11-18 21:52 ` Eric Sandeen
2013-11-18 20:30 ` Ben Myers
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=527D2BA0.7000504@sandeen.net \
--to=sandeen@sandeen.net \
--cc=david@fromorbit.com \
--cc=xfs@oss.sgi.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.