All of lore.kernel.org
 help / color / mirror / Atom feed
From: Eric Sandeen <sandeen@sandeen.net>
To: Dave Chinner <david@fromorbit.com>, xfs@oss.sgi.com
Subject: Re: [PATCH 5/5] xfs: increase inode cluster size for v5 filesystems
Date: Fri, 08 Nov 2013 12:21:20 -0600	[thread overview]
Message-ID: <527D2BA0.7000504@sandeen.net> (raw)
In-Reply-To: <1383280040-21979-6-git-send-email-david@fromorbit.com>

On 10/31/13, 11:27 PM, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> v5 filesystems use 512 byte inodes as a minimum, so read inodes in

I think you meant :

"..., so today we read inodes in clusters that are..."
         ^^^^^^^^

- changes the meaning quite a bit.  :)

> clusters that are effectively half the size of a v4 filesystem with
> 256 byte inodes. For v5 fielsystems, scale the inode cluster size
> with the size of the inode so that we keep a constant 32 inodes per
> cluster ratio for all inode IO.
> 
> This only works if mkfs.xfs sets the inode alignment appropriately
> for larger inode clusters, so this functionality is made conditional
> on mkfs doing the right thing. xfs_repair needs to know about
> the inode alignment changes, too.
> 
> Wall time:
> 	create	bulkstat	find+stat	ls -R	unlink
> v4	237s	161s		173s		201s	299s
> v5	235s	163s		205s		 31s	356s
> patched	234s	160s		182s		 29s	317s
> 
> System time:
> 	create	bulkstat	find+stat	ls -R	unlink
> v4	2601s	2490s		1653s		1656s	2960s
> v5	2637s	2497s		1681s		  20s	3216s
> patched	2613s	2451s		1658s		  20s	3007s
> 
> So, wall time same or down across the board, system time same or
> down across the board, and cache hit rates all improve except for
> the ls -R case which is a pure cold cache directory read workload
> on v5 filesystems...
> 
> So, this patch removes most of the performance and CPU usage
> differential between v4 and v5 filesystems on traversal related
> workloads.

We already have some issues w/ larger inode clusters & freespace
fragmentation resulting in the inability to create new clusters,
right?

How might this impact that problem?  I understand that it may be
a tradeoff.

> Note: while this patch is currently for v5 filesystems only, there
> is no reason it can't be ported back to v4 filesystems.  This hasn't
> been done here because bringing the code back to v4 requires
> forwards and backwards kernel compatibility testing.  i.e. to
> deterine if older kernels(*) do the right thing with larger inode
> alignments but still only using 8k inode cluster sizes. None of this
> testing and validation on v4 filesystems has been done, so for the
> moment larger inode clusters is limited to v5 superblocks.

Thanks for that explanation.

> (*) a current default config v4 filesystem should mount just fine on
> 2.6.23 (when lazy-count support was introduced), and so if we change
> the alignment emitted by mkfs without a feature bit then we have to
> make sure it works properly on all kernels since 2.6.23. And if we
> allow it to be changed when the lazy-count bit is not set, then it's
> all kernels since v2 logs were introduced that need to be tested for
> compatibility...
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_mount.c      | 15 +++++++++++++++
>  fs/xfs/xfs_mount.h      |  2 +-
>  fs/xfs/xfs_trans_resv.c |  3 +--
>  3 files changed, 17 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
> index da88f16..02df7b4 100644
> --- a/fs/xfs/xfs_mount.c
> +++ b/fs/xfs/xfs_mount.c
> @@ -41,6 +41,7 @@
>  #include "xfs_fsops.h"
>  #include "xfs_trace.h"
>  #include "xfs_icache.h"
> +#include "xfs_dinode.h"
>  
>  
>  #ifdef HAVE_PERCPU_SB
> @@ -718,8 +719,22 @@ xfs_mountfs(
>  	 * Set the inode cluster size.
>  	 * This may still be overridden by the file system
>  	 * block size if it is larger than the chosen cluster size.
> +	 *
> +	 * For v5 filesystems, scale the cluster size with the inode size to
> +	 * keep a constant ratio of inode per cluster buffer, but only if mkfs
> +	 * has set the inode alignment value appropriately for larger cluster
> +	 * sizes.
>  	 */
>  	mp->m_inode_cluster_size = XFS_INODE_BIG_CLUSTER_SIZE;

Just thinking out loud here: So this is runtime only; nothing on disk sets
the cluster size explicitly (granted, it never did).

So moving back and forth across newer/older kernels will create clusters 
of different sizes on the same filesystem, right?

(In the very distant past, this same change could have happened if
the amount of memory in a box changed (!) - see commit
425f9ddd534573f58df8e7b633a534fcfc16d44d; prior to that we set
m_inode_cluster_size on the fly as well).

But sb_inoalignmt is a mkfs-set, on-disk feature.  So we might start with
i.e. this, where A1 are 8k alignment points, and 512 byte inodes, in clusters
of size 8k / 16 inodes:

A1           A1           A1           A1           
[ 16 inodes ][ 16 inodes ]             [ 16 inodes ]

and in this case we couldn't bump up m_inode_cluster_size, lest we
allocate a larger cluster on the smaller alignment & overlap:

A1           A1           A1           A1           
[ 16 inodes ][ 16 inodes ]             [ 16 inodes ] <--- existing
                         [        32 inodes        ] <--- new

But if we had a larger (say, 16k) inode alignment, like:  we could
safely do:

A1                        A2                        A2          
[ 16 inodes ]                                       [ 16 inodes ]
                          [        32 inodes       ] <--- new

and be ok.

So the only other thing I wonder about is when we are handling
pre-existing, smaller-than m_inode_cluster_size clusters.

i.e. xfs_ifree_cluster() figures out the number of blocks &
number of inodes in a cluster, based on the (now not
constant) m_inode_cluster_size.

What stops us from going off the end of a smaller cluster?
</handwave>

I'm not up to speed on inode IO TBH, so will dig into it a little more.

Aside from all those vague questions, this seems sane.  ;)

> +	if (xfs_sb_version_hascrc(&mp->m_sb)) {
> +		int	new_size = mp->m_inode_cluster_size;
> +
> +		new_size *= mp->m_sb.sb_inodesize / XFS_DINODE_MIN_SIZE;
> +		if (mp->m_sb.sb_inoalignmt >= XFS_B_TO_FSBT(mp, new_size))
> +			mp->m_inode_cluster_size = new_size;
> +		xfs_info(mp, "Using inode cluster size of %d bytes",
> +			 mp->m_inode_cluster_size);
> +	}
>  
>  	/*
>  	 * Set inode alignment fields

next 2 hunks are just tidying, right.

Thanks,
-Eric

> diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
> index 1d8101a..a466c5e 100644
> --- a/fs/xfs/xfs_mount.h
> +++ b/fs/xfs/xfs_mount.h
> @@ -112,7 +112,7 @@ typedef struct xfs_mount {
>  	__uint8_t		m_blkbb_log;	/* blocklog - BBSHIFT */
>  	__uint8_t		m_agno_log;	/* log #ag's */
>  	__uint8_t		m_agino_log;	/* #bits for agino in inum */
> -	__uint16_t		m_inode_cluster_size;/* min inode buf size */
> +	uint			m_inode_cluster_size;/* min inode buf size */
>  	uint			m_blockmask;	/* sb_blocksize-1 */
>  	uint			m_blockwsize;	/* sb_blocksize in words */
>  	uint			m_blockwmask;	/* blockwsize-1 */
> diff --git a/fs/xfs/xfs_trans_resv.c b/fs/xfs/xfs_trans_resv.c
> index d53d9f0..2fd59c0 100644
> --- a/fs/xfs/xfs_trans_resv.c
> +++ b/fs/xfs/xfs_trans_resv.c
> @@ -385,8 +385,7 @@ xfs_calc_ifree_reservation(
>  		xfs_calc_inode_res(mp, 1) +
>  		xfs_calc_buf_res(2, mp->m_sb.sb_sectsize) +
>  		xfs_calc_buf_res(1, XFS_FSB_TO_B(mp, 1)) +
> -		MAX((__uint16_t)XFS_FSB_TO_B(mp, 1),
> -		    XFS_INODE_CLUSTER_SIZE(mp)) +
> +		max_t(uint, XFS_FSB_TO_B(mp, 1), XFS_INODE_CLUSTER_SIZE(mp)) +
>  		xfs_calc_buf_res(1, 0) +
>  		xfs_calc_buf_res(2 + XFS_IALLOC_BLOCKS(mp) +
>  				 mp->m_in_maxlevels, 0) +
> 

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

  parent reply	other threads:[~2013-11-08 18:21 UTC|newest]

Thread overview: 30+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2013-11-01  4:27 [PATCH 0/5] xfs: more patches for 3.13 Dave Chinner
2013-11-01  4:27 ` [PATCH 1/5] xfs: xfs_remove deadlocks due to inverted AGF vs AGI lock ordering Dave Chinner
2013-11-01  4:27 ` [PATCH 2/5] xfs: open code inc_inode_iversion when logging an inode Dave Chinner
2013-11-05 16:41   ` Christoph Hellwig
2013-11-18 21:54   ` Eric Sandeen
2013-11-18 22:28     ` Ben Myers
2013-11-18 22:45       ` Eric Sandeen
2013-11-01  4:27 ` [PATCH 3/5] xfs: trace AIL manipulations Dave Chinner
2013-11-05 16:41   ` Christoph Hellwig
2013-11-01  4:27 ` [PATCH 4/5] xfs: add tracepoints to AGF/AGI read operations Dave Chinner
2013-11-05 16:42   ` Christoph Hellwig
2013-11-01  4:27 ` [PATCH 5/5] xfs: increase inode cluster size for v5 filesystems Dave Chinner
2013-11-05 16:43   ` Christoph Hellwig
2013-11-05 19:56     ` Dave Chinner
2013-11-06 21:31       ` Ben Myers
2013-11-07  0:32         ` Dave Chinner
2013-11-12 17:33       ` Christoph Hellwig
2013-11-08 18:21   ` Eric Sandeen [this message]
2013-11-11 22:45     ` Dave Chinner
2013-11-12  0:24       ` Eric Sandeen
2013-11-14 18:51   ` Eric Sandeen
2013-11-06 23:01 ` [PATCH 0/5] xfs: more patches for 3.13 Ben Myers
2013-11-07  1:57   ` Dave Chinner
2013-11-13  1:16     ` Eric Sandeen
2013-11-14  1:16       ` Dave Chinner
2013-11-15 17:19         ` Eric Sandeen
2013-11-15 17:55           ` Eric Sandeen
2013-11-17 19:48             ` Dave Chinner
2013-11-18 21:52               ` Eric Sandeen
2013-11-18 20:30 ` Ben Myers

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=527D2BA0.7000504@sandeen.net \
    --to=sandeen@sandeen.net \
    --cc=david@fromorbit.com \
    --cc=xfs@oss.sgi.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.