public inbox for linux-xfs@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/3] xfs: changes for 3.11
@ 2013-05-27 11:39 Dave Chinner
  2013-05-27 11:39 ` [PATCH 1/3] xfs: don't use speculative prealloc for small files Dave Chinner
                   ` (2 more replies)
  0 siblings, 3 replies; 4+ messages in thread
From: Dave Chinner @ 2013-05-27 11:39 UTC (permalink / raw)
  To: xfs

Hi folks,

three patches for the 3.11 cycle. The first two i've posted
previously to reduce buffer cache lookups for inode creation and
inode unlink.

The third is a new patch that modifies speculative preallocation to
be nicer in small file workloads. It solves anissue where a specific
workload triggers freespace fragmentation problems due to writeback
being triggered with the speculative preallocation beyond EOF
still attached to the file and then never being used.

Cheers,

Dave.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 4+ messages in thread

* [PATCH 1/3] xfs: don't use speculative prealloc for small files
  2013-05-27 11:39 [PATCH 0/3] xfs: changes for 3.11 Dave Chinner
@ 2013-05-27 11:39 ` Dave Chinner
  2013-05-27 11:39 ` [PATCH 2/3] xfs: don't do IO when creating an new inode Dave Chinner
  2013-05-27 11:39 ` [PATCH 3/3] xfs: xfs_ifree doesn't need to modify the inode buffer Dave Chinner
  2 siblings, 0 replies; 4+ messages in thread
From: Dave Chinner @ 2013-05-27 11:39 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

Dedicated small file workloads have been seeing significant free
space fragmentation causing premature inode allocation failure
when large inode sizes are in use. A particular test case showed
that a workload that runs to a real ENOSPC on 256 byte inodes would
fail inode allocation with ENOSPC about about 80% full with 512 byte
inodes, and at about 50% full with 1024 byte inodes.

The same workload, when run with -o allocsize=4096 on 1024 byte
inodes would run to being 100% full before giving ENOSPC. That is,
no freespace fragmentation at all.

The issue was caused by the specific IO pattern the application had
- the framework it was using did not support direct IO, and so it
was emulating it by using fadvise(DONT_NEED). The result was that
the data was getting written back before the speculative prealloc
had been trimmed from memory by the close(), and so small single
block files were being allocated with 2 blocks, and then having one
truncated away. The result was lots of small 4k free space extents,
and hence each new 8k allocation would take another 8k from
contiguous free space and turn it into 4k of allocated space and 4k
of free space.

Hence inode allocation, which requires contiguous, aligned
allocation of 16k (256 byte inodes), 32k (512 byte inodes) or 64k
(1024 byte inodes) can fail to find sufficiently large freespace and
hence fail while there is still lots of free space available.

There's a simple fix for this, and one that has precendence in the
allocator code already - don't do speculative allocation unless the
size of the file is larger than a certain size. In this case, that
size is the minimum default preallocation size:
mp->m_writeio_blocks. And to keep with the concept of being nice to
people when the files are still relatively small, cap the prealloc
to mp->m_writeio_blocks until the file goes over a stripe unit is
size, at which point we'll fall back to the current behaviour based
on the last extent size.

This will effectively turn off speculative prealloc for very small
files, keep preallocation low for small files, and behave as it
currently does for any file larger than a stripe unit. This
completely avoids the freespace fragmentation problem this
particular IO pattern was causing.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_iomap.c |   13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 8f8aaee..14be676 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -284,6 +284,15 @@ xfs_iomap_eof_want_preallocate(
 		return 0;
 
 	/*
+	 * If the file is smaller than the minimum prealloc and we are using
+	 * dynamic preallocation, don't do any preallocation at all as it is
+	 * likely this is the only write to the file that is going to be done.
+	 */
+	if (!(mp->m_flags & XFS_MOUNT_DFLT_IOSIZE) &&
+	    XFS_ISIZE(ip) < mp->m_writeio_blocks)
+		return 0;
+
+	/*
 	 * If there are any real blocks past eof, then don't
 	 * do any speculative allocation.
 	 */
@@ -345,6 +354,10 @@ xfs_iomap_eof_prealloc_initial_size(
 	if (mp->m_flags & XFS_MOUNT_DFLT_IOSIZE)
 		return 0;
 
+	/* If the file is small, then use the minimum prealloc */
+	if (XFS_ISIZE(ip) < mp->m_dalign)
+		return 0;
+
 	/*
 	 * As we write multiple pages, the offset will always align to the
 	 * start of a page and hence point to a hole at EOF. i.e. if the size is
-- 
1.7.10.4

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 4+ messages in thread

* [PATCH 2/3] xfs: don't do IO when creating an new inode
  2013-05-27 11:39 [PATCH 0/3] xfs: changes for 3.11 Dave Chinner
  2013-05-27 11:39 ` [PATCH 1/3] xfs: don't use speculative prealloc for small files Dave Chinner
@ 2013-05-27 11:39 ` Dave Chinner
  2013-05-27 11:39 ` [PATCH 3/3] xfs: xfs_ifree doesn't need to modify the inode buffer Dave Chinner
  2 siblings, 0 replies; 4+ messages in thread
From: Dave Chinner @ 2013-05-27 11:39 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

When we are allocating a new inode, we read the inode cluster off
disk to increment the generation number. We are already using a
random generation number for newly allocated inodes, so if we are not
using the ikeep mode, we can just generate a new generation number
when we initialise the newly allocated inode.

This avoids the need for reading the inode buffer during inode
creation. This will speed up allocation of inodes in cold, partially
allocated clusters as they will no longer need to be read from disk
during allocation. It will also reduce the CPU overhead of inode
allocation by not having the process the buffer read, even on cache
hits.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_inode.c |   36 ++++++++++++++++++++++++++++--------
 1 file changed, 28 insertions(+), 8 deletions(-)

diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index f4daadf..71c074f 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -1028,6 +1028,11 @@ xfs_dinode_calc_crc(
 
 /*
  * Read the disk inode attributes into the in-core inode structure.
+ *
+ * If we are initialising a new inode and we are not utilising the
+ * XFS_MOUNT_IKEEP inode cluster mode, we can simple build the new inode core
+ * with a random generation number. If we are keeping inodes around, we need to
+ * read the inode cluster to get the existing generation number off disk.
  */
 int
 xfs_iread(
@@ -1047,6 +1052,22 @@ xfs_iread(
 	if (error)
 		return error;
 
+	/* shortcut IO on inode allocation if possible */
+	if ((iget_flags & XFS_IGET_CREATE) &&
+	    !(mp->m_flags & XFS_MOUNT_IKEEP)) {
+		/* initialise the on-disk inode core */
+		memset(&ip->i_d, 0, sizeof(ip->i_d));
+		ip->i_d.di_magic = XFS_DINODE_MAGIC;
+		ip->i_d.di_gen = prandom_u32();
+		if (xfs_sb_version_hascrc(&mp->m_sb)) {
+			ip->i_d.di_version = 3;
+			ip->i_d.di_ino = ip->i_ino;
+			uuid_copy(&ip->i_d.di_uuid, &mp->m_sb.sb_uuid);
+		} else
+			ip->i_d.di_version = 2;
+		return 0;
+	}
+
 	/*
 	 * Get pointers to the on-disk inode and the buffer containing it.
 	 */
@@ -1133,17 +1154,16 @@ xfs_iread(
 	xfs_buf_set_ref(bp, XFS_INO_REF);
 
 	/*
-	 * Use xfs_trans_brelse() to release the buffer containing the
-	 * on-disk inode, because it was acquired with xfs_trans_read_buf()
-	 * in xfs_imap_to_bp() above.  If tp is NULL, this is just a normal
+	 * Use xfs_trans_brelse() to release the buffer containing the on-disk
+	 * inode, because it was acquired with xfs_trans_read_buf() in
+	 * xfs_imap_to_bp() above.  If tp is NULL, this is just a normal
 	 * brelse().  If we're within a transaction, then xfs_trans_brelse()
 	 * will only release the buffer if it is not dirty within the
 	 * transaction.  It will be OK to release the buffer in this case,
-	 * because inodes on disk are never destroyed and we will be
-	 * locking the new in-core inode before putting it in the hash
-	 * table where other processes can find it.  Thus we don't have
-	 * to worry about the inode being changed just because we released
-	 * the buffer.
+	 * because inodes on disk are never destroyed and we will be locking the
+	 * new in-core inode before putting it in the cache where other
+	 * processes can find it.  Thus we don't have to worry about the inode
+	 * being changed just because we released the buffer.
 	 */
  out_brelse:
 	xfs_trans_brelse(tp, bp);
-- 
1.7.10.4

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 4+ messages in thread

* [PATCH 3/3] xfs: xfs_ifree doesn't need to modify the inode buffer
  2013-05-27 11:39 [PATCH 0/3] xfs: changes for 3.11 Dave Chinner
  2013-05-27 11:39 ` [PATCH 1/3] xfs: don't use speculative prealloc for small files Dave Chinner
  2013-05-27 11:39 ` [PATCH 2/3] xfs: don't do IO when creating an new inode Dave Chinner
@ 2013-05-27 11:39 ` Dave Chinner
  2 siblings, 0 replies; 4+ messages in thread
From: Dave Chinner @ 2013-05-27 11:39 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

Long ago, bulkstat used to read inodes directly fromteh backing
buffer for speed. This had the unfortunate problem of being cache
incoherent with unlinks, and so xfs_ifree() had to mark the inode
as free directly in the backing buffer. bulkstat was changed some
time ago to use inode cache coherent lookups, and so will never see
unlinked inodes in it's lookups. Hence xfs_ifree() does not need to
touch the inode backing buffer anymore.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_inode.c |   32 ++++----------------------------
 1 file changed, 4 insertions(+), 28 deletions(-)

diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 71c074f..518988c 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -2065,8 +2065,6 @@ xfs_ifree(
 	int			error;
 	int			delete;
 	xfs_ino_t		first_ino;
-	xfs_dinode_t    	*dip;
-	xfs_buf_t       	*ibp;
 
 	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
 	ASSERT(ip->i_d.di_nlink == 0);
@@ -2079,14 +2077,13 @@ xfs_ifree(
 	 * Pull the on-disk inode from the AGI unlinked list.
 	 */
 	error = xfs_iunlink_remove(tp, ip);
-	if (error != 0) {
+	if (error)
 		return error;
-	}
 
 	error = xfs_difree(tp, ip->i_ino, flist, &delete, &first_ino);
-	if (error != 0) {
+	if (error)
 		return error;
-	}
+
 	ip->i_d.di_mode = 0;		/* mark incore inode as free */
 	ip->i_d.di_flags = 0;
 	ip->i_d.di_dmevmask = 0;
@@ -2098,31 +2095,10 @@ xfs_ifree(
 	 * by reincarnations of this inode.
 	 */
 	ip->i_d.di_gen++;
-
 	xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
 
-	error = xfs_imap_to_bp(ip->i_mount, tp, &ip->i_imap, &dip, &ibp,
-			       0, 0);
-	if (error)
-		return error;
-
-        /*
-	* Clear the on-disk di_mode. This is to prevent xfs_bulkstat
-	* from picking up this inode when it is reclaimed (its incore state
-	* initialzed but not flushed to disk yet). The in-core di_mode is
-	* already cleared  and a corresponding transaction logged.
-	* The hack here just synchronizes the in-core to on-disk
-	* di_mode value in advance before the actual inode sync to disk.
-	* This is OK because the inode is already unlinked and would never
-	* change its di_mode again for this inode generation.
-	* This is a temporary hack that would require a proper fix
-	* in the future.
-	*/
-	dip->di_mode = 0;
-
-	if (delete) {
+	if (delete)
 		error = xfs_ifree_cluster(ip, tp, first_ino);
-	}
 
 	return error;
 }
-- 
1.7.10.4

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2013-05-27 11:39 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2013-05-27 11:39 [PATCH 0/3] xfs: changes for 3.11 Dave Chinner
2013-05-27 11:39 ` [PATCH 1/3] xfs: don't use speculative prealloc for small files Dave Chinner
2013-05-27 11:39 ` [PATCH 2/3] xfs: don't do IO when creating an new inode Dave Chinner
2013-05-27 11:39 ` [PATCH 3/3] xfs: xfs_ifree doesn't need to modify the inode buffer Dave Chinner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox