public inbox for linux-xfs@vger.kernel.org
* [PATCH 0/2] xfs: dynamic speculative allocation beyond EOF V4
@ 2010-12-13  1:25 Dave Chinner
  2010-12-13  1:25 ` [PATCH 1/2] xfs: dynamic speculative EOF preallocation Dave Chinner
  2010-12-13  1:25 ` [PATCH 2/2] xfs: don't truncate prealloc from frequently accessed inodes Dave Chinner
  0 siblings, 2 replies; 13+ messages in thread
From: Dave Chinner @ 2010-12-13  1:25 UTC (permalink / raw)
  To: xfs

This is the latest version of the dynamic speculative allocation
beyond EOF patch set. The description of the patchset can be found
here:

http://oss.sgi.com/archives/xfs/2010-10/msg00040.html

Version 4:
- factored the prealloc size calculation into a separate function to keep
  xfs_iomap_write_delay() easy to read.
- converted the i_dirty_releases counter to a flag.

Version 3:
- allocsize mount option returned to fixed preallocation size only.
- reduces the maximum dynamic prealloc size as the filesystem gets near
  full.
- split i_delayed_blks bug fixes into new patch (posted in 2.6.37-rc
  bug fix series)

Version 2:
- base the speculative preallocation size on the current inode size,
  not the number of previous speculative allocations.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

* [PATCH 0/2] xfs: dynamic speculative allocation beyond EOF V3
@ 2010-11-29  0:43 Dave Chinner
  2010-11-29  0:43 ` [PATCH 2/2] xfs: don't truncate prealloc from frequently accessed inodes Dave Chinner
  0 siblings, 1 reply; 13+ messages in thread
From: Dave Chinner @ 2010-11-29  0:43 UTC (permalink / raw)
  To: xfs

This is the latest version of the dynamic speculative allocation
beyond EOF patch set. The description of the patchset can be found
here:

http://oss.sgi.com/archives/xfs/2010-10/msg00040.html

Version 3:

- allocsize mount option returned to fixed preallocation size only.
- reduces the maximum dynamic prealloc size as the filesystem gets near
  full.
- split i_delayed_blks bug fixes into new patch (posted in 2.6.37-rc
  bug fix series)

Version 2:
- base the speculative preallocation size on the current inode size,
  not the number of previous speculative allocations.



* [RFC, PATCH 0/2] xfs: dynamic speculative preallocation for delalloc
@ 2010-10-04 10:13 Dave Chinner
  2010-10-04 10:13 ` [PATCH 2/2] xfs: don't truncate prealloc from frequently accessed inodes Dave Chinner
  0 siblings, 1 reply; 13+ messages in thread
From: Dave Chinner @ 2010-10-04 10:13 UTC (permalink / raw)
  To: xfs

When multiple concurrent streaming writes land in the same AG,
allocation of extents interleaves between inodes and causes
excessive fragmentation of the files being written. Instead of
getting maximally sized extents, we get writeback-range-sized
extents interleaved on disk. That is, for four files A, B, C and D,
we end up with extents like:

   +---+---+---+---+---+---+---+---+---+---+---+---+
     A1  B1  C1  D1  A2  B2  C2  A3  D2  C3  B3  D3 .....

instead of:

   +-----------+-----------+-----------+-----------+
         A           B           C           D

It is well known that using the allocsize mount option makes the
allocator behaviour much better and more likely to result in
the second layout above than the first, but that doesn't work in all
situations (e.g. writes from the NFS server). I think that we should
not be relying on manual configuration to solve this problem.

To demonstrate, writing 4 x 64GB files in parallel (16TB volume,
inode64 so all files land in same AG, 700MB/s write speed)

$ for i in `seq 0 1 3`; do
> dd if=/dev/zero of=/mnt/scratch/test.$i bs=64k count=1048576 &
> done
....

results in:

$ for i in `seq 0 1 3`; do
> sudo xfs_bmap -vvp /mnt/scratch/test.$i | grep ": \[" | wc -l
> done
777
196
804
784
$
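As a purely illustrative back-of-envelope check (not part of the
patch), the average extent sizes those counts imply for 64GB files:

```python
# Average extent size implied by the extent counts above,
# for 64GB files (counts taken from the xfs_bmap runs).
FILE_SIZE_MB = 64 * 1024  # 64GB expressed in MB

for count in (777, 196, 804, 784):
    # roughly the ~80MB / ~320MB averages discussed below
    print(f"{count:4d} extents -> ~{FILE_SIZE_MB / count:.0f} MB per extent")
```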

This shows an average extent size of roughly 80MB on three of the
files, and 320MB on the fourth. The level of fragmentation varies
throughout the files, and varies greatly from run to run. To
demonstrate allocsize=1g:

$ for i in `seq 0 1 3`; do
> sudo xfs_bmap -vvp /mnt/scratch/test.$i | grep ": \[" | wc -l
> done
64
64
64
64
$

Which is 64x1GB extents per file, as we would expect. However, we
can do better than that - with this dynamic speculative
preallocation patch:

$ for i in `seq 0 1 3`; do
> sudo xfs_bmap -vvp /mnt/scratch/test.$i | grep ": \[" | wc -l
> done
9
9
9
9
$

Which gives maximally sized extents of 8GB (i.e. perfect):

$ sudo xfs_bmap -vv /mnt/scratch/test.0
/mnt/scratch/test.0:
 EXT: FILE-OFFSET             BLOCK-RANGE          AG AG-OFFSET                 TOTAL
   0: [0..16777207]:          96..16777303          0 (96..16777303)         16777208
   1: [16777208..33554295]:   91344616..108121703   0 (91344616..108121703)  16777088
   2: [33554296..50331383]:   158452968..175230055  0 (158452968..175230055) 16777088
   3: [50331384..67108471]:   225561320..242338407  0 (225561320..242338407) 16777088
   4: [67108472..83885559]:   292669672..309446759  0 (292669672..309446759) 16777088
   5: [83885560..100662647]:  359778024..376555111  0 (359778024..376555111) 16777088
   6: [100662648..117439735]: 426886376..443663463  0 (426886376..443663463) 16777088
   7: [117439736..134216823]: 510771816..527548903  0 (510771816..527548903) 16777088
   8: [134216824..134217727]: 594657256..594658159  0 (594657256..594658159)      904
$
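xfs_bmap -v reports ranges in 512-byte basic blocks, so the extent
sizes above can be sanity-checked (a quick illustrative check,
assuming the documented 512-byte unit):

```python
# xfs_bmap -v FILE-OFFSET/TOTAL columns are in 512-byte basic blocks.
BLOCK = 512

# TOTAL column of extent 0 above: just shy of 16777216 blocks (8 GiB)
print(16777208 * BLOCK / 2**30, "GiB")

# The last FILE-OFFSET ends at block 134217727, so the file is 64 GiB
print((134217727 + 1) * BLOCK // 2**30, "GiB")
```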

The same results occur for tests running 16 and 64 sequential
writers into the same AG - extents of 8GB in all files, so
this is a major improvement in default behaviour and effectively
means we do not need the allocsize mount option anymore.

Worth noting is that the extents still interleave between files -
that problem still exists - but the size of the extents now means
that sequential read and write rates are not going to be affected
by excessive seeks between extents within each file.

Given this demonstrably improves allocation patterns, the only
question that remains in my mind is exactly what algorithm to use to
scale the preallocation.  The current patch records the last
prealloc size and increases the next one from that. While that
provides good results, it will cause problems when interacting with
truncation. It also means that a file may have a substantial amount
of preallocation beyond EOF - maybe several times the size of the
file.

However, the current algorithm does work well when writing lots of
relatively small files (e.g. up to a few tens of megabytes), as
increasing the preallocation size fast reduces the chances of
interleaving small allocations.

I've been thinking that basing the preallocation size on the current
file size - say, preallocating half the size of the file - is a
better option once file sizes start to grow large (more than a few
tens of megabytes). So maybe a combination of the two is a better
idea: increase exponentially up to default^2 (4MB prealloc), then
take min(max(i_size / 2, default^2), XFS_MAXEXTLEN) as the prealloc
size so that we don't do excessive amounts of preallocation?
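As a sketch only, that combined heuristic might look like the
following (Python rather than kernel C; the 64k starting size, the
4MB exponential cap and the byte-valued XFS_MAXEXTLEN are
illustrative assumptions, not what the patch actually uses):

```python
# Sketch of the combined prealloc sizing heuristic described above:
# grow exponentially up to default^2, then switch to an i_size-based
# size clamped to XFS_MAXEXTLEN. All constants are illustrative.
DEFAULT_PREALLOC = 64 * 1024        # hypothetical starting prealloc size
EXP_CAP = 4 * 1024 * 1024           # the "default^2 (4MB)" cap from the text
XFS_MAXEXTLEN_BYTES = 8 * 2**30     # assumed ~8GB maximum extent size

def next_prealloc(prev_prealloc, i_size):
    if prev_prealloc < EXP_CAP:
        # exponential growth phase for small files
        return min(max(prev_prealloc * 2, DEFAULT_PREALLOC), EXP_CAP)
    # large-file phase: half the current file size, bounded both ways
    return min(max(i_size // 2, EXP_CAP), XFS_MAXEXTLEN_BYTES)
```

Doubling quickly protects lots of small files from interleaving,
while the i_size/2 term keeps EOF preallocation proportionate for
large files.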

--

We need to make the same write patterns result in equivalent
allocation patterns even when they come through the NFS server.
Right now the NFS server uses a file descriptor for each write that
comes across the wire. This means that the ->release function is
called after every write, and that means XFS will be truncating away
the speculative preallocation it did during the write. Hence we get
interleaved allocation between files and fragmentation.

To avoid this problem, detect when the ->release function is being
called repeatedly on an inode that has delayed allocation
outstanding. If this happens twice in a row, then don't truncate the
speculative allocation away. This ensures that the speculative
preallocation is preserved until the delalloc blocks are converted
to real extents during writeback.
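In outline, the detection amounts to something like this (a Python
sketch; the field and function names are illustrative - the real
patch operates on struct xfs_inode in the ->release path):

```python
# Sketch of the repeated ->release() detection described above.
class Inode:
    def __init__(self):
        self.delalloc_blocks = 0   # delayed allocation outstanding
        self.seen_release = False  # set on first ->release with delalloc

def on_release(inode, truncate_prealloc):
    """Called each time a file descriptor for the inode is released."""
    if inode.delalloc_blocks == 0:
        return
    if inode.seen_release:
        # Second (or later) ->release with delalloc still outstanding:
        # keep the speculative preallocation for writeback to use.
        return
    inode.seen_release = True
    truncate_prealloc(inode)       # first release: trim as before
```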

The result of this is that concurrent files written by NFS will tend
to have a small first extent (due to the speculative prealloc being
truncated once), followed by 4-8GB extents that interleave
identically to the local dd examples above. I have tested this for
4, 16 and 64 concurrent writers from multiple NFS clients. The
result for 2 clients each writing 16x16GB files (32 in total):

$ for i in `seq 0 1 31`; do
> sudo xfs_bmap -vv /mnt/scratch/test.$i |grep ": \[" | wc -l
> done | uniq -c
    1 2
   31 3

Mostly a combination of 4GB and 8GB extents, instead of severe
fragmentation. The typical layout was:

/mnt/scratch/test.1:
 EXT: FILE-OFFSET        BLOCK-RANGE          AG AG-OFFSET                TOTAL
   0: [0..8388607]:         225562280..233950887  0 (225562280..233950887)  8388608
   1: [8388608..25165815]:  410111608..426888815  0 (410111608..426888815) 16777208
   2: [25165816..33554431]: 896648152..905036767  0 (896648152..905036767)  8388616

These results are using NFSv3, and the per-file write rate is only
~3MB/s.  Hence it can be seen that the dynamic preallocation works
for both high and low per-file write throughput.

Comments welcome.



end of thread, other threads:[~2010-12-16 15:44 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-12-13  1:25 [PATCH 0/2] xfs: dynamic speculative allocation beyond EOF V4 Dave Chinner
2010-12-13  1:25 ` [PATCH 1/2] xfs: dynamic speculative EOF preallocation Dave Chinner
2010-12-15 18:57   ` Christoph Hellwig
2010-12-13  1:25 ` [PATCH 2/2] xfs: don't truncate prealloc from frequently accessed inodes Dave Chinner
2010-12-16 15:46   ` Christoph Hellwig
  -- strict thread matches above, loose matches on Subject: below --
2010-11-29  0:43 [PATCH 0/2] xfs: dynamic speculative allocation beyond EOF V3 Dave Chinner
2010-11-29  0:43 ` [PATCH 2/2] xfs: don't truncate prealloc from frequently accessed inodes Dave Chinner
2010-11-29  9:42   ` Andi Kleen
2010-11-30  1:00     ` Dave Chinner
2010-11-30 17:03   ` Christoph Hellwig
2010-11-30 22:00     ` Dave Chinner
2010-10-04 10:13 [RFC, PATCH 0/2] xfs: dynamic speculative preallocation for delalloc Dave Chinner
2010-10-04 10:13 ` [PATCH 2/2] xfs: don't truncate prealloc from frequently accessed inodes Dave Chinner
2010-10-14 17:22   ` Alex Elder
2010-10-14 21:28     ` Dave Chinner
