public inbox for linux-xfs@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH v2] xfs: extent size hints can round up extents past MAXEXTLEN
@ 2015-04-16  5:00 Dave Chinner
  2015-04-16 17:32 ` Brian Foster
  0 siblings, 1 reply; 9+ messages in thread
From: Dave Chinner @ 2015-04-16  5:00 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

This results in BMBT corruption, as seen by this test:

# mkfs.xfs -f -d size=40051712b,agcount=4 /dev/vdc
....
# mount /dev/vdc /mnt/scratch
# xfs_io -ft -c "extsize 16m" -c "falloc 0 30g" -c "bmap -vp" /mnt/scratch/foo

which results in this failure on a debug kernel:

XFS: Assertion failed: (blockcount & xfs_mask64hi(64-BMBT_BLOCKCOUNT_BITLEN)) == 0, file: fs/xfs/libxfs/xfs_bmap_btree.c, line: 211
....
Call Trace:
 [<ffffffff814cf0ff>] xfs_bmbt_set_allf+0x8f/0x100
 [<ffffffff814cf18d>] xfs_bmbt_set_all+0x1d/0x20
 [<ffffffff814f2efe>] xfs_iext_insert+0x9e/0x120
 [<ffffffff814c7956>] ? xfs_bmap_add_extent_hole_real+0x1c6/0xc70
 [<ffffffff814c7956>] xfs_bmap_add_extent_hole_real+0x1c6/0xc70
 [<ffffffff814caaab>] xfs_bmapi_write+0x72b/0xed0
 [<ffffffff811c72ac>] ? kmem_cache_alloc+0x15c/0x170
 [<ffffffff814fe070>] xfs_alloc_file_space+0x160/0x400
 [<ffffffff81ddcc29>] ? down_write+0x29/0x60
 [<ffffffff815063eb>] xfs_file_fallocate+0x29b/0x310
 [<ffffffff811d2bc8>] ? __sb_start_write+0x58/0x120
 [<ffffffff811e3e18>] ? do_vfs_ioctl+0x318/0x570
 [<ffffffff811cd680>] vfs_fallocate+0x140/0x260
 [<ffffffff811ce6f8>] SyS_fallocate+0x48/0x80
 [<ffffffff81ddec09>] system_call_fastpath+0x12/0x17

The tracepoint that indicates the extent that triggered the assert
failure is:

xfs_iext_insert:   idx 0 offset 0 block 16777224 count 2097152 flag 1

Clearly indicating that the extent length is greater than MAXEXTLEN,
which is 2097151. A prior trace point shows the allocation was an
exact size match and that a length greater than MAXEXTLEN was asked
for:

xfs_alloc_size_done:  agno 1 agbno 8 minlen 2097152 maxlen 2097152
					    ^^^^^^^        ^^^^^^^

The issue is that the extent size hint alignment is rounding up the
extent size past MAXEXTLEN, because xfs_bmapi_write() is not taking
into account extent size hints when calculating the maximum extent
length to allocate. xfs_bmapi_reserve_delalloc() is already doing
this, but direct extent allocation is not.

We don't see this problem with extent size hints through the IO path
because we can't do single IOs large enough to trigger MAXEXTLEN
allocation. fallocate(), OTOH, is not limited in it's allocation
sizes and so needs help here. The fix is simply to copy the logic
from xfs_bmapi_reserve_delalloc() and apply it apropriately to
xfs_bmapi_write().

I also add an ASSERT() to xfs_bmap_extsize_align() so we'll catch
cases of alignment exceeding MAXEXTLEN on debug kernel machines in
future.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/libxfs/xfs_bmap.c | 51 +++++++++++++++++++++++++++++++++++-------------
 1 file changed, 37 insertions(+), 14 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index aeffeaa..37949b5 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -3224,12 +3224,21 @@ xfs_bmap_extsize_align(
 		align_alen += temp;
 		align_off -= temp;
 	}
+
+	/* Same adjustment for the end of the requested area. */
+	temp = (align_alen % extsz);
+	if (temp)
+		align_alen += extsz - temp;
+
 	/*
-	 * Same adjustment for the end of the requested area.
+	 * we are in trouble if the caller requested an extent that will align
+	 * to something larger than the supported on disk extent size.  Assert
+	 * fail here to catch callers that make this mistake; they should always
+	 * be setting the maximum allocation length to be (MAXEXTLEN - extsz) so
+	 * we can round outwards here for alignment.
 	 */
-	if ((temp = (align_alen % extsz))) {
-		align_alen += extsz - temp;
-	}
+	ASSERT(align_alen <= MAXEXTLEN);
+
 	/*
 	 * If the previous block overlaps with this proposed allocation
 	 * then move the start forward without adjusting the length.
@@ -4074,6 +4083,27 @@ xfs_bmapi_read(
 	return 0;
 }
 
+/*
+ * Calculate the maximum extent length we can ask to allocate after taking into
+ * account the on-disk size limitations, the extent size hints and the size
+ * being requested. We have to deal with the extent size hint here because the
+ * allocation will attempt alignment and hence grow the length outwards by up to
+ * @extsz on either side.
+ */
+static inline xfs_extlen_t
+xfs_bmapi_max_extlen(
+	struct xfs_inode	*ip,
+	xfs_extlen_t		length)
+{
+	xfs_extlen_t		extsz = xfs_get_extsz_hint(ip);
+	xfs_extlen_t		max_length = MAXEXTLEN;
+
+	if (extsz)
+		max_length -= 2 * extsz - 1;
+
+	return (length < max_length) ? length : max_length;
+}
+
 STATIC int
 xfs_bmapi_reserve_delalloc(
 	struct xfs_inode	*ip,
@@ -4092,20 +4122,13 @@ xfs_bmapi_reserve_delalloc(
 	xfs_extlen_t		extsz;
 	int			error;
 
-	alen = XFS_FILBLKS_MIN(len, MAXEXTLEN);
+	alen = xfs_bmapi_max_extlen(ip, len);
 	if (!eof)
 		alen = XFS_FILBLKS_MIN(alen, got->br_startoff - aoff);
 
-	/* Figure out the extent size, adjust alen */
+	/* Figure out the extent size, align alen */
 	extsz = xfs_get_extsz_hint(ip);
 	if (extsz) {
-		/*
-		 * Make sure we don't exceed a single extent length when we
-		 * align the extent by reducing length we are going to
-		 * allocate by the maximum amount extent size aligment may
-		 * require.
-		 */
-		alen = XFS_FILBLKS_MIN(len, MAXEXTLEN - (2 * extsz - 1));
 		error = xfs_bmap_extsize_align(mp, got, prev, extsz, rt, eof,
 					       1, 0, &aoff, &alen);
 		ASSERT(!error);
@@ -4287,7 +4310,7 @@ xfs_bmapi_allocate(
 					 &bma->prev);
 		}
 	} else {
-		bma->length = XFS_FILBLKS_MIN(bma->length, MAXEXTLEN);
+		bma->length = xfs_bmapi_max_extlen(bma->ip, bma->length);
 		if (!bma->eof)
 			bma->length = XFS_FILBLKS_MIN(bma->length,
 					bma->got.br_startoff - bma->offset);
-- 
2.0.0

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH v2] xfs: extent size hints can round up extents past MAXEXTLEN
  2015-04-16  5:00 [PATCH v2] xfs: extent size hints can round up extents past MAXEXTLEN Dave Chinner
@ 2015-04-16 17:32 ` Brian Foster
  2015-04-16 22:28   ` Dave Chinner
  0 siblings, 1 reply; 9+ messages in thread
From: Brian Foster @ 2015-04-16 17:32 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Thu, Apr 16, 2015 at 03:00:50PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> This results in BMBT corruption, as seen by this test:
> 
> # mkfs.xfs -f -d size=40051712b,agcount=4 /dev/vdc
> ....
> # mount /dev/vdc /mnt/scratch
> # xfs_io -ft -c "extsize 16m" -c "falloc 0 30g" -c "bmap -vp" /mnt/scratch/foo
> 
> which results in this failure on a debug kernel:
> 
> XFS: Assertion failed: (blockcount & xfs_mask64hi(64-BMBT_BLOCKCOUNT_BITLEN)) == 0, file: fs/xfs/libxfs/xfs_bmap_btree.c, line: 211
> ....
> Call Trace:
>  [<ffffffff814cf0ff>] xfs_bmbt_set_allf+0x8f/0x100
>  [<ffffffff814cf18d>] xfs_bmbt_set_all+0x1d/0x20
>  [<ffffffff814f2efe>] xfs_iext_insert+0x9e/0x120
>  [<ffffffff814c7956>] ? xfs_bmap_add_extent_hole_real+0x1c6/0xc70
>  [<ffffffff814c7956>] xfs_bmap_add_extent_hole_real+0x1c6/0xc70
>  [<ffffffff814caaab>] xfs_bmapi_write+0x72b/0xed0
>  [<ffffffff811c72ac>] ? kmem_cache_alloc+0x15c/0x170
>  [<ffffffff814fe070>] xfs_alloc_file_space+0x160/0x400
>  [<ffffffff81ddcc29>] ? down_write+0x29/0x60
>  [<ffffffff815063eb>] xfs_file_fallocate+0x29b/0x310
>  [<ffffffff811d2bc8>] ? __sb_start_write+0x58/0x120
>  [<ffffffff811e3e18>] ? do_vfs_ioctl+0x318/0x570
>  [<ffffffff811cd680>] vfs_fallocate+0x140/0x260
>  [<ffffffff811ce6f8>] SyS_fallocate+0x48/0x80
>  [<ffffffff81ddec09>] system_call_fastpath+0x12/0x17
> 
> The tracepoint that indicates the extent that triggered the assert
> failure is:
> 
> xfs_iext_insert:   idx 0 offset 0 block 16777224 count 2097152 flag 1
> 
> Clearly indicating that the extent length is greater than MAXEXTLEN,
> which is 2097151. A prior trace point shows the allocation was an
> exact size match and that a length greater than MAXEXTLEN was asked
> for:
> 
> xfs_alloc_size_done:  agno 1 agbno 8 minlen 2097152 maxlen 2097152
> 					    ^^^^^^^        ^^^^^^^
> 
> The issue is that the extent size hint alignment is rounding up the
> extent size past MAXEXTLEN, because xfs_bmapi_write() is not taking
> into account extent size hints when calculating the maximum extent
> length to allocate. xfs_bmapi_reserve_delalloc() is already doing
> this, but direct extent allocation is not.
> 
> We don't see this problem with extent size hints through the IO path
> because we can't do single IOs large enough to trigger MAXEXTLEN
> allocation. fallocate(), OTOH, is not limited in it's allocation
> sizes and so needs help here. The fix is simply to copy the logic
> from xfs_bmapi_reserve_delalloc() and apply it apropriately to
> xfs_bmapi_write().
> 
> I also add an ASSERT() to xfs_bmap_extsize_align() so we'll catch
> cases of alignment exceeding MAXEXTLEN on debug kernel machines in
> future.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---

Looks fine from the perspective of applying pre-existing logic to a
separate codepath, but...

>  fs/xfs/libxfs/xfs_bmap.c | 51 +++++++++++++++++++++++++++++++++++-------------
>  1 file changed, 37 insertions(+), 14 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> index aeffeaa..37949b5 100644
> --- a/fs/xfs/libxfs/xfs_bmap.c
> +++ b/fs/xfs/libxfs/xfs_bmap.c
> @@ -3224,12 +3224,21 @@ xfs_bmap_extsize_align(
>  		align_alen += temp;
>  		align_off -= temp;
>  	}
> +
> +	/* Same adjustment for the end of the requested area. */
> +	temp = (align_alen % extsz);
> +	if (temp)
> +		align_alen += extsz - temp;
> +
>  	/*
> -	 * Same adjustment for the end of the requested area.
> +	 * we are in trouble if the caller requested an extent that will align
> +	 * to something larger than the supported on disk extent size.  Assert
> +	 * fail here to catch callers that make this mistake; they should always
> +	 * be setting the maximum allocation length to be (MAXEXTLEN - extsz) so
> +	 * we can round outwards here for alignment.
>  	 */
> -	if ((temp = (align_alen % extsz))) {
> -		align_alen += extsz - temp;
> -	}
> +	ASSERT(align_alen <= MAXEXTLEN);
> +
>  	/*
>  	 * If the previous block overlaps with this proposed allocation
>  	 * then move the start forward without adjusting the length.
> @@ -4074,6 +4083,27 @@ xfs_bmapi_read(
>  	return 0;
>  }
>  
> +/*
> + * Calculate the maximum extent length we can ask to allocate after taking into
> + * account the on-disk size limitations, the extent size hints and the size
> + * being requested. We have to deal with the extent size hint here because the
> + * allocation will attempt alignment and hence grow the length outwards by up to
> + * @extsz on either side.
> + */
> +static inline xfs_extlen_t
> +xfs_bmapi_max_extlen(
> +	struct xfs_inode	*ip,
> +	xfs_extlen_t		length)
> +{
> +	xfs_extlen_t		extsz = xfs_get_extsz_hint(ip);
> +	xfs_extlen_t		max_length = MAXEXTLEN;
> +
> +	if (extsz)
> +		max_length -= 2 * extsz - 1;

This can underflow or cause other issues if set to just the right value
(with smaller block sizes such that length can be trimmed to 0):

$ mkfs.xfs -f -bsize=1k <dev>
$ mount <dev> /mnt
$ xfs_io -f -c "extsize 1g" -c "pwrite 0 4k" -c fsync /mnt/file
pwrite64: No space left on device

...
 kernel:XFS: Assertion failed: ap->length, file: fs/xfs/libxfs/xfs_bmap.c, line: 3652
...

$ xfs_io -f -c "extsize 1025m" -c "pwrite 0 4k" -c fsync /mnt/file
wrote 4096/4096 bytes at offset 0
4 KiB, 4 ops; 0.0000 sec (1.212 MiB/sec and 1241.0797 ops/sec)

(Both that and the original reproducer might make a good xfstests test,
btw...)

Brian

> +
> +	return (length < max_length) ? length : max_length;
> +}
> +
>  STATIC int
>  xfs_bmapi_reserve_delalloc(
>  	struct xfs_inode	*ip,
> @@ -4092,20 +4122,13 @@ xfs_bmapi_reserve_delalloc(
>  	xfs_extlen_t		extsz;
>  	int			error;
>  
> -	alen = XFS_FILBLKS_MIN(len, MAXEXTLEN);
> +	alen = xfs_bmapi_max_extlen(ip, len);
>  	if (!eof)
>  		alen = XFS_FILBLKS_MIN(alen, got->br_startoff - aoff);
>  
> -	/* Figure out the extent size, adjust alen */
> +	/* Figure out the extent size, align alen */
>  	extsz = xfs_get_extsz_hint(ip);
>  	if (extsz) {
> -		/*
> -		 * Make sure we don't exceed a single extent length when we
> -		 * align the extent by reducing length we are going to
> -		 * allocate by the maximum amount extent size aligment may
> -		 * require.
> -		 */
> -		alen = XFS_FILBLKS_MIN(len, MAXEXTLEN - (2 * extsz - 1));
>  		error = xfs_bmap_extsize_align(mp, got, prev, extsz, rt, eof,
>  					       1, 0, &aoff, &alen);
>  		ASSERT(!error);
> @@ -4287,7 +4310,7 @@ xfs_bmapi_allocate(
>  					 &bma->prev);
>  		}
>  	} else {
> -		bma->length = XFS_FILBLKS_MIN(bma->length, MAXEXTLEN);
> +		bma->length = xfs_bmapi_max_extlen(bma->ip, bma->length);
>  		if (!bma->eof)
>  			bma->length = XFS_FILBLKS_MIN(bma->length,
>  					bma->got.br_startoff - bma->offset);
> -- 
> 2.0.0
> 
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v2] xfs: extent size hints can round up extents past MAXEXTLEN
  2015-04-16 17:32 ` Brian Foster
@ 2015-04-16 22:28   ` Dave Chinner
  2015-04-17  0:03     ` Dave Chinner
  2015-04-17 12:58     ` Brian Foster
  0 siblings, 2 replies; 9+ messages in thread
From: Dave Chinner @ 2015-04-16 22:28 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs

On Thu, Apr 16, 2015 at 01:32:38PM -0400, Brian Foster wrote:
> On Thu, Apr 16, 2015 at 03:00:50PM +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > This results in BMBT corruption, as seen by this test:
> > 
> > # mkfs.xfs -f -d size=40051712b,agcount=4 /dev/vdc
> > ....
> > # mount /dev/vdc /mnt/scratch
> > # xfs_io -ft -c "extsize 16m" -c "falloc 0 30g" -c "bmap -vp" /mnt/scratch/foo
> > 
> > which results in this failure on a debug kernel:
> > 
> > XFS: Assertion failed: (blockcount & xfs_mask64hi(64-BMBT_BLOCKCOUNT_BITLEN)) == 0, file: fs/xfs/libxfs/xfs_bmap_btree.c, line: 211
> > ....
....
> 
> Looks fine from the perspective of applying pre-existing logic to a
> separate codepath, but...

That was my thought....

> > + * Calculate the maximum extent length we can ask to allocate after taking into
> > + * account the on-disk size limitations, the extent size hints and the size
> > + * being requested. We have to deal with the extent size hint here because the
> > + * allocation will attempt alignment and hence grow the length outwards by up to
> > + * @extsz on either side.
> > + */
> > +static inline xfs_extlen_t
> > +xfs_bmapi_max_extlen(
> > +	struct xfs_inode	*ip,
> > +	xfs_extlen_t		length)
> > +{
> > +	xfs_extlen_t		extsz = xfs_get_extsz_hint(ip);
> > +	xfs_extlen_t		max_length = MAXEXTLEN;
> > +
> > +	if (extsz)
> > +		max_length -= 2 * extsz - 1;
> 
> This can underflow or cause other issues if set to just the right value
> (with smaller block sizes such that length can be trimmed to 0):

But I assumed the existing code was correct for this context. My
bad. :/

> $ mkfs.xfs -f -bsize=1k <dev>
> $ mount <dev> /mnt
> $ xfs_io -f -c "extsize 1g" -c "pwrite 0 4k" -c fsync /mnt/file
> pwrite64: No space left on device

Yup, because it 2^21 = 2G, and extsize = 1g puts max_length < 0.

Ok. So, the problem is that it is overestimating the amount of space
that alignment will need, and that alignment cannot be guaranteed
for extsz hints of over (MAXEXTLEN / 2) in size.

i.e. given an alignment (A[0-2]) and an extent (E[01]):

   A0                  A1                  A2
   +-------------------+-------------------+
		     +ooo+
		     E0  E1

The problem is that the alignment done by xfs_bmap_extsize_align()
only extends outwards (i.e. increases extent size). Hence E0 gets
rounded down to A0-A2, and E1 gets extended to A2, which means we
are adding almost 2 entire extent size hints to the allocation.
That's where the reduction in length by two extsz values came from.

Now, for delayed allocation, this is just fine, because real
allocation will break this delalloc extent up into two separate
extents, and underflow wouldn't be noticed as delalloc extents are
not physically limited to MAXEXTLEN and so nothing would have
broken. Still, it's not the intended behaviour.

I'm not sure what the solution is yet - the fundamental problem here
is the outwards alignment of both ends of the extent, and this
MAXEXTLEN twiddling is just an encoding of that behaviour. I need to
spend some time looking at xfs_bmap_extsize_align() and determining
if there is something we can do differently here.

> (Both that and the original reproducer might make a good xfstests test,
> btw...)

Yeah, I think I mentioned that on IRC to sandeen when I wrote the
first fix.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v2] xfs: extent size hints can round up extents past MAXEXTLEN
  2015-04-16 22:28   ` Dave Chinner
@ 2015-04-17  0:03     ` Dave Chinner
  2015-04-17 13:01       ` Brian Foster
  2015-04-17 12:58     ` Brian Foster
  1 sibling, 1 reply; 9+ messages in thread
From: Dave Chinner @ 2015-04-17  0:03 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs

On Fri, Apr 17, 2015 at 08:28:29AM +1000, Dave Chinner wrote:
> On Thu, Apr 16, 2015 at 01:32:38PM -0400, Brian Foster wrote:
> > On Thu, Apr 16, 2015 at 03:00:50PM +1000, Dave Chinner wrote:
> > > + * Calculate the maximum extent length we can ask to allocate after taking into
> > > + * account the on-disk size limitations, the extent size hints and the size
> > > + * being requested. We have to deal with the extent size hint here because the
> > > + * allocation will attempt alignment and hence grow the length outwards by up to
> > > + * @extsz on either side.
> > > + */
> > > +static inline xfs_extlen_t
> > > +xfs_bmapi_max_extlen(
> > > +	struct xfs_inode	*ip,
> > > +	xfs_extlen_t		length)
> > > +{
> > > +	xfs_extlen_t		extsz = xfs_get_extsz_hint(ip);
> > > +	xfs_extlen_t		max_length = MAXEXTLEN;
> > > +
> > > +	if (extsz)
> > > +		max_length -= 2 * extsz - 1;
> > 
> > This can underflow or cause other issues if set to just the right value
> > (with smaller block sizes such that length can be trimmed to 0):
> 
> But I assumed the existing code was correct for this context. My
> bad. :/
> 
> > $ mkfs.xfs -f -bsize=1k <dev>
> > $ mount <dev> /mnt
> > $ xfs_io -f -c "extsize 1g" -c "pwrite 0 4k" -c fsync /mnt/file
> > pwrite64: No space left on device
> 
> Yup, because it 2^21 = 2G, and extsize = 1g puts max_length < 0.
> 
> Ok. So, the problem is that it is overestimating the amount of space
> that alignment will need, and that alignment cannot be guaranteed
> for extsz hints of over (MAXEXTLEN / 2) in size.
> 
> i.e. given an alignment (A[0-2]) and an extent (E[01]):
> 
>    A0                  A1                  A2
>    +-------------------+-------------------+
> 		     +ooo+
> 		     E0  E1
> 
> The problem is that the alignment done by xfs_bmap_extsize_align()
> only extends outwards (i.e. increases extent size). Hence E0 gets
> rounded down to A0-A2, and E1 gets extended to A2, which means we
> are adding almost 2 entire extent size hints to the allocation.
> That's where the reduction in length by two extsz values came from.
> 
> Now, for delayed allocation, this is just fine, because real
> allocation will break this delalloc extent up into two separate
> extents, and underflow wouldn't be noticed as delalloc extents are
> not physically limited to MAXEXTLEN and so nothing would have
> broken. Still, it's not the intended behaviour.
> 
> I'm not sure what the solution is yet - the fundamental problem here
> is the outwards alignment of both ends of the extent, and this
> MAXEXTLEN twiddling is just an encoding of that behaviour. I need to
> spend some time looking at xfs_bmap_extsize_align() and determining
> if there is something we can do differently here.

Ok, so the callers of xfs_bmap_extsize_align() are:

	xfs_bmapi_reserve_delalloc()
	xfs_bmap_btalloc()
	xfs_bmap_rtalloc().

For xfs_bmapi_reserve_delalloc(), the alignment does not need grow
outwards; it can be truncated mid-range, and the code should still
work. i.e.

   A0                  A1                  A2
   +-------------------+-------------------+
		     +ooo+
		     E0  E1
   +-------------------+
   R0                  R1

R[01] is a valid alignment and will result in a second allocation
occurring for this:

   A0                  A1                  A2
   +-------------------+-------------------+
		       +o+
		      E2  E1
                       +-------------------+
                       R1                  R2

And so the range we need allocation for (E[01]) will be allocated
and correctly extent size aligned.

For xfs_bmap_btalloc() - the problem case here - the code is a
little more complex. We do:

xfs_bmapi_write
  loop until all allocated {
    xfs_bmapi_allocate(bma)
      calc off/len
        xfs_bmap_btalloc(bma)
	  xfs_bmap_extsize_align(bma)
	  xfs_alloc_vextent
	  update bma->length
      BMBT insert
    trim returned map
  }

So we are doing alignment two steps removed from the off/len
calculation (xfs_bmap_rtalloc() is in the same boat). Hence the
question is whether xfs_bmap_extsize_align() can trim the range
being allocated and still have everything work....

Ok, upon further reading, the xfs_bmalloc structure (bma) that is
passed between these functions to track the allocation being done is
updated after allocation with the length of the extent allocated.
IOWs:

	bma->length = E(len)
	xfs_bmap_btalloc(bma)
	  A(len) = xfs_bmap_extsize_align(bma->length)
	  R(len) = xfs_alloc_vextent(A(len))
	  bma->length = R(len)

Hence this:

   A0                  A1                  A2
   +-------------------+-------------------+
		     +ooo+
		     E0  E1
   +-------------------+
   R0                  R1

Is a valid result from xfs_bmap_btalloc() and the loop in
xfs_bmapi_write() will do a second allocation and alignment as per
the above delalloc case. xfs_bmap_rtalloc() appears to mirror this
same structure, so should also have the same behaviour.

What this means is that we can actually reduce the requested
allocation to be only a partial overlap when aligning it, and
everything should still work. Let's now see how complex that makes
the code...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v2] xfs: extent size hints can round up extents past MAXEXTLEN
  2015-04-16 22:28   ` Dave Chinner
  2015-04-17  0:03     ` Dave Chinner
@ 2015-04-17 12:58     ` Brian Foster
  1 sibling, 0 replies; 9+ messages in thread
From: Brian Foster @ 2015-04-17 12:58 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Fri, Apr 17, 2015 at 08:28:29AM +1000, Dave Chinner wrote:
> On Thu, Apr 16, 2015 at 01:32:38PM -0400, Brian Foster wrote:
> > On Thu, Apr 16, 2015 at 03:00:50PM +1000, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > > 
> > > This results in BMBT corruption, as seen by this test:
> > > 
> > > # mkfs.xfs -f -d size=40051712b,agcount=4 /dev/vdc
> > > ....
> > > # mount /dev/vdc /mnt/scratch
> > > # xfs_io -ft -c "extsize 16m" -c "falloc 0 30g" -c "bmap -vp" /mnt/scratch/foo
> > > 
> > > which results in this failure on a debug kernel:
> > > 
> > > XFS: Assertion failed: (blockcount & xfs_mask64hi(64-BMBT_BLOCKCOUNT_BITLEN)) == 0, file: fs/xfs/libxfs/xfs_bmap_btree.c, line: 211
> > > ....
> ....
> > 
> > Looks fine from the perspective of applying pre-existing logic to a
> > separate codepath, but...
> 
> That was my thought....
> 
> > > + * Calculate the maximum extent length we can ask to allocate after taking into
> > > + * account the on-disk size limitations, the extent size hints and the size
> > > + * being requested. We have to deal with the extent size hint here because the
> > > + * allocation will attempt alignment and hence grow the length outwards by up to
> > > + * @extsz on either side.
> > > + */
> > > +static inline xfs_extlen_t
> > > +xfs_bmapi_max_extlen(
> > > +	struct xfs_inode	*ip,
> > > +	xfs_extlen_t		length)
> > > +{
> > > +	xfs_extlen_t		extsz = xfs_get_extsz_hint(ip);
> > > +	xfs_extlen_t		max_length = MAXEXTLEN;
> > > +
> > > +	if (extsz)
> > > +		max_length -= 2 * extsz - 1;
> > 
> > This can underflow or cause other issues if set to just the right value
> > (with smaller block sizes such that length can be trimmed to 0):
> 
> But I assumed the existing code was correct for this context. My
> bad. :/
> 
> > $ mkfs.xfs -f -bsize=1k <dev>
> > $ mount <dev> /mnt
> > $ xfs_io -f -c "extsize 1g" -c "pwrite 0 4k" -c fsync /mnt/file
> > pwrite64: No space left on device
> 
> Yup, because it 2^21 = 2G, and extsize = 1g puts max_length < 0.
> 

I think it puts max_length at 0, which basically kills the allocation.
Increasing the hint further underflows the max and makes it ineffective.
Regardless, broken either way...

> Ok. So, the problem is that it is overestimating the amount of space
> that alignment will need, and that alignment cannot be guaranteed
> for extsz hints of over (MAXEXTLEN / 2) in size.
> 
> i.e. given an alignment (A[0-2]) and an extent (E[01]):
> 
>    A0                  A1                  A2
>    +-------------------+-------------------+
> 		     +ooo+
> 		     E0  E1
> 
> The problem is that the alignment done by xfs_bmap_extsize_align()
> only extends outwards (i.e. increases extent size). Hence E0 gets
> rounded down to A0-A2, and E1 gets extended to A2, which means we
> are adding almost 2 entire extent size hints to the allocation.
> That's where the reduction in length by two extsz values came from.
> 

Makes sense... From reading through xfs_bmap_extsize_align(), it looks
like the intent of the function is to basically look at the current bmap
of the file and apply the extent size hint to the original allocation
request. E.g., expand the range of the file being allocated while
dealing with potential overlap of previous or subsequent extents, eof,
etc.

> Now, for delayed allocation, this is just fine, because real
> allocation will break this delalloc extent up into two separate
> extents, and underflow wouldn't be noticed as delalloc extents are
> not physically limited to MAXEXTLEN and so nothing would have
> broken. Still, it's not the intended behaviour.
> 

The delalloc behavior wasn't clear to me at first. I was expecting
something along the lines of the behavior above, only done as a delalloc
extent (only inserted in the in-core extent list). Observing that not
happening, however, lead me to this:

aff3a9ed xfs: Use preallocation for inodes with extsz hints

... which leads me to believe all of the extent size hint handling code
in the bmapi delalloc codepath is historical from when we did have this
behavior. We simply turned it off without cleaning out the lower layers,
yes?

In any case, that explains the behavior. It's a bit confusing having
that code around. On one hand, I could understand the view that the
allocator is an independent layer that should account for the hints
regardless of how the higher layers choose to call it. The downside is
we can't really test that allocator codepath any longer. I think I'd be
in favor of ripping that stuff out if it's not called. We could always
add it back down the road if the extent alignment stuff is well factored
into helper functions.

> I'm not sure what the solution is yet - the fundamental problem here
> is the outwards alignment of both ends of the extent, and this
> MAXEXTLEN twiddling is just an encoding of that behaviour. I need to
> spend some time looking at xfs_bmap_extsize_align() and determining
> if there is something we can do differently here.
> 

Can the extent alignment code learn to account for MAXEXTLEN itself?

Brian

> > (Both that and the original reproducer might make a good xfstests test,
> > btw...)
> 
> Yeah, I think I mentioned that on IRC to sandeen when I wrote the
> first fix.
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v2] xfs: extent size hints can round up extents past MAXEXTLEN
  2015-04-17  0:03     ` Dave Chinner
@ 2015-04-17 13:01       ` Brian Foster
  2015-04-18 23:17         ` Dave Chinner
  0 siblings, 1 reply; 9+ messages in thread
From: Brian Foster @ 2015-04-17 13:01 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Fri, Apr 17, 2015 at 10:03:08AM +1000, Dave Chinner wrote:
> On Fri, Apr 17, 2015 at 08:28:29AM +1000, Dave Chinner wrote:
> > On Thu, Apr 16, 2015 at 01:32:38PM -0400, Brian Foster wrote:
> > > On Thu, Apr 16, 2015 at 03:00:50PM +1000, Dave Chinner wrote:
> > > > + * Calculate the maximum extent length we can ask to allocate after taking into
> > > > + * account the on-disk size limitations, the extent size hints and the size
> > > > + * being requested. We have to deal with the extent size hint here because the
> > > > + * allocation will attempt alignment and hence grow the length outwards by up to
> > > > + * @extsz on either side.
> > > > + */
> > > > +static inline xfs_extlen_t
> > > > +xfs_bmapi_max_extlen(
> > > > +	struct xfs_inode	*ip,
> > > > +	xfs_extlen_t		length)
> > > > +{
> > > > +	xfs_extlen_t		extsz = xfs_get_extsz_hint(ip);
> > > > +	xfs_extlen_t		max_length = MAXEXTLEN;
> > > > +
> > > > +	if (extsz)
> > > > +		max_length -= 2 * extsz - 1;
> > > 
> > > This can underflow or cause other issues if set to just the right value
> > > (with smaller block sizes such that length can be trimmed to 0):
> > 
> > But I assumed the existing code was correct for this context. My
> > bad. :/
> > 
> > > $ mkfs.xfs -f -bsize=1k <dev>
> > > $ mount <dev> /mnt
> > > $ xfs_io -f -c "extsize 1g" -c "pwrite 0 4k" -c fsync /mnt/file
> > > pwrite64: No space left on device
> > 
> > Yup, because it 2^21 = 2G, and extsize = 1g puts max_length < 0.
> > 
> > Ok. So, the problem is that it is overestimating the amount of space
> > that alignment will need, and that alignment cannot be guaranteed
> > for extsz hints of over (MAXEXTLEN / 2) in size.
> > 
> > i.e. given an alignment (A[0-2]) and an extent (E[01]):
> > 
> >    A0                  A1                  A2
> >    +-------------------+-------------------+
> > 		     +ooo+
> > 		     E0  E1
> > 
> > The problem is that the alignment done by xfs_bmap_extsize_align()
> > only extends outwards (i.e. increases extent size). Hence E0 gets
> > rounded down to A0-A2, and E1 gets extended to A2, which means we
> > are adding almost 2 entire extent size hints to the allocation.
> > That's where the reduction in length by two extsz values came from.
> > 
> > Now, for delayed allocation, this is just fine, because real
> > allocation will break this delalloc extent up into two separate
> > extents, and underflow wouldn't be noticed as delalloc extents are
> > not physically limited to MAXEXTLEN and so nothing would have
> > broken. Still, it's not the intended behaviour.
> > 
> > I'm not sure what the solution is yet - the fundamental problem here
> > is the outwards alignment of both ends of the extent, and this
> > MAXEXTLEN twiddling is just an encoding of that behaviour. I need to
> > spend some time looking at xfs_bmap_extsize_align() and determining
> > if there is something we can do differently here.
> 
> Ok, so the callers of xfs_bmap_extsize_align() are:
> 
> 	xfs_bmapi_reserve_delalloc()
> 	xfs_bmap_btalloc()
> 	xfs_bmap_rtalloc().
> 
> For xfs_bmapi_reserve_delalloc(), the alignment does not need grow
> outwards; it can be truncated mid-range, and the code should still
> work. i.e.
> 
>    A0                  A1                  A2
>    +-------------------+-------------------+
> 		     +ooo+
> 		     E0  E1
>    +-------------------+
>    R0                  R1
> 
> R[01] is a valid alignment and will result in a second allocation
> occurring for this:
> 
>    A0                  A1                  A2
>    +-------------------+-------------------+
> 		       +o+
> 		      E2  E1
>                        +-------------------+
>                        R1                  R2
> 
> And so the range we need allocation for (E[01]) will be allocated
> and correctly extent size aligned.
> 

See my previous comments about delalloc extent size hints...

That aside, seems reasonable at a glance. The delayed allocation is
effectively aggregated into what we expect to be an allocation that
covers the entire aligned range.

> For xfs_bmap_btalloc() - the problem case here - the code is a
> little more complex. We do:
> 
> xfs_bmapi_write
>   loop until all allocated {
>     xfs_bmapi_allocate(bma)
>       calc off/len
>         xfs_bmap_btalloc(bma)
> 	  xfs_bmap_extsize_align(bma)
> 	  xfs_alloc_vextent
> 	  update bma->length
>       BMBT insert
>     trim returned map
>   }
> 
> So we are doing alignment two steps removed from the off/len
> calculation (xfs_bmap_rtalloc() is in the same boat). Hence the
> question is whether xfs_bmap_extsize_align() can trim the range
> being allocated and still have everything work....
> 
> Ok, upon further reading, the xfs_bmalloc structure (bma) that is
> passed between these functions to track the allocation being done is
> updated after allocation with the length of the extent allocated.
> IOWs:
> 
> 	bma->length = E(len)
> 	xfs_bmap_btalloc(bma)
> 	  A(len) = xfs_bmap_extsize_align(bma->length)
> 	  R(len) = xfs_alloc_vextent(A(len))
> 	  bma->length = R(len)
> 
> Hence this:
> 
>    A0                  A1                  A2
>    +-------------------+-------------------+
> 		     +ooo+
> 		     E0  E1
>    +-------------------+
>    R0                  R1
> 
> Is a valid result from xfs_bmap_btalloc() and the loop in
> xfs_bmapi_write() will do a second allocation and alignment as per
> the above delalloc case. xfs_bmap_rtalloc() appears to mirror this
> same structure, so should also have the same behaviour.
> 
> What this means is that we can actually reduce the requested
> allocation to be only a partial overlap when aligning it, and
> everything should still work. Let's now see how complex that makes
> the code...
> 

Ok, so xfs_bmapi_write() walks over the file block range to be mapped
and is thus prepared to handle the potential for multiple allocations.
As long as the extent size alignment covers the high level range in
incremental order, the higher layers should keep moving along until the
originally requested range is allocated.

What I'm not so sure about is that the xfs_bmapi_write() loop also
accounts for nimap, and that is passed as 1 from at least the couple of
codepaths I looked at. I'm guessing this is because those paths have a
transaction reservation good enough for one allocation at a time. Some
paths (e.g., xfs_iomap_write_allocate()) seems to handle this with an
even higher level loop, but others such as xfs_iomap_write_direct() do
not appear to. That said, it might still be the case that everything
technically works, as then the higher level DIO becomes the next guy up
the chain responsible for the alloc via requesting the next mapping...

Even if that is the case, it seems at some level this alters the
semantics of the extent size hint. Maybe that's fine and we just
document it such that rather than extent size hints potentially
resulting in 2x allocations, unaligned I/Os simply result in multiple
aligned allocations. IIUC, that shouldn't really have much user visible
impact, if at all..?

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v2] xfs: extent size hints can round up extents past MAXEXTLEN
  2015-04-17 13:01       ` Brian Foster
@ 2015-04-18 23:17         ` Dave Chinner
  2015-04-19 13:33           ` Brian Foster
  0 siblings, 1 reply; 9+ messages in thread
From: Dave Chinner @ 2015-04-18 23:17 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs

[compund reply]

On Fri, Apr 17, 2015 at 08:58:44AM -0400, Brian Foster wrote:
> On Fri, Apr 17, 2015 at 08:28:29AM +1000, Dave Chinner wrote:
> > On Thu, Apr 16, 2015 at 01:32:38PM -0400, Brian Foster wrote:
> > > On Thu, Apr 16, 2015 at 03:00:50PM +1000, Dave Chinner wrote:
> > > > From: Dave Chinner <dchinner@redhat.com>
> > > > 
> > > > This results in BMBT corruption, as seen by this test:
> > > > 
> > > > # mkfs.xfs -f -d size=40051712b,agcount=4 /dev/vdc
> > > > ....
> > > > # mount /dev/vdc /mnt/scratch
> > > > # xfs_io -ft -c "extsize 16m" -c "falloc 0 30g" -c "bmap -vp" /mnt/scratch/foo
> > > > 
> > > > which results in this failure on a debug kernel:
> > > > 
> > > > XFS: Assertion failed: (blockcount & xfs_mask64hi(64-BMBT_BLOCKCOUNT_BITLEN)) == 0, file: fs/xfs/libxfs/xfs_bmap_btree.c, line: 211
> > > > ....
> > ....
....
> I think it puts max_length at 0, which basically kills the allocation.
> Increasing the hint further underflows the max and makes it ineffective.
> Regardless, broken either way...
> 
> > Ok. So, the problem is that it is overestimating the amount of space
> > that alignment will need, and that alignment cannot be guaranteed
> > for extsz hints of over (MAXEXTLEN / 2) in size.
> > 
> > i.e. given an alignment (A[0-2]) and an extent (E[01]):
> > 
> >    A0                  A1                  A2
> >    +-------------------+-------------------+
> > 		     +ooo+
> > 		     E0  E1
> > 
> > The problem is that the alignment done by xfs_bmap_extsize_align()
> > only extends outwards (i.e. increases extent size). Hence E0 gets
> > rounded down to A0-A2, and E1 gets extended to A2, which means we
> > are adding almost 2 entire extent size hints to the allocation.
> > That's where the reduction in length by two extsz values came from.
> > 
> 
> Makes sense... From reading through xfs_bmap_extsize_align(), it looks
> like the intent of the function is to basically look at the current bmap
> of the file and apply the extent size hint to the original allocation
> request. E.g., expand the range of the file being allocated while
> dealing with potential overlap of previous or subsequent extents, eof,
> etc.

Yes, that is what it is supposed to be doing.

> > Now, for delayed allocation, this is just fine, because real
> > allocation will break this delalloc extent up into two separate
> > extents, and underflow wouldn't be noticed as delalloc extents are
> > not physically limited to MAXEXTLEN and so nothing would have
> > broken. Still, it's not the intended behaviour.
> > 
> 
> The delalloc behavior wasn't clear to me at first. I was expecting
> something along the lines of the behavior above, only done as a delalloc
> extent (only inserted in the in-core extent list). Observing that not
> happening, however, lead me to this:
> 
> aff3a9ed xfs: Use preallocation for inodes with extsz hints
> 
> ... which leads me to believe all of the extent size hint handling code
> in the bmapi delalloc codepath is historical from when we did have this
> behavior. We simply turned it off without cleaning out the lower layers,
> yes?

Yes. It had crossed my mind that the delalloc code probably wasn't
being called, but I did the analysis assuming that it was called.

> In any case, that explains the behavior. It's a bit confusing having
> that code around. On one hand, I could understand the view that the
> allocator is an independent layer that should account for the hints
> regardless of how the higher layers choose to call it. The downside is
> we can't really test that allocator codepath any longer. I think I'd be
> in favor of ripping that stuff out if it's not called. We could always
> add it back down the road if the extent alignment stuff is well factored
> into helper functions.

Yup.

[.... to other reply ....]

> > Ok, so the callers of xfs_bmap_extsize_align() are:
> > 
> > 	xfs_bmapi_reserve_delalloc()
> > 	xfs_bmap_btalloc()
> > 	xfs_bmap_rtalloc().
> > 
> > For xfs_bmapi_reserve_delalloc(), the alignment does not need grow
> > outwards; it can be truncated mid-range, and the code should still
> > work. i.e.
> > 
> >    A0                  A1                  A2
> >    +-------------------+-------------------+
> > 		     +ooo+
> > 		     E0  E1
> >    +-------------------+
> >    R0                  R1
> > 
> > R[01] is a valid alignment and will result in a second allocation
> > occurring for this:
> > 
> >    A0                  A1                  A2
> >    +-------------------+-------------------+
> > 		       +o+
> > 		      E2  E1
> >                        +-------------------+
> >                        R1                  R2
> > 
> > And so the range we need allocation for (E[01]) will be allocated
> > and correctly extent size aligned.
> > 
> 
> See my previous comments about delalloc extent size hints...
> 
> That aside, seems reasonable at a glance. The delayed allocation is
> effectively aggregated into what we expect to be an allocation that
> covers the entire aligned range.

Right.

> 
> > For xfs_bmap_btalloc() - the problem case here - the code is a
> > little more complex. We do:
> > 
> > xfs_bmapi_write
> >   loop until all allocated {
> >     xfs_bmapi_allocate(bma)
> >       calc off/len
> >         xfs_bmap_btalloc(bma)
> > 	  xfs_bmap_extsize_align(bma)
> > 	  xfs_alloc_vextent
> > 	  update bma->length
> >       BMBT insert
> >     trim returned map
> >   }
> > 
> > So we are doing alignment two steps removed from the off/len
> > calculation (xfs_bmap_rtalloc() is in the same boat). Hence the
> > question is whether xfs_bmap_extsize_align() can trim the range
> > being allocated and still have everything work....
> > 
> > Ok, upon further reading, the xfs_bmalloc structure (bma) that is
> > passed between these functions to track the allocation being done is
> > updated after allocation with the length of the extent allocated.
> > IOWs:
> > 
> > 	bma->length = E(len)
> > 	xfs_bmap_btalloc(bma)
> > 	  A(len) = xfs_bmap_extsize_align(bma->length)
> > 	  R(len) = xfs_alloc_vextent(A(len))
> > 	  bma->length = R(len)
> > 
> > Hence this:
> > 
> >    A0                  A1                  A2
> >    +-------------------+-------------------+
> > 		     +ooo+
> > 		     E0  E1
> >    +-------------------+
> >    R0                  R1
> > 
> > Is a valid result from xfs_bmap_btalloc() and the loop in
> > xfs_bmapi_write() will do a second allocation and alignment as per
> > the above delalloc case. xfs_bmap_rtalloc() appears to mirror this
> > same structure, so should also have the same behaviour.
> > 
> > What this means is that we can actually reduce the requested
> > allocation to be only a partial overlap when aligning it, and
> > everything should still work. Let's now see how complex that makes
> > the code...
> 
> Ok, so xfs_bmapi_write() walks over the file block range to be mapped
> and is thus prepared to handle the potential for multiple allocations.
> As long as the extent size alignment covers the high level range in
> incremental order, the higher layers should keep moving along until the
> originally requested range is allocated.

Yes.

> What I'm not so sure about is that the xfs_bmapi_write() loop also
> accounts for nimap, and that is passed as 1 from at least the couple of
> codepaths I looked at.

Yes. This "shorter allocation" is the same case as not having a
single contiguous region of free space large enough to allocate the
entire space being asked for - the allocated extent comes back
shorter than maxlen but larger than minlen, so another allocation is
needed...

> I'm guessing this is because those paths have a
> transaction reservation good enough for one allocation at a time. Some
> paths (e.g., xfs_iomap_write_allocate()) seems to handle this with an
> even higher level loop, but others such as xfs_iomap_write_direct() do
> not appear to. That said, it might still be the case that everything
> technically works, as then the higher level DIO becomes the next guy up
> the chain responsible for the alloc via requesting the next mapping...

Right, all the higher layers have loops to complete the
allocation/mapping calls because they always have to assume that a
single allocation/mapping call will not span the entire range they
are asking for. If they didn't loop, we'd see broken bits all over
the place ;)

> Even if that is the case, it seems at some level this alters the
> semantics of the extent size hint. Maybe that's fine and we just
> document it such that rather than extent size hints potentially
> resulting in 2x allocations, unaligned I/Os simply result in multiple
> aligned allocations. IIUC, that shouldn't really have much user visible
> impact, if at all..?

I don't think it has any visible user impact at all, just a slight
difference in CPU usage. We'll still end up with the same allocation
being done because we hold the AGF locked over both the allocations,
and the "contiguous allocation block target" should result in the
same free space extents being chosen as if it was a single
allocation...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v2] xfs: extent size hints can round up extents past MAXEXTLEN
  2015-04-18 23:17         ` Dave Chinner
@ 2015-04-19 13:33           ` Brian Foster
  2015-04-19 23:59             ` Dave Chinner
  0 siblings, 1 reply; 9+ messages in thread
From: Brian Foster @ 2015-04-19 13:33 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Sun, Apr 19, 2015 at 09:17:45AM +1000, Dave Chinner wrote:
> [compund reply]
> 
> On Fri, Apr 17, 2015 at 08:58:44AM -0400, Brian Foster wrote:
> > On Fri, Apr 17, 2015 at 08:28:29AM +1000, Dave Chinner wrote:
> > > On Thu, Apr 16, 2015 at 01:32:38PM -0400, Brian Foster wrote:
> > > > On Thu, Apr 16, 2015 at 03:00:50PM +1000, Dave Chinner wrote:
> > > > > From: Dave Chinner <dchinner@redhat.com>
> > 
...
> > > For xfs_bmap_btalloc() - the problem case here - the code is a
> > > little more complex. We do:
> > > 
> > > xfs_bmapi_write
> > >   loop until all allocated {
> > >     xfs_bmapi_allocate(bma)
> > >       calc off/len
> > >         xfs_bmap_btalloc(bma)
> > > 	  xfs_bmap_extsize_align(bma)
> > > 	  xfs_alloc_vextent
> > > 	  update bma->length
> > >       BMBT insert
> > >     trim returned map
> > >   }
> > > 
> > > So we are doing alignment two steps removed from the off/len
> > > calculation (xfs_bmap_rtalloc() is in the same boat). Hence the
> > > question is whether xfs_bmap_extsize_align() can trim the range
> > > being allocated and still have everything work....
> > > 
> > > Ok, upon further reading, the xfs_bmalloc structure (bma) that is
> > > passed between these functions to track the allocation being done is
> > > updated after allocation with the length of the extent allocated.
> > > IOWs:
> > > 
> > > 	bma->length = E(len)
> > > 	xfs_bmap_btalloc(bma)
> > > 	  A(len) = xfs_bmap_extsize_align(bma->length)
> > > 	  R(len) = xfs_alloc_vextent(A(len))
> > > 	  bma->length = R(len)
> > > 
> > > Hence this:
> > > 
> > >    A0                  A1                  A2
> > >    +-------------------+-------------------+
> > > 		     +ooo+
> > > 		     E0  E1
> > >    +-------------------+
> > >    R0                  R1
> > > 
> > > Is a valid result from xfs_bmap_btalloc() and the loop in
> > > xfs_bmapi_write() will do a second allocation and alignment as per
> > > the above delalloc case. xfs_bmap_rtalloc() appears to mirror this
> > > same structure, so should also have the same behaviour.
> > > 
> > > What this means is that we can actually reduce the requested
> > > allocation to be only a partial overlap when aligning it, and
> > > everything should still work. Let's now see how complex that makes
> > > the code...
> > 
> > Ok, so xfs_bmapi_write() walks over the file block range to be mapped
> > and is thus prepared to handle the potential for multiple allocations.
> > As long as the extent size alignment covers the high level range in
> > incremental order, the higher layers should keep moving along until the
> > originally requested range is allocated.
> 
> Yes.
> 
> > What I'm not so sure about is that the xfs_bmapi_write() loop also
> > accounts for nimap, and that is passed as 1 from at least the couple of
> > codepaths I looked at.
> 
> Yes. This "shorter allocation" is the same case as not having a
> single contiguous region of free space large enough to allocate the
> entire space being asked for - the allocated extent comes back
> shorter than maxlen but larger than minlen, so another allocation is
> needed...
> 
> > I'm guessing this is because those paths have a
> > transaction reservation good enough for one allocation at a time. Some
> > paths (e.g., xfs_iomap_write_allocate()) seems to handle this with an
> > even higher level loop, but others such as xfs_iomap_write_direct() do
> > not appear to. That said, it might still be the case that everything
> > technically works, as then the higher level DIO becomes the next guy up
> > the chain responsible for the alloc via requesting the next mapping...
> 
> Right, all the higher layers have loops to complete the
> allocation/mapping calls because they always have to assume that a
> single allocation/mapping call will not span the entire range they
> are asking for. If they didn't loop, we'd see broken bits all over
> the place ;)
> 
> > Even if that is the case, it seems at some level this alters the
> > semantics of the extent size hint. Maybe that's fine and we just
> > document it such that rather than extent size hints potentially
> > resulting in 2x allocations, unaligned I/Os simply result in multiple
> > aligned allocations. IIUC, that shouldn't really have much user visible
> > impact, if at all..?
> 
> I don't think it has any visible user impact at all, just a slight
> difference in CPU usage. We'll still end up with the same allocation
> being done because we hold the AGF locked over both the allocations,
> and the "contiguous allocation block target" should result in the
> same free space extents being chosen as if it was a single
> allocation...
> 

Indeed, a bit of a longer path for the overall allocation... and given
that the purpose of the hint is to cause larger allocations, we'll
naturally end up doing less of them overall for a given workload.

The idea that the same allocation is guaranteed doesn't seem always the
case, however. If we bubble up out of xfs_bmapi_write(), the calling
code commits and starts a new transaction. It does look like the
"first_block" mechanism will certainly try to pick up where it left off,
if I'm following that correctly, so perhaps that is the real world
result most of the time (just as an optimization as opposed to a hard
guarantee).

That still seems reasonable to me regardless since we're still doing
extent size allocations. I don't see any major reason why the hint
mechanism needs to guarantee everything is single allocation as opposed
to just ensuring allocations are of the requested size, provided the
file mapping allows it. The way I understand it, we're just disabling a
small optimization that happens to cause problems under certain
conditions. FWIW, another approach could be to limit the scope of the
optimization (e.g., do the outward rounding depending on the size of the
hint with respect to maxextlen), but at that point we're getting into
territory where it makes things even harder to test for questionable
value in return...

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v2] xfs: extent size hints can round up extents past MAXEXTLEN
  2015-04-19 13:33           ` Brian Foster
@ 2015-04-19 23:59             ` Dave Chinner
  0 siblings, 0 replies; 9+ messages in thread
From: Dave Chinner @ 2015-04-19 23:59 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs

On Sun, Apr 19, 2015 at 09:33:01AM -0400, Brian Foster wrote:
> On Sun, Apr 19, 2015 at 09:17:45AM +1000, Dave Chinner wrote:
> > [compund reply]
> > 
> > On Fri, Apr 17, 2015 at 08:58:44AM -0400, Brian Foster wrote:
> > > On Fri, Apr 17, 2015 at 08:28:29AM +1000, Dave Chinner wrote:
...
> > > Even if that is the case, it seems at some level this alters the
> > > semantics of the extent size hint. Maybe that's fine and we just
> > > document it such that rather than extent size hints potentially
> > > resulting in 2x allocations, unaligned I/Os simply result in multiple
> > > aligned allocations. IIUC, that shouldn't really have much user visible
> > > impact, if at all..?
> > 
> > I don't think it has any visible user impact at all, just a slight
> > difference in CPU usage. We'll still end up with the same allocation
> > being done because we hold the AGF locked over both the allocations,
> > and the "contiguous allocation block target" should result in the
> > same free space extents being chosen as if it was a single
> > allocation...
> 
> Indeed, a bit of a longer path for the overall allocation... and given
> that the purpose of the hint is to cause larger allocations, we'll
> naturally end up doing less of them overall for a given workload.
> 
> The idea that the same allocation is guaranteed doesn't seem always the
> case, however. If we bubble up out of xfs_bmapi_write(), the calling
> code commits and starts a new transaction. It does look like the
> "first_block" mechanism will certainly try to pick up where it left off,
> if I'm following that correctly, so perhaps that is the real world
> result most of the time (just as an optimization as opposed to a hard
> guarantee).

Right, that's effectively what will happen - other mechanisms in the
allocator will give contiguous allocation, so we won't notice
anything unusual in most cases.

> That still seems reasonable to me regardless since we're still doing
> extent size allocations. I don't see any major reason why the hint
> mechanism needs to guarantee everything is single allocation as opposed
> to just ensuring allocations are of the requested size, provided the
> file mapping allows it. The way I understand it, we're just disabling a
> small optimization that happens to cause problems under certain
> conditions. FWIW, another approach could be to limit the scope of the
> optimization (e.g., do the outward rounding depending on the size of the
> hint with respect to maxextlen), but at that point we're getting into
> territory where it makes things even harder to test for questionable
> value in return...

Well, that's effectively what the change does - it only rounds
inward if the alignment result is larger than MAXEXTLEN. So in the
majority of cases we aren't ever going to trigger this inward
rounding case and that's likely why it's not been noticed as being
broken. And why we need a fstest to cover this case ;)

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2015-04-19 23:59 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2015-04-16  5:00 [PATCH v2] xfs: extent size hints can round up extents past MAXEXTLEN Dave Chinner
2015-04-16 17:32 ` Brian Foster
2015-04-16 22:28   ` Dave Chinner
2015-04-17  0:03     ` Dave Chinner
2015-04-17 13:01       ` Brian Foster
2015-04-18 23:17         ` Dave Chinner
2015-04-19 13:33           ` Brian Foster
2015-04-19 23:59             ` Dave Chinner
2015-04-17 12:58     ` Brian Foster

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox