* [PATCH 1/2] xfs: dynamic speculative EOF preallocation
2010-10-04 10:13 [RFC, PATCH 0/2] xfs: dynamic speculative preallocation for delalloc Dave Chinner
@ 2010-10-04 10:13 ` Dave Chinner
2010-10-14 17:22 ` Alex Elder
0 siblings, 1 reply; 14+ messages in thread
From: Dave Chinner @ 2010-10-04 10:13 UTC (permalink / raw)
To: xfs
From: Dave Chinner <dchinner@redhat.com>
Currently the size of the speculative preallocation during delayed
allocation is fixed by either the allocsize mount option or a
default size. We are seeing a lot of cases where we need to
recommend using the allocsize mount option to prevent fragmentation
when buffered writes land in the same AG.
Rather than using a fixed preallocation size by default (up to 64k),
make it dynamic by exponentially increasing it on each subsequent
preallocation. This will result in the preallocation size increasing
as the file increases, so for streaming writes we are much more
likely to get large preallocations exactly when we need it to reduce
fragmentation. It should also prevent the need for using the
allocsize mount option for most workloads involving concurrent
streaming writes.
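The growth policy can be sketched as a small standalone model (a sketch only,
not the kernel code; the 4x factor and the MAXEXTLEN cap mirror the patch
below, but the names and the default size are illustrative):

```c
#include <assert.h>

/* Illustrative stand-in for the real XFS maximum extent length. */
#define MODEL_MAXEXTLEN 2097151U	/* 2^21 - 1 blocks */

/*
 * Each successful preallocation quadruples the size used for the next
 * one, starting from the mount's default writeio size and capped at a
 * single maximum-length extent.
 */
unsigned int next_prealloc(unsigned int last_prealloc,
			   unsigned int writeio_blocks)
{
	unsigned int want = last_prealloc * 4;

	if (want < writeio_blocks)
		return writeio_blocks;	/* first time: use the default */
	return want < MODEL_MAXEXTLEN ? want : MODEL_MAXEXTLEN;
}
```

Feeding the result back in as last_prealloc gives the exponential
progression described above, saturating at a single maximum-length extent.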
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
fs/xfs/xfs_inode.h | 1 +
fs/xfs/xfs_iomap.c | 39 +++++++++++++++++++++++++++++++++++++--
2 files changed, 38 insertions(+), 2 deletions(-)
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index 39f8c78..1594190 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -248,6 +248,7 @@ typedef struct xfs_inode {
mrlock_t i_iolock; /* inode IO lock */
struct completion i_flush; /* inode flush completion q */
atomic_t i_pincount; /* inode pin count */
+ unsigned int i_last_prealloc; /* last EOF prealloc size */
wait_queue_head_t i_ipin_wait; /* inode pinning wait queue */
spinlock_t i_flags_lock; /* inode i_flags lock */
/* Miscellaneous state. */
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 2057614..b2e4782 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -389,6 +389,9 @@ error_out:
* If the caller is doing a write at the end of the file, then extend the
* allocation out to the file system's write iosize. We clean up any extra
* space left over when the file is closed in xfs_inactive().
+ *
+ * If we find we already have delalloc preallocation out to alloc_blocks
+ * beyond EOF, don't do more preallocation as it is not needed.
*/
STATIC int
xfs_iomap_eof_want_preallocate(
@@ -405,6 +408,7 @@ xfs_iomap_eof_want_preallocate(
xfs_filblks_t count_fsb;
xfs_fsblock_t firstblock;
int n, error, imaps;
+ int found_delalloc = 0;
*prealloc = 0;
if ((offset + count) <= ip->i_size)
@@ -427,11 +431,25 @@ xfs_iomap_eof_want_preallocate(
if ((imap[n].br_startblock != HOLESTARTBLOCK) &&
(imap[n].br_startblock != DELAYSTARTBLOCK))
return 0;
+
start_fsb += imap[n].br_blockcount;
count_fsb -= imap[n].br_blockcount;
+
+ /* count delalloc blocks beyond EOF */
+ if (imap[n].br_startblock == DELAYSTARTBLOCK)
+ found_delalloc += imap[n].br_blockcount;
}
}
- *prealloc = 1;
+ if (!found_delalloc) {
+ /* haven't got any prealloc, so need some */
+ *prealloc = 1;
+ } else if (found_delalloc <= count_fsb) {
+ /* almost run out of prealloc */
+ *prealloc = 1;
+ } else {
+ /* still lots of prealloc left */
+ *prealloc = 0;
+ }
return 0;
}
@@ -469,6 +487,7 @@ xfs_iomap_write_delay(
extsz = xfs_get_extsz_hint(ip);
offset_fsb = XFS_B_TO_FSBT(mp, offset);
+
error = xfs_iomap_eof_want_preallocate(mp, ip, offset, count,
ioflag, imap, XFS_WRITE_IMAPS, &prealloc);
if (error)
@@ -476,9 +495,25 @@ xfs_iomap_write_delay(
retry:
if (prealloc) {
+ xfs_fileoff_t alloc_blocks = 0;
+ /*
+ * If we don't have a user specified preallocation size, dynamically
+ * increase the preallocation size as we do more preallocation.
+ * Cap the maximum size at a single extent.
+ */
+ if (!(mp->m_flags & XFS_MOUNT_DFLT_IOSIZE)) {
+ alloc_blocks = XFS_FILEOFF_MIN(MAXEXTLEN,
+ (ip->i_last_prealloc * 4));
+ }
+ if (alloc_blocks == 0)
+ alloc_blocks = mp->m_writeio_blocks;
+ ip->i_last_prealloc = alloc_blocks;
+
aligned_offset = XFS_WRITEIO_ALIGN(mp, (offset + count - 1));
ioalign = XFS_B_TO_FSBT(mp, aligned_offset);
- last_fsb = ioalign + mp->m_writeio_blocks;
+ last_fsb = ioalign + alloc_blocks;
+ printk("ino %lld, ioalign 0x%llx, alloc_blocks 0x%llx\n",
+ ip->i_ino, ioalign, alloc_blocks);
} else {
last_fsb = XFS_B_TO_FSB(mp, ((xfs_ufsize_t)(offset + count)));
}
--
1.7.1
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
* Re: [PATCH 1/2] xfs: dynamic speculative EOF preallocation
2010-10-04 10:13 ` [PATCH 1/2] xfs: dynamic speculative EOF preallocation Dave Chinner
@ 2010-10-14 17:22 ` Alex Elder
2010-10-14 21:33 ` Dave Chinner
0 siblings, 1 reply; 14+ messages in thread
From: Alex Elder @ 2010-10-14 17:22 UTC (permalink / raw)
To: Dave Chinner; +Cc: xfs
On Mon, 2010-10-04 at 21:13 +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> Currently the size of the speculative preallocation during delayed
> allocation is fixed by either the allocsize mount option or a
> default size. We are seeing a lot of cases where we need to
> recommend using the allocsize mount option to prevent fragmentation
> when buffered writes land in the same AG.
>
> Rather than using a fixed preallocation size by default (up to 64k),
> make it dynamic by exponentially increasing it on each subsequent
> preallocation. This will result in the preallocation size increasing
> as the file increases, so for streaming writes we are much more
> likely to get large preallocations exactly when we need it to reduce
> fragmentation. It should also prevent the need for using the
> allocsize mount option for most workloads involving concurrent
> streaming writes.
I have some comments, below.
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
. . .
> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> index 2057614..b2e4782 100644
> --- a/fs/xfs/xfs_iomap.c
> +++ b/fs/xfs/xfs_iomap.c
. . .
> @@ -427,11 +431,25 @@ xfs_iomap_eof_want_preallocate(
> if ((imap[n].br_startblock != HOLESTARTBLOCK) &&
> (imap[n].br_startblock != DELAYSTARTBLOCK))
> return 0;
> +
> start_fsb += imap[n].br_blockcount;
> count_fsb -= imap[n].br_blockcount;
> +
> + /* count delalloc blocks beyond EOF */
> + if (imap[n].br_startblock == DELAYSTARTBLOCK)
> + found_delalloc += imap[n].br_blockcount;
> }
> }
> - *prealloc = 1;
At this point, count_fsb will be 0 (a necessary condition
for loop termination, since count_fsb is unsigned). Since
found_delalloc is initially 0 (and is also unsigned), we
can safely assume that found_delalloc >= count_fsb. The
only case in which found_delalloc <= count_fsb is if
found_delalloc is also 0 (a case you cover separately,
first, below).
Furthermore, *prealloc was assigned the value 0 at the top.
So I think this bottom section can be simplified to:
if (!found_delalloc)
*prealloc = 1;
But if that's the case, then maybe the loop can simply
return as soon as it finds a delayed allocation entry.
In other words, the net effect of this is that you
only tell the caller we want preallocation if *no*
preallocated blocks beyond EOF exist. That may be
fine, but it doesn't seem to match your understanding
based on your code, so I just wanted to call your
attention to it.
> + if (!found_delalloc) {
> + /* haven't got any prealloc, so need some */
> + *prealloc = 1;
> + } else if (found_delalloc <= count_fsb) {
> + /* almost run out of prealloc */
> + *prealloc = 1;
> + } else {
> + /* still lots of prealloc left */
> + *prealloc = 0;
> + }
> return 0;
> }
>
> @@ -469,6 +487,7 @@ xfs_iomap_write_delay(
> extsz = xfs_get_extsz_hint(ip);
> offset_fsb = XFS_B_TO_FSBT(mp, offset);
>
> +
This hunk should be killed. It just adds an unwanted
blank line.
> error = xfs_iomap_eof_want_preallocate(mp, ip, offset, count,
> ioflag, imap, XFS_WRITE_IMAPS, &prealloc);
> if (error)
> @@ -476,9 +495,25 @@ xfs_iomap_write_delay(
>
> retry:
> if (prealloc) {
> + xfs_fileoff_t alloc_blocks = 0;
> + /*
> + * If we don't have a user specified preallocation size, dynamically
> + * increase the preallocation size as we do more preallocation.
> + * Cap the maximum size at a single extent.
> + */
> + if (!(mp->m_flags & XFS_MOUNT_DFLT_IOSIZE)) {
I note that this is circumventing the special code in
xfs_set_rw_sizes() that tries to set up a different (smaller)
size in the event the "sync" (generic) mount option was supplied
(indicated by XFS_MOUNT_SYNC). If that is a good thing, then I
suggest the code in xfs_set_rw_sizes() go away. But it would be
good to have the case for making that change stated.
> + alloc_blocks = XFS_FILEOFF_MIN(MAXEXTLEN,
> + (ip->i_last_prealloc * 4));
> + }
So this is the spot that begs the question of whether
the default I/O size mount option is needed any more. The
net effect of your change (assuming no "allocsize" mount
option is in effect) is:
- Initially, ip->i_last_prealloc will be 0. Therefore the
first time through, the preallocated blocks beyond the
end will be based on m_writeio_blocks (either 16KB or
64KB, dependent on whether XFS_MOUNT_WSYNC was specified).
- Thereafter, whenever more preallocated-at-EOF blocks
are needed, the number allocated will be 4 times more
than the last time (growing exponentially), limited by
the maximum extent size.
I guess the reason one might want the "allocsize" mount
option now becomes the opposite of why one might have
wanted it before. I.e., it would be used to *reduce*
the size of the preallocated range beyond EOF, which I
could envision might be reasonable in some circumstances.
> + if (alloc_blocks == 0)
> + alloc_blocks = mp->m_writeio_blocks;
> + ip->i_last_prealloc = alloc_blocks;
> +
> aligned_offset = XFS_WRITEIO_ALIGN(mp, (offset + count - 1));
> ioalign = XFS_B_TO_FSBT(mp, aligned_offset);
> - last_fsb = ioalign + mp->m_writeio_blocks;
> + last_fsb = ioalign + alloc_blocks;
> + printk("ino %lld, ioalign 0x%llx, alloc_blocks 0x%llx\n",
> + ip->i_ino, ioalign, alloc_blocks);
Kill the debug printk() call.
> } else {
> last_fsb = XFS_B_TO_FSB(mp, ((xfs_ufsize_t)(offset + count)));
> }
* Re: [PATCH 1/2] xfs: dynamic speculative EOF preallocation
2010-10-14 17:22 ` Alex Elder
@ 2010-10-14 21:33 ` Dave Chinner
0 siblings, 0 replies; 14+ messages in thread
From: Dave Chinner @ 2010-10-14 21:33 UTC (permalink / raw)
To: Alex Elder; +Cc: xfs
On Thu, Oct 14, 2010 at 12:22:45PM -0500, Alex Elder wrote:
> On Mon, 2010-10-04 at 21:13 +1100, Dave Chinner wrote:
> > @@ -427,11 +431,25 @@ xfs_iomap_eof_want_preallocate(
> > if ((imap[n].br_startblock != HOLESTARTBLOCK) &&
> > (imap[n].br_startblock != DELAYSTARTBLOCK))
> > return 0;
> > +
> > start_fsb += imap[n].br_blockcount;
> > count_fsb -= imap[n].br_blockcount;
> > +
> > + /* count delalloc blocks beyond EOF */
> > + if (imap[n].br_startblock == DELAYSTARTBLOCK)
> > + found_delalloc += imap[n].br_blockcount;
> > }
> > }
> > - *prealloc = 1;
>
> At this point, count_fsb will be 0 (a necessary condition
> for loop termination, since count_fsb is unsigned). Since
> found_delalloc is initially 0 (and is also unsigned), we
> can safely assume that found_delalloc >= count_fsb. The
> only case in which found_delalloc <= count_fsb is if
> found_delalloc is also 0 (a case you cover separately,
> first, below).
>
> Furthermore, *prealloc was assigned the value 0 at the top.
> So I think this bottom section can be simplified to:
> if (!found_delalloc)
> *prealloc = 1;
I'll have a look at this - there was some reason for the second
case, but the code has changed since I needed it and, as you
suggest, it might not be needed anymore.
> > error = xfs_iomap_eof_want_preallocate(mp, ip, offset, count,
> > ioflag, imap, XFS_WRITE_IMAPS, &prealloc);
> > if (error)
> > @@ -476,9 +495,25 @@ xfs_iomap_write_delay(
> >
> > retry:
> > if (prealloc) {
> > + xfs_fileoff_t alloc_blocks = 0;
> > + /*
> > + * If we don't have a user specified preallocation size, dynamically
> > + * increase the preallocation size as we do more preallocation.
> > + * Cap the maximum size at a single extent.
> > + */
> > + if (!(mp->m_flags & XFS_MOUNT_DFLT_IOSIZE)) {
>
> I note that this is circumventing the special code in
> xfs_set_rw_sizes() that tries to set up a different (smaller)
> size in the event the "sync" (generic) mount option was supplied
> (indicated by XFS_MOUNT_SYNC). If that is a good thing, then I
> suggest the code in xfs_set_rw_sizes() go away. But it would be
> good to have the case for making that change stated.
The new code based on file size is a bit different. It still
triggers off the absence of this flag, but it now uses the default
sizes as the minimum speculative allocation size.
> I guess the reason one might want the "allocsize" mount
> option now becomes the opposite of why one might have
> wanted it before. I.e., it would be used to *reduce*
> the size of the preallocated range beyond EOF, which I
> could envision might be reasonable in some circumstances.
It now becomes the minimum preallocation size, rather than both the
minimum and the maximum....
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* [PATCH 0/2] xfs: dynamic speculative allocation beyond EOF V3
@ 2010-11-29 0:43 Dave Chinner
2010-11-29 0:43 ` [PATCH 1/2] xfs: dynamic speculative EOF preallocation Dave Chinner
2010-11-29 0:43 ` [PATCH 2/2] xfs: don't truncate prealloc from frequently accessed inodes Dave Chinner
0 siblings, 2 replies; 14+ messages in thread
From: Dave Chinner @ 2010-11-29 0:43 UTC (permalink / raw)
To: xfs
This is the latest version of the dynamic speculative allocation
beyond EOF patch set. The description of the patchset can be found
here:
http://oss.sgi.com/archives/xfs/2010-10/msg00040.html
Version 3:
- allocsize mount option returned to fixed preallocation size only.
- reduces maximum dynamic prealloc size as the filesystem gets near
full.
- split i_delayed_blks bug fixes into new patch (posted in 2.6.37-rc
bug fix series)
Version 2:
- base speculative preallocation size on current inode size, not the
number of previous speculative allocations.
* [PATCH 1/2] xfs: dynamic speculative EOF preallocation
2010-11-29 0:43 [PATCH 0/2] xfs: dynamic speculative allocation beyond EOF V3 Dave Chinner
@ 2010-11-29 0:43 ` Dave Chinner
2010-12-07 10:17 ` Christoph Hellwig
2010-11-29 0:43 ` [PATCH 2/2] xfs: don't truncate prealloc from frequently accessed inodes Dave Chinner
1 sibling, 1 reply; 14+ messages in thread
From: Dave Chinner @ 2010-11-29 0:43 UTC (permalink / raw)
To: xfs
From: Dave Chinner <dchinner@redhat.com>
Currently the size of the speculative preallocation during delayed
allocation is fixed by either the allocsize mount option or a
default size. We are seeing a lot of cases where we need to
recommend using the allocsize mount option to prevent fragmentation
when buffered writes land in the same AG.
Rather than using a fixed preallocation size by default (up to 64k),
make it dynamic by basing it on the current inode size. That way the
EOF preallocation will increase as the file size increases. Hence
for streaming writes we are much more likely to get large
preallocations exactly when we need it to reduce fragmentation.
For default settings, the size of the initial extents is determined
by the number of parallel writers and the amount of memory in the
machine. For 4GB RAM and 4 concurrent 32GB file writes:
EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL
0: [0..1048575]: 1048672..2097247 0 (1048672..2097247) 1048576
1: [1048576..2097151]: 5242976..6291551 0 (5242976..6291551) 1048576
2: [2097152..4194303]: 12583008..14680159 0 (12583008..14680159) 2097152
3: [4194304..8388607]: 25165920..29360223 0 (25165920..29360223) 4194304
4: [8388608..16777215]: 58720352..67108959 0 (58720352..67108959) 8388608
5: [16777216..33554423]: 117440584..134217791 0 (117440584..134217791) 16777208
6: [33554424..50331511]: 184549056..201326143 0 (184549056..201326143) 16777088
7: [50331512..67108599]: 251657408..268434495 0 (251657408..268434495) 16777088
and for 16 concurrent 16GB file writes:
EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL
0: [0..262143]: 2490472..2752615 0 (2490472..2752615) 262144
1: [262144..524287]: 6291560..6553703 0 (6291560..6553703) 262144
2: [524288..1048575]: 13631592..14155879 0 (13631592..14155879) 524288
3: [1048576..2097151]: 30408808..31457383 0 (30408808..31457383) 1048576
4: [2097152..4194303]: 52428904..54526055 0 (52428904..54526055) 2097152
5: [4194304..8388607]: 104857704..109052007 0 (104857704..109052007) 4194304
6: [8388608..16777215]: 209715304..218103911 0 (209715304..218103911) 8388608
7: [16777216..33554423]: 452984848..469762055 0 (452984848..469762055) 16777208
Because it is hard to take back speculative preallocation, cases
where there are large slow growing log files on a nearly full
filesystem may cause premature ENOSPC. Hence as the filesystem nears
full, the maximum dynamic prealloc size is reduced according to this
table (based on 4k block size):
freespace max prealloc size
>5% full extent (8GB)
4-5% 2GB (8GB >> 2)
3-4% 1GB (8GB >> 3)
2-3% 512MB (8GB >> 4)
1-2% 256MB (8GB >> 5)
<1% 128MB (8GB >> 6)
This should reduce the amount of space held in speculative
preallocation for such cases.
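The table above can be modelled as a small standalone function (a sketch
only; the threshold array mirrors xfs_set_low_space_thresholds() in the
patch below, but the enum and function names here are illustrative):

```c
#include <assert.h>
#include <stdint.h>

/* Free space thresholds, 1% up to 5% of the filesystem. */
enum { LOWSP_1_PCNT, LOWSP_2_PCNT, LOWSP_3_PCNT, LOWSP_4_PCNT,
       LOWSP_5_PCNT, LOWSP_MAX };

/*
 * Return the right shift to apply to the maximum prealloc size:
 * 0 above 5% free space, 2 below 5%, and one more for each
 * further percent of free space lost.
 */
int prealloc_shift(int64_t freesp, int64_t dblocks)
{
	int64_t low_space[LOWSP_MAX];
	int shift = 0;
	int i;

	for (i = 0; i < LOWSP_MAX; i++)
		low_space[i] = dblocks / 100 * (i + 1);

	if (freesp < low_space[LOWSP_5_PCNT]) {
		shift = 2;
		for (i = LOWSP_4_PCNT; i >= LOWSP_1_PCNT; i--)
			if (freesp < low_space[i])
				shift++;
	}
	return shift;
}
```

With a 2^21-block (8GB at 4k) maximum extent, shifts of 2 through 6 give
exactly the 2GB down to 128MB caps in the table.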
The allocsize mount option turns off the dynamic behaviour and fixes
the prealloc size to whatever the mount option specifies. i.e. the
behaviour is unchanged.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
fs/xfs/xfs_fsops.c | 1 +
fs/xfs/xfs_iomap.c | 74 +++++++++++++++++++++++++++++++++++++++++++++-------
fs/xfs/xfs_mount.c | 26 ++++++++++++-----
fs/xfs/xfs_mount.h | 26 +++++++++++++++++-
4 files changed, 107 insertions(+), 20 deletions(-)
diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
index be34ff2..6d17206 100644
--- a/fs/xfs/xfs_fsops.c
+++ b/fs/xfs/xfs_fsops.c
@@ -374,6 +374,7 @@ xfs_growfs_data_private(
mp->m_maxicount = icount << mp->m_sb.sb_inopblog;
} else
mp->m_maxicount = 0;
+ xfs_set_low_space_thresholds(mp);
/* update secondary superblocks. */
for (agno = 1; agno < nagcount; agno++) {
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 2057614..d7c2f05 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -389,6 +389,9 @@ error_out:
* If the caller is doing a write at the end of the file, then extend the
* allocation out to the file system's write iosize. We clean up any extra
* space left over when the file is closed in xfs_inactive().
+ *
+ * If we find we already have delalloc preallocation beyond EOF, don't do more
+ * preallocation as it is not needed.
*/
STATIC int
xfs_iomap_eof_want_preallocate(
@@ -405,6 +408,7 @@ xfs_iomap_eof_want_preallocate(
xfs_filblks_t count_fsb;
xfs_fsblock_t firstblock;
int n, error, imaps;
+ int found_delalloc = 0;
*prealloc = 0;
if ((offset + count) <= ip->i_size)
@@ -427,11 +431,16 @@ xfs_iomap_eof_want_preallocate(
if ((imap[n].br_startblock != HOLESTARTBLOCK) &&
(imap[n].br_startblock != DELAYSTARTBLOCK))
return 0;
+
start_fsb += imap[n].br_blockcount;
count_fsb -= imap[n].br_blockcount;
+
+ if (imap[n].br_startblock == DELAYSTARTBLOCK)
+ found_delalloc = 1;
}
}
- *prealloc = 1;
+ if (!found_delalloc)
+ *prealloc = 1;
return 0;
}
@@ -469,6 +478,7 @@ xfs_iomap_write_delay(
extsz = xfs_get_extsz_hint(ip);
offset_fsb = XFS_B_TO_FSBT(mp, offset);
+
error = xfs_iomap_eof_want_preallocate(mp, ip, offset, count,
ioflag, imap, XFS_WRITE_IMAPS, &prealloc);
if (error)
@@ -476,9 +486,44 @@ xfs_iomap_write_delay(
retry:
if (prealloc) {
+ xfs_fileoff_t alloc_blocks = 0;
+ /*
+ * If we don't have a user specified preallocation size, dynamically
+ * increase the preallocation size as the size of the file
+ * grows. Cap the maximum size at a single extent or less if
+ * the filesystem is near full. The closer the filesystem is to
+ * full, the smaller the maximum preallocation.
+ */
+ if (!(mp->m_flags & XFS_MOUNT_DFLT_IOSIZE)) {
+ int shift = 0;
+ int64_t freesp;
+
+ alloc_blocks = XFS_B_TO_FSB(mp, ip->i_size);
+ alloc_blocks = XFS_FILEOFF_MIN(MAXEXTLEN,
+ rounddown_pow_of_two(alloc_blocks));
+
+ freesp = xfs_icsb_read(mp, XFS_ICSB_FDBLOCKS);
+ if (freesp < mp->m_low_space[XFS_LOWSP_5_PCNT]) {
+ shift = 2;
+ if (freesp < mp->m_low_space[XFS_LOWSP_4_PCNT])
+ shift++;
+ if (freesp < mp->m_low_space[XFS_LOWSP_3_PCNT])
+ shift++;
+ if (freesp < mp->m_low_space[XFS_LOWSP_2_PCNT])
+ shift++;
+ if (freesp < mp->m_low_space[XFS_LOWSP_1_PCNT])
+ shift++;
+ }
+ if (shift)
+ alloc_blocks >>= shift;
+ }
+
+ if (alloc_blocks < mp->m_writeio_blocks)
+ alloc_blocks = mp->m_writeio_blocks;
+
aligned_offset = XFS_WRITEIO_ALIGN(mp, (offset + count - 1));
ioalign = XFS_B_TO_FSBT(mp, aligned_offset);
- last_fsb = ioalign + mp->m_writeio_blocks;
+ last_fsb = ioalign + alloc_blocks;
} else {
last_fsb = XFS_B_TO_FSB(mp, ((xfs_ufsize_t)(offset + count)));
}
@@ -496,22 +541,31 @@ retry:
XFS_BMAPI_DELAY | XFS_BMAPI_WRITE |
XFS_BMAPI_ENTIRE, &firstblock, 1, imap,
&nimaps, NULL);
- if (error && (error != ENOSPC))
+ switch (error) {
+ case 0:
+ case ENOSPC:
+ case EDQUOT:
+ break;
+ default:
return XFS_ERROR(error);
+ }
/*
- * If bmapi returned us nothing, and if we didn't get back EDQUOT,
- * then we must have run out of space - flush all other inodes with
- * delalloc blocks and retry without EOF preallocation.
+ * If bmapi returned us nothing, we got either ENOSPC or EDQUOT. For
+ * ENOSPC, flush all other inodes with delalloc blocks to free up
+ * some of the excess reserved metadata space. For both cases, retry
+ * without EOF preallocation.
*/
if (nimaps == 0) {
trace_xfs_delalloc_enospc(ip, offset, count);
if (flushed)
- return XFS_ERROR(ENOSPC);
+ return XFS_ERROR(error ? error : ENOSPC);
- xfs_iunlock(ip, XFS_ILOCK_EXCL);
- xfs_flush_inodes(ip);
- xfs_ilock(ip, XFS_ILOCK_EXCL);
+ if (error == ENOSPC) {
+ xfs_iunlock(ip, XFS_ILOCK_EXCL);
+ xfs_flush_inodes(ip);
+ xfs_ilock(ip, XFS_ILOCK_EXCL);
+ }
flushed = 1;
error = 0;
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index 3905fc3..98cb266 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -315,14 +315,6 @@ xfs_icsb_sum(
return percpu_counter_sum_positive(&mp->m_icsb[counter]);
}
-static inline int64_t
-xfs_icsb_read(
- struct xfs_mount *mp,
- int counter)
-{
- return percpu_counter_read_positive(&mp->m_icsb[counter]);
-}
-
void
xfs_icsb_reinit_counters(
struct xfs_mount *mp)
@@ -1145,6 +1137,21 @@ xfs_set_rw_sizes(xfs_mount_t *mp)
}
/*
+ * precalculate the low space thresholds for dynamic speculative preallocation.
+ */
+void
+xfs_set_low_space_thresholds(
+ struct xfs_mount *mp)
+{
+ int i;
+
+ for (i = 0; i < XFS_LOWSP_MAX; i++) {
+ mp->m_low_space[i] = mp->m_sb.sb_dblocks / 100 * (i + 1);
+ }
+}
+
+
+/*
* Set whether we're using inode alignment.
*/
STATIC void
@@ -1366,6 +1373,9 @@ xfs_mountfs(
*/
xfs_set_rw_sizes(mp);
+ /* set the low space thresholds for dynamic preallocation */
+ xfs_set_low_space_thresholds(mp);
+
/*
* Set the inode cluster size.
* This may still be overridden by the file system
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index c8ff435..9f99a62 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -75,8 +75,15 @@ enum {
XFS_ICSB_MAX,
};
-extern int xfs_icsb_modify_inodes(struct xfs_mount *, int, int64_t, int);
-extern int xfs_icsb_modify_free_blocks(struct xfs_mount *, int64_t, int);
+/* dynamic preallocation free space thresholds, 5% down to 1% */
+enum {
+ XFS_LOWSP_1_PCNT = 0,
+ XFS_LOWSP_2_PCNT,
+ XFS_LOWSP_3_PCNT,
+ XFS_LOWSP_4_PCNT,
+ XFS_LOWSP_5_PCNT,
+ XFS_LOWSP_MAX,
+};
typedef struct xfs_mount {
struct super_block *m_super;
@@ -172,6 +179,8 @@ typedef struct xfs_mount {
on the next remount,rw */
struct shrinker m_inode_shrink; /* inode reclaim shrinker */
struct percpu_counter m_icsb[XFS_ICSB_MAX];
+ int64_t m_low_space[XFS_LOWSP_MAX];
+ /* low free space thresholds */
} xfs_mount_t;
/*
@@ -333,6 +342,19 @@ extern int xfs_icsb_init_counters(struct xfs_mount *);
extern void xfs_icsb_reinit_counters(struct xfs_mount *);
extern void xfs_icsb_destroy_counters(struct xfs_mount *);
extern void xfs_icsb_sync_counters(struct xfs_mount *);
+extern int xfs_icsb_modify_inodes(struct xfs_mount *, int, int64_t, int);
+extern int xfs_icsb_modify_free_blocks(struct xfs_mount *, int64_t, int);
+
+static inline int64_t
+xfs_icsb_read(
+ struct xfs_mount *mp,
+ int counter)
+{
+ return percpu_counter_read_positive(&mp->m_icsb[counter]);
+}
+
+extern void xfs_set_low_space_thresholds(struct xfs_mount *);
+
#endif /* __KERNEL__ */
--
1.7.2.3
* [PATCH 2/2] xfs: don't truncate prealloc from frequently accessed inodes
2010-11-29 0:43 [PATCH 0/2] xfs: dynamic speculative allocation beyond EOF V3 Dave Chinner
2010-11-29 0:43 ` [PATCH 1/2] xfs: dynamic speculative EOF preallocation Dave Chinner
@ 2010-11-29 0:43 ` Dave Chinner
2010-11-29 9:42 ` Andi Kleen
2010-11-30 17:03 ` Christoph Hellwig
1 sibling, 2 replies; 14+ messages in thread
From: Dave Chinner @ 2010-11-29 0:43 UTC (permalink / raw)
To: xfs
From: Dave Chinner <dchinner@redhat.com>
A long standing problem for streaming writes through the NFS server
has been that the NFS server opens and closes file descriptors on an
inode for every write. The result of this behaviour is that the
->release() function is called on every close and that results in
XFS truncating speculative preallocation beyond the EOF. This has
an adverse effect on file layout when multiple files are being
written at the same time - they interleave their extents and can
result in severe fragmentation.
To avoid this problem, keep a count of the number of ->release calls
made on an inode. For most cases, an inode is only going to be opened
once for writing and then closed again during its lifetime in
cache. Hence if there are multiple ->release calls, there is a good
chance that the inode is being accessed by the NFS server. Hence
count up every time ->release is called while there are delalloc
blocks still outstanding on the inode.
If this count is non-zero when ->release is next called, then do not
truncate away the speculative preallocation - leave it there so that
subsequent writes do not need to reallocate the delalloc space. This
will prevent interleaving of extents of different inodes written
concurrently to the same AG.
If we get this wrong, it is not a big deal as we truncate
speculative allocation beyond EOF anyway in xfs_inactive() when the
inode is thrown out of the cache.
The new counter in the struct xfs_inode fits into a hole in the
structure on 64 bit machines, so does not grow the size of the inode
at all.
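The heuristic can be sketched as a standalone model (a sketch only; the
struct is a stand-in for the real xfs_inode fields, and it simplifies one
detail: the real code re-checks i_delayed_blks after the truncate, not
before):

```c
#include <assert.h>

/* Illustrative stand-in for the relevant xfs_inode state. */
struct model_inode {
	unsigned int delayed_blks;	/* delalloc blocks outstanding */
	int dirty_releases;		/* dirty ->release calls so far */
};

/*
 * Return nonzero if ->release should trim speculative EOF
 * preallocation. After a couple of closes that still had delalloc
 * blocks outstanding (e.g. the NFS server's open-write-close
 * pattern), leave the preallocation in place.
 */
int release_should_trim(struct model_inode *ip)
{
	if (ip->dirty_releases > 1)
		return 0;

	/* Trim now; delalloc blocks remaining mean a dirty close. */
	if (ip->delayed_blks)
		ip->dirty_releases++;
	return 1;
}
```

Clean closes never bump the counter, so a file written once and closed
behaves exactly as before; only repeated dirty closes flip the inode into
the keep-preallocation state.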
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
fs/xfs/xfs_iget.c | 1 +
fs/xfs/xfs_inode.h | 1 +
fs/xfs/xfs_vnodeops.c | 61 ++++++++++++++++++++++++++++++++-----------------
3 files changed, 42 insertions(+), 21 deletions(-)
diff --git a/fs/xfs/xfs_iget.c b/fs/xfs/xfs_iget.c
index 0cdd269..18991a9 100644
--- a/fs/xfs/xfs_iget.c
+++ b/fs/xfs/xfs_iget.c
@@ -84,6 +84,7 @@ xfs_inode_alloc(
memset(&ip->i_d, 0, sizeof(xfs_icdinode_t));
ip->i_size = 0;
ip->i_new_size = 0;
+ ip->i_dirty_releases = 0;
/* prevent anyone from using this yet */
VFS_I(ip)->i_state = I_NEW;
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index fb2ca2e..ea2f34e 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -260,6 +260,7 @@ typedef struct xfs_inode {
xfs_fsize_t i_size; /* in-memory size */
xfs_fsize_t i_new_size; /* size when write completes */
atomic_t i_iocount; /* outstanding I/O count */
+ int i_dirty_releases; /* dirty ->release calls */
/* VFS inode */
struct inode i_vnode; /* embedded VFS inode */
diff --git a/fs/xfs/xfs_vnodeops.c b/fs/xfs/xfs_vnodeops.c
index 8e4a63c..49f3a5a 100644
--- a/fs/xfs/xfs_vnodeops.c
+++ b/fs/xfs/xfs_vnodeops.c
@@ -964,29 +964,48 @@ xfs_release(
xfs_flush_pages(ip, 0, -1, XBF_ASYNC, FI_NONE);
}
- if (ip->i_d.di_nlink != 0) {
- if ((((ip->i_d.di_mode & S_IFMT) == S_IFREG) &&
- ((ip->i_size > 0) || (VN_CACHED(VFS_I(ip)) > 0 ||
- ip->i_delayed_blks > 0)) &&
- (ip->i_df.if_flags & XFS_IFEXTENTS)) &&
- (!(ip->i_d.di_flags &
- (XFS_DIFLAG_PREALLOC | XFS_DIFLAG_APPEND)))) {
+ if (ip->i_d.di_nlink == 0)
+ return 0;
- /*
- * If we can't get the iolock just skip truncating
- * the blocks past EOF because we could deadlock
- * with the mmap_sem otherwise. We'll get another
- * chance to drop them once the last reference to
- * the inode is dropped, so we'll never leak blocks
- * permanently.
- */
- error = xfs_free_eofblocks(mp, ip,
- XFS_FREE_EOF_TRYLOCK);
- if (error)
- return error;
- }
- }
+ if ((((ip->i_d.di_mode & S_IFMT) == S_IFREG) &&
+ ((ip->i_size > 0) || (VN_CACHED(VFS_I(ip)) > 0 ||
+ ip->i_delayed_blks > 0)) &&
+ (ip->i_df.if_flags & XFS_IFEXTENTS)) &&
+ (!(ip->i_d.di_flags & (XFS_DIFLAG_PREALLOC | XFS_DIFLAG_APPEND)))) {
+ /*
+ * If we can't get the iolock just skip truncating the blocks
+ * past EOF because we could deadlock with the mmap_sem
+ * otherwise. We'll get another chance to drop them once the
+ * last reference to the inode is dropped, so we'll never leak
+ * blocks permanently.
+ *
+ * Further, count the number of times we get here in the life
+ * of this inode. If the inode is being opened, written and
+ * closed frequently and we have delayed allocation blocks
+ * outstanding (e.g. streaming writes from the NFS server),
+ * truncating the blocks past EOF will cause fragmentation to
+ * occur.
+ *
+ * In this case don't do the truncation, either, but we have to
+ * be careful how we detect this case. Blocks beyond EOF show
+ * up as i_delayed_blks even when the inode is clean, so we
+ * need to truncate them away first before checking for a dirty
+ * release. Hence on the first couple of dirty closes, we will
+ * still remove the speculative allocation, but then we will
+ * leave it in place.
+ */
+ if (ip->i_dirty_releases > 1)
+ return 0;
+ error = xfs_free_eofblocks(mp, ip,
+ XFS_FREE_EOF_TRYLOCK);
+ if (error)
+ return error;
+
+ /* delalloc blocks after truncation means it really is dirty */
+ if (ip->i_delayed_blks)
+ ip->i_dirty_releases++;
+ }
return 0;
}
--
1.7.2.3
* Re: [PATCH 2/2] xfs: don't truncate prealloc from frequently accessed inodes
2010-11-29 0:43 ` [PATCH 2/2] xfs: don't truncate prealloc from frequently accessed inodes Dave Chinner
@ 2010-11-29 9:42 ` Andi Kleen
2010-11-30 1:00 ` Dave Chinner
2010-11-30 17:03 ` Christoph Hellwig
1 sibling, 1 reply; 14+ messages in thread
From: Andi Kleen @ 2010-11-29 9:42 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-fsdevel, xfs
Dave Chinner <david@fromorbit.com> writes:
>
> To avoid this problem, keep a count of the number of ->release calls
> made on an inode. For most cases, an inode is only going to be opened
> once for writing and then closed again during its lifetime in
> cache. Hence if there are multiple ->release calls, there is a good
> chance that the inode is being accessed by the NFS server. Hence
> count up every time ->release is called while there are delalloc
> blocks still outstanding on the inode.
Seems like a hack. It would be cleaner and less fragile to add a
explicit VFS hint that is passed down from the nfs server, similar
to the existing open intents.
-Andi
--
ak@linux.intel.com -- Speaking for myself only.
* Re: [PATCH 2/2] xfs: don't truncate prealloc from frequently accessed inodes
2010-11-29 9:42 ` Andi Kleen
@ 2010-11-30 1:00 ` Dave Chinner
0 siblings, 0 replies; 14+ messages in thread
From: Dave Chinner @ 2010-11-30 1:00 UTC (permalink / raw)
To: Andi Kleen; +Cc: linux-fsdevel, xfs
On Mon, Nov 29, 2010 at 10:42:29AM +0100, Andi Kleen wrote:
> Dave Chinner <david@fromorbit.com> writes:
> >
> > To avoid this problem, keep a count of the number of ->release calls
> > made on an inode. For most cases, an inode is only going to be opened
> once for writing and then closed again during its lifetime in
> > cache. Hence if there are multiple ->release calls, there is a good
> > chance that the inode is being accessed by the NFS server. Hence
> > count up every time ->release is called while there are delalloc
> > blocks still outstanding on the inode.
>
> Seems like a hack. It would be cleaner and less fragile to add an
> explicit VFS hint that is passed down from the nfs server, similar
> to the existing open intents.
Agreed.
However, we've been asking for the nfsd to change its behaviour for
various operations for quite some time (i.e. years) to help
filesystems behave better, and we're no closer to having it
fixed now than we were 3 or 4 years ago. What the nfsd really needs
is an open file cache so that IO looks like normal file IO rather
than every write being an "open-write-close" operation....
While we wait for nfsd to be fixed, we've still got people reporting
excessive fragmentation during concurrent sequential writes to nfs
servers running XFS, so we really need some kind of fix for the
problem...
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [PATCH 2/2] xfs: don't truncate prealloc from frequently accessed inodes
2010-11-29 0:43 ` [PATCH 2/2] xfs: don't truncate prealloc from frequently accessed inodes Dave Chinner
2010-11-29 9:42 ` Andi Kleen
@ 2010-11-30 17:03 ` Christoph Hellwig
2010-11-30 22:00 ` Dave Chinner
1 sibling, 1 reply; 14+ messages in thread
From: Christoph Hellwig @ 2010-11-30 17:03 UTC (permalink / raw)
To: Dave Chinner; +Cc: xfs
Did any problems show up with just trying to use an inode flag instead
of the counter? I'd really hate to bloat the inode without reason.
* Re: [PATCH 2/2] xfs: don't truncate prealloc from frequently accessed inodes
2010-11-30 17:03 ` Christoph Hellwig
@ 2010-11-30 22:00 ` Dave Chinner
0 siblings, 0 replies; 14+ messages in thread
From: Dave Chinner @ 2010-11-30 22:00 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: xfs
On Tue, Nov 30, 2010 at 12:03:01PM -0500, Christoph Hellwig wrote:
> Did any problems show up with just trying to use an inode flag instead
> of the counter? I'd really hate to bloat the inode without reason.
None that I've noticed in local testing, but I haven't been
focussing on this aspect so I hadn't changed it. I'll change it to a
flag, and we can go back to a counter if necessary.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [PATCH 1/2] xfs: dynamic speculative EOF preallocation
2010-11-29 0:43 ` [PATCH 1/2] xfs: dynamic speculative EOF preallocation Dave Chinner
@ 2010-12-07 10:17 ` Christoph Hellwig
2010-12-07 10:49 ` Dave Chinner
0 siblings, 1 reply; 14+ messages in thread
From: Christoph Hellwig @ 2010-12-07 10:17 UTC (permalink / raw)
To: Dave Chinner; +Cc: xfs
I need the patch below to make the new code link on 32-bit systems:
Index: xfs/fs/xfs/xfs_mount.c
===================================================================
--- xfs.orig/fs/xfs/xfs_mount.c 2010-12-07 11:01:50.394021585 +0100
+++ xfs/fs/xfs/xfs_mount.c 2010-12-07 11:02:46.231256257 +0100
@@ -1111,7 +1111,10 @@ xfs_set_low_space_thresholds(
int i;
for (i = 0; i < XFS_LOWSP_MAX; i++) {
- mp->m_low_space[i] = mp->m_sb.sb_dblocks / 100 * (i + 1);
+ __uint64_t space = mp->m_sb.sb_dblocks;
+ do_div(space, 100);
+
+ mp->m_low_space[i] = space * (i + 1);
}
}
* Re: [PATCH 1/2] xfs: dynamic speculative EOF preallocation
2010-12-07 10:17 ` Christoph Hellwig
@ 2010-12-07 10:49 ` Dave Chinner
0 siblings, 0 replies; 14+ messages in thread
From: Dave Chinner @ 2010-12-07 10:49 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: xfs
On Tue, Dec 07, 2010 at 05:17:46AM -0500, Christoph Hellwig wrote:
> I need the patch below to make the new code link on 32-bit systems:
>
> Index: xfs/fs/xfs/xfs_mount.c
> ===================================================================
> --- xfs.orig/fs/xfs/xfs_mount.c 2010-12-07 11:01:50.394021585 +0100
> +++ xfs/fs/xfs/xfs_mount.c 2010-12-07 11:02:46.231256257 +0100
> @@ -1111,7 +1111,10 @@ xfs_set_low_space_thresholds(
> int i;
>
> for (i = 0; i < XFS_LOWSP_MAX; i++) {
> - mp->m_low_space[i] = mp->m_sb.sb_dblocks / 100 * (i + 1);
> + __uint64_t space = mp->m_sb.sb_dblocks;
> + do_div(space, 100);
> +
> + mp->m_low_space[i] = space * (i + 1);
> }
> }
Yes, makes sense. I'll fold that into the latest version of the
patch. Thanks.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* [PATCH 1/2] xfs: dynamic speculative EOF preallocation
2010-12-13 1:25 [PATCH 0/2] xfs: dynamic speculative allocation beyond EOF V4 Dave Chinner
@ 2010-12-13 1:25 ` Dave Chinner
2010-12-15 18:57 ` Christoph Hellwig
0 siblings, 1 reply; 14+ messages in thread
From: Dave Chinner @ 2010-12-13 1:25 UTC (permalink / raw)
To: xfs
From: Dave Chinner <dchinner@redhat.com>
Currently the size of the speculative preallocation during delayed
allocation is fixed by either the allocsize mount option or a
default size. We are seeing a lot of cases where we need to
recommend using the allocsize mount option to prevent fragmentation
when buffered writes land in the same AG.
Rather than using a fixed preallocation size by default (up to 64k),
make it dynamic by basing it on the current inode size. That way the
EOF preallocation will increase as the file size increases. Hence
for streaming writes we are much more likely to get large
preallocations exactly when we need it to reduce fragmentation.
For default settings, the size of the initial extents is determined
by the number of parallel writers and the amount of memory in the
machine. For 4GB RAM and 4 concurrent 32GB file writes:
EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL
0: [0..1048575]: 1048672..2097247 0 (1048672..2097247) 1048576
1: [1048576..2097151]: 5242976..6291551 0 (5242976..6291551) 1048576
2: [2097152..4194303]: 12583008..14680159 0 (12583008..14680159) 2097152
3: [4194304..8388607]: 25165920..29360223 0 (25165920..29360223) 4194304
4: [8388608..16777215]: 58720352..67108959 0 (58720352..67108959) 8388608
5: [16777216..33554423]: 117440584..134217791 0 (117440584..134217791) 16777208
6: [33554424..50331511]: 184549056..201326143 0 (184549056..201326143) 16777088
7: [50331512..67108599]: 251657408..268434495 0 (251657408..268434495) 16777088
and for 16 concurrent 16GB file writes:
EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL
0: [0..262143]: 2490472..2752615 0 (2490472..2752615) 262144
1: [262144..524287]: 6291560..6553703 0 (6291560..6553703) 262144
2: [524288..1048575]: 13631592..14155879 0 (13631592..14155879) 524288
3: [1048576..2097151]: 30408808..31457383 0 (30408808..31457383) 1048576
4: [2097152..4194303]: 52428904..54526055 0 (52428904..54526055) 2097152
5: [4194304..8388607]: 104857704..109052007 0 (104857704..109052007) 4194304
6: [8388608..16777215]: 209715304..218103911 0 (209715304..218103911) 8388608
7: [16777216..33554423]: 452984848..469762055 0 (452984848..469762055) 16777208
Because it is hard to take back speculative preallocation, cases
where there are large slow growing log files on a nearly full
filesystem may cause premature ENOSPC. Hence as the filesystem nears
full, the maximum dynamic prealloc size is reduced according to this
table (based on 4k block size):
freespace max prealloc size
>5% full extent (8GB)
4-5% 2GB (8GB >> 2)
3-4% 1GB (8GB >> 3)
2-3% 512MB (8GB >> 4)
1-2% 256MB (8GB >> 5)
<1% 128MB (8GB >> 6)
This should reduce the amount of space held in speculative
preallocation for such cases.
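[Editorial note: the table above can be expressed directly. This is an illustrative userspace sketch of the scaling logic only; the names are made up, and the real patch additionally caps the size at MAXEXTLEN, rounds it down to a power of two, and reads free space from a per-cpu counter:]

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_LOWSP_MAX 5

/*
 * Precompute the 1%..5% free-space thresholds from the filesystem's
 * data block count, then derive the right-shift applied to the
 * maximum speculative preallocation size: no shift above 5% free,
 * then >>2 at 4-5% free, growing by one per percent down to >>6
 * below 1% free.
 */
static int demo_prealloc_shift(uint64_t dblocks, uint64_t freesp)
{
	uint64_t low_space[DEMO_LOWSP_MAX];
	int i, shift = 0;

	for (i = 0; i < DEMO_LOWSP_MAX; i++)
		low_space[i] = dblocks / 100 * (uint64_t)(i + 1);

	if (freesp < low_space[4]) {		/* below 5% free */
		shift = 2;
		for (i = 3; i >= 0; i--)	/* 4%, 3%, 2%, 1% */
			if (freesp < low_space[i])
				shift++;
	}
	return shift;
}
```

With 4k blocks and an 8GB maximum extent, a shift of 2 yields the 2GB cap from the table, and a shift of 6 yields 128MB.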
The allocsize mount option turns off the dynamic behaviour and fixes
the prealloc size to whatever the mount option specifies. i.e. the
behaviour is unchanged.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
fs/xfs/xfs_fsops.c | 1 +
fs/xfs/xfs_iomap.c | 85 +++++++++++++++++++++++++++++++++++++++++++++------
fs/xfs/xfs_mount.c | 21 +++++++++++++
fs/xfs/xfs_mount.h | 14 ++++++++
4 files changed, 111 insertions(+), 10 deletions(-)
diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
index be34ff2..6d17206 100644
--- a/fs/xfs/xfs_fsops.c
+++ b/fs/xfs/xfs_fsops.c
@@ -374,6 +374,7 @@ xfs_growfs_data_private(
mp->m_maxicount = icount << mp->m_sb.sb_inopblog;
} else
mp->m_maxicount = 0;
+ xfs_set_low_space_thresholds(mp);
/* update secondary superblocks. */
for (agno = 1; agno < nagcount; agno++) {
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 2057614..40f9612 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -389,6 +389,9 @@ error_out:
* If the caller is doing a write at the end of the file, then extend the
* allocation out to the file system's write iosize. We clean up any extra
* space left over when the file is closed in xfs_inactive().
+ *
+ * If we find we already have delalloc preallocation beyond EOF, don't do more
+ * preallocation as it is not needed.
*/
STATIC int
xfs_iomap_eof_want_preallocate(
@@ -405,6 +408,7 @@ xfs_iomap_eof_want_preallocate(
xfs_filblks_t count_fsb;
xfs_fsblock_t firstblock;
int n, error, imaps;
+ int found_delalloc = 0;
*prealloc = 0;
if ((offset + count) <= ip->i_size)
@@ -427,14 +431,63 @@ xfs_iomap_eof_want_preallocate(
if ((imap[n].br_startblock != HOLESTARTBLOCK) &&
(imap[n].br_startblock != DELAYSTARTBLOCK))
return 0;
+
start_fsb += imap[n].br_blockcount;
count_fsb -= imap[n].br_blockcount;
+
+ if (imap[n].br_startblock == DELAYSTARTBLOCK)
+ found_delalloc = 1;
}
}
- *prealloc = 1;
+ if (!found_delalloc)
+ *prealloc = 1;
return 0;
}
+/*
+ * If we don't have a user specified preallocation size, dynamically increase
+ * the preallocation size as the size of the file grows. Cap the maximum size
+ * at a single extent or less if the filesystem is near full. The closer the
+ * filesystem is to full, the smaller the maximum preallocation.
+ */
+STATIC xfs_fsblock_t
+xfs_iomap_prealloc_size(
+ struct xfs_mount *mp,
+ struct xfs_inode *ip)
+{
+ xfs_fsblock_t alloc_blocks = 0;
+
+ if (!(mp->m_flags & XFS_MOUNT_DFLT_IOSIZE)) {
+ int shift = 0;
+ int64_t freesp;
+
+ alloc_blocks = XFS_B_TO_FSB(mp, ip->i_size);
+ alloc_blocks = XFS_FILEOFF_MIN(MAXEXTLEN,
+ rounddown_pow_of_two(alloc_blocks));
+
+ freesp = percpu_counter_read_positive(
+ &mp->m_icsb[XFS_ICSB_FDBLOCKS]);
+ if (freesp < mp->m_low_space[XFS_LOWSP_5_PCNT]) {
+ shift = 2;
+ if (freesp < mp->m_low_space[XFS_LOWSP_4_PCNT])
+ shift++;
+ if (freesp < mp->m_low_space[XFS_LOWSP_3_PCNT])
+ shift++;
+ if (freesp < mp->m_low_space[XFS_LOWSP_2_PCNT])
+ shift++;
+ if (freesp < mp->m_low_space[XFS_LOWSP_1_PCNT])
+ shift++;
+ }
+ if (shift)
+ alloc_blocks >>= shift;
+ }
+
+ if (alloc_blocks < mp->m_writeio_blocks)
+ alloc_blocks = mp->m_writeio_blocks;
+
+ return alloc_blocks;
+}
+
STATIC int
xfs_iomap_write_delay(
xfs_inode_t *ip,
@@ -469,6 +522,7 @@ xfs_iomap_write_delay(
extsz = xfs_get_extsz_hint(ip);
offset_fsb = XFS_B_TO_FSBT(mp, offset);
+
error = xfs_iomap_eof_want_preallocate(mp, ip, offset, count,
ioflag, imap, XFS_WRITE_IMAPS, &prealloc);
if (error)
@@ -476,9 +530,11 @@ xfs_iomap_write_delay(
retry:
if (prealloc) {
+ xfs_fsblock_t alloc_blocks = xfs_iomap_prealloc_size(mp, ip);
+
aligned_offset = XFS_WRITEIO_ALIGN(mp, (offset + count - 1));
ioalign = XFS_B_TO_FSBT(mp, aligned_offset);
- last_fsb = ioalign + mp->m_writeio_blocks;
+ last_fsb = ioalign + alloc_blocks;
} else {
last_fsb = XFS_B_TO_FSB(mp, ((xfs_ufsize_t)(offset + count)));
}
@@ -496,22 +552,31 @@ retry:
XFS_BMAPI_DELAY | XFS_BMAPI_WRITE |
XFS_BMAPI_ENTIRE, &firstblock, 1, imap,
&nimaps, NULL);
- if (error && (error != ENOSPC))
+ switch (error) {
+ case 0:
+ case ENOSPC:
+ case EDQUOT:
+ break;
+ default:
return XFS_ERROR(error);
+ }
/*
- * If bmapi returned us nothing, and if we didn't get back EDQUOT,
- * then we must have run out of space - flush all other inodes with
- * delalloc blocks and retry without EOF preallocation.
+ * If bmapi returned us nothing, we got either ENOSPC or EDQUOT. For
+ * ENOSPC, flush all other inodes with delalloc blocks to free up
+ * some of the excess reserved metadata space. For both cases, retry
+ * without EOF preallocation.
*/
if (nimaps == 0) {
trace_xfs_delalloc_enospc(ip, offset, count);
if (flushed)
- return XFS_ERROR(ENOSPC);
+ return XFS_ERROR(error ? error : ENOSPC);
- xfs_iunlock(ip, XFS_ILOCK_EXCL);
- xfs_flush_inodes(ip);
- xfs_ilock(ip, XFS_ILOCK_EXCL);
+ if (error == ENOSPC) {
+ xfs_iunlock(ip, XFS_ILOCK_EXCL);
+ xfs_flush_inodes(ip);
+ xfs_ilock(ip, XFS_ILOCK_EXCL);
+ }
flushed = 1;
error = 0;
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index 5e41ef3..fe27338 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -1101,6 +1101,24 @@ xfs_set_rw_sizes(xfs_mount_t *mp)
}
/*
+ * precalculate the low space thresholds for dynamic speculative preallocation.
+ */
+void
+xfs_set_low_space_thresholds(
+ struct xfs_mount *mp)
+{
+ int i;
+
+ for (i = 0; i < XFS_LOWSP_MAX; i++) {
+ __uint64_t space = mp->m_sb.sb_dblocks;
+
+ do_div(space, 100);
+ mp->m_low_space[i] = space * (i + 1);
+ }
+}
+
+
+/*
* Set whether we're using inode alignment.
*/
STATIC void
@@ -1322,6 +1340,9 @@ xfs_mountfs(
*/
xfs_set_rw_sizes(mp);
+ /* set the low space thresholds for dynamic preallocation */
+ xfs_set_low_space_thresholds(mp);
+
/*
* Set the inode cluster size.
* This may still be overridden by the file system
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 03ad25c6..7b42e04 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -75,6 +75,16 @@ enum {
XFS_ICSB_MAX,
};
+/* dynamic preallocation free space thresholds, 5% down to 1% */
+enum {
+ XFS_LOWSP_1_PCNT = 0,
+ XFS_LOWSP_2_PCNT,
+ XFS_LOWSP_3_PCNT,
+ XFS_LOWSP_4_PCNT,
+ XFS_LOWSP_5_PCNT,
+ XFS_LOWSP_MAX,
+};
+
typedef struct xfs_mount {
struct super_block *m_super;
xfs_tid_t m_tid; /* next unused tid for fs */
@@ -169,6 +179,8 @@ typedef struct xfs_mount {
on the next remount,rw */
struct shrinker m_inode_shrink; /* inode reclaim shrinker */
struct percpu_counter m_icsb[XFS_ICSB_MAX];
+ int64_t m_low_space[XFS_LOWSP_MAX];
+ /* low free space thresholds */
} xfs_mount_t;
/*
@@ -333,6 +345,8 @@ extern void xfs_icsb_sync_counters(struct xfs_mount *);
extern int xfs_icsb_modify_inodes(struct xfs_mount *, int, int64_t);
extern int xfs_icsb_modify_free_blocks(struct xfs_mount *, int64_t, int);
+extern void xfs_set_low_space_thresholds(struct xfs_mount *);
+
#endif /* __KERNEL__ */
extern void xfs_mod_sb(struct xfs_trans *, __int64_t);
--
1.7.2.3
* Re: [PATCH 1/2] xfs: dynamic speculative EOF preallocation
2010-12-13 1:25 ` [PATCH 1/2] xfs: dynamic speculative EOF preallocation Dave Chinner
@ 2010-12-15 18:57 ` Christoph Hellwig
0 siblings, 0 replies; 14+ messages in thread
From: Christoph Hellwig @ 2010-12-15 18:57 UTC (permalink / raw)
To: Dave Chinner; +Cc: xfs
With this patch my 32-bit x86 VM hangs when running test 014, although
I can interrupt it. With the full patchset applied I get softlockup
warnings in this test for xfsaild, which seems related. Also test 012
fails, but that might be intentional (?).
Thread overview: 14+ messages
2010-11-29 0:43 [PATCH 0/2] xfs: dynamic speculative allocation beyond EOF V3 Dave Chinner
2010-11-29 0:43 ` [PATCH 1/2] xfs: dynamic speculative EOF preallocation Dave Chinner
2010-12-07 10:17 ` Christoph Hellwig
2010-12-07 10:49 ` Dave Chinner
2010-11-29 0:43 ` [PATCH 2/2] xfs: don't truncate prealloc from frequently accessed inodes Dave Chinner
2010-11-29 9:42 ` Andi Kleen
2010-11-30 1:00 ` Dave Chinner
2010-11-30 17:03 ` Christoph Hellwig
2010-11-30 22:00 ` Dave Chinner
-- strict thread matches above, loose matches on Subject: below --
2010-12-13 1:25 [PATCH 0/2] xfs: dynamic speculative allocation beyond EOF V4 Dave Chinner
2010-12-13 1:25 ` [PATCH 1/2] xfs: dynamic speculative EOF preallocation Dave Chinner
2010-12-15 18:57 ` Christoph Hellwig
2010-10-04 10:13 [RFC, PATCH 0/2] xfs: dynamic speculative preallocation for delalloc Dave Chinner
2010-10-04 10:13 ` [PATCH 1/2] xfs: dynamic speculative EOF preallocation Dave Chinner
2010-10-14 17:22 ` Alex Elder
2010-10-14 21:33 ` Dave Chinner