* [RFC, PATCH 0/2] xfs: dynamic speculative preallocation for delalloc
@ 2010-10-04 10:13 Dave Chinner
2010-10-04 10:13 ` [PATCH 1/2] xfs: dynamic speculative EOF preallocation Dave Chinner
` (2 more replies)
0 siblings, 3 replies; 22+ messages in thread
From: Dave Chinner @ 2010-10-04 10:13 UTC (permalink / raw)
To: xfs
When multiple concurrent streaming writes land in the same AG,
allocation of extents interleaves between inodes and causes
excessive fragmentation of the files being written. That instead of
getting maximally sized extents, we'll get writeback range sized
extents interleaved on disk. that is for four files A, B, C and D,
we'll end up with extents like:
+---+---+---+---+---+---+---+---+---+---+---+---+
A1 B1 C1 D1 A2 B2 C2 A3 D2 C3 B3 D3 .....
instead of:
+-----------+-----------+-----------+-----------+
A B C D
It is well known that using the allocsize mount option makes the
allocator behaviour much better and more likely to result in
the second layout above than the first, but that doesn't work in all
situations (e.g. writes from the NFS server). I think that we should
not be relying on manual configuration to solve this problem.
To demonstrate, writing 4 x 64GB files in parallel (16TB volume,
inode64 so all files land in same AG, 700MB/s write speed)
$ for i in `seq 0 1 3`; do
> dd if=/dev/zero of=/mnt/scratch/test.$i bs=64k count=1048576 &
> done
....
results in:
$ for i in `seq 0 1 3`; do
> sudo xfs_bmap -vvp /mnt/scratch/test.$i | grep ": \[" | wc -l
> done
777
196
804
784
$
This shows an average extent size on three of files of 80MB, and
320MB for the other file. The level of fragmentation varies
throughout the files, and varies greatly from run to run. To
demonstrate allocsize=1g:
$ for i in `seq 0 1 3`; do
> sudo xfs_bmap -vvp /mnt/scratch/test.$i | grep ": \[" | wc -l
> done
64
64
64
64
$
Which is 64x1GB extents per file, as we would expect. However, we
can do better than that - with this dynamic speculative
preallocation patch:
$ for i in `seq 0 1 3`; do
> sudo xfs_bmap -vvp /mnt/scratch/test.$i | grep ": \[" | wc -l
> done
9
9
9
9
$
Which gives extent sizes of a maximal 8GB (i.e. perfect):
$ sudo xfs_bmap -vv /mnt/scratch/test.0
/mnt/scratch/test.0:
EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL
0: [0..16777207]: 96..16777303 0 (96..16777303) 16777208
1: [16777208..33554295]: 91344616..108121703 0 (91344616..108121703) 16777088
2: [33554296..50331383]: 158452968..175230055 0 (158452968..175230055) 16777088
3: [50331384..67108471]: 225561320..242338407 0 (225561320..242338407) 16777088
4: [67108472..83885559]: 292669672..309446759 0 (292669672..309446759) 16777088
5: [83885560..100662647]: 359778024..376555111 0 (359778024..376555111) 16777088
6: [100662648..117439735]: 426886376..443663463 0 (426886376..443663463) 16777088
7: [117439736..134216823]: 510771816..527548903 0 (510771816..527548903) 16777088
8: [134216824..134217727]: 594657256..594658159 0 (594657256..594658159) 904
$
The same results occur for tests running 16 and 64 sequential
writers into the same AG - extents of 8GB in all files, so
this is a major improvement in default behaviour and effectively
means we do not need the allocsize mount option anymore.
Worth noting is that the extents still interleave between files -
that problem still exists - but the size of the extents now means
that sequential read and write rates are not going to be affected
by excessive seeks between extents within each file.
Given this demonstratably improves allocation patterns, the only
question that remains in my mind is exactly what algorithm to use to
scale the preallocation. The current patch records the last
prealloc size and increases the next one from that. While that
preovides good results, it will cause problems when interacting with
truncation. It also means that a file may have a substantial amount
of preallocatin beyond EOF - maybe several times the size of the
file.
However, the current algorithm does work well when writing lots of
relatively small files (e.g. up to a few tens of megabytes), as
increasing the preallocation size fast reduces the chances of
interleaving small allocations.
I've been thinking that basing the preallocation size on the current
file size - say preallocate half the size of the file, is a better
option once file sizes start to grow large (more than a few tens of
of megabytes), so maybe a combination of the two is a better idea
(increase exponentially up to default^2 (4MB prealloc), then take
min(max(i_size / 2, default^2), XFS_MAXEXTLEN) as the prealloc size
so that we don't do excessive amounts of preallocation?
--
We need to make the same write patterns result in equivalent
allocation patterns even when they come through the NFS server.
Right now the NFS server uses a file descriptor for each write that
comes across the wire. This means that the ->release function is
called after every write, and that means XFS will be truncating away
the speculative preallocation it did during the write. Hence we get
interleaving files and fragmentation.
To avoid this problem, detect when the ->release function is being
called repeatedly on an inode that has delayed allocation
outstanding. If this happenѕ twice in a row, then don't truncate the
speculative allocation away. This ensures that the speculative
preallocation is preserved until the delalloc blocks are converted
to real extents during writeback.
The result of this is that concurrent files written by NFS will tend
to have a small first extent (due to specultive prealloc being
truncated once), followed by 4-8GB extents that interleave
identically to the above local dd exmaples. I have tested this for
4, 16 and 64 concurrent writers from multiple NFS clients. The
result for 2 clients each writing 16x16GB files (32 all up):
$ for i in `seq 0 1 31`; do
> sudo xfs_bmap -vv /mnt/scratch/test.$i |grep ": \[" | wc -l
> done | uniq -c
1 2
31 3
Mostly a combination of 4GB and 8GB extents, instead of severe
fragmentation. The typical layout was:
/mnt/scratch/test.1:
EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL
0: [0..8388607]: 225562280..233950887 0 (225562280..233950887) 8388608
1: [8388608..25165815]: 410111608..426888815 0 (410111608..426888815) 16777208
2: [25165816..33554431]: 896648152..905036767 0 (896648152..905036767) 8388616
These results are using NFSv3, and the per-file write rate is only
~3MB/s. Hence it can be seen that the dynamic preallocation works
for both high and low per-file write throughput.
Comments welcome.
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply [flat|nested] 22+ messages in thread* [PATCH 1/2] xfs: dynamic speculative EOF preallocation 2010-10-04 10:13 [RFC, PATCH 0/2] xfs: dynamic speculative preallocation for delalloc Dave Chinner @ 2010-10-04 10:13 ` Dave Chinner 2010-10-14 17:22 ` Alex Elder 2010-10-04 10:13 ` [PATCH 2/2] xfs: don't truncate prealloc from frequently accessed inodes Dave Chinner 2010-10-14 17:22 ` [RFC, PATCH 0/2] xfs: dynamic speculative preallocation for delalloc Alex Elder 2 siblings, 1 reply; 22+ messages in thread From: Dave Chinner @ 2010-10-04 10:13 UTC (permalink / raw) To: xfs From: Dave Chinner <dchinner@redhat.com> Currently the size of the speculative preallocation during delayed allocation is fixed by either the allocsize mount option of a default size. We are seeing a lot of cases where we need to recommend using the allocsize mount option to prevent fragmentation when buffered writes land in the same AG. Rather than using a fixed preallocation size by default (up to 64k), make it dynamic by exponentially increasing it on each subsequent preallocation. This will result in the preallocation size increasing as the file increases, so for streaming writes we are much more likely to get large preallocations exactly when we need it to reduce fragementation. It should also prevent the need for using the allocsize mount option for most workloads involving concurrent streaming writes. Signed-off-by: Dave Chinner <dchinner@redhat.com> --- fs/xfs/xfs_inode.h | 1 + fs/xfs/xfs_iomap.c | 39 +++++++++++++++++++++++++++++++++++++-- 2 files changed, 38 insertions(+), 2 deletions(-) diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h index 39f8c78..1594190 100644 --- a/fs/xfs/xfs_inode.h +++ b/fs/xfs/xfs_inode.h @@ -248,6 +248,7 @@ typedef struct xfs_inode { mrlock_t i_iolock; /* inode IO lock */ struct completion i_flush; /* inode flush completion q */ atomic_t i_pincount; /* inode pin count */ + unsigned int i_last_prealloc; /* last EOF prealloc size */ wait_queue_head_t i_ipin_wait; /* inode pinning wait queue */ spinlock_t i_flags_lock; /* inode i_flags lock */ /* Miscellaneous state. */ diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c index 2057614..b2e4782 100644 --- a/fs/xfs/xfs_iomap.c +++ b/fs/xfs/xfs_iomap.c @@ -389,6 +389,9 @@ error_out: * If the caller is doing a write at the end of the file, then extend the * allocation out to the file system's write iosize. We clean up any extra * space left over when the file is closed in xfs_inactive(). + * + * If we find we already have delalloc preallocation out to alloc_blocks + * beyond EOF, don't do more preallocation as it it not needed. */ STATIC int xfs_iomap_eof_want_preallocate( @@ -405,6 +408,7 @@ xfs_iomap_eof_want_preallocate( xfs_filblks_t count_fsb; xfs_fsblock_t firstblock; int n, error, imaps; + int found_delalloc = 0; *prealloc = 0; if ((offset + count) <= ip->i_size) @@ -427,11 +431,25 @@ xfs_iomap_eof_want_preallocate( if ((imap[n].br_startblock != HOLESTARTBLOCK) && (imap[n].br_startblock != DELAYSTARTBLOCK)) return 0; + start_fsb += imap[n].br_blockcount; count_fsb -= imap[n].br_blockcount; + + /* count delalloc blocks beyond EOF */ + if (imap[n].br_startblock == DELAYSTARTBLOCK) + found_delalloc += imap[n].br_blockcount; } } - *prealloc = 1; + if (!found_delalloc) { + /* haven't got any prealloc, so need some */ + *prealloc = 1; + } else if (found_delalloc <= count_fsb) { + /* almost run out of prealloc */ + *prealloc = 1; + } else { + /* still lots of prealloc left */ + *prealloc = 0; + } return 0; } @@ -469,6 +487,7 @@ xfs_iomap_write_delay( extsz = xfs_get_extsz_hint(ip); offset_fsb = XFS_B_TO_FSBT(mp, offset); + error = xfs_iomap_eof_want_preallocate(mp, ip, offset, count, ioflag, imap, XFS_WRITE_IMAPS, &prealloc); if (error) @@ -476,9 +495,25 @@ xfs_iomap_write_delay( retry: if (prealloc) { + xfs_fileoff_t alloc_blocks = 0; + /* + * If we don't have a user specified preallocation size, dynamically + * increase the preallocation size as we do more preallocation. + * Cap the maximum size at a single extent. + */ + if (!(mp->m_flags & XFS_MOUNT_DFLT_IOSIZE)) { + alloc_blocks = XFS_FILEOFF_MIN(MAXEXTLEN, + (ip->i_last_prealloc * 4)); + } + if (alloc_blocks == 0) + alloc_blocks = mp->m_writeio_blocks; + ip->i_last_prealloc = alloc_blocks; + aligned_offset = XFS_WRITEIO_ALIGN(mp, (offset + count - 1)); ioalign = XFS_B_TO_FSBT(mp, aligned_offset); - last_fsb = ioalign + mp->m_writeio_blocks; + last_fsb = ioalign + alloc_blocks; + printk("ino %lld, ioalign 0x%llx, alloc_blocks 0x%llx\n", + ip->i_ino, ioalign, alloc_blocks); } else { last_fsb = XFS_B_TO_FSB(mp, ((xfs_ufsize_t)(offset + count))); } -- 1.7.1 _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply related [flat|nested] 22+ messages in thread
* Re: [PATCH 1/2] xfs: dynamic speculative EOF preallocation 2010-10-04 10:13 ` [PATCH 1/2] xfs: dynamic speculative EOF preallocation Dave Chinner @ 2010-10-14 17:22 ` Alex Elder 2010-10-14 21:33 ` Dave Chinner 0 siblings, 1 reply; 22+ messages in thread From: Alex Elder @ 2010-10-14 17:22 UTC (permalink / raw) To: Dave Chinner; +Cc: xfs On Mon, 2010-10-04 at 21:13 +1100, Dave Chinner wrote: > From: Dave Chinner <dchinner@redhat.com> > > Currently the size of the speculative preallocation during delayed > allocation is fixed by either the allocsize mount option of a > default size. We are seeing a lot of cases where we need to > recommend using the allocsize mount option to prevent fragmentation > when buffered writes land in the same AG. > > Rather than using a fixed preallocation size by default (up to 64k), > make it dynamic by exponentially increasing it on each subsequent > preallocation. This will result in the preallocation size increasing > as the file increases, so for streaming writes we are much more > likely to get large preallocations exactly when we need it to reduce > fragementation. It should also prevent the need for using the > allocsize mount option for most workloads involving concurrent > streaming writes. I have some comments, below. > Signed-off-by: Dave Chinner <dchinner@redhat.com> > --- . . . > diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c > index 2057614..b2e4782 100644 > --- a/fs/xfs/xfs_iomap.c > +++ b/fs/xfs/xfs_iomap.c . . . > @@ -427,11 +431,25 @@ xfs_iomap_eof_want_preallocate( > if ((imap[n].br_startblock != HOLESTARTBLOCK) && > (imap[n].br_startblock != DELAYSTARTBLOCK)) > return 0; > + > start_fsb += imap[n].br_blockcount; > count_fsb -= imap[n].br_blockcount; > + > + /* count delalloc blocks beyond EOF */ > + if (imap[n].br_startblock == DELAYSTARTBLOCK) > + found_delalloc += imap[n].br_blockcount; > } > } > - *prealloc = 1; At this point, count_fsb will be 0 (a necessary condition for loop termination, since count_fsb is unsigned). Since found_delalloc is initially 0 (and is also unsigned), we can safely assume that found_delalloc >= count_fsb. The only case in which found_delalloc <= count_fsb is if found_delalloc is also 0 (a case you cover separately, first, below). Furthermore, *prealloc was assigned the value 0 at the top. So I think this bottom section can be simplified to: if (!found_delalloc) *prealloc = 1; But if that's the case, then maybe the loop can simply return as soon as it finds a delayed allocation entry. In other words, the net effect of this is that you only tell the caller we want preallocation if *no* preallocated blocks beyond EOF exist. That may be fine, but it doesn't seem to match your understanding based on your code, so I just wanted to call your attention to it. > + if (!found_delalloc) { > + /* haven't got any prealloc, so need some */ > + *prealloc = 1; > + } else if (found_delalloc <= count_fsb) { > + /* almost run out of prealloc */ > + *prealloc = 1; > + } else { > + /* still lots of prealloc left */ > + *prealloc = 0; > + } > return 0; > } > > @@ -469,6 +487,7 @@ xfs_iomap_write_delay( > extsz = xfs_get_extsz_hint(ip); > offset_fsb = XFS_B_TO_FSBT(mp, offset); > > + This is hunk should be killed. It just adds an unwanted blank line. > error = xfs_iomap_eof_want_preallocate(mp, ip, offset, count, > ioflag, imap, XFS_WRITE_IMAPS, &prealloc); > if (error) > @@ -476,9 +495,25 @@ xfs_iomap_write_delay( > > retry: > if (prealloc) { > + xfs_fileoff_t alloc_blocks = 0; > + /* > + * If we don't have a user specified preallocation size, dynamically > + * increase the preallocation size as we do more preallocation. > + * Cap the maximum size at a single extent. > + */ > + if (!(mp->m_flags & XFS_MOUNT_DFLT_IOSIZE)) { I note that this is circumventing the special code in xfs_set_rw_sizes() that tries to set up a different (smaller) size in the event the "sync" (generic) mount option was supplied (indicated by XFS_MOUNT_SYNC). If that is a good thing, then I suggest the code in xfs_set_rw_sizes() go away. But it would be good to have the case for making that change stated. > + alloc_blocks = XFS_FILEOFF_MIN(MAXEXTLEN, > + (ip->i_last_prealloc * 4)); > + } So this is the spot that begs the question of whether the the default I/O size mount option is needed any more. The net effect of your change (assuming no "allocsize" mount option is in effect) is: - Initially, ip->i_last_prealloc will be 0. Therefore the first time through, the preallocated blocks beyond the end will be based on m_writeio_blocks (either 16KB or 64KB, dependent on whether XFS_MOUNT_WSYNC was specified). - Thereafter, whenever more preallocated-at-EOF blocks are needed, the number allocated will be 4 times more than the last time (growing exponentially), limited by the maximum extent size. I guess the reason one might want the "allocsize" mount option now becomes the opposite of why one might have wanted it before. I.e., it would be used to *reduce* the size of the preallocated range beyond EOF, which I could envision might be reasonable in some circumstances. > + if (alloc_blocks == 0) > + alloc_blocks = mp->m_writeio_blocks; > + ip->i_last_prealloc = alloc_blocks; > + > aligned_offset = XFS_WRITEIO_ALIGN(mp, (offset + count - 1)); > ioalign = XFS_B_TO_FSBT(mp, aligned_offset); > - last_fsb = ioalign + mp->m_writeio_blocks; > + last_fsb = ioalign + alloc_blocks; > + printk("ino %lld, ioalign 0x%llx, alloc_blocks 0x%llx\n", > + ip->i_ino, ioalign, alloc_blocks); Kill the debug printk() call. > } else { > last_fsb = XFS_B_TO_FSB(mp, ((xfs_ufsize_t)(offset + count))); > } _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 1/2] xfs: dynamic speculative EOF preallocation 2010-10-14 17:22 ` Alex Elder @ 2010-10-14 21:33 ` Dave Chinner 2010-10-15 6:51 ` allocsize mount option, was: " Michael Monnerie 0 siblings, 1 reply; 22+ messages in thread From: Dave Chinner @ 2010-10-14 21:33 UTC (permalink / raw) To: Alex Elder; +Cc: xfs On Thu, Oct 14, 2010 at 12:22:45PM -0500, Alex Elder wrote: > On Mon, 2010-10-04 at 21:13 +1100, Dave Chinner wrote: > > @@ -427,11 +431,25 @@ xfs_iomap_eof_want_preallocate( > > if ((imap[n].br_startblock != HOLESTARTBLOCK) && > > (imap[n].br_startblock != DELAYSTARTBLOCK)) > > return 0; > > + > > start_fsb += imap[n].br_blockcount; > > count_fsb -= imap[n].br_blockcount; > > + > > + /* count delalloc blocks beyond EOF */ > > + if (imap[n].br_startblock == DELAYSTARTBLOCK) > > + found_delalloc += imap[n].br_blockcount; > > } > > } > > - *prealloc = 1; > > At this point, count_fsb will be 0 (a necessary condition > for loop termination, since count_fsb is unsigned). Since > found_delalloc is initially 0 (and is also unsigned), we > can safely assume that found_delalloc >= count_fsb. The > only case in which found_delalloc <= count_fsb is if > found_delalloc is also 0 (a case you cover separately, > first, below). > > Furthermore, *prealloc was assigned the value 0 at the top. > So I think this bottom section can be simplified to: > if (!found_delalloc) > *prealloc = 1; I'll have a look at this - there was some reason for the second case, but the code has changed since I needed it and, as you suggest, it might not be needed anymore. > > error = xfs_iomap_eof_want_preallocate(mp, ip, offset, count, > > ioflag, imap, XFS_WRITE_IMAPS, &prealloc); > > if (error) > > @@ -476,9 +495,25 @@ xfs_iomap_write_delay( > > > > retry: > > if (prealloc) { > > + xfs_fileoff_t alloc_blocks = 0; > > + /* > > + * If we don't have a user specified preallocation size, dynamically > > + * increase the preallocation size as we do more preallocation. > > + * Cap the maximum size at a single extent. > > + */ > > + if (!(mp->m_flags & XFS_MOUNT_DFLT_IOSIZE)) { > > I note that this is circumventing the special code in > xfs_set_rw_sizes() that tries to set up a different (smaller) > size in the event the "sync" (generic) mount option was supplied > (indicated by XFS_MOUNT_SYNC). If that is a good thing, then I > suggest the code in xfs_set_rw_sizes() go away. But it would be > good to have the case for making that change stated. The new code based on file size is a bit different. It still triggers off the absence of his flag, but it now uses the default sizes as the minimum speculative allocation size. > I guess the reason one might want the "allocsize" mount > option now becomes the opposite of why one might have > wanted it before. I.e., it would be used to *reduce* > the size of the preallocated range beyond EOF, which I > could envision might be reasonable in some circumstances. It now becomes the minimum preallocation size, rather than both the minimum and the maximum.... Cheers, Dave. -- Dave Chinner david@fromorbit.com _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: allocsize mount option, was: [PATCH 1/2] xfs: dynamic speculative EOF preallocation 2010-10-14 21:33 ` Dave Chinner @ 2010-10-15 6:51 ` Michael Monnerie 2010-10-15 11:59 ` Dave Chinner 0 siblings, 1 reply; 22+ messages in thread From: Michael Monnerie @ 2010-10-15 6:51 UTC (permalink / raw) To: xfs [-- Attachment #1.1: Type: Text/Plain, Size: 1183 bytes --] On Donnerstag, 14. Oktober 2010 Dave Chinner wrote: > > I guess the reason one might want the "allocsize" mount > > option now becomes the opposite of why one might have > > wanted it before. I.e., it would be used to reduce > > the size of the preallocated range beyond EOF, which I > > could envision might be reasonable in some circumstances. > > It now becomes the minimum preallocation size, rather than both the > minimum and the maximum.... Until now, I often set allocsize to be <nr of data disks>*<stripe size>, i.e. in a 8 disk RAID-6 with 64KB stripe size = 6*64 = 384KB I guess this should provide the best performance. Is my assumption true? Will it change with the new code? Does XFS automatically use allocsize=<1 full stripe> so I can skip my manual allocsize options? -- mit freundlichen Grüssen, Michael Monnerie, Ing. BSc it-management Internet Services http://proteger.at [gesprochen: Prot-e-schee] Tel: 0660 / 415 65 31 ****** Radiointerview zum Thema Spam ****** http://www.it-podcast.at/archiv.html#podcast-100716 // Wir haben im Moment zwei Häuser zu verkaufen: // http://zmi.at/langegg/ // http://zmi.at/haus2009/ [-- Attachment #1.2: This is a digitally signed message part. --] [-- Type: application/pgp-signature, Size: 198 bytes --] [-- Attachment #2: Type: text/plain, Size: 121 bytes --] _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: allocsize mount option, was: [PATCH 1/2] xfs: dynamic speculative EOF preallocation 2010-10-15 6:51 ` allocsize mount option, was: " Michael Monnerie @ 2010-10-15 11:59 ` Dave Chinner 0 siblings, 0 replies; 22+ messages in thread From: Dave Chinner @ 2010-10-15 11:59 UTC (permalink / raw) To: Michael Monnerie; +Cc: xfs On Fri, Oct 15, 2010 at 08:51:24AM +0200, Michael Monnerie wrote: > On Donnerstag, 14. Oktober 2010 Dave Chinner wrote: > > > I guess the reason one might want the "allocsize" mount > > > option now becomes the opposite of why one might have > > > wanted it before. I.e., it would be used to reduce > > > the size of the preallocated range beyond EOF, which I > > > could envision might be reasonable in some circumstances. > > > > It now becomes the minimum preallocation size, rather than both the > > minimum and the maximum.... > > Until now, I often set allocsize to be <nr of data disks>*<stripe size>, > i.e. in a 8 disk RAID-6 with 64KB stripe size = 6*64 = 384KB > I guess this should provide the best performance. It's not doing what you think it is: > Is my assumption true? No. allocsize=size Sets the buffered I/O end-of-file preallocation size when doing delayed allocation writeout (default size is 64KiB). Valid values for this option are page size (typically 4KiB) through to 1GiB, inclusive, in power-of-2 increments. ^^^^^^^^^^^^^^^^^^^^^^^^^ The code will round the value down to the nearest power of 2, which means you're actually telling it to preallocate 256k at a time. > Will it change with the new code? Entirely possible. We can and do change behaviour of mount options when it results in an improvement. > Does XFS automatically use allocsize=<1 full stripe> so I can skip my > manual allocsize options? No. I will refer you to the swalloc mount option, though: swalloc Data allocations will be rounded up to stripe width boundaries when the current end of file is being extended and the file size is larger than the stripe width size. Which affects both delayed allocation (after speculative prealloc has been calculated) and physical allocation for direct IO. Cheers, Dave. -- Dave Chinner david@fromorbit.com _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 22+ messages in thread
* [PATCH 2/2] xfs: don't truncate prealloc from frequently accessed inodes 2010-10-04 10:13 [RFC, PATCH 0/2] xfs: dynamic speculative preallocation for delalloc Dave Chinner 2010-10-04 10:13 ` [PATCH 1/2] xfs: dynamic speculative EOF preallocation Dave Chinner @ 2010-10-04 10:13 ` Dave Chinner 2010-10-14 17:22 ` Alex Elder 2010-10-14 17:22 ` [RFC, PATCH 0/2] xfs: dynamic speculative preallocation for delalloc Alex Elder 2 siblings, 1 reply; 22+ messages in thread From: Dave Chinner @ 2010-10-04 10:13 UTC (permalink / raw) To: xfs From: Dave Chinner <dchinner@redhat.com> A long standing problem for streaming writeѕ through the NFS server has been that the NFS server opens and closes file descriptors on an inode for every write. The result of this behaviour is that the ->release() function is called on every close and that results in XFS truncating speculative preallocation beyond the EOF. This has an adverse effect on file layout when multiple files are being written at the same time - they interleave their extents and can result in severe fragmentation. To avoid this problem, keep a count of the number of ->release calls made on an inode. For most cases, an inode is only going to be opened once for writing and then closed again during it's lifetime in cache. Hence if there are multiple ->release calls, there is a good chance that the inode is being accessed by the NFS server. Hence count up every time ->release is called while there are delalloc blocks still outstanding on the inode. If this count is non-zero when ->release is next called, then do no truncate away the speculative preallocation - leave it there so that subsequent writes do not need to reallocate the delalloc space. This will prevent interleaving of extents of different inodes written concurrently to the same AG. If we get this wrong, it is not a big deal as we truncate speculative allocation beyond EOF anyway in xfs_inactive() when the inode is thrown out of the cache. The new counter in the struct xfs_inode fits into a hole in the structure on 64 bit machines, so does not grow the size of the inode at all. Signed-off-by: Dave Chinner <dchinner@redhat.com> --- fs/xfs/xfs_inode.h | 1 + fs/xfs/xfs_vnodeops.c | 15 ++++++++++++++- 2 files changed, 15 insertions(+), 1 deletions(-) diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h index 1594190..82aad5e 100644 --- a/fs/xfs/xfs_inode.h +++ b/fs/xfs/xfs_inode.h @@ -261,6 +261,7 @@ typedef struct xfs_inode { xfs_fsize_t i_size; /* in-memory size */ xfs_fsize_t i_new_size; /* size when write completes */ atomic_t i_iocount; /* outstanding I/O count */ + int i_dirty_releases; /* dirty ->release calls */ /* VFS inode */ struct inode i_vnode; /* embedded VFS inode */ diff --git a/fs/xfs/xfs_vnodeops.c b/fs/xfs/xfs_vnodeops.c index b7bdc43..0c8eeba 100644 --- a/fs/xfs/xfs_vnodeops.c +++ b/fs/xfs/xfs_vnodeops.c @@ -979,14 +979,27 @@ xfs_release( * chance to drop them once the last reference to * the inode is dropped, so we'll never leak blocks * permanently. + * + * Further, count the number of times we get here in + * the life of this inode. If the inode is being + * opened, written and closed frequently and we have + * delayed allocation blocks oustanding (e.g. streaming + * writes from the NFS server), truncating the + * blocks past EOF will cause fragmentation to occur. + * In this case don't do the truncation, either. */ + if (ip->i_delayed_blks) + ip->i_dirty_releases++; + if (ip->i_dirty_releases > 1) + goto out; + error = xfs_free_eofblocks(mp, ip, XFS_FREE_EOF_TRYLOCK); if (error) return error; } } - +out: return 0; } -- 1.7.1 _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply related [flat|nested] 22+ messages in thread
* Re: [PATCH 2/2] xfs: don't truncate prealloc from frequently accessed inodes 2010-10-04 10:13 ` [PATCH 2/2] xfs: don't truncate prealloc from frequently accessed inodes Dave Chinner @ 2010-10-14 17:22 ` Alex Elder 2010-10-14 21:28 ` Dave Chinner 0 siblings, 1 reply; 22+ messages in thread From: Alex Elder @ 2010-10-14 17:22 UTC (permalink / raw) To: Dave Chinner; +Cc: xfs On Mon, 2010-10-04 at 21:13 +1100, Dave Chinner wrote: > From: Dave Chinner <dchinner@redhat.com> > > A long standing problem for streaming writeѕ through the NFS server > has been that the NFS server opens and closes file descriptors on an > inode for every write. The result of this behaviour is that the > ->release() function is called on every close and that results in > XFS truncating speculative preallocation beyond the EOF. This has > an adverse effect on file layout when multiple files are being > written at the same time - they interleave their extents and can > result in severe fragmentation. > > To avoid this problem, keep a count of the number of ->release calls > made on an inode. For most cases, an inode is only going to be opened > once for writing and then closed again during it's lifetime in > cache. Hence if there are multiple ->release calls, there is a good > chance that the inode is being accessed by the NFS server. Hence > count up every time ->release is called while there are delalloc > blocks still outstanding on the inode. > > If this count is non-zero when ->release is next called, then do no > truncate away the speculative preallocation - leave it there so that > subsequent writes do not need to reallocate the delalloc space. This > will prevent interleaving of extents of different inodes written > concurrently to the same AG. > > If we get this wrong, it is not a big deal as we truncate > speculative allocation beyond EOF anyway in xfs_inactive() when the > inode is thrown out of the cache. > > The new counter in the struct xfs_inode fits into a hole in the > structure on 64 bit machines, so does not grow the size of the inode > at all. This seems reasonable, and I have no real objection to it. However, I have a question and a comment related to the affected code (and not your specific change). > Signed-off-by: Dave Chinner <dchinner@redhat.com> > --- > fs/xfs/xfs_inode.h | 1 + > fs/xfs/xfs_vnodeops.c | 15 ++++++++++++++- > 2 files changed, 15 insertions(+), 1 deletions(-) > > diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h > index 1594190..82aad5e 100644 > --- a/fs/xfs/xfs_inode.h > +++ b/fs/xfs/xfs_inode.h > @@ -261,6 +261,7 @@ typedef struct xfs_inode { > xfs_fsize_t i_size; /* in-memory size */ > xfs_fsize_t i_new_size; /* size when write completes */ > atomic_t i_iocount; /* outstanding I/O count */ > + int i_dirty_releases; /* dirty ->release calls */ > > /* VFS inode */ > struct inode i_vnode; /* embedded VFS inode */ > diff --git a/fs/xfs/xfs_vnodeops.c b/fs/xfs/xfs_vnodeops.c > index b7bdc43..0c8eeba 100644 > --- a/fs/xfs/xfs_vnodeops.c > +++ b/fs/xfs/xfs_vnodeops.c OK, this comment is unrelated to your exact change. But just above the next hunk there's a big nasty condition, which appears to be *almost* duplicated in xfs_inactive() (twice!). It would be very nice if, while you're at modifying this nearby code, you could encapsulate that condition in a macro that has a meaningful name. > @@ -979,14 +979,27 @@ xfs_release( > * chance to drop them once the last reference to > * the inode is dropped, so we'll never leak blocks > * permanently. I'm curious what the effect is if we simply don't do the truncate *except* when the inode becomes inactive. It means we hang onto the stuff for a while longer, and maybe it makes things messier in the event of a crash. Can you tell me why we do the truncate here as well as in xfs_inactive() (or what the problem is of *not* doing it here)? > + * > + * Further, count the number of times we get here in > + * the life of this inode. If the inode is being > + * opened, written and closed frequently and we have > + * delayed allocation blocks oustanding (e.g. streaming > + * writes from the NFS server), truncating the > + * blocks past EOF will cause fragmentation to occur. > + * In this case don't do the truncation, either. > */ > + if (ip->i_delayed_blks) > + ip->i_dirty_releases++; > + if (ip->i_dirty_releases > 1) > + goto out; > + > error = xfs_free_eofblocks(mp, ip, > XFS_FREE_EOF_TRYLOCK); > if (error) > return error; > } > } > - > +out: > return 0; > } > _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 2/2] xfs: don't truncate prealloc from frequently accessed inodes 2010-10-14 17:22 ` Alex Elder @ 2010-10-14 21:28 ` Dave Chinner 0 siblings, 0 replies; 22+ messages in thread From: Dave Chinner @ 2010-10-14 21:28 UTC (permalink / raw) To: Alex Elder; +Cc: xfs On Thu, Oct 14, 2010 at 12:22:50PM -0500, Alex Elder wrote: > On Mon, 2010-10-04 at 21:13 +1100, Dave Chinner wrote: > > diff --git a/fs/xfs/xfs_vnodeops.c b/fs/xfs/xfs_vnodeops.c > > index b7bdc43..0c8eeba 100644 > > --- a/fs/xfs/xfs_vnodeops.c > > +++ b/fs/xfs/xfs_vnodeops.c > > OK, this comment is unrelated to your exact change. But just above > the next hunk there's a big nasty condition, which appears to > be *almost* duplicated in xfs_inactive() (twice!). It would be > very nice if, while you're at modifying this nearby code, you > could encapsulate that condition in a macro that has a meaningful > name. I've looked at doing this factoring before, but the conditions are subtly different and hence cannot be factored into a single macro. The xfs_inactive case can be cleaned up (i've got old patches around that do half the job, IIRC), but the tests in the two functions are not the same. > > @@ -979,14 +979,27 @@ xfs_release( > > * chance to drop them once the last reference to > > * the inode is dropped, so we'll never leak blocks > > * permanently. > > I'm curious what the effect is if we simply don't do the truncate > *except* when the inode becomes inactive. It means we hang onto > the stuff for a while longer, and maybe it makes things messier > in the event of a crash. It doesn't change a thing in the event of a crash - speculative preallocation is entirely in-memory state. > Can you tell me why we do the truncate > here as well as in xfs_inactive() (or what the problem is of > *not* doing it here)? the truncation in ->release is not guaranteed to succeed - it uses trylock semantics to avoid blocking unnecessarily. It can do this because we know that when xfs_inactive() is called, the truncation will happen for real. -- Dave Chinner david@fromorbit.com _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [RFC, PATCH 0/2] xfs: dynamic speculative preallocation for delalloc 2010-10-04 10:13 [RFC, PATCH 0/2] xfs: dynamic speculative preallocation for delalloc Dave Chinner 2010-10-04 10:13 ` [PATCH 1/2] xfs: dynamic speculative EOF preallocation Dave Chinner 2010-10-04 10:13 ` [PATCH 2/2] xfs: don't truncate prealloc from frequently accessed inodes Dave Chinner @ 2010-10-14 17:22 ` Alex Elder 2010-10-14 21:16 ` Dave Chinner 2 siblings, 1 reply; 22+ messages in thread From: Alex Elder @ 2010-10-14 17:22 UTC (permalink / raw) To: Dave Chinner; +Cc: xfs On Mon, 2010-10-04 at 21:13 +1100, Dave Chinner wrote: > When multiple concurrent streaming writes land in the same AG, > allocation of extents interleaves between inodes and causes > excessive fragmentation of the files being written. That instead of > getting maximally sized extents, we'll get writeback range sized > extents interleaved on disk. that is for four files A, B, C and D, > we'll end up with extents like: > > +---+---+---+---+---+---+---+---+---+---+---+---+ > A1 B1 C1 D1 A2 B2 C2 A3 D2 C3 B3 D3 ..... > > instead of: > > +-----------+-----------+-----------+-----------+ > A B C D > > It is well known that using the allocsize mount option makes the > allocator behaviour much better and more likely to result in > the second layout above than the first, but that doesn't work in all > situations (e.g. writes from the NFS server). I think that we should > not be relying on manual configuration to solve this problem. > . . . (deleting some of your demonstration detail) > The same results occur for tests running 16 and 64 sequential > writers into the same AG - extents of 8GB in all files, so > this is a major improvement in default behaviour and effectively > means we do not need the allocsize mount option anymore. > > Worth noting is that the extents still interleave between files - > that problem still exists - but the size of the extents now means > that sequential read and write rates are not going to be affected > by excessive seeks between extents within each file. Just curious--do we have any current and meaningful information about the trade-off between the size of an extent and seek time? Obviously maximizing the extent size will maximize the bang (data read) for the buck (seek cost) but can we quantify that with current storage device specs? (This is really a theoretical aside...) > Given this demonstratably improves allocation patterns, the only > question that remains in my mind is exactly what algorithm to use to > scale the preallocation. The current patch records the last > prealloc size and increases the next one from that. While that > preovides good results, it will cause problems when interacting with > truncation. It also means that a file may have a substantial amount > of preallocatin beyond EOF - maybe several times the size of the > file. I honestly haven't looked into this yet, but can you expand on the truncation problems you mention? Is it that the preallocated blocks should be dropped and the scaling algorithm should be reset when a truncation occurs or something? > However, the current algorithm does work well when writing lots of > relatively small files (e.g. up to a few tens of megabytes), as > increasing the preallocation size fast reduces the chances of > interleaving small allocations. One thing that I keep wondering about as I think about this is what the effect is as the file system (or AG) gets full, and what level of "full" is enough to make any adverse effects of a change like this start to show up. The other thing is, what sort of workloads are reasonable things to use to gauge the effect? NFS is perhaps common, but it's unique in how it closes files all the time. What happens when there's a more "normal" (non-NFS) workload? For AGs with enough free space I suppose it's a win overall. > I've been thinking that basing the preallocation size on the current > file size - say preallocate half the size of the file, is a better > option once file sizes start to grow large (more than a few tens of > of megabytes), so maybe a combination of the two is a better idea > (increase exponentially up to default^2 (4MB prealloc), then take > min(max(i_size / 2, default^2), XFS_MAXEXTLEN) as the prealloc size > so that we don't do excessive amounts of preallocation? I think basing it on the file size is a good idea, it scales the (initial) preallocation size to the specific file. This would assume that files tend to grow by amounts comparable to their size rather than suddenly and dramatically changing. That seems reasonable but I have nothing empirical to back up that assumption. Similarly, the assumption that once a file starts to grow you should rapidly increase the EOF preallocation goal seems sensible--certainly for the hindsight case of a stream of appends--but I have no proof that a normal use case wouldn't trigger this algorithm when it might be better not to. > -- > > We need to make the same write patterns result in equivalent > allocation patterns even when they come through the NFS server. > Right now the NFS server uses a file descriptor for each write that > comes across the wire. This means that the ->release function is > called after every write, and that means XFS will be truncating away > the speculative preallocation it did during the write. Hence we get > interleaving files and fragmentation. It could be useful to base the behavior on actual knowledge that a file system is being exported by NFS. But it may well be that other applications (like shell scripts that loop and append to the same file repeatedly) might benefit. > To avoid this problem, detect when the ->release function is being > called repeatedly on an inode that has delayed allocation > outstanding. If this happenѕ twice in a row, then don't truncate the > speculative allocation away. This ensures that the speculative > preallocation is preserved until the delalloc blocks are converted > to real extents during writeback. . . . I have a few other comments in my reviews of your two patches. . . . > Comments welcome. You got some... -Alex _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [RFC, PATCH 0/2] xfs: dynamic speculative preallocation for delalloc 2010-10-14 17:22 ` [RFC, PATCH 0/2] xfs: dynamic speculative preallocation for delalloc Alex Elder @ 2010-10-14 21:16 ` Dave Chinner 2010-10-14 21:50 ` Ivan.Novick 0 siblings, 1 reply; 22+ messages in thread From: Dave Chinner @ 2010-10-14 21:16 UTC (permalink / raw) To: Alex Elder; +Cc: xfs On Thu, Oct 14, 2010 at 12:22:32PM -0500, Alex Elder wrote: > On Mon, 2010-10-04 at 21:13 +1100, Dave Chinner wrote: > > When multiple concurrent streaming writes land in the same AG, > > allocation of extents interleaves between inodes and causes > > excessive fragmentation of the files being written. That instead of > > getting maximally sized extents, we'll get writeback range sized > > extents interleaved on disk. that is for four files A, B, C and D, > > we'll end up with extents like: > > > > +---+---+---+---+---+---+---+---+---+---+---+---+ > > A1 B1 C1 D1 A2 B2 C2 A3 D2 C3 B3 D3 ..... > > > > instead of: > > > > +-----------+-----------+-----------+-----------+ > > A B C D > > > > It is well known that using the allocsize mount option makes the > > allocator behaviour much better and more likely to result in > > the second layout above than the first, but that doesn't work in all > > situations (e.g. writes from the NFS server). I think that we should > > not be relying on manual configuration to solve this problem. > > > > . . . (deleting some of your demonstration detail) > > > The same results occur for tests running 16 and 64 sequential > > writers into the same AG - extents of 8GB in all files, so > > this is a major improvement in default behaviour and effectively > > means we do not need the allocsize mount option anymore. > > > > Worth noting is that the extents still interleave between files - > > that problem still exists - but the size of the extents now means > > that sequential read and write rates are not going to be affected > > by excessive seeks between extents within each file. > > Just curious--do we have any current and meaningful > information about the trade-off between the size of an > extent and seek time? Obviously maximizing the extent > size will maximize the bang (data read) for the buck (seek > cost) but can we quantify that with current storage device > specs? (This is really a theoretical aside...) The reported numbers were a drop in read throughput of ~15% on a GB/s class filesystem when the files interleaved. Fundamentally, it's a pretty simple equation. If the average seek time is 5ms, and your disk runs at 100MB/s, then you lose 500kB/s for every seek. while you are doing 20 seeks/s, then typically you'll see 90MB/s, which is still pretty close to disk speed and most people won't notice. However, if your disk subsystem does 1GB/s, that 20 seeks/s is now 100MB/s and is very noticable. The latency of the seek does not change as the bandwidth of the volume goes up, so the per-seek bandwidth penalty is significantly higher. Even readahead cannot hide the latency penalty if the number of seeks increases too much... > > Given this demonstratably improves allocation patterns, the only > > question that remains in my mind is exactly what algorithm to use to > > scale the preallocation. The current patch records the last > > prealloc size and increases the next one from that. While that > > preovides good results, it will cause problems when interacting with > > truncation. It also means that a file may have a substantial amount > > of preallocatin beyond EOF - maybe several times the size of the > > file. > > I honestly haven't looked into this yet, but can you expand on > the truncation problems you mention? Is it that the preallocated > blocks should be dropped and the scaling algorithm should be > reset when a truncation occurs or something? Create a large file. preallocation size goes: 64k 256k 1024k 4096k 16M 64M 256M now truncate the file. The write one block. Preallocation size is now 1GB. I've actually changed this now after doing more testing to be based on the current file size. From my current commit message for the patch: For default settings, the size and the initial extents is determined by the number of parallel writers and the amount of memory in the machine. For 4GB RAM and 4 concurrent 32GB file writes: EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL 0: [0..1048575]: 1048672..2097247 0 (1048672..2097247) 1048576 1: [1048576..2097151]: 5242976..6291551 0 (5242976..6291551) 1048576 2: [2097152..4194303]: 12583008..14680159 0 (12583008..14680159) 2097152 3: [4194304..8388607]: 25165920..29360223 0 (25165920..29360223) 4194304 4: [8388608..16777215]: 58720352..67108959 0 (58720352..67108959) 8388608 5: [16777216..33554423]: 117440584..134217791 0 (117440584..134217791) 16777208 6: [33554424..50331511]: 184549056..201326143 0 (184549056..201326143) 16777088 7: [50331512..67108599]: 251657408..268434495 0 (251657408..268434495) 16777088 and for 16 concurrent 16GB file writes: EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL 0: [0..262143]: 2490472..2752615 0 (2490472..2752615) 262144 1: [262144..524287]: 6291560..6553703 0 (6291560..6553703) 262144 2: [524288..1048575]: 13631592..14155879 0 (13631592..14155879) 524288 3: [1048576..2097151]: 30408808..31457383 0 (30408808..31457383) 1048576 4: [2097152..4194303]: 52428904..54526055 0 (52428904..54526055) 2097152 5: [4194304..8388607]: 104857704..109052007 0 (104857704..109052007) 4194304 6: [8388608..16777215]: 209715304..218103911 0 (209715304..218103911) 8388608 7: [16777216..33554423]: 452984848..469762055 0 (452984848..469762055) 16777208 > > However, the current algorithm does work well when writing lots of > > relatively small files (e.g. up to a few tens of megabytes), as > > increasing the preallocation size fast reduces the chances of > > interleaving small allocations. > > One thing that I keep wondering about as I think about this > is what the effect is as the file system (or AG) gets full, New speculative preallocation does not occur when ENOSPC is reported. > and what level of "full" is enough to make any adverse > effects of a change like this start to show up. The other > thing is, what sort of workloads are reasonable things to > use to gauge the effect? Anything that does concurrent buffered writes to the same directory. Databases, DVRs, MPI applications, etc. Anything you'd use the allocsize mount option for. ;) > > We need to make the same write patterns result in equivalent > > allocation patterns even when they come through the NFS server. > > Right now the NFS server uses a file descriptor for each write that > > comes across the wire. This means that the ->release function is > > called after every write, and that means XFS will be truncating away > > the speculative preallocation it did during the write. Hence we get > > interleaving files and fragmentation. > > It could be useful to base the behavior on actual knowledge > that a file system is being exported by NFS. But it may well > be that other applications (like shell scripts that loop and > append to the same file repeatedly) might benefit. We have no knowledge of whether the filesystem is exported or not. I could a "current_task_is_nfsd" hack in there, but I'd rather not go to that extreme. Cheers, Dave. -- Dave Chinner david@fromorbit.com _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [RFC, PATCH 0/2] xfs: dynamic speculative preallocation for delalloc 2010-10-14 21:16 ` Dave Chinner @ 2010-10-14 21:50 ` Ivan.Novick 2010-10-15 7:14 ` Michael Monnerie 0 siblings, 1 reply; 22+ messages in thread From: Ivan.Novick @ 2010-10-14 21:50 UTC (permalink / raw) To: david, aelder; +Cc: xfs >> The reported numbers were a drop in read throughput of ~15% on a >> GB/s class filesystem when the files interleaved. I have seen ~50% performance improvement in read rate when changing from small extents to large extents with XFS. Essentially going from not using allocsize to setting 1gb allocsize. Also GB/s class filesystem. Cheers, Ivan Novick _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [RFC, PATCH 0/2] xfs: dynamic speculative preallocation for delalloc 2010-10-14 21:50 ` Ivan.Novick @ 2010-10-15 7:14 ` Michael Monnerie 2010-10-15 11:45 ` Dave Chinner 0 siblings, 1 reply; 22+ messages in thread From: Michael Monnerie @ 2010-10-15 7:14 UTC (permalink / raw) To: xfs; +Cc: Ivan.Novick, aelder [-- Attachment #1.1: Type: Text/Plain, Size: 1792 bytes --] On Donnerstag, 14. Oktober 2010 Ivan.Novick@emc.com wrote: > I have seen ~50% performance improvement in read rate when changing > from small extents to large extents with XFS. Essentially going from > not using allocsize to setting 1gb allocsize. Also GB/s class > filesystem. In case of a database, I'd say 1GB allocsize seems fine. But when you're at /var/log/, it could be a penalty, as when you have 50 open log files which get permanently appended, you have 50GB preallocated which almost surely won't be needed before the log is rotated anyway. The question is: what is a good value for preallocation? I'd guess for a database 1GB seems reasonable, for /var/log one full stripe (nr-data-disks * stripe size), for webserver/ftp data too, as well as a fileserver (storing office documents) and mailserver (storing each mail as a separate file or into one flat file). Or, when you have an app that writes files of specific size, use the "normal maximum" file size. We have a special server storing files of normally 50M-1G size, there I set allocsize=1024k, so that files will be in one junk on disk. When sometimes a file is bigger, then it will be fragmented but at reasonable sizes so it won't matter too much. Is that reasonable? Are there guides to help with that? I'd like to write a FAQ entry if there is (can ever be?) a consensus of what would be a good value. -- mit freundlichen Grüssen, Michael Monnerie, Ing. BSc it-management Internet Services http://proteger.at [gesprochen: Prot-e-schee] Tel: 0660 / 415 65 31 ****** Radiointerview zum Thema Spam ****** http://www.it-podcast.at/archiv.html#podcast-100716 // Wir haben im Moment zwei Häuser zu verkaufen: // http://zmi.at/langegg/ // http://zmi.at/haus2009/ [-- Attachment #1.2: This is a digitally signed message part. --] [-- Type: application/pgp-signature, Size: 198 bytes --] [-- Attachment #2: Type: text/plain, Size: 121 bytes --] _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [RFC, PATCH 0/2] xfs: dynamic speculative preallocation for delalloc 2010-10-15 7:14 ` Michael Monnerie @ 2010-10-15 11:45 ` Dave Chinner 2010-10-17 14:31 ` Michael Monnerie 0 siblings, 1 reply; 22+ messages in thread From: Dave Chinner @ 2010-10-15 11:45 UTC (permalink / raw) To: Michael Monnerie; +Cc: Ivan.Novick, aelder, xfs On Fri, Oct 15, 2010 at 09:14:54AM +0200, Michael Monnerie wrote: > On Donnerstag, 14. Oktober 2010 Ivan.Novick@emc.com wrote: > > I have seen ~50% performance improvement in read rate when changing > > from small extents to large extents with XFS. Essentially going from > > not using allocsize to setting 1gb allocsize. Also GB/s class > > filesystem. > > In case of a database, I'd say 1GB allocsize seems fine. But when you're > at /var/log/, it could be a penalty, as when you have 50 open log files > which get permanently appended, you have 50GB preallocated which almost > surely won't be needed before the log is rotated anyway. Don't put your high performance database on the same filesystem as your system log files. Or don't use tunings for databases on your root filesystem. > The question is: what is a good value for preallocation? If you can't answer that, use the defaults. > I'd guess for a database 1GB seems reasonable, depends on your database. If it does direct IO, then allocsize is completely meaningless.... > for /var/log one full > stripe (nr-data-disks * stripe size), IIRC, that is the default behaviour for _physical_ extent allocation on files larger than one stripe unit (i.e stripe aligned and sized allocation). That is very different from speculative allocation done at delayed allocation time, which is purely in-memory until physical allocation occurs. > for webserver/ftp data too, as > well as a fileserver (storing office documents) and mailserver (storing > each mail as a separate file or into one flat file). > Or, when you have an app that writes files of specific size, use the > "normal maximum" file size. We have a special server storing files of > normally 50M-1G size, there I set allocsize=1024k, so that files will be > in one junk on disk. When sometimes a file is bigger, then it will be > fragmented but at reasonable sizes so it won't matter too much. > > Is that reasonable? What is reasonable is that the default behaviour does the right thing. :) Nobody should need to use the allocsize mount option except in extreme corner cases, and that is what my patchset attempts to achieve. Cheers, Dave. -- Dave Chinner david@fromorbit.com _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [RFC, PATCH 0/2] xfs: dynamic speculative preallocation for delalloc 2010-10-15 11:45 ` Dave Chinner @ 2010-10-17 14:31 ` Michael Monnerie 2010-10-17 23:49 ` Dave Chinner 0 siblings, 1 reply; 22+ messages in thread From: Michael Monnerie @ 2010-10-17 14:31 UTC (permalink / raw) To: xfs [-- Attachment #1.1: Type: Text/Plain, Size: 2615 bytes --] On Freitag, 15. Oktober 2010 Dave Chinner wrote: > On Fri, Oct 15, 2010 at 09:14:54AM +0200, Michael Monnerie wrote: > > The question is: what is a good value for preallocation? > > If you can't answer that, use the defaults. Who can really without testing? For a webserver, hosting hundreds of different websites, with many ftp uploads and where the rest of the writes are webservers temp and log files, do you know the best value without investigation? That's what I meant. > > I'd guess for a database 1GB seems reasonable, > > depends on your database. If it does direct IO, then allocsize is > completely meaningless.... True, I was thinking of postgresql and mysql (innodb, 1 file per table). Both increment file sizes when needed. > IIRC, that is the default behaviour for _physical_ extent allocation > on files larger than one stripe unit (i.e stripe aligned and sized > allocation). That is very different from speculative allocation done > at delayed allocation time, which is purely in-memory until physical > allocation occurs. Together with your other answer (swalloc...) I get the picture now. So delalloc, speculative prealloc, and physical allocation are done, where the first two are in-memory. And the mount option allocsize sets the delalloc size. Now something comes to my mind: When I make an ftp upload, it's very slow in terms of disk speed. Say 2MB/s. How long would delalloc wait before flushing buffers on disk? Isn't that controlled by the values of vm.dirty_background_bytes and vm.dirty_expire_centisecs and vm.dirty_writeback_centisecs? Then dirty_background_bytes would be the maximum possible delalloc size? But it's value is the top for the whole machine, so it would need to be much higher than I want it to be for a specific filesystem. Or are those values controllable per mount? Sorry for my maybe boring questions, I'm trying to understand how to get the most out of a server with lot's of VMs, and tuning seems to help (and be needed) a lot in this situation. > Nobody should need to use the allocsize mount option except in > extreme corner cases, and that is what my patchset attempts to > achieve. Sounds good :-) -- mit freundlichen Grüssen, Michael Monnerie, Ing. BSc it-management Internet Services http://proteger.at [gesprochen: Prot-e-schee] Tel: 0660 / 415 65 31 ****** Radiointerview zum Thema Spam ****** http://www.it-podcast.at/archiv.html#podcast-100716 // Wir haben im Moment zwei Häuser zu verkaufen: // http://zmi.at/langegg/ // http://zmi.at/haus2009/ [-- Attachment #1.2: This is a digitally signed message part. --] [-- Type: application/pgp-signature, Size: 198 bytes --] [-- Attachment #2: Type: text/plain, Size: 121 bytes --] _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [RFC, PATCH 0/2] xfs: dynamic speculative preallocation for delalloc 2010-10-17 14:31 ` Michael Monnerie @ 2010-10-17 23:49 ` Dave Chinner 2010-10-18 6:39 ` Michael Monnerie 0 siblings, 1 reply; 22+ messages in thread From: Dave Chinner @ 2010-10-17 23:49 UTC (permalink / raw) To: Michael Monnerie; +Cc: xfs On Sun, Oct 17, 2010 at 04:31:48PM +0200, Michael Monnerie wrote: > On Freitag, 15. Oktober 2010 Dave Chinner wrote: > > On Fri, Oct 15, 2010 at 09:14:54AM +0200, Michael Monnerie wrote: > > > The question is: what is a good value for preallocation? > > > > If you can't answer that, use the defaults. > > Who can really without testing? For a webserver, hosting hundreds of > different websites, with many ftp uploads and where the rest of the > writes are webservers temp and log files, do you know the best value > without investigation? That's what I meant. Yes, I do know that - the best value is the default value. > > > I'd guess for a database 1GB seems reasonable, > > > > depends on your database. If it does direct IO, then allocsize is > > completely meaningless.... > > True, I was thinking of postgresql and mysql (innodb, 1 file per table). > Both increment file sizes when needed. MySQL can do direct IO, so you can't just assume that the database app will tell you what type of IO it is doing. > > IIRC, that is the default behaviour for _physical_ extent allocation > > on files larger than one stripe unit (i.e stripe aligned and sized > > allocation). That is very different from speculative allocation done > > at delayed allocation time, which is purely in-memory until physical > > allocation occurs. > > Together with your other answer (swalloc...) I get the picture now. So > delalloc, speculative prealloc, and physical allocation are done, where > the first two are in-memory. And the mount option allocsize sets the > delalloc size. > > Now something comes to my mind: When I make an ftp upload, it's very > slow in terms of disk speed. Say 2MB/s. How long would delalloc wait > before flushing buffers on disk? delalloc does not do any waiting. The system decides when to flush data to disk. We even do delalloc for sync writes - the system flushes the data to disk immediately instead of via memory pressure or background writeback. > Isn't that controlled by the values of > vm.dirty_background_bytes and vm.dirty_expire_centisecs and > vm.dirty_writeback_centisecs? Exactly. > Then dirty_background_bytes would be the > maximum possible delalloc size? No. Those numbers control the amount of dirty pages in memory, not the amount of space we've allocated speculatively. Indeed, by definition speculative allocation past EOF means it is space that currently has no dirty pages.... ;) Cheers, Dave. -- Dave Chinner david@fromorbit.com _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [RFC, PATCH 0/2] xfs: dynamic speculative preallocation for delalloc 2010-10-17 23:49 ` Dave Chinner @ 2010-10-18 6:39 ` Michael Monnerie 0 siblings, 0 replies; 22+ messages in thread From: Michael Monnerie @ 2010-10-18 6:39 UTC (permalink / raw) To: xfs [-- Attachment #1.1: Type: Text/Plain, Size: 2325 bytes --] On Montag, 18. Oktober 2010 Dave Chinner wrote: > delalloc does not do any waiting. The system decides when to flush > data to disk. We even do delalloc for sync writes - the system > flushes the data to disk immediately instead of via memory pressure > or background writeback. > > > Isn't that controlled by the values of > > vm.dirty_background_bytes and vm.dirty_expire_centisecs and > > vm.dirty_writeback_centisecs? > > Exactly. > > > Then dirty_background_bytes would be the > > maximum possible delalloc size? > > No. Those numbers control the amount of dirty pages in memory, not > the amount of space we've allocated speculatively. Indeed, by > definition speculative allocation past EOF means it is space that > currently has no dirty pages.... ;) The big picture is becoming clearer, but I still don't get it. So delalloc is not really about delayed allocation, but more a speculation about needed filesizes, combined with in-memory preallocation of physical blocks on the storage? So it's "future allocation with delayed data filling". Or what is "delayed allocation" refering to, when data can be on disk already? You wrote in another mail: > that is the default behaviour for _physical_ extent allocation > on files larger than one stripe unit (i.e stripe aligned and sized > allocation). That is very different from speculative allocation done > at delayed allocation time, which is purely in-memory until physical > allocation occurs So when a file is slowly uploaded, delalloc can imagine it will be 1MB on disk, and reserve that space in-memory. And when a write occurs, because dirty_background_bytes is hit, the fragments will be written, and the full 1MB is physically reserved on disk. Then more data arrives, delalloc now imagines the file will be 100MB on disk, it will reserve that space, first in-memory and after the next buffer flush on disk. Is it like this? -- mit freundlichen Grüssen, Michael Monnerie, Ing. BSc it-management Internet Services http://proteger.at [gesprochen: Prot-e-schee] Tel: 0660 / 415 65 31 ****** Radiointerview zum Thema Spam ****** http://www.it-podcast.at/archiv.html#podcast-100716 // Wir haben im Moment zwei Häuser zu verkaufen: // http://zmi.at/langegg/ // http://zmi.at/haus2009/ [-- Attachment #1.2: This is a digitally signed message part. --] [-- Type: application/pgp-signature, Size: 198 bytes --] [-- Attachment #2: Type: text/plain, Size: 121 bytes --] _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 22+ messages in thread
* [PATCH 0/2] xfs: dynamic speculative allocation beyond EOF V3 @ 2010-11-29 0:43 Dave Chinner 2010-11-29 0:43 ` [PATCH 1/2] xfs: dynamic speculative EOF preallocation Dave Chinner 0 siblings, 1 reply; 22+ messages in thread From: Dave Chinner @ 2010-11-29 0:43 UTC (permalink / raw) To: xfs This is the latest version of the dynamic speculative allocation beyond EOF patch set. The description of the patchset can be found here: http://oss.sgi.com/archives/xfs/2010-10/msg00040.html Version 3: - allocsize mount option returned to fixed preallocation size only. - reduces maximum dynamic prealloc size as the filesytem gets near full. - split i_delayed_blks bug fixes into new patch (posted in 2.6.37-rc bug fix series) Version 2: - base speculative execution size on current inode size, not the number of previous speculative allocations. _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 22+ messages in thread
* [PATCH 1/2] xfs: dynamic speculative EOF preallocation 2010-11-29 0:43 [PATCH 0/2] xfs: dynamic speculative allocation beyond EOF V3 Dave Chinner @ 2010-11-29 0:43 ` Dave Chinner 2010-12-07 10:17 ` Christoph Hellwig 0 siblings, 1 reply; 22+ messages in thread From: Dave Chinner @ 2010-11-29 0:43 UTC (permalink / raw) To: xfs From: Dave Chinner <dchinner@redhat.com> Currently the size of the speculative preallocation during delayed allocation is fixed by either the allocsize mount option of a default size. We are seeing a lot of cases where we need to recommend using the allocsize mount option to prevent fragmentation when buffered writes land in the same AG. Rather than using a fixed preallocation size by default (up to 64k), make it dynamic by basing it on the current inode size. That way the EOF preallocation will increase as the file size increases. Hence for streaming writes we are much more likely to get large preallocations exactly when we need it to reduce fragementation. For default settings, the size of the initial extents is determined by the number of parallel writers and the amount of memory in the machine. For 4GB RAM and 4 concurrent 32GB file writes: EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL 0: [0..1048575]: 1048672..2097247 0 (1048672..2097247) 1048576 1: [1048576..2097151]: 5242976..6291551 0 (5242976..6291551) 1048576 2: [2097152..4194303]: 12583008..14680159 0 (12583008..14680159) 2097152 3: [4194304..8388607]: 25165920..29360223 0 (25165920..29360223) 4194304 4: [8388608..16777215]: 58720352..67108959 0 (58720352..67108959) 8388608 5: [16777216..33554423]: 117440584..134217791 0 (117440584..134217791) 16777208 6: [33554424..50331511]: 184549056..201326143 0 (184549056..201326143) 16777088 7: [50331512..67108599]: 251657408..268434495 0 (251657408..268434495) 16777088 and for 16 concurrent 16GB file writes: EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL 0: [0..262143]: 2490472..2752615 0 (2490472..2752615) 262144 1: [262144..524287]: 6291560..6553703 0 (6291560..6553703) 262144 2: [524288..1048575]: 13631592..14155879 0 (13631592..14155879) 524288 3: [1048576..2097151]: 30408808..31457383 0 (30408808..31457383) 1048576 4: [2097152..4194303]: 52428904..54526055 0 (52428904..54526055) 2097152 5: [4194304..8388607]: 104857704..109052007 0 (104857704..109052007) 4194304 6: [8388608..16777215]: 209715304..218103911 0 (209715304..218103911) 8388608 7: [16777216..33554423]: 452984848..469762055 0 (452984848..469762055) 16777208 Because it is hard to take back specualtive preallocation, cases where there are large slow growing log files on a nearly full filesystem may cause premature ENOSPC. Hence as the filesystem nears full, the maximum dynamic prealloc size іs reduced according to this table (based on 4k block size): freespace max prealloc size >5% full extent (8GB) 4-5% 2GB (8GB >> 2) 3-4% 1GB (8GB >> 3) 2-3% 512MB (8GB >> 4) 1-2% 256MB (8GB >> 5) <1% 128MB (8GB >> 6) This should reduce the amount of space held in speculative preallocation for such cases. The allocsize mount option turns off the dynamic behaviour and fixes the prealloc size to whatever the mount option specifies. i.e. the behaviour is unchanged. Signed-off-by: Dave Chinner <dchinner@redhat.com> --- fs/xfs/xfs_fsops.c | 1 + fs/xfs/xfs_iomap.c | 74 +++++++++++++++++++++++++++++++++++++++++++++------- fs/xfs/xfs_mount.c | 26 ++++++++++++----- fs/xfs/xfs_mount.h | 26 +++++++++++++++++- 4 files changed, 107 insertions(+), 20 deletions(-) diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c index be34ff2..6d17206 100644 --- a/fs/xfs/xfs_fsops.c +++ b/fs/xfs/xfs_fsops.c @@ -374,6 +374,7 @@ xfs_growfs_data_private( mp->m_maxicount = icount << mp->m_sb.sb_inopblog; } else mp->m_maxicount = 0; + xfs_set_low_space_thresholds(mp); /* update secondary superblocks. */ for (agno = 1; agno < nagcount; agno++) { diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c index 2057614..d7c2f05 100644 --- a/fs/xfs/xfs_iomap.c +++ b/fs/xfs/xfs_iomap.c @@ -389,6 +389,9 @@ error_out: * If the caller is doing a write at the end of the file, then extend the * allocation out to the file system's write iosize. We clean up any extra * space left over when the file is closed in xfs_inactive(). + * + * If we find we already have delalloc preallocation beyond EOF, don't do more + * preallocation as it it not needed. */ STATIC int xfs_iomap_eof_want_preallocate( @@ -405,6 +408,7 @@ xfs_iomap_eof_want_preallocate( xfs_filblks_t count_fsb; xfs_fsblock_t firstblock; int n, error, imaps; + int found_delalloc = 0; *prealloc = 0; if ((offset + count) <= ip->i_size) @@ -427,11 +431,16 @@ xfs_iomap_eof_want_preallocate( if ((imap[n].br_startblock != HOLESTARTBLOCK) && (imap[n].br_startblock != DELAYSTARTBLOCK)) return 0; + start_fsb += imap[n].br_blockcount; count_fsb -= imap[n].br_blockcount; + + if (imap[n].br_startblock == DELAYSTARTBLOCK) + found_delalloc = 1; } } - *prealloc = 1; + if (!found_delalloc) + *prealloc = 1; return 0; } @@ -469,6 +478,7 @@ xfs_iomap_write_delay( extsz = xfs_get_extsz_hint(ip); offset_fsb = XFS_B_TO_FSBT(mp, offset); + error = xfs_iomap_eof_want_preallocate(mp, ip, offset, count, ioflag, imap, XFS_WRITE_IMAPS, &prealloc); if (error) @@ -476,9 +486,44 @@ xfs_iomap_write_delay( retry: if (prealloc) { + xfs_fileoff_t alloc_blocks = 0; + /* + * If we don't have a user specified preallocation size, dynamically + * increase the preallocation size as the size of the file + * grows. Cap the maximum size at a single extent or less if + * the filesystem is near full. The closer the filesystem is to + * full, the smaller the maximum prealocation. + */ + if (!(mp->m_flags & XFS_MOUNT_DFLT_IOSIZE)) { + int shift = 0; + int64_t freesp; + + alloc_blocks = XFS_B_TO_FSB(mp, ip->i_size); + alloc_blocks = XFS_FILEOFF_MIN(MAXEXTLEN, + rounddown_pow_of_two(alloc_blocks)); + + freesp = xfs_icsb_read(mp, XFS_ICSB_FDBLOCKS); + if (freesp < mp->m_low_space[XFS_LOWSP_5_PCNT]) { + shift = 2; + if (freesp < mp->m_low_space[XFS_LOWSP_4_PCNT]) + shift++; + if (freesp < mp->m_low_space[XFS_LOWSP_3_PCNT]) + shift++; + if (freesp < mp->m_low_space[XFS_LOWSP_2_PCNT]) + shift++; + if (freesp < mp->m_low_space[XFS_LOWSP_1_PCNT]) + shift++; + } + if (shift) + alloc_blocks >>= shift; + } + + if (alloc_blocks < mp->m_writeio_blocks) + alloc_blocks = mp->m_writeio_blocks; + aligned_offset = XFS_WRITEIO_ALIGN(mp, (offset + count - 1)); ioalign = XFS_B_TO_FSBT(mp, aligned_offset); - last_fsb = ioalign + mp->m_writeio_blocks; + last_fsb = ioalign + alloc_blocks; } else { last_fsb = XFS_B_TO_FSB(mp, ((xfs_ufsize_t)(offset + count))); } @@ -496,22 +541,31 @@ retry: XFS_BMAPI_DELAY | XFS_BMAPI_WRITE | XFS_BMAPI_ENTIRE, &firstblock, 1, imap, &nimaps, NULL); - if (error && (error != ENOSPC)) + switch (error) { + case 0: + case ENOSPC: + case EDQUOT: + break; + default: return XFS_ERROR(error); + } /* - * If bmapi returned us nothing, and if we didn't get back EDQUOT, - * then we must have run out of space - flush all other inodes with - * delalloc blocks and retry without EOF preallocation. + * If bmapi returned us nothing, we got either ENOSPC or EDQUOT. For + * ENOSPC, * flush all other inodes with delalloc blocks to free up + * some of the excess reserved metadata space. For both cases, retry + * without EOF preallocation. */ if (nimaps == 0) { trace_xfs_delalloc_enospc(ip, offset, count); if (flushed) - return XFS_ERROR(ENOSPC); + return XFS_ERROR(error ? error : ENOSPC); - xfs_iunlock(ip, XFS_ILOCK_EXCL); - xfs_flush_inodes(ip); - xfs_ilock(ip, XFS_ILOCK_EXCL); + if (error == ENOSPC) { + xfs_iunlock(ip, XFS_ILOCK_EXCL); + xfs_flush_inodes(ip); + xfs_ilock(ip, XFS_ILOCK_EXCL); + } flushed = 1; error = 0; diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c index 3905fc3..98cb266 100644 --- a/fs/xfs/xfs_mount.c +++ b/fs/xfs/xfs_mount.c @@ -315,14 +315,6 @@ xfs_icsb_sum( return percpu_counter_sum_positive(&mp->m_icsb[counter]); } -static inline int64_t -xfs_icsb_read( - struct xfs_mount *mp, - int counter) -{ - return percpu_counter_read_positive(&mp->m_icsb[counter]); -} - void xfs_icsb_reinit_counters( struct xfs_mount *mp) @@ -1145,6 +1137,21 @@ xfs_set_rw_sizes(xfs_mount_t *mp) } /* + * precalculate the low space thresholds for dynamic speculative preallocation. + */ +void +xfs_set_low_space_thresholds( + struct xfs_mount *mp) +{ + int i; + + for (i = 0; i < XFS_LOWSP_MAX; i++) { + mp->m_low_space[i] = mp->m_sb.sb_dblocks / 100 * (i + 1); + } +} + + +/* * Set whether we're using inode alignment. */ STATIC void @@ -1366,6 +1373,9 @@ xfs_mountfs( */ xfs_set_rw_sizes(mp); + /* set the low space thresholds for dynamic preallocation */ + xfs_set_low_space_thresholds(mp); + /* * Set the inode cluster size. * This may still be overridden by the file system diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h index c8ff435..9f99a62 100644 --- a/fs/xfs/xfs_mount.h +++ b/fs/xfs/xfs_mount.h @@ -75,8 +75,15 @@ enum { XFS_ICSB_MAX, }; -extern int xfs_icsb_modify_inodes(struct xfs_mount *, int, int64_t, int); -extern int xfs_icsb_modify_free_blocks(struct xfs_mount *, int64_t, int); +/* dynamic preallocation free space thresholds, 5% down to 1% */ +enum { + XFS_LOWSP_1_PCNT = 0, + XFS_LOWSP_2_PCNT, + XFS_LOWSP_3_PCNT, + XFS_LOWSP_4_PCNT, + XFS_LOWSP_5_PCNT, + XFS_LOWSP_MAX, +}; typedef struct xfs_mount { struct super_block *m_super; @@ -172,6 +179,8 @@ typedef struct xfs_mount { on the next remount,rw */ struct shrinker m_inode_shrink; /* inode reclaim shrinker */ struct percpu_counter m_icsb[XFS_ICSB_MAX]; + int64_t m_low_space[XFS_LOWSP_MAX]; + /* low free space thresholds */ } xfs_mount_t; /* @@ -333,6 +342,19 @@ extern int xfs_icsb_init_counters(struct xfs_mount *); extern void xfs_icsb_reinit_counters(struct xfs_mount *); extern void xfs_icsb_destroy_counters(struct xfs_mount *); extern void xfs_icsb_sync_counters(struct xfs_mount *); +extern int xfs_icsb_modify_inodes(struct xfs_mount *, int, int64_t, int); +extern int xfs_icsb_modify_free_blocks(struct xfs_mount *, int64_t, int); + +static inline int64_t +xfs_icsb_read( + struct xfs_mount *mp, + int counter) +{ + return percpu_counter_read_positive(&mp->m_icsb[counter]); +} + +extern void xfs_set_low_space_thresholds(struct xfs_mount *); + #endif /* __KERNEL__ */ -- 1.7.2.3 _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply related [flat|nested] 22+ messages in thread
* Re: [PATCH 1/2] xfs: dynamic speculative EOF preallocation 2010-11-29 0:43 ` [PATCH 1/2] xfs: dynamic speculative EOF preallocation Dave Chinner @ 2010-12-07 10:17 ` Christoph Hellwig 2010-12-07 10:49 ` Dave Chinner 0 siblings, 1 reply; 22+ messages in thread From: Christoph Hellwig @ 2010-12-07 10:17 UTC (permalink / raw) To: Dave Chinner; +Cc: xfs I need the patch below to make the new code link on 32-bit systems: Index: xfs/fs/xfs/xfs_mount.c =================================================================== --- xfs.orig/fs/xfs/xfs_mount.c 2010-12-07 11:01:50.394021585 +0100 +++ xfs/fs/xfs/xfs_mount.c 2010-12-07 11:02:46.231256257 +0100 @@ -1111,7 +1111,10 @@ xfs_set_low_space_thresholds( int i; for (i = 0; i < XFS_LOWSP_MAX; i++) { - mp->m_low_space[i] = mp->m_sb.sb_dblocks / 100 * (i + 1); + __uint64_t space = mp->m_sb.sb_dblocks; + do_div(space, 100); + + mp->m_low_space[i] = space * (i + 1); } } _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 1/2] xfs: dynamic speculative EOF preallocation 2010-12-07 10:17 ` Christoph Hellwig @ 2010-12-07 10:49 ` Dave Chinner 0 siblings, 0 replies; 22+ messages in thread From: Dave Chinner @ 2010-12-07 10:49 UTC (permalink / raw) To: Christoph Hellwig; +Cc: xfs On Tue, Dec 07, 2010 at 05:17:46AM -0500, Christoph Hellwig wrote: > I need the patch below to make the new code link on 32-bit systems: > > Index: xfs/fs/xfs/xfs_mount.c > =================================================================== > --- xfs.orig/fs/xfs/xfs_mount.c 2010-12-07 11:01:50.394021585 +0100 > +++ xfs/fs/xfs/xfs_mount.c 2010-12-07 11:02:46.231256257 +0100 > @@ -1111,7 +1111,10 @@ xfs_set_low_space_thresholds( > int i; > > for (i = 0; i < XFS_LOWSP_MAX; i++) { > - mp->m_low_space[i] = mp->m_sb.sb_dblocks / 100 * (i + 1); > + __uint64_t space = mp->m_sb.sb_dblocks; > + do_div(space, 100); > + > + mp->m_low_space[i] = space * (i + 1); > } > } Yes, makes sense. I'll fold that into the latest version of the patch. Thanks. Cheers, Dave. -- Dave Chinner david@fromorbit.com _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 22+ messages in thread
* [PATCH 0/2] xfs: dynamic speculative allocation beyond EOF V4 @ 2010-12-13 1:25 Dave Chinner 2010-12-13 1:25 ` [PATCH 1/2] xfs: dynamic speculative EOF preallocation Dave Chinner 0 siblings, 1 reply; 22+ messages in thread From: Dave Chinner @ 2010-12-13 1:25 UTC (permalink / raw) To: xfs This is the latest version of the dynamic speculative allocation beyond EOF patch set. The description of the patchset can be found here: http://oss.sgi.com/archives/xfs/2010-10/msg00040.html Version 4: - factored prealloc size into separate function to keep xfs_iomap_write_delay() easy to read. - convert i_dirty_releases counter to a flag. Version 3: - allocsize mount option returned to fixed preallocation size only. - reduces maximum dynamic prealloc size as the filesytem gets near full. - split i_delayed_blks bug fixes into new patch (posted in 2.6.37-rc bug fix series) Version 2: - base speculative execution size on current inode size, not the number of previous speculative allocations. _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 22+ messages in thread
* [PATCH 1/2] xfs: dynamic speculative EOF preallocation 2010-12-13 1:25 [PATCH 0/2] xfs: dynamic speculative allocation beyond EOF V4 Dave Chinner @ 2010-12-13 1:25 ` Dave Chinner 2010-12-15 18:57 ` Christoph Hellwig 0 siblings, 1 reply; 22+ messages in thread From: Dave Chinner @ 2010-12-13 1:25 UTC (permalink / raw) To: xfs From: Dave Chinner <dchinner@redhat.com> Currently the size of the speculative preallocation during delayed allocation is fixed by either the allocsize mount option of a default size. We are seeing a lot of cases where we need to recommend using the allocsize mount option to prevent fragmentation when buffered writes land in the same AG. Rather than using a fixed preallocation size by default (up to 64k), make it dynamic by basing it on the current inode size. That way the EOF preallocation will increase as the file size increases. Hence for streaming writes we are much more likely to get large preallocations exactly when we need it to reduce fragementation. For default settings, the size of the initial extents is determined by the number of parallel writers and the amount of memory in the machine. For 4GB RAM and 4 concurrent 32GB file writes: EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL 0: [0..1048575]: 1048672..2097247 0 (1048672..2097247) 1048576 1: [1048576..2097151]: 5242976..6291551 0 (5242976..6291551) 1048576 2: [2097152..4194303]: 12583008..14680159 0 (12583008..14680159) 2097152 3: [4194304..8388607]: 25165920..29360223 0 (25165920..29360223) 4194304 4: [8388608..16777215]: 58720352..67108959 0 (58720352..67108959) 8388608 5: [16777216..33554423]: 117440584..134217791 0 (117440584..134217791) 16777208 6: [33554424..50331511]: 184549056..201326143 0 (184549056..201326143) 16777088 7: [50331512..67108599]: 251657408..268434495 0 (251657408..268434495) 16777088 and for 16 concurrent 16GB file writes: EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL 0: [0..262143]: 2490472..2752615 0 (2490472..2752615) 262144 1: [262144..524287]: 6291560..6553703 0 (6291560..6553703) 262144 2: [524288..1048575]: 13631592..14155879 0 (13631592..14155879) 524288 3: [1048576..2097151]: 30408808..31457383 0 (30408808..31457383) 1048576 4: [2097152..4194303]: 52428904..54526055 0 (52428904..54526055) 2097152 5: [4194304..8388607]: 104857704..109052007 0 (104857704..109052007) 4194304 6: [8388608..16777215]: 209715304..218103911 0 (209715304..218103911) 8388608 7: [16777216..33554423]: 452984848..469762055 0 (452984848..469762055) 16777208 Because it is hard to take back specualtive preallocation, cases where there are large slow growing log files on a nearly full filesystem may cause premature ENOSPC. Hence as the filesystem nears full, the maximum dynamic prealloc size іs reduced according to this table (based on 4k block size): freespace max prealloc size >5% full extent (8GB) 4-5% 2GB (8GB >> 2) 3-4% 1GB (8GB >> 3) 2-3% 512MB (8GB >> 4) 1-2% 256MB (8GB >> 5) <1% 128MB (8GB >> 6) This should reduce the amount of space held in speculative preallocation for such cases. The allocsize mount option turns off the dynamic behaviour and fixes the prealloc size to whatever the mount option specifies. i.e. the behaviour is unchanged. Signed-off-by: Dave Chinner <dchinner@redhat.com> --- fs/xfs/xfs_fsops.c | 1 + fs/xfs/xfs_iomap.c | 85 +++++++++++++++++++++++++++++++++++++++++++++------ fs/xfs/xfs_mount.c | 21 +++++++++++++ fs/xfs/xfs_mount.h | 14 ++++++++ 4 files changed, 111 insertions(+), 10 deletions(-) diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c index be34ff2..6d17206 100644 --- a/fs/xfs/xfs_fsops.c +++ b/fs/xfs/xfs_fsops.c @@ -374,6 +374,7 @@ xfs_growfs_data_private( mp->m_maxicount = icount << mp->m_sb.sb_inopblog; } else mp->m_maxicount = 0; + xfs_set_low_space_thresholds(mp); /* update secondary superblocks. */ for (agno = 1; agno < nagcount; agno++) { diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c index 2057614..40f9612 100644 --- a/fs/xfs/xfs_iomap.c +++ b/fs/xfs/xfs_iomap.c @@ -389,6 +389,9 @@ error_out: * If the caller is doing a write at the end of the file, then extend the * allocation out to the file system's write iosize. We clean up any extra * space left over when the file is closed in xfs_inactive(). + * + * If we find we already have delalloc preallocation beyond EOF, don't do more + * preallocation as it it not needed. */ STATIC int xfs_iomap_eof_want_preallocate( @@ -405,6 +408,7 @@ xfs_iomap_eof_want_preallocate( xfs_filblks_t count_fsb; xfs_fsblock_t firstblock; int n, error, imaps; + int found_delalloc = 0; *prealloc = 0; if ((offset + count) <= ip->i_size) @@ -427,14 +431,63 @@ xfs_iomap_eof_want_preallocate( if ((imap[n].br_startblock != HOLESTARTBLOCK) && (imap[n].br_startblock != DELAYSTARTBLOCK)) return 0; + start_fsb += imap[n].br_blockcount; count_fsb -= imap[n].br_blockcount; + + if (imap[n].br_startblock == DELAYSTARTBLOCK) + found_delalloc = 1; } } - *prealloc = 1; + if (!found_delalloc) + *prealloc = 1; return 0; } +/* + * If we don't have a user specified preallocation size, dynamically increase + * the preallocation size as the size of the file grows. Cap the maximum size + * at a single extent or less if the filesystem is near full. The closer the + * filesystem is to full, the smaller the maximum prealocation. + */ +STATIC xfs_fsblock_t +xfs_iomap_prealloc_size( + struct xfs_mount *mp, + struct xfs_inode *ip) +{ + xfs_fsblock_t alloc_blocks = 0; + + if (!(mp->m_flags & XFS_MOUNT_DFLT_IOSIZE)) { + int shift = 0; + int64_t freesp; + + alloc_blocks = XFS_B_TO_FSB(mp, ip->i_size); + alloc_blocks = XFS_FILEOFF_MIN(MAXEXTLEN, + rounddown_pow_of_two(alloc_blocks)); + + freesp = percpu_counter_read_positive( + &mp->m_icsb[XFS_ICSB_FDBLOCKS]); + if (freesp < mp->m_low_space[XFS_LOWSP_5_PCNT]) { + shift = 2; + if (freesp < mp->m_low_space[XFS_LOWSP_4_PCNT]) + shift++; + if (freesp < mp->m_low_space[XFS_LOWSP_3_PCNT]) + shift++; + if (freesp < mp->m_low_space[XFS_LOWSP_2_PCNT]) + shift++; + if (freesp < mp->m_low_space[XFS_LOWSP_1_PCNT]) + shift++; + } + if (shift) + alloc_blocks >>= shift; + } + + if (alloc_blocks < mp->m_writeio_blocks) + alloc_blocks = mp->m_writeio_blocks; + + return alloc_blocks; +} + STATIC int xfs_iomap_write_delay( xfs_inode_t *ip, @@ -469,6 +522,7 @@ xfs_iomap_write_delay( extsz = xfs_get_extsz_hint(ip); offset_fsb = XFS_B_TO_FSBT(mp, offset); + error = xfs_iomap_eof_want_preallocate(mp, ip, offset, count, ioflag, imap, XFS_WRITE_IMAPS, &prealloc); if (error) @@ -476,9 +530,11 @@ xfs_iomap_write_delay( retry: if (prealloc) { + xfs_fsblock_t alloc_blocks = xfs_iomap_prealloc_size(mp, ip); + aligned_offset = XFS_WRITEIO_ALIGN(mp, (offset + count - 1)); ioalign = XFS_B_TO_FSBT(mp, aligned_offset); - last_fsb = ioalign + mp->m_writeio_blocks; + last_fsb = ioalign + alloc_blocks; } else { last_fsb = XFS_B_TO_FSB(mp, ((xfs_ufsize_t)(offset + count))); } @@ -496,22 +552,31 @@ retry: XFS_BMAPI_DELAY | XFS_BMAPI_WRITE | XFS_BMAPI_ENTIRE, &firstblock, 1, imap, &nimaps, NULL); - if (error && (error != ENOSPC)) + switch (error) { + case 0: + case ENOSPC: + case EDQUOT: + break; + default: return XFS_ERROR(error); + } /* - * If bmapi returned us nothing, and if we didn't get back EDQUOT, - * then we must have run out of space - flush all other inodes with - * delalloc blocks and retry without EOF preallocation. + * If bmapi returned us nothing, we got either ENOSPC or EDQUOT. For + * ENOSPC, * flush all other inodes with delalloc blocks to free up + * some of the excess reserved metadata space. For both cases, retry + * without EOF preallocation. */ if (nimaps == 0) { trace_xfs_delalloc_enospc(ip, offset, count); if (flushed) - return XFS_ERROR(ENOSPC); + return XFS_ERROR(error ? error : ENOSPC); - xfs_iunlock(ip, XFS_ILOCK_EXCL); - xfs_flush_inodes(ip); - xfs_ilock(ip, XFS_ILOCK_EXCL); + if (error == ENOSPC) { + xfs_iunlock(ip, XFS_ILOCK_EXCL); + xfs_flush_inodes(ip); + xfs_ilock(ip, XFS_ILOCK_EXCL); + } flushed = 1; error = 0; diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c index 5e41ef3..fe27338 100644 --- a/fs/xfs/xfs_mount.c +++ b/fs/xfs/xfs_mount.c @@ -1101,6 +1101,24 @@ xfs_set_rw_sizes(xfs_mount_t *mp) } /* + * precalculate the low space thresholds for dynamic speculative preallocation. + */ +void +xfs_set_low_space_thresholds( + struct xfs_mount *mp) +{ + int i; + + for (i = 0; i < XFS_LOWSP_MAX; i++) { + __uint64_t space = mp->m_sb.sb_dblocks; + + do_div(space, 100); + mp->m_low_space[i] = space * (i + 1); + } +} + + +/* * Set whether we're using inode alignment. */ STATIC void @@ -1322,6 +1340,9 @@ xfs_mountfs( */ xfs_set_rw_sizes(mp); + /* set the low space thresholds for dynamic preallocation */ + xfs_set_low_space_thresholds(mp); + /* * Set the inode cluster size. * This may still be overridden by the file system diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h index 03ad25c6..7b42e04 100644 --- a/fs/xfs/xfs_mount.h +++ b/fs/xfs/xfs_mount.h @@ -75,6 +75,16 @@ enum { XFS_ICSB_MAX, }; +/* dynamic preallocation free space thresholds, 5% down to 1% */ +enum { + XFS_LOWSP_1_PCNT = 0, + XFS_LOWSP_2_PCNT, + XFS_LOWSP_3_PCNT, + XFS_LOWSP_4_PCNT, + XFS_LOWSP_5_PCNT, + XFS_LOWSP_MAX, +}; + typedef struct xfs_mount { struct super_block *m_super; xfs_tid_t m_tid; /* next unused tid for fs */ @@ -169,6 +179,8 @@ typedef struct xfs_mount { on the next remount,rw */ struct shrinker m_inode_shrink; /* inode reclaim shrinker */ struct percpu_counter m_icsb[XFS_ICSB_MAX]; + int64_t m_low_space[XFS_LOWSP_MAX]; + /* low free space thresholds */ } xfs_mount_t; /* @@ -333,6 +345,8 @@ extern void xfs_icsb_sync_counters(struct xfs_mount *); extern int xfs_icsb_modify_inodes(struct xfs_mount *, int, int64_t); extern int xfs_icsb_modify_free_blocks(struct xfs_mount *, int64_t, int); +extern void xfs_set_low_space_thresholds(struct xfs_mount *); + #endif /* __KERNEL__ */ extern void xfs_mod_sb(struct xfs_trans *, __int64_t); -- 1.7.2.3 _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply related [flat|nested] 22+ messages in thread
* Re: [PATCH 1/2] xfs: dynamic speculative EOF preallocation 2010-12-13 1:25 ` [PATCH 1/2] xfs: dynamic speculative EOF preallocation Dave Chinner @ 2010-12-15 18:57 ` Christoph Hellwig 0 siblings, 0 replies; 22+ messages in thread From: Christoph Hellwig @ 2010-12-15 18:57 UTC (permalink / raw) To: Dave Chinner; +Cc: xfs With this patch my 32-bit x86 VM hangs when running test 014, although I can interrupt it. With the full patchset applied I get softlockup warnings in this test for xfsaild, which seems related. Also test 012 fails, but that might be intentional (?). _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 22+ messages in thread
end of thread, other threads:[~2010-12-15 18:55 UTC | newest] Thread overview: 22+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2010-10-04 10:13 [RFC, PATCH 0/2] xfs: dynamic speculative preallocation for delalloc Dave Chinner 2010-10-04 10:13 ` [PATCH 1/2] xfs: dynamic speculative EOF preallocation Dave Chinner 2010-10-14 17:22 ` Alex Elder 2010-10-14 21:33 ` Dave Chinner 2010-10-15 6:51 ` allocsize mount option, was: " Michael Monnerie 2010-10-15 11:59 ` Dave Chinner 2010-10-04 10:13 ` [PATCH 2/2] xfs: don't truncate prealloc from frequently accessed inodes Dave Chinner 2010-10-14 17:22 ` Alex Elder 2010-10-14 21:28 ` Dave Chinner 2010-10-14 17:22 ` [RFC, PATCH 0/2] xfs: dynamic speculative preallocation for delalloc Alex Elder 2010-10-14 21:16 ` Dave Chinner 2010-10-14 21:50 ` Ivan.Novick 2010-10-15 7:14 ` Michael Monnerie 2010-10-15 11:45 ` Dave Chinner 2010-10-17 14:31 ` Michael Monnerie 2010-10-17 23:49 ` Dave Chinner 2010-10-18 6:39 ` Michael Monnerie -- strict thread matches above, loose matches on Subject: below -- 2010-11-29 0:43 [PATCH 0/2] xfs: dynamic speculative allocation beyond EOF V3 Dave Chinner 2010-11-29 0:43 ` [PATCH 1/2] xfs: dynamic speculative EOF preallocation Dave Chinner 2010-12-07 10:17 ` Christoph Hellwig 2010-12-07 10:49 ` Dave Chinner 2010-12-13 1:25 [PATCH 0/2] xfs: dynamic speculative allocation beyond EOF V4 Dave Chinner 2010-12-13 1:25 ` [PATCH 1/2] xfs: dynamic speculative EOF preallocation Dave Chinner 2010-12-15 18:57 ` Christoph Hellwig
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox