* Re: xfs_fsr question for improvement
@ 2010-04-26 20:58 Richard Scobie
0 siblings, 0 replies; 14+ messages in thread
From: Richard Scobie @ 2010-04-26 20:58 UTC (permalink / raw)
To: xfs
Linda Walsh wrote:
> That's a rather naive view. It may not be one application but several
> writing to the disk at once. Or it could be one, but recording multiple
> streams to disk at the same time -- of course it would have to write
> them to disk as they come in, as memory is limited -- how else would
> you prevent interleaving in such a case? There are too many situations
> where fragmenting can occur to toss them all off and say they are the
> result of not paying an application programmer to do it "correctly".
Agreed. XFS does offer some assistance for improving some of these
situations - the "filestreams" mount option. This is not very well
documented in the man pages, but the following is extracted from tutorial notes:
"Filestreams Allocator
A certain class of applications such as those doing film scanner
video ingest will write many large files to a directory in sequence.
It's important for playback performance that these files end up
allocated next to each other on disk, since consecutive data is
retrieved optimally by hardware RAID read-ahead.
XFS's standard allocator starts out doing the right thing as far
as file allocation is concerned. Even if multiple streams are being
written simultaneously, their files will be placed separately and
contiguously on disk. The problem is that once an allocation group
fills up, a new one must be chosen and there's no longer a parent
directory in a unique AG to use as an AG "owner". Without a way
to reserve the new AG for the original directory's use, all the
files being allocated by all the streams will start getting placed
in the same AGs as each other. The result is that consecutive
frames in one directory are placed on disk with frames from other
directories interleaved between them, which is a worst-case layout
for playback performance. When reading back the frames in directory
A, hardware RAID read-ahead will cache data from frames in directory
B which is counterproductive.
Create a file system with a small AG size to demonstrate:
sles10:~ sjv: sudo mkfs.xfs -d agsize=64m /dev/sdb7 > /dev/null
sles10:~ sjv: sudo mount /dev/sdb7 /test
sles10:~ sjv: sudo chmod 777 /test
sles10:~ sjv: cd /test
sles10:/test sjv:
Create ten 10MB files concurrently in two directories:
sles10:/test sjv: mkdir a b
sles10:/test sjv: for dir in a b; do
> for file in `seq 0 9`; do
> xfs_mkfile 10m $dir/$file
> done &
> done; wait 2>/dev/null
[1] 30904
[2] 30905
sles10:/test sjv: ls -lid * */*
131 drwxr-xr-x 2 sjv users 86 2006-10-20 13:48 a
132 -rw------- 1 sjv users 10485760 2006-10-20 13:48 a/0
133 -rw------- 1 sjv users 10485760 2006-10-20 13:48 a/1
134 -rw------- 1 sjv users 10485760 2006-10-20 13:48 a/2
135 -rw------- 1 sjv users 10485760 2006-10-20 13:48 a/3
136 -rw------- 1 sjv users 10485760 2006-10-20 13:48 a/4
137 -rw------- 1 sjv users 10485760 2006-10-20 13:48 a/5
138 -rw------- 1 sjv users 10485760 2006-10-20 13:48 a/6
139 -rw------- 1 sjv users 10485760 2006-10-20 13:48 a/7
140 -rw------- 1 sjv users 10485760 2006-10-20 13:48 a/8
141 -rw------- 1 sjv users 10485760 2006-10-20 13:48 a/9
262272 drwxr-xr-x 2 sjv users 86 2006-10-20 13:48 b
262273 -rw------- 1 sjv users 10485760 2006-10-20 13:48 b/0
262274 -rw------- 1 sjv users 10485760 2006-10-20 13:48 b/1
262275 -rw------- 1 sjv users 10485760 2006-10-20 13:48 b/2
262276 -rw------- 1 sjv users 10485760 2006-10-20 13:48 b/3
262277 -rw------- 1 sjv users 10485760 2006-10-20 13:48 b/4
262278 -rw------- 1 sjv users 10485760 2006-10-20 13:48 b/5
262279 -rw------- 1 sjv users 10485760 2006-10-20 13:48 b/6
262280 -rw------- 1 sjv users 10485760 2006-10-20 13:48 b/7
262281 -rw------- 1 sjv users 10485760 2006-10-20 13:48 b/8
262282 -rw------- 1 sjv users 10485760 2006-10-20 13:48 b/9
sles10:/test sjv:
Note that all the inodes are in the same AGs as each other. What
about the file data? Use xfs_bmap -v to examine the extents:
sles10:/test sjv: for file in `seq 0 9`; do
> bmap_a=`xfs_bmap -v a/$file | tail -1`
> bmap_b=`xfs_bmap -v b/$file | tail -1`
> ag_a=`echo $bmap_a | awk '{print $4}'`
> ag_b=`echo $bmap_b | awk '{print $4}'`
> br_a=`echo $bmap_a | awk '{printf "%-18s", $3}'`
> br_b=`echo $bmap_b | awk '{printf "%-18s", $3}'`
> echo a/$file: $ag_a "$br_a" b/$file: $ag_b "$br_b"
> done
a/0: 0 96..20575 b/0: 1 131168..151647
a/1: 0 20576..41055 b/1: 1 151648..172127
a/2: 0 41056..61535 b/2: 1 172128..192607
a/3: 0 61536..82015 b/3: 1 192608..213087
a/4: 0 82016..102495 b/4: 1 213088..233567
a/5: 0 102496..122975 b/5: 1 233568..254047
a/6: 2 299600..300111 b/6: 2 262208..275007
a/7: 2 338016..338527 b/7: 2 312400..312911
a/8: 2 344672..361567 b/8: 3 393280..401983
a/9: 2 361568..382047 b/9: 3 401984..421951
sles10:/test sjv:
The middle column is the AG number and the right column is the block
range. Note how the extents for files in both directories get
placed on top of each other in AG 2.
Something to note in the results is that even though the file extents
have worked their way up into AGs 2 and 3, the inode numbers show
that the file inodes are all in the same AGs as their parent
directory, i.e. AGs 0 and 1. Why is this? To understand, it's
important to consider the order in which events are occurring. The
two bash processes writing files are calling xfs_mkfile, which
starts by opening a file with the O_CREAT flag. At this point, XFS
has no idea how large the file's data is going to be, so it dutifully
creates a new inode for the file in the same AG as the parent
directory. The call returns successfully and the system continues
with its tasks. When XFS is asked to write the file data a short time
later, a new AG must be found for it because the AG the inode is
in is full. The result is a violation of the original goal to keep
file data close to its inode on disk. In practice, because inodes
are allocated in clusters on disk, a process that's reading back a
stream is likely to cache all the inodes it needs with just one or
two reads, so the disk seeking involved won't be as bad as it first
seems.
On the other hand, the extent data placement seen in the xfs_bmap
-v output is a problem. Once the data extents spilled into AG 2,
both processes were given allocations there on a first-come-first-served
basis. This destroyed the neatly contiguous allocation pattern for
the files and will certainly degrade read performance later on.
To address this issue, a new allocation algorithm was added to XFS
that associates a parent directory with an AG until a preset
inactivity timeout elapses. The new algorithm is called the
Filestreams allocator and it is enabled in one of two ways. Either
the filesystem is mounted with the -o filestreams option, or the
filestreams chattr flag is applied to a directory to indicate that
all allocations beneath that point in the directory hierarchy should
use the filestreams allocator.
With the filestreams allocator enabled, the above test produces
results that look like this:
a/0: 0 96..20575 b/0: 1 131168..151647
a/1: 0 20576..41055 b/1: 1 151648..172127
a/2: 0 41056..61535 b/2: 1 172128..192607
a/3: 0 61536..82015 b/3: 1 192608..213087
a/4: 0 82016..102495 b/4: 1 213088..233567
a/5: 0 102496..122975 b/5: 1 233568..254047
a/6: 2 272456..273479 b/6: 3 393280..410271
a/7: 2 290904..300119 b/7: 3 410272..426655
a/8: 2 300632..321111 b/8: 3 426656..441503
a/9: 2 329304..343639 b/9: 3 441504..459935
Once the process writing files to the first directory starts using
AG 2, that AG is no longer considered available so the other process
skips it and moves to AG 3."
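For reference, the two ways of enabling the filestreams allocator mentioned in the notes look like this (the device, mount point and directory names here are hypothetical, and both require an XFS filesystem):

```shell
# Whole filesystem: enable the filestreams allocator at mount time.
sudo mount -o filestreams /dev/sdb7 /test

# Per directory: set the filestreams inheritance flag ('t' in xfs_io);
# all allocations below this point then use the filestreams allocator.
sudo xfs_io -c 'chattr +t' /test/ingest
```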
> I don't see why you posted -- it wasn't to help anyone nor to offer
> constructive criticism. It was a bit harsh on the criticism side, as
> though something about it was 'personal' for you.... Also sensed a
> tinge of bitterness in that last bit of
Not unusual for this poster.
Regards,
Richard
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
* xfs_fsr question for improvement
@ 2010-04-16 8:43 Michael Monnerie
2010-04-16 10:43 ` Stan Hoeppner
2010-04-17 1:24 ` Dave Chinner
0 siblings, 2 replies; 14+ messages in thread
From: Michael Monnerie @ 2010-04-16 8:43 UTC (permalink / raw)
To: xfs
From the man page I read that a file is defragmented by copying it to a
free space big enough to place it in one extent.
Now I have a 4TB filesystem, where all files written are at least 1GB,
average 5GB, up to 30GB each. I just xfs_growfs'd that filesystem to
6TB, as it was 97% full (150GB free). Every night an xfs_fsr run
finishes defragmenting everything, except during the last few days,
when it didn't find enough contiguous free space to defragment.
Could it be that the defragmentation did its job, but in the end the
file layout was like this:
file 1GB
freespace 900M
file 1GB
freespace 900M
file 1GB
freespace 900M
That, while being an "almost worst case" scenario, would mean that once
the filesystem is about 50% full, new 1GB files will be fragmented all
the time.
To prevent this, xfs_fsr should do a "compress" phase after
defragmentation finished, in order to move all the files behind each
other:
file 1GB
file 1GB
file 1GB
file 1GB
freespace 3600M
That would also help fill the filesystem from front to end, reducing
disk head moves.
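As a quick sanity check of that scenario (pure arithmetic, using the sizes from the sketch above): once every free hole is 900M, a new 1GB file cannot fit into any single hole, so it needs at least two extents no matter how much total free space exists:

```shell
# Minimum extent count for a 1024 MB file when the largest contiguous
# free hole is 900 MB: ceil(1024 / 900).
gap_mb=900
file_mb=1024
echo $(( (file_mb + gap_mb - 1) / gap_mb ))   # -> 2
```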
Another thing, but related to xfs_fsr, is that I did an xfs_repair on
that filesystem once, and I could see there were a lot of small I/Os
done, with almost no throughput. The disks are 7.200rpm 2TB disks, so
random disk access is horribly slow, and it looked like the disks were
doing nothing else but seeking.
Would it be possible xfs_fsr defrags the meta data in a way that they
are all together so seeks are faster?
Currently, when I do "find /this_big_fs -inum 1234", it takes *ages* for
a run, while there are not so many files on it:
# iostat -kx 5 555
Device: r/s rkB/s avgrq-sz avgqu-sz await svctm %util
xvdb 23,20 92,80 8,00 0,42 15,28 18,17 42,16
xvdc 20,20 84,00 8,32 0,57 28,40 28,36 57,28
(I edited the output to remove "writes" columns, as they are 0)
This is a RAID-5 over 7 disks, with 2TB volumes concatenated using
LVM. As I only added the 3rd 2TB volume today, there are no seeks on
that new area yet.
So I get 43 reads/second at 100% utilization. Well I can see up to
150r/s, but still that's no "wow". A single run to find an inode takes a
very long time.
# df -i
Filesystem Inodes IUsed IFree IUse%
mybigstore 1258291200 765684 1257525516 1%
So only 765.684 files, and it takes about 8 minutes for a "find" pass.
Maybe an xfs_fsr over metadata could help here?
--
with kind regards,
Michael Monnerie, Ing. BSc
it-management Internet Services
http://proteger.at [gesprochen: Prot-e-schee]
Tel: 0660 / 415 65 31
// We currently have two houses for sale:
// http://zmi.at/langegg/
// http://zmi.at/haus2009/
* Re: xfs_fsr question for improvement
2010-04-16 8:43 Michael Monnerie
@ 2010-04-16 10:43 ` Stan Hoeppner
2010-04-17 1:24 ` Dave Chinner
1 sibling, 0 replies; 14+ messages in thread
From: Stan Hoeppner @ 2010-04-16 10:43 UTC (permalink / raw)
To: xfs
Michael Monnerie put forth on 4/16/2010 3:43 AM:
> To prevent this, xfs_fsr should do a "compress" phase after
> defragmentation finished, in order to move all the files behind each
> other:
> file 1GB
> file 1GB
> file 1GB
> file 1GB
> freespace 3600M
> That would also help fill the filesystem from front to end, reducing
> disk head moves.
What happens if those are frequently written/appended files, such as logs or
mbox mail files, database files, etc? If you pack them nose to tail with
this "compress" phase they will instantly be fragmented upon the next append
operation. Leaving some free sectors at the tail end of a file is what
helps prevent fragmentation. I don't think this compression would be a good
default behavior. I think "packing" is probably a better term, as
"compression" has a long-standing connotation.
Sounds like you have a corner case. If this "packing" was implemented,
maybe it would be best to make it a command line option only.
--
Stan
* Re: xfs_fsr question for improvement
2010-04-16 8:43 Michael Monnerie
2010-04-16 10:43 ` Stan Hoeppner
@ 2010-04-17 1:24 ` Dave Chinner
2010-04-17 7:13 ` Emmanuel Florac
2010-05-03 6:49 ` Michael Monnerie
1 sibling, 2 replies; 14+ messages in thread
From: Dave Chinner @ 2010-04-17 1:24 UTC (permalink / raw)
To: Michael Monnerie; +Cc: xfs
On Fri, Apr 16, 2010 at 10:43:10AM +0200, Michael Monnerie wrote:
> From the man page I read that a file is defragmented by copying it to a
> free space big enough to place it in one extent.
>
> Now I have a 4TB filesystem, where all files written are at least 1GB,
> average 5GB, up to 30GB each. I just xfs_growfs'd that filesystem to
> 6TB, as it was 97% full (150GB free). Every night a xfs_fsr runs and
> finished to defragment everything, except during the last days where it
> didn't find enough free space in a row to defragment.
>
> Could it be that the defragmentation did it's job but in the end the
> file layout was like this:
> file 1GB
> freespace 900M
> file 1GB
> freespace 900M
> file 1GB
> freespace 900M
> That, while being an "almost worst case" scenario, would mean that once
> the filesystem is about 50% full, new 1GB files will be fragmented all
> the time.
Yup, xfs_fsr does not care about free space fragmentation - it just
cares about reducing the number of extents in the target file. fsr
is not very smart, because being smart is hard. Also, fsr is generally
not needed because the allocator usually does a pretty good job of
laying out files contiguously up front.
However, the mistake that _everyone_ makes is assuming that "not
quite perfect" equals "fragmented and needs fixing".
2 extents in a 1GB file is not a fragmented file - if the number was
in the hundreds then I'd be saying that it was fragmented, but not
single digits. XFS resists fragmentation better than most other
filesystems, so defragmentation, while possible, is generally not
needed.
You've got to think about what the numbers you are seeing really
mean before you can determine if you have a fragmentation problem or
not. If you don't understand what they mean in terms of your
applications or you aren't seeing any adverse performance problems,
then you don't have a fragmentation problem, no matter what the
numbers say....
e.g. I only consider a file fragmented enough to run fsr on it when
the number of extents or location of them is such that I can't get
large IOs from it (i.e. extents of less than a couple of megabytes
for most users) and it therefore affects performance. An example of
this is my VM block device images:
$ for f in `ls *.img`; do sudo xfs_bmap -v $f |tail -1 | awk '// {print $1}' ; done
856:
2676:
103:
823:
5452:
4734:
9222:
4101:
4258:
They have thousands of extents in them and they are all between
8-10GB in size, and IO from my VMs is still capable of saturating
the disks backing these files. While I'd normally consider these
files fragmented and candidates for running fsr on them, the number
of extents is not actually a performance-limiting factor and so
there's no point in defragmenting them. Especially as that requires
shutting down the VMs...
> To prevent this, xfs_fsr should do a "compress" phase after
> defragmentation finished, in order to move all the files behind each
> other:
> file 1GB
> file 1GB
> file 1GB
> file 1GB
> freespace 3600M
> That would also help fill the filesystem from front to end, reducing
> disk head moves.
Packing requires a whole lot more knowledge of the filesystem layout
in fsr, like where the free space is. We don't export that
information to userspace. It also requires the ability to allocate
at specific locations, instead of letting the allocator choose as it
does now. This is also a capability we don't have from userspace.
If you want to extend fsr to do this, you need to discover all the
files that have data in the same AG as the one you want to pack
(requires a full filesystem scan to build a block-to-owner inode
mapping), then move the data out of the identified areas of
freespace fragmentation into other AGs, then move them back in using
preallocation. This will pack the data as best as possible. I don't
have time to do this myself, but I'll happily review the patches ;)
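The discovery pass described above could start with a per-file filter like the following sketch: given `xfs_bmap -v` output (a filename line, a header line, then one row per extent with the AG number in column 4), report whether the file owns any extent in a target AG. Sample output is inlined here so the filter runs without an XFS filesystem; on a real one you would pipe `xfs_bmap -v "$f"` in instead.

```shell
# Canned `xfs_bmap -v` output for one file (values taken from the
# tutorial demo earlier in the thread); real usage would be:
#   xfs_bmap -v "$f" | awk ...
bmap='a/6:
 EXT: FILE-OFFSET     BLOCK-RANGE     AG AG-OFFSET        TOTAL
   0: [0..511]:       299600..300111   2 (37392..37903)     512
   1: [512..20479]:   344672..364639   2 (82464..102431)  19968'

# Rows 3 onward are extents; column 4 is the allocation group number.
echo "$bmap" | awk -v ag=2 '
    NR > 2 && $4 == ag { found = 1 }
    END { print (found ? "in AG" : "not in AG") }'   # -> in AG
```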
Alternatively, if you want to pack your filesystem right now, copy
everything off it and then copy it back on. i.e. dump and restore.
> Another thing, but related to xfs_fsr, is that I did an xfs_repair on
> that filesystem once, and I could see there were a lot of small I/Os
> done, with almost no throughput. The disks are 7.200rpm 2TB disks, so
> random disk access is horribly slow, and it looked like the disks were
> doing nothing else but seeking.
This is not at all related to xfs_fsr. Newer versions of repair are
much smarter about reading metadata off disk - they can do readahead
and reorder IOs into ascending block offset....
> Would it be possible xfs_fsr defrags the meta data in a way that they
> are all together so seeks are faster?
It's not related to fsr because fsr does not defragment metadata.
Some metadata cannot be defragmented (e.g. inodes cannot be moved),
some metadata cannot be manipulated directly (e.g. free space
btrees), and some is just difficult to do (e.g. directory
defragmentation) so hasn't ever been done.
> Currently, when I do "find /this_big_fs -inum 1234", it takes *ages* for
> a run, while there are not so many files on it:
> # iostat -kx 5 555
> Device: r/s rkB/s avgrq-sz avgqu-sz await svctm %util
> xvdb 23,20 92,80 8,00 0,42 15,28 18,17 42,16
> xvdc 20,20 84,00 8,32 0,57 28,40 28,36 57,28
Well, it's not XFS's fault that each read IO is taking 20-30ms. You
can only do 30-50 IOs a second per drive at that rate, so:
[...]
> So I get 43 reads/second at 100% utilization. Well I can see up to
This is right on the money - it's going as fast as your (slow) RAID-5
volume will allow it to....
> 150r/s, but still that's no "wow". A single run to find an inode takes a
> very long time.
Raid 5/6 generally provides the same IOPS performance as a single
spindle, regardless of the width of the RAID stripe. A 2TB sata
drive might be able to do 150-200 IOPS, so a RAID5 array made up of
these drives will tend to max out at roughly the same....
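A back-of-envelope check of those numbers against the iostat output earlier in the thread (assumed: roughly 45 reads/s on each of the two devices, over the ~8-minute find pass):

```shell
# Total random reads issued during the pass: seek-bound, as expected
# for a RAID-5 volume delivering roughly single-spindle IOPS.
reads_per_sec=90      # ~45 r/s per device, two devices
seconds=$(( 8 * 60 ))
echo $(( reads_per_sec * seconds ))   # -> 43200
```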
> # df -i
> Filesystem Inodes IUsed IFree IUse%
> mybigstore 1258291200 765684 1257525516 1%
>
> So only 765.684 files, and it takes about 8 minutes for a "find" pass.
> Maybe an xfs_fsr over metadata could help here?
Eric recently increased the directory read buffer size fed to XFS,
which should allow more readahead to occur internally for large
directories. This will help when reading large directories, but nothing
can be done in XFS when the directories are small, because inodes can't
be moved and find does not do readahead of directory inodes...
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: xfs_fsr question for improvement
2010-04-17 1:24 ` Dave Chinner
@ 2010-04-17 7:13 ` Emmanuel Florac
2010-04-25 11:17 ` Peter Grandi
2010-05-03 6:49 ` Michael Monnerie
1 sibling, 1 reply; 14+ messages in thread
From: Emmanuel Florac @ 2010-04-17 7:13 UTC (permalink / raw)
To: Dave Chinner; +Cc: Michael Monnerie, xfs
On Sat, 17 Apr 2010 11:24:15 +1000, you wrote:
> XFS resists fragmentation better than most other
> filesystems, so defragmentation, while possible, is generally not
> needed.
There are nevertheless two systems where I need to use xfs_fsr
regularly to keep decent performance: my test VMware server
(performance dropped to an abysmal level until I set up a daily
xfs_fsr cron job), and a write-intensive video server.
--
------------------------------------------------------------------------
Emmanuel Florac | Direction technique
| Intellique
| <eflorac@intellique.com>
| +33 1 78 94 84 02
------------------------------------------------------------------------
* Re: xfs_fsr question for improvement
2010-04-17 7:13 ` Emmanuel Florac
@ 2010-04-25 11:17 ` Peter Grandi
2010-04-25 13:02 ` Emmanuel Florac
2010-04-26 0:02 ` Linda Walsh
0 siblings, 2 replies; 14+ messages in thread
From: Peter Grandi @ 2010-04-25 11:17 UTC (permalink / raw)
To: Linux XFS
[ ... ]
>> XFS resists fragmentation better than most other filesystems,
>> so defragmentation, while possible, is generally not needed.
That's a common myth, for XFS as for most other filesystems.
Also, most applications write files really badly.
http://www.sabi.co.uk/blog/anno06-3rd.html#060914b
> There two systems nevertheless where I need to use xfs_fsr
> regularly to keep decent performance :
Fortunately 'xfs_fsr' is fairly reliable, but even so, in-place
defragmentation is a risky proposition for several reasons.
Note also that 'xfs_fsr' uses a terrible "defragmentation"
strategy (from 'man xfs_fsr'):
"The reorganization algorithm operates on one file at a time,"
"xfs_fsr improves the layout of extents for each file by
copying the entire file to a temporary location and then
interchanging the data extents of the target and temporary
files in an atomic manner."
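(For reference, the per-file reorganisation the man page describes is typically driven like this; the path is hypothetical, and `-v` just makes xfs_fsr report what it reorganises:)

```shell
# Defragment a single file in place; xfs_fsr copies it to a temporary
# file and atomically swaps the data extents, as quoted above.
xfs_fsr -v /data/video/clip0001.mov
```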
> my test VMware server (performance dropped down to abysmal
> level until I set up a daily xfs_fsr cron job),
That should not be the case unless you are using very sparse
VM image files, in which case you get what you pay for.
> and a write-intensive video server.
That also should not be the case unless your applications write
strategy is wrong and you get extremely interleaved streams, in
which case you get what you paid for the application programmer.
* Re: xfs_fsr question for improvement
2010-04-25 11:17 ` Peter Grandi
@ 2010-04-25 13:02 ` Emmanuel Florac
2010-04-25 21:04 ` Eric Sandeen
2010-04-26 0:02 ` Linda Walsh
1 sibling, 1 reply; 14+ messages in thread
From: Emmanuel Florac @ 2010-04-25 13:02 UTC (permalink / raw)
To: Peter Grandi; +Cc: Linux XFS
On Sun, 25 Apr 2010 12:17:24 +0100, you wrote:
> > my test VMware server (performance dropped down to abysmal
> > level until I set up a daily xfs_fsr cron job),
>
> That should not be the case unless you are using very sparse
> VM image files, in which case you get what you pay for.
>
This is a development and test VMware server, so it hosts lots (100
or so) of test VMs with sparse image files (when you start a VM to host
a quick test, you don't want to spend 15 minutes initializing the
drives).
> > and a write-intensive video server.
>
> That also should not be the case unless your applications write
> strategy is wrong and you get extremely interleaved streams, in
> which case you get what you paid for the application programmer.
The application write strategy is as simple as possible: several
different machines, each unaware of the others, write huge media files
to a samba share. I don't know how it could be worse, but it could
hardly be enhanced in any way.
--
------------------------------------------------------------------------
Emmanuel Florac | Direction technique
| Intellique
| <eflorac@intellique.com>
| +33 1 78 94 84 02
------------------------------------------------------------------------
* Re: xfs_fsr question for improvement
2010-04-25 13:02 ` Emmanuel Florac
@ 2010-04-25 21:04 ` Eric Sandeen
2010-04-25 21:44 ` Emmanuel Florac
0 siblings, 1 reply; 14+ messages in thread
From: Eric Sandeen @ 2010-04-25 21:04 UTC (permalink / raw)
To: Emmanuel Florac; +Cc: Peter Grandi, Linux XFS
Emmanuel Florac wrote:
> On Sun, 25 Apr 2010 12:17:24 +0100, you wrote:
>
>>> my test VMware server (performance dropped down to abysmal
>>> level until I set up a daily xfs_fsr cron job),
>> That should not be the case unless you are using very sparse
>> VM image files, in which case you get what you pay for.
>>
>
> This is a development and test VM Ware server, so it hosts lots ( 100
> or so) of test VM with sparse image files (when you start a VM to host
> a quick test, you don't want to spend 15 minutes initializing the
> drives).
If you have the -space- then you can use space preallocation to
do this very quickly, FWIW. xfs_io's resvsp command, or fallocate
in recent util-linux-ng, if vmware doesn't do it on its own already.
You pay some penalty for unwritten extent conversion but it'd be
better than massive fragmentation of the images.
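A sketch of that suggestion (the image path and size here are hypothetical): reserve the space up front so later guest writes land inside one preallocated region instead of fragmenting.

```shell
# XFS-specific: reserve 20 GiB of unwritten extents from offset 0.
xfs_io -f -c 'resvsp 0 20g' /vms/test.img
# Or, filesystem-agnostic, with fallocate from recent util-linux-ng:
fallocate -l 20G /vms/test.img
```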
>>> and a write-intensive video server.
>> That also should not be the case unless your applications write
>> strategy is wrong and you get extremely interleaved streams, in
>> which case you get what you paid for the application programmer.
>
> The application write strategy is as simple as possible; several
> different machines unaware of every other write huge media files to a
> samba share. I don't know how it could be worse, however it could
> hardly be enhanced in any way.
If it's all large writes, you could mount -o allocsize=512m or so:
allocsize=size
Sets the buffered I/O end-of-file preallocation size when
doing delayed allocation writeout (default size is 64KiB).
Valid values for this option are page size (typically 4KiB)
through to 1GiB, inclusive, in power-of-2 increments.
and that might help.
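Concretely, that looks like the following (the device and mount point are assumed):

```shell
# One-off mount with 512 MiB end-of-file preallocation:
sudo mount -o allocsize=512m /dev/sdc1 /export/media
# Or persistently, in /etc/fstab:
# /dev/sdc1  /export/media  xfs  defaults,allocsize=512m  0  0
```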
-Eric
* Re: xfs_fsr question for improvement
2010-04-25 21:04 ` Eric Sandeen
@ 2010-04-25 21:44 ` Emmanuel Florac
0 siblings, 0 replies; 14+ messages in thread
From: Emmanuel Florac @ 2010-04-25 21:44 UTC (permalink / raw)
To: Eric Sandeen; +Cc: Peter Grandi, Linux XFS
On Sun, 25 Apr 2010 16:04:34 -0500, you wrote:
> If it's all large writes, you could mount -o allocsize=512m or so:
>
> allocsize=size
> Sets the buffered I/O end-of-file preallocation size when
> doing delayed allocation writeout (default size is 64KiB).
> Valid values for this option are page size (typically 4KiB)
> through to 1GiB, inclusive, in power-of-2 increments.
>
> and that might help.
Oh yes, nice idea indeed! I'll try it.
--
------------------------------------------------------------------------
Emmanuel Florac | Direction technique
| Intellique
| <eflorac@intellique.com>
| +33 1 78 94 84 02
------------------------------------------------------------------------
* Re: xfs_fsr question for improvement
2010-04-25 11:17 ` Peter Grandi
2010-04-25 13:02 ` Emmanuel Florac
@ 2010-04-26 0:02 ` Linda Walsh
1 sibling, 0 replies; 14+ messages in thread
From: Linda Walsh @ 2010-04-26 0:02 UTC (permalink / raw)
To: Peter Grandi; +Cc: Linux XFS
Peter Grandi wrote:
>>> XFS resists fragmentation better than most other filesystems,
>>> so defragmentation, while possible, is generally not needed.
>
> That's a common myth. For most file systems and filesystems.
> Also most applications write files really badly.
> http://www.sabi.co.uk/blog/anno06-3rd.html#060914b
----
Do you have any evidence to back this up? The article you quote says
nothing about XFS allocation -- it's talking about Windows systems
using FAT or NTFS -- unless you've ported XFS to Windows? If not, I
don't see how your comments are relevant.
> Fortunately 'xfs_fsr' is mostly reliable, but in-place
> defragmentation is a risky propostion for several reasons even
> if 'xfs_fsr' is fairly reliable.
---
How is it risky? Do you have any evidence to back up this claim?
It copies from where it is to a pre-reserved, vacant space (which it
finds just before it does the copy). When the copy is done
successfully, it points the inode at the defragmented data and frees
the old copy -- or at least that's my 'not having looked at the code'
understanding of it.
This is basically a file copy. So you are saying that in-place file
defragmentation using a file copy is risky? Doesn't this imply that
copying files is risky? How is this meaningful?
> Note also that 'xfs_fsr' uses a terrible "defragmentation"
> strategy (from 'man xfs_fsr'):
> "The reorganization algorithm operates on one file at a time,"
----
xfs_fsr does a superb job of file defragmenting. It doesn't
do disk defragmenting. But it does defragment single files well, which
is all it was designed to do. We can lament that it hasn't been
improved on, but no one with money, or with 'free time' (ha) and the
knowledge, has seen it as a problem, so it hasn't been fixed.
> That also should not be the case unless your applications write
> strategy is wrong and you get extremely interleaved streams, in
> which case you get what you paid for the application programmer.
---
That's a rather naive view. It may not be one application but several
writing to the disk at once. Or it could be one, but recording multiple streams
to disk at the same time -- of course it would have to write them to disk as they
come in, as memory is limited -- how else would you prevent interleaving in such
a case? There are too many situations where fragmenting can occur to toss them
all off and say they are the result of not paying an application programmer to do it
"correctly".
I don't see why you posted -- it wasn't to help anyone nor to offer
constructive criticism. It was a bit harsh on the criticism side, as
though something about it was 'personal' for you.... I also sensed a
tinge of bitterness in that last bit of criticism about the video
stream fragmentation. I'm sorry for your loss, but please try to
understand that this is a forum/list for developers/users to help with
XFS problems. Please rethink your approach here. I apologize in
advance if I'm out of line, but something seemed off-key in this
post -- maybe I'm misreading things completely... it's happened
before. :-)
Linda Walsh
* Re: xfs_fsr question for improvement
2010-04-17 1:24 ` Dave Chinner
2010-04-17 7:13 ` Emmanuel Florac
@ 2010-05-03 6:49 ` Michael Monnerie
2010-05-03 7:41 ` Michael Monnerie
2010-05-03 12:17 ` Dave Chinner
1 sibling, 2 replies; 14+ messages in thread
From: Michael Monnerie @ 2010-05-03 6:49 UTC (permalink / raw)
To: xfs
On Saturday, 17 April 2010, Dave Chinner wrote:
> They have thousands of extents in them and they are all between
> 8-10GB in size, and IO from my VMs is still capable of saturating
> the disks backing these files. While I'd normally consider these
> files fragmented and candidates for running fsr on them, the number
> of extents is not actually a performance-limiting factor and so
> there's no point in defragmenting them. Especially as that requires
> shutting down the VMs...
I personally care less about file fragmentation than about
metadata/inode/directory fragmentation. This server gets accessed by
numerous people:
# time find /mountpoint/ -inum 107901420
/mountpoint/some/dir/ectory/path/x.iso
real 7m50.732s
user 0m0.152s
sys 0m2.376s
It took nearly 8 minutes to search through that mount point, which is
6TB in size on a RAID-5 striped over seven 2TB disks, so search speed
should be high. Especially as there are only 765,000 files on that disk:
Filesystem Inodes IUsed IFree IUse%
/mountpoint 1258291200 765659 1257525541 1%
Wouldn't you say an 8-minute search over just 765,000 files is slow,
even when only using 7x 2TB 7200rpm disks in RAID-5?
> > Would it be possible for xfs_fsr to defrag the metadata so that it
> > is all kept together and seeks are faster?
>
> It's not related to fsr because fsr does not defragment metadata.
> Some metadata cannot be defragmented (e.g. inodes cannot be moved),
> some metadata cannot be manipulated directly (e.g. free space
> btrees), and some is just difficult to do (e.g. directory
> defragmentation) so hasn't ever been done.
I see. On this particular server I know it would be good for performance
to have the metadata defragmented, but that's not the aim of xfs_fsr.
But maybe some developer will get bored one day and find a way to
optimize the search&find of files on an aged filesystem, i.e. metadata
defrag :-)
I tried this twice:
# time find /mountpoint/ -inum 107901420
real 8m17.316s
user 0m0.148s
sys 0m1.964s
# time find /mountpoint/ -inum 107901420
real 0m30.113s
user 0m0.540s
sys 0m9.813s
Caching helps the 2nd time :-)
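The cold-versus-warm difference can be reproduced deliberately by dropping the page, dentry and inode caches between runs (requires root; a sketch, using the same mount point and inode number as above):

```shell
# First run: populates the cache (slow if the cache is cold).
time find /mountpoint/ -inum 107901420 > /dev/null

# Flush dirty data, then drop page cache, dentries and inodes,
# so the next run has to hit the disks again.
sync
echo 3 > /proc/sys/vm/drop_caches

# This run is cold again and should take minutes, not seconds.
time find /mountpoint/ -inum 107901420 > /dev/null
```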
> > Currently, when I do "find /this_big_fs -inum 1234", it takes
> > *ages* for a run, while there are not so many files on it:
> > # iostat -kx 5 555
> > Device:   r/s    rkB/s  avgrq-sz  avgqu-sz  await  svctm  %util
> > xvdb      23,20  92,80  8,00      0,42      15,28  18,17  42,16
> > xvdc      20,20  84,00  8,32      0,57      28,40  28,36  57,28
>
> Well, it's not XFS's fault that each read IO is taking 20-30ms. You
> can only do 30-50 IOs a second per drive at that rate, so:
>
> [...]
>
> > So I get 43 reads/second at 100% utilization. Well I can see up to
>
> This is right on the money - it's going as fast as your (slow) RAID-5
> volume will allow it to....
>
> > 150r/s, but still that's no "wow". A single run to find an inode
> > takes a very long time.
>
> Raid 5/6 generally provides the same IOPS performance as a single
> spindle, regardless of the width of the RAID stripe. A 2TB sata
> drive might be able to do 150-200 IOPS, so a RAID5 array made up of
> these drives will tend to max out at roughly the same....
Running xfs_fsr, I can see up to 1200r+1200w=2400I/Os per second:
Device:  rrqm/s  wrqm/s  r/s      w/s      rkB/s     wkB/s     avgrq-sz  avgqu-sz  await  svctm  %util
xvdc     0,00    0,00    0,00     1191,42  0,00      52320,16  87,83     121,23    96,77  0,71   84,63
xvde     0,00    0,00    1226,35  0,00     52324,15  0,00      85,33     0,77      0,62   0,13   15,33
But on average it's about 600-700 reads plus writes per second, so
1200-1400 IOPS.
Both "disks" are 2TB LVM volumes on the same RAID set; I just had to
split it because Xen doesn't allow creating volumes larger than 2TB.
So the badly slow I/O I see during "find" is not happening during fsr.
How can that be?
I'm just running another "find" on a freshly remounted XFS, and I can
see the reads are happening in parallel on 2 of the 3 2TB volumes:
Device:  r/s     w/s   rkB/s   wkB/s  avgrq-sz  avgqu-sz  await  svctm  %util
xvdb     103,20  0,00  476,80  0,00   9,24      0,46      4,52   4,50   46,40
xvdc     97,80   0,00  455,20  0,00   9,31      0,52      5,29   5,30   51,84
When I created that XFS, I took two 2TB partitions, did pvcreate,
vgcreate and lvcreate. Could it be that lvcreate automatically thought
it should do a RAID-0? Because all reads are equally split between the
two volumes. After a while, I added the 3rd 2TB volume, and I can't see
that behaviour there. So maybe this is the source of all evil.
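Whether lvcreate really striped the volume can be checked after the fact; a sketch with a placeholder VG name (vg0) and LV name (lv0):

```shell
# "#Str" greater than 1 means the LV is striped (RAID-0-like)
# across that many physical volumes.
lvs -o lv_name,stripes,stripe_size,devices vg0

# To force a linear (unstriped) layout at creation time,
# request exactly one stripe:
# lvcreate -i1 -L 2T -n lv0 vg0
```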
BTW: I changed mount options "atime,diratime" to "relatime,reldiratime"
now and "find" runtime went from 8 minutes down to 7m14s.
--
with kind regards,
Michael Monnerie, Ing. BSc
it-management Internet Services
http://proteger.at [pronounced: Prot-e-schee]
Tel: 0660 / 415 65 31
// We currently have two houses for sale:
// http://zmi.at/langegg/
// http://zmi.at/haus2009/
* Re: xfs_fsr question for improvement
2010-05-03 6:49 ` Michael Monnerie
@ 2010-05-03 7:41 ` Michael Monnerie
2010-05-03 12:17 ` Dave Chinner
1 sibling, 0 replies; 14+ messages in thread
From: Michael Monnerie @ 2010-05-03 7:41 UTC (permalink / raw)
To: xfs
On Monday, 3 May 2010 Michael Monnerie wrote:
> When I created that XFS, I took two 2TB partitions, did pvcreate,
> vgcreate and lvcreate. Could it be that lvcreate automatically
> thought it should do a RAID-0? Because all reads are equally split
> between the two volumes. After a while, I added the 3rd 2TB volume,
> and I can't see that behaviour there. So maybe this is the source of
> all evil.
I found that lvcreate really is too smart:
-i, --stripes Stripes
Gives the number of stripes. This is equal to the number
of physical volumes to scatter the logical volume.
So it seems lvcreate did know that the VG was split among 2 "disks", and
therefore used -i2 while I wanted -i1.
> reldiratime
Should be nodiratime, of course.
--
with kind regards,
Michael Monnerie, Ing. BSc
it-management Internet Services
http://proteger.at [pronounced: Prot-e-schee]
Tel: 0660 / 415 65 31
// We currently have two houses for sale:
// http://zmi.at/langegg/
// http://zmi.at/haus2009/
* Re: xfs_fsr question for improvement
2010-05-03 6:49 ` Michael Monnerie
2010-05-03 7:41 ` Michael Monnerie
@ 2010-05-03 12:17 ` Dave Chinner
2010-05-10 22:39 ` Michael Monnerie
1 sibling, 1 reply; 14+ messages in thread
From: Dave Chinner @ 2010-05-03 12:17 UTC (permalink / raw)
To: Michael Monnerie; +Cc: xfs
On Mon, May 03, 2010 at 08:49:43AM +0200, Michael Monnerie wrote:
> On Saturday, 17 April 2010 Dave Chinner wrote:
> > They have thousands of extents in them and they are all between
> > 8-10GB in size, and IO from my VMs is still capable of saturating
> > the disks backing these files. While I'd normally consider these
> > files fragmented and candidates for running fsr on them, the number
> > of extents is not actually a performance limiting factor and so
> > there's no point in defragmenting them. Especially as that requires
> > shutting down the VMs...
>
> I personally care less about file fragmentation than about
> metadata/inode/directory fragmentation. This server gets accessed by
> numerous people:
>
> # time find /mountpoint/ -inum 107901420
> /mountpoint/some/dir/ectory/path/x.iso
>
> real 7m50.732s
> user 0m0.152s
> sys 0m2.376s
>
> It took nearly 8 minutes to search through that mount point, which is
> 6TB in size on a RAID-5 striped over seven 2TB disks, so search speed
> should be high.
Not necessarily, as your raid array has shown.
>
> Especially as there are only 765,000 files on that disk:
> Filesystem Inodes IUsed IFree IUse%
> /mountpoint 1258291200 765659 1257525541 1%
>
> Wouldn't you say an 8-minute search over just 765,000 files is slow,
> even when only using 7x 2TB 7200rpm disks in RAID-5?
Depends on the directory structure and the number of IOs needed to
traverse it. If it's only a handful of files per directory, then you
get no internal directory readahead to hide read latency. That
results in a small random synchronous read workload that might
require a couple of hundred thousand IOs to complete.
From your earlier stats showing a read rate of 50 IO/s from the RAID
array, the directory traversal requires about 25,000 read IOs to
complete. That takes about 10s on my laptop's cheap SSD, which does
random reads about 50x faster than your RAID array....
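Dave's arithmetic can be checked directly: at roughly 50 random read IOs per second and about 25,000 synchronous metadata reads for the traversal (both figures from the messages above), the expected runtime is:

```shell
# Traversal time = synchronous read IOs / achievable random read IOPS.
ios=25000   # approx. metadata reads needed for the directory walk
iops=50     # approx. random read rate of the RAID-5 array
echo "$((ios / iops)) seconds"   # prints "500 seconds", i.e. roughly the 8 minutes observed
```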
> > > Would it be possible for xfs_fsr to defrag the metadata so that
> > > it is all kept together and seeks are faster?
> >
> > It's not related to fsr because fsr does not defragment metadata.
> > Some metadata cannot be defragmented (e.g. inodes cannot be moved),
> > some metadata cannot be manipulated directly (e.g. free space
> > btrees), and some is just difficult to do (e.g. directory
> > defragmentation) so hasn't ever been done.
>
> I see. On this particular server I know it would be good for performance
> to have the metadata defragmented, but that's not the aim of xfs_fsr.
> But maybe some developer will get bored one day and find a way to
> optimize the search&find of files on an aged filesystem, i.e. metadata
> defrag :-)
Many have. Find and tar have resisted attempts to optimise them over
the years, so stuff like this:
http://oss.oracle.com/~mason/acp/
grows on the interwebs all over the place... ;)
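The core trick behind tools like acp (visit files in inode order so the disk walk is mostly ascending instead of random) can be approximated with standard GNU tools. This is only a sketch of the idea, not what acp actually does internally, and it breaks on paths containing newlines:

```shell
# List files with their inode numbers, sort numerically by inode,
# strip the inode column, then read the files in that order.
# On an aged filesystem this tends to shorten seek distances
# compared to plain readdir order.
find /mountpoint -type f -printf '%i %p\n' \
    | sort -n \
    | cut -d' ' -f2- \
    | xargs -d '\n' cat > /dev/null
```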
> I tried this two times:
> # time find /mountpoint/ -inum 107901420
> real 8m17.316s
> user 0m0.148s
> sys 0m1.964s
>
> # time find /mountpoint/ -inum 107901420
> real 0m30.113s
> user 0m0.540s
> sys 0m9.813s
>
> Caching helps the 2nd time :-)
That still seems rather slow traversing 750,000 cached directory
entries. My laptop (1.3GHz CULV core2 CPU) does 465,000 directory
entries in:
$ time sudo find / -mount -inum 123809285
real 0m2.196s
user 0m0.384s
sys 0m1.464s
> > Raid 5/6 generally provides the same IOPS performance as a single
> > spindle, regardless of the width of the RAID stripe. A 2TB sata
> > drive might be able to do 150-200 IOPS, so a RAID5 array made up of
> > these drives will tend to max out at roughly the same....
>
> Running xfs_fsr, I can see up to 1200r+1200w=2400I/Os per second:
>
> Device:  rrqm/s  wrqm/s  r/s      w/s      rkB/s     wkB/s     avgrq-sz  avgqu-sz  await  svctm  %util
> xvdc     0,00    0,00    0,00     1191,42  0,00      52320,16  87,83     121,23    96,77  0,71   84,63
> xvde     0,00    0,00    1226,35  0,00     52324,15  0,00      85,33     0,77      0,62   0,13   15,33
>
> But on average it's about 600-700 reads plus writes per second, so
> 1200-1400 IOPS.
> Both "disks" are 2TB LVM volumes on the same RAID set; I just had to
> split it because Xen doesn't allow creating volumes larger than 2TB.
>
> So the badly slow I/O I see during "find" is not happening during fsr.
> How can that be?
Because most of the IO xfs_fsr does is large sequential IO which the
RAID caches are optimised for. Directory traversals, OTOH, are small,
semi-random IO which are latency sensitive....
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: xfs_fsr question for improvement
2010-05-03 12:17 ` Dave Chinner
@ 2010-05-10 22:39 ` Michael Monnerie
0 siblings, 0 replies; 14+ messages in thread
From: Michael Monnerie @ 2010-05-10 22:39 UTC (permalink / raw)
To: xfs
On Monday, 3 May 2010 Dave Chinner wrote:
> Many have. Find and tar have resisted attempts to optimise them over
> the years, so stuff like this:
>
> http://oss.oracle.com/~mason/acp/
> grows on the interwebs all over the place... ;)
Uh, that makes a nice 3818 IOPS with 161MB/s:
xvdb  3818,16  0,80  161449,90  35,13  84,57  10,75  2,30  0,26  99,88
And I even saw >4000 IOPS and 180MB/s. Nice.
The tool gave me an idea:
lvchange -r 1024 /dev/all_my_lvm_stores
And this boosts copy performance a lot: with the default "-r 128" I had
around 10-30MB/s, now 30-100MB/s. Of course this depends on the type of
access and so on, but at least while moving all the data back from the
backup LVM volume to the re-created original one it's a drastic speedup.
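For reference, the read-ahead value can be inspected and changed per logical volume; a sketch with a placeholder LV path (the value is in 512-byte sectors, so 1024 means 512 KiB):

```shell
# Current read-ahead, in sectors, as the block layer sees it:
blockdev --getra /dev/vg0/lv0

# Raise the LV's read-ahead to 1024 sectors (512 KiB), as done above:
lvchange -r 1024 /dev/vg0/lv0

# Verify the new value took effect:
blockdev --getra /dev/vg0/lv0
```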
> > # time find /mountpoint/ -inum 107901420
> > real 0m30.113s
> > user 0m0.540s
> > sys 0m9.813s
> >
> > Caching helps the 2nd time :-)
>
> That still seems rather slow traversing 750,000 cached directory
> entries. My laptop (1.3GHz CULV core2 CPU) does 465,000 directory
> entries in:
>
> $ time sudo find / -mount -inum 123809285
>
> real 0m2.196s
> user 0m0.384s
> sys 0m1.464s
So why was it so slow here?
As soon as all the data has been moved back, I can retest whether search
speed has increased.
--
with kind regards,
Michael Monnerie, Ing. BSc
it-management Internet Services
http://proteger.at [pronounced: Prot-e-schee]
Tel: 0660 / 415 65 31
// We currently have two houses for sale:
// http://zmi.at/langegg/
// http://zmi.at/haus2009/
end of thread, other threads:[~2010-05-10 22:37 UTC | newest]
Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-04-26 20:58 xfs_fsr question for improvement Richard Scobie
-- strict thread matches above, loose matches on Subject: below --
2010-04-16 8:43 Michael Monnerie
2010-04-16 10:43 ` Stan Hoeppner
2010-04-17 1:24 ` Dave Chinner
2010-04-17 7:13 ` Emmanuel Florac
2010-04-25 11:17 ` Peter Grandi
2010-04-25 13:02 ` Emmanuel Florac
2010-04-25 21:04 ` Eric Sandeen
2010-04-25 21:44 ` Emmanuel Florac
2010-04-26 0:02 ` Linda Walsh
2010-05-03 6:49 ` Michael Monnerie
2010-05-03 7:41 ` Michael Monnerie
2010-05-03 12:17 ` Dave Chinner
2010-05-10 22:39 ` Michael Monnerie