public inbox for linux-xfs@vger.kernel.org
* Re: xfs_fsr question for improvement
@ 2010-04-26 20:58 Richard Scobie
  0 siblings, 0 replies; 14+ messages in thread
From: Richard Scobie @ 2010-04-26 20:58 UTC (permalink / raw)
  To: xfs

Linda Walsh wrote:

 > That's a rather naive view.  It may not be one application but several
 > writing to the disk at once.  Or it could be one, but recording multiple
 > streams to disk at the same time -- of course it would have to write
 > them to disk as they come in, as memory is limited -- how else would you
 > prevent interleaving in such a case?  There are too many situations
 > where fragmenting can occur to toss them all off and say they are the
 > result of not paying an application programmer to do it "correctly".

Agreed. XFS does offer some assistance for improving some of these
situations - the "filestreams" mount option. It is not well documented
in the man pages, but the following is extracted from tutorial notes:

"Filestreams Allocator

A certain class of applications such as those doing film scanner
video ingest will write many large files to a directory in sequence.
It's important for playback performance that these files end up
allocated next to each other on disk, since consecutive data is
retrieved optimally by hardware RAID read-ahead.

XFS's standard allocator starts out doing the right thing as far
as file allocation is concerned.  Even if multiple streams are being
written simultaneously, their files will be placed separately and
contiguously on disk.  The problem is that once an allocation group
fills up, a new one must be chosen and there's no longer a parent
directory in a unique AG to use as an AG "owner".  Without a way
to reserve the new AG for the original directory's use, all the
files being allocated by all the streams will start getting placed
in the same AGs as each other.  The result is that consecutive
frames in one directory are placed on disk with frames from other
directories interleaved between them, which is a worst-case layout
for playback performance.  When reading back the frames in directory
A, hardware RAID read-ahead will cache data from frames in directory
B which is counterproductive.

Create a file system with a small AG size to demonstrate:

sles10:~ sjv: sudo mkfs.xfs -d agsize=64m /dev/sdb7 > /dev/null
sles10:~ sjv: sudo mount /dev/sdb7 /test
sles10:~ sjv: sudo chmod 777 /test
sles10:~ sjv: cd /test
sles10:/test sjv:

Create ten 10MB files concurrently in two directories:

sles10:/test sjv: mkdir a b
sles10:/test sjv: for dir in a b; do
> for file in `seq 0 9`; do
> xfs_mkfile 10m $dir/$file
> done &
> done; wait 2>/dev/null
[1] 30904
[2] 30905
sles10:/test sjv: ls -lid * */*
    131 drwxr-xr-x 2 sjv users       86 2006-10-20 13:48 a
    132 -rw------- 1 sjv users 10485760 2006-10-20 13:48 a/0
    133 -rw------- 1 sjv users 10485760 2006-10-20 13:48 a/1
    134 -rw------- 1 sjv users 10485760 2006-10-20 13:48 a/2
    135 -rw------- 1 sjv users 10485760 2006-10-20 13:48 a/3
    136 -rw------- 1 sjv users 10485760 2006-10-20 13:48 a/4
    137 -rw------- 1 sjv users 10485760 2006-10-20 13:48 a/5
    138 -rw------- 1 sjv users 10485760 2006-10-20 13:48 a/6
    139 -rw------- 1 sjv users 10485760 2006-10-20 13:48 a/7
    140 -rw------- 1 sjv users 10485760 2006-10-20 13:48 a/8
    141 -rw------- 1 sjv users 10485760 2006-10-20 13:48 a/9
262272 drwxr-xr-x 2 sjv users       86 2006-10-20 13:48 b
262273 -rw------- 1 sjv users 10485760 2006-10-20 13:48 b/0
262274 -rw------- 1 sjv users 10485760 2006-10-20 13:48 b/1
262275 -rw------- 1 sjv users 10485760 2006-10-20 13:48 b/2
262276 -rw------- 1 sjv users 10485760 2006-10-20 13:48 b/3
262277 -rw------- 1 sjv users 10485760 2006-10-20 13:48 b/4
262278 -rw------- 1 sjv users 10485760 2006-10-20 13:48 b/5
262279 -rw------- 1 sjv users 10485760 2006-10-20 13:48 b/6
262280 -rw------- 1 sjv users 10485760 2006-10-20 13:48 b/7
262281 -rw------- 1 sjv users 10485760 2006-10-20 13:48 b/8
262282 -rw------- 1 sjv users 10485760 2006-10-20 13:48 b/9
sles10:/test sjv:

Note that each directory's inodes are all in the same AG as the
directory itself.  What about the file data?  Use xfs_bmap -v to
examine the extents:

sles10:/test sjv: for file in `seq 0 9`; do
> bmap_a=`xfs_bmap -v a/$file | tail -1`
> bmap_b=`xfs_bmap -v b/$file | tail -1`
> ag_a=`echo $bmap_a | awk '{print $4}'`
> ag_b=`echo $bmap_b | awk '{print $4}'`
> br_a=`echo $bmap_a | awk '{printf "%-18s", $3}'`
> br_b=`echo $bmap_b | awk '{printf "%-18s", $3}'`
> echo a/$file: $ag_a "$br_a" b/$file: $ag_b "$br_b"
> done
a/0: 0 96..20575          b/0: 1 131168..151647
a/1: 0 20576..41055       b/1: 1 151648..172127
a/2: 0 41056..61535       b/2: 1 172128..192607
a/3: 0 61536..82015       b/3: 1 192608..213087
a/4: 0 82016..102495      b/4: 1 213088..233567
a/5: 0 102496..122975     b/5: 1 233568..254047
a/6: 2 299600..300111     b/6: 2 262208..275007
a/7: 2 338016..338527     b/7: 2 312400..312911
a/8: 2 344672..361567     b/8: 3 393280..401983
a/9: 2 361568..382047     b/9: 3 401984..421951
sles10:/test sjv:

The middle column is the AG number and the right column is the block
range.  Note how the extents for files in both directories get
placed on top of each other in AG 2.
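Since the filesystem was made with agsize=64m and xfs_bmap -v reports
offsets in 512-byte basic blocks, the AG holding any block offset can be
checked with simple arithmetic (a quick sketch using the start blocks
from the output above):

```shell
# An agsize of 64m spans 64 * 1024 * 1024 / 512 = 131072 basic blocks,
# so AG number = start block / blocks-per-AG.
blocks_per_ag=$((64 * 1024 * 1024 / 512))
for start in 96 131168 299600 393280; do
    echo "block $start -> AG $((start / blocks_per_ag))"
done
```

This confirms the columns above: block 96 is in AG 0, 131168 in AG 1,
299600 in AG 2 and 393280 in AG 3.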

Something to note in the results is that even though the file extents
have worked their way up into AGs 2 and 3, the inode numbers show
that the file inodes are all in the same AGs as their parent
directory, i.e. AGs 0 and 1.  Why is this?  To understand, it's
important to consider the order in which events are occurring.  The
two bash processes writing files are calling xfs_mkfile, which
starts by opening a file with the O_CREAT flag.  At this point, XFS
has no idea how large the file's data is going to be, so it dutifully
creates a new inode for the file in the same AG as the parent
directory.  The call returns successfully and the system continues
with its tasks.  When XFS is asked to write the file data a short time
later, a new AG must be found for it because the AG the inode is
in is full.  The result is a violation of the original goal to keep
file data close to its inode on disk.  In practice, because inodes
are allocated in clusters on disk, a process that's reading back a
stream is likely to cache all the inodes it needs with just one or
two reads, so the disk seeking involved won't be as bad as it first
seems.

On the other hand, the extent data placement seen in the xfs_bmap
-v output is a problem.  Once the data extents spilled into AG 2,
both processes were given allocations there on a first-come-first-served
basis.  This destroyed the neatly contiguous allocation pattern for
the files and will certainly degrade read performance later on.

To address this issue, a new allocation algorithm was added to XFS
that associates a parent directory with an AG until a preset
inactivity timeout elapses.  The new algorithm is called the
Filestreams allocator and it is enabled in one of two ways.  Either
the filesystem is mounted with the -o filestreams option, or the
filestreams chattr flag is applied to a directory to indicate that
all allocations beneath that point in the directory hierarchy should
use the filestreams allocator.
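For reference, the two ways to enable it look roughly like this (a
sketch only; the device and mount point are placeholders, and the
xfs_io flag letter for filestreams should be checked against your
xfs_io(8) man page):

```shell
# 1. Enable the filestreams allocator for the whole filesystem:
mount -o filestreams /dev/sdb7 /test

# 2. Or set the inode flag on a directory, so only allocations
#    beneath that point use the filestreams allocator:
mkdir /test/ingest
xfs_io -c 'chattr +S' /test/ingest
```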

With the filestreams allocator enabled, the above test produces
results that look like this:

a/0: 0 96..20575          b/0: 1 131168..151647
a/1: 0 20576..41055       b/1: 1 151648..172127
a/2: 0 41056..61535       b/2: 1 172128..192607
a/3: 0 61536..82015       b/3: 1 192608..213087
a/4: 0 82016..102495      b/4: 1 213088..233567
a/5: 0 102496..122975     b/5: 1 233568..254047
a/6: 2 272456..273479     b/6: 3 393280..410271
a/7: 2 290904..300119     b/7: 3 410272..426655
a/8: 2 300632..321111     b/8: 3 426656..441503
a/9: 2 329304..343639     b/9: 3 441504..459935

Once the process writing files to the first directory starts using
AG 2, that AG is no longer considered available so the other process
skips it and moves to AG 3."

 > I don't see why you posted -- it wasn't to help anyone nor to offer
 > constructive criticism.  It was a bit harsh on the criticism side, as
 > though something about it was 'personal' for you....  Also sensed a
 > tinge of bitterness in that last bit of

Not unusual for this poster.

Regards,

Richard

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

* xfs_fsr question for improvement
@ 2010-04-16  8:43 Michael Monnerie
  2010-04-16 10:43 ` Stan Hoeppner
  2010-04-17  1:24 ` Dave Chinner
  0 siblings, 2 replies; 14+ messages in thread
From: Michael Monnerie @ 2010-04-16  8:43 UTC (permalink / raw)
  To: xfs



From the man page I read that a file is defragmented by copying it to a
free space big enough to hold it in one extent.

Now I have a 4TB filesystem where all files written are at least 1GB,
average 5GB, up to 30GB each. I just xfs_growfs'd that filesystem to
6TB, as it was 97% full (150GB free). Every night an xfs_fsr runs and
manages to defragment everything, except during the last few days, when
it couldn't find enough contiguous free space to defragment.

Could it be that the defragmentation did its job but in the end the
file layout was like this:
file 1GB
freespace 900M
file 1GB
freespace 900M
file 1GB
freespace 900M
That, while being an "almost worst case" scenario, would mean that once 
the filesystem is about 50% full, new 1GB files will be fragmented all 
the time.
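The arithmetic behind that worst case can be sketched directly: plenty
of free space in total, yet no single hole large enough for a new file:

```shell
hole_mb=900          # each free-space gap in the layout above
file_mb=1024         # a new 1GB file
total_free_mb=$((3 * hole_mb))
echo "total free: ${total_free_mb}M, largest contiguous hole: ${hole_mb}M"
# 2700M free overall, but every 1GB allocation must split across holes:
if [ "$file_mb" -gt "$hole_mb" ]; then
    echo "new ${file_mb}M files will always fragment"
fi
```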
To prevent this, xfs_fsr should do a "compress" phase after
defragmentation finishes, in order to move all the files up against
each other:
file 1GB
file 1GB
file 1GB
file 1GB
freespace 3600M
That would also help fill the filesystem from front to end, reducing 
disk head moves.


Another thing, related to xfs_fsr: I once ran xfs_repair on that
filesystem, and I could see a lot of small I/Os being done with almost
no throughput. The disks are 7,200rpm 2TB disks, so random disk access
is horribly slow, and it looked like the disks were doing nothing but
seeking.
Would it be possible for xfs_fsr to defragment the metadata so that it
is all kept close together and seeks are faster?
Currently, when I do "find /this_big_fs -inum 1234", it takes *ages* for 
a run, while there are not so many files on it:
# iostat -kx 5 555
Device:         r/s     rkB/s    avgrq-sz avgqu-sz   await  svctm  %util
xvdb              23,20    92,80     8,00     0,42   15,28  18,17  42,16
xvdc              20,20    84,00     8,32     0,57   28,40  28,36  57,28
(I edited the output to remove "writes" columns, as they are 0)
This is a RAID-5 over 7 disks, and 2 TB volumes are used with LVM 
concatenated. As I only added the 3rd 2TB volume today, no seeks on that 
new place.
So I get 43 reads/second at 100% utilization. Well I can see up to 
150r/s, but still that's no "wow". A single run to find an inode takes a 
very long time.
# df -i
Filesystem        Inodes   IUsed       IFree  IUse%
mybigstore    1258291200  765684  1257525516     1%

So only 765.684 files, and it takes about 8 minutes for a "find" pass.
Maybe an xfs_fsr over metadata could help here?
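As a rough back-of-envelope (taking the ~8 minute pass and ~100 reads/s
from the iostat figures as given), the scan looks seek-bound rather
than bandwidth-bound:

```shell
inodes=765684
seconds=$((8 * 60))        # ~8 minute find pass
reads_per_sec=100          # rough midpoint of the iostat numbers
total_reads=$((seconds * reads_per_sec))
echo "~$total_reads reads, i.e. ~$((inodes / total_reads)) inodes per read"
```

About 48000 reads for 765684 inodes is roughly 15 inodes per read -
consistent with one inode cluster per seek, so the time goes into head
movement, not data transfer.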


-- 
mit freundlichen Grüssen,
Michael Monnerie, Ing. BSc

it-management Internet Services
http://proteger.at [gesprochen: Prot-e-schee]
Tel: 0660 / 415 65 31

// Wir haben im Moment zwei Häuser zu verkaufen:
// http://zmi.at/langegg/
// http://zmi.at/haus2009/




end of thread, other threads:[~2010-05-10 22:37 UTC | newest]

Thread overview: 14+ messages:
2010-04-26 20:58 xfs_fsr question for improvement Richard Scobie
  -- strict thread matches above, loose matches on Subject: below --
2010-04-16  8:43 Michael Monnerie
2010-04-16 10:43 ` Stan Hoeppner
2010-04-17  1:24 ` Dave Chinner
2010-04-17  7:13   ` Emmanuel Florac
2010-04-25 11:17     ` Peter Grandi
2010-04-25 13:02       ` Emmanuel Florac
2010-04-25 21:04         ` Eric Sandeen
2010-04-25 21:44           ` Emmanuel Florac
2010-04-26  0:02       ` Linda Walsh
2010-05-03  6:49   ` Michael Monnerie
2010-05-03  7:41     ` Michael Monnerie
2010-05-03 12:17     ` Dave Chinner
2010-05-10 22:39       ` Michael Monnerie
