* Issues and new to the group
@ 2013-09-26 11:47 Ronnie Tartar
2013-09-26 12:06 ` Stan Hoeppner
0 siblings, 1 reply; 12+ messages in thread
From: Ronnie Tartar @ 2013-09-26 11:47 UTC (permalink / raw)
To: xfs
Hi,
I have a 600GB XFS file system mounted that suddenly started running slow on
writes. It takes about 2.5 to 3.5 seconds to write a single file. Some
folders (with fewer files) work well. But it will copy fast, then
slow for long periods of time. This is a virtualized CentOS 5.9 64-bit box
on Citrix XenServer 5.6 SP2. It doesn't seem to be an I/O load issue, as most of
the load is system%. My fragmentation is less than 1%. Any help would
be greatly appreciated. I was looking to see if there is a better way to
mount this partition or allocate more memory, whatever it takes. The
folders are image folders that have anywhere between 5 and 10 million images
in each folder.
Thanks
Fstab mount is:
/dev/xvdb1 /images xfs
defaults,nodiratime,nosuid,nodev,allocsize=64m 1 1
Slabtop is:
Active / Total Objects (% used) : 2705947 / 2872142 (94.2%)
Active / Total Slabs (% used) : 290008 / 290008 (100.0%)
Active / Total Caches (% used) : 111 / 165 (67.3%)
Active / Total Size (% used) : 1048796.52K / 1083850.55K (96.8%)
Minimum / Average / Maximum Object : 0.02K / 0.38K / 128.00K
OBJS ACTIVE USE OBJ SIZE SLABS OBJ/SLAB CACHE SIZE NAME
843080 836242 99% 0.44K 105385 8 421540K xfs_inode
839608 836242 99% 0.56K 119944 7 479776K xfs_vnode
461610 432085 93% 0.21K 25645 18 102580K dentry_cache
306200 306200 100% 0.09K 7655 40 30620K buffer_head
222570 121732 54% 0.12K 7419 30 29676K size-128
117439 97108 82% 0.52K 16777 7 67108K radix_tree_node
30820 30814 99% 0.19K 1541 20 6164K xfs_ili
13270 13266 99% 0.74K 2654 5 10616K ext3_inode_cache
9390 9390 100% 0.25K 626 15 2504K size-256
7682 7562 98% 0.16K 334 23 1336K vm_area_struct
3068 1711 55% 0.06K 52 59 208K size-64
2816 2786 98% 0.09K 64 44 256K sysfs_dir_cache
2055 1275 62% 0.25K 137 15 548K filp
1440 1289 89% 0.02K 10 144 40K anon_vma
1120 999 89% 0.03K 10 112 40K size-32
768 534 69% 0.08K 16 48 64K selinux_inode_security
756 698 92% 0.55K 108 7 432K inode_cache
576 554 96% 0.58K 96 6 384K proc_inode_cache
476 455 95% 1.00K 119 4 476K size-1024
404 403 99% 2.00K 202 2 808K size-2048
404 404 100% 4.00K 404 1 1616K size-4096
360 350 97% 0.12K 12 30 48K bio
320 284 88% 0.50K 40 8 160K size-512
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
* Re: Issues and new to the group
2013-09-26 11:47 Issues and new to the group Ronnie Tartar
@ 2013-09-26 12:06 ` Stan Hoeppner
2013-09-26 13:12 ` Ronnie Tartar
0 siblings, 1 reply; 12+ messages in thread
From: Stan Hoeppner @ 2013-09-26 12:06 UTC (permalink / raw)
To: Ronnie Tartar; +Cc: xfs
On 9/26/2013 6:47 AM, Ronnie Tartar wrote:
> I have a 600GB xfs file system mounted that suddenly started running slow on
> writes. It takes about 2.5 to 3.5 seconds to write a single file. Some
This typically occurs when the filesystem gets near full and free space
is heavily fragmented. Writing to these free space fragments requires
lots of seeking. Seeking causes latency. I assume your storage device
is spinning rust, yes?
> folders (with less number of files) work well. But it will copy fast, then
> slow for long periods of time.
Some allocation groups may have less fragmented free space than others.
Put another way, they may have more contiguous free space. Thus less
seeking.
> This is a virtualized CentOS 5.9 64 bit box
> on Citrix Xenserver 5.6SP2. Doesn't seem to be a load i/o issue as most of
> the load is system%. My fragmentation is less than 1 %. Any help would
> be greatly appreciated. I was looking to see if there was a better way to
> mount this partition or allocate more memory, whatever it takes. The
> folders are image folders that have anywhere between 5 to 10 million images
> in each folder.
> Fstab mount is:
> /dev/xvdb1 /images xfs
> defaults,nodiratime,nosuid,nodev,allocsize=64m 1 1
^^^^^^^^^^^^^
This tells XFS to allocate 64MB of free space at the end of each file
being allocated. If free space is heavily fragmented and the fragments
are all small, this will exacerbate the seek problem. Given the 64MB
allocsize, I assume these image files are quite large. If this is
correct, writing them over scattered small free space fragments also
requires seeking. Thus, I'd guess you're seeking your disk, or array,
to death.
How full is the XFS volume, and what does your free space fragmentation
map look like?
--
Stan
* RE: Issues and new to the group
2013-09-26 12:06 ` Stan Hoeppner
@ 2013-09-26 13:12 ` Ronnie Tartar
2013-09-26 13:30 ` Ronnie Tartar
2013-09-26 14:59 ` Joe Landman
0 siblings, 2 replies; 12+ messages in thread
From: Ronnie Tartar @ 2013-09-26 13:12 UTC (permalink / raw)
To: stan; +Cc: xfs
Stan,
Thanks for the reply.
My fragmentation is:
[root@AP-FS1 ~]# xfs_db -c frag -r /dev/xvdb1
actual 10470159, ideal 10409782, fragmentation factor 0.58%
xfs_db> freesp
from to extents blocks pct
1 1 52343 52343 0.08
2 3 34774 86290 0.13
4 7 122028 732886 1.08
8 15 182345 1898531 2.80
16 31 147747 3300501 4.87
32 63 111134 4981898 7.35
64 127 93359 8475962 12.50
128 255 51914 9069884 13.38
256 511 25548 9200077 13.57
512 1023 23027 17482586 25.79
1024 2047 8662 10600931 15.64
2048 4095 808 1915158 2.82
The volume is 57% full.
I have removed allocsize=64m from the fstab and rebooted. These are not
large files, so that could definitely have been causing issues.
Would copying them to a new folder and renaming the folder back help?
This is running virtualized, but it's definitely not a rust bucket: X5570
CPUs with an MD3200 array under light I/O.
It seems like I/O wait is not the problem; system% is. Is this the OS
trying to find a spot for these files?
Thanks
* RE: Issues and new to the group
2013-09-26 13:12 ` Ronnie Tartar
@ 2013-09-26 13:30 ` Ronnie Tartar
2013-09-26 14:23 ` Eric Sandeen
2013-09-26 14:59 ` Joe Landman
1 sibling, 1 reply; 12+ messages in thread
From: Ronnie Tartar @ 2013-09-26 13:30 UTC (permalink / raw)
To: stan; +Cc: xfs
Stan, it looks like I have a directory fragmentation problem.
xfs_db> frag -d
actual 65057, ideal 4680, fragmentation factor 92.81%
What is the best way to fix this?
Thanks
* Re: Issues and new to the group
2013-09-26 13:30 ` Ronnie Tartar
@ 2013-09-26 14:23 ` Eric Sandeen
2013-09-26 23:46 ` Stan Hoeppner
0 siblings, 1 reply; 12+ messages in thread
From: Eric Sandeen @ 2013-09-26 14:23 UTC (permalink / raw)
To: Ronnie Tartar; +Cc: stan, xfs
On 9/26/13 8:30 AM, Ronnie Tartar wrote:
> Stan, looks like I have directory fragmentation problem.
>
> xfs_db> frag -d
> actual 65057, ideal 4680, fragmentation factor 92.81%
>
> What is the best way to fix this?
http://xfs.org/index.php/XFS_FAQ#Q:_The_xfs_db_.22frag.22_command_says_I.27m_over_50.25._Is_that_bad.3F
We should just get rid of that command, TBH.
So your dirs are in an average of 65057/4680 or about 14 fragments each.
Really not that bad, in the scope of things.
I'd imagine that this could be more of your problem:
> The
> folders are image folders that have anywhere between 5 to 10 million images
> in each folder.
at 10 million entries in a dir, you're going to start slowing down on inserts
due to btree management. But that probably doesn't account for multiple seconds for
a single file.
So really, it's not clear *what* is slow.
> It takes about 2.5 to 3.5 seconds to write a single file.
strace with timing would be a very basic way to get a sense of what is slow;
is it the file open/create? How big is the file, and are you doing buffered or
direct IO?
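A concrete way to run Eric's suggestion might look like this; the source file and target folder below are placeholders, not paths from the thread:

```shell
# Trace a single-file copy into one of the slow image folders,
# recording per-syscall latency (-T) and timestamps (-ttt).
# Both paths are placeholders.
strace -f -T -ttt -o /tmp/cp.trace cp /tmp/test.jpg /images/folder1/
# Each trace line ends with the call's elapsed time in <seconds>,
# so sorting on the field after '<' surfaces the slowest syscalls.
sort -t'<' -k2 -rn /tmp/cp.trace | head
```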
On a more modern OS you could do some of the tracing suggested in
http://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_reporting_a_problem.3F
but some sort of profiling (oprofile, perhaps) might tell you where time is being spent in the kernel.
When you say suddenly started, was it after a kernel upgrade or other change?
-Eric
* Re: Issues and new to the group
2013-09-26 13:12 ` Ronnie Tartar
2013-09-26 13:30 ` Ronnie Tartar
@ 2013-09-26 14:59 ` Joe Landman
2013-09-26 15:26 ` Jay Ashworth
` (2 more replies)
1 sibling, 3 replies; 12+ messages in thread
From: Joe Landman @ 2013-09-26 14:59 UTC (permalink / raw)
To: xfs
On 09/26/2013 09:12 AM, Ronnie Tartar wrote:
> Stan,
>
> Thanks for the reply.
>
> My fragmentation is:
>
> [root@AP-FS1 ~]# xfs_db -c frag -r /dev/xvdb1
> actual 10470159, ideal 10409782, fragmentation factor 0.58%
This was never likely the cause ...
[...]
> This is running virtualized, definitely not a rust bucket. It's x5570 cpus
... well, this is likely the cause (virtualized)
> with MD3200 Array with light I/O.
>
> Seems like i/o wait is not problem, system% is problem. Is this the OS
> trying to find spot for these files?
From your previous description
> takes. The folders are image folders that have anywhere between 5 to
> 10 million images in each folder.
The combination of very large folders and virtualization is working
against you. Couple that with an old (ancient by Linux standards) XFS
in the virtual CentOS 5.9 system, and you aren't going to have much joy
with this without changing a few things.
First and foremost:
Can you change from one single large folder to a hierarchical set of
folders? The single large folder means any metadata operation (ls,
stat, open, close) has a huge set of lists to traverse. It will work,
albeit slowly. As a rule of thumb, we try to make sure our users don't
go much beyond 10k files per folder. If they need to, building a hierarchy
of folders slightly increases management complexity, but keeps the lists
that need to be traversed much smaller.
A strategy for doing this: If your files are named "aaaa0001"
"aaaa0002" ... "zzzz9999" or similar, then you can chop off the first
letter, and make a directory of it, and then put all files starting with
that letter in that directory. Then within each of those directories,
do the same thing with the second letter. This gets you 676 directories
and about 15k files per directory. Much faster directory operations.
Much smaller lists to traverse.
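A minimal sketch of that migration in POSIX sh, using the two-character split Joe describes; the function names and the flat/hashed paths are illustrative, not the poster's actual layout:

```shell
#!/bin/sh
# Map a file name to its hashed location: "aaaa0001.jpg" lands in
# a/a/aaaa0001.jpg, keyed on its first two characters.
hashpath() {
    name=$1
    c1=$(printf '%s' "$name" | cut -c1)
    c2=$(printf '%s' "$name" | cut -c2)
    printf '%s/%s/%s\n' "$c1" "$c2" "$name"
}

# Move every regular file in a flat folder into the hashed tree.
migrate() {
    src=$1 dst=$2
    for f in "$src"/*; do
        [ -f "$f" ] || continue
        name=$(basename "$f")
        dest=$dst/$(hashpath "$name")
        mkdir -p "$(dirname "$dest")"
        mv "$f" "$dest"
    done
}
```

Run as e.g. `migrate /images/folder1 /images/folder1.hashed`, then swap the directory names once the move completes.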
If you can't change the layout, this is a much harder problem to solve,
though you could do it by using one very large file, maintaining your
own start/end and file metadata in other files, and writing your files
at specific offsets (start and end above). This isn't a good solution
unless you know how to write file systems.
* Re: Issues and new to the group
2013-09-26 14:59 ` Joe Landman
@ 2013-09-26 15:26 ` Jay Ashworth
2013-09-26 22:47 ` Dave Chinner
2013-09-26 22:16 ` Dave Chinner
2013-09-27 2:39 ` Stan Hoeppner
2 siblings, 1 reply; 12+ messages in thread
From: Jay Ashworth @ 2013-09-26 15:26 UTC (permalink / raw)
To: xfs
----- Original Message -----
> From: "Joe Landman" <joe.landman@gmail.com>
> > takes. The folders are image folders that have anywhere between 5 to
> > 10 million images in each folder.
>
> The combination of very large folders, and virtualization is working
> against you. Couple that with an old (ancient by Linux standards) xfs
> in the virtual CentOS 5.9 system, and you aren't going to have much
> joy with this without changing a few things.
> Can you change from one single large folder to a hierarchical set of
> folders? The single large folder means any metadata operation (ls,
> stat, open, close) has a huge set of lists to traverse. It will work,
> albeit slowly. As a rule of thumb, we try to make sure our users don't
> go much beyond 10k files/folder. If they need to, building a hierarchy
> of folders slightly increases management complexity, but keeps the
> lists that are needed to be traversed much smaller.
>
> A strategy for doing this: If your files are named "aaaa0001"
> "aaaa0002" ... "zzzz9999" or similar, then you can chop off the first
> letter, and make a directory of it, and then put all files starting
> with that letter in that directory. Then within each of those directories,
> do the same thing with the second letter. This gets you 676
> directories and about 15k files per directory. Much faster directory operations.
> Much smaller lists to traverse.
While this problem isn't *near* as bad on XFS as it was on older filesystems,
where over maybe 500-1000 files would result in 'ls' commands taking
over a minute...
It's still a good idea to filename-hash large collections of files of
similar types into a directory tree, as Joe recommends. The best approach
I myself have seen is to hash a filename of
835bfak3f89yu12.jpg
into
8/3/5/b/835bfak3f89yu12.jpg
8/3/5/b/f/835bfak3f89yu12.jpg
8/3/5/b/f/a/835bfak3f89yu12.jpg
Going as deep as necessary to reduce the size of the directories. What
you gain by not having to handle Directories Of Unusual Size outweighs
(probably far outweighs) what you lose in needing to cache the extra
directory levels.
Note that I didn't actually trim the filename proper; the final file still has
its full name. This hash is easy to build, as long as you fix the number of layers
in advance... and if you need to make it deeper, later, it's easy to build a
shell script that crawls the current tree and adds the next layer.
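That layering is easy to express in POSIX sh; a sketch, where the depth argument is the fixed number of layers and the filename is Jay's example:

```shell
#!/bin/sh
# Build a hashed path: each of the first $2 characters of the name
# becomes one directory level, and the file keeps its full name.
jayhash() {
    name=$1 depth=$2
    path=""
    i=1
    while [ "$i" -le "$depth" ]; do
        path="$path$(printf '%s' "$name" | cut -c"$i")/"
        i=$((i + 1))
    done
    printf '%s%s\n' "$path" "$name"
}

jayhash 835bfak3f89yu12.jpg 4   # 8/3/5/b/835bfak3f89yu12.jpg
```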
Cheers,
-- jra
--
Jay R. Ashworth Baylink jra@baylink.com
Designer The Things I Think RFC 2100
Ashworth & Associates http://baylink.pitas.com 2000 Land Rover DII
St Petersburg FL USA #natog +1 727 647 1274
* Re: Issues and new to the group
2013-09-26 14:59 ` Joe Landman
2013-09-26 15:26 ` Jay Ashworth
@ 2013-09-26 22:16 ` Dave Chinner
2013-09-27 2:17 ` Joe Landman
2013-09-27 2:39 ` Stan Hoeppner
2 siblings, 1 reply; 12+ messages in thread
From: Dave Chinner @ 2013-09-26 22:16 UTC (permalink / raw)
To: Joe Landman; +Cc: xfs
On Thu, Sep 26, 2013 at 10:59:41AM -0400, Joe Landman wrote:
> On 09/26/2013 09:12 AM, Ronnie Tartar wrote:
> >Stan,
> >
> >Thanks for the reply.
> >
> >My fragmentation is:
> >
> >[root@AP-FS1 ~]# xfs_db -c frag -r /dev/xvdb1
> >actual 10470159, ideal 10409782, fragmentation factor 0.58%
>
> This was never likely the cause ...
>
> [...]
>
> >This is running virtualized, definitely not a rust bucket. It's x5570 cpus
>
> ... well, this is likely the cause (virtualized)
>
> >with MD3200 Array with light I/O.
> >
> >Seems like i/o wait is not problem, system% is problem. Is this the OS
> >trying to find spot for these files?
>
> From your previous description
>
> >takes. The folders are image folders that have anywhere between 5 to
> >10 million images in each folder.
>
> The combination of very large folders, and virtualization is working
> against you. Couple that with an old (ancient by Linux standards)
> xfs in the virtual CentOS 5.9 system, and you aren't going to have
> much joy with this without changing a few things.
Virtualisation will have nothing to do with the problem. *All* my
testing of XFS is done in a virtualised environment - including all
the performance testing I do. And I do it this way because even with
SSD-based storage, the virtualisation overhead is less than 2% for IO
rates exceeding 100,000 IOPS....
And, well, I can boot a virtualised machine in under 7s, while a
physical machine reboot takes about 5 minutes, so there's a massive
win in terms of compile/boot/test cycle times doing things this way.
> First and foremost:
>
> Can you change from one single large folder to a hierarchical set of
> folders? The single large folder means any metadata operation (ls,
> stat, open, close) has a huge set of lists to traverse. It will
> work, albeit slowly. As a rule of thumb, we try to make sure our
> users don't go much beyond 10k files/folder. If they need to,
> building a hierarchy of folders slightly increases management
> complexity, but keeps the lists that are needed to be traversed much
> smaller.
I'll just quote what I told someone yesterday on IRC:
[26/09/13 08:00] <dchinner_> spligak: the only way to scale linux directory operations is to spread them out over multiple directories.
[26/09/13 08:00] <dchinner_> operations on a directory are always single threaded
[26/09/13 08:01] <dchinner_> and the typical limitation is somewhere between 10-20k modification operations per second per directory
[26/09/13 08:02] <dchinner_> an empty directory will average about 15-20k creates/s out to 100k entries, expect about 7-10k creates/s at 1 million entries, and down to around 2k creates/s at 10M entries
[26/09/13 08:03] <dchinner_> these numbers are variable depending on name lengths, filesystem fragmentation, etc
[26/09/13 08:04] <dchinner_> but, in reality, for a short term file store that has lots of create and removal, you really want to hash your files over multiple directories
[26/09/13 08:05] <dchinner_> A hash that is some multiple of the AG count (e.g. 2-4x, assuming an AG count of 16+) is usually sufficient on XFS...
[26/09/13 08:05] <dchinner_> spligak: the degradation is logarithmic due to the btree-based structure of the directory indexes
.....
[26/09/13 08:11] <dchinner_> spligak: hashing obviates the need for a dir-per-thread - it spreads the load out by only having threads that end up with hash collisions working on the same dir..
>
> A strategy for doing this: If your files are named "aaaa0001"
> "aaaa0002" ... "zzzz9999" or similar, then you can chop off the
> first letter, and make a directory of it, and then put all files
> starting with that letter in that directory. Then within each of
> those directories, do the same thing with the second letter. This
> gets you 676 directories and about 15k files per directory. Much
> faster directory operations. Much smaller lists to traverse.
But that's still not optimal, as directory operations will then
serialise on per-AG locks and so modifications will still be a
bottleneck if you only have 4 AGs in your filesystem. i.e. if you
are going to do this, you need to tailor the directory hash to the
concurrency the filesystem structure provides, because more, smaller
directories are not necessarily better than fewer larger ones.
Indeed, if your workload is dominated by random lookups, the
hashing technique is less efficient than just having one large
directory, as the internal btree indexes in the XFS directory
structure are far, far more IO efficient than a multi-level
directory hash of smaller directories. The trade-off in this case is
lookup concurrency - enough directories to provide good lookup
concurrency, yet few enough that you still get the IO benefit from
the scalability of the internal directory structure.
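Dave's 2-4x rule of thumb can be scripted; a sketch, assuming the usual `xfs_info` output format where the first line carries `agcount=N` (the sample line and the `/images` mountpoint are illustrative):

```shell
#!/bin/sh
# Derive a directory-hash width range from the filesystem's AG count,
# per the 2-4x multiple quoted above.
hashwidth() {
    agcount=$(printf '%s\n' "$1" | sed -n 's/.*agcount=\([0-9]*\).*/\1/p' | head -1)
    echo "$((agcount * 2)) to $((agcount * 4)) top-level directories"
}

# Against a live mount: hashwidth "$(xfs_info /images)"
hashwidth "meta-data=/dev/xvdb1 isize=256 agcount=16, agsize=9830400 blks"
```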
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: Issues and new to the group
2013-09-26 15:26 ` Jay Ashworth
@ 2013-09-26 22:47 ` Dave Chinner
0 siblings, 0 replies; 12+ messages in thread
From: Dave Chinner @ 2013-09-26 22:47 UTC (permalink / raw)
To: Jay Ashworth; +Cc: xfs
On Thu, Sep 26, 2013 at 11:26:47AM -0400, Jay Ashworth wrote:
> ----- Original Message -----
> > From: "Joe Landman" <joe.landman@gmail.com>
>
> > > takes. The folders are image folders that have anywhere between 5 to
> > > 10 million images in each folder.
> >
> > The combination of very large folders, and virtualization is working
> > against you. Couple that with an old (ancient by Linux standards) xfs
> > in the virtual CentOS 5.9 system, and you aren't going to have much
> > joy with this without changing a few things.
>
> > Can you change from one single large folder to a heirarchical set of
> > folders? The single large folder means any metadata operation (ls,
> > stat, open, close) has a huge set of lists to traverse. It will work,
> > albiet slowly. As a rule of thumb, we try to make sure our users don't
> > go much beyond 10k files/folder. If they need to, building a heirarchy
> > of folders slightly increases management complexity, but keeps the
> > lists that are needed to be traversed much smaller.
> >
> > A strategy for doing this: If your files are named "aaaa0001"
> > "aaaa0002" ... "zzzz9999" or similar, then you can chop off the first
> > letter, and make a directory of it, and then put all files starting
> > with that letter in that directory. Then within each of those directories,
> > do the same thing with the second letter. This gets you 676
> > directories and about 15k files per directory. Much faster directory operations.
> > Much smaller lists to traverse.
>
> While this problem isn't *near* as bad on XFS as it was on older filesystems,
> where over maybe 500-1000 files would result in 'ls' commands taking
> over a minute...
Assuming a worst case, 500-1000 files requires 700-1200 IOs for ls
to complete. If that's taking over a minute, then you're getting
less than 10-20 IOPS for the workload which is about 10% of the
capability of a typical SATA drive. This sounds to me like there was
lots of other stuff competing for IO bandwidth at the same time or
something else wrong to result in such poor performance for ls.
> It's still a good idea to filename hash large collections of files of
> similar types into a directory tree, as Joe recommends. The best approach
> I myself have seen to this is to hash a filename of
>
> 835bfak3f89yu12.jpg
>
> into
>
> 8/3/5/b/835bfak3f89yu12.jpg
> 8/3/5/b/f/835bfak3f89yu12.jpg
> 8/3/5/b/f/a/835bfak3f89yu12.jpg
No, not on XFS. Here you have a fanout per level of 16, i.e.
consider a tree with a fanout of 16. To move from level to level, it
takes 2 IOs.
Let's consider the internal hash btree in XFS. For a 4k directory
block, it fits 500 entries - call it 512 to make the math easy - i.e.
it is a tree with a fanout per level of 512. To move from level to
level, it takes 1 IO.
> 8/3/5/b/f/a/835bfak3f89yu12.jpg
Here we have 6 levels of hash, that's 16^6 = 16.7M fanout.
With a fanout of 512, the internal XFS hash btree needs only 3
levels (64 * 512 * 512) to index the same number of directory entries.
So, do a lookup on the hash, it takes 12 IOs to get to the leaf
directory, then as many IOs are required to look up the entry in the
leaf directory. For a single large XFS directory, it takes 3 IOs to
find the dirent, and another 1 to read the dirent and return it to
userspace i.e. 4 IOs total vs 12 + N IOs for the equivalent 16-way
hash of the same depth...
What I am trying to point out is that on XFS deep hashing will not
improve performance like it might on ext4 - on XFS you should look
to use wide, shallow directory hashing with relatively large numbers
of entries in each leaf directory, because the internal directory
structure is much more efficient from an IO perspective than
hashing is...
And then, of course, if directory IO is still the limiting factor
with large numbers of leaf entries (e.g. you're indexing billions of
files), you have the option of using larger directory blocks and
making the internal directory fanout up to 16x wider than in this
example...
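Dave's level counts can be checked mechanically: the number of tree levels needed to index N entries at a per-level fanout F is the smallest L with F^L >= N. A sketch in sh/awk:

```shell
#!/bin/sh
# Smallest number of levels l such that fanout^l >= n.
levels() {
    awk -v n="$1" -v fan="$2" 'BEGIN {
        l = 0; cap = 1
        while (cap < n) { cap *= fan; l++ }
        print l
    }'
}

levels 16777216 16    # 16-way hash over 16.7M entries: 6 levels
levels 16777216 512   # 512-wide btree over the same entries: 3 levels
```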
> Going as deep as necessary to reduce the size of the directories. What
> you gain by not having to handle Directories Of Unusual Size outweighs
> (probably far outweighs) what you lose in needing to cache the extra
> directory levels.
On XFS, a directory with a million entries is not an unusual size -
with a 4k directory block size the algorithms are still pretty CPU
efficient at this point, though it's going to be at roughly half
that of an empty directory. It's once you get above several million
entries that the modification cost starts to dominate performance
considerations and at that point a wider hash, not a deeper hash
should be considered.
> Note that I didn't actually trim the filename proper; the final file still has
> its full name. This hash is easy to build, as long as you fix the number of layers
> in advance... and if you need to make it deeper, later, it's easy to build a
> shell script that crawls the current tree and adds the next layer.
Avoiding the need for rebalancing a directory hash is one of the
reasons for designing it around a scalable directory structure in
the first place. It pretty much means the only consideration for
the width of the hash and the underlying filesystem layout is the
concurrency your application requires.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: Issues and new to the group
2013-09-26 14:23 ` Eric Sandeen
@ 2013-09-26 23:46 ` Stan Hoeppner
0 siblings, 0 replies; 12+ messages in thread
From: Stan Hoeppner @ 2013-09-26 23:46 UTC (permalink / raw)
To: Eric Sandeen; +Cc: xfs, Ronnie Tartar
On 9/26/2013 9:23 AM, Eric Sandeen wrote:
> On 9/26/13 8:30 AM, Ronnie Tartar wrote:
>> Stan, looks like I have directory fragmentation problem.
>>
>> xfs_db> frag -d
>> actual 65057, ideal 4680, fragmentation factor 92.81%
>>
>> What is the best way to fix this?
>
> http://xfs.org/index.php/XFS_FAQ#Q:_The_xfs_db_.22frag.22_command_says_I.27m_over_50.25._Is_that_bad.3F
>
> We should just get rid of that command, TBH.
>
> So your dirs are in an average of 65057/4680 or about 14 fragments each.
> Really not that bad, in the scope of things.
>
> I'd imagine that this could be more of your problem:
>
>> The
>> folders are image folders that have anywhere between 5 to 10 million images
>> in each folder.
>
> at 10 million entries in a dir, you're going to start slowing down on inserts
> due to btree management. But that probably doesn't account for multiple seconds for
> a single file.
>
> So really,it's not clear *what* is slow.
>
>> It takes about 2.5 to 3.5 seconds to write a single file.
>
> strace with timing would be a very basic way to get a sense of what is slow;
> is it the file open/create? How big is the file, are you doing buffered or
> direct IO?
>
> On a more modern OS you could do some of the tracing suggested in
> http://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_reporting_a_problem.3F
>
> but some sort of profiling (oprofile, perhaps) might tell you where time is being spent in the kernel.
>
> When you say suddenly started, was it after a kernel upgrade or other change?
Eric is an expert on this, much more knowledgeable than me. And somehow
I missed the 5-10 million files per dir. Maybe you have multiple issues
here adding up to large delays. In addition to the steps Eric
recommends, it can't hurt to go ahead and take a look at the free space
map. Depending on how the filesystem has aged this could be a factor,
such as being 90%+ full at one time, and then lots of files being deleted.
# xfs_db -r -c freesp /dev/[device]
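[Editor's note: a minimal stand-in for the strace-with-timing approach Eric suggested, written as a sketch of my own; it times each phase of a single-file write so you can see whether open/create, write, or fsync is the slow part.]

```python
import os
import time

def time_single_write(directory: str, data: bytes = b"x" * 4096) -> dict:
    """Create, write, and fsync one file, timing each phase separately."""
    path = os.path.join(directory, f"probe-{os.getpid()}.tmp")
    timings = {}
    t0 = time.perf_counter()
    fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o600)
    timings["open"] = time.perf_counter() - t0   # slow here: directory insert
    t0 = time.perf_counter()
    os.write(fd, data)
    timings["write"] = time.perf_counter() - t0  # slow here: buffered write path
    t0 = time.perf_counter()
    os.fsync(fd)
    timings["fsync"] = time.perf_counter() - t0  # slow here: actual disk IO
    os.close(fd)
    os.unlink(path)
    return timings
```

Run it against the problem directory; if "open" dominates, the cost is in the directory btree insert rather than the data write.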
--
Stan
* Re: Issues and new to the group
2013-09-26 22:16 ` Dave Chinner
@ 2013-09-27 2:17 ` Joe Landman
0 siblings, 0 replies; 12+ messages in thread
From: Joe Landman @ 2013-09-27 2:17 UTC (permalink / raw)
To: xfs
On 09/26/2013 06:16 PM, Dave Chinner wrote:
> Virtualisation will have nothing to do with the problem. *All* my
YMMV. Very heavy IO in KVM/Xen often results in some very interesting
performance anomalies in the testing we've done on customer use cases.
[...]
> And, well, I can boot a virtualised machine in under 7s, while a
> physical machine reboot takes about 5 minutes, so there's a massive
> win in terms of compile/boot/test cycle times doing things this way.
Certainly I agree with that aspect. Our KVM instances reboot and reload
very quickly. This is one of their nicest features, and one we use for
similar reasons.
>
>> First and foremost:
>>
>> Can you change from one single large folder to a hierarchical set of
>> folders? The single large folder means any metadata operation (ls,
>> stat, open, close) has a huge set of lists to traverse. It will
>> work, albeit slowly. As a rule of thumb, we try to make sure our
>> users don't go much beyond 10k files/folder. If they need to,
>> building a hierarchy of folders slightly increases management
>> complexity, but keeps the lists that need to be traversed much
>> smaller.
>
> I'll just quote what I told someone yesterday on IRC:
>
[...]
>> A strategy for doing this: If your files are named "aaaa0001"
>> "aaaa0002" ... "zzzz9999" or similar, then you can chop off the
>> first letter, and make a directory of it, and then put all files
>> starting with that letter in that directory. Then within each of
>> those directories, do the same thing with the second letter. This
>> gets you 676 directories and about 15k files per directory. Much
>> faster directory operations. Much smaller lists to traverse.
>
> But that's still not optimal, as directory operations will then
> serialise on per AG locks and so modifications will still be a
> bottleneck if you only have 4 AGs in your filesystem. i.e. if you
> are going to do this, you need to tailor the directory hash to the
> concurrency the filesystem structure provide because more, smaller
> directories are not necessarily better than fewer larger ones.
>
> Indeed, if your workload is dominated by random lookups, the
> hashing technique is less efficient than just having one large
> directory, as the internal btree indexes in the XFS directory
> structure are far, far more IO efficient than a multi-level
> directory hash of smaller directories. The trade-off in this case is
> lookup concurrency - enough directories to provide good lookup
> concurrency, yet few enough that you still get the IO benefit from
> the scalability of the internal directory structure.
That said, it's pretty clear the OP is hitting performance bottlenecks.
While the scheme I proposed is non-optimal for the use case, I'd be
hard pressed to imagine it making things worse for him, based upon what
he's reported.
Obviously, more detail on the issue is needed.
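[Editor's note: for reference, the two-level scheme Joe described earlier in the thread can be sketched as follows; the function names and paths are my own illustration, not code from the thread.]

```python
# File "aaaa0002" lands in <root>/a/a/aaaa0002: the first two characters
# of the name become two nested directory levels, giving 26*26 = 676
# directories for lowercase-letter names.
import os

def hashed_path(root: str, filename: str) -> str:
    return os.path.join(root, filename[0], filename[1], filename)

def store(root: str, filename: str, data: bytes) -> str:
    path = hashed_path(root, filename)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        f.write(data)
    return path
```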
* Re: Issues and new to the group
2013-09-26 14:59 ` Joe Landman
2013-09-26 15:26 ` Jay Ashworth
2013-09-26 22:16 ` Dave Chinner
@ 2013-09-27 2:39 ` Stan Hoeppner
2 siblings, 0 replies; 12+ messages in thread
From: Stan Hoeppner @ 2013-09-27 2:39 UTC (permalink / raw)
To: Joe Landman; +Cc: Ronnie Tartar, xfs
On 9/26/2013 9:59 AM, Joe Landman wrote:
> On 09/26/2013 09:12 AM, Ronnie Tartar wrote:
>> Stan,
>>
>> Thanks for the reply.
>>
>> My fragmentation is:
>>
>> [root@AP-FS1 ~]# xfs_db -c frag -r /dev/xvdb1
>> actual 10470159, ideal 10409782, fragmentation factor 0.58%
> This was never likely the cause ...
Nobody suggested it was.
--
Stan
end of thread, other threads:[~2013-09-27 2:39 UTC | newest]
Thread overview: 12+ messages:
2013-09-26 11:47 Issues and new to the group Ronnie Tartar
2013-09-26 12:06 ` Stan Hoeppner
2013-09-26 13:12 ` Ronnie Tartar
2013-09-26 13:30 ` Ronnie Tartar
2013-09-26 14:23 ` Eric Sandeen
2013-09-26 23:46 ` Stan Hoeppner
2013-09-26 14:59 ` Joe Landman
2013-09-26 15:26 ` Jay Ashworth
2013-09-26 22:47 ` Dave Chinner
2013-09-26 22:16 ` Dave Chinner
2013-09-27 2:17 ` Joe Landman
2013-09-27 2:39 ` Stan Hoeppner