public inbox for linux-xfs@vger.kernel.org
* Issues and new to the group
@ 2013-09-26 11:47 Ronnie Tartar
  2013-09-26 12:06 ` Stan Hoeppner
  0 siblings, 1 reply; 12+ messages in thread
From: Ronnie Tartar @ 2013-09-26 11:47 UTC (permalink / raw)
  To: xfs

Hi, 

I have a 600GB XFS file system mounted that suddenly started running slow on
writes.  It takes about 2.5 to 3.5 seconds to write a single file.  Some
folders (with fewer files) work well.  But it will copy fast, then go
slow for long periods of time.  This is a virtualized CentOS 5.9 64-bit box
on Citrix XenServer 5.6 SP2.  It doesn't seem to be an I/O load issue, as
most of the load is system%.  My fragmentation is less than 1%.  Any help
would be greatly appreciated.  I was looking to see if there is a better way
to mount this partition or allocate more memory, whatever it takes.  The
folders are image folders that have anywhere between 5 and 10 million images
in each folder.

Thanks

Fstab mount is: 
/dev/xvdb1              /images xfs
defaults,nodiratime,nosuid,nodev,allocsize=64m 1 1

Slabtop is:
Active / Total Objects (% used)    : 2705947 / 2872142 (94.2%)
 Active / Total Slabs (% used)      : 290008 / 290008 (100.0%)
 Active / Total Caches (% used)     : 111 / 165 (67.3%)
 Active / Total Size (% used)       : 1048796.52K / 1083850.55K (96.8%)
 Minimum / Average / Maximum Object : 0.02K / 0.38K / 128.00K

  OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME

843080 836242  99%    0.44K 105385        8    421540K xfs_inode
839608 836242  99%    0.56K 119944        7    479776K xfs_vnode
461610 432085  93%    0.21K  25645       18    102580K dentry_cache
306200 306200 100%    0.09K   7655       40     30620K buffer_head
222570 121732  54%    0.12K   7419       30     29676K size-128
117439  97108  82%    0.52K  16777        7     67108K radix_tree_node
 30820  30814  99%    0.19K   1541       20      6164K xfs_ili
 13270  13266  99%    0.74K   2654        5     10616K ext3_inode_cache
  9390   9390 100%    0.25K    626       15      2504K size-256
  7682   7562  98%    0.16K    334       23      1336K vm_area_struct
  3068   1711  55%    0.06K     52       59       208K size-64
  2816   2786  98%    0.09K     64       44       256K sysfs_dir_cache
  2055   1275  62%    0.25K    137       15       548K filp
  1440   1289  89%    0.02K     10      144        40K anon_vma
  1120    999  89%    0.03K     10      112        40K size-32
   768    534  69%    0.08K     16       48        64K selinux_inode_security
   756    698  92%    0.55K    108        7       432K inode_cache
   576    554  96%    0.58K     96        6       384K proc_inode_cache
   476    455  95%    1.00K    119        4       476K size-1024
   404    403  99%    2.00K    202        2       808K size-2048
   404    404 100%    4.00K    404        1      1616K size-4096
   360    350  97%    0.12K     12       30        48K bio     
   320    284  88%    0.50K     40        8       160K size-512

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs


* Re: Issues and new to the group
  2013-09-26 11:47 Issues and new to the group Ronnie Tartar
@ 2013-09-26 12:06 ` Stan Hoeppner
  2013-09-26 13:12   ` Ronnie Tartar
  0 siblings, 1 reply; 12+ messages in thread
From: Stan Hoeppner @ 2013-09-26 12:06 UTC (permalink / raw)
  To: Ronnie Tartar; +Cc: xfs

On 9/26/2013 6:47 AM, Ronnie Tartar wrote:

> I have a 600GB xfs file system mounted that suddenly started running slow on
> writes.  It takes about 2.5 to 3.5 seconds to write a single file.  Some

This typically occurs when the filesystem gets near full and free space
is heavily fragmented.  Writing to these free space fragments requires
lots of seeking.  Seeking causes latency.  I assume your storage device
is spinning rust, yes?

> folders (with fewer files) work well.  But it will copy fast, then
> slow for long periods of time.  

Some allocation groups may have less fragmented free space than others.
 Put another way, they may have more contiguous free space.  Thus less
seeking.

> This is a virtualized CentOS 5.9 64 bit box
> on Citrix Xenserver 5.6SP2.  Doesn't seem to be a load i/o issue as most of
> the load is system%.  My fragmentation is less than 1 %.    Any help would
> be greatly appreciated.  I was looking to see if there was a better way to
> mount this partition or allocate more memory, whatever it takes.  The
> folders are image folders that have anywhere between 5 to 10 million images
> in each folder.

> Fstab mount is: 
> /dev/xvdb1              /images xfs
> defaults,nodiratime,nosuid,nodev,allocsize=64m 1 1
                                   ^^^^^^^^^^^^^
This tells XFS to allocate 64MB of free space at the end of each file
being allocated.  If free space is heavily fragmented and the fragments
are all small, this will exacerbate the seek problem.  Given the 64MB
allocsize, I assume these image files are quite large.  If this is
correct, writing them over scattered small free space fragments also
requires seeking.  Thus, I'd guess you're seeking your disk, or array,
to death.

How full is the XFS volume, and what does your free space fragmentation
map look like?
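(Both questions can be answered with read-only commands along these lines;
the device and mount point are the ones from the fstab above:)

```shell
# Read-only checks; /images and /dev/xvdb1 are from the original post.
df -h /images                    # how full the volume is
xfs_db -r -c frag /dev/xvdb1     # file fragmentation factor
xfs_db -r -c freesp /dev/xvdb1   # free space fragmentation histogram
```

The -r flag opens the filesystem read-only, so this is safe on a mounted
volume.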

-- 
Stan


* RE: Issues and new to the group
  2013-09-26 12:06 ` Stan Hoeppner
@ 2013-09-26 13:12   ` Ronnie Tartar
  2013-09-26 13:30     ` Ronnie Tartar
  2013-09-26 14:59     ` Joe Landman
  0 siblings, 2 replies; 12+ messages in thread
From: Ronnie Tartar @ 2013-09-26 13:12 UTC (permalink / raw)
  To: stan; +Cc: xfs

Stan,

Thanks for the reply.

My fragmentation is:

[root@AP-FS1 ~]# xfs_db -c frag -r /dev/xvdb1
actual 10470159, ideal 10409782, fragmentation factor 0.58%

xfs_db> freesp
   from      to extents  blocks    pct
      1       1   52343   52343   0.08
      2       3   34774   86290   0.13
      4       7  122028  732886   1.08
      8      15  182345 1898531   2.80
     16      31  147747 3300501   4.87
     32      63  111134 4981898   7.35
     64     127   93359 8475962  12.50
    128     255   51914 9069884  13.38
    256     511   25548 9200077  13.57
    512    1023   23027 17482586  25.79
   1024    2047    8662 10600931  15.64
   2048    4095     808 1915158   2.82

The volume is 57% full.

I have removed allocsize=64m from the fstab and rebooted.  These are not
large files, so this could definitely have been causing issues.
Would copying them to a new folder and renaming the folder back help?

This is running virtualized, definitely not a rust bucket.  It's X5570 CPUs
with an MD3200 array under light I/O.

Seems like I/O wait is not the problem; system% is.  Is this the OS
trying to find a spot for these files?

Thanks





-----Original Message-----
From: Stan Hoeppner [mailto:stan@hardwarefreak.com] 
Sent: Thursday, September 26, 2013 8:07 AM
To: Ronnie Tartar
Cc: xfs@oss.sgi.com
Subject: Re: Issues and new to the group

On 9/26/2013 6:47 AM, Ronnie Tartar wrote:

> I have a 600GB xfs file system mounted that suddenly started running 
> slow on writes.  It takes about 2.5 to 3.5 seconds to write a single 
> file.  Some

This typically occurs when the filesystem gets near full and free space is
heavily fragmented.  Writing to these free space fragments requires lots of
seeking.  Seeking causes latency.  I assume your storage device is spinning
rust, yes?

> folders (with fewer files) work well.  But it will copy fast, 
> then slow for long periods of time.

Some allocation groups may have less fragmented free space than others.
 Put another way, they may have more contiguous free space.  Thus less
seeking.

> This is a virtualized CentOS 5.9 64 bit box on Citrix Xenserver 
> 5.6SP2.  Doesn't seem to be a load i/o issue as most of
> the load is system%.  My fragmentation is less than 1 %.    Any help would
> be greatly appreciated.  I was looking to see if there was a better 
> way to mount this partition or allocate more memory, whatever it 
> takes.  The folders are image folders that have anywhere between 5 to 
> 10 million images in each folder.

> Fstab mount is: 
> /dev/xvdb1              /images xfs
> defaults,nodiratime,nosuid,nodev,allocsize=64m 1 1
                                   ^^^^^^^^^^^^^
This tells XFS to allocate
64MB of free space at the end of each file being allocated.  If free space
is heavily fragmented and the fragments are all small, this will exacerbate
the seek problem.  Given the 64MB allocsize, I assume these image files are
quite large.  If this is correct, writing them over scattered small free
space fragments also requires seeking.  Thus, I'd guess you're seeking your
disk, or array, to death.

How full is the XFS volume, and what does your free space fragmentation map
look like?

--
Stan



* RE: Issues and new to the group
  2013-09-26 13:12   ` Ronnie Tartar
@ 2013-09-26 13:30     ` Ronnie Tartar
  2013-09-26 14:23       ` Eric Sandeen
  2013-09-26 14:59     ` Joe Landman
  1 sibling, 1 reply; 12+ messages in thread
From: Ronnie Tartar @ 2013-09-26 13:30 UTC (permalink / raw)
  To: stan; +Cc: xfs

Stan, looks like I have a directory fragmentation problem.

xfs_db> frag -d
actual 65057, ideal 4680, fragmentation factor 92.81%

What is the best way to fix this?

Thanks


-----Original Message-----
From: xfs-bounces@oss.sgi.com [mailto:xfs-bounces@oss.sgi.com] On Behalf Of
Ronnie Tartar
Sent: Thursday, September 26, 2013 9:12 AM
To: stan@hardwarefreak.com
Cc: xfs@oss.sgi.com
Subject: RE: Issues and new to the group

Stan,

Thanks for the reply.

My fragmentation is:

[root@AP-FS1 ~]# xfs_db -c frag -r /dev/xvdb1
actual 10470159, ideal 10409782, fragmentation factor 0.58%

xfs_db> freesp
   from      to extents  blocks    pct
      1       1   52343   52343   0.08
      2       3   34774   86290   0.13
      4       7  122028  732886   1.08
      8      15  182345 1898531   2.80
     16      31  147747 3300501   4.87
     32      63  111134 4981898   7.35
     64     127   93359 8475962  12.50
    128     255   51914 9069884  13.38
    256     511   25548 9200077  13.57
    512    1023   23027 17482586  25.79
   1024    2047    8662 10600931  15.64
   2048    4095     808 1915158   2.82

The volume is 57% full.

I have removed allocsize=64m from the fstab and rebooted.  These are not
large files, so this could definitely have been causing issues.
Would copying them to a new folder and renaming the folder back help?

This is running virtualized, definitely not a rust bucket.  It's X5570 CPUs
with an MD3200 array under light I/O.

Seems like I/O wait is not the problem; system% is.  Is this the OS
trying to find a spot for these files?

Thanks





-----Original Message-----
From: Stan Hoeppner [mailto:stan@hardwarefreak.com]
Sent: Thursday, September 26, 2013 8:07 AM
To: Ronnie Tartar
Cc: xfs@oss.sgi.com
Subject: Re: Issues and new to the group

On 9/26/2013 6:47 AM, Ronnie Tartar wrote:

> I have a 600GB xfs file system mounted that suddenly started running 
> slow on writes.  It takes about 2.5 to 3.5 seconds to write a single 
> file.  Some

This typically occurs when the filesystem gets near full and free space is
heavily fragmented.  Writing to these free space fragments requires lots of
seeking.  Seeking causes latency.  I assume your storage device is spinning
rust, yes?

> folders (with fewer files) work well.  But it will copy fast, 
> then slow for long periods of time.

Some allocation groups may have less fragmented free space than others.
 Put another way, they may have more contiguous free space.  Thus less
seeking.

> This is a virtualized CentOS 5.9 64 bit box on Citrix Xenserver 
> 5.6SP2.  Doesn't seem to be a load i/o issue as most of
> the load is system%.  My fragmentation is less than 1 %.    Any help would
> be greatly appreciated.  I was looking to see if there was a better 
> way to mount this partition or allocate more memory, whatever it 
> takes.  The folders are image folders that have anywhere between 5 to
> 10 million images in each folder.

> Fstab mount is: 
> /dev/xvdb1              /images xfs
> defaults,nodiratime,nosuid,nodev,allocsize=64m 1 1
                                   ^^^^^^^^^^^^^
This tells XFS to allocate
64MB of free space at the end of each file being allocated.  If free space
is heavily fragmented and the fragments are all small, this will exacerbate
the seek problem.  Given the 64MB allocsize, I assume these image files are
quite large.  If this is correct, writing them over scattered small free
space fragments also requires seeking.  Thus, I'd guess you're seeking your
disk, or array, to death.

How full is the XFS volume, and what does your free space fragmentation map
look like?

--
Stan



* Re: Issues and new to the group
  2013-09-26 13:30     ` Ronnie Tartar
@ 2013-09-26 14:23       ` Eric Sandeen
  2013-09-26 23:46         ` Stan Hoeppner
  0 siblings, 1 reply; 12+ messages in thread
From: Eric Sandeen @ 2013-09-26 14:23 UTC (permalink / raw)
  To: Ronnie Tartar; +Cc: stan, xfs

On 9/26/13 8:30 AM, Ronnie Tartar wrote:
> Stan, looks like I have directory fragmentation problem.
> 
> xfs_db> frag -d
> actual 65057, ideal 4680, fragmentation factor 92.81%
> 
> What is the best way to fix this?

http://xfs.org/index.php/XFS_FAQ#Q:_The_xfs_db_.22frag.22_command_says_I.27m_over_50.25._Is_that_bad.3F

We should just get rid of that command, TBH.

So your dirs are in an average of 65057/4680, or about 14 fragments each.
Really not that bad, in the scheme of things.

I'd imagine that this could be more of your problem:

> The
> folders are image folders that have anywhere between 5 to 10 million images
> in each folder.

at 10 million entries in a dir, you're going to start slowing down on inserts
due to btree management.  But that probably doesn't account for multiple seconds for
a single file.

So really, it's not clear *what* is slow.

> It takes about 2.5 to 3.5 seconds to write a single file.

strace with timing would be a very basic way to get a sense of what is slow;
is it the file open/create?  How big is the file, are you doing buffered or
direct IO?
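Something like the following would be a minimal version of that (the file
and directory names are placeholders, not from the thread, and it assumes
strace is installed):

```shell
# -tt: wall-clock timestamps; -T: show time spent in each syscall as <seconds>.
strace -tt -T -o /tmp/write.trace \
    cp /tmp/sample.jpg /images/somefolder/sample.jpg

# Rank syscalls by elapsed time (the trailing <...> field on each line).
awk -F'[<>]' 'NF >= 3 { print $(NF-1), $0 }' /tmp/write.trace | sort -rn | head
```

That would at least show whether the seconds are going into the open/create
on the huge directory, the write itself, or somewhere else.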

On a more modern OS you could do some of the tracing suggested in
http://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_reporting_a_problem.3F

but some sort of profiling (oprofile, perhaps) might tell you where time is being spent in the kernel.

When you say suddenly started, was it after a kernel upgrade or other change?

-Eric


* Re: Issues and new to the group
  2013-09-26 13:12   ` Ronnie Tartar
  2013-09-26 13:30     ` Ronnie Tartar
@ 2013-09-26 14:59     ` Joe Landman
  2013-09-26 15:26       ` Jay Ashworth
                         ` (2 more replies)
  1 sibling, 3 replies; 12+ messages in thread
From: Joe Landman @ 2013-09-26 14:59 UTC (permalink / raw)
  To: xfs

On 09/26/2013 09:12 AM, Ronnie Tartar wrote:
> Stan,
>
> Thanks for the reply.
>
> My fragmentation is:
>
> [root@AP-FS1 ~]# xfs_db -c frag -r /dev/xvdb1
> actual 10470159, ideal 10409782, fragmentation factor 0.58%

This was never likely the cause ...

[...]

> This is running virtualized, definitely not a rust bucket.  It's x5570 cpus

... well, this is likely the cause (virtualized)

> with MD3200 Array with light I/O.
>
> Seems like i/o wait is not problem, system% is problem.  Is this the OS
> trying to find spot for these files?

From your previous description:

> takes.  The folders are image folders that have anywhere between 5 to
> 10 million images in each folder.

The combination of very large folders and virtualization is working 
against you.  Couple that with an old (ancient by Linux standards) XFS 
in the virtual CentOS 5.9 system, and you aren't going to have much joy 
with this without changing a few things.

First and foremost:

Can you change from one single large folder to a hierarchical set of 
folders?  The single large folder means any metadata operation (ls, 
stat, open, close) has a huge set of lists to traverse.  It will work, 
albeit slowly.  As a rule of thumb, we try to make sure our users don't 
go much beyond 10k files/folder.  If they need more, building a hierarchy 
of folders slightly increases management complexity, but keeps the lists 
that need to be traversed much smaller.

A strategy for doing this:  If your files are named "aaaa0001" 
"aaaa0002" ... "zzzz9999" or similar, then you can chop off the first 
letter, and make a directory of it, and then put all files starting with 
that letter in that directory.  Then within each of those directories, 
do the same thing with the second letter.  This gets you 676 directories 
and about 15k files per directory.  Much faster directory operations. 
Much smaller lists to traverse.
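A rough sketch of that migration, using the two-letter scheme above (all
paths here are hypothetical):

```shell
# Sketch: file "aaaa0001" lands in directory a/a/.
# bucket_path prints the two-level directory for a given file name.
bucket_path() {
    first=$(printf '%s' "$1" | cut -c1)
    second=$(printf '%s' "$1" | cut -c2)
    printf '%s/%s\n' "$first" "$second"
}

# Dry-run migration of one flat folder into the hashed layout
# (drop the echo to actually move files):
# for f in /images/big_folder/*; do
#     dest="/images/hashed/$(bucket_path "$(basename "$f")")"
#     echo mkdir -p "$dest" "&& mv $f $dest/"
# done
```

The migration loop is left as a commented dry run on purpose; moving 10M
files is slow and should be tested on a small subset first.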

If you can't change the layout, this is a much harder problem to solve, 
though you could do it by using one very large file, maintaining your 
own start/end and file metadata in other files, and writing your files 
at specific offsets (start and end above).  This isn't a good solution 
unless you know how to write file systems.


* Re: Issues and new to the group
  2013-09-26 14:59     ` Joe Landman
@ 2013-09-26 15:26       ` Jay Ashworth
  2013-09-26 22:47         ` Dave Chinner
  2013-09-26 22:16       ` Dave Chinner
  2013-09-27  2:39       ` Stan Hoeppner
  2 siblings, 1 reply; 12+ messages in thread
From: Jay Ashworth @ 2013-09-26 15:26 UTC (permalink / raw)
  To: xfs

----- Original Message -----
> From: "Joe Landman" <joe.landman@gmail.com>

> > takes. The folders are image folders that have anywhere between 5 to
> > 10 million images in each folder.
> 
> The combination of very large folders, and virtualization is working
> against you. Couple that with an old (ancient by Linux standards) xfs
> in the virtual CentOS 5.9 system, and you aren't going to have much
> joy with this without changing a few things.

> Can you change from one single large folder to a hierarchical set of
> folders? The single large folder means any metadata operation (ls,
> stat, open, close) has a huge set of lists to traverse. It will work,
> albeit slowly. As a rule of thumb, we try to make sure our users don't
> go much beyond 10k files/folder. If they need to, building a hierarchy
> of folders slightly increases management complexity, but keeps the
> lists that are needed to be traversed much smaller.
> 
> A strategy for doing this: If your files are named "aaaa0001"
> "aaaa0002" ... "zzzz9999" or similar, then you can chop off the first
> letter, and make a directory of it, and then put all files starting
> with that letter in that directory. Then within each of those directories,
> do the same thing with the second letter. This gets you 676
> directories and about 15k files per directory. Much faster directory operations.
> Much smaller lists to traverse.

While this problem isn't *near* as bad on XFS as it was on older filesystems,
where directories of maybe 500-1000 files would result in 'ls' commands
taking over a minute...

It's still a good idea to filename-hash large collections of files of 
similar types into a directory tree, as Joe recommends.  The best approach
I myself have seen is to hash a filename of

835bfak3f89yu12.jpg

into

8/3/5/b/835bfak3f89yu12.jpg
8/3/5/b/f/835bfak3f89yu12.jpg
8/3/5/b/f/a/835bfak3f89yu12.jpg

Going as deep as necessary to reduce the size of the directories.  What
you lose in needing to cache the extra directory levels is outweighed
(probably far outweighed) by not having to handle Directories Of Unusual
Size.

Note that I didn't actually trim the filename proper; the final file still has
its full name.  This hash is easy to build, as long as you fix the number of layers
in advance... and if you need to make it deeper, later, it's easy to build a 
shell script that crawls the current tree and adds the next layer.
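That scheme can be sketched as a small shell function, with the depth fixed
in advance as described (nothing here is from the thread beyond the example
filename):

```shell
# Fixed-depth character hash: depth 4 turns 835bfak3f89yu12.jpg
# into 8/3/5/b/835bfak3f89yu12.jpg; the file keeps its full name.
hash_path() {
    name=$1
    depth=$2
    path=""
    i=1
    while [ "$i" -le "$depth" ]; do
        # take the i-th character of the name as the next directory level
        path="$path$(printf '%s' "$name" | cut -c"$i")/"
        i=$((i + 1))
    done
    printf '%s%s\n' "$path" "$name"
}
```

Going one level deeper is then just a matter of calling it with depth+1 and
moving files down, which is what the crawl-and-add-a-layer script would do.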

Cheers,
-- jra
-- 
Jay R. Ashworth                  Baylink                       jra@baylink.com
Designer                     The Things I Think                       RFC 2100
Ashworth & Associates     http://baylink.pitas.com         2000 Land Rover DII
St Petersburg FL USA               #natog                      +1 727 647 1274


* Re: Issues and new to the group
  2013-09-26 14:59     ` Joe Landman
  2013-09-26 15:26       ` Jay Ashworth
@ 2013-09-26 22:16       ` Dave Chinner
  2013-09-27  2:17         ` Joe Landman
  2013-09-27  2:39       ` Stan Hoeppner
  2 siblings, 1 reply; 12+ messages in thread
From: Dave Chinner @ 2013-09-26 22:16 UTC (permalink / raw)
  To: Joe Landman; +Cc: xfs

On Thu, Sep 26, 2013 at 10:59:41AM -0400, Joe Landman wrote:
> On 09/26/2013 09:12 AM, Ronnie Tartar wrote:
> >Stan,
> >
> >Thanks for the reply.
> >
> >My fragmentation is:
> >
> >[root@AP-FS1 ~]# xfs_db -c frag -r /dev/xvdb1
> >actual 10470159, ideal 10409782, fragmentation factor 0.58%
> 
> This was never likely the cause ...
> 
> [...]
> 
> >This is running virtualized, definitely not a rust bucket.  It's x5570 cpus
> 
> ... well, this is likely the cause (virtualized)
> 
> >with MD3200 Array with light I/O.
> >
> >Seems like i/o wait is not problem, system% is problem.  Is this the OS
> >trying to find spot for these files?
> 
> From your previous description
> 
> >takes.  The folders are image folders that have anywhere between 5 to
> >10 million images in each folder.
> 
> The combination of very large folders, and virtualization is working
> against you.  Couple that with an old (ancient by Linux standards)
> xfs in the virtual CentOS 5.9 system, and you aren't going to have
> much joy with this without changing a few things.

Virtualisation will have nothing to do with the problem.  *All* my
testing of XFS is done in a virtualised environment - including all the
performance testing I do.  And I do it this way because even with SSD
based storage, the virtualisation overhead is less than 2% for IO
rates exceeding 100,000 IOPS....

And, well, I can boot a virtualised machine in under 7s, while a
physical machine reboot takes about 5 minutes, so there's a massive
win in terms of compile/boot/test cycle times doing things this way.

> First and foremost:
> 
> Can you change from one single large folder to a hierarchical set of
> folders?  The single large folder means any metadata operation (ls,
> stat, open, close) has a huge set of lists to traverse.  It will
> work, albeit slowly.  As a rule of thumb, we try to make sure our
> users don't go much beyond 10k files/folder.  If they need to,
> building a hierarchy of folders slightly increases management
> complexity, but keeps the lists that are needed to be traversed much
> smaller.

I'll just quote what I told someone yesterday on IRC:

[26/09/13 08:00] <dchinner_> spligak: the only way to scale linux directory operations is to spread them out over multiple directories.
[26/09/13 08:00] <dchinner_> operations on a directory are always single threaded
[26/09/13 08:01] <dchinner_> and the typical limitiation is somewhere between 10-20k modification operations per second per directory
[26/09/13 08:02] <dchinner_> an empty directory will average about 15-20k creates/s out to 100k entries, expect about 7-10k creates/s at 1 million entries, and down to around 2k creates/s at 10M entries
[26/09/13 08:03] <dchinner_> these numbers are variable depending on name lengths, filesystem fragmentation, etc
[26/09/13 08:04] <dchinner_> but, in reality, for a short term file store that has lots of create and removal, you really want to hash your files over multiple directories
[26/09/13 08:05] <dchinner_> A hash that is some multiple of the AG count (e.g. 2-4x, assuming an AG count of 16+ is usually sufficient on XFS...
[26/09/13 08:05] <dchinner_> spligak: the degradation is logarithmic due to the btree-based structure of the directory indexes
.....
[26/09/13 08:11] <dchinner_> spligak: hashing obliviates the need for a dir-per-thread - it spreads the load out by only having threads that end up with hash collisions working on the same dir..
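A sketch of that rule of thumb (the agcount here is a stand-in; on a real
system it would come from `xfs_info /images`):

```shell
# Pick a bucket count that's a small multiple of the AG count (the 2-4x
# rule of thumb above), then hash file names into numbered bucket dirs.
agcount=16
buckets=$((agcount * 4))    # 64 top-level directories

bucket_for() {
    # cksum prints "CRC bytecount" for stdin; use the CRC mod bucket count.
    crc=$(printf '%s' "$1" | cksum | cut -d' ' -f1)
    echo $((crc % buckets))
}

# A file then lives under /images/$(bucket_for name)/name.
```

The hash only has to be deterministic and reasonably uniform; cksum is just
a convenient stand-in.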

> 
> A strategy for doing this:  If your files are named "aaaa0001"
> "aaaa0002" ... "zzzz9999" or similar, then you can chop off the
> first letter, and make a directory of it, and then put all files
> starting with that letter in that directory.  Then within each of
> those directories, do the same thing with the second letter.  This
> gets you 676 directories and about 15k files per directory.  Much
> faster directory operations. Much smaller lists to traverse.

But that's still not optimal, as directory operations will then
serialise on per-AG locks and so modifications will still be a
bottleneck if you only have 4 AGs in your filesystem. i.e. if you
are going to do this, you need to tailor the directory hash to the
concurrency the filesystem structure provides, because more, smaller
directories are not necessarily better than fewer larger ones.

Indeed, if your workload is dominated by random lookups, the
hashing technique is less efficient than just having one large
directory, as the internal btree indexes in the XFS directory
structure are far, far more IO efficient than a multi-level
directory hash of smaller directories. The trade-off in this case is
lookup concurrency - enough directories to provide good lookup
concurrency, yet few enough that you still get the IO benefit from
the scalability of the internal directory structure.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: Issues and new to the group
  2013-09-26 15:26       ` Jay Ashworth
@ 2013-09-26 22:47         ` Dave Chinner
  0 siblings, 0 replies; 12+ messages in thread
From: Dave Chinner @ 2013-09-26 22:47 UTC (permalink / raw)
  To: Jay Ashworth; +Cc: xfs

On Thu, Sep 26, 2013 at 11:26:47AM -0400, Jay Ashworth wrote:
> ----- Original Message -----
> > From: "Joe Landman" <joe.landman@gmail.com>
> 
> > > takes. The folders are image folders that have anywhere between 5 to
> > > 10 million images in each folder.
> > 
> > The combination of very large folders, and virtualization is working
> > against you. Couple that with an old (ancient by Linux standards) xfs
> > in the virtual CentOS 5.9 system, and you aren't going to have much
> > joy with this without changing a few things.
> 
> > Can you change from one single large folder to a hierarchical set of
> > folders? The single large folder means any metadata operation (ls,
> > stat, open, close) has a huge set of lists to traverse. It will work,
> > albeit slowly. As a rule of thumb, we try to make sure our users don't
> > go much beyond 10k files/folder. If they need to, building a hierarchy
> > of folders slightly increases management complexity, but keeps the
> > lists that are needed to be traversed much smaller.
> > 
> > A strategy for doing this: If your files are named "aaaa0001"
> > "aaaa0002" ... "zzzz9999" or similar, then you can chop off the first
> > letter, and make a directory of it, and then put all files starting
> > with that letter in that directory. Then within each of those directories,
> > do the same thing with the second letter. This gets you 676
> > directories and about 15k files per directory. Much faster directory operations.
> > Much smaller lists to traverse.
> 
> While this problem isn't *near* as bad on XFS as it was on older filesystems,
> where over maybe 500-1000 files would result in 'ls' commands taking
> over a minute...

Assuming a worst case, 500-1000 files requires 700-1200 IOs for ls
to complete.  If that's taking over a minute, then you're getting
less than 10-20 IOPS for the workload, which is about 10% of the
capability of a typical SATA drive.  This sounds to me like there was
lots of other stuff competing for IO bandwidth at the same time, or
something else wrong, to result in such poor performance for ls.

> It's still a good idea to filename hash large collections of files of 
> similar types into a directory tree, as Joe recommends.  The best approach
> I myself have seen to this is to has a filename of
> 
> 835bfak3f89yu12.jpg
> 
> into
> 
> 8/3/5/b/835bfak3f89yu12.jpg
> 8/3/5/b/f/835bfak3f89yu12.jpg
> 8/3/5/b/f/a/835bfak3f89yu12.jpg

No, not on XFS.  Here you have a fanout per level of 16, i.e. a tree
with a fanout of 16.  To move from level to level, it takes 2 IOs.

Now consider the internal hash btree in XFS.  A 4k directory block
fits about 500 entries - call it 512 to make the math easy - i.e. it
is a tree with a fanout per level of 512.  To move from level to
level, it takes 1 IO.

> 8/3/5/b/f/a/835bfak3f89yu12.jpg

Here we have 6 levels of hash, that's 16^6 = 16.7M fanout.

With a fanout of 512, the internal XFS hash btree needs only 3
levels (64 * 512 * 512) to index the same number directory entries.

So, do a lookup on the hash, it takes 12 IOs to get to the leaf
directory, then as many IOs are required to look up the entry in the
leaf directory. For a single large XFS directory, it takes 3 IOs to
find the dirent, and another 1 to read the dirent and return it to
userspace i.e. 4 IOs total vs 12 + N IOs for the equivalent 16-way
hash of the same depth...

What I am trying to point out is that on XFS deep hashing will not
improve performance like it might on ext4 - on XFS you should look
to use wide, shallow directory hashing with relatively large numbers
of entries in each leaf directory, because the internal directory
structure is much more efficient from an IO perspective than
hashing is...

And then, of course, if directory IO is still the limiting factor
with large numbers of leaf entries (e.g. you're indexing billions of
files), you have the option of using larger directory blocks and
making the internal directory fanout up to 16x wider than in this
example...

> Going as deep as necessary to reduce the size of the directories. What
> you lose in needing to cache the extra directory levels outweighs (probably
> far outweighs) having to handle Directories Of Unusual Size.

On XFS, a directory with a million entries is not an unusual size -
with a 4k directory block size the algorithms are still pretty CPU
efficient at this point, though it's going to be at roughly half
that of an empty directory. It's once you get above several million
entries that the modification cost starts to dominate performance
considerations, and at that point a wider hash, not a deeper hash,
should be considered..

> Note that I didn't actually trim the filename proper; the final file still has
> its full name.  This hash is easy to build, as long as you fix the number of layers
> in advance... and if you need to make it deeper, later, it's easy to build a 
> shell script that crawls the current tree and adds the next layer.

Avoiding the need for rebalancing a directory hash is one of the
reasons for designing it around a scalable directory structure in
the first place.  It pretty much means the only consideration for
the width of the hash and the underlying filesystem layout is the
concurrency your application requires.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Issues and new to the group
  2013-09-26 14:23       ` Eric Sandeen
@ 2013-09-26 23:46         ` Stan Hoeppner
  0 siblings, 0 replies; 12+ messages in thread
From: Stan Hoeppner @ 2013-09-26 23:46 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: xfs, Ronnie Tartar

On 9/26/2013 9:23 AM, Eric Sandeen wrote:
> On 9/26/13 8:30 AM, Ronnie Tartar wrote:
>> Stan, looks like I have directory fragmentation problem.
>>
>> xfs_db> frag -d
>> actual 65057, ideal 4680, fragmentation factor 92.81%
>>
>> What is the best way to fix this?
> 
> http://xfs.org/index.php/XFS_FAQ#Q:_The_xfs_db_.22frag.22_command_says_I.27m_over_50.25._Is_that_bad.3F
> 
> We should just get rid of that command, TBH.
> 
> So your dirs are in an average of 65057/4680 or about 14 fragments each.
> Really not that bad, in the scope of things.
> 
> I'd imagine that this could be more of your problem:
> 
>> The
>> folders are image folders that have anywhere between 5 to 10 million images
>> in each folder.
> 
> at 10 million entries in a dir, you're going to start slowing down on inserts
> due to btree management.  But that probably doesn't account for multiple seconds for
> a single file.
> 
> So really, it's not clear *what* is slow.
> 
>> It takes about 2.5 to 3.5 seconds to write a single file.
> 
> strace with timing would be a very basic way to get a sense of what is slow;
> is it the file open/create?  How big is the file, are you doing buffered or
> direct IO?
> 
> On a more modern OS you could do some of the tracing suggested in
> http://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_reporting_a_problem.3F
> 
> but some sort of profiling (oprofile, perhaps) might tell you where time is being spent in the kernel.
> 
> When you say suddenly started, was it after a kernel upgrade or other change?

Eric is an expert on this, much more knowledgeable than me.  And somehow
I missed the 5-10 million files per dir.  Maybe you have multiple issues
here adding up to large delays.  In addition to the steps Eric
recommends, it can't hurt to go ahead and take a look at the free space
map.  Depending on how the filesystem has aged this could be a factor,
such as being 90%+ full at one time, and then lots of files being deleted.

# xfs_db -r -c freesp /dev/[device]

-- 
Stan


* Re: Issues and new to the group
  2013-09-26 22:16       ` Dave Chinner
@ 2013-09-27  2:17         ` Joe Landman
  0 siblings, 0 replies; 12+ messages in thread
From: Joe Landman @ 2013-09-27  2:17 UTC (permalink / raw)
  To: xfs

On 09/26/2013 06:16 PM, Dave Chinner wrote:

> Virtualisation will have nothing to do with the problem. *All* my

YMMV.  Very heavy IO in KVM/Xen often results in some very interesting 
performance anomalies from the testing we've done on customer use cases.

[...]

> And, well, I can boot a virtualised machine in under 7s, while a
> physical machine reboot takes about 5 minutes, so there's a massive
> win in terms of compile/boot/test cycle times doing things this way.

Certainly I agree with that aspect.  Our KVM instances reboot and reload 
very quickly.  This is one of their nicest features.  One we use for 
similar reasons.

>
>> First and foremost:
>>
>> Can you change from one single large folder to a hierarchical set of
>> folders?  The single large folder means any metadata operation (ls,
>> stat, open, close) has a huge set of lists to traverse.  It will
>> work, albeit slowly.  As a rule of thumb, we try to make sure our
>> users don't go much beyond 10k files/folder.  If they need to,
>> building a hierarchy of folders slightly increases management
>> complexity, but keeps the lists that need to be traversed much
>> smaller.
>
> I'll just quote what I told someone yesterday on IRC:
>

[...]


>> A strategy for doing this:  If your files are named "aaaa0001"
>> "aaaa0002" ... "zzzz9999" or similar, then you can chop off the
>> first letter, and make a directory of it, and then put all files
>> starting with that letter in that directory.  Then within each of
>> those directories, do the same thing with the second letter.  This
>> gets you 676 directories and about 15k files per directory.  Much
>> faster directory operations. Much smaller lists to traverse.
>
> But that's still not optimal, as directory operations will then
> serialise on per AG locks and so modifications will still be a
> bottleneck if you only have 4 AGs in your filesystem. i.e. if you
> are going to do this, you need to tailor the directory hash to the
> concurrency the filesystem structure provides, because more, smaller
> directories are not necessarily better than fewer larger ones.
>
> Indeed, if your workload is dominated by random lookups, the
> hashing technique is less efficient than just having one large
> directory as the internal btree indexes in the XFS directory
> structure are far, far more IO efficient than a multi-level
> directory hash of smaller directories. The trade-off in this case is
> lookup concurrency - enough directories to provide good lookup
> concurrency, yet few enough that you still get the IO benefit from
> the scalability of the internal directory structure.
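For concreteness, the quoted two-letter scheme (for names like "aaaa0001" ... "zzzz9999") might be sketched like this; the "images/" prefix is a placeholder:

```shell
# Two hash levels keyed on the first two characters of the filename:
# "abcd0001" maps to images/a/b/abcd0001.
f="abcd0001"
top=$(printf '%s' "$f" | cut -c1)
sub=$(printf '%s' "$f" | cut -c2)
echo "images/$top/$sub/$f"
```

This prints "images/a/b/abcd0001"; with 26 letters at each level that gives the 676 leaf directories mentioned above.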

This said, it's pretty clear the OP is hitting performance bottlenecks. 
While the scheme I proposed was non-optimal for the use case, I'd be 
hard pressed to imagine it being worse for his use case based upon what 
he's reported.

Obviously, more detail on the issue is needed.


* Re: Issues and new to the group
  2013-09-26 14:59     ` Joe Landman
  2013-09-26 15:26       ` Jay Ashworth
  2013-09-26 22:16       ` Dave Chinner
@ 2013-09-27  2:39       ` Stan Hoeppner
  2 siblings, 0 replies; 12+ messages in thread
From: Stan Hoeppner @ 2013-09-27  2:39 UTC (permalink / raw)
  To: Joe Landman; +Cc: Ronnie Tartar, xfs

On 9/26/2013 9:59 AM, Joe Landman wrote:
> On 09/26/2013 09:12 AM, Ronnie Tartar wrote:
>> Stan,
>>
>> Thanks for the reply.
>>
>> My fragmentation is:
>>
>> [root@AP-FS1 ~]# xfs_db -c frag -r /dev/xvdb1
>> actual 10470159, ideal 10409782, fragmentation factor 0.58%

> This was never likely the cause ...

Nobody suggested it was.

-- 
Stan



Thread overview: 12+ messages
2013-09-26 11:47 Issues and new to the group Ronnie Tartar
2013-09-26 12:06 ` Stan Hoeppner
2013-09-26 13:12   ` Ronnie Tartar
2013-09-26 13:30     ` Ronnie Tartar
2013-09-26 14:23       ` Eric Sandeen
2013-09-26 23:46         ` Stan Hoeppner
2013-09-26 14:59     ` Joe Landman
2013-09-26 15:26       ` Jay Ashworth
2013-09-26 22:47         ` Dave Chinner
2013-09-26 22:16       ` Dave Chinner
2013-09-27  2:17         ` Joe Landman
2013-09-27  2:39       ` Stan Hoeppner
