public inbox for linux-xfs@vger.kernel.org
* 1B files, slow file creation, only AG0 used
@ 2012-03-10  2:13 Michael Spiegle
  2012-03-10  4:59 ` Eric Sandeen
  2012-03-12  0:56 ` Dave Chinner
  0 siblings, 2 replies; 8+ messages in thread
From: Michael Spiegle @ 2012-03-10  2:13 UTC (permalink / raw)
  To: xfs

We're seeing some very strange behavior with XFS on the default kernel
for CentOS 5.6 (note: I have also tried 3.2.9 and witnessed the same issue).
 The dataset on this server is about 1B small files (anywhere from 1KB
to 50KB).  We first noticed it when creating files in a directory.  A
simple 'touch' would take over 300ms on a completely idle system.  If
I simply create a different directory, touching files is 1ms or
faster.  Example:

# time touch 0
real    0m0.323s
user    0m0.000s
sys     0m0.323s

# mkdir tmp2
# time touch tmp2/0
real    0m0.001s
user    0m0.000s
sys     0m0.000s

We've done quite a bit of testing and debugging, and while we don't
have an answer yet, we've noticed that our filesystem was created with
the default of 32 AGs.  When using xfs_db, we notice that all
allocations appear to be in AG0 only.  We've also noticed during
testing that if we create 512 AGs, the distribution appears to be
better.  It seems that the AG is actually encoded into the inode, and
the XFS_INO_TO_AGNO(mp,i) macro is used to determine the AG by
performing a bitshift.  In our case, the bitshift appears to be
32 bits, and since our inode numbers are 32 bits, we always end up with AG0.
Does anyone know if our slow file creation issue is related to our use
of AG0, and if so, what's the best way to utilize additional AGs?
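
To illustrate the arithmetic (a rough Python sketch, assuming I'm reading
the macro right; the geometry values come from the xfs_info output below,
and the agblklog/inopblog names follow the on-disk superblock fields):

```python
import math

# Geometry from xfs_info below: agsize=152575999 blks, bsize=4096, isize=256
agsize_blocks = 152575999
inodes_per_block = 4096 // 256                  # 16 inodes per fs block

agblklog = math.ceil(math.log2(agsize_blocks))  # bits to address a block within an AG -> 28
inopblog = int(math.log2(inodes_per_block))     # bits to address an inode within a block -> 4

def ino_to_agno(ino):
    # Rough equivalent of XFS_INO_TO_AGNO(mp, i): shift off the within-AG bits
    return ino >> (agblklog + inopblog)

print(agblklog + inopblog)     # 32: the shift we observed
print(ino_to_agno(2**32 - 1))  # 0: any 32-bit inode number lands in AG0
```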

Per-AG counts:
# for x in {0..31}; do echo -n "${x}: "; xfs_db -c "agi ${x}" -c
"print" -r /dev/sda1 | grep "^count"; done
0: count = 1098927744
1: count = 0
2: count = 0
3: count = 0
4: count = 0
5: count = 0
6: count = 0
7: count = 0
8: count = 0
9: count = 0
10: count = 0
11: count = 0
12: count = 0
13: count = 0
14: count = 0
15: count = 0
16: count = 0
17: count = 0
18: count = 0
19: count = 0
20: count = 0
21: count = 0
22: count = 0
23: count = 0
24: count = 0
25: count = 0
26: count = 0
27: count = 0
28: count = 0
29: count = 0
30: count = 0
31: count = 0

Some general stats on the server:
24x Xeon
24GB RAM
CentOS 5.6
20TB of storage
1B files
RAID6, 14 drives, SATA

Output of "xfs_info /dev/sda1":
meta-data=/dev/sda1              isize=256    agcount=32, agsize=152575999 blks
         =                       sectsz=512   attr=0
data     =                       bsize=4096   blocks=4882431968, imaxpct=25
         =                       sunit=0      swidth=0 blks, unwritten=1
naming   =version 2              bsize=4096
log      =internal               bsize=4096   blocks=32768, version=1
         =                       sectsz=512   sunit=0 blks, lazy-count=0
realtime =none                   extsz=4096   blocks=0, rtextents=0

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs


* Re: 1B files, slow file creation, only AG0 used
  2012-03-10  2:13 1B files, slow file creation, only AG0 used Michael Spiegle
@ 2012-03-10  4:59 ` Eric Sandeen
  2012-03-10  5:25   ` Michael Spiegle
  2012-03-12  0:56 ` Dave Chinner
  1 sibling, 1 reply; 8+ messages in thread
From: Eric Sandeen @ 2012-03-10  4:59 UTC (permalink / raw)
  To: mike; +Cc: xfs

On 3/9/12 8:13 PM, Michael Spiegle wrote:
> We're seeing some very strange behavior with XFS on the default kernel
> for CentOS 5.6 (note: I have also tried 3.2.9 and witnessed the same issue).

On CentOS, please be sure you're not using the old xfs kmod package, just FYI.
The module shipped with the kernel is what you should use.

>  The dataset on this server is about 1B small files (anywhere from 1KB
> to 50KB).  We first noticed it when creating files in a directory.  A
> simple 'touch' would take over 300ms on a completely idle system.  If
> I simply create a different directory, touching files is 1ms or
> faster.  Example:
> 
> # time touch 0
> real    0m0.323s
> user    0m0.000s
> sys     0m0.323s
> 
> # mkdir tmp2
> # time touch tmp2/0
> real    0m0.001s
> user    0m0.000s
> sys     0m0.000s

If anything this is a testament to XFS scalability, if it can make the billion-and-first inode in a single dir in "only" 300ms ;)

You might want to read up on the available docs, i.e.

http://xfs.org/docs/xfsdocs-xml-dev/XFS_User_Guide//tmp/en-US/html/xfs-allocators.html

probably covers a lot of what you are wondering about.

When you make a new dir, in general XFS will put inodes & files in that dir into a new
AG.

But another thing you are likely running into is the inode32 allocation behavior, also explained in the doc above.
In that case inodes are kept in the lower AGs, and data is sprinkled around the higher AGs.

> We've done quite a bit of testing and debugging, and while we don't
> have an answer yet, we've noticed that our filesystem was created with
> the default of 32 AGs.  When using xfs_db, we notice that all
> allocations appear to be in AG0 only.  We've also noticed during
> testing that if we create 512 AGs, the distribution appears to be
> better.  It seems that the AG is actually encoded into the inode, and
> the XFS_INO_TO_AGNO(mp,i) macro is used to determine the AG by
> performing a bitshift.  

Right, the physical location of the inode can be determined from the inode number itself + fs geometry. This is why the default behavior of restricting inodes to 32 bits keeps them all in lower disk blocks; in your case, the lowest AG.

> In our case, the bitshift appears to be
> 32 bits, and since our inode numbers are 32 bits, we always end up with AG0.
> Does anyone know if our slow file creation issue is related to our use
> of AG0, and if so, what's the best way to utilize additional AGs?

If you mount with -o inode64, inodes may be allocated anywhere on the fs, in any AG. New subdirs go to new AGs, activity will be distributed across the filesystem. As long as your applications can properly handle 64-bit inode numbers, this is probably the way to go.

You would be better off not creating all billion in a single dir, as well.
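
To make the contrast concrete, here's a tiny sketch (hypothetical inode
numbers; the 32-bit shift comes from your agsize/isize geometry):

```python
AGINO_BITS = 32   # for your geometry: agblklog (28) + inopblog (4)

def ino_to_agno(ino):
    # The AG number lives in the high bits of the inode number
    return ino >> AGINO_BITS

# inode32: inode numbers capped at 32 bits -> high bits always zero -> AG0
print(ino_to_agno(0xFFFFFFFF))       # 0

# inode64: inode numbers may carry nonzero high bits -> any AG
print(ino_to_agno(5 << AGINO_BITS))  # 5
```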

> Per-AG counts:
> # for x in {0..31}; do echo -n "${x}: "; xfs_db -c "agi ${x}" -c
> "print" -r /dev/sda1 | grep "^count"; done
> 0: count = 1098927744
> 1: count = 0
> 2: count = 0
... <snip> ...

> 29: count = 0
> 30: count = 0
> 31: count = 0
> 
> Some general stats on the server:
> 24x Xeon
> 24GB RAM
> CentOS 5.6
> 20TB of storage
> 1B files
> RAID6, 14 drives, SATA
> 
> Output of "xfs_info /dev/sda1":
> meta-data=/dev/sda1              isize=256    agcount=32, agsize=152575999 blks
>          =                       sectsz=512   attr=0

I wonder why you have attr=0 and 32 AGs; pretty old xfsprogs maybe.

> data     =                       bsize=4096   blocks=4882431968, imaxpct=25
>          =                       sunit=0      swidth=0 blks, unwritten=1

You probably would be better off telling mkfs.xfs what your stripe geometry is, as well.

-Eric

> naming   =version 2              bsize=4096
> log      =internal               bsize=4096   blocks=32768, version=1
>          =                       sectsz=512   sunit=0 blks, lazy-count=0
> realtime =none                   extsz=4096   blocks=0, rtextents=0
> 

* Re: 1B files, slow file creation, only AG0 used
  2012-03-10  4:59 ` Eric Sandeen
@ 2012-03-10  5:25   ` Michael Spiegle
  2012-03-12  2:59     ` Stan Hoeppner
  0 siblings, 1 reply; 8+ messages in thread
From: Michael Spiegle @ 2012-03-10  5:25 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: xfs

On Fri, Mar 9, 2012 at 8:59 PM, Eric Sandeen <sandeen@sandeen.net> wrote:
> On CentOS, please be sure you're not using the old xfs kmod package, just FYI.
> The module shipped with the kernel is what you should use.

We're definitely using the XFS module that came with the kernel.  In
any case, we're seeing problems with the 3.2.9 kernel as well, but
maybe this is due to running mkfs.xfs from CentOS 5.6.

>
> If anything this is a testament to XFS scalability, if it can make the billion-and-first inode in a single dir in "only" 300ms ;)
>
> You might want to read up on the available docs, i.e.
>
> http://xfs.org/docs/xfsdocs-xml-dev/XFS_User_Guide//tmp/en-US/html/xfs-allocators.html
>
> probably covers a lot of what you are wondering about.
>
> When you make a new dir, in general XFS will put inodes & files in that dir into a new
> AG.
>
> But another thing you are likely running into is the inode32 allocation behavior, also explained in the doc above.
> In that case inodes are kept in the lower AGs, and data is sprinkled around the higher AGs.
>

Our "1B" files are spread out evenly into a tree of 65536 directories.
 I've read the docs, and they seem explicit about new directories
being created in new AGs, however we are not seeing that on our
system.  All 1B files (despite being spread out across more than 64K
dirs) are in the first AG.  I have tried remounting the filesystem
with inode64 (and on 3.2.9), but this behavior does not seem to change
even if I add more files afterwards.

>
> If you mount with -o inode64, inodes may be allocated anywhere on the fs, in any AG. New subdirs go to new AGs, activity will be distributed across the filesystem. As long as your applications can properly handle 64-bit inode numbers, this is probably the way to go.
>
> You would be better off not creating all billion in a single dir, as well.
>

As mentioned above, the inode64 mount option doesn't seem to affect
anything.  Can you think of anything else I should check that would
prevent this from working?

>
> I wonder why you have attr=0 and 32 AGs; pretty old xfsprogs maybe.
>

Yep, the filesystem was created with the xfsprogs that came with
CentOS 5.6.  Unfortunately, it took us a long time to copy all of the
files to this filesystem, so we're looking for a way to fix this
without having to reformat.


* Re: 1B files, slow file creation, only AG0 used
  2012-03-10  2:13 1B files, slow file creation, only AG0 used Michael Spiegle
  2012-03-10  4:59 ` Eric Sandeen
@ 2012-03-12  0:56 ` Dave Chinner
  2012-03-12 21:54   ` Michael Spiegle
  1 sibling, 1 reply; 8+ messages in thread
From: Dave Chinner @ 2012-03-12  0:56 UTC (permalink / raw)
  To: Michael Spiegle; +Cc: xfs

On Fri, Mar 09, 2012 at 06:13:27PM -0800, Michael Spiegle wrote:
> We're seeing some very strange behavior with XFS on the default kernel
> for CentOS 5.6 (note: I have also tried 3.2.9 and witnessed the same issue).
>  The dataset on this server is about 1B small files (anywhere from 1KB
> to 50KB).  We first noticed it when creating files in a directory.  A
> simple 'touch' would take over 300ms on a completely idle system.  If
> I simply create a different directory, touching files is 1ms or
> faster.  Example:
> 
> # time touch 0
> real    0m0.323s
> user    0m0.000s
> sys     0m0.323s
> 
> # mkdir tmp2
> # time touch tmp2/0
> real    0m0.001s
> user    0m0.000s
> sys     0m0.000s

Entirely normal.  Some operations require IO to complete (e.g.
reading directory blocks to find where to insert the new entry),
while adding the first file to a directory generally requires zero
IO.  You're seeing the difference between cold cache and hot cache
performance.

> We've done quite a bit of testing and debugging, and while we don't
> have an answer yet, we've noticed that our filesystem was created with
> the default of 32 AGs.  When using xfs_db, we notice that all
> allocations appear to be in AG0 only.

Go look up what the inode32 and inode64 mount options do. The
default is inode32....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: 1B files, slow file creation, only AG0 used
  2012-03-10  5:25   ` Michael Spiegle
@ 2012-03-12  2:59     ` Stan Hoeppner
  2012-03-12 22:11       ` Michael Spiegle
  0 siblings, 1 reply; 8+ messages in thread
From: Stan Hoeppner @ 2012-03-12  2:59 UTC (permalink / raw)
  To: mike; +Cc: Eric Sandeen, xfs

On 3/9/2012 11:25 PM, Michael Spiegle wrote:

> Our "1B" files are spread out evenly into a tree of 65536 directories.
>  I've read the docs, and they seem explicit about new directories
> being created in new AGs, however we are not seeing that on our
> system.  All 1B files (despite being spread out across more than 64K
> dirs) are in the first AG.  I have tried remounting the filesystem
> with inode64 (and on 3.2.9), but this behavior does not seem to change
> even if I add more files afterwards.

Given that you're using the inode32 allocator with 64k dirs, and seeing no
files in AGs other than AG0, might this suggest that these are
zero-length files or similar, being stored entirely within the directory
inodes and thus occupying no extents in other AGs?  Would that
explain why 'everything' is in AG0?

> As mentioned above, the inode64 mount option doesn't seem to affect
> anything.  Can you think of anything else I should check that would
> prevent this from working?

If my guess about these files is correct, mounting with inode64 and
writing additional files should create new directory inodes in other
AGs, but you still won't see file extents in those other AGs, just as
you don't in the first AG.

Are you using this XFS filesystem as a poor man's database or something
similar?  This would tend to explain a billion files, with no extents,
wholly stored in directory inodes, only in AG0, while using the inode32
allocator.

-- 
Stan


* Re: 1B files, slow file creation, only AG0 used
  2012-03-12  0:56 ` Dave Chinner
@ 2012-03-12 21:54   ` Michael Spiegle
  2012-03-13  0:08     ` Dave Chinner
  0 siblings, 1 reply; 8+ messages in thread
From: Michael Spiegle @ 2012-03-12 21:54 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

I believe we figured out what was going wrong:
1) You definitely need inode64 as a mount option
2) It seems that the AG metadata was being cached.  We had to unmount
the filesystem and remount it to get updated counts on per-AG usage.

For the moment, I've written a script to copy/rename/delete our files
so that they are gradually migrated to new AGs.  FWIW, I noticed that
this operation is significantly faster on an EL6.2-based kernel
(2.6.32) compared to EL5 (2.6.18).  I'm also using the 'delaylog'
mount option which probably helps a bit.  I still have a few other
curiosities about this particular issue though:

On Sun, Mar 11, 2012 at 5:56 PM, Dave Chinner <david@fromorbit.com> wrote:
>
> Entirely normal.  Some operations require IO to complete (e.g.
> reading directory blocks to find where to insert the new entry),
> while adding the first file to a directory generally requires zero
> IO.  You're seeing the difference between cold cache and hot cache
> performance.
>

In this situation, any files written to the same directory exhibited
this issue regardless of cache state.  For example:

Takes 300ms to complete:
touch tmp/0

Takes 600ms to complete:
touch tmp/0 tmp/1

Takes 1200ms to complete:
touch tmp/0 tmp/1 tmp/2 tmp/3

I would expect the directory to be cached after the first file is
created.  I don't understand why all subsequent writes were affected
as well.

>
> Go look up what the inode32 and inode64 mount options do. The
> default is inode32....
>

So now that we're mounting inode64, I wonder if we'll see degraded
performance in the future due to a sub-optimal on-disk layout of our
data.  Even though I am migrating data to other AGs, will there be any
permanent "damage" to AG0 since it had to allocate 1B inodes?  What
happens to all of that metadata when the files are removed?
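
For reference, the migration pass I mentioned above boils down to something
like this (simplified, hypothetical sketch; the real script adds batching
and error handling):

```python
import os
import shutil

def migrate(path):
    """Rewrite a file so it gets a fresh inode, which the inode64
    allocator is then free to place in a different AG."""
    tmp = path + ".migrate"   # hypothetical temp-name convention
    shutil.copy2(path, tmp)   # copy: new file -> new inode
    os.rename(tmp, path)      # rename over the original; the old inode is freed
```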

Regards,
Mike


* Re: 1B files, slow file creation, only AG0 used
  2012-03-12  2:59     ` Stan Hoeppner
@ 2012-03-12 22:11       ` Michael Spiegle
  0 siblings, 0 replies; 8+ messages in thread
From: Michael Spiegle @ 2012-03-12 22:11 UTC (permalink / raw)
  To: stan; +Cc: xfs

On Sun, Mar 11, 2012 at 7:59 PM, Stan Hoeppner <stan@hardwarefreak.com> wrote:
>
> Given that you're using the inode32 allocator with 64k dirs, and seeing no
> files in AGs other than AG0, might this suggest that these are
> zero-length files or similar, being stored entirely within the directory
> inodes and thus occupying no extents in other AGs?  Would that
> explain why 'everything' is in AG0?
>

If any of our files are 0 bytes, then that would probably be an error.
Our files are anywhere from 1KB to 50KB.  There could be files outside
of that range, but they're at the edges of the bell curve.

>
> If my guess about these files is correct, mounting with inode64 and
> writing additional files should create new directory inodes in other
> AGs, but you still won't see file extents in those other AGs, just as
> you don't in the first AG.
>
> Are you using this XFS filesystem as a poor man's database or something
> similar?  This would tend to explain a billion files, with no extents,
> wholly stored in directory inodes, only in AG0, while using the inode32
> allocator.
>
> --
> Stan

I mentioned it in another reply, but we just found out that the AG
information appeared to be cached.  We had to unmount/remount the
filesystem in order to get updated counts.  We do see other AGs being
used when mounting with inode64.

No, we have a database for the database stuff =)  These are legitimate
pieces of content that we're storing for our users (many many billions
of files).  We were using EXT3, but recently moved to XFS.  It's nice
to not be constrained by statically-allocated inodes!


* Re: 1B files, slow file creation, only AG0 used
  2012-03-12 21:54   ` Michael Spiegle
@ 2012-03-13  0:08     ` Dave Chinner
  0 siblings, 0 replies; 8+ messages in thread
From: Dave Chinner @ 2012-03-13  0:08 UTC (permalink / raw)
  To: Michael Spiegle; +Cc: xfs

On Mon, Mar 12, 2012 at 02:54:20PM -0700, Michael Spiegle wrote:
> I believe we figured out what was going wrong:
> 1) You definitely need inode64 as a mount option
> 2) It seems that the AG metadata was being cached.  We had to unmount
> the filesystem and remount it to get updated counts on per-AG usage.

If you were looking at it with xfs_db, then yes, that is what will
happen. Use "echo 1 > /proc/sys/vm/drop_caches" to get the cached
metadata dropped.

> For the moment, I've written a script to copy/rename/delete our files
> so that they are gradually migrated to new AGs.  FWIW, I noticed that
> this operation is significantly faster on an EL6.2-based kernel
> (2.6.32) compared to EL5 (2.6.18).  I'm also using the 'delaylog'
> mount option which probably helps a bit.  I still have a few other
> curiosities about this particular issue though:
> 
> On Sun, Mar 11, 2012 at 5:56 PM, Dave Chinner <david@fromorbit.com> wrote:
> >
> > Entirely normal.  Some operations require IO to complete (e.g.
> > reading directory blocks to find where to insert the new entry),
> > while adding the first file to a directory generally requires zero
> > IO.  You're seeing the difference between cold cache and hot cache
> > performance.
> >
> 
> In this situation, any files written to the same directory exhibited
> this issue regardless of cache state.  For example:
> 
> Takes 300ms to complete:
> touch tmp/0
> 
> Takes 600ms to complete:
> touch tmp/0 tmp/1
> 
> Takes 1200ms to complete:
> touch tmp/0 tmp/1 tmp/2 tmp/3
> 
> I would expect the directory to be cached after the first file is
> created.  I don't understand why all subsequent writes were affected
> as well.

I don't have enough information to help you.  I don't know what
hardware you are running on, how big the directory is, what the
layout of the directory is, etc.  The "needs to do IO" was simply a
SWAG....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
