* bad performance on touch/cp file on XFS system
@ 2014-08-25 3:34 Zhang Qiang
2014-08-25 5:18 ` Dave Chinner
0 siblings, 1 reply; 15+ messages in thread
From: Zhang Qiang @ 2014-08-25 3:34 UTC (permalink / raw)
To: xfs
Dear XFS community & developers,

I am using CentOS 6.3 with XFS as the base file system, on RAID5 hardware
storage.

Detailed environment as follows:
OS: CentOS 6.3
Kernel: kernel-2.6.32-279.el6.x86_64
XFS mount info (df output): /dev/sdb1 on /data type xfs
(rw,noatime,nodiratime,nobarrier)

Detailed phenomenon:
# df
Filesystem Size Used Avail Use% Mounted on
/dev/sda1 29G 17G 11G 61% /
/dev/sdb1 893G 803G 91G 90% /data
/dev/sda4 2.2T 1.6T 564G 75% /data1
# time touch /data1/1111
real 0m23.043s
user 0m0.001s
sys 0m0.349s
# perf top
Events: 6K cycles
16.96% [xfs] [k] xfs_inobt_get_rec
11.95% [xfs] [k] xfs_btree_increment
11.16% [xfs] [k] xfs_btree_get_rec
7.39% [xfs] [k] xfs_btree_get_block
5.02% [xfs] [k] xfs_dialloc
4.87% [xfs] [k] xfs_btree_rec_offset
4.33% [xfs] [k] xfs_btree_readahead
4.13% [xfs] [k] _xfs_buf_find
4.05% [kernel] [k] intel_idle
2.89% [xfs] [k] xfs_btree_rec_addr
1.04% [kernel] [k] kmem_cache_free
It seems that some XFS kernel functions spend a lot of time
(xfs_inobt_get_rec, xfs_btree_increment, etc.)

I found a bug in bugzilla [1]; is it the same issue as this one?

Any constructive suggestions about this issue would be greatly appreciated,
as it's really hard to reproduce on another system and it's not possible
to upgrade that online machine.
[1] https://bugzilla.redhat.com/show_bug.cgi?format=multiple&id=813137
Thanks in advance
Qiang
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: bad performance on touch/cp file on XFS system
  2014-08-25  3:34 bad performance on touch/cp file on XFS system Zhang Qiang
@ 2014-08-25  5:18 ` Dave Chinner
  2014-08-25  8:09   ` Zhang Qiang
  2014-08-25  8:47   ` Zhang Qiang
  0 siblings, 2 replies; 15+ messages in thread

From: Dave Chinner @ 2014-08-25 5:18 UTC (permalink / raw)
To: Zhang Qiang; +Cc: xfs

On Mon, Aug 25, 2014 at 11:34:34AM +0800, Zhang Qiang wrote:
> Dear XFS community & developers,
>
> I am using CentOS 6.3 and xfs as base file system and use RAID5 as hardware
> storage.
>
> Detail environment as follow:
> OS: CentOS 6.3
> Kernel: kernel-2.6.32-279.el6.x86_64
> XFS option info(df output): /dev/sdb1 on /data type xfs
> (rw,noatime,nodiratime,nobarrier)
....
> # time touch /data1/1111
> real 0m23.043s
> user 0m0.001s
> sys 0m0.349s
>
> # perf top
> Events: 6K cycles
> 16.96% [xfs] [k] xfs_inobt_get_rec
> 11.95% [xfs] [k] xfs_btree_increment
> 11.16% [xfs] [k] xfs_btree_get_rec
....
> It seems that some xfs kernel function spend much time (xfs_inobt_get_rec,
> xfs_btree_increment, etc.)
>
> I found a bug in bugzilla [1], is that is the same issue like this?

No.

> It's very greatly appreciated if you can give constructive suggestion about
> this issue, as It's really hard to reproduce from another system and it's
> not possible to do upgrade on that online machine.

You've got very few free inodes, widely distributed in the allocated
inode btree. The CPU time above is the btree search for the next
free inode.

This is the issue solved by this series of recent commits to add a
new on-disk free inode btree index:

53801fd xfs: enable the finobt feature on v5 superblocks
0c153c1 xfs: report finobt status in fs geometry
a3fa516 xfs: add finobt support to growfs
3efa4ff xfs: update the finobt on inode free
2b64ee5 xfs: refactor xfs_difree() inobt bits into xfs_difree_inobt() helper
6dd8638 xfs: use and update the finobt on inode allocation
0aa0a75 xfs: insert newly allocated inode chunks into the finobt
9d43b18 xfs: update inode allocation/free transaction reservations for finobt
aafc3c2 xfs: support the XFS_BTNUM_FINOBT free inode btree type
8e2c84d xfs: reserve v5 superblock read-only compat. feature bit for finobt
57bd3db xfs: refactor xfs_ialloc_btree.c to support multiple inobt numbers

Which is of no help to you, however, because it's not available in
any CentOS kernel.

There's really not much you can do to avoid the problem once you've
punched random freespace holes in the allocated inode btree. It
generally doesn't affect many people; those that it does affect are
normally using XFS as an object store indexed by a hard link farm
(e.g. various backup programs do this).

If you dump the superblock via xfs_db, the difference between icount
and ifree will give you an idea of how much "needle in a haystack"
searching is going on. You can probably narrow it down to a specific
AG by dumping the AGI headers and checking the same thing. Filling
in all the holes (by creating a bunch of zero length files in the
appropriate AGs) might take some time, but it should make the
problem go away until you remove more files and create random
free inode holes again...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
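The icount/ifree check described above can be scripted. A minimal sketch
(Python for illustration only; the `icount`/`ifree` field names come from
the xfs_db superblock dumps later in this thread, but the parsing here is
an assumption about captured text output, not an xfs_db interface):

```python
# Parse "icount"/"ifree" from captured `xfs_db -c 'sb 0' -c p` output and
# report how sparse the free-inode population is: the fewer free inodes
# relative to allocated ones, the worse the "needle in a haystack" search.
import re

def inode_stats(sb_dump: str):
    """Return (icount, ifree, free fraction) from xfs_db-style 'field = value' text."""
    fields = dict(re.findall(r"^(\w+) = (\d+)$", sb_dump, re.M))
    icount, ifree = int(fields["icount"]), int(fields["ifree"])
    return icount, ifree, ifree / icount

# Sample values taken from the superblock dump later in this thread.
sample = """\
icount = 220619904
ifree = 26202919
"""

icount, ifree, frac = inode_stats(sample)
print(f"allocated (in-use) inodes: {icount - ifree}")
print(f"free fraction: {frac:.2%}")
```

The same parsing could be pointed at per-AG `agi` dumps to find which AG
carries the sparse free inodes.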
* Re: bad performance on touch/cp file on XFS system
  2014-08-25  5:18 ` Dave Chinner
@ 2014-08-25  8:09   ` Zhang Qiang
  2014-08-25  8:56     ` Dave Chinner
  2014-08-25  8:47   ` Zhang Qiang
  1 sibling, 1 reply; 15+ messages in thread

From: Zhang Qiang @ 2014-08-25 8:09 UTC (permalink / raw)
To: Dave Chinner; +Cc: xfs

Thanks for your quick and clear response. Some comments below:

2014-08-25 13:18 GMT+08:00 Dave Chinner <david@fromorbit.com>:
> On Mon, Aug 25, 2014 at 11:34:34AM +0800, Zhang Qiang wrote:
> > Dear XFS community & developers,
....
> You've got very few free inodes, widely distributed in the allocated
> inode btree. The CPU time above is the btree search for the next
> free inode.
>
> This is the issue solved by this series of recent commits to add a
> new on-disk free inode btree index:

[Qiang] This means that if I want to fix this issue, I have to apply the
following patches and build my own kernel. As the on-disk structure has
been changed, should I also re-create the xfs filesystem? Is there any
user space tool to convert the old on-disk filesystem to the new format,
without having to back up and restore the current data?

> 53801fd xfs: enable the finobt feature on v5 superblocks
....
> 57bd3db xfs: refactor xfs_ialloc_btree.c to support multiple inobt numbers
>
> Which is of no help to you, however, because it's not available in
> any CentOS kernel.

[Qiang] Do you think it's possible to just backport these patches to
kernel 2.6.32 (CentOS 6.3) to fix this issue? Or would it be better to
backport them to the 3.10 kernel used in CentOS 7.0?

> There's really not much you can do to avoid the problem once you've
> punched random freespace holes in the allocated inode btree. It
> generally doesn't affect many people; those that it does affect are
> normally using XFS as an object store indexed by a hard link farm
> (e.g. various backup programs do this).

OK, I see.

Could you please guide me on how to reproduce this issue easily? I have
tried with a 500G xfs partition and used about 98% of the space, but
still can't reproduce it. Is there any easy way you can think of?

> If you dump the superblock via xfs_db, the difference between icount
> and ifree will give you an idea of how much "needle in a haystack"
> searching is going on. You can probably narrow it down to a specific
> AG by dumping the AGI headers and checking the same thing. Filling
> in all the holes (by creating a bunch of zero length files in the
> appropriate AGs) might take some time, but it should make the
> problem go away until you remove more files and create random
> free inode holes again...

I will try to investigate the issue in detail. Thanks for your kind
response.

Qiang
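The hole-filling workaround quoted above (a bunch of zero-length files)
can be sketched as follows. This is a hedged, scaled-down illustration:
the directory and file count are placeholders, and on the real filesystem
the files would have to be created in directories that land in the
affected AG, which this sketch does not control:

```python
# Create many zero-length files so scattered free-inode slots get
# consumed, shrinking the free-inode search in the AGI btree.
import os
import tempfile

def fill_inode_holes(directory: str, count: int) -> int:
    """Create up to `count` empty files; return how many were newly created."""
    created = 0
    for i in range(count):
        path = os.path.join(directory, f"fill.{i:08d}")
        try:
            # O_EXCL so an existing file is never clobbered
            fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        except FileExistsError:
            continue
        os.close(fd)
        created += 1
    return created

with tempfile.TemporaryDirectory() as d:
    print(fill_inode_holes(d, 100))  # → 100
```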
* Re: bad performance on touch/cp file on XFS system
  2014-08-25  8:09   ` Zhang Qiang
@ 2014-08-25  8:56     ` Dave Chinner
  2014-08-25  9:05       ` Zhang Qiang
  0 siblings, 1 reply; 15+ messages in thread

From: Dave Chinner @ 2014-08-25 8:56 UTC (permalink / raw)
To: Zhang Qiang; +Cc: xfs

On Mon, Aug 25, 2014 at 04:09:05PM +0800, Zhang Qiang wrote:
> Thanks for your quick and clear response. Some comments below:
....
> [Qiang] This means that if I want to fix this issue, I have to apply the
> following patches and build my own kernel.

Yes. Good luck, even I wouldn't attempt to do that.

And then use xfsprogs 3.2.1, and make a new filesystem that enables
metadata CRCs and the free inode btree feature.

> As the on-disk structure has been changed, should I also re-create the
> xfs filesystem?

Yes, you need to download the latest xfsprogs (3.2.1) to be able to
make it with the necessary feature bits set.

> Is there any user space tool to convert the old on-disk filesystem to
> the new format, without having to back up and restore the current data?

No, we don't write utilities to mangle on disk formats. dump, mkfs
and restore is far more reliable than any "in-place conversion" code
we could write. It will probably be faster, too.

> > Which is of no help to you, however, because it's not available in
> > any CentOS kernel.
>
> [Qiang] Do you think it's possible to just backport these patches to
> kernel 2.6.32 (CentOS 6.3) to fix this issue? Or would it be better to
> backport them to the 3.10 kernel used in CentOS 7.0?

You can try, but if you break it you get to keep all the pieces
yourself. Eventually someone who maintains the RHEL code will do a
backport that will trickle down to CentOS. If you need it any
sooner, then you'll need to do it yourself, or upgrade to RHEL
and ask your support contact for it to be included in RHEL 7.1....

> Could you please guide me on how to reproduce this issue easily? I have
> tried with a 500G xfs partition and used about 98% of the space, but
> still can't reproduce it. Is there any easy way you can think of?

Search the archives for the test cases that were used for the patch
set. There's a performance test case documented in the review
discussions.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
* Re: bad performance on touch/cp file on XFS system
  2014-08-25  8:56     ` Dave Chinner
@ 2014-08-25  9:05       ` Zhang Qiang
  0 siblings, 0 replies; 15+ messages in thread

From: Zhang Qiang @ 2014-08-25 9:05 UTC (permalink / raw)
To: Dave Chinner; +Cc: xfs

Great, thank you.

From my xfs_db debugging, I found I have icount and ifree as follows:

icount = 220619904
ifree = 26202919

So free inodes make up about 10%, which is not so few. Are you still
sure the patches can fix this issue?

Here's the detailed xfs_db info:

# mount /dev/sda4 /data1/
# xfs_info /data1/
meta-data=/dev/sda4              isize=256    agcount=4, agsize=142272384 blks
         =                       sectsz=512   attr=2, projid32bit=0
data     =                       bsize=4096   blocks=569089536, imaxpct=5
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal               bsize=4096   blocks=277875, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
# umount /dev/sda4
# xfs_db /dev/sda4
xfs_db> sb 0
xfs_db> p
magicnum = 0x58465342
blocksize = 4096
dblocks = 569089536
rblocks = 0
rextents = 0
uuid = 13ecf47b-52cf-4944-9a71-885bddc5e008
logstart = 536870916
rootino = 128
rbmino = 129
rsumino = 130
rextsize = 1
agblocks = 142272384
agcount = 4
rbmblocks = 0
logblocks = 277875
versionnum = 0xb4a4
sectsize = 512
inodesize = 256
inopblock = 16
fname = "\000\000\000\000\000\000\000\000\000\000\000\000"
blocklog = 12
sectlog = 9
inodelog = 8
inopblog = 4
agblklog = 28
rextslog = 0
inprogress = 0
imax_pct = 5
icount = 220619904
ifree = 26202919
fdblocks = 147805479
frextents = 0
uquotino = 0
gquotino = 0
qflags = 0
flags = 0
shared_vn = 0
inoalignmt = 2
unit = 0
width = 0
dirblklog = 0
logsectlog = 0
logsectsize = 0
logsunit = 1
features2 = 0xa
bad_features2 = 0xa
xfs_db> sb 1
xfs_db> p
magicnum = 0x58465342
blocksize = 4096
dblocks = 569089536
rblocks = 0
rextents = 0
uuid = 13ecf47b-52cf-4944-9a71-885bddc5e008
logstart = 536870916
rootino = 128
rbmino = null
rsumino = null
rextsize = 1
agblocks = 142272384
agcount = 4
rbmblocks = 0
logblocks = 277875
versionnum = 0xb4a4
sectsize = 512
inodesize = 256
inopblock = 16
fname = "\000\000\000\000\000\000\000\000\000\000\000\000"
blocklog = 12
sectlog = 9
inodelog = 8
inopblog = 4
agblklog = 28
rextslog = 0
inprogress = 1
imax_pct = 5
icount = 0
ifree = 0
fdblocks = 568811645
frextents = 0
uquotino = 0
gquotino = 0
qflags = 0
flags = 0
shared_vn = 0
inoalignmt = 2
unit = 0
width = 0
dirblklog = 0
logsectlog = 0
logsectsize = 0
logsunit = 1
features2 = 0xa
bad_features2 = 0xa
xfs_db> sb 2
xfs_db> p
magicnum = 0x58465342
blocksize = 4096
dblocks = 569089536
rblocks = 0
rextents = 0
uuid = 13ecf47b-52cf-4944-9a71-885bddc5e008
logstart = 536870916
rootino = null
rbmino = null
rsumino = null
rextsize = 1
agblocks = 142272384
agcount = 4
rbmblocks = 0
logblocks = 277875
versionnum = 0xb4a4
sectsize = 512
inodesize = 256
inopblock = 16
fname = "\000\000\000\000\000\000\000\000\000\000\000\000"
blocklog = 12
sectlog = 9
inodelog = 8
inopblog = 4
agblklog = 28
rextslog = 0
inprogress = 1
imax_pct = 5
icount = 0
ifree = 0
fdblocks = 568811645
frextents = 0
uquotino = 0
gquotino = 0
qflags = 0
flags = 0
shared_vn = 0
inoalignmt = 2
unit = 0
width = 0
dirblklog = 0
logsectlog = 0
logsectsize = 0
logsunit = 1
features2 = 0xa
bad_features2 = 0xa
xfs_db> sb 3
xfs_db> p
magicnum = 0x58465342
blocksize = 4096
dblocks = 569089536
rblocks = 0
rextents = 0
uuid = 13ecf47b-52cf-4944-9a71-885bddc5e008
logstart = 536870916
rootino = 128
rbmino = null
rsumino = null
rextsize = 1
agblocks = 142272384
agcount = 4
rbmblocks = 0
logblocks = 277875
versionnum = 0xb4a4
sectsize = 512
inodesize = 256
inopblock = 16
fname = "\000\000\000\000\000\000\000\000\000\000\000\000"
blocklog = 12
sectlog = 9
inodelog = 8
inopblog = 4
agblklog = 28
rextslog = 0
inprogress = 1
imax_pct = 5
icount = 0
ifree = 0
fdblocks = 568811645
frextents = 0
uquotino = 0
gquotino = 0
qflags = 0
flags = 0
shared_vn = 0
inoalignmt = 2
unit = 0
width = 0
dirblklog = 0
logsectlog = 0
logsectsize = 0
logsunit = 1
features2 = 0xa
bad_features2 = 0xa

Thanks
Qiang

2014-08-25 16:56 GMT+08:00 Dave Chinner <david@fromorbit.com>:
> On Mon, Aug 25, 2014 at 04:09:05PM +0800, Zhang Qiang wrote:
> > Thanks for your quick and clear response. Some comments below:
....
* Re: bad performance on touch/cp file on XFS system
  2014-08-25  5:18 ` Dave Chinner
  2014-08-25  8:09   ` Zhang Qiang
@ 2014-08-25  8:47   ` Zhang Qiang
  2014-08-25  9:08     ` Dave Chinner
  1 sibling, 1 reply; 15+ messages in thread

From: Zhang Qiang @ 2014-08-25 8:47 UTC (permalink / raw)
To: Dave Chinner; +Cc: xfs

I have checked icount and ifree, and found that about 11.8 percent are
free, so free inodes should not be too few.

Here's the detailed log, any new clue?

# mount /dev/sda4 /data1/
# xfs_info /data1/
meta-data=/dev/sda4              isize=256    agcount=4, agsize=142272384 blks
         =                       sectsz=512   attr=2, projid32bit=0
data     =                       bsize=4096   blocks=569089536, imaxpct=5
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal               bsize=4096   blocks=277875, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
# umount /dev/sda4
# xfs_db /dev/sda4
xfs_db> sb 0
xfs_db> p
magicnum = 0x58465342
blocksize = 4096
dblocks = 569089536
rblocks = 0
rextents = 0
uuid = 13ecf47b-52cf-4944-9a71-885bddc5e008
logstart = 536870916
rootino = 128
rbmino = 129
rsumino = 130
rextsize = 1
agblocks = 142272384
agcount = 4
rbmblocks = 0
logblocks = 277875
versionnum = 0xb4a4
sectsize = 512
inodesize = 256
inopblock = 16
fname = "\000\000\000\000\000\000\000\000\000\000\000\000"
blocklog = 12
sectlog = 9
inodelog = 8
inopblog = 4
agblklog = 28
rextslog = 0
inprogress = 0
imax_pct = 5
icount = 220619904
ifree = 26202919
fdblocks = 147805479
frextents = 0
uquotino = 0
gquotino = 0
qflags = 0
flags = 0
shared_vn = 0
inoalignmt = 2
unit = 0
width = 0
dirblklog = 0
logsectlog = 0
logsectsize = 0
logsunit = 1
features2 = 0xa
bad_features2 = 0xa
xfs_db> sb 1
xfs_db> p
....
[secondary superblocks sb 1, sb 2 and sb 3 show inprogress = 1, icount = 0,
ifree = 0, as in the sibling reply in this thread]

2014-08-25 13:18 GMT+08:00 Dave Chinner <david@fromorbit.com>:
> On Mon, Aug 25, 2014 at 11:34:34AM +0800, Zhang Qiang wrote:
> > Dear XFS community & developers,
....
* Re: bad performance on touch/cp file on XFS system
  2014-08-25  8:47   ` Zhang Qiang
@ 2014-08-25  9:08     ` Dave Chinner
  2014-08-25 10:31       ` Zhang Qiang
  0 siblings, 1 reply; 15+ messages in thread

From: Dave Chinner @ 2014-08-25 9:08 UTC (permalink / raw)
To: Zhang Qiang; +Cc: xfs

On Mon, Aug 25, 2014 at 04:47:39PM +0800, Zhang Qiang wrote:
> I have checked icount and ifree, and found that about 11.8 percent are
> free, so free inodes should not be too few.
>
> Here's the detailed log, any new clue?
>
> # mount /dev/sda4 /data1/
> # xfs_info /data1/
> meta-data=/dev/sda4              isize=256    agcount=4, agsize=142272384

4 AGs

> icount = 220619904
> ifree = 26202919

And 220 million inodes. There's your problem - that's an average
of 55 million inodes per AGI btree assuming you are using inode64.
If you are using inode32, then the inodes will be in 2 btrees, or
maybe even only one.

Anyway you look at it, searching btrees with tens of millions of
entries is going to consume a *lot* of CPU time. So, really, the
state your fs is in is probably unfixable without mkfs. And really,
that's probably pushing the boundaries of what xfsdump and
xfsrestore can support - it's going to take a long time to dump and
restore that data....

With that many inodes, I'd be considering moving to 32 or 64 AGs to
keep the btree size down to a more manageable size. The free inode
btree would also help, but, really, 220M inodes in a 2TB filesystem
is really pushing the boundaries of sanity.....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
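The arithmetic behind the 55-million figure above, as a quick sketch
(icount taken from the superblock dump in this thread; evenly-spread
inodes are an assumption, since inode32 can concentrate them in fewer
AGs):

```python
# With icount inodes spread across agcount allocation groups, each AGI
# btree holds roughly icount/agcount records; more AGs means smaller
# per-tree searches for the next free inode.
icount = 220_619_904

for agcount in (4, 32, 64):
    per_ag = icount // agcount
    print(f"{agcount:>2} AGs -> ~{per_ag / 1e6:.1f}M inodes per AG btree")
```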
* Re: bad performance on touch/cp file on XFS system
From: Zhang Qiang @ 2014-08-25 10:31 UTC
To: Dave Chinner; +Cc: xfs

2014-08-25 17:08 GMT+08:00 Dave Chinner <david@fromorbit.com>:
> And 220 million inodes. There's your problem - that's an average of 55
> million inodes per AGI btree, assuming you are using inode64. If you
> are using inode32, then the inodes will be in 2 btrees, or maybe even
> only one.

You are right, all inodes stay in one AG.

BTW, why do I allocate 4 AGs yet all inodes stay in one AG with
inode32? Sorry, I am not familiar with XFS internals yet.

> Anyway you look at it, searching btrees with tens of millions of
> entries is going to consume a *lot* of CPU time. So, really, the state
> your fs is in is probably unfixable without mkfs. And really, that's
> probably pushing the boundaries of what xfsdump and xfsrestore can
> support - it's going to take a long time to dump and restore that
> data....

That's reasonable.

> With that many inodes, I'd be considering moving to 32 or 64 AGs to
> keep the btree size down to a more manageable size. The free inode
> btree would also help, but, really, 220M inodes in a 2TB filesystem is
> really pushing the boundaries of sanity.....

So a better inode count for one AG is about 5M. Are there any documents
about these options where I can learn more?

I will spend more time learning how to use XFS, and its internals, and
try to contribute code. Thanks for your help.
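[Editor's note: re-making the filesystem along the lines Dave suggests would look roughly like the sketch below. The device name, AG count, and dump file path are taken from or assumed for this thread; as Dave warns, xfsdump/xfsrestore over 220M inodes will take a very long time:]

```shell
# 1. Full (level-0) dump of the old filesystem to scratch space big
#    enough to hold it (path is an assumption).
xfsdump -l 0 -f /backup/data1.dump /data1

# 2. Unmount and re-make with more, smaller AGs so each AGI btree
#    stays a manageable size.
umount /data1
mkfs.xfs -f -d agcount=32 /dev/sda4

# 3. Mount with 64-bit inode numbers so new inodes spread across AGs.
mount -o inode64,nobarrier /dev/sda4 /data1

# 4. Restore the dump into the fresh filesystem.
xfsrestore -f /backup/data1.dump /data1
```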
* Re: bad performance on touch/cp file on XFS system
From: Dave Chinner @ 2014-08-25 22:26 UTC
To: Zhang Qiang; +Cc: xfs

On Mon, Aug 25, 2014 at 06:31:10PM +0800, Zhang Qiang wrote:
> You are right, all inodes stay in one AG.
>
> BTW, why do I allocate 4 AGs yet all inodes stay in one AG with
> inode32?

Because the top addresses in the 2nd AG go over 32 bits, hence only
AG 0 can be used for inodes. Changing to inode64 will give you some
relief, but any time allocation occurs in AG 0 it will be slow. i.e.
you'll be trading "always slow" for "unpredictably slow".

> So a better inode count for one AG is about 5M.

Not necessarily. But for your storage it's almost certainly going to
minimise the problem you are seeing.

> Are there any documents about these options where I can learn more?

http://xfs.org/index.php/XFS_Papers_and_Documentation

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
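[Editor's note: a sketch of switching an existing filesystem to inode64. One assumption here: on 2.6.32-era kernels a plain `mount -o remount,inode64` may not reliably take effect, so the safer path is a full umount/mount; the device, mount point, and other options are the ones shown earlier in this thread:]

```shell
# Switch an existing filesystem to 64-bit inode numbers; newly
# allocated inodes can then land in any AG, not just AG 0.
umount /data1
mount -o inode64,noatime,nodiratime,nobarrier /dev/sda4 /data1

# Make it persistent across reboots in /etc/fstab, e.g.:
# /dev/sda4  /data1  xfs  inode64,noatime,nodiratime,nobarrier  0 0
```

Note that only new allocations spread out; existing inodes stay where they are until files are removed and recreated.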
* Re: bad performance on touch/cp file on XFS system
From: Greg Freemyer @ 2014-08-25 22:46 UTC
To: Dave Chinner; +Cc: Zhang Qiang, xfs-oss

On Mon, Aug 25, 2014 at 6:26 PM, Dave Chinner <david@fromorbit.com> wrote:
> Because the top addresses in the 2nd AG go over 32 bits, hence only
> AG 0 can be used for inodes. Changing to inode64 will give you some
> relief, but any time allocation occurs in AG 0 it will be slow. i.e.
> you'll be trading "always slow" for "unpredictably slow".
>
>> Are there any documents about these options where I can learn more?
>
> http://xfs.org/index.php/XFS_Papers_and_Documentation

Given the apparently huge number of small files, would he likely see a
big performance increase if he replaced that 2TB of rust with SSD?

Greg
* Re: bad performance on touch/cp file on XFS system
From: Dave Chinner @ 2014-08-26 2:37 UTC
To: Greg Freemyer; +Cc: Zhang Qiang, xfs-oss

On Mon, Aug 25, 2014 at 06:46:31PM -0400, Greg Freemyer wrote:
> Given the apparently huge number of small files, would he likely see a
> big performance increase if he replaced that 2TB of rust with SSD?

Doubt it - the profiles showed the allocation being CPU bound searching
the metadata that indexes all those inodes. Those same profiles showed
all the signs that it was hitting the buffer cache most of the time,
too, which is why it was CPU bound....

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
* Re: bad performance on touch/cp file on XFS system
From: Zhang Qiang @ 2014-08-26 10:04 UTC
To: Dave Chinner; +Cc: Greg Freemyer, xfs-oss

Thanks Dave/Greg for your analysis and suggestions.

I can summarize what I should do next:

- back up my data using xfsdump
- rebuild the filesystem using mkfs with agcount=32 for the 2T disk
- mount the filesystem with inode64,nobarrier
- apply the patches adding a free inode btree to the on-disk structure

As we have about ~100 servers to back up, that will take much effort;
do you have any other suggestion?

What I am testing (ongoing):

- created a new 2T partition filesystem
- create small files and fill the whole space, then remove some of them
  randomly
- check the performance of touch/cp on files
- apply patches and verify

I have got more data from the server:

1) flush all caches (echo 3 > /proc/sys/vm/drop_caches) and umount the
   filesystem
2) mount the filesystem and test with the touch command
   * the first touch of a new file takes about ~23s
   * the second touch takes about ~0.1s

Here's the perf data.

First touch command:

Events: 435 cycles
+   7.51%  touch  [xfs]              [k] xfs_inobt_get_rec
+   5.61%  touch  [xfs]              [k] xfs_btree_get_block
+   5.38%  touch  [xfs]              [k] xfs_btree_increment
+   4.26%  touch  [xfs]              [k] xfs_btree_get_rec
+   3.73%  touch  [kernel.kallsyms]  [k] find_busiest_group
+   3.43%  touch  [xfs]              [k] _xfs_buf_find
+   2.72%  touch  [xfs]              [k] xfs_btree_readahead
+   2.38%  touch  [xfs]              [k] xfs_trans_buf_item_match
+   2.34%  touch  [xfs]              [k] xfs_dialloc
+   2.32%  touch  [kernel.kallsyms]  [k] generic_make_request
+   2.09%  touch  [xfs]              [k] xfs_btree_rec_offset
+   1.75%  touch  [kernel.kallsyms]  [k] kmem_cache_alloc
+   1.63%  touch  [kernel.kallsyms]  [k] cpumask_next_and
+   1.41%  touch  [sd_mod]           [k] sd_prep_fn
+   1.41%  touch  [kernel.kallsyms]  [k] get_page_from_freelist
+   1.38%  touch  [kernel.kallsyms]  [k] __alloc_pages_nodemask
+   1.27%  touch  [kernel.kallsyms]  [k] scsi_request_fn
+   1.22%  touch  [kernel.kallsyms]  [k] blk_queue_bounce
+   1.20%  touch  [kernel.kallsyms]  [k] cfq_should_idle
+   1.10%  touch  [xfs]              [k] xfs_btree_rec_addr
+   1.03%  touch  [kernel.kallsyms]  [k] cfq_dispatch_requests
+   1.00%  touch  [kernel.kallsyms]  [k] _spin_lock_irqsave
+   0.94%  touch  [kernel.kallsyms]  [k] memcpy
+   0.86%  touch  [kernel.kallsyms]  [k] swiotlb_map_sg_attrs
+   0.84%  touch  [kernel.kallsyms]  [k] alloc_pages_current
+   0.82%  touch  [kernel.kallsyms]  [k] submit_bio
+   0.81%  touch  [megaraid_sas]     [k] megasas_build_and_issue_cmd_fusion
+   0.77%  touch  [kernel.kallsyms]  [k] blk_peek_request
+   0.73%  touch  [xfs]              [k] xfs_btree_setbuf
+   0.73%  touch  [megaraid_sas]     [k] MR_GetPhyParams
+   0.73%  touch  [kernel.kallsyms]  [k] run_timer_softirq
+   0.71%  touch  [kernel.kallsyms]  [k] pick_next_task_rt
+   0.71%  touch  [kernel.kallsyms]  [k] init_request_from_bio
+   0.70%  touch  [kernel.kallsyms]  [k] thread_return
+   0.69%  touch  [kernel.kallsyms]  [k] cfq_set_request
+   0.67%  touch  [kernel.kallsyms]  [k] mempool_alloc
+   0.66%  touch  [xfs]              [k] xfs_buf_hold
+   0.66%  touch  [kernel.kallsyms]  [k] find_next_bit
+   0.62%  touch  [kernel.kallsyms]  [k] cfq_insert_request
+   0.61%  touch  [kernel.kallsyms]  [k] scsi_init_io
+   0.60%  touch  [megaraid_sas]     [k] MR_BuildRaidContext
+   0.59%  touch  [kernel.kallsyms]  [k] policy_zonelist
+   0.59%  touch  [kernel.kallsyms]  [k] elv_insert
+   0.58%  touch  [xfs]              [k] xfs_buf_allocate_memory

Second touch command:

Events: 105 cycles
+  20.92%  touch  [xfs]              [k] xfs_inobt_get_rec
+  14.27%  touch  [xfs]              [k] xfs_btree_get_rec
+  12.21%  touch  [xfs]              [k] xfs_btree_get_block
+  12.12%  touch  [xfs]              [k] xfs_btree_increment
+   9.86%  touch  [xfs]              [k] xfs_btree_readahead
+   7.87%  touch  [xfs]              [k] _xfs_buf_find
+   4.93%  touch  [xfs]              [k] xfs_btree_rec_addr
+   4.12%  touch  [xfs]              [k] xfs_dialloc
+   3.03%  touch  [kernel.kallsyms]  [k] clear_page_c
+   2.96%  touch  [xfs]              [k] xfs_btree_rec_offset
+   1.31%  touch  [kernel.kallsyms]  [k] kmem_cache_free
+   1.03%  touch  [xfs]              [k] xfs_trans_buf_item_match
+   0.99%  touch  [kernel.kallsyms]  [k] _atomic_dec_and_lock
+   0.99%  touch  [xfs]              [k] xfs_inobt_get_maxrecs
+   0.99%  touch  [xfs]              [k] xfs_buf_unlock
+   0.99%  touch  [xfs]              [k] kmem_zone_alloc
+   0.98%  touch  [kernel.kallsyms]  [k] kmem_cache_alloc
+   0.28%  touch  [kernel.kallsyms]  [k] pgd_alloc
+   0.17%  touch  [kernel.kallsyms]  [k] page_fault
+   0.01%  touch  [kernel.kallsyms]  [k] native_write_msr_safe

I have compared the memory used; it seems that XFS tries to load inode
bmap blocks the first time, which takes much time. Is that the reason
the first touch operation takes so long?

Thanks
Qiang

2014-08-26 10:37 GMT+08:00 Dave Chinner <david@fromorbit.com>:
> Doubt it - the profiles showed the allocation being CPU bound
> searching the metadata that indexes all those inodes. Those same
> profiles showed all the signs that it was hitting the buffer cache
> most of the time, too, which is why it was CPU bound....
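[Editor's note: the cold-cache measurement described above, as a repeatable sketch. File names are arbitrary, and the timings in the comments are the ones reported in this thread, not guaranteed outputs:]

```shell
# Cold-cache inode allocation timing: flush caches, remount, then time
# the first and second allocations.
umount /data1
sync
echo 3 > /proc/sys/vm/drop_caches      # drop page cache, dentries, inodes
mount -o inode64,nobarrier /dev/sda4 /data1

time touch /data1/coldfile   # first allocation: AGI btree read from disk (~23s here)
time touch /data1/warmfile   # second allocation: btree mostly cached (~0.1s here)

# Optional: profile the cold run to see where the CPU time goes.
# perf record -g -- touch /data1/coldfile2 && perf report --stdio
```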
* Re: bad performance on touch/cp file on XFS system
From: Dave Chinner @ 2014-08-26 13:13 UTC
To: Zhang Qiang; +Cc: Greg Freemyer, xfs-oss

On Tue, Aug 26, 2014 at 06:04:52PM +0800, Zhang Qiang wrote:
> I can summarize what I should do next:
>
> - back up my data using xfsdump
> - rebuild the filesystem using mkfs with agcount=32 for the 2T disk
> - mount the filesystem with inode64,nobarrier

Ok up to here.

> - apply the patches adding a free inode btree to the on-disk structure

No, don't do that. You're almost certain to get it wrong and corrupt
your filesystems and lose data.

> As we have about ~100 servers to back up, that will take much effort;
> do you have any other suggestion?

Just remount them with inode64. Nothing else. Over time, as you add and
remove files, the inodes will redistribute across all 4 AGs.

> 1) flush all caches (echo 3 > /proc/sys/vm/drop_caches) and umount the
>    filesystem
> 2) mount the filesystem and test with the touch command
>    * the first touch of a new file takes about ~23s
>    * the second touch takes about ~0.1s

So it's cache population that is your issue. You didn't say that the
first time around, which means the diagnosis was wrong. Again, it's
having to search a btree with 220 million inodes in it to find the
first free inode, and that btree has to be pulled in from disk and
searched. Once it's cached, each subsequent allocation will be much
faster because the majority of the tree being searched will already be
in cache...

> I have compared the memory used; it seems that XFS tries to load inode
> bmap blocks the first time, which takes much time. Is that the reason
> the first touch operation takes so long?

No. Reading the AGI btree to find the first free inode to allocate is
what is taking the time. If you spread the inodes out over 4 AGs (using
inode64) then the overhead of the first read will go down
proportionally. Indeed, that is one of the reasons for using more AGs
than 4 for filesystems like this.

Still, I can't help but wonder why you are using a filesystem to store
hundreds of millions of tiny files, when a database is far better
suited to storing and indexing this type and quantity of data....

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
* Re: bad performance on touch/cp file on XFS system
From: Zhang Qiang @ 2014-08-27 8:53 UTC
To: Dave Chinner; +Cc: Greg Freemyer, xfs-oss

2014-08-26 21:13 GMT+08:00 Dave Chinner <david@fromorbit.com>:
> Just remount them with inode64. Nothing else. Over time, as you add
> and remove files, the inodes will redistribute across all 4 AGs.

OK.

How can I see the number of inodes in each AG? Here are my checking
steps:

1) Check the unmounted filesystem first:

[root@fstest data1]# xfs_db -c "sb 0" -c "p" /dev/sdb1 | egrep 'icount|ifree'
icount = 421793920
ifree = 41
[root@fstest data1]# xfs_db -c "sb 1" -c "p" /dev/sdb1 | egrep 'icount|ifree'
icount = 0
ifree = 0
[root@fstest data1]# xfs_db -c "sb 2" -c "p" /dev/sdb1 | egrep 'icount|ifree'
icount = 0
ifree = 0
[root@fstest data1]# xfs_db -c "sb 3" -c "p" /dev/sdb1 | egrep 'icount|ifree'
icount = 0
ifree = 0

2) Mount it with inode64 and create many files:

[root@fstest /]# mount -o inode64,nobarrier /dev/sdb1 /data
[root@fstest /]# cd /data/tmp/
[root@fstest tmp]# fdtree.bash -d 16 -l 2 -f 100 -s 1
[root@fstest /]# umount /data

3) Check with xfs_db again:

[root@fstest data1]# xfs_db -f -c "sb 0" -c "p" /dev/sdb1 | egrep 'icount|ifree'
icount = 421821504
ifree = 52
[root@fstest data1]# xfs_db -f -c "sb 1" -c "p" /dev/sdb1 | egrep 'icount|ifree'
icount = 0
ifree = 0

So it seems that the inodes are only in the first AG. Or are
icount/ifree not the correct values to check? How should I check how
many inodes are in each AG?

I am trying to improve performance on the current filesystem and kernel
just by remounting with inode64, and to find a way to make the inodes
redistribute evenly across all AGs. Is there any good way? For example,
backing up half of the data to another device, removing it, then
copying it back.

> Still, I can't help but wonder why you are using a filesystem to store
> hundreds of millions of tiny files, when a database is far better
> suited to storing and indexing this type and quantity of data....

OK, these are the back-end servers of a social networking website -
actually the CDN infrastructure - with servers located in different
cities. We have a global sync script to make all these 100 servers hold
the same data.

Each server uses RAID10 and XFS (CentOS 6.3).

About 3M files (50K in size) are generated every day, and we track the
path of each file in a database.

Do you have any suggestions to improve our solution?
* Re: bad performance on touch/cp file on XFS system
From: Dave Chinner @ 2014-08-28 2:08 UTC
To: Zhang Qiang; +Cc: Greg Freemyer, xfs-oss

On Wed, Aug 27, 2014 at 04:53:17PM +0800, Zhang Qiang wrote:
> How can I see the number of inodes in each AG? Here are my checking
> steps:
>
> 1) Check the unmounted filesystem first:
>
> [root@fstest data1]# xfs_db -c "sb 0" -c "p" /dev/sdb1 | egrep 'icount|ifree'
> icount = 421793920
> ifree = 41
> [root@fstest data1]# xfs_db -c "sb 1" -c "p" /dev/sdb1 | egrep 'icount|ifree'
> icount = 0
> ifree = 0

That's wrong. You need to check the AGI headers, not the superblock.
Only the primary superblock gets updated, and it's the aggregate of all
the AGI values, not the per-AG values.

And, BTW, that's *421 million* inodes in that filesystem. Almost twice
as many as the filesystem you started showing problems on...

> OK, these are the back-end servers of a social networking website -
> actually the CDN infrastructure - with servers located in different
> cities. We have a global sync script to make all these 100 servers
> hold the same data.
>
> Each server uses RAID10 and XFS (CentOS 6.3).
>
> About 3M files (50K in size) are generated every day, and we track the
> path of each file in a database.

I'd suggest you are overestimating the size of the files being stored
by an order of magnitude: 200M files at 50k in size is 10TB, not 1.5TB.

But you've confirmed exactly what I thought - you're using the
filesystem as an anonymous object store for hundreds of millions of
small objects, and that's exactly the situation I'd expect to see these
problems....

> Do you have any suggestions to improve our solution?

TANSTAAFL. I've given you some stuff to try; worst case is reformatting
and re-copying all the data. I don't really have much time to do much
more than that - talk to Red Hat (because you are using CentOS) if you
want help with a more targeted solution to your problem...

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
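[Editor's note: a sketch of the AGI check Dave describes, using xfs_db on the unmounted device. Assumptions: 4 AGs and device /dev/sdb1 as in this thread; the AGI fields corresponding to the superblock's icount/ifree totals are `count` and `freecount`:]

```shell
# Per-AG allocated/free inode counts live in each AGI header; only the
# primary superblock (sb 0) carries the filesystem-wide totals.
for ag in 0 1 2 3; do
    echo "AG $ag:"
    xfs_db -r -c "agi $ag" -c "p count freecount" /dev/sdb1
done
```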
end of thread, other threads: [~2014-08-28 2:09 UTC | newest]

Thread overview: 15+ messages:
2014-08-25  3:34 bad performance on touch/cp file on XFS system Zhang Qiang
2014-08-25  5:18 ` Dave Chinner
2014-08-25  8:09 ` Zhang Qiang
2014-08-25  8:56 ` Dave Chinner
2014-08-25  9:05 ` Zhang Qiang
2014-08-25  8:47 ` Zhang Qiang
2014-08-25  9:08 ` Dave Chinner
2014-08-25 10:31 ` Zhang Qiang
2014-08-25 22:26 ` Dave Chinner
2014-08-25 22:46 ` Greg Freemyer
2014-08-26  2:37 ` Dave Chinner
2014-08-26 10:04 ` Zhang Qiang
2014-08-26 13:13 ` Dave Chinner
2014-08-27  8:53 ` Zhang Qiang
2014-08-28  2:08 ` Dave Chinner