* xfs_fsr, sunit, and swidth
  From: Dave Hall @ 2013-03-13 18:11 UTC
  To: xfs

Does xfs_fsr react in any way to the sunit and swidth attributes of the
file system?  In other words, with an XFS filesystem set up directly on a
hardware RAID, it is recommended that the mount command be changed to
specify sunit and swidth values that reflect the geometry of the RAID.  In
my case, these values were not specified on the mkfs.xfs of a rather large
file system running on a RAID 6 array.  I am wondering whether adding sunit
and swidth parameters to the fstab will cause xfs_fsr to do anything
different than it is already doing.  Most importantly, will it improve
performance in any way?

Thanks.

-Dave

--
Dave Hall
Binghamton University
kdhall@binghamton.edu
607-760-2328 (Cell)
607-777-4641 (Office)
* Re: xfs_fsr, sunit, and swidth
  From: Dave Chinner @ 2013-03-13 23:57 UTC
  To: Dave Hall; +Cc: xfs

On Wed, Mar 13, 2013 at 02:11:19PM -0400, Dave Hall wrote:
> Does xfs_fsr react in any way to the sunit and swidth attributes of
> the file system?

Not directly.

> In other words, with an XFS filesystem set up
> directly on a hardware RAID, it is recommended that the mount
> command be changed to specify sunit and swidth values that reflect
> the geometry of the RAID.

The mount option does nothing if sunit/swidth weren't specified at mkfs
time.  sunit/swidth affect the initial layout of the filesystem, and that
cannot be altered after the fact.  Hence you can't arbitrarily change
sunit/swidth after mkfs - you are limited to changes that are compatible
with the existing alignment.  If no alignment was specified, then there is
no new alignment that can be verified as compatible with the existing
layout....

> In my case, these values were not
> specified on the mkfs.xfs of a rather large file system running on a
> RAID 6 array.

Which means the mount option won't work.

> I am wondering whether adding sunit and swidth parameters to
> the fstab will cause xfs_fsr to do anything different than it is
> already doing.  Most importantly, will it improve performance in any
> way?

It will make no difference at all.

A more important question: why do you even need to run xfs_fsr?

Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
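As background to the above: stripe alignment is normally specified when the
filesystem is created, not at mount time.  A minimal sketch, assuming a
hypothetical RAID6 with a 128k chunk and 14 data disks (the values are
illustrative, not a recommendation for any particular array):

    # Set alignment at mkfs time: su is the per-disk chunk size,
    # sw is the number of data disks (RAID6: total disks minus two).
    mkfs.xfs -d su=128k,sw=14 /dev/XXX

    # The same geometry expressed as sunit/swidth in 512-byte sectors
    # (128KB = 256 sectors; 256 * 14 = 3584):
    mkfs.xfs -d sunit=256,swidth=3584 /dev/XXX

    # Confirm what an existing, mounted filesystem was created with:
    xfs_info /mount/point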
* Re: xfs_fsr, sunit, and swidth
  From: Stan Hoeppner @ 2013-03-14 00:03 UTC
  To: Dave Hall; +Cc: xfs

On 3/13/2013 1:11 PM, Dave Hall wrote:
> Does xfs_fsr react in any way to the sunit and swidth attributes of the
> file system?

No, manually remounting with new stripe alignment and then running xfs_fsr
is not going to magically reorganize your filesystem.

> In other words, with an XFS filesystem set up directly on a
> hardware RAID, it is recommended that the mount command be changed to
> specify sunit and swidth values that reflect the geometry of the
> RAID.

This recommendation (as well as most things storage related) is workload
dependent.  A common misconception is that XFS simply needs to be aligned
to the RAID stripe.  In reality, it's more critical that XFS writeout be
aligned to the application's write pattern, and thus the hardware RAID
stripe needs to be as well.  Another common misconception is that simply
aligning XFS to the RAID stripe will automagically yield fully filled
hardware stripes.  That is entirely dependent on matching the hardware
RAID stripe to the application's write pattern.

> In my case, these values were not specified on the mkfs.xfs of a
> rather large file system running on a RAID 6 array.  I am wondering
> whether adding sunit and swidth parameters to the fstab will cause
> xfs_fsr to do anything different than it is already doing.

No, see above.  And read this carefully: aligning XFS affects writeout
only during allocation.  It does not affect xfs_fsr.  Nor does it affect
non-allocation workloads, i.e. database inserts, writing new mail to mbox
files, etc.

> Most importantly, will it improve performance in any way?

You provided insufficient information for us to help you optimize
performance.  For us to even take a stab at answering this we need to know
at least:

1.  application/workload write pattern(s) -- is it allocation heavy?
    a.  small random IO
    b.  large streaming
    c.  if mixed, what is the ratio

2.  current hardware RAID parameters
    a.  strip/chunk size
    b.  # of effective spindles (RAID6 disk count minus 2)

3.  current percentage of filesystem bytes and inodes used
    a.  ~$ df /dev/[mount_point]
    b.  ~$ df -i /dev/[mount_point]

FWIW, parity RAID is abysmal with random writes, and especially so if the
hardware stripe width is larger than the workload's write IOs.  Thus,
optimizing performance with hardware RAID and filesystems must be done
during the design phase of the storage.  For instance, if you have a RAID6
chunk/strip size of 512K and 8 effective spindles, that's a 4MB stripe
width.  If your application is doing random allocation writeout in 256K
chunks, you simply can't optimize performance without blowing away the
array and recreating it.  For this example you'd need a chunk/strip of
32K, which with 8 effective spindles equals a 256K stripe width.

Now, there is a possible silver lining here.  If your workload is doing
mostly large streaming writes, allocation or not, that are many multiples
of your current hardware RAID stripe, it doesn't matter whether your XFS
is doing default 4K writes or has been aligned to the RAID stripe.  In
this case the controller's BBWC is typically going to take the successive
XFS 4K IOs and fill hardware stripes automatically.

So again, as always, the answer depends on your workload.

--
Stan
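To make the arithmetic in the example above concrete, here is a small
shell sketch that derives the full stripe width and matching mkfs values
from a chunk size and disk count; the numbers are the hypothetical ones
from the example, not measurements from a real array:

    #!/bin/sh
    # Derive XFS alignment values from RAID6 geometry (illustrative).
    CHUNK_KB=512                  # per-disk chunk/strip size in KB
    DISKS=10                      # total disks in the array
    DATA_DISKS=$((DISKS - 2))     # RAID6 loses two disks to parity

    STRIPE_KB=$((CHUNK_KB * DATA_DISKS))
    echo "full stripe width: ${STRIPE_KB}KB"   # 512K x 8 = 4096KB (4MB)
    echo "alignment: mkfs.xfs -d su=${CHUNK_KB}k,sw=${DATA_DISKS} ..."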
* Re: xfs_fsr, sunit, and swidth
  From: Stan Hoeppner @ 2013-03-14 12:26 UTC
  To: Dave Hall, xfs@oss.sgi.com

On 3/13/2013 11:37 PM, Dave Hall wrote:
> Stan,
>
> If you'd rather I can re-post this to xfs@oss.sgi.com, but I'm not clear
> on exactly where this address leads.  I am grateful for your response.

No need, I'm CC'ing the list address.  Read this entirely before hitting
reply.

> So the details are that this is a 16 x 2TB 7200 rpm SATA drive array in
> a RAID enclosure.  The array is configured RAID6 (so 14 data spindles)
> with a chunk size of 128k.  The XFS formatted size is 26TB with 19TB
> currently used.

So your RAID6 stripe width is 14 * 128KB = 1,792KB.

> The workload is a backup program called rsnapshot.  If you're not
> familiar, this program uses cp -al to create a linked copy of the
> previous backup, and then rsync -av --del to copy in any changes.  The
> current snapshots contain about 14.8 million files.  The total number
> of snapshots is about 600.

So you've got a metadata heavy workload with lots of links being created.

> The performance problem that led me to investigate XFS is that some
> time around mid-November the cp -al step started running very long -
> sometimes over 48 hours.  Sometimes it runs in just a few hours.  Prior
> to then the entire backup consistently finished in less than 12 hours.
> When the cp -al is running long, the output of dstat indicates that the
> I/O to the fs is fairly light.

The 'cp -al' command is a pure metadata workload, which means lots of
writes to the filesystem directory trees, but not into files.  And if your
kernel is lower than 2.6.39 your log throughput would be pretty high as
well.  But given this is RAID6 you'll have significant RMW for these
directory writes, maybe overwhelming RMW, driving latency up and thus
actual bandwidth down.  So dstat bytes throughput may be low, but %wa may
be through the roof, making the dstat data you're watching completely
misleading as to what's really going on and what's causing the problem.

> Please let me know if you need any further information.

Yes, please provide the output of the following commands:

~$ grep xfs /etc/fstab
~$ xfs_info /dev/[mount-point]
~$ df /dev/[mount_point]
~$ df -i /dev/[mount_point]
~$ xfs_db -r -c freesp /dev/[mount-point]

Also please provide the make/model of the RAID controller, the write cache
size and whether it is indeed enabled and working, as well as any errors,
if any, logged by the controller in dmesg or elsewhere in Linux, or in the
controller firmware.

> Also, again, I can post this to xfs@oss.sgi.com but I'd really like to
> know more about the address.

Makes me wonder where you obtained the list address.  Apparently not from
the official websites or you'd not have to ask.  Maybe this will assuage
your fears. ;)

xfs@oss.sgi.com is the official XFS mailing list submission address for
the XFS developers and users.  oss.sgi.com is the server provided and
managed by SGI (www.sgi.com) that houses the XFS open source project.  SGI
created the XFS filesystem, first released on their proprietary IRIX/MIPS
computers in 1994.  SGI open sourced XFS and ported it to Linux in the
early 2000s.

XFS is actively developed by a fairly large group of people, and AFAIK
most of them are currently employed by Red Hat, including Dave Chinner,
who also replied to your post.  Dave wrote the delaylog code, which will
probably go a long way toward fixing your problem if you're currently
using 2.6.38 or lower and not mounting with this option enabled.  It
didn't become the default until 2.6.39.

More info here http://www.xfs.org and here http://oss.sgi.com/projects/xfs/

> Thanks.

You bet.

--
Stan

> -Dave
>
> On 3/13/2013 8:03 PM, Stan Hoeppner wrote:
>> [full quote of the previous message snipped]
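For readers unfamiliar with rsnapshot, the rotation described above
reduces to a hard-link copy followed by an rsync.  A minimal sketch of the
equivalent shell steps (the snapshot paths are illustrative):

    # The pure-metadata step: copy the newest snapshot by hard-linking
    # every file, so no file data is duplicated.
    cp -al /infortrend/daily.0 /infortrend/daily.1

    # Then sync in only what changed; rsync replaces (and thereby
    # un-links) any file it updates, so unchanged files keep sharing
    # blocks with the older snapshots.
    rsync -av --del /source/ /infortrend/daily.0/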
* Re: xfs_fsr, sunit, and swidth
  From: Stan Hoeppner @ 2013-03-14 12:55 UTC
  To: stan; +Cc: Dave Hall, xfs@oss.sgi.com

Quick note below, need one more bit of info.

On 3/14/2013 7:26 AM, Stan Hoeppner wrote:
> [...]
>> Please let me know if you need any further information.
>
> Yes, please provide the output of the following commands:

~$ uname -a

> ~$ grep xfs /etc/fstab
> ~$ xfs_info /dev/[mount-point]
> ~$ df /dev/[mount_point]
> ~$ df -i /dev/[mount_point]
> ~$ xfs_db -r -c freesp /dev/[mount-point]
>
> [remainder of quoted message snipped]
* Re: xfs_fsr, sunit, and swidth
  From: Dave Hall @ 2013-03-14 14:59 UTC
  To: stan; +Cc: xfs@oss.sgi.com

On 03/14/2013 08:55 AM, Stan Hoeppner wrote:
> ~$ uname -a

Linux decoy 3.2.0-0.bpo.4-amd64 #1 SMP Debian 3.2.35-2~bpo60+1 x86_64 GNU/Linux

> ~$ grep xfs /etc/fstab

LABEL=backup  /infortrend  xfs  inode64,noatime,nodiratime,nobarrier  0 0

(cat /proc/mounts:
/dev/sdb1 /infortrend xfs rw,noatime,nodiratime,attr2,delaylog,nobarrier,inode64,noquota 0 0)

Note that there is also a second XFS on a separate 3ware RAID card, but
the I/O traffic on that one is fairly low.  It is used as a staging area
for a Debian mirror that is hosted on another server.

> ~$ xfs_info /dev/[mount-point]

# xfs_info /dev/sdb1
meta-data=/dev/sdb1            isize=256    agcount=26, agsize=268435455 blks
         =                     sectsz=512   attr=2
data     =                     bsize=4096   blocks=6836364800, imaxpct=5
         =                     sunit=0      swidth=0 blks
naming   =version 2            bsize=4096   ascii-ci=0
log      =internal             bsize=4096   blocks=521728, version=2
         =                     sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                 extsz=4096   blocks=0, rtextents=0

> ~$ df /dev/[mount_point]

# df /dev/sdb1
Filesystem      1K-blocks        Used   Available Use% Mounted on
/dev/sdb1     27343372288 20432618356  6910753932  75% /infortrend

> ~$ df -i /dev/[mount_point]

# df -i /dev/sdb1
Filesystem      Inodes      IUsed      IFree IUse% Mounted on
/dev/sdb1   5469091840 1367746380 4101345460   26% /infortrend

> ~$ xfs_db -r -c freesp /dev/[mount-point]

# xfs_db -r -c freesp /dev/sdb1
   from      to extents     blocks   pct
      1       1  832735     832735  0.05
      2       3  432183    1037663  0.06
      4       7  365573    1903965  0.11
      8      15  352402    3891608  0.23
     16      31  332762    7460486  0.43
     32      63  300571   13597941  0.79
     64     127  233778   20900655  1.21
    128     255  152003   27448751  1.59
    256     511  112673   40941665  2.37
    512    1023   82262   59331126  3.43
   1024    2047   53238   76543454  4.43
   2048    4095   34092   97842752  5.66
   4096    8191   22743  129915842  7.52
   8192   16383   14453  162422155  9.40
  16384   32767    8501  190601554 11.03
  32768   65535    4695  210822119 12.20
  65536  131071    2615  234787546 13.59
 131072  262143    1354  237684818 13.76
 262144  524287     470  160228724  9.27
 524288 1048575      74   47384798  2.74
1048576 2097151       1    2097122  0.12

> Also please provide the make/model of the RAID controller, the write
> cache size and whether it is indeed enabled and working, as well as any
> errors, if any, logged by the controller in dmesg or elsewhere in
> Linux, or in the controller firmware.

The RAID box is an Infortrend S16S-G1030 with 512MB cache and a fully
functional battery.  I couldn't find any details about the internal RAID
implementation used by Infortrend.  The array is SAS attached to an LSI
HBA (SAS2008 PCI-Express Fusion-MPT SAS-2).

The system hardware is a SuperMicro quad 8-core XEON E7-4820 2.0GHz with
128 GB of RAM, hyper-threading enabled.  (This is something that I
inherited.  There is no doubt that it is overkill.)

Another bit of information that you didn't ask about is the I/O scheduler
algorithm.  I just checked and found it set to 'cfq', although I thought I
had set it to 'noop' via a kernel parameter in GRUB.

Also, some observations about the cp -al: in parallel to investigating
hardware/OS/filesystem issues I have done some experiments with cp -al.
It hurts to have 64 cores available and see cp -al running the wheels off
just one, with a couple of others slightly active with system level
duties.  So I tried some experiments where I copied smaller segments of
the file tree in parallel (using make -j), roughly as sketched below.  I
haven't had the chance to fully play this out, but these parallel cp
invocations completed very quickly.  So it would appear that the cp
command itself may bog down with such a large file tree.  I haven't had a
chance to tear apart the source code or do any profiling to see if there
are any obvious problems there.

Lastly, I will mention that I see almost 0% wa when watching top.

--
Dave Hall
Binghamton University
kdhall@binghamton.edu
607-760-2328 (Cell)
607-777-4641 (Office)
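The parallel experiment mentioned above can be sketched roughly as
follows; this hypothetical version fans out one cp -al per top-level
subdirectory with xargs instead of make -j (paths, job count, and layout
are illustrative, and it ignores files sitting directly in the top
directory):

    #!/bin/sh
    # Run one hard-link copy per top-level subdirectory, 8 at a time.
    SRC=/infortrend/daily.0
    DST=/infortrend/daily.1
    mkdir -p "$DST"

    ls "$SRC" | xargs -P8 -I{} cp -al "$SRC/{}" "$DST/{}"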
* Re: xfs_fsr, sunit, and swidth
  From: Stefan Ring @ 2013-03-14 18:07 UTC
  To: Dave Hall; +Cc: xfs@oss.sgi.com

> Lastly, I will mention that I see almost 0% wa when watching top.

I notice that XFS in general will report less %wa than ext4, although it
exercises the disks a bit more when traversing a large directory tree, for
example.  But with 64 cores you will see at most about 1.5% in top anyway
(one thread of 64 is ~1.6%) if one process is doing nothing but waiting on
the disk.
* Re: xfs_fsr, sunit, and swidth
  From: Stan Hoeppner @ 2013-03-15 05:14 UTC
  To: Dave Hall; +Cc: xfs@oss.sgi.com

On 3/14/2013 9:59 AM, Dave Hall wrote:
> Linux decoy 3.2.0-0.bpo.4-amd64 #1 SMP Debian 3.2.35-2~bpo60+1 x86_64
> GNU/Linux

Ok, so you're already on a recent kernel with delaylog.

> LABEL=backup  /infortrend  xfs  inode64,noatime,nodiratime,nobarrier  0 0

XFS uses relatime by default, so noatime/nodiratime are useless, though
not part of the problem.  inode64 is good, as your files and metadata have
locality.  nobarrier is good with functioning BBWC.

> meta-data=/dev/sdb1            isize=256    agcount=26, agsize=268435455 blks
>          =                     sectsz=512   attr=2
> data     =                     bsize=4096   blocks=6836364800, imaxpct=5
>          =                     sunit=0      swidth=0 blks
> naming   =version 2            bsize=4096   ascii-ci=0
> log      =internal             bsize=4096   blocks=521728, version=2
>          =                     sectsz=512   sunit=0 blks, lazy-count=1
> realtime =none                 extsz=4096   blocks=0, rtextents=0

Standard internal log, no alignment.  With delaylog, 512MB BBWC, and a
nearly pure metadata workload, this should be fine.

> Filesystem      1K-blocks        Used   Available Use% Mounted on
> /dev/sdb1     27343372288 20432618356  6910753932  75% /infortrend

Looks good.  75% is close to tickling the free space fragmentation dragon,
but you're not there yet.

> Filesystem      Inodes      IUsed      IFree IUse% Mounted on
> /dev/sdb1   5469091840 1367746380 4101345460   26% /infortrend

Plenty of free inodes.

> # xfs_db -r -c freesp /dev/sdb1
>    from      to extents     blocks   pct
>       1       1  832735     832735  0.05
>       2       3  432183    1037663  0.06
>       4       7  365573    1903965  0.11
>       8      15  352402    3891608  0.23
>      16      31  332762    7460486  0.43
>      32      63  300571   13597941  0.79
>      64     127  233778   20900655  1.21
>     128     255  152003   27448751  1.59
>     256     511  112673   40941665  2.37
>     512    1023   82262   59331126  3.43
>    1024    2047   53238   76543454  4.43
>    2048    4095   34092   97842752  5.66
>    4096    8191   22743  129915842  7.52
>    8192   16383   14453  162422155  9.40
>   16384   32767    8501  190601554 11.03
>   32768   65535    4695  210822119 12.20
>   65536  131071    2615  234787546 13.59
>  131072  262143    1354  237684818 13.76
>  262144  524287     470  160228724  9.27
>  524288 1048575      74   47384798  2.74
> 1048576 2097151       1    2097122  0.12

Your free space map isn't completely horrible given you're at 75%
capacity.  It looks like most of the free space is in chunks 32MB and
larger.  Those 14.8m files have a mean size of ~1.22MB, which suggests
most of the files are small, so you shouldn't be having high seek load
(thus latency) during allocation.

> The RAID box is an Infortrend S16S-G1030 with 512MB cache and a fully
> functional battery.  I couldn't find any details about the internal
> RAID implementation used by Infortrend.  The array is SAS attached to
> an LSI HBA (SAS2008 PCI-Express Fusion-MPT SAS-2).

It's an older unit, definitely not the fastest in its class, but unless
the firmware is horrible the 512MB BBWC should handle this metadata
workload with aplomb.

With 128GB RAM and Linux read-ahead caching you don't need the RAID
controller to be doing read caching.  Go into the SANWatch interface and
make sure you're dedicating all the cache to writes, not reads.  This may
or may not be configurable.  Some firmware will simply drop read cache
lines dynamically when writes come in; some lets you manually tweak the
ratio.  I'm not that familiar with the Infortrend units.  But again, this
is a minor optimization, and I don't think it is part of the problem.

> The system hardware is a SuperMicro quad 8-core XEON E7-4820 2.0GHz
> with 128 GB of RAM, hyper-threading enabled.  (This is something that I
> inherited.  There is no doubt that it is overkill.)

Just a bit.  64 hardware threads, 72MB of L3 cache, and 128GB RAM for a
storage server with two storage HBAs and low throughput disk arrays.
Apparently running a Debian mirror is more compute intensive than I
previously thought...

> Another bit of information that you didn't ask about is the I/O
> scheduler algorithm.

Didn't get that far yet. ;)

> I just checked and found it set to 'cfq', although I thought I had set
> it to 'noop' via a kernel parameter in GRUB.

As you're using a distro kernel, I recommend simply doing it in root's
crontab.  That way it can't get 'lost' during kernel upgrades due to grub
update problems, etc.  The scheduler can be changed on the fly, so it
doesn't matter where you set it in the boot sequence.

@reboot /bin/echo noop > /sys/block/sdb/queue/scheduler

> Also, some observations about the cp -al: in parallel to investigating
> hardware/OS/filesystem issues I have done some experiments with cp -al.
> It hurts to have 64 cores available and see cp -al running the wheels
> off just one, with a couple of others slightly active with system level
> duties.

This tends to happen when one runs single threaded user space code on a
large multiprocessor.

> So I tried some experiments where I copied smaller segments of
> the file tree in parallel (using make -j).  I haven't had the chance to
> fully play this out, but these parallel cp invocations completed very
> quickly.  So it would appear that the cp command itself may bog down
> with such a large file tree.  I haven't had a chance to tear apart the
> source code or do any profiling to see if there are any obvious
> problems there.
>
> Lastly, I will mention that I see almost 0% wa when watching top.

So it's probably safe to say at this point that XFS and IO in general are
not the problem.

One thing you did not mention is how you are using rsnapshot.  If you are
using it as most folks do, to back up remote filesystems of other machines
over ethernet, what happens when you simply schedule multiple rsnapshot
processes concurrently, targeting each at a different remote machine?

If you're using rsnapshot strictly locally, you should take a hard look at
xfsdump.  It exists specifically for backing up XFS filesystems/files, has
been around a very long time, and is very mature.  It's not quite as
flexible as rsnapshot and may require more disk space, but it is lightning
fast, even though limited to a single thread on Linux.  Why is it
lightning fast?  Because the bulk of the work is performed in kernel space
by the XFS driver, directly manipulating the filesystem -- no user space
execution or system calls.

See 'man xfsdump'.  Familiarize yourself with it and perform a test dump,
to a file, of a large (~1TB) directory/tree.  You'll see what we mean by
lightning fast compared to rsnapshot and other user space methods.  And
you'll actually see some IO throughput with this. ;)

--
Stan
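A hedged sketch of the kind of xfsdump test suggested above; the labels
and dump file path are placeholders, and 'man xfsdump' is authoritative
for the full option set:

    # Level-0 dump of the filesystem to a file (-L/-M are session and
    # media labels; without them xfsdump prompts interactively).
    xfsdump -l 0 -L testdump -M testmedia -f /scratch/infortrend.dump /infortrend

    # A subtree can be selected instead with -s <relative-path>.
    # To verify, restore into a scratch directory:
    xfsrestore -f /scratch/infortrend.dump /scratch/restore_test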
* Re: xfs_fsr, sunit, and swidth
  From: Dave Chinner @ 2013-03-15 11:45 UTC
  To: Stan Hoeppner; +Cc: Dave Hall, xfs@oss.sgi.com

On Fri, Mar 15, 2013 at 12:14:40AM -0500, Stan Hoeppner wrote:
> On 3/14/2013 9:59 AM, Dave Hall wrote:
> Looks good.  75% is close to tickling the free space fragmentation
> dragon, but you're not there yet.

Don't be so sure ;)

>> Filesystem      Inodes      IUsed      IFree IUse% Mounted on
>> /dev/sdb1   5469091840 1367746380 4101345460   26% /infortrend
>
> Plenty of free inodes.
>
>> # xfs_db -r -c freesp /dev/sdb1
>>    from      to extents     blocks   pct
>>       1       1  832735     832735  0.05
>>       2       3  432183    1037663  0.06
>>       4       7  365573    1903965  0.11
>>       8      15  352402    3891608  0.23
>>      16      31  332762    7460486  0.43
>>      32      63  300571   13597941  0.79
>>      64     127  233778   20900655  1.21
>>     128     255  152003   27448751  1.59
>>     256     511  112673   40941665  2.37
>>     512    1023   82262   59331126  3.43
>>    1024    2047   53238   76543454  4.43
>>    2048    4095   34092   97842752  5.66
>>    4096    8191   22743  129915842  7.52
>>    8192   16383   14453  162422155  9.40
>>   16384   32767    8501  190601554 11.03
>>   32768   65535    4695  210822119 12.20
>>   65536  131071    2615  234787546 13.59
>>  131072  262143    1354  237684818 13.76
>>  262144  524287     470  160228724  9.27
>>  524288 1048575      74   47384798  2.74
>> 1048576 2097151       1    2097122  0.12
>
> Your free space map isn't completely horrible given you're at 75%
> capacity.  It looks like most of the free space is in chunks 32MB and
> larger.  Those 14.8m files have a mean size of ~1.22MB, which suggests
> most of the files are small, so you shouldn't be having high seek load
> (thus latency) during allocation.

FWIW, you can't really tell how bad the freespace fragmentation is from
global output like this.  All of the large contiguous free space might be
in one or two AGs, and the others might be badly fragmented.  Hence you
need to at least sample a few AGs to determine if this is representative
of the freespace in each AG....

As it is, the above output raises alarms for me.  What I see is that the
number of small extents massively outnumbers the large extents.  The fact
that there are roughly 2.5 million extents smaller than 63 blocks and that
there is only one freespace extent larger than 4GB indicates to me that
free space is substantially fragmented.  At 25% free space, that's roughly
250GB per AG, and if the largest freespace in most AGs is less than 4GB in
length, then free space is not contiguous.  I.e. free space appears to be
heavily weighted towards small extents...

So, the above output would lead me to investigate the freespace layout
more deeply, to determine if this is going to affect the workload that is
being run...

Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: xfs_fsr, sunit, and swidth
  From: Stan Hoeppner @ 2013-03-16 04:47 UTC
  To: Dave Chinner; +Cc: Dave Hall, xfs@oss.sgi.com

On 3/15/2013 6:45 AM, Dave Chinner wrote:
> On Fri, Mar 15, 2013 at 12:14:40AM -0500, Stan Hoeppner wrote:
>> Looks good.  75% is close to tickling the free space fragmentation
>> dragon, but you're not there yet.
>
> Don't be so sure ;)

The only thing I'm sure of is that I'll always be learning something new
about XFS and how to troubleshoot it. ;)

>>> # xfs_db -r -c freesp /dev/sdb1
>>> [freesp histogram snipped; quoted in full earlier in the thread]
>>
>> Your free space map isn't completely horrible given you're at 75%
>> capacity.  ...
>
> FWIW, you can't really tell how bad the freespace fragmentation is from
> global output like this.

True.

> All of the large contiguous free space might be in one or two AGs, and
> the others might be badly fragmented.  Hence you need to at least
> sample a few AGs to determine if this is representative of the
> freespace in each AG....

What would be representative of 26 AGs?  First, middle, last?  So Mr. Hall
would execute:

~$ xfs_db -r /dev/sdb1
xfs_db> freesp -a0
...
xfs_db> freesp -a13
...
xfs_db> freesp -a25
...
xfs_db> quit

> As it is, the above output raises alarms for me.  What I see is that
> the number of small extents massively outnumbers the large extents.
> The fact that there are roughly 2.5 million extents smaller than 63
> blocks and that there is only one freespace extent larger than 4GB
> indicates to me that free space is substantially fragmented.  At 25%
> free space, that's roughly 250GB per AG, and if the largest freespace
> in most AGs is less than 4GB in length, then free space is not
> contiguous.  I.e. free space appears to be heavily weighted towards
> small extents...

It didn't raise alarms for me.  This is an rsnapshot workload with
millions of small files.  For me it was a foregone conclusion he'd have
serious fragmentation.  What I was looking at is whether it's severe
enough to be a factor in his stated problem.  I don't think it is.  In
fact I think it's completely unrelated, which is why I didn't go into
deeper analysis of this.  Though I could be incorrect. ;)

> So, the above output would lead me to investigate the freespace layout
> more deeply, to determine if this is going to affect the workload that
> is being run...

May be time to hold class again, Dave, as I'm probably missing something.
His slowdown is serial hardlink creation with "cp -al" of many millions of
files.  Hardlinks are metadata structures, which means this workload
modifies btrees and inodes, not extents, right?

XFS directory metadata is stored closely together in each AG, correct?
'cp -al' is going to walk directories in order, which means we're going to
have good read caching of the directory information, thus little to no
random read IO.  The cp is then going to create a hardlink per file.  Now,
even with the default 4KB write alignment, we should be getting a large
bundle of hardlinks per write.  And I would think the 512MB BBWC on the
array controller, if the firmware is decent, should do a good job of
merging these to mitigate RMW cycles.

The OP is seeing 100% CPU for the cp operation, almost no IO, and no
iowait.  If XFS or RMW were introducing any latency I'd think we'd see
some iowait.

Thus I believe at this point that the problem is those millions of serial
user space calls in a single Perl thread causing the high CPU burn, little
IO, and long run time -- not XFS nor the storage.  And I think the OP came
to this conclusion as well, without waiting on our analysis of his
filesystem.

Regardless of the OP's course of action, I of course welcome critique of
my analysis, so I learn new things and improve for future cases --
specifically WRT high metadata modification workloads on parity RAID
storage, which is what this OP could actually have if he runs many
rsnapshots in parallel.  With 32 cores/64 threads and 128GB RAM he can
certainly generate much higher rsnapshot load on his filesystem and
storage, if he chooses to.

--
Stan
* Re: xfs_fsr, sunit, and swidth
  From: Dave Chinner @ 2013-03-16 07:21 UTC
  To: Stan Hoeppner; +Cc: Dave Hall, xfs@oss.sgi.com

On Fri, Mar 15, 2013 at 11:47:08PM -0500, Stan Hoeppner wrote:
> On 3/15/2013 6:45 AM, Dave Chinner wrote:
>> All of the large contiguous free space might be in one or two AGs, and
>> the others might be badly fragmented.  Hence you need to at least
>> sample a few AGs to determine if this is representative of the
>> freespace in each AG....
>
> What would be representative of 26 AGs?  First, middle, last?  So Mr.
> Hall would execute:
>
> ~$ xfs_db -r /dev/sdb1
> xfs_db> freesp -a0
> ...
> xfs_db> freesp -a13
> ...
> xfs_db> freesp -a25
> ...
> xfs_db> quit

Yup, though I normally just run something like:

# for i in `seq 0 1 <agcount - 1>`; do
>     xfs_db -r -c "freesp -a $i" <dev>
> done

to look at them all quickly...

>> As it is, the above output raises alarms for me.  What I see is that
>> the number of small extents massively outnumbers the large extents.
>> The fact that there are roughly 2.5 million extents smaller than 63
>> blocks and that there is only one freespace extent larger than 4GB
>> indicates to me that free space is substantially fragmented.
>
> It didn't raise alarms for me.  This is an rsnapshot workload with
> millions of small files.  For me it was a foregone conclusion he'd have
> serious fragmentation.  What I was looking at is whether it's severe
> enough to be a factor in his stated problem.  I don't think it is.  In
> fact I think it's completely unrelated, which is why I didn't go into
> deeper analysis of this.  Though I could be incorrect. ;)

Ok, so what size extents is the metadata held in?  1-4 filesystem block
extents.  So, when we do a by-size freespace btree lookup, we don't find a
large freespace to allocate from.  So we fall back to a by-blkno search
down the freespace btree to find a nearby block of sufficient size.  That
search runs until we run off one end of the freespace btree.  And when
this might have to walk along several tens of thousands of btree records,
each allocation will consume a *lot* of CPU time.  How much?  Well,
compared to finding a large freespace extent, think orders of magnitude
more CPU overhead per allocation...

>> So, the above output would lead me to investigate the freespace layout
>> more deeply, to determine if this is going to affect the workload that
>> is being run...
>
> May be time to hold class again, Dave, as I'm probably missing
> something.  His slowdown is serial hardlink creation with "cp -al" of
> many millions of files.  Hardlinks are metadata structures, which means
> this workload modifies btrees and inodes, not extents, right?

It modifies directories and inodes, and adding directory entries requires
allocation of new directory blocks, and that requires scanning of the
freespace trees....

> XFS directory metadata is stored closely together in each AG, correct?
> 'cp -al' is going to walk directories in order, which means we're going
> to have good read caching of the directory information, thus little to
> no random read IO.

Not if the directory is fragmented.  If freespace is fragmented, then
there's a good chance that directory blocks are not going to have good
locality, though the effect of that will be minimised by the directory
block readahead that is done.

> The cp is then going to create a hardlink per file.  Now, even with the
> default 4KB write alignment, we should be getting a large bundle of
> hardlinks per write.  And I would think the 512MB BBWC on the array
> controller, if the firmware is decent, should do a good job of merging
> these to mitigate RMW cycles.

It's possible, but I would expect the lack of IO to be caused by the fact
that modification is CPU bound.  I.e. it's taking so long for every hard
link to be created (on average) that the IO subsystem can handle the
read/write IO demands with ease, because there is relatively little IO
being issued.

> The OP is seeing 100% CPU for the cp operation, almost no IO, and no
> iowait.  If XFS or RMW were introducing any latency I'd think we'd see
> some iowait.

Right, so that leads to the conclusion that the freespace fragmentation is
definitely a potential cause of the excessive CPU usage....

> Thus I believe at this point that the problem is those millions of
> serial user space calls in a single Perl thread causing the high CPU
> burn, little IO, and long run time -- not XFS nor the storage.  And I
> think the OP came to this conclusion as well, without waiting on our
> analysis of his filesystem.

Using perf to profile the kernel while the cp -al workload is running will
tell us exactly where the CPU is being burnt.  That will confirm the
analysis, or point us at some other issue that is causing excessive CPU
burn...

Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
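One hedged way to gauge the size of the free space btrees being walked is
to look at each AG's AGF header; the field names below are as printed by
xfs_db on filesystems of this vintage, and the device path is the one from
this thread, used illustratively:

    # 'bnolevel' and 'cntlevel' are the heights of the by-block-number
    # and by-size free space btrees; 'freeblks' and 'longest' show the
    # total free blocks and the longest free extent in the AG.
    xfs_db -r -c "agf 0" -c "print" /dev/sdb1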
* Re: xfs_fsr, sunit, and swidth
  From: Stan Hoeppner @ 2013-03-16 11:45 UTC
  To: Dave Chinner; +Cc: Dave Hall, xfs@oss.sgi.com

On 3/16/2013 2:21 AM, Dave Chinner wrote:
> Yup, though I normally just run something like:
>
> # for i in `seq 0 1 <agcount - 1>`; do
> >     xfs_db -r -c "freesp -a $i" <dev>
> > done
>
> to look at them all quickly...

Ahh, you have to put the xfs_db command in quotes if it has args.  I kept
getting an error when using -a in my command line.  Thanks.

Your command line will give histograms for all 26 AGs.  This isn't
sampling just a few as you suggested.  Do we generally want to have users
dump histograms of all their AGs to the mailing list?  Or will sampling
do?  In this case something like this?

~$ for i in 0 8 17 25; do xfs_db -r -c "freesp -a $i" /dev/sdb1; done

> Ok, so what size extents is the metadata held in?  1-4 filesystem block
> extents.

So, 4KB to 16KB.  How many of the hard links being created can we store in
each?

> So, when we do a by-size freespace btree lookup, we don't find a large
> freespace to allocate from.  So we fall back to a by-blkno search down
> the freespace btree to find a nearby block of sufficient size.

If we only need a free block of 4-16KB for our hardlinks, nearly any of
his free space would be usable, wouldn't it?

> That search runs until we run off one end of the freespace btree.  And
> when this might have to walk along several tens of thousands of btree
> records, each allocation will consume a *lot* of CPU time.  How much?
> Well, compared to finding a large freespace extent, think orders of
> magnitude more CPU overhead per allocation...

I follow you, up to a point.  I'm disconnected between the free block size
requirements for metadata and having to potentially walk two entire btrees
looking for a free chunk of sufficient size.  It seems to me every free
extent in his histogram is usable for hardlink metadata if our minimum is
one filesystem block, or 4KB.  WRT CPU burn, I'll address my thoughts on
that much further below.

> It modifies directories and inodes, and adding directory entries
> requires allocation of new directory blocks, and that requires scanning
> of the freespace trees....

Got it.

> Not if the directory is fragmented.  If freespace is fragmented, then
> there's a good chance that directory blocks are not going to have good
> locality, though the effect of that will be minimised by the directory
> block readahead that is done.

Got it.  And given this box has 128GB of RAM, there's probably a lot of
directory metadata already in cache.

> It's possible, but I would expect the lack of IO to be caused by the
> fact that modification is CPU bound.  I.e. it's taking so long for
> every hard link to be created (on average) that the IO subsystem can
> handle the read/write IO demands with ease, because there is relatively
> little IO being issued.

The OP stated one CPU core is throttled, two have very light load, and the
other 29 are idle.  The throttled core must be the one on which the cp
code is executing.  The kernel isn't going to schedule the XFS btree
walking thread(s) on the same core, is it?  So if no other cores are
anywhere near peak, isn't it safe to assume the workload isn't CPU bound
due to free space btree walking?  I should have thought of this earlier
when he described the load on his cores...

> Right, so that leads to the conclusion that the freespace fragmentation
> is definitely a potential cause of the excessive CPU usage....

Is it still a candidate, given what I describe above WRT XFS thread
scheduling, and that only one core is hammered?

> Using perf to profile the kernel while the cp -al workload is running
> will tell us exactly where the CPU is being burnt.  That will confirm
> the analysis, or point us at some other issue that is causing excessive
> CPU burn...

I'd like to see this as well.  Because if the bottleneck isn't XFS, I'd
like to understand how a 2GHz core with 18MB of L3 cache is being
completely consumed by a cp command which is doing nothing but creating
hardlinks -- while the IO rate is almost nothing.

--
Stan
* Re: xfs_fsr, sunit, and swidth
  From: Dave Hall @ 2013-03-25 17:00 UTC
  To: xfs@oss.sgi.com

On 03/16/2013 03:21 AM, Dave Chinner wrote:
> Using perf to profile the kernel while the cp -al workload is running
> will tell us exactly where the CPU is being burnt.  That will confirm
> the analysis, or point us at some other issue that is causing excessive
> CPU burn...

Dave, which perf command(s) would you like me to run?  (I'm familiar with
the concept behind this kind of tool, but I haven't worked with this one
before.)

Also, what would you like me to do with the xfs_db freesp output for 26
agroups?

-Dave

--
Dave Hall
Binghamton University
kdhall@binghamton.edu
607-760-2328 (Cell)
607-777-4641 (Office)
* Re: xfs_fsr, sunit, and swidth
  From: Stan Hoeppner @ 2013-03-27 21:16 UTC
  To: Dave Hall; +Cc: xfs@oss.sgi.com

On 3/25/2013 12:00 PM, Dave Hall wrote:
> Dave, which perf command(s) would you like me to run?  (I'm familiar
> with the concept behind this kind of tool, but I haven't worked with
> this one before.)

I'll let Dave answer this one.

> Also, what would you like me to do with the xfs_db freesp output for 26
> agroups?

A pastebin link should be fine.  Only a couple of people will be looking
at it.  I don't see value in free space maps of 26 AGs being archived.

FWIW, it's probably best to reply-all instead of just to the list.
Sometimes posts get lost in the noise.  Not sure if that's the case here,
but it's been a couple of days with no response from Dave C, and the
answers to these questions are very short.  Thus I'm guessing he missed
your post, so I'm CC'ing him here.

--
Stan
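For reference, a sketch of typical perf invocations for this kind of
kernel-side profiling, run while the cp -al is grinding (the capture
duration is arbitrary):

    # Live, kernel-only view of where CPU time is going
    # (-U hides user space symbols):
    perf top -U

    # Or capture ~30 seconds system-wide with call graphs, then report:
    perf record -a -g sleep 30
    perf report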
* Re: xfs_fsr, sunit, and swidth 2013-03-27 21:16 ` Stan Hoeppner @ 2013-03-29 19:59 ` Dave Hall 2013-03-31 1:22 ` Dave Chinner 0 siblings, 1 reply; 32+ messages in thread From: Dave Hall @ 2013-03-29 19:59 UTC (permalink / raw) To: stan; +Cc: xfs@oss.sgi.com Dave, Stan, Here is the link for perf top -U: http://pastebin.com/JYLXYWki. The ag report is at http://pastebin.com/VzziSa4L. Interestingly, the backups ran fast a couple times this week. Once under 9 hours. Today it looks like it's running long again. -Dave Dave Hall Binghamton University kdhall@binghamton.edu 607-760-2328 (Cell) 607-777-4641 (Office) On 03/27/2013 05:16 PM, Stan Hoeppner wrote: > On 3/25/2013 12:00 PM, Dave Hall wrote: > >> On 03/16/2013 03:21 AM, Dave Chinner wrote: >> >>> Using perf to profile the kernel while the cp -al workload is >>> running will tell us exactly where the CPU is being burnt. That >>> will confirm the analysis, or point us at some other issue that is >>> causing excessive CPU burn... >>> >>> >> Dave, which perf command(s) would you like me to run? (I'm familiar >> with the concept behind this kind of tool, but I haven't worked with >> this one before). >> > I'll let Dave answer this one. > > >> Also, what would you like me to do with the xfs_db freesp output for 26 >> agroups? >> > A pastebin link should be fine. Only a couple of people will be looking > at it. I don't see value in free space maps of 26 AGs being archived. > > FWIW, it's probably best to reply-all instead of just to the list. > Sometimes posts get lost in the noise. Not sure if that's the case > here, but it's been a couple of days with no response from Dave C, and > the answers to these questions are very short. Thus I'm guessing he > missed your post, so I'm CC'ing him here. > > _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: xfs_fsr, sunit, and swidth 2013-03-29 19:59 ` Dave Hall @ 2013-03-31 1:22 ` Dave Chinner 2013-04-02 10:34 ` Hans-Peter Jansen 2013-04-03 14:25 ` Dave Hall 0 siblings, 2 replies; 32+ messages in thread From: Dave Chinner @ 2013-03-31 1:22 UTC (permalink / raw) To: Dave Hall; +Cc: stan, xfs@oss.sgi.com On Fri, Mar 29, 2013 at 03:59:46PM -0400, Dave Hall wrote: > Dave, Stan, > > Here is the link for perf top -U: http://pastebin.com/JYLXYWki. > The ag report is at http://pastebin.com/VzziSa4L. Interestingly, > the backups ran fast a couple times this week. Once under 9 hours. > Today it looks like it's running long again. 12.38% [xfs] [k] xfs_btree_get_rec 11.65% [xfs] [k] _xfs_buf_find 11.29% [xfs] [k] xfs_btree_increment 7.88% [xfs] [k] xfs_inobt_get_rec 5.40% [kernel] [k] intel_idle 4.13% [xfs] [k] xfs_btree_get_block 4.09% [xfs] [k] xfs_dialloc 3.21% [xfs] [k] xfs_btree_readahead 2.00% [xfs] [k] xfs_btree_rec_offset 1.50% [xfs] [k] xfs_btree_rec_addr Inode allocation searches, looking for an inode near to the parent directory. What this indicates is that you have lots of sparsely allocated inode chunks on disk. i.e. each 64-inode chunk has some free inodes in it, and some used inodes. This is likely due to random removal of inodes as you delete old backups and link counts drop to zero. Because we only index inodes on "allocated chunks", finding a chunk that has a free inode can be like finding a needle in a haystack. There are heuristics used to stop searches from consuming too much CPU, but it still can be quite slow when you repeatedly hit those paths.... I don't have an answer that will magically speed things up for you right now... Cheers, Dave. -- Dave Chinner david@fromorbit.com _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 32+ messages in thread
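For anyone wanting to put rough numbers on this on their own filesystem, one low-impact check is to compare allocated vs. free inode counts per AG straight from the AGI headers. This is only a sketch: the device path is illustrative, it assumes the count/freecount field names as xfs_db prints them, and it should be run read-only:

  # inodes tracked by AG 0's inode btree vs. free slots in its chunks
  xfs_db -r -c "agi 0" -c "print count freecount" /dev/sdb1

A freecount that stays high across many AGs is consistent with the sparsely populated chunks described above, since every one of those free slots lives in a chunk the allocator may have to walk past.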
* Re: xfs_fsr, sunit, and swidth 2013-03-31 1:22 ` Dave Chinner @ 2013-04-02 10:34 ` Hans-Peter Jansen 2013-04-03 14:25 ` Dave Hall 1 sibling, 0 replies; 32+ messages in thread From: Hans-Peter Jansen @ 2013-04-02 10:34 UTC (permalink / raw) To: xfs; +Cc: Dave Hall, stan On Sonntag, 31. März 2013 12:22:31 Dave Chinner wrote: > On Fri, Mar 29, 2013 at 03:59:46PM -0400, Dave Hall wrote: > > Dave, Stan, > > > > Here is the link for perf top -U: http://pastebin.com/JYLXYWki. > > The ag report is at http://pastebin.com/VzziSa4L. Interestingly, > > the backups ran fast a couple times this week. Once under 9 hours. > > Today it looks like it's running long again. > > 12.38% [xfs] [k] xfs_btree_get_rec > 11.65% [xfs] [k] _xfs_buf_find > 11.29% [xfs] [k] xfs_btree_increment > 7.88% [xfs] [k] xfs_inobt_get_rec > 5.40% [kernel] [k] intel_idle > 4.13% [xfs] [k] xfs_btree_get_block > 4.09% [xfs] [k] xfs_dialloc > 3.21% [xfs] [k] xfs_btree_readahead > 2.00% [xfs] [k] xfs_btree_rec_offset > 1.50% [xfs] [k] xfs_btree_rec_addr > > Inode allocation searches, looking for an inode near to the parent > directory. > > What this indicates is that you have lots of sparsely allocated inode > chunks on disk. i.e. each 64-inode chunk has some free inodes in it, > and some used inodes. This is likely due to random removal of inodes > as you delete old backups and link counts drop to zero. Because we > only index inodes on "allocated chunks", finding a chunk that has a > free inode can be like finding a needle in a haystack. There are > heuristics used to stop searches from consuming too much CPU, but it > still can be quite slow when you repeatedly hit those paths.... > > I don't have an answer that will magically speed things up for > you right now... Hmm, unfortunately, this access pattern is pretty common; at least all "cp -al & rsync" based backup solutions will suffer from it after a while. I noticed that the "removing old backups" part is also taking *ages* in this scenario. I had to manually remove parts of a backup (subtrees with a few million ordinary files, massively hardlinked as usual), which took 4-5 hours for each run on a Hitachi Ultrastar 7K4000 drive. For the 8 subtrees, that finally took one and a half days, freeing about 500 GB of space. Oh well. The question is: is it (logically) possible to reorganize the fragmented inode allocation space with a specialized tool (to be implemented) that lays out the allocation space in such a way that it matches XFS's earliest "expectations", or does that violate some deeper FS logic I'm not aware of? I have to mention that I haven't made any tests with other file systems, as playing games with backups ranges very low on my scale of sensible tests, but experience has shown that XFS usually sucks less than its alternatives, even if the access pattern doesn't match its primary optimization domain. Hence, implementing such a tool makes sense, where "least sucking" should be aimed for. Cheers, Pete _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: xfs_fsr, sunit, and swidth 2013-03-31 1:22 ` Dave Chinner 2013-04-02 10:34 ` Hans-Peter Jansen @ 2013-04-03 14:25 ` Dave Hall 2013-04-12 17:25 ` Dave Hall 1 sibling, 1 reply; 32+ messages in thread From: Dave Hall @ 2013-04-03 14:25 UTC (permalink / raw) To: Dave Chinner; +Cc: stan, xfs@oss.sgi.com So, assuming entropy has reached critical mass and that there is no easy fix for this physical file system, what would happen if I replicated this data to a new disk array? When I say 'replicate', I'm not talking about xfsdump. I'm talking about running a series of cp -al/rsync operations (or maybe rsync with --link-dest) that will precisely reproduce the linked data on my current array. All of the inodes would be re-allocated. There wouldn't be any (or at least not many) deletes. I am hoping that if I do this the inode fragmentation will be significantly reduced on the target as compared to the source. Of course over time it may re-fragment, but with two arrays I can always wipe one and reload it. -Dave Dave Hall Binghamton University kdhall@binghamton.edu 607-760-2328 (Cell) 607-777-4641 (Office) On 03/30/2013 09:22 PM, Dave Chinner wrote: > On Fri, Mar 29, 2013 at 03:59:46PM -0400, Dave Hall wrote: > >> Dave, Stan, >> >> Here is the link for perf top -U: http://pastebin.com/JYLXYWki. >> The ag report is at http://pastebin.com/VzziSa4L. Interestingly, >> the backups ran fast a couple times this week. Once under 9 hours. >> Today it looks like it's running long again. >> > 12.38% [xfs] [k] xfs_btree_get_rec > 11.65% [xfs] [k] _xfs_buf_find > 11.29% [xfs] [k] xfs_btree_increment > 7.88% [xfs] [k] xfs_inobt_get_rec > 5.40% [kernel] [k] intel_idle > 4.13% [xfs] [k] xfs_btree_get_block > 4.09% [xfs] [k] xfs_dialloc > 3.21% [xfs] [k] xfs_btree_readahead > 2.00% [xfs] [k] xfs_btree_rec_offset > 1.50% [xfs] [k] xfs_btree_rec_addr > > Inode allocation searches, looking for an inode near to the parent > directory. > > What this indicates is that you have lots of sparsely allocated inode > chunks on disk. i.e. each 64-inode chunk has some free inodes in it, > and some used inodes. This is likely due to random removal of inodes > as you delete old backups and link counts drop to zero. Because we > only index inodes on "allocated chunks", finding a chunk that has a > free inode can be like finding a needle in a haystack. There are > heuristics used to stop searches from consuming too much CPU, but it > still can be quite slow when you repeatedly hit those paths.... > > I don't have an answer that will magically speed things up for > you right now... > > Cheers, > > Dave. > _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 32+ messages in thread
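To make the proposed replication concrete, a minimal sketch of the --link-dest variant might look like the following. The snapshot names and paths are hypothetical (rsnapshot's real directory names and retention levels would differ), and it assumes snapshots are copied oldest-first so each pass can link against the one before it:

  #!/bin/sh
  # copy snapshots oldest-first; unchanged files hardlink against
  # the snapshot copied in the previous pass
  prev=""
  for snap in daily.6 daily.5 daily.4 daily.3 daily.2 daily.1 daily.0; do
      if [ -n "$prev" ]; then
          rsync -aH --link-dest="/new/$prev" "/old/$snap/" "/new/$snap/"
      else
          rsync -aH "/old/$snap/" "/new/$snap/"
      fi
      prev="$snap"
  done

Because each snapshot is linked against the copy made just before it, files unchanged across consecutive snapshots end up sharing one inode on the target, which is exactly the property the hard-link forest depends on.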
* Re: xfs_fsr, sunit, and swidth 2013-04-03 14:25 ` Dave Hall @ 2013-04-12 17:25 ` Dave Hall 2013-04-13 0:45 ` Dave Chinner 2013-04-13 0:51 ` Stan Hoeppner 0 siblings, 2 replies; 32+ messages in thread From: Dave Hall @ 2013-04-12 17:25 UTC (permalink / raw) To: stan; +Cc: xfs@oss.sgi.com Stan, Did this post get lost in the shuffle? Looking at it I think it could have been a bit unclear. What I need to do anyway is have a second, off-site copy of my backup data. So I'm going to be building a second array. In copying, in order to preserve the hard link structure of the source array I'd have to run a sequence of cp -al / rsync calls that would mimic what rsnapshot did to get me to where I am right now. (Note that I could also potentially use rsync --link-dest.) So the question is how would the target xfs file system fare as far as my inode fragmentation situation is concerned? I'm hoping that since the target would be a fresh file system, and since during the 'copy' phase I'd only be adding inodes, that the inode allocation would be more compact and orderly than what I have on the source array. What do you think? Thanks. -Dave Dave Hall Binghamton University kdhall@binghamton.edu 607-760-2328 (Cell) 607-777-4641 (Office) On 04/03/2013 10:25 AM, Dave Hall wrote: > So, assuming entropy has reached critical mass and that there is no > easy fix for this physical file system, what would happen if I > replicated this data to a new disk array? When I say 'replicate', I'm > not talking about xfsdump. I'm talking about running a series of cp > -al/rsync operations (or maybe rsync with --link-dest) that will > precisely reproduce the linked data on my current array. All of the > inodes would be re-allocated. There wouldn't be any (or at least not > many) deletes. > > I am hoping that if I do this the inode fragmentation will be > significantly reduced on the target as compared to the source. Of > course over time it may re-fragment, but with two arrays I can always > wipe one and reload it. > > -Dave > > Dave Hall > Binghamton University > kdhall@binghamton.edu > 607-760-2328 (Cell) > 607-777-4641 (Office) > > > On 03/30/2013 09:22 PM, Dave Chinner wrote: >> On Fri, Mar 29, 2013 at 03:59:46PM -0400, Dave Hall wrote: >>> Dave, Stan, >>> >>> Here is the link for perf top -U: http://pastebin.com/JYLXYWki. >>> The ag report is at http://pastebin.com/VzziSa4L. Interestingly, >>> the backups ran fast a couple times this week. Once under 9 hours. >>> Today it looks like it's running long again. >> 12.38% [xfs] [k] xfs_btree_get_rec >> 11.65% [xfs] [k] _xfs_buf_find >> 11.29% [xfs] [k] xfs_btree_increment >> 7.88% [xfs] [k] xfs_inobt_get_rec >> 5.40% [kernel] [k] intel_idle >> 4.13% [xfs] [k] xfs_btree_get_block >> 4.09% [xfs] [k] xfs_dialloc >> 3.21% [xfs] [k] xfs_btree_readahead >> 2.00% [xfs] [k] xfs_btree_rec_offset >> 1.50% [xfs] [k] xfs_btree_rec_addr >> >> Inode allocation searches, looking for an inode near to the parent >> directory. >> >> What this indicates is that you have lots of sparsely allocated inode >> chunks on disk. i.e. each 64-inode chunk has some free inodes in it, >> and some used inodes. This is likely due to random removal of inodes >> as you delete old backups and link counts drop to zero. Because we >> only index inodes on "allocated chunks", finding a chunk that has a >> free inode can be like finding a needle in a haystack. There are >> heuristics used to stop searches from consuming too much CPU, but it >> still can be quite slow when you repeatedly hit those paths.... 
>> >> I don't have an answer that will magically speed things up for >> you right now... >> >> Cheers, >> >> Dave. _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: xfs_fsr, sunit, and swidth 2013-04-12 17:25 ` Dave Hall @ 2013-04-13 0:45 ` Dave Chinner 2013-04-13 0:51 ` Stan Hoeppner 1 sibling, 0 replies; 32+ messages in thread From: Dave Chinner @ 2013-04-13 0:45 UTC (permalink / raw) To: Dave Hall; +Cc: stan, xfs@oss.sgi.com On Fri, Apr 12, 2013 at 01:25:22PM -0400, Dave Hall wrote: > Stan, > > Did this post get lost in the shuffle? Looking at it I think it > could have been a bit unclear. What I need to do anyway is have a > second, off-site copy of my backup data. So I'm going to be > building a second array. In copying, in order to preserve the hard > link structure of the source array I'd have to run a sequence of cp > -al / rsync calls that would mimic what rsnapshot did to get me to > where I am right now. (Note that I could also potentially use rsync > --link-dest.) > So the question is how would the target xfs file system fare as far > as my inode fragmentation situation is concerned? I'm hoping that > since the target would be a fresh file system, and since during the > 'copy' phase I'd only be adding inodes, that the inode allocation > would be more compact and orderly than what I have on the source > array. What do you think? Sure, it would be to start with, but you'll eventually end up in the same place. Removing links from the forest is what leads to the sparse free inode space, so even starting with a dense inode allocation pattern, it'll turn sparse the moment you remove backups from the forest.... Cheers, Dave. -- Dave Chinner david@fromorbit.com _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: xfs_fsr, sunit, and swidth 2013-04-12 17:25 ` Dave Hall 2013-04-13 0:45 ` Dave Chinner @ 2013-04-13 0:51 ` Stan Hoeppner 2013-04-15 20:35 ` Dave Hall 1 sibling, 1 reply; 32+ messages in thread From: Stan Hoeppner @ 2013-04-13 0:51 UTC (permalink / raw) To: Dave Hall; +Cc: xfs@oss.sgi.com On 4/12/2013 12:25 PM, Dave Hall wrote: > Stan, > > Did this post get lost in the shuffle? Looking at it I think it could > have been a bit unclear. What I need to do anyway is have a second, > off-site copy of my backup data. So I'm going to be building a second > array. In copying, in order to preserve the hard link structure of the > source array I'd have to run a sequence of cp -al / rsync calls that > would mimic what rsnapshot did to get me to where I am right now. (Note > that I could also potentially use rsync --link-dest.) > > So the question is how would the target xfs file system fare as far as > my inode fragmentation situation is concerned? I'm hoping that since > the target would be a fresh file system, and since during the 'copy' > phase I'd only be adding inodes, that the inode allocation would be more > compact and orderly than what I have on the source array. What do > you think? The question isn't what it will look like initially, as your inodes shouldn't be sparsely allocated as with your current aged filesystem. The question is how quickly the problem will arise on the new filesystem as you free inodes. I don't have the answer to that question. There's no way to predict this that I know of. -- Stan > Thanks. > > -Dave > > Dave Hall > Binghamton University > kdhall@binghamton.edu > 607-760-2328 (Cell) > 607-777-4641 (Office) > > > On 04/03/2013 10:25 AM, Dave Hall wrote: >> So, assuming entropy has reached critical mass and that there is no >> easy fix for this physical file system, what would happen if I >> replicated this data to a new disk array? When I say 'replicate', I'm >> not talking about xfsdump. I'm talking about running a series of cp >> -al/rsync operations (or maybe rsync with --link-dest) that will >> precisely reproduce the linked data on my current array. All of the >> inodes would be re-allocated. There wouldn't be any (or at least not >> many) deletes. >> >> I am hoping that if I do this the inode fragmentation will be >> significantly reduced on the target as compared to the source. Of >> course over time it may re-fragment, but with two arrays I can always >> wipe one and reload it. >> >> -Dave >> >> Dave Hall >> Binghamton University >> kdhall@binghamton.edu >> 607-760-2328 (Cell) >> 607-777-4641 (Office) >> >> >> On 03/30/2013 09:22 PM, Dave Chinner wrote: >>> On Fri, Mar 29, 2013 at 03:59:46PM -0400, Dave Hall wrote: >>>> Dave, Stan, >>>> >>>> Here is the link for perf top -U: http://pastebin.com/JYLXYWki. >>>> The ag report is at http://pastebin.com/VzziSa4L. Interestingly, >>>> the backups ran fast a couple times this week. Once under 9 hours. >>>> Today it looks like it's running long again. >>> 12.38% [xfs] [k] xfs_btree_get_rec >>> 11.65% [xfs] [k] _xfs_buf_find >>> 11.29% [xfs] [k] xfs_btree_increment >>> 7.88% [xfs] [k] xfs_inobt_get_rec >>> 5.40% [kernel] [k] intel_idle >>> 4.13% [xfs] [k] xfs_btree_get_block >>> 4.09% [xfs] [k] xfs_dialloc >>> 3.21% [xfs] [k] xfs_btree_readahead >>> 2.00% [xfs] [k] xfs_btree_rec_offset >>> 1.50% [xfs] [k] xfs_btree_rec_addr >>> >>> Inode allocation searches, looking for an inode near to the parent >>> directory. 
>>> >>> What this indicates is that you have lots of sparsely allocated inode >>> chunks on disk. i.e. each 64-inode chunk has some free inodes in it, >>> and some used inodes. This is likely due to random removal of inodes >>> as you delete old backups and link counts drop to zero. Because we >>> only index inodes on "allocated chunks", finding a chunk that has a >>> free inode can be like finding a needle in a haystack. There are >>> heuristics used to stop searches from consuming too much CPU, but it >>> still can be quite slow when you repeatedly hit those paths.... >>> >>> I don't have an answer that will magically speed things up for >>> you right now... >>> >>> Cheers, >>> >>> Dave. > > _______________________________________________ > xfs mailing list > xfs@oss.sgi.com > http://oss.sgi.com/mailman/listinfo/xfs _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: xfs_fsr, sunit, and swidth 2013-04-13 0:51 ` Stan Hoeppner @ 2013-04-15 20:35 ` Dave Hall 2013-04-16 1:45 ` Stan Hoeppner 2013-04-16 16:18 ` Dave Chinner 0 siblings, 2 replies; 32+ messages in thread From: Dave Hall @ 2013-04-15 20:35 UTC (permalink / raw) To: stan; +Cc: xfs@oss.sgi.com Stan, I understand that this will be an ongoing problem. It seems like all I could do at this point would be to 'manually defrag' my inodes the hard way by doing this 'copy' operation whenever things slow down. (Either that or go get my PhD in file systems and try to come up with a better inode management algorithm.) I will be keeping two copies of this data going forward anyway. Are there any other suggestions you might have at this time - xfs or otherwise? -Dave Dave Hall Binghamton University kdhall@binghamton.edu 607-760-2328 (Cell) 607-777-4641 (Office) On 04/12/2013 08:51 PM, Stan Hoeppner wrote: > On 4/12/2013 12:25 PM, Dave Hall wrote: > >> Stan, >> >> Did this post get lost in the shuffle? Looking at it I think it could >> have been a bit unclear. What I need to do anyway is have a second, >> off-site copy of my backup data. So I'm going to be building a second >> array. In copying, in order to preserve the hard link structure of the >> source array I'd have to run a sequence of cp -al / rsync calls that >> would mimic what rsnapshot did to get me to where I am right now. (Note >> that I could also potentially use rsync --link-dest.) >> >> So the question is how would the target xfs file system fare as far as >> my inode fragmentation situation is concerned? I'm hoping that since >> the target would be a fresh file system, and since during the 'copy' >> phase I'd only be adding inodes, that the inode allocation would be more >> compact and orderly than what I have on the source array. What do >> you think? >> > The question isn't what it will look like initially, as your inodes > shouldn't be sparsely allocated as with your current aged filesystem. > > The question is how quickly the problem will arise on the new filesystem > as you free inodes. I don't have the answer to that question. There's > no way to predict this that I know of. > > _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: xfs_fsr, sunit, and swidth 2013-04-15 20:35 ` Dave Hall @ 2013-04-16 1:45 ` Stan Hoeppner 0 siblings, 0 replies; 32+ messages in thread From: Stan Hoeppner @ 2013-04-16 1:45 UTC (permalink / raw) To: Dave Hall; +Cc: xfs@oss.sgi.com On 4/15/2013 3:35 PM, Dave Hall wrote: > Stan, > > I understand that this will be an ongoing problem. It seems like all I could do at this point would be to 'manually defrag' my inodes the hard way by doing this 'copy' operation whenever things slow down. (Either that or go get my PhD in file systems and try to come up with a better inode management algorithm.) I will be keeping two copies of this data going forward anyway. > > Are there any other suggestions you might have at this time - xfs or otherwise? I'm no expert in this particular area, so I'll simply give the sysadmin 101 perspective: Always pick the right tool for the job. If XFS isn't working satisfactorily for this job and no fix is forthcoming, I'd test EXT4 and JFS to see if either of them is more suitable for this job. The other option is to switch to a backup job that doesn't create/delete millions of hard links. There are likely other possibilities. -- Stan > -Dave > > Dave Hall > Binghamton University > kdhall@binghamton.edu > 607-760-2328 (Cell) > 607-777-4641 (Office) > > > On 04/12/2013 08:51 PM, Stan Hoeppner wrote: >> On 4/12/2013 12:25 PM, Dave Hall wrote: >> >>> Stan, >>> >>> Did this post get lost in the shuffle? Looking at it I think it could >>> have been a bit unclear. What I need to do anyway is have a second, >>> off-site copy of my backup data. So I'm going to be building a second >>> array. In copying, in order to preserve the hard link structure of the >>> source array I'd have to run a sequence of cp -al / rsync calls that >>> would mimic what rsnapshot did to get me to where I am right now. (Note >>> that I could also potentially use rsync --link-dest.) >>> >>> So the question is how would the target xfs file system fare as far as >>> my inode fragmentation situation is concerned? I'm hoping that since >>> the target would be a fresh file system, and since during the 'copy' >>> phase I'd only be adding inodes, that the inode allocation would be more >>> compact and orderly than what I have on the source array. What do >>> you think? >>> >> The question isn't what it will look like initially, as your inodes >> shouldn't be sparsely allocated as with your current aged filesystem. >> >> The question is how quickly the problem will arise on the new filesystem >> as you free inodes. I don't have the answer to that question. There's >> no way to predict this that I know of. >> >> > > _______________________________________________ > xfs mailing list > xfs@oss.sgi.com > http://oss.sgi.com/mailman/listinfo/xfs _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: xfs_fsr, sunit, and swidth 2013-04-15 20:35 ` Dave Hall 2013-04-16 1:45 ` Stan Hoeppner @ 2013-04-16 16:18 ` Dave Chinner 2015-02-22 23:35 ` XFS/LVM/Multipath on a single RAID volume Dave Hall 1 sibling, 1 reply; 32+ messages in thread From: Dave Chinner @ 2013-04-16 16:18 UTC (permalink / raw) To: Dave Hall; +Cc: stan, xfs@oss.sgi.com On Mon, Apr 15, 2013 at 04:35:38PM -0400, Dave Hall wrote: > Stan, > > I understand that this will be an ongoing problem. It seems like > all I could do at this point would be to 'manually defrag' my > inodes the hard way by doing this 'copy' operation whenever things > slow down. (Either that or go get my PhD in file systems and try to > come up with a better inode management algorithm.) No need, I know how to fix it for good. Just add a new btree that tracks free inodes, rather than having to scan the allocated inode tree to find free inodes. Shouldn't actually be too difficult to do, as it's a generic btree and the code to keep both btrees in sync is a copy of the way the two freespace btrees are kept in sync.... Cheers, Dave. -- Dave Chinner david@fromorbit.com _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 32+ messages in thread
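For what it's worth, this is the design that later shipped in mainline XFS as the free inode btree (finobt). On a kernel and xfsprogs recent enough to have the feature, a fresh backup filesystem could opt in at mkfs time; the device path here is illustrative:

  mkfs.xfs -m crc=1,finobt=1 /dev/sdX

This only helps where the whole stack is new enough to understand the feature, and, at least as originally shipped, it could not be retrofitted onto an existing filesystem.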
* XFS/LVM/Multipath on a single RAID volume 2013-04-16 16:18 ` Dave Chinner @ 2015-02-22 23:35 ` Dave Hall 2015-02-23 11:18 ` Emmanuel Florac 0 siblings, 1 reply; 32+ messages in thread From: Dave Hall @ 2015-02-22 23:35 UTC (permalink / raw) To: Dave Chinner; +Cc: stan, xfs@oss.sgi.com Dave, Stan, Not sure if you remember, but we corresponded for a while a couple of years ago about some performance problems I was having with XFS on a 26TB SAS-attached RAID box. If either of you is still working on XFS, I've got some new questions. Actually, what I've got is a new array to set up. Same size, but faster disks and a faster controller. It will replace the existing array as the primary backup volume. So since I have a fresh array that's not in production yet I was hoping to get some pointers on how to configure it to maximize XFS performance. In particular, I've seen a suggestion that a multipathed array should be sliced up into logical drives and pasted back together with LVM. Wondering also about putting the journal in a separate logical drive on the same array. I am able to set up a 2-way multipath right now, and I might be able to justify adding a second controller to the array to get a 4-way multipath going. Even if the LVM approach is the wrong one, I clearly have a rare chance to set this array up the right way. Please let me know if you have any suggestions. Thanks. -Dave Dave Hall Binghamton University kdhall@binghamton.edu 607-760-2328 (Cell) 607-777-4641 (Office) _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: XFS/LVM/Multipath on a single RAID volume 2015-02-22 23:35 ` XFS/LVM/Multipath on a single RAID volume Dave Hall @ 2015-02-23 11:18 ` Emmanuel Florac 2015-02-24 22:04 ` Dave Hall 0 siblings, 1 reply; 32+ messages in thread From: Emmanuel Florac @ 2015-02-23 11:18 UTC (permalink / raw) To: Dave Hall; +Cc: stan, xfs@oss.sgi.com Le Sun, 22 Feb 2015 18:35:19 -0500 Dave Hall <kdhall@binghamton.edu> écrivait: > So since I have a fresh array that's not in production yet I was > hoping to get some pointers on how to configure it to maximize XFS > performance. In particular, I've seen a suggestion that a > multipathed array should be sliced up into logical drives and pasted > back together with LVM. Wondering also about putting the journal in > a separate logical drive on the same array. What's the hardware configuration like? Before multipathing, you need to know if your RAID controller and disks can actually saturate your link. Generally SAS-attached enclosures are driven through a 4-way SFF-8088 cable, with a bandwidth of 4x 6Gbps (maximum throughput per link: 3 GB/s) or 4 x 12 Gbps (max throughput: 6 GB/s). > I am able to set up a 2-way multipath right now, and I might be able > to justify adding a second controller to the array to get a 4-way > multipath going. A multipath can double the throughput, provided that you have enough drives: you'll need about 24 7k RPM drives to saturate _one_ 4x6Gbps SAS link. If you have only 12 drives, dual attachment probably won't yield much. > Even if the LVM approach is the wrong one, I clearly have a rare > chance to set this array up the right way. Please let me know if you > have any suggestions. In my experience, software RAID-0 with md gives slightly better performance than LVM, though not much. -- ------------------------------------------------------------------------ Emmanuel Florac | Direction technique | Intellique | <eflorac@intellique.com> | +33 1 78 94 84 02 ------------------------------------------------------------------------ _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: XFS/LVM/Multipath on a single RAID volume 2015-02-23 11:18 ` Emmanuel Florac @ 2015-02-24 22:04 ` Dave Hall 2015-02-24 22:33 ` Dave Chinner 2015-02-25 11:21 ` Emmanuel Florac 1 sibling, 2 replies; 32+ messages in thread From: Dave Hall @ 2015-02-24 22:04 UTC (permalink / raw) To: Emmanuel Florac; +Cc: stan, xfs@oss.sgi.com Dave Hall Binghamton University kdhall@binghamton.edu 607-760-2328 (Cell) 607-777-4641 (Office) On 02/23/2015 06:18 AM, Emmanuel Florac wrote: > Le Sun, 22 Feb 2015 18:35:19 -0500 > Dave Hall<kdhall@binghamton.edu> écrivait: > > >> So since I have a fresh array that's not in production yet I was >> hoping to get some pointers on how to configure it to maximize XFS >> performance. In particular, I've seen a suggestion that a >> multipathed array should be sliced up into logical drives and pasted >> back together with LVM. Wondering also about putting the journal in >> a separate logical drive on the same array. >> > What's the hardware configuration like? Before multipathing, you need > to know if your RAID controller and disks can actually saturate your > link. Generally SAS-attached enclosures are driven through a 4-way > SFF-8088 cable, with a bandwidth of 4x 6Gbps (maximum throughput per > link: 3 GB/s) or 4 x 12 Gbps (max throughput: 6 GB/s). > > The new hardware is an Infortrend with 16 x 2TB 6Gbps SAS drives. It has one controller with dual 6Gbps SAS ports. The server currently has two 3Gbps SAS HBAs. On an existing array based on similar but slightly slower hardware, I'm getting miserable performance. The bottleneck seems to be on the server side. For specifics, the array is laid out as a single 26TB volume and attached by a single 3Gbps SAS. The server is a quad 8-core Xeon with 128GB RAM. The networking is all 10Gb. The application is rsnapshot which is essentially a series of rsync copies where the unchanged files are hard-linked from one snapshot to the next. CPU utilization is very low and only a few cores seem to be active. Yet the operation is taking hours to complete. The premise that was presented to me by someone in the storage business is that with 'many' processor cores one should slice a large array up into segments, multipath the whole deal, and then mash the segments back together with LVM (or MD). Since the kernel would ultimately see a bunch of smaller storage segments that were all getting activity, it should dispatch a set of cores for each storage segment and get the job done faster. I think in theory this would even work to some extent on a single-path SAS connection. >> I am able to set up a 2-way multipath right now, and I might be able >> to justify adding a second controller to the array to get a 4-way >> multipath going. >> > A multipath can double the throughput, provided that you have enough > drives: you'll need about 24 7k RPM drives to saturate _one_ 4x6Gbps > SAS link. If you have only 12 drives, dual attachment probably won't > yield much. > > >> Even if the LVM approach is the wrong one, I clearly have a rare >> chance to set this array up the right way. Please let me know if you >> have any suggestions. >> > In my experience, software RAID-0 with md gives slightly better > performance than LVM, though not much. > > > MD RAID-0 seems as likely as LVM, so I'd probably try that first. The big question is how to size the slices of the array to make XFS happy and then how to make sure XFS knows about it. Secondly, there is the question of the log volume. 
Seems that with multipath there might be some possible advantage to putting this in its own slice on the array so that log writes could be in an I/O stream that is managed separately from the rest. _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 32+ messages in thread
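Purely to make the layering under discussion concrete (note that the reply below argues this slicing is unnecessary at these speeds), the slice-and-stripe idea would look roughly like this with md; the multipath map names are hypothetical placeholders for two LUNs exported by the array:

  # stripe two multipathed LUNs into one block device
  mdadm --create /dev/md0 --level=0 --raid-devices=2 --chunk=512 \
        /dev/mapper/mpatha /dev/mapper/mpathb
  mkfs.xfs /dev/md0    # geometry is picked up from md automatically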
* Re: XFS/LVM/Multipath on a single RAID volume 2015-02-24 22:04 ` Dave Hall @ 2015-02-24 22:33 ` Dave Chinner [not found] ` <54ED01BC.6080302@binghamton.edu> 2015-02-25 11:49 ` Emmanuel Florac 2015-02-25 11:21 ` Emmanuel Florac 1 sibling, 2 replies; 32+ messages in thread From: Dave Chinner @ 2015-02-24 22:33 UTC (permalink / raw) To: Dave Hall; +Cc: stan, xfs@oss.sgi.com On Tue, Feb 24, 2015 at 05:04:35PM -0500, Dave Hall wrote: > > Dave Hall > Binghamton University > kdhall@binghamton.edu > 607-760-2328 (Cell) > 607-777-4641 (Office) > > > On 02/23/2015 06:18 AM, Emmanuel Florac wrote: > >Le Sun, 22 Feb 2015 18:35:19 -0500 > >Dave Hall<kdhall@binghamton.edu> écrivait: > > > >>So since I have a fresh array that's not in production yet I was > >>hoping to get some pointers on how to configure it to maximize XFS > >>performance. In particular, I've seen a suggestion that a > >>multipathed array should be sliced up into logical drives and pasted > >>back together with LVM. Wondering also about putting the journal in > >>a separate logical drive on the same array. > >What's the hardware configuration like? Before multipathing, you need > >to know if your RAID controller and disks can actually saturate your > >link. Generally SAS-attached enclosures are driven through a 4-way > >SFF-8088 cable, with a bandwidth of 4x 6Gbps (maximum throughput per > >link: 3 GB/s) or 4 x 12 Gbps (max throughput: 6 GB/s). > > > The new hardware is an Infortrend with 16 x 2TB 6Gbps SAS drives. > It has one controller with dual 6Gbps SAS ports. The server > currently has two 3Gbps SAS HBAs. > > On an existing array based on similar but slightly slower hardware, > I'm getting miserable performance. The bottleneck seems to be on > the server side. For specifics, the array is laid out as a single > 26TB volume and attached by a single 3Gbps SAS. So, 300MB/s max throughput. > The server is a quad > 8-core Xeon with 128GB RAM. The networking is all 10Gb. The > application is rsnapshot which is essentially a series of rsync > copies where the unchanged files are hard-linked from one snapshot > to the next. CPU utilization is very low and only a few cores seem > to be active. Yet the operation is taking hours to complete. rsync is likely limited by network throughput and round trip latency. Test your storage performance locally first, see if it performs as expected. > The premise that was presented to me by someone in the storage > business is that with 'many' processor cores one should slice a > large array up into segments, multipath the whole deal, and then > mash the segments back together with LVM (or MD). No, that's just a bad idea. CPU and memory locality is the least of your worries, and won't have any influence on performance at such low speeds. When you start getting up into the multiple-GB/s of throughput (note, GB/s not Gbps) locality matters more, but not for what you are doing. And multipathing should be ignored until you've characterised and understood single-port LUN performance. > Since the kernel > would ultimately see a bunch of smaller storage segments that were > all getting activity, it should dispatch a set of cores for each > storage segment and get the job done faster. I think in theory this > would even work to some extent on a single-path SAS connection. The kernel already does most of the necessary locality stuff for optimal performance for you. 
> >>I am able to set up a 2-way multipath right now, and I might be able > >>to justify adding a second controller to the array to get a 4-way > >>multipath going. > >A multipath can double the throughput, provided that you have enough > >drives: you'll need about 24 7k RPM drives to saturate _one_ 4x6Gbps > >SAS link. If you have only 12 drives, dual attachment probably won't > >yield much. > > > >>Even if the LVM approach is the wrong one, I clearly have a rare > >>chance to set this array up the right way. Please let me know if you > >>have any suggestions. > >In my experience, software RAID-0 with md gives slightly better > >performance than LVM, though not much. > > > > > MD RAID-0 seems as likely as LVM, so I'd probably try that first. > The big question is how to size the slices of the array Doesn't really matter for RAID 0. > to make XFS > happy and then how to make sure XFS knows about it. If you are using MD, then mkfs.xfs will pick up the config automatically from the MD device. > Secondly, there > is the question of the log volume. Seems that with multipath there > might be some possible advantage to putting this in its own slice on > the array so that log writes could be in an I/O stream that is > managed separately from the rest. There are very few workloads where an external log makes any sense these days. Log bandwidth is generally a minor part of any workload, and non-volatile write caches aggregate the sequential writes to the point where they impose very little physical IO overhead on the array... Cheers, Dave. -- Dave Chinner david@fromorbit.com _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 32+ messages in thread
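A simple local baseline of the kind suggested above could be run with nothing more than dd; the mount point and sizes are only examples, and the direct-IO flags matter because they bypass the page cache so the array rather than RAM is measured:

  # sequential write, then read back, through the filesystem
  dd if=/dev/zero of=/backup/ddtest bs=1M count=8192 oflag=direct
  dd if=/backup/ddtest of=/dev/null bs=1M iflag=direct
  rm /backup/ddtest

If those numbers sit near the ~300MB/s link limit noted above, the single 3Gbps SAS path is the ceiling; if they come in far lower, the problem is elsewhere in the stack.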
[parent not found: <54ED01BC.6080302@binghamton.edu>]
* Re: XFS/LVM/Multipath on a single RAID volume [not found] ` <54ED01BC.6080302@binghamton.edu> @ 2015-02-24 23:33 ` Dave Chinner 0 siblings, 0 replies; 32+ messages in thread From: Dave Chinner @ 2015-02-24 23:33 UTC (permalink / raw) To: Dave Hall; +Cc: xfs [cc the XFS list again] On Tue, Feb 24, 2015 at 05:57:00PM -0500, Dave Hall wrote: > Dave, > > I'm not going to post any more of my noob questions. Which defeats the purpose of having a public, archived list - other people can find your questions and the answers through search engines like Google. > Sounds like > about the best I could do would be to get a faster HBA (planned) and > just go for it. Also sounds like I might want to look at breaking > up some of the large rsyncs that are running inside rsnapshot. Perhaps > it's just the directory tree traversal that's killing my > performance. Most likely - that's small, random IO and will almost always be seek bound on spinning disks. > One last question - format options: I seem to recall that there are > some parameters on the mkfs - su, sw, etc. Do I need to specify > those when I set up this new volume or can mkfs.xfs calculate them > correctly now? XFS has calculated them correctly for years when you are using MD or LVM for software striping. Nowadays it even works with some hardware RAID, but support is still vendor and hardware specific. That's when you may have to specify it manually, as per the FAQ: http://xfs.org/index.php/XFS_FAQ#Q:_I_want_to_tune_my_XFS_filesystems_for_.3Csomething.3E > Also, I saw something about formatting differently > for a workload like email with many small files, vs. a media > workload that's focused on large files. Since rsnapshot has to > create a new directory tree for every snapshot I'm going to say it's > closer to the email workload. Any guidance on that? Set up your storage config to be optimal for your workload, and XFS should set its defaults appropriately. If you have a random seek bound workload, though, there's very little you can tweak at the filesystem level that will make any significant difference to performance. In these cases, it's better to buy big, cheap SSDs than expensive spinning disks if you need better performance for this sort of workload. Cheers, Dave. -- Dave Chinner david@fromorbit.com _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 32+ messages in thread
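If manual alignment does turn out to be necessary for a hardware RAID that mkfs.xfs cannot probe, the su/sw form of the options looks like this. The numbers are only an example, assuming a 16-drive RAID6 (14 data spindles) with a 128k per-disk chunk, and the device path is illustrative:

  mkfs.xfs -d su=128k,sw=14 /dev/sdX

Here su is the per-disk chunk (stripe unit) and sw the number of data-bearing disks, so su multiplied by sw gives the full stripe width the filesystem will align allocations to.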
* Re: XFS/LVM/Multipath on a single RAID volume 2015-02-24 22:33 ` Dave Chinner [not found] ` <54ED01BC.6080302@binghamton.edu> @ 2015-02-25 11:49 ` Emmanuel Florac 1 sibling, 0 replies; 32+ messages in thread From: Emmanuel Florac @ 2015-02-25 11:49 UTC (permalink / raw) To: Dave Chinner; +Cc: Dave Hall, stan, xfs@oss.sgi.com Le Wed, 25 Feb 2015 09:33:44 +1100 Dave Chinner <david@fromorbit.com> écrivait: > > On an existing array based on similar but slightly slower hardware, > > I'm getting miserable performance. The bottleneck seems to be on > > the server side. For specifics, the array is laid out as a single > > 26TB volume and attached by a single 3Gbps SAS. > > So, 300MB/s max throughput. > Ah yes, maybe external RAID controllers can only use one SAS channel out of the 4 available; that would definitely limit performance badly. This limitation doesn't apply to internal RAID controllers (Adaptec, LSI, Areca) driving a JBOD though. I'll do a short digression on external storage enclosures: they're mostly useful to provide redundant controllers. If you're using only one controller, cheap ones (such as Infortrend, Promise and the like) will always perform poorly compared to a modern PCIe RAID controller. High-end storage enclosures (DotHill, NetApp, etc) with high-bandwidth attachments (FC or IB) provide better performance AND redundancy, but at a hefty price. So if you want fast, cheap arrays, definitely use Adaptec/LSI/Areca and simple JBOD chassis like Supermicro's. -- ------------------------------------------------------------------------ Emmanuel Florac | Direction technique | Intellique | <eflorac@intellique.com> | +33 1 78 94 84 02 ------------------------------------------------------------------------ _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: XFS/LVM/Multipath on a single RAID volume 2015-02-24 22:04 ` Dave Hall 2015-02-24 22:33 ` Dave Chinner @ 2015-02-25 11:21 ` Emmanuel Florac 1 sibling, 0 replies; 32+ messages in thread From: Emmanuel Florac @ 2015-02-25 11:21 UTC (permalink / raw) To: Dave Hall; +Cc: stan, xfs@oss.sgi.com Le Tue, 24 Feb 2015 17:04:35 -0500 Dave Hall <kdhall@binghamton.edu> écrivait: > The new hardware is an Infortrend with 16 x 2TB 6Gbps SAS drives. It > has one controller with dual 6Gbps SAS ports. The server currently > has two 3Gbps SAS HBAs. In my experience with these kinds of controllers, they perform quite poorly with more than 1 RAID-6 array. I'd go for a single RAID-6 array. Then, as you said, you'll have to use multipath and LVM to create two LVs to stripe together, to use both your HBAs and get some more performance. However, with only 16 7k RPM drives you can't hope for much more than 1.5 GByte/s, which is achievable with only one 3Gb SAS HBA... -- ------------------------------------------------------------------------ Emmanuel Florac | Direction technique | Intellique | <eflorac@intellique.com> | +33 1 78 94 84 02 ------------------------------------------------------------------------ _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: xfs_fsr, sunit, and swidth 2013-03-25 17:00 ` Dave Hall 2013-03-27 21:16 ` Stan Hoeppner @ 2013-03-28 1:38 ` Dave Chinner 1 sibling, 0 replies; 32+ messages in thread From: Dave Chinner @ 2013-03-28 1:38 UTC (permalink / raw) To: Dave Hall; +Cc: xfs@oss.sgi.com On Mon, Mar 25, 2013 at 01:00:51PM -0400, Dave Hall wrote: > > Dave Hall > Binghamton University > kdhall@binghamton.edu > 607-760-2328 (Cell) > 607-777-4641 (Office) > > > On 03/16/2013 03:21 AM, Dave Chinner wrote: > >Using perf to profile the kernel while the cp -al workload is > >running will tell us exactly where the CPU is being burnt. That > >will confirm the analysis, or point us at some other issue that is > >causing excessive CPU burn... > > > Dave, which perf command(s) would you like me to run? (I'm > familiar with the concept behind this kind of tool, but I haven't > worked with this one before). Just run 'perf top -U' for 10s while the problem is occurring and pastebin the output.... Cheers, Dave. -- Dave Chinner david@fromorbit.com _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 32+ messages in thread
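For later readers, a capture along those lines can also be recorded rather than watched live; the 10 second window follows the suggestion above, and both forms use stock perf options:

  # live view, kernel symbols only
  perf top -U
  # or record ~10s system-wide with call graphs, then report
  perf record -a -g -- sleep 10
  perf report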
Thread overview: 32+ messages
2013-03-13 18:11 xfs_fsr, sunit, and swidth Dave Hall
2013-03-13 23:57 ` Dave Chinner
2013-03-14 0:03 ` Stan Hoeppner
[not found] ` <514153ED.3000405@binghamton.edu>
2013-03-14 12:26 ` Stan Hoeppner
2013-03-14 12:55 ` Stan Hoeppner
2013-03-14 14:59 ` Dave Hall
2013-03-14 18:07 ` Stefan Ring
2013-03-15 5:14 ` Stan Hoeppner
2013-03-15 11:45 ` Dave Chinner
2013-03-16 4:47 ` Stan Hoeppner
2013-03-16 7:21 ` Dave Chinner
2013-03-16 11:45 ` Stan Hoeppner
2013-03-25 17:00 ` Dave Hall
2013-03-27 21:16 ` Stan Hoeppner
2013-03-29 19:59 ` Dave Hall
2013-03-31 1:22 ` Dave Chinner
2013-04-02 10:34 ` Hans-Peter Jansen
2013-04-03 14:25 ` Dave Hall
2013-04-12 17:25 ` Dave Hall
2013-04-13 0:45 ` Dave Chinner
2013-04-13 0:51 ` Stan Hoeppner
2013-04-15 20:35 ` Dave Hall
2013-04-16 1:45 ` Stan Hoeppner
2013-04-16 16:18 ` Dave Chinner
2015-02-22 23:35 ` XFS/LVM/Multipath on a single RAID volume Dave Hall
2015-02-23 11:18 ` Emmanuel Florac
2015-02-24 22:04 ` Dave Hall
2015-02-24 22:33 ` Dave Chinner
[not found] ` <54ED01BC.6080302@binghamton.edu>
2015-02-24 23:33 ` Dave Chinner
2015-02-25 11:49 ` Emmanuel Florac
2015-02-25 11:21 ` Emmanuel Florac
2013-03-28 1:38 ` xfs_fsr, sunit, and swidth Dave Chinner