* Re: ARC-1120 and MD very sloooow [not found] <1385118796.8091.31.camel@bews002.euractiv.com> @ 2013-11-22 20:17 ` Stan Hoeppner 2013-11-25 8:56 ` Jimmy Thrasibule 0 siblings, 1 reply; 9+ messages in thread From: Stan Hoeppner @ 2013-11-22 20:17 UTC (permalink / raw) To: Jimmy Thrasibule; +Cc: Linux RAID, xfs@oss.sgi.com [CC'ing XFS] On 11/22/2013 5:13 AM, Jimmy Thrasibule wrote: Hi Jimmy, This may not be an md problem. It appears you've mangled your XFS filesystem alignment. This may be a contributing factor to the low write throughput. > md3 : active raid10 sdc1[0] sdf1[3] sde1[2] sdd1[1] > 7813770240 blocks super 1.2 512K chunks 2 near-copies [4/4] [UUUU] ... > /dev/md3 on /srv type xfs (rw,nosuid,nodev,noexec,noatime,attr2,delaylog,inode64,sunit=2048,swidth=4096,noquota) Beyond having a ridiculously unnecessary quantity of mount options, it appears you've got your filesystem alignment messed up, still. Your RAID geometry is 512KB chunk, 1MB stripe width. Your override above is telling the filesystem that the RAID geometry is chunk size 1MB and stripe width 2MB, so XFS is pumping double the IO size that md is expecting. > # xfs_info /dev/md3 > meta-data=/dev/md3 isize=256 agcount=32, agsize=30523648 blks > = sectsz=512 attr=2 > data = bsize=4096 blocks=976755712, imaxpct=5 > = sunit=256 swidth=512 blks > naming =version 2 bsize=4096 ascii-ci=0 > log =internal bsize=4096 blocks=476936, version=2 > = sectsz=512 sunit=8 blks, lazy-count=1 You created your filesystem with stripe unit of 128KB and stripe width of 256KB which don't match the RAID geometry. I assume this is the reason for the fstab overrides. I suggest you try overriding with values that match the RAID geometry, which should be sunit=1024 and swidth=2048. This may or may not cure the low write throughput but it's a good starting point, and should be done anyway. You could also try specifying zeros to force all filesystem write IOs to be 4KB, i.e. no alignment. 
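[Editor's note: Stan's suggested override values follow mechanically from the array geometry. A minimal sketch of the arithmetic — the sunit/swidth mount options are counted in 512-byte sectors:]

```shell
# 4-disk RAID10 near-2: 512 KiB chunk, 2 effective data members per stripe.
# The sunit/swidth mount options are expressed in 512-byte sectors.
chunk_kib=512
data_members=2
sunit=$(( chunk_kib * 1024 / 512 ))     # 1024 sectors = 512 KiB
swidth=$(( sunit * data_members ))      # 2048 sectors = 1 MiB
echo "override to try: -o sunit=$sunit,swidth=$swidth"
```

Setting both options to zero instead (sunit=0,swidth=0) is the no-alignment variant mentioned above.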
Also, your log was created with a stripe unit alignment of 4KB, which is 128 times smaller than your chunk. The default value is zero, which means use 4KB IOs. This shouldn't be a problem, but I do wonder why you manually specified a value equal to the default. mkfs.xfs automatically reads the stripe geometry from md and sets sunit/swidth correctly (assuming non-nested arrays). Why did you specify these manually? > The issue is that disk access is very slow and I cannot spot why. Here > is some data when I try to access the file system. > > > # dd if=/dev/zero of=/srv/test.zero bs=512K count=6000 > 6000+0 records in > 6000+0 records out > 3145728000 bytes (3.1 GB) copied, 82.2142 s, 38.3 MB/s > > # dd if=/srv/store/video/test.zero of=/dev/null > 6144000+0 records in > 6144000+0 records out > 3145728000 bytes (3.1 GB) copied, 12.0893 s, 260 MB/s What percent of the filesystem space is currently used? > First run: > $ time ls /srv/files > [...] > real 9m59.609s > user 0m0.408s > sys 0m0.176s This is a separate problem and has nothing to do with the hardware, md, or XFS. I assisted with a similar, probably identical, ls completion time issue last week on the XFS list. I'd guess you're storing user and group data on a remote LDAP server and it is responding somewhat slowly. Use 'strace -T' with ls and you'll see lots of poll calls and the time taken by each. 17,189 files at 35ms avg latency per LDAP query yields 10m02s, if my math is correct, so 35ms is your current avg latency per query. Be aware that even if you get the average LDAP latency per file down to 2ms, you're still looking at 34s for ls to complete on this directory. Much better than 10 minutes, but nothing close to the local speed you're used to. > Second run: > $ time ls /srv/files > [...] > real 0m0.257s > user 0m0.108s > sys 0m0.088s Here the LDAP data has been cached. Wait an hour, run ls again, and it'll be slow again. 
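[Editor's note: the 10-minute estimate above can be replayed directly; the file count comes from the `ls -l | wc -l` output quoted later in the thread:]

```shell
# 17,189 directory entries, one uid/gid lookup each at ~35 ms average:
files=17189
ms_each=35
total_ms=$(( files * ms_each ))                                  # 601615 ms
echo "$(( total_ms / 60000 ))m$(( total_ms % 60000 / 1000 ))s"   # 10m1s
# Per-syscall timing of the slow ls, as suggested:
#   strace -T ls -l /srv/files 2>&1 | less
```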
> $ ls -l /srv/files | wc -l > 17189 > I guess the controller is what's is blocking here as I encounter the > issue only on servers where it is installed. I tried many settings like > enabling or disabling cache but nothing changed. The controller is not the cause of the 10 minute ls delay. If you see the ls delay only on servers with this controller it is coincidence. The cause lay elsewhere. Areca are pretty crappy controllers generally, but I doubt they're at fault WRT your low write throughput, though it is possible. > Any advise would be appreciated. I hope I've steered you in the right direction. -- Stan _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs
* Re: ARC-1120 and MD very sloooow 2013-11-22 20:17 ` ARC-1120 and MD very sloooow Stan Hoeppner @ 2013-11-25 8:56 ` Jimmy Thrasibule 2013-11-26 0:45 ` Stan Hoeppner 0 siblings, 1 reply; 9+ messages in thread From: Jimmy Thrasibule @ 2013-11-25 8:56 UTC (permalink / raw) To: stan; +Cc: Linux RAID, xfs@oss.sgi.com Hello Stan, > This may not be an md problem. It appears you've mangled your XFS > filesystem alignment. This may be a contributing factor to the low > write throughput. > > > md3 : active raid10 sdc1[0] sdf1[3] sde1[2] sdd1[1] > > 7813770240 blocks super 1.2 512K chunks 2 near-copies [4/4] [UUUU] > ... > > /dev/md3 on /srv type xfs (rw,nosuid,nodev,noexec,noatime,attr2,delaylog,inode64,sunit=2048,swidth=4096,noquota) > > Beyond having a ridiculously unnecessary quantity of mount options, it > appears you've got your filesystem alignment messed up, still. Your > RAID geometry is 512KB chunk, 1MB stripe width. Your override above is > telling the filesystem that the RAID geometry is chunk size 1MB and > stripe width 2MB, so XFS is pumping double the IO size that md is > expecting. The nosuid, nodev, noexec, noatime and inode64 options are mine, the others are added by the system. > > # xfs_info /dev/md3 > > meta-data=/dev/md3 isize=256 agcount=32, agsize=30523648 blks > > = sectsz=512 attr=2 > > data = bsize=4096 blocks=976755712, imaxpct=5 > > = sunit=256 swidth=512 blks > > naming =version 2 bsize=4096 ascii-ci=0 > > log =internal bsize=4096 blocks=476936, version=2 > > = sectsz=512 sunit=8 blks, lazy-count=1 > > You created your filesystem with stripe unit of 128KB and stripe width > of 256KB which don't match the RAID geometry. I assume this is the > reason for the fstab overrides. I suggest you try overriding with > values that match the RAID geometry, which should be sunit=1024 and > swidth=2048. This may or may not cure the low write throughput but it's > a good starting point, and should be done anyway. 
You could also try > specifying zeros to force all filesystem write IOs to be 4KB, i.e. no > alignment. > > Also, your log was created with a stripe unit alignment of 4KB, which is > 128 times smaller than your chunk. The default value is zero, which > means use 4KB IOs. This shouldn't be a problem, but I do wonder why you > manually specified a value equal to the default. > > mkfs.xfs automatically reads the stripe geometry from md and sets > sunit/swidth correctly (assuming non-nested arrays). Why did you > specify these manually? It is said to trust mkfs.xfs, that's what I did. No options have been specified by me and mkfs.xfs guessed everything by itself. > > The issue is that disk access is very slow and I cannot spot why. Here > > is some data when I try to access the file system. > > > > > > # dd if=/dev/zero of=/srv/test.zero bs=512K count=6000 > > 6000+0 records in > > 6000+0 records out > > 3145728000 bytes (3.1 GB) copied, 82.2142 s, 38.3 MB/s > > > > # dd if=/srv/store/video/test.zero of=/dev/null > > 6144000+0 records in > > 6144000+0 records out > > 3145728000 bytes (3.1 GB) copied, 12.0893 s, 260 MB/s > > What percent of the filesystem space is currently used? Very small, 3GB / 6TB, something like 0.05%. > > First run: > > $ time ls /srv/files > > [...] > > real 9m59.609s > > user 0m0.408s > > sys 0m0.176s > > This is a separate problem and has nothing to do with the hardware, md, > or XFS. I assisted with a similar, probably identical, ls completion > time issue last week on the XFS list. I'd guess you're storing user and > group data on a remote LDAP server and it is responding somewhat slowly. > Use 'strace -T' with ls and you'll see lots of poll calls and the time > taken by each. 17,189 files at 35ms avg latency per LDAP query yields > 10m02s, if my math is correct, so 35ms is your current avg latency per > query. 
Be aware that even if you get the average LDAP latency per file > down to 2ms, you're still looking at 34s for ls to complete on this > directory. Much better than 10 minutes, but nothing close to the local > speed you're used to. > > > Second run: > > $ time ls /srv/files > > [...] > > real 0m0.257s > > user 0m0.108s > > sys 0m0.088s > > Here the LDAP data has been cached. Wait an hour, run ls again, and > it'll be slow again. > > > $ ls -l /srv/files | wc -l > > 17189 > > > I guess the controller is what's is blocking here as I encounter the > > issue only on servers where it is installed. I tried many settings like > > enabling or disabling cache but nothing changed. Just using the good old `/etc/passwd` and `/etc/group` files here. There is no special permissions configuration. > The controller is not the cause of the 10 minute ls delay. If you see > the ls delay only on servers with this controller it is coincidence. > The cause lay elsewhere. > > Areca are pretty crappy controllers generally, but I doubt they're at > fault WRT your low write throughput, though it is possible. Well I have issues only on those servers. Strangely enough. I see however that I mixed up the outputs concerning the filesystem details. Let me put everything in order. 
Server 1
--------

# xfs_info /dev/md3
meta-data=/dev/mapper/data-video isize=256    agcount=33, agsize=50331520 blks
         =                       sectsz=512   attr=2
data     =                       bsize=4096   blocks=1610612736, imaxpct=5
         =                       sunit=128    swidth=256 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal               bsize=4096   blocks=521728, version=2
         =                       sectsz=512   sunit=8 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

# mdadm -D /dev/md3
/dev/md3:
        Version : 1.2
  Creation Time : Thu Oct 24 14:33:59 2013
     Raid Level : raid10
     Array Size : 7813770240 (7451.79 GiB 8001.30 GB)
  Used Dev Size : 3906885120 (3725.90 GiB 4000.65 GB)
   Raid Devices : 4
  Total Devices : 4
    Persistence : Superblock is persistent

    Update Time : Fri Nov 22 12:30:20 2013
          State : clean
 Active Devices : 4
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 0

         Layout : near=2
     Chunk Size : 512K

           Name : srv1:data (local to host srv1)
           UUID : ea612767:5870a6f5:38e8537a:8fd03631
         Events : 22

    Number   Major   Minor   RaidDevice State
       0       8       33        0      active sync   /dev/sdc1
       1       8       49        1      active sync   /dev/sdd1
       2       8       65        2      active sync   /dev/sde1
       3       8       81        3      active sync   /dev/sdf1

# grep md3 /etc/fstab
/dev/md3 /srv xfs defaults,inode64 0 0

Server 2
--------

# xfs_info /dev/md0
meta-data=/dev/md0               isize=256    agcount=32, agsize=30523648 blks
         =                       sectsz=512   attr=2
data     =                       bsize=4096   blocks=976755712, imaxpct=5
         =                       sunit=256    swidth=512 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal               bsize=4096   blocks=476936, version=2
         =                       sectsz=512   sunit=8 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

# mdadm -D /dev/md0
/dev/md0:
        Version : 1.2
  Creation Time : Thu Nov 8 11:20:57 2012
     Raid Level : raid10
     Array Size : 3907022848 (3726.03 GiB 4000.79 GB)
  Used Dev Size : 1953511424 (1863.01 GiB 2000.40 GB)
   Raid Devices : 4
  Total Devices : 5
    Persistence : Superblock is persistent

    Update Time : Mon Nov 25 08:37:33 2013
          State : active
 Active Devices : 4
Working Devices : 5
 Failed Devices : 0
  Spare Devices : 1

         Layout : near=2
     Chunk Size : 1024K

           Name : srv2:0
           UUID : 0bb3f599:e414f7ae:0ba93fa2:7a2b4e67
         Events : 280490

    Number   Major   Minor   RaidDevice State
       0       8       17        0      active sync   /dev/sdb1
       1       8       33        1      active sync   /dev/sdc1
       2       8       49        2      active sync   /dev/sdd1
       5       8       65        3      active sync   /dev/sde1

       4       8       81        -      spare   /dev/sdf1

# grep md0 /etc/fstab
/dev/md0 /srv noatime,nodev,nosuid,noexec,inode64 0 0

--
Jimmy
* Re: ARC-1120 and MD very sloooow 2013-11-25 8:56 ` Jimmy Thrasibule @ 2013-11-26 0:45 ` Stan Hoeppner 2013-11-26 2:52 ` Dave Chinner 0 siblings, 1 reply; 9+ messages in thread From: Stan Hoeppner @ 2013-11-26 0:45 UTC (permalink / raw) To: Jimmy Thrasibule; +Cc: Linux RAID, xfs@oss.sgi.com On 11/25/2013 2:56 AM, Jimmy Thrasibule wrote: > Hello Stan, > >> This may not be an md problem. It appears you've mangled your XFS >> filesystem alignment. This may be a contributing factor to the low >> write throughput. >> >>> md3 : active raid10 sdc1[0] sdf1[3] sde1[2] sdd1[1] >>> 7813770240 blocks super 1.2 512K chunks 2 near-copies [4/4] [UUUU] >> ... >>> /dev/md3 on /srv type xfs (rw,nosuid,nodev,noexec,noatime,attr2,delaylog,inode64,sunit=2048,swidth=4096,noquota) >> >> Beyond having a ridiculously unnecessary quantity of mount options, it >> appears you've got your filesystem alignment messed up, still. Your >> RAID geometry is 512KB chunk, 1MB stripe width. Your override above is >> telling the filesystem that the RAID geometry is chunk size 1MB and >> stripe width 2MB, so XFS is pumping double the IO size that md is >> expecting. > > The nosuid, nodev, noexec, noatime and inode64 options are mine, the > others are added by the system. Right. It's unusual to see this many mount options. FYI, the XFS default is relatime, which is nearly identical to noatime. Specifying noatime won't gain you anything. Do you really need nosuid, nodev, noexec? >>> # xfs_info /dev/md3 >>> meta-data=/dev/md3 isize=256 agcount=32, agsize=30523648 blks >>> = sectsz=512 attr=2 >>> data = bsize=4096 blocks=976755712, imaxpct=5 >>> = sunit=256 swidth=512 blks >>> naming =version 2 bsize=4096 ascii-ci=0 >>> log =internal bsize=4096 blocks=476936, version=2 >>> = sectsz=512 sunit=8 blks, lazy-count=1 >> >> You created your filesystem with stripe unit of 128KB and stripe width >> of 256KB which don't match the RAID geometry. I assume this is the >> reason for the fstab overrides. 
I suggest you try overriding with >> values that match the RAID geometry, which should be sunit=1024 and >> swidth=2048. This may or may not cure the low write throughput but it's >> a good starting point, and should be done anyway. You could also try >> specifying zeros to force all filesystem write IOs to be 4KB, i.e. no >> alignment. >> >> Also, your log was created with a stripe unit alignment of 4KB, which is >> 128 times smaller than your chunk. The default value is zero, which >> means use 4KB IOs. This shouldn't be a problem, but I do wonder why you >> manually specified a value equal to the default. >> >> mkfs.xfs automatically reads the stripe geometry from md and sets >> sunit/swidth correctly (assuming non-nested arrays). Why did you >> specify these manually? > > It is said to trust mkfs.xfs, that's what I did. No options have been > specified by me and mkfs.xfs guessed everything by itself. So the mkfs.xfs defaults in Wheezy did this. Maybe I'm missing something WRT the md/RAID10 near2 layout. I know the alternate layouts can play tricks with the resulting stripe width but I'm not sure if that's the case here. The log sunit of 8 blocks may be due to your chunk being 512KB, which IIRC is greater than the XFS allowed maximum for the log. Hence it may have been dropped to 4KB for this reason. >>> The issue is that disk access is very slow and I cannot spot why. Here >>> is some data when I try to access the file system. >>> >>> >>> # dd if=/dev/zero of=/srv/test.zero bs=512K count=6000 >>> 6000+0 records in >>> 6000+0 records out >>> 3145728000 bytes (3.1 GB) copied, 82.2142 s, 38.3 MB/s >>> >>> # dd if=/srv/store/video/test.zero of=/dev/null >>> 6144000+0 records in >>> 6144000+0 records out >>> 3145728000 bytes (3.1 GB) copied, 12.0893 s, 260 MB/s >> >> What percent of the filesystem space is currently used? > > Very small, 3GB / 6TB, something like 0.05%. So the low write speed shouldn't be related to free space fragmentation. 
>>> First run: >>> $ time ls /srv/files >>> [...] >>> real 9m59.609s >>> user 0m0.408s >>> sys 0m0.176s >> >> This is a separate problem and has nothing to do with the hardware, md, >> or XFS. I assisted with a similar, probably identical, ls completion >> time issue last week on the XFS list. I'd guess you're storing user and >> group data on a remote LDAP server and it is responding somewhat slowly. >> Use 'strace -T' with ls and you'll see lots of poll calls and the time >> taken by each. 17,189 files at 35ms avg latency per LDAP query yields >> 10m02s, if my math is correct, so 35ms is your current avg latency per >> query. Be aware that even if you get the average LDAP latency per file >> down to 2ms, you're still looking at 34s for ls to complete on this >> directory. Much better than 10 minutes, but nothing close to the local >> speed you're used to. >> >>> Second run: >>> $ time ls /srv/files >>> [...] >>> real 0m0.257s >>> user 0m0.108s >>> sys 0m0.088s >> >> Here the LDAP data has been cached. Wait an hour, run ls again, and >> it'll be slow again. >> >>> $ ls -l /srv/files | wc -l >>> 17189 >> >>> I guess the controller is what's is blocking here as I encounter the >>> issue only on servers where it is installed. I tried many settings like >>> enabling or disabling cache but nothing changed. > > Just using the old good `/etc/passwd` and `/etc/group` files here. There > is no special permissions configuration. You'll need to run "strace -T ls -l" to determine what's eating all the time. The user and kernel code is taking less than 0.5s combined. The other 9m58s is spent waiting on something. You need to identify that. This is interesting. You have low linear write speed to a file with dd, yet also horrible latency with a read operation. Do you see any errors in dmesg relating to the Areca, or anything else? >> The controller is not the cause of the 10 minute ls delay. If you see >> the ls delay only on servers with this controller it is coincidence. 
>> The cause lay elsewhere. >> >> Areca are pretty crappy controllers generally, but I doubt they're at >> fault WRT your low write throughput, though it is possible. > > Well I have issues only on those servers. Strange enough. Yes, this is a strange case thus far. Do you also see the low write speed and slow ls on md0, any/all of your md/RAID10 arrays? > I see however that I messed the outputs concerning the filesystem > details. Let me put everything in order. > > > Server 1 > -------- > > # xfs_info /dev/md3 > meta-data=/dev/mapper/data-video isize=256 agcount=33, agsize=50331520 blks > = sectsz=512 attr=2 > data = bsize=4096 blocks=1610612736, imaxpct=5 > = sunit=128 swidth=256 blks > naming =version 2 bsize=4096 ascii-ci=0 > log =internal bsize=4096 blocks=521728, version=2 > = sectsz=512 sunit=8 blks, lazy-count=1 > realtime =none extsz=4096 blocks=0, rtextents=0 > > # mdadm -D /dev/md3 > /dev/md3: > Version : 1.2 > Creation Time : Thu Oct 24 14:33:59 2013 > Raid Level : raid10 > Array Size : 7813770240 (7451.79 GiB 8001.30 GB) > Used Dev Size : 3906885120 (3725.90 GiB 4000.65 GB) > Raid Devices : 4 > Total Devices : 4 > Persistence : Superblock is persistent > > Update Time : Fri Nov 22 12:30:20 2013 > State : clean > Active Devices : 4 > Working Devices : 4 > Failed Devices : 0 > Spare Devices : 0 > > Layout : near=2 > Chunk Size : 512K > > Name : srv1:data (local to host srv1) > UUID : ea612767:5870a6f5:38e8537a:8fd03631 > Events : 22 > > Number Major Minor RaidDevice State > 0 8 33 0 active sync /dev/sdc1 > 1 8 49 1 active sync /dev/sdd1 > 2 8 65 2 active sync /dev/sde1 > 3 8 81 3 active sync /dev/sdf1 > > # grep md3 /etc/fstab > /dev/md3 /srv xfs defaults,inode64 0 0 > > > Server 2 > -------- > > # xfs_info /dev/md0 > meta-data=/dev/md0 isize=256 agcount=32, agsize=30523648 blks > = sectsz=512 attr=2 > data = bsize=4096 blocks=976755712, imaxpct=5 > = sunit=256 swidth=512 blks > naming =version 2 bsize=4096 ascii-ci=0 > log =internal bsize=4096 
blocks=476936, version=2 > = sectsz=512 sunit=8 blks, lazy-count=1 > realtime =none extsz=4096 blocks=0, rtextents=0 > > # mdadm -D /dev/md0 > /dev/md0: > Version : 1.2 > Creation Time : Thu Nov 8 11:20:57 2012 > Raid Level : raid10 > Array Size : 3907022848 (3726.03 GiB 4000.79 GB) > Used Dev Size : 1953511424 (1863.01 GiB 2000.40 GB) > Raid Devices : 4 > Total Devices : 5 > Persistence : Superblock is persistent > > Update Time : Mon Nov 25 08:37:33 2013 > State : active > Active Devices : 4 > Working Devices : 5 > Failed Devices : 0 > Spare Devices : 1 > > Layout : near=2 > Chunk Size : 1024K > > Name : srv2:0 > UUID : 0bb3f599:e414f7ae:0ba93fa2:7a2b4e67 > Events : 280490 > > Number Major Minor RaidDevice State > 0 8 17 0 active sync /dev/sdb1 > 1 8 33 1 active sync /dev/sdc1 > 2 8 49 2 active sync /dev/sdd1 > 5 8 65 3 active sync /dev/sde1 > > 4 8 81 - spare /dev/sdf1 > > # grep md0 /etc/fstab > /dev/md0 /srv noatime,nodev,nosuid,noexec,inode64 0 0 -- Stan
* Re: ARC-1120 and MD very sloooow 2013-11-26 0:45 ` Stan Hoeppner @ 2013-11-26 2:52 ` Dave Chinner 2013-11-26 3:58 ` Stan Hoeppner 0 siblings, 1 reply; 9+ messages in thread From: Dave Chinner @ 2013-11-26 2:52 UTC (permalink / raw) To: Stan Hoeppner; +Cc: Jimmy Thrasibule, Linux RAID, xfs@oss.sgi.com On Mon, Nov 25, 2013 at 06:45:38PM -0600, Stan Hoeppner wrote: > On 11/25/2013 2:56 AM, Jimmy Thrasibule wrote: > > Hello Stan, > > > >> This may not be an md problem. It appears you've mangled your XFS > >> filesystem alignment. This may be a contributing factor to the low > >> write throughput. > >> > >>> md3 : active raid10 sdc1[0] sdf1[3] sde1[2] sdd1[1] > >>> 7813770240 blocks super 1.2 512K chunks 2 near-copies [4/4] [UUUU] > >> ... > >>> /dev/md3 on /srv type xfs (rw,nosuid,nodev,noexec,noatime,attr2,delaylog,inode64,sunit=2048,swidth=4096,noquota) > >> > >> Beyond having a ridiculously unnecessary quantity of mount options, it > >> appears you've got your filesystem alignment messed up, still. Your > >> RAID geometry is 512KB chunk, 1MB stripe width. Your override above is > >> telling the filesystem that the RAID geometry is chunk size 1MB and > >> stripe width 2MB, so XFS is pumping double the IO size that md is > >> expecting. > > > > The nosuid, nodev, noexec, noatime and inode64 options are mine, the > > others are added by the system. > > Right. It's unusual to see this many mount options. FYI, the XFS > default is relatime, which is nearly identical to noatime. Specifying > noatime won't gain you anything. Do you really need nosuid, nodev, noexec? 
> > >>> # xfs_info /dev/md3 > >>> meta-data=/dev/md3 isize=256 agcount=32, agsize=30523648 blks > >>> = sectsz=512 attr=2 > >>> data = bsize=4096 blocks=976755712, imaxpct=5 > >>> = sunit=256 swidth=512 blks > >>> naming =version 2 bsize=4096 ascii-ci=0 > >>> log =internal bsize=4096 blocks=476936, version=2 > >>> = sectsz=512 sunit=8 blks, lazy-count=1 > >> > >> You created your filesystem with stripe unit of 128KB and stripe width > >> of 256KB which don't match the RAID geometry. I assume this is the sunit/swidth is in filesystem blocks, not sectors. Hence sunit is 1MB, swidth = 2MB. While it's not quite correct (su=512k,sw=1m), it's not actually a problem... > >> reason for the fstab overrides. I suggest you try overriding with > >> values that match the RAID geometry, which should be sunit=1024 and > >> swidth=2048. This may or may not cure the low write throughput but it's > >> a good starting point, and should be done anyway. You could also try > >> specifying zeros to force all filesystem write IOs to be 4KB, i.e. no > >> alignment. > >> > >> Also, your log was created with a stripe unit alignment of 4KB, which is > >> 128 times smaller than your chunk. The default value is zero, which > >> means use 4KB IOs. This shouldn't be a problem, but I do wonder why you > >> manually specified a value equal to the default. > >> > >> mkfs.xfs automatically reads the stripe geometry from md and sets > >> sunit/swidth correctly (assuming non-nested arrays). Why did you > >> specify these manually? > > > > It is said to trust mkfs.xfs, that's what I did. No options have been > > specified by me and mkfs.xfs guessed everything by itself. Well, mkfs.xfs just uses what it gets from the kernel, so it might have been told the wrong thing by MD itself. However, you can modify sunit/swidth by mount options, so you can't directly trust what is reported from xfs_info to be what mkfs actually set originally. > So the mkfs.xfs defaults in Wheezy did this. 
Maybe I'm missing > something WRT the md/RAID10 near2 layout. I know the alternate layouts > can play tricks with the resulting stripe width but I'm not sure if > that's the case here. The log sunit of 8 blocks may be due to your > chunk being 512KB, which IIRC is greater than the XFS allowed maximum > for the log. Hence it may have been dropped to 4KB for this reason. Again, lsunit is in filesystem blocks, so it is 32k, not 4k. And yes, the default lsunit when the sunit > 256k is 32k. So, nothing wrong there, either. > >>> The issue is that disk access is very slow and I cannot spot why. Here > >>> is some data when I try to access the file system. > >>> > >>> > >>> # dd if=/dev/zero of=/srv/test.zero bs=512K count=6000 > >>> 6000+0 records in > >>> 6000+0 records out > >>> 3145728000 bytes (3.1 GB) copied, 82.2142 s, 38.3 MB/s > >>> > >>> # dd if=/srv/store/video/test.zero of=/dev/null > >>> 6144000+0 records in > >>> 6144000+0 records out > >>> 3145728000 bytes (3.1 GB) copied, 12.0893 s, 260 MB/s > >> > >> What percent of the filesystem space is currently used? > > > > Very small, 3GB / 6TB, something like 0.05%. The usual: "iostat -x -d -m 5" output while the test is running. Also, you are using buffered IO, so changing it to use direct IO will tell us exactly what the disks are doing when IO is issued. blktrace is your friend here.... Cheers, Dave. -- Dave Chinner david@fromorbit.com
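[Editor's note: a hedged sketch of the measurements Dave asks for, left commented out since they need root and write to the array; the test-file path reuses the scratch file from the earlier dd runs:]

```shell
# Direct-IO write test: oflag=direct bypasses the page cache, so the
# throughput reflects what md and the disks actually sustain.
#   dd if=/dev/zero of=/srv/test.zero bs=512K count=6000 oflag=direct
# Matching direct-IO read:
#   dd if=/srv/test.zero of=/dev/null bs=512K iflag=direct
# In a second terminal while dd runs, per-device utilisation every 5 s:
#   iostat -x -d -m 5
# Deeper block-layer tracing if iostat is inconclusive:
#   blktrace -d /dev/md3 -o - | blkparse -i -
```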
* Re: ARC-1120 and MD very sloooow 2013-11-26 2:52 ` Dave Chinner @ 2013-11-26 3:58 ` Stan Hoeppner 2013-11-26 6:14 ` Dave Chinner 0 siblings, 1 reply; 9+ messages in thread From: Stan Hoeppner @ 2013-11-26 3:58 UTC (permalink / raw) To: Dave Chinner; +Cc: Jimmy Thrasibule, Linux RAID, xfs@oss.sgi.com On 11/25/2013 8:52 PM, Dave Chinner wrote: ... > sunit/swidth is in filesystem blocks, not sectors. Hence > sunit is 1MB, swidth = 2MB. While it's not quite correct > (su=512k,sw=1m), it's not actually a problem... Well that's what I thought as well, and I was puzzled by the 8 blocks value for the log sunit. So I double checked before posting, and 'man mkfs.xfs' told me sunit=value This is used to specify the stripe unit for a RAID device or a logical volume. The value has to be specified in 512-byte block units. So apparently the units of 'sunit' are different depending on which XFS tool one is using. That's a bit confusing. And 'man xfs_info' (xfs_growfs) doesn't tell us that sunit is given in filesystem blocks. I'm using xfsprogs 3.1.4 so maybe these have been corrected since. > Well, mkfs.xfs just uses what it gets from the kernel, so it > might have been told the wrong thing by MD itself. However, you can > modify sunit/swidth by mount options, so you can't directly trust > what is reported from xfs_info to be what mkfs actually set > originally. Got it. > Again, lsunit is in filesystem blocks, so it is 32k, not 4k. And > yes, the default lsunit when the sunit > 256k is 32k. So, nothing > wrong there, either. So where should I have looked to confirm sunit reported by xfs_info is in fs block (4KB) multiples, not in the 512B multiples of mkfs.xfs? > The usual: "iostat -x -d -m 5" output while the test is running. > Also, you are using buffered IO, so changing it to use direct IO > will tell us exactly what the disks are doing when IO is issued. > blktrace is your friend here.... It'll be interesting to see where this troubleshooting leads. 
Buffered single stream write speed is ~6x slower than read w/RAID10. That makes me wonder if the controller and drive write caches have been disabled. That could explain this. -- Stan
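[Editor's note: the ~6x figure comes straight from the two buffered dd runs quoted earlier in the thread:]

```shell
# 260 MB/s buffered read vs 38.3 MB/s buffered write:
ratio=$(awk 'BEGIN { printf "%.1f", 260 / 38.3 }')
echo "read is ${ratio}x the write throughput"   # 6.8x
```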
* Re: ARC-1120 and MD very sloooow 2013-11-26 3:58 ` Stan Hoeppner @ 2013-11-26 6:14 ` Dave Chinner 2013-11-26 8:03 ` Stan Hoeppner 0 siblings, 1 reply; 9+ messages in thread From: Dave Chinner @ 2013-11-26 6:14 UTC (permalink / raw) To: Stan Hoeppner; +Cc: Jimmy Thrasibule, Linux RAID, xfs@oss.sgi.com On Mon, Nov 25, 2013 at 09:58:21PM -0600, Stan Hoeppner wrote: > On 11/25/2013 8:52 PM, Dave Chinner wrote: > ... > > sunit/swidth is in filesystem blocks, not sectors. Hence > > sunit is 1MB, swidth = 2MB. While it's not quite correct > > (su=512k,sw=1m), it's not actually a problem... > > Well that's what I thought as well, and I was puzzled by the 8 blocks > value for the log sunit. So I double checked before posting, and 'man > mkfs.xfs' told me > > sunit=value > This is used to specify the stripe unit for a RAID device > or a logical volume. The value has to be specified in > 512-byte block units. > > So apparently the units of 'sunit' are different depending on which XFS > tool one is using. No they don't. sunit as a mkfs input value is determined by 512 byte units. The output is given in units of "blks" i.e. the log block size: $ mkfs.xfs -N -l sunit=64 /dev/vdb .... log =internal log bsize=4096 blocks=12800, version=2 = sectsz=512 sunit=8 blks, lazy-count=1 Which is given by the "bsize=4096" variable and so are, in this case, 4k in size. input = 64 * 512 bytes = 8 * 4096 bytes = output Remember, you can specify su rather than sunit, and they are specified in sectors, filesystem blocks or bytes, and the output is still in units of log block size: # mkfs.xfs -N -b size=4096 -l su=8b /dev/vdb .... log =internal log bsize=4096 blocks=12800, version=2 = sectsz=512 sunit=8 blks, lazy-count=1 # mkfs.xfs -N -l su=32k /dev/vdb .... log =internal log bsize=4096 blocks=12800, version=2 = sectsz=512 sunit=8 blks, lazy-count=1 IOWs, the input units can vary, but the output units are always the same. > That's a bit confusing. 
And 'man xfs_info' > (xfs_growfs) doesn't tell us that sunit is given in filesystem blocks. > I'm using xfsprogs 3.1.4 so maybe these have been corrected since. It might seem confusing at first, but it's actually quite consistent... > > Again, lsunit is in filesystem blocks, so it is 32k, not 4k. And > > yes, the default lsunit when the sunit > 256k is 32k. So, nothing > > wrong there, either. > > So where should I have looked to confirm sunit reported by xfs_info is > in fs block (4KB) multiples, not in the 512B multiples of mkfs.xfs? Explained above. Cheers, Dave. -- Dave Chinner david@fromorbit.com
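[Editor's note: Dave's conversion, and the same rule applied to the two servers' chunk sizes from earlier in the thread, can be checked mechanically. This assumes 4096-byte filesystem blocks, as reported by bsize in the xfs_info output:]

```shell
# mkfs input 'sunit=64' is in 512 B units; output 'sunit=8 blks' is in
# 4096 B filesystem blocks: 64*512 == 8*4096.
[ $(( 64 * 512 )) -eq $(( 8 * 4096 )) ] && echo "log sunit checks out"

# Server 1: 512K chunk  -> sunit=128 blks, swidth=256 blks (2 data members)
# Server 2: 1024K chunk -> sunit=256 blks, swidth=512 blks
for chunk_k in 512 1024; do
    sunit_blks=$(( chunk_k * 1024 / 4096 ))
    echo "chunk ${chunk_k}K -> sunit=${sunit_blks} swidth=$(( sunit_blks * 2 )) blks"
done
```

Both results match the xfs_info outputs quoted above, i.e. read in filesystem blocks, each server's filesystem is in fact aligned to its array.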
* Re: ARC-1120 and MD very sloooow 2013-11-26 6:14 ` Dave Chinner @ 2013-11-26 8:03 ` Stan Hoeppner 2013-11-28 15:59 ` Jimmy Thrasibule 0 siblings, 1 reply; 9+ messages in thread From: Stan Hoeppner @ 2013-11-26 8:03 UTC (permalink / raw) To: Dave Chinner; +Cc: Jimmy Thrasibule, Linux RAID, xfs@oss.sgi.com On 11/26/2013 12:14 AM, Dave Chinner wrote: > On Mon, Nov 25, 2013 at 09:58:21PM -0600, Stan Hoeppner wrote: >> On 11/25/2013 8:52 PM, Dave Chinner wrote: >> ... >>> sunit/swidth is in filesystem blocks, not sectors. Hence >>> sunit is 1MB, swidth = 2MB. While it's not quite correct >>> (su=512k,sw=1m), it's not actually a problem... >> >> Well that's what I thought as well, and I was puzzled by the 8 blocks >> value for the log sunit. So I double checked before posting, and 'man >> mkfs.xfs' told me >> >> sunit=value >> This is used to specify the stripe unit for a RAID device >> or a logical volume. The value has to be specified in >> 512-byte block units. >> >> So apparently the units of 'sunit' are different depending on which XFS >> tool one is using. > > No they don't. sunit as a mkfs input value is determined by 512 byte > units. The output is given in units of "blks" i.e. the log block > size: Yes. That's pretty clear now. And I've figured out why this is... > $ mkfs.xfs -N -l sunit=64 /dev/vdb > .... > log =internal log bsize=4096 blocks=12800, version=2 > = sectsz=512 sunit=8 blks, lazy-count=1 > > Which is given by the "bsize=4096" variable and so are, in this > case, 4k in size. input = 64 * 512 bytes = 8 * 4096 bytes = output > > Remember, you can specify su rather than sunit, and they are > specified in sectors, filesystem blocks or bytes, and the output is > still in units of log block size: I never used IRIX. But I've deduced that this made sense then due to variable filesystem block size selection during mkfs. 
But in Linux the filesystem block size is static, at 4KB, equal to the
page size, and from everything I've read the page size isn't going to
change any time soon. Thus for Linux-only users, this exercise of
specifying creation values in 512-byte blocks, or bytes, or multiples
of the fs block size, can be very confusing, when the output is always
a multiple of filesystem blocks, always a multiple of 4KB.

> # mkfs.xfs -N -b size=4096 -l su=8b /dev/vdb
                                ^^^^^

I never noticed this until now because I've never used an external
log, nor needed an internal log with different geometry than the data
section. But why do we have different input values for su in the data
(bytes) and log (blocks) sections? I hope to learn something from
your answer, as I usually do. :)

> ....
> log   =internal log   bsize=4096  blocks=12800, version=2
>       =               sectsz=512  sunit=8 blks, lazy-count=1
>
> # mkfs.xfs -N -l su=32k /dev/vdb
> ....
> log   =internal log   bsize=4096  blocks=12800, version=2
>       =               sectsz=512  sunit=8 blks, lazy-count=1
>
> IOWs, the input units can vary, but the output units are always the
> same.
>
>> That's a bit confusing. And 'man xfs_info' (xfs_growfs) doesn't
>> tell us that sunit is given in filesystem blocks. I'm using
>> xfsprogs 3.1.4 so maybe these have been corrected since.
>
> It might seem confusing at first, but it's actually quite
> consistent...

At first? Dang Dave, you've been mentoring me for something like 3+
years now. :)

I don't deal with alignment issues very often, but this isn't my first
rodeo. I had my answer based on 4KB blocks, and went to the docs to
verify it before posting. That's the logical thing to do. In this
case, the docs led me astray. That shouldn't happen. It won't happen
to me again, but if it happened once, after using the software and
documentation for over 4 years, it may well happen to someone else.
So I'm thinking a short caveat/note might be in order in mkfs.xfs(8).
Something like: "Note: During filesystem creation, data section stripe
alignment values (sunit/swidth/su/sw) are specified in units other
than filesystem blocks. After creation, sunit/swidth values are
reported in multiples of filesystem blocks by the xfsprogs tools."

>>> Again, lsunit is in filesystem blocks, so it is 32k, not 4k. And
>>> yes, the default lsunit when the sunit > 256k is 32k. So, nothing
>>> wrong there, either.
>>
>> So where should I have looked to confirm sunit reported by xfs_info
>> is in fs block (4KB) multiples, not in the 512B multiples of
>> mkfs.xfs?
>
> Explained above.

Thanks Dave. Hopefully others learn from this as well.

-- 
Stan
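The conversion Dave walks through above reduces to simple arithmetic.
A sketch, assuming the usual 512-byte mkfs.xfs input units and a
4096-byte filesystem block size (the function name here is just for
illustration, not an xfsprogs API):

```python
SECTOR = 512      # mkfs.xfs sunit/swidth input units (bytes)
FS_BLOCK = 4096   # filesystem block size, the "bsize" in xfs_info output

def mkfs_sunit_to_blks(sunit_in, bsize=FS_BLOCK):
    """Convert a mkfs.xfs sunit input value (512-byte units) to the
    'blks' value that mkfs/xfs_info output reports."""
    return sunit_in * SECTOR // bsize

# Dave's example: mkfs.xfs -l sunit=64 is reported as sunit=8 blks,
# because 64 * 512 bytes = 8 * 4096 bytes.
print(mkfs_sunit_to_blks(64))     # 8

# The 512KB chunk discussed earlier in the thread: su=512k means
# sunit=1024 as mkfs input, reported as sunit=128 blks by xfs_info.
print(mkfs_sunit_to_blks(1024))   # 128
```

The same rule applies to the log: lsunit=8 blks is 8 * 4096 = 32k, as
Dave notes above.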
* Re: ARC-1120 and MD very sloooow
  2013-11-28 15:59 UTC
From: Jimmy Thrasibule
To: stan
Cc: Linux RAID, xfs@oss.sgi.com

> Right. It's unusual to see this many mount options. FYI, the XFS
> default is relatime, which is nearly identical to noatime.
> Specifying noatime won't gain you anything. Do you really need
> nosuid, nodev, noexec?

Well, better to say what I don't want on the filesystem, no?

> Do you also see the low write speed and slow ls on md0, any/all of
> your md/RAID10 arrays?

Yes, all drive operations are slow. Unfortunately, I have no drives
in the machine that are not managed by the controller to push tests
further.

> The usual: "iostat -x -d -m 5" output while the test is running.
> Also, you are using buffered IO, so changing it to use direct IO
> will tell us exactly what the disks are doing when IO is issued.
> blktrace is your friend here....

I've run the following:

# dd if=/dev/zero of=/srv/store/video/test.zero bs=512K count=6000 oflag=direct
6000+0 records in
6000+0 records out
3145728000 bytes (3.1 GB) copied, 179.945 s, 17.5 MB/s

# dd if=/srv/store/video/test.zero of=/dev/null iflag=direct
6144000+0 records in
6144000+0 records out
3145728000 bytes (3.1 GB) copied, 984.317 s, 3.2 MB/s

Traces are huge for the read test so I put them on Google Drive +
SHA1 sums:

https://drive.google.com/folderview?id=0BxJZG8aWsaMaVWkyQk1ELU5yX2c

Drives `sdc` to `sdf` are part of the RAID10 array. Only drives `sdc`
and `sde` are used when reading.

> That makes me wonder if the controller and drive write caches have
> been disabled. That could explain this.

Caching is enabled for the controller, but there's not much
information.
> sys info
The System Information
===========================================
Main Processor    : 500MHz
CPU ICache Size   : 32KB
CPU DCache Size   : 32KB
CPU SCache Size   : 0KB
System Memory     : 128MB/333MHz/ECC
Firmware Version  : V1.49 2010-12-02
BOOT ROM Version  : V1.49 2010-12-02
Serial Number     : Y611CAABAR200126
Controller Name   : ARC-1120
===========================================

By the way, is enabling the controller cache a good idea? I would
disable it and let the kernel manage.

--
Jimmy
* Re: ARC-1120 and MD very sloooow
  2013-11-28 19:59 UTC
From: Stan Hoeppner
To: Jimmy Thrasibule
Cc: Linux RAID, xfs@oss.sgi.com

On 11/28/2013 9:59 AM, Jimmy Thrasibule wrote:
>> Right. It's unusual to see this many mount options. FYI, the XFS
>> default is relatime, which is nearly identical to noatime.
>> Specifying noatime won't gain you anything. Do you really need
>> nosuid, nodev, noexec?
>
> Well, better to say what I don't want on the filesystem, no?
>
>> Do you also see the low write speed and slow ls on md0, any/all of
>> your md/RAID10 arrays?
>
> Yes, all drive operations are slow. Unfortunately, I have no drives
> in the machine that are not managed by the controller to push tests
> further.

Testing a single drive might provide a useful comparison.

>> The usual: "iostat -x -d -m 5" output while the test is running.
>> Also, you are using buffered IO, so changing it to use direct IO
>> will tell us exactly what the disks are doing when IO is issued.
>> blktrace is your friend here....
>
> I've run the following:
>
> # dd if=/dev/zero of=/srv/store/video/test.zero bs=512K count=6000 oflag=direct
> 6000+0 records in
> 6000+0 records out
> 3145728000 bytes (3.1 GB) copied, 179.945 s, 17.5 MB/s

While O_DIRECT writing will give a more accurate picture of the
throughput at the disks, single-threaded O_DIRECT is usually not a
good test due to serialization. That said, 17.5MB/s is very slow even
for a single thread.

> # dd if=/srv/store/video/test.zero of=/dev/null iflag=direct
> 6144000+0 records in
> 6144000+0 records out
> 3145728000 bytes (3.1 GB) copied, 984.317 s, 3.2 MB/s

This is useless. Never use O_DIRECT on input with dd. The result will
always be ~20x lower than actual drive throughput.
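A back-of-the-envelope check of why the O_DIRECT read above looks so
bad: the 6144000 record count shows dd fell back to its default 512-byte
block size, so every read is a synchronous 512-byte IO. A sketch using
the numbers from the dd output:

```python
bytes_total = 3_145_728_000   # from the dd output above
read_seconds = 984.317
read_records = 6_144_000      # dd default bs=512 -> 512-byte records

# Throughput, using dd's decimal megabytes (10^6 bytes).
mb_per_s = bytes_total / read_seconds / 1e6
print(round(mb_per_s, 1))     # 3.2, matching dd's own report

# Each O_DIRECT read is one synchronous 512-byte IO, so the average
# per-IO turnaround is just elapsed time over record count.
latency_us = read_seconds / read_records * 1e6
print(round(latency_us))      # 160 microseconds per 512-byte read
```

At 512 bytes per synchronous round trip, even a fast array can't show
its real throughput, which is why O_DIRECT reads with dd's default
block size say nothing about the drives.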
> Traces are huge for the read test so I put them on Google Drive +
> SHA1 sums:
> https://drive.google.com/folderview?id=0BxJZG8aWsaMaVWkyQk1ELU5yX2c
>
> Drives `sdc` to `sdf` are part of the RAID10 array. Only drives
> `sdc` and `sde` are used when reading.
>
>> That makes me wonder if the controller and drive write caches have
>> been disabled. That could explain this.
>
> Caching is enabled for the controller but not much information.
>
> > sys info
> The System Information
> ===========================================
> Main Processor    : 500MHz
> CPU ICache Size   : 32KB
> CPU DCache Size   : 32KB
> CPU SCache Size   : 0KB
> System Memory     : 128MB/333MHz/ECC
> Firmware Version  : V1.49 2010-12-02
> BOOT ROM Version  : V1.49 2010-12-02
> Serial Number     : Y611CAABAR200126
> Controller Name   : ARC-1120
> ===========================================

This doesn't tell you whether the read/write cache is enabled or
disabled. This is simply the controller information summary.

> By the way, is enabling the controller cache a good idea? I would
> disable it and let the kernel manage.

With any decent RAID card the cache is enabled automatically for
reads. The write cache will only be enabled automatically if a
battery module is present and the firmware test shows it is in good
condition. Some controllers allow manually enabling the write cache
without a battery. This is usually not advised.

Since barriers are enabled in XFS by default, you may try enabling
write cache on the controller to see if this helps performance. It
may not, depending on how the controller handles barriers. And of
course, using md you'll want drive caches enabled or performance will
be horrible. Which is why I recommend checking to make sure they're
enabled.

-- 
Stan
Thread overview: 9+ messages
[not found] <1385118796.8091.31.camel@bews002.euractiv.com>
2013-11-22 20:17 ` ARC-1120 and MD very sloooow Stan Hoeppner
2013-11-25 8:56 ` Jimmy Thrasibule
2013-11-26 0:45 ` Stan Hoeppner
2013-11-26 2:52 ` Dave Chinner
2013-11-26 3:58 ` Stan Hoeppner
2013-11-26 6:14 ` Dave Chinner
2013-11-26 8:03 ` Stan Hoeppner
2013-11-28 15:59 ` Jimmy Thrasibule
2013-11-28 19:59 ` Stan Hoeppner