* How to deal with XFS stripe geometry mismatch with hardware RAID5
From: troby @ 2012-03-13 23:21 UTC (permalink / raw)
To: xfs
I have a 30TB XFS filesystem created on CentOS 5.4 x86_64, kernel 2.6.39,
using xfsprogs 2.9.4. The underlying hardware is 12 3TB SATA drives on a
Dell PERC 700 controller with 1GB cache. There is an external journal on a
separate set of 15k SAS drives (I now suspect this was unnecessary, because
there is very little metadata activity). When I created the filesystem I
(mistakenly) believed the stripe width of the filesystem should count all 12
drives rather than 11. I've seen some opinions that this is correct, but a
larger number have convinced me that it is not. I also set up the RAID
BIOS to use a small stripe element of 8KB per drive, based on the I/O
request size I was seeing at the time in previous installations of the same
application, which was generally doing writes around 100KB. I'm trying to
determine how to proceed to optimize write performance. Recreating the
filesystem and reloading its existing data is not out of the question, but
would be a last resort.
The filesystem contains a MongoDB installation consisting of roughly 13000
2GB files which are already allocated. The application is almost exclusively
inserting data; there are no updates, and files are written pretty much
sequentially. When I set up the fstab entry I believed that it would inherit
the stripe geometry automatically; however, I now understand that is not the
case with XFS version 2. What I'm seeing now is average request sizes of
about 100KB, half the stripe size. With a typical write volume around
5MB per second I am getting wait times around 50ms, which appears to be
degrading performance. The filesystem was created on a partition aligned to
a 1MB boundary.
Short of recreating the filesystem with the correct stripe width, would it
make sense to change the mount options to define a stripe width that
actually matches either the filesystem (11 stripe elements wide) or the
hardware (12 stripe elements wide)? Is there a danger of filesystem
corruption if I give fstab a mount geometry that doesn't match the values
used at filesystem creation time?
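For concreteness, the kind of override I have in mind would look something like this in fstab (values hypothetical: the sunit/swidth mount options are given in 512-byte sectors, so an 8KB element across 11 data disks would be sunit=16, swidth=176):

```
/dev/sdb1  /data  xfs  defaults,sunit=16,swidth=176  0 0
```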
I'm unclear on the role of the RAID hardware cache in this. Since the writes
are sequential, and since the volume of data written is such that it would
take about 3 minutes to actually fill the RAID cache, I would think the data
would be resident in the cache long enough to assemble a full-width stripe
at the hardware level and avoid the 4 I/O RAID5 penalty.
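The arithmetic behind that "about 3 minutes" estimate, for reference (assumed round numbers: the controller's 1GB cache filling at roughly 5MB/s of sustained writes):

```shell
# Rough check: time to fill the RAID controller cache at the observed
# write rate (1GB cache, ~5MB/s sustained writes, both from above).
cache_mb=1024
write_mb_per_s=5
fill_s=$((cache_mb / write_mb_per_s))
echo "~${fill_s} seconds (~$((fill_s / 60)) minutes) to fill the cache"
```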
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
* Re: How to deal with XFS stripe geometry mismatch with hardware RAID5
From: Brian Candler @ 2012-03-14 7:37 UTC (permalink / raw)
To: troby; +Cc: xfs

On Tue, Mar 13, 2012 at 04:21:07PM -0700, troby wrote:
> there is very little metadata activity). When I created the filesystem I
> (mistakenly) believed the stripe width of the filesystem should count all 12
> drives rather than 11. I've seen some opinions that this is correct, but a
> larger number which have convinced me that it is not.

With a 12-disk RAID5 you have 11 data disks, so the optimal filesystem
alignment is 11 x stripe size. This is auto-detected for software (md)
raid, but may or may not be for hardware RAID controllers.

For example, here is a 12-disk RAID6 md array (10 data, 2 parity):

$ cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md127 : active raid6 sdf[4] sdb[0] sdh[6] sdj[8] sdi[7] sdd[2] sdm[10] sdk[9] sdc[1] sdl[11] sdg[5] sde[3]
      29302654080 blocks super 1.2 level 6, 64k chunk, algorithm 2 [12/12] [UUUUUUUUUUUU]

And here is the XFS filesystem which was created on it:

$ xfs_info /dev/md127
meta-data=/dev/md127             isize=256    agcount=32, agsize=228926992 blks
         =                       sectsz=512   attr=2
data     =                       bsize=4096   blocks=7325663520, imaxpct=5
         =                       sunit=16     swidth=160 blks
naming   =version 2              bsize=16384  ascii-ci=0
log      =internal               bsize=4096   blocks=521728, version=2
         =                       sectsz=512   sunit=16 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

The parameters were detected automatically. sunit=16 x 4K = 64K,
swidth=160 x 4K = 640K.
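For your setup, the same derivation can be sketched as below (an illustrative sketch, not from the thread: a 12-disk RAID5 with the 8KB per-drive element you chose; the mkfs.xfs invocation in the comment is hypothetical and should not be run on a live device):

```shell
# Deriving mkfs.xfs su/sw for a 12-disk RAID5 with an 8KB chunk per drive.
chunk_kb=8
disks=12
data_disks=$((disks - 1))   # RAID5 loses one disk's worth of space to parity
full_stripe_kb=$((chunk_kb * data_disks))
echo "su=${chunk_kb}k sw=${data_disks} (full stripe = ${full_stripe_kb}KB)"
# A matching (hypothetical) invocation would be:
#   mkfs.xfs -d su=${chunk_kb}k,sw=${data_disks} <device>
```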
> I also set up the RAID
> BIOS to use a small stripe element of 8KB per drive, based on the I/O
> request size I was seeing at the time in previous installations of the same
> application, which was generally doing writes around 100KB.

I'd say this is almost guaranteed to give poor performance, because there
will always be a partial stripe write if you are doing random writes. E.g.
consider the best case, which is when the 100KB is aligned with the start
of the stripe. You will have:

- an 88KB write across the whole stripe: 12 disks seek and write; this
  will take a whole revolution before it completes on every drive, i.e.
  8.3ms rotational latency, in addition to seek time. The transfer time
  will be insignificant.

- one tiny 12KB write across a partial stripe. This will involve an 8K
  write to block A, a 4K read of block B and block P (parity), and a 4K
  write of block B and block P.

Now consider what it would have been with a 256KB stripe size. If you're
lucky and the whole 100K fits within a chunk, you'll have:

- read 100K from block A and block P
- write 100K to block A and block P

There is less rotational latency, only slightly higher transfer time (for
a slow drive which does 100MB/sec, 100KB will take 1ms), and it will allow
concurrent writers in the same area of disk, and much faster access if
there are concurrent readers of those 100K chunks.

The performance will still suck, however, compared to RAID10.

> I'm unclear on the role of the RAID hardware cache in this. Since the writes
> are sequential, and since the volume of data written is such that it would
> take about 3 minutes to actually fill the RAID cache, I would think the data
> would be resident in the cache long enough to assemble a full-width stripe
> at the hardware level and avoid the 4 I/O RAID5 penalty.

Only if you're writing sequentially.
For example, if you were untarring a
huge tar file containing 100KB files, all in the same directory, XFS can
allocate the extents one after the other, and so you will be doing pure
stripe writes.

But for *random* I/O, which I'm pretty sure is what mongodb will be doing,
you won't have a chance. The controller will be forced to read the existing
data and parity blocks so it can write back the updated parity.

So the conclusion is: do you actually care about performance for this
application? If you do, I'd say don't use RAID5. If you absolutely must
use parity RAID then go buy a Netapp ($$$) or experiment with btrfs (risky).
The cost of another 10 disks for a RAID10 array is going to be small in
comparison.

Regards,

Brian.
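The partial-stripe arithmetic in the reply above (a 100KB write against an 88KB full stripe) can be checked with a quick sketch, using the numbers discussed in the thread:

```shell
# A 100KB write on an 11+1 RAID5 with 8KB chunks splits into one full
# 88KB stripe plus a 12KB remainder that forces a parity read-modify-write.
write_kb=100
chunk_kb=8
data_disks=11
stripe_kb=$((chunk_kb * data_disks))
full_kb=$(( (write_kb / stripe_kb) * stripe_kb ))
partial_kb=$((write_kb - full_kb))
echo "full stripe(s): ${full_kb}KB, partial remainder: ${partial_kb}KB"
```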
* Re: How to deal with XFS stripe geometry mismatch with hardware RAID5
From: Brian Candler @ 2012-03-14 7:52 UTC (permalink / raw)
To: troby; +Cc: xfs

> So the conclusion is: do you actually care about performance for this
> application? If you do, I'd say don't use RAID5. If you absolutely must
> use parity RAID then go buy a Netapp ($$$) or experiment with btrfs (risky).
> The cost of another 10 disks for a RAID10 array is going to be small in
> comparison.

Or you could switch to another database like couchdb, which only appends to
its database and index files - it never goes back and overwrites existing
blocks.
* Re: How to deal with XFS stripe geometry mismatch with hardware RAID5
From: Peter Grandi @ 2012-03-14 15:41 UTC (permalink / raw)
To: Linux fs XFS

[ ... chunk sizes and relatively small random IO ... ]

> So the conclusion is: do you actually care about performance
> for this application? If you do, I'd say don't use RAID5.

That's a general argument :-). http://WWW.BAARF.com/

The argument you make about RMW for relatively small random
transactions becomes even more relevant when considering parity
rebuilding in case of a drive failure.

> If you absolutely must use parity RAID then go buy a Netapp
> ($$$) or experiment with btrfs (risky).

The Netapp WAFL or BTRFS don't "solve" the RMW problem, they
just do parity with COW (object based in the case of BTRFS).
The COW does not do in-place RMW, but something that has the
same cost overall (depending on the balance of reads/writes,
duty cycle, and temporal vs spatial locality).

The presence of parity chunks that must be kept in sync with
the other blocks in the same stripe turns the stripe into a
block "cluster" for write purposes, and that's inescapable.

If *multithreaded* performance were not important there would
instead be a case for RAID2/3 with synchronous disks (to nullify
the disk alignment times), but suitable components are probably
not easy to source.

> The cost of another 10 disks for a RAID10 array is going to be
> small in comparison.

More wise words, but this is a discussion about a choice to use
an 11+1 RAID5, which is something that looks good to "management"
by saving money upfront while delaying the trouble (lower speed
and higher risk) until later :-).
* Re: How to deal with XFS stripe geometry mismatch with hardware RAID5
From: troby @ 2012-03-14 17:53 UTC (permalink / raw)
To: xfs

The choice of RAID5 was a compromise due to the need to store 30TB of data
on each of 2 systems (a master and a replicated slave) - we couldn't afford
that much space on our SAN for this application, but we could afford a
12-bay system with 3TB SATA drives. My hope was that since the write
pattern was expected to be large sequential writes with no updates, the
RAID5 penalty would not be significant. And it's quite possible that would
be the case if I had got the stripe width right. The 8K element size was
chosen because the actual average request size I was seeing on previous
installations of the database was around 60K, which is still smaller than
the stripe width over 12 drives even using 8K.

I did try btrfs early on to take advantage of compression, but it failed.
This was about six months ago, though.
* Re: How to deal with XFS stripe geometry mismatch with hardware RAID5
From: Stan Hoeppner @ 2012-03-14 8:36 UTC (permalink / raw)
To: troby; +Cc: xfs

On 3/13/2012 6:21 PM, troby wrote:

> Short of recreating the filesystem with the correct stripe width, would it
> make sense to change the mount options to define a stripe width that
> actually matches either the filesystem (11 stripe elements wide) or the
> hardware (12 stripe elements wide)? Is there a danger of filesystem
> corruption if I give fstab a mount geometry that doesn't match the values
> used at filesystem creation time?

What would make sense is for you to first show

$ cat /etc/fstab
$ xfs_info /dev/raid_device_name

before we recommend any changes.

> I'm unclear on the role of the RAID hardware cache in this. Since the writes
> are sequential,

This seems to be an assumption at odds with other information you've
provided.

> and since the volume of data written is such that it would
> take about 3 minutes to actually fill the RAID cache,

The PERC 700 operates in write-through cache mode if no BBU is present
or the battery is degraded or has failed. You did not state whether
your PERC 700 has the BBU installed. If not, you can increase write
performance and decrease latency pretty substantially by adding the BBU,
which enables the write-back cache mode.

You may want to check whether MongoDB uses fsync writes by default. If
it does, and you don't have the BBU and write-back cache, this is
affecting your write latency and throughput as well.

> I would think the data
> would be resident in the cache long enough to assemble a full-width stripe
> at the hardware level and avoid the 4 I/O RAID5 penalty.

Again, write-back cache is only enabled with the BBU on the PERC 700.
Do note that achieving full stripe width writes is as much a function of
your application workload and filesystem tuning as it is the RAID
firmware, especially if the cache is in write-through mode, in which
case the firmware can't do much, if anything, to maximize full width
stripes.

And keep in mind you won't hit the parity read-modify-write penalty on
new stripe writes. This only happens when rewriting existing stripes.
Your reported 50ms of latency for 100KB write IOs seems to suggest you
don't have the BBU installed and you're actually doing RMW on existing
stripes, not strictly new stripe writes. This is likely because...

As an XFS filesystem gets full (you're at ~87%), file blocks may begin
to be written into free space within existing partially occupied RAID
stripes. This is where the RAID5/6 RMW penalty really kicks you in the
a$$, especially if you have misaligned the filesystem geometry to the
underlying RAID geometry.

--
Stan
* Re: How to deal with XFS stripe geometry mismatch with hardware RAID5
From: troby @ 2012-03-14 17:43 UTC (permalink / raw)
To: xfs

$ cat /etc/fstab
/dev/sdb1 /data xfs defaults,logdev=/dev/sda3,logbsize=256k,logbufs=8,largeio,nobarrier

$ xfs_info /dev/sdb1
meta-data=/dev/sdb1              isize=256    agcount=32, agsize=251772920 blks
         =                       sectsz=4096  attr=0
data     =                       bsize=4096   blocks=8056733408, imaxpct=2
         =                       sunit=2      swidth=24 blks, unwritten=1
naming   =version 2              bsize=4096
log      =external               bsize=4096   blocks=16000, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

Mongo pre-allocates its datafiles and zero-fills them (there is a short
header at the start of each, not rewritten as far as I know) and then
writes to them sequentially, wrapping around when it hits the end. In this
case the entire load is inserts, no updates, hence the sequential writes.
The data will not wrap around for about 6 months, at which time old files
will be overwritten starting from the beginning. The BBU is functioning
and the cache is set to write-back. The files are memory-mapped; I'll
check whether fsync is used. Flushing is done about every 30 seconds and
takes about 8 seconds.

One thing I'm wondering is whether the incorrect stripe structure I
specified with mkfs is actually written into the filesystem structure, or
is effectively just a hint to the kernel for what to use for a write size.
If the latter, could I specify the correct stripe width in the mount
options and override the incorrect width used by mkfs? Since the current
average write size is only about half the specified stripe size, and since
I'm not using md or xfs v3, it seems the kernel is ignoring it for now.
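Decoding my xfs_info geometry above (a quick sketch; xfs_info reports sunit and swidth in filesystem blocks, bsize=4096 here) confirms the mistake — the stripe width was computed over all 12 drives rather than 11:

```shell
# sunit/swidth from xfs_info are in 4KB filesystem blocks, so
# sunit=2, swidth=24 decode as follows.
bsize_kb=4
sunit_blks=2
swidth_blks=24
chunk_kb=$((sunit_blks * bsize_kb))          # per-drive RAID element
stripe_kb=$((swidth_blks * bsize_kb))        # full stripe as mkfs saw it
width_chunks=$((swidth_blks / sunit_blks))   # number of elements counted
echo "chunk=${chunk_kb}KB, stripe=${stripe_kb}KB over ${width_chunks} chunks"
```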
* Re: How to deal with XFS stripe geometry mismatch with hardware RAID5
From: Brian Candler @ 2012-03-14 21:05 UTC (permalink / raw)
To: troby; +Cc: xfs

On Wed, Mar 14, 2012 at 10:43:44AM -0700, troby wrote:
> Mongo pre-allocates its datafiles and zero-fills them (there is a short
> header at the start of each, not rewritten as far as I know) and then
> writes to them sequentially, wrapping around when it hits the end. In this
> case the entire load is inserts, no updates, hence the sequential writes.
> The data will not wrap around for about 6 months, at which time old files
> will be overwritten starting from the beginning. The BBU is functioning and
> the cache is set to write-back. The files are memory-mapped, I'll check
> whether fsync is used. Flushing is done about every 30 seconds and takes
> about 8 seconds.

How much data has been added to mongodb in those 30 seconds?

If everything really was being written sequentially then I reckon you could
write about 6.6GB in that time (11 disks x 75MB/sec x 8 seconds). From your
posting I suspect you are not achieving that level of performance :-)

If it really is being written sequentially to a contiguous file then the
stripe alignment won't make any difference, because this is just a big
pre-allocated file, and XFS will do its best to give one big contiguous
chunk of space for it.

Anyway, you don't need to guess these things, you can easily find out.

(1) Is the file preallocated and contiguous, or fragmented?

    # xfs_bmap /path/to/file

This will show you if you get one huge extent. If you get a number of
large extents (say 100MB+) that would be fine for performance too. If you
get lots of shrapnel then there's a problem.

(2) Are you really writing sequentially?

    # btrace /dev/whatever | grep ' [DC] '

This will show you block requests dispatched [D] and completed [C] to the
controller.

And at a higher level:

    # strace -p <pid-of-mongodb-process>

will show you the seek/write/read operations that the application is
performing.

Once you have the answers to those, you can make a better judgement as to
what's happening.

(3) One other thing to check:

    cat /sys/block/xxx/bdi/read_ahead_kb
    cat /sys/block/xxx/queue/max_sectors_kb

Increasing those to 1024 (echo 1024 > ....) may make some improvement.

> One thing I'm wondering is whether the incorrect stripe structure I
> specified with mkfs is actually written into the file system structure

I am guessing that probably things like chunks of inodes are
stripe-aligned. But if you're really writing sequentially to a huge
contiguous file then it won't matter anyway.

Regards,

Brian.
* Re: How to deal with XFS stripe geometry mismatch with hardware RAID5 2012-03-14 21:05 ` Brian Candler @ 2012-03-14 23:21 ` troby 2012-03-15 0:31 ` Peter Grandi 0 siblings, 1 reply; 12+ messages in thread From: troby @ 2012-03-14 23:21 UTC (permalink / raw) To: xfs Brian Candler wrote: > > On Wed, Mar 14, 2012 at 10:43:44AM -0700, troby wrote: >> Mongo pre-allocates its datafiles and zero-fills them (there is a short >> header at the start of each, not rewritten as far as I know) and then >> writes to them sequentially, wrapping around when it hits the end. In >> this >> case the entire load is inserts, no updates, hence the sequential writes. >> The data will not wrap around for about 6 months, at which time old files >> will be overwritten starting from the beginning. The BBU is functioning >> and >> the cache is set to write-back. The files are memory-mapped, I'll check >> whether fsync is used. Flushing is done about every 30 seconds and takes >> about 8 seconds. > > How much data has been added to mongodb in those 30 seconds? > > typically 2.5 MB > > If everything really was being written sequentially then I reckon you > could > write about 6.6GB in that time (11 disks x 75MB/sec x 8 seconds). From > your > posting I suspect you are not achieving that level of performance :-) > > If it really is being written sequentially to a continguous file then the > stripe alignment won't make any difference, because this is just a big > pre-allocated file, and XFS will do its best to give one big contiguous > chunk of space for it. > > Anwyay, you don't need to guess these things, you can easily find out. > > (1) Is the file preallocated and contiguous, or fragmented? > > # xfs_bmap /path/to/file > > All seem to have a single extent: > this is a currently active file: > lfs.303: > 0: [0..4192255]: 36322376672..36326568927 > > this is an old file: > lfs.3: > 0: [0..1048575]: 2039336992..2040385567 > > > > This will show you if you get one huge extent. 
> If you get a number of large extents (say 100MB+) that would be fine for
> performance too. If you get lots of shrapnel then there's a problem.
>
> (2) Are you really writing sequentially?
>
>     # btrace /dev/whatever | grep ' [DC] '
>
> This will show you block requests dispatched [D] and completed [C] to the
> controller.

I'm not familiar with the btrace output, but here's the summary of roughly
5 minutes:

Total (8,16):
 Reads Queued:      16,914,    1,888MiB   Writes Queued:      47,147,    1,438MiB
 Read Dispatches:   16,914,    1,888MiB   Write Dispatches:   47,050,    1,438MiB
 Reads Requeued:         0                Writes Requeued:         0
 Reads Completed:   16,914,    1,888MiB   Writes Completed:   47,050,    1,438MiB
 Read Merges:            0,        0KiB   Write Merges:           97,      592KiB
 IO unplugs:        17,060                Timer unplugs:           6

Throughput (R/W): 5,528KiB/s / 4,209KiB/s
Events (8,16): 418,873 entries
Skips: 0 forward (0 - 0.0%)

And here is some of the detail:

8,16  0  2251  7.674877079   5364  C R 42376096952 + 256 [0]
8,16  0  2252  7.675031410   5364  C R 4046119976 + 256 [0]
8,16  0  2259  7.689553858   5364  D R 4046120232 + 256 [mongod]
8,16  0  2260  7.689812456   5364  C R 4046120232 + 256 [0]
8,16  0  2267  7.690973707   5364  D R 42376097208 + 256 [mongod]
8,16  0  2268  7.691225467   5364  C R 42376097208 + 256 [0]
8,16  0  2275  7.699438100   5364  D R 21964732520 + 256 [mongod]
8,16  0  2276  7.699688313      0  C R 21964732520 + 256 [0]
8,16  0  2283  7.700493875   5364  D R 4046120488 + 256 [mongod]
8,16  0  2284  7.700749134   5364  C R 4046120488 + 256 [0]
8,16  0  2291  7.703460687   5364  D R 42376097464 + 256 [mongod]
8,16  0  2292  7.703707154   5364  C R 42376097464 + 256 [0]
8,16  2   928  7.730573720   5364  D R 21964760296 + 256 [mongod]
8,16  0  2293  7.747651477      0  C R 21964760296 + 256 [0]
8,16  0  2300  7.754517529   5364  D R 4046120744 + 256 [mongod]
8,16  0  2301  7.754781549   5364  C R 4046120744 + 256 [0]
8,16  0  2308  7.760712917   5364  D R 42376097720 + 256 [mongod]
8,16  0  2309  7.761392841   5364  C R 42376097720 + 256 [0]
8,16  2   935  7.769193162   5597  D R 4046121000 + 256 [mongod]
8,16  0  2310  7.769458041      0  C R 4046121000 + 256 [0]
8,16  2   942  7.773021214   5597  D R 42376097976 + 256 [mongod]
8,16  0  2311  7.773290126      0  C R 42376097976 + 256 [0]
8,16  2   949  7.780080336   5597  D R 4046121256 + 256 [mongod]
8,16  0  2312  7.780346410      0  C R 4046121256 + 256 [0]
8,16  2   956  7.808903046   5597  D R 42376098232 + 256 [mongod]
8,16  0  2313  7.809197289      0  C R 42376098232 + 256 [0]
8,16  2   963  7.816907787   5597  D R 4046121512 + 256 [mongod]
8,16  0  2314  7.817182676      0  C R 4046121512 + 256 [0]
8,16  2   970  7.827457411   5597  D R 42376098488 + 256 [mongod]
8,16  0  2315  7.827730410      0  C R 42376098488 + 256 [0]
8,16  0  2316  7.833225453      0  C R 4046121768 + 256 [0]
8,16  1  2410  7.844128616  37922  D W 60216121432 + 80 [flush-8:16]
8,16  1  2411  7.844140476  37922  D W 60216121528 + 256 [flush-8:16]
8,16  1  2412  7.844145438  37922  D W 60216121784 + 256 [flush-8:16]
8,16  1  2413  7.844149939  37922  D W 60216122040 + 256 [flush-8:16]
8,16  1  2414  7.844154486  37922  D W 60216122296 + 256 [flush-8:16]
8,16  1  2415  7.844159104  37922  D W 60216122552 + 256 [flush-8:16]
8,16  1  2416  7.844163489  37922  D W 60216122808 + 256 [flush-8:16]
8,16  1  2417  7.844169195  37922  D W 60216123064 + 256 [flush-8:16]
8,16  1  2418  7.844173666  37922  D W 60216123320 + 256 [flush-8:16]
8,16  1  2419  7.844178182  37922  D W 60216123576 + 208 [flush-8:16]
8,16  1  2420  7.844182518  37922  D W 60216123800 + 256 [flush-8:16]
8,16  1  2421  7.844186886  37922  D W 60216124056 + 256 [flush-8:16]
8,16  1  2422  7.844191572  37922  D W 60216124312 + 256 [flush-8:16]
8,16  1  2423  7.844195825  37922  D W 60216124568 + 256 [flush-8:16]
8,16  1  2424  7.844200405  37922  D W 60216124824 + 256 [flush-8:16]
8,16  1  2425  7.844205039  37922  D W 60216125080 + 256 [flush-8:16]
8,16  1  2426  7.844209304  37922  D W 60216125336 + 256 [flush-8:16]
8,16  1  2427  7.844213483  37922  D W 60216125592 + 256 [flush-8:16]
8,16  1  2428  7.844217895  37922  D W 60216125848 + 256 [flush-8:16]
8,16  1  2429  7.844222295  37922  D W 60216126104 + 256 [flush-8:16]
8,16  1  2430  7.844226651  37922  D W 60216126360 + 256 [flush-8:16]
8,16  1  2431  7.844230959  37922  D W 60216126616 + 256 [flush-8:16]
8,16  1  2432  7.844235575  37922  D W 60216126872 + 256 [flush-8:16]
8,16  1  2433  7.844239866  37922  D W 60216127128 + 256 [flush-8:16]
8,16  1  2434  7.844244274  37922  D W 60216127384 + 256 [flush-8:16]
8,16  1  2435  7.844249817  37922  D W 60216127640 + 256 [flush-8:16]
8,16  1  2436  7.844254266  37922  D W 60216127896 + 256 [flush-8:16]
8,16  1  2437  7.844258706  37922  D W 60216128152 + 256 [flush-8:16]
8,16  1  2438  7.844263213  37922  D W 60216128408 + 256 [flush-8:16]
8,16  1  2439  7.844267570  37922  D W 60216128664 + 256 [flush-8:16]

> And at a higher level:
>
>     # strace -p <pid-of-mongodb-process>
>
> will show you the seek/write/read operations that the application is
> performing.
>
> Once you have the answers to those, you can make a better judgement as to
> what's happening.
>
> (3) One other thing to check:
>
>     cat /sys/block/xxx/bdi/read_ahead_kb
>     cat /sys/block/xxx/queue/max_sectors_kb
>
> Increasing those to 1024 (echo 1024 > ....) may make some improvement.

They were 128 - I increased the first, but trying to write the second
gave me a write error.

>> One thing I'm wondering is whether the incorrect stripe structure I
>> specified with mkfs is actually written into the file system structure

> I am guessing that probably things like chunks of inodes are
> stripe-aligned. But if you're really writing sequentially to a huge
> contiguous file then it won't matter anyway.
>
> Regards,
>
> Brian.
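Working out average request sizes from a trace like the one above can be scripted rather than eyeballed. The sketch below is not from the thread; it parses blkparse-style records, assuming the default output field layout shown in the excerpt (action code in field 6, sector count after the '+', sectors of 512 bytes).

```python
# Sketch: average request size per direction from blkparse-style records,
# e.g. "8,16 0 2259 7.689553858 5364 D R 4046120232 + 256 [mongod]".
# Field layout assumed from the default blkparse output quoted above.
from collections import defaultdict

def avg_request_kib(lines):
    totals = defaultdict(lambda: [0, 0])   # direction -> [requests, sectors]
    for line in lines:
        f = line.split()
        if len(f) < 10 or f[5] != "D":     # count dispatched requests only
            continue
        direction = f[6][0]                # RWBS field: 'R' or 'W' leads
        totals[direction][0] += 1
        totals[direction][1] += int(f[9])  # sector count after the '+'
    return {d: s * 512 / 1024 / n for d, (n, s) in totals.items()}

trace = [
    "8,16 0 2259 7.689553858 5364 D R 4046120232 + 256 [mongod]",
    "8,16 0 2260 7.689812456 5364 C R 4046120232 + 256 [0]",
    "8,16 1 2410 7.844128616 37922 D W 60216121432 + 80 [flush-8:16]",
    "8,16 1 2411 7.844140476 37922 D W 60216121528 + 256 [flush-8:16]",
]
print(avg_request_kib(trace))  # {'R': 128.0, 'W': 84.0}
```

Run over the full five-minute trace, the same arithmetic reproduces the roughly 100K average read size discussed in this thread.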
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
* Re: How to deal with XFS stripe geometry mismatch with hardware RAID5
  2012-03-14 23:21 ` troby
@ 2012-03-15  0:31 ` Peter Grandi
  0 siblings, 0 replies; 12+ messages in thread
From: Peter Grandi @ 2012-03-15 0:31 UTC (permalink / raw)
To: Linux fs XFS

[ ... ]

>> lfs.303:
>> 0: [0..4192255]: 36322376672..36326568927

[ ... ]

>> lfs.3:
>> 0: [0..1048575]: 2039336992..2040385567

$ factor 36322376672 2039336992
36322376672: 2 2 2 2 2 37 3257 9419
2039336992: 2 2 2 2 2 7 11 37 22369
$ factor 4192256 1048576
4192256: 2 2 2 2 2 2 2 2 2 2 2 23 89
1048576: 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2

The starting addresses look to be 16KiB aligned (starting sector a
multiple of 2^5 512B sectors). I would have expected otherwise. The sizes
are multiples of 1MiB, 2047MiB and 512MiB, which is plausible.

[ ... ]

> I'm not familiar with the btrace output, but here's the summary of roughly
> 5 minutes:

>> Total (8,16):
>>  Reads Queued:      16,914,    1,888MiB   Writes Queued:      47,147,    1,438MiB
>>  Read Dispatches:   16,914,    1,888MiB   Write Dispatches:   47,050,    1,438MiB
>>  Reads Requeued:         0                Writes Requeued:         0
>>  Reads Completed:   16,914,    1,888MiB   Writes Completed:   47,050,    1,438MiB
>>  Read Merges:            0,        0KiB   Write Merges:           97,      592KiB
>>  IO unplugs:        17,060                Timer unplugs:           6
>> Throughput (R/W): 5,528KiB/s / 4,209KiB/s
>> Events (8,16): 418,873 entries
>> Skips: 0 forward (0 - 0.0%)

That's around 17k reads, or 60/s, each of 100K, and 47k writes, or 160/s,
average 31K. Both reads and writes happen at around 4-5MB/s. Since the
RAID5 is managed by the PERC, the reads cannot be RMW reads, and it is
unlikely that they are sequential with the writes. There may be quite a
bit of random access going on.
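The divisibility reasoning in the `factor` output can also be checked directly. This sketch (not from the thread) computes the largest power-of-two byte alignment of the two quoted extent start sectors, and tests one of them against the 88KiB stripe width implied by this thread's geometry (8KiB chunk, 11 data drives):

```python
# Sketch: check the alignment of the xfs_bmap extent start sectors quoted
# above, instead of reading it off prime factorizations by hand.

def byte_alignment(sector, sector_size=512):
    """Largest power-of-two byte boundary the sector address sits on."""
    nbytes = sector * sector_size
    return nbytes & -nbytes               # lowest set bit = alignment

for start in (36322376672, 2039336992):   # lfs.303 and lfs.3 start sectors
    print(start, byte_alignment(start))   # both are 16 KiB (16384) aligned

# Against the full stripe width (8 KiB chunk * 11 data drives = 88 KiB):
swidth_bytes = 8 * 1024 * 11
print(36322376672 * 512 % swidth_bytes == 0)   # False: not stripe-aligned
```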
* Re: How to deal with XFS stripe geometry mismatch with hardware RAID5
  2012-03-14 17:43 ` troby
  2012-03-14 21:05 ` Brian Candler
@ 2012-03-14 22:48 ` Peter Grandi
  1 sibling, 0 replies; 12+ messages in thread
From: Peter Grandi @ 2012-03-14 22:48 UTC (permalink / raw)
To: Linux fs XFS

>>>> I have a 30TB XFS filesystem created on CentOS 5.4 X86_64,
>>>> kernel 2.6.39, using xfsprogs 2.9.4. The underlying hardware
>>>> is 12 3TB SATA drives on a Dell PERC 700 controller with 1GB
>>>> cache. [ ... ]

>>>> [ ... ] set up the RAID BIOS to use a small stripe element
>>>> of 8KB per drive, [ ... ] The filesystem contains a MongoDB
>>>> installation consisting of roughly 13000 2GB files which are
>>>> already allocated. The application is almost exclusively
>>>> inserting data, there are no updates, and files are written
>>>> pretty much sequentially. [ ... ]

How many of the 13,000 are being written at roughly the same time?
Because if you are logging 100K to each of them all the time, that is a
heavily random-access workload. Each file may be written sequentially,
but the *disk* would be subject to a storm of seeks.

>>>> When I set up the fstab entry I believed that it would
>>>> inherit the stripe geometry automatically, however now I
>>>> understand that is not the case with XFS version 2.

'mkfs.xfs' asks the kernel about drive geometry. If the kernel could
read it off the PERC 700 it would have been fine. The kernel can easily
read geometry off MD etc. RAID sets because the relevant info is already
in the system state.

>>>> What I'm seeing now is average request sizes which are about
>>>> 100KB, half the stripe size.

But writes from what to what? From Linux to the PERC 700 cache, or from
the PERC 700 cache to the RAID set drives?

>>>> With a typical write volume around 5MB per second I am
>>>> getting wait times around 50ms, which appears to be
>>>> degrading performance. [ ... ]

5MB per second in aggregate is hardly worth worrying about. What do the
50ms mean as wait times? Again, it matters a great deal whether it is
Linux->PERC or PERC->drives.

If you have barriers enabled, and MongoDB is 'fsync'ing every 100K, then
100K will be the transaction size. Also, with a 100K append size and
5MB/s aggregate, you are doing 50 transactions/s, and it matters a great
deal whether all are to the same file, sequentially, or each is to a
different file, etc.

>>>> [ ... ] Is there a danger of filesystem corruption if I give
>>>> fstab a mount geometry that doesn't match the values used at
>>>> filesystem creation time?

No, those values are purely advisory.

>>>> I'm unclear on the role of the RAID hardware cache in
>>>> this. Since the writes are sequential, and since the volume
>>>> of data written is such that it would take about 3 minutes
>>>> to actually fill the RAID cache, I would think the data
>>>> would be resident in the cache long enough to assemble a
>>>> full-width stripe at the hardware level and avoid the 4 I/O
>>>> RAID5 penalty.

Sure, if the cache is configured right and barriers are not invoked
every 100KiB.

[ ... ]

> Mongo pre-allocates its datafiles and zero-fills them (there is
> a short header at the start of each, not rewritten as far as I
> know) and then writes to them sequentially, wrapping around
> when it hits the end.

Preallocating is good.

> In this case the entire load is inserts, no updates, hence the
> sequential writes.

So it is not random access, if it is a log-like operation. If it is a
lot of 100K appends, things look a lot better.

> [ ... ] The BBU is functioning and the cache is set to
> write-back.

That's good. Check whether XFS has barriers enabled, and it might help
to make sure that the host adapter really knows the geometry of the RAID
set; if there is a parameter for how much unwritten data to buffer, set
it high, to maximize the chances that it will do as it should and issue
whole-stripe writes.

> [ ... ] Flushing is done about every 30 seconds and takes
> about 8 seconds.

I usually prefer nearly continuous flushing (at the Linux level in
particular), in part to avoid the 8s pauses, even if that partly defeats
the XFS delayed allocation logic. However there is a contradiction here
between seeing 100K transactions and flushing taking 8s at a write rate
of 5MB/s, every 30s. The latter would imply 40MB of writes every 30s.

> One thing I'm wondering is whether the incorrect stripe
> structure I specified with mkfs

Probably the incorrect stripe structure here is mostly not that
important; there are bigger factors at play.

> is actually written into the file system structure or
> effectively just a hint to the kernel for what to use for a
> write size.

The stripe parameters have static and dynamic effects:

static
  - The metadata allocator attempts to interleave metadata at chunk
    ('sunit') boundaries to parallelize access.
  - The data allocator attempts to allocate extents on stripe
    ('swidth') aligned boundaries to maximize the chances of doing
    stripe-aligned IO.
  These allocations are aligned according to the stripe parameters
  current when the metadata and data extents were allocated.

dynamic
  - The block IO bottom end attempts to generate bulk IO requests
    aligned on stripe boundaries.
  These requests are aligned according to the stripe parameters current
  at the moment the IO occurs.

The metadata and data extents may well have been allocated with
alignment different from that on which IO requests are aligned.

> If not, could I specify the correct stripe width in the mount
> options and override the incorrect width used by mkfs?

Sure, but the space already allocated is already on the "wrong"
boundaries, even if XFS supposedly will try to issue IOs on the
as-mounted stripe alignment.

> Since the current average write size is only about half the
> specified stripe size, and since I'm not using md or xfs v.3
> it seems the kernel is ignoring it for now.

All the kernel does is upload a bunch of blocks to the PERC, and all the
RAID optimization is done by the PERC.

> The choice of RAID5 was a compromise due to the need to store
> 30TB of data on each of 2 systems (a master and a replicated
> slave) - we couldn't afford that much space on our SAN for this
> application, but we could afford a 12-bay system with 3TB SATA
> drives.

Still, an 11+1 RAID5 is a brave option to take.

> My hope was that since the write pattern was expected to be
> large sequential writes with no updates that the RAID5 penalty
> would not be significant.

That was a reasonable hope, but 11+1 RAID5 has other downsides.

> And it's quite possible that would be the case if I had got the
> stripe width right.

Uh, I suspect that stripe alignment here is not that important. That
50ms after 100K sounds much, much worse than RMW. On 15k drives 50ms is
about 4-6 seek times, which is way more than RMW would take.

> The 8K element size was chosen because the actual average
> request size I was seeing on previous installations of the
> database was around 60K, which is still smaller than the stripe
> width over 12 drives even using 8K.

That is not necessarily the right logic, but for bulk sequential
transfers a small chunk size is a good idea; in general, other things
being equal, the smaller the chunk and stripe size the better.

> I did try btrfs early on to take advantage of compression, but
> it failed. This was about six months ago, though.

"failed" sounds a bit strange, and note that BTRFS has much larger
overheads than other filesystems. But your application seems ideal for
XFS.

Instead of using some weird kernel like 2.6.39 with EL5, you might want
to try an "official" EL5 kernel like the Oracle 2.6.32 one, or switch to
EL6/CentOS6.
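As a worked example of the mount-option override discussed above, this sketch (geometry from this thread, not a command from it) derives the sunit/swidth values, assuming the mount(8) convention that both are given in 512-byte units:

```python
# Sketch: sunit/swidth mount-option values for this thread's geometry,
# an 8 KiB chunk per drive and 11 data drives in the 11+1 RAID5.
# Assumes mount(8) for XFS takes sunit/swidth in 512-byte units.

def xfs_mount_stripe_opts(chunk_bytes, data_disks):
    sunit = chunk_bytes // 512        # stripe unit in 512-byte units
    swidth = sunit * data_disks       # full data stripe width
    return f"sunit={sunit},swidth={swidth}"

print(xfs_mount_stripe_opts(8 * 1024, 11))  # sunit=16,swidth=176
```

With the mistaken 12-drive width the same arithmetic gives swidth=192, which is presumably what the filesystem was created with.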
* Re: How to deal with XFS stripe geometry mismatch with hardware RAID5
  2012-03-13 23:21 How to deal with XFS stripe geometry mismatch with hardware RAID5 troby
  2012-03-14  7:37 ` Brian Candler
  2012-03-14  8:36 ` Stan Hoeppner
@ 2012-03-14 23:22 ` Peter Grandi
  2 siblings, 0 replies; 12+ messages in thread
From: Peter Grandi @ 2012-03-14 23:22 UTC (permalink / raw)
To: Linux fs XFS

> I have a 30TB XFS filesystem created on CentOS 5.4 X86_64,
> kernel 2.6.39, using xfsprogs 2.9.4. [ ... ] The filesystem
> contains a MongoDB installation consisting of roughly 13000
> 2GB files which are already allocated. [ ... ]

BTW, while 30TB is probably still in the realm of the plausible, if
excessively large, and this is a 64b system, this is a good example of a
gratuitously large filetree and an excessively wide RAID set under it:

  http://www.sabi.co.uk/blog/0805may.html#080516

A large single filetree only makes sense if one needs a large and
unified free-space pool. But in this application all files are
essentially independent and preallocated. A more manageable setup might
have been a set of 4-8TB filetrees, for example on a group of 3TB-drive
2+1 or even 4+1 (split in two partitions) RAID5 sets.
end of thread, other threads:[~2012-03-15  0:31 UTC | newest]

Thread overview: 12+ messages
2012-03-13 23:21 How to deal with XFS stripe geometry mismatch with hardware RAID5 troby
2012-03-14  7:37 ` Brian Candler
2012-03-14  7:52   ` Brian Candler
2012-03-14 15:41     ` Peter Grandi
2012-03-14 17:53       ` troby
2012-03-14  8:36 ` Stan Hoeppner
2012-03-14 17:43   ` troby
2012-03-14 21:05     ` Brian Candler
2012-03-14 23:21       ` troby
2012-03-15  0:31         ` Peter Grandi
2012-03-14 22:48     ` Peter Grandi
2012-03-14 23:22 ` Peter Grandi