public inbox for linux-xfs@vger.kernel.org
 help / color / mirror / Atom feed
* How to deal with XFS stripe geometry mismatch with hardware RAID5
@ 2012-03-13 23:21 troby
  2012-03-14  7:37 ` Brian Candler
                   ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: troby @ 2012-03-13 23:21 UTC (permalink / raw)
  To: xfs


I have a 30TB XFS filesystem created on CentOS 5.4 X86_64, kernel 2.6.39,
using xfsprogs 2.9.4. The underlying hardware is 12 3TB SATA drives on a
Dell PERC 700 controller with 1GB cache. There is an external journal on a
separate set of 15k SAS drives (I suspect now this was unnecessary, because
there is very little metadata activity). When I created the filesystem I
(mistakenly) believed the stripe width of the filesystem should count all 12
drives rather than 11. I've seen some opinions that this is correct, but a
larger number which have convinced me that it is not. I also set up the RAID
BIOS to use a small stripe element of 8KB per drive, based on the I/O
request size I was seeing at the time in previous installations of the same
application, which was generally doing writes around 100KB. I'm trying to
determine how to proceed to optimize write performance. Recreating the
filesystem and its existing data is not out of the question, but would be a
last resort.

The filesystem contains a MongoDB installation consisting of roughly 13000
2GB files which are already allocated. The application is almost exclusively
inserting data, there are no updates, and files are written pretty much
sequentially. When I set up the fstab entry I believed that it would inherit
the stripe geometry automatically, however now I understand that is not the
case with XFS version 2. What I'm seeing now is average request sizes which
are about 100KB, half the stripe size. With a typical write volume around
5MB per second I am getting wait times around 50ms, which appears to be
degrading performance. The filesystem was created on a partition aligned to
a 1MB boundary.

Short of recreating the filesystem with the correct stripe width, would it
make sense to change the mount options to define a stripe width that
actually matches either the filesystem (11 stripe elements wide) or the
hardware (12 stripe elements wide)? Is there a danger of filesystem
corruption if I give fstab a mount geometry that doesn't match the values
used at filesystem creation time?

I'm unclear on the role of the RAID hardware cache in this. Since the writes
are sequential, and since the volume of data written is such that it would
take about 3 minutes to actually fill the RAID cache, I would think the data
would be resident in the cache long enough to assemble a full-width stripe
at the hardware level and avoid the 4 I/O RAID5 penalty. 
-- 
View this message in context: http://old.nabble.com/How-to-deal-with-XFS-stripe-geometry-mismatch-with-hardware-RAID5-tp33498437p33498437.html
Sent from the Xfs - General mailing list archive at Nabble.com.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: How to deal with XFS stripe geometry mismatch with hardware RAID5
  2012-03-13 23:21 How to deal with XFS stripe geometry mismatch with hardware RAID5 troby
@ 2012-03-14  7:37 ` Brian Candler
  2012-03-14  7:52   ` Brian Candler
                     ` (2 more replies)
  2012-03-14  8:36 ` Stan Hoeppner
  2012-03-14 23:22 ` Peter Grandi
  2 siblings, 3 replies; 12+ messages in thread
From: Brian Candler @ 2012-03-14  7:37 UTC (permalink / raw)
  To: troby; +Cc: xfs

On Tue, Mar 13, 2012 at 04:21:07PM -0700, troby wrote:
> there is very little metadata activity). When I created the filesystem I
> (mistakenly) believed the stripe width of the filesystem should count all 12
> drives rather than 11. I've seen some opinions that this is correct, but a
> larger number which have convinced me that it is not.

With a 12-disk RAID5 you have 11 data disks, so the optimal filesystem
alignment is 11 x stripe size.  This is auto-detected for software (md)
raid, but may or may not be for hardware RAID controllers.

For example, here is a 12-disk RAID6 md array (10 data, 2 parity):

$ cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md127 : active raid6 sdf[4] sdb[0] sdh[6] sdj[8] sdi[7] sdd[2] sdm[10] sdk[9] sdc[1] sdl[11] sdg[5] sde[3]
      29302654080 blocks super 1.2 level 6, 64k chunk, algorithm 2 [12/12] [UUUUUUUUUUUU]
      
And here is the XFS filesystem which was created on it:

$ xfs_info /dev/md127
meta-data=/dev/md127             isize=256    agcount=32, agsize=228926992 blks
         =                       sectsz=512   attr=2
data     =                       bsize=4096   blocks=7325663520, imaxpct=5
         =                       sunit=16     swidth=160 blks
naming   =version 2              bsize=16384  ascii-ci=0
log      =internal               bsize=4096   blocks=521728, version=2
         =                       sectsz=512   sunit=16 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

The parameters were detected automatically: sunit = 16 x 4K = 64K, swidth =
160 x 4K = 640K.
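Those unit conversions can be sketched directly; this is a small illustration using the values from the array above (the mkfs.xfs line is echoed, not run, and `/dev/md127` is just the device from this example):

```shell
# Convert xfs_info's sunit/swidth (given in filesystem blocks) to bytes.
bsize=4096    # fs block size from xfs_info
sunit=16      # chunk, in fs blocks
swidth=160    # full data stripe, in fs blocks

echo "chunk: $((sunit * bsize / 1024))K, full stripe: $((swidth * bsize / 1024))K"

# The same geometry could be stated explicitly at mkfs time:
echo "mkfs.xfs -d su=$((sunit * bsize / 1024))k,sw=$((swidth / sunit)) /dev/md127"
```

This prints a 64K chunk and a 640K full stripe, i.e. `-d su=64k,sw=10`, matching the 10 data disks of the RAID6.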

> I also set up the RAID
> BIOS to use a small stripe element of 8KB per drive, based on the I/O
> request size I was seeing at the time in previous installations of the same
> application, which was generally doing writes around 100KB.

I'd say this is almost guaranteed to give poor performance, because there
will always be a partial stripe write if you are doing random writes.  For
example, consider the best case, where the 100KB is aligned with the start of
the stripe.  You will have:

- an 88KB write across the whole stripe
  - 12 disks seek and write; this will take a whole revolution before
    it completes on every drive, i.e. 8.3ms rotational latency, in addition
    to seek time. The transfer time will be insignificant
  - one tiny write
- a 12KB write across a partial stripe. This will involve an 8K write to block
  A, a 4K read of block B and block P (parity), and a 4K write of block B
  and block P.
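The arithmetic behind that split can be checked in a couple of lines; a sketch using the geometry under discussion (8K chunks, 11 data disks):

```shell
chunk_kb=8       # RAID chunk ("stripe element") per drive
data_disks=11    # 12-disk RAID5 = 11 data + 1 parity
write_kb=100     # typical request size reported

stripe_kb=$((chunk_kb * data_disks))   # full data stripe
echo "full-stripe portion: ${stripe_kb}K"
echo "partial-stripe remainder: $((write_kb - stripe_kb))K"
```

This yields an 88K full stripe plus a 12K remainder, which is where the read-modify-write cost comes from.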

Now consider what it would have been with a 256KB stripe size. If you're
lucky and the whole 100K fits within a chunk, you'll have:

- read 100K from block A and block P
- write 100K to block A and block P

There is less rotational latency and only slightly higher transfer time
(for a slow drive which does 100MB/sec, 100KB will take 1ms), and it will
allow concurrent writers in the same area of disk, and much faster access if
there are concurrent readers of those 100K chunks.

The performance will still suck however, compared to RAID10.

> I'm unclear on the role of the RAID hardware cache in this. Since the writes
> are sequential, and since the volume of data written is such that it would
> take about 3 minutes to actually fill the RAID cache, I would think the data
> would be resident in the cache long enough to assemble a full-width stripe
> at the hardware level and avoid the 4 I/O RAID5 penalty. 

Only if you're writing sequentially. For example, if you were untarring a
huge tar file containing 100KB files, all in the same directory, XFS can
allocate the extents one after the other, and so you will be doing pure
stripe writes.

But for *random* I/O, which I'm pretty sure is what mongodb will be doing,
you won't have a chance. The controller will be forced to read the existing
data and parity blocks so it can write back the updated parity.

So the conclusion is: do you actually care about performance for this
application?  If you do, I'd say don't use RAID5.  If you absolutely must
use parity RAID then go buy a Netapp ($$$) or experiment with btrfs (risky). 
The cost of another 10 disks for a RAID10 array is going to be small in
comparison.

Regards,

Brian.


* Re: How to deal with XFS stripe geometry mismatch with hardware RAID5
  2012-03-14  7:37 ` Brian Candler
@ 2012-03-14  7:52   ` Brian Candler
  2012-03-14 15:41   ` Peter Grandi
  2012-03-14 17:53   ` troby
  2 siblings, 0 replies; 12+ messages in thread
From: Brian Candler @ 2012-03-14  7:52 UTC (permalink / raw)
  To: troby; +Cc: xfs

> So the conclusion is: do you actually care about performance for this
> application?  If you do, I'd say don't use RAID5.  If you absolutely must
> use parity RAID then go buy a Netapp ($$$) or experiment with btrfs (risky). 
> The cost of another 10 disks for a RAID10 array is going to be small in
> comparison.

Or you could switch to another database like couchdb which only appends to
its database and index files - it never goes back and overwrites existing
blocks.


* Re: How to deal with XFS stripe geometry mismatch with hardware RAID5
  2012-03-13 23:21 How to deal with XFS stripe geometry mismatch with hardware RAID5 troby
  2012-03-14  7:37 ` Brian Candler
@ 2012-03-14  8:36 ` Stan Hoeppner
  2012-03-14 17:43   ` troby
  2012-03-14 23:22 ` Peter Grandi
  2 siblings, 1 reply; 12+ messages in thread
From: Stan Hoeppner @ 2012-03-14  8:36 UTC (permalink / raw)
  To: troby; +Cc: xfs

On 3/13/2012 6:21 PM, troby wrote:

> Short of recreating the filesystem with the correct stripe width, would it
> make sense to change the mount options to define a stripe width that
> actually matches either the filesystem (11 stripe elements wide) or the
> hardware (12 stripe elements wide)? Is there a danger of filesystem
> corruption if I give fstab a mount geometry that doesn't match the values
> used at filesystem creation time?

What would make sense is for you to first show

$ cat /etc/fstab
$ xfs_info /dev/raid_device_name

before we recommend any changes.

> I'm unclear on the role of the RAID hardware cache in this. Since the writes
> are sequential, 

This seems to be an assumption at odds with other information you've
provided.

> and since the volume of data written is such that it would
> take about 3 minutes to actually fill the RAID cache, 

The PERC 700 operates in write-through cache mode if no BBU is present
or the battery is degraded or has failed.  You did not state whether
your PERC 700 has the BBU installed.  If not, you can increase write
performance and decrease latency pretty substantially by adding the BBU
which enables the write-back cache mode.

You may want to check whether MongoDB uses fsync writes by default.  If
it does, and you don't have the BBU and write-back cache, this is
affecting your write latency and throughput as well.
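For checking the BBU and cache mode from the OS, LSI-based controllers such as the Dell PERC family can usually be queried with the MegaCli utility; this is only a sketch, assuming MegaCli is installed (the binary name, e.g. `MegaCli` vs `MegaCli64`, and exact flags vary by version):

```shell
# Query BBU state (presence, charge, health) on all adapters.
MegaCli64 -AdpBbuCmd -GetBbuStatus -aALL

# Show the current cache policy (WriteBack vs WriteThrough) for all
# logical drives on all adapters.
MegaCli64 -LDGetProp -Cache -LALL -aALL
```

If the second command reports WriteThrough despite a healthy BBU, the cache policy itself needs changing in the controller configuration.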

> I would think the data
> would be resident in the cache long enough to assemble a full-width stripe
> at the hardware level and avoid the 4 I/O RAID5 penalty. 

Again, write-back-cache is only enabled with BBU on the PERC 700.  Do
note that achieving full stripe width writes is as much a function of
your application workload and filesystem tuning as it is the RAID
firmware, especially if the cache is in write-through mode, in which
case the firmware can't do much, if anything, to maximize full width
stripes.

And keep in mind you won't hit the parity read-modify-write penalty on
new stripe writes.  This only happens when rewriting existing stripes.
Your reported 50ms of latency for 100KB write IOs seems to suggest you
don't have the BBU installed and you're actually doing RMW on existing
stripes, not strictly new stripe writes.  This is likely because...

As an XFS filesystem gets full (you're at ~87%), file blocks may begin
to be written into free space within existing partially occupied RAID
stripes.  This is where the RAID5/6 RMW penalty really kicks you in the
a$$, especially if you have misaligned the filesystem geometry to the
underlying RAID geometry.

-- 
Stan


* Re: How to deal with XFS stripe geometry mismatch with hardware RAID5
  2012-03-14  7:37 ` Brian Candler
  2012-03-14  7:52   ` Brian Candler
@ 2012-03-14 15:41   ` Peter Grandi
  2012-03-14 17:53   ` troby
  2 siblings, 0 replies; 12+ messages in thread
From: Peter Grandi @ 2012-03-14 15:41 UTC (permalink / raw)
  To: Linux fs XFS

[ ... chunk sizes and relatively small random IO ... ]

> So the conclusion is: do you actually care about performance
> for this application?  If you do, I'd say don't use RAID5.

That's a general argument :-). http://WWW.BAARF.com/

The argument you make about RMW for relatively small random
transactions becomes even more relevant when considering parity
rebuilding in case of a drive failure.

> If you absolutely must use parity RAID then go buy a Netapp
> ($$$) or experiment with btrfs (risky).

Netapp's WAFL and BTRFS don't "solve" the RMW problem; they
just do parity with COW (object based in the case of BTRFS).

The COW does not do in-place RMW, but something that has the
same cost overall (depending on balance of read/writes and duty
cycle and temporal vs spatial locality).

The presence of parity chunks that must be kept in sync with the
other blocks in the same stripe turns the stripe into a block
"cluster" for write purposes, and that's inescapable.

If *multithreaded* performance were not important there would
instead be a case for RAID2/3 with synchronous disks (to nullify
the disk alignment times), but suitable components are probably
not easy to source.

> The cost of another 10 disks for a RAID10 array is going to be
> small in comparison.

More wise words, but this is a discussion about a choice to use
an 11+1 RAID5, which is something that looks good to "management"
by saving money upfront while delaying trouble (lower speed and
higher risk) until later :-).


* Re: How to deal with XFS stripe geometry mismatch with hardware RAID5
  2012-03-14  8:36 ` Stan Hoeppner
@ 2012-03-14 17:43   ` troby
  2012-03-14 21:05     ` Brian Candler
  2012-03-14 22:48     ` Peter Grandi
  0 siblings, 2 replies; 12+ messages in thread
From: troby @ 2012-03-14 17:43 UTC (permalink / raw)
  To: xfs


/dev/sdb1  /data  xfs  defaults,logdev=/dev/sda3,logbsize=256k,logbufs=8,largeio,nobarrier

meta-data=/dev/sdb1              isize=256    agcount=32, agsize=251772920 blks
         =                       sectsz=4096  attr=0
data     =                       bsize=4096   blocks=8056733408, imaxpct=2
         =                       sunit=2      swidth=24 blks, unwritten=1
naming   =version 2              bsize=4096
log      =external               bsize=4096   blocks=16000, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0


Mongo pre-allocates its datafiles and zero-fills them (there is a short
header at the start of each, not rewritten as far as I know)  and then
writes to them sequentially, wrapping around when it hits the end. In this
case the entire load is inserts, no updates, hence the sequential writes.
The data will not wrap around for about 6 months, at which time old files
will be overwritten starting from the beginning. The BBU is functioning and
the cache is set to write-back. The files are memory-mapped, I'll check
whether fsync is used. Flushing is done about every 30 seconds and takes
about 8 seconds.

One thing I'm wondering is whether the incorrect stripe structure I
specified with mkfs is actually written into the file system structure or
effectively just a hint to the kernel for what to use for a write size. If
not, could I specify the correct stripe width in the mount options and
override the incorrect width used by mkfs? Since the current average write
size is only about half the specified stripe size, and since I'm not using
md or xfs v.3, it seems the kernel is ignoring it for now.
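For what it's worth, XFS does accept sunit/swidth as mount options, given in 512-byte sectors; a sketch of computing them for an 8K-chunk, 11-data-disk layout (the mount command is echoed rather than run, and `/data` is the mount point from the fstab above; whether a remount actually picks the values up depends on the kernel):

```shell
chunk_kb=8                        # RAID chunk per drive
data_disks=11                     # 12-disk RAID5 = 11 data + 1 parity

sunit=$((chunk_kb * 1024 / 512))  # sectors per chunk
swidth=$((sunit * data_disks))    # sectors per full data stripe

echo "mount -o remount,sunit=${sunit},swidth=${swidth} /data"
```

This works out to sunit=16, swidth=176 sectors, i.e. the geometry of the hardware rather than the (incorrect) one baked in at mkfs time.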


Stan Hoeppner wrote:
> 
> On 3/13/2012 6:21 PM, troby wrote:
> 
>> Short of recreating the filesystem with the correct stripe width, would
>> it
>> make sense to change the mount options to define a stripe width that
>> actually matches either the filesystem (11 stripe elements wide) or the
>> hardware (12 stripe elements wide)? Is there a danger of filesystem
>> corruption if I give fstab a mount geometry that doesn't match the values
>> used at filesystem creation time?
> 
> What would make sense is for you to first show
> 
> $ cat /etc/fstab
> $ xfs_info /dev/raid_device_name
> 
> before we recommend any changes.
> 
>> I'm unclear on the role of the RAID hardware cache in this. Since the
>> writes
>> are sequential, 
> 
> This seems to be an assumption at odds with other information you've
> provided.
> 
>> and since the volume of data written is such that it would
>> take about 3 minutes to actually fill the RAID cache, 
> 
> The PERC 700 operates in write-through cache mode if no BBU is present
> or the battery is degraded or has failed.  You did not state whether
> your PERC 700 has the BBU installed.  If not, you can increase write
> performance and decrease latency pretty substantially by adding the BBU
> which enables the write-back cache mode.
> 
> You may want to check whether MongoDB uses fsync writes by default.  If
> it does, and you don't have the BBU and write-back cache, this is
> affecting your write latency and throughput as well.
> 
>> I would think the data
>> would be resident in the cache long enough to assemble a full-width
>> stripe
>> at the hardware level and avoid the 4 I/O RAID5 penalty. 
> 
> Again, write-back-cache is only enabled with BBU on the PERC 700.  Do
> note that achieving full stripe width writes is as much a function of
> your application workload and filesystem tuning as it is the RAID
> firmware, especially if the cache is in write-through mode, in which
> case the firmware can't do much, if anything, to maximize full width
> stripes.
> 
> And keep in mind you won't hit the parity read-modify-write penalty on
> new stripe writes.  This only happens when rewriting existing stripes.
> Your reported 50ms of latency for 100KB write IOs seems to suggest you
> don't have the BBU installed and you're actually doing RMW on existing
> stripes, not strictly new stripe writes.  This is likely because...
> 
> As an XFS filesystem gets full (you're at ~87%), file blocks may begin
> to be written into free space within existing partially occupied RAID
> stripes.  This is where the RAID5/6 RMW penalty really kicks you in the
> a$$, especially if you have misaligned the filesystem geometry to the
> underlying RAID geometry.
> 
> -- 
> Stan
> 
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs
> 
> 

-- 
View this message in context: http://old.nabble.com/How-to-deal-with-XFS-stripe-geometry-mismatch-with-hardware-RAID5-tp33498437p33504048.html
Sent from the Xfs - General mailing list archive at Nabble.com.


* Re: How to deal with XFS stripe geometry mismatch with hardware RAID5
  2012-03-14  7:37 ` Brian Candler
  2012-03-14  7:52   ` Brian Candler
  2012-03-14 15:41   ` Peter Grandi
@ 2012-03-14 17:53   ` troby
  2 siblings, 0 replies; 12+ messages in thread
From: troby @ 2012-03-14 17:53 UTC (permalink / raw)
  To: xfs


The choice of RAID5 was a compromise due to the need to store 30TB of data on
each of 2 systems (a master and a replicated slave) - we couldn't afford
that much space on our SAN for this application, but we could afford a
12-bay system with 3TB SATA drives. My hope was that since the write pattern
was expected to be large sequential writes with no updates that the RAID5
penalty would not be significant. And it's quite possible that would be the
case if I had got the stripe width right. The 8K element size was chosen
because the actual average request size I was seeing on previous
installations of the database was around 60K, which is still smaller than
the stripe width over 12 drives even using 8K. I did try btrfs early on to
take advantage of compression, but it failed. This was about six months ago,
though.


Brian Candler wrote:
> 
> On Tue, Mar 13, 2012 at 04:21:07PM -0700, troby wrote:
>> there is very little metadata activity). When I created the filesystem I
>> (mistakenly) believed the stripe width of the filesystem should count all
>> 12
>> drives rather than 11. I've seen some opinions that this is correct, but
>> a
>> larger number which have convinced me that it is not.
> 
> With a 12-disk RAID5 you have 11 data disks, so the optimal filesystem
> alignment is 11 x stripe size.  This is auto-detected for software (md)
> raid, but may or may not be for hardware RAID controllers.
> 
> For example, here is a 12-disk RAID6 md array (10 data, 2 parity):
> 
> $ cat /proc/mdstat
> Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5]
> [raid4]
> [raid10] 
> md127 : active raid6 sdf[4] sdb[0] sdh[6] sdj[8] sdi[7] sdd[2] sdm[10]
> sdk[9] sdc[1] sdl[11] sdg[5] sde[3]
>       29302654080 blocks super 1.2 level 6, 64k chunk, algorithm 2 [12/12]
> [UUUUUUUUUUUU]
>       
> And here is the XFS filesystem which was created on it:
> 
> $ xfs_info /dev/md127
> meta-data=/dev/md127             isize=256    agcount=32, agsize=228926992
> blks
>          =                       sectsz=512   attr=2
> data     =                       bsize=4096   blocks=7325663520, imaxpct=5
>          =                       sunit=16     swidth=160 blks
> naming   =version 2              bsize=16384  ascii-ci=0
> log      =internal               bsize=4096   blocks=521728, version=2
>          =                       sectsz=512   sunit=16 blks, lazy-count=1
> realtime =none                   extsz=4096   blocks=0, rtextents=0
> 
> The parameters were detected automatically. sunit=16 x 4K = 64K, swidth=
> 160 x 4K = 640K.
> 
>> I also set up the RAID
>> BIOS to use a small stripe element of 8KB per drive, based on the I/O
>> request size I was seeing at the time in previous installations of the
>> same
>> application, which was generally doing writes around 100KB.
> 
> I'd say this is almost guaranteed to give poor performance, because there
> will always be partial stripe write if you are doing random writes.  e.g. 
> consider the best case, which is when the 100KB is aligned with the start
> of
> the stripe.  You will have:
> 
> - an 88KB write across the whole stripe
>   - 12 disks seek and write; this will take a whole revolution before
>     it completes on every drive, i.e. 8.3ms rotational latency, in
> addition
>     to seek time. The transfer time will be insignificant
>   - one tiny write
> - a 12KB write across a partial stripe. This will involve an 8K write to
> block
>   A, a 4K read of block B and block P (parity), and a 4K write of block B
>   and block P.
> 
> Now consider what it would have been with a 256KB stripe size. If you're
> lucky and the whole 100K fits within a chunk, you'll have:
> 
> - read 100K from block A and block P
> - write 100K to block A and block P
> 
> There is less rotational latency, only slightly higher transfer time
> (for a slow drive which does 100MB/sec, 100KB will take 1ms), and will
> allow
> concurrent writers in the same area of disk, and much faster access if
> there
> are concurrent readers of those 100K chunks.
> 
> The performance will still suck however, compared to RAID10.
> 
>> I'm unclear on the role of the RAID hardware cache in this. Since the
>> writes
>> are sequential, and since the volume of data written is such that it
>> would
>> take about 3 minutes to actually fill the RAID cache, I would think the
>> data
>> would be resident in the cache long enough to assemble a full-width
>> stripe
>> at the hardware level and avoid the 4 I/O RAID5 penalty. 
> 
> Only if you're writing sequentially. For example, if you were untarring a
> huge tar file containing 100KB files, all in the same directory, XFS can
> allocate the extents one after the other, and so you will be doing pure
> stripe writes.
> 
> But for *random* I/O, which I'm pretty sure is what mongodb will be doing,
> you won't have a chance. The controller will be forced to read the
> existing
> data and parity blocks so it can write back the updated parity.
> 
> So the conclusion is: do you actually care about performance for this
> application?  If you do, I'd say don't use RAID5.  If you absolutely must
> use parity RAID then go buy a Netapp ($$$) or experiment with btrfs
> (risky). 
> The cost of another 10 disks for a RAID10 array is going to be small in
> comparison.
> 
> Regards,
> 
> Brian.
> 
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs
> 
> 

-- 
View this message in context: http://old.nabble.com/How-to-deal-with-XFS-stripe-geometry-mismatch-with-hardware-RAID5-tp33498437p33504119.html
Sent from the Xfs - General mailing list archive at Nabble.com.


* Re: How to deal with XFS stripe geometry mismatch with hardware RAID5
  2012-03-14 17:43   ` troby
@ 2012-03-14 21:05     ` Brian Candler
  2012-03-14 23:21       ` troby
  2012-03-14 22:48     ` Peter Grandi
  1 sibling, 1 reply; 12+ messages in thread
From: Brian Candler @ 2012-03-14 21:05 UTC (permalink / raw)
  To: troby; +Cc: xfs

On Wed, Mar 14, 2012 at 10:43:44AM -0700, troby wrote:
> Mongo pre-allocates its datafiles and zero-fills them (there is a short
> header at the start of each, not rewritten as far as I know)  and then
> writes to them sequentially, wrapping around when it hits the end. In this
> case the entire load is inserts, no updates, hence the sequential writes.
> The data will not wrap around for about 6 months, at which time old files
> will be overwritten starting from the beginning. The BBU is functioning and
> the cache is set to write-back. The files are memory-mapped, I'll check
> whether fsync is used. Flushing is done about every 30 seconds and takes
> about 8 seconds.

How much data has been added to mongodb in those 30 seconds?

If everything really was being written sequentially then I reckon you could
write about 6.6GB in that time (11 disks x 75MB/sec x 8 seconds). From your
posting I suspect you are not achieving that level of performance :-)

If it really is being written sequentially to a contiguous file then the
stripe alignment won't make any difference, because this is just a big
pre-allocated file, and XFS will do its best to give one big contiguous
chunk of space for it.

Anyway, you don't need to guess these things; you can easily find out.

(1) Is the file preallocated and contiguous, or fragmented?

    # xfs_bmap /path/to/file

This will show you if you get one huge extent. If you get a number of large
extents (say 100MB+) that would be fine for performance too.  If you get
lots of shrapnel then there's a problem.

(2) Are you really writing sequentially?

    # btrace /dev/whatever | grep ' [DC] '

This will show you block requests dispatched [D] and completed [C] to the
controller.

And at a higher level:

    # strace -p <pid-of-mongodb-process>

will show you the seek/write/read operations that the application is
performing.

Once you have the answers to those, you can make a better judgement as to
what's happening.

(3) One other thing to check:

cat /sys/block/xxx/bdi/read_ahead_kb
cat /sys/block/xxx/queue/max_sectors_kb

Increasing those to 1024 (echo 1024 > ....) may make some improvement.

> One thing I'm wondering is whether the incorrect stripe structure I
> specified with mkfs is actually written into the file system structure

I am guessing that things like chunks of inodes are probably stripe-aligned.
But if you're really writing sequentially to a huge contiguous file then it
won't matter anyway.

Regards,

Brian.


* Re: How to deal with XFS stripe geometry mismatch with hardware RAID5
  2012-03-14 17:43   ` troby
  2012-03-14 21:05     ` Brian Candler
@ 2012-03-14 22:48     ` Peter Grandi
  1 sibling, 0 replies; 12+ messages in thread
From: Peter Grandi @ 2012-03-14 22:48 UTC (permalink / raw)
  To: Linux fs XFS

>>>> I have a 30TB XFS filesystem created on CentOS 5.4 X86_64,
>>>> kernel 2.6.39, using xfsprogs 2.9.4. The underlying hardware
>>>> is 12 3TB SATA drives on a Dell PERC 700 controller with 1GB
>>>> cache. [ ... ]

>>>> [ ... ] set up the RAID BIOS to use a small stripe element
>>>> of 8KB per drive, [ ... ] The filesystem contains a MongoDB
>>>> installation consisting of roughly 13000 2GB files which are
>>>> already allocated. The application is almost exclusively
>>>> inserting data, there are no updates, and files are written
>>>> pretty much sequentially. [ ... ]

How many of the 13,000 are being written at roughly at the same
time? Because if you are logging 100K to each of them all the
time, that is a heavily random access workload. Each file may be
written sequentially, but the *disk* would be subject to a storm
of seeks.

>>>> When I set up the fstab entry I believed that it would
>>>> inherit the stripe geometry automatically, however now I
>>>> understand that is not the case with XFS version 2.

'mkfs.xfs' asks the kernel about drive geometry. If the kernel
could read it off the PERC 700 it would have been fine. The
kernel can easily read geometry off MD etc. RAID sets because the
relevant info is already in the system state.

>>>> What I'm seeing now is average request sizes which are about
>>>> 100KB, half the stripe size.

But writes from what to what? From Linux to the PERC 700 cache or
from the PERC 700 cache to the RAID set drives?

>>>> With a typical write volume around 5MB per second I am
>>>> getting wait times around 50ms, which appears to be
>>>> degrading performance. [ ... ]

5MB per second in aggregate is hardly worth worrying about.
What do the 50ms mean as wait times? Again, it matters a great
deal whether it is Linux->PERC or PERC->drives.

If you have barriers enabled, and the MongoDB is 'fsync'ing every
100K, then 100K will be the transaction size.

Also, with a 100K append size, and 5MB/s aggregate, you are doing
50 transactions/s and it matters a great deal whether all are to
the same file, sequentially, or each is to a different file, etc.
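That back-of-envelope rate works out as follows (a sketch, treating 5MB/s as roughly 5000KB/s):

```shell
rate_kb_per_s=5000   # ~5 MB/s aggregate write volume
append_kb=100        # typical append/transaction size

echo "$((rate_kb_per_s / append_kb)) transactions/s"
```

which gives the ~50 transactions/s figure above.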

>>>> [ ... ] Is there a danger of filesystem corruption if I give
>>>> fstab a mount geometry that doesn't match the values used at
>>>> filesystem creation time?

No, those values are purely advisory.

>>>> I'm unclear on the role of the RAID hardware cache in
>>>> this. Since the writes are sequential, and since the volume
>>>> of data written is such that it would take about 3 minutes
>>>> to actually fill the RAID cache, I would think the data
>>>> would be resident in the cache long enough to assemble a
>>>> full-width stripe at the hardware level and avoid the 4 I/O
>>>> RAID5 penalty.

Sure, if the cache is configured right and barriers are not
invoked every 100KiB.

[ ... ]

> Mongo pre-allocates its datafiles and zero-fills them (there is
> a short header at the start of each, not rewritten as far as I
> know) and then writes to them sequentially, wrapping around
> when it hits the end.

Preallocating is good.

> In this case the entire load is inserts, no updates, hence the
> sequential writes.

So it is not random access if it is a log-like operation; if it
is a lot of 100K appends, things look a lot better.

> [ ... ] The BBU is functioning and the cache is set to
> write-back.

That's good. Check whether XFS has barriers enabled, and it might
help to make sure that the host adapter really knows the geometry
of the RAID set; if there is a parameter for how much unwritten
data to buffer, set it high, to maximize the chances that the
controller does what it should and issues whole-stripe writes.
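A quick way to check the barrier setting; on these kernels XFS
mounts with barriers on by default, and they only show up in the
mount options when disabled (a sketch; assumes no xfs filesystem
is mounted with "nobarrier" on this host):

```shell
# Barriers are the XFS default; "nobarrier" in /proc/mounts is the
# only visible sign that they have been turned off.
if grep ' xfs ' /proc/mounts 2>/dev/null | grep -q nobarrier; then
  barriers=disabled
else
  barriers=enabled
fi
echo "barriers: $barriers"
```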

> [ ... ] Flushing is done about every 30 seconds and takes
> about 8 seconds.

I usually prefer nearly continuous flushing (at the Linux level
too, in particular), in part to avoid the 8s pauses, even if
that partly defeats the XFS delayed allocation logic.

However there is a contradiction here between seeing 100K
transactions and a flush taking 8s at a write rate of 5MB/s
every 30s: the latter would imply 40MB of writes every 30s.
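The mismatch in the two reported figures, in numbers:

```shell
# 50 transactions/s of 100K each vs. an 8s flush at 5MB/s every 30s
tx_per_s=50; tx_kb=100
flush_s=8;  rate_mb_s=5
echo "steady write rate: $((tx_per_s * tx_kb / 1000)) MB/s"
echo "implied flush burst: $((flush_s * rate_mb_s)) MB per 30s"
```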

> One thing I'm wondering is whether the incorrect stripe
> structure I specified with mkfs

The incorrect stripe structure here is probably not that
important; there are bigger factors at play.

> is actually written into the file system structure or
> effectively just a hint to the kernel for what to use for a
> write size.

The stripe parameters have static and dynamic effects:

  static
    - The metadata allocator attempts to interleave metadata at
      chunk ('sunit') boundaries to parallelize access.
    - The data allocator attempts to allocate extents on stripe
      ('swidth') aligned boundaries to maximize the chances of
      doing stripe aligned IO.
    These allocations are aligned according to the stripe
    parameters current when the metadata and data extents were
    allocated.
 
  dynamic
    - The block IO bottom end attempts to generate bulk IO
      requests aligned on stripe boundaries.
    These requests are aligned according to the stripe
    parameters current at the moment the IO occurs. The metadata
    and data extents may well have been allocated with alignment
    different from that on which IO requests are aligned.
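The statically recorded parameters can be read back from a
mounted filesystem; a sketch, with a hypothetical /data mount
point and abridged output:

```shell
# xfs_info reports the sunit/swidth recorded at mkfs time, in
# filesystem blocks, e.g. (illustrative, abridged):
#
#   xfs_info /data
#   data  = bsize=4096 ...  sunit=2  swidth=24 blks
#
# With 4KiB blocks, sunit=2 is the 8KiB chunk; swidth=24 would
# reflect the mistaken 12-disk count, while 11 data disks give:
echo "swidth for 11 data disks: $((8 * 11 / 4)) blocks"
```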

> If not, could I specify the correct stripe width in the mount
> options and override the incorrect width used by mkfs?

Sure, but the space already allocated is on the "wrong"
boundaries, even if XFS will supposedly try to issue IOs on the
as-mounted stripe alignment.
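As mount options, sunit/swidth are given in 512-byte units, so an
8KiB chunk across 11 data disks would look like this in fstab (a
sketch; device and mount point are hypothetical):

```shell
# Example fstab line for overriding stripe geometry at mount time:
#
#   /dev/sdX1  /data  xfs  noatime,sunit=16,swidth=176  0 0
#
# The unit conversion from KiB chunks to 512-byte sectors:
chunk_kib=8; data_disks=11
echo "sunit=$((chunk_kib * 2)),swidth=$((chunk_kib * 2 * data_disks))"
```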

> Since the current average write size is only about half the
> specified stripe size, and since I'm not using md or xfs v.3
> it seems the kernel is ignoring it for now.

All the kernel does is to upload a bunch of blocks to the PERC,
and all the RAID optimization is done by the PERC.

> The choice of RAID5 was a compromise due to the need to store
> 30TB of data on each of 2 systems (a master and a replicated
> slave) - we couldn't afford that much space on our SAN for this
> application, but we could afford a 12-bay system with 3TB SATA
> drives.

Still, an 11+1 RAID5 is a brave option to take.

> My hope was that since the write pattern was expected to be
> large sequential writes with no updates that the RAID5 penalty
> would not be significant.

That was a reasonable hope, but 11+1 RAID5 has other downsides.

> And it's quite possible that would be the case if I had got the
> stripe width right.

Uh, I suspect that stripe alignment here is not that important.
That 50ms after 100K sounds much, much worse than RMW. On 15k
drives 50ms is about 4-6 seek times, which is way more than RMW
would take.

> The 8K element size was chosen because the actual average
> request size I was seeing on previous installations of the
> database was around 60K, which is still smaller than the stripe
> width over 12 drives even using 8K.

That is not necessarily the right logic, but for bulk sequential
transfers a small chunk size is a good idea; in general, other
things being equal, the smaller the chunk and the stripe size
the better.
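Why a small chunk helps here, in numbers: with an 8KiB chunk the
full 11-disk stripe is 88KiB, so a typical ~60K request spreads
across most of the disks instead of hammering one:

```shell
# Full stripe width and how many chunks a 60KiB request touches:
chunk_kib=8; data_disks=11
echo "full stripe: $((chunk_kib * data_disks)) KiB"
echo "chunks touched by a 60KiB request: $(( (60 + chunk_kib - 1) / chunk_kib ))"
```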

> I did try btrfs early on to take advantage of compression, but
> it failed. This was about six months ago, though.

"failed" sounds a bit strange, and note that BTRFS has much
larger overheads than other filesystems. But your applications
seems ideal for XFS. Instead of using some weird kernel like
2.6.39 with EL5, you might want to try an "official" EL5 kernel
like the Oracle 2.6.32 one, or switch to EL6/CentOS6.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs


* Re: How to deal with XFS stripe geometry mismatch with hardware RAID5
  2012-03-14 21:05     ` Brian Candler
@ 2012-03-14 23:21       ` troby
  2012-03-15  0:31         ` Peter Grandi
  0 siblings, 1 reply; 12+ messages in thread
From: troby @ 2012-03-14 23:21 UTC (permalink / raw)
  To: xfs




Brian Candler wrote:
> 
> On Wed, Mar 14, 2012 at 10:43:44AM -0700, troby wrote:
>> Mongo pre-allocates its datafiles and zero-fills them (there is a short
>> header at the start of each, not rewritten as far as I know)  and then
>> writes to them sequentially, wrapping around when it hits the end. In
>> this
>> case the entire load is inserts, no updates, hence the sequential writes.
>> The data will not wrap around for about 6 months, at which time old files
>> will be overwritten starting from the beginning. The BBU is functioning
>> and
>> the cache is set to write-back. The files are memory-mapped, I'll check
>> whether fsync is used. Flushing is done about every 30 seconds and takes
>> about 8 seconds.
> 
> How much data has been added to mongodb in those 30 seconds?
> 
>    typically 2.5 MB
> 
> If everything really was being written sequentially then I reckon you
> could
> write about 6.6GB in that time (11 disks x 75MB/sec x 8 seconds). From
> your
> posting I suspect you are not achieving that level of performance :-)
> 
> If it really is being written sequentially to a continguous file then the
> stripe alignment won't make any difference, because this is just a big
> pre-allocated file, and XFS will do its best to give one big contiguous
> chunk of space for it.
> 
> Anwyay, you don't need to guess these things, you can easily find out.
> 
> (1) Is the file preallocated and contiguous, or fragmented?
> 
>     # xfs_bmap /path/to/file
> 
> All seem to have a single extent:
> this is a currently active file:
> lfs.303:
>         0: [0..4192255]: 36322376672..36326568927
> 
> this is an old file:
> lfs.3:
>         0: [0..1048575]: 2039336992..2040385567
> 
> 
> 
> This will show you if you get one huge extent. If you get a number of
> large
> extents (say 100MB+) that would be fine for performance too.  If you get
> lots of shrapnel then there's a problem.
> 
> (2) Are you really writing sequentially?
> 
>     # btrace /dev/whatever | grep ' [DC] '
> 
> This will show you block requests dispatched [D] and completed [C] to the
> controller.
> 
> I'm not familiar with the btrace output, but here's the summary of roughly
> 5 minutes:
> 
> Total (8,16):
>  Reads Queued:      16,914,    1,888MiB  Writes Queued:      47,147,   
> 1,438MiB
>  Read Dispatches:   16,914,    1,888MiB  Write Dispatches:   47,050,   
> 1,438MiB
>  Reads Requeued:         0               Writes Requeued:         0
>  Reads Completed:   16,914,    1,888MiB  Writes Completed:   47,050,   
> 1,438MiB
>  Read Merges:            0,        0KiB  Write Merges:           97,     
> 592KiB
>  IO unplugs:        17,060               Timer unplugs:           6
> 
> Throughput (R/W): 5,528KiB/s / 4,209KiB/s
> Events (8,16): 418,873 entries
> Skips: 0 forward (0 -   0.0%)
> 
> 
> And here is some of the detail:
> 
> 8,16   0     2251     7.674877079  5364  C   R 42376096952 + 256 [0]
>   8,16   0     2252     7.675031410  5364  C   R 4046119976 + 256 [0]
>   8,16   0     2259     7.689553858  5364  D   R 4046120232 + 256 [mongod]
>   8,16   0     2260     7.689812456  5364  C   R 4046120232 + 256 [0]
>   8,16   0     2267     7.690973707  5364  D   R 42376097208 + 256
> [mongod]
>   8,16   0     2268     7.691225467  5364  C   R 42376097208 + 256 [0]
>   8,16   0     2275     7.699438100  5364  D   R 21964732520 + 256
> [mongod]
>   8,16   0     2276     7.699688313     0  C   R 21964732520 + 256 [0]
>   8,16   0     2283     7.700493875  5364  D   R 4046120488 + 256 [mongod]
>   8,16   0     2284     7.700749134  5364  C   R 4046120488 + 256 [0]
>   8,16   0     2291     7.703460687  5364  D   R 42376097464 + 256
> [mongod]
>   8,16   0     2292     7.703707154  5364  C   R 42376097464 + 256 [0]
>   8,16   2      928     7.730573720  5364  D   R 21964760296 + 256
> [mongod]
>   8,16   0     2293     7.747651477     0  C   R 21964760296 + 256 [0]
>   8,16   0     2300     7.754517529  5364  D   R 4046120744 + 256 [mongod]
>   8,16   0     2301     7.754781549  5364  C   R 4046120744 + 256 [0]
>   8,16   0     2308     7.760712917  5364  D   R 42376097720 + 256
> [mongod]
>   8,16   0     2309     7.761392841  5364  C   R 42376097720 + 256 [0]
>   8,16   2      935     7.769193162  5597  D   R 4046121000 + 256 [mongod]
>   8,16   0     2310     7.769458041     0  C   R 4046121000 + 256 [0]
>   8,16   2      942     7.773021214  5597  D   R 42376097976 + 256
> [mongod]
>   8,16   0     2311     7.773290126     0  C   R 42376097976 + 256 [0]
>   8,16   2      949     7.780080336  5597  D   R 4046121256 + 256 [mongod]
>   8,16   0     2312     7.780346410     0  C   R 4046121256 + 256 [0]
>   8,16   2      956     7.808903046  5597  D   R 42376098232 + 256
> [mongod]
>   8,16   0     2313     7.809197289     0  C   R 42376098232 + 256 [0]
>   8,16   2      963     7.816907787  5597  D   R 4046121512 + 256 [mongod]
>   8,16   0     2314     7.817182676     0  C   R 4046121512 + 256 [0]
>   8,16   2      970     7.827457411  5597  D   R 42376098488 + 256
> [mongod]
>   8,16   0     2315     7.827730410     0  C   R 42376098488 + 256 [0]
>   8,16   0     2316     7.833225453     0  C   R 4046121768 + 256 [0]
>   8,16   1     2410     7.844128616 37922  D   W 60216121432 + 80
> [flush-8:16]
>   8,16   1     2411     7.844140476 37922  D   W 60216121528 + 256
> [flush-8:16]
>   8,16   1     2412     7.844145438 37922  D   W 60216121784 + 256
> [flush-8:16]
>   8,16   1     2413     7.844149939 37922  D   W 60216122040 + 256
> [flush-8:16]
>   8,16   1     2414     7.844154486 37922  D   W 60216122296 + 256
> [flush-8:16]
>   8,16   1     2415     7.844159104 37922  D   W 60216122552 + 256
> [flush-8:16]
>   8,16   1     2416     7.844163489 37922  D   W 60216122808 + 256
> [flush-8:16]
>   8,16   1     2417     7.844169195 37922  D   W 60216123064 + 256
> [flush-8:16]
>   8,16   1     2418     7.844173666 37922  D   W 60216123320 + 256
> [flush-8:16]
>   8,16   1     2419     7.844178182 37922  D   W 60216123576 + 208
> [flush-8:16]
>   8,16   1     2420     7.844182518 37922  D   W 60216123800 + 256
> [flush-8:16]
>   8,16   1     2421     7.844186886 37922  D   W 60216124056 + 256
> [flush-8:16]
>   8,16   1     2422     7.844191572 37922  D   W 60216124312 + 256
> [flush-8:16]
>   8,16   1     2423     7.844195825 37922  D   W 60216124568 + 256
> [flush-8:16]
>   8,16   1     2424     7.844200405 37922  D   W 60216124824 + 256
> [flush-8:16]
>   8,16   1     2425     7.844205039 37922  D   W 60216125080 + 256
> [flush-8:16]
>   8,16   1     2426     7.844209304 37922  D   W 60216125336 + 256
> [flush-8:16]
>   8,16   1     2427     7.844213483 37922  D   W 60216125592 + 256
> [flush-8:16]
>   8,16   1     2428     7.844217895 37922  D   W 60216125848 + 256
> [flush-8:16]
>   8,16   1     2429     7.844222295 37922  D   W 60216126104 + 256
> [flush-8:16]
>   8,16   1     2430     7.844226651 37922  D   W 60216126360 + 256
> [flush-8:16]
>   8,16   1     2431     7.844230959 37922  D   W 60216126616 + 256
> [flush-8:16]
>   8,16   1     2432     7.844235575 37922  D   W 60216126872 + 256
> [flush-8:16]
>   8,16   1     2433     7.844239866 37922  D   W 60216127128 + 256
> [flush-8:16]
>   8,16   1     2434     7.844244274 37922  D   W 60216127384 + 256
> [flush-8:16]
>   8,16   1     2435     7.844249817 37922  D   W 60216127640 + 256
> [flush-8:16]
>   8,16   1     2436     7.844254266 37922  D   W 60216127896 + 256
> [flush-8:16]
>   8,16   1     2437     7.844258706 37922  D   W 60216128152 + 256
> [flush-8:16]
>   8,16   1     2438     7.844263213 37922  D   W 60216128408 + 256
> [flush-8:16]
>   8,16   1     2439     7.844267570 37922  D   W 60216128664 + 256
> [flush-8:16]
> 
> 
> And at a higher level:
> 
>     # strace -p <pid-of-mongodb-process>
> 
> will show you the seek/write/read operations that the application is
> performing.
> 
> Once you have the answers to those, you can make a better judgement as to
> what's happening.
> 
> (3) One other thing to check:
> 
> cat /sys/block/xxx/bdi/read_ahead_kb
> cat /sys/block/xxx/queue/max_sectors_kb
> 
> Increasing those to 1024 (echo 1024 > ....) may make some improvement.
> 
>     They were 128 - I increased the first, but trying to write the second
> gave me a write error.
> 
>> One thing I'm wondering is whether the incorrect stripe structure I
>> specified with mkfs is actually written into the file system structure
> 
> I am guessing that probably things like chunks of inodes are
> stripe-aligned. 
> But if you're really writing sequentially to a huge contiguous file then
> it
> won't matter anyway.
> 
> Regards,
> 
> Brian.
> 




* Re: How to deal with XFS stripe geometry mismatch with hardware RAID5
  2012-03-13 23:21 How to deal with XFS stripe geometry mismatch with hardware RAID5 troby
  2012-03-14  7:37 ` Brian Candler
  2012-03-14  8:36 ` Stan Hoeppner
@ 2012-03-14 23:22 ` Peter Grandi
  2 siblings, 0 replies; 12+ messages in thread
From: Peter Grandi @ 2012-03-14 23:22 UTC (permalink / raw)
  To: Linux fs XFS

> I have a 30TB XFS filesystem created on CentOS 5.4 X86_64,
> kernel 2.6.39, using xfsprogs 2.9.4. [ ... ] The filesystem
> contains a MongoDB installation consisting of roughly 13000
> 2GB files which are already allocated. [ ... ]

BTW, while 30TB is probably still in the realm of the plausible,
if excessively large, and this is a 64-bit system, this is a good
example of a gratuitously large filetree and an excessively wide
RAID set under it.

  http://www.sabi.co.uk/blog/0805may.html#080516

A large single filetree only makes sense if one needs a large
and unified free space pool... But in this application all files
are essentially independent and preallocated.

A more manageable setup might have been a set of 4-8TB
filetrees, for example on a group of 2+1 or even 4+1 RAID5 sets
of 3TB drives (split into two partitions).



* Re: How to deal with XFS stripe geometry mismatch with hardware RAID5
  2012-03-14 23:21       ` troby
@ 2012-03-15  0:31         ` Peter Grandi
  0 siblings, 0 replies; 12+ messages in thread
From: Peter Grandi @ 2012-03-15  0:31 UTC (permalink / raw)
  To: Linux fs XFS

[ ... ]

>> lfs.303:
>> 0: [0..4192255]: 36322376672..36326568927
[ ... ]
>> lfs.3:
>> 0: [0..1048575]: 2039336992..2040385567

  $ factor 36322376672 2039336992
  36322376672: 2 2 2 2 2 37 3257 9419
  2039336992: 2 2 2 2 2 7 11 37 22369
  $ factor 4192256 1048576
  4192256: 2 2 2 2 2 2 2 2 2 2 2 23 89
  1048576: 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2

The starting addresses look to be 16KiB aligned (starting
sector a multiple of 2^5 512B sectors); I would have expected
something different. The sizes, 2047MiB and 512MiB, are
multiples of 1MiB, which is plausible.
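The same conclusion without factoring by hand: a start sector is
16KiB aligned if it divides by 32 (512B sectors), and extent
lengths in sectors convert to MiB by dividing by 2048:

```shell
# Alignment and size check on the xfs_bmap figures quoted above:
for start in 36322376672 2039336992; do
  echo "start $start mod 32 = $((start % 32))"
done
for len in 4192256 1048576; do
  echo "length $len sectors = $((len / 2048)) MiB"
done
```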

[ ... ]

> I'm not familiar with the btrace output, but here's the summary of roughly
> 5 minutes:

>> Total (8,16):
>> Reads Queued:      16,914,    1,888MiB  Writes Queued:      47,147,   1,438MiB
>> Read Dispatches:   16,914,    1,888MiB  Write Dispatches:   47,050,   1,438MiB
>> Reads Requeued:         0               Writes Requeued:         0
>> Reads Completed:   16,914,    1,888MiB  Writes Completed:   47,050,   1,438MiB
>> Read Merges:            0,        0KiB  Write Merges:           97,   592KiB
>> IO unplugs:        17,060               Timer unplugs:           6
>> Throughput (R/W): 5,528KiB/s / 4,209KiB/s
>> Events (8,16): 418,873 entries
>> Skips: 0 forward (0 -   0.0%)

That's around 17k reads, or 60/s, each of about 100K, and 47k
writes, or 160/s, averaging 31K. Both reads and writes happen at
around 4-5MB/s.

Since the RAID5 is managed by the PERC, the reads cannot be
those of RMW, and it is unlikely that they are sequential with
the writes. There may be quite a bit of random access going on.
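How those per-second figures fall out of the btrace totals,
assuming the stated ~5 minute (300s) window:

```shell
# Back-of-envelope rates from the btrace summary quoted above:
reads=16914;  read_mib=1888
writes=47050; write_mib=1438
secs=300
echo "reads:  $((reads / secs))/s, avg $((read_mib * 1024 / reads)) KiB"
echo "writes: $((writes / secs))/s, avg $((write_mib * 1024 / writes)) KiB"
```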



end of thread, other threads:[~2012-03-15  0:31 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2012-03-13 23:21 How to deal with XFS stripe geometry mismatch with hardware RAID5 troby
2012-03-14  7:37 ` Brian Candler
2012-03-14  7:52   ` Brian Candler
2012-03-14 15:41   ` Peter Grandi
2012-03-14 17:53   ` troby
2012-03-14  8:36 ` Stan Hoeppner
2012-03-14 17:43   ` troby
2012-03-14 21:05     ` Brian Candler
2012-03-14 23:21       ` troby
2012-03-15  0:31         ` Peter Grandi
2012-03-14 22:48     ` Peter Grandi
2012-03-14 23:22 ` Peter Grandi
