* Useful benchmarking tools for RAID
@ 2008-03-12 22:27 Bryan Mark Mesich
  2008-03-13 15:26 ` michael
  2008-03-13 20:41 ` Peter Grandi
  0 siblings, 2 replies; 7+ messages in thread
From: Bryan Mark Mesich @ 2008-03-12 22:27 UTC (permalink / raw)
  To: Linux RAID

Good afternoon,

I've been sitting back quietly reading posts over the past few months
regarding RAID performance.  My ultimate goal is to increase the
performance of our IMAP mail servers that have storage on top of RAID 5.
During peak times of the day, a single IMAP box might have 500+ imapd
processes running simultaneously.  As a result, the load increases, as
does the users' blood pressure.

I'm currently testing with the following:

Intel SE7520BD2 motherboard
(2) 3Ware PCI-E 9550SX 8 port SATA card
1 GB of memory
(2) Core2Duo 3.0GHz
(16) Seagate 750GB Barracuda ES drives
RHEL 5.1 server (stock 2.6.18)

I've setup 3 RAID5 arrays arranged in a 3+1 layout.  I created them with
different chunk sizes (64k, 128k, and 256k) for testing purposes.
Write-caching has been disabled (no battery) on the 3Ware cards and I'm
using ext3 as my filesystem.  When creating the filesystems, I used
sensible stride sizes and disabled directory indexing.
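
For example, for the 64k-chunk array the filesystem creation looked
roughly like this (device name illustrative; stride = chunk size /
ext3 block size = 64k / 4k = 16, with dir_index turned off):

  # 4k blocks, RAID-aware stride, no directory indexing
  mkfs.ext3 -b 4096 -E stride=16 -O ^dir_index /dev/sdb1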

I ran bonnie 1.4 on 2 of the filesystems with the following results:

### Chunk size = 64k
./Bonnie -d /mnt/64/ -s 1024 -y -u -o_direct

MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU   /sec %CPU
1*1024 59185 50.9 21849  7.5 14490  5.0 16377 24.1 212812 25.3 267.8  1.5

### Chunk size = 256k
./Bonnie -d /mnt/256/ -s 1024 -y -u -o_direct

MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU   /sec %CPU
1*1024 47650 40.6 22561  6.8 19019  6.9 16872 22.2 209770 23.7 267.2  1.5


OK...so now I have some benchmarks, but I'm not sure they remotely
relate to normal IO on a busy IMAP server.  I would expect an IMAP
server to have many relatively small random reads and writes.  Looking
at the output of 'iostat' on one of the mail servers, I can see that the
average IO request size (avgrq-sz) is 77 and 87 (sectors?) on disks that
are members of a RAID1 array.  If I understand this correctly, the
average I/O request to the block device is around 40k.  Can a larger
chunk size help out when random I/O is prevalent?
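
For reference, I'm reading avgrq-sz as 512-byte sectors, so the
arithmetic is roughly 77-87 sectors * 512 bytes = 38-43k per request.
The numbers came from extended iostat output along these lines (the
interval is arbitrary):

  # per-device extended statistics, sampled every 5 seconds
  iostat -x 5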

With this said, has anyone ever tried tuning a RAID5 array to a busy
mail server (or similar application)?  An even better question would be
how a person can go about benchmarking different storage configurations
that can be applied to a specific application.  At this point, I'm not
sure which benchmarking tool(s) would be useful in this situation and
how that testing should be conducted.  Should I measure throughput or
smiling email users :)

Thanks in advance,

~Bryan


* Re: Useful benchmarking tools for RAID
  2008-03-12 22:27 Useful benchmarking tools for RAID Bryan Mark Mesich
@ 2008-03-13 15:26 ` michael
  2008-03-13 20:41 ` Peter Grandi
  1 sibling, 0 replies; 7+ messages in thread
From: michael @ 2008-03-13 15:26 UTC (permalink / raw)
  To: Linux RAID

Quoting Bryan Mark Mesich <bmesich@atlantis.cc.ndsu.nodak.edu>:

> Good afternoon,
>
> I've been sitting back quietly reading posts over the past few months
> regarding RAID performance.  My ultimate goal is to increase the
> performance of our IMAP mail servers that have storage on top of RAID 5.
> During peak times of the day, a single IMAP box might have 500+ imapd
> processes running simultaneously.  As a result, the load increases, as
> does the users' blood pressure.
>
> I'm currently testing with the following:
>
> Intel SE7520BD2 motherboard
> (2) 3Ware PCI-E 9550SX 8 port SATA card
> 1 GB of memory
> (2) Core2Duo 3.0GHz
> (16) Seagate 750GB Barracuda ES drives
> RHEL 5.1 server (stock 2.6.18)
>
> I've setup 3 RAID5 arrays arranged in a 3+1 layout.  I created them with
> different chunk sizes (64k, 128k, and 256k) for testing purposes.
> Write-caching has been disabled (no battery) on the 3Ware cards and I'm
> using ext3 as my filesystem.  When creating the filesystems, I used
> sensible stride sizes and disabled directory indexing.
>
> I ran bonnie 1.4 on 2 of the filesystems with the following results:
>
> ### Chunk size = 64k
> ./Bonnie -d /mnt/64/ -s 1024 -y -u -o_direct
>
> MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU   /sec %CPU
> 1*1024 59185 50.9 21849  7.5 14490  5.0 16377 24.1 212812 25.3 267.8  1.5
>
> ### Chunk size = 256k
> ./Bonnie -d /mnt/256/ -s 1024 -y -u -o_direct
>
> MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU   /sec %CPU
> 1*1024 47650 40.6 22561  6.8 19019  6.9 16872 22.2 209770 23.7 267.2  1.5

[snip]

> With this said, has anyone ever tried tuning a RAID5 array to a busy
> mail server (or similar application)?  An even better question would be
> how a person can go about benchmarking different storage configurations
> that can be applied to a specific application.  At this point, I'm not
> sure which benchmarking tool(s) would be useful in this situation and
> how that testing should be conducted.  Should I measure throughput or
> smiling email users :)

Hello,

I'm unfamiliar with the term "3+1 RAID layout" in the context of 3
RAID 5 arrays. Can you be more specific? Just curious.  :)

I've always found that RAID 5 shows better performance on arrays
with 6 or more disks.  If that can't be achieved, then a RAID 10 might
be better suited.  Of course, many factors come into play.
If RAID 5 is your only option, then perhaps try an array with
more devices.

You can try to get bonnie++ to benchmark many small files.  Not sure
how real-world this is.

# bonnie++ -u root -f -s 0 -n 100:64000:0:128

I think this means 100*1024 files, with sizes ranging from 0 to 64000
bytes, spread across 128 directories (or something like that).
Don't forget about some of the ext3 mount options that can speed
things up, like
  -o noatime,data=writeback
Of course, these remove features that might be important to you.
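
For example, an fstab entry might look something like this (the device
and mount point are just placeholders):

  /dev/sdb1  /mnt/64  ext3  defaults,noatime,data=writeback  0 2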

Cheers,
Mike





* Re: Useful benchmarking tools for RAID
  2008-03-12 22:27 Useful benchmarking tools for RAID Bryan Mark Mesich
  2008-03-13 15:26 ` michael
@ 2008-03-13 20:41 ` Peter Grandi
  2008-03-13 21:17   ` Richard Scobie
  2008-03-13 22:58   ` Bryan Mark Mesich
  1 sibling, 2 replies; 7+ messages in thread
From: Peter Grandi @ 2008-03-13 20:41 UTC (permalink / raw)
  To: Linux RAID

>>> On Wed, 12 Mar 2008 17:27:58 -0500, Bryan Mark Mesich
>>> <bmesich@atlantis.cc.ndsu.nodak.edu> said:

bmesich> [ ... ] performance of our IMAP mail servers that have
bmesich> storage on top of RAID 5. [ ... ]

That may not be a good combination. I generally dislike RAID5,
but even without being prejudiced :-), RAID5 is suited to a
mostly-read load, and a mail store is usually not mostly-read,
because it does lots of appends. In particular it does lots of
widely scattered appends. As usual, I'd rather use RAID10 here.

Most importantly, the structure of the mail store mailboxes
matters a great deal e.g. whether it is mbox-style, or else
maildir-style, or something else entirely like DBMS-style.

bmesich> During peak times of the day, a single IMAP box might
bmesich> have 500+ imapd processes running simultaneously.

The 'imapd's are not such a big deal; the delivery daemons may be
causing more trouble, along with the interference between the two and
the choice of elevator. As to the elevator, in your case who knows
which would be best: a case could be made for 'anticipatory', another
for 'deadline', and perhaps 'noop' is the safest. As usual, flusher
parameters are also probably quite important. Setting the RHEL
'vm/max_queue_size' to a low value, something like 50-100 in your
case, might be useful.
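
For what it's worth, the elevator can be checked and switched per
device at runtime, e.g. (device name purely illustrative):

  cat /sys/block/sda/queue/scheduler
  echo deadline > /sys/block/sda/queue/scheduler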

Now that it occurs to me, another factor is whether your users
access the mail store mostly as a download area (that is mostly
as they would if using POP3) or they actually keep their mail
permanently on it, and edit the mailboxes via IMAP4.

In the latter case the reliability of the mail store is even
more important, and the write rates even higher, so I would
recommend RAID10 even more strongly.

If you think that RAID10 costs too much in WASTED capacity,
good luck! :-)

Or you could investigate whether your IMAP server can do
compressed mailboxes. You got plenty of CPU power, more so
probably relative to your network speed.

bmesich> I'm currently testing with the following:
bmesich> Intel SE7520BD2 motherboard
bmesich> (2) 3Ware PCI-E 9550SX 8 port SATA card

Pretty good.

bmesich> 1 GB of memory

Probably ridiculously small. Sad to say...

bmesich> (2) Core2Duo 3.0GHz
bmesich> (16) Seagate 750GB Barracuda ES drives
bmesich> RHEL 5.1 server (stock 2.6.18)

Pretty good. Those 16x750GB look *perfect* for a nice sw RAID10,
with 8 pairs, with each member of a pair on a different 9550SX.
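
A minimal sketch, assuming the 3ware cards just export the 16 disks
individually (device names purely illustrative, 'n2' layout, and the
device order chosen so each adjacent pair spans the two controllers):

  mdadm --create /dev/md0 --level=10 --layout=n2 --chunk=64 \
        --raid-devices=16 \
        /dev/sda /dev/sdi /dev/sdb /dev/sdj /dev/sdc /dev/sdk \
        /dev/sdd /dev/sdl /dev/sde /dev/sdm /dev/sdf /dev/sdn \
        /dev/sdg /dev/sdo /dev/sdh /dev/sdp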

bmesich> I've setup 3 RAID5 arrays arranged in a 3+1 layout.  I
bmesich> created them with different chunk sizes (64k, 128k, and
bmesich> 256k) for testing purposes.

Chunk size in your situation is the least of your worries. Anyhow
it depends on the structure of your mail store.

bmesich> Write-caching has been disabled (no battery) on the
bmesich> 3Ware cards

That can be a very bad idea, if that also disables the builtin
cache of the disks. If the ondisk cache is enabled it probably
matters relatively little. Anyhow for a system like yours doing
what it does I would consider battery backup *for the whole
server* pretty important.

bmesich> and I'm using ext3 as my filesystem.

That's likely to be a very bad idea. Consider just this: your
3+1 arrays have one 3x750GB filesystem each (I guess). How long
could 'fsck' of one of those take? You really don't want to know.

Depending on mail store structure I'd be using ReiserFS, or JFS
or even XFS. My usual suggestion is to use JFS by default unless
one has special reasons.

There may well be special reasons! In your case ReiserFS would be
rather better if the mail store was organized as a lot of small
files, and XFS if it was organized as large mail archives files,
for example. XFS also has the advantages that it supports write
barriers (but not sure if the one in 2.6.18 already does), so you
could probably enable the host adapter cache, and that it handles
well very parallel access patterns. It has the disadvantage that
it can require several GB of memory to 'fsck' (like 1GB per 1TB
of filesystem, or more), and does not work as well with lots of
small files (while ReiserFS is very good, and JFS not too bad).

bmesich> When creating the filesystems, I used sensible stride
bmesich> sizes and disabled directory indexing.

That's very wise, both of that.

bmesich> I ran bonnie 1.4 on 2 of the filesystems with the following results:

               ---Sequential Output (nosync)--- ---Sequential Input-- --Rnd Seek-
               -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --04k (03)-

bmesich> Chunk size = 64k
bmesich> ./Bonnie -d /mnt/64/ -s 1024 -y -u -o_direct
bmesich> MB     K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU
bmesich> 1*1024 59185 50.9 21849  7.5 14490  5.0 16377 24.1 212812 25.3 267.8  1.5

bmesich> Chunk size = 256k
bmesich> ./Bonnie -d /mnt/256/ -s 1024 -y -u -o_direct
bmesich> MB     K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU
bmesich> 1*1024 47650 40.6 22561  6.8 19019  6.9 16872 22.2 209770 23.7 267.2  1.5

So you are getting 45-60MB/s writing and 210-220MB/s reading.
The reading rate is roughly reasonable (each of those disks can
do 70-80MB/s on the outer tracks), but the write speed is pretty
disastrous. Probably like many others you are using RAID5 without
realizing the pitfalls of parity RAID (amply explained in some
recent threads). Such pitfalls are particularly bad also if the
access patterns involve lots of small writes.

This is what I get on a 4x(1+1) RAID10 (with 'f2' for better read
performance, I would suggest the default 'n2' in your case) with
mixed 400GB and 1TB disks (and 'blockdev --setra 1024', regrettably
as detailed in a recent message of mine):

  # Bonnie -y -u -o_direct -s 2000 -v 2 -d /tmp/a
  Bonnie 1.4: File '/tmp/a/Bonnie.27318', size: 2097152000, volumes: 2
  Using O_DIRECT for block based I/O
  Writing with putc_unlocked()...done: 176107 kB/s  79.0 %CPU
  Rewriting...                   done:  31797 kB/s   3.1 %CPU
  Writing intelligently...       done: 243844 kB/s   9.6 %CPU
  Reading with getc_unlocked()...done:  22424 kB/s  28.3 %CPU
  Reading intelligently...       done: 475166 kB/s  14.9 %CPU
  Seek numbers calculated on first volume only
  Seeker 1...Seeker 2...Seeker 3...start 'em...done...done...done...
		---Sequential Output (sync)----- ---Sequential Input-- --Rnd Seek-
		-CharUnlk- -DIOBlock- -DRewrite- -CharUnlk- -DIOBlock- --04k (03)-
  Machine    MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU   /sec %CPU
  serv02 2*2000 176107 79.0 243844  9.6 31797  3.1 22424 28.3 475166 14.9  382.1  1.0

Note however that the seek rates are not much higher than yours,
more or less of course.

bmesich> OK...so now I have some benchmarks, but I'm not sure if
bmesich> it remotely relates to normal IO on a busy IMAP server.

I think that's unlikely -- Bonnie is a good test of the limits of
a _storage system_, not particularly of any given application
usage pattern (unless that application looks a lot like Bonnie).

bmesich> I would expect an IMAP server to have many relatively
bmesich> small random reads and writes.

Perhaps -- but it all depends on the structure of the mail store
and whether the users download mail or keep their mailboxes on
the server, and how big those mailboxes tend to be.

bmesich> With this said, has anyone ever tried tuning a RAID5
bmesich> array to a busy mail server (or similar application)?

Note a little but important point of terminology: a mail server
and a mail store server are two very different things. They may
be running on the same hardware, but that's all.

bmesich> An even better question would be how a person can go
bmesich> about benchmarking different storage configurations
bmesich> that can be applied to a specific application. [ ... ] 
bmesich> Should I measure throughput or smiling email users :)

Your application is narrow enough. There are mail-specific
benchmarks, e.g. Postmark, but they tend to be for mail servers,
not mail store servers.

A mail store server is in effect though a file server, even if
the protocol is IMAP4 rather than SMB or NFS or WEBDAV. But file
size, number, and access patterns matter.

Thinking of file server, the hundreds of IMAP daemons and the
size of the mail store point to a large concurrent user base.

I would dearly hope that you have several good (with a fair bit
of offloading) 1gb/s interfaces with load balancing across them
(either bonding or ECMP), or at least one 10gb/s interface, and a
pretty good switch/router/network, and you have set the obvious
TCP parameters for high speed network transfer over high bandwidth
links.
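
For instance, something along these lines (the values are purely
illustrative, not a recommendation):

  sysctl -w net.core.rmem_max=16777216
  sysctl -w net.core.wmem_max=16777216
  sysctl -w net.ipv4.tcp_rmem="4096 87380 16777216"
  sysctl -w net.ipv4.tcp_wmem="4096 65536 16777216"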

If your users are typical contemporary ones and send each other
attachments dozens of megabytes long, a single 1gb/s interface
that can do 110MB/s with the best parameters is not going to be
enough.


* Re: Useful benchmarking tools for RAID
  2008-03-13 20:41 ` Peter Grandi
@ 2008-03-13 21:17   ` Richard Scobie
  2008-03-13 22:58   ` Bryan Mark Mesich
  1 sibling, 0 replies; 7+ messages in thread
From: Richard Scobie @ 2008-03-13 21:17 UTC (permalink / raw)
  To: Linux RAID Mailing List

Peter Grandi wrote:

> for example. XFS also has the advantages that it supports write
> barriers (but not sure if the one in 2.6.18 already does), so you
> could probably enable the host adapter cache, and that it handles
> well very parallel access patterns. It has the disadvantage that
> it can require several GB of memory to 'fsck' (like 1GB per 1TB
> of filesystem, or more), and does not work as well with lots of
> small files (while ReiserFS is very good, and JFS not too bad).

This need no longer be the case (xfs_repair memory usage).

Improvements were made in xfs_repair 2.9.2 and later - quoting one of
the developers:

"Right now, I can repair a 9TB filesystem with ~150 million inodes
in 2GB of RAM without going to swap using xfs_repair 2.9.4 and
with no custom/tuning/config options."

Regards,

Richard



* Re: Useful benchmarking tools for RAID
  2008-03-13 20:41 ` Peter Grandi
  2008-03-13 21:17   ` Richard Scobie
@ 2008-03-13 22:58   ` Bryan Mark Mesich
  2008-03-16 23:39     ` Peter Grandi
  1 sibling, 1 reply; 7+ messages in thread
From: Bryan Mark Mesich @ 2008-03-13 22:58 UTC (permalink / raw)
  To: Linux RAID


On Thu, 2008-03-13 at 20:41 +0000, Peter Grandi wrote:
> bmesich> [ ... ] performance of our IMAP mail servers that have
> bmesich> storage on top of RAID 5. [ ... ]
> 
> That may be not a good combination. I generally dislike RAID5,
> but even without being prejudiced :-), RAID5 is suited to a
> mostly-read load, and a mail store is usually not mostly-read,
> because it does lots of appends. In particular it does lots of
> widely scattered appends. As usual, I'd rather use RAID10 here.
> 
> Most importantly, the structure of the mail store mailboxes
> matters a great deal e.g. whether it is mbox-style, or else
> maildir-style, or something else entirely like DBMS-style.

We are currently using mbx mail format, but are looking into switching
to mixed (not sure if 'mixed' is the correct terminology).  We were
hoping that the smaller file sizes would in turn cause more efficient
I/O. Any thoughts on this change?
> 
> bmesich> During peak times of the day, a single IMAP box might
> bmesich> have 500+ imapd processes running simultaneously.
> 
> The 'imapd's are not such a big deal, the delivery daemons may be
> causing more trouble, and the interference between the two, and
> the type of elevator. As to elevator in your case who knows which
> would be best, a case could be made for 'anticipatory', another
> one for 'deadline', and perhaps 'noop' is the safest. As usual,
> flusher parameters are also probably quite important. Setting the
> RHEL 'vm/max_queue_size' to a low value, something like 50-100 in
> your case, might be useful.
> 
Good point on both.  The imap boxes are currently using cfq (Red Hat
default).  I've been setting up SAR to collect data points so that when
we decide to change the scheduler, we have something to measure
against.
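
For example, something as simple as the following (interval and count
are arbitrary) gives per-device numbers to compare before and after:

  sar -d 5 12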

> Now that it occurs to me, another factor is whether your users
> access the mail store mostly as a download area (that is mostly
> as they would if using POP3) or they actually keep their mail
> permanently on it, and edit the mailboxes via IMAP4.

In our setup, the mail servers store the mail permanently (unless users
delete).  Users have a 512MB quota on their mailboxes. 

[Cut]
> bmesich> 1 GB of memory
> 
> Probably ridiculously small. Sad to say...

You're right, 1GB on a mail server is small in this case.  In my attempt
to simplify my problems I left out some of the complexities of our
storage layout.  In reality, the imap servers store their mail on
mirrored SAN volumes via dual 4Gb fibre channel HBAs.  Typical volume
size for the mail to sit on is around 250GB.  The fibre targets are
running RAID5 in a 3+1 layout in separate geographic areas (my test box
is a fibre target replacement not yet in service, thus the small amount
of memory).  I should also mention that we are using bitmaps on the
RAID1 array.  Possibly moving these to local disk would increase
performance some?
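
Something like the following is what I had in mind (array name and
bitmap path are just examples; the bitmap file has to live on a
filesystem outside the array itself):

  mdadm --grow /dev/md0 --bitmap=none
  mdadm --grow /dev/md0 --bitmap=/var/lib/md0-bitmap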

We're using 3rd party software developed by Pavitrasoft to export the
volumes to the initiators.  We've been looking at SC/ST as a replacement
for Pavitrasoft's software, but are unsure about moving to it.  I've
done a little reading on RAID10, and what I have read looks promising
in regard to write performance improvements.  I'll set up a RAID 10
array with 8 drives and run some benchmarks.

[Cut]
> bmesich> I've setup 3 RAID5 arrays arranged in a 3+1 layout.  I
> bmesich> created them with different chunk sizes (64k, 128k, and
> bmesich> 256k) for testing purposes.
> 
> Chunk size in your situation is the least of your worries. Anyhow
> it depends on the structure of your mail store.

Some of my reading indicated that larger chunk sizes can increase I/O
performance where random writes/reads occur often.  Any thoughts on
this? 
> 
> bmesich> Write-caching has been disabled (no battery) on the
> bmesich> 3Ware cards
> 
> That can be a very bad idea, if that also disables the builtin
> cache of the disks. If the ondisk cache is enabled it probably
> matters relatively little. Anyhow for a system like yours doing
> what it does I would consider battery backup *for the whole
> server* pretty important.

Good point.  I was unaware that disabling write-caching on the
controller might affect the cache on the drives themselves.  As for
battery backup, the whole data center is protected by a UPS.  I was
referring to controller batteries on the 3ware cards.  I was under the
assumption that batteries on the controllers are a must when using
write-caching sensibly.  Any ideas on how much write cache is needed
to be useful?  I calculated the average I/O rate to be around
440k/sec.  So, with 128MB of cache, ([128*1024]/440)/60 = 4.9 minutes
of cache time before it is overwritten?
> 
> bmesich> and I'm using ext3 as my filesystem.
> 
> That's likely to be a very bad idea. Consider just this: your
> 3+1 arrays have one 3x750GB filesystem each (I guess). How long
> could 'fsck' of one of those take? You really don't want to know.

We have an 850GB volume running ext3 on an ftp server.  It takes a very
long time :(
> 
> Depending on mail store structure I'd be using ReiserFS, or JFS
> or even XFS. My usual suggestion is to use JFS by default unless
> one has special reasons.

Is JFS still being supported by IBM?  Another option I'm looking at
would be to move the (SAN) filesystem journal to local disk.

[Cut]
> Note however that the seek rates are not much higher than yours,
> more or less of course.

Looks good.  I'll have to try it out.

[Cut]
> 
> bmesich> With this said, has anyone ever tried tuning a RAID5
> bmesich> array to a busy mail server (or similar application)?
> 
> Note a little but important point of terminology: a mail server
> and a mail store server are two very different things. They may
> be running on the same hardware, but that's all.

Thanks for the correction :)

[Cut]
> I would dearly hope that you have several good (with a fair bit
> of offloading) 1gb/s interfaces with load balancing across them
> (either bonding ro ECMP), or at least one 10gb/s interface, and a
> pretty good switch/router/network, and your have set the obvious
> TCP parameters for high speed network transfer over high bandwidth
> links.

We are currently running 7 imap servers servicing around 15,000+ users.
You're absolutely right, I think we would benefit from having more
hardware to spread the users across.  Users are relatively balanced
between the imap servers, but there are just too many users.  I'm hoping
we get an additional 2 imap servers to help with the load.
> 
> If your users are typical contemporary ones and send each other
> attachements dozens of megabytes long, a single 1gb/s interface
> that can do 110MB/s with the best parameter is not going to be
> enough.

The most damaging user actions seem to be internal listserv messages
addressed to thousands of users.  Holding these messages until night
time (when the load is down), or educating our user base, may help some.
--
Thanks for the reply,

~Bryan


* Re: Useful benchmarking tools for RAID
  2008-03-13 22:58   ` Bryan Mark Mesich
@ 2008-03-16 23:39     ` Peter Grandi
  2008-03-17 20:48       ` Richard Scobie
  0 siblings, 1 reply; 7+ messages in thread
From: Peter Grandi @ 2008-03-16 23:39 UTC (permalink / raw)
  To: Linux RAID

>>> On Thu, 13 Mar 2008 17:58:35 -0500, Bryan Mark Mesich
>>> <bmesich@atlantis.cc.ndsu.nodak.edu> said:

[ ... performance boost for an IMAP mail server ... ]

bmesich> We are currently using mbx mail format, but are looking
bmesich> into switching to mixed (not sure if 'mixed' is the
bmesich> correct terminology). We were hoping that the smaller
bmesich> file sizes would in turn cause more efficient I/O. Any
bmesich> thoughts on this change?

Smaller file sizes usually don't cause more efficient IO, but
may cause more effective IO. But one negative aspect of small
files is more metadata access, and many file systems don't
handle metadata well (as to that, investigate using "nodiratime"
and either "noatime" or "relatime" already).

It depends on how your users interact with the mail store and
the current distribution of mail store file sizes. For example
if most of your users keep their mail (as you indicate below)
and keep it as many do in a single Inbox of up to 500MB, just
about any operation (except delivery) will rewrite it, and in
your current setup rewrite performance is terrible at around
20MB/s.

However, if you move to smaller files ReiserFS seems better, if
you keep mbox JFS is nicer, and if the mboxes are largish
perhaps XFS is better.

bmesich> In our setup, the mail servers store the mail
bmesich> permanently (unless users delete). Users have a 512MB
bmesich> quota on their mailboxes.

It would be interesting to have a look at whether they then split
their mailboxes into folders or keep it all in the Inbox.  In
other words, to have a look at the number and size of files.

bmesich> mirrored SAN volumes via Dual 4GB fibre channel HBA's.
bmesich> Typical volume size for the mail to sit on is around
bmesich> 250GB. The fibre targets are running RAID5 in a 3+1
bmesich> layout in separate geographic areas (my test box is a
bmesich> fibre target replacement not yet in service, thus the
bmesich> small amount of memory). I should also mention that we
bmesich> are using bitmaps on the RAID1 array.  Possibly moving
bmesich> these to local disk would increase performance some?

bmesich> Some of my readings indicated that larger chunk sizes
bmesich> can increase I/O performance where random writes/reads
bmesich> occur often. [ ... ]

Yes, but that also increases RAID5 stripe size, making the
chances of avoiding RMW lower.
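
For instance, with your 3+1 layout a 256KiB chunk means a full data
stripe of 3 x 256KiB = 768KiB, while a 64KiB chunk gives 192KiB;
either way the ~40KiB writes you measured earlier cover only a small
fraction of a stripe, so most of them pay the read-modify-write cost.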

bmesich> [ ... ] disabling write-caching on the controller
bmesich> might affect the cache on the drives themselves.

Well, that depends on the firmware of the host adapter. Somewhat
reasonably if you tell it that its own cache can't be used, some
will assume that enabling the disk cache isn't safe either.

bmesich> As for battery backup, the whole data center is
bmesich> protected by a UPS.  I was referring to controller
bmesich> batteries on the 3ware cards.

But if the whole data center is on UPS, the battery on the
individual host adapter is almost redundant (I can imagine some
cases where power is lost to a single machine of course).

bmesich> I was under the assumption that batteries on the
bmesich> controllers are a must when using write-caching
bmesich> sensibly.

Well, yes and no. In general the Linux cache is enough for
caching and the disk cache is enough for buffering.

The host adapter cache is most useful for RAID5, as a stripe
buffer: to keep in memory writes that do not cover a full stripe
hoping that sooner or later the rest of the stripe will be
written and thus a RMW cycle will be avoided. In your case
that may be a vain hope.

bmesich> [ ... ] average I/O rate to be around 440k/sec.
bmesich> So, with 128MB of cache, ([128*1024]/440)/60 = 4.9
bmesich> minutes of cache time before it is overwritten?

Here the calculation seems motivated by thinking of the host
adapter cache as a proper cache for popular blocks. But in your
case I suspect that is not that relevant.

[ ... ]

bmesich> Is JFS still being supported by IBM?

It was never supported by IBM... The only filesystem for which
you can get support (with a modest fee) is ReiserFS, and 'ext3'
for RedHat customers only.

However IBM have stopped actively developing JFS, much as SGI
have stopped actively developing XFS, and RedHat have stopped
actively developing 'ext3'.

The main difference is in reactiveness to bug fixing: for JFS
it is up to the general kernel development community, while for
ReiserFS, XFS and 'ext3' there is a sponsor who cares (somewhat)
about that.

>> Note a little but important point of terminology: a mail server
>> and a mail store server are two very different things. They may
>> be running on the same hardware, but that's all.

bmesich> Thanks for the correction :)

Well, it was not a correction, but a prompt to consider the
impact of mail delivery. You have been trying to simplify the
description of your situation, but an IMAP mail store is fed
from a mail spool, and the mail spool from some network link.

A large impact on the performance of your mail store may be how
mail is delivered into it, and whether the mail transport server
and the mail delivery system are running on the same servers as
the mail store.

For example, if the mail store and the mail spool are on the
same server or disks then the one network interface is busy with
3 types of traffic:

* incoming e-mail
* outgoing e-mail
* outgoing mail store data

and mail delivery is likely to be local.

There are also incoming mail store requests, but they are likely
to be trivial (if numerous).

bmesich> We are currently running 7 imap servers servicing
bmesich> around 15,000+ users. [ ... ]
bmesich> The most damaging user actions seem to be internal
bmesich> listserv messages addressed to thousands of users. [
bmesich> ... ]

In that case mail spooling and delivery are likely to be a very
big part of the equation.

You may want to investigate IMAP servers that store mailboxes
using DBMSes; they often store each message and attachment only once,
no matter how many local recipients it has.

Overall I suspect that your RAID issues are small compared to
the rest, even if the rather low RAID5 write rates reported
surely contribute robustly, suggesting that taking care about
alignment (at least) would help. But RAID10 does not have
special writing issues.


* Re: Useful benchmarking tools for RAID
  2008-03-16 23:39     ` Peter Grandi
@ 2008-03-17 20:48       ` Richard Scobie
  0 siblings, 0 replies; 7+ messages in thread
From: Richard Scobie @ 2008-03-17 20:48 UTC (permalink / raw)
  To: Linux RAID Mailing List

Peter Grandi wrote:

> However, if you move to smaller files ReiserFS seems better, if
> you keep mbox JFS is nicer, and if the mboxes are largish
> perhaps XFS is better.

Testing might be the best option.  This Intel-authored PDF (slightly
dated) suggests XFS may be the best choice for maildir-based storage
and ext3 for mbox.

http://www.valhenson.org/review/choosing.pdf


> bmesich> I was under the assumption that batteries on the
> bmesich> controllers are a must when using write-caching
> bmesich> sensibly.
> 
> Well, yes and no. In general the Linux cache is enough for
> caching and the disk cache is enough for buffering.
> 
> The host adapter cache is most useful for RAID5, as a stripe
> buffer: to keep in memory writes that do not cover a full stripe
> hoping that sooner or later the rest of the stripe will be
> written and thus a RMW cycle will be avoided. In your case
> that may be a vain hope.

If using XFS, keeping the battery-backed controller cache would be
sensible - see the "Write Back Cache" section of the FAQ at SGI:

http://oss.sgi.com/projects/xfs/faq.html#wcache


> However IBM have stopped actively developing JFS, much as SGI
> have stopped actively developing XFS, and RedHat have stopped
> actively developing 'ext3'.
> 
> The main difference is in reactiveness to bug fixing: for JFS
> it is up to the general kernel development community, while for
> ReiserFS, XFS and 'ext3' there is a sponsor who cares (somewhat)
> about that.

Although SGI may have "stopped actively developing XFS", in the sense
that SGI has EOL'ed IRIX, SGI staff are actively adding new features
to and fixing bugs in the Linux implementation. See

http://oss.sgi.com/projects/xfs/

and an active mailing list:

http://marc.info/?l=linux-xfs

Regards,

Richard

