* non volatile ram devices
@ 2002-12-04 19:59 Russell Coker
  2002-12-04 20:24 ` Ragnar Kjørstad
                   ` (2 more replies)
  0 siblings, 3 replies; 20+ messages in thread
From: Russell Coker @ 2002-12-04 19:59 UTC (permalink / raw)
  To: linux-ide-arrays; +Cc: ReiserFS

I have some servers that are giving inadequate disk performance for Maildir 
mail spools.  They are running kernel 2.4.19 (2.4.20 upgrade is planned) and 
using ReiserFS for everything that's important.

At this stage it is impossible for me to replace disks, RAID controllers, or 
anything else really significant.

What I am thinking of doing is using a kernel that supports data journalling 
which should increase performance, but still probably won't give me enough.  
So I am thinking of using an "external journal" (or using software RAID to 
put the part of the partition containing the journal on a different device).

The device containing the journal would be something much faster than physical 
media.  I have been doing some research on non-volatile memory devices.  I 
only found one company producing disks that are RAM based with battery 
backup, and they seem to start at $10K (too expensive - probably because they 
are much larger than I need, I need 128M at most, they provide 2G).  I found 
many companies selling flash memory, but that only takes a million writes 
(that'll last about an hour for the use I plan).  I found one company selling 
PC-Card devices that have two batteries for backup, but that requires getting 
a PCI controller for PC-Cards (something I haven't tried before).

Does anyone know of an affordable ($1000 or less) device that can survive 
unexpected power outages of at least 24 hours duration, can commit a write in 
less than 1ms, supports unlimited writes, and connects to an IDE or SCSI bus 
(or PCI if there's a suitable Linux driver)?

-- 
http://www.coker.com.au/selinux/   My NSA Security Enhanced Linux packages
http://www.coker.com.au/bonnie++/  Bonnie++ hard drive benchmark
http://www.coker.com.au/postal/    Postal SMTP/POP benchmark
http://www.coker.com.au/~russell/  My home page



* Re: non volatile ram devices
  2002-12-04 19:59 non volatile ram devices Russell Coker
@ 2002-12-04 20:24 ` Ragnar Kjørstad
  2002-12-05  9:00   ` Russell Coker
  2002-12-04 22:05 ` Hans Reiser
  2002-12-05  6:32 ` Oleg Drokin
  2 siblings, 1 reply; 20+ messages in thread
From: Ragnar Kjørstad @ 2002-12-04 20:24 UTC (permalink / raw)
  To: Russell Coker; +Cc: linux-ide-arrays, ReiserFS

On Wed, Dec 04, 2002 at 08:59:35PM +0100, Russell Coker wrote:
> I have some servers that are giving inadequate disk performance for Maildir 
> mail spools.  They are running kernel 2.4.19 (2.4.20 upgrade is planned) and 
> using ReiserFS for everything that's important.

One thing you might consider is replacing the reiserfs hash with a
maildir-specific hash. In my rather limited testing I found that it was
significantly faster; I think some tests gave 200-300% speed
improvement.

But, as I said, there was only limited testing. Don't go this route
unless you have the time to test it properly both for stability and
performance.


> What I am thinking of doing is using a kernel that supports data journalling 
> which should increase performance, but still probably won't give me enough.  
> So I am thinking of using an "external journal" (or using software RAID to 
> put the part of the partition containing the journal on a different device).
> 
> The device containing the journal would be something much faster than physical 
> media.

Even if the device is just a regular disk it should give you a real
performance boost. Depending on your RAID-setup, it may not be the
throughput, but the seeking back and forth between the journal and the
rest of the disk that kills performance. Having the journal on a
separate disk solves that problem.


> Does anyone know of an affordable ($1000 or less) device that can survive 
> unexpected power outages of at least 24 hours duration, can commit a write in 
> less than 1ms, supports unlimited writes, and connects to an IDE or SCSI bus 
> (or PCI if there's a suitable Linux driver)?

Did you check out Micro Memory Inc? (http://www.umem.com/) 
I think they have some PCI-cards (with linux-drivers) which may be
suitable for this. 

However, the main strength of flash/RAM devices is that you can do
random writes very fast. For a journal device all access will be
sequential, so there may not be much advantage compared to using a
separate disk for the journal. I've never tried, so I'm not sure exactly
how well it would work.

Is your server read- or write- bound? I've found that some mailservers
are IO-bound because of reads (I guess pop- and imap-servers that are
polling), and then the external journal is not likely to help.



-- 
Ragnar Kjørstad


* Re: non volatile ram devices
  2002-12-04 22:05 ` Hans Reiser
@ 2002-12-04 21:17   ` Mike Jadon
  0 siblings, 0 replies; 20+ messages in thread
From: Mike Jadon @ 2002-12-04 21:17 UTC (permalink / raw)
  To: reiser, Russell Coker
  Cc: linux-ide-arrays, ReiserFS, Edward Shishkin, lrm, rmathews

Hans,

Many thanks for the referral.

Russell,

Our qty. 1-9 pricing is $730/unit for the 128MB card and $960/unit for the 
1GB card.  A driver is included in the 2.4.19 kernel.

Thanks,

Mike


At 02:05 PM 12/4/2002, Hans Reiser wrote:
>Russell Coker wrote:
>
>>I have some servers that are giving inadequate disk performance for 
>>Maildir mail spools.  They are running kernel 2.4.19 (2.4.20 upgrade is 
>>planned) and using ReiserFS for everything that's important.
>>
>>At this stage it is impossible for me to replace disks, RAID controllers, 
>>or anything else really significant.
>>
>>What I am thinking of doing is using a kernel that supports data 
>>journalling which should increase performance, but still probably won't 
>>give me enough.
>>So I am thinking of using an "external journal" (or using software RAID 
>>to put the part of the partition containing the journal on a different device).
>>
>>The device containing the journal would be something much faster than 
>>physical media.  I have been doing some research on non-volatile memory 
>>devices.  I only found one company producing disks that are RAM based 
>>with battery backup, and they seem to start at $10K (too expensive - 
>>probably because they are much larger than I need, I need 128M at most, 
>>they provide 2G).  I found many companies selling flash memory, but that 
>>only takes a million writes (that'll last about an hour for the use I 
>>plan).  I found one company selling PC-Card devices that have two 
>>batteries for backup, but that requires getting a PCI controller for 
>>PC-Cards (something I haven't tried before).
>>
>>Does anyone know of an affordable ($1000 or less) device that can survive 
>>unexpected power outages of at least 24 hours duration, can commit a 
>>write in less than 1ms, supports unlimited writes, and connects to an IDE 
>>or SCSI bus (or PCI if there's a suitable Linux driver)?
>>
>>
>The umem.com folks sell a device that we have tested and benchmarked 
>reiserfs on. If I could get Edward to format benchmarks in a way that 
>conveys the information that is relevant to persons reading them, I would 
>post them on our mailing list....
>
>Hans
>

Mike Jadon
Micro Memory, LLC
(US) Tel 818 998 0070 x 318
(US) Fax 818 998 4459
mikej@umem.com
www.umem.com
9540 Vassar
Chatsworth, Ca.
USA 91311 




* Re: non volatile ram devices
  2002-12-04 19:59 non volatile ram devices Russell Coker
  2002-12-04 20:24 ` Ragnar Kjørstad
@ 2002-12-04 22:05 ` Hans Reiser
  2002-12-04 21:17   ` Mike Jadon
  2002-12-05  6:32 ` Oleg Drokin
  2 siblings, 1 reply; 20+ messages in thread
From: Hans Reiser @ 2002-12-04 22:05 UTC (permalink / raw)
  To: Russell Coker; +Cc: linux-ide-arrays, ReiserFS, mikej, Edward Shishkin

Russell Coker wrote:

>I have some servers that are giving inadequate disk performance for Maildir 
>mail spools.  They are running kernel 2.4.19 (2.4.20 upgrade is planned) and 
>using ReiserFS for everything that's important.
>
>At this stage it is impossible for me to replace disks, RAID controllers, or 
>anything else really significant.
>
>What I am thinking of doing is using a kernel that supports data journalling 
>which should increase performance, but still probably won't give me enough.  
>So I am thinking of using an "external journal" (or using software RAID to 
>put the part of the partition containing the journal on a different device).
>
>The device containing the journal would be something much faster than physical 
>media.  I have been doing some research on non-volatile memory devices.  I 
>only found one company producing disks that are RAM based with battery 
>backup, and they seem to start at $10K (too expensive - probably because they 
>are much larger than I need, I need 128M at most, they provide 2G).  I found 
>many companies selling flash memory, but that only takes a million writes 
>(that'll last about an hour for the use I plan).  I found one company selling 
>PC-Card devices that have two batteries for backup, but that requires getting 
>a PCI controller for PC-Cards (something I haven't tried before).
>
>Does anyone know of an affordable ($1000 or less) device that can survive 
>unexpected power outages of at least 24 hours duration, can commit a write in 
>less than 1ms, supports unlimited writes, and connects to an IDE or SCSI bus 
>(or PCI if there's a suitable Linux driver)?
>
>  
>
The umem.com folks sell a device that we have tested and benchmarked 
reiserfs on. If I could get Edward to format benchmarks in a way that 
conveys the information that is relevant to persons reading them, I 
would post them on our mailing list....

Hans



* Re: non volatile ram devices
  2002-12-04 19:59 non volatile ram devices Russell Coker
  2002-12-04 20:24 ` Ragnar Kjørstad
  2002-12-04 22:05 ` Hans Reiser
@ 2002-12-05  6:32 ` Oleg Drokin
  2002-12-05  8:36   ` Russell Coker
  2 siblings, 1 reply; 20+ messages in thread
From: Oleg Drokin @ 2002-12-05  6:32 UTC (permalink / raw)
  To: Russell Coker; +Cc: ReiserFS

Hello!

On Wed, Dec 04, 2002 at 08:59:35PM +0100, Russell Coker wrote:

> I have some servers that are giving inadequate disk performance for Maildir 
> mail spools.  They are running kernel 2.4.19 (2.4.20 upgrade is planned) and 
> using ReiserFS for everything that's important.

May I ask what kind of inadequacy you observe, and on what kinds of operations?

Thank you.

Bye,
    Oleg


* Re: non volatile ram devices
  2002-12-05  6:32 ` Oleg Drokin
@ 2002-12-05  8:36   ` Russell Coker
  2002-12-05 16:21     ` Todd Lyons
  0 siblings, 1 reply; 20+ messages in thread
From: Russell Coker @ 2002-12-05  8:36 UTC (permalink / raw)
  To: Oleg Drokin; +Cc: ReiserFS

On Thu, 5 Dec 2002 07:32, Oleg Drokin wrote:
> > I have some servers that are giving inadequate disk performance for
> > Maildir mail spools.  They are running kernel 2.4.19 (2.4.20 upgrade is
> > planned) and using ReiserFS for everything that's important.
>
> May I ask what kind of inadequacy you observe, and on what kinds of
> operations?

It just generally isn't fast enough.  The servers in question have 4 * 72G 
U160 SCSI disks in RAID-5 arrays on MegaRAID controllers.  They are designed 
to handle 300,000 accounts for POP and IMAP.

At times of high load there's 20 reads per second and 160 writes per second.

-- 
http://www.coker.com.au/selinux/   My NSA Security Enhanced Linux packages
http://www.coker.com.au/bonnie++/  Bonnie++ hard drive benchmark
http://www.coker.com.au/postal/    Postal SMTP/POP benchmark
http://www.coker.com.au/~russell/  My home page



* Re: non volatile ram devices
  2002-12-04 20:24 ` Ragnar Kjørstad
@ 2002-12-05  9:00   ` Russell Coker
  2002-12-05 10:38     ` Ragnar Kjørstad
  2002-12-05 13:23     ` Chris Mason
  0 siblings, 2 replies; 20+ messages in thread
From: Russell Coker @ 2002-12-05  9:00 UTC (permalink / raw)
  To: Ragnar Kjørstad; +Cc: linux-ide-arrays, ReiserFS, Mike Jadon

On Wed, 4 Dec 2002 21:24, Ragnar Kjørstad wrote:
> On Wed, Dec 04, 2002 at 08:59:35PM +0100, Russell Coker wrote:
> > I have some servers that are giving inadequate disk performance for
> > Maildir mail spools.  They are running kernel 2.4.19 (2.4.20 upgrade is
> > planned) and using ReiserFS for everything that's important.
>
> One thing you might consider is replacing the reiserfs hash with a
> maildir-specific hash. In my rather limited testing I found that it was
> significantly faster; I think some tests gave 200-300% speed
> improvement.
>
> But, as I said, there was only limited testing. Don't go this route
> unless you have the time to test it properly both for stability and
> performance.

Thanks for the suggestion.  However I don't think that I have the resources to 
develop and adequately test such a change.

Also I doubt that this will help much for my use, I am seeing 160 writes per 
second but only 20 reads per second at peak load times.  So I think that the 
caching is doing well (and directory sizes aren't too big because of quotas).

> > The device containing the journal would be something much faster than
> > physical media.
>
> Even if the device is just a regular disk it should give you a real
> performance boost. Depending on your RAID-setup, it may not be the
> throughput, but the seeking back and forth between the journal and the
> rest of the disk that kills performance. Having the journal on a
> separate disk solves that problem.

True.  However I could only put in a single extra disk, and I don't want to 
use non-RAID...

> > Does anyone know of an affordable ($1000 or less) device that can survive
> > unexpected power outages of at least 24 hours duration, can commit a
> > write in less than 1ms, supports unlimited writes, and connects to an IDE
> > or SCSI bus (or PCI if there's a suitable Linux driver)?
>
> Did you check out Micro Memory Inc? (http://www.umem.com/)
> I think they have some PCI-cards (with linux-drivers) which may be
> suitable for this.

Thanks to everyone who recommended that, I'll check it out.  Based on the 
prices that Mike offered it seems crazy to go for a mere 128M, I think that a 
1G card would do best.  I could use it for the journal of the mail store file 
system and for the entire mail spool.  This should multiply mail delivery 
performance by a factor of at least 4 I think!  Given the price difference 
between 128M and 1G, maybe I should be looking at 2G...

> However, the main strength of flash/RAM devices is that you can do
> random writes very fast. For a journal device all access will be
> sequential, so there may not be much advantage compared to using a
> separate disk for the journal. I've never tried, so I'm not sure exactly
> how well it would work.

One significant advantage of RAM is that there's almost zero latency.  Doing a 
synchronous write to disk (i.e. any journal write) takes a significant amount 
of time.  Moving that to RAM should improve things a lot.

> Is your server read- or write- bound? I've found that some mailservers
> are IO-bound because of reads (I guess pop- and imap-servers that are
> polling), and then the external journal is not likely to help.

In this case the machines each have 4G of RAM.  The total RAM for the mail 
cluster is four times what was used for the Solaris cluster, and Intel X86 
architecture uses less RAM than SPARC (32bit CISC vs 64bit RISC) and I 
suspect that the software we're now using (Qmail and Courier) is more memory 
efficient than Netscape too.  Overall we have heaps more cache memory than 
before, I'm seeing 20 reads per second and 160 writes per second at times of 
peak load.

I don't think that the ratio of reads and writes will change, the people who 
have their machines constantly polling for mail are the ones who receive the 
most mail, so therefore for reads it all stays in cache.

When we scale the machines up to more users if the RAM proves inadequate then 
we can always upgrade the servers to 8G of RAM each if necessary...

-- 
http://www.coker.com.au/selinux/   My NSA Security Enhanced Linux packages
http://www.coker.com.au/bonnie++/  Bonnie++ hard drive benchmark
http://www.coker.com.au/postal/    Postal SMTP/POP benchmark
http://www.coker.com.au/~russell/  My home page



* Re: non volatile ram devices
  2002-12-05  9:00   ` Russell Coker
@ 2002-12-05 10:38     ` Ragnar Kjørstad
  2002-12-05 10:45       ` Russell Coker
  2002-12-05 13:23     ` Chris Mason
  1 sibling, 1 reply; 20+ messages in thread
From: Ragnar Kjørstad @ 2002-12-05 10:38 UTC (permalink / raw)
  To: Russell Coker; +Cc: linux-ide-arrays, ReiserFS, Mike Jadon

On Thu, Dec 05, 2002 at 10:00:32AM +0100, Russell Coker wrote:
> > Even if the device is just a regular disk it should give you a real
> > performance boost. Depending on your RAID-setup, it may not be the
> > throughput, but the seeking back and forth between the journal and the
> > rest of the disk that kills performance. Having the journal on a
> > separate disk solves that problem.
> 
> True.  However I could only put in a single extra disk, and I don't want to 
> use non-RAID...

Unless you use two ramdisks you still have a single point of failure.
I'm not sure exactly how the reliability of the ramdrive compares to that of a
disk.



-- 
Ragnar Kjørstad


* Re: non volatile ram devices
  2002-12-05 10:38     ` Ragnar Kjørstad
@ 2002-12-05 10:45       ` Russell Coker
  0 siblings, 0 replies; 20+ messages in thread
From: Russell Coker @ 2002-12-05 10:45 UTC (permalink / raw)
  To: Ragnar Kjørstad; +Cc: linux-ide-arrays, ReiserFS, Mike Jadon

On Thu, 5 Dec 2002 11:38, Ragnar Kjørstad wrote:
> On Thu, Dec 05, 2002 at 10:00:32AM +0100, Russell Coker wrote:
> > > Even if the device is just a regular disk it should give you a real
> > > performance boost. Depending on your RAID-setup, it may not be the
> > > throughput, but the seeking back and forth between the journal and the
> > > rest of the disk that kills performance. Having the journal on a
> > > separate disk solves that problem.
> >
> > True.  However I could only put in a single extra disk, and I don't want
> > to use non-RAID...
>
> Unless you use two ramdisks you still have a single point of failure.

True, but then both the RAID controller and the motherboard are single points 
of failure already.

> I'm not sure exactly how the reliability of the ramdrive compares to that of a
> disk.

The RAM drive has no moving parts and should be inherently more reliable.  I 
don't recall ever having RAM die on a machine that had been functioning 
properly except when mechanical issues apply (i.e. clumsy people taking 
machines apart).  Hard drives die regularly.  Get a busy machine like a news 
server or a mail server and you expect to keep replacing dead hard drives as 
they wear out.

-- 
http://www.coker.com.au/selinux/   My NSA Security Enhanced Linux packages
http://www.coker.com.au/bonnie++/  Bonnie++ hard drive benchmark
http://www.coker.com.au/postal/    Postal SMTP/POP benchmark
http://www.coker.com.au/~russell/  My home page



* Re: non volatile ram devices
  2002-12-05  9:00   ` Russell Coker
  2002-12-05 10:38     ` Ragnar Kjørstad
@ 2002-12-05 13:23     ` Chris Mason
  2002-12-06  9:52       ` Russell Coker
  1 sibling, 1 reply; 20+ messages in thread
From: Chris Mason @ 2002-12-05 13:23 UTC (permalink / raw)
  To: Russell Coker; +Cc: Ragnar Kjørstad, ReiserFS, Mike Jadon

On Thu, 2002-12-05 at 04:00, Russell Coker wrote:

> In this case the machines each have 4G of RAM.  The total RAM for the mail 
> cluster is four times what was used for the Solaris cluster, and Intel X86 
> architecture uses less RAM than SPARC (32bit CISC vs 64bit RISC) and I 
> suspect that the software we're now using (Qmail and Courier) is more memory 
> efficient than Netscape too.  Overall we have heaps more cache memory than 
> before, I'm seeing 20 reads per second and 160 writes per second at times of 
> peak load.

Have you benchmarked these machines to determine the max write load
capacity on reiserfs?  Are you using a vanilla kernel or one with
patches applied?

I've done a few of my own benchmarks of the data logging patches, but it
would be great to see some independent verification of the speedups in a
real mail server workload.

-chris




* Re: non volatile ram devices
  2002-12-05  8:36   ` Russell Coker
@ 2002-12-05 16:21     ` Todd Lyons
  2002-12-05 22:51       ` Russell Coker
  0 siblings, 1 reply; 20+ messages in thread
From: Todd Lyons @ 2002-12-05 16:21 UTC (permalink / raw)
  To: ReiserFS

Russell Coker wanted us to know:

>It just generally isn't fast enough.  The servers in question have 4 * 72G 
>U160 SCSI disks in RAID-5 arrays on MegaRAID controllers.  They are designed 
>to handle 300,000 accounts for POP and IMAP.
>At times of high load there's 20 reads per second and 160 writes per second.

Let me ask some really stupid questions.
What kind of logging are your pop, imap, and mail services doing?  If
logging to syslog, redirect the mail logging facility to tty12 instead
of a file on the hard drive.  If syslog is logging to a network log
server, then there's not much you can do.  If logging to /dev/null, this
is a non-issue.  Is the Maildir spool on its own partition? (I can't see
how it's not since it's you, Russell).  Is /var/log on its own
partition?

What I'm getting at with all of this is that syslog can create
significant load on a machine if the machine is really busy.

Russell, if you've already tried all this, happily ignore this message
and let us know what you find.
-- 
Blue skies...		Todd
| Get a bigger hammer!   |  Sometimes you get what you want.      |
| http://www.mrball.net  |  Sometimes you get experience.         |
| http://faq.mrball.net  |                     --unknown origin   |
   Linux kernel 2.4.19-16mdk   1 user,  load average: 0.00, 0.00, 0.00


* Re: non volatile ram devices
  2002-12-05 16:21     ` Todd Lyons
@ 2002-12-05 22:51       ` Russell Coker
  0 siblings, 0 replies; 20+ messages in thread
From: Russell Coker @ 2002-12-05 22:51 UTC (permalink / raw)
  To: Todd Lyons, ReiserFS

On Thu, 5 Dec 2002 17:21, Todd Lyons wrote:
> Russell Coker wanted us to know:
> >It just generally isn't fast enough.  The servers in question have 4 * 72G
> >U160 SCSI disks in RAID-5 arrays on MegaRAID controllers.  They are
> > designed to handle 300,000 accounts for POP and IMAP.
> >At times of high load there's 20 reads per second and 160 writes per
> > second.
>
> Let me ask some really stupid questions.
> What kind of logging are your pop, imap, and mail services doing?  If
> logging to syslog, redirect the mail logging facility to tty12 instead
> of a file on the hard drive.  If syslog is logging to a network log
> server, then there's not much you can do.  If logging to /dev/null, this
> is a non-issue.

Logging is to files, but it's got "-" at the start of the log entries to stop 
them being sync'd so I doubt that they have a great impact.

Turning off logs on live production servers is something that I am hesitant to 
do, and I don't expect it to improve performance much as the qmail spool dir 
(with synchronous writes) is also on the /var file system.

> Is the Maildir spool on its own partition? (I can't see
> how it's not since it's you, Russell).  Is /var/log on its own
> partition?

Partitions are /, /var, and /mail .  Not my choice but it's not too bad 
either.

> What I'm getting at with all of this is that syslog can create
> significant load on a machine if the machine is really busy.

Non synchronous writes for logging when a machine has 4G of RAM to allow big 
caches should not be a performance issue.  If it is then there's something 
wrong with the logging.  I may give it a try, but I'll try data logging 
first.

Thanks for the suggestion.

-- 
http://www.coker.com.au/selinux/   My NSA Security Enhanced Linux packages
http://www.coker.com.au/bonnie++/  Bonnie++ hard drive benchmark
http://www.coker.com.au/postal/    Postal SMTP/POP benchmark
http://www.coker.com.au/~russell/  My home page



* Re: non volatile ram devices
  2002-12-05 13:23     ` Chris Mason
@ 2002-12-06  9:52       ` Russell Coker
  2002-12-06 13:03         ` Chris Mason
  2002-12-06 23:50         ` Matthias Andree
  0 siblings, 2 replies; 20+ messages in thread
From: Russell Coker @ 2002-12-06  9:52 UTC (permalink / raw)
  To: Chris Mason; +Cc: Ragnar Kjørstad, ReiserFS, Mike Jadon

[-- Attachment #1: Type: text/plain, Size: 2617 bytes --]

On Thu, 5 Dec 2002 14:23, Chris Mason wrote:
> On Thu, 2002-12-05 at 04:00, Russell Coker wrote:
> > In this case the machines each have 4G of RAM.  The total RAM for the
> > mail cluster is four times what was used for the Solaris cluster, and
> > Intel X86 architecture uses less RAM than SPARC (32bit CISC vs 64bit
> > RISC) and I suspect that the software we're now using (Qmail and Courier)
> > is more memory efficient than Netscape too.  Overall we have heaps more
> > cache memory than before, I'm seeing 20 reads per second and 160 writes
> > per second at times of peak load.
>
> Have you benchmarked these machines to determine the max write load
> capacity on reiserfs?  Are you using a vanilla kernel or one with
> patches applied?

I'm using a fairly vanilla kernel.  Its performance is 2 messages per second 
taken from the qmail spool and delivered while there is a background load of POP 
access and new incoming mail.  I.e. if there is a backlog of mail to deliver, 
the backlog gets smaller by 120 messages per minute.

> I've done a few of my own benchmarks of the data logging patches, but it
> would be great to see some independent verification of the speedups in a
> real mail server workload.

I've attached the results from a quick bonnie++ run of a vanilla system, a 
system with the patches you referred me to, and finally with the file system 
mounted with data journalling.

The test was pretty quick (only a single pass) because I've spent too much time 
fiddling with the crappy test hardware to be inclined to spend much more 
effort on it (how the hell is a P3-600 with a 6G IDE drive and 128M of RAM 
supposed to be used for evaluating software to deploy on a server with 
2*1.8GHz CPUs, 196G of hardware RAID, and 4G of RAM?).

The results seem to show that the patches do some good on their own, nothing 
really exciting but worth having.  The data journalling improves performance 
of synchronously creating files in the 512b to 16K size range (the issue I am 
interested in) by a factor of 7!  This is very promising, I only hope that 
the performance gains when 200 processes are hitting a hardware RAID array of 
4 U160 disks are as good as when a single process is hitting a cheap old IDE 
disk.

This may even remove the immediate need for umem devices.  But I think I'll 
try and get them anyway.  Extra speed is always useful.

-- 
http://www.coker.com.au/selinux/   My NSA Security Enhanced Linux packages
http://www.coker.com.au/bonnie++/  Bonnie++ hard drive benchmark
http://www.coker.com.au/postal/    Postal SMTP/POP benchmark
http://www.coker.com.au/~russell/  My home page

[-- Attachment #2: res.html --]
[-- Type: text/html, Size: 4706 bytes --]


* Re: non volatile ram devices
  2002-12-06  9:52       ` Russell Coker
@ 2002-12-06 13:03         ` Chris Mason
  2002-12-06 23:53           ` Matthias Andree
  2002-12-06 23:50         ` Matthias Andree
  1 sibling, 1 reply; 20+ messages in thread
From: Chris Mason @ 2002-12-06 13:03 UTC (permalink / raw)
  To: Russell Coker; +Cc: Ragnar Kjørstad, ReiserFS, Mike Jadon

[-- Attachment #1: Type: text/plain, Size: 1442 bytes --]

On Fri, 2002-12-06 at 04:52, Russell Coker wrote:
 
> The results seem to show that the patches do some good on their own, nothing 
> really exciting but worth having.  The data journalling improves performance 
> of synchronously creating files in the 512b to 16K size range (the issue I am 
> interested in) by a factor of 7!  This is very promising, I only hope that 
> the performance gains when 200 processes are hitting a hardware RAID array of 
> 4 U160 disks are as good as when a single process is hitting a cheap old IDE 
> disk.
>

You should see a significant improvement over the old code as the number
of procs involved goes up.  The data logging patches have an
optimization Andrew Morton suggested, which is to schedule for a bit
during an fsync to allow other procs to get some work done and increase
the size of the transaction.  I've attached his synctest.c, which tries
to approximate a postfix mail load.  Check the difference between
data=journal and a pure kernel for

time synctest -F -f -n 1 -t 100 dir_name

(it does no timing on its own, you'll have to run it under time)

This does fsyncs on both the file and the directory in a simulated
delivery.  It isn't a perfect benchmark, but it does hammer on fsyncs
nicely.

Another interesting metric is to use the reiserfs proc interface to
count the number of transactions required to finish each run.  (check
the transid in /proc/fs/reiserfs/<disk>/journal)

-chris


[-- Attachment #2: synctest.c --]
[-- Type: text/plain, Size: 7642 bytes --]

/*
 * Test and benchmark synchronous operations.
 */

#undef _XOPEN_SOURCE	/* MAP_ANONYMOUS */

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <errno.h>
#include <stdarg.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/resource.h>
#include <sys/wait.h>
#include <sys/mman.h>

/*
 * Lots of yummy globals!
 */
char *progname, *dirname;
int verbose, use_fsync, use_osync;
int fsync_dir;
int n_threads = 1, n_iters = 100;
int *child_status;
int this_child_index;
int dir_fd;
int show_tids;
int threads_per_dir = 1;
int thread_group;
int do_unlink;
int rename_pass;

#define N_FILES		100
#define UNLINK_LAG	30
#define RENAME_PASSES	3

void show(char *fmt, ...)
{
	if (verbose) {
		va_list ap;

		va_start(ap, fmt);
		vfprintf(stdout, fmt, ap);
		fflush( stdout );
		va_end(ap);
	}
}

/*
 * - Create a file.
 * - Write some data to it
 * - Maybe fsync() it.
 * - Close it
 * - Maybe fsync() its parent dir
 * - rename() it.
 * - maybe fsync() its parent dir
 * - rename() it.
 * - maybe fsync() its parent dir
 * - rename() it.
 * - maybe fsync() its parent dir
 * - UNLINK_LAG files later, maybe unlink it.
 * - maybe fsync() its parent dir
 *
 * Repeat the above N_FILES times
 */

char *mk_dirname(void)
{
	char *ret = malloc(strlen(dirname) + 64);

	sprintf(ret, "%s/%05d", dirname, thread_group);
	return ret;
}

char *mk_filename(int fileno)
{
	char *ret = malloc(strlen(dirname) + 64);

	sprintf(ret, "%s/%05d/%05d-%05d",
			dirname, thread_group, getpid(), fileno);
	return ret;
}

char *mk_new_filename(int fileno, int pass)
{
	char *ret = malloc(strlen(dirname) + 64);

	sprintf(ret, "%s/%05d/%02d-%05d-%05d",
			dirname, thread_group, pass, getpid(), fileno);
	return ret;
}

void sync_dir(void)
{
	if (fsync_dir) {
		show("fsync(%s)\n", dirname);
		if (fsync(dir_fd) < 0) {
			fprintf(stderr, "%s: failed to fsync dir `%s': %s\n",
				progname, dirname, strerror(errno));
			exit(1);
		}
	}
}

void make_dir(void)
{
	char *n = mk_dirname();

	show("mkdir(%s)\n", n);
	if (mkdir(n, 0777) < 0) {
		fprintf(stderr, "%s: Cannot make directory `%s': %s\n",
			progname, n, strerror(errno));
		exit(1);
	}
	free(n);
}

void remove_dir(void)
{
	char *n = mk_dirname();
	show("rmdir(%s)\n", n);
	rmdir(n);
	free(n);
}

void write_stuff_to(int fd, char *name)
{
	static char buf[500000];
	static int to_write = 5000;

	show("write %d bytes to `%s'\n", to_write, name);
	if (write(fd, buf, to_write) != to_write) {
		fprintf(stderr, "%s: failed to write %d bytes to `%s': %s\n",
			progname, to_write, name, strerror(errno));
		exit(1);
	}

	to_write *= 1.1;
	if (to_write > 250000)
		to_write = 5000;
}

void unlink_one_file(int fileno, int pass)
{
	if (do_unlink) {
		char *name = mk_new_filename(fileno, pass);

		show("unlink(%s)\n", name);
		if (unlink(name) < 0) {
			fprintf(stderr, "%s: failed to unlink `%s': %s\n",
				progname, name, strerror(errno));
			exit(1);
		}
		sync_dir();
		free(name);
	}
}

void do_one_file(int fileno)
{
	char *name = mk_filename(fileno);
	int fd, flags;

	flags = O_RDWR|O_CREAT|O_TRUNC;
	if (use_osync)
		flags |= O_SYNC;

	show("open(%s)\n", name);
	fd = open(name, flags, 0666);
	if (fd < 0) {
		fprintf(stderr, "%s: failed to create file `%s': %s\n",
			progname, name, strerror(errno));
		exit(1);
	}

	write_stuff_to(fd, name);

	if (use_fsync) {
		show("fsync(%s)\n", name);
		if (fsync(fd) < 0) {
			fprintf(stderr, "%s: failed to fsync `%s': %s\n",
				progname, name, strerror(errno));
			exit(1);
		}
	}

	show("close(%s)\n", name);
	if (close(fd) < 0) {
		fprintf(stderr, "%s: failed to close `%s': %s\n",
			progname, name, strerror(errno));
		exit(1);
	}

	sync_dir();

	for (rename_pass = 0; rename_pass < RENAME_PASSES; rename_pass++) {
		char *newname = mk_new_filename(fileno, rename_pass);

		show("rename(%s, %s)\n", name, newname);
		if (rename(name, newname) < 0) {
			fprintf(stderr,
				"%s: failed to rename `%s' to `%s': %s\n",
				progname, name, newname, strerror(errno));
			exit(1);
		}
		sync_dir();
		free(name);
		name = newname;
	}
	rename_pass--;
	free(name);
}

void do_child(void)
{
	int fileno;
	char *dn = mk_dirname();
	int dotcount;

	dir_fd = open(dn, O_RDONLY);
	if (dir_fd < 0) {
		fprintf(stderr, "%s: failed to open dir `%s': %s\n",
			progname, dn, strerror(errno));
		exit(1);
	}
	free(dn);

	dotcount = N_FILES / 10;
	if (dotcount == 0)
		dotcount = 1;

	for (fileno = 0; fileno < N_FILES; fileno++) {
		if (fileno % dotcount == 0) {
			printf(".");
			fflush(stdout);
		}
		do_one_file(fileno);
		if (fileno >= UNLINK_LAG)
			unlink_one_file(fileno - UNLINK_LAG, RENAME_PASSES - 1);
	}
	for (fileno = N_FILES - UNLINK_LAG; fileno < N_FILES; fileno++)
		unlink_one_file(fileno, RENAME_PASSES - 1);
}

void doit(void)
{
	int child;
	int children_left;

	child_status = (int *)mmap(	0,
				n_threads * sizeof(*child_status),
				PROT_READ|PROT_WRITE,
				MAP_SHARED|MAP_ANONYMOUS,
				-1,
				0);
	if (child_status == MAP_FAILED) {
		perror("mmap");
		exit(1);
	}

	memset(child_status, 0, n_threads * sizeof(*child_status));

	thread_group = -1;
	for (this_child_index = 0;
			this_child_index < n_threads; this_child_index++)
	{
		if (this_child_index % threads_per_dir == 0) {
			thread_group++;
			make_dir();
		}

		if (fork() == 0) {
			int iter;

			for (iter = 0; iter < n_iters; iter++)
				do_child();
			child_status[this_child_index] = 1;
			exit(0);
		}
	}

	/* Parent */
	children_left = n_threads;
	while (children_left) {
		int status;

		if( wait3(&status, 0, 0) < 0 ) {
			if( errno != EINTR ) {
				perror("wait3");
				exit(1);
			}
			continue;
		}
		for (child = 0; child < n_threads; child++) {
			if (child_status[child] == 1) {
				child_status[child] = 2;
				printf("*");
				fflush(stdout);
				children_left--;
			}
		}
	}
	/* Round up so a partially filled last directory is removed too. */
	for (thread_group = 0;
			thread_group < (n_threads + threads_per_dir - 1) / threads_per_dir;
			thread_group++)
		remove_dir();

	printf("\n");
}

void usage(void)
{
	fprintf(stderr,
		"Usage: %s [-fFouv] [-p threads-per-dir] [-n iters] [-t threads] dirname\n",
			progname);
	fprintf(stderr, "        -f:    Use fsync() on close\n"); 
	fprintf(stderr, "        -F:    Use fsync() on parent dir\n"); 
	fprintf(stderr, "        -n:    Number of iterations\n");
	fprintf(stderr, "        -o:    Open files O_SYNC\n");
	fprintf(stderr, "        -p:    Number of threads per directory\n");
	fprintf(stderr, "        -t:    Number of threads\n");
	fprintf(stderr, "        -u:    Unlink files during test\n");
	fprintf(stderr, "        -v:    Verbose\n"); 
	fprintf(stderr, "   dirname:    Directory to run tests in\n");
	exit(1);
}


int main(int argc, char *argv[])
{
	int c;

	progname = argv[0];
	while ((c = getopt(argc, argv, "vFfout:n:p:")) != -1) {
		switch (c) {
		case 'f':
			use_fsync++;
			break;
		case 'F':
			fsync_dir++;
			break;
		case 'n':
			n_iters = strtol(optarg, NULL, 10);
			break;
		case 'o':
			use_osync++;
			break;
		case 'p':
			threads_per_dir = strtol(optarg, NULL, 10);
			break;
		case 't':
			n_threads = strtol(optarg, NULL, 10);
			break;
		case 'u':
			do_unlink++;
			break;
		case 'v':
			verbose++;
			break;
		default:
			usage();
		}
	}

	if (optind == argc)
		usage();
	dirname = argv[optind++];
	if (optind != argc)
		usage();

	doit();
	exit(0);
}

* Re: non volatile ram devices
  2002-12-06  9:52       ` Russell Coker
  2002-12-06 13:03         ` Chris Mason
@ 2002-12-06 23:50         ` Matthias Andree
  2002-12-07  4:09           ` Todd Lyons
  2002-12-07 10:03           ` Russell Coker
  1 sibling, 2 replies; 20+ messages in thread
From: Matthias Andree @ 2002-12-06 23:50 UTC (permalink / raw)
  To: reiserfs-list

Russell Coker <russell@coker.com.au> writes:

> I'm using a fairly vanilla kernel.  It's performance is 2 messages per second 
> taken from qmail spool and delivered while there is a background load of pop 
> access and new incoming mail.  IE if there is a backlog of mail to deliver 
> the backlog gets smaller by 120 messages per minute.

In my benchmarks on a plain FreeBSD ffs and a Micropolis 4345WS UWSCSI
disk drive (7200/min) that was otherwise idle, qmail maxes out for
remote 1-to-1 deliveries at a good 3 deliveries/s. It might improve a
little with André Oppenheimer's patches; I didn't bother to check.
Postfix does 15/s on softupdates FreeBSD ffs; qmail does not support
softupdates. I didn't check Linux file systems on a current disk drive
such as Fujitsu MAH (7200/min U160 SCSI).

So I believe on ATA or loaded SCSI 2 messages per second is as good as
qmail gets with its 13+ synchronous writes per delivery. It's a
pig. Retrying with -o dirsync instead of -o sync might be worthwhile
though. Kernel patches needed.
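
As a rough cross-check of these figures, here is a back-of-envelope
sketch in C. It assumes a simplified model that is not from this thread:
every synchronous write waits for at least one full platter rotation and
nothing is batched. The function name is hypothetical.

```c
#include <assert.h>

/*
 * Upper bound on deliveries per second under the simplified model:
 * one full disk rotation per synchronous write, no batching.
 * The model itself is an assumption made for illustration.
 */
double max_deliveries_per_sec(double rpm, int sync_writes_per_delivery)
{
	double rotations_per_sec;

	assert(rpm > 0 && sync_writes_per_delivery > 0);
	rotations_per_sec = rpm / 60.0;
	return rotations_per_sec / sync_writes_per_delivery;
}
```

For a 7200/min drive and 13 synchronous writes per delivery this gives
120 / 13, a little over 9 deliveries/s on an otherwise idle disk. Seeks
and competing I/O can only pull the real figure lower, so 2 to 3
deliveries/s on a loaded system is not implausible under this model.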

-- 
Matthias Andree

* Re: non volatile ram devices
  2002-12-06 13:03         ` Chris Mason
@ 2002-12-06 23:53           ` Matthias Andree
  0 siblings, 0 replies; 20+ messages in thread
From: Matthias Andree @ 2002-12-06 23:53 UTC (permalink / raw)
  To: reiserfs-list

Chris Mason <mason@suse.com> writes:

> This does fsyncs on both the file and the directory in a simulated
> delivery.  It isn't a perfect benchmark, but it does hammer on fsyncs
> nicely.

No need to sync the directory in Postfix if the fsync() makes sure the
filename of a newly created file cannot be lost. I heard this was true
for ReiserFS v3.6 ;-)
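
The two variants being compared can be sketched in C roughly as follows.
This is a minimal illustration, not code from Postfix or qmail: the
function name deliver(), its parameters and the sync_dir flag are all
hypothetical. With sync_dir set to 0 it relies on fsync() of the file
also committing the file name, which is a filesystem-specific assumption.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Write msg to dir/tmpname, fsync() the file, rename it to dir/finalname
 * and, if sync_dir is non-zero, fsync() the directory as well.
 * Returns 0 on success, -1 on error. All names are hypothetical.
 */
int deliver(const char *dir, const char *tmpname, const char *finalname,
	    const char *msg, int sync_dir)
{
	char tmp[4096], final[4096];
	size_t len;
	int fd, dfd, ret = -1;

	assert(dir && tmpname && finalname && msg);
	len = strlen(msg);
	if (snprintf(tmp, sizeof(tmp), "%s/%s", dir, tmpname) >= (int)sizeof(tmp) ||
	    snprintf(final, sizeof(final), "%s/%s", dir, finalname) >= (int)sizeof(final))
		return -1;

	fd = open(tmp, O_WRONLY | O_CREAT | O_EXCL, 0600);
	if (fd < 0)
		return -1;
	/* A short write is treated as a failure to keep the sketch small. */
	if (write(fd, msg, len) != (ssize_t)len || fsync(fd) < 0) {
		close(fd);
		unlink(tmp);
		return -1;
	}
	if (close(fd) < 0 || rename(tmp, final) < 0) {
		unlink(tmp);
		return -1;
	}
	if (!sync_dir)
		return 0;

	/* Make the new directory entry durable too. */
	dfd = open(dir, O_RDONLY);
	if (dfd < 0)
		return -1;
	if (fsync(dfd) == 0)
		ret = 0;
	close(dfd);
	return ret;
}
```

Calling deliver() with sync_dir set to 1 costs one extra synchronous
operation per message, which is the cost the directory fsync() in the
benchmark is measuring.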

-- 
Matthias Andree

* Re: non volatile ram devices
  2002-12-06 23:50         ` Matthias Andree
@ 2002-12-07  4:09           ` Todd Lyons
  2002-12-07 17:13             ` Matthias Andree
  2002-12-07 10:03           ` Russell Coker
  1 sibling, 1 reply; 20+ messages in thread
From: Todd Lyons @ 2002-12-07  4:09 UTC (permalink / raw)
  To: reiserfs-list

Matthias Andree wanted us to know:

>> I'm using a fairly vanilla kernel.  It's performance is 2 messages per second 
>> taken from qmail spool and delivered while there is a background load of pop 
>> access and new incoming mail.  IE if there is a backlog of mail to deliver 
>> the backlog gets smaller by 120 messages per minute.
>In my benchmarks on a plain FreeBSD ffs and a Micropolis 4345WS UWSCSI
>disk drive (7200/min) that was otherwise idle, qmail maxes out for
>remote 1-to-1 deliveries at a good 3 deliveries/s. It might improve a

You need to increase your remoteconcurrency limit.  Unless your emails
are 10's of Megabytes each, 3/s is way low.
-- 
Blue skies...		Todd
| Get a bigger hammer!   |  All vendors suck, but different ones  |
| http://www.mrball.net  |  suck less in different applications.  |
| http://faq.mrball.net  |                --Andy Walden on NANOG  |
   Linux kernel 2.4.19-16mdk   1 user,  load average: 0.00, 0.00, 0.00

* Re: non volatile ram devices
  2002-12-06 23:50         ` Matthias Andree
  2002-12-07  4:09           ` Todd Lyons
@ 2002-12-07 10:03           ` Russell Coker
  2002-12-07 10:44             ` Valdis.Kletnieks
  1 sibling, 1 reply; 20+ messages in thread
From: Russell Coker @ 2002-12-07 10:03 UTC (permalink / raw)
  To: Matthias Andree, reiserfs-list

On Sat, 7 Dec 2002 00:50, Matthias Andree wrote:
> Russell Coker <russell@coker.com.au> writes:
> > I'm using a fairly vanilla kernel.  It's performance is 2 messages per
> > second taken from qmail spool and delivered while there is a background
> > load of pop access and new incoming mail.  IE if there is a backlog of
> > mail to deliver the backlog gets smaller by 120 messages per minute.
>
> In my benchmarks on a plain FreeBSD ffs and a Micropolis 4345WS UWSCSI
> disk drive (7200/min) that was otherwise idle, qmail maxes out for
> remote 1-to-1 deliveries at a good 3 deliveries/s. It might improve a
> little with André Oppenheimer's patches, I didn't bother to check,
> Postfix does 15/s on softupdates FreeBSD ffs, qmail does not support
> softupdates. I didn't check Linux file systems on a current disk drive
> such as Fujitsu MAH (7200/min U160 SCSI).

How does qmail not support softupdates?

> So I believe on ATA or loaded SCSI 2 messages per second is as good as
> qmail gets with its 13+ synchronous writes per delivery. It's a
> pig. Retrying with -o dirsync instead of -o sync might be worthwhile
> though. Kernel patches needed.

Well, these 13 synchronous writes are what I am trying to solve.  With
data=journal and the journal on a RAM device I expect that performance will
improve massively.

This is not a Qmail bottleneck AFAIK; Qmail is using all the disk capacity.
If I add any extra disk IO load (such as starting a process to deliver
bulletins by hard-linking directly into user Maildirs) then the system load
average increases dramatically (load average goes from ~2 to ~10 if I add an
extra process doing heavy disk writes).

-- 
http://www.coker.com.au/selinux/   My NSA Security Enhanced Linux packages
http://www.coker.com.au/bonnie++/  Bonnie++ hard drive benchmark
http://www.coker.com.au/postal/    Postal SMTP/POP benchmark
http://www.coker.com.au/~russell/  My home page


* Re: non volatile ram devices
  2002-12-07 10:03           ` Russell Coker
@ 2002-12-07 10:44             ` Valdis.Kletnieks
  0 siblings, 0 replies; 20+ messages in thread
From: Valdis.Kletnieks @ 2002-12-07 10:44 UTC (permalink / raw)
  To: Russell Coker; +Cc: reiserfs-list

On Sat, 07 Dec 2002 11:03:00 +0100, Russell Coker <russell@coker.com.au>  said:
> How does qmail not support softupdates?

I can't speak for qmail directly, but I've heard of other software that gets
indigestion because softupdates doesn't present the exact same API and view
of the world.  I think it had to do with exactly how you fsync() a directory,
and when the syscall returned, and when the data was *really* on disk.
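
One way to observe the "when the syscall returned" part is to time a
directory fsync() directly. The C sketch below does only that; the
function name is hypothetical, and it measures how long the call takes
to return, which says nothing about when the data actually reached the
platters.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <time.h>
#include <unistd.h>

/*
 * Open the directory at path read-only, fsync() it, and return the
 * wall-clock time the call took in milliseconds, or -1.0 on any error.
 * This times the syscall only; it cannot prove durability.
 */
double dir_fsync_latency_ms(const char *path)
{
	struct timespec start, end;
	double ms = -1.0;
	int fd;

	assert(path != NULL);
	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1.0;
	if (clock_gettime(CLOCK_MONOTONIC, &start) == 0 &&
	    fsync(fd) == 0 &&
	    clock_gettime(CLOCK_MONOTONIC, &end) == 0)
		ms = (end.tv_sec - start.tv_sec) * 1000.0 +
		     (end.tv_nsec - start.tv_nsec) / 1e6;
	close(fd);
	return ms;
}
```

A filesystem that returns from a directory fsync() almost instantly
right after many creates and renames may be deferring the real work,
which is the kind of difference described above.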

* Re: non volatile ram devices
  2002-12-07  4:09           ` Todd Lyons
@ 2002-12-07 17:13             ` Matthias Andree
  0 siblings, 0 replies; 20+ messages in thread
From: Matthias Andree @ 2002-12-07 17:13 UTC (permalink / raw)
  To: reiserfs-list

Todd Lyons <todd@mrball.net> writes:

>>> I'm using a fairly vanilla kernel.  It's performance is 2 messages per second 
>>> taken from qmail spool and delivered while there is a background load of pop 
>>> access and new incoming mail.  IE if there is a backlog of mail to deliver 
>>> the backlog gets smaller by 120 messages per minute.
>>In my benchmarks on a plain FreeBSD ffs and a Micropolis 4345WS UWSCSI
>>disk drive (7200/min) that was otherwise idle, qmail maxes out for
>>remote 1-to-1 deliveries at a good 3 deliveries/s. It might improve a
>
> You need to increase your remoteconcurrency limit.  Unless your emails
> are 10's of Megabytes each, 3/s is way low.

No, I don't -- for most of the test, qmail was running 0 to 2
qmail-remote processes. This only changes when the todo queue has been
drained completely and no more mail needs to be preprocessed. Only after
the todo queue is empty does qmail ramp up to the remoteconcurrency
limit. And I'd certainly not raise remoteconcurrency above 20, because
qmail would easily trample the destination host if I had many recipients
in one domain, giving me "false" deferrals (because it has run into the
tcpserver limit of the MX it's talking to).

See André Oppenheim's "silly qmail syndrome patch" and the corresponding
graphs at http://www.nrg4u.com/

-- 
Matthias Andree
