* filesystem benchmarking fun
@ 2007-05-16 14:42 Chris Mason
  2007-05-16 16:01 ` Chuck Ebbert
  2007-05-16 18:12 ` Jan Engelhardt
  0 siblings, 2 replies; 27+ messages in thread
From: Chris Mason @ 2007-05-16 14:42 UTC
  To: linux-kernel

Hello everyone,

I've been spending some time lately on filesystem benchmarking, in part
because my pet FS project is getting more stable and closer to release.
Now seems like a good time to step back and try to find out what
workloads we think are most important and see how well Linux is doing on
them.  So, I'll start with my favorite three benchmarks and why I think
they matter.  Over time I hope to collect a bunch of results for all of
us to argue about.

* fio: http://brick.kernel.dk/snaps/
Fio can abuse a file via just about every API in the kernel: aio, dio,
syslets, splice, etc.  It can thread, fork, record and play back traces
and provides good numbers for throughput and latencies on various
sequential and random io loads.

* fs_mark: http://developer.osdl.org/dev/doubt/fs_mark/index.html
This one covers most of the 'use the FS as a database' type workloads,
and can vary the number of files, directory depth etc.  It has detailed
timings for reads, writes, unlinks and fsyncs that make it good for
simulating mail servers and other setups.

* compilebench: http://oss.oracle.com/~mason/compilebench/
Tries to benchmark the filesystem allocator by aging the FS through
simulated kernel compiles, patch runs and other operations.

It's easy to get caught up in one benchmark or another and try to use
them for bragging rights.  But, what I want to do is talk about the
workloads we're trying to optimize for and our current methods for
measuring success.  If we don't have good benchmarks for a given
workload, I'd like to try and collect ideas on how to make one.

For example, I'll pick on xfs for a minute.  compilebench shows the
default FS you get from mkfs.xfs is pretty slow for untarring a bunch of
kernel trees.  Dave Chinner gave me some mount options that make it
dramatically better, but it still writes at 10MB/s on a sata drive that
can do 80MB/s.  Ext3 is better, but still only 20MB/s. 

Both are presumably picking a reasonable file and directory layout.
Still, our writeback algorithms are clearly not optimized for this kind
of workload.  Should we fix it?

-chris


* Re: filesystem benchmarking fun
@ 2007-05-16 21:01 Al Boldi
  0 siblings, 0 replies; 27+ messages in thread
From: Al Boldi @ 2007-05-16 21:01 UTC
  To: linux-kernel

Andrew Morton wrote:
> Chris Mason <chris.mason@oracle.com> wrote:
> > > Should be: it uses first-fit.
> > >
> > > >  Looks like ext3 is just walking a list of
> > > > bh/jh, maybe we can just sort the silly thing?
> > >
> > > The IO scheduler is supposed to do that.
> > >
> > > But I don't know what's causing this.
> >
> > I had high hopes of blaming cfq, but deadline gives the same results:
> >
> > create dir kernel-0 222MB in 5.38 seconds (41.33 MB/s)
> > ... [ ~30MB/s here ] ...
> > create dir kernel-7 222MB in 8.11 seconds (27.42 MB/s)
> > create dir kernel-8 222MB in 18.39 seconds (12.09 MB/s)
> > create dir kernel-9 222MB in 6.91 seconds (32.18 MB/s)
> > create dir kernel-10 222MB in 24.32 seconds (9.14 MB/s)
> > create dir kernel-11 222MB in 12.06 seconds (18.44 MB/s)
> > create dir kernel-12 222MB in 10.95 seconds (20.31 MB/s)
> >
> > The good news is that if you let it run long enough, the times
> > stabilize.  The bad news is:
> >
> > create dir kernel-86 222MB in 15.85 seconds (14.03 MB/s)
> > create dir kernel-87 222MB in 28.67 seconds (7.76 MB/s)
> > create dir kernel-88 222MB in 18.12 seconds (12.27 MB/s)
> > create dir kernel-89 222MB in 19.77 seconds (11.25 MB/s)
>
> well hang on.  Doesn't this just mean that the first few runs were writing
> into pagecache and the later ones were blocking due to dirty-memory
> limits?
>
> Or do you have a sync in there?
>
> > echo 2048 > /sys/block/..../nr_requests didn't do it either.
> >
> > I guess I'll have systemtap tell me more about the log flushing.

Try these:
# echo anticipatory > /sys/block/.../scheduler
# echo 0 > /sys/block/.../iosched/antic_expire
# echo 192 > /sys/block/.../max_sectors_kb
# echo 192 > /sys/block/.../read_ahead_kb

These give me the best performance; most noticeably, antic_expire > 0 leaves
the IO scheduler in an apparent limbo.

see http://bugzilla.kernel.org/show_bug.cgi?id=5900


Thanks!

--
Al
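
The compilebench lines quoted above all share one shape: directory name,
size, elapsed seconds, and throughput. A minimal Python sketch for pulling
the reported MB/s out of such lines and summarizing them follows. The regular
expression is inferred only from the lines quoted in this thread, not from
compilebench's source, and the function names are made up for illustration.

```python
import re
from statistics import mean

# Pattern inferred from the quoted output, e.g.
#   create dir kernel-86 222MB in 15.85 seconds (14.03 MB/s)
LINE_RE = re.compile(
    r"create dir (?P<name>\S+) (?P<size_mb>\d+)MB in "
    r"(?P<seconds>[\d.]+) seconds \((?P<rate>[\d.]+) MB/s\)"
)


def parse_create_lines(text):
    """Return (name, seconds, reported MB/s) for each 'create dir' line.

    Matching is done anywhere in the text, so lines that still carry
    email quote prefixes ("> > ") are picked up as well.
    """
    return [
        (m["name"], float(m["seconds"]), float(m["rate"]))
        for m in LINE_RE.finditer(text)
    ]


def summarize(text):
    """Summarize the reported throughput of all matched runs."""
    rates = [rate for _, _, rate in parse_create_lines(text)]
    return {
        "runs": len(rates),
        "min": min(rates),
        "mean": mean(rates),
        "max": max(rates),
    }


# The four "stabilized" runs quoted in the thread.
SAMPLE = """\
create dir kernel-86 222MB in 15.85 seconds (14.03 MB/s)
create dir kernel-87 222MB in 28.67 seconds (7.76 MB/s)
create dir kernel-88 222MB in 18.12 seconds (12.27 MB/s)
create dir kernel-89 222MB in 19.77 seconds (11.25 MB/s)
"""

if __name__ == "__main__":
    print(summarize(SAMPLE))
```

The sketch reads the throughput figure the tool printed rather than dividing
222 by the elapsed time, because the printed size looks rounded and the two
do not agree exactly.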


* Re: filesystem benchmarking fun
@ 2007-05-17 11:52 Xu CanHao
  0 siblings, 0 replies; 27+ messages in thread
From: Xu CanHao @ 2007-05-17 11:52 UTC
  To: chris.mason, linux-kernel

On May 17, 5:10 am, Chris Mason <chris.ma...@oracle.com> wrote:
> On Wed, May 16, 2007 at 01:37:26PM -0700, Andrew Morton wrote:
> > On Wed, 16 May 2007 16:14:14 -0400
> > Chris Mason <chris.ma...@oracle.com> wrote:
>
> > > On Wed, May 16, 2007 at 01:04:13PM -0700, Andrew Morton wrote:
> > > > > The good news is that if you let it run long enough, the times
> > > > > stabilize.  The bad news is:
>
> > > > > create dir kernel-86 222MB in 15.85 seconds (14.03 MB/s)
> > > > > create dir kernel-87 222MB in 28.67 seconds (7.76 MB/s)
> > > > > create dir kernel-88 222MB in 18.12 seconds (12.27 MB/s)
> > > > > create dir kernel-89 222MB in 19.77 seconds (11.25 MB/s)
>
> > > > well hang on.  Doesn't this just mean that the first few runs were writing
> > > > into pagecache and the later ones were blocking due to dirty-memory limits?
>
> > > > Or do you have a sync in there?
>
> > > There's no sync,  but if you watch vmstat you can clearly see the log
> > > flushes, even when the overall create times are 11MB/s.  vmstat goes
> > > 30MB/s -> 4MB/s or less, then back up to 30MB/s.
>
> > How do you know that it is a log flush rather than, say, pdflush
> > hitting the blockdev inode and doing a big seeky write?
>
> I don't...it gets especially tricky because ext3_writepage starts
> a transaction, and so pdflush does hit the log flushing code too.
>
> So, in comes systemtap.  I instrumented submit_bh to look for seeks
> (defined as writes more than 16 blocks apart) when the process was
> inside __log_wait_for_space.  The probe is attached, it is _really_
> quick and dirty because I'm about to run out the door.
>
> Watching vmstat, every time the __log_wait_for_space hits lots of seeks,
> vmstat goes into the 2-4MB/s range.  Not a scientific match up, but
> here's some sample output:
>
> 7824 ext3 done waiting for space total wrote 3155 blocks seeks 2241
> 7827 ext3 done waiting for space total wrote 855 blocks seeks 598
> 7827 ext3 done waiting for space total wrote 2547 blocks seeks 1759
> 7653 ext3 done waiting for space total wrote 2273 blocks seeks 1609
>
> I also recorded the total size of each seek; 66% of them were 6000
> blocks or more.
>
> -chris
>
> [jbd.tap]
>
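> global in_process  # pid -> 1 while that pid is inside __log_wait_for_space
> global writers     # pid -> buffers submitted while waiting for log space
> global last        # pid -> block number of the previous submit_bh
> global seeks       # pid -> submissions more than 16 blocks from the last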
>
> probe kernel.function("__log_wait_for_space@fs/jbd/checkpoint.c") {
>     printf("%d ext3 waiting for space\n", pid())
>     p = pid()
>     writers[p] = 0
>     in_process[p] = 1
>     last[p] = 0
>     seeks[p] = 0
>
> }
>
> probe kernel.function("__log_wait_for_space@fs/jbd/checkpoint.c").return {
>     p = pid()
>     in_process[p] = 0
>     printf("%d ext3 done waiting for space total wrote %d blocks seeks %d\n", p,
>           writers[p], seeks[p])
>
> }
>
> probe kernel.function("submit_bh") {
>     p = pid()
>     in_proc = in_process[p]
>     if (in_proc != 0) {
>         writers[p] += 1
>         block = $bh->b_blocknr
>         last_block = last[p]
>         diff = 0
>         if (last_block != 0) {
>             if (last_block < block && block - last_block > 16) {
>                 diff = block - last_block
>             }
>             if (last_block > block && last_block - block > 16) {
>                 diff = last_block - block
>             }
>         }
>
>         last[p] = block
>         if (diff != 0) {
>             printf("seek log write pid %d last %d this %d diff %d\n",
>                        p, last_block, block, diff);
>             seeks[p] += 1
>         }
>     }
>
> }

To Chris Mason:

I see that your file-system aging methodology is much the same as
here: http://defragfs.sourceforge.net/theory.html

Would it be useful to you?
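
The seek rule in the quoted jbd.tap probe (a submission counts as a seek when
it lands more than 16 blocks away from the previous one) can be sketched
offline in Python. This is an illustration of the counting rule only, not a
replacement for the probe: the function name and the example block list are
hypothetical, and the input is assumed to be a plain sequence of block
numbers in submission order.

```python
SEEK_THRESHOLD = 16  # blocks; the same cutoff the quoted probe uses


def count_seeks(blocks, threshold=SEEK_THRESHOLD):
    """Count submissions further than `threshold` blocks from the previous one.

    Mirrors the quoted probe: a previous block number of 0 means "no
    previous block yet", so the first submission is never a seek.
    Returns (number of seeks, list of seek distances in blocks).
    """
    seeks = 0
    distances = []
    last_block = 0
    for block in blocks:
        if last_block != 0:
            diff = abs(block - last_block)
            if diff > threshold:
                seeks += 1
                distances.append(diff)
        last_block = block
    return seeks, distances


if __name__ == "__main__":
    # Hypothetical submission order: three nearby blocks, a long jump,
    # one neighbour, then a long jump back.
    example = [100, 101, 110, 7000, 7001, 500]
    seeks, distances = count_seeks(example)
    print(seeks, distances)
```

Applied to the sample output quoted above, dividing "seeks" by "blocks" gives
roughly 0.7 on each of the four lines (for example 2241 / 3155), which fits
the throughput drop seen in vmstat during those periods.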

Thread overview: 27+ messages
2007-05-16 14:42 filesystem benchmarking fun Chris Mason
2007-05-16 16:01 ` Chuck Ebbert
2007-05-16 17:11   ` Chris Mason
2007-05-16 18:25     ` Andrew Morton
2007-05-16 19:13       ` Chris Mason
2007-05-16 19:33         ` Andrew Morton
2007-05-16 19:53           ` Chris Mason
2007-05-16 20:04             ` Andrew Morton
2007-05-16 20:14               ` Chris Mason
2007-05-16 20:37                 ` Andrew Morton
2007-05-16 21:02                   ` Chris Mason
2007-05-24 17:29                     ` Vara Prasad
2007-05-22 16:35                   ` Chris Mason
2007-05-22 17:50                     ` John Stoffel
2007-05-22 18:12                       ` Chris Mason
2007-05-22 18:21                     ` Andrew Morton
2007-05-22 18:39                       ` Chris Mason
2007-05-22 21:25                       ` Matt Mackall
2007-05-25  7:14                         ` Jens Axboe
2007-05-16 18:12 ` Jan Engelhardt
2007-05-16 19:12   ` Jeff Garzik
2007-05-16 19:16     ` Jeffrey Hundstad
2007-05-16 19:21       ` Jan Engelhardt
2007-05-18  3:32     ` Eric Sandeen
2007-05-16 19:25   ` Chris Mason
  -- strict thread matches above, loose matches on Subject: below --
2007-05-16 21:01 Al Boldi
2007-05-17 11:52 Xu CanHao
