From: Chris Mason <chris.mason@oracle.com>
To: Freek Dijkstra <Freek.Dijkstra@sara.nl>
Cc: "linux-btrfs@vger.kernel.org" <linux-btrfs@vger.kernel.org>,
"axboe@kernel.dk" <axboe@kernel.dk>
Subject: Re: Poor read performance on high-end server
Date: Fri, 6 Aug 2010 07:41:44 -0400
Message-ID: <20100806114144.GB29846@think>
In-Reply-To: <4C5B2B42.3030407@sara.nl>
On Thu, Aug 05, 2010 at 11:21:06PM +0200, Freek Dijkstra wrote:
> Chris Mason wrote:
>
> > Basically we have two different things to tune. First the block layer
> > and then btrfs.
>
>
> > And then we need to setup a fio job file that hammers on all the ssds at
> > once. I'd have it use aio/dio and talk directly to the drives.
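For reference, the kind of job file I mean looks roughly like this (a
sketch only; device paths and sizes are illustrative):

    [global]
    rw=read
    bs=20m
    size=32g
    # O_DIRECT, straight at the block devices
    direct=1
    ioengine=libaio
    iodepth=1

    [f1]
    filename=/dev/sdd
    # one [fN] stanza per drive: /dev/sde, /dev/sdf, ...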
>
> Thanks. First one disk:
>
> > f1: (groupid=0, jobs=1): err= 0: pid=6273
> > read : io=32780MB, bw=260964KB/s, iops=12, runt=128626msec
> > clat (usec): min=74940, max=80721, avg=78449.61, stdev=923.24
> > bw (KB/s) : min=240469, max=269981, per=100.10%, avg=261214.77, stdev=2765.91
> > cpu : usr=0.01%, sys=2.69%, ctx=1747, majf=0, minf=5153
> > IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
> > submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> > complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> > issued r/w: total=1639/0, short=0/0
> >
> > lat (msec): 100=100.00%
> >
> > Run status group 0 (all jobs):
> > READ: io=32780MB, aggrb=260963KB/s, minb=267226KB/s, maxb=267226KB/s, mint=128626msec, maxt=128626msec
> >
> > Disk stats (read/write):
> > sdd: ios=261901/0, merge=0/0, ticks=10135270/0, in_queue=10136460, util=99.30%
>
> So 255 MiByte/s.
> Out of curiosity, what is the distinction between the reported figures
> of 260964 kiB/s, 261214.77 kiB/s, 267226 kiB/s and 260963 kiB/s?
When there is only one job, they should all be the same. aggrb is the
total bandwidth seen across all the jobs, minb is the lowest per-job
bandwidth, and maxb is the highest.
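In the 16-disk run below, for instance, aggrb=3301MB/s is the sum over
all 16 jobs, while minb=216323KB/s and maxb=219763KB/s are the slowest
and fastest individual jobs in the group.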
>
>
> Now 16 disks (abbreviated):
>
> > ~/fio# ./fio ssd.fio
> > Starting 16 processes
> > f1: (groupid=0, jobs=1): err= 0: pid=4756
> > read : io=32780MB, bw=212987KB/s, iops=10, runt=157600msec
> > clat (msec): min=75, max=138, avg=96.15, stdev= 4.47
> > lat (msec): min=75, max=138, avg=96.15, stdev= 4.47
> > bw (KB/s) : min=153121, max=268968, per=6.31%, avg=213181.15, stdev=9052.26
> > cpu : usr=0.00%, sys=1.71%, ctx=2737, majf=0, minf=5153
> > IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
> > submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> > complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> > issued r/w: total=1639/0, short=0/0
> >
> > lat (msec): 100=97.99%, 250=2.01%
> > Run status group 0 (all jobs):
> > READ: io=524480MB, aggrb=3301MB/s, minb=216323KB/s, maxb=219763KB/s, mint=156406msec, maxt=158893msec
> So, the aggregate throughput for these 16 disks is 3301 MiByte/s.
>
> I also tried hardware RAID (2 sets of 8 disks), and got a similar result:
>
> > Run status group 0 (all jobs):
> > READ: io=65560MB, aggrb=3024MB/s, minb=1548MB/s, maxb=1550MB/s, mint=21650msec, maxt=21681msec
Great, so we know the drives are fast.
>
>
>
> > fio should be able to push these devices up to the line speed. If it
> > doesn't I would suggest changing elevators (deadline, cfq, noop) and
> > bumping the max request size to the max supported by the device.
>
> 3301 MiByte/s seems like a reasonable number, given the theoretical
> maximum of 16 times the single-disk performance: 16 * 256 MiByte/s =
> 4096 MiByte/s.
>
> Based on this, I have not looked at tuning. Would you recommend that I do?
>
> Our minimal goal is 2500 MiByte/s; that seems achievable as ZFS was able
> to reach 2750 MiByte/s without tuning.
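If you do want to experiment with the elevator and request-size knobs I
mentioned above, they live in sysfs; a sketch (sdd is just an example
device):

    # switch the I/O scheduler for one device
    echo deadline > /sys/block/sdd/queue/scheduler
    # check what the hardware supports, then raise the request size
    cat /sys/block/sdd/queue/max_hw_sectors_kb
    echo 1024 > /sys/block/sdd/queue/max_sectors_kb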
>
> > When we have a config that does so, we can tune the btrfs side of things
> > as well.
>
> Some files are created in the root folder of the mount point, but I get
> errors instead of results:
>
Someone else mentioned that btrfs only gained DIO reads in 2.6.35. I
think you'll get the best results with that kernel if you can find an
update.
If not, you can change the fio job file to remove direct=1 and increase the
bs flag up to 20M.
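Roughly like this (a sketch; the mount point and job count are
illustrative):

    [global]
    rw=read
    # buffered reads: no direct=1, so the page cache and readahead do the work
    bs=20m
    size=32g
    ioengine=sync
    numjobs=16
    # fio creates its own test files under the btrfs mount
    directory=/mnt/btrfs

    [bufread]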
I'd also suggest changing /sys/class/bdi/btrfs-1/read_ahead_kb to a
bigger number. Try 20480.
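That is:

    echo 20480 > /sys/class/bdi/btrfs-1/read_ahead_kb

(the exact btrfs-N instance name can vary per mount; check
ls /sys/class/bdi/ to find it).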
-chris