Re: Disk IO, "Paralllel sequential" load: read-ahead inefficient? FS tuning?

From: Wu Fengguang <fengguang.wu@intel.com>
To: Andi Kleen <andi@firstfloor.org>
Cc: Frantisek Rysanek <Frantisek.Rysanek@post.cz>,
	"linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>
Subject: Re: Disk IO, "Paralllel sequential" load: read-ahead inefficient? FS tuning?
Date: Fri, 10 Apr 2009 10:35:47 +0800
Message-ID: <20090410023547.GB6831@localhost>
In-Reply-To: <87prfl8zzk.fsf@basil.nowhere.org>

On Fri, Apr 10, 2009 at 02:05:03AM +0800, Andi Kleen wrote:
> "Frantisek Rysanek" <Frantisek.Rysanek@post.cz> writes:
> 
> > I don't understand all the tweakable knobs of mkfs.xfs - not well
> > enough to match the 4MB RAID chunk size somewhere in the internal
> > structure of XFS.
> 
> If it's software RAID, recent mkfs.xfs should be able to figure
> out the stripe sizes on its own.

A side note on Frantisek's "perfectly aligned 4MB readahead on 4MB
file allocation on 4MB RAID chunk size" proposal:
- A 4MB IO size may be good for _disk_ bandwidth, but not necessarily
  for the actual throughput of your applications, because of latency
  issues.
- A quick (and dirty) solution for your big-file servers is to use a
  16MB chunk size for software RAID and a 2MB readahead size. It won't
  suffer much from RAID5's partial-write inefficiency, because
  - the write ratio is small
  - the writes are mostly sequential and can be written back in bursts
  The benefit for reads is that, as long as XFS keeps the file blocks
  contiguous, only 1 out of 8 readahead IOs will involve two disks :-)
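
The "1 out of 8" figure can be checked with a small model. This is only
a sketch under stated assumptions: the file is laid out contiguously,
readahead windows are issued back to back, and any window that crosses
a chunk boundary touches two disks. The function name and the 1MB
misalignment are made up for illustration.

```python
MIB = 1024 * 1024


def crossing_fraction(chunk_size, readahead_size, start_offset, windows):
    """Fraction of back-to-back readahead windows that span two RAID chunks.

    Simplified model: a window starting at byte `start` covers
    [start, start + readahead_size) and counts as a two-disk IO when its
    first and last byte fall into different chunks.
    """
    crossings = 0
    for i in range(windows):
        start = start_offset + i * readahead_size
        last = start + readahead_size - 1
        if start // chunk_size != last // chunk_size:
            crossings += 1
    return crossings / windows


if __name__ == "__main__":
    # 16MB chunks, 2MB readahead, windows misaligned by 1MB (hypothetical).
    print(crossing_fraction(16 * MIB, 2 * MIB, 1 * MIB, 64))
    # 4MB chunks, 4MB readahead, same misalignment.
    print(crossing_fraction(4 * MIB, 4 * MIB, 1 * MIB, 64))
```

In this model the 16MB/2MB setup sends one window in eight across a
chunk boundary when the windows are misaligned, and none when they are
aligned. The 4MB/4MB setup only avoids two-disk IOs if the alignment is
perfect; with any misalignment every window crosses a boundary.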

> > Another problem is, that there seems to be a single tweakable knob to 
> > read-ahead in Linux 2.6, accessible in several ways:
> >   /sys/block/<dev>/queue/max_sectors_kb
> >   /sbin/blockdev --setra
> >   /sbin/blockdev --setfra
> 
> unsigned long max_sane_readahead(unsigned long nr)
> {
>         return min(nr, (node_page_state(numa_node_id(), NR_INACTIVE_FILE)
>                 + node_page_state(numa_node_id(), NR_FREE_PAGES)) / 2);
> }
> 
> So you can affect it indirectly by keeping a lot of memory free
> with vm.min_free_kbytes. Probably not an optimal solution.

Of course, not even viable ;)

Here is the memory demand of concurrent readahead.  For a 1MB readahead
size, each stream requires about 2MB of memory to stay safe from
readahead thrashing. So for a server with 1000 streams, about 2GB is
enough for readahead.
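
The estimate above is plain multiplication. A tiny sketch follows; the
helper name is made up, and applying the same factor of two to other
readahead sizes is an assumption, not something measured here.

```python
def readahead_memory_mb(streams, readahead_mb, safety_factor=2):
    """Rough page-cache demand (in MB) to keep readahead safe from thrashing.

    safety_factor=2 mirrors the "about 2MB per stream for 1MB readahead"
    rule of thumb; using it for other readahead sizes is an assumption.
    """
    return streams * readahead_mb * safety_factor


if __name__ == "__main__":
    print(readahead_memory_mb(1000, 1))  # 1000 streams, 1MB readahead
    print(readahead_memory_mb(1000, 2))  # same streams, 2MB readahead
```

With 1000 streams and 1MB readahead this gives 2000MB, i.e. roughly the
2GB quoted above.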

My old adaptive readahead patches can significantly reduce this
requirement, e.g. cut that 2GB down to 500MB. However, who cares
(please speak up!)? Servers seem to have plenty of memory nowadays.

> >
> > Based on some manpages on the madvise() and fadvise() functions, I'd 
> > say that the level of read-ahead corresponding to MADV_SEQUENTIAL and 
> > FADV_SEQUENTIAL is still decimal orders less than the desired figure.
> 
> Wu Fengguang (cc'ed) is doing a lot of work on the MADV_* readahead
> algorithms. There was a recent new patchkit from him on linux-kernel
> that you might try. It still uses strict limits, but it's better
> at figuring out specific patterns.
> 
> But then if you really know very well what kind of readahead
> is needed it might be best to just implement it directly in the
> applications than to rely on kernel heuristics.

File download servers typically do sequential reads/writes, which
can be served well by the kernel readahead logic.

Apache/lighttpd have the option to do mmap reads. For these sequential
mmap read workloads, these new patches are expected to serve them well:
http://lwn.net/Articles/327647/
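
For reference, a sequential mmap read workload looks roughly like the
sketch below from user space. This is not how Apache or lighttpd are
implemented; the function name and chunk size are made up, and the
MADV_SEQUENTIAL hint is optional (it needs Python 3.8+ on a platform
that defines it).

```python
import mmap
import os


def mmap_sequential_read(path, chunk_size=1 << 20):
    """Walk a file front to back through a read-only mapping; return bytes seen."""
    total = 0
    with open(path, "rb") as f:
        size = os.fstat(f.fileno()).st_size
        if size == 0:
            return 0  # mmap cannot map an empty file
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            # Optional hint; skipped where the platform does not provide it.
            if hasattr(m, "madvise") and hasattr(mmap, "MADV_SEQUENTIAL"):
                m.madvise(mmap.MADV_SEQUENTIAL)
            for off in range(0, size, chunk_size):
                total += len(m[off:off + chunk_size])
    return total
```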

> For example for faster booting sys_readahead() is widely used
> now.

And the more portable/versatile posix_fadvise() hints :-)
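
A minimal sketch of such a hint, using Python's os.posix_fadvise()
wrapper (Unix only, Python 3.3+). The function name and the 1MB read
size are made up for illustration; how much the kernel actually reads
ahead in response to the hint is up to the kernel.

```python
import os


def read_sequentially(path, chunk_size=1 << 20):
    """Read a whole file after declaring sequential access; return bytes read."""
    total = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        # offset=0, len=0 applies the advice to the whole file.
        if hasattr(os, "posix_fadvise"):
            os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)
        while True:
            data = os.read(fd, chunk_size)
            if not data:
                break
            total += len(data)
    finally:
        os.close(fd)
    return total
```

The hint is advisory: the read loop behaves the same without it, so the
call can be skipped on platforms that lack it.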

Thanks,
Fengguang
