Re: Disk IO, "Paralllel sequential" load: read-ahead inefficient? FS tuning?

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Wu Fengguang <fengguang.wu@intel.com>
To: Andi Kleen <andi@firstfloor.org>
Cc: Frantisek Rysanek <Frantisek.Rysanek@post.cz>,
	"linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>
Subject: Re: Disk IO, "Paralllel sequential" load: read-ahead inefficient? FS tuning?
Date: Fri, 10 Apr 2009 10:35:47 +0800	[thread overview]
Message-ID: <20090410023547.GB6831@localhost> (raw)
In-Reply-To: <87prfl8zzk.fsf@basil.nowhere.org>

On Fri, Apr 10, 2009 at 02:05:03AM +0800, Andi Kleen wrote:
> "Frantisek Rysanek" <Frantisek.Rysanek@post.cz> writes:
> 
> > I don't understand all the tweakable knobs of mkfs.xfs - not well
> > enough to match the 4MB RAID chunk size somewhere in the internal
> > structure of XFS.
> 
> If it's software RAID recent mkfs.xfs should be able to figure
> it out the stripe sizes on its own.

A side note on Frantisek's "perfectly aligned 4MB readahead on 4MB
file allocation on 4MB RAID chunk size" proposal:
- 4MB IO size may be good for _disk_ bandwidth but not necessarily for
  the actual throughput of your applications because of latency issues.
- a (dirty) quick solution for your big-file servers is to use 16MB
  chunk size for software RAID and use 2MB readahead size. It won't
  suffer a lot from RAID5's partial write insufficiency, because
  - the write ratio is small
  - the writes are mostly sequential and can be write-back in busty
  The benefit for reads are, as long as XFS keeps the file blocks
  continuous, only 1 out of 8 readahead IO will involve two disks :-)

> > Another problem is, that there seems to be a single tweakable knob to 
> > read-ahead in Linux 2.6, accessible in several ways:
> >   /sys/block/<dev>/queue/max_sectors_kb
> >   /sbin/blockdev --setra
> >   /sbin/blockdev --setfra
> 
> unsigned long max_sane_readahead(unsigned long nr)
> {
>         return min(nr, (node_page_state(numa_node_id(), NR_INACTIVE_FILE)
>                 + node_page_state(numa_node_id(), NR_FREE_PAGES)) / 2);
> }
> 
> So you can affect it indirectly by keeping a lot of memory free
> with vm.min_free_kbytes. Probably not an optimal solution.

Of course, not even viable ;)

Here is the memory demand of concurrent readahead.  For 1MB readahead
size, each stream will require about 2MB memory to keep it safe from
readahead thrashing. So for a server with 1000 streams, 2GB is enough
for readahead.

My old adaptive readahead patches can significantly reduce this
requirement - e.g. cut that 2GB down to 500MB. However, who cares
(please speak out!)? Servers seem to be memory bounty nowadays..

> >
> > Based on some manpages on the madvise() and fadvise() functions, I'd 
> > say that the level of read-ahead corresponding to MADV_SEQUENTIAL and 
> > FADV_SEQUENTIAL is still decimal orders less than the desired figure.
> 
> Wu Fengguang (cc'ed) is doing a lot of work on the MADV_* readahead
> algorithms. There was a recent new patchkit from him on linux-kernel
> that you might try. It still uses strict limits, but it's better
> at figuring out specific patterns.
> 
> But then if you really know very well what kind of readahead
> is needed it might be best to just implement it directly in the
> applications than to rely on kernel heuristics.

File downloading servers typically run sequential reads/writes.
Which can be well served by the kernel readahead logic.

Apache/lighttpd have the option to do mmap reads. For these sequential
mmap read workloads, these new patches are expected to serve them well:
http://lwn.net/Articles/327647/

> For example for faster booting sys_readahead() is widely used
> now.

And the more portable/versatile posix_fadvise() advices :-)

Thanks,
Fengguang

next prev parent reply	other threads:[~2009-04-10  2:36 UTC|newest]

Thread overview: 4+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2009-04-08 14:22 Disk IO, "Paralllel sequential" load: read-ahead inefficient? FS tuning? Frantisek Rysanek
2009-04-09 18:05 ` Andi Kleen
2009-04-10  2:35   ` Wu Fengguang [this message]
2009-04-10  6:19   ` Frantisek Rysanek

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20090410023547.GB6831@localhost \
    --to=fengguang.wu@intel.com \
    --cc=Frantisek.Rysanek@post.cz \
    --cc=andi@firstfloor.org \
    --cc=linux-fsdevel@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.