* Disk IO, "Paralllel sequential" load: read-ahead inefficient? FS tuning?
@ 2009-04-08 14:22 Frantisek Rysanek
From: Frantisek Rysanek @ 2009-04-08 14:22 UTC (permalink / raw)
To: linux-fsdevel
Dear everyone,
first of all, apologies for asking such a general question on this
fairly focused and productive mailing list... any references to other
places or previous posts are welcome :-)
Recently I've been getting requests for help with optimising the
"storage back end" on Linux-based servers run by various people that
I come in contact with.
And I'm starting to see a "pattern" in those requests for help:
typically the box in question runs some web-based application, but
essentially the traffic consists of transferring big files
back and forth. Someone uploads a file, and a number of other people
later download it. So it must be pretty similar to a busy master
distribution site of some Linux distro, or even www.kernel.org :-)
A possible difference is that in my case the stored data is accessed
more or less evenly, and the capacities served range from a few TB
to tens of TB per machine.
The proportion of writes in the total traffic can be 20% or less,
sometimes much less (less than 1%).
The key point is that many (up to many hundreds of) server threads
are reading (and writing) the files in parallel, in relatively tiny
snippets. I've read before that the load presented by such multiple
parallel sessions is "bad" and difficult to handle.
Yet I believe that the sequential nature of the individual files
does suggest some basic ideas for optimization:
1) try to massage the big individual files so that they are allocated
in large contiguous chunks on the disk. Prevent fragmentation. That
way, an individual file can be read "sequentially" = at the maximum
transfer rate in MBps. Clearly this boils down to the choice of FS
type and FS tweaking, and possibly some application-level
optimizations should help, if such can be implemented (grow the files
in big chunks; see the preallocation sketch after this list).
2) try to optimize the granularity of reads for maximum MBps
throughput. Given the behavior of common disk drives, an optimum
sequential transfer size is about 4 MB. That should correspond to the
FS allocation unit size (whatever the precise name - cluster, block
group etc.) and the RAID stripe size, if the block device is really a
striped RAID device. Next, the runtime OS behavior (read-ahead size)
should be set to this value, at least in theory. And, for optimum
performance, chunk boundaries at the various layers/levels should be
aligned. This approach based on heavy read-ahead will require lots of
free RAM in the VM, but given the typical per-thread data rate vs.
the number of threads, I guess this is essentially viable.
3) try to optimize writes for bigger transaction size. In Linux, it
takes some tweaking to the VM dirty ratio and (deadline) IO scheduler
timeouts, but ultimately it's perhaps the only bit that works
somewhat well. Unfortunately, given the relatively small proportion
of writes, this optimization has only a small effect on the total
volume of traffic. It may come in useful if you use RAID levels with
calculated parity (typically RAID 5 or 6), which reduce the write
IOps available from the spindles roughly by the number of spindles
in a stripe set...
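As a minimal sketch of the preallocation idea in point 1), assuming
the application can be patched and can grow its files in big steps
(the helper name and the 64 MB growth step below are just made-up
example figures, not anything the FS mandates):

#define _XOPEN_SOURCE 600
#include <fcntl.h>      /* posix_fallocate() */
#include <stdio.h>
#include <string.h>

/* Preallocate the file up to the next multiple of 'chunk' bytes, so the
 * filesystem can hand out large contiguous extents instead of growing
 * the file write by write. Truncate back to the real size on close. */
static int grow_in_big_chunks(int fd, off_t needed_size)
{
        const off_t chunk = 64 * 1024 * 1024;   /* arbitrary 64 MB growth step */
        off_t target = ((needed_size + chunk - 1) / chunk) * chunk;
        int err = posix_fallocate(fd, 0, target);
        if (err)
                fprintf(stderr, "posix_fallocate: %s\n", strerror(err));
        return err;
}

On a reasonably recent kernel and glibc this should translate into a
real fallocate()-style preallocation rather than zero-filling, but
that is worth verifying on the target system.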
Problems:
The FS on-disk allocation granularity and RAID device stripe sizes
available are typically much smaller than 4 MB. Especially in HW RAID
controllers, the maximum stripe size is typically limited to maybe
128 kB, which means a waste of valuable IOps if you use the RAID to
store large sequential files. A simple solution, at least for testing
purposes, is to use the native Linux software MD RAID (level 0), as
this RAID implementation accepts big chunk sizes without a problem
(I've just tested 4 MB), and to stripe together several stand-alone
mirror volumes presented by a hardware RAID. It can be seen as a
waste of the HW RAID's acceleration unit, but the resulting hybrid
RAID 10 works very well for a basic demonstration of the other
issues. There are no outright configuration bottlenecks in such a
setup: the bottom-level mirrors don't have a fixed stripe size, and
RAID 10 doesn't suffer from the RAID 5/6 "parity poison".
In particular, the intended read-ahead optimization seems
troublesome / ineffective.
I've tried with XFS, which is really the only FS eligible for volume
sizes over 16 TB, and which is also said to be well optimized for
sequential data, aiming at contiguous allocation and all that.
Everybody's using XFS in such applications.
I don't understand all the tweakable knobs of mkfs.xfs - not well
enough to match the 4MB RAID chunk size somewhere in the internal
structure of XFS.
Another problem is that there seems to be a single tweakable
read-ahead knob in Linux 2.6, accessible in several ways:
/sys/block/<dev>/queue/max_sectors_kb
/sbin/blockdev --setra
/sbin/blockdev --setfra
When speaking about read-ahead optimization, about reading big
contiguous chunks of data, I intrinsically mean per-file read-ahead
at the "filesystem payload level". And the key trouble seems to be
that Linux read-ahead takes place at the block device level. You ask
for some piece of data, and your request gets rounded up to 128 kB at
the block level (and aligned to integer blocks of that size, it would
seem). 128 kB is the default size.
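(For reference, a small sketch of what blockdev --setra/--getra does
underneath - the BLKRAGET/BLKRASET ioctls on the block device node.
The value is counted in 512-byte sectors, so 8192 corresponds to 4 MB;
the /dev/md0 path is just an example. Note that this is still the
block-device-level setting discussed above, not a per-file one.)

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>           /* BLKRAGET, BLKRASET */

int main(int argc, char **argv)
{
        const char *dev = (argc > 1) ? argv[1] : "/dev/md0";  /* example device */
        long ra = 0;
        int fd = open(dev, O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        ioctl(fd, BLKRAGET, &ra);               /* current read-ahead, in 512 B sectors */
        printf("%s: read-ahead %ld sectors\n", dev, ra);

        if (ioctl(fd, BLKRASET, 8192UL) < 0)    /* 8192 sectors = 4 MB; needs CAP_SYS_ADMIN */
                perror("BLKRASET");

        close(fd);
        return 0;
}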
As a result, interestingly, if you finish off all the aforementioned
optimizations by setting the max_sectors_kb to 4096, you don't get
higher throughput. You do get increased MBps at the block device
level (as seen in iostat), but your throughput at the level of open
files actually drops, and the number of threads "blocked in iowait"
grows.
My explanation for the "read-ahead misbehavior" is that the data is
not entirely contiguous on the disk. That the filesystem metadata
introduces some "out of sequence" reads, which result in reading
ahead 4 MB chunks of metadata and irrelevant disk space (= useless
junk). Or, possibly, that the file allocation on disk is not all that
contiguous. Essentially, that somehow the huge read-ahead spills over
into irrelevant space much too often. I've even tried to counter this
effect "in vitro" by preparing a filesystem "tiled" with 1 GB files
that I created in sequence (one by one) by calling
posix_fallocate64() on freshly opened file descriptors...
But even then the reading threads made the read-ahead misbehave in
precisely that way.
It would be excellent if the read-ahead could happen at the
"filesystem payload level" and map somehow optimally onto block-level
traffic patterns. Yet the --setfra (and --getfra) parameters of the
"blockdev" util in Linux 2.6 seem to map to the block-level read-
ahead size. Is there any other tweakable knob that I'm missing?
Based on some manpages for the madvise() and fadvise() functions, I'd
say that the level of read-ahead corresponding to MADV_SEQUENTIAL and
FADV_SEQUENTIAL is still orders of magnitude less than the desired
figure. Besides, those are syscalls that need to be called by the
application on an open file handle - it may or may not be
easy to implement those selectively enough in your app.
Consider a monster like Apache with PHP and APR bucket brigades at
the back end... Who's got the wits to implement the "advise"
syscalls in those tarballs?
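(For the record, the per-file hint itself is only a few lines of C -
a sketch of what such a patch would boil down to; the helper name is
made up, and whether the resulting read-ahead window is large enough
is exactly the open question above:)

#define _XOPEN_SOURCE 600
#include <fcntl.h>              /* posix_fadvise() */
#include <stdio.h>
#include <string.h>

/* Hint that 'fd' will be read sequentially; the kernel may then use a
 * larger read-ahead window for this file (on Linux, roughly double
 * the default). */
static void hint_sequential(int fd)
{
        int err = posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL); /* length 0 = whole file */
        if (err)
                fprintf(stderr, "posix_fadvise: %s\n", strerror(err));
}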
It would help if you could set this per mount point,
using some sysfs variable or an ioctl() call from
some small stand-alone utility.
I've dumped all the stuff I could come up with on a web page:
http://www.fccps.cz/download/adv/frr/hdd/hdd.html
It contains some primitive load generators that I'm using.
Do you have any further tips on that?
I'd love to hope that I'm just missing something simple...
Any ideas are welcome :-)
Frank Rysanek
* Re: Disk IO, "Parallel sequential" load: read-ahead inefficient? FS tuning?
From: Andi Kleen @ 2009-04-09 18:05 UTC (permalink / raw)
To: Frantisek Rysanek; +Cc: linux-fsdevel, Wu Fengguang
"Frantisek Rysanek" <Frantisek.Rysanek@post.cz> writes:
> I don't understand all the tweakable knobs of mkfs.xfs - not well
> enough to match the 4MB RAID chunk size somewhere in the internal
> structure of XFS.
If it's software RAID, recent mkfs.xfs should be able to figure
out the stripe sizes on its own.
> Another problem is that there seems to be a single tweakable
> read-ahead knob in Linux 2.6, accessible in several ways:
> /sys/block/<dev>/queue/max_sectors_kb
> /sbin/blockdev --setra
> /sbin/blockdev --setfra
unsigned long max_sane_readahead(unsigned long nr)
{
        return min(nr, (node_page_state(numa_node_id(), NR_INACTIVE_FILE)
                        + node_page_state(numa_node_id(), NR_FREE_PAGES)) / 2);
}
So you can affect it indirectly by keeping a lot of memory free
with vm.min_free_kbytes. Probably not an optimal solution.
>
> Based on some manpages for the madvise() and fadvise() functions, I'd
> say that the level of read-ahead corresponding to MADV_SEQUENTIAL and
> FADV_SEQUENTIAL is still orders of magnitude less than the desired
> figure.
Wu Fengguang (cc'ed) is doing a lot of work on the MADV_* readahead
algorithms. There was a recent new patchkit from him on linux-kernel
that you might try. It still uses strict limits, but it's better
at figuring out specific patterns.
But then if you really know very well what kind of readahead
is needed it might be best to just implement it directly in the
applications than to rely on kernel heuristics.
For example for faster booting sys_readahead() is widely used
now.
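(As a sketch of that approach - the descriptor, offset and window size
are whatever the application knows about its own access pattern, and
the helper name is made up:)

#define _GNU_SOURCE
#include <fcntl.h>              /* readahead() */
#include <stdio.h>

/* Ask the kernel to populate the page cache with 'len' bytes of 'fd'
 * starting at 'off', before the worker thread actually reads them. */
static void prefetch(int fd, off64_t off, size_t len)
{
        if (readahead(fd, off, len) < 0)
                perror("readahead");
}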
-Andi
--
ak@linux.intel.com -- Speaking for myself only.
* Re: Disk IO, "Parallel sequential" load: read-ahead inefficient? FS tuning?
From: Wu Fengguang @ 2009-04-10 2:35 UTC (permalink / raw)
To: Andi Kleen; +Cc: Frantisek Rysanek, linux-fsdevel@vger.kernel.org
On Fri, Apr 10, 2009 at 02:05:03AM +0800, Andi Kleen wrote:
> "Frantisek Rysanek" <Frantisek.Rysanek@post.cz> writes:
>
> > I don't understand all the tweakable knobs of mkfs.xfs - not well
> > enough to match the 4MB RAID chunk size somewhere in the internal
> > structure of XFS.
>
> If it's software RAID, recent mkfs.xfs should be able to figure
> out the stripe sizes on its own.
A side note on Frantisek's "perfectly aligned 4MB readahead on 4MB
file allocation on 4MB RAID chunk size" proposal:
- 4MB IO size may be good for _disk_ bandwidth but not necessarily for
the actual throughput of your applications, because of latency issues.
- a (dirty) quick solution for your big-file servers is to use a 16MB
chunk size for software RAID and a 2MB readahead size. It won't
suffer a lot from RAID5's partial-write inefficiency, because
- the write ratio is small
- the writes are mostly sequential and can be written back in bursts
The benefit for reads is that, as long as XFS keeps the file blocks
contiguous, only 1 out of 8 readahead IOs will involve two disks :-)
> > Another problem is that there seems to be a single tweakable
> > read-ahead knob in Linux 2.6, accessible in several ways:
> > /sys/block/<dev>/queue/max_sectors_kb
> > /sbin/blockdev --setra
> > /sbin/blockdev --setfra
>
> unsigned long max_sane_readahead(unsigned long nr)
> {
>         return min(nr, (node_page_state(numa_node_id(), NR_INACTIVE_FILE)
>                         + node_page_state(numa_node_id(), NR_FREE_PAGES)) / 2);
> }
>
> So you can affect it indirectly by keeping a lot of memory free
> with vm.min_free_kbytes. Probably not an optimal solution.
Of course, not even viable ;)
Here is the memory demand of concurrent readahead. For 1MB readahead
size, each stream will require about 2MB memory to keep it safe from
readahead thrashing. So for a server with 1000 streams, 2GB is enough
for readahead.
My old adaptive readahead patches can significantly reduce this
requirement - e.g. cut that 2GB down to 500MB. However, who cares
(please speak out!)? Servers seem to have plenty of memory nowadays..
> >
> > Based on some manpages for the madvise() and fadvise() functions, I'd
> > say that the level of read-ahead corresponding to MADV_SEQUENTIAL and
> > FADV_SEQUENTIAL is still orders of magnitude less than the desired
> > figure.
>
> Wu Fengguang (cc'ed) is doing a lot of work on the MADV_* readahead
> algorithms. There was a recent new patchkit from him on linux-kernel
> that you might try. It still uses strict limits, but it's better
> at figuring out specific patterns.
>
> But then if you really know very well what kind of readahead
> is needed it might be best to just implement it directly in the
> applications than to rely on kernel heuristics.
File-downloading servers typically run sequential reads/writes,
which can be well served by the kernel readahead logic.
Apache/lighttpd have the option to do mmap reads. For these sequential
mmap read workloads, these new patches are expected to serve them well:
http://lwn.net/Articles/327647/
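(A sketch of that mmap-read pattern with the MADV_SEQUENTIAL hint,
assuming the whole file is mapped in one go - the helper name is
made up:)

#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdio.h>

/* Map a whole file read-only and tell the kernel it will be walked
 * sequentially, so it can read ahead more aggressively and may drop
 * pages soon after they are read. */
static void *map_sequential(int fd, size_t *len_out)
{
        struct stat st;
        if (fstat(fd, &st) < 0) { perror("fstat"); return NULL; }

        void *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (p == MAP_FAILED) { perror("mmap"); return NULL; }

        madvise(p, st.st_size, MADV_SEQUENTIAL);        /* sequential-access hint */
        *len_out = st.st_size;
        return p;
}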
> For example for faster booting sys_readahead() is widely used
> now.
And the more portable/versatile posix_fadvise() hints :-)
Thanks,
Fengguang
* Re: Disk IO, "Parallel sequential" load: read-ahead inefficient? FS tuning?
From: Frantisek Rysanek @ 2009-04-10 6:19 UTC (permalink / raw)
To: linux-fsdevel
On 9 Apr 2009 at 20:05, Andi Kleen wrote:
[...]
> unsigned long max_sane_readahead(unsigned long nr)
[...]
> Wu Fengguang (cc'ed) is doing a lot of work on the MADV_* readahead
> algorithms. There was a recent new patchkit from him on linux-kernel
[...]
> For example for faster booting sys_readahead() is widely used now.
>
Okay, thanks for the pointers :-)
Now I know where to get follow-up reading / try some hacking...
Frank Rysanek