* Disk IO, "Paralllel sequential" load: read-ahead inefficient? FS tuning?
@ 2009-04-08 14:22 Frantisek Rysanek
From: Frantisek Rysanek @ 2009-04-08 14:22 UTC (permalink / raw)
To: linux-fsdevel
Dear everyone,
first of all, apologies for asking such a general question on this
fairly focused and productive mailing list... any references to other
places or previous posts are welcome :-)
Recently I've been getting requests for help with optimising the
"storage back end" on Linux-based servers run by various people that
I come in contact with.
And I'm starting to see a "pattern" in those requests for help:
typically the box in question runs some web-based application, but
essentially the traffic consists of transferring big files
back and forth. Someone uploads a file, and a number of other people
later download it. So it must be pretty similar to a busy master
distribution site of some Linux distro, or even www.kernel.org :-)
A possible difference is that in my case the stored data is accessed
more or less evenly, and the capacities served range from a few TB
to tens of TB per machine.
The proportion of writes in the total traffic can be 20% or less,
sometimes much less (less than 1%).
The key point is that many (up to many hundreds of) server threads
are reading (and writing) the files in parallel, in relatively tiny
snippets. I've read before that the load presented by such multiple
parallel sessions is "bad" and difficult to handle.
Yet I believe that the sequential nature of the individual files
does suggest some basic ideas for optimization:
1) try to massage the big individual files so that they are allocated
in large contiguous chunks on the disk. Prevent fragmentation. That
way, an individual file can be read "sequentially" = at the maximum
transfer rate in MBps. Clearly this boils down to the choice of FS
type and FS tweaking, and possibly some application-level
optimizations should help, if such can be implemented (grow the files
in big chunks; see the preallocation sketch after this list).
2) try to optimize the granularity of reads for maximum MBps
throughput. Given the behavior of common disk drives, an optimum
sequential transfer size is about 4 MB. That should correspond to the
FS allocation unit size (whatever the precise name - cluster, block
group etc.) and the RAID stripe size, if the block device is really a
striped RAID device. Next, the runtime OS behavior (read-ahead size)
should be set to this value, at least in theory. And, for optimum
performance, chunk boundaries at the various layers/levels should be
aligned. This approach based on heavy read-ahead will require lots of
free RAM in the VM, but given the typical per-thread data rate vs.
the number of threads, I guess this is essentially viable.
3) try to optimize writes for bigger transaction size. In Linux, it
takes some tweaking to the VM dirty ratio and (deadline) IO scheduler
timeouts, but ultimately it's perhaps the only bit that works
somewhat well. Unfortunately, given the relatively small proportion
of writes, this optimization has only a small effect on the total
volume of traffic. It may come in useful if you use RAID levels with
calculated parity (typically RAID 5 or 6), which reduce the write
IOps available from the spindles roughly by the number of spindles
in a stripe set...
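As a minimal sketch of the preallocation idea in point 1), assuming
the application can be patched and can grow its files in big steps
(the helper name and the 64 MB growth step below are just made-up
example figures, not anything the FS mandates):

#define _XOPEN_SOURCE 600
#include <fcntl.h>      /* posix_fallocate() */
#include <stdio.h>
#include <string.h>

/* Preallocate the file up to the next multiple of 'chunk' bytes, so the
 * filesystem can hand out large contiguous extents instead of growing
 * the file write by write. Truncate back to the real size on close. */
static int grow_in_big_chunks(int fd, off_t needed_size)
{
        const off_t chunk = 64 * 1024 * 1024;   /* arbitrary 64 MB growth step */
        off_t target = ((needed_size + chunk - 1) / chunk) * chunk;
        int err = posix_fallocate(fd, 0, target);
        if (err)
                fprintf(stderr, "posix_fallocate: %s\n", strerror(err));
        return err;
}

On a reasonably recent kernel and glibc this should translate into a
real fallocate()-style preallocation rather than zero-filling, but
that is worth verifying on the target system.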
Problems:
The FS on-disk allocation granularity and RAID device stripe sizes
available are typically much smaller than 4 MB. Especially in HW RAID
controllers, the maximum stripe size is typically limited to maybe
128 kB, which means a waste of valuable IOps if you use the RAID to
store large sequential files. A simple solution, at least for testing
purposes, is to use the native Linux software MD RAID (level 0), as
this RAID implementation accepts big chunk sizes without a problem
(I've just tested 4 MB), and to stripe together several stand-alone
mirror volumes presented by a hardware RAID. It can be seen as a
waste of the HW RAID's acceleration unit, but the resulting hybrid
RAID 10 works very well for a basic demonstration of the other
issues. There are no outright configuration bottlenecks in such a
setup: the bottom-level mirrors don't have a fixed stripe size, and
RAID 10 doesn't suffer from the RAID 5/6 "parity poison".
In particular, the intended read-ahead optimization seems
troublesome / ineffective.
I've tried with XFS, which is really the only FS eligible for volume
sizes over 16 TB, and which is also said to be well optimized for
sequential data, aiming at contiguous allocation and all that.
Everybody's using XFS in such applications.
I don't understand all the tweakable knobs of mkfs.xfs - not well
enough to match the 4MB RAID chunk size somewhere in the internal
structure of XFS.
Another problem is that there seems to be a single tweakable
read-ahead knob in Linux 2.6, accessible in several ways:
/sys/block/<dev>/queue/max_sectors_kb
/sbin/blockdev --setra
/sbin/blockdev --setfra
When speaking about read-ahead optimization, about reading big
contiguous chunks of data, I intrinsically mean per-file read-ahead
at the "filesystem payload level". And the key trouble seems to be
that Linux read-ahead takes place at the block device level. You ask
for some piece of data, and your request gets rounded up to 128 kB at
the block level (and aligned to integer blocks of that size, it would
seem). 128 kB is the default size.
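(For reference, a small sketch of what blockdev --setra/--getra does
underneath - the BLKRAGET/BLKRASET ioctls on the block device node.
The value is counted in 512-byte sectors, so 8192 corresponds to 4 MB;
the /dev/md0 path is just an example. Note that this is still the
block-device-level setting discussed above, not a per-file one.)

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>           /* BLKRAGET, BLKRASET */

int main(int argc, char **argv)
{
        const char *dev = (argc > 1) ? argv[1] : "/dev/md0";  /* example device */
        long ra = 0;
        int fd = open(dev, O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        ioctl(fd, BLKRAGET, &ra);               /* current read-ahead, in 512 B sectors */
        printf("%s: read-ahead %ld sectors\n", dev, ra);

        if (ioctl(fd, BLKRASET, 8192UL) < 0)    /* 8192 sectors = 4 MB; needs CAP_SYS_ADMIN */
                perror("BLKRASET");

        close(fd);
        return 0;
}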
As a result, interestingly, if you finish off all the aforementioned
optimizations by setting the max_sectors_kb to 4096, you don't get
higher throughput. You do get increased MBps at the block device
level (as seen in iostat), but your throughput at the level of open
files actually drops, and the number of threads "blocked in iowait"
grows.
My explanation for the "read-ahead misbehavior" is that the data is
not entirely contiguous on the disk. That the filesystem metadata
introduces some "out of sequence" reads, which result in reading
ahead 4 MB chunks of metadata and irrelevant disk space (= useless
junk). Or, possibly, that the file allocation on disk is not all that
contiguous. Essentially, that somehow the huge read-ahead spills over
into irrelevant space much too often. I've even tried to counter this
effect "in vitro" by preparing a filesystem "tiled" with 1 GB files
that I created in sequence (one by one) by calling
posix_fallocate64() on freshly opened file descriptors...
But even then the reading threads made the read-ahead misbehave in
precisely that way.
It would be excellent if the read-ahead could happen at the
"filesystem payload level" and map somehow optimally onto block-level
traffic patterns. Yet the --setfra (and --getfra) parameters of the
"blockdev" util in Linux 2.6 seem to map to the block-level read-
ahead size. Is there any other tweakable knob that I'm missing?
Based on some manpages for the madvise() and fadvise() functions, I'd
say that the level of read-ahead corresponding to MADV_SEQUENTIAL and
FADV_SEQUENTIAL is still orders of magnitude less than the desired
figure. Besides, those are syscalls that need to be called by the
application on an open file handle - it may or may not be
easy to implement those selectively enough in your app.
Consider a monster like Apache with PHP and APR bucket brigades at
the back end... Who's got the wits to implement the "advise"
syscalls in those tarballs?
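(For the record, the per-file hint itself is only a few lines of C -
a sketch of what such a patch would boil down to; the helper name is
made up, and whether the resulting read-ahead window is large enough
is exactly the open question above:)

#define _XOPEN_SOURCE 600
#include <fcntl.h>              /* posix_fadvise() */
#include <stdio.h>
#include <string.h>

/* Hint that 'fd' will be read sequentially; the kernel may then use a
 * larger read-ahead window for this file (on Linux, roughly double
 * the default). */
static void hint_sequential(int fd)
{
        int err = posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL); /* length 0 = whole file */
        if (err)
                fprintf(stderr, "posix_fadvise: %s\n", strerror(err));
}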
It would help if you could set this per mount point,
using some sysfs variable or an ioctl() call from
some small stand-alone utility.
I've dumped all the stuff I could come up with on a web page:
http://www.fccps.cz/download/adv/frr/hdd/hdd.html
It contains some primitive load generators that I'm using.
Do you have any further tips on that?
I'd love to hope that I'm just missing something simple...
Any ideas are welcome :-)
Frank Rysanek
* Re: Disk IO, "Parallel sequential" load: read-ahead inefficient? FS tuning?
From: Andi Kleen @ 2009-04-09 18:05 UTC (permalink / raw)
To: Frantisek Rysanek; +Cc: linux-fsdevel, Wu Fengguang
"Frantisek Rysanek" <Frantisek.Rysanek@post.cz> writes:
> I don't understand all the tweakable knobs of mkfs.xfs - not well
> enough to match the 4MB RAID chunk size somewhere in the internal
> structure of XFS.
If it's software RAID, recent mkfs.xfs should be able to figure
out the stripe sizes on its own.
> Another problem is that there seems to be a single tweakable
> read-ahead knob in Linux 2.6, accessible in several ways:
> /sys/block/<dev>/queue/max_sectors_kb
> /sbin/blockdev --setra
> /sbin/blockdev --setfra
unsigned long max_sane_readahead(unsigned long nr)
{
        return min(nr, (node_page_state(numa_node_id(), NR_INACTIVE_FILE)
                        + node_page_state(numa_node_id(), NR_FREE_PAGES)) / 2);
}
So you can affect it indirectly by keeping a lot of memory free
with vm.min_free_kbytes. Probably not an optimal solution.
>
> Based on some manpages for the madvise() and fadvise() functions, I'd
> say that the level of read-ahead corresponding to MADV_SEQUENTIAL and
> FADV_SEQUENTIAL is still orders of magnitude less than the desired
> figure.
Wu Fengguang (cc'ed) is doing a lot of work on the MADV_* readahead
algorithms. There was a recent new patchkit from him on linux-kernel
that you might try. It still uses strict limits, but it's better
at figuring out specific patterns.
But then if you really know very well what kind of readahead
is needed it might be best to just implement it directly in the
applications than to rely on kernel heuristics.
For example for faster booting sys_readahead() is widely used
now.
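(As a sketch of that approach - the descriptor, offset and window size
are whatever the application knows about its own access pattern, and
the helper name is made up:)

#define _GNU_SOURCE
#include <fcntl.h>              /* readahead() */
#include <stdio.h>

/* Ask the kernel to populate the page cache with 'len' bytes of 'fd'
 * starting at 'off', before the worker thread actually reads them. */
static void prefetch(int fd, off64_t off, size_t len)
{
        if (readahead(fd, off, len) < 0)
                perror("readahead");
}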
-Andi
--
ak@linux.intel.com -- Speaking for myself only.
* Re: Disk IO, "Parallel sequential" load: read-ahead inefficient? FS tuning?
From: Wu Fengguang @ 2009-04-10 2:35 UTC (permalink / raw)
To: Andi Kleen; +Cc: Frantisek Rysanek, linux-fsdevel@vger.kernel.org
On Fri, Apr 10, 2009 at 02:05:03AM +0800, Andi Kleen wrote:
> "Frantisek Rysanek" <Frantisek.Rysanek@post.cz> writes:
>
> > I don't understand all the tweakable knobs of mkfs.xfs - not well
> > enough to match the 4MB RAID chunk size somewhere in the internal
> > structure of XFS.
>
> If it's software RAID, recent mkfs.xfs should be able to figure
> out the stripe sizes on its own.
A side note on Frantisek's "perfectly aligned 4MB readahead on 4MB
file allocation on 4MB RAID chunk size" proposal:
- 4MB IO size may be good for _disk_ bandwidth but not necessarily for
the actual throughput of your applications, because of latency issues.
- a (dirty) quick solution for your big-file servers is to use a 16MB
chunk size for software RAID and a 2MB readahead size. It won't
suffer a lot from RAID5's partial-write inefficiency, because
- the write ratio is small
- the writes are mostly sequential and can be written back in bursts
The benefit for reads is that, as long as XFS keeps the file blocks
contiguous, only 1 out of 8 readahead IOs will involve two disks :-)
> > Another problem is that there seems to be a single tweakable
> > read-ahead knob in Linux 2.6, accessible in several ways:
> > /sys/block/<dev>/queue/max_sectors_kb
> > /sbin/blockdev --setra
> > /sbin/blockdev --setfra
>
> unsigned long max_sane_readahead(unsigned long nr)
> {
>         return min(nr, (node_page_state(numa_node_id(), NR_INACTIVE_FILE)
>                         + node_page_state(numa_node_id(), NR_FREE_PAGES)) / 2);
> }
>
> So you can affect it indirectly by keeping a lot of memory free
> with vm.min_free_kbytes. Probably not an optimal solution.
Of course, not even viable ;)
Here is the memory demand of concurrent readahead. For 1MB readahead
size, each stream will require about 2MB memory to keep it safe from
readahead thrashing. So for a server with 1000 streams, 2GB is enough
for readahead.
My old adaptive readahead patches can significantly reduce this
requirement - e.g. cut that 2GB down to 500MB. However, who cares
(please speak out!)? Servers seem to have plenty of memory nowadays..
> >
> > Based on some manpages for the madvise() and fadvise() functions, I'd
> > say that the level of read-ahead corresponding to MADV_SEQUENTIAL and
> > FADV_SEQUENTIAL is still orders of magnitude less than the desired
> > figure.
>
> Wu Fengguang (cc'ed) is doing a lot of work on the MADV_* readahead
> algorithms. There was a recent new patchkit from him on linux-kernel
> that you might try. It still uses strict limits, but it's better
> at figuring out specific patterns.
>
> But then if you really know very well what kind of readahead
> is needed it might be best to just implement it directly in the
> applications than to rely on kernel heuristics.
File-downloading servers typically run sequential reads/writes,
which can be well served by the kernel readahead logic.
Apache/lighttpd have the option to do mmap reads. For these sequential
mmap read workloads, these new patches are expected to serve them well:
http://lwn.net/Articles/327647/
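(A sketch of that mmap-read pattern with the MADV_SEQUENTIAL hint,
assuming the whole file is mapped in one go - the helper name is
made up:)

#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdio.h>

/* Map a whole file read-only and tell the kernel it will be walked
 * sequentially, so it can read ahead more aggressively and may drop
 * pages soon after they are read. */
static void *map_sequential(int fd, size_t *len_out)
{
        struct stat st;
        if (fstat(fd, &st) < 0) { perror("fstat"); return NULL; }

        void *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (p == MAP_FAILED) { perror("mmap"); return NULL; }

        madvise(p, st.st_size, MADV_SEQUENTIAL);        /* sequential-access hint */
        *len_out = st.st_size;
        return p;
}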
> For example for faster booting sys_readahead() is widely used
> now.
And the more portable/versatile posix_fadvise() hints :-)
Thanks,
Fengguang
* Re: Disk IO, "Parallel sequential" load: read-ahead inefficient? FS tuning?
From: Frantisek Rysanek @ 2009-04-10 6:19 UTC (permalink / raw)
To: linux-fsdevel
On 9 Apr 2009 at 20:05, Andi Kleen wrote:
[...]
> unsigned long max_sane_readahead(unsigned long nr)
[...]
> Wu Fengguang (cc'ed) is doing a lot of work on the MADV_* readahead
> algorithms. There was a recent new patchkit from him on linux-kernel
[...]
> For example for faster booting sys_readahead() is widely used now.
>
Okay, thanks for the pointers :-)
Now I know where to get follow-up reading / try some hacking...
Frank Rysanek