From: "Frantisek Rysanek" <Frantisek.Rysanek@post.cz>
To: linux-fsdevel@vger.kernel.org
Subject: Disk IO, "Parallel sequential" load: read-ahead inefficient? FS tuning?
Date: Wed, 08 Apr 2009 16:22:45 +0200
Message-ID: <49DCCF55.32475.8FFB6157@localhost>
Dear everyone,
first of all apologies for asking such a general question in this
fairly focused and productive mailing list... any references to other
places or previous posts are welcome :-)
Recently I've been getting requests for help with optimising the
"storage back end" on Linux-based servers run by various people that
I come in contact with.
And, I'm starting to see a "pattern" in those requests for help:
typically the box in question runs some web-based application, but
essentially the traffic pattern consists of transferring big files
back'n'forth. Someone uploads a file, and a number of other people
later download it. So it must be pretty similar to a busy master
distribution site of some Linux distro or even www.kernel.org :-)
A possible difference is that the capacity served in my case is
accessed more or less evenly and amounts to a few TB up to tens of TB
per machine.
The proportion of writes in the total traffic can be 20% or less,
sometimes much less (less than 1%).
The key point is that many (up to several hundred) server threads
are reading (and writing) the files in parallel, in relatively tiny
snippets. I've read before that the load presented by such multiple
parallel sessions is "bad" and difficult to handle.
Yet I believe that the sequential nature of the individual files
does suggest some basic ideas for optimization:
1) try to massage the big individual files to be allocated in large
contiguous chunks on the disk. Prevent fragmentation. That way, an
individual file can be read "sequentially" = with maximum transfer
rate in MBps. Clearly this boils down to the choice of FS type and FS
tweaking, and possibly some application-level optimizations could
help, if they can be implemented (grow the files in big chunks).
2) try to optimize the granularity of reads for maximum MBps
throughput. Given the behavior of common disk drives, an optimum
sequential transfer size is about 4 MB. That should correspond to a
FS allocation unit size (whatever the precise name - cluster, block
group etc.) and RAID stripe size, if the block device is really a
striped RAID device. Next, the runtime OS behavior (read-ahead size)
should be set to this value, at least in theory. And, for
optimum performance, chunk boundaries at the various layers/levels
should be aligned. This approach based on heavy read-ahead will
require lots of free RAM in the VM, but given the typical per-thread
data rate vs. the number of threads, I guess this is essentially
viable.
3) try to optimize writes for a bigger transaction size. In Linux, it
takes some tweaking of the VM dirty ratios and the (deadline) IO
scheduler timeouts (the knobs I mean are sketched right after this
list), but ultimately it's perhaps the only bit that works somewhat
well. Unfortunately, given the relatively small proportion of writes,
this optimization has only a relatively small effect on the whole
volume of traffic. It may come in useful if you use RAID levels with
calculated parity (typically RAID 5 or 6), which reduce the IOps
available from the spindles when writing roughly by the number of
spindles in a stripe set...
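To be concrete, here is a minimal sketch of those write-side knobs,
assuming the deadline scheduler and a device called /dev/sda (just an
example name); the particular numbers are guesses to experiment with,
not recommendations:

  # let dirty data accumulate longer, so writeback happens in bigger batches
  sysctl -w vm.dirty_background_ratio=10
  sysctl -w vm.dirty_ratio=40
  sysctl -w vm.dirty_expire_centisecs=3000

  # use the deadline scheduler and relax its write expiry,
  # so that writes can be merged into larger requests
  echo deadline > /sys/block/sda/queue/scheduler
  echo 5000 > /sys/block/sda/queue/iosched/write_expire
  echo 4 > /sys/block/sda/queue/iosched/writes_starved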
Problems:
The FS on-disk allocation granularity and the RAID stripe sizes
available are typically much smaller than 4 MB. Especially in HW RAID
controllers, the maximum stripe size is typically limited to maybe
128 kB, which means a waste of valuable IOps if you use the RAID to
store large sequential files. A simple solution, at least for testing
purposes, is to use the Linux native software MD RAID (level 0), as
this RAID implementation accepts big chunk sizes without a problem
(I've just tested 4 MB), and to stripe together several stand-alone
mirror volumes presented by a hardware RAID controller. It can be
seen as a waste of the HW RAID's acceleration unit, but the resulting
hybrid RAID 10 works very well for a basic demonstration of the other
issues. There are no outright configuration bottlenecks in such a
setup: the bottom-level mirrors don't have a fixed stripe size, and
RAID 10 doesn't suffer from the RAID5/6 "parity poison".
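Just to illustrate the kind of setup I mean (a sketch; the device
names and member count are made up, and mdadm takes the chunk size in
kB, so 4096 means 4 MB):

  # /dev/sda and /dev/sdb are two mirror volumes exported by the HW
  # controller; stripe them together with a 4 MB chunk using MD RAID 0
  mdadm --create /dev/md0 --level=0 --chunk=4096 --raid-devices=2 \
        /dev/sda /dev/sdb
  cat /proc/mdstat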
The intended read-ahead optimization, in particular, seems
troublesome / ineffective.
I've tried with XFS, which is really the only FS eligible for volume
sizes over 16 TB, and also said to be well optimized for sequential
data, aiming at contiguous allocation and all that. Everybody's using
XFS in such applications.
I don't understand all the tweakable knobs of mkfs.xfs - not well
enough to match the 4 MB RAID chunk size to anything in the internal
structure of XFS.
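The closest candidates seem to be the su/sw (a.k.a. sunit/swidth)
options of mkfs.xfs and the allocsize mount option - a sketch,
assuming the hybrid /dev/md0 above with a 4 MB chunk and two striped
members, mounted at /data (a made-up path); I'm not at all sure this
is the right way to express the geometry:

  # tell XFS about the RAID geometry:
  # stripe unit = MD chunk size, stripe width = number of striped members
  mkfs.xfs -d su=4m,sw=2 /dev/md0
  xfs_info /dev/md0
  # grow files in bigger extents via speculative preallocation
  mount -o allocsize=64m,inode64 /dev/md0 /data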
Another problem is that there seems to be a single tweakable knob for
read-ahead in Linux 2.6, accessible in several ways:
/sys/block/<dev>/queue/max_sectors_kb
/sbin/blockdev --setra
/sbin/blockdev --setfra
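E.g., to ask for a 4 MB read-ahead on the MD device (a sketch; note
that --setra/--setfra take the value in 512-byte sectors, and /dev/md0
is again just my example device):

  blockdev --getra /dev/md0        # current read-ahead, in 512 B sectors
  blockdev --setra 8192 /dev/md0   # 8192 * 512 B = 4 MB
  blockdev --setfra 8192 /dev/md0
  # the same value appears to be visible (in kB) as
  cat /sys/block/md0/queue/read_ahead_kb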
When speaking about read-ahead optimization, about reading big
contiguous chunks of data, I implicitly mean per-file read-ahead at
the "filesystem payload level". And the key trouble seems to be that
the Linux read-ahead takes place at the block device level. You ask
for some piece of data, and your request gets rounded up to 128 kB at
the block level (and aligned to whole blocks of that size, it would
seem); 128 kB is the default size.
As a result, interestingly, if you finish off all the aforementioned
optimizations by setting the max_sectors_kb to 4096, you don't get
higher throughput. You do get increased MBps at the block device
level (as seen in iostat), but your throughput at the level of open
files actually drops, and the number of threads "blocked in iowait"
grows.
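In case it matters, this is how I watch the two levels (nothing
fancy; sda and md0 are again just my example devices):

  # block-level throughput and average request size per device
  iostat -xk sda md0 5
  # number of threads blocked in uninterruptible sleep: the 'b' column
  vmstat 5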
My explanation for the "read-ahead misbehavior" is that the data is
not entirely contiguous on the disk: that the filesystem metadata
introduces some "out of sequence" reads, which result in reading
ahead 4 MB chunks of metadata and irrelevant disk space (= useless
junk). Or, possibly, that the file allocation on disk is not all that
contiguous. Essentially, the huge read-ahead somehow overlaps much
too often into irrelevant space. I've even tried to counter this
effect "in vitro" by preparing a filesystem "tiled" with 1 GB files
that I created in sequence (one by one) by calling
posix_fallocate64() on a freshly opened file descriptor...
but even then, the reading threads made the read-ahead misbehave in
precisely that way.
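Roughly the same tiling should be reproducible from the shell with
xfs_io, if I read its manpage right - a sketch, with made-up file
names:

  # pre-allocate ten 1 GB files, one by one, hoping for contiguous extents
  for i in $(seq 0 9); do
      xfs_io -f -c "resvsp 0 1g" -c "truncate 1g" /data/tile$i
  done
  # check how contiguous the allocation really came out
  xfs_bmap -v /data/tile0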
It would be excellent if the read-ahead could happen at the
"filesystem payload level" and map somehow optimally to block-level
traffic patterns. Yet, the --setfra (and --getfra) parameters to the
"blockdev" util in Linux 2.6 seem to map to the block-level read-
ahead size. Is there any other tweakable knob that I'm missing?
Based on the manpages for the madvise() and fadvise() functions, I'd
say that the level of read-ahead corresponding to MADV_SEQUENTIAL and
FADV_SEQUENTIAL is still orders of magnitude less than the desired
figure. Besides, those are syscalls that need to be called by the
application on an open file handle - it may or may not be easy to
implement them selectively enough in your app.
Consider a monster like Apache with PHP and APR bucket brigades at
the back end... Who's got the wits to implement the "advise"
syscalls in those tarballs?
It would help if you could set this per mountpoint
using some sysfs variable or using an ioctl() call from
some small stand-alone util.
I've dumped all the stuff I could come up with on a web page:
http://www.fccps.cz/download/adv/frr/hdd/hdd.html
It contains some primitive load generators that I'm using.
Do you have any further tips on that?
I'd love to hope that I'm just missing something simple...
Any ideas are welcome :-)
Frank Rysanek