From: Nikolai Joukov <kolya@cs.sunysb.edu>
To: Al Boldi <a1426z@gawab.com>
Cc: linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
linux-raid@vger.kernel.org
Subject: Re: [ANNOUNCE] RAIF: Redundant Array of Independent Filesystems
Date: Sat, 16 Dec 2006 12:39:45 -0500 (EST)
Message-ID: <Pine.GSO.4.53.0612161118010.24531@compserv1>
In-Reply-To: <200612161635.49502.a1426z@gawab.com>
> I am looking at filling the net-pipe, and it only reaches 40-75% max, with
> some short 100% bursts, and a slow 10% start. It seems that caching
> somewhat delays the writes, which then batch up and sync at various speeds.
> So you have the cache really hiding slow sync speeds. To tune this, it may
> be helpful to turn off caching, which in turn could surface the actual
> bottlenecks.
Well, RAIF is an external kernel module and as such cannot change the
caching behavior much. A notorious problem of all Linux stackable file
systems is double caching of data. Every stackable file system caches the
data at its own level and copies it from/to the lower file system's cached
pages when necessary. This has some advantages for file systems that
change the data (e.g., encrypt or compress). However, it effectively
cuts the system's usable cache memory by a factor of two or more.
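To make the cache-size effect concrete, here is a toy user-space model.
It is only a sketch and not RAIF code: the LRU policy, the class name
PageCache, and the function distinct_pages_cached are all assumptions
made for illustration, and real Linux page reclaim is more involved.
It counts how many distinct file pages remain cached after a long
sequential write when every page is held once at the upper level and
once per lower copy.

```python
from collections import OrderedDict


class PageCache:
    """Tiny LRU page cache holding at most `capacity` pages (assumed policy)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.pages = OrderedDict()

    def touch(self, key):
        # Insert or refresh a page; evict the least recently used one if full.
        if key in self.pages:
            self.pages.move_to_end(key)
        else:
            self.pages[key] = True
            if len(self.pages) > self.capacity:
                self.pages.popitem(last=False)


def distinct_pages_cached(capacity, n_pages, lower_copies, cache_upper=True):
    """Write n_pages sequentially; return how many distinct file pages stay cached."""
    cache = PageCache(capacity)
    for page in range(n_pages):
        if cache_upper:
            # Copy held by the stackable (upper) file system.
            cache.touch(("upper", page))
        for branch in range(lower_copies):
            # Copy held by each lower file system.
            cache.touch(("lower", branch, page))
    # The last tuple element is the file page number for both key shapes.
    return len({key[-1] for key in cache.pages})
```

In this model, a 1200-page cache with one lower copy ends up holding 600
distinct file pages, and with three lower copies it holds 300. With no
upper-level copy (cache_upper=False) and one lower copy, all 1200 slots
hold distinct pages.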
Among all the existing stackable file systems, Tracefs and RAIF have the
strictest low-overhead requirements. Here are the solutions to the
double-caching problem that we tried:
1. Redirect read and write requests to the lower file systems directly.
This avoids caching data at the RAIF level. However, this
optimization must be turned off as soon as a file is mmap'ed to avoid
cache inconsistencies. Also, even if the file is not mmap'ed, RAIF1
still keeps the data copies for all the lower branches. This
optimization is implemented in RAIF but is not turned on by default.
(We strip this and many other #ifdef'ed code fragments from the
code releases automatically.)
2. We cache the data at the RAIF level. When we write to the lower file
systems we do allocate the lower pages and do copy the data, but we
also mark the lower pages with the PG_reclaim flag before calling the
lower writepage operation. As a result, all the lower pages are released
right after the write completes. This works fine for mmap'ed files and
is the default RAIF behavior now. It solves the problem for most
workloads that mix reads and writes. For example, it improved
Postmark's performance severalfold. Unfortunately, this optimization
does not improve performance for big sequential writes - the workload
that you tried. So essentially, you had a quarter of your original page
cache while running your workload.
3. A known ideal solution for this problem is sharing of the cached pages
between file systems. We attempted to do it for Tracefs but the
resulting code is not beautiful and is potentially racy:
<http://marc.theaimsgroup.com/?l=linux-fsdevel&m=113193082115222&w=2>
Unfortunately, for fan-out file systems this solution requires even
more support from the OS. Most other OSs (including BSD and Windows)
do share cached pages this way, but unfortunately Linux does not :-(
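The mode switch in solution 1 above can be sketched in user space as
follows. This is a hypothetical model, not the RAIF implementation
(which is kernel C code): the classes LowerFile and StackedFile, the
PAGE_SIZE value, and the synchronous write-through in cached mode are
all assumptions made to keep the example short. It only shows the
dispatch idea: bypass the upper-level cache until the file is mmap'ed,
then route everything through it.

```python
PAGE_SIZE = 4096  # assumed page size for the sketch


class LowerFile:
    """Stand-in for a file on a lower file system, with its own cached pages."""

    def __init__(self):
        self.pages = {}  # page index -> bytes

    def write_page(self, index, data):
        self.pages[index] = data

    def read_page(self, index):
        # Unwritten pages read back as zeros.
        return self.pages.get(index, bytes(PAGE_SIZE))


class StackedFile:
    """Stand-in for the same file as seen through the stackable layer."""

    def __init__(self, lower):
        self.lower = lower
        self.upper_pages = {}  # pages cached at the stackable level
        self.mmapped = False

    def mmap(self):
        # Memory-mapped accesses go through upper_pages, so from now on
        # read/write must use the same cache to stay consistent.
        self.mmapped = True

    def write_page(self, index, data):
        if self.mmapped:
            self.upper_pages[index] = data  # cached at the upper level
        # In both modes the data reaches the lower file system.
        self.lower.write_page(index, data)

    def read_page(self, index):
        if not self.mmapped:
            return self.lower.read_page(index)  # direct: no upper copy
        if index not in self.upper_pages:
            self.upper_pages[index] = self.lower.read_page(index)
        return self.upper_pages[index]
```

Before mmap() is called, upper_pages stays empty no matter how much is
read or written; afterwards every touched page also gets an upper-level
copy, which is the double caching that the direct mode avoids.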
> little overhead. So for RAIF to be viable, it needs to have low overhead,
> which doesn't seem impossible to implement, given RAIF's simple but
> beautiful approach.
Thanks a lot!
Nikolai.
---------------------
Nikolai Joukov, Ph.D.
Filesystems and Storage Laboratory
Stony Brook University