From: David Howells <dhowells@redhat.com>
To: "Lever, Charles" <Charles.Lever@netapp.com>
Cc: linux-fsdevel@vger.kernel.org,
Linux filesystem caching discussion list
<linux-cachefs@redhat.com>,
SteveD@redhat.com
Subject: Re: Re: NFS Patch for FSCache
Date: Fri, 13 May 2005 12:17:27 +0100 [thread overview]
Message-ID: <26591.1115983047@redhat.com> (raw)
In-Reply-To: <482A3FA0050D21419C269D13989C611307CF4C9E@lavender-fe.eng.netapp.com>
Charles Lever <Charles.Lever@netapp.com> wrote:
> to benchmark this i think you need to explore the architectural
> weaknesses of your approach. how bad will it get using cachefs with
> badly designed applications or client/server setups?
There are a number of critical points:
(1) Inodes
I need to represent in the cache the files I have cached so that I can
find the data attached to them and keep track of the last access times
for the purposes of culling inodes to make space. There are several ways
of doing this:
(a) Set aside a portion of the disk as an inode table and search it
end to end for each open attempt. Now assume I store one inode per
sector (512 bytes) and have a table of 128*1024 entries; this
means the worst case is that I have to read through 64MB (16384
pages) - and the worst case is going to happen every time there's
a cache miss. Now this can be alleviated in a couple of ways:
- By holding a bitmap, say, indicating which page-sized blocks
actually have inodes in them, but that then precludes the use of
do_generic_mapping_read() and readahead.
- By simply keeping track of where the last used block is to cut
short the scan and by moving inodes down or by always allocating
as low a slot as possible.
The really unpleasant side of this is that it will cycle on
average at least half of this space through the page cache every
time we scan the table. The average is likely to be worse than
half if cache misses are taken into account.
This also cuts out a chunk of the cache and makes it permanently
unavailable.
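Just to put numbers on that worst case, here's a minimal userspace sketch of
the arithmetic (the geometry is the example above, not a real on-disk layout):

/* Rough arithmetic for the worst-case scan of a flat inode table:
 * one inode per 512-byte sector, 128*1024 slots, 4K pages.  Purely
 * illustrative; not any actual on-disk format. */
#include <stdio.h>

int main(void)
{
	unsigned long slots     = 128 * 1024;	/* inode slots in the table */
	unsigned long slot_size = 512;		/* one inode per sector */
	unsigned long page_size = 4096;

	unsigned long table_bytes = slots * slot_size;
	unsigned long table_pages = table_bytes / page_size;

	/* A cache miss scans the whole table; a hit scans half on average. */
	printf("table: %lu MB, %lu pages; avg pages scanned on a hit: %lu\n",
	       table_bytes >> 20, table_pages, table_pages / 2);
	return 0;
}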
(b) Store metadata in an extensible file. This is what cachefs
actually does. It's slower than (a), but does allow (almost)
unlimited growth of the inode table and only uses as much of the
space in the cache as is necessary. Since I keep a list of free
inode entries, allocating a new one is very quick.
However, the worst case performance for (b) is actually worse than
for (a) because I not only have to read through the blocks of
inode definitions, but I also have to read the pointer blocks, and
there's on average one of those per 1024 inodes. Even worse is
that reading a particular inode has to be synchronous with respect
to walking the indirection chain.
This also has the same unpleasant side effects on scanning the
table as (a), only more so.
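The reason allocation is quick in (b) is just the free list: allocating is a
pop, releasing is a push. A minimal sketch of the idea (the structures are
invented for illustration, not cachefs's actual format):

/* Free-inode-list sketch: O(1) allocate and release, in contrast to
 * the scan needed for lookup.  Hypothetical structures only. */
#include <stdio.h>

struct cache_inode {
	unsigned int		slot;		/* index into the inode file */
	struct cache_inode	*next_free;	/* link in the free list */
};

static struct cache_inode *free_list;

static struct cache_inode *alloc_inode(void)
{
	struct cache_inode *ino = free_list;

	if (ino)
		free_list = ino->next_free;	/* O(1) pop */
	return ino;
}

static void release_inode(struct cache_inode *ino)
{
	ino->next_free = free_list;		/* O(1) push */
	free_list = ino;
}

int main(void)
{
	struct cache_inode a = { .slot = 1 }, b = { .slot = 2 };

	release_inode(&a);
	release_inode(&b);
	printf("allocated slot %u\n", alloc_inode()->slot);
	return 0;
}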
(c) Store the metadata records in a tree. This would permit
predictable and potentially constant time lookup for a particular
inode, though we would have to walk the tree to find the
particular inode we're looking for, which has to be synchronous.
Scanning a tree is potentially even worse than for flat files like
in (a) and (b) since you potentially have a lot more intermediate
nodes to walk. However, the really big advantage of a tree is that
it's a lot easier to remove dead space, thus compressing the tree,
plus it only needs to eat up as much cache space as is necessary.
Trees are a lot harder to juggle with the type of journalling that
cachefs does now, however, since the worst case for fanning out
any particular node is that you have to allocate as many new nodes
as there are slots in the page you already have plus one. So if
you make slots sector size, that's nine pages on a 4K page system,
but 129 on a 64K page system, and you have to journal all this
data...
Ideally, for optimal lookup speed, you want your tree to be
balanced, and that means dealing with rotation through nodes with
more than two pointers. This is something else the current method
of journalling is really poor at coping with.
However, the use of a wandering tree rather than exquisite
journalling would make matters easier here; the only journalling
required would be the change of tree root and the change in state
of the free block lists. The downside of a wandering tree is that
you have to maintain sufficient free space to be able to wander while
performing a deletion to make more space.
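The fan-out worst case a couple of paragraphs back works out as follows (a
sketch of the arithmetic only):

/* Worst-case node split with sector-sized (512-byte) slots: one new
 * node per slot in the page, plus one. */
#include <stdio.h>

int main(void)
{
	unsigned long page_sizes[] = { 4096, 65536 };

	for (int i = 0; i < 2; i++) {
		unsigned long slots = page_sizes[i] / 512;

		printf("%luK page: %lu slots -> worst case %lu new nodes\n",
		       page_sizes[i] >> 10, slots, slots + 1);
	}
	return 0;
}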
(2) Indexing
When a file is opened, CacheFS has to look in the cache to see if that
file is represented there. This means searching the set of cached files
for the one in which we're particularly interested.
Now I could store the lookup keys in the inodes themselves, but this has
two important consequences:
(a) It restricts the amount of auxiliary data an inode can store;
this includes such things as direct data pointers.
(b) It restricts the size of a netfs key and auxiliary data that can
be stored in an inode without increasing the on-disk inode
representation.
Furthermore, the keys of a particular class of object from a particular
netfs are generally of the same sort of size. For instance AFS vnodes
require a vnode ID (4 bytes) and a vnode uniquifier (4 bytes) per file.
Nothing else. These can be packed much more closely than can inodes,
making them that much faster to search.
Therefore, I chose to arrange cachefs as a set of homogeneous indexes,
where each index is defined by the netfs to hold elements of a particular
size. This makes searching an index that much faster.
However, since it's an index tree that's defined on disk, open() is still
synchronous in that it has to walk down the index tree until it finds (or
not) the particular data file it's looking for.
So the rapid opening and closing of a lot of small files is going to be
penalised; though this is reduced by the fact that we cache information
about indexes we know about. For instance, in the NFS client caching, we
have two layers of indexes: a server index (pinned by NFS server
records), each entry of which points to another index that keeps track of
the NFS files we know about by the NFS file handle.
Unfortunately, NFS file handles are potentially quite large (128 bytes
maximum for NFS4). This means that we can't fit many per page (about 30
on a 4K page). Now imagine that we're looking at a 2.6.11 kernel source
tree on an NFS server; this has about 21000 files. This means that we
require about 700 pages at least, more if the NFS filesystem wants to
store auxiliary data as well. So the worst case lookup is reading
through 700 pages (cache miss is worst case). If the key was stored in
with the inodes, this would mean at least 2625 pages to be read (assuming
8 per page). Now using do_generic_mapping_read() alleviates some of the
latency associated with this by doing readahead, but it's still quite a
turnover of the page cache.
If NFS had just the one index for all files on all servers, then the keys
would be bigger, though maybe only by 4 bytes (IP address), permitting
say 29 per page. Now assume you've got two servers, each with a kernel
source tree; that brings the number of files to 42000, and a worst case
lookup of say 1448 pages in the index or 5250 in the inode table
directly.
As you can see, keeping your indexes small greatly helps reduce the
lookup times, provided you can keep information about the indexes pinned
in memory to help you get to the bottom faster.
However, having more levels of index and subdividing the key space
between them brings its own pitfall: there are more levels, and walking
them has to be synchronous. Not only that, but creating all these indexes
on disk also has to be synchronous; and worse, has to be journalled.
Now all this is for flat-file indexes, as cachefs currently uses. If,
instead, cachefs used a tree the worst case lookup time would be (for a
balanced tree):
round_up(log1024(#objects)) * step_time
for 4K pages (i.e. 1024 pointers per page). Not only that, but the maximum
page cache thrash would be round_up(log1024(#objects)) pages too. So if
you've got a million objects, then this would be 2.
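A minimal sketch of that formula, assuming a fan-out of 1024 pointers per 4K
page (compile with -lm):

/* Balanced-tree lookup depth: the number of levels, and hence the
 * worst-case synchronous reads and page-cache pages touched. */
#include <math.h>
#include <stdio.h>

static unsigned int tree_depth(double nr_objects, double fanout)
{
	return (unsigned int)ceil(log(nr_objects) / log(fanout));
}

int main(void)
{
	printf("1,000,000 objects     -> depth %u\n", tree_depth(1e6, 1024));
	printf("1,000,000,000 objects -> depth %u\n", tree_depth(1e9, 1024));
	return 0;
}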
(3) Data storage
Finally, once you've found your inode, you have to be able to retrieve
data from it. cachefs does this by having a small number of direct
pointers in the inode itself, plus a single-indirect pointer, plus a
double indirect pointer, plus potentially higher-order indirection
pointers.
So for cachefs as it stands, up to the first 112 pages (448KB if 4K
pages) of a file are pointed to directly. These pages are really fast to
access because the inodes are held in memory as long as the fscache
cookies persist. Further into a file, performance degrades as more and
more levels of indirection blocks have to be traversed; but I think
that's acceptable. The single indirection block covers the next 4MB of
the file and then the double indirection block covers the next 4GB. We'd
then have to use triple indirection for the following 4TB and so on.
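To make those coverage figures concrete, here's a minimal sketch that maps a
page index within a file to the indirection level needed to reach it (the
constants mirror the description above; this isn't cachefs code):

/* 112 direct pointers, then single/double/triple indirection with
 * 1024 pointers per 4K block.  Illustrative only. */
#include <stdio.h>

#define NR_DIRECT	112ULL
#define PTRS_PER_BLOCK	1024ULL		/* 4K block of 32-bit pointers */

static int indirection_level(unsigned long long page)
{
	unsigned long long covered = NR_DIRECT;
	unsigned long long span = PTRS_PER_BLOCK;
	int level = 0;

	while (page >= covered) {
		covered += span;
		span *= PTRS_PER_BLOCK;
		level++;
	}
	return level;	/* 0 = direct, 1 = single indirect, ... */
}

int main(void)
{
	printf("page 100     -> level %d\n", indirection_level(100));
	printf("page 1000    -> level %d\n", indirection_level(1000));
	printf("page 500000  -> level %d\n", indirection_level(500000));
	printf("page 2000000 -> level %d\n", indirection_level(2000000));
	return 0;
}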
One disadvantage of doing this is that walking the indirection chain is,
of course, synchronous; though the page cache will alleviate the problem
somewhat. However, worse is that we have to journal the allocations of
pages and pointer blocks.
There are some ways of improving things:
(a) Extents
When the netfs presents some pages for caching, if these are all
contiguous in the inode's page space then they could be
represented on disk as a start page index, a size and either a
list of blocks or the first block of a contiguous chunk of disk.
Extents are quite tricky to deal with, depending on the degree of
data integrity you want. How do you deal with overlapping extents?
Do you just overwrite the first extent with the data from the
second and then write an abbreviated second extent (and perhaps
append as much of the second onto the first if possible)? What if
you're guaranteeing not to overwrite any data?
Also you really need a flexible tree structure to manage extents,
I think.
On-disk contiguous extents are also impractical as it's unlikely
that you'd be able to scrape together large runs of adjacent
blocks easily.
(b) Larger grained allocations
Performance could be improved at the cost of lowering the
utilisation of the disk by allocating blocks in groups. Pointers
in pointer blocks would then become a tuple containing a pointer
to the block group and a bitmap of the usage of the blocks in that
group - we mustn't return uninitialised data to the netfs. This
would allow one tuple to address up to 32 blocks (assuming a
32-bit bitmap); a coverage of 128KB with 4KB pages and 2MB with
64KB pages. However, the chunking factor could be anything from 2
to 32; a factor of 4 (16KB) might be a good compromise.
It would also be possible to create power-of-2 gradated allocation
bands in the cache, and attempt to allocate an appropriately sized
chunk, depending on the relationship between the target page group
position in the inode and the size of the inode's data.
This, however, would complicate the free block list handling as
each band would require its own free list maintenance.
This sort of thing would also be easy to provide mount-time tuning
for.
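A hypothetical shape for such a pointer-block entry, purely as a sketch (not
an existing cachefs structure):

/* Grouped-allocation pointer entry: a pointer to a group of up to 32
 * blocks plus a bitmap saying which blocks have actually been written,
 * so uninitialised data is never handed back to the netfs. */
#include <stdint.h>
#include <stdio.h>

struct group_ptr {
	uint32_t	group_start;	/* first block of the group on disk */
	uint32_t	present;	/* bit n set => block n is valid */
};

/* Return the disk block backing @index within the group, or 0 if that
 * block has never been written (i.e. would be uninitialised). */
static uint32_t group_lookup(const struct group_ptr *gp, unsigned int index)
{
	if (index >= 32 || !(gp->present & (1u << index)))
		return 0;
	return gp->group_start + index;
}

int main(void)
{
	struct group_ptr gp = { .group_start = 1000, .present = 0x5 };

	printf("block 0 -> %u\n", (unsigned)group_lookup(&gp, 0)); /* 1000 */
	printf("block 1 -> %u\n", (unsigned)group_lookup(&gp, 1)); /* 0: unwritten */
	return 0;
}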
(4) Other objects
One thing cachefs doesn't supply facilities for is caching other types of
objects, such as extended attributes. This is tricky with the flat file
indexing used as there's nowhere really to store a list of the objects
required. A move to a tree based filesystem would make this possible.
> for instance, what happens when the client's cache disk is much slower
> than the server (high performance RAID with high speed networking)?
Then using a local cache won't help you, no matter how hard it tries, except
in the following circumstances:
(1) The server is not available.
(2) The network is heavily used by more than just one machine.
(3) The server is very busy.
> what happens when the client's cache disk fills up so the disk cache is
> constantly turning over (which files are kicked out of your backing
> cachefs to make room for new data)?
I want that to be based on an LRU approach, using last access times. Inodes
pinned by being open can't be disposed of and neither can inodes pinned by
being marked so; but anything else is fair game for culling.
The cache has to be scanned occasionally to build up a list of inodes that are
candidates for being culled, and I think a certain amount of space must be
kept available to satisfy allocation requests; therefore the culling needs
thresholds.
Unfortunately, culling is going to be slower than allocation in general
because we always know where we're going to allocate, but we have to search
for something to get the chop.
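A very rough sketch of the threshold idea, with invented numbers and
structures just to show the shape of it:

/* Culling only kicks in below a low-water mark of free space and stops
 * once a high-water mark is reached; oldest unpinned inodes go first. */
#include <stdio.h>
#include <time.h>

struct cull_candidate {
	unsigned int	ino;
	time_t		atime;		/* last access time */
	int		pinned;		/* open or explicitly pinned */
};

#define LOW_WATER	 64	/* start culling below this many free blocks */
#define HIGH_WATER	256	/* stop culling once this many are free */

static unsigned long cull(struct cull_candidate *c, int n,
			  unsigned long free_blocks,
			  unsigned long blocks_per_inode)
{
	/* Candidates are assumed to be sorted oldest-atime first. */
	for (int i = 0; i < n && free_blocks < HIGH_WATER; i++) {
		if (c[i].pinned)
			continue;
		printf("culling inode %u\n", c[i].ino);
		free_blocks += blocks_per_inode;
	}
	return free_blocks;
}

int main(void)
{
	struct cull_candidate c[] = {
		{ .ino = 1, .atime = 100, .pinned = 0 },
		{ .ino = 2, .atime = 200, .pinned = 1 },
		{ .ino = 3, .atime = 300, .pinned = 0 },
	};
	unsigned long free_blocks = 32;

	if (free_blocks < LOW_WATER)
		free_blocks = cull(c, 3, free_blocks, 128);
	printf("free blocks now %lu\n", free_blocks);
	return 0;
}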
> what happens with multi-threaded I/O-bound applications when the cachefs is
> on a single spindle?
I don't think that multi-threaded applications are actually distinguishable on
Linux from several single-threaded applications.
Allocation has to be serialised if proper filesystem integrity is to be
maintained and if there is to be proper error handling. I do, however, try to
keep the critical section as small and as fast as possible.
> is there any performance dependency on the size of the backing cachefs?
The blocks holding virtually contiguous data may be scattered further apart on
the disk. This is unfortunate, but unless I can reclaim specific blocks or
use larger grained allocations there's not a lot I can do about that.
> do you also cache directory contents on disk?
That's entirely up to the netfs. AFS, yes; NFS, no. If cachefs were extended
to support the caching of other types of object, this would become easier for
NFS.
> remember that the application you designed this for (preserving cache
> contents across client reboots) is only one way this will be used. some
> of us would like to use this facility to provide a high-performance
> local cache larger than the client's RAM. :^)
Agreed. I'd like to be able to disable the journal. I'd also like to be able
to use it for low priority swap space. Obviously swap space can't be evicted
from the cache without approval from the swapper.
> synchronous file system metadata management is the bane of every cachefs
> implementation i know about.
Yeah. But imagine you work for a company with /usr mounted over the network by
every machine. Then the power fails. When the power comes back, all these
machines wake up, see their cache is in a bad state, reinitialise it and
immediately splat the server trying to regain /usr.
> have you measured what performance impact there is when cache files go from
> no indirection to single indirect blocks, or from single to double
> indirection? have you measured how expensive it is to reuse a single cache
> file because the cachefs file system is already full? how expensive is it
> to invalidate the data in the cache (say, if some other client changes a
> file you already have cached in your cachefs)?
Not yet. All these things require a journal update, but the changes don't
actually happen immediately, and we don't actually spend that much time
waiting around, except when we need to read something from the disk first.
The journal manager makes sure that the altered blocks hit the disk after the
journal does, and that happens in a background thread. Changes are batched up,
and we can write up to about 4000 in a batch (we must not overwrite the
previous BATCH and ACK marks in the journal).
> what about using an extent-based file system for the backing cachefs?
> that would probably not be too difficult because you have a good
> prediction already of how large the file will be (just look at the file
> size on the server).
Extents make deletion a lot slower, I suspect, because the structure is a lot
more flexible. Extents also do not eliminate indirection; they merely move it
elsewhere - an extent tree is indirect.
I'm working on making cachefs wandering tree based at the moment, at least for
the indexing. I've considered having data pointer blocks (extents) as objects
in the tree, but it's actually more complicated that way :-/
> how about using smallish chunks, like the AFS cache manager, to avoid
> indirection entirely?
In what way does this help? You have to, for example, be able to cope with a
64-bit machine requesting pages from anywhere within a file, so you might get
a request for page 0xFFFFFFFFFFFFFFFF from a sparse file and have to be able
to deal with it. How are you going to handle that without indirection? CacheFS
at the moment can't deal with that, but it wouldn't be too hard to make it
possible, it merely requires 7th-level indirection (I think). And if I move to
64-bit block pointers at some point, you'd be able to store almost that many
pages, assuming you could find a block device big enough.
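For what it's worth, the seven levels can be checked by brute force (a sketch
assuming 1024 pointers per 4K block, and ignoring the direct pointers, which
don't change the answer):

/* How many levels of indirection are needed before the tree spans the
 * full 64-bit page index space?  1024^7 = 2^70 >= 2^64, so seven. */
#include <stdio.h>

int main(void)
{
	long double span = 1024;	/* pages covered by one single-indirect block */
	long double target = 18446744073709551616.0L;	/* 2^64 */
	int level = 1;

	while (span < target) {
		span *= 1024;
		level++;
	}
	printf("levels of indirection needed: %d\n", level);	/* 7 */
	return 0;
}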
> would there be any performance advantage to caching small files in memory
> and large files on disk, or vice versa?
Well, you have the page cache; but how do you decide what should be kept in
memory and what should be committed to disk? If you keep small files in
memory, then you lose them if the power goes off or you reboot, and if you're
trying to operate disconnectedly, then you've got a problem.
David