From: Theodore Ts'o <tytso@mit.edu>
To: Eric Sandeen <sandeen@redhat.com>
Cc: Dave Chinner <david@fromorbit.com>,
Ext4 Developers List <linux-ext4@vger.kernel.org>
Subject: Re: [PATCH 0/5 v2] add extent status tree caching
Date: Tue, 13 Aug 2013 09:04:59 -0400
Message-ID: <20130813130459.GD8902@thunk.org>
In-Reply-To: <5209A649.90406@redhat.com>

On Mon, Aug 12, 2013 at 10:21:45PM -0500, Eric Sandeen wrote:
>
> Reading extents via fiemap almost certainly moves that metadata into
> kernel cache, simply by the act of reading the block device to get them.

Well, if the file system has an extent cache. It certainly will end
up reading the pages involved with the extents into the buffer and/or
page cache (depending on how the file system does things).

> I see Dave's point that we _do_ have an interface today to read
> all file extents into cache. We don't mark them as particularly sticky,
> however.
>
> This seems pretty clearly driven by a Google workload need; something you
> can probably test. Does FIEMAP do the job for you or not? If not, why not?

If you are using memory containers the way we do, in practice every
single process is going to be under memory pressure. See previous
comments I've made about why in a cloud environment, memory is your
most precious resource, since motherboards have limited numbers of
DIMM slots, and high-density DIMMs are expensive --- this is why
services like Amazon EC2 and Linode charge $$$ if you need much more
than 512MB of memory. This is because in order to make cloud systems
cost effective from a financial ROI point of view (especially once you
include power and cooling costs), you need to pack a large number of
workloads on each machine, and this is true regardless of whether you
are using containers or VMs as your method of isolation.

So basically, if you are trying to use your memory efficiently, _and_
you are trying to meet 99.9 percentile latency SLA numbers for your
performance-critical workloads, you need to have a way of telling the
system that certain pieces of memory (in this case, certain parts of
the extent cache) are more important than others (for example, a
no-longer-used inode/dentry in the inode/dentry cache or other slab
objects).

						- Ted

P.S. In previous versions of this patch (which never went upstream,
using a different implementation which also never went upstream), this
ioctl nailed the relevant portions of the extent cache into memory
permanently, and they wouldn't be evicted no matter how much memory
pressure you would be under. In the Google environment, this wasn't a
major issue, since all jobs run under a restrictive memory container
and so a buggy or malicious program which attempted to precache too
many files would end up OOM-killing itself (after which point the
situation would correct itself).

In this version of the patch, I've made the cache entries sticky, but
they aren't permanently nailed in place. This is because not all
systems will be running with containers, and I wanted to make sure we
had a safety valve against abuse. Could someone still degrade the
system performance if they tried to abuse this ioctl? Sure, but
someone can do the same thing with a "while (1) fork();" bomb.