From: Daniel J Blueman <daniel@numascale.com>
To: Theodore Ts'o <tytso@mit.edu>
Cc: linux-ext4@vger.kernel.org, LKML <linux-kernel@vger.kernel.org>,
Steffen Persvold <sp@numascale.com>,
Andreas Dilger <adilger.kernel@dilger.ca>
Subject: Re: ext4 performance falloff
Date: Sat, 05 Apr 2014 11:28:17 +0800
Message-ID: <533F7851.30803@numascale.com>
In-Reply-To: <20140404205604.GC10275@thunk.org>
On 04/05/2014 04:56 AM, Theodore Ts'o wrote:
> On Sat, Apr 05, 2014 at 01:00:55AM +0800, Daniel J Blueman wrote:
>> On a larger system with 1728 cores/4.5TB of memory, running 3.13.9, I'm
>> seeing very low (600KB/s) cached write performance to a local ext4
>> filesystem:
> Thanks for the heads up. Most (all?) of the ext4 developers don't have
> systems with thousands of cores, so these issues generally don't come up
> for us, and so we're not likely (hell, very unlikely!) to notice potential
> problems caused by these sorts of uber-large systems.
Hehe. It's not every day we get access to these systems also.
>> Analysis shows that ext4 is reading from all cores' cpu-local data (thus
>> expensive off-NUMA-node access) for each block written:
>>
>>     if (free_clusters - (nclusters + rsv + dirty_clusters) <
>>                     EXT4_FREECLUSTERS_WATERMARK) {
>>             free_clusters = percpu_counter_sum_positive(fcc);
>>             dirty_clusters = percpu_counter_sum_positive(dcc);
>>     }
>>
>> This threshold is defined as:
>>
>> #define EXT4_FREECLUSTERS_WATERMARK (4 * (percpu_counter_batch * nr_cpu_ids))
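
For context on why that fallback hurts here: percpu_counter_sum_positive()
walks every online CPU's local counter under the counter's spinlock, so
each call pulls a cache line from essentially every NUMA node in the
system. Roughly (a simplified sketch modelled on lib/percpu_counter.c;
the exact code varies by kernel version):

/*
 * Simplified sketch of the exact-sum slow path (modelled on
 * lib/percpu_counter.c); the real code differs in detail, but the cost
 * is the per-CPU walk: one (likely remote) cache line per online CPU.
 */
static s64 percpu_counter_sum_sketch(struct percpu_counter *fbc)
{
        s64 ret;
        int cpu;
        unsigned long flags;

        raw_spin_lock_irqsave(&fbc->lock, flags);
        ret = fbc->count;
        for_each_online_cpu(cpu) {
                s32 *pcount = per_cpu_ptr(fbc->counters, cpu);
                ret += *pcount;
        }
        raw_spin_unlock_irqrestore(&fbc->lock, flags);
        return ret;
}

The fast path, by contrast, just reads the cached global fbc->count.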
...
> The problem we are trying to solve here is that when we do delayed
> allocation, we're making an implicit promise that there will be space
> available.
>
> I've done the calculations, and 4 * 32 * 1728 cores = 221184 blocks,
> or 864 megabytes. That would mean that the file system is over 98%
> full, so that's actually pretty reasonable; most of the time there's
> more free space than that.
The filesystem is empty, straight after mkfs; the approach here may make
sense if we want to allow all cores to write to this FS, but here only
one core is writing.

Instrumenting shows free_clusters=16464621, nclusters=1, rsv=842790,
dirty_clusters=0, percpu_counter_batch=3456 and nr_cpu_ids=1728, so with
anything under ~91GB of free space we hit this slow path. It feels more
sensible to trigger this behaviour when the FS is, say, 98% full,
irrespective of the number of cores, though I appreciate that's not why
the check is there.
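
For reference, plugging those numbers in (a quick userspace sketch; it
assumes 4KiB clusters, which is what gives the ~91GB figure above):

/*
 * Back-of-the-envelope check of where the watermark sits with the
 * instrumented values; assumes 4KiB clusters (no bigalloc).
 */
#include <stdio.h>

int main(void)
{
        long percpu_counter_batch = 3456;   /* scales with online CPUs */
        long nr_cpu_ids = 1728;
        long watermark = 4 * percpu_counter_batch * nr_cpu_ids;

        printf("watermark = %ld clusters (~%.1f GiB at 4KiB/cluster)\n",
               watermark, watermark * 4096.0 / (1 << 30));
        /* prints: watermark = 23887872 clusters (~91.1 GiB at 4KiB/cluster) */
        return 0;
}

which is also why the threshold here is so much larger than the 864MB
estimate above: percpu_counter_batch is 3456 on this box rather than the
default 32.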
Since these block devices are attached to a single NUMA node's I/O link,
there is a scaling limit there anyway, so perhaps there is a rationale
for capping this at min(256, nr_cpu_ids)?
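
Something like this, say (untested, purely to illustrate the idea; the
min() type-checking may want a cast):

/*
 * Untested sketch: bound the per-CPU slack we tolerate before falling
 * back to the exact (and expensive) percpu_counter_sum_positive() walk,
 * so the threshold stops scaling with the full 1728-core count.
 */
#define EXT4_FREECLUSTERS_WATERMARK \
        (4 * (percpu_counter_batch * min(256, nr_cpu_ids)))

(Caveat: percpu_counter_batch itself scales with the online CPU count, so
this only caps one of the two factors.)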
> It looks like the real problem is that we're using nr_cpu_ids, which
> is the maximum possible number of cpu's that the system can support,
> which is different from the number of cpu's that you currently have.
> For normal kernels nr_cpu_ids is small, so that has never been a
> problem, but I bet you have nr_cpu_ids set to something really large,
> right?
>
> If you change nr_cpu_ids to total_cpus in the definition of
> EXT4_FREECLUSTERS_WATERMARK, does that make things better for your
> system?
I have reproduced this with CPU hotplug disabled, so nr_cpu_ids is
nicely at 1728.
Thanks,
Daniel
--
Daniel J Blueman
Principal Software Engineer, Numascale