Re: ext4 performance falloff - Daniel J Blueman

From: Daniel J Blueman <daniel@numascale.com>
To: Theodore Ts'o <tytso@mit.edu>
Cc: linux-ext4@vger.kernel.org, LKML <linux-kernel@vger.kernel.org>,
	Steffen Persvold <sp@numascale.com>,
	Andreas Dilger <adilger.kernel@dilger.ca>
Subject: Re: ext4 performance falloff
Date: Sat, 05 Apr 2014 11:28:17 +0800
Message-ID: <533F7851.30803@numascale.com>
In-Reply-To: <20140404205604.GC10275@thunk.org>

On 04/05/2014 04:56 AM, Theodore Ts'o wrote:
> On Sat, Apr 05, 2014 at 01:00:55AM +0800, Daniel J Blueman wrote:
>> On a larger system 1728 cores/4.5TB memory and 3.13.9, I'm seeing very low
>> 600KB/s cached write performance to a local ext4 filesystem:

> Thanks for the heads up.  Most (all?) of the ext4 developers don't have
> systems with thousands of cores, so these issues generally don't come up
> for us, and so we're not likely (hell, very unlikely!) to notice potential
> problems caused by these sorts of uber-large systems.

Hehe. It's not every day we get access to these systems either.

>> Analysis shows that ext4 is reading from all cores' cpu-local data (thus
>> expensive off-NUMA-node access) for each block written:
>>
>> if (free_clusters - (nclusters + rsv + dirty_clusters) <
>> 				EXT4_FREECLUSTERS_WATERMARK) {
>> 	free_clusters  = percpu_counter_sum_positive(fcc);
>> 	dirty_clusters = percpu_counter_sum_positive(dcc);
>> }
>>
>> This threshold is defined as:
>>
>> #define EXT4_FREECLUSTERS_WATERMARK (4 * (percpu_counter_batch *
>> nr_cpu_ids))
...
> The problem we are trying to solve here is that when we do delayed
> allocation, we're making an implicit promise that there will be space
> available
>
> I've done the calculations, and 4 * 32 * 1728 cores = 221184 blocks,
> or 864 megabytes.  That would mean that the file system is over 98%
> full, so that's actually pretty reasonable; most of the time there's
> more free space than that.

The filesystem is empty after the mkfs; the approach here may make sense
if we want to allow all cores to write to this FS, but here only one
core is writing.

Instrumenting shows that free_clusters=16464621 nclusters=1 rsv=842790
dirty_clusters=0 percpu_counter_batch=3456 nr_cpu_ids=1728; with less
than about 91GB of free space, we'd hit this issue. It feels more
sensible to start this behaviour when the FS is, say, 98% full,
irrespective of the number of cores, but that's not why the behaviour
is there.
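
As a sanity check on the numbers above, here is a minimal userspace
sketch of the arithmetic. It is not kernel code: the function names are
made up for illustration, and the 4 KiB cluster size is an assumption
(it is not stated in this mail, though it matches the 864 megabyte
figure quoted earlier).

```c
#include <assert.h>
#include <stdint.h>

/* Mirrors EXT4_FREECLUSTERS_WATERMARK: 4 * (percpu_counter_batch * nr_cpu_ids). */
int64_t watermark_clusters(int64_t batch, int64_t nr_cpu_ids)
{
	assert(batch > 0 && nr_cpu_ids > 0);
	return 4 * (batch * nr_cpu_ids);
}

/*
 * Mirrors the quoted check: returns nonzero when the exact (and, on a
 * large NUMA machine, expensive) per-CPU summation would be performed.
 */
int takes_slow_path(int64_t free_clusters, int64_t nclusters,
		    int64_t rsv, int64_t dirty_clusters, int64_t watermark)
{
	return free_clusters - (nclusters + rsv + dirty_clusters) < watermark;
}

/* ASSUMPTION: 4 KiB clusters. */
int64_t clusters_to_bytes(int64_t clusters)
{
	return clusters * 4096;
}
```

With the instrumented batch of 3456, watermark_clusters(3456, 1728) is
23887872 clusters, or 97844723712 bytes (about 91 GiB) under the 4 KiB
assumption. The left-hand side of the check is 16464621 - 842791 =
15621830, which is below that, so the summation runs on every write.
With a batch of 32 the watermark would be 221184 clusters and the same
check would not trigger.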

Since these block devices are attached to a single NUMA node's IO link,
there is a scaling limitation there anyway, so there may be a rationale
for capping this at min(256, nr_cpu_ids)?
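
To put a rough number on that idea, a sketch of the capped calculation
follows. WATERMARK_CPU_CAP and capped_watermark_clusters() are
hypothetical names used only for illustration, this is untested against
a real kernel, and it assumes percpu_counter_batch is left unchanged.

```c
#include <stdint.h>

/* Hypothetical cap; 256 is only the figure floated above. */
#define WATERMARK_CPU_CAP 256

/* Same shape as the existing watermark, with the CPU count clamped. */
int64_t capped_watermark_clusters(int64_t batch, int64_t nr_cpu_ids)
{
	int64_t cpus = nr_cpu_ids < WATERMARK_CPU_CAP ? nr_cpu_ids
						      : WATERMARK_CPU_CAP;

	return 4 * (batch * cpus);
}
```

For batch=3456 and nr_cpu_ids=1728 this gives 3538944 clusters, about
13.5 GiB if clusters are 4 KiB (an assumption), which is well below the
15621830 clusters available in the instrumented case. Systems with 256
or fewer CPU ids would see no change.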

> It looks like the real problem is that we're using nr_cpu_ids, which
> is the maximum possible number of cpu's that the system can support,
> which is different from the number of cpu's that you currently have.
> For normal kernels nr_cpu_ids is small, so that has never been a
> problem, but I bet you have nr_cpu_ids set to something really large,
> right?
>
> If you change nr_cpu_ids to total_cpus in the definition of
> EXT4_FREECLUSTERS_WATERMARK, does that make things better for your
> system?

I have reproduced this with CPU hotplug disabled, so nr_cpu_ids is
already 1728, matching the number of cores present.

Thanks,
   Daniel
-- 
Daniel J Blueman
Principal Software Engineer, Numascale
