From: Benjamin LaHaise <bcrl@kvack.org>
To: Andreas Dilger <adilger@dilger.ca>
Cc: Theodore Ts'o <tytso@mit.edu>,
"linux-ext4@vger.kernel.org" <linux-ext4@vger.kernel.org>
Subject: Re: ext4: first write to large ext3 filesystem takes 96 seconds
Date: Wed, 30 Jul 2014 10:49:28 -0400
Message-ID: <20140730144928.GA10295@kvack.org>
In-Reply-To: <E0178FE2-1C0C-4AF3-BA8C-3F32B4A4ACF7@dilger.ca>
Hi Andreas, Ted,
I've finally had some more time to dig into this problem, and it's worse
than I initially thought in that it occurs on normal ext4 filesystems.
On Mon, Jul 07, 2014 at 11:11:58PM -0600, Andreas Dilger wrote:
...
> The main problem here is that reading all of the block bitmaps takes
> a huge amount of time for a large filesystem.
Very true.
...
>
> 7.8TB / 128MB/group ~= 8000 groups
> 8000 bitmaps / 100 seeks/sec = 80s
>
> So that is what is making things slow. Once the allocator has all the
> blocks in memory there are no problems. There are some heuristics
> to skip bitmaps that are totally full, but they don't work in your case.
>
> This is why the flex_bg feature was created - to allow the bitmaps
> to be read from disk without seeks. This also speeds up e2fsck by
> the same 96s that would otherwise be wasted waiting for the disk.
Unfortunately, that isn't the case.
> Backporting flex_bg to ext3 would be fairly trivial - just disable the checks
> for the location of the bitmaps at mount time. However, using it
> requires that you reformat your filesystem with "-O flex_bg" to
> get the improved layout.
flex_bg is not sufficient to resolve this issue. On a native ext4
filesystem created with mke4fs 1.41.12, this problem still occurs.
I created a 7.1TB filesystem and filled it to about 92% full with
8MB files. The time to create a new 8MB file after a fresh mount ranges
from 0.017 seconds to 13.2 seconds. The outlier correlates with bitmaps
being read from disk. A copy of /proc/fs/ext4/dm-2/mb_groups from this
92% full fs is available at http://www.kvack.org/~bcrl/mb_groups.ext4-92
Note that it isn't the first allocating write to the filesystem that is
the worst in terms of timing; it can end up being the 10th or even the
100th attempt.
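For anyone wanting to reproduce this kind of measurement, here is a minimal
sketch. It is not the script used for the numbers above. The function
names, the file naming scheme and the use of fsync() to force allocation
are all assumptions made for illustration.

```python
import os
import time


def time_file_creates(directory, count=100, size=8 * 1024 * 1024):
    """Create `count` files of `size` bytes and return per-file latencies.

    Assumption: fsync() is used so that delayed allocation does not hide
    the block allocation cost inside a later writeback.
    """
    payload = b"\0" * size
    latencies = []
    for i in range(count):
        path = os.path.join(directory, f"probe-{i:04d}.bin")
        start = time.monotonic()
        with open(path, "wb") as f:
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())
        latencies.append(time.monotonic() - start)
    return latencies


def summarize(latencies):
    """Report the fastest and slowest create, and which attempt was slowest."""
    worst = max(range(len(latencies)), key=latencies.__getitem__)
    return {
        "min": min(latencies),
        "max": latencies[worst],
        "worst_index": worst,
    }
```

Run against a directory on the freshly mounted filesystem, e.g.
summarize(time_file_creates("/mnt/test")), where /mnt/test is a
placeholder path. The "worst_index" field shows whether the slowest create
was the first attempt or a later one.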
> The other option (if your runtime environment allows it) is to prefetch
> the block bitmaps using "dumpe2fs /dev/XXX > /dev/null" before the
> filesystem is in use. This still takes 90s, but can be started early in
> the boot process on each disk in parallel.
That isn't a solution. Prefetching is impossible in my particular use-case,
as the filesystem is being mounted after a failover from another node --
any data prefetched prior to switching active nodes is not guaranteed to be
valid.
This seems like a pretty serious regression relative to ext3. Why can't
ext4's mballoc pick better block groups to attempt allocating from based
on the free block counts in the block group summaries?
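To make the question concrete, here is a rough sketch of the idea. It is
not mballoc's actual algorithm, and GroupSummary, candidate_groups and
goal_group are hypothetical names. It assumes the per-group free block
counts are already in memory from the group descriptors, so ranking the
groups needs no bitmap reads.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GroupSummary:
    group: int        # block group number
    free_blocks: int  # free block count taken from the group descriptor


def candidate_groups(summaries, blocks_needed, goal_group=0):
    """Return group numbers whose bitmaps are worth reading, best first.

    Groups with fewer than `blocks_needed` free blocks are skipped.
    The rest are ordered by free block count (descending), with ties
    broken by forward distance from `goal_group`, wrapping around.
    """
    n = len(summaries)
    viable = [s for s in summaries if s.free_blocks >= blocks_needed]
    viable.sort(key=lambda s: (-s.free_blocks, (s.group - goal_group) % n))
    return [s.group for s in viable]
```

For an 8MB file with 4KB blocks (an assumed block size), blocks_needed
would be 2048. A free block count says nothing about fragmentation, so a
group that passes this filter may still lack a suitable contiguous extent;
the sketch only narrows down which bitmaps get read first.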
-ben
--
"Thought is the essence of where you are now."