From: Andreas Dilger <adilger@sun.com>
To: Theodore Tso <tytso@mit.edu>
Cc: linux-ext4@vger.kernel.org
Subject: Re: ext4 unlink performance
Date: Wed, 19 Nov 2008 12:10:01 -0600
Message-ID: <20081119181000.GD3186@webber.adilger.int>
In-Reply-To: <20081119024021.GA10185@mit.edu>
On Nov 18, 2008 21:40 -0500, Theodore Ts'o wrote:
> Looking at the blkparse profiles, doing an rm -rf given the ext4
> produced layout required 5130 megabytes. The exact same directory
> hierarchy, as laid out by ext3, required only 1294 megabytes.
> Looking at a few selected inode allocation bitmaps, we see that ext4
> will often need to write (and thus journal) the same block allocation
> bitmap block 4 or 5 times:
>
> 254,7 0 352 0.166492349 9376 C R 8216 + 8 [0]
> 254,7 0 348788 212.885545554 0 C W 8216 + 8 [0]
> 254,7 0 461448 309.533613765 0 C W 8216 + 8 [0]
> 254,7 0 827687 558.781690434 0 C W 8216 + 8 [0]
> 254,7 0 1210492 760.738217014 0 C W 8216 + 8 [0]
>
> However, the same block allocation bitmap block is only written once
> or twice.
>
> 254,8 0 3119 9.535331283 0 C R 524288 + 8 [0]
> 254,8 0 24504 45.253431031 0 C W 524288 + 8 [0]
> 254,8 0 85476 144.455205555 23903 C W 524288 + 8 [0]
Looking at the seekwatcher graphs, it is clear that the ext4 layout
is doing fewer seeks, and packing the data into a smaller part of
the filesystem, which makes the performance result counter-intuitive.
Even though the IO bandwidth is ostensibly higher (usually a good thing
on metadata benchmarks), that isn't any good if we are doing more writes.

It isn't immediately clear that _just_ rewriting the same block
multiple times is the culprit in itself, because in the ext3 case
there would be more block bitmaps affected that would _each_ be written
out 1 or 2 times, while the closer packing of ext4 allocations results
in fewer total bitmaps being used.

One would think that more sharing of a block bitmap would
result in a performance _increase_, because there is more chance that
it will be re-used within the same transaction.
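
To check how often each block is really rewritten, one could count
completed writes per sector in the blkparse output. The following is only
a rough sketch: it assumes the default blkparse line layout (device, CPU,
sequence, timestamp, PID, action, RWBS, sector), and the function name
count_completed_writes is made up for illustration. The sample data is the
trace excerpt quoted above.

```python
from collections import Counter

# Trace lines copied from the quoted blkparse excerpt.
SAMPLE = """\
254,7 0 352 0.166492349 9376 C R 8216 + 8 [0]
254,7 0 348788 212.885545554 0 C W 8216 + 8 [0]
254,7 0 461448 309.533613765 0 C W 8216 + 8 [0]
254,7 0 827687 558.781690434 0 C W 8216 + 8 [0]
254,7 0 1210492 760.738217014 0 C W 8216 + 8 [0]
254,8 0 3119 9.535331283 0 C R 524288 + 8 [0]
254,8 0 24504 45.253431031 0 C W 524288 + 8 [0]
254,8 0 85476 144.455205555 23903 C W 524288 + 8 [0]
"""


def count_completed_writes(text):
    """Count completed ('C') write requests per (device, start sector).

    Assumes the default blkparse field order; lines that do not match
    (summaries, blank lines) are skipped.
    """
    counts = Counter()
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 8:
            continue
        dev, action, rwbs, sector = fields[0], fields[5], fields[6], fields[7]
        if action == "C" and "W" in rwbs and sector.isdigit():
            counts[(dev, int(sector))] += 1
    return counts


writes = count_completed_writes(SAMPLE)
for (dev, sector), n in sorted(writes.items()):
    print(dev, sector, n)
```

Run over the whole trace rather than this excerpt, a histogram of these
counts would show whether a few hot bitmap blocks or many blocks written
once or twice account for the extra write volume.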
> ext4:
> Reads Completed: 59947, 239788KiB
> Writes Completed: 1282K, 5130MiB
>
> ext3:
> Reads Completed: 64856, 259424KiB
> Writes Completed: 323582, 1294MiB
The reads look about the same, but writes are about 4x higher. What would be
useful to examine is the inode number grouping of files in the same
subdirectory, along with the blocks they are allocating. It seems
like the inodes are being packed more closely together, but the
blocks (and hence block bitmap writes) are spread further apart.

That may be a side-effect of the mballoc per-CPU cache again, where
files being written in the same subdirectory are spread apart because
of the write thread being rescheduled to different cores.
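
As a starting point for the inode half of that comparison, something like
the sketch below could summarise which inode groups the entries of one
directory fall into. It is illustrative only: the helper names are
invented, and inodes_per_group is an assumption that has to be taken from
the filesystem's superblock (e.g. the "Inodes per group" line of
dumpe2fs -h). Mapping the data blocks would need a separate tool such as
filefrag.

```python
import os
from collections import Counter


def inode_group(ino, inodes_per_group):
    """Return the group an inode number belongs to.

    ext2/3/4 number inodes starting at 1, so inode 1 is the first
    inode of group 0.
    """
    return (ino - 1) // inodes_per_group


def groups_in_dir(path, inodes_per_group):
    """Count how many entries of `path` have their inode in each group."""
    counts = Counter()
    with os.scandir(path) as entries:
        for entry in entries:
            counts[inode_group(entry.inode(), inodes_per_group)] += 1
    return counts


if __name__ == "__main__":
    # 8192 is only a placeholder; use the value reported for your filesystem.
    for group, n in sorted(groups_in_dir(".", 8192).items()):
        print(f"group {group}: {n} entries")
```

If most entries of a subdirectory land in one or two groups while their
blocks do not, that would support the theory above.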
I discussed this in the past with Eric, in the case of a file doing
small writes+fsync and the blocks being fragmented needlessly between
different parts of the filesystem. The proposed solution in that case
(that Aneesh could probably fix quickly) is to attach an inode to the
per-CPU preallocation group (PA) on the first write (for small files). If it
doesn't get any more writes that is fine, but if it does, then the same
PA would be used for further allocations regardless of which CPU is doing
the IO.
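
The difference can be shown with a toy model. This is not mballoc code:
each CPU is simply given one "region" standing in for its preallocation
group, and the function name allocate and the sticky flag are invented for
illustration.

```python
def allocate(writes, sticky):
    """Place each write in a region and return {inode: [region, ...]}.

    writes is an ordered list of (inode, cpu) pairs. In this toy model
    the region id equals the CPU id. With sticky=True an inode keeps the
    region it was given on its first write; with sticky=False every write
    goes to the region of whichever CPU issued it.
    """
    first_region = {}
    placement = {}
    for inode, cpu in writes:
        region = cpu
        if sticky:
            region = first_region.setdefault(inode, region)
        placement.setdefault(inode, []).append(region)
    return placement


# One small file written three times while its thread migrates across CPUs.
writes = [(12, 0), (12, 2), (12, 1)]
per_cpu = allocate(writes, sticky=False)
per_inode = allocate(writes, sticky=True)
print(per_cpu[12])    # [0, 2, 1]
print(per_inode[12])  # [0, 0, 0]
```

With the inode pinned to its first PA, all three writes stay in one
region, so only one set of bitmaps is dirtied in this model.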
Another solution for that case, and (as I speculate) this case, is to
attach the PA to the parent directory and have all small files in the
same directory use that PA. This would ensure that blocks allocated to
small inodes in the same directory are kept together. The drawback is
that this could hurt performance for multiple threads writing to the
same directory.
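
A similarly simplified sketch of the per-directory variant, again a toy
model and not the real allocator (regions_per_dir and its arguments are
hypothetical names), counts how many distinct regions the small files of
each directory end up touching:

```python
from collections import defaultdict


def regions_per_dir(writes, per_directory):
    """Return {directory: number of distinct regions touched}.

    writes is an ordered list of (directory, filename, cpu) tuples, one
    per small file. The region id equals the CPU id. With
    per_directory=True the directory keeps the region chosen for its
    first file and every later file in it reuses that region.
    """
    dir_region = {}
    touched = defaultdict(set)
    for directory, _name, cpu in writes:
        region = cpu
        if per_directory:
            region = dir_region.setdefault(directory, region)
        touched[directory].add(region)
    return {d: len(r) for d, r in touched.items()}


writes = [("/a", "f1", 0), ("/a", "f2", 3), ("/a", "f3", 1), ("/b", "g1", 3)]
print(regions_per_dir(writes, per_directory=False))  # {'/a': 3, '/b': 1}
print(regions_per_dir(writes, per_directory=True))   # {'/a': 1, '/b': 1}
```

The model deliberately ignores the drawback mentioned above: with one PA
per directory, every thread writing into that directory would be
allocating from the same PA.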
Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.