From: Andreas Dilger <adilger@sun.com>
To: Theodore Tso <tytso@mit.edu>
Cc: linux-ext4@vger.kernel.org,
"Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>,
Eric Sandeen <sandeen@redhat.com>
Subject: Re: [PATCH, RFC] ext4: New inode/block allocation algorithms for flex_bg filesystems
Date: Fri, 27 Feb 2009 02:17:04 -0700 [thread overview]
Message-ID: <20090227091704.GR3199@webber.adilger.int> (raw)
In-Reply-To: <20090226182156.GL7227@mit.edu>
On Feb 26, 2009 13:21 -0500, Theodore Ts'o wrote:
> I tried adding some of Andreas' suggestions which would tend to pack
> the inodes less agressively, in the hopes that it might speed up the
> mkdir operation, but at least for seq_mkdir/mkdirs_mark benchmark, it
> really didn't help, because we are disk bound, not cpu bound.
Could you explain in a bit more detail what you tried? In particular,
was this the "mapping hash range onto itable range" I have proposed in
the past?
As a rough outline of what I'm thinking, this kind of mapping might only
start once we exceed a single directory leaf block, as this coincides
with the start of htree hashing and hash-order vs. itable-order randomness.
Basically we would map the N leaf blocks of a directory into a range
of M itable blocks that had some number of free inodes. If we start
with 2 directory leaf blocks (right after split) that are 1/2 full:
4096 bytes/itable block / 512 bytes/inode = 8 inode/itable block
4096 bytes/leaf block / 40 bytes/dirent = 102 dirent/leaf block
102 dirent/leaf * 1/2 / 1 dirent/inode / 8 inode/itable = 6 itable/leaf
so that would mean filling the remaining 1/2 space in the 2 leaf blocks
would consume about 12 itable blocks. When there are 4 leaf blocks in the
directory we map to 24 itable blocks.
When we are scanning this directory (say at 4 leaf block size) for values
in the first leaf block (which is in hash order) the entries will likely
be in either:
+ the first 12 itable blocks (there was no itable ordering at that time)
+ the first 3 blocks of the first 12-block range (1/4 of hash values)
+ the first 6 blocks of the second 24-block range (1/4 of hash values)
= 21 blocks
Contrast this with regular htree inode allocation, the first 1/4 of the
directory entries will likely (randomly) have entries in all 12+12+24=48
of the blocks, so we are loading/modifying about 1/2 of the itable blocks
when doing stat/unlink in the directory.
If we make a table for stat/unlink of all entries in the first leaf block:
directory size total 1 leaf blk leaf blocks access
blocks:files itable blocks file ratio accessed ratio
1 102 12 1/1 12 1
2 204 12+12=24 1/2 12+6=18 3/4
4 408 12+12+24=48 1/4 12+3+6=21 1/3
8 816 12+12+24+48=96 1/8 12+2+3+6=23 1/4
16 1632 12+12+24+48+96=192 1/16 12+1+2+3+6=24 1/8
32 3264 384 1/32 24+1=25 1/15
64 6528 768 1/64 25+1=26 1/30
128 13056 1536 1/128 27 1/57
While initially it seems that past a directory of size 8 blocks we would
only modify at most 102 itable blocks per dirent block (== number of
entries in the dirent block) and the "access ratio" would stick around 1/4,
in practise we should continue to get proportionately fewer itable blocks
loaded/modified per dirent block because the itable blocks allocated
at the beginning (12+...) are used/modified repeatedly for the first
N dirent blocks and do not further negatively impact performance (no
re-loads due to cache pressure, or are redirtied in the journal).
In comparison, with the current "random" dirent->itable mapping we would
get another 102 new dirent blocks touched for each leaf block, and for
larger directories the leaf blocks cannot even all fit into a single
journal transaction and the performance tanks because each unlink will
cause a separate 4kB block to be written into the journal.
> + int flex_size = ext4_flex_bg_size(EXT4_SB(ac->ac_sb));
>
> + /* Avoid using the first bg of a flexgroup for data files */
> + (flex_size >= EXT4_FLEX_SIZE_DIR_ALLOC_SCHEME) &&
Since these are both constants, wouldn't it make more sense to just
check the sbi->s_log_groups_per_flex against the lg of the threshold:
if (sbi->s_log_groups_per_flex > (2)) (as a #defined constant)
Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
next prev parent reply other threads:[~2009-02-27 9:17 UTC|newest]
Thread overview: 14+ messages / expand[flat|nested] mbox.gz Atom feed top
2009-02-18 15:43 [PATCH, RFC] ext4: New inode/block allocation algorithms for flex_bg filesystems Theodore Tso
2009-02-24 8:59 ` Aneesh Kumar K.V
2009-02-24 15:27 ` Theodore Tso
2009-02-24 19:04 ` Theodore Tso
2009-02-24 22:41 ` Andreas Dilger
2009-02-25 0:57 ` Eric Sandeen
2009-02-25 0:58 ` Eric Sandeen
2009-02-25 2:50 ` Theodore Tso
2009-02-26 18:21 ` Theodore Tso
2009-02-26 18:38 ` Aneesh Kumar K.V
2009-03-30 8:48 ` Aneesh Kumar K.V
2009-02-27 0:15 ` Andreas Dilger
2009-02-27 9:17 ` Andreas Dilger [this message]
2009-02-27 15:06 ` Theodore Tso
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20090227091704.GR3199@webber.adilger.int \
--to=adilger@sun.com \
--cc=aneesh.kumar@linux.vnet.ibm.com \
--cc=linux-ext4@vger.kernel.org \
--cc=sandeen@redhat.com \
--cc=tytso@mit.edu \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.