bug in inode allocator? - Darrick J. Wong

From: "Darrick J. Wong" <djwong@us.ibm.com>
To: linux-ext4 <linux-ext4@vger.kernel.org>
Cc: Keith Mannthey <kmannth@us.ibm.com>, Mingming Cao <mcao@us.ibm.com>
Subject: bug in inode allocator?
Date: Mon, 22 Mar 2010 17:21:23 -0700
Message-ID: <20100323002123.GQ29604@tux1.beaverton.ibm.com>

Hi,

I'm trying to understand how ext4_allocate_inode selects a blockgroup when
creating a top level directory, and I've noticed a couple of odd behaviors with
the algorithm (2.6.34-rc2 if anyone cares):

First, the allocator picks a random blockgroup and, starting there, scans all
the blockgroups linearly to find the least heavily loaded one.  However, if
there are ties for the least heavily loaded bg, the allocator picks the first
one it scanned, not necessarily the one with the lowest bg number.  This seems
to be a strategy to scatter top level directories all over the disk, to keep
them from ending up in the same bg and fragmenting each other.  However, if
the tie is between empty blockgroups and the media is a rotating disk, this can
result in top level directories being created far away from the high-bandwidth
beginning of the disk.  If one creates only a handful of directories which all
end up hashing to higher numbered blockgroups, then the filesystem won't use
the high performance areas of the disk until there's enough data to wrap around
to the blockgroups at the beginning.
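
To make the tie behavior concrete, here is a toy model of the scan described
above.  It is a sketch and not the kernel code: the single "load" number per
group, the function name, and the strict less-than comparison are all
assumptions made for illustration.

```python
def pick_first_scanned(loads, start):
    """Scan every group once, beginning at `start` and wrapping around.

    `loads` is a hypothetical per-group load figure (lower is better).
    The strict '<' means a later group only wins if it is strictly less
    loaded, so ties go to the first minimum encountered in scan order.
    """
    ngroups = len(loads)
    best = None
    for i in range(ngroups):
        g = (start + i) % ngroups
        if best is None or loads[g] < loads[best]:
            best = g
    return best


# Eight empty groups: everything ties, so the winner is just the start point.
empty = [0] * 8
print([pick_first_scanned(empty, s) for s in range(8)])
# prints [0, 1, 2, 3, 4, 5, 6, 7]
```

In the real allocator the start point is random, so on a freshly made
filesystem this model places a new top level directory in an essentially
random group.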

An "easy" fix seems to be: If there is a tie in comparing blockgroups, then the
one with the lowest bg number wins, though that heavily biases directory
placement towards the beginning of the disk, so further study on my part is
needed.  In performing _that_ analysis, I came across a second problem:
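
A sketch of that tie-break rule, again as a toy model rather than a patch: the
per-group "load" number and both function names are hypothetical, and each new
directory is assumed to add exactly one unit of load to its group.

```python
def pick_lowest_numbered(loads):
    """Least loaded group wins; among ties, the lowest group number wins."""
    return min(range(len(loads)), key=lambda g: (loads[g], g))


def place_dirs(ngroups, count):
    """Place `count` directories on an empty filesystem of `ngroups` groups.

    Returns the list of chosen group numbers, in creation order.
    """
    loads = [0] * ngroups
    chosen = []
    for _ in range(count):
        g = pick_lowest_numbered(loads)
        loads[g] += 1  # assumption: one directory is one unit of load
        chosen.append(g)
    return chosen


print(place_dirs(8, 3))  # prints [0, 1, 2]
```

With this rule the first few directories always land in the lowest numbered
groups, which is the bias towards the beginning of the disk mentioned above.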

The get_orlov_stat() function returns three metrics for a given block group;
these metrics (used_dirs, free_inodes, and free_blocks) are used to figure out
if one blockgroup is less heavily loaded than another.  If I create a bunch of
1-byte files, the free_inodes and free_blocks counts decrease by 1 every time,
as you'd expect.  However, when I create directories, only the free_blocks
count decreases--used_dirs and free_inodes remain the same!  This seemed very
suspicious to me, so I umounted and mounted the filesystem and reran my test.
free_inodes and used_dirs suddenly decreased by the number of directories that
I had created before the umount, but after the first mkdir, the free_inodes and
used_dirs counts did not change, just like before.

I then ran a loop wherein I create a directory and then a small file.  For
each dir/file creation, the free_inodes count decreased by 1, the used_dirs
count remained unchanged, and the free_blocks count decreased by 2.  Weird,
since I was pretty sure that even directories require an inode and a block.

I interpret this behavior to mean that free_inodes/used_dirs only get updated
at mount time and at file creation time, and furthermore are not being updated
when directories get created.  The fact that the counts _do_ suddenly decrease
across a umount/mount cycle confirms that directories do use inodes, which is
what I expect.  I wondered if this was a behavior of delalloc or something, but
-o nodelalloc did not change the behavior.  Nor did adding copious calls to
sync.
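
The interpretation above can be written down as a toy model.  This only
restates the hypothesis; it says nothing about how the kernel actually stores
these counters.  The class, its "disk" and "cached" dictionaries, and the
choice to model used_dirs as a count that grows with each directory are all
assumptions.

```python
class ToyGroup:
    """One block group with on-disk counters and a cached in-memory copy."""

    def __init__(self, free_inodes, free_blocks):
        self.disk = {"used_dirs": 0,
                     "free_inodes": free_inodes,
                     "free_blocks": free_blocks}
        self.cached = dict(self.disk)  # what the allocator would look at

    def create_file(self):
        # Hypothesis: file creation updates both copies.
        for counters in (self.disk, self.cached):
            counters["free_inodes"] -= 1
            counters["free_blocks"] -= 1

    def mkdir(self):
        # Hypothesis: mkdir updates the on-disk inode and directory counts,
        # but only the block count is kept current in the cached copy.
        self.disk["used_dirs"] += 1
        self.disk["free_inodes"] -= 1
        self.disk["free_blocks"] -= 1
        self.cached["free_blocks"] -= 1

    def remount(self):
        # Hypothesis: the cached copy is rebuilt from disk at mount time.
        self.cached = dict(self.disk)


g = ToyGroup(free_inodes=100, free_blocks=1000)
g.mkdir()
print(g.cached)  # free_blocks dropped to 999; the other two are unchanged
g.remount()
print(g.cached)  # the cached copy now matches what is on disk
```

In this model a mkdir followed by a small file creation lowers the cached
free_inodes by 1 and free_blocks by 2 while leaving used_dirs alone, which
matches the loop test described earlier.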

Unfortunately, this second behavior means that the "find the least full
blockgroup" code can use stale data in its comparisons.  Am I correct that
something is wrong here, or have I misinterpreted the code?  Is it /supposed/
to be the case that used_dirs reflects the number of directories in the
blockgroup at *mount time* and not at the current time?
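
To illustrate why stale counts matter, here is a small sketch in which the
comparison looks only at used_dirs.  That single-metric comparison, the
function name, and the numbers are assumptions chosen for the example; they
are not measurements.

```python
def fewest_dirs(used_dirs):
    """Return the group with the fewest directories (first one on a tie)."""
    return min(range(len(used_dirs)), key=lambda g: (used_dirs[g], g))


# Hypothetical state: five directories were created in group 0 since mount,
# but the counts the allocator sees still show the mount-time values.
at_mount_time = [0, 1, 2]
right_now = [5, 1, 2]

print(fewest_dirs(at_mount_time))  # prints 0
print(fewest_dirs(right_now))      # prints 1
```

With mount-time counts the model keeps choosing group 0, even though group 1
is the least loaded group by the current counts.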

--D