From: Chandan Rajendra <chandan@linux.vnet.ibm.com>
To: Dave Chinner <david@fromorbit.com>
Cc: dchinner@redhat.com, linux-xfs <linux-xfs@vger.kernel.org>,
"sandeen@sandeen.net" <sandeen@sandeen.net>,
"Darrick J . Wong" <darrick.wong@oracle.com>
Subject: Re: Clarification on inode alignment
Date: Fri, 24 Feb 2017 15:54:26 +0530
Message-ID: <2430065.bKG4XOb8hV@localhost.localdomain>
In-Reply-To: <20170224000852.GF17542@dastard>
On Friday, February 24, 2017 11:08:52 AM Dave Chinner wrote:
> > However I still don't understand how non-aligned inode clusters (64 inodes in
> > a single cluster) would break the inode number to disk location
> > arithmetic calculations. Could you please explain it?
>
> First, you need to understand the correct terminology: "inode
> cluster" vs "inode chunk"
>
> inode chunk: unit of inode allocation, always 64 contiguous inodes
> referenced by a single inobt record
>
> inode cluster: a buffer used to do inode IO, whose size is dependent
> on superblock flags and field values
>
> The inobt record records the AGBNO of the inode chunk, and
> internally then indexes the inodes in the chunk from 0 to 63.
>
> Inode clusters are independent of the inobt record. The number of
> inodes in a cluster buffer is dependent on the inode size and the
> inode cluster buffer size. The size of the inode cluster buffer is
> dependent on filesystem block size and inode alignment.
>
> Take, for example, v4 filesystem, 4k fsb, 256 byte inodes, with
> everything indexed to zero to show that the maths to derive
> everything from chunk_agbno + ino # is simple:
>
>            chunk @ agbno
>            +-------------------------------------------+
> ino#       0          16         32         48
>
> block           0          1          2          3
>            +----------+----------+----------+----------+
> agbno          +0         +1         +2         +3
>
> cluster               0                     1
>            +---------------------+---------------------+
> agbno                +0                    +2
>
> All nice and simple, yes? So we can clearly see that an inode number
> of AGBNO | INO# can be mapped to the physical block
>
> chunk_agbno + (INO# / inodes per block)
>
> And the cluster buffer physical location is:
>
> chunk_agbno + (INO# / inodes per cluster) * blocks per cluster
>
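A minimal sketch of the arithmetic above, in Python, using the numbers from the quoted example (4k blocks, 256-byte inodes). The two-block cluster size is read off the diagram, the function names are made up for illustration, and the cluster result is scaled by blocks per cluster so that it comes out as a block address (+0 and +2 in the diagram):

```python
# Toy model of the quoted example; not kernel code.
BLOCK_SIZE = 4096
INODE_SIZE = 256
BLOCKS_PER_CLUSTER = 2   # assumption: read off the diagram (clusters at +0 and +2)

INODES_PER_BLOCK = BLOCK_SIZE // INODE_SIZE                   # 16
INODES_PER_CLUSTER = INODES_PER_BLOCK * BLOCKS_PER_CLUSTER    # 32


def inode_block(chunk_agbno, ino):
    """AG block holding inode `ino` (0..63) of the chunk at chunk_agbno."""
    return chunk_agbno + ino // INODES_PER_BLOCK


def cluster_start(chunk_agbno, ino):
    """First AG block of the cluster buffer that covers inode `ino`."""
    return chunk_agbno + (ino // INODES_PER_CLUSTER) * BLOCKS_PER_CLUSTER
```

For a chunk at agbno 100, inode 48 lives in block 103 and its cluster buffer starts at block 102.
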
> Ok, so the math is simple (as you've noticed), but it doesn't
> explain the alignment constraints. The question is this:
> what assumption does this math make about the relationship
> between the inode number and the physical location of the inode?
>
> ....
>
> ....
>
> That's right, it assumes that chunk_agbno + INO# can only map to a
> single physical location and so never overlaps with another inode
> chunk. i.e. this cannot happen as a result of an inode free/alloc
> operation pair:
>
> Free:
>            chunk @ agbno
>            +-------------------------------------------+
> ino#       0          16         32         48
>
> Alloc:
>                                             chunk @ agbno+3
>                                             +-------------------------------------------+
> ino#                                        0          16         32         48
>
> If we have the overlapping chunk allocation ranges like this, then
> we can have multiple inode numbers that map to the same physical
> location. In the above case, both of the inode numbers (agbno | 48)
> and (agbno + 3 | 0) map to the same physical location but they have
> different cluster buffer addresses (i.e. agbno+2 vs agbno+3)
>
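The overlap case can be checked with a few lines of Python. This is only a sketch under the example's assumptions (16 inodes per block, two-block cluster buffers); `locate`, `cluster_blocks` and the agbno value 100 are hypothetical names and numbers chosen for illustration:

```python
INODES_PER_BLOCK = 16      # 4096-byte block / 256-byte inode
BLOCKS_PER_CLUSTER = 2     # assumption taken from the example's cluster layout
INODES_PER_CLUSTER = INODES_PER_BLOCK * BLOCKS_PER_CLUSTER


def locate(chunk_agbno, ino):
    """Return (block, cluster_start) for inode `ino` of the chunk at chunk_agbno."""
    block = chunk_agbno + ino // INODES_PER_BLOCK
    cluster = chunk_agbno + (ino // INODES_PER_CLUSTER) * BLOCKS_PER_CLUSTER
    return block, cluster


def cluster_blocks(cluster_start):
    """Set of AG blocks covered by the cluster buffer starting at cluster_start."""
    return set(range(cluster_start, cluster_start + BLOCKS_PER_CLUSTER))


AGBNO = 100                    # arbitrary example value
old = locate(AGBNO, 48)        # inode 48 of the freed chunk
new = locate(AGBNO + 3, 0)     # inode 0 of the newly allocated chunk
shared = cluster_blocks(old[1]) & cluster_blocks(new[1])
```

Both inode numbers resolve to block 103, but through cluster buffers starting at 102 and 103, and those two buffers share block 103. That is the overlapping-buffer situation the cache must never see.
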
> So, when you get an inode number, how do you know it is valid and
> you haven't raced with an unlink that just removed the underlying
> inode chunk? You can do a buffer lookup to see if it's stale, but
> that has all sorts of problems in that a key constraint is that we
> must not have overlapping buffers in the cache. How do we know what
> buffers we need to look up (and how do we do it in a race free
> manner) to ensure that all the original inode cluster buffers have
> been invalidated and their transactions committed during an
> allocation?
>
> IOWs, without jumping through all sorts of cluster buffer coherence
> validation hoops we end up with free vs allocation and free vs
> lookup races on the cluster buffers if we just use inode number
> conversions for physical buffer mapping. That's complex, costly and
> extremely error prone, so we essentially have to treat all inode
> numbers as untrusted because of these races.
>
> There are two ways to solve this problem.
> 1) always look up the inobt record for an inode number to
> get the chunk_agbno from the inobt as locking the AGI for
> lookup guarantees no alloc/lookup/free races can occur; or
>
> 2) Ensure that inode chunks never overlap by physically
> aligning them at allocation time, hence ensuring that every
> physical address maps to exactly one inode number and
> cluster buffer address.
>
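Option 2 can be illustrated with a small Python sketch. It assumes, purely for illustration, that the alignment equals the chunk length in blocks (4 blocks for 64 inodes of 256 bytes on 4k blocks); the real alignment value comes from the superblock and is not spelled out in this thread. The function names are hypothetical:

```python
BLOCKS_PER_CHUNK = 4   # 64 inodes * 256 bytes / 4096-byte blocks


def chunks_overlap(a, b):
    """True if the chunks starting at AG blocks a and b share any block."""
    return a < b + BLOCKS_PER_CHUNK and b < a + BLOCKS_PER_CHUNK


def aligned_chunk_start(agbno, alignment=BLOCKS_PER_CHUNK):
    """The only chunk start that can own block `agbno` when starts are aligned.

    Assumption: alignment == chunk length in blocks.
    """
    return agbno - agbno % alignment


def any_overlap(starts):
    """True if any two distinct chunk starts in `starts` overlap."""
    starts = list(starts)
    return any(chunks_overlap(a, b)
               for i, a in enumerate(starts)
               for b in starts[i + 1:])
```

With unaligned starts, chunks at 100 and 103 overlap. Once starts are restricted to multiples of the alignment, every block belongs to exactly one possible chunk, so a block address identifies a single inode number.
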
> XFS implemented 1) back in 1994 when inode cluster buffers were
> introduced. The issue with this is that doing an inobt lookup every
> time we want to map an inode number is excitingly expensive. If
> we know the inode number is correct (i.e. it comes from other internal
> metadata that we've already validated), then this is overhead we can
> avoid if we constrain the disk format via method 2).
>
> That was done more than 20 years ago:
>
> commit 07d3e5d3764a8cf02d2e40397da0018c5c60f70a
> Author: Doug Doucette <doucette@engr.sgi.com>
> Date: Tue Jun 4 19:08:18 1996 +0000
>
> Support for aligned inode allocation (bug 385316). Support for
> superblock versioning (bug 385292). Some cleanup.
>
> And we've used aligned inodes ever since....
>
Dave, thanks a lot for explaining the decisions behind the inode
alignment requirement.
--
chandan