From: Brian Foster <bfoster@redhat.com>
To: Dave Chinner <david@fromorbit.com>
Cc: xfs@oss.sgi.com
Subject: Re: [PATCH RFC 00/18] xfs: sparse inode chunks
Date: Mon, 28 Jul 2014 08:14:01 -0400 [thread overview]
Message-ID: <20140728121400.GB53501@bfoster.bfoster> (raw)
In-Reply-To: <20140726000335.GE20518@dastard>
On Sat, Jul 26, 2014 at 10:03:35AM +1000, Dave Chinner wrote:
> On Fri, Jul 25, 2014 at 12:30:57PM -0400, Brian Foster wrote:
> > On Fri, Jul 25, 2014 at 08:32:11AM +1000, Dave Chinner wrote:
> > > On Thu, Jul 24, 2014 at 10:22:50AM -0400, Brian Foster wrote:
> > > > Hi all,
> > > >
> > > > This is a first pass at sparse inode chunk support for XFS. Some
> > > > background on this work is available here:
> > > >
> > > > http://oss.sgi.com/archives/xfs/2013-08/msg00346.html
> > > >
> > > > The basic idea is to allow the partial allocation of inode chunks into
> > > > fragmented regions of free space. This is accomplished through addition
> > > > of a holemask field into the inobt record that defines what portion(s)
> > > > of an inode chunk are invalid (i.e., holes in the chunk). This work is
> > > > not quite complete, but is at a point where I'd like to start getting
> > > > feedback on the design and what direction to take for some of the known
> > > > gaps.
> > > >
> > > > The basic breakdown of functionality in this set is as follows:
> > > >
> > > > - Patches 1-2 - A couple generic cleanups that are dependencies for later
> > > > patches in the series.
> > > > - Patches 3-5 - Basic data structure update, feature bit and minor
> > > > helper introduction.
> > > > - Patches 6-7 - Update v5 icreate logging and recovery to handle sparse
> > > > inode records.
> > > > - Patches 8-13 - Allocation support for sparse inode records. Physical
> > > > chunk allocation and individual inode allocation.
> > > > - Patches 14-16 - Deallocation support for sparse inode chunks. Physical
> > > > chunk deallocation, individual inode free and cluster free.
> > > > - Patch 17 - Fixes for bulkstat/inumbers.
> > > > - Patch 18 - Activate support for sparse chunk allocation and
> > > > processing.
> > > >
> > > > This work is lightly tested for regression (some xfstests failures due
> > > > to repair) and basic functionality. I have a new xfstests test I'll
> > > > forward along for demonstration purposes.
> > > >
> > > > Some notes on gaps in the design:
> > > >
> > > > - Sparse inode chunk allocation granularity:
> > > >
> > > > The current minimum sparse chunk allocation granularity is the cluster
> > > > size.
> > >
> > > Looking at the patchset (I got to patch 5 that first uses this),
> > > this is problematic. the cluster size is currently a kernel
> > > implementation detail, and not something defined by the on-disk
> > > format. We can change the cluster size in the kernel and not affect
> > > the format on disk. Making the cluster size a part of the disk
> > > format by defining it to be the resolution of sparse inode chunks
> > > changes that - it's now a part of the on-disk inode format, and that
> > > greatly limits what we can do with it.
> > >
> >
> > I was going off the inoalignmt bit that's set at mkfs time, but looking
> > at the code I think I see what you mean...
> >
> > > > My initial attempts at this work tried to redefine to the minimum
> > > > chunk length based on the holemask granularity (a la the stale macro I
> > > > seemingly left in this series ;), but this involves tweaking the
> > > > codepaths that use the cluster size (i.e., imap) which proved rather
> > > > hairy.
> > >
> > > This is where we need to head towards, though. The cluster size is
> > > currently the unit of inode IO, so that needs to be influenced by the
> > > sparse inode chunk granularity. Yes, we can define the inode chunk
> > > granularity to be the same as the cluster size, but that simply
> > > means we need to configure the cluster size appropriately at mount.
> > > It doesn't mean we need to change what cluster size means or it's
> > > implementation....
> > >
> >
> > I don't think I've necessarily encoded the cluster size into the disk
> > format explicitly, but rather used it as a heuristic for how to size the
> > sparse inode chunk allocations. I think what you're saying is that the
> > cluster size can change across a mount or code update, while the on-disk
> > allocation state will not. Thus we can't always go on the assumption
> > that the on-disk allocations will be sane with regard to the current
> > cluster size. For example, I suppose if the cluster size increased after
> > we have some existing sparse allocations we'd end up with some broken
> > buffer mappings for inode chunks.
>
> Yup, that's the problem in a nutshell: cluster size is not fixed.
>
> > > Again, that's kernel inode buffer cache implementaiton details, not
> > > something that matters for the on-disk format. So really these need
> > > to be separated. Probably means we need a "sparse inode allocation
> > > alignment" field in the superblock to define this. Having
> > > the kernel reject sparse alignments it can't support from the
> > > initial implementation means we can improve the kernel
> > > implementation over time and (eventually) support sub-cluster sized
> > > sparse inode allocation.
> > >
> > > i.e. initial implementation only supports sparse alignment ==
> > > cluster size, and rejects everything else....
> > >
> >
> > Yeah, that's pretty much what I was trying to accomplish, just not
> > encoded properly with regard to the superblock options. E.g., I
> > overloaded the cluster size in a fragile way. Instead, explicitly set
> > the sparse inode allocation granularity and let that influence the
> > cluster size. I suspect that means we could even still use the cluster
> > size heuristic to define the default alloc. granularity, but of course
> > we aren't married to it so can evaluate that independently. That sounds
> > much nicer, thanks.
>
> Right. We may not do anything else in the short or medium term, but
> having a separate superblock field gives us the flexibility for
> change the inode cluster implementation/behaviour in the future...
>
> > > dynamically, because it means the same upgrade/downdgrade cycle can
> > > have different results simply due to filesystem freespace
> > > fragmentation patterns...
> > >
> > > Perhaps an xfs-admin command to turn the feature on dynamicallyi for
> > > existing filesystems, kind of like what we did with lazy superblock
> > > counters when they were introduced?
> > >
> >
> > Hmm, I suppose that does create a new and interesting dynamic with
> > regard to the feature bit (non-deterministic backwards compatibility).
> > One could certainly value backwards compatibility over this particular
> > feature, and there is currently no way to control it. I'll look into
> > doing something with xfs_admin. In fact, I was thinking of adding
> > something to tune the cluster size bit to get around the v5 scaling
> > issue anyways.
>
> What v5 scalability issue is that? I don't recall any outstanding
> issues with inode cluster IO....
>
There's no scalability issue... I'm just referring to the fact that we
scale the cluster size by the inode size increase factor on v5
superblocks.
E.g., my free space fragmentation xfstests test started out with a fixed
file size based on something close to the worst case with an
implementation that used the allocation granularity of max(<holemask bit
granularity>, <inodes per block>). Once I tied the implementation to the
cluster size due to the aforementioned complexities, it became apparent
the test was less effective with my chosen file size on v5 supers,
particularly as the inode size increased. So from there I was
considering a similar xfs_admin command a user could run to reduce the
cluster size as a backstop should this limitation arise in the real
world. We can start with doing something just to enable the feature as
outlined above and revisit this then...
Brian
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
next prev parent reply other threads:[~2014-07-28 12:14 UTC|newest]
Thread overview: 50+ messages / expand[flat|nested] mbox.gz Atom feed top
2014-07-24 14:22 [PATCH RFC 00/18] xfs: sparse inode chunks Brian Foster
2014-07-24 14:22 ` [PATCH 01/18] xfs: refactor xfs_inobt_insert() to eliminate loop and support variable count Brian Foster
2014-07-24 22:10 ` Dave Chinner
2014-07-28 16:03 ` Brian Foster
2014-07-28 23:32 ` Dave Chinner
2014-07-29 14:43 ` Brian Foster
2014-07-24 14:22 ` [PATCH 02/18] xfs: pass xfs_mount directly to xfs_ialloc_cluster_alignment() Brian Foster
2014-07-24 14:22 ` [PATCH 03/18] xfs: define sparse inode chunks v5 sb feature bit and helper function Brian Foster
2014-07-24 17:08 ` Mark Tinguely
2014-07-24 17:37 ` Brian Foster
2014-07-24 18:38 ` Mark Tinguely
2014-07-24 19:38 ` Brian Foster
2014-07-24 23:35 ` Dave Chinner
2014-07-24 14:22 ` [PATCH 04/18] xfs: introduce inode record hole mask for sparse inode chunks Brian Foster
2014-07-24 22:14 ` Dave Chinner
2014-07-28 16:16 ` Brian Foster
2014-08-07 15:18 ` Brian Foster
2014-07-24 14:22 ` [PATCH 05/18] xfs: create macros/helpers for dealing with " Brian Foster
2014-07-24 22:13 ` Dave Chinner
2014-07-24 14:22 ` [PATCH 06/18] xfs: pass inode count through ordered icreate log item Brian Foster
2014-07-24 14:22 ` [PATCH 07/18] xfs: handle sparse inode chunks in icreate log recovery Brian Foster
2014-07-24 14:22 ` [PATCH 08/18] xfs: create helper to manage record overlap for sparse inode chunks Brian Foster
2014-07-24 22:41 ` Dave Chinner
2014-07-28 16:19 ` Brian Foster
2014-07-29 0:07 ` Dave Chinner
2014-07-29 15:10 ` Brian Foster
2014-07-24 14:22 ` [PATCH 09/18] xfs: allocate sparse inode chunks on full chunk allocation failure Brian Foster
2014-07-24 14:23 ` [PATCH 10/18] xfs: set sparse inodes feature bit when a sparse chunk is allocated Brian Foster
2014-07-24 22:46 ` Dave Chinner
2014-07-28 16:23 ` Brian Foster
2014-07-24 14:23 ` [PATCH 11/18] xfs: reduce min. inode allocation space requirement for sparse inode chunks Brian Foster
2014-07-24 22:50 ` Dave Chinner
2014-07-24 14:23 ` [PATCH 12/18] xfs: helper to convert inobt record holemask to inode alloc. bitmap Brian Foster
2014-07-24 23:21 ` Dave Chinner
2014-07-24 14:23 ` [PATCH 13/18] xfs: filter out sparse regions from individual inode allocation Brian Foster
2014-07-24 14:23 ` [PATCH 14/18] xfs: update free inode record logic to support sparse inode records Brian Foster
2014-07-24 14:23 ` [PATCH 15/18] xfs: only free allocated regions of inode chunks Brian Foster
2014-07-24 23:24 ` Dave Chinner
2014-07-24 14:23 ` [PATCH 16/18] xfs: skip unallocated regions of inode chunks in xfs_ifree_cluster() Brian Foster
2014-07-24 14:23 ` [PATCH 17/18] xfs: use actual inode count for sparse records in bulkstat/inumbers Brian Foster
2014-07-24 23:29 ` Dave Chinner
2014-07-24 14:23 ` [PATCH 18/18] xfs: enable sparse inode chunks for v5 superblocks Brian Foster
2014-07-24 23:34 ` Dave Chinner
2014-07-24 16:28 ` [PATCH RFC 00/18] xfs: sparse inode chunks Brian Foster
2014-07-24 22:32 ` Dave Chinner
2014-07-25 16:30 ` Brian Foster
2014-07-26 0:03 ` Dave Chinner
2014-07-28 12:14 ` Brian Foster [this message]
2014-07-29 0:26 ` Dave Chinner
2014-07-29 15:25 ` Brian Foster
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20140728121400.GB53501@bfoster.bfoster \
--to=bfoster@redhat.com \
--cc=david@fromorbit.com \
--cc=xfs@oss.sgi.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox