Re: topics for the file system mini-summit

From: Russell Cattelan <cattelan@thebarn.com>
To: Matthew Wilcox <matthew@wil.cx>
Cc: Valerie Henson <val_henson@linux.intel.com>,
	Ric Wheeler <ric@emc.com>,
	linux-fsdevel@vger.kernel.org,
	Arjan van de Ven <arjan@linux.intel.com>
Subject: Re: topics for the file system mini-summit
Date: Thu, 01 Jun 2006 15:06:19 -0500
Message-ID: <447F48BB.7080105@thebarn.com>
In-Reply-To: <20060601124517.GC32143@parisc-linux.org>

Matthew Wilcox wrote:

>On Wed, May 31, 2006 at 08:24:18PM -0700, Valerie Henson wrote:
>  
>
>>Actually, the continuation inode is in B.  When we create a link in
>>directory A to file C, a continuation inode for directory A is created
>>in domain B, and a block containing the link to file C is allocated
>>inside domain B as well.  So there is no continuation inode in domain
>>A.
>>
>>That being said, this idea is at the hand-waving stage and probably
>>has many other (hopefully non-fatal) flaws.  Thanks for taking a look!
>>    
>>
>
>OK, so we really have two kinds of continuation inodes, and it might be
>sensible to name them differently.  We have "here's some extra data for
>that inode over there" and "here's a hardlink from another domain".  I
>dub the first one a 'continuation inode' and the second a 'shadow inode'.
>
>Continuation inodes and shadow inodes both suffer from the problem
>that they might be unwittingly orphaned, unless they have some kind of
>back-link to their referrer.  That seems more soluble though.  The domain
>B minifsck can check to see if the backlinked inode or directory is
>still there.  If the domain A minifsck prunes something which has a link
>to domain B, it should be able to just remove the continuation/shadow
>inode there, without fscking domain B.
>
>Another advantage to this is that inodes never refer to blocks outside
>their zone, so we can forget about all this '64-bit block number' crap.
>We don't even need 64-bit inode numbers -- we can use special direntries
>for shadow inodes, and inodes which refer to continuation inodes need
>a new encoding scheme anyway.  Normal inodes would remain 32-bit and
>refer to the local domain, and shadow/continuation inode numbers would
>be 32-bits of domain, plus 32-bits of inode within that domain.
>
>So I like this ;-)
>
>  
>
>>>Surely XFS must have a more elegant solution than this?
>>>      
>>>
XFS may be a bit better suited to this "encapsulated" form of
inode/directory management, since its AGs already try to keep metadata
close to the file data.
So it would be quite feasible to offline a particular AG and run a
consistency check on it.

But yes, hard links pose the same problem as the one being discussed here.
File data can also span AGs, and thus create interdependencies between AGs
in terms of both the file data and the metadata blocks that manage the
extents.
But the idea of creating continuation inodes seems like a good one.
For XFS it might be better to do this at the AG level, so that as soon as
a hard link in one AG refers to an inode in another AG, the AGs are
flagged as being linked.
This would allow any form of interdependent data to be grouped
(quotas, extended attributes, etc.).
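
As a rough sketch of the AG-level flagging, assuming a fixed AG count and
a simple union-find over AG numbers (AG_COUNT and every function name here
are made up for illustration, none of this is existing XFS code):

```c
#include <assert.h>
#include <stdint.h>

#define AG_COUNT 16u    /* hypothetical number of allocation groups */

/* ag_parent[i] == i means AG i is the representative of its linked group. */
static uint32_t ag_parent[AG_COUNT];

/* Start with every AG in a group of its own (no cross-AG references). */
void ag_groups_init(void)
{
        for (uint32_t i = 0; i < AG_COUNT; i++)
                ag_parent[i] = i;
}

/* Find the representative of the group that 'ag' belongs to. */
uint32_t ag_group_of(uint32_t ag)
{
        assert(ag < AG_COUNT);
        while (ag_parent[ag] != ag) {
                /* Path halving keeps later lookups short. */
                ag_parent[ag] = ag_parent[ag_parent[ag]];
                ag = ag_parent[ag];
        }
        return ag;
}

/* Record that something in AG 'from' (e.g. a hard link) refers to AG 'to'. */
void ag_mark_linked(uint32_t from, uint32_t to)
{
        uint32_t a = ag_group_of(from);
        uint32_t b = ag_group_of(to);

        if (a != b)
                ag_parent[b] = a;
}

/* Nonzero if the two AGs would have to be checked together. */
int ag_must_check_together(uint32_t x, uint32_t y)
{
        return ag_group_of(x) == ag_group_of(y);
}
```

With this, after recording links 1 -> 5 and 5 -> 7, offlining AG 1 for a
check would pull in AGs 5 and 7 but leave AG 2 alone.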

>>val@goober:/usr/src/linux-2.6.16.19$ wc -l `find fs/xfs/ -type f`
>>[snip]
>> 109083 total
>>    
>>
>
>Well, yes.  I think that inside the Linux XFS implementation there's a
>small and neat filesystem struggling to get out.  Once SGI finally dies,
>perhaps we can rip out all the CXFS stubs and IRIX combatability.  Then
>we might be able to see it.
>
>For fun, if you're a masochist, try to follow the code flow for
>something easy like fsync().
>
>const struct file_operations xfs_file_operations = {
>        .fsync          = xfs_file_fsync,
>}
>
>xfs_file_fsync(struct file *filp, struct dentry *dentry, int datasync)
>{
>        struct inode    *inode = dentry->d_inode;
>        vnode_t         *vp = vn_from_inode(inode);
>        int             error;
>        int             flags = FSYNC_WAIT;
>
>        if (datasync)
>                flags |= FSYNC_DATA;
>        VOP_FSYNC(vp, flags, NULL, (xfs_off_t)0, (xfs_off_t)-1, error);
>        return -error;
>}
>
>#define _VOP_(op, vp)   (*((vnodeops_t *)(vp)->v_fops)->op)
>  
>
Don't forget the extremely hard to untangle behaviors.
#define VNHEAD(vp)    ((vp)->v_bh.bh_first)
#define VOP(op, vp)    (*((bhv_vnodeops_t *)VNHEAD(vp)->bd_ops)->op)

Which I won't even try to explain cuz they confuse the crap out of me.
But that is what CXFS uses to create different call chains.

Oh, and note that to make things even more evil, the call chains are
dynamically changed based on whether an inode has a client or not.
So with no CXFS client the call chain is about the same as local XFS,
but when a client comes in, CXFS will insert more behaviors / VOPs
that hook up all the cluster management stuff for that inode.
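
To illustrate the shape of the problem, here is a stripped-down user-space
sketch of a behavior chain where dispatch always goes through the head of
the chain. The struct layout and names are simplified stand-ins loosely
modelled on the macros above, not the real XFS/CXFS definitions, and the
return values are only markers showing which path ran:

```c
#include <assert.h>
#include <stddef.h>

struct bhv_desc;

/* One operation per table keeps the sketch small. */
struct bhv_ops {
        int (*fsync)(struct bhv_desc *bd);
};

/* A behavior: an ops table plus a link to the next behavior down the chain. */
struct bhv_desc {
        const struct bhv_ops *bd_ops;
        struct bhv_desc *bd_next;
};

struct vnode {
        struct bhv_desc *bh_first;      /* head of the behavior chain */
};

/* Dispatch through whatever behavior currently sits at the head. */
#define VOP_FSYNC(vp) ((vp)->bh_first->bd_ops->fsync((vp)->bh_first))

/* Local path: returns 1 as a marker (hypothetical value). */
int local_fsync(struct bhv_desc *bd)
{
        (void)bd;
        return 1;
}

/* Cluster path: adds 10 as a marker, then hands off to the next behavior. */
int cluster_fsync(struct bhv_desc *bd)
{
        assert(bd->bd_next != NULL);
        return 10 + bd->bd_next->bd_ops->fsync(bd->bd_next);
}

const struct bhv_ops local_ops = { .fsync = local_fsync };
const struct bhv_ops cluster_ops = { .fsync = cluster_fsync };

/* Push a behavior on top of the chain, changing what VOP_FSYNC reaches. */
void bhv_insert_head(struct vnode *vp, struct bhv_desc *bd)
{
        bd->bd_next = vp->bh_first;
        vp->bh_first = bd;
}
```

The caller's VOP_FSYNC(vp) line never changes, but what it ends up running
depends on which behaviors have been inserted at that moment, which is
what makes the code flow so hard to follow by reading.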

>#define VOP_FSYNC(vp,f,cr,b,e,rv)                                       \
>        rv = _VOP_(vop_fsync, vp)((vp)->v_fbhv,f,cr,b,e)
>
>vnodeops_t xfs_vnodeops = {
>        .vop_fsync              = xfs_fsync,
>}
>
>Finally, xfs_fsync actually does the work.  The best bit about all this
>abstraction is that there's only one xfs_vnodeops defined!  So this could
>all be done with an xfs_file_fsync() that munged its parameters and called
>xfs_fsync() directly.  That wouldn't even affect IRIX combatability,
>but it would make life difficult for CXFS, apparently.
>
>  
>
So some of my ex co-workers at SGI will disagree with the following but...

The VOPs that are left in XFS are completely pointless at this point.
Since XFS never has anything other than one call chain, it shouldn't
have to deal with all that stuff in local mode.

All the behavior call chaining should be handled by CXFS, and thus all
the VOP code should be pushed to that code base.  I think there are 4 VOP
calls that are used internally by XFS, and as such the callers of those
VOPs may need something else that provides a way of re-entering the call
chain at the top.

I have done some of the work in terms of just replacing the VOP calls
with straight calls to the final functions, in the hopes of tossing the
vnodeops out of XFS.
And I spec'd out a way of fixing CXFS to deal with the VOPs internally,
but unfortunately that kind of work will always fall under the
ENORESOURCES category.
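
For the fsync case quoted above, the straight-call version would look
roughly like this. It is a user-space sketch only: the types, the flag
values, last_flags and xfs_fsync_stub() are stand-ins made up so that it
compiles outside the kernel, and the argument order simply follows the
VOP_FSYNC macro quoted above.

```c
#include <assert.h>
#include <stddef.h>

typedef long long xfs_off_t;            /* stand-in for the kernel type */
#define FSYNC_WAIT 0x1                  /* flag values are placeholders */
#define FSYNC_DATA 0x2

struct bhv_desc { int unused; };        /* stand-in behavior descriptor */
struct vnode { struct bhv_desc *v_fbhv; };

int last_flags;                         /* records what the stub was asked to do */

/* Stand-in for the function that does the real work; returns 0 on success. */
int xfs_fsync_stub(struct bhv_desc *bdp, int flag, void *credp,
                   xfs_off_t start, xfs_off_t stop)
{
        (void)bdp; (void)credp; (void)start; (void)stop;
        last_flags = flag;
        return 0;
}

/* Same parameter munging as the quoted xfs_file_fsync(), minus the macros. */
int xfs_file_fsync_direct(struct vnode *vp, int datasync)
{
        int flags = FSYNC_WAIT;

        assert(vp != NULL);
        if (datasync)
                flags |= FSYNC_DATA;
        /* Direct call in place of VOP_FSYNC(); negate to get a Linux errno. */
        return -xfs_fsync_stub(vp->v_fbhv, flags, NULL,
                               (xfs_off_t)0, (xfs_off_t)-1);
}
```

One function pointer hop (the file_operations entry) remains, and the
rest of the path can be read top to bottom.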

I know SGI will never take it as long as CXFS lives, but maybe someday
when SGI finally fizzles... :-)

Ohh, and the whole IRIX compat is crap at this point, since many of the
VOP call params have been changed to match Linux params.

>http://oss.sgi.com/projects/xfs/mail_archive/200308/msg00214.html
