Re: [PATCH 2/2] Large EAs

From: Kalpak Shah <Kalpak.Shah@Sun.COM>
To: Theodore Tso <tytso@mit.edu>
Cc: Andreas Dilger <adilger@Sun.COM>,
	Kalpak Shah <kalpak.shah@gmail.com>,
	linux-ext4 <linux-ext4@vger.kernel.org>,
	Mingming Cao <cmm@us.ibm.com>
Subject: Re: [PATCH 2/2] Large EAs
Date: Wed, 03 Dec 2008 16:08:27 +0530
Message-ID: <1228300707.3121.71.camel@localhost>
In-Reply-To: <20081127003555.GD14101@mit.edu>

Since we need to make sure that inodes are not used very frequently for
storing EAs, the following design was discussed on the ext4 concall:

xattrs with blocksize/2 < ea_size <= blocksize are stored in a block of
their own, whose number is referenced directly from the
ext4_xattr_entry (using some unique combination of bits to encode that
the entry references a block instead of an inode, and finding space to
store a 48-bit block number). xattrs with ea_size > blocksize are
stored in an inode of their own, referenced from the entry.
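
As a rough illustration of the size classes above, here is a minimal
sketch in C. The enum and function names are made up for this example
and are not from the patch; the case ea_size <= blocksize/2 (value
stays in the normal EA block) is taken from Ted's mail quoted below.

```c
#include <stddef.h>

/* Hypothetical names, for illustration only; not part of the patch. */
enum ea_placement {
	EA_IN_XATTR_BLOCK,	/* ea_size <= blocksize/2 */
	EA_IN_OWN_BLOCK,	/* blocksize/2 < ea_size <= blocksize */
	EA_IN_INODE,		/* ea_size > blocksize */
};

/* Pick where an EA value of ea_size bytes would live under this design. */
static enum ea_placement ea_classify(size_t ea_size, size_t blocksize)
{
	if (ea_size > blocksize)
		return EA_IN_INODE;
	if (ea_size > blocksize / 2)
		return EA_IN_OWN_BLOCK;
	return EA_IN_XATTR_BLOCK;
}
```

With a 4kB blocksize, a 3000-byte value would get its own block and a
5000-byte value would get its own inode.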

During the discussion, Andreas suggested another idea that avoids the
need to point at blocks from the ext4_xattr_entry:

Use mballoc to try to find up to 64kB of contiguous blocks to store
smaller xattrs. The ext4_xattr_header has an h_blocks field which we
can use to indicate the number of contiguous blocks that are allocated
for this inode's xattrs.

The ext4_xattr_entry has a 16-bit block offset that can be used to
point anywhere within a 64kB block.  This not only allows many more
small xattrs to be stored efficiently, but also mid-sized xattrs (<=
blocksize) can be handled efficiently because the data will be packed
into the single group of blocks.  It also avoids the need to reference
block numbers from the ext4_xattr_entry directly, which is ugly.
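
To make the addressing concrete, here is a minimal sketch in C of how a
16-bit offset could be split across a contiguous run of blocks. It
assumes the offset is counted from the start of the first block in the
run and that blocksize is a power of two no larger than 64kB. The
helper and struct names are hypothetical and nothing here is taken from
an actual patch.

```c
#include <stdint.h>

#define EA_REGION_MAX	65536u	/* reach of a 16-bit offset: 64kB */

/* Hypothetical helper: most contiguous blocks a 16-bit offset can cover. */
static uint32_t ea_max_blocks(uint32_t blocksize)
{
	return EA_REGION_MAX / blocksize;
}

/* Hypothetical result type: where a value starts inside the run. */
struct ea_location {
	uint32_t block;		/* index within the contiguous run, 0-based */
	uint32_t offset;	/* byte offset inside that block */
};

/* Split a 16-bit region offset into (block index, offset in block). */
static struct ea_location ea_locate(uint16_t value_offs, uint32_t blocksize)
{
	struct ea_location loc = {
		.block  = value_offs / blocksize,
		.offset = value_offs % blocksize,
	};
	return loc;
}
```

With a 4kB blocksize this gives a run of at most 16 blocks, and h_blocks
would record how many of those are actually allocated.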

Comments?

Thanks,
Kalpak

On Wed, 2008-11-26 at 19:35 -0500, Theodore Tso wrote:
> On Wed, Nov 26, 2008 at 02:49:29PM -0700, Andreas Dilger wrote:
> > 
> > One benefit I think is that at least the orphaned EA inode can be
> > cleaned up instead of lingering in the middle of the shared EA tree.
> > 
> > Another benefit of having separate EAs is that it makes it tractable to
> > modify very large EAs.  Otherwise, if there are a number of large
> > EAs shared in a single tree they would all have to be modified in order
> > to store a larger value for an EA in the middle of the tree.
> 
> I guess I didn't make myself clear.  I was *not* suggesting that we
> share EA's in one inode, or in one extent tree.  Instead, what I
> suggested was that instead of having a pointer to an inode, if the
> value of the EA is less than half the blocksize, it is stored in the
> EA block.  If it is between 50% and 100% of the blocksize, instead of
> pointing at inode, we point to a block.  If it is greater than a
> blocksize, we point at a block containing an EA tree.  (Which means
> for a large EA the average space overhead is 6k --- 4k for the extent
> block, plus 2k for the fragmentation cost).
> 
> So this scheme very much uses separate EA's, and does not pack all of
> the EA's into a single tree.  It is deliberately kept simple precisely
> because like you I don't think it's worth it to optimize EA's.  On the
> other hand, running out of inodes is a big problem, and dynamic inodes
> is far more complicated an issue, especially if we don't have 64-bit
> inode support in the kernel and in userspace, and we need to worry
> about locality issues and how dynamic inodes work with online
> resizing. 
> 
> The tradeoff is that my scheme doesn't burn an inode for each large
> EA, but for EA's greater than a blocksize, we chew an extra block's
> worth of overhead.  Personally, I think it's a worthwhile tradeoff ---
> 
>    	       	  	  	     	     - Ted
