Re: [PATCH 2/2] Large EAs

From: Kalpak Shah <Kalpak.Shah@Sun.COM>
To: Theodore Tso <tytso@mit.edu>
Cc: Andreas Dilger <adilger@Sun.COM>,
	Kalpak Shah <kalpak.shah@gmail.com>,
	linux-ext4 <linux-ext4@vger.kernel.org>,
	Mingming Cao <cmm@us.ibm.com>
Subject: Re: [PATCH 2/2] Large EAs
Date: Wed, 03 Dec 2008 16:08:27 +0530
Message-ID: <1228300707.3121.71.camel@localhost>
In-Reply-To: <20081127003555.GD14101@mit.edu>

Since we want to avoid consuming an inode for every large EA, the
following design was discussed on the ext4 concall:

xattrs with blocksize/2 < ea_size <= blocksize are stored in a block
whose number is referenced directly from the ext4_xattr_entry (this
needs some unique combination of bits to encode that the entry
references a block instead of an inode, and space to store 48-bit block
numbers). xattrs with ea_size > blocksize are referenced via an
inode.
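
The size tiers above can be sketched as a small classifier. This is
only an illustration, not code from the patch: the enum and function
names are hypothetical, and the lowest tier (value kept in the EA block
when it is at most half a block) is taken from Ted's quoted mail below.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names for illustration; none of these exist in ext4. */
enum ea_placement {
	EA_IN_XATTR_BLOCK,	/* ea_size <= blocksize/2: value stays in the EA block */
	EA_IN_OWN_BLOCK,	/* blocksize/2 < ea_size <= blocksize: entry points at a block */
	EA_IN_INODE		/* ea_size > blocksize: entry points at an inode */
};

/* Pick the storage tier for an EA value of ea_size bytes. */
static enum ea_placement ea_classify(size_t ea_size, size_t blocksize)
{
	assert(blocksize > 0);

	if (ea_size <= blocksize / 2)
		return EA_IN_XATTR_BLOCK;
	if (ea_size <= blocksize)
		return EA_IN_OWN_BLOCK;
	return EA_IN_INODE;
}
```

With a 4096-byte block size, a 2048-byte value would stay in the EA
block, 2049 to 4096 bytes would get a block of its own, and anything
larger would go to an inode.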

During the discussion Andreas suggested another idea that avoids the
need to point at blocks from the ext4_xattr_entry:

Use mballoc to try to find up to 64kB of contiguous blocks to store
smaller xattrs. The ext4_xattr_header has an h_blocks field that we
can use to record the number of contiguous blocks allocated for this
inode's xattrs.

The ext4_xattr_entry has a 16-bit offset that can be used to point
anywhere within a 64kB region.  This not only allows many more
small xattrs to be stored efficiently, but also mid-sized xattrs (<=
blocksize) can be handled efficiently because the data will be packed
into the single group of blocks.  It also avoids the need to reference
block numbers from the ext4_xattr_entry directly, which is ugly.
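
To make the addressing concrete, here is a rough sketch of how a 16-bit
offset could be resolved inside a run of h_blocks contiguous blocks.
The struct and function names are hypothetical and this is not kernel
code; it only assumes that the offset counts bytes from the start of
the first block in the run.

```c
#include <assert.h>
#include <stdint.h>

#define EA_REGION_MAX 65536u	/* 2^16 bytes reachable with a 16-bit offset */

/* Hypothetical result type: which block of the run, and where in it. */
struct ea_location {
	uint32_t block_index;		/* 0 .. h_blocks - 1 */
	uint32_t offset_in_block;	/* 0 .. blocksize - 1 */
};

/* Largest run of contiguous blocks a 16-bit offset can fully address. */
static uint32_t ea_region_max_blocks(uint32_t blocksize)
{
	assert(blocksize > 0 && blocksize <= EA_REGION_MAX);
	return EA_REGION_MAX / blocksize;
}

/*
 * Resolve a 16-bit byte offset within the run.
 * Returns 0 on success, -1 if the offset lies beyond the h_blocks
 * blocks actually allocated.
 */
static int ea_locate(uint16_t value_offs, uint32_t h_blocks,
		     uint32_t blocksize, struct ea_location *loc)
{
	uint32_t idx;

	assert(blocksize > 0);

	idx = value_offs / blocksize;
	if (idx >= h_blocks)
		return -1;

	loc->block_index = idx;
	loc->offset_in_block = value_offs % blocksize;
	return 0;
}
```

With 4kB blocks this gives at most 16 blocks per run, and an offset of
5000 in a two-block run lands at byte 904 of the second block.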

Comments?

Thanks,
Kalpak

On Wed, 2008-11-26 at 19:35 -0500, Theodore Tso wrote:
> On Wed, Nov 26, 2008 at 02:49:29PM -0700, Andreas Dilger wrote:
> > 
> > One benefit I think is that at least the orphaned EA inode can be
> > cleaned up instead of lingering in the middle of the shared EA tree.
> > 
> > Another benefit of having separate EAs is that it makes it tractable to
> > modify very large EAs.  Otherwise, if there are a number of large
> > EAs shared in a single tree they would all have to be modified in order
> > to store a larger value for an EA in the middle of the tree.
> 
> I guess I didn't make myself clear.  I was *not* suggesting that we
> share EA's in one inode, or in one extent tree.  Instead, what I
> suggested was that instead of having a pointer to an inode, if the
> value of the EA is less than half the blocksize, it is stored in the
> EA block.  If it is between 50% and 100% of the blocksize, instead of
> pointing at inode, we point to a block.  If it is greater than a
> blocksize, we point at a block containing an EA tree.  (Which means
> for a large EA the average space overhead is 6k --- 4k for the extent
> block, plus 2k for the fragmentation cost).
> 
> So this scheme very much uses separate EA's, and does not pack all of
> the EA's into a single tree.  It is deliberately kept simple precisely
> because like you I don't think it's worth it to optimize EA's.  On the
> other hand, running out of inodes is a big problem, and dynamic inodes
> is far more complicated an issue, especially if we don't have 64-bit
> inode support in the kernel and in userspace, and we need to worry
> about locality issues and how dynamic inodes work with online
> resizing. 
> 
> The tradeoff is that my scheme doesn't burn an inode for each large
> EA, but for EA's greater than a blocksize, we chew an extra block's
> worth of overhead.  Personally, I think it's a worthwhile tradeoff ---
> 
>    	       	  	  	     	     - Ted

