linux-ext4.vger.kernel.org archive mirror
From: Ted Ts'o <tytso@mit.edu>
To: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: linux-ext4@vger.kernel.org
Subject: Re: [PATCH, RFC 00/12] bigalloc patchset
Date: Mon, 21 Mar 2011 09:24:15 -0400
Message-ID: <20110321132415.GI4135@thunk.org>
In-Reply-To: <5427513F-76B9-4315-AC17-4BF35B290B18@dilger.ca>

On Mon, Mar 21, 2011 at 09:55:07AM +0100, Andreas Dilger wrote:
> > 
> > The cost is decreased disk space efficiency.  Directories will consume
> > 1T, as will extent tree blocks.
>
> Presumably you mean "1M" here and not "1T"?

Yes; or more accurately, one allocation cluster (no matter what size
it might be).


> It would be a shame to waste another MB of space just to allocate
> 4kB for the next indirect block...  I guess it isn't clear to me why
> the index blocks need to be treated differently from file data
> blocks or directory blocks in this regard, since they both can use
> multiple blocks from the same cluster.  Being able to use the full
> cluster would allow 256 * 344 = 88064 extents, or 11TB to be
> addressed by the cluster of index blocks, which should be plenty.
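The arithmetic behind the quoted estimate can be reproduced with a short
sketch.  It assumes 4k blocks and a 1M cluster, takes the figure of 344
extents per index block directly from the quoted message, and assumes a
single extent can cover at most 32768 blocks; none of these constants
come from the patchset itself.

```python
KIB = 1024
MIB = 1024 * KIB
TIB = 1024 * 1024 * MIB

BLOCK_SIZE = 4 * KIB           # assumed file system block size
CLUSTER_SIZE = 1 * MIB         # assumed allocation cluster size
EXTENTS_PER_BLOCK = 344        # figure used in the quoted message
MAX_EXTENT_BLOCKS = 32768      # assumed maximum length of one extent, in blocks

# Number of 4k index blocks that fit in one 1M cluster.
blocks_per_cluster = CLUSTER_SIZE // BLOCK_SIZE

# Extents addressable if every block in the cluster is an index block.
extents = blocks_per_cluster * EXTENTS_PER_BLOCK

# Bytes addressable if every one of those extents has the maximum length.
addressable = extents * MAX_EXTENT_BLOCKS * BLOCK_SIZE

print(blocks_per_cluster, extents, addressable / TIB)
```

Under these assumptions the result is 88064 extents and 10.75 TiB, which
matches the "11TB" in the quoted text after rounding.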

There's a reason why I'm explicitly not supporting indirect blocks
with bigalloc, at least initially.  :-)

The reason this gets difficult with metadata blocks (directory
blocks excepted) is the problem of determining, at allocation time,
whether a given block in a cluster is in use, and, at free time,
whether all of the blocks in a cluster are no longer in use so that
the cluster can be released.  For data blocks we rely on the extent tree to
determine this, since clusters are aligned with respect to logical
block numbers --- that is, a physical cluster which is 1M starts on a
1M logical block boundary, and covers the logical blocks in that 1M
region.  So if you have a sparse file with one 4k block at offset 4,
and another 4k block at offset 1M+42, that file will consume _two_
clusters, not one.
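
The alignment rule above can be illustrated with a minimal sketch.  It
is not taken from the patchset: the function name clusters_used is
hypothetical, the offsets are treated as byte offsets into the file, and
a 1M cluster is assumed.

```python
CLUSTER_SIZE = 1024 * 1024     # 1M allocation cluster (assumed)


def clusters_used(byte_offsets, cluster_size=CLUSTER_SIZE):
    """Return the sorted logical cluster indexes that must be allocated.

    Each offset stands for one allocated 4k block containing that byte.
    Because clusters are aligned to logical offsets, the cluster index
    is simply the offset divided by the cluster size.
    """
    return sorted({off // cluster_size for off in byte_offsets})


# The sparse file from the example: one block near offset 4 and another
# near offset 1M+42 land in different logical clusters.
example = clusters_used([4, CLUSTER_SIZE + 42])
print(example, len(example))
```

Two blocks that both fall below the 1M boundary, say at offsets 4 and
42, would share a single cluster instead.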

But for file system metadata blocks, such as extent tree blocks, if we
want to allocate multiple blocks from the same cluster, we would need
some way of determining which blocks from that cluster have been
allocated so far.  I could add a bitmap to the first block in the
cluster, but that adds a lot of complexity.

One thing which I've thought about doing is to initialize a bitmap in
the first block of a cluster (and then use the second block), but to
only use one block per cluster for extent tree blocks --- at least for
now.  That would allow a future read-only extension to use multiple
blocks/cluster, and if I also implement checking the bitmap at free
time, it could be a fully backwards compatible extension.
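
As a rough illustration of the bitmap idea, here is a toy model of one
metadata cluster.  It is only a sketch under stated assumptions: the
class MetadataCluster and its methods are hypothetical, the bitmap is
modeled as a Python integer rather than an on-disk block, and 256 blocks
per cluster (1M cluster, 4k blocks) is assumed.

```python
BLOCKS_PER_CLUSTER = 256  # 1M cluster / 4k blocks (assumed)


class MetadataCluster:
    """Toy model of one cluster whose first block holds an in-use bitmap."""

    def __init__(self, blocks_per_cluster=BLOCKS_PER_CLUSTER):
        self.nblocks = blocks_per_cluster
        # Bit i set means block i of the cluster is in use.
        # Block 0 holds the bitmap itself, so it starts out marked.
        self.bitmap = 1

    def alloc_block(self):
        """Return the lowest free block index, or None if the cluster is full."""
        for i in range(1, self.nblocks):
            if not self.bitmap & (1 << i):
                self.bitmap |= 1 << i
                return i
        return None

    def free_block(self, i):
        """Release block i; return True if the whole cluster can now be freed."""
        if i <= 0 or i >= self.nblocks or not self.bitmap & (1 << i):
            raise ValueError("block %d is not an allocated metadata block" % i)
        self.bitmap &= ~(1 << i)
        # Only the bitmap block remains, so nothing else needs the cluster.
        return self.bitmap == 1
```

The first allocation returns index 1, that is, the second block of the
cluster.  The one-block-per-cluster restriction described above
corresponds to never calling alloc_block() a second time, while the
check in free_block() is what a later multi-block extension would rely
on.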

> Unfortunately, the overhead of allocating a whole cluster for every
> index block and every directory is fairly high.  For Lustre it
> matters very little, since there are only a handful of directories
> (under 40) on the data filesystems where this would be used and the
> real directory tree is located on a different metadata filesystem
> which probably wouldn't use this feature, but for most "normal"
> users this overhead may become prohibitive.  That is why I've been
> trying to think of a way to allow sub-cluster allocations for these
> uses.

I don't think it's that bad, if the cluster size is well chosen.  If
you know that most of your files are 4-8M, and you are using a 1M
cluster allocation size, most of the time you will be able to fit all
of the extents you need into the inode.  It's only for highly
fragmented file systems that you'll need more than 3 extents to store
8 clusters, no?  And for very large files, say 256M, an extra 1M
cluster for an extent tree block would be unfortunate, if it is needed,
but as a percentage of the file space used, it's not a complete deal
breaker.
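
To put a number on that percentage, a small sketch follows.  The helper
extent_block_overhead is hypothetical, and it assumes exactly one extra
cluster is spent on an extent tree block.

```python
MIB = 1024 * 1024


def extent_block_overhead(file_size, cluster_size=MIB):
    """Fraction of extra space when one whole cluster holds an extent tree block."""
    return cluster_size / file_size


# A 256M file with a 1M cluster size.
pct = 100 * extent_block_overhead(256 * MIB)
print("%.2f%%" % pct)
```

For a 256M file with 1M clusters this works out to 1/256, or about
0.39% of the file's size.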

> > Please comment!  I do not intend for these patches to be merged during
> > the 2.6.39 merge window.  I am targeting 2.6.40, 3 months from now,
> > since these patches are quite extensive.
> 
> Is that before or after e2fsck support for this will be done?  I'm
> rather reluctant to commit anything to the kernel that doesn't have
> e2fsck support in a released e2fsprogs.

I think getting the e2fsck changes done in 3 months really ought not
to be a problem...

						- Ted

