Re: block groups with no inode tables

linux-ext4.vger.kernel.org archive mirror

* Re: block groups with no inode tables
  2007-07-10 17:40   ` Dave Kleikamp
@ 2007-07-10 15:59     ` Mingming Cao
  2007-07-10 19:09       ` Dave Kleikamp
  0 siblings, 1 reply; 8+ messages in thread
From: Mingming Cao @ 2007-07-10 15:59 UTC (permalink / raw)
  To: Dave Kleikamp; +Cc: coly li, Jose R. Santos, linux-ext4@vger.kernel.org

On Tue, 2007-07-10 at 12:40 -0500, Dave Kleikamp wrote:
> On Wed, 2007-07-11 at 01:30 +0800, coly li wrote:
> > Hi, once we decide to do this, how about storing inode inside the
> > directory ?
> 
> Which directory?
I think Coly is referring to the idea of storing inodes inside the
directory file.

It's one way to implement the dynamic inode table allocation. With it
you don't have system-wide inode tables anymore, but all inode
structures are directly stored in the directory file.

> 
> > IMHO, the latter one is more attractive :-)
> 
> Sounds like a mess to me.  Consider ln and mv.
> 
> > Coly
> 


* block groups with no inode tables
@ 2007-07-10 17:12 Jose R. Santos
  2007-07-10 17:30 ` coly li
  2007-07-10 20:30 ` Theodore Tso
  0 siblings, 2 replies; 8+ messages in thread
From: Jose R. Santos @ 2007-07-10 17:12 UTC (permalink / raw)
  To: linux-ext4@vger.kernel.org

Hi folks,

As I play with the allocation of the metadata for the FLEX_BG feature,
it seems that we could benefit from having block groups with no inode
tables.  Right now we allocate one inode table per bg based on
inode_blocks_per_group.  For FLEX_BG though, it would make more sense
to have larger inode tables that fully use the inode bitmaps allocated
on the first few block groups.  Once we reach the number of inodes per
FLEX_BG, the remaining block groups could have no inode tables
defined.

The idea here is that we better utilize the inode bitmaps and reduce the
number of inode tables to improve mkfs/fsck times.  We could also
support expanding the inode tables later: some block groups would have
empty entries in their block group descriptors, so as long as we can
find enough free blocks for a new inode table, increasing the number of
inodes should be relatively easy.

Don't know if ext4 currently supports this.  Any thoughts?
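
A rough sketch of the arithmetic behind this idea, assuming one inode
bitmap block per group (so each bitmap can track block_size * 8 inodes)
and that the per-FLEX_BG inode count stays the same.  The function and
parameter names are made up for illustration; this is not mke2fs code.

```python
def flex_bg_inode_layout(groups_per_flex, inodes_per_group, block_size=4096):
    """Return (groups_with_tables, groups_without_tables) for one flex group."""
    # One bitmap block holds block_size * 8 bits, i.e. that many inodes.
    bitmap_capacity = block_size * 8
    # Keep the total inode count of the flex group unchanged.
    total_inodes = groups_per_flex * inodes_per_group
    # Pack the inodes into as few fully used bitmaps as possible (ceiling division).
    with_tables = -(-total_inodes // bitmap_capacity)
    return with_tables, groups_per_flex - with_tables


if __name__ == "__main__":
    # Hypothetical numbers: 16 groups per flex group, 8192 inodes per group.
    print(flex_bg_inode_layout(16, 8192))
```

With those hypothetical numbers, 4 of the 16 groups would carry inode
tables and the other 12 would have none.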

-JRS


* Re: block groups with no inode tables
  2007-07-10 17:12 block groups with no inode tables Jose R. Santos
@ 2007-07-10 17:30 ` coly li
  2007-07-10 17:40   ` Dave Kleikamp
  2007-07-10 20:30 ` Theodore Tso
  1 sibling, 1 reply; 8+ messages in thread
From: coly li @ 2007-07-10 17:30 UTC (permalink / raw)
  To: Jose R. Santos; +Cc: linux-ext4@vger.kernel.org

Hi, once we decide to do this, how about storing inode inside the
directory ?

IMHO, the latter one is more attractive :-)

Coly

On Tue, 2007-07-10 at 12:12 -0500, Jose R. Santos wrote:
> Hi folks,
> 
> As I play with the allocation of the metadata for the FLEX_BG feature,
> it seems that we could benefit from having block groups with no inode
> tables.  Right now we allocate one inode table per bg based on
> inode_blocks_per_group.  For FLEX_BG though, it would make more sense
> to have larger inode tables that fully use the inode bitmaps allocated
> on the first few block groups.  Once we reach the number of inodes per
> FLEX_BG, the remaining block groups could have no inode tables
> defined.
> 
> The idea here is that we better utilize the inode bitmaps and reduce the
> number of inode tables to improve mkfs/fsck times.  We could also
> support expanding the inode tables later: some block groups would have
> empty entries in their block group descriptors, so as long as we can
> find enough free blocks for a new inode table, increasing the number of
> inodes should be relatively easy.
> 
> Don't know if ext4 currently supports this.  Any thoughts?
> 
> -JRS


* Re: block groups with no inode tables
  2007-07-10 17:30 ` coly li
@ 2007-07-10 17:40   ` Dave Kleikamp
  2007-07-10 15:59     ` Mingming Cao
  0 siblings, 1 reply; 8+ messages in thread
From: Dave Kleikamp @ 2007-07-10 17:40 UTC (permalink / raw)
  To: coly li; +Cc: Jose R. Santos, linux-ext4@vger.kernel.org

On Wed, 2007-07-11 at 01:30 +0800, coly li wrote:
> Hi, once we decide to do this, how about storing inode inside the
> directory ?

Which directory?

> IMHO, the latter one is more attractive :-)

Sounds like a mess to me.  Consider ln and mv.

> Coly

-- 
David Kleikamp
IBM Linux Technology Center


* Re: block groups with no inode tables
  2007-07-10 15:59     ` Mingming Cao
@ 2007-07-10 19:09       ` Dave Kleikamp
  2007-07-11  4:50         ` Andreas Dilger
  0 siblings, 1 reply; 8+ messages in thread
From: Dave Kleikamp @ 2007-07-10 19:09 UTC (permalink / raw)
  To: cmm; +Cc: coly li, Jose R. Santos, linux-ext4@vger.kernel.org

On Tue, 2007-07-10 at 11:59 -0400, Mingming Cao wrote:
> On Tue, 2007-07-10 at 12:40 -0500, Dave Kleikamp wrote:
> > On Wed, 2007-07-11 at 01:30 +0800, coly li wrote:
> > > Hi, once we decide to do this, how about storing inode inside the
> > > directory ?
> > 
> > Which directory?
> I think Coly is referring to the idea of storing inodes inside the
> directory file.
> 
> It's one way to implement the dynamic inode table allocation. With it
> you don't have system-wide inode tables anymore, but all inode
> structures are directly stored in the directory file.

Assuming you mean the parent directory?  An inode isn't tied to a
specific parent.

	ln dir1/file1 dir2/
	mv dir1/file1 dir3/
	rmdir dir1

What happens to the inode?  I really don't think that the directory
is the right place to store an inode.
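
The sequence above can be replayed with Python's standard library to
see what is left afterwards.  This is only an illustration, assuming a
POSIX filesystem that supports hard links; the helper name is made up.

```python
import os
import tempfile


def hardlink_demo():
    """Replay the ln/mv/rmdir sequence and report what is left of the inode."""
    with tempfile.TemporaryDirectory() as root:
        for d in ("dir1", "dir2", "dir3"):
            os.mkdir(os.path.join(root, d))
        original = os.path.join(root, "dir1", "file1")
        with open(original, "w") as f:
            f.write("data\n")
        ino = os.stat(original).st_ino

        os.link(original, os.path.join(root, "dir2", "file1"))    # ln dir1/file1 dir2/
        os.rename(original, os.path.join(root, "dir3", "file1"))  # mv dir1/file1 dir3/
        os.rmdir(os.path.join(root, "dir1"))                      # rmdir dir1

        st2 = os.stat(os.path.join(root, "dir2", "file1"))
        st3 = os.stat(os.path.join(root, "dir3", "file1"))
        # Original inode number, the inode seen through each name, and the link count.
        return ino, st2.st_ino, st3.st_ino, st2.st_nlink


if __name__ == "__main__":
    print(hardlink_demo())
```

Both surviving names resolve to the same inode, with a link count of 2,
even though the directory where the file was created is gone.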
-- 
David Kleikamp
IBM Linux Technology Center


* Re: block groups with no inode tables
  2007-07-10 17:12 block groups with no inode tables Jose R. Santos
  2007-07-10 17:30 ` coly li
@ 2007-07-10 20:30 ` Theodore Tso
  2007-07-11  4:31   ` Andreas Dilger
  1 sibling, 1 reply; 8+ messages in thread
From: Theodore Tso @ 2007-07-10 20:30 UTC (permalink / raw)
  To: Jose R. Santos; +Cc: linux-ext4@vger.kernel.org

On Tue, Jul 10, 2007 at 12:12:21PM -0500, Jose R. Santos wrote:
> Hi folks,
> 
> As I play with the allocation of the metadata for the FLEX_BG feature,
> it seems that we could benefit from having block groups with no inode
> tables.  Right now we allocate one inode table per bg based on
> inode_blocks_per_group.  For FLEX_BG though, it would make more sense
> to have larger inode tables that fully use the inode bitmaps allocated
> on the first few block groups.  Once we reach the number of inodes per
> FLEX_BG, the remaining block groups could have no inode tables
> defined.
> 
> The idea here is that we better utilize the inode bitmaps and reduce the
> number of inode tables to improve mkfs/fsck times.  We could also
> support expanding the inode tables later: some block groups would have
> empty entries in their block group descriptors, so as long as we can
> find enough free blocks for a new inode table, increasing the number of
> inodes should be relatively easy.
> 
> Don't know if ext4 currently supports this.  Any thoughts?

Plans to support this are there; Andreas sent a patch back in April to
implement this, using bg_itable_unused, which is already reserved in
the block group data structure.  The idea here is to speed up fsck by
specifying how many inodes are actually in use in the block group, so
we don't have to initialize them until they are to be used.  This is
tied with the checksum patches, since doing this means we need to
really worry about the accuracy of the block group descriptors or we
could lose a lot of data if the block group descriptors are corrupted.
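
A toy model of why the two pieces go together, assuming the field counts
never-used inodes at the tail of the inode table.  The class and
function names are hypothetical, and zlib.crc32 is only a stand-in for
whatever checksum the real patches use.

```python
import zlib


class GroupDesc:
    """Toy block group descriptor; field names are illustrative only."""

    def __init__(self, inodes_per_group, itable_unused):
        self.inodes_per_group = inodes_per_group
        self.itable_unused = itable_unused  # modelled on bg_itable_unused
        self.checksum = self.compute_checksum()

    def compute_checksum(self):
        # Stand-in checksum over the fields we care about.
        payload = f"{self.inodes_per_group}:{self.itable_unused}".encode()
        return zlib.crc32(payload)


def inodes_to_scan(desc):
    """How many inodes a checker would have to look at in this group."""
    # A descriptor that fails its checksum cannot be trusted: scan everything.
    if desc.compute_checksum() != desc.checksum:
        return desc.inodes_per_group
    # Otherwise the unused tail of the table can be skipped.
    return desc.inodes_per_group - desc.itable_unused


if __name__ == "__main__":
    desc = GroupDesc(8192, 8000)
    print(inodes_to_scan(desc))
```

Without the checksum, a corrupted count would silently hide in-use
inodes from fsck, which is the data loss risk mentioned above.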

We also have something already implemented which does this on a
per-blockgroup basis.  That's the LAZY_BG feature, which was intended
for testing really big filesystems without needing to initialize all
of the inode tables.  In fact, mke2fs -O lazy_bg only initializes
the first and last blockgroups, in order to make sure we can force the
use of blocks at the very end of the filesystem, so we can find any
2**32 bit cleanliness problems, or other problems with really big
block numbers.
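
For a sense of scale, here is a small sketch of where the 2**32 boundary
falls, assuming the usual layout of block_size * 8 blocks per group.
The function names are made up; the set of initialized groups simply
follows the description above.

```python
def lazy_bg_initialized_groups(group_count):
    """Groups whose inode tables get written: the first and the last."""
    return sorted({0, group_count - 1})


def first_group_past_32bit(block_size=4096):
    """Return (group index, byte offset) where block numbers reach 2**32."""
    # One block bitmap block covers block_size * 8 blocks.
    blocks_per_group = block_size * 8
    # First block number that no longer fits in 32 bits.
    limit = 2 ** 32
    return limit // blocks_per_group, limit * block_size


if __name__ == "__main__":
    group, offset = first_group_past_32bit()
    print(group, offset // 2 ** 40, "TiB")
    print(lazy_bg_initialized_groups(group + 1))
```

With 4 KiB blocks the boundary sits at 16 TiB, so a test filesystem
needs more than 131072 block groups before its last group exercises
block numbers that do not fit in 32 bits.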

Regards,

						- Ted


* Re: block groups with no inode tables
  2007-07-10 20:30 ` Theodore Tso
@ 2007-07-11  4:31   ` Andreas Dilger
  0 siblings, 0 replies; 8+ messages in thread
From: Andreas Dilger @ 2007-07-11  4:31 UTC (permalink / raw)
  To: Theodore Tso; +Cc: Jose R. Santos, linux-ext4@vger.kernel.org

On Jul 10, 2007  16:30 -0400, Theodore Tso wrote:
> On Tue, Jul 10, 2007 at 12:12:21PM -0500, Jose R. Santos wrote:
> > As I play with the allocation of the metadata for the FLEX_BG feature,
> > it seems that we could benefit from having block groups with no inode
> > tables.  Right now we allocate one inode table per bg based on
> > inode_blocks_per_group.  For FLEX_BG though, it would make more sense
> > to have larger inode tables that fully use the inode bitmaps allocated
> > on the first few block groups.  Once we reach the number of inodes per
> > FLEX_BG, the remaining block groups could have no inode tables
> > defined.
> > 
> > The idea here is that we better utilize the inode bitmaps and reduce the
> > number of inode tables to improve mkfs/fsck times.  We could also
> > support expanding the inode tables later: some block groups would have
> > empty entries in their block group descriptors, so as long as we can
> > find enough free blocks for a new inode table, increasing the number of
> > inodes should be relatively easy.
> > 
> > Don't know if ext4 currently supports this.  Any thoughts?
> 
> Plans to support this are there; Andreas sent a patch back in April to
> implement this, using bg_itable_unused, which is already reserved in
> the block group data structure.  The idea here is to speed up fsck by
> specifying how many inodes are actually in use in the block group, so
> we don't have to initialize them until they are to be used.  This is
> tied with the checksum patches, since doing this means we need to
> really worry about the accuracy of the block group descriptors or we
> could lose a lot of data if the block group descriptors are corrupted.

I think Jose means something slightly different, but in the end the
uninit_groups feature (patches in the patch queue, but disabled for
some reason) essentially implements this.  We don't need to read inode
bitmaps from disk if the INODE_UNINIT flag is in the group.

I think all that is needed to get the semantics Jose wants is to tune
the inode allocation in ext4_new_inode() to avoid inode bitmaps that
are not yet initialized.  I suppose the other incremental feature would
be to allow the blocks in the inode table to be used for file allocation,
but this exposes us to potential malicious corruption in some cases if
users create "inode looking" data files (e.g. suid root inodes) on a full
filesystem and e2fsck is convinced to treat them as inodes.

We might instead limit this space to directories and indirect/index
blocks, which wouldn't be a bad idea but when we get to changing the
inode structures too much I'd like to combine several of the other
changes.
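
A minimal sketch of the allocation tweak described above: prefer groups
whose inode bitmap is already initialized, and touch an uninitialized
group only as a last resort.  This is a toy model, not
ext4_new_inode(); the dict layout and the flag value are illustrative
assumptions.

```python
INODE_UNINIT = 0x1  # flag name from the message; the value here is illustrative


def pick_group_for_new_inode(groups, start=0):
    """groups: list of dicts with 'flags' and 'free_inodes'.

    Returns a group index, or None if no group has a free inode.
    """
    n = len(groups)
    order = [(start + i) % n for i in range(n)]
    # First pass: only groups whose inode bitmap is already initialized.
    for g in order:
        if not (groups[g]["flags"] & INODE_UNINIT) and groups[g]["free_inodes"] > 0:
            return g
    # Second pass: fall back to an uninitialized group, which forces it
    # to be initialized on first use.
    for g in order:
        if (groups[g]["flags"] & INODE_UNINIT) and groups[g]["free_inodes"] > 0:
            return g
    return None


if __name__ == "__main__":
    groups = [
        {"flags": INODE_UNINIT, "free_inodes": 8192},
        {"flags": 0, "free_inodes": 0},
        {"flags": 0, "free_inodes": 5},
    ]
    print(pick_group_for_new_inode(groups))
```

The two-pass order keeps untouched groups untouched for as long as
possible, which is what preserves the mkfs/fsck savings.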

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


* Re: block groups with no inode tables
  2007-07-10 19:09       ` Dave Kleikamp
@ 2007-07-11  4:50         ` Andreas Dilger
  0 siblings, 0 replies; 8+ messages in thread
From: Andreas Dilger @ 2007-07-11  4:50 UTC (permalink / raw)
  To: Dave Kleikamp; +Cc: cmm, coly li, Jose R. Santos, linux-ext4@vger.kernel.org

On Jul 10, 2007  14:09 -0500, Dave Kleikamp wrote:
> Assuming you mean the parent directory?  An inode isn't tied to a
> specific parent.
> 
> 	ln dir1/file1 dir2/
> 	mv dir1/file1 dir3/
> 	rmdir dir1
> 
> What happens to the inode?

The inode stays in the same place, and the block maps of the directories
are changed to enclose the inode.  In ideal (== normal) circumstances,
inodes are allocated within a directory in a sequential manner, and this
would also result in linear inode block allocation, great for extent-mapped
files.  In cases like the above, you will have fragmented IO patterns,
but those are already true for existing directories.

> I really don't think that the directory is the right place to store an inode.

There are actually some performance benefits from this, see
http://citeseer.ist.psu.edu/ganger97embedded.html

Each inode would be a disk block, or possibly a few (slightly larger than
now) inodes per block, on the order of 1kB or more.  This allows for
packing small files into the inode also (as an EA) or alternately having
many extents in the inode for huge files or lots of inline EAs.

I've also got a plan to overcome the hard-link limitations in that paper,
by storing the filename of an inode as an EA in the inode, prefixed by
the inode number & generation of the parent.  When doing a readdir or
lookup, we know the parent directory in which we are looking, so we can
consider only the names in that directory.  When doing a readdir, we can
immediately list all of the names for this inode together.  The caveat
is that we need a flexible EA scheme to handle this, maybe a directory
with more EAs in it?
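
A rough sketch of that naming scheme, assuming each link is kept as one
EA-like record of (parent inode number, parent generation, name).  The
class and method names are hypothetical and say nothing about how the
EAs would be laid out on disk.

```python
class Inode:
    """Toy inode that carries its own names, one record per hard link."""

    def __init__(self, ino):
        self.ino = ino
        # Each record: (parent_ino, parent_generation, name).
        self.links = []

    def add_link(self, parent_ino, parent_gen, name):
        self.links.append((parent_ino, parent_gen, name))

    def names_in(self, parent_ino, parent_gen):
        """Names this inode has inside one specific directory."""
        return [
            name
            for (p_ino, p_gen, name) in self.links
            if (p_ino, p_gen) == (parent_ino, parent_gen)
        ]


if __name__ == "__main__":
    f = Inode(42)
    f.add_link(10, 1, "file1")   # name in directory inode 10
    f.add_link(11, 1, "file1")   # hard link in directory inode 11
    f.add_link(10, 1, "alias")   # second name in directory inode 10
    print(f.names_in(10, 1))
```

Matching on the generation as well as the inode number means a record
left over from a deleted and reused parent inode is simply ignored.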

The one thing that I'm not sure about is how to handle the case where
inode blocks are allocated in relatively random order.  I'd _like_ to
be able to avoid the POSIX telldir/seekdir problem by doing readdir()
in block order, but that also means that if we allocate an inode block
between two other existing inode blocks in a directory that we should
"insert" the block into the directory instead of e.g. appending it.
That means the file offset in a directory is not constant, but maybe it
is OK to return the physical block number for telldir?
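
To illustrate why a physical block number might work as the telldir
cookie, here is a toy model, assuming the directory walks its inode
blocks in ascending physical order.  The class is hypothetical and
ignores everything else a real directory has to track.

```python
import bisect


class BlockOrderedDir:
    """Toy directory that iterates its inode blocks in physical order."""

    def __init__(self):
        self.blocks = []  # physical block numbers, kept sorted

    def add_block(self, phys):
        # A new block may land between two existing ones.
        bisect.insort(self.blocks, phys)

    def readdir(self, cookie=0):
        """Yield (block, next_cookie) for every block at or after the cookie."""
        i = bisect.bisect_left(self.blocks, cookie)
        for phys in self.blocks[i:]:
            yield phys, phys + 1


if __name__ == "__main__":
    d = BlockOrderedDir()
    d.add_block(100)
    d.add_block(300)
    block, cookie = next(d.readdir())   # caller stops after block 100
    d.add_block(200)                    # block inserted in the middle
    print(list(d.readdir(cookie)))      # resumes at 200, then 300
```

Because the cookie names a position on disk rather than an offset in
the directory file, inserting a block in the middle does not shift the
position a reader resumes from.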

We would still have a hash for the files, but instead of per block
as it is now, it would need to have leaf entries for each name, since
an inode can have many names and would appear in multiple hash buckets.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

