ext2 inode size (on-disk)

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* ext2 inode size (on-disk)
       [not found] <20001202014045.F2272@parcelfarce.linux.theplanet.co.uk>
@ 2001-04-19 11:55 ` Alexander Viro
  2001-04-19 17:41   ` Andreas Dilger
  2001-04-19 20:10   ` [Ext2-devel] " tytso
  0 siblings, 2 replies; 12+ messages in thread
From: Alexander Viro @ 2001-04-19 11:55 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andreas Dilger, Theodore Y. Ts'o,
	Ext2 development mailing list

	Erm... Folks, can ->s_inode_size be not a power of 2? Both
libext2fs and kernel break in that case. Example:

dd if=/dev/zero of=foo bs=1024 count=20480
mkfs -I 192 foo

corrupts memory and segfaults. Reason: ext2_read_inode() (same problem
is present in the kernel version of said beast) finds inode offset within
cylinder group piece of inode table, splits it into block*block_size+offset,
reads the block and works with the structure at given offset.

I.e. it does
	group = (ino-1) / inodes_per_group;
	number_in_group = (ino-1) % inodes_per_group;
	offset_in_group = number_in_group * inode_size;
	block_number = inode_table_base[group] + offset_in_group/block_size;
	offset_in_block = offset_in_group % block_size

Guess what happens if inode crosses block boundary? Exactly.

AFAICS we have two sane solutions:

	a) require inode size to be a power of 2

	b) switch to

	group = (ino-1) / inodes_per_group;
	number_in_group = (ino-1) % inodes_per_group;
	block_in_group = number_in_group / inodes_per_block;
	number_in_block = number_in_group % inodes_per_block;
	block = inode_table_fragments[group] + block_in_group;
	offset_in_block = number_in_block * inode_size;

	i.e. instead of current "pack inodes into piece of inode table and
pad it in the end" do "pack inodes into blocks padding the end of every block".

Something has to be done - right now mke2fs effectively mandates "inode size
is a power of 2" and as far as I'm concerned it's OK, but segfaulting is
a bit too drastic way of telling user "don't do it"...
								Al

PS: can we assume that inodes_per_group is a multiple of inodes_per_block
or it isn't guaranteed?


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: ext2 inode size (on-disk)
  2001-04-19 11:55 ` ext2 inode size (on-disk) Alexander Viro
@ 2001-04-19 17:41   ` Andreas Dilger
  2001-04-19 18:02     ` Alexander Viro
  2001-04-19 20:10   ` [Ext2-devel] " tytso
  1 sibling, 1 reply; 12+ messages in thread
From: Andreas Dilger @ 2001-04-19 17:41 UTC (permalink / raw)
  To: Alexander Viro
  Cc: linux-kernel, Theodore Y. Ts'o, Ext2 development mailing list

Al, you write:
> 	Erm... Folks, can ->s_inode_size be not a power of 2? Both
> libext2fs and kernel break in that case. Example:
> 
> dd if=/dev/zero of=foo bs=1024 count=20480
> mkfs -I 192 foo

I had always assumed that it would be a power-of-two size, but since it
is an undocumented option to mke2fs, I suppose it was never really
intended to be used.  It appears, however, that the mke2fs code
doesn't do ANY checking on the parameter, so you could concievably make
the inode size SMALLER than the current size, and this would DEFINITELY
be bad as well.

> corrupts memory and segfaults. Reason: ext2_read_inode() (same problem
> is present in the kernel version of said beast) finds inode offset within
> cylinder group piece of inode table, splits it into block*block_size+offset,
> reads the block and works with the structure at given offset.

However, we _should_ be safe against this, because ext2_read_super() does:

        if (le32_to_cpu(es->s_rev_level) == EXT2_GOOD_OLD_REV) {
                sb->u.ext2_sb.s_inode_size = EXT2_GOOD_OLD_INODE_SIZE;
                sb->u.ext2_sb.s_first_ino = EXT2_GOOD_OLD_FIRST_INO;
        } else {
                sb->u.ext2_sb.s_inode_size = le16_to_cpu(es->s_inode_size);
                sb->u.ext2_sb.s_first_ino = le32_to_cpu(es->s_first_ino);
                if (sb->u.ext2_sb.s_inode_size != EXT2_GOOD_OLD_INODE_SIZE) {
                        printk ("EXT2-fs: unsupported inode size: %d\n",
                                sb->u.ext2_sb.s_inode_size);
                        goto failed_mount;
                }
        }

Are you talking about user-space code or kernel code when you say it is
segfaulting and corrupting memory?

> 	a) require inode size to be a power of 2

This would be my preferred solution, makes the math in the kernel a lot
faster.  Should probably be put into mke2fs for safety, and also checked
when/if we ever allow the kernel to mount a filesystem with non-power-of-2
inode sizes.

> PS: can we assume that inodes_per_group is a multiple of inodes_per_block
> or it isn't guaranteed?

mke2fs will always set up the filesystem this way, and there is no real
reason NOT to do that.  If you are using a filesystem block for the inode
table, it is pointless to leave part of it empty, because you can't use
it for anything else anyways.

Cheers, Andreas
-- 
Andreas Dilger  \ "If a man ate a pound of pasta and a pound of antipasto,
                 \  would they cancel out, leaving him still hungry?"
http://www-mddsp.enel.ucalgary.ca/People/adilger/               -- Dogbert

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: ext2 inode size (on-disk)
  2001-04-19 17:41   ` Andreas Dilger
@ 2001-04-19 18:02     ` Alexander Viro
  2001-04-19 19:24       ` Andreas Dilger
  0 siblings, 1 reply; 12+ messages in thread
From: Alexander Viro @ 2001-04-19 18:02 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: linux-kernel, Theodore Y. Ts'o, Ext2 development mailing list

On Thu, 19 Apr 2001, Andreas Dilger wrote:

> Al, you write:
> > 	Erm... Folks, can ->s_inode_size be not a power of 2? Both
> > libext2fs and kernel break in that case. Example:
> > 
> > dd if=/dev/zero of=foo bs=1024 count=20480
> > mkfs -I 192 foo
> 
> I had always assumed that it would be a power-of-two size, but since it
> is an undocumented option to mke2fs, I suppose it was never really
> intended to be used.  It appears, however, that the mke2fs code
> doesn't do ANY checking on the parameter, so you could concievably make
> the inode size SMALLER than the current size, and this would DEFINITELY
> be bad as well.

In some sense it does - it dies if you've passed it not a power of two ;-)
I don't think that segfault is a good way to report the problem, though...

Problem with mkfs is obvious, but kernel side is also shady - we could
have cleaner code if we assumed that inode size is power of 2. As it
is, we have a code in read_super() that checks for size == 128 _and_
code that was apparently writen in assumption that it can be not a
power of 2. However, if that was the really the goal, we fail - code
in ext2_read_inode() actually would break with such sizes.

In other words, the real question is what the hell are we trying to
do there. If we want code that deals with sizes that are not powers of 2
we need to change ext2_read_inode() and friends. It wouldn't be
hard. OTOH, if we guarantee that inode size will always remain a power of
2 we can simplify the thing. In any case current situation doesn't
make much sense. The only question is direction of fix.

Could those who introduced ->s_inode_size tell what use had been intended?

> mke2fs will always set up the filesystem this way, and there is no real
> reason NOT to do that.  If you are using a filesystem block for the inode
> table, it is pointless to leave part of it empty, because you can't use
> it for anything else anyways.

It's not that simple - if you need 160 bytes per inode rounding it up
to the next power of two will lose a lot. On 4Kb fs it will be
16 inodes per block instead of 25 - 36% loss...

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: ext2 inode size (on-disk)
  2001-04-19 18:02     ` Alexander Viro
@ 2001-04-19 19:24       ` Andreas Dilger
  2001-04-20  2:12         ` Alexander Viro
  0 siblings, 1 reply; 12+ messages in thread
From: Andreas Dilger @ 2001-04-19 19:24 UTC (permalink / raw)
  To: Alexander Viro
  Cc: Andreas Dilger, linux-kernel, Theodore Y. Ts'o,
	Ext2 development mailing list

Al writes:
> > I had always assumed that it would be a power-of-two size, but since it
> > is an undocumented option to mke2fs, I suppose it was never really
> > intended to be used.  It appears, however, that the mke2fs code
> > doesn't do ANY checking on the parameter, so you could concievably make
> > the inode size SMALLER than the current size, and this would DEFINITELY
> > be bad as well.
> 
> In some sense it does - it dies if you've passed it not a power of two ;-)
> I don't think that segfault is a good way to report the problem, though...

Strange, I run "mke2fs -I 192 /dev/hdc2" and do not have a segfault or any
problems with e2fsck or debugfs on the resulting filesystem.  I'm running
1.20-WIP, but I don't think anything was changed in this area for some time.

However, looking at the output from dumpe2fs shows it is packing inodes
across block boundaries (inode_size = 192, inodes_per_group = 16144,
inode_blocks_per_group = 757).  It is also not filling the last inode
table block (which would give us 16149 inodes).

Also, looking at the disk, it shows garbage data in the space after the
end of the normal inode, and ext2 should always zero-fill unused fields.

Basically, packing inodes across block boundaries is TOTALLY broken.
This can lead to all sorts of data corruption issues if one block is
written to disk, and the other is not.  For that matter, the same would
hold true with any not-power-of-two inode size.  In this case, the
inode will still be crossing a hard disk sector boundary, and have the
(small) possibility that part of the inode is written and part is not.

In this light, the safe inode sizes are 128 (current size), 256, and 512.

> > mke2fs will always set up the filesystem this way, and there is no real
> > reason NOT to do that.  If you are using a filesystem block for the inode
> > table, it is pointless to leave part of it empty, because you can't use
> > it for anything else anyways.
> 
> It's not that simple - if you need 160 bytes per inode rounding it up
> to the next power of two will lose a lot. On 4Kb fs it will be
> 16 inodes per block instead of 25 - 36% loss...

What I was getting at, is that no matter what size of inode we are using,
we should ALWAYS fill the last inode table block as full as possible.  Once
you have allocated a block to the inode table, it should be filled with
as many inodes as fit in a single block.  To do anything else is a waste.

For example, with 160 byte inodes, and a 4k inode table block, we can fit

4096 / 160 = 25 inodes into this block (with 96 bytes remaining)

There is no reason to only have 1 or 2 or 17 inodes in this block.  If we
assume we are not crossing a block boundary with the inode, then
inodes_per_group = n * inodes_per_block, where n = number of inode blocks.

Cheers, Andreas

PS - is this a code cleanup issue, or do you have some reason that you want
     to increase the inode size?
-- 
Andreas Dilger  \ "If a man ate a pound of pasta and a pound of antipasto,
                 \  would they cancel out, leaving him still hungry?"
http://www-mddsp.enel.ucalgary.ca/People/adilger/               -- Dogbert

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: ext2 inode size (on-disk)
  2001-04-19 19:24       ` Andreas Dilger
@ 2001-04-20  2:12         ` Alexander Viro
  0 siblings, 0 replies; 12+ messages in thread
From: Alexander Viro @ 2001-04-20  2:12 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: linux-kernel, Theodore Y. Ts'o, Ext2 development mailing list



On Thu, 19 Apr 2001, Andreas Dilger wrote:

> Strange, I run "mke2fs -I 192 /dev/hdc2" and do not have a segfault or any
> problems with e2fsck or debugfs on the resulting filesystem.  I'm running
> 1.20-WIP, but I don't think anything was changed in this area for some time.
 
May depend on the libc version/size of device/phase of the moon. I've
got segfaults with 1.18, 1.19 and 1.20-WIP on a Debian box with glibc
2.1.3-18 and 20Mb image. What really happens is memory corruption in
libe2fs (ext2_write_inode()), segfault comes later (usually in free()).

> Basically, packing inodes across block boundaries is TOTALLY broken.
> This can lead to all sorts of data corruption issues if one block is
> written to disk, and the other is not.  For that matter, the same would

Yup.

> PS - is this a code cleanup issue, or do you have some reason that you want
>      to increase the inode size?

Code cleanup
								Al


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Ext2-devel] ext2 inode size (on-disk)
  2001-04-19 11:55 ` ext2 inode size (on-disk) Alexander Viro
  2001-04-19 17:41   ` Andreas Dilger
@ 2001-04-19 20:10   ` tytso
  2001-04-19 20:29     ` Jeff Garzik
  2001-04-20  2:23     ` Alexander Viro
  1 sibling, 2 replies; 12+ messages in thread
From: tytso @ 2001-04-19 20:10 UTC (permalink / raw)
  To: Alexander Viro
  Cc: linux-kernel, Andreas Dilger, Theodore Y. Ts'o,
	Ext2 development mailing list

On Thu, Apr 19, 2001 at 07:55:20AM -0400, Alexander Viro wrote:
> 	Erm... Folks, can ->s_inode_size be not a power of 2? Both
> libext2fs and kernel break in that case. 

This was a project that was never completed.  I thought at one point
of allowing the inode size to be not a power of 2, but if you do that,
you really want to avoid letting an inode cross a block boundary ---
for reliability and performance reasons if nothing else.   

It may simply be easiest at this point to require that the inode size
be a power of two, at least as far as going from 128 to 256 bytes,
just for compatibility reasons.  (Although if we do that, the folks
who want to use extra space in the inode will come pooring out of the
woodwork, and we're going to have to careful to control who uses what
parts of the extended inode.)

In the long run, it probably makes sense to adjust the algorithms to
allow for non-power-of-two inode sizes, but require an incompatible
filesystem feature flag (so that older kernels and filesystem
utilities won't choke when mounting filesystems with non-standard
sized inodes.

							- Ted

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Ext2-devel] ext2 inode size (on-disk)
  2001-04-19 20:10   ` [Ext2-devel] " tytso
@ 2001-04-19 20:29     ` Jeff Garzik
  2001-04-20  6:30       ` Theodore Tso
  2001-04-20  2:23     ` Alexander Viro
  1 sibling, 1 reply; 12+ messages in thread
From: Jeff Garzik @ 2001-04-19 20:29 UTC (permalink / raw)
  To: tytso
  Cc: Alexander Viro, linux-kernel, Andreas Dilger,
	Ext2 development mailing list

tytso@valinux.com wrote:
> In the long run, it probably makes sense to adjust the algorithms to
> allow for non-power-of-two inode sizes,

If you don't mind, does that imply packing inodes across block
boundaries?

Regards,

	Jeff


-- 
Jeff Garzik       | "The universe is like a safe to which there is a
Building 1024     |  combination -- but the combination is locked up
MandrakeSoft      |  in the safe."    -- Peter DeVries

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Ext2-devel] ext2 inode size (on-disk)
  2001-04-19 20:29     ` Jeff Garzik
@ 2001-04-20  6:30       ` Theodore Tso
  0 siblings, 0 replies; 12+ messages in thread
From: Theodore Tso @ 2001-04-20  6:30 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: tytso, Alexander Viro, linux-kernel, Andreas Dilger,
	Ext2 development mailing list

On Thu, Apr 19, 2001 at 04:29:52PM -0400, Jeff Garzik wrote:
> tytso@valinux.com wrote:
> > In the long run, it probably makes sense to adjust the algorithms to
> > allow for non-power-of-two inode sizes,
> 
> If you don't mind, does that imply packing inodes across block
> boundaries?

No, it means that padding the end of each block in the inode table so
that inodes don't cross block boundries.  (i.e., if the inode size is
150 bytes, then there's room for 6 inodes in a 1k block, with 124
bytes left over for padding.)

						- Ted

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Ext2-devel] ext2 inode size (on-disk)
  2001-04-19 20:10   ` [Ext2-devel] " tytso
  2001-04-19 20:29     ` Jeff Garzik
@ 2001-04-20  2:23     ` Alexander Viro
  2001-04-20  5:35       ` Andreas Dilger
  2001-04-20  6:35       ` Theodore Tso
  1 sibling, 2 replies; 12+ messages in thread
From: Alexander Viro @ 2001-04-20  2:23 UTC (permalink / raw)
  To: tytso
  Cc: linux-kernel, Andreas Dilger, Theodore Y. Ts'o,
	Ext2 development mailing list



On Thu, 19 Apr 2001 tytso@valinux.com wrote:

> This was a project that was never completed.  I thought at one point
> of allowing the inode size to be not a power of 2, but if you do that,
> you really want to avoid letting an inode cross a block boundary ---
> for reliability and performance reasons if nothing else.   

Agreed.

> In the long run, it probably makes sense to adjust the algorithms to
> allow for non-power-of-two inode sizes, but require an incompatible
> filesystem feature flag (so that older kernels and filesystem
> utilities won't choke when mounting filesystems with non-standard
> sized inodes.

I don't think that it's needed - old kernels (up to -CURRENT ;-) will
simply refuse to mount if ->s_inode_size != 128. Old utilites may be
trickier, though...

I'm somewhat concerned about the following: last block of inode table
fragment may have less inodes than the rest. Reason: number of inodes
per group should be a multiple of 8 and with inodes bigger than 128
bytes it may give such effect. Comments?

I would really, really like to end up with accurate description of
inode table layout somewhere in Documentation/filesystems. Heck, I
volunteer to write it down and submit into the tree ;-)
							Al


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Ext2-devel] ext2 inode size (on-disk)
  2001-04-20  2:23     ` Alexander Viro
@ 2001-04-20  5:35       ` Andreas Dilger
  2001-04-20  6:37         ` Theodore Tso
  2001-04-20  6:35       ` Theodore Tso
  1 sibling, 1 reply; 12+ messages in thread
From: Andreas Dilger @ 2001-04-20  5:35 UTC (permalink / raw)
  To: Alexander Viro
  Cc: tytso, linux-kernel, Andreas Dilger, Theodore Y. Ts'o,
	Ext2 development mailing list

Al writes:
> I don't think that it's needed - old kernels (up to -CURRENT ;-) will
> simply refuse to mount if ->s_inode_size != 128. Old utilites may be
> trickier, though...

Probably would need an incompat flag for changing the inode size anyways,
so old utilities wouldn't set that anyways.

> I'm somewhat concerned about the following: last block of inode table
> fragment may have less inodes than the rest. Reason: number of inodes
> per group should be a multiple of 8 and with inodes bigger than 128
> bytes it may give such effect. Comments?

I don't _think_ that there is a requirement for a multiple-of-8 inodes
per group.  OK, looking into mke2fs (actually lib/ext2fs/initialize.c)
it _does_ show that it needs to be a multiple of 8, but I'm not sure
exactly what the "bitmap splicing code" mentioned in the comment is.

In the end, it doesn't really matter much - if we go with multiple-of-2
inode sizes, all it means is that we may need to have multiple-of-2 (or
possibly 4 for 512-byte inodes in a 1k block filesystem) inode table
blocks in each group.  Not a big deal.  The code already handles this.

> I would really, really like to end up with accurate description of
> inode table layout somewhere in Documentation/filesystems. Heck, I
> volunteer to write it down and submit into the tree ;-)

I can write a few words as well.

Cheers, Andreas
-- 
Andreas Dilger  \ "If a man ate a pound of pasta and a pound of antipasto,
                 \  would they cancel out, leaving him still hungry?"
http://www-mddsp.enel.ucalgary.ca/People/adilger/               -- Dogbert

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Ext2-devel] ext2 inode size (on-disk)
  2001-04-20  5:35       ` Andreas Dilger
@ 2001-04-20  6:37         ` Theodore Tso
  0 siblings, 0 replies; 12+ messages in thread
From: Theodore Tso @ 2001-04-20  6:37 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Alexander Viro, tytso, linux-kernel, Theodore Y. Ts'o,
	Ext2 development mailing list

On Thu, Apr 19, 2001 at 11:35:40PM -0600, Andreas Dilger wrote:
> I don't _think_ that there is a requirement for a multiple-of-8 inodes
> per group.  OK, looking into mke2fs (actually lib/ext2fs/initialize.c)
> it _does_ show that it needs to be a multiple of 8, but I'm not sure
> exactly what the "bitmap splicing code" mentioned in the comment is.

It's has to be a multiple of 8 because of how e2fsprogs handles
bitmaps --- that is, it takes the various pieces of all of the
bitmaps, and butts them up together in memory.  It would be possible
to remove this restriction by reworking the e2fsprogs library code,
but quite frankly, I don't think the restriction is all that
unreasonable.

						- Ted

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Ext2-devel] ext2 inode size (on-disk)
  2001-04-20  2:23     ` Alexander Viro
  2001-04-20  5:35       ` Andreas Dilger
@ 2001-04-20  6:35       ` Theodore Tso
  1 sibling, 0 replies; 12+ messages in thread
From: Theodore Tso @ 2001-04-20  6:35 UTC (permalink / raw)
  To: Alexander Viro
  Cc: tytso, linux-kernel, Andreas Dilger, Theodore Y. Ts'o,
	Ext2 development mailing list

On Thu, Apr 19, 2001 at 10:23:39PM -0400, Alexander Viro wrote:
> 
> I'm somewhat concerned about the following: last block of inode table
> fragment may have less inodes than the rest. Reason: number of inodes
> per group should be a multiple of 8 and with inodes bigger than 128
> bytes it may give such effect. Comments?

Yup, that's right.  That shouldn't be too bad, though, since we
already calculate things by dividing by INODES_PER_BLOCK_GROUP.  So
the fact that the last block of the inode table may have some unused
space shouldn't be a problem.

> I would really, really like to end up with accurate description of
> inode table layout somewhere in Documentation/filesystems. Heck, I
> volunteer to write it down and submit into the tree ;-)

The "design and implementation of ext2" paper has a pretty good
explanation of the inode table, but of course it assumed a convenient
inode size of 128, and didn't really go into the issues of what might
happen if the inode size were larger, or not a power of two.

So yeah, getting something which explains how things work now that
things have gotten a bit more complicated would be a good thing.

						- Ted

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2001-04-20  6:38 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <20001202014045.F2272@parcelfarce.linux.theplanet.co.uk>
2001-04-19 11:55 ` ext2 inode size (on-disk) Alexander Viro
2001-04-19 17:41   ` Andreas Dilger
2001-04-19 18:02     ` Alexander Viro
2001-04-19 19:24       ` Andreas Dilger
2001-04-20  2:12         ` Alexander Viro
2001-04-19 20:10   ` [Ext2-devel] " tytso
2001-04-19 20:29     ` Jeff Garzik
2001-04-20  6:30       ` Theodore Tso
2001-04-20  2:23     ` Alexander Viro
2001-04-20  5:35       ` Andreas Dilger
2001-04-20  6:37         ` Theodore Tso
2001-04-20  6:35       ` Theodore Tso

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox