* ext2 inode size (on-disk) [not found] <20001202014045.F2272@parcelfarce.linux.theplanet.co.uk> @ 2001-04-19 11:55 ` Alexander Viro 2001-04-19 17:41 ` Andreas Dilger 2001-04-19 20:10 ` [Ext2-devel] " tytso 0 siblings, 2 replies; 12+ messages in thread From: Alexander Viro @ 2001-04-19 11:55 UTC (permalink / raw) To: linux-kernel Cc: Andreas Dilger, Theodore Y. Ts'o, Ext2 development mailing list Erm... Folks, can ->s_inode_size be not a power of 2? Both libext2fs and kernel break in that case. Example: dd if=/dev/zero of=foo bs=1024 count=20480 mkfs -I 192 foo corrupts memory and segfaults. Reason: ext2_read_inode() (same problem is present in the kernel version of said beast) finds inode offset within cylinder group piece of inode table, splits it into block*block_size+offset, reads the block and works with the structure at given offset. I.e. it does group = (ino-1) / inodes_per_group; number_in_group = (ino-1) % inodes_per_group; offset_in_group = number_in_group * inode_size; block_number = inode_table_base[group] + offset_in_group/block_size; offset_in_block = offset_in_group % block_size Guess what happens if inode crosses block boundary? Exactly. AFAICS we have two sane solutions: a) require inode size to be a power of 2 b) switch to group = (ino-1) / inodes_per_group; number_in_group = (ino-1) % inodes_per_group; block_in_group = number_in_group / inodes_per_block; number_in_block = number_in_group % inodes_per_block; block = inode_table_fragments[group] + block_in_group; offset_in_block = number_in_block * inode_size; i.e. instead of current "pack inodes into piece of inode table and pad it in the end" do "pack inodes into blocks padding the end of every block". Something has to be done - right now mke2fs effectively mandates "inode size is a power of 2" and as far as I'm concerned it's OK, but segfaulting is a bit too drastic way of telling user "don't do it"... Al PS: can we assume that inodes_per_group is a multiple of inodes_per_block or it isn't guaranteed? ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: ext2 inode size (on-disk) 2001-04-19 11:55 ` ext2 inode size (on-disk) Alexander Viro @ 2001-04-19 17:41 ` Andreas Dilger 2001-04-19 18:02 ` Alexander Viro 2001-04-19 20:10 ` [Ext2-devel] " tytso 1 sibling, 1 reply; 12+ messages in thread From: Andreas Dilger @ 2001-04-19 17:41 UTC (permalink / raw) To: Alexander Viro Cc: linux-kernel, Theodore Y. Ts'o, Ext2 development mailing list Al, you write: > Erm... Folks, can ->s_inode_size be not a power of 2? Both > libext2fs and kernel break in that case. Example: > > dd if=/dev/zero of=foo bs=1024 count=20480 > mkfs -I 192 foo I had always assumed that it would be a power-of-two size, but since it is an undocumented option to mke2fs, I suppose it was never really intended to be used. It appears, however, that the mke2fs code doesn't do ANY checking on the parameter, so you could concievably make the inode size SMALLER than the current size, and this would DEFINITELY be bad as well. > corrupts memory and segfaults. Reason: ext2_read_inode() (same problem > is present in the kernel version of said beast) finds inode offset within > cylinder group piece of inode table, splits it into block*block_size+offset, > reads the block and works with the structure at given offset. However, we _should_ be safe against this, because ext2_read_super() does: if (le32_to_cpu(es->s_rev_level) == EXT2_GOOD_OLD_REV) { sb->u.ext2_sb.s_inode_size = EXT2_GOOD_OLD_INODE_SIZE; sb->u.ext2_sb.s_first_ino = EXT2_GOOD_OLD_FIRST_INO; } else { sb->u.ext2_sb.s_inode_size = le16_to_cpu(es->s_inode_size); sb->u.ext2_sb.s_first_ino = le32_to_cpu(es->s_first_ino); if (sb->u.ext2_sb.s_inode_size != EXT2_GOOD_OLD_INODE_SIZE) { printk ("EXT2-fs: unsupported inode size: %d\n", sb->u.ext2_sb.s_inode_size); goto failed_mount; } } Are you talking about user-space code or kernel code when you say it is segfaulting and corrupting memory? > a) require inode size to be a power of 2 This would be my preferred solution, makes the math in the kernel a lot faster. Should probably be put into mke2fs for safety, and also checked when/if we ever allow the kernel to mount a filesystem with non-power-of-2 inode sizes. > PS: can we assume that inodes_per_group is a multiple of inodes_per_block > or it isn't guaranteed? mke2fs will always set up the filesystem this way, and there is no real reason NOT to do that. If you are using a filesystem block for the inode table, it is pointless to leave part of it empty, because you can't use it for anything else anyways. Cheers, Andreas -- Andreas Dilger \ "If a man ate a pound of pasta and a pound of antipasto, \ would they cancel out, leaving him still hungry?" http://www-mddsp.enel.ucalgary.ca/People/adilger/ -- Dogbert ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: ext2 inode size (on-disk) 2001-04-19 17:41 ` Andreas Dilger @ 2001-04-19 18:02 ` Alexander Viro 2001-04-19 19:24 ` Andreas Dilger 0 siblings, 1 reply; 12+ messages in thread From: Alexander Viro @ 2001-04-19 18:02 UTC (permalink / raw) To: Andreas Dilger Cc: linux-kernel, Theodore Y. Ts'o, Ext2 development mailing list On Thu, 19 Apr 2001, Andreas Dilger wrote: > Al, you write: > > Erm... Folks, can ->s_inode_size be not a power of 2? Both > > libext2fs and kernel break in that case. Example: > > > > dd if=/dev/zero of=foo bs=1024 count=20480 > > mkfs -I 192 foo > > I had always assumed that it would be a power-of-two size, but since it > is an undocumented option to mke2fs, I suppose it was never really > intended to be used. It appears, however, that the mke2fs code > doesn't do ANY checking on the parameter, so you could concievably make > the inode size SMALLER than the current size, and this would DEFINITELY > be bad as well. In some sense it does - it dies if you've passed it not a power of two ;-) I don't think that segfault is a good way to report the problem, though... Problem with mkfs is obvious, but kernel side is also shady - we could have cleaner code if we assumed that inode size is power of 2. As it is, we have a code in read_super() that checks for size == 128 _and_ code that was apparently writen in assumption that it can be not a power of 2. However, if that was the really the goal, we fail - code in ext2_read_inode() actually would break with such sizes. In other words, the real question is what the hell are we trying to do there. If we want code that deals with sizes that are not powers of 2 we need to change ext2_read_inode() and friends. It wouldn't be hard. OTOH, if we guarantee that inode size will always remain a power of 2 we can simplify the thing. In any case current situation doesn't make much sense. The only question is direction of fix. Could those who introduced ->s_inode_size tell what use had been intended? > mke2fs will always set up the filesystem this way, and there is no real > reason NOT to do that. If you are using a filesystem block for the inode > table, it is pointless to leave part of it empty, because you can't use > it for anything else anyways. It's not that simple - if you need 160 bytes per inode rounding it up to the next power of two will lose a lot. On 4Kb fs it will be 16 inodes per block instead of 25 - 36% loss... ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: ext2 inode size (on-disk) 2001-04-19 18:02 ` Alexander Viro @ 2001-04-19 19:24 ` Andreas Dilger 2001-04-20 2:12 ` Alexander Viro 0 siblings, 1 reply; 12+ messages in thread From: Andreas Dilger @ 2001-04-19 19:24 UTC (permalink / raw) To: Alexander Viro Cc: Andreas Dilger, linux-kernel, Theodore Y. Ts'o, Ext2 development mailing list Al writes: > > I had always assumed that it would be a power-of-two size, but since it > > is an undocumented option to mke2fs, I suppose it was never really > > intended to be used. It appears, however, that the mke2fs code > > doesn't do ANY checking on the parameter, so you could concievably make > > the inode size SMALLER than the current size, and this would DEFINITELY > > be bad as well. > > In some sense it does - it dies if you've passed it not a power of two ;-) > I don't think that segfault is a good way to report the problem, though... Strange, I run "mke2fs -I 192 /dev/hdc2" and do not have a segfault or any problems with e2fsck or debugfs on the resulting filesystem. I'm running 1.20-WIP, but I don't think anything was changed in this area for some time. However, looking at the output from dumpe2fs shows it is packing inodes across block boundaries (inode_size = 192, inodes_per_group = 16144, inode_blocks_per_group = 757). It is also not filling the last inode table block (which would give us 16149 inodes). Also, looking at the disk, it shows garbage data in the space after the end of the normal inode, and ext2 should always zero-fill unused fields. Basically, packing inodes across block boundaries is TOTALLY broken. This can lead to all sorts of data corruption issues if one block is written to disk, and the other is not. For that matter, the same would hold true with any not-power-of-two inode size. In this case, the inode will still be crossing a hard disk sector boundary, and have the (small) possibility that part of the inode is written and part is not. In this light, the safe inode sizes are 128 (current size), 256, and 512. > > mke2fs will always set up the filesystem this way, and there is no real > > reason NOT to do that. If you are using a filesystem block for the inode > > table, it is pointless to leave part of it empty, because you can't use > > it for anything else anyways. > > It's not that simple - if you need 160 bytes per inode rounding it up > to the next power of two will lose a lot. On 4Kb fs it will be > 16 inodes per block instead of 25 - 36% loss... What I was getting at, is that no matter what size of inode we are using, we should ALWAYS fill the last inode table block as full as possible. Once you have allocated a block to the inode table, it should be filled with as many inodes as fit in a single block. To do anything else is a waste. For example, with 160 byte inodes, and a 4k inode table block, we can fit 4096 / 160 = 25 inodes into this block (with 96 bytes remaining) There is no reason to only have 1 or 2 or 17 inodes in this block. If we assume we are not crossing a block boundary with the inode, then inodes_per_group = n * inodes_per_block, where n = number of inode blocks. Cheers, Andreas PS - is this a code cleanup issue, or do you have some reason that you want to increase the inode size? -- Andreas Dilger \ "If a man ate a pound of pasta and a pound of antipasto, \ would they cancel out, leaving him still hungry?" http://www-mddsp.enel.ucalgary.ca/People/adilger/ -- Dogbert ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: ext2 inode size (on-disk) 2001-04-19 19:24 ` Andreas Dilger @ 2001-04-20 2:12 ` Alexander Viro 0 siblings, 0 replies; 12+ messages in thread From: Alexander Viro @ 2001-04-20 2:12 UTC (permalink / raw) To: Andreas Dilger Cc: linux-kernel, Theodore Y. Ts'o, Ext2 development mailing list On Thu, 19 Apr 2001, Andreas Dilger wrote: > Strange, I run "mke2fs -I 192 /dev/hdc2" and do not have a segfault or any > problems with e2fsck or debugfs on the resulting filesystem. I'm running > 1.20-WIP, but I don't think anything was changed in this area for some time. May depend on the libc version/size of device/phase of the moon. I've got segfaults with 1.18, 1.19 and 1.20-WIP on a Debian box with glibc 2.1.3-18 and 20Mb image. What really happens is memory corruption in libe2fs (ext2_write_inode()), segfault comes later (usually in free()). > Basically, packing inodes across block boundaries is TOTALLY broken. > This can lead to all sorts of data corruption issues if one block is > written to disk, and the other is not. For that matter, the same would Yup. > PS - is this a code cleanup issue, or do you have some reason that you want > to increase the inode size? Code cleanup Al ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [Ext2-devel] ext2 inode size (on-disk) 2001-04-19 11:55 ` ext2 inode size (on-disk) Alexander Viro 2001-04-19 17:41 ` Andreas Dilger @ 2001-04-19 20:10 ` tytso 2001-04-19 20:29 ` Jeff Garzik 2001-04-20 2:23 ` Alexander Viro 1 sibling, 2 replies; 12+ messages in thread From: tytso @ 2001-04-19 20:10 UTC (permalink / raw) To: Alexander Viro Cc: linux-kernel, Andreas Dilger, Theodore Y. Ts'o, Ext2 development mailing list On Thu, Apr 19, 2001 at 07:55:20AM -0400, Alexander Viro wrote: > Erm... Folks, can ->s_inode_size be not a power of 2? Both > libext2fs and kernel break in that case. This was a project that was never completed. I thought at one point of allowing the inode size to be not a power of 2, but if you do that, you really want to avoid letting an inode cross a block boundary --- for reliability and performance reasons if nothing else. It may simply be easiest at this point to require that the inode size be a power of two, at least as far as going from 128 to 256 bytes, just for compatibility reasons. (Although if we do that, the folks who want to use extra space in the inode will come pooring out of the woodwork, and we're going to have to careful to control who uses what parts of the extended inode.) In the long run, it probably makes sense to adjust the algorithms to allow for non-power-of-two inode sizes, but require an incompatible filesystem feature flag (so that older kernels and filesystem utilities won't choke when mounting filesystems with non-standard sized inodes. - Ted ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [Ext2-devel] ext2 inode size (on-disk) 2001-04-19 20:10 ` [Ext2-devel] " tytso @ 2001-04-19 20:29 ` Jeff Garzik 2001-04-20 6:30 ` Theodore Tso 2001-04-20 2:23 ` Alexander Viro 1 sibling, 1 reply; 12+ messages in thread From: Jeff Garzik @ 2001-04-19 20:29 UTC (permalink / raw) To: tytso Cc: Alexander Viro, linux-kernel, Andreas Dilger, Ext2 development mailing list tytso@valinux.com wrote: > In the long run, it probably makes sense to adjust the algorithms to > allow for non-power-of-two inode sizes, If you don't mind, does that imply packing inodes across block boundaries? Regards, Jeff -- Jeff Garzik | "The universe is like a safe to which there is a Building 1024 | combination -- but the combination is locked up MandrakeSoft | in the safe." -- Peter DeVries ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [Ext2-devel] ext2 inode size (on-disk) 2001-04-19 20:29 ` Jeff Garzik @ 2001-04-20 6:30 ` Theodore Tso 0 siblings, 0 replies; 12+ messages in thread From: Theodore Tso @ 2001-04-20 6:30 UTC (permalink / raw) To: Jeff Garzik Cc: tytso, Alexander Viro, linux-kernel, Andreas Dilger, Ext2 development mailing list On Thu, Apr 19, 2001 at 04:29:52PM -0400, Jeff Garzik wrote: > tytso@valinux.com wrote: > > In the long run, it probably makes sense to adjust the algorithms to > > allow for non-power-of-two inode sizes, > > If you don't mind, does that imply packing inodes across block > boundaries? No, it means that padding the end of each block in the inode table so that inodes don't cross block boundries. (i.e., if the inode size is 150 bytes, then there's room for 6 inodes in a 1k block, with 124 bytes left over for padding.) - Ted ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [Ext2-devel] ext2 inode size (on-disk) 2001-04-19 20:10 ` [Ext2-devel] " tytso 2001-04-19 20:29 ` Jeff Garzik @ 2001-04-20 2:23 ` Alexander Viro 2001-04-20 5:35 ` Andreas Dilger 2001-04-20 6:35 ` Theodore Tso 1 sibling, 2 replies; 12+ messages in thread From: Alexander Viro @ 2001-04-20 2:23 UTC (permalink / raw) To: tytso Cc: linux-kernel, Andreas Dilger, Theodore Y. Ts'o, Ext2 development mailing list On Thu, 19 Apr 2001 tytso@valinux.com wrote: > This was a project that was never completed. I thought at one point > of allowing the inode size to be not a power of 2, but if you do that, > you really want to avoid letting an inode cross a block boundary --- > for reliability and performance reasons if nothing else. Agreed. > In the long run, it probably makes sense to adjust the algorithms to > allow for non-power-of-two inode sizes, but require an incompatible > filesystem feature flag (so that older kernels and filesystem > utilities won't choke when mounting filesystems with non-standard > sized inodes. I don't think that it's needed - old kernels (up to -CURRENT ;-) will simply refuse to mount if ->s_inode_size != 128. Old utilites may be trickier, though... I'm somewhat concerned about the following: last block of inode table fragment may have less inodes than the rest. Reason: number of inodes per group should be a multiple of 8 and with inodes bigger than 128 bytes it may give such effect. Comments? I would really, really like to end up with accurate description of inode table layout somewhere in Documentation/filesystems. Heck, I volunteer to write it down and submit into the tree ;-) Al ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [Ext2-devel] ext2 inode size (on-disk) 2001-04-20 2:23 ` Alexander Viro @ 2001-04-20 5:35 ` Andreas Dilger 2001-04-20 6:37 ` Theodore Tso 2001-04-20 6:35 ` Theodore Tso 1 sibling, 1 reply; 12+ messages in thread From: Andreas Dilger @ 2001-04-20 5:35 UTC (permalink / raw) To: Alexander Viro Cc: tytso, linux-kernel, Andreas Dilger, Theodore Y. Ts'o, Ext2 development mailing list Al writes: > I don't think that it's needed - old kernels (up to -CURRENT ;-) will > simply refuse to mount if ->s_inode_size != 128. Old utilites may be > trickier, though... Probably would need an incompat flag for changing the inode size anyways, so old utilities wouldn't set that anyways. > I'm somewhat concerned about the following: last block of inode table > fragment may have less inodes than the rest. Reason: number of inodes > per group should be a multiple of 8 and with inodes bigger than 128 > bytes it may give such effect. Comments? I don't _think_ that there is a requirement for a multiple-of-8 inodes per group. OK, looking into mke2fs (actually lib/ext2fs/initialize.c) it _does_ show that it needs to be a multiple of 8, but I'm not sure exactly what the "bitmap splicing code" mentioned in the comment is. In the end, it doesn't really matter much - if we go with multiple-of-2 inode sizes, all it means is that we may need to have multiple-of-2 (or possibly 4 for 512-byte inodes in a 1k block filesystem) inode table blocks in each group. Not a big deal. The code already handles this. > I would really, really like to end up with accurate description of > inode table layout somewhere in Documentation/filesystems. Heck, I > volunteer to write it down and submit into the tree ;-) I can write a few words as well. Cheers, Andreas -- Andreas Dilger \ "If a man ate a pound of pasta and a pound of antipasto, \ would they cancel out, leaving him still hungry?" http://www-mddsp.enel.ucalgary.ca/People/adilger/ -- Dogbert ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [Ext2-devel] ext2 inode size (on-disk) 2001-04-20 5:35 ` Andreas Dilger @ 2001-04-20 6:37 ` Theodore Tso 0 siblings, 0 replies; 12+ messages in thread From: Theodore Tso @ 2001-04-20 6:37 UTC (permalink / raw) To: Andreas Dilger Cc: Alexander Viro, tytso, linux-kernel, Theodore Y. Ts'o, Ext2 development mailing list On Thu, Apr 19, 2001 at 11:35:40PM -0600, Andreas Dilger wrote: > I don't _think_ that there is a requirement for a multiple-of-8 inodes > per group. OK, looking into mke2fs (actually lib/ext2fs/initialize.c) > it _does_ show that it needs to be a multiple of 8, but I'm not sure > exactly what the "bitmap splicing code" mentioned in the comment is. It's has to be a multiple of 8 because of how e2fsprogs handles bitmaps --- that is, it takes the various pieces of all of the bitmaps, and butts them up together in memory. It would be possible to remove this restriction by reworking the e2fsprogs library code, but quite frankly, I don't think the restriction is all that unreasonable. - Ted ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [Ext2-devel] ext2 inode size (on-disk) 2001-04-20 2:23 ` Alexander Viro 2001-04-20 5:35 ` Andreas Dilger @ 2001-04-20 6:35 ` Theodore Tso 1 sibling, 0 replies; 12+ messages in thread From: Theodore Tso @ 2001-04-20 6:35 UTC (permalink / raw) To: Alexander Viro Cc: tytso, linux-kernel, Andreas Dilger, Theodore Y. Ts'o, Ext2 development mailing list On Thu, Apr 19, 2001 at 10:23:39PM -0400, Alexander Viro wrote: > > I'm somewhat concerned about the following: last block of inode table > fragment may have less inodes than the rest. Reason: number of inodes > per group should be a multiple of 8 and with inodes bigger than 128 > bytes it may give such effect. Comments? Yup, that's right. That shouldn't be too bad, though, since we already calculate things by dividing by INODES_PER_BLOCK_GROUP. So the fact that the last block of the inode table may have some unused space shouldn't be a problem. > I would really, really like to end up with accurate description of > inode table layout somewhere in Documentation/filesystems. Heck, I > volunteer to write it down and submit into the tree ;-) The "design and implementation of ext2" paper has a pretty good explanation of the inode table, but of course it assumed a convenient inode size of 128, and didn't really go into the issues of what might happen if the inode size were larger, or not a power of two. So yeah, getting something which explains how things work now that things have gotten a bit more complicated would be a good thing. - Ted ^ permalink raw reply [flat|nested] 12+ messages in thread
end of thread, other threads:[~2001-04-20 6:38 UTC | newest]
Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
[not found] <20001202014045.F2272@parcelfarce.linux.theplanet.co.uk>
2001-04-19 11:55 ` ext2 inode size (on-disk) Alexander Viro
2001-04-19 17:41 ` Andreas Dilger
2001-04-19 18:02 ` Alexander Viro
2001-04-19 19:24 ` Andreas Dilger
2001-04-20 2:12 ` Alexander Viro
2001-04-19 20:10 ` [Ext2-devel] " tytso
2001-04-19 20:29 ` Jeff Garzik
2001-04-20 6:30 ` Theodore Tso
2001-04-20 2:23 ` Alexander Viro
2001-04-20 5:35 ` Andreas Dilger
2001-04-20 6:37 ` Theodore Tso
2001-04-20 6:35 ` Theodore Tso
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox