* Status of META_BG?
From: Phillip Susi @ 2012-03-15 15:46 UTC
To: ext4 development

What is the status of the META_BG feature?  I thought that it never got off the ground, and FLEX_BG was used instead.  Is there still a possibility of META_BG being implemented in the future, or is the code for it currently in e2fsprogs just cruft?
* Re: Status of META_BG?
From: Andreas Dilger @ 2012-03-15 16:25 UTC
To: Phillip Susi; +Cc: ext4 development

On 2012-03-15, at 9:46 AM, Phillip Susi wrote:
> What is the status of the META_BG feature?  I thought that it never
> got off the ground, and FLEX_BG was used instead.  Is there still a
> possibility of META_BG being implemented in the future, or is the
> code for it currently in e2fsprogs just cruft?

META_BG serves a different purpose than FLEX_BG, but you are right that it never really got off the ground.

META_BG is needed to overcome the problem of the backup group descriptor blocks growing too large, and to simplify the task of online (or offline) filesystem resizing.  In the case of very large filesystems (256TB or more, assuming a 4kB block size) the group descriptor blocks grow to fill an entire block group, and in the case of group 0 and group 1 they would start overlapping, which would not work.  For filesystem resizing, if the number of group descriptor blocks grows and there is no reserved block for it, then an online resize will fail, and an offline resize has to move data and/or metadata in order to keep the group descriptor blocks sequential.

META_BG addresses both of these issues by distributing the group descriptor blocks throughout the filesystem, one set per "meta group" (= the number of groups whose descriptors fit into a single block).  The number of backups is reduced (0-3 backups), and the blocks no longer need to be contiguous.

So there is still value in making this feature work, but it hasn't been high on anyone's list yet.  Assistance welcome.

Cheers, Andreas
* Re: Status of META_BG?
From: Phillip Susi @ 2012-03-15 17:55 UTC
To: Andreas Dilger; +Cc: ext4 development

On 3/15/2012 12:25 PM, Andreas Dilger wrote:
> In the case of very large filesystems (256TB or more, assuming 4kB
> block size) the group descriptor blocks will grow to fill an entire
> block group, and in the case of group 0 and group 1 they would start
> overlapping, which would not work.

To get an fs that large, you have to enable 64bit support, which also means you can pass the limit of 32k blocks per group.  Doing that should allow for a much more reasonable number of groups (which is a good thing for several reasons), and would also solve this problem, wouldn't it?

> META_BG addresses both of these issues by distributing the group
> descriptor blocks into the filesystem for each "meta group" (= the
> number of groups whose descriptors fit into a single block).

So it puts one GD block at the start of every several block groups?  Wouldn't that drastically slow down opening/mounting the fs, since the disk has to seek to every block group?  Perhaps if it were coupled with flex_bg, so that flex_factor GD blocks would be clustered, that would mitigate it somewhat, but IIRC the default flex factor is only 16, so it might need to be bumped up for such large disks.

> The number of backups is reduced (0 - 3 backups), and the blocks do
> not need to be contiguous anymore.

You know, I've been wondering why the group descriptors are backed up in the first place.  If the backups are only ever written at mkfs time, and can be reconstructed with mke2fs -S, then what purpose do they serve?
* Re: Status of META_BG?
From: Andreas Dilger @ 2012-03-15 21:06 UTC
To: Phillip Susi; +Cc: ext4 development

On 2012-03-15, at 11:55 AM, Phillip Susi wrote:
> To get an fs that large, you have to enable 64bit support, which also
> means you can pass the limit of 32k blocks per group.

I'm not sure what you mean here.  Sure, there can be more than 32k blocks per group, but there is still only a single block bitmap per group, so having more blocks per group depends on a larger blocksize.

> Doing that should allow for a much more reasonable number of groups
> (which is a good thing for several reasons), and would also solve this
> problem, wouldn't it?

Possibly, in conjunction with BIGALLOC.

>> META_BG addresses both of these issues by distributing the group
>> descriptor blocks into the filesystem for each "meta group" (= the
>> number of groups whose descriptors fit into a single block).
>
> So it puts one GD block at the start of every several block groups?

One at the start of the first group, the second group, and the last group of each meta group.

> Wouldn't that drastically slow down opening/mounting the fs since the
> disk has to seek to every block group?

Yes, definitely.  That wasn't a concern before flex_bg arrived, since that seek was needed for every group's block/inode bitmap as well.

> Perhaps if it were coupled with flex_bg so that flex_factor GD blocks
> would be clustered, that would mitigate that somewhat, but IIRC the
> default flex factor is only 16, so it might need to be bumped up for
> such large disks.

Something like this could be done to work with flex_bg, but it would likely mean an incompat change, since the GD blocks are among the few "absolute position" blocks in the filesystem that allow finding all of the other metadata.

Maybe with bigalloc the number of groups is reduced and the size of the groups is increased, which helps in two ways: fewer groups mean fewer GD blocks, and larger groups mean more GD blocks can fit into the 0th and 1st groups.

>> The number of backups is reduced (0 - 3 backups), and the blocks do
>> not need to be contiguous anymore.
>
> You know, I've been wondering why the group descriptors are backed up
> in the first place.  If the backups are only ever written at mkfs
> time, and can be reconstructed with mke2fs -S, then what purpose do
> they serve?

Well, "mke2fs -S" only applies a best-guess estimate of the metadata locations using default parameters.  If the parameters are not identical (e.g. flex_bg on/off, bigalloc on/off, etc.) then "mke2fs -S" will only further corrupt an already-fatally-corrupted filesystem, and you need to start from scratch.  Having the group descriptor backups and superblocks in well-known locations avoids this in most cases.

This makes me wonder whether the backup superblock/GD for a bigalloc filesystem can even be located without knowing the bigalloc cluster size, and whether e2fsck will try a range of power-of-two cluster sizes to find it.

Cheers, Andreas
* Re: Status of META_BG?
From: Phillip Susi @ 2012-03-16 13:42 UTC
To: Andreas Dilger; +Cc: ext4 development

On 3/15/2012 5:06 PM, Andreas Dilger wrote:
>> To get an fs that large, you have to enable 64bit support, which also
>> means you can pass the limit of 32k blocks per group.
>
> I'm not sure what you mean here.  Sure, there can be more than 32k
> blocks per group, but there is still only a single block bitmap per
> group, so having more blocks per group depends on a larger blocksize.

Heh, I'm not sure what you mean here.  What does the block bitmap have to do with anything?  I thought the issue was that the size of the block group descriptor table exceeded the size of a block group, as a result of there being a huge number of block groups, each limited to a size of 128 MB.

>> Doing that should allow for a much more reasonable number of groups
>> (which is a good thing for several reasons), and would also solve
>> this problem, wouldn't it?
>
> Possibly, in conjunction with BIGALLOC.

BIGALLOC?

>> So it puts one GD block at the start of every several block groups?
>
> One at the start of the first group, the second group, and the last
> group.

You mean one copy of the whole table?  That's not what the current code in e2fsprogs looks like it does to me.  openfs.c has:

> blk64_t ext2fs_descriptor_block_loc2(ext2_filsys fs, blk64_t group_block,
>                                      dgrp_t i)
> {
>     int bg;
>     int has_super = 0;
>     blk64_t ret_blk;
>
>     if (!(fs->super->s_feature_incompat & EXT2_FEATURE_INCOMPAT_META_BG) ||
>         (i < fs->super->s_first_meta_bg))
>         return (group_block + i + 1);
>
>     bg = EXT2_DESC_PER_BLOCK(fs->super) * i;
>     if (ext2fs_bg_has_super(fs, bg))
>         has_super = 1;
>     ret_blk = ext2fs_group_first_block2(fs, bg) + has_super;

That appears to map the GDT block number to a block group based on how many group descriptors fit in a block, so there's one GDT block every several block groups.  The subsequent code then checks if it is being asked for a backup and shifts the result over by one whole block group, so it looks like there is exactly one backup, whose blocks are each stored in the block group following the one that holds the corresponding primary GDT block.

>> Wouldn't that drastically slow down opening/mounting the fs since
>> the disk has to seek to every block group?
>
> Yes, definitely.  That wasn't a concern before flex_bg arrived, since
> that seek was needed for every group's block/inode bitmap as well.

But you don't need to scan every bitmap at mount time, do you?  Aren't they loaded on demand when the group is first accessed?  But you do need to scan all of the group descriptors at mount time.

> Maybe with bigalloc the number of groups is reduced and the size of
> the groups is increased, which helps in two ways: fewer groups mean
> fewer GD blocks, and larger groups mean more GD blocks can fit into
> the 0th and 1st groups.

That's what I was talking about.  I'm not sure what bigalloc is, but once you enable 64bit, that gets you the ability to have more than 32768 blocks per group, so you have fewer groups and more room in them.

> Well, "mke2fs -S" only applies a best-guess estimate of the metadata
> locations using default parameters.  If the parameters are not
> identical (e.g. flex_bg on/off, bigalloc on/off, etc.) then "mke2fs
> -S" will only further corrupt an already-fatally-corrupted filesystem,
> and you need to start from scratch.

That's true of mke2fs -S, but you could do the same thing while consulting the existing superblock to determine the parameters.  I believe that all parameters that affect the contents of the GDT can be found in the superblock: specifically, block size, blocks per group, and the flex factor.  Given that information, e2fsck should be able to rebuild the GDT.
* Re: Status of META_BG?
From: Ted Ts'o @ 2012-03-18 20:41 UTC
To: Phillip Susi; +Cc: Andreas Dilger, ext4 development

On Thu, Mar 15, 2012 at 01:55:36PM -0400, Phillip Susi wrote:
>> META_BG addresses both of these issues by distributing the group
>> descriptor blocks into the filesystem for each "meta group" (= the
>> number of groups whose descriptors fit into a single block).
>
> So it puts one GD block at the start of every several block groups?
> Wouldn't that drastically slow down opening/mounting the fs since
> the disk has to seek to every block group?

Not necessarily; right now we pull in every single block group descriptor at mount time because we need to update s_free_inodes_count and s_free_blocks_count.  If we change things so that we only pull in the block group descriptors at mount time after a journal replay (but not after a clean umount, when the last inodes count and free blocks count should be correctly updated), that would avoid seeking to every 16th block group at mount time.

   - Ted
* Re: Status of META_BG?
From: Andreas Dilger @ 2012-03-18 23:20 UTC
To: Ted Ts'o; +Cc: Phillip Susi, ext4 development

On 2012-03-18, at 2:41 PM, Ted Ts'o wrote:
> Not necessarily; right now we pull in every single block group
> descriptor at mount time because we need to update s_free_inodes_count
> and s_free_blocks_count.  If we change things so that we only pull in
> the block group descriptors at mount time after a journal replay (but
> not after a clean umount, when the last inodes count and free blocks
> count should be correctly updated), that would avoid seeking to every
> 16th block group at mount time.

The lazy init thread also walks all of the group descriptors in the background after mount, so this could be handled asynchronously even without any changes.  That is OK if there are free blocks and no user processes trying to write files, but we've had slowdowns in the past due to block bitmap lookups of every group looking for free space.

Loading the group descriptors will be 32x or 16x faster than loading the bitmaps, but we still saw delays of up to 10 minutes for filesystems under 16TB due to seeking (before flex_bg), so I imagine this will also be an issue with meta_bg.  It would be nice to retroactively define the semantics of flex_bg + meta_bg to mean that 2^s_log_groups_per_flex group descriptors are co-located.

Cheers, Andreas