* [RFC] dynamic inodes
@ 2008-09-24 11:46 Alex Tomas
From: Alex Tomas @ 2008-09-24 11:46 UTC (permalink / raw)
To: ext4 development
Hi,
here is another idea for how to achieve more (dynamic) inodes:
* new dir_entry format with a 64-bit inode number
* the inode number space is 64-bit:
  * 2^48 physical 4K blocks
  * 2^5 inodes per 4K block
  * the highest bit selects the addressing scheme: static or dynamic
    (see the sketch below)
* each block is covered by two bits, one in the inode (I) and one in the
  block (B) bitmap:
  I: 0, B: 0 - block is free
  I: 0, B: 1 - block is in use, but contains no inodes
  I: 1, B: 0 - block is completely full of inodes
  I: 1, B: 1 - block contains some inodes and still has free space
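A minimal sketch of how such a 64-bit dynamic inode number could be packed
and unpacked. The field widths follow the layout above, but the macro and
function names are illustrative assumptions, not an existing ext4 API:

    /* Hypothetical encoding: bit 63 = dynamic flag, bits 5..52 = physical
     * block number (2^48 blocks), bits 0..4 = inode slot within the block
     * (2^5 inodes per 4K block).  Names are made up for illustration. */
    #include <stdint.h>

    #define DYN_INO_FLAG    (1ULL << 63)
    #define DYN_SLOT_BITS   5

    static inline uint64_t dyn_ino_make(uint64_t block, unsigned slot)
    {
            return DYN_INO_FLAG | (block << DYN_SLOT_BITS) | slot;
    }

    static inline int dyn_ino_is_dynamic(uint64_t ino)
    {
            return (ino & DYN_INO_FLAG) != 0;
    }

    static inline uint64_t dyn_ino_block(uint64_t ino)
    {
            return (ino & ~DYN_INO_FLAG) >> DYN_SLOT_BITS;
    }

    static inline unsigned dyn_ino_slot(uint64_t ino)
    {
            return ino & ((1U << DYN_SLOT_BITS) - 1);
    }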
implementation issues:
* how to allocate new blocks for inodes
  1. try to find a block with a free inode slot using the bitmaps
  2. if none is found, allocate a new block and make it an inode block
  3. if the group has no free blocks, repeat steps 1-2 in other groups
* how to release a block from inode use
  * if a block has no used inodes left, mark it 0 in I and 1 in B
  * or, if the group has very few free inodes, leave it as an inode block
* how to find a free inode in an existing inode block
  * just scan all slots (see the sketch after this list)
* how to migrate/co-exist with static inodes
  * the group is marked with a DYNAMIC_INODES flag
  * we can turn a group dynamic when it has 0 inodes in use (simple)
  * we can turn a group dynamic and update the bitmaps in place (hard)
* how to implement variable-length inodes
  * should be simple at allocation time
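A rough sketch of the "scan all slots" step, assuming a free slot is
recognized by a zero i_mode as suggested later in the thread; the function
and its buffer handling are illustrative only, not kernel code:

    /* Illustrative only: scan one candidate inode block (I:1, B:1) for a
     * free slot, assuming an unused slot is recognized by i_mode == 0
     * (i_mode is the little-endian 16-bit field at offset 0 of an
     * ext2/3/4 on-disk inode). */
    #include <stdint.h>

    static int find_free_slot(const uint8_t *block, unsigned block_size,
                              unsigned inode_size)
    {
            unsigned slots = block_size / inode_size;

            for (unsigned i = 0; i < slots; i++) {
                    const uint8_t *raw = block + i * inode_size;
                    uint16_t i_mode = raw[0] | (raw[1] << 8);   /* le16 */

                    if (i_mode == 0)
                            return i;       /* free slot found */
            }
            return -1;                      /* block is actually full */
    }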
possibilities:
* use large inodes to store directory entries:
  (4096 - 128) / (4 + 8 + 8) = 198 entries
* use large inodes to store EAs
* if we introduce a small header for the new inode blocks, we could use it
  to store file tails
problems:
* an inode can't be relocated without changing the directory entry
* an inode's size can't be changed after creation
* e2fsck support?
comments/suggestions are very welcome.
thanks, Alex
* Re: [RFC] dynamic inodes
@ 2008-09-25 22:09 Andreas Dilger
From: Andreas Dilger @ 2008-09-25 22:09 UTC (permalink / raw)
To: Alex Tomas; +Cc: ext4 development

Sadly this was sitting in my outbox overnight, and might be obsolete
already (explanation in a follow-up email), but I'm sending it as food for
thought...

On Sep 24, 2008 15:46 +0400, Alex Tomas wrote:
> here is another idea for how to achieve more (dynamic) inodes:
> * new dir_entry format with a 64-bit inode number

Yes, that is a requirement in all cases. I've always thought that we
should also implement inode-in-dirent when we need to change the dirent
format and make dynamic inodes, but that may be too much to chew on at one
time.

> * the inode number space is 64-bit:
>   * 2^48 physical 4K blocks
>   * 2^5 inodes per 4K block

The 2^5 inodes/4kB block would actually depend on the blocksize/inodesize;
let's just call this inodes-per-block-bits (IPBB). It will be a power of 2
between 0 and 8 (i.e. between 1 and 256 inodes per block), which is fine.
For common ext4 filesystems this would be 2^4 = 16 inodes/block, because
the default is 256-byte inodes today.

> * the highest bit selects the addressing scheme: static or dynamic

Alternately, any inode >= 2^32 would be dynamic? One clear benefit of
putting the dynamic inodes at the end of the number space is that they
will only be used if the static inodes are full, which reduces risk due
to corruption and overhead due to dynamic allocations.

> * each block is covered by two bits, one in the inode (I) and one in the
>   block (B) bitmap:
>   I: 0, B: 0 - block is free
>   I: 0, B: 1 - block is in use, but contains no inodes
>   I: 1, B: 0 - block is completely full of inodes
>   I: 1, B: 1 - block contains some inodes and still has free space

Storing B:0 for an in-use block seems very dangerous to me. This also
doesn't really address the need to be able to quickly locate free inodes,
because it means "I:1" _might_ mean the inode is free or it might not, so
EVERY "in-use" inode would need to be checked to see if it is free.

We need to start with a "dynamic inode bitmap" (DIB) that is mapped from
an "inode table file" (possibly only for the dynamic inode table blocks).
Free inodes can be scanned using the normal ext4_find_next_zero_bit() in
each of the bitmaps. Each such DIB block holds an array of bits indicating
dynamic inode use, as well as an array of block numbers which map IPBB
inode bits to dynamic inode table blocks. The DIBB should also have a
header which contains space for a magic, a checksum, and the count of free
and total inodes, like a GDT has, as well as a count of in-use itable
blocks.

The dynamic inode table blocks (DITB) should also hold a header with a
magic, a checksum, and a back-pointer to the DIBB. The back-pointer to the
DIBB allows efficient clearing of the in-use bit and location of the DIBB
if the dynamic inode itself is corrupted, and possibly freeing of the DITB
when its last in-use inode is freed.
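To make the proposed layout concrete, here is a rough sketch of what the
DIBB and DITB headers described above might look like. Every field name,
width and magic value here is an assumption for illustration; none of this
is an existing ext4 on-disk structure:

    /* Hypothetical on-disk headers for the proposal above; all names and
     * values are invented.  The mail budgets 64 bytes for the DIBB header,
     * leaving room beyond the fields sketched here. */
    #include <stdint.h>

    #define DIBB_MAGIC 0x44494242u          /* "DIBB", made up */
    #define DITB_MAGIC 0x44495442u          /* "DITB", made up */

    struct dibb_header {                    /* dynamic inode bitmap block */
            uint32_t magic;
            uint32_t checksum;
            uint32_t free_inodes;
            uint32_t total_inodes;
            uint32_t itable_blocks;         /* in-use DITBs mapped below */
            uint32_t pad;
            /* followed by: an array of 8-byte physical block numbers of
             * the DITBs, plus an in-use bitmap of IPBB bits per DITB */
    };

    struct ditb_header {                    /* dynamic inode table block */
            uint32_t magic;
            uint32_t checksum;
            uint64_t dibb_block;            /* back-pointer to the DIBB */
            /* followed by the dynamic inodes themselves */
    };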
For common 256-byte inodes and 4kB blocks we need 8 bytes/block for the
block addresses, and 1 bit/inode, so:

  4096 bytes/block / 256 bytes/inode = 16 inodes (bits)/block = 2-byte bitmap
  (4096 bytes - 64-byte header) / (8-byte address + 2-byte bitmap)
    = ~400 itable blocks per DIBB
    = 400 * 16 = 6400 inodes/DIBB

For 64kB blocks:

  65536 bytes/block / 256 bytes/inode = 256 inodes (bits)/block = 32-byte bitmap
  (65536 bytes - 64-byte header) / (8-byte address + 32-byte bitmap)
    = ~1636 itable blocks per DIBB
    = 1636 * 256 = 418816 inodes/DIBB

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
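A small throwaway program to redo the same capacity figures for other
block/inode-size combinations; the 64-byte header and the per-DITB bitmap
layout are taken from the sketch above and are assumptions, not a defined
on-disk format:

    /* Recompute the DIBB capacity figures for arbitrary block and inode
     * sizes (assumes the hypothetical 64-byte DIBB header). */
    #include <stdio.h>

    static void dibb_capacity(unsigned block_size, unsigned inode_size)
    {
            unsigned inodes_per_block = block_size / inode_size;
            unsigned bitmap_bytes     = inodes_per_block / 8;   /* 1 bit/inode */
            unsigned entry_bytes      = 8 + bitmap_bytes;       /* 8-byte address */
            unsigned itable_blocks    = (block_size - 64) / entry_bytes;

            printf("%u-byte blocks, %u-byte inodes: %u itable blocks, "
                   "%u inodes per DIBB\n", block_size, inode_size,
                   itable_blocks, itable_blocks * inodes_per_block);
    }

    int main(void)
    {
            dibb_capacity(4096, 256);       /* ~400 itable blocks, ~6400 inodes */
            dibb_capacity(65536, 256);      /* ~1636 itable blocks, ~418816 inodes */
            return 0;
    }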
* Re: [RFC] dynamic inodes
@ 2008-09-25 23:00 Alex Tomas
From: Alex Tomas @ 2008-09-25 23:00 UTC (permalink / raw)
To: Andreas Dilger; +Cc: ext4 development

Andreas Dilger wrote:
> Alternately, any inode >= 2^32 would be dynamic? One clear benefit of
> putting the dynamic inodes at the end of the number space is that they
> will only be used if the static inodes are full, which reduces risk due
> to corruption and overhead due to dynamic allocations.

the highest (63rd) bit puts dynamic inodes at the end, no? the idea is
that using 48 bits we can address a block directly, without any additional
lookup via some metainode. essentially this is just a way to introduce
64-bit addressable fragments of size 2^(64-48-1).

>> * each block is covered by two bits, one in the inode (I) and one in
>>   the block (B) bitmap:
>>   I: 0, B: 0 - block is free
>>   I: 0, B: 1 - block is in use, but contains no inodes
>>   I: 1, B: 0 - block is completely full of inodes
>>   I: 1, B: 1 - block contains some inodes and still has free space
>
> Storing B:0 for an in-use block seems very dangerous to me. This also
> doesn't really address the need to be able to quickly locate free inodes,
> because it means "I:1" _might_ mean the inode is free or it might not,
> so EVERY "in-use" inode would need to be checked to see if it is free.

just combine I and B into a single bitmap:
1) when you look for a free block, it's any 0 bit in the bitmap made by
   (I & B)
2) when you look for a free inode (in current inode blocks), it's any 1 bit
   in the bitmap made, again, by (I & B); then you read the corresponding
   block and find a free slot there (for example, it can be a null i_mode)

looks very simple and doable?

> We need to start with a "dynamic inode bitmap" (DIB) that is mapped from
> an "inode table file" (possibly only for the dynamic inode table blocks).
> Free inodes can be scanned using the normal ext4_find_next_zero_bit()
> in each of the bitmaps.

the idea is that we can implement truly dynamic and variable-length inodes
without introducing special files, using existing structures.

thanks, Alex
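A literal transcription of the two lookups Alex describes, purely for
illustration; as Andreas's reply below explains, the combined bitmap is
harder to use in practice than this sketch suggests:

    /* Illustrative only: both searches walk the combined (I & B) bitmap;
     * a hit is only a candidate and still needs checking against the real
     * inode/block state. */
    #include <stdint.h>

    static int find_candidate(const uint64_t *I, const uint64_t *B,
                              unsigned nbits, int want_free_inode)
    {
            for (unsigned i = 0; i < nbits; i++) {
                    unsigned word = i / 64, bit = i % 64;
                    int ib = (I[word] >> bit & 1) & (B[word] >> bit & 1);

                    /* free-inode search: look for a 1 bit (partly filled
                     * inode block); free-block search: look for a 0 bit */
                    if (want_free_inode ? ib : !ib)
                            return i;
            }
            return -1;
    }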
* Re: [RFC] dynamic inodes
@ 2008-09-25 23:29 Andreas Dilger
From: Andreas Dilger @ 2008-09-25 23:29 UTC (permalink / raw)
To: Alex Tomas; +Cc: ext4 development

On Sep 26, 2008 03:00 +0400, Alex Tomas wrote:
> Andreas Dilger wrote:
>> Storing B:0 for an in-use block seems very dangerous to me. This also
>> doesn't really address the need to be able to quickly locate free
>> inodes, because it means "I:1" _might_ mean the inode is free or it
>> might not, so EVERY "in-use" inode would need to be checked to see if
>> it is free.
>
> just combine I and B into a single bitmap:
> 1) when you look for a free block, it's any 0 bit in the bitmap made by
>    (I & B)
> 2) when you look for a free inode (in current inode blocks), it's any 1
>    bit in the bitmap made, again, by (I & B); then you read the
>    corresponding block and find a free slot there (for example, it can
>    be a null i_mode)
>
> looks very simple and doable?

It _sounds_ simple, but I think the implementation will not be what is
expected. Either you need to keep a 3rd bitmap for each group which is
(I & B), used for finding either inodes or blocks first (with,
respectively, find_first_bit() or find_first_zero_bit()), then check the
"normal" inode and block bitmaps, keeping this in sync with mballoc, with
confusion/danger on disk/e2fsck because in-use itable blocks are marked
"0" in the block bitmap. There will be races between updating these
bitmaps, unless the group is locked for both block and inode allocations
on any update, because setting any bit completely changes the meaning.

Alternately, if there are only I and B bitmaps, then find_first_bit() and
find_first_zero_bit() are not useful. Searching for free blocks means
looking for "B:0" and finding potentially many "B:0 I:1" blocks that are
full of inodes. Searching for free inodes means looking for "I:1"
(strangely) but finding potentially many "I:1 B:0" blocks.

I much prefer the dynamic itable idea from José (which I embellished in my
other email), which is very simple for both the kernel and e2fsck, robust,
and avoids the 64-bit inode problem for userspace to the maximum amount
(i.e. a full 4B inodes must be in use before we ever need to use 64-bit
inodes). The lack of complexity in itable allocation also translates
directly into increased robustness in the face of corruption.

It doesn't provide dynamic-sized inodes (which hasn't traditionally been a
problem), nor is it perfect in terms of being able to fully populate a
filesystem with inodes in all use cases, but it could work in all but
completely pathological fragmentation cases (at which point one wonders if
it isn't better to just return -ENOSPC than to flog a nearly dead
filesystem). It can definitely do a good job in most likely uses, and also
provides a big win over what is done today.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
* Re: [RFC] dynamic inodes
@ 2008-09-30 14:02 Alex Tomas
From: Alex Tomas @ 2008-09-30 14:02 UTC (permalink / raw)
To: Andreas Dilger; +Cc: ext4 development

Andreas Dilger wrote:
> It _sounds_ simple, but I think the implementation will not be what is
> expected. Either you need to keep a 3rd bitmap for each group which is
> (I & B), used for finding either inodes or blocks first (with,
> respectively, find_first_bit() or find_first_zero_bit()), then check the
> "normal" inode and block bitmaps, keeping this in sync with mballoc,
> with confusion/danger on disk/e2fsck because in-use itable blocks are
> marked "0" in the block bitmap. There will be races between updating
> these bitmaps, unless the group is locked for both block and inode
> allocations on any update, because setting any bit completely changes
> the meaning.
>
> Alternately, if there are only I and B bitmaps, then find_first_bit()
> and find_first_zero_bit() are not useful. Searching for free blocks
> means looking for "B:0" and finding potentially many "B:0 I:1" blocks
> that are full of inodes. Searching for free inodes means looking for
> "I:1" (strangely) but finding potentially many "I:1 B:0" blocks.

mballoc already maintains its own in-core copy, so we'd have to apply
another bitmap to it. as for races - I think this can be done with proper
ordering, probably even without locks: a free block turns used first, and
only then becomes part of the "fragmentary" space. anyway, the complexity
would be way smaller than mballoc itself, for example.

> I much prefer the dynamic itable idea from José (which I embellished in
> my other email), which is very simple for both the kernel and e2fsck,
> robust, and avoids the 64-bit inode problem for userspace to the maximum
> amount (i.e. a full 4B inodes must be in use before we ever need to
> use 64-bit inodes). The lack of complexity in itable allocation also
> translates directly into increased robustness in the face of corruption.
>
> It doesn't provide dynamic-sized inodes (which hasn't traditionally
> been a problem), nor is it perfect in terms of being able to fully
> populate a filesystem with inodes in all use cases, but it could work
> in all but completely pathological fragmentation cases (at which point
> one wonders if it isn't better to just return -ENOSPC than to flog a
> nearly dead filesystem). It can definitely do a good job in most likely
> uses, and also provides a big win over what is done today.

I do very much understand simplicity and robustness as driving reasons,
and I agree that dynamic inodes added via empty group descriptors is an
excellent idea. but I still think that with the original idea we could get
much more than just dynamic inodes (though that was the original
intention). for example, storing small directories (up to ~200 directory
entries) within the inode could be very nice to avoid a bunch of seeks,
and tail packing could be as well. notice we don't really need
fine-grained structures to find free slots in that fragmentary space -
usually small files are generated at once (e.g. tar -xf), so to pack them
we just need to remember a few recent "partially filled" blocks. if some
file is deleted and we get a free slot in the corresponding block - who
cares - with current ext3/4 we'd waste the whole block anyway. another
reason for that design was to support >2^32 files per filesystem.

thanks, Alex
* Re: [RFC] dynamic inodes
@ 2008-09-25 22:37 Andreas Dilger
From: Andreas Dilger @ 2008-09-25 22:37 UTC (permalink / raw)
To: Alex Tomas; +Cc: ext4 development

On Sep 24, 2008 15:46 +0400, Alex Tomas wrote:
> here is another idea for how to achieve more (dynamic) inodes:

Actually, José proposed a _very_ simple idea that would allow dynamic
inodes with relatively low code complexity or risk due to dynamic
placement of inode tables.

The basic idea is to extend the FLEX_BG feature so that (essentially)
"blockless groups" can be added to the filesystem when the inodes are all
gone. The core idea of FLEX_BG is that the "group metadata" (inode and
block bitmaps, inode table) can be placed anywhere in the filesystem.
This implies that a "block group" is strictly just a contiguous range of
blocks, and somewhere in the filesystem is the metadata that describes its
usage.

If one adds a new group (ostensibly "at the end of the filesystem") that
has a flag which indicates there are no blocks available in the group,
then what we get is the inode bitmap and inode table, with a 1-block
"excess baggage" of the block bitmap and a new group descriptor. The
"baggage" is small compared to the overhead needed to locate and describe
fully dynamic inode tables.

A big plus is that there are very few changes needed to the kernel or
e2fsck (the "dynamic inode table" is just a group which has no space for
data). Some quick checks on 10 filesystems (some local, some server) show
that there is enough contiguous space in the filesystems to allocate a
full inode table (between 1-4MB for most filesystems), and mballoc can
help with this. This makes sense because the cases where there is a
shortage of inodes also mean there is an excess of space, and if the
inodes were under-provisioned it also (usually) means the itable is on the
smaller side.

Another important benefit is that the 32-bit inode space is used fully
before there is any need to grow to 64-bit inodes. This avoids the
compatibility issues with userspace to the maximum possible extent,
without any complex remapping of inode numbers.

We could hedge our bets for finding a large enough contiguous itable space
and allow the itable to be smaller than normal, marking the end inodes as
in-use. e2fsck will in fact consider any blocks under the rest of the
inode table as "shared blocks" and do duplicate block processing to remap
the data blocks. We could also leverage online defrag to remap the blocks
before allocating the itable if there isn't enough space.

Another major benefit of this approach is that the "dynamic" inode table
is actually relatively static in location, and we don't need a tree to
find it. We would continue to use the "normal" group inodes first, and
only add dynamic groups if there are no free inodes. It would also be
possible to remove the last dynamic group if all its inodes are freed.

The itable location would be replicated to all of the group descriptor
backups for safety, though we would need to find a way for META_BG to
store a backup of the GDT in blocks that don't exist, in the case where
increasing the GDT size in-place isn't possible.

The drawback of the approach is relatively coarse-grained itable
allocation, which would fail if the filesystem is highly fragmented, but
we don't _have_ to succeed either. The coarse-grained approach is also a
benefit because we don't need complex data structures to find the itable,
it reduces seeking during e2fsck, and we can keep some hysteresis in
adding/removing dynamic groups to reduce overhead (updates of many GDT
backups).

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
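A rough sketch of how the "no blocks available" flag above might be
consumed by the allocators. EXT4_BG_BLOCKLESS is an invented name and
value (only INODE_UNINIT, BLOCK_UNINIT and INODE_ZEROED style flags exist
today), and the group-descriptor struct is reduced to the one field the
sketch needs:

    /* Illustrative only: a hypothetical bg_flags bit marking a group that
     * contributes inodes but no data blocks. */
    #include <stdint.h>

    #define EXT4_BG_BLOCKLESS 0x0008        /* hypothetical flag value */

    struct group_desc {                     /* stand-in for ext4_group_desc */
            uint16_t bg_flags;
    };

    static int group_has_data_blocks(const struct group_desc *gdp)
    {
            return !(gdp->bg_flags & EXT4_BG_BLOCKLESS);
    }

    /* the block allocator (and online resize) would simply skip such groups */
    static long next_group_with_blocks(const struct group_desc *gd,
                                       long start, long ngroups)
    {
            for (long g = start; g < ngroups; g++)
                    if (group_has_data_blocks(&gd[g]))
                            return g;
            return -1;                      /* only inode-only groups remain */
    }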
* Re: [RFC] dynamic inodes
@ 2008-09-26 1:10 Jose R. Santos
From: Jose R. Santos @ 2008-09-26 1:10 UTC (permalink / raw)
To: Andreas Dilger; +Cc: Alex Tomas, ext4 development

On Thu, 25 Sep 2008 16:37:31 -0600
Andreas Dilger <adilger@sun.com> wrote:
> [...]
> The itable location would be replicated to all of the group descriptor
> backups for safety, though we would need to find a way for META_BG to
> store a backup of the GDT in blocks that don't exist, in the case where
> increasing the GDT size in-place isn't possible.

One way to get around this is to implement the exact opposite of what I
proposed earlier and have block groups with no inode tables. If we do a
1:1 distribution of inodes per block and don't allocate inode tables for a
series of block groups within a flexbg, we could later on attempt to
allocate new inode tables when we run out of inodes. If we leave holes in
the inode numbers for the missing inode tables, adding new inode tables in
these block groups would not require any inode renumbering. This also
does not break the current inode allocator, which would be a good thing.
This should be even simpler to implement than the previous proposal. The
drawback is that when allocating a new inode table, the 1:1 distribution
of inodes per block means that we need to find a bigger chunk of
contiguous blocks, since we have bigger inode tables per block group.
Since the current inode allocator tries to keep 10% of the blocks in a
flexbg free, finding contiguous blocks may not be a really big issue.
Another issue is 64-bit filesystems if we use a 1:1 scheme.

This would be like uninitialized inode tables with the added steps of
finding free blocks, allocating a new inode table and zeroing it. Since
we could choose to allocate a new inode table in the flexbg with the most
free blocks, this could keep filesystem metadata/data layout consistently
close together to maintain predictable performance. This option also has
no overhead compared to the previous proposal.

-JRS
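The "holes in the inode numbers" point works because ext4 inode numbers
are purely positional; a small illustration (the group geometry values are
examples, not read from a real superblock):

    /* Inode numbers are computed from the group and the slot, so a group
     * whose inode table simply is not allocated yet leaves a gap in the
     * number space without renumbering anything. */
    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
            uint32_t inodes_per_group = 8192;
            uint32_t group = 5;             /* a group with no itable yet */
            uint32_t index = 123;           /* slot within that group */

            /* same formula whether or not the itable blocks exist on disk */
            uint64_t ino = (uint64_t)group * inodes_per_group + index + 1;

            printf("inode %llu stays reserved for group %u until its itable "
                   "is allocated\n", (unsigned long long)ino, group);
            return 0;
    }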
* Re: [RFC] dynamic inodes
@ 2008-09-26 10:36 Andreas Dilger
From: Andreas Dilger @ 2008-09-26 10:36 UTC (permalink / raw)
To: Jose R. Santos; +Cc: Alex Tomas, ext4 development

On Sep 25, 2008 20:10 -0500, Jose R. Santos wrote:
> One way to get around this is to implement the exact opposite of what I
> proposed earlier and have block groups with no inode tables. If we do a
> 1:1 distribution of inodes per block and don't allocate inode tables for
> a series of block groups within a flexbg, we could later on attempt to
> allocate new inode tables when we run out of inodes. If we leave holes
> in the inode numbers for the missing inode tables, adding new inode
> tables in these block groups would not require any inode renumbering.
> [...]

The problem with leaving gaps in the itable is that this needs the
filesystem to be created in this manner in the first place, while adding
them at the end can be done to any filesystem. If we are preparing the
filesystem in advance for this we could just reserve enough GDT space too
(as online resize already does to some extent).

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
* Re: [RFC] dynamic inodes
@ 2008-09-26 14:49 Jose R. Santos
From: Jose R. Santos @ 2008-09-26 14:49 UTC (permalink / raw)
To: Andreas Dilger; +Cc: Alex Tomas, ext4 development

On Fri, 26 Sep 2008 04:36:07 -0600
Andreas Dilger <adilger@sun.com> wrote:
> On Sep 25, 2008 20:10 -0500, Jose R. Santos wrote:
>> One way to get around this is to implement the exact opposite of what I
>> proposed earlier and have block groups with no inode tables.
>> [...]
>
> The problem with leaving gaps in the itable is that this needs the
> filesystem to be created in this manner in the first place, while adding
> them at the end can be done to any filesystem. If we are preparing the
> filesystem in advance for this we could just reserve enough GDT space
> too (as online resize already does to some extent).

Agreed, but performance-wise this way is more consistent with the current
block and inode allocators. The block allocator will start its free-block
search in the block group that contains the inode. Since these block
groups do not contain any blocks, the block allocator will have to be
modified to make sure data is not being placed randomly on the disk. The
flex_bg inode allocator would also need to be modified, since it currently
depends on an algorithm that assumes that block groups contain actual
blocks. One of the things that got flex_bg added to ext4 in the first
place was the performance improvement it provided. I would like to keep
that advantage if possible.

This could also be used to speed up mkfs, since we would not need to zero
out as many inode tables. We could initialize just a couple of inode
tables per flex_bg group and allocate the rest dynamically. You do pay a
small penalty when allocating a new inode table, since we first need to
find the blocks for that inode table as well as zero it afterward. The
penalty is less than if we do the one-time background zeroing of inode
tables, where the disk will be thrashing for a while the first time it is
mounted.

If supporting already existing filesystems is really important, we could
always implement both techniques, since they technically should not
conflict with each other, though you couldn't use both of them at the same
time if you have a 1:1 block/inode ratio.

-JRS
* Re: [RFC] dynamic inodes
@ 2008-09-26 20:01 Andreas Dilger
From: Andreas Dilger @ 2008-09-26 20:01 UTC (permalink / raw)
To: Jose R. Santos; +Cc: Alex Tomas, ext4 development

On Sep 26, 2008 09:49 -0500, Jose R. Santos wrote:
> Agreed, but performance-wise this way is more consistent with the
> current block and inode allocators. The block allocator will start its
> free-block search in the block group that contains the inode. Since
> these block groups do not contain any blocks, the block allocator will
> have to be modified to make sure data is not being placed randomly on
> the disk.

This is already the case today when a block group is full. The block
allocator needs to handle this gracefully.

> The flex_bg inode allocator would also need to be modified, since it
> currently depends on an algorithm that assumes that block groups contain
> actual blocks. One of the things that got flex_bg added to ext4 in the
> first place was the performance improvement it provided. I would like
> to keep that advantage if possible.

I don't think the performance advantage was at all related to inode->block
locality (since this is actually worse with FLEX_BG) but rather to better
metadata locality (e.g. contiguous bitmaps and itables avoiding seeking
during metadata operations).

> This could also be used to speed up mkfs, since we would not need to
> zero out as many inode tables. We could initialize just a couple of
> inode tables per flex_bg group and allocate the rest dynamically.

There is already the ability to avoid zeroing ANY inode tables with
uninit_bg, but it is unsafe to do this in production because the old
itable data is there and e2fsck might become confused if the group's
bg_itable_unused is lost (due to GDT corruption or other inconsistency).

> You do pay a small penalty when allocating a new inode table, since we
> first need to find the blocks for that inode table as well as zero it
> afterward. The penalty is less than if we do the one-time background
> zeroing of inode tables, where the disk will be thrashing for a while
> the first time it is mounted.

I don't think it is any different. The itable zeroing is _still_ needed,
because the flag that indicates whether an itable is used or not is
unreliable in some corruption cases, and we don't want to read garbage
from disk. IMHO when a filesystem is first formatted and mounted it is
probably mostly idle, and if not, the zeroing (and other stuff) thread can
be delayed (e.g. in a new distro install maybe the itables aren't zeroed
until the second or third mount, no great loss/risk).

> If supporting already existing filesystems is really important, we could
> always implement both techniques, since they technically should not
> conflict with each other, though you couldn't use both of them at the
> same time if you have a 1:1 block/inode ratio.

IMHO dynamic inode tables for existing filesystems is the MAIN goal. Once
you know you have run out of inodes it is already too late to plan for it,
and if you need a reformat to implement this scheme you could just as
easily reformat with enough inodes in the first place :-).

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
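For readers unfamiliar with uninit_bg, the bg_itable_unused count Andreas
mentions is what lets the kernel and e2fsck skip the tail of a group's
inode table; roughly (field access simplified, no checksum or locking
handling, so this is a sketch rather than the real ext4 code path):

    /* Sketch: how many itable blocks actually need scanning in a group
     * when bg_itable_unused can be trusted (the real code validates the
     * group checksum before trusting the field). */
    static unsigned itable_blocks_in_use(unsigned inodes_per_group,
                                         unsigned itable_unused,
                                         unsigned inode_size,
                                         unsigned block_size)
    {
            unsigned used_inodes = inodes_per_group - itable_unused;
            unsigned inodes_per_block = block_size / inode_size;

            /* round up to whole itable blocks */
            return (used_inodes + inodes_per_block - 1) / inodes_per_block;
    }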
* Re: [RFC] dynamic inodes
@ 2008-09-26 2:11 Theodore Tso
From: Theodore Tso @ 2008-09-26 2:11 UTC (permalink / raw)
To: Andreas Dilger; +Cc: Alex Tomas, ext4 development

On Thu, Sep 25, 2008 at 04:37:31PM -0600, Andreas Dilger wrote:
> If one adds a new group (ostensibly "at the end of the filesystem") that
> has a flag which indicates there are no blocks available in the group,
> then what we get is the inode bitmap and inode table, with a 1-block
> "excess baggage" of the block bitmap and a new group descriptor. The
> "baggage" is small compared to the overhead needed to locate and
> describe fully dynamic inode tables.

It's a good idea; and technically you don't have to allocate a block
bitmap, given that the flag is present which says "no blocks available".
The reason for allocating it is that if you're trying to maintain full
backwards compatibility, it will work --- except that you need some way of
making sure that the on-line resizing code won't screw with the filesystem
--- so the feature would have to be a read-only compat feature anyway.

To do on-line resizing, you'd have to clear the flag and then know that
the first "inode-only" block group should be given the new blocks.

> The itable location would be replicated to all of the group descriptor
> backups for safety, though we would need to find a way for META_BG to
> store a backup of the GDT in blocks that don't exist, in the case where
> increasing the GDT size in-place isn't possible.

This is actually the big problem; with META_BG, in order to find the group
descriptor blocks, it assumes that the first group descriptor can be found
at the beginning of the group descriptor block, which means it has to be
found at a certain offset from the beginning of the filesystem. And this
would not be true for inode-only block groups.

The simplest solution actually would be to allocate inodes from the *end*
of the 32-bit inode space, growing downwards, and having those inodes be
stored in a reserved inode. You would lose block locality, although that
could be solved by adding a block-group affinity field to the inode
structure which is used by "extended inodes".

- Ted
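A quick sketch of the mapping Ted describes, where dynamic inode numbers
grow downward from 2^32 - 1 and the extra inodes live in the data blocks
of a reserved inode; the helper and the way the static/dynamic boundary is
passed in are hypothetical:

    /* Illustrative only: map a "downward-growing" dynamic inode number to
     * an offset inside a hypothetical reserved "extended itable" inode.
     * The boundary (count of static inodes) would come from the
     * superblock; here it is just a parameter. */
    #include <stdint.h>

    static int64_t extended_itable_offset(uint32_t ino, uint32_t static_inodes,
                                          uint32_t inode_size)
    {
            if (ino <= static_inodes)
                    return -1;              /* handled by the static itable */

            /* inode 0xFFFFFFFF is the first dynamic one, 0xFFFFFFFE the
             * second, and so on down toward the static range */
            return (int64_t)(UINT32_MAX - ino) * inode_size;
    }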
* Re: [RFC] dynamic inodes
@ 2008-09-26 10:33 Andreas Dilger
From: Andreas Dilger @ 2008-09-26 10:33 UTC (permalink / raw)
To: Theodore Tso; +Cc: Alex Tomas, ext4 development

On Sep 25, 2008 22:11 -0400, Theodore Ts'o wrote:
> On Thu, Sep 25, 2008 at 04:37:31PM -0600, Andreas Dilger wrote:
>> If one adds a new group (ostensibly "at the end of the filesystem")
>> that has a flag which indicates there are no blocks available in the
>> group, then what we get is the inode bitmap and inode table, with a
>> 1-block "excess baggage" of the block bitmap and a new group
>> descriptor. The "baggage" is small compared to the overhead needed to
>> locate and describe fully dynamic inode tables.
>
> It's a good idea; and technically you don't have to allocate a block
> bitmap, given that the flag is present which says "no blocks available".
> The reason for allocating it is that if you're trying to maintain full
> backwards compatibility, it will work --- except that you need some way
> of making sure that the on-line resizing code won't screw with the
> filesystem --- so the feature would have to be a read-only compat
> feature anyway.

Sure, I agree it is possible to go either way. I was just trying to go
for the element of least surprise. Having a group with
"bg_block_bitmap = 0" would be strange, but no more strange than having a
group for blocks beyond the end of the filesystem...

> To do on-line resizing, you'd have to clear the flag and then know that
> the first "inode-only" block group should be given the new blocks.

Right.

>> The itable location would be replicated to all of the group descriptor
>> backups for safety, though we would need to find a way for META_BG to
>> store a backup of the GDT in blocks that don't exist, in the case where
>> increasing the GDT size in-place isn't possible.
>
> This is actually the big problem; with META_BG, in order to find the
> group descriptor blocks, it assumes that the first group descriptor can
> be found at the beginning of the group descriptor block, which means it
> has to be found at a certain offset from the beginning of the
> filesystem. And this would not be true for inode-only block groups.

We could special-case the placement of the GDT blocks in this case, and
then put them into the proper META_BG location when/if the blocks are
actually added to the filesystem.

> The simplest solution actually would be to allocate inodes from the
> *end* of the 32-bit inode space, growing downwards, and having those
> inodes be stored in a reserved inode. You would lose block locality,
> although that could be solved by adding a block-group affinity field
> to the inode structure which is used by "extended inodes".

I don't see how growing the inode numbers downward really helps anything.
With FLEX_BG there already is no "affinity" between the inodes and the
blocks.

The drawback of putting the inode table into an inode is that this is
relatively fragile if the inode is corrupted. We'd want to have
replication of the inode itself (we couldn't replicate the whole inode
table very efficiently). Alternately, we could put the GDT into the inode
and replicate the whole inode several times (the data would already be
present in the filesystem). We just need to select inodes from disparate
parts of the filesystem to avoid corruption (I'd suggest one inode from
each backup superblock group), point them at the existing GDT blocks, then
allow the new GDT blocks to be added to each one. The backup GDT-inode
copies only need to be changed when new groups are added/removed.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
* Re: [RFC] dynamic inodes
@ 2008-09-26 14:33 Theodore Tso
From: Theodore Tso @ 2008-09-26 14:33 UTC (permalink / raw)
To: Andreas Dilger; +Cc: Alex Tomas, ext4 development

On Fri, Sep 26, 2008 at 04:33:22AM -0600, Andreas Dilger wrote:
>> This is actually the big problem; with META_BG, in order to find the
>> group descriptor blocks, it assumes that the first group descriptor can
>> be found at the beginning of the group descriptor block, which means it
>> has to be found at a certain offset from the beginning of the
>> filesystem. And this would not be true for inode-only block groups.
>
> We could special-case the placement of the GDT blocks in this case, and
> then put them into the proper META_BG location when/if the blocks are
> actually added to the filesystem.

Yes, but where do you put the GDT blocks in the case where there is no
more space in the reserved GDT blocks? Using some inode is probably the
best bet, since we would then know where to find the GDT blocks.

My suggestion of using inode numbers growing downward from the top of the
2**32 number space was to avoid needing to move the GDT blocks into their
proper place if and when the filesystem is grown; it simplifies the code
needed for the on-line resizing, and it also means that when you do the
on-line resizing the filesystem gets more inodes along with more blocks.
If we move the GDT blocks into their proper place, then the filesystem
gets more blocks, but not more inodes; but if the inodes are dynamically
grown automatically by the filesystem, maybe that's not a problem.

> Alternately, we could put the GDT into the inode and replicate the whole
> inode several times (the data would already be present in the
> filesystem). We just need to select inodes from disparate parts of the
> filesystem to avoid corruption (I'd suggest one inode from each backup
> superblock group), point them at the existing GDT blocks, then allow the
> new GDT blocks to be added to each one. The backup GDT-inode copies
> only need to be changed when new groups are added/removed.

Yes, that's probably the best solution, IMHO.

- Ted
* Re: [RFC] dynamic inodes
@ 2008-09-26 20:18 Andreas Dilger
From: Andreas Dilger @ 2008-09-26 20:18 UTC (permalink / raw)
To: Theodore Tso; +Cc: Alex Tomas, ext4 development

On Sep 26, 2008 10:33 -0400, Theodore Ts'o wrote:
>> We could special-case the placement of the GDT blocks in this case, and
>> then put them into the proper META_BG location when/if the blocks are
>> actually added to the filesystem.
>
> Yes, but where do you put the GDT blocks in the case where there is no
> more space in the reserved GDT blocks? Using some inode is probably the
> best bet, since we would then know where to find the GDT blocks.

I agree that replicating a GDT inode is probably the easiest answer. IIRC
this was proposed in the past as well, before META_BG was implemented. To
be honest, we should just deprecate META_BG at that point; I don't think
it was ever used by anything, and it still isn't properly handled by the
cross-product of filesystem features (online resize, others).

> My suggestion of using inode numbers growing downward from the top of
> the 2**32 number space was to avoid needing to move the GDT blocks
> into their proper place if and when the filesystem is grown;

How do inode numbers affect the GDT blocks? Is it because high inode
numbers would be in correspondingly high "groups" and resizing could be
done "normally" without affecting the new GDT placement? Once we move
over to a scheme of GDT inodes, there isn't necessarily a "proper place"
for GDT blocks, so I don't know if that makes a difference.

I was going to object on the grounds that the GDT inodes will become too
large and sparse, but for a "normal" ratio (8192 inodes/group) this only
works out to 32MB for the whole GDT to hit 2^32 inodes.

The other thing we should consider is the case where the inode ratio is
too high, and it is limiting the growth of the filesystem due to the 2^32
inode limit. With a default inode ratio of 1 inode/8192 bytes, this hits
2^32 inodes at 262144 groups, or only 32TB... We may need to also be able
to add "inodeless groups" in such systems unless we also implement 64-bit
inode numbers at the same time. This isn't impossible, though the
directory format would need to change to handle 64-bit inode numbers, and
we would need some way to convert between the leaf formats.

> it simplifies the code needed for the on-line resizing, and it also
> means that when you do the on-line resizing the filesystem gets more
> inodes along with more blocks. ... if the inodes are dynamically grown
> automatically by the filesystem, maybe that's not a problem.

It probably makes sense to increase the "static" inode count
proportionally with the new blocks, since we already know the inode ratio
is too small, so I can see a benefit from this direction.

>> Alternately, we could put the GDT into the inode and replicate the
>> whole inode several times (the data would already be present in the
>> filesystem). We just need to select inodes from disparate parts of the
>> filesystem to avoid corruption (I'd suggest one inode from each backup
>> superblock group), point them at the existing GDT blocks, then allow
>> the new GDT blocks to be added to each one. The backup GDT-inode
>> copies only need to be changed when new groups are added/removed.
>
> Yes, that's probably the best solution, IMHO.
>
> - Ted

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
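The 32TB figure follows from the default mke2fs geometry; a quick check of
the arithmetic, where the 4KB block size, 128MB groups and the
8192-bytes-per-inode default are the assumptions:

    /* 4KB blocks -> 32768 blocks (128MB) per group; 1 inode per 8192
     * bytes -> 16384 inodes per group.  2^32 inodes are exhausted after
     * 2^32 / 16384 = 262144 groups, i.e. 262144 * 128MB = 32TB. */
    #include <stdio.h>

    int main(void)
    {
            unsigned long long inodes_per_group = (32768ULL * 4096) / 8192;
            unsigned long long groups = (1ULL << 32) / inodes_per_group;
            unsigned long long tbytes = groups * 32768 * 4096 >> 40;

            printf("%llu groups, %llu TB\n", groups, tbytes);   /* 262144, 32 */
            return 0;
    }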
* Re: [RFC] dynamic inodes
@ 2008-09-26 22:26 Theodore Tso
From: Theodore Tso @ 2008-09-26 22:26 UTC (permalink / raw)
To: Andreas Dilger; +Cc: Alex Tomas, ext4 development

On Fri, Sep 26, 2008 at 02:18:32PM -0600, Andreas Dilger wrote:
> I agree that replicating a GDT inode is probably the easiest answer.
> IIRC this was proposed in the past as well, before META_BG was
> implemented. To be honest, we should just deprecate META_BG at that
> point; I don't think it was ever used by anything, and it still isn't
> properly handled by the cross-product of filesystem features (online
> resize, others).

META_BG is I think relatively well supported now, actually. More to the
point, the resize_inode feature doesn't work for filesystems with more
than 2**32 blocks, since indirect blocks don't work for such filesystems.
The assumption had always been that we would use META_BG to support
online resize for > 2**32 block filesystems, once we had implemented
on-line resize support for it.

> How do inode numbers affect the GDT blocks? Is it because high inode
> numbers would be in correspondingly high "groups" and resizing could
> be done "normally" without affecting the new GDT placement?

Yep. So inode numbers between 1 and (num_bg * inodes_per_bg) + 1 are
"natural" inodes, and inodes above that would have to be "dynamic" inodes
where the GDT would be found in an inode.

> I was going to object on the grounds that the GDT inodes will become too
> large and sparse, but for a "normal" ratio (8192 inodes/group) this only
> works out to 32MB for the whole GDT to hit 2^32 inodes.

I'm not sure what you mean by "sparse"... they would be just as tightly
packed, but just starting at 2**32 - 1 and growing down.

> The other thing we should consider is the case where the inode ratio is
> too high, and it is limiting the growth of the filesystem due to the
> 2^32 inode limit. With a default inode ratio of 1 inode/8192 bytes,
> this hits 2^32 inodes at 262144 groups, or only 32TB... We may need to
> also be able to add "inodeless groups" in such systems unless we also
> implement 64-bit inode numbers at the same time.

Yeah, good point. The real fundamental question is whether we want to try
to support 2**64 inodes as a long-term goal. Past a certain point, we
would have to have inodeless groups if we support 2**48 physical blocks,
but only 2**32 inodes, with or without dynamic inodes.

- Ted