64bit filesystem questions

linux-ext4.vger.kernel.org archive mirror

* 64bit filesystem questions
@ 2011-06-09 14:36 Phillip Susi
  2011-06-10  0:08 ` Andreas Dilger
  0 siblings, 1 reply; 9+ messages in thread
From: Phillip Susi @ 2011-06-09 14:36 UTC
  To: linux-ext4

I was checking up on 64bit ( > 16 TB ) fs support in the 
next branch last night and have a few questions:

1)  Why is blocks_per_group limited to 32k ( more specifically, 8 x 
blocksize )?

2)  Why can't 64bit be enabled explicitly ( with -O or -E? ) instead of 
only automatically, when it is needed and the automatic 64bit setting is 
enabled in mke2fs.conf?

3)  Why does 64bit disable the resize inode?


* Re: 64bit filesystem questions
  2011-06-09 14:36 64bit filesystem questions Phillip Susi
@ 2011-06-10  0:08 ` Andreas Dilger
  2011-06-10 15:19   ` Phillip Susi
  0 siblings, 1 reply; 9+ messages in thread
From: Andreas Dilger @ 2011-06-10  0:08 UTC
  To: Phillip Susi; +Cc: linux-ext4

On 2011-06-09, at 8:36 AM, Phillip Susi wrote:
> I was checking up on the support for 64bit ( > 16 TB ) fs support in the next branch last night and have a few questions:
> 
> 1)  Why is blocks_per_group limited to 32k ( more specifically, 8 x blocksize )

There is only a single block pointer for each bitmap per group.  That said,
with flex_bg this is mostly meaningless, since the bitmaps do not have to
be located in the group, and a flex group is the same as a virtual group
that is {flex_bg_factor} times as large.
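
To make the arithmetic concrete: the block bitmap for a group is a single
block with one bit per block in the group, so a group can hold at most
8 x blocksize blocks.  A minimal Python sketch of that limit (the function
name is made up for illustration and is not taken from e2fsprogs):

```python
def max_blocks_per_group(block_size_bytes: int) -> int:
    """Blocks that a single one-block bitmap can track (one bit per block)."""
    return block_size_bytes * 8


for bs in (1024, 2048, 4096):
    blocks = max_blocks_per_group(bs)
    group_bytes = blocks * bs  # size of one full group in bytes
    print(f"{bs:5d}-byte blocks: {blocks:6d} blocks/group, "
          f"{group_bytes // 2**20:4d} MiB/group")
```

With 4 KiB blocks this gives the 32768 blocks (128 MiB) per group that the
question refers to.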

> 2)  Why can't 64bit be enabled explicitly ( with -O or -E? ) instead of automatically when needed and the enable automatic 64bit setting is in mke2fs.conf?

I thought it could be enabled with "-O 64bit", but I admit I've never tried.

> 3)  Why does 64bit disable the resize inode?

Because the on-disk format of the resize inode is only suitable for 32-bit
filesystems (it is an indirect-block mapped file and cannot reserve blocks
beyond 2^32).  The "future" way to resize filesystems is using the META_BG
feature, but the ability to use it has not been integrated into the kernel
or e2fsprogs yet.
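
As a rough illustration of the 2^32 limit: an indirect-block mapped file
stores 32-bit block numbers, so it cannot refer to anything past block
2^32 - 1.  The sketch below assumes 4 KiB blocks (an assumption, the thread
does not fix a block size here) and uses a hypothetical helper name:

```python
def max_addressable_bytes(block_number_bits: int, block_size_bytes: int) -> int:
    """Bytes reachable when block numbers are block_number_bits wide."""
    return (2 ** block_number_bits) * block_size_bytes


limit = max_addressable_bytes(32, 4096)  # assumed 4 KiB block size
print(limit // 2**40, "TiB")
```

This is the same 16 TB boundary above which the 64bit feature is needed.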

Cheers, Andreas


* Re: 64bit filesystem questions
  2011-06-10  0:08 ` Andreas Dilger
@ 2011-06-10 15:19   ` Phillip Susi
  2011-06-10 16:19     ` Andreas Dilger
  0 siblings, 1 reply; 9+ messages in thread
From: Phillip Susi @ 2011-06-10 15:19 UTC
  To: Andreas Dilger; +Cc: linux-ext4

On 6/9/2011 8:08 PM, Andreas Dilger wrote:
> There is only a single block pointer for each bitmap per group.  That said,
> with flex_bg this is mostly meaningless, since the bitmaps do not have to
> be located in the group, and a flex group is the same as a virtual group
> that is {flex_bg_factor} times as large.

Of course there is only a single pointer because there is only a single 
bitmap.  What does this have to do with limiting the block count to 8 * 
blocksize?

>> 3)  Why does 64bit disable the resize inode?
>
> Because the on-disk format of the resize inode is only suitable for 32-bit
> filesystems (it is an indirect-block mapped file and cannot reserve blocks
> beyond 2^32).  The "future" way to resize filesystems is using the META_BG
> feature, but the ability to use it has not been integrated into the kernel
> or e2fsprogs yet.

Ahh, right... no indirect blocks.  Couldn't and shouldn't the resize 
inode just use extents instead?  Also I thought that META_BG was an idea 
that eventually became FLEX_BG and has been dropped?



* Re: 64bit filesystem questions
  2011-06-10 15:19   ` Phillip Susi
@ 2011-06-10 16:19     ` Andreas Dilger
  2011-06-10 17:14       ` Phillip Susi
  0 siblings, 1 reply; 9+ messages in thread
From: Andreas Dilger @ 2011-06-10 16:19 UTC
  To: Phillip Susi; +Cc: linux-ext4

On 2011-06-10, at 9:19 AM, Phillip Susi wrote:
> On 6/9/2011 8:08 PM, Andreas Dilger wrote:
>> There is only a single block pointer for each bitmap per group.  That said,
>> with flex_bg this is mostly meaningless, since the bitmaps do not have to
>> be located in the group, and a flex group is the same as a virtual group
>> that is {flex_bg_factor} times as large.
> 
> Of course there is only a single pointer because there is only a single bitmap.  What does this have to do with limiting the block count to 8 * blocksize?

I think in the presence of flex_bg this issue is moot.

>>> 3)  Why does 64bit disable the resize inode?
>> 
>> Because the on-disk format of the resize inode is only suitable for 32-bit
>> filesystems (it is an indirect-block mapped file and cannot reserve blocks
>> beyond 2^32).  The "future" way to resize filesystems is using the META_BG
>> feature, but the ability to use it has not been integrated into the kernel
>> or e2fsprogs yet.
> 
> Ahh, right... no indirect blocks.  Couldn't and shouldn't the resize inode just use extents instead?  Also I thought that META_BG was an idea that eventually became FLEX_BG and has been dropped?

META_BG also reduces the number of group descriptor blocks needed for the table.

Normally (without META_BG) each group descriptor table has a full copy of all
group descriptor blocks, and it has to be allocated contiguously on disk.
With META_BG, there are only 2 backups of each GDT block, and it is spread
around the filesystem, so there is not a need to allocate huge chunks of space.

Once a filesystem gets up to 256TB, the GDT will be larger than a whole
group (more than 128MB per GDT), and it will not be possible to create a
larger filesystem without META_BG.
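
The 256TB figure can be checked with a short calculation.  This is only a
sketch under stated assumptions: 4 KiB blocks, 8 x blocksize blocks per
group, and 64-byte group descriptors (the descriptor size assumed here for
64bit filesystems); the function name is hypothetical:

```python
def gdt_bytes(fs_bytes: int, block_size: int = 4096, desc_size: int = 64) -> int:
    """Size of one full copy of the group descriptor table, in bytes."""
    blocks_per_group = block_size * 8            # one bitmap block per group
    group_bytes = blocks_per_group * block_size  # 128 MiB with 4 KiB blocks
    groups = -(-fs_bytes // group_bytes)         # ceiling division
    return groups * desc_size                    # assumed 64-byte descriptors


group_bytes = 4096 * 8 * 4096
for tib in (16, 64, 256, 512):
    size = gdt_bytes(tib * 2**40)
    print(f"{tib:4d} TiB fs: GDT = {size // 2**20:4d} MiB "
          f"({size / group_bytes:.2f} groups)")
```

Under these assumptions one copy of the GDT reaches the size of a whole
128 MiB group at 256 TiB, so anything larger no longer fits in one group.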


Cheers, Andreas

* Re: 64bit filesystem questions
  2011-06-10 16:19     ` Andreas Dilger
@ 2011-06-10 17:14       ` Phillip Susi
  2011-06-10 17:29         ` Andreas Dilger
  0 siblings, 1 reply; 9+ messages in thread
From: Phillip Susi @ 2011-06-10 17:14 UTC
  To: Andreas Dilger; +Cc: linux-ext4

On 6/10/2011 12:19 PM, Andreas Dilger wrote:
> I think in the presence of flex_bg this issue is moot.

What is the issue without flex_bg?



* Re: 64bit filesystem questions
  2011-06-10 17:14       ` Phillip Susi
@ 2011-06-10 17:29         ` Andreas Dilger
  2011-06-10 17:45           ` Phillip Susi
  0 siblings, 1 reply; 9+ messages in thread
From: Andreas Dilger @ 2011-06-10 17:29 UTC
  To: Phillip Susi; +Cc: linux-ext4

On 2011-06-10, at 11:14 AM, Phillip Susi wrote:
> On 6/10/2011 12:19 PM, Andreas Dilger wrote:
>> I think in the presence of flex_bg this issue is moot.
> 
> What is the issue without flex_bg?

No "issue" really, just that the block/inode bitmaps are spread all over
the filesystem.  The original discussion was about whether there could be
"larger bitmaps that addressed more than 32768 blocks", which is essentially
what the flex_bg feature provides.  With flex_bg the bitmaps for different
groups will be allocated adjacent to each other on disk, and allow addressing
more than 32768 blocks without any seeking.

On large filesystems without flex_bg, the bitmaps are scattered across the
disk, so a seek is needed to read each one.  Given that spinning disks have
stayed at about 100 seeks/sec for decades, it takes 10+ minutes just to read
all of the bitmaps.
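
A back-of-the-envelope version of that estimate, as a sketch only.  It
assumes 4 KiB blocks, one block bitmap and one inode bitmap per group, one
seek per bitmap block, and the 100 seeks/sec figure above; the function name
is hypothetical:

```python
def bitmap_read_seconds(fs_bytes: int, block_size: int = 4096,
                        seeks_per_sec: int = 100,
                        bitmaps_per_group: int = 2) -> float:
    """Time to read every bitmap if each one costs a separate seek."""
    group_bytes = block_size * 8 * block_size   # 128 MiB with 4 KiB blocks
    groups = fs_bytes // group_bytes
    return groups * bitmaps_per_group / seeks_per_sec


for tib in (2, 4, 16):
    secs = bitmap_read_seconds(tib * 2**40)
    print(f"{tib:2d} TiB: {secs / 60:5.1f} minutes of seeking")
```

Under these assumptions a 2 TiB filesystem spends about 5.5 minutes seeking
and a 16 TiB one about 44 minutes, which brackets the 10+ minute figure.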

On my 2TB 5400 RPM SATA drive, e2fsck time went from ~20 minutes to ~3 minutes
by copying the data to a new ext4 filesystem with flex_bg + extents.  For a
fair comparison, I then reformatted the original (identical) disk without
flex_bg or extents and copied the data back, so that both filesystems were
freshly formatted and the old one's fragmentation did not skew the result.

Cheers, Andreas


* Re: 64bit filesystem questions
  2011-06-10 17:29         ` Andreas Dilger
@ 2011-06-10 17:45           ` Phillip Susi
  2011-06-10 20:37             ` Andreas Dilger
  0 siblings, 1 reply; 9+ messages in thread
From: Phillip Susi @ 2011-06-10 17:45 UTC
  To: Andreas Dilger; +Cc: linux-ext4

On 6/10/2011 1:29 PM, Andreas Dilger wrote:
> On 2011-06-10, at 11:14 AM, Phillip Susi wrote:
>> On 6/10/2011 12:19 PM, Andreas Dilger wrote:
>>> I think in the presence of flex_bg this issue is moot.
>>
>> What is the issue without flex_bg?
>
> No "issue" really, just that the block/inode bitmaps are spread all over
> the filesystem.  The original discussion was about whether there could be
> "larger bitmaps that addressed more than 32768 blocks", which is essentially
> what the flex_bg feature provides.  With flex_bg the bitmaps for different
> groups will be allocated adjacent to each other on disk, and allow addressing
> more than 32768 blocks without any seeking.
>
> On large filesystems without flex_bg, the distribution of the bitmaps without
> flex_bg means that a seek is needed to read each one, and given that spinning
> disks have stayed at about 100 seeks/sec for decades it means 10+ minutes just
> to read all of the bitmaps.
>
> On my 2TB 5400 RPM SATA drive, e2fsck time went from ~20 minutes to ~3 minutes
> by copying the data to a new ext4 filesystem with flex_bg + extents.  For a
> fair comparison, I then reformatted the original (identical) disk without
> flex_bg or extents and copied the data back, so that there wasn't any unfair
> comparison between the newly-formatted filesystem and the old fragmented one.

I know what flex_bg is; what I don't understand is what it has to do 
with the limit on the size of a block group.  Whether the block bitmaps 
are stored in their native block group, or clustered up with flex_bg 
does not seem to have anything to do with whether or not the size of the 
bitmap can exceed 32k blocks.


* Re: 64bit filesystem questions
  2011-06-10 17:45           ` Phillip Susi
@ 2011-06-10 20:37             ` Andreas Dilger
  2011-06-10 21:21               ` Phillip Susi
  0 siblings, 1 reply; 9+ messages in thread
From: Andreas Dilger @ 2011-06-10 20:37 UTC
  To: Phillip Susi; +Cc: linux-ext4

On 2011-06-10, at 11:45 AM, Phillip Susi wrote:
> On 6/10/2011 1:29 PM, Andreas Dilger wrote:
>> On 2011-06-10, at 11:14 AM, Phillip Susi wrote:
>>> On 6/10/2011 12:19 PM, Andreas Dilger wrote:
>>>> I think in the presence of flex_bg this issue is moot.
>>> 
>>> What is the issue without flex_bg?
>> 
>> No "issue" really, just that the block/inode bitmaps are spread all over
>> the filesystem.  The original discussion was about whether there could be
>> "larger bitmaps that addressed more than 32768 blocks", which is essentially
>> what the flex_bg feature provides.  With flex_bg the bitmaps for different
>> groups will be allocated adjacent to each other on disk, and allow addressing
>> more than 32768 blocks without any seeking.
>> 
>> On large filesystems without flex_bg, the distribution of the bitmaps without
>> flex_bg means that a seek is needed to read each one, and given that spinning
>> disks have stayed at about 100 seeks/sec for decades it means 10+ minutes just
>> to read all of the bitmaps.
>> 
>> On my 2TB 5400 RPM SATA drive, e2fsck time went from ~20 minutes to ~3 minutes
>> by copying the data to a new ext4 filesystem with flex_bg + extents.  For a
>> fair comparison, I then reformatted the original (identical) disk without
>> flex_bg or extents and copied the data back, so that there wasn't any unfair
>> comparison between the newly-formatted filesystem and the old fragmented one.
> 
> I know what flex_bg is; what I don't understand is what it has to do with the limit on the size of a block group.  Whether the block bitmaps are stored in their native block group, or clustered up with flex_bg does not seem to have anything to do with whether or not the size of the bitmap can exceed 32k blocks.

I hope it is obvious that a single bitmap block can only address the number
of bits (==blocks) that fit within that block.  To address more blocks the
block bitmap needs to be larger than a single block in size.  One possible
way to do this (discussed early on for ext4) would be to have N block
bitmap blocks per group.  That raises issues of how to address those blocks
for each "block group", and what the meaning of a "block group" really is.

The other (very similar, but not identical) approach is to essentially merge
N adjacent "block groups" into a single "large block group" that has N block
bitmaps, and addresses N * blocksize * 8 blocks per "large block group".
In this case "N" is the flex_bg factor (constrained to 2^n), and the "large
block group" is called a "flex group".  It achieves exactly the same thing
as having N block bitmaps per group, with the only difference that there are
N group descriptors that point to the bitmaps, and they no longer have to be
located within the groups themselves.
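
The "large block group" arithmetic can be sketched as follows.  The sketch
assumes 4 KiB blocks and a first data block of 0; the helper names are
hypothetical, and the flex factor of 4 is only an example value:

```python
BLOCK_SIZE = 4096                   # assumed block size in bytes
BLOCKS_PER_GROUP = BLOCK_SIZE * 8   # one bitmap block => 32768 blocks


def group_of(block: int) -> int:
    """Ordinary block group that contains the given block number."""
    return block // BLOCKS_PER_GROUP


def flex_group_of(group: int, flex_factor: int) -> int:
    """Flex group that contains the given block group."""
    return group // flex_factor


def blocks_per_flex_group(flex_factor: int) -> int:
    """N * blocksize * 8 blocks covered by one flex group."""
    return flex_factor * BLOCKS_PER_GROUP


flex_factor = 4                     # example value, must be a power of two
block = 100_000
grp = group_of(block)
print(f"block {block} -> group {grp} -> "
      f"flex group {flex_group_of(grp, flex_factor)}")
print("blocks per flex group:", blocks_per_flex_group(flex_factor))
```

With a flex factor of 4, the four bitmaps together cover 131072 blocks.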

There is virtually no difference between "larger bitmap" and "flex_bg":

"b"=block bitmap, "i"=inode bitmap, "."=data block

Non-flex_bg configuration for 4 groups * 32768 blocks:

bi...{32760}...bi...{32760}...bi...{32760}...bi...{32760}...

Each block bitmap addresses 32768 blocks in total (including itself).

flex_bg configuration for the same 4 groups * 32768 blocks:

bbbbiiii.....................{131020}.......................

If you treat the four "bbbb" blocks as a single block bitmap, and "iiii"
as a single inode bitmap, and the contiguous range of free blocks as a
single group, it is exactly what you are asking for - a larger bitmap.

Cheers, Andreas


* Re: 64bit filesystem questions
  2011-06-10 20:37             ` Andreas Dilger
@ 2011-06-10 21:21               ` Phillip Susi
  0 siblings, 0 replies; 9+ messages in thread
From: Phillip Susi @ 2011-06-10 21:21 UTC
  To: Andreas Dilger; +Cc: linux-ext4

On 6/10/2011 4:37 PM, Andreas Dilger wrote:
> I hope it is obvious that a single bitmap block can only address the number
> of bits (==blocks) that fit within that block.  To address more blocks the
> block bitmap needs to be larger than a single block in size.  One possible
> way to do this (discussed early on for ext4) would be to have N block
> bitmap blocks per group.  That raises issues of how to address those blocks
> for each "block group", and what the meaning of a "block group" really is.

I thought it was obvious that if there were more blocks, then you would 
have more than one bitmap block and it would just follow the first.

> The other (very similar, but not identical) approach is to essentially merge
> N adjacent "block groups" into a single "large block group" that has N block
> bitmaps, and addresses N * blocksize * 8 blocks per "large block group".
> In this case "N" is the flex_bg factor (constrained to 2^n), and the "large
> block group" is called a "flex group".  It achieves exactly the same thing
> as having N block bitmaps per group, with the only difference that there are
> N group descriptors that point to the bitmaps, and they no longer have to be
> located within the groups themselves.

The other side effect is that you have N inode tables and N inode 
bitmaps.  A typical fs these days seems to have 8192 inodes in each bg, 
which gives far more inodes than needed, and only uses 1/4 of the inode 
bitmap block.
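
The 1/4 figure follows directly from the bitmap size.  A quick check,
assuming 4 KiB blocks (the function name is made up for illustration):

```python
from fractions import Fraction


def inode_bitmap_fill(inodes_per_group: int, block_size: int = 4096) -> Fraction:
    """Fraction of a one-block inode bitmap that is actually used."""
    bits_in_bitmap_block = block_size * 8
    return Fraction(inodes_per_group, bits_in_bitmap_block)


print(inode_bitmap_fill(8192))   # 8192 inodes per group, 4 KiB blocks
```

With 8192 inodes per group only 8192 of the 32768 bits in each inode bitmap
block are meaningful.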

Now that I've looked a bit more at the code, it seems the 32k block 
limit comes from the old ext2 block group descriptor only having a 16 
bit field for the free blocks count.  This was fixed in the ext4 bg 
descriptor, but it seems that is not actually used except on a 64bit fs.
It looks like a few more bits of code need to be cleaned up to allow for
more blocks per group when using 64bit.

> If you treat the four "bbbb" blocks as a single block bitmap, and "iiii"
> as a single inode bitmap, and the contiguous range of free blocks as a
> single group, it is exactly what you are asking for - a larger bitmap.

While each of those inode bitmaps may follow the previous, each one is 
typically only 1/4 used and the rest ignored.  It would be better to 
have only the single inode bitmap for a single, larger bg.

