ext4 64bit (disk >16TB) question

public inbox for linux-ext4@vger.kernel.org

* ext4 64bit (disk >16TB) question
@ 2008-07-14 19:50 Goswin von Brederlow
  2008-07-14 23:46 ` Theodore Tso
  2008-07-15 18:27 ` Jose R. Santos
  0 siblings, 2 replies; 16+ messages in thread
From: Goswin von Brederlow @ 2008-07-14 19:50 UTC (permalink / raw)
  To: linux-ext4

Hi,

we are using lustre on a cluster of servers and raid boxes. Currently
lustre is based on the ext3 code and has a limit of 8TiB for each
filesystem. For us that results in having to split a server's storage
into up to 4 chunks and run one fs on each, which I would rather avoid.
The solution for this would be to rebase the lustre patches to use
ext4 instead, which should also reduce the patch set considerably.
Lustre already patches a lot of ext4 features into the ext3 base.

But before I start rebasing lustre I thought I would first test out
plain ext4 so I know any bugs I find will be from my rebasing and not
already existing in ext4 itself. And there I ran into a big problem:
Current e2fsprogs (1.41) seems to be totally unable to handle the ext4 64BIT
feature, i.e. filesystems larger than 16TiB. mkfs.ext4 always
stops, saying the disk exceeds the 32bit block count. And looking at
the code I see a lot of blk_t (instead of blk64_t) and unsigned long
(instead of unsigned long long [or even better blk64_t]) usage.
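
As a rough illustration of where the 16TiB figure comes from, the
sketch below multiplies the number of addressable blocks by the block
size. The 4KiB block size is an assumption (the thread does not state
it), and max_fs_bytes() is a made-up helper, not an e2fsprogs function.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Largest filesystem size, in bytes, that can be addressed with
 * `bits`-wide block numbers and `block_size`-byte blocks.
 * Hypothetical helper for illustration only.
 */
static uint64_t max_fs_bytes(unsigned bits, uint64_t block_size)
{
    /* Keep the product inside 64 bits for the cases shown here. */
    assert(bits >= 1 && bits <= 32 && block_size <= 65536);
    return ((uint64_t)1 << bits) * block_size;
}
```

With 32-bit block numbers and 4KiB blocks this gives 2^44 bytes, i.e.
16TiB. Passing 31 bits gives 8TiB, which would match the ext3-based
limit mentioned above if one bit is lost to signedness; that
explanation is a guess, not something stated in this thread.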

I found ext4 64bit patches for e2fsprogs 1.39 that fix at least
mkfs. Does anyone know if there is an updated patch set for 1.41
anywhere? And when will that be added to e2fsprogs upstream?

MfG
        Goswin

* Re: ext4 64bit (disk >16TB) question
  2008-07-14 19:50 ext4 64bit (disk >16TB) question Goswin von Brederlow
@ 2008-07-14 23:46 ` Theodore Tso
  2008-07-15  5:42   ` Goswin von Brederlow
  2008-07-15 18:27 ` Jose R. Santos
  1 sibling, 1 reply; 16+ messages in thread
From: Theodore Tso @ 2008-07-14 23:46 UTC (permalink / raw)
  To: Goswin von Brederlow; +Cc: linux-ext4

On Mon, Jul 14, 2008 at 09:50:56PM +0200, Goswin von Brederlow wrote:
> Current e2fsprogs (1.41) seems to be totally unable to handle the ext4 64BIT
> feature, i.e. filesystems larger than 16TiB. mkfs.ext4 always
> stops, saying the disk exceeds the 32bit block count. And looking at
> the code I see a lot of blk_t (instead of blk64_t) and unsigned long
> (instead of unsigned long long [or even better blk64_t]) usage.
> 
> I found ext4 64bit patches for e2fsprogs 1.39 that fix at least
> mkfs. Does anyone know if there is an updated patch set for 1.41
> anywhere? And when will that be added to e2fsprogs upstream?

Yes, this is correct.  The 1.39 64-bit patches break the shared
library ABI, and also there were some long-term problems with having
super-large bitmaps taking huge amounts of memory without some kind of
run-length encoding or other compression technique.  I decided to
reject the 1.39 approach because it would have caused short- and
long-term maintenance issues.

At the moment 1.41 does not support > 32 bit block numbers.  The
priority was to get something which supported all of the other ext4
features out the door, since that would allow much better testing of
the ext4 code base.  We are now working on 64-bit support in
e2fsprogs, with mke2fs coming first, and the other tools coming later.
But yeah, good quality 64-bit e2fsprogs support is going to lag for a
bit.  Sorry, we're working as fast as we can, given the resources we
have.

Regards,

						- Ted

* Re: ext4 64bit (disk >16TB) question
  2008-07-14 23:46 ` Theodore Tso
@ 2008-07-15  5:42   ` Goswin von Brederlow
  2008-07-15 12:36     ` Theodore Tso
  2008-07-15 13:16     ` Ric Wheeler
  0 siblings, 2 replies; 16+ messages in thread
From: Goswin von Brederlow @ 2008-07-15  5:42 UTC (permalink / raw)
  To: linux-ext4

Theodore Tso <tytso@mit.edu> writes:

> On Mon, Jul 14, 2008 at 09:50:56PM +0200, Goswin von Brederlow wrote:
>> I found ext4 64bit patches for e2fsprogs 1.39 that fix at least
>> mkfs. Does anyone know if there is an updated patch set for 1.41
>> anywhere? And when will that be added to e2fsprogs upstream?
>
> Yes, this is correct.  The 1.39 64-bit patches break the shared
> library ABI, and also there were some long-term problems with having
> super-large bitmaps taking huge amounts of memory without some kind of
> run-length encoding or other compression technique.  I decided to
> reject the 1.39 approach because it would have caused short- and
> long-term maintenance issues.

Is that a problem for the kernel or for the user space? I noticed that
mke2fs 1.39 used over a gigabyte of memory to format a >16TiB disk. While
that is a lot, it is not really a problem here.

> At the moment 1.41 does not support > 32 bit block numbers.  The
> priority was to get something which supported all of the other ext4
> features out the door, since that would allow much better testing of
> the ext4 code base.  We are now working on 64-bit support in
> e2fsprogs, with mke2fs coming first, and the other tools coming later.
> But yeah, good quality 64-bit e2fsprogs support is going to lag for a
> bit.  Sorry, we're working as fast as we can, given the resources we
> have.

Will there be filesystem changes as well? The above mentioned
run-length encoding sounds a bit like a new bitmap format, or is that
only supposed to be the in-memory format in userspace?

What is the plan for adding 64-bit support to the shared lib now?
Will you introduce a do_foo64() function in parallel to do_foo() to
maintain abi compatibility? Will you add versioned symbols? Or will
there be an abi break at some point?

The reason I ask all this is because I'm willing to spend some time
patching and testing. A single >16TiB filesystem instead of multiple
smaller ones would be a great benefit for us.

MfG
        Goswin

* Re: ext4 64bit (disk >16TB) question
  2008-07-15  5:42   ` Goswin von Brederlow
@ 2008-07-15 12:36     ` Theodore Tso
  2008-07-15 17:00       ` Goswin von Brederlow
  2008-07-15 13:16     ` Ric Wheeler
  1 sibling, 1 reply; 16+ messages in thread
From: Theodore Tso @ 2008-07-15 12:36 UTC (permalink / raw)
  To: Goswin von Brederlow; +Cc: linux-ext4

On Tue, Jul 15, 2008 at 07:42:01AM +0200, Goswin von Brederlow wrote:
> Is that a problem for the kernel or for the user space? I noticed that
> mke2fs 1.39 used over a gigabyte of memory to format a >16TiB disk. While
> that is a lot, it is not really a problem here.

Userspace.  The kernel demand-loads bitmap blocks as needed, but
e2fsprogs keeps bitarrays in user memory.  The problem is e2fsck; it
needs in the worst case something like 5 different block bitmaps and
3 or 4 inode bitmaps.  (I don't remember the exact numbers, but it's
that order of magnitude.)  So if it's something like a gigabyte of
memory for mke2fs, it might be 6-7 gigs of memory for e2fsck.  If this
is before swap has been enabled, it might not work at all, and even
with swap, we're talking serious slowdown if e2fsck is constantly
paging to disk.
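
To put rough numbers on this, here is a minimal sketch of the memory a
flat one-bit-per-block bitmap needs. It assumes 4KiB blocks and ignores
inode bitmaps, whose size depends on the inode count; bitmap_bytes() is
a hypothetical helper, not part of e2fsprogs.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Bytes needed for an uncompressed in-memory bitmap that holds one
 * bit per filesystem block.  Hypothetical helper for illustration.
 */
static uint64_t bitmap_bytes(uint64_t fs_bytes, uint64_t block_size)
{
    assert(block_size > 0);
    uint64_t blocks = fs_bytes / block_size;
    return (blocks + 7) / 8;    /* round up to whole bytes */
}
```

Under those assumptions a 16TiB filesystem has 2^32 blocks, so one
block bitmap takes 512MiB and five of them take 2.5GiB before any
inode bitmaps are counted. That is the same order of magnitude as the
estimate above, not a measurement.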

> Will there be filesystem changes as well? The above mentioned
> run-length encoding sounds a bit like a new bitmap format or is that
> only supposed to be the in memory format in userspace?

No, it will only be a memory format in userspace.  And I anticipate
multiple backend storage formats for the bitmaps, depending on what
they will be used for.  For example, e2fsck uses one inode bitmap to
detect directory loops when following the parent '..' entry; this is a
super-sparse array, with at most N bits set in the entire array, where
N is the deepest directory in the filesystem.  Simply storing a sorted
list of bits that are "on" is the most efficient representation for
that particular bitmap.  Other bitmaps will be much better off stored
in memory using perhaps an extent of "on" bits in a red-black tree,
etc.  At least initially I will implement the "dumb and stupid" fixed
bitarray, but I need to make sure that we have the right dispatching to
support the rest.
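
One way such dispatching could look is sketched below: callers go
through a small table of function pointers, and each backend (a fixed
bitarray, or a sorted list of "on" bits for very sparse bitmaps)
supplies its own implementation. Every name here (bmap, bmap_ops,
bmap_new and so on) is hypothetical; this is not the libext2fs
interface, only an illustration of the idea.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

struct bmap;

/* Per-backend operations; callers never call these directly. */
struct bmap_ops {
    int  (*set)(struct bmap *b, uint64_t bit);   /* 0 on success, -1 on error */
    int  (*test)(const struct bmap *b, uint64_t bit);
    void (*destroy)(struct bmap *b);
};

struct bmap {
    const struct bmap_ops *ops;
    uint64_t nbits;
    unsigned char *bits;        /* bitarray backend */
    uint64_t *list;             /* sorted-list backend */
    size_t count, cap;
};

/* Backend 1: fixed bitarray, one bit per entry. */
static int ba_set(struct bmap *b, uint64_t bit)
{
    if (bit >= b->nbits) return -1;
    b->bits[bit >> 3] |= (unsigned char)(1u << (bit & 7));
    return 0;
}
static int ba_test(const struct bmap *b, uint64_t bit)
{
    return bit < b->nbits && ((b->bits[bit >> 3] >> (bit & 7)) & 1);
}
static void ba_destroy(struct bmap *b) { free(b->bits); free(b); }
static const struct bmap_ops ba_ops = { ba_set, ba_test, ba_destroy };

/* Backend 2: sorted list of set bits, for super-sparse bitmaps. */
static size_t sl_lower_bound(const struct bmap *b, uint64_t bit)
{
    size_t lo = 0, hi = b->count;
    while (lo < hi) {           /* binary search: first element >= bit */
        size_t mid = lo + (hi - lo) / 2;
        if (b->list[mid] < bit) lo = mid + 1; else hi = mid;
    }
    return lo;
}
static int sl_set(struct bmap *b, uint64_t bit)
{
    if (bit >= b->nbits) return -1;
    size_t i = sl_lower_bound(b, bit);
    if (i < b->count && b->list[i] == bit) return 0;   /* already set */
    if (b->count == b->cap) {
        size_t ncap = b->cap ? b->cap * 2 : 16;
        uint64_t *n = realloc(b->list, ncap * sizeof(*n));
        if (!n) return -1;
        b->list = n;
        b->cap = ncap;
    }
    memmove(&b->list[i + 1], &b->list[i], (b->count - i) * sizeof(*b->list));
    b->list[i] = bit;
    b->count++;
    return 0;
}
static int sl_test(const struct bmap *b, uint64_t bit)
{
    size_t i = sl_lower_bound(b, bit);
    return i < b->count && b->list[i] == bit;
}
static void sl_destroy(struct bmap *b) { free(b->list); free(b); }
static const struct bmap_ops sl_ops = { sl_set, sl_test, sl_destroy };

enum bmap_kind { BMAP_BITARRAY, BMAP_SORTED_LIST };

/* Pick a backend at creation time; returns NULL on allocation failure. */
static struct bmap *bmap_new(enum bmap_kind kind, uint64_t nbits)
{
    struct bmap *b = calloc(1, sizeof(*b));
    if (!b) return NULL;
    b->nbits = nbits;
    if (kind == BMAP_BITARRAY) {
        uint64_t bytes = nbits / 8 + 1;
        if (bytes > SIZE_MAX) { free(b); return NULL; }
        b->bits = calloc((size_t)bytes, 1);
        if (!b->bits) { free(b); return NULL; }
        b->ops = &ba_ops;
    } else {
        b->ops = &sl_ops;
    }
    return b;
}

/* Public entry points: the only place that touches the ops table. */
static int bmap_set(struct bmap *b, uint64_t bit) { assert(b); return b->ops->set(b, bit); }
static int bmap_test(const struct bmap *b, uint64_t bit) { assert(b); return b->ops->test(b, bit); }
static void bmap_free(struct bmap *b) { if (b) b->ops->destroy(b); }
```

A caller would create a sparse bitmap with bmap_new(BMAP_SORTED_LIST, n)
and a dense one with bmap_new(BMAP_BITARRAY, n), then use bmap_set()
and bmap_test() identically on both. An extent-based backend could be
added the same way without touching callers.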

> What is the plan of how to add 64-bit support to the shared lib now?
> Will you introduce a do_foo64() function in parallel to do_foo() to
> maintain abi compatibility? Will you add versioned symbols? Or will
> there be an abi break at some point?

There's a pretty good description of my plans here:

	http://thread.gmane.org/gmane.comp.file-systems.ext4/2845

So no versioned symbols; instead, new functions, where we go from
ext2fs_block_iterate2() to ext2fs_block_iterate3(), etc.  All new
interfaces that I have been adding have been 64-bit clean to begin
with.  So for example all of the extents code uses blk64_t.  The
io_manager has been switched over to support 64-bit block numbers,
etc.

> The reason I ask all this is because I'm willing to spend some time
> patching and testing. A single >16TiB filesystem instead of multiple
> smaller ones would be a great benefit for us.

Jose Santos has been working on some patches, and I've been working on
the 64-bit bitmap support (when I have time, which means it's been
sporadic).  My primary priority for ext4 has been on getting the last
major bits of the patches into mainline and getting e2fsprogs 1.41 out
the door so that basic testing, bug fixing, and stabilization could
begin.  We still have some bugs that we need to squash, such as the
summary statistics and/or checksums in the block group descriptors
getting corrupted.  Nothing so far that can't be fixed with e2fsck,
but getting ext4 stable is just *much* higher priority for me right
now.

That being said, if you want to join the ext4 development efforts,
please subscribe to the linux-ext4@vger.kernel.org mailing list
(standard majordomo subscription interface, like all of the kernel.org
lists).  The wiki at http://ext4.wiki.kernel.org has some good stuff,
but there's also stuff which is out of date there.  But stuff like the
ext4 irc channel is there, and the "getting started page" is
reasonably up to date.

Regards,

					- Ted

* Re: ext4 64bit (disk >16TB) question
  2008-07-15 12:36     ` Theodore Tso
@ 2008-07-15 17:00       ` Goswin von Brederlow
  2008-07-15 17:19         ` Theodore Tso
  0 siblings, 1 reply; 16+ messages in thread
From: Goswin von Brederlow @ 2008-07-15 17:00 UTC (permalink / raw)
  To: Theodore Tso; +Cc: Goswin von Brederlow, linux-ext4

Theodore Tso <tytso@mit.edu> writes:

> On Tue, Jul 15, 2008 at 07:42:01AM +0200, Goswin von Brederlow wrote:
>> Is that a problem for the kernel or for the user space? I noticed that
>> mke2fs 1.39 used over a gigabyte of memory to format a >16TiB disk. While
>> that is a lot, it is not really a problem here.
>
> Userspace.  The kernel demand-loads bitmap blocks as needed, but
> e2fsprogs keeps bitarrays in user memory.  The problem is e2fsck; it
> needs in the worst case something like 5 different block bitmaps and
> 3 or 4 inode bitmaps.  (I don't remember the exact numbers, but it's
> that order of magnitude.)  So if it's something like a gigabyte of
> memory for mke2fs, it might be 6-7 gigs of memory for e2fsck.  If this
> is before swap has been enabled, it might not work at all, and even
> with swap, we're talking serious slowdown if e2fsck is constantly
> paging to disk.

That problem I know. That is why I always make / small and then swap
can be enabled.

Normally I would suggest just mmapping the blocks from the disk. But
with a 32bit cpu and 6-7 gigs that won't work. But that is not a use
case for me anyway. Nobody buys 32bit systems here and especially not
with that much storage. 4-8 cores and 8-32GiB of RAM are quite normal and
they won't have a problem. So changing the in-memory maps to be demand
loaded or compressed wouldn't be a priority for me.

>> Will there be filesystem changes as well? The above mentioned
>> run-length encoding sounds a bit like a new bitmap format or is that
>> only supposed to be the in memory format in userspace?
>
> No, it will only be a memory format in userspace.  And I anticipate
> multiple backend storage formats for the bitmaps, depending on what
> they will be used for.  For example, e2fsck uses one inode bitmap to
> detect directory loops when following the parent '..' entry; this is a
> super-sparse array, with at most N bits set in the entire array, where
> N is the deepest directory in the filesystem.  Simply storing a sorted
> list of bits that are "on" is the most efficient representation for
> that particular bitmap.  Other bitmaps will be much better off stored
> in memory using perhaps an extent of "on" bits in a red-black tree,
> etc.  At least initially I will implement the "dumb and stupid" fixed
> bitarray, but I need to make sure that we have the right dispatching to
> support the rest.

Makes sense.

>> What is the plan of how to add 64-bit support to the shared lib now?
>> Will you introduce a do_foo64() function in parallel to do_foo() to
>> maintain abi compatibility? Will you add versioned symbols? Or will
>> there be an abi break at some point?
>
> There's a pretty good description of my plans here:
>
> 	http://thread.gmane.org/gmane.comp.file-systems.ext4/2845
>
> So no versioned symbols; instead, new functions, where we go from
> ext2fs_block_iterate2() to ext2fs_block_iterate3(), etc.  All new
> interfaces that I have been adding have been 64-bit clean to begin
> with.  So for example all of the extents code uses blk64_t.  The
> io_manager has been switched over to support 64-bit block numbers,
> etc.

The get_size() function (actual name is a bit longer) uses a blk_t *
to store the disk's size and returns EFBIG if the disk exceeds 2^32
blocks. So now you have three choices:

1) break abi:  get_size(blk64_t *size)
2) extend abi: get_size64(blk64_t *size)
3) versioned symbols: get_size_old(blk_t *size) + get_size_new(blk64_t *size),
   versioned to use the right one.

That function is pretty much the only thing I looked at so far because
that is where mkfs.ext4 stops with >16TiB.

> That being said, if you want to join the ext4 development efforts,
> please subscribe to the linux-ext4@vger.kernel.org mailing list
> (standard majordomo subscription interface, like all of the kernel.org
> lists).  The wiki at http://ext4.wiki.kernel.org has some good stuff,
> but there's also stuff which is out of date there.  But stuff like the
> ext4 irc channel is there, and the "getting started page" is
> reasonably up to date.

Already done.

MfG
        Goswin

* Re: ext4 64bit (disk >16TB) question
  2008-07-15 17:00       ` Goswin von Brederlow
@ 2008-07-15 17:19         ` Theodore Tso
  0 siblings, 0 replies; 16+ messages in thread
From: Theodore Tso @ 2008-07-15 17:19 UTC (permalink / raw)
  To: Goswin von Brederlow; +Cc: linux-ext4

On Tue, Jul 15, 2008 at 07:00:10PM +0200, Goswin von Brederlow wrote:
> The get_size() function (actual name is a bit longer) uses a blk_t *
> to store the disk's size and returns EFBIG if the disk exceeds 2^32
> blocks. So now you have three choices:
> 
> 1) break abi:  get_size(blk64_t *size)
> 2) extend abi: get_size64(blk64_t *size)
> 3) versioned symbols: get_size_old(blk_t *size) + get_size_new(blk64_t *size),
>    versioned to use the right one.

... and I'm choosing choice (2a):

    get_size2(blk64_t *size);

							- Ted

* Re: ext4 64bit (disk >16TB) question
  2008-07-15  5:42   ` Goswin von Brederlow
  2008-07-15 12:36     ` Theodore Tso
@ 2008-07-15 13:16     ` Ric Wheeler
  2008-07-15 14:01       ` Bernd Schubert
  1 sibling, 1 reply; 16+ messages in thread
From: Ric Wheeler @ 2008-07-15 13:16 UTC (permalink / raw)
  To: Goswin von Brederlow; +Cc: linux-ext4

Goswin von Brederlow wrote:
> The reason I ask all this is because I'm willing to spend some time
> patching and testing. A single >16TiB filesystem instead of multiple
> smaller ones would be a great benefit for us.
>
>   
Can you give us any details about your use case? Is it hundreds of very 
large files, or 100 million little ones?

Any interesting hardware in the mix on the storage or server side?

Thanks!

Ric


* Re: ext4 64bit (disk >16TB) question
  2008-07-15 13:16     ` Ric Wheeler
@ 2008-07-15 14:01       ` Bernd Schubert
  2008-07-15 14:08         ` Ric Wheeler
  0 siblings, 1 reply; 16+ messages in thread
From: Bernd Schubert @ 2008-07-15 14:01 UTC (permalink / raw)
  To: rwheeler; +Cc: Goswin von Brederlow, linux-ext4

On Tuesday 15 July 2008 15:16:33 Ric Wheeler wrote:
> Can you give us any details about your use case? Is it hundreds of very
> large files, or 100 million little ones?

Depends on our customers. Though lustre is rather slow for small files and we 
try to inform our customers about that. On the other hand, there are also no 
choices of cluster filesystem for small files.

>
> Any interesting hardware in the mix on the storage or server side?

What exactly do you want to know? Usually we have a server-pair and Infortrend 
Raid-units. Since lustre doesn't do any redundancy on its own, we usually 
also have a raid1, raid5 or raid6 of several raid units.

For ease of management and optimal performance, we need single partitions 
larger than 8TiB (raid1) or 16TiB (raid5 or raid6). And the present 8TiB 
limit strongly bites us.


Cheers,
Bernd

-- 
Bernd Schubert
Q-Leap Networks GmbH

* Re: ext4 64bit (disk >16TB) question
  2008-07-15 14:01       ` Bernd Schubert
@ 2008-07-15 14:08         ` Ric Wheeler
  2008-07-15 16:13           ` Goswin von Brederlow
  0 siblings, 1 reply; 16+ messages in thread
From: Ric Wheeler @ 2008-07-15 14:08 UTC (permalink / raw)
  To: Bernd Schubert; +Cc: Goswin von Brederlow, linux-ext4

Bernd Schubert wrote:
> On Tuesday 15 July 2008 15:16:33 Ric Wheeler wrote:
>> Can you give us any details about your use case? Is it hundreds of very
>> large files, or 100 million little ones?
>>     
>
> Depends on our customers. Though lustre is rather slow for small files and we 
> try to inform our customers about that. On the other hand, there are also no 
> choices of cluster filesystem for small files.
>   

Thanks - so this is not an internal application, but hosting for various 
workloads? We have different scalability issues depending on the nature 
and mix of file sizes, etc.

>   
>> Any interesting hardware in the mix on the storage or server side?
>>     
>
> What exactly do you want to know? Usually we have a server-pair and Infortrend 
> Raid-units. Since lustre doesn't do any redundancy on its own, we usually 
> also have a raid1, raid5 or raid6 of several raid units.
>   

One thing that we have been working on/thinking about is how best to 
automatically self-tune a file system to the storage. Today, XFS is 
probably the best normal Linux file system at figuring out raid stripe 
size, etc. Getting this enhanced in ext4 could lead to a significant 
performance win for users who are not masters of performance tuning, etc.

How long would you wait for something like fsck to run to completion 
before you would need to go to backup tapes? 6 hours? 1 day? 1 week ;-) ?
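
As a rough sketch of what such tuning has to compute, the two numbers
a file system usually wants from a striped array are the stride (one
disk's chunk, in file system blocks) and the stripe width (the stride
times the number of data-bearing disks). The helper name and the
example geometry below are assumptions for illustration, not values
from this thread.

```c
#include <assert.h>

struct raid_geom {
    unsigned stride;        /* blocks per chunk on one disk */
    unsigned stripe_width;  /* blocks per full stripe across data disks */
};

/*
 * chunk_bytes: per-disk chunk size of the array
 * block_bytes: file system block size
 * data_disks:  disks holding data, i.e. total disks minus parity disks
 * Hypothetical helper for illustration only.
 */
static struct raid_geom raid_geometry(unsigned chunk_bytes,
                                      unsigned block_bytes,
                                      unsigned data_disks)
{
    struct raid_geom g;

    assert(block_bytes > 0 && chunk_bytes % block_bytes == 0);
    g.stride = chunk_bytes / block_bytes;
    g.stripe_width = g.stride * data_disks;
    return g;
}
```

For example, an 8-disk raid6 has 6 data disks, so with 64KiB chunks
and 4KiB blocks this yields a stride of 16 blocks and a stripe width
of 96 blocks.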

> For ease of management and optimal performance, we need single partitions 
> larger than 8TiB (raid1) or 16TiB (raid5 or raid6). And the present 8TiB 
> limit strongly bites us.
>
>
> Cheers,
> Bernd
>   

Makes sense, thanks for the information!

Regards,

Ric



* Re: ext4 64bit (disk >16TB) question
  2008-07-15 14:08         ` Ric Wheeler
@ 2008-07-15 16:13           ` Goswin von Brederlow
  0 siblings, 0 replies; 16+ messages in thread
From: Goswin von Brederlow @ 2008-07-15 16:13 UTC (permalink / raw)
  To: rwheeler; +Cc: linux-ext4

Ric Wheeler <rwheeler@redhat.com> writes:

> How long would you wait for something like fsck to run to completion
> before you would need to go to back up tapes? 6 hours? 1 day? 1 week
> ;-) ?

Backup? What are backups? :))

A hardware raid6 resync takes about 16h. A software raid6 (over 6
hardware raid6) resync takes 1-2 days.

With lustre the fsck has to be done on the MDT (metadata target) to
build a database file and on all the OSTs (object storage targets). So
there is some parallelization in the system.

But each OST, if we get ext4 64bit working, would be 28TB. I would
assume days for an fsck run. Weeks would not be good and less than a
day is totally unrealistic.

But a few days for an fsck of a 400-800TB filesystem isn't so
bad. Reading that amount from backup will take ages too, if you even
have one.

MfG
        Goswin

* Re: ext4 64bit (disk >16TB) question
  2008-07-14 19:50 ext4 64bit (disk >16TB) question Goswin von Brederlow
  2008-07-14 23:46 ` Theodore Tso
@ 2008-07-15 18:27 ` Jose R. Santos
  2008-07-15 20:12   ` Andreas Dilger
  1 sibling, 1 reply; 16+ messages in thread
From: Jose R. Santos @ 2008-07-15 18:27 UTC (permalink / raw)
  To: Goswin von Brederlow; +Cc: linux-ext4

On Mon, 14 Jul 2008 21:50:56 +0200
Goswin von Brederlow <goswin-v-b@web.de> wrote:

> I found ext4 64bit patches for e2fsprogs 1.39 that fix at least
> mkfs. Does anyone know if there is an updated patch set for 1.41
> anywhere? And when will that be added to e2fsprogs upstream?

Hi Goswin,

I've recently submitted a set of patches that covers most of the API
changes needed to support >16TB file systems (missing Ted's bitmap
support of course).  Once the bitmap support is included, it _SHOULD_
be relatively painless to add mke2fs support with this series of patches.

Stay tuned.
 

-JRS

* Re: ext4 64bit (disk >16TB) question
  2008-07-15 18:27 ` Jose R. Santos
@ 2008-07-15 20:12   ` Andreas Dilger
  2008-07-15 20:15     ` Ric Wheeler
  2008-07-15 21:20     ` Jose R. Santos
  0 siblings, 2 replies; 16+ messages in thread
From: Andreas Dilger @ 2008-07-15 20:12 UTC (permalink / raw)
  To: Jose R. Santos; +Cc: Goswin von Brederlow, linux-ext4

On Jul 15, 2008  13:27 -0500, Jose R. Santos wrote:
> I've recently submitted a set of patches that covers most of the API
> changes needed to support >16TB file systems (missing Ted bitmap
> support of course).  Once the bitmap support is included, it _SHOULD_
> be relatively painless to add mke2fs support with this series of patches.

Jose,
while waiting for the "efficient bitmap" support, how hard would it be
to implement "inefficient bitmaps" that just malloc some GB of memory
if needed?  This would at least allow people with huge devices to test
mke2fs/ext4/e2fsck in the meantime.
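
A rough size estimate for such a flat in-memory bitmap, as a minimal
sketch. It assumes 4KiB blocks and one bit per filesystem block; the
function name is made up for illustration and is not an e2fsprogs API.

```python
TIB = 1 << 40
MIB = 1 << 20


def flat_bitmap_bytes(fs_bytes: int, block_size: int = 4096) -> int:
    """Bytes needed for one bitmap holding one bit per filesystem block."""
    blocks = -(-fs_bytes // block_size)  # ceiling division
    return -(-blocks // 8)               # 8 blocks tracked per byte


for size_tib in (16, 32, 64):
    mib = flat_bitmap_bytes(size_tib * TIB) / MIB
    print(f"{size_tib} TiB filesystem: {mib:.0f} MiB per bitmap")
```

Under these assumptions one bitmap costs 512MiB at 16TiB and 1GiB at
32TiB; a tool that holds several bitmaps at once needs a multiple of that.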

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.



* Re: ext4 64bit (disk >16TB) question
  2008-07-15 20:12   ` Andreas Dilger
@ 2008-07-15 20:15     ` Ric Wheeler
  2008-07-15 21:03       ` Goswin von Brederlow
  2008-07-15 21:20     ` Jose R. Santos
  1 sibling, 1 reply; 16+ messages in thread
From: Ric Wheeler @ 2008-07-15 20:15 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: Jose R. Santos, Goswin von Brederlow, linux-ext4

Andreas Dilger wrote:
> On Jul 15, 2008  13:27 -0500, Jose R. Santos wrote:
>   
>> On Mon, 14 Jul 2008 21:50:56 +0200
>> Goswin von Brederlow <goswin-v-b@web.de> wrote:
>>     
>>> we are using lustre on a cluster of servers and raid boxes. Currently
>>> lustre is based on the ext3 code and has a limit of 8TiB for each
>>> filesystem. For us that results on having to split a servers storage
>>> into up to 4 chunks and run one fs on each which I would rather avoid.
>>> The solution for this would be to rebase the lustre patches to use
>>> ext4 instead, which should also reduce the patch set considerably.
>>> Lustre already patches a lot of ext4 features into the ext3 base.
>>>
>>>
>>> But before I start rebasing lustre I though I would first test out
>>> plain ext4 so I know any bugs I find will be from my rebasing and not
>>> already existing in ext4 itself. And there I run into a big problem:
>>> Current e2fsprogs (1.41) seem to be totaly unable to handle the ext4 64BIT
>>> feature, i.e. filesystems larger than 16TiB. The mkfs.ext4 always
>>> stops saying the disk exceeds the 32bit block count. And looking at
>>> the code I see a lot of blk_t (instead of blk64_t) and unsigned long
>>> (instead of unsigned long long [or even better blk64_t]) usage.
>>>
>>> I found ext4 64bit patches for e2fsprogs 1.39 that fix at least
>>> mkfs. Does anyone know if there is an updated patch set for 1.41
>>> anywhere? And when will that be added to e2fsprogs upstream?
>>>       
>> I've recently submitted a set of patches that covers most of the API
>> changes needed to support >16TB file systems (missing Ted bitmap
>> support of course).  Once the bitmap support is included, it _SHOULD_
>> be relatively painless to add mke2fs support with this series of patches.
>>     
>
> Jose,
> while waiting for the "efficient bitmap" support, how hard would it be
> to implement "inefficient bitmaps" that just malloc some GB of memory
> if needed?  This would at least allow people with huge devices to test
> mke2fs/ext4/e2fsck in the meantime.
>
> Cheers, Andreas
> --
> Andreas Dilger
> Sr. Staff Engineer, Lustre Group
> Sun Microsystems of Canada, Inc.
>
>   

I think that would be very useful - how much DRAM would we need for a 
16TB file system ;-) ?

ric



* Re: ext4 64bit (disk >16TB) question
  2008-07-15 20:15     ` Ric Wheeler
@ 2008-07-15 21:03       ` Goswin von Brederlow
  0 siblings, 0 replies; 16+ messages in thread
From: Goswin von Brederlow @ 2008-07-15 21:03 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Andreas Dilger, Jose R. Santos, Goswin von Brederlow, linux-ext4

Ric Wheeler <ricwheeler@gmail.com> writes:

> Andreas Dilger wrote:
>> Jose,
>> while waiting for the "efficient bitmap" support, how hard would it be
>> to implement "inefficient bitmaps" that just malloc some GB of memory
>> if needed?  This would at least allow people with huge devices to test
>> mke2fs/ext4/e2fsck in the meantime.
>>
>> Cheers, Andreas
>> --
>> Andreas Dilger
>> Sr. Staff Engineer, Lustre Group
>> Sun Microsystems of Canada, Inc.
>>
>>
>
> I think that would be very useful - how much DRAM would we need for a
> 16TB file system ;-) ?
>
> ric

The patched 1.39 e2fsprogs managed to format a 16TiB disk under kvm with
1GiB ram and 128k swap. Formatting a 32TiB disk uses nearly 1GiB ram for
mkfs alone and eventually managed to deadlock the I/O layer in kvm with
1.5GB ram and 128k swap. (Something I'm sure is kvm's fault. :)

But fsck is supposed to need several times more memory (see the other
mails in this thread). So having 4-16GiB ram is probably recommended
for anyone thinking about testing.

I used the sparse_create script linked on one of the ext4 wiki pages
with a sparse loopback file and used mke2fs -i $((64*1024*1024)) to
speed things up. With that, a freshly formatted 16TiB ext4 uses somewhat
over 4GiB.
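
Why -i (bytes per inode) speeds things up can be estimated with a short
sketch. The 16384 comparison ratio and the 256-byte inode size are
assumptions chosen for illustration, and mke2fs rounds per block group,
so the real numbers will differ somewhat.

```python
TIB = 1 << 40
MIB = 1 << 20


def inode_count(fs_bytes: int, bytes_per_inode: int) -> int:
    """Approximate number of inodes mke2fs would create for a given -i."""
    return fs_bytes // bytes_per_inode


def inode_table_bytes(fs_bytes: int, bytes_per_inode: int,
                      inode_size: int = 256) -> int:
    """Approximate total size of the inode tables (inode_size is assumed)."""
    return inode_count(fs_bytes, bytes_per_inode) * inode_size


# Compare an assumed small ratio with the 64MiB ratio used above.
for ratio in (16384, 64 * 1024 * 1024):
    inodes = inode_count(16 * TIB, ratio)
    table_mib = inode_table_bytes(16 * TIB, ratio) / MIB
    print(f"-i {ratio}: {inodes} inodes, about {table_mib:.0f} MiB of inode tables")
```

With one inode per 64MiB there are far fewer inode table blocks to
write, which is what makes the format quick on a sparse file.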

MfG
        Goswin


* Re: ext4 64bit (disk >16TB) question
  2008-07-15 20:12   ` Andreas Dilger
  2008-07-15 20:15     ` Ric Wheeler
@ 2008-07-15 21:20     ` Jose R. Santos
  2008-07-16 10:10       ` Goswin von Brederlow
  1 sibling, 1 reply; 16+ messages in thread
From: Jose R. Santos @ 2008-07-15 21:20 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: Goswin von Brederlow, linux-ext4

On Tue, 15 Jul 2008 14:12:19 -0600
Andreas Dilger <adilger@sun.com> wrote:

> On Jul 15, 2008  13:27 -0500, Jose R. Santos wrote:
> > On Mon, 14 Jul 2008 21:50:56 +0200
> > Goswin von Brederlow <goswin-v-b@web.de> wrote:
> > > we are using lustre on a cluster of servers and raid boxes. Currently
> > > lustre is based on the ext3 code and has a limit of 8TiB for each
> > > filesystem. For us that results on having to split a servers storage
> > > into up to 4 chunks and run one fs on each which I would rather avoid.
> > > The solution for this would be to rebase the lustre patches to use
> > > ext4 instead, which should also reduce the patch set considerably.
> > > Lustre already patches a lot of ext4 features into the ext3 base.
> > > 
> > > 
> > > But before I start rebasing lustre I though I would first test out
> > > plain ext4 so I know any bugs I find will be from my rebasing and not
> > > already existing in ext4 itself. And there I run into a big problem:
> > > Current e2fsprogs (1.41) seem to be totaly unable to handle the ext4 64BIT
> > > feature, i.e. filesystems larger than 16TiB. The mkfs.ext4 always
> > > stops saying the disk exceeds the 32bit block count. And looking at
> > > the code I see a lot of blk_t (instead of blk64_t) and unsigned long
> > > (instead of unsigned long long [or even better blk64_t]) usage.
> > > 
> > > I found ext4 64bit patches for e2fsprogs 1.39 that fix at least
> > > mkfs. Does anyone know if there is an updated patch set for 1.41
> > > anywhere? And when will that be added to e2fsprogs upstream?
> > 
> > I've recently submitted a set of patches that covers most of the API
> > changes needed to support >16TB file systems (missing Ted bitmap
> > support of course).  Once the bitmap support is included, it _SHOULD_
> > be relatively painless to add mke2fs support with this series of patches.
> 
> Jose,
> while waiting for the "efficient bitmap" support, how hard would it be
> to implement "inefficient bitmaps" that just malloc some GB of memory
> if needed?  This would at least allow people with huge devices to test
> mke2fs/ext4/e2fsck in the meantime.

As Ted mentioned already, the "efficient bitmap" support can come
later, but the 64bit API calls need to be well designed to be able to
support different models.  I will see how difficult it would be to create
an ABI BREAKING patch for testing purposes, but coming up with an ABI
compatible one seems like too much work if it's going to be replaced
sometime in the near future.

It should be possible to test it with flexbg as well (I think), since
all I need to make sure of is that all bitmaps reside within the 32bit
block boundary.  I don't have a large disk to test on, so I'm playing with
device mapper to see how I can fake one.  Our lab network is making
things difficult though.
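
A back-of-the-envelope check of the "32bit block boundary" idea, as a
sketch only. It assumes 4KiB blocks, 8 * block_size blocks per group,
and one block bitmap plus one inode bitmap per group; the helper names
are hypothetical and nothing here is taken from the patches.

```python
TIB = 1 << 40
BLOCK_SIZE = 4096
BLOCKS_PER_GROUP = 8 * BLOCK_SIZE  # one bitmap block covers this many blocks


def max_fs_bytes_32bit(block_size: int = BLOCK_SIZE) -> int:
    """Largest filesystem addressable with 32-bit block numbers."""
    return (1 << 32) * block_size


def bitmap_blocks(fs_bytes: int) -> int:
    """Blocks occupied by block and inode bitmaps across all groups."""
    blocks = fs_bytes // BLOCK_SIZE
    groups = -(-blocks // BLOCKS_PER_GROUP)  # ceiling division
    return 2 * groups


needed = bitmap_blocks(32 * TIB)
print("32-bit limit in TiB:", max_fs_bytes_32bit() // TIB)
print("bitmap blocks for 32 TiB:", needed)
print("fits below the 32-bit boundary:", needed < (1 << 32))
```

On these assumptions the bitmaps of a 32TiB filesystem occupy 2GiB of
blocks, so packing them at the front of the device keeps every bitmap
block number within 32 bits.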

I'm sure that I will uncover a couple of bugs this way.  Like the fact
that I forgot to set the 64bit compatibility flag or large group
descriptors. :)

> Cheers, Andreas
> --
> Andreas Dilger
> Sr. Staff Engineer, Lustre Group
> Sun Microsystems of Canada, Inc.
> 

-JRS


* Re: ext4 64bit (disk >16TB) question
  2008-07-15 21:20     ` Jose R. Santos
@ 2008-07-16 10:10       ` Goswin von Brederlow
  0 siblings, 0 replies; 16+ messages in thread
From: Goswin von Brederlow @ 2008-07-16 10:10 UTC (permalink / raw)
  To: Jose R. Santos; +Cc: Andreas Dilger, Goswin von Brederlow, linux-ext4

"Jose R. Santos" <jrs@us.ibm.com> writes:

> block boundary.  Dont have large disk to test on so Im playing with
> device mapper to see how I can fake one.  Our lab network is making
> thing difficult though.

Download

http://www.bullopensource.org/ext4/files/sparse_create

and for 32TiB use

dd if=/dev/zero of=/somewhere/SPACE bs=1 count=1 seek=100000000000
losetup /dev/loop0 /somewhere/SPACE
./sparse_create ext4dev /dev/loop0 68719476736

If you have a real block device to spare use that directly.
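
The numbers in the commands above can be sanity-checked with a short
sketch. It assumes the last argument to sparse_create is a count of
512-byte sectors (the usual device-mapper unit); that is an assumption
about the script, not something verified here.

```python
TIB = 1 << 40
GIB = 1 << 30
SECTOR_SIZE = 512  # assumed unit of the sparse_create size argument


def sectors_to_bytes(sectors: int, sector_size: int = SECTOR_SIZE) -> int:
    """Convert a sector count to a size in bytes."""
    return sectors * sector_size


def dd_file_bytes(bs: int, count: int, seek: int) -> int:
    """Apparent file size after dd writes `count` blocks at offset `seek` blocks."""
    return (seek + count) * bs


device_bytes = sectors_to_bytes(68719476736)
backing_bytes = dd_file_bytes(bs=1, count=1, seek=100000000000)

print("virtual device size in TiB:", device_bytes / TIB)
print(f"sparse backing file size in GiB: {backing_bytes / GIB:.1f}")
```

So the virtual device is 32TiB while the sparse backing file has an
apparent size of only about 93GiB.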

MfG
        Goswin


Thread overview: 16+ messages
2008-07-14 19:50 ext4 64bit (disk >16TB) question Goswin von Brederlow
2008-07-14 23:46 ` Theodore Tso
2008-07-15  5:42   ` Goswin von Brederlow
2008-07-15 12:36     ` Theodore Tso
2008-07-15 17:00       ` Goswin von Brederlow
2008-07-15 17:19         ` Theodore Tso
2008-07-15 13:16     ` Ric Wheeler
2008-07-15 14:01       ` Bernd Schubert
2008-07-15 14:08         ` Ric Wheeler
2008-07-15 16:13           ` Goswin von Brederlow
2008-07-15 18:27 ` Jose R. Santos
2008-07-15 20:12   ` Andreas Dilger
2008-07-15 20:15     ` Ric Wheeler
2008-07-15 21:03       ` Goswin von Brederlow
2008-07-15 21:20     ` Jose R. Santos
2008-07-16 10:10       ` Goswin von Brederlow
