* ext4 64bit (disk >16TB) question
@ 2008-07-14 19:50 Goswin von Brederlow
2008-07-14 23:46 ` Theodore Tso
2008-07-15 18:27 ` Jose R. Santos
0 siblings, 2 replies; 16+ messages in thread
From: Goswin von Brederlow @ 2008-07-14 19:50 UTC (permalink / raw)
To: linux-ext4
Hi,
we are using lustre on a cluster of servers and raid boxes. Currently
lustre is based on the ext3 code and has a limit of 8TiB for each
filesystem. For us that results in having to split a server's storage
into up to 4 chunks and run one fs on each which I would rather avoid.
The solution for this would be to rebase the lustre patches to use
ext4 instead, which should also reduce the patch set considerably.
Lustre already patches a lot of ext4 features into the ext3 base.
But before I start rebasing lustre I thought I would first test out
plain ext4 so I know any bugs I find will be from my rebasing and not
already existing in ext4 itself. And there I run into a big problem:
Current e2fsprogs (1.41) seem to be totally unable to handle the ext4 64BIT
feature, i.e. filesystems larger than 16TiB. The mkfs.ext4 always
stops saying the disk exceeds the 32bit block count. And looking at
the code I see a lot of blk_t (instead of blk64_t) and unsigned long
(instead of unsigned long long [or even better blk64_t]) usage.
I found ext4 64bit patches for e2fsprogs 1.39 that fix at least
mkfs. Does anyone know if there is an updated patch set for 1.41
anywhere? And when will that be added to e2fsprogs upstream?
MfG
Goswin
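The 16TiB boundary in the message above follows from 32-bit block numbers if one assumes the common 4 KiB block size (the thread does not state the block size): 2^32 blocks * 4096 bytes = 16 TiB. A minimal sketch of that arithmetic follows; the typedefs and function names are illustrative stand-ins, not the e2fsprogs definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for the block-number types named in the thread. */
typedef uint32_t blk_t;    /* 32-bit block number */
typedef uint64_t blk64_t;  /* 64-bit block number */

/* Largest filesystem size in bytes that a 32-bit block count can describe. */
static uint64_t max_bytes_32bit(uint32_t block_size)
{
    assert(block_size != 0);
    return ((uint64_t)1 << 32) * block_size;
}

/* Returns 1 if the block count of the device no longer fits in a blk_t. */
static int needs_64bit(uint64_t dev_bytes, uint32_t block_size)
{
    assert(block_size != 0);
    blk64_t blocks = dev_bytes / block_size;
    return blocks > UINT32_MAX;
}
```

With 4096-byte blocks, max_bytes_32bit() gives 16 TiB, and a 28 TiB device makes needs_64bit() return 1.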
* Re: ext4 64bit (disk >16TB) question
2008-07-14 19:50 ext4 64bit (disk >16TB) question Goswin von Brederlow
@ 2008-07-14 23:46 ` Theodore Tso
2008-07-15 5:42 ` Goswin von Brederlow
2008-07-15 18:27 ` Jose R. Santos
1 sibling, 1 reply; 16+ messages in thread
From: Theodore Tso @ 2008-07-14 23:46 UTC (permalink / raw)
To: Goswin von Brederlow; +Cc: linux-ext4

On Mon, Jul 14, 2008 at 09:50:56PM +0200, Goswin von Brederlow wrote:
> we are using lustre on a cluster of servers and raid boxes. Currently
> lustre is based on the ext3 code and has a limit of 8TiB for each
> filesystem. For us that results in having to split a server's storage
> into up to 4 chunks and run one fs on each which I would rather avoid.
> The solution for this would be to rebase the lustre patches to use
> ext4 instead, which should also reduce the patch set considerably.
> Lustre already patches a lot of ext4 features into the ext3 base.
>
> But before I start rebasing lustre I thought I would first test out
> plain ext4 so I know any bugs I find will be from my rebasing and not
> already existing in ext4 itself. And there I run into a big problem:
> Current e2fsprogs (1.41) seem to be totally unable to handle the ext4 64BIT
> feature, i.e. filesystems larger than 16TiB. The mkfs.ext4 always
> stops saying the disk exceeds the 32bit block count. And looking at
> the code I see a lot of blk_t (instead of blk64_t) and unsigned long
> (instead of unsigned long long [or even better blk64_t]) usage.
>
> I found ext4 64bit patches for e2fsprogs 1.39 that fix at least
> mkfs. Does anyone know if there is an updated patch set for 1.41
> anywhere? And when will that be added to e2fsprogs upstream?

Yes, this is correct. The 1.39 64-bit patches break the shared
library ABI, and also there were some long-term problems with having
super-large bitmaps taking huge amounts of memory without some kind of
run-length encoding or other compression technique. I decided to
reject the 1.39 approach because it would have caused short- and
long-term maintenance issues.

At the moment 1.41 does not support > 32 bit block numbers. The
priority was to get something which supported all of the other ext4
features out the door, since that would allow much better testing of
the ext4 code base. We are now working on 64-bit support in
e2fsprogs, with mke2fs coming first, and the other tools coming later.
But yeah, good quality 64-bit e2fsprogs support is going to lag for a
bit. Sorry, we're working as fast as we can, given the resources we
have.

Regards,

- Ted
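To put a rough number on the "super-large bitmaps" concern: an uncompressed bitmap needs one bit per block, so its size grows linearly with the filesystem. The sketch below only does that arithmetic; the function names are made up for illustration, and the 4 KiB block size used in the note after it is an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes needed for a flat bitmap with one bit per block, rounded up. */
static uint64_t flat_bitmap_bytes(uint64_t nblocks)
{
    return nblocks / 8 + (nblocks % 8 != 0);
}

/* Total bytes when a tool keeps `count` such bitmaps in memory at once. */
static uint64_t total_bitmap_bytes(uint64_t nblocks, unsigned count)
{
    uint64_t one = flat_bitmap_bytes(nblocks);
    assert(count == 0 || one <= UINT64_MAX / count);  /* no overflow */
    return one * count;
}
```

For a 16 TiB filesystem with 4 KiB blocks (2^32 blocks), one flat block bitmap is 512 MiB, so a tool that holds several of them at once quickly needs multiple gigabytes.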
* Re: ext4 64bit (disk >16TB) question
2008-07-14 23:46 ` Theodore Tso
@ 2008-07-15 5:42 ` Goswin von Brederlow
2008-07-15 12:36 ` Theodore Tso
2008-07-15 13:16 ` Ric Wheeler
0 siblings, 2 replies; 16+ messages in thread
From: Goswin von Brederlow @ 2008-07-15 5:42 UTC (permalink / raw)
To: linux-ext4

Theodore Tso <tytso@mit.edu> writes:

> On Mon, Jul 14, 2008 at 09:50:56PM +0200, Goswin von Brederlow wrote:
>> I found ext4 64bit patches for e2fsprogs 1.39 that fix at least
>> mkfs. Does anyone know if there is an updated patch set for 1.41
>> anywhere? And when will that be added to e2fsprogs upstream?
>
> Yes, this is correct. The 1.39 64-bit patches break the shared
> library ABI, and also there were some long-term problems with having
> super-large bitmaps taking huge amounts of memory without some kind of
> run-length encoding or other compression technique. I decided to
> reject the 1.39 approach because it would have caused short- and
> long-term maintenance issues.

Is that a problem for the kernel or for the user space? I noticed that
mke2fs 1.39 used over a gigabyte memory to format a >16TiB disk. While
being a lot that is not really a problem here.

> At the moment 1.41 does not support > 32 bit block numbers. The
> priority was to get something which supported all of the other ext4
> features out the door, since that would allow much better testing of
> the ext4 code base. We are now working on 64-bit support in
> e2fsprogs, with mke2fs coming first, and the other tools coming later.
> But yeah, good quality 64-bit e2fsprogs support is going to lag for a
> bit. Sorry, we're working as fast as we can, given the resources we
> have.

Will there be filesystem changes as well? The above mentioned
run-length encoding sounds a bit like a new bitmap format or is that
only supposed to be the in memory format in userspace?

What is the plan of how to add 64-bit support to the shared lib now?
Will you introduce a do_foo64() function in parallel to do_foo() to
maintain abi compatibility? Will you add versioned symbols? Or will
there be an abi break at some point?

The reason I ask all this is because I'm willing to spend some time
patching and testing. A single >16TiB filesystem instead of multiple
smaller ones would be a great benefit for us.

> Regards,
>
> - Ted

MfG
Goswin
* Re: ext4 64bit (disk >16TB) question
2008-07-15 5:42 ` Goswin von Brederlow
@ 2008-07-15 12:36 ` Theodore Tso
2008-07-15 17:00 ` Goswin von Brederlow
2008-07-15 13:16 ` Ric Wheeler
1 sibling, 1 reply; 16+ messages in thread
From: Theodore Tso @ 2008-07-15 12:36 UTC (permalink / raw)
To: Goswin von Brederlow; +Cc: linux-ext4

On Tue, Jul 15, 2008 at 07:42:01AM +0200, Goswin von Brederlow wrote:
> Is that a problem for the kernel or for the user space? I noticed that
> mke2fs 1.39 used over a gigabyte memory to format a >16TiB disk. While
> being a lot that is not really a problem here.

Userspace. The kernel demand-loads bitmap blocks as needed, but
e2fsprogs keeps bitarrays in user memory. The problem is e2fsck; it
needs in the worst case something like 5 different block bitmaps and
3 or 4 inode bitmaps. (I don't remember the exact numbers, but it's
that order of magnitude.) So if it's something like a gigabyte of
memory for mke2fs, it might be 6-7 gigs of memory for e2fsck. If this
is before swap has been enabled, it might not work at all, and even
with swap, we're talking serious slowdown if e2fsck is constantly
paging to disk.

> Will there be filesystem changes as well? The above mentioned
> run-length encoding sounds a bit like a new bitmap format or is that
> only supposed to be the in memory format in userspace?

No, it will only be a memory format in userspace. And I anticipate
multiple backend storage formats for the bitmaps, depending on what
they will be used for. For example, e2fsck uses one inode bitmap to
detect directory loops when following the parent '..' entry; this is a
super-sparse array, with at most N bits set in the entire array, where
N is the deepest directory in the filesystem. Simply storing a sorted
list of bits that are "on" is the most efficient representation for
that particular bitmap. Other bitmaps will be much better off stored
in memory using perhaps an extent of "on" bits in a red-black tree,
etc. At least initially I will implement the "dumb and stupid" fixed
bitarray, but I need to make sure that we have the right dispatching to
support the rest.

> What is the plan of how to add 64-bit support to the shared lib now?
> Will you introduce a do_foo64() function in parallel to do_foo() to
> maintain abi compatibility? Will you add versioned symbols? Or will
> there be an abi break at some point?

There's a pretty good description of my plans here:

http://thread.gmane.org/gmane.comp.file-systems.ext4/2845

So no versioned symbols, new functions where we go from
ext2fs_block_iterate2() to ext2fs_block_iterate3(), etc. All new
interfaces that I have been adding have all been 64-bit clean to begin
with. So for example all of the extents code use blk64_t. The
io_manager has been switched over to support 64-bit block numbers,
etc.

> The reason I ask all this is because I'm willing to spend some time
> patching and testing. A single >16TiB filesystem instead of multiple
> smaller ones would be a great benefit for us.

Jose Santos has been working on some patches, and I've been working on
the 64-bit bitmap support (when I have time, which means it's been
sporadic). My primary priority for ext4 has been on getting last
major bits of the patches into mainline and getting e2fsprogs 1.41 out
the door so that basic testing, bug fixing, and stabilization could
begin. We still have some bugs that need to squash, such as the
summary statistics and/or checksums in the block group descriptors
getting corrupted. Nothing so far that can't be fixed with e2fsck,
but getting ext4 stable is just *much* higher priority for me right
now.

That being said, if you want to join the ext4 development efforts,
please subscribe to the linux-ext4@vger.kernel.org mailing list
(standard majordomo subscription interface, like all of the kernel.org
lists). The wiki at http://ext4.wiki.kernel.org has some good stuff,
but there's also stuff which is out of date there. But stuff like the
ext4 irc channel is there, and the "getting started page" is
reasonably up to date.

Regards,

- Ted
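The idea of several in-memory bitmap backends behind one dispatching layer can be sketched roughly as below. This is only an illustration of the approach described in the message above, not the code that went into e2fsprogs: every type and function name is hypothetical, and only two backends are shown (the fixed bitarray and the sorted list of "on" bits), with the red-black tree of extents left out.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical names throughout; this is not the libext2fs interface. */
struct bitmap;
struct bitmap_ops {
    int  (*set)(struct bitmap *bm, uint64_t bit);        /* 0 or -1 */
    int  (*test)(const struct bitmap *bm, uint64_t bit); /* 0 or 1 */
    void (*destroy)(struct bitmap *bm);
};
struct bitmap {
    const struct bitmap_ops *ops;
    uint64_t nbits;        /* size of the logical bit space */
    unsigned char *bits;   /* fixed bitarray backend only */
    uint64_t *list;        /* sorted-list backend only */
    size_t used, cap;      /* entries used / allocated in list */
};

/* Backend 1: "dumb and stupid" fixed bitarray, one bit per entry. */
static int ba_set(struct bitmap *bm, uint64_t bit)
{
    bm->bits[bit / 8] |= (unsigned char)(1u << (bit % 8));
    return 0;
}
static int ba_test(const struct bitmap *bm, uint64_t bit)
{
    return (bm->bits[bit / 8] >> (bit % 8)) & 1;
}
static void ba_destroy(struct bitmap *bm) { free(bm->bits); free(bm); }

/* Backend 2: sorted list of the bits that are on, for sparse bitmaps. */
static size_t sl_find(const struct bitmap *bm, uint64_t bit)
{
    size_t lo = 0, hi = bm->used;   /* binary search: first entry >= bit */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (bm->list[mid] < bit)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;
}
static int sl_set(struct bitmap *bm, uint64_t bit)
{
    size_t i = sl_find(bm, bit);
    if (i < bm->used && bm->list[i] == bit)
        return 0;                   /* already set */
    if (bm->used == bm->cap) {      /* grow the list */
        size_t ncap = bm->cap ? bm->cap * 2 : 16;
        uint64_t *n = realloc(bm->list, ncap * sizeof(*n));
        if (!n)
            return -1;
        bm->list = n;
        bm->cap = ncap;
    }
    memmove(&bm->list[i + 1], &bm->list[i],
            (bm->used - i) * sizeof(*bm->list));
    bm->list[i] = bit;
    bm->used++;
    return 0;
}
static int sl_test(const struct bitmap *bm, uint64_t bit)
{
    size_t i = sl_find(bm, bit);
    return i < bm->used && bm->list[i] == bit;
}
static void sl_destroy(struct bitmap *bm) { free(bm->list); free(bm); }

static const struct bitmap_ops ba_ops = { ba_set, ba_test, ba_destroy };
static const struct bitmap_ops sl_ops = { sl_set, sl_test, sl_destroy };

enum bitmap_kind { BITMAP_ARRAY, BITMAP_SORTED_LIST };

static struct bitmap *bitmap_new(enum bitmap_kind kind, uint64_t nbits)
{
    struct bitmap *bm = calloc(1, sizeof(*bm));
    if (!bm)
        return NULL;
    bm->nbits = nbits;
    bm->ops = (kind == BITMAP_ARRAY) ? &ba_ops : &sl_ops;
    if (kind == BITMAP_ARRAY) {
        bm->bits = calloc((size_t)(nbits / 8 + 1), 1);
        if (!bm->bits) {
            free(bm);
            return NULL;
        }
    }
    return bm;
}

/* Callers only use these wrappers and never see the backend. */
static int bitmap_set(struct bitmap *bm, uint64_t bit)
{
    assert(bit < bm->nbits);
    return bm->ops->set(bm, bit);
}
static int bitmap_test(const struct bitmap *bm, uint64_t bit)
{
    assert(bit < bm->nbits);
    return bm->ops->test(bm, bit);
}
static void bitmap_free(struct bitmap *bm) { bm->ops->destroy(bm); }
```

The point of the ops table is that a caller such as a directory-loop check could ask for BITMAP_SORTED_LIST over a 2^40-entry space and pay only for the handful of bits it sets, while other callers keep the plain bitarray, all through the same bitmap_set()/bitmap_test() calls.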
* Re: ext4 64bit (disk >16TB) question
2008-07-15 12:36 ` Theodore Tso
@ 2008-07-15 17:00 ` Goswin von Brederlow
2008-07-15 17:19 ` Theodore Tso
0 siblings, 1 reply; 16+ messages in thread
From: Goswin von Brederlow @ 2008-07-15 17:00 UTC (permalink / raw)
To: Theodore Tso; +Cc: Goswin von Brederlow, linux-ext4

Theodore Tso <tytso@mit.edu> writes:

> On Tue, Jul 15, 2008 at 07:42:01AM +0200, Goswin von Brederlow wrote:
>> Is that a problem for the kernel or for the user space? I noticed that
>> mke2fs 1.39 used over a gigabyte memory to format a >16TiB disk. While
>> being a lot that is not really a problem here.
>
> Userspace. The kernel demand-loads bitmap blocks as needed, but
> e2fsprogs keeps bitarrays in user memory. The problem is e2fsck; it
> needs in the worst case something like 5 different block bitmaps and
> 3 or 4 inode bitmaps. (I don't remember the exact numbers, but it's
> that order of magnitude.) So if it's something like a gigabyte of
> memory for mke2fs, it might be 6-7 gigs of memory for e2fsck. If this
> is before swap has been enabled, it might not work at all, and even
> with swap, we're talking serious slowdown if e2fsck is constantly
> paging to disk.

That problem I know. That is why I always make / small and then swap
can be enabled. Normally I would suggest just mmaping the blocks from the
disk. But with a 32bit cpu and 6-7 gigs that won't work.

But that is not a use case for me anyway. Nobody buys 32bit systems
here and especially not with that much storage. 4-8 cores and 8-32Gig
ram are quite normal and they won't have a problem. So fixing the in
memory maps to demand loading or compressed wouldn't be a priority for
me.

>> Will there be filesystem changes as well? The above mentioned
>> run-length encoding sounds a bit like a new bitmap format or is that
>> only supposed to be the in memory format in userspace?
>
> No, it will only be a memory format in userspace. And I anticipate
> multiple backend storage formats for the bitmaps, depending on what
> they will be used for. For example, e2fsck uses one inode bitmap to
> detect directory loops when following the parent '..' entry; this is a
> super-sparse array, with at most N bits set in the entire array, where
> N is the deepest directory in the filesystem. Simply storing a sorted
> list of bits that are "on" is the most efficient representation for
> that particular bitmap. Other bitmaps will be much better off stored
> in memory using perhaps an extent of "on" bits in a red-black tree,
> etc. At least initially I will implement the "dumb and stupid" fixed
> bitarray, but I need to make sure that we have the right dispatching to
> support the rest.

Makes sense.

>> What is the plan of how to add 64-bit support to the shared lib now?
>> Will you introduce a do_foo64() function in parallel to do_foo() to
>> maintain abi compatibility? Will you add versioned symbols? Or will
>> there be an abi break at some point?
>
> There's a pretty good description of my plans here:
>
> http://thread.gmane.org/gmane.comp.file-systems.ext4/2845
>
> So no versioned symbols, new functions where we go from
> ext2fs_block_iterate2() to ext2fs_block_iterate3(), etc. All new
> interfaces that I have been adding have all been 64-bit clean to begin
> with. So for example all of the extents code use blk64_t. The
> io_manager has been switched over to support 64-bit block numbers,
> etc.

The get_size() function (actual name is a bit longer) does use a blk_t
* to store the disk's size and returns EFBIG if the disk exceeds 2^32
blocks. So now you have three choices:

1) break abi: get_size(blk64_t *size)
2) extend abi: get_size64(blk64_t *size);
3) versioned symbols: get_size_old(blk_t *size) + get_size_new(blk64_t
*size) and versioned to use the right one.

That function is pretty much the only thing I looked at so far because
that is where mkfs.ext4 stops with >16TiB.

>> The reason I ask all this is because I'm willing to spend some time
>> patching and testing. A single >16TiB filesystem instead of multiple
>> smaller ones would be a great benefit for us.
>
> Jose Santos has been working on some patches, and I've been working on
> the 64-bit bitmap support (when I have time, which means it's been
> sporadic). My primary priority for ext4 has been on getting last
> major bits of the patches into mainline and getting e2fsprogs 1.41 out
> the door so that basic testing, bug fixing, and stabilization could
> begin. We still have some bugs that need to squash, such as the
> summary statistics and/or checksums in the block group descriptors
> getting corrupted. Nothing so far that can't be fixed with e2fsck,
> but getting ext4 stable is just *much* higher priority for me right
> now.
>
> That being said, if you want to join the ext4 development efforts,
> please subscribe to the linux-ext4@vger.kernel.org mailing list
> (standard majordomo subscription interface, like all of the kernel.org
> lists). The wiki at http://ext4.wiki.kernel.org has some good stuff,
> but there's also stuff which is out of date there. But stuff like the
> ext4 irc channel is there, and the "getting started page" is
> reasonably up to date.

Already done.

MfG
Goswin
* Re: ext4 64bit (disk >16TB) question
2008-07-15 17:00 ` Goswin von Brederlow
@ 2008-07-15 17:19 ` Theodore Tso
0 siblings, 0 replies; 16+ messages in thread
From: Theodore Tso @ 2008-07-15 17:19 UTC (permalink / raw)
To: Goswin von Brederlow; +Cc: linux-ext4

On Tue, Jul 15, 2008 at 07:00:10PM +0200, Goswin von Brederlow wrote:
> The get_size() function (actual name is a bit longer) does use a blk_t
> * to store the disk's size and returns EFBIG if the disk exceeds 2^32
> blocks. So now you have three choices:
>
> 1) break abi: get_size(blk64_t *size)
> 2) extend abi: get_size64(blk64_t *size);
> 3) versioned symbols: get_size_old(blk_t *size) + get_size_new(blk64_t
> *size) and versioned to use the right one.

... and I'm choosing choice (2a):

get_size2(blk64_t *size);

- Ted
* Re: ext4 64bit (disk >16TB) question
2008-07-15 5:42 ` Goswin von Brederlow
2008-07-15 12:36 ` Theodore Tso
@ 2008-07-15 13:16 ` Ric Wheeler
2008-07-15 14:01 ` Bernd Schubert
1 sibling, 1 reply; 16+ messages in thread
From: Ric Wheeler @ 2008-07-15 13:16 UTC (permalink / raw)
To: Goswin von Brederlow; +Cc: linux-ext4

Goswin von Brederlow wrote:
> Theodore Tso <tytso@mit.edu> writes:
>
>> On Mon, Jul 14, 2008 at 09:50:56PM +0200, Goswin von Brederlow wrote:
>>
>>> I found ext4 64bit patches for e2fsprogs 1.39 that fix at least
>>> mkfs. Does anyone know if there is an updated patch set for 1.41
>>> anywhere? And when will that be added to e2fsprogs upstream?
>>>
>> Yes, this is correct. The 1.39 64-bit patches break the shared
>> library ABI, and also there were some long-term problems with having
>> super-large bitmaps taking huge amounts of memory without some kind of
>> run-length encoding or other compression technique. I decided to
>> reject the 1.39 approach because it would have caused short- and
>> long-term maintenance issues.
>>
>
> Is that a problem for the kernel or for the user space? I noticed that
> mke2fs 1.39 used over a gigabyte memory to format a >16TiB disk. While
> being a lot that is not really a problem here.
>
>> At the moment 1.41 does not support > 32 bit block numbers. The
>> priority was to get something which supported all of the other ext4
>> features out the door, since that would allow much better testing of
>> the ext4 code base. We are now working on 64-bit support in
>> e2fsprogs, with mke2fs coming first, and the other tools coming later.
>> But yeah, good quality 64-bit e2fsprogs support is going to lag for a
>> bit. Sorry, we're working as fast as we can, given the resources we
>> have.
>>
>
> Will there be filesystem changes as well? The above mentioned
> run-length encoding sounds a bit like a new bitmap format or is that
> only supposed to be the in memory format in userspace?
>
> What is the plan of how to add 64-bit support to the shared lib now?
> Will you introduce a do_foo64() function in parallel to do_foo() to
> maintain abi compatibility? Will you add versioned symbols? Or will
> there be an abi break at some point?
>
> The reason I ask all this is because I'm willing to spend some time
> patching and testing. A single >16TiB filesystem instead of multiple
> smaller ones would be a great benefit for us.
>

Can you give us any details about your use case? Is it hundreds of very
large files, or 100 million little ones?

Any interesting hardware in the mix on the storage or server side?

Thanks!

Ric
* Re: ext4 64bit (disk >16TB) question
2008-07-15 13:16 ` Ric Wheeler
@ 2008-07-15 14:01 ` Bernd Schubert
2008-07-15 14:08 ` Ric Wheeler
0 siblings, 1 reply; 16+ messages in thread
From: Bernd Schubert @ 2008-07-15 14:01 UTC (permalink / raw)
To: rwheeler; +Cc: Goswin von Brederlow, linux-ext4

On Tuesday 15 July 2008 15:16:33 Ric Wheeler wrote:
> Goswin von Brederlow wrote:
> > Theodore Tso <tytso@mit.edu> writes:
> >> On Mon, Jul 14, 2008 at 09:50:56PM +0200, Goswin von Brederlow wrote:
> >>> I found ext4 64bit patches for e2fsprogs 1.39 that fix at least
> >>> mkfs. Does anyone know if there is an updated patch set for 1.41
> >>> anywhere? And when will that be added to e2fsprogs upstream?
> >>
> >> Yes, this is correct. The 1.39 64-bit patches break the shared
> >> library ABI, and also there were some long-term problems with having
> >> super-large bitmaps taking huge amounts of memory without some kind of
> >> run-length encoding or other compression technique. I decided to
> >> reject the 1.39 approach because it would have caused short- and
> >> long-term maintenance issues.
> >
> > Is that a problem for the kernel or for the user space? I noticed that
> > mke2fs 1.39 used over a gigabyte memory to format a >16TiB disk. While
> > being a lot that is not really a problem here.
> >
> >> At the moment 1.41 does not support > 32 bit block numbers. The
> >> priority was to get something which supported all of the other ext4
> >> features out the door, since that would allow much better testing of
> >> the ext4 code base. We are now working on 64-bit support in
> >> e2fsprogs, with mke2fs coming first, and the other tools coming later.
> >> But yeah, good quality 64-bit e2fsprogs support is going to lag for a
> >> bit. Sorry, we're working as fast as we can, given the resources we
> >> have.
> >
> > Will there be filesystem changes as well? The above mentioned
> > run-length encoding sounds a bit like a new bitmap format or is that
> > only supposed to be the in memory format in userspace?
> >
> > What is the plan of how to add 64-bit support to the shared lib now?
> > Will you introduce a do_foo64() function in parallel to do_foo() to
> > maintain abi compatibility? Will you add versioned symbols? Or will
> > there be an abi break at some point?
> >
> > The reason I ask all this is because I'm willing to spend some time
> > patching and testing. A single >16TiB filesystem instead of multiple
> > smaller ones would be a great benefit for us.
>
> Can you give us any details about your use case? Is it hundreds of very
> large files, or 100 million little ones?

Depends on our customers. Though lustre is rather slow for small files and we
try to inform our customers about that. On the other hand there are also no
choices of cluster filesystem for small files.

> Any interesting hardware in the mix on the storage or server side?

What exactly do you want to know? Usually we have a server-pair and Infortrend
Raid-units. Since lustre doesn't do any redundancy on its own, we usually
also have a raid1, raid5 or raid6 of several raid units.

For ease of management and optimal performance, we need single partitions
larger than 8TiB (raid1) or 16TiB (raid5 or raid6). And the present 8TiB
limit strongly bites us.

Cheers,
Bernd

--
Bernd Schubert
Q-Leap Networks GmbH
* Re: ext4 64bit (disk >16TB) question
2008-07-15 14:01 ` Bernd Schubert
@ 2008-07-15 14:08 ` Ric Wheeler
2008-07-15 16:13 ` Goswin von Brederlow
0 siblings, 1 reply; 16+ messages in thread
From: Ric Wheeler @ 2008-07-15 14:08 UTC (permalink / raw)
To: Bernd Schubert; +Cc: Goswin von Brederlow, linux-ext4

Bernd Schubert wrote:
> On Tuesday 15 July 2008 15:16:33 Ric Wheeler wrote:
>
>> Goswin von Brederlow wrote:
>>
>>> Theodore Tso <tytso@mit.edu> writes:
>>>
>>>> On Mon, Jul 14, 2008 at 09:50:56PM +0200, Goswin von Brederlow wrote:
>>>>
>>>>> I found ext4 64bit patches for e2fsprogs 1.39 that fix at least
>>>>> mkfs. Does anyone know if there is an updated patch set for 1.41
>>>>> anywhere? And when will that be added to e2fsprogs upstream?
>>>>>
>>>> Yes, this is correct. The 1.39 64-bit patches break the shared
>>>> library ABI, and also there were some long-term problems with having
>>>> super-large bitmaps taking huge amounts of memory without some kind of
>>>> run-length encoding or other compression technique. I decided to
>>>> reject the 1.39 approach because it would have caused short- and
>>>> long-term maintenance issues.
>>>>
>>> Is that a problem for the kernel or for the user space? I noticed that
>>> mke2fs 1.39 used over a gigabyte memory to format a >16TiB disk. While
>>> being a lot that is not really a problem here.
>>>
>>>> At the moment 1.41 does not support > 32 bit block numbers. The
>>>> priority was to get something which supported all of the other ext4
>>>> features out the door, since that would allow much better testing of
>>>> the ext4 code base. We are now working on 64-bit support in
>>>> e2fsprogs, with mke2fs coming first, and the other tools coming later.
>>>> But yeah, good quality 64-bit e2fsprogs support is going to lag for a
>>>> bit. Sorry, we're working as fast as we can, given the resources we
>>>> have.
>>>>
>>> Will there be filesystem changes as well? The above mentioned
>>> run-length encoding sounds a bit like a new bitmap format or is that
>>> only supposed to be the in memory format in userspace?
>>>
>>> What is the plan of how to add 64-bit support to the shared lib now?
>>> Will you introduce a do_foo64() function in parallel to do_foo() to
>>> maintain abi compatibility? Will you add versioned symbols? Or will
>>> there be an abi break at some point?
>>>
>>> The reason I ask all this is because I'm willing to spend some time
>>> patching and testing. A single >16TiB filesystem instead of multiple
>>> smaller ones would be a great benefit for us.
>>>
>> Can you give us any details about your use case? Is it hundreds of very
>> large files, or 100 million little ones?
>>
>
> Depends on our customers. Though lustre is rather slow for small files and we
> try to inform our customers about that. On the other hand there are also no
> choices of cluster filesystem for small files.
>

Thanks - so this is not an internal application, but hosting for
various workloads? We have different scalability issues depending on
the nature and mix of file sizes, etc.

>> Any interesting hardware in the mix on the storage or server side?
>>
>
> What exactly do you want to know? Usually we have a server-pair and Infortrend
> Raid-units. Since lustre doesn't do any redundancy on its own, we usually
> also have a raid1, raid5 or raid6 of several raid units.
>

One thing that we have been working on/thinking about is how best to
automatically self tune a file system to the storage. Today, XFS is
probably the best normal linux file system at figuring out raid stripe
size, etc. Getting this enhanced in ext4 could lead to a significant
performance win for users who are not masters of performance tuning,
etc.

How long would you wait for something like fsck to run to completion
before you would need to go to back up tapes? 6 hours? 1 day? 1 week
;-) ?

> For ease of management and optimal performance, we need single partitions
> larger than 8TiB (raid1) or 16TiB (raid5 or raid6). And the present 8TiB
> limit strongly bites us.
>
> Cheers,
> Bernd
>

Makes sense, thanks for the information!

Regards,

Ric
* Re: ext4 64bit (disk >16TB) question
2008-07-15 14:08 ` Ric Wheeler
@ 2008-07-15 16:13 ` Goswin von Brederlow
0 siblings, 0 replies; 16+ messages in thread
From: Goswin von Brederlow @ 2008-07-15 16:13 UTC (permalink / raw)
To: rwheeler; +Cc: linux-ext4

Ric Wheeler <rwheeler@redhat.com> writes:

> How long would you wait for something like fsck to run to completion
> before you would need to go to back up tapes? 6 hours? 1 day? 1 week
> ;-) ?

Backup? What are backups? :))

A hardware raid6 resync takes about 16h. A Software raid6 (over 6
hardware raid6) resync takes 1-2 days. With lustre the fsck has to be
done on the MDT (meta data target) to build a database file and on all
the OST (object storage target). So there is some parallelization in
the system. But each OST, if we get ext4 64bit working, would be
28TB. I would assume days for an fsck run. Weeks would not be good and
less than a day is totally unrealistic.

But a few days for an fsck of 400-800TB filesystem isn't so bad.
Reading that amount from backup will take ages too, if you even have
one.

MfG
Goswin
* Re: ext4 64bit (disk >16TB) question
2008-07-14 19:50 ext4 64bit (disk >16TB) question Goswin von Brederlow
2008-07-14 23:46 ` Theodore Tso
@ 2008-07-15 18:27 ` Jose R. Santos
2008-07-15 20:12 ` Andreas Dilger
1 sibling, 1 reply; 16+ messages in thread
From: Jose R. Santos @ 2008-07-15 18:27 UTC (permalink / raw)
To: Goswin von Brederlow; +Cc: linux-ext4

On Mon, 14 Jul 2008 21:50:56 +0200
Goswin von Brederlow <goswin-v-b@web.de> wrote:

> Hi,
>
> we are using lustre on a cluster of servers and raid boxes. Currently
> lustre is based on the ext3 code and has a limit of 8TiB for each
> filesystem. For us that results in having to split a server's storage
> into up to 4 chunks and run one fs on each which I would rather avoid.
> The solution for this would be to rebase the lustre patches to use
> ext4 instead, which should also reduce the patch set considerably.
> Lustre already patches a lot of ext4 features into the ext3 base.
>
> But before I start rebasing lustre I thought I would first test out
> plain ext4 so I know any bugs I find will be from my rebasing and not
> already existing in ext4 itself. And there I run into a big problem:
> Current e2fsprogs (1.41) seem to be totally unable to handle the ext4 64BIT
> feature, i.e. filesystems larger than 16TiB. The mkfs.ext4 always
> stops saying the disk exceeds the 32bit block count. And looking at
> the code I see a lot of blk_t (instead of blk64_t) and unsigned long
> (instead of unsigned long long [or even better blk64_t]) usage.
>
> I found ext4 64bit patches for e2fsprogs 1.39 that fix at least
> mkfs. Does anyone know if there is an updated patch set for 1.41
> anywhere? And when will that be added to e2fsprogs upstream?

Hi Goswin,

I've recently submitted a set of patches that covers most of the API
changes needed to support >16TB file systems (missing Ted bitmap
support of course). Once the bitmap support is included, it _SHOULD_
be relatively painless to add mke2fs support with this series of patches.

Stay tuned.

> MfG
> Goswin

-JRS
* Re: ext4 64bit (disk >16TB) question
2008-07-15 18:27 ` Jose R. Santos
@ 2008-07-15 20:12 ` Andreas Dilger
2008-07-15 20:15 ` Ric Wheeler
2008-07-15 21:20 ` Jose R. Santos
0 siblings, 2 replies; 16+ messages in thread
From: Andreas Dilger @ 2008-07-15 20:12 UTC (permalink / raw)
To: Jose R. Santos; +Cc: Goswin von Brederlow, linux-ext4

On Jul 15, 2008 13:27 -0500, Jose R. Santos wrote:
> On Mon, 14 Jul 2008 21:50:56 +0200
> Goswin von Brederlow <goswin-v-b@web.de> wrote:
> > we are using lustre on a cluster of servers and raid boxes. Currently
> > lustre is based on the ext3 code and has a limit of 8TiB for each
> > filesystem. For us that results in having to split a server's storage
> > into up to 4 chunks and run one fs on each which I would rather avoid.
> > The solution for this would be to rebase the lustre patches to use
> > ext4 instead, which should also reduce the patch set considerably.
> > Lustre already patches a lot of ext4 features into the ext3 base.
> >
> >
> > But before I start rebasing lustre I thought I would first test out
> > plain ext4 so I know any bugs I find will be from my rebasing and not
> > already existing in ext4 itself. And there I run into a big problem:
> > Current e2fsprogs (1.41) seem to be totally unable to handle the ext4 64BIT
> > feature, i.e. filesystems larger than 16TiB. The mkfs.ext4 always
> > stops saying the disk exceeds the 32bit block count. And looking at
> > the code I see a lot of blk_t (instead of blk64_t) and unsigned long
> > (instead of unsigned long long [or even better blk64_t]) usage.
> >
> > I found ext4 64bit patches for e2fsprogs 1.39 that fix at least
> > mkfs. Does anyone know if there is an updated patch set for 1.41
> > anywhere? And when will that be added to e2fsprogs upstream?
>
> I've recently submitted a set of patches that covers most of the API
> changes needed to support >16TB file systems (missing Ted's bitmap
> support of course). Once the bitmap support is included, it _SHOULD_
> be relatively painless to add mke2fs support with this series of patches.

Jose,
while waiting for the "efficient bitmap" support, how hard would it be
to implement "inefficient bitmaps" that just malloc some GB of memory
if needed? This would at least allow people with huge devices to test
mke2fs/ext4/e2fsck in the meantime.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

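A minimal sketch of the "inefficient bitmap" idea Andreas describes: one bit per block in a single flat allocation, indexed by a block number that is allowed to exceed 32 bits. This is not e2fsprogs code. The class name FlatBlockBitmap and its methods are hypothetical and only illustrate the data structure; Python's bytearray stands in for the malloc'ed buffer.

```python
class FlatBlockBitmap:
    """One bit per block in a single contiguous buffer (hypothetical helper)."""

    def __init__(self, block_count: int) -> None:
        if block_count < 0:
            raise ValueError("block_count must be non-negative")
        self.block_count = block_count
        # Round up to whole bytes; this is the "just malloc it" step.
        self.bits = bytearray((block_count + 7) // 8)

    def _check(self, block: int) -> None:
        if not 0 <= block < self.block_count:
            raise IndexError(f"block {block} out of range")

    def mark(self, block: int) -> None:
        """Set the bit for this block (block in use)."""
        self._check(block)
        self.bits[block >> 3] |= 1 << (block & 7)

    def unmark(self, block: int) -> None:
        """Clear the bit for this block (block free)."""
        self._check(block)
        self.bits[block >> 3] &= ~(1 << (block & 7)) & 0xFF

    def test(self, block: int) -> bool:
        """Return True if the bit for this block is set."""
        self._check(block)
        return bool(self.bits[block >> 3] & (1 << (block & 7)))


# Small demonstration only. A filesystem larger than 16TiB with 4KiB blocks
# would need block_count > 2**32 and a correspondingly large buffer.
bitmap = FlatBlockBitmap(1 << 20)
bitmap.mark(123_456)
```

The appeal of this layout is that lookups are trivial; the cost is that memory grows linearly with the device size whether or not the bits are ever set.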
* Re: ext4 64bit (disk >16TB) question
2008-07-15 20:12 ` Andreas Dilger
@ 2008-07-15 20:15 ` Ric Wheeler
2008-07-15 21:03 ` Goswin von Brederlow
2008-07-15 21:20 ` Jose R. Santos
1 sibling, 1 reply; 16+ messages in thread
From: Ric Wheeler @ 2008-07-15 20:15 UTC (permalink / raw)
To: Andreas Dilger; +Cc: Jose R. Santos, Goswin von Brederlow, linux-ext4

Andreas Dilger wrote:
> On Jul 15, 2008 13:27 -0500, Jose R. Santos wrote:
>
>> On Mon, 14 Jul 2008 21:50:56 +0200
>> Goswin von Brederlow <goswin-v-b@web.de> wrote:
>>
>>> we are using lustre on a cluster of servers and raid boxes. Currently
>>> lustre is based on the ext3 code and has a limit of 8TiB for each
>>> filesystem. For us that results in having to split a server's storage
>>> into up to 4 chunks and run one fs on each which I would rather avoid.
>>> The solution for this would be to rebase the lustre patches to use
>>> ext4 instead, which should also reduce the patch set considerably.
>>> Lustre already patches a lot of ext4 features into the ext3 base.
>>>
>>>
>>> But before I start rebasing lustre I thought I would first test out
>>> plain ext4 so I know any bugs I find will be from my rebasing and not
>>> already existing in ext4 itself. And there I run into a big problem:
>>> Current e2fsprogs (1.41) seem to be totally unable to handle the ext4 64BIT
>>> feature, i.e. filesystems larger than 16TiB. The mkfs.ext4 always
>>> stops saying the disk exceeds the 32bit block count. And looking at
>>> the code I see a lot of blk_t (instead of blk64_t) and unsigned long
>>> (instead of unsigned long long [or even better blk64_t]) usage.
>>>
>>> I found ext4 64bit patches for e2fsprogs 1.39 that fix at least
>>> mkfs. Does anyone know if there is an updated patch set for 1.41
>>> anywhere? And when will that be added to e2fsprogs upstream?
>>>
>> I've recently submitted a set of patches that covers most of the API
>> changes needed to support >16TB file systems (missing Ted's bitmap
>> support of course). Once the bitmap support is included, it _SHOULD_
>> be relatively painless to add mke2fs support with this series of patches.
>>
>
> Jose,
> while waiting for the "efficient bitmap" support, how hard would it be
> to implement "inefficient bitmaps" that just malloc some GB of memory
> if needed? This would at least allow people with huge devices to test
> mke2fs/ext4/e2fsck in the meantime.
>
> Cheers, Andreas
> --
> Andreas Dilger
> Sr. Staff Engineer, Lustre Group
> Sun Microsystems of Canada, Inc.
>
>

I think that would be very useful - how much DRAM would we need for a
16TB file system ;-) ?

ric

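As a rough back-of-the-envelope answer to Ric's question, the arithmetic for a single flat bitmap can be sketched as follows. The 4KiB block size is an assumption (it is the block size at which 32-bit block numbers top out at 16TiB), and the helper name flat_bitmap_bytes is made up for this illustration. The figure covers one bitmap only; tools that hold several bitmaps at once would need a multiple of it.

```python
TIB = 1024 ** 4
MIB = 1024 ** 2


def flat_bitmap_bytes(fs_bytes: int, block_size: int = 4096) -> int:
    """Bytes needed for one bit per block, rounded up to whole bytes."""
    blocks = -(-fs_bytes // block_size)  # ceiling division
    return -(-blocks // 8)


for size_tib in (16, 32):
    mib = flat_bitmap_bytes(size_tib * TIB) / MIB
    print(f"{size_tib} TiB -> {mib:.0f} MiB per bitmap")
```

Under these assumptions one flat bitmap costs 512 MiB for a 16TiB filesystem and 1024 MiB for 32TiB.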
* Re: ext4 64bit (disk >16TB) question
2008-07-15 20:15 ` Ric Wheeler
@ 2008-07-15 21:03 ` Goswin von Brederlow
0 siblings, 0 replies; 16+ messages in thread
From: Goswin von Brederlow @ 2008-07-15 21:03 UTC (permalink / raw)
To: Ric Wheeler
Cc: Andreas Dilger, Jose R. Santos, Goswin von Brederlow, linux-ext4

Ric Wheeler <ricwheeler@gmail.com> writes:

> Andreas Dilger wrote:
>> Jose,
>> while waiting for the "efficient bitmap" support, how hard would it be
>> to implement "inefficient bitmaps" that just malloc some GB of memory
>> if needed? This would at least allow people with huge devices to test
>> mke2fs/ext4/e2fsck in the meantime.
>>
>> Cheers, Andreas
>> --
>> Andreas Dilger
>> Sr. Staff Engineer, Lustre Group
>> Sun Microsystems of Canada, Inc.
>>
>>
>
> I think that would be very useful - how much DRAM would we need for a
> 16TB file system ;-) ?
>
> ric

The patched 1.39 e2fsprogs managed to format a 16TiB disk under kvm with
1GiB ram and 128k swap. A 32TiB disk format uses nearly 1GiB ram for
mkfs alone and eventually managed to deadlock the I/O layer in kvm with
1.5GB ram and 128k swap. (Something I'm sure is kvm's fault. :)

But fsck is supposed to eat more by a factor (see other mails in
thread). So having 4-16GiB ram is probably recommended for anyone
thinking about testing.

I used the sparse_create script linked on one of the ext4 wiki pages
with a sparse loopback file and used mke2fs -i $((64*1024*1024)) to
speed things up. With that, a freshly formatted 16TiB ext4 uses
somewhat over 4GiB.

MfG
Goswin

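To see why a large bytes-per-inode ratio speeds up formatting, the effect of mke2fs -i on the inode count can be estimated as below. This is only an approximation: mke2fs rounds the real count per block group, the helper name approx_inode_count is invented here, and the 16KiB comparison ratio is an assumed figure for contrast, not something stated in the thread.

```python
TIB = 1024 ** 4


def approx_inode_count(fs_bytes: int, bytes_per_inode: int) -> int:
    """Roughly one inode per bytes_per_inode bytes of filesystem space."""
    return fs_bytes // bytes_per_inode


sparse = approx_inode_count(16 * TIB, 64 * 1024 * 1024)  # -i $((64*1024*1024))
dense = approx_inode_count(16 * TIB, 16 * 1024)          # assumed 16KiB ratio

print(f"64MiB per inode: {sparse} inodes")
print(f"16KiB per inode: {dense} inodes")
print(f"reduction factor: {dense // sparse}")
```

Fewer inodes means far less inode table space to lay out, which matters when the backing store is a sparse file.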
* Re: ext4 64bit (disk >16TB) question
2008-07-15 20:12 ` Andreas Dilger
2008-07-15 20:15 ` Ric Wheeler
@ 2008-07-15 21:20 ` Jose R. Santos
2008-07-16 10:10 ` Goswin von Brederlow
1 sibling, 1 reply; 16+ messages in thread
From: Jose R. Santos @ 2008-07-15 21:20 UTC (permalink / raw)
To: Andreas Dilger; +Cc: Goswin von Brederlow, linux-ext4

On Tue, 15 Jul 2008 14:12:19 -0600
Andreas Dilger <adilger@sun.com> wrote:

> On Jul 15, 2008 13:27 -0500, Jose R. Santos wrote:
> > On Mon, 14 Jul 2008 21:50:56 +0200
> > Goswin von Brederlow <goswin-v-b@web.de> wrote:
> > > we are using lustre on a cluster of servers and raid boxes. Currently
> > > lustre is based on the ext3 code and has a limit of 8TiB for each
> > > filesystem. For us that results in having to split a server's storage
> > > into up to 4 chunks and run one fs on each which I would rather avoid.
> > > The solution for this would be to rebase the lustre patches to use
> > > ext4 instead, which should also reduce the patch set considerably.
> > > Lustre already patches a lot of ext4 features into the ext3 base.
> > >
> > >
> > > But before I start rebasing lustre I thought I would first test out
> > > plain ext4 so I know any bugs I find will be from my rebasing and not
> > > already existing in ext4 itself. And there I run into a big problem:
> > > Current e2fsprogs (1.41) seem to be totally unable to handle the ext4 64BIT
> > > feature, i.e. filesystems larger than 16TiB. The mkfs.ext4 always
> > > stops saying the disk exceeds the 32bit block count. And looking at
> > > the code I see a lot of blk_t (instead of blk64_t) and unsigned long
> > > (instead of unsigned long long [or even better blk64_t]) usage.
> > >
> > > I found ext4 64bit patches for e2fsprogs 1.39 that fix at least
> > > mkfs. Does anyone know if there is an updated patch set for 1.41
> > > anywhere? And when will that be added to e2fsprogs upstream?
> >
> > I've recently submitted a set of patches that covers most of the API
> > changes needed to support >16TB file systems (missing Ted's bitmap
> > support of course). Once the bitmap support is included, it _SHOULD_
> > be relatively painless to add mke2fs support with this series of patches.
>
> Jose,
> while waiting for the "efficient bitmap" support, how hard would it be
> to implement "inefficient bitmaps" that just malloc some GB of memory
> if needed? This would at least allow people with huge devices to test
> mke2fs/ext4/e2fsck in the meantime.

As Ted mentioned already, the "efficient bitmap" support can come later,
but the 64bit API calls need to be well designed to be able to support
different models. I will see how difficult it would be to create an
ABI BREAKING patch for testing purposes, but coming up with an ABI
compatible one seems like too much work if it's going to be replaced
sometime in the near future.

It should be possible to test it with flexbg as well (I think), since all
I need to make sure is that all bitmaps reside within the 32bit block
boundary. I don't have a large disk to test on, so I'm playing with
device mapper to see how I can fake one. Our lab network is making
things difficult though.

I'm sure that I will uncover a couple of bugs this way. Like the fact
that I forgot to set the 64bit compatibility flag or large group
descriptors. :)

> Cheers, Andreas
> --
> Andreas Dilger
> Sr. Staff Engineer, Lustre Group
> Sun Microsystems of Canada, Inc.
>

-JRS

* Re: ext4 64bit (disk >16TB) question
2008-07-15 21:20 ` Jose R. Santos
@ 2008-07-16 10:10 ` Goswin von Brederlow
0 siblings, 0 replies; 16+ messages in thread
From: Goswin von Brederlow @ 2008-07-16 10:10 UTC (permalink / raw)
To: Jose R. Santos; +Cc: Andreas Dilger, Goswin von Brederlow, linux-ext4

"Jose R. Santos" <jrs@us.ibm.com> writes:

> block boundary. I don't have a large disk to test on, so I'm playing
> with device mapper to see how I can fake one. Our lab network is
> making things difficult though.

Download http://www.bullopensource.org/ext4/files/sparse_create and
for 32TiB use

dd if=/dev/zero of=/somewhere/SPACE bs=1 count=1 seek=100000000000
losetup /dev/loop0 /somewhere/SPACE
./sparse_create ext4dev /dev/loop0 68719476736

If you have a real block device to spare, use that directly.

MfG
Goswin

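The numbers in the commands above can be sanity-checked with a little arithmetic. This sketch assumes that the last argument to sparse_create is a size in 512-byte sectors; the thread does not say so, but that reading matches the "for 32TiB" remark. The 4KiB block size is likewise an assumption.

```python
SECTOR = 512          # assumed unit of the sparse_create size argument
BLOCK = 4096          # assumed filesystem block size
TIB = 1024 ** 4
GIB = 1024 ** 3

sectors = 68_719_476_736                 # argument passed to sparse_create
virtual_bytes = sectors * SECTOR         # size of the faked device

# dd seeks 100000000000 bytes into the file and writes one byte, so the
# sparse backing file has this apparent size (almost none of it allocated).
backing_bytes = 100_000_000_000 + 1

blocks = virtual_bytes // BLOCK
needs_64bit = blocks > 2 ** 32 - 1       # too many blocks for a 32-bit number

print(f"virtual device: {virtual_bytes // TIB} TiB")
print(f"backing file:   {backing_bytes / GIB:.1f} GiB apparent size")
print(f"4KiB blocks:    {blocks} (needs 64-bit block numbers: {needs_64bit})")
```

Under these assumptions the faked device is 32TiB, backed by a sparse file with an apparent size of roughly 93GiB, and its block count no longer fits in 32 bits, which is exactly the case the 64BIT feature is meant to cover.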