* Re: 20TB ext4
[not found] <s6nzks9y74l.fsf@falbala.ieap.uni-kiel.de>
@ 2010-12-13 18:12 ` Lukas Czerner
2010-12-13 21:57 ` Andreas Dilger
1 sibling, 0 replies; 6+ messages in thread
From: Lukas Czerner @ 2010-12-13 18:12 UTC (permalink / raw)
To: Stephan Boettcher; +Cc: linux-fsdevel, linux-ext4
[-- Attachment #1: Type: TEXT/PLAIN, Size: 4978 bytes --]
On Mon, 13 Dec 2010, Stephan Boettcher wrote:
>
> Moin,
>
> I spent the weekend trying to setup a 20TB ext4 filesystem on a 32-bit
> i386 system. The filesystem is now up and running, but on a 64-bit
> machine. I intend to test this setup for a while. I understand that
> this is highly experimental. If there is anything special I should do
> to help shaking out bugs, please tell me.
>
> Thanks for all the code
> Stephan
This is indeed interesting, I'll add linux-ext4 into cc so more ext4
people can see this.
Thanks!
-Lukas
>
>
>
> The setup:
>
> Two old servers, dual Xeon 3GHz, hyperthreaded, in sturdy server
> housings, redundant power supplies, noisy but solid. A third
> identical server will become available to me next week.
>
> Each server has six 2TB SATA drives. The drives are partitioned into a
> 20GB partition and a second partition with the remaining almost 2TB.
>
> Kernel 2.6.36.1.
>
> A raid1 (/dev/md1) over three 20GB partitions is the root filesystem,
> three 20GB partitions for swap, and a RAID5 (/dev/md0) from the six big
> partitions.
>
> The 10TB /dev/md0 is exported via nbd. I had to patch nbd-client to
> import this on a 32-bit machine, so that part works.
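
[Editor's note: a minimal sketch of how such a per-server layout might be assembled with mdadm and nbd-server. Device names and the 512k chunk follow the /proc/mdstat output later in this mail, but the exact commands and the nbd port are assumptions, not quoted from the original session.]

```shell
# RAID1 root filesystem across three of the 20GB partitions
mdadm --create /dev/md1 --level=1 --raid-devices=3 /dev/sda1 /dev/sdc1 /dev/sde1

# RAID5 data array over the six big partitions, 512k chunk as in mdstat
mdadm --create /dev/md0 --level=5 --chunk=512 --raid-devices=6 \
    /dev/sda2 /dev/sdb2 /dev/sdc2 /dev/sdd2 /dev/sde2 /dev/sdf2

# export the ~10TB array over the network; the port is an arbitrary example
nbd-server 2000 /dev/md0
```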
>
> The intention was to export two (later three) via nbd to one of the
> servers, which combines them to a RAID5² with net capacity 20TB. With
> e2fsprogs master branch I could make a filesystem, but dumpe2fs and
> fsck failed. Mounting the filesystem said: EFBIG.
>
> Obviously, with 32-bit pgoff_t this will not work, and it was said
> elsewhere that making pgoff_t 64-bit on i386 will require a lot of faith
> and luck, since there are more than 3000 unsigned longs in the fs tree.
>
> So I exported both 10TB raid5 as nbd to my 64-bit desktop (Core 2 Quad,
> 2.6.36.2), did mke2fs, mount, some rsyncing, umount, dumpe2fs, fsck, mount,
> more rsyncing -- no problems yet.
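
[Editor's note: a sketch of the command sequence described above. The 64bit feature flag matches the dumpe2fs feature list shown below; the exact invocations and mount point are reconstructed, not quoted from the original session.]

```shell
# create a >16TB ext4 filesystem; needs an e2fsprogs build with 64bit support
mke2fs -t ext4 -O 64bit /dev/md9p1

mount -t ext4 /dev/md9p1 /data/hinkelstein
# ... rsync some data in, then unmount and re-check
umount /data/hinkelstein
dumpe2fs -h /dev/md9p1   # "64bit" should appear under "Filesystem features"
e2fsck -f /dev/md9p1
```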
>
> I'd prefer to run the setup self-contained without an extra 64-bit head.
> Maybe I will partition it down to a 16TB and a 4TB partition. Maybe I
> just dare to compile a kernel with typedef unsigned long long pgoff_t
> and see what happens, maybe I can help fixing that kind of configuration.
>
>
>
> (stephan)idefix:~$ cat /proc/mdstat
> Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
> md0 : active raid5 sda2[0] sdf2[5] sde2[4] sdd2[3] sdc2[2] sdb2[1]
> 9662653440 blocks level 5, 512k chunk, algorithm 2 [6/6] [UUUUUU]
>
> md1 : active raid1 sda1[0] sde1[2] sdc1[1]
> 20980736 blocks [3/3] [UUU]
>
> unused devices: <none>
>
> (stephan)falbala:~$ cat /proc/mdstat
> Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
> md9 : active raid5 nbd0[0] nbd1[1]
> 19325303808 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/2] [UU_]
> ...
>
> unused devices: <none>
>
>
> (root)falbala:~# /home/asterix/stephan/src/e2fsprogs/build/misc/dumpe2fs -h /dev/md9p1
> dumpe2fs 1.41.13 (22-Nov-2010)
> Filesystem volume name: <none>
> Last mounted on: /data/hinkelstein
> Filesystem UUID: 7c96821d-3371-465b-9c69-f67ec1a953fa
> Filesystem magic number: 0xEF53
> Filesystem revision #: 1 (dynamic)
> Filesystem features: has_journal ext_attr dir_index filetype needs_recovery extent 64bit flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize
> Filesystem flags: signed_directory_hash
> Default mount options: (none)
> Filesystem state: clean
> Errors behavior: Continue
> Filesystem OS type: Linux
> Inode count: 2415673344
> Block count: 4831325943
> Reserved block count: 241566297
> Free blocks: 4686685845
> Free inodes: 2415191498
> First block: 0
> Block size: 4096
> Fragment size: 4096
> Blocks per group: 32768
> Fragments per group: 32768
> Inodes per group: 16384
> Inode blocks per group: 512
> Flex block group size: 16
> Filesystem created: Sun Dec 12 23:02:05 2010
> Last mount time: Mon Dec 13 09:24:10 2010
> Last write time: Mon Dec 13 09:24:10 2010
> Mount count: 2
> Maximum mount count: 26
> Last checked: Sun Dec 12 23:02:05 2010
> Check interval: 15552000 (6 months)
> Next check after: Sat Jun 11 00:02:05 2011
> Lifetime writes: 288 GB
> Reserved blocks uid: 0 (user root)
> Reserved blocks gid: 0 (group root)
> First inode: 11
> Inode size: 128
> Journal inode: 8
> Default directory hash: half_md4
> Directory Hash Seed: 3c0d80ff-6611-43ad-93e8-b083d637e549
> Journal backup: inode blocks
> Journal features: journal_incompat_revoke FEATURE_I1
> Journal size: 128M
> Journal length: 32768
> Journal sequence: 0x00002bea
> Journal start: 4481
>
>
>
--
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: 20TB ext4
[not found] <s6nzks9y74l.fsf@falbala.ieap.uni-kiel.de>
2010-12-13 18:12 ` 20TB ext4 Lukas Czerner
@ 2010-12-13 21:57 ` Andreas Dilger
2010-12-14 3:27 ` Ric Wheeler
2010-12-14 8:59 ` Stephan Boettcher
1 sibling, 2 replies; 6+ messages in thread
From: Andreas Dilger @ 2010-12-13 21:57 UTC (permalink / raw)
To: Stephan Boettcher; +Cc: ext4 development
On 2010-12-13, at 09:23, Stephan Boettcher wrote:
> A raid1 (/dev/md1) over three 20GB partitions is the root filesystem,
> three 20GB partitions for swap, and a RAID5 (/dev/md0) from the six big
> partitions.
>
> The 10TB /dev/md0 is exported via nbd. I had to patch nbd-client to
> import this on a 32-bit machine, so that part works.
>
> The intention was to export two (later three) via nbd to one of the
> servers, which combines them to a RAID5² with net capacity 20TB. With
> e2fsprogs master branch I could make a filesystem, but dumpe2fs and
> fsck failed. Mounting the filesystem said: EFBIG.
RAID-5 on top of RAID-5 is going to be VERY SLOW... Also note that only a single "nbd client" system will be able to use this storage at one time. If you have dedicated server nodes, and you want to be able to use these 20TB from multiple clients, you might consider using Lustre, which uses ext4 as the back-end storage, and can scale to many PB filesystems (largest known filesystem is 20PB, from 1344 * 8TB separate ext4 filesystems).
> Obviously, with 32-bit pgoff_t this will not work, and it was said
> elsewhere that making pgoff_t 64-bit on i386 will require a lot of faith
> and luck, since there are more than 3000 unsigned longs in the fs tree.
I don't think that is going to happen any time soon. Lustre _can_ export from a 32-bit server, though it definitely isn't very common anymore. For the cost of a single 2TB drive you can likely get a new motherboard + 64-bit CPU + RAM...
> I'd prefer to run the setup self-contained without an extra 64-bit head.
> Maybe I will partition it down to a 16TB and a 4TB partition. Maybe I
> just dare to compile a kernel with typedef unsigned long long pgoff_t
> and see what happens, maybe I can help fixing that kind of configuration.
I would suggest you examine what it is you are really trying to get out of this system? Is it just for fun, to test ext4 with > 16TB filesystems? Great, you can probably do that with the 64-bit nbd client. Do you actually want to use this for some data you care about? Then trying to get 32-bit kernels to handle > 16TB block devices is a risky strategy to take for a few hundred USD. Given that you are willing to spend a few thousand USD for the 2TB drives, you should consider just getting a 64-bit CPU + RAM to handle it.
Also note that running e2fsck on such a large filesystem will need 6-8GB of RAM at a minimum, and can be a lot more if there are serious problems (e.g. duplicate blocks). Recently I saw a report of 22GB of RAM needed for e2fsck to complete, which is just impossible on a 32-bit machine.
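
[Editor's note: an illustrative back-of-envelope for the filesystem in the dumpe2fs output above — my arithmetic, not Andreas's; the bitmap multipliers are guesses at e2fsck's working set, and real usage is considerably higher once per-inode structures are counted.]

```shell
blocks=4831325943    # block count from the dumpe2fs output above
inodes=2415673344    # inode count from the dumpe2fs output above

# one bit per block or inode per bitmap; the number of bitmaps is a guess
block_bitmap_bytes=$(( blocks / 8 * 2 ))
inode_bitmap_bytes=$(( inodes / 8 * 3 ))

total_mib=$(( (block_bitmap_bytes + inode_bitmap_bytes) / 1024 / 1024 ))
echo "${total_mib} MiB for in-memory bitmaps alone"
```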
Cheers, Andreas
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
* Re: 20TB ext4
2010-12-13 21:57 ` Andreas Dilger
@ 2010-12-14 3:27 ` Ric Wheeler
2010-12-14 8:59 ` Stephan Boettcher
1 sibling, 0 replies; 6+ messages in thread
From: Ric Wheeler @ 2010-12-14 3:27 UTC (permalink / raw)
To: Andreas Dilger; +Cc: Stephan Boettcher, ext4 development
On 12/13/2010 04:57 PM, Andreas Dilger wrote:
> On 2010-12-13, at 09:23, Stephan Boettcher wrote:
>> A raid1 (/dev/md1) over three 20GB partitions is the root filesystem,
>> three 20GB partitions for swap, and a RAID5 (/dev/md0) from the six big
>> partitions.
>>
>> The 10TB /dev/md0 is exported via nbd. I had to patch nbd-client to
>> import this on a 32-bit machine, so that part works.
>>
>> The intention was to export two (later three) via nbd to one of the
>> servers, which combines them to a RAID5² with net capacity 20TB. With
>> e2fsprogs master branch I could make a filesystem, but dumpe2fs and
>> fsck failed. Mounting the filesystem said: EFBIG.
> RAID-5 on top of RAID-5 is going to be VERY SLOW... Also note that only a single "nbd client" system will be able to use this storage at one time. If you have dedicated server nodes, and you want to be able to use these 20TB from multiple clients, you might consider using Lustre, which uses ext4 as the back-end storage, and can scale to many PB filesystems (largest known filesystem is 20PB, from 1344 * 8TB separate ext4 filesystems).
>
>> Obviously, with 32-bit pgoff_t this will not work, and it was said
>> elsewhere that making pgoff_t 64-bit on i386 will require a lot of faith
>> and luck, since there are more than 3000 unsigned longs in the fs tree.
> I don't think that is going to happen any time soon. Lustre _can_ export from a 32-bit server, though it definitely isn't very common anymore. For the cost of a single 2TB drive you can likely get a new motherboard + 64-bit CPU + RAM...
>
>> I'd prefer to run the setup self-contained without an extra 64-bit head.
>> Maybe I will partition it down to a 16TB and a 4TB partition. Maybe I
>> just dare to compile a kernel with typedef unsigned long long pgoff_t
>> and see what happens, maybe I can help fixing that kind of configuration.
> I would suggest you examine what it is you are really trying to get out of this system? Is it just for fun, to test ext4 with > 16TB filesystems? Great, you can probably do that with the 64-bit nbd client. Do you actually want to use this for some data you care about? Then trying to get 32-bit kernels to handle > 16TB block devices is a risky strategy to take for a few hundred USD. Given that you are willing to spend a few thousand USD for the 2TB drives, you should consider just getting a 64-bit CPU + RAM to handle it.
>
> Also note that running e2fsck on such a large filesystem will need 6-8GB of RAM at a minimum, and can be a lot more if there are serious problems (e.g. duplicate blocks). Recently I saw a report of 22GB of RAM needed for e2fsck to complete, which is just impossible on a 32-bit machine.
>
>
> Cheers, Andreas
>
I have to agree here - I do not see this as being a great investment of time.
Even low-powered CPUs can often run in 64-bit mode these days and, as Andreas
says, you will need a lot of DRAM to fsck this box :)
Ric
* Re: 20TB ext4
2010-12-13 21:57 ` Andreas Dilger
2010-12-14 3:27 ` Ric Wheeler
@ 2010-12-14 8:59 ` Stephan Boettcher
2010-12-14 20:51 ` Stephan Boettcher
1 sibling, 1 reply; 6+ messages in thread
From: Stephan Boettcher @ 2010-12-14 8:59 UTC (permalink / raw)
To: ext4 development
Andreas Dilger <adilger@dilger.ca> writes:
> On 2010-12-13, at 09:23, Stephan Boettcher wrote:
>> A raid1 (/dev/md1) over three 20GB partitions is the root filesystem,
>> three 20GB partitions for swap, and a RAID5 (/dev/md0) from the six big
>> partitions.
>>
>> The 10TB /dev/md0 is exported via nbd. I had to patch nbd-client to
>> import this on a 32-bit machine, so that part works.
>>
>> The intention was to export two (later three) via nbd to one of the
>> servers, which combines them to a RAID5² with net capacity 20TB. With
>> e2fsprogs master branch I could make a filesystem, but dumpe2fs and
>> fsck failed. Mounting the filesystem said: EFBIG.
>
> RAID-5 on top of RAID-5 is going to be VERY SLOW...
Speed is not a priority. But I thought, since it's distributed across
multiple servers, it cannot be that bad.
> Also note that only a single "nbd client" system will be able to use
> this storage at one time.
Yes, this is obvious. I have several safeguards (e.g., only a single
IP address in /etc/nbd-server/allow) to make sure I do not accidentally
use a partition concurrently from two servers.
> If you have dedicated server nodes, and you want to be able to use
> these 20TB from multiple clients, you might consider using Lustre,
> which uses ext4 as the back-end storage, and can scale to many PB
> filesystems (largest known filesystem is 20PB, from 1344 * 8TB
> separate ext4 filesystems).
I like things to be as simple and transparent as possible :-) The plan
is to export the fs via NFS. I will hit the 16 TB limit again, won't I?
I did not test that part yet. The NFS clients will then probably be
required to run 64-bit kernels as well.
>> Obviously, with 32-bit pgoff_t this will not work, and it was said
>> elsewhere that making pgoff_t 64-bit on i386 will require a lot of faith
>> and luck, since there are more than 3000 unsigned longs in the fs tree.
>
> I don't think that is going to happen any time soon. Lustre _can_
> export from a 32-bit server, though it definitely isn't very common
> anymore. For the cost of a single 2TB drive you can likely get a new
> motherboard + 64-bit CPU + RAM...
This is an exercise to keep a set of old, trusty servers usefully
employed that were otherwise due to be discarded. One aspect of
Linux is its ability to keep old hardware running.
>> I'd prefer to run the setup self-contained without an extra 64-bit head.
>> Maybe I will partition it down to a 16TB and a 4TB partition. Maybe I
>> just dare to compile a kernel with typedef unsigned long long pgoff_t
>> and see what happens, maybe I can help fixing that kind of configuration.
>
> I would suggest you examine what it is you are really trying to get
> out of this system?
I see it as a challenge to learn stuff (Linux fs, ext4, git) and kind of
like a sport to find out where the limits are. In the end we may have a
server for backups of some of that new virtual production, and I hope I
can contribute some testing to Linux fs code.
Our computer center throws out all the old servers and replaces them
with virtual machines on that big new system, with virtual disks from
fibre-channel-connected RAIDs. They seem to run well, but I also like
some real, non-virtual backup, at least for a while.
> Is it just for fun, to test ext4 with > 16TB filesystems?
Mostly.
> Great, you can probably do that with the 64-bit nbd client. Do you
> actually want to use this for some data you care about?
Maybe, eventually.
If I then really need to care about the data, I will probably partition
it to <16TB filesystems.
> Then trying to get 32-bit kernels to handle > 16TB block devices is a
> risky strategy to take for a few hundred USD.
No risk, no fun, no progress.
I do see the mismatch, though: the hardware is massively redundant, and
the software highly experimental.
> Given that you are willing to spend a few thousand USD for the 2TB
> drives, you should consider just getting a 64-bit CPU + RAM to handle
> it.
Those disks are incredibly cheap; we spent about $1500 for 20 disks.
One thing I want to test is how often I need to swap out one of those
during the next year.
> Also note that running e2fsck on such a large filesystem will need
> 6-8GB of RAM at a minimum, and can be a lot more if there are serious
> problems (e.g. duplicate blocks). Recently I saw a report of 22GB of
> RAM needed for e2fsck to complete, which is just impossible on a
> 32-bit machine.
Thank you for these comments; they will certainly influence how I
proceed, but I don't know yet how.
For a few months I will experiment with the setup. I am open to
suggestions, patches to test, etc.
--
Stephan
* Re: 20TB ext4
2010-12-14 8:59 ` Stephan Boettcher
@ 2010-12-14 20:51 ` Stephan Boettcher
2010-12-15 9:21 ` Andreas Dilger
0 siblings, 1 reply; 6+ messages in thread
From: Stephan Boettcher @ 2010-12-14 20:51 UTC (permalink / raw)
To: linux-ext4
Stephan Boettcher <boettcher@physik.uni-kiel.de> writes:
> Andreas Dilger <adilger@dilger.ca> writes:
>
>> On 2010-12-13, at 09:23, Stephan Boettcher wrote:
>>> A raid1 (/dev/md1) over three 20GB partitions is the root filesystem,
>>> three 20GB partitions for swap, and a RAID5 (/dev/md0) from the six big
>>> partitions.
>>>
>>> The 10TB /dev/md0 is exported via nbd. I had to patch nbd-client to
>>> import this on a 32-bit machine, so that part works.
>>>
>>> The intention was to export two (later three) via nbd to one of the
>>> servers, which combines them to a RAID5² with net capacity 20TB. With
>>> e2fsprogs master branch I could make a filesystem, but dumpe2fs and
>>> fsck failed. Mounting the filesystem said: EFBIG.
>> If you have dedicated server nodes, and you want to be able to use
>> these 20TB from multiple clients, you might consider using Lustre,
>> which uses ext4 as the back-end storage, and can scale to many PB
>> filesystems (largest known filesystem is 20PB, from 1344 * 8TB
>> separate ext4 filesystems).
>
> I like things to be as simple and transparent as possible :-) The plan
> is to export the fs via NFS. I will hit the 16 TB limit again, won't I?
> I did not test that part yet. The NFS clients will then probably be
> required to run 64-bit kernels as well.
Excuse me for not knowing all that much about how Linux filesystems
work. I was surprised that I could export the 20TB filesystem via NFS
and mount it on a 32-bit (2.6.31) system. Do I need to expect failures
when I try to actually use it that way, or does the NFS filesystem not
use the page cache or something, so that the 16TB limit does not apply?
Thanks, Stephan
I guess I should upgrade that kernel ...
(root)informatix:/data/hinkelstein# cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 15
model : 2
model name : Intel(R) Pentium(R) 4 CPU 2.00GHz
stepping : 4
cpu MHz : 2020.126
cache size : 512 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 2
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pebs bts
bogomips : 4040.25
clflush size : 64
power management:
(root)informatix:/data/hinkelstein# cat /proc/version
Linux version 2.6.31 (stephan@informatix) (gcc version 4.3.3 (Debian 4.3.3-8) ) #2 Fri Oct 2 08:25:51 CEST 2009
(root)informatix:/data/hinkelstein# df .
Filesystem 1K-blocks Used Available Use% Mounted on
falbala:/data/hinkelstein/
19021934592 651503616 17404166144 4% /data/hinkelstein
--
Stephan
* Re: 20TB ext4
2010-12-14 20:51 ` Stephan Boettcher
@ 2010-12-15 9:21 ` Andreas Dilger
0 siblings, 0 replies; 6+ messages in thread
From: Andreas Dilger @ 2010-12-15 9:21 UTC (permalink / raw)
To: Stephan Boettcher; +Cc: linux-ext4
On 2010-12-14, at 13:51, Stephan Boettcher wrote:
> Stephan Boettcher <boettcher@physik.uni-kiel.de> writes:
>> Andreas Dilger <adilger@dilger.ca> writes:
>>> If you have dedicated server nodes, and you want to be able to use
>>> these 20TB from multiple clients, you might consider using Lustre,
>>> which uses ext4 as the back-end storage, and can scale to many PB
>>> filesystems (largest known filesystem is 20PB, from 1344 * 8TB
>>> separate ext4 filesystems).
>>
>> I like things to be as simple and transparent as possible :-) The plan
>> is to export the fs via NFS. I will hit the 16 TB limit again, won't I?
>> I did not test that part yet. The NFS clients will then probably be
>> required to run 64-bit kernels as well.
>
> Excuse me for not knowing all that much about how linux filesystems
> work. I was surprised that I could export the 20TB filesystem via NFS
> and mount it on a 32-bit (2.6.31) system. Do I need to expect failures
> when I try to actually use it that way, or does the nfs filesystem not
> use the page cache or something, so that the 16TB limit does not apply?
The 16TB limit is related to the 32-bit page index * PAGE_SIZE (4kB), so 2^32 * 2^12 = 2^44 = 16 * 2^40 = 16TB.
Because NFS is not exporting the whole block device, just file access, the filesystem size is not limited by the 32-bit page index. However, individual file access would be limited to 16TB, and it may be there are lower limits on the file size.
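
[Editor's note: the arithmetic from the paragraph above, spelled out; shell is used here purely for illustration.]

```shell
# 2^32 distinct page indices, times 4 KiB (2^12) pages on i386
limit=$(( (1 << 32) * 4096 ))
echo "${limit} bytes"          # 17592186044416
echo "$(( limit >> 40 )) TiB"  # 16
```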
With Lustre, we had filesystems of hundreds of TB in size with 32-bit clients and servers. The individual Lustre backing filesystems are still under 16TB, so there is not a problem to serve them from 32-bit nodes. In any case, nobody buys 32-bit systems today so we don't test this much anymore.
Cheers, Andreas