* BIG files & file systems
@ 2002-07-31 19:16 Peter J. Braam
2002-07-31 19:26 ` Christoph Hellwig
` (3 more replies)
0 siblings, 4 replies; 38+ messages in thread
From: Peter J. Braam @ 2002-07-31 19:16 UTC (permalink / raw)
To: linux-kernel
Hi,
I've just been told that some "limitations" of the following kind will
remain:
page index = unsigned long
ino_t = unsigned long
Lustre has definitely been asked to support much larger files than
16TB. Also file systems with a trillion files have been requested by
one of our supporters (you don't want to know who, besides I've no
idea how many bits go in a trillion, but it's more than 32).
I understand why people don't want to sprinkle the kernel with u64's,
and arguably we can wait a year or two and use 64 bit architectures,
so I'm probably not going to kick up a fuss about it.
However, I thought I'd let you know that there are organizations that
_really_ want to have such big files and file systems and get quite
dismayed about "small integers". And we will fail to deliver on a
requirement to write a 50TB file because of this.
My first Linux machine was a 25MHz i386 with a 40MB disk....
- Peter -
^ permalink raw reply [flat|nested] 38+ messages in thread

* Re: BIG files & file systems
From: Christoph Hellwig @ 2002-07-31 19:26 UTC (permalink / raw)
To: Peter J. Braam; +Cc: linux-kernel

On Wed, Jul 31, 2002 at 01:16:20PM -0600, Peter J. Braam wrote:
> Hi,
>
> I've just been told that some "limitations" of the following kind will
> remain:
>
>   page index = unsigned long
>   ino_t = unsigned long
>
> Lustre has definitely been asked to support much larger files than
> 16TB. Also file systems with a trillion files have been requested by
> one of our supporters (you don't want to know who, besides I've no
> idea how many bits go in a trillion, but it's more than 32).

What about using 64bit machines? ..
* Re: BIG files & file systems
From: Matti Aarnio @ 2002-07-31 20:04 UTC (permalink / raw)
To: Christoph Hellwig, Peter J. Braam, linux-kernel

On Wed, Jul 31, 2002 at 08:26:38PM +0100, Christoph Hellwig wrote:
> On Wed, Jul 31, 2002 at 01:16:20PM -0600, Peter J. Braam wrote:
> > I've just been told that some "limitations" of the following kind will
> > remain:
> >
> >   page index = unsigned long
> >   ino_t = unsigned long
> >
> > Lustre has definitely been asked to support much larger files than
> > 16TB. Also file systems with a trillion files have been requested by
> > one of our supporters (you don't want to know who, besides I've no
> > idea how many bits go in a trillion, but it's more than 32).
>
> What about using 64bit machines? ..

It depends on many things:
  - Block layer (unsigned long)
  - Page indexes (unsigned long)
  - Filesystem format dependent limits
    - EXT2/EXT3: u32_t FILESYSTEM block index; presuming EXT2/EXT3
      is supported only up to 4 kB block sizes, that gives you a
      very hard limit of 16 terabytes (16 * "10^12")
    - ReiserFS: u32_t block indexes presently, u64_t in future;
      block size ranges? Max size is limited by the maximum
      supported file size, likely 2^63, which is roughly
      8 * "10^18", or circa 500 000 times larger than the
      EXT2/EXT3 format maximum.
    - ClusterFS (Braam et al.): 64-bit block indexes? File size
      limitation same as with ReiserFS.

(Just to illustrate a few..)

/Matti Aarnio
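Matti's figures follow directly from the field widths he lists; a quick arithmetic check (Python used here purely as a calculator, not part of the original thread — note the thread's "16 terabytes" is the binary reading of the same number, 17.6 decimal TB):

```python
# Sanity check of the filesystem-format limits quoted above.
# Assumes 4 kB blocks and a u32_t block index for ext2/ext3.

BLOCK_SIZE = 4 * 1024          # 4 kB blocks
MAX_BLOCKS = 2 ** 32           # u32_t filesystem block index

ext2_fs_limit = BLOCK_SIZE * MAX_BLOCKS
print(ext2_fs_limit)                 # 17592186044416 bytes (16 TiB)
print(ext2_fs_limit / 10 ** 12)      # ~17.6 decimal terabytes

# ReiserFS figure: a 2^63 signed file-size limit versus the ext2/ext3 max.
reiser_limit = 2 ** 63
print(reiser_limit // ext2_fs_limit) # 524288, i.e. "circa 500 000 times larger"
```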
* Re: BIG files & file systems
From: Christoph Hellwig @ 2002-07-31 20:12 UTC (permalink / raw)
To: Matti Aarnio; +Cc: Peter J. Braam, linux-kernel

On Wed, Jul 31, 2002 at 11:04:12PM +0300, Matti Aarnio wrote:
> It depends on many things:
>   - Block layer (unsigned long)
>   - Page indexes (unsigned long)

Those grow with sizeof(unsigned long) on 64bit machines. And for the
filesystem internals, just use one that is designed to be used with
such big storage devices (e.g. jfs or xfs, certainly not ext2/3 or
reiserfs3).
* Re: BIG files & file systems
From: Albert D. Cahalan @ 2002-08-02 17:26 UTC (permalink / raw)
To: Matti Aarnio; +Cc: Christoph Hellwig, Peter J. Braam, linux-kernel

Matti Aarnio writes:
> It depends on many things:
>   - Block layer (unsigned long)
>   - Page indexes (unsigned long)
>   - Filesystem format dependent limits
>     - EXT2/EXT3: u32_t FILESYSTEM block index, presuming the EXT2/EXT3
>       is supported only up to 4 kB block sizes, that gives
>       you a very hard limit.. of 16 terabytes (16 * "10^12")

You first hit the triple-indirection limit at 4 TB.
http://www.cs.uml.edu/~acahalan/linux/ext2.gif

>     - ReiserFS: u32_t block indexes presently, u64_t in future;
>       block size ranges ? Max size is limited by the
>       maximum supported file size, likely 2^63, which is
>       roughly 8 * "10^18", or circa 500 000 times larger
>       than EXT2/EXT3 format maximum.

The top 4 st_size bits get stolen, so it's 60-bit sizes.
You also get the 32-bit block limit at 16 TB.
* Re: BIG files & file systems
From: Randy.Dunlap @ 2002-08-02 22:14 UTC (permalink / raw)
To: Albert D. Cahalan
Cc: Matti Aarnio, Christoph Hellwig, Peter J. Braam, linux-kernel

On Fri, 2 Aug 2002, Albert D. Cahalan wrote:

| Matti Aarnio writes:
| > It depends on many things:
| >   - Block layer (unsigned long)
| >   - Page indexes (unsigned long)
| >   - Filesystem format dependent limits
| >     - EXT2/EXT3: u32_t FILESYSTEM block index, presuming the EXT2/EXT3
| >       is supported only up to 4 kB block sizes, that gives
| >       you a very hard limit.. of 16 terabytes (16 * "10^12")
|
| You first hit the triple-indirection limit at 4 TB.
| http://www.cs.uml.edu/~acahalan/linux/ext2.gif
|
| >     - ReiserFS: u32_t block indexes presently, u64_t in future;
| >       block size ranges ? Max size is limited by the
| >       maximum supported file size, likely 2^63, which is
| >       roughly 8 * "10^18", or circa 500 000 times larger
| >       than EXT2/EXT3 format maximum.
|
| The top 4 st_size bits get stolen, so it's 60-bit sizes.
| You also get the 32-bit block limit at 16 TB.

For a LinuxWorld presentation in August, I have asked each of the
4 journaling filesystems (ext3, reiserfs, JFS, and XFS) what their
filesystem/filesize limits are. Here's what they have told me.

                      ext3fs   reiserfs   JFS     XFS
max filesize:         16 TB#   1 EB       4 PB$   8 TB%
max filesystem size:  2 TB     17.6 TB*   4 PB$   2 TB!

Notes:
  #: think sparse files
  *: 4 KB blocks
  $: 16 TB on 32-bit architectures
  %: 4 KB pages
  !: block device limit

--
~Randy
* Re: BIG files & file systems
From: Albert D. Cahalan @ 2002-08-03 3:26 UTC (permalink / raw)
To: Randy.Dunlap
Cc: Albert D. Cahalan, Matti Aarnio, Christoph Hellwig, Peter J. Braam, linux-kernel

Randy.Dunlap writes:
> For a LinuxWorld presentation in August, I have asked each of the
> 4 journaling filesystems (ext3, reiserfs, JFS, and XFS) what their
> filesystem/filesize limits are. Here's what they have told me.
>
>                       ext3fs   reiserfs   JFS     XFS
> max filesize:         16 TB#   1 EB       4 PB$   8 TB%
> max filesystem size:  2 TB     17.6 TB*   4 PB$   2 TB!
>
> Notes:
>   #: think sparse files
>   *: 4 KB blocks
>   $: 16 TB on 32-bit architectures
>   %: 4 KB pages
>   !: block device limit

Please fix that before you give your presentation.
Sparse files won't save you from the triple-indirection limit.
This has me suspicious of the other numbers as well.

Ext2 gives you 0xc blocks addressed right off the inode.
Then with one 4 kB block of block pointers, you can get
to another 0x400 (1024) blocks. With a block of pointers to
blocks of pointers, you may address another 0x100000 blocks.
Finally, triple indirection gives you a block of pointers
to blocks of pointers to blocks of pointers, for another
0x40000000 data blocks. That's a total of:

  0x4010040c blocks
  0x4010040c000 bytes
  4.4e12 bytes and change
  4402 GB (decimal gigabytes)
  4.4 TB (decimal terabytes)

Of course you can't really use 4.4 TB on 32-bit Linux,
so there is a sort of dishonesty in making this claim.
I can get to 2.2 TB, which disturbingly would wrap any
code using signed 32-bit math on units of 512 bytes.
The exact limits are:

  0x000001ffffffefff max offset
  0x000001fffffff000 max size
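Albert's triple-indirection arithmetic can be re-derived in a few lines (a calculator sketch, not part of the thread; ext2 with 4 kB blocks has 4096/4 = 1024 block pointers per indirect block):

```python
# Re-derivation of the ext2 triple-indirection total above.
BLOCK = 4096
PTRS = BLOCK // 4              # 1024 four-byte block pointers per block

direct = 12                    # 0xc blocks addressed right off the inode
single = PTRS                  # 0x400 via the single-indirect block
double = PTRS ** 2             # 0x100000 via double indirection
triple = PTRS ** 3             # 0x40000000 via triple indirection

total_blocks = direct + single + double + triple
print(hex(total_blocks))       # 0x4010040c
print(total_blocks * BLOCK)    # 4402345721856 bytes, "4.4e12 and change"
```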
* Re: BIG files & file systems
From: Andreas Dilger @ 2002-08-06 5:19 UTC (permalink / raw)
To: Albert D. Cahalan
Cc: Randy.Dunlap, Matti Aarnio, Christoph Hellwig, Peter J. Braam, linux-kernel

On Aug 02, 2002 23:26 -0400, Albert D. Cahalan wrote:
> Randy.Dunlap writes:
> > For a LinuxWorld presentation in August, I have asked each of the
> > 4 journaling filesystems (ext3, reiserfs, JFS, and XFS) what their
> > filesystem/filesize limits are. Here's what they have told me.
> >
> >                       ext3fs   reiserfs   JFS     XFS
> > max filesize:         16 TB#   1 EB       4 PB$   8 TB%
> > max filesystem size:  2 TB     17.6 TB*   4 PB$   2 TB!

I think you need a "!" behind the 2TB limit for ext3 max filesystem
size. The actual filesystem limit for 4kB block size is 16TB*
(2^32 blocks). More on this below.

> > Notes:
> >   #: think sparse files
> >   *: 4 KB blocks
> >   $: 16 TB on 32-bit architectures
> >   %: 4 KB pages
> >   !: block device limit
>
> Please fix that before you give your presentation.
> Sparse files won't save you from the triple-indirection limit.
> This has me suspicious of the other numbers as well.
>
> Ext2 gives you 0xc blocks addressed right off the inode.
> Then with one 4 kB block of block pointers, you can get
> to another 0x400 (1024) blocks. With a block of pointers to
> blocks of pointers, you may address another 0x100000 blocks.
> Finally, triple indirection gives you a block of pointers
> to blocks of pointers to blocks of pointers, for another
> 0x40000000 data blocks. That's a total of:
>
>   0x4010040c blocks
>   0x4010040c000 bytes
>   4.4e12 bytes and change
>   4402 GB (decimal gigabytes)
>   4.4 TB (decimal terabytes)
>
> Of course you can't really use 4.4 TB on 32-bit Linux,
> so there is a sort of dishonesty in making this claim.
> I can get to 2.2 TB, which disturbingly would wrap any
> code using signed 32-bit math on units of 512 bytes.
> The exact limits are:
>
>   0x000001ffffffefff max offset
>   0x000001fffffff000 max size

I would also have to add another footnote to this, if people start
talking about limits on 64-bit and >4kB page size systems. ext2/3 can
support multiple block sizes (limited by the hardware page size), and
actually supporting larger block sizes has only been restricted for
cross-platform compatibility reasons.

Now that larger page sizes are becoming more common, the support for
up to 16kB block sizes has already been added into e2fsprogs, and will
only need a 1-line change in the kernel to be supported. The choice of
16kB pages as the limit is somewhat arbitrary also, and could be
increased again in the future, as needed.

Having 16kB block size would allow a maximum of 64TB for a single
filesystem. The per-file limit would be over 256TB.

In reality, we will probably implement extent-based allocation for
ext3 when we start getting into filesystems that large, which has been
discussed among the ext2/ext3 developers already. We could also go to
a clustered filesystem like Lustre, which can span a large number of
separate filesystems (and hosts).

Cheers, Andreas
--
Andreas Dilger
http://www-mddsp.enel.ucalgary.ca/People/adilger/
http://sourceforge.net/projects/ext2resize/
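The 64TB figure Andreas quotes is just the 2^32 block-number space scaled by the block size; the same scaling gives the 16TB and 32TB points mentioned elsewhere in the thread (a quick check, not from the original mail):

```python
# Filesystem size limit = 32-bit block numbers * block size.
MAX_BLOCKS = 2 ** 32

for block_size in (4096, 8192, 16384):
    limit_tib = MAX_BLOCKS * block_size // 2 ** 40
    print(block_size, "->", limit_tib, "TiB")
# 4096 -> 16 TiB, 8192 -> 32 TiB, 16384 -> 64 TiB
```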
* Re: BIG files & file systems
From: Albert D. Cahalan @ 2002-08-06 7:24 UTC (permalink / raw)
To: Andreas Dilger
Cc: Albert D. Cahalan, Randy.Dunlap, Matti Aarnio, Christoph Hellwig, Peter J. Braam, linux-kernel

Andreas Dilger writes:
> I would also have to add another footnote to this, if people start
> talking about limits on 64-bit and >4kB page size systems. ext2/3 can
> support multiple block sizes (limited by the hardware page size), and
> actually supporting larger block sizes has only been restricted for
> cross-platform compatibility reasons.

This looks pretty silly if you think about it. We support
both 8 kB UFS and 64 kB FAT16 already.

> Having 16kB block size would allow a maximum of 64TB for a single
> filesystem. The per-file limit would be over 256TB.

Um, yeah, 64 TB of data with 192 TB of holes!
I really don't think you should count a file that won't fit on
your filesystem. It's one thing to say ext2 is ready for when the
block devices grow. It's another thing to talk about files that
can't possibly fit without changing the filesystem layout.

> In reality, we will probably implement extent-based allocation for
> ext3 when we start getting into filesystems that large, which has been
> discussed among the ext2/ext3 developers already.

It's nice to have a simple filesystem. If you turn ext2/ext3
into an XFS/JFS competitor, then what is left? Just minix fs?
* Re: BIG files & file systems
From: Andreas Dilger @ 2002-08-06 7:52 UTC (permalink / raw)
To: Albert D. Cahalan
Cc: Randy.Dunlap, Matti Aarnio, Christoph Hellwig, Peter J. Braam, linux-kernel

On Aug 06, 2002 03:24 -0400, Albert D. Cahalan wrote:
> Andreas Dilger writes:
> > Having 16kB block size would allow a maximum of 64TB for a single
> > filesystem. The per-file limit would be over 256TB.
>
> Um, yeah, 64 TB of data with 192 TB of holes!
> I really don't think you should count a file
> that won't fit on your filesystem.

Well, no worse than the original posting which had reiserfs supporting
something-EB files and 16TB filesystems. Don't think I didn't consider
this at the time of posting.

> > In reality, we will probably implement extent-based allocation for
> > ext3 when we start getting into filesystems that large, which has been
> > discussed among the ext2/ext3 developers already.
>
> It's nice to have a simple filesystem. If you turn ext2/ext3
> into an XFS/JFS competitor, then what is left? Just minix fs?

Note that I said ext3 in the above sentence, and not ext2. I'm not in
favour of adding all of the high-end features (htree, extents, etc)
into ext2 at all. It makes absolutely no sense to have a multi-TB
filesystem running ext2, and then the fsck time takes a day. It is
desirable to put some minimum support into ext2 for newer features
when it makes sense and does not complicate the code, but not for
everything.

Cheers, Andreas
--
Andreas Dilger
http://www-mddsp.enel.ucalgary.ca/People/adilger/
http://sourceforge.net/projects/ext2resize/
* Re: BIG files & file systems
From: Matti Aarnio @ 2002-08-06 9:28 UTC (permalink / raw)
To: Albert D. Cahalan, Randy.Dunlap, Matti Aarnio, Christoph Hellwig, Peter J. Braam, linux-kernel

The original question was: "on a 64 bit machine, what are the limits?"
This "will do, won't do" with 32-bit systems is thus out of scope.
And prolonged will/won't is out of scope in every case...

Maybe somebody wants to write a concise encyclopedic article about
the issue for the LKML-FAQ?  http://www.tux.org/lkml/

/Matti Aarnio
* Re: BIG files & file systems
From: Stephen Lord @ 2002-08-05 13:04 UTC (permalink / raw)
To: Randy.Dunlap
Cc: Albert D. Cahalan, Matti Aarnio, Christoph Hellwig, Peter J. Braam, Linux Kernel Mailing List

On Fri, 2002-08-02 at 17:14, Randy.Dunlap wrote:
> For a LinuxWorld presentation in August, I have asked each of the
> 4 journaling filesystems (ext3, reiserfs, JFS, and XFS) what their
> filesystem/filesize limits are. Here's what they have told me.
>
>                       ext3fs   reiserfs   JFS     XFS
> max filesize:         16 TB#   1 EB       4 PB$   8 TB%
> max filesystem size:  2 TB     17.6 TB*   4 PB$   2 TB!
>
> Notes:
>   #: think sparse files
>   *: 4 KB blocks
>   $: 16 TB on 32-bit architectures
>   %: 4 KB pages
>   !: block device limit

Randy,

If those are the numbers you are presenting then make it clear that
for XFS those are the limits imposed by the Linux kernel. The core of
XFS itself can support files and filesystems of 9 Exabytes. I do not
think all the filesystems are reporting their numbers in the same way.

Steve
* Re: BIG files & file systems
From: Hans Reiser @ 2002-08-05 13:42 UTC (permalink / raw)
To: Stephen Lord
Cc: Randy.Dunlap, Albert D. Cahalan, Matti Aarnio, Christoph Hellwig, Peter J. Braam, Linux Kernel Mailing List

Stephen Lord wrote:
> If those are the numbers you are presenting then make it clear that
> for XFS those are the limits imposed by the Linux kernel. The core
> of XFS itself can support files and filesystems of 9 Exabytes.
> I do not think all the filesystems are reporting their numbers in
> the same way.
>
> Steve

You might also mention that I think the limits imposed by Linux are
the only meaningful ones, as we would change our limits as soon as
Linux did, and it was Linux that selected our limits for us. We would
have changed already if Linux didn't make it pointless to change it
on Intel. Reiser4 will have 64 bit blocknumbers that will be
semi-pointless until 64 bit CPUs are widely deployed, and I am simply
guessing this will be not very far into reiser4's lifecycle. Really,
the couple of #defines that constitute these size limits, plus some
surrounding code, are not such a big thing to change (except that it
constitutes a disk format change).

--
Hans
* Re: BIG files & file systems
From: Randy.Dunlap @ 2002-08-05 13:56 UTC (permalink / raw)
To: Hans Reiser
Cc: Stephen Lord, Albert D. Cahalan, Christoph Hellwig, Linux Kernel Mailing List

On Mon, 5 Aug 2002, Hans Reiser wrote:

| Stephen Lord wrote:
| >> For a LinuxWorld presentation in August, I have asked each of the
| >> 4 journaling filesystems (ext3, reiserfs, JFS, and XFS) what their
| >> filesystem/filesize limits are. Here's what they have told me.
| >>
| >>                       ext3fs   reiserfs   JFS     XFS=
| >> max filesize:         16 TB#   1 EB       4 PB$   8 TB%
| >> max filesystem size:  2 TB     17.6 TB*   4 PB$   2 TB!
| >>
| >> Notes:
| >>   #: think sparse files
| >>   *: 4 KB blocks
| >>   $: 16 TB on 32-bit architectures
| >>   %: 4 KB pages
| >>   !: block device limit

=: all limits are kernel limits (probably true for JFS and reiser also)

Albert, your graph shows that the triple-indirect limit is at 8 EB,
right?

| > Randy,
| >
| > If those are the numbers you are presenting then make it clear that
| > for XFS those are the limits imposed by the Linux kernel. The
| > core of XFS itself can support files and filesystems of 9 Exabytes.
| > I do not think all the filesystems are reporting their numbers in
| > the same way.
| >
| > Steve

Yes, that info was missing from this text-mode info, but it's already
on the slide. I will be sure to make it more obvious, and to make the
numbers more consistent.

| You might also mention that I think the limits imposed by Linux are the
| only meaningful ones, as we would change our limits as soon as Linux
| did, and it was Linux that selected our limits for us. We would have
| changed already if Linux didn't make it pointless to change it on Intel.
| Reiser4 will have 64 bit blocknumbers that will be semi-pointless until
| 64 bit CPUs are widely deployed, and I am simply guessing this will be
| not very far into reiser4's lifecycle. Really, the couple of #defines
| that constitute these size limits, plus some surrounding code, are not
| such a big thing to change (except that it constitutes a disk format
| change).

Right. I'll make the point in general that Linux internals are the
reasons for many of these limits.

Thanks,
--
~Randy
* Re: BIG files & file systems
From: Randy.Dunlap @ 2002-08-05 14:21 UTC (permalink / raw)
To: linux-kernel; +Cc: Albert D. Cahalan

On Mon, 5 Aug 2002, Randy.Dunlap wrote:

| Albert, your graph shows that the triple-indirect limit is
| at 8 EB, right?

Yes, but your text (email) explanation puts it at around 4.4 TB.
Got it. Thanks.

--
~Randy
* Re: BIG files & file systems
From: Albert D. Cahalan @ 2002-08-05 17:31 UTC (permalink / raw)
To: Randy.Dunlap; +Cc: linux-kernel, Albert D. Cahalan

Randy.Dunlap writes:
> On Mon, 5 Aug 2002, Randy.Dunlap wrote:
>> Albert, your graph shows that the triple-indirect limit is
>> at 8 EB, right?

No, that's the API limit. We use signed 64-bit byte offsets in our
API. (it's just under 8 EiB, which is about 9.2 EB)

I do see one flaw on my graph. That horizontal line at 1 TiB ought
to be at 2 TiB apparently. It's for the kernel limit, perhaps only
on 32-bit hardware. This changes the limit with 4096-byte blocks
from 1 TiB to 2 TiB, so the filesystem's 4.4 TB is still out of reach.

> Yes, but your text (email) explanation puts it at around
> 4.4 TB. Got it.

If we had quadruple indirection, then we'd hit a 17.6 TB limit
(16 TiB) due to the 32-bit block numbers. With an 8192-byte block
size, we'd hit the block number limit at 35 TB (32 TiB) before
hitting the triple-indirection limit. Of course none of this gets
you past the kernel limit at around 2.2 TB.

I believe we allow 8192-byte blocks on the Alpha. You might want to
look into that. IA-64 maybe too.
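Albert's point that a larger block size flips which limit bites first can be checked directly (an illustrative sketch, not from the thread; it compares the triple-indirection reach against the 32-bit block-number space):

```python
# Which limit bites first: triple indirection or 32-bit block numbers?
def triple_indirect_blocks(block_size, ptr_size=4):
    """Blocks addressable via direct + single/double/triple indirection."""
    ptrs = block_size // ptr_size
    return 12 + ptrs + ptrs ** 2 + ptrs ** 3

for bs in (4096, 8192):
    reach = triple_indirect_blocks(bs)
    cap = 2 ** 32                       # 32-bit block numbers
    max_bytes = min(reach, cap) * bs
    print(bs, hex(reach), max_bytes)

# 4096-byte blocks: indirection reach (0x4010040c blocks, ~4.4 TB) < 2^32,
#                   so triple indirection is the binding limit.
# 8192-byte blocks: 12 + 2048 + 2048^2 + 2048^3 > 2^32, so the block-number
#                   limit wins first, at 2^32 * 8192 = 32 TiB (~35 TB).
```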
* Re: BIG files & file systems
From: jw schultz @ 2002-08-06 0:16 UTC (permalink / raw)
To: Linux Kernel Mailing List

On Mon, Aug 05, 2002 at 05:42:18PM +0400, Hans Reiser wrote:
> You might also mention that I think the limits imposed by Linux are the
> only meaningful ones, as we would change our limits as soon as Linux
> did, and it was Linux that selected our limits for us. We would have
> changed already if Linux didn't make it pointless to change it on Intel.
> Reiser4 will have 64 bit blocknumbers that will be semi-pointless until
> 64 bit CPUs are widely deployed, and I am simply guessing this will be
> not very far into reiser4's lifecycle. Really, the couple of #defines
> that constitute these size limits, plus some surrounding code, are not
> such a big thing to change (except that it constitutes a disk format
> change).

Hans,

My recollection is that reiser4 isn't released yet. Why not set the
reiser4 disk format with 64 bit blocknumbers from dot? 32 bit archs
could write zeros and otherwise ignore the upper 32 bits, and refuse
to mount if the filesystem size would cause overflow. That way you
avoid an on-disk format change mid-cycle. That seems a lot less
overhead than coping with different datatypes.

Of course, if you'd rather support another on-disk format to squeeze
a bit more data onto small drives, I can understand.

--
J.W. Schultz            Pegasystems Technologies
email address:          jw@pegasys.ws

        Remember Cernan and Schmitt
* Re: BIG files & file systems
From: Hans Reiser @ 2002-08-06 9:48 UTC (permalink / raw)
To: jw schultz; +Cc: Linux Kernel Mailing List

jw schultz wrote:
> My recollection is that reiser4 isn't released yet. Why not
> set the reiser4 disk format with 64 bit blocknumbers from
> dot? 32 bit archs could write zeros and otherwise ignore
> the upper 32 bits and refuse to mount if filesystem size
> would cause overflow. That way you avoid on-disk format
> change mid cycle. That seems a lot less overhead than
> coping with different datatypes.

We are using 64 bit blocknumbers in reiser4, and letting Linux limit
them. Perhaps my writing style was rather lacking in clarity.....
Linux is going to use some hacks in 2.5 that will let it go moderately
above the 2.4 limits. 64 bit blocknumbers seem the most flexible thing
in the face of what will be ever-evolving hacks, followed by the
introduction of 64 bit CPUs into the mainstream.

--
Hans
* Re: BIG files & file systems
From: Jan Harkes @ 2002-07-31 21:07 UTC (permalink / raw)
To: Peter J. Braam; +Cc: linux-kernel

On Wed, Jul 31, 2002 at 01:16:20PM -0600, Peter J. Braam wrote:
> I've just been told that some "limitations" of the following kind will
> remain:
>
>   page index = unsigned long
>   ino_t = unsigned long

The number of files is not limited by ino_t, just look at the
iget5_locked operation in fs/inode.c. It is possible to have your own
n-bit file identifier, and simply provide your own comparison function.
The ino_t then becomes the 'hash-bucket' in which the actual inode is
looked up.

For the page_index, maybe at some point someone manages to cleanly mix
large pages (2MB?) with the current 4KB pages. Very large files could
then use the page_index as an index into these large pages, which
should allow for 9PB files (or something close to that).

Jan
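Jan's hash-bucket scheme can be sketched as a toy model (this is not the kernel's iget5_locked code, just an illustration of the idea; all names here are hypothetical): the 32-bit inode number is only the bucket, and the filesystem's own wide identifier is what the comparison callback actually matches.

```python
# Toy model of the iget5_locked idea: ino_t is a hash bucket, and a
# filesystem-supplied wide identifier disambiguates within the bucket.
class InodeCache:
    def __init__(self):
        self.buckets = {}                 # 32-bit bucket -> list of inodes

    def iget5(self, wide_id):
        bucket = wide_id & 0xFFFFFFFF     # hashval: what ends up in st_ino
        for inode in self.buckets.setdefault(bucket, []):
            if inode["fid"] == wide_id:   # the 'test' callback's comparison
                return inode
        inode = {"fid": wide_id, "ino": bucket}   # the 'set' callback
        self.buckets[bucket].append(inode)
        return inode

cache = InodeCache()
a = cache.iget5(0x1_0000_0001)            # two distinct wide identifiers...
b = cache.iget5(0x2_0000_0001)            # ...landing in the same bucket
print(a is b)                             # False: two different files
print(a["ino"] == b["ino"])               # True: duplicate st_ino, Viro's point below
```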
* Re: BIG files & file systems
From: Alexander Viro @ 2002-07-31 21:13 UTC (permalink / raw)
To: Jan Harkes; +Cc: Peter J. Braam, linux-kernel

On Wed, 31 Jul 2002, Jan Harkes wrote:
> The number of files is not limited by ino_t, just look at the
> iget5_locked operation in fs/inode.c. It is possible to have your own
> n-bit file identifier, and simply provide your own comparison function.
> The ino_t then becomes the 'hash-bucket' in which the actual inode is
> looked up.

You _do_ need unique ->st_ino from stat(2), though - otherwise tar(1)
and friends will break in all sorts of amusing ways. And there's
nothing the kernel can do about that - applications expect 32bit
st_ino (compare them as 32bit values, etc.)
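The breakage Viro describes comes from archivers treating (st_dev, st_ino) as a unique file identity to detect hard links; if the filesystem hands out duplicate inode numbers, unrelated files look like links. A small illustration (an assumption-level sketch, not code from the thread):

```python
# How tar-like tools detect hard links: (st_dev, st_ino) as identity.
import os
import tempfile

d = tempfile.mkdtemp()
f1 = os.path.join(d, "a")
with open(f1, "w") as fh:
    fh.write("x")
f2 = os.path.join(d, "b")
os.link(f1, f2)                   # a genuine hard link to "a"
f3 = os.path.join(d, "c")
with open(f3, "w") as fh:
    fh.write("y")

def identity(path):
    st = os.stat(path)
    return (st.st_dev, st.st_ino)

print(identity(f1) == identity(f2))   # True: archived as a link, correctly
print(identity(f1) == identity(f3))   # False here; but if the filesystem
                                      # recycled st_ino, "c" would wrongly be
                                      # archived as a hard link to "a"
```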
* Re: BIG files & file systems 2002-07-31 21:13 ` Alexander Viro @ 2002-08-01 3:51 ` Jan Harkes 2002-08-01 12:01 ` Mark Mielke 2002-08-02 0:09 ` Stephen Lord 0 siblings, 2 replies; 38+ messages in thread From: Jan Harkes @ 2002-08-01 3:51 UTC (permalink / raw) To: Alexander Viro; +Cc: Peter J. Braam, linux-kernel On Wed, Jul 31, 2002 at 05:13:46PM -0400, Alexander Viro wrote: > On Wed, 31 Jul 2002, Jan Harkes wrote: > > On Wed, Jul 31, 2002 at 01:16:20PM -0600, Peter J. Braam wrote: > > > I've just been told that some "limitations" of the following kind will > > > remain: > > > page index = unsigned long > > > ino_t = unsigned long > > > > The number of files is not limited by ino_t, just look at the > > iget5_locked operation in fs/inode.c. It is possible to have your own > > n-bit file identifier, and simply provide your own comparison function. > > The ino_t then becomes the 'hash-bucket' in which the actual inode is > > looked up. > > You _do_ need unique ->st_ino from stat(2), though - otherwise tar(1) > and friends will break in all sorts of amusing ways. And there's > nothing kernel can do about that - applications expect 32bit st_ino > (compare them as 32bit values, etc.) Which is why "tar and friends" are to different extents already broken on various filesystems like Coda, NFS, NTFS, ReiserFS, and probably XFS. (i.e. anything that currently uses iget5_locked instead of iget to grab the inode). Jan ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: BIG files & file systems 2002-08-01 3:51 ` Jan Harkes @ 2002-08-01 12:01 ` Mark Mielke 2002-08-02 0:09 ` Stephen Lord 1 sibling, 0 replies; 38+ messages in thread From: Mark Mielke @ 2002-08-01 12:01 UTC (permalink / raw) To: Alexander Viro, Peter J. Braam, linux-kernel On Wed, Jul 31, 2002 at 11:51:19PM -0400, Jan Harkes wrote: > On Wed, Jul 31, 2002 at 05:13:46PM -0400, Alexander Viro wrote: > > You _do_ need unique ->st_ino from stat(2), though - otherwise tar(1) > > and friends will break in all sorts of amusing ways. And there's > > nothing kernel can do about that - applications expect 32bit st_ino > > (compare them as 32bit values, etc.) > Which is why "tar and friends" are to different extents already broken > on various filesystems like Coda, NFS, NTFS, ReiserFS, and probably XFS. > (i.e. anything that currently uses iget5_locked instead of iget to grab > the inode). In theory? Maybe. In practice, a lot more than just "tar and friends" assume that inodes are unique... mark (who recently, *continues* to write code that makes this assumption, although, granted, most of the checks are 'file caching'-type checks, and it isn't likely that a file will be the same size, the same inode, the same device, and the same path...) -- mark@mielke.cc/markm@ncf.ca/markm@nortelnetworks.com __________________________ . . _ ._ . . .__ . . ._. .__ . . . .__ | Neighbourhood Coder |\/| |_| |_| |/ |_ |\/| | |_ | |/ |_ | | | | | | \ | \ |__ . | | .|. |__ |__ | \ |__ | Ottawa, Ontario, Canada One ring to rule them all, one ring to find them, one ring to bring them all and in the darkness bind them... http://mark.mielke.cc/ ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: BIG files & file systems 2002-08-01 3:51 ` Jan Harkes 2002-08-01 12:01 ` Mark Mielke @ 2002-08-02 0:09 ` Stephen Lord 2002-08-02 12:17 ` Chris Mason 2002-08-02 13:56 ` Jan Harkes 1 sibling, 2 replies; 38+ messages in thread From: Stephen Lord @ 2002-08-02 0:09 UTC (permalink / raw) To: Jan Harkes; +Cc: Alexander Viro, Peter J. Braam, Linux Kernel On Wed, 2002-07-31 at 22:51, Jan Harkes wrote: > On Wed, Jul 31, 2002 at 05:13:46PM -0400, Alexander Viro wrote: > > On Wed, 31 Jul 2002, Jan Harkes wrote: > > > On Wed, Jul 31, 2002 at 01:16:20PM -0600, Peter J. Braam wrote: > > > > I've just been told that some "limitations" of the following kind will > > > > remain: > > > > page index = unsigned long > > > > ino_t = unsigned long > > > > > > The number of files is not limited by ino_t, just look at the > > > iget5_locked operation in fs/inode.c. It is possible to have your own > > > n-bit file identifier, and simply provide your own comparison function. > > > The ino_t then becomes the 'hash-bucket' in which the actual inode is > > > looked up. > > > > You _do_ need unique ->st_ino from stat(2), though - otherwise tar(1) > > and friends will break in all sorts of amusing ways. And there's > > nothing kernel can do about that - applications expect 32bit st_ino > > (compare them as 32bit values, etc.) > > Which is why "tar and friends" are to different extents already broken > on various filesystems like Coda, NFS, NTFS, ReiserFS, and probably XFS. > (i.e. anything that currently uses iget5_locked instead of iget to grab > the inode). Why are they broken? In the case of XFS at least you still get a unique and stable inode number back - and it fits in 32 bits too. Steve ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: BIG files & file systems 2002-08-02 0:09 ` Stephen Lord @ 2002-08-02 12:17 ` Chris Mason 2002-08-02 12:33 ` Anton Altaparmakov 2002-08-02 13:56 ` Jan Harkes 1 sibling, 1 reply; 38+ messages in thread From: Chris Mason @ 2002-08-02 12:17 UTC (permalink / raw) To: Stephen Lord; +Cc: Jan Harkes, Alexander Viro, Peter J. Braam, Linux Kernel On Thu, 2002-08-01 at 20:09, Stephen Lord wrote: > > > You _do_ need unique ->st_ino from stat(2), though - otherwise tar(1) > > > and friends will break in all sorts of amusing ways. And there's > > > nothing kernel can do about that - applications expect 32bit st_ino > > > (compare them as 32bit values, etc.) > > > > Which is why "tar and friends" are to different extents already broken > > on various filesystems like Coda, NFS, NTFS, ReiserFS, and probably XFS. > > (i.e. anything that currently uses iget5_locked instead of iget to grab > > the inode). > > Why are they broken? In the case of XFS at least you still get a unique > and stable inode number back - and it fits in 32 bits too. reiserfs is not broken here. It has unique stable 32 bit inode numbers, but looking up the file on disk requires 64 bits of information. -chris ^ permalink raw reply [flat|nested] 38+ messages in thread
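Chris's point — a unique 32-bit number for user space while the on-disk lookup needs 64 bits — can be illustrated as follows. Field names are loosely modelled on reiserfs keys and are not exact:

```c
#include <stdint.h>

/* A two-part on-disk key: user space only ever sees half of it. */
struct rfs_key {
    uint32_t dirid;      /* parent directory's id */
    uint32_t objectid;   /* the object itself     */
};

/* What stat(2) reports: unique and stable, because the filesystem
 * allocates objectids uniquely -- but only half the lookup data. */
static uint32_t key_to_ino(const struct rfs_key *k)
{
    return k->objectid;
}

/* What the filesystem actually needs to locate the file on disk. */
static uint64_t key_to_disk_cookie(const struct rfs_key *k)
{
    return ((uint64_t)k->dirid << 32) | k->objectid;
}
```

This is why such a filesystem wants iget5_locked-style lookup internally (it must carry the full key to its read_inode path) even though the exported inode number fits comfortably in 32 bits.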
* Re: BIG files & file systems 2002-08-02 12:17 ` Chris Mason @ 2002-08-02 12:33 ` Anton Altaparmakov 0 siblings, 0 replies; 38+ messages in thread From: Anton Altaparmakov @ 2002-08-02 12:33 UTC (permalink / raw) To: Chris Mason Cc: Stephen Lord, Jan Harkes, Alexander Viro, Peter J. Braam, Linux Kernel At 13:17 02/08/02, Chris Mason wrote: >On Thu, 2002-08-01 at 20:09, Stephen Lord wrote: > > > > > You _do_ need unique ->st_ino from stat(2), though - otherwise tar(1) > > > > and friends will break in all sorts of amusing ways. And there's > > > > nothing kernel can do about that - applications expect 32bit st_ino > > > > (compare them as 32bit values, etc.) > > > > > > Which is why "tar and friends" are to different extents already broken > > > on various filesystems like Coda, NFS, NTFS, ReiserFS, and probably XFS. > > > (i.e. anything that currently uses iget5_locked instead of iget to grab > > > the inode). > > > > Why are they broken? In the case of XFS at least you still get a unique > > and stable inode number back - and it fits in 32 bits too. > >reiserfs is not broken here. It has unique stable 32 bit inode numbers, >but looking up the file on disk requires 64 bits of information. ntfs is not broken here, either. It also uses unique stable 32 bit inode numbers, but inside the driver (not visible to user space at all at present), we use additional, fake inodes. But tar and friends will never see those so there is no problem... Anton -- "I've not lost my mind. It's backed up on tape somewhere." - Unknown -- Anton Altaparmakov <aia21 at cantab.net> (replace at with @) Linux NTFS Maintainer / IRC: #ntfs on irc.openprojects.net WWW: http://linux-ntfs.sf.net/ & http://www-stu.christs.cam.ac.uk/~aia21/ ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: BIG files & file systems 2002-08-02 0:09 ` Stephen Lord 2002-08-02 12:17 ` Chris Mason @ 2002-08-02 13:56 ` Jan Harkes 2002-08-02 14:06 ` Steve Lord 1 sibling, 1 reply; 38+ messages in thread From: Jan Harkes @ 2002-08-02 13:56 UTC (permalink / raw) To: Stephen Lord; +Cc: Alexander Viro, Peter J. Braam, Linux Kernel On Thu, Aug 01, 2002 at 07:09:37PM -0500, Stephen Lord wrote: > On Wed, 2002-07-31 at 22:51, Jan Harkes wrote: > > On Wed, Jul 31, 2002 at 05:13:46PM -0400, Alexander Viro wrote: > > > You _do_ need unique ->st_ino from stat(2), though - otherwise tar(1) > > > and friends will break in all sorts of amusing ways. And there's > > > nothing kernel can do about that - applications expect 32bit st_ino > > > (compare them as 32bit values, etc.) > > > > Which is why "tar and friends" are to different extents already broken > > on various filesystems like Coda, NFS, NTFS, ReiserFS, and probably XFS. > > (i.e. anything that currently uses iget5_locked instead of iget to grab > > the inode). > > Why are they broken? In the case of XFS at least you still get a unique > and stable inode number back - and it fits in 32 bits too. I was simply assuming that any filesystem that is using iget5 and doesn't use the simpler iget helper has some reason why it cannot find an inode given just the 32-bit ino_t. This is definitely true for Coda, we have 96-bit file identifiers. Actually my development tree currently uses 128-bit, it is aware of multiple administrative realms and distinguishes between objects with FID 0x7f000001.0x1.0x1 in different administrative domains. There is a hash-function that tries to map these large FIDs into the 32-bit ino_t space with as few collisions as possible. NFS has a >32-bit filehandle. ReiserFS might have unique inodes, but seems to need access to the directory to find them. So I don't quickly see how it would guarantee uniqueness. 
NTFS actually doesn't seem to use iget5 yet, but it has multiple streams per object which would probably end up using the same ino_t. I haven't looked at XFS, but as all in-tree filesystems that use iget5_locked have potential ino_t collisions, I was assuming XFS would fit in the same category. Userspace applications should have an option to ignore hardlinks. Very large filesystems either don't care because there is plenty of space, don't support them across boundaries that are not visible to the application, or could be dealing with them automatically (COW links). Besides, if I really have a trillion files, I don't want 'tar and friends' to try to keep track of all those inode numbers (and device numbers) in memory. The other solution is that applications can actually use more of the information from the inode to avoid confusion, like st_nlink and st_mtime, which are useful when the filesystem is still mounted rw as well. And to make it even better, st_uid, st_gid, st_size, st_blocks and st_ctime, and an MD5/SHA checksum. Although this obviously would become even worse for the trillion-file backup case. Jan ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: BIG files & file systems 2002-08-02 13:56 ` Jan Harkes @ 2002-08-02 14:06 ` Steve Lord 2002-08-02 15:10 ` Hans Reiser 0 siblings, 1 reply; 38+ messages in thread From: Steve Lord @ 2002-08-02 14:06 UTC (permalink / raw) To: Jan Harkes; +Cc: Alexander Viro, Peter J. Braam, Linux Kernel On Fri, 2002-08-02 at 08:56, Jan Harkes wrote: > > I was simply assuming that any filesystem that is using iget5 and > doesn't use the simpler iget helper has some reason why it cannot find > an inode given just the 32-bit ino_t. In XFS's case (remember, the iget5 code is based on XFS changes) it is more a matter of the code to read the inode sometimes needing to pass other info down to the read_inode part of the filesystem, so we want to do that internally. XFS can have 64 bit inode numbers, but you need more than 1 Tbyte in an fs to get that big (inode numbers are a disk address). We also have code which keeps them in the bottom 1 Tbyte which is turned on by default on Linux. > > This is definitely true for Coda, we have 96-bit file identifiers. > Actually my development tree currently uses 128-bit, it is aware of > multiple administrative realms and distinguishes between objects with > FID 0x7f000001.0x1.0x1 in different administrative domains. There is a > hash-function that tries to map these large FIDs into the 32-bit ino_t > space with as few collisions as possible. > > NFS has a >32-bit filehandle. ReiserFS might have unique inodes, but > seems to need access to the directory to find them. So I don't quickly > see how it would guarantee uniqueness. NTFS actually doesn't seem to use > iget5 yet, but it has multiple streams per object which would probably > end up using the same ino_t. > > Userspace applications should either have an option to ignore hardlinks. 
> Very large filesystems either don't care because there is plenty of > space, don't support them across boundaries that are not visible to the > application, or could be dealing with them them automatically (COW > links). Besides, if I really have a trillion files, I don't want 'tar > and friends' to try to keep track of all those inode numbers (and device > numbers) in memory. > > The other solution is that applications can actually use more of the > information from the inode to avoid confusion, like st_nlink and > st_mtime, which are useful when the filesystem is still mounted rw as > well. And to make it even better, st_uid, st_gid, st_size, st_blocks and > st_ctime, and a MD5/SHA checksum. Although this obviously would become > even worse for the trillion file backup case. If apps would have to change then I would vote for allowing larger inodes out of the kernel in an extended version of stat and getdents. I was going to say 64 bit versions, but if even 64 is not enough for you, it is getting a little hard to handle. Steve > Jan -- Steve Lord voice: +1-651-683-3511 Principal Engineer, Filesystem Software email: lord@sgi.com ^ permalink raw reply [flat|nested] 38+ messages in thread
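Steve's suggestion amounts to a new syscall pair returning a widened structure rather than changing the existing ABI. The layout below is purely illustrative, not any real ABI (Linux did eventually grow statx(2) in a similar spirit, with a 64-bit inode field):

```c
#include <stdint.h>

/* Hypothetical widened stat result: every field that can plausibly
 * overflow gets 64 bits.  Illustrative only -- not a real kernel ABI. */
struct xstat {
    uint64_t st_dev;
    uint64_t st_ino;      /* room for XFS's 64-bit inode numbers */
    uint64_t st_size;
    uint64_t st_blocks;
    uint32_t st_nlink;
    uint32_t st_mode;
};
```

Existing binaries keep the old 32-bit stat; only applications that opt into the extended call see (and must correctly compare) the wide inode numbers.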
* Re: BIG files & file systems 2002-08-02 14:06 ` Steve Lord @ 2002-08-02 15:10 ` Hans Reiser 2002-08-02 15:39 ` Trond Myklebust 0 siblings, 1 reply; 38+ messages in thread From: Hans Reiser @ 2002-08-02 15:10 UTC (permalink / raw) To: Steve Lord; +Cc: Jan Harkes, Alexander Viro, Peter J. Braam, Linux Kernel There are a number of interfaces that need expansion in 2.5. Telldir and seekdir would be much better if they took as argument some filesystem specific opaque cookie (e.g. filename). Using a byte offset to reference a directory entry that was found with a filename is an implementation specific artifact that obviously only works for a ufs/s5fs/ext2 type of filesystem, and is just wrong. 4 billion files is not enough to store the government's XML databases in. Hans Steve Lord wrote: >On Fri, 2002-08-02 at 08:56, Jan Harkes wrote: > > >>I was simply assuming that any filesystem that is using iget5 and >>doesn't use the simpler iget helper has some reason why it cannot find >>an inode given just the 32-bit ino_t. >> >> > >In XFS's case (remember, the iget5 code is based on XFS changes) it is >more a matter of the code to read the inode sometimes needing to pass >other info down to the read_inode part of the filesystem, so we want to >do that internally. XFS can have 64 bit inode numbers, but you need more >than 1 Tbyte in an fs to get that big (inode numbers are a disk >address). We also have code which keeps them in the bottom 1 Tbyte >which is turned on by default on Linux. > > > >>This is definitely true for Coda, we have 96-bit file identifiers. >>Actually my development tree currently uses 128-bit, it is aware of >>multiple administrative realms and distinguishes between objects with >>FID 0x7f000001.0x1.0x1 in different administrative domains. There is a >>hash-function that tries to map these large FIDs into the 32-bit ino_t >>space with as few collisions as possible. >> >>NFS has a >32-bit filehandle. 
ReiserFS might have unique inodes, but >>seems to need access to the directory to find them. So I don't quickly >>see how it would guarantee uniqueness. NTFS actually doesn't seem to use >>iget5 yet, but it has multiple streams per object which would probably >>end up using the same ino_t. >> >>Userspace applications should either have an option to ignore hardlinks. >>Very large filesystems either don't care because there is plenty of >>space, don't support them across boundaries that are not visible to the >>application, or could be dealing with them them automatically (COW >>links). Besides, if I really have a trillion files, I don't want 'tar >>and friends' to try to keep track of all those inode numbers (and device >>numbers) in memory. >> >>The other solution is that applications can actually use more of the >>information from the inode to avoid confusion, like st_nlink and >>st_mtime, which are useful when the filesystem is still mounted rw as >>well. And to make it even better, st_uid, st_gid, st_size, st_blocks and >>st_ctime, and a MD5/SHA checksum. Although this obviously would become >>even worse for the trillion file backup case. >> >> > >If apps would have to change then I would vote for allowing larger >inodes out of the kernel in an extended version of stat and getdents. >I was going to say 64 bit versions, but if even 64 is not enough for >you, it is getting a little hard to handle. > >Steve > > > >>Jan >> >> -- Hans ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: BIG files & file systems 2002-08-02 15:10 ` Hans Reiser @ 2002-08-02 15:39 ` Trond Myklebust 2002-08-02 17:01 ` Hans Reiser 0 siblings, 1 reply; 38+ messages in thread From: Trond Myklebust @ 2002-08-02 15:39 UTC (permalink / raw) To: Hans Reiser Cc: Steve Lord, Jan Harkes, Alexander Viro, Peter J. Braam, Linux Kernel >>>>> " " == Hans Reiser <reiser@namesys.com> writes: > There are a number of interfaces that need expansion in 2.5. > Telldir and seekdir would be much better if they took as > argument some filesystem specific opaque cookie > (e.g. filename). Using a byte offset to reference a directory > entry that was found with a filename is an implementation > specific artifact that obviously only works for a ufs/s5fs/ext2 > type of filesystem, and is just wrong. > 4 billion files is not enough to store the government's XML > databases in. That's more of a glibc-specific bug. Most other libc implementations appear to be quite capable of providing a userspace 'readdir()' which doesn't ever use the lseek() syscall. Note however that NFS compatibility *does* provide a limitation here: the cookies that are passed between client and server are limited to 32 bits (NFSv2) or 64 bits (NFSv3/v4), so you'll be wanting to provide some hack to get around this... Cheers, Trond ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: BIG files & file systems 2002-08-02 15:39 ` Trond Myklebust @ 2002-08-02 17:01 ` Hans Reiser 2002-08-02 17:25 ` Nikita Danilov 0 siblings, 1 reply; 38+ messages in thread From: Hans Reiser @ 2002-08-02 17:01 UTC (permalink / raw) To: Trond Myklebust Cc: Steve Lord, Jan Harkes, Alexander Viro, Peter J. Braam, Linux Kernel Trond Myklebust wrote: > > 4 billion files is not enough to store the government's XML > > databases in. > >That's more of a glibc-specific bug. Most other libc implementations >appear to be quite capable of providing a userspace 'readdir()' which >doesn't ever use the lseek() syscall. > Interesting. Thanks for the info. -- Hans ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: BIG files & file systems 2002-08-02 17:01 ` Hans Reiser @ 2002-08-02 17:25 ` Nikita Danilov 2002-08-02 17:47 ` Trond Myklebust 0 siblings, 1 reply; 38+ messages in thread From: Nikita Danilov @ 2002-08-02 17:25 UTC (permalink / raw) To: Hans Reiser Cc: Trond Myklebust, Steve Lord, Jan Harkes, Alexander Viro, Peter J. Braam, Linux Kernel Hans Reiser writes: > Trond Myklebust wrote: > > > > 4 billion files is not enough to store the government's XML > > > databases in. > > > >That's more of a glibc-specific bug. Most other libc implementations > >appear to be quite capable of providing a userspace 'readdir()' which > >doesn't ever use the lseek() syscall. > > > Interesting. Thanks for the info. But there still is a problem with applications (if any) calling seekdir/telldir directly... > > -- > Hans > Nikita. > ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: BIG files & file systems 2002-08-02 17:25 ` Nikita Danilov @ 2002-08-02 17:47 ` Trond Myklebust 2002-08-02 18:10 ` Nikita Danilov 0 siblings, 1 reply; 38+ messages in thread From: Trond Myklebust @ 2002-08-02 17:47 UTC (permalink / raw) To: Nikita Danilov Cc: Hans Reiser, Steve Lord, Jan Harkes, Alexander Viro, Peter J. Braam, Linux Kernel >>>>> " " == Nikita Danilov <Nikita@Namesys.COM> writes: > But there still is a problem with applications (if any) calling > seekdir/telldir directly... Agreed. Note however that the semantics for seekdir/telldir as specified by SUSv2 are much weaker than those in our current getdents()+lseek(). >From the Opengroup documentation for seekdir, it states that: On systems that conform to the Single UNIX Specification, Version 2, a subsequent call to readdir() may not be at the desired position if the value of loc was not obtained from an earlier call to telldir(), or if a call to rewinddir() occurred between the call to telldir() and the call to seekdir(). IOW assigning a unique offset to each and every entry in the directory is overkill (unless the user is calling telldir() for all those entries). Cheers, Trond ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: BIG files & file systems 2002-08-02 17:47 ` Trond Myklebust @ 2002-08-02 18:10 ` Nikita Danilov 2002-08-02 18:31 ` Hans Reiser 0 siblings, 1 reply; 38+ messages in thread From: Nikita Danilov @ 2002-08-02 18:10 UTC (permalink / raw) To: trond.myklebust Cc: Hans Reiser, Steve Lord, Jan Harkes, Alexander Viro, Peter J. Braam, Linux Kernel Trond Myklebust writes: > >>>>> " " == Nikita Danilov <Nikita@Namesys.COM> writes: > > > But there still is a problem with applications (if any) calling > > seekdir/telldir directly... > > Agreed. Note however that the semantics for seekdir/telldir as > specified by SUSv2 are much weaker than those in our current > getdents()+lseek(). > > >From the Opengroup documentation for seekdir, it states that: > > On systems that conform to the Single UNIX Specification, Version 2, > a subsequent call to readdir() may not be at the desired position if > the value of loc was not obtained from an earlier call to telldir(), > or if a call to rewinddir() occurred between the call to telldir() > and the call to seekdir(). > > IOW assigning a unique offset to each and every entry in the directory > is overkill (unless the user is calling telldir() for all those > entries). Are you implying some kind of ->telldir() file operation that notifies the file-system that the user intends to later restart readdir from the "current" position, and changing glibc to call sys_telldir/sys_seekdir instead of lseek? This will allow file-systems like reiser4 that cannot restart readdir from 32 bits' worth of data to, at least, allocate something in the kernel on a call to ->telldir() and free it in ->release(). > > Cheers, > Trond Nikita. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: BIG files & file systems 2002-08-02 18:10 ` Nikita Danilov @ 2002-08-02 18:31 ` Hans Reiser 2002-08-02 18:48 ` Nikita Danilov 0 siblings, 1 reply; 38+ messages in thread From: Hans Reiser @ 2002-08-02 18:31 UTC (permalink / raw) To: Nikita Danilov Cc: trond.myklebust, Steve Lord, Jan Harkes, Alexander Viro, Peter J. Braam, Linux Kernel Nikita Danilov wrote: >Trond Myklebust writes: > > >>>>> " " == Nikita Danilov <Nikita@Namesys.COM> writes: > > > > > But there still is a problem with applications (if any) calling > > > seekdir/telldir directly... > > > > Agreed. Note however that the semantics for seekdir/telldir as > > specified by SUSv2 are much weaker than those in our current > > getdents()+lseek(). > > > > >From the Opengroup documentation for seekdir, it states that: > > > > On systems that conform to the Single UNIX Specification, Version 2, > > a subsequent call to readdir() may not be at the desired position if > > the value of loc was not obtained from an earlier call to telldir(), > > or if a call to rewinddir() occurred between the call to telldir() > > and the call to seekdir(). > > > > IOW assigning a unique offset to each and every entry in the directory > > is overkill (unless the user is calling telldir() for all those > > entries). > Forgive the really dumb question, but does this mean we can just store the last entry returned to readdir in the directory metadata, and completely ignore the value of loc? > >Are you implying some kind of ->telldir() file operation that notifies >file-system that user has intention to later restart readdir from the >"current" position and changing glibc to call sys_telldir/sys_seekdir in >stead of lseek? This will allow file-systems like reiser4 that cannot >restart readdir from 32bitsful of data to, at least, allocate something >in kernel on call to ->telldir() and free in ->release(). > > > > > Cheers, > > Trond > >Nikita. > > > > -- Hans ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: BIG files & file systems 2002-08-02 18:31 ` Hans Reiser @ 2002-08-02 18:48 ` Nikita Danilov 2002-08-02 18:59 ` Hans Reiser 0 siblings, 1 reply; 38+ messages in thread From: Nikita Danilov @ 2002-08-02 18:48 UTC (permalink / raw) To: Hans Reiser Cc: trond.myklebust, Steve Lord, Jan Harkes, Alexander Viro, Peter J. Braam, Linux Kernel Hans Reiser writes: > Nikita Danilov wrote: > > >Trond Myklebust writes: > > > >>>>> " " == Nikita Danilov <Nikita@Namesys.COM> writes: > > > > > > > But there still is a problem with applications (if any) calling > > > > seekdir/telldir directly... > > > > > > Agreed. Note however that the semantics for seekdir/telldir as > > > specified by SUSv2 are much weaker than those in our current > > > getdents()+lseek(). > > > > > > >From the Opengroup documentation for seekdir, it states that: > > > > > > On systems that conform to the Single UNIX Specification, Version 2, > > > a subsequent call to readdir() may not be at the desired position if > > > the value of loc was not obtained from an earlier call to telldir(), > > > or if a call to rewinddir() occurred between the call to telldir() > > > and the call to seekdir(). > > > > > > IOW assigning a unique offset to each and every entry in the directory > > > is overkill (unless the user is calling telldir() for all those > > > entries). > > > Forgive the really dumb question, but does this mean we can just store > the last entry returned to readdir in the directory metadata, and > completely ignore the value of loc? If application is using readdir, then yes: glibc internally maps readdir into getdents plus at most one lseek on directory for "adjustment" purposes (if I remember correctly, problem is that kernel struct dirent has extra field and glibc cannot tell in advance how many of them will fit into supplied user buffer). But if application uses seekdir(3)/telldir(3) directly---then no. 
> > > > >Are you implying some kind of ->telldir() file operation that notifies > >file-system that user has intention to later restart readdir from the > >"current" position and changing glibc to call sys_telldir/sys_seekdir in > >stead of lseek? This will allow file-systems like reiser4 that cannot > >restart readdir from 32bitsful of data to, at least, allocate something > >in kernel on call to ->telldir() and free in ->release(). > > > > > > > > Cheers, > > > Trond > > Nikita. > > > > > > > > > > > -- > Hans > > > ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: BIG files & file systems 2002-08-02 18:48 ` Nikita Danilov @ 2002-08-02 18:59 ` Hans Reiser 0 siblings, 0 replies; 38+ messages in thread From: Hans Reiser @ 2002-08-02 18:59 UTC (permalink / raw) To: Nikita Danilov Cc: trond.myklebust, Steve Lord, Jan Harkes, Alexander Viro, Peter J. Braam, Linux Kernel Nikita Danilov wrote: >Hans Reiser writes: > > Nikita Danilov wrote: > > > > >Trond Myklebust writes: > > > > >>>>> " " == Nikita Danilov <Nikita@Namesys.COM> writes: > > > > > > > > > But there still is a problem with applications (if any) calling > > > > > seekdir/telldir directly... > > > > > > > > Agreed. Note however that the semantics for seekdir/telldir as > > > > specified by SUSv2 are much weaker than those in our current > > > > getdents()+lseek(). > > > > > > > > >From the Opengroup documentation for seekdir, it states that: > > > > > > > > On systems that conform to the Single UNIX Specification, Version 2, > > > > a subsequent call to readdir() may not be at the desired position if > > > > the value of loc was not obtained from an earlier call to telldir(), > > > > or if a call to rewinddir() occurred between the call to telldir() > > > > and the call to seekdir(). > > > > > > > > IOW assigning a unique offset to each and every entry in the directory > > > > is overkill (unless the user is calling telldir() for all those > > > > entries). > > > > > Forgive the really dumb question, but does this mean we can just store > > the last entry returned to readdir in the directory metadata, and > > completely ignore the value of loc? > >If application is using readdir, then yes: glibc internally maps readdir >into getdents plus at most one lseek on directory for "adjustment" >purposes (if I remember correctly, problem is that kernel struct dirent >has extra field and glibc cannot tell in advance how many of them will >fit into supplied user buffer). > >But if application uses seekdir(3)/telldir(3) directly---then no. 
> It sounds like we could store the loc to key mapping in the file handle (a (partial) key is what reiser4 needs to find a directory entry). I am trying to understand if we need to store more than one loc to key mapping in the file handle, or if one is enough. What do people use telldir()/seekdir() for in practice? > > > > > > > > >Are you implying some kind of ->telldir() file operation that notifies > > >file-system that user has intention to later restart readdir from the > > >"current" position and changing glibc to call sys_telldir/sys_seekdir in > > >stead of lseek? This will allow file-systems like reiser4 that cannot > > >restart readdir from 32bitsful of data to, at least, allocate something > > >in kernel on call to ->telldir() and free in ->release(). > > > > > > > > > > > Cheers, > > > > Trond > > > > >Nikita. > > > > > > > > > > > > > > > > > > > -- > > Hans > > > > > > > > > > -- Hans ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: BIG files & file systems 2002-07-31 19:16 BIG files & file systems Peter J. Braam 2002-07-31 19:26 ` Christoph Hellwig 2002-07-31 21:07 ` Jan Harkes @ 2002-08-01 12:01 ` David Woodhouse 2002-08-01 20:33 ` Andrew Morton 3 siblings, 0 replies; 38+ messages in thread From: David Woodhouse @ 2002-08-01 12:01 UTC (permalink / raw) To: Peter J. Braam; +Cc: linux-kernel braam@clusterfs.com said: > (you don't want to know who, besides I've no idea how many bits go in > a trillion, but it's more than 32). It all gets a little confusing after 'million'. Either you mean a US 'trillion', which is a European 'billion'; 10^12. Or you mean a European 'trillion', which is a US 'quintillion'; 10^18. In general, it's best to stick to the numeric form if it's greater than 10^6. With the possible exception of using 'milliard' for 10^9, which may cause the recipient to have to look up the word, but won't cause it to be misinterpreted as 10^12 by non-usians. http://216.239.33.100/search?q=cache:rwJFJLB7ZnoC:www.reportercentral.com/reference/vocabulary/numbernames.html -- dwmw2 ^ permalink raw reply [flat|nested] 38+ messages in thread
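Either reading of "trillion" settles Peter's "how many bits" question the same way — more than 32. The arithmetic is small:

```c
#include <stdint.h>

/* Number of bits needed to represent n (position of the highest set
 * bit plus one); bits_needed(0) is 0. */
static unsigned bits_needed(uint64_t n)
{
    unsigned bits = 0;
    while (n) {
        n >>= 1;
        bits++;
    }
    return bits;
}
```

A US trillion (10^12) needs 40 bits; a European trillion (10^18) needs 60. Both fit in 64 bits, neither in 32.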
* Re: BIG files & file systems 2002-07-31 19:16 BIG files & file systems Peter J. Braam ` (2 preceding siblings ...) 2002-08-01 12:01 ` David Woodhouse @ 2002-08-01 20:33 ` Andrew Morton 3 siblings, 0 replies; 38+ messages in thread From: Andrew Morton @ 2002-08-01 20:33 UTC (permalink / raw) To: Peter J. Braam; +Cc: linux-kernel "Peter J. Braam" wrote: > > Hi, > > I've just been told that some "limitations" of the following kind will > remain: > page index = unsigned long > ino_t = unsigned long > > Lustre has definitely been asked to support much larger files than > 16TB. Also file systems with a trillion files have been requested by > one of our supporters (you don't want to know who, besides I've no > idea how many bits go in a trillion, but it's more than 32). > > I understand why people don't want to sprinkle the kernel with u64's, > and arguably we can wait a year or two and use 64 bit architectures, > so I'm probably not going to kick up a fuss about it. > > However, I thought I'd let you know that there are organizations that > _really_ want to have such big files and file systems and get quite > dismayed about "small integers". And we will fail to deliver on a > requirement to write a 50TB file because of this. I don't know about the ino_t thing, but as far as the pagecache indices goes it's simply a matter of - s/unsigned long/pgoff_t/ in a zillion places - modify the radix tree code a bit - implement CONFIG_LL_PAGECACHE_INDEX - make it all work - convince Linus Linus's objections are threefold: it expands struct page, 64 bit arith is slow and gcc tends to get it wrong. And I would add "most developers won't test 64-bit pgoff_t, and it'll get broken regularly". The expansion of struct page and the performance impact is just a cost which you'll have to balance against the benefits. For a few people, 32-bit pagecache index is a showstopper and they'll accept that tradeoff. 
Sprinkling `pgoff_t' everywhere is, IMO, not a bad thing - it aids code readability because it tells you what the variable is used for. As for broken gcc, well, the proponents of 64-bit pgoff_t would have to work to identify the correct gcc version and generally get gcc doing the right thing. ^ permalink raw reply [flat|nested] 38+ messages in thread
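The 16TB figure from the start of the thread falls straight out of this arithmetic: with 4KB pages, a 32-bit page index caps a file at 2^32 × 2^12 = 2^44 bytes. A sketch of the typedef switch Andrew describes — `CONFIG_LL_PAGECACHE_INDEX` is taken from the mail, everything else is illustrative:

```c
#include <stdint.h>

/* One typedef, selected by a config option, instead of 'unsigned long'
 * sprinkled through the page cache. */
#ifdef CONFIG_LL_PAGECACHE_INDEX
typedef uint64_t pgoff_t;
#else
typedef unsigned long pgoff_t;
#endif

#define PAGE_SHIFT 12   /* 4 KB pages */

/* Largest file size addressable through the page cache for a given
 * page-index width, saturating at UINT64_MAX when it no longer fits. */
static uint64_t max_file_bytes(unsigned index_bits)
{
    if (index_bits + PAGE_SHIFT >= 64)
        return UINT64_MAX;              /* effectively unlimited */
    return (1ULL << index_bits) << PAGE_SHIFT;
}
```

max_file_bytes(32) is 2^44 bytes, i.e. the 16TB ceiling Lustre is hitting; a 64-bit pgoff_t removes it at the cost of a wider struct page on 32-bit machines.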
end of thread, other threads:[~2002-08-06 9:45 UTC | newest]

Thread overview: 38+ messages:
2002-07-31 19:16 BIG files & file systems Peter J. Braam
2002-07-31 19:26 ` Christoph Hellwig
2002-07-31 20:04 ` Matti Aarnio
2002-07-31 20:12 ` Christoph Hellwig
2002-08-02 17:26 ` Albert D. Cahalan
2002-08-02 22:14 ` Randy.Dunlap
2002-08-03 3:26 ` Albert D. Cahalan
2002-08-06 5:19 ` Andreas Dilger
2002-08-06 7:24 ` Albert D. Cahalan
2002-08-06 7:52 ` Andreas Dilger
2002-08-06 9:28 ` Matti Aarnio
2002-08-05 13:04 ` Stephen Lord
2002-08-05 13:42 ` Hans Reiser
2002-08-05 13:56 ` Randy.Dunlap
2002-08-05 14:21 ` Randy.Dunlap
2002-08-05 17:31 ` Albert D. Cahalan
2002-08-06 0:16 ` jw schultz
2002-08-06 9:48 ` Hans Reiser
2002-07-31 21:07 ` Jan Harkes
2002-07-31 21:13 ` Alexander Viro
2002-08-01 3:51 ` Jan Harkes
2002-08-01 12:01 ` Mark Mielke
2002-08-02 0:09 ` Stephen Lord
2002-08-02 12:17 ` Chris Mason
2002-08-02 12:33 ` Anton Altaparmakov
2002-08-02 13:56 ` Jan Harkes
2002-08-02 14:06 ` Steve Lord
2002-08-02 15:10 ` Hans Reiser
2002-08-02 15:39 ` Trond Myklebust
2002-08-02 17:01 ` Hans Reiser
2002-08-02 17:25 ` Nikita Danilov
2002-08-02 17:47 ` Trond Myklebust
2002-08-02 18:10 ` Nikita Danilov
2002-08-02 18:31 ` Hans Reiser
2002-08-02 18:48 ` Nikita Danilov
2002-08-02 18:59 ` Hans Reiser
2002-08-01 12:01 ` David Woodhouse
2002-08-01 20:33 ` Andrew Morton