* [RFC] Add new extent structure in ext4
@ 2012-01-23 12:51 Robin Dong
  2012-01-23 18:59 ` Ted Ts'o
  ` (3 more replies)
  0 siblings, 4 replies; 15+ messages in thread
From: Robin Dong @ 2012-01-23 12:51 UTC (permalink / raw)
  To: Ted Ts'o, Andreas Dilger, Ext4 Developers List

Hi Ted, Andreas and the list,

Now that the bigalloc feature has been completed in ext4, we can have
much larger block groups (and therefore larger contiguous space), but the
current extent structure limits the size of an extent to 128MB, which is
not optimal.

We could solve the problem by creating a new extent format to support
larger extent sizes, which looks like this:

struct ext4_extent2 {
	__le64 ee_block;	/* first logical block extent covers */
	__le64 ee_start;	/* starting physical block */
	__le32 ee_len;		/* number of blocks covered by extent */
	__le32 ee_flags;	/* flags and future extension */
};

struct ext4_extent2_idx {
	__le64 ei_block;	/* index covers logical blocks from 'block' */
	__le64 ei_leaf;		/* pointer to the physical block of the next level */
	__le32 ei_flags;	/* flags and future extension */
	__le32 ei_unused;	/* padding */
};

I think we could keep the structure of ext4_extent_header and add a new
incompat flag, EXT4_FEATURE_INCOMPAT_EXTENTS2.

The new extent format could support 16TB of contiguous space and larger
volumes.

What's your opinion?

--
Best Regards
Robin Dong

^ permalink raw reply	[flat|nested] 15+ messages in thread
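The 128MB and 16TB figures in the proposal follow from the width of the length field. The sketch below works through that arithmetic; it assumes 4KB blocks, assumes that the current 16-bit ee_len reserves its top bit to mark uninitialized extents (leaving 2^15 blocks per extent), and assumes that all 32 bits of the proposed ee_len count blocks. The macro and function names are illustrative only and do not come from the kernel.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: current ee_len is 16 bits with the top bit reserved as an
 * "uninitialized" marker, so one extent covers at most 2^15 blocks. */
#define CUR_MAX_EXTENT_BLOCKS (UINT64_C(1) << 15)

/* Assumption: the proposed __le32 ee_len uses all 32 bits as a block count,
 * so the largest storable length is 2^32 - 1 blocks. */
#define NEW_MAX_EXTENT_BLOCKS ((uint64_t)UINT32_MAX)

/* Largest contiguous byte range a single extent can describe. */
static uint64_t max_extent_bytes(uint64_t max_blocks, uint32_t block_size)
{
    /* Block sizes are powers of two. */
    assert(block_size != 0 && (block_size & (block_size - 1)) == 0);
    return max_blocks * block_size;
}
```

With 4096-byte blocks, max_extent_bytes(CUR_MAX_EXTENT_BLOCKS, 4096) is 128 MiB, and max_extent_bytes(NEW_MAX_EXTENT_BLOCKS, 4096) is one block short of 16 TiB.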
* Re: [RFC] Add new extent structure in ext4
  2012-01-23 12:51 [RFC] Add new extent structure in ext4 Robin Dong
@ 2012-01-23 18:59 ` Ted Ts'o
  2012-01-23 23:17   ` Andreas Dilger
  2012-01-24 13:34 ` Jan Kara
  ` (2 subsequent siblings)
  3 siblings, 1 reply; 15+ messages in thread
From: Ted Ts'o @ 2012-01-23 18:59 UTC (permalink / raw)
  To: Robin Dong; +Cc: Andreas Dilger, Ext4 Developers List

On Mon, Jan 23, 2012 at 08:51:53PM +0800, Robin Dong wrote:
>
> We could solve the problem by creating a new extent format to support
> larger extent size, which looks like this:
>
> struct ext4_extent2 {
>	__le64 ee_block;	/* first logical block extent covers */
>	__le64 ee_start;	/* starting physical block */
>	__le32 ee_len;		/* number of blocks covered by extent */
>	__le32 ee_flags;	/* flags and future extension */
> };
>
> I think we could keep the structure of ext4_extent_header and add new
> incompat flag EXT4_FEATURE_INCOMPAT_EXTENTS2.

The really unfortunate thing about using a 24 byte on-disk extent
structure is that you can only fit 2 extents in the inode before
needing to spill out to an external header.

So being able to support multiple extent formats in the inode (by using
a different eh_magic number) would probably be a good thing. In fact,
it might be useful to also have a version which looks like this:

struct ext4_extent_packed {
	__le32 ee_start_lo;
	__le16 ee_start_hi;
	__le16 ee_len;
};

i.e., something which only takes 8 bytes, but which is only used for
non-sparse files in the inode structure, so that you can fit 6 extents
in the inode.

The hard part will be cleaning up and refactoring the extent code to
support multiple on-disk extent formats. (That's going to be very
messy, though! So if we're going to go through all of that work, it
would be nice if it had advantages not just for huge file systems, but
also for desktop workloads.) Once this investment gets done, supporting
a third extent format should be relatively straightforward.

This would also allow us to make the new extent format be an RO_COMPAT
feature, so that an existing ext4 file system could be converted to
take advantage of the new extent encodings without needing to do a
backup / reformat / restore pass.

					- Ted
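The "2 extents" and "6 extents" counts above can be reproduced with a small calculation. This is only a sketch: it assumes the inode's i_block area is 60 bytes and that a 12-byte extent header sits at its start, as in the current ext4 layout. The struct names ending in _model are hypothetical stand-ins that use fixed-width integers in place of the kernel's __le types.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define I_BLOCK_BYTES       60 /* assumed: i_block is 15 four-byte slots */
#define EXTENT_HEADER_BYTES 12 /* assumed: size of struct ext4_extent_header */

/* Stand-in for the proposed 24-byte extent (hypothetical name). */
struct ext4_extent2_model {
    uint64_t ee_block;
    uint64_t ee_start;
    uint32_t ee_len;
    uint32_t ee_flags;
};

/* Stand-in for the 8-byte packed extent (hypothetical name). */
struct ext4_extent_packed_model {
    uint32_t ee_start_lo;
    uint16_t ee_start_hi;
    uint16_t ee_len;
};

/* Number of extent entries that fit in the inode after the header. */
static size_t extents_in_inode(size_t entry_bytes)
{
    assert(entry_bytes > 0);
    return (I_BLOCK_BYTES - EXTENT_HEADER_BYTES) / entry_bytes;
}
```

Under these assumptions, extents_in_inode(12) gives 4 for the existing 12-byte extent, while sizeof of the two stand-in structs gives 2 and 6 respectively, matching the counts in the message.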
* Re: [RFC] Add new extent structure in ext4 2012-01-23 18:59 ` Ted Ts'o @ 2012-01-23 23:17 ` Andreas Dilger 0 siblings, 0 replies; 15+ messages in thread From: Andreas Dilger @ 2012-01-23 23:17 UTC (permalink / raw) To: Ted Ts'o; +Cc: Robin Dong, Ext4 Developers List On 2012-01-23, at 11:59 AM, Ted Ts'o wrote: > On Mon, Jan 23, 2012 at 08:51:53PM +0800, Robin Dong wrote: >> >> We could solve the problem by creating a new extent format to support >> larger extent size, which looks like this: >> >> struct ext4_extent2 { >> __le64 ee_block; /* first logical block extent covers */ >> __le64 ee_start; /* starting physical block */ >> __le32 ee_len; /* number of blocks covered by extent */ >> __le32 ee_flags; /* flags and future extension */ >> }; >> >> I think we could keep the structure of ext4_extent_header and add new >> imcompat flag EXT4_FEATURE_INCOMPAT_EXTENTS2. > > The really unfortunate thing about using a 24 byte on-disk extent > structure is that you can only fit 2 extents in the inode before > needing to spill out to an external header. > > So being able to support multiple exent formats in the inode (by using > a different eh_magic number) would probably be a good thing. In fact, > it might be useful to also have a version which looks like this: > > struct ext4_extent_packed { > __le32 ee_start_lo; > __le16 ee_start_hi; > __le16 ee_len; > }; > > i.e., something which only takes 8 bytes, but which is only used for > non-sparse files in the inode structure, so that you can fit 6 extents > in the inode. How does the code determine in advance whether a file is going to be sparse or not? Does this mean that the extents would have to be changed as soon as a hole is added to a file? That probably isn't bad if this format is only used inside the inode, but would be very complex if it is used for an indirect block. 
Actually, my thought has been that it would be useful to have a new "extent" format for block-mapped files that have fragmented on-disk layout, like directories. > The hard part will be cleaning up and refactoring the extent code to > support multiple on-disk extent formats. (That's going to be very > messy, though! So if we're going to go through all of that work, it > would benice if it had advantages not for huge file systems, but also > for desktop workloads.) Once this investment gets done, supporting a > third extent format should be relatively straight forward. ... and fourth... > This would also allow us to make the new extent format be an RO_COMPAT > feature, so that an existing ext4 file system could be converted to > take advantage of the new extent encodings without needing to do a > backup / reformat / restore pass. How could a new extent format be RO_COMPAT? Old kernels couldn't possibly be able to read files with the new extent format. I guess you are thinking that they are RO_COMPAT in the sense of "they don't crash old kernels, but new files cannot be read"? Cheers, Andreas ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [RFC] Add new extent structure in ext4 2012-01-23 12:51 [RFC] Add new extent structure in ext4 Robin Dong 2012-01-23 18:59 ` Ted Ts'o @ 2012-01-24 13:34 ` Jan Kara 2012-01-24 17:32 ` Andreas Dilger 2012-01-25 22:48 ` Dave Chinner 2012-01-30 20:41 ` Eric Sandeen 3 siblings, 1 reply; 15+ messages in thread From: Jan Kara @ 2012-01-24 13:34 UTC (permalink / raw) To: Robin Dong; +Cc: Ted Ts'o, Andreas Dilger, Ext4 Developers List Hello, On Mon 23-01-12 20:51:53, Robin Dong wrote: > After the bigalloc-feature is completed in ext4, we could have much more > big size of block-group (also bigger continuous space), but the extent > structure of files now limit the extent size below 128MB, which is not > optimal. It is not optimal but does it really make difference? I.e. what improvement do you expect from enlarging extents from 128MB to say 4GB (or do you expect to be consistently able to allocate continguous chunks larger than 4GB?)? All you save is a single read of an indirect block... Is that really worth the complications with another extent format? But maybe I miss some benefit. Honza > We could solve the problem by creating a new extent format to support > larger extent size, which looks like this: > > struct ext4_extent2 { > __le64 ee_block; /* first logical block extent covers */ > __le64 ee_start; /* starting physical block */ > __le32 ee_len; /* number of blocks covered by extent */ > __le32 ee_flags; /* flags and future extension */ > }; > > struct ext4_extent2_idx { > __le64 ei_block; /* index covers logical blocks from 'block' */ > __le64 ei_leaf; /* pointer to the physical block of the next level */ > __le32 ei_flags; /* flags and future extension */ > __le32 ei_unused; /* padding */ > }; > > I think we could keep the structure of ext4_extent_header and add new > imcompat flag EXT4_FEATURE_INCOMPAT_EXTENTS2. > > The new extent format could support 16TB continuous space and larger volumes. > > What's your opinion? 
-- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [RFC] Add new extent structure in ext4 2012-01-24 13:34 ` Jan Kara @ 2012-01-24 17:32 ` Andreas Dilger 0 siblings, 0 replies; 15+ messages in thread From: Andreas Dilger @ 2012-01-24 17:32 UTC (permalink / raw) To: Jan Kara; +Cc: Robin Dong, Ted Ts'o, Andreas Dilger, Ext4Developers List On 2012-01-24, at 6:34, Jan Kara <jack@suse.cz> wrote: > On Mon 23-01-12 20:51:53, Robin Dong wrote: >> After the bigalloc-feature is completed in ext4, we could have much more >> big size of block-group (also bigger continuous space), but the extent >> structure of files now limit the extent size below 128MB, which is not >> optimal. > > It is not optimal but does it really make difference? I.e. what > improvement do you expect from enlarging extents from 128MB to say 4GB (or > do you expect to be consistently able to allocate continguous chunks larger > than 4GB?)? All you save is a single read of an indirect block... Is that > really worth the complications with another extent format? But maybe I miss > some benefit. What I'm (somewhat) interested in is increasing the maximum file size. IMHO, I think it would be better to do this with a larger block size (similar to bigalloc, but actually handling large blocks as a side benefit) since this will reduce the allocation overhead as well. Even if the blocksize is only 64kB, that would allow files up to 256TB, and filesystems up to 2^64 bytes without the complexity of changing the extent format (which Ted looked at once and thought was difficult). Since Robin and Ted already did most of that work for bigalloc, I think the remaining effort would be manageable, especially if mmap is disabled on such a filesystem. Increasing the maximum extent size may have some small benefit, but I don't think it would be noticeable, and would rarely be used due to fragmentation and such. 
A single index block with 128MB extents can already address over 16GB, and with large blocks this increases with the square of the blocksize (larger extents * more extents per index block). Cheers, Andreas >> We could solve the problem by creating a new extent format to support >> larger extent size, which looks like this: >> >> struct ext4_extent2 { >> __le64 ee_block; /* first logical block extent covers */ >> __le64 ee_start; /* starting physical block */ >> __le32 ee_len; /* number of blocks covered by extent */ >> __le32 ee_flags; /* flags and future extension */ >> }; >> >> struct ext4_extent2_idx { >> __le64 ei_block; /* index covers logical blocks from 'block' */ >> __le64 ei_leaf; /* pointer to the physical block of the next level */ >> __le32 ei_flags; /* flags and future extension */ >> __le32 ei_unused; /* padding */ >> }; >> >> I think we could keep the structure of ext4_extent_header and add new >> imcompat flag EXT4_FEATURE_INCOMPAT_EXTENTS2. >> >> The new extent format could support 16TB continuous space and larger volumes. >> >> What's your opinion? > -- > Jan Kara <jack@suse.cz> > SUSE Labs, CR ^ permalink raw reply [flat|nested] 15+ messages in thread
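The size limits quoted in this message can be checked with a short calculation. The sketch below assumes the current extent format: a 32-bit logical block number, 12-byte extent entries after a 12-byte header in each tree block, and at most 2^15 blocks per extent. It also reads "a single index block" as one tree block full of maximum-length leaf extents, which is an interpretation rather than something the message states. All names are illustrative.

```c
#include <assert.h>
#include <stdint.h>

#define EXTENT_HEADER_BYTES 12                  /* assumed header size */
#define EXTENT_ENTRY_BYTES  12                  /* assumed current extent size */
#define MAX_EXTENT_BLOCKS   (UINT64_C(1) << 15) /* assumed per-extent limit */
#define LOGICAL_BLOCK_BITS  32                  /* assumed width of ee_block */

/* Maximum file size when logical block numbers are 32 bits wide. */
static uint64_t max_file_bytes(uint32_t block_size)
{
    assert(block_size != 0);
    return (UINT64_C(1) << LOGICAL_BLOCK_BITS) * block_size;
}

/* Bytes addressable by one tree block full of maximum-length extents. */
static uint64_t leaf_block_coverage_bytes(uint32_t block_size)
{
    assert(block_size > EXTENT_HEADER_BYTES);
    uint64_t entries = (block_size - EXTENT_HEADER_BYTES) / EXTENT_ENTRY_BYTES;
    return entries * MAX_EXTENT_BLOCKS * block_size;
}
```

With 64KB blocks, max_file_bytes(65536) is 2^48 bytes, i.e. 256 TiB. With 4KB blocks, one tree block holds 340 entries of 128 MiB each, which is comfortably over 16GB; with 64KB blocks both the entry count and the extent size grow by roughly 16x, giving the quadratic scaling described above.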
* Re: [RFC] Add new extent structure in ext4 2012-01-23 12:51 [RFC] Add new extent structure in ext4 Robin Dong 2012-01-23 18:59 ` Ted Ts'o 2012-01-24 13:34 ` Jan Kara @ 2012-01-25 22:48 ` Dave Chinner 2012-01-25 23:03 ` Andreas Dilger 2012-01-30 20:41 ` Eric Sandeen 3 siblings, 1 reply; 15+ messages in thread From: Dave Chinner @ 2012-01-25 22:48 UTC (permalink / raw) To: Robin Dong; +Cc: Ted Ts'o, Andreas Dilger, Ext4 Developers List On Mon, Jan 23, 2012 at 08:51:53PM +0800, Robin Dong wrote: > Hi Ted, Andreas and the list, > > After the bigalloc-feature is completed in ext4, we could have much more > big size of block-group (also bigger continuous space), but the extent > structure of files now limit the extent size below 128MB, which is not > optimal. > > We could solve the problem by creating a new extent format to support > larger extent size, which looks like this: > > struct ext4_extent2 { > __le64 ee_block; /* first logical block extent covers */ > __le64 ee_start; /* starting physical block */ > __le32 ee_len; /* number of blocks covered by extent */ > __le32 ee_flags; /* flags and future extension */ > }; > > struct ext4_extent2_idx { > __le64 ei_block; /* index covers logical blocks from 'block' */ > __le64 ei_leaf; /* pointer to the physical block of the next level */ > __le32 ei_flags; /* flags and future extension */ > __le32 ei_unused; /* padding */ > }; > > I think we could keep the structure of ext4_extent_header and add new > imcompat flag EXT4_FEATURE_INCOMPAT_EXTENTS2. > > The new extent format could support 16TB continuous space and larger volumes. > > What's your opinion? Just use XFS. Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [RFC] Add new extent structure in ext4 2012-01-25 22:48 ` Dave Chinner @ 2012-01-25 23:03 ` Andreas Dilger 2012-01-27 0:19 ` Dave Chinner 0 siblings, 1 reply; 15+ messages in thread From: Andreas Dilger @ 2012-01-25 23:03 UTC (permalink / raw) To: Dave Chinner; +Cc: Robin Dong, Ted Ts'o, Ext4 Developers List On 2012-01-25, at 3:48 PM, Dave Chinner wrote: > On Mon, Jan 23, 2012 at 08:51:53PM +0800, Robin Dong wrote: >> Hi Ted, Andreas and the list, >> >> After the bigalloc-feature is completed in ext4, we could have much more >> big size of block-group (also bigger continuous space), but the extent >> structure of files now limit the extent size below 128MB, which is not >> optimal. >> >> We could solve the problem by creating a new extent format to support >> larger extent size, which looks like this: >> >> struct ext4_extent2 { >> __le64 ee_block; /* first logical block extent covers */ >> __le64 ee_start; /* starting physical block */ >> __le32 ee_len; /* number of blocks covered by extent */ >> __le32 ee_flags; /* flags and future extension */ >> }; >> >> struct ext4_extent2_idx { >> __le64 ei_block; /* index covers logical blocks from 'block' */ >> __le64 ei_leaf; /* pointer to the physical block of the next level */ >> __le32 ei_flags; /* flags and future extension */ >> __le32 ei_unused; /* padding */ >> }; >> >> I think we could keep the structure of ext4_extent_header and add new >> imcompat flag EXT4_FEATURE_INCOMPAT_EXTENTS2. >> >> The new extent format could support 16TB continuous space and larger volumes. >> >> What's your opinion? > > Just use XFS. Thanks for your troll. If you have something actually useful to contribute, please feel free to post. Otherwise, this is a list for ext4 development. I don't encourage XFS users to switch to ext4 (or ZFS, for that matter, since ZFS can do a lot of things that just aren't possible for XFS, and is now available for Linux) on your mailing lists, and I'd appreciate the same courtesy here... 
Cheers, Andreas ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [RFC] Add new extent structure in ext4 2012-01-25 23:03 ` Andreas Dilger @ 2012-01-27 0:19 ` Dave Chinner 2012-01-27 14:27 ` Tao Ma 0 siblings, 1 reply; 15+ messages in thread From: Dave Chinner @ 2012-01-27 0:19 UTC (permalink / raw) To: Andreas Dilger; +Cc: Robin Dong, Ted Ts'o, Ext4 Developers List On Wed, Jan 25, 2012 at 04:03:09PM -0700, Andreas Dilger wrote: > On 2012-01-25, at 3:48 PM, Dave Chinner wrote: > > On Mon, Jan 23, 2012 at 08:51:53PM +0800, Robin Dong wrote: > >> Hi Ted, Andreas and the list, > >> > >> After the bigalloc-feature is completed in ext4, we could have much more > >> big size of block-group (also bigger continuous space), but the extent > >> structure of files now limit the extent size below 128MB, which is not > >> optimal. > >> > >> We could solve the problem by creating a new extent format to support > >> larger extent size, which looks like this: > >> > >> struct ext4_extent2 { > >> __le64 ee_block; /* first logical block extent covers */ > >> __le64 ee_start; /* starting physical block */ > >> __le32 ee_len; /* number of blocks covered by extent */ > >> __le32 ee_flags; /* flags and future extension */ > >> }; > >> > >> struct ext4_extent2_idx { > >> __le64 ei_block; /* index covers logical blocks from 'block' */ > >> __le64 ei_leaf; /* pointer to the physical block of the next level */ > >> __le32 ei_flags; /* flags and future extension */ > >> __le32 ei_unused; /* padding */ > >> }; > >> > >> I think we could keep the structure of ext4_extent_header and add new > >> imcompat flag EXT4_FEATURE_INCOMPAT_EXTENTS2. > >> > >> The new extent format could support 16TB continuous space and larger volumes. > >> > >> What's your opinion? > > > > Just use XFS. > > Thanks for your troll. > > If you have something actually useful to contribute, please feel free to post. > Otherwise, this is a list for ext4 development. You can chose to see my comment as a troll, but it has a serious message. 
If your use case is large multi-TB files, then why wouldn't you just
use a filesystem that was designed for files that large from the ground
up, rather than try to extend a filesystem that is already struggling
with the file sizes it already supports? Not to mention that very few
people even need this functionality, and those that do right now are
using XFS.

Indeed, on current measures, a 15.95TB file on ext4 takes 330s to
allocate on my test rig, while XFS will do it in under *35
milliseconds*. What's the point of increasing the maximum file size
when it takes so long to allocate or free the space? If you can't make
the allocation and freeing scale first to the existing file size
limits, there's little point in introducing support for larger files.

And as an ext4 user, all I want from ext4 is for it to be stable like
ext3 is stable, not have it continually destabilised by the addition of
incompatible feature after incompatible feature. Indeed, I can't use
ext4 in the places I'm using ext3 right now because ext4 is not very
resilient in the face of 20 system crashes a day. I generally find that
ext4 filesystems are irretrievably corrupted within a week. In
comparison, I have ext3 filesystems that have lasted more than 3 years
under such workloads without any corruptions occurring.

So the long form of my 3-word comment is effectively: "If you need
multi-TB files, then use the filesystem most appropriate for that
workload instead of trying to make ext4 more complex and unstable than
it already is".

> I don't encourage XFS users to switch to ext4 (or ZFS, for that matter, since
> ZFS can do a lot of things that just aren't possible for XFS, and is now
> available for Linux) on your mailing lists, and I'd appreciate the same
> courtesy here...

Sorry, I didn't realise that I'm not allowed to tell ext4 people to use
the filesystem most appropriate to their requirements. Extending ext4
is not the right solution to every problem.

I say stuff like this w.r.t. "don't use XFS for that" or "XFS will
never support that" all the time on the XFS lists and IRC channels, and
nobody thinks that it is out of place. If you want to pop up and say
that "you should use ext4 for that" on the XFS lists then you are
welcome to do so. Such comments generally result in an informative
technical discussion of the pros and cons of why something is or is not
suited to the given requirement without anyone being called a troll.

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [RFC] Add new extent structure in ext4 2012-01-27 0:19 ` Dave Chinner @ 2012-01-27 14:27 ` Tao Ma 2012-01-29 22:07 ` Dave Chinner 0 siblings, 1 reply; 15+ messages in thread From: Tao Ma @ 2012-01-27 14:27 UTC (permalink / raw) To: Dave Chinner Cc: Andreas Dilger, Robin Dong, Ted Ts'o, Ext4 Developers List Hi Dave, On 01/27/2012 08:19 AM, Dave Chinner wrote: > On Wed, Jan 25, 2012 at 04:03:09PM -0700, Andreas Dilger wrote: >> On 2012-01-25, at 3:48 PM, Dave Chinner wrote: >>> On Mon, Jan 23, 2012 at 08:51:53PM +0800, Robin Dong wrote: >>>> Hi Ted, Andreas and the list, >>>> >>>> After the bigalloc-feature is completed in ext4, we could have much more >>>> big size of block-group (also bigger continuous space), but the extent >>>> structure of files now limit the extent size below 128MB, which is not >>>> optimal. >>>> >>>> We could solve the problem by creating a new extent format to support >>>> larger extent size, which looks like this: >>>> >>>> struct ext4_extent2 { >>>> __le64 ee_block; /* first logical block extent covers */ >>>> __le64 ee_start; /* starting physical block */ >>>> __le32 ee_len; /* number of blocks covered by extent */ >>>> __le32 ee_flags; /* flags and future extension */ >>>> }; >>>> >>>> struct ext4_extent2_idx { >>>> __le64 ei_block; /* index covers logical blocks from 'block' */ >>>> __le64 ei_leaf; /* pointer to the physical block of the next level */ >>>> __le32 ei_flags; /* flags and future extension */ >>>> __le32 ei_unused; /* padding */ >>>> }; >>>> >>>> I think we could keep the structure of ext4_extent_header and add new >>>> imcompat flag EXT4_FEATURE_INCOMPAT_EXTENTS2. >>>> >>>> The new extent format could support 16TB continuous space and larger volumes. >>>> >>>> What's your opinion? >>> >>> Just use XFS. >> >> Thanks for your troll. >> >> If you have something actually useful to contribute, please feel free to post. >> Otherwise, this is a list for ext4 development. 
> > You can chose to see my comment as a troll, but it has a serious > message. If that is your use case is for large multi-TB files, then > why wouldn't you just use a filesystem that was designed for files > that large from the ground up rather than try to extend a filesystem > that is already struggling with file sizes that it already supports? > Not to mention that very few people even need this functionality, > and those that do right now are using XFS. Robin is one of my colleague. And to be frank, ext4 works well currently in our product system. And we'd like to see it grows to fit our future need also. I think it helps both the community and our employer. Having said that, another reason why we don't consider of XFS as our choice is that we don't think we have the ability to maintain 2 file systems in our product system. > > Indeed, on current measures, a 15.95TB file on ext4 takes 330s to > allocate on my test rig, while XFS will do it under *35 > milliseconds*. What's the point of increasing the maximum file size > when it when it takes so long to allocate or free the space? If you > can't make the allocation and freeing scale first to the existing > file size limits, there's little point in introducing support for > larger files. I think your test case here is biased since you used the most successful story from XFS. Yes, bitmap-based file system is a little bit hard to allocate a very large file if the bitmap is scattered all over the disk, but I don't think ext4 can't fill the gap of this test case in the future. Let us wait and see. :) > > And as an ext4 user, all I want is from ext4 to be stable like ext3 > is stable, not have it continually destabilised by the addition of > incompatible feature after incompatible feature. Indeed, I can't > use ext4 in the places I'm using ext3 right now because ext4 is not > very resilient in the face of 20 system crashes a day. I generally > find that ext4 filesystems are irretrievable corrupted within a > week. 
In comparison, I have ext3 filesystems have lasted more than > 3 years under such workloads without any corruptions occurring. OK, so next time when you see the corruption, please at least send it to the mail list so that ext4 developers can have the chance of seeing it. Complaint doesn't improve it. I have read your original letter about the review process in xfs development, it is good and I guess ext4 should take it as a standard process. > > So the long form of my 3-word comment is effectively: "If you need > multi-TB files, then use the filesystem most appropriate for that > workload instead of trying to make ext4 more complex and unstable > than it already is". I have read and watched the talk you gave in this year's LCA, your assumption about ext4 may be a little frightening, but it is good for the ext4 community. In your talk "xfs is much slower than ext4 in 2009-2010 for meta-intensive workload", and now it works much faster. So why do you think ext4 can't be improved also like xfs? Thanks Tao > >> I don't encourage XFS users to switch to ext4 (or ZFS, for that matter, since >> ZFS can do a lot of things that just aren't possible for XFS, and is now >> available for Linux) on your mailing lists, and I'd appreciate the same >> courtesy here... > > Sorry, I didn't realise that I'm not aren't allowed to tell ext4 > people to use the filesystem most appropriate to their requirements. > Extending ext4 is not the right solution to every problem. > > I say stuff like this w.r.t. "don't use XFS for that" or "XFS will > never support that" all the time on the XFS lists and IRC channels, > and nobody thinks that it is out of place. If you want to pop up and > say that "you should use ext4 for that" on the XFS lists then you > are welcome to do so. Such comments generally results in an > informative technical discussion of the pros and cons of why > something is or is not suited to the given requirement without > anyone being called a troll. 
* Re: [RFC] Add new extent structure in ext4 2012-01-27 14:27 ` Tao Ma @ 2012-01-29 22:07 ` Dave Chinner 2012-01-30 22:50 ` Andreas Dilger 0 siblings, 1 reply; 15+ messages in thread From: Dave Chinner @ 2012-01-29 22:07 UTC (permalink / raw) To: Tao Ma; +Cc: Andreas Dilger, Robin Dong, Ted Ts'o, Ext4 Developers List On Fri, Jan 27, 2012 at 10:27:02PM +0800, Tao Ma wrote: > Hi Dave, > On 01/27/2012 08:19 AM, Dave Chinner wrote: > > On Wed, Jan 25, 2012 at 04:03:09PM -0700, Andreas Dilger wrote: > >> On 2012-01-25, at 3:48 PM, Dave Chinner wrote: > >>> On Mon, Jan 23, 2012 at 08:51:53PM +0800, Robin Dong wrote: > >>>> Hi Ted, Andreas and the list, > >>>> > >>>> After the bigalloc-feature is completed in ext4, we could have much more > >>>> big size of block-group (also bigger continuous space), but the extent > >>>> structure of files now limit the extent size below 128MB, which is not > >>>> optimal. ..... > >>>> The new extent format could support 16TB continuous space and larger volumes. > >>>> > >>>> What's your opinion? > >>> > >>> Just use XFS. > >> > >> Thanks for your troll. > >> > >> If you have something actually useful to contribute, please feel free to post. > >> Otherwise, this is a list for ext4 development. > > > > You can chose to see my comment as a troll, but it has a serious > > message. If that is your use case is for large multi-TB files, then > > why wouldn't you just use a filesystem that was designed for files > > that large from the ground up rather than try to extend a filesystem > > that is already struggling with file sizes that it already supports? > > Not to mention that very few people even need this functionality, > > and those that do right now are using XFS. > Robin is one of my colleague. And to be frank, ext4 works well currently > in our product system. And we'd like to see it grows to fit our future > need also. Sure. But at the expense of the average user? 
ext4 is supposed to be primarily the Linux desktop filesystem, yet all
I see is people trying to make it something for big, bigger and
biggest. Bigalloc, new extent formats, no-journal mode, dioread_nolock,
COW snapshots, secure delete, etc. It's a list of features that are
somewhat incompatible with each other and that are useful to only a
handful of vendors or companies. Most have no relevance at all to the
uses of the majority of ext4 users.

This is what I'm getting at - I don't object to adding functionality
that is generically useful and applies to all filesystem configs, but
that's not what is happening. ext4 appears to have a development
mindset of "if we don't support X, then we can do Y" and I don't think
that serves the ext4 users very well at all.

BTW, if you think that is a harsh criticism, just reflect on the
insanity of the recent "we can support 64k block sizes if we just
disable mmap" discussion. Yes, that's great for Lustre, but it is
useless for everyone else...

> I think it helps both the community and our employer. Having
> said that, another reason why we don't consider of XFS as our choice is
> that we don't think we have the ability to maintain 2 file systems in
> our product system.

That's your choice as a product vendor, not mine as an ext4 user....

> > Indeed, on current measures, a 15.95TB file on ext4 takes 330s to
> > allocate on my test rig, while XFS will do it under *35
> > milliseconds*. What's the point of increasing the maximum file size
> > when it takes so long to allocate or free the space? If you
> > can't make the allocation and freeing scale first to the existing
> > file size limits, there's little point in introducing support for
> > larger files.
> I think your test case here is biased since you used the most successful
> story from XFS. Yes, bitmap-based file system is a little bit hard to
> allocate a very large file if the bitmap is scattered all over the disk,

Which is the case whenever the filesystem has been used for a while. I
did those tests on a pristine, empty filesystem, so the speed of
allocation only goes down from there. Bitmap-based allocation degrades
much, much faster than extent-tree based allocation, especially when
you have to search for the free space to allocate from....

Indeed, how do you plan to test such large files robustly when it takes
so long to allocate the space to them? I mean, I can easily test large
files on XFS because of how quickly allocation occurs. I can easily
fragment free space and test large fragmented files because of how
quickly allocation occurs. But if the same tests that take a minute to
run on XFS take 4 orders of magnitude longer on ext4, just how good is
your test coverage going to be? What about when you have different
filesystem block sizes, or different mount options, or are doing it
concurrently with an online resize?

IOWs, the slowness of the allocation greatly limits the ability to test
such a feature at the scale it is designed to support. That's my big,
overriding concern - with ext4 allocation being so slow, we can't
really test large files with enough thoroughness *right now*.
Increasing the file size is only going to make that problem worse and
that, to me, is a show stopper. If you can't test it properly, then the
change should not be made.

> but I don't think ext4 can't fill the gap of this test case in the
> future. Let us wait and see. :)

How do you plan to fix it? If there isn't a plan, or it involves a
major on-disk format change, then aren't we back to square one about
adding intrusive, complex and destabilising features to a filesystem
that people are relying on to be stable?
> > And as an ext4 user, all I want is from ext4 to be stable like ext3 > > is stable, not have it continually destabilised by the addition of > > incompatible feature after incompatible feature. Indeed, I can't > > use ext4 in the places I'm using ext3 right now because ext4 is not > > very resilient in the face of 20 system crashes a day. I generally > > find that ext4 filesystems are irretrievable corrupted within a > > week. In comparison, I have ext3 filesystems have lasted more than > > 3 years under such workloads without any corruptions occurring. > OK, so next time when you see the corruption, please at least send it to > the mail list so that ext4 developers can have the chance of seeing it. > Complaint doesn't improve it. I won't be reporting corruptions because I stopped using ext4 more than 6 months ago on these machines after the last batch of unreproducable, unrepairable corruptions that occurred. I couldn't get anything from the corpses (I do know how to analyse a corrupt ext4 filesystem), so there really wasn't anything to report.... Generally speaking, the first sign of problems was a corrupted binary or missing or empty file. The filesystem never complained or detected corruption at runtime. By that stage, the original cause of the corruption was unfindable because the problems may have happened many crashes ago and been propagated further. running e2fsck at that point generally resulted in a mess with lots of stuff ending in lost+found and multiply linked blocks being duplicated all over the place. IOWs, an unrecoverable mess. > > So the long form of my 3-word comment is effectively: "If you need > > multi-TB files, then use the filesystem most appropriate for that > > workload instead of trying to make ext4 more complex and unstable > > than it already is". > I have read and watched the talk you gave in this year's LCA, your > assumption about ext4 may be a little frightening, but it is good for > the ext4 community. 
> In your talk "xfs is much slower than ext4 in
> 2009-2010 for meta-intensive workload", and now it works much faster. So
> why do you think ext4 can't be improved also like xfs?

Because all of the XFS changes talked about in that talk did not
change the on-disk format at all. They are *software-only* changes
and are completely transparent to users. They are even the default
behaviours now, so users with 10 year old XFS filesystems will also
benefit from them. And they can go back to their old kernels if they
don't like the new kernels, too...

We know that the problems ext4 has are much, much deeper and as this
thread shows require significant on-disk format changes to solve.
And they will only benefit those that have new filesystems or make
their old filesystems incompatible with old kernels. IOWs, the
changes being proposed don't help solve problems on all the existing
filesystems transparently. That's a *major* difference between
where XFS was 2 years ago and where ext4 is now.

Sure, given enough time and resources, any problem is solvable. But
really, do ext4 users really need a new, incompatible, difficult to
test on-disk format to solve problems that most people will never
hit on their desktop and server systems before they migrate them to
BTRFS?

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com

* Re: [RFC] Add new extent structure in ext4
  2012-01-29 22:07 ` Dave Chinner
@ 2012-01-30 22:50 ` Andreas Dilger
  2012-01-30 23:52 ` Ted Ts'o
  2012-02-01  3:57 ` Dave Chinner
  0 siblings, 2 replies; 15+ messages in thread
From: Andreas Dilger @ 2012-01-30 22:50 UTC
To: Dave Chinner; +Cc: Tao Ma, Robin Dong, Ted Ts'o, Ext4 Developers List

On 2012-01-29, at 3:07 PM, Dave Chinner wrote:
> On Fri, Jan 27, 2012 at 10:27:02PM +0800, Tao Ma wrote:
>> Hi Dave,
>> On 01/27/2012 08:19 AM, Dave Chinner wrote:
>>> On Wed, Jan 25, 2012 at 04:03:09PM -0700, Andreas Dilger wrote:
>>>> On 2012-01-25, at 3:48 PM, Dave Chinner wrote:
>>>>> On Mon, Jan 23, 2012 at 08:51:53PM +0800, Robin Dong wrote:
>>>>>> Hi Ted, Andreas and the list,
>>>>>>
>>>>>> After the bigalloc-feature is completed in ext4, we could have much more
>>>>>> big size of block-group (also bigger continuous space), but the extent
>>>>>> structure of files now limit the extent size below 128MB, which is not
>>>>>> optimal.
>
> .....
>
>>>>>> The new extent format could support 16TB continuous space and larger volumes.
>>>>>>
>>>>>> What's your opinion?
>>>>>
>>>>> Just use XFS.
>>>>
>>>> Thanks for your troll.
>>>>
>>>> If you have something actually useful to contribute, please feel free to post.
>>>> Otherwise, this is a list for ext4 development.
>>>
>>> You can choose to see my comment as a troll, but it has a serious
>>> message. If your use case is for large multi-TB files, then
>>> why wouldn't you just use a filesystem that was designed for files
>>> that large from the ground up rather than try to extend a filesystem
>>> that is already struggling with file sizes that it already supports?
>>> Not to mention that very few people even need this functionality,
>>> and those that do right now are using XFS.
>>
>> Robin is one of my colleagues. And to be frank, ext4 works well currently
>> in our product system. And we'd like to see it grow to fit our future
>> need also.
>
> Sure.
> But at the expense of the average user? ext4 is supposed to be
> primarily the Linux desktop filesystem,

That is your opinion, as an XFS developer that is trying to keep XFS
relevant for some part of the market. Yet ext4 does extremely well at
both the desktop and server workloads.

> yet all I see is people trying to make it something for big, bigger
> and biggest. Bigalloc, new extent formats, no-journal mode,
> dioread_nolock, COW snapshots, secure delete, etc. It's a list of
> features that are somewhat incompatible with each other that are
> useful to only a handful of vendors or companies. Most have no
> relevance at all to the uses of the majority of ext4 users.

??? This is quickly degrading into a mud slinging match. You claim
that "because ext4 is only relevant for desktops, it shouldn't try to
scale or improve performance". Should I similarly claim that "because
XFS is only relevant to gigantic SMP systems with huge RAID arrays it
shouldn't try to improve small file performance or be CPU efficient"?

Not at all. The ext4 users and developers choose it because it meets
their needs better than XFS for one reason or another, and we will
continue to improve it for everyone while we are interested to do so.
The ext4 multi-block allocator was originally done for high-throughput
file servers, but it is totally relevant for desktop workloads today.
The same is true for delayed allocation, and other improvements in the
past. I imagine that bigalloc would be very welcome for media servers
and other large file IO environments.

> This is what I'm getting at - I don't object to adding functionality
> that is generically useful and applies to all filesystem configs,
> but that's not what is happening. ext4 appears to have a development
> mindset of "if we don't support X, then we can do Y" and I don't
> think that serves the ext4 users very well at all.
>
> BTW, if you think that is a harsh criticism, just reflect on the
> insanity of the recent "we can support 64k block sizes if we just
> disable mmap" discussion. Yes, that's great for Lustre, but it is
> useless for everyone else...

I don't see that at all. The complexity of blocksize > PAGE_SIZE is
greatly reduced if we don't have to support mmap IO. Of course I'd be
much happier if the VM supported this properly, but it's been 10 years
and it hasn't happened, so waiting longer isn't reasonable.

To be honest, I totally agree that large blocks may not be relevant
for every desktop user. It may not even be relevant for Lustre, but
that isn't a valid reason not to even _discuss_ feature development
and see whether that leads us to an implementation that meets a number
of different needs.

Disabling mmap IO for some configurations doesn't prevent someone from
having a 4kB block LV for the root filesystem, and a separate data LV
for large file IO. It isn't that mmap for blocksize > PAGE_SIZE is
impossible to implement, but I'd rather see the code handling the
real-world use cases (efficient large file IO, filesystem portability
between IA64, PPC, ARM) than growing extra complexity to handle an
obscure use case (e.g. mmap file IO and binaries executed from a data
storage filesystem).

Once we get the mechanics of large block allocation, we can still look
into the complexity of mmap thereon, because a large block ext4
filesystem does not actually involve a disk format change; it has been
handled for ages by ext2/3/4 for CPUs that have larger PAGE_SIZE.
Handling mmap was in Robin's original submission, and I suggested that
we exclude it initially to reduce complexity for the initial
implementation.

>> I think it helps both the community and our employer. Having
>> said that, another reason why we don't consider XFS as our choice is
>> that we don't think we have the ability to maintain 2 file systems in
>> our product system.
>
> That's your choice as a product vendor, not mine as an ext4 user....

You're suggesting that if I started using XFS on my home filesystems
then I get veto power over your development plans? Hmm, I don't think
that is going to happen. Later on, you claim that you aren't even an
ext4 user, so what is the point of your complaint?

The way it works is that anyone is free to develop any features they
want for ext4, they are free to post them to this list (or not) and
the ext4 maintainers can evaluate them on functionality and
performance in the manner that they see fit, without any requirement
that they be accepted, keeping in mind that we _do_ take regular user
needs into account.

The mere existence of a feature, nay even the discussion of a feature
for ext4, should not be stifled by the suggestion that XFS is the last
word in filesystems (especially since ZFS has already claimed that
label :-).

>>> Indeed, on current measures, a 15.95TB file on ext4 takes 330s to
>>> allocate on my test rig, while XFS will do it under *35
>>> milliseconds*. What's the point of increasing the maximum file size
>>> when it takes so long to allocate or free the space? If you
>>> can't make the allocation and freeing scale first to the existing
>>> file size limits, there's little point in introducing support for
>>> larger files.
>>
>> I think your test case here is biased since you used the most successful
>> story from XFS. Yes, bitmap-based file system is a little bit hard to
>> allocate a very large file if the bitmap is scattered all over the disk,
>
> Which is the case whenever the filesystem has been used for a while.
> I did those tests on a pristine, empty filesystem, so the speed of
> allocation only goes down from there. bitmap based allocation
> degrades much, much faster than extent-tree based allocation,
> especially when you have to search for the free space to allocate
> from....
>
> Indeed, how do you plan to test such large files robustly when it
> takes so long to allocate the space to them? I mean, I can easily
> test large files on XFS because of how quickly allocation occurs. I
> can easily fragment free space and test large fragmented files
> because of how quickly allocation occurs. But if the same tests that
> take a minute to run on XFS take 4 orders of magnitude longer on
> ext4, just how good is your test coverage going to be? What about
> when you have different filesystem block sizes, or different mount
> options, or doing it concurrently with an online resize?
>
> IOWs, the slowness of the allocation greatly limits the ability to
> test such a feature at the scale it is designed to support. That's
> my big, overriding concern - with ext4 allocation being so slow, we
> can't really test large files with enough thoroughness *right now*.
> Increasing the file size is only going to make that problem worse
> and that, to me, is a show stopper. If you can't test it properly,
> then the change should not be made.

Hmm, excellent suggestion. Maybe if we implement faster allocation
for ext4 your objections could be quieted? Wait, that is what you
are objecting to in the first place (bigalloc, large blocks, etc) or
any changes to ext4 that don't meet your approval.

>> but I don't think ext4 can't fill the gap of this test case in the
>> future. Let us wait and see. :)
>
> How do you plan to fix it? If there isn't a plan, or it involves a
> major on-disk format change, then aren't we back to square one about
> adding intrusive, complex and destabilising features to a filesystem
> that people are relying on to be stable?
>
>>> And as an ext4 user, all I want is for ext4 to be stable like ext3
>>> is stable, not have it continually destabilised by the addition of
>>> incompatible feature after incompatible feature.
>>> Indeed, I can't use ext4 in the places I'm using ext3 right now
>>> because ext4 is not very resilient in the face of 20 system crashes
>>> a day. I generally find that ext4 filesystems are irretrievably
>>> corrupted within a week. In comparison, I have ext3 filesystems that
>>> have lasted more than 3 years under such workloads without any
>>> corruptions occurring.
>>
>> OK, so next time when you see the corruption, please at least send it to
>> the mail list so that ext4 developers can have the chance of seeing it.
>> Complaint doesn't improve it.
>
> I won't be reporting corruptions because I stopped using ext4 more
> than 6 months ago on these machines after the last batch of
> unreproducible, unrepairable corruptions that occurred. I couldn't
> get anything from the corpses (I do know how to analyse a corrupt
> ext4 filesystem), so there really wasn't anything to report....
>
> Generally speaking, the first sign of problems was a corrupted
> binary or missing or empty file. The filesystem never complained or
> detected corruption at runtime. By that stage, the original cause of
> the corruption was unfindable because the problems may have happened
> many crashes ago and been propagated further. Running e2fsck at that
> point generally resulted in a mess with lots of stuff ending in
> lost+found and multiply linked blocks being duplicated all over the
> place. IOWs, an unrecoverable mess.

I haven't heard of similar problems reported here, but even the
existence of such bug reports can be useful to alert developers to
the existence of such a problem, and to help narrow down corruption
issues to a specific kernel version.

>>> So the long form of my 3-word comment is effectively: "If you need
>>> multi-TB files, then use the filesystem most appropriate for that
>>> workload instead of trying to make ext4 more complex and unstable
>>> than it already is".
>>
>> I have read and watched the talk you gave in this year's LCA, your
>> assumption about ext4 may be a little frightening, but it is good for
>> the ext4 community. In your talk "xfs is much slower than ext4 in
>> 2009-2010 for meta-intensive workload", and now it works much faster. So
>> why do you think ext4 can't be improved also like xfs?
>
> Because all of the XFS changes talked about in that talk did not
> change the on-disk format at all. They are *software-only* changes
> and are completely transparent to users. They are even the default
> behaviours now, so users with 10 year old XFS filesystems will also
> benefit from them. And they can go back to their old kernels if they
> don't like the new kernels, too...

That is only partly true. XFS had to change the 32-bit vs. 64-bit
inode numbers to get better performance, and that is not backward
compatible on 32-bit systems. XFS had changed the logging format to
be more efficient in order to not suck at metadata benchmarks.

> We know that the problems ext4 has are much, much deeper and as this
> thread shows require significant on-disk format changes to solve.

That is a very broad statement, and I think it is your extrapolation
from reading a snippet of one thread on this list.

> And they will only benefit those that have new filesystems or make
> their old filesystems incompatible with old kernels. IOWs, the
> changes being proposed don't help solve problems on all the existing
> filesystems transparently. That's a *major* difference between
> where XFS was 2 years ago and where ext4 is now.

Not true. The ext4 code can mount and run ancient ext2 filesystems
and shows a significant performance improvement without any on-disk
format changes. Ask Google about their million(?) ext4 filesystems
and how they have improved with only a software update.

Maybe the converse could also be said, that the fact that XFS can show
so much performance improvement without changing the on-disk format is
a testament to how complex and badly written the old code was? I
think that argument holds as little value as yours, but I don't jump
up and down in xfs@oss.sgi.com touting the fact that ext4 is as fast
as (or faster than) XFS for most real-world workloads with only 1/2 of
the code.

> Sure, given enough time and resources, any problem is solvable. But
> really, do ext4 users really need a new, incompatible, difficult to
> test on-disk format to solve problems that most people will never
> hit on their desktop and server systems before they migrate them to
> BTRFS?

Again, you are entitled to your opinion, and are free to spend your
time and efforts where you like. I wish Chris all the best for Btrfs,
but having looked at that code I'm not in a hurry to move over to
using it for our production workloads, nor even for my home file
server.

The joy of open source software is that everyone is free to make their
own choices. I've made mine, and along with many other developers and
users the choice has been ext4. Thanks for your input, we'll continue
to discuss and develop whatever we want, regardless of how much you
want everyone to use XFS.

Cheers, Andreas

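For a sense of scale, the allocation timings quoted above (330 s on ext4
versus under 35 ms on XFS for a 15.95TB file) can be turned into a rough
slowdown factor and a projected test run time. This is only a
back-of-the-envelope sketch: the helper names are made up for
illustration, and it assumes, as the quoted argument does, that a whole
test slows down by the same factor as the allocation itself.

```c
#include <stdint.h>

/* Integer slowdown factor between two durations given in milliseconds
 * (hypothetical helper, not taken from any filesystem test suite). */
uint64_t slowdown_factor(uint64_t slow_ms, uint64_t fast_ms)
{
    return slow_ms / fast_ms;
}

/* Projected run time in seconds of a test whose baseline duration is
 * scaled by the given slowdown factor. */
uint64_t projected_seconds(uint64_t baseline_seconds, uint64_t factor)
{
    return baseline_seconds * factor;
}
```

With the quoted figures, slowdown_factor(330000, 35) is 9428, i.e. close
to four orders of magnitude, and a one-minute test scaled by that factor
comes out at 565680 seconds, or roughly six and a half days.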
* Re: [RFC] Add new extent structure in ext4
  2012-01-30 22:50 ` Andreas Dilger
@ 2012-01-30 23:52 ` Ted Ts'o
  2012-02-01  3:57 ` Dave Chinner
  1 sibling, 0 replies; 15+ messages in thread
From: Ted Ts'o @ 2012-01-30 23:52 UTC
To: Andreas Dilger; +Cc: Dave Chinner, Tao Ma, Robin Dong, Ext4 Developers List

As a large meta-comment, let me say that I find that most
conversations about which file systems users "should" use are very
often not very useful. Even less useful is what developers "should"
be working on.

In that way, my philosophy of ext4 is that it should be like the Linux
kernel; it's an evolutionary process and central planning is often
overrated. People contribute to ext4 for many different reasons, and
that means they optimize ext4 for their particular workloads. Like
Linus for Linux, we're not trying to architect for "world domination"
by saying, "hmm, in order to 'take out' reiserfs4, we'd better
implement features foo and bar".

Instead, it's things like "gee, this company over here is interested
in using ext4 as a back-end store for a cluster file system where the
journal is unnecessary overhead and performance under severe memory
pressure is important" --- and so we got no journal mode and some
improvements to the block allocator so it works better under those
conditions.

People contribute to ext4 for different goals, just as people
contribute to Linux for different goals. And just as there are times
when improvements for big servers have improved Linux's capabilities
for embedded machines, and vice versa, there are similar things that
can and have happened for ext4 (such as extents and the multi-block
allocator originally being developed for Lustre, but which have been
very useful for many other use cases).

Personally, I find that I get a lot more joy out of programming to
make a codebase better --- as opposed to programming with the goal to
kill off some other codebase, or discouraging other users from using
some other codebase.
Now that's an open source approach to things.

Things are no doubt very different if you are trying to allocate
engineering resources at a distribution. So there may be some
tensions between a desire from an open source perspective to be as
flexible as possible, and a company's position that they only want to
support a limited set of configuration options. I think those
decisions are ones which are best made by the distribution, and not as
part of the open source process. After all, what might make sense for
one distribution's customer base and business model, might not make
sense for another's.

There are some dangers to that model; for example, RAID support was
only implemented for Lustre's private in-kernel (and out-of-tree)
API. Some smarts in ext4's writepages codepath so that we can
properly handle RAID support is currently lacking. I'd work on it,
except that I don't personally (nor does my employer) have a strong
need to worry about RAID systems. I'll certainly integrate code that
fixes that problem, and I'm confident that eventually someone will
decide that's the one bit of improvement they need so that ext4 is a
good match for their use case.

I'm definitely not going to stress that this is something we have to
do right away just so we can kill off XFS; most of us are hopefully
working on ext4 because it's fun, and secondarily because amazingly
enough our employers are willing to pay for us to work on something
cool. (Just as I'm glad most Linux kernel developers weren't waking
up trying to think up ways to kill off FreeBSD or try to put the Mark
Williams Company out of business. :-)

Let me also add that competition is a good thing. It keeps all of us
on our toes. Legacy unix systems accepted that system calls and
context switches were naturally slow, until Linux proved that it could
be done very quickly and efficiently. SGI didn't bother dealing with
XFS's slow metadata performance even though they were selling desktops
during its original development.
It was only when Ric Wheeler (as he tells the story) told the XFS
developers how much XFS lagged on fs_mark that there was a strong
effort to address those issues, over a decade and a half after XFS's
original deployment.

That's why I don't believe it's productive to say that a particular
file system has no place in an ecosystem. If developers are
continuing to work on an OS, or a file system, and if users continue
to use it, then of course it has a place. You might not understand
why that might be true initially, but in general it's not because
everyone is being foolish/stupid.

One last observation. It's dangerous to focus on just one benchmark;
especially if it is a micro-benchmark. As a tool to improve one
aspect of a file system's performance, it's certainly useful. But how
many workloads will really hammer a file system with 16 cores, by
creating lots of small files and nothing else?

I have no doubt that we could improve ext4's scalability for that
particular workload. But is that a deadly shortcoming that should
cause ext4 developers to drop everything else they are doing and work
on this problem, lest users immediately reformat their disks and
switch to another file system because ext4's block allocator isn't as
scalable as it could be for lots of small block allocations done in
parallel? I'd suggest that might be an over-reaction.

Best regards,

- Ted

* Re: [RFC] Add new extent structure in ext4
  2012-01-30 22:50 ` Andreas Dilger
  2012-01-30 23:52 ` Ted Ts'o
@ 2012-02-01  3:57 ` Dave Chinner
  1 sibling, 0 replies; 15+ messages in thread
From: Dave Chinner @ 2012-02-01 3:57 UTC
To: Andreas Dilger; +Cc: Tao Ma, Robin Dong, Ted Ts'o, Ext4 Developers List

On Mon, Jan 30, 2012 at 03:50:24PM -0700, Andreas Dilger wrote:
> On 2012-01-29, at 3:07 PM, Dave Chinner wrote:
> > yet all I see is people trying to make it something for big, bigger
> > and biggest. Bigalloc, new extent formats, no-journal mode,
> > dioread_nolock, COW snapshots, secure delete, etc. It's a list of
> > features that are somewhat incompatible with each other that are
> > useful to only a handful of vendors or companies. Most have no
> > relevance at all to the uses of the majority of ext4 users.
>
> ??? This is quickly degrading into a mud slinging match. You claim
> that "because ext4 is only relevant for desktops, it shouldn't try to
> scale or improve performance". Should I similarly claim that "because
> XFS is only relevant to gigantic SMP systems with huge RAID arrays it
> shouldn't try to improve small file performance or be CPU efficient"?

You can if you want..... But then I'll just point to Eric Whitney's
latest results showing XFS is generally slightly more CPU efficient
than ext4, and performs as well as ext4 on the small file workload
he ran. :)

> Not at all. The ext4 users and developers choose it because it meets
> their needs better than XFS for one reason or another, and we will

More likely is that most desktop users choose ext4 because it is the
default filesystem their distribution installs, not because they
know anything about it or any other linux filesystem....

> continue to improve it for everyone while we are interested to do so.
> The ext4 multi-block allocator was originally done for high-throughput
> file servers, but it is totally relevant for desktop workloads today.
> The same is true for delayed allocation, and other improvements in the
> past. I imagine that bigalloc would be very welcome for media servers
> and other large file IO environments.

Yes, it will help certain workloads, but it isn't a general solution
to the allocation scalability problems. It also requires informed
and knowledgeable users to know about such features, when it is best
to use them and when not to use them.

One of the things that I'm concerned about is that the changes being
made add new upfront decisions that users have to be informed about
and understand sufficiently to be able to make the correct decision.
You're making the assumption that users are informed and
knowledgeable, and all filesystem developers should know this is
simply not true. Users repeatedly demonstrate that they don't know
how filesystems work, don't understand the knobs that are provided,
don't understand what their applications do in terms of filesystem
operations and don't really understand their data sets. Education
takes time and effort, but still users make the same mistakes over
and over again.

That's the reason why we have the mantra "use the defaults" when it
comes to users asking questions about how to optimise an XFS
filesystem. XFS is almost at the point where the defaults work for
most people, from $300 ARM-based NAS boxes all the way up to
multi-million dollar supercomputers. That's what we should be
delivering to users - something that just works. Special case
solutions should be few and far between, and only in those cases
should education about the various options be necessary.

That ext4 now has a much more complex configuration matrix than XFS,
and that developers are expecting users to understand that matrix
and how it relates to their systems and workloads without prior
experience seems like a pretty valid concern to me.

> > IOWs, the slowness of the allocation greatly limits the ability to
> > test such a feature at the scale it is designed to support.
> > That's my big, overriding concern - with ext4 allocation being so
> > slow, we can't really test large files with enough thoroughness
> > *right now*. Increasing the file size is only going to make that
> > problem worse and that, to me, is a show stopper. If you can't test
> > it properly, then the change should not be made.
>
> Hmm, excellent suggestion. Maybe if we implement faster allocation
> for ext4 your objections could be quieted? Wait, that is what you
> are objecting to in the first place (bigalloc, large blocks, etc) or
> any changes to ext4 that don't meet your approval.

bigalloc is not a solution to the use case that I initially found
this problem on - filling large filesystems quickly before starting
testing. Regardless of the existence of bigalloc, we still need to
test large 4k block size, 4k alloc size filesystems because that is
what users will mostly use.

Further, bigalloc makes the large filesystem test matrix more
complex and time consuming - we now have to test default configs as
well as bigalloc filesystems. And if this new extent format change
goes in, suddenly it is "defaults X bigalloc (various sizes) X
extent format". This gets impossible to test very quickly, and so
we end up with a mess of options that nobody really knows how well
they work together because they simply aren't adequately tested.

I've been trying to help address this large scale testing problem -
to make >16TB filesystem testing for ext4 and btrfs as well as XFS
easy to do through xfstests. Allocation speed is just one of the
initial problems I'm coming across for both ext4 and BTRFS.

Having easily repeatable tests for large filesystems is fundamental
to being able to support such filesystems. However, requiring magic
pixie dust to enable such testing raises a serious question about
the suitability of the filesystem for such usage.
And then further expanding support in an area that is known to be
deficient seems very misguided to me - it doesn't make testing any
easier, and it makes testing large files and filesystems even more
time consuming. This is a serious problem, and that's why I'm asking
whether this change is even something that should be done in the
first place.

Yes, I could have said it better than a throw-away, one-line
comment. But I'm trying to explain the many reasons I had for the
glib comment because that comment is based on problems that I've
seen over the past year or so trying to use and test ext4....

> >> I have read and watched the talk you gave in this year's LCA,
> >> your assumption about ext4 may be a little frightening, but it
> >> is good for the ext4 community. In your talk "xfs is much
> >> slower than ext4 in 2009-2010 for meta-intensive workload", and
> >> now it works much faster. So why do you think ext4 can't be
> >> improved also like xfs?
> >
> > Because all of the XFS changes talked about in that talk did not
> > change the on-disk format at all. They are *software-only*
> > changes and are completely transparent to users. They are even
> > the default behaviours now, so users with 10 year old XFS
> > filesystems will also benefit from them. And they can go back to
> > their old kernels if they don't like the new kernels, too...
>
> That is only partly true. XFS had to change the 32-bit vs. 64-bit
> inode numbers to get better performance, and that is not backward
> compatible on 32-bit systems. XFS had changed the logging format
> to be more efficient in order to not suck at metadata benchmarks.

Not true, but it's irrelevant to the above discussion, anyway, so I
won't waste time going down this path any further....

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com

* Re: [RFC] Add new extent structure in ext4
  2012-01-23 12:51 [RFC] Add new extent structure in ext4 Robin Dong
  ` (2 preceding siblings ...)
  2012-01-25 22:48 ` Dave Chinner
@ 2012-01-30 20:41 ` Eric Sandeen
  2012-01-30 22:52 ` Andreas Dilger
  3 siblings, 1 reply; 15+ messages in thread
From: Eric Sandeen @ 2012-01-30 20:41 UTC
To: Robin Dong; +Cc: Ted Ts'o, Andreas Dilger, Ext4 Developers List

On 1/23/12 6:51 AM, Robin Dong wrote:
> Hi Ted, Andreas and the list,
>
> After the bigalloc-feature is completed in ext4, we could have much more
> big size of block-group (also bigger continuous space), but the extent
> structure of files now limit the extent size below 128MB, which is not
> optimal.
>
> We could solve the problem by creating a new extent format to support
> larger extent size, which looks like this:
>
> struct ext4_extent2 {
>         __le64 ee_block; /* first logical block extent covers */
>         __le64 ee_start; /* starting physical block */
>         __le32 ee_len;   /* number of blocks covered by extent */
>         __le32 ee_flags; /* flags and future extension */
> };
>
> struct ext4_extent2_idx {
>         __le64 ei_block;  /* index covers logical blocks from 'block' */
>         __le64 ei_leaf;   /* pointer to the physical block of the next level */
>         __le32 ei_flags;  /* flags and future extension */
>         __le32 ei_unused; /* padding */
> };
>
> I think we could keep the structure of ext4_extent_header and add new
> incompat flag EXT4_FEATURE_INCOMPAT_EXTENTS2.
>
> The new extent format could support 16TB continuous space and larger volumes.

(larger volumes?)

> What's your opinion?
>

I think that mailing list drama aside ;) Dave has a decent point that we
shouldn't allow structures to scale out further than the code *using* them
can scale.

In other words, if we already have some trouble being efficient with 2^32
blocks in a file, it is risky and perhaps unwise to allow even larger
files, until those problems are resolved.
At a minimum, I'd suggest that such a change should not go in until it
is demonstrated that ext4 can, in general, handle such large file sizes
efficiently.

It'd be nice to be able to self-host large sparse images for large fs
testing, though. I suppose bigalloc solves that a little, though with
some backing store space usage penalty. I suppose if a bigalloc fs is
hosted on a bigalloc fs, things should (?) line up and be reasonable.

-Eric

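As a size check on the proposed structures quoted above, the following
sketch mirrors their field layout with plain fixed-width types and counts
how many entries fit in the inode. It assumes the usual 60-byte i_block
area and a 12-byte ext4_extent_header in front of the entries; the le64
and le32 typedefs and the helper name are stand-ins made up for
illustration, not kernel definitions.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-ins for the kernel's __le64/__le32 on-disk types (assumption:
 * only the field sizes matter for this check, not the byte order). */
typedef uint64_t le64;
typedef uint32_t le32;

struct ext4_extent2 {
    le64 ee_block;   /* first logical block extent covers */
    le64 ee_start;   /* starting physical block */
    le32 ee_len;     /* number of blocks covered by extent */
    le32 ee_flags;   /* flags and future extension */
};

struct ext4_extent2_idx {
    le64 ei_block;   /* index covers logical blocks from 'block' */
    le64 ei_leaf;    /* physical block of the next level */
    le32 ei_flags;   /* flags and future extension */
    le32 ei_unused;  /* padding */
};

/* Both proposed entries come out at 24 bytes with no hidden padding. */
_Static_assert(sizeof(struct ext4_extent2) == 24, "leaf entry is 24 bytes");
_Static_assert(sizeof(struct ext4_extent2_idx) == 24, "index entry is 24 bytes");

/* Assumed in-inode layout: 60 bytes of i_block, 12 of them used by the
 * extent header. */
enum { I_BLOCK_BYTES = 60, EXTENT_HEADER_BYTES = 12 };

/* Number of extent entries of the given size that fit in the inode. */
size_t extents_in_inode(size_t entry_bytes)
{
    return (I_BLOCK_BYTES - EXTENT_HEADER_BYTES) / entry_bytes;
}
```

Under those assumptions a 24-byte entry leaves room for 2 extents in the
inode, the current 12-byte entry for 4, and the 8-byte packed variant
suggested earlier in the thread for 6, which matches the counts given
there.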
* Re: [RFC] Add new extent structure in ext4
  2012-01-30 20:41 ` Eric Sandeen
@ 2012-01-30 22:52 ` Andreas Dilger
  0 siblings, 0 replies; 15+ messages in thread
From: Andreas Dilger @ 2012-01-30 22:52 UTC
To: Eric Sandeen; +Cc: Robin Dong, Ted Ts'o, Ext4 Developers List

On 2012-01-30, at 1:41 PM, Eric Sandeen wrote:
> On 1/23/12 6:51 AM, Robin Dong wrote:
>> After the bigalloc-feature is completed in ext4, we could have much more
>> big size of block-group (also bigger continuous space), but the extent
>> structure of files now limit the extent size below 128MB, which is not
>> optimal.
>>
>> The new extent format could support 16TB continuous space and larger volumes.
>
> (larger volumes?)

Strictly speaking, the current extent format "only" allows filesystems
up to 2^48 * blocksize bytes, typically 2^60 bytes. That in itself is
not a significant limitation IMHO, since there are a number of other
format-based limitations in this area (number of group descriptor
blocks, etc), and the overall "do we realistically expect a single
filesystem to be so big" question, that cannot be fixed by simply
increasing the addressable blocks per file.

Those format-based limits would not be present if we could handle a
larger blocksize for the filesystem, since the number of groups is
reduced by the square of the blocksize increase, as are a number of
other limits.

>> What's your opinion?
>
> I think that mailing list drama aside ;) Dave has a decent point that we
> shouldn't allow structures to scale out further than the code *using* them
> can scale.
>
> In other words, if we already have some trouble being efficient with 2^32
> blocks in a file, it is risky and perhaps unwise to allow even larger
> files, until those problems are resolved. At a minimum, I'd suggest that
> such a change should not go in until it is demonstrated that ext4 can, in
> general, handle such large file sizes efficiently.
I think the issue that Dave pointed out (efficiency of allocating large files) is one that has partially been addressed by bigalloc. Using bigalloc allows larger clusters to be allocated much more efficiently, but it only gets us part of the way there.

> It'd be nice to be able to self-host large sparse images for large fs
> testing, though. I suppose bigalloc solves that a little, though with
> some backing store space usage penalty. I suppose if a bigalloc fs is
> hosted on a bigalloc fs, things should (?) line up and be reasonable.

This is the one limitation of bigalloc - it doesn't change the underlying filesystem blocksize. That means the current extent format still cannot address more than 2^32 blocks in a single file, so self-hosting filesystem images over 16TB with 4kB blocksize is not possible with bigalloc.

It _would_ be possible with a larger filesystem blocksize, and the bigalloc code already paved the way for most of that to happen. The joy of allowing large blocks for 4kB PAGE_SIZE is that it _doesn't_ involve an on-disk format change, and would have the added benefit that it would allow mounting IA64, PPC, ARM, SPARC, etc. filesystems directly, and facilitate migration or disaster recovery from those aging platforms.

Cheers, Andreas
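A minimal sketch of the arithmetic behind the limits discussed in this message, for readers who want to check the numbers. The helper names are hypothetical and do not exist in the ext4 sources. The group-size formula assumes the default non-bigalloc layout, in which a single bitmap block of blocksize bytes tracks 8 * blocksize blocks.

```c
#include <stdint.h>

/*
 * Hypothetical helpers, for illustration only (not ext4 code).
 *
 * log2 of the number of bytes addressable when block numbers are
 * addr_bits wide and each block is 2^blkbits bytes.
 */
static inline unsigned addressable_bytes_log2(unsigned addr_bits,
                                              unsigned blkbits)
{
	return addr_bits + blkbits;
}

/*
 * Bytes covered by one block group, assuming one bitmap block per
 * group: 8 * blocksize bits in the bitmap, one bit per block.
 */
static inline uint64_t group_bytes(uint64_t blocksize)
{
	return 8 * blocksize * blocksize;
}

/* Number of block groups needed to cover fs_bytes (rounded up). */
static inline uint64_t group_count(uint64_t fs_bytes, uint64_t blocksize)
{
	uint64_t gb = group_bytes(blocksize);

	return (fs_bytes + gb - 1) / gb;
}
```

With 4kB blocks (blkbits = 12), addressable_bytes_log2(48, 12) gives 60 and addressable_bytes_log2(32, 12) gives 44, which match the 2^60-byte filesystem limit and the 16TB per-file limit mentioned above. Going from 4kB to 64kB blocks multiplies group_bytes() by 256, the square of the 16x blocksize increase, so the group count for a fixed-size filesystem shrinks by the same factor.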
end of thread (newest message: 2012-02-01 3:57 UTC)

Thread overview: 15+ messages
2012-01-23 12:51 [RFC] Add new extent structure in ext4 Robin Dong
2012-01-23 18:59 ` Ted Ts'o
2012-01-23 23:17 ` Andreas Dilger
2012-01-24 13:34 ` Jan Kara
2012-01-24 17:32 ` Andreas Dilger
2012-01-25 22:48 ` Dave Chinner
2012-01-25 23:03 ` Andreas Dilger
2012-01-27 0:19 ` Dave Chinner
2012-01-27 14:27 ` Tao Ma
2012-01-29 22:07 ` Dave Chinner
2012-01-30 22:50 ` Andreas Dilger
2012-01-30 23:52 ` Ted Ts'o
2012-02-01 3:57 ` Dave Chinner
2012-01-30 20:41 ` Eric Sandeen
2012-01-30 22:52 ` Andreas Dilger