* Questions for article
From: Thomas King @ 2008-06-02 21:50 UTC
To: linux-ext4

Folks,

I am writing an article for Linux.com to answer Henry Newman's at
http://www.enterprisestorageforum.com/sans/features/article.php/3749926.
Is there anyone that can field a few questions on ext4?

Thanks!
Tom King
* Re: Questions for article
From: Eric Sandeen @ 2008-06-02 22:30 UTC
To: Thomas King; +Cc: linux-ext4

Thomas King wrote:
> Folks,
>
> I am writing an article for Linux.com to answer Henry Newman's at
> http://www.enterprisestorageforum.com/sans/features/article.php/3749926.
> Is there anyone that can field a few questions on ext4?
>
> Thanks!
> Tom King

Honestly I'm not sure it's worth feeding the trolls... that guy has some
points but is sufficiently off-base to make me wonder if he actually has
any broad Linux filesystem experience.

...But anyway, I'd just ask the questions on-list if you don't mind a
collaborative answer. :)

-Eric
* Re: Questions for article
From: Andreas Dilger @ 2008-06-02 22:59 UTC
To: Thomas King; +Cc: linux-ext4

On Jun 02, 2008 16:50 -0500, Thomas King wrote:
> I am writing an article for Linux.com to answer Henry Newman's at
> http://www.enterprisestorageforum.com/sans/features/article.php/3749926.
> Is there anyone that can field a few questions on ext4?

It depends on what you are proposing to write... Henry's comments are
mostly accurate. There isn't even support for > 16TB filesystems in
e2fsprogs today, so I wouldn't go rushing into an email saying "ext4 can
support a single 100TB filesystem today". It wouldn't be too hard to
take a 100TB Lustre filesystem and run it on a single node, but I doubt
anyone would actually want to do that, and it still doesn't meet the
requirements of "a single instance filesystem".

What is noteworthy is that the comments about IO not being aligned to
RAID boundaries are only partly correct. This is actually done in ext4
with mballoc (assuming you set these boundaries in the superblock
manually), and is also done by XFS automatically. The RAID geometry
detection code should be added to mke2fs also, if someone would be
interested. The ext4/mballoc code does NOT align the metadata to RAID
boundaries, though this is being worked on also.

The mballoc code also does efficient block allocations (multi-MB at a
time), BUT there is no userspace interface for this yet, except
O_DIRECT. The delayed allocation (delalloc) patches for ext4 are still
in the unstable part of the patch series... What Henry is
misunderstanding here is that the filesystem blocksize isn't
necessarily the maximum unit of space allocation. I agree we could do
this more efficiently (e.g. allocate an entire 128MB block group at a
time for large files), but we haven't gotten there yet.

There are a large number of IO performance improvements in ext4 due to
work to improve IO server performance for Lustre (which Henry is of
course familiar with), and for Lustre at least we are able to get IO
performance in the 2GB/s range on 42 x 50MB/s disks with software
RAID 0 (Sun x4500), but these are with O_DIRECT.

On the fsck front, there have been performance improvements recently
(uninit_bg), and more arriving soon (flex_bg and block metadata
clustering), but that is still a long way from removing the need for
e2fsck in case of corruption.

Similarly, Lustre (with ext3) can scale to a 10M-file directory
reasonably (though not superbly) for a certain kind of workload. On the
other hand, this can be really nasty with a "readdir+stat" kind of
workload. Lustre also runs with filesystems > 250M files total, but I
haven't heard of e2fsck performance for such filesystems.

I'd personally tend to keep quiet until we CAN show that ext4 runs well
on a 100TB filesystem, that e2fsck time isn't fatal, etc.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
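A sketch of the arithmetic behind those manually-set superblock
boundaries may help. The `-E stride`/`stripe-width` mke2fs extended
options are the usual way to record them (assumed here from e2fsprogs of
that era); the array geometry below is purely illustrative:

```python
def raid_alignment_hints(chunk_kib, block_kib, data_disks):
    """Translate RAID geometry into ext4 superblock alignment hints.

    stride       = RAID chunk size measured in filesystem blocks
    stripe_width = blocks in one full data stripe (stride * data disks)
    """
    stride = chunk_kib // block_kib
    return stride, stride * data_disks

# Hypothetical array: 64 KiB chunks, 4 KiB blocks, RAID-5 on 6 disks
# (5 data disks + 1 parity):
stride, width = raid_alignment_hints(64, 4, 5)
print(f"mke2fs -E stride={stride},stripe-width={width} /dev/sdX")
# prints: mke2fs -E stride=16,stripe-width=80 /dev/sdX
```

With these hints set, mballoc can start file data on chunk boundaries
and size allocations in full-stripe multiples, avoiding RAID-5/6
read-modify-write cycles.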
* Re: Questions for article
From: Eric Sandeen @ 2008-06-03 0:40 UTC
To: Andreas Dilger; +Cc: Thomas King, linux-ext4

Andreas Dilger wrote:
> On Jun 02, 2008 16:50 -0500, Thomas King wrote:
>> I am writing an article for Linux.com to answer Henry Newman's at
>> http://www.enterprisestorageforum.com/sans/features/article.php/3749926.
>> Is there anyone that can field a few questions on ext4?
>
> It depends on what you are proposing to write... Henry's comments are
> mostly accurate.

But others are way off base IMHO, to the point where I don't put a lot
of stock in the article. fsck only checks the log? Hardly. No Linux
filesystem does proper geometry alignment? XFS has for years.

He seems to take ext3 weaknesses and extrapolate to all Linux
filesystems. The fact that he suggests testing a 500T ext3 filesystem
indicates a ... lack of research. Never mind that had he done that
research he'd have found that, well... you can't do it. :) On the one
hand it proves his point about scalability (of ext3), but on the other
hand it indicates that he hasn't completely investigated the problem of
Linux filesystem scalability himself.

He's clearly not bothered to do the tests he proposes himself. A
100-million-inode filesystem is not that uncommon on XFS, and some of
the tests he proposes are probably in daily use at SGI customers.

So writing an article about ext4 to refute all his arguments might be
premature, but dismissing all Linux filesystems based on ext3
shortcomings is also shortsighted. He has some valid points, but saying
"fscking a multi-terabyte fs is too slow on Linux" without showing that
it actually *is* slow on Linux, or that it *is* fast on $whatever_else,
is just hand-waving. On the other hand it's a very hard test for mere
mortals to run. :)

-Eric
* Re: Questions for article
From: Thomas King @ 2008-06-03 15:17 UTC
To: linux-ext4

>> I'd personally tend to keep quiet until we CAN show that ext4
>> runs well on a 100TB filesystem, that e2fsck time isn't fatal, etc.

He is fairly keen on XFS except for a couple of items: "The metadata
areas are not aligned with RAID strips and allocation units are FAR too
small but better than ext." However, some of his comments do hint that
any current filesystem technology wouldn't make him happy. ;)

Folks, thank you for suffering my questions and probing. I may post a
few more later.

Tom King
* Re: Questions for article
From: Thomas King @ 2008-06-03 15:10 UTC
To: linux-ext4

> It depends on what you are proposing to write... Henry's comments are
> mostly accurate. There isn't even support for > 16TB filesystems in
> e2fsprogs today, so I wouldn't go rushing into an email saying "ext4
> can support a single 100TB filesystem today". It wouldn't be too hard
> to take a 100TB Lustre filesystem and run it on a single node, but I
> doubt anyone would actually want to do that and it still doesn't meet
> the requirements of "a single instance filesystem".

Aye, as you probably saw in his article, he's skirting cluster
filesystems since most of the implementations he's referencing use a
single physical filesystem.

> What is noteworthy is that the comments about IO not being aligned
> to RAID boundaries are only partly correct. This is actually done in
> ext4 with mballoc (assuming you set these boundaries in the superblock
> manually), and is also done by XFS automatically. The RAID geometry
> detection code should be added to mke2fs also, if someone would be
> interested. The ext4/mballoc code does NOT align the metadata to RAID
> boundaries, though this is being worked on also.

Good to know!

> The mballoc code also does efficient block allocations (multi-MB at a
> time), BUT there is no userspace interface for this yet, except
> O_DIRECT. The delayed allocation (delalloc) patches for ext4 are
> still in the unstable part of the patch series... What Henry is
> misunderstanding here is that the filesystem blocksize isn't
> necessarily the maximum unit of space allocation. I agree we could do
> this more efficiently (e.g. allocate an entire 128MB block group at a
> time for large files), but we haven't gotten there yet.

Can I assume this (large block size) is a possibility later?

> I'd personally tend to keep quiet until we CAN show that ext4
> runs well on a 100TB filesystem, that e2fsck time isn't fatal, etc.

What will be the largest theoretical filesystem for ext4?

Here are three other features he thought necessary for massive
filesystems in Linux:

- T10 DIF (block protect?) aware filesystem
- NFSv4.1 support
- Support for proposed POSIX relaxation extensions for HPC

Are these already in ext4 or on the radar? Is there anything else y'all
would like folks to know about ext4 and its future?

Thanks!
Tom King
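The O_DIRECT interface that keeps coming up in the thread imposes
alignment rules: buffer address, file offset, and transfer length must
generally be multiples of the device's logical block size (4096 is
assumed below). A sketch assuming Linux (`os.O_DIRECT`); note that some
filesystems, tmpfs among them, reject O_DIRECT outright, so this is
illustrative rather than portable:

```python
import mmap
import os

BLOCK = 4096  # assumed logical block size; real code should query it

def pad_to_block(nbytes, block=BLOCK):
    """Round a transfer length up to the multiple O_DIRECT requires."""
    return (nbytes + block - 1) // block * block

def direct_write(path, payload):
    """Write payload bypassing the page cache. An anonymous mmap gives
    a page-aligned buffer; the length is padded to a block multiple."""
    buf = mmap.mmap(-1, pad_to_block(len(payload)))
    buf.write(payload)
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_DIRECT, 0o644)
    try:
        os.write(fd, buf)      # kernel DMAs straight from our buffer
    finally:
        os.close(fd)
        buf.close()
```

The alignment and padding bookkeeping is exactly why O_DIRECT remains a
specialist interface rather than a general userspace path to mballoc's
large allocations.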
* Re: Questions for article
From: Martin K. Petersen @ 2008-06-03 15:49 UTC
To: Thomas King; +Cc: linux-ext4

>>>>> "Thomas" == Thomas King <kingttx@tomslinux.homelinux.org> writes:

Thomas> - T10 DIF (block protect?) aware file system

I'm not really sure what the ext4 people are officially planning, but I
know from conversations with Ted and a few others that there's
interest. Wiring up ext4 to the block integrity infrastructure is
pretty easy. It's defining the tagging and making fsck use it that's
the hard part. Some of that hinges on a userland interface that I
haven't quite finished baking yet.

However, a filesystem doesn't have to be explicitly DIF-aware to take
advantage of it. Sector tagging is just icing on the cake. The current
DIF infrastructure automagically protects all I/O that doesn't already
have integrity metadata attached.

Unfortunately, ext[23] aren't working well with protection turned on
right now. The way DIF works is that you add a checksum to the I/O when
it is submitted. If there's a mismatch, the HBA or the drive will
reject the I/O. And unfortunately both ext2 and ext3 frequently modify
pages that are in flight, causing a checksum mismatch. I have yet to
try ext4.

XFS and btrfs work fine with DIF except for the generic writable mmap
hole that I think I'm about to fix.

--
Martin K. Petersen
Oracle Linux Engineering
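For context on the checksum Martin describes: T10 DIF appends an 8-byte
tuple to each 512-byte sector, and its 16-bit guard tag is a CRC over
the sector data using polynomial 0x8BB7. A bit-at-a-time sketch of that
checksum (real HBAs and drives compute it in hardware; this is only to
show the mechanism):

```python
def crc16_t10dif(data, crc=0):
    """CRC-16/T10-DIF guard tag: poly 0x8BB7, no bit reflection, no
    final XOR. A mismatch in this value is what makes the HBA or drive
    reject an I/O whose page was modified while in flight."""
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = (crc << 1) ^ 0x8BB7 if crc & 0x8000 else crc << 1
            crc &= 0xFFFF
    return crc
```

Because the guard tag is computed at submission time, any later change
to the page, as ext2/ext3 make while pages are in flight, produces a
mismatch at the HBA or drive, which is the failure mode Martin
describes.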
* Re: Questions for article
From: Andreas Dilger @ 2008-06-03 22:07 UTC
To: Thomas King; +Cc: linux-ext4

On Jun 03, 2008 10:10 -0500, Thomas King wrote:
> > The mballoc code also does efficient block allocations (multi-MB at
> > a time), BUT there is no userspace interface for this yet, except
> > O_DIRECT. The delayed allocation (delalloc) patches for ext4 are
> > still in the unstable part of the patch series... What Henry is
> > misunderstanding here is that the filesystem blocksize isn't
> > necessarily the maximum unit of space allocation. I agree we could
> > do this more efficiently (e.g. allocate an entire 128MB block group
> > at a time for large files), but we haven't gotten there yet.
>
> Can I assume this (large block size) is a possibility later?

Well, anything is a possibility later. There are no plans to implement
it.

> > I'd personally tend to keep quiet until we CAN show that ext4
> > runs well on a 100TB filesystem, that e2fsck time isn't fatal, etc.
>
> What will be the largest theoretical filesystem for ext4?

In theory, it could be 2^64 bytes in size, though common architectures
would currently be limited to 2^60 bytes due to 4kB PAGE_SIZE ==
blocksize. I'm not at all interested in "theoretical filesystem size",
however, since theory != practice, and a 2^64-byte filesystem that
takes 10 weeks to format or fsck wouldn't be very useful... Not that I
think ext4 is that bad, but I don't like to make claims based on
complete guesswork.

> Here are three other features he thought necessary for massive
> filesystems in Linux:
> - T10 DIF (block protect?) aware file system

DIF support is underway, though I'm not aware of filesystem support for
it.

> - NFSv4.1 support

In progress.

> - Support for proposed POSIX relaxation extensions for HPC

Nothing more than a proposal; it wouldn't even begin to see a Linux
implementation until there is something more than a few emails on the
list. These are mostly meaningless outside of the context of a cluster.

Don't get me wrong, these ARE things that Linux will want to implement
as filesystems and clusters get huge, and it is also my job to work on
such large file system deployments.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
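Andreas's 2^60 figure falls out of the on-disk format: the ext4 extent
format addresses physical blocks with 48-bit numbers, and with the
blocksize pinned to the common 4 KiB page size that gives 2^48 blocks
of 2^12 bytes each. A quick sanity check of the arithmetic:

```python
BLOCK_BITS = 12       # 4 KiB blocksize == PAGE_SIZE on common arches
BLOCK_NR_BITS = 48    # ext4 extents carry 48-bit physical block numbers

max_bytes = (1 << BLOCK_NR_BITS) << BLOCK_BITS
assert max_bytes == 1 << 60                # 1 EiB with 4 KiB blocks
print(max_bytes // (1 << 40), "TiB")       # prints: 1048576 TiB
```

That is 2^20 TiB, versus the roughly 16 TiB ceiling that 32-bit block
numbers impose on ext3 with 4 KiB blocks, which is the gap the thread
keeps returning to.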