Questions for article


* Questions for article
@ 2008-06-02 21:50 Thomas King
  2008-06-02 22:30 ` Eric Sandeen
  2008-06-02 22:59 ` Andreas Dilger
  0 siblings, 2 replies; 19+ messages in thread
From: Thomas King @ 2008-06-02 21:50 UTC (permalink / raw)
  To: linux-ext4

Folks,

I am writing an article for Linux.com to answer Henry Newman's at
http://www.enterprisestorageforum.com/sans/features/article.php/3749926. Is
there anyone that can field a few questions on ext4?

Thanks!
Tom King


* Re: Questions for article
  2008-06-02 21:50 Thomas King
@ 2008-06-02 22:30 ` Eric Sandeen
  2008-06-02 22:59 ` Andreas Dilger
  1 sibling, 0 replies; 19+ messages in thread
From: Eric Sandeen @ 2008-06-02 22:30 UTC (permalink / raw)
  To: Thomas King; +Cc: linux-ext4

Thomas King wrote:
> Folks,
> 
> I am writing an article for Linux.com to answer Henry Newman's at
> http://www.enterprisestorageforum.com/sans/features/article.php/3749926. Is
> there anyone that can field a few questions on ext4?
> 
> Thanks!
> Tom King

Honestly I'm not sure it's worth feeding the trolls... that guy has some
points but is sufficiently off-base to make me wonder if he actually has
any broad Linux filesystem experience.  ...But anyway, I'd just ask the
questions on-list if you don't mind a collaborative answer.  :)

-Eric


* Re: Questions for article
  2008-06-02 21:50 Thomas King
  2008-06-02 22:30 ` Eric Sandeen
@ 2008-06-02 22:59 ` Andreas Dilger
  2008-06-03  0:40   ` Eric Sandeen
  2008-06-03 15:10   ` Thomas King
  1 sibling, 2 replies; 19+ messages in thread
From: Andreas Dilger @ 2008-06-02 22:59 UTC (permalink / raw)
  To: Thomas King; +Cc: linux-ext4

On Jun 02, 2008  16:50 -0500, Thomas King wrote:
> I am writing an article for Linux.com to answer Henry Newman's at
> http://www.enterprisestorageforum.com/sans/features/article.php/3749926. Is
> there anyone that can field a few questions on ext4?

It depends on what you are proposing to write...  Henry's comments are
mostly accurate.  There isn't even support for > 16TB filesystems in
e2fsprogs today, so I wouldn't go rushing into an email saying "ext4
can support a single 100TB filesystem today".  It wouldn't be too hard
to take a 100TB Lustre filesystem and run it on a single node, but I
doubt anyone would actually want to do that and it still doesn't meet
the requirements of "a single instance filesystem".

What is noteworthy is that the comments about IO not being aligned
to RAID boundaries are only partly correct.  This is actually done in
ext4 with mballoc (assuming you set these boundaries in the superblock
manually), and is also done by XFS automatically.  The RAID geometry
detection code should be added to mke2fs also, if someone would be
interested.  The ext4/mballoc code does NOT align the metadata to RAID
boundaries, though this is being worked on also.
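
To illustrate what "aligned to RAID boundaries" means for a data
allocation, here is a minimal Python sketch.  It is not the mballoc code:
the function names and the example geometry (64 KiB chunk, 8 data disks,
4 KiB blocks) are assumptions chosen purely for illustration.  It rounds a
requested extent outward to full-stripe boundaries, so that a write covers
whole stripes.

```python
def stripe_blocks(chunk_bytes, data_disks, block_bytes=4096):
    """Number of filesystem blocks in one full RAID stripe."""
    if chunk_bytes % block_bytes:
        raise ValueError("chunk size must be a multiple of the block size")
    return (chunk_bytes // block_bytes) * data_disks


def align_extent(start, length, stripe):
    """Round [start, start + length) outward to stripe boundaries.

    All values are in filesystem blocks.
    Returns (aligned_start, aligned_length).
    """
    aligned_start = (start // stripe) * stripe
    end = start + length
    aligned_end = -(-end // stripe) * stripe  # ceiling division
    return aligned_start, aligned_end - aligned_start


# Hypothetical array: 64 KiB chunks across 8 data disks, 4 KiB blocks.
stripe = stripe_blocks(chunk_bytes=64 * 1024, data_disks=8)
print(stripe)                                              # 128
print(align_extent(start=200, length=100, stripe=stripe))  # (128, 256)
```

On parity RAID, a write that covers only part of a stripe generally
forces the array to read the rest of the stripe to recompute parity,
which is why the boundaries matter.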

The mballoc code also does efficient block allocations (multi-MB at a
time), BUT there is no userspace interface for this yet, except O_DIRECT.
The delayed allocation (delalloc) patches for ext4 are still in the unstable
part of the patch series...  What Henry is misunderstanding here is that
the filesystem blocksize isn't necessarily the maximum unit for space
allocation.  I agree we could do this more efficiently (e.g. allocate an
entire 128MB block group at a time for large files), but we haven't gotten
there yet.

There are a large number of IO performance improvements in ext4 due to
work to improve IO server performance for Lustre (which Henry is of
course familiar with), and for Lustre at least we are able to get IO
performance in the 2GB/s range on 42 50MB/s disks with software RAID 0
(Sun x4500), but these are with O_DIRECT.

On the fsck front, there have been performance improvements recently
(uninit_bg), and more arriving soon (flex_bg and block metadata
clustering), but that is still a long way from removing the need for
e2fsck in case of corruption.

Similarly, Lustre (with ext3) can scale to a 10M file directory reasonably
(though not superbly) for a certain kind of workload.  On the other hand,
this can be really nasty with a "readdir+stat" kind of workload.  Lustre
also runs with filesystems > 250M files total, but I haven't heard of
e2fsck performance for such filesystems.
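
For readers unfamiliar with the term, a "readdir+stat" workload lists a
directory and then stats every entry, much as "ls -l" does.  A minimal
Python sketch of that access pattern follows; the function name is
hypothetical, and this illustrates the shape of the workload rather than
serving as a benchmark.

```python
import os


def readdir_plus_stat(path):
    """List a directory, then stat each entry: one inode lookup per file."""
    sizes = {}
    for name in os.listdir(path):                 # the readdir part
        st = os.stat(os.path.join(path, name))    # one stat per entry
        sizes[name] = st.st_size
    return sizes
```

With millions of entries, the cost is dominated by the per-entry stat
calls, each of which may have to fetch a different inode from disk.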

I'd personally tend to keep quiet until we CAN show that ext4
runs well on a 100TB filesystem, that e2fsck time isn't fatal, etc.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


* Re: Questions for article
  2008-06-02 22:59 ` Andreas Dilger
@ 2008-06-03  0:40   ` Eric Sandeen
  2008-06-03 15:17     ` Thomas King
  2008-06-03 15:10   ` Thomas King
  1 sibling, 1 reply; 19+ messages in thread
From: Eric Sandeen @ 2008-06-03  0:40 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: Thomas King, linux-ext4

Andreas Dilger wrote:
> On Jun 02, 2008  16:50 -0500, Thomas King wrote:
>> I am writing an article for Linux.com to answer Henry Newman's at
>> http://www.enterprisestorageforum.com/sans/features/article.php/3749926. Is
>> there anyone that can field a few questions on ext4?
> 
> It depends on what you are proposing to write...  Henry's comments are
> mostly accurate.  

But others are way off base IMHO, to the point where I don't put a lot
of stock in the article.  fsck only checks the log?  Hardly.  No linux
filesystem does proper geometry alignment?  XFS has for years.

He seems to take ext3 weaknesses and extrapolate to all linux
filesystems.   The fact that he suggests testing a 500T ext3 filesystem
indicates a ... lack of research.  Never mind that had he done that
research he'd have found that you, well... you can't do it.  :)  On the
one hand it proves his point about scalability (of ext3) but on the
other hand indicates that he's not completely investigated the problem
of linux filesystem scalability, himself.

Of the tests he proposes, he's clearly not bothered to do them himself.
 A 100 million inode filesystem is not that uncommon on xfs, and some of
the tests he proposes are probably in daily use at SGI customers.

So writing an article about ext4 to refute all his arguments might be
premature, but dismissing all linux filesystems based on ext3
shortcomings is also shortsighted.  He has some valid points but saying
"fscking a multi-terabyte fs is too slow on linux" without showing that
it actually *is* slow on linux, or that it *is* fast on $whatever_else,
is just hand-waving.  On the other hand  it's a very hard test for mere
mortals to run.  :)

-Eric


* Re: Questions for article
  2008-06-02 22:59 ` Andreas Dilger
  2008-06-03  0:40   ` Eric Sandeen
@ 2008-06-03 15:10   ` Thomas King
  2008-06-03 15:49     ` Martin K. Petersen
  2008-06-03 22:07     ` Andreas Dilger
  1 sibling, 2 replies; 19+ messages in thread
From: Thomas King @ 2008-06-03 15:10 UTC (permalink / raw)
  To: linux-ext4

> On Jun 02, 2008  16:50 -0500, Thomas King wrote:
>> I am writing an article for Linux.com to answer Henry Newman's at
>> http://www.enterprisestorageforum.com/sans/features/article.php/3749926. Is
>> there anyone that can field a few questions on ext4?
>
> It depends on what you are proposing to write...  Henry's comments are
> mostly accurate.  There isn't even support for > 16TB filesystems in
> e2fsprogs today, so I wouldn't go rushing into an email saying "ext4
> can support a single 100TB filesystem today".  It wouldn't be too hard
> to take a 100TB Lustre filesystem and run it on a single node, but I
> doubt anyone would actually want to do that and it still doesn't meet
> the requirements of "a single instance filesystem".
>
Aye, as you probably saw in his article, he's skirting cluster filesystems since
most of the implementations he's referencing use a single physical filesystem.

> What is noteworthy is that the comments about IO not being aligned
> to RAID boundaries is only partly correct.  This is actually done in
> ext4 with mballoc (assuming you set these boundaries in the superblock
> manually), and is also done by XFS automatically.  The RAID geometry
> detection code should be added to mke2fs also, if someone would be
> interested.  The ext4/mballoc code does NOT align the metadata to RAID
> boundaries, though this is being worked on also.
>
Good to know!

> The mballoc code also does efficient block allocations (multi-MB at a
> time), BUT there is no userspace interface for this yet, except O_DIRECT.
> The delayed allocation (delalloc) patches for ext4 are still in the unstable
> part of the patch series...  What Henry is misunderstanding here is that
> the filesystem blocksize isn't necessarily the maximum unit for space
> allocation.  I agree we could do this more efficiently (e.g. allocate an
> entire 128MB block group at a time for large files), but we haven't gotten
> there yet.
>
Can I assume this (large block size) is a possibility later?

> There are a large number of IO performance improvements in ext4 due to
> work to improve IO server performance for Lustre (which Henry is of
> course familiar with), and for Lustre at least we are able to get IO
> performance in the 2GB/s range on 42 50MB/s disks with software RAID 0
> (Sun x4500), but these are with O_DIRECT.
>
> For the fsck front, there have been performance improvements recently
> (uninit_bg), and more arriving soon (flex_bg and block metadata
> clustering), but that is still a far way from removing the need for
> e2fsck in case of corruption.
>
> Similarly, Lustre (with ext3) can scale to a 10M file directory reasonably
> (though not superbly) for a certain kind of workload.  On the other hand,
> this can be really nasty with a "readdir+stat" kind of workload.  Lustre
> also runs with filesystems > 250M files total, but I haven't heard of
> e2fsck performance for such filesystems.
>
>
> I'd personally tend to keep quiet until we CAN show that ext4
> runs well on a 100TB filesystem, that e2fsck time isn't fatal, etc.
>
What will be the largest theoretical filesystem for ext4?
Here are three other features he thought necessary for massive filesystems in
Linux:
-T10 DIF (block protect?) aware file system
-NFSv4.1 support
-Support for proposed POSIX relaxation extensions for HPC
Are these already in ext4 or on the radar?
Is there anything else y'all would like folks to know about ext4 and its future?

>
> Cheers, Andreas
> --
> Andreas Dilger
> Sr. Staff Engineer, Lustre Group
> Sun Microsystems of Canada, Inc.

Thanks!
Tom King


* Re: Questions for article
  2008-06-03  0:40   ` Eric Sandeen
@ 2008-06-03 15:17     ` Thomas King
  0 siblings, 0 replies; 19+ messages in thread
From: Thomas King @ 2008-06-03 15:17 UTC (permalink / raw)
  To: linux-ext4

> Andreas Dilger wrote:
>> On Jun 02, 2008  16:50 -0500, Thomas King wrote:
>>> I am writing an article for Linux.com to answer Henry Newman's at
>>> http://www.enterprisestorageforum.com/sans/features/article.php/3749926. Is
>>> there anyone that can field a few questions on ext4?
>>
>> It depends on what you are proposing to write...  Henry's comments are
>> mostly accurate.
>
> But others are way off base IMHO, to the point where I don't put a lot
> of stock in the article.  fsck only checks the log?  Hardly.  No linux
> filesystem does proper geometry alignment?  XFS has for years.
>
> He seems to take ext3 weaknesses and extrapolate to all linux
> filesystems.   The fact that he suggests testing a 500T ext3 filesystem
> indicates a ... lack of research.  Never mind that had he done that
> research he'd have found that you, well... you can't do it.  :)  On the
> one hand it proves his point about scalability (of ext3) but on the
> other hand indicates that he's not completely investigated the problem
> of linux filesystem scalability, himself.
>
> Of the tests he proposes, he's clearly not bothered to do them himself.
>  A 100 million inode filesystem is not that uncommon on xfs, and some of
> the tests he proposes are probably in daily use at SGI customers.
>
> So writing an article about ext4 to refute all his arguments might be
> premature, but dismissing all linux filesystems based on ext3
> shortcomings is also shortsighted.  He has some valid points but saying
> "fscking a multi-terabyte fs is too slow on linux" without showing that
> it actually *is* slow on linux, or that it *is* fast on $whatever_else,
> is just hand-waving.  On the other hand  it's a very hard test for mere
> mortals to run.  :)
>
> -Eric

He is fairly keen on XFS except for a couple of items. "The metadata areas are
not aligned with RAID strips and allocation units are FAR too small but better
than ext." However, some of his comments do hint that any current filesystem
technology wouldn't make him happy. ;)

Folks, thank you for suffering my questions and probing. I may post a few more
later.
Tom King


* Questions for article
@ 2008-06-03 15:34 Thomas King
  2008-06-03 19:42 ` Justin Piszcz
  2008-06-04 14:52 ` Emmanuel Florac
  0 siblings, 2 replies; 19+ messages in thread
From: Thomas King @ 2008-06-03 15:34 UTC (permalink / raw)
  To: xfs

I am writing an article to answer Henry Newman's at
http://www.enterprisestorageforum.com/sans/features/article.php/3749926. I've
already been bugging folks on the ext4 mailing list and one of them mentioned I
should also send some of the same questions to this list. Please let me know if
I may do so.

Thanks!
Tom King


* Re: Questions for article
  2008-06-03 15:10   ` Thomas King
@ 2008-06-03 15:49     ` Martin K. Petersen
  2008-06-03 22:07     ` Andreas Dilger
  1 sibling, 0 replies; 19+ messages in thread
From: Martin K. Petersen @ 2008-06-03 15:49 UTC (permalink / raw)
  To: Thomas King; +Cc: linux-ext4

>>>>> "Thomas" == Thomas King <kingttx@tomslinux.homelinux.org> writes:

Thomas> - T10 DIF (block protect?) aware file system

I'm not really sure what the ext4 people are officially planning but I
know from conversations with Ted and a few others that there's
interest.  Wiring up ext4 to the block integrity infrastructure is
pretty easy.  It's defining the tagging and making fsck use it that's
the hard part.  Some of that hinges on a userland interface that I
haven't quite finished baking yet.

However, a filesystem doesn't have to be explicitly DIF-aware to take
advantage of it.  Sector tagging is just icing on the cake.  The
current DIF infrastructure automagically protects all I/O that doesn't
already have integrity metadata attached.

Unfortunately, ext[23] aren't working well with protection turned on
right now.  The way DIF works is that you add a checksum to the I/O
when it is submitted.  If there's a mismatch, the HBA or the drive
will reject the I/O.  And unfortunately both ext2 and ext3 frequently
modify pages that are in flight, causing a checksum mismatch.  I have
yet to try ext4.
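
The failure mode described above can be sketched in a few lines of
Python.  This is a toy model, not the kernel's block integrity code: the
helper names are hypothetical, and the guard tag is computed here as a
CRC-16 with polynomial 0x8BB7, which is assumed to match the T10 DIF
guard and should be checked against the specification.  A checksum is
attached when the I/O is submitted; if the page changes before the device
verifies it, the comparison fails.

```python
def crc16_t10(data):
    """Bitwise CRC-16, polynomial 0x8BB7, initial value 0, no reflection."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x8BB7) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def submit(page):
    """Compute the guard tag at submission time."""
    return crc16_t10(bytes(page))


def device_verify(page, guard):
    """What the HBA or drive does on receipt: recompute and compare."""
    return crc16_t10(bytes(page)) == guard


page = bytearray(512)                    # one 512-byte sector of zeroes
guard = submit(page)
ok_before = device_verify(page, guard)   # data unchanged since submission
page[0] ^= 0xFF                          # page modified while in flight
ok_after = device_verify(page, guard)    # guard no longer matches
print(ok_before, ok_after)               # True False
```

In this model the only fix is for the submitter to keep the page stable
between computing the guard and completion of the I/O.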

XFS and btrfs work fine with DIF except for the generic writable mmap
hole that I think I'm about to fix.

-- 
Martin K. Petersen	Oracle Linux Engineering


* Re: Questions for article
  2008-06-03 15:34 Thomas King
@ 2008-06-03 19:42 ` Justin Piszcz
  2008-06-04 14:52 ` Emmanuel Florac
  1 sibling, 0 replies; 19+ messages in thread
From: Justin Piszcz @ 2008-06-03 19:42 UTC (permalink / raw)
  To: Thomas King; +Cc: xfs



On Tue, 3 Jun 2008, Thomas King wrote:

> I am writing an article to answer Henry Newman's at
> http://www.enterprisestorageforum.com/sans/features/article.php/3749926. I've
> already been bugging folks on the ext4 mailing list and one of them mentioned I
> should also send some of the same questions to this list. Please let me know if
> I may do so.
>
> Thanks!
> Tom King
>
>

What are the questions?

Justin.


* Re: Questions for article
@ 2008-06-03 20:48 Thomas King
  2008-06-03 22:00 ` Martin K. Petersen
                   ` (2 more replies)
  0 siblings, 3 replies; 19+ messages in thread
From: Thomas King @ 2008-06-03 20:48 UTC (permalink / raw)
  To: xfs

>
>
> On Tue, 3 Jun 2008, Thomas King wrote:
>
>> I am writing an article to answer Henry Newman's at
>> http://www.enterprisestorageforum.com/sans/features/article.php/3749926. I've
>> already been bugging folks on the ext4 mailing list and one of them
>> mentioned I should also send some of the same questions to this list.
>> Please let me know if I may do so.
>>
>> Thanks!
>> Tom King
>>
>>
>
> What are the questions?
>
> Justin.

For the most part, XFS is used for massive filesystems (hundreds of petabytes)
successfully in Linux (among other OS's). However, Mr. Newman still believes
there are details that XFS doesn't cover or that Linux limits (such as page
sizes on x86 limiting block sizes).

With that preface, here are some questions:
-Is XFS fully RAID aware in that it aligns metadata with RAID stripes? Some of
the information I see states XFS can get geometry information from LVM and MD,
but what about hardware RAID?
-Does XFS take advantage of T10 DIF (block protection?)?
-Does/Will XFS support NFS v4.1?
-Concerning the block-size limit, will this eventually be a thing of the past?
Mr. Newman's contention is massive filesystems should have much larger block
sizes, but he also contends that OSD is the eventual answer instead of using
block allocation.
-Is there anything else y'all would like folks to understand about XFS and
massive implementations?

Thanks!
Tom King


* Re: Questions for article
  2008-06-03 20:48 Questions for article Thomas King
@ 2008-06-03 22:00 ` Martin K. Petersen
  2008-06-03 22:14 ` Eric Sandeen
  2008-06-04  5:31 ` Christoph Hellwig
  2 siblings, 0 replies; 19+ messages in thread
From: Martin K. Petersen @ 2008-06-03 22:00 UTC (permalink / raw)
  To: Thomas King; +Cc: xfs

>>>>> "Thomas" == Thomas King <kingttx@tomslinux.homelinux.org> writes:

Thomas> -Is XFS fully RAID aware in that it aligns metadata with RAID
Thomas> stripes? Some of the information I see states XFS can get
Thomas> geometry information from LVM and MD, but what about hardware
Thomas> RAID? 

The stuff that queries MD/LVM for stripe unit/stripe size has been in
XFS for a while[1].

For hardware RAID there is no non-proprietary way to obtain the
information from the device.  So whoever runs mkfs on a hardware RAID
device must manually specify the geometry using the sunit and swidth
parameters.  That capability has been there since the dawn of time.
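
As a worked example of specifying the geometry by hand, the sketch below
converts a RAID chunk size and data-disk count into sunit and swidth
values.  It assumes both are given to mkfs.xfs in 512-byte units; the
example array (64 KiB chunk, 8 data disks), the function name and the
/dev/sdX placeholder are made up, so check the mkfs.xfs man page for your
version before relying on it.

```python
SECTOR = 512  # sunit/swidth are assumed to be expressed in 512-byte units


def xfs_geometry(chunk_bytes, data_disks):
    """Return (sunit, swidth) in 512-byte sectors for a striped array."""
    if chunk_bytes % SECTOR:
        raise ValueError("chunk size must be a multiple of 512 bytes")
    sunit = chunk_bytes // SECTOR      # one chunk on one disk
    swidth = sunit * data_disks        # one full stripe across data disks
    return sunit, swidth


sunit, swidth = xfs_geometry(chunk_bytes=64 * 1024, data_disks=8)
print(f"mkfs.xfs -d sunit={sunit},swidth={swidth} /dev/sdX")
# prints: mkfs.xfs -d sunit=128,swidth=1024 /dev/sdX
```

Only data-bearing disks count toward swidth, so parity disks should be
excluded when filling in data_disks.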

Note that the upcoming version of SBC-3 (SCSI Block Commands)
finally features a VPD page that the array firmware can fill out to
let the operating system know about stripe size, etc.  I have been
working on a patch that extracts this information and presents it to
the block layer in a generic fashion.  But so far I have not seen a
single array that implements said VPD page.  IOW, there hasn't been
much motivation to finish that work.

Also, SBC-3 is work in progress.  The standard has not been ratified
yet so things could change before it is released.  I doubt they are
going to change the block limits VPD, but who knows?

Thomas> -Does XFS take advantage of T10 DIF (block protection?)?

As I mentioned earlier today, filesystems do not need to be explicitly
DIF-aware.  I/Os submitted by XFS will be protected if the kernel does
DIF.

The DIF support has not been accepted upstream yet.  Working on that.
But in any case DIF-capable hardware is not generally available.

[1] http://www.linux.sgi.com/archives/xfs/2001-03/msg00435.html

-- 
Martin K. Petersen	Oracle Linux Engineering


* Re: Questions for article
  2008-06-03 15:10   ` Thomas King
  2008-06-03 15:49     ` Martin K. Petersen
@ 2008-06-03 22:07     ` Andreas Dilger
  1 sibling, 0 replies; 19+ messages in thread
From: Andreas Dilger @ 2008-06-03 22:07 UTC (permalink / raw)
  To: Thomas King; +Cc: linux-ext4

On Jun 03, 2008  10:10 -0500, Thomas King wrote:
> > The mballoc code also does efficient block allocations (multi-MB at a
> > time), BUT there is no userspace interface for this yet, except O_DIRECT.
> > The delayed allocation (delalloc) patches for ext4 are still in the unstable
> > part of the patch series...  What Henry is misunderstanding here is that
> > the filesystem blocksize isn't necessarily the maximum unit for space
> > allocation.  I agree we could do this more efficiently (e.g. allocate an
> > entire 128MB block group at a time for large files), but we haven't gotten
> > there yet.
>
> Can I assume this (large block size) is a possibility later?

Well, anything is a possibility later.  There are no plans to implement it.

> > I'd personally tend to keep quiet until we CAN show that ext4
> > runs well on a 100TB filesystem, that e2fsck time isn't fatal, etc.
>
> What will be the largest theoretical filesystem for ext4?

In theory, it could be 2^64 bytes in size, though common architectures
would currently be limited to 2^60 bytes due to 4kB PAGE_SIZE == blocksize.
I'm not at all interested in "theoretical filesystem size", however, since
theory != practise and a 2^64-byte filesystem that takes 10 weeks to format
or fsck wouldn't be very useful...  Not that I think ext4 is that bad, but
I don't like to make claims based on complete guesswork.
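
The arithmetic behind these limits is short enough to spell out.  The
sketch below assumes 48-bit block numbers for ext4 and 32-bit block
numbers for the current 16TB e2fsprogs limit mentioned earlier in the
thread; those widths are not stated in this message and are assumptions
here, as is the function name.

```python
def max_fs_bytes(block_number_bits, block_size):
    """Largest addressable filesystem: number of blocks times block size."""
    return (1 << block_number_bits) * block_size


BLOCK = 4096  # 4kB blocksize == PAGE_SIZE on common architectures

ext4_limit = max_fs_bytes(48, BLOCK)    # assumed 48-bit block numbers
tools_limit = max_fs_bytes(32, BLOCK)   # assumed 32-bit block numbers

print(ext4_limit == 2**60)              # True
print(tools_limit // 2**40, "TiB")      # 16 TiB
```

Under the same assumption, a 64kB blocksize would give
max_fs_bytes(48, 65536) == 2**64, which matches the theoretical figure
quoted above.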

> Here are three other features he thought necessary for massive filesystems in
> Linux:
> -T10 DIF (block protect?) aware file system

- DIF support is underway, though I'm not aware of filesystem support for it

> -NFSv4.1 support

- in progress

> -Support for proposed POSIX relaxation extensions for HPC

- nothing more than a proposal, it wouldn't even begin to see Linux
  implementation until there is something more than a few emails on
  the list.  These are mostly meaningless outside of the context of
  a cluster.

Don't get me wrong, these ARE things that Linux will want to implement
as filesystems and clusters get huge, and it is also my job to work on
such large file system deployments.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.



* Re: Questions for article
  2008-06-03 20:48 Questions for article Thomas King
  2008-06-03 22:00 ` Martin K. Petersen
@ 2008-06-03 22:14 ` Eric Sandeen
  2008-06-03 22:19   ` Thomas King
  2008-06-04  5:28   ` Christoph Hellwig
  2008-06-04  5:31 ` Christoph Hellwig
  2 siblings, 2 replies; 19+ messages in thread
From: Eric Sandeen @ 2008-06-03 22:14 UTC (permalink / raw)
  To: Thomas King; +Cc: xfs

Thomas King wrote:

> -Concerning the block-size limit, will this eventually be a thing of the past?
> Mr. Newman's contention is massive filesystems should have much larger block
> sizes, but he also contends that OSD is the eventual answer instead of using
> block allocation.

Just to reiterate what I already put on the ext4 list... :)

ftp://ftp.kernel.org/pub/linux/kernel/people/christoph/largeblocksize/4/patches/
http://kerneltrap.org/Linux/Large_Blocksize_Performance

Not sure where those patches are headed.

It's also not clear to me that this is really a critical feature for
large filesystems; space allocation is not done block by block per se in
xfs, as Mr. Newman seems (?) to imply (?)  The block granularity is
there throughout the fs but I'm not sure how much it matters in
practice.  Dave...?

OSDs may have their place, we'll see.  It's pretty new stuff (unless you
count Lustre, I guess, but I thought he didn't want to talk lustre...)
I don't think this relates to a linux shortcoming in any way (or to
xfs...), it's  awfully new stuff that just about nobody really has in
production.

> -Is there anything else y'all would like folks to understand about XFS and
> massive implementations?

I already pointed him at the xfs_repair paper, since he seems concerned
about fsck (and pointed out that yes, xfs_repair really *DOES* check all
filesystem data and does not simply replay the log...)

http://mirror.linux.org.au/pub/linux.conf.au/2008/slides/135-fixing_xfs_faster.pdf

Maybe some of the folks on the list with said massive implementations
can speak up too.  :)

-Eric


* Re: Questions for article
  2008-06-03 22:14 ` Eric Sandeen
@ 2008-06-03 22:19   ` Thomas King
  2008-06-04  5:28   ` Christoph Hellwig
  1 sibling, 0 replies; 19+ messages in thread
From: Thomas King @ 2008-06-03 22:19 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: xfs

> Thomas King wrote:
>
>> -Concerning the block-size limit, will this eventually be a thing of the past?
>> Mr. Newman's contention is massive filesystems should have much larger block
>> sizes, but he also contends that OSD is the eventual answer instead of using
>> block allocation.
>
> Just to reiterate what I already put on the ext4 list... :)
>
> ftp://ftp.kernel.org/pub/linux/kernel/people/christoph/largeblocksize/4/patches/
> http://kerneltrap.org/Linux/Large_Blocksize_Performance
>
> Not sure where those patches are headed.
>
> It's also not clear to me that this is really a critical feature for
> large filesystems; space allocation is not done block by block per se in
> xfs, as Mr. Newman seems (?) to imply (?)  The block granularity is
> there throughout the fs but I'm not sure how much it matters in
> practice.  Dave...?
>
> OSDs may have their place, we'll see.  It's pretty new stuff (unless you
> count Lustre, I guess, but I thought he didn't want to talk lustre...)
> I don't think this relates to a linux shortcoming in any way (or to
> xfs...), it's  awfully new stuff that just about nobody really has in
> production.
>
>> -Is there anything else y'all would like folks to understand about XFS and
>> massive implementations?
>
> I already pointed him at the xfs_repair paper, since he seems concerned
> about fsck (and pointed out that yes, xfs_repair really *DOES* check all
> filesystem data and does not simply replay the log...)
>
> http://mirror.linux.org.au/pub/linux.conf.au/2008/slides/135-fixing_xfs_faster.pdf
>
> Maybe some of the folks on the list with said massive implementations
> can speak up too.  :)
>
> -Eric
>
Both you and Andreas gave me some excellent information on both lists, and thank
you all for your patience. I appreciate everyone piping in. Like you say, if
there is anyone with massive implementations that wishes to add, please do so.

Thanks!
Tom King

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Questions for article
  2008-06-03 22:14 ` Eric Sandeen
  2008-06-03 22:19   ` Thomas King
@ 2008-06-04  5:28   ` Christoph Hellwig
  1 sibling, 0 replies; 19+ messages in thread
From: Christoph Hellwig @ 2008-06-04  5:28 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: Thomas King, xfs

On Tue, Jun 03, 2008 at 05:14:44PM -0500, Eric Sandeen wrote:
> It's also not clear to me that this is really a critical feature for
> large filesystems; space allocation is not done block by block per se in
> xfs, as Mr. Newman seems (?) to imply (?)  The block granularity is
> there throughout the fs but I'm not sure how much it matters in
> practice.  Dave...?

For streaming I/O workloads it doesn't matter anymore, see Dave's 2006
OLS talk.  The direct to bio I/O path mitigates any blocksize impact.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Questions for article
  2008-06-03 20:48 Questions for article Thomas King
  2008-06-03 22:00 ` Martin K. Petersen
  2008-06-03 22:14 ` Eric Sandeen
@ 2008-06-04  5:31 ` Christoph Hellwig
  2008-06-04 14:16   ` Thomas King
  2 siblings, 1 reply; 19+ messages in thread
From: Christoph Hellwig @ 2008-06-04  5:31 UTC (permalink / raw)
  To: Thomas King; +Cc: xfs

On Tue, Jun 03, 2008 at 03:48:49PM -0500, Thomas King wrote:
> For the most part, XFS is used for massive filesystems (hundreds of petabytes)

I think hundreds of petabytes is not something we commonly see today :)
Hundreds of TB is more reasonable.

> -Does/Will XFS support NFS v4.1?

I suspect he means support for PNFS.  PNFS is just like CXFS over
sunrpc, so for anyone who cares, adding an XFS layout driver shouldn't
be a problem, and should not actually require changes to the disk format or
low-level XFS code.  Note that I think pnfs a really good idea.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Questions for article
  2008-06-04  5:31 ` Christoph Hellwig
@ 2008-06-04 14:16   ` Thomas King
  2008-06-04 15:06     ` Eric Sandeen
  0 siblings, 1 reply; 19+ messages in thread
From: Thomas King @ 2008-06-04 14:16 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs

> On Tue, Jun 03, 2008 at 03:48:49PM -0500, Thomas King wrote:
>> For the most part, XFS is used for massive filesystems (hundreds of petabytes)
>
> I think hundreds of petabytes is not something we commonly see today :)
> hundreds of TB is more reasonable.

If I'm going to answer his two articles, he's speaking in the context of massive
filesystems. True, hundreds of petabytes are not common but that's the
environment he's talking about.

From what I'm seeing from XFS, BTRFS, ext4, and HAMMER, Linux filesystems are
going to easily keep up with the current trend. For the massive filesystems
Henry speaks of, XFS has some new features that I don't think he's aware of and
that need to come out in this answer.

Tom King

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Questions for article
  2008-06-03 15:34 Thomas King
  2008-06-03 19:42 ` Justin Piszcz
@ 2008-06-04 14:52 ` Emmanuel Florac
  1 sibling, 0 replies; 19+ messages in thread
From: Emmanuel Florac @ 2008-06-04 14:52 UTC (permalink / raw)
  To: Thomas King; +Cc: xfs

On Tue, 3 Jun 2008 10:34:48 -0500 (CDT)
Thomas King <kingttx@tomslinux.homelinux.org> wrote:

> I am writing an article to answer Henry Newman's at
> http://www.enterprisestorageforum.com/sans/features/article.php/3749926.
> I've already been bugging folks on the ext4 mailing list and one of
> them mentioned I should also send some of the same questions to this
> list. Please let me know if I may do so.

Seems like a good idea. This guy doesn't even mention XFS, while it's
more or less the only viable option for big filesystems (more than 8TB).
I currently use 30, 40TB XFS filesystems that work just fine.

I've already compared all filesystems: XFS works great for big
filesystems. JFS works well too; however, it lacks a defragmenting
utility, which is quite a problem for big filesystems with lots of write
activity. reiserfs 3.6 simply breaks over 4TB; mkfs.ext3 is so slow
that it's a problem from the start, and then the performance is abysmal.

-- 
----------------------------------------
Emmanuel Florac     |   Intellique
----------------------------------------

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Questions for article
  2008-06-04 14:16   ` Thomas King
@ 2008-06-04 15:06     ` Eric Sandeen
  0 siblings, 0 replies; 19+ messages in thread
From: Eric Sandeen @ 2008-06-04 15:06 UTC (permalink / raw)
  To: Thomas King; +Cc: Christoph Hellwig, xfs

Thomas King wrote:
>> On Tue, Jun 03, 2008 at 03:48:49PM -0500, Thomas King wrote:
>>> For the most part, XFS is used for massive filesystems (hundreds of petabytes)
>> I think hundreds of petabytes is not something we commonly see today :)
>> hundreds of TB is more reasonable.
> 
> If I'm going to answer his two articles, he's speaking in the context of massive
> filesystems. True, hundreds of petabytes are not common but that's the
> environment he's talking about.
> 
> From what I'm seeing from XFS, BTRFS, ext4, and HAMMER, Linux filesystems are
> going to easily keep up with the current trend. For the massive filesystems
> Henry speaks of, XFS has some new features I don't think he's aware of and needs
> to come out in this answer.
> 
> Tom King

One thing I would be careful of is not to fall into the trap of letting
Linux filesystems get bashed over things that *nobody* really has today.
Stuff like PNFS, OSD, DIF, etc. is bleeding-edge for almost *everybody*.

Petabyte filesystems are hard.  For *everybody*.

And hundred-petabyte filesystems aren't just uncommon, they don't exist
AFAIK.

-Eric

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2008-06-04 15:05 UTC | newest]

Thread overview: 19+ messages
2008-06-03 20:48 Questions for article Thomas King
2008-06-03 22:00 ` Martin K. Petersen
2008-06-03 22:14 ` Eric Sandeen
2008-06-03 22:19   ` Thomas King
2008-06-04  5:28   ` Christoph Hellwig
2008-06-04  5:31 ` Christoph Hellwig
2008-06-04 14:16   ` Thomas King
2008-06-04 15:06     ` Eric Sandeen
  -- strict thread matches above, loose matches on Subject: below --
2008-06-03 15:34 Thomas King
2008-06-03 19:42 ` Justin Piszcz
2008-06-04 14:52 ` Emmanuel Florac
2008-06-02 21:50 Thomas King
2008-06-02 22:30 ` Eric Sandeen
2008-06-02 22:59 ` Andreas Dilger
2008-06-03  0:40   ` Eric Sandeen
2008-06-03 15:17     ` Thomas King
2008-06-03 15:10   ` Thomas King
2008-06-03 15:49     ` Martin K. Petersen
2008-06-03 22:07     ` Andreas Dilger
