public inbox for linux-xfs@vger.kernel.org
* avoid mbox file fragmentation
@ 2010-10-19 23:04 Stan Hoeppner
  2010-10-19 23:42 ` Dave Chinner
  2010-10-20 11:21 ` avoid mbox file fragmentation Peter Grandi
  0 siblings, 2 replies; 20+ messages in thread
From: Stan Hoeppner @ 2010-10-19 23:04 UTC (permalink / raw)
  To: xfs

In an effort to maximize mbox performance and minimize fragmentation I'm
using these mount options in fstab.  This is on a Debian Lenny box but
with vanilla 2.6.34.1 rolled from kernel.org source (Lenny ships with
2.6.26).  xfsprogs is 2.9.8.

/dev/sda6  /home   xfs   defaults,logbufs=8,logbsize=256k,allocsize=1m

Since the actual XFS mount defaults aren't consistently published
anywhere that I can find I'm manually specifying logbufs and logbsize.

I added allocsize=1m as my read of the man page suggests this will
preallocate an additional 1MB of extent space at the end of each mbox
file each time it is written, which I would think should eliminate
fragmentation of these files.  However, this doesn't seem to be
eliminating the fragmentation.  I added allocsize=1m at a date after all
of the mbox files in question already existed.  Does allocsize=1m only
affect new files or does it preallocate at the end of existing files?

I've probably totally misread what allocsize= actually does.  Please
educate me.  If allocsize= doesn't help prevent fragmentation of mbox
files, what can I do to mitigate this, other than regularly running xfs_fsr?

Filesystem    Type    Size  Used Avail Use% Mounted on
/dev/sda6      xfs     94G  1.3G   92G   2% /home

Filesystem    Type    Inodes   IUsed   IFree IUse% Mounted on
/dev/sda6      xfs       94M    1.1K     94M    1% /home

actual 1096, ideal 1011, fragmentation factor 7.76%

I've recently run xfs_fsr thus the low 7% figure.  Every couple of weeks
the fragmentation reaches ~30%.  I save a lot of list mail, dozens to
hundreds per day for each of about 7 foss mailing lists.  As I say, in
just a couple of weeks, these mbox files become fragmented, on the order
of a dozen to a few dozen extents per mbox file.  With so much free
space available on this filesystem, why aren't/weren't these files being
spread out with sufficient space between them to prevent fragmentation?

P.S.

(Dave or someone has suggested on list that with newer kernels the
defaults for these two (and other) mount options do not match those
suggested in the man pages.  I requested a feature some time ago that
would actually put these default values in /proc files to eliminate any
doubt as to what the actual defaults being used are.  I don't recall if
anything came of this.  I've seen many an OP get "scolded" on list for
manually specifying values that were apparently equal to the "default"
values as stated by the responding dev.  This problem would never exist
if the documentation was complete and consistent.  If it already is,
then something is wrong, as myself and hordes of other OPs aren't able
to locate this definitive information regarding mount defaults.)

-- 
Stan

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: avoid mbox file fragmentation
  2010-10-19 23:04 avoid mbox file fragmentation Stan Hoeppner
@ 2010-10-19 23:42 ` Dave Chinner
  2010-10-20  2:36   ` Stan Hoeppner
                     ` (2 more replies)
  2010-10-20 11:21 ` avoid mbox file fragmentation Peter Grandi
  1 sibling, 3 replies; 20+ messages in thread
From: Dave Chinner @ 2010-10-19 23:42 UTC (permalink / raw)
  To: Stan Hoeppner; +Cc: xfs

On Tue, Oct 19, 2010 at 06:04:35PM -0500, Stan Hoeppner wrote:
> In an effort to maximize mbox performance and minimize fragmentation I'm
> using these mount options in fstab.  This is on a Debian Lenny box but
> with vanilla 2.6.34.1 rolled from kernel.org source (Lenny ships with
> 2.6.26).  xfsprogs is 2.9.8.
....
> I've recently run xfs_fsr thus the low 7% figure.  Every couple of weeks
> the fragmentation reaches ~30%.  I save a lot of list mail, dozens to
> hundreds per day for each of about 7 foss mailing lists.  As I say, in
> just a couple of weeks, these mbox files become fragmented, on the order
> of a dozen to a few dozen extents per mbox file.  With so much free
> space available on this filesystem, why aren't/weren't these files being
> spread out with sufficient space between them to prevent fragmentation?

I've explained how allocsize works, and that speculative allocation
gets truncated away when the file is closed. Hence if the application
is doing:

	open()
	seek(EOF)
	write()
	close()

allocsize does not help you because the speculative prealloc done in
write() is truncated away in close().

What you want is _physical_ preallocation, not speculative
preallocation. i.e. look up XFS_IOC_RESVSP or FIEMAP so your
application does _permanent_ preallocation past EOF. Alternatively, the
filesystem will avoid the truncation on close() if the file has the
APPEND attribute set and the application is writing via O_APPEND...

The filesystem cannot do everything for you. Sometimes the
application has to help....

Cheers,

Dave.
> (Dave or someone has suggested on list that with newer kernels the
> defaults for these two (and other) mount options do not match those
> suggested in the man pages.  I requested a feature some time ago that
> would actually put these default values in /proc files to eliminate any
> doubt as to what the actual defaults being used are.  I don't recall if
> anything came of this.

/proc/mounts displays some default options.

> I've seen many an OP get "scolded" on list for
> manually specifying values that were apparently equal to the "default"
> values as stated by the responding dev.  This problem would never exist
> if the documentation was complete and consistent.

As I say to _everyone_ who complains about this: Patches
to correct documentation issues are greatly appreciated. You don't
need to be a coding expert to send a patch correcting a man page....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: avoid mbox file fragmentation
  2010-10-19 23:42 ` Dave Chinner
@ 2010-10-20  2:36   ` Stan Hoeppner
  2010-10-20 11:31     ` Peter Grandi
  2010-10-20  3:03   ` Stan Hoeppner
  2010-10-20 11:50   ` Peter Grandi
  2 siblings, 1 reply; 20+ messages in thread
From: Stan Hoeppner @ 2010-10-20  2:36 UTC (permalink / raw)
  To: xfs

Dave Chinner put forth on 10/19/2010 6:42 PM:

> I've explained how allocsize works, and that speculative allocation
> gets truncated away when the file is closed. Hence if the application
> is doing:

My apologies that missing this required another post.  It is appreciated.

> 	open()
> 	seek(EOF)
> 	write()
> 	close()

The application is Dovecot IMAP server.  I'm not positive, but I believe
this is the append method it uses, as do most mbox writers.

> allocsize does not help you because the speculative prealloc done in
> write() is truncated away in close().
> 
> What you want is _physical_ preallocation, not speculative
> preallocation. i.e. look up XFS_IOC_RESVSP or FIEMAP so your
> application does _permanent_ preallocation past EOF. Alternatively, the
> filesystem will avoid the truncation on close() if the file has the
> APPEND attribute set and the application is writing via O_APPEND...

Thank you very much, Dave.  I'll see if I can get Timo to implement this.
Odds are long, as I believe mbox has a very small user base today
compared to maildir and thus has lower development priority.  If it's
relatively easy to implement, maybe he can/will do it.

> The filesystem cannot do everything for you. Sometimes the
> application has to help....

Agreed.  I'm far from an expert on either, but I've been around the
block enough, and been on this list long enough, to know this. :)
Please don't think I was "blaming" XFS for the fragmentation.  Any file
that constantly gets appended, forever, is going to fragment.  Fact of
life.  I just thought there may be a tweak in XFS to lessen the
fragmentation effects.  Maybe it's simply time for me to move to maildir
format which pretty much eliminates fragmentation altogether.  The main
reason I like mbox is the fast full body searching of mail folders.
Doing so with maildir crawls, relatively speaking, especially for
folders with 15k+ emails.  The second is portability of mbox files.
Just about any MUA can read them.  Backup/restore is fast and
straightforward, although restoring individual messages is a bear.

> /proc/mounts displays some default options.

Yep.  We went through them together previously.  The main ones missing
there are the performance enhancing (log) options.

> As I say to _everyone_ who complains about this: Patches
> to correct documentation issues are greatly appreciated. You don't
> need to be a coding expert to send a patch correcting a man page....

I would love to do so, Dave.  However, I don't know at what rev the
defaults changed.  What has been stated on list recently is that
recent kernels default to both logbufs=8 and logbsize=256k, which are
the maximum values according to my man page for mount.  My man page for
mount is rather old, as Debian Stable lags well behind upstream, and it
states these default values are based on the fs block size and the
machine's memory size.  Maybe the current man page has the correct info
already?  Anyone have a link to the most current man page for mount?

I guess this is one of the downsides to using a newer rolled kernel in
place of a distro's default kernel?  The man page doesn't get updated
and doesn't reflect features/defaults of the new kernel?

-- 
Stan


* Re: avoid mbox file fragmentation
  2010-10-19 23:42 ` Dave Chinner
  2010-10-20  2:36   ` Stan Hoeppner
@ 2010-10-20  3:03   ` Stan Hoeppner
  2010-10-21  1:55     ` Dave Chinner
  2010-10-20 11:50   ` Peter Grandi
  2 siblings, 1 reply; 20+ messages in thread
From: Stan Hoeppner @ 2010-10-20  3:03 UTC (permalink / raw)
  To: xfs

Dave Chinner put forth on 10/19/2010 6:42 PM:

> I've explained how allocsize works, and that speculative allocation
> gets truncated away when the file is closed. Hence if the application
> is doing:
> 
> 	open()
> 	seek(EOF)
> 	write()
> 	close()

I don't know if it changes anything in the sequence above, but Dovecot
uses mmap i/o.  As I've said, I'm not a dev.  Just thought this
could/might be relevant.  Would using mmap be compatible with physical
preallocation?

-- 
Stan


* Re: avoid mbox file fragmentation
  2010-10-19 23:04 avoid mbox file fragmentation Stan Hoeppner
  2010-10-19 23:42 ` Dave Chinner
@ 2010-10-20 11:21 ` Peter Grandi
  1 sibling, 0 replies; 20+ messages in thread
From: Peter Grandi @ 2010-10-20 11:21 UTC (permalink / raw)
  To: Linux XFS

[ ... ]

> I added allocsize=1m as my read of the man page suggests this
> will preallocate an additional 1MB of extent space at the end
> of each mbox file each time it is written, which I would think
> should eliminate fragmentation of these files. [ ... ] I've
> probably totally misread what allocsize= actually does. [
> ... ]

That is not really the point. The point is that it is extremely
difficult, if not pointless, to ensure contiguous allocation for N
slowly growing streams (see below) on a 1-dimensional storage
device in the general case. One does not need to read any
introductory file system tutorial to figure that out; just a bit of
thinking it through will do.

> If allocsize= doesn't help prevent fragmentation of mbox
> files, what can I do to mitigate this, other than regularly
> running xfs_fsr?

'xfs_fsr' is not really a defragmenter; it just fixes some of
the worst cases, if there is a lot of free space.

> I've recently run xfs_fsr thus the low 7% figure.  Every
> couple of weeks the fragmentation reaches ~30%.  I save a lot
> of list mail, dozens to hundreds per day for each of about 7
> foss mailing lists.

That is the case where one has a number of slowly growing
"stream" files, such as multiple parallel downloads, or log
files.

> As I say, in just a couple of weeks, these mbox files become
> fragmented, on the order of a dozen to a few dozen extents per
> mbox file.

That's not a lot, and not necessarily something to worry about.
What matters is the size of the fragments and their distance in terms
of arm positioning times.

Anyhow in general file system allocators are designed to give
decent average performance, not to give perfect allocations in
edge cases (especially not those requiring predictive instead of
adaptive algorithms), unlike so many newbies posting to this
list seem to assume.

If you want perfect allocation strategies for edge cases, please
write your own abstract storage system. DBMS authors do that for
example.


* Re: avoid mbox file fragmentation
  2010-10-20  2:36   ` Stan Hoeppner
@ 2010-10-20 11:31     ` Peter Grandi
  0 siblings, 0 replies; 20+ messages in thread
From: Peter Grandi @ 2010-10-20 11:31 UTC (permalink / raw)
  To: Linux XFS

[ ... ]

> Odds are long as I believe mbox has a very small user base
> today compared to maildir [ ... ]

Why? It is a good format and lots and lots of programs default
to it. Maildir is the result of the usual "file systems are the
optimal DBMS for databases with many small records" attitude.

Actually it was designed to avoid locking issues on filesystems
with poor locking implementations, for things like spool files
with high update rates, not for archiving emails, for which 'mbox'
is usually a lot better.

> I just thought there may be a tweak in XFS to lessen the
> fragmentation effects.

The simplest tweak would be to write a predictive custom allocator
with advising (which is sort of what DaveC said), as nothing else
would do.

> Maybe it's simply time for me to move to maildir format which
> pretty much eliminates fragmentation altogether.

HAHAHAHAHAHAHA! Good one. Genius level insight. Just par for the
course for many posts to this (and others) mailing list.

As a curiosity, how do you square this amazing insight with
writing later that:

> The main reason I like mbox is the fast full body searching of
> mail folders. Doing so with maildir crawls, relatively speaking,
> especially for folders with 15k+ emails.

This reminds me of the other recent post where some other genius
thought that they were sorting the contents of files by changing
the order of directory entries and wondered why the result was so
slow.

[ ... ]


* Re: avoid mbox file fragmentation
  2010-10-19 23:42 ` Dave Chinner
  2010-10-20  2:36   ` Stan Hoeppner
  2010-10-20  3:03   ` Stan Hoeppner
@ 2010-10-20 11:50   ` Peter Grandi
  2010-10-21  2:00     ` Dave Chinner
  2 siblings, 1 reply; 20+ messages in thread
From: Peter Grandi @ 2010-10-20 11:50 UTC (permalink / raw)
  To: Linux XFS

[ ...  multiple slowly growing streams case ... ]

> What you want is _physical_ preallocation, not speculative
> preallocation.

Not even that. What he really wants is applications based on a
mail database layer that handles these issues, like real DBMSes do
(or *should* do), with tablespaces and the like, including, where
necessary, reservations, as you say:

> i.e. look up XFS_IOC_RESVSP or FIEMAP so your application does
> _permanent_ preallocation past EOF. Alternatively, the filesystem
> will avoid the truncation on close() if the file has the APPEND
> attribute set and the application is writing via O_APPEND...

Because these issues are common to essentially all multiple slow
growing files. I have seen slowly downloading ISO images with
hundreds of thousands of extents, never mind log files.

> The filesystem cannot do everything for you. Sometimes the
> application has to help....

That's only because file system authors are lazy and haven't
implemented 'O_PONIES' yet. :-)


However, I am a fan of having *default* physical preallocation,
because as a rule one can trade off space for speed nowadays, and
padding files with "future growth" tails is fairly cheap, and one
could modify the filesystem code or 'fsck' to reclaim unused space
in "future growth" tails.

Ideally applications would advise the filesystem with the expected
allocation patterns (e.g. "immutable", or "loglike", or "rewritable")
but the 'O_PONIES' discussions show that this is unrealistic.
Userspace sucks.

Even so, I think that a lot of good could be done by autoadvising in
the 'stdio' library, as 'stdio' flags often give a pretty huge hint.
Uhm, I just remembered that I wrote in my blog something fairly
related, with a nice, if sadly still true, quote from a crucial paper
on all this userspace (and sometimes kernel) suckage:

  http://www.sabi.co.uk/blog/anno05-4th.html#051011
  http://www.sabi.co.uk/blog/anno05-4th.html#051011b
  http://www.sabi.co.uk/blog/anno05-4th.html#051012
  http://www.sabi.co.uk/blog/anno05-4th.html#051012d


* Re: avoid mbox file fragmentation
  2010-10-20  3:03   ` Stan Hoeppner
@ 2010-10-21  1:55     ` Dave Chinner
  0 siblings, 0 replies; 20+ messages in thread
From: Dave Chinner @ 2010-10-21  1:55 UTC (permalink / raw)
  To: Stan Hoeppner; +Cc: xfs

On Tue, Oct 19, 2010 at 10:03:19PM -0500, Stan Hoeppner wrote:
> Dave Chinner put forth on 10/19/2010 6:42 PM:
> 
> > I've explained how allocsize works, and that speculative allocation
> > gets truncated away when the file is closed. Hence if the application
> > is doing:
> > 
> > 	open()
> > 	seek(EOF)
> > 	write()
> > 	close()
> 
> I don't know if it changes anything in the sequence above, but Dovecot
> uses mmap i/o.  As I've said, I'm not a dev.  Just thought this
> could/might be relevant.  Would using mmap be compatible with physical
> preallocation?

mmap() can't write beyond EOF or extend the file. Hence it would
have to be:

	open()
	mmap()
	ftruncate(new_size)
	<write via mmap>

In this method, there is no speculative preallocation because there
is never a delayed allocation that extends the file size.  It
simply doesn't matter where the close() occurs. Hence if you use
mmap() writes like this, the only way you can avoid fragmentation is
to use physical preallocation beyond EOF before you start any
writes....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: avoid mbox file fragmentation
  2010-10-20 11:50   ` Peter Grandi
@ 2010-10-21  2:00     ` Dave Chinner
  2010-10-21 16:39       ` Peter Grandi
  0 siblings, 1 reply; 20+ messages in thread
From: Dave Chinner @ 2010-10-21  2:00 UTC (permalink / raw)
  To: Peter Grandi; +Cc: Linux XFS

On Wed, Oct 20, 2010 at 12:50:45PM +0100, Peter Grandi wrote:
> However, I am a fan of having *default* physical preallocation,
> because as a rule one can trade off space for speed nowadays, and
> padding files with "future growth" tails is fairly cheap, and one
> could modify the filesystem code or 'fsck' to reclaim unused space
> in "future grown" tails.

When writing lots of small files (e.g. unpacking a kernel tarball),
leaving space at the file tail due to preallocation (that will never
get used) means that the file data is now sparse on disk instead of
packed into adjacent blocks.

The result? Instead of the elevator merging adjacent file data IOs
into a large IO, they all get issued individually, and the seek
count for IO goes way up. Not truncating away the speculative
preallocation beyond EOF will cause this, too, and that slows down
such workloads by an order of magnitude....

So, Peter, you don't get the pony because we gave it to someone
else. ;)

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: avoid mbox file fragmentation
  2010-10-21  2:00     ` Dave Chinner
@ 2010-10-21 16:39       ` Peter Grandi
  2010-10-21 20:06         ` Best filesystems ? Andrew Daviel
  0 siblings, 1 reply; 20+ messages in thread
From: Peter Grandi @ 2010-10-21 16:39 UTC (permalink / raw)
  To: Linux XFS


>> However, I am a fan of having *default* physical
>> preallocation, because as a rule one can trade off space for
>> speed nowadays, and padding files with "future growth" tails
>> is fairly cheap, and one could modify the filesystem code or
>> 'fsck' to reclaim unused space in "future grown" tails.

> When writing lots of small files (e.g. unpacking a kernel
> tarball),

But the big deal here is that's not something that a filesystem
targeted at high bandwidth multistreaming loads should be
optimized for, at least by default.

  While a number of people writing to this mailing list try to
  use XFS as a small records DBMS, we read their silly posts
  here precisely because it's a stupid idea and thus it works
  badly for them, and so they complain. All the guys who use XFS
  mostly for the workloads it is good for don't complain, so we
  don't read about them here that much.

There is simply no way to optimize for both workloads, because
they require completely different strategies.

  BTW, '.a' files exist precisely because "many small files" was
  considered a bad idea for '.o' files; I wish that the original
  UNIX guys had used '.a' files much more widely, at the very
  least for '.h' files and 'man' pages and configuration files.
  One of the several details that the BTL guys missed (apart
  from 'creat' I mean :->).

> leaving space at the file tail due to preallocation (that will
> never get used)

It will get used, e.g. by 'fsck' or the filesystem itself (let's
call it 'grey' space, and say that it will be reclaimed either
periodically or after all free space has been exhausted).

The overall issue here, when you use the future tense "will
never", is that by default you can only have _adaptive_
strategies, and you need _predictive_ ones to handle
completely different workloads well.

That is, predictive at the kernel level (and then good luck
making good guesses, even if in some useful simple cases it is
possible) or at the app level (and then userspace sucks, but at
least it's their fault).

> means that the file data is now sparse on disk instead of
> packed into adjacent blocks.

But even in your case of lots of small files without "tails",
that doesn't work either, because you get lots of little files
that are not quite guaranteed to be contiguous to each other,
and anyhow if the tails are not large, then even with tails the
files can be "nearly" (most on the same track) contiguous.

> The result? Instead of the elevate merging adjacent file data
> IOs into a large IO, they all get issued individually,

Modern IO subsystems do scatter/gather and mailboxing quite
well; even SATA/SAS nowadays almost does the latter.

> and the seek count for IO goes way up.

Not necessarily -- because many/most files end up together on
the same track/cylinder they would have ended up without tails.

Anyhow for the "many small records DBMS" case, 

> Not truncating away the specualtive prealocation beyond EOF
> will cause this,

It will cause some space diffusion, but again you cannot do both
multistreaming high bandwidth and singlestreaming lots of small
files well, and a filesystem should not do the latter anyhow.

And even so, if for example the userspace hints to the file
system that sa file is very unlikely to be modified or appended
to (which is the default for most apps, and could be inferred
for example by no 'w' permissions or other details), the
filesystem need not put tails on it. Even heuristics based on
how many writes (and their interval) and/or seeks have occurred
can be used to asses the need for tails.

But again for the types of workloads XFS targets, one can do the
brute force approach defaulting to tails, and it does not cost
anywhere like this:

> too, and that slows down such workloads by an order of
> magnitude....

That sounds way too much. First consider that if a file is less
than 2KiB in size it will already have an "internal" tail (in a
4KiB block) much longer than itself, and that's already a very
bad idea for small files, as it makes for many more IOPs and IO
bus bandwidth than needed, never mind the cache and RAM space.

  And if you are targeting things like that workload, the
  greatest improvement is just reducing the block size.
  If you do, please scale the following numbers accordingly.

Then suppose that each nearly-4KiB file will have a 4KiB tail;
this simply reduces the density of space by a factor of around 2x
(makes a bad situation only 2x worse :->).

In other words, leaving (for a while) "tails" on small files
behaves somewhat similarly to (but much better than) doubling the
block size to 8KiB, which is not a catastrophe (the catastrophe has
already happened when default block sizes became 4KiB).

Unless you really want to argue that switching from 4KiB to 8KiB
blocks is really going to cost 10x in performance.

What I mean is that tails don't *have* to be done stupidly; sure there
are a number of really dumb things in Linux IO (from swap space
prefetching to "plugging"), but a few things seem to have been
designed with some forethought, so one does not need to assume
that they would necessarily be implemented that badly (and no, I
don't have time, as usual, so I stay in my "armchair").


* Best filesystems ?
  2010-10-21 16:39       ` Peter Grandi
@ 2010-10-21 20:06         ` Andrew Daviel
  2010-10-22  2:47           ` Stan Hoeppner
  2010-10-23 18:13           ` Peter Grandi
  0 siblings, 2 replies; 20+ messages in thread
From: Andrew Daviel @ 2010-10-21 20:06 UTC (permalink / raw)
  To: Peter Grandi; +Cc: Linux XFS

On Thu, 21 Oct 2010, Peter Grandi wrote:

> But the big deal here is that's not something that a filesystem
> targeted at high bandwidth multistreaming loads should be
> optimized for, at least by default.

Has someone documented (recently, and correctly) what filesystems (XFS, 
ext3, ext4, reiser, JFS, VFAT etc.) are best for what tasks (mailboxes, 
database, general computing, video streaming etc.) ?

I messed around with Bonnie etc. a couple of years ago, deciding that XFS 
was better than ext3 for 4Mb mailbox fragments. Later I was quite 
impressed by an early ext4 on FC9, until it bombed. But I don't really 
have the skills to assess properly.


-- 
Andrew Daviel, TRIUMF, Canada


* Re: Best filesystems ?
  2010-10-21 20:06         ` Best filesystems ? Andrew Daviel
@ 2010-10-22  2:47           ` Stan Hoeppner
  2010-10-23 18:13           ` Peter Grandi
  1 sibling, 0 replies; 20+ messages in thread
From: Stan Hoeppner @ 2010-10-22  2:47 UTC (permalink / raw)
  To: xfs

Andrew Daviel put forth on 10/21/2010 3:06 PM:

> Has someone documented (recently, and correctly) what filesystems (XFS,
> ext3, ext4, reiser, JFS, VFAT etc.) are best for what tasks (mailboxes,
> database, general computing, video streaming etc.) ?

http://btrfs.boxacle.net/repository/raid/2.6.35-rc5/2.6.35-rc5/

That's the most recent set of tests I'm aware of.  The testing was
performed against an IBM DS4500 with 136 drives in a multilevel combo
hardware/software stripe setup.  Complete hardware description and test
details are here:

http://btrfs.boxacle.net/


-- 
Stan


* Re: Best filesystems ?
  2010-10-21 20:06         ` Best filesystems ? Andrew Daviel
  2010-10-22  2:47           ` Stan Hoeppner
@ 2010-10-23 18:13           ` Peter Grandi
  2010-10-23 20:16             ` Emmanuel Florac
                               ` (3 more replies)
  1 sibling, 4 replies; 20+ messages in thread
From: Peter Grandi @ 2010-10-23 18:13 UTC (permalink / raw)
  To: Linux XFS

[ ... ]

>> But the big deal here is that's not something that a filesystem
>> targeted at high bandwidth multistreaming loads should be
>> optimized for, at least by default.

> Has someone documented (recently, and correctly) what
> filesystems (XFS, ext3, ext4, reiser, JFS, VFAT etc.) are best
> for what tasks (mailboxes, database, general computing, video
> streaming etc.) ?

That's impossible. In part because it would take a lot of work,
but mostly because file system performance is so anisotropic with
storage and application profiles, and there are too many variables.
File systems also have somewhat different features. I have written
some impressions on what matters here:

  http://www.sabi.co.uk/blog/0804apr.html#080415

The basic performance metric for a file system used to be what
percentage of the native performance of a hard disk it could deliver
to a single process, but currently most file systems are at 90%
or higher for most of that.

My current impressions are:

* Reiser3 is nice with the right parameters for the few cases
  where there is a significant number of small files on a small
  storage system accessed by a single process.

* 'ext2' is nice for small freshly loaded filesystems with not
  too big files, and for MS-Windows compatibility.

* 'ext3' is good for widespread compatibility and for somewhat
  "average", small or middling filesystems up to 2TB.

* 'ext4' is good for in-place upgradeability from 'ext3', and some
  more scalability and features, but I can't see any real reason
  why it was developed other than offering an in-place upgrade path
  to RH customers, given that JFS and XFS were already there and
  fairly mature. I think it has a bit more scalability and performance
  than 'ext3' (especially better internal parallelism).

* JFS is good for almost everything, including largish filesystems
  on somewhat largish systems with lots of processes accessing
  lots of files; it works equally well on 32b and 64b, is very
  stable, and has a couple of nice features. Its major downside is
  less care than XFS for barriers. I think that it can support
  filesystems up to 10-15TB well, and perhaps beyond. It should
  have been made the default for Linux for at least a decade
  instead of 'ext3'.

* XFS is like JFS, with somewhat higher scalability both as to
  sizes and as to internal parallelism in the case of multiple
  processes accessing the same file, and has a couple of nice
  features (mostly barrier support, but also small blocks and large
  inodes). Its major limitations are its internal complexity and
  that it should only be used on 64b systems. It can support single
  filesystems larger than 10-15TB, but that's stretching things.

* Lustre works well as a network parallel large file streaming
  filesystem. It is however somewhat unstable and great care has to
  be taken in integration testing because of that.

I currently think that JFS, XFS and Lustre cover more or less all
common workloads. I occasionally use 'ext2' or NTFS for data
exchange between Linux and MS-Windows.

There are some other interesting ones:

* UDF could be a very decent small-to-average-sized file system,
  especially for interchange, but also for general purpose use.
  Implementations are a bit lacking.

* OCFS2 works well in non-cluster mode, has pretty decent
  performance, and can be used in shared-storage mode too, but it
  still seems a bit too unstable.

* NILFS2 seems just the right thing for SSD-based file systems,
  and with a garbage collector could be a general purpose file
  system.

* GlusterFS seems quite interesting for the distributed case.

Reiser4 also looked fairly good, but it has been more or less
succeeded by BTRFS (whose author used to work on Reiser4). BTRFS
seems like a not entirely justified successor to 'ext4', though it
also has a few nice extra features built in that I am not sure
really need to be built in (I have the same feeling about ZFS).

Then there are traditional file systems that are also of special
interest, like OpenAFS or GFS2.

The major current issue with all of these is 'fsck' times and the
lack of scalability to single-pool filesystems larger than several
TB, and doing that well is still a research issue (which hasn't
stopped happy-go-lucky people from creating much larger
filesystems, usually with XFS, "because you can"; good luck to
them).

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

* Re: Best filesystems ?
  2010-10-23 18:13           ` Peter Grandi
@ 2010-10-23 20:16             ` Emmanuel Florac
  2010-10-26  0:55               ` hank peng
  2010-10-23 21:28             ` Stan Hoeppner
                               ` (2 subsequent siblings)
  3 siblings, 1 reply; 20+ messages in thread
From: Emmanuel Florac @ 2010-10-23 20:16 UTC (permalink / raw)
  To: Peter Grandi; +Cc: Linux XFS

On Sat, 23 Oct 2010 19:13:08 +0100, you wrote:

> Its major limitations are internal complexity and that it should
>   only be used on 64b systems. It can support single filesystems
>   larger than 10-15TB, but that's stretching things.

I currently take care of 100 servers with 20 to 80 TB XFS filesystems.
Most are under heavy use, 95% filled for more than 3 years in a row,
etc. I can safely affirm that XFS is definitely safe for anything up to
80 TB.

-- 
------------------------------------------------------------------------
Emmanuel Florac     |   Direction technique
                    |   Intellique
                    |   <eflorac@intellique.com>
                    |   +33 1 78 94 84 02
------------------------------------------------------------------------

* Re: Best filesystems ?
  2010-10-23 18:13           ` Peter Grandi
  2010-10-23 20:16             ` Emmanuel Florac
@ 2010-10-23 21:28             ` Stan Hoeppner
  2010-10-24  0:17             ` Steve Costaras
  2010-10-24 18:27             ` Michael Monnerie
  3 siblings, 0 replies; 20+ messages in thread
From: Stan Hoeppner @ 2010-10-23 21:28 UTC (permalink / raw)
  To: xfs

Peter Grandi put forth on 10/23/2010 1:13 PM:

> * XFS is like JFS, but with somewhat higher scalability, both as
>   to sizes and as to internal parallelism in the case of multiple
>   processes accessing the same file, and has a couple of nice
>   features (mostly barrier support, but also small blocks and large
>   inodes). Its major limitations are internal complexity and that
>   it should only be used on 64b systems. It can support single
>   filesystems larger than 10-15TB, but that's stretching things.

I'm surprised you mentioned Lustre and not CXFS, especially given the
mailing list you posted this diatribe to.  They're beasts of a
different breed, but both cluster-oriented.  And apparently you're not
familiar with the NASA Advanced Supercomputing (NAS) Division at Ames
Research Center.  They've got some multi-hundred-terabyte filesystems
to show you:

http://www.nas.nasa.gov/Resources/Systems/archive_storage.html

> * Lustre works well as a network parallel large file streaming
>   filesystem. It is however somewhat unstable and great care has to
>   be taken in integration testing because of that.

Integration testing?  I guess you mean the linkers and compilers on the
build machines?  The Lustre client library gets linked directly into
every application that runs on the compute nodes--it's entirely user
space code.  Or at least this is the way it used to be IIRC.  To do any
general work with the files (move/copy/etc) you had to log onto an
object storage server node.  Maybe this has changed with newer revs.

> I currently think that JFS, XFS and Lustre cover more or less all
> common workloads. 

I'll agree with the first two, but Lustre isn't used for "common"
workloads.  See above.

> I occasionally use 'ext2' or NTFS for data
> exchange between Linux and MS-Windows.

Most people use Samba, CD/DVD-Rs, or thumb drives for this purpose.  If
you have _that_ much data to move between systems, Samba over GigE
yields similar throughput to mounting a USB or eSATA disk, without the
hassle.  If Samba throws up a wall and you can't tune your way around
it, use rsync or something similar.  Dragging hard drives between
systems is so... 80s. ;)

-- 
Stan

* Re: Best filesystems ?
  2010-10-23 18:13           ` Peter Grandi
  2010-10-23 20:16             ` Emmanuel Florac
  2010-10-23 21:28             ` Stan Hoeppner
@ 2010-10-24  0:17             ` Steve Costaras
  2010-10-24 18:27             ` Michael Monnerie
  3 siblings, 0 replies; 20+ messages in thread
From: Steve Costaras @ 2010-10-24  0:17 UTC (permalink / raw)
  To: xfs



On 2010-10-23 13:13, Peter Grandi wrote:
>
> * JFS is good for almost everything, including largish filesystems
>    on somewhat largish systems with lots of processes accessing
>    lots of files; it works equally well on 32b and 64b, is very
>    stable, and has a couple of nice features. Its major downside is
>    that it takes less care than XFS over barriers. I think it can
>    support filesystems up to 10-15TB well, and perhaps beyond. It
>    should have been made the default for Linux instead of 'ext3' at
>    least a decade ago.

I would comment here that JFS is indeed very good, but it does have a
problem when reaching/hitting the 32TB boundary.  This appears to be a
user space tool issue.  It is the main reason why I switched over to
XFS, as I was running into this problem too often.

> * XFS is like JFS, but with somewhat higher scalability, both as
>    to sizes and as to internal parallelism in the case of multiple
>    processes accessing the same file, and has a couple of nice
>    features (mostly barrier support, but also small blocks and large
>    inodes). Its major limitations are internal complexity and that
>    it should only be used on 64b systems. It can support single
>    filesystems larger than 10-15TB, but that's stretching things.

I have used XFS up to 120TB myself on real media (i.e. not sparse
files) under Linux, and will be building >128TB shortly.  I have used
more with XFS on Irix in the past.


Generally I find that with most file systems/tools there are many bugs
when you cross bit boundaries where they were not tested.  When
using/planning large systems, /always/ test first and have good
backups.
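One cheap way to do that kind of boundary testing before buying the disks is a sparse backing file: it has a huge apparent size but consumes almost no real space. A sketch (the mkfs/mount steps need root and xfsprogs, so they are left as comments; the 1T size is illustrative and would be scaled up, e.g. to 33T, to probe a 32TB boundary):

```shell
# Create a sparse file with a large apparent size but near-zero
# actual block usage, suitable as a loopback test image.
img=$(mktemp)
truncate -s 1T "$img"
stat -c 'apparent=%s blocks=%b' "$img"   # huge size, ~0 blocks
# mkfs.xfs -f "$img"                     # needs xfsprogs
# sudo mount -o loop "$img" /mnt/test    # then exercise the fs
```

This won't catch performance problems on real media, but it does exercise the on-disk-format and tool code paths past the boundary of interest.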

* Re: Best filesystems ?
  2010-10-23 18:13           ` Peter Grandi
                               ` (2 preceding siblings ...)
  2010-10-24  0:17             ` Steve Costaras
@ 2010-10-24 18:27             ` Michael Monnerie
  2010-10-24 20:52               ` Emmanuel Florac
  3 siblings, 1 reply; 20+ messages in thread
From: Michael Monnerie @ 2010-10-24 18:27 UTC (permalink / raw)
  To: xfs


On Saturday, 23 October 2010, Peter Grandi wrote:
> * Reiser3 is nice with the right parameters for the few cases
>   where there is a significant number of small files on a small
>   storage system accessed by a single process.

I still use reiser3 for all servers' root filesystems; it's very good
for that. Never had a problem.

For anything else I use XFS for performance reasons. Except for my KDE
/home partition, which stays on reiserfs because KDE seems to rely on
O_PONIES, and crashes lead to zero-size config files, which is not a
lot of fun.
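The zero-size-config-file failure mode has a well-known defence that applications can use instead of hoping for O_PONIES: write the new contents to a temporary file, flush it, then rename() it over the original, since rename is atomic within one filesystem. A minimal sketch (file names are illustrative; note that GNU coreutils' sync accepts a file argument):

```shell
# Crash-safe file replacement: after a crash the file is either the
# complete old version or the complete new one, never zero-length.
cfg="$(mktemp -d)/app.conf"
echo "colour=blue" > "$cfg"      # the existing config
tmp="$cfg.tmp.$$"
echo "colour=green" > "$tmp"     # full new contents go to a temp file
sync "$tmp"                      # flush the new data to disk first
mv -f "$tmp" "$cfg"              # atomic rename(2) within the fs
```

A C implementation would use fsync() on the temp file (and ideally on the containing directory) before the rename; the ordering is the important part.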

-- 
with kind regards,
Michael Monnerie, Ing. BSc

it-management Internet Services
http://proteger.at [pronounced: Prot-e-schee]
Tel: 0660 / 415 65 31

****** Radio interview on the topic of spam ******
http://www.it-podcast.at/archiv.html#podcast-100716

// We currently have two houses for sale:
// http://zmi.at/langegg/
// http://zmi.at/haus2009/


* Re: Best filesystems ?
  2010-10-24 18:27             ` Michael Monnerie
@ 2010-10-24 20:52               ` Emmanuel Florac
  0 siblings, 0 replies; 20+ messages in thread
From: Emmanuel Florac @ 2010-10-24 20:52 UTC (permalink / raw)
  To: Michael Monnerie; +Cc: xfs

On Sun, 24 Oct 2010 20:27:59 +0200, you wrote:

> I still use reiser3 for all servers' root filesystems; it's very good
> for that. Never had a problem.
> 
> For anything else I use XFS for performance reasons. 

I did exactly the same. Actually I've set up about 2500 servers this
way in the past 7 years, and the serious problems can be counted on
one hand.

-- 
------------------------------------------------------------------------
Emmanuel Florac     |   Direction technique
                    |   Intellique
                    |   <eflorac@intellique.com>
                    |   +33 1 78 94 84 02
------------------------------------------------------------------------

* Re: Best filesystems ?
  2010-10-23 20:16             ` Emmanuel Florac
@ 2010-10-26  0:55               ` hank peng
  2010-10-26  7:19                 ` Emmanuel Florac
  0 siblings, 1 reply; 20+ messages in thread
From: hank peng @ 2010-10-26  0:55 UTC (permalink / raw)
  To: Emmanuel Florac; +Cc: Linux XFS

2010/10/24 Emmanuel Florac <eflorac@intellique.com>:
> On Sat, 23 Oct 2010 19:13:08 +0100, you wrote:
>
>> Its major limitations are internal complexity and that it should
>>   only be used on 64b systems. It can support single filesystems
>>   larger than 10-15TB, but that's stretching things.
>
> I currently take care of 100 servers with 20 to 80 TB XFS filesystems.
> Most are under heavy use, 95% filled for more than 3 years in a row,
> etc. I can safely affirm that XFS is definitely safe for anything up to
> 80 TB.
>
Which kernel version do you use? I have occasionally seen errors like
"Input/output error" after a sudden power failure under Linux kernel
2.6.23.

> --
> ------------------------------------------------------------------------
> Emmanuel Florac     |   Direction technique
>                    |   Intellique
>                    |   <eflorac@intellique.com>
>                    |   +33 1 78 94 84 02
> ------------------------------------------------------------------------



-- 
The simplest is not all best but the best is surely the simplest!

* Re: Best filesystems ?
  2010-10-26  0:55               ` hank peng
@ 2010-10-26  7:19                 ` Emmanuel Florac
  0 siblings, 0 replies; 20+ messages in thread
From: Emmanuel Florac @ 2010-10-26  7:19 UTC (permalink / raw)
  To: hank peng; +Cc: Linux XFS

On Tue, 26 Oct 2010 08:55:49 +0800, you wrote:

> Which kernel version do you use? I occasionally found some errors like
> "Input/output error" after sudden power failure under linux kernel
> 2.6.23.

Various, still have a couple of machines running 2.6.17.13 (but in the
smaller range, 10 to 20 TB) and 2.6.22.19; many running 2.6.24.7, some
running 2.6.27.25, and the latest bunch is running 2.6.32.11. Never
used 2.6.23 in production so I couldn't tell much about this.

-- 
------------------------------------------------------------------------
Emmanuel Florac     |   Direction technique
                    |   Intellique
                    |   <eflorac@intellique.com>
                    |   +33 1 78 94 84 02
------------------------------------------------------------------------

end of thread, other threads:[~2010-10-26  7:18 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-10-19 23:04 avoid mbox file fragmentation Stan Hoeppner
2010-10-19 23:42 ` Dave Chinner
2010-10-20  2:36   ` Stan Hoeppner
2010-10-20 11:31     ` Peter Grandi
2010-10-20  3:03   ` Stan Hoeppner
2010-10-21  1:55     ` Dave Chinner
2010-10-20 11:50   ` Peter Grandi
2010-10-21  2:00     ` Dave Chinner
2010-10-21 16:39       ` Peter Grandi
2010-10-21 20:06         ` Best filesystems ? Andrew Daviel
2010-10-22  2:47           ` Stan Hoeppner
2010-10-23 18:13           ` Peter Grandi
2010-10-23 20:16             ` Emmanuel Florac
2010-10-26  0:55               ` hank peng
2010-10-26  7:19                 ` Emmanuel Florac
2010-10-23 21:28             ` Stan Hoeppner
2010-10-24  0:17             ` Steve Costaras
2010-10-24 18:27             ` Michael Monnerie
2010-10-24 20:52               ` Emmanuel Florac
2010-10-20 11:21 ` avoid mbox file fragmentation Peter Grandi

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox