Subject: topics for the file system mini-summit
From: Ric Wheeler @ 2006-05-25 21:44 UTC (permalink / raw)
  To: linux-fsdevel


Now that the IO mini-summit has solved all known issues under file
systems, I thought that I should throw out a list of challenges/problems
that I see with linux file systems (specifically with ext3 & reiserfs) ;-)

With the background of very large (and steadily growing) commodity
drives, we are seeing single consumer drives of 500GB today, with even
larger drives coming soon.  Of course, array-based storage capacities
are a large multiple of these (many terabytes per LUN).

With both ext3 and reiserfs, running a single large file system
translates into several practical limitations well before we hit the
existing size limits:

    (1) repair/fsck time can take hours or even days, depending on the
health of the file system and its underlying disk as well as the number
of files (a rough back-of-the-envelope model follows this list).  This
does not work well for large servers and is a disaster for "appliances"
that have to run these commands buried deep in some data center with
nobody watching...
    (2) most file system performance testing is done on "pristine" file
systems with very few files.  Aged file systems, especially those with
very high file counts, show very noticeable performance degradation
when the file system is very large.
    (3) very poor fault containment for these very large devices - it
would be great to be able to ride through a failure of a segment of the
underlying storage without taking down the whole file system.
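
To make the fsck concern concrete, here is a minimal back-of-the-envelope
model in Python.  Every number in it - inode density, drive throughput,
random IOPS, file count, reads per file - is an assumption for a
circa-2006 commodity SATA setup, not a measurement; the only point is
that the random metadata reads dominate the sequential inode scan and
push repair time into the hours-to-days range:

    # Toy model: fsck cost ~= sequential scan of the inode tables
    #                       + random reads for directory/indirect blocks.
    # Every figure below is an assumption, not a measurement.

    FS_SIZE_BYTES   = 4 * 500 * 10**9   # four 500GB drives as one file system
    BYTES_PER_INODE = 8192              # assumed mkfs inode density
    INODE_BYTES     = 128               # ext3-style on-disk inode size
    SEQ_BYTES_PER_S = 50 * 10**6        # assumed sustained sequential read
    RANDOM_IOPS     = 150               # assumed seek-bound random reads/sec
    FILES           = 20 * 10**6        # assumed "high file count" box
    READS_PER_FILE  = 1                 # assumed extra random read per file

    inode_table_bytes = FS_SIZE_BYTES // BYTES_PER_INODE * INODE_BYTES
    seq_hours    = inode_table_bytes / SEQ_BYTES_PER_S / 3600.0
    random_hours = FILES * READS_PER_FILE / float(RANDOM_IOPS) / 3600.0

    print("inode tables to scan: %.0f GB" % (inode_table_bytes / 10**9))
    print("sequential scan     : %.1f hours" % seq_hours)    # ~0.2 hours
    print("random metadata I/O : %.1f hours" % random_hours) # ~37 hours

Even with these fairly generous assumptions the random component lands in
the tens of hours, which is consistent with the "hours or even days" above.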

The obvious alternative to this is to break up these big disks into
multiple small file systems, but there again we hit several issues.

As an example, one of the boxes that I work with has 4 drives of 500GB
each, with limited memory and CPU resources.  To address the issues
above, we break each drive into 100GB chunks, which gives us 20
(reiserfs) file systems per box (the arithmetic is sketched just after
this list).  The new set of problems that arises from this includes:

    (1) no forced unmount - if one file system goes down, you have to
reboot the box to recover.
    (2) worst case memory consumption for the journals scales linearly
with the number of file systems (32MB per file system).
    (3) we take away the ability of the file system to do intelligent
head movement on the drives (i.e., I end up begging the application team
to please use only one file system per drive at a time for ingest ;-)).
The same goes for allocation - we basically have to push the job of
using the capacity evenly up to the application.
    (4) the pain of administering multiple file systems.
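
For concreteness, the layout and memory arithmetic for that box works out
as follows (a minimal Python sketch; the drive count, chunk size and the
32MB-per-file-system journal figure come straight from this mail, the
rest is just multiplication):

    # Layout and worst-case journal memory for the box described above.
    DRIVES            = 4
    DRIVE_SIZE_GB     = 500
    CHUNK_SIZE_GB     = 100
    JOURNAL_MB_PER_FS = 32   # worst-case in-memory cost per mounted fs

    filesystems = DRIVES * (DRIVE_SIZE_GB // CHUNK_SIZE_GB)   # 20
    journal_mb  = filesystems * JOURNAL_MB_PER_FS             # 640

    print("file systems per box     : %d" % filesystems)
    print("worst-case journal memory: %d MB" % journal_mb)

If the 32MB worst case is hit on all 20 file systems at once, that is
roughly 640MB of memory just for keeping the chunks mounted - a real
cost on a memory-constrained box, before the page cache or the
application gets anything.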

I know that other file systems deal with scale better, but the question
is really how to move the mass of linux users onto these large and
increasingly common storage devices in a way that handles these challenges.

ric







Thread overview: 24+ messages
2006-05-25 21:44 topics for the file system mini-summit Ric Wheeler
2006-05-26 16:48 ` Andreas Dilger
2006-05-27  0:49   ` Ric Wheeler
2006-05-27 14:18     ` Andreas Dilger
2006-05-28  1:44       ` Ric Wheeler
2006-05-29  0:11 ` Matthew Wilcox
2006-05-29  2:07   ` Ric Wheeler
2006-05-29 16:09     ` Andreas Dilger
2006-05-29 19:29       ` Ric Wheeler
2006-05-30  6:14         ` Andreas Dilger
2006-06-07 10:10       ` Stephen C. Tweedie
2006-06-07 14:03         ` Andi Kleen
2006-06-07 18:55         ` Andreas Dilger
2006-06-01  2:19 ` Valerie Henson
2006-06-01  2:42   ` Matthew Wilcox
2006-06-01  3:24     ` Valerie Henson
2006-06-01 12:45       ` Matthew Wilcox
2006-06-01 12:53         ` Arjan van de Ven
2006-06-01 20:06         ` Russell Cattelan
2006-06-02 11:27         ` Nathan Scott
2006-06-01  5:36   ` Andreas Dilger
2006-06-03 13:50   ` Ric Wheeler
2006-06-03 14:13     ` Arjan van de Ven
2006-06-03 15:07       ` Ric Wheeler
