linux-fsdevel.vger.kernel.org archive mirror
* topics for the file system mini-summit
@ 2006-05-25 21:44 Ric Wheeler
  2006-05-26 16:48 ` Andreas Dilger
                   ` (2 more replies)
  0 siblings, 3 replies; 24+ messages in thread
From: Ric Wheeler @ 2006-05-25 21:44 UTC (permalink / raw)
  To: linux-fsdevel


Now that the IO mini-summit has solved all known issues under file
systems, I thought that I should throw out a list of challenges/problems
that I see with linux file systems (specifically with ext3 & reiserfs) ;-)

With the background of very large (and increasingly large) commodity
drives, we are seeing single consumer drives of 500GB today with even
larger drives coming soon.  Of course, array-based storage capacities
are a large multiple of these (many terabytes per LUN).

With both ext3 and with reiserfs, running a single large file system
translates into several practical limitations before we even hit the
existing size limitations:

    (1) repair/fsck time can take hours or even days depending on the
health of the file system and its underlying disk as well as the number
of files.  This does not work well for large servers and is a disaster
for "appliances" that need to run these commands buried deep in some
data center without a person watching...
    (2) most file system performance testing is done on "pristine" file
systems with very few files.  Performance over time, especially with
very high file counts, degrades very noticeably on very large file
systems.
     (3) very poor fault containment for these very large devices - it
would be great to be able to ride through a failure of a segment of the
underlying storage without taking down the whole file system.

The obvious alternative to this is to break up these big disks into
multiple small file systems, but there again we hit several issues.

As an example, in one of the boxes that I work with we have 4 drives,
each 500GB, with limited memory and CPU resources. To address the
issues above, we break each drive into 100GB chunks which gives us 20
(reiserfs) file systems per box.  The set of new problems that arise
from this include:

    (1) no forced unmount - one file system goes down, you have to
reboot the box to recover.
    (2) worst case memory consumption for the journal scales linearly
with the number of file systems (32MB per file system; see the
arithmetic after this list).
    (3) we take away the ability of the file system to do intelligent
head movement on the drives (i.e., I end up begging the application team
to please only use one file system per drive at a time for ingest ;-)).
The same goes for allocation - we basically have to push this up to the
application to use the capacity in an even way.
    (4) pain of administration of multiple file systems.
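
On that kind of box the arithmetic for (2) above is not pretty: with all
20 file systems mounted, the worst case is

    20 file systems x 32MB of journal = 640MB

of memory that can go to journaling alone.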

I know that other file systems deal with scale better, but the question
is really how to move the mass of linux users onto these large and
increasingly common storage devices in a way that handles these challenges.

ric





^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: topics for the file system mini-summit
  2006-05-25 21:44 topics for the file system mini-summit Ric Wheeler
@ 2006-05-26 16:48 ` Andreas Dilger
  2006-05-27  0:49   ` Ric Wheeler
  2006-05-29  0:11 ` Matthew Wilcox
  2006-06-01  2:19 ` Valerie Henson
  2 siblings, 1 reply; 24+ messages in thread
From: Andreas Dilger @ 2006-05-26 16:48 UTC (permalink / raw)
  To: Ric Wheeler; +Cc: linux-fsdevel

On May 25, 2006  14:44 -0700, Ric Wheeler wrote:
> With both ext3 and with reiserfs, running a single large file system
> translates into several practical limitations before we even hit the
> existing size limitations:
> 
>    (1) repair/fsck time can take hours or even days depending on the
> health of the file system and its underlying disk as well as the number
> of files.  This does not work well for large servers and is a disaster
> for "appliances" that need to run these commands buried deep in some
> data center without a person watching...
>    (2) most file system performance testing is done on "pristine" file
> systems with very few files.  Performance over time, especially with
> very high file counts, suffers very noticeable performance degradation
> with very large file systems.
>     (3) very poor fault containment for these very large devices - it
> would be great to be able to ride through a failure of a segment of the
> underlying storage without taking down the whole file system.
> 
> The obvious alternative to this is to break up these big disks into
> multiple small file systems, but there again we hit several issues.
> 
> As an example, in one of the boxes that I work with we have 4 drives,
> each 500GBs, with limited memory and CPU resources. To address the
> issues above, we break each drive into 100GB chunks which gives us 20
> (reiserfs) file systems per box.  The set of new problems that arise
> from this include:
> 
>    (1) no forced unmount - one file system goes down, you have to
> reboot the box to recover.
>    (2) worst case memory consumption for the journal scales linearly
> with the number of file systems (32MB/per file system).
>    (3) we take away the ability of the file system to do intelligent
> head movement on the drives (i.e., I end up begging the application team
> to please only use one file system per drive at a time for ingest ;-)).
> The same goes for allocation - we basically have to push this up to the
> application to use the capacity in an even way.
>    (4) pain of administration of multiple file systems.
> 
> I know that other file systems deal with scale better, but the question
> is really how to move the mass of linux users onto these large and
> increasingly common storage devices in a way that handles these challenges.

In a way what you describe is Lustre - it aggregates multiple "smaller"
filesystems into a single large filesystem from the application POV
(though in many cases "smaller" filesystems are 2TB).  It runs e2fsck
in parallel if needed, has smart object allocation (clients do delayed
allocation, can load balance across storage targets, etc), can run with
down storage targets.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: topics for the file system mini-summit
  2006-05-26 16:48 ` Andreas Dilger
@ 2006-05-27  0:49   ` Ric Wheeler
  2006-05-27 14:18     ` Andreas Dilger
  0 siblings, 1 reply; 24+ messages in thread
From: Ric Wheeler @ 2006-05-27  0:49 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: linux-fsdevel



Andreas Dilger wrote:

>On May 25, 2006  14:44 -0700, Ric Wheeler wrote:
>  
>
>>With both ext3 and with reiserfs, running a single large file system
>>translates into several practical limitations before we even hit the
>>existing size limitations:
>>
>>   ....
>>
>>I know that other file systems deal with scale better, but the question
>>is really how to move the mass of linux users onto these large and
>>increasingly common storage devices in a way that handles these challenges.
>>    
>>
>
>In a way what you describe is Lustre - it aggregates multiple "smaller"
>filesystems into a single large filesystem from the application POV
>(though in many cases "smaller" filesystems are 2TB).  It runs e2fsck
>in parallel if needed, has smart object allocation (clients do delayed
>allocation, can load balance across storage targets, etc), can run with
>down storage targets.
>
>Cheers, Andreas
>--
>Andreas Dilger
>Principal Software Engineer
>Cluster File Systems, Inc.
>
>
>  
>
The approach that lustre takes here is great -  distributed systems 
typically  take into account subcomponent failures as a fact of life & 
do this better than many single system designs...

The challenge is still there on the "smaller" file systems that make up 
Lustre - you can spend a lot of time waiting for just one fsck to finish ;-)

ric


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: topics for the file system mini-summit
  2006-05-27  0:49   ` Ric Wheeler
@ 2006-05-27 14:18     ` Andreas Dilger
  2006-05-28  1:44       ` Ric Wheeler
  0 siblings, 1 reply; 24+ messages in thread
From: Andreas Dilger @ 2006-05-27 14:18 UTC (permalink / raw)
  To: Ric Wheeler; +Cc: linux-fsdevel

On May 26, 2006  20:49 -0400, Ric Wheeler wrote:
> Andreas Dilger wrote:
> >In a way what you describe is Lustre - it aggregates multiple "smaller"
> >filesystems into a single large filesystem from the application POV
> >(though in many cases "smaller" filesystems are 2TB).  It runs e2fsck
> >in parallel if needed, has smart object allocation (clients do delayed
> >allocation, can load balance across storage targets, etc), can run with
> >down storage targets.
>
> The approach that lustre takes here is great -  distributed systems 
> typically  take into account subcomponent failures as a fact of life & 
> do this better than many single system designs...
> 
> The challenge is still there on the "smaller" file systems that make up 
> Lustre - you can spend a lot of time waiting for just one fsck to finish ;-)

CFS is actually quite interested in improving the health and reliability
of the component filesystems also.  That is the reason for our interest
in the U. Wisconsin IRON filesystem work, which we are (slowly) working
to include into ext3.

This will also be our focus for upcoming filesystem work.  It is
relatively easy to make filesystems with 64-bit structures, but the
ability to run such large filesystems in the face of corruption
environments is the real challenge.  It isn't practical to need a
17-year e2fsck time, extrapolating 2TB e2fsck times to 2^48 block
filesystems.  A lot of the features in ZFS make sense in this regard.
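
(Assuming fsck time scales roughly linearly with block count and that a
2TB check takes something like 17 minutes - both rough assumptions - the
arithmetic is:

    2TB at 4KB blocks          = 2^29 blocks
    2^48 blocks / 2^29 blocks  = 2^19, i.e. ~520,000x larger
    ~520,000 x ~17 minutes     = ~8.9 million minutes, roughly 17 years

Any realistic per-2TB number lands in the same absurd range.)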

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: topics for the file system mini-summit
  2006-05-27 14:18     ` Andreas Dilger
@ 2006-05-28  1:44       ` Ric Wheeler
  0 siblings, 0 replies; 24+ messages in thread
From: Ric Wheeler @ 2006-05-28  1:44 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: linux-fsdevel



Andreas Dilger wrote:

>On May 26, 2006  20:49 -0400, Ric Wheeler wrote:
>  
>
>>Andreas Dilger wrote:
>>    
>>
>>>In a way what you describe is Lustre - it aggregates multiple "smaller"
>>>filesystems into a single large filesystem from the application POV
>>>(though in many cases "smaller" filesystems are 2TB).  It runs e2fsck
>>>in parallel if needed, has smart object allocation (clients do delayed
>>>allocation, can load balance across storage targets, etc), can run with
>>>down storage targets.
>>>      
>>>
>>The approach that lustre takes here is great -  distributed systems 
>>typically  take into account subcomponent failures as a fact of life & 
>>do this better than many single system designs...
>>
>>The challenge is still there on the "smaller" file systems that make up 
>>Lustre - you can spend a lot of time waiting for just one fsck to finish ;-)
>>    
>>
>
>CFS is actually quite interested in improving the health and reliability
>of the component filesystems also.  That is the reason for our interest
>in the U. Wisconsin IRON filesystem work, which we are (slowly) working
>to include into ext3.
>  
>
We actually were the sponsors of the Wisconsin work, so I am glad to 
hear that it has a real impact.  I think that the IRON FS ideas will 
help, but they still don't eliminate the scalability issues with fsck 
(or the performance dips I see on file systems with very high object 
counts).

>This will also be our focus for upcoming filesystem work.  It is
>relatively easy to make filesystems with 64-bit structures, but the
>ability to run such large filesystems in the face of corruption
>environments is the real challenge.  It isn't practical to need a
>17-year e2fsck time, extrapolating 2TB e2fsck times to 2^48 block
>filesystems.  A lot of the features in ZFS make sense in this regard.
>
>Cheers, Andreas
>--
>
>  
>
Absolutely agree - I wonder if there is some value in trying to go back 
to profiling fsck if someone has not already done that.  It won't get 
rid of the design limitations, but we might be able to make some 
significant improvements...

ric


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: topics for the file system mini-summit
  2006-05-25 21:44 topics for the file system mini-summit Ric Wheeler
  2006-05-26 16:48 ` Andreas Dilger
@ 2006-05-29  0:11 ` Matthew Wilcox
  2006-05-29  2:07   ` Ric Wheeler
  2006-06-01  2:19 ` Valerie Henson
  2 siblings, 1 reply; 24+ messages in thread
From: Matthew Wilcox @ 2006-05-29  0:11 UTC (permalink / raw)
  To: Ric Wheeler; +Cc: linux-fsdevel

On Thu, May 25, 2006 at 02:44:50PM -0700, Ric Wheeler wrote:
> The obvious alternative to this is to break up these big disks into
> multiple small file systems, but there again we hit several issues.
> 
> As an example, in one of the boxes that I work with we have 4 drives,
> each 500GBs, with limited memory and CPU resources. To address the
> issues above, we break each drive into 100GB chunks which gives us 20
> (reiserfs) file systems per box.  The set of new problems that arise
> from this include:
> 
>    (1) no forced unmount - one file system goes down, you have to
> reboot the box to recover.
>    (2) worst case memory consumption for the journal scales linearly
> with the number of file systems (32MB/per file system).
>    (3) we take away the ability of the file system to do intelligent
> head movement on the drives (i.e., I end up begging the application team
> to please only use one file system per drive at a time for ingest ;-)).
> The same goes for allocation - we basically have to push this up to the
> application to use the capacity in an even way.
>    (4) pain of administration of multiple file systems.
> 
> I know that other file systems deal with scale better, but the question
> is really how to move the mass of linux users onto these large and
> increasingly common storage devices in a way that handles these challenges.

How do you handle the inode number space?  Do you partition it across
the sub-filesystems, or do you prohibit hardlinks between the sub-fses?

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: topics for the file system mini-summit
  2006-05-29  0:11 ` Matthew Wilcox
@ 2006-05-29  2:07   ` Ric Wheeler
  2006-05-29 16:09     ` Andreas Dilger
  0 siblings, 1 reply; 24+ messages in thread
From: Ric Wheeler @ 2006-05-29  2:07 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-fsdevel



Matthew Wilcox wrote:

>On Thu, May 25, 2006 at 02:44:50PM -0700, Ric Wheeler wrote:
>  
>
>>The obvious alternative to this is to break up these big disks into
>>multiple small file systems, but there again we hit several issues.
>>
>>As an example, in one of the boxes that I work with we have 4 drives,
>>each 500GBs, with limited memory and CPU resources. To address the
>>issues above, we break each drive into 100GB chunks which gives us 20
>>(reiserfs) file systems per box.  The set of new problems that arise
>>from this include:
>>
>>   (1) no forced unmount - one file system goes down, you have to
>>reboot the box to recover.
>>   (2) worst case memory consumption for the journal scales linearly
>>with the number of file systems (32MB/per file system).
>>   (3) we take away the ability of the file system to do intelligent
>>head movement on the drives (i.e., I end up begging the application team
>>to please only use one file system per drive at a time for ingest ;-)).
>>The same goes for allocation - we basically have to push this up to the
>>application to use the capacity in an even way.
>>   (4) pain of administration of multiple file systems.
>>
>>I know that other file systems deal with scale better, but the question
>>is really how to move the mass of linux users onto these large and
>>increasingly common storage devices in a way that handles these challenges.
>>    
>>
>
>How do you handle the inode number space?  Do you partition it across
>the sub-filesystems, or do you prohibit hardlinks between the sub-fses?
>  
>
I think that the namespace needs to present a normal file system set of 
operations - support for hardlinks, no magic directories, etc. - so that 
applications don't need to load balance across (or even be aware of) the 
sub-units that provide storage.  If we removed that requirement, we 
would be back to today's collection of various file systems mounted on a 
single host.

I know that Lustre aggregates full file systems, but you could build a 
file system on top of a collection of disk partitions/LUNs and then 
your inode numbers could be extended to encode the partition number and 
the internal mapping. You could even harden the block groups to the 
point that fsck could heal one group while the file system stayed 
(mostly?) online, backed by the rest of the block groups...

ric




^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: topics for the file system mini-summit
  2006-05-29  2:07   ` Ric Wheeler
@ 2006-05-29 16:09     ` Andreas Dilger
  2006-05-29 19:29       ` Ric Wheeler
  2006-06-07 10:10       ` Stephen C. Tweedie
  0 siblings, 2 replies; 24+ messages in thread
From: Andreas Dilger @ 2006-05-29 16:09 UTC (permalink / raw)
  To: Ric Wheeler; +Cc: Matthew Wilcox, linux-fsdevel

On May 28, 2006  22:07 -0400, Ric Wheeler wrote:
> I think that the namespace needs to present a normal file system set of 
> operations - support for hardlinks, no magic directories, etc. so that 
> applications don't need to load balance (or even be aware) of the 
> sub-units that provide storage.  If we removed that requirement, we 
> would be back to today's collection of various file systems mounted on a 
> single host.
> 
> I know that lustre aggregates full file systems

Yes - we have a metadata-only filesystem which exports the inode numbers
and namespace, and then separate (essentially private) filesystems that
store all of the data.  The object store filesystems do not export any
namespace that is visible to userspace.

> you could build a 
> file system on top of a collection of disk partitions/LUN's and then 
> your inode would could be extended to encode the partition number and 
> the internal mapping. You could even harden the block groups to the 
> point that fsck could heal one group while the file system was (mostly?) 
> online backed up by the rest of the block groups...

This is one thing that we have been thinking of for ext3.  Instead of a
filesystem-wide "error" bit we could move this per-group to only mark
the block or inode bitmaps in error if they have a checksum failure.
This would prevent allocations from that group to avoid further potential
corruption of the filesystem metadata.

Once an error is detected then a filesystem service thread or a userspace
helper would walk the inode table (starting in the current group, which
is most likely to hold the relevant data) recreating the respective bitmap
table and keeping a "valid bit" bitmap as well.  Once all of the bits
in the bitmap are marked valid then we can start using this group again.
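
To make that concrete, a rough sketch of the bookkeeping involved (none
of these names exist in ext3; they are purely illustrative):

#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BG_BITMAP_SUSPECT  0x0001       /* checksum failed; no allocations */
#define BLOCKS_PER_GROUP   32768

struct group_repair {
        uint16_t      flags;
        unsigned char bitmap[BLOCKS_PER_GROUP / 8];  /* being rebuilt */
        unsigned char valid[BLOCKS_PER_GROUP / 8];   /* bits proven correct */
};

/* Called when the on-disk block or inode bitmap fails its checksum. */
static void group_mark_suspect(struct group_repair *g)
{
        g->flags |= BG_BITMAP_SUSPECT;
        memset(g->valid, 0, sizeof(g->valid));
}

/* The allocator simply skips groups that are under repair. */
static int group_may_allocate(const struct group_repair *g)
{
        return !(g->flags & BG_BITMAP_SUSPECT);
}

/* The repair thread calls this for each block the inode-table walk settles. */
static void group_settle_block(struct group_repair *g, unsigned blk, int in_use)
{
        unsigned char bit = (unsigned char)(1u << (blk % 8));

        if (in_use)
                g->bitmap[blk / 8] |= bit;
        else
                g->bitmap[blk / 8] &= ~bit;
        g->valid[blk / 8] |= bit;
}

/* Once every bit has been proven, allocations can resume. */
static int group_try_reenable(struct group_repair *g)
{
        for (size_t i = 0; i < sizeof(g->valid); i++)
                if (g->valid[i] != 0xff)
                        return 0;
        g->flags &= ~BG_BITMAP_SUSPECT;
        return 1;
}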

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: topics for the file system mini-summit
  2006-05-29 16:09     ` Andreas Dilger
@ 2006-05-29 19:29       ` Ric Wheeler
  2006-05-30  6:14         ` Andreas Dilger
  2006-06-07 10:10       ` Stephen C. Tweedie
  1 sibling, 1 reply; 24+ messages in thread
From: Ric Wheeler @ 2006-05-29 19:29 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: Matthew Wilcox, linux-fsdevel



Andreas Dilger wrote:

>On May 28, 2006  22:07 -0400, Ric Wheeler wrote:
>  
>
>>you could build a 
>>file system on top of a collection of disk partitions/LUN's and then 
>>your inode would could be extended to encode the partition number and 
>>the internal mapping. You could even harden the block groups to the 
>>point that fsck could heal one group while the file system was (mostly?) 
>>online backed up by the rest of the block groups...
>>    
>>
>
>This is one thing that we have been thinking of for ext3.  Instead of a
>filesystem-wide "error" bit we could move this per-group to only mark
>the block or inode bitmaps in error if they have a checksum failure.
>This would prevent allocations from that group to avoid further potential
>corruption of the filesystem metadata.
>
>Once an error is detected then a filesystem service thread or a userspace
>helper would walk the inode table (starting in the current group, which
>is most likely to hold the relevant data) recreating the respective bitmap
>table and keeping a "valid bit" bitmap as well.  Once all of the bits
>in the bitmap are marked valid then we can start using this group again.
>
>
>  
>
That is a neat idea - would you lose complete access to the impacted 
group, or have you thought about "best effort" read-only while under repair?

One thing that has worked very well for us is that we keep a digital 
signature of each user object (MD5, SHA-family hash, etc.) so we can 
validate that what we wrote is what got read back.  This also provides a 
very powerful sanity check after getting hit by failing media or severe 
file system corruption, since whatever we do manage to salvage (which 
might not be all files) can be validated.

As an archival (write once, read infrequently) storage device, this 
works pretty well for us since the signature does not need to be 
constantly recomputed on each write/append.

For general purpose read/write workloads, I wonder if it would make 
sense to compute and store such a checksum or signature on close (say in 
an extended attribute)?  It might be useful to use another of those 
special attributes (like the immutable attribute) to indicate that this 
file is important enough to digitally sign on close.
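
Something along these lines, done in userspace for illustration (the
attribute name and the sign-on-close policy here are made up, and
OpenSSL's SHA-256 plus fsetxattr(2) are just stand-ins for whatever the
file system would do internally):

#include <stdio.h>
#include <unistd.h>
#include <sys/xattr.h>
#include <openssl/sha.h>

static int sign_before_close(int fd)
{
        SHA256_CTX ctx;
        unsigned char buf[65536], md[SHA256_DIGEST_LENGTH];
        char hex[2 * SHA256_DIGEST_LENGTH + 1];
        ssize_t n;

        if (lseek(fd, 0, SEEK_SET) < 0)
                return -1;
        SHA256_Init(&ctx);
        while ((n = read(fd, buf, sizeof(buf))) > 0)
                SHA256_Update(&ctx, buf, (size_t)n);
        if (n < 0)
                return -1;
        SHA256_Final(md, &ctx);

        for (int i = 0; i < SHA256_DIGEST_LENGTH; i++)
                sprintf(hex + 2 * i, "%02x", md[i]);

        /* "user.sha256" is just an example attribute name */
        return fsetxattr(fd, "user.sha256", hex, sizeof(hex) - 1, 0);
}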

Regards,

Ric


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: topics for the file system mini-summit
  2006-05-29 19:29       ` Ric Wheeler
@ 2006-05-30  6:14         ` Andreas Dilger
  0 siblings, 0 replies; 24+ messages in thread
From: Andreas Dilger @ 2006-05-30  6:14 UTC (permalink / raw)
  To: Ric Wheeler; +Cc: Matthew Wilcox, linux-fsdevel

On May 29, 2006  15:29 -0400, Ric Wheeler wrote:
> Andreas Dilger wrote:
> >Instead of a filesystem-wide "error" bit we could move this per-group to
> >only mark the block or inode bitmaps in error if they have a checksum
> >failure.  This would prevent allocations from that group to avoid further
> >potential corruption of the filesystem metadata.
> >
> >Once an error is detected then a filesystem service thread or a userspace
> >helper would walk the inode table (starting in the current group, which
> >is most likely to hold the relevant data) recreating the respective bitmap
> >table and keeping a "valid bit" bitmap as well.  Once all of the bits
> >in the bitmap are marked valid then we can start using this group again.
>
> That is a neat idea - would you lose complete access to the impacted 
> group, or have you thought about "best effort" read-only while under repair?

I think we would only need to prevent new allocation from the group if the
bitmap is corrupted.  The extent format already has a magic number to give
a very quick sanity check (unlike indirect blocks which can be filled with
random garbage on large filesystems and still appear valid).  We are looking
at adding checksums in the extent metadata and could also do extra internal
consistency checks to validate this metadata (e.g. sequential ordering of
logical offsets, non-overlapping logical offsets, proper parent->child
logical offset hierarchy, etc.).
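
For illustration, the sort of cheap check this enables (the structures
here are simplified stand-ins, not the real on-disk extent format):

#include <stdint.h>

struct ext_entry {
        uint32_t logical;        /* first logical block covered */
        uint32_t len;            /* number of blocks */
        uint64_t physical;       /* first physical block */
};

struct ext_node_hdr {
        uint16_t magic;          /* 0xf30a in the extents patches */
        uint16_t entries;
};

/* Returns 1 if the node looks sane: right magic, ascending and
 * non-overlapping logical ranges, all within the parent's range. */
static int ext_node_sane(const struct ext_node_hdr *hdr,
                         const struct ext_entry *ex,
                         uint32_t parent_start, uint32_t parent_end)
{
        uint32_t prev_end = parent_start;

        if (hdr->magic != 0xf30a)
                return 0;
        for (int i = 0; i < hdr->entries; i++) {
                if (ex[i].logical < prev_end)
                        return 0;        /* misordered or overlapping */
                if (ex[i].logical + ex[i].len > parent_end)
                        return 0;        /* escapes the parent's range */
                prev_end = ex[i].logical + ex[i].len;
        }
        return 1;
}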

So, we are mostly safe from the "incorrect block free" side, and just need
to worry about the "block is free in bitmap, don't reallocate" problem.
Allowing unlinks in a group also allows the "valid" bitmap to be updated
when the bits are cleared, so this is beneficial to the end goal of getting
an all-valid block bitmap.  We could even get more fancy and allow blocks
marked valid to be used for allocations, but that is more complex than I like.

> One thing that has worked very well for us is that we keep a digital 
> signature of each user object (MD5, SHAX hash, etc) so we can validate 
> that what we wrote is what got read back.  This also provides a very 
> powerful sanity check after getting hit by failing media or severe file 
> system corruption since what ever we do manage to salvage (which might 
> not be all files) can be validated.

Yes, we've looked at this also for Lustre (we can already do checksums
from the client memory down to the server disk), but the problem of
consistency in the face of write/truncate/append and a crash is complex.
There's also the issue of whether to do partial-file checksums (in order
to allow more efficient updates) or full-file checksums.

I believe at one point there was work on a checksum loop device, but this
also has potential consistency problems in the face of a crash.

> For general purpose read/write work loads, I wonder if it would make 
> sense to compute and store such a checksum or signature on close (say in 
> an extended attribute)?  It might be useful to use another of those 
> special attributes (like immutable attribute) to indicate that this file 
> is important enough to digitally sign on close.

Hmm, good idea.  If a file is immutable it is fairly certain it won't be
modified any time soon, so it is a good candidate for checksumming.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: topics for the file system mini-summit
  2006-05-25 21:44 topics for the file system mini-summit Ric Wheeler
  2006-05-26 16:48 ` Andreas Dilger
  2006-05-29  0:11 ` Matthew Wilcox
@ 2006-06-01  2:19 ` Valerie Henson
  2006-06-01  2:42   ` Matthew Wilcox
                     ` (2 more replies)
  2 siblings, 3 replies; 24+ messages in thread
From: Valerie Henson @ 2006-06-01  2:19 UTC (permalink / raw)
  To: Ric Wheeler; +Cc: linux-fsdevel, Arjan van de Ven

On Thu, May 25, 2006 at 02:44:50PM -0700, Ric Wheeler wrote:
> 
>    (1) repair/fsck time can take hours or even days depending on the
> health of the file system and its underlying disk as well as the number
> of files.  This does not work well for large servers and is a disaster
> for "appliances" that need to run these commands buried deep in some
> data center without a person watching...
>    (2) most file system performance testing is done on "pristine" file
> systems with very few files.  Performance over time, especially with
> very high file counts, suffers very noticeable performance degradation
> with very large file systems.
>     (3) very poor fault containment for these very large devices - it
> would be great to be able to ride through a failure of a segment of the
> underlying storage without taking down the whole file system.
> 
> The obvious alternative to this is to break up these big disks into
> multiple small file systems, but there again we hit several issues.

1 and 3 are some of my main concerns, and what I want to focus a lot
of the workshop discussion on.  I view the question as: How do we keep
file system management simple while splitting the underlying storage
into isolated failure domains that can be repaired individually
online? (Say that three times fast.) Just splitting up into multiple
file systems only solves the second problem, and only if you have
forced umount, as you noted.

The approach we took in ZFS was to separate namespace management and
allocation management.  File systems aren't a fixed size, they take up
as much space as they need from a shared underlying pool.  You can
think of a file system in ZFS as a movable directory with management
bits attached.  I don't think this is the direction we should go, but
it's an example of separating your namespace management from a lot of
other stuff it doesn't really need to be attached to.

I don't think a block group is a good enough fault isolation domain -
think hard links.  What I think we need is normal file system
structures when you are referencing stuff inside your fault isolation
domain, and something more complicated if you have to reference stuff
outside.  One of Arjan's ideas involves something we're calling
continuation inodes - if the file's data is stored in multiple
domains, it has a separate continuation inode in each domain, and each
continuation inode has all the information necessary to run a full
fsck on the data inside that domain.  Similarly, if a directory has a
hard link to a file outside its domain, we'll have to allocate a
continuation inode and dir entry block in the domain containing the
file.  The idea is that you can run fsck on a domain without having to
go look outside that domain.  You may have to clean up a few things in
other domains, but they are easy to find and don't require an fsck in
other domains.
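
To make the hand-waving slightly more concrete, here is roughly the
minimum a continuation inode would seem to need to carry (all of these
names are invented, of course):

#include <stdint.h>

/* Enough that a per-domain fsck never has to look outside its own
 * domain: who the "head" is, and a purely local view of what lives here. */
struct cont_inode {
        uint32_t head_domain;    /* domain holding the head of the file */
        uint32_t head_ino;       /* its inode number within that domain */
        uint32_t nlink_here;     /* references from this domain only */
        uint32_t blocks_here;    /* space accounted locally */
        /* ...followed by an ordinary block/extent map whose block
         * numbers are all local to this domain... */
};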

-VAL

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: topics for the file system mini-summit
  2006-06-01  2:19 ` Valerie Henson
@ 2006-06-01  2:42   ` Matthew Wilcox
  2006-06-01  3:24     ` Valerie Henson
  2006-06-01  5:36   ` Andreas Dilger
  2006-06-03 13:50   ` Ric Wheeler
  2 siblings, 1 reply; 24+ messages in thread
From: Matthew Wilcox @ 2006-06-01  2:42 UTC (permalink / raw)
  To: Valerie Henson; +Cc: Ric Wheeler, linux-fsdevel, Arjan van de Ven

On Wed, May 31, 2006 at 07:19:09PM -0700, Valerie Henson wrote:
> I don't think a block group is a good enough fault isolation domain -
> think hard links.  What I think we need is normal file system
> structures when you are referencing stuff inside your fault isolation
> domain, and something more complicated if you have to reference stuff
> outside.  One of Arjan's ideas involves something we're calling
> continuation inodes - if the file's data is stored in multiple
> domains, it has a separate continuation inode in each domain, and each
> continuation inode has all the information necessary to run a full
> fsck on the data inside that domain.  Similarly, if a directory has a
> hard link to a file outside its domain, we'll have to allocate a
> continuation inode and dir entry block in the domain containing the
> file.  The idea is that you can run fsck on a domain without having to
> go look outside that domain.  You may have to clean up a few things in
> other domains, but they are easy to find and don't require an fsck in
> other domains.

I don't quite get it.  Let's say we have directories A and B in domain A
and domain B.  A file C is created in directory B and is thus allocated
in domain B.  Now we create a link to file C in directory A, and remove
the link from directory B.  Presumably we have a continuation inode in
domain A and an inode with no references to it in domain B.  How does
fsck tell that there's a continuation inode in a different domain?

The obvious answer is to put a special flag on the continued inode ...
but then the question becomes more about the care and feeding of
said flag.  Maybe it's a counter.  But then fsck needs to check the
counter's not corrupt.  Or is it backlinks from the inode to all its
continuation inodes?  That quickly gets messy with locking.

Surely XFS must have a more elegant solution than this?

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: topics for the file system mini-summit
  2006-06-01  2:42   ` Matthew Wilcox
@ 2006-06-01  3:24     ` Valerie Henson
  2006-06-01 12:45       ` Matthew Wilcox
  0 siblings, 1 reply; 24+ messages in thread
From: Valerie Henson @ 2006-06-01  3:24 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: Ric Wheeler, linux-fsdevel, Arjan van de Ven

On Wed, May 31, 2006 at 08:42:47PM -0600, Matthew Wilcox wrote:
> On Wed, May 31, 2006 at 07:19:09PM -0700, Valerie Henson wrote:
> > I don't think a block group is a good enough fault isolation domain -
> > think hard links.  What I think we need is normal file system
> > structures when you are referencing stuff inside your fault isolation
> > domain, and something more complicated if you have to reference stuff
> > outside.  One of Arjan's ideas involves something we're calling
> > continuation inodes - if the file's data is stored in multiple
> > domains, it has a separate continuation inode in each domain, and each
> > continuation inode has all the information necessary to run a full
> > fsck on the data inside that domain.  Similarly, if a directory has a
> > hard link to a file outside its domain, we'll have to allocate a
> > continuation inode and dir entry block in the domain containing the
> > file.  The idea is that you can run fsck on a domain without having to
> > go look outside that domain.  You may have to clean up a few things in
> > other domains, but they are easy to find and don't require an fsck in
> > other domains.
> 
> I don't quite get it.  Let's say we have directories A and B in domain A
> and domain B.  A file C is created in directory B and is thus allocated
> in domain B.  Now we create a link to file C in directory A, and remove
> the link from directory B.

All correct up to here...

> Presumably we have a continuation inode in
> domain A and an inode with no references to it in domain B.  How does
> fsck tell that there's a continuation inode in a different domain?

Actually, the continuation inode is in B.  When we create a link in
directory A to file C, a continuation inode for directory A is created
in domain B, and a block containing the link to file C is allocated
inside domain B as well.  So there is no continuation inode in domain
A.

That being said, this idea is at the hand-waving stage and probably
has many other (hopefully non-fatal) flaws.  Thanks for taking a look!

> Surely XFS must have a more elegant solution than this?

val@goober:/usr/src/linux-2.6.16.19$ wc -l `find fs/xfs/ -type f`
[snip]
 109083 total

:)

-VAL

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: topics for the file system mini-summit
  2006-06-01  2:19 ` Valerie Henson
  2006-06-01  2:42   ` Matthew Wilcox
@ 2006-06-01  5:36   ` Andreas Dilger
  2006-06-03 13:50   ` Ric Wheeler
  2 siblings, 0 replies; 24+ messages in thread
From: Andreas Dilger @ 2006-06-01  5:36 UTC (permalink / raw)
  To: Valerie Henson; +Cc: Ric Wheeler, linux-fsdevel, Arjan van de Ven

On May 31, 2006  19:19 -0700, Valerie Henson wrote:
> I don't think a block group is a good enough fault isolation domain -
> think hard links.  What I think we need is normal file system
> structures when you are referencing stuff inside your fault isolation
> domain, and something more complicated if you have to reference stuff
> outside.  One of Arjan's ideas involves something we're calling
> continuation inodes - if the file's data is stored in multiple
> domains, it has a separate continuation inode in each domain, and each
> continuation inode has all the information necessary to run a full
> fsck on the data inside that domain.  Similarly, if a directory has a
> hard link to a file outside its domain, we'll have to allocate a
> continuation inode and dir entry block in the domain containing the
> file.  The idea is that you can run fsck on a domain without having to
> go look outside that domain.  You may have to clean up a few things in
> other domains, but they are easy to find and don't require an fsck in
> other domains.

This sounds very much like the approach Lustre has taken for clustered
metadata servers (CMD), which was developed as an advanced prototype
last year, and is being reimplemented for production now.

In "regular" (non-CMD) Lustre there is a single metadata target (MDT)
which holds all of the namespace (directories, filenames, inodes), and
the inodes have EA metadata that tells users of those files which other
storage targets (OSTs) hold the file data (RAID 0 stripe currently).
OSTs are completely self-contained ext3 filesystems, as is the MDT.

In the prototype CMD Lustre there are multiple metadata targets that
make up a single namespace.  Generally, each directory and the inodes
therein are kept on a single MDT but in the case of large directories (>
64k entries, which are split across MDTs by the hash of the filename),
hard links, or renames it is possible to have a cross-MDT inode reference
in a directory.

The cross-MDT reference is implemented by storing a special dirent
in the directory which tells the caller which other MDT actually has
the inode.  The remote inode itself is held in a private "MDT object"
directory so that it has a local filesystem reference and can be looked
up by a special filename that is derived from the inode number, and I
believe source MDT (either in the filename or the private directory)
to keep the link count correct.
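
Very roughly (this is not the actual Lustre on-disk format, just the
information such a dirent has to carry per the description above):

#include <stdint.h>

struct cross_mdt_dirent {
        uint32_t target_mdt;     /* which MDT actually holds the inode */
        uint64_t target_ino;     /* its inode number there; also used to
                                  * derive the name under that MDT's
                                  * private "MDT object" directory */
        uint16_t name_len;
        char     name[];         /* the visible filename */
};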

This allows each MDT filesystem to be internally consistent, and the
cross-MDT dirents are treated by e2fsck much the same as symlinks in
the sense that a dangling reference is non-fatal.  There is (or at
least was, in the CMD prototype design) a second-stage tool which
would get a list of cross-MDT references that it could correlate with
the MDT object directory inodes on the other MDTs and fix up refcounts
or orphaned inodes.


In the case of "split directories", which are implemented in order to
load-balance metadata operations across multiple MDTs, there was also a
need to migrate directory entries to other MDTs when the directory
splits.  That is only done once, when the directory grows beyond 64k
entries, in order to limit the number of cross-MDT entries in the
directory and to get the added parallelism involved as soon as possible.
After the initial split, new direntries and their inodes are created
together within a single MDT, though there are several directory
"stripes" on multiple MDTs running in parallel.

The same methods used to do dirent migration were also used for handling
renames across directories on multiple MDTs.  The basics are that there
need to be separate target filesystem primitives exported for creating
and deleting an inode, and for adding or removing an entry from a
directory (operations which are bundled together in the Linux VFS).

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: topics for the file system mini-summit
  2006-06-01  3:24     ` Valerie Henson
@ 2006-06-01 12:45       ` Matthew Wilcox
  2006-06-01 12:53         ` Arjan van de Ven
                           ` (2 more replies)
  0 siblings, 3 replies; 24+ messages in thread
From: Matthew Wilcox @ 2006-06-01 12:45 UTC (permalink / raw)
  To: Valerie Henson; +Cc: Ric Wheeler, linux-fsdevel, Arjan van de Ven

On Wed, May 31, 2006 at 08:24:18PM -0700, Valerie Henson wrote:
> Actually, the continuation inode is in B.  When we create a link in
> directory A to file C, a continuation inode for directory A is created
> in domain B, and a block containing the link to file C is allocated
> inside domain B as well.  So there is no continuation inode in domain
> A.
> 
> That being said, this idea is at the hand-waving stage and probably
> has many other (hopefully non-fatal) flaws.  Thanks for taking a look!

OK, so we really have two kinds of continuation inodes, and it might be
sensible to name them differently.  We have "here's some extra data for
that inode over there" and "here's a hardlink from another domain".  I
dub the first one a 'continuation inode' and the second a 'shadow inode'.

Continuation inodes and shadow inodes both suffer from the problem
that they might be unwittingly orphaned, unless they have some kind of
back-link to their referrer.  That seems more soluble though.  The domain
B minifsck can check to see if the backlinked inode or directory is
still there.  If the domain A minifsck prunes something which has a link
to domain B, it should be able to just remove the continuation/shadow
inode there, without fscking domain B.

Another advantage to this is that inodes never refer to blocks outside
their zone, so we can forget about all this '64-bit block number' crap.
We don't even need 64-bit inode numbers -- we can use special direntries
for shadow inodes, and inodes which refer to continuation inodes need
a new encoding scheme anyway.  Normal inodes would remain 32-bit and
refer to the local domain, and shadow/continuation inode numbers would
be 32-bits of domain, plus 32-bits of inode within that domain.
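
For concreteness, a minimal sketch of that encoding (purely illustrative;
the split between domain and local inode number is the only point):

#include <stdint.h>

/* Plain inodes stay 32-bit and local; only shadow/continuation
 * references carry a domain in the high half. */
static inline uint64_t make_remote_ref(uint32_t domain, uint32_t ino)
{
        return ((uint64_t)domain << 32) | ino;
}

static inline uint32_t remote_ref_domain(uint64_t ref)
{
        return (uint32_t)(ref >> 32);
}

static inline uint32_t remote_ref_ino(uint64_t ref)
{
        return (uint32_t)ref;
}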

So I like this ;-)

> > Surely XFS must have a more elegant solution than this?
> 
> val@goober:/usr/src/linux-2.6.16.19$ wc -l `find fs/xfs/ -type f`
> [snip]
>  109083 total

Well, yes.  I think that inside the Linux XFS implementation there's a
small and neat filesystem struggling to get out.  Once SGI finally dies,
perhaps we can rip out all the CXFS stubs and IRIX compatibility.  Then
we might be able to see it.

For fun, if you're a masochist, try to follow the code flow for
something easy like fsync().

const struct file_operations xfs_file_operations = {
        .fsync          = xfs_file_fsync,
}

xfs_file_fsync(struct file *filp, struct dentry *dentry, int datasync)
{
        struct inode    *inode = dentry->d_inode;
        vnode_t         *vp = vn_from_inode(inode);
        int             error;
        int             flags = FSYNC_WAIT;

        if (datasync)
                flags |= FSYNC_DATA;
        VOP_FSYNC(vp, flags, NULL, (xfs_off_t)0, (xfs_off_t)-1, error);
        return -error;
}

#define _VOP_(op, vp)   (*((vnodeops_t *)(vp)->v_fops)->op)

#define VOP_FSYNC(vp,f,cr,b,e,rv)                                       \
        rv = _VOP_(vop_fsync, vp)((vp)->v_fbhv,f,cr,b,e)

vnodeops_t xfs_vnodeops = {
        .vop_fsync              = xfs_fsync,
}

Finally, xfs_fsync actually does the work.  The best bit about all this
abstraction is that there's only one xfs_vnodeops defined!  So this could
all be done with an xfs_file_fsync() that munged its parameters and called
xfs_fsync() directly.  That wouldn't even affect IRIX compatibility,
but it would make life difficult for CXFS, apparently.

http://oss.sgi.com/projects/xfs/mail_archive/200308/msg00214.html

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: topics for the file system mini-summit
  2006-06-01 12:45       ` Matthew Wilcox
@ 2006-06-01 12:53         ` Arjan van de Ven
  2006-06-01 20:06         ` Russell Cattelan
  2006-06-02 11:27         ` Nathan Scott
  2 siblings, 0 replies; 24+ messages in thread
From: Arjan van de Ven @ 2006-06-01 12:53 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: Valerie Henson, Ric Wheeler, linux-fsdevel

Matthew Wilcox wrote:
> On Wed, May 31, 2006 at 08:24:18PM -0700, Valerie Henson wrote:
>> Actually, the continuation inode is in B.  When we create a link in
>> directory A to file C, a continuation inode for directory A is created
>> in domain B, and a block containing the link to file C is allocated
>> inside domain B as well.  So there is no continuation inode in domain
>> A.
>>
>> That being said, this idea is at the hand-waving stage and probably
>> has many other (hopefully non-fatal) flaws.  Thanks for taking a look!
> 
> OK, so we really have two kinds of continuation inodes, and it might be
> sensible to name them differently.  We have "here's some extra data for
> that inode over there" and "here's a hardlink from another domain".  I
> dub the first one a 'continuation inode' and the second a 'shadow inode'.

nonono

the "hardlink" is in a directory inode, and that *directory* has a continuation
for the dentry that constitutes the hardlink.  But that dentry is "local" to
the data, so the directory ends up being split over the domains.

> Another advantage to this is that inodes never refer to blocks outside
> their zone, so we can forget about all this '64-bit block number' crap.

exactly the point!



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: topics for the file system mini-summit
  2006-06-01 12:45       ` Matthew Wilcox
  2006-06-01 12:53         ` Arjan van de Ven
@ 2006-06-01 20:06         ` Russell Cattelan
  2006-06-02 11:27         ` Nathan Scott
  2 siblings, 0 replies; 24+ messages in thread
From: Russell Cattelan @ 2006-06-01 20:06 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Valerie Henson, Ric Wheeler, linux-fsdevel, Arjan van de Ven

Matthew Wilcox wrote:

>On Wed, May 31, 2006 at 08:24:18PM -0700, Valerie Henson wrote:
>  
>
>>Actually, the continuation inode is in B.  When we create a link in
>>directory A to file C, a continuation inode for directory A is created
>>in domain B, and a block containing the link to file C is allocated
>>inside domain B as well.  So there is no continuation inode in domain
>>A.
>>
>>That being said, this idea is at the hand-waving stage and probably
>>has many other (hopefully non-fatal) flaws.  Thanks for taking a look!
>>    
>>
>
>OK, so we really have two kinds of continuation inodes, and it might be
>sensible to name them differently.  We have "here's some extra data for
>that inode over there" and "here's a hardlink from another domain".  I
>dub the first one a 'continuation inode' and the second a 'shadow inode'.
>
>Continuation inodes and shadow inodes both suffer from the problem
>that they might be unwittingly orphaned, unless they have some kind of
>back-link to their referrer.  That seems more soluble though.  The domain
>B minifsck can check to see if the backlinked inode or directory is
>still there.  If the domain A minifsck prunes something which has a link
>to domain B, it should be able to just remove the continuation/shadow
>inode there, without fscking domain B.
>
>Another advantage to this is that inodes never refer to blocks outside
>their zone, so we can forget about all this '64-bit block number' crap.
>We don't even need 64-bit inode numbers -- we can use special direntries
>for shadow inodes, and inodes which refer to continuation inodes need
>a new encoding scheme anyway.  Normal inodes would remain 32-bit and
>refer to the local domain, and shadow/continuation inode numbers would
>be 32-bits of domain, plus 32-bits of inode within that domain.
>
>So I like this ;-)
>
>  
>
>>>Surely XFS must have a more elegant solution than this?
>>>      
>>>
XFS may be a bit better suited to this "encapsulated" form of 
inode/directory management, since its AGs already try to keep metadata 
close to the file data.  So it would be quite feasible to take a 
particular AG offline and do a consistency check on it.

But yes, hard links pose the same problem being discussed here.
File data can also span AGs and thus create interdependencies between 
AGs, in terms of both file data and the metadata blocks that manage the 
extents.  But the idea of creating continuation inodes seems like a good 
one.  For XFS it might be better to do this at the AG level, so that as 
soon as a hard link in one AG refers to an inode in another AG, the two 
AGs are flagged as being linked.  This would allow any form of 
interdependent data to be grouped (quotas, extended attributes, etc.).


>>val@goober:/usr/src/linux-2.6.16.19$ wc -l `find fs/xfs/ -type f`
>>[snip]
>> 109083 total
>>    
>>
>
>Well, yes.  I think that inside the Linux XFS implementation there's a
>small and neat filesystem struggling to get out.  Once SGI finally dies,
>perhaps we can rip out all the CXFS stubs and IRIX combatability.  Then
>we might be able to see it.
>
>For fun, if you're a masochist, try to follow the code flow for
>something easy like fsync().
>
>const struct file_operations xfs_file_operations = {
>        .fsync          = xfs_file_fsync,
>}
>
>xfs_file_fsync(struct file *filp, struct dentry *dentry, int datasync)
>{
>        struct inode    *inode = dentry->d_inode;
>        vnode_t         *vp = vn_from_inode(inode);
>        int             error;
>        int             flags = FSYNC_WAIT;
>
>        if (datasync)
>                flags |= FSYNC_DATA;
>        VOP_FSYNC(vp, flags, NULL, (xfs_off_t)0, (xfs_off_t)-1, error);
>        return -error;
>}
>
>#define _VOP_(op, vp)   (*((vnodeops_t *)(vp)->v_fops)->op)
>  
>
Don't forget the extremely hard to untangle behaviors.
#define VNHEAD(vp)    ((vp)->v_bh.bh_first)
#define VOP(op, vp)    (*((bhv_vnodeops_t *)VNHEAD(vp)->bd_ops)->op)

Which I won't even try to explain cuz they confuse the crap out of me.
But that is what CXFS uses to create different call chains.

Oh, and to make things even more evil, the call chains are dynamically 
changed based on whether an inode has a client or not.  So in the case 
of no CXFS client the call chain is about the same as local XFS, but 
when a client comes in, CXFS will insert more behaviors/vops that hook 
up all the cluster management stuff for that inode.

>#define VOP_FSYNC(vp,f,cr,b,e,rv)                                       \
>        rv = _VOP_(vop_fsync, vp)((vp)->v_fbhv,f,cr,b,e)
>
>vnodeops_t xfs_vnodeops = {
>        .vop_fsync              = xfs_fsync,
>}
>
>Finally, xfs_fsync actually does the work.  The best bit about all this
>abstraction is that there's only one xfs_vnodeops defined!  So this could
>all be done with an xfs_file_fsync() that munged its parameters and called
>xfs_fsync() directly.  That wouldn't even affect IRIX combatability,
>but it would make life difficult for CXFS, apparently.
>
>  
>
So some of my ex co-workers at SGI will disagree with the following but...

The VOPs that are left in XFS are completely pointless at this point; 
since XFS never has anything other than one call chain, it shouldn't 
have to deal with all that stuff in local mode.

All the behavior call chaining should be handled by CXFS, and thus all 
the VOP code should be pushed to that code base.  I think there are 4 
VOP calls that are used internally by XFS, and the callers of those 
vops may need something else that provides a way of re-entering the 
call chain at the top.

I have done some of the work in terms of just replacing the VOP calls 
with straight calls
to the final functions in the hopes of tossing the vnodeops out of XFS.
And I spec'd out a way of fixing CXFS to deal with the vops internally 
but unfortunately
that kind of work will always fall under the ENORESOURCES category.


I know SGI will never take it as long as CXFS lives, but maybe someday 
when SGI finally fizzles... :-)

Ohh, and the whole IRIX compat is crap at this point, since many of the 
VOP call params have been changed to match Linux params.

>http://oss.sgi.com/projects/xfs/mail_archive/200308/msg00214.html


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: topics for the file system mini-summit
  2006-06-01 12:45       ` Matthew Wilcox
  2006-06-01 12:53         ` Arjan van de Ven
  2006-06-01 20:06         ` Russell Cattelan
@ 2006-06-02 11:27         ` Nathan Scott
  2 siblings, 0 replies; 24+ messages in thread
From: Nathan Scott @ 2006-06-02 11:27 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Valerie Henson, Ric Wheeler, linux-fsdevel, Arjan van de Ven

Hi there,

On Thu, Jun 01, 2006 at 06:45:17AM -0600, Matthew Wilcox wrote:
> On Wed, May 31, 2006 at 08:24:18PM -0700, Valerie Henson wrote:
> We don't even need 64-bit inode numbers -- we can use special direntries
> for shadow inodes, and inodes which refer to continuation inodes need
> a new encoding scheme anyway.  Normal inodes would remain 32-bit and
> refer to the local domain, and shadow/continuation inode numbers would
> be 32-bits of domain, plus 32-bits of inode within that domain.

Be careful of 64 bit inode numbers; we have had a world of pain
with those due to broken userspace applications that cannot work
with larger inode numbers (backup programs, with on-disk formats
that only have room for 32 bits).  To the point that we're actually
having to find new ways to keep our inode number space within the
32 bits-of-significance range.
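
The failure mode, reduced to a toy (the struct is a stand-in for any
on-disk index format that only reserved 32 bits for the inode number; on
a 32-bit userland built without large-file support, stat() itself
already fails with EOVERFLOW before you even get this far):

#include <stdint.h>
#include <sys/types.h>
#include <sys/stat.h>

struct legacy_record {          /* e.g. an old backup catalogue entry */
        uint32_t ino;           /* only 32 bits on disk */
        uint32_t mode;
};

static int record_file(const char *path, struct legacy_record *rec)
{
        struct stat st;

        if (stat(path, &st) != 0)
                return -1;
        if ((uint64_t)st.st_ino > 0xffffffffULL)
                return -1;      /* refuse rather than silently truncate */
        rec->ino  = (uint32_t)st.st_ino;
        rec->mode = (uint32_t)st.st_mode;
        return 0;
}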

> > > Surely XFS must have a more elegant solution than this?

Do you mean a solution to the problem of inodes spanning multiple devices?
We don't really have this concept in XFS today, but it remains
an area of active interest (and current development).

Or are you referring to the problem of scaling in general?  Yes,
XFS uses many approaches there, it was designed with this in mind.
The first step toward an "elegant solution" is to use 64 bit disk
addressing, or to at least be able to.

> Well, yes.  I think that inside the Linux XFS implementation
> there's a small and neat filesystem struggling to get out.

Hmm, interesting theory.  In reality, no - all the complexity in
XFS is in core pieces that, no matter how the code is sliced and
diced, still need to be there... the allocator, the log code, and
the multiple Btrees could all possibly be simpler, but they are
all critical pieces of XFS's ability to scale upward, and all have
seen many years of tuning under a large variety of workloads.

> For fun, if you're a masochist, try to follow the code flow for
> something easy like fsync().
> ...
> Finally, xfs_fsync actually does the work.  The best bit about all this
> abstraction is that there's only one xfs_vnodeops defined!

No, the best bit is that an abstraction like this is a neat fit to
the problems presented by this multiple-devices-within-one-filesystem
approach.  As I said, this is an area of active interest to us also.

cheers.

-- 
Nathan

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: topics for the file system mini-summit
  2006-06-01  2:19 ` Valerie Henson
  2006-06-01  2:42   ` Matthew Wilcox
  2006-06-01  5:36   ` Andreas Dilger
@ 2006-06-03 13:50   ` Ric Wheeler
  2006-06-03 14:13     ` Arjan van de Ven
  2 siblings, 1 reply; 24+ messages in thread
From: Ric Wheeler @ 2006-06-03 13:50 UTC (permalink / raw)
  To: Valerie Henson; +Cc: linux-fsdevel, Arjan van de Ven


Valerie Henson wrote:

>On Thu, May 25, 2006 at 02:44:50PM -0700, Ric Wheeler wrote:
>  
>
>>   (1) repair/fsck time can take hours or even days depending on the
>>health of the file system and its underlying disk as well as the number
>>of files.  This does not work well for large servers and is a disaster
>>for "appliances" that need to run these commands buried deep in some
>>data center without a person watching...
>>   (2) most file system performance testing is done on "pristine" file
>>systems with very few files.  Performance over time, especially with
>>very high file counts, suffers very noticeable performance degradation
>>with very large file systems.
>>    (3) very poor fault containment for these very large devices - it
>>would be great to be able to ride through a failure of a segment of the
>>underlying storage without taking down the whole file system.
>>
>>The obvious alternative to this is to break up these big disks into
>>multiple small file systems, but there again we hit several issues.
>>    
>>
>
>1 and 3 are some of my main concerns, and what I want to focus a lot
>of the workshop discussion on.  I view the question as: How do we keep
>file system management simple while splitting the underlying storage
>into isolated failure domains that can be repaired individually
>online? (Say that three times fast.) Just splitting up into multiple
>file systems only solves the second problem, and only if you have
>forced umount, as you noted.
>
>
>  
>
Any thoughts about what the right semantics are for properly doing a 
forced unmount, and whether it is doable near term (as opposed to the 
more strategic/long term issues laid out in this thread)?

ric


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: topics for the file system mini-summit
  2006-06-03 13:50   ` Ric Wheeler
@ 2006-06-03 14:13     ` Arjan van de Ven
  2006-06-03 15:07       ` Ric Wheeler
  0 siblings, 1 reply; 24+ messages in thread
From: Arjan van de Ven @ 2006-06-03 14:13 UTC (permalink / raw)
  To: Ric Wheeler; +Cc: Valerie Henson, linux-fsdevel

Ric Wheeler wrote:
>>
> Any thoughts about what the right semantics are for properly doing a 
> forced unmount and how whether it is doable near term (as opposed to the 
> more strategic/long term issues laid out in this thread) ?

I would like to ask you to take one step back; in the past, when I have seen
people want "forced unmount", they actually wanted something else that they
thought (at that point, incorrectly) forced unmount would solve.

there's a few things an unmount does
1) detach from the namespace (tree)
2) shut down the filesystem to
2a) allow someone else to mount/fsck/etc it
2b) finish stuff up and put it in a known state (clean)
3) shut down IO to a fs for another node to take over
    (which is what the "incorrectly" is about technically)
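
For reference, the pieces that exist in stock Linux today are the
umount2() flags.  MNT_DETACH (lazy unmount) covers point (1) alone, while
MNT_FORCE is honoured by only a few filesystems (NFS is the usual example)
and by itself gives neither (2) nor (3).  A minimal userspace sketch, with
the mount point and argument handling purely illustrative:

#include <stdio.h>
#include <string.h>
#include <sys/mount.h>

int main(int argc, char **argv)
{
        /* usage: ./umount-sketch [detach|force] /mnt/point */
        const char *target = argc > 2 ? argv[2] : "/mnt/data";
        int flags;

        if (argc > 1 && strcmp(argv[1], "force") == 0)
                flags = MNT_FORCE;      /* abort pending requests, fs permitting */
        else
                flags = MNT_DETACH;     /* lazy: drop from namespace only */

        if (umount2(target, flags) != 0) {
                perror("umount2");
                return 1;
        }
        return 0;
}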



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: topics for the file system mini-summit
  2006-06-03 14:13     ` Arjan van de Ven
@ 2006-06-03 15:07       ` Ric Wheeler
  0 siblings, 0 replies; 24+ messages in thread
From: Ric Wheeler @ 2006-06-03 15:07 UTC (permalink / raw)
  To: Arjan van de Ven; +Cc: Valerie Henson, linux-fsdevel, trond.myklebust


Arjan van de Ven wrote:

> Ric Wheeler wrote:
>
>>>
>> Any thoughts about what the right semantics are for properly doing a 
>> forced unmount, and whether it is doable near term (as opposed to 
>> the more strategic/long-term issues laid out in this thread)?
>
>
> I would like to ask you to take one step back; in the past, when I have
> seen people want "forced unmount", they actually wanted something else
> that they thought (at that point incorrectly) forced unmount would solve.
>
> There are a few things an unmount does:
> 1) detach from the namespace (tree)
> 2) shut down the filesystem to
> 2a) allow someone else to mount/fsck/etc. it
> 2b) finish stuff up and put it in a known (clean) state
> 3) shut down IO to a fs so another node can take over
>    (which is what the "incorrectly" is about, technically)
>
>
We have 20-30 100GB file systems on a single box (to avoid the long fsck 
time).  When you hit an issue with one file system (say a panic), or a 
dead drive takes out a set of maybe 5 file systems at once, we want to 
be able to keep the box up, since it is still serving up something like 
2.5TB of storage to the user ;-)

So what I want is all of the following:

    (1) do your 2a - be able to fsck and repair corruptions and then 
hopefully remount that file system without a reboot of the box.
    (2) leave all other file systems (including those of the same type) 
running without error (good fault isolation).
    (3) don't try to clean up that file system's state - error out any 
existing IOs; it is perfectly fine for processes using it to get blown 
away.  In effect, treat it (from the file system's point of view) just 
like you would a power outage & reboot.  You should replay the journal & 
recover only after you get the disk back.
    (4) make sure that a hung disk or panic'ed file system does not 
prevent an intentional reboot.
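
A rough sketch of the semantics in (3), written as a toy userspace model
rather than real VFS code (the names fake_fs, fs_read and
fs_force_shutdown are invented for illustration): a per-filesystem
shutdown flag turns every later I/O attempt into an immediate EIO instead
of blocking on, or trusting, the broken device, which is roughly the
behaviour XFS's forced-shutdown mechanism already provides within a
single filesystem.

#include <errno.h>
#include <stdio.h>
#include <string.h>

struct fake_fs {
        const char *name;
        int shut_down;          /* set once; cleared only by fsck + remount */
};

static int fs_read(struct fake_fs *fs, char *buf, size_t len)
{
        if (fs->shut_down)
                return -EIO;    /* error out rather than touch broken state */
        memset(buf, 0, len);    /* stand-in for the real I/O path */
        return (int)len;
}

static void fs_force_shutdown(struct fake_fs *fs)
{
        /* point (3): no cleanup, just stop; treat it like a power cut */
        fs->shut_down = 1;
}

int main(void)
{
        struct fake_fs fs = { .name = "data3", .shut_down = 0 };
        char buf[16];

        printf("before shutdown: %d\n", fs_read(&fs, buf, sizeof(buf)));
        fs_force_shutdown(&fs);
        printf("after shutdown:  %d\n", fs_read(&fs, buf, sizeof(buf)));
        return 0;
}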

In a conversation about this with Trond, I think that he has some other 
specific motivations from an NFS point of view as well.

ric


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: topics for the file system mini-summit
  2006-05-29 16:09     ` Andreas Dilger
  2006-05-29 19:29       ` Ric Wheeler
@ 2006-06-07 10:10       ` Stephen C. Tweedie
  2006-06-07 14:03         ` Andi Kleen
  2006-06-07 18:55         ` Andreas Dilger
  1 sibling, 2 replies; 24+ messages in thread
From: Stephen C. Tweedie @ 2006-06-07 10:10 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Ric Wheeler, Matthew Wilcox, linux-fsdevel, Stephen Tweedie

Hi,

On Mon, 2006-05-29 at 10:09 -0600, Andreas Dilger wrote:

> This is one thing that we have been thinking of for ext3.  Instead of a
> filesystem-wide "error" bit we could move this per-group to only mark
> the block or inode bitmaps in error if they have a checksum failure.
> This would prevent allocations from that group to avoid further potential
> corruption of the filesystem metadata.

Trouble is, individual files can span multiple groups easily.  And one
of the common failure modes is failure in the indirect tree.  What
action do you take if you detect that?

There is fundamentally a large difference between the class of errors
that can arise due to EIO --- simple loss of a block of data --- and
those which can arise from actual corrupt data/metadata.  If we detect
the latter and attempt to soldier on regardless, then we have no idea
what inconsistencies we are allowing to be propagated through the
filesystem.  

That can easily end up corrupting files far from the actual error.  Say
an indirect block is corrupted; we delete that file, and end up freeing
a block belonging to some other file on a distant block group.  Ooops.
Once that other block gets reallocated and overwritten, we have
corrupted that other file.

*That* is why taking the fs down/readonly on failure is the safe option.

The inclusion of checksums would certainly allow us to harden things.
In the above scenario, failure of the checksum test would allow us to
discard corrupt indirect blocks before we could allow any harm to come
to other disk blocks.  But that only works for cases where the checksum
notices the problem; if we're talking about possible OS bugs, memory
corruption etc. then it is quite possible to get corruption in the in-
memory copy, which gets properly checksummed and written to disk, so you
can't rely on that catching all cases.

--Stephen



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: topics for the file system mini-summit
  2006-06-07 10:10       ` Stephen C. Tweedie
@ 2006-06-07 14:03         ` Andi Kleen
  2006-06-07 18:55         ` Andreas Dilger
  1 sibling, 0 replies; 24+ messages in thread
From: Andi Kleen @ 2006-06-07 14:03 UTC (permalink / raw)
  To: Stephen C. Tweedie; +Cc: Ric Wheeler, Matthew Wilcox, linux-fsdevel

"Stephen C. Tweedie" <sct@redhat.com> writes:
 
> The inclusion of checksums would certainly allow us to harden things.
> In the above scenario, failure of the checksum test would allow us to
> discard corrupt indirect blocks before we could allow any harm to come
> to other disk blocks.  But that only works for cases where the checksum
> notices the problem; if we're talking about possible OS bugs, memory
> corruption etc. then it is quite possible to get corruption in the in-
> memory copy, which gets properly checksummed and written to disk, so you
> can't rely on that catching all cases.

I don't think you'll ever get a good solution for random kernel memory
corruption - if that happens you are dead no matter what you do.  Even
if your file system still works, your application will eventually
produce garbage once its own data gets corrupted.

Limiting detection to on-storage corruption is entirely reasonable.

Also, handling 100% of all cases is not feasible anyway.  Just handling
more cases than we do today would already be a big step forward.

"The perfect is the enemy of the good"

-Andi


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: topics for the file system mini-summit
  2006-06-07 10:10       ` Stephen C. Tweedie
  2006-06-07 14:03         ` Andi Kleen
@ 2006-06-07 18:55         ` Andreas Dilger
  1 sibling, 0 replies; 24+ messages in thread
From: Andreas Dilger @ 2006-06-07 18:55 UTC (permalink / raw)
  To: Stephen C. Tweedie; +Cc: Ric Wheeler, Matthew Wilcox, linux-fsdevel

On Jun 07, 2006  11:10 +0100, Stephen C. Tweedie wrote:
> On Mon, 2006-05-29 at 10:09 -0600, Andreas Dilger wrote:
> > This is one thing that we have been thinking of for ext3.  Instead of a
> > filesystem-wide "error" bit we could move this per-group to only mark
> > the block or inode bitmaps in error if they have a checksum failure.
> > This would prevent allocations from that group to avoid further potential
> > corruption of the filesystem metadata.
> 
> Trouble is, individual files can span multiple groups easily.  And one
> of the common failure modes is failure in the indirect tree.  What
> action do you take if you detect that?

Return an IO error for that part of the file?  We already refuse to
free file blocks that overlap with filesystem metadata, but have no
way to know whether the rest of the blocks are valid or not.
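
A simplified sketch of that kind of sanity check at block-free time (the
layout constants and function name below are invented, not real ext3
code): refuse to free any block whose number falls outside the data area
or lands inside a group's own metadata, since such a pointer can only
have come from corruption.

#include <stdbool.h>
#include <stdio.h>

#define FIRST_DATA_BLOCK        1UL
#define BLOCKS_PER_GROUP        32768UL
#define TOTAL_BLOCKS            (BLOCKS_PER_GROUP * 16)
#define META_BLOCKS_PER_GROUP   520UL   /* bitmaps + inode table (made up) */

static bool block_free_is_sane(unsigned long blk)
{
        unsigned long in_group;

        if (blk < FIRST_DATA_BLOCK || blk >= TOTAL_BLOCKS)
                return false;           /* outside the filesystem entirely */

        in_group = (blk - FIRST_DATA_BLOCK) % BLOCKS_PER_GROUP;
        if (in_group < META_BLOCKS_PER_GROUP)
                return false;           /* overlaps this group's metadata */

        return true;
}

int main(void)
{
        unsigned long candidates[] = { 0, 100, 40000, TOTAL_BLOCKS + 5 };

        for (int i = 0; i < 4; i++)
                printf("block %lu: %s\n", candidates[i],
                       block_free_is_sane(candidates[i]) ? "ok to free"
                                                         : "refuse, flag error");
        return 0;
}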

> There is fundamentally a large difference between the class of errors
> that can arise due to EIO --- simple loss of a block of data --- and
> those which can arise from actual corrupt data/metadata.  If we detect
> the latter and attempt to soldier on regardless, then we have no idea
> what inconsistencies we are allowing to be propagated through the
> filesystem.  

Recall that one of the other goals is to add checksumming to the extent
tree metadata (if it isn't already covered by the inode checksum).  Even
today, the fact that the extent format has a magic allows some types of
corruption to be detected.  The structure is also somewhat verifiable 
(e.g. logical extent offsets are increasing, logical_offset + length is
non-overlapping with next logical offset, etc) even without checksums.
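
A sketch of those structural checks (the struct layout and magic value
below are simplified stand-ins rather than the on-disk ext3 extent
format): even without a checksum, an extent array can be rejected if its
magic is wrong, its logical offsets fail to increase, or consecutive
extents overlap.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define EXT_HEADER_MAGIC 0x1EAF         /* invented value */

struct ext {
        uint32_t logical;       /* first logical block covered */
        uint16_t len;           /* number of blocks in this extent */
        uint32_t physical;      /* starting physical block */
};

static bool extent_array_plausible(uint16_t magic,
                                   const struct ext *ex, int count)
{
        if (magic != EXT_HEADER_MAGIC)
                return false;

        for (int i = 1; i < count; i++) {
                /* logical offsets must increase and must not overlap */
                if (ex[i].logical <= ex[i - 1].logical)
                        return false;
                if ((uint64_t)ex[i - 1].logical + ex[i - 1].len > ex[i].logical)
                        return false;
        }
        return true;
}

int main(void)
{
        struct ext map[] = {
                { .logical = 0,  .len = 16, .physical = 1000 },
                { .logical = 16, .len = 8,  .physical = 2000 },
                { .logical = 12, .len = 4,  .physical = 3000 }, /* out of order */
        };

        printf("first two plausible: %d\n",
               extent_array_plausible(EXT_HEADER_MAGIC, map, 2));
        printf("all three plausible: %d\n",
               extent_array_plausible(EXT_HEADER_MAGIC, map, 3));
        return 0;
}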

The proposed ext3_extent_tail would also contain an inode+generation
back-reference, and the checksum would depend on the physical block
location, so if one extent index block were incorrectly written in
place of another, or the higher-level reference were corrupted, this
would also be detectable.

        struct ext3_extent_tail {
                __u64   et_inum;         /* back-reference: owning inode */
                __u32   et_igeneration;  /* back-reference: inode generation */
                __u32   et_checksum;     /* covers the block and its location */
        };
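
One plausible way the verification could work, sketched in userspace
(the proposal above does not pin down the algorithm, so the use of a
plain CRC-32, the tail-at-the-end-of-the-block placement, and the way
the physical block number and inode are folded in are all assumptions
here):

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* userspace mirror of the proposed tail, with fixed-width types */
struct ext3_extent_tail {
        uint64_t et_inum;
        uint32_t et_igeneration;
        uint32_t et_checksum;
};

/* plain bitwise CRC-32; a real implementation would likely use the
 * kernel's crc32 helpers instead */
static uint32_t crc32_buf(uint32_t crc, const unsigned char *p, size_t len)
{
        crc = ~crc;
        while (len--) {
                crc ^= *p++;
                for (int i = 0; i < 8; i++)
                        crc = (crc >> 1) ^ (0xEDB88320U & -(crc & 1U));
        }
        return ~crc;
}

/* checksum the extent block with et_checksum zeroed, folding in the
 * physical block number and owning inode so that a block written to the
 * wrong location, or claimed by the wrong inode, fails verification */
static uint32_t extent_block_csum(unsigned char *block, size_t size,
                                  uint64_t phys_blk, uint64_t inum)
{
        struct ext3_extent_tail *tail =
                (void *)(block + size - sizeof(*tail));
        uint32_t saved = tail->et_checksum;
        uint32_t crc;

        tail->et_checksum = 0;
        crc = crc32_buf(0, block, size);
        crc = crc32_buf(crc, (unsigned char *)&phys_blk, sizeof(phys_blk));
        crc = crc32_buf(crc, (unsigned char *)&inum, sizeof(inum));
        tail->et_checksum = saved;
        return crc;
}

int main(void)
{
        unsigned char block[4096] = { 0 };
        struct ext3_extent_tail *tail =
                (void *)(block + sizeof(block) - sizeof(*tail));
        uint64_t phys = 123456, inum = 42;

        tail->et_inum = inum;
        tail->et_igeneration = 1;
        tail->et_checksum = extent_block_csum(block, sizeof(block), phys, inum);

        /* verifier recomputes with the location and inode it expects */
        printf("checksum matches: %d\n",
               extent_block_csum(block, sizeof(block), phys, inum)
               == tail->et_checksum);
        return 0;
}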

> That can easily end up corrupting files far from the actual error.  Say
> an indirect block is corrupted; we delete that file, and end up freeing
> a block belonging to some other file on a distant block group.  Ooops.
> Once that other block gets reallocated and overwritten, we have
> corrupted that other file.

Oh, I totally agree with that, which is another reason why I've proposed
the "block mapped extent" several times.  It would be referenced from
an extent index block or inode, would start with an extent header so the
reader can verify that it holds at least semi-plausible block pointers,
and could optionally have an ext3_extent_tail to validate the block data
itself.

The block-mapped extent is useful for fragmented files or files with
lots of small holes in them.  Conceivably it would also be possible
to quickly remap old block-mapped (indirect tree) files to bm-extent
files if this were desirable.
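
A hypothetical on-disk shape for such a block-mapped extent (none of the
field names, widths or magic below correspond to an existing ext3
structure; they only illustrate the header / pointer-array /
optional-tail arrangement described above):

#include <stdint.h>

#define BM_EXTENT_MAGIC 0xB10C          /* invented value */

struct bm_extent_header {
        uint16_t bh_magic;      /* lets the reader sanity-check the block */
        uint16_t bh_entries;    /* number of valid block pointers */
        uint32_t bh_logical;    /* first logical block mapped by this array */
};

struct bm_extent_block {
        struct bm_extent_header bh;
        uint32_t bh_blocks[];   /* physical block numbers; 0 marks a hole */
        /* optionally followed on disk by a struct ext3_extent_tail */
};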

> *That* is why taking the fs down/readonly on failure is the safe option.

And wait 17 years for e2fsck to complete?  While I agree it is the
safest option, sometimes it is necessary to just block off parts of the
filesystem from writes and soldier on until the system can be taken down
safely.
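
A toy model of that "block off and soldier on" behaviour (invented
names, not the ext3 allocator): each group descriptor carries an error
flag, set when its bitmap fails a checksum, and the allocator simply
skips flagged groups while the rest of the filesystem stays writable.

#include <stdio.h>

#define NGROUPS 8

struct group_desc {
        unsigned int free_blocks;
        unsigned int error;     /* checksum failure seen in this group */
};

static int find_group_for_alloc(struct group_desc *g, int ngroups, int goal)
{
        for (int i = 0; i < ngroups; i++) {
                int grp = (goal + i) % ngroups;

                if (g[grp].error)
                        continue;       /* quarantined: no new allocations */
                if (g[grp].free_blocks > 0)
                        return grp;
        }
        return -1;                      /* every healthy group is full */
}

int main(void)
{
        struct group_desc groups[NGROUPS] = {
                [0] = { .free_blocks = 10, .error = 1 }, /* bad bitmap */
                [1] = { .free_blocks = 0 },
                [2] = { .free_blocks = 25 },
        };

        printf("allocate near group 0 -> group %d\n",
               find_group_for_alloc(groups, NGROUPS, 0));
        return 0;
}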

> The inclusion of checksums would certainly allow us to harden things.
> In the above scenario, failure of the checksum test would allow us to
> discard corrupt indirect blocks before we could allow any harm to come
> to other disk blocks.  But that only works for cases where the checksum
> notices the problem; if we're talking about possible OS bugs, memory
> corruption etc. then it is quite possible to get corruption in the in-
> memory copy, which gets properly checksummed and written to disk, so you
> can't rely on that catching all cases.

I agree, we can't ever handle everything unless we get checksums from the
top of linux to the bottom (maybe stored in the page table?), but we can
at least do the best we can.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2006-06-07 18:55 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
2006-05-25 21:44 topics for the file system mini-summit Ric Wheeler
2006-05-26 16:48 ` Andreas Dilger
2006-05-27  0:49   ` Ric Wheeler
2006-05-27 14:18     ` Andreas Dilger
2006-05-28  1:44       ` Ric Wheeler
2006-05-29  0:11 ` Matthew Wilcox
2006-05-29  2:07   ` Ric Wheeler
2006-05-29 16:09     ` Andreas Dilger
2006-05-29 19:29       ` Ric Wheeler
2006-05-30  6:14         ` Andreas Dilger
2006-06-07 10:10       ` Stephen C. Tweedie
2006-06-07 14:03         ` Andi Kleen
2006-06-07 18:55         ` Andreas Dilger
2006-06-01  2:19 ` Valerie Henson
2006-06-01  2:42   ` Matthew Wilcox
2006-06-01  3:24     ` Valerie Henson
2006-06-01 12:45       ` Matthew Wilcox
2006-06-01 12:53         ` Arjan van de Ven
2006-06-01 20:06         ` Russell Cattelan
2006-06-02 11:27         ` Nathan Scott
2006-06-01  5:36   ` Andreas Dilger
2006-06-03 13:50   ` Ric Wheeler
2006-06-03 14:13     ` Arjan van de Ven
2006-06-03 15:07       ` Ric Wheeler
